Finish 4, took 2hr

Ryan 2025-01-08 15:38:11 +08:00
parent 6d88f082a0
commit 45d10bab7f
3 changed files with 168 additions and 2 deletions

@@ -40,6 +40,21 @@
- [Reasons](#reasons)
- [Methods](#methods)
- [Generation](#generation)
- [Definition](#definition)
- [Methods](#methods)
- [Feature Crossing](#feature-crossing)
- [Polynomial Expansion](#polynomial-expansion)
- [Feature Learning by Trees](#feature-learning-by-trees)
- [Automatic Feature Learning: Deep Learning](#automatic-feature-learning-deep-learning)
- [Feature Selection](#feature-selection)
- [Reason](#reason)
- [Methods](#methods)
- [Filter](#filter)
- [Wrapper](#wrapper)
- [Embedded](#embedded)
- [Shuffling](#shuffling)
- [Hybrid](#hybrid)
- [Dimensionality Reduction](#dimensionality-reduction)
<!--toc:end-->
## Definition
@@ -283,11 +298,162 @@
- homogeneity of variance
- Example:
- assuming $y > 0.5$ leads to class 1, otherwise class 2
- ![page 1](./assets/4-analytics-line-regression.webp)
- ![page 2](./assets/4-analytics-line-regression-2.webp)
- Some other ML algorithms make no distributional assumptions, but may still
benefit from better-distributed data
#### Methods
- Logarithmic transformation: $\log(x + 1)$
- Useful when applied to **skewed distributions**: it **expands** small
values and **compresses** big values, which helps make the distribution less
skewed
- Numerical values must satisfy $x > -1$
- Reciprocal transformation $1/x$
- Square root $\sqrt{x}$
- Similar to log transform
- Exponential
- Box-Cox transformation: $(x^{\lambda} - 1) / \lambda$
- **Prerequisite:** numeric values must be positive, which can be achieved by
shifting the data
- Quantile transformation: using quantiles
- Transforms a feature to follow a uniform or normal distribution. Tends to
spread out the most frequent values
- **Robust** to outliers
- But it is a **non-linear** transform that may distort linear correlations;
in exchange, variables at different scales become more comparable (see the
sketch below)
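
A minimal sketch of these transforms in Python, assuming NumPy, SciPy, and
scikit-learn; the lognormal sample is just a stand-in for a skewed feature:

```python
import numpy as np
from scipy import stats
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)
x = rng.lognormal(sigma=1.0, size=1000)  # hypothetical skewed feature

log_x = np.log1p(x)              # log(x + 1), requires x > -1
boxcox_x, lam = stats.boxcox(x)  # requires x > 0; lambda is fitted
qt = QuantileTransformer(output_distribution="normal", n_quantiles=100)
quantile_x = qt.fit_transform(x.reshape(-1, 1))
```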
### Generation
#### Definition
- Generating new features that are often not the result of feature
transformation
- Examples:
- $Age \times NumberDiagnoses$
- ![statistical feature](./assets/4-analytics-feat-gen-example-1.webp)
- ![fourier transform](./assets/4-analytics-feat-gen-example-2.webp)
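
A minimal illustration of the first example, with hypothetical column names:

```python
import pandas as pd

df = pd.DataFrame({"Age": [34, 70, 51], "NumberDiagnoses": [1, 4, 2]})
# A generated feature, not a transformation of any single column
df["Age_x_NumberDiagnoses"] = df["Age"] * df["NumberDiagnoses"]
```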
#### Methods
##### Feature Crossing
- Create new features from existing ones, thus increasing predictive power
- Takes the Cartesian product of existing features:
- $A \times B = \{(a, b) \mid a \in A \text{ and } b \in B\}$
- Useful when the data is not linearly separable (see the sketch after this
list)
- Deciding which features to cross:
- Use expertise
- Automatic exploration tools
- [Deep learning](#automatic-feature-learning-deep-learning)
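
A minimal sketch of a feature cross on two hypothetical categorical columns,
using pandas:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["NY", "SF", "NY", "SF"],                # hypothetical feature A
    "device": ["ios", "ios", "android", "android"],  # hypothetical feature B
})
# Cartesian product of the two features: one category per (city, device) pair
df["city_x_device"] = df["city"] + "_" + df["device"]
crossed = pd.get_dummies(df["city_x_device"])  # one binary column per pair
```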
##### Polynomial Expansion
- Useful in modelling, since it can model non-linear relationships between
predictor and outcome
- Use fitted polynomial variables to represent the data:
- $\mathrm{poly}(x, n) = a_0 + a_1 \times x + a_2 \times x^2 + \dots + a_n \times x^n$
- Pros:
- Fast
- Good performance, compared to binning
- Doesn't create correlated features
- Good at handling continuous change
- Cons:
- Less interpretable
- Lots of variables produced
- Hard to model changes in distribution
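
A minimal sketch using scikit-learn's PolynomialFeatures, one common way to
generate these terms (note that, unlike the single-variable formula above, it
also adds interaction terms between features):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.arange(6, dtype=float).reshape(3, 2)  # two toy features
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)  # adds x0^2, x0*x1, x1^2 columns
print(poly.get_feature_names_out())
```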
##### Feature Learning by Trees
- Each sample ends up in a leaf node
- The decision path to each leaf is a new non-linear feature
- With N leaf nodes, we can create N new binary features (see the sketch
below)
- Pro: a fast way to get informative features
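
A minimal sketch of the idea with a random forest, assuming scikit-learn:
`apply()` returns each sample's leaf index per tree, and one-hot encoding
those indices yields the binary features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder

X, y = make_classification(n_samples=200, random_state=0)
forest = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=0)
forest.fit(X, y)

leaves = forest.apply(X)  # shape (n_samples, n_trees): leaf index per tree
# One binary feature per leaf node, encoding each sample's decision path
leaf_features = OneHotEncoder().fit_transform(leaves)
```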
##### Automatic Feature Learning: Deep Learning
- A deep learning model learns the features from the data
- Differences from shallow networks:
- Deep, in the sense of having multiple hidden layers
- Trained with stochastic gradient descent
- Can automate feature extraction
- Requires larger datasets
- DL can learn a hierarchy of features: character → word → word group → clause
→ sentence
- CNNs use convolutional layers to apply filters to the input image, detecting
features such as edges, textures and shapes (a minimal sketch follows)
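
A minimal sketch of such a feature extractor, assuming PyTorch; the layer
sizes are arbitrary:

```python
import torch
import torch.nn as nn

# Stacked conv layers: early filters pick up edges/textures, later ones
# combine them into higher-level shapes (hierarchical features)
extractor = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
)

images = torch.randn(4, 1, 28, 28)  # dummy batch of grayscale images
features = extractor(images)        # (4, 16, 7, 7) learned feature maps
```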
## Feature Selection
### Reason
- More features do not necessarily lead to a better model
- Feature selection is useful for:
- Model simplification: easier interpretation, smaller models, lower cost
- Lower data requirements
- Reduced dimensionality
- Enhanced generalization, less overfitting
### Methods
#### Filter
- Scores features with simple statistics, independently of any model, and
keeps the best ones
- Main methods:
- Variance: remove features with (near-)zero variance, i.e. the same value
for every sample
- Correlation: remove features that are highly correlated with each other
- Con: fails to consider interactions between features, which may reduce the
predictive power of the model (see the sketch below)
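
A minimal sketch of both filters, assuming scikit-learn and pandas; the
column names and the 0.95 cutoff are arbitrary:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

X = pd.DataFrame({
    "constant": [1.0, 1.0, 1.0, 1.0],  # zero variance
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.1, 6.0, 8.2],         # nearly a duplicate of "a"
})

# Variance filter: drops zero-variance features (the default threshold)
X_var = VarianceThreshold().fit_transform(X)

# Correlation filter: drop one feature from each highly correlated pair
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.95).any()]
```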
#### Wrapper
- Searches through possible feature subsets, training and evaluating the model
on each
- Steps of execution (p98), skipped
- Con: computationally expensive (a sketch of one common wrapper-style method
follows)
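
Recursive feature elimination is one common wrapper-style method; a minimal
scikit-learn sketch, with an arbitrary estimator and subset size:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
# Repeatedly fit the model and drop the weakest feature until 5 remain
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5)
rfe.fit(X, y)
print(rfe.support_)  # boolean mask of the selected features
```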
#### Embedded
- Performs feature selection as part of the ML algorithm itself
- This addresses the drawbacks of both the filter and wrapper methods, and has
the advantages of both:
- Faster than wrapper methods
- More accurate than filter methods
- Methods:
- Regularization: adds a penalty to the coefficients, which can drive some of
them to zero; those features can then be removed from the dataset
- Tree-based methods: output feature importances, which can be used to select
features (see the sketch below)
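
A minimal sketch of the regularization route, assuming scikit-learn: an L1
(Lasso) penalty drives some coefficients exactly to zero, and SelectFromModel
drops the corresponding features:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=0.1, random_state=0)
lasso = Lasso(alpha=1.0).fit(X, y)  # uninformative features get coef 0
selector = SelectFromModel(lasso, prefit=True)
X_selected = selector.transform(X)  # keeps non-zero-coefficient features
```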
#### Shuffling
- Shuffle the values of a single feature and measure the drop in model
performance; a large drop indicates an important feature
#### Hybrid
- Combines the approaches above, e.g. a cheap filter pass first, then a
wrapper on the remaining features
#### Dimensionality Reduction
- When dimensionality is too high, the data is computationally expensive to
process. We **project the data** onto a lower-dimensional subspace that
captures the **essence** of the data
- Reasons:
- Curse of dimensionality: high-dimensional data has a large number of
features, which can make it difficult to analyze and understand
- Removes sparse or noisy data, reduces overfitting
- Creates a model with a lower number of variables
- PCA:
- A form of feature extraction: combines and transforms the dataset's original
variables
- Projects data onto a new space, defined by a subset of the principal
components
- An **unsupervised** linear dimensionality reduction technique
- Preserves signal, filters out noise
- Uses the **covariance matrix**
- TODO: is calculation needed
- Maximizes total variance, without using class labels
- LDA:
- Similar to PCA
- Differs from PCA in that it uses the classification labels in the dataset
- Goal: maximize the separation between classes while keeping the variance
within each class low
- Uses the **scatter matrix**
- Maximizes interclass difference (see the sketch below)
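
A minimal sketch contrasting the two on a labeled dataset, assuming
scikit-learn:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
# PCA: unsupervised, ignores y, keeps the directions of maximum variance
X_pca = PCA(n_components=2).fit_transform(X)
# LDA: supervised, uses y, maximizes between-class separation
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)
```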
