diff --git a/4-data-analytics.md b/4-data-analytics.md
index e022c6b..f353c75 100644
--- a/4-data-analytics.md
+++ b/4-data-analytics.md
@@ -40,6 +40,21 @@
   - [Reasons](#reasons)
   - [Methods](#methods)
   - [Generation](#generation)
+    - [Definition](#definition)
+    - [Methods](#methods)
+      - [Feature Crossing](#feature-crossing)
+      - [Polynomial Expansion](#polynomial-expansion)
+      - [Feature Learning by Trees](#feature-learning-by-trees)
+      - [Automatic Feature Learning: Deep Learning](#automatic-feature-learning-deep-learning)
+- [Feature Selection](#feature-selection)
+  - [Reason](#reason)
+  - [Methods](#methods)
+    - [Filter](#filter)
+    - [Wrapper](#wrapper)
+    - [Embedded](#embedded)
+    - [Shuffling](#shuffling)
+    - [Hybrid](#hybrid)
+    - [Dimensionality Reduction](#dimensionality-reduction)
 
 ## Definition
 
@@ -283,11 +298,162 @@
 - homogeneity of variance
 - Example:
   - assuming y > 0.5 lead to class 1, otherwise class 2
-    - ![page 1](./assets/4-analytics-line-regression.webp)
-    - ![page 2](./assets/4-analytics-line-regression-2.webp)
+  - ![page 1](./assets/4-analytics-line-regression.webp)
+  - ![page 2](./assets/4-analytics-line-regression-2.webp)
 - Some other ML algorithms do not make any assumption, but still may benefit
   from a better distributed data
 
 #### Methods
 
+- Logarithmic transformation: $\log(x + 1)$
+  - Useful for **skewed distributions**: it **expands** small values and
+    **compresses** big values, which makes the distribution less skewed
+  - Numeric values must satisfy $x \gt -1$
+- Reciprocal transformation: $1/x$
+- Square root: $\sqrt{x}$
+  - Similar to the log transform
+- Exponential
+- Box-Cox transformation: $(x^{\lambda} - 1) / \lambda$
+  - **Prerequisite:** numeric values must be positive, which can be solved by
+    shifting
+- Quantile transformation: using quantiles
+  - Transforms the feature to follow a uniform or normal distribution. Tends
+    to spread out the most frequent values
+  - This is **robust**
+  - But it is a **non-linear** transform: it may distort linear correlations,
+    although variables at different scales become more comparable
+- A few of these transforms are sketched in code below
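+
+The sketch is minimal and assumes NumPy, SciPy and scikit-learn are
+available; the lognormal sample data is hypothetical and only meant to show
+the effect of each transform:
+
+```python
+import numpy as np
+from scipy import stats
+from sklearn.preprocessing import QuantileTransformer
+
+rng = np.random.default_rng(0)
+x = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1))  # skewed, positive data
+
+log_x = np.log1p(x)                  # log(x + 1), defined for x > -1
+sqrt_x = np.sqrt(x)                  # similar effect to log, but milder
+bc_x, lam = stats.boxcox(x.ravel())  # (x^lambda - 1) / lambda, needs x > 0
+
+# Quantile transformation: map empirical quantiles onto a normal distribution
+qt = QuantileTransformer(output_distribution="normal", n_quantiles=100)
+q_x = qt.fit_transform(x)
+```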
+
 ### Generation
+
+#### Definition
+
+- Generating new features that are often not the result of feature
+  transformation
+- Examples:
+  - $Age \times NumberDiagnoses$
+  - ![statistical feature](./assets/4-analytics-feat-gen-example-1.webp)
+  - ![fourier transform](./assets/4-analytics-feat-gen-example-2.webp)
+
+#### Methods
+
+##### Feature Crossing
+
+- Creates new features from existing ones, thus increasing predictive power
+- Takes the Cartesian product of existing features:
+  - $A \times B = \{(a, b) \mid a \in A \text{ and } b \in B\}$
+- Useful when the data is not linearly separable
+- Deciding which features to cross:
+  - Use domain expertise
+  - Automatic exploration tools
+  - [Deep learning](#automatic-feature-learning-deep-learning)
+
+##### Polynomial Expansion
+
+- Useful in modelling, since it can capture non-linear relationships between
+  predictor and outcome
+- Uses fitted polynomial variables to represent the data:
+  - $poly(x, n) = a_0 + a_1 \times x + a_2 \times x^2 + \dots + a_n \times x^n$
+- Pros:
+  - Fast
+  - Good performance compared to binning
+  - Doesn't create correlated features
+  - Good at handling continuous change
+- Cons:
+  - Less interpretable
+  - Produces many variables
+  - Hard to model changes in distribution
+
+##### Feature Learning by Trees
+
+- Each sample ends up in a leaf node
+- The decision path to that leaf is a new non-linear feature
+- With N leaf nodes we can create N new binary features
+- Pro: a fast way to get informative features
+
+##### Automatic Feature Learning: Deep Learning
+
+- A deep learning model learns the features from the data
+- Differences from shallow networks:
+  - Deep, in the sense of having multiple hidden layers
+  - Introduced stochastic gradient descent
+- Can automate feature extraction
+- Requires larger datasets
+- DL can learn a hierarchy of features: character → word → word group →
+  clause → sentence
+- CNN: uses convolutional layers to apply filters to the input image, to
+  detect various features such as edges, textures and shapes
+
+## Feature Selection
+
+### Reason
+
+- More features don't necessarily lead to a better model
+- Feature selection is useful for:
+  - Model simplification: easier interpretation, smaller model, lower cost
+  - Lower data requirements: less data is needed
+  - Lower dimensionality
+  - Enhanced generalization, less overfitting
+
+### Methods
+
+#### Filter
+
+- Select the best features using the following criteria, then evaluate
+- Main methods:
+  - Variance: remove features that (almost) always take the same value
+  - Correlation: remove features that are highly correlated with each other
+- Con: fails to consider the interaction between features and may reduce the
+  predictive power of the model
+
+#### Wrapper
+
+- Searches through the possible feature subsets and evaluates each of them
+- Steps of execution (p98), skipped
+- Con: computationally expensive
+
+#### Embedded
+
+- Performs feature selection as part of the ML algorithm itself
+- This addresses the drawbacks of both the filter and wrapper methods and has
+  the advantages of both
+- Faster than the wrapper method
+- More accurate than the filter method
+- Methods (both are sketched below):
+  - Regularization: adds a penalty to the coefficients, which can shrink some
+    of them to zero; those features can then be removed from the dataset
+  - Tree-based methods: output feature importances, which can be used to
+    select features
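+
+A minimal sketch of both embedded approaches, assuming scikit-learn is
+available; the synthetic dataset from `make_classification` is illustrative
+only:
+
+```python
+import numpy as np
+from sklearn.datasets import make_classification
+from sklearn.ensemble import RandomForestClassifier
+from sklearn.linear_model import LogisticRegression
+
+# Synthetic data: 20 features, only 5 of which are informative
+X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
+                           random_state=0)
+
+# Regularization: the L1 penalty shrinks uninformative coefficients to zero
+l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
+kept = np.flatnonzero(l1.coef_.ravel() != 0)  # indices of surviving features
+
+# Tree-based: feature importances come out of the fitting process itself
+rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
+ranking = np.argsort(rf.feature_importances_)[::-1]  # most important first
+```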
+
+#### Shuffling
+
+#### Hybrid
+
+#### Dimensionality Reduction
+
+- When dimensionality is too high, it is computationally expensive to process
+  the data. We **project the data** onto a lower-dimensional subspace that
+  captures the **essence** of the data
+- Reasons:
+  - Curse of dimensionality: high-dimensional data has a large number of
+    features or dimensions, which can make it difficult to analyze and
+    understand
+  - Remove sparse or noisy data, reduce overfitting
+  - To create a model with a lower number of variables
+- PCA:
+  - A form of feature extraction: combines and transforms the dataset's
+    original features
+  - Projects the data onto a new space, defined by a subset of the principal
+    components
+  - An **unsupervised** linear dimensionality reduction technique
+  - Preserves signal, filters out noise
+  - Uses the **covariance matrix**
+  - TODO: is calculation needed
+  - Maximizes the variance of the projected data
+- LDA:
+  - Similar to PCA
+  - Unlike PCA, it uses the class labels in the dataset, so it is
+    **supervised**
+  - Goal: maximize the separation between classes while keeping the variance
+    within each class small
+  - Uses **scatter matrices** (within-class and between-class)
+  - Maximizes the interclass difference
diff --git a/assets/4-analytics-feat-gen-example-1.webp b/assets/4-analytics-feat-gen-example-1.webp
new file mode 100644
index 0000000..2b22fff
Binary files /dev/null and b/assets/4-analytics-feat-gen-example-1.webp differ
diff --git a/assets/4-analytics-feat-gen-example-2.webp b/assets/4-analytics-feat-gen-example-2.webp
new file mode 100644
index 0000000..9b7d6cd
Binary files /dev/null and b/assets/4-analytics-feat-gen-example-2.webp differ