Finish 4, took 2hr
This commit is contained in:
parent
6d88f082a0
commit
45d10bab7f
|
@ -40,6 +40,21 @@
|
||||||
- [Reasons](#reasons)
|
- [Reasons](#reasons)
|
||||||
- [Methods](#methods)
|
- [Methods](#methods)
|
||||||
- [Generation](#generation)
|
- [Generation](#generation)
|
||||||
|
- [Definition](#definition)
|
||||||
|
- [Methods](#methods)
|
||||||
|
- [Feature Crossing](#feature-crossing)
|
||||||
|
- [Polynomial Expansion](#polynomial-expansion)
|
||||||
|
- [Feature Learning by Trees](#feature-learning-by-trees)
|
||||||
|
- [Automatic Feature learning: Deep learning](#automatic-feature-learning-deep-learning)
|
||||||
|
- [Feature Selection](#feature-selection)
|
||||||
|
- [Reason](#reason)
|
||||||
|
- [Methods](#methods)
|
||||||
|
- [Filter](#filter)
|
||||||
|
- [Wrapper](#wrapper)
|
||||||
|
- [Embedded](#embedded)
|
||||||
|
- [Shuffling](#shuffling)
|
||||||
|
- [Hybrid](#hybrid)
|
||||||
|
- [Dimensionality Reduction](#dimensionality-reduction)
|
||||||
<!--toc:end-->
|
<!--toc:end-->
|
||||||
|
|
||||||
## Definition
|
## Definition
|
||||||
|
@ -283,11 +298,162 @@
|
||||||
- homogeneity of variance
|
- homogeneity of variance
|
||||||
- Example:
|
- Example:
|
||||||
- assuming y > 0.5 lead to class 1, otherwise class 2
|
- assuming y > 0.5 lead to class 1, otherwise class 2
|
||||||
- ![page 1](./assets/4-analytics-line-regression.webp)
|
- ![page 1](./assets/4-analytics-line-regression.webp)
|
||||||
- ![page 2](./assets/4-analytics-line-regression-2.webp)
|
- ![page 2](./assets/4-analytics-line-regression-2.webp)
|
||||||
- Some other ML algorithms do not make any assumption, but still may benefit
|
- Some other ML algorithms do not make any assumption, but still may benefit
|
||||||
from a better distributed data
|
from a better distributed data
|
||||||
|
|
||||||
#### Methods
|
#### Methods
|
||||||
|
|
||||||
|
- Logarithmic transformation: $\log(x + 1)$
|
||||||
|
- Useful when applied to **skewed distributions**: it **expands** small
|
||||||
|
values and **compresses** large values, which helps make the distribution less
|
||||||
|
skewed
|
||||||
|
- Requires $x > -1$ for every numerical value $x$
|
||||||
|
- Reciprocal transformation: $1/x$
|
||||||
|
- Square root $\sqrt{x}$
|
||||||
|
- Similar to log transform
|
||||||
|
- Exponential
|
||||||
|
- Box-Cox transformation: $(x^{\lambda} - 1) / \lambda$ for $\lambda \neq 0$, and $\log(x)$ for $\lambda = 0$
|
||||||
|
- **Prerequisite:** numeric values must be positive, which can be ensured by
|
||||||
|
shifting
|
||||||
|
- Quantile transformation: using quantiles
|
||||||
|
- Transforms a feature to follow a uniform or normal distribution. Tends to spread
|
||||||
|
out the most frequent values.
|
||||||
|
- This is **robust** to outliers
|
||||||
|
- However, it is a **non-linear** transform: it may distort linear correlations, although
|
||||||
|
variables at different scales are more comparable
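
The transformations listed above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function names are made up for this example, `box_cox` assumes a fixed, user-chosen $\lambda$ (real tools usually estimate it), and `quantile_uniform` is a simplified rank-based mapping that breaks ties by input order.

```python
import math


def log_transform(xs):
    """log(x + 1) for each value; every x must be greater than -1."""
    return [math.log1p(x) for x in xs]


def box_cox(xs, lam):
    """Box-Cox with a fixed lambda; every x must be positive."""
    if lam == 0:
        return [math.log(x) for x in xs]
    return [(x ** lam - 1) / lam for x in xs]


def quantile_uniform(xs):
    """Map each value to its rank-based position inside (0, 1)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    out = [0.0] * len(xs)
    for rank, i in enumerate(order):
        out[i] = (rank + 0.5) / len(xs)
    return out


skewed = [0, 1, 2, 5, 100]          # hypothetical right-skewed feature
logged = log_transform(skewed)      # the large value is compressed the most
powered = box_cox([1, 4], 0.5)      # lambda = 0.5 behaves like a square root
ranked = quantile_uniform([10, 0, 5])
```

Note how the quantile mapping keeps only the ordering of the values, which is why it is insensitive to the size of the outlier but does not preserve linear relationships.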
|
||||||
|
|
||||||
### Generation
|
### Generation
|
||||||
|
|
||||||
|
#### Definition
|
||||||
|
|
||||||
|
- Generating new features that are often not the result of feature
|
||||||
|
transformation
|
||||||
|
- Examples:
|
||||||
|
- $Age \times NumberDiagnoses$
|
||||||
|
- ![statistical feature](./assets/4-analytics-feat-gen-example-1.webp)
|
||||||
|
- ![fourier transform](./assets/4-analytics-feat-gen-example-2.webp)
|
||||||
|
|
||||||
|
#### Methods
|
||||||
|
|
||||||
|
##### Feature Crossing
|
||||||
|
|
||||||
|
- Create new features from existing ones, thus increasing predictive power
|
||||||
|
- Takes the Cartesian product of existing features
|
||||||
|
- $A \times B = \{(a, b) \mid a \in A \text{ and } b \in B\}$
|
||||||
|
- Useful when data is not linearly separable
|
||||||
|
- Deciding which features to cross:
|
||||||
|
- Use expertise
|
||||||
|
- Automatic exploration tools
|
||||||
|
- [Deep learning](#automatic-feature-learning-deep-learning)
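
As a rough sketch of feature crossing via the Cartesian product, the snippet below crosses two categorical features and one-hot encodes the result. The feature names and category values (`age_bins`, `diagnoses`) are hypothetical, chosen only to echo the age and diagnoses example earlier in these notes.

```python
from itertools import product


def cross_vocabulary(values_a, values_b):
    """All possible crossed categories: the Cartesian product A x B."""
    return [f"{a}_x_{b}" for a, b in product(values_a, values_b)]


def cross_one_hot(a, b, vocabulary):
    """One-hot encode the crossed value of a single sample."""
    key = f"{a}_x_{b}"
    return [1 if key == v else 0 for v in vocabulary]


age_bins = ["young", "old"]      # hypothetical binned feature A
diagnoses = ["few", "many"]      # hypothetical binned feature B
vocab = cross_vocabulary(age_bins, diagnoses)
encoded = cross_one_hot("old", "few", vocab)
```

The crossed vocabulary grows as `len(A) * len(B)`, which is one reason to choose crosses deliberately rather than crossing everything.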
|
||||||
|
|
||||||
|
##### Polynomial Expansion
|
||||||
|
|
||||||
|
- Useful in modelling, since it can model non-linear relationships between
|
||||||
|
predictor and outcome
|
||||||
|
- Use fitted polynomial variables to represent the data:
|
||||||
|
- $poly(x, n) = a_0 + a_1 \times x + a_2 \times x^2 + \dots + a_n \times x^n$
|
||||||
|
- Pros:
|
||||||
|
- Fast
|
||||||
|
- Good performance, compared to binning
|
||||||
|
- Doesn't create correlated features
|
||||||
|
- Good at handling continuous change
|
||||||
|
- Cons:
|
||||||
|
- Less interpretable
|
||||||
|
- Lots of variables produced
|
||||||
|
- Hard to model changes in distribution
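
A minimal sketch of polynomial expansion for a single numeric feature follows. It produces raw powers only; the helper names are illustrative, and the coefficients passed to `poly` are assumed to come from some fitting step that is not shown here.

```python
def polynomial_features(x, degree):
    """Return [x, x^2, ..., x^degree] for one numeric value."""
    return [x ** k for k in range(1, degree + 1)]


def poly(x, coefficients):
    """Evaluate a_0 + a_1*x + ... + a_n*x^n for given coefficients."""
    return sum(a * x ** k for k, a in enumerate(coefficients))


expanded = polynomial_features(2, 3)   # one feature becomes three
value = poly(2, [1, 0, 3])             # 1 + 0*2 + 3*2^2
```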
|
||||||
|
|
||||||
|
##### Feature Learning by Trees
|
||||||
|
|
||||||
|
- Each sample is assigned to a leaf node
|
||||||
|
- The decision path to each leaf node is a new non-linear feature
|
||||||
|
- We can create N new binary features (with N leaf nodes)
|
||||||
|
- Pro: fast way to get informative features
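
The idea of turning leaf membership into binary features can be sketched with a tiny hand-written tree. The tree below (`TREE`, its thresholds and its three leaves) is entirely hypothetical; in practice the tree would be learned from data by a tree-based model.

```python
# Hypothetical decision tree: internal nodes test one feature against a threshold.
TREE = {
    "feature": 0, "threshold": 50,
    "left": {"leaf": 0},
    "right": {
        "feature": 1, "threshold": 3,
        "left": {"leaf": 1},
        "right": {"leaf": 2},
    },
}
N_LEAVES = 3


def leaf_index(tree, sample):
    """Follow the decision path and return the index of the leaf reached."""
    node = tree
    while "leaf" not in node:
        go_left = sample[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["leaf"]


def leaf_one_hot(tree, sample, n_leaves):
    """N binary features, exactly one of which is 1 for each sample."""
    idx = leaf_index(tree, sample)
    return [1 if i == idx else 0 for i in range(n_leaves)]


features = leaf_one_hot(TREE, [63, 5], N_LEAVES)
```

Each binary feature stands for a whole conjunction of threshold tests, which is what makes it non-linear in the original inputs.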
|
||||||
|
|
||||||
|
##### Automatic Feature learning: Deep learning
|
||||||
|
|
||||||
|
- A deep learning model learns the features from the data
|
||||||
|
- Differences from shallow networks
|
||||||
|
- Deep, in the sense of having multiple hidden layers
|
||||||
|
- Introduced stochastic gradient descent
|
||||||
|
- Can automate feature extraction
|
||||||
|
- Require larger datasets
|
||||||
|
- DL can learn a hierarchy of features: Character → word → word group → clause
|
||||||
|
→ sentence
|
||||||
|
- CNN: uses convolutional layers to apply filters to the input image, to detect
|
||||||
|
various features such as edges, textures and shapes
|
||||||
|
|
||||||
|
## Feature Selection
|
||||||
|
|
||||||
|
### Reason
|
||||||
|
|
||||||
|
- More features don't necessarily lead to a better model
|
||||||
|
- Feature selection is useful for
|
||||||
|
- Model simplification: easier interpretation, smaller model, lower cost
|
||||||
|
- Lower data requirements: less data is required
|
||||||
|
- Lower dimensionality
|
||||||
|
- Enhanced generalization, less overfitting
|
||||||
|
|
||||||
|
### Methods
|
||||||
|
|
||||||
|
#### Filter
|
||||||
|
|
||||||
|
- Score and select the best features using the statistical measures below, independently of any model
|
||||||
|
- Main methods
|
||||||
|
- Variance: remove features whose value is constant (or nearly constant) across samples
|
||||||
|
- Correlation: remove features that are highly correlated with each other
|
||||||
|
- Con: fails to consider interactions between features and may reduce the
|
||||||
|
predictive power of the model
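
The two filter criteria above can be sketched as follows. The thresholds (`min_variance`, `max_corr`), the greedy keep-first order, and the toy columns are assumptions made for illustration only.

```python
from statistics import mean, pvariance


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def filter_features(columns, min_variance=0.0, max_corr=0.95):
    """columns maps a feature name to its values; returns the names kept."""
    kept = []
    for name, values in columns.items():
        if pvariance(values) <= min_variance:
            continue  # constant feature carries no information
        if any(abs(pearson(values, columns[k])) > max_corr for k in kept):
            continue  # redundant with a feature that was already kept
        kept.append(name)
    return kept


data = {  # hypothetical toy dataset
    "age": [30, 40, 50, 60],
    "age_months": [360, 480, 600, 720],
    "site": [1, 1, 1, 1],
    "visits": [2, 9, 1, 5],
}
selected = filter_features(data)
```

Because each feature is judged on its own or pairwise, a feature that is only useful in combination with another can still be discarded, which is the drawback noted above.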
|
||||||
|
|
||||||
|
#### Wrapper
|
||||||
|
|
||||||
|
- Search through the possible feature subsets, train a model on each, and evaluate
|
||||||
|
them
|
||||||
|
- Steps of execution (p98), skipped
|
||||||
|
- Con: Computationally expensive
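
A wrapper search can be sketched as an exhaustive loop over subsets. This is not the step list from the course material; `toy_score` and the `USEFUL` table are hypothetical stand-ins for "train a model on this subset and return its validation score".

```python
from itertools import combinations


def best_subset(feature_names, score):
    """Score every non-empty subset and return the best one with its score."""
    best, best_score = None, float("-inf")
    for size in range(1, len(feature_names) + 1):
        for subset in combinations(feature_names, size):
            s = score(subset)
            if s > best_score:
                best, best_score = subset, s
    return best, best_score


# Hypothetical per-feature usefulness, used only to fake a model score.
USEFUL = {"age": 0.30, "visits": 0.25, "site": 0.0}


def toy_score(subset):
    """Stand-in for a validation score, with a small penalty per feature."""
    return sum(USEFUL[f] for f in subset) - 0.05 * len(subset)


chosen, chosen_score = best_subset(list(USEFUL), toy_score)
```

With `n` features there are `2**n - 1` subsets to score, each requiring a model fit in a real setting, which is where the computational cost comes from.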
|
||||||
|
|
||||||
|
#### Embedded
|
||||||
|
|
||||||
|
- Performs feature selection as part of training the ML algorithm
|
||||||
|
- This addresses the drawbacks of both filter and wrapper methods, and has the
|
||||||
|
advantages of both
|
||||||
|
- Faster than wrapper
|
||||||
|
- More accurate than filter
|
||||||
|
- Methods:
|
||||||
|
- Regularization: add a penalty on the coefficients (e.g. L1), which can shrink some of them to zero,
|
||||||
|
so the corresponding features can be removed from the dataset
|
||||||
|
- Tree-based methods: output feature importances, which can be used to
|
||||||
|
select features.
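
To illustrate how regularization performs selection, here is a small coordinate-descent sketch for L1-regularized least squares (Lasso), minimising $\frac{1}{2n}\lVert y - Xw \rVert^2 + \alpha \lVert w \rVert_1$. The function names, `alpha` value and toy data are assumptions for this example, and the code is written for clarity rather than speed.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t; values within [-t, t] become exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, y, alpha, n_iter=100):
    """Coordinate descent for (1/2n)*||y - Xw||^2 + alpha*||w||_1."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho, z = 0.0, 0.0
            for i in range(n):
                # prediction using every feature except j
                pred = sum(X[i][k] * w[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred)
                z += X[i][j] ** 2
            w[j] = soft_threshold(rho / n, alpha) / (z / n) if z else 0.0
    return w


# Toy data: the target depends only on the first column.
X = [[1, 1], [-1, 1], [1, -1], [-1, -1]]
y = [2, -2, 2, -2]
weights = lasso_cd(X, y, alpha=0.1)
selected = [j for j, wj in enumerate(weights) if wj != 0.0]
```

The coefficient of the irrelevant column ends up exactly zero, so that feature can be dropped, while the relevant coefficient is shrunk slightly below its unpenalised value.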
|
||||||
|
|
||||||
|
#### Shuffling
|
||||||
|
|
||||||
|
#### Hybrid
|
||||||
|
|
||||||
|
#### Dimensionality Reduction
|
||||||
|
|
||||||
|
- When dimensionality is too high, it's computationally expensive to process
|
||||||
|
the data. We **project the data** onto a lower-dimensional subspace that captures the
|
||||||
|
**essence** of the data
|
||||||
|
- Reason
|
||||||
|
- Curse of dimensionality: high-dimensional data have a large number of
|
||||||
|
features or dimensions, which can make it difficult to analyze and
|
||||||
|
understand
|
||||||
|
- Remove sparse or noisy data, reduce overfitting
|
||||||
|
- To create a model with a lower number of variables
|
||||||
|
- PCA:
|
||||||
|
- form of feature extraction, combines and transforms the dataset's original
|
||||||
|
values
|
||||||
|
- projects data onto a new space, defined by a subset of the principal
|
||||||
|
components
|
||||||
|
- Is an **unsupervised** linear dimensionality reduction technique
|
||||||
|
- Preserves signal, filters out noise
|
||||||
|
- Uses the **covariance matrix**
|
||||||
|
- TODO: is calculation needed
|
||||||
|
- Minimize intraclass difference
|
||||||
|
- LDA:
|
||||||
|
- Similar to PCA
|
||||||
|
- Differs from PCA in that it is supervised: it uses the classification labels in the dataset
|
||||||
|
- Goal: maximise the separation between classes relative to the variance within each class.
|
||||||
|
- Uses **scatter matrices** (within-class and between-class)
|
||||||
|
- Maximizes interclass difference
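
As a rough sketch of the covariance-matrix route to PCA, the snippet below centres the data, builds the covariance matrix and finds the first principal component by power iteration. Power iteration is a simple substitute chosen here for brevity (libraries typically use an eigen or singular value decomposition), and the helper names and toy data are assumptions.

```python
def covariance_matrix(rows):
    """Return the sample covariance matrix and the mean-centred rows."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(p)]
           for a in range(p)]
    return cov, centered


def first_component(cov, n_iter=200):
    """Power iteration: approaches the eigenvector with the largest eigenvalue.

    Assumes the covariance matrix is non-zero and the start vector is not
    orthogonal to that eigenvector.
    """
    v = [1.0] * len(cov)
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(len(v))) for a in range(len(v))]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v


rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]  # points on y = 2x
cov, centered = covariance_matrix(rows)
pc1 = first_component(cov)
# Project the 2-D points onto the first component: 2 features become 1.
projected = [sum(c[j] * pc1[j] for j in range(len(pc1))) for c in centered]
```

No class labels appear anywhere in this computation, which is the sense in which PCA is unsupervised; LDA would additionally need the label of each row to build its within-class and between-class scatter matrices.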
|
||||||
|
|
BIN
assets/4-analytics-feat-gen-example-1.webp
Normal file
Binary file not shown.
After Width: | Height: | Size: 62 KiB |
BIN
assets/4-analytics-feat-gen-example-2.webp
Normal file
Binary file not shown.
After Width: | Height: | Size: 98 KiB |