Finish 4, took 2hr

This commit is contained in:
parent 6d88f082a0
commit 45d10bab7f

@@ -40,6 +40,21 @@
- [Reasons](#reasons)
- [Methods](#methods)
- [Generation](#generation)
- [Definition](#definition)
- [Methods](#methods)
- [Feature Crossing](#feature-crossing)
- [Polynomial Expansion](#polynomial-expansion)
- [Feature Learning by Trees](#feature-learning-by-trees)
- [Automatic Feature learning: Deep learning](#automatic-feature-learning-deep-learning)
- [Feature Selection](#feature-selection)
- [Reason](#reason)
- [Methods](#methods)
- [Filter](#filter)
- [Wrapper](#wrapper)
- [Embedded](#embedded)
- [Shuffling](#shuffling)
- [Hybrid](#hybrid)
- [Dimensionality Reduction](#dimensionality-reduction)
<!--toc:end-->

## Definition

@@ -283,11 +298,162 @@

- homogeneity of variance
- Example:
  - assuming y > 0.5 leads to class 1, otherwise class 2
  - ![page 1](./assets/4-analytics-line-regression.webp)
  - ![page 2](./assets/4-analytics-line-regression-2.webp)
- Some other ML algorithms do not make any assumption, but may still benefit
  from better-distributed data

#### Methods

- Logarithmic transformation: $\log(x + 1)$
  - Useful when applied to **skewed distributions**: it **expands** small
    values and **compresses** big values, which makes the distribution less
    skewed
  - Numerical values must satisfy $x \gt -1$
- Reciprocal transformation: $1/x$
- Square root: $\sqrt{x}$
  - Similar to the log transform
- Exponential
- Box-Cox transformation: $(x^\lambda - 1) / \lambda$
  - **Prerequisite:** numeric values must be positive, which can be solved by
    shifting
- Quantile transformation: using quantiles
  - Transforms the feature to follow a uniform or normal distribution. Tends
    to spread out the most frequent values.
  - This is **robust**
  - But it is a **non-linear** transform: it may distort linear correlations,
    though variables at different scales become more comparable

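A minimal sketch of these transforms with NumPy and scikit-learn (the data here is made up; `PowerTransformer` and `QuantileTransformer` are one way to apply the Box-Cox and quantile transforms, assuming scikit-learn is available):

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer, QuantileTransformer

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1))  # right-skewed, positive values

x_log = np.log1p(x)      # log(x + 1), defined for x > -1
x_recip = 1.0 / x        # reciprocal (x must be non-zero)
x_sqrt = np.sqrt(x)      # square root, similar effect to the log transform

# Box-Cox: (x^lambda - 1) / lambda; requires strictly positive values,
# PowerTransformer estimates lambda from the data.
x_boxcox = PowerTransformer(method="box-cox").fit_transform(x)

# Quantile transformation: map empirical quantiles onto a normal distribution;
# robust to outliers but non-linear.
x_quantile = QuantileTransformer(
    output_distribution="normal", n_quantiles=500
).fit_transform(x)
```
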
### Generation

#### Definition

- Generating new features that are often not the result of feature
  transformation
- Examples:
  - $Age \times NumberDiagnoses$
  - ![statistical feature](./assets/4-analytics-feat-gen-example-1.webp)
  - ![fourier transform](./assets/4-analytics-feat-gen-example-2.webp)

#### Methods

##### Feature Crossing

- Create new features from existing ones, thus increasing predictive power
- Takes the Cartesian product of existing features
  - $A \times B = \{(a, b) \mid a \in A \text{ and } b \in B\}$
- Useful when the data is not linearly separable
- Deciding which features to cross:
  - Use domain expertise
  - Automatic exploration tools
  - [Deep learning](#automatic-feature-learning-deep-learning)

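A minimal sketch of a feature cross on two hypothetical categorical columns (`city` and `device` are made-up names): the crossed feature enumerates the Cartesian pairs and is then one-hot encoded.

```python
import pandas as pd

df = pd.DataFrame({
    "city":   ["berlin", "paris", "berlin", "rome"],
    "device": ["mobile", "desktop", "desktop", "mobile"],
})

# The cross is one combined category per (city, device) pair ...
df["city_x_device"] = df["city"] + "_" + df["device"]

# ... which is one-hot encoded so that a linear model can pick up
# interactions it could not learn from the two columns separately.
crossed = pd.get_dummies(df["city_x_device"], prefix="cross")
print(crossed)
```
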
##### Polynomial Expansion

- Useful in modelling, since it can model non-linear relationships between
  predictor and outcome
- Use fitted polynomial variables to represent the data:
  - $\mathrm{poly}(x, n) = a_0 + a_1 x + a_2 x^2 + \dots + a_n x^n$
- Pros:
  - Fast
  - Good performance compared to binning
  - Doesn't create correlated features
  - Good at handling continuous change
- Cons:
  - Less interpretable
  - Produces a large number of variables
  - Hard to model changes in distribution

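A minimal sketch using scikit-learn's `PolynomialFeatures`, which generates the terms $1, x, x^2, \dots, x^n$ that a linear model can then fit:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.array([[1.0], [2.0], [3.0]])

# degree=3 expands a single column into [1, x, x^2, x^3]
poly = PolynomialFeatures(degree=3, include_bias=True)
x_poly = poly.fit_transform(x)

print(poly.get_feature_names_out())  # ['1' 'x0' 'x0^2' 'x0^3']
print(x_poly)
```
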
##### Feature Learning by Trees

- Each sample ends up in a leaf node
- The decision path to each leaf is a new non-linear feature
- With N leaf nodes we can create N new binary features
- Pro: fast way to get informative features

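A minimal sketch of the idea (dataset and tree parameters are arbitrary): fit a small forest, read off the leaf index each sample lands in via `apply()`, and one-hot encode those indices into new binary features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

forest = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=0)
forest.fit(X, y)

# apply() returns, for every sample, the leaf index it falls into in each tree.
leaves = forest.apply(X)                              # shape: (n_samples, n_trees)

# One new binary feature per (tree, leaf) pair.
leaf_features = OneHotEncoder().fit_transform(leaves).toarray()
print(leaf_features.shape)
```
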
##### Automatic Feature learning: Deep learning

- A deep learning model learns the features from the data
- Differences from shallow networks:
  - Deep, in the sense of having multiple hidden layers
  - Introduced stochastic gradient descent
  - Can automate feature extraction
  - Requires larger datasets
- DL can learn a hierarchy of features: character → word → word group → clause
  → sentence
- CNN: uses convolutional layers to apply filters to the input image to detect
  various features such as edges, textures, and shapes

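A minimal sketch (assuming PyTorch; layer sizes are arbitrary) of a CNN used as a learned feature extractor: the convolutional stack turns raw pixels into a feature vector instead of hand-crafted features.

```python
import torch
import torch.nn as nn

feature_extractor = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),   # low-level patterns (edges)
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(8, 16, kernel_size=3, padding=1),  # combinations of patterns (shapes)
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),                                # -> 16-dimensional feature vector
)

images = torch.randn(4, 1, 28, 28)               # a fake batch of 28x28 images
features = feature_extractor(images)
print(features.shape)                            # torch.Size([4, 16])
```
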
## Feature Selection

### Reason

- More features don't necessarily lead to a better model
- Feature selection is useful for:
  - Model simplification: easier interpretation, smaller model, lower cost
  - Lower data requirements: less data is required
  - Lower dimensionality
  - Enhanced generalization, less overfitting

### Methods

#### Filter

- Select the best features with simple statistical criteria, independently of
  the model, then evaluate
- Main methods:
  - Variance: remove features that take (nearly) the same value for all samples
  - Correlation: remove features that are highly correlated with each other
- Con: fails to consider the interaction between features and may reduce the
  predictive power of the model

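A minimal sketch of the two filter criteria above (column names and threshold values are made up): a variance threshold plus a simple pairwise-correlation filter.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "constant": np.ones(200),       # zero variance -> dropped
    "a": rng.normal(size=200),
    "b": rng.normal(size=200),
})
X["a_copy"] = X["a"] * 2 + rng.normal(scale=0.01, size=200)  # ~perfectly correlated with "a"

# 1) Variance filter: drop features whose variance is not above the threshold.
kept = X.columns[VarianceThreshold(threshold=0.0).fit(X).get_support()]

# 2) Correlation filter: drop one feature from every highly correlated pair.
corr = X[kept].corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]

print("kept after variance filter:", list(kept))
print("dropped by correlation filter:", to_drop)
```
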
#### Wrapper

- Searches through the possible feature subsets and evaluates each subset by
  training the model on it
- Steps of execution (p98), skipped
- Con: computationally expensive

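A minimal sketch of a wrapper method, assuming scikit-learn's `SequentialFeatureSelector`: greedy forward selection, where every candidate subset is scored by cross-validating the model itself.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# At each step, add the feature whose inclusion gives the best CV score.
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=5000),
    n_features_to_select=5,
    direction="forward",
    cv=5,
)
selector.fit(X, y)
print(selector.get_support(indices=True))  # indices of the selected features
```
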
#### Embedded

- Performs feature selection as part of the ML algorithm itself
- This addresses the drawbacks of both the filter and the wrapper method, and
  has advantages of both:
  - Faster than wrapper
  - More accurate than filter
- Methods:
  - Regularization: adds a penalty to the coefficients, which can shrink some
    of them to zero, so those features can be removed from the dataset
  - Tree-based methods: output feature importances, which can be used to
    select features

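A minimal sketch of embedded selection via L1 regularization (dataset and `alpha` are arbitrary): the Lasso drives some coefficients to exactly zero, and `SelectFromModel` keeps the rest.

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)

# The L1 penalty shrinks some coefficients to exactly zero ...
lasso = Lasso(alpha=1.0).fit(X, y)

# ... and SelectFromModel keeps only the features with non-zero coefficients.
selector = SelectFromModel(lasso, prefit=True)
print("selected feature indices:", selector.get_support(indices=True))
```
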
#### Shuffling
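
The notes leave this section empty; a minimal sketch, assuming "shuffling" refers to permutation importance: shuffle one feature at a time and measure how much the model's score drops, then drop features whose shuffling barely matters.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Features whose shuffling barely hurts the score are candidates for removal.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
print(result.importances_mean)
```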
#### Hybrid
#### Dimensionality Reduction
- When dimensionality is too high, it's computationally expensive to process
  the data. We **project the data** onto a lower-dimensional subspace that
  captures the **essence** of the data
- Reason:
  - Curse of dimensionality: high-dimensional data have a large number of
    features or dimensions, which can make them difficult to analyze and
    understand
  - Remove sparse or noisy data, reduce overfitting
  - Create a model with a lower number of variables
- PCA:
  - A form of feature extraction: combines and transforms the dataset's
    original variables
  - Projects the data onto a new space, defined by a subset of the principal
    components
  - An **unsupervised** linear dimensionality reduction technique
  - Preserves signal, filters out noise
  - Uses the **covariance matrix**
  - TODO: is calculation needed
  - Minimize intraclass difference
- LDA:
  - Similar to PCA
  - Differs from PCA because it retains the classification labels in the
    dataset
  - Goal: maximize data variance and maximize class difference in the data
  - Uses the **scatter matrix**
  - Maximizes interclass difference

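A minimal sketch contrasting PCA (unsupervised, covariance-based) with LDA (supervised, uses class labels) on the same dataset:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# PCA ignores y: directions of maximum variance.
X_pca = PCA(n_components=2).fit_transform(X)

# LDA uses y: directions that best separate the classes
# (at most n_classes - 1 components, here 2).
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

print(X_pca.shape, X_lda.shape)  # (150, 2) (150, 2)
```
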
BIN assets/4-analytics-feat-gen-example-1.webp (new file, 62 KiB, binary file not shown)
BIN assets/4-analytics-feat-gen-example-2.webp (new file, 98 KiB, binary file not shown)