Finish 4, took 2hr

This commit is contained in:
Ryan 2025-01-08 15:38:11 +08:00
parent 6d88f082a0
commit 45d10bab7f
3 changed files with 168 additions and 2 deletions

@@ -40,6 +40,21 @@
- [Reasons](#reasons)
- [Methods](#methods)
- [Generation](#generation)
- [Definition](#definition)
- [Methods](#methods)
- [Feature Crossing](#feature-crossing)
- [Polynomial Expansion](#polynomial-expansion)
- [Feature Learning by Trees](#feature-learning-by-trees)
- [Automatic Feature learning: Deep learning](#automatic-feature-learning-deep-learning)
- [Feature Selection](#feature-selection)
- [Reason](#reason)
- [Methods](#methods)
- [Filter](#filter)
- [Wrapper](#wrapper)
- [Embedded](#embedded)
- [Shuffling](#shuffling)
- [Hybrid](#hybrid)
- [Dimensionality Reduction](#dimensionality-reduction)
<!--toc:end-->
## Definition
@@ -283,11 +298,162 @@
- homogeneity of variance
- Example:
- assuming y > 0.5 leads to class 1, otherwise class 2
- ![page 1](./assets/4-analytics-line-regression.webp)
- ![page 2](./assets/4-analytics-line-regression-2.webp)
- Some other ML algorithms do not make any assumptions, but may still benefit
from better-distributed data
#### Methods
- Logarithmic transformation: $\log(x + 1)$
- Useful for **skewed distributions**: it **expands** small values and
**compresses** large values, which makes the distribution less skewed
- Numerical values must satisfy $x \gt -1$
- Reciprocal transformation: $1/x$
- Square root: $\sqrt{x}$
- Similar to the log transform
- Exponential
- Box-Cox transformation: $(x^\lambda - 1) / \lambda$ for $\lambda \ne 0$, and
$\log(x)$ for $\lambda = 0$
- **Prerequisite:** numeric values must be positive, which can be achieved by
shifting
- Quantile transformation: uses quantiles
- Transforms the feature to follow a uniform or normal distribution. Tends to
spread out the most frequent values.
- This is **robust** to outliers
- But it is a **non-linear** transform: it may distort linear correlations,
although variables at different scales become more comparable
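The transformations above can be sketched in plain Python. This is only an illustration under stated assumptions: the function names and the sample values are made up for this note, `lam` is picked by hand rather than estimated, and the quantile version is a simple rank-based mapping to a uniform distribution (ties are not averaged, and at least two values are assumed).

```python
import math


def log_transform(xs):
    """log(x + 1); every x must be greater than -1."""
    return [math.log1p(x) for x in xs]


def box_cox(xs, lam):
    """Box-Cox transform; every x must be positive. lam is chosen by the caller."""
    if lam == 0:
        return [math.log(x) for x in xs]
    return [(x ** lam - 1) / lam for x in xs]


def quantile_uniform(xs):
    """Map each value to its rank-based quantile in [0, 1]."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    out = [0.0] * len(xs)
    for rank, i in enumerate(order):
        out[i] = rank / (len(xs) - 1)
    return out


skewed = [1, 2, 3, 10, 1000]  # hypothetical right-skewed sample
print(log_transform(skewed))     # large values are compressed the most
print(box_cox(skewed, 0.5))      # lam = 0.5 behaves like a shifted square root
print(quantile_uniform(skewed))  # evenly spread ranks, outlier no longer dominates
```

In practice `lam` is usually estimated from the data, and a library implementation would handle ties and interpolation for unseen values.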
### Generation
#### Definition
- Generating new features that are often not the result of feature
transformation
- Examples:
- $Age \times NumberDiagnoses$
- ![statistical feature](./assets/4-analytics-feat-gen-example-1.webp)
- ![fourier transform](./assets/4-analytics-feat-gen-example-2.webp)
#### Methods
##### Feature Crossing
- Create new features from existing ones, which can increase predictive power
- Takes the Cartesian product of existing features
- $A \times B = \{(a, b) \mid a \in A \text{ and } b \in B\}$
- Useful when the data is not linearly separable
- Deciding which feature to cross:
- Use expertise
- Automatic exploration tools
- [Deep learning](#automatic-feature-learning-deep-learning)
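A minimal sketch of feature crossing, assuming two small categorical features. The helper names and the XOR-style toy data are hypothetical and only illustrate why a crossed feature can help when the raw features are not linearly separable.

```python
from itertools import product


def cross_features(a_values, b_values):
    """All possible crossed categories: the Cartesian product A x B."""
    return [f"{a}_x_{b}" for a, b in product(a_values, b_values)]


def cross_row(a, b):
    """Crossed feature value for a single sample."""
    return f"{a}_x_{b}"


# Hypothetical XOR-like data: the label is 1 only when the two inputs differ,
# so no single straight line over (x1, x2) separates the classes.
rows = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]

# After crossing, each category maps to exactly one label.
crossed = {cross_row(x1, x2): label for x1, x2, label in rows}

print(cross_features([0, 1], [0, 1]))
print(crossed)
```

Once the crossed feature is one-hot encoded, a linear model can assign each category its own weight, which is what makes the XOR pattern learnable.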
##### Polynomial Expansion
- Useful in modelling, since it can model non-linear relationships between
predictor and outcome
- Use fitted polynomial variables to represent the data:
- $\mathrm{poly}(x, n) = a_0 + a_1 \times x + a_2 \times x^2 + \cdots + a_n \times x^n$
- Pros:
- Fast
- Good performance, compared to binning
- Doesn't create correlated features
- Good at handling continuous change
- Cons:
- Less interpretable
- Lots of variables produced
- Hard to model changes in distribution
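As a rough sketch, polynomial expansion of a single numeric value can be written as below. The function names are made up for this note, and the sketch uses raw powers of `x` rather than an orthogonalized basis, so the generated columns here will generally be correlated with each other.

```python
def poly_expand(x, n):
    """Return the expanded features [x, x^2, ..., x^n] for one numeric value."""
    return [x ** k for k in range(1, n + 1)]


def poly_eval(x, coeffs):
    """Evaluate a_0 + a_1*x + ... + a_n*x^n, with coeffs = [a_0, ..., a_n]."""
    powers = poly_expand(x, len(coeffs) - 1)
    return coeffs[0] + sum(a * p for a, p in zip(coeffs[1:], powers))


print(poly_expand(2, 3))        # features fed to a linear model
print(poly_eval(2, [1, 0, 3]))  # 1 + 0*2 + 3*2^2
```

A linear model fitted on the expanded features is still linear in the coefficients, which is why the fit stays fast even though the relationship with `x` is non-linear.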
##### Feature Learning by Trees
- Each sample ends up in one leaf node
- The decision path to each leaf is a new non-linear feature
- We can create N new binary features (with N leaf nodes)
- Pro: Fast way to get informative features
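The idea can be sketched with a tiny hand-written tree. Everything here is hypothetical: the features (`age`, `income`), the thresholds, and the function names are made up, and in practice the tree would be fitted on data rather than written by hand.

```python
def leaf_index(age, income):
    """A hand-written depth-2 tree; returns the leaf (0..3) a sample reaches."""
    if age < 40:
        return 0 if income < 50_000 else 1
    return 2 if income < 80_000 else 3


def leaf_one_hot(age, income, n_leaves=4):
    """N leaves give N binary features; exactly one of them is 1 per sample."""
    idx = leaf_index(age, income)
    return [1 if i == idx else 0 for i in range(n_leaves)]


print(leaf_one_hot(30, 60_000))  # reaches leaf 1
print(leaf_one_hot(55, 90_000))  # reaches leaf 3
```

Each binary feature encodes a conjunction of threshold tests (the decision path), so a linear model on top of these features can pick up non-linear interactions.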
##### Automatic Feature learning: Deep learning
- A deep learning model learns the features from data
- Differences from shallow networks
- Deep, in the sense of having multiple hidden layers
- Typically trained with stochastic gradient descent
- Can automate feature extraction
- Requires larger datasets
- DL can learn a hierarchy of features: character → word → word group → clause
→ sentence
- CNN: uses convolutional layers to apply filters to the input image, to detect
various features such as edges, textures and shapes
## Feature Selection
### Reason
- More features don't necessarily lead to a better model
- Feature selection is useful for
- Model simplification: easy interpretation, smaller model, less cost
- Lower data requirements: less data is required
- Less dimensionality
- Enhanced generalization, less overfitting
### Methods
#### Filter
- Select the best features via the following methods, then evaluate
- Main methods
- Variance: remove features whose value is the same (or nearly the same) for
all samples
- Correlation: remove features that are highly correlated with each other
- Con: Fails to consider the interaction between features and may reduce the
predictive power of the model
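A minimal sketch of the two filter methods above. The thresholds (`var_min`, `corr_max`), the function names, and the toy columns are assumptions for illustration; the correlation used is Pearson's, computed by hand.

```python
import math
from statistics import mean, pvariance


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long columns."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def filter_features(columns, var_min=0.0, corr_max=0.95):
    """columns: dict of name -> list of values. Returns the names that are kept."""
    kept = []
    for name, values in columns.items():
        if pvariance(values) <= var_min:
            continue  # (near-)constant feature carries no information
        if any(abs(pearson(values, columns[k])) > corr_max for k in kept):
            continue  # redundant with a feature that is already kept
        kept.append(name)
    return kept


data = {
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8],  # perfectly correlated with "a"
    "c": [5, 5, 5, 5],  # constant
    "d": [1, 0, 1, 0],
}
print(filter_features(data))
```

Note that which of two correlated features survives depends on column order here; the sketch simply keeps the first one it sees.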
#### Wrapper
- Search through the possible feature subsets and evaluate each of them with
the model
- Steps of execution (p98), skipped
- Con: Computationally expensive
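The wrapper idea can be sketched as an exhaustive search over subsets. This is not the step list from the slides: `best_subset` and `toy_score` are hypothetical names, and `toy_score` is only a stand-in for "train a model on this subset and return its validation score".

```python
from itertools import combinations


def best_subset(features, score):
    """Evaluate every non-empty subset; score(subset) returns a float, higher is better."""
    best, best_score = None, float("-inf")
    for k in range(1, len(features) + 1):
        for subset in combinations(features, k):
            s = score(subset)
            if s > best_score:
                best, best_score = subset, s
    return best, best_score


# Hypothetical usefulness of each feature, used only by the toy scorer below.
USEFUL = {"age": 0.30, "bmi": 0.25, "noise": 0.0}


def toy_score(subset):
    """Stand-in for model evaluation: reward useful features, penalize subset size."""
    return sum(USEFUL[f] for f in subset) - 0.05 * len(subset)


print(best_subset(["age", "bmi", "noise"], toy_score))
```

With `n` features the loop evaluates `2**n - 1` subsets, which is where the computational cost comes from; greedy forward or backward search is a common way to cut this down.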
#### Embedded
- Performs feature selection as part of the ML algorithm
- This addresses the drawbacks of both the filter and wrapper methods, and has
the advantages of both
- Faster than wrapper
- More accurate than filter
- Methods:
- Regularization: adds a penalty on the coefficients, which can shrink some of
them to zero; the corresponding features can then be removed from the dataset
- Tree-based methods: output feature importances, which can be used to
select features.
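As one common instance of the regularization idea, the L1-penalized (lasso) objective for linear regression can be written as follows. The symbols are assumptions for illustration and do not come from the notes: $X$ is the feature matrix with $n$ rows and $p$ columns, $y$ the target, $\beta$ the coefficients, and $\lambda \ge 0$ the penalty strength.

$$
\hat{\beta} = \arg\min_{\beta} \; \frac{1}{2n} \lVert y - X\beta \rVert_2^2 + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
$$

As $\lambda$ grows, more coefficients $\hat{\beta}_j$ become exactly zero, and the corresponding features can be dropped.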
#### Shuffling
#### Hybrid
#### Dimensionality Reduction
- When dimensionality is too high, the data is computationally expensive to
process. We **project the data** onto a lower-dimensional subspace that
captures the **essence** of the data
- Reason
- Curse of dimensionality: high-dimensional data have a large number of
features or dimensions, which can make them difficult to analyze and
understand
- Remove sparse or noisy data, reduce overfitting
- To create a model with a lower number of variables
- PCA:
- form of feature extraction, combines and transforms the dataset's original
values
- projects data onto a new space, defined by this subset of principal
components
- Is an **unsupervised** linear dimensionality reduction technique
- Preserves signal, filters out noise
- Uses the **covariance matrix**
- TODO: is calculation needed
- LDA:
- Similar to PCA
- Differs from PCA in that it uses the classification labels in the dataset
(supervised)
- Goal: maximize the difference between classes while minimizing the
difference within each class
- Uses **scatter matrices**
- Maximizes interclass difference
- Minimizes intraclass difference
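A minimal sketch of PCA via the covariance matrix, restricted to 2-D points so the eigen-decomposition has a closed form. The function name and the sample points are made up for this note; a real implementation would handle any number of dimensions with a linear algebra library.

```python
import math


def pca_first_component(points):
    """points: list of (x, y). Returns (unit direction, 1-D projections)."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    centered = [(x - mx, y - my) for x, y in points]

    # Sample covariance matrix [[sxx, sxy], [sxy, syy]]
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)

    # Largest eigenvalue of a symmetric 2x2 matrix (closed form)
    lam = (sxx + syy) / 2 + math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)

    # Eigenvector that belongs to that eigenvalue
    if sxy != 0:
        vx, vy = lam - syy, sxy
    else:
        vx, vy = (1.0, 0.0) if sxx >= syy else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    direction = (vx / norm, vy / norm)

    # Project the centered data onto the first principal component
    projected = [x * direction[0] + y * direction[1] for x, y in centered]
    return direction, projected


direction, projected = pca_first_component([(0, 0), (1, 1), (2, 2), (3, 3)])
print(direction)  # points along the diagonal, where the variance is largest
print(projected)  # the 2-D points reduced to one coordinate each
```

The direction found is the one along which the data varies most; no class labels are used, which is the unsupervised part. LDA would instead use the labels to build its scatter matrices.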

Binary file not shown. Size: 62 KiB

Binary file not shown. Size: 98 KiB