Add part of 4, used 1.5 hr

parent ea049b6d06
commit 6d88f082a0
@@ -12,7 +12,7 @@
 - [Numeric with meaningful magnitude:](#numeric-with-meaningful-magnitude)
 - [Have enough samples](#have-enough-samples)
 - [Bring human insight to problem](#bring-human-insight-to-problem)
-- [Process of Feature Engineering](#process-of-feature-engineering)
+- [Methods of Feature Engineering](#methods-of-feature-engineering)
 - [Scaling](#scaling)
 - [Rationale:](#rationale)
 - [Methods:](#methods)
@@ -29,7 +29,16 @@
 - [k means binning](#k-means-binning)
 - [decision trees](#decision-trees)
 - [Encoding](#encoding)
+- [Definition](#definition)
+- [Reason](#reason)
+- [Methods](#methods)
+- [One hot encoding](#one-hot-encoding)
+- [Ordinal encoding](#ordinal-encoding)
+- [Count / frequency encoding](#count-frequency-encoding)
+- [Mean / target encoding](#mean-target-encoding)
 - [Transformation](#transformation)
+- [Reasons](#reasons)
+- [Methods](#methods)
 - [Generation](#generation)
 <!--toc:end-->

@@ -83,8 +92,8 @@
 ### Numeric with meaningful magnitude:

 - It does not mean that **categorical** features can't be used in training:
-  simply, they will need to be **transformed** through a process called one-hot
-  encoding
+  simply, they will need to be **transformed** through a process called
+  [encoding](#encoding)
 - Example: Font category: (Arial, Times New Roman)

 ### Have enough samples
@@ -99,7 +108,7 @@
   **curious mind**
 - This is an iterative process, need to use **feedback** from production usage

-## Process of Feature Engineering
+## Methods of Feature Engineering

 ### Scaling

@@ -153,7 +162,7 @@
 #### Reason for binning

 - Example: Solar energy modeling
-  - Acelleration calculation, by binning, and reduce the number of simulation
+  - Accelerates calculation by binning, and reduces the number of simulations
     needed
 - Improves **performance** by grouping data with **similar attributes** and
   **similar predictive strength**
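To make the grouping concrete, here is a minimal sketch with synthetic data, using scikit-learn's `KBinsDiscretizer` (the k-means strategy matches the "k means binning" entry in the TOC; the irradiance framing is a made-up example):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Synthetic continuous feature, e.g. solar irradiance readings
rng = np.random.default_rng(0)
x = rng.uniform(0, 100, size=(1000, 1))

# Five bins whose edges come from 1-D k-means, so each bin groups
# values with similar attributes (and similar predictive strength)
binner = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="kmeans")
x_binned = binner.fit_transform(x)
```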
@@ -192,6 +201,93 @@

 ### Encoding

+#### Definition
+
+- The inverse of binning: creating numerical values from categorical variables
+
+#### Reason
+
+- Machine learning algorithms require **numerical** input data, and encoding
+  converts **categorical** data to **numerical** data
+
+#### Methods
+
+##### One hot encoding
+
+- Replaces a categorical (nominal) variable with separate binary variables
+- **Eliminates ordinality**: nominal variables shouldn't be ranked; otherwise
+  the algorithm may think there's an ordering between the categories
+- Improves performance by allowing the model to capture complex relationships
+  within the data that may be **missed** if categorical variables are treated
+  as **single** entities
+- Cons
+  - High dimensionality: makes the model more complex and slower to train
+  - Produces sparse data
+  - May lead to overfitting, especially if there are too many categories and
+    the sample size is small
+- Usage:
+  - Good for algorithms that look at all features at the same time: neural
+    networks, clustering, SVM
+  - Used for linear regression, but **keep k-1** binary variables to avoid
+    **multicollinearity**:
+    - In linear regression, the presence of all k binary variables for a
+      categorical feature (where k is the number of categories) introduces
+      perfect multicollinearity. This happens because the k-th variable is a
+      linear **combination** of the others (e.g., if "Red" and "Blue" are 0,
+      "Green" must be 1).
+  - Don't use for tree algorithms
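For concreteness, a minimal pandas sketch with a made-up `color` column; `drop_first=True` produces the k-1 variant discussed above:

```python
import pandas as pd

df = pd.DataFrame({"color": ["Red", "Blue", "Green", "Blue"]})

# One binary column per category (k columns)
full = pd.get_dummies(df["color"], prefix="color")

# k-1 columns for linear regression: the dropped category becomes
# the baseline, avoiding perfect multicollinearity
reduced = pd.get_dummies(df["color"], prefix="color", drop_first=True)
```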
+##### Ordinal encoding
+
+- Ordinal variable: comprises a finite set of discrete values with a **ranked**
+  ordering
+- Ordinal encoding replaces each label with its ordered number
+- By itself it does not give the variable more predictive power
+- Usage:
+  - For categorical data with ordinal meaning
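A minimal sketch with a hypothetical `size` column, where the rank map encodes the domain ordering:

```python
import pandas as pd

df = pd.DataFrame({"size": ["small", "large", "medium", "small"]})

# Map each label to its rank; the order itself is domain knowledge
order = {"small": 0, "medium": 1, "large": 2}
df["size_encoded"] = df["size"].map(order)
```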
+##### Count / frequency encoding
+
+- Replaces each occurrence of a label with the count of its occurrences
+- Cons:
+  - Loses unique categories: if two categories have the same frequency, they
+    will be treated as the same
+  - Doesn't handle unseen categories
+  - Overfitting, if frequencies are low in general
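A sketch of both the count and the normalized-frequency variant, on a toy `city` column:

```python
import pandas as pd

df = pd.DataFrame({"city": ["A", "B", "A", "C", "A", "B"]})

# Raw counts; note that if two cities had equal counts they would
# collapse into the same encoded value (the con noted above)
df["city_count"] = df["city"].map(df["city"].value_counts())

# Frequency variant: share of rows instead of raw count
df["city_freq"] = df["city"].map(df["city"].value_counts(normalize=True))
```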
+##### Mean / target encoding
+
+- Replaces the _value_ for every category with the average of _values_ for
+  every _category-value_ pair
+- Creates a monotonic relationship between the variable and the target
+- Doesn't expand the feature space
+- Con: prone to overfitting
+- Usage:
+  - High-cardinality data (cardinality: the number of elements in a set), by
+    leveraging the target variable's statistics to retain predictive power
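A minimal sketch with hypothetical `city` and `target` columns; given the overfitting risk above, the per-category means would normally be computed on the training split only:

```python
import pandas as pd

df = pd.DataFrame({
    "city":   ["A", "B", "A", "C", "A", "B"],
    "target": [1, 0, 1, 0, 0, 1],
})

# Mean of the target per category, mapped back onto the column
means = df.groupby("city")["target"].mean()
df["city_encoded"] = df["city"].map(means)
```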
 ### Transformation

+#### Reasons
+
+- Linear/Logistic regression models make assumptions about the relationship
+  between the predictors and the outcome.
+  - Transformation may help create this relationship and avoid poor
+    performance.
+- Assumptions:
+  - Linear dependency between the predictors and the outcome
+  - Multivariate normality (every variable X should follow a Gaussian
+    distribution)
+  - No or little multicollinearity
+  - Homogeneity of variance
+- Example:
+  - Assuming y > 0.5 leads to class 1, otherwise class 2
+  - ![page 1](./assets/4-analytics-line-regression.webp)
+  - ![page 2](./assets/4-analytics-line-regression-2.webp)
+- Some other ML algorithms do not make any assumptions, but may still benefit
+  from better-distributed data
+
+#### Methods
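One common method is the log transform, which pulls a long right tail toward the Gaussian shape assumed above; a minimal sketch with made-up skewed data:

```python
import numpy as np

# Hypothetical right-skewed predictor (e.g., income-like values)
rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

# log1p compresses the long right tail, moving the distribution
# closer to the normality assumed by linear models
x_log = np.log1p(x)
```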

 ### Generation

BIN  assets/4-analytics-line-regression-2.webp  (new file, 66 KiB)
BIN  assets/4-analytics-line-regression.webp  (new file, 55 KiB)