# Data analytics: Feature engineering

<!--toc:start-->
- [Data analytics: Feature engineering](#data-analytics-feature-engineering)
  - [Definition](#definition)
  - [Sources of features](#sources-of-features)
  - [Feature engineering in ML](#feature-engineering-in-ml)
  - [Types of feature engineering](#types-of-feature-engineering)
  - [Good feature:](#good-feature)
    - [Related to objective (important)](#related-to-objective-important)
    - [Known at prediction-time](#known-at-prediction-time)
    - [Numeric with meaningful magnitude:](#numeric-with-meaningful-magnitude)
    - [Have enough samples](#have-enough-samples)
    - [Bring human insight to problem](#bring-human-insight-to-problem)
  - [Process of Feature Engineering](#process-of-feature-engineering)
    - [Scaling](#scaling)
      - [Rationale:](#rationale)
      - [Methods:](#methods)
        - [Normalization or Standardization:](#normalization-or-standardization)
        - [Min-max scaling:](#min-max-scaling)
        - [Robust scaling:](#robust-scaling)
      - [Choosing](#choosing)
    - [Discretization / Binning / Bucketing](#discretization-binning-bucketing)
      - [Definition](#definition)
      - [Reason for binning](#reason-for-binning)
      - [Methods](#methods)
        - [Equal width binning](#equal-width-binning)
        - [Equal frequency binning](#equal-frequency-binning)
        - [k means binning](#k-means-binning)
        - [decision trees](#decision-trees)
    - [Encoding](#encoding)
    - [Transformation](#transformation)
    - [Generation](#generation)
<!--toc:end-->

## Definition

- The process that attempts to create **additional** relevant features from
  **existing** raw features, so that the predictive power of **algorithms**
  is improved
- Important to machine learning

## Sources of features

- Different features are needed for different problems, even in the same domain

## Feature engineering in ML

- Process of ML iterations:
  - Baseline model -> Feature engineering -> Model 2 -> Feature engineering -> ...

## Types of feature engineering

- Highlighting **interactions** between features
- Representing the feature in a **different** way

## Good feature:

### Related to objective (important)

- Example: the number of concrete blocks around a house is not related to house
  prices

### Known at prediction-time

- Some data is known **immediately**, while other data is not available in
  **real time**: a feature can't be fed to the model if it isn't present at
  prediction time
- Feature definition shouldn't **change** over time
- Example: if sales data only becomes available with a 3-day lag, then current
  sales figures can't be used for training a model that has to predict with
  3-day-old data

### Numeric with meaningful magnitude:

- It does not mean that **categorical** features can't be used in training:
  they simply need to be **transformed** through a process called one-hot
  encoding (see the sketch after this list)
- Example: Font category: (Arial, Times New Roman)

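A minimal sketch of one-hot encoding with scikit-learn; the `font` column and
its values are made up for illustration:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical feature: the font category from the example above
df = pd.DataFrame({"font": ["Arial", "Times New Roman", "Arial", "Verdana"]})

# One 0/1 column per category; categories unseen at prediction time become all zeros
encoder = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
encoded = encoder.fit_transform(df[["font"]])

print(encoder.get_feature_names_out())  # ['font_Arial' 'font_Times New Roman' 'font_Verdana']
print(encoded)
```
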
### Have enough samples

- Have at least five examples of any value before using it in your model (a
  quick check is sketched below)
- If the values of a feature are poorly assorted and unbalanced, the trained
  model will be biased

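A quick sketch of checking that rule of thumb for a categorical column; the
data and the threshold of five are illustrative:

```python
import pandas as pd

# Made-up categorical feature with one rare value
df = pd.DataFrame({"font": ["Arial"] * 7 + ["Times New Roman"] * 5 + ["Verdana"] * 1})

counts = df["font"].value_counts()
rare = counts[counts < 5].index.tolist()
print(rare)  # ['Verdana'] -> too few samples; consider dropping or grouping it
```
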
### Bring human insight to problem

- Must have a reason for this feature to be useful; needs **subject matter**
  expertise and a **curious mind**
- This is an iterative process, need to use **feedback** from production usage

## Process of Feature Engineering

### Scaling

#### Rationale:

- Leads to a better model; useful when feature scales are very uneven: $X_1 \gg X_2$

#### Methods:

##### Normalization or Standardization:

- $Z = \frac{X - \mu}{\sigma}$
- Re-scales the variable to a standard normal distribution centered around 0
  with an SD of 1
- Will **compress** the values into a narrow range if the variable is skewed or
  has outliers
  - This may impair the prediction

##### Min-max scaling:

- $X_{scaled} = \frac{X - min}{max - min}$
- Also compresses the observations into a narrow range (same drawback as above)

##### Robust scaling:

- $X_{scaled} = \frac{X - \text{median}}{\text{IQR}}$
- IQR: Interquartile range
- Better at **preserving** the spread

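A minimal numpy sketch of the three formulas above on a small made-up sample
with one outlier, showing how the outlier squeezes the standardized and min-max
values while robust scaling preserves the spread:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # made-up data with an outlier

# Standardization: Z = (X - mean) / std
z = (x - x.mean()) / x.std()

# Min-max scaling: (X - min) / (max - min)
minmax = (x - x.min()) / (x.max() - x.min())

# Robust scaling: (X - median) / IQR
q1, q3 = np.percentile(x, [25, 75])
robust = (x - np.median(x)) / (q3 - q1)

print(z)       # non-outlier values are compressed into a narrow band
print(minmax)  # same effect: the bulk of the data sits near 0
print(robust)  # spread of the non-outlier values is preserved
```
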
#### Choosing

- If the data is **not Gaussian-like** and has a **skewed distribution** or
  outliers: use **robust** scaling, as the other two will compress the data into
  a narrow range, which is not ideal
- For **PCA or LDA** (distance or covariance calculations), it is better to use
  **Normalization or Standardization**, since it removes the effect of the
  numerical scale on variance and covariance
- Min-max scaling is bound to 0-1 and has the same drawback as normalization;
  new data may fall out of bounds (outside the original range). It is preferred
  when the network expects a 0-1 **scale**

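The same three methods are available in scikit-learn; a sketch of plugging one
into a PCA pipeline (the skewed feature matrix here is randomly generated as a
placeholder):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler  # or MinMaxScaler, RobustScaler

X = np.random.default_rng(0).lognormal(size=(100, 3))  # placeholder skewed data

# Standardize before PCA so the scale of a feature does not dominate the variance
pipeline = make_pipeline(StandardScaler(), PCA(n_components=2))
X_reduced = pipeline.fit_transform(X)
print(X_reduced.shape)  # (100, 2)
```
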
### Discretization / Binning / Bucketing

#### Definition

- The process of transforming a **continuous** variable into a **discrete** one,
  by creating a set of contiguous intervals that span the range of the
  variable's values
- ![binning diagram](./assets/4-analytics-binning.webp)

#### Reason for binning

- Example: Solar energy modeling
  - Accelerates the calculation: binning reduces the number of simulations
    needed
- Improves **performance** by grouping data with **similar attributes** and
  **similar predictive strength**
- Handles **non-linearity**: binning can capture **non-linear patterns**, thus
  improving the fitting power of the model
- **Interpretability** is enhanced by grouping
- Reduces the impact of **outliers**
- Prevents **overfitting**
- Allows feature **interaction** with **continuous** variables

#### Methods

##### Equal width binning

- Divides the range of the variable into bins of the same width (see the
  `pd.cut` sketch below)
- Con: sensitive to skewed distributions

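A minimal pandas sketch of equal width binning; the `age` column is made up:

```python
import pandas as pd

df = pd.DataFrame({"age": [18, 22, 25, 31, 45, 52, 67, 80]})

# 4 bins of equal width spanning the range 18-80
df["age_bin"] = pd.cut(df["age"], bins=4)
print(df["age_bin"].value_counts(sort=False))  # skewed data leaves some bins nearly empty
```
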
##### Equal frequency binning

- Divides the range of possible values of the variable into N bins, where each
  bin carries the same **number** of observations (see the `pd.qcut` sketch
  below)
- Con: may disrupt the relationship with the target

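The same made-up `age` column, binned by quantiles so that each bin holds
roughly the same number of rows:

```python
import pandas as pd

df = pd.DataFrame({"age": [18, 22, 25, 31, 45, 52, 67, 80]})

# 4 quantile-based bins: each bin carries (roughly) the same number of observations
df["age_bin"] = pd.qcut(df["age"], q=4)
print(df["age_bin"].value_counts(sort=False))
```
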
##### k means binning

- Uses k-means to partition the values into clusters (see the sketch below)
- Con: needs hyper-parameter tuning

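A minimal scikit-learn sketch; the data is made up and `n_bins` is the
hyper-parameter that would need tuning:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

x = np.array([[1.0], [1.2], [1.1], [5.0], [5.2], [9.8], [10.1], [10.3]])

# k-means strategy: bin edges are placed between the learned cluster centers
binner = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="kmeans")
x_binned = binner.fit_transform(x)

print(x_binned.ravel())      # bin index per observation
print(binner.bin_edges_[0])  # learned bin edges
```
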
##### decision trees

- Uses a decision tree to decide the best splitting points (sketched below)
- The tree groups observations so that values within a bin are more similar to
  each other than to those in other bins
- Con:
  - may cause overfitting
  - has a chance of failing: bad performance

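A minimal sketch of tree-based binning, assuming a binary target `y` exists: a
shallow tree is fit on the single feature and its learned thresholds become the
bin edges (the feature values and target are made up):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

x = np.array([[3], [7], [12], [18], [25], [33], [41], [58]])  # single feature
y = np.array([0, 0, 0, 1, 1, 1, 0, 0])                        # made-up binary target

# Limiting the number of leaves controls the number of bins and curbs overfitting
tree = DecisionTreeClassifier(max_leaf_nodes=4, random_state=0).fit(x, y)

# Internal nodes hold the split thresholds; leaf nodes are marked with -2
thresholds = sorted(t for t in tree.tree_.threshold if t != -2)
print(thresholds)  # use these values as bin edges, e.g. with pd.cut
```
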
### Encoding

### Transformation

### Generation