diff --git a/4-data-analytics.md b/4-data-analytics.md
index 7bcd2b5..96c5e0c 100644
--- a/4-data-analytics.md
+++ b/4-data-analytics.md
@@ -1,18 +1,39 @@
-# Data analytics
+# Data analytics: Feature engineering
+
+- [Data analytics: Feature engineering](#data-analytics-feature-engineering)
+  - [Definition](#definition)
+  - [Sources of features](#sources-of-features)
+  - [Feature engineering in ML](#feature-engineering-in-ml)
+  - [Types of feature engineering](#types-of-feature-engineering)
+  - [Good feature:](#good-feature)
+    - [Related to objective (important)](#related-to-objective-important)
+    - [Known at prediction-time](#known-at-prediction-time)
+    - [Numeric with meaningful magnitude:](#numeric-with-meaningful-magnitude)
+    - [Have enough samples](#have-enough-samples)
+    - [Bring human insight to problem](#bring-human-insight-to-problem)
+  - [Process of Feature Engineering](#process-of-feature-engineering)
+    - [Scaling](#scaling)
+      - [Rationale:](#rationale)
+      - [Methods:](#methods)
+        - [Normalization or Standardization:](#normalization-or-standardization)
+        - [Min-max scaling:](#min-max-scaling)
+        - [Robust scaling:](#robust-scaling)
+      - [Choosing](#choosing)
+    - [Discretization / Binning / Bucketing](#discretization--binning--bucketing)
+      - [Definition](#definition-1)
+      - [Reason for binning](#reason-for-binning)
+      - [Methods](#methods-1)
+        - [Equal width binning](#equal-width-binning)
+        - [Equal frequency binning](#equal-frequency-binning)
+        - [K-means binning](#k-means-binning)
+        - [Decision trees](#decision-trees)
+    - [Encoding](#encoding)
+    - [Transformation](#transformation)
+    - [Generation](#generation)
+
-- [Data analytics](#data-analytics)
-  - [Feature engineering](#feature-engineering)
-  - [Definition](#definition)
-  - [Sources of features](#sources-of-features)
-  - [Is a part of machine learning, an iterative process](#is-a-part-of-machine-learning-an-iterative-process)
-  - [Intro](#intro)
-  - [Types of feature engineering](#types-of-feature-engineering)
-  - [Good feature:](#good-feature)
-
-## Feature engineering
-
-### Definition
+## Definition

 - The process that attempts to create **additional** relevant features from
   **existing** raw features, so that the predictive power of **algorithms**
@@ -21,11 +42,11 @@
   is improved.
 - Important to machine learning

-### Sources of features
+## Sources of features

 - Different features are needed for different problems, even in the same domain

-### Feature engineering in ML
+## Feature engineering in ML

 - Process of ML iterations:
   - Baseline model -> Feature engineering -> Model 2 -> Feature engineering ->
@@ -42,26 +63,135 @@
   - Highlighting **interactions** between features
   - Representing the feature in a **different** way

-### Good feature:
+## Good feature:

-- Related to objective (important)
-  - Example: the number of concrete blocks around it is not related to house
-    prices
-- Known at prediction-time
-  - Some data could be known **immediately**, and some other data is not known
-    in **real time**: Can't feed the feature to a model, if it isn't present
-    at prediction time
-  - Feature definition shouldn't **change** over time
-  - Example: If the sales data at prediction time is only available within 3
-    days, with a 3 day lag, then current sale data can't be used for training
-    (that has to predict with a 3-day old data)
-- Numeric with meaningful magnitude:
-  - It does not mean that **categorical** features can't be used in training:
-    simply, they will need to be **transformed** through a process called
-    one-hot encoding
-  - Example: Font category: (Arial, Times New Roman)
-- Have enough samples
-  - Have at least five examples of any value before using it in your model
-  - If features tend to be poorly assorted and are unbalanced, then the
-    trained model will be biased
-- Bring human insight to problem
+### Related to objective (important)
+
+- Example: the number of concrete blocks around a house is not related to its
+  price
+
+### Known at prediction-time
+
+- Some data could be known **immediately**, while other data is not known in
+  **real time**: a feature can't be fed to the model if it isn't present at
+  prediction time
+- Feature definitions shouldn't **change** over time
+- Example: If sales data only becomes available with a 3-day lag, then current
+  sales data can't be used as a training feature; the model has to predict
+  using 3-day-old data
+
+### Numeric with meaningful magnitude:
+
+- This does not mean that **categorical** features can't be used in training:
+  they simply need to be **transformed** through a process called one-hot
+  encoding
+- Example: Font category: (Arial, Times New Roman)
+
+### Have enough samples
+
+- Have at least five examples of any value before using it in your model
+- If feature values are poorly assorted or unbalanced, the trained model will
+  be biased
+
+### Bring human insight to problem
+
+- There must be a reason for the feature to be useful; this needs **subject
+  matter** expertise and a **curious mind**
+- This is an iterative process: use **feedback** from production usage
+
+## Process of Feature Engineering
+
+### Scaling
+
+#### Rationale:
+
+- Leads to a better model; useful when feature magnitudes are uneven:
+  $X_1 \gg X_2$
+
+#### Methods:
+
+##### Normalization or Standardization:
+
+- $Z = \frac{X - \mu}{\sigma}$
+- Re-scales the variable to a standard normal distribution, centered around 0
+  with an SD of 1
+- Will **compress** the values into a narrow range if the variable is skewed
+  or has outliers
+  - This may impair the prediction
+
+##### Min-max scaling:
+
+- $X_{scaled} = \frac{X - X_{min}}{X_{max} - X_{min}}$
+- Will also compress the observations into a narrow range
+
+##### Robust scaling:
+
+- $X_{scaled} = \frac{X - \text{median}}{\text{IQR}}$
+- IQR: interquartile range
+- Better at **preserving** the spread
+
+#### Choosing
+
+- If the data is **not Gaussian-like**, e.g. it has a **skewed distribution**
+  or outliers, use **robust** scaling: the other two methods would compress
+  the data into a narrow range, which is not ideal
+- For **PCA or LDA** (distance or covariance calculations), prefer
+  **normalization or standardization**, since it removes the effect of
+  numerical scale on variance and covariance
+- Min-max scaling is bound to 0-1 and shares the compression drawback of
+  normalization, and new data may fall outside the original range; it is
+  preferred when the network expects inputs on a 0-1 **scale** (see the
+  sketch below)
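+
+A minimal sketch comparing the three scalers above on a feature with an
+outlier (this assumes `numpy` and `scikit-learn`; the toy data is made up):
+
+```python
+import numpy as np
+from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler
+
+# Four inliers plus one large outlier
+X = np.array([[1.0], [2.0], [2.5], [3.0], [100.0]])
+
+print(StandardScaler().fit_transform(X).ravel())  # z = (x - mean) / sd
+print(MinMaxScaler().fit_transform(X).ravel())    # (x - min) / (max - min)
+print(RobustScaler().fit_transform(X).ravel())    # (x - median) / IQR
+```
+
+With this toy data, standardization squeezes the four inliers into a band
+about 0.05 wide and min-max scaling pushes them close to 0, while robust
+scaling keeps them spread between -1.5 and 0.5.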
+
+### Discretization / Binning / Bucketing
+
+#### Definition
+
+- The process of transforming a **continuous** variable into a **discrete**
+  one by creating a set of contiguous intervals that span the range of the
+  variable's values
+- ![binning diagram](./assets/4-analytics-binning.webp)
+
+#### Reason for binning
+
+- Example: solar energy modeling
+  - Accelerates the calculation: binning reduces the number of simulations
+    needed
+- Improves **performance** by grouping data with **similar attributes** and
+  **similar predictive strength**
+- Improves handling of **non-linearity**: binning can capture **non-linear
+  patterns**, improving the fitting power of the model
+- **Interpretability** is enhanced by grouping
+- Reduces the impact of **outliers**
+- Prevents **overfitting**
+- Allows feature **interaction** with **continuous** variables
+
+#### Methods
+
+The first three methods are sketched in code at the end of this section.
+
+##### Equal width binning
+
+- Divides the range into N bins of the same width
+- Con: sensitive to skewed distributions
+
+##### Equal frequency binning
+
+- Divides the range of possible values into N bins, where each bin carries
+  the same **number** of observations
+- Con: may disrupt the relationship with the target
+
+##### K-means binning
+
+- Uses k-means to partition the values into clusters
+- Con: needs hyper-parameter tuning
+
+##### Decision trees
+
+- Uses a decision tree to find the best splitting points
+- Observations in the same bin are more similar to each other than to those
+  in other bins
+- Cons:
+  - may cause overfitting
+  - may fail to find good splits, giving bad performance
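+
+A minimal sketch of the first three binning methods (this assumes `numpy`,
+`pandas`, and `scikit-learn`; the data is made up, and decision-tree binning
+is omitted):
+
+```python
+import numpy as np
+import pandas as pd
+from sklearn.preprocessing import KBinsDiscretizer
+
+rng = np.random.default_rng(0)
+x = rng.exponential(scale=10.0, size=1000)  # skewed toy feature
+
+equal_width = pd.cut(x, bins=4)  # same interval width per bin
+equal_freq = pd.qcut(x, q=4)     # same number of observations per bin
+
+# K-means binning: bin edges are derived from 1-D k-means clusters.
+# (strategy="uniform" / "quantile" would reproduce the two methods above.)
+kbins = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy="kmeans")
+labels = kbins.fit_transform(x.reshape(-1, 1))
+```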
+
+### Encoding
+
+### Transformation
+
+### Generation
diff --git a/assets/4-analytics-binning.webp b/assets/4-analytics-binning.webp
new file mode 100644
index 0000000..7de5c3a
Binary files /dev/null and b/assets/4-analytics-binning.webp differ
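
As a possible starting point for the `Encoding` stub above, a minimal one-hot
encoding sketch for the font-category example (this assumes `pandas`; the data
is made up):

```python
import pandas as pd

# Hypothetical font-category feature (see "Numeric with meaningful magnitude")
df = pd.DataFrame({"font": ["Arial", "Times New Roman", "Arial"]})

# One-hot encoding: one 0/1 indicator column per category
one_hot = pd.get_dummies(df["font"], prefix="font")
print(one_hot)
```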