Add more to 4, took 1.5hr

2025-01-07 21:20:43 +08:00 · 2025-01-07 21:20:43 +08:00 · ea049b6d06
parent 33d3bee7b0
commit ea049b6d06
2 changed files with 166 additions and 36 deletions
--- a/4-data-analytics.md
+++ b/4-data-analytics.md
@@ -1,18 +1,39 @@
-# Data analytics
+# Data analytics: Feature engineering

 <!--toc:start-->
+- [Data analytics: Feature engineering](#data-analytics-feature-engineering)
+  - [Definition](#definition)
+  - [Sources of features](#sources-of-features)
+  - [Feature engineering in ML](#feature-engineering-in-ml)
+    - [Types of feature engineering](#types-of-feature-engineering)
+  - [Good feature:](#good-feature)
+    - [Related to objective (important)](#related-to-objective-important)
+    - [Known at prediction-time](#known-at-prediction-time)
+    - [Numeric with meaningful magnitude:](#numeric-with-meaningful-magnitude)
+    - [Have enough samples](#have-enough-samples)
+    - [Bring human insight to problem](#bring-human-insight-to-problem)
+  - [Process of Feature Engineering](#process-of-feature-engineering)
+    - [Scaling](#scaling)
+      - [Rationale:](#rationale)
+      - [Methods:](#methods)
+        - [Normalization or Standardization:](#normalization-or-standardization)
+        - [Min-max scaling:](#min-max-scaling)
+        - [Robust scaling:](#robust-scaling)
+      - [Choosing](#choosing)
+    - [Discretization / Binning / Bucketing](#discretization-binning-bucketing)
+      - [Definition](#definition)
+      - [Reason for binning](#reason-for-binning)
+      - [Methods](#methods)
+        - [Equal width binning](#equal-width-binning)
+        - [Equal frequency binning](#equal-frequency-binning)
+        - [k means binning](#k-means-binning)
+        - [decision trees](#decision-trees)
+    - [Encoding](#encoding)
+    - [Transformation](#transformation)
+    - [Generation](#generation)
+<!--toc:end-->

- [Data analytics](#data-analytics)
-    - [Feature engineering](#feature-engineering) - [Definition](#definition) -
-      [Sources of features](#sources-of-features) -
-      [Is a part of machine learning, an iterative process](#is-a-part-of-machine-learning-an-iterative-process) -
-      [Intro](#intro) -
-      [Types of feature engineering](#types-of-feature-engineering) -
-      [Good feature:](#good-feature) <!--toc:end-->
-
-## Feature engineering
-
-### Definition
+## Definition

 - The process that attempts to create **additional** relevant features from
  **existing** raw features, to increase the predictive power of **algorithms**
@@ -21,11 +42,11 @@
  is improved.
 - Important to machine learning

-### Sources of features
+## Sources of features

 - Different features are needed for different problems, even in the same domain

-### Feature engineering in ML
+## Feature engineering in ML

 - Process of ML iterations:
    - Baseline model -> Feature engineering -> Model 2 -> Feature engineering ->
@@ -42,26 +63,135 @@
 - Highlighting **interactions** between features
 - Representing the feature in a **different** way

-### Good feature:
+## Good feature:

- Related to objective (important)
-    - Example: the number of concrete blocks around it is not related to house
-      prices
- Known at prediction-time
-    - Some data could be known **immediately**, and some other data is not known
-      in **real time**: Can't feed the feature to a model, if it isn't present
-      at prediction time
-    - Feature definition shouldn't **change** over time
-    - Example: If the sales data at prediction time is only available within 3
-      days, with a 3 day lag, then current sale data can't be used for training
-      (that has to predict with a 3-day old data)
- Numeric with meaningful magnitude:
-    - It does not mean that **categorical** features can't be used in training:
-      simply, they will need to be **transformed** through a process called
-      one-hot encoding
-    - Example: Font category: (Arial, Times New Roman)
- Have enough samples
-    - Have at least five examples of any value before using it in your model
-    - If features tend to be poorly assorted and are unbalanced, then the
-      trained model will be biased
- Bring human insight to problem
+### Related to objective (important)
+
+- Example: the number of concrete blocks around a house is not related to its
+  price
+
+### Known at prediction-time
+
+- Some data is known **immediately**, while other data is not available in
+  **real time**: a feature can't be fed to the model if it isn't present at
+  prediction time
+- Feature definition shouldn't **change** over time
+- Example: if sales data only becomes available with a 3-day lag, then same-day
+  sales can't be used as a training feature; the model has to be trained on the
+  same 3-day-old data it will see at prediction time
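
A minimal sketch of the 3-day lag example, assuming a plain list of daily sales. The helper name `lagged_feature` and the numbers are made up for illustration:

```python
def lagged_feature(daily_sales, lag_days=3):
    """For each day, return the sales value that was already available.

    Days without enough history get None.
    """
    return [
        daily_sales[i - lag_days] if i >= lag_days else None
        for i in range(len(daily_sales))
    ]


sales = [10, 12, 9, 14, 11, 15]  # made-up daily sales
print(lagged_feature(sales))  # [None, None, None, 10, 12, 9]
```

Training on the lagged column keeps the training input consistent with what is actually known at prediction time.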
+
+### Numeric with meaningful magnitude:
+
+- This does not mean that **categorical** features can't be used in training:
+  they first need to be **transformed** into numbers, for example through
+  one-hot encoding
+- Example: font category (Arial, Times New Roman)
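
One-hot encoding of the font example could look roughly like this. It is a plain-Python sketch; the helper `one_hot` is hypothetical, and in practice a library encoder would normally be used:

```python
def one_hot(values):
    """Map each categorical value to a 0/1 vector, one column per category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


fonts = ["Arial", "Times New Roman", "Arial"]
categories, encoded = one_hot(fonts)
print(categories)  # ['Arial', 'Times New Roman']
print(encoded)     # [[1, 0], [0, 1], [1, 0]]
```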
+
+### Have enough samples
+
+- Have at least five examples of any value before using it in your model
+- If some feature values are poorly represented or unbalanced, the trained
+  model will tend to be biased
+
+### Bring human insight to problem
+
+- Must have a reason for the feature to be useful; this needs **subject-matter
+  knowledge** and a **curious mind**
+- This is an iterative process that needs **feedback** from production usage
+
+## Process of Feature Engineering
+
+### Scaling
+
+#### Rationale:
+
+- Often leads to a better model; useful when features are on very different
+  scales, e.g. $X_1 \gg X_2$
+
+#### Methods:
+
+##### Normalization or Standardization:
+
+- $Z = \frac{X - \mu}{\sigma}$
+- Re-scales the variable to be centered around 0 with an SD of 1; the shape of
+  the distribution is unchanged, so the result is standard normal only if the
+  data was Gaussian to begin with
+- Will **compress** most values into a narrow range if the variable is skewed or
+  has outliers.
+    - This may impair the prediction
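
The formula above can be sketched with the standard library only. The sample values are made up, and the population SD is an assumption (a sample SD would work the same way):

```python
import statistics


def standardize(xs):
    """Z-score: subtract the mean, divide by the (population) SD."""
    mu = statistics.fmean(xs)
    sigma = statistics.pstdev(xs)  # assumes the values are not all identical
    return [(x - mu) / sigma for x in xs]


values = [1.0, 2.0, 3.0, 4.0, 100.0]  # one large outlier
z = standardize(values)
print([round(v, 2) for v in z])
```

With the outlier present, the four small values end up within less than 0.1 of each other, which is the compression described above.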
+
+##### Min-max scaling:
+
+- $X_{scaled} = \frac{X - X_{min}}{X_{max} - X_{min}}$
+- Also compresses most observations into a narrow range when outliers are
+  present
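
A small sketch of min-max scaling on the same kind of made-up data, assuming the values are not all equal:

```python
def min_max_scale(xs):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


values = [1.0, 2.0, 3.0, 4.0, 100.0]  # one large outlier
scaled = min_max_scale(values)
print([round(v, 4) for v in scaled])  # [0.0, 0.0101, 0.0202, 0.0303, 1.0]
```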
+
+##### Robust scaling:
+
+- $X_{scaled} = \frac{X - median}{IQR}$
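+- IQR: interquartile range (75th percentile minus 25th percentile)
+- Better at **preserving** the spread of the bulk of the data, since the median
+  and IQR are barely affected by outliers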
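
Robust scaling can be sketched with `statistics.median` and `statistics.quantiles`. The `inclusive` quantile method is an assumption made here; other quartile conventions give slightly different IQR values:

```python
import statistics


def robust_scale(xs):
    """Center on the median and divide by the interquartile range."""
    med = statistics.median(xs)
    q1, _, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    return [(x - med) / (q3 - q1) for x in xs]


values = [1.0, 2.0, 3.0, 4.0, 100.0]  # one large outlier
print(robust_scale(values))  # [-1.0, -0.5, 0.0, 0.5, 48.5]
```

The four ordinary values stay evenly spread, while the outlier remains clearly far away instead of squeezing everything else together.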
+
+#### Choosing
+
+- If data is **not Gaussian-like**, and has a **skewed distribution** or
+  outliers: use **robust** scaling, as the other two will compress the data to
+  a narrow range, which is not ideal
+- For **PCA or LDA** (distance or covariance calculations), better to use
+  **Normalization or Standardization**, since it removes the effect of
+  numerical scale on variance and covariance
+- Min-max scaling: bounded to 0-1 on the data it was fitted on, has the same
+  drawback as normalization, and new data may fall out of bounds (outside the
+  original range). Preferred when the model, e.g. a neural network, expects
+  inputs on a 0-1 **scale**
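
The out-of-bounds point for min-max scaling can be illustrated by fitting the range on training data and applying it to new data. The function names and numbers below are hypothetical:

```python
def fit_min_max(train):
    """Learn the scaling range from the training data only."""
    return min(train), max(train)


def apply_min_max(xs, lo, hi):
    """Apply a previously learned range; results may fall outside 0-1."""
    return [(x - lo) / (hi - lo) for x in xs]


train = [10.0, 20.0, 30.0]
lo, hi = fit_min_max(train)

new_data = [5.0, 25.0, 40.0]  # 5 and 40 lie outside the training range
print(apply_min_max(new_data, lo, hi))  # [-0.25, 0.75, 1.5]
```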
+
+### Discretization / Binning / Bucketing
+
+#### Definition
+
+- The process of transforming a **continuous** variable into a **discrete** one
+  by creating a set of contiguous intervals that span the range of the
+  variable's values
+- ![binning diagram](./assets/4-analytics-binning.webp)
+
+#### Reason for binning
+
+- Example: Solar energy modeling
+    - Speeds up the calculation: binning reduces the number of simulations
+      needed
+- Improves **performance** by grouping data that has **similar attributes** and
+  **similar predictive strength**
+- Handles **non-linearity**: bins can capture **non-linear patterns**, thus
+  improving the fitting power of the model
+- **Interpretability** is enhanced by grouping
+- Reduces the impact of **outliers**
+- Helps prevent **overfitting**
+- Allows feature **interactions** with **continuous** variables
+
+#### Methods
+
+##### Equal width binning
+
+- Divide the range of values into N bins of the same width
+- Con: sensitive to skewed distributions (most observations can end up in a few
+  bins)
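
A minimal sketch of equal width binning, assuming the values are not all equal. The helper name is made up:

```python
def equal_width_bins(xs, n_bins):
    """Assign each value a bin index, using bins of identical width."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / n_bins
    # The maximum sits exactly on the right edge, so clamp it into the last bin.
    return [min(int((x - lo) / width), n_bins - 1) for x in xs]


values = [1, 2, 3, 4, 100]  # skewed: one large value
print(equal_width_bins(values, 3))  # [0, 0, 0, 0, 2]
```

The skew shows up immediately: four of five observations share one bin and the middle bin stays empty.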
+
+##### Equal frequency binning
+
+- Divides the range of possible values of the variable into N bins, where each
+  bin carries the same **number** of observations
+- Con: may disrupt the relationship with the target
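
Equal frequency binning can be sketched by ranking the observations and cutting the ranks into equal chunks. This simplified version may place tied values in different bins, which a library implementation would usually handle differently:

```python
def equal_frequency_bins(xs, n_bins):
    """Assign bin indices so each bin holds (roughly) the same number of values."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])  # indices by value
    bins = [0] * len(xs)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(xs)
    return bins


values = [1, 2, 3, 4, 100, 5]
print(equal_frequency_bins(values, 3))  # [0, 0, 1, 1, 2, 2]
```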
+
+##### k means binning
+
+- Use k-means to partition the values into clusters; each cluster is a bin
+- Con: needs hyper-parameter tuning (the number of clusters k)
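
A one-dimensional k-means (Lloyd's algorithm) is enough to illustrate the idea. The evenly spaced initialisation and the fixed iteration count are assumptions of this sketch, not part of the notes:

```python
def kmeans_1d(xs, k, n_iter=20):
    """Cluster a single variable with Lloyd's algorithm; labels act as bins."""
    xs_sorted = sorted(xs)
    # Assumed initialisation: pick starting centers spread across the sorted data.
    centers = [xs_sorted[(2 * j + 1) * len(xs) // (2 * k)] for j in range(k)]

    def assign():
        return [min(range(k), key=lambda j: abs(x - centers[j])) for x in xs]

    for _ in range(n_iter):
        labels = assign()
        for j in range(k):
            members = [x for x, lab in zip(xs, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = sum(members) / len(members)
    return assign(), centers


values = [1, 2, 3, 10, 11, 12, 50, 52]
labels, centers = kmeans_1d(values, k=3)
print(labels)   # [0, 0, 0, 1, 1, 1, 2, 2]
print(centers)  # [2.0, 11.0, 51.0]
```

The choice of `k` is the hyper-parameter mentioned above; a different `k` gives different bin boundaries.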
+
+##### decision trees
+
+- Uses a decision tree to decide the best split points
+- Groups observations that are more similar to each other than to those in
+  other bins
+- Con:
+    - may cause overfitting
+    - may fail to find useful splits, giving bad performance
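
The idea of letting a tree pick the split points can be sketched as a tiny regression tree on one feature. It assumes a numeric target and squared error as the split criterion, and all names and data are made up:

```python
def sse(ys):
    """Sum of squared errors around the mean (0.0 for an empty list)."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)


def best_split(xs, ys):
    """Return the threshold that most reduces squared error, or None."""
    pairs = sorted(zip(xs, ys))
    best_t, best_err = None, sse(ys)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between equal feature values
        err = sse([y for _, y in pairs[:i]]) + sse([y for _, y in pairs[i:]])
        if err < best_err:
            best_t, best_err = (pairs[i - 1][0] + pairs[i][0]) / 2, err
    return best_t


def tree_bin_edges(xs, ys, max_depth=2):
    """Recursively collect split points, like a shallow regression tree."""
    if max_depth == 0:
        return []
    t = best_split(xs, ys)
    if t is None:
        return []
    edges = [t]
    for keep_left in (True, False):
        part = [(x, y) for x, y in zip(xs, ys) if (x <= t) == keep_left]
        edges += tree_bin_edges(
            [x for x, _ in part], [y for _, y in part], max_depth - 1
        )
    return sorted(edges)


x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [10, 10, 10, 10, 30, 30, 50, 50]  # target changes at x = 4.5 and x = 6.5
print(tree_bin_edges(x, y))  # [4.5, 6.5]
```

Limiting `max_depth` is one simple guard against the overfitting risk listed above.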
+
+### Encoding
+
+### Transformation
+
+### Generation
--- a/assets/4-analytics-binning.webp
+++ b/assets/4-analytics-binning.webp