# Data analytics: Feature engineering

<!--toc:start-->
- [Data analytics: Feature engineering](#data-analytics-feature-engineering)
  - [Definition](#definition)
  - [Sources of features](#sources-of-features)
  - [Feature engineering in ML](#feature-engineering-in-ml)
  - [Types of feature engineering](#types-of-feature-engineering)
  - [Good feature:](#good-feature)
    - [Related to objective (important)](#related-to-objective-important)
    - [Known at prediction-time](#known-at-prediction-time)
    - [Numeric with meaningful magnitude:](#numeric-with-meaningful-magnitude)
    - [Have enough samples](#have-enough-samples)
    - [Bring human insight to problem](#bring-human-insight-to-problem)
  - [Process of Feature Engineering](#process-of-feature-engineering)
    - [Scaling](#scaling)
      - [Rationale:](#rationale)
      - [Methods:](#methods)
        - [Normalization or Standardization:](#normalization-or-standardization)
        - [Min-max scaling:](#min-max-scaling)
        - [Robust scaling:](#robust-scaling)
      - [Choosing](#choosing)
    - [Discretization / Binning / Bucketing](#discretization-binning-bucketing)
      - [Definition](#definition)
      - [Reason for binning](#reason-for-binning)
      - [Methods](#methods)
        - [Equal width binning](#equal-width-binning)
        - [Equal frequency binning](#equal-frequency-binning)
        - [k means binning](#k-means-binning)
        - [decision trees](#decision-trees)
    - [Encoding](#encoding)
    - [Transformation](#transformation)
    - [Generation](#generation)
<!--toc:end-->

## Definition

- The process that attempts to create **additional** relevant features from
  **existing** raw features, so that the predictive power of **algorithms**
  is improved
- Important to machine learning

## Sources of features

- Different features are needed for different problems, even in the same domain

## Feature engineering in ML

- Process of ML iterations:
  - Baseline model -> Feature engineering -> Model 2 -> Feature engineering -> ...

## Types of feature engineering

- Highlighting **interactions** between features
- Representing the feature in a **different** way

## Good feature:

### Related to objective (important)

- Example: the number of concrete blocks around a house is not related to house
  prices

### Known at prediction-time

- Some data is known **immediately**, while other data is not available in
  **real time**: a feature can't be fed to the model if it isn't present at
  prediction time
- Feature definition shouldn't **change** over time
- Example: if sales data only becomes available with a 3-day lag, then current
  sales figures can't be used for training a model that has to predict with
  3-day-old data

### Numeric with meaningful magnitude:

- It does not mean that **categorical** features can't be used in training:
  they simply need to be **transformed** through a process called one-hot
  encoding (see the sketch after this list)
- Example: Font category: (Arial, Times New Roman)

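A minimal sketch of one-hot encoding with scikit-learn; the `font` column and
its values are made up for illustration:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical feature: the font category from the example above
df = pd.DataFrame({"font": ["Arial", "Times New Roman", "Arial", "Verdana"]})

# One 0/1 column per category; categories unseen at prediction time become all zeros
encoder = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
encoded = encoder.fit_transform(df[["font"]])

print(encoder.get_feature_names_out())  # ['font_Arial' 'font_Times New Roman' 'font_Verdana']
print(encoded)
```
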
### Have enough samples

- Have at least five examples of any value before using it in your model (a
  quick check is sketched below)
- If the values of a feature are poorly assorted and unbalanced, the trained
  model will be biased

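A quick sketch of checking that rule of thumb for a categorical column; the
data and the threshold of five are illustrative:

```python
import pandas as pd

# Made-up categorical feature with one rare value
df = pd.DataFrame({"font": ["Arial"] * 7 + ["Times New Roman"] * 5 + ["Verdana"] * 1})

counts = df["font"].value_counts()
rare = counts[counts < 5].index.tolist()
print(rare)  # ['Verdana'] -> too few samples; consider dropping or grouping it
```
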
### Bring human insight to problem

- Must have a reason for this feature to be useful; needs **subject matter**
  expertise and a **curious mind**
- This is an iterative process, need to use **feedback** from production usage

## Process of Feature Engineering

### Scaling

#### Rationale:

- Leads to a better model; useful when feature scales are very uneven: $X_1 \gg X_2$

#### Methods:

##### Normalization or Standardization:

- $Z = \frac{X - \mu}{\sigma}$
- Re-scales the variable to a standard normal distribution centered around 0
  with an SD of 1
- Will **compress** the values into a narrow range if the variable is skewed or
  has outliers
  - This may impair the prediction

##### Min-max scaling:

- $X_{scaled} = \frac{X - min}{max - min}$
- Also compresses the observations into a narrow range (same drawback as above)

##### Robust scaling:

- $X_{scaled} = \frac{X - \text{median}}{\text{IQR}}$
- IQR: Interquartile range
- Better at **preserving** the spread

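A minimal numpy sketch of the three formulas above on a small made-up sample
with one outlier, showing how the outlier squeezes the standardized and min-max
values while robust scaling preserves the spread:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # made-up data with an outlier

# Standardization: Z = (X - mean) / std
z = (x - x.mean()) / x.std()

# Min-max scaling: (X - min) / (max - min)
minmax = (x - x.min()) / (x.max() - x.min())

# Robust scaling: (X - median) / IQR
q1, q3 = np.percentile(x, [25, 75])
robust = (x - np.median(x)) / (q3 - q1)

print(z)       # non-outlier values are compressed into a narrow band
print(minmax)  # same effect: the bulk of the data sits near 0
print(robust)  # spread of the non-outlier values is preserved
```
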
#### Choosing

- If the data is **not Gaussian-like** and has a **skewed distribution** or
  outliers: use **robust** scaling, as the other two will compress the data into
  a narrow range, which is not ideal
- For **PCA or LDA** (distance or covariance calculations), it is better to use
  **Normalization or Standardization**, since it removes the effect of the
  numerical scale on variance and covariance
- Min-max scaling is bound to 0-1 and has the same drawback as normalization;
  new data may fall out of bounds (outside the original range). It is preferred
  when the network expects a 0-1 **scale**

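The same three methods are available in scikit-learn; a sketch of plugging one
into a PCA pipeline (the skewed feature matrix here is randomly generated as a
placeholder):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler  # or MinMaxScaler, RobustScaler

X = np.random.default_rng(0).lognormal(size=(100, 3))  # placeholder skewed data

# Standardize before PCA so the scale of a feature does not dominate the variance
pipeline = make_pipeline(StandardScaler(), PCA(n_components=2))
X_reduced = pipeline.fit_transform(X)
print(X_reduced.shape)  # (100, 2)
```
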
### Discretization / Binning / Bucketing

#### Definition

- The process of transforming a **continuous** variable into a **discrete** one,
  by creating a set of contiguous intervals that span the range of the
  variable's values
- ![binning diagram](./assets/4-analytics-binning.webp)

#### Reason for binning

- Example: Solar energy modeling
  - Accelerates the calculation: binning reduces the number of simulations
    needed
- Improves **performance** by grouping data with **similar attributes** and
  **similar predictive strength**
- Handles **non-linearity**: binning can capture **non-linear patterns**, thus
  improving the fitting power of the model
- **Interpretability** is enhanced by grouping
- Reduces the impact of **outliers**
- Prevents **overfitting**
- Allows feature **interaction** with **continuous** variables

#### Methods

##### Equal width binning

- Divides the range of the variable into bins of the same width (see the
  `pd.cut` sketch below)
- Con: sensitive to skewed distributions

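A minimal pandas sketch of equal width binning; the `age` column is made up:

```python
import pandas as pd

df = pd.DataFrame({"age": [18, 22, 25, 31, 45, 52, 67, 80]})

# 4 bins of equal width spanning the range 18-80
df["age_bin"] = pd.cut(df["age"], bins=4)
print(df["age_bin"].value_counts(sort=False))  # skewed data leaves some bins nearly empty
```
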
##### Equal frequency binning

- Divides the range of possible values of the variable into N bins, where each
  bin carries the same **number** of observations (see the `pd.qcut` sketch
  below)
- Con: may disrupt the relationship with the target

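The same made-up `age` column, binned by quantiles so that each bin holds
roughly the same number of rows:

```python
import pandas as pd

df = pd.DataFrame({"age": [18, 22, 25, 31, 45, 52, 67, 80]})

# 4 quantile-based bins: each bin carries (roughly) the same number of observations
df["age_bin"] = pd.qcut(df["age"], q=4)
print(df["age_bin"].value_counts(sort=False))
```
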
##### k means binning

- Uses k-means to partition the values into clusters (see the sketch below)
- Con: needs hyper-parameter tuning

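A minimal scikit-learn sketch; the data is made up and `n_bins` is the
hyper-parameter that would need tuning:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

x = np.array([[1.0], [1.2], [1.1], [5.0], [5.2], [9.8], [10.1], [10.3]])

# k-means strategy: bin edges are placed between the learned cluster centers
binner = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="kmeans")
x_binned = binner.fit_transform(x)

print(x_binned.ravel())      # bin index per observation
print(binner.bin_edges_[0])  # learned bin edges
```
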
##### decision trees

- Uses a decision tree to decide the best splitting points (sketched below)
- The tree groups observations so that values within a bin are more similar to
  each other than to those in other bins
- Con:
  - may cause overfitting
  - has a chance of failing: bad performance

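A minimal sketch of tree-based binning, assuming a binary target `y` exists: a
shallow tree is fit on the single feature and its learned thresholds become the
bin edges (the feature values and target are made up):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

x = np.array([[3], [7], [12], [18], [25], [33], [41], [58]])  # single feature
y = np.array([0, 0, 0, 1, 1, 1, 0, 0])                        # made-up binary target

# Limiting the number of leaves controls the number of bins and curbs overfitting
tree = DecisionTreeClassifier(max_leaf_nodes=4, random_state=0).fit(x, y)

# Internal nodes hold the split thresholds; leaf nodes are marked with -2
thresholds = sorted(t for t in tree.tree_.threshold if t != -2)
print(thresholds)  # use these values as bin edges, e.g. with pd.cut
```
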
### Encoding

### Transformation

### Generation