# Data analytics: Feature engineering

<!--toc:start-->
- [Data analytics: Feature engineering](#data-analytics-feature-engineering)
  - [Definition](#definition)
  - [Sources of features](#sources-of-features)
  - [Feature engineering in ML](#feature-engineering-in-ml)
    - [Types of feature engineering](#types-of-feature-engineering)
  - [Good feature:](#good-feature)
    - [Related to objective (important)](#related-to-objective-important)
    - [Known at prediction-time](#known-at-prediction-time)
    - [Numeric with meaningful magnitude:](#numeric-with-meaningful-magnitude)
    - [Have enough samples](#have-enough-samples)
    - [Bring human insight to problem](#bring-human-insight-to-problem)
  - [Methods of Feature Engineering](#methods-of-feature-engineering)
    - [Scaling](#scaling)
      - [Rationale:](#rationale)
      - [Methods:](#methods)
        - [Normalization or Standardization:](#normalization-or-standardization)
        - [Min-max scaling:](#min-max-scaling)
        - [Robust scaling:](#robust-scaling)
      - [Choosing](#choosing)
    - [Discretization / Binning / Bucketing](#discretization-binning-bucketing)
      - [Definition](#definition)
      - [Reason for binning](#reason-for-binning)
      - [Methods](#methods)
        - [Equal width binning](#equal-width-binning)
        - [Equal frequency binning](#equal-frequency-binning)
        - [k means binning](#k-means-binning)
        - [decision trees](#decision-trees)
    - [Encoding](#encoding)
      - [Definition](#definition)
      - [Reason](#reason)
      - [Methods](#methods)
        - [One hot encoding](#one-hot-encoding)
        - [Ordinal encoding](#ordinal-encoding)
        - [Count / frequency encoding](#count-frequency-encoding)
        - [Mean / target encoding](#mean-target-encoding)
    - [Transformation](#transformation)
      - [Reasons](#reasons)
      - [Methods](#methods)
    - [Generation](#generation)
      - [Definition](#definition)
      - [Methods](#methods)
        - [Feature Crossing](#feature-crossing)
        - [Polynomial Expansion](#polynomial-expansion)
        - [Feature Learning by Trees](#feature-learning-by-trees)
        - [Automatic Feature learning: Deep learning](#automatic-feature-learning-deep-learning)
  - [Feature Selection](#feature-selection)
    - [Reason](#reason)
    - [Methods](#methods)
      - [Filter](#filter)
      - [Wrapper](#wrapper)
      - [Embedded](#embedded)
      - [Shuffling](#shuffling)
      - [Hybrid](#hybrid)
      - [Dimensionality Reduction](#dimensionality-reduction)
<!--toc:end-->

## Definition

- The process that attempts to create **additional** relevant features from
  **existing** raw features, to increase the predictive power of **algorithms**
- Alternative definition: transform raw data into features that **better
  represent** the underlying problem, such that the accuracy of the predictive
  model is improved
- Important to machine learning

## Sources of features

- Different features are needed for different problems, even in the same domain

## Feature engineering in ML

- Process of ML iterations:
  - Baseline model -> Feature engineering -> Model 2 -> Feature engineering ->
    Final model
- Example: data needed to predict house prices
  - ML can do this, given sufficient features
- Reason for feature engineering: raw data is rarely useful as-is
  - It must be mapped into a feature vector
- Good feature engineering takes the most time in an ML project

### Types of feature engineering

- **Indicator** variables to isolate information
- Highlighting **interactions** between features
- Representing a feature in a **different** way

## Good feature:

### Related to objective (important)

- Example: the number of concrete blocks around a house is not related to its
  price

### Known at prediction-time

- Some data is known **immediately**, while other data is not available in
  **real time**: a feature can't be fed to the model if it isn't present at
  prediction time
- Feature definitions shouldn't **change** over time
- Example: if sales data only becomes available with a 3-day lag, current sales
  figures can't be used for training, because at prediction time the model will
  have to work with 3-day-old data

### Numeric with meaningful magnitude:

- This does not mean that **categorical** features can't be used in training:
  they simply need to be **transformed** through a process called
  [encoding](#encoding)
- Example: font category (Arial, Times New Roman)

### Have enough samples

- Have at least five examples of a value before using it in your model
- If features are poorly assorted and unbalanced, the trained model will be
  biased

### Bring human insight to problem

- There must be a reason for the feature to be useful; this needs **subject
  matter expertise** and a **curious mind**
- This is an iterative process: use **feedback** from production usage

## Methods of Feature Engineering

### Scaling

#### Rationale:

- Leads to a better model; useful when features are on very different scales,
  e.g. $X_1 \gg X_2$

#### Methods:

##### Normalization or Standardization:

- $Z = \frac{X - \mu}{\sigma}$
- Re-scales the variable to a standard normal distribution centred around 0
  with an SD of 1
- Will **compress** the values into a narrow range if the variable is skewed or
  has outliers
  - This may impair the prediction

##### Min-max scaling:

- $X_{scaled} = \frac{X - min}{max - min}$
- Also compresses the observations

##### Robust scaling:

- $X_{scaled} = \frac{X - median}{IQR}$
- IQR: interquartile range
- Better at **preserving** the spread

#### Choosing

- If the data is **not Gaussian-like**, and has a **skewed distribution** or
  outliers: use **robust** scaling, as the other two will compress the data
  into a narrow range, which is not ideal
- For **PCA or LDA** (distance or covariance calculations), prefer
  **normalization or standardization**, since it removes the effect of the
  numerical scale on variance and covariance
- Min-max scaling is bound to 0-1 and shares the drawbacks of standardization,
  and new data may fall outside the original range; it is preferred when the
  network expects a 0-1 **scale**

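A minimal sketch comparing the three scalers from scikit-learn on a made-up
skewed feature with one large outlier; the data and parameters are illustrative
assumptions, not from the notes:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

rng = np.random.default_rng(0)
# Skewed feature with one large outlier (illustrative data)
x = np.append(rng.exponential(scale=10, size=100), 500.0).reshape(-1, 1)

for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    x_scaled = scaler.fit_transform(x)
    # RobustScaler (median / IQR) is the least affected by the outlier
    print(type(scaler).__name__, x_scaled.min().round(2), x_scaled.max().round(2))
```
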
### Discretization / Binning / Bucketing

#### Definition

- The process of transforming a **continuous** variable into a **discrete**
  one, by creating a set of contiguous intervals that spans the range of the
  variable's values
- ![binning diagram](./assets/4-analytics-binning.webp)

#### Reason for binning

- Example: solar energy modeling
  - Accelerates the calculation: binning reduces the number of simulations
    needed
- Improves **performance** by grouping data with **similar attributes** and
  **similar predictive strength**
- Improves handling of **non-linearity**: bins can capture **non-linear
  patterns**, improving the fitting power of the model
- **Interpretability** is enhanced by grouping
- Reduces the impact of **outliers**
- Helps prevent **overfitting**
- Allows feature **interaction** with **continuous** variables

#### Methods

##### Equal width binning

- Divides the range of values into bins of the same width
- Con: sensitive to skewed distributions

##### Equal frequency binning

- Divides the range of possible values into N bins, where each bin carries the
  same **number** of observations
- Con: may disrupt the relationship with the target

##### k means binning

- Uses k-means to partition the values into clusters
- Con: needs hyper-parameter tuning

##### decision trees

- Uses a decision tree to decide the best splitting points
- Observations within a bin are more similar to each other than to those in
  other bins
- Cons:
  - May cause overfitting
  - Has a chance of failing: bad performance

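A short sketch of equal-width, equal-frequency and k-means binning, assuming
pandas and scikit-learn; the synthetic income column is made up for
illustration:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

# Synthetic, right-skewed "income" values (illustrative only)
income = pd.Series(np.random.default_rng(1).lognormal(mean=10, sigma=0.5, size=1000))

equal_width = pd.cut(income, bins=5)  # equal width: counts per bin can be very uneven
equal_freq = pd.qcut(income, q=5)     # equal frequency: same count per bin, widths differ

# k-means binning: bin edges are placed by clustering the 1-D values
kmeans = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="kmeans")
income_kmeans_bin = kmeans.fit_transform(income.to_frame())

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```
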
### Encoding

#### Definition

- The inverse of binning: creating numerical features from categorical
  variables

#### Reason

- Machine learning algorithms require **numerical** input data; encoding
  converts **categorical** data into **numerical** data

#### Methods

##### One hot encoding

- Replaces a categorical (nominal) variable with separate binary variables
- **Eliminates ordinality**: nominal categories shouldn't be ranked, otherwise
  the algorithm may assume an ordering between the values
- Improves performance by allowing the model to capture complex relationships
  within the data that may be **missed** if categorical variables are treated
  as **single** entities
- Cons:
  - High dimensionality: makes the model more complex and slower to train
  - Produces sparse data
  - May lead to overfitting, especially if there are too many categories and
    the sample size is small
- Usage:
  - Good for algorithms that look at all features at the same time: neural
    networks, clustering, SVM
  - Used for linear regression, but **keep k-1** binary variables to avoid
    **multicollinearity**:
    - In linear regression, the presence of all k binary variables for a
      categorical feature (where k is the number of categories) introduces
      perfect multicollinearity, because the k-th variable is a linear
      **combination** of the others (e.g., if "Red" and "Blue" are 0, "Green"
      must be 1)
  - Don't use for tree algorithms

##### Ordinal encoding

- Ordinal variable: comprises a finite set of discrete values with a **ranked**
  ordering
- Ordinal encoding replaces each label with an ordered number
- By itself it does not give the variable more predictive power
- Usage:
  - For categorical data with ordinal meaning

##### Count / frequency encoding

- Replaces each label with its count (or frequency) of occurrences
- Cons:
  - Loses unique categories: if two categories have the same frequency, they
    are treated as the same
  - Doesn't handle unseen categories
  - Overfitting, if frequencies are generally low

##### Mean / target encoding

- Replaces each category with the average of the target _values_ for that
  category
- Creates a monotonic relationship between the variable and the target
- Doesn't expand the feature space
- Con: prone to overfitting
- Usage:
  - High-cardinality data (cardinality: the number of elements in a set), by
    leveraging the target variable's statistics to retain predictive power

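A small sketch of the four encodings above on a hypothetical `city` / `size` /
`price` DataFrame; all column names and values are invented for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Oslo", "Oslo", "Bergen", "Tromso", "Bergen", "Oslo"],
    "size": ["S", "M", "L", "M", "S", "L"],       # ordinal: S < M < L
    "price": [10.0, 12.0, 9.0, 14.0, 8.0, 11.0],  # target
})

# One-hot: drop_first=True keeps k-1 dummies to avoid multicollinearity
one_hot = pd.get_dummies(df["city"], prefix="city", drop_first=True)

# Ordinal: map ranked labels to ordered integers
ordinal = df["size"].map({"S": 0, "M": 1, "L": 2}).rename("size_ord")

# Count / frequency: replace each label with how often it occurs
counts = df["city"].map(df["city"].value_counts()).rename("city_count")

# Mean / target: replace each label with the mean target value for that label
target_mean = df["city"].map(df.groupby("city")["price"].mean()).rename("city_te")

print(pd.concat([df, one_hot, ordinal, counts, target_mean], axis=1))
```
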
### Transformation

#### Reasons

- Linear/logistic regression models make assumptions about the relationship
  between the predictors and the outcome
  - Transformation may help create this relationship and avoid poor performance
  - Assumptions:
    - Linear dependency between the predictors and the outcome
    - Multivariate normality (every variable X should follow a Gaussian
      distribution)
    - No or little multicollinearity
    - Homogeneity of variance
  - Example:
    - Assuming y > 0.5 leads to class 1, otherwise class 2
    - ![page 1](./assets/4-analytics-line-regression.webp)
    - ![page 2](./assets/4-analytics-line-regression-2.webp)
- Some other ML algorithms do not make any assumptions, but may still benefit
  from better-distributed data

#### Methods

- Logarithmic transformation: $\log(x + 1)$
  - Useful for **skewed distributions**: it **expands** small values and
    **compresses** big values, which makes the distribution less skewed
  - Numerical values must satisfy $x \gt -1$
- Reciprocal transformation: $1/x$
- Square root: $\sqrt{x}$
  - Similar to the log transform
- Exponential
- Box-Cox transformation: $(x^\lambda - 1) / \lambda$
  - **Prerequisite:** numeric values must be positive, which can be solved by
    shifting
- Quantile transformation: using quantiles
  - Transforms the feature to follow a uniform or normal distribution; tends to
    spread out the most frequent values
  - This is **robust**
  - But it is a **non-linear** transform: it may distort linear correlations,
    though variables at different scales become more comparable

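A minimal sketch of the log, square-root, Box-Cox and quantile transforms on a
synthetic skewed positive feature, assuming NumPy, SciPy and scikit-learn; the
parameter choices are illustrative:

```python
import numpy as np
from scipy import stats
from sklearn.preprocessing import QuantileTransformer

x = np.random.default_rng(2).lognormal(mean=0.0, sigma=1.0, size=1000)

x_log = np.log1p(x)              # log(x + 1), needs x > -1
x_sqrt = np.sqrt(x)              # similar, milder effect
x_boxcox, lam = stats.boxcox(x)  # requires strictly positive values

qt = QuantileTransformer(output_distribution="normal", n_quantiles=100)
x_quantile = qt.fit_transform(x.reshape(-1, 1))  # maps to an approx. normal shape

for name, arr in [("log", x_log), ("sqrt", x_sqrt), ("box-cox", x_boxcox)]:
    print(name, round(float(stats.skew(arr)), 2))  # skewness shrinks toward 0
```
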
### Generation

#### Definition

- Generating new features that are often not the result of a simple feature
  transformation
- Examples:
  - $Age \times NumberDiagnoses$
  - ![statistical feature](./assets/4-analytics-feat-gen-example-1.webp)
  - ![fourier transform](./assets/4-analytics-feat-gen-example-2.webp)

#### Methods

##### Feature Crossing

- Creates new features from existing ones, thus increasing predictive power
- Takes the Cartesian product of existing features
  - $A \times B = \{(a, b) \mid a \in A \ \text{and} \ b \in B\}$
- Useful when the data is not linearly separable
- Deciding which features to cross:
  - Use expertise
  - Automatic exploration tools
  - [Deep learning](#automatic-feature-learning-deep-learning)

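A tiny sketch of a feature cross between two hypothetical categorical columns;
the column names and values are assumptions:

```python
import pandas as pd

df = pd.DataFrame({
    "device": ["mobile", "desktop", "mobile", "desktop"],
    "country": ["NO", "NO", "SE", "SE"],
})

# The crossed feature enumerates the Cartesian product of the two categories,
# letting a linear model learn a separate weight per (device, country) pair.
df["device_x_country"] = df["device"] + "_" + df["country"]
print(df)
```
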
##### Polynomial Expansion

- Useful in modelling, since it can capture non-linear relationships between
  predictor and outcome
- Uses fitted polynomial variables to represent the data:
  - $poly(x, n) = a_0 + a_1 \times x + a_2 \times x^2 + \dots + a_n \times x^n$
- Pros:
  - Fast
  - Good performance compared to binning
  - Doesn't create correlated features
  - Good at handling continuous change
- Cons:
  - Less interpretable
  - Produces many variables
  - Hard to model changes in distribution

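A minimal sketch of polynomial expansion with scikit-learn's
`PolynomialFeatures`; degree 2 and the toy matrix are illustrative choices:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

# Columns: x1, x2, x1^2, x1 x2, x2^2 (the x1*x2 term is also a feature cross)
print(poly.get_feature_names_out(["x1", "x2"]))
print(X_poly)
```
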
##### Feature Learning by Trees

- Each sample ends up in a leaf node
- The decision path to each leaf is a new non-linear feature
- With N leaf nodes, we can create N new binary features
- Pro: fast way to get informative features

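A minimal sketch of deriving binary leaf features from a tree ensemble,
assuming scikit-learn and synthetic data; the model and its sizes are arbitrary
illustrative choices:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

forest = RandomForestClassifier(n_estimators=20, max_depth=3, random_state=0).fit(X, y)

# apply() returns the leaf index each sample lands in, one column per tree;
# one-hot encoding those indices gives one new binary feature per leaf
leaves = forest.apply(X)
leaf_features = OneHotEncoder(handle_unknown="ignore").fit_transform(leaves)
print(leaf_features.shape)  # (500, total number of leaves across all trees)
```
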
##### Automatic Feature learning: Deep learning

- A deep learning model learns the features from the data
- Differences from shallow networks:
  - Deep, in the sense of having multiple hidden layers
  - Introduced stochastic gradient descent
  - Can automate feature extraction
  - Requires larger datasets
- DL can learn a hierarchy of features: character → word → word group → clause
  → sentence
- CNN: uses convolutional layers to apply filters to the input image, detecting
  features such as edges, textures and shapes

## Feature Selection

### Reason

- More features don't necessarily lead to a better model
- Feature selection is useful for:
  - Model simplification: easier interpretation, smaller model, less cost
  - Lower data requirements: less data is required
  - Lower dimensionality
  - Enhanced generalization, less overfitting

### Methods

#### Filter

- Select the best features with the following methods, then evaluate
- Main methods:
  - Variance: remove features whose value is (nearly) constant
  - Correlation: remove features that are highly correlated with each other
- Con: fails to consider the interaction between features and may reduce the
  predictive power of the model

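A small sketch of filter-style selection (variance, then correlation) on
synthetic data; the column names and the 0.9 correlation threshold are
assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(3)
df = pd.DataFrame(rng.normal(size=(200, 5)), columns=[f"f{i}" for i in range(5)])
df["f_const"] = 1.0                                    # zero-variance feature
df["f_dup"] = df["f0"] + 0.01 * rng.normal(size=200)   # highly correlated with f0

# Variance filter: drop features with (near) zero variance
vt = VarianceThreshold(threshold=1e-6).fit(df)
kept = df.columns[vt.get_support()]

# Correlation filter: drop one feature from each pair with |corr| > 0.9
corr = df[kept].corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]

print("dropped by variance filter:", [c for c in df.columns if c not in set(kept)])
print("dropped by correlation filter:", to_drop)
```
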
#### Wrapper

- Searches through the possible feature subsets and evaluates each of them
- Steps of execution (p98), skipped
- Con: computationally expensive

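One common wrapper-style approach is recursive feature elimination, which
repeatedly fits a model and discards the weakest features; a minimal sketch
with scikit-learn on synthetic data, all parameters illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=12, n_informative=4, random_state=0)

# Fit, drop the weakest feature(s), refit, until n_features_to_select remain
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=4)
rfe.fit(X, y)
print(rfe.support_)  # boolean mask of the retained features
print(rfe.ranking_)  # 1 = selected; higher = eliminated earlier
```
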
#### Embedded

- Uses feature selection as part of the ML algorithm itself
- This addresses the drawbacks of both filter and wrapper methods, and has the
  advantages of both:
  - Faster than wrapper methods
  - More accurate than filter methods
- Methods:
  - Regularization: adds a penalty to coefficients, which can drive them to
    zero, so that those features can be removed from the dataset
  - Tree-based methods: output feature importances, which can be used to select
    features

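A minimal sketch of embedded selection via L1 regularization (Lasso): features
whose coefficient is driven to zero are effectively dropped. Synthetic data,
and `alpha` is an illustrative choice:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=300, n_features=10, n_informative=3, noise=5.0,
                       random_state=0)

# The L1 penalty shrinks uninformative coefficients to exactly zero
lasso = Lasso(alpha=1.0).fit(X, y)
selected = np.flatnonzero(lasso.coef_ != 0)
print("kept feature indices:", selected)
```
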
#### Shuffling

#### Hybrid

#### Dimensionality Reduction

- When dimensionality is too high, the data is computationally expensive to
  process; we **project the data** onto a lower-dimensional subspace that
  captures the **essence** of the data
- Reason:
  - Curse of dimensionality: high-dimensional data has a large number of
    features, which can make it difficult to analyze and understand
  - Remove sparse or noisy data, reduce overfitting
  - Create a model with a lower number of variables
- PCA:
  - A form of feature extraction: combines and transforms the dataset's
    original values
  - Projects data onto a new space, defined by a subset of the principal
    components
  - Is an **unsupervised** linear dimensionality reduction technique
  - Preserves signal, filters out noise
  - Uses the **covariance matrix**
  - TODO: is calculation needed
  - Minimize intraclass difference
- LDA:
  - Similar to PCA
  - Different from PCA, because it uses the classification labels in the
    dataset
  - Goal: maximize data variance and maximize class difference in the data
  - Uses the **scatter matrix**
  - Maximizes interclass difference

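A minimal sketch contrasting PCA (unsupervised) and LDA (supervised) on the
iris dataset; projecting to two components is an illustrative choice:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_std = StandardScaler().fit_transform(X)  # scale first (see Scaling above)

X_pca = PCA(n_components=2).fit_transform(X_std)                             # ignores labels
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X_std, y)   # uses labels

print(X_pca.shape, X_lda.shape)  # both (150, 2)
```
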
|