# Data analytics: Feature engineering
<!--toc:start-->
- [Data analytics: Feature engineering](#data-analytics-feature-engineering)
  - [Definition](#definition)
  - [Sources of features](#sources-of-features)
  - [Feature engineering in ML](#feature-engineering-in-ml)
    - [Types of feature engineering](#types-of-feature-engineering)
  - [Good feature:](#good-feature)
    - [Related to objective (important)](#related-to-objective-important)
    - [Known at prediction-time](#known-at-prediction-time)
    - [Numeric with meaningful magnitude:](#numeric-with-meaningful-magnitude)
    - [Have enough samples](#have-enough-samples)
    - [Bring human insight to problem](#bring-human-insight-to-problem)
  - [Process of Feature Engineering](#process-of-feature-engineering)
    - [Scaling](#scaling)
      - [Rationale:](#rationale)
      - [Methods:](#methods)
        - [Normalization or Standardization:](#normalization-or-standardization)
        - [Min-max scaling:](#min-max-scaling)
        - [Robust scaling:](#robust-scaling)
      - [Choosing](#choosing)
    - [Discretization / Binning / Bucketing](#discretization-binning-bucketing)
      - [Definition](#definition)
      - [Reason for binning](#reason-for-binning)
      - [Methods](#methods)
        - [Equal width binning](#equal-width-binning)
        - [Equal frequency binning](#equal-frequency-binning)
        - [K-means binning](#k-means-binning)
        - [Decision trees](#decision-trees)
    - [Encoding](#encoding)
    - [Transformation](#transformation)
    - [Generation](#generation)
<!--toc:end-->
## Definition
- The process that attempts to create **additional** relevant features from
**existing** raw features, to increase the predictive power of **algorithms**
- Alternative definition: transform raw data into features that **better
  represent** the underlying problem, such that the accuracy of the predictive
  model is improved
- Important to machine learning
## Sources of features
- Different features are needed for different problems, even in the same domain
## Feature engineering in ML
- Process of ML iterations:
- Baseline model -> Feature engineering -> Model 2 -> Feature engineering ->
Final
- Example: the data needed to predict house prices
  - ML can do this, given sufficient features
- Reason for feature engineering: raw data is rarely useful as-is
  - It must be mapped into a feature vector first
- Good feature engineering takes the most time in an ML project
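
As a minimal sketch of this mapping (the record fields and the city vocabulary
are hypothetical), turning a raw record into a feature vector might look like:

```python
# Hypothetical raw record for a house-price model
raw = {"rooms": 3, "area_sqm": 95.0, "city": "London"}

# Numeric fields pass through; the categorical "city" is expanded
# against a fixed vocabulary (one-hot encoding, covered below)
CITIES = ["London", "Paris", "Berlin"]
feature_vector = [raw["rooms"], raw["area_sqm"]] + [
    1.0 if raw["city"] == c else 0.0 for c in CITIES
]
print(feature_vector)  # [3, 95.0, 1.0, 0.0, 0.0]
```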
### Types of feature engineering
- **Indicator** variable to isolate information
- Highlighting **interactions** between features
- Representing the feature in a **different** way
## Good feature:
### Related to objective (important)
- Example: the number of concrete blocks around a house is not related to its
  price
### Known at prediction-time
- Some data is known **immediately**, while other data is not available in
  **real time**: a feature can't be fed to a model if it isn't present at
  prediction time
- Feature definitions shouldn't **change** over time
- Example: if sales data only becomes available with a 3-day lag, then current
  sales data can't be used for training, since at prediction time the model
  will only have 3-day-old data to work with
### Numeric with meaningful magnitude:
- This does not mean that **categorical** features can't be used in training:
  they simply need to be **transformed** through a process called one-hot
  encoding, as sketched below
- Example: font category (Arial, Times New Roman)
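
A minimal sketch of one-hot encoding, assuming pandas is available:

```python
import pandas as pd

df = pd.DataFrame({"font": ["Arial", "Times New Roman", "Arial"]})
# Each category becomes its own 0/1 indicator column
print(pd.get_dummies(df, columns=["font"]))
```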
### Have enough samples
- Have at least five examples of any value before using it in your model
- If feature values are poorly assorted and unbalanced, the trained model will
  be biased
### Bring human insight to problem
- There must be a reason for the feature to be useful; finding it needs
  **subject-matter** expertise and a **curious mind**
- This is an iterative process: use **feedback** from production usage
## Process of Feature Engineering
### Scaling
#### Rationale:
- Leads to a better model; especially useful when feature scales are very
  uneven: $X_1 \gg X_2$
#### Methods:
##### Normalization or Standardization:
- $Z = \frac{X - \mu}{\sigma}$
- Re-scales the variable to a standard normal distribution, centered around 0
  with an SD of 1
- Will **compress** the values into a narrow range if the variable is skewed or
  has outliers
  - This may impair the prediction
##### Min-max scaling:
- $X_{scaled} = \frac{X - X_{min}}{X_{max} - X_{min}}$
- Also compresses observations into a narrow range when outliers are present
##### Robust scaling:
- $X_{scaled} = \frac{X - median}{IQR}$
- IQR: Interquartile range
- Better at **preserving** the spread
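
A minimal sketch of all three scalers via scikit-learn, run on a synthetic
skewed variable (the data here is illustrative only):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

rng = np.random.default_rng(0)
x = rng.lognormal(sigma=1.5, size=(1000, 1))  # skewed, outlier-heavy

for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    z = scaler.fit_transform(x)
    # Robust scaling preserves more of the spread around the median
    print(f"{type(scaler).__name__:>14}: min={z.min():.2f} max={z.max():.2f}")
```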
#### Choosing
- If the data is **not Gaussian-like**, with a **skewed distribution** or
  outliers: use **robust** scaling, as the other two methods compress the data
  into a narrow range, which is not ideal
- For **PCA or LDA** (distance or covariance calculations), prefer
  **normalization or standardization**, since it removes the effect of
  numerical scale on variance and covariance
- Min-max scaling is bound to the 0-1 range, shares the compression drawback of
  standardization, and new data may fall out of bounds (outside the original
  range); it is preferred when the network expects a 0-1 **scale**
### Discretization / Binning / Bucketing
#### Definition
- The process of transforming a **continuous** variable into a **discrete** one
  by creating a set of contiguous intervals that spans the range of the
  variable's values
- ![binning diagram](./assets/4-analytics-binning.webp)
#### Reason for binning
- Example: solar energy modeling
  - Binning accelerates the calculation by reducing the number of simulations
    needed
- Improves **performance** by grouping data with **similar attributes** and
  **similar predictive strength**
- Adds **non-linearity**: the model can capture **non-linear patterns**, which
  improves its fitting power
- **Interpretability** is enhanced by grouping
- Reduces the impact of **outliers**
- Prevents **overfitting**
- Allows feature **interaction** with **continuous** variables
#### Methods
##### Equal width binning
- Divides the range of the variable into N bins of the same width
- Con: sensitive to skewed distributions
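
A minimal sketch with pandas: `pd.cut` with an integer bin count performs
equal-width binning.

```python
import numpy as np
import pandas as pd

values = pd.Series(np.random.default_rng(0).exponential(size=1000))
equal_width = pd.cut(values, bins=5)  # 5 intervals of identical width
# With a skewed distribution, most observations pile into the first bin
print(equal_width.value_counts().sort_index())
```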
##### Equal frequency binning
- Divides the range of possible values of the variable into N bins, where each
  bin carries the same **number** of observations
- Con: may disrupt the relationship with the target
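
The pandas counterpart is `pd.qcut`, which splits on quantiles instead of
widths (same synthetic data as in the sketch above):

```python
import numpy as np
import pandas as pd

values = pd.Series(np.random.default_rng(0).exponential(size=1000))
equal_freq = pd.qcut(values, q=5)  # quantile edges: ~200 observations per bin
print(equal_freq.value_counts().sort_index())
```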
##### K-means binning
- Uses k-means clustering to partition the values into clusters
- Con: needs hyper-parameter tuning (e.g. the number of bins)
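
One way to do this is scikit-learn's `KBinsDiscretizer` with
`strategy="kmeans"` (a sketch; `n_bins` is the hyper-parameter to tune):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

x = np.random.default_rng(0).exponential(size=(1000, 1))
kbins = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="kmeans")
binned = kbins.fit_transform(x)  # bin index per observation
print(kbins.bin_edges_[0])       # edges derived from the cluster centers
```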
##### Decision trees
- Uses a decision tree to decide the best splitting points
- The tree groups observations so that values within a bin are more similar to
  each other, with respect to the target, than to values in other bins
- Cons:
  - may cause overfitting
  - has a chance of failing, i.e. performing badly
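
A sketch of the idea: fit a shallow tree of the single feature against the
target, then reuse its split thresholds as bin edges (capping the depth is one
way to limit the overfitting risk noted above).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(500, 1))
y = np.sin(x).ravel() + rng.normal(0, 0.1, size=500)

# max_depth=2 yields at most 3 split points, i.e. at most 4 bins
tree = DecisionTreeRegressor(max_depth=2).fit(x, y)
# Internal nodes store real thresholds; leaf nodes are marked with -2
edges = np.sort(tree.tree_.threshold[tree.tree_.threshold != -2])
bins = np.digitize(x.ravel(), edges)  # bin index per observation
print(edges)
```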
### Encoding
### Transformation
### Generation