# Data analytics: Feature engineering
<!--toc:start-->
- [Data analytics: Feature engineering](#data-analytics-feature-engineering)
  - [Definition](#definition)
  - [Sources of features](#sources-of-features)
  - [Feature engineering in ML](#feature-engineering-in-ml)
    - [Types of feature engineering](#types-of-feature-engineering)
  - [Good feature:](#good-feature)
    - [Related to objective (important)](#related-to-objective-important)
    - [Known at prediction-time](#known-at-prediction-time)
    - [Numeric with meaningful magnitude:](#numeric-with-meaningful-magnitude)
    - [Have enough samples](#have-enough-samples)
    - [Bring human insight to problem](#bring-human-insight-to-problem)
  - [Process of Feature Engineering](#process-of-feature-engineering)
    - [Scaling](#scaling)
      - [Rationale:](#rationale)
      - [Methods:](#methods)
        - [Normalization or Standardization:](#normalization-or-standardization)
        - [Min-max scaling:](#min-max-scaling)
        - [Robust scaling:](#robust-scaling)
      - [Choosing](#choosing)
    - [Discretization / Binning / Bucketing](#discretization-binning-bucketing)
      - [Definition](#definition)
      - [Reason for binning](#reason-for-binning)
      - [Methods](#methods)
        - [Equal width binning](#equal-width-binning)
        - [Equal frequency binning](#equal-frequency-binning)
        - [k means binning](#k-means-binning)
        - [decision trees](#decision-trees)
    - [Encoding](#encoding)
    - [Transformation](#transformation)
    - [Generation](#generation)
<!--toc:end-->
## Definition
- The process that attempts to create **additional** relevant features from
**existing** raw features, to increase the predictive power of **algorithms**
- Alternative definition: transform raw data into features that **better
  represent** the underlying problem, such that the accuracy of the predictive
  model is improved.
- Important to machine learning
## Sources of features
- Different features are needed for different problems, even in the same domain
## Feature engineering in ML
- Process of ML iterations:
  - Baseline model -> Feature engineering -> Model 2 -> Feature engineering ->
    Final model
- Example: data needed to predict house price
  - ML can do this, given sufficient features
- Reason for feature engineering: raw data is rarely useful as-is
  - It must be mapped into a feature vector
- Good feature engineering takes the most time in an ML project
### Types of feature engineering
- **Indicator** variable to isolate information
- Highlighting **interactions** between features
- Representing the feature in a **different** way
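A minimal pandas sketch of these three types, assuming a hypothetical
house-price table (the column names `area_sqm`, `num_rooms`, and `year_built`
are illustrative, not from the notes):

```python
import pandas as pd

# Hypothetical house-price data; values and column names are illustrative only
df = pd.DataFrame({
    "area_sqm": [60, 120, 95],
    "num_rooms": [2, 4, 3],
    "year_built": [1998, 2015, 2007],
})

# Indicator variable: isolate one piece of information
df["is_new_build"] = (df["year_built"] >= 2010).astype(int)

# Interaction between two existing features
df["area_per_room"] = df["area_sqm"] / df["num_rooms"]

# Representing a feature in a different way
df["decade_built"] = (df["year_built"] // 10) * 10

print(df)
```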
## Good feature:
### Related to objective (important)
- Example: the number of concrete blocks around a house is not related to its
  price
### Known at prediction-time
- Some data is known **immediately**, while other data is not available in
  **real time**: a feature can't be fed to a model if it isn't present at
  prediction time
- Feature definitions shouldn't **change** over time
- Example: if sales data only becomes available with a 3-day lag, then current
  sales figures can't be used for training, because at prediction time the
  model will only have 3-day-old data
### Numeric with meaningful magnitude:
- This does not mean that **categorical** features can't be used in training:
  they simply need to be **transformed** first, through a process called one-hot
  encoding
- Example: Font category: (Arial, Times New Roman)
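A sketch of the one-hot transformation mentioned above, assuming pandas (the
`font` column reuses the font example; the data rows are illustrative):

```python
import pandas as pd

# Hypothetical categorical feature, as in the font example above
df = pd.DataFrame({"font": ["Arial", "Times New Roman", "Arial"]})

# One-hot encoding: each category becomes its own 0/1 indicator column
encoded = pd.get_dummies(df["font"], prefix="font").astype(int)
print(encoded)
```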
### Have enough samples
- Have at least five examples of any value before using it in your model
- If a feature's values are poorly represented or unbalanced, the trained model
  will be biased
### Bring human insight to problem
- There must be a reason for the feature to be useful; this needs
  **subject-matter** expertise and a **curious mind**
- This is an iterative process: use **feedback** from production usage
## Process of Feature Engineering
### Scaling
#### Rationale:
- Leads to a better model; useful when feature scales are very uneven: $X_1 \gg X_2$
#### Methods:
##### Normalization or Standardization:
- $Z = \frac{X - \mu}{\sigma}$
- Re-scales the variable to a standard normal distribution, centered around 0
  with an SD of 1
- Will **compress** the values into a narrow range if the variable is skewed or
  has outliers
  - This may impair the prediction
##### Min-max scaling:
- $X_{scaled} = \frac{X - min}{max - min}$
- Also compresses observations into a narrow range when outliers are present
##### Robust scaling:
- $X_{scaled} = \frac{X - median}{IQR}$
- IQR: Interquartile range
- Better at **preserving** the spread
#### Choosing
- If the data is **not Gaussian-like**, and has a **skewed distribution** or
  outliers: use **robust** scaling, as the other two will compress the data
  into a narrow range, which is not ideal
- For **PCA or LDA** (distance or covariance calculations), it is better to use
  **Normalization or Standardization**, since it removes the effect of the
  numerical scale on variance and covariance
- Min-max scaling is bound to 0-1 and has the same compression drawback as
  standardization, and new data may fall out of bounds (outside the original
  range). It is preferred when the network expects a 0-1 **scale**
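A small sketch comparing the three scalers on a toy skewed column, assuming
scikit-learn's `StandardScaler`, `MinMaxScaler`, and `RobustScaler` (the data
values are illustrative only):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Skewed toy data with one large outlier (illustrative values only)
X = np.array([[1.0], [2.0], [2.5], [3.0], [100.0]])

# Standardization: (X - mean) / std
print(StandardScaler().fit_transform(X).ravel())

# Min-max scaling: (X - min) / (max - min), bounded to [0, 1]
print(MinMaxScaler().fit_transform(X).ravel())

# Robust scaling: (X - median) / IQR, less affected by the outlier
print(RobustScaler().fit_transform(X).ravel())
```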
### Discretization / Binning / Bucketing
#### Definition
- The process of transforming a **continuous** variable into a **discrete** one,
  by creating a set of contiguous intervals that span the range of the
  variable's values
- ![binning diagram](./assets/4-analytics-binning.webp)
#### Reason for binning
- Example: Solar energy modeling
  - Accelerates calculation: binning reduces the number of simulations needed
- Improves **performance** by grouping data with **similar attributes** and
  **similar predictive strength**
- Captures **non-linearity**: bins can represent **non-linear patterns**, thus
  improving the fitting power of the model
- **Interpretability** is enhanced by grouping
- Reduces the impact of **outliers**
- Prevents **overfitting**
- Allows feature **interaction** with **continuous** variables
#### Methods
##### Equal width binning
- Divides the range into bins of the same width
- Con: sensitive to skewed distributions
##### Equal frequency binning
- Divides the range of possible values of the variable into N bins, where each
  bin carries the same **number** of observations
- Con: may disrupt the relationship with the target
##### k means binning
- Uses k-means to partition the values into clusters
- Con: needs hyper-parameter tuning
##### decision trees
- Uses decision trees to decide the best splitting points
  - Observations within a bin end up more similar to each other than to those
    in other bins
- Cons:
  - May cause overfitting
  - Has a chance of failing: bad performance
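A sketch of the first three binning methods using scikit-learn's
`KBinsDiscretizer`, whose `uniform`, `quantile`, and `kmeans` strategies
correspond to equal-width, equal-frequency, and k-means binning (decision-tree
binning would need a separate tree fit against the target); the data values are
illustrative only:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Right-skewed toy variable (illustrative values only)
x = np.random.default_rng(0).exponential(scale=2.0, size=(200, 1))

# equal width / equal frequency / k-means binning
for strategy in ["uniform", "quantile", "kmeans"]:
    binner = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy=strategy)
    bins = binner.fit_transform(x)
    # bin_edges_ shows where each method placed its cut points
    print(strategy, np.round(binner.bin_edges_[0], 2))
```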
### Encoding
### Transformation
### Generation