2025-01-07 21:20:43 +08:00

# Data analytics: Feature engineering

<!--toc:start-->

- [Data analytics: Feature engineering](#data-analytics-feature-engineering)
  - [Definition](#definition)
  - [Sources of features](#sources-of-features)
  - [Feature engineering in ML](#feature-engineering-in-ml)
    - [Types of feature engineering](#types-of-feature-engineering)
  - [Good feature:](#good-feature)
    - [Related to objective (important)](#related-to-objective-important)
    - [Known at prediction-time](#known-at-prediction-time)
    - [Numeric with meaningful magnitude:](#numeric-with-meaningful-magnitude)
    - [Have enough samples](#have-enough-samples)
    - [Bring human insight to problem](#bring-human-insight-to-problem)
  - [Methods of Feature Engineering](#methods-of-feature-engineering)
    - [Scaling](#scaling)
      - [Rationale:](#rationale)
      - [Methods:](#methods)
        - [Normalization or Standardization:](#normalization-or-standardization)
        - [Min-max scaling:](#min-max-scaling)
        - [Robust scaling:](#robust-scaling)
      - [Choosing](#choosing)
    - [Discretization / Binning / Bucketing](#discretization-binning-bucketing)
      - [Definition](#definition)
      - [Reason for binning](#reason-for-binning)
      - [Methods](#methods)
        - [Equal width binning](#equal-width-binning)
        - [Equal frequency binning](#equal-frequency-binning)
        - [k means binning](#k-means-binning)
        - [decision trees](#decision-trees)
    - [Encoding](#encoding)
      - [Definition](#definition)
      - [Reason](#reason)
      - [Methods](#methods)
        - [One hot encoding](#one-hot-encoding)
        - [Ordinal encoding](#ordinal-encoding)
        - [Count / frequency encoding](#count-frequency-encoding)
        - [Mean / target encoding](#mean-target-encoding)
    - [Transformation](#transformation)
      - [Reasons](#reasons)
      - [Methods](#methods)
    - [Generation](#generation)
      - [Definition](#definition)
      - [Methods](#methods)
        - [Feature Crossing](#feature-crossing)
        - [Polynomial Expansion](#polynomial-expansion)
        - [Feature Learning by Trees](#feature-learning-by-trees)
        - [Automatic Feature learning: Deep learning](#automatic-feature-learning-deep-learning)
  - [Feature Selection](#feature-selection)
    - [Reason](#reason)
    - [Methods](#methods)
      - [Filter](#filter)
      - [Wrapper](#wrapper)
      - [Embedded](#embedded)
      - [Shuffling](#shuffling)
      - [Hybrid](#hybrid)
      - [Dimensionality Reduction](#dimensionality-reduction)

<!--toc:end-->

## Definition

- The process of creating **additional** relevant features from **existing** raw
  features, to increase the predictive power of **algorithms**
- Alternative definition: transforming raw data into features that **better
  represent** the underlying problem, so that the accuracy of the predictive
  model improves
- An important part of machine learning

## Sources of features

- Different problems need different features, even within the same domain

## Feature engineering in ML

- Process of ML iterations:
  - Baseline model -> Feature engineering -> Model 2 -> Feature engineering ->
    Final
- Example: data needed to predict house price
  - ML can do that, given sufficient features
- Reason for feature engineering: raw data is rarely useful as-is
  - It must be mapped into a feature vector
- Good feature engineering takes up most of the time spent on an ML project

### Types of feature engineering

- **Indicator** variables to isolate information
- Highlighting **interactions** between features
- Representing a feature in a **different** way

## Good feature:

### Related to objective (important)

- Example: the number of concrete blocks around a house is not related to its
  price

### Known at prediction-time

- Some data is known **immediately**, while other data is not available in
  **real time**. A feature can't be fed to a model if it isn't present at
  prediction time
- A feature's definition shouldn't **change** over time
- Example: if sales data only becomes available with a 3-day lag, current sales
  data can't be used for training, because at prediction time the model has to
  work with data that is 3 days old

### Numeric with meaningful magnitude:

- This does not mean that **categorical** features can't be used in training:
  they simply need to be **transformed** first, through a process called
  [encoding](#encoding)
- Example: font category (Arial, Times New Roman)

### Have enough samples

- Have at least five examples of any value before using it in your model
- If a feature's values are poorly distributed and unbalanced, the trained
  model will be biased

### Bring human insight to problem

- There must be a reason for a feature to be useful; finding it needs
  **subject matter** expertise and a **curious mind**
- This is an iterative process that needs **feedback** from production usage

## Methods of Feature Engineering

### Scaling

#### Rationale:

- Leads to a better model; useful when features are on very different scales:
  $X_1 \gg X_2$

#### Methods:

##### Normalization or Standardization:

- $Z = \frac{X - \mu}{\sigma}$
- Re-scales the variable so that it is centered around 0 with an SD of 1 (the
  shape of the distribution itself is unchanged)
- If the variable is skewed or has outliers, most values are **compressed**
  into a narrow range
  - This may impair the prediction

##### Min-max scaling:

- $X_{scaled} = \frac{X - min}{max - min}$
- Also compresses the observations when outliers are present

##### Robust scaling:

- $X_{scaled} = \frac{X - median}{IQR}$
  - IQR: interquartile range
- Better at **preserving** the spread

#### Choosing

- If the data is **not Gaussian-like** and has a **skewed distribution** or
  outliers: use **robust** scaling, as the other two will compress the data
  into a narrow range, which is not ideal
- For **PCA or LDA** (distance or covariance calculations), it is better to use
  **Normalization or Standardization**, since it removes the effect of
  numerical scale on variance and covariance
- Min-max scaling is bounded to 0-1 and has the same drawback as normalization;
  new data may also fall out of bounds (outside the original range). It is
  preferred when a network prefers a 0-1 **scale**
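
A rough sketch of the three scaling formulas from this section, using only the Python standard library. The function names and the sample `values` list are made up for illustration, and the quartiles come from `statistics.quantiles` with the inclusive method, which is only one of several quartile conventions.

```python
import statistics


def standardize(xs):
    """Z-score: subtract the mean, divide by the (population) standard deviation."""
    mu = statistics.fmean(xs)
    sigma = statistics.pstdev(xs)
    return [(x - mu) / sigma for x in xs]


def min_max_scale(xs):
    """Map the smallest value to 0 and the largest to 1."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def robust_scale(xs):
    """Center on the median and divide by the interquartile range (IQR)."""
    q1, median, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    return [(x - median) / (q3 - q1) for x in xs]


# Hypothetical data with a single outlier (100.0).
values = [1.0, 2.0, 3.0, 4.0, 100.0]

print(standardize(values))
print(min_max_scale(values))
print(robust_scale(values))  # [-1.0, -0.5, 0.0, 0.5, 48.5]
```

With this sample, min-max scaling squeezes the four ordinary values into roughly the bottom 3% of the 0-1 range, while robust scaling keeps them evenly spread between -1 and 0.5 and leaves the outlier far away.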

### Discretization / Binning / Bucketing

#### Definition

- The process of transforming a **continuous** variable into a **discrete**
  one, by creating a set of contiguous intervals that span the range of the
  variable's values
- ![binning diagram](./assets/4-analytics-binning.webp)

#### Reason for binning

- Example: solar energy modeling
  - Binning accelerates the calculation by reducing the number of simulations
    needed
- Improves **performance** by grouping data that has **similar attributes** and
  **similar predictive strength**
- Improves **non-linearity**: binning can capture **non-linear patterns**, thus
  improving the fitting power of the model
- **Interpretability** is enhanced by grouping
- Reduces the impact of **outliers**
- Prevents **overfitting**
- Allows feature **interaction** with **continuous** variables

#### Methods

##### Equal width binning

- Divides the range of values into bins of the same width
- Con: sensitive to skewed distributions

##### Equal frequency binning

- Divides the range of possible values of the variable into N bins, where each
  bin carries the same **number** of observations
- Con: may disrupt the relationship with the target

##### k means binning

- Uses k-means to partition the values into clusters
- Con: needs hyper-parameter tuning

##### decision trees

- Uses decision trees to decide the best split points
- Observes which bins are more similar to each other than to other bins
- Con:
  - May cause overfitting
  - Has a chance of failing: bad performance
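
The first two binning methods above (equal width and equal frequency) can be sketched in plain Python as follows. The function names and the `values` list are illustrative assumptions; the equal-frequency version assigns bins by rank, so tied values may be split across neighbouring bins.

```python
def equal_width_bins(xs, n_bins):
    """Assign each value to one of n_bins intervals of identical width."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / n_bins
    # The maximum would land in bin n_bins, so clamp it into the last bin.
    return [min(int((x - lo) / width), n_bins - 1) for x in xs]


def equal_frequency_bins(xs, n_bins):
    """Assign bins by rank, so each bin holds roughly the same number of values."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    bins = [0] * len(xs)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(xs)
    return bins


# Hypothetical skewed data: seven small values and one large one.
values = [1, 2, 3, 4, 5, 6, 7, 100]

print(equal_width_bins(values, 4))      # [0, 0, 0, 0, 0, 0, 0, 3]
print(equal_frequency_bins(values, 4))  # [0, 0, 1, 1, 2, 2, 3, 3]
```

The output illustrates the sensitivity of equal width binning to skew: almost everything falls into the first bin, whereas equal frequency binning keeps the bins balanced.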

### Encoding

#### Definition

- The inverse of binning: creating numerical values from categorical variables

#### Reason

- Most machine learning algorithms require **numerical** input data; encoding
  converts **categorical** data to **numerical** data

#### Methods

##### One hot encoding

- Replaces a categorical (nominal) variable with a set of binary variables, one
  per category
- **Eliminates** **ordinality**: nominal categories shouldn't be ranked,
  otherwise the algorithm may think there is an ordering between them
- Improves performance by allowing the model to capture complex relationships
  within the data that may be **missed** if the categorical variable is
  treated as a **single** entity
- Cons
  - High dimensionality: makes the model more complex and slower to train
  - Produces sparse data
  - May lead to overfitting, especially if there are too many categories and
    the sample size is small
- Usage:
  - Good for algorithms that look at all features at the same time: neural
    networks, clustering, SVM
  - Used for linear regression, but **keep k-1** binary variables to avoid
    **multicollinearity**:
    - In linear regression, the presence of all k binary variables for a
      categorical feature (where k is the number of categories) introduces
      perfect multicollinearity. This happens because the k-th variable is a
      linear **combination** of the others (e.g., if "Red" and "Blue" are 0,
      "Green" must be 1).
  - Don't use for tree algorithms
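
A minimal sketch of one hot encoding, including the "keep k-1" variant mentioned for linear regression. The `one_hot` helper, its `drop_first` flag and the `colors` list are hypothetical names chosen for this example; categories are ordered alphabetically purely for reproducibility.

```python
def one_hot(values, drop_first=False):
    """Return (column names, rows of 0/1 indicators) for a nominal variable."""
    categories = sorted(set(values))
    if drop_first:
        # Keep k-1 columns: the dropped category is implied by all zeros.
        categories = categories[1:]
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


colors = ["Red", "Green", "Blue", "Green"]

print(one_hot(colors))
# (['Blue', 'Green', 'Red'], [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]])

print(one_hot(colors, drop_first=True))
# (['Green', 'Red'], [[0, 1], [1, 0], [0, 0], [1, 0]])
```

In the k-1 output, "Blue" is represented by the all-zero row, so no column is a linear combination of the others.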

##### Ordinal encoding

- Ordinal variable: comprises a finite set of discrete values with a **ranked**
  ordering
- Ordinal encoding replaces each label with its position in that order
- Does not add information that would give the variable more predictive power
- Usage:
  - For categorical data with ordinal meaning
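
As a small illustration of ordinal encoding, assuming a made-up `size_order` ranking that the analyst supplies by hand (the order cannot be inferred from the labels themselves):

```python
# Hypothetical ranked categories, listed from lowest to highest.
size_order = ["small", "medium", "large"]

# Map each label to its position in the ranking.
rank = {label: position for position, label in enumerate(size_order)}

sizes = ["medium", "small", "large", "medium"]
encoded = [rank[s] for s in sizes]

print(encoded)  # [1, 0, 2, 1]
```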

##### Count / frequency encoding

- Replaces each label with the number of times it occurs in the data
- Cons:
  - Loses unique categories: if two categories have the same frequency, they
    are treated as the same
  - Doesn't handle unseen categories
  - Overfitting, if frequencies are low in general
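
A short sketch of count and frequency encoding with `collections.Counter`. The `count_encode` helper and the `cities` data are invented for the example.

```python
from collections import Counter


def count_encode(values, normalize=False):
    """Replace each label with its count (or relative frequency if normalize=True)."""
    counts = Counter(values)
    total = len(values)
    return [counts[v] / total if normalize else counts[v] for v in values]


cities = ["A", "B", "A", "C", "B", "A", "D"]

print(count_encode(cities))  # [3, 2, 3, 1, 2, 3, 1]
```

Categories "C" and "D" both occur once, so they receive the same code and can no longer be told apart, which is the first drawback listed above.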

##### Mean / target encoding

- Replaces each category with the average of the _target_ values over the rows
  that belong to that category
- Creates a monotonic relationship between the variable and the target
- Doesn't expand the feature space
- Con: prone to overfitting
- Usage:
  - High cardinality data (cardinality: the number of elements in a set, here
    the number of distinct categories), by leveraging the target variable's
    statistics to retain predictive power
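
A bare-bones sketch of mean / target encoding, without the smoothing or cross-validation schemes that are commonly used to limit overfitting. The `target_encode` function and the `cities` / `bought` columns are hypothetical.

```python
from collections import defaultdict


def target_encode(categories, targets):
    """Replace each category with the mean target value observed for it."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, y in zip(categories, targets):
        sums[category] += y
        counts[category] += 1
    means = {category: sums[category] / counts[category] for category in sums}
    return [means[category] for category in categories], means


cities = ["A", "A", "B", "B", "B", "C"]
bought = [1, 0, 1, 1, 0, 1]  # made-up binary target

encoded, means = target_encode(cities, bought)
print(means)
```

Category "C" has a single row, so its encoded value is exactly that row's target. This is the overfitting risk noted above: in practice the means would be computed on training data only.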

### Transformation

#### Reasons

- Linear/logistic regression models make assumptions about the relationship
  between the predictors and the outcome.
  - Transformation may help create this relationship and avoid poor
    performance.
  - Assumptions:
    - Linear dependency between the predictors and the outcome.
    - Multivariate normality (every variable X should follow a Gaussian
      distribution)
    - No or little multicollinearity
    - Homogeneity of variance
  - Example:
    - Assuming y > 0.5 leads to class 1, otherwise class 2
    - ![page 1](./assets/4-analytics-line-regression.webp)
    - ![page 2](./assets/4-analytics-line-regression-2.webp)
- Some other ML algorithms do not make any such assumptions, but may still
  benefit from better-distributed data

#### Methods

- Logarithmic transformation: $\log(x + 1)$
  - Useful when applied to **skewed distributions**: it **expands** small
    values and **compresses** big values, which helps make the distribution
    less skewed
  - Numerical values must satisfy $x \gt -1$
- Reciprocal transformation: $1/x$
- Square root: $\sqrt{x}$
  - Similar to the log transform
- Exponential
- Box-Cox transformation: $(x^\lambda - 1) / \lambda$ for $\lambda \neq 0$, and
  $\log(x)$ for $\lambda = 0$
  - **Prerequisite:** numeric values must be positive; this can be solved by
    shifting
- Quantile transformation: uses quantiles
  - Transforms the feature to follow a uniform or normal distribution. Tends to
    spread out the most frequent values.
  - This is **robust** to outliers
  - But it is a **non-linear** transform: it may distort linear correlations,
    although variables at different scales become more comparable
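
The log and Box-Cox transformations listed above can be sketched with the standard `math` module. The function names are illustrative, and `lam` is supplied by hand here; choosing it from the data (for example by maximum likelihood) is outside this sketch.

```python
import math


def log_transform(xs):
    """log(x + 1), valid for x > -1; log1p stays accurate for small x."""
    return [math.log1p(x) for x in xs]


def box_cox(xs, lam):
    """Box-Cox transform for strictly positive values."""
    if lam == 0:
        return [math.log(x) for x in xs]
    return [(x ** lam - 1) / lam for x in xs]


skewed = [0, 9, 99, 999]  # made-up right-skewed values

print(log_transform(skewed))   # gaps between values shrink to about log(10) each
print(box_cox([1.0, 4.0], 0.5))  # [0.0, 2.0]
```

In the log-transformed output the spacing between consecutive values becomes nearly constant, which is the "compress big values" effect described above.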

### Generation

#### Definition

- Generating new features that are often not the result of feature
  transformation
- Examples:
  - $Age \times NumberDiagnoses$
  - ![statistical feature](./assets/4-analytics-feat-gen-example-1.webp)
  - ![fourier transform](./assets/4-analytics-feat-gen-example-2.webp)

#### Methods

##### Feature Crossing

- Creates new features from existing ones, thus increasing predictive power
- Takes the Cartesian product of existing features
  - $A \times B = \{(a, b) \mid a \in A \text{ and } b \in B\}$
- Useful when the data is not linearly separable
- Deciding which features to cross:
  - Use expertise
  - Automatic exploration tools
  - [Deep learning](#automatic-feature-learning-deep-learning)
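
One way to picture a feature cross of two categorical columns is shown below. The column names (`day`, `part`) and the `_x_` separator are assumptions made for this sketch; the crossed column would normally be encoded afterwards, for example with one hot encoding.

```python
from itertools import product


def cross_features(a_values, b_values):
    """Combine two categorical columns row by row into one crossed column."""
    return [f"{a}_x_{b}" for a, b in zip(a_values, b_values)]


def crossed_vocabulary(a_values, b_values):
    """All possible crossed categories: the Cartesian product A x B."""
    a_unique = sorted(set(a_values))
    b_unique = sorted(set(b_values))
    return [f"{a}_x_{b}" for a, b in product(a_unique, b_unique)]


day = ["Mon", "Sat", "Sat"]
part = ["morning", "morning", "evening"]

print(cross_features(day, part))
# ['Mon_x_morning', 'Sat_x_morning', 'Sat_x_evening']

print(crossed_vocabulary(day, part))
# ['Mon_x_evening', 'Mon_x_morning', 'Sat_x_evening', 'Sat_x_morning']
```

The vocabulary grows as the product of the individual cardinalities, which is why crossing high-cardinality features quickly becomes expensive.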

##### Polynomial Expansion

- Useful in modelling, since it can capture non-linear relationships between
  predictor and outcome
- Uses fitted polynomial terms to represent the data:
  - $poly(x, n) = a_0 + a_1 \times x + a_2 \times x^2 + \dots + a_n \times x^n$
- Pros:
  - Fast
  - Good performance, compared to binning
  - Doesn't create correlated features
  - Good at handling continuous change
- Cons:
  - Less interpretable
  - Lots of variables produced
  - Hard to model changes in distribution
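
A small sketch of polynomial expansion for a single predictor. `polynomial_features` produces the raw power terms that a linear model could then fit, and `poly` evaluates the formula above for given coefficients; both names and the coefficient values are made up. These are raw powers, not the orthogonal polynomials that some libraries generate.

```python
def polynomial_features(x, degree):
    """Expand a single value x into [x, x^2, ..., x^degree]."""
    return [x ** k for k in range(1, degree + 1)]


def poly(x, coefficients):
    """Evaluate a_0 + a_1*x + ... + a_n*x^n for coefficients [a_0, ..., a_n]."""
    return sum(a * x ** k for k, a in enumerate(coefficients))


print(polynomial_features(2, 3))  # [2, 4, 8]
print(poly(2, [1, 0, 3]))         # 1 + 0*2 + 3*4 = 13
```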

##### Feature Learning by Trees

- Each sample ends up in one leaf node
- The decision path to each leaf is a new non-linear feature
- We can create N new binary features (with N leaf nodes)
- Pro: fast way to get informative features
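
To make the leaf-to-feature idea concrete, the sketch below uses a hand-written stand-in for a fitted decision tree. The split variables (`age`, `income`), the thresholds and the three leaves are all hypothetical; in practice the tree would be learned from data and the leaf index read from the fitted model.

```python
N_LEAVES = 3


def leaf_index(sample):
    """Hypothetical fitted tree: route a sample to one of three leaves."""
    if sample["age"] < 30:
        return 0
    if sample["income"] < 50_000:
        return 1
    return 2


def leaf_features(sample):
    """One binary feature per leaf: 1 for the leaf the sample lands in."""
    index = leaf_index(sample)
    return [1 if i == index else 0 for i in range(N_LEAVES)]


print(leaf_features({"age": 25, "income": 80_000}))  # [1, 0, 0]
print(leaf_features({"age": 40, "income": 30_000}))  # [0, 1, 0]
print(leaf_features({"age": 40, "income": 90_000}))  # [0, 0, 1]
```

Each binary feature encodes a conjunction of conditions along one decision path (for example "age >= 30 and income < 50,000"), which a linear model could not express from the raw columns alone.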

##### Automatic Feature learning: Deep learning

- A deep learning model learns the features from the data
- Differences from shallow networks
  - Deep, in the sense of having multiple hidden layers
  - Typically trained with stochastic gradient descent
  - Can automate feature extraction
  - Requires larger datasets
- DL can learn a hierarchy of features: Character → word → word group → clause
  → sentence
- CNN: uses convolutional layers to apply filters to the input image, to detect
  various features such as edges, textures and shapes

## Feature Selection

### Reason

- More features don't necessarily lead to a better model
- Feature selection is useful for
  - Model simplification: easier interpretation, smaller model, lower cost
  - Lower data requirements: less data is required
  - Lower dimensionality
  - Enhanced generalization, less overfitting

### Methods

#### Filter

- Selects the best features via the following methods, then evaluates them
- Main methods
  - Variance: remove features that have the same value for (almost) every
    sample
  - Correlation: remove features that are highly correlated with each other
- Con: fails to consider the interaction between features, and may reduce the
  predictive power of the model
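
A minimal sketch of the two filter criteria above, assuming the data is held as a dictionary of columns. The `filter_features` helper, the 0.95 correlation threshold and the sample columns are illustrative choices, and only exactly constant columns are treated as zero-variance here.

```python
import math
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long columns."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def filter_features(columns, corr_threshold=0.95):
    """Drop constant columns, then columns highly correlated with a kept one."""
    kept = {}
    for name, values in columns.items():
        if statistics.pvariance(values) == 0:
            continue  # same value everywhere: carries no information
        if any(abs(pearson(values, other)) > corr_threshold for other in kept.values()):
            continue  # near-duplicate of a feature that is already kept
        kept[name] = values
    return list(kept)


columns = {
    "const": [1, 1, 1, 1],
    "height_cm": [150, 160, 170, 180],
    "height_in": [59.1, 63.0, 66.9, 70.9],
    "weight": [70, 55, 80, 60],
}

print(filter_features(columns))  # ['height_cm', 'weight']
```

Note that the result depends on column order: of two highly correlated features, the one seen first is kept.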

#### Wrapper

- Searches through the possible feature subsets and evaluates each of them
- Steps of execution (p98), skipped
- Con: computationally expensive

#### Embedded

- Performs feature selection as a part of the ML algorithm
- This addresses the drawbacks of both the filter and wrapper methods, and has
  the advantages of both
  - Faster than wrapper
  - More accurate than filter
- Methods:
  - Regularization: adds a penalty to the coefficients, which can shrink some
    of them to zero; the corresponding features can then be removed from the
    dataset
  - Tree based methods: output feature importances, which can be used to
    select features.

#### Shuffling

#### Hybrid

#### Dimensionality Reduction

- When dimensionality is too high, the data is computationally expensive to
  process. We **project the data** onto a lower-dimensional subspace that
  captures the **essence** of the data
- Reason
  - Curse of dimensionality: high-dimensional data has a large number of
    features or dimensions, which can make it difficult to analyze and
    understand
  - Remove sparse or noisy data, reduce overfitting
  - To create a model with a lower number of variables
- PCA:
  - A form of feature extraction: combines and transforms the dataset's
    original values
  - Projects the data onto a new space, defined by a subset of principal
    components
  - Is an **unsupervised** linear dimensionality reduction technique
  - Preserves signal, filters out noise
  - Uses the **covariance matrix**
    - TODO: is calculation needed
- LDA:
  - Similar to PCA
  - Differs from PCA because it retains the classification labels in the
    dataset
  - Goal: maximize the difference between classes in the data
  - Uses **scatter matrices**
  - Minimizes intraclass difference
  - Maximizes interclass difference
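
As a rough illustration of the PCA bullets above, the sketch below centers the data, builds the covariance matrix and finds the first principal component with power iteration. This is a teaching sketch under simplifying assumptions (pure Python, one component, a fixed number of iterations); the function name and the `rows` data are made up, and real projects would normally rely on a linear algebra library.

```python
import math


def first_principal_component(rows, iterations=100):
    """Return (unit direction of largest variance, projection of each row onto it)."""
    n, d = len(rows), len(rows[0])

    # Center each column on its mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]

    # Sample covariance matrix (d x d).
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]

    # Power iteration: repeatedly multiply by the covariance matrix and normalize.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]

    # Project the centered data onto the component: d features become one.
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores


# Hypothetical 2-D data lying close to the line y = 2x.
rows = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.0)]

component, scores = first_principal_component(rows)
print(component)  # direction roughly proportional to (1, 2)
print(scores)     # one number per row instead of two
```

Because the two columns are almost perfectly related, a single component retains nearly all of the variance, which is the sense in which PCA captures the "essence" of the data.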