diff --git a/4-data-analytics.md b/4-data-analytics.md
index 96c5e0c..e022c6b 100644
--- a/4-data-analytics.md
+++ b/4-data-analytics.md
@@ -12,7 +12,7 @@
     - [Numeric with meaningful magnitude:](#numeric-with-meaningful-magnitude)
     - [Have enough samples](#have-enough-samples)
     - [Bring human insight to problem](#bring-human-insight-to-problem)
-  - [Process of Feature Engineering](#process-of-feature-engineering)
+  - [Methods of Feature Engineering](#methods-of-feature-engineering)
     - [Scaling](#scaling)
       - [Rationale:](#rationale)
       - [Methods:](#methods)
@@ -29,7 +29,16 @@
         - [k means binning](#k-means-binning)
         - [decision trees](#decision-trees)
     - [Encoding](#encoding)
+      - [Definition](#definition)
+      - [Reason](#reason)
+      - [Methods](#methods)
+        - [One hot encoding](#one-hot-encoding)
+        - [Ordinal encoding](#ordinal-encoding)
+        - [Count / frequency encoding](#count-frequency-encoding)
+        - [Mean / target encoding](#mean-target-encoding)
     - [Transformation](#transformation)
+      - [Reasons](#reasons)
+      - [Methods](#methods)
     - [Generation](#generation)
@@ -83,8 +92,8 @@
 ### Numeric with meaningful magnitude:
 
 - It does not mean that **categorical** features can't be used in training:
-  simply, they will need to be **transformed** through a process called one-hot
-  encoding
+  simply, they will need to be **transformed** through a process called
+  [encoding](#encoding)
 - Example: Font category: (Arial, Times New Roman)
 
 ### Have enough samples
@@ -99,7 +108,7 @@
   **curious mind**
 
 - This is an iterative process, need to use **feedback** from production usage
 
-## Process of Feature Engineering
+## Methods of Feature Engineering
 
 ### Scaling
@@ -153,7 +162,7 @@
 #### Reason for binning
 
 - Example: Solar energy modeling
-  - Acelleration calculation, by binning, and reduce the number of simulation
+  - Accelerates the calculation: binning reduces the number of simulations
     needed
 - Improves **performance** by grouping data with **similar attributes** and has
   **similar predictive strength**
@@ -192,6 +201,93 @@
 ### Encoding
 
+#### Definition
+
+- The inverse of binning: creating numerical values from categorical variables
+
+#### Reason
+
+- Machine learning algorithms require **numerical** input data, so
+  **categorical** data must first be converted to **numerical** data
+
+#### Methods
+
+(Each of the four methods below is illustrated in the short code sketch at the
+end of this section.)
+
+##### One hot encoding
+
+- Replaces a categorical (nominal) variable with a set of binary variables,
+  one per category
+- **Eliminates** **ordinality**: nominal categories shouldn't be ranked, and
+  encoding them as a single number would make the algorithm assume an ordering
+  between the categories that does not exist
+- Improves performance by allowing the model to capture complex relationships
+  in the data that may be **missed** if categorical variables are treated as
+  **single** entities
+- Cons
+  - High dimensionality: makes the model more complex and slower to train
+  - Produces sparse data
+  - May lead to overfitting, especially if there are many categories and the
+    sample size is small
+- Usage:
+  - Good for algorithms that look at all features at the same time: neural
+    networks, clustering, SVM
+  - Used for linear regression, but **keep k-1** binary variables to avoid
+    **multicollinearity**:
+    - In linear regression, the presence of all k binary variables for a
+      categorical feature (where k is the number of categories) introduces
+      perfect multicollinearity. This happens because the k-th variable is a
+      linear **combination** of the others (e.g., if "Red" and "Blue" are 0,
+      "Green" must be 1).
+  - Don't use for tree-based algorithms
+
+##### Ordinal encoding
+
+- Ordinal variable: comprises a finite set of discrete values with a **ranked**
+  ordering
+- Ordinal encoding replaces each label with an ordered number
+- Does not by itself give the variable more predictive power; it only
+  preserves the existing ranking
+- Usage:
+  - For categorical data with ordinal meaning
+
+##### Count / frequency encoding
+
+- Replaces each label with its count (or frequency) of occurrences
+- Cons:
+  - Loses unique categories: if two categories have the same frequency, they
+    are treated as the same
+  - Doesn't handle unseen categories
+  - Prone to overfitting, especially when frequencies are generally low
+
+##### Mean / target encoding
+
+- Replaces each category with the average of the _target_ values for that
+  category
+- Creates a monotonic relationship between the variable and the target
+- Does not expand the feature space
+- Con: prone to overfitting
+- Usage:
+  - High-cardinality data (many distinct categories), by leveraging the
+    target variable's statistics to retain predictive power
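+
+A minimal sketch of the four encoders above, using pandas; the toy DataFrame
+and its column names are hypothetical, chosen only for illustration:
+
+```python
+import pandas as pd
+
+# Hypothetical toy data: "color" is nominal, "size" is ordinal
+df = pd.DataFrame({
+    "color": ["red", "blue", "green", "blue", "red", "red"],
+    "size": ["S", "M", "L", "M", "S", "L"],
+    "target": [1, 0, 1, 0, 1, 1],
+})
+
+# One hot encoding: drop_first=True keeps k-1 dummies, avoiding the
+# multicollinearity issue described above for linear models
+df = df.join(pd.get_dummies(df["color"], prefix="color", drop_first=True))
+
+# Ordinal encoding: an explicit rank for an ordered category
+df["size_ord"] = df["size"].map({"S": 0, "M": 1, "L": 2})
+
+# Count / frequency encoding: replace each label with its relative frequency
+df["color_freq"] = df["color"].map(df["color"].value_counts(normalize=True))
+
+# Mean / target encoding: replace each label with the mean target value of
+# its category (fit on training data only, since it overfits easily)
+df["color_mean"] = df["color"].map(df.groupby("color")["target"].mean())
+```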
+
 ### Transformation
 
+#### Reasons
+
+- Linear/logistic regression models make assumptions about the relationship
+  between the predictors and the outcome.
+  - Transformation may help create this relationship and so avoid poor
+    performance.
+  - Assumptions:
+    - Linear dependency between the predictors and the outcome
+    - Multivariate normality (every variable X should follow a Gaussian
+      distribution)
+    - No or little multicollinearity
+    - Homogeneity of variance (homoscedasticity)
+  - Example:
+    - assuming y > 0.5 leads to class 1, otherwise class 2
+    - ![page 1](./assets/4-analytics-line-regression.webp)
+    - ![page 2](./assets/4-analytics-line-regression-2.webp)
+- Some other ML algorithms do not make any assumptions but may still benefit
+  from better-distributed data
+
+#### Methods
+
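+One common transformation (among others such as square root, reciprocal, and
+Box-Cox) is the log transform, which compresses a long right tail. A minimal
+sketch, assuming a hypothetical right-skewed positive feature:
+
+```python
+import numpy as np
+
+rng = np.random.default_rng(0)
+
+# Hypothetical right-skewed positive feature (e.g., income-like data)
+x = rng.lognormal(mean=3.0, sigma=1.0, size=1_000)
+
+# Log transform: log1p (log(1 + x)) stays defined when x contains zeros;
+# it compresses the right tail so the result is closer to Gaussian
+x_log = np.log1p(x)
+
+def skewness(a):
+    """Sample skewness: roughly 0 for a symmetric distribution."""
+    return np.mean((a - a.mean()) ** 3) / a.std() ** 3
+
+print(f"skewness before: {skewness(x):.2f}")      # strongly positive
+print(f"skewness after:  {skewness(x_log):.2f}")  # much closer to 0
+```
+
 ### Generation
 
diff --git a/assets/4-analytics-line-regression-2.webp b/assets/4-analytics-line-regression-2.webp
new file mode 100644
index 0000000..02d72b9
Binary files /dev/null and b/assets/4-analytics-line-regression-2.webp differ
diff --git a/assets/4-analytics-line-regression.webp b/assets/4-analytics-line-regression.webp
new file mode 100644
index 0000000..348c6ad
Binary files /dev/null and b/assets/4-analytics-line-regression.webp differ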