diff --git a/4-data-analytics.md b/4-data-analytics.md
new file mode 100644
index 0000000..7bcd2b5
--- /dev/null
+++ b/4-data-analytics.md
@@ -0,0 +1,67 @@
+# Data analytics
+
+
+
+- [Data analytics](#data-analytics)
+  - [Feature engineering](#feature-engineering)
+    - [Definition](#definition)
+    - [Sources of features](#sources-of-features)
+    - [Feature engineering in ML](#feature-engineering-in-ml)
+    - [Types of feature engineering](#types-of-feature-engineering)
+    - [Good feature:](#good-feature)
+
+## Feature engineering
+
+### Definition
+
+- The process of creating **additional** relevant features from **existing** raw
+  features, to increase the predictive power of **algorithms**
+- Alternative definition: transform raw data into features that **better
+  represent** the underlying problem, so that the accuracy of the predictive
+  model improves
+- An important step in machine learning
+
+### Sources of features
+
+- Different features are needed for different problems, even in the same domain
+
+### Feature engineering in ML
+
+- ML is an iterative process:
+  - Baseline model -> Feature engineering -> Model 2 -> Feature engineering ->
+    Final model
+- Example: the data needed to predict house prices
+  - ML can do that given sufficient features
+- Reason for feature engineering: raw data is rarely useful as-is
+  - It must be mapped into a feature vector
+  - Good feature engineering takes up most of the time spent on ML
+
+### Types of feature engineering
+
+- **Indicator** variables to isolate information
+- Highlighting **interactions** between features
+- Representing a feature in a **different** way
+
+### Good feature:
+
+- Related to the objective (important)
+  - Example: the number of concrete blocks around a house is not related to its
+    price
+- Known at prediction time
+  - Some data is known **immediately**, while other data is not known in **real
+    time**: a feature can't be fed to a model if it isn't present at prediction
+    time
+  - The feature definition shouldn't **change** over time
+  - Example: if sales data only becomes available with a 3-day lag, then current
+    sales data can't be used for training, because the model will have to
+    predict with 3-day-old data
+- Numeric with meaningful magnitude
+  - This does not mean that **categorical** features can't be used in training:
+    they first need to be **transformed** through a process called one-hot
+    encoding
+  - Example: font category (Arial, Times New Roman)
+- Has enough samples
+  - Have at least five examples of any value before using it in your model
+  - If feature values are poorly distributed and unbalanced, the trained model
+    will be biased
+- Brings human insight to the problem
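
The one-hot encoding mentioned under "Numeric with meaningful magnitude" can be sketched in plain Python as below. This is a minimal illustration, not part of the notes themselves: the function name `one_hot_encode`, the `min_count` parameter, and the extra font value "Comic Sans" are hypothetical. Encoding values with fewer than five examples as all zeros is one assumed way to apply the "at least five examples" rule; the notes do not prescribe how rare values should be handled.

```python
from collections import Counter


def one_hot_encode(values, min_count=5):
    """One-hot encode a list of categorical values.

    Categories seen fewer than `min_count` times get no column of their
    own, so rows holding them encode as all zeros (an assumption of this
    sketch, not a rule taken from the notes).
    """
    counts = Counter(values)
    # Sort the kept categories so the column order is deterministic.
    categories = sorted(c for c, n in counts.items() if n >= min_count)
    index = {c: i for i, c in enumerate(categories)}

    rows = []
    for value in values:
        row = [0] * len(categories)
        if value in index:
            row[index[value]] = 1
        rows.append(row)
    return categories, rows


# Hypothetical sample: two well-represented fonts and one rare font.
fonts = ["Arial"] * 6 + ["Times New Roman"] * 5 + ["Comic Sans"] * 2
categories, encoded = one_hot_encode(fonts)
```

Each kept category becomes its own 0/1 column, so the model never sees an arbitrary numeric ordering between fonts; libraries such as pandas and scikit-learn provide their own encoders for real projects.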