# Data analytics

- [Data analytics](#data-analytics)
  - [Feature engineering](#feature-engineering)
    - [Definition](#definition)
    - [Sources of features](#sources-of-features)
    - [Feature engineering in ML](#feature-engineering-in-ml)
    - [Types of feature engineering](#types-of-feature-engineering)
    - [Good feature:](#good-feature)

## Feature engineering

### Definition

- The process of creating **additional** relevant features from **existing** raw features, to increase the predictive power of **algorithms**
- Alternative definition: transforming raw data into features that **better represent** the underlying problem, so that the accuracy of the predictive model improves
- Important to machine learning

### Sources of features

- Different features are needed for different problems, even in the same domain

### Feature engineering in ML

- Feature engineering is part of the iterative ML process:
  - Baseline model -> Feature engineering -> Model 2 -> Feature engineering -> Final
- Example: data needed to predict house price
  - ML can do that with sufficient features
- Reason for feature engineering: raw data are rarely useful as they are
  - Must be mapped into a feature vector
- Good feature engineering takes up most of the time spent on ML

### Types of feature engineering

- **Indicator** variable to isolate information
- Highlighting **interactions** between features
- Representing the feature in a **different** way

### Good feature:

- Related to objective (important)
  - Example: the number of concrete blocks around a house is not related to house prices
- Known at prediction-time
  - Some data is known **immediately**, and other data is not known in **real time**: a feature can't be fed to a model if it isn't present at prediction time
  - Feature definition shouldn't **change** over time
  - Example: if sales data only becomes available with a 3-day lag, then current sales data can't be used for training (the model has to predict with 3-day-old data)
- Numeric with meaningful magnitude:
  - This does not mean that **categorical** features can't be used in training: they simply need to be **transformed** through a process called one-hot encoding
  - Example: Font category: (Arial, Times New Roman)
- Have enough samples
  - Have at least five examples of any value before using it in your model
  - If feature values are poorly distributed and unbalanced, then the trained model will be biased
- Bring human insight to problem
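
The one-hot encoding and "at least five examples" points above can be sketched in plain Python. This is a minimal illustration, not a definitive implementation: the font data, the `frequent_values` and `one_hot` helpers, and the choice to map rare values to an all-zero vector are assumptions made for the example.

```python
from collections import Counter

MIN_EXAMPLES = 5  # threshold taken from the "at least five examples" rule


def frequent_values(values, min_examples=MIN_EXAMPLES):
    """Return the sorted category values that occur at least min_examples times."""
    counts = Counter(values)
    return sorted(v for v, n in counts.items() if n >= min_examples)


def one_hot(value, vocabulary):
    """Encode value as a 0/1 vector with one slot per vocabulary entry.

    A value outside the vocabulary (for example a rare font) maps to all zeros.
    This handling of rare values is an assumption of the sketch.
    """
    return [1 if value == v else 0 for v in vocabulary]


# Hypothetical font column: two common fonts and one rare one.
fonts = ["Arial"] * 6 + ["Times New Roman"] * 5 + ["Comic Sans"] * 2

vocabulary = frequent_values(fonts)          # rare "Comic Sans" is dropped
encoded = [one_hot(f, vocabulary) for f in fonts]

print(vocabulary)
print(encoded[0], encoded[6], encoded[11])
```

Each font becomes a vector whose entries are 0 or 1, so no artificial ordering is imposed on the categories, and values with too few examples never get a column of their own.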