# Data analytics

<!--toc:start-->

- [Data analytics](#data-analytics)
  - [Feature engineering](#feature-engineering)
    - [Definition](#definition)
    - [Sources of features](#sources-of-features)
    - [Feature engineering in ML](#feature-engineering-in-ml)
    - [Types of feature engineering](#types-of-feature-engineering)
    - [Good feature:](#good-feature)

<!--toc:end-->

## Feature engineering

### Definition

- The process of creating **additional** relevant features from **existing**
  raw features to increase the predictive power of learning **algorithms**
- Alternative definition: transforming raw data into features that **better
  represent** the underlying problem, so that the accuracy of the predictive
  model improves
- Important to machine learning: a model can only learn from the features it
  is given

### Sources of features

- Different features are needed for different problems, even in the same domain

### Feature engineering in ML

- ML is an iterative process:
  - Baseline model -> Feature engineering -> Model 2 -> Feature engineering ->
    Final model
- Example: predicting house prices
  - ML can do this given sufficient features
- Reason for feature engineering: raw data is rarely useful as-is
  - It must be mapped into a feature vector
- Good feature engineering takes up most of the time spent on an ML project

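
As a rough illustration of mapping raw data into a feature vector, the sketch below turns one house record into a list of numbers. The record schema (`area_m2`, `bedrooms`, `year_built`, `has_garage`) and the function name `to_feature_vector` are hypothetical, chosen only for this example.

```python
from datetime import date


def to_feature_vector(record, today):
    """Map one raw house record (hypothetical schema) to a numeric feature vector."""
    # Derived feature: the age of the house is usually more useful than the raw year.
    age_years = today.year - record["year_built"]
    return [
        float(record["area_m2"]),
        float(record["bedrooms"]),
        float(age_years),
        1.0 if record["has_garage"] else 0.0,  # boolean mapped to 0/1
    ]


# Made-up record, for illustration only.
raw = {"area_m2": 120, "bedrooms": 3, "year_built": 2005, "has_garage": True}
vector = to_feature_vector(raw, today=date(2025, 1, 1))
print(vector)  # [120.0, 3.0, 20.0, 1.0]
```

Each iteration of feature engineering would typically add, drop, or reshape entries of this vector before the next model is trained.
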
### Types of feature engineering

- **Indicator** variables to isolate information
- Highlighting **interactions** between features
- Representing a feature in a **different** way

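
The three types above can be sketched on a single record. The field names (`sold_on`, `rooms`, `avg_room_area_m2`, `year_built`) and the specific derived features are assumptions made for this example, not part of the notes.

```python
from datetime import datetime


def engineer(row):
    """Return a copy of `row` with three hypothetical engineered features."""
    out = dict(row)

    # Indicator variable: isolate "sold on a weekend" as a 0/1 flag.
    sold = datetime.strptime(row["sold_on"], "%Y-%m-%d")
    out["sold_on_weekend"] = 1 if sold.weekday() >= 5 else 0

    # Interaction: combine two features whose product carries meaning.
    out["total_room_area"] = row["rooms"] * row["avg_room_area_m2"]

    # Different representation: bucket a continuous year into a decade.
    out["decade_built"] = (row["year_built"] // 10) * 10

    return out


# Made-up record, for illustration only.
row = {"sold_on": "2024-06-15", "rooms": 4, "avg_room_area_m2": 18.5, "year_built": 1987}
features = engineer(row)
print(features["sold_on_weekend"], features["total_room_area"], features["decade_built"])
# 1 74.0 1980
```
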
### Good feature:

- Related to the objective (important)
  - Example: the number of concrete blocks around a house is not related to
    its price
- Known at prediction time
  - Some data is known **immediately**, while other data is not available in
    **real time**: a feature can't be fed to a model if it isn't present at
    prediction time
  - Feature definition shouldn't **change** over time
  - Example: if sales data only becomes available with a 3-day lag, then
    current sales data can't be used for training, because at prediction
    time the model has to work with 3-day-old data
- Numeric with meaningful magnitude:
  - This does not mean that **categorical** features can't be used in
    training: they simply need to be **transformed**, for example through
    one-hot encoding
  - Example: font category (Arial, Times New Roman)
- Have enough samples
  - Have at least five examples of any value before using it in your model
  - If feature values are poorly distributed and unbalanced, the trained
    model will be biased
- Bring human insight to the problem
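
The 3-day-lag example in the list above can be sketched as follows: the training feature for a given day is built from the sales figure that would already have been known on that day. The helper name `lagged_sales_feature`, the `daily_sales` mapping, and the numbers in it are hypothetical.

```python
from datetime import date, timedelta

LAG_DAYS = 3  # reporting delay assumed in the sales example


def lagged_sales_feature(daily_sales, target_day):
    """Return the sales figure that would already be known on `target_day`."""
    known_day = target_day - timedelta(days=LAG_DAYS)
    # None signals that no data exists for that day.
    return daily_sales.get(known_day)


# Made-up daily sales for 1 to 10 March 2024.
daily_sales = {date(2024, 3, d): 100 + d for d in range(1, 11)}

feature = lagged_sales_feature(daily_sales, date(2024, 3, 10))
print(feature)  # 107, the value from 7 March
```

Building training rows this way keeps training and prediction consistent, since both see data of the same age.
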
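
A minimal sketch of the last two numeric-feature points, using the font example: one-hot encoding turns a categorical value into 0/1 columns, and a simple count flags values with fewer than five examples. The function names `one_hot` and `rare_values` are hypothetical, and the data is made up; in practice a library encoder would usually be used instead.

```python
from collections import Counter

MIN_EXAMPLES = 5  # threshold from the "at least five examples" rule of thumb


def one_hot(values):
    """One-hot encode categorical values; returns (categories, rows)."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


def rare_values(values, min_examples=MIN_EXAMPLES):
    """List category values that appear fewer than `min_examples` times."""
    counts = Counter(values)
    return sorted(v for v, n in counts.items() if n < min_examples)


# Made-up sample: six Arial rows, two Times New Roman rows.
fonts = ["Arial"] * 6 + ["Times New Roman"] * 2
categories, rows = one_hot(fonts)
print(categories)          # ['Arial', 'Times New Roman']
print(rows[0], rows[-1])   # [1, 0] [0, 1]
print(rare_values(fonts))  # ['Times New Roman']
```
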