EBU6504_smart_arch_notes/4-data-analytics.md

# Data analytics

<!--toc:start-->

- [Data analytics](#data-analytics)
    - [Feature engineering](#feature-engineering)
        - [Definition](#definition)
        - [Sources of features](#sources-of-features)
        - [Feature engineering in ML](#feature-engineering-in-ml)
        - [Types of feature engineering](#types-of-feature-engineering)
        - [Good feature:](#good-feature)

<!--toc:end-->

## Feature engineering

### Definition

- The process of creating **additional** relevant features from **existing**
  raw features, to increase the predictive power of **algorithms**
- Alternative definition: transforming raw data into features that **better
  represent** the underlying problem, so that the accuracy of the predictive
  model improves
- A central step in machine learning

### Sources of features

- Different features are needed for different problems, even in the same domain

### Feature engineering in ML

- ML is an iterative process, and feature engineering is repeated at each
  step:
    - Baseline model -> Feature engineering -> Model 2 -> Feature engineering ->
      Final model
- Example: predicting house prices
    - ML can do this, given sufficient features
- Reason for feature engineering: raw data is rarely useful as-is
    - It must be mapped into a feature vector
    - Good feature engineering often takes up most of the time in an ML project
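As a minimal sketch of what "mapped into a feature vector" can look like for the house price example, the snippet below turns one raw record into a fixed-length list of numbers. The `RawHouse` fields, the `to_feature_vector` helper, and the `current_year` default are hypothetical and chosen only for illustration; they are not from the course material.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RawHouse:
    # Hypothetical raw record; the field names are illustrative assumptions.
    area_sqm: float
    bedrooms: int
    year_built: int
    has_garden: bool


def to_feature_vector(house: RawHouse, current_year: int = 2025) -> List[float]:
    """Map one raw record to a fixed-length numeric feature vector."""
    return [
        house.area_sqm,
        float(house.bedrooms),
        float(current_year - house.year_built),  # age in years, derived from a raw field
        1.0 if house.has_garden else 0.0,        # boolean encoded as 0/1
    ]


house = RawHouse(area_sqm=85.0, bedrooms=3, year_built=2005, has_garden=True)
print(to_feature_vector(house))  # [85.0, 3.0, 20.0, 1.0]
```

In the iterative loop above, each "Feature engineering" step would typically mean revising a mapping like this one and retraining the model on the new vectors.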

### Types of feature engineering

- **Indicator** variable to isolate information
- Highlighting **interactions** between features
- Representing the feature in a **different** way

### Good feature:

- Related to objective (important)
    - Example: the number of concrete blocks around a house is not related to
      its price
- Known at prediction-time
    - Some data is known **immediately**, while other data is not available in
      **real time**: a feature can't be fed to the model if it isn't present at
      prediction time
    - The feature definition shouldn't **change** over time
    - Example: if sales data only becomes available with a 3-day lag, then
      current-day sales can't be used as a training feature; the model has to
      be trained on, and predict with, 3-day-old sales data
- Numeric with meaningful magnitude
    - This does not mean that **categorical** features can't be used in
      training: they need to be **transformed** into numeric form first, for
      example through one-hot encoding
    - Example: a font category (Arial, Times New Roman) has no meaningful
      magnitude, so it must be encoded before use
- Has enough samples
    - Have at least five examples of any value before using it in your model
    - If feature values are poorly distributed and unbalanced, the trained
      model will tend to be biased
- Brings human insight to the problem
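Three of the criteria above lend themselves to small checks in code. The sketch below is illustrative rather than a definitive implementation: `one_hot`, `rare_values`, and `lagged` are hypothetical helper names, and the sample data is made up. It shows one-hot encoding of the font category, flagging values with fewer than five examples, and shifting a sales series by 3 days so that training only sees what would be known at prediction time.

```python
from collections import Counter
from typing import List, Optional, Sequence


def one_hot(value: str, categories: Sequence[str]) -> List[float]:
    """Encode a categorical value as a 0/1 vector, one slot per category."""
    return [1.0 if value == category else 0.0 for category in categories]


def rare_values(values: Sequence[str], min_count: int = 5) -> List[str]:
    """Return the values that occur fewer than min_count times."""
    counts = Counter(values)
    return sorted(v for v, n in counts.items() if n < min_count)


def lagged(series: Sequence[float], lag: int) -> List[Optional[float]]:
    """Shift a series by `lag` steps; early positions have no known value."""
    return [series[i - lag] if i >= lag else None for i in range(len(series))]


fonts = ["Arial", "Times New Roman"]
print(one_hot("Arial", fonts))  # [1.0, 0.0]

observed = ["Arial"] * 6 + ["Times New Roman"] * 2
print(rare_values(observed))  # ['Times New Roman']

daily_sales = [10.0, 12.0, 9.0, 14.0, 11.0]
print(lagged(daily_sales, lag=3))  # [None, None, None, 10.0, 12.0]
```

Rows where the lagged value is `None` would normally be dropped before training, since the model would have had no sales figure available on those days.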