spent 1 hr on this, not complete

2025-01-07 19:00:11 +08:00 · 2025-01-07 19:00:11 +08:00 · 33d3bee7b0
parent ac75b690d8
commit 33d3bee7b0
1 changed files with 67 additions and 0 deletions
--- a/4-data-analytics.md
+++ b/4-data-analytics.md
@@ -0,0 +1,67 @@
+# Data analytics
+
+<!--toc:start-->
+
+- [Data analytics](#data-analytics)
+    - [Feature engineering](#feature-engineering)
+        - [Definition](#definition)
+        - [Sources of features](#sources-of-features)
+        - [Feature engineering in ML](#feature-engineering-in-ml)
+        - [Types of feature engineering](#types-of-feature-engineering)
+        - [Good feature:](#good-feature) <!--toc:end-->
+
+## Feature engineering
+
+### Definition
+
+- The process of creating **additional** relevant features from **existing**
+  raw features, to increase the predictive power of **algorithms**
+- Alternative definition: transforming raw data into features that **better
+  represent** the underlying problem, so that the accuracy of the predictive
+  model is improved
+- Important to machine learning
+
+### Sources of features
+
+- Different features are needed for different problems, even in the same domain
+
+### Feature engineering in ML
+
+- Process of ML iterations:
+    - Baseline model -> Feature engineering -> Model 2 -> Feature engineering ->
+      Final model
+- Example: the data needed to predict a house price
+    - ML can predict it, given sufficient features
+- Reason for feature engineering: raw data is rarely usable as-is
+    - It must be mapped into a feature vector
+    - Good feature engineering takes up most of the time in an ML project
+
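The step of mapping raw data into a feature vector can be sketched in plain Python. This is only an illustration for the house price example: the field names (`area_sqm`, `built_year`, `has_garage`), the helper `to_feature_vector`, and the reference year are hypothetical and do not come from the notes.

```python
# Hypothetical raw record for the house price example; field names are assumptions.
raw_house = {"area_sqm": "85.5", "built_year": "1998", "has_garage": "yes"}


def to_feature_vector(record, reference_year=2025):
    """Map one raw record (all strings) to a list of numeric features."""
    area = float(record["area_sqm"])                  # numeric, meaningful magnitude
    age = reference_year - int(record["built_year"])  # derived feature: age in years
    garage = 1.0 if record["has_garage"] == "yes" else 0.0  # 0/1 indicator
    return [area, float(age), garage]


features = to_feature_vector(raw_house)
print(features)
```

A model never sees the raw strings, only the numeric vector, which is why this mapping has to be designed before any training can start.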
+### Types of feature engineering
+
+- **Indicator** variable to isolate information
+- Highlighting **interactions** between features
+- Representing the feature in a **different** way
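The three types above can be sketched on a single record. This is a minimal illustration, not a recipe: the row layout, the helper `engineer`, and the 5-year threshold for a "new" house are assumptions made for the example.

```python
from datetime import date


def engineer(row):
    """Return a copy of the row with three engineered features added."""
    out = dict(row)
    # 1. Indicator variable: isolate "new build" as a 0/1 flag
    #    (the 5-year threshold is an assumption).
    out["is_new"] = 1 if row["age_years"] < 5 else 0
    # 2. Interaction: combine two features into one, here as a ratio.
    out["area_per_room"] = row["area_sqm"] / row["rooms"]
    # 3. Different representation: ISO date string -> day of week (Monday = 0).
    out["sold_weekday"] = date.fromisoformat(row["sold_on"]).weekday()
    return out


# Hypothetical record; the column names are illustrative only.
house = {"age_years": 3, "area_sqm": 90.0, "rooms": 4, "sold_on": "2025-01-07"}
print(engineer(house))
```

Each new column restates information that was already in the row, but in a form a model can pick up more easily.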
+
+### Good feature:
+
+- Related to objective (important)
+    - Example: the number of concrete blocks around a house is not related to
+      its price
+- Known at prediction-time
+    - Some data is known **immediately**, while other data is not available in
+      **real time**: a feature can't be fed to a model if it isn't present at
+      prediction time
+    - Feature definition shouldn't **change** over time
+    - Example: if sales data only becomes available with a 3-day lag, then
+      same-day sales data can't be used for training, because the model will
+      have to predict with 3-day-old data
+- Numeric with meaningful magnitude
+    - This does not mean that **categorical** features can't be used in
+      training: they simply need to be **transformed** through a process called
+      one-hot encoding
+    - Example: font category (Arial, Times New Roman)
+- Have enough samples
+    - Have at least five examples of any value before using it in your model
+    - If feature values are poorly represented or unbalanced, the trained
+      model will tend to be biased
+- Bring human insight to the problem
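Two of the points above, one-hot encoding and the five-example rule of thumb, can be sketched in plain Python. The helper names `one_hot` and `rare_values` and the sample font list are hypothetical; a real project would more likely use a library encoder, which this sketch does not try to reproduce.

```python
from collections import Counter


def one_hot(values):
    """One-hot encode a list of categorical values.

    Returns (categories, rows): each row has a 1 in the position of its
    category and 0 elsewhere. Categories are sorted for a stable column order.
    """
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        rows.append(row)
    return categories, rows


def rare_values(values, min_count=5):
    """Return the values seen fewer than min_count times (five, as in the notes)."""
    counts = Counter(values)
    return sorted(v for v, n in counts.items() if n < min_count)


# Hypothetical sample: six "Arial" rows and two "Times New Roman" rows.
fonts = ["Arial"] * 6 + ["Times New Roman"] * 2
categories, encoded = one_hot(fonts)
print(categories)
print(encoded[0], encoded[-1])
print(rare_values(fonts))
```

In this sample "Times New Roman" appears only twice, so `rare_values` flags it as having too few examples to be used as-is.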