EBU6504_smart_arch_notes/4-data-analytics.md
2025-01-07 19:00:11 +08:00

Data analytics

Feature engineering

Definition

  • The process of creating additional relevant features from existing raw features in order to increase the predictive power of algorithms
  • Alternative definition: transforming raw data into features that better represent the underlying problem, so that the accuracy of the predictive model improves
  • Important to machine learning

Sources of features

  • Different features are needed for different problems, even in the same domain

Feature engineering in ML

  • Process of ML iterations:
    • Baseline model -> Feature engineering -> Model 2 -> Feature engineering -> Final
  • Example: data needed to predict house price
    • ML can do that, given sufficient features
  • Reason for feature engineering: raw data are rarely useful as-is
    • Must be mapped into a feature vector
    • Good feature engineering takes up most of the time in an ML project
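
To make "mapped into a feature vector" concrete, here is a minimal sketch for the house price example. The raw listing, its field names, and the fixed reference date are all hypothetical and chosen only for illustration; real data would need more careful parsing and validation.

```python
from datetime import date

# Hypothetical raw listing, as it might arrive from a form or a scrape.
raw_listing = {
    "area": "120 m2",
    "bedrooms": "3",
    "built": "1998-06-01",
    "has_garage": "yes",
}


def to_feature_vector(raw, today=date(2025, 1, 1)):
    """Map one raw listing to a list of floats that a model can consume."""
    # "120 m2" -> 120.0 (assumes the number comes first, then the unit)
    area_m2 = float(raw["area"].split()[0])
    bedrooms = float(raw["bedrooms"])
    # A construction date is hard to use directly; the age in years is numeric.
    built_year = date.fromisoformat(raw["built"]).year
    age_years = float(today.year - built_year)
    # Yes/no text becomes a 0/1 value.
    has_garage = 1.0 if raw["has_garage"].lower() == "yes" else 0.0
    return [area_m2, bedrooms, age_years, has_garage]


vector = to_feature_vector(raw_listing)
print(vector)
```

Each position in the resulting list has a fixed meaning (area, bedrooms, age, garage), which is what lets a model treat every listing in the same way.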

Types of feature engineering

  • Indicator variable to isolate information
  • Highlighting interactions between features
  • Representing the feature in a different way
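
The three types above can be sketched on a single house record. The field names, the "new build" threshold of 2 years, and the choice of a ratio as the interaction and a logarithm as the alternative representation are assumptions made for illustration, not prescriptions from the notes.

```python
import math


def engineer(house):
    """Return a copy of `house` with three illustrative derived features."""
    features = dict(house)

    # 1. Indicator variable: isolate one piece of information as 0/1.
    #    The 2-year threshold is an arbitrary, hypothetical choice.
    features["is_new_build"] = 1 if house["age_years"] <= 2 else 0

    # 2. Interaction between features: combine two raw features into one.
    #    max(..., 1) avoids division by zero for studio flats.
    features["area_per_bedroom"] = house["area_m2"] / max(house["bedrooms"], 1)

    # 3. Different representation of the same feature: log scale
    #    compresses large values.
    features["log_area"] = math.log(house["area_m2"])

    return features


example = engineer({"area_m2": 120.0, "bedrooms": 3, "age_years": 1})
print(example)
```

Whether any of these derived features actually helps depends on the problem, which is why the process alternates between modelling and feature engineering.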

Good feature:

  • Related to objective (important)
    • Example: the number of concrete blocks around a house is not related to its price
  • Known at prediction-time
    • Some data is known immediately, while other data is not available in real time: a feature can't be fed to a model if it isn't present at prediction time
    • Feature definition shouldn't change over time
    • Example: if sales data only becomes available with a 3-day lag, then current sales data can't be used for training, because the model will have to predict with 3-day-old data
  • Numeric with meaningful magnitude:
    • This does not mean that categorical features can't be used in training; they simply need to be transformed into numeric form first, through a process called one-hot encoding
    • Example: a font category (Arial, Times New Roman) has no meaningful magnitude, so it must be encoded
  • Have enough samples
    • Have at least five examples of any value before using it in your model
    • If feature values are poorly distributed and unbalanced, the trained model will be biased
  • Bring human insight to problem
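
As a rough sketch of two of the points above, the following one-hot encodes the font category and applies the "at least five examples" rule. The sample data, the "Comic Sans" value, and the decision to fold rare values into a single "other" column are assumptions for illustration; the notes only say that a value should have enough examples before it is used.

```python
from collections import Counter

MIN_EXAMPLES = 5  # threshold taken from the rule of thumb in the notes


def one_hot_encode(values, min_examples=MIN_EXAMPLES):
    """One-hot encode a categorical column, grouping rare values as 'other'."""
    counts = Counter(values)
    # Keep only categories seen often enough; sort for a stable column order.
    kept = sorted(v for v, n in counts.items() if n >= min_examples)
    columns = kept + ["other"]
    rows = []
    for v in values:
        key = v if v in kept else "other"
        # Exactly one column is 1 for each row.
        rows.append([1 if key == c else 0 for c in columns])
    return columns, rows


# Hypothetical sample: two common fonts and one rare one.
fonts = ["Arial"] * 6 + ["Times New Roman"] * 5 + ["Comic Sans"] * 2
columns, rows = one_hot_encode(fonts)
print(columns)
print(rows[0], rows[6], rows[11])
```

Each category becomes its own 0/1 column, so the model never sees an artificial ordering such as Arial = 1, Times New Roman = 2.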