
Data analytics: Feature engineering

Definition

  • The process that attempts to create additional relevant features from existing raw features, to increase the predictive power of algorithms
  • Alternative definition: transform raw data into features that better represent the underlying problem, so that the accuracy of the predictive model is improved.
  • Important to machine learning

Sources of features

  • Different features are needed for different problems, even in the same domain

Feature engineering in ML

  • Process of ML iterations:
    • Baseline model -> Feature engineering -> Model 2 -> Feature engineering -> Final
  • Example: data needed to predict house price
    • ML can do this given sufficient features
  • Reason for feature engineering: raw data are rarely useful
    • They must be mapped into a feature vector
    • Good feature engineering takes most of the time in an ML project

Types of feature engineering

  • Indicator variable to isolate information
  • Highlighting interactions between features
  • Representing the feature in a different way

Good feature: should be related to the objective

  • Example: the number of concrete blocks around a house is not related to house prices, so it would be a poor feature

Known at prediction-time

  • Some data is known immediately, while other data is not available in real time: a feature can't be fed to a model if it isn't available at prediction time
  • Feature definition shouldn't change over time
  • Example: if sales data arrives with a 3-day lag, then current sales figures can't be used for training, because at prediction time the model will only have data that is 3 days old

Numeric with meaningful magnitude:

  • This does not mean that categorical features can't be used in training; they simply need to be transformed through a process called encoding
  • Example: Font category: (Arial, Times New Roman)

Have enough samples

  • Have at least five examples of any value before using it in your model
  • If feature values are poorly represented or unbalanced, the trained model will be biased

Bring human insight to problem

  • There must be a reason for the feature to be useful; this needs subject-matter expertise and a curious mind
  • This is an iterative process, need to use feedback from production usage

Methods of Feature Engineering

Scaling

Rationale:

  • Leads to a better model; useful when features have very different scales, e.g. X_1 \gg X_2

Methods:

Normalization or Standardization:
  • Z = \frac{X - \mu}{\sigma}
  • Re-scales the variable to a standard normal distribution centered around 0 with an SD of 1
  • Will compress the values into a narrow range if the variable is skewed or has outliers
    • This may impair the prediction
Min-max scaling:
  • X_{scaled} = \frac{X - min}{max - min}
  • Also compresses observations into a narrow range
Robust scaling:
  • X_{scaled} = \frac{X - median}{IQR}
  • IQR: interquartile range
  • Better at preserving the spread of the variable

Choosing

  • If the data is not Gaussian-like, and has a skewed distribution or outliers: use robust scaling, as the other two will compress the data into a narrow range, which is not ideal
  • For PCA or LDA (distance or covariance calculations), it is better to use normalization/standardization, since it removes the effect of numerical scale on variance and covariance
  • Min-max scaling is bounded to 0-1, has the same compression drawback as standardization, and new data may fall outside the original range. It is preferred when the network expects a 0-1 scale
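A minimal sketch of the three scalers, assuming scikit-learn; the toy values are illustrative:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Toy feature with one outlier (illustrative values only)
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Standardization: (X - mean) / std
print(StandardScaler().fit_transform(X).ravel())

# Min-max scaling: (X - min) / (max - min), bounded to [0, 1]
print(MinMaxScaler().fit_transform(X).ravel())

# Robust scaling: (X - median) / IQR, less affected by the outlier
print(RobustScaler().fit_transform(X).ravel())
```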

Discretization / Binning / Bucketing

Definition

  • The process of transforming continuous variables into discrete ones, by creating a set of contiguous intervals that span the range of the variable's values
  • binning diagram

Reason for binning

  • Example: Solar energy modeling
    • Calculations are accelerated by binning, which reduces the number of simulations needed
  • Improves performance by grouping data with similar attributes and similar predictive strength
  • Handles non-linearity: bins can capture non-linear patterns, improving the fitting power of the model
  • Interpretability is enhanced by grouping
  • Reduces the impact of outliers
  • Prevents overfitting
  • Allows feature interaction with continuous variables

Methods

Equal width binning
  • Divides the range into bins of the same width
  • Con: sensitive to skewed distributions
Equal frequency binning
  • Divides the range of possible values of the variable into N bins, where each bin carries the same number of observations
  • Con: may disrupt the relationship with the target
K-means binning
  • Uses k-means clustering to partition the values into clusters
  • Con: needs hyper-parameter tuning
Decision tree binning
  • Uses a decision tree to find the best splitting points
  • Observations within a bin are more similar to each other than to those in other bins
  • Cons:
    • may cause overfitting
    • may fail to find good splits, resulting in poor performance
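A minimal sketch of the first three strategies, assuming pandas and scikit-learn; the ages are illustrative:

```python
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

ages = pd.Series([18, 22, 25, 31, 40, 45, 58, 63, 70, 85], name="age")

# Equal-width binning: bins of identical width
equal_width = pd.cut(ages, bins=4)

# Equal-frequency binning: each bin holds roughly the same number of observations
equal_freq = pd.qcut(ages, q=4)

# K-means binning: bin edges chosen by 1-D k-means clustering
kmeans = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy="kmeans")
kmeans_labels = kmeans.fit_transform(ages.to_frame())

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
print(kmeans_labels.ravel())
```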

Encoding

Definition

  • The inverse of binning: creating numerical values from categorical variables

Reason

  • Machine learning algorithms require numerical input data, and this converts categorical data to numerical data

Methods

One hot encoding
  • Replaces a (nominal) categorical variable with a set of binary variables
  • Eliminates ordinality: nominal categories shouldn't be ranked, otherwise the algorithm may assume an ordering between the categories
  • Improves performance by allowing the model to capture complex relationships within the data that may be missed if categorical variables are treated as single entities
  • Cons
    • High dimensionality: makes the model more complex and slower to train
    • Produces sparse data
    • May lead to overfitting, especially if there are too many categories and the sample size is small
  • Usage:
    • Good for algorithms that look at all features at the same time: neural networks, clustering, SVMs
    • Can be used for linear regression, but keep only k-1 binary variables to avoid multicollinearity (see the sketch after this list):
      • In linear regression, the presence of all k binary variables for a categorical feature (where k is the number of categories) introduces perfect multicollinearity. This happens because the k-th variable is a linear combination of the others (e.g., if "Red" and "Blue" are 0, "Green" must be 1).
    • Don't use for tree algorithms
Ordinal encoding
  • Ordinal variable: comprises a finite set of discrete values with a ranked ordering
  • Ordinal encoding replaces the labels with ordered numbers
  • By itself this does not give the variable more predictive power
  • Usage:
    • For categorical data with ordinal meaning
Count / frequency encoding
  • Replaces each label with its count (or frequency) of occurrences
  • Cons:
    • Loses unique categories: if two categories have the same frequency, they are treated as the same
    • Doesn't handle unseen categories
    • May overfit when the counts are generally low
Mean / target encoding
  • Replaces each category with the average of the target values for that category
  • Creates a monotonic relationship between the variable and the target
  • Does not expand the feature space
  • Con: prone to overfitting
  • Usage:
    • High-cardinality data (cardinality: the number of distinct categories), leveraging the target variable's statistics to retain predictive power
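A minimal sketch of these encoders, assuming pandas; the colour/size columns and price target are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "colour": ["red", "blue", "green", "blue", "red", "red"],
    "size":   ["S", "M", "L", "M", "S", "L"],       # ordinal: S < M < L
    "price":  [10.0, 12.0, 20.0, 11.0, 9.0, 22.0],  # target variable
})

# One-hot encoding, keeping k-1 columns to avoid multicollinearity
one_hot = pd.get_dummies(df["colour"], prefix="colour", drop_first=True)

# Ordinal encoding: map ranked labels to ordered integers
ordinal = df["size"].map({"S": 0, "M": 1, "L": 2})

# Count / frequency encoding: replace each label with its number of occurrences
counts = df["colour"].map(df["colour"].value_counts())

# Mean / target encoding: replace each category with the mean target value
target_enc = df["colour"].map(df.groupby("colour")["price"].mean())

print(pd.concat([df, one_hot, ordinal.rename("size_ord"),
                 counts.rename("colour_count"), target_enc.rename("colour_te")], axis=1))
```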

Transformation

Reasons

  • Linear/logistic regression models make assumptions about the relationship between the predictors and the outcome.
    • Transformation may help create this relationship and avoid poor performance.
    • Assumptions:
      • Linear dependency between the predictors and the outcome.
      • Multivariate normality (every variable X should follow a Gaussian distribution)
      • No or little multicollinearity
      • homogeneity of variance
    • Example:
      • assuming y > 0.5 leads to class 1, otherwise class 2
  • Some other ML algorithms do not make any assumptions, but may still benefit from better-distributed data

Methods

  • Logarithmic transformation: \log(x + 1)
    • Useful for skewed distributions: it expands small values and compresses large values, making the distribution less skewed
    • Numerical values must satisfy x > -1
  • Reciprocal transformation: 1/x
  • Square root: \sqrt{x}
    • Similar to the log transform
  • Exponential
  • Box-Cox transformation: (x^\lambda - 1) / \lambda
    • Prerequisite: values must be positive, which can be achieved by shifting
  • Quantile transformation: using quantiles
    • Transforms the feature to follow a uniform or normal distribution. Tends to spread out the most frequent values.
    • This is robust to outliers
    • But it is a non-linear transform and may distort linear correlations; however, variables at different scales become more comparable
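A minimal sketch of a few of these transforms, assuming NumPy, SciPy and scikit-learn; the skewed sample data is illustrative:

```python
import numpy as np
from scipy import stats
from sklearn.preprocessing import QuantileTransformer

# Right-skewed toy data (illustrative)
x = np.random.default_rng(0).lognormal(mean=0.0, sigma=1.0, size=1000)

log_x = np.log1p(x)                 # log(x + 1), requires x > -1
sqrt_x = np.sqrt(x)                 # similar effect to the log transform
boxcox_x, lam = stats.boxcox(x)     # Box-Cox, requires strictly positive x

# Quantile transformation to an approximately normal distribution
qt = QuantileTransformer(output_distribution="normal", n_quantiles=100)
quantile_x = qt.fit_transform(x.reshape(-1, 1))

# Skewness before and after: the transforms pull in the long right tail
print(stats.skew(x), stats.skew(log_x), stats.skew(boxcox_x))
```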

Generation

Definition

  • Generating new features that are often not simple transformations of existing ones
  • Examples:
    • Age \times NumberDiagnoses
    • statistical feature
    • fourier transform

Methods

Feature Crossing
  • Creates new features from existing ones, thus increasing predictive power
  • Takes the Cartesian product of existing features
    • A \times B = \{(a, b) : a \in A \text{ and } b \in B\}
  • Useful when the data is not linearly separable
  • Deciding which features to cross requires domain insight
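A minimal sketch of a feature cross on two categorical columns, assuming pandas; the column names and values are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "city":     ["London", "London", "Paris", "Paris"],
    "property": ["flat", "house", "flat", "house"],
})

# Feature cross: one combined category per (city, property) pair,
# i.e. an element of the Cartesian product of the two value sets
df["city_x_property"] = df["city"] + "_" + df["property"]

# The crossed feature can then be one-hot encoded as usual
print(pd.get_dummies(df["city_x_property"]))
```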
Polynomial Expansion
  • Useful in modelling, since it can model non-linear relationships between predictor and outcome
  • Use fitted polynomial variables to represent the data:
    • poly(x, n) = a_0 + a_1 x + a_2 x^2 + \dots + a_n x^n
  • Pros:
    • Fast
    • Good performance, compared to binning
    • Doesn't create correlated features
    • Good at handling continuous change
  • Cons:
    • Less interpretable
    • Lots of variables produced
    • Hard to model changes in distribution
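A minimal sketch of polynomial expansion, assuming scikit-learn; the degree and data are illustrative:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# Expand to degree-2 polynomial terms: 1, x1, x2, x1^2, x1*x2, x2^2
poly = PolynomialFeatures(degree=2, include_bias=True)
X_poly = poly.fit_transform(X)

print(poly.get_feature_names_out())  # names of the generated terms
print(X_poly)
```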
Feature Learning by Trees
  • Each sample ends up in a leaf node
  • The decision path to that leaf is a new non-linear feature
  • With N leaf nodes, we can create N new binary features
  • Pro: Fast to get informative feature
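A minimal sketch of tree-based feature learning, assuming scikit-learn: each sample is mapped to the leaf it reaches in every tree, and the leaf indices are one-hot encoded into new binary features (the dataset is synthetic and illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Fit a small forest; each tree partitions the samples into leaf nodes
forest = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=0)
forest.fit(X, y)

# apply() returns, for each sample, the index of the leaf it reaches in every tree
leaf_indices = forest.apply(X)              # shape: (n_samples, n_trees)

# One binary feature per leaf node: these are the new non-linear features
leaf_features = OneHotEncoder(sparse_output=False).fit_transform(leaf_indices)
print(leaf_features.shape)
```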
Automatic Feature learning: Deep learning
  • Deep learning model learns the features from data
  • Differences from shallow networks:
    • Deep, in the sense of having multiple hidden layers
    • Introduced stochastic gradient descent
  • Can automate feature extraction
  • Require larger datasets
  • DL can learn a hierarchy of features: character → word → word group → clause → sentence
  • CNNs use convolutional layers to apply filters to the input image, detecting features such as edges, textures and shapes

Feature Selection

Reason

  • More features do not necessarily lead to a better model
  • Feature selection is useful for
    • Model simplification: easier interpretation, smaller model, lower cost
    • Lower data requirements: less data needs to be collected
    • Reduced dimensionality
    • Enhanced generalization, less overfitting

Methods

Filter

  • Select the best features via the following criteria, then evaluate
  • Main methods
    • Variance: remove features whose value is (nearly) constant
    • Correlation: remove features that are highly correlated with each other
  • Con: fails to consider interactions between features and may reduce the predictive power of the model

Wrapper

  • Uses a search over all possible feature subsets and evaluates each of them
  • Steps of execution (p98), skipped
  • Con: Computationally expensive

Embedded

  • Performs feature selection as part of the ML algorithm
  • This addresses the drawbacks of both the filter and wrapper methods and has the advantages of both
  • Faster than wrapper methods
  • More accurate than filter methods
  • Methods:
    • Regularization: adds a penalty to the coefficients, which can shrink some of them to zero, so the corresponding features can be removed from the dataset
    • Tree-based methods: output feature importances, which can be used to select features
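A minimal sketch of the three families, assuming scikit-learn; the synthetic dataset and thresholds are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import VarianceThreshold, RFE, SelectFromModel
from sklearn.linear_model import LogisticRegression, Lasso

X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=0)

# Filter: drop (near-)constant features by variance
X_filtered = VarianceThreshold(threshold=0.0).fit_transform(X)

# Wrapper: recursive feature elimination searches over feature subsets
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X, y)

# Embedded: L1 regularization drives some coefficients to exactly zero
embedded = SelectFromModel(Lasso(alpha=0.05)).fit(X, y)

print(rfe.support_)            # mask of features kept by the wrapper
print(embedded.get_support())  # mask of features kept by the embedded method
```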

Shuffling

Hybrid

Dimensionality Reduction

  • When the dimensionality is too high, the data is computationally expensive to process. We project the data onto a lower-dimensional subspace that captures the essence of the data
  • Reason
    • Curse of dimensionality: high dimensionality data have large number of features or dimensions, which can make it difficult to analyze and understand
    • Remove sparse or noisy data, reduce overfitting
    • To create a model with lower number of variables
  • PCA:
    • A form of feature extraction: combines and transforms the dataset's original variables
    • Projects the data onto a new space defined by a subset of the principal components
    • Is an unsupervised linear dimensionality reduction technique
    • Preserves signal, filter out noise
    • Use covariance matrix
    • TODO: is calculation needed
    • Minimize intraclass difference
  • LDA:
    • Similar to PCA
    • Differs from PCA because it uses the classification labels in the dataset
    • Goal: maximize the variance of the data and maximize the separation between classes
    • Use scatter matrix
    • Maximizes interclass difference
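A minimal sketch of both techniques on the Iris dataset, assuming scikit-learn; the component counts are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)   # standardize before PCA/LDA

# PCA: unsupervised, projects onto the directions of maximum variance
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
print(pca.explained_variance_ratio_)

# LDA: supervised, uses the class labels to maximize class separation
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)
print(X_pca.shape, X_lda.shape)
```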