Feature Engineering — Complete Guide

Feature engineering transforms raw data into features that a model can learn from more easily. In practice it often affects results more than the choice of algorithm.

Numerical Features

Scaling:

StandardScaler (Z-score):
z = (x - μ) / σ
Mean=0, Std=1
Use for: Distance-based and gradient-based algorithms (SVM, KNN, neural networks); tree-based models generally do not need it
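As a minimal sketch of the z-score formula above, the snippet below standardizes a small NumPy array by hand. The values are made up for illustration. It uses the population standard deviation (ddof=0), which is also what scikit-learn's StandardScaler computes.

```python
import numpy as np

# Hypothetical feature values, for illustration only.
x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

mu = x.mean()        # mean of the feature
sigma = x.std()      # population standard deviation (ddof=0)
z = (x - mu) / sigma # z-score for every value

print(z)
```

After the transform the feature has mean 0 and standard deviation 1. In a real project, compute mu and sigma on the training split only and reuse them on the test split.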

MinMaxScaler:
x_scaled = (x - min) / (max - min)
Range: [0, 1]
Use for: Neural networks, image data
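The min-max formula can be sketched the same way. The helper name min_max_scale and the sample values are assumptions made for this example. The guard for a constant column is also an addition, since the formula would otherwise divide by zero.

```python
import numpy as np

def min_max_scale(x):
    """Rescale a 1-D array to the range [0, 1]."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    if span == 0:
        # Constant feature: every value maps to 0 to avoid dividing by zero.
        return np.zeros_like(x)
    return (x - x.min()) / span

# Hypothetical feature values, for illustration only.
x_scaled = min_max_scale([2.0, 4.0, 6.0, 10.0])
print(x_scaled)
```

Because the minimum and maximum define the range, a single extreme value can squeeze all other values into a narrow band.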

RobustScaler:
Uses median and IQR: x_scaled = (x - median) / IQR
Robust to outliers
Use for: Data with outliers
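To illustrate why median and IQR help with outliers, here is a hand-rolled sketch on made-up values with one extreme point. It mirrors the default behaviour of scikit-learn's RobustScaler (centre on the median, scale by the 25th to 75th percentile range), though it is written directly in NumPy.

```python
import numpy as np

# Hypothetical data with one large outlier (100).
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1

x_robust = (x - median) / iqr
print(x_robust)
```

The four typical values stay evenly spaced around zero, while the outlier remains clearly separated instead of compressing everything else.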

Log Transform:
x_log = log(x + 1), defined for x > -1; the +1 keeps x = 0 finite
Use for: Skewed distributions, power laws
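A short sketch of the log transform, assuming non-negative values spanning several orders of magnitude (the numbers are invented). NumPy's log1p computes the natural log of (x + 1), and expm1 inverts it.

```python
import numpy as np

# Hypothetical right-skewed values, for illustration only.
x = np.array([0.0, 9.0, 99.0, 999.0])

x_log = np.log1p(x)        # natural log of (x + 1)
x_back = np.expm1(x_log)   # inverse transform recovers the original values

print(x_log)
```

The transform compresses large values far more than small ones, which pulls in a long right tail.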

Categorical Features

One-Hot Encoding:
Color: [Red, Blue, Green]
→ Red:  [1, 0, 0]
→ Blue: [0, 1, 0]
→ Green:[0, 0, 1]
Use for: Categories without order
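One possible way to produce the encoding above is pandas get_dummies. The sketch below fixes the category order explicitly so the columns come out as Red, Blue, Green. Without that step pandas would sort them alphabetically. The sample series is made up.

```python
import pandas as pd

# Hypothetical colour column, for illustration only.
colors = pd.Series(["Red", "Blue", "Green", "Red"], name="color")

# Declare the categories in the order we want the columns to appear.
colors = colors.astype(pd.CategoricalDtype(["Red", "Blue", "Green"]))

one_hot = pd.get_dummies(colors, dtype=int)
print(one_hot)
```

Inside a scikit-learn pipeline, OneHotEncoder is usually the better fit because it remembers the categories seen during training.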

Label Encoding:
Size: [Low, Medium, High]
→ Low=0, Medium=1, High=2
Use for: Ordinal categories (Low < Medium < High)
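For ordinal categories, an explicit mapping keeps the integer codes in the intended order. This is a minimal sketch with a hand-written dictionary. The names order and levels are illustrative.

```python
# Explicit order: Low < Medium < High.
order = {"Low": 0, "Medium": 1, "High": 2}

# Hypothetical column values, for illustration only.
levels = ["Medium", "Low", "High", "Low"]

encoded = [order[value] for value in levels]
print(encoded)
```

An explicit mapping matters because scikit-learn's LabelEncoder assigns codes in sorted (alphabetical) order, which would give High=0, Low=1, Medium=2 and lose the ranking.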

Target Encoding:
Replace each category with the mean of the target for that category, computed on training data only to avoid target leakage
Red: mean(target where Red) = 0.7
Use for: High-cardinality categories
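A basic target encoding can be sketched with pandas as follows. The data frames and column names are invented for the example. Category means are learned from the training rows only, and categories that never appeared in training fall back to the global mean, which is one common choice rather than the only one.

```python
import pandas as pd

# Hypothetical training and test data, for illustration only.
train = pd.DataFrame({
    "color": ["Red", "Red", "Blue", "Blue", "Green"],
    "target": [1, 0, 1, 1, 0],
})
test = pd.DataFrame({"color": ["Red", "Blue", "Yellow"]})

# Learn the encoding from the training set only.
global_mean = train["target"].mean()
category_means = train.groupby("color")["target"].mean()

# Apply it to new data; unseen categories get the global mean.
test["color_encoded"] = test["color"].map(category_means).fillna(global_mean)
print(test)
```

For small categories the raw mean is noisy, so practical implementations often add smoothing or compute the means out-of-fold.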

Feature Creation

Date features:
├─ Year, Month, Day, Hour
├─ Day of week, Is weekend
├─ Is holiday, Season
└─ Days since event
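Most of the date features listed above can be derived with the pandas .dt accessor. The sketch below assumes a column named timestamp and a reference date of 2024-01-01, both made up for illustration. Holiday and season flags are left out because they depend on a calendar or region you would have to supply.

```python
import pandas as pd

# Hypothetical timestamps, for illustration only.
df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-01-06 14:30", "2024-07-04 09:00"])
})
ts = df["timestamp"]

df["year"] = ts.dt.year
df["month"] = ts.dt.month
df["day"] = ts.dt.day
df["hour"] = ts.dt.hour
df["day_of_week"] = ts.dt.dayofweek       # Monday=0 ... Sunday=6
df["is_weekend"] = ts.dt.dayofweek >= 5   # Saturday or Sunday

# Whole days elapsed since a reference event (assumed date).
reference = pd.Timestamp("2024-01-01")
df["days_since_reference"] = (ts - reference).dt.days

print(df)
```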

Text features:
├─ Word count, Character count
├─ TF-IDF vectors
├─ Word embeddings
└─ Sentiment scores
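The simpler text features above can be sketched in a few lines. The three documents are invented. Word and character counts use plain Python, and TF-IDF uses scikit-learn's TfidfVectorizer with default settings. Embeddings and sentiment scores are omitted because they need a pretrained model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical documents, for illustration only.
docs = ["the cat sat", "the dog sat down", "cats and dogs"]

word_counts = [len(doc.split()) for doc in docs]  # whitespace-separated tokens
char_counts = [len(doc) for doc in docs]          # characters, including spaces

# One row per document, one column per vocabulary term.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)

print(word_counts, char_counts, tfidf.shape)
```

The TF-IDF result is a sparse matrix, which keeps memory use manageable when the vocabulary is large.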

Interaction features:
├─ x₁ × x₂ (product)
├─ x₁ / x₂ (ratio)
├─ x₁ - x₂ (difference)
└─ x₁², x₂² (polynomial)
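The interaction features above map directly onto column arithmetic in pandas. The column names x1 and x2 follow the notation in the list, and the values are illustrative.

```python
import pandas as pd

# Hypothetical numeric features, for illustration only.
df = pd.DataFrame({"x1": [2.0, 4.0, 6.0], "x2": [1.0, 2.0, 3.0]})

df["x1_times_x2"] = df["x1"] * df["x2"]   # product
df["x1_over_x2"] = df["x1"] / df["x2"]    # ratio (watch for zeros in x2)
df["x1_minus_x2"] = df["x1"] - df["x2"]   # difference
df["x1_squared"] = df["x1"] ** 2          # polynomial terms
df["x2_squared"] = df["x2"] ** 2

print(df)
```

Ratios need care when the denominator can be zero, since pandas will produce inf or NaN rather than raise an error.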

Aggregation features:
├─ Mean, Median, Std per group
├─ Count per category
├─ Rolling statistics
└─ Lag features
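Group statistics, lags, and rolling windows can all be built with pandas groupby. The sketch below assumes a small sales table with store and day columns (invented for this example). The rolling mean is computed on shifted values so that each row only sees earlier days, which is a design choice to avoid leaking the current value into its own feature.

```python
import pandas as pd

# Hypothetical daily sales per store, for illustration only.
df = pd.DataFrame({
    "store": ["A", "A", "A", "B", "B", "B"],
    "day": [1, 2, 3, 1, 2, 3],
    "sales": [10.0, 20.0, 30.0, 5.0, 15.0, 25.0],
}).sort_values(["store", "day"])

grouped = df.groupby("store")["sales"]

df["store_mean"] = grouped.transform("mean")    # mean per group
df["store_count"] = grouped.transform("count")  # rows per group
df["sales_lag1"] = grouped.shift(1)             # previous day's sales

# Mean of the two previous days, excluding the current row.
df["sales_roll2"] = grouped.transform(lambda s: s.shift(1).rolling(2).mean())

print(df)
```

The first rows of each group contain NaN for the lag and rolling columns because no earlier observations exist.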

Feature Selection

Method 1: Filter (statistical tests)
├─ Correlation with target
├─ Chi-squared test
├─ Mutual information
└─ ANOVA F-test
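As one example of a filter method, the sketch below scores each feature with the ANOVA F-test and keeps the top three, using scikit-learn's SelectKBest. The dataset is synthetic (make_classification), and k=3 is an arbitrary choice for the example.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 10 features, 3 of which carry signal.
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=3,
    n_redundant=0, random_state=0,
)

# Score every feature independently and keep the 3 highest-scoring ones.
selector = SelectKBest(score_func=f_classif, k=3)
X_selected = selector.fit_transform(X, y)

selected_idx = selector.get_support(indices=True)
print("Selected feature indices:", selected_idx)
```

Swapping score_func for chi2 or mutual_info_classif (both in sklearn.feature_selection) gives the chi-squared and mutual information variants. Note that chi2 requires non-negative feature values.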

Method 2: Wrapper (model-based)
├─ Forward selection
├─ Backward elimination
├─ Recursive feature elimination (RFE)
└─ Genetic algorithms
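Recursive feature elimination is available in scikit-learn as RFE. The sketch below wraps a logistic regression and drops one feature per round until three remain. The data is synthetic, and the estimator and target count are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 10 features, 3 of which carry signal.
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=3,
    n_redundant=0, random_state=0,
)

# Repeatedly fit the model and remove the weakest feature (step=1).
rfe = RFE(
    estimator=LogisticRegression(max_iter=1000),
    n_features_to_select=3,
    step=1,
)
rfe.fit(X, y)

X_reduced = rfe.transform(X)
print("Kept features:", rfe.support_)
print("Elimination ranking (1 = kept):", rfe.ranking_)
```

Wrapper methods refit the model many times, so they cost more than filter methods but account for how features work together.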

Method 3: Embedded (built into model)
├─ L1 regularization (Lasso)
├─ Feature importance (Tree-based)
└─ Permutation importance (model-agnostic, computed after training)
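L1 regularization performs selection as a side effect of training: coefficients of weak features shrink to exactly zero. The sketch below fits scikit-learn's Lasso on synthetic regression data. The alpha value is an arbitrary choice for illustration and would normally be tuned.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data: 10 features, 3 of which influence the target.
X, y = make_regression(
    n_samples=200, n_features=10, n_informative=3,
    noise=5.0, random_state=0,
)

# A larger alpha means a stronger penalty and more zeroed coefficients.
model = Lasso(alpha=1.0)
model.fit(X, y)

kept = np.flatnonzero(model.coef_)  # indices of features with non-zero weight
print("Features with non-zero coefficients:", kept)
```

The synthetic features here already share a similar scale. With real data, scale the features first, because the L1 penalty treats all coefficients alike.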

Python Implementation

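import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Placeholder data so the example runs end to end; replace X and y with your own.
# The values are random, so the reported accuracy is not meaningful.
rng = np.random.default_rng(42)
n = 200
X = pd.DataFrame({
    'age': rng.integers(18, 80, n),
    'income': rng.normal(50_000, 15_000, n),
    'score': rng.uniform(0, 100, n),
    'gender': rng.choice(['F', 'M'], n),
    'city': rng.choice(['Paris', 'Berlin', 'Madrid'], n),
    'category': rng.choice(['A', 'B', 'C'], n),
})
y = rng.integers(0, 2, n)

# Define preprocessing: scale numeric columns, one-hot encode categorical ones
numerical = ['age', 'income', 'score']
categorical = ['gender', 'city', 'category']

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numerical),
    # handle_unknown='ignore' encodes categories unseen in training as all zeros
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical)
])

# Create pipeline: preprocessing and model are fitted together
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42))
])

# Split first so the scaler and encoder only ever see training data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
pipeline.fit(X_train, y_train)
print(f"Accuracy: {pipeline.score(X_test, y_test):.3f}")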

Key Takeaways

Feature engineering is often more important than model choice
Scale numerical features for distance-based algorithms
One-hot encode unordered categorical variables; use ordinal or target encoding for ordered or high-cardinality ones
Create interaction features to capture relationships
Feature selection removes noise and speeds up training
Use pipelines so preprocessing is fitted on training data only, which prevents data leakage
Domain knowledge guides the best feature engineering
Automated tools such as Featuretools can generate candidate features
