Feature Engineering — Complete Guide
Feature engineering transforms raw data into features that improve model performance. It is often the most impactful step in a machine learning project.
Numerical Features
Scaling:
StandardScaler (Z-score):
z = (x - μ) / σ
Mean=0, Std=1
Use for: Distance- and gradient-based algorithms (SVM, KNN, Neural Networks); tree-based models generally do not need scaling
MinMaxScaler:
x_scaled = (x - min) / (max - min)
Range: [0, 1] (sensitive to outliers, because min and max set the scale)
Use for: Neural networks, image data
RobustScaler:
x_scaled = (x - median) / IQR
Uses median and interquartile range (IQR) instead of mean and standard deviation
Robust to outliers
Use for: Data with outliers
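To make the difference between the three scalers concrete, here is a minimal sketch that applies each one to the same column. The data is made up for illustration (one deliberate outlier), and the scikit-learn defaults are assumed throughout.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Hypothetical single feature with one outlier (100.0)
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

standard = StandardScaler().fit_transform(x)  # (x - mean) / std
minmax = MinMaxScaler().fit_transform(x)      # (x - min) / (max - min)
robust = RobustScaler().fit_transform(x)      # (x - median) / IQR

print("standard:", standard.ravel())
print("minmax:  ", minmax.ravel())
print("robust:  ", robust.ravel())
```

With this data the outlier squeezes the four ordinary values into a narrow band near 0 under MinMaxScaler, while RobustScaler keeps them evenly spaced around the median.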
Log Transform:
x_log = log(x + 1)
Requires x > -1 (typically non-negative counts or amounts); the +1 keeps x = 0 defined
Use for: Skewed distributions, power laws
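A short sketch of the log transform, assuming the natural logarithm (the formula above does not state a base) and made-up right-skewed values. NumPy's `log1p` computes log(x + 1) directly and `expm1` inverts it.

```python
import numpy as np

# Hypothetical right-skewed, non-negative values (e.g. purchase amounts)
x = np.array([0.0, 9.0, 99.0, 999.0, 9999.0])

x_log = np.log1p(x)       # natural log of (x + 1); accurate for small x
x_back = np.expm1(x_log)  # inverse transform recovers the original scale

print(x_log)
```

The inverse is useful when the transform is applied to a regression target and predictions need to be reported in the original units.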
Categorical Features
One-Hot Encoding:
Color: [Red, Blue, Green]
→ Red: [1, 0, 0]
→ Blue: [0, 1, 0]
→ Green:[0, 0, 1]
Use for: Categories without order
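The mapping above can be reproduced with scikit-learn's `OneHotEncoder`. This is a sketch on a made-up color column; the explicit `categories` order is an assumption chosen so the columns match the listing (by default the encoder sorts categories alphabetically).

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical column of unordered categories; the encoder expects a 2D input
colors = np.array([["Red"], ["Blue"], ["Green"], ["Blue"]])

# Fix the column order to Red, Blue, Green to match the listing above
encoder = OneHotEncoder(categories=[["Red", "Blue", "Green"]])
one_hot = encoder.fit_transform(colors).toarray()  # densify the sparse result

print(one_hot)
```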
Label Encoding:
Size: [Low, Medium, High]
→ Low=0, Medium=1, High=2
Use for: Ordinal categories (Low < Medium < High); on unordered categories such as colors, the integers imply an order that does not exist
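For the ordinal case (Low < Medium < High), one way to control the integer assignment is scikit-learn's `OrdinalEncoder` with an explicit category order. This is a sketch on made-up values; swapping in `OrdinalEncoder` is a choice made here because scikit-learn's `LabelEncoder` is intended for target labels and assigns codes alphabetically.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal feature; the encoder expects a 2D input
levels = np.array([["Low"], ["High"], ["Medium"], ["Low"]])

# Spell out the order so that Low=0, Medium=1, High=2
encoder = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
encoded = encoder.fit_transform(levels)

print(encoded.ravel())
```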
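Target Encoding:
Replace each category with the mean of the target for that category
Red: mean(target where Red) = 0.7
Compute the means on training data only (ideally out-of-fold) to avoid target leakage
Use for: High-cardinality categories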
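A minimal sketch of target encoding with pandas, under a few assumptions: the data is made up, the helper `fit_target_encoding` and its `smoothing` parameter are hypothetical names introduced for illustration, and categories unseen in training fall back to the global target mean. The means are learned from the training frame only and then applied to new data.

```python
import pandas as pd

# Hypothetical training data and new data to encode
train = pd.DataFrame({
    "color": ["Red", "Red", "Red", "Blue", "Blue", "Green"],
    "target": [1, 1, 0, 0, 1, 0],
})
test = pd.DataFrame({"color": ["Red", "Green", "Purple"]})


def fit_target_encoding(df, col, target, smoothing=0.0):
    """Return (category -> encoded value, global target mean)."""
    global_mean = df[target].mean()
    stats = df.groupby(col)[target].agg(["mean", "count"])
    # Shrink each category mean toward the global mean;
    # smoothing=0 gives the plain per-category mean.
    encoded = (stats["count"] * stats["mean"] + smoothing * global_mean) / (
        stats["count"] + smoothing
    )
    return encoded.to_dict(), global_mean


mapping, global_mean = fit_target_encoding(train, "color", "target")

# Unseen categories ("Purple") fall back to the global mean
test["color_te"] = test["color"].map(mapping).fillna(global_mean)
print(test)
```

A positive `smoothing` value pulls rare categories toward the global mean, which tends to reduce overfitting on categories with only a few rows.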
Feature Creation
Date features:
├─ Year, Month, Day, Hour
├─ Day of week, Is weekend
├─ Is holiday, Season
└─ Days since event
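Most of the date features listed above can be derived with the pandas `.dt` accessor. The sketch below uses made-up timestamps and a hypothetical reference date `event_date`; the season code assumes northern-hemisphere meteorological seasons. Holiday flags are left out because they need an external calendar.

```python
import pandas as pd

# Hypothetical timestamps and reference event
df = pd.DataFrame(
    {"timestamp": pd.to_datetime(["2024-01-06 14:30", "2024-07-15 09:00"])}
)
event_date = pd.Timestamp("2024-01-01")

ts = df["timestamp"].dt
df["year"] = ts.year
df["month"] = ts.month
df["day"] = ts.day
df["hour"] = ts.hour
df["day_of_week"] = ts.dayofweek       # Monday=0 ... Sunday=6
df["is_weekend"] = ts.dayofweek >= 5   # Saturday or Sunday
# Assumed season coding: 0=winter (Dec-Feb), 1=spring, 2=summer, 3=autumn
df["season"] = ts.month % 12 // 3
df["days_since_event"] = (df["timestamp"] - event_date).dt.days

print(df.drop(columns="timestamp"))
```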
Text features:
├─ Word count, Character count
├─ TF-IDF vectors
├─ Word embeddings
└─ Sentiment scores
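The simpler text features can be sketched as follows, using made-up documents and scikit-learn's `TfidfVectorizer` with default settings. Word embeddings and sentiment scores are omitted here because they depend on a pretrained model or lexicon that this guide does not specify.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical short documents
docs = ["great product, works great", "terrible product", "works as expected"]

# Simple count features (whitespace tokenization)
word_counts = [len(d.split()) for d in docs]
char_counts = [len(d) for d in docs]

# TF-IDF: one row per document, one column per vocabulary term
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)  # sparse matrix

print(word_counts, char_counts)
print(sorted(vectorizer.vocabulary_))
print(tfidf.shape)
```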
Interaction features:
├─ x₁ × x₂ (product)
├─ x₁ / x₂ (ratio)
├─ x₁ - x₂ (difference)
└─ x₁², x₂² (polynomial)
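The four interaction types above map directly onto column arithmetic in pandas. This sketch uses made-up values; replacing a zero denominator with NaN is one possible way to handle division by zero, not the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric features
df = pd.DataFrame({"x1": [2.0, 4.0, 6.0], "x2": [1.0, 2.0, 0.0]})

df["product"] = df["x1"] * df["x2"]
# Guard against division by zero: a zero denominator yields NaN
df["ratio"] = df["x1"] / df["x2"].replace(0, np.nan)
df["difference"] = df["x1"] - df["x2"]
df["x1_sq"] = df["x1"] ** 2
df["x2_sq"] = df["x2"] ** 2

print(df)
```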
Aggregation features:
├─ Mean, Median, Std per group
├─ Count per category
├─ Rolling statistics
└─ Lag features
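Group statistics, lags, and rolling windows can all be built with pandas `groupby`. The sketch below assumes a made-up daily sales table with hypothetical column names (`store`, `day`, `sales`). The rolling mean is shifted by one row so each value only uses earlier days, which avoids leaking the current value into its own feature.

```python
import pandas as pd

# Hypothetical daily sales per store, sorted by time within each store
df = pd.DataFrame({
    "store": ["A", "A", "A", "B", "B", "B"],
    "day": [1, 2, 3, 1, 2, 3],
    "sales": [10.0, 20.0, 30.0, 5.0, 15.0, 25.0],
}).sort_values(["store", "day"])

# Per-group statistics broadcast back to every row
df["store_mean"] = df.groupby("store")["sales"].transform("mean")
df["store_count"] = df.groupby("store")["sales"].transform("count")

# Lag feature: previous day's sales in the same store (NaN on the first day)
df["sales_lag1"] = df.groupby("store")["sales"].shift(1)

# Rolling mean of the two previous days, excluding the current row
df["sales_roll2"] = df.groupby("store")["sales"].transform(
    lambda s: s.shift(1).rolling(2).mean()
)

print(df)
```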
Feature Selection
Method 1: Filter (statistical tests)
├─ Correlation with target
├─ Chi-squared test
├─ Mutual information
└─ ANOVA F-test
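As an illustration of the filter approach, the sketch below scores features with the ANOVA F-test and with mutual information on a synthetic dataset. The dataset and the choice of `k=3` are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif

# Synthetic data: 10 features, of which 3 carry signal
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=3, n_redundant=0, random_state=0
)

# Keep the 3 features with the highest ANOVA F-scores
selector = SelectKBest(score_func=f_classif, k=3)
X_selected = selector.fit_transform(X, y)
print("kept columns:", selector.get_support(indices=True))

# Mutual information also captures non-linear dependence on the target
mi_scores = mutual_info_classif(X, y, random_state=0)
print("mutual information:", mi_scores.round(3))
```

Filter methods score each feature independently of any model, so they are fast but can miss features that only matter in combination.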
Method 2: Wrapper (search over feature subsets, scored by a model)
├─ Forward selection
├─ Backward elimination
├─ Recursive feature elimination (RFE)
└─ Genetic algorithms
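Of the wrapper methods above, recursive feature elimination is available directly in scikit-learn. This sketch assumes a synthetic dataset and a logistic regression as the underlying estimator; any estimator exposing coefficients or feature importances could be used instead.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 10 features, of which 3 carry signal
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=3, n_redundant=0, random_state=0
)

# Repeatedly fit the model and drop the weakest feature until 3 remain
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=3, step=1)
rfe.fit(X, y)

print("selected mask:", rfe.support_)  # True for kept features
print("ranking:", rfe.ranking_)        # 1 = selected, higher = dropped earlier
X_reduced = rfe.transform(X)
```

Because the model is refit at every step, wrapper methods cost more than filter methods but take feature interactions into account.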
Method 3: Embedded (built into model)
├─ L1 regularization (Lasso)
├─ Feature importance (Tree-based)
└─ Permutation importance (model-agnostic; computed after fitting rather than during training)
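The three items above can be sketched on a synthetic regression problem. The dataset, `alpha=1.0`, and the forest settings are assumptions chosen for illustration. Permutation importance is computed on the training data here only to keep the example short; held-out data gives a more honest estimate.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.linear_model import Lasso

# Synthetic data: 10 features, of which 3 carry signal
X, y = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=5.0, random_state=0
)

# L1 regularization drives the coefficients of weak features to exactly zero
lasso = Lasso(alpha=1.0).fit(X, y)
kept = [i for i, c in enumerate(lasso.coef_) if c != 0]
print("features kept by Lasso:", kept)

# Tree ensembles expose impurity-based importances that sum to 1
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
importances = forest.feature_importances_
print("forest importances:", importances.round(3))

# Permutation importance: score drop when one feature's values are shuffled
perm = permutation_importance(forest, X, y, n_repeats=5, random_state=0)
print("permutation importances:", perm.importances_mean.round(3))
```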
Python Implementation
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
# Assumes X is a pandas DataFrame containing the columns below and y holds the class labels
# Define preprocessing
numerical = ['age', 'income', 'score']
categorical = ['gender', 'city', 'category']
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numerical),  # scale numerical columns
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical),  # unseen categories encode as all zeros
])
# Create pipeline: preprocessing is fit on the training split only
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42)),
])
# Train on 80% of the data and evaluate on the held-out 20%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
pipeline.fit(X_train, y_train)
print(f"Accuracy: {pipeline.score(X_test, y_test):.3f}")
Key Takeaways
- Feature engineering is often more important than model choice
- Scale numerical features for distance-based algorithms
- One-hot encode unordered categorical variables; use ordinal or target encoding for ordered or high-cardinality ones
- Create interaction features to capture relationships
- Feature selection removes noise and speeds up training
- Use pipelines so preprocessing is fit on training data only, which prevents data leakage
- Domain knowledge guides the best feature engineering
- Automated tools (e.g. Featuretools) can generate candidate features