Feature Engineering — Complete Guide

Feature engineering transforms raw data into inputs that a model can learn from more easily. On tabular problems it is often the step with the largest effect on model performance.


Numerical Features

Scaling:

StandardScaler (Z-score):
z = (x - μ) / σ
Mean=0, Std=1
Use for: Scale-sensitive algorithms (SVM, KNN, neural networks); tree-based models generally do not need it

MinMaxScaler:
x_scaled = (x - min) / (max - min)
Range: [0, 1]
Use for: Neural networks, image data

RobustScaler:
Uses median and IQR
Robust to outliers
Use for: Data with outliers
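
To make the difference between the three scalers concrete, here is a minimal sketch that applies each one to the same toy column. The values are made up for illustration and include one deliberate outlier.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Toy single-feature column; 100.0 is a deliberate outlier (illustrative data)
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

z = StandardScaler().fit_transform(x)   # (x - mean) / std
mm = MinMaxScaler().fit_transform(x)    # (x - min) / (max - min)
rb = RobustScaler().fit_transform(x)    # (x - median) / IQR

print("standard:", z.ravel().round(2))
print("min-max: ", mm.ravel().round(2))
print("robust:  ", rb.ravel().round(2))
```

With MinMaxScaler the outlier squeezes the four ordinary values into a narrow band near 0, while RobustScaler keeps them spread around the median.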

Log Transform:
x_log = log(x + 1)
Requires x > -1 (in practice, non-negative values such as counts or amounts)
Use for: Right-skewed distributions, power laws
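
The log transform can be sketched with NumPy's log1p, which computes log(x + 1) directly. The income values below are invented for illustration.

```python
import numpy as np

# Illustrative right-skewed values (not real data)
income = np.array([0.0, 9.0, 99.0, 999.0, 99999.0])

income_log = np.log1p(income)    # natural log of (x + 1); defined at x = 0
restored = np.expm1(income_log)  # inverse transform: exp(x) - 1

print(income_log.round(3))
```

np.expm1 is the matching inverse, which is useful when a model predicts on the log scale and the result has to be mapped back.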

Categorical Features

One-Hot Encoding:
Color: [Red, Blue, Green]
→ Red:  [1, 0, 0]
→ Blue: [0, 1, 0]
→ Green:[0, 0, 1]
Use for: Categories without order
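
A minimal sketch of the same encoding with scikit-learn follows. The column name "color" and the explicit category order are assumptions chosen so the output matches the vectors above; by default the encoder sorts categories alphabetically.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"color": ["Red", "Blue", "Green", "Blue"]})

# Fix the column order to Red, Blue, Green; ignore categories unseen at fit time
encoder = OneHotEncoder(categories=[["Red", "Blue", "Green"]], handle_unknown="ignore")
encoded = encoder.fit_transform(df[["color"]]).toarray()
print(encoded)

# A category that was not seen during fit becomes an all-zero row
unknown = encoder.transform(pd.DataFrame({"color": ["Purple"]})).toarray()
print(unknown)
```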

Label Encoding:
Size: [Low, Medium, High]
→ Low=0, Medium=1, High=2
Use for: Ordinal categories (Low < Medium < High). On unordered categories such as colors, the integers imply an order that does not exist.
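
For an ordered feature, one way to sketch this is scikit-learn's OrdinalEncoder with the order spelled out explicitly. The column name "priority" is hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({"priority": ["Low", "High", "Medium", "Low"]})

# Pass the order explicitly; otherwise categories are sorted alphabetically
encoder = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
df["priority_code"] = encoder.fit_transform(df[["priority"]]).ravel().astype(int)

print(df)
```

Without the explicit list, alphabetical sorting would give High=0, Low=1, Medium=2, which loses the intended order.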

Target Encoding:
Replace each category with the mean of the target for that category (computed on training data only)
Red: mean(target where Red) = 0.7
Use for: High-cardinality categories
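
A minimal sketch of target encoding with plain pandas is shown below. The data, the column names, and the fallback to the global mean for unseen categories are all assumptions for illustration. In practice the training rows are usually encoded out-of-fold or with smoothing, since encoding a row with a mean that includes its own target leaks information.

```python
import pandas as pd

# Illustrative data (made up)
train = pd.DataFrame({
    "color": ["Red", "Red", "Blue", "Blue", "Blue", "Green"],
    "target": [1, 0, 1, 1, 0, 0],
})
test = pd.DataFrame({"color": ["Red", "Green", "Purple"]})

# Learn the mapping from the training set only
global_mean = train["target"].mean()
category_means = train.groupby("color")["target"].mean()

train["color_te"] = train["color"].map(category_means)
# Unseen categories (here "Purple") fall back to the global mean
test["color_te"] = test["color"].map(category_means).fillna(global_mean)

print(test)
```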

Feature Creation

Date features:
├─ Year, Month, Day, Hour
├─ Day of week, Is weekend
├─ Is holiday, Season
└─ Days since event
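
Most of the date features listed above can be derived with the pandas .dt accessor. In this sketch the timestamps, the column name "ts", and the reference date are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({"ts": pd.to_datetime(["2024-01-06 09:30", "2024-07-15 18:00"])})

df["year"] = df["ts"].dt.year
df["month"] = df["ts"].dt.month
df["day"] = df["ts"].dt.day
df["hour"] = df["ts"].dt.hour
df["day_of_week"] = df["ts"].dt.dayofweek   # Monday=0 ... Sunday=6
df["is_weekend"] = df["day_of_week"] >= 5

# Whole days elapsed since a hypothetical reference event
reference = pd.Timestamp("2024-01-01")
df["days_since_event"] = (df["ts"] - reference).dt.days

print(df)
```

Holiday and season flags need a calendar or a lookup table for the relevant region, so they are left out here.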

Text features:
├─ Word count, Character count
├─ TF-IDF vectors
├─ Word embeddings
└─ Sentiment scores
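
The simpler text features can be sketched as follows, using pandas string methods for the counts and scikit-learn's TfidfVectorizer for the vectors. The three documents are invented. Embeddings and sentiment scores need additional models and are not shown.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

docs = pd.Series([
    "great product",
    "terrible product would not buy",
    "great great value",
])

word_count = docs.str.split().str.len()   # whitespace-separated tokens
char_count = docs.str.len()               # characters, including spaces

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)    # sparse matrix: documents x vocabulary

print(word_count.tolist(), char_count.tolist())
print(sorted(vectorizer.vocabulary_))
print(tfidf.shape)
```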

Interaction features:
├─ x₁ × x₂ (product)
├─ x₁ / x₂ (ratio)
├─ x₁ - x₂ (difference)
└─ x₁², x₂² (polynomial)
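
These interaction features are one-liners in pandas. The sketch below uses made-up values and guards the ratio against division by zero, which is an easy detail to miss.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x1": [2.0, 4.0, 6.0], "x2": [1.0, 0.0, 3.0]})

df["product"] = df["x1"] * df["x2"]
df["difference"] = df["x1"] - df["x2"]
# Replace zeros in the denominator with NaN so the ratio is NaN rather than inf
df["ratio"] = df["x1"] / df["x2"].replace(0, np.nan)
df["x1_sq"] = df["x1"] ** 2
df["x2_sq"] = df["x2"] ** 2

print(df)
```

For many columns at once, scikit-learn's PolynomialFeatures generates products and powers automatically, at the cost of a rapidly growing feature count.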

Aggregation features:
├─ Mean, Median, Std per group
├─ Count per category
├─ Rolling statistics
└─ Lag features
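
A minimal sketch of group-level, lag, and rolling features with pandas groupby follows. The store/day/sales table is hypothetical. The rolling mean is computed on shifted values so that each row only sees earlier rows, which is an assumption suited to forecasting-style problems.

```python
import pandas as pd

df = pd.DataFrame({
    "store": ["A", "A", "A", "B", "B", "B"],
    "day": [1, 2, 3, 1, 2, 3],
    "sales": [10.0, 20.0, 30.0, 5.0, 15.0, 25.0],
}).sort_values(["store", "day"])

grouped = df.groupby("store")["sales"]

df["store_mean"] = grouped.transform("mean")    # mean per group, broadcast to rows
df["store_count"] = grouped.transform("count")  # rows per group
df["sales_lag_1"] = grouped.shift(1)            # previous day's sales in the same store
# Mean of the two previous days, excluding the current row
df["sales_roll_2"] = grouped.transform(lambda s: s.shift(1).rolling(2).mean())

print(df)
```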

Feature Selection

Method 1: Filter (statistical tests)
├─ Correlation with target
├─ Chi-squared test
├─ Mutual information
└─ ANOVA F-test
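
As one example of a filter method, the sketch below scores each feature with the ANOVA F-test and keeps the top three. The synthetic dataset and the choice of k=3 are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 10 features, of which 3 carry signal
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=3, n_redundant=0, random_state=0
)

selector = SelectKBest(score_func=f_classif, k=3)
X_selected = selector.fit_transform(X, y)

print("F-scores:", selector.scores_.round(1))
print("kept feature indices:", selector.get_support(indices=True))
```

Swapping f_classif for mutual_info_classif or chi2 (the latter needs non-negative features) gives the other tests in the list.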

Method 2: Wrapper (search over feature subsets, scored by a model)
├─ Forward selection
├─ Backward elimination
├─ Recursive feature elimination (RFE)
└─ Genetic algorithms
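
Recursive feature elimination is the wrapper method with direct scikit-learn support. This sketch assumes a logistic regression as the scoring model and a target of three features, both arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(
    n_samples=200, n_features=10, n_informative=3, n_redundant=0, random_state=0
)

# Repeatedly fit the model and drop the weakest feature until 3 remain
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe.fit(X, y)

print("selected mask:", rfe.support_)
print("ranking (1 = kept):", rfe.ranking_)
```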

Method 3: Embedded (built into model)
├─ L1 regularization (Lasso)
├─ Feature importance (Tree-based)
└─ Permutation importance (model-agnostic and computed after fitting, so not strictly embedded)
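
L1 regularization can be sketched as follows: features whose Lasso coefficients shrink to exactly zero are effectively dropped. The synthetic regression data and alpha=1.0 are illustrative assumptions, and in practice alpha is usually tuned by cross-validation.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=5.0, random_state=0
)

# Scale first so the penalty treats all features equally
X_scaled = StandardScaler().fit_transform(X)
lasso = Lasso(alpha=1.0).fit(X_scaled, y)

kept = np.flatnonzero(lasso.coef_)   # indices of features with non-zero coefficients
print("coefficients:", lasso.coef_.round(2))
print("kept feature indices:", kept)
```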

Python Implementation

from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Define preprocessing
numerical = ['age', 'income', 'score']
categorical = ['gender', 'city', 'category']

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numerical),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical)
])

# Create pipeline
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier())
])

# Train (X is assumed to be a DataFrame containing the columns above, y the labels)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
pipeline.fit(X_train, y_train)
print(f"Accuracy: {pipeline.score(X_test, y_test):.3f}")
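
The pipeline above can also be evaluated with cross-validation. Because the scaler and encoder sit inside the pipeline, they are refit on the training part of each fold and never see the held-out rows. The sketch below is self-contained and uses a synthetic DataFrame with the same column names; the data, the label rule, and the model settings are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the real dataset (illustrative only)
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "age": rng.integers(18, 70, n),
    "income": rng.normal(50_000, 15_000, n),
    "score": rng.uniform(0, 1, n),
    "gender": rng.choice(["F", "M"], n),
    "city": rng.choice(["Paris", "Lima", "Osaka"], n),
    "category": rng.choice(["a", "b", "c"], n),
})
y = (X["score"] + rng.normal(0, 0.2, n) > 0.5).astype(int)

preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["age", "income", "score"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["gender", "city", "category"]),
])
pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", RandomForestClassifier(n_estimators=50, random_state=0)),
])

# Preprocessing is refit inside each of the 5 folds
scores = cross_val_score(pipeline, X, y, cv=5)
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```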

Key Takeaways

  1. Feature engineering is often more important than model choice
  2. Scale numerical features for distance-based algorithms
  3. One-hot encode categorical variables for most models
  4. Create interaction features to capture relationships
  5. Feature selection removes noise and speeds up training
  6. Use pipelines so preprocessing is fit on training data only, which prevents data leakage
  7. Domain knowledge guides the best feature engineering
  8. Automated tools (e.g., Featuretools) can generate candidate features
