Advanced Feature Engineering
Feature engineering is often the difference between a mediocre model and a high-performing one. Raw data rarely arrives in a form that maximizes predictive power: transforming, combining, and encoding features systematically can unlock patterns that algorithms miss on raw inputs.
Why Feature Engineering Matters
Models don't understand data the way humans do. A linear model sees only numbers; a tree-based model sees only split points. The art of feature engineering is translating domain knowledge and data structure into numeric representations that expose the signal hiding in your data.
import pandas as pd
import numpy as np
from sklearn.preprocessing import (
PolynomialFeatures, StandardScaler, LabelEncoder,
OrdinalEncoder, KBinsDiscretizer
)
from category_encoders import TargetEncoder, WOEEncoder
import warnings
warnings.filterwarnings('ignore')
Feature Creation from Domain Knowledge
The most powerful features come from understanding the problem domain.
np.random.seed(42)
n = 1000
df = pd.DataFrame({
'transaction_amount': np.random.lognormal(mean=4, sigma=1.2, size=n),
'transaction_hour': np.random.choice(range(24), size=n),
'transaction_dayofweek': np.random.choice(range(7), size=n),
'customer_age': np.random.normal(40, 12, size=n).clip(18, 80),
'account_balance': np.random.lognormal(mean=8, sigma=0.8, size=n),
'num_products': np.random.poisson(2, size=n) + 1,
'tenure_months': np.random.exponential(36, size=n).clip(1, 120),
'credit_score': np.random.normal(650, 80, size=n).clip(300, 850),
'category': np.random.choice(['electronics', 'groceries', 'travel', 'entertainment'], size=n),
'is_fraud': np.random.binomial(1, 0.02, size=n)
})
# Ratio features: capture relationships between variables
# The dataset has no income column; customer_age * 1000 serves as a crude stand-in here
df['balance_to_income'] = df['account_balance'] / (df['customer_age'] * 1000 + 1)
df['amount_to_balance'] = df['transaction_amount'] / (df['account_balance'] + 1)
# Aggregation features: category-level summaries (the grouping key is 'category', not a customer ID)
category_stats = df.groupby('category').agg(
    cat_avg_amount=('transaction_amount', 'mean'),
    cat_std_amount=('transaction_amount', 'std'),
    cat_avg_balance=('account_balance', 'mean')
).reset_index()
df = df.merge(category_stats, on='category', how='left')
# Deviation features: how far from the norm
df['amount_zscore'] = (
(df['transaction_amount'] - df['cat_avg_amount']) / (df['cat_std_amount'] + 1e-8)
)
# Cyclical features: encode circular nature of time
df['hour_sin'] = np.sin(2 * np.pi * df['transaction_hour'] / 24)
df['hour_cos'] = np.cos(2 * np.pi * df['transaction_hour'] / 24)
df['dow_sin'] = np.sin(2 * np.pi * df['transaction_dayofweek'] / 7)
df['dow_cos'] = np.cos(2 * np.pi * df['transaction_dayofweek'] / 7)
print(df[['transaction_amount', 'amount_zscore', 'hour_sin', 'hour_cos']].head())
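To see why the sine/cosine pair helps, the following minimal sketch compares distances between hours on the raw scale and on the encoded scale. The helper name `encode_hour` is hypothetical and exists only for this illustration.

```python
import numpy as np

def encode_hour(hour):
    """Map an hour in [0, 24) to a point on the unit circle (hypothetical helper)."""
    angle = 2 * np.pi * hour / 24
    return np.array([np.sin(angle), np.cos(angle)])

# On the raw scale, 23:00 and 00:00 look 23 units apart
raw_gap = abs(23 - 0)

# On the encoded scale, 23:00 and 01:00 are equally close to midnight
encoded_gap_23 = np.linalg.norm(encode_hour(23) - encode_hour(0))
encoded_gap_1 = np.linalg.norm(encode_hour(1) - encode_hour(0))

# Noon is the farthest point from midnight on the circle
encoded_gap_noon = np.linalg.norm(encode_hour(12) - encode_hour(0))

print(f"raw gap 23h vs 0h:      {raw_gap}")
print(f"encoded gap 23h vs 0h:  {encoded_gap_23:.3f}")
print(f"encoded gap 1h vs 0h:   {encoded_gap_1:.3f}")
print(f"encoded gap 12h vs 0h:  {encoded_gap_noon:.3f}")
```

Both components are needed: sine alone cannot distinguish 03:00 from 09:00, and cosine alone cannot distinguish 06:00 from 18:00.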
Binning and Discretization
Converting continuous variables into bins can capture non-linear relationships and reduce noise.
# Fixed-edge binning with hand-picked age bands (the bands are not equal width)
df['age_group'] = pd.cut(
df['customer_age'],
bins=[0, 25, 35, 45, 55, 65, 100],
labels=['18-25', '26-35', '36-45', '46-55', '56-65', '65+']
)
# Quantile binning: equal frequency
df['amount_quantile'] = pd.qcut(
df['transaction_amount'],
q=5,
labels=['Q1', 'Q2', 'Q3', 'Q4', 'Q5']
)
# Sklearn discretization (fit_transform returns a 2-D array, so flatten it before assigning)
kbins = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='quantile')
df['credit_score_bin'] = kbins.fit_transform(df[['credit_score']]).ravel().astype(int)
# Custom binning based on domain knowledge
def risk_bucket(score):
    if score < 580:
        return 'high_risk'
    elif score < 670:
        return 'moderate_risk'
    elif score < 740:
        return 'good'
    else:
        return 'excellent'
df['risk_category'] = df['credit_score'].apply(risk_bucket)
# Interaction binning β combine two continuous features
df['age_amount_bin'] = pd.cut(
df['customer_age'], bins=5
).astype(str) + '_' + pd.cut(
df['transaction_amount'], bins=5
).astype(str)
print(df[['customer_age', 'age_group', 'amount_quantile', 'risk_category']].head(10))
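The practical difference between equal-width and quantile binning shows up on skewed data. The sketch below uses a synthetic log-normal series (an assumption chosen to resemble the transaction amounts above) and compares how many rows land in each bin.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed amounts; the distribution parameters are illustrative
rng = np.random.default_rng(0)
amounts = pd.Series(rng.lognormal(mean=4, sigma=1.2, size=1000))

# Equal-width: five intervals of identical width between min and max
equal_width = pd.cut(amounts, bins=5)
# Equal-frequency: five intervals holding roughly the same number of rows
equal_freq = pd.qcut(amounts, q=5)

width_counts = equal_width.value_counts(sort=False)
freq_counts = equal_freq.value_counts(sort=False)

print("Equal-width bin counts:")
print(width_counts)
print("\nQuantile bin counts:")
print(freq_counts)
```

With a long right tail, equal-width bins put most rows in the first interval and leave the upper bins nearly empty, while quantile bins stay balanced at the cost of very uneven interval widths.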
Encoding Categorical Variables
Different encoding strategies suit different algorithms and cardinality levels.
# Ordinal encoding for ordered categories
risk_mapping = {'high_risk': 0, 'moderate_risk': 1, 'good': 2, 'excellent': 3}
df['risk_ordinal'] = df['risk_category'].map(risk_mapping)
# One-hot encoding for low-cardinality features
df_onehot = pd.get_dummies(df['category'], prefix='cat', drop_first=True)
df = pd.concat([df, df_onehot], axis=1)
# Label encoding (useful for tree-based models)
le = LabelEncoder()
df['category_label'] = le.fit_transform(df['category'])
# Frequency encoding: replace category with its frequency
freq_map = df['category'].value_counts(normalize=True).to_dict()
df['category_freq'] = df['category'].map(freq_map)
# Binary encoding: more compact than one-hot for high cardinality
def binary_encode(series, n_bits=None):
    """Encode each category's integer code as a set of 0/1 bit columns."""
    codes = series.astype('category').cat.codes
    if n_bits is None:
        # Use at least one bit, even when the column holds a single category
        n_bits = max(1, int(np.ceil(np.log2(codes.max() + 1))))
    bits = [(codes >> i) & 1 for i in range(n_bits)]
    return pd.DataFrame(
        {f'{series.name}_bit{i}': bits[i] for i in range(n_bits)}
    )
binary_cols = binary_encode(df['category'])
df = pd.concat([df, binary_cols], axis=1)
print(df[[c for c in df.columns if 'bit' in c or 'category' in c]].head())
Target Encoding
Target encoding replaces categories with the mean of the target variable. It is powerful but requires care to avoid leakage.
from sklearn.model_selection import KFold
def target_encode_cv(df, cat_col, target_col, n_folds=5, smoothing=10):
    """Out-of-fold target encoding: each row is encoded from folds that exclude it."""
    encoded = pd.Series(index=df.index, dtype=float)
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)
    for train_idx, val_idx in kf.split(df):
        train, val = df.iloc[train_idx], df.iloc[val_idx]
        # Compute the prior from the training fold only, so held-out targets never leak in
        prior = train[target_col].mean()
        stats = train.groupby(cat_col)[target_col].agg(['mean', 'count'])
        # Shrink each category mean toward the prior; small categories shrink the most
        smooth = (stats['count'] * stats['mean'] + smoothing * prior) / (stats['count'] + smoothing)
        encoded.iloc[val_idx] = val[cat_col].map(smooth).fillna(prior).to_numpy()
    return encoded
df['category_target_enc'] = target_encode_cv(df, 'category', 'is_fraud')
# Compare distributions
print("Original category means:")
print(df.groupby('category')['is_fraud'].mean())
print("\nTarget encoded values (sample):")
print(df[['category', 'category_target_enc']].drop_duplicates().head())
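The smoothing term in `target_encode_cv` is easier to reason about with concrete numbers. This sketch applies the same weighted-average formula to two made-up categories that share a raw mean of 0.50 but differ in size; `smoothed_mean` is a hypothetical helper written for this example.

```python
def smoothed_mean(cat_mean, count, global_mean, smoothing=10):
    """Weighted average of the category mean and the global mean (hypothetical helper)."""
    return (count * cat_mean + smoothing * global_mean) / (count + smoothing)

global_mean = 0.02  # matches the 2% fraud rate used to generate the data

# A rare category: only 2 rows, one of them fraudulent
rare = smoothed_mean(cat_mean=0.50, count=2, global_mean=global_mean)
# A common category with the same raw mean but 2000 rows
common = smoothed_mean(cat_mean=0.50, count=2000, global_mean=global_mean)

print(f"rare category encoding:   {rare:.3f}")    # 0.100
print(f"common category encoding: {common:.3f}")  # 0.498
```

The rare category is pulled most of the way back toward the global mean, while the well-supported one keeps nearly all of its raw value. Larger `smoothing` values shrink harder.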
Interaction and Polynomial Features
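Interaction and polynomial features capture relationships between variables that linear models would otherwise miss. Interaction features combine two or more variables to capture synergistic effects, for example the product $x_1 x_2$.
For polynomial features of degree $d$, the expansion includes all monomials of total degree up to $d$. With two features and $d = 2$, this gives $x_1, x_2, x_1^2, x_1 x_2, x_2^2$ (plus a constant bias term, if included).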
from sklearn.preprocessing import PolynomialFeatures
# Manual interaction features
df['age_x_products'] = df['customer_age'] * df['num_products']
df['amount_x_hour'] = df['transaction_amount'] * df['transaction_hour']
df['balance_x_tenure'] = df['account_balance'] * df['tenure_months']
# Polynomial features from sklearn
poly = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)
key_features = df[['customer_age', 'num_products', 'credit_score']].copy()
poly_features = poly.fit_transform(key_features)
poly_df = pd.DataFrame(
poly_features,
columns=poly.get_feature_names_out(['age', 'products', 'credit']),
index=df.index
)
df = pd.concat([df, poly_df], axis=1)
print(f"Original features: 3 -> Polynomial features: {poly_df.shape[1]}")
print(poly_df.head())
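Polynomial expansions grow quickly, so it is worth estimating the output width before fitting. For $n$ input features and degree $d$, the number of monomials of total degree at most $d$ is $\binom{n+d}{d}$, including the constant term. The sketch below checks that count against `PolynomialFeatures` on random data; `n_poly_features` is a hypothetical helper.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

def n_poly_features(n_features, degree, include_bias=False):
    """Count monomials of total degree <= degree (hypothetical helper)."""
    total = comb(n_features + degree, degree)
    return total if include_bias else total - 1

# Three random input features, mirroring the three columns used above
X = np.random.default_rng(0).normal(size=(5, 3))

counts = {}
for degree in (2, 3, 4):
    poly = PolynomialFeatures(degree=degree, include_bias=False)
    n_out = poly.fit_transform(X).shape[1]
    counts[degree] = (n_out, n_poly_features(3, degree))
    print(f"degree={degree}: sklearn={n_out}, formula={counts[degree][1]}")
```

For three inputs this gives 9, 19, and 34 columns at degrees 2, 3, and 4, which is why high-degree expansions are usually restricted to a handful of key features.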
Log and Power Transformations
Log and power transformations reduce skewness and stabilize variance.
from sklearn.preprocessing import PowerTransformer
# Log transform for right-skewed features
df['log_amount'] = np.log1p(df['transaction_amount'])
df['log_balance'] = np.log1p(df['account_balance'])
# Square root transform
df['sqrt_amount'] = np.sqrt(df['transaction_amount'])
# Yeo-Johnson handles zeros and negatives; Box-Cox requires strictly positive values
pt = PowerTransformer(method='yeo-johnson')
df['amount_yeojohnson'] = pt.fit_transform(df[['transaction_amount']]).ravel()
# Compare skewness
for col in ['transaction_amount', 'log_amount', 'amount_yeojohnson']:
    print(f"{col}: skewness = {df[col].skew():.3f}")
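The choice between the two power transforms mostly comes down to the sign of the data. As a small sketch on made-up values, Yeo-Johnson accepts zeros and negatives, while Box-Cox in scikit-learn raises an error unless every value is strictly positive.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Illustrative column containing a negative value and a zero
values = np.array([[-2.0], [0.0], [1.0], [5.0], [40.0], [300.0]])

# Yeo-Johnson works on the full range of real numbers
yj_out = PowerTransformer(method='yeo-johnson').fit_transform(values)

# Box-Cox refuses non-positive input
try:
    PowerTransformer(method='box-cox').fit_transform(values)
    box_cox_failed = False
except ValueError as exc:
    box_cox_failed = True
    print(f"Box-Cox rejected the data: {exc}")

# Box-Cox is fine once the data is restricted to positive values
positive = values[values > 0].reshape(-1, 1)
bc_out = PowerTransformer(method='box-cox').fit_transform(positive)

print(yj_out.ravel())
print(bc_out.ravel())
```

Both variants standardize their output to zero mean and unit variance by default, so no separate scaling step is needed afterwards.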
Missing Value Features
Missingness itself can be a signal.
# Create missingness indicators
df['balance_missing'] = df['account_balance'].isnull().astype(int)
df['credit_missing'] = df['credit_score'].isnull().astype(int)
# Count of missing features per row
missing_cols = ['account_balance', 'credit_score', 'tenure_months']
df['missing_count'] = df[missing_cols].isnull().sum(axis=1)
# Impute and create flag
for col in missing_cols:
    median_val = df[col].median()
    df[f'{col}_imputed'] = df[col].fillna(median_val)
    df[f'{col}_was_missing'] = df[col].isnull().astype(int)
print(df[[c for c in df.columns if 'missing' in c or 'imputed' in c]].sum())
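The synthetic frame above contains no missing values, so every indicator comes out as zero. To see the technique do something, the sketch below blanks out every tenth credit score in a small made-up frame and uses scikit-learn's `SimpleImputer` with `add_indicator=True`, which performs the imputation and appends the missingness flag in one step. The column names for the result are assigned by hand here.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Small illustrative frame; distribution parameters mirror the earlier example
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    'account_balance': rng.lognormal(8, 0.8, size=200),
    'credit_score': rng.normal(650, 80, size=200),
})

# Deliberately blank out every 10th credit score (20 rows in total)
demo.loc[demo.index[::10], 'credit_score'] = np.nan

# Median imputation plus one indicator column per feature that had missing values
imputer = SimpleImputer(strategy='median', add_indicator=True)
imputed = imputer.fit_transform(demo)

result = pd.DataFrame(
    imputed,
    columns=['account_balance', 'credit_score', 'credit_score_was_missing'],
)
print(result.head(12))
print("Rows flagged as missing:", int(result['credit_score_was_missing'].sum()))
```

Because the imputer is fitted, the same medians and indicator layout can be reused on new data, which the manual `fillna` loop does not guarantee.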
Datetime Feature Extraction
Dates contain rich patterns when properly decomposed.
dates = pd.date_range('2024-01-01', periods=n, freq='h')
df['date'] = dates[:n]
# Extract components
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['hour'] = df['date'].dt.hour
df['dayofyear'] = df['date'].dt.dayofyear
df['weekofyear'] = df['date'].dt.isocalendar().week.astype(int)
df['is_weekend'] = df['date'].dt.dayofweek.isin([5, 6]).astype(int)
df['is_month_start'] = df['date'].dt.is_month_start.astype(int)
df['is_month_end'] = df['date'].dt.is_month_end.astype(int)
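# Rolling window features
# Note: an integer window counts rows, not days. The data here is hourly, so window=7
# spans 7 rows (7 hours); a true 7-day window needs a time offset such as '7D'.
df['amount_7d_mean'] = df['transaction_amount'].rolling(window=7, min_periods=1).mean()
df['amount_7d_std'] = df['transaction_amount'].rolling(window=7, min_periods=1).std()
df['amount_30d_mean'] = df['transaction_amount'].rolling(window=30, min_periods=1).mean()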
# Lag features
df['amount_lag1'] = df['transaction_amount'].shift(1)
df['amount_lag7'] = df['transaction_amount'].shift(7)
df['amount_diff1'] = df['transaction_amount'].diff(1)
print(df[['date', 'is_weekend', 'amount_7d_mean', 'amount_lag1']].head(10))
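Rolling features can be defined by row count or by elapsed time, and the two behave differently. The following sketch, on a made-up ten-day series, contrasts a 3-row window with a 3-day time-based window that uses `closed='left'` so the current observation is excluded from its own feature.

```python
import numpy as np
import pandas as pd

# Ten daily observations with amounts 1..10 (illustrative data)
ts = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=10, freq='D'),
    'amount': np.arange(1.0, 11.0),
}).set_index('date')

# Row-based window: the last 3 rows, including the current one
ts['mean_3rows'] = ts['amount'].rolling(window=3, min_periods=1).mean()

# Time-based window: the 3 days strictly before the current timestamp
ts['mean_prev_3d'] = ts['amount'].rolling('3D', closed='left').mean()

print(ts)
```

The time-based version needs a sorted datetime index, and its first row is NaN because nothing precedes it. Excluding the current row matters when the rolled column is related to the target, since including it can leak information.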
Building a Feature Engineering Pipeline
Putting it all together: the transformations above can be combined into a single reusable pipeline.
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Inheriting from BaseEstimator provides get_params/set_params, which Pipeline and
# ColumnTransformer need in order to clone each step; TransformerMixin adds fit_transform.
class CyclicalEncoder(BaseEstimator, TransformerMixin):
    def __init__(self, col, max_val):
        self.col = col
        self.max_val = max_val

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        X[f'{self.col}_sin'] = np.sin(2 * np.pi * X[self.col] / self.max_val)
        X[f'{self.col}_cos'] = np.cos(2 * np.pi * X[self.col] / self.max_val)
        return X.drop(columns=[self.col])

class LogTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, cols):
        self.cols = cols

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        for col in self.cols:
            X[col] = np.log1p(X[col])
        return X

# Define column groups
numeric_features = ['transaction_amount', 'account_balance', 'credit_score']
categorical_features = ['category']
cyclical_features = {'transaction_hour': 24, 'transaction_dayofweek': 7}
# Build preprocessing pipeline
preprocessor = Pipeline([
    ('log_transform', LogTransformer(['transaction_amount', 'account_balance'])),
    ('scaler', StandardScaler())
])
# One CyclicalEncoder per cyclical column, so cyclical_features is actually used
cyclical_transformers = [
    (f'cyclical_{col}', CyclicalEncoder(col, max_val), [col])
    for col, max_val in cyclical_features.items()
]
feature_engineering_pipeline = ColumnTransformer([
    ('numeric', preprocessor, numeric_features),
    ('categorical', OneHotEncoder(drop='first', sparse_output=False), categorical_features),
    *cyclical_transformers,
])
# Fit and transform
X_engineered = feature_engineering_pipeline.fit_transform(df)
print(f"Final feature matrix shape: {X_engineered.shape}")
Best Practices
- Always split before target encoding: this prevents target leakage
- Monitor feature drift: engineered features may need recalibration over time
- Use domain knowledge: the best features come from understanding the business problem
- Start simple, add complexity: begin with obvious features before crafting exotic ones
- Document every transformation: reproducibility requires knowing what you did and why
- Validate with cross-validation: especially for target encoding and feature selection
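The first practice in the list can be sketched in a few lines: split first, learn the category means from the training rows only, then map them onto the test rows. The data below is random and purely illustrative, and the fallback to the training prior for unseen categories is one reasonable choice among several.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Illustrative data: three categories and a binary target
rng = np.random.default_rng(1)
data = pd.DataFrame({
    'category': rng.choice(['a', 'b', 'c'], size=300),
    'target': rng.binomial(1, 0.3, size=300),
})

# Split before computing any target statistics
train, test = train_test_split(data, test_size=0.25, random_state=1)
train, test = train.copy(), test.copy()

# Learn the encoding from the training rows only
prior = train['target'].mean()
category_means = train.groupby('category')['target'].mean()

# Apply it to the test rows; test targets are never consulted
test['category_enc'] = test['category'].map(category_means).fillna(prior)

# A category absent from training falls back to the training prior
unseen = pd.Series(['z']).map(category_means).fillna(prior)

print(category_means)
print(test.head())
print("Encoding for unseen category:", unseen.iloc[0])
```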
Summary
Advanced feature engineering transforms raw data into predictive power. Master these techniques (cyclical encoding, target encoding, interaction features, missingness indicators, and domain-driven feature creation) and you'll consistently extract more signal from the same data than relying on default preprocessing alone.