Advanced Feature Engineering

Feature engineering is often the difference between a mediocre model and a high-performing one. Raw data rarely arrives in a form that maximizes predictive power – transforming, combining, and encoding features systematically can unlock patterns that algorithms miss on raw inputs.

Why Feature Engineering Matters

Models don't understand data the way humans do. A linear model sees only numbers; a tree-based model sees only split points. The art of feature engineering is translating domain knowledge and data structure into numeric representations that expose the signal hiding in your data.

import pandas as pd
import numpy as np
from sklearn.preprocessing import (
    PolynomialFeatures, StandardScaler, LabelEncoder,
    OrdinalEncoder, KBinsDiscretizer
)
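# Optional third-party dependency: category_encoders ships ready-made TargetEncoder and
# WOEEncoder classes. Nothing below uses them, so the import is left commented out.
# from category_encoders import TargetEncoder, WOEEncoder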
import warnings
warnings.filterwarnings('ignore')

Feature Creation from Domain Knowledge

The most powerful features come from understanding the problem domain.

np.random.seed(42)
n = 1000

df = pd.DataFrame({
    'transaction_amount': np.random.lognormal(mean=4, sigma=1.2, size=n),
    'transaction_hour': np.random.choice(range(24), size=n),
    'transaction_dayofweek': np.random.choice(range(7), size=n),
    'customer_age': np.random.normal(40, 12, size=n).clip(18, 80),
    'account_balance': np.random.lognormal(mean=8, sigma=0.8, size=n),
    'num_products': np.random.poisson(2, size=n) + 1,
    'tenure_months': np.random.exponential(36, size=n).clip(1, 120),
    'credit_score': np.random.normal(650, 80, size=n).clip(300, 850),
    'category': np.random.choice(['electronics', 'groceries', 'travel', 'entertainment'], size=n),
    'is_fraud': np.random.binomial(1, 0.02, size=n)
})

# Ratio features – capture relationships between variables
# Note: the data has no income column, so customer_age * 1000 is only a crude stand-in for income
df['balance_to_income'] = df['account_balance'] / (df['customer_age'] * 1000 + 1)
df['amount_to_balance'] = df['transaction_amount'] / (df['account_balance'] + 1)

# Aggregation features – category-level summaries (the grouping key is 'category', not a customer ID)
category_stats = df.groupby('category').agg(
    cat_avg_amount=('transaction_amount', 'mean'),
    cat_std_amount=('transaction_amount', 'std'),
    cat_avg_balance=('account_balance', 'mean')
)
# category_stats is indexed by 'category', which merge matches against the column of the same name
df = df.merge(category_stats, on='category', how='left')

# Deviation features – how far from the norm
df['amount_zscore'] = (
    (df['transaction_amount'] - df['cat_avg_amount']) / (df['cat_std_amount'] + 1e-8)
)

# Cyclical features – encode circular nature of time
df['hour_sin'] = np.sin(2 * np.pi * df['transaction_hour'] / 24)
df['hour_cos'] = np.cos(2 * np.pi * df['transaction_hour'] / 24)
df['dow_sin'] = np.sin(2 * np.pi * df['transaction_dayofweek'] / 7)
df['dow_cos'] = np.cos(2 * np.pi * df['transaction_dayofweek'] / 7)

print(df[['transaction_amount', 'amount_zscore', 'hour_sin', 'hour_cos']].head())
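
To see why the sine/cosine pair is worth the extra column, it helps to compare distances before and after encoding. The sketch below is a standalone illustration (the helper name `cyclical_encode` is an assumption, not part of the code above): on the raw scale 23:00 and 00:00 look 23 units apart, while in the encoded space they sit next to each other on the unit circle.

```python
import numpy as np

def cyclical_encode(values, period):
    """Map values on a cycle of length `period` to (sin, cos) coordinates."""
    angle = 2 * np.pi * np.asarray(values, dtype=float) / period
    return np.column_stack([np.sin(angle), np.cos(angle)])

hours = np.array([23, 0, 12])
encoded = cyclical_encode(hours, period=24)

# On the raw scale, 23:00 and 00:00 look as far apart as two hours can be
raw_gap = abs(hours[0] - hours[1])

# On the unit circle they are neighbours, while 00:00 and 12:00 are opposite points
enc_gap_adjacent = np.linalg.norm(encoded[0] - encoded[1])
enc_gap_opposite = np.linalg.norm(encoded[1] - encoded[2])

print(f"raw |23 - 0| = {raw_gap}")
print(f"encoded distance, 23h vs 0h: {enc_gap_adjacent:.3f}")
print(f"encoded distance, 0h vs 12h: {enc_gap_opposite:.3f}")
```

Both components are needed: sine alone maps 00:00 and 12:00 to the same value, and cosine alone cannot separate 06:00 from 18:00.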

Binning and Discretization

Converting continuous variables into bins can capture non-linear relationships and reduce noise, at the cost of discarding the variation within each bin.

# Fixed-edge binning – hand-picked age bands (the bins are not equal-width)
df['age_group'] = pd.cut(
    df['customer_age'],
    bins=[0, 25, 35, 45, 55, 65, 100],
    labels=['18-25', '26-35', '36-45', '46-55', '56-65', '65+']
)

# Quantile binning – equal frequency
df['amount_quantile'] = pd.qcut(
    df['transaction_amount'],
    q=5,
    labels=['Q1', 'Q2', 'Q3', 'Q4', 'Q5']
)

# Sklearn discretization – fit_transform returns an (n, 1) array, so flatten it before assignment
kbins = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='quantile')
df['credit_score_bin'] = kbins.fit_transform(df[['credit_score']]).ravel().astype(int)

# Custom binning based on domain knowledge
def risk_bucket(score):
    if score < 580:
        return 'high_risk'
    elif score < 670:
        return 'moderate_risk'
    elif score < 740:
        return 'good'
    else:
        return 'excellent'

df['risk_category'] = df['credit_score'].apply(risk_bucket)

# Interaction binning – combine two continuous features
df['age_amount_bin'] = pd.cut(
    df['customer_age'], bins=5
).astype(str) + '_' + pd.cut(
    df['transaction_amount'], bins=5
).astype(str)

print(df[['customer_age', 'age_group', 'amount_quantile', 'risk_category']].head(10))
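
The choice between equal-width and quantile bins matters most on skewed data. As a rough illustration, the sketch below applies both to a synthetic log-normal sample (generated independently of the DataFrame above) and compares how many rows land in each bin.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
amounts = pd.Series(rng.lognormal(mean=4, sigma=1.2, size=1000))

# Equal-width: every bin spans the same range of values
equal_width = pd.cut(amounts, bins=5)
# Equal-frequency: every bin holds (about) the same number of rows
equal_freq = pd.qcut(amounts, q=5)

width_counts = equal_width.value_counts(sort=False)
freq_counts = equal_freq.value_counts(sort=False)

print("Equal-width bin counts:")
print(width_counts)
print("\nQuantile bin counts:")
print(freq_counts)
```

On a long-tailed feature, equal-width bins put almost every row in the lowest bin and leave the rest nearly empty, which is why quantile bins are usually the safer default for amounts and balances.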

Encoding Categorical Variables

Different encoding strategies suit different algorithms and cardinality levels.

# Ordinal encoding for ordered categories
risk_mapping = {'high_risk': 0, 'moderate_risk': 1, 'good': 2, 'excellent': 3}
df['risk_ordinal'] = df['risk_category'].map(risk_mapping)

# One-hot encoding for low-cardinality features
df_onehot = pd.get_dummies(df['category'], prefix='cat', drop_first=True)
df = pd.concat([df, df_onehot], axis=1)

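# Label encoding – arbitrary integer codes; acceptable for tree-based models, misleading for linear ones
# (scikit-learn intends LabelEncoder for targets; OrdinalEncoder is the feature-side equivalent)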
le = LabelEncoder()
df['category_label'] = le.fit_transform(df['category'])

# Frequency encoding – replace category with its frequency
freq_map = df['category'].value_counts(normalize=True).to_dict()
df['category_freq'] = df['category'].map(freq_map)

# Binary encoding – more efficient than one-hot for high cardinality
def binary_encode(series, n_bits=None):
    codes = series.astype('category').cat.codes
    if n_bits is None:
        n_bits = int(np.ceil(np.log2(codes.max() + 1)))
    bits = []
    for i in range(n_bits):
        bits.append((codes >> i) & 1)
    return pd.DataFrame(
        {f'{series.name}_bit{i}': bits[i] for i in range(n_bits)}
    )

binary_cols = binary_encode(df['category'])
df = pd.concat([df, binary_cols], axis=1)
print(df[[c for c in df.columns if 'bit' in c or 'category' in c]].head())
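
The encodings above are fitted on the full DataFrame for brevity. In a real workflow the mapping would normally be learned on the training split and then applied to new data, which raises the question of unseen categories. A minimal sketch for frequency encoding, using a tiny made-up train/test pair (the `'gaming'` category is hypothetical and exists only to show the unseen case):

```python
import pandas as pd

train = pd.Series(['groceries', 'groceries', 'travel', 'electronics'], name='category')
test = pd.Series(['travel', 'groceries', 'gaming'], name='category')  # 'gaming' never appears in train

# Learn the frequencies from the training split only
freq_map = train.value_counts(normalize=True)

train_freq = train.map(freq_map)
# Unseen categories have no entry in the map and come back as NaN; treat them as frequency 0
test_freq = test.map(freq_map).fillna(0.0)

print(test_freq.tolist())
```

Filling with 0 is one reasonable choice among several; a small constant or a dedicated "unknown" flag would work too, depending on how the downstream model treats zeros.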

Target Encoding

Target encoding replaces categories with the mean of the target variable – powerful but requires care to avoid leakage.

from sklearn.model_selection import KFold

def target_encode_cv(df, cat_col, target_col, n_folds=5, smoothing=10):
    """Out-of-fold target encoding: each row is encoded using only the other folds."""
    encoded = pd.Series(index=df.index, dtype=float)

    kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)

    for train_idx, val_idx in kf.split(df):
        train, val = df.iloc[train_idx], df.iloc[val_idx]

        # Use the training-fold mean as the prior so validation targets never leak in
        prior = train[target_col].mean()
        stats = train.groupby(cat_col)[target_col].agg(['mean', 'count'])
        # Shrink each category mean toward the prior; small categories shrink the most
        smooth = (stats['count'] * stats['mean'] + smoothing * prior) / (stats['count'] + smoothing)

        # Categories unseen in the training fold fall back to the prior;
        # to_numpy() makes the assignment positional rather than index-aligned
        encoded.iloc[val_idx] = val[cat_col].map(smooth).fillna(prior).to_numpy()

    return encoded

df['category_target_enc'] = target_encode_cv(df, 'category', 'is_fraud')

# Compare distributions
print("Original category means:")
print(df.groupby('category')['is_fraud'].mean())
print("\nTarget encoded values (sample):")
print(df[['category', 'category_target_enc']].drop_duplicates().head())
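
The smoothing term is what keeps rare categories from producing extreme encodings. The sketch below isolates the shrinkage formula used in `target_encode_cv` and applies it to two hypothetical categories with the same raw fraud rate but very different sample sizes (the counts and the 0.02 global rate are illustrative assumptions).

```python
def smoothed_mean(cat_mean, cat_count, global_mean, smoothing=10):
    """Blend a category's target mean with the global mean, weighted by sample size."""
    return (cat_count * cat_mean + smoothing * global_mean) / (cat_count + smoothing)

global_mean = 0.02  # assumed overall fraud rate

# Same raw rate (50%), but one category has 2 rows and the other 2000
rare = smoothed_mean(cat_mean=0.5, cat_count=2, global_mean=global_mean)
common = smoothed_mean(cat_mean=0.5, cat_count=2000, global_mean=global_mean)

print(f"rare category:   raw 0.50 -> encoded {rare:.3f}")
print(f"common category: raw 0.50 -> encoded {common:.3f}")
```

With only two rows, the rare category is pulled most of the way back to the global rate, while the well-supported category keeps almost all of its own mean. Larger `smoothing` values strengthen that pull.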

Interaction and Polynomial Features

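Interaction and polynomial features capture relationships that linear models would otherwise miss. An interaction feature multiplies two or more variables so the model can represent effects that depend on their combination:

$$x_{ij} = x_i \cdot x_j$$

For polynomial features of degree $d$, the expansion includes all monomials up to degree $d$. With two inputs and $d = 2$, for example, the terms are $x_1$, $x_2$, $x_1^2$, $x_1 x_2$, and $x_2^2$ (plus a constant column if a bias term is included). The code below builds both kinds: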

from sklearn.preprocessing import PolynomialFeatures

# Manual interaction features
df['age_x_products'] = df['customer_age'] * df['num_products']
df['amount_x_hour'] = df['transaction_amount'] * df['transaction_hour']
df['balance_x_tenure'] = df['account_balance'] * df['tenure_months']

# Polynomial features from sklearn
poly = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)
key_features = df[['customer_age', 'num_products', 'credit_score']].copy()
poly_features = poly.fit_transform(key_features)
poly_df = pd.DataFrame(
    poly_features, 
    columns=poly.get_feature_names_out(['age', 'products', 'credit']),
    index=df.index
)
df = pd.concat([df, poly_df], axis=1)

print(f"Original features: 3 → Polynomial features: {poly_df.shape[1]}")
print(poly_df.head())
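
Polynomial expansion grows quickly with the number of inputs, so it is worth estimating the output width before applying it to a wide table. The number of monomials of degree at most $d$ in $n$ variables is $\binom{n+d}{d}$, including the constant term. The sketch below checks that count against `PolynomialFeatures` on random data; the helper name `n_poly_features` is an assumption for illustration.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

def n_poly_features(n_features, degree):
    """Number of monomials of degree 1..degree in n_features variables (no bias column)."""
    return comb(n_features + degree, degree) - 1  # subtract the constant term

# Cross-check the formula against scikit-learn on a small random matrix
X = np.random.default_rng(0).normal(size=(5, 3))
poly = PolynomialFeatures(degree=2, include_bias=False)
n_generated = poly.fit_transform(X).shape[1]
print(f"formula: {n_poly_features(3, 2)}, sklearn: {n_generated}")

# Growth at degree 3 for wider inputs
for n in (3, 10, 50):
    print(f"{n} features, degree 3 -> {n_poly_features(n, 3)} columns")
```

This is the main reason to restrict the expansion to a handful of key features, as the example above does, or to set `interaction_only=True`.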

Log and Power Transformations

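Log and power transformations reduce skewness and stabilize variance, which mostly benefits linear and distance-based models; tree-based models are largely insensitive to monotonic transforms.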

from sklearn.preprocessing import PowerTransformer

# Log transform for right-skewed features
df['log_amount'] = np.log1p(df['transaction_amount'])
df['log_balance'] = np.log1p(df['account_balance'])

# Square root transform
df['sqrt_amount'] = np.sqrt(df['transaction_amount'])

# Yeo-Johnson power transform (handles zeros and negatives; Box-Cox requires strictly positive inputs)
pt = PowerTransformer(method='yeo-johnson')
df['amount_yeojohnson'] = pt.fit_transform(df[['transaction_amount']]).ravel()

# Compare skewness
for col in ['transaction_amount', 'log_amount', 'amount_yeojohnson']:
    print(f"{col}: skewness = {df[col].skew():.3f}")
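
Unlike `log1p`, the power transform estimates a parameter from the data, so it should be fitted on the training split and reused on anything that arrives later. A minimal sketch on synthetic log-normal amounts (the split sizes and seed are arbitrary assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
amounts = pd.DataFrame({'amount': rng.lognormal(mean=4, sigma=1.2, size=1000)})
train, test = train_test_split(amounts, test_size=0.25, random_state=0)

pt = PowerTransformer(method='yeo-johnson')  # also standardizes the output by default
train_t = pt.fit_transform(train).ravel()    # lambda is estimated from the training rows only
test_t = pt.transform(test).ravel()          # the same lambda is reused, not re-estimated

raw_skew = train['amount'].skew()
transformed_skew = pd.Series(train_t).skew()
print(f"estimated lambda: {pt.lambdas_[0]:.3f}")
print(f"train skewness before: {raw_skew:.3f}, after: {transformed_skew:.3f}")
```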

Missing Value Features

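Missingness itself can be a signal. Note that the synthetic dataset above contains no missing values, so the indicators below are all zero here; the code shows the pattern to apply to real data.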

# Create missingness indicators
df['balance_missing'] = df['account_balance'].isnull().astype(int)
df['credit_missing'] = df['credit_score'].isnull().astype(int)

# Count of missing features per row
missing_cols = ['account_balance', 'credit_score', 'tenure_months']
df['missing_count'] = df[missing_cols].isnull().sum(axis=1)

# Impute and create flag
for col in missing_cols:
    median_val = df[col].median()
    df[f'{col}_imputed'] = df[col].fillna(median_val)
    df[f'{col}_was_missing'] = df[col].isnull().astype(int)

print(df[[c for c in df.columns if 'missing' in c or 'imputed' in c]].sum())
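
scikit-learn can produce the imputed values and the missingness flags in one step, with the medians learned from the training data only. The sketch below uses `SimpleImputer(add_indicator=True)` on two tiny hand-made frames; the values are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

train = pd.DataFrame({
    'account_balance': [1000.0, np.nan, 3000.0, 5000.0],
    'credit_score': [600.0, 700.0, np.nan, 650.0],
})
test = pd.DataFrame({
    'account_balance': [np.nan, 2000.0],
    'credit_score': [720.0, np.nan],
})

# add_indicator=True appends one 0/1 column per feature that had missing values during fit
imputer = SimpleImputer(strategy='median', add_indicator=True)
imputer.fit(train)                  # medians come from the training rows only
test_out = imputer.transform(test)  # columns: 2 imputed features, then 2 missingness flags

print("training medians:", imputer.statistics_)
print(test_out)
```

Because the flags are created only for columns that had gaps at fit time, a feature that first goes missing in production will be imputed but not flagged, which is worth keeping in mind when monitoring.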

Datetime Feature Extraction

Dates contain rich patterns when properly decomposed.

dates = pd.date_range('2024-01-01', periods=n, freq='h')
df['date'] = dates[:n]

# Extract components
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['hour'] = df['date'].dt.hour
df['dayofyear'] = df['date'].dt.dayofyear
df['weekofyear'] = df['date'].dt.isocalendar().week.astype(int)
df['is_weekend'] = df['date'].dt.dayofweek.isin([5, 6]).astype(int)
df['is_month_start'] = df['date'].dt.is_month_start.astype(int)
df['is_month_end'] = df['date'].dt.is_month_end.astype(int)

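# Rolling window features – the data is hourly, so use time-based windows rather than row counts
# (window=7 would span 7 hours, not 7 days); each window includes the current row
amount_by_time = df.set_index('date')['transaction_amount']
df['amount_7d_mean'] = amount_by_time.rolling('7D', min_periods=1).mean().to_numpy()
df['amount_7d_std'] = amount_by_time.rolling('7D', min_periods=1).std().to_numpy()
df['amount_30d_mean'] = amount_by_time.rolling('30D', min_periods=1).mean().to_numpy()

# Lag features – shifts are counted in rows, so lag1 is one hour back and lag7 is seven hours back
df['amount_lag1'] = df['transaction_amount'].shift(1)
df['amount_lag7'] = df['transaction_amount'].shift(7)
df['amount_diff1'] = df['transaction_amount'].diff(1)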

print(df[['date', 'is_weekend', 'amount_7d_mean', 'amount_lag1']].head(10))
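
One detail to watch with rolling features is whether the window includes the current row. If the feature is meant to describe history available before the transaction, shifting first keeps the current value out of its own window. A small sketch on a five-day toy series:

```python
import pandas as pd

amounts = pd.Series(
    [10.0, 20.0, 30.0, 40.0, 50.0],
    index=pd.date_range('2024-01-01', periods=5, freq='D'),
)

# Includes the current observation in its own window
rolling_incl = amounts.rolling(window=3, min_periods=1).mean()

# Shift first so each window holds only strictly earlier observations
rolling_past = amounts.shift(1).rolling(window=3, min_periods=1).mean()

print(pd.DataFrame({
    'amount': amounts,
    'incl_current': rolling_incl,
    'past_only': rolling_past,
}))
```

The first row of the past-only column is NaN because no history exists yet; how to fill it (global mean, zero, or a separate flag) is a modeling choice.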

Building a Feature Engineering Pipeline

This section combines several of the steps above into a reusable scikit-learn pipeline.

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Inheriting from TransformerMixin and BaseEstimator provides fit_transform,
# get_params, and cloning, which Pipeline and ColumnTransformer rely on
class CyclicalEncoder(TransformerMixin, BaseEstimator):
    """Replace a periodic column with its sine and cosine components."""
    def __init__(self, col, max_val):
        self.col = col
        self.max_val = max_val

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn

    def transform(self, X):
        X = X.copy()
        X[f'{self.col}_sin'] = np.sin(2 * np.pi * X[self.col] / self.max_val)
        X[f'{self.col}_cos'] = np.cos(2 * np.pi * X[self.col] / self.max_val)
        return X.drop(columns=[self.col])

class LogTransformer(TransformerMixin, BaseEstimator):
    """Apply log1p to the given columns."""
    def __init__(self, cols):
        self.cols = cols

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn

    def transform(self, X):
        X = X.copy()
        for col in self.cols:
            X[col] = np.log1p(X[col])
        return X

# Define column groups
numeric_features = ['transaction_amount', 'account_balance', 'credit_score']
categorical_features = ['category']
cyclical_features = {'transaction_hour': 24, 'transaction_dayofweek': 7}

# Build preprocessing pipeline
preprocessor = Pipeline([
    ('log_transform', LogTransformer(['transaction_amount', 'account_balance'])),
    ('scaler', StandardScaler())
])

feature_engineering_pipeline = ColumnTransformer(
    [
        ('numeric', preprocessor, numeric_features),
        ('categorical', OneHotEncoder(drop='first', sparse_output=False), categorical_features),
    ]
    # One cyclical encoder per periodic column
    + [
        (f'cyclical_{col}', CyclicalEncoder(col, period), [col])
        for col, period in cyclical_features.items()
    ]
)

# Fit and transform
X_engineered = feature_engineering_pipeline.fit_transform(df)
print(f"Final feature matrix shape: {X_engineered.shape}")
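
The call above fits and transforms the same DataFrame. The main benefit of a pipeline, though, is that statistics learned during `fit` can be reapplied unchanged to held-out data. The following standalone sketch shows that pattern on a small synthetic frame; it swaps the custom log class for scikit-learn's built-in `FunctionTransformer` and uses `handle_unknown='ignore'`, both of which are choices made for this illustration rather than part of the pipeline above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
data = pd.DataFrame({
    'transaction_amount': rng.lognormal(mean=4, sigma=1.2, size=200),
    'category': rng.choice(['electronics', 'groceries', 'travel'], size=200),
})
train, test = train_test_split(data, test_size=0.25, random_state=0)

pipeline = ColumnTransformer([
    ('numeric', Pipeline([
        ('log', FunctionTransformer(np.log1p)),  # stateless log transform
        ('scale', StandardScaler()),             # mean/std learned during fit
    ]), ['transaction_amount']),
    # handle_unknown='ignore' encodes categories unseen during fit as all zeros
    ('categorical', OneHotEncoder(handle_unknown='ignore'), ['category']),
])

X_train = pipeline.fit_transform(train)  # statistics are learned here
X_test = pipeline.transform(test)        # and only applied here
print(X_train.shape, X_test.shape)
```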

Best Practices

Always split before target encoding – prevent target leakage
Monitor feature drift – engineered features may need recalibration over time
Use domain knowledge – the best features come from understanding the business problem
Start simple, add complexity – begin with obvious features before crafting exotic ones
Document every transformation – reproducibility requires knowing what you did and why
Validate with cross-validation – especially for target encoding and feature selection
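
For the drift-monitoring point, one commonly used heuristic is the Population Stability Index (PSI), which compares a feature's binned distribution at training time with its distribution in newer data. The sketch below is a simplified version under stated assumptions: quantile bins taken from the reference sample, new values clipped into the reference range, and a small `eps` to avoid taking the log of zero. The function name and defaults are illustrative, not a standard API.

```python
import numpy as np

def population_stability_index(expected, actual, n_bins=10, eps=1e-6):
    """PSI between a reference sample and a newer sample of one numeric feature."""
    # Bin edges come from the reference (training-time) distribution
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    # Clip new values into the reference range so none fall outside the bins
    actual = np.clip(actual, edges[0], edges[-1])
    exp_frac = np.histogram(expected, bins=edges)[0] / len(expected)
    act_frac = np.histogram(actual, bins=edges)[0] / len(actual)
    # eps guards against log(0) when a bin is empty
    exp_frac = np.clip(exp_frac, eps, None)
    act_frac = np.clip(act_frac, eps, None)
    return float(np.sum((act_frac - exp_frac) * np.log(act_frac / exp_frac)))

rng = np.random.default_rng(0)
reference = rng.normal(650, 80, size=5000)  # e.g. credit scores at training time
same = rng.normal(650, 80, size=5000)       # new data, same distribution
shifted = rng.normal(700, 80, size=5000)    # new data, mean shifted upward

psi_same = population_stability_index(reference, same)
psi_shifted = population_stability_index(reference, shifted)
print(f"PSI, same distribution:    {psi_same:.4f}")
print(f"PSI, shifted distribution: {psi_shifted:.4f}")
```

What counts as "too much" drift depends on the application, so any alert threshold should be calibrated against historical data rather than taken from a rule of thumb.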

Summary

Advanced feature engineering transforms raw data into predictive power. Master these techniques – cyclical encoding, target encoding, interaction features, missingness indicators, and domain-driven feature creation – and you'll consistently extract more signal from the same data than you would by relying on default preprocessing alone.