Feature Engineering Techniques

What is Feature Engineering?

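Feature engineering is the process of transforming raw data into features (model inputs) that make patterns easier for a learning algorithm to detect. Well-designed features often let a simple model perform well, whereas poorly chosen features tend to require a more complex model to reach the same accuracy.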

Importance of Feature Engineering

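  • Can improve model accuracy
  • Can reduce the model complexity needed for a given accuracy
  • Turns missing values into usable signal (for example, a missing-value indicator)
  • Encodes domain knowledge the model cannot infer on its own
  • Can reduce overfitting when noisy or redundant inputs are removed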

Types of Features

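  1. Numerical Features: Continuous or discrete numeric values (e.g., income, counts)
  2. Categorical Features: Values drawn from a fixed set of labels (e.g., city, product type)
  3. Temporal Features: Dates, times, and quantities derived from them
  4. Text Features: Natural language
  5. Image Features: Visual data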
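
As a rough illustration of the first four types, the sketch below sorts the columns of a small DataFrame by type. The data and column names are hypothetical, and the rule used to separate free text from categories is a crude heuristic chosen for this example, not a standard method. Image features are left out because they are rarely stored as plain columns.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only
df = pd.DataFrame({
    "income": [42000.0, 58000.0, 31000.0],
    "city": ["Oslo", "Lima", "Oslo"],
    "signup_date": pd.to_datetime(["2021-01-05", "2021-03-17", "2022-07-30"]),
    "review": ["great product overall", "too slow", "works fine"],
})

numerical_cols = df.select_dtypes(include="number").columns.tolist()
temporal_cols = df.select_dtypes(include="datetime").columns.tolist()

# The remaining columns hold strings. Crude heuristic (an assumption, not a rule):
# columns whose values average more than one word are treated as free text
string_cols = [c for c in df.columns if c not in numerical_cols + temporal_cols]
text_cols = [c for c in string_cols if df[c].str.split().str.len().mean() > 1]
categorical_cols = [c for c in string_cols if c not in text_cols]

print(numerical_cols, categorical_cols, temporal_cols, text_cols)
```

On real data the split between categorical and text columns usually needs manual review.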

Numerical Feature Transformation

import numpy as np
import pandas as pd

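# The examples below assume X is a numeric feature matrix and df is a pandas DataFrame

# Polynomial Features: degree=2 adds squares and pairwise products of the inputs
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

# Binning/Discretization: intervals are right-inclusive, e.g. (18, 35];
# include_lowest=True keeps age 0 in the first bin instead of producing NaN
df['age_bin'] = pd.cut(df['age'], bins=[0, 18, 35, 50, 100],
                       labels=['child', 'young', 'adult', 'senior'],
                       include_lowest=True)

# Log transformation: log1p computes log(1 + x), so zeros are allowed (x must be > -1)
df['log_income'] = np.log1p(df['income'])

# Square root transformation: requires non-negative values
df['sqrt_value'] = np.sqrt(df['value'])

# Box-Cox transformation: requires strictly positive values;
# lambda_param is the fitted power parameter
from scipy import stats
df['boxcox_value'], lambda_param = stats.boxcox(df['value'])

# Power transformation: Yeo-Johnson also accepts zero and negative values.
# The result is a NumPy array, standardized by default (standardize=True)
from sklearn.preprocessing import PowerTransformer
pt = PowerTransformer(method='yeo-johnson')
df_transformed = pt.fit_transform(df[['value']])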
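
To see what these transformations do, the following sketch measures skewness before and after the log and Yeo-Johnson transforms. It uses synthetic log-normal data generated for this example (the `income` values are not real), so treat it as an illustration rather than a benchmark.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

# Synthetic right-skewed data (an assumption made for this illustration)
rng = np.random.default_rng(0)
df = pd.DataFrame({"income": rng.lognormal(mean=3, sigma=1, size=1000)})

df["log_income"] = np.log1p(df["income"])

pt = PowerTransformer(method="yeo-johnson")
# fit_transform returns a 2D array; ravel() flattens it into one column
df["yj_income"] = pt.fit_transform(df[["income"]]).ravel()

# Skewness near 0 indicates a roughly symmetric distribution
skew = df[["income", "log_income", "yj_income"]].skew()
print(skew.round(2))
```

Both transformed columns should show skewness much closer to zero than the raw column.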

Interaction and Categorical Feature Engineering

# Creating interaction features
df['age_income'] = df['age'] * df['income']

# Ratio features: treat zero income as missing to avoid division by zero
df['debt_income_ratio'] = df['debt'] / df['income'].replace(0, np.nan)

# Binary features from categorical
df['is_urban'] = (df['location'] == 'Urban').astype(int)

# Count encoding
count_map = df['category'].value_counts().to_dict()
df['category_count'] = df['category'].map(count_map)

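# Target encoding: the means below use every row's target, which leaks label
# information; in practice, compute them on training folds only
mean_target = df.groupby('category')['target'].mean()
df['category_target_mean'] = df['category'].map(mean_target)

# Weight of Evidence encoding (expects a binary target coded as 0/1)
def calculate_woe(data, feature, target, smoothing=0.5):
    cross_tab = pd.crosstab(data[feature], data[target])
    # Smooth the counts so a category with no events or no non-events
    # does not produce log(0) or a division by zero
    events = cross_tab[1] + smoothing
    non_events = cross_tab[0] + smoothing
    woe = np.log((events / events.sum()) / (non_events / non_events.sum()))
    return woe.to_dict()

woe_map = calculate_woe(df, 'category', 'target')
df['category_woe'] = df['category'].map(woe_map)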
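
Target encoding computed on the full dataset lets each row see its own label. One common way to limit this is out-of-fold encoding, sketched below on a hypothetical toy dataset. The column name `category_te_oof`, the fold count, and the fallback to the fold's overall mean are choices made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Hypothetical toy data for illustration
df = pd.DataFrame({
    "category": ["a", "a", "b", "b", "b", "c", "a", "c", "b", "a"],
    "target":   [1, 0, 1, 1, 0, 0, 1, 1, 0, 1],
})

df["category_te_oof"] = np.nan
kf = KFold(n_splits=5, shuffle=True, random_state=0)

for fit_idx, enc_idx in kf.split(df):
    fit_part = df.iloc[fit_idx]
    # Category means come only from rows outside the fold being encoded
    fold_means = fit_part.groupby("category")["target"].mean()
    # Fallback for categories that do not appear in the fitting rows
    fold_prior = fit_part["target"].mean()
    encoded = df.iloc[enc_idx]["category"].map(fold_means).fillna(fold_prior)
    df.loc[df.index[enc_idx], "category_te_oof"] = encoded.to_numpy()

print(df)
```

Each row's encoding is built without using that row's own target value.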

Date/Time Feature Engineering

df['date'] = pd.to_datetime(df['date'])

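# Cyclical encoding for periodic features: sine and cosine together place
# December next to January. A fixed period of 31 is an approximation for days,
# since shorter months leave a small gap at the wrap-around
df['month_sin'] = np.sin(2 * np.pi * df['date'].dt.month / 12)
df['month_cos'] = np.cos(2 * np.pi * df['date'].dt.month / 12)
df['day_sin'] = np.sin(2 * np.pi * df['date'].dt.day / 31)
df['day_cos'] = np.cos(2 * np.pi * df['date'].dt.day / 31)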

# Time since reference date
df['days_since_start'] = (df['date'] - pd.Timestamp('2020-01-01')).dt.days

# Calendar boundary flags (month and quarter boundaries, not business days)
df['is_month_start'] = df['date'].dt.is_month_start.astype(int)
df['is_month_end'] = df['date'].dt.is_month_end.astype(int)
df['is_quarter_start'] = df['date'].dt.is_quarter_start.astype(int)

# Days in month
df['days_in_month'] = df['date'].dt.days_in_month
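
The point of the sine/cosine encoding is easiest to see by comparing distances. In the small sketch below, the helper `encoded_distance` is a name introduced for this example. It shows that December and January are 11 apart as raw integers but adjacent in the encoded space.

```python
import numpy as np

months = np.arange(1, 13)
angle = 2 * np.pi * months / 12
month_sin, month_cos = np.sin(angle), np.cos(angle)

def encoded_distance(m1, m2):
    """Euclidean distance between two months in (sin, cos) space."""
    i, j = m1 - 1, m2 - 1
    return float(np.hypot(month_sin[i] - month_sin[j],
                          month_cos[i] - month_cos[j]))

dec_jan = encoded_distance(12, 1)   # adjacent months across the year boundary
jan_feb = encoded_distance(1, 2)    # adjacent months within the year
jan_jul = encoded_distance(1, 7)    # opposite sides of the circle

print(round(dec_jan, 3), round(jan_feb, 3), round(jan_jul, 3))
```

Adjacent months are equally far apart wherever they fall in the year, and months six apart are at the maximum distance of 2.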

Text Feature Engineering

import re
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

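# Basic text cleaning (assumes no missing values; keeps ASCII letters only)
def clean_text(text):
    text = text.lower()
    text = re.sub(r'[^a-z]', ' ', text)   # drop digits, punctuation, non-ASCII characters
    text = re.sub(r'\s+', ' ', text)      # collapse repeated whitespace
    return text.strip()

df['cleaned_text'] = df['text'].apply(clean_text)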

# Bag of Words
bow = CountVectorizer(max_features=1000)
bow_features = bow.fit_transform(df['cleaned_text'])

# TF-IDF
tfidf = TfidfVectorizer(max_features=1000)
tfidf_features = tfidf.fit_transform(df['cleaned_text'])

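# Text statistics features (computed on the raw text, before cleaning)
df['text_length'] = df['text'].str.len()
df['word_count'] = df['text'].str.split().str.len()
# text_length includes whitespace, so this is an approximation;
# a zero word count becomes NaN rather than a division by zero
df['avg_word_length'] = df['text_length'] / df['word_count'].replace(0, np.nan)
df['unique_word_count'] = df['text'].apply(lambda x: len(set(x.split())))
df['uppercase_count'] = df['text'].str.count(r'[A-Z]')
df['digit_count'] = df['text'].str.count(r'\d')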

Aggregation Features

# Group-based aggregations
df['user_avg_spend'] = df.groupby('user_id')['amount'].transform('mean')
df['user_total_spend'] = df.groupby('user_id')['amount'].transform('sum')
df['user_count'] = df.groupby('user_id')['transaction_id'].transform('count')

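# Rolling aggregations (time series)
# rolling(7) spans the last 7 rows per user, which equals 7 days only when
# each user has exactly one row per day
df = df.sort_values(['user_id', 'date'])
df['rolling_mean_7d'] = df.groupby('user_id')['amount'].transform(
    lambda x: x.rolling(7, min_periods=1).mean())
# The rolling std is NaN for a user's first row (only one observation)
df['rolling_std_7d'] = df.groupby('user_id')['amount'].transform(
    lambda x: x.rolling(7, min_periods=1).std())

# Lag features: shift(n) moves back n rows within each user, not n days
df['amount_lag_1'] = df.groupby('user_id')['amount'].shift(1)
df['amount_lag_7'] = df.groupby('user_id')['amount'].shift(7)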

# Difference features
df['amount_diff_1'] = df.groupby('user_id')['amount'].diff(1)
df['amount_diff_7'] = df.groupby('user_id')['amount'].diff(7)
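
When transactions are irregularly spaced, a window defined in calendar time may be closer to the intent of a "7-day" feature. The sketch below uses a `'7D'` offset window with `closed='left'` so that each row's mean covers only earlier transactions from the previous seven days. The data and the column name `prev_7d_mean` are hypothetical.

```python
import pandas as pd

# Hypothetical transactions, sorted by user and date
df = pd.DataFrame({
    "user_id": [1, 1, 1, 1, 2, 2],
    "date": pd.to_datetime(["2024-01-01", "2024-01-03", "2024-01-09",
                            "2024-01-20", "2024-02-01", "2024-02-03"]),
    "amount": [10.0, 20.0, 30.0, 40.0, 5.0, 15.0],
}).sort_values(["user_id", "date"]).reset_index(drop=True)

# '7D' is a time-based window; closed='left' excludes the current row,
# so the feature uses only information available before the transaction
rolled = (
    df.set_index("date")
      .groupby("user_id")["amount"]
      .rolling("7D", closed="left")
      .mean()
)

# The result is ordered by user, then by date, matching the sorted frame
df["prev_7d_mean"] = rolled.to_numpy()
print(df)
```

Rows with no earlier transaction inside the window get NaN, which can be filled or kept as a signal of inactivity.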

Feature Selection

from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.ensemble import RandomForestClassifier

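# X is assumed to be a pandas DataFrame of features and y the class labels.
# Fit selectors on training data only; otherwise the selection leaks test information

# Univariate selection: keep the 10 features with the highest ANOVA F-scores
selector = SelectKBest(f_classif, k=10)
X_selected = selector.fit_transform(X, y)

# Recursive Feature Elimination: repeatedly drop the least important feature
model = RandomForestClassifier(n_estimators=100, random_state=42)
rfe = RFE(model, n_features_to_select=10)
X_rfe = rfe.fit_transform(X, y)

# Feature importance from tree-based models
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)
importance = pd.DataFrame({
    'feature': X.columns,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)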
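
One way to keep feature selection from seeing validation data is to place the selector inside a pipeline, so it is refit on each training fold. The sketch below uses a synthetic dataset from `make_classification`; the sample size, feature counts, and number of trees are arbitrary choices for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic data: 30 features, of which 5 carry signal
X, y = make_classification(n_samples=300, n_features=30,
                           n_informative=5, random_state=0)

pipeline = make_pipeline(
    SelectKBest(f_classif, k=10),   # refit inside every training fold
    RandomForestClassifier(n_estimators=50, random_state=0),
)

scores = cross_val_score(pipeline, X, y, cv=5)
print(scores.mean())

# After fitting on all data, inspect which features were kept
pipeline.fit(X, y)
selected_mask = pipeline[0].get_support()
```

The cross-validated score then reflects both the selection step and the model, not the model alone.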

Feature Scaling

from sklearn.preprocessing import StandardScaler, MinMaxScaler

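# Standardization: zero mean, unit variance
# fit_transform returns a 2D array; ravel() flattens it into a single column
scaler = StandardScaler()
df['feature_scaled'] = scaler.fit_transform(df[['feature']]).ravel()

# Min-Max scaling: rescales values to the [0, 1] range
scaler = MinMaxScaler()
df['feature_normalized'] = scaler.fit_transform(df[['feature']]).ravel()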

# Rank transformation
df['feature_rank'] = df['feature'].rank()
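
In a modeling workflow, scaling statistics are normally learned from the training split and then reused on held-out data. A minimal sketch on synthetic values follows; the split ratio and the distribution are assumptions made for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic single-feature data
rng = np.random.default_rng(0)
X = rng.normal(loc=50, scale=10, size=(200, 1))
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # learn mean and std from train only
X_test_scaled = scaler.transform(X_test)         # reuse the training statistics

print(X_train_scaled.mean(), X_train_scaled.std())
print(X_test_scaled.mean(), X_test_scaled.std())
```

The test split will not have exactly zero mean and unit variance, which is expected because it was scaled with the training statistics.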

Key Takeaways

  1. Feature engineering can significantly boost model performance
  2. Domain knowledge is crucial for creating meaningful features
  3. Consider the target variable when creating features, but compute target-dependent features (such as target encoding) on training data only to avoid leakage
  4. Validate new features through cross-validation
  5. Document all feature transformations for reproducibility
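
As an illustration of validating a feature through cross-validation, the sketch below compares a linear model with and without an interaction term. The data is synthetic and deliberately constructed so that the label depends on the product of `x1` and `x2` (both hypothetical names), so the size of the improvement is not representative of real datasets.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: the label depends on the sign of x1 * x2
rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=500), "x2": rng.normal(size=500)})
y = (df["x1"] * df["x2"] > 0).astype(int)

# Candidate engineered feature
df["x1_x2"] = df["x1"] * df["x2"]

model = LogisticRegression(max_iter=1000)
baseline = cross_val_score(model, df[["x1", "x2"]], y, cv=5).mean()
with_feature = cross_val_score(model, df[["x1", "x2", "x1_x2"]], y, cv=5).mean()

print(f"without interaction: {baseline:.3f}")
print(f"with interaction:    {with_feature:.3f}")
```

Keeping the feature is justified only if the cross-validated score improves consistently across folds.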
