Why Data Preprocessing Matters
Raw data is often messy, incomplete, and inconsistent. Data preprocessing transforms raw data into a format suitable for analysis and machine learning. It's estimated that data scientists spend 60-80% of their time on data preprocessing.
Common Data Quality Issues
- Missing Values: Empty cells in the dataset
- Inconsistent Data: Different formats for same information
- Outliers: Extreme values that distort analysis
- Duplicate Records: Repeated entries
- Incorrect Data Types: Wrong data types for columns
- Noise: Random errors in measurements
Handling Missing Values
Different strategies for different scenarios:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
# Option 1: Remove rows with missing values
df_clean = df.dropna()
# Option 2: Remove columns with too many missing values
df_clean = df.dropna(axis=1, thresh=0.7*len(df))
# Option 3: Fill with a specific value
df['column'].fillna(0, inplace=True)
df['column'].fillna('Unknown', inplace=True)
# Option 4: Fill with mean/median (numerical)
df['column'].fillna(df['column'].mean(), inplace=True)
df['column'].fillna(df['column'].median(), inplace=True)
# Option 5: Forward fill or backward fill (time series)
df['column'].fillna(method='ffill', inplace=True)
df['column'].fillna(method='bfill', inplace=True)
# Option 6: Using sklearn imputer
imputer = SimpleImputer(strategy='mean') # mean, median, most_frequent
imputed_data = imputer.fit_transform(df)
# Option 7: Using KNN imputer
from sklearn.impute import KNNImputer
knn_imputer = KNNImputer(n_neighbors=5)
imputed_data = knn_imputer.fit_transform(df)
Handling Outliers
Statistical methods for detecting and treating outliers:
# Method 1: Z-score method
from scipy import stats
z_scores = np.abs(stats.zscore(df['column']))
df_clean = df[z_scores < 3]
# Method 2: IQR method
Q1 = df['column'].quantile(0.25)
Q3 = df['column'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
df_clean = df[(df['column'] >= lower_bound) & (df['column'] <= upper_bound)]
# Method 3: Winsorization (capping)
df['column'] = df['column'].clip(lower=df['column'].quantile(0.05),
upper=df['column'].quantile(0.95))
# Method 4: Log transformation for skewed data
df['column'] = np.log1p(df['column'])
Data Type Conversion
Ensuring correct data types:
# Convert to numeric
df['column'] = pd.to_numeric(df['column'], errors='coerce')
# Convert to datetime
df['date'] = pd.to_datetime(df['date'])
# Convert to category (for categorical data)
df['category'] = df['category'].astype('category')
# Convert strings to numeric
df['numeric_string'] = df['numeric_string'].str.replace(',', '').astype(float)
Removing Duplicates
# Check for duplicates
print(df.duplicated().sum())
# Remove duplicates
df_clean = df.drop_duplicates()
# Remove duplicates based on specific columns
df_clean = df.drop_duplicates(subset=['column1', 'column2'])
# Keep last occurrence
df_clean = df.drop_duplicates(keep='last')
String Manipulation
Cleaning text data:
# Remove whitespace
df['column'] = df['column'].str.strip()
# Convert to lowercase
df['column'] = df['column'].str.lower()
# Remove special characters
df['column'] = df['column'].str.replace(r'[^a-zA-Z0-9]', '', regex=True)
# Split into multiple columns
df[['first_name', 'last_name']] = df['name'].str.split(' ', expand=True)
# Extract patterns
df['email_domain'] = df['email'].str.extract(r'@(.*)')
Data Normalization and Standardization
Feature scaling for machine learning:
from sklearn.preprocessing import MinMaxScaler, StandardScaler
# Min-Max Scaling (0 to 1)
scaler_minmax = MinMaxScaler()
df_scaled = scaler.fit_transform(df)
# Standardization (mean=0, std=1)
scaler_std = StandardScaler()
df_standardized = scaler_std.fit_transform(df)
# Robust Scaler (uses median and IQR)
from sklearn.preprocessing import RobustScaler
scaler_robust = RobustScaler()
df_robust = scaler_robust.fit_transform(df)
# Log transformation for skewed data
df['skewed_column'] = np.log1p(df['skewed_column'])
Encoding Categorical Variables
Converting text to numbers:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import pandas as pd
# Label Encoding (ordinal)
le = LabelEncoder()
df['category_encoded'] = le.fit_transform(df['category'])
# One-Hot Encoding (nominal)
df_encoded = pd.get_dummies(df, columns=['category'], prefix='cat')
# Using pandas get_dummies
df = pd.get_dummies(df, columns=['category1', 'category2'])
# Target Encoding (using target variable)
target_mean = df.groupby('category')['target'].mean()
df['category_encoded'] = df['category'].map(target_mean)
# Frequency Encoding
freq = df['category'].value_counts(normalize=True)
df['category_freq'] = df['category'].map(freq)
DateTime Feature Engineering
Extracting useful features from dates:
df['date'] = pd.to_datetime(df['date'])
# Extract components
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['dayofweek'] = df['date'].dt.dayofweek
df['quarter'] = df['date'].dt.quarter
# Time-based features
df['is_weekend'] = df['dayofweek'].isin([5, 6]).astype(int)
df['hour'] = df['date'].dt.hour
# Calculate time differences
df['days_since'] = (df['date'] - df['date'].min()).dt.days
Best Practices
- Always explore your data before preprocessing
- Document all preprocessing steps
- Create pipelines for reproducibility
- Handle missing values based on context
- Test preprocessing on validation data
- Consider the downstream model requirements