Data Cleaning: Missing Values, Outliers and Types
Master data cleaning — handling missing values, outliers, duplicates, and type conversions.
Missing Data Mechanisms
DfMissing Completely at Random (MCAR)
Data is MCAR if the probability of a value being missing is independent of both the observed and unobserved data. This is the strongest assumption — missingness provides no information about the missing values. Example: data entry errors, random equipment failures.
DfMissing at Random (MAR)
Data is MAR if the probability of missingness depends only on observed data (not on the missing values themselves). Example: men are less likely to report depression scores, but within men, missingness doesn't depend on the actual depression level.
DfMissing Not at Random (MNAR)
Data is MNAR if the missingness depends on the missing values themselves. This is the most problematic mechanism — the missing data contains information. Example: high-income individuals refusing to report income.
Detecting Missing Values
import pandas as pd
import numpy as np
df = pd.DataFrame({
'A': [1, np.nan, 3, np.nan, 5],
'B': [np.nan, 2, 3, 4, np.nan],
'C': [1, 2, np.nan, 4, 5],
'D': ['x', 'y', None, 'w', 'z']
})
# Basic detection
print(df.isnull()) # Boolean DataFrame
print(df.isnull().sum()) # Count per column
print(df.isnull().sum().sum()) # Total missing
print(df.isnull().mean() * 100) # Percentage per column
# Row-level missingness
print(df.isnull().sum(axis=1)) # Missing count per row
# Missing data patterns
missing_pattern = df.isnull().value_counts()
print(missing_pattern)
# Heatmap visualization
import seaborn as sns
import matplotlib.pyplot as plt
sns.heatmap(df.isnull(), cbar=True, yticklabels=False)
plt.title('Missing Data Heatmap')
plt.show()
Handling Missing Values
Deletion
# Drop rows with any NaN
df_drop_any = df.dropna()
# Drop rows where all values are NaN
df_drop_all = df.dropna(how='all')
# Drop rows with fewer than 3 non-NaN values
df_drop_thresh = df.dropna(thresh=3)
# Drop columns with NaN
df_drop_cols = df.dropna(axis=1)
# Drop rows where specific columns are NaN
df_drop_subset = df.dropna(subset=['A', 'B'])
# Fraction of data retained
print(f"Rows retained: {len(df_drop_any)/len(df)*100:.1f}%")
Deletion Bias
Listwise deletion (dropping rows) is unbiased only under MCAR. Under MAR, it produces biased estimates because the deleted rows are not a random sample. Under MAR, use imputation instead. Always check the missing data mechanism before choosing a strategy.
Imputation
Mean Imputation
Here,
- =Imputed value for missing observation i
- =Mean of observed values
- =Number of observed (non-missing) values
# Mean imputation (numerical only)
df['A_filled'] = df['A'].fillna(df['A'].mean())
# Median imputation (robust to outliers)
df['B_filled'] = df['B'].fillna(df['B'].median())
# Mode imputation (categorical)
df['D_filled'] = df['D'].fillna(df['D'].mode()[0])
# Constant imputation
df['C_filled'] = df['C'].fillna(0)
# Forward fill / backward fill (time series)
df['A_ffill'] = df['A'].ffill() # Use previous valid value
df['A_bfill'] = df['A'].bfill() # Use next valid value
# Interpolation (numerical, time series)
df['A_interp'] = df['A'].interpolate(method='linear')
Mean Imputation Distorts Variance
Mean imputation reduces variance by a factor of where is the proportion of missing data. It also distorts correlations toward zero. For statistical analysis, prefer multiple imputation or model-based methods.
Advanced Imputation
from sklearn.impute import KNNImputer, SimpleImputer
# KNN Imputation
imputer_knn = KNNImputer(n_neighbors=3)
df_numeric = df[['A', 'B', 'C']].copy()
df_imputed = pd.DataFrame(
imputer_knn.fit_transform(df_numeric),
columns=df_numeric.columns
)
print(df_imputed)
# Iterative Imputation (MICE-like)
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
imputer_iter = IterativeImputer(max_iter=10, random_state=42)
df_iter_imputed = pd.DataFrame(
imputer_iter.fit_transform(df_numeric),
columns=df_numeric.columns
)
# Simple Imputer with strategy
imputer = SimpleImputer(strategy='median')
df_simple = pd.DataFrame(
imputer.fit_transform(df_numeric),
columns=df_numeric.columns
)
KNN Imputation
Here,
- =Set of k nearest neighbors of observation i
- =Inverse distance weight for neighbor j
- =Distance between observations i and j
- =Observed value at neighbor j
Outlier Detection
DfOutlier
An outlier is an observation that lies an abnormal distance from other values in a dataset. Outliers can be genuine extreme values (signal) or data errors (noise). The appropriate treatment depends on whether the outlier represents a valid but rare phenomenon or a measurement/recording error.
IQR Method
def detect_outliers_iqr(series, multiplier=1.5):
Q1 = series.quantile(0.25)
Q3 = series.quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - multiplier * IQR
upper = Q3 + multiplier * IQR
return (series < lower) | (series > upper)
# Apply to DataFrame
numeric_cols = df.select_dtypes(include=[np.number]).columns
outlier_mask = df[numeric_cols].apply(detect_outliers_iqr)
print(f"Outliers per column:\n{outlier_mask.sum()}")
# Count of outliers per row
print(f"Outliers per row:\n{outlier_mask.sum(axis=1)}")
Z-Score Method
from scipy import stats
def detect_outliers_zscore(series, threshold=3):
z_scores = np.abs(stats.zscore(series.dropna()))
mask = pd.Series(False, index=series.index)
mask[series.dropna().index] = z_scores > threshold
return mask
# Modified Z-score (using MAD — more robust)
def detect_outliers_mad(series, threshold=3.5):
median = series.median()
mad = np.median(np.abs(series - median))
modified_z = 0.6745 * (series - median) / mad
return np.abs(modified_z) > threshold
Handling Outliers
# Winsorization (cap at percentiles)
from scipy.stats import mstats
df['A_winsorized'] = mstats.winsorize(df['A'], limits=[0.05, 0.05])
# Clip to IQR bounds
Q1 = df['A'].quantile(0.25)
Q3 = df['A'].quantile(0.75)
IQR = Q3 - Q1
df['A_clipped'] = df['A'].clip(lower=Q1-1.5*IQR, upper=Q3+1.5*IQR)
# Log transformation (for right-skewed data)
df['A_log'] = np.log1p(df['A']) # log(1 + x) handles zeros
# Replace with NaN (treat as missing)
outlier_mask = detect_outliers_iqr(df['A'])
df.loc[outlier_mask, 'A'] = np.nan
When to Remove vs Keep Outliers
- Remove: Data entry errors, measurement artifacts, clearly invalid values
- Keep: Genuine extreme values (e.g., billionaire wealth in income data)
- Transform: Log or Box-Cox to reduce skewness while keeping the data
- Winsorize: Cap extreme values at a specified percentile
- Separate analysis: Model outliers separately if they represent a distinct population
Duplicates
df_dup = pd.DataFrame({
'name': ['Alice', 'Bob', 'Alice', 'Diana', 'Bob'],
'age': [25, 30, 25, 28, 30],
'score': [95, 87, 95, 92, 88] # Note: Bob has different score
})
# Exact duplicates (all columns)
print(df_dup.duplicated().sum()) # 1 exact duplicate (row 0)
print(df_dup[df_dup.duplicated()])
# Subset duplicates (key columns)
print(df_dup.duplicated(subset=['name']).sum()) # 2 (Alice and Bob appear twice)
# Keep options
df_dup.drop_duplicates(subset=['name'], keep='first') # Keep first occurrence
df_dup.drop_duplicates(subset=['name'], keep='last') # Keep last occurrence
df_dup.drop_duplicates(subset=['name'], keep=False) # Drop all duplicates
# Mark duplicates without removing
df_dup['is_dup'] = df_dup.duplicated(subset=['name'], keep=False)
Determining Duplicate Strategy
Before removing duplicates, understand why they exist:
- Exact duplicates: Likely data entry errors → drop
- Key duplicates with different values: Possible legitimate entries → investigate
- Temporal duplicates: Latest version may be authoritative → keep last
- Near-duplicates: Use fuzzy matching (Levenshtein distance) for text fields
Type Conversions
# String to numeric
df['price'] = pd.to_numeric(df['price_str'], errors='coerce') # NaN on failure
df['price'] = pd.to_numeric(df['price_str'], errors='ignore') # Keep as string
df['price'] = pd.to_numeric(df['price_str'], errors='raise') # Raise exception
# String to datetime
df['date'] = pd.to_datetime(df['date_str'], format='%Y-%m-%d', errors='coerce')
df['date'] = pd.to_datetime(df['date_str'], infer_datetime_format=True)
# Extract datetime components
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['dayofweek'] = df['date'].dt.dayofweek # Monday=0
df['quarter'] = df['date'].dt.quarter
# Optimize dtypes
df['age'] = df['age'].astype('int8') # -128 to 127
df['salary'] = df['salary'].astype('int32') # Instead of int64
df['grade'] = df['grade'].astype('category') # Categorical
# Check memory savings
print(f"Before: {df.memory_usage(deep=True).sum() / 1024:.1f} KB")
df_optimized = df.copy()
for col in df_optimized.select_dtypes(include='object').columns:
if df_optimized[col].nunique() / len(df_optimized) < 0.5:
df_optimized[col] = df_optimized[col].astype('category')
print(f"After: {df_optimized.memory_usage(deep=True).sum() / 1024:.1f} KB")
Category Dtype for Memory Efficiency
String columns with low cardinality (few unique values relative to total rows) should use category dtype. It stores integers internally with a lookup table. For a column with 1M rows and 10 unique values, category uses ~10x less memory than object (strings).
Text Cleaning
# Whitespace
df['name'] = df['name'].str.strip() # Remove leading/trailing
df['name'] = df['name'].str.replace(r'\s+', ' ', regex=True) # Multiple spaces
# Case
df['name'] = df['name'].str.lower()
df['name'] = df['name'].str.title()
# Pattern extraction
df['area_code'] = df['phone'].str.extract(r'\((\d{3})\)')
# Replace patterns
df['name'] = df['name'].str.replace(r'[^a-zA-Z\s]', '', regex=True)
# Encode/decode
df['name_encoded'] = df['name'].str.encode('utf-8').str.decode('latin-1')
Complete Cleaning Pipeline
def clean_dataframe(df):
df = df.copy()
# 1. Remove exact duplicates
df = df.drop_duplicates()
# 2. Strip whitespace from string columns
str_cols = df.select_dtypes(include='object').columns
for col in str_cols:
df[col] = df[col].str.strip()
# 3. Convert numeric columns
for col in df.columns:
if df[col].dtype == 'object':
converted = pd.to_numeric(df[col], errors='coerce')
if converted.notna().sum() / len(df) > 0.8:
df[col] = converted
# 4. Convert date columns
for col in df.columns:
if 'date' in col.lower() or 'time' in col.lower():
df[col] = pd.to_datetime(df[col], errors='coerce')
# 5. Optimize dtypes
for col in df.select_dtypes(include='object').columns:
if df[col].nunique() / len(df) < 0.5:
df[col] = df[col].astype('category')
# 6. Report cleaning summary
print(f"Shape: {df.shape}")
print(f"Missing: {df.isnull().sum().sum()}")
print(f"Duplicates removed: {len(df) - len(df.drop_duplicates())}")
return df
Key Takeaways
Summary: Data Cleaning
- MCAR, MAR, MNAR describe different missing data mechanisms — the mechanism determines the appropriate handling strategy
- Mean imputation is simple but distorts variance and correlations — prefer KNN or iterative imputation for statistical analysis
- IQR method is robust for outlier detection; Z-score assumes normality; MAD is the most robust univariate method
- Duplicates should be investigated before removal — understand why they exist
- Type conversions with
errors='coerce'handle messy data gracefully - Category dtype provides significant memory savings for low-cardinality string columns
- Build a reproducible cleaning pipeline — every transformation should be traceable and reversible
Practice Exercise
- Given a dataset with 20% MCAR missing values, compare the bias introduced by mean imputation, median imputation, KNN imputation, and listwise deletion. Compute the difference between imputed and true means.
- Write a function that detects outliers using IQR, MAD, and Z-score methods simultaneously, then creates a consensus flag (outlier if 2+ methods agree).
- Build a text cleaning pipeline that handles: extra whitespace, mixed case, special characters, encoding issues (mojibake), and inconsistent formatting (e.g., "NY", "New York", "NYC" → "New York").
- Given a DataFrame with mixed types and messy data, implement the complete cleaning pipeline and verify data quality with a summary report.
- Analyze the missing data mechanism in a real-world dataset using Little's MCAR test or pattern analysis with
missingnolibrary.