Handling Missing Data — Complete Guide


Types of Missing Data

Understanding why data is missing determines the right strategy.

Type   Full Name                      Missingness Depends On     Example
──────────────────────────────────────────────────────────────────────────────────────────────
MCAR   Missing Completely At Random   Nothing                    Random sensor failure
MAR    Missing At Random              Other observed variables   Income missing more for young people
MNAR   Missing Not At Random          The missing value itself   High earners omit salary

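Rule of thumb: MCAR → most methods stay unbiased (deletion only costs sample size). MAR → model-based imputation that uses the other observed variables. MNAR → collect more data or model the missingness mechanism.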
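
To make the three mechanisms concrete, here is a minimal simulation sketch. The age/income data, the probabilities, and the helper name drop_with_prob are all made up for illustration; only the pattern of bias matters.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical toy data: income rises with age
age = rng.uniform(20, 65, n)
income = 20_000 + 1_000 * (age - 20) + rng.normal(0, 5_000, n)

def drop_with_prob(values, prob):
    """Return a copy of `values` where each entry is set to NaN
    with its own probability from the array `prob`."""
    out = values.copy()
    out[rng.random(len(values)) < prob] = np.nan
    return out

# MCAR: every value has the same 30% chance of being missing
mcar = drop_with_prob(income, np.full(n, 0.3))
# MAR: missingness depends on an observed variable (younger -> more missing)
mar = drop_with_prob(income, np.where(age < 35, 0.6, 0.1))
# MNAR: missingness depends on the value itself (high earners omit salary)
mnar = drop_with_prob(income, np.where(income > np.quantile(income, 0.75), 0.8, 0.1))

versions = {"full data": income, "MCAR": mcar, "MAR": mar, "MNAR": mnar}
summary = pd.DataFrame({
    "observed mean": {k: np.nanmean(v) for k, v in versions.items()},
    "missing %":     {k: np.isnan(v).mean() * 100 for k, v in versions.items()},
}).round(1)
print(summary)
```

In this setup the observed mean under MCAR should stay close to the full-data mean, while MAR pushes it up (low-income young people drop out) and MNAR pushes it down (high earners drop out). The MAR bias can be corrected by conditioning on age. The MNAR bias cannot be corrected from the observed data alone.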


Detecting Missing Data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer

np.random.seed(42)

# Load the dataset once, then inject missing values for demonstration
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)

# Inject missingness completely at random (MCAR) into three columns
df.iloc[np.random.choice(len(df), 50, replace=False), 0] = np.nan   # mean radius
df.iloc[np.random.choice(len(df), 30, replace=False), 2] = np.nan   # mean perimeter
df.iloc[np.random.choice(len(df), 80, replace=False), 5] = np.nan   # mean compactness

# ── 1. Overview ───────────────────────────────────────────────────────
missing = df.isnull().sum()
missing_pct = (df.isnull().mean() * 100).round(2)
missing_report = pd.DataFrame({
    "Missing Count": missing,
    "Missing %":     missing_pct,
    "Dtype":         df.dtypes,
}).query("`Missing Count` > 0").sort_values("Missing %", ascending=False)

print(missing_report)

# ── 2. Missingness heatmap ────────────────────────────────────────────
plt.figure(figsize=(14, 6))
sns.heatmap(df.isnull().T, cmap="YlOrRd", cbar=False,
            xticklabels=False, yticklabels=True)
plt.title("Missing Data Pattern", fontsize=13, fontweight="bold")
plt.xlabel("Observations"); plt.ylabel("Features")
plt.tight_layout(); plt.show()

# ── 3. Missingness bar chart ──────────────────────────────────────────
missing_pct[missing_pct > 0].sort_values().plot(
    kind="barh", figsize=(9, 4), color="#ef4444", edgecolor="white")
plt.title("Missing Data Percentage by Feature")
plt.xlabel("Missing %"); plt.axvline(5, color="orange", linestyle="--",
                                      label="5% threshold")
plt.legend(); plt.grid(True, alpha=0.3, axis="x")
plt.tight_layout(); plt.show()
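
Counts and plots show where values are missing, but not why. One rough check for MAR is to compare the other features between rows where a column is missing and rows where it is present. The sketch below injects a MAR-style pattern purely for illustration; the function name missingness_association is hypothetical, and a large difference is a hint, not a formal test.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
demo = pd.DataFrame(data.data, columns=data.feature_names)

# Illustration only: drop "mean texture" more often when "mean radius" is large
rng = np.random.default_rng(0)
p_missing = np.where(demo["mean radius"] > demo["mean radius"].median(), 0.4, 0.05)
demo.loc[rng.random(len(demo)) < p_missing, "mean texture"] = np.nan

def missingness_association(frame, target):
    """Standardized mean difference of every other numeric column between
    rows where `target` is missing and rows where it is present."""
    is_missing = frame[target].isna()
    others = frame.drop(columns=[target]).select_dtypes(include=[np.number])
    diff = others[is_missing].mean() - others[~is_missing].mean()
    return (diff / others.std()).sort_values(key=np.abs, ascending=False)

assoc = missingness_association(demo, "mean texture")
print(assoc.head())
```

Features with a standardized difference far from zero are candidates for driving the missingness, which would point toward MAR rather than MCAR. Differences near zero are consistent with MCAR but cannot rule out MNAR.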

Strategy 1 — Deletion

# Listwise deletion (remove rows with ANY missing)
df_complete = df.dropna()
print(f"Rows retained: {len(df_complete)}/{len(df)} = {len(df_complete)/len(df)*100:.1f}%")

# Column deletion (> 40% missing)
threshold = 0.40
df_no_high_miss = df.drop(columns=df.columns[df.isnull().mean() > threshold])

# ✅ Use when: MCAR, small % missing
# ❌ Avoid when: MAR/MNAR, or losing >5% of data
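
Deletion does not have to be all-or-nothing. As a small sketch on a made-up frame, pandas' dropna accepts subset (only check certain columns) and thresh (minimum number of non-missing values per row), which often keeps more rows than plain listwise deletion.

```python
import numpy as np
import pandas as pd

# Made-up toy frame for illustration
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, 6.0, 7.0],
    "c": [np.nan, np.nan, 11.0, np.nan],
})

# Listwise deletion: keep only fully observed rows
complete_rows = toy.dropna()

# Drop rows only when a column you actually model on is missing
rows_with_a = toy.dropna(subset=["a"])

# Keep rows that have at least 2 non-missing values
rows_min_two = toy.dropna(thresh=2)

print(len(toy), len(complete_rows), len(rows_with_a), len(rows_min_two))
```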

Strategy 2 — Simple Imputation

from sklearn.impute import SimpleImputer

# Mean imputation (numeric, symmetric distributions)
mean_imp = SimpleImputer(strategy="mean")
df["mean radius_filled"] = mean_imp.fit_transform(df[["mean radius"]]).ravel()  # flatten (n, 1) output to 1-D

# Median imputation (numeric, skewed distributions) ← usually preferred
median_imp = SimpleImputer(strategy="median")

# Mode imputation (categorical)
mode_imp = SimpleImputer(strategy="most_frequent")

# Constant imputation
const_imp = SimpleImputer(strategy="constant", fill_value=-1)

# Compare distributions before/after
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
df["mean radius"].dropna().hist(ax=axes[0], bins=40, color="steelblue",
                                 edgecolor="white", alpha=0.7, label="Original")
axes[0].axvline(df["mean radius"].mean(), color="red", linestyle="--",
                label=f"Mean={df['mean radius'].mean():.2f}")
axes[0].set_title("Before Imputation"); axes[0].legend()

df["mean radius_filled"].hist(ax=axes[1], bins=40, color="steelblue",
                               edgecolor="white", alpha=0.7, label="After Mean Imp")
axes[1].axvline(df["mean radius"].mean(), color="red", linestyle="--")
axes[1].set_title("After Mean Imputation"); axes[1].legend()
plt.tight_layout(); plt.show()
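
The spike that mean imputation adds at the mean also shrinks the spread of the feature. A minimal numeric sketch on synthetic data (the distribution and the 30% missing rate are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(0)
x = rng.normal(loc=50, scale=10, size=(2_000, 1))

# Remove roughly 30% of the values completely at random
x_missing = x.copy()
x_missing[rng.random(len(x)) < 0.3, 0] = np.nan

x_filled = SimpleImputer(strategy="mean").fit_transform(x_missing)

std_observed = np.nanstd(x_missing)   # spread of the observed values only
std_filled = x_filled.std()           # spread after filling with the mean

print(f"Std of observed values: {std_observed:.2f}")
print(f"Std after mean imputation: {std_filled:.2f}")
```

The mean is unchanged, but the standard deviation drops because every filled value sits exactly at the center. Correlations with other features are weakened for the same reason.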

Strategy 3 — KNN Imputation

from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Scale first (KNN is distance-based)
scaler = StandardScaler()
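# Exclude the demo column from Strategy 2 so an already-imputed copy does not leak in
df_numeric = df.select_dtypes(include=[np.number]).drop(
    columns=["mean radius_filled"], errors="ignore")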
df_scaled  = scaler.fit_transform(df_numeric)

knn_imp = KNNImputer(n_neighbors=5, weights="distance")
df_knn_scaled = knn_imp.fit_transform(df_scaled)
df_knn = pd.DataFrame(
    scaler.inverse_transform(df_knn_scaled),
    columns=df_numeric.columns
)

# ✅ Use when: MAR, moderate % missing, features are correlated
# ⚠️  Slow on large datasets (O(n²))
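
To see what KNNImputer does under the hood, here is a tiny worked example on a 4x3 array (unscaled toy numbers, chosen only so the arithmetic is easy to follow). Distances are computed with a NaN-aware Euclidean metric that skips missing coordinates and rescales for the ones that remain.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.metrics.pairwise import nan_euclidean_distances

X = np.array([
    [1.0,    2.0, np.nan],
    [3.0,    4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0,    8.0, 7.0],
])

# Pairwise distances that ignore missing coordinates
dist = nan_euclidean_distances(X, X)
print(dist.round(2))

# Each missing value becomes the mean of that feature over the 2 nearest
# rows in which the feature is observed
X_filled = KNNImputer(n_neighbors=2).fit_transform(X)
print(X_filled)
```

Row 0 is closest to rows 1 and 2, so its missing third feature becomes (3 + 5) / 2 = 4. Row 2 is equally close to rows 1 and 3, so its missing first feature becomes (3 + 8) / 2 = 5.5.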

Strategy 4 — Iterative (MICE) Imputation

from sklearn.experimental import enable_iterative_imputer  # noqa
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge
from sklearn.ensemble import RandomForestRegressor

# MICE with Bayesian Ridge (default)
mice_imp = IterativeImputer(
    estimator=BayesianRidge(),
    max_iter=10,
    random_state=42,
    sample_posterior=True,   # draw from the predictive posterior; one run = one stochastic imputation
)
df_mice = pd.DataFrame(
    mice_imp.fit_transform(df_numeric),
    columns=df_numeric.columns
)

# MICE with Random Forest (non-linear relationships)
rf_imp = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=42),
    max_iter=5,
    random_state=42,
)
df_rf_imp = pd.DataFrame(
    rf_imp.fit_transform(df_numeric),
    columns=df_numeric.columns
)

# ✅ Best for: MAR, multiple features missing, correlated features
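
A single fit_transform call returns one completed dataset, even with sample_posterior=True. Multiple imputation means repeating the draw with different seeds and then pooling. The sketch below uses synthetic data and simply averages the completed datasets; in a full analysis you would usually fit the downstream model on each dataset and pool the estimates instead.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)

# Synthetic data for illustration: three correlated columns
base = rng.normal(size=(300, 1))
X = np.hstack([base + rng.normal(scale=0.3, size=(300, 1)) for _ in range(3)])
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.15] = np.nan

m = 5  # number of imputed datasets
imputations = [
    IterativeImputer(sample_posterior=True, max_iter=10, random_state=seed)
    .fit_transform(X_missing)
    for seed in range(m)
]

stacked = np.stack(imputations)      # shape (m, n_rows, n_cols)
pooled = stacked.mean(axis=0)        # simple pooled point estimate per cell
between_std = stacked.std(axis=0)    # spread across imputations per cell

was_missing = np.isnan(X_missing)
print("Mean spread on imputed cells:", between_std[was_missing].mean().round(3))
print("Max spread on observed cells:", between_std[~was_missing].max())
```

The spread across imputations is zero for observed cells and positive for imputed ones. That spread is the imputation uncertainty a single filled-in dataset hides.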

Comparing Imputation Methods

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

y = load_breast_cancer().target

# Listwise deletion changes the number of rows, so it cannot run as a Pipeline
# step and is not listed here; every entry below is a true imputer.
methods = {
    "Mean":           SimpleImputer(strategy="mean"),
    "Median":         SimpleImputer(strategy="median"),
    "KNN (k=5)":      KNNImputer(n_neighbors=5),
    "MICE":           IterativeImputer(max_iter=10, random_state=42),
}

results = {}
for name, imp in methods.items():
    pipe = Pipeline([
        ("imputer", imp),
        ("scaler",  StandardScaler()),
        ("clf",     LogisticRegression(max_iter=500)),
    ])
    scores = cross_val_score(pipe, df_numeric, y, cv=5, scoring="f1")
    results[name] = {"F1 Mean": scores.mean(), "F1 Std": scores.std()}

results_df = pd.DataFrame(results).T.round(4)
print("\nImputation Method Comparison:")
print(results_df.sort_values("F1 Mean", ascending=False))
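
A complete-case baseline has to be built outside the pipeline, because dropping rows also means dropping the matching labels. A minimal self-contained sketch follows. It re-creates a similar MCAR pattern with its own random generator, so the exact rows differ from the ones above, and the names X_demo, y_demo, X_cc and y_cc are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_demo = pd.DataFrame(data.data, columns=data.feature_names)
y_demo = pd.Series(data.target)

# Inject MCAR missingness into columns 0, 2 and 5
rng = np.random.default_rng(42)
for col, n_missing in [(0, 50), (2, 30), (5, 80)]:
    X_demo.iloc[rng.choice(len(X_demo), n_missing, replace=False), col] = np.nan

# Keep only fully observed rows, and the labels that belong to them
complete = X_demo.notna().all(axis=1)
X_cc, y_cc = X_demo[complete], y_demo[complete]

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf",    LogisticRegression(max_iter=500)),
])
scores = cross_val_score(pipe, X_cc, y_cc, cv=5, scoring="f1")
print(f"Complete cases: {len(X_cc)}/{len(X_demo)} rows, "
      f"F1 = {scores.mean():.4f} ± {scores.std():.4f}")
```

This score is computed on fewer rows than the imputation pipelines see, so treat it as a rough reference point, not a like-for-like comparison.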

Decision Guide

% Missing       Mechanism    Recommendation
──────────────────────────────────────────────────────────
< 5%            MCAR         Listwise deletion or mean/median
5–20%           MCAR/MAR     Median (numeric), mode (categorical)
5–20%           MAR          KNN or MICE imputation
> 20%           MAR          MICE with uncertainty; add indicator column
> 40%           Any          Consider dropping feature; collect more data
Any             MNAR         Model the missingness mechanism explicitly

Always add a binary indicator column for features with >5% missing:

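# "income" stands in for any feature with >5% missing
df["income_missing"] = df["income"].isna().astype(int)
# Assign the result back; fillna(inplace=True) on a selected column is chained
# assignment and is deprecated in recent pandas versions
df["income"] = df["income"].fillna(df["income"].median())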
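
Inside a scikit-learn pipeline, the same idea is available through the add_indicator option of SimpleImputer, which appends one 0/1 column per feature that had missing values during fit. A small sketch on a made-up array:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up data: both columns contain one missing value
X = np.array([
    [1.0,    10.0],
    [np.nan, 20.0],
    [3.0,    np.nan],
    [5.0,    40.0],
])

imp = SimpleImputer(strategy="median", add_indicator=True)
X_out = imp.fit_transform(X)

# Columns 0-1: imputed features; columns 2-3: missingness indicators
print(X_out)
```

Because the indicator is learned during fit, it is applied consistently to training and test data, which a manual isna() column does not guarantee on its own.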

Key Takeaways

  1. Diagnose first: MCAR/MAR/MNAR determines the right strategy
  2. Simple imputation (median) is fine for MCAR with < 5% missing
  3. KNN imputation leverages feature correlations; scale first
  4. MICE/Iterative imputation is the strongest general-purpose option for MAR data
  5. Add a missingness indicator for features with > 5% missing so the model knows where data was imputed
