Types of Missing Data
Understanding why data is missing determines the right strategy.
| Type | Full Name | Missingness Depends On | Example |
|---|---|---|---|
| MCAR | Missing Completely At Random | Nothing | Random sensor failure |
| MAR | Missing At Random | Other observed variables | Income missing more for young people |
| MNAR | Missing Not At Random | The missing value itself | High earners omit salary |
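Rule of thumb: under MCAR, most methods (including deletion) stay unbiased, although they still cost sample size. Under MAR, use model-based imputation that conditions on the observed variables. Under MNAR, collect more data or model the missingness mechanism explicitly.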
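To make the three mechanisms concrete, here is a minimal simulation sketch. The age/income relationship, the missingness probabilities, and the helper name drop_values are all illustrative assumptions, not taken from any real dataset. It compares the mean of the surviving values with the true mean under each mechanism, then shows that a simple regression on the observed variable repairs the MAR case.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
age = rng.uniform(20, 65, n)                               # always observed
income = 20_000 + 1_000 * age + rng.normal(0, 10_000, n)   # assumed linear relationship


def drop_values(values, p_missing):
    """Return a copy where each entry becomes NaN with its own probability."""
    out = values.copy()
    out[rng.random(len(values)) < p_missing] = np.nan
    return out


# MCAR: every value has the same 30% chance of going missing
income_mcar = drop_values(income, np.full(n, 0.30))
# MAR: missingness depends only on an observed variable (age)
income_mar = drop_values(income, np.where(age < 35, 0.60, 0.10))
# MNAR: missingness depends on the hidden value itself (top quartile hides more)
income_mnar = drop_values(
    income, np.where(income > np.quantile(income, 0.75), 0.70, 0.10)
)

true_mean = income.mean()
mcar_mean = np.nanmean(income_mcar)
mar_mean = np.nanmean(income_mar)
mnar_mean = np.nanmean(income_mnar)

# Under MAR, a model that conditions on age can recover the mean
obs = ~np.isnan(income_mar)
slope, intercept = np.polyfit(age[obs], income_mar[obs], 1)
mar_imputed_mean = np.where(obs, income_mar, intercept + slope * age).mean()

print(f"true mean          : {true_mean:,.0f}")
print(f"MCAR observed mean : {mcar_mean:,.0f}")
print(f"MAR observed mean  : {mar_mean:,.0f}")
print(f"MAR after modeling : {mar_imputed_mean:,.0f}")
print(f"MNAR observed mean : {mnar_mean:,.0f}")
```

In this setup the MCAR mean stays close to the truth, the MAR mean drifts upward because younger (lower-income) people drop out more often, and the MNAR mean drifts downward. Only the MAR bias can be repaired from the observed data alone.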
Detecting Missing Data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer
np.random.seed(42)
# Load the features once, then inject missing values for the demo
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
# Inject missingness: rows are drawn uniformly at random, so all three columns are MCAR
df.iloc[np.random.choice(len(df), 50, replace=False), 0] = np.nan  # mean radius
df.iloc[np.random.choice(len(df), 30, replace=False), 2] = np.nan  # mean perimeter
df.iloc[np.random.choice(len(df), 80, replace=False), 5] = np.nan  # mean compactness
# ── 1. Overview ───────────────────────────────────────────────────────
missing = df.isnull().sum()
missing_pct = (df.isnull().mean() * 100).round(2)
missing_report = pd.DataFrame({
    "Missing Count": missing,
    "Missing %": missing_pct,
    "Dtype": df.dtypes,
}).query("`Missing Count` > 0").sort_values("Missing %", ascending=False)
print(missing_report)
# ── 2. Missingness heatmap ────────────────────────────────────────────
plt.figure(figsize=(14, 6))
sns.heatmap(df.isnull().T, cmap="YlOrRd", cbar=False,
            xticklabels=False, yticklabels=True)
plt.title("Missing Data Pattern", fontsize=13, fontweight="bold")
plt.xlabel("Observations"); plt.ylabel("Features")
plt.tight_layout(); plt.show()
# ── 3. Missingness bar chart ──────────────────────────────────────────
missing_pct[missing_pct > 0].sort_values().plot(
    kind="barh", figsize=(9, 4), color="#ef4444", edgecolor="white")
plt.title("Missing Data Percentage by Feature")
plt.xlabel("Missing %")
plt.axvline(5, color="orange", linestyle="--", label="5% threshold")
plt.legend(); plt.grid(True, alpha=0.3, axis="x")
plt.tight_layout(); plt.show()
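Counts and heatmaps show where values are missing, but not why. One rough check, sketched below under stated assumptions, is to test whether the rows with a missing value differ on the other observed features. The helper missingness_association and the synthetic age/income/noise data are hypothetical, and the Welch t-test is only one possible choice of test. A strong association is evidence against MCAR; it cannot separate MAR from MNAR, because the hidden values are never observed.

```python
import numpy as np
import pandas as pd
from scipy import stats


def missingness_association(frame, target_col):
    """Welch t-test of each other numeric column, split by whether target_col is missing."""
    is_missing = frame[target_col].isna()
    rows = []
    numeric_cols = frame.select_dtypes(include="number").columns
    for col in (c for c in numeric_cols if c != target_col):
        missing_grp = frame.loc[is_missing, col].dropna()
        observed_grp = frame.loc[~is_missing, col].dropna()
        if len(missing_grp) < 2 or len(observed_grp) < 2:
            continue  # not enough data in one group to run the test
        t_stat, p_value = stats.ttest_ind(missing_grp, observed_grp, equal_var=False)
        rows.append({"feature": col, "t": float(t_stat), "p_value": float(p_value)})
    report = pd.DataFrame(rows, columns=["feature", "t", "p_value"])
    return report.sort_values("p_value").reset_index(drop=True)


# Synthetic demo: income goes missing more often for people under 35
rng = np.random.default_rng(0)
n = 2_000
demo = pd.DataFrame({
    "age": rng.uniform(20, 65, n),
    "noise": rng.normal(size=n),   # unrelated to the missingness
})
demo["income"] = 20_000 + 1_000 * demo["age"] + rng.normal(0, 10_000, n)
p_missing = np.where(demo["age"] < 35, 0.60, 0.10)
demo.loc[rng.random(n) < p_missing, "income"] = np.nan

report = missingness_association(demo, "income")
print(report)
```

Running many such tests inflates false positives, so treat small p-values as a prompt to investigate rather than as proof of a mechanism.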
Strategy 1 — Deletion
# Listwise deletion (remove rows with ANY missing)
df_complete = df.dropna()
print(f"Rows retained: {len(df_complete)}/{len(df)} = {len(df_complete)/len(df)*100:.1f}%")
# Column deletion (> 40% missing)
threshold = 0.40
df_no_high_miss = df.drop(columns=df.columns[df.isnull().mean() > threshold])
# ✅ Use when: MCAR, small % missing
# ❌ Avoid when: MAR/MNAR, or losing >5% of data
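Listwise deletion compounds quickly across columns. As a back-of-the-envelope sketch, assuming each column goes missing independently of the others (which is what the MCAR injection above does), the expected share of fully observed rows is the product of the per-column observed rates. The function name is a hypothetical helper.

```python
import numpy as np


def expected_complete_fraction(missing_rates):
    """Expected share of complete rows if columns go missing independently."""
    rates = np.asarray(missing_rates, dtype=float)
    return float(np.prod(1.0 - rates))


# Ten columns, each only 5% missing
ten_cols = expected_complete_fraction([0.05] * 10)
# The three injected columns above: 50, 30 and 80 missing out of 569 rows
demo_cols = expected_complete_fraction([50 / 569, 30 / 569, 80 / 569])

print(f"10 columns at 5% each : {ten_cols:.1%} of rows expected complete")
print(f"demo dataset          : {demo_cols:.1%} of rows expected complete")
```

Even modest per-column rates can remove a large share of rows, which is why the 5% guideline should be applied to rows lost, not to any single column.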
Strategy 2 — Simple Imputation
from sklearn.impute import SimpleImputer
# Mean imputation (numeric, roughly symmetric distributions)
mean_imp = SimpleImputer(strategy="mean")
# Store the result separately so df keeps its NaNs for the later strategies
mean_radius_filled = pd.Series(
    mean_imp.fit_transform(df[["mean radius"]]).ravel(), index=df.index)
# Median imputation (numeric, skewed distributions) ← usually preferred
median_imp = SimpleImputer(strategy="median")
# Mode imputation (categorical)
mode_imp = SimpleImputer(strategy="most_frequent")
# Constant imputation
const_imp = SimpleImputer(strategy="constant", fill_value=-1)
# Compare distributions before/after
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
df["mean radius"].dropna().hist(ax=axes[0], bins=40, color="steelblue",
                                edgecolor="white", alpha=0.7, label="Original")
axes[0].axvline(df["mean radius"].mean(), color="red", linestyle="--",
                label=f"Mean={df['mean radius'].mean():.2f}")
axes[0].set_title("Before Imputation"); axes[0].legend()
mean_radius_filled.hist(ax=axes[1], bins=40, color="steelblue",
                        edgecolor="white", alpha=0.7, label="After Mean Imp")
axes[1].axvline(df["mean radius"].mean(), color="red", linestyle="--")
axes[1].set_title("After Mean Imputation"); axes[1].legend()
plt.tight_layout(); plt.show()
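The spike at the mean in the second histogram has a measurable cost: every filled value sits exactly at the mean, so the spread of the column shrinks. The sketch below uses synthetic data (an assumption, chosen so the effect is easy to see) to quantify that shrinkage. It also fits the imputer on the training split only and reuses the learned mean on the test split, which avoids leaking test statistics into training.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
x = rng.normal(loc=10.0, scale=2.0, size=(1_000, 1))
x[rng.random(1_000) < 0.30, 0] = np.nan          # 30% MCAR

x_train, x_test = train_test_split(x, test_size=0.25, random_state=0)

imputer = SimpleImputer(strategy="mean")
x_train_filled = imputer.fit_transform(x_train)  # mean learned from training rows only
x_test_filled = imputer.transform(x_test)        # reuse the training mean, no refit

observed_share = np.mean(~np.isnan(x_train))
std_before = np.nanstd(x_train)                  # spread of the observed values
std_after = x_train_filled.std()                 # spread after filling with the mean

print(f"training mean used for filling: {imputer.statistics_[0]:.3f}")
print(f"std before: {std_before:.3f}  std after: {std_after:.3f}")
print(f"shrink factor sqrt(observed share): {np.sqrt(observed_share):.3f}")
```

With mean imputation the standard deviation shrinks by the square root of the observed share, so correlations and standard errors computed afterwards look more certain than they should.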
Strategy 3 — KNN Imputation
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler
# Scale first (KNN is distance-based)
scaler = StandardScaler()
df_numeric = df.select_dtypes(include=[np.number])
df_scaled = scaler.fit_transform(df_numeric)
knn_imp = KNNImputer(n_neighbors=5, weights="distance")
df_knn_scaled = knn_imp.fit_transform(df_scaled)
df_knn = pd.DataFrame(
    scaler.inverse_transform(df_knn_scaled),
    columns=df_numeric.columns,
    index=df_numeric.index,
)
# ✅ Use when: MAR, moderate % missing, features are correlated
# ⚠️ Slow on large datasets (O(n²))
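To see what KNNImputer does on a case small enough to check by hand, the sketch below uses a four-row toy array (illustrative values only). Distances are computed with scikit-learn's NaN-aware Euclidean metric, which skips coordinates missing in either row and rescales the rest. Each gap is then filled with the average of that feature over the nearest rows that have it observed.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.metrics.pairwise import nan_euclidean_distances

X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Pairwise distances that ignore missing coordinates
distances = nan_euclidean_distances(X)
print(np.round(distances, 2))

# Uniform weights: each gap becomes the plain mean of its 2 nearest donors
X_filled = KNNImputer(n_neighbors=2).fit_transform(X)
print(X_filled)
```

Row 0 is closest to rows 1 and 2, so its missing third feature becomes (3 + 5) / 2 = 4. Row 2 is equally close to rows 1 and 3, so its missing first feature becomes (3 + 8) / 2 = 5.5.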
Strategy 4 — Iterative (MICE) Imputation
from sklearn.experimental import enable_iterative_imputer # noqa
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge
from sklearn.ensemble import RandomForestRegressor
# MICE-style iterative imputation with Bayesian Ridge (the default estimator)
mice_imp = IterativeImputer(
    estimator=BayesianRidge(),
    max_iter=10,
    random_state=42,
    # Draw each imputed value from the predictive posterior instead of using its mean.
    # One fit still returns ONE completed dataset; multiple imputation needs several seeds.
    sample_posterior=True,
)
df_mice = pd.DataFrame(
    mice_imp.fit_transform(df_numeric),
    columns=df_numeric.columns,
)
# Iterative imputation with Random Forest (non-linear relationships)
rf_imp = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=42),
    max_iter=5,
    random_state=42,
)
df_rf_imp = pd.DataFrame(
    rf_imp.fit_transform(df_numeric),
    columns=df_numeric.columns,
)
# ✅ Best for: MAR, multiple features missing, correlated features
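IterativeImputer returns a single completed dataset per fit. A common way to obtain multiple imputations, sketched here on synthetic data (the data, m = 5, and the choice of the column mean as the estimate are all assumptions), is to refit with sample_posterior=True under different seeds and then pool the per-dataset estimates with Rubin's rules.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
n = 300
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(scale=0.5, size=n)   # x2 is predictable from x1
X_missing = np.column_stack([x1, x2])
mask = rng.random(n) < 0.25                     # 25% of x2 goes missing (MCAR)
X_missing[mask, 1] = np.nan

m = 5  # number of imputed datasets
imputations = [
    IterativeImputer(sample_posterior=True, max_iter=10, random_state=seed)
    .fit_transform(X_missing)
    for seed in range(m)
]

# Estimate of interest: the mean of x2, computed once per imputed dataset
estimates = np.array([imp[:, 1].mean() for imp in imputations])
within_var = np.mean([imp[:, 1].var(ddof=1) / n for imp in imputations])
between_var = estimates.var(ddof=1)

# Rubin's rules: pooled estimate and total variance
pooled_mean = estimates.mean()
total_var = within_var + (1 + 1 / m) * between_var

print(f"per-imputation means: {np.round(estimates, 3)}")
print(f"pooled mean: {pooled_mean:.3f}  pooled std error: {np.sqrt(total_var):.3f}")
```

The between-imputation term is what a single imputed dataset hides: it carries the extra uncertainty that comes from not knowing the missing values.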
Comparing Imputation Methods
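from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
y = load_breast_cancer().target
methods = {
    "Mean": SimpleImputer(strategy="mean"),
    "Median": SimpleImputer(strategy="median"),
    "KNN (k=5)": KNNImputer(n_neighbors=5),
    "MICE": IterativeImputer(max_iter=10, random_state=42),
}
results = {}
for name, imp in methods.items():
    # Every step lives in the pipeline, so each CV fold fits it on its training split only.
    # The scaler comes first (it ignores NaNs) so KNN distances use scaled features.
    pipe = Pipeline([
        ("scaler", StandardScaler()),
        ("imputer", imp),
        ("clf", LogisticRegression(max_iter=500)),
    ])
    scores = cross_val_score(pipe, df_numeric, y, cv=5, scoring="f1")
    results[name] = {"F1 Mean": scores.mean(), "F1 Std": scores.std()}
# Complete-case baseline: drop incomplete rows instead of imputing.
# It is scored on fewer rows, so read the comparison as indicative only.
complete = df_numeric.notna().all(axis=1).to_numpy()
cc_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=500)),
])
cc_scores = cross_val_score(cc_pipe, df_numeric[complete], y[complete], cv=5, scoring="f1")
results["Complete cases"] = {"F1 Mean": cc_scores.mean(), "F1 Std": cc_scores.std()}
results_df = pd.DataFrame(results).T.round(4)
print("\nImputation Method Comparison:")
print(results_df.sort_values("F1 Mean", ascending=False))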
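Downstream F1 measures whether imputation helps the classifier, not whether the filled values are accurate. When the true values are known, as they are for artificially injected gaps, the imputed cells can be scored directly. The sketch below does this on synthetic data with one strongly correlated feature pair. The data-generating setup and the helper masked_rmse are assumptions for illustration.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer

rng = np.random.default_rng(0)
n = 500
base = rng.normal(size=n)
X_true = np.column_stack([
    base,
    base + rng.normal(scale=0.2, size=n),   # strongly correlated with column 0
    rng.normal(size=n),                     # unrelated feature
])

# Hide 20% of column 1 completely at random, keeping the truth for scoring
mask = np.zeros_like(X_true, dtype=bool)
mask[:, 1] = rng.random(n) < 0.20
X_missing = np.where(mask, np.nan, X_true)


def masked_rmse(imputer):
    """RMSE between imputed and true values, computed on the hidden cells only."""
    X_filled = imputer.fit_transform(X_missing)
    return float(np.sqrt(np.mean((X_filled[mask] - X_true[mask]) ** 2)))


rmse = {
    "mean": masked_rmse(SimpleImputer(strategy="mean")),
    "knn": masked_rmse(KNNImputer(n_neighbors=5)),
    "iterative": masked_rmse(IterativeImputer(max_iter=10, random_state=0)),
}
for name, value in sorted(rmse.items(), key=lambda kv: kv[1]):
    print(f"{name:<10} RMSE = {value:.3f}")
```

On data like this, where another feature nearly determines the missing one, the model-based imputers recover the hidden values far better than the column mean. On weakly correlated features the gap narrows.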
Decision Guide
| % Missing | Mechanism | Recommendation |
|---|---|---|
| < 5% | MCAR | Listwise deletion or mean/median |
| 5–20% | MCAR/MAR | Median (numeric), mode (categorical) |
| 5–20% | MAR | KNN or MICE imputation |
| > 20% | MAR | MICE with uncertainty; add indicator column |
| > 40% | Any | Consider dropping feature; collect more data |
| Any | MNAR | Model the missingness mechanism explicitly |
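Add a binary indicator column for features with more than 5% missing, and create it before filling so the information is not lost (income is a hypothetical column here, not part of the breast cancer data):
df["income_missing"] = df["income"].isna().astype(int)  # record missingness first
# Assign the result back: inplace=True on a column selection is deprecated in pandas 2.x
df["income"] = df["income"].fillna(df["income"].median())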
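scikit-learn can produce the same indicator inside the imputer via the add_indicator option, which keeps the step usable in a Pipeline. The sketch below uses a hypothetical five-row frame; the income and age values and the "_missing" column naming are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

frame = pd.DataFrame({
    "income": [52_000.0, np.nan, 61_000.0, np.nan, 48_000.0],  # hypothetical values
    "age": [34.0, 29.0, 45.0, 52.0, 38.0],
})

# add_indicator=True appends one 0/1 column per feature that had gaps during fit
imputer = SimpleImputer(strategy="median", add_indicator=True)
values = imputer.fit_transform(frame)

# indicator_.features_ holds the positions of the columns that contained NaNs
flagged = frame.columns[imputer.indicator_.features_]
columns = list(frame.columns) + [f"{col}_missing" for col in flagged]
filled = pd.DataFrame(values, columns=columns, index=frame.index)

print(filled)
```

Only income gets an indicator because age had no gaps when the imputer was fitted. A column that first shows missing values at prediction time is still imputed but gets no indicator.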
Key Takeaways
- Diagnose first: the mechanism (MCAR, MAR, or MNAR) determines the right strategy
- Simple imputation (median) is fine when data is MCAR and less than 5% is missing
- KNN imputation leverages feature correlations; scale features first
- Iterative (MICE-style) imputation is a strong default for MAR data, especially when several correlated features have gaps
- Add a missingness indicator so the model can tell which values were imputed