CW

Data Cleaning: Missing Values, Outliers and Types

Module 2: NumPy & PandasFree Lesson

Advertisement

Data Cleaning: Missing Values, Outliers and Types

Master data cleaning — handling missing values, outliers, duplicates, and type conversions.

Data Cleaning PipelineMissingNaN, None, NAMCAR / MAR / MNARImpute or DropHigh ImpactOutliersIQR MethodZ-Score MethodDBSCAN / IsolationDomain ExpertiseDuplicatesExact DuplicatesFuzzy Matchingkeep='first'/last'Quick WinsTypesastype()pd.to_datetime()pd.to_numeric()Category TypeTextWhitespaceCase NormalizationRegex CleaningEncoding IssuesImpact on Downstream AnalysisMissing → Biased EstimatesOutliers → Distorted ModelsDuplicates → Inflated CountsClean data → Reliable analysis → Trustworthy models

Missing Data Mechanisms

DfMissing Completely at Random (MCAR)

Data is MCAR if the probability of a value being missing is independent of both the observed and unobserved data. This is the strongest assumption — missingness provides no information about the missing values. Example: data entry errors, random equipment failures.

DfMissing at Random (MAR)

Data is MAR if the probability of missingness depends only on observed data (not on the missing values themselves). Example: men are less likely to report depression scores, but within men, missingness doesn't depend on the actual depression level.

DfMissing Not at Random (MNAR)

Data is MNAR if the missingness depends on the missing values themselves. This is the most problematic mechanism — the missing data contains information. Example: high-income individuals refusing to report income.

Missing Data MechanismsMCARMissingness independent of all dataP(Missing | X, Y) = P(Missing)Deletion is unbiased (but loses power)Most common in clinical trialsMARMissingness depends on observed dataP(Missing | X, Y) = P(Missing | X)Multiple imputation works wellMost real-world missing dataMNARMissingness depends on missing valuesP(Missing | X, Y) ≠ f(X)Requires sensitivity analysisPattern-mixture models needed

Detecting Missing Values

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': [1, np.nan, 3, np.nan, 5],
    'B': [np.nan, 2, 3, 4, np.nan],
    'C': [1, 2, np.nan, 4, 5],
    'D': ['x', 'y', None, 'w', 'z']
})

# Basic detection
print(df.isnull())           # Boolean DataFrame
print(df.isnull().sum())     # Count per column
print(df.isnull().sum().sum())  # Total missing
print(df.isnull().mean() * 100)  # Percentage per column

# Row-level missingness
print(df.isnull().sum(axis=1))  # Missing count per row

# Missing data patterns
missing_pattern = df.isnull().value_counts()
print(missing_pattern)

# Heatmap visualization
import seaborn as sns
import matplotlib.pyplot as plt
sns.heatmap(df.isnull(), cbar=True, yticklabels=False)
plt.title('Missing Data Heatmap')
plt.show()

Handling Missing Values

Deletion

# Drop rows with any NaN
df_drop_any = df.dropna()

# Drop rows where all values are NaN
df_drop_all = df.dropna(how='all')

# Drop rows with fewer than 3 non-NaN values
df_drop_thresh = df.dropna(thresh=3)

# Drop columns with NaN
df_drop_cols = df.dropna(axis=1)

# Drop rows where specific columns are NaN
df_drop_subset = df.dropna(subset=['A', 'B'])

# Fraction of data retained
print(f"Rows retained: {len(df_drop_any)/len(df)*100:.1f}%")

Deletion Bias

Listwise deletion (dropping rows) is unbiased only under MCAR. Under MAR, it produces biased estimates because the deleted rows are not a random sample. Under MAR, use imputation instead. Always check the missing data mechanism before choosing a strategy.

Imputation

Mean Imputation

hatxi=barx=frac1nsumj=1nxjquadtext(wherexjtextisobserved)\\hat{x}_i = \\bar{x} = \\frac{1}{n} \\sum_{j=1}^{n} x_j \\quad \\text{(where } x_j \\text{ is observed)}

Here,

  • x^i\hat{x}_i=Imputed value for missing observation i
  • xˉ\bar{x}=Mean of observed values
  • nn=Number of observed (non-missing) values
# Mean imputation (numerical only)
df['A_filled'] = df['A'].fillna(df['A'].mean())

# Median imputation (robust to outliers)
df['B_filled'] = df['B'].fillna(df['B'].median())

# Mode imputation (categorical)
df['D_filled'] = df['D'].fillna(df['D'].mode()[0])

# Constant imputation
df['C_filled'] = df['C'].fillna(0)

# Forward fill / backward fill (time series)
df['A_ffill'] = df['A'].ffill()  # Use previous valid value
df['A_bfill'] = df['A'].bfill()  # Use next valid value

# Interpolation (numerical, time series)
df['A_interp'] = df['A'].interpolate(method='linear')

Mean Imputation Distorts Variance

Mean imputation reduces variance by a factor of (1p)(1 - p) where pp is the proportion of missing data. It also distorts correlations toward zero. For statistical analysis, prefer multiple imputation or model-based methods.

Advanced Imputation

from sklearn.impute import KNNImputer, SimpleImputer

# KNN Imputation
imputer_knn = KNNImputer(n_neighbors=3)
df_numeric = df[['A', 'B', 'C']].copy()
df_imputed = pd.DataFrame(
    imputer_knn.fit_transform(df_numeric),
    columns=df_numeric.columns
)
print(df_imputed)

# Iterative Imputation (MICE-like)
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
imputer_iter = IterativeImputer(max_iter=10, random_state=42)
df_iter_imputed = pd.DataFrame(
    imputer_iter.fit_transform(df_numeric),
    columns=df_numeric.columns
)

# Simple Imputer with strategy
imputer = SimpleImputer(strategy='median')
df_simple = pd.DataFrame(
    imputer.fit_transform(df_numeric),
    columns=df_numeric.columns
)

KNN Imputation

hatxi=fracsumjinmathcalNk(i)wjcdotxjsumjinmathcalNk(i)wjquadtextwherewj=frac1d(xi,xj)\\hat{x}_i = \\frac{\\sum_{j \\in \\mathcal{N}_k(i)} w_j \\cdot x_j}{\\sum_{j \\in \\mathcal{N}_k(i)} w_j} \\quad \\text{where } w_j = \\frac{1}{d(x_i, x_j)}

Here,

  • Nk(i)\mathcal{N}_k(i)=Set of k nearest neighbors of observation i
  • wjw_j=Inverse distance weight for neighbor j
  • d(xi,xj)d(x_i, x_j)=Distance between observations i and j
  • xjx_j=Observed value at neighbor j

Outlier Detection

DfOutlier

An outlier is an observation that lies an abnormal distance from other values in a dataset. Outliers can be genuine extreme values (signal) or data errors (noise). The appropriate treatment depends on whether the outlier represents a valid but rare phenomenon or a measurement/recording error.

Outlier Detection MethodsIQR MethodQ1 = 25th percentileQ3 = 75th percentileIQR = Q3 - Q1Lower: Q1 - 1.5*IQRUpper: Q3 + 1.5*IQRRobust to non-normal dataZ-Score Methodz = (x - μ) / σ|z| > 2 → Moderate outlier|z| > 3 → Extreme outlierAssumes normal distributionSensitive to μ, σ from outliersMAD MethodMAD = median(|x - median(x)|)z_modified = 0.6745*(x-med)/MAD|z| > 3.5 → OutlierRobust: uses median, not mean0.6745 = z for 75th percentile

IQR Method

def detect_outliers_iqr(series, multiplier=1.5):
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - multiplier * IQR
    upper = Q3 + multiplier * IQR
    return (series < lower) | (series > upper)

# Apply to DataFrame
numeric_cols = df.select_dtypes(include=[np.number]).columns
outlier_mask = df[numeric_cols].apply(detect_outliers_iqr)
print(f"Outliers per column:\n{outlier_mask.sum()}")

# Count of outliers per row
print(f"Outliers per row:\n{outlier_mask.sum(axis=1)}")

Z-Score Method

from scipy import stats

def detect_outliers_zscore(series, threshold=3):
    z_scores = np.abs(stats.zscore(series.dropna()))
    mask = pd.Series(False, index=series.index)
    mask[series.dropna().index] = z_scores > threshold
    return mask

# Modified Z-score (using MAD — more robust)
def detect_outliers_mad(series, threshold=3.5):
    median = series.median()
    mad = np.median(np.abs(series - median))
    modified_z = 0.6745 * (series - median) / mad
    return np.abs(modified_z) > threshold

Handling Outliers

# Winsorization (cap at percentiles)
from scipy.stats import mstats
df['A_winsorized'] = mstats.winsorize(df['A'], limits=[0.05, 0.05])

# Clip to IQR bounds
Q1 = df['A'].quantile(0.25)
Q3 = df['A'].quantile(0.75)
IQR = Q3 - Q1
df['A_clipped'] = df['A'].clip(lower=Q1-1.5*IQR, upper=Q3+1.5*IQR)

# Log transformation (for right-skewed data)
df['A_log'] = np.log1p(df['A'])  # log(1 + x) handles zeros

# Replace with NaN (treat as missing)
outlier_mask = detect_outliers_iqr(df['A'])
df.loc[outlier_mask, 'A'] = np.nan

When to Remove vs Keep Outliers

  • Remove: Data entry errors, measurement artifacts, clearly invalid values
  • Keep: Genuine extreme values (e.g., billionaire wealth in income data)
  • Transform: Log or Box-Cox to reduce skewness while keeping the data
  • Winsorize: Cap extreme values at a specified percentile
  • Separate analysis: Model outliers separately if they represent a distinct population

Duplicates

df_dup = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Alice', 'Diana', 'Bob'],
    'age': [25, 30, 25, 28, 30],
    'score': [95, 87, 95, 92, 88]  # Note: Bob has different score
})

# Exact duplicates (all columns)
print(df_dup.duplicated().sum())  # 1 exact duplicate (row 0)
print(df_dup[df_dup.duplicated()])

# Subset duplicates (key columns)
print(df_dup.duplicated(subset=['name']).sum())  # 2 (Alice and Bob appear twice)

# Keep options
df_dup.drop_duplicates(subset=['name'], keep='first')   # Keep first occurrence
df_dup.drop_duplicates(subset=['name'], keep='last')    # Keep last occurrence
df_dup.drop_duplicates(subset=['name'], keep=False)     # Drop all duplicates

# Mark duplicates without removing
df_dup['is_dup'] = df_dup.duplicated(subset=['name'], keep=False)

Determining Duplicate Strategy

Before removing duplicates, understand why they exist:

  • Exact duplicates: Likely data entry errors → drop
  • Key duplicates with different values: Possible legitimate entries → investigate
  • Temporal duplicates: Latest version may be authoritative → keep last
  • Near-duplicates: Use fuzzy matching (Levenshtein distance) for text fields

Type Conversions

# String to numeric
df['price'] = pd.to_numeric(df['price_str'], errors='coerce')  # NaN on failure
df['price'] = pd.to_numeric(df['price_str'], errors='ignore')  # Keep as string
df['price'] = pd.to_numeric(df['price_str'], errors='raise')   # Raise exception

# String to datetime
df['date'] = pd.to_datetime(df['date_str'], format='%Y-%m-%d', errors='coerce')
df['date'] = pd.to_datetime(df['date_str'], infer_datetime_format=True)

# Extract datetime components
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['dayofweek'] = df['date'].dt.dayofweek  # Monday=0
df['quarter'] = df['date'].dt.quarter

# Optimize dtypes
df['age'] = df['age'].astype('int8')        # -128 to 127
df['salary'] = df['salary'].astype('int32')  # Instead of int64
df['grade'] = df['grade'].astype('category') # Categorical

# Check memory savings
print(f"Before: {df.memory_usage(deep=True).sum() / 1024:.1f} KB")
df_optimized = df.copy()
for col in df_optimized.select_dtypes(include='object').columns:
    if df_optimized[col].nunique() / len(df_optimized) < 0.5:
        df_optimized[col] = df_optimized[col].astype('category')
print(f"After:  {df_optimized.memory_usage(deep=True).sum() / 1024:.1f} KB")

Category Dtype for Memory Efficiency

String columns with low cardinality (few unique values relative to total rows) should use category dtype. It stores integers internally with a lookup table. For a column with 1M rows and 10 unique values, category uses ~10x less memory than object (strings).

Text Cleaning

# Whitespace
df['name'] = df['name'].str.strip()      # Remove leading/trailing
df['name'] = df['name'].str.replace(r'\s+', ' ', regex=True)  # Multiple spaces

# Case
df['name'] = df['name'].str.lower()
df['name'] = df['name'].str.title()

# Pattern extraction
df['area_code'] = df['phone'].str.extract(r'\((\d{3})\)')

# Replace patterns
df['name'] = df['name'].str.replace(r'[^a-zA-Z\s]', '', regex=True)

# Encode/decode
df['name_encoded'] = df['name'].str.encode('utf-8').str.decode('latin-1')

Complete Cleaning Pipeline

def clean_dataframe(df):
    df = df.copy()

    # 1. Remove exact duplicates
    df = df.drop_duplicates()

    # 2. Strip whitespace from string columns
    str_cols = df.select_dtypes(include='object').columns
    for col in str_cols:
        df[col] = df[col].str.strip()

    # 3. Convert numeric columns
    for col in df.columns:
        if df[col].dtype == 'object':
            converted = pd.to_numeric(df[col], errors='coerce')
            if converted.notna().sum() / len(df) > 0.8:
                df[col] = converted

    # 4. Convert date columns
    for col in df.columns:
        if 'date' in col.lower() or 'time' in col.lower():
            df[col] = pd.to_datetime(df[col], errors='coerce')

    # 5. Optimize dtypes
    for col in df.select_dtypes(include='object').columns:
        if df[col].nunique() / len(df) < 0.5:
            df[col] = df[col].astype('category')

    # 6. Report cleaning summary
    print(f"Shape: {df.shape}")
    print(f"Missing: {df.isnull().sum().sum()}")
    print(f"Duplicates removed: {len(df) - len(df.drop_duplicates())}")

    return df

Key Takeaways

Summary: Data Cleaning

  1. MCAR, MAR, MNAR describe different missing data mechanisms — the mechanism determines the appropriate handling strategy
  2. Mean imputation is simple but distorts variance and correlations — prefer KNN or iterative imputation for statistical analysis
  3. IQR method is robust for outlier detection; Z-score assumes normality; MAD is the most robust univariate method
  4. Duplicates should be investigated before removal — understand why they exist
  5. Type conversions with errors='coerce' handle messy data gracefully
  6. Category dtype provides significant memory savings for low-cardinality string columns
  7. Build a reproducible cleaning pipeline — every transformation should be traceable and reversible

Practice Exercise

  1. Given a dataset with 20% MCAR missing values, compare the bias introduced by mean imputation, median imputation, KNN imputation, and listwise deletion. Compute the difference between imputed and true means.
  2. Write a function that detects outliers using IQR, MAD, and Z-score methods simultaneously, then creates a consensus flag (outlier if 2+ methods agree).
  3. Build a text cleaning pipeline that handles: extra whitespace, mixed case, special characters, encoding issues (mojibake), and inconsistent formatting (e.g., "NY", "New York", "NYC" → "New York").
  4. Given a DataFrame with mixed types and messy data, implement the complete cleaning pipeline and verify data quality with a summary report.
  5. Analyze the missing data mechanism in a real-world dataset using Little's MCAR test or pattern analysis with missingno library.

Advertisement

Need Expert Data Science Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement