Data Cleaning Essentials

Data scientists spend 60-80% of their time cleaning data. Dirty data leads to wrong conclusions and broken models. This lesson covers every major cleaning technique you will use in practice.

Data Cleaning Pipeline

Missing Values

Missing data is the most common data quality issue. How you handle it affects your analysis significantly.

Detecting Missing Values

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "name": ["Alice", "Bob", None, "Diana", "Eve"],
    "age": [25, np.nan, 35, 28, np.nan],
    "salary": [70000, 80000, np.nan, 75000, 90000],
    "department": ["eng", "eng", None, "sales", "eng"],
})

# Count missing values per column
print(df.isnull().sum())
# name          1
# age           2
# salary        1
# department    1

# Percentage missing
print((df.isnull().sum() / len(df) * 100).round(1))
# name          20.0
# age           40.0
# salary        20.0
# department    20.0

# Rows with any missing values
print(df[df.isnull().any(axis=1)])

# Patterns in missing data
print(df.isnull().describe())

Types of Missingness

Architecture Diagram

MCAR (Missing Completely at Random):
  Missingness has no relationship with any data.
  Example: Lab equipment randomly fails.
  Safe to drop rows.

MAR (Missing at Random):
  Missingness is related to observed data but not the missing value itself.
  Example: Younger people skip income questions (age is observed).
  Use imputation based on other columns.

MNAR (Missing Not at Random):
  Missingness is related to the missing value itself.
  Example: High earners hide their income.
  Requires domain knowledge or specialized methods.

Handling Missing Values

df = pd.DataFrame({
    "age": [25, 30, np.nan, 35, 28, np.nan],
    "salary": [50000, np.nan, 70000, np.nan, 60000, 80000],
    "score": [85, 90, np.nan, np.nan, 75, 88],
})

# Drop rows with any missing values
print(df.dropna())
print(df.dropna(subset=["age"]))  # Only check specific columns

# Drop columns with too many missing values
threshold = len(df) * 0.5  # Drop if > 50% missing
print(df.dropna(axis=1, thresh=threshold))

# Fill with constant
df["age"] = df["age"].fillna(0)

# Fill with statistics
df["salary"] = df["salary"].fillna(df["salary"].mean())
df["score"] = df["score"].fillna(df["score"].median())

# Fill with different values per column
df = df.fillna({
    "age": df["age"].mean(),
    "salary": df["salary"].median(),
    "score": 0,
})

# Forward fill (useful for time series)
df["score"] = df["score"].fillna(method="ffill")

# Backward fill
df["score"] = df["score"].fillna(method="bfill")

# Interpolation
df["age"] = df["age"].interpolate(method="linear")

# Indicator variable for missingness
df["age_was_missing"] = df["age"].isnull().astype(int)

Duplicates

Duplicate rows inflate counts and bias aggregations.

Detecting Duplicates

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Alice", "Diana", "Bob"],
    "age": [25, 30, 25, 28, 30],
    "city": ["NYC", "LA", "NYC", "Houston", "LA"],
})

# Exact duplicates
print(df.duplicated().sum())  # 2

# Duplicates based on specific columns
print(df.duplicated(subset=["name"]).sum())  # 1

# View duplicate rows
print(df[df.duplicated(keep=False)])  # Show all duplicates

# Unique values
print(df["name"].nunique())  # 4
print(df["name"].unique())   # ['Alice' 'Bob' 'Diana']

Removing Duplicates

# Remove exact duplicates (keep first occurrence)
df_clean = df.drop_duplicates()

# Keep last occurrence
df_clean = df.drop_duplicates(keep="last")

# Remove based on specific columns
df_clean = df.drop_duplicates(subset=["name"])

# Keep most recent (assuming sorted by date)
df = df.sort_values("date")
df_clean = df.drop_duplicates(subset=["name"], keep="last")

Outliers

Outliers are data points that differ significantly from other observations. They can be errors or genuinely extreme values.

Detecting Outliers

np.random.seed(42)
data = pd.Series(np.random.normal(50, 10, 1000).tolist() + [150, 200, -30])

# IQR Method (most common)
Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

outliers = data[(data < lower) | (data > upper)]
print(f"IQR bounds: {lower:.2f} to {upper:.2f}")
print(f"Outliers found: {len(outliers)}")

# Z-Score Method
from scipy import stats

z_scores = np.abs(stats.zscore(data))
outliers_z = data[z_scores > 3]
print(f"Z-score outliers: {len(outliers_z)}")

# Modified Z-Score (more robust)
median = data.median()
mad = np.median(np.abs(data - median))
modified_z = 0.6745 * (data - median) / mad
outliers_mz = data[np.abs(modified_z) > 3.5]
print(f"Modified Z outliers: {len(outliers_mz)}")

Handling Outliers

# Cap at percentiles (Winsorization)
lower_cap = data.quantile(0.01)
upper_cap = data.quantile(0.99)
data_capped = data.clip(lower=lower_cap, upper=upper_cap)

# Remove outliers
data_clean = data[(data >= lower) & (data <= upper)]

# Replace with NaN and impute
data.loc[(data < lower) | (data > upper)] = np.nan
data_imputed = data.fillna(data.median())

# Log transform (reduces impact of extreme values)
data_log = np.log1p(data.clip(lower=0))

Data Type Issues

df = pd.DataFrame({
    "price": ["19.99", "42.50", "N/A", "100.00"],
    "quantity": ["3", "5", "2", "7"],
    "date": ["01/15/2024", "02/20/2024", "03/25/2024", "04/10/2024"],
    "category": ["  A  ", "B", "  C  ", "A "],
})

# String to numeric (handle errors)
df["price"] = pd.to_numeric(df["price"], errors="coerce")

# String to integer
df["quantity"] = df["quantity"].astype(int)

# String to datetime
df["date"] = pd.to_datetime(df["date"], format="%m/%d/%Y")

# Clean string columns
df["category"] = df["category"].str.strip().str.upper()

# Optimize dtypes
df["quantity"] = df["quantity"].astype("int8")      # Smaller int type
df["price"] = df["price"].astype("float32")         # Smaller float type

# Categorical type (saves memory for low-cardinality columns)
df["category"] = df["category"].astype("category")

print(df.dtypes)
print(df)

String Cleaning

df = pd.DataFrame({
    "name": ["  alice smith ", "BOB JONES", "charlie@domain.com", "diana-ray"],
    "phone": ["555-123-4567", "(555) 987-6543", "555.234.5678", "N/A"],
})

# Standardize names
df["name_clean"] = (
    df["name"]
    .str.strip()
    .str.lower()
    .str.replace(r"[^a-zA-Z\s]", "", regex=True)  # Remove special chars
    .str.replace(r"\s+", " ", regex=True)          # Normalize whitespace
)

# Extract phone digits
df["phone_clean"] = df["phone"].str.replace(r"\D", "", regex=True)

# Extract email domain
df["domain"] = df["name"].str.extract(r"@(\S+)")

# Pattern matching
df["has_hyphen"] = df["name"].str.contains("-")

# Split columns
name_parts = df["name_clean"].str.split(" ", expand=True)
df["first_name"] = name_parts[0]
df["last_name"] = name_parts[1] if 1 in name_parts.columns else ""

print(df[["name_clean", "phone_clean", "domain", "first_name"]])

Feature Engineering Basics

Feature engineering creates new variables from existing data to improve model performance.

df = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=100, freq="D"),
    "revenue": np.random.randint(100, 1000, 100),
    "cost": np.random.randint(50, 500, 100),
    "category": np.random.choice(["A", "B", "C"], 100),
    "customer_age": np.random.randint(18, 70, 100),
})

# Date features
df["month"] = df["date"].dt.month
df["day_of_week"] = df["date"].dt.dayofweek
df["is_weekend"] = df["date"].dt.dayofweek >= 5
df["quarter"] = df["date"].dt.quarter

# Numeric features
df["profit"] = df["revenue"] - df["cost"]
df["profit_margin"] = df["profit"] / df["revenue"]
df["revenue_log"] = np.log1p(df["revenue"])

# Binning
df["age_group"] = pd.cut(
    df["customer_age"],
    bins=[0, 25, 35, 50, 100],
    labels=["young", "mid", "senior", "elder"],
)

# One-hot encoding
df_encoded = pd.get_dummies(df, columns=["category"], prefix="cat", drop_first=True)

# Lag features (for time series)
df["revenue_lag_1"] = df["revenue"].shift(1)
df["revenue_lag_7"] = df["revenue"].shift(7)

# Rolling statistics
df["revenue_ma_7"] = df["revenue"].rolling(window=7).mean()
df["revenue_std_7"] = df["revenue"].rolling(window=7).std()

print(df.head(10))

Complete Cleaning Pipeline

def clean_dataset(df, target_col=None):
    """
    Comprehensive data cleaning pipeline.

    Returns cleaned DataFrame and cleaning report.
    """
    report = {}
    original_shape = df.shape

    # 1. Remove exact duplicates
    n_dupes = df.duplicated().sum()
    df = df.drop_duplicates()
    report["duplicates_removed"] = n_dupes

    # 2. Clean column names
    df.columns = (
        df.columns.str.strip()
        .str.lower()
        .str.replace(r"[^a-z0-9_]", "_", regex=True)
        .str.replace(r"_+", "_", regex=True)
    )

    # 3. Convert data types
    for col in df.select_dtypes(include="object").columns:
        # Try numeric conversion
        converted = pd.to_numeric(df[col], errors="coerce")
        if converted.notna().sum() > len(df) * 0.5:
            df[col] = converted
            continue

        # Try datetime conversion
        try:
            df[col] = pd.to_datetime(df[col])
            continue
        except (ValueError, TypeError):
            pass

    # 4. Handle missing values
    missing_before = df.isnull().sum().sum()

    # Drop columns with > 50% missing
    threshold = len(df) * 0.5
    df = df.dropna(axis=1, thresh=threshold)

    # Fill numeric with median, categorical with mode
    for col in df.select_dtypes(include="number").columns:
        df[col] = df[col].fillna(df[col].median())

    for col in df.select_dtypes(include=["object", "category"]).columns:
        if df[col].mode().empty:
            df[col] = df[col].fillna("unknown")
        else:
            df[col] = df[col].fillna(df[col].mode()[0])

    missing_after = df.isnull().sum().sum()

    # 5. Remove obvious outliers (IQR method) for numeric columns
    outliers_removed = 0
    numeric_cols = df.select_dtypes(include="number").columns
    if target_col:
        numeric_cols = numeric_cols.drop(target_col, errors="ignore")

    for col in numeric_cols:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        lower = Q1 - 3 * IQR  # Use 3x for less aggressive filtering
        upper = Q3 + 3 * IQR
        mask = (df[col] >= lower) & (df[col] <= upper)
        outliers_removed += (~mask).sum()
        df = df[mask]

    report.update({
        "original_shape": original_shape,
        "final_shape": df.shape,
        "missing_before": missing_before,
        "missing_after": missing_after,
        "outliers_removed": outliers_removed,
    })

    print("Cleaning Report:")
    for k, v in report.items():
        print(f"  {k}: {v}")

    return df, report

Missing Data Formulas

Missing data percentage per column:

\text{missing\_pct}_j = \frac{\sum_{i=1}^{n} \mathbb{1}(x_{ij} = \text{NaN})}{n} \times 100\%

Mean imputation:

\hat{x}_{ij} = \bar{x}_j = \frac{1}{n_{\text{obs}}} \sum_{i: x_{ij} \neq \text{NaN}} x_{ij}

Forward fill (time series):

\hat{x}_{ij} = x_{(i-1)j} \quad \text{for the nearest preceding non-missing value}

IQR outlier bounds:

\text{lower} = Q_1 - 1.5 \times IQR, \quad \text{upper} = Q_3 + 1.5 \times IQR

Z-score outlier threshold:

|z_i| = \left|\frac{x_i - \bar{x}}{\sigma}\right| > 3 \implies \text{outlier}

Key Takeaways

Always check missing values first – the pattern tells you the right handling method.
Use IQR or Z-score for outlier detection; domain knowledge decides whether to remove or cap them.
Clean column names, data types, and strings before analysis.
Feature engineering (date parts, ratios, bins, encoding) is where you add real value.
Build cleaning functions early – you will reuse them on every project.
Document every cleaning decision – your future self will thank you.

Data Cleaning Essentials

Data Cleaning Essentials

Data Cleaning Pipeline

Missing Values

Detecting Missing Values

Types of Missingness

Handling Missing Values

Duplicates

Detecting Duplicates

Removing Duplicates

Outliers

Detecting Outliers

Handling Outliers

Data Type Issues

String Cleaning

Feature Engineering Basics

Complete Cleaning Pipeline

Missing Data Formulas

Key Takeaways

Premium Content

Need Expert Data Science Help?