Data Cleaning Essentials
Data scientists spend 60-80% of their time cleaning data. Dirty data leads to wrong conclusions and broken models. This lesson covers every major cleaning technique you will use in practice.
Data Cleaning Pipeline
Missing Values
Missing data is the most common data quality issue. How you handle it affects your analysis significantly.
Detecting Missing Values
import pandas as pd
import numpy as np
df = pd.DataFrame({
"name": ["Alice", "Bob", None, "Diana", "Eve"],
"age": [25, np.nan, 35, 28, np.nan],
"salary": [70000, 80000, np.nan, 75000, 90000],
"department": ["eng", "eng", None, "sales", "eng"],
})
# Count missing values per column
print(df.isnull().sum())
# name 1
# age 2
# salary 1
# department 1
# Percentage missing
print((df.isnull().sum() / len(df) * 100).round(1))
# name 20.0
# age 40.0
# salary 20.0
# department 20.0
# Rows with any missing values
print(df[df.isnull().any(axis=1)])
# Patterns in missing data
print(df.isnull().describe())
Types of Missingness
MCAR (Missing Completely at Random):
Missingness has no relationship with any data.
Example: Lab equipment randomly fails.
Safe to drop rows.
MAR (Missing at Random):
Missingness is related to observed data but not the missing value itself.
Example: Younger people skip income questions (age is observed).
Use imputation based on other columns.
MNAR (Missing Not at Random):
Missingness is related to the missing value itself.
Example: High earners hide their income.
Requires domain knowledge or specialized methods.
Handling Missing Values
df = pd.DataFrame({
"age": [25, 30, np.nan, 35, 28, np.nan],
"salary": [50000, np.nan, 70000, np.nan, 60000, 80000],
"score": [85, 90, np.nan, np.nan, 75, 88],
})
# Drop rows with any missing values
print(df.dropna())
print(df.dropna(subset=["age"])) # Only check specific columns
# Drop columns with too many missing values
threshold = len(df) * 0.5 # Drop if > 50% missing
print(df.dropna(axis=1, thresh=threshold))
# Fill with constant
df["age"] = df["age"].fillna(0)
# Fill with statistics
df["salary"] = df["salary"].fillna(df["salary"].mean())
df["score"] = df["score"].fillna(df["score"].median())
# Fill with different values per column
df = df.fillna({
"age": df["age"].mean(),
"salary": df["salary"].median(),
"score": 0,
})
# Forward fill (useful for time series)
df["score"] = df["score"].fillna(method="ffill")
# Backward fill
df["score"] = df["score"].fillna(method="bfill")
# Interpolation
df["age"] = df["age"].interpolate(method="linear")
# Indicator variable for missingness
df["age_was_missing"] = df["age"].isnull().astype(int)
Duplicates
Duplicate rows inflate counts and bias aggregations.
Detecting Duplicates
df = pd.DataFrame({
"name": ["Alice", "Bob", "Alice", "Diana", "Bob"],
"age": [25, 30, 25, 28, 30],
"city": ["NYC", "LA", "NYC", "Houston", "LA"],
})
# Exact duplicates
print(df.duplicated().sum()) # 2
# Duplicates based on specific columns
print(df.duplicated(subset=["name"]).sum()) # 1
# View duplicate rows
print(df[df.duplicated(keep=False)]) # Show all duplicates
# Unique values
print(df["name"].nunique()) # 4
print(df["name"].unique()) # ['Alice' 'Bob' 'Diana']
Removing Duplicates
# Remove exact duplicates (keep first occurrence)
df_clean = df.drop_duplicates()
# Keep last occurrence
df_clean = df.drop_duplicates(keep="last")
# Remove based on specific columns
df_clean = df.drop_duplicates(subset=["name"])
# Keep most recent (assuming sorted by date)
df = df.sort_values("date")
df_clean = df.drop_duplicates(subset=["name"], keep="last")
Outliers
Outliers are data points that differ significantly from other observations. They can be errors or genuinely extreme values.
Detecting Outliers
np.random.seed(42)
data = pd.Series(np.random.normal(50, 10, 1000).tolist() + [150, 200, -30])
# IQR Method (most common)
Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
outliers = data[(data < lower) | (data > upper)]
print(f"IQR bounds: {lower:.2f} to {upper:.2f}")
print(f"Outliers found: {len(outliers)}")
# Z-Score Method
from scipy import stats
z_scores = np.abs(stats.zscore(data))
outliers_z = data[z_scores > 3]
print(f"Z-score outliers: {len(outliers_z)}")
# Modified Z-Score (more robust)
median = data.median()
mad = np.median(np.abs(data - median))
modified_z = 0.6745 * (data - median) / mad
outliers_mz = data[np.abs(modified_z) > 3.5]
print(f"Modified Z outliers: {len(outliers_mz)}")
Handling Outliers
# Cap at percentiles (Winsorization)
lower_cap = data.quantile(0.01)
upper_cap = data.quantile(0.99)
data_capped = data.clip(lower=lower_cap, upper=upper_cap)
# Remove outliers
data_clean = data[(data >= lower) & (data <= upper)]
# Replace with NaN and impute
data.loc[(data < lower) | (data > upper)] = np.nan
data_imputed = data.fillna(data.median())
# Log transform (reduces impact of extreme values)
data_log = np.log1p(data.clip(lower=0))
Data Type Issues
df = pd.DataFrame({
"price": ["19.99", "42.50", "N/A", "100.00"],
"quantity": ["3", "5", "2", "7"],
"date": ["01/15/2024", "02/20/2024", "03/25/2024", "04/10/2024"],
"category": [" A ", "B", " C ", "A "],
})
# String to numeric (handle errors)
df["price"] = pd.to_numeric(df["price"], errors="coerce")
# String to integer
df["quantity"] = df["quantity"].astype(int)
# String to datetime
df["date"] = pd.to_datetime(df["date"], format="%m/%d/%Y")
# Clean string columns
df["category"] = df["category"].str.strip().str.upper()
# Optimize dtypes
df["quantity"] = df["quantity"].astype("int8") # Smaller int type
df["price"] = df["price"].astype("float32") # Smaller float type
# Categorical type (saves memory for low-cardinality columns)
df["category"] = df["category"].astype("category")
print(df.dtypes)
print(df)
String Cleaning
df = pd.DataFrame({
"name": [" alice smith ", "BOB JONES", "charlie@domain.com", "diana-ray"],
"phone": ["555-123-4567", "(555) 987-6543", "555.234.5678", "N/A"],
})
# Standardize names
df["name_clean"] = (
df["name"]
.str.strip()
.str.lower()
.str.replace(r"[^a-zA-Z\s]", "", regex=True) # Remove special chars
.str.replace(r"\s+", " ", regex=True) # Normalize whitespace
)
# Extract phone digits
df["phone_clean"] = df["phone"].str.replace(r"\D", "", regex=True)
# Extract email domain
df["domain"] = df["name"].str.extract(r"@(\S+)")
# Pattern matching
df["has_hyphen"] = df["name"].str.contains("-")
# Split columns
name_parts = df["name_clean"].str.split(" ", expand=True)
df["first_name"] = name_parts[0]
df["last_name"] = name_parts[1] if 1 in name_parts.columns else ""
print(df[["name_clean", "phone_clean", "domain", "first_name"]])
Feature Engineering Basics
Feature engineering creates new variables from existing data to improve model performance.
df = pd.DataFrame({
"date": pd.date_range("2024-01-01", periods=100, freq="D"),
"revenue": np.random.randint(100, 1000, 100),
"cost": np.random.randint(50, 500, 100),
"category": np.random.choice(["A", "B", "C"], 100),
"customer_age": np.random.randint(18, 70, 100),
})
# Date features
df["month"] = df["date"].dt.month
df["day_of_week"] = df["date"].dt.dayofweek
df["is_weekend"] = df["date"].dt.dayofweek >= 5
df["quarter"] = df["date"].dt.quarter
# Numeric features
df["profit"] = df["revenue"] - df["cost"]
df["profit_margin"] = df["profit"] / df["revenue"]
df["revenue_log"] = np.log1p(df["revenue"])
# Binning
df["age_group"] = pd.cut(
df["customer_age"],
bins=[0, 25, 35, 50, 100],
labels=["young", "mid", "senior", "elder"],
)
# One-hot encoding
df_encoded = pd.get_dummies(df, columns=["category"], prefix="cat", drop_first=True)
# Lag features (for time series)
df["revenue_lag_1"] = df["revenue"].shift(1)
df["revenue_lag_7"] = df["revenue"].shift(7)
# Rolling statistics
df["revenue_ma_7"] = df["revenue"].rolling(window=7).mean()
df["revenue_std_7"] = df["revenue"].rolling(window=7).std()
print(df.head(10))
Complete Cleaning Pipeline
def clean_dataset(df, target_col=None):
"""
Comprehensive data cleaning pipeline.
Returns cleaned DataFrame and cleaning report.
"""
report = {}
original_shape = df.shape
# 1. Remove exact duplicates
n_dupes = df.duplicated().sum()
df = df.drop_duplicates()
report["duplicates_removed"] = n_dupes
# 2. Clean column names
df.columns = (
df.columns.str.strip()
.str.lower()
.str.replace(r"[^a-z0-9_]", "_", regex=True)
.str.replace(r"_+", "_", regex=True)
)
# 3. Convert data types
for col in df.select_dtypes(include="object").columns:
# Try numeric conversion
converted = pd.to_numeric(df[col], errors="coerce")
if converted.notna().sum() > len(df) * 0.5:
df[col] = converted
continue
# Try datetime conversion
try:
df[col] = pd.to_datetime(df[col])
continue
except (ValueError, TypeError):
pass
# 4. Handle missing values
missing_before = df.isnull().sum().sum()
# Drop columns with > 50% missing
threshold = len(df) * 0.5
df = df.dropna(axis=1, thresh=threshold)
# Fill numeric with median, categorical with mode
for col in df.select_dtypes(include="number").columns:
df[col] = df[col].fillna(df[col].median())
for col in df.select_dtypes(include=["object", "category"]).columns:
if df[col].mode().empty:
df[col] = df[col].fillna("unknown")
else:
df[col] = df[col].fillna(df[col].mode()[0])
missing_after = df.isnull().sum().sum()
# 5. Remove obvious outliers (IQR method) for numeric columns
outliers_removed = 0
numeric_cols = df.select_dtypes(include="number").columns
if target_col:
numeric_cols = numeric_cols.drop(target_col, errors="ignore")
for col in numeric_cols:
Q1 = df[col].quantile(0.25)
Q3 = df[col].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 3 * IQR # Use 3x for less aggressive filtering
upper = Q3 + 3 * IQR
mask = (df[col] >= lower) & (df[col] <= upper)
outliers_removed += (~mask).sum()
df = df[mask]
report.update({
"original_shape": original_shape,
"final_shape": df.shape,
"missing_before": missing_before,
"missing_after": missing_after,
"outliers_removed": outliers_removed,
})
print("Cleaning Report:")
for k, v in report.items():
print(f" {k}: {v}")
return df, report
Missing Data Formulas
Missing data percentage per column:
Mean imputation:
Forward fill (time series):
IQR outlier bounds:
Z-score outlier threshold:
Key Takeaways
- Always check missing values first β the pattern tells you the right handling method.
- Use IQR or Z-score for outlier detection; domain knowledge decides whether to remove or cap them.
- Clean column names, data types, and strings before analysis.
- Feature engineering (date parts, ratios, bins, encoding) is where you add real value.
- Build cleaning functions early β you will reuse them on every project.
- Document every cleaning decision β your future self will thank you.