Principal Component Analysis (PCA)

Reducing Dimensionality While Preserving Variance

PCA finds orthogonal directions of maximum variance in high-dimensional data. By projecting onto principal components, you reduce complexity while retaining the most informative structure in the data.
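To make "orthogonal directions of maximum variance" concrete, here is a minimal from-scratch sketch using only NumPy. It assumes the textbook formulation (eigendecomposition of the sample covariance matrix), which is one common way to compute PCA and not necessarily what any particular library does internally. The simulated data and all variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (an assumption for illustration): 500 points, 3 correlated
# variables generated from 2 latent factors plus a little noise
latent = rng.normal(size=(500, 2))
mixing = np.array([[2.0, 0.0],
                   [1.0, 1.0],
                   [0.0, 0.5]])
X = latent @ mixing.T + 0.1 * rng.normal(size=(500, 3))

# 1. Center each variable at zero
Xc = X - X.mean(axis=0)

# 2. Sample covariance matrix (variables are in columns)
cov = np.cov(Xc, rowvar=False)

# 3. Eigendecomposition of the symmetric covariance matrix.
#    eigh returns eigenvalues in ascending order, so reorder to descending.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Project the centered data onto the eigenvectors (principal directions)
scores = Xc @ eigvecs

# Each eigenvalue is the variance of the data along its direction
explained_ratio = eigvals / eigvals.sum()
print("Explained variance ratio:", np.round(explained_ratio, 3))
print("Directions orthonormal:", np.allclose(eigvecs.T @ eigvecs, np.eye(3)))
```

The columns of `eigvecs` are mutually orthogonal, and the variance of each column of `scores` equals the corresponding eigenvalue, which is why sorting by eigenvalue ranks the directions by variance.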

Typical applications:

Genomics: visualize the expression levels of thousands of genes in a 2D plot
Image Processing: compress facial recognition features while preserving identity information
Finance: extract key risk factors from correlated asset returns

When variables are strongly correlated, the first few components often capture most of the variance that is spread across hundreds of original variables.
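The claim about the first few components can be checked on simulated data. The sketch below assumes 50 observed variables driven by 3 latent factors (the sizes and the noise level are arbitrary choices for illustration) and measures how much variance is lost when the data is rebuilt from only the first k components. It uses the SVD of the centered data, whose right singular vectors are the principal directions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data (an assumption): 50 observed variables, 3 latent factors
n_samples, n_features, n_latent = 300, 50, 3
latent = rng.normal(size=(n_samples, n_latent))
loadings = rng.normal(size=(n_features, n_latent))
X = latent @ loadings.T + 0.3 * rng.normal(size=(n_samples, n_features))

# Center the data; rows of Vt are then the principal directions
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)


def reconstruction_error(k):
    """Fraction of total variance lost when keeping only the first k components."""
    X_hat = (Xc @ Vt[:k].T) @ Vt[:k]  # project down to k dims, then map back
    return np.sum((Xc - X_hat) ** 2) / np.sum(Xc ** 2)


for k in (1, 2, 3, 5, 10):
    print(f"k={k:2d}  variance lost: {reconstruction_error(k):.3f}")
```

With this setup the loss should drop sharply up to k = 3 and only slowly afterwards, because the remaining components mostly describe noise. Real data rarely has such a clean cutoff.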

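In practice, PCA is used for dimensionality reduction, visualization, and feature extraction. The example below simulates 10 correlated variables driven by 3 latent factors, standardizes them, fits PCA, and draws a scree plot, a cumulative variance plot, a 2D projection, and a biplot of the feature loadings.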

import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

np.random.seed(42)

# Simulated data: 10 observed variables generated from 3 latent factors
# plus independent noise, so the variables are correlated with each other
n = 200
true_components = 3
X_latent = np.random.randn(n, true_components)
loading_matrix = np.random.randn(10, true_components)
X = X_latent @ loading_matrix.T + np.random.randn(n, 10) * 0.5

# Standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# PCA
pca = PCA()
X_pca = pca.fit_transform(X_scaled)

# Scree plot
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

axes[0].plot(range(1, 11), pca.explained_variance_ratio_*100, 'bo-', markersize=8)
axes[0].bar(range(1, 11), pca.explained_variance_ratio_*100, alpha=0.3, color='steelblue')
axes[0].set_xlabel('Principal Component')
axes[0].set_ylabel('Variance Explained (%)')
axes[0].set_title('Scree Plot')

# Cumulative variance
cumvar = np.cumsum(pca.explained_variance_ratio_)*100
axes[1].plot(range(1, 11), cumvar, 'ro-', markersize=8)
axes[1].axhline(80, color='green', linestyle='--', label='80% threshold')
axes[1].axhline(95, color='blue', linestyle='--', label='95% threshold')
axes[1].set_title('Cumulative Variance Explained')
axes[1].set_xlabel('Number of Components')
axes[1].set_ylabel('Cumulative Variance (%)')
axes[1].legend()

n_comp_80 = np.argmax(cumvar >= 80) + 1
n_comp_95 = np.argmax(cumvar >= 95) + 1
print(f"Components needed for 80% variance: {n_comp_80}")
print(f"Components needed for 95% variance: {n_comp_95}")

# 2D visualization using first 2 PCs
axes[2].scatter(X_pca[:, 0], X_pca[:, 1], alpha=0.5, color='steelblue')
axes[2].set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]*100:.1f}%)')
axes[2].set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]*100:.1f}%)')
axes[2].set_title('Data in PC Space')

plt.tight_layout()
plt.savefig('pca.png', dpi=150)
plt.show()

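# Biplot: PC scores with the feature loadings overlaid as arrows
fig, ax = plt.subplots(figsize=(8, 8))
loadings = pca.components_.T  # shape (n_features, n_components)
ax.scatter(X_pca[:, 0], X_pca[:, 1], alpha=0.3, s=10, color='gray')
# One arrow per original variable; the factor of 5 is an arbitrary visual
# scale so the loadings are readable against the scores
for i, (x, y) in enumerate(loadings[:, :2]):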
    ax.arrow(0, 0, x*5, y*5, head_width=0.1, head_length=0.05,
             fc='red', ec='red')
    ax.text(x*5.5, y*5.5, f'X{i+1}', fontsize=9, color='red', ha='center')
ax.axhline(0, color='black', linewidth=0.5)
ax.axvline(0, color='black', linewidth=0.5)
ax.set_title('PCA Biplot')
ax.set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]*100:.1f}%)')
ax.set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]*100:.1f}%)')
plt.tight_layout()
plt.savefig('pca_biplot.png', dpi=150)
plt.show()
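The example above standardizes the data before fitting PCA. The following sketch illustrates why that step matters, under the assumption of a toy dataset where one variable is measured on a much larger scale than the others. It uses NumPy only; dividing by `X.std(axis=0)` (population standard deviation) mirrors what `StandardScaler` does by default. The helper `first_component` is a hypothetical name introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated data (an assumption): two correlated variables on a unit scale,
# plus one unrelated variable measured on a much larger scale
n = 1000
a = rng.normal(size=n)
b = 0.8 * a + 0.6 * rng.normal(size=n)
c = 1000.0 * rng.normal(size=n)
X = np.column_stack([a, b, c])


def first_component(data):
    """Return the PC1 direction and the share of variance it explains."""
    centered = data - data.mean(axis=0)
    _, S, Vt = np.linalg.svd(centered, full_matrices=False)
    ratios = S ** 2 / np.sum(S ** 2)
    return Vt[0], ratios[0]


# Standardize: zero mean and unit variance per column
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

pc1_raw, ratio_raw = first_component(X)
pc1_std, ratio_std = first_component(X_std)

print("Raw PC1 direction:         ", np.round(pc1_raw, 3), f"({ratio_raw:.1%})")
print("Standardized PC1 direction:", np.round(pc1_std, 3), f"({ratio_std:.1%})")
```

On the raw data, PC1 points almost entirely along the large-scale variable simply because of its units. After standardization, PC1 instead reflects the correlation between the first two variables. Whether to standardize depends on whether differences in scale carry meaning for the problem at hand.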
