Principal Component Analysis (PCA) — Dimensionality Reduction

Principal Component Analysis (PCA)

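PCA finds orthogonal directions (principal components) along which the data vary most, ordered so that each successive component captures the largest share of the remaining variance. It is used for dimensionality reduction, visualization, and feature extraction.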
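
To make "directions of maximum variance" concrete, here is a minimal sketch of PCA computed by hand with NumPy: center the data, take the eigenvectors of the covariance matrix, and sort them by eigenvalue. The toy data and the names `A`, `mixing`, `scores`, and `explained_ratio` are illustrative assumptions, not part of the lesson's dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 100 samples, 3 correlated features
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.2]])
A = rng.normal(size=(100, 3)) @ mixing

# 1. Center each feature at zero
Ac = A - A.mean(axis=0)

# 2. Covariance matrix of the features (columns)
cov = np.cov(Ac, rowvar=False)

# 3. Eigen-decomposition; eigh is meant for symmetric matrices
#    and returns eigenvalues in ascending order
eigvals, eigvecs = np.linalg.eigh(cov)

# 4. Reorder so the largest-variance direction comes first
order = np.argsort(eigvals)[::-1]
eigvals = eigvals[order]
eigvecs = eigvecs[:, order]

# 5. Project the centered data onto the principal directions
scores = Ac @ eigvecs

# Share of total variance captured by each component
explained_ratio = eigvals / eigvals.sum()
print(explained_ratio)
```

The columns of `eigvecs` play the role of the principal components and `eigvals` are the variances along them. The scikit-learn code that follows reaches the same decomposition through a singular value decomposition rather than an explicit covariance matrix.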

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

np.random.seed(42)

# Simulated data: 10 observed variables generated from 3 latent factors plus noise
n = 200
true_components = 3
X_latent = np.random.randn(n, true_components)
loading_matrix = np.random.randn(10, true_components)
X = X_latent @ loading_matrix.T + np.random.randn(n, 10) * 0.5

# Standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# PCA
pca = PCA()
X_pca = pca.fit_transform(X_scaled)

# Scree plot
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

axes[0].plot(range(1, 11), pca.explained_variance_ratio_*100, 'bo-', markersize=8)
axes[0].bar(range(1, 11), pca.explained_variance_ratio_*100, alpha=0.3, color='steelblue')
axes[0].set_xlabel('Principal Component')
axes[0].set_ylabel('Variance Explained (%)')
axes[0].set_title('Scree Plot')

# Cumulative variance
cumvar = np.cumsum(pca.explained_variance_ratio_)*100
axes[1].plot(range(1, 11), cumvar, 'ro-', markersize=8)
axes[1].axhline(80, color='green', linestyle='--', label='80% threshold')
axes[1].axhline(95, color='blue', linestyle='--', label='95% threshold')
axes[1].set_title('Cumulative Variance Explained')
axes[1].set_xlabel('Number of Components')
axes[1].set_ylabel('Cumulative Variance (%)')
axes[1].legend()

n_comp_80 = np.argmax(cumvar >= 80) + 1
n_comp_95 = np.argmax(cumvar >= 95) + 1
print(f"Components needed for 80% variance: {n_comp_80}")
print(f"Components needed for 95% variance: {n_comp_95}")

# 2D visualization using first 2 PCs
axes[2].scatter(X_pca[:, 0], X_pca[:, 1], alpha=0.5, color='steelblue')
axes[2].set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]*100:.1f}%)')
axes[2].set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]*100:.1f}%)')
axes[2].set_title('Data in PC Space')

plt.tight_layout()
plt.savefig('pca.png', dpi=150)
plt.show()

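# Biplot: feature directions overlaid on the PC1/PC2 scores
fig, ax = plt.subplots(figsize=(8, 8))
# Rows = original features, columns = PCs; each column is a unit-length direction
loadings = pca.components_.T
ax.scatter(X_pca[:, 0], X_pca[:, 1], alpha=0.3, s=10, color='gray')
for i, (x, y) in enumerate(loadings[:, :2]):
    # The factor of 5 only stretches the arrows so they are visible against the scores
    ax.arrow(0, 0, x*5, y*5, head_width=0.1, head_length=0.05,
             fc='red', ec='red')
    ax.text(x*5.5, y*5.5, f'X{i+1}', fontsize=9, color='red', ha='center')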
ax.axhline(0, color='black', linewidth=0.5)
ax.axvline(0, color='black', linewidth=0.5)
ax.set_title('PCA Biplot')
ax.set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]*100:.1f}%)')
ax.set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]*100:.1f}%)')
plt.tight_layout()
plt.savefig('pca_biplot.png', dpi=150)
plt.show()
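
The 80% and 95% thresholds above are read off the cumulative variance curve by hand. As a hedged alternative, scikit-learn's `PCA` accepts a fraction for `n_components` and keeps the smallest number of components whose cumulative explained variance exceeds it. The sketch below regenerates similar synthetic data so it runs on its own; the names `X_demo`, `pca_95`, and `rel_err` are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic data in the same spirit as above: 3 latent factors, 10 observed variables
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(10, 3))
X_demo = latent @ mixing.T + 0.5 * rng.normal(size=(200, 10))
X_demo_scaled = StandardScaler().fit_transform(X_demo)

# A float in (0, 1) asks for enough components to pass that variance fraction;
# svd_solver="full" is set explicitly because this mode requires an exact solver
pca_95 = PCA(n_components=0.95, svd_solver="full")
Z = pca_95.fit_transform(X_demo_scaled)

# Map the reduced data back to the original 10-dimensional space
X_back = pca_95.inverse_transform(Z)

# Relative squared reconstruction error = fraction of variance discarded
rel_err = np.sum((X_demo_scaled - X_back) ** 2) / np.sum(X_demo_scaled ** 2)

print(f"Components kept: {pca_95.n_components_}")
print(f"Variance retained: {pca_95.explained_variance_ratio_.sum():.3f}")
print(f"Relative reconstruction error: {rel_err:.3f}")
```

Because the standardized data are centered, the relative reconstruction error equals one minus the retained variance fraction, which gives a direct reading of how much information the dropped components carried.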

Key Takeaways

  1. PC directions are orthogonal, and the resulting scores are uncorrelated linear combinations of the original features
  2. The scree plot and cumulative variance guide how many PCs to keep
  3. Standardize features before PCA when they are on different scales; otherwise high-variance features dominate
  4. Loadings show how strongly each original feature contributes to each PC
  5. PCA captures only linear structure; use t-SNE or UMAP for nonlinear dimensionality reduction
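
Takeaways 1 and 3 can be checked numerically. The sketch below uses two hypothetical, independent features on very different scales (the names `income` and `age` and their distributions are made up for illustration) and compares PCA with and without standardization.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical features: independent, but with very different variances
income = rng.normal(50_000, 15_000, size=500)
age = rng.normal(40, 10, size=500)
X_raw = np.column_stack([income, age])

# PCA on raw data: the large-scale feature dominates PC1
pca_raw = PCA().fit(X_raw)

# PCA on standardized data: both features are on an equal footing
X_std = StandardScaler().fit_transform(X_raw)
pca_std = PCA().fit(X_std)
scores = pca_std.transform(X_std)

print("Raw variance ratios:         ", pca_raw.explained_variance_ratio_)
print("Raw PC1 direction:           ", pca_raw.components_[0])
print("Standardized variance ratios:", pca_std.explained_variance_ratio_)

# Takeaway 1: the PC scores are uncorrelated with each other
print("Correlation between PC scores:", np.corrcoef(scores, rowvar=False)[0, 1])
```

Without scaling, PC1 is essentially the `income` axis, so the "structure" PCA reports is just a units artifact. After standardization the two components share the variance roughly equally, as expected for independent features.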
