PCA: Dimensionality Reduction and Feature Extraction

Why Dimensionality Reduction?

High-dimensional data presents fundamental challenges that degrade both computational efficiency and model performance. The curse of dimensionality manifests in several critical ways.

The Curse of Dimensionality

As dimensionality increases:

Volume Explosion: The volume of the feature space grows exponentially as , where is the range of each feature
Data Sparsity: For a fixed number of observations , the density of data points vanishes:
Distance Concentration: In high dimensions, the ratio of distances between nearest and farthest points converges to 1:

Overfitting Risk: Models with many parameters relative to observations generalize poorly

Benefits of Dimensionality Reduction

Computational Efficiency: Reduces time complexity from to where
Noise Filtering: Removes low-variance components containing primarily noise
Visualization: Projects high-dimensional data into 2D/3D for human interpretation
Multicollinearity Removal: Creates orthogonal features that are uncorrelated
Improved Generalization: Reduces model complexity and overfitting

PCA Algorithm — Step by Step

PCA finds orthogonal directions of maximum variance in the data through eigendecomposition of the covariance matrix.

Mathematical Formulation

Given a dataset with observations and features:

Step 1 — Center the data:

Step 2 — Compute the covariance matrix:

Step 3 — Eigendecomposition:

where are eigenvalues and are corresponding eigenvectors.

Step 4 — Project data onto principal components:

where contains the top eigenvectors.

Geometric Interpretation

Eigendecomposition and SVD

Relationship Between PCA and Eigendecomposition

The covariance matrix is symmetric positive semi-definite, guaranteeing real non-negative eigenvalues and orthogonal eigenvectors:

where and .

SVD-Based PCA

For computational efficiency, PCA is typically performed via Singular Value Decomposition:

where:

: left singular vectors (orthonormal)
: diagonal matrix of singular values
: right singular vectors (principal directions)

Connection to eigenvalues:

The principal component scores are:

Eigenvector Visualization

Variance Explained and Scree Plot

Proportion of Variance Explained

Each principal component captures a fraction of the total variance:

The cumulative variance explained by the first components:

Scree Plot

The scree plot displays eigenvalues in descending order, showing the "elbow" where adding more components yields diminishing returns.

Choosing the Number of Components

Three principal methods guide component selection:

1. Kaiser's Rule (Eigenvalue > 1)

Retain components where (for standardized data):

Rationale: A component with explains less variance than a single original variable.

2. Cumulative Variance Threshold

Select such that cumulative PVE exceeds a threshold :

Common thresholds: .

3. Scree Plot Elbow Method

Identify the "elow" in the scree plot — the point where the rate of decrease sharply changes:

4. Parallel Analysis (Horn's Method)

Compare eigenvalues to those from random data of the same dimensions. Retain components where:

This method corrects for the upward bias in eigenvalues due to finite sample sizes.

PCA for Visualization (2D/3D)

From High Dimensions to 2D

PCA enables visualization of high-dimensional datasets by projecting onto the first two or three principal components while preserving maximum variance.

Information Loss: When projecting from dimensions to :

2D Projection Example

Kernel PCA for Non-Linear Data

Limitation of Standard PCA

Standard PCA assumes linear relationships. For non-linear manifolds, PCA may fail to capture the intrinsic structure.

The Kernel Trick

Kernel PCA maps data to a higher-dimensional feature space via a non-linear mapping , then performs PCA in .

The key insight: we never compute explicitly. Instead, we use the kernel function:

Common Kernel Functions

Kernel	Formula	Use Case
Linear		Standard PCA
Polynomial		Polynomial relationships
RBF (Gaussian)		Complex non-linear boundaries
Sigmoid		Neural network-like behavior

Kernel PCA Algorithm

Compute the kernel matrix:
Center the kernel matrix:
Solve the eigenvalue problem:
Project onto kernel principal components:

Comparison: Linear vs Kernel PCA

Implementation in Python

Standard PCA with Scikit-Learn

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_digits

# Load high-dimensional data
digits = load_digits()
X = digits.data  # (1797, 64) — 8x8 pixel images
y = digits.target

# Standardize (critical for PCA)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Results
print(f"Original shape: {X.shape}")
print(f"Reduced shape:  {X_pca.shape}")
print(f"Variance explained: {pca.explained_variance_ratio_.sum():.3f}")
print(f"Per-component: {pca.explained_variance_ratio_}")

Scree Plot and Component Selection

# Full PCA to inspect all components
pca_full = PCA().fit(X_scaled)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Scree plot
axes[0].bar(range(1, len(pca_full.explained_variance_) + 1),
            pca_full.explained_variance_, alpha=0.7, label='Individual')
axes[0].step(range(1, len(pca_full.explained_variance_ratio_) + 1),
             np.cumsum(pca_full.explained_variance_ratio_),
             where='mid', color='red', label='Cumulative')
axes[0].axhline(y=0.90, color='k', linestyle='--', label='90% threshold')
axes[0].set_xlabel('Principal Component')
axes[0].set_ylabel('Variance Explained')
axes[0].set_title('Scree Plot')
axes[0].legend()

# Cumulative variance
cumvar = np.cumsum(pca_full.explained_variance_ratio_)
n_components_90 = np.argmax(cumvar >= 0.90) + 1
axes[1].plot(range(1, len(cumvar) + 1), cumvar, 'o-')
axes[1].axvline(x=n_components_90, color='r', linestyle='--',
                label=f'{n_components_90} components for 90%')
axes[1].set_xlabel('Number of Components')
axes[1].set_ylabel('Cumulative Variance Explained')
axes[1].set_title('Cumulative Variance')
axes[1].legend()

plt.tight_layout()
plt.show()

2D Visualization with Class Labels

# Visualize digit clusters in 2D PCA space
plt.figure(figsize=(10, 8))
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y,
                      cmap='tab10', alpha=0.6, s=10)
plt.colorbar(scatter, label='Digit')
plt.xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%} variance)')
plt.ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%} variance)')
plt.title('MNIST Digits — PCA 2D Projection')
plt.show()

Reconstruction from Principal Components

# Reconstruct from k components
n_comp = 30
pca_30 = PCA(n_components=n_comp)
X_transformed = pca_30.fit_transform(X_scaled)
X_reconstructed = pca_30.inverse_transform(X_transformed)

# Reconstruct to original scale
X_recon_original = scaler.inverse_transform(X_reconstructed)

# Visualize reconstruction quality
fig, axes = plt.subplots(2, 8, figsize=(16, 4))
for i in range(8):
    axes[0, i].imshow(digits.images[i], cmap='gray_r')
    axes[0, i].set_title('Original')
    axes[0, i].axis('off')

    axes[1, i].imshow(X_recon_original[i].reshape(8, 8), cmap='gray_r')
    axes[1, i].set_title('Reconstructed')
    axes[1, i].axis('off')

plt.suptitle(f'Reconstruction with {n_comp} components')
plt.tight_layout()
plt.show()

Kernel PCA Implementation

from sklearn.decomposition import KernelPCA

# Non-linear data example
from sklearn.datasets import make_moons
X_moons, y_moons = make_moons(n_samples=300, noise=0.1, random_state=42)

# Standard PCA vs Kernel PCA
pca_std = PCA(n_components=2)
kpca_rbf = KernelPCA(n_components=2, kernel='rbf', gamma=15)
kpca_poly = KernelPCA(n_components=2, kernel='poly', degree=3)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

for ax, model, title in zip(axes,
    [pca_std, kpca_rbf, kpca_poly],
    ['Linear PCA', 'Kernel PCA (RBF)', 'Kernel PCA (Poly)']):
    X_proj = model.fit_transform(X_moons)
    ax.scatter(X_proj[:, 0], X_proj[:, 1], c=y_moons, cmap='coolwarm', s=20)
    ax.set_title(title)
    ax.set_xlabel('Component 1')
    ax.set_ylabel('Component 2')

plt.tight_layout()
plt.show()

Incremental PCA for Large Datasets

from sklearn.decomposition import IncrementalPCA

# For datasets that don't fit in memory
n_batches = 10
ipca = IncrementalPCA(n_components=2)

batch_size = len(X_scaled) // n_batches
for i in range(n_batches):
    start = i * batch_size
    end = start + batch_size
    ipca.fit(X_scaled[start:end])

X_ipca = ipca.transform(X_scaled)
print(f"Explained variance: {ipca.explained_variance_ratio_.sum():.3f}")

PCA as Preprocessing for Other Models

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Pipeline: Scale ? PCA ? Classify
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=20)),
    ('clf', LogisticRegression(max_iter=1000))
])

# Compare with and without PCA
scores_pca = cross_val_score(pipe, digits.data, digits.target, cv=5)
scores_raw = cross_val_score(
    Pipeline([('scaler', StandardScaler()),
              ('clf', LogisticRegression(max_iter=1000))]),
    digits.data, digits.target, cv=5)

print(f"Without PCA: {scores_raw.mean():.3f} ± {scores_raw.std():.3f}")
print(f"With PCA (20): {scores_pca.mean():.3f} ± {scores_pca.std():.3f}")

Before/After PCA Dimensionality Reduction

Mathematical Properties of PCA

Optimality of PCA

PCA finds the rank- approximation that minimizes the reconstruction error:

By the Eckart-Young-Mirsky Theorem, the solution is given by the truncated SVD:

with optimal reconstruction error:

Properties of Principal Components

Orthogonality: for
Unit length:
Maximum variance: is maximized subject to
Uncorrelated scores: (diagonal)
Total variance preserved:

Common Pitfalls and Best Practices

⚠️

Critical: Always standardize features before PCA when variables have different units or scales. PCA is sensitive to variance magnitude — features with larger scales will dominate.

Pitfalls

Pitfall	Consequence	Solution
Not standardizing	Scale-dominated components	Use `StandardScaler` before PCA
Ignoring outliers	Distorted principal components	Use Robust PCA or remove outliers
Over-reducing	Information loss, degraded model	Monitor reconstruction error
Interpreting PCs causally	Incorrect causal inference	PCs are statistical, not causal
Using PCA with sparse data	Dense representation loses sparsity	Use Sparse PCA or NMF

When to Use PCA

Preprocessing: Reduce dimensionality before classification/regression
Visualization: Project to 2D/3D for exploratory data analysis
Noise reduction: Remove low-variance components (assumed noise)
Feature decorrelation: Create uncorrelated features for models assuming independence
Multicollinearity: Resolve correlated predictors in linear models

When NOT to Use PCA

Interpretability needed: PCs are linear combinations, not original features
Sparse data: Use truncated SVD or NMF instead
Non-linear structure: Use Kernel PCA, t-SNE, or UMAP
Categorical data: Use MCA (Multiple Correspondence Analysis) instead
Time series: Use specialized methods (e.g., DMD, POD)

PCA vs Other Dimensionality Reduction Methods

Method	Type	Preserves	Best For
PCA	Linear	Global variance	General purpose, preprocessing
Kernel PCA	Non-linear	Kernel-space variance	Non-linear manifolds
t-SNE	Non-linear	Local neighborhoods	Visualization (2D/3D)
UMAP	Non-linear	Local + global structure	Visualization, clustering
LDA	Supervised	Class separability	Classification preprocessing
Autoencoders	Non-linear	Reconstruction	Complex non-linear reduction
NMF	Linear (non-negative)	Non-negative factors	Interpretable features (images, text)
ICA	Linear	Statistical independence	Signal separation (blind source)

Summary

PCA is the foundational technique for dimensionality reduction:

Core idea: Find orthogonal directions of maximum variance via eigendecomposition of the covariance matrix
Computational method: SVD is preferred numerically — gives principal directions as columns of
Component selection: Use Kaiser's rule, cumulative PVE threshold, or parallel analysis
Variance explained: quantifies information retention
Optimality: PCA minimizes mean squared reconstruction error (Eckart-Young theorem)
Kernel extension: Kernel PCA handles non-linear data via the kernel trick
Best practice: Always standardize data, validate reconstruction quality, and choose based on the application's information needs