CW

PCA: Dimensionality Reduction and Feature Extraction

Module 9: Unsupervised LearningFree Lesson

Advertisement

Why Dimensionality Reduction?

High-dimensional data presents fundamental challenges that degrade both computational efficiency and model performance. The curse of dimensionality manifests in several critical ways.

The Curse of Dimensionality

As dimensionality pp increases:

  1. Volume Explosion: The volume of the feature space grows exponentially as O(rp)O(r^p), where rr is the range of each feature
  2. Data Sparsity: For a fixed number of observations nn, the density of data points vanishes: ρn/rp\rho \propto n / r^p
  3. Distance Concentration: In high dimensions, the ratio of distances between nearest and farthest points converges to 1:
limpdmaxdmindmin0\lim_{p \to \infty} \frac{d_{\max} - d_{\min}}{d_{\min}} \to 0
  1. Overfitting Risk: Models with many parameters relative to observations generalize poorly

Benefits of Dimensionality Reduction

  • Computational Efficiency: Reduces time complexity from O(np2)O(np^2) to O(nk2)O(nk^2) where kpk \ll p
  • Noise Filtering: Removes low-variance components containing primarily noise
  • Visualization: Projects high-dimensional data into 2D/3D for human interpretation
  • Multicollinearity Removal: Creates orthogonal features that are uncorrelated
  • Improved Generalization: Reduces model complexity and overfitting

PCA Algorithm — Step by Step

PCA finds orthogonal directions of maximum variance in the data through eigendecomposition of the covariance matrix.

Mathematical Formulation

Given a dataset XRn×p\mathbf{X} \in \mathbb{R}^{n \times p} with nn observations and pp features:

Step 1 — Center the data:

Z=Xμ,μ=1ni=1nxi\mathbf{Z} = \mathbf{X} - \boldsymbol{\mu}, \quad \boldsymbol{\mu} = \frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i

Step 2 — Compute the covariance matrix:

C=1n1ZTZRp×p\mathbf{C} = \frac{1}{n-1}\mathbf{Z}^T\mathbf{Z} \in \mathbb{R}^{p \times p}

Step 3 — Eigendecomposition:

Cvk=λkvk\mathbf{C}\mathbf{v}_k = \lambda_k \mathbf{v}_k

where λ1λ2λp0\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_p \geq 0 are eigenvalues and vk\mathbf{v}_k are corresponding eigenvectors.

Step 4 — Project data onto principal components:

Y=ZVk\mathbf{Y} = \mathbf{Z}\mathbf{V}_k

where Vk=[v1v2vk]\mathbf{V}_k = [\mathbf{v}_1 | \mathbf{v}_2 | \cdots | \mathbf{v}_k] contains the top kk eigenvectors.

Geometric Interpretation

PCA: Projection onto Principal ComponentsOriginal Data (2D)PC1PC290°ProjectProjected (1D)PC1 scoreOriginal dataProjected dataPC1PC2

Eigendecomposition and SVD

Relationship Between PCA and Eigendecomposition

The covariance matrix C\mathbf{C} is symmetric positive semi-definite, guaranteeing real non-negative eigenvalues and orthogonal eigenvectors:

C=VΛVT\mathbf{C} = \mathbf{V}\boldsymbol{\Lambda}\mathbf{V}^T

where Λ=diag(λ1,λ2,,λp)\boldsymbol{\Lambda} = \text{diag}(\lambda_1, \lambda_2, \ldots, \lambda_p) and VVT=I\mathbf{V}\mathbf{V}^T = \mathbf{I}.

SVD-Based PCA

For computational efficiency, PCA is typically performed via Singular Value Decomposition:

Z=UΣVT\mathbf{Z} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T

where:

  • URn×n\mathbf{U} \in \mathbb{R}^{n \times n}: left singular vectors (orthonormal)
  • ΣRn×p\boldsymbol{\Sigma} \in \mathbb{R}^{n \times p}: diagonal matrix of singular values σ1σ20\sigma_1 \geq \sigma_2 \geq \cdots \geq 0
  • VRp×p\mathbf{V} \in \mathbb{R}^{p \times p}: right singular vectors (principal directions)

Connection to eigenvalues:

λk=σk2n1\lambda_k = \frac{\sigma_k^2}{n-1}

The principal component scores are:

Y=ZV=UΣ\mathbf{Y} = \mathbf{Z}\mathbf{V} = \mathbf{U}\boldsymbol{\Sigma}

Eigenvector Visualization

Eigenvector Directions in Feature SpaceFeature 1Feature 2v1 (?1=85%)v2 (?2=15%)mean

Variance Explained and Scree Plot

Proportion of Variance Explained

Each principal component captures a fraction of the total variance:

PVEk=λkj=1pλj\text{PVE}_k = \frac{\lambda_k}{\sum_{j=1}^{p} \lambda_j}

The cumulative variance explained by the first kk components:

Cumulative PVEk=j=1kλjj=1pλj\text{Cumulative PVE}_k = \frac{\sum_{j=1}^{k} \lambda_j}{\sum_{j=1}^{p} \lambda_j}

Scree Plot

The scree plot displays eigenvalues in descending order, showing the "elbow" where adding more components yields diminishing returns.

Scree Plot — Variance ExplainedPrincipal ComponentEigenvalue (?)02468PC1PC2PC3PC4PC5PC67.523.852.141.020.680.31ElbowEigenvalueCumulative %

Choosing the Number of Components

Three principal methods guide component selection:

1. Kaiser's Rule (Eigenvalue > 1)

Retain components where λk>1\lambda_k > 1 (for standardized data):

k=max{k:λk>1}k^* = \max\{k : \lambda_k > 1\}

Rationale: A component with λk<1\lambda_k < 1 explains less variance than a single original variable.

2. Cumulative Variance Threshold

Select kk such that cumulative PVE exceeds a threshold τ\tau:

k=min{k:j=1kλjj=1pλjτ}k^* = \min\left\{k : \frac{\sum_{j=1}^{k}\lambda_j}{\sum_{j=1}^{p}\lambda_j} \geq \tau\right\}

Common thresholds: τ{0.80,0.90,0.95}\tau \in \{0.80, 0.90, 0.95\}.

3. Scree Plot Elbow Method

Identify the "elow" in the scree plot — the point where the rate of decrease sharply changes:

elbow=argmaxk(λkλk+1)\text{elbow} = \arg\max_k \left(\lambda_k - \lambda_{k+1}\right)

4. Parallel Analysis (Horn's Method)

Compare eigenvalues to those from random data of the same dimensions. Retain components where:

λk>λkrandom\lambda_k > \lambda_k^{\text{random}}

This method corrects for the upward bias in eigenvalues due to finite sample sizes.

Component Selection Methods ComparisonKaiser's Rule? > 1Simple heuristicMay over/under-selectBest for standardizeddata (Z-scores)Quick checkCumulative PVE= 80-95%Application-dependent80% for visualization90% for modeling95% for reconstructionMost commonParallel Analysis? > ?_randomMonte Carlo simulationControls for samplingerror and spuriouseigenvaluesMost rigorous

PCA for Visualization (2D/3D)

From High Dimensions to 2D

PCA enables visualization of high-dimensional datasets by projecting onto the first two or three principal components while preserving maximum variance.

Information Loss: When projecting from pp dimensions to k<pk < p:

Reconstruction Error=j=k+1pλj=tr(C)j=1kλj\text{Reconstruction Error} = \sum_{j=k+1}^{p} \lambda_j = \text{tr}(\mathbf{C}) - \sum_{j=1}^{k}\lambda_j

2D Projection Example

2D PCA Projection of High-Dimensional DataBefore PCA (2D view)PCAAfter PCA (PC1 vs PC2)PC1PC2Class AClass BClass C

Kernel PCA for Non-Linear Data

Limitation of Standard PCA

Standard PCA assumes linear relationships. For non-linear manifolds, PCA may fail to capture the intrinsic structure.

The Kernel Trick

Kernel PCA maps data to a higher-dimensional feature space F\mathcal{F} via a non-linear mapping ϕ:RpF\phi: \mathbb{R}^p \to \mathcal{F}, then performs PCA in F\mathcal{F}.

The key insight: we never compute ϕ(x)\phi(\mathbf{x}) explicitly. Instead, we use the kernel function:

K(xi,xj)=ϕ(xi),ϕ(xj)K(\mathbf{x}_i, \mathbf{x}_j) = \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle

Common Kernel Functions

KernelFormulaUse Case
LinearK(x,z)=xTzK(\mathbf{x}, \mathbf{z}) = \mathbf{x}^T\mathbf{z}Standard PCA
PolynomialK(x,z)=(xTz+c)dK(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^T\mathbf{z} + c)^dPolynomial relationships
RBF (Gaussian)K(x,z)=exp(xz22σ2)K(\mathbf{x}, \mathbf{z}) = \exp\left(-\frac{\|\mathbf{x}-\mathbf{z}\|^2}{2\sigma^2}\right)Complex non-linear boundaries
SigmoidK(x,z)=tanh(αxTz+c)K(\mathbf{x}, \mathbf{z}) = \tanh(\alpha\mathbf{x}^T\mathbf{z} + c)Neural network-like behavior

Kernel PCA Algorithm

  1. Compute the kernel matrix: Kij=K(xi,xj)K_{ij} = K(\mathbf{x}_i, \mathbf{x}_j)
  2. Center the kernel matrix: K~=K1nKK1n+1nK1n\tilde{\mathbf{K}} = \mathbf{K} - \mathbf{1}_n\mathbf{K} - \mathbf{K}\mathbf{1}_n + \mathbf{1}_n\mathbf{K}\mathbf{1}_n
  3. Solve the eigenvalue problem: K~αk=nλkαk\tilde{\mathbf{K}}\boldsymbol{\alpha}_k = n\lambda_k \boldsymbol{\alpha}_k
  4. Project onto kernel principal components:
PCk(x)=i=1nαk(i)K(xi,x)\text{PC}_k(\mathbf{x}) = \sum_{i=1}^{n} \alpha_k^{(i)} K(\mathbf{x}_i, \mathbf{x})

Comparison: Linear vs Kernel PCA

Linear PCA vs Kernel PCA (RBF)Original DataLinear PCA failsKernel PCA (RBF) SeparatesNon-linear structure preserved

Implementation in Python

Standard PCA with Scikit-Learn

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_digits

# Load high-dimensional data
digits = load_digits()
X = digits.data  # (1797, 64) — 8x8 pixel images
y = digits.target

# Standardize (critical for PCA)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Results
print(f"Original shape: {X.shape}")
print(f"Reduced shape:  {X_pca.shape}")
print(f"Variance explained: {pca.explained_variance_ratio_.sum():.3f}")
print(f"Per-component: {pca.explained_variance_ratio_}")

Scree Plot and Component Selection

# Full PCA to inspect all components
pca_full = PCA().fit(X_scaled)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Scree plot
axes[0].bar(range(1, len(pca_full.explained_variance_) + 1),
            pca_full.explained_variance_, alpha=0.7, label='Individual')
axes[0].step(range(1, len(pca_full.explained_variance_ratio_) + 1),
             np.cumsum(pca_full.explained_variance_ratio_),
             where='mid', color='red', label='Cumulative')
axes[0].axhline(y=0.90, color='k', linestyle='--', label='90% threshold')
axes[0].set_xlabel('Principal Component')
axes[0].set_ylabel('Variance Explained')
axes[0].set_title('Scree Plot')
axes[0].legend()

# Cumulative variance
cumvar = np.cumsum(pca_full.explained_variance_ratio_)
n_components_90 = np.argmax(cumvar >= 0.90) + 1
axes[1].plot(range(1, len(cumvar) + 1), cumvar, 'o-')
axes[1].axvline(x=n_components_90, color='r', linestyle='--',
                label=f'{n_components_90} components for 90%')
axes[1].set_xlabel('Number of Components')
axes[1].set_ylabel('Cumulative Variance Explained')
axes[1].set_title('Cumulative Variance')
axes[1].legend()

plt.tight_layout()
plt.show()

2D Visualization with Class Labels

# Visualize digit clusters in 2D PCA space
plt.figure(figsize=(10, 8))
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y,
                      cmap='tab10', alpha=0.6, s=10)
plt.colorbar(scatter, label='Digit')
plt.xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%} variance)')
plt.ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%} variance)')
plt.title('MNIST Digits — PCA 2D Projection')
plt.show()

Reconstruction from Principal Components

# Reconstruct from k components
n_comp = 30
pca_30 = PCA(n_components=n_comp)
X_transformed = pca_30.fit_transform(X_scaled)
X_reconstructed = pca_30.inverse_transform(X_transformed)

# Reconstruct to original scale
X_recon_original = scaler.inverse_transform(X_reconstructed)

# Visualize reconstruction quality
fig, axes = plt.subplots(2, 8, figsize=(16, 4))
for i in range(8):
    axes[0, i].imshow(digits.images[i], cmap='gray_r')
    axes[0, i].set_title('Original')
    axes[0, i].axis('off')

    axes[1, i].imshow(X_recon_original[i].reshape(8, 8), cmap='gray_r')
    axes[1, i].set_title('Reconstructed')
    axes[1, i].axis('off')

plt.suptitle(f'Reconstruction with {n_comp} components')
plt.tight_layout()
plt.show()

Kernel PCA Implementation

from sklearn.decomposition import KernelPCA

# Non-linear data example
from sklearn.datasets import make_moons
X_moons, y_moons = make_moons(n_samples=300, noise=0.1, random_state=42)

# Standard PCA vs Kernel PCA
pca_std = PCA(n_components=2)
kpca_rbf = KernelPCA(n_components=2, kernel='rbf', gamma=15)
kpca_poly = KernelPCA(n_components=2, kernel='poly', degree=3)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

for ax, model, title in zip(axes,
    [pca_std, kpca_rbf, kpca_poly],
    ['Linear PCA', 'Kernel PCA (RBF)', 'Kernel PCA (Poly)']):
    X_proj = model.fit_transform(X_moons)
    ax.scatter(X_proj[:, 0], X_proj[:, 1], c=y_moons, cmap='coolwarm', s=20)
    ax.set_title(title)
    ax.set_xlabel('Component 1')
    ax.set_ylabel('Component 2')

plt.tight_layout()
plt.show()

Incremental PCA for Large Datasets

from sklearn.decomposition import IncrementalPCA

# For datasets that don't fit in memory
n_batches = 10
ipca = IncrementalPCA(n_components=2)

batch_size = len(X_scaled) // n_batches
for i in range(n_batches):
    start = i * batch_size
    end = start + batch_size
    ipca.fit(X_scaled[start:end])

X_ipca = ipca.transform(X_scaled)
print(f"Explained variance: {ipca.explained_variance_ratio_.sum():.3f}")

PCA as Preprocessing for Other Models

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Pipeline: Scale ? PCA ? Classify
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=20)),
    ('clf', LogisticRegression(max_iter=1000))
])

# Compare with and without PCA
scores_pca = cross_val_score(pipe, digits.data, digits.target, cv=5)
scores_raw = cross_val_score(
    Pipeline([('scaler', StandardScaler()),
              ('clf', LogisticRegression(max_iter=1000))]),
    digits.data, digits.target, cv=5)

print(f"Without PCA: {scores_raw.mean():.3f} ± {scores_raw.std():.3f}")
print(f"With PCA (20): {scores_pca.mean():.3f} ± {scores_pca.std():.3f}")

Before/After PCA Dimensionality Reduction

Before vs After PCABefore PCA (p=64 features)64 correlated features+ noise + redundancyPCA64?2 dims90% var retainedAfter PCA (k=2 components)2 uncorrelated PCsNoise removedMemory: 64x reducedSpeed: ~4000x fasterVisualization: Enabled

Mathematical Properties of PCA

Optimality of PCA

PCA finds the rank-kk approximation X^=UkΣkVkT\hat{\mathbf{X}} = \mathbf{U}_k\boldsymbol{\Sigma}_k\mathbf{V}_k^T that minimizes the reconstruction error:

minX^:rank(X^)kXX^F2\min_{\hat{\mathbf{X}}: \text{rank}(\hat{\mathbf{X}}) \leq k} \|\mathbf{X} - \hat{\mathbf{X}}\|_F^2

By the Eckart-Young-Mirsky Theorem, the solution is given by the truncated SVD:

X^=j=1kσjujvjT\hat{\mathbf{X}} = \sum_{j=1}^{k} \sigma_j \mathbf{u}_j \mathbf{v}_j^T

with optimal reconstruction error:

XX^F2=j=k+1pσj2=(n1)j=k+1pλj\|\mathbf{X} - \hat{\mathbf{X}}\|_F^2 = \sum_{j=k+1}^{p} \sigma_j^2 = (n-1)\sum_{j=k+1}^{p} \lambda_j

Properties of Principal Components

  1. Orthogonality: viTvj=0\mathbf{v}_i^T\mathbf{v}_j = 0 for iji \neq j
  2. Unit length: vk=1\|\mathbf{v}_k\| = 1
  3. Maximum variance: Var(Zvk)=λk\text{Var}(\mathbf{Z}\mathbf{v}_k) = \lambda_k is maximized subject to vk=1\|\mathbf{v}_k\| = 1
  4. Uncorrelated scores: Cov(Y)=Λ\text{Cov}(\mathbf{Y}) = \boldsymbol{\Lambda} (diagonal)
  5. Total variance preserved: k=1pλk=tr(C)\sum_{k=1}^{p} \lambda_k = \text{tr}(\mathbf{C})

Common Pitfalls and Best Practices

Critical: Always standardize features before PCA when variables have different units or scales. PCA is sensitive to variance magnitude — features with larger scales will dominate.

Pitfalls

PitfallConsequenceSolution
Not standardizingScale-dominated componentsUse StandardScaler before PCA
Ignoring outliersDistorted principal componentsUse Robust PCA or remove outliers
Over-reducingInformation loss, degraded modelMonitor reconstruction error
Interpreting PCs causallyIncorrect causal inferencePCs are statistical, not causal
Using PCA with sparse dataDense representation loses sparsityUse Sparse PCA or NMF

When to Use PCA

  • Preprocessing: Reduce dimensionality before classification/regression
  • Visualization: Project to 2D/3D for exploratory data analysis
  • Noise reduction: Remove low-variance components (assumed noise)
  • Feature decorrelation: Create uncorrelated features for models assuming independence
  • Multicollinearity: Resolve correlated predictors in linear models

When NOT to Use PCA

  • Interpretability needed: PCs are linear combinations, not original features
  • Sparse data: Use truncated SVD or NMF instead
  • Non-linear structure: Use Kernel PCA, t-SNE, or UMAP
  • Categorical data: Use MCA (Multiple Correspondence Analysis) instead
  • Time series: Use specialized methods (e.g., DMD, POD)

PCA vs Other Dimensionality Reduction Methods

MethodTypePreservesBest For
PCALinearGlobal varianceGeneral purpose, preprocessing
Kernel PCANon-linearKernel-space varianceNon-linear manifolds
t-SNENon-linearLocal neighborhoodsVisualization (2D/3D)
UMAPNon-linearLocal + global structureVisualization, clustering
LDASupervisedClass separabilityClassification preprocessing
AutoencodersNon-linearReconstructionComplex non-linear reduction
NMFLinear (non-negative)Non-negative factorsInterpretable features (images, text)
ICALinearStatistical independenceSignal separation (blind source)

Summary

PCA is the foundational technique for dimensionality reduction:

  1. Core idea: Find orthogonal directions of maximum variance via eigendecomposition of the covariance matrix
  2. Computational method: SVD is preferred numerically — Z=UΣVT\mathbf{Z} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T gives principal directions as columns of V\mathbf{V}
  3. Component selection: Use Kaiser's rule, cumulative PVE threshold, or parallel analysis
  4. Variance explained: PVEk=λk/λj\text{PVE}_k = \lambda_k / \sum \lambda_j quantifies information retention
  5. Optimality: PCA minimizes mean squared reconstruction error (Eckart-Young theorem)
  6. Kernel extension: Kernel PCA handles non-linear data via the kernel trick
  7. Best practice: Always standardize data, validate reconstruction quality, and choose kk based on the application's information needs

Advertisement

Need Expert Data Science Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement