PCA for Dimensionality Reduction

Module 2: Machine LearningFree Lesson

Advertisement

PCA for Dimensionality Reduction

The Curse of Dimensionality

As the number of features (dimensions) increases, several problems emerge:

DfCurse of Dimensionality

The phenomenon where the volume of the feature space grows exponentially with the number of dimensions, causing data to become sparse and distances between points to become increasingly uniform. This leads to overfitting, increased computational cost, and degraded model performance in high-dimensional spaces.

Architecture Diagram
CURSE OF DIMENSIONALITY EFFECTS:

Dimension | Data Points Needed for Same Density
----------|------------------------------------
    2     | 100
    5     | 1,000
   10     | 10,000
   20     | 1,000,000
   50     | 10^10 (impossible!)

Problems:
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ 1. SPARSE DATA: Points become far apart in high dimensions    β”‚
β”‚ 2. OVERFITTING: More features = more parameters to estimate   β”‚
β”‚ 3. COMPUTATIONAL COST: Algorithms slow down exponentially     β”‚
β”‚ 4. CURSE OF DISTANCE: All points become equidistant           β”‚
β”‚ 5. VISUALIZATION: Can't plot beyond 3D                        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Mathematical Intuition

In high dimensions, the volume of the space grows exponentially:

Volume of a d-Dimensional Hypersphere

Vd(r)=Ο€d/2Ξ“(d/2+1)β‹…rdV_d(r) = \frac{\pi^{d/2}}{\Gamma(d/2 + 1)} \cdot r^d

Here,

  • =Number of dimensions
  • =Radius of the hypersphere
  • =Gamma function (generalization of factorial)

For r=1r=1 and d=100d=100:

V100(1)=Ο€50Ξ“(51)β‰ˆ10βˆ’40V_{100}(1) = \frac{\pi^{50}}{\Gamma(51)} \approx 10^{-40}

This means data becomes extremely sparse!

The curse of dimensionality is not just a theoretical curiosity β€” it has practical consequences. For example, k-nearest neighbors degenerate in high dimensions because all points become approximately equidistant. This is why dimensionality reduction (like PCA) is often a critical preprocessing step before applying distance-based algorithms.

Principal Component Analysis (PCA)

PCA finds the directions (principal components) that maximize variance in the data.

DfPrincipal Component Analysis

An orthogonal linear transformation that projects data onto a new coordinate system where the greatest variance lies along the first principal component, the second greatest variance along the second, and so on. PCA finds the directions of maximum variance by computing the eigendecomposition of the covariance matrix (or equivalently, the SVD of the centered data matrix).

Mathematical Derivation

Given data matrix X∈RnΓ—dX \in \mathbb{R}^{n \times d}:

Step 1: Center the data

Centering

X~=Xβˆ’ΞΌ,ΞΌ=1nβˆ‘i=1nxi\tilde{X} = X - \mu, \quad \mu = \frac{1}{n}\sum_{i=1}^{n} x_i

Here,

  • =Mean vector of the data
  • =Number of samples

Step 2: Compute covariance matrix

Sample Covariance Matrix

Ξ£=1nβˆ’1X~TX~\Sigma = \frac{1}{n-1}\tilde{X}^T\tilde{X}

Here,

  • =Covariance matrix (d Γ— d)
  • =Centered data matrix

Step 3: Eigendecomposition

Ξ£vi=Ξ»ivi\Sigma v_i = \lambda_i v_i

Where:

  • viv_i = eigenvector (principal component direction)
  • Ξ»i\lambda_i = eigenvalue (variance explained by that component)

Step 4: Project onto top-k components

PCA Projection

Z=X~VkZ = \tilde{X} V_k

Here,

  • =Projected data (n Γ— k)
  • =Top-k eigenvectors [v_1, ..., v_k]
  • =Number of components to keep

Always center (and optionally scale) your data before applying PCA. PCA is sensitive to the scale of features β€” features with larger magnitudes will dominate the variance. Use StandardScaler before PCA when features have different units.

Visual Explanation

Architecture Diagram
PCA GEOMETRIC INTERPRETATION:

Original Data (2D):
                ●
         ●  ●     ●
      ●     ●  ●
   ●     ●     ●

Step 1: Find direction of maximum variance
                ●
         ●  ●  β†’β†’β†’β†’  ●  ← Principal Component 1
      ●     ●  ●
   ●     ●     ●

Step 2: Find orthogonal direction
                ↑
         ●  ●  |  ●  ← Principal Component 2
      ●     ●  ●
   ●     ●     ●

Step 3: Project data
PC1: ● ● ● ● ● ● ● ● ●  (1D representation)
     ───────────────────
     Original 2D β†’ Reduced 1D

Explained Variance

DfExplained Variance Ratio

The proportion of the total variance in the data that is captured by a given principal component. It is computed as the eigenvalue of that component divided by the sum of all eigenvalues.

Explained Variance Ratio

ExplainedΒ VarianceΒ Ratiok=Ξ»kβˆ‘i=1dΞ»i\text{Explained Variance Ratio}_k = \frac{\lambda_k}{\sum_{i=1}^{d} \lambda_i}

Here,

  • =Eigenvalue of the k-th principal component
  • =Total number of features

ThVariance Preservation

The total variance in the data is preserved under PCA. That is, the sum of eigenvalues equals the trace of the covariance matrix: βˆ‘i=1dΞ»i=tr(Ξ£)\sum_{i=1}^{d} \lambda_i = \text{tr}(\Sigma). When we project onto the top-k components, we retain a fraction βˆ‘i=1kΞ»iβˆ‘i=1dΞ»i\frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{d} \lambda_i} of the total variance.

Complete Python Implementation

πŸ“Complete PCA Implementation

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA, IncrementalPCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_digits, fetch_olivetti_faces
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from mpl_toolkits.mplot3d import Axes3D
import time
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)

print("=" * 70)
print("PCA FOR DIMENSIONALITY REDUCTION")
print("=" * 70)

# ============================================
# 1. Basic PCA Implementation
# ============================================

print("\n1. BASIC PCA IMPLEMENTATION")
print("-" * 40)

# Generate correlated data
n_samples = 300
x1 = np.random.randn(n_samples)
x2 = 0.8 * x1 + 0.3 * np.random.randn(n_samples)  # Correlated with x1
x3 = -0.5 * x1 + 0.6 * np.random.randn(n_samples)  # Also correlated

X = np.column_stack([x1, x2, x3])

print(f"Original data shape: {X.shape}")
print(f"Feature correlations:")
print(f"  Corr(x1, x2): {np.corrcoef(x1, x2)[0, 1]:.3f}")
print(f"  Corr(x1, x3): {np.corrcoef(x1, x3)[0, 1]:.3f}")
print(f"  Corr(x2, x3): {np.corrcoef(x2, x3)[0, 1]:.3f}")

# Apply PCA
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X)

print(f"\nPCA Results:")
print(f"Explained variance ratio: {pca.explained_variance_ratio_}")
print(f"Total variance explained: {pca.explained_variance_ratio_.sum():.4f}")

Explained Variance Analysis

# Load digits dataset
digits = load_digits()
X_digits = digits.data
y_digits = digits.target

print(f"Digits dataset shape: {X_digits.shape}")
print(f"Number of classes: {len(np.unique(y_digits))}")

# Apply PCA with all components
pca_full = PCA()
X_digits_pca = pca_full.fit_transform(X_digits)

# Plot explained variance
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Individual variance
axes[0].bar(range(1, len(pca_full.explained_variance_ratio_) + 1),
            pca_full.explained_variance_ratio_, alpha=0.7, label='Individual')
axes[0].set_xlabel('Principal Component')
axes[0].set_ylabel('Explained Variance Ratio')
axes[0].set_title('Scree Plot')
axes[0].grid(True, alpha=0.3)

# Cumulative variance
cumulative_var = np.cumsum(pca_full.explained_variance_ratio_)
axes[1].plot(range(1, len(cumulative_var) + 1), cumulative_var, 'bo-')
axes[1].axhline(y=0.95, color='r', linestyle='--', label='95% threshold')
axes[1].axhline(y=0.90, color='g', linestyle='--', label='90% threshold')
axes[1].set_xlabel('Number of Components')
axes[1].set_ylabel('Cumulative Explained Variance')
axes[1].set_title('Cumulative Explained Variance')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('pca_variance.png', dpi=150, bbox_inches='tight')
plt.show()

# Find number of components for different thresholds
for threshold in [0.90, 0.95, 0.99]:
    n_components = np.argmax(cumulative_var >= threshold) + 1
    print(f"Components for {threshold*100:.0f}% variance: {n_components}")

The "elbow method" in the scree plot helps identify the optimal number of components. Look for the point where the explained variance drops sharply β€” components beyond this point contribute relatively little information. In practice, retaining components that capture 90-95% of total variance is a common heuristic.

Impact on Classification

# Compare classification with different numbers of components
n_components_range = [2, 5, 10, 20, 30, 40, 50, 64]
train_scores = []
cv_scores = []
times = []

for n_comp in n_components_range:
    start = time.time()
    
    # Create pipeline
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('pca', PCA(n_components=n_comp)),
        ('classifier', LogisticRegression(max_iter=1000, random_state=42))
    ])
    
    # Cross-validation
    scores = cross_val_score(pipeline, X_digits, y_digits, cv=5, scoring='accuracy')
    
    time_taken = time.time() - start
    
    train_scores.append(n_comp)
    cv_scores.append(scores.mean())
    times.append(time_taken)
    
    print(f"Components: {n_comp:3d}, Accuracy: {scores.mean():.4f} (+/- {scores.std():.4f}), "
          f"Time: {time_taken:.3f}s")

# Plot results
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].plot(n_components_range, cv_scores, 'bo-', linewidth=2, markersize=8)
axes[0].set_xlabel('Number of Components')
axes[0].set_ylabel('CV Accuracy')
axes[0].set_title('Classification Accuracy vs PCA Components')
axes[0].grid(True, alpha=0.3)
axes[0].axhline(y=max(cv_scores), color='r', linestyle='--', alpha=0.5)

axes[1].plot(n_components_range, times, 'rs-', linewidth=2, markersize=8)
axes[1].set_xlabel('Number of Components')
axes[1].set_ylabel('Training Time (s)')
axes[1].set_title('Training Time vs PCA Components')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('pca_classification.png', dpi=150, bbox_inches='tight')
plt.show()

Incremental PCA for Large Datasets

# Generate large dataset
n_large = 10000
X_large = np.random.randn(n_large, 100)

print(f"Large dataset shape: {X_large.shape}")

# Standard PCA
start = time.time()
pca_standard = PCA(n_components=10)
X_standard = pca_standard.fit_transform(X_large)
time_standard = time.time() - start

# Incremental PCA
start = time.time()
pca_incremental = IncrementalPCA(n_components=10, batch_size=1000)
X_incremental = pca_incremental.fit_transform(X_large)
time_incremental = time.time() - start

print(f"\nStandard PCA:")
print(f"  Shape: {X_standard.shape}")
print(f"  Time: {time_standard:.4f}s")
print(f"  Variance explained: {pca_standard.explained_variance_ratio_.sum():.4f}")

print(f"\nIncremental PCA:")
print(f"  Shape: {X_incremental.shape}")
print(f"  Time: {time_incremental:.4f}s")
print(f"  Variance explained: {pca_incremental.explained_variance_ratio_.sum():.4f}")

Standard PCA requires the entire dataset to fit in memory (O(nΓ—d) space). Incremental PCA processes data in mini-batches, making it suitable for datasets that are too large to fit in RAM. The trade-off is slightly different results and slower overall computation time.

PCA Components Visualization

# Visualize first 10 components as images (for digits)
pca_viz = PCA(n_components=10)
pca_viz.fit(X_digits)

fig, axes = plt.subplots(2, 5, figsize=(15, 6))
for i, ax in enumerate(axes.ravel()):
    ax.imshow(pca_viz.components_[i].reshape(8, 8), cmap='gray')
    ax.set_title(f'PC{i+1}\n({pca_viz.explained_variance_ratio_[i]*100:.1f}%)')
    ax.axis('off')

plt.suptitle('First 10 Principal Components (Digits)', fontsize=14)
plt.tight_layout()
plt.savefig('pca_components.png', dpi=150, bbox_inches='tight')
plt.show()

Eigenvalues and Eigenvectors

Intuitive Explanation

Architecture Diagram
EIGENVALUES AND EIGENVECTORS:

A matrix A transforms vectors. Eigenvectors are special:
they only get SCALED, not rotated.

Original Vector v:        After A*v:
    ↓                         ↓
    ↓                     β•±   ↓   β•²
    ↓                   β•±     ↓     β•²
    ──→               ──→─────↓─────→
                            Ξ»v (scaled)

β€’ Eigenvector v: direction preserved
β€’ Eigenvalue Ξ»: scaling factor
β€’ Larger Ξ» = more variance in that direction

PCA finds directions with MAXIMUM variance
= directions with LARGEST eigenvalues

Mathematical Properties

DfEigenvalue Decomposition

For a symmetric matrix (such as a covariance matrix Ξ£\Sigma), the eigendecomposition expresses the matrix as Ξ£=VΞ›VT\Sigma = V \Lambda V^T where VV is the matrix of eigenvectors and Ξ›\Lambda is the diagonal matrix of eigenvalues.

For a symmetric matrix (covariance matrix Ξ£\Sigma):

Ξ£vi=Ξ»ivi\Sigma v_i = \lambda_i v_i

Properties:

  • All eigenvalues are real: Ξ»i∈R\lambda_i \in \mathbb{R}
  • Eigenvectors are orthogonal: viTvj=0v_i^T v_j = 0 for iβ‰ ji \neq j
  • Total variance: βˆ‘i=1dΞ»i=tr(Ξ£)\sum_{i=1}^{d} \lambda_i = \text{tr}(\Sigma)

SVD vs Eigendecomposition

PCA can be computed via either eigendecomposition of the covariance matrix or Singular Value Decomposition (SVD) of the centered data matrix. SVD is numerically more stable and is the method used by scikit-learn. The relationship is: if X=USVTX = U S V^T is the SVD of the centered data, then the singular values relate to eigenvalues as Ξ»i=si2/(nβˆ’1)\lambda_i = s_i^2 / (n-1).

# SVD approach to PCA
X_centered = X_digits - X_digits.mean(axis=0)

# Using SVD
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Using eigendecomposition
cov = X_centered.T @ X_centered / (X_centered.shape[0] - 1)
eig_vals, eig_vecs = np.linalg.eigh(cov)

# Sort by eigenvalues
idx = np.argsort(eig_vals)[::-1]
eig_vals = eig_vals[idx]
eig_vecs = eig_vecs[:, idx]

print("Comparing SVD and Eigendecomposition:")
print(f"  SVD singular values (first 5): {S[:5]}")
print(f"  SVD variance explained: {(S**2 / (S**2).sum())[:5]}")
print(f"\n  Eigenvalues (first 5): {eig_vals[:5]}")
print(f"  Eigenvalues variance explained: {(eig_vals / eig_vals.sum())[:5]}")

# They should be equivalent
print(f"\nEquivalence check: {np.allclose(S**2 / (X_centered.shape[0] - 1), eig_vals)}")

Practical Application: Image Compression

πŸ“Image Compression with PCA

# Load a sample image (or create synthetic one)
from sklearn.datasets import load_sample_image

try:
    image = load_sample_image('china.jpg')
except:
    image = np.random.randint(0, 255, (100, 100, 3), dtype=np.uint8)

# Convert to grayscale for simplicity
if len(image.shape) == 3:
    image_gray = np.mean(image, axis=2)
else:
    image_gray = image

print(f"Original image shape: {image_gray.shape}")

# Divide image into patches
patch_size = 8
n_patches_y = image_gray.shape[0] // patch_size
n_patches_x = image_gray.shape[1] // patch_size

patches = []
for i in range(n_patches_y):
    for j in range(n_patches_x):
        patch = image_gray[i*patch_size:(i+1)*patch_size, 
                          j*patch_size:(j+1)*patch_size]
        patches.append(patch.flatten())

patches = np.array(patches)
print(f"Number of patches: {patches.shape}")
print(f"Patch size: {patch_size}x{patch_size} = {patch_size**2} dimensions")

# Apply PCA
n_components_list = [2, 5, 10, 20, 32]
reconstruction_errors = []

fig, axes = plt.subplots(2, len(n_components_list) + 1, figsize=(20, 8))

# Original
axes[0, 0].imshow(image_gray, cmap='gray')
axes[0, 0].set_title('Original')
axes[0, 0].axis('off')

axes[1, 0].imshow(image_gray, cmap='gray')
axes[1, 0].set_title('Reconstructed (Full)')
axes[1, 0].axis('off')

for idx, n_comp in enumerate(n_components_list):
    # PCA compression
    pca = PCA(n_components=n_comp)
    patches_compressed = pca.fit_transform(patches)
    patches_reconstructed = pca.inverse_transform(patches_compressed)
    
    # Calculate reconstruction error
    error = np.mean((patches - patches_reconstructed) ** 2)
    reconstruction_errors.append(error)
    
    # Reconstruct image
    image_reconstructed = np.zeros_like(image_gray)
    patch_idx = 0
    for i in range(n_patches_y):
        for j in range(n_patches_x):
            patch = patches_reconstructed[patch_idx].reshape(patch_size, patch_size)
            image_reconstructed[i*patch_size:(i+1)*patch_size, 
                              j*patch_size:(j+1)*patch_size] = patch
            patch_idx += 1
    
    axes[0, idx + 1].imshow(patches_compressed[:, :min(n_comp, 4)].T, cmap='viridis', aspect='auto')
    axes[0, idx + 1].set_title(f'{n_comp} Components\nCompressed')
    axes[0, idx + 1].axis('off')
    
    axes[1, idx + 1].imshow(image_reconstructed, cmap='gray')
    axes[1, idx + 1].set_title(f'{n_comp} Components\nMSE: {error:.2f}')
    axes[1, idx + 1].axis('off')

plt.suptitle('Image Compression with PCA', fontsize=14)
plt.tight_layout()
plt.savefig('pca_compression.png', dpi=150, bbox_inches='tight')
plt.show()

# Compression ratio
original_size = patches.nbytes
print("\nCompression Analysis:")
for n_comp, error in zip(n_components_list, reconstruction_errors):
    compressed_size = patches.shape[0] * n_comp * patches.dtype.itemsize
    ratio = original_size / compressed_size
    print(f"  {n_comp:2d} components: MSE={error:.2f}, "
          f"Compression ratio: {ratio:.1f}x, "
          f"Size reduction: {(1 - 1/ratio)*100:.1f}%")

The reconstruction error from PCA compression is exactly the sum of the eigenvalues of the discarded components. This provides a principled way to choose the number of components: pick the smallest k such that the total discarded variance is below an acceptable threshold.

Key Takeaways

  1. PCA finds directions of maximum variance in data
  2. Explained variance tells us how much information we retain
  3. Dimensionality reduction helps with visualization and computation
  4. Always center data before applying PCA
  5. Incremental PCA is essential for large datasets
  6. PCA components can reveal underlying structure
  7. SVD and eigendecomposition give equivalent results
  8. PCA is linear - for non-linear, use t-SNE or UMAP

Summary Table

AspectStandard PCAIncremental PCAKernel PCA
TypeLinearLinearNon-linear
Dataset SizeSmall-MediumLargeSmall-Medium
ComplexityO(dΒ³) or O(nΒ²d)O(batch Γ— dΒ²)O(nΒ³)
MemoryO(nΓ—d)O(batchΓ—d)O(nΒ²)
Best ForGeneral purposeOut-of-core dataNon-linear manifolds

Comparison with Other Methods

Architecture Diagram
DIMENSIONALITY REDUCTION METHODS:

PCA (Linear):
β€’ Preserves global variance
β€’ Fast and interpretable
β€’ Good for linear relationships

t-SNE (Non-linear):
β€’ Preserves local structure
β€’ Good for visualization
β€’ Slow, non-parametric

UMAP (Non-linear):
β€’ Faster than t-SNE
β€’ Better preserves global structure
β€’ Parametric version available

LDA (Supervised):
β€’ Uses class labels
β€’ Maximizes class separation
β€’ Good for classification

πŸ“‹Summary: PCA for Dimensionality Reduction

  1. The curse of dimensionality causes data sparsity, overfitting, and computational overhead as dimensions grow exponentially.
  2. PCA finds orthogonal directions (principal components) that maximize variance by computing the eigendecomposition of the covariance matrix.
  3. The explained variance ratio Ξ»k/βˆ‘Ξ»i\lambda_k / \sum \lambda_i quantifies how much information each component captures β€” use the scree plot and cumulative variance thresholds (90-95%) to choose k.
  4. Centering data is essential before PCA; standardization is critical when features have different units or scales.
  5. Incremental PCA processes data in mini-batches for out-of-core learning on datasets too large for memory.
  6. PCA is equivalent to SVD of the centered data matrix β€” scikit-learn uses SVD for numerical stability.
  7. PCA is linear β€” for non-linear manifold structure, use kernel PCA, t-SNE, or UMAP.
  8. Reconstruction error from PCA equals the sum of discarded eigenvalues, providing a principled component selection criterion.

Practice Exercises

Exercise 1: Face Recognition

"""
Use PCA (Eigenfaces) for face recognition:
1. Load face dataset (e.g., Olivetti faces)
2. Apply PCA to get eigenfaces
3. Reconstruct faces with different components
4. Train classifier on PCA features
5. Evaluate accuracy vs number of components
"""

# Your code here

Exercise 2: Anomaly Detection with PCA

"""
Detect anomalies using PCA reconstruction error:
1. Generate normal data
2. Inject anomalies
3. Apply PCA
4. Calculate reconstruction error
5. Set threshold and detect anomalies
"""

# Your code here

Exercise 3: Compare PCA with Other Methods

"""
Compare PCA with:
1. t-SNE
2. UMAP (if available)
3. Linear Discriminant Analysis (LDA)

On the same dataset, compare:
- Visualization quality
- Classification accuracy after reduction
- Computational time
"""

# Your code here

Exercise 4: PCA for Feature Engineering

"""
Use PCA as feature engineering:
1. Create polynomial features
2. Apply PCA to reduce
3. Compare model performance with/without PCA
4. Analyze which features contribute most
"""

# Your code here

Next: We'll dive deep into model evaluation metrics including ROC curves, AUC, and Precision-Recall curves.

Advertisement

Need Expert Data Science Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement