PCA for Dimensionality Reduction

The Curse of Dimensionality

As the number of features (dimensions) increases, several problems emerge:

DfCurse of Dimensionality

The phenomenon where the volume of the feature space grows exponentially with the number of dimensions, causing data to become sparse and distances between points to become increasingly uniform. This leads to overfitting, increased computational cost, and degraded model performance in high-dimensional spaces.

Architecture Diagram

CURSE OF DIMENSIONALITY EFFECTS:

Dimension | Data Points Needed for Same Density
----------|------------------------------------
    2     | 100
    5     | 1,000
   10     | 10,000
   20     | 1,000,000
   50     | 10^10 (impossible!)

Problems:
┌─────────────────────────────────────────────────────────────────┐
│ 1. SPARSE DATA: Points become far apart in high dimensions    │
│ 2. OVERFITTING: More features = more parameters to estimate   │
│ 3. COMPUTATIONAL COST: Algorithms slow down exponentially     │
│ 4. CURSE OF DISTANCE: All points become equidistant           │
│ 5. VISUALIZATION: Can't plot beyond 3D                        │
└─────────────────────────────────────────────────────────────────┘

Mathematical Intuition

In high dimensions, the volume of the space grows exponentially:

Volume of a d-Dimensional Hypersphere

V_d(r) = \frac{\pi^{d/2}}{\Gamma(d/2 + 1)} \cdot r^d

Here,

=Number of dimensions
=Radius of the hypersphere
=Gamma function (generalization of factorial)

For $r=1$ and $d=100$ :

V_{100}(1) = \frac{\pi^{50}}{\Gamma(51)} \approx 10^{-40}

This means data becomes extremely sparse!

The curse of dimensionality is not just a theoretical curiosity — it has practical consequences. For example, k-nearest neighbors degenerate in high dimensions because all points become approximately equidistant. This is why dimensionality reduction (like PCA) is often a critical preprocessing step before applying distance-based algorithms.

Principal Component Analysis (PCA)

PCA finds the directions (principal components) that maximize variance in the data.

DfPrincipal Component Analysis

An orthogonal linear transformation that projects data onto a new coordinate system where the greatest variance lies along the first principal component, the second greatest variance along the second, and so on. PCA finds the directions of maximum variance by computing the eigendecomposition of the covariance matrix (or equivalently, the SVD of the centered data matrix).

Mathematical Derivation

Given data matrix $X \in \mathbb{R}^{n \times d}$ :

Step 1: Center the data

Centering

\tilde{X} = X - \mu, \quad \mu = \frac{1}{n}\sum_{i=1}^{n} x_i

Here,

=Mean vector of the data
=Number of samples

Step 2: Compute covariance matrix

Sample Covariance Matrix

\Sigma = \frac{1}{n-1}\tilde{X}^T\tilde{X}

Here,

=Covariance matrix (d × d)
=Centered data matrix

Step 3: Eigendecomposition

\Sigma v_i = \lambda_i v_i

Where:

$v_i$ = eigenvector (principal component direction)
$\lambda_i$ = eigenvalue (variance explained by that component)

Step 4: Project onto top-k components

PCA Projection

Z = \tilde{X} V_k

Here,

=Projected data (n × k)
=Top-k eigenvectors [v_1, ..., v_k]
=Number of components to keep

Always center (and optionally scale) your data before applying PCA. PCA is sensitive to the scale of features — features with larger magnitudes will dominate the variance. Use StandardScaler before PCA when features have different units.

Visual Explanation

Architecture Diagram

PCA GEOMETRIC INTERPRETATION:

Original Data (2D):
                ●
         ●  ●     ●
      ●     ●  ●
   ●     ●     ●

Step 1: Find direction of maximum variance
                ●
         ●  ●  →→→→  ●  ← Principal Component 1
      ●     ●  ●
   ●     ●     ●

Step 2: Find orthogonal direction
                ↑
         ●  ●  |  ●  ← Principal Component 2
      ●     ●  ●
   ●     ●     ●

Step 3: Project data
PC1: ● ● ● ● ● ● ● ● ●  (1D representation)
     ───────────────────
     Original 2D → Reduced 1D

Explained Variance

DfExplained Variance Ratio

The proportion of the total variance in the data that is captured by a given principal component. It is computed as the eigenvalue of that component divided by the sum of all eigenvalues.

Explained Variance Ratio

\text{Explained Variance Ratio}_k = \frac{\lambda_k}{\sum_{i=1}^{d} \lambda_i}

Here,

=Eigenvalue of the k-th principal component
=Total number of features

ThVariance Preservation

The total variance in the data is preserved under PCA. That is, the sum of eigenvalues equals the trace of the covariance matrix: $\sum_{i=1}^{d} \lambda_i = \text{tr}(\Sigma)$ . When we project onto the top-k components, we retain a fraction $\frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{d} \lambda_i}$ of the total variance.

Complete Python Implementation

📝Complete PCA Implementation

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA, IncrementalPCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_digits, fetch_olivetti_faces
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from mpl_toolkits.mplot3d import Axes3D
import time
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)

print("=" * 70)
print("PCA FOR DIMENSIONALITY REDUCTION")
print("=" * 70)

# ============================================
# 1. Basic PCA Implementation
# ============================================

print("\n1. BASIC PCA IMPLEMENTATION")
print("-" * 40)

# Generate correlated data
n_samples = 300
x1 = np.random.randn(n_samples)
x2 = 0.8 * x1 + 0.3 * np.random.randn(n_samples)  # Correlated with x1
x3 = -0.5 * x1 + 0.6 * np.random.randn(n_samples)  # Also correlated

X = np.column_stack([x1, x2, x3])

print(f"Original data shape: {X.shape}")
print(f"Feature correlations:")
print(f"  Corr(x1, x2): {np.corrcoef(x1, x2)[0, 1]:.3f}")
print(f"  Corr(x1, x3): {np.corrcoef(x1, x3)[0, 1]:.3f}")
print(f"  Corr(x2, x3): {np.corrcoef(x2, x3)[0, 1]:.3f}")

# Apply PCA
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X)

print(f"\nPCA Results:")
print(f"Explained variance ratio: {pca.explained_variance_ratio_}")
print(f"Total variance explained: {pca.explained_variance_ratio_.sum():.4f}")

Explained Variance Analysis

# Load digits dataset
digits = load_digits()
X_digits = digits.data
y_digits = digits.target

print(f"Digits dataset shape: {X_digits.shape}")
print(f"Number of classes: {len(np.unique(y_digits))}")

# Apply PCA with all components
pca_full = PCA()
X_digits_pca = pca_full.fit_transform(X_digits)

# Plot explained variance
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Individual variance
axes[0].bar(range(1, len(pca_full.explained_variance_ratio_) + 1),
            pca_full.explained_variance_ratio_, alpha=0.7, label='Individual')
axes[0].set_xlabel('Principal Component')
axes[0].set_ylabel('Explained Variance Ratio')
axes[0].set_title('Scree Plot')
axes[0].grid(True, alpha=0.3)

# Cumulative variance
cumulative_var = np.cumsum(pca_full.explained_variance_ratio_)
axes[1].plot(range(1, len(cumulative_var) + 1), cumulative_var, 'bo-')
axes[1].axhline(y=0.95, color='r', linestyle='--', label='95% threshold')
axes[1].axhline(y=0.90, color='g', linestyle='--', label='90% threshold')
axes[1].set_xlabel('Number of Components')
axes[1].set_ylabel('Cumulative Explained Variance')
axes[1].set_title('Cumulative Explained Variance')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('pca_variance.png', dpi=150, bbox_inches='tight')
plt.show()

# Find number of components for different thresholds
for threshold in [0.90, 0.95, 0.99]:
    n_components = np.argmax(cumulative_var >= threshold) + 1
    print(f"Components for {threshold*100:.0f}% variance: {n_components}")

The "elbow method" in the scree plot helps identify the optimal number of components. Look for the point where the explained variance drops sharply — components beyond this point contribute relatively little information. In practice, retaining components that capture 90-95% of total variance is a common heuristic.

Impact on Classification

# Compare classification with different numbers of components
n_components_range = [2, 5, 10, 20, 30, 40, 50, 64]
train_scores = []
cv_scores = []
times = []

for n_comp in n_components_range:
    start = time.time()
    
    # Create pipeline
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('pca', PCA(n_components=n_comp)),
        ('classifier', LogisticRegression(max_iter=1000, random_state=42))
    ])
    
    # Cross-validation
    scores = cross_val_score(pipeline, X_digits, y_digits, cv=5, scoring='accuracy')
    
    time_taken = time.time() - start
    
    train_scores.append(n_comp)
    cv_scores.append(scores.mean())
    times.append(time_taken)
    
    print(f"Components: {n_comp:3d}, Accuracy: {scores.mean():.4f} (+/- {scores.std():.4f}), "
          f"Time: {time_taken:.3f}s")

# Plot results
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].plot(n_components_range, cv_scores, 'bo-', linewidth=2, markersize=8)
axes[0].set_xlabel('Number of Components')
axes[0].set_ylabel('CV Accuracy')
axes[0].set_title('Classification Accuracy vs PCA Components')
axes[0].grid(True, alpha=0.3)
axes[0].axhline(y=max(cv_scores), color='r', linestyle='--', alpha=0.5)

axes[1].plot(n_components_range, times, 'rs-', linewidth=2, markersize=8)
axes[1].set_xlabel('Number of Components')
axes[1].set_ylabel('Training Time (s)')
axes[1].set_title('Training Time vs PCA Components')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('pca_classification.png', dpi=150, bbox_inches='tight')
plt.show()

Incremental PCA for Large Datasets

# Generate large dataset
n_large = 10000
X_large = np.random.randn(n_large, 100)

print(f"Large dataset shape: {X_large.shape}")

# Standard PCA
start = time.time()
pca_standard = PCA(n_components=10)
X_standard = pca_standard.fit_transform(X_large)
time_standard = time.time() - start

# Incremental PCA
start = time.time()
pca_incremental = IncrementalPCA(n_components=10, batch_size=1000)
X_incremental = pca_incremental.fit_transform(X_large)
time_incremental = time.time() - start

print(f"\nStandard PCA:")
print(f"  Shape: {X_standard.shape}")
print(f"  Time: {time_standard:.4f}s")
print(f"  Variance explained: {pca_standard.explained_variance_ratio_.sum():.4f}")

print(f"\nIncremental PCA:")
print(f"  Shape: {X_incremental.shape}")
print(f"  Time: {time_incremental:.4f}s")
print(f"  Variance explained: {pca_incremental.explained_variance_ratio_.sum():.4f}")

Standard PCA requires the entire dataset to fit in memory (O(n×d) space). Incremental PCA processes data in mini-batches, making it suitable for datasets that are too large to fit in RAM. The trade-off is slightly different results and slower overall computation time.

PCA Components Visualization

# Visualize first 10 components as images (for digits)
pca_viz = PCA(n_components=10)
pca_viz.fit(X_digits)

fig, axes = plt.subplots(2, 5, figsize=(15, 6))
for i, ax in enumerate(axes.ravel()):
    ax.imshow(pca_viz.components_[i].reshape(8, 8), cmap='gray')
    ax.set_title(f'PC{i+1}\n({pca_viz.explained_variance_ratio_[i]*100:.1f}%)')
    ax.axis('off')

plt.suptitle('First 10 Principal Components (Digits)', fontsize=14)
plt.tight_layout()
plt.savefig('pca_components.png', dpi=150, bbox_inches='tight')
plt.show()

Eigenvalues and Eigenvectors

Intuitive Explanation

Architecture Diagram

EIGENVALUES AND EIGENVECTORS:

A matrix A transforms vectors. Eigenvectors are special:
they only get SCALED, not rotated.

Original Vector v:        After A*v:
    ↓                         ↓
    ↓                     ╱   ↓   ╲
    ↓                   ╱     ↓     ╲
    ──→               ──→─────↓─────→
                            λv (scaled)

• Eigenvector v: direction preserved
• Eigenvalue λ: scaling factor
• Larger λ = more variance in that direction

PCA finds directions with MAXIMUM variance
= directions with LARGEST eigenvalues

Mathematical Properties

DfEigenvalue Decomposition

For a symmetric matrix (such as a covariance matrix $\Sigma$ ), the eigendecomposition expresses the matrix as $\Sigma = V \Lambda V^T$ where $V$ is the matrix of eigenvectors and $\Lambda$ is the diagonal matrix of eigenvalues.

For a symmetric matrix (covariance matrix $\Sigma$ ):

\Sigma v_i = \lambda_i v_i

Properties:

All eigenvalues are real: $\lambda_i \in \mathbb{R}$
Eigenvectors are orthogonal: $v_i^T v_j = 0$ for $i \neq j$
Total variance: $\sum_{i=1}^{d} \lambda_i = \text{tr}(\Sigma)$

SVD vs Eigendecomposition

PCA can be computed via either eigendecomposition of the covariance matrix or Singular Value Decomposition (SVD) of the centered data matrix. SVD is numerically more stable and is the method used by scikit-learn. The relationship is: if $X = U S V^T$ is the SVD of the centered data, then the singular values relate to eigenvalues as $\lambda_i = s_i^2 / (n-1)$ .

# SVD approach to PCA
X_centered = X_digits - X_digits.mean(axis=0)

# Using SVD
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Using eigendecomposition
cov = X_centered.T @ X_centered / (X_centered.shape[0] - 1)
eig_vals, eig_vecs = np.linalg.eigh(cov)

# Sort by eigenvalues
idx = np.argsort(eig_vals)[::-1]
eig_vals = eig_vals[idx]
eig_vecs = eig_vecs[:, idx]

print("Comparing SVD and Eigendecomposition:")
print(f"  SVD singular values (first 5): {S[:5]}")
print(f"  SVD variance explained: {(S**2 / (S**2).sum())[:5]}")
print(f"\n  Eigenvalues (first 5): {eig_vals[:5]}")
print(f"  Eigenvalues variance explained: {(eig_vals / eig_vals.sum())[:5]}")

# They should be equivalent
print(f"\nEquivalence check: {np.allclose(S**2 / (X_centered.shape[0] - 1), eig_vals)}")

Practical Application: Image Compression

📝Image Compression with PCA

# Load a sample image (or create synthetic one)
from sklearn.datasets import load_sample_image

try:
    image = load_sample_image('china.jpg')
except:
    image = np.random.randint(0, 255, (100, 100, 3), dtype=np.uint8)

# Convert to grayscale for simplicity
if len(image.shape) == 3:
    image_gray = np.mean(image, axis=2)
else:
    image_gray = image

print(f"Original image shape: {image_gray.shape}")

# Divide image into patches
patch_size = 8
n_patches_y = image_gray.shape[0] // patch_size
n_patches_x = image_gray.shape[1] // patch_size

patches = []
for i in range(n_patches_y):
    for j in range(n_patches_x):
        patch = image_gray[i*patch_size:(i+1)*patch_size, 
                          j*patch_size:(j+1)*patch_size]
        patches.append(patch.flatten())

patches = np.array(patches)
print(f"Number of patches: {patches.shape}")
print(f"Patch size: {patch_size}x{patch_size} = {patch_size**2} dimensions")

# Apply PCA
n_components_list = [2, 5, 10, 20, 32]
reconstruction_errors = []

fig, axes = plt.subplots(2, len(n_components_list) + 1, figsize=(20, 8))

# Original
axes[0, 0].imshow(image_gray, cmap='gray')
axes[0, 0].set_title('Original')
axes[0, 0].axis('off')

axes[1, 0].imshow(image_gray, cmap='gray')
axes[1, 0].set_title('Reconstructed (Full)')
axes[1, 0].axis('off')

for idx, n_comp in enumerate(n_components_list):
    # PCA compression
    pca = PCA(n_components=n_comp)
    patches_compressed = pca.fit_transform(patches)
    patches_reconstructed = pca.inverse_transform(patches_compressed)
    
    # Calculate reconstruction error
    error = np.mean((patches - patches_reconstructed) ** 2)
    reconstruction_errors.append(error)
    
    # Reconstruct image
    image_reconstructed = np.zeros_like(image_gray)
    patch_idx = 0
    for i in range(n_patches_y):
        for j in range(n_patches_x):
            patch = patches_reconstructed[patch_idx].reshape(patch_size, patch_size)
            image_reconstructed[i*patch_size:(i+1)*patch_size, 
                              j*patch_size:(j+1)*patch_size] = patch
            patch_idx += 1
    
    axes[0, idx + 1].imshow(patches_compressed[:, :min(n_comp, 4)].T, cmap='viridis', aspect='auto')
    axes[0, idx + 1].set_title(f'{n_comp} Components\nCompressed')
    axes[0, idx + 1].axis('off')
    
    axes[1, idx + 1].imshow(image_reconstructed, cmap='gray')
    axes[1, idx + 1].set_title(f'{n_comp} Components\nMSE: {error:.2f}')
    axes[1, idx + 1].axis('off')

plt.suptitle('Image Compression with PCA', fontsize=14)
plt.tight_layout()
plt.savefig('pca_compression.png', dpi=150, bbox_inches='tight')
plt.show()

# Compression ratio
original_size = patches.nbytes
print("\nCompression Analysis:")
for n_comp, error in zip(n_components_list, reconstruction_errors):
    compressed_size = patches.shape[0] * n_comp * patches.dtype.itemsize
    ratio = original_size / compressed_size
    print(f"  {n_comp:2d} components: MSE={error:.2f}, "
          f"Compression ratio: {ratio:.1f}x, "
          f"Size reduction: {(1 - 1/ratio)*100:.1f}%")

The reconstruction error from PCA compression is exactly the sum of the eigenvalues of the discarded components. This provides a principled way to choose the number of components: pick the smallest k such that the total discarded variance is below an acceptable threshold.

Key Takeaways

PCA finds directions of maximum variance in data
Explained variance tells us how much information we retain
Dimensionality reduction helps with visualization and computation
Always center data before applying PCA
Incremental PCA is essential for large datasets
PCA components can reveal underlying structure
SVD and eigendecomposition give equivalent results
PCA is linear - for non-linear, use t-SNE or UMAP

Summary Table

Aspect	Standard PCA	Incremental PCA	Kernel PCA
Type	Linear	Linear	Non-linear
Dataset Size	Small-Medium	Large	Small-Medium
Complexity	O(d³) or O(n²d)	O(batch × d²)	O(n³)
Memory	O(n×d)	O(batch×d)	O(n²)
Best For	General purpose	Out-of-core data	Non-linear manifolds

Comparison with Other Methods

Architecture Diagram

DIMENSIONALITY REDUCTION METHODS:

PCA (Linear):
• Preserves global variance
• Fast and interpretable
• Good for linear relationships

t-SNE (Non-linear):
• Preserves local structure
• Good for visualization
• Slow, non-parametric

UMAP (Non-linear):
• Faster than t-SNE
• Better preserves global structure
• Parametric version available

LDA (Supervised):
• Uses class labels
• Maximizes class separation
• Good for classification

📋Summary: PCA for Dimensionality Reduction

The curse of dimensionality causes data sparsity, overfitting, and computational overhead as dimensions grow exponentially.
PCA finds orthogonal directions (principal components) that maximize variance by computing the eigendecomposition of the covariance matrix.
The explained variance ratio $\lambda_k / \sum \lambda_i$ quantifies how much information each component captures — use the scree plot and cumulative variance thresholds (90-95%) to choose k.
Centering data is essential before PCA; standardization is critical when features have different units or scales.
Incremental PCA processes data in mini-batches for out-of-core learning on datasets too large for memory.
PCA is equivalent to SVD of the centered data matrix — scikit-learn uses SVD for numerical stability.
PCA is linear — for non-linear manifold structure, use kernel PCA, t-SNE, or UMAP.
Reconstruction error from PCA equals the sum of discarded eigenvalues, providing a principled component selection criterion.

Practice Exercises

Exercise 1: Face Recognition

"""
Use PCA (Eigenfaces) for face recognition:
1. Load face dataset (e.g., Olivetti faces)
2. Apply PCA to get eigenfaces
3. Reconstruct faces with different components
4. Train classifier on PCA features
5. Evaluate accuracy vs number of components
"""

# Your code here

Exercise 2: Anomaly Detection with PCA

"""
Detect anomalies using PCA reconstruction error:
1. Generate normal data
2. Inject anomalies
3. Apply PCA
4. Calculate reconstruction error
5. Set threshold and detect anomalies
"""

# Your code here

Exercise 3: Compare PCA with Other Methods

"""
Compare PCA with:
1. t-SNE
2. UMAP (if available)
3. Linear Discriminant Analysis (LDA)

On the same dataset, compare:
- Visualization quality
- Classification accuracy after reduction
- Computational time
"""

# Your code here

Exercise 4: PCA for Feature Engineering

"""
Use PCA as feature engineering:
1. Create polynomial features
2. Apply PCA to reduce
3. Compare model performance with/without PCA
4. Analyze which features contribute most
"""

# Your code here

Next: We'll dive deep into model evaluation metrics including ROC curves, AUC, and Precision-Recall curves.

PCA for Dimensionality Reduction

PCA for Dimensionality Reduction

The Curse of Dimensionality

DfCurse of Dimensionality

Mathematical Intuition

Volume of a d-Dimensional Hypersphere

Principal Component Analysis (PCA)

DfPrincipal Component Analysis

Mathematical Derivation

Centering

Sample Covariance Matrix

PCA Projection

Visual Explanation

Explained Variance

DfExplained Variance Ratio

Explained Variance Ratio

ThVariance Preservation

Complete Python Implementation

📝Complete PCA Implementation

Explained Variance Analysis

Impact on Classification

Incremental PCA for Large Datasets

PCA Components Visualization

Eigenvalues and Eigenvectors

Intuitive Explanation

Mathematical Properties

DfEigenvalue Decomposition

SVD vs Eigendecomposition

Practical Application: Image Compression

📝Image Compression with PCA

Key Takeaways

Summary Table

Comparison with Other Methods

📋Summary: PCA for Dimensionality Reduction

Practice Exercises

Exercise 1: Face Recognition

Exercise 2: Anomaly Detection with PCA

Exercise 3: Compare PCA with Other Methods

Exercise 4: PCA for Feature Engineering

Need Expert Data Science Help?