PCA for Dimensionality Reduction
The Curse of Dimensionality
As the number of features (dimensions) increases, several problems emerge:
DfCurse of Dimensionality
The phenomenon where the volume of the feature space grows exponentially with the number of dimensions, causing data to become sparse and distances between points to become increasingly uniform. This leads to overfitting, increased computational cost, and degraded model performance in high-dimensional spaces.
CURSE OF DIMENSIONALITY EFFECTS:
Dimension | Data Points Needed for Same Density
----------|------------------------------------
2 | 100
5 | 1,000
10 | 10,000
20 | 1,000,000
50 | 10^10 (impossible!)
Problems:
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β 1. SPARSE DATA: Points become far apart in high dimensions β
β 2. OVERFITTING: More features = more parameters to estimate β
β 3. COMPUTATIONAL COST: Algorithms slow down exponentially β
β 4. CURSE OF DISTANCE: All points become equidistant β
β 5. VISUALIZATION: Can't plot beyond 3D β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Mathematical Intuition
In high dimensions, the volume of the space grows exponentially:
Volume of a d-Dimensional Hypersphere
Here,
- =Number of dimensions
- =Radius of the hypersphere
- =Gamma function (generalization of factorial)
For and :
This means data becomes extremely sparse!
The curse of dimensionality is not just a theoretical curiosity β it has practical consequences. For example, k-nearest neighbors degenerate in high dimensions because all points become approximately equidistant. This is why dimensionality reduction (like PCA) is often a critical preprocessing step before applying distance-based algorithms.
Principal Component Analysis (PCA)
PCA finds the directions (principal components) that maximize variance in the data.
DfPrincipal Component Analysis
An orthogonal linear transformation that projects data onto a new coordinate system where the greatest variance lies along the first principal component, the second greatest variance along the second, and so on. PCA finds the directions of maximum variance by computing the eigendecomposition of the covariance matrix (or equivalently, the SVD of the centered data matrix).
Mathematical Derivation
Given data matrix :
Step 1: Center the data
Centering
Here,
- =Mean vector of the data
- =Number of samples
Step 2: Compute covariance matrix
Sample Covariance Matrix
Here,
- =Covariance matrix (d Γ d)
- =Centered data matrix
Step 3: Eigendecomposition
Where:
- = eigenvector (principal component direction)
- = eigenvalue (variance explained by that component)
Step 4: Project onto top-k components
PCA Projection
Here,
- =Projected data (n Γ k)
- =Top-k eigenvectors [v_1, ..., v_k]
- =Number of components to keep
Always center (and optionally scale) your data before applying PCA. PCA is sensitive to the scale of features β features with larger magnitudes will dominate the variance. Use StandardScaler before PCA when features have different units.
Visual Explanation
PCA GEOMETRIC INTERPRETATION:
Original Data (2D):
β
β β β
β β β
β β β
Step 1: Find direction of maximum variance
β
β β ββββ β β Principal Component 1
β β β
β β β
Step 2: Find orthogonal direction
β
β β | β β Principal Component 2
β β β
β β β
Step 3: Project data
PC1: β β β β β β β β β (1D representation)
βββββββββββββββββββ
Original 2D β Reduced 1D
Explained Variance
DfExplained Variance Ratio
The proportion of the total variance in the data that is captured by a given principal component. It is computed as the eigenvalue of that component divided by the sum of all eigenvalues.
Explained Variance Ratio
Here,
- =Eigenvalue of the k-th principal component
- =Total number of features
ThVariance Preservation
The total variance in the data is preserved under PCA. That is, the sum of eigenvalues equals the trace of the covariance matrix: . When we project onto the top-k components, we retain a fraction of the total variance.
Complete Python Implementation
πComplete PCA Implementation
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA, IncrementalPCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_digits, fetch_olivetti_faces
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from mpl_toolkits.mplot3d import Axes3D
import time
import warnings
warnings.filterwarnings('ignore')
np.random.seed(42)
print("=" * 70)
print("PCA FOR DIMENSIONALITY REDUCTION")
print("=" * 70)
# ============================================
# 1. Basic PCA Implementation
# ============================================
print("\n1. BASIC PCA IMPLEMENTATION")
print("-" * 40)
# Generate correlated data
n_samples = 300
x1 = np.random.randn(n_samples)
x2 = 0.8 * x1 + 0.3 * np.random.randn(n_samples) # Correlated with x1
x3 = -0.5 * x1 + 0.6 * np.random.randn(n_samples) # Also correlated
X = np.column_stack([x1, x2, x3])
print(f"Original data shape: {X.shape}")
print(f"Feature correlations:")
print(f" Corr(x1, x2): {np.corrcoef(x1, x2)[0, 1]:.3f}")
print(f" Corr(x1, x3): {np.corrcoef(x1, x3)[0, 1]:.3f}")
print(f" Corr(x2, x3): {np.corrcoef(x2, x3)[0, 1]:.3f}")
# Apply PCA
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X)
print(f"\nPCA Results:")
print(f"Explained variance ratio: {pca.explained_variance_ratio_}")
print(f"Total variance explained: {pca.explained_variance_ratio_.sum():.4f}")
Explained Variance Analysis
# Load digits dataset
digits = load_digits()
X_digits = digits.data
y_digits = digits.target
print(f"Digits dataset shape: {X_digits.shape}")
print(f"Number of classes: {len(np.unique(y_digits))}")
# Apply PCA with all components
pca_full = PCA()
X_digits_pca = pca_full.fit_transform(X_digits)
# Plot explained variance
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Individual variance
axes[0].bar(range(1, len(pca_full.explained_variance_ratio_) + 1),
pca_full.explained_variance_ratio_, alpha=0.7, label='Individual')
axes[0].set_xlabel('Principal Component')
axes[0].set_ylabel('Explained Variance Ratio')
axes[0].set_title('Scree Plot')
axes[0].grid(True, alpha=0.3)
# Cumulative variance
cumulative_var = np.cumsum(pca_full.explained_variance_ratio_)
axes[1].plot(range(1, len(cumulative_var) + 1), cumulative_var, 'bo-')
axes[1].axhline(y=0.95, color='r', linestyle='--', label='95% threshold')
axes[1].axhline(y=0.90, color='g', linestyle='--', label='90% threshold')
axes[1].set_xlabel('Number of Components')
axes[1].set_ylabel('Cumulative Explained Variance')
axes[1].set_title('Cumulative Explained Variance')
axes[1].legend()
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('pca_variance.png', dpi=150, bbox_inches='tight')
plt.show()
# Find number of components for different thresholds
for threshold in [0.90, 0.95, 0.99]:
n_components = np.argmax(cumulative_var >= threshold) + 1
print(f"Components for {threshold*100:.0f}% variance: {n_components}")
The "elbow method" in the scree plot helps identify the optimal number of components. Look for the point where the explained variance drops sharply β components beyond this point contribute relatively little information. In practice, retaining components that capture 90-95% of total variance is a common heuristic.
Impact on Classification
# Compare classification with different numbers of components
n_components_range = [2, 5, 10, 20, 30, 40, 50, 64]
train_scores = []
cv_scores = []
times = []
for n_comp in n_components_range:
start = time.time()
# Create pipeline
pipeline = Pipeline([
('scaler', StandardScaler()),
('pca', PCA(n_components=n_comp)),
('classifier', LogisticRegression(max_iter=1000, random_state=42))
])
# Cross-validation
scores = cross_val_score(pipeline, X_digits, y_digits, cv=5, scoring='accuracy')
time_taken = time.time() - start
train_scores.append(n_comp)
cv_scores.append(scores.mean())
times.append(time_taken)
print(f"Components: {n_comp:3d}, Accuracy: {scores.mean():.4f} (+/- {scores.std():.4f}), "
f"Time: {time_taken:.3f}s")
# Plot results
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
axes[0].plot(n_components_range, cv_scores, 'bo-', linewidth=2, markersize=8)
axes[0].set_xlabel('Number of Components')
axes[0].set_ylabel('CV Accuracy')
axes[0].set_title('Classification Accuracy vs PCA Components')
axes[0].grid(True, alpha=0.3)
axes[0].axhline(y=max(cv_scores), color='r', linestyle='--', alpha=0.5)
axes[1].plot(n_components_range, times, 'rs-', linewidth=2, markersize=8)
axes[1].set_xlabel('Number of Components')
axes[1].set_ylabel('Training Time (s)')
axes[1].set_title('Training Time vs PCA Components')
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('pca_classification.png', dpi=150, bbox_inches='tight')
plt.show()
Incremental PCA for Large Datasets
# Generate large dataset
n_large = 10000
X_large = np.random.randn(n_large, 100)
print(f"Large dataset shape: {X_large.shape}")
# Standard PCA
start = time.time()
pca_standard = PCA(n_components=10)
X_standard = pca_standard.fit_transform(X_large)
time_standard = time.time() - start
# Incremental PCA
start = time.time()
pca_incremental = IncrementalPCA(n_components=10, batch_size=1000)
X_incremental = pca_incremental.fit_transform(X_large)
time_incremental = time.time() - start
print(f"\nStandard PCA:")
print(f" Shape: {X_standard.shape}")
print(f" Time: {time_standard:.4f}s")
print(f" Variance explained: {pca_standard.explained_variance_ratio_.sum():.4f}")
print(f"\nIncremental PCA:")
print(f" Shape: {X_incremental.shape}")
print(f" Time: {time_incremental:.4f}s")
print(f" Variance explained: {pca_incremental.explained_variance_ratio_.sum():.4f}")
Standard PCA requires the entire dataset to fit in memory (O(nΓd) space). Incremental PCA processes data in mini-batches, making it suitable for datasets that are too large to fit in RAM. The trade-off is slightly different results and slower overall computation time.
PCA Components Visualization
# Visualize first 10 components as images (for digits)
pca_viz = PCA(n_components=10)
pca_viz.fit(X_digits)
fig, axes = plt.subplots(2, 5, figsize=(15, 6))
for i, ax in enumerate(axes.ravel()):
ax.imshow(pca_viz.components_[i].reshape(8, 8), cmap='gray')
ax.set_title(f'PC{i+1}\n({pca_viz.explained_variance_ratio_[i]*100:.1f}%)')
ax.axis('off')
plt.suptitle('First 10 Principal Components (Digits)', fontsize=14)
plt.tight_layout()
plt.savefig('pca_components.png', dpi=150, bbox_inches='tight')
plt.show()
Eigenvalues and Eigenvectors
Intuitive Explanation
EIGENVALUES AND EIGENVECTORS:
A matrix A transforms vectors. Eigenvectors are special:
they only get SCALED, not rotated.
Original Vector v: After A*v:
β β
β β± β β²
β β± β β²
βββ βββββββββββββββ
Ξ»v (scaled)
β’ Eigenvector v: direction preserved
β’ Eigenvalue Ξ»: scaling factor
β’ Larger Ξ» = more variance in that direction
PCA finds directions with MAXIMUM variance
= directions with LARGEST eigenvalues
Mathematical Properties
DfEigenvalue Decomposition
For a symmetric matrix (such as a covariance matrix ), the eigendecomposition expresses the matrix as where is the matrix of eigenvectors and is the diagonal matrix of eigenvalues.
For a symmetric matrix (covariance matrix ):
Properties:
- All eigenvalues are real:
- Eigenvectors are orthogonal: for
- Total variance:
SVD vs Eigendecomposition
PCA can be computed via either eigendecomposition of the covariance matrix or Singular Value Decomposition (SVD) of the centered data matrix. SVD is numerically more stable and is the method used by scikit-learn. The relationship is: if is the SVD of the centered data, then the singular values relate to eigenvalues as .
# SVD approach to PCA
X_centered = X_digits - X_digits.mean(axis=0)
# Using SVD
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
# Using eigendecomposition
cov = X_centered.T @ X_centered / (X_centered.shape[0] - 1)
eig_vals, eig_vecs = np.linalg.eigh(cov)
# Sort by eigenvalues
idx = np.argsort(eig_vals)[::-1]
eig_vals = eig_vals[idx]
eig_vecs = eig_vecs[:, idx]
print("Comparing SVD and Eigendecomposition:")
print(f" SVD singular values (first 5): {S[:5]}")
print(f" SVD variance explained: {(S**2 / (S**2).sum())[:5]}")
print(f"\n Eigenvalues (first 5): {eig_vals[:5]}")
print(f" Eigenvalues variance explained: {(eig_vals / eig_vals.sum())[:5]}")
# They should be equivalent
print(f"\nEquivalence check: {np.allclose(S**2 / (X_centered.shape[0] - 1), eig_vals)}")
Practical Application: Image Compression
πImage Compression with PCA
# Load a sample image (or create synthetic one)
from sklearn.datasets import load_sample_image
try:
image = load_sample_image('china.jpg')
except:
image = np.random.randint(0, 255, (100, 100, 3), dtype=np.uint8)
# Convert to grayscale for simplicity
if len(image.shape) == 3:
image_gray = np.mean(image, axis=2)
else:
image_gray = image
print(f"Original image shape: {image_gray.shape}")
# Divide image into patches
patch_size = 8
n_patches_y = image_gray.shape[0] // patch_size
n_patches_x = image_gray.shape[1] // patch_size
patches = []
for i in range(n_patches_y):
for j in range(n_patches_x):
patch = image_gray[i*patch_size:(i+1)*patch_size,
j*patch_size:(j+1)*patch_size]
patches.append(patch.flatten())
patches = np.array(patches)
print(f"Number of patches: {patches.shape}")
print(f"Patch size: {patch_size}x{patch_size} = {patch_size**2} dimensions")
# Apply PCA
n_components_list = [2, 5, 10, 20, 32]
reconstruction_errors = []
fig, axes = plt.subplots(2, len(n_components_list) + 1, figsize=(20, 8))
# Original
axes[0, 0].imshow(image_gray, cmap='gray')
axes[0, 0].set_title('Original')
axes[0, 0].axis('off')
axes[1, 0].imshow(image_gray, cmap='gray')
axes[1, 0].set_title('Reconstructed (Full)')
axes[1, 0].axis('off')
for idx, n_comp in enumerate(n_components_list):
# PCA compression
pca = PCA(n_components=n_comp)
patches_compressed = pca.fit_transform(patches)
patches_reconstructed = pca.inverse_transform(patches_compressed)
# Calculate reconstruction error
error = np.mean((patches - patches_reconstructed) ** 2)
reconstruction_errors.append(error)
# Reconstruct image
image_reconstructed = np.zeros_like(image_gray)
patch_idx = 0
for i in range(n_patches_y):
for j in range(n_patches_x):
patch = patches_reconstructed[patch_idx].reshape(patch_size, patch_size)
image_reconstructed[i*patch_size:(i+1)*patch_size,
j*patch_size:(j+1)*patch_size] = patch
patch_idx += 1
axes[0, idx + 1].imshow(patches_compressed[:, :min(n_comp, 4)].T, cmap='viridis', aspect='auto')
axes[0, idx + 1].set_title(f'{n_comp} Components\nCompressed')
axes[0, idx + 1].axis('off')
axes[1, idx + 1].imshow(image_reconstructed, cmap='gray')
axes[1, idx + 1].set_title(f'{n_comp} Components\nMSE: {error:.2f}')
axes[1, idx + 1].axis('off')
plt.suptitle('Image Compression with PCA', fontsize=14)
plt.tight_layout()
plt.savefig('pca_compression.png', dpi=150, bbox_inches='tight')
plt.show()
# Compression ratio
original_size = patches.nbytes
print("\nCompression Analysis:")
for n_comp, error in zip(n_components_list, reconstruction_errors):
compressed_size = patches.shape[0] * n_comp * patches.dtype.itemsize
ratio = original_size / compressed_size
print(f" {n_comp:2d} components: MSE={error:.2f}, "
f"Compression ratio: {ratio:.1f}x, "
f"Size reduction: {(1 - 1/ratio)*100:.1f}%")
The reconstruction error from PCA compression is exactly the sum of the eigenvalues of the discarded components. This provides a principled way to choose the number of components: pick the smallest k such that the total discarded variance is below an acceptable threshold.
Key Takeaways
- PCA finds directions of maximum variance in data
- Explained variance tells us how much information we retain
- Dimensionality reduction helps with visualization and computation
- Always center data before applying PCA
- Incremental PCA is essential for large datasets
- PCA components can reveal underlying structure
- SVD and eigendecomposition give equivalent results
- PCA is linear - for non-linear, use t-SNE or UMAP
Summary Table
| Aspect | Standard PCA | Incremental PCA | Kernel PCA |
|---|---|---|---|
| Type | Linear | Linear | Non-linear |
| Dataset Size | Small-Medium | Large | Small-Medium |
| Complexity | O(dΒ³) or O(nΒ²d) | O(batch Γ dΒ²) | O(nΒ³) |
| Memory | O(nΓd) | O(batchΓd) | O(nΒ²) |
| Best For | General purpose | Out-of-core data | Non-linear manifolds |
Comparison with Other Methods
DIMENSIONALITY REDUCTION METHODS:
PCA (Linear):
β’ Preserves global variance
β’ Fast and interpretable
β’ Good for linear relationships
t-SNE (Non-linear):
β’ Preserves local structure
β’ Good for visualization
β’ Slow, non-parametric
UMAP (Non-linear):
β’ Faster than t-SNE
β’ Better preserves global structure
β’ Parametric version available
LDA (Supervised):
β’ Uses class labels
β’ Maximizes class separation
β’ Good for classification
πSummary: PCA for Dimensionality Reduction
- The curse of dimensionality causes data sparsity, overfitting, and computational overhead as dimensions grow exponentially.
- PCA finds orthogonal directions (principal components) that maximize variance by computing the eigendecomposition of the covariance matrix.
- The explained variance ratio quantifies how much information each component captures β use the scree plot and cumulative variance thresholds (90-95%) to choose k.
- Centering data is essential before PCA; standardization is critical when features have different units or scales.
- Incremental PCA processes data in mini-batches for out-of-core learning on datasets too large for memory.
- PCA is equivalent to SVD of the centered data matrix β scikit-learn uses SVD for numerical stability.
- PCA is linear β for non-linear manifold structure, use kernel PCA, t-SNE, or UMAP.
- Reconstruction error from PCA equals the sum of discarded eigenvalues, providing a principled component selection criterion.
Practice Exercises
Exercise 1: Face Recognition
"""
Use PCA (Eigenfaces) for face recognition:
1. Load face dataset (e.g., Olivetti faces)
2. Apply PCA to get eigenfaces
3. Reconstruct faces with different components
4. Train classifier on PCA features
5. Evaluate accuracy vs number of components
"""
# Your code here
Exercise 2: Anomaly Detection with PCA
"""
Detect anomalies using PCA reconstruction error:
1. Generate normal data
2. Inject anomalies
3. Apply PCA
4. Calculate reconstruction error
5. Set threshold and detect anomalies
"""
# Your code here
Exercise 3: Compare PCA with Other Methods
"""
Compare PCA with:
1. t-SNE
2. UMAP (if available)
3. Linear Discriminant Analysis (LDA)
On the same dataset, compare:
- Visualization quality
- Classification accuracy after reduction
- Computational time
"""
# Your code here
Exercise 4: PCA for Feature Engineering
"""
Use PCA as feature engineering:
1. Create polynomial features
2. Apply PCA to reduce
3. Compare model performance with/without PCA
4. Analyze which features contribute most
"""
# Your code here
Next: We'll dive deep into model evaluation metrics including ROC curves, AUC, and Precision-Recall curves.