Dimensionality Reduction — Complete Guide

Dimensionality reduction compresses high-dimensional data into fewer dimensions while preserving important information.

Why Reduce Dimensions?

Curse of Dimensionality:
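├─ More dimensions = far more data needed to cover the space
├─ Distances become less informative (nearest and farthest points look almost equally far away)
├─ Models overfit more easily
└─ Training becomes slower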
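
The point about distances can be checked numerically. The sketch below is illustrative only: it assumes points drawn uniformly from a unit hypercube, and `relative_contrast` is a hypothetical helper name. It measures how much farther the farthest point is than the nearest one, relative to the nearest.

```python
import numpy as np

def relative_contrast(n_points, n_dims, seed=0):
    """(farthest - nearest) / nearest distance from one query point to random points."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))   # assumption: uniform in the unit hypercube
    query = rng.random(n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

for d in (2, 10, 100, 1000):
    print(f"{d:>5} dims: relative contrast = {relative_contrast(500, d):.2f}")
```

The contrast shrinks as the number of dimensions grows, which is why nearest-neighbor style reasoning gets weaker in high-dimensional spaces.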

Benefits:
├─ Faster training
├─ Less overfitting
├─ Better visualization (2D/3D)
├─ Removes noise
└─ Fewer features = simpler model

PCA (Principal Component Analysis)

PCA finds the directions of MAXIMUM VARIANCE:

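1. Standardize data (zero mean, unit variance per feature)
2. Compute covariance matrix
3. Find its eigenvectors (principal components) and eigenvalues (variance along each)
4. Project data onto the top K eigenvectors (largest eigenvalues)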

PC1: Direction of most variance
PC2: Direction of second most variance (orthogonal to PC1)
...
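
The four steps above can be sketched directly in NumPy. This is a minimal illustration, not how scikit-learn implements PCA internally; `pca_eig` and the random demo data are hypothetical, and the sketch assumes no feature has zero variance.

```python
import numpy as np

def pca_eig(X, k):
    """PCA via eigen-decomposition of the covariance matrix (illustrative sketch)."""
    # 1. Standardize: zero mean, unit variance per feature
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. Covariance matrix of the features (columns)
    cov = np.cov(X_std, rowvar=False)
    # 3. Eigen-decomposition; eigh is meant for symmetric matrices
    #    and returns eigenvalues in ascending order, so re-sort descending
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # 4. Project onto the top k eigenvectors
    components = eigvecs[:, :k]
    return X_std @ components, eigvals / eigvals.sum()

# Assumption: synthetic correlated data, only for demonstration
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
X_proj, ratios = pca_eig(X_demo, k=2)
print(X_proj.shape)       # projected data: one row per sample, k columns
print(ratios.round(3))    # explained variance ratio per component
```

The projected columns are uncorrelated with each other, which is what "orthogonal to PC1" means in practice.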

Explained variance ratio tells you what fraction of the total variance each PC captures, for example:
PC1: 72%
PC2: 15%
PC3: 8%
PC4: 5% → can probably drop this one
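
One common way to decide how many components to keep is to pick the smallest K whose cumulative explained variance reaches a chosen threshold. The sketch below uses the illustrative ratios from the list above; `components_for_variance` is a hypothetical helper, and the thresholds are arbitrary examples.

```python
import numpy as np

def components_for_variance(ratios, threshold):
    """Smallest K whose cumulative explained variance ratio reaches `threshold`."""
    cumulative = np.cumsum(ratios)
    k = int(np.searchsorted(cumulative, threshold) + 1)
    return min(k, len(ratios))   # guard against thresholds above the total

ratios = [0.72, 0.15, 0.08, 0.05]   # illustrative values from the text
print(components_for_variance(ratios, 0.85))  # → 2
print(components_for_variance(ratios, 0.90))  # → 3
```

In scikit-learn, passing a float between 0 and 1 as `n_components` (for example `PCA(n_components=0.90)`) asks the library to make a similar selection for you.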

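from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# X: feature matrix of shape (n_samples, n_features)
# Standardize first so that no single feature dominates the variance
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

print(f"Explained variance: {pca.explained_variance_ratio_}")
# e.g. [0.72, 0.15] would mean the first 2 components explain 87% of variance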

t-SNE

t-SNE preserves LOCAL structure (neighborhoods):

Best for: Visualization (2D/3D)
Not for: Feature reduction for training

How it works:
1. Compute pairwise similarities in high-D (Gaussian kernel)
2. Compute pairwise similarities in low-D (Student-t kernel, heavier tails)
3. Move the low-D points to minimize the KL divergence between the two
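
The objective behind these three steps can be written out in a few lines of NumPy. This is a simplified sketch: it uses one fixed Gaussian bandwidth `sigma` for every point, whereas real t-SNE tunes a bandwidth per point from the perplexity, and it only evaluates the loss rather than optimizing it. All function names here are hypothetical.

```python
import numpy as np

def pairwise_sq_dists(Z):
    """Squared Euclidean distance between every pair of rows."""
    diff = Z[:, None, :] - Z[None, :, :]
    return (diff ** 2).sum(axis=-1)

def high_dim_similarities(X, sigma=1.0):
    """Gaussian similarities (assumption: one shared sigma instead of per-point tuning)."""
    n = X.shape[0]
    P = np.exp(-pairwise_sq_dists(X) / (2.0 * sigma ** 2))
    np.fill_diagonal(P, 0.0)                 # a point is not its own neighbor
    P /= P.sum(axis=1, keepdims=True)        # conditional p(j | i)
    return (P + P.T) / (2.0 * n)             # symmetrized joint p_ij

def low_dim_similarities(Y):
    """Student-t similarities with one degree of freedom."""
    Q = 1.0 / (1.0 + pairwise_sq_dists(Y))
    np.fill_diagonal(Q, 0.0)
    return Q / Q.sum()

def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q), the quantity t-SNE minimizes over the low-D coordinates."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / (Q[mask] + eps))))

# Assumption: random data and a random 2D layout, only to evaluate the loss once
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 10))
Y_demo = rng.normal(size=(50, 2))
P = high_dim_similarities(X_demo)
Q = low_dim_similarities(Y_demo)
print(f"KL(P || Q) for a random layout: {kl_divergence(P, Q):.3f}")
```

An optimizer would repeatedly adjust `Y_demo` to push this number down; a random layout is simply the starting point.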

Key parameters:
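├─ perplexity: Roughly the effective number of neighbors per point (5-50)
├─ learning_rate: Gradient descent step size (10-1000)
└─ n_iter: Number of optimization iterations (1000+); named max_iter in recent scikit-learn releases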

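import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# X: feature matrix of shape (n_samples, n_features); y: labels used only for coloring
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_2d = tsne.fit_transform(X)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap='viridis')
plt.title('t-SNE Visualization')
plt.show()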
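
Because perplexity has a strong effect on the layout, it can help to fit several values and compare the plots. The sketch below is one possible way to do that; the digits subset, the chosen perplexity values, and the `embeddings` dictionary are assumptions for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# Assumption: a small subset of the digits dataset keeps the run short
X_digits = load_digits().data[:300]

embeddings = {}
for perplexity in (5, 30, 50):
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=42)
    embeddings[perplexity] = tsne.fit_transform(X_digits)
    print(f"perplexity={perplexity:>2}: final KL divergence = {tsne.kl_divergence_:.3f}")
```

The KL values are not directly comparable across perplexities, since each perplexity defines a different high-D distribution; the plots themselves are the better guide.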

UMAP

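UMAP is a newer alternative to t-SNE that is usually faster:

Advantages over t-SNE:
├─ Often much faster, especially on large datasets (the speedup depends on data and settings)
├─ Tends to preserve more global structure while keeping local neighborhoods
├─ Can transform new data
└─ Often works better as a preprocessing step for clustering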

import umap  # provided by the umap-learn package

# X: feature matrix of shape (n_samples, n_features)
reducer = umap.UMAP(n_components=2, n_neighbors=15)
X_2d = reducer.fit_transform(X)

Comparison

Method    Speed    Local    Global   Transform
──────────────────────────────────────────────
PCA       Fast     No       Yes      Yes
t-SNE     Slow     Yes      No       No
UMAP      Medium   Yes      Yes      Yes
LDA       Fast     No       No       Yes
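
The Transform column in the table above is about whether a fitted reducer can be reused on data it has not seen. A minimal sketch with scikit-learn, assuming the iris dataset and an arbitrary 80/20 split as stand-ins for "old" and "new" data:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split

X_iris = load_iris().data
X_train, X_new = train_test_split(X_iris, test_size=0.2, random_state=0)

pca = PCA(n_components=2).fit(X_train)   # learn the projection once
X_new_2d = pca.transform(X_new)          # reuse it on unseen rows
print(X_new_2d.shape)

# scikit-learn's TSNE exposes fit / fit_transform only,
# so there is no learned mapping to apply to new rows
print(hasattr(TSNE(), "transform"))
```

This matters for model training: anything used as a preprocessing step has to be applied identically to training and test data.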

LDA (Linear Discriminant Analysis): Supervised, uses class labels to find directions that separate the classes
PCA: Unsupervised, uses only features
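
The difference shows up directly in the API: LDA needs labels when fitting, PCA does not. A short sketch, assuming the iris dataset as example data:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X_iris, y_iris = load_iris(return_X_y=True)

# LDA uses the labels and can return at most (n_classes - 1) components; here 3 - 1 = 2
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X_iris, y_iris)

# PCA ignores the labels entirely
X_pca = PCA(n_components=2).fit_transform(X_iris)

print(X_lda.shape, X_pca.shape)
```

The cap of (n_classes - 1) components means LDA cannot be used for arbitrary target dimensions the way PCA can.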

Key Takeaways

PCA is the standard first choice: fast, interpretable, widely used
t-SNE is best for visualization: preserves local structure
UMAP is usually faster than t-SNE and tends to preserve more global structure
Explained variance tells you how much of the variance PCA preserves
Standardize data before PCA
Reduce to 2-3 dimensions for visualization
Reduce to roughly 10-50 dimensions for model training (a rule of thumb; check explained variance or validation score)
Dimensionality reduction can improve model performance, but this is not guaranteed
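
When reduction is used for training, wrapping it in a pipeline keeps the scaler and PCA fitted on the training folds only. The sketch below is one way to set that up; the digits dataset, the choice of 20 components, and logistic regression as the model are all assumptions for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_digits, y_digits = load_digits(return_X_y=True)   # 64 features per sample

model = make_pipeline(
    StandardScaler(),                    # standardize before PCA
    PCA(n_components=20),                # assumption: 20 components, tune for your data
    LogisticRegression(max_iter=1000),
)
scores = cross_val_score(model, X_digits, y_digits, cv=5)
print(f"Mean accuracy with 20 components: {scores.mean():.3f}")
```

Comparing this score against the same model without the PCA step is a simple way to check whether the reduction helps or hurts on a given dataset.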
