Dimensionality Reduction — Complete Guide
Dimensionality reduction compresses high-dimensional data into fewer dimensions while preserving important information.
Why Reduce Dimensions?
Curse of Dimensionality:
├─ More dimensions = more data needed to cover the space
├─ Distances concentrate: nearest and farthest neighbors end up almost equally far away
├─ Models overfit more easily
└─ Training becomes slower
Benefits:
├─ Faster training
├─ Less overfitting
├─ Better visualization (2D/3D)
├─ Removes noise
└─ Fewer features = simpler model
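The claim that distances lose meaning in high dimensions can be checked numerically. The sketch below is only an illustration on uniform random points, not a proof; the helper name distance_contrast and the sample sizes are assumptions made for this example. It measures how much farther the farthest point is than the nearest one, relative to the nearest.

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_contrast(n_points: int, n_dims: int) -> float:
    """Return (max - min) / min over distances from one query point to all others."""
    points = rng.random((n_points, n_dims))  # uniform samples in the unit hypercube
    query = points[0]
    dists = np.linalg.norm(points[1:] - query, axis=1)
    return float((dists.max() - dists.min()) / dists.min())

# Hypothetical settings: 500 points, increasing dimensionality
contrasts = {d: distance_contrast(500, d) for d in (2, 10, 100, 1000)}
for d, c in contrasts.items():
    print(f"{d:>4} dims: relative contrast = {c:.2f}")
```

The contrast shrinks as the number of dimensions grows, which is why nearest-neighbor style reasoning degrades on raw high-dimensional data.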
PCA (Principal Component Analysis)
PCA finds the directions of MAXIMUM VARIANCE:
1. Standardize data
2. Compute covariance matrix
3. Find eigenvectors (principal components)
4. Project data onto top K eigenvectors
PC1: Direction of most variance
PC2: Direction of second most variance (orthogonal to PC1)
...
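The four steps above can be sketched directly in NumPy. This is a minimal illustration, assuming a dense numeric matrix that fits in memory; the function name pca_from_scratch and the random demo data are made up for the example, and library implementations typically use an SVD instead of an explicit covariance matrix.

```python
import numpy as np

def pca_from_scratch(X: np.ndarray, k: int):
    """Project X onto its top-k principal components; also return variance ratios."""
    # 1. Standardize each feature to zero mean and unit variance
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant features
    Z = (X - mean) / std

    # 2. Covariance matrix of the standardized features
    cov = np.cov(Z, rowvar=False)

    # 3. Eigen-decomposition (eigh returns eigenvalues in ascending order)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]  # sort by descending variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # 4. Project onto the top-k eigenvectors
    X_proj = Z @ eigvecs[:, :k]
    return X_proj, eigvals / eigvals.sum()

# Hypothetical demo data: 200 samples, 5 correlated features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
X_proj, ratio = pca_from_scratch(X_demo, k=2)
print(X_proj.shape, np.round(ratio, 3))
```

Because the eigenvectors are orthogonal, the projected columns come out uncorrelated, which matches the "orthogonal to PC1" note above.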
Explained variance ratio tells you what fraction of the total variance each PC captures:
PC1: 72%
PC2: 15%
PC3: 8%
PC4: 5% → can probably drop this one
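One common way to decide how many components to keep is to accumulate the ratios until a target is reached. The sketch below reuses the illustrative percentages listed above; the 90% target is an assumption chosen for the example, not a universal rule.

```python
import numpy as np

# Illustrative explained variance ratios from the list above
ratios = np.array([0.72, 0.15, 0.08, 0.05])
cumulative = np.cumsum(ratios)

target = 0.90  # hypothetical threshold; pick one that suits the task
# Index of the first component where the running total reaches the target
k = int(np.argmax(cumulative >= target)) + 1

print("Cumulative variance:", np.round(cumulative, 2))
print("Components to keep:", k)
```

With these numbers the first three components are kept and PC4 is dropped. scikit-learn's PCA also accepts a float between 0 and 1 for n_components, which applies the same cumulative-variance rule during fitting.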
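from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# X: feature matrix of shape (n_samples, n_features)
X_scaled = StandardScaler().fit_transform(X)  # step 1: standardize
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)
print(f"Explained variance: {pca.explained_variance_ratio_}")
# e.g. [0.72, 0.15] would mean the first 2 components explain 87% of the variance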
t-SNE
t-SNE preserves LOCAL structure (neighborhoods):
Best for: Visualization (2D/3D)
Not for: Feature reduction for training (scikit-learn's TSNE cannot embed new, unseen points)
How it works:
1. Compute pairwise similarities in high-D (Gaussian kernel)
2. Compute pairwise similarities in low-D (Student-t kernel)
3. Move the low-D points to minimize the KL divergence between the two
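To make the three steps concrete, here is a simplified NumPy sketch of the quantities involved. It is not the full algorithm: it uses one fixed Gaussian bandwidth (sigma) for every point, whereas real t-SNE tunes a bandwidth per point to match the perplexity, and it only evaluates the KL divergence for a random layout instead of optimizing it. All function names are invented for this illustration.

```python
import numpy as np

def pairwise_sq_dists(A: np.ndarray) -> np.ndarray:
    """Squared Euclidean distance between every pair of rows."""
    sq = np.sum(A ** 2, axis=1)
    D = sq[:, None] + sq[None, :] - 2.0 * (A @ A.T)
    np.fill_diagonal(D, 0.0)
    return np.maximum(D, 0.0)  # clip tiny negatives from rounding

def high_d_similarities(X: np.ndarray, sigma: float = 1.0) -> np.ndarray:
    """Step 1: Gaussian similarities (simplification: one shared sigma)."""
    P = np.exp(-pairwise_sq_dists(X) / (2.0 * sigma ** 2))
    np.fill_diagonal(P, 0.0)
    P /= P.sum(axis=1, keepdims=True)  # conditional p(j | i)
    return (P + P.T) / (2.0 * len(X))  # symmetrize so the matrix sums to 1

def low_d_similarities(Y: np.ndarray) -> np.ndarray:
    """Step 2: heavy-tailed Student-t similarities (one degree of freedom)."""
    Q = 1.0 / (1.0 + pairwise_sq_dists(Y))
    np.fill_diagonal(Q, 0.0)
    return Q / Q.sum()

def kl_divergence(P: np.ndarray, Q: np.ndarray, eps: float = 1e-12) -> float:
    """Step 3: the cost that the optimizer drives down."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / (Q[mask] + eps))))

rng = np.random.default_rng(0)
X_hd = rng.normal(size=(50, 10))          # hypothetical high-D data
Y_init = rng.normal(size=(50, 2)) * 1e-2  # random low-D starting layout

P = high_d_similarities(X_hd)
Q = low_d_similarities(Y_init)
kl = kl_divergence(P, Q)
print(f"KL divergence of the random layout: {kl:.3f}")
```

The full method would repeatedly adjust Y_init by gradient descent on this cost, which is what the learning rate and iteration count control.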
Key parameters:
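├─ perplexity: Effective number of neighbors per point (5-50; must be less than the number of samples)
├─ learning_rate: Step size (10-1000)
└─ n_iter: Number of iterations (1000+); named max_iter in scikit-learn 1.5 and later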
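Since perplexity has the largest visible effect on the picture, it can help to fit a few values and compare. The sketch below assumes scikit-learn's bundled digits dataset and a 300-sample subset to keep the run short; the specific perplexity values are just examples from the range above.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X_digits, y_digits = load_digits(return_X_y=True)
X_small = X_digits[:300]  # small subset so the loop finishes quickly

embeddings = {}
for perplexity in (5, 30, 50):
    tsne = TSNE(n_components=2, perplexity=perplexity, init="pca", random_state=42)
    embeddings[perplexity] = tsne.fit_transform(X_small)
    # kl_divergence_ is the final value of the cost after optimization
    print(f"perplexity={perplexity}: final KL divergence = {tsne.kl_divergence_:.3f}")
```

Each entry in embeddings can be passed to a scatter plot. KL values are not directly comparable across perplexities, so the plots themselves are the better guide.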
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# X: feature matrix; y: labels, used only to color the points
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_2d = tsne.fit_transform(X)
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap='viridis')
plt.title('t-SNE Visualization')
plt.show()
UMAP
UMAP is a typically faster alternative to t-SNE:
Advantages over t-SNE:
├─ Usually much faster on large datasets
├─ Tends to preserve more global structure
├─ Can transform new data
└─ Often works better as preprocessing for clustering
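# Requires the umap-learn package (pip install umap-learn)
import umap

# n_neighbors trades off local (small values) against global (large values) structure
reducer = umap.UMAP(n_components=2, n_neighbors=15)
X_2d = reducer.fit_transform(X)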
Comparison
Method   Speed    Local structure   Global structure   Transforms new data
──────────────────────────────────────────────────────────────────────────
PCA      Fast     No                Yes                Yes
t-SNE    Slow     Yes               No                 No
UMAP     Medium   Yes               Yes                Yes
LDA      Fast     No                No                 Yes
LDA: Supervised — uses class labels
PCA: Unsupervised — uses only features
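The supervised versus unsupervised difference, and the ability to transform new data, both show up in the scikit-learn API. The sketch below uses the iris dataset as an assumed example: PCA is fit on features alone, LDA also needs labels, and the TSNE estimator has no transform method at all.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split

X_iris, y_iris = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X_iris, y_iris, test_size=0.3, random_state=0, stratify=y_iris
)

pca = PCA(n_components=2).fit(X_train)  # unsupervised: features only
# LDA allows at most (n_classes - 1) components, so 2 for the 3 iris classes
lda = LinearDiscriminantAnalysis(n_components=2).fit(X_train, y_train)  # supervised

# Both fitted models can project data they were not fitted on
X_test_pca = pca.transform(X_test)
X_test_lda = lda.transform(X_test)

# Check which estimators expose a transform method for new data
has_transform = {
    name: hasattr(est, "transform")
    for name, est in [("PCA", pca), ("LDA", lda), ("t-SNE", TSNE())]
}
print(has_transform)
```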
Key Takeaways
- PCA is the standard — fast, interpretable, widely used
- t-SNE is best for visualization — preserves local structure
- UMAP is usually faster than t-SNE and preserves more global structure
- Explained variance tells you how much of the original variance PCA preserves
- Standardize data before PCA
- Reduce to 2-3 dimensions for visualization
- For model training, 10-50 dimensions is a common starting point; tune the number with cross-validation
- Dimensionality reduction can improve model performance
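Several of these takeaways combine naturally into one scikit-learn pipeline: standardize, reduce with PCA, then train. The sketch below is an illustration on the bundled digits dataset; the choice of 30 components and of logistic regression as the model are assumptions for the example, not recommendations.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_digits, y_digits = load_digits(return_X_y=True)  # 64 features per sample

pipeline = make_pipeline(
    StandardScaler(),                  # standardize before PCA
    PCA(n_components=30),              # hypothetical choice in the 10-50 range
    LogisticRegression(max_iter=2000),
)

# The scaler and PCA are refit inside each fold, so no test data leaks into them
scores = cross_val_score(pipeline, X_digits, y_digits, cv=5)
print(f"Mean CV accuracy with 30 components: {scores.mean():.3f}")
```

Running the same cross-validation with different n_components values is a simple way to check whether reduction helps or hurts on a given dataset.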