Dimensionality Reduction: t-SNE, UMAP & Autoencoders
Visualizing and compressing high-dimensional data
Interview Question
"Explain the differences between t-SNE, UMAP, and PCA for dimensionality reduction. When would you use each method? How do autoencoders compare to these methods?"
Difficulty: Hard | Frequently asked at NVIDIA, Meta, Google
Theoretical Foundation
PCA Recap (Linear)
PCA finds orthogonal directions of maximum variance. Fast and deterministic, but only captures linear relationships.
t-SNE (t-Distributed Stochastic Neighbor Embedding)
Goal: Preserve local structure (nearby points stay nearby).
Algorithm:
- Compute pairwise similarities in high-dimensional space:
- Compute similarities in low-dimensional space using t-distribution:
- Minimize KL divergence:
Properties:
- Preserves local structure well
- Non-parametric (no explicit mapping)
- Slow for large datasets ()
- Non-deterministic (different runs give different results)
- Good for visualization (2-3D)
UMAP (Uniform Manifold Approximation and Projection)
Goal: Preserve both local and global structure.
Algorithm:
- Build fuzzy simplicial set (neighborhood graph) in high-D
- Optimize low-D embedding to match the fuzzy set
Properties:
- Faster than t-SNE ()
- Better preserves global structure
- Can be used for general dimensionality reduction
- Has a parametric version
- Deterministic with fixed random seed
t-SNE vs UMAP Comparison
| Aspect | t-SNE | UMAP |
|---|---|---|
| Speed | Slow | Fast |
| Global Structure | Poor | Good |
| Scalability | Limited | Better |
| Deterministic | No | Yes (with seed) |
| Out-of-sample | No | Yes (parametric) |
| Use Case | Visualization only | Visualization + reduction |
βΉοΈ
Key Insight: UMAP is generally preferred over t-SNE because it's faster, preserves global structure better, and can handle out-of-sample points. t-SNE is still useful for quick visualization.
Autoencoders
Architecture:
- Encoder: Maps input to latent space
- Decoder: Reconstructs input from latent space
Training:
Variants:
- Variational Autoencoder (VAE): Adds probabilistic structure to latent space
- Denoising Autoencoder: Trained to reconstruct from corrupted input
- Sparse Autoencoder: Enforces sparsity in latent representation
Properties:
- Learns non-linear mappings
- Can generate new data (VAE)
- Requires more data and tuning
- Computationally expensive
Method Selection Guide
| Method | Speed | Non-linear | Generation | Out-of-sample | Best For |
|---|---|---|---|---|---|
| PCA | Fast | No | No | Yes | Linear data, feature reduction |
| t-SNE | Slow | Yes | No | No | 2D/3D visualization |
| UMAP | Medium | Yes | No | Yes | Visualization + reduction |
| Autoencoder | Slow | Yes | Yes | Yes | Complex non-linear patterns |
Code Implementation
Real-World Applications
NVIDIA: Feature Visualization
- CNN Feature Maps: Visualizing learned representations
- Latent Space Analysis: Understanding generative models
- GPU-Accelerated Reduction: CUDA implementations of t-SNE/UMAP
Meta: Content Understanding
- Embedding Visualization: Visualizing user/item embeddings
- Topic Modeling: Reducing document dimensions
- Anomaly Detection: Identifying outliers in embedding space
π‘
NVIDIA Interview Tip: Discuss GPU acceleration of t-SNE and UMAP. Mention libraries like FIt-SNE and cuML that provide GPU-accelerated implementations.
Common Follow-Up Questions
Q1: Why is t-SNE not suitable for high-dimensional data directly? t-SNE is slow () and struggles with high dimensions. Apply PCA first to reduce to ~50 dimensions, then apply t-SNE.
Q2: How does UMAP preserve global structure better than t-SNE? UMAP uses fuzzy topological representations and optimizes a cross-entropy loss, while t-SNE uses KL divergence which emphasizes local structure.
Q3: When would you use an autoencoder over t-SNE/UMAP? When you need: (1) a parametric mapping for out-of-sample points, (2) data generation capabilities (VAE), or (3) complex non-linear patterns.
Q4: How do you evaluate dimensionality reduction quality?
- Trustworthiness: Are neighbors preserved?
- Continuity: Is the mapping smooth?
- Downstream task performance: Does reduction help classification/clustering?