Autoencoders: VAE, Denoising, Representation Learning — Asked at Google & OpenAI

🎯 The Interview Question

"Explain the autoencoder architecture and its variants. What is the mathematical formulation of a Variational Autoencoder (VAE) and how does it differ from a standard autoencoder? How does denoising autoencoding work, and what are its benefits? What is the connection between autoencoders and representation learning?"

This question tests understanding of unsupervised learning — crucial for Google (representation learning) and OpenAI (generative models).

📚 Detailed Answer

Standard Autoencoder

An autoencoder learns to compress data into a latent representation and reconstruct it:

Encoder: $z = f_\phi(x)$
Decoder: $\hat{x} = g_\theta(z)$

Objective: Minimize reconstruction error:

$$\mathcal{L}_{AE} = \|x - g_\theta(f_\phi(x))\|^2$$

Architecture:

Input (784) → Encoder → Bottleneck (32) → Decoder → Output (784)
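
A minimal NumPy sketch of the forward pass and reconstruction loss described above. The single linear layer on each side, the random untrained weights, and the names `encode`, `decode`, and `reconstruction_loss` are illustrative assumptions; a real model would use nonlinear layers and learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, latent_dim = 784, 32  # sizes taken from the diagram above

# Hypothetical, untrained parameters: one linear layer for each side.
W_enc = rng.normal(0.0, 0.01, size=(input_dim, latent_dim))
W_dec = rng.normal(0.0, 0.01, size=(latent_dim, input_dim))

def encode(x):
    """z = f_phi(x): project the input down to the bottleneck."""
    return x @ W_enc

def decode(z):
    """x_hat = g_theta(z): map the latent code back to input space."""
    return z @ W_dec

def reconstruction_loss(x):
    """Squared L2 reconstruction error, averaged over the batch."""
    x_hat = decode(encode(x))
    return np.mean(np.sum((x - x_hat) ** 2, axis=1))

x = rng.random((4, input_dim))  # a batch of 4 fake flattened images
z = encode(x)
loss = reconstruction_loss(x)
```

Training would adjust `W_enc` and `W_dec` to reduce `loss`; the 32-dimensional `z` is what forces compression.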

Limitations:

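Deterministic: the same input always maps to the same latent code
No principled generation: the latent space has no prescribed distribution, so randomly sampled latent vectors often decode to implausible outputs
Risk of learning the identity function when the bottleneck is not narrower than the input or the network is otherwise unconstrained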

Variational Autoencoder (VAE)

A VAE adds probabilistic structure to enable generation.

The encoder outputs the parameters of a diagonal Gaussian posterior $q_\phi(z|x) = \mathcal{N}(\mu, \text{diag}(\sigma^2))$:

$$(\mu, \log\sigma^2) = f_\phi(x)$$

Reparameterization trick:

$$z = \mu + \sigma \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$

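Objective (ELBO, maximized during training; in practice its negative is minimized as the loss):

$$\mathcal{L}_{VAE} = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}(q_\phi(z|x) \| p(z))$$

where:

First term: expected log-likelihood of the input under the decoder (its negative is the reconstruction loss)
Second term: KL divergence between the approximate posterior and the prior $p(z) = \mathcal{N}(0, I)$ (latent space regularization)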

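Closed-form KL divergence for a diagonal Gaussian posterior $\mathcal{N}(\mu, \text{diag}(\sigma^2))$ and a standard normal prior, with latent dimension $d$:

$$D_{KL}(q_\phi(z|x) \| p(z)) = -\frac{1}{2}\sum_{j=1}^{d}(1 + \log\sigma_j^2 - \mu_j^2 - \sigma_j^2)$$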
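
The closed form above is easy to check numerically. The sketch below assumes the encoder returns `mu` and `log_var` arrays; the function name `gaussian_kl` is a hypothetical choice for illustration.

```python
import numpy as np

def gaussian_kl(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dims."""
    return -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var), axis=-1)

# Posterior equal to the prior: the divergence is zero.
kl_zero = gaussian_kl(np.zeros(3), np.zeros(3))

# Unit variance, mean shifted by 1 in each of 3 dimensions:
# each dimension contributes 0.5, so the total is 1.5.
kl_shift = gaussian_kl(np.ones(3), np.zeros(3))
```

Working with `log_var` rather than `sigma` keeps the variance positive without any explicit constraint.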

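💡 The reparameterization trick is crucial. Sampling $z \sim q_\phi(z|x)$ directly is not differentiable with respect to $\mu$ and $\sigma$, so gradients cannot be backpropagated through it. Writing $z = \mu + \sigma \odot \epsilon$ moves the randomness into $\epsilon$, which does not depend on the parameters; gradients then flow through $\mu$ and $\sigma$ while $\epsilon$ is treated as a fixed noise input.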
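
A small NumPy sketch of the sampling step, assuming the encoder outputs `mu` and `log_var`. NumPy has no autograd, so the gradient paths are only described in comments; the function name `reparameterize` is illustrative.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)   # the only source of randomness
    sigma = np.exp(0.5 * log_var)         # log-variance -> standard deviation
    # For a fixed eps, z is affine in mu and sigma:
    # dz/dmu = 1 and dz/dsigma = eps, which is where gradients flow.
    return mu + sigma * eps, eps

rng = np.random.default_rng(0)
mu = np.full((100_000, 2), 3.0)
log_var = np.full((100_000, 2), np.log(0.25))  # sigma = 0.5

z, eps = reparameterize(mu, log_var, rng)
```

The samples in `z` have mean close to 3.0 and standard deviation close to 0.5, as expected for draws from $\mathcal{N}(\mu, \sigma^2)$.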

VAE vs Standard Autoencoder

Aspect	Autoencoder	VAE
Latent space	Deterministic	Probabilistic
Generation	No principled way to sample new data	Sample $z \sim p(z)$ and decode
Smoothness	No continuity guarantee	Smooth latent space encouraged by the KL term
Training	Reconstruction only	Reconstruction + KL regularization

Denoising Autoencoder

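A denoising autoencoder corrupts the input with noise and is trained to reconstruct the clean version (here $\sigma$ is a fixed noise level, unrelated to the VAE's posterior $\sigma$):

$$\tilde{x} = x + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2 I)$$

$$\mathcal{L}_{DAE} = \|x - g_\theta(f_\phi(\tilde{x}))\|^2$$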
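
The sketch below illustrates the corruption step and the loss. It also shows why copying the input stops being a perfect solution. The names `corrupt` and `dae_loss`, the noise level 0.5, and the random data are assumptions made for the example.

```python
import numpy as np

def corrupt(x, noise_std, rng):
    """x_tilde = x + eps, eps ~ N(0, noise_std^2 I)."""
    return x + rng.normal(0.0, noise_std, size=x.shape)

def dae_loss(x, x_tilde, reconstruct):
    """Squared error between the CLEAN input and the reconstruction of the NOISY one."""
    x_hat = reconstruct(x_tilde)
    return np.mean(np.sum((x - x_hat) ** 2, axis=1))

rng = np.random.default_rng(0)
x = rng.random((1000, 16))
x_tilde = corrupt(x, noise_std=0.5, rng=rng)

# The identity map reproduces the noise, so its expected loss is
# d * noise_std^2 = 16 * 0.25 = 4 rather than 0.
identity_loss = dae_loss(x, x_tilde, reconstruct=lambda v: v)
```

Any model that beats `identity_loss` must have learned something about the structure of the clean data.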

Benefits:

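Learns features that are robust to small input perturbations
Discourages the identity solution, since copying the noisy input no longer minimizes the loss
Captures data manifold structure by learning to map corrupted points back toward clean data
Conceptual precursor to diffusion models, which train a denoiser across many noise levels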

Mathematical Analysis

Information Bottleneck Perspective

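The bottleneck limits how much information the latent code $Z$ carries about the input $X$, measured by the mutual information:

$$I(X; Z) = H(Z) - H(Z|X)$$

Regularization limits $I(X; Z)$: in a VAE, the KL term averaged over the data is an upper bound on it. This pressures the model to keep only the information that most helps reconstruction.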
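
One way to make the role of the KL term precise is to write the decomposition out. The aggregate posterior $q_\phi(z) = \mathbb{E}_{p(x)}[q_\phi(z|x)]$ and the notation $I_q$ (mutual information under the encoder) are introduced here only for this derivation:

$$
\begin{aligned}
\mathbb{E}_{p(x)}\big[D_{KL}(q_\phi(z|x) \| p(z))\big]
  &= \mathbb{E}_{p(x)}\,\mathbb{E}_{q_\phi(z|x)}\left[\log\frac{q_\phi(z|x)}{q_\phi(z)} + \log\frac{q_\phi(z)}{p(z)}\right] \\
  &= I_q(X; Z) + D_{KL}(q_\phi(z) \| p(z)) \\
  &\ge I_q(X; Z)
\end{aligned}
$$

So penalizing the average KL term caps the information the code can carry about the input.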

Manifold Learning

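Under the manifold hypothesis, data lies on or near a lower-dimensional manifold $\mathcal{M}$ embedded in $\mathbb{R}^n$:

$$\text{dim}(\mathcal{M}) = d \ll n$$

The encoder maps data to $d$-dimensional coordinates, and the decoder maps those coordinates back onto the manifold:

$$\mathbb{R}^n \xrightarrow{f_\phi} \mathbb{R}^d \xrightarrow{g_\theta} \mathcal{M} \subset \mathbb{R}^n$$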

Advanced Autoencoder Variants

β-VAE

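Adds a weight $\beta$ to the KL term to encourage disentangled representations:

$$\mathcal{L}_{\beta\text{-VAE}} = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - \beta D_{KL}(q_\phi(z|x) \| p(z))$$

Setting $\beta = 1$ recovers the standard VAE. $\beta > 1$ encourages disentangled factors (each latent dimension captures one factor of variation), typically at the cost of reconstruction quality.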
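
A sketch of the objective written as a loss to minimize (the negative of the expression above). It assumes a unit-variance Gaussian decoder, so the reconstruction term reduces to half the squared error up to a constant. The function name `beta_vae_loss`, the default `beta=4.0`, and the stand-in arrays are illustrative assumptions.

```python
import numpy as np

def beta_vae_loss(x, x_hat, mu, log_var, beta=4.0):
    """Negative beta-ELBO, averaged over the batch.

    Assumes a unit-variance Gaussian decoder, so -log p(x|z) equals
    0.5 * ||x - x_hat||^2 up to an additive constant.
    """
    recon = 0.5 * np.sum((x - x_hat) ** 2, axis=1)
    kl = -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var), axis=1)
    return np.mean(recon + beta * kl), np.mean(recon), np.mean(kl)

rng = np.random.default_rng(0)
x = rng.random((8, 10))
x_hat = x + 0.1            # a stand-in reconstruction, off by 0.1 everywhere
mu = np.ones((8, 2))       # stand-in encoder outputs
log_var = np.zeros((8, 2))

loss_b1, recon, kl = beta_vae_loss(x, x_hat, mu, log_var, beta=1.0)
loss_b4, _, _ = beta_vae_loss(x, x_hat, mu, log_var, beta=4.0)
```

With these inputs the KL term is 1.0 per example, so raising `beta` from 1 to 4 adds exactly 3.0 to the loss while the reconstruction term is unchanged.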

Vector Quantized VAE (VQ-VAE)

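Uses a discrete latent space: the encoder output $z_e$ is replaced by its nearest entry in a learned codebook $\mathcal{E} = \{e_k\}$:

$$z_q = \arg\min_{e_k \in \mathcal{E}} \|z_e - e_k\|^2$$

In addition to the reconstruction loss, training uses a codebook term and a commitment term:

$$\mathcal{L}_{VQ} = \|\text{sg}[z_e] - z_q\|^2 + \beta\|z_e - \text{sg}[z_q]\|^2$$

where $\text{sg}$ is the stop-gradient operator and $\beta$ is the commitment weight (unrelated to the $\beta$ in β-VAE). Because the $\arg\min$ is not differentiable, the decoder's gradient is copied straight through from $z_q$ to $z_e$. Discrete codebooks of this kind are used in VQGAN; the original DALL-E uses a closely related discrete VAE.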
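
A NumPy sketch of the codebook lookup and the two auxiliary loss terms. NumPy has no autograd, so stop-gradient appears only in comments. The tiny codebook, the function names `quantize` and `vq_loss_terms`, and the default `beta=0.25` are assumptions for illustration.

```python
import numpy as np

def quantize(z_e, codebook):
    """Return the nearest codebook entry and its index for each encoder output."""
    # Pairwise squared distances, shape (batch, num_codes).
    d2 = np.sum((z_e[:, None, :] - codebook[None, :, :]) ** 2, axis=-1)
    idx = np.argmin(d2, axis=1)
    return codebook[idx], idx

def vq_loss_terms(z_e, z_q, beta=0.25):
    """Codebook and commitment terms.

    Both share the same squared distance; in an autograd framework they
    differ only in which argument receives gradients.
    """
    sq = np.mean(np.sum((z_e - z_q) ** 2, axis=1))
    codebook_term = sq            # ||sg[z_e] - z_q||^2 would update the codebook
    commitment_term = beta * sq   # beta * ||z_e - sg[z_q]||^2 would update the encoder
    return codebook_term, commitment_term

codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]])
z_e = np.array([[0.1, -0.2], [0.9, 1.2], [-0.8, 1.7]])

z_q, idx = quantize(z_e, codebook)
codebook_term, commitment_term = vq_loss_terms(z_e, z_q)
```

Here each encoder output snaps to a different code, so `idx` is `[0, 1, 2]`; downstream models can then work with these integer indices instead of continuous vectors.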

Hierarchical VAE (NVAE)

Stacks multiple groups of latent variables $z_1, \dots, z_k$ in a hierarchy, with residual connections in the encoder and decoder networks:

x → VAE1(z1) → VAE2(z2) → ... → VAEk(zk)

Each level captures different scales of variation.

Representation Learning

Autoencoders learn useful representations without labels:

Self-supervised pre-training:

Train autoencoder on unlabeled data
Use encoder for downstream tasks
Fine-tune with small labeled dataset

Applications:

Anomaly detection: High reconstruction error → anomaly
Data augmentation: Generate similar samples
Dimensionality reduction: Use latent space as features
Image inpainting: Fill in missing parts
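
The anomaly-detection use case can be sketched without training a neural network. The example below swaps in a linear autoencoder fit in closed form via SVD (equivalent to PCA) as a stand-in for a trained encoder and decoder. The synthetic data, the 99th-percentile threshold, and all names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "normal" data near a 1-D line in 3-D space.
direction = np.array([1.0, 2.0, -1.0]) / np.sqrt(6.0)
t = rng.normal(0.0, 1.0, size=(500, 1))
normal_data = t * direction + rng.normal(0.0, 0.05, size=(500, 3))

# Fit the linear autoencoder: top principal direction via SVD.
mean = normal_data.mean(axis=0)
_, _, vt = np.linalg.svd(normal_data - mean, full_matrices=False)
W = vt[:1].T                      # (3, 1): encoder uses W, decoder uses W.T

def reconstruction_error(x):
    z = (x - mean) @ W            # encode to a 1-D latent code
    x_hat = z @ W.T + mean        # decode back to 3-D
    return np.sum((x - x_hat) ** 2, axis=1)

# Threshold chosen as the 99th percentile of errors on normal data.
threshold = np.percentile(reconstruction_error(normal_data), 99)

on_manifold = np.array([[0.5, 1.0, -0.5]])   # lies along the line
off_manifold = np.array([[2.0, -1.0, 2.0]])  # far from the line

queries = np.vstack([on_manifold, off_manifold])
is_anomaly = reconstruction_error(queries) > threshold
```

The point on the line reconstructs almost perfectly and is not flagged, while the point far from the line has a large reconstruction error and is flagged.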

Follow-Up Questions

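Q: Why is the reparameterization trick necessary in VAE?
A: Without it, we can't backpropagate through the sampling operation. The trick expresses the sample as a deterministic, differentiable function of $\mu$, $\sigma$, and an independent noise input $\epsilon$, so gradients reach the encoder parameters.

Q: How do VAEs differ from GANs?
A: VAEs maximize a lower bound on the log-likelihood, which gives a stable, well-defined objective but often blurry samples. GANs train a generator against a discriminator, which implicitly minimizes a divergence between the data and model distributions; samples tend to be sharp, but training can be unstable. A VAE includes an encoder, whereas a standard GAN only generates.

Q: What is the role of KL divergence in VAE?
A: It keeps each approximate posterior $q_\phi(z|x)$ close to the standard normal prior, so that samples drawn from the prior decode to plausible outputs and interpolation in latent space is smooth.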