
Autoencoders: VAE, Denoising, Representation Learning — Asked at Google & OpenAI




Autoencoders: VAE, Denoising & Representation Learning

Premium Interview Preparation — Representation Learning Mastery

🎯 The Interview Question

"Explain the autoencoder architecture and its variants. What is the mathematical formulation of a Variational Autoencoder (VAE) and how does it differ from a standard autoencoder? How does denoising autoencoding work, and what are its benefits? What is the connection between autoencoders and representation learning?"

This question tests understanding of unsupervised learning — crucial for Google (representation learning) and OpenAI (generative models).


📚 Detailed Answer

Standard Autoencoder

An autoencoder learns to compress data into a latent representation and reconstruct it:

Encoder: $z = f_\phi(x)$
Decoder: $\hat{x} = g_\theta(z)$

Objective: Minimize reconstruction error:

$$\mathcal{L}_{AE} = \|x - g_\theta(f_\phi(x))\|^2$$

Architecture:

Architecture Diagram
Input (784) → Encoder → Bottleneck (32) → Decoder → Output (784)
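As a rough illustration of the encoder, bottleneck, and decoder pipeline, here is a minimal sketch in plain Python. It assumes an untrained linear encoder and decoder with toy sizes (8 and 2 standing in for 784 and 32); the helper names `matvec`, `init_weights`, and `reconstruction_loss` are made up for this example. Practical autoencoders use nonlinear layers and are trained by gradient descent.

```python
import random

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in W]

def reconstruction_loss(x, x_hat):
    """Squared L2 error ||x - x_hat||^2."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))

def init_weights(rows, cols, rng, scale=0.1):
    """Small random weights; a stand-in for a learned layer."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
n, d = 8, 2                      # toy sizes standing in for 784 -> 32
W_enc = init_weights(d, n, rng)  # encoder parameters (phi)
W_dec = init_weights(n, d, rng)  # decoder parameters (theta)

x = [rng.random() for _ in range(n)]
z = matvec(W_enc, x)             # encoder: z = f_phi(x)
x_hat = matvec(W_dec, z)         # decoder: x_hat = g_theta(z)
loss = reconstruction_loss(x, x_hat)
```

Training would adjust `W_enc` and `W_dec` to reduce `loss` over a dataset; the bottleneck size `d` controls how much the input must be compressed.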

Limitations:

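  • Deterministic: the same input always produces the same latent code
  • No generative capability: no prior is imposed on the latent space, so randomly chosen latent vectors often decode to unrealistic outputs
  • May learn the identity function unless constrained, for example by a bottleneck smaller than the input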

Variational Autoencoder (VAE)

VAE adds probabilistic structure to enable generation.

Encoder outputs distribution parameters:

$$\mu = f_\phi(x), \quad \log\sigma^2 = g_\phi(x)$$

Reparameterization trick:

$$z = \mu + \sigma \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$
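The sampling step can be sketched in a few lines of plain Python. This assumes the encoder outputs `mu` and `log_var` as lists of floats; the function name `reparameterize` and the optional `eps` argument (handy for deterministic checks) are choices made for this illustration.

```python
import math
import random

def reparameterize(mu, log_var, eps=None, rng=random):
    """Return z = mu + sigma * eps, with sigma = exp(0.5 * log_var)."""
    if eps is None:
        # Noise is drawn independently of the encoder outputs.
        eps = [rng.gauss(0.0, 1.0) for _ in mu]
    return [m + math.exp(0.5 * lv) * e for m, lv, e in zip(mu, log_var, eps)]

mu = [0.5, -1.0]
log_var = [0.0, math.log(4.0)]    # sigma = [1, 2]
z = reparameterize(mu, log_var, eps=[1.0, -0.5])
# z is approximately [1.5, -2.0]
```

Because `z` is a deterministic function of `mu`, `log_var`, and `eps`, an automatic differentiation framework can compute gradients with respect to the encoder outputs.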

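Objective (ELBO), which is maximized during training (equivalently, its negative is minimized as the loss):

$$\mathcal{L}_{VAE} = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}(q_\phi(z|x) \| p(z))$$

where:

  • First term: expected log-likelihood of the input under the decoder (its negative is the reconstruction loss)
  • Second term: KL divergence from the approximate posterior to the prior (latent space regularization)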
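To see why this objective is a lower bound, it helps to write the standard decomposition of the log-likelihood. Here $p_\theta(z|x)$ denotes the true (intractable) posterior, a symbol not used elsewhere in this answer:

$$\log p_\theta(x) = \mathcal{L}_{VAE} + D_{KL}(q_\phi(z|x) \| p_\theta(z|x)) \ge \mathcal{L}_{VAE}$$

Since the KL divergence is non-negative, maximizing $\mathcal{L}_{VAE}$ raises a lower bound on $\log p_\theta(x)$, and the bound is tight when $q_\phi(z|x)$ matches the true posterior.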

KL divergence for a diagonal Gaussian posterior and a standard normal prior:

$$D_{KL} = -\frac{1}{2}\sum_{j=1}^{d}(1 + \log\sigma_j^2 - \mu_j^2 - \sigma_j^2)$$
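The closed form translates directly into code. The sketch below assumes the encoder provides `mu` and `log_var` as lists of floats; the function name `kl_to_standard_normal` is made up for this example.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return -0.5 * sum(
        1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var)
    )

kl_zero = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])  # posterior equals the prior
kl_shifted = kl_to_standard_normal([1.0], [0.0])         # unit variance, mean shifted by 1
# kl_zero is 0 and kl_shifted is 0.5
```

Working with `log_var` rather than `sigma` keeps the variance positive without an explicit constraint, which is the parameterization the encoder equations above use.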

💡

The reparameterization trick is crucial. Without it, we can't backpropagate through the sampling operation. By writing $z = \mu + \sigma \odot \epsilon$, gradients flow through $\mu$ and $\sigma$, while $\epsilon$ is drawn independently of the parameters and treated as a constant during backpropagation.

VAE vs Standard Autoencoder

| Aspect | Autoencoder | VAE |
| --- | --- | --- |
| Latent space | Deterministic | Probabilistic |
| Generation | Cannot generate | Can sample and generate |
| Smoothness | No continuity guarantee | Continuous latent space |
| Training | Reconstruction only | Reconstruction + KL regularization |

Denoising Autoencoder

A denoising autoencoder corrupts the input with noise and is trained to reconstruct the clean version. Here $\sigma$ is a fixed noise level, unrelated to the encoder variance in the VAE:

$$\tilde{x} = x + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2 I)$$

$$\mathcal{L}_{DAE} = \|x - g_\theta(f_\phi(\tilde{x}))\|^2$$
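A minimal sketch of the corruption and loss computation follows. It assumes additive Gaussian noise as in the equations above; `reconstruct` is a hypothetical stand-in for the composed network $g_\theta(f_\phi(\cdot))$, and the function names are chosen for this example only.

```python
import random

def corrupt(x, sigma, rng):
    """Add independent Gaussian noise with standard deviation sigma to each entry."""
    return [xi + rng.gauss(0.0, sigma) for xi in x]

def dae_loss(x_clean, reconstruct, sigma, rng):
    """Squared error between the CLEAN input and the reconstruction of the noisy input."""
    x_noisy = corrupt(x_clean, sigma, rng)
    x_hat = reconstruct(x_noisy)
    return sum((a - b) ** 2 for a, b in zip(x_clean, x_hat))

rng = random.Random(0)
x = [0.2, 0.8, 0.5]
identity = lambda v: list(v)   # a model that simply copies its input
loss_identity = dae_loss(x, identity, sigma=0.3, rng=rng)
```

The identity map incurs a positive loss here because it reproduces the noise, which is why the denoising objective discourages trivial copying.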

Benefits:

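  • Learns features that are robust to small input perturbations
  • Prevents the identity function: copying the noisy input no longer minimizes the loss
  • Captures data manifold structure by learning to map corrupted points back toward the data
  • Conceptually related to diffusion models, which also train networks to remove noise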

Mathematical Analysis

Information Bottleneck Perspective

Autoencoders compress information:

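$$I(X; Z) = H(Z) - H(Z|X)$$

Regularization limits $I(X; Z)$: averaged over the data, the KL term in the VAE upper-bounds the mutual information between the input and the latent code, so penalizing it pushes the model to keep only the information most useful for reconstruction.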
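To make the mutual information formula concrete, here is a small sketch for discrete variables. It assumes a toy joint probability table `joint[x][z]`, which is a simplification: the latent codes discussed above are continuous, and this example only illustrates the identity $I(X;Z) = H(Z) - H(Z|X)$.

```python
import math

def entropy(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information(joint):
    """I(X;Z) = H(Z) - H(Z|X) for joint[x][z] = P(X=x, Z=z)."""
    p_x = [sum(row) for row in joint]
    p_z = [sum(col) for col in zip(*joint)]
    h_z = entropy(p_z)
    # H(Z|X) is the average entropy of Z within each row, weighted by P(X=x).
    h_z_given_x = sum(
        px * entropy([p / px for p in row])
        for px, row in zip(p_x, joint) if px > 0
    )
    return h_z - h_z_given_x

copy_code = [[0.5, 0.0], [0.0, 0.5]]        # Z is an exact copy of a fair bit X
useless_code = [[0.25, 0.25], [0.25, 0.25]]  # Z is independent of X
mi_copy = mutual_information(copy_code)        # 1 bit
mi_useless = mutual_information(useless_code)  # 0 bits
```

A latent code that copies the input carries maximal information, while one independent of the input carries none; regularization moves the model between these extremes.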

Manifold Learning

Under the manifold hypothesis, data is assumed to lie on (or near) a lower-dimensional manifold $\mathcal{M}$ embedded in $\mathbb{R}^n$:

$$\dim(\mathcal{M}) = d \ll n$$

The encoder maps inputs to $d$-dimensional coordinates, and the decoder maps those coordinates back onto the manifold:

$$f: \mathbb{R}^n \rightarrow \mathbb{R}^d \rightarrow \mathcal{M}$$

Advanced Autoencoder Variants

β-VAE

Adds a weight $\beta$ to the KL term to encourage disentangled representations:

$$\mathcal{L}_{\beta\text{-VAE}} = \mathbb{E}[\log p(x|z)] - \beta D_{KL}(q(z|x) \| p(z))$$

$\beta > 1$ encourages disentangled factors (each latent dimension captures one factor of variation), typically at the cost of reconstruction quality. Setting $\beta = 1$ recovers the standard VAE objective.
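The effect of $\beta$ on the training loss can be sketched as follows. This assumes a Gaussian decoder with unit variance, so the negative log-likelihood reduces to half the squared error up to a constant, and it evaluates a single sample rather than an expectation. The function name `beta_vae_loss` is made up for this example.

```python
import math

def beta_vae_loss(x, x_hat, mu, log_var, beta=1.0):
    """Negative beta-ELBO for one example (to be minimized)."""
    # Reconstruction term: -log p(x|z) up to an additive constant,
    # assuming a unit-variance Gaussian decoder.
    recon = 0.5 * sum((a - b) ** 2 for a, b in zip(x, x_hat))
    # Closed-form KL between the diagonal Gaussian posterior and N(0, I).
    kl = -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var))
    return recon + beta * kl

x, x_hat = [1.0, 0.0], [0.0, 0.0]
mu, log_var = [1.0], [0.0]
loss_vae = beta_vae_loss(x, x_hat, mu, log_var, beta=1.0)   # 0.5 + 1 * 0.5
loss_beta = beta_vae_loss(x, x_hat, mu, log_var, beta=4.0)  # 0.5 + 4 * 0.5
```

With a larger `beta`, the same KL value costs more, so the optimizer is pushed to keep the posterior closer to the prior even if reconstructions get worse.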

Vector Quantized VAE (VQ-VAE)

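Uses a discrete latent space with a learned codebook $\mathcal{E}$. The encoder output $z_e$ is replaced by its nearest codebook vector:

$$z_q = \arg\min_{e_k \in \mathcal{E}} \|z_e - e_k\|^2$$

$$\mathcal{L}_{VQ} = \|sg[z_e] - e\|^2 + \beta\|z_e - sg[e]\|^2$$

where $sg$ is the stop-gradient operator and $e$ is the selected codebook vector $z_q$. The first term moves the codebook vector toward the encoder output; the second (commitment) term keeps the encoder output close to its codebook vector. $\beta$ here is the commitment weight, separate from the $\beta$ of β-VAE, and these terms are added to the reconstruction loss. Used in VQGAN; the original DALL-E uses a closely related discrete VAE.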
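The nearest-neighbor lookup and the forward value of the quantization loss can be sketched in plain Python. The codebook values and the default `beta=0.25` are assumptions for illustration, and the function names are made up. The stop-gradient only changes which parameters receive gradients, so both terms have the same numerical value in the forward pass.

```python
def quantize(z_e, codebook):
    """Return (index, vector) of the codebook entry nearest to z_e."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    k = min(range(len(codebook)), key=lambda i: sq_dist(z_e, codebook[i]))
    return k, codebook[k]

def vq_loss_value(z_e, e, beta=0.25):
    """Forward value of the codebook term plus the beta-weighted commitment term."""
    d = sum((a - b) ** 2 for a, b in zip(z_e, e))
    # With autodiff, the codebook term would update only e and the
    # commitment term only z_e; numerically both equal d here.
    return d + beta * d

codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]   # toy codebook with 3 entries
z_e = [0.9, 1.2]
k, z_q = quantize(z_e, codebook)                   # nearest entry is index 1
loss_vq = vq_loss_value(z_e, z_q)
```

The decoder would then receive `z_q` instead of `z_e`, so every input is represented by one of a finite set of codes.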

Hierarchical VAE (NVAE)

Stacks VAE layers with residual connections:

Architecture Diagram
x → VAE1(z1) → VAE2(z2) → ... → VAEk(zk)

Each level captures different scales of variation.

Representation Learning

Autoencoders learn useful representations without labels:

Self-supervised pre-training:

  1. Train autoencoder on unlabeled data
  2. Use encoder for downstream tasks
  3. Fine-tune with small labeled dataset

Applications:

  • Anomaly detection: High reconstruction error → anomaly
  • Data augmentation: Generate similar samples
  • Dimensionality reduction: Use latent space as features
  • Image inpainting: Fill in missing parts
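The anomaly detection use case can be sketched as a threshold on reconstruction error. Here `project` is a hypothetical stand-in for a trained autoencoder that has learned that normal data lies near the first axis, and the threshold value is an arbitrary assumption; in practice it would be chosen from the error distribution on held-out normal data.

```python
def reconstruction_error(x, reconstruct):
    """Squared error between x and its reconstruction."""
    x_hat = reconstruct(x)
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))

def flag_anomalies(samples, reconstruct, threshold):
    """Mark each sample whose reconstruction error exceeds the threshold."""
    return [reconstruction_error(x, reconstruct) > threshold for x in samples]

# Hypothetical "trained" model: keeps the first coordinate, discards the second.
project = lambda x: [x[0], 0.0]

samples = [[1.0, 0.05], [-2.0, -0.1], [0.5, 3.0]]
flags = flag_anomalies(samples, project, threshold=0.5)
# flags == [False, False, True]
```

The third sample lies far from the structure the model can represent, so it reconstructs poorly and is flagged.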

Follow-Up Questions

Q: Why is the reparameterization trick necessary in VAE? A: Without it, we can't backpropagate through the sampling operation. The trick makes the sampling differentiable by treating noise as an input rather than a random operation.

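Q: How do VAEs differ from GANs? A: VAEs maximize a tractable lower bound on the log-likelihood, which gives stable training but often blurry samples. GANs are trained adversarially, which implicitly minimizes a divergence between the data and generator distributions; samples are typically sharper, but training can be unstable. VAEs include an encoder for inference; a standard GAN only generates.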

Q: What is the role of KL divergence in VAE? A: It regularizes the latent space to be close to a standard normal prior, enabling smooth interpolation and generation.
