KL Divergence
Why It Matters
KL divergence (Kullback-Leibler divergence) measures how one probability distribution differs from another. It is not a true distance metric (it's asymmetric), but it is the fundamental building block of variational autoencoders (VAEs), the expectation-maximization algorithm, and many other ML methods. When you see "minimize KL divergence" in a paper, this is what they mean.
Historical Context
Kullback and Leibler
Solomon Kullback and Richard Leibler introduced this divergence in 1951. It's also known as relative entropy or information gain. It measures the "extra" information needed when using distribution $Q$ to approximate distribution $P$.
Core Definitions
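Definition: KL Divergence
The Kullback-Leibler divergence from distribution $Q$ to distribution $P$ is:
$$D_{\mathrm{KL}}(P \| Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$$
for discrete distributions, or:
$$D_{\mathrm{KL}}(P \| Q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx$$
for continuous distributions.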
Definition: Forward vs Reverse KL
- Forward KL $D_{\mathrm{KL}}(P \| Q)$: Averages over $P$. Sensitive where $P$ has high probability.
- Reverse KL $D_{\mathrm{KL}}(Q \| P)$: Averages over $Q$. Sensitive where $Q$ has high probability.
The choice between forward and reverse KL has profound implications for approximation quality.
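Definition: f-Divergence
KL divergence is a special case of the f-divergence family:
$$D_f(P \| Q) = \sum_{x} Q(x) \, f\!\left(\frac{P(x)}{Q(x)}\right)$$
For KL divergence, $f(t) = t \log t$ (forward) or $f(t) = -\log t$ (reverse).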
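As a quick numerical check of the two generator choices, the sketch below evaluates the f-divergence sum $\sum_x Q(x) f(P(x)/Q(x))$ for a pair of small discrete distributions. The helper names (`kl`, `f_divergence`) and the example distributions are illustrative assumptions, and both distributions are assumed strictly positive so that every ratio is defined.

```python
import math

def kl(p, q):
    """D_KL(P || Q) in nats for strictly positive discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def f_divergence(p, q, f):
    """sum_x Q(x) * f(P(x) / Q(x)), assuming Q(x) > 0 everywhere."""
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))

# Illustrative distributions (assumed for this sketch)
p = [0.7, 0.2, 0.1]
q = [0.4, 0.4, 0.2]

forward = f_divergence(p, q, lambda t: t * math.log(t))  # generator t * log(t)
reverse = f_divergence(p, q, lambda t: -math.log(t))     # generator -log(t)

print(f"f(t) = t log t -> {forward:.6f}, D_KL(P||Q) = {kl(p, q):.6f}")
print(f"f(t) = -log t  -> {reverse:.6f}, D_KL(Q||P) = {kl(q, p):.6f}")
```

Each printed pair should agree up to floating-point error, since $Q \cdot (P/Q)\log(P/Q) = P \log(P/Q)$ and $-Q \log(P/Q) = Q \log(Q/P)$.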
Key Formulas
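KL Divergence (Discrete)
$$D_{\mathrm{KL}}(P \| Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$$
Here,
- $D_{\mathrm{KL}}(P \| Q)$ = KL divergence from Q to P
- $P$ = True distribution
- $Q$ = Approximate distribution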
KL Divergence (Continuous)
$$D_{\mathrm{KL}}(P \| Q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx$$
Here,
- $p(x)$ = True density
- $q(x)$ = Approximate density
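KL Divergence of Gaussians
$$D_{\mathrm{KL}}\big(\mathcal{N}(\mu_0, \sigma_0^2) \,\|\, \mathcal{N}(\mu_1, \sigma_1^2)\big) = \log \frac{\sigma_1}{\sigma_0} + \frac{\sigma_0^2 + (\mu_0 - \mu_1)^2}{2 \sigma_1^2} - \frac{1}{2}$$
Here,
- $\mu_0, \sigma_0$ = Mean and std of true distribution
- $\mu_1, \sigma_1$ = Mean and std of approximate distribution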
Relation to Cross-Entropy and Entropy
$$D_{\mathrm{KL}}(P \| Q) = H(P, Q) - H(P)$$
Here,
- $H(P, Q)$ = Cross-entropy between P and Q
- $H(P)$ = Entropy of P
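The identity "KL = cross-entropy minus entropy" can be checked directly on a small example. The following is a minimal sketch using only the standard library; the function names are illustrative, and the two distributions are the ones used in the Python examples later in this document.

```python
import math

def entropy(p):
    """H(P) = -sum_x P(x) log P(x), in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(P, Q) = -sum_x P(x) log Q(x), in nats (assumes Q(x) > 0 where P(x) > 0)."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    """D_KL(P || Q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.9, 0.1]
q = [0.5, 0.5]

gap = cross_entropy(p, q) - entropy(p)
print(f"H(P,Q) - H(P) = {gap:.6f}")
print(f"D_KL(P||Q)    = {kl(p, q):.6f}")
```

Because the gap equals a KL divergence, it is never negative, which is the same statement as "cross-entropy is at least the entropy of $P$".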
Relation to Mutual Information
$$I(X; Y) = D_{\mathrm{KL}}\big(P(X, Y) \,\|\, P(X) P(Y)\big)$$
Here,
- $I(X; Y)$ = Mutual information
- $P(X) P(Y)$ = Product of marginals (independence)
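To make the mutual-information relation concrete, the sketch below computes the KL divergence between a small joint table and the product of its marginals, then compares it with the entropy form $H(X) + H(Y) - H(X, Y)$. The joint tables and the function name `mutual_information` are assumptions made for illustration.

```python
import math

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def mutual_information(joint):
    """I(X; Y) as D_KL(P(X, Y) || P(X) P(Y)); joint[i][j] = P(X = i, Y = j)."""
    px = [sum(row) for row in joint]          # marginal of X
    py = [sum(col) for col in zip(*joint)]    # marginal of Y
    return sum(
        pxy * math.log(pxy / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0
    )

# Illustrative correlated joint over two binary variables
joint = [[0.4, 0.1],
         [0.1, 0.4]]
mi = mutual_information(joint)

# Same quantity from entropies
px = [sum(row) for row in joint]
py = [sum(col) for col in zip(*joint)]
mi_from_entropies = entropy(px) + entropy(py) - entropy([v for row in joint for v in row])

# An independent joint (outer product of marginals) should give zero
independent = [[0.06, 0.14],
               [0.24, 0.56]]
mi_independent = mutual_information(independent)

print(f"I(X;Y) via KL        = {mi:.6f}")
print(f"I(X;Y) via entropies = {mi_from_entropies:.6f}")
```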
Properties and Theorems
Theorem: Non-negativity (Gibbs' Inequality)
$D_{\mathrm{KL}}(P \| Q) \geq 0$ for all distributions $P$ and $Q$. Equality holds if and only if $P = Q$ almost everywhere.
Theorem: Asymmetry
$D_{\mathrm{KL}}(P \| Q) \neq D_{\mathrm{KL}}(Q \| P)$ in general. KL divergence is NOT a metric or distance: it fails the symmetry and triangle inequality properties.
Theorem: Forward KL is Mean-Seeking
When minimizing $D_{\mathrm{KL}}(P \| Q)$ w.r.t. $Q$, the optimal $Q$ tends to cover all modes of $P$. This is because the expectation is over $P$, so $Q$ must assign non-zero probability wherever $P$ does.
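Theorem: Reverse KL is Mode-Seeking
When minimizing $D_{\mathrm{KL}}(Q \| P)$ w.r.t. $Q$, the optimal $Q$ tends to concentrate on a single mode of $P$. This is because the expectation is over $Q$: regions where $Q$ places little mass contribute almost nothing, so $Q$ can "ignore" entire modes of $P$, while it is heavily penalized for placing mass where $P$ is close to zero.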
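Theorem: Chain Rule for KL Divergence
$$D_{\mathrm{KL}}\big(P(X, Y) \,\|\, Q(X, Y)\big) = D_{\mathrm{KL}}\big(P(X) \,\|\, Q(X)\big) + \mathbb{E}_{x \sim P(X)}\Big[ D_{\mathrm{KL}}\big(P(Y \mid x) \,\|\, Q(Y \mid x)\big) \Big]$$
The joint KL decomposes into marginal KL plus conditional KL.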
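The decomposition can be verified on a pair of 2x2 joint tables: compute the joint KL directly, then compare it with the KL between the $X$-marginals plus the $P(X)$-weighted average of the KLs between the conditionals of $Y$ given $X$. The tables and helper names below are illustrative assumptions.

```python
import math

def kl(p, q):
    """D_KL(P || Q) in nats for discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Illustrative joints over (x, y); rows index x, columns index y
P = [[0.30, 0.20],
     [0.10, 0.40]]
Q = [[0.25, 0.25],
     [0.30, 0.20]]

def flatten(joint):
    return [v for row in joint for v in row]

def marginal_x(joint):
    return [sum(row) for row in joint]

def conditional_y_given_x(joint, x):
    total = sum(joint[x])
    return [v / total for v in joint[x]]

joint_kl = kl(flatten(P), flatten(Q))
marginal_kl = kl(marginal_x(P), marginal_x(Q))

# Conditional term: KL between P(Y|x) and Q(Y|x), averaged over x ~ P(X)
conditional_kl = sum(
    px * kl(conditional_y_given_x(P, x), conditional_y_given_x(Q, x))
    for x, px in enumerate(marginal_x(P))
)

print(f"joint KL               = {joint_kl:.6f}")
print(f"marginal + conditional = {marginal_kl + conditional_kl:.6f}")
```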
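Theorem: Data Processing Inequality
If $X \to Y$ is a Markov chain (that is, $Y$ is produced from $X$ by a fixed channel) and $P_X, Q_X$ are distributions over $X$ with induced output distributions $P_Y, Q_Y$, then:
$$D_{\mathrm{KL}}(P_Y \| Q_Y) \leq D_{\mathrm{KL}}(P_X \| Q_X)$$
Post-processing cannot increase KL divergence.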
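A small numerical illustration of the data processing inequality: two input distributions are pushed through the same noisy channel (a row-stochastic matrix), and the KL divergence between the outputs is compared with the KL divergence between the inputs. The channel, the distributions, and the name `push_through` are hypothetical choices for this sketch.

```python
import math

def kl(p, q):
    """D_KL(P || Q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def push_through(dist, channel):
    """Distribution of Y when X ~ dist and channel[x][y] = P(Y = y | X = x)."""
    n_out = len(channel[0])
    return [sum(dist[x] * channel[x][y] for x in range(len(dist)))
            for y in range(n_out)]

# Illustrative input distributions over three symbols
p_x = [0.6, 0.3, 0.1]
q_x = [0.2, 0.3, 0.5]

# Illustrative noisy channel from three inputs to two outputs; rows sum to 1
channel = [[0.8, 0.2],
           [0.5, 0.5],
           [0.1, 0.9]]

p_y = push_through(p_x, channel)
q_y = push_through(q_x, channel)

kl_before = kl(p_x, q_x)
kl_after = kl(p_y, q_y)
print(f"KL before channel: {kl_before:.4f}")
print(f"KL after channel:  {kl_after:.4f}")
```

The same channel must be applied to both distributions; the inequality says nothing about outputs produced by different channels.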
Worked Examples
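Example 1: Basic KL Calculation
Problem: $P = (0.9, 0.1)$, $Q = (0.5, 0.5)$ (the distributions used in the Python examples below). Compute $D_{\mathrm{KL}}(P \| Q)$ and $D_{\mathrm{KL}}(Q \| P)$.
Solution: Basic KL
$$D_{\mathrm{KL}}(P \| Q) = 0.9 \log \frac{0.9}{0.5} + 0.1 \log \frac{0.1}{0.5} \approx 0.5290 - 0.1609 = 0.3681 \text{ nats}$$
$$D_{\mathrm{KL}}(Q \| P) = 0.5 \log \frac{0.5}{0.9} + 0.5 \log \frac{0.5}{0.1} \approx -0.2939 + 0.8047 = 0.5108 \text{ nats}$$
Note: $D_{\mathrm{KL}}(P \| Q) \neq D_{\mathrm{KL}}(Q \| P)$, which is the asymmetry in action.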
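Example 2: KL Between Gaussians
Problem: Compute $D_{\mathrm{KL}}\big(\mathcal{N}(0, 1^2) \,\|\, \mathcal{N}(0, 2^2)\big)$, using the standard deviations $1$ and $2$ from the Python examples below.
Solution: Gaussian KL
Using the formula with $\mu_0 = 0$, $\sigma_0 = 1$, $\mu_1 = 0$, $\sigma_1 = 2$:
$$D_{\mathrm{KL}} = \log \frac{2}{1} + \frac{1^2 + (0 - 0)^2}{2 \cdot 2^2} - \frac{1}{2} \approx 0.6931 + 0.1250 - 0.5 = 0.3181 \text{ nats}$$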
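Example 3: KL Divergence in VAE
Problem: In a VAE, the encoder outputs $q(z \mid x) = \mathcal{N}(\mu, \operatorname{diag}(\sigma^2))$ and the prior is $p(z) = \mathcal{N}(0, I)$. If $\mu = (0.5, -0.3)$ and $\log \sigma^2 = (-0.5, -1.0)$ (the values used in the Python examples below), compute the KL term.
Solution: VAE KL Term
$$D_{\mathrm{KL}}\big(q(z \mid x) \,\|\, p(z)\big) = -\frac{1}{2} \sum_{j} \left(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\right)$$
$$= -\frac{1}{2} \big[(1 - 0.5 - 0.25 - 0.6065) + (1 - 1.0 - 0.09 - 0.3679)\big] = -\frac{1}{2}(-0.3565 - 0.4579) \approx 0.4072 \text{ nats}$$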
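Example 4: Reverse KL Mode-Seeking
Problem: $P$ is a mixture of two well-separated Gaussians. If $Q$ is a single Gaussian $\mathcal{N}(\mu, \sigma^2)$, what happens when minimizing $D_{\mathrm{KL}}(Q \| P)$ vs $D_{\mathrm{KL}}(P \| Q)$ over $\mu$ and $\sigma$?
Solution: Mode-Seeking vs Mean-Seeking
- Minimizing $D_{\mathrm{KL}}(Q \| P)$: $Q$ will concentrate on one mode, ignoring the other. This is mode-seeking behavior.
- Minimizing $D_{\mathrm{KL}}(P \| Q)$: $Q$ will spread out to cover both modes, becoming wider. This is mean-seeking behavior.
VAEs use $D_{\mathrm{KL}}(q(z \mid x) \| p(z))$, with the learned posterior $q$ as the first argument. This term pulls the approximate posterior toward the prior and regularizes the latent space; weighted too heavily, it can contribute to posterior collapse rather than prevent it.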
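The mode-seeking versus mean-seeking contrast can be reproduced with a coarse grid search. The sketch below assumes a specific bimodal target, $0.5\,\mathcal{N}(-3, 1) + 0.5\,\mathcal{N}(3, 1)$, approximates both KL integrals with a Riemann sum, and picks the best single Gaussian $\mathcal{N}(\mu, \sigma^2)$ on a small parameter grid for each direction. The target, the grids, and all function names are assumptions chosen for illustration, not values from the example above.

```python
import math

def log_normal_pdf(x, mu, sigma):
    return -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

def log_mixture_pdf(x):
    """log of 0.5*N(-3, 1) + 0.5*N(3, 1), an assumed bimodal target P."""
    a = math.log(0.5) + log_normal_pdf(x, -3.0, 1.0)
    b = math.log(0.5) + log_normal_pdf(x, 3.0, 1.0)
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))  # stable log-sum-exp

# Integration grid for a simple Riemann sum over [-20, 20]
DX = 0.05
XS = [-20.0 + DX * i for i in range(801)]
LOG_P = [log_mixture_pdf(x) for x in XS]

def forward_kl(mu, sigma):
    """D_KL(P || Q): the integrand is weighted by the target P."""
    return DX * sum(math.exp(lp) * (lp - log_normal_pdf(x, mu, sigma))
                    for x, lp in zip(XS, LOG_P))

def reverse_kl(mu, sigma):
    """D_KL(Q || P): the integrand is weighted by the approximation Q."""
    total = 0.0
    for x, lp in zip(XS, LOG_P):
        lq = log_normal_pdf(x, mu, sigma)
        total += math.exp(lq) * (lq - lp)
    return DX * total

# Coarse parameter grid: mu in [-5, 5] and sigma in [0.5, 4.0], both in steps of 0.5
grid = [(-5.0 + 0.5 * i, 0.5 + 0.5 * j) for i in range(21) for j in range(8)]
mu_fwd, sigma_fwd = min(grid, key=lambda ms: forward_kl(*ms))
mu_rev, sigma_rev = min(grid, key=lambda ms: reverse_kl(*ms))

print(f"forward KL minimizer: mu={mu_fwd:+.1f}, sigma={sigma_fwd:.1f}")
print(f"reverse KL minimizer: mu={mu_rev:+.1f}, sigma={sigma_rev:.1f}")
```

Under these assumptions the forward-KL fit should sit between the modes with a large standard deviation (close to the mixture's overall mean and spread), while the reverse-KL fit should lock onto one of the two modes with a standard deviation near 1. Which mode wins is arbitrary, since the target is symmetric.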
Python Implementation
```python
import numpy as np


def kl_divergence_discrete(p, q):
    """Compute KL divergence D_KL(P || Q) in nats for discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    # Terms with p == 0 contribute 0, so drop them to avoid log(0)
    mask = p > 0
    p, q = p[mask], q[mask]
    # Clamp q away from zero to avoid division by zero. The true KL is
    # infinite when q == 0 and p > 0; this returns a large finite value.
    q = np.where(q > 0, q, 1e-10)
    return np.sum(p * np.log(p / q))


def kl_divergence_gaussian(mu0, sigma0, mu1, sigma1):
    """Compute D_KL(N(mu0, sigma0^2) || N(mu1, sigma1^2)); sigmas are standard deviations."""
    return (np.log(sigma1 / sigma0)
            + (sigma0**2 + (mu0 - mu1)**2) / (2 * sigma1**2) - 0.5)


def reverse_kl_gaussian(mu0, sigma0, mu1, sigma1):
    """Compute D_KL(N(mu1, sigma1^2) || N(mu0, sigma0^2))."""
    return kl_divergence_gaussian(mu1, sigma1, mu0, sigma0)


def vae_kl_term(mu, log_var):
    """Compute KL divergence for VAE: D_KL(q(z|x) || N(0, I))."""
    # q(z|x) = N(mu, diag(exp(log_var)))
    # p(z) = N(0, I)
    return -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var))


# --- Examples ---
# Discrete distributions
p = [0.9, 0.1]
q = [0.5, 0.5]
print(f"KL(P||Q): {kl_divergence_discrete(p, q):.4f}")
print(f"KL(Q||P): {kl_divergence_discrete(q, p):.4f}")

# Gaussian KL (the second parameter of each Gaussian is a standard deviation)
print(f"KL(N(0,1^2)||N(0,2^2)): {kl_divergence_gaussian(0, 1, 0, 2):.4f}")
print(f"KL(N(0,2^2)||N(0,1^2)): {kl_divergence_gaussian(0, 2, 0, 1):.4f}")

# VAE KL term
mu = np.array([0.5, -0.3])
log_var = np.array([-0.5, -1.0])
print(f"VAE KL term: {vae_kl_term(mu, log_var):.4f}")
```
Applications in AI/ML
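Variational Autoencoders (VAEs)
The VAE loss is:
$$\mathcal{L} = -\mathbb{E}_{q(z \mid x)}\big[\log p(x \mid z)\big] + D_{\mathrm{KL}}\big(q(z \mid x) \,\|\, p(z)\big)$$
The KL term regularizes the encoder to stay close to the prior $p(z)$. This enables meaningful latent space interpolation and generation.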
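Expectation-Maximization
EM alternates between:
- E-step: Compute $q(z) = p(z \mid x, \theta^{\text{old}})$
- M-step: Maximize $\mathbb{E}_{q(z)}\big[\log p(x, z \mid \theta)\big]$ over $\theta$
This is equivalent to maximizing a lower bound on the log-likelihood.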
Knowledge Distillation
In distillation, the student minimizes $D_{\mathrm{KL}}(P_T \| P_S)$ where $P_T$ is the teacher's soft output and $P_S$ is the student's. Using forward KL encourages the student to cover all modes of the teacher's knowledge.
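A minimal sketch of this loss for a single example is shown below. It assumes the soft outputs come from a temperature-scaled softmax over logits, which is a common setup but is an assumption here, as are the function names and the logit values.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_kl(teacher_logits, student_logits, temperature=2.0):
    """D_KL(teacher || student) on temperature-softened outputs, in nats."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return sum(t * math.log(t / s)
               for t, s in zip(p_teacher, p_student) if t > 0)

teacher = [4.0, 1.0, 0.5]   # hypothetical teacher logits
student = [2.5, 1.5, 0.2]   # hypothetical student logits

loss = distillation_kl(teacher, student)
print(f"distillation KL: {loss:.4f}")
```

The teacher distribution is the first argument, so the expectation is taken over the teacher's probabilities: classes the teacher considers plausible dominate the loss.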
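Distribution Matching
- GANs: With an optimal discriminator, the original GAN objective minimizes the Jensen-Shannon divergence between $P_{\text{data}}$ and $P_G$, which is itself built from two KL terms
- Flow models: Maximize likelihood, equivalent to minimizing $D_{\mathrm{KL}}(P_{\text{data}} \| P_\theta)$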
- Domain adaptation: Minimize feature distribution shift between source and target domains
Common Mistakes
| Mistake | Why It's Wrong | Correct Approach |
|---|---|---|
| Treating KL as a distance metric | KL is asymmetric and doesn't satisfy triangle inequality | Use symmetric alternatives like Jensen-Shannon divergence |
| Using KL when $Q(x) = 0$ but $P(x) > 0$ | Division by zero; the divergence is infinite | Smooth $Q$ with a small epsilon; filter only terms where $P(x) = 0$ |
| Assuming forward and reverse KL give same result | They optimize different things | Choose based on whether you need mode-seeking or mean-seeking |
| Forgetting the non-negativity property | KL is always ≥ 0 | If you get negative KL, check your implementation |
| Using natural log vs log base 2 carelessly | Units differ (nats vs bits) | Be consistent; VAEs typically use nats |
Interview Questions
Q1: Why is KL divergence asymmetric? A: $D_{\mathrm{KL}}(P \| Q)$ averages over $P$, while $D_{\mathrm{KL}}(Q \| P)$ averages over $Q$. These weight different regions of the space differently, leading to different values.
Q2: What's the difference between forward and reverse KL? A: Forward KL is mean-seeking ($Q$ spreads to cover all of $P$). Reverse KL is mode-seeking ($Q$ concentrates on one mode of $P$). The VAE regularizer $D_{\mathrm{KL}}(q(z \mid x) \| p(z))$ averages over the learned distribution $q$, so it has the reverse-KL form.
Q3: Can KL divergence be infinite? A: Yes. If $P$ has support where $Q$ doesn't ($P(x) > 0$ but $Q(x) = 0$), the KL divergence $D_{\mathrm{KL}}(P \| Q)$ is infinite. This is why it's important to ensure $Q$ has support at least as wide as $P$.
Q4: Why not just use Euclidean distance between distributions? A: Euclidean distance ignores the probability structure. Two distributions can be close in Euclidean distance but have very different shapes. KL divergence accounts for the actual probability values.
Q5: How is KL divergence used in VAEs? A: The VAE loss includes $D_{\mathrm{KL}}(q(z \mid x) \| p(z))$ where $p(z) = \mathcal{N}(0, I)$. This regularizes the latent space, ensuring the encoder produces distributions close to the prior, enabling smooth interpolation and generation.
Practice Problems
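Problem 1: Symmetric KL
Problem: Under what condition is $D_{\mathrm{KL}}(P \| Q) = D_{\mathrm{KL}}(Q \| P)$?
Solution: Symmetric KL
This holds when $P = Q$ (both are zero), or in special cases where the distributions are "symmetric" in a specific sense. For example, two Gaussians with equal variance $\sigma^2$ give $(\mu_0 - \mu_1)^2 / (2 \sigma^2)$ in both directions. In general, KL is asymmetric, so equality is rare.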
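Problem 2: KL with Shifted Distribution
Problem: Compute $D_{\mathrm{KL}}\big(\mathcal{N}(\mu, \sigma^2) \,\|\, \mathcal{N}(\mu + \delta, \sigma^2)\big)$ for a shift $\delta$.
Solution: Shifted Gaussian KL
With equal standard deviations, the Gaussian KL formula reduces to:
$$D_{\mathrm{KL}} = \log \frac{\sigma}{\sigma} + \frac{\sigma^2 + \delta^2}{2 \sigma^2} - \frac{1}{2} = \frac{\delta^2}{2 \sigma^2}$$
For a unit shift between unit-variance Gaussians ($\delta = 1$, $\sigma = 1$), this gives $0.5$ nats.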
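Problem 3: KL Lower Bound
Problem: Show that $D_{\mathrm{KL}}(P \| Q) \geq 0$ implies $H(P, Q) \geq H(P)$.
Solution: KL Lower Bound
Since $D_{\mathrm{KL}}(P \| Q) = H(P, Q) - H(P) \geq 0$, rearranging gives $H(P, Q) \geq H(P)$. Cross-entropy is always at least as large as the entropy of the true distribution.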
Variants and Related Divergences
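Definition: Jensen-Shannon Divergence
A symmetric, bounded alternative to KL:
$$\mathrm{JSD}(P \| Q) = \frac{1}{2} D_{\mathrm{KL}}(P \| M) + \frac{1}{2} D_{\mathrm{KL}}(Q \| M)$$
where $M = \frac{1}{2}(P + Q)$. JSD is bounded: $0 \leq \mathrm{JSD}(P \| Q) \leq \log 2$ nats.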
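The symmetry and the $\log 2$ bound are easy to confirm numerically. The sketch below builds the Jensen-Shannon divergence from two KL terms against the midpoint distribution; the function names are illustrative, and the first pair of distributions is the one used in the Python examples earlier in this document.

```python
import math

def kl(p, q):
    """D_KL(P || Q) in nats; terms with P(x) = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence in nats, via the midpoint M = (P + Q) / 2."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = [0.9, 0.1]
q = [0.5, 0.5]

# Distributions with no shared support reach the upper bound log 2,
# a case where either direction of KL would be infinite
disjoint = jsd([1.0, 0.0], [0.0, 1.0])

print(f"JSD(P, Q) = {jsd(p, q):.6f}")
print(f"JSD(Q, P) = {jsd(q, p):.6f}")
print(f"disjoint  = {disjoint:.6f}, log 2 = {math.log(2):.6f}")
```

The midpoint $M$ is positive wherever $P$ or $Q$ is, so neither KL term can divide by zero.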
Definition: Reverse KL in Practice
Reverse KL is used when you want $Q$ to concentrate on a single mode of $P$. Applications include variational inference with unimodal approximations to multimodal posteriors, and conservative models that avoid low-density regions of $P$.
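KL in EM Algorithm
The Expectation-Maximization algorithm maximizes a lower bound on log-likelihood:
$$\log p(x \mid \theta) = \mathbb{E}_{q(z)}\left[\log \frac{p(x, z \mid \theta)}{q(z)}\right] + D_{\mathrm{KL}}\big(q(z) \,\|\, p(z \mid x, \theta)\big)$$
The first term is the lower bound. The E-step sets $q(z) = p(z \mid x, \theta)$ (making KL = 0), and the M-step maximizes the expected complete-data log-likelihood.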
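The decomposition of the log-likelihood into a lower bound plus a KL term can be checked on a toy model with one binary latent variable and a single fixed observation. The prior and likelihood values below are hypothetical, as are the function names.

```python
import math

prior = [0.6, 0.4]        # p(z), hypothetical
likelihood = [0.2, 0.7]   # p(x | z) for the single observed x, hypothetical

joint = [pz * px for pz, px in zip(prior, likelihood)]   # p(x, z)
evidence = sum(joint)                                    # p(x)
posterior = [j / evidence for j in joint]                # p(z | x)

def lower_bound(q):
    """E_q[log p(x, z) - log q(z)], the bound that EM maximizes."""
    return sum(qz * math.log(jz / qz) for qz, jz in zip(q, joint) if qz > 0)

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

q_arbitrary = [0.5, 0.5]
gap = kl(q_arbitrary, posterior)

print(f"log p(x)              = {math.log(evidence):.6f}")
print(f"bound + KL (any q)    = {lower_bound(q_arbitrary) + gap:.6f}")
print(f"bound at q = posterior = {lower_bound(posterior):.6f}")
```

For any choice of $q$ the bound and the KL gap add up to $\log p(x)$; choosing $q$ equal to the posterior closes the gap, which is what the E-step does.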
KL in GANs and Flows
- GANs: The original GAN minimizes Jensen-Shannon divergence (related to KL). WGAN uses Wasserstein distance instead.
- Flow models: Maximum likelihood training minimizes $D_{\mathrm{KL}}(P_{\text{data}} \| P_\theta)$ implicitly through log-likelihood maximization.
- Domain adaptation: Minimizes feature distribution shift by reducing a divergence such as $D_{\mathrm{KL}}(P_{\text{source}} \| P_{\text{target}})$.
- Reinforcement learning: KL constraints prevent policy updates from being too large (TRPO, PPO).
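KL and Maximum Likelihood
Minimizing $D_{\mathrm{KL}}(P_{\text{data}} \| P_\theta)$ w.r.t. $\theta$ is equivalent to maximizing $\mathbb{E}_{x \sim P_{\text{data}}}[\log P_\theta(x)]$, which is exactly maximum likelihood estimation. The two objectives differ only by the entropy of $P_{\text{data}}$, which does not depend on $\theta$. This is why many generative models (flows, autoregressive models) maximize log-likelihood.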
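The equivalence can be seen on a Bernoulli model: scanning the parameter over a grid, the value that minimizes the KL divergence from the data distribution is the same value that maximizes the expected log-likelihood. The data distribution, the grid, and the function names in this sketch are assumptions chosen for illustration.

```python
import math

p_data = {0: 0.3, 1: 0.7}   # hypothetical empirical distribution over {0, 1}

def model(theta):
    """Bernoulli model with P(x = 1) = theta."""
    return {0: 1.0 - theta, 1: theta}

def kl_to_model(theta):
    """D_KL(P_data || P_theta) in nats."""
    q = model(theta)
    return sum(p * math.log(p / q[x]) for x, p in p_data.items() if p > 0)

def expected_log_likelihood(theta):
    """E_{x ~ P_data}[log P_theta(x)]."""
    q = model(theta)
    return sum(p * math.log(q[x]) for x, p in p_data.items())

thetas = [i / 100 for i in range(1, 100)]   # 0.01, 0.02, ..., 0.99
theta_min_kl = min(thetas, key=kl_to_model)
theta_max_ll = max(thetas, key=expected_log_likelihood)

print(f"argmin KL:             {theta_min_kl:.2f}")
print(f"argmax log-likelihood: {theta_max_ll:.2f}")
```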
Quick Reference
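| Quantity | Formula | Key Property |
|---|---|---|
| Forward KL | $D_{\mathrm{KL}}(P \Vert Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$ | Mean-seeking |
| Reverse KL | $D_{\mathrm{KL}}(Q \Vert P) = \sum_x Q(x) \log \frac{Q(x)}{P(x)}$ | Mode-seeking |
| Gaussian KL | $\log \frac{\sigma_1}{\sigma_0} + \frac{\sigma_0^2 + (\mu_0 - \mu_1)^2}{2 \sigma_1^2} - \frac{1}{2}$ | Closed form |
| Relation to CE | $D_{\mathrm{KL}}(P \Vert Q) = H(P, Q) - H(P)$ | Non-negative |
| VAE KL | $-\frac{1}{2} \sum_j \left(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\right)$ | Closed form for diagonal covariance |
| JSD | $\frac{1}{2} D_{\mathrm{KL}}(P \Vert M) + \frac{1}{2} D_{\mathrm{KL}}(Q \Vert M)$, with $M = \frac{1}{2}(P + Q)$ | Symmetric, bounded |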
Cross-References
- 081 - Entropy: $D_{\mathrm{KL}}(P \| Q) = H(P, Q) - H(P)$. KL is the difference between cross-entropy and entropy.
- 082 - Mutual Information: $I(X; Y) = D_{\mathrm{KL}}(P(X, Y) \| P(X) P(Y))$. MI is a special case of KL divergence.
- 084 - Cross-Entropy: Cross-entropy loss = entropy + KL divergence. Minimizing CE is equivalent to minimizing KL when entropy is fixed.
- 085 - Applications: VAEs use KL regularization, EM algorithm uses KL in E-step, distillation uses KL to match teacher.
Common Pitfalls in Implementation
Numerical Stability
When implementing KL divergence, always:
- Filter out zero probabilities before computing $\log$
- Use `np.where` or masking to avoid division by zero
- Add small epsilon (e.g., $10^{-10}$) to denominators
- Consider using log-space computations for numerical stability
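One way to follow the log-space advice is to work from unnormalized logits and never form a probability ratio. The sketch below is one possible approach under that assumption; `log_softmax` and `kl_from_logits` are illustrative names, and the extreme logits are chosen so that the ordinary probabilities underflow.

```python
import math

def log_softmax(logits):
    """Log-probabilities from logits using the log-sum-exp trick."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]

def kl_from_logits(p_logits, q_logits):
    """D_KL(P || Q) in nats, with P = softmax(p_logits) and Q = softmax(q_logits)."""
    log_p = log_softmax(p_logits)
    log_q = log_softmax(q_logits)
    # exp(log_p) may underflow to 0.0, but log_p - log_q stays finite
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

# With logits this far apart, the smaller softmax probability underflows
# to 0.0 in double precision, so a ratio-based formula would divide by zero
value = kl_from_logits([0.0, -800.0], [-800.0, 0.0])
same = kl_from_logits([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])

print(f"KL from extreme logits: {value:.1f}")
print(f"KL of a distribution with itself: {same:.1f}")
```

This does not need an epsilon, because the difference of log-probabilities is computed directly instead of taking the log of a ratio.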
When KL is Infinite
If $P$ has support where $Q$ doesn't, $D_{\mathrm{KL}}(P \| Q)$ is infinite. In practice:
- Use truncated distributions with matching support
- Add smoothing to $Q$ to ensure non-zero probability everywhere
- Use reverse KL $D_{\mathrm{KL}}(Q \| P)$, which stays finite in this case as long as $Q$'s support lies inside $P$'s
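The sketch below illustrates these options on a three-outcome example. It uses a strict KL function that reports infinity instead of clamping, and a simple additive-smoothing helper; both function names, the smoothing constant, and the distributions are assumptions made for illustration.

```python
import math

def kl_strict(p, q):
    """D_KL(P || Q) in nats; returns inf when Q(x) = 0 but P(x) > 0."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue            # 0 * log(0 / q) is taken as 0
        if qi == 0:
            return math.inf     # P has mass where Q has none
        total += pi * math.log(pi / qi)
    return total

def smooth(q, eps=1e-3):
    """Additive smoothing: add eps to every entry, then renormalize."""
    total = sum(q) + eps * len(q)
    return [(qi + eps) / total for qi in q]

p = [0.5, 0.5, 0.0]
q = [1.0, 0.0, 0.0]

forward = kl_strict(p, q)             # infinite: Q misses part of P's support
reverse = kl_strict(q, p)             # finite: Q's support lies inside P's
smoothed = kl_strict(p, smooth(q))    # finite, but the value depends on eps

print(f"D_KL(P||Q)          = {forward}")
print(f"D_KL(Q||P)          = {reverse:.4f}")
print(f"D_KL(P||smoothed Q) = {smoothed:.4f}")
```

The smoothed value grows without bound as `eps` shrinks, so it is best read as a regularized quantity and not as an estimate of the true divergence.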
Summary
Key Takeaways
- KL Divergence: $D_{\mathrm{KL}}(P \| Q) = \mathbb{E}_{x \sim P}\left[\log \frac{P(x)}{Q(x)}\right]$ measures the information lost when $Q$ is used to approximate $P$. It's the expected log-likelihood ratio.
- Non-negativity: $D_{\mathrm{KL}}(P \| Q) \geq 0$ always, with equality iff $P = Q$. This is Gibbs' inequality and follows from Jensen's inequality.
- Asymmetry: $D_{\mathrm{KL}}(P \| Q) \neq D_{\mathrm{KL}}(Q \| P)$ in general. Forward KL averages over $P$ (mean-seeking); reverse KL averages over $Q$ (mode-seeking).
- Relation to Other Quantities: $D_{\mathrm{KL}}(P \| Q) = H(P, Q) - H(P)$. Cross-entropy = entropy + KL divergence.
- Gaussian KL: For univariate Gaussians, KL has a closed form involving means and variances. For multivariate Gaussians, it involves the log-determinant of covariance matrices.
- VAE Loss: The KL term $D_{\mathrm{KL}}(q(z \mid x) \| p(z))$ regularizes the latent space. With diagonal Gaussian assumptions, it has an elegant closed form.
- Mode-Seeking vs Mean-Seeking: Forward KL ($D_{\mathrm{KL}}(P \| Q)$) causes $Q$ to spread out; reverse KL ($D_{\mathrm{KL}}(Q \| P)$) causes $Q$ to concentrate on one mode. Choose based on your application.