Information Theory

KL Divergence

Master KL divergence, its properties, asymmetry, and applications in variational inference.

📂 Divergence · 📖 Lesson 83 of 100 · 🎓 Free Course


ℹ️ Why It Matters

KL divergence (Kullback-Leibler divergence) measures how one probability distribution differs from another. It is not a true distance metric (it's asymmetric), but it is the fundamental building block of variational autoencoders (VAEs), the expectation-maximization algorithm, and many other ML methods. When you see "minimize KL divergence" in a paper, this is what they mean.


Historical Context

ℹ️ Kullback and Leibler

Solomon Kullback and Richard Leibler introduced this divergence in 1951. It is also known as relative entropy or information gain. It measures the "extra" information needed when using distribution $Q$ to approximate distribution $P$.


Core Definitions

Definition: KL Divergence

The Kullback-Leibler divergence from distribution $Q$ to distribution $P$ is:

$$D_{KL}(P \| Q) = \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)}$$

for discrete distributions, or:

$$D_{KL}(P \| Q) = \int_{-\infty}^{\infty} p(x) \log \frac{p(x)}{q(x)} \, dx$$

for continuous distributions.

Definition: Forward vs Reverse KL

  • Forward KL $D_{KL}(P \| Q)$: Averages over $P$. Sensitive where $P$ has high probability.
  • Reverse KL $D_{KL}(Q \| P)$: Averages over $Q$. Sensitive where $Q$ has high probability.

The choice between forward and reverse KL has profound implications for approximation quality.

Definition: f-Divergence

KL divergence is a special case of the f-divergence family:

$$D_f(P \| Q) = \sum_x q(x) \, f\left(\frac{p(x)}{q(x)}\right)$$

For KL divergence, $f(t) = t \log t$ (forward) or $f(t) = -\log t$ (reverse).
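
The two choices of $f$ can be checked numerically. The following is a minimal sketch in plain Python, assuming strictly positive discrete distributions; the helper names `f_divergence` and `kl` are illustrative and not part of the lesson's own code.

```python
import math

def f_divergence(p, q, f):
    """D_f(P || Q) = sum_x q(x) * f(p(x) / q(x)); assumes every q(x) > 0."""
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))

def kl(p, q):
    """Direct D_KL(P || Q) in nats; assumes strictly positive entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

p = [0.9, 0.1]
q = [0.5, 0.5]

forward = f_divergence(p, q, lambda t: t * math.log(t))  # f(t) = t log t
reverse = f_divergence(p, q, lambda t: -math.log(t))     # f(t) = -log t

print(f"f(t) = t log t: {forward:.4f}   D_KL(P||Q): {kl(p, q):.4f}")
print(f"f(t) = -log t : {reverse:.4f}   D_KL(Q||P): {kl(q, p):.4f}")
```

Each generator reproduces the corresponding direct sum, because $q \cdot (p/q) \log(p/q) = p \log(p/q)$ and $-q \log(p/q) = q \log(q/p)$.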


Key Formulas

KL Divergence (Discrete)

$$D_{KL}(P \| Q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}$$

Here,

  • $D_{KL}(P \| Q)$ = KL divergence from Q to P
  • $p(x)$ = True distribution
  • $q(x)$ = Approximate distribution

KL Divergence (Continuous)

$$D_{KL}(P \| Q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx$$

Here,

  • $p(x)$ = True density
  • $q(x)$ = Approximate density

KL Divergence of Gaussians

$$D_{KL}(\mathcal{N}(\mu_0, \sigma_0^2) \| \mathcal{N}(\mu_1, \sigma_1^2)) = \log \frac{\sigma_1}{\sigma_0} + \frac{\sigma_0^2 + (\mu_0 - \mu_1)^2}{2\sigma_1^2} - \frac{1}{2}$$

Here,

  • $\mu_0, \sigma_0$ = Mean and std of true distribution
  • $\mu_1, \sigma_1$ = Mean and std of approximate distribution

Relation to Cross-Entropy and Entropy

$$D_{KL}(P \| Q) = H(P, Q) - H(P)$$

Here,

  • $H(P, Q)$ = Cross-entropy between P and Q
  • $H(P)$ = Entropy of P
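
As a quick numerical check of this identity, here is a small sketch in plain Python. The example distributions and the helper names `entropy`, `cross_entropy`, and `kl` are assumptions chosen for illustration; the code assumes $q(x) > 0$ wherever $p(x) > 0$.

```python
import math

def entropy(p):
    """H(P) in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(P, Q) in nats; assumes q(x) > 0 wherever p(x) > 0."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    """D_KL(P || Q) in nats, computed directly from the definition."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]

gap = cross_entropy(p, q) - entropy(p)
print(f"H(P,Q) - H(P) = {gap:.6f}")
print(f"D_KL(P||Q)    = {kl(p, q):.6f}")
```

The two printed numbers should agree up to floating-point rounding, and the gap is never negative.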

Relation to Mutual Information

$$I(X; Y) = D_{KL}\big(p(x,y) \;\|\; p(x)p(y)\big)$$

Here,

  • $I(X; Y)$ = Mutual information
  • $p(x)p(y)$ = Product of marginals (independence)
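
To make the formula concrete, the sketch below computes mutual information from a small joint probability table by evaluating the KL divergence between the joint and the product of its marginals. The function name `mutual_information` and both example tables are hypothetical.

```python
import math

def mutual_information(joint):
    """I(X;Y) in nats from a table joint[i][j] = p(x_i, y_j)."""
    px = [sum(row) for row in joint]        # marginal of X (row sums)
    py = [sum(col) for col in zip(*joint)]  # marginal of Y (column sums)
    return sum(
        pxy * math.log(pxy / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0
    )

# Joint equal to the product of marginals [0.4, 0.6] and [0.3, 0.7]
independent = [[0.12, 0.28], [0.18, 0.42]]
# Joint with mass concentrated on the diagonal, so X and Y are correlated
dependent = [[0.4, 0.1], [0.1, 0.4]]

mi_independent = mutual_information(independent)
mi_dependent = mutual_information(dependent)
print(f"I(X;Y), independent table: {mi_independent:.6f}")
print(f"I(X;Y), dependent table  : {mi_dependent:.6f}")
```

The independent table gives a value that is zero up to rounding, while the dependent table gives a strictly positive value.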

Properties and Theorems

Theorem: Non-negativity (Gibbs' Inequality)

$D_{KL}(P \| Q) \geq 0$ for all distributions $P$ and $Q$. Equality holds if and only if $P = Q$ almost everywhere.
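
One standard route to this result, sketched here for the discrete case, applies Jensen's inequality to the concave logarithm, summing only over points where $p(x) > 0$:

```latex
\begin{aligned}
-D_{KL}(P \| Q)
  &= \sum_{x:\, p(x) > 0} p(x) \log \frac{q(x)}{p(x)} \\
  &\leq \log \sum_{x:\, p(x) > 0} p(x) \, \frac{q(x)}{p(x)}
     && \text{(Jensen, since } \log \text{ is concave)} \\
  &= \log \sum_{x:\, p(x) > 0} q(x)
   \;\leq\; \log 1 = 0 .
\end{aligned}
```

Since the logarithm is strictly concave, equality in the Jensen step requires $q(x)/p(x)$ to be constant on the support of $P$, which together with normalization forces $P = Q$.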

Theorem: Asymmetry

$D_{KL}(P \| Q) \neq D_{KL}(Q \| P)$ in general. KL divergence is NOT a metric or distance: it fails the symmetry and triangle inequality properties.

Theorem: Forward KL is Mean-Seeking

When minimizing $D_{KL}(P \| Q)$ w.r.t. $Q$, the optimal $Q$ tends to cover all modes of $P$. This is because the expectation is over $P$, so $Q$ must assign non-zero probability wherever $P$ does.

Theorem: Reverse KL is Mode-Seeking

When minimizing $D_{KL}(Q \| P)$ w.r.t. $Q$, the optimal $Q$ tends to concentrate on a single mode of $P$. This is because the expectation is over $Q$, so $Q$ can "ignore" low-probability regions of $P$.

Theorem: Chain Rule for KL Divergence

$$D_{KL}(P(X,Y) \| Q(X,Y)) = D_{KL}(P(X) \| Q(X)) + D_{KL}(P(Y|X) \| Q(Y|X))$$

The joint KL decomposes into marginal KL plus conditional KL, where the conditional KL is the divergence between $P(Y|X=x)$ and $Q(Y|X=x)$ averaged over $x \sim P(X)$.
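
The decomposition can be verified on a small example. This is a minimal sketch with two hypothetical 2x2 joint tables; the conditional term is computed as the $P(X)$-weighted average of per-row KL divergences.

```python
import math

def kl(p, q):
    """D_KL(P || Q) in nats for discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical joint tables: table[i][j] = probability of (X=i, Y=j)
P = [[0.40, 0.20], [0.10, 0.30]]
Q = [[0.25, 0.25], [0.25, 0.25]]

def flatten(table):
    return [v for row in table for v in row]

def marginal_x(table):
    return [sum(row) for row in table]

px, qx = marginal_x(P), marginal_x(Q)

joint_kl = kl(flatten(P), flatten(Q))
marginal_kl = kl(px, qx)
# Conditional KL: sum_x P(x) * D_KL(P(Y|x) || Q(Y|x))
conditional_kl = sum(
    px[i] * kl([v / px[i] for v in P[i]], [v / qx[i] for v in Q[i]])
    for i in range(len(P))
)

print(f"joint KL               = {joint_kl:.6f}")
print(f"marginal + conditional = {marginal_kl + conditional_kl:.6f}")
```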

Theorem: Data Processing Inequality

If $X \to Y \to Z$ is a Markov chain, $P, Q$ are distributions over $X$, and $P_Z, Q_Z$ are the distributions they induce on $Z$ through the same transitions, then:

$$D_{KL}(P_Z \| Q_Z) \leq D_{KL}(P_X \| Q_X)$$

Post-processing cannot increase KL divergence.
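
A small numerical illustration of the inequality: both input distributions are pushed through the same noisy channel (a row-stochastic matrix), and the KL divergence between the outputs is compared with the KL divergence between the inputs. The channel matrix and the helper `push_through` are hypothetical choices for this sketch, and a single processing step stands in for the chain.

```python
import math

def kl(p, q):
    """D_KL(P || Q) in nats for discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def push_through(dist, channel):
    """Output distribution: out[j] = sum_i dist[i] * channel[i][j]."""
    n_out = len(channel[0])
    return [
        sum(dist[i] * channel[i][j] for i in range(len(dist)))
        for j in range(n_out)
    ]

p_x = [0.9, 0.1]
q_x = [0.5, 0.5]
# channel[i][j] = probability of output j given input i (each row sums to 1)
channel = [[0.8, 0.2], [0.3, 0.7]]

p_z = push_through(p_x, channel)
q_z = push_through(q_x, channel)

kl_before = kl(p_x, q_x)
kl_after = kl(p_z, q_z)
print(f"KL before processing: {kl_before:.4f}")
print(f"KL after processing : {kl_after:.4f}")
```

The noisy channel blurs both inputs toward each other, so the divergence after processing is smaller than before.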


Worked Examples

📝 Example 1: Basic KL Calculation

Problem: $P = [0.9, 0.1]$, $Q = [0.5, 0.5]$. Compute $D_{KL}(P \| Q)$ and $D_{KL}(Q \| P)$.

💡 Solution: Basic KL

Using base-2 logarithms (bits), with $\log_2 1.8 \approx 0.848$ and $\log_2 0.2 \approx -2.322$:

$$D_{KL}(P \| Q) = 0.9 \log_2 \frac{0.9}{0.5} + 0.1 \log_2 \frac{0.1}{0.5} = 0.9(0.848) + 0.1(-2.322) \approx 0.531 \text{ bits}$$

$$D_{KL}(Q \| P) = 0.5 \log_2 \frac{0.5}{0.9} + 0.5 \log_2 \frac{0.5}{0.1} = 0.5(-0.848) + 0.5(2.322) \approx 0.737 \text{ bits}$$

Note: $D_{KL}(P \| Q) \neq D_{KL}(Q \| P)$, asymmetry in action. In nats (natural log) the same two values are about 0.368 and 0.511.

📝 Example 2: KL Between Gaussians

Problem: Compute $D_{KL}(\mathcal{N}(0, 1) \| \mathcal{N}(0, 2^2))$, where the second Gaussian has standard deviation 2.

💡 Solution: Gaussian KL

Using the formula with $\mu_0 = 0, \sigma_0 = 1, \mu_1 = 0, \sigma_1 = 2$:

$$D_{KL} = \log \frac{2}{1} + \frac{1 + 0}{2 \cdot 4} - \frac{1}{2} = \log 2 + \frac{1}{8} - \frac{1}{2} = 0.693 + 0.125 - 0.5 = 0.318 \text{ nats}$$

📝 Example 3: KL Divergence in VAE

Problem: In a VAE, the encoder outputs $q(z|x) = \mathcal{N}(\mu, \sigma^2)$ and the prior is $p(z) = \mathcal{N}(0, 1)$. If $\mu = 0.5, \sigma = 0.8$, compute the KL term.

💡 Solution: VAE KL Term

$$D_{KL}(q(z|x) \| p(z)) = \log \frac{1}{0.8} + \frac{0.64 + 0.25}{2 \cdot 1} - \frac{1}{2} = -\log(0.8) + \frac{0.89}{2} - 0.5 = 0.223 + 0.445 - 0.5 = 0.168 \text{ nats}$$

📝 Example 4: Reverse KL Mode-Seeking

Problem: $P$ is a mixture of two Gaussians: $0.5\,\mathcal{N}(-3, 1) + 0.5\,\mathcal{N}(3, 1)$. $Q$ is a single Gaussian, starting at $\mathcal{N}(0, 1)$. What happens when minimizing $D_{KL}(Q \| P)$ vs $D_{KL}(P \| Q)$ over the parameters of $Q$?

💡 Solution: Mode-Seeking vs Mean-Seeking

  • Minimizing $D_{KL}(Q \| P)$: $Q$ will concentrate on one mode (e.g., $\mu \approx 3$), ignoring the other. This is mode-seeking behavior.
  • Minimizing $D_{KL}(P \| Q)$: $Q$ will spread out to cover both modes, becoming wider. This is mean-seeking behavior.

The VAE KL term $D_{KL}(q(z|x) \| p(z))$ places the approximating distribution $q$ in the first argument, so it has the reverse-KL form. It pulls the approximate posterior toward the prior; if it dominates the objective, the encoder can ignore $x$ altogether, a failure mode known as posterior collapse.
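
The mode-seeking and mean-seeking behavior in Example 4 can be reproduced with a brute-force search. The sketch below is an illustration under stated assumptions, not an optimizer anyone would use in practice: it approximates both KL integrals with a Riemann sum on a fixed grid over $[-12, 12]$ and searches a coarse, hypothetical grid of candidate $(\mu, \sigma)$ values for a single Gaussian $Q$.

```python
import math

LOG_SQRT_2PI = 0.5 * math.log(2 * math.pi)

def log_normal_pdf(x, mu, sigma):
    """Log density of N(mu, sigma^2) at x."""
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma) - LOG_SQRT_2PI

# Integration grid (assumption: [-12, 12] holds essentially all the mass)
DX = 0.05
XS = [-12 + DX * k for k in range(481)]

# Target P: the mixture 0.5 N(-3, 1) + 0.5 N(3, 1) from the example
LOG_P = [
    math.log(0.5 * math.exp(log_normal_pdf(x, -3, 1))
             + 0.5 * math.exp(log_normal_pdf(x, 3, 1)))
    for x in XS
]

def forward_and_reverse_kl(mu, sigma):
    """Riemann-sum estimates of D_KL(P||Q) and D_KL(Q||P), Q = N(mu, sigma^2)."""
    forward = reverse = 0.0
    for x, log_p in zip(XS, LOG_P):
        log_q = log_normal_pdf(x, mu, sigma)
        forward += math.exp(log_p) * (log_p - log_q) * DX
        reverse += math.exp(log_q) * (log_q - log_p) * DX
    return forward, reverse

# Candidate parameters: mu in [-4, 4] step 0.5, sigma in [0.5, 4] step 0.25
candidates = [(-4 + 0.5 * i, 0.5 + 0.25 * j)
              for i in range(17) for j in range(15)]
scores = {c: forward_and_reverse_kl(*c) for c in candidates}

best_forward = min(candidates, key=lambda c: scores[c][0])
best_reverse = min(candidates, key=lambda c: scores[c][1])
print(f"forward KL minimizer (mu, sigma): {best_forward}")
print(f"reverse KL minimizer (mu, sigma): {best_reverse}")
```

On this grid the forward-KL fit should sit at the overall mean with a wide standard deviation (moment matching gives $\mu = 0$, $\sigma^2 = 10$), while the reverse-KL fit should lock onto one of the two modes with $\sigma$ close to 1. Which mode wins is arbitrary, since the two are symmetric.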


Python Implementation

import numpy as np

def kl_divergence_discrete(p, q):
    """Compute KL divergence D_KL(P || Q) in nats for discrete distributions."""
    p, q = np.array(p, dtype=float), np.array(q, dtype=float)
    # Terms with p(x) = 0 contribute 0 by convention, so drop them (avoids log(0))
    mask = p > 0
    p, q = p[mask], q[mask]
    # Where q(x) = 0 but p(x) > 0 the true KL is infinite; clamping q to a tiny
    # value returns a large finite number instead of inf
    q = np.where(q > 0, q, 1e-10)
    return np.sum(p * np.log(p / q))

def kl_divergence_gaussian(mu0, sigma0, mu1, sigma1):
    """Compute KL divergence between two univariate Gaussians."""
    return (np.log(sigma1 / sigma0) +
            (sigma0**2 + (mu0 - mu1)**2) / (2 * sigma1**2) - 0.5)

def reverse_kl_gaussian(mu0, sigma0, mu1, sigma1):
    """Compute D_KL(N(mu1, sigma1^2) || N(mu0, sigma0^2))."""
    return kl_divergence_gaussian(mu1, sigma1, mu0, sigma0)

def vae_kl_term(mu, log_var):
    """Compute KL divergence for VAE: D_KL(q(z|x) || N(0,I))."""
    # q(z|x) = N(mu, exp(log_var))
    # p(z) = N(0, 1)
    return -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var))

# --- Examples ---
# Discrete distributions
p = [0.9, 0.1]
q = [0.5, 0.5]
print(f"KL(P||Q): {kl_divergence_discrete(p, q):.4f}")
print(f"KL(Q||P): {kl_divergence_discrete(q, p):.4f}")

# Gaussian KL
print(f"KL(N(0,1)||N(0,2)): {kl_divergence_gaussian(0, 1, 0, 2):.4f}")
print(f"KL(N(0,2)||N(0,1)): {kl_divergence_gaussian(0, 2, 0, 1):.4f}")

# VAE KL term
mu = np.array([0.5, -0.3])
log_var = np.array([-0.5, -1.0])
print(f"VAE KL term: {vae_kl_term(mu, log_var):.4f}")

Applications in AI/ML

ℹ️ Variational Autoencoders (VAEs)

The VAE objective (the evidence lower bound, which is maximized; the training loss is its negative) is: $\mathcal{L} = \mathbb{E}_{q(z|x)}[\log p(x|z)] - D_{KL}(q(z|x) \| p(z))$

The KL term regularizes the encoder to stay close to the prior $p(z) = \mathcal{N}(0, I)$. This enables meaningful latent space interpolation and generation.

ℹ️ Expectation-Maximization

EM alternates between:

  • E-step: Compute $q(z) = p(z|x, \theta_{\text{old}})$
  • M-step: Maximize $\mathbb{E}_q[\log p(x,z|\theta)]$ with respect to $\theta$

This is equivalent to maximizing a lower bound on the log-likelihood.

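ℹ️ Knowledge Distillation

In distillation, the student minimizes $D_{KL}(p_T \| p_S)$ where $p_T$ is the teacher's soft output and $p_S$ is the student's. Because this is forward KL (averaged over the teacher's distribution), the student is penalized wherever it assigns low probability to outputs the teacher considers likely, so it is pushed to cover all modes of the teacher's output.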

ℹ️ Distribution Matching

  • GANs: With an optimal discriminator, the original GAN objective minimizes the Jensen-Shannon divergence between $P_{\text{data}}$ and $P_{\text{model}}$, a symmetrized relative of KL
  • Flow models: Maximize likelihood, equivalent to minimizing $D_{KL}(P_{\text{data}} \| P_{\text{model}})$
  • Domain adaptation: Minimize feature distribution shift between source and target domains

Common Mistakes

| Mistake | Why It's Wrong | Correct Approach |
|---|---|---|
| Treating KL as a distance metric | KL is asymmetric and doesn't satisfy triangle inequality | Use symmetric alternatives like Jensen-Shannon divergence |
| Using KL when $q(x) = 0$ but $p(x) > 0$ | Division by zero | Use small epsilon or filter zero probabilities |
| Assuming forward and reverse KL give same result | They optimize different things | Choose based on whether you need mode-seeking or mean-seeking |
| Forgetting the non-negativity property | KL is always ≥ 0 | If you get negative KL, check your implementation |
| Using natural log vs log base 2 carelessly | Units differ (nats vs bits) | Be consistent; VAEs typically use nats |

Interview Questions

Q1: Why is KL divergence asymmetric? A: $D_{KL}(P \| Q)$ averages $\log(p/q)$ over $P$, while $D_{KL}(Q \| P)$ averages $\log(q/p)$ over $Q$. These weight different regions of the space differently, leading to different values.

Q2: What's the difference between forward and reverse KL? A: Forward KL $D_{KL}(P \| Q)$ is mean-seeking (spreads to cover all of $P$). Reverse KL $D_{KL}(Q \| P)$ is mode-seeking (concentrates on one mode of $P$). The VAE regularizer $D_{KL}(q(z|x) \| p(z))$ puts the approximating distribution $q$ first, so it has the reverse-KL form.

Q3: Can KL divergence be infinite? A: Yes. If $P$ has support where $Q$ doesn't ($p(x) > 0$ but $q(x) = 0$), the KL divergence is infinite. This is why it's important to ensure $Q$ has support at least as wide as $P$.

Q4: Why not just use Euclidean distance between distributions? A: Euclidean distance ignores the probability structure. Two distributions can be close in Euclidean distance but have very different shapes. KL divergence accounts for the actual probability values.

Q5: How is KL divergence used in VAEs? A: The VAE loss includes $D_{KL}(q(z|x) \| p(z))$ where $p(z) = \mathcal{N}(0, I)$. This regularizes the latent space, ensuring the encoder produces distributions close to the prior, enabling smooth interpolation and generation.


Practice Problems

📝 Problem 1: Symmetric KL

Problem: Under what condition is $D_{KL}(P \| Q) = D_{KL}(Q \| P)$?

💡 Solution: Symmetric KL

Trivially when $P = Q$, where both divergences are zero. Equality can also hold for distinct distributions with suitable symmetry: for two Gaussians with the same variance $\sigma^2$, the closed form reduces to $(\mu_0 - \mu_1)^2 / (2\sigma^2)$ in both directions. In general KL is asymmetric, so equality is the exception.

📝 Problem 2: KL with Shifted Distribution

Problem: Compute $D_{KL}(\mathcal{N}(1, 1) \| \mathcal{N}(0, 1))$.

💡 Solution: Shifted Gaussian KL

$$D_{KL} = \log \frac{1}{1} + \frac{1 + (1-0)^2}{2 \cdot 1} - \frac{1}{2} = 0 + 1 - 0.5 = 0.5 \text{ nats}$$

📝 Problem 3: KL Lower Bound

Problem: Show that $D_{KL}(P \| Q) \geq 0$ implies $H(P, Q) \geq H(P)$.

💡 Solution: KL Lower Bound

Since $D_{KL}(P \| Q) = H(P, Q) - H(P) \geq 0$, rearranging gives $H(P, Q) \geq H(P)$. Cross-entropy is always at least as large as the entropy of the true distribution.


Variants and Related Divergences

Definition: Jensen-Shannon Divergence

A symmetric, bounded alternative to KL:

$$JSD(P \| Q) = \frac{1}{2} D_{KL}(P \| M) + \frac{1}{2} D_{KL}(Q \| M)$$

where $M = \frac{1}{2}(P + Q)$. JSD is bounded: $0 \leq JSD \leq \log 2$ nats.
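
The symmetry and the $\log 2$ bound are easy to check numerically. A minimal sketch in plain Python follows; the function names `kl` and `jsd` and the test distributions are illustrative.

```python
import math

def kl(p, q):
    """D_KL(P || Q) in nats; terms with p(x) = 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence in nats, via the mixture M = (P + Q) / 2."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = [0.9, 0.1]
q = [0.5, 0.5]
# Distributions with disjoint support: KL would be infinite, JSD is not
disjoint_a = [1.0, 0.0]
disjoint_b = [0.0, 1.0]

print(f"JSD(P||Q) = {jsd(p, q):.4f}, JSD(Q||P) = {jsd(q, p):.4f}")
print(f"JSD of disjoint distributions = {jsd(disjoint_a, disjoint_b):.4f}")
print(f"log 2                         = {math.log(2):.4f}")
```

Because $M$ is positive wherever either $P$ or $Q$ is, both KL terms stay finite even for disjoint supports, where JSD reaches its upper bound.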

Definition: Reverse KL in Practice

Reverse KL $D_{KL}(Q \| P)$ is used when you want $Q$ to concentrate on a single mode of $P$. Applications include variational inference with unimodal approximations to multimodal posteriors, and conservative models that avoid low-density regions of $P$.

ℹ️ KL in EM Algorithm

The Expectation-Maximization algorithm maximizes a lower bound on log-likelihood. For any distribution $q(z)$:

$$\log p(x|\theta) = \mathbb{E}_{q(z)}[\log p(x,z|\theta)] - \mathbb{E}_{q(z)}[\log q(z)] + D_{KL}(q(z) \| p(z|x, \theta)) \geq \mathbb{E}_{q(z)}[\log p(x,z|\theta)] - \mathbb{E}_{q(z)}[\log q(z)]$$

The E-step sets $q(z) = p(z|x, \theta_{\text{old}})$ (making the KL term zero at $\theta_{\text{old}}$, so the bound is tight there), and the M-step maximizes the expected complete-data log-likelihood with respect to $\theta$.

ℹ️ KL in GANs and Flows

  • GANs: The original GAN minimizes Jensen-Shannon divergence (related to KL). WGAN uses Wasserstein distance instead.
  • Flow models: Maximum likelihood training minimizes $D_{KL}(P_{\text{data}} \| P_{\text{model}})$ implicitly through log-likelihood maximization.
  • Domain adaptation: Minimizes feature distribution shift by reducing $D_{KL}(P_{\text{source}} \| P_{\text{target}})$.
  • Reinforcement learning: KL constraints prevent policy updates from being too large (TRPO, PPO).

ℹ️ KL and Maximum Likelihood

Minimizing $D_{KL}(P_{\text{data}} \| P_\theta)$ w.r.t. $\theta$ is equivalent to maximizing $\mathbb{E}_{P_{\text{data}}}[\log P_\theta(x)]$, which is exactly maximum likelihood estimation. This is why many generative models (flows, autoregressive models) maximize log-likelihood.
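
The equivalence follows in one step from splitting the logarithm. Written out, with $H(P_{\text{data}})$ denoting the entropy of the data distribution:

```latex
\begin{aligned}
D_{KL}(P_{\text{data}} \| P_\theta)
  &= \mathbb{E}_{P_{\text{data}}}\big[\log P_{\text{data}}(x)\big]
     - \mathbb{E}_{P_{\text{data}}}\big[\log P_\theta(x)\big] \\
  &= -H(P_{\text{data}})
     - \mathbb{E}_{P_{\text{data}}}\big[\log P_\theta(x)\big], \\[4pt]
\arg\min_\theta \, D_{KL}(P_{\text{data}} \| P_\theta)
  &= \arg\max_\theta \, \mathbb{E}_{P_{\text{data}}}\big[\log P_\theta(x)\big].
\end{aligned}
```

The entropy term does not depend on $\theta$, so it drops out of the optimization. In practice the expectation is replaced by an average over training samples, which gives the usual average log-likelihood.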


Quick Reference

| Quantity | Formula | Key Property |
|---|---|---|
| Forward KL | $D_{KL}(P \parallel Q) = \sum p \log(p/q)$ | Mean-seeking |
| Reverse KL | $D_{KL}(Q \parallel P) = \sum q \log(q/p)$ | Mode-seeking |
| Gaussian KL | $\log(\sigma_1/\sigma_0) + \frac{\sigma_0^2 + (\mu_0-\mu_1)^2}{2\sigma_1^2} - \frac{1}{2}$ | Closed form |
| Relation to CE | $D_{KL}(P \parallel Q) = H(P, Q) - H(P)$ | Non-negative |
| VAE KL | $-0.5 \sum(1 + \log\sigma^2 - \mu^2 - \sigma^2)$ | Closed form for diagonal |
| JSD | $\frac{1}{2}D_{KL}(P \parallel M) + \frac{1}{2}D_{KL}(Q \parallel M)$ | Symmetric, bounded |

Cross-References

  • 081 - Entropy: $D_{KL}(P \| Q) = H(P, Q) - H(P)$. KL is the difference between cross-entropy and entropy.
  • 082 - Mutual Information: $I(X;Y) = D_{KL}(p(x,y) \| p(x)p(y))$. MI is a special case of KL divergence.
  • 084 - Cross-Entropy: Cross-entropy loss = entropy + KL divergence. Minimizing CE is equivalent to minimizing KL when entropy is fixed.
  • 085 - Applications: VAEs use KL regularization, EM algorithm uses KL in E-step, distillation uses KL to match teacher.

Common Pitfalls in Implementation

ℹ️ Numerical Stability

When implementing KL divergence, always:

  1. Filter out zero probabilities before computing $\log(p/q)$
  2. Use np.where or masking to avoid division by zero
  3. Add small epsilon (e.g., $10^{-10}$) to denominators
  4. Consider using log-space computations for numerical stability
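
As one way to read the log-space suggestion, the sketch below computes KL directly from unnormalized scores (logits) without ever forming a ratio of probabilities. It assumes the distributions come from a softmax; the helper names `log_softmax` and `kl_from_log_probs` are illustrative, and the extreme logits are chosen to show that nothing overflows.

```python
import math

def log_softmax(logits):
    """Log-probabilities from logits, using the max-subtraction trick."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def kl_from_log_probs(log_p, log_q):
    """D_KL(P || Q) in nats from log-probabilities.

    Uses p * (log p - log q), so no division and no log of a tiny number.
    """
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

# Extreme logits: exponentiating them directly would overflow
logits_p = [1000.0, 0.0, -1000.0]
logits_q = [0.0, 0.0, 0.0]  # uniform over three outcomes

log_p = log_softmax(logits_p)
log_q = log_softmax(logits_q)
value = kl_from_log_probs(log_p, log_q)
print(f"KL from logits: {value:.4f}")
```

Here $P$ is numerically a point mass on the first outcome and $Q$ is uniform over three outcomes, so the result is $\log 3$. Probabilities that underflow to zero simply contribute zero to the sum.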

ℹ️ When KL is Infinite

If $P$ has support where $Q$ doesn't, KL is infinite. In practice:

  • Use truncated distributions with matching support
  • Add smoothing to $Q$ to ensure non-zero probability everywhere
  • Use reverse KL $D_{KL}(Q \| P)$, which stays finite as long as $Q$ puts no mass where $P$ is zero

Summary

📋 Key Takeaways

  • KL Divergence: $D_{KL}(P \| Q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$ measures the information lost when $Q$ is used to approximate $P$. It's the expected log-likelihood ratio under $P$.

  • Non-negativity: $D_{KL}(P \| Q) \geq 0$ always, with equality iff $P = Q$. This is Gibbs' inequality and follows from Jensen's inequality.

  • Asymmetry: $D_{KL}(P \| Q) \neq D_{KL}(Q \| P)$ in general. Forward KL averages over $P$ (mean-seeking); reverse KL averages over $Q$ (mode-seeking).

  • Relation to Other Quantities: $D_{KL}(P \| Q) = H(P, Q) - H(P)$. Equivalently, cross-entropy = entropy + KL divergence.

  • Gaussian KL: For univariate Gaussians, KL has a closed form involving means and variances. For multivariate Gaussians, it involves the log-determinant of covariance matrices.

  • VAE Loss: The KL term $D_{KL}(q(z|x) \| p(z))$ regularizes the latent space. With diagonal Gaussian assumptions, it has an elegant closed form.

  • Mode-Seeking vs Mean-Seeking: Forward KL ($P \| Q$) causes $Q$ to spread out; reverse KL ($Q \| P$) causes $Q$ to concentrate on one mode. Choose based on your application.
