Information Theory

Applications in Machine Learning

See how information theory powers feature selection, VAEs, diffusion models, and knowledge distillation.


ℹ️ Why It Matters

Information theory is more than elegant math: it underpins many practical ML algorithms. Feature selection uses mutual information to rank variables. Decision trees use information gain to choose splits. VAEs penalize a KL divergence term to learn structured latent spaces. Diffusion models are trained with a denoising objective derived from a variational bound on the log-likelihood. Knowledge distillation uses cross-entropy to transfer knowledge from large models to small ones. Many major ML methods have an information-theoretic quantity at their core.


Feature Selection with Mutual Information

Definition: Feature Selection via MI

For a feature $X_j$ and target $Y$, compute the mutual information:

$$I(X_j; Y) = H(Y) - H(Y \mid X_j)$$

Features with higher MI are more informative. Select the top-$k$ features or use a threshold.

MI-Based Feature Ranking

$$\text{Score}(X_j) = I(X_j; Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}$$

Here,

  • $X_j$ = $j$-th feature
  • $Y$ = Target variable

from sklearn.feature_selection import mutual_info_classif
from sklearn.datasets import make_classification
import numpy as np

# Classification example
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           n_redundant=5, random_state=42)
mi_scores = mutual_info_classif(X, y, random_state=42)

# Rank features
feature_ranking = np.argsort(mi_scores)[::-1]
print("Feature ranking by MI:")
for rank, idx in enumerate(feature_ranking[:10]):
    print(f"  Rank {rank+1}: Feature {idx} (MI = {mi_scores[idx]:.4f})")

# Select top-k features
k = 5
selected_features = feature_ranking[:k]
X_selected = X[:, selected_features]
print(f"\nSelected features: {selected_features}")

ℹ️ Why MI Over Correlation?

MI captures nonlinear dependencies that correlation misses. For example, if $X$ is distributed symmetrically about zero, $Y = X^2$ has zero correlation with $X$ but high MI. MI is also model-agnostic: you don't need to assume a specific functional relationship.
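The contrast can be checked exactly on a small discrete case. The sketch below assumes a toy distribution chosen only for illustration ($X$ uniform on $\{-2, \ldots, 2\}$ and $Y = X^2$) and computes both quantities from their definitions using only the standard library.

```python
import math
from collections import Counter

# Toy distribution (an assumption for illustration): X uniform on {-2, ..., 2}, Y = X^2.
xs = [-2, -1, 0, 1, 2]
ys = [x * x for x in xs]
n = len(xs)

# Pearson correlation, computed from its definition.
mean_x = sum(xs) / n
mean_y = sum(ys) / n
cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
correlation = cov / (std_x * std_y)

# Mutual information in bits, from the joint and marginal distributions.
p_xy = {pair: c / n for pair, c in Counter(zip(xs, ys)).items()}
p_x = {x: c / n for x, c in Counter(xs).items()}
p_y = {y: c / n for y, c in Counter(ys).items()}
mi_bits = sum(p * math.log2(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items())

print(f"correlation = {correlation:.3f}")
print(f"I(X; Y)     = {mi_bits:.3f} bits")
```

Because $Y$ is a deterministic function of $X$ here, the MI equals $H(Y) \approx 1.52$ bits, while the correlation is zero.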


Information Bottleneck

Definition: Information Bottleneck Principle

The information bottleneck method finds a compressed representation $T$ of input $X$ that preserves maximum information about target $Y$:

$$\min_{p(t \mid x)} \; I(X; T) - \beta \, I(T; Y)$$

where $\beta$ controls the trade-off between compression (low $I(X;T)$) and prediction (high $I(T;Y)$).

Information Bottleneck Objective

$$\min_{p(t \mid x)} \; I(X; T) - \beta \, I(T; Y)$$

Here,

  • $I(X;T)$ = Compression: how much of $X$ is retained in $T$
  • $I(T;Y)$ = Prediction: how much $T$ tells us about $Y$
  • $\beta$ = Trade-off parameter

ℹ️ Deep Learning Connection

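It has been hypothesized that deep neural networks implicitly optimize the information bottleneck: successive layers compress the input (reducing $I(X; T)$) while preserving task-relevant information (keeping $I(T; Y)$ high). If so, compressing away irrelevant noise would help explain why deep networks generalize well. The hypothesis is still debated, in part because mutual information in deterministic networks is hard to estimate.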
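To make the objective concrete, here is a minimal sketch that evaluates $I(X;T) - \beta\, I(T;Y)$ for two hand-written encoders on a small discrete problem. The joint table `p_xy`, the two encoders, and the helper names are all hypothetical, chosen only for illustration; the sketch relies on the usual assumption that $T$ depends on $Y$ only through $X$.

```python
import numpy as np

def mutual_information(p_joint):
    """Mutual information in bits of a 2-D joint probability table."""
    p_a = p_joint.sum(axis=1, keepdims=True)
    p_b = p_joint.sum(axis=0, keepdims=True)
    mask = p_joint > 0  # skip zero cells, where p * log(p) is taken as 0
    return float(np.sum(p_joint[mask] * np.log2(p_joint[mask] / (p_a * p_b)[mask])))

def ib_objective(p_xy, p_t_given_x, beta):
    """I(X;T) - beta * I(T;Y) for a stochastic encoder p(t|x)."""
    p_x = p_xy.sum(axis=1)
    p_xt = p_x[:, None] * p_t_given_x   # joint p(x, t)
    p_ty = p_t_given_x.T @ p_xy         # p(t, y) = sum_x p(t|x) p(x, y)
    return mutual_information(p_xt) - beta * mutual_information(p_ty)

# Hypothetical joint p(x, y): inputs 0 and 1 mostly have label 0, inputs 2 and 3 mostly label 1.
p_xy = np.array([[0.20, 0.05],
                 [0.20, 0.05],
                 [0.05, 0.20],
                 [0.05, 0.20]])

identity_encoder = np.eye(4)  # T = X: no compression
merge_encoder = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)  # T keeps only the group

for name, encoder in [("identity", identity_encoder), ("merge", merge_encoder)]:
    print(f"{name}: objective = {ib_objective(p_xy, encoder, beta=2.0):.4f}")
```

In this toy case the merging encoder halves $I(X;T)$ (from 2 bits to 1 bit) without losing any information about $Y$, so it attains the lower objective value.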


Decision Trees and Information Gain

Definition: Information Gain

For a dataset $D$ and feature $A$ with values $\{a_1, \ldots, a_V\}$:

$$IG(D, A) = H(D) - \sum_{v=1}^{V} \frac{|D_v|}{|D|} H(D_v)$$

where $D_v$ is the subset of $D$ where feature $A$ has value $a_v$.

Information Gain

$$IG(D, A) = H(D) - \sum_{v} \frac{|D_v|}{|D|} H(D_v)$$

Here,

  • $H(D)$ = Entropy of the dataset before the split
  • $H(D_v)$ = Entropy of subset $D_v$ after the split
  • $|D_v|/|D|$ = Fraction of samples in subset $D_v$

Theorem: Gain Ratio

Information gain is biased toward features with many distinct values. The gain ratio normalizes it:

$$GR(D, A) = \frac{IG(D, A)}{H_A(D)}$$

where $H_A(D) = -\sum_v \frac{|D_v|}{|D|} \log \frac{|D_v|}{|D|}$ is the intrinsic information of the split.
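The two formulas can be worked through by hand on a tiny dataset. The sketch below uses hypothetical toy data (a two-valued `weather` feature and a `row_id` feature that is unique per sample) to show the bias: information gain prefers the many-valued feature, while the gain ratio penalizes it.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy in bits of a list of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def information_gain(feature, labels):
    """IG(D, A) = H(D) - sum_v |D_v|/|D| * H(D_v)."""
    n = len(labels)
    remainder = 0.0
    for v in set(feature):
        subset = [y for a, y in zip(feature, labels) if a == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

def gain_ratio(feature, labels):
    """GR(D, A) = IG(D, A) / H_A(D); H_A(D) is the entropy of the feature's own values."""
    split_info = entropy(feature)
    return information_gain(feature, labels) / split_info if split_info > 0 else 0.0

# Hypothetical toy data: 8 samples with a binary label.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
weather = ["sun", "sun", "sun", "sun", "rain", "rain", "rain", "sun"]
row_id = list(range(8))  # unique value per sample, useless for prediction

for name, feature in [("weather", weather), ("row_id", row_id)]:
    ig = information_gain(feature, labels)
    gr = gain_ratio(feature, labels)
    print(f"{name}: IG = {ig:.3f}, GR = {gr:.3f}")
```

Splitting on `row_id` yields pure single-sample subsets, so its information gain is the full 1 bit, but its split entropy of $\log_2 8 = 3$ bits pulls its gain ratio below that of `weather`.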

from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load data
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=42)

# Train decision tree; criterion="entropy" selects splits by information gain
# (scikit-learn's default criterion is "gini")
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=42)
tree.fit(X_train, y_train)

print(f"Training accuracy: {tree.score(X_train, y_train):.3f}")
print(f"Test accuracy: {tree.score(X_test, y_test):.3f}")

# Feature importances (normalized total impurity reduction contributed by each feature)
print("\nFeature importances:")
for name, importance in zip(iris.feature_names, tree.feature_importances_):
    print(f"  {name}: {importance:.4f}")

Variational Autoencoders (VAEs)

Definition: VAE Evidence Lower Bound (ELBO)

The VAE maximizes the ELBO:

$$\mathcal{L} = \mathbb{E}_{q(z \mid x)}[\log p(x \mid z)] - D_{KL}(q(z \mid x) \,\|\, p(z))$$

where $q(z \mid x)$ is the encoder and $p(x \mid z)$ is the decoder. In practice, training minimizes the negative ELBO, $-\mathcal{L}$.

VAE Loss

$$\mathcal{L} = \underbrace{\mathbb{E}_{q(z \mid x)}[\log p(x \mid z)]}_{\text{reconstruction}} - \underbrace{D_{KL}(q(z \mid x) \,\|\, p(z))}_{\text{regularization}}$$

Here,

  • $q(z \mid x)$ = Encoder distribution (approximate posterior)
  • $p(x \mid z)$ = Decoder distribution (likelihood)
  • $p(z)$ = Prior (usually $\mathcal{N}(0, I)$)

VAE KL Term (Diagonal Gaussian)

$$D_{KL} = -\frac{1}{2} \sum_{j=1}^{J} \left(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\right)$$

Here,

  • $\mu_j$ = Mean of encoder output for latent dimension $j$
  • $\sigma_j^2$ = Variance of encoder output for latent dimension $j$
  • $J$ = Dimensionality of latent space

import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, input_dim, hidden_dim, latent_dim):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
        )
        self.mu = nn.Linear(hidden_dim, latent_dim)
        self.log_var = nn.Linear(hidden_dim, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, input_dim),
            nn.Sigmoid(),
        )

    def reparameterize(self, mu, log_var):
        std = torch.exp(0.5 * log_var)
        eps = torch.randn_like(std)
        return mu + eps * std

    def forward(self, x):
        h = self.encoder(x)
        mu, log_var = self.mu(h), self.log_var(h)
        z = self.reparameterize(mu, log_var)
        x_hat = self.decoder(z)
        return x_hat, mu, log_var

    def loss(self, x, x_hat, mu, log_var):
        recon = nn.functional.binary_cross_entropy(x_hat, x, reduction='sum')
        kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
        return recon + kl

# Example usage
model = VAE(input_dim=784, hidden_dim=256, latent_dim=16)
x = torch.rand(32, 784)  # inputs in [0, 1], as binary cross-entropy targets must be
x_hat, mu, log_var = model(x)
loss = model.loss(x, x_hat, mu, log_var)
print(f"VAE loss: {loss.item():.2f}")

ℹ️ Why KL in VAEs?

The KL term $D_{KL}(q(z \mid x) \,\|\, p(z))$ keeps the latent space well-structured: nearby points in latent space decode to similar outputs. Without it, the encoder could memorize inputs, producing a degenerate latent space.


Diffusion Models

Definition: Denoising Score Matching

Diffusion models learn a score function $\nabla_x \log p(x)$ by training a neural network $\epsilon_\theta(x_t, t)$ to predict the noise added at each step:

$$\mathcal{L} = \mathbb{E}_{t, x_0, \epsilon}\left[\|\epsilon - \epsilon_\theta(x_t, t)\|^2\right]$$

where $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$.

Diffusion Training Objective

$$\mathcal{L} = \mathbb{E}_{t \sim U(1,T),\, \epsilon \sim \mathcal{N}(0,I)} \left[\|\epsilon - \epsilon_\theta(x_t, t)\|^2\right]$$

Here,

  • $x_t$ = Noised data at timestep $t$
  • $\epsilon$ = Added noise
  • $\epsilon_\theta$ = Noise prediction network

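Theorem: Connection to Information Theory

The diffusion process destroys information gradually: $I(X_0; X_t)$ decreases as $t$ increases, and by the final step $X_T$ is nearly independent of $X_0$. The reverse process has to supply that information again, one denoising step at a time. The noise-prediction objective is a reweighted form of the variational bound on the log-likelihood $\log p_\theta(x_0)$, which decomposes into a sum of KL divergences between the true and learned denoising distributions.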
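The loss of information can be made quantitative in one special case. Assuming, purely for illustration, a scalar $X_0 \sim \mathcal{N}(0, 1)$, the noised variable $X_t = \sqrt{\bar{\alpha}_t}\, X_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$ is jointly Gaussian with $X_0$, and the mutual information has the closed form $I(X_0; X_t) = -\tfrac{1}{2} \log(1 - \bar{\alpha}_t)$ nats. The sketch below evaluates it for a linear schedule from $10^{-4}$ to $0.02$ over 1000 steps.

```python
import math

# Linear beta schedule (an assumption for this sketch): 1e-4 to 0.02 over T = 1000 steps.
T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]

# alpha_bar_t is the running product of (1 - beta_s) for s <= t.
alpha_bar = []
running = 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bar.append(running)

def mi_nats(a_bar):
    """I(X_0; X_t) in nats for scalar X_0 ~ N(0, 1): -0.5 * log(1 - alpha_bar_t)."""
    return -0.5 * math.log(1.0 - a_bar)

for step in [1, 10, 100, 500, 1000]:
    a_bar = alpha_bar[step - 1]
    print(f"t = {step:4d}: alpha_bar = {a_bar:.5f}, I(X_0; X_t) = {mi_nats(a_bar):.5f} nats")
```

Under these assumptions the mutual information starts at several nats after the first step and falls monotonically to nearly zero at $t = T$.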

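import math
import torch
import torch.nn as nn

class SimpleDiffusion(nn.Module):
    def __init__(self, dim, hidden_dim, t_dim=32):
        super().__init__()
        self.t_dim = t_dim
        self.time_embed = nn.Sequential(
            nn.Linear(t_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        self.net = nn.Sequential(
            nn.Linear(dim + hidden_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, dim),
        )

    def timestep_embedding(self, t):
        # Sinusoidal features of the integer timestep (a common choice): (batch,) -> (batch, t_dim)
        half = self.t_dim // 2
        freqs = torch.exp(-math.log(10000.0) * torch.arange(half, device=t.device) / half)
        args = t.float().unsqueeze(-1) * freqs
        return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

    def forward(self, x_t, t):
        # Condition the noise prediction on the embedded timestep
        t_emb = self.time_embed(self.timestep_embedding(t))
        h = torch.cat([x_t, t_emb], dim=-1)
        return self.net(h)

# Example: forward diffusion
def forward_diffusion(x_0, t, betas):
    alpha_bar = torch.cumprod(1 - betas, dim=0)
    a_bar_t = alpha_bar[t].unsqueeze(-1)  # (batch, 1), broadcasts over the feature dimension
    noise = torch.randn_like(x_0)
    x_t = torch.sqrt(a_bar_t) * x_0 + torch.sqrt(1 - a_bar_t) * noise
    return x_t, noise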

# Training step
betas = torch.linspace(1e-4, 0.02, 1000)
model = SimpleDiffusion(dim=64, hidden_dim=128)
x_0 = torch.randn(16, 64)
t = torch.randint(0, 1000, (16,))
x_t, noise = forward_diffusion(x_0, t, betas)
predicted_noise = model(x_t, t)
loss = nn.functional.mse_loss(predicted_noise, noise)
print(f"Diffusion loss: {loss.item():.4f}")

Knowledge Distillation

Definition: Knowledge Distillation

Knowledge distillation trains a small student model to mimic a large teacher model by minimizing:

$$\mathcal{L} = \alpha \cdot H(y, \hat{y}_S) + (1-\alpha) \cdot H(p_T, p_S)$$

where $p_T$ and $p_S$ are the soft probability outputs of teacher and student, respectively.

Knowledge Distillation Loss

$$\mathcal{L} = \alpha \cdot CE(y, \hat{y}_S) + (1-\alpha) \cdot CE(p_T, p_S)$$

Here,

  • $\alpha$ = Weight balancing hard and soft labels
  • $CE(y, \hat{y}_S)$ = Cross-entropy with ground truth labels
  • $CE(p_T, p_S)$ = Cross-entropy between teacher and student soft outputs

ℹ️ Why Soft Labels?

Soft labels from the teacher encode inter-class relationships. For example, a teacher might output [0.8, 0.15, 0.05] for [cat, dog, car]. This tells the student that cat is more similar to dog than to car—knowledge that hard one-hot labels [1, 0, 0] cannot convey.

import torch
import torch.nn as nn

def distillation_loss(student_logits, teacher_logits, labels, temperature=3.0, alpha=0.7):
    """Compute knowledge distillation loss."""
    soft_student = nn.functional.log_softmax(student_logits / temperature, dim=1)
    soft_teacher = nn.functional.softmax(teacher_logits / temperature, dim=1)

    distill_loss = nn.functional.kl_div(
        soft_student, soft_teacher, reduction='batchmean'
    ) * (temperature ** 2)

    hard_loss = nn.functional.cross_entropy(student_logits, labels)

    return alpha * hard_loss + (1 - alpha) * distill_loss

# Example
teacher = nn.Linear(128, 10)
student = nn.Linear(128, 10)
x = torch.randn(32, 128)
labels = torch.randint(0, 10, (32,))

with torch.no_grad():
    teacher_logits = teacher(x)
student_logits = student(x)

loss = distillation_loss(student_logits, teacher_logits, labels)
print(f"Distillation loss: {loss.item():.4f}")

Information Theory in NLP

Definition: Perplexity

Perplexity measures how well a language model predicts text:

$$\text{PPL} = 2^{H(P, Q)} = 2^{-\frac{1}{N}\sum_{i=1}^{N} \log_2 q(w_i \mid w_{<i})}$$

Lower perplexity means a better model. For example, GPT-3 reported a zero-shot perplexity of about 20.5 on the Penn Treebank test set.

ℹ️ Language Modeling as CE

Language modeling minimizes cross-entropy between the true next-token distribution and the model's prediction. Each token's loss is $-\log q(w_i \mid w_{<i})$. The sum over tokens gives the total sequence loss.
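A small worked example ties the per-token loss to perplexity. The probabilities below are hypothetical values standing in for what a model assigned to the tokens that actually occurred; the sketch checks that the base-2 and natural-log routes give the same perplexity.

```python
import math

# Hypothetical values of q(w_i | w_<i) for the four tokens that actually occurred.
token_probs = [0.25, 0.5, 0.125, 0.25]
n = len(token_probs)

# Average per-token loss, once in bits and once in nats.
ce_bits = -sum(math.log2(q) for q in token_probs) / n
ce_nats = -sum(math.log(q) for q in token_probs) / n

# Perplexity is the exponential of the cross-entropy in the matching base.
ppl_from_bits = 2 ** ce_bits
ppl_from_nats = math.exp(ce_nats)

print(f"cross-entropy: {ce_bits:.3f} bits = {ce_nats:.3f} nats per token")
print(f"perplexity: {ppl_from_bits:.3f} (base 2), {ppl_from_nats:.3f} (base e)")
```

Here the average loss is 2 bits per token, so the perplexity is 4: the model is as uncertain as if it were choosing uniformly among four tokens at each step.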

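import math
import torch
import torch.nn as nn

# Simplified language model: embedding -> one Transformer layer -> vocabulary head
vocab_size = 10000
embed = nn.Embedding(vocab_size, 256)
model = nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)
head = nn.Linear(256, vocab_size)

# Input: batch of token sequences, with one extra token so inputs and targets can be shifted
tokens = torch.randint(0, vocab_size, (4, 129))  # (batch, seq_len + 1)
x, targets = tokens[:, :-1], tokens[:, 1:]       # target at position i is token i + 1

# Causal mask: position i may attend only to positions <= i
seq_len = x.size(1)
causal_mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
logits = head(model(embed(x), src_mask=causal_mask))  # (batch, seq_len, vocab_size)

# Average next-token cross-entropy in nats
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size),
    targets.reshape(-1)
)

# Perplexity: exp of the cross-entropy in nats (equal to 2 ** cross-entropy in bits)
ppl = math.exp(loss.item())
print(f"Loss: {loss.item():.4f}, Perplexity: {ppl:.2f}")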

Common Mistakes

| Mistake | Why It's Wrong | Correct Approach |
|---|---|---|
| Using only correlation for feature selection | Misses nonlinear dependencies | Use mutual information |
| Ignoring the information bottleneck in deep networks | The representation may overfit to noise | Add noise or information constraints |
| Not using soft labels in distillation | Loses inter-class relationship information | Use temperature scaling for soft targets |
| Confusing perplexity with accuracy | Perplexity is the exponential of cross-entropy, not a count of correct predictions | Report both; lower perplexity does not guarantee higher accuracy |
| Assuming all information-theoretic quantities are symmetric | KL divergence, cross-entropy, and conditional entropy are asymmetric | Pay attention to argument order |

Interview Questions

Q1: How does MI help in feature selection? A: MI $I(X_j; Y)$ measures the reduction in uncertainty about $Y$ given feature $X_j$. Unlike correlation, it captures any dependency (linear or nonlinear, monotonic or not). Features with higher MI are more informative.

Q2: What is the information bottleneck and why does it matter? A: It's a principle for learning compressed representations that retain task-relevant information. The objective $\min I(X;T) - \beta I(T;Y)$ balances compression and prediction. It offers one proposed, and still debated, explanation of why deep networks generalize: they compress away noise.

Q3: How do VAEs use KL divergence? A: The KL term $D_{KL}(q(z \mid x) \,\|\, p(z))$ regularizes the latent space to be close to the prior $\mathcal{N}(0, I)$. This enables smooth interpolation and generation. Without it, the encoder could memorize.

Q4: Why does knowledge distillation use soft labels? A: Soft labels encode inter-class similarities. A teacher outputting [0.8, 0.15, 0.05] for [cat, dog, car] transfers knowledge about class relationships that hard one-hot labels cannot convey.

Q5: What's the connection between perplexity and cross-entropy? A: Perplexity $= 2^{H(P,Q)}$, the exponential of cross-entropy (base 2 when $H$ is measured in bits). It's the effective number of equally likely tokens the model is choosing among. Lower perplexity = better predictions.


Practice Problems

📝Problem 1: Feature Selection

Problem: You have 100 features and a binary target. Feature A has $I(A; Y) = 0.5$ bits, Feature B has $I(B; Y) = 0.3$ bits, and Feature C has $I(C; Y) = 0.0$ bits. Which features should you select?

💡Solution: Feature Selection

Select Features A and B (high MI). Feature C has zero MI with the target, meaning it is independent of the target and provides no information on its own. In practice, you'd select all features above some threshold or the top-$k$ features.

📝Problem 2: VAE KL Term

Problem: For a VAE with encoder output $\mu = [0.5, -0.3]$, $\log \sigma^2 = [-0.5, -1.0]$, compute the KL divergence term.

💡Solution: VAE KL

First recover the variances: $\sigma_1^2 = e^{-0.5} \approx 0.607$ and $\sigma_2^2 = e^{-1.0} \approx 0.368$. Then:

$$\begin{aligned}
D_{KL} &= -\frac{1}{2}\sum_j \left(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\right) \\
&= -\frac{1}{2}\left[(1 - 0.5 - 0.25 - 0.607) + (1 - 1.0 - 0.09 - 0.368)\right] \\
&= -\frac{1}{2}\left[(-0.357) + (-0.458)\right] \approx 0.407 \text{ nats}
\end{aligned}$$
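The arithmetic can be double-checked with a few lines of code. This is a minimal sketch of the closed-form KL term for a diagonal Gaussian against a standard normal prior; the function name is hypothetical.

```python
import math

def kl_diag_gaussian_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in nats, using the closed form."""
    return -0.5 * sum(1 + lv - m ** 2 - math.exp(lv) for m, lv in zip(mu, log_var))

kl = kl_diag_gaussian_to_standard_normal(mu=[0.5, -0.3], log_var=[-0.5, -1.0])
print(f"KL = {kl:.3f} nats")
```

The result is roughly 0.41 nats, agreeing with the hand calculation up to rounding of the intermediate values.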

📝Problem 3: Distillation Temperature

Problem: Why does higher temperature in knowledge distillation produce softer probability distributions?

💡Solution: Temperature

A temperature $T$ divides the logits before the softmax: $\text{softmax}(z/T)$. A higher $T$ flattens the distribution because the differences between logits are divided by $T$. As $T \to \infty$, the distribution becomes uniform. At $T = 1$ it is the model's original distribution, and as $T \to 0$ it approaches a one-hot distribution on the largest logit.
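The flattening effect is easy to see numerically. The sketch below applies several temperatures to one set of hypothetical teacher logits and reports the entropy of each resulting distribution; the helper names are illustrative, not part of any library.

```python
import math

def softmax_with_temperature(logits, temperature):
    """softmax(z / T), with the maximum subtracted for numerical stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_bits(probs):
    """Shannon entropy in bits; higher means a softer distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

logits = [4.0, 2.0, 0.5]  # hypothetical teacher logits for [cat, dog, car]
for temperature in [0.5, 1.0, 3.0, 10.0]:
    probs = softmax_with_temperature(logits, temperature)
    rounded = [round(p, 3) for p in probs]
    print(f"T = {temperature:>4}: probs = {rounded}, entropy = {entropy_bits(probs):.3f} bits")
```

The entropy rises with temperature toward its maximum of $\log_2 3 \approx 1.585$ bits, which is the uniform distribution over three classes.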


Quick Reference

| Application | IT Concept | Formula |
|---|---|---|
| Feature Selection | Mutual Information | $I(X_j; Y) = H(Y) - H(Y \mid X_j)$ |
| Decision Trees | Information Gain | $IG(D, A) = H(D) - \sum_v \frac{\lvert D_v \rvert}{\lvert D \rvert} H(D_v)$ |
| VAE | KL Divergence | $D_{KL}(q(z \mid x) \,\Vert\, p(z))$ |
| Diffusion | Score Matching | $\lVert \epsilon - \epsilon_\theta(x_t, t) \rVert^2$ |
| Distillation | Cross-Entropy | $\alpha \cdot CE(y, \hat{y}_S) + (1-\alpha) \cdot CE(p_T, p_S)$ |
| Language Modeling | Perplexity | $\text{PPL} = 2^{H(P, Q)}$ |

Cross-References

  • 081 - Entropy — Foundation for information gain, perplexity, and all uncertainty measures.
  • 082 - Mutual Information — Used directly in feature selection and the information bottleneck.
  • 083 - KL Divergence: Central to VAEs, EM algorithm, and distribution matching.
  • 084 - Cross-Entropy — The loss function for classification, distillation, and language modeling.

Summary

📋Key Takeaways

  • Feature Selection: Mutual information $I(X_j; Y) = H(Y) - H(Y \mid X_j)$ captures any dependency between feature and target. It's model-agnostic and more general than correlation for selecting informative features.

  • Information Bottleneck: The principle $\min I(X;T) - \beta I(T;Y)$ learns compressed representations that retain task-relevant information. Deep networks have been hypothesized to optimize this implicitly, which would help explain their generalization ability.

  • Decision Trees: Information gain $IG(D, A) = H(D) - \sum_v \frac{|D_v|}{|D|} H(D_v)$ measures how much a feature reduces entropy. Trees greedily split on the feature with the highest gain.

  • VAEs: The ELBO $\mathcal{L} = \mathbb{E}[\log p(x \mid z)] - D_{KL}(q(z \mid x) \,\|\, p(z))$ balances reconstruction quality with latent space regularization. The KL term encourages smooth, structured latent spaces.

  • Diffusion Models: Train by predicting noise: $\|\epsilon - \epsilon_\theta(x_t, t)\|^2$. The forward process gradually destroys information; the reverse process recovers it.

  • Knowledge Distillation: $\mathcal{L} = \alpha \cdot CE(y, \hat{y}_S) + (1-\alpha) \cdot CE(p_T, p_S)$ transfers knowledge via soft labels that encode inter-class relationships.

  • Language Modeling: Perplexity $= 2^{H(P,Q)}$ measures prediction quality. Lower perplexity = better model. Training minimizes cross-entropy between true and predicted token distributions.
