Applications in Machine Learning
ℹ️ Why It Matters
Information theory is more than elegant math; it underpins many practical ML algorithms. Feature selection uses mutual information to rank variables. Decision trees use information gain to choose splits. VAEs minimize a KL divergence term to learn structured latent spaces. Diffusion models train with denoising score matching, an objective closely tied to a variational bound on log-likelihood. Knowledge distillation uses cross-entropy to transfer knowledge from large to small models. Many major ML paradigms rest on an information-theoretic quantity.
Feature Selection with Mutual Information
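Definition: Feature Selection via MI
For a feature $X_j$ and target $Y$, compute the mutual information:
$$I(X_j; Y) = \sum_{x}\sum_{y} p(x, y)\,\log\frac{p(x, y)}{p(x)\,p(y)}$$
Features with higher MI are more informative. Select the top-$k$ features or use a threshold.
MI-Based Feature Ranking
$$I(X_j; Y) = H(Y) - H(Y \mid X_j)$$
Here,
- $X_j$ = j-th feature
- $Y$ = Target variable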
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression
from sklearn.datasets import make_classification
import numpy as np

# Classification example: 20 features, 5 informative and 5 redundant
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           n_redundant=5, random_state=42)
mi_scores = mutual_info_classif(X, y, random_state=42)

# Rank features from highest to lowest MI
feature_ranking = np.argsort(mi_scores)[::-1]
print("Feature ranking by MI:")
for rank, idx in enumerate(feature_ranking[:10]):
    print(f" Rank {rank+1}: Feature {idx} (MI = {mi_scores[idx]:.4f})")

# Select top-k features
k = 5
selected_features = feature_ranking[:k]
X_selected = X[:, selected_features]
print(f"\nSelected features: {selected_features}")
ℹ️ Why MI Over Correlation?
MI captures nonlinear dependencies that correlation misses. For example, $Y = X^2$ has zero correlation with $X$ (if the distribution of $X$ is symmetric around zero) but high MI. MI is also model-agnostic: you don't need to assume a specific functional relationship.
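The contrast between correlation and MI can be checked numerically. The sketch below is only illustrative: it uses a simple histogram (plug-in) MI estimator written for this example, not scikit-learn's nearest-neighbor estimator, and the helper name `histogram_mi`, the bin count, and the sample size are all assumptions.

```python
import numpy as np

def histogram_mi(x, y, bins=20):
    """Plug-in estimate of I(X; Y) in nats from a 2D histogram (illustrative helper)."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = joint / joint.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)   # marginal of X, shape (bins, 1)
    p_y = p_xy.sum(axis=0, keepdims=True)   # marginal of Y, shape (1, bins)
    nonzero = p_xy > 0                      # skip empty cells to avoid log(0)
    return float(np.sum(p_xy[nonzero] * np.log(p_xy[nonzero] / (p_x @ p_y)[nonzero])))

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=5000)       # symmetric around zero
y_quadratic = x ** 2                        # fully determined by x, but not linearly
y_independent = rng.uniform(-1.0, 1.0, size=5000)

corr_quadratic = np.corrcoef(x, y_quadratic)[0, 1]
mi_quadratic = histogram_mi(x, y_quadratic)
mi_independent = histogram_mi(x, y_independent)

print(f"corr(x, x^2)        = {corr_quadratic:+.3f}")
print(f"MI(x, x^2)          = {mi_quadratic:.3f} nats")
print(f"MI(x, independent)  = {mi_independent:.3f} nats")
```

The correlation with `x ** 2` should come out near zero while its MI estimate is large; the independent variable's estimate stays close to zero, apart from the small positive bias of histogram estimators.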
Information Bottleneck
Definition: Information Bottleneck Principle
The information bottleneck method finds a compressed representation $T$ of input $X$ that preserves maximum information about target $Y$:
$$\min_{p(t \mid x)} \; I(X; T) - \beta\, I(T; Y)$$
where $\beta$ controls the trade-off between compression (low $I(X; T)$) and prediction (high $I(T; Y)$).
Information Bottleneck Objective
$$\mathcal{L}_{IB} = I(X; T) - \beta\, I(T; Y)$$
Here,
- $I(X; T)$ = Compression: how much of X is retained in T
- $I(T; Y)$ = Prediction: how much T tells us about Y
- $\beta$ = Trade-off parameter
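For small discrete problems the objective $I(X;T) - \beta\, I(T;Y)$ can be evaluated exactly. The sketch below is a toy illustration, not an optimizer: the joint table `p_xy`, the two hand-written encoders, and the value of `beta` are all hypothetical. It compares an encoder that copies the input with one that merges inputs sharing the same label distribution.

```python
import numpy as np

def mutual_information(p_joint):
    """I(A; B) in nats for a joint probability table p_joint[a, b]."""
    p_a = p_joint.sum(axis=1, keepdims=True)
    p_b = p_joint.sum(axis=0, keepdims=True)
    mask = p_joint > 0                       # skip zero cells to avoid log(0)
    return float(np.sum(p_joint[mask] * np.log(p_joint[mask] / (p_a @ p_b)[mask])))

def ib_objective(p_xy, p_t_given_x, beta):
    """I(X;T) - beta * I(T;Y) for an encoder p(t|x), assuming T depends on Y only through X."""
    p_x = p_xy.sum(axis=1)                   # p(x)
    p_xt = p_x[:, None] * p_t_given_x        # p(x, t) = p(x) p(t|x)
    p_ty = p_t_given_x.T @ p_xy              # p(t, y) = sum_x p(t|x) p(x, y)
    return mutual_information(p_xt) - beta * mutual_information(p_ty)

# Hypothetical joint p(x, y): inputs 0 and 1 mostly carry label 0, inputs 2 and 3 mostly label 1.
p_xy = np.array([[0.225, 0.025],
                 [0.225, 0.025],
                 [0.025, 0.225],
                 [0.025, 0.225]])

identity_encoder = np.eye(4)                 # T = X: no compression
merging_encoder = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)  # T keeps only the group

beta = 2.0
ib_identity = ib_objective(p_xy, identity_encoder, beta)
ib_merged = ib_objective(p_xy, merging_encoder, beta)
print(f"Identity encoder: {ib_identity:.4f}")
print(f"Merging encoder:  {ib_merged:.4f}")
```

Both encoders keep the same information about $Y$, because the merged inputs have identical label distributions. The merging encoder retains less of $X$, so it scores lower (better) for any $\beta$.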
ℹ️ Deep Learning Connection
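One influential hypothesis holds that deep neural networks implicitly optimize the information bottleneck. Under this view, successive layers compress the input (reduce $I(X; T)$) while preserving task-relevant information (keep $I(T; Y)$ high). This is offered as one explanation of why deep networks generalize well: they compress away irrelevant noise. The claim remains debated, because the observed compression depends on the activation function and on how mutual information is estimated.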
Decision Trees and Information Gain
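Definition: Information Gain
For a dataset $D$ and feature $A$ with values $\{v_1, \dots, v_k\}$:
$$IG(D, A) = H(D) - \sum_{v} \frac{|D_v|}{|D|}\, H(D_v)$$
where $D_v$ is the subset of $D$ where feature $A$ has value $v$.
Information Gain
$$IG(D, A) = H(D) - \sum_{v} \frac{|D_v|}{|D|}\, H(D_v)$$
Here,
- $H(D)$ = Entropy of the dataset before split
- $H(D_v)$ = Entropy of subset D_v after split
- $|D_v| / |D|$ = Fraction of samples in subset D_v
Theorem: Gain Ratio
Information gain is biased toward features with many values. The gain ratio normalizes:
$$GR(D, A) = \frac{IG(D, A)}{IV(A)}$$
where $IV(A) = -\sum_{v} \frac{|D_v|}{|D|} \log_2 \frac{|D_v|}{|D|}$ is the intrinsic information of the split.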
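A small hand computation shows why the normalization matters. The sketch below uses hypothetical toy data: `weather` predicts the label perfectly with two values, while `row_id` is unique per row and therefore also yields pure subsets. The function names are illustrative.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy in bits of the empirical distribution of `values`."""
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in Counter(values).values())

def information_gain(feature, labels):
    """H(D) minus the size-weighted entropy of each subset D_v."""
    total = len(labels)
    weighted = 0.0
    for v in set(feature):
        subset = [y for f, y in zip(feature, labels) if f == v]
        weighted += len(subset) / total * entropy(subset)
    return entropy(labels) - weighted

def gain_ratio(feature, labels):
    """Information gain divided by the intrinsic information of the split."""
    intrinsic = entropy(feature)   # entropy of the split sizes themselves
    return information_gain(feature, labels) / intrinsic if intrinsic > 0 else 0.0

# Hypothetical toy dataset
labels = ["yes", "yes", "no", "no", "yes", "no", "yes", "no"]
weather = ["sun", "sun", "rain", "rain", "sun", "rain", "sun", "rain"]
row_id = [f"r{i}" for i in range(8)]   # one distinct value per row

ig_weather, ig_row_id = information_gain(weather, labels), information_gain(row_id, labels)
gr_weather, gr_row_id = gain_ratio(weather, labels), gain_ratio(row_id, labels)
print(f"IG: weather={ig_weather:.3f}, row_id={ig_row_id:.3f}")
print(f"GR: weather={gr_weather:.3f}, row_id={gr_row_id:.3f}")
```

Both features reach the maximum information gain of 1 bit, so information gain alone cannot tell the useful feature from the identifier. The gain ratio penalizes `row_id` for its 3 bits of intrinsic information.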
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load data
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=42)

# Train decision tree. scikit-learn defaults to Gini impurity, so request
# criterion="entropy" to split on information gain.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=42)
tree.fit(X_train, y_train)
print(f"Training accuracy: {tree.score(X_train, y_train):.3f}")
print(f"Test accuracy: {tree.score(X_test, y_test):.3f}")

# Feature importances (normalized total entropy reduction per feature)
print("\nFeature importances:")
for name, importance in zip(iris.feature_names, tree.feature_importances_):
    print(f" {name}: {importance:.4f}")
Variational Autoencoders (VAEs)
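Definition: VAE Evidence Lower Bound (ELBO)
The VAE maximizes the ELBO:
$$\text{ELBO}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - D_{KL}\left(q_\phi(z \mid x) \,\|\, p(z)\right)$$
where $q_\phi(z \mid x)$ is the encoder and $p_\theta(x \mid z)$ is the decoder.
VAE Loss
$$\mathcal{L}_{VAE} = -\mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] + D_{KL}\left(q_\phi(z \mid x) \,\|\, p(z)\right)$$
Here,
- $q_\phi(z \mid x)$ = Encoder distribution (approximate posterior)
- $p_\theta(x \mid z)$ = Decoder distribution (likelihood)
- $p(z)$ = Prior (usually N(0, I))
VAE KL Term (Diagonal Gaussian)
$$D_{KL} = -\frac{1}{2} \sum_{j=1}^{d} \left(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\right)$$
Here,
- $\mu_j$ = Mean of encoder output for latent dimension j
- $\sigma_j^2$ = Variance of encoder output for latent dimension j
- $d$ = Dimensionality of latent space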
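The closed-form KL term for a diagonal Gaussian encoder against a standard normal prior can be sanity-checked against a Monte Carlo estimate of $\mathbb{E}_q[\log q(z) - \log p(z)]$. The sketch below uses hypothetical encoder outputs `mu` and `log_var`; the sample count is arbitrary.

```python
import numpy as np

def kl_diag_gaussian(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ) in nats."""
    return float(-0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var)))

# Hypothetical encoder outputs for a 2-dimensional latent space
mu = np.array([0.5, -1.0])
log_var = np.log(np.array([0.8, 1.5]))

kl_closed = kl_diag_gaussian(mu, log_var)

# Monte Carlo check: sample z ~ q and average log q(z) - log p(z)
rng = np.random.default_rng(0)
std = np.exp(0.5 * log_var)
z = mu + std * rng.standard_normal((200_000, 2))
log_q = -0.5 * np.sum(((z - mu) / std) ** 2 + log_var + np.log(2 * np.pi), axis=1)
log_p = -0.5 * np.sum(z ** 2 + np.log(2 * np.pi), axis=1)
kl_monte_carlo = float(np.mean(log_q - log_p))

print(f"Closed form: {kl_closed:.4f} nats")
print(f"Monte Carlo: {kl_monte_carlo:.4f} nats")
```

The two numbers should agree up to sampling noise. The closed form is exactly zero when the encoder output matches the prior (`mu = 0`, `log_var = 0`).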
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, input_dim, hidden_dim, latent_dim):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
        )
        # Two heads: mean and log-variance of q(z|x)
        self.mu = nn.Linear(hidden_dim, latent_dim)
        self.log_var = nn.Linear(hidden_dim, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, input_dim),
            nn.Sigmoid(),  # outputs in [0, 1] for the Bernoulli likelihood
        )

    def reparameterize(self, mu, log_var):
        # z = mu + sigma * eps keeps sampling differentiable w.r.t. mu and log_var
        std = torch.exp(0.5 * log_var)
        eps = torch.randn_like(std)
        return mu + eps * std

    def forward(self, x):
        h = self.encoder(x)
        mu, log_var = self.mu(h), self.log_var(h)
        z = self.reparameterize(mu, log_var)
        x_hat = self.decoder(z)
        return x_hat, mu, log_var

    def loss(self, x, x_hat, mu, log_var):
        # Negative ELBO, summed over the batch: reconstruction term + KL term
        recon = nn.functional.binary_cross_entropy(x_hat, x, reduction='sum')
        kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
        return recon + kl

# Example usage
model = VAE(input_dim=784, hidden_dim=256, latent_dim=16)
x = torch.rand(32, 784)  # BCE targets must lie in [0, 1], so use rand rather than randn
x_hat, mu, log_var = model(x)
loss = model.loss(x, x_hat, mu, log_var)
print(f"VAE loss: {loss.item():.2f}")
ℹ️ Why KL in VAEs?
The KL term ensures the latent space is well-structured: close points in latent space decode to similar outputs. Without it, the encoder could memorize inputs, producing a degenerate latent space.
Diffusion Models
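Definition: Denoising Score Matching
Diffusion models learn a score function $\nabla_{x_t} \log p_t(x_t)$ by training a neural network to predict the noise added at each step:
$$\mathcal{L} = \mathbb{E}_{x_0, \epsilon, t}\left[\left\| \epsilon - \epsilon_\theta(x_t, t) \right\|^2\right]$$
where $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$ and $\epsilon \sim \mathcal{N}(0, I)$.
Diffusion Training Objective
$$\mathcal{L} = \mathbb{E}_{x_0, \epsilon, t}\left[\left\| \epsilon - \epsilon_\theta(x_t, t) \right\|^2\right]$$
Here,
- $x_t$ = Noised data at timestep t
- $\epsilon$ = Added noise
- $\epsilon_\theta$ = Noise prediction network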
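Theorem: Connection to Information Theory
The diffusion process destroys information gradually: the mutual information $I(x_0; x_t)$ between clean and noised data decreases as $t$ increases. The reverse process recovers information. The training objective is a reweighted variational bound on the data log-likelihood $\log p_\theta(x_0)$, so minimizing it lowers an upper bound on the cross-entropy between the data distribution and the model.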
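The loss of information along the forward process has a closed form in one special case. Assuming scalar data $x_0 \sim \mathcal{N}(0, 1)$ (an assumption made only for this sketch), $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$ is a Gaussian channel with signal-to-noise ratio $\bar{\alpha}_t / (1 - \bar{\alpha}_t)$, so $I(x_0; x_t) = -\tfrac{1}{2} \log(1 - \bar{\alpha}_t)$. The linear beta schedule is illustrative.

```python
import numpy as np

# Linear noise schedule (illustrative values)
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)

# Gaussian-channel mutual information in nats, valid under the x_0 ~ N(0, 1) assumption:
# I(x_0; x_t) = 0.5 * log(1 + SNR_t) with SNR_t = alpha_bar_t / (1 - alpha_bar_t)
mi_nats = -0.5 * np.log(1.0 - alpha_bar)

for step in (0, 99, 499, 999):
    print(f"t={step:4d}  alpha_bar={alpha_bar[step]:.5f}  I(x_0; x_t)={mi_nats[step]:.5f} nats")
```

Under this assumption the mutual information falls monotonically and is close to zero by the final step, where $x_t$ is almost pure noise.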
import math
import torch
import torch.nn as nn

def timestep_embedding(t, dim):
    """Sinusoidal embedding of integer timesteps (an assumed choice; any timestep embedding works)."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, device=t.device) / half)
    args = t.float().unsqueeze(-1) * freqs              # (batch, half)
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)  # (batch, dim)

class SimpleDiffusion(nn.Module):
    def __init__(self, dim, hidden_dim, t_dim=32):
        super().__init__()
        self.t_dim = t_dim
        self.time_embed = nn.Sequential(
            nn.Linear(t_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        self.net = nn.Sequential(
            nn.Linear(dim + hidden_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, dim),
        )

    def forward(self, x_t, t):
        # Embed the timestep deterministically, then condition the network on it
        t_emb = self.time_embed(timestep_embedding(t, self.t_dim))
        h = torch.cat([x_t, t_emb], dim=-1)
        return self.net(h)

# Example: forward diffusion
def forward_diffusion(x_0, t, betas):
    alpha_bar = torch.cumprod(1 - betas, dim=0)
    a = alpha_bar[t].unsqueeze(-1)  # (batch, 1) so it broadcasts over feature dims
    noise = torch.randn_like(x_0)
    x_t = torch.sqrt(a) * x_0 + torch.sqrt(1 - a) * noise
    return x_t, noise

# Training step
betas = torch.linspace(1e-4, 0.02, 1000)
model = SimpleDiffusion(dim=64, hidden_dim=128)
x_0 = torch.randn(16, 64)
t = torch.randint(0, 1000, (16,))
x_t, noise = forward_diffusion(x_0, t, betas)
predicted_noise = model(x_t, t)
loss = nn.functional.mse_loss(predicted_noise, noise)
print(f"Diffusion loss: {loss.item():.4f}")
Knowledge Distillation
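Definition: Knowledge Distillation
Knowledge distillation trains a small student model to mimic a large teacher model by minimizing:
$$\mathcal{L}_{KD} = \alpha\, \mathcal{L}_{CE}(y, p_s) + (1 - \alpha)\, \mathcal{L}_{CE}(p_t, p_s)$$
where $p_t$ and $p_s$ are the soft probability outputs of teacher and student, respectively.
Knowledge Distillation Loss
$$\mathcal{L}_{KD} = \alpha\, \mathcal{L}_{CE}(y, p_s) + (1 - \alpha)\, \mathcal{L}_{CE}(p_t, p_s)$$
Here,
- $\alpha$ = Weight balancing hard and soft labels
- $\mathcal{L}_{CE}(y, p_s)$ = Cross-entropy with ground truth labels
- $\mathcal{L}_{CE}(p_t, p_s)$ = Cross-entropy between teacher and student soft outputs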
ℹ️ Why Soft Labels?
Soft labels from the teacher encode inter-class relationships. For example, a teacher might output [0.8, 0.15, 0.05] for [cat, dog, car]. This tells the student that cat is more similar to dog than to car—knowledge that hard one-hot labels [1, 0, 0] cannot convey.
import torch
import torch.nn as nn

def distillation_loss(student_logits, teacher_logits, labels, temperature=3.0, alpha=0.7):
    """Compute knowledge distillation loss."""
    # Soften both distributions with the same temperature
    soft_student = nn.functional.log_softmax(student_logits / temperature, dim=1)
    soft_teacher = nn.functional.softmax(teacher_logits / temperature, dim=1)
    # KL(teacher || student) equals the soft cross-entropy minus the teacher's entropy,
    # which is constant w.r.t. the student; T^2 keeps gradient scale comparable across temperatures
    distill_loss = nn.functional.kl_div(
        soft_student, soft_teacher, reduction='batchmean'
    ) * (temperature ** 2)
    hard_loss = nn.functional.cross_entropy(student_logits, labels)
    return alpha * hard_loss + (1 - alpha) * distill_loss

# Example
teacher = nn.Linear(128, 10)
student = nn.Linear(128, 10)
x = torch.randn(32, 128)
labels = torch.randint(0, 10, (32,))
with torch.no_grad():  # the teacher is frozen
    teacher_logits = teacher(x)
student_logits = student(x)
loss = distillation_loss(student_logits, teacher_logits, labels)
print(f"Distillation loss: {loss.item():.4f}")
Information Theory in NLP
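Definition: Perplexity
Perplexity measures how well a language model predicts text:
$$\text{PPL} = \exp\left(-\frac{1}{N} \sum_{t=1}^{N} \log p_\theta(x_t \mid x_{<t})\right)$$
Lower perplexity = better model. For scale, GPT-3 (175B) reports a zero-shot perplexity of about 20.5 on Penn Treebank.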
ℹ️ Language Modeling as CE
Language modeling minimizes cross-entropy between the true next-token distribution and the model's prediction. Each token's loss is $-\log p_\theta(x_t \mid x_{<t})$. The sum over tokens gives the total sequence loss.
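The chain from per-token loss to sequence loss to perplexity can be traced by hand. The sketch below uses hypothetical probabilities that a model assigned to the correct next token at four positions; `perplexity` is an illustrative helper name.

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-probability assigned to the observed tokens."""
    token_losses = [-math.log(p) for p in token_probs]   # per-token cross-entropy in nats
    return math.exp(sum(token_losses) / len(token_losses))

# Hypothetical probabilities of the correct next token at each position
token_probs = [0.25, 0.5, 0.125, 0.5]

token_losses = [-math.log(p) for p in token_probs]
sequence_loss = sum(token_losses)                 # total sequence loss
mean_loss = sequence_loss / len(token_probs)      # what a mean-reduced CE loss reports
ppl = perplexity(token_probs)

print(f"Sequence loss: {sequence_loss:.4f} nats")
print(f"Mean loss:     {mean_loss:.4f} nats")
print(f"Perplexity:    {ppl:.4f}")

# A model that is uniform over 50 tokens has perplexity 50
uniform_ppl = perplexity([1 / 50] * 10)
print(f"Uniform-over-50 perplexity: {uniform_ppl:.1f}")
```

The uniform case shows why perplexity is read as an effective number of equally likely choices.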
import math
import torch
import torch.nn as nn

# Simplified language model training
vocab_size, d_model, seq_len = 10000, 256, 128
embed = nn.Embedding(vocab_size, d_model)  # token ids -> vectors
model = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
head = nn.Linear(d_model, vocab_size)

# Input: batch of token sequences
x = torch.randint(0, vocab_size, (4, seq_len))  # (batch, seq_len)
# Causal mask: position i may only attend to positions <= i
causal_mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
logits = head(model(embed(x), src_mask=causal_mask))  # (batch, seq_len, vocab_size)

# Target: next token, so the prediction at position i is scored against token i+1
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),
    x[:, 1:].reshape(-1),
)

# Perplexity
ppl = math.exp(loss.item())
print(f"Loss: {loss.item():.4f}, Perplexity: {ppl:.2f}")
Common Mistakes
| Mistake | Why It's Wrong | Correct Approach |
|---|---|---|
| Using correlation for feature selection | Misses nonlinear dependencies | Use mutual information |
| Ignoring information bottleneck in deep networks | May overfit to noise | Add noise/information constraints |
| Not using soft labels in distillation | Loses inter-class relationship info | Use temperature scaling for soft targets |
| Confusing perplexity with accuracy | Perplexity is exponential of CE | Lower perplexity ≠ higher accuracy |
| Assuming all information theory quantities are symmetric | KL divergence, cross-entropy, and conditional entropy are asymmetric (mutual information is symmetric) | Pay attention to argument order |
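The last row of the table can be verified on a two-outcome example. The sketch below uses hypothetical distributions `p` and `q` and a hypothetical joint table; it computes KL divergence in both directions and mutual information with the arguments swapped.

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) in nats for strictly positive discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

def mutual_information(p_joint):
    """I(A; B) computed as KL between the joint and the product of its marginals."""
    p_a = p_joint.sum(axis=1, keepdims=True)
    p_b = p_joint.sum(axis=0, keepdims=True)
    return kl_divergence(p_joint.ravel(), (p_a @ p_b).ravel())

# Hypothetical distributions
p = np.array([0.9, 0.1])
q = np.array([0.5, 0.5])
kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)
print(f"KL(p || q) = {kl_pq:.4f} nats")
print(f"KL(q || p) = {kl_qp:.4f} nats")

# Hypothetical joint p(x, y); transposing it swaps the roles of X and Y
p_xy = np.array([[0.4, 0.1],
                 [0.2, 0.3]])
mi_xy = mutual_information(p_xy)
mi_yx = mutual_information(p_xy.T)
print(f"I(X; Y) = {mi_xy:.4f} nats, I(Y; X) = {mi_yx:.4f} nats")
```

The two KL values differ, so argument order matters there, while the mutual information is unchanged by the swap.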
Interview Questions
Q1: How does MI help in feature selection? A: MI measures the reduction in uncertainty about $Y$ given feature $X_j$. Unlike correlation, it captures any dependency (linear, nonlinear, or non-monotonic). Features with higher MI are more informative.
Q2: What is the information bottleneck and why does it matter? A: It's a principle for learning compressed representations that retain task-relevant information. The objective $I(X; T) - \beta\, I(T; Y)$ balances compression and prediction. It has been proposed as an explanation of why deep networks generalize (they compress away noise), though that claim is still debated.
Q3: How do VAEs use KL divergence? A: The KL term regularizes the approximate posterior $q_\phi(z \mid x)$ to be close to the prior $p(z) = \mathcal{N}(0, I)$. This enables smooth interpolation and generation. Without it, the encoder could memorize.
Q4: Why does knowledge distillation use soft labels? A: Soft labels encode inter-class similarities. A teacher outputting [0.8, 0.15, 0.05] for [cat, dog, car] transfers knowledge about class relationships that hard one-hot labels cannot convey.
Q5: What's the connection between perplexity and cross-entropy? A: Perplexity $= \exp(H(p, q))$, the exponential of cross-entropy measured in nats. It's the effective number of equally likely tokens the model is choosing between. Lower perplexity = better predictions.
Practice Problems
📝Problem 1: Feature Selection
Problem: You have 100 features and a binary target. Features A and B both have high mutual information with the target, and Feature C has $I(C; Y) = 0$ bits. Which features should you select?
💡Solution: Feature Selection
Select Features A and B (high MI). Feature C has zero MI with the target, meaning it's independent and provides no information. In practice, you'd select all features above some threshold or the top- features.
📝Problem 2: VAE KL Term
Problem: For a VAE with encoder output , , compute the KL divergence term.
💡Solution: VAE KL
📝Problem 3: Distillation Temperature
Problem: Why does higher temperature in knowledge distillation produce softer probability distributions?
💡Solution: Temperature
Higher temperature divides the logits before softmax: $p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$. This flattens the distribution because the differences between logits are divided by $T$. As $T \to \infty$, the distribution becomes uniform. As $T \to 1$, it becomes the original sharp distribution.
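The flattening effect is easy to see numerically. The sketch below applies a temperature-scaled softmax to hypothetical teacher logits and reports the entropy of the result; the function names are illustrative.

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """softmax(logits / T), shifted by the max for numerical stability."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def entropy_nats(p):
    """Shannon entropy of a strictly positive distribution."""
    return float(-np.sum(p * np.log(p)))

logits = [4.0, 2.0, 0.5]   # hypothetical teacher logits for three classes

results = {}
for T in (1.0, 3.0, 100.0):
    probs = softmax_with_temperature(logits, T)
    results[T] = probs
    print(f"T={T:6.1f}  probs={np.round(probs, 3)}  entropy={entropy_nats(probs):.3f} nats")
```

Entropy rises with $T$ and approaches $\log 3$, the entropy of the uniform distribution over three classes.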
Quick Reference
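| Application | IT Concept | Formula |
|---|---|---|
| Feature Selection | Mutual Information | $I(X_j; Y) = H(Y) - H(Y \mid X_j)$ |
| Decision Trees | Information Gain | $IG(D, A) = H(D) - \sum_v \frac{\lvert D_v \rvert}{\lvert D \rvert} H(D_v)$ |
| VAE | KL Divergence | $D_{KL}(q_\phi(z \mid x) \,\Vert\, p(z))$ |
| Diffusion | Score Matching | $\mathbb{E}\left[\lVert \epsilon - \epsilon_\theta(x_t, t) \rVert^2\right]$ |
| Distillation | Cross-Entropy | $\alpha\, \mathcal{L}_{CE}(y, p_s) + (1 - \alpha)\, \mathcal{L}_{CE}(p_t, p_s)$ |
| Language Modeling | Perplexity | $\exp\left(-\frac{1}{N} \sum_t \log p_\theta(x_t \mid x_{<t})\right)$ |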
Cross-References
- 081 - Entropy: Foundation for information gain, perplexity, and all uncertainty measures.
- 082 - Mutual Information: Used directly in feature selection and the information bottleneck.
- 083 - KL Divergence: Central to VAEs, EM algorithm, and distribution matching.
- 084 - Cross-Entropy: The loss function for classification, distillation, and language modeling.
Summary
📋Key Takeaways
- Feature Selection: Mutual information captures any dependency between feature and target. It's model-agnostic and detects nonlinear relationships that correlation misses.
- Information Bottleneck: The principle learns compressed representations that retain task-relevant information. Deep networks are hypothesized to optimize this implicitly, which is one proposed (and still debated) explanation of their generalization ability.
- Decision Trees: Information gain measures how much a feature reduces entropy. Trees greedily split on the feature with highest gain.
- VAEs: The loss $-\mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)] + D_{KL}(q_\phi(z \mid x) \,\|\, p(z))$ balances reconstruction quality with latent space regularization. The KL term ensures smooth, structured latent spaces.
- Diffusion Models: Train by predicting noise: $\mathbb{E}\left[\| \epsilon - \epsilon_\theta(x_t, t) \|^2\right]$. The forward process gradually destroys information; the reverse process recovers it.
- Knowledge Distillation: The loss $\alpha\, \mathcal{L}_{CE}(y, p_s) + (1 - \alpha)\, \mathcal{L}_{CE}(p_t, p_s)$ transfers knowledge via soft labels that encode inter-class relationships.
- Language Modeling: Perplexity measures prediction quality. Lower perplexity = better model. Training minimizes cross-entropy between true and predicted token distributions.