Applications in Machine Learning

ℹ️ Why It Matters

Information theory is not just elegant math; it underpins many practical ML algorithms. Feature selection uses mutual information to rank variables. Decision trees use information gain to choose splits. VAEs penalize a KL divergence to learn structured latent spaces. Diffusion models are trained with a denoising objective derived from a variational (KL-based) bound. Knowledge distillation uses cross-entropy to transfer knowledge from large to small models. Information-theoretic quantities appear in most major ML paradigms.

Feature Selection with Mutual Information

Definition: Feature Selection via MI

For a feature $X_j$ and target $Y$ , compute the mutual information:

I(X_j; Y) = H(Y) - H(Y|X_j)

Features with higher MI are more informative. Select the top- $k$ features or use a threshold.

MI-Based Feature Ranking

\text{Score}(X_j) = I(X_j; Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)p(y)}

Here,

$X_j$ =j-th feature
$Y$ =Target variable
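As a concrete check of the two expressions above, the sketch below computes MI for a small, made-up joint distribution and confirms that the double-sum form agrees with $H(Y) - H(Y|X_j)$. The table `joint` and the helper names are illustrative assumptions, not part of any library.

```python
import math

# Hypothetical joint pmf p(x, y) for a binary feature and a binary target.
joint = {(0, 0): 0.4, (0, 1): 0.1,
         (1, 0): 0.1, (1, 1): 0.4}

def marginal(joint, axis):
    """Sum the joint pmf over the other variable."""
    out = {}
    for key, p in joint.items():
        out[key[axis]] = out.get(key[axis], 0.0) + p
    return out

def entropy(pmf):
    """Shannon entropy in bits of a dict-valued pmf."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

px, py = marginal(joint, 0), marginal(joint, 1)

# Double-sum form: sum_{x,y} p(x,y) log2[ p(x,y) / (p(x) p(y)) ]
mi_sum = sum(p * math.log2(p / (px[x] * py[y]))
             for (x, y), p in joint.items() if p > 0)

# Entropy form: H(Y) - H(Y|X), using H(Y|X) = H(X,Y) - H(X)
mi_entropy = entropy(py) - (entropy(joint) - entropy(px))

print(f"MI (sum form):     {mi_sum:.4f} bits")
print(f"MI (entropy form): {mi_entropy:.4f} bits")
```

Both forms give about 0.278 bits for this table. With real data the joint distribution has to be estimated first, which is where estimators such as the ones in scikit-learn come in.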

from sklearn.feature_selection import mutual_info_classif, mutual_info_regression
from sklearn.datasets import make_classification
import numpy as np

# Classification example
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           n_redundant=5, random_state=42)
mi_scores = mutual_info_classif(X, y, random_state=42)

# Rank features
feature_ranking = np.argsort(mi_scores)[::-1]
print("Feature ranking by MI:")
for rank, idx in enumerate(feature_ranking[:10]):
    print(f"  Rank {rank+1}: Feature {idx} (MI = {mi_scores[idx]:.4f})")

# Select top-k features
k = 5
selected_features = feature_ranking[:k]
X_selected = X[:, selected_features]
print(f"\nSelected features: {selected_features}")

ℹ️ Why MI Over Correlation?

MI captures nonlinear dependencies that correlation misses. For example, if $X$ is distributed symmetrically around zero, $Y = X^2$ has zero correlation with $X$ but high MI. MI is also model-agnostic: you don't need to assume a specific functional relationship.
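The $Y = X^2$ case can be worked out exactly for a small discrete example. The sketch below assumes $X$ is uniform on $\{-2, \ldots, 2\}$ (an illustrative choice) and computes both the Pearson correlation and the MI from the exact distribution rather than from samples.

```python
import math

# Assumption: X is uniform on {-2, ..., 2}; Y = X^2 is deterministic but non-monotonic.
xs = [-2, -1, 0, 1, 2]
p = 1 / len(xs)
pairs = [(x, x * x) for x in xs]

# Pearson correlation from exact moments.
mean_x = sum(p * x for x, _ in pairs)
mean_y = sum(p * y for _, y in pairs)
cov = sum(p * (x - mean_x) * (y - mean_y) for x, y in pairs)
var_x = sum(p * (x - mean_x) ** 2 for x, _ in pairs)
var_y = sum(p * (y - mean_y) ** 2 for _, y in pairs)
corr = cov / math.sqrt(var_x * var_y)

# Y is a function of X, so H(Y|X) = 0 and I(X; Y) = H(Y).
py = {}
for _, y in pairs:
    py[y] = py.get(y, 0.0) + p
mi = -sum(q * math.log2(q) for q in py.values())

print(f"correlation = {corr:.3f}, MI = {mi:.3f} bits")
```

The covariance cancels term by term because the distribution is symmetric, while the MI equals $H(Y) \approx 1.52$ bits.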

Information Bottleneck

Definition: Information Bottleneck Principle

The information bottleneck method finds a compressed representation $T$ of input $X$ that preserves maximum information about target $Y$ :

\min_{p(t|x)} I(X; T) - \beta \, I(T; Y)

where $\beta$ controls the trade-off between compression (low $I(X;T)$ ) and prediction (high $I(T;Y)$ ).

Information Bottleneck Objective

\min_{p(t|x)} I(X;T) - \beta I(T;Y)

Here,

$I(X;T)$ =Compression: how much of X is retained in T
$I(T;Y)$ =Prediction: how much T tells us about Y
$\beta$ =Trade-off parameter
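To make the trade-off concrete, the sketch below evaluates the objective for three hand-written encoders on a toy problem. Everything here is an illustrative assumption: $X$ is uniform on $\{0,1,2,3\}$, $Y = X \bmod 2$, and the encoders are deterministic tables $p(t|x)$. It only scores candidate encoders; it does not run the iterative information bottleneck algorithm.

```python
import math

# Hypothetical toy problem: X uniform on {0, 1, 2, 3}, target Y = X mod 2.
p_x = [0.25, 0.25, 0.25, 0.25]
p_y_given_x = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]

def mutual_info(p_a, p_b_given_a):
    """I(A; B) in bits from p(a) and the conditional table p(b|a)."""
    n_a, n_b = len(p_a), len(p_b_given_a[0])
    p_b = [sum(p_a[a] * p_b_given_a[a][b] for a in range(n_a)) for b in range(n_b)]
    total = 0.0
    for a in range(n_a):
        for b in range(n_b):
            joint = p_a[a] * p_b_given_a[a][b]
            if joint > 0:
                total += joint * math.log2(joint / (p_a[a] * p_b[b]))
    return total

def ib_terms(p_t_given_x):
    """Return (I(X;T), I(T;Y)) for an encoder p(t|x), assuming the Markov chain Y - X - T."""
    n_x, n_t = len(p_x), len(p_t_given_x[0])
    p_t = [sum(p_x[x] * p_t_given_x[x][t] for x in range(n_x)) for t in range(n_t)]
    # p(y|t) = sum_x p(y|x) p(t|x) p(x) / p(t); every t used below has p(t) > 0.
    p_y_given_t = [
        [sum(p_y_given_x[x][y] * p_t_given_x[x][t] * p_x[x] for x in range(n_x)) / p_t[t]
         for y in range(2)]
        for t in range(n_t)
    ]
    return mutual_info(p_x, p_t_given_x), mutual_info(p_t, p_y_given_t)

encoders = {
    "identity (T = X)": [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]],
    "parity (T = X mod 2)": [[1, 0], [0, 1], [1, 0], [0, 1]],
    "constant (T = 0)": [[1], [1], [1], [1]],
}

beta = 2.0
scores = {}
for name, enc in encoders.items():
    i_xt, i_ty = ib_terms(enc)
    scores[name] = i_xt - beta * i_ty
    print(f"{name:22s} I(X;T) = {i_xt:.2f}  I(T;Y) = {i_ty:.2f}  objective = {scores[name]:.2f}")

best = min(scores, key=scores.get)
print("Lowest objective:", best)
```

The parity encoder keeps the one bit that matters for $Y$ and drops the rest, so it attains the lowest objective of the three at $\beta = 2$. The identity encoder pays for an extra bit of $I(X;T)$ that buys no additional $I(T;Y)$.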

ℹ️ Deep Learning Connection

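The information bottleneck has been proposed as a lens on deep learning: hidden layers form successively compressed representations $T$ that discard input detail (lower $I(X; T)$) while keeping task-relevant information (high $I(T; Y)$). Under this view, generalization comes from compressing away irrelevant noise. The claim that networks implicitly optimize this objective is debated, and the observed compression depends on the architecture and on how MI is estimated.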

Decision Trees and Information Gain

Definition: Information Gain

For a dataset $D$ and feature $A$ with values $\{a_1, \ldots, a_V\}$ :

IG(D, A) = H(D) - \sum_{v=1}^{V} \frac{|D_v|}{|D|} H(D_v)

where $D_v$ is the subset of $D$ where feature $A$ has value $a_v$ .

Information Gain

IG(D, A) = H(D) - \sum_{v} \frac{|D_v|}{|D|} H(D_v)

Here,

$H(D)$ =Entropy of the dataset before split
$H(D_v)$ =Entropy of subset D_v after split
$|D_v|/|D|$ =Fraction of samples in subset D_v
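The formula translates almost line for line into code. The sketch below uses a made-up eight-row dataset (feature names and values are hypothetical) to compare a perfectly predictive feature with a weakly predictive one.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """IG(D, A) = H(D) - sum_v |D_v|/|D| * H(D_v)."""
    n = len(labels)
    subsets = {}
    for v, y in zip(feature_values, labels):
        subsets.setdefault(v, []).append(y)  # D_v: labels of rows where A == v
    remainder = sum(len(sub) / n * entropy(sub) for sub in subsets.values())
    return entropy(labels) - remainder

# Hypothetical toy data: 8 samples with a binary label.
labels  = ["yes", "yes", "yes", "no", "no", "no", "yes", "no"]
outlook = ["sun", "sun", "sun", "rain", "rain", "rain", "sun", "rain"]
windy   = [True, False, True, False, True, False, True, False]

ig_outlook = information_gain(outlook, labels)
ig_windy = information_gain(windy, labels)
print(f"IG(outlook) = {ig_outlook:.3f} bits")
print(f"IG(windy)   = {ig_windy:.3f} bits")
```

Here `outlook` separates the classes completely, so its gain equals the full dataset entropy of 1 bit, while `windy` yields about 0.189 bits. A greedy tree builder would split on `outlook` first.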

Theorem: Gain Ratio

Information gain is biased toward features with many distinct values. The gain ratio normalizes it by the entropy of the split itself:

GR(D, A) = \frac{IG(D, A)}{H_A(D)}

where $H_A(D) = -\sum_v \frac{|D_v|}{|D|} \log \frac{|D_v|}{|D|}$ is the intrinsic information of the split.
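The bias is easiest to see with a feature that is unique per row, such as an ID column. The sketch below, using hypothetical data, shows that such a feature ties a meaningful feature on information gain but scores much lower on gain ratio.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy in bits of a list of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def gain_and_ratio(feature_values, labels):
    """Return (IG, GR). Assumes the feature takes at least two values, so H_A(D) > 0."""
    n = len(labels)
    subsets = {}
    for v, y in zip(feature_values, labels):
        subsets.setdefault(v, []).append(y)
    ig = entropy(labels) - sum(len(s) / n * entropy(s) for s in subsets.values())
    split_info = entropy(feature_values)  # H_A(D): entropy of the split proportions
    return ig, ig / split_info

# Hypothetical data: 'group' is genuinely predictive, 'row_id' is unique per row.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
group  = ["a", "a", "a", "a", "b", "b", "b", "b"]
row_id = list(range(8))

ig_group, gr_group = gain_and_ratio(group, labels)
ig_id, gr_id = gain_and_ratio(row_id, labels)
print(f"group:  IG = {ig_group:.3f}, GR = {gr_group:.3f}")
print(f"row_id: IG = {ig_id:.3f}, GR = {gr_id:.3f}")
```

Both features reach an information gain of 1 bit, but `row_id` splits the data eight ways, so its intrinsic information is $\log_2 8 = 3$ bits and its gain ratio drops to 1/3.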

from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load data
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=42)

# Train decision tree (criterion="entropy" makes splits maximize information gain;
# scikit-learn's default criterion is "gini")
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=42)
tree.fit(X_train, y_train)

print(f"Training accuracy: {tree.score(X_train, y_train):.3f}")
print(f"Test accuracy: {tree.score(X_test, y_test):.3f}")

# Feature importances (normalized total impurity reduction contributed by each feature)
print("\nFeature importances:")
for name, importance in zip(iris.feature_names, tree.feature_importances_):
    print(f"  {name}: {importance:.4f}")

Variational Autoencoders (VAEs)

Definition: VAE Evidence Lower Bound (ELBO)

The VAE maximizes the ELBO:

\mathcal{L} = \mathbb{E}_{q(z|x)}[\log p(x|z)] - D_{KL}(q(z|x) \| p(z))

where $q(z|x)$ is the encoder and $p(x|z)$ is the decoder.

VAE Loss

\mathcal{L} = \underbrace{\mathbb{E}_{q(z|x)}[\log p(x|z)]}_{\text{reconstruction}} - \underbrace{D_{KL}(q(z|x) \| p(z))}_{\text{regularization}}

Here,

$q(z|x)$ =Encoder distribution (approximate posterior)
$p(x|z)$ =Decoder distribution (likelihood)
$p(z)$ =Prior (usually N(0, I))

VAE KL Term (Diagonal Gaussian)

D_{KL} = -\frac{1}{2} \sum_{j=1}^{J} \left(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\right)

Here,

$\mu_j$ =Mean of encoder output for latent dimension j
$\sigma_j^2$ =Variance of encoder output for latent dimension j
$J$ =Dimensionality of latent space
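The closed form follows from the definition of KL divergence in a few lines. The derivation below is a sketch for a single latent dimension, assuming $q(z|x) = \mathcal{N}(\mu, \sigma^2)$ and $p(z) = \mathcal{N}(0, 1)$; the $J$-dimensional result is the sum over independent dimensions.

```latex
\begin{align*}
D_{KL}(q \,\|\, p)
  &= \mathbb{E}_{q}\left[\log q(z) - \log p(z)\right] \\
  &= \mathbb{E}_{q}\left[-\tfrac{1}{2}\log(2\pi\sigma^2) - \frac{(z-\mu)^2}{2\sigma^2}
       + \tfrac{1}{2}\log(2\pi) + \frac{z^2}{2}\right] \\
  &= -\tfrac{1}{2}\log\sigma^2 - \tfrac{1}{2} + \tfrac{1}{2}\left(\mu^2 + \sigma^2\right)
     \qquad \text{using } \mathbb{E}_q[(z-\mu)^2] = \sigma^2,\ \mathbb{E}_q[z^2] = \mu^2 + \sigma^2 \\
  &= -\tfrac{1}{2}\left(1 + \log\sigma^2 - \mu^2 - \sigma^2\right)
\end{align*}
```

The term vanishes exactly when $\mu = 0$ and $\sigma^2 = 1$, that is, when the encoder output matches the prior.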

import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, input_dim, hidden_dim, latent_dim):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
        )
        self.mu = nn.Linear(hidden_dim, latent_dim)
        self.log_var = nn.Linear(hidden_dim, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, input_dim),
            nn.Sigmoid(),
        )

    def reparameterize(self, mu, log_var):
        std = torch.exp(0.5 * log_var)
        eps = torch.randn_like(std)
        return mu + eps * std

    def forward(self, x):
        h = self.encoder(x)
        mu, log_var = self.mu(h), self.log_var(h)
        z = self.reparameterize(mu, log_var)
        x_hat = self.decoder(z)
        return x_hat, mu, log_var

    def loss(self, x, x_hat, mu, log_var):
        # Negative ELBO: reconstruction term plus KL(q(z|x) || N(0, I)), summed over the batch
        recon = nn.functional.binary_cross_entropy(x_hat, x, reduction='sum')
        kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
        return recon + kl

# Example usage
model = VAE(input_dim=784, hidden_dim=256, latent_dim=16)
x = torch.rand(32, 784)  # values in [0, 1], as binary cross-entropy targets require
x_hat, mu, log_var = model(x)
loss = model.loss(x, x_hat, mu, log_var)
print(f"VAE loss: {loss.item():.2f}")

ℹ️ Why KL in VAEs?

The KL term $D_{KL}(q(z|x) \| p(z))$ ensures the latent space is well-structured: close points in latent space decode to similar outputs. Without it, the encoder could memorize inputs, producing a degenerate latent space.

Diffusion Models

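Definition: Denoising Score Matching

Diffusion models learn the score function $\nabla_{x_t} \log p_t(x_t)$ of the noised data distribution at each noise level by training a neural network $\epsilon_\theta(x_t, t)$ to predict the noise added at each step (the two are related by $\nabla_{x_t} \log p_t(x_t) \approx -\epsilon_\theta(x_t, t) / \sqrt{1 - \bar{\alpha}_t}$):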

\mathcal{L} = \mathbb{E}_{t, x_0, \epsilon}\left[\|\epsilon - \epsilon_\theta(x_t, t)\|^2\right]

where $x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon$ .

Diffusion Training Objective

\mathcal{L} = \mathbb{E}_{t \sim U(1,T), \epsilon \sim \mathcal{N}(0,I)} \left[\|\epsilon - \epsilon_\theta(x_t, t)\|^2\right]

Here,

$x_t$ =Noised data at timestep t
$\epsilon$ =Added noise
$\epsilon_\theta$ =Noise prediction network

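Theorem: Connection to Information Theory

The diffusion process destroys information gradually: $I(X_0; X_t)$ decreases as $t$ increases, and $X_T$ is close to pure noise, so $I(X_0; X_T) \approx 0$. The reverse process regenerates structure step by step. The training objective is a reweighted variational lower bound on the data log-likelihood $\log p_\theta(x_0)$, which decomposes into KL divergences between the forward-process posteriors $q(x_{t-1} | x_t, x_0)$ and the learned reverse transitions $p_\theta(x_{t-1} | x_t)$.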
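The decay of $I(X_0; X_t)$ can be computed in closed form in one special case. The sketch below assumes a scalar $X_0 \sim \mathcal{N}(0, 1)$ and a linear noise schedule (both are illustrative assumptions); the forward step is then a Gaussian channel whose mutual information is $\tfrac{1}{2}\ln(1 + \text{SNR})$.

```python
import math

# Assumed linear noise schedule with T = 1000 steps.
T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]

# alpha_bar_t = prod_{s <= t} (1 - beta_s)
alpha_bar = []
running = 1.0
for b in betas:
    running *= 1.0 - b
    alpha_bar.append(running)

# Assumption: scalar X_0 ~ N(0, 1). Then X_t = sqrt(ab) X_0 + sqrt(1 - ab) eps is a
# Gaussian channel with SNR = ab / (1 - ab), so I(X_0; X_t) = 0.5 * ln(1 + SNR) nats.
mi_nats = [0.5 * math.log(1.0 + ab / (1.0 - ab)) for ab in alpha_bar]

for t in (1, 10, 100, 500, 1000):
    print(f"t = {t:4d}: I(X_0; X_t) = {mi_nats[t - 1]:.5f} nats")
```

Under these assumptions the mutual information starts near 4.6 nats after the first step and falls monotonically to almost zero by the last step. Real data is not Gaussian, so the numbers differ, but the monotone decay follows from the data processing inequality along the forward Markov chain.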

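import math
import torch
import torch.nn as nn

class SimpleDiffusion(nn.Module):
    def __init__(self, dim, hidden_dim, t_dim=32):
        super().__init__()
        self.t_dim = t_dim  # size of the timestep features (must be even)
        self.time_embed = nn.Sequential(
            nn.Linear(t_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        self.net = nn.Sequential(
            nn.Linear(dim + hidden_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, dim),
        )

    def timestep_features(self, t):
        # Sinusoidal features of the integer timestep, shape (batch, t_dim)
        half = self.t_dim // 2
        idx = torch.arange(half, dtype=torch.float32, device=t.device)
        freqs = torch.exp(-math.log(10000.0) * idx / half)
        args = t.float().unsqueeze(-1) * freqs
        return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

    def forward(self, x_t, t):
        # Embed the timestep deterministically so the network knows the noise level
        t_emb = self.time_embed(self.timestep_features(t))
        h = torch.cat([x_t, t_emb], dim=-1)
        return self.net(h)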

# Example: forward diffusion
def forward_diffusion(x_0, t, betas):
    alpha_bar = torch.cumprod(1 - betas, dim=0)
    a = alpha_bar[t].unsqueeze(-1)  # (batch, 1) so it broadcasts over the feature dimension
    noise = torch.randn_like(x_0)
    x_t = torch.sqrt(a) * x_0 + torch.sqrt(1 - a) * noise
    return x_t, noise

# Training step
betas = torch.linspace(1e-4, 0.02, 1000)
model = SimpleDiffusion(dim=64, hidden_dim=128)
x_0 = torch.randn(16, 64)
t = torch.randint(0, 1000, (16,))
x_t, noise = forward_diffusion(x_0, t, betas)
predicted_noise = model(x_t, t)
loss = nn.functional.mse_loss(predicted_noise, noise)
print(f"Diffusion loss: {loss.item():.4f}")

Knowledge Distillation

Definition: Knowledge Distillation

Knowledge distillation trains a small student model to mimic a large teacher model by minimizing:

\mathcal{L} = \alpha \cdot H(y, \hat{y}_S) + (1-\alpha) \cdot H(p_T, p_S)

where $p_T$ and $p_S$ are the soft probability outputs of teacher and student, respectively, typically obtained by applying softmax to the logits divided by a temperature $T > 1$.

Knowledge Distillation Loss

L = \alpha \cdot CE(y, \hat{y}_S) + (1-\alpha) \cdot CE(p_T, p_S)

Here,

$\alpha$ =Weight balancing hard and soft labels
$CE(y, \hat{y}_S)$ =Cross-entropy with ground truth labels
$CE(p_T, p_S)$ =Cross-entropy between teacher and student soft outputs
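Implementations often replace the soft cross-entropy term with a KL divergence. The two differ only by the teacher's entropy, $CE(p_T, p_S) = H(p_T) + D_{KL}(p_T \| p_S)$, which does not depend on the student, so the gradients are the same. The sketch below checks the identity numerically on made-up probability vectors.

```python
import math

p_teacher = [0.80, 0.15, 0.05]  # hypothetical teacher soft targets
p_student = [0.60, 0.30, 0.10]  # hypothetical student prediction

# All three quantities in nats.
cross_entropy = -sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))
teacher_entropy = -sum(pt * math.log(pt) for pt in p_teacher)
kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student))

print(f"CE(p_T, p_S)        = {cross_entropy:.4f}")
print(f"H(p_T) + KL(p_T||p_S) = {teacher_entropy + kl:.4f}")
```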

ℹ️ Why Soft Labels?

Soft labels from the teacher encode inter-class relationships. For example, a teacher might output [0.8, 0.15, 0.05] for [cat, dog, car]. This tells the student that cat is more similar to dog than to car—knowledge that hard one-hot labels [1, 0, 0] cannot convey.

import torch
import torch.nn as nn

def distillation_loss(student_logits, teacher_logits, labels, temperature=3.0, alpha=0.7):
    """Compute knowledge distillation loss."""
    soft_student = nn.functional.log_softmax(student_logits / temperature, dim=1)
    soft_teacher = nn.functional.softmax(teacher_logits / temperature, dim=1)

    distill_loss = nn.functional.kl_div(
        soft_student, soft_teacher, reduction='batchmean'
    ) * (temperature ** 2)

    hard_loss = nn.functional.cross_entropy(student_logits, labels)

    return alpha * hard_loss + (1 - alpha) * distill_loss

# Example
teacher = nn.Linear(128, 10)
student = nn.Linear(128, 10)
x = torch.randn(32, 128)
labels = torch.randint(0, 10, (32,))

with torch.no_grad():
    teacher_logits = teacher(x)
student_logits = student(x)

loss = distillation_loss(student_logits, teacher_logits, labels)
print(f"Distillation loss: {loss.item():.4f}")

Information Theory in NLP

Definition: Perplexity

Perplexity measures how well a language model predicts text:

\text{PPL} = 2^{H(P, Q)} = 2^{-\frac{1}{N}\sum_i \log_2 q(w_i | w_{<i})}

Lower perplexity means a better model. For reference, GPT-3 reports a zero-shot perplexity of about 20.5 on Penn Treebank.

ℹ️ Language Modeling as CE

Language modeling minimizes cross-entropy between the true next-token distribution and the model's prediction. Each token's loss is $-\log q(w_i | w_{<i})$ . The sum over tokens gives the total sequence loss.
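The per-token losses turn into perplexity by averaging and exponentiating. The sketch below uses made-up token probabilities and shows that the base-2 and natural-log routes agree, and that a uniform model over $V$ tokens has perplexity $V$.

```python
import math

def perplexity(token_probs):
    """Perplexity from the model probabilities q(w_i | w_<i) of the observed tokens.

    Returns the value computed two ways: 2 ** (CE in bits) and exp(CE in nats).
    """
    n = len(token_probs)
    ce_bits = -sum(math.log2(q) for q in token_probs) / n
    ce_nats = -sum(math.log(q) for q in token_probs) / n
    return 2 ** ce_bits, math.exp(ce_nats)

# Hypothetical probabilities the model assigned to four observed tokens.
probs = [0.5, 0.25, 0.125, 0.125]
ppl_base2, ppl_base_e = perplexity(probs)
print(f"PPL via bits: {ppl_base2:.4f}, via nats: {ppl_base_e:.4f}")

# A model that is uniform over a 50-token vocabulary.
uniform_ppl, _ = perplexity([1 / 50] * 10)
print(f"Uniform model over 50 tokens: PPL = {uniform_ppl:.1f}")
```

The average loss for the first example is 2.25 bits per token, giving a perplexity of about 4.76 either way. This is why frameworks that report loss in nats convert with `exp` rather than `2 **`.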

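import math
import torch
import torch.nn as nn

# Simplified language model training (random tokens stand in for real text)
vocab_size = 10000
d_model = 256
embed = nn.Embedding(vocab_size, d_model)  # token ids -> vectors
model = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
head = nn.Linear(d_model, vocab_size)

# Input: batch of token sequences; targets are the inputs shifted by one position
tokens = torch.randint(0, vocab_size, (4, 129))  # (batch, seq_len + 1)
x, targets = tokens[:, :-1], tokens[:, 1:]       # each (batch, seq_len)

# Causal mask: position i may attend only to positions <= i
seq_len = x.size(1)
causal_mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
logits = head(model(embed(x), src_mask=causal_mask))  # (batch, seq_len, vocab_size)

# Cross-entropy (in nats) between the predictions and the next tokens
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size),
    targets.reshape(-1)
)

# Perplexity: exp of the loss in nats equals 2 ** (loss in bits)
ppl = math.exp(loss.item())
print(f"Loss: {loss.item():.4f}, Perplexity: {ppl:.2f}")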

Common Mistakes

Mistake	Why It's Wrong	Correct Approach
Using correlation for feature selection	Misses nonlinear dependencies	Use mutual information
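Ignoring representation compression in deep networks	Representations may retain input noise that is irrelevant to the target	Regularize the representation, e.g. inject noise or penalize $I(X;T)$ as in the information bottleneck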
Not using soft labels in distillation	Loses inter-class relationship info	Use temperature scaling for soft targets
Confusing perplexity with accuracy	Perplexity is the exponential of CE, not a fraction of correct predictions	Report both; lower perplexity does not guarantee higher accuracy
Assuming all information theory quantities are symmetric	KL divergence, cross-entropy, and conditional entropy are asymmetric (MI is symmetric)	Pay attention to argument order
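The asymmetry in the last row is easy to verify numerically. The sketch below computes KL divergence in both directions for two made-up Bernoulli distributions.

```python
import math

def kl_bits(p, q):
    """D_KL(p || q) in bits for discrete distributions given as lists."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.9, 0.1]  # hypothetical distributions
q = [0.5, 0.5]

kl_pq = kl_bits(p, q)
kl_qp = kl_bits(q, p)
print(f"KL(p || q) = {kl_pq:.4f} bits")
print(f"KL(q || p) = {kl_qp:.4f} bits")
```

The two directions give about 0.531 and 0.737 bits, so swapping the arguments changes the quantity being minimized.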

Interview Questions

Q1: How does MI help in feature selection? A: MI $I(X_j; Y)$ measures the reduction in uncertainty about $Y$ given feature $X_j$ . Unlike correlation, it captures any dependency (linear or nonlinear, monotonic or not). Features with higher MI are more informative.

Q2: What is the information bottleneck and why does it matter? A: It's a principle for learning compressed representations that retain task-relevant information. The objective $\min I(X;T) - \beta I(T;Y)$ balances compression and prediction. It offers one (debated) explanation for why deep networks generalize: they compress away noise.

Q3: How do VAEs use KL divergence? A: The KL term $D_{KL}(q(z|x) \| p(z))$ regularizes the latent space to be close to the prior $\mathcal{N}(0, I)$ . This enables smooth interpolation and generation. Without it, the encoder could memorize.

Q4: Why does knowledge distillation use soft labels? A: Soft labels encode inter-class similarities. A teacher outputting [0.8, 0.15, 0.05] for [cat, dog, car] transfers knowledge about class relationships that hard one-hot labels cannot convey.

Q5: What's the connection between perplexity and cross-entropy? A: Perplexity $= 2^{H(P,Q)}$ with $H$ in bits (equivalently $e^{H}$ with $H$ in nats), the exponential of cross-entropy. It's the effective number of equally likely tokens the model is choosing among. Lower perplexity = better predictions.

Practice Problems

📝Problem 1: Feature Selection

Problem: You have 100 features and a binary target. Feature A has $I(A; Y) = 0.5$ bits, Feature B has $I(B; Y) = 0.3$ bits, and Feature C has $I(C; Y) = 0.0$ bits. Which features should you select?

💡Solution: Feature Selection

Select Features A and B (high MI). Feature C has zero MI with the target, meaning it's independent and provides no information. In practice, you'd select all features above some threshold or the top- $k$ features.

📝Problem 2: VAE KL Term

Problem: For a VAE with encoder output $\mu = [0.5, -0.3]$ , $\log \sigma^2 = [-0.5, -1.0]$ , compute the KL divergence term.

💡Solution: VAE KL

D_{KL} = -\frac{1}{2}\sum_j (1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2)

= -\frac{1}{2}[(1 - 0.5 - 0.25 - 0.6065) + (1 - 1.0 - 0.09 - 0.3679)]

= -\frac{1}{2}[(-0.3565) + (-0.4579)] \approx 0.4072 \text{ nats}
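The hand calculation can be cross-checked by evaluating the same expression at full floating-point precision. This is a minimal sketch using only the values given in the problem.

```python
import math

mu = [0.5, -0.3]
log_var = [-0.5, -1.0]

# D_KL = -1/2 * sum_j (1 + log sigma_j^2 - mu_j^2 - sigma_j^2), with sigma_j^2 = exp(log_var_j)
kl = -0.5 * sum(1 + lv - m ** 2 - math.exp(lv) for m, lv in zip(mu, log_var))
print(f"KL = {kl:.4f} nats")
```

It evaluates to roughly 0.407 nats; small differences in the third decimal come from rounding $\sigma_j^2$ in the intermediate steps.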

📝Problem 3: Distillation Temperature

Problem: Why does higher temperature in knowledge distillation produce softer probability distributions?

💡Solution: Temperature

Higher temperature $T$ divides the logits before softmax: $\text{softmax}(z/T)$ . This flattens the distribution because the differences between logits are divided by $T$ . As $T \to \infty$ , the distribution becomes uniform. At $T = 1$ it is the original softmax distribution, and as $T \to 0$ it approaches a one-hot distribution on the largest logit.
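The flattening effect can be seen by applying softmax at several temperatures to the same logits. The sketch below uses made-up logits and reports the entropy of each resulting distribution as a measure of softness.

```python
import math

def softmax_with_temperature(logits, temperature):
    """softmax(z / T), computed stably by subtracting the largest scaled logit."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_bits(p):
    """Shannon entropy in bits of a probability vector."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

logits = [4.0, 2.0, 0.0]  # hypothetical teacher logits for three classes
entropies = []
for temperature in (0.5, 1.0, 3.0, 10.0):
    probs = softmax_with_temperature(logits, temperature)
    entropies.append(entropy_bits(probs))
    print(f"T = {temperature:4.1f}: probs = {[round(p, 3) for p in probs]}, "
          f"entropy = {entropies[-1]:.3f} bits")
```

Entropy rises with temperature toward the uniform limit of $\log_2 3 \approx 1.585$ bits, while low temperatures concentrate nearly all mass on the largest logit.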

Quick Reference

Application	IT Concept	Formula
Feature Selection	Mutual Information	$I(X_j; Y) = H(Y) - H(Y|X_j)$
Decision Trees	Information Gain	$IG(D, A) = H(D) - \sum_v \frac{|D_v|}{|D|} H(D_v)$
VAE	KL Divergence	$D_{KL}(q(z|x) \| p(z))$
Diffusion	Score Matching	$\|\epsilon - \epsilon_\theta(x_t, t)\|^2$
Distillation	Cross-Entropy	$\alpha \cdot CE(y, \hat{y}_S) + (1-\alpha) \cdot CE(p_T, p_S)$
Language Modeling	Perplexity	$\text{PPL} = 2^{H(P, Q)}$

Cross-References

081 - Entropy: Foundation for information gain, perplexity, and all uncertainty measures.
082 - Mutual Information: Used directly in feature selection and the information bottleneck.
083 - KL Divergence: Central to VAEs, EM algorithm, and distribution matching.
084 - Cross-Entropy: The loss function for classification, distillation, and language modeling.

Summary

📋Key Takeaways

Feature Selection: Mutual information $I(X_j; Y) = H(Y) - H(Y|X_j)$ captures any dependency between feature and target. It's model-agnostic and more powerful than correlation for selecting informative features.
Information Bottleneck: The principle $\min I(X;T) - \beta I(T;Y)$ learns compressed representations that retain task-relevant information. Deep networks have been argued to optimize this implicitly, which would help explain their generalization ability, though the claim is debated.
Decision Trees: Information gain $IG(D, A) = H(D) - \sum_v \frac{|D_v|}{|D|} H(D_v)$ measures how much a feature reduces entropy. Trees greedily split on the feature with highest gain.
VAEs: The loss $\mathcal{L} = \mathbb{E}[\log p(x|z)] - D_{KL}(q(z|x) \| p(z))$ balances reconstruction quality with latent space regularization. The KL term ensures smooth, structured latent spaces.
Diffusion Models: Train by predicting noise: $\|\epsilon - \epsilon_\theta(x_t, t)\|^2$ . The forward process gradually destroys information; the reverse process recovers it.
Knowledge Distillation: $L = \alpha \cdot CE(y, \hat{y}_S) + (1-\alpha) \cdot CE(p_T, p_S)$ transfers knowledge via soft labels that encode inter-class relationships.
Language Modeling: Perplexity $= 2^{H(P,Q)}$ measures prediction quality. Lower perplexity = better model. Training minimizes cross-entropy between true and predicted token distributions.