Information Theory — Measuring Information

ℹ️ Why It Matters

Information theory quantifies information, measures surprise, and forms the basis for many ML loss functions (cross-entropy, KL divergence).

What is Information?

Definition: Information

Surprise = Information

If something very unlikely happens → HIGH information
If something very likely happens → LOW information

Information Content

I(x) = -\log_2(P(x))

Here,

$P(x)$ = probability of event $x$
$I(x)$ = information content of $x$ (in bits)

📝Example: Information Content

$P(\text{heads}) = 0.5 \rightarrow I(\text{heads}) = -\log_2(0.5) = 1$ bit
$P(\text{rolling a 6}) = 1/6 \rightarrow I(\text{rolling a 6}) = -\log_2(1/6) \approx 2.585$ bits
$P(\text{sun rises}) = 0.999 \rightarrow I(\text{sun rises}) = -\log_2(0.999) \approx 0.00144$ bits
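The three values above can be reproduced with a few lines of Python. This is a minimal sketch; the function name `information_content` is an illustrative choice, not something defined elsewhere in this lesson.

```python
import math


def information_content(p: float) -> float:
    """Self-information -log2(p) of an event with probability p, in bits."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in the interval (0, 1]")
    # log2(1/p) is the same quantity as -log2(p)
    return math.log2(1.0 / p)


if __name__ == "__main__":
    events = [("heads", 0.5), ("rolling a 6", 1 / 6), ("sun rises", 0.999)]
    for label, p in events:
        print(f"{label}: {information_content(p):.5f} bits")
```

Halving the probability of an event always adds exactly one bit, which is why a fair coin flip is the natural reference point for "1 bit".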

Entropy

Definition: Entropy

The average information (surprise) of a random variable.

Entropy

H(X) = -\sum_i P(x_i) \times \log_2(P(x_i))

Here,

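$H(X)$ = entropy of random variable $X$ (in bits)
$P(x_i)$ = probability of outcome $x_i$ (terms with $P(x_i) = 0$ contribute 0 by convention)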

📝Example: Entropy

Fair coin: $H = -(0.5 \times \log_2(0.5) + 0.5 \times \log_2(0.5)) = 1$ bit
Loaded coin (90% heads): $H = -(0.9 \times \log_2(0.9) + 0.1 \times \log_2(0.1)) \approx 0.469$ bits
Certain outcome: $H = 0$ bits

Properties:

$H \geq 0$ (entropy is never negative)
$H$ is maximized when all outcomes are equally likely, where it equals $\log_2 n$ for $n$ possible outcomes
$H$ measures uncertainty — more entropy = more uncertainty
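The coin examples and the properties listed above can be checked numerically. The following is a small sketch, assuming distributions are given as plain Python lists of probabilities; the helper name `entropy` is chosen here for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy in bits of a discrete distribution given as a list."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Outcomes with probability 0 are skipped: they contribute 0 by convention.
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)


if __name__ == "__main__":
    print("fair coin:    ", entropy([0.5, 0.5]))
    print("loaded coin:  ", round(entropy([0.9, 0.1]), 3))
    print("certain:      ", entropy([1.0, 0.0]))
    # A uniform die reaches the maximum log2(6); a skewed die falls below it.
    print("uniform die:  ", round(entropy([1 / 6] * 6), 3))
    print("skewed die:   ", round(entropy([0.5, 0.1, 0.1, 0.1, 0.1, 0.1]), 3))
```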

Cross-Entropy

Definition: Cross-Entropy

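The average number of bits needed to encode outcomes drawn from a true distribution $P$ using a code optimized for a predicted distribution $Q$. It equals $H(P)$ when $Q = P$ and grows as $Q$ moves away from $P$.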

Cross-Entropy

H(P, Q) = -\sum_i P(x_i) \times \log_2(Q(x_i))

Here,

$P$ = true distribution
$Q$ = predicted distribution
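To get a feel for the formula, the sketch below compares one true distribution against a close prediction and a poor one. The distributions and the function name `cross_entropy` are made up for illustration, and the code assumes $Q(x_i) > 0$ wherever $P(x_i) > 0$.

```python
import math


def cross_entropy(p, q):
    """Cross-entropy H(P, Q) in bits for two distributions given as lists.

    Assumes q[i] > 0 wherever p[i] > 0; otherwise the value is infinite.
    """
    return sum(pi * math.log2(1.0 / qi) for pi, qi in zip(p, q) if pi > 0)


if __name__ == "__main__":
    p = [0.9, 0.1]        # true distribution (the loaded coin)
    q_close = [0.8, 0.2]  # hypothetical prediction close to p
    q_poor = [0.1, 0.9]   # hypothetical prediction far from p

    print("H(P, P)       =", round(cross_entropy(p, p), 3))
    print("H(P, Q_close) =", round(cross_entropy(p, q_close), 3))
    print("H(P, Q_poor)  =", round(cross_entropy(p, q_poor), 3))
```

The smallest value is obtained when the prediction equals the true distribution, in which case cross-entropy reduces to the entropy of $P$.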

ℹ️ In ML

\text{Loss} = -\sum_i y_i \times \log(\hat{y}_i)

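$y$ = true labels (one-hot encoded)
$\hat{y}$ = predicted probabilities
$\log$ = natural logarithm in most ML frameworks, so the loss is measured in nats rather than bits; this only rescales $H(P, Q)$ by a constant factor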

💡 Why cross-entropy is used in classification

If the model predicts the correct class with high probability → low loss
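If the model assigns low probability to the correct class → high loss
It heavily penalizes confident wrong predictions, because the loss grows without bound as the predicted probability of the true class approaches 0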
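The behaviour described in these points can be seen on a single three-class example. This is a sketch in plain Python rather than any particular framework's loss function; `cross_entropy_loss` and the `eps` clamp are assumptions made here to keep the example self-contained and to avoid `log(0)`.

```python
import math


def cross_entropy_loss(y_true, y_pred, eps=1e-12):
    """Cross-entropy loss (natural log) for one example.

    y_true is a one-hot list, y_pred a list of predicted probabilities.
    Predictions are clamped to eps so that log(0) never occurs.
    """
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


if __name__ == "__main__":
    y = [0, 1, 0]  # the true class is index 1

    confident_right = cross_entropy_loss(y, [0.05, 0.90, 0.05])
    unsure = cross_entropy_loss(y, [0.30, 0.40, 0.30])
    confident_wrong = cross_entropy_loss(y, [0.90, 0.05, 0.05])

    print(f"confident and right: {confident_right:.3f}")
    print(f"unsure:              {unsure:.3f}")
    print(f"confident and wrong: {confident_wrong:.3f}")
```

Because the labels are one-hot, only the predicted probability of the true class enters the sum, so the loss is simply $-\log(\hat{y}_{\text{true}})$.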

KL Divergence

Definition: KL Divergence

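Measures the extra bits needed, on average, when outcomes from $P$ are encoded with a code optimized for $Q$. Informally, it is the information lost when $Q$ is used to approximate $P$.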

KL Divergence

KL(P \| Q) = \sum_i P(x_i) \times \log_2\left(\frac{P(x_i)}{Q(x_i)}\right)

Here,

$KL(P \| Q)$ = KL divergence of $P$ from $Q$, where $P$ is the true distribution and $Q$ is the approximation

KL(P \| Q) = H(P, Q) - H(P) = \text{Cross-Entropy} - \text{Entropy}
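This identity follows directly from the definitions of entropy and cross-entropy. A short derivation, assuming base-2 logarithms throughout so that it matches $H(X)$ and $H(P, Q)$ as defined earlier:

```latex
\begin{aligned}
KL(P \| Q) &= \sum_i P(x_i) \log_2\left(\frac{P(x_i)}{Q(x_i)}\right) \\
           &= \sum_i P(x_i) \log_2 P(x_i) - \sum_i P(x_i) \log_2 Q(x_i) \\
           &= -H(P) + H(P, Q) \\
           &= H(P, Q) - H(P)
\end{aligned}
```

When the true distribution $P$ is fixed, $H(P)$ is a constant, so minimizing cross-entropy with respect to $Q$ is the same as minimizing $KL(P \| Q)$.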

Properties:

$KL(P \| Q) \geq 0$ (always non-negative)
$KL(P \| Q) = 0$ if and only if $P = Q$
KL is NOT symmetric: in general, $KL(P \| Q) \neq KL(Q \| P)$
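These properties are easy to verify on a small example. The sketch below uses base-2 logarithms and two made-up distributions; the helper names are illustrative, and the code assumes $Q(x_i) > 0$ wherever $P(x_i) > 0$.

```python
import math


def entropy(p):
    """Shannon entropy H(P) in bits."""
    return sum(pi * math.log2(1.0 / pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    """Cross-entropy H(P, Q) in bits."""
    return sum(pi * math.log2(1.0 / qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence(p, q):
    """KL(P || Q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


if __name__ == "__main__":
    p = [0.9, 0.1]
    q = [0.5, 0.5]

    print("KL(P || Q)      =", round(kl_divergence(p, q), 3))
    print("H(P, Q) - H(P)  =", round(cross_entropy(p, q) - entropy(p), 3))
    print("KL(Q || P)      =", round(kl_divergence(q, p), 3))  # differs from KL(P || Q)
    print("KL(P || P)      =", kl_divergence(p, p))            # zero when P = Q
```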

Applications:

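Variational Autoencoders (VAE): the KL divergence between the learned latent distribution and the prior acts as a regularization term in the loss
Knowledge distillation: the student model is trained to match the teacher model's output distribution, often by minimizing a KL divergence between the two
GAN training: the original GAN objective is related to the Jensen-Shannon divergence, a symmetrized quantity built from KL divergence, between the real and generated distributions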

Mutual Information

Definition: Mutual Information

How much knowing one variable tells you about another.

Mutual Information

I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)

Here,

$I(X; Y)$ = mutual information between $X$ and $Y$
$H(X|Y)$ = conditional entropy, the uncertainty remaining in $X$ once $Y$ is known

I(X; Y) = \sum_{x,y} P(x,y) \times \log_2\left(\frac{P(x,y)}{P(x) \times P(y)}\right)

Interpretation:

$I(X; Y) = 0$: $X$ and $Y$ are independent (this holds if and only if they are independent)
$I(X; Y) = H(X)$: knowing $Y$ tells you everything about $X$, since $H(X|Y) = 0$
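Both extremes, and a case in between, can be computed from a joint probability table using the sum formula. This is a minimal sketch; the tables are invented for illustration, and `joint[x][y]` is assumed to hold $P(x, y)$.

```python
import math


def mutual_information(joint):
    """I(X; Y) in bits from a joint table where joint[x][y] = P(x, y)."""
    px = [sum(row) for row in joint]        # marginal P(x): sum over each row
    py = [sum(col) for col in zip(*joint)]  # marginal P(y): sum over each column
    return sum(
        pxy * math.log2(pxy / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0  # zero-probability cells contribute nothing
    )


if __name__ == "__main__":
    independent = [[0.25, 0.25], [0.25, 0.25]]  # two unrelated fair coins
    identical = [[0.5, 0.0], [0.0, 0.5]]        # Y is always equal to X
    correlated = [[0.4, 0.1], [0.1, 0.4]]       # Y agrees with X 80% of the time

    print("independent:", mutual_information(independent))
    print("identical:  ", mutual_information(identical))
    print("correlated: ", round(mutual_information(correlated), 3))
```

In the identical case the result equals $H(X) = 1$ bit, the entropy of a fair coin.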

Applications:

Feature selection: Choose features with high mutual information with the target
Information bottleneck: Compress input while retaining relevant information
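Clustering: Evaluate cluster quality by comparing cluster assignments with reference labels (for example, with normalized mutual information)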

Data Compression Connection

Shannon's Source Coding Theorem:

Theorem: Source Coding Theorem

You cannot losslessly compress data to fewer than $H(X)$ bits per symbol on average. $H(X)$ is the theoretical minimum average number of bits needed per symbol.

Huffman Coding:

Assign shorter codes to more frequent symbols
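Optimal among prefix-free codes that assign a whole number of bits to each symbol; the average code length $L$ satisfies $H(X) \leq L < H(X) + 1$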
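The link between Huffman coding and entropy can be checked by building a code and comparing its average length with $H(X)$. The sketch below computes only the code lengths, not the bit strings, using Python's standard `heapq` module; the symbol probabilities are invented for illustration, and the function assumes at least two symbols.

```python
import heapq
import math


def huffman_code_lengths(probs):
    """Return {symbol: code length in bits} for a Huffman code.

    probs maps each symbol to its probability. Assumes at least two symbols.
    """
    # Each heap entry is (probability, tie_breaker, {symbol: depth so far}).
    # The unique tie_breaker keeps Python from ever comparing the dicts.
    heap = [(p, i, {sym: 0}) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        # Merge the two least probable subtrees; every symbol inside them
        # moves one level deeper, so its code grows by one bit.
        p1, _, d1 = heapq.heappop(heap)
        p2, _, d2 = heapq.heappop(heap)
        merged = {sym: depth + 1 for sym, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]


def average_length(probs, lengths):
    """Expected code length in bits per symbol."""
    return sum(probs[s] * lengths[s] for s in probs)


def entropy(probs):
    """Shannon entropy in bits of a {symbol: probability} mapping."""
    return sum(p * math.log2(1.0 / p) for p in probs.values() if p > 0)


if __name__ == "__main__":
    for probs in (
        {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125},  # powers of 1/2
        {"a": 0.6, "b": 0.3, "c": 0.1},                 # not powers of 1/2
    ):
        lengths = huffman_code_lengths(probs)
        print(lengths,
              "average:", round(average_length(probs, lengths), 3),
              "entropy:", round(entropy(probs), 3))
```

When every probability is a power of 1/2, the average length matches the entropy exactly; otherwise it sits slightly above it, consistent with the source coding theorem.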

Applications in AI:

Neural network pruning: Remove redundant weights
Knowledge distillation: Compress large model into small one
Model compression for edge devices

📋Key Takeaways

Information measures surprise. $I(x) = -\log_2(P(x))$: rare events carry more information, and certain events carry none. With a base-2 logarithm, information is measured in bits.
Entropy is average surprise. $H(X) = -\sum_i P(x_i) \log_2(P(x_i))$ measures the uncertainty of a random variable — maximum when all outcomes are equally likely, zero when certain.
Cross-Entropy is THE classification loss. $H(P, Q) = -\sum_i P(x_i) \log_2(Q(x_i))$ measures how bad your predicted distribution $Q$ is compared to the true distribution $P$ — directly used as the loss function in neural network classification.
KL Divergence measures information loss. $KL(P \| Q) = H(P, Q) - H(P)$ quantifies how much information is lost when $Q$ approximates $P$. It is used in VAEs and knowledge distillation, and it underlies the Jensen-Shannon divergence that appears in the analysis of GAN training. Note: in general, $KL(P \| Q) \neq KL(Q \| P)$.
Mutual Information finds dependencies. $I(X; Y) = H(X) - H(X|Y)$ measures how much knowing one variable tells you about another — powerful for feature selection and evaluating clustering quality.
Shannon's theorem sets compression limits. You cannot losslessly compress data below $H(X)$ bits per symbol on average. The same idea of removing redundancy motivates practical model compression and pruning in neural networks.