← Math|81 of 100
Information Theory

Entropy

Master Shannon entropy, conditional entropy, and their applications in ML and compression.

📂 Entropy📖 Lesson 81 of 100🎓 Free Course

Advertisement

Entropy

â„šī¸ Why It Matters

Entropy is the cornerstone of information theory. Introduced by Claude Shannon in 1948, it quantifies the fundamental uncertainty or "surprise" in a random variable. Every time you compress a file, train a neural network with cross-entropy loss, or split a decision tree, entropy is at work. Understanding entropy deeply gives you intuition for information gain, coding theory, loss function design, and the theoretical limits of compression.


Historical Context

â„šī¸ Shannon's Breakthrough

Before Shannon, "information" was an intuitive concept. Shannon formalized it: the information gained from observing an event with probability pp is −log⁡2p-\log_2 p. Rare events carry more information. Entropy is the expected information content across all possible outcomes of a random variable. This single idea unified communication, compression, and cryptography under a mathematical framework.


Core Definitions

DfShannon Entropy

The Shannon entropy of a discrete random variable XX with probability mass function p(x)p(x) is:

H(X)=−∑x∈Xp(x)log⁡2p(x)H(X) = -\sum_{x \in \mathcal{X}} p(x) \log_2 p(x)

where the sum is over all possible values of XX, and we adopt the convention 0log⁥20=00 \log_2 0 = 0. Entropy is measured in bits when using log⁥2\log_2.

DfDifferential Entropy

For a continuous random variable XX with probability density function f(x)f(x):

h(X)=−âˆĢ−∞∞f(x)log⁥2f(x) dxh(X) = -\int_{-\infty}^{\infty} f(x) \log_2 f(x) \, dx

Unlike discrete entropy, differential entropy can be negative and is not directly interpretable as "information content." However, differences in differential entropy (e.g., in rate-distortion theory) are meaningful.


Key Formulas

Shannon Entropy

H(X)=−∑xp(x)log⁡2p(x)H(X) = -\sum_{x} p(x) \log_2 p(x)

Here,

  • H(X)H(X)=Entropy of random variable X (in bits)
  • p(x)p(x)=Probability of outcome x
  • log2log_2=Logarithm base 2 (gives units in bits)

Entropy in Nats

H(X)=−∑xp(x)ln⁡p(x)H(X) = -\sum_{x} p(x) \ln p(x)

Here,

  • lnln=Natural logarithm (gives units in nats)
  • 1nat1 nat== 1 / ln(2) ≈ 1.4427 bits

Joint Entropy

H(X,Y)=−∑x,yp(x,y)log⁡2p(x,y)H(X, Y) = -\sum_{x,y} p(x,y) \log_2 p(x,y)

Here,

  • H(X,Y)H(X, Y)=Joint entropy of X and Y
  • p(x,y)p(x,y)=Joint probability of x and y

Conditional Entropy

H(YâˆŖX)=−∑x,yp(x,y)log⁥2p(yâˆŖx)H(Y|X) = -\sum_{x,y} p(x,y) \log_2 p(y|x)

Here,

  • H(YâˆŖX)H(Y|X)=Entropy of Y given X
  • p(yâˆŖx)p(y|x)=Conditional probability of y given x

Relative Entropy (KL Divergence)

DKL(PâˆĨQ)=∑xp(x)log⁥p(x)q(x)D_{KL}(P \| Q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}

Here,

  • DKLD_{KL}=Kullback-Leibler divergence
  • P,QP, Q=Two probability distributions

Properties and Theorems

ThNon-negativity

H(X)â‰Ĩ0H(X) \geq 0 for all discrete random variables XX. Equality holds if and only if XX is deterministic (one outcome has probability 1).

ThMaximum Entropy

For a discrete random variable with âˆŖXâˆŖ|\mathcal{X}| possible outcomes:

H(X)≤log⁥2âˆŖXâˆŖH(X) \leq \log_2 |\mathcal{X}|

Equality holds when XX is uniformly distributed over its support.

ThChain Rule for Entropy

H(X,Y)=H(X)+H(YâˆŖX)=H(Y)+H(XâˆŖY)H(X, Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)

The joint entropy decomposes into one marginal entropy plus one conditional entropy.

ThSubadditivity

H(X,Y)≤H(X)+H(Y)H(X, Y) \leq H(X) + H(Y)

Equality holds if and only if XX and YY are independent.

ThConditioning Reduces Entropy

H(YâˆŖX)≤H(Y)H(Y|X) \leq H(Y)

Knowing XX cannot increase uncertainty about YY. Equality holds iff XX and YY are independent.

ThShannon's Source Coding Theorem

No lossless compression scheme can compress a source with entropy H(X)H(X) to fewer than H(X)H(X) bits per symbol on average. Conversely, there exist codes that achieve average length arbitrarily close to H(X)H(X).


Worked Examples

📝Example 1: Fair Coin

Problem: Compute the entropy of a fair coin flip.

💡Solution: Fair Coin

A fair coin has p(heads)=p(tails)=0.5p(\text{heads}) = p(\text{tails}) = 0.5.

H(X)=−[0.5log⁡2(0.5)+0.5log⁡2(0.5)]=−[0.5(−1)+0.5(−1)]=1.0 bitH(X) = -\left[0.5 \log_2(0.5) + 0.5 \log_2(0.5)\right] = -\left[0.5(-1) + 0.5(-1)\right] = 1.0 \text{ bit}

This is the maximum entropy for a binary variable.

📝Example 2: Biased Coin

Problem: Compute the entropy of a coin with p(heads)=0.9p(\text{heads}) = 0.9.

💡Solution: Biased Coin

H(X)=−[0.9log⁡2(0.9)+0.1log⁡2(0.1)]=−[0.9(−0.152)+0.1(−3.322)]=0.469 bitsH(X) = -\left[0.9 \log_2(0.9) + 0.1 \log_2(0.1)\right] = -\left[0.9(-0.152) + 0.1(-3.322)\right] = 0.469 \text{ bits}

The biased coin has lower entropy because the outcome is more predictable.

📝Example 3: Loaded Die

Problem: Compute the entropy of a loaded die with probabilities [1/2,1/4,1/8,1/16,1/32,1/32][1/2, 1/4, 1/8, 1/16, 1/32, 1/32].

💡Solution: Loaded Die

H=−∑i=16pilog⁡2(pi)H = -\sum_{i=1}^{6} p_i \log_2(p_i)
=−[12(−1)+14(−2)+18(−3)+116(−4)+132(−5)+132(−5)]= -\left[\frac{1}{2}(-1) + \frac{1}{4}(-2) + \frac{1}{8}(-3) + \frac{1}{16}(-4) + \frac{1}{32}(-5) + \frac{1}{32}(-5)\right]
=0.5+0.5+0.375+0.25+0.15625+0.15625=1.9375 bits= 0.5 + 0.5 + 0.375 + 0.25 + 0.15625 + 0.15625 = 1.9375 \text{ bits}

Compare with a fair die: H=log⁡2(6)≈2.585H = \log_2(6) \approx 2.585 bits.

📝Example 4: Joint and Conditional Entropy

Problem: Let X,YX, Y be binary with joint distribution: p(0,0)=0.4p(0,0) = 0.4, p(0,1)=0.1p(0,1) = 0.1, p(1,0)=0.1p(1,0) = 0.1, p(1,1)=0.4p(1,1) = 0.4. Compute H(X)H(X), H(YâˆŖX)H(Y|X), and H(X,Y)H(X,Y).

💡Solution: Joint and Conditional

Marginals: p(X=0)=0.5p(X=0) = 0.5, p(X=1)=0.5p(X=1) = 0.5. So H(X)=1.0H(X) = 1.0 bit.

H(YâˆŖX=0)=−[0.8log⁥2(0.8)+0.2log⁥2(0.2)]=0.722H(Y|X=0) = -[0.8 \log_2(0.8) + 0.2 \log_2(0.2)] = 0.722 bits.

H(YâˆŖX=1)=−[0.2log⁥2(0.2)+0.8log⁥2(0.8)]=0.722H(Y|X=1) = -[0.2 \log_2(0.2) + 0.8 \log_2(0.8)] = 0.722 bits.

H(YâˆŖX)=∑xp(x)H(YâˆŖX=x)=0.5(0.722)+0.5(0.722)=0.722H(Y|X) = \sum_x p(x) H(Y|X=x) = 0.5(0.722) + 0.5(0.722) = 0.722 bits.

H(X,Y)=H(X)+H(YâˆŖX)=1.0+0.722=1.722H(X,Y) = H(X) + H(Y|X) = 1.0 + 0.722 = 1.722 bits.


Python Implementation

import numpy as np
from collections import Counter

def entropy(probs):
    """Compute Shannon entropy in bits from a probability distribution."""
    probs = np.array(probs)
    probs = probs[probs > 0]  # Avoid log(0)
    return -np.sum(probs * np.log2(probs))

def entropy_from_samples(samples):
    """Estimate entropy from a list of samples."""
    counter = Counter(samples)
    total = len(samples)
    probs = np.array([count / total for count in counter.values()])
    return entropy(probs)

def joint_entropy(joint_probs):
    """Compute joint entropy from a 2D joint probability matrix."""
    flat = joint_probs.flatten()
    return entropy(flat)

def conditional_entropy(joint_probs):
    """Compute H(Y|X) from a joint probability matrix."""
    h_y_given_x = 0.0
    for i in range(joint_probs.shape[0]):
        p_x = joint_probs[i].sum()
        if p_x > 0:
            h_y_given_x += p_x * entropy(joint_probs[i] / p_x)
    return h_y_given_x

# --- Examples ---
# Fair coin
print(f"Fair coin: {entropy([0.5, 0.5]):.3f} bits")          # 1.000

# Biased coin
print(f"Biased (0.9,0.1): {entropy([0.9, 0.1]):.3f} bits")  # 0.469

# Fair die
print(f"Fair die: {entropy([1/6]*6):.3f} bits")              # 2.585

# Loaded die
loaded = [1/2, 1/4, 1/8, 1/16, 1/32, 1/32]
print(f"Loaded die: {entropy(loaded):.3f} bits")              # 1.938

# Joint entropy example
joint = np.array([[0.4, 0.1], [0.1, 0.4]])
print(f"Joint entropy: {joint_entropy(joint):.3f} bits")      # 1.722
print(f"Conditional H(Y|X): {conditional_entropy(joint):.3f}")  # 0.722

# Entropy from samples
np.random.seed(42)
samples = np.random.choice(['A', 'B', 'C', 'D'], size=10000, p=[0.5, 0.25, 0.125, 0.125])
print(f"Estimated entropy: {entropy_from_samples(samples):.3f}")  # ~1.75

Applications in AI/ML

â„šī¸ Decision Trees

Decision trees use information gain to choose splits. For a feature AA splitting dataset DD into subsets DvD_v:

IG(D,A)=H(D)−∑vâˆŖDvâˆŖâˆŖDâˆŖH(Dv)IG(D, A) = H(D) - \sum_{v} \frac{|D_v|}{|D|} H(D_v)

The feature with highest IG is chosen. This is directly derived from entropy.

â„šī¸ Cross-Entropy Loss

Neural networks for classification minimize:

L=−1N∑i=1N∑c=1Cyiclog⁡p^ic\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N} \sum_{c=1}^{C} y_{ic} \log \hat{p}_{ic}

where yicy_{ic} is the true label and p^ic\hat{p}_{ic} is the predicted probability. This is the cross-entropy between the true distribution and the model's predicted distribution.

â„šī¸ Compression

Shannon's source coding theorem guarantees that the optimal average code length for a source is H(X)H(X) bits per symbol. Huffman coding and arithmetic coding approach this limit.

â„šī¸ Feature Selection

Mutual information I(X;Y)=H(Y)−H(YâˆŖX)I(X; Y) = H(Y) - H(Y|X) measures how much a feature XX reduces uncertainty about target YY. Features with higher MI are more informative.


Common Mistakes

MistakeWhy It's WrongCorrect Approach
Using log⁥\log without specifying baseUnits depend on baseUse log⁥2\log_2 for bits, ln⁥\ln for nats
Forgetting 0log⁥0=00 \log 0 = 0 conventionCauses NaN errorsFilter zero probabilities before computing
Confusing entropy with informationEntropy is expected information, not instantaneousEntropy averages over all outcomes
Assuming entropy is always positive for continuous RVsDifferential entropy can be negativeDifferential entropy is not directly comparable
Treating conditional entropy as "entropy minus something"H(YâˆĨX)≠H(Y)−H(X)H(Y\|X) \neq H(Y) - H(X)Use the chain rule: H(YâˆĨX)=H(X,Y)−H(X)H(Y\|X) = H(X,Y) - H(X)

Interview Questions

Q1: Why does a fair coin have maximum entropy? A: A fair coin has p=0.5p = 0.5 for both outcomes, which maximizes −∑plog⁡p-\sum p \log p subject to ∑p=1\sum p = 1. This is proven by Jensen's inequality or Lagrange multipliers.

Q2: Can entropy be negative? A: Discrete entropy H(X)â‰Ĩ0H(X) \geq 0 always. Differential entropy h(X)h(X) can be negative (e.g., a very concentrated Gaussian with small variance).

Q3: What's the relationship between entropy and compression? A: Shannon's source coding theorem: you cannot compress below H(X)H(X) bits per symbol losslessly. Entropy is the theoretical minimum average code length.

Q4: How is entropy used in decision trees? A: Information gain IG(D,A)=H(D)−∑vâˆŖDvâˆŖâˆŖDâˆŖH(Dv)IG(D, A) = H(D) - \sum_v \frac{|D_v|}{|D|} H(D_v) measures how much a feature reduces uncertainty. The feature with highest IG is selected for splitting.

Q5: Why use log⁥2\log_2 instead of ln⁥\ln? A: log⁥2\log_2 gives entropy in bits, which is intuitive for binary decisions. ln⁥\ln gives nats. The choice doesn't affect which distribution has higher entropy, only the units.


Practice Problems

📝Problem 1: Minimum Entropy

Problem: What is the minimum possible entropy for a binary random variable? Under what condition is it achieved?

💡Solution: Minimum Entropy

The minimum is H(X)=0H(X) = 0, achieved when p=0p = 0 or p=1p = 1 (deterministic outcome). One outcome has probability 1, so there is no uncertainty.

📝Problem 2: Entropy Ordering

Problem: Rank the following distributions by entropy (lowest to highest): (a) [0.99, 0.01], (b) [0.5, 0.5], (c) [0.7, 0.3], (d) [0.25, 0.25, 0.25, 0.25].

💡Solution: Entropy Ordering

(a) H≈0.081H \approx 0.081 bits, (c) H≈0.881H \approx 0.881 bits, (b) H=1.0H = 1.0 bit, (d) H=2.0H = 2.0 bits.

Ordering: (a) < (c) < (b) < (d). The more concentrated the distribution, the lower the entropy. The uniform distribution over 4 outcomes has the highest.

📝Problem 3: Chain Rule Application

Problem: H(X)=3H(X) = 3 bits, H(X,Y)=5H(X,Y) = 5 bits. What is H(YâˆŖX)H(Y|X)? What is H(YâˆŖX)H(Y|X) if XX and YY are independent?

💡Solution: Chain Rule

By the chain rule: H(YâˆŖX)=H(X,Y)−H(X)=5−3=2H(Y|X) = H(X,Y) - H(X) = 5 - 3 = 2 bits.

If XX and YY are independent: H(YâˆŖX)=H(Y)H(Y|X) = H(Y). We need H(Y)=H(X,Y)−H(X)=2H(Y) = H(X,Y) - H(X) = 2 bits only if independent. But we're told H(X,Y)=5H(X,Y) = 5 and H(X)=3H(X) = 3, so H(YâˆŖX)=2H(Y|X) = 2 regardless. If additionally independent, H(Y)=2H(Y) = 2 and H(X,Y)=H(X)+H(Y)=5H(X,Y) = H(X) + H(Y) = 5 ✓.


Quick Reference

QuantityFormulaUnits
Shannon EntropyH(X)=−∑xp(x)log⁡2p(x)H(X) = -\sum_x p(x) \log_2 p(x)bits
Joint EntropyH(X,Y)=−∑x,yp(x,y)log⁡2p(x,y)H(X,Y) = -\sum_{x,y} p(x,y) \log_2 p(x,y)bits
Conditional EntropyH(YâˆĨX)=H(X,Y)−H(X)H(Y\|X) = H(X,Y) - H(X)bits
Max Entropylog⁥2âˆĨXâˆĨ\log_2 \|\mathcal{X}\|bits
Information GainIG=H(D)−∑vâˆĨDvâˆĨâˆĨDâˆĨH(Dv)IG = H(D) - \sum_v \frac{\|D_v\|}{\|D\|} H(D_v)bits

Relationship to Other Quantities

â„šī¸ Entropy as the Foundation

Entropy is the building block for many other information-theoretic quantities:

  • Mutual Information: I(X;Y)=H(X)−H(XâˆŖY)I(X;Y) = H(X) - H(X|Y) — measures shared information.
  • KL Divergence: DKL(PâˆĨQ)=H(P,Q)−H(P)D_{KL}(P\|Q) = H(P,Q) - H(P) — measures distribution mismatch.
  • Cross-Entropy: H(P,Q)=H(P)+DKL(PâˆĨQ)H(P,Q) = H(P) + D_{KL}(P\|Q) — the loss function for classification.
  • Information Gain: IG(D,A)=H(D)−H(DâˆŖA)IG(D,A) = H(D) - H(D|A) — used in decision trees.
  • Perplexity: PPL=2H(P,Q)\text{PPL} = 2^{H(P,Q)} — used in language modeling.

ThData Processing Inequality

If X→Y→ZX \to Y \to Z forms a Markov chain, then H(ZâˆŖX)â‰ĨH(ZâˆŖY)H(Z|X) \geq H(Z|Y). Processing data cannot decrease uncertainty about a downstream variable more than the original data already does.

ThStrong Subadditivity

For three random variables X,Y,ZX, Y, Z:

H(X,Y,Z)+H(Y)≤H(X,Y)+H(Y,Z)H(X, Y, Z) + H(Y) \leq H(X, Y) + H(Y, Z)

This is the information-theoretic analog of the triangle inequality.

â„šī¸ Entropy in Physics

In statistical mechanics, entropy S=kBln⁡WS = k_B \ln W measures the number of microstates WW consistent with a macrostate. Shannon entropy H=−∑plog⁡pH = -\sum p \log p generalizes this to arbitrary probability distributions. The connection: Boltzmann entropy is Shannon entropy when all microstates are equally likely.


Common Confusions

â„šī¸ Entropy vs Information

Entropy is the expected information content, not the information content of a single event. A single rare event has high information (−log⁡p-\log p), but entropy averages over all events weighted by their probabilities.

â„šī¸ Entropy vs Randomness

Entropy measures uncertainty, not randomness in the colloquial sense. A fair coin has high entropy (1 bit), while a biased coin (p=0.99p = 0.99) has low entropy (0.08 bits), even though both are "random" in the sense that outcomes are unpredictable.


Cross-References

  • 082 - Mutual Information — I(X;Y)=H(X)−H(XâˆŖY)I(X;Y) = H(X) - H(X|Y) — measures shared information between variables.
  • 083 - KL Divergence: DKL(PâˆĨQ)=H(P,Q)−H(P)D_{KL}(P\|Q) = H(P,Q) - H(P) — measures distribution mismatch.
  • 084 - Cross-Entropy — H(P,Q)=H(P)+DKL(PâˆĨQ)H(P,Q) = H(P) + D_{KL}(P\|Q) — used as loss in classification.
  • 085 - Applications — Information gain in decision trees, feature selection with MI, VAE loss functions.

Summary

📋Key Takeaways

  • Shannon Entropy: H(X)=−∑xp(x)log⁥2p(x)H(X) = -\sum_{x} p(x) \log_2 p(x) measures the average information content (in bits) of a random variable. Higher entropy means more uncertainty and more information per observation.

  • Entropy Bounds: Entropy is always non-negative H(X)â‰Ĩ0H(X) \geq 0, equals 0 for deterministic outcomes, and is maximized at log⁥2âˆŖXâˆŖ\log_2 |X| when all outcomes are equally likely (uniform distribution).

  • Conditional Entropy: H(YâˆŖX)=−∑x,yp(x,y)log⁥p(yâˆŖx)H(Y|X) = -\sum_{x,y} p(x,y) \log p(y|x) measures remaining uncertainty about YY after observing XX. Conditioning reduces entropy: H(YâˆŖX)≤H(Y)H(Y|X) \leq H(Y).

  • Chain Rule: H(X,Y)=H(X)+H(YâˆŖX)H(X,Y) = H(X) + H(Y|X). The joint entropy of two variables equals the entropy of XX plus the conditional entropy of YY given XX. This decomposes complex systems.

  • Units Matter: Using log⁥2\log_2 gives entropy in bits; using ln⁥\ln gives nats; using log⁥10\log_{10} gives bans/hartleys. Bits are standard in information theory and machine learning.

  • Source Coding Theorem: Shannon proved that the entropy H(X)H(X) is the theoretical minimum average number of bits needed to encode messages from a source. No compression scheme can do better.

  • Practical Application: Entropy drives decision tree splitting (information gain), loss function design (cross-entropy loss in neural networks), compression limits (Shannon's source coding theorem), and feature selection in ML pipelines.

Lesson Progress81 / 100