Entropy

ℹ️ Why It Matters

Entropy is the cornerstone of information theory. Introduced by Claude Shannon in 1948, it quantifies the fundamental uncertainty or "surprise" in a random variable. Every time you compress a file, train a neural network with cross-entropy loss, or split a decision tree, entropy is at work. Understanding entropy deeply gives you intuition for information gain, coding theory, loss function design, and the theoretical limits of compression.

Historical Context

ℹ️ Shannon's Breakthrough

Before Shannon, "information" was an intuitive concept. Shannon formalized it: the information gained from observing an event with probability $p$ is $-\log_2 p$ . Rare events carry more information. Entropy is the expected information content across all possible outcomes of a random variable. This single idea unified communication, compression, and cryptography under a mathematical framework.

Core Definitions

DfShannon Entropy

The Shannon entropy of a discrete random variable $X$ with probability mass function $p(x)$ is:

H(X) = -\sum_{x \in \mathcal{X}} p(x) \log_2 p(x)

where the sum is over all possible values of $X$ , and we adopt the convention $0 \log_2 0 = 0$ . Entropy is measured in bits when using $\log_2$ .

DfDifferential Entropy

For a continuous random variable $X$ with probability density function $f(x)$ :

h(X) = -\int_{-\infty}^{\infty} f(x) \log_2 f(x) \, dx

Unlike discrete entropy, differential entropy can be negative and is not directly interpretable as "information content." However, differences in differential entropy (e.g., in rate-distortion theory) are meaningful.

Key Formulas

Shannon Entropy

H(X) = -\sum_{x} p(x) \log_2 p(x)

Here,

$H(X)$ =Entropy of random variable X (in bits)
$p(x)$ =Probability of outcome x
$log_2$ =Logarithm base 2 (gives units in bits)

Entropy in Nats

H(X) = -\sum_{x} p(x) \ln p(x)

Here,

$ln$ =Natural logarithm (gives units in nats)
$1 nat$ == 1 / ln(2) ≈ 1.4427 bits

Joint Entropy

H(X, Y) = -\sum_{x,y} p(x,y) \log_2 p(x,y)

Here,

$H(X, Y)$ =Joint entropy of X and Y
$p(x,y)$ =Joint probability of x and y

Conditional Entropy

H(Y|X) = -\sum_{x,y} p(x,y) \log_2 p(y|x)

Here,

$H(Y|X)$ =Entropy of Y given X
$p(y|x)$ =Conditional probability of y given x

Relative Entropy (KL Divergence)

D_{KL}(P \| Q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}

Here,

$D_{KL}$ =Kullback-Leibler divergence
$P, Q$ =Two probability distributions

Properties and Theorems

ThNon-negativity

$H(X) \geq 0$ for all discrete random variables $X$ . Equality holds if and only if $X$ is deterministic (one outcome has probability 1).

ThMaximum Entropy

For a discrete random variable with $|\mathcal{X}|$ possible outcomes:

H(X) \leq \log_2 |\mathcal{X}|

Equality holds when $X$ is uniformly distributed over its support.

ThChain Rule for Entropy

H(X, Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)

The joint entropy decomposes into one marginal entropy plus one conditional entropy.

ThSubadditivity

H(X, Y) \leq H(X) + H(Y)

Equality holds if and only if $X$ and $Y$ are independent.

ThConditioning Reduces Entropy

H(Y|X) \leq H(Y)

Knowing $X$ cannot increase uncertainty about $Y$ . Equality holds iff $X$ and $Y$ are independent.

ThShannon's Source Coding Theorem

No lossless compression scheme can compress a source with entropy $H(X)$ to fewer than $H(X)$ bits per symbol on average. Conversely, there exist codes that achieve average length arbitrarily close to $H(X)$ .

Worked Examples

📝Example 1: Fair Coin

Problem: Compute the entropy of a fair coin flip.

💡Solution: Fair Coin

A fair coin has $p(\text{heads}) = p(\text{tails}) = 0.5$ .

H(X) = -\left[0.5 \log_2(0.5) + 0.5 \log_2(0.5)\right] = -\left[0.5(-1) + 0.5(-1)\right] = 1.0 \text{ bit}

This is the maximum entropy for a binary variable.

📝Example 2: Biased Coin

Problem: Compute the entropy of a coin with $p(\text{heads}) = 0.9$ .

💡Solution: Biased Coin

H(X) = -\left[0.9 \log_2(0.9) + 0.1 \log_2(0.1)\right] = -\left[0.9(-0.152) + 0.1(-3.322)\right] = 0.469 \text{ bits}

The biased coin has lower entropy because the outcome is more predictable.

📝Example 3: Loaded Die

Problem: Compute the entropy of a loaded die with probabilities $[1/2, 1/4, 1/8, 1/16, 1/32, 1/32]$ .

💡Solution: Loaded Die

H = -\sum_{i=1}^{6} p_i \log_2(p_i)

= -\left[\frac{1}{2}(-1) + \frac{1}{4}(-2) + \frac{1}{8}(-3) + \frac{1}{16}(-4) + \frac{1}{32}(-5) + \frac{1}{32}(-5)\right]

= 0.5 + 0.5 + 0.375 + 0.25 + 0.15625 + 0.15625 = 1.9375 \text{ bits}

Compare with a fair die: $H = \log_2(6) \approx 2.585$ bits.

📝Example 4: Joint and Conditional Entropy

Problem: Let $X, Y$ be binary with joint distribution: $p(0,0) = 0.4$ , $p(0,1) = 0.1$ , $p(1,0) = 0.1$ , $p(1,1) = 0.4$ . Compute $H(X)$ , $H(Y|X)$ , and $H(X,Y)$ .

💡Solution: Joint and Conditional

Marginals: $p(X=0) = 0.5$ , $p(X=1) = 0.5$ . So $H(X) = 1.0$ bit.

$H(Y|X=0) = -[0.8 \log_2(0.8) + 0.2 \log_2(0.2)] = 0.722$ bits.

$H(Y|X=1) = -[0.2 \log_2(0.2) + 0.8 \log_2(0.8)] = 0.722$ bits.

$H(Y|X) = \sum_x p(x) H(Y|X=x) = 0.5(0.722) + 0.5(0.722) = 0.722$ bits.

$H(X,Y) = H(X) + H(Y|X) = 1.0 + 0.722 = 1.722$ bits.

Python Implementation

import numpy as np
from collections import Counter

def entropy(probs):
    """Compute Shannon entropy in bits from a probability distribution."""
    probs = np.array(probs)
    probs = probs[probs > 0]  # Avoid log(0)
    return -np.sum(probs * np.log2(probs))

def entropy_from_samples(samples):
    """Estimate entropy from a list of samples."""
    counter = Counter(samples)
    total = len(samples)
    probs = np.array([count / total for count in counter.values()])
    return entropy(probs)

def joint_entropy(joint_probs):
    """Compute joint entropy from a 2D joint probability matrix."""
    flat = joint_probs.flatten()
    return entropy(flat)

def conditional_entropy(joint_probs):
    """Compute H(Y|X) from a joint probability matrix."""
    h_y_given_x = 0.0
    for i in range(joint_probs.shape[0]):
        p_x = joint_probs[i].sum()
        if p_x > 0:
            h_y_given_x += p_x * entropy(joint_probs[i] / p_x)
    return h_y_given_x

# --- Examples ---
# Fair coin
print(f"Fair coin: {entropy([0.5, 0.5]):.3f} bits")          # 1.000

# Biased coin
print(f"Biased (0.9,0.1): {entropy([0.9, 0.1]):.3f} bits")  # 0.469

# Fair die
print(f"Fair die: {entropy([1/6]*6):.3f} bits")              # 2.585

# Loaded die
loaded = [1/2, 1/4, 1/8, 1/16, 1/32, 1/32]
print(f"Loaded die: {entropy(loaded):.3f} bits")              # 1.938

# Joint entropy example
joint = np.array([[0.4, 0.1], [0.1, 0.4]])
print(f"Joint entropy: {joint_entropy(joint):.3f} bits")      # 1.722
print(f"Conditional H(Y|X): {conditional_entropy(joint):.3f}")  # 0.722

# Entropy from samples
np.random.seed(42)
samples = np.random.choice(['A', 'B', 'C', 'D'], size=10000, p=[0.5, 0.25, 0.125, 0.125])
print(f"Estimated entropy: {entropy_from_samples(samples):.3f}")  # ~1.75

Applications in AI/ML

ℹ️ Decision Trees

Decision trees use information gain to choose splits. For a feature $A$ splitting dataset $D$ into subsets $D_v$ :

IG(D, A) = H(D) - \sum_{v} \frac{|D_v|}{|D|} H(D_v)

The feature with highest IG is chosen. This is directly derived from entropy.

ℹ️ Cross-Entropy Loss

Neural networks for classification minimize:

\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N} \sum_{c=1}^{C} y_{ic} \log \hat{p}_{ic}

where $y_{ic}$ is the true label and $\hat{p}_{ic}$ is the predicted probability. This is the cross-entropy between the true distribution and the model's predicted distribution.

ℹ️ Compression

Shannon's source coding theorem guarantees that the optimal average code length for a source is $H(X)$ bits per symbol. Huffman coding and arithmetic coding approach this limit.

ℹ️ Feature Selection

Mutual information $I(X; Y) = H(Y) - H(Y|X)$ measures how much a feature $X$ reduces uncertainty about target $Y$ . Features with higher MI are more informative.

Common Mistakes

Mistake	Why It's Wrong	Correct Approach
Using $\log$ without specifying base	Units depend on base	Use $\log_2$ for bits, $\ln$ for nats
Forgetting $0 \log 0 = 0$ convention	Causes NaN errors	Filter zero probabilities before computing
Confusing entropy with information	Entropy is expected information, not instantaneous	Entropy averages over all outcomes
Assuming entropy is always positive for continuous RVs	Differential entropy can be negative	Differential entropy is not directly comparable
Treating conditional entropy as "entropy minus something"	$H(Y\\|X) \neq H(Y) - H(X)$	Use the chain rule: $H(Y\\|X) = H(X,Y) - H(X)$

Interview Questions

Q1: Why does a fair coin have maximum entropy? A: A fair coin has $p = 0.5$ for both outcomes, which maximizes $-\sum p \log p$ subject to $\sum p = 1$ . This is proven by Jensen's inequality or Lagrange multipliers.

Q2: Can entropy be negative? A: Discrete entropy $H(X) \geq 0$ always. Differential entropy $h(X)$ can be negative (e.g., a very concentrated Gaussian with small variance).

Q3: What's the relationship between entropy and compression? A: Shannon's source coding theorem: you cannot compress below $H(X)$ bits per symbol losslessly. Entropy is the theoretical minimum average code length.

Q4: How is entropy used in decision trees? A: Information gain $IG(D, A) = H(D) - \sum_v \frac{|D_v|}{|D|} H(D_v)$ measures how much a feature reduces uncertainty. The feature with highest IG is selected for splitting.

Q5: Why use $\log_2$ instead of $\ln$ ? A: $\log_2$ gives entropy in bits, which is intuitive for binary decisions. $\ln$ gives nats. The choice doesn't affect which distribution has higher entropy, only the units.

Practice Problems

📝Problem 1: Minimum Entropy

Problem: What is the minimum possible entropy for a binary random variable? Under what condition is it achieved?

💡Solution: Minimum Entropy

The minimum is $H(X) = 0$ , achieved when $p = 0$ or $p = 1$ (deterministic outcome). One outcome has probability 1, so there is no uncertainty.

📝Problem 2: Entropy Ordering

Problem: Rank the following distributions by entropy (lowest to highest): (a) [0.99, 0.01], (b) [0.5, 0.5], (c) [0.7, 0.3], (d) [0.25, 0.25, 0.25, 0.25].

💡Solution: Entropy Ordering

(a) $H \approx 0.081$ bits, (c) $H \approx 0.881$ bits, (b) $H = 1.0$ bit, (d) $H = 2.0$ bits.

Ordering: (a) < (c) < (b) < (d). The more concentrated the distribution, the lower the entropy. The uniform distribution over 4 outcomes has the highest.

📝Problem 3: Chain Rule Application

Problem: $H(X) = 3$ bits, $H(X,Y) = 5$ bits. What is $H(Y|X)$ ? What is $H(Y|X)$ if $X$ and $Y$ are independent?

💡Solution: Chain Rule

By the chain rule: $H(Y|X) = H(X,Y) - H(X) = 5 - 3 = 2$ bits.

If $X$ and $Y$ are independent: $H(Y|X) = H(Y)$ . We need $H(Y) = H(X,Y) - H(X) = 2$ bits only if independent. But we're told $H(X,Y) = 5$ and $H(X) = 3$ , so $H(Y|X) = 2$ regardless. If additionally independent, $H(Y) = 2$ and $H(X,Y) = H(X) + H(Y) = 5$ ✓.

Quick Reference

Quantity	Formula	Units
Shannon Entropy	$H(X) = -\sum_x p(x) \log_2 p(x)$	bits
Joint Entropy	$H(X,Y) = -\sum_{x,y} p(x,y) \log_2 p(x,y)$	bits
Conditional Entropy	$H(Y\\|X) = H(X,Y) - H(X)$	bits
Max Entropy	$\log_2 \\|\mathcal{X}\\|$	bits
Information Gain	$IG = H(D) - \sum_v \frac{\\|D_v\\|}{\\|D\\|} H(D_v)$	bits

Relationship to Other Quantities

ℹ️ Entropy as the Foundation

Entropy is the building block for many other information-theoretic quantities:

Mutual Information: $I(X;Y) = H(X) - H(X|Y)$ — measures shared information.
KL Divergence: $D_{KL}(P\|Q) = H(P,Q) - H(P)$ — measures distribution mismatch.
Cross-Entropy: $H(P,Q) = H(P) + D_{KL}(P\|Q)$ — the loss function for classification.
Information Gain: $IG(D,A) = H(D) - H(D|A)$ — used in decision trees.
Perplexity: $\text{PPL} = 2^{H(P,Q)}$ — used in language modeling.

ThData Processing Inequality

If $X \to Y \to Z$ forms a Markov chain, then $H(Z|X) \geq H(Z|Y)$ . Processing data cannot decrease uncertainty about a downstream variable more than the original data already does.

ThStrong Subadditivity

For three random variables $X, Y, Z$ :

H(X, Y, Z) + H(Y) \leq H(X, Y) + H(Y, Z)

This is the information-theoretic analog of the triangle inequality.

ℹ️ Entropy in Physics

In statistical mechanics, entropy $S = k_B \ln W$ measures the number of microstates $W$ consistent with a macrostate. Shannon entropy $H = -\sum p \log p$ generalizes this to arbitrary probability distributions. The connection: Boltzmann entropy is Shannon entropy when all microstates are equally likely.

Common Confusions

ℹ️ Entropy vs Information

Entropy is the expected information content, not the information content of a single event. A single rare event has high information ( $-\log p$ ), but entropy averages over all events weighted by their probabilities.

ℹ️ Entropy vs Randomness

Entropy measures uncertainty, not randomness in the colloquial sense. A fair coin has high entropy (1 bit), while a biased coin ( $p = 0.99$ ) has low entropy (0.08 bits), even though both are "random" in the sense that outcomes are unpredictable.

Cross-References

082 - Mutual Information — $I(X;Y) = H(X) - H(X|Y)$ — measures shared information between variables.
083 - KL Divergence: $D_{KL}(P\|Q) = H(P,Q) - H(P)$ — measures distribution mismatch.
084 - Cross-Entropy — $H(P,Q) = H(P) + D_{KL}(P\|Q)$ — used as loss in classification.
085 - Applications — Information gain in decision trees, feature selection with MI, VAE loss functions.

Summary

📋Key Takeaways

Shannon Entropy: $H(X) = -\sum_{x} p(x) \log_2 p(x)$ measures the average information content (in bits) of a random variable. Higher entropy means more uncertainty and more information per observation.
Entropy Bounds: Entropy is always non-negative $H(X) \geq 0$ , equals 0 for deterministic outcomes, and is maximized at $\log_2 |X|$ when all outcomes are equally likely (uniform distribution).
Conditional Entropy: $H(Y|X) = -\sum_{x,y} p(x,y) \log p(y|x)$ measures remaining uncertainty about $Y$ after observing $X$ . Conditioning reduces entropy: $H(Y|X) \leq H(Y)$ .
Chain Rule: $H(X,Y) = H(X) + H(Y|X)$ . The joint entropy of two variables equals the entropy of $X$ plus the conditional entropy of $Y$ given $X$ . This decomposes complex systems.
Units Matter: Using $\log_2$ gives entropy in bits; using $\ln$ gives nats; using $\log_{10}$ gives bans/hartleys. Bits are standard in information theory and machine learning.
Source Coding Theorem: Shannon proved that the entropy $H(X)$ is the theoretical minimum average number of bits needed to encode messages from a source. No compression scheme can do better.
Practical Application: Entropy drives decision tree splitting (information gain), loss function design (cross-entropy loss in neural networks), compression limits (Shannon's source coding theorem), and feature selection in ML pipelines.