Entropy
âšī¸ Why It Matters
Entropy is the cornerstone of information theory. Introduced by Claude Shannon in 1948, it quantifies the fundamental uncertainty or "surprise" in a random variable. Every time you compress a file, train a neural network with cross-entropy loss, or split a decision tree, entropy is at work. Understanding entropy deeply gives you intuition for information gain, coding theory, loss function design, and the theoretical limits of compression.
Historical Context
âšī¸ Shannon's Breakthrough
Before Shannon, "information" was an intuitive concept. Shannon formalized it: the information gained from observing an event with probability is . Rare events carry more information. Entropy is the expected information content across all possible outcomes of a random variable. This single idea unified communication, compression, and cryptography under a mathematical framework.
Core Definitions
DfShannon Entropy
The Shannon entropy of a discrete random variable with probability mass function is:
where the sum is over all possible values of , and we adopt the convention . Entropy is measured in bits when using .
DfDifferential Entropy
For a continuous random variable with probability density function :
Unlike discrete entropy, differential entropy can be negative and is not directly interpretable as "information content." However, differences in differential entropy (e.g., in rate-distortion theory) are meaningful.
Key Formulas
Shannon Entropy
Here,
- =Entropy of random variable X (in bits)
- =Probability of outcome x
- =Logarithm base 2 (gives units in bits)
Entropy in Nats
Here,
- =Natural logarithm (gives units in nats)
- == 1 / ln(2) â 1.4427 bits
Joint Entropy
Here,
- =Joint entropy of X and Y
- =Joint probability of x and y
Conditional Entropy
Here,
- =Entropy of Y given X
- =Conditional probability of y given x
Relative Entropy (KL Divergence)
Here,
- =Kullback-Leibler divergence
- =Two probability distributions
Properties and Theorems
ThNon-negativity
for all discrete random variables . Equality holds if and only if is deterministic (one outcome has probability 1).
ThMaximum Entropy
For a discrete random variable with possible outcomes:
Equality holds when is uniformly distributed over its support.
ThChain Rule for Entropy
The joint entropy decomposes into one marginal entropy plus one conditional entropy.
ThSubadditivity
Equality holds if and only if and are independent.
ThConditioning Reduces Entropy
Knowing cannot increase uncertainty about . Equality holds iff and are independent.
ThShannon's Source Coding Theorem
No lossless compression scheme can compress a source with entropy to fewer than bits per symbol on average. Conversely, there exist codes that achieve average length arbitrarily close to .
Worked Examples
đExample 1: Fair Coin
Problem: Compute the entropy of a fair coin flip.
đĄSolution: Fair Coin
A fair coin has .
This is the maximum entropy for a binary variable.
đExample 2: Biased Coin
Problem: Compute the entropy of a coin with .
đĄSolution: Biased Coin
The biased coin has lower entropy because the outcome is more predictable.
đExample 3: Loaded Die
Problem: Compute the entropy of a loaded die with probabilities .
đĄSolution: Loaded Die
Compare with a fair die: bits.
đExample 4: Joint and Conditional Entropy
Problem: Let be binary with joint distribution: , , , . Compute , , and .
đĄSolution: Joint and Conditional
Marginals: , . So bit.
bits.
bits.
bits.
bits.
Python Implementation
import numpy as np
from collections import Counter
def entropy(probs):
"""Compute Shannon entropy in bits from a probability distribution."""
probs = np.array(probs)
probs = probs[probs > 0] # Avoid log(0)
return -np.sum(probs * np.log2(probs))
def entropy_from_samples(samples):
"""Estimate entropy from a list of samples."""
counter = Counter(samples)
total = len(samples)
probs = np.array([count / total for count in counter.values()])
return entropy(probs)
def joint_entropy(joint_probs):
"""Compute joint entropy from a 2D joint probability matrix."""
flat = joint_probs.flatten()
return entropy(flat)
def conditional_entropy(joint_probs):
"""Compute H(Y|X) from a joint probability matrix."""
h_y_given_x = 0.0
for i in range(joint_probs.shape[0]):
p_x = joint_probs[i].sum()
if p_x > 0:
h_y_given_x += p_x * entropy(joint_probs[i] / p_x)
return h_y_given_x
# --- Examples ---
# Fair coin
print(f"Fair coin: {entropy([0.5, 0.5]):.3f} bits") # 1.000
# Biased coin
print(f"Biased (0.9,0.1): {entropy([0.9, 0.1]):.3f} bits") # 0.469
# Fair die
print(f"Fair die: {entropy([1/6]*6):.3f} bits") # 2.585
# Loaded die
loaded = [1/2, 1/4, 1/8, 1/16, 1/32, 1/32]
print(f"Loaded die: {entropy(loaded):.3f} bits") # 1.938
# Joint entropy example
joint = np.array([[0.4, 0.1], [0.1, 0.4]])
print(f"Joint entropy: {joint_entropy(joint):.3f} bits") # 1.722
print(f"Conditional H(Y|X): {conditional_entropy(joint):.3f}") # 0.722
# Entropy from samples
np.random.seed(42)
samples = np.random.choice(['A', 'B', 'C', 'D'], size=10000, p=[0.5, 0.25, 0.125, 0.125])
print(f"Estimated entropy: {entropy_from_samples(samples):.3f}") # ~1.75
Applications in AI/ML
âšī¸ Decision Trees
Decision trees use information gain to choose splits. For a feature splitting dataset into subsets :
The feature with highest IG is chosen. This is directly derived from entropy.
âšī¸ Cross-Entropy Loss
Neural networks for classification minimize:
where is the true label and is the predicted probability. This is the cross-entropy between the true distribution and the model's predicted distribution.
âšī¸ Compression
Shannon's source coding theorem guarantees that the optimal average code length for a source is bits per symbol. Huffman coding and arithmetic coding approach this limit.
âšī¸ Feature Selection
Mutual information measures how much a feature reduces uncertainty about target . Features with higher MI are more informative.
Common Mistakes
| Mistake | Why It's Wrong | Correct Approach |
|---|---|---|
| Using without specifying base | Units depend on base | Use for bits, for nats |
| Forgetting convention | Causes NaN errors | Filter zero probabilities before computing |
| Confusing entropy with information | Entropy is expected information, not instantaneous | Entropy averages over all outcomes |
| Assuming entropy is always positive for continuous RVs | Differential entropy can be negative | Differential entropy is not directly comparable |
| Treating conditional entropy as "entropy minus something" | Use the chain rule: |
Interview Questions
Q1: Why does a fair coin have maximum entropy? A: A fair coin has for both outcomes, which maximizes subject to . This is proven by Jensen's inequality or Lagrange multipliers.
Q2: Can entropy be negative? A: Discrete entropy always. Differential entropy can be negative (e.g., a very concentrated Gaussian with small variance).
Q3: What's the relationship between entropy and compression? A: Shannon's source coding theorem: you cannot compress below bits per symbol losslessly. Entropy is the theoretical minimum average code length.
Q4: How is entropy used in decision trees? A: Information gain measures how much a feature reduces uncertainty. The feature with highest IG is selected for splitting.
Q5: Why use instead of ? A: gives entropy in bits, which is intuitive for binary decisions. gives nats. The choice doesn't affect which distribution has higher entropy, only the units.
Practice Problems
đProblem 1: Minimum Entropy
Problem: What is the minimum possible entropy for a binary random variable? Under what condition is it achieved?
đĄSolution: Minimum Entropy
The minimum is , achieved when or (deterministic outcome). One outcome has probability 1, so there is no uncertainty.
đProblem 2: Entropy Ordering
Problem: Rank the following distributions by entropy (lowest to highest): (a) [0.99, 0.01], (b) [0.5, 0.5], (c) [0.7, 0.3], (d) [0.25, 0.25, 0.25, 0.25].
đĄSolution: Entropy Ordering
(a) bits, (c) bits, (b) bit, (d) bits.
Ordering: (a) < (c) < (b) < (d). The more concentrated the distribution, the lower the entropy. The uniform distribution over 4 outcomes has the highest.
đProblem 3: Chain Rule Application
Problem: bits, bits. What is ? What is if and are independent?
đĄSolution: Chain Rule
By the chain rule: bits.
If and are independent: . We need bits only if independent. But we're told and , so regardless. If additionally independent, and â.
Quick Reference
| Quantity | Formula | Units |
|---|---|---|
| Shannon Entropy | bits | |
| Joint Entropy | bits | |
| Conditional Entropy | bits | |
| Max Entropy | bits | |
| Information Gain | bits |
Relationship to Other Quantities
âšī¸ Entropy as the Foundation
Entropy is the building block for many other information-theoretic quantities:
- Mutual Information: â measures shared information.
- KL Divergence: â measures distribution mismatch.
- Cross-Entropy: â the loss function for classification.
- Information Gain: â used in decision trees.
- Perplexity: â used in language modeling.
ThData Processing Inequality
If forms a Markov chain, then . Processing data cannot decrease uncertainty about a downstream variable more than the original data already does.
ThStrong Subadditivity
For three random variables :
This is the information-theoretic analog of the triangle inequality.
âšī¸ Entropy in Physics
In statistical mechanics, entropy measures the number of microstates consistent with a macrostate. Shannon entropy generalizes this to arbitrary probability distributions. The connection: Boltzmann entropy is Shannon entropy when all microstates are equally likely.
Common Confusions
âšī¸ Entropy vs Information
Entropy is the expected information content, not the information content of a single event. A single rare event has high information (), but entropy averages over all events weighted by their probabilities.
âšī¸ Entropy vs Randomness
Entropy measures uncertainty, not randomness in the colloquial sense. A fair coin has high entropy (1 bit), while a biased coin () has low entropy (0.08 bits), even though both are "random" in the sense that outcomes are unpredictable.
Cross-References
- 082 - Mutual Information â â measures shared information between variables.
- 083 - KL Divergence: â measures distribution mismatch.
- 084 - Cross-Entropy â â used as loss in classification.
- 085 - Applications â Information gain in decision trees, feature selection with MI, VAE loss functions.
Summary
đKey Takeaways
-
Shannon Entropy: measures the average information content (in bits) of a random variable. Higher entropy means more uncertainty and more information per observation.
-
Entropy Bounds: Entropy is always non-negative , equals 0 for deterministic outcomes, and is maximized at when all outcomes are equally likely (uniform distribution).
-
Conditional Entropy: measures remaining uncertainty about after observing . Conditioning reduces entropy: .
-
Chain Rule: . The joint entropy of two variables equals the entropy of plus the conditional entropy of given . This decomposes complex systems.
-
Units Matter: Using gives entropy in bits; using gives nats; using gives bans/hartleys. Bits are standard in information theory and machine learning.
-
Source Coding Theorem: Shannon proved that the entropy is the theoretical minimum average number of bits needed to encode messages from a source. No compression scheme can do better.
-
Practical Application: Entropy drives decision tree splitting (information gain), loss function design (cross-entropy loss in neural networks), compression limits (Shannon's source coding theorem), and feature selection in ML pipelines.