Information Theory — Measuring Information
ℹ️ Why It Matters
Information theory quantifies information, measures surprise, and forms the basis for many ML loss functions (cross-entropy, KL divergence).
What is Information?
Definition: Information
Surprise = Information
- If something very unlikely happens → HIGH information
- If something very likely happens → LOW information
Information Content
$$I(x) = -\log_2 P(x)$$
Here,
- $P(x)$ = Probability of event $x$
- $I(x)$ = Information content (in bits)
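📝 Example: Information Content
- $P(x) = 0.5$: $I(x) = -\log_2 0.5 = 1$ bit
- $P(x) = 0.25$: $I(x) = -\log_2 0.25 = 2$ bits
- $P(x) = 0.01$: $I(x) = -\log_2 0.01 \approx 6.64$ bits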
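The link between probability and information content can be checked numerically. The following is a minimal sketch, assuming the standard definition $I(x) = -\log_2 P(x)$; the function name `information_content` and the sample probabilities are illustrative choices, not part of the original material.

```python
import math


def information_content(p: float) -> float:
    """Self-information I(x) = -log2 P(x), in bits, for an event of probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in the interval (0, 1]")
    return -math.log2(p)


# Rarer events carry more bits of information.
for p in (0.5, 0.25, 0.01):
    print(f"P(x) = {p}: {information_content(p):.2f} bits")
```

Halving the probability of an event adds exactly one bit of information, which is why the logarithm is taken in base 2.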
Entropy
Definition: Entropy
The average information (surprise) of a random variable.
Entropy
$$H(X) = -\sum_{x} P(x) \log_2 P(x)$$
Here,
- $H(X)$ = Entropy of random variable $X$
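📝 Example: Entropy
- Fair coin: $H(X) = -(0.5 \log_2 0.5 + 0.5 \log_2 0.5) = 1$ bit
- Loaded coin (90% heads): $H(X) = -(0.9 \log_2 0.9 + 0.1 \log_2 0.1) \approx 0.469$ bits
- Certain outcome: $H(X) = 0$ bits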
Properties:
- $H(X) \geq 0$ (entropy is never negative)
- $H(X)$ is maximized when all outcomes are equally likely
- $H(X)$ measures uncertainty: more entropy = more uncertainty
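These properties can be verified with a few lines of code. This is a minimal sketch, assuming the usual convention that outcomes with zero probability contribute nothing to the sum; the function name `entropy` and the example distributions are illustrative.

```python
import math


def entropy(probs):
    """Shannon entropy H(X) = -sum p * log2(p), in bits.

    Outcomes with p == 0 are skipped (the 0 * log 0 = 0 convention).
    """
    return sum(-p * math.log2(p) for p in probs if p > 0)


fair_coin = entropy([0.5, 0.5])        # maximum uncertainty for two outcomes
loaded_coin = entropy([0.9, 0.1])      # less uncertain than the fair coin
certain = entropy([1.0, 0.0])          # no uncertainty at all

uniform4 = entropy([0.25, 0.25, 0.25, 0.25])
skewed4 = entropy([0.7, 0.1, 0.1, 0.1])

print(f"fair coin:   {fair_coin:.3f} bits")
print(f"loaded coin: {loaded_coin:.3f} bits")
print(f"certain:     {certain:.3f} bits")
print(f"uniform over 4 outcomes: {uniform4:.3f} bits, skewed: {skewed4:.3f} bits")
```

The uniform distribution over four outcomes reaches $\log_2 4 = 2$ bits, while the skewed one stays below it, matching the claim that entropy peaks when all outcomes are equally likely.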
Cross-Entropy
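Definition: Cross-Entropy
Measures the average number of bits needed to encode outcomes drawn from the true distribution $P$ using a code optimized for the predicted distribution $Q$. It grows as $Q$ moves away from $P$.
Cross-Entropy
$$H(P, Q) = -\sum_{x} P(x) \log_2 Q(x)$$
Here,
- $P$ = True distribution
- $Q$ = Predicted distribution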
ℹ️ In ML
$$L = -\sum_{i} y_i \log \hat{y}_i$$
- $y_i$ = true labels (one-hot encoded)
- $\hat{y}_i$ = predicted probabilities
💡 Why cross-entropy is used in classification
- If the model predicts the correct class with high probability → low loss
- If the model predicts the wrong class → high loss
- It heavily penalizes confident wrong predictions
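The three behaviours listed above can be seen on a single three-class example. This is a minimal sketch, assuming a one-hot label and the natural logarithm (so the loss is in nats, as is common in ML practice); the function name `cross_entropy`, the `eps` clamp, and the sample predictions are illustrative assumptions.

```python
import math


def cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy loss -sum_i y_i * log(y_hat_i), using the natural log.

    eps clamps predictions away from 0 to avoid log(0); it is a numerical
    safeguard added for this sketch.
    """
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


y_true = [0, 1, 0]  # the correct class is index 1

confident_right = cross_entropy(y_true, [0.05, 0.90, 0.05])
unsure = cross_entropy(y_true, [0.30, 0.40, 0.30])
confident_wrong = cross_entropy(y_true, [0.90, 0.05, 0.05])

print(f"confident and right: {confident_right:.3f}")
print(f"unsure:              {unsure:.3f}")
print(f"confident and wrong: {confident_wrong:.3f}")
```

With a one-hot label the loss reduces to $-\log \hat{y}_c$ for the correct class $c$, so it grows without bound as the probability assigned to the correct class approaches zero.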
KL Divergence
Definition: KL Divergence
Measures how much information is lost when $Q$ approximates $P$.
KL Divergence
$$D_{KL}(P \| Q) = \sum_{x} P(x) \log_2 \frac{P(x)}{Q(x)}$$
Here,
- $D_{KL}(P \| Q)$ = KL divergence from $Q$ to $P$
Properties:
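- $D_{KL}(P \| Q) \geq 0$ (always non-negative)
- $D_{KL}(P \| Q) = 0$ if and only if $P = Q$
- KL is NOT symmetric: in general, $D_{KL}(P \| Q) \neq D_{KL}(Q \| P)$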
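A small numerical check makes these properties concrete. This is a minimal sketch, assuming discrete distributions given as equal-length lists and base-2 logarithms; the function name `kl_divergence` and the two example distributions are illustrative.

```python
import math


def kl_divergence(p, q):
    """D_KL(P || Q) = sum_x P(x) * log2(P(x) / Q(x)), in bits."""
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0:
            continue          # 0 * log(0 / q) is taken to be 0
        if qx == 0:
            return math.inf   # P puts mass where Q puts none
        total += px * math.log2(px / qx)
    return total


p = [0.5, 0.5]  # e.g. a fair coin
q = [0.9, 0.1]  # e.g. a loaded coin

print(f"D_KL(P || Q) = {kl_divergence(p, q):.3f} bits")
print(f"D_KL(Q || P) = {kl_divergence(q, p):.3f} bits")
print(f"D_KL(P || P) = {kl_divergence(p, p):.3f} bits")
```

The two directions give different values, so KL divergence is not a true distance metric, and the divergence of a distribution from itself is zero.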
Applications:
- Variational Autoencoders (VAE): KL divergence between learned distribution and prior
- Knowledge distillation: Student model mimics teacher model
- GAN training: The original GAN objective is related to the Jensen-Shannon divergence, a symmetrized variant of KL divergence, between the real and generated distributions
Mutual Information
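Definition: Mutual Information
How much knowing one variable tells you about another.
Mutual Information
$$I(X;Y) = \sum_{x,y} P(x,y) \log_2 \frac{P(x,y)}{P(x)\,P(y)} = H(X) - H(X \mid Y)$$
Here,
- $I(X;Y)$ = Mutual information between $X$ and $Y$
- $H(X \mid Y)$ = Entropy of $X$ that remains once $Y$ is known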
Interpretation:
- $I(X;Y) = 0$: $X$ and $Y$ are independent
- $I(X;Y) = H(X)$: Knowing $Y$ tells you everything about $X$
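Both extremes can be reproduced from a joint probability table. This is a minimal sketch, assuming two binary variables whose joint distribution is given as a nested list `joint[x][y]`; the function name `mutual_information` and both tables are illustrative.

```python
import math


def mutual_information(joint):
    """I(X;Y) in bits from a joint probability table joint[x][y]."""
    px = [sum(row) for row in joint]          # marginal P(x)
    py = [sum(col) for col in zip(*joint)]    # marginal P(y)
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:
                mi += pxy * math.log2(pxy / (px[i] * py[j]))
    return mi


# Two independent fair coins: P(x, y) = P(x) * P(y) everywhere.
independent = [[0.25, 0.25],
               [0.25, 0.25]]

# Y is an exact copy of X: knowing Y determines X completely.
identical = [[0.5, 0.0],
             [0.0, 0.5]]

print(f"independent: {mutual_information(independent):.3f} bits")
print(f"identical:   {mutual_information(identical):.3f} bits")
```

In the second table $X$ is a fair coin with $H(X) = 1$ bit, and the mutual information equals that full entropy.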
Applications:
- Feature selection: Choose features with high mutual information with the target
- Information bottleneck: Compress input while retaining relevant information
- Clustering: Evaluate cluster quality
Data Compression Connection
Shannon's Source Coding Theorem:
Theorem: Source Coding Theorem
You cannot losslessly compress data to fewer than $H(X)$ bits per symbol on average. $H(X)$ is the theoretical minimum average number of bits needed per symbol.
Huffman Coding:
- Assign shorter codes to more frequent symbols
- Optimal prefix-free coding
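The connection between Huffman coding and entropy can be illustrated by comparing the average code length with $H(X)$. This is a minimal sketch, assuming at least two symbols with known probabilities; the function name `huffman_code_lengths`, the symbol names, and the probabilities are illustrative, and only code lengths (not the bit strings themselves) are computed.

```python
import heapq
import math


def huffman_code_lengths(probs):
    """Return {symbol: code length in bits} for a Huffman code over probs.

    Assumes at least two symbols. Each heap entry holds a subtree as
    (probability, tie_breaker, {symbol: depth inside the subtree}).
    """
    heap = [(p, i, {sym: 0}) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)  # unique tie-breaker so dicts are never compared
    while len(heap) > 1:
        # Merge the two least probable subtrees; every symbol in them
        # moves one level deeper, i.e. gains one bit.
        p1, _, d1 = heapq.heappop(heap)
        p2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]


probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
lengths = huffman_code_lengths(probs)

avg_length = sum(probs[s] * lengths[s] for s in probs)
source_entropy = sum(-p * math.log2(p) for p in probs.values())

print("code lengths:", lengths)
print(f"average length: {avg_length:.3f} bits/symbol")
print(f"entropy:        {source_entropy:.3f} bits/symbol")
```

Because every probability here is a power of 1/2, the average code length matches the entropy exactly. For other distributions a Huffman code stays within one bit of the entropy per symbol but generally does not reach it.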
Applications in AI:
- Neural network pruning: Remove redundant weights
- Knowledge distillation: Compress large model into small one
- Model compression for edge devices
📋 Key Takeaways
- Information measures surprise. $I(x) = -\log_2 P(x)$: rare events carry more information; certain events carry none. The bit is the fundamental unit of information theory.
- Entropy is average surprise. $H(X) = -\sum_{x} P(x) \log_2 P(x)$ measures the uncertainty of a random variable: maximum when all outcomes are equally likely, zero when the outcome is certain.
- Cross-Entropy is THE classification loss. $H(P, Q) = -\sum_{x} P(x) \log_2 Q(x)$ measures how bad your predicted distribution is compared to the true distribution, and it is used directly as the loss function in neural network classification.
- KL Divergence measures information loss. $D_{KL}(P \| Q)$ quantifies how much information is lost when $Q$ approximates $P$; it is used in VAEs, knowledge distillation, and GAN training. Note: in general, $D_{KL}(P \| Q) \neq D_{KL}(Q \| P)$.
- Mutual Information finds dependencies. $I(X;Y)$ measures how much knowing one variable tells you about another, which makes it powerful for feature selection and for evaluating clustering quality.
- Shannon's theorem sets compression limits. You cannot losslessly compress data below $H(X)$ bits per symbol on average, which connects information theory to practical model compression and pruning in neural networks.