Mutual Information
ℹ️ Why It Matters
Mutual information (MI) answers a fundamental question: "How much does knowing one variable tell me about another?" Unlike correlation, MI captures arbitrary dependencies—linear, nonlinear, monotonic, or otherwise. It's the backbone of feature selection in ML, the objective in the information bottleneck method, and the metric for evaluating clustering quality. If you understand MI, you understand what it means for two variables to be "related."
Core Definitions
DfMutual Information
The mutual information between two discrete random variables and is:
MI measures the reduction in uncertainty about one variable due to knowledge of the other.
DfConditional Entropy
The conditional entropy of given is:
It represents the remaining uncertainty about after observing .
DfInformation Gain
Information gain (also called mutual information in the context of feature selection) is:
This is the expected reduction in entropy of dataset after splitting on feature .
Key Formulas
Mutual Information (Entropy Form)
Here,
- =Mutual information between X and Y
- =Marginal entropy of X
- =Conditional entropy of X given Y
Mutual Information (Joint Form)
Here,
- =Joint probability
- =Marginal probabilities
Mutual Information as KL Divergence
Here,
- =Kullback-Leibler divergence
- =Product of marginals (independence assumption)
Conditional Entropy
Here,
- =Conditional entropy of Y given X
- =Joint entropy
Chain Rule for Mutual Information
Here,
- =MI of X with the pair (Y, Z)
- =Conditional MI of X and Z given Y
Conditional Mutual Information
Here,
- =MI of X and Y given Z
Properties and Theorems
ThNon-negativity
for all random variables . Equality holds if and only if and are independent.
ThSymmetry
. Mutual information is symmetric—knowing about is the same as knowing about .
ThSelf-Mutual Information
. A variable's MI with itself equals its entropy. This is also called the information of .
ThChain Rule Decomposition
The total MI between and the pair equals the MI between and plus the conditional MI between and given .
ThData Processing Inequality
If forms a Markov chain, then . Processing data cannot increase information about the original source.
ThMI and Entropy Relationship
. MI measures the "overlap" between the uncertainties of and .
Worked Examples
📝Example 1: Independent Variables
Problem: If and are independent with and , what is ?
💡Solution: Independent Variables
Since and are independent, for all . Therefore:
MI is zero when variables are independent—knowing one tells you nothing about the other.
📝Example 2: Deterministic Relationship
Problem: Let where is binary with . Compute .
💡Solution: Deterministic
Since , knowing completely determines :
MI equals the entropy when the relationship is deterministic.
📝Example 3: Partial Dependency
Problem: Let be binary uniform, and with probability 0.8, else . Compute .
💡Solution: Partial Dependency
Joint distribution:
bit (uniform).
bits.
bits.
📝Example 4: Chain Rule in Practice
Problem: bits, bits. Find .
💡Solution: Chain Rule
By the chain rule:
📝Example 5: Data Processing Inequality
Problem: If is a Markov chain with bits and bits, what can you say about ?
💡Solution: Data Processing Inequality
By the data processing inequality: bits.
Processing into cannot create new information about . So bits. In general, could be anywhere from 0 to 5 bits depending on how much information retains about .
Python Implementation
import numpy as np
from collections import Counter
from sklearn.feature_selection import mutual_info_regression, mutual_info_classif
def mutual_information(x, y, base=2):
"""Compute MI between two discrete arrays."""
n = len(x)
x, y = np.array(x), np.array(y)
# Joint distribution
xy = np.stack([x, y], axis=1)
xy_counter = Counter(map(tuple, xy))
pxy = np.array([count / n for count in xy_counter.values()])
# Marginals
x_counter = Counter(x)
y_counter = Counter(y)
px = np.array([x_counter[k] / n for k in xy_counter.keys().__iter__()])
# Rebuild properly
px = np.array([sum(pxy[i] for i in range(len(pxy))
if list(xy_counter.keys())[i][0] == k)
for k in x_counter.keys()])
# Simpler approach: compute from joint
joint = np.zeros((len(set(x)), len(set(y))))
x_vals, y_vals = sorted(set(x)), sorted(set(y))
x_map = {v: i for i, v in enumerate(x_vals)}
y_map = {v: i for i, v in enumerate(y_vals)}
for xi, yi in zip(x, y):
joint[x_map[xi], y_map[yi]] += 1
joint /= n
# MI from joint
mi = 0.0
for i in range(joint.shape[0]):
for j in range(joint.shape[1]):
if joint[i, j] > 0:
px = joint[i].sum()
py = joint[:, j].sum()
mi += joint[i, j] * np.log2(joint[i, j] / (px * py))
return mi
def conditional_mi(x, y, z):
"""Estimate I(X; Y | Z) by averaging I(X; Y) within each Z value."""
z_vals = np.unique(z)
total = 0.0
for z_val in z_vals:
mask = z == z_val
if mask.sum() > 1:
total += mutual_information(x[mask], y[mask]) * mask.mean()
return total
# --- Feature Selection Example ---
np.random.seed(42)
X = np.random.randn(1000, 5)
y = X[:, 0] * 2 + X[:, 1] * 0.5 + np.random.randn(1000) * 0.1
mi_scores = mutual_info_regression(X, y, random_state=42)
print("MI scores per feature:")
for i, score in enumerate(mi_scores):
print(f" Feature {i}: {score:.4f}")
top_features = np.argsort(mi_scores)[::-1][:3]
print(f"Top 3 features: {top_features}")
# Feature 0 and 1 should have highest MI
Applications in AI/ML
ℹ️ Feature Selection
MI is model-agnostic: it captures any dependency (linear or nonlinear) between feature and target. Use mutual_info_classif or mutual_info_regression from sklearn to rank features. Select the top- features with highest MI scores.
ℹ️ Information Bottleneck
The information bottleneck method learns a compressed representation of input that preserves maximum information about target :
This is the theoretical foundation of autoencoders and VAEs.
ℹ️ Clustering Evaluation
Normalized mutual information (NMI) evaluates clustering quality against ground truth labels. NMI = 1 means perfect agreement; NMI = 0 means no relationship.
ℹ️ Feature Selection Pipeline
- Compute MI between each feature and target.
- Rank features by MI score.
- Select top- or use a threshold.
- Train model on selected features only.
This reduces dimensionality, removes noise, and often improves model performance.
Common Mistakes
| Mistake | Why It's Wrong | Correct Approach |
|---|---|---|
| Using correlation instead of MI | Correlation only captures linear dependencies | Use MI to capture any type of dependency |
| Assuming MI = causation | MI measures association, not causation | MI is a necessary but not sufficient condition for causation |
| Ignoring the sign of MI | MI is always non-negative | If you get a negative MI estimate, check your implementation |
| Not handling continuous variables | MI estimation for continuous data requires binning or k-NN | Use mutual_info_regression from sklearn |
| Confusing MI with conditional MI | Conditional MI accounts for the effect of a third variable |
Interview Questions
Q1: What's the difference between MI and correlation? A: Correlation measures linear dependence only. MI captures any type of dependence (linear, nonlinear, monotonic, etc.). MI = 0 if and only if variables are independent. Correlation = 0 only implies no linear relationship.
Q2: Why is MI symmetric but correlation can be asymmetric in causal interpretation? A: MI is a mutual measure— by definition. Correlation is also symmetric, but causal relationships are directional. MI measures shared information, not causal direction.
Q3: How do you estimate MI for continuous variables?
A: Options include: (1) binning/discretization, (2) k-NN based estimators (Kragh & Pethel), (3) kernel density estimation. sklearn's mutual_info_regression uses k-NN estimation.
Q4: What is the information bottleneck? A: It's a method to find a compressed representation of that maximizes while minimizing . The Lagrange multiplier controls the trade-off between compression and prediction.
Q5: Can MI be greater than the entropy? A: No. . MI cannot exceed the entropy of either variable.
Practice Problems
📝Problem 1: Binary MI Calculation
Problem: and are binary with joint probabilities: , , , . Compute .
💡Solution: Binary MI
Marginals: , , , .
bit, bits.
bits.
bits.
📝Problem 2: Maximum MI
Problem: What is the maximum possible MI between two binary variables? When is it achieved?
💡Solution: Maximum MI
. For binary variables, bit. Maximum MI = 1 bit, achieved when is a deterministic function of (or vice versa) and both are uniform.
📝Problem 3: MI and Independence
Problem: If , what can you conclude about and ?
💡Solution: MI and Independence
if and only if and are independent. This means for all , so knowing one variable gives zero information about the other.
Quick Reference
| Quantity | Formula | Notes |
|---|---|---|
| MI (entropy form) | Reduction in uncertainty | |
| MI (KL form) | Distance from independence | |
| Chain rule | Decompose multi-variable MI | |
| Conditional MI | MI after controlling for Z | |
| Data processing | Processing cannot add info |
Cross-References
- 081 - Entropy — — MI is defined in terms of entropy.
- 083 - KL Divergence: — MI is a special case of KL divergence.
- 084 - Cross-Entropy — Cross-entropy loss minimizes , related to MI through .
- 085 - Applications — Feature selection, information bottleneck, decision trees.
Summary
📋Key Takeaways
-
Mutual Information: measures how much knowing one variable reduces uncertainty about the other. It captures any type of dependency.
-
MI as KL Divergence: measures how far the joint distribution is from independence. MI = 0 iff independent.
-
Symmetry: . MI is symmetric—knowing about is the same as knowing about . This differs from conditional MI which is asymmetric.
-
Chain Rule: . The total MI decomposes into a marginal MI plus a conditional MI.
-
Data Processing Inequality: If is a Markov chain, then . Post-processing cannot increase information.
-
Feature Selection: MI is the gold standard for model-agnostic feature selection. It captures nonlinear dependencies that correlation misses. Use sklearn's
mutual_info_classif/mutual_info_regressionfor practical estimation. -
Bounds: . MI cannot exceed the entropy of either variable.