← Math|82 of 100
Information Theory

Mutual Information

Master mutual information, conditional entropy, chain rule, and applications in feature selection.

📂 MI📖 Lesson 82 of 100🎓 Free Course

Advertisement

Mutual Information

ℹ️ Why It Matters

Mutual information (MI) answers a fundamental question: "How much does knowing one variable tell me about another?" Unlike correlation, MI captures arbitrary dependencies—linear, nonlinear, monotonic, or otherwise. It's the backbone of feature selection in ML, the objective in the information bottleneck method, and the metric for evaluating clustering quality. If you understand MI, you understand what it means for two variables to be "related."


Core Definitions

DfMutual Information

The mutual information between two discrete random variables XX and YY is:

I(X;Y)=xXyYp(x,y)log2p(x,y)p(x)p(y)I(X; Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log_2 \frac{p(x, y)}{p(x) p(y)}

MI measures the reduction in uncertainty about one variable due to knowledge of the other.

DfConditional Entropy

The conditional entropy of YY given XX is:

H(YX)=x,yp(x,y)log2p(yx)=xp(x)H(YX=x)H(Y|X) = -\sum_{x,y} p(x,y) \log_2 p(y|x) = \sum_x p(x) H(Y|X=x)

It represents the remaining uncertainty about YY after observing XX.

DfInformation Gain

Information gain (also called mutual information in the context of feature selection) is:

IG(D,A)=H(D)H(DA)IG(D, A) = H(D) - H(D|A)

This is the expected reduction in entropy of dataset DD after splitting on feature AA.


Key Formulas

Mutual Information (Entropy Form)

I(X;Y)=H(X)H(XY)=H(Y)H(YX)I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)

Here,

  • I(X;Y)I(X;Y)=Mutual information between X and Y
  • H(X)H(X)=Marginal entropy of X
  • H(XY)H(X|Y)=Conditional entropy of X given Y

Mutual Information (Joint Form)

I(X;Y)=x,yp(x,y)logp(x,y)p(x)p(y)I(X;Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)p(y)}

Here,

  • p(x,y)p(x,y)=Joint probability
  • p(x),p(y)p(x), p(y)=Marginal probabilities

Mutual Information as KL Divergence

I(X;Y)=DKL(p(x,y)    p(x)p(y))I(X;Y) = D_{KL}\big(p(x,y) \;\|\; p(x)p(y)\big)

Here,

  • DKLD_{KL}=Kullback-Leibler divergence
  • p(x)p(y)p(x)p(y)=Product of marginals (independence assumption)

Conditional Entropy

H(YX)=xp(x)H(YX=x)=H(X,Y)H(X)H(Y|X) = \sum_x p(x) H(Y|X=x) = H(X,Y) - H(X)

Here,

  • H(YX)H(Y|X)=Conditional entropy of Y given X
  • H(X,Y)H(X,Y)=Joint entropy

Chain Rule for Mutual Information

I(X;Y,Z)=I(X;Y)+I(X;ZY)I(X; Y, Z) = I(X; Y) + I(X; Z | Y)

Here,

  • I(X;Y,Z)I(X; Y, Z)=MI of X with the pair (Y, Z)
  • I(X;ZY)I(X; Z | Y)=Conditional MI of X and Z given Y

Conditional Mutual Information

I(X;YZ)=H(XZ)H(XY,Z)I(X; Y | Z) = H(X|Z) - H(X|Y,Z)

Here,

  • I(X;YZ)I(X; Y | Z)=MI of X and Y given Z

Properties and Theorems

ThNon-negativity

I(X;Y)0I(X; Y) \geq 0 for all random variables X,YX, Y. Equality holds if and only if XX and YY are independent.

ThSymmetry

I(X;Y)=I(Y;X)I(X; Y) = I(Y; X). Mutual information is symmetric—knowing XX about YY is the same as knowing YY about XX.

ThSelf-Mutual Information

I(X;X)=H(X)I(X; X) = H(X). A variable's MI with itself equals its entropy. This is also called the information of XX.

ThChain Rule Decomposition

I(X;Y,Z)=I(X;Y)+I(X;ZY)I(X; Y, Z) = I(X; Y) + I(X; Z | Y)

The total MI between XX and the pair (Y,Z)(Y, Z) equals the MI between XX and YY plus the conditional MI between XX and ZZ given YY.

ThData Processing Inequality

If XYZX \to Y \to Z forms a Markov chain, then I(X;Z)I(X;Y)I(X; Z) \leq I(X; Y). Processing data cannot increase information about the original source.

ThMI and Entropy Relationship

I(X;Y)=H(X)+H(Y)H(X,Y)I(X; Y) = H(X) + H(Y) - H(X, Y). MI measures the "overlap" between the uncertainties of XX and YY.


Worked Examples

📝Example 1: Independent Variables

Problem: If XX and YY are independent with XBernoulli(0.5)X \sim \text{Bernoulli}(0.5) and YBernoulli(0.5)Y \sim \text{Bernoulli}(0.5), what is I(X;Y)I(X; Y)?

💡Solution: Independent Variables

Since XX and YY are independent, p(x,y)=p(x)p(y)p(x,y) = p(x)p(y) for all x,yx, y. Therefore:

I(X;Y)=x,yp(x,y)logp(x,y)p(x)p(y)=x,yp(x,y)log1=0I(X; Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)p(y)} = \sum_{x,y} p(x,y) \log 1 = 0

MI is zero when variables are independent—knowing one tells you nothing about the other.

📝Example 2: Deterministic Relationship

Problem: Let Y=XY = X where XX is binary with p(0)=0.5p(0) = 0.5. Compute I(X;Y)I(X; Y).

💡Solution: Deterministic

Since Y=XY = X, knowing XX completely determines YY:

I(X;Y)=H(Y)H(YX)=H(X)0=1.0 bitI(X; Y) = H(Y) - H(Y|X) = H(X) - 0 = 1.0 \text{ bit}

MI equals the entropy when the relationship is deterministic.

📝Example 3: Partial Dependency

Problem: Let XX be binary uniform, and Y=XY = X with probability 0.8, else Y=1XY = 1-X. Compute I(X;Y)I(X;Y).

💡Solution: Partial Dependency

Joint distribution:

  • p(0,0)=0.5×0.8=0.4p(0,0) = 0.5 \times 0.8 = 0.4
  • p(0,1)=0.5×0.2=0.1p(0,1) = 0.5 \times 0.2 = 0.1
  • p(1,0)=0.5×0.2=0.1p(1,0) = 0.5 \times 0.2 = 0.1
  • p(1,1)=0.5×0.8=0.4p(1,1) = 0.5 \times 0.8 = 0.4

H(Y)=1.0H(Y) = 1.0 bit (uniform).

H(YX=0)=H(YX=1)=[0.8log2(0.8)+0.2log2(0.2)]=0.722H(Y|X=0) = H(Y|X=1) = -[0.8 \log_2(0.8) + 0.2 \log_2(0.2)] = 0.722 bits.

I(X;Y)=H(Y)H(YX)=1.00.722=0.278I(X;Y) = H(Y) - H(Y|X) = 1.0 - 0.722 = 0.278 bits.

📝Example 4: Chain Rule in Practice

Problem: I(X;Y,Z)=3I(X; Y, Z) = 3 bits, I(X;Y)=2I(X; Y) = 2 bits. Find I(X;ZY)I(X; Z | Y).

💡Solution: Chain Rule

By the chain rule: I(X;Y,Z)=I(X;Y)+I(X;ZY)I(X; Y, Z) = I(X; Y) + I(X; Z | Y)

I(X;ZY)=I(X;Y,Z)I(X;Y)=32=1 bitI(X; Z | Y) = I(X; Y, Z) - I(X; Y) = 3 - 2 = 1 \text{ bit}

📝Example 5: Data Processing Inequality

Problem: If XYZX \to Y \to Z is a Markov chain with I(X;Y)=5I(X;Y) = 5 bits and I(Y;Z)=4I(Y;Z) = 4 bits, what can you say about I(X;Z)I(X;Z)?

💡Solution: Data Processing Inequality

By the data processing inequality: I(X;Z)I(X;Y)=5I(X;Z) \leq I(X;Y) = 5 bits.

Processing YY into ZZ cannot create new information about XX. So I(X;Z)5I(X;Z) \leq 5 bits. In general, I(X;Z)I(X;Z) could be anywhere from 0 to 5 bits depending on how much information ZZ retains about YY.


Python Implementation

import numpy as np
from collections import Counter
from sklearn.feature_selection import mutual_info_regression, mutual_info_classif

def mutual_information(x, y, base=2):
    """Compute MI between two discrete arrays."""
    n = len(x)
    x, y = np.array(x), np.array(y)

    # Joint distribution
    xy = np.stack([x, y], axis=1)
    xy_counter = Counter(map(tuple, xy))
    pxy = np.array([count / n for count in xy_counter.values()])

    # Marginals
    x_counter = Counter(x)
    y_counter = Counter(y)
    px = np.array([x_counter[k] / n for k in xy_counter.keys().__iter__()])
    # Rebuild properly
    px = np.array([sum(pxy[i] for i in range(len(pxy))
                       if list(xy_counter.keys())[i][0] == k)
                   for k in x_counter.keys()])

    # Simpler approach: compute from joint
    joint = np.zeros((len(set(x)), len(set(y))))
    x_vals, y_vals = sorted(set(x)), sorted(set(y))
    x_map = {v: i for i, v in enumerate(x_vals)}
    y_map = {v: i for i, v in enumerate(y_vals)}
    for xi, yi in zip(x, y):
        joint[x_map[xi], y_map[yi]] += 1
    joint /= n

    # MI from joint
    mi = 0.0
    for i in range(joint.shape[0]):
        for j in range(joint.shape[1]):
            if joint[i, j] > 0:
                px = joint[i].sum()
                py = joint[:, j].sum()
                mi += joint[i, j] * np.log2(joint[i, j] / (px * py))
    return mi

def conditional_mi(x, y, z):
    """Estimate I(X; Y | Z) by averaging I(X; Y) within each Z value."""
    z_vals = np.unique(z)
    total = 0.0
    for z_val in z_vals:
        mask = z == z_val
        if mask.sum() > 1:
            total += mutual_information(x[mask], y[mask]) * mask.mean()
    return total

# --- Feature Selection Example ---
np.random.seed(42)
X = np.random.randn(1000, 5)
y = X[:, 0] * 2 + X[:, 1] * 0.5 + np.random.randn(1000) * 0.1

mi_scores = mutual_info_regression(X, y, random_state=42)
print("MI scores per feature:")
for i, score in enumerate(mi_scores):
    print(f"  Feature {i}: {score:.4f}")

top_features = np.argsort(mi_scores)[::-1][:3]
print(f"Top 3 features: {top_features}")
# Feature 0 and 1 should have highest MI

Applications in AI/ML

ℹ️ Feature Selection

MI is model-agnostic: it captures any dependency (linear or nonlinear) between feature and target. Use mutual_info_classif or mutual_info_regression from sklearn to rank features. Select the top-kk features with highest MI scores.

ℹ️ Information Bottleneck

The information bottleneck method learns a compressed representation TT of input XX that preserves maximum information about target YY:

minp(tx)I(X;T)βI(T;Y)\min_{p(t|x)} I(X; T) - \beta \, I(T; Y)

This is the theoretical foundation of autoencoders and VAEs.

ℹ️ Clustering Evaluation

Normalized mutual information (NMI) evaluates clustering quality against ground truth labels. NMI = 1 means perfect agreement; NMI = 0 means no relationship.

ℹ️ Feature Selection Pipeline

  1. Compute MI between each feature and target.
  2. Rank features by MI score.
  3. Select top-kk or use a threshold.
  4. Train model on selected features only.

This reduces dimensionality, removes noise, and often improves model performance.


Common Mistakes

MistakeWhy It's WrongCorrect Approach
Using correlation instead of MICorrelation only captures linear dependenciesUse MI to capture any type of dependency
Assuming MI = causationMI measures association, not causationMI is a necessary but not sufficient condition for causation
Ignoring the sign of MIMI is always non-negativeIf you get a negative MI estimate, check your implementation
Not handling continuous variablesMI estimation for continuous data requires binning or k-NNUse mutual_info_regression from sklearn
Confusing MI with conditional MII(X;Y)I(X;YZ)I(X; Y) \neq I(X; Y | Z)Conditional MI accounts for the effect of a third variable

Interview Questions

Q1: What's the difference between MI and correlation? A: Correlation measures linear dependence only. MI captures any type of dependence (linear, nonlinear, monotonic, etc.). MI = 0 if and only if variables are independent. Correlation = 0 only implies no linear relationship.

Q2: Why is MI symmetric but correlation can be asymmetric in causal interpretation? A: MI is a mutual measure—I(X;Y)=I(Y;X)I(X;Y) = I(Y;X) by definition. Correlation is also symmetric, but causal relationships are directional. MI measures shared information, not causal direction.

Q3: How do you estimate MI for continuous variables? A: Options include: (1) binning/discretization, (2) k-NN based estimators (Kragh & Pethel), (3) kernel density estimation. sklearn's mutual_info_regression uses k-NN estimation.

Q4: What is the information bottleneck? A: It's a method to find a compressed representation TT of XX that maximizes I(T;Y)I(T; Y) while minimizing I(X;T)I(X; T). The Lagrange multiplier β\beta controls the trade-off between compression and prediction.

Q5: Can MI be greater than the entropy? A: No. I(X;Y)min(H(X),H(Y))I(X; Y) \leq \min(H(X), H(Y)). MI cannot exceed the entropy of either variable.


Practice Problems

📝Problem 1: Binary MI Calculation

Problem: XX and YY are binary with joint probabilities: p(0,0)=0.3p(0,0) = 0.3, p(0,1)=0.2p(0,1) = 0.2, p(1,0)=0.1p(1,0) = 0.1, p(1,1)=0.4p(1,1) = 0.4. Compute I(X;Y)I(X;Y).

💡Solution: Binary MI

Marginals: p(X=0)=0.5p(X=0) = 0.5, p(X=1)=0.5p(X=1) = 0.5, p(Y=0)=0.4p(Y=0) = 0.4, p(Y=1)=0.6p(Y=1) = 0.6.

H(X)=1.0H(X) = 1.0 bit, H(Y)=0.4log2(0.4)0.6log2(0.6)=0.971H(Y) = -0.4 \log_2(0.4) - 0.6 \log_2(0.6) = 0.971 bits.

H(X,Y)=[0.3log2(0.3)+0.2log2(0.2)+0.1log2(0.1)+0.4log2(0.4)]=1.846H(X,Y) = -[0.3 \log_2(0.3) + 0.2 \log_2(0.2) + 0.1 \log_2(0.1) + 0.4 \log_2(0.4)] = 1.846 bits.

I(X;Y)=H(X)+H(Y)H(X,Y)=1.0+0.9711.846=0.125I(X;Y) = H(X) + H(Y) - H(X,Y) = 1.0 + 0.971 - 1.846 = 0.125 bits.

📝Problem 2: Maximum MI

Problem: What is the maximum possible MI between two binary variables? When is it achieved?

💡Solution: Maximum MI

I(X;Y)min(H(X),H(Y))I(X;Y) \leq \min(H(X), H(Y)). For binary variables, min(H(X),H(Y))1\min(H(X), H(Y)) \leq 1 bit. Maximum MI = 1 bit, achieved when YY is a deterministic function of XX (or vice versa) and both are uniform.

📝Problem 3: MI and Independence

Problem: If I(X;Y)=0I(X;Y) = 0, what can you conclude about XX and YY?

💡Solution: MI and Independence

I(X;Y)=0I(X;Y) = 0 if and only if XX and YY are independent. This means p(x,y)=p(x)p(y)p(x,y) = p(x)p(y) for all x,yx, y, so knowing one variable gives zero information about the other.


Quick Reference

QuantityFormulaNotes
MI (entropy form)I(X;Y)=H(X)H(XY)I(X;Y) = H(X) - H(X\|Y)Reduction in uncertainty
MI (KL form)I(X;Y)=DKL(p(x,y)p(x)p(y))I(X;Y) = D_{KL}(p(x,y) \| p(x)p(y))Distance from independence
Chain ruleI(X;Y,Z)=I(X;Y)+I(X;ZY)I(X; Y, Z) = I(X; Y) + I(X; Z \| Y)Decompose multi-variable MI
Conditional MII(X;YZ)=H(XZ)H(XY,Z)I(X; Y \| Z) = H(X\|Z) - H(X\|Y,Z)MI after controlling for Z
Data processingXYZ    I(X;Z)I(X;Y)X \to Y \to Z \implies I(X;Z) \leq I(X;Y)Processing cannot add info

Cross-References

  • 081 - EntropyI(X;Y)=H(X)H(XY)I(X;Y) = H(X) - H(X|Y) — MI is defined in terms of entropy.
  • 083 - KL Divergence: I(X;Y)=DKL(p(x,y)p(x)p(y))I(X;Y) = D_{KL}(p(x,y) \| p(x)p(y)) — MI is a special case of KL divergence.
  • 084 - Cross-Entropy — Cross-entropy loss minimizes H(P,Q)H(P, Q), related to MI through DKLD_{KL}.
  • 085 - Applications — Feature selection, information bottleneck, decision trees.

Summary

📋Key Takeaways

  • Mutual Information: I(X;Y)=H(X)H(XY)=H(Y)H(YX)I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) measures how much knowing one variable reduces uncertainty about the other. It captures any type of dependency.

  • MI as KL Divergence: I(X;Y)=DKL(p(x,y)p(x)p(y))I(X;Y) = D_{KL}(p(x,y) \| p(x)p(y)) measures how far the joint distribution is from independence. MI = 0 iff independent.

  • Symmetry: I(X;Y)=I(Y;X)I(X;Y) = I(Y;X). MI is symmetric—knowing XX about YY is the same as knowing YY about XX. This differs from conditional MI which is asymmetric.

  • Chain Rule: I(X;Y,Z)=I(X;Y)+I(X;ZY)I(X; Y, Z) = I(X; Y) + I(X; Z | Y). The total MI decomposes into a marginal MI plus a conditional MI.

  • Data Processing Inequality: If XYZX \to Y \to Z is a Markov chain, then I(X;Z)I(X;Y)I(X;Z) \leq I(X;Y). Post-processing cannot increase information.

  • Feature Selection: MI is the gold standard for model-agnostic feature selection. It captures nonlinear dependencies that correlation misses. Use sklearn's mutual_info_classif/mutual_info_regression for practical estimation.

  • Bounds: I(X;Y)min(H(X),H(Y))I(X;Y) \leq \min(H(X), H(Y)). MI cannot exceed the entropy of either variable.

Lesson Progress82 / 100