Mutual Information

ℹ️ Why It Matters

Mutual information (MI) answers a fundamental question: "How much does knowing one variable tell me about another?" Unlike correlation, MI captures arbitrary dependencies—linear, nonlinear, monotonic, or otherwise. It's the backbone of feature selection in ML, the objective in the information bottleneck method, and the metric for evaluating clustering quality. If you understand MI, you understand what it means for two variables to be "related."

Core Definitions

DfMutual Information

The mutual information between two discrete random variables $X$ and $Y$ is:

I(X; Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log_2 \frac{p(x, y)}{p(x) p(y)}

MI measures the reduction in uncertainty about one variable due to knowledge of the other.

DfConditional Entropy

The conditional entropy of $Y$ given $X$ is:

H(Y|X) = -\sum_{x,y} p(x,y) \log_2 p(y|x) = \sum_x p(x) H(Y|X=x)

It represents the remaining uncertainty about $Y$ after observing $X$ .

DfInformation Gain

Information gain (also called mutual information in the context of feature selection) is:

IG(D, A) = H(D) - H(D|A)

This is the expected reduction in entropy of dataset $D$ after splitting on feature $A$ .

Key Formulas

Mutual Information (Entropy Form)

I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)

Here,

$I(X;Y)$ =Mutual information between X and Y
$H(X)$ =Marginal entropy of X
$H(X|Y)$ =Conditional entropy of X given Y

Mutual Information (Joint Form)

I(X;Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)p(y)}

Here,

$p(x,y)$ =Joint probability
$p(x), p(y)$ =Marginal probabilities

Mutual Information as KL Divergence

I(X;Y) = D_{KL}\big(p(x,y) \;\|\; p(x)p(y)\big)

Here,

$D_{KL}$ =Kullback-Leibler divergence
$p(x)p(y)$ =Product of marginals (independence assumption)

Conditional Entropy

H(Y|X) = \sum_x p(x) H(Y|X=x) = H(X,Y) - H(X)

Here,

$H(Y|X)$ =Conditional entropy of Y given X
$H(X,Y)$ =Joint entropy

Chain Rule for Mutual Information

I(X; Y, Z) = I(X; Y) + I(X; Z | Y)

Here,

$I(X; Y, Z)$ =MI of X with the pair (Y, Z)
$I(X; Z | Y)$ =Conditional MI of X and Z given Y

Conditional Mutual Information

I(X; Y | Z) = H(X|Z) - H(X|Y,Z)

Here,

$I(X; Y | Z)$ =MI of X and Y given Z

Properties and Theorems

ThNon-negativity

$I(X; Y) \geq 0$ for all random variables $X, Y$ . Equality holds if and only if $X$ and $Y$ are independent.

ThSymmetry

$I(X; Y) = I(Y; X)$ . Mutual information is symmetric—knowing $X$ about $Y$ is the same as knowing $Y$ about $X$ .

ThSelf-Mutual Information

$I(X; X) = H(X)$ . A variable's MI with itself equals its entropy. This is also called the information of $X$ .

ThChain Rule Decomposition

I(X; Y, Z) = I(X; Y) + I(X; Z | Y)

The total MI between $X$ and the pair $(Y, Z)$ equals the MI between $X$ and $Y$ plus the conditional MI between $X$ and $Z$ given $Y$ .

ThData Processing Inequality

If $X \to Y \to Z$ forms a Markov chain, then $I(X; Z) \leq I(X; Y)$ . Processing data cannot increase information about the original source.

ThMI and Entropy Relationship

$I(X; Y) = H(X) + H(Y) - H(X, Y)$ . MI measures the "overlap" between the uncertainties of $X$ and $Y$ .

Worked Examples

📝Example 1: Independent Variables

Problem: If $X$ and $Y$ are independent with $X \sim \text{Bernoulli}(0.5)$ and $Y \sim \text{Bernoulli}(0.5)$ , what is $I(X; Y)$ ?

💡Solution: Independent Variables

Since $X$ and $Y$ are independent, $p(x,y) = p(x)p(y)$ for all $x, y$ . Therefore:

I(X; Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)p(y)} = \sum_{x,y} p(x,y) \log 1 = 0

MI is zero when variables are independent—knowing one tells you nothing about the other.

📝Example 2: Deterministic Relationship

Problem: Let $Y = X$ where $X$ is binary with $p(0) = 0.5$ . Compute $I(X; Y)$ .

💡Solution: Deterministic

Since $Y = X$ , knowing $X$ completely determines $Y$ :

I(X; Y) = H(Y) - H(Y|X) = H(X) - 0 = 1.0 \text{ bit}

MI equals the entropy when the relationship is deterministic.

📝Example 3: Partial Dependency

Problem: Let $X$ be binary uniform, and $Y = X$ with probability 0.8, else $Y = 1-X$ . Compute $I(X;Y)$ .

💡Solution: Partial Dependency

Joint distribution:

$p(0,0) = 0.5 \times 0.8 = 0.4$
$p(0,1) = 0.5 \times 0.2 = 0.1$
$p(1,0) = 0.5 \times 0.2 = 0.1$
$p(1,1) = 0.5 \times 0.8 = 0.4$

$H(Y) = 1.0$ bit (uniform).

$H(Y|X=0) = H(Y|X=1) = -[0.8 \log_2(0.8) + 0.2 \log_2(0.2)] = 0.722$ bits.

$I(X;Y) = H(Y) - H(Y|X) = 1.0 - 0.722 = 0.278$ bits.

📝Example 4: Chain Rule in Practice

Problem: $I(X; Y, Z) = 3$ bits, $I(X; Y) = 2$ bits. Find $I(X; Z | Y)$ .

💡Solution: Chain Rule

By the chain rule: $I(X; Y, Z) = I(X; Y) + I(X; Z | Y)$

I(X; Z | Y) = I(X; Y, Z) - I(X; Y) = 3 - 2 = 1 \text{ bit}

📝Example 5: Data Processing Inequality

Problem: If $X \to Y \to Z$ is a Markov chain with $I(X;Y) = 5$ bits and $I(Y;Z) = 4$ bits, what can you say about $I(X;Z)$ ?

💡Solution: Data Processing Inequality

By the data processing inequality: $I(X;Z) \leq I(X;Y) = 5$ bits.

Processing $Y$ into $Z$ cannot create new information about $X$ . So $I(X;Z) \leq 5$ bits. In general, $I(X;Z)$ could be anywhere from 0 to 5 bits depending on how much information $Z$ retains about $Y$ .

Python Implementation

import numpy as np
from collections import Counter
from sklearn.feature_selection import mutual_info_regression, mutual_info_classif

def mutual_information(x, y, base=2):
    """Compute MI between two discrete arrays."""
    n = len(x)
    x, y = np.array(x), np.array(y)

    # Joint distribution
    xy = np.stack([x, y], axis=1)
    xy_counter = Counter(map(tuple, xy))
    pxy = np.array([count / n for count in xy_counter.values()])

    # Marginals
    x_counter = Counter(x)
    y_counter = Counter(y)
    px = np.array([x_counter[k] / n for k in xy_counter.keys().__iter__()])
    # Rebuild properly
    px = np.array([sum(pxy[i] for i in range(len(pxy))
                       if list(xy_counter.keys())[i][0] == k)
                   for k in x_counter.keys()])

    # Simpler approach: compute from joint
    joint = np.zeros((len(set(x)), len(set(y))))
    x_vals, y_vals = sorted(set(x)), sorted(set(y))
    x_map = {v: i for i, v in enumerate(x_vals)}
    y_map = {v: i for i, v in enumerate(y_vals)}
    for xi, yi in zip(x, y):
        joint[x_map[xi], y_map[yi]] += 1
    joint /= n

    # MI from joint
    mi = 0.0
    for i in range(joint.shape[0]):
        for j in range(joint.shape[1]):
            if joint[i, j] > 0:
                px = joint[i].sum()
                py = joint[:, j].sum()
                mi += joint[i, j] * np.log2(joint[i, j] / (px * py))
    return mi

def conditional_mi(x, y, z):
    """Estimate I(X; Y | Z) by averaging I(X; Y) within each Z value."""
    z_vals = np.unique(z)
    total = 0.0
    for z_val in z_vals:
        mask = z == z_val
        if mask.sum() > 1:
            total += mutual_information(x[mask], y[mask]) * mask.mean()
    return total

# --- Feature Selection Example ---
np.random.seed(42)
X = np.random.randn(1000, 5)
y = X[:, 0] * 2 + X[:, 1] * 0.5 + np.random.randn(1000) * 0.1

mi_scores = mutual_info_regression(X, y, random_state=42)
print("MI scores per feature:")
for i, score in enumerate(mi_scores):
    print(f"  Feature {i}: {score:.4f}")

top_features = np.argsort(mi_scores)[::-1][:3]
print(f"Top 3 features: {top_features}")
# Feature 0 and 1 should have highest MI

Applications in AI/ML

ℹ️ Feature Selection

MI is model-agnostic: it captures any dependency (linear or nonlinear) between feature and target. Use mutual_info_classif or mutual_info_regression from sklearn to rank features. Select the top- $k$ features with highest MI scores.

ℹ️ Information Bottleneck

The information bottleneck method learns a compressed representation $T$ of input $X$ that preserves maximum information about target $Y$ :

\min_{p(t|x)} I(X; T) - \beta \, I(T; Y)

This is the theoretical foundation of autoencoders and VAEs.

ℹ️ Clustering Evaluation

Normalized mutual information (NMI) evaluates clustering quality against ground truth labels. NMI = 1 means perfect agreement; NMI = 0 means no relationship.

ℹ️ Feature Selection Pipeline

Compute MI between each feature and target.
Rank features by MI score.
Select top- $k$ or use a threshold.
Train model on selected features only.

This reduces dimensionality, removes noise, and often improves model performance.

Common Mistakes

Mistake	Why It's Wrong	Correct Approach
Using correlation instead of MI	Correlation only captures linear dependencies	Use MI to capture any type of dependency
Assuming MI = causation	MI measures association, not causation	MI is a necessary but not sufficient condition for causation
Ignoring the sign of MI	MI is always non-negative	If you get a negative MI estimate, check your implementation
Not handling continuous variables	MI estimation for continuous data requires binning or k-NN	Use `mutual_info_regression` from sklearn
Confusing MI with conditional MI	$I(X; Y) \neq I(X; Y \| Z)$	Conditional MI accounts for the effect of a third variable

Interview Questions

Q1: What's the difference between MI and correlation? A: Correlation measures linear dependence only. MI captures any type of dependence (linear, nonlinear, monotonic, etc.). MI = 0 if and only if variables are independent. Correlation = 0 only implies no linear relationship.

Q2: Why is MI symmetric but correlation can be asymmetric in causal interpretation? A: MI is a mutual measure— $I(X;Y) = I(Y;X)$ by definition. Correlation is also symmetric, but causal relationships are directional. MI measures shared information, not causal direction.

Q3: How do you estimate MI for continuous variables? A: Options include: (1) binning/discretization, (2) k-NN based estimators (Kragh & Pethel), (3) kernel density estimation. sklearn's mutual_info_regression uses k-NN estimation.

Q4: What is the information bottleneck? A: It's a method to find a compressed representation $T$ of $X$ that maximizes $I(T; Y)$ while minimizing $I(X; T)$ . The Lagrange multiplier $\beta$ controls the trade-off between compression and prediction.

Q5: Can MI be greater than the entropy? A: No. $I(X; Y) \leq \min(H(X), H(Y))$ . MI cannot exceed the entropy of either variable.

Practice Problems

📝Problem 1: Binary MI Calculation

Problem: $X$ and $Y$ are binary with joint probabilities: $p(0,0) = 0.3$ , $p(0,1) = 0.2$ , $p(1,0) = 0.1$ , $p(1,1) = 0.4$ . Compute $I(X;Y)$ .

💡Solution: Binary MI

Marginals: $p(X=0) = 0.5$ , $p(X=1) = 0.5$ , $p(Y=0) = 0.4$ , $p(Y=1) = 0.6$ .

$H(X) = 1.0$ bit, $H(Y) = -0.4 \log_2(0.4) - 0.6 \log_2(0.6) = 0.971$ bits.

$H(X,Y) = -[0.3 \log_2(0.3) + 0.2 \log_2(0.2) + 0.1 \log_2(0.1) + 0.4 \log_2(0.4)] = 1.846$ bits.

$I(X;Y) = H(X) + H(Y) - H(X,Y) = 1.0 + 0.971 - 1.846 = 0.125$ bits.

📝Problem 2: Maximum MI

Problem: What is the maximum possible MI between two binary variables? When is it achieved?

💡Solution: Maximum MI

$I(X;Y) \leq \min(H(X), H(Y))$ . For binary variables, $\min(H(X), H(Y)) \leq 1$ bit. Maximum MI = 1 bit, achieved when $Y$ is a deterministic function of $X$ (or vice versa) and both are uniform.

📝Problem 3: MI and Independence

Problem: If $I(X;Y) = 0$ , what can you conclude about $X$ and $Y$ ?

💡Solution: MI and Independence

$I(X;Y) = 0$ if and only if $X$ and $Y$ are independent. This means $p(x,y) = p(x)p(y)$ for all $x, y$ , so knowing one variable gives zero information about the other.

Quick Reference

Quantity	Formula	Notes
MI (entropy form)	$I(X;Y) = H(X) - H(X\\|Y)$	Reduction in uncertainty
MI (KL form)	$I(X;Y) = D_{KL}(p(x,y) \\| p(x)p(y))$	Distance from independence
Chain rule	$I(X; Y, Z) = I(X; Y) + I(X; Z \\| Y)$	Decompose multi-variable MI
Conditional MI	$I(X; Y \\| Z) = H(X\\|Z) - H(X\\|Y,Z)$	MI after controlling for Z
Data processing	$X \to Y \to Z \implies I(X;Z) \leq I(X;Y)$	Processing cannot add info

Cross-References

081 - Entropy — $I(X;Y) = H(X) - H(X|Y)$ — MI is defined in terms of entropy.
083 - KL Divergence: $I(X;Y) = D_{KL}(p(x,y) \| p(x)p(y))$ — MI is a special case of KL divergence.
084 - Cross-Entropy — Cross-entropy loss minimizes $H(P, Q)$ , related to MI through $D_{KL}$ .
085 - Applications — Feature selection, information bottleneck, decision trees.

Summary

📋Key Takeaways

Mutual Information: $I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)$ measures how much knowing one variable reduces uncertainty about the other. It captures any type of dependency.
MI as KL Divergence: $I(X;Y) = D_{KL}(p(x,y) \| p(x)p(y))$ measures how far the joint distribution is from independence. MI = 0 iff independent.
Symmetry: $I(X;Y) = I(Y;X)$ . MI is symmetric—knowing $X$ about $Y$ is the same as knowing $Y$ about $X$ . This differs from conditional MI which is asymmetric.
Chain Rule: $I(X; Y, Z) = I(X; Y) + I(X; Z | Y)$ . The total MI decomposes into a marginal MI plus a conditional MI.
Data Processing Inequality: If $X \to Y \to Z$ is a Markov chain, then $I(X;Z) \leq I(X;Y)$ . Post-processing cannot increase information.
Feature Selection: MI is the gold standard for model-agnostic feature selection. It captures nonlinear dependencies that correlation misses. Use sklearn's mutual_info_classif/mutual_info_regression for practical estimation.
Bounds: $I(X;Y) \leq \min(H(X), H(Y))$ . MI cannot exceed the entropy of either variable.