Probability

Joint Distributions

Understand joint, marginal, and conditional distributions for multiple random variables, covariance, correlation, and independence.

Topic: Multivariate | Lesson 40 of 100

Why It Matters

Real-world data is rarely univariate. Most datasets (images, time series, tabular records) consist of multiple interacting variables. Joint distributions are the mathematical framework that captures how random variables co-vary, letting you reason about dependencies, compute conditional probabilities, and build multivariate models. Without them you cannot define correlation, perform Bayesian inference over several variables, build copula models in finance, or design variational autoencoders in deep learning. They underlie PCA, Gaussian mixture models, hidden Markov models, and most other multivariate statistical techniques, and they form the bridge from elementary probability to applied data science and machine learning.


Joint PMF/PDF

Definition: Joint Probability Mass Function (Discrete)

For two discrete random variables $X$ and $Y$, the joint PMF $p_{X,Y}(x,y)$ gives the probability that $X = x$ and $Y = y$ simultaneously:

$$p_{X,Y}(x,y) = P(X = x, Y = y)$$

The joint PMF must satisfy:

  1. $p_{X,Y}(x,y) \geq 0$ for all $(x,y)$
  2. $\sum_{x}\sum_{y} p_{X,Y}(x,y) = 1$

For a finite support with $m$ values of $X$ and $n$ values of $Y$, the joint PMF can be represented as an $m \times n$ matrix in which each entry is a probability.

Definition: Joint Probability Density Function (Continuous)

For two continuous random variables $X$ and $Y$, the joint PDF $f_{X,Y}(x,y)$ satisfies:

$$f_{X,Y}(x,y) \geq 0 \quad \text{for all } (x,y)$$
$$\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f_{X,Y}(x,y) \, dx \, dy = 1$$

The probability of falling in a region $A$ is:

$$P((X,Y) \in A) = \iint_A f_{X,Y}(x,y) \, dx \, dy$$

Unlike the PMF, the joint PDF value $f_{X,Y}(x,y)$ is not a probability; it is a density. Only integration over a region yields a probability.
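To make the density-versus-probability distinction concrete, here is a minimal sketch. It assumes a hypothetical uniform density on the square $[0, 0.5]^2$ (an illustration, not an example from the lesson), where the density equals 4 on the support while every probability stays at or below 1. The function name `joint_pdf` is made up for this sketch; `scipy.integrate.dblquad` does the integration.

```python
from scipy.integrate import dblquad

# Hypothetical example: (X, Y) uniform on the square [0, 0.5] x [0, 0.5].
# The density is 1 / area = 4 on the support, which is larger than 1.
def joint_pdf(y, x):
    # dblquad expects the inner integration variable (y) as the first argument.
    inside = (0.0 <= x <= 0.5) and (0.0 <= y <= 0.5)
    return 4.0 if inside else 0.0

# P(0 <= X <= 0.25, 0 <= Y <= 0.25): integrate the density over that region.
prob, abs_err = dblquad(joint_pdf, 0.0, 0.25, 0.0, 0.25)

print(f"density at (0.1, 0.1): {joint_pdf(0.1, 0.1)}")
print(f"P(X <= 0.25, Y <= 0.25) = {prob:.4f}")
```

The region covers one quarter of the support, so the integral should come out close to 0.25 even though the integrand is 4 everywhere inside it.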

Joint Distribution (Discrete)

$$P(X=x, Y=y) = p(x,y) \geq 0, \quad \sum_x \sum_y p(x,y) = 1$$

Here,

  • $p(x,y)$ = joint probability mass function
  • $P(X=x, Y=y)$ = probability that $X$ takes value $x$ and $Y$ takes value $y$ simultaneously

Joint Distribution (Continuous)

$$f_{X,Y}(x,y) \geq 0, \quad \iint_{\mathbb{R}^2} f_{X,Y}(x,y) \, dx\,dy = 1$$

Here,

  • $f_{X,Y}(x,y)$ = joint probability density function
  • $\mathbb{R}^2$ = the entire 2D real plane

📝 Joint PMF Example

Consider two fair coins tossed independently. Let $X$ = number of heads on coin 1 (0 or 1) and $Y$ = number of heads on coin 2 (0 or 1). The joint PMF is:

| | $Y=0$ | $Y=1$ |
|---|---|---|
| $X=0$ | 0.25 | 0.25 |
| $X=1$ | 0.25 | 0.25 |

Each cell has probability $0.25 = 0.5 \times 0.5$ because the coins are fair and independent.


Marginal Distributions

Theorem: Marginal Distribution

The marginal distribution of one variable is obtained by summing (discrete) or integrating (continuous) the joint distribution over all values of the other variable. It gives the distribution of a single variable, ignoring the others.

For discrete random variables:

$$P(X=x) = \sum_{y} P(X=x, Y=y) = \sum_{y} p(x,y)$$
$$P(Y=y) = \sum_{x} P(X=x, Y=y) = \sum_{x} p(x,y)$$

For continuous random variables:

$$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x,y) \, dy$$
$$f_Y(y) = \int_{-\infty}^{\infty} f_{X,Y}(x,y) \, dx$$
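The continuous marginal can be checked numerically. The sketch below assumes a standard bivariate normal with correlation 0.5 (an illustrative choice, not from the lesson) and integrates the joint density over $y$ with `scipy.integrate.quad`. For this particular joint the $X$-marginal is known to be $\mathcal{N}(0, 1)$, which gives something to compare against. The helper name `marginal_x` is hypothetical.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import multivariate_normal, norm

# Assumed example: standard bivariate normal with correlation 0.5.
joint = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, 0.5], [0.5, 1.0]])

def marginal_x(x):
    """f_X(x) = integral of f_{X,Y}(x, y) over all y, computed numerically."""
    value, _ = quad(lambda y: joint.pdf([x, y]), -np.inf, np.inf)
    return value

for x in (-1.0, 0.0, 2.0):
    # The X-marginal of this joint is the standard normal N(0, 1).
    print(f"x={x:+.1f}  numeric={marginal_x(x):.6f}  exact={norm.pdf(x):.6f}")
```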

⚠️ Marginalizing Loses Information

Computing a marginal distribution discards all information about the relationship between variables. Two very different joint distributions can have the same marginals. The joint distribution contains strictly more information than the marginals alone.
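A small sketch of this point, using two hypothetical joint PMFs over binary variables (the tables are made up for illustration): one in which $X$ and $Y$ are independent, and one in which $Y$ always equals $X$. Both have identical marginals.

```python
import numpy as np

# Two hypothetical joint PMFs over binary X (rows) and Y (columns).
joint_independent = np.array([[0.25, 0.25],
                              [0.25, 0.25]])   # X and Y independent
joint_coupled = np.array([[0.5, 0.0],
                          [0.0, 0.5]])         # Y always equals X

for name, joint in [("independent", joint_independent), ("coupled", joint_coupled)]:
    p_x = joint.sum(axis=1)  # sum over Y
    p_y = joint.sum(axis=0)  # sum over X
    print(f"{name}: P(X)={p_x}, P(Y)={p_y}")

same_marginals = (
    np.allclose(joint_independent.sum(axis=1), joint_coupled.sum(axis=1))
    and np.allclose(joint_independent.sum(axis=0), joint_coupled.sum(axis=0))
)
same_joint = np.allclose(joint_independent, joint_coupled)
print("same marginals:", same_marginals)
print("same joint:", same_joint)
```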

📝 Marginal from Joint PMF

Given the joint PMF:

| | $Y=0$ | $Y=1$ | $Y=2$ | Marginal $P(X)$ |
|---|---|---|---|---|
| $X=0$ | 0.1 | 0.2 | 0.1 | 0.4 |
| $X=1$ | 0.2 | 0.3 | 0.1 | 0.6 |
| Marginal $P(Y)$ | 0.3 | 0.5 | 0.2 | 1.0 |

$$P(X=0) = 0.1 + 0.2 + 0.1 = 0.4$$
$$P(Y=1) = 0.2 + 0.3 = 0.5$$

Conditional Distributions

Definition: Conditional Distribution

The conditional distribution of $Y$ given $X = x$ describes the probability of $Y$ values when we know $X$ has taken a specific value. It is defined as:

Discrete case:

$$P(Y=y \mid X=x) = \frac{P(X=x, Y=y)}{P(X=x)} = \frac{p(x,y)}{p_X(x)}$$

provided $p_X(x) > 0$.

Continuous case:

$$f_{Y \mid X}(y \mid x) = \frac{f_{X,Y}(x,y)}{f_X(x)}$$

provided $f_X(x) > 0$.

Conditional Probability: Bayes' Rule for Joint Distributions

$$P(Y=y \mid X=x) = \frac{P(X=x, Y=y)}{P(X=x)} = \frac{p(x,y)}{\sum_{y'} p(x,y')}$$

Here,

  • $P(Y=y \mid X=x)$ = conditional probability of $Y = y$ given $X = x$
  • $p(x,y)$ = joint probability at $(x,y)$
  • $p_X(x) = \sum_{y'} p(x,y')$ = marginal probability of $X = x$ (the normalizing constant)

Theorem: Chain Rule of Probability

The joint distribution can always be decomposed as a product of a marginal and a conditional:

$$P(X=x, Y=y) = P(X=x) \cdot P(Y=y \mid X=x)$$
$$P(X=x, Y=y) = P(Y=y) \cdot P(X=x \mid Y=y)$$

This chain rule generalizes to $n$ variables:

$$P(X_1, X_2, \dots, X_n) = P(X_1) \cdot P(X_2 \mid X_1) \cdot P(X_3 \mid X_1, X_2) \cdots P(X_n \mid X_1, \dots, X_{n-1})$$

📝 Conditional Distribution Example

Suppose $P(X=0,Y=0)=0.2$, $P(X=0,Y=1)=0.3$, $P(X=1,Y=0)=0.1$, $P(X=1,Y=1)=0.4$.

Find $P(Y=1 \mid X=0)$:

$$P(Y=1 \mid X=0) = \frac{P(X=0, Y=1)}{P(X=0)} = \frac{0.3}{0.2 + 0.3} = \frac{0.3}{0.5} = 0.6$$

Interpretation: Given $X=0$, there is a 60% chance that $Y=1$.
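The same computation can be done for every row of the table at once. This is a minimal sketch using the example's numbers; the helper name `conditional_y_given_x` is hypothetical, and rows with zero marginal probability are left as NaN because the conditional is undefined there.

```python
import numpy as np

# Joint PMF from the example: rows are X = 0, 1 and columns are Y = 0, 1.
joint = np.array([[0.2, 0.3],
                  [0.1, 0.4]])

def conditional_y_given_x(joint_pmf):
    """Return a table whose row x holds P(Y = y | X = x)."""
    p_x = joint_pmf.sum(axis=1, keepdims=True)
    cond = np.full_like(joint_pmf, np.nan, dtype=float)
    # Rows with P(X = x) = 0 stay NaN: the conditional is undefined there.
    np.divide(joint_pmf, p_x, out=cond, where=p_x > 0)
    return cond

cond = conditional_y_given_x(joint)
print(cond)
print("P(Y=1 | X=0) =", cond[0, 1])
```

Each row of the result is a distribution over $Y$, so each row should sum to 1.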


Independence of Random Variables

Definition: Statistical Independence

Two random variables $X$ and $Y$ are independent if and only if the joint distribution factorizes into the product of marginals for all values:

$$P(X=x, Y=y) = P(X=x) \cdot P(Y=y) \quad \text{for all } x, y$$

Equivalently, $X$ and $Y$ are independent if and only if:

  1. $P(Y=y \mid X=x) = P(Y=y)$ for all $x, y$ with $P(X=x) > 0$ (knowing $X$ does not change the distribution of $Y$)
  2. $f_{X,Y}(x,y) = f_X(x) \cdot f_Y(y)$ for all $(x,y)$ (continuous case)

⚠️ Independence vs Uncorrelated

Independence is strictly stronger than zero correlation. If $X$ and $Y$ are independent, then $\text{Cov}(X,Y) = 0$. The converse is false: uncorrelated variables can still be dependent, for example $X \sim U(-1,1)$ and $Y = X^2$. This distinction matters in ML: some models assume independence (Naive Bayes) and others capture only linear dependence (PCA), so both can miss nonlinear relationships.

Theorem: Tests for Independence

To check whether $X$ and $Y$ are independent:

  1. Factorization test: Verify $p(x,y) = p_X(x) \cdot p_Y(y)$ for all $(x,y)$.
  2. Conditional test: Verify $P(Y=y \mid X=x) = P(Y=y)$ for all $x,y$ with $P(X=x) > 0$.
  3. Correlation test (necessary only): If $\text{Cov}(X,Y) \neq 0$, they are not independent. But $\text{Cov}(X,Y) = 0$ does not prove independence.
  4. Mutual information: $I(X;Y) = 0$ if and only if $X$ and $Y$ are independent.
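The mutual information criterion can be evaluated exactly when the joint PMF is known. The sketch below uses the standard definition $I(X;Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p_X(x)\,p_Y(y)}$ in nats; the function name `mutual_information` and the two tables are illustrative assumptions.

```python
import numpy as np

def mutual_information(joint_pmf):
    """I(X;Y) in nats from a joint PMF table (rows: X, columns: Y)."""
    p_x = joint_pmf.sum(axis=1, keepdims=True)
    p_y = joint_pmf.sum(axis=0, keepdims=True)
    product = p_x * p_y                # what the joint would be under independence
    mask = joint_pmf > 0               # convention: 0 * log(0) = 0
    return float(np.sum(joint_pmf[mask] * np.log(joint_pmf[mask] / product[mask])))

independent = np.outer([0.3, 0.7], [0.4, 0.6])   # factorizes by construction
dependent = np.array([[0.2, 0.3],
                      [0.1, 0.4]])

mi_independent = mutual_information(independent)
mi_dependent = mutual_information(dependent)
print(f"I(X;Y), independent table: {mi_independent:.6f}")
print(f"I(X;Y), dependent table:   {mi_dependent:.6f}")
```

With a known table the result is exact up to floating-point error. Estimating mutual information from samples is harder and typically biased, so a small positive estimate from data is not by itself evidence of dependence.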

📝 Checking Independence

Given: $P(X=0,Y=0)=0.12$, $P(X=0,Y=1)=0.18$, $P(X=1,Y=0)=0.28$, $P(X=1,Y=1)=0.42$.

Marginals: $P(X=0)=0.30$, $P(X=1)=0.70$, $P(Y=0)=0.40$, $P(Y=1)=0.60$.

Check every cell against the product of its marginals:

$$P(X=0) \cdot P(Y=0) = 0.30 \times 0.40 = 0.12 = P(X=0,Y=0)$$
$$P(X=0) \cdot P(Y=1) = 0.30 \times 0.60 = 0.18 = P(X=0,Y=1)$$
$$P(X=1) \cdot P(Y=0) = 0.70 \times 0.40 = 0.28 = P(X=1,Y=0)$$
$$P(X=1) \cdot P(Y=1) = 0.70 \times 0.60 = 0.42 = P(X=1,Y=1)$$

All four entries match, so $X$ and $Y$ are independent.
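The factorization check above can be sketched in NumPy by comparing the joint table with the outer product of its marginals. The variable names are illustrative.

```python
import numpy as np

# Joint PMF from the example: rows are X = 0, 1 and columns are Y = 0, 1.
joint = np.array([[0.12, 0.18],
                  [0.28, 0.42]])

p_x = joint.sum(axis=1)          # marginal of X
p_y = joint.sum(axis=0)          # marginal of Y
product = np.outer(p_x, p_y)     # table the joint would equal under independence

# Compare with a tolerance because the entries are floating-point numbers.
is_independent = np.allclose(joint, product)
print(product)
print("independent:", is_independent)
```

Using `np.allclose` instead of `==` avoids false negatives from rounding in the products.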


Joint Expectation

Expectation of a Function of Two Variables

$$E[g(X,Y)] = \sum_x \sum_y g(x,y) \, p(x,y) \quad \text{(discrete)}$$

Here,

  • $g(X,Y)$ = any function of the two random variables
  • $p(x,y)$ = joint PMF

Joint Expectation (Continuous)

$$E[g(X,Y)] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g(x,y) \, f_{X,Y}(x,y) \, dx \, dy$$

Here,

  • $g(X,Y)$ = any function of the two random variables
  • $f_{X,Y}(x,y)$ = joint PDF

Theorem: Linearity of Joint Expectation

For any random variables $X, Y$ and constants $a, b, c$:

  1. $E[aX + bY + c] = aE[X] + bE[Y] + c$
  2. $E[X + Y] = E[X] + E[Y]$ (always holds, even for dependent variables)
  3. $E[XY] = E[X] \cdot E[Y]$ if and only if $X$ and $Y$ are uncorrelated; independence is sufficient but not necessary

Important: Linearity of expectation never requires independence. The product rule $E[XY] = E[X]E[Y]$ holds exactly when $\text{Cov}(X,Y) = 0$, which independence guarantees.

📝 Joint Expectation Computation

Let $P(X=0,Y=0)=0.2$, $P(X=0,Y=1)=0.3$, $P(X=1,Y=0)=0.1$, $P(X=1,Y=1)=0.4$. The marginals are $P(X=1) = 0.1 + 0.4 = 0.5$ and $P(Y=1) = 0.3 + 0.4 = 0.7$.

$$E[XY] = \sum_x \sum_y xy \cdot p(x,y) = 0 \cdot 0.2 + 0 \cdot 0.3 + 0 \cdot 0.1 + 1 \cdot 1 \cdot 0.4 = 0.4$$
$$E[X] = 0(0.5) + 1(0.5) = 0.5$$
$$E[Y] = 0(0.3) + 1(0.7) = 0.7$$

Since $E[XY] = 0.4 \neq E[X]E[Y] = 0.35$, $X$ and $Y$ are correlated and therefore not independent.
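The double sum for $E[g(X,Y)]$ can be sketched as a small helper applied to the example's table. The function name `expect` is hypothetical; it builds grids of $x$ and $y$ values that line up with the rows and columns of the joint PMF.

```python
import numpy as np

x_vals = np.array([0, 1])
y_vals = np.array([0, 1])
joint = np.array([[0.2, 0.3],    # X = 0
                  [0.1, 0.4]])   # X = 1

def expect(g, joint_pmf, xs, ys):
    """E[g(X, Y)] = sum over x, y of g(x, y) * p(x, y)."""
    gx, gy = np.meshgrid(xs, ys, indexing="ij")  # gx[i, j] = xs[i], gy[i, j] = ys[j]
    return float(np.sum(g(gx, gy) * joint_pmf))

e_xy = expect(lambda x, y: x * y, joint, x_vals, y_vals)
e_x = expect(lambda x, y: x, joint, x_vals, y_vals)
e_y = expect(lambda x, y: y, joint, x_vals, y_vals)
print(f"E[XY]={e_xy:.2f}, E[X]={e_x:.2f}, E[Y]={e_y:.2f}, E[XY]-E[X]E[Y]={e_xy - e_x * e_y:.2f}")
```

Passing `indexing="ij"` matters: it keeps the first grid axis aligned with the rows ($X$) of the table.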


Covariance

Covariance

$$\text{Cov}(X,Y) = E[(X - \mu_X)(Y - \mu_Y)] = E[XY] - E[X]E[Y]$$

Here,

  • $\text{Cov}(X,Y)$ = covariance between $X$ and $Y$, a measure of linear co-movement
  • $\mu_X, \mu_Y$ = means of $X$ and $Y$ respectively
  • $E[XY]$ = joint expectation (product moment)

Theorem: Properties of Covariance

  1. Self-covariance: $\text{Cov}(X,X) = \text{Var}(X)$
  2. Symmetry: $\text{Cov}(X,Y) = \text{Cov}(Y,X)$
  3. Linearity: $\text{Cov}(aX + b, cY + d) = ac \cdot \text{Cov}(X,Y)$
  4. Sum rule: $\text{Cov}(X+Z, Y) = \text{Cov}(X,Y) + \text{Cov}(Z,Y)$
  5. Zero for independent variables: If $X \perp Y$, then $\text{Cov}(X,Y) = 0$
  6. Variance of sum: $\text{Var}(X+Y) = \text{Var}(X) + \text{Var}(Y) + 2\text{Cov}(X,Y)$
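Two of these properties (the variance of a sum and the shift-and-scale rule) also hold exactly for sample statistics, provided the same normalization is used throughout. The sketch below checks them on simulated data; the data-generating choices and the helper `cov` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1_000)
y = 0.6 * x + rng.normal(size=1_000)   # hypothetical correlated pair

def cov(u, v):
    """Sample covariance; np.cov uses ddof=1 by default, matching np.var below."""
    return np.cov(u, v)[0, 1]

# Variance of a sum: Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y)
lhs = np.var(x + y, ddof=1)
rhs = np.var(x, ddof=1) + np.var(y, ddof=1) + 2 * cov(x, y)
print(f"Var(X+Y) = {lhs:.6f}, Var(X) + Var(Y) + 2 Cov(X,Y) = {rhs:.6f}")

# Shift and scale: Cov(aX + b, cY + d) = a * c * Cov(X, Y)
a, b, c, d = 2.0, 5.0, -3.0, 1.0
scaled_ok = np.isclose(cov(a * x + b, c * y + d), a * c * cov(x, y))
print("shift-and-scale rule holds:", scaled_ok)
```

Mixing `ddof=0` and `ddof=1` between the variance and covariance calls would break the first identity, which is a common source of small mismatches.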

ℹ️ Interpreting Covariance Sign

  • $\text{Cov}(X,Y) > 0$: $X$ and $Y$ tend to move in the same direction
  • $\text{Cov}(X,Y) < 0$: $X$ and $Y$ tend to move in opposite directions
  • $\text{Cov}(X,Y) = 0$: $X$ and $Y$ are uncorrelated (but not necessarily independent)
  • The magnitude of covariance depends on the units of $X$ and $Y$, which makes it hard to compare across variable pairs. This is why correlation is needed.

Correlation

Pearson Correlation Coefficient

$$\rho_{XY} = \frac{\text{Cov}(X,Y)}{\sigma_X \sigma_Y} \in [-1, 1]$$

Here,

  • $\rho_{XY}$ = Pearson correlation coefficient between $X$ and $Y$
  • $\sigma_X, \sigma_Y$ = standard deviations of $X$ and $Y$

Theorem: Properties of Correlation

  1. $-1 \leq \rho \leq 1$ always
  2. $\rho = 1$ if and only if $Y = aX + b$ with probability 1 for some $a > 0$ (perfect positive linear relationship)
  3. $\rho = -1$ if and only if $Y = aX + b$ with probability 1 for some $a < 0$ (perfect negative linear relationship)
  4. $\rho = 0$ means uncorrelated (no linear relationship)
  5. $\rho$ is invariant to positive linear transformations: $\text{Corr}(aX+b, cY+d) = \text{Corr}(X,Y)$ for $a, c > 0$
  6. Independence $\Rightarrow$ $\rho = 0$, but $\rho = 0$ does not imply independence

⚠️ Correlation Only Captures Linear Dependence

Two variables can have a strong nonlinear relationship and still have $\rho = 0$ (or $\rho \approx 0$ in a sample). Classic example: $X \sim U(-1, 1)$ and $Y = X^2$. Here $Y$ is a deterministic function of $X$, yet $\text{Cov}(X,Y) = E[X^3] - E[X]E[X^2] = 0$. Use mutual information or distance correlation to detect nonlinear dependencies.

📝 Computing Correlation

Given: $E[X] = 2$, $E[Y] = 5$, $E[XY] = 12$, $\text{Var}(X) = 4$, $\text{Var}(Y) = 9$.

$$\text{Cov}(X,Y) = E[XY] - E[X]E[Y] = 12 - 10 = 2$$
$$\rho = \frac{\text{Cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{2}{2 \times 3} = \frac{1}{3} \approx 0.333$$

Interpretation: a moderate positive linear correlation.


Python Implementation

💡 Python for Joint Distributions

NumPy and SciPy provide efficient tools for computing marginals, conditionals, covariance, and correlation from either a theoretical joint table or empirical samples.

import numpy as np
from scipy import stats

# === Joint PMF from a table ===
# Rows index X, columns index Y; all entries must sum to 1.
joint_pmf = np.array([
    [0.1, 0.2, 0.1],   # X=0: Y=0,1,2
    [0.2, 0.3, 0.1],   # X=1: Y=0,1,2
])
assert np.isclose(joint_pmf.sum(), 1.0), "joint PMF must sum to 1"

# Marginal distributions
p_x = joint_pmf.sum(axis=1)  # Sum over Y → P(X)
p_y = joint_pmf.sum(axis=0)  # Sum over X → P(Y)
print(f"Marginal P(X): {p_x}")   # approximately [0.4, 0.6]
print(f"Marginal P(Y): {p_y}")   # approximately [0.3, 0.5, 0.2]

# Conditional distribution P(Y | X=1)
x_idx = 1
p_y_given_x1 = joint_pmf[x_idx] / p_x[x_idx]
print(f"P(Y | X=1): {p_y_given_x1}")

# Verify: conditional probabilities sum to 1
print(f"Sum of conditional: {p_y_given_x1.sum():.4f}")  # Should be 1.0

# === Joint PDF for continuous variables ===
# Bivariate normal example
mean = [0, 0]
cov_matrix = [[1, 0.8], [0.8, 1]]  # Correlation = 0.8

# Sample from bivariate normal
np.random.seed(42)
n_samples = 10000
samples = np.random.multivariate_normal(mean, cov_matrix, size=n_samples)
X_samples, Y_samples = samples[:, 0], samples[:, 1]

# Empirical covariance and correlation
emp_cov = np.cov(X_samples, Y_samples)
emp_corr = np.corrcoef(X_samples, Y_samples)
print(f"Empirical covariance matrix:\n{emp_cov}")
print(f"Empirical correlation:\n{emp_corr}")

# === Independence check ===
# Two independent variables
X_ind = np.random.randn(5000)
Y_ind = np.random.randn(5000)
print(f"Correlation (independent): {np.corrcoef(X_ind, Y_ind)[0,1]:.4f}")

# Two dependent variables (Y = X^2)
X_dep = np.random.uniform(-1, 1, 5000)
Y_dep = X_dep ** 2
print(f"Correlation (dependent but uncorrelated): {np.corrcoef(X_dep, Y_dep)[0,1]:.4f}")

# === Mutual information for nonlinear dependence ===
from scipy.stats import entropy

def mutual_information_2d(x, y, bins=20):
    """Estimate mutual information using histogram-based method."""
    p_xy = np.histogram2d(x, y, bins=bins)[0]
    p_xy = p_xy / p_xy.sum()
    p_x = p_xy.sum(axis=1)
    p_y = p_xy.sum(axis=0)
    mi = entropy(p_x) + entropy(p_y) - entropy(p_xy.ravel())
    return mi

print(f"MI (independent): {mutual_information_2d(X_ind, Y_ind):.4f}")
print(f"MI (X, X^2):     {mutual_information_2d(X_dep, Y_dep):.4f}")

# === Visualize joint distribution ===
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Joint PMF heatmap
im = axes[0].imshow(joint_pmf, cmap='Blues', vmin=0, vmax=0.3)
axes[0].set_xlabel('Y')
axes[0].set_ylabel('X')
axes[0].set_title('Joint PMF')
plt.colorbar(im, ax=axes[0])

# Bivariate normal samples
axes[1].scatter(X_samples[:500], Y_samples[:500], alpha=0.3, s=10)
axes[1].set_xlabel('X')
axes[1].set_ylabel('Y')
axes[1].set_title('Bivariate Normal (ρ=0.8)')
axes[1].set_aspect('equal')

# Dependent but uncorrelated
axes[2].scatter(X_dep[:500], Y_dep[:500], alpha=0.3, s=10)
axes[2].set_xlabel('X')
axes[2].set_ylabel('Y = X²')
axes[2].set_title('Dependent but Uncorrelated')

plt.tight_layout()
plt.savefig('joint_distributions.png', dpi=150)
plt.show()

Applications in AI/ML

💡 Joint Distributions in Machine Learning

Understanding joint distributions is essential for probabilistic modeling, generative models, and any ML task that involves uncertainty quantification.

Definition: Generative Models

A generative model learns the joint distribution $p_{\text{model}}(\mathbf{x}, z)$ over data $\mathbf{x}$ and latent variables $z$. Once learned, you can:

  • Sample new data: draw $z \sim p(z)$, then $\mathbf{x} \sim p(\mathbf{x} \mid z)$
  • Compute posteriors: $p(z \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid z) \, p(z)}{p(\mathbf{x})}$ (Bayesian inference)
  • Evaluate likelihood: $p(\mathbf{x}) = \int p(\mathbf{x}, z) \, dz$

Examples: VAEs, GANs (implicitly), Hidden Markov Models, Gaussian Mixture Models, Bayesian Networks.

Definition: Naive Bayes Classifier

The Naive Bayes classifier assumes conditional independence of features given the class label:

$$P(\mathbf{x} \mid Y = k) = \prod_{j=1}^{d} P(x_j \mid Y = k)$$

This factorization turns a high-dimensional joint estimation problem into $d$ independent one-dimensional problems, making training tractable even with limited data. Despite the unrealistic independence assumption, Naive Bayes often performs surprisingly well in practice (spam filtering, text classification).
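A minimal sketch of how the factorization is used at prediction time, assuming a hypothetical model with two classes and three binary features. All probabilities are made up for illustration, and `posterior` is not a library function. The product over features is computed in log space to avoid underflow when $d$ is large.

```python
import numpy as np

# Hypothetical model: 2 classes, 3 binary features.
prior = np.array([0.4, 0.6])                 # P(Y = k)
# p_feature[k, j] = P(x_j = 1 | Y = k); values are made up for illustration.
p_feature = np.array([[0.8, 0.1, 0.5],
                      [0.2, 0.7, 0.5]])

def posterior(x):
    """P(Y = k | x) under the conditional-independence assumption."""
    x = np.asarray(x)
    # Sum of log P(x_j | Y = k) over features is the log of the product.
    log_lik = np.sum(x * np.log(p_feature) + (1 - x) * np.log(1 - p_feature), axis=1)
    log_joint = np.log(prior) + log_lik       # log P(x, Y = k)
    log_joint -= log_joint.max()              # stabilize before exponentiating
    unnorm = np.exp(log_joint)
    return unnorm / unnorm.sum()              # divide by P(x)

post = posterior([1, 0, 1])
print(post)
```

For the input `[1, 0, 1]` the unnormalized joint values are $0.4 \times 0.8 \times 0.9 \times 0.5 = 0.144$ and $0.6 \times 0.2 \times 0.3 \times 0.5 = 0.018$, so class 0 receives posterior probability $0.144 / 0.162$.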

Definition: Gaussian Mixture Models (GMM)

A GMM models data as a mixture of $K$ multivariate Gaussians:

$$p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$$

where $\pi_k$ are mixing weights, $\boldsymbol{\mu}_k$ are component means, and $\boldsymbol{\Sigma}_k$ are component covariance matrices. GMMs capture multimodal distributions and are used for clustering, density estimation, and as sub-components of more complex models.
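The mixture density and its two-stage sampling procedure can be sketched as follows. The two-component parameters are hypothetical, and `gmm_pdf` and `gmm_sample` are illustrative helper names rather than library functions.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical 2-component mixture in two dimensions.
weights = np.array([0.3, 0.7])                       # mixing weights, sum to 1
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs = [np.eye(2), np.array([[1.0, 0.5], [0.5, 2.0]])]

def gmm_pdf(x):
    """p(x) = sum_k pi_k * N(x | mu_k, Sigma_k)."""
    return sum(
        w * multivariate_normal(mean=m, cov=c).pdf(x)
        for w, m, c in zip(weights, means, covs)
    )

def gmm_sample(n, rng):
    """Draw a component index k ~ pi, then x ~ N(mu_k, Sigma_k)."""
    ks = rng.choice(len(weights), size=n, p=weights)
    return np.array([rng.multivariate_normal(means[k], covs[k]) for k in ks])

rng = np.random.default_rng(0)
samples = gmm_sample(2_000, rng)
print("density at the first mean:", gmm_pdf(means[0]))
print("sample mean:", samples.mean(axis=0))   # should be near 0.7 * [3, 3] = [2.1, 2.1]
```

Sampling the component index first and the point second is the chain rule $p(\mathbf{x}, k) = p(k)\,p(\mathbf{x} \mid k)$ applied to the mixture.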

Definition: Variational Autoencoders (VAEs)

VAEs approximate the intractable posterior $p(z \mid \mathbf{x})$ with a learned encoder $q_\phi(z \mid \mathbf{x})$. The training objective is the ELBO:

$$\mathcal{L} = E_{q_\phi(z \mid \mathbf{x})}[\log p_\theta(\mathbf{x} \mid z)] - D_{\text{KL}}(q_\phi(z \mid \mathbf{x}) \,\|\, p(z))$$

This relies fundamentally on the joint distribution $p(\mathbf{x}, z) = p(\mathbf{x} \mid z) \, p(z)$.
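The KL term of the ELBO has a closed form under one common set of assumptions, which the lesson does not state explicitly: a diagonal Gaussian encoder $q_\phi(z \mid \mathbf{x}) = \mathcal{N}(\boldsymbol{\mu}, \text{diag}(\boldsymbol{\sigma}^2))$ and a standard normal prior $p(z) = \mathcal{N}(0, I)$. In that case $D_{\text{KL}} = \frac{1}{2}\sum_j (\mu_j^2 + \sigma_j^2 - 1 - \log \sigma_j^2)$. The sketch below implements only this term; the function name is hypothetical.

```python
import numpy as np

def kl_diag_gaussian_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    mu = np.asarray(mu, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    return 0.5 * np.sum(mu**2 + np.exp(log_var) - 1.0 - log_var)

# When the encoder output equals the prior, the KL term vanishes.
kl_at_prior = kl_diag_gaussian_to_standard_normal([0.0, 0.0], [0.0, 0.0])
# Moving the mean away from 0 or changing the variance makes it positive.
kl_shifted = kl_diag_gaussian_to_standard_normal([1.0, -0.5], [0.2, -0.3])
print(kl_at_prior, kl_shifted)
```

Parameterizing by the log-variance is a common convention because it keeps the variance positive without constraints.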

📝 Covariance Matrix in PCA

Principal Component Analysis (PCA) performs an eigendecomposition of the data covariance matrix $\boldsymbol{\Sigma} = \frac{1}{n}\mathbf{X}^T\mathbf{X}$, where the columns of $\mathbf{X}$ have been mean-centered (some texts use $\frac{1}{n-1}$ instead of $\frac{1}{n}$). The eigenvectors are the principal components (directions of maximum variance), and the eigenvalues quantify the variance explained by each component. PCA depends on the joint distribution of the features only through this covariance matrix.
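A minimal sketch of PCA through the covariance matrix, on hypothetical two-feature data generated with correlation 0.8. It centers the data, forms $\frac{1}{n}\mathbf{X}^T\mathbf{X}$, and uses `np.linalg.eigh`, which is intended for symmetric matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 500 samples of 2 correlated features.
data = rng.multivariate_normal([1.0, -2.0], [[1.0, 0.8], [0.8, 1.0]], size=500)

centered = data - data.mean(axis=0)            # PCA assumes mean-centered columns
cov = centered.T @ centered / len(centered)    # (1/n) X^T X

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]              # sort by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained_ratio = eigvals / eigvals.sum()
print("principal directions (columns):\n", eigvecs)
print("explained variance ratio:", explained_ratio)
```

With equal variances and positive correlation, the first principal direction should lie close to the diagonal $(1, 1)/\sqrt{2}$, up to sign.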


Common Mistakes

| Mistake | Why It's Wrong | Correct Approach |
|---|---|---|
| Treating $P(X=x, Y=y)$ as $P(X=x) + P(Y=y)$ | Joint probability is not the sum of marginals | $P(X=x, Y=y) = P(X=x) \cdot P(Y=y \mid X=x)$ |
| Assuming $\text{Cov}(X,Y) = 0 \Rightarrow X \perp Y$ | Zero correlation does not imply independence | Use mutual information or chi-squared tests for independence |
| Computing marginals by summing the wrong axis | Axis convention depends on variable ordering | Check which axis corresponds to which variable before summing |
| Dividing by zero in conditional probability | $P(Y \mid X)$ is undefined when $P(X) = 0$ | Always verify the conditioning event has positive probability |
| Confusing joint PMF with joint PDF | PMF gives probabilities; PDF gives densities | $P(X=x, Y=y) = p(x,y)$ for discrete; $P((X,Y) \in A) = \iint_A f(x,y) \, dx \, dy$ for continuous |
| Assuming independence from a scatter plot | A plot with no visible linear trend can still hide nonlinear dependence | Compute mutual information or use nonparametric tests |
| Forgetting to normalize conditional distributions | Conditionals must sum or integrate to 1 | Divide by the marginal: $P(Y \mid X) = p(x,y) / p_X(x)$ |
| Using sample covariance for hypothesis testing without correction | The $1/n$ sample covariance is biased, noticeably so for small $n$ | Use Bessel's correction ($1/(n-1)$) or proper statistical tests |

Interview Questions

📝 Question 1: Joint to Marginal

Q: Given the joint PMF $p(x,y)$ for $X, Y \in \{1, 2, 3\}$ where $p(x,y) = \frac{x+y}{36}$, find $P(X \leq 2, Y \leq 2)$.

A: The constant 36 is the sum of $x+y$ over all nine cells, so the PMF sums to 1. Then $P(X \leq 2, Y \leq 2) = \sum_{x=1}^{2}\sum_{y=1}^{2} \frac{x+y}{36} = \frac{2+3+3+4}{36} = \frac{12}{36} = \frac{1}{3}$.

📝 Question 2: Independence Test

Q: $X$ and $Y$ have joint PMF: $p(0,0)=0.3$, $p(0,1)=0.2$, $p(1,0)=0.1$, $p(1,1)=0.4$. Are they independent?

A: $P(X=0) = 0.5$, $P(Y=0) = 0.4$. If independent, $p(0,0)$ should equal $0.5 \times 0.4 = 0.2 \neq 0.3$. Not independent.

📝 Question 3: Conditional Distribution

Q: If $X$ and $Y$ are jointly distributed with $P(X=0,Y=0)=0.2$, $P(X=0,Y=1)=0.3$, $P(X=1,Y=0)=0.1$, $P(X=1,Y=1)=0.4$, find $E[Y \mid X=1]$.

A: $P(Y=0 \mid X=1) = 0.1/0.5 = 0.2$, $P(Y=1 \mid X=1) = 0.4/0.5 = 0.8$. $E[Y \mid X=1] = 0(0.2) + 1(0.8) = 0.8$.

📝 Question 4: Covariance Computation

Q: Given $E[X]=3$, $E[Y]=7$, $E[XY]=25$, $E[X^2]=13$, $E[Y^2]=55$, find $\text{Var}(X+Y)$.

A: $\text{Cov}(X,Y) = 25 - 21 = 4$. $\text{Var}(X) = 13-9=4$, $\text{Var}(Y) = 55-49=6$. $\text{Var}(X+Y) = 4+6+2(4) = 18$.

📝 Question 5: Bayes' Rule Application

Q: A diagnostic test has sensitivity 0.95 and specificity 0.90. If the disease prevalence is 1%, what is $P(\text{disease} \mid \text{positive})$?

A: Using the joint distribution: $P(D+, T+) = 0.01 \times 0.95 = 0.0095$. $P(D-, T+) = 0.99 \times 0.10 = 0.099$. $P(T+) = 0.0095 + 0.099 = 0.1085$. $P(D+ \mid T+) = 0.0095/0.1085 \approx 0.0876$ (only 8.8%!)
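The same answer can be read off a 2×2 joint table of disease status and test result, built from the three numbers in the question. This is a minimal sketch; the variable names are illustrative.

```python
import numpy as np

prevalence, sensitivity, specificity = 0.01, 0.95, 0.90

# Joint table P(D, T): rows are disease (yes, no), columns are test (positive, negative).
joint = np.array([
    [prevalence * sensitivity,             prevalence * (1 - sensitivity)],
    [(1 - prevalence) * (1 - specificity), (1 - prevalence) * specificity],
])

p_positive = joint[:, 0].sum()                       # marginal P(T+)
p_disease_given_positive = joint[0, 0] / p_positive  # P(D+ | T+)

print(f"P(T+) = {p_positive:.4f}")
print(f"P(D+ | T+) = {p_disease_given_positive:.4f}")
```

Laying out the full table makes the source of the low posterior visible: the false-positive cell is roughly ten times larger than the true-positive cell because healthy people vastly outnumber sick ones.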

📝 Question 6: Multivariate Normal

Q: If $(X,Y) \sim \mathcal{N}\left(\begin{pmatrix}0\\0\end{pmatrix}, \begin{pmatrix}1 & \rho \\ \rho & 1\end{pmatrix}\right)$, what is the conditional distribution of $Y$ given $X=x$?

A: $Y \mid X=x \sim \mathcal{N}(\rho x, 1-\rho^2)$. The conditional mean is $\rho x$ (regression toward the mean) and the conditional variance $1-\rho^2$ decreases as $|\rho|$ increases (knowing $X$ reduces uncertainty about $Y$).
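This result can be checked by simulation. The sketch below assumes $\rho = 0.8$ (an arbitrary illustrative value) and uses the fact that, for this model, the least-squares slope of $Y$ on $X$ estimates $\rho$ and the residual variance estimates $1 - \rho^2$.

```python
import numpy as np

rho = 0.8
rng = np.random.default_rng(0)
xy = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=200_000)
x, y = xy[:, 0], xy[:, 1]

# Least-squares slope of Y on X estimates the coefficient in E[Y | X = x] = rho * x.
slope = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
# The residual variance estimates Var(Y | X = x) = 1 - rho^2.
residual_var = np.var(y - slope * x, ddof=1)

print(f"slope        = {slope:.3f}  (theory: {rho})")
print(f"residual var = {residual_var:.3f}  (theory: {1 - rho**2:.2f})")
```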


Practice Problems

📝 Problem 1: Marginal Distribution

The joint PMF of $X$ and $Y$ is $p(x,y) = \frac{xy}{36}$ for $x, y \in \{1, 2, 3\}$. Find the marginal PMFs $p_X(x)$ and $p_Y(y)$.

💡 Solution

$$p_X(x) = \sum_{y=1}^{3} \frac{xy}{36} = \frac{x}{36}(1+2+3) = \frac{6x}{36} = \frac{x}{6}$$

So $p_X(1)=1/6$, $p_X(2)=2/6=1/3$, $p_X(3)=3/6=1/2$. By symmetry, $p_Y(y) = y/6$.

📝 Problem 2: Conditional Distribution

Given $p(0,0)=0.1$, $p(0,1)=0.2$, $p(1,0)=0.3$, $p(1,1)=0.4$, compute $P(X=1 \mid Y=0)$ and $P(Y=1 \mid X=0)$.

💡 Solution

$P(Y=0) = 0.1 + 0.3 = 0.4$. $P(X=1 \mid Y=0) = p(1,0)/P(Y=0) = 0.3/0.4 = 0.75$.

$P(X=0) = 0.1 + 0.2 = 0.3$. $P(Y=1 \mid X=0) = p(0,1)/P(X=0) = 0.2/0.3 \approx 0.667$.

📝 Problem 3: Independence Check

$X$ takes values $\{1, 2\}$ and $Y$ takes values $\{a, b\}$ with joint PMF: $p(1,a)=0.2$, $p(1,b)=0.3$, $p(2,a)=0.1$, $p(2,b)=0.4$. Are $X$ and $Y$ independent?

💡 Solution

$P(X=1)=0.5$, $P(Y=a)=0.3$. If independent, $p(1,a)$ should be $0.5 \times 0.3 = 0.15 \neq 0.2$.

Not independent. The joint probabilities do not factor into the product of marginals.

📝 Problem 4: Covariance from Joint Distribution

Let $P(X=-1,Y=-1)=0.1$, $P(X=-1,Y=1)=0.2$, $P(X=1,Y=-1)=0.3$, $P(X=1,Y=1)=0.4$. Compute $\text{Cov}(X,Y)$ and $\rho_{XY}$.

💡 Solution

$E[X] = -1(0.3) + 1(0.7) = 0.4$. $E[Y] = -1(0.4) + 1(0.6) = 0.2$.

$E[XY] = (-1)(-1)(0.1) + (-1)(1)(0.2) + (1)(-1)(0.3) + (1)(1)(0.4) = 0.1 - 0.2 - 0.3 + 0.4 = 0$.

$\text{Cov}(X,Y) = E[XY] - E[X]E[Y] = 0 - 0.08 = -0.08$.

$E[X^2] = 1(0.3) + 1(0.7) = 1$. $\text{Var}(X) = 1 - 0.16 = 0.84$. $\sigma_X = \sqrt{0.84} \approx 0.9165$.

$E[Y^2] = 1(0.4) + 1(0.6) = 1$. $\text{Var}(Y) = 1 - 0.04 = 0.96$. $\sigma_Y = \sqrt{0.96} \approx 0.9798$.

$\rho = -0.08 / (0.9165 \times 0.9798) \approx -0.0891$.
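The arithmetic in this solution can be reproduced directly from the joint table. This is a minimal sketch using matrix-vector products for the sums; the variable names are illustrative.

```python
import numpy as np

vals = np.array([-1.0, 1.0])           # values taken by both X and Y
joint = np.array([[0.1, 0.2],          # X = -1: Y = -1, +1
                  [0.3, 0.4]])         # X = +1: Y = -1, +1

p_x, p_y = joint.sum(axis=1), joint.sum(axis=0)
e_x, e_y = vals @ p_x, vals @ p_y
e_xy = vals @ joint @ vals             # sum over x, y of x * y * p(x, y)

cov_xy = e_xy - e_x * e_y
var_x = (vals**2) @ p_x - e_x**2
var_y = (vals**2) @ p_y - e_y**2
rho = cov_xy / np.sqrt(var_x * var_y)

print(f"Cov(X,Y) = {cov_xy:.4f}, rho = {rho:.4f}")
```

Computing $\rho$ from unrounded variances avoids the small error that comes from multiplying rounded standard deviations by hand.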

📝 Problem 5: Chain Rule Application

A joint distribution over three binary variables satisfies $P(A=0)=0.6$, $P(B=0 \mid A=0)=0.7$, $P(B=0 \mid A=1)=0.4$, $P(C=0 \mid A=0, B=0)=0.8$, $P(C=0 \mid A=0, B=1)=0.5$, $P(C=0 \mid A=1, B=0)=0.3$, $P(C=0 \mid A=1, B=1)=0.1$. Compute $P(A=0, B=0, C=0)$.

💡 Solution

Using the chain rule:

$$P(A=0, B=0, C=0) = P(A=0) \cdot P(B=0 \mid A=0) \cdot P(C=0 \mid A=0, B=0)$$
$$= 0.6 \times 0.7 \times 0.8 = 0.336$$
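The conditional tables in this problem determine all eight joint probabilities, not just one. The sketch below applies the chain rule to every combination and confirms that the resulting table is a valid joint PMF; the array names are illustrative.

```python
import numpy as np

# Conditional probability tables from the problem (index 0 means value 0).
p_a = np.array([0.6, 0.4])                          # P(A)
p_b0_given_a = np.array([0.7, 0.4])                 # P(B=0 | A=a)
p_c0_given_ab = np.array([[0.8, 0.5],               # P(C=0 | A=0, B=b)
                          [0.3, 0.1]])              # P(C=0 | A=1, B=b)

joint = np.zeros((2, 2, 2))                         # joint[a, b, c]
for a in range(2):
    for b in range(2):
        p_b = p_b0_given_a[a] if b == 0 else 1 - p_b0_given_a[a]
        for c in range(2):
            p_c = p_c0_given_ab[a, b] if c == 0 else 1 - p_c0_given_ab[a, b]
            # Chain rule: P(a, b, c) = P(a) * P(b | a) * P(c | a, b)
            joint[a, b, c] = p_a[a] * p_b * p_c

print(f"P(A=0, B=0, C=0) = {joint[0, 0, 0]:.3f}")
print(f"total probability = {joint.sum():.3f}")
```

Seven numbers specify the full table here, which matches the $2^3 - 1 = 7$ free parameters of a joint PMF over three binary variables.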

Quick Reference

| Quantity | Formula | Python |
|---|---|---|
| Joint PMF | $P(X=x, Y=y) = p(x,y)$ | 2D NumPy array |
| Joint PDF | $f_{X,Y}(x,y)$ | `scipy.stats.multivariate_normal` |
| Marginal (discrete) | $P(X=x) = \sum_y p(x,y)$ | `joint.sum(axis=1)` |
| Marginal (continuous) | $f_X(x) = \int f_{X,Y}(x,y) \, dy$ | `np.trapz(f, y, axis=1)` (`np.trapezoid` in NumPy 2.0+) |
| Conditional | $P(Y=y \mid X=x) = p(x,y)/p_X(x)$ | `joint[x] / marginal_x` |
| Independence test | $p(x,y) = p_X(x) \cdot p_Y(y)$ for all $(x,y)$ | `np.allclose(joint, np.outer(px, py))` |
| Covariance | $\text{Cov}(X,Y) = E[XY] - E[X]E[Y]$ | `np.cov(X, Y)[0, 1]` |
| Correlation | $\rho = \text{Cov}(X,Y)/(\sigma_X \sigma_Y)$ | `np.corrcoef(X, Y)[0, 1]` |
| Chain rule | $P(A,B,C) = P(A)P(B \mid A)P(C \mid A,B)$ | Nested loops or recursion |
| Mutual information | $I(X;Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p_X(x)p_Y(y)}$ | `sklearn.metrics.mutual_info_score` |

📋Key Takeaways

  • Joint Distribution $P(X=x, Y=y) = p(x,y)$ captures the probability of specific combinations of values across multiple random variables simultaneously. It is the most complete description of the relationship between variables.
  • Marginal Distribution is obtained by summing (discrete: $P(X=x) = \sum_y p(x,y)$) or integrating (continuous: $f_X(x) = \int f_{X,Y}(x,y) \, dy$) over the other variable. Marginals discard dependence information.
  • Conditional Distribution $P(Y=y \mid X=x) = \frac{p(x,y)}{p_X(x)}$ gives the distribution of one variable given a specific value of another. It is the foundation of Bayes' rule and predictive modeling.
  • Independence means $P(X=x, Y=y) = P(X=x)P(Y=y)$ for all $x, y$: the joint factorizes into the product of marginals. Independence is stronger than zero correlation.
  • Covariance $\text{Cov}(X,Y) = E[XY] - E[X]E[Y]$ measures linear co-movement. It is the numerator of correlation and appears in the variance-of-sum formula.
  • Correlation $\rho = \text{Cov}(X,Y)/(\sigma_X \sigma_Y) \in [-1,1]$ normalizes covariance to a unitless scale. It captures only linear dependence; nonlinear relationships require mutual information or other measures.
  • Chain Rule: $P(X_1, \dots, X_n) = P(X_1) P(X_2 \mid X_1) \cdots P(X_n \mid X_1, \dots, X_{n-1})$ decomposes any joint distribution into a product of conditionals.
  • In code: Use `np.sum(axis=...)` for marginals, division for conditionals, `np.cov()` and `np.corrcoef()` for dependence measures, and `np.outer()` to test independence.
  • Practice Application: Joint distributions are essential for probabilistic graphical models, generative modeling (VAEs, GMMs), PCA, Bayesian inference, and understanding feature dependencies in high-dimensional ML datasets.