Joint Distributions | ChatWhole Learn

Why It Matters

Real-world data is rarely univariate. Most datasets (images, time series, tabular records) consist of multiple interacting variables. Joint distributions are the mathematical framework that captures how random variables co-vary, letting you reason about dependencies, compute conditional probabilities, and build multivariate models. Without them you cannot properly define correlation, perform Bayesian inference on multiple variables, build copula models in finance, or design variational autoencoders in deep learning. They underlie PCA, Gaussian mixture models, hidden Markov models, and most other multivariate statistical techniques. Mastering joint distributions is the bridge from elementary probability to real-world data science and machine learning.

Joint PMF/PDF

Definition: Joint Probability Mass Function (Discrete)

For two discrete random variables $X$ and $Y$ , the joint PMF $p_{X,Y}(x,y)$ gives the probability that $X = x$ and $Y = y$ simultaneously:

p_{X,Y}(x,y) = P(X = x, Y = y)

The joint PMF must satisfy:

$p_{X,Y}(x,y) \geq 0$ for all $(x,y)$
$\sum_{x}\sum_{y} p_{X,Y}(x,y) = 1$

For a finite support with $m$ values of $X$ and $n$ values of $Y$ , the joint PMF can be represented as an $m \times n$ matrix where each entry is a probability.

Definition: Joint Probability Density Function (Continuous)

For two continuous random variables $X$ and $Y$ , the joint PDF $f_{X,Y}(x,y)$ satisfies:

f_{X,Y}(x,y) \geq 0 \quad \text{for all } (x,y)

\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f_{X,Y}(x,y) \, dx \, dy = 1

The probability of falling in a region $A$ is:

P((X,Y) \in A) = \iint_A f_{X,Y}(x,y) \, dx \, dy

Unlike the PMF, the joint PDF value $f_{X,Y}(x,y)$ is not a probability — it is a density. Only integration over a region yields a probability.
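To make the density-versus-probability distinction concrete, the sketch below assumes two independent standard normal variables (an illustrative choice, not one fixed by the text) and integrates their joint PDF over the unit square with `scipy.integrate.dblquad`. It also evaluates a much narrower Gaussian whose density at the origin exceeds 1, something no probability could do.

```python
import numpy as np
from scipy import integrate, stats

# Assumed example: X and Y are independent standard normals.
joint = stats.multivariate_normal(mean=[0.0, 0.0], cov=np.eye(2))

# dblquad integrates func(y, x) for x in [0, 1] and y in [0, 1].
prob_square, _ = integrate.dblquad(
    lambda y, x: joint.pdf([x, y]),
    0.0, 1.0,
    lambda x: 0.0, lambda x: 1.0,
)

# Independence gives a closed form to compare against.
expected = (stats.norm.cdf(1.0) - stats.norm.cdf(0.0)) ** 2

# A density is not bounded by 1: shrink the variance and look at the peak.
narrow = stats.multivariate_normal(mean=[0.0, 0.0], cov=0.01 * np.eye(2))
peak_density = narrow.pdf([0.0, 0.0])

print(f"P(0<=X<=1, 0<=Y<=1) = {prob_square:.4f} (closed form {expected:.4f})")
print(f"Peak density of the narrow Gaussian: {peak_density:.2f}")
```

Only the integrated value is a probability; the peak density is just a height of the surface.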

Joint Distribution — Discrete

P(X=x, Y=y) = p(x,y) \geq 0, \quad \sum_x \sum_y p(x,y) = 1

Here,

$p(x,y)$ = Joint probability mass function
$P(X=x, Y=y)$ = Probability that X takes value x and Y takes value y simultaneously

Joint Distribution — Continuous

f_{X,Y}(x,y) \geq 0, \quad \iint_{\mathbb{R}^2} f_{X,Y}(x,y) \, dx\,dy = 1

Here,

$f_{X,Y}(x,y)$ = Joint probability density function
$\mathbb{R}^2$ = The entire 2D real plane

📝Joint PMF Example

Consider two fair coins tossed independently. Let $X$ = number of heads on coin 1 (0 or 1) and $Y$ = number of heads on coin 2 (0 or 1). The joint PMF is:

	$Y=0$	$Y=1$
$X=0$	0.25	0.25
$X=1$	0.25	0.25

Each cell has probability $0.5 \times 0.5 = 0.25$ because the coins are fair and independent, so the joint PMF is the product of the two marginals.
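The coin table can be reproduced in a few lines. This is a minimal sketch, assuming index 0 means tails and index 1 means heads; it builds the joint PMF as an outer product of the marginals and checks the two PMF conditions.

```python
import numpy as np

# Marginal PMFs of two fair coins (index 0 = tails, 1 = heads).
p_x = np.array([0.5, 0.5])
p_y = np.array([0.5, 0.5])

# For independent variables the joint PMF is the outer product of the marginals.
joint = np.outer(p_x, p_y)

# Check the two PMF conditions: non-negativity and total mass 1.
is_valid = bool(np.all(joint >= 0) and np.isclose(joint.sum(), 1.0))

print(joint)
print("Valid joint PMF:", is_valid)
```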

Marginal Distributions

Theorem: Marginal Distribution

The marginal distribution of one variable is obtained by summing (discrete) or integrating (continuous) the joint distribution over all values of the other variable. It tells you the distribution of a single variable ignoring the others.

For discrete random variables:

P(X=x) = \sum_{y} P(X=x, Y=y) = \sum_{y} p(x,y)

P(Y=y) = \sum_{x} P(X=x, Y=y) = \sum_{x} p(x,y)

For continuous random variables:

f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x,y) \, dy

f_Y(y) = \int_{-\infty}^{\infty} f_{X,Y}(x,y) \, dx

⚠️ Marginalizing Loses Information

Computing a marginal distribution discards all information about the relationship between variables. Two very different joint distributions can have the same marginals. The joint distribution contains strictly more information than the marginals alone.
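A small numerical illustration of this warning: the two joint tables below are hypothetical, chosen so that one is independent and the other makes $Y$ always equal $X$. Both produce identical marginals.

```python
import numpy as np

# Two hypothetical joint PMFs over binary X (rows) and Y (columns).
independent = np.array([[0.25, 0.25],
                        [0.25, 0.25]])
coupled = np.array([[0.5, 0.0],
                    [0.0, 0.5]])   # Y always equals X

def marginals(joint):
    """Return (P(X), P(Y)) by summing out the other variable."""
    return joint.sum(axis=1), joint.sum(axis=0)

px_a, py_a = marginals(independent)
px_b, py_b = marginals(coupled)

same_marginals = bool(np.allclose(px_a, px_b) and np.allclose(py_a, py_b))
same_joint = bool(np.allclose(independent, coupled))
print("Same marginals:", same_marginals)
print("Same joint:", same_joint)
```

The marginals alone cannot tell these two situations apart, which is why the joint table has to be kept if dependence matters.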

📝Marginal from Joint PMF

Given the joint PMF:

	$Y=0$	$Y=1$	$Y=2$	Marginal $P(X)$
$X=0$	0.1	0.2	0.1	0.4
$X=1$	0.2	0.3	0.1	0.6
Marginal $P(Y)$	0.3	0.5	0.2	1.0

P(X=0) = 0.1 + 0.2 + 0.1 = 0.4

P(Y=1) = 0.2 + 0.3 = 0.5

Conditional Distributions

Definition: Conditional Distribution

The conditional distribution of $Y$ given $X = x$ describes the probability of $Y$ values when we know $X$ has taken a specific value. It is defined as:

Discrete case:

P(Y=y \mid X=x) = \frac{P(X=x, Y=y)}{P(X=x)} = \frac{p(x,y)}{p_X(x)}

provided $p_X(x) > 0$ .

Continuous case:

f_{Y \mid X}(y \mid x) = \frac{f_{X,Y}(x,y)}{f_X(x)}

provided $f_X(x) > 0$ .

Conditional Probability — Bayes' Rule for Joint Distributions

P(Y=y \mid X=x) = \frac{P(X=x, Y=y)}{P(X=x)} = \frac{p(x,y)}{\sum_{y'} p(x,y')}

Here,

$P(Y=y \mid X=x)$ = Conditional probability of Y given X = x
$p(x,y)$ = Joint probability at (x,y)
$p_X(x) = \sum_{y'} p(x,y')$ = Marginal probability of X = x (the normalizing constant)

Theorem: Chain Rule of Probability

The joint distribution can always be decomposed as a product of a marginal and a conditional:

P(X=x, Y=y) = P(X=x) \cdot P(Y=y \mid X=x)

P(X=x, Y=y) = P(Y=y) \cdot P(X=x \mid Y=y)

This chain rule generalizes to $n$ variables:

P(X_1, X_2, \dots, X_n) = P(X_1) \cdot P(X_2 \mid X_1) \cdot P(X_3 \mid X_1, X_2) \cdots P(X_n \mid X_1, \dots, X_{n-1})

📝Conditional Distribution Example

Suppose $P(X=0,Y=0)=0.2$ , $P(X=0,Y=1)=0.3$ , $P(X=1,Y=0)=0.1$ , $P(X=1,Y=1)=0.4$ .

Find $P(Y=1 \mid X=0)$ :

P(Y=1 \mid X=0) = \frac{P(X=0, Y=1)}{P(X=0)} = \frac{0.3}{0.2 + 0.3} = \frac{0.3}{0.5} = 0.6

Interpretation: Given $X=0$ , there is a 60% chance $Y=1$ .
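The same calculation can be done for every row at once. The sketch below assumes the joint PMF is stored with rows indexing $X$ and columns indexing $Y$; it divides each row by its marginal and then uses the chain rule to rebuild the joint as a sanity check.

```python
import numpy as np

# Joint PMF from the example: rows index X in {0, 1}, columns index Y in {0, 1}.
joint = np.array([[0.2, 0.3],
                  [0.1, 0.4]])

p_x = joint.sum(axis=1)                # marginal P(X)
cond_y_given_x = joint / p_x[:, None]  # row x holds P(Y | X = x)

print("P(Y=1 | X=0) =", cond_y_given_x[0, 1])

# Chain rule: P(X=x) * P(Y=y | X=x) recovers the joint.
reconstructed = p_x[:, None] * cond_y_given_x
print("Chain rule recovers joint:", np.allclose(reconstructed, joint))
```

Each row of `cond_y_given_x` is itself a valid PMF, so its entries sum to 1.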

Independence of Random Variables

Definition: Statistical Independence

Two random variables $X$ and $Y$ are independent if and only if the joint distribution factorizes into the product of marginals for all values:

P(X=x, Y=y) = P(X=x) \cdot P(Y=y) \quad \text{for all } x, y

Equivalently, $X$ and $Y$ are independent if and only if:

$P(Y=y \mid X=x) = P(Y=y)$ for all $x, y$ with $P(X=x) > 0$ (knowing $X$ does not change $Y$ 's distribution)
$f_{X,Y}(x,y) = f_X(x) \cdot f_Y(y)$ for all $(x,y)$ (continuous case)

⚠️ Independence vs Uncorrelated

Independence is strictly stronger than zero correlation. If $X$ and $Y$ are independent (with finite variances), then $\text{Cov}(X,Y) = 0$ . The converse is false: uncorrelated variables can still be dependent (e.g., $X \sim U(-1,1)$ and $Y = X^2$ , where $\text{Cov}(X,Y) = E[X^3] - E[X]E[X^2] = 0$ even though $Y$ is a function of $X$ ). This distinction is critical in ML: some models assume conditional independence (Naive Bayes) and others capture only linear dependence (PCA), so both can miss nonlinear relationships.

Theorem: Tests for Independence

To check whether $X$ and $Y$ are independent:

Factorization test: Verify $p(x,y) = p_X(x) \cdot p_Y(y)$ for all $(x,y)$ .
Conditional test: Verify $P(Y=y \mid X=x) = P(Y=y)$ for all $x,y$ with $P(X=x) > 0$ .
Correlation test (necessary only): If $\text{Cov}(X,Y) \neq 0$ , they are not independent. But $\text{Cov}(X,Y) = 0$ does not prove independence.
Mutual information: $I(X;Y) = 0$ if and only if $X$ and $Y$ are independent.

📝Checking Independence

Given: $P(X=0,Y=0)=0.12$ , $P(X=0,Y=1)=0.18$ , $P(X=1,Y=0)=0.28$ , $P(X=1,Y=1)=0.42$ .

Marginals: $P(X=0)=0.30$ , $P(X=1)=0.70$ , $P(Y=0)=0.40$ , $P(Y=1)=0.60$ .

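Check each cell against the product of its marginals:

$P(X=0) \cdot P(Y=0) = 0.30 \times 0.40 = 0.12 = P(X=0,Y=0)$ ✓
$P(X=0) \cdot P(Y=1) = 0.30 \times 0.60 = 0.18 = P(X=0,Y=1)$ ✓
$P(X=1) \cdot P(Y=0) = 0.70 \times 0.40 = 0.28 = P(X=1,Y=0)$ ✓
$P(X=1) \cdot P(Y=1) = 0.70 \times 0.60 = 0.42 = P(X=1,Y=1)$ ✓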

All entries match → $X$ and $Y$ are independent.
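The factorization test is easy to automate. The helper below, `is_independent`, is a hypothetical name introduced for illustration; it compares a joint table with the outer product of its own marginals. The second table is a made-up dependent example for contrast.

```python
import numpy as np

def is_independent(joint, tol=1e-9):
    """Factorization test: joint equals the outer product of its marginals."""
    p_x = joint.sum(axis=1)
    p_y = joint.sum(axis=0)
    return bool(np.allclose(joint, np.outer(p_x, p_y), atol=tol))

# Table from the example (rows: X, columns: Y).
example = np.array([[0.12, 0.18],
                    [0.28, 0.42]])

# A hypothetical dependent table for contrast.
dependent = np.array([[0.3, 0.2],
                      [0.1, 0.4]])

print(is_independent(example))
print(is_independent(dependent))
```

With estimated rather than exact probabilities, an exact comparison like this is too strict; a statistical test such as chi-squared is the usual substitute.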

Joint Expectation

Expectation of a Function of Two Variables

E[g(X,Y)] = \sum_x \sum_y g(x,y) \, p(x,y) \quad \text{(discrete)}

Here,

$g(X,Y)$ = Any function of the two random variables
$p(x,y)$ = Joint PMF

Joint Expectation — Continuous

E[g(X,Y)] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g(x,y) \, f_{X,Y}(x,y) \, dx \, dy

Here,

$g(X,Y)$ = Any function of the two random variables
$f_{X,Y}(x,y)$ = Joint PDF

Theorem: Linearity of Joint Expectation

For any random variables $X, Y$ and constants $a, b, c$ :

$E[aX + bY + c] = aE[X] + bE[Y] + c$
$E[X + Y] = E[X] + E[Y]$ (always holds, even for dependent variables)
$E[XY] = E[X] \cdot E[Y]$ if and only if $X$ and $Y$ are uncorrelated; independence is sufficient for this but not necessary

Important: Linearity of expectation never requires independence. The product rule $E[XY] = E[X]E[Y]$ holds exactly when $\text{Cov}(X,Y) = 0$ , which independence guarantees but does not characterize.

📝Joint Expectation Computation

Let $P(X=0,Y=0)=0.2$ , $P(X=0,Y=1)=0.3$ , $P(X=1,Y=0)=0.1$ , $P(X=1,Y=1)=0.4$ .

E[XY] = \sum_x \sum_y xy \cdot p(x,y) = 0 \cdot 0.2 + 0 \cdot 0.3 + 0 \cdot 0.1 + 1 \cdot 1 \cdot 0.4 = 0.4

E[X] = 0(0.5) + 1(0.5) = 0.5

E[Y] = 0(0.3) + 1(0.7) = 0.7

Since $E[XY] = 0.4 \neq E[X]E[Y] = 0.35$ , the covariance is $0.05 \neq 0$ , so $X$ and $Y$ are correlated and therefore not independent.
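The double sum in the definition maps directly onto array operations. The function `expect` below is a hypothetical helper, a sketch assuming the joint PMF is a 2D array with rows for $X$ and columns for $Y$; it evaluates $E[g(X,Y)]$ for any vectorized `g`.

```python
import numpy as np

def expect(g, x_vals, y_vals, joint):
    """E[g(X, Y)] = sum over x, y of g(x, y) * p(x, y)."""
    # With indexing="ij", xx[i, j] = x_vals[i] and yy[i, j] = y_vals[j].
    xx, yy = np.meshgrid(x_vals, y_vals, indexing="ij")
    return float(np.sum(g(xx, yy) * joint))

x_vals = np.array([0, 1])
y_vals = np.array([0, 1])
joint = np.array([[0.2, 0.3],
                  [0.1, 0.4]])   # rows: X, columns: Y

e_xy = expect(lambda x, y: x * y, x_vals, y_vals, joint)
e_x = expect(lambda x, y: x, x_vals, y_vals, joint)
e_y = expect(lambda x, y: y, x_vals, y_vals, joint)
cov = e_xy - e_x * e_y
print(e_xy, e_x, e_y, cov)
```

Passing a different `g`, for example `lambda x, y: (x - y) ** 2`, reuses the same machinery without any other change.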

Covariance

\text{Cov}(X,Y) = E[(X - \mu_X)(Y - \mu_Y)] = E[XY] - E[X]E[Y]

Here,

$\text{Cov}(X,Y)$ = Covariance between X and Y; measures linear co-movement
$\mu_X, \mu_Y$ = Means of X and Y respectively
$E[XY]$ = Joint expectation (product moment)

Theorem: Properties of Covariance

Self-covariance: $\text{Cov}(X,X) = \text{Var}(X)$
Symmetry: $\text{Cov}(X,Y) = \text{Cov}(Y,X)$
Linearity: $\text{Cov}(aX + b, cY + d) = ac \cdot \text{Cov}(X,Y)$
Sum rule: $\text{Cov}(X+Z, Y) = \text{Cov}(X,Y) + \text{Cov}(Z,Y)$
Zero for independent variables: If $X \perp Y$ , then $\text{Cov}(X,Y) = 0$
Variance of sum: $\text{Var}(X+Y) = \text{Var}(X) + \text{Var}(Y) + 2\text{Cov}(X,Y)$
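The variance-of-sum identity also holds exactly for sample statistics, provided the same normalization is used throughout. The sketch below checks it on hypothetical simulated data; the linear relationship between `x` and `y` is an assumption made only to get a non-zero covariance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical correlated data: Y depends linearly on X plus noise.
x = rng.normal(size=1_000)
y = 0.5 * x + rng.normal(size=1_000)

cov_matrix = np.cov(x, y)            # 2x2 matrix, ddof=1 by default
var_x, var_y = cov_matrix[0, 0], cov_matrix[1, 1]
cov_xy = cov_matrix[0, 1]

lhs = np.var(x + y, ddof=1)          # same normalization as np.cov
rhs = var_x + var_y + 2 * cov_xy
print(lhs, rhs)
```

Dropping the `2 * cov_xy` term would understate the variance here, because the covariance is positive.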

ℹ️ Interpreting Covariance Sign

$\text{Cov}(X,Y) > 0$ : $X$ and $Y$ tend to move in the same direction
$\text{Cov}(X,Y) < 0$ : $X$ and $Y$ tend to move in opposite directions
$\text{Cov}(X,Y) = 0$ : $X$ and $Y$ are uncorrelated (but not necessarily independent)
The magnitude of covariance depends on the units of $X$ and $Y$ , making it hard to compare across variable pairs — hence the need for correlation.

Correlation

Pearson Correlation Coefficient

\rho_{XY} = \frac{\text{Cov}(X,Y)}{\sigma_X \sigma_Y} \in [-1, 1]

Here,

$\rho_{XY}$ = Pearson correlation coefficient between X and Y
$\sigma_X, \sigma_Y$ = Standard deviations of X and Y

Theorem: Properties of Correlation

$-1 \leq \rho \leq 1$ always
$\rho = 1$ if and only if $Y = aX + b$ with $a > 0$ (perfect positive linear relationship)
$\rho = -1$ if and only if $Y = aX + b$ with $a < 0$ (perfect negative linear relationship)
$\rho = 0$ means uncorrelated (no linear relationship)
$\rho$ is invariant to positive affine transformations: $\text{Corr}(aX+b, cY+d) = \text{Corr}(X,Y)$ for $a, c > 0$ (in general, for $a, c \neq 0$ the result is $\text{sign}(ac) \cdot \text{Corr}(X,Y)$ )
Independence $\Rightarrow$ $\rho = 0$ , but $\rho = 0$ does not imply independence

⚠️ Correlation Only Captures Linear Dependence

Two variables can have a strong nonlinear relationship yet have $\rho \approx 0$ . Classic example: $X \sim U(-1, 1)$ and $Y = X^2$ . These are perfectly dependent but uncorrelated. Use mutual information or distance correlation to detect nonlinear dependencies.

📝Computing Correlation

Given: $E[X] = 2$ , $E[Y] = 5$ , $E[XY] = 12$ , $\text{Var}(X) = 4$ , $\text{Var}(Y) = 9$ .

\text{Cov}(X,Y) = E[XY] - E[X]E[Y] = 12 - 10 = 2

\rho = \frac{\text{Cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{2}{2 \times 3} = \frac{1}{3} \approx 0.333

Moderate positive linear correlation.

Python Implementation

💡 Python for Joint Distributions

NumPy and SciPy provide efficient tools for computing joint distributions, marginals, conditionals, covariance, and correlation from both theoretical and empirical data.

import numpy as np
from scipy import stats

# === Joint PMF from a table ===
joint_pmf = np.array([
    [0.1, 0.2, 0.1],   # X=0: Y=0,1,2
    [0.2, 0.3, 0.1],   # X=1: Y=0,1,2
])
assert np.isclose(joint_pmf.sum(), 1.0), "joint PMF must sum to 1"

# Marginal distributions
p_x = joint_pmf.sum(axis=1)  # Sum over Y → P(X)
p_y = joint_pmf.sum(axis=0)  # Sum over X → P(Y)
print(f"Marginal P(X): {p_x}")   # approximately [0.4, 0.6]
print(f"Marginal P(Y): {p_y}")   # approximately [0.3, 0.5, 0.2]

# Conditional distribution P(Y | X=1)
x_idx = 1
p_y_given_x1 = joint_pmf[x_idx] / p_x[x_idx]
print(f"P(Y | X=1): {p_y_given_x1}")

# Verify: conditional probabilities sum to 1
print(f"Sum of conditional: {p_y_given_x1.sum():.4f}")  # Should be 1.0

# === Joint PDF for continuous variables ===
# Bivariate normal example
mean = [0, 0]
cov_matrix = [[1, 0.8], [0.8, 1]]  # Correlation = 0.8

# Sample from bivariate normal
np.random.seed(42)
n_samples = 10000
samples = np.random.multivariate_normal(mean, cov_matrix, size=n_samples)
X_samples, Y_samples = samples[:, 0], samples[:, 1]

# Empirical covariance and correlation
emp_cov = np.cov(X_samples, Y_samples)
emp_corr = np.corrcoef(X_samples, Y_samples)
print(f"Empirical covariance matrix:\n{emp_cov}")
print(f"Empirical correlation:\n{emp_corr}")

# === Independence check ===
# Two independent variables
X_ind = np.random.randn(5000)
Y_ind = np.random.randn(5000)
print(f"Correlation (independent): {np.corrcoef(X_ind, Y_ind)[0,1]:.4f}")

# Two dependent variables (Y = X^2)
X_dep = np.random.uniform(-1, 1, 5000)
Y_dep = X_dep ** 2
print(f"Correlation (dependent but uncorrelated): {np.corrcoef(X_dep, Y_dep)[0,1]:.4f}")

# === Mutual information for nonlinear dependence ===
from scipy.stats import entropy

def mutual_information_2d(x, y, bins=20):
    """Estimate mutual information using histogram-based method."""
    p_xy = np.histogram2d(x, y, bins=bins)[0]
    p_xy = p_xy / p_xy.sum()
    p_x = p_xy.sum(axis=1)
    p_y = p_xy.sum(axis=0)
    mi = entropy(p_x) + entropy(p_y) - entropy(p_xy.ravel())
    return mi

print(f"MI (independent): {mutual_information_2d(X_ind, Y_ind):.4f}")
print(f"MI (X, X^2):     {mutual_information_2d(X_dep, Y_dep):.4f}")

# === Visualize joint distribution ===
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Joint PMF heatmap
im = axes[0].imshow(joint_pmf, cmap='Blues', vmin=0, vmax=0.3)
axes[0].set_xlabel('Y')
axes[0].set_ylabel('X')
axes[0].set_title('Joint PMF')
plt.colorbar(im, ax=axes[0])

# Bivariate normal samples
axes[1].scatter(X_samples[:500], Y_samples[:500], alpha=0.3, s=10)
axes[1].set_xlabel('X')
axes[1].set_ylabel('Y')
axes[1].set_title('Bivariate Normal (ρ=0.8)')
axes[1].set_aspect('equal')

# Dependent but uncorrelated
axes[2].scatter(X_dep[:500], Y_dep[:500], alpha=0.3, s=10)
axes[2].set_xlabel('X')
axes[2].set_ylabel('Y = X²')
axes[2].set_title('Dependent but Uncorrelated')

plt.tight_layout()
plt.savefig('joint_distributions.png', dpi=150)
plt.show()

Applications in AI/ML

💡 Joint Distributions in Machine Learning

Understanding joint distributions is essential for probabilistic modeling, generative models, and any ML task that involves uncertainty quantification.

Definition: Generative Models

A generative model learns the joint distribution $p_{\text{model}}(\mathbf{x}, z)$ over data $\mathbf{x}$ and latent variables $z$ . Once learned, you can:

Sample new data: draw $z \sim p(z)$ , then $\mathbf{x} \sim p(\mathbf{x} \mid z)$
Compute posteriors: $p(z \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid z) p(z)}{p(\mathbf{x})}$ (Bayesian inference)
Evaluate likelihood: $p(\mathbf{x}) = \int p(\mathbf{x}, z) \, dz$

Examples: VAEs, GANs (implicitly), Hidden Markov Models, Gaussian Mixture Models, Bayesian Networks.

Definition: Naive Bayes Classifier

The Naive Bayes classifier assumes conditional independence of features given the class label:

P(\mathbf{x} \mid Y = k) = \prod_{j=1}^{d} P(x_j \mid Y = k)

This factorization turns a high-dimensional joint estimation problem into $d$ independent one-dimensional problems, making training tractable even with limited data. Despite the unrealistic independence assumption, Naive Bayes often performs surprisingly well in practice (spam filtering, text classification).
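The factorization can be sketched for binary features. This is a minimal illustration, not a production classifier: the function names, the Laplace smoothing constant `alpha`, and the six-row dataset are all assumptions made for the example.

```python
import numpy as np

def fit_bernoulli_nb(X, y, alpha=1.0):
    """Estimate class priors and per-feature P(x_j = 1 | Y = k) with Laplace smoothing."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == k) for k in classes])
    theta = np.array([
        (X[y == k].sum(axis=0) + alpha) / ((y == k).sum() + 2 * alpha)
        for k in classes
    ])
    return classes, priors, theta

def log_posterior_scores(x, priors, theta):
    """Unnormalized log P(Y = k | x) = log prior + sum_j log P(x_j | Y = k)."""
    log_lik = (x * np.log(theta) + (1 - x) * np.log(1 - theta)).sum(axis=1)
    return np.log(priors) + log_lik

# Tiny made-up dataset: 3 binary features, 2 classes.
X = np.array([[1, 1, 0], [1, 0, 0], [1, 1, 1],
              [0, 0, 1], [0, 1, 1], [0, 0, 0]])
y = np.array([0, 0, 0, 1, 1, 1])

classes, priors, theta = fit_bernoulli_nb(X, y)
scores = log_posterior_scores(np.array([1, 1, 0]), priors, theta)
predicted = classes[np.argmax(scores)]
print("Predicted class:", predicted)
```

Each entry of `theta` is estimated from a single feature column, which is exactly the reduction to $d$ one-dimensional problems described above.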

Definition: Gaussian Mixture Models (GMM)

A GMM models data as a mixture of $K$ multivariate Gaussians:

p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)

where $\pi_k$ are mixing weights, $\boldsymbol{\mu}_k$ are component means, and $\boldsymbol{\Sigma}_k$ are component covariance matrices. GMMs capture multimodal distributions and are used for clustering, density estimation, and as sub-components of more complex models.
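The mixture density can be evaluated directly from its definition. The parameters below describe a hypothetical two-component mixture in 2D, and the helper names are introduced only for this sketch. The second helper normalizes the per-component terms $\pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$ to get the posterior over components, which is what soft clustering uses.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical 2-component mixture in 2D.
weights = np.array([0.3, 0.7])
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs = [np.eye(2), np.array([[1.0, 0.5], [0.5, 1.0]])]

def component_terms(x, weights, means, covs):
    """pi_k * N(x | mu_k, Sigma_k) for each component k."""
    return np.array([
        w * multivariate_normal(mean=m, cov=c).pdf(x)
        for w, m, c in zip(weights, means, covs)
    ])

def gmm_pdf(x, weights, means, covs):
    """p(x) = sum_k pi_k * N(x | mu_k, Sigma_k)."""
    return float(component_terms(x, weights, means, covs).sum())

def gmm_responsibilities(x, weights, means, covs):
    """Posterior P(component k | x), obtained by normalizing the component terms."""
    terms = component_terms(x, weights, means, covs)
    return terms / terms.sum()

x = np.array([0.0, 0.0])
density = gmm_pdf(x, weights, means, covs)
resp = gmm_responsibilities(x, weights, means, covs)
print(density, resp)
```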

Definition: Variational Autoencoders (VAEs)

VAEs approximate the intractable posterior $p(z \mid \mathbf{x})$ with a learned encoder $q_\phi(z \mid \mathbf{x})$ . The training objective is the evidence lower bound (ELBO), which is maximized:

\mathcal{L} = E_{q_\phi(z|\mathbf{x})}[\log p_\theta(\mathbf{x} \mid z)] - D_{\text{KL}}(q_\phi(z \mid \mathbf{x}) \| p(z))

This relies fundamentally on the joint distribution $p(\mathbf{x}, z) = p(\mathbf{x} \mid z) p(z)$ .

📝Covariance Matrix in PCA

Principal Component Analysis (PCA) performs an eigendecomposition of the data covariance matrix $\boldsymbol{\Sigma} = \frac{1}{n}\mathbf{X}^T\mathbf{X}$ , where the columns of $\mathbf{X}$ have been mean-centered (use $\frac{1}{n-1}$ for the unbiased estimate). The eigenvectors are the principal components (directions of maximum variance), and the eigenvalues quantify the variance explained by each component. PCA therefore depends on the joint distribution of the features only through its second-order structure, the covariance matrix.
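The eigendecomposition route can be sketched with NumPy alone. The data below is simulated from an assumed covariance, purely for illustration; real pipelines typically use an SVD-based implementation instead.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2D data with correlated features.
cov_true = np.array([[3.0, 1.5], [1.5, 1.0]])
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov_true, size=2_000)

X_centered = X - X.mean(axis=0)              # PCA assumes mean-centered data
sigma = X_centered.T @ X_centered / len(X)   # covariance with 1/n normalization

eigvals, eigvecs = np.linalg.eigh(sigma)     # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]            # reorder by descending variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained_ratio = eigvals / eigvals.sum()
scores = X_centered @ eigvecs                # data projected onto the components
print(explained_ratio)
```

The projected `scores` have a diagonal covariance matrix whose entries are the eigenvalues, which is the sense in which PCA decorrelates the features.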

Common Mistakes

Mistake	Why It's Wrong	Correct Approach
Treating $P(X=x, Y=y)$ as $P(X=x) + P(Y=y)$	Joint probability is NOT the sum of marginals	$P(X=x, Y=y) = P(X=x) \cdot P(Y=y \mid X=x)$
Assuming $\text{Cov}(X,Y) = 0 \Rightarrow X \perp Y$	Zero correlation does not imply independence	Use mutual information or chi-squared tests for independence
Computing marginals by summing the wrong axis	Axis convention depends on variable ordering	Check which axis corresponds to which variable before summing
Dividing by zero in conditional probability	$P(Y \mid X)$ is undefined when $P(X) = 0$	Always verify the conditioning event has positive probability
Confusing joint PMF with joint PDF	PMF gives probabilities; PDF gives densities	$P(X=x, Y=y) = p(x,y)$ for discrete; $P((X,Y) \in A) = \iint_A f(x,y) dxdy$ for continuous
Inferring independence from a scatter plot with no visible trend	A flat linear trend rules out only linear dependence	Compute mutual information or use nonparametric tests
Forgetting to normalize conditional distributions	Conditionals must sum/integrate to 1	Divide by the marginal: $P(Y \mid X) = p(x,y) / p_X(x)$
Using the $1/n$ sample covariance without correction	The $1/n$ estimator is biased, noticeably so for small $n$	Use Bessel's correction ($1/(n-1)$, the default in `np.cov`) or proper statistical tests

Interview Questions

📝Question 1: Joint to Marginal

Q: Given the joint PMF $p(x,y)$ for $X, Y \in \{1, 2, 3\}$ where $p(x,y) = \frac{x+y}{36}$ , find $P(X \leq 2, Y \leq 2)$ .

A: The normalizer is $\sum_{x=1}^{3}\sum_{y=1}^{3}(x+y) = 36$ , so $p$ is a valid PMF. Then $P(X \leq 2, Y \leq 2) = \sum_{x=1}^{2}\sum_{y=1}^{2} \frac{x+y}{36} = \frac{2+3+3+4}{36} = \frac{12}{36} = \frac{1}{3}$ .

📝Question 2: Independence Test

Q: $X$ and $Y$ have joint PMF: $p(0,0)=0.3$ , $p(0,1)=0.2$ , $p(1,0)=0.1$ , $p(1,1)=0.4$ . Are they independent?

A: $P(X=0) = 0.5$ , $P(Y=0) = 0.4$ . If independent, $p(0,0)$ should equal $0.5 \times 0.4 = 0.2 \neq 0.3$ . Not independent.

📝Question 3: Conditional Distribution

Q: If $X$ and $Y$ are jointly distributed with $P(X=0,Y=0)=0.2$ , $P(X=0,Y=1)=0.3$ , $P(X=1,Y=0)=0.1$ , $P(X=1,Y=1)=0.4$ , find $E[Y \mid X=1]$ .

A: $P(Y=0 \mid X=1) = 0.1/0.5 = 0.2$ , $P(Y=1 \mid X=1) = 0.4/0.5 = 0.8$ . $E[Y \mid X=1] = 0(0.2) + 1(0.8) = 0.8$ .

📝Question 4: Covariance Computation

Q: Given $E[X]=3$ , $E[Y]=7$ , $E[XY]=25$ , $E[X^2]=13$ , $E[Y^2]=55$ , find $\text{Var}(X+Y)$ .

A: $\text{Cov}(X,Y) = 25 - 21 = 4$ . $\text{Var}(X) = 13-9=4$ , $\text{Var}(Y) = 55-49=6$ . $\text{Var}(X+Y) = 4+6+2(4) = 18$ .

📝Question 5: Bayes' Rule Application

Q: A diagnostic test has sensitivity 0.95 and specificity 0.90. If the disease prevalence is 1%, what is $P(\text{disease} \mid \text{positive})$ ?

A: Using joint distribution: $P(D+, T+) = 0.01 \times 0.95 = 0.0095$ . $P(D-, T+) = 0.99 \times 0.10 = 0.099$ . $P(T+) = 0.1085$ . $P(D+ \mid T+) = 0.0095/0.1085 \approx 0.0876$ (only 8.8%!)
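Laying the problem out as a 2×2 joint table makes the calculation mechanical. In this sketch the row and column ordering is an assumed convention (rows for disease status, columns for test result).

```python
import numpy as np

prevalence = 0.01
sensitivity = 0.95   # P(T+ | D+)
specificity = 0.90   # P(T- | D-)

# Joint table: rows = disease status (D+, D-), columns = test result (T+, T-).
joint = np.array([
    [prevalence * sensitivity,             prevalence * (1 - sensitivity)],
    [(1 - prevalence) * (1 - specificity), (1 - prevalence) * specificity],
])

p_test = joint.sum(axis=0)              # marginal over disease status
posterior = joint[:, 0] / p_test[0]     # P(D | T+) for D+ and D-
print(f"P(D+ | T+) = {posterior[0]:.4f}")
```

Most positives come from the much larger healthy group, which is why the posterior stays small despite the high sensitivity.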

📝Question 6: Multivariate Normal

Q: If $(X,Y) \sim \mathcal{N}\left(\begin{pmatrix}0\\0\end{pmatrix}, \begin{pmatrix}1 & \rho \\ \rho & 1\end{pmatrix}\right)$ , what is the conditional distribution of $Y$ given $X=x$ ?

A: $Y \mid X=x \sim \mathcal{N}(\rho x, 1-\rho^2)$ . The conditional mean is $\rho x$ (regression toward the mean) and the conditional variance $1-\rho^2$ decreases as $|\rho|$ increases (knowing $X$ reduces uncertainty about $Y$ ).
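The closed form can be checked against the definition $f_{Y \mid X}(y \mid x) = f_{X,Y}(x,y) / f_X(x)$. The values of `rho` and `x` below are arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

rho = 0.8   # assumed correlation for illustration
x = 1.5     # assumed conditioning value
joint = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]])

y_grid = np.linspace(-3.0, 4.0, 50)
points = np.column_stack([np.full_like(y_grid, x), y_grid])

# Conditional density from the definition: f(x, y) / f_X(x), with X ~ N(0, 1).
cond_from_joint = joint.pdf(points) / norm.pdf(x)

# Closed form: N(rho * x, 1 - rho^2); scale is the standard deviation.
cond_closed_form = norm.pdf(y_grid, loc=rho * x, scale=np.sqrt(1 - rho**2))

max_abs_diff = np.max(np.abs(cond_from_joint - cond_closed_form))
print(max_abs_diff)
```

The two curves agree up to floating-point error, so the difference printed is tiny.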

Practice Problems

📝Problem 1: Marginal Distribution

The joint PMF of $X$ and $Y$ is $p(x,y) = \frac{xy}{36}$ for $x, y \in \{1, 2, 3\}$ . Find the marginal PMFs $p_X(x)$ and $p_Y(y)$ .

💡Solution

p_X(x) = \sum_{y=1}^{3} \frac{xy}{36} = \frac{x}{36}(1+2+3) = \frac{6x}{36} = \frac{x}{6}

So $p_X(1)=1/6$ , $p_X(2)=2/6=1/3$ , $p_X(3)=3/6=1/2$ . By symmetry, $p_Y(y) = y/6$ .

📝Problem 2: Conditional Distribution

Given $p(0,0)=0.1$ , $p(0,1)=0.2$ , $p(1,0)=0.3$ , $p(1,1)=0.4$ , compute $P(X=1 \mid Y=0)$ and $P(Y=1 \mid X=0)$ .

💡Solution

$P(Y=0) = 0.1 + 0.3 = 0.4$ . $P(X=1 \mid Y=0) = p(1,0)/P(Y=0) = 0.3/0.4 = 0.75$ .

$P(X=0) = 0.1 + 0.2 = 0.3$ . $P(Y=1 \mid X=0) = p(0,1)/P(X=0) = 0.2/0.3 \approx 0.667$ .

📝Problem 3: Independence Check

$X$ takes values $\{1, 2\}$ and $Y$ takes values $\{a, b\}$ with joint PMF: $p(1,a)=0.2$ , $p(1,b)=0.3$ , $p(2,a)=0.1$ , $p(2,b)=0.4$ . Are $X$ and $Y$ independent?

💡Solution

$P(X=1)=0.5$ , $P(Y=a)=0.3$ . If independent, $p(1,a)$ should be $0.5 \times 0.3 = 0.15 \neq 0.2$ .

Not independent. The joint probabilities do not factor into the product of marginals.

📝Problem 4: Covariance from Joint Distribution

Let $P(X=-1,Y=-1)=0.1$ , $P(X=-1,Y=1)=0.2$ , $P(X=1,Y=-1)=0.3$ , $P(X=1,Y=1)=0.4$ . Compute $\text{Cov}(X,Y)$ and $\rho_{XY}$ .

💡Solution

$E[X] = -1(0.3) + 1(0.7) = 0.4$ . $E[Y] = -1(0.4) + 1(0.6) = 0.2$ .

$E[XY] = (-1)(-1)(0.1) + (-1)(1)(0.2) + (1)(-1)(0.3) + (1)(1)(0.4) = 0.1 - 0.2 - 0.3 + 0.4 = 0$ .

$\text{Cov}(X,Y) = E[XY] - E[X]E[Y] = 0 - 0.08 = -0.08$ .

$E[X^2] = 1(0.3) + 1(0.7) = 1$ . $\text{Var}(X) = 1 - 0.16 = 0.84$ . $\sigma_X = \sqrt{0.84} \approx 0.9165$ .

$E[Y^2] = 1(0.4) + 1(0.6) = 1$ . $\text{Var}(Y) = 1 - 0.04 = 0.96$ . $\sigma_Y = \sqrt{0.96} \approx 0.9798$ .

$\rho = -0.08 / (0.9165 \times 0.9798) \approx -0.0891$ .
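The same quantities can be computed directly from the joint table. The sketch below is one way to do it with NumPy matrix products; the array layout (rows for $X$, columns for $Y$) is an assumed convention.

```python
import numpy as np

x_vals = np.array([-1, 1])
y_vals = np.array([-1, 1])
joint = np.array([[0.1, 0.2],
                  [0.3, 0.4]])   # rows: X = -1, 1; columns: Y = -1, 1

p_x, p_y = joint.sum(axis=1), joint.sum(axis=0)
e_x, e_y = x_vals @ p_x, y_vals @ p_y
e_xy = x_vals @ joint @ y_vals            # sum over x, y of x * y * p(x, y)
cov = e_xy - e_x * e_y
var_x = (x_vals**2) @ p_x - e_x**2
var_y = (y_vals**2) @ p_y - e_y**2
rho = cov / np.sqrt(var_x * var_y)
print(f"Cov = {cov:.4f}, rho = {rho:.4f}")
```

Working from the table avoids rounding the standard deviations before dividing, so the correlation keeps full precision.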

📝Problem 5: Chain Rule Application

A joint distribution over three binary variables satisfies $P(A=0)=0.6$ , $P(B=0 \mid A=0)=0.7$ , $P(B=0 \mid A=1)=0.4$ , $P(C=0 \mid A=0, B=0)=0.8$ , $P(C=0 \mid A=0, B=1)=0.5$ , $P(C=0 \mid A=1, B=0)=0.3$ , $P(C=0 \mid A=1, B=1)=0.1$ . Compute $P(A=0, B=0, C=0)$ .

💡Solution

Using the chain rule:

P(A=0, B=0, C=0) = P(A=0) \cdot P(B=0 \mid A=0) \cdot P(C=0 \mid A=0, B=0)

= 0.6 \times 0.7 \times 0.8 = 0.336
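The chain rule also gives the entire joint table at once. In the sketch below the complementary probabilities, such as $P(B=1 \mid A=0) = 0.3$, are filled in by subtracting the given values from 1, and broadcasting multiplies the three factors into a 2×2×2 array.

```python
import numpy as np

p_a = np.array([0.6, 0.4])                        # P(A=a)
p_b_given_a = np.array([[0.7, 0.3],               # P(B=b | A=0)
                        [0.4, 0.6]])              # P(B=b | A=1)
p_c0_given_ab = np.array([[0.8, 0.5],             # P(C=0 | A=0, B=b)
                          [0.3, 0.1]])            # P(C=0 | A=1, B=b)

# Stack P(C=0 | a, b) and P(C=1 | a, b) along a new last axis: shape (2, 2, 2).
p_c_given_ab = np.stack([p_c0_given_ab, 1 - p_c0_given_ab], axis=-1)

# joint[a, b, c] = P(A=a) * P(B=b | A=a) * P(C=c | A=a, B=b)
joint = p_a[:, None, None] * p_b_given_a[:, :, None] * p_c_given_ab

print("P(A=0, B=0, C=0) =", joint[0, 0, 0])
print("Total mass:", joint.sum())
```

Any marginal or conditional over the three variables can then be read off by summing the appropriate axes of `joint`.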

Quick Reference

Quantity	Formula	Python
Joint PMF	$P(X=x, Y=y) = p(x,y)$	2D numpy array
Joint PDF	$f_{X,Y}(x,y)$	`scipy.stats.multivariate_normal`
Marginal (discrete)	$P(X=x) = \sum_y p(x,y)$	`joint.sum(axis=1)`
Marginal (continuous)	$f_X(x) = \int f_{X,Y}(x,y) dy$	`np.trapezoid(f, y, axis=1)` (`np.trapz` before NumPy 2.0)
Conditional	$P(Y=y \mid X=x) = p(x,y)/p_X(x)$	`joint[x] / marginal_x[x]`
Independence test	$p(x,y) = p_X(x) \cdot p_Y(y)$ for all $(x,y)$	`np.allclose(joint, np.outer(px, py))`
Covariance	$\text{Cov}(X,Y) = E[XY] - E[X]E[Y]$	`np.cov(X, Y)[0, 1]`
Correlation	$\rho = \text{Cov}(X,Y)/(\sigma_X \sigma_Y)$	`np.corrcoef(X, Y)[0, 1]`
Chain rule	$P(A,B,C) = P(A)P(B \mid A)P(C \mid A,B)$	Nested loops or recursion
Mutual information	$I(X;Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p_X(x)p_Y(y)}$	`sklearn.metrics.mutual_info_score`

Cross-References

Probability Foundations: 034-probability-foundations — Sample spaces, axioms, and basic probability rules
Conditional Probability: 035-probability-conditional — Conditional probability and Bayes' theorem
Random Variables: 037-probability-random-variables — Definition and types of random variables
Probability Distributions: 038-probability-distributions — Common univariate distributions
Expectation and Variance: 039-probability-expectation — Moments, MGF, and properties
Covariance and Correlation: 041-probability-covariance — Detailed treatment of dependence measures
Central Limit Theorem: 042-probability-clt — Normal approximation for sums of random variables
Markov Chains: 043-probability-markov — Transition matrices and stationary distributions
Information Theory: 082-info-theory-mutual — Mutual information as a measure of dependence

📋Key Takeaways

Joint Distribution $P(X=x, Y=y) = p(x,y)$ captures the probability of specific combinations of values across multiple random variables simultaneously. It is the most complete description of the relationship between variables.
Marginal Distribution is obtained by summing (discrete: $P(X=x) = \sum_y p(x,y)$ ) or integrating (continuous: $f_X(x) = \int f_{X,Y}(x,y) dy$ ) over the other variable. Marginals discard dependence information.
Conditional Distribution $P(Y=y \mid X=x) = \frac{p(x,y)}{p_X(x)}$ gives the distribution of one variable given a specific value of another. It is the foundation of Bayes' rule and predictive modeling.
Independence means $P(X=x, Y=y) = P(X=x)P(Y=y)$ for all $x, y$ — the joint factorizes into the product of marginals. Independence is stronger than zero correlation.
Covariance $\text{Cov}(X,Y) = E[XY] - E[X]E[Y]$ measures linear co-movement. It is the numerator of correlation and appears in the variance-of-sum formula.
Correlation $\rho = \text{Cov}(X,Y)/(\sigma_X \sigma_Y) \in [-1,1]$ normalizes covariance to a unitless scale. It captures only linear dependence — nonlinear relationships require mutual information or other measures.
Chain Rule: $P(X_1, \dots, X_n) = P(X_1) P(X_2 \mid X_1) \cdots P(X_n \mid X_1, \dots, X_{n-1})$ decomposes any joint distribution into a product of conditionals.
In code: Use np.sum(axis=...) for marginals, division for conditionals, np.cov() and np.corrcoef() for dependence measures, and np.outer() to test independence.
Practical Application: Joint distributions are essential for probabilistic graphical models, generative modeling (VAEs, GMMs), PCA, Bayesian inference, and understanding feature dependencies in high-dimensional ML datasets.