Random Variables & Probability Distributions

ℹ️ Why It Matters

Machine learning models make predictions under uncertainty. A patient might have a disease or not. A stock price tomorrow could be higher or lower. A spam filter must decide if an email is junk. Random variables are the mathematical language that lets us quantify, model, and reason about this uncertainty. Probability theory is built on them, and modern AI, statistics, and data science are in turn built on probability theory. Many loss functions, sampling strategies, Bayesian models, and generative AI systems rest on random variables and their distributions.

What is a Random Variable?

Definition: Random Variable

A random variable $X$ is a function that maps outcomes from a sample space $\Omega$ to real numbers. Formally, $X: \Omega \rightarrow \mathbb{R}$ . It assigns a numerical value to each possible outcome of a random experiment.

Random Variable

X: \Omega \rightarrow \mathbb{R}

Here,

$X$ = The random variable
$\Omega$ = The sample space (all possible outcomes)
$\mathbb{R}$ = The set of real numbers

Simple Analogy: Think of a random variable like a score in a game. The game has many possible outcomes (roll of a die, draw of a card), but the random variable converts each outcome into a number you can work with, such as points, dollars, or centimeters.

Real-World Examples:

Coin flip: $X = 1$ if heads, $X = 0$ if tails
Dice roll: $X$ = the number shown on the die (1 through 6)
Height of a person: $X$ = height in centimeters (can be any value in a range)
Number of customers: $X$ = count of customers arriving in an hour (0, 1, 2, ...)
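
To make the "function on outcomes" idea concrete, here is a minimal sketch that treats a random variable as an ordinary Python function. The two-coin-flip sample space and the names `sample_space`, `num_heads`, and `pmf` are illustrative assumptions, not part of the examples above.

```python
from fractions import Fraction

# Hypothetical sample space: two fair coin flips, all outcomes equally likely.
sample_space = ["HH", "HT", "TH", "TT"]
prob = {outcome: Fraction(1, 4) for outcome in sample_space}


def num_heads(outcome: str) -> int:
    """Random variable X: maps an outcome to the number of heads it contains."""
    return outcome.count("H")


# The distribution of X is induced by the mapping:
# P(X = x) is the total probability of all outcomes that X sends to x.
pmf = {}
for outcome in sample_space:
    x = num_heads(outcome)
    pmf[x] = pmf.get(x, Fraction(0)) + prob[outcome]

for x in sorted(pmf):
    print(f"P(X = {x}) = {pmf[x]}")
```

The outcomes themselves are strings, but once `num_heads` is applied, every question about the experiment becomes a question about numbers.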

Discrete vs. Continuous

Definition: Discrete Random Variable

A random variable $X$ is discrete if it takes on a countable number of distinct values. The set of possible values is finite or countably infinite.

Examples of discrete variables:

Number of heads in 10 coin flips: $\{0, 1, 2, \ldots, 10\}$
Number of emails received per day: $\{0, 1, 2, 3, \ldots\}$
Customer rating: $\{1, 2, 3, 4, 5\}$

Definition: Continuous Random Variable

A random variable $X$ is continuous if it can take on any value within a real interval. The set of possible values is uncountably infinite.

Examples of continuous variables:

Temperature: any value in $[-40, 50]$ degrees Celsius
Weight: any value in $[0, \infty)$ kilograms
Time to complete a task: any value in $[0, \infty)$ seconds

💡 Key Distinction

For a discrete random variable, $P(X = x) > 0$ for specific values. For a continuous random variable, $P(X = x) = 0$ for any single point — probabilities are only meaningful over intervals, e.g., $P(a \leq X \leq b)$ .

Probability Mass Function (PMF)

Definition: Probability Mass Function (PMF)

For a discrete random variable $X$ , the probability mass function $p(x)$ gives the probability that $X$ takes on the exact value $x$ :

p(x) = P(X = x)

PMF Properties

p(x) \geq 0 \quad \text{and} \quad \sum_{x} p(x) = 1

Here,

$p(x)$ = Probability that $X$ equals $x$
$\sum_x p(x)$ = Sum of all probabilities, which must equal 1

📝Example: PMF of a Fair Die

For a fair six-sided die, $X$ = the outcome:

$x$	1	2	3	4	5	6
$p(x)$	1/6	1/6	1/6	1/6	1/6	1/6

$p(1) = P(X = 1) = \frac{1}{6}$
$\sum_{x=1}^{6} p(x) = \frac{1}{6} \times 6 = 1$ ✓

📝Example: PMF of an Unfair Coin

Suppose a biased coin has $P(\text{heads}) = 0.7$ and $P(\text{tails}) = 0.3$ :

$x$	0 (tails)	1 (heads)
$p(x)$	0.3	0.7

$p(0) + p(1) = 0.3 + 0.7 = 1$ ✓
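
The two PMF properties can be checked mechanically. The sketch below stores a PMF as a dictionary and tests non-negativity and normalization; the helper name `is_valid_pmf` and the deliberately broken table `not_a_pmf` are assumptions made for illustration.

```python
import math


def is_valid_pmf(pmf: dict, tol: float = 1e-9) -> bool:
    """Return True if every p(x) >= 0 and the probabilities sum to 1."""
    non_negative = all(p >= 0 for p in pmf.values())
    normalized = math.isclose(sum(pmf.values()), 1.0, abs_tol=tol)
    return non_negative and normalized


fair_die = {x: 1 / 6 for x in range(1, 7)}
biased_coin = {0: 0.3, 1: 0.7}
not_a_pmf = {0: 0.5, 1: 0.6}  # sums to 1.1, so it is not a valid PMF

print(is_valid_pmf(fair_die))     # True
print(is_valid_pmf(biased_coin))  # True
print(is_valid_pmf(not_a_pmf))    # False
```

A tolerance is used because floating-point sums such as six copies of `1 / 6` may differ from 1 by a rounding error.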

Probability Density Function (PDF)

Definition: Probability Density Function (PDF)

For a continuous random variable $X$ , the probability density function $f(x)$ describes the relative likelihood of $X$ taking on a value near $x$ . The probability that $X$ falls in an interval $[a, b]$ is the area under $f(x)$ over that interval:

P(a \leq X \leq b) = \int_a^b f(x)\,dx

PDF Properties

f(x) \geq 0 \quad \text{and} \quad \int_{-\infty}^{\infty} f(x)\,dx = 1

Here,

$f(x)$ = The probability density at $x$
$\int_{-\infty}^{\infty} f(x)\,dx$ = Total area under the curve, which must equal 1

⚠️ Important

Unlike a PMF, $f(x)$ is not a probability. It is a density. The value $f(x)$ can exceed 1 — only probabilities (areas under the curve) must be between 0 and 1. For a continuous variable, $P(X = c) = 0$ for any specific point $c$ .

📝Example: PDF of a Uniform Distribution

For $X \sim \text{Uniform}(0, 1)$ :

f(x) = \begin{cases} 1 & 0 \leq x \leq 1 \\ 0 & \text{otherwise} \end{cases}

$P(0.2 \leq X \leq 0.5) = \int_{0.2}^{0.5} 1\,dx = 0.3$
The total area: $\int_0^1 1\,dx = 1$ ✓
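
A density above 1 is easy to produce. The sketch below uses Uniform(0, 0.5), which is not one of the examples above and is chosen only because its density equals 2 on the whole interval. The helper `integrate` is a simple midpoint Riemann sum written for this illustration, not a library routine.

```python
def uniform_pdf(x: float, a: float, b: float) -> float:
    """Density of Uniform(a, b): constant 1/(b - a) on [a, b], zero elsewhere."""
    return 1.0 / (b - a) if a <= x <= b else 0.0


def integrate(f, lo: float, hi: float, steps: int = 100_000) -> float:
    """Approximate the integral of f over [lo, hi] with a midpoint Riemann sum."""
    width = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * width) for i in range(steps)) * width


a, b = 0.0, 0.5
density_at_point = uniform_pdf(0.25, a, b)
total_area = integrate(lambda x: uniform_pdf(x, a, b), a, b)
prob_interval = integrate(lambda x: uniform_pdf(x, a, b), 0.1, 0.2)

print(f"f(0.25)            = {density_at_point:.2f}")  # density, exceeds 1
print(f"total area         = {total_area:.4f}")        # still 1
print(f"P(0.1 <= X <= 0.2) = {prob_interval:.4f}")     # a probability, at most 1
```

The density is 2 everywhere on the interval, yet every probability computed from it stays between 0 and 1 because the interval is only 0.5 wide.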

Cumulative Distribution Function (CDF)

Definition: Cumulative Distribution Function (CDF)

The cumulative distribution function $F(x)$ gives the probability that the random variable $X$ takes on a value less than or equal to $x$ :

F(x) = P(X \leq x)

CDF Definition

F(x) = P(X \leq x)

Here,

$F(x)$ = Cumulative probability up to $x$
$P(X \leq x)$ = Probability that $X$ is at most $x$

CDF for Continuous Variables

F(x) = \int_{-\infty}^{x} f(t)\,dt

Here,

$F(x)$ = The CDF
$f(t)$ = The probability density function

CDF for Discrete Variables

F(x) = \sum_{x_i \leq x} p(x_i)

Here,

$F(x)$ = The CDF
$p(x_i)$ = PMF evaluated at $x_i$

Theorem: CDF Properties

Non-decreasing: If $a \leq b$ , then $F(a) \leq F(b)$
Limits: $\lim_{x \to -\infty} F(x) = 0$ and $\lim_{x \to \infty} F(x) = 1$
Right-continuous: $F(x)$ is always right-continuous
Interval probability: $P(a < X \leq b) = F(b) - F(a)$

📝Example: CDF of a Fair Die

For $X$ = outcome of a fair die roll:

$x$	1	2	3	4	5	6
$F(x)$	1/6	2/6	3/6	4/6	5/6	1

$F(3) = P(X \leq 3) = P(X=1) + P(X=2) + P(X=3) = \frac{1}{6} + \frac{1}{6} + \frac{1}{6} = \frac{1}{2}$
$P(2 < X \leq 5) = F(5) - F(2) = \frac{5}{6} - \frac{2}{6} = \frac{3}{6} = \frac{1}{2}$
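
The die table can be reproduced by accumulating the PMF. This is a minimal sketch using exact fractions; the function name `F` mirrors the notation above, and the evaluation at the non-integer point 3.7 is an added illustration of the step-function behaviour.

```python
from fractions import Fraction
from itertools import accumulate

values = [1, 2, 3, 4, 5, 6]
pmf = [Fraction(1, 6)] * 6

# Running sum of the PMF gives the CDF at each support point.
cdf = dict(zip(values, accumulate(pmf)))


def F(x) -> Fraction:
    """Discrete CDF: sum of p(x_i) over all support points x_i <= x."""
    return sum((p for v, p in zip(values, pmf) if v <= x), Fraction(0))


print(cdf[3])       # 1/2
print(F(5) - F(2))  # 1/2, i.e. P(2 < X <= 5)
print(F(3.7))       # 1/2, the CDF is flat between support points
```

Because the CDF of a discrete variable only jumps at support points, `F(3.7)` equals `F(3)`.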

💡 CDF vs PMF/PDF

The CDF works for both discrete and continuous random variables. It always exists and is always well-defined, making it a universal tool. The PMF only works for discrete variables; the PDF only works for continuous variables.

Bernoulli Distribution

The simplest distribution: a single yes/no trial.

Definition: Bernoulli Distribution

A random variable $X$ follows a Bernoulli distribution with parameter $p$ if it takes value 1 with probability $p$ and value 0 with probability $1 - p$ .

Bernoulli PMF

p(x) = p^x (1-p)^{1-x}, \quad x \in \{0, 1\}

Here,

$p$ = Probability of success ($X = 1$)
$1 - p$ = Probability of failure ($X = 0$)
$x$ = Outcome: 0 or 1

Bernoulli Parameters

\mu = p, \quad \sigma^2 = p(1-p)

Here,

$\mu$ = Mean (expected value)
$\sigma^2$ = Variance

📝Example: Bernoulli Distribution

A drug has a 90% cure rate. Let $X = 1$ if cured, $X = 0$ if not.

$p = 0.9$ , so $P(X = 1) = 0.9$ , $P(X = 0) = 0.1$
Mean: $\mu = 0.9$
Variance: $\sigma^2 = 0.9 \times 0.1 = 0.09$
Standard deviation: $\sigma = 0.3$

ℹ️ AI/ML Connection

The Bernoulli distribution models binary decisions: spam/not spam, click/no-click, cat/dog. It is the output distribution of binary classifiers and the foundation of logistic regression.

Binomial Distribution

How many successes in $n$ independent Bernoulli trials?

Definition: Binomial Distribution

A random variable $X$ follows a binomial distribution $X \sim \text{Binomial}(n, p)$ if it counts the number of successes in $n$ independent Bernoulli trials, each with success probability $p$ .

Binomial PMF

P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}, \quad k = 0, 1, 2, \ldots, n

Here,

$\binom{n}{k}$ = Binomial coefficient: $n$ choose $k$
$n$ = Number of trials
$p$ = Probability of success on each trial
$k$ = Number of successes

Binomial Parameters

\mu = np, \quad \sigma^2 = np(1-p)

Here,

$n$ = Number of trials
$p$ = Success probability
$\mu$ = Mean
$\sigma^2$ = Variance

📝Example: Binomial Distribution

Flip a fair coin 10 times. Let $X$ = number of heads. Then $X \sim \text{Binomial}(10, 0.5)$ .

P(X = 5) = \binom{10}{5} (0.5)^5 (0.5)^5 = 252 \times \frac{1}{1024} \approx 0.246

Mean: $\mu = 10 \times 0.5 = 5$
Variance: $\sigma^2 = 10 \times 0.5 \times 0.5 = 2.5$

The most likely outcome is 5 heads (24.6% probability), which makes intuitive sense.
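
The numbers in this example can be verified directly from the PMF formula. The sketch below uses only the standard library; `binomial_pmf` is a helper defined here for illustration, and the mean and variance are computed from the PMF rather than from the closed-form expressions.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 10, 0.5
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

mode = max(range(n + 1), key=lambda k: pmf[k])             # most likely count
mean = sum(k * pk for k, pk in enumerate(pmf))             # sum of k * p(k)
variance = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))

print(f"P(X = 5) = {pmf[5]:.4f}")  # 0.2461
print(f"mode = {mode}, mean = {mean:.1f}, variance = {variance:.1f}")
```

Computing the moments from the PMF confirms that they agree with $np$ and $np(1-p)$ for this case.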

💡 Relationship to Bernoulli

A Bernoulli distribution is a special case of the binomial distribution where $n = 1$ . That is, $\text{Bernoulli}(p) = \text{Binomial}(1, p)$ .

Poisson Distribution

Counting rare events over a fixed interval.

Definition: Poisson Distribution

A random variable $X$ follows a Poisson distribution $X \sim \text{Poisson}(\lambda)$ if it counts the number of events occurring in a fixed interval of time or space, given a constant average rate $\lambda$ and independent events.

Poisson PMF

P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad k = 0, 1, 2, \ldots

Here,

$\lambda$ = Average rate (expected number of events)
$k$ = Number of events
$e$ = Euler's number ≈ 2.71828

Poisson Parameters

\mu = \lambda, \quad \sigma^2 = \lambda

Here,

$\lambda$ = Rate parameter
$\mu$ = Mean (equals $\lambda$)
$\sigma^2$ = Variance (also equals $\lambda$)

📝Example: Poisson Distribution

A server receives an average of 4 requests per minute ( $\lambda = 4$ ). What is the probability of getting exactly 6 requests in a minute?

P(X = 6) = \frac{4^6 e^{-4}}{6!} \approx \frac{4096 \times 0.0183}{720} \approx 0.104

There is about a 10.4% chance of receiving exactly 6 requests.

ℹ️ Poisson Limit

When $n$ is large and $p$ is small, the Binomial distribution $\text{Binomial}(n, p)$ is well approximated by $\text{Poisson}(\lambda)$ where $\lambda = np$ . This is why Poisson is used for rare events.
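
A quick numerical check shows how fast the approximation improves. The sketch below holds $\lambda = np = 4$ fixed and compares $P(X = 6)$ under Binomial($n$, $\lambda/n$) with the Poisson value; the particular values of $n$ are arbitrary choices for illustration.

```python
from math import comb, exp, factorial


def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for X ~ Poisson(lam)."""
    return lam**k * exp(-lam) / factorial(k)


lam, k = 4.0, 6
target = poisson_pmf(k, lam)

errors = {}
for n in (10, 100, 10_000):
    approx = binomial_pmf(k, n, lam / n)  # p shrinks as n grows, np stays at lam
    errors[n] = abs(approx - target)
    print(f"n = {n:>6}: binomial = {approx:.5f}, poisson = {target:.5f}")
```

With $n = 10$ the success probability is 0.4, which is not small, so the two values differ visibly; the gap shrinks as $n$ grows and $p$ falls.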

Uniform Distribution

Every outcome equally likely over an interval.

Definition: Uniform Distribution

A random variable $X$ follows a continuous uniform distribution $X \sim \text{Uniform}(a, b)$ if its density is constant on the interval $[a, b]$ and zero elsewhere, so that subintervals of equal length are equally probable.

Uniform PDF

f(x) = \frac{1}{b - a}, \quad a \leq x \leq b

Here,

$a$ = Lower bound
$b$ = Upper bound
$b - a$ = Width of the interval

Uniform CDF

F(x) = \frac{x - a}{b - a}, \quad a \leq x \leq b

Here,

$F(x)$ = Cumulative probability at $x$

Uniform Parameters

\mu = \frac{a + b}{2}, \quad \sigma^2 = \frac{(b - a)^2}{12}

Here,

$\mu$ = Mean (midpoint of the interval)
$\sigma^2$ = Variance

📝Example: Uniform Distribution

A random number generator produces values uniformly between 0 and 1: $X \sim \text{Uniform}(0, 1)$ .

PDF: $f(x) = 1$ for $0 \leq x \leq 1$
Mean: $\mu = \frac{0 + 1}{2} = 0.5$
Variance: $\sigma^2 = \frac{(1-0)^2}{12} = \frac{1}{12} \approx 0.0833$
$P(0.25 \leq X \leq 0.75) = \frac{0.75 - 0.25}{1 - 0} = 0.5$

ℹ️ Why Uniform Matters

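The Uniform(0,1) distribution is the starting point for most random number generation. In principle, any continuous random variable can be generated from uniform random numbers using the inverse transform method: if $U \sim \text{Uniform}(0, 1)$ and $F$ is the target CDF, then $X = F^{-1}(U)$ has CDF $F$.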
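
As a sketch of the inverse transform method, the code below turns uniform draws into exponential draws. The exponential distribution is not covered in this section and is used only because its CDF, $F(x) = 1 - e^{-\lambda x}$, inverts in closed form to $F^{-1}(u) = -\ln(1 - u)/\lambda$. The rate, seed, and function name are illustrative assumptions.

```python
import math
import random


def sample_exponential(rate: float, rng: random.Random) -> float:
    """Draw one Exponential(rate) sample by inverting the CDF at a uniform draw."""
    u = rng.random()                   # U ~ Uniform[0, 1)
    return -math.log(1.0 - u) / rate   # F^{-1}(u); 1 - u is in (0, 1], so log is safe


rng = random.Random(0)  # fixed seed so the run is repeatable
rate = 2.0
samples = [sample_exponential(rate, rng) for _ in range(100_000)]

sample_mean = sum(samples) / len(samples)
print(f"sample mean = {sample_mean:.3f}, theoretical mean = {1 / rate:.3f}")
```

The sample mean should land close to $1/\lambda = 0.5$, which is the mean of the exponential distribution.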

Python Implementation

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# --- Bernoulli Distribution ---
# Simulate 10000 coin flips with p=0.7
bernoulli_rv = stats.bernoulli(p=0.7)
samples = bernoulli_rv.rvs(size=10000)
print(f"Bernoulli Mean: {samples.mean():.3f}")       # ~0.70
print(f"Bernoulli Var: {samples.var():.3f}")          # ~0.21
print(f"P(X=1): {bernoulli_rv.pmf(1):.3f}")          # 0.700
print(f"P(X<=0): {bernoulli_rv.cdf(0):.3f}")         # 0.300

# --- Binomial Distribution ---
# 10 trials, p=0.3, simulate 10000 experiments
binom_rv = stats.binom(n=10, p=0.3)
samples = binom_rv.rvs(size=10000)
print(f"Binomial Mean: {samples.mean():.3f}")         # ~3.0
print(f"Binomial Var: {samples.var():.3f}")           # ~2.1
print(f"P(X=5): {binom_rv.pmf(5):.4f}")              # ~0.1029
print(f"P(X<=3): {binom_rv.cdf(3):.4f}")             # ~0.6496

# --- Poisson Distribution ---
# Average 4 events per interval
poisson_rv = stats.poisson(mu=4)
samples = poisson_rv.rvs(size=10000)
print(f"Poisson Mean: {samples.mean():.3f}")          # ~4.0
print(f"Poisson Var: {samples.var():.3f}")            # ~4.0
print(f"P(X=6): {poisson_rv.pmf(6):.4f}")            # ~0.1042

# --- Uniform Distribution ---
# Continuous uniform on [0, 1]
uniform_rv = stats.uniform(loc=0, scale=1)
samples = uniform_rv.rvs(size=10000)
print(f"Uniform Mean: {samples.mean():.3f}")          # ~0.50
print(f"Uniform Var: {samples.var():.4f}")            # ~0.0833
print(f"P(0.25<=X<=0.75): {uniform_rv.cdf(0.75) - uniform_rv.cdf(0.25):.3f}")  # 0.500

# --- Visualization ---
fig, axes = plt.subplots(2, 2, figsize=(12, 8))

# Bernoulli
axes[0, 0].bar([0, 1], [0.3, 0.7], color=['steelblue', 'coral'])
axes[0, 0].set_title('Bernoulli(p=0.7)')
axes[0, 0].set_xlabel('x')
axes[0, 0].set_ylabel('P(X=x)')

# Binomial
x_binom = np.arange(0, 11)
axes[0, 1].bar(x_binom, binom_rv.pmf(x_binom), color='steelblue')
axes[0, 1].set_title('Binomial(n=10, p=0.3)')
axes[0, 1].set_xlabel('k')
axes[0, 1].set_ylabel('P(X=k)')

# Poisson
x_poisson = np.arange(0, 12)
axes[1, 0].bar(x_poisson, poisson_rv.pmf(x_poisson), color='coral')
axes[1, 0].set_title('Poisson(λ=4)')
axes[1, 0].set_xlabel('k')
axes[1, 0].set_ylabel('P(X=k)')

# Uniform
x_uniform = np.linspace(-0.2, 1.2, 1000)
axes[1, 1].fill_between(x_uniform, uniform_rv.pdf(x_uniform), alpha=0.3, color='steelblue')
axes[1, 1].plot(x_uniform, uniform_rv.pdf(x_uniform), color='steelblue')
axes[1, 1].set_title('Uniform(0, 1)')
axes[1, 1].set_xlabel('x')
axes[1, 1].set_ylabel('f(x)')

plt.tight_layout()
plt.savefig('distributions.png', dpi=150)
plt.show()

Applications in AI/ML

ℹ️ Why Distributions Matter in ML

Probability distributions are not just theory: they underpin most machine learning systems. Here are some of the most important applications.

Loss Functions Derived from Distributions

Many common loss functions in ML are negative log-likelihoods of probability distributions:

Loss Function	Distribution	Use Case
Binary Cross-Entropy	Bernoulli	Binary classification
Categorical Cross-Entropy	Categorical	Multi-class classification
MSE (Mean Squared Error)	Gaussian	Regression
Poisson Loss	Poisson	Count prediction

📝Example: Cross-Entropy Loss

For a binary classifier predicting $\hat{y} = P(Y=1)$ , the cross-entropy loss for a single example with true label $y \in \{0, 1\}$ is:

\mathcal{L} = -[y \log(\hat{y}) + (1-y) \log(1-\hat{y})]

This is the negative log-likelihood of a Bernoulli distribution with parameter $\hat{y}$ .
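
The equivalence can be confirmed numerically. The sketch below evaluates the cross-entropy formula and the negative log of the Bernoulli PMF for a few label and prediction pairs; the pairs are arbitrary, and the predictions are kept strictly between 0 and 1 so the logarithms are defined.

```python
import math


def bernoulli_pmf(y: int, p: float) -> float:
    """p(y) = p^y (1 - p)^(1 - y) for y in {0, 1}."""
    return p**y * (1 - p) ** (1 - y)


def binary_cross_entropy(y: int, y_hat: float) -> float:
    """Cross-entropy loss for one example with true label y and prediction y_hat."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))


results = []
for y, y_hat in [(1, 0.9), (0, 0.9), (1, 0.2)]:
    bce = binary_cross_entropy(y, y_hat)
    nll = -math.log(bernoulli_pmf(y, y_hat))  # negative log-likelihood
    results.append((bce, nll))
    print(f"y = {y}, y_hat = {y_hat}: BCE = {bce:.4f}, NLL = {nll:.4f}")
```

The two columns agree for every pair, and a confident wrong prediction (label 0 with prediction 0.9) is penalized far more than a confident correct one.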

Sampling and Data Augmentation

Monte Carlo methods: Draw samples from distributions to estimate integrals and expectations
Reparameterization trick: Used in VAEs (Variational Autoencoders) to backpropagate through random sampling
Data augmentation: Add noise sampled from known distributions to training data
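
As a minimal illustration of the Monte Carlo idea in the list above, the sketch below estimates $E[X^2]$ for $X \sim \text{Uniform}(0, 1)$, whose exact value is $\int_0^1 x^2\,dx = 1/3$. The target function, seed, sample size, and the helper name `monte_carlo_expectation` are assumptions chosen for the example.

```python
import random


def monte_carlo_expectation(g, sampler, n_samples: int) -> float:
    """Estimate E[g(X)] as the average of g over n_samples draws from sampler."""
    return sum(g(sampler()) for _ in range(n_samples)) / n_samples


rng = random.Random(42)  # fixed seed so the run is repeatable
estimate = monte_carlo_expectation(lambda x: x * x, rng.random, 100_000)
exact = 1 / 3            # integral of x^2 over [0, 1]

print(f"estimate = {estimate:.4f}, exact = {exact:.4f}")
```

The error of such an estimate typically shrinks in proportion to $1/\sqrt{n}$, so many samples are needed for each additional digit of accuracy.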

Generative Models

Gaussian Mixture Models (GMM): Model data as a mixture of Gaussians
Naive Bayes: Assume features follow specific distributions (Gaussian, Bernoulli, Multinomial)
Normalizing Flows: Transform simple distributions (Uniform, Gaussian) into complex ones

💡 The Big Picture

Choosing the right distribution for your data is one of the most important modeling decisions. If your data are counts, use Poisson or Negative Binomial. If they are binary, use Bernoulli. If they are continuous and symmetric, consider Gaussian. Mis-specifying the distribution leads to poor models and misleading conclusions.

Common Mistakes

Mistake	Why It's Wrong	Correct Approach
Saying $P(X = 0.5) = 0.3$ for a continuous variable	For continuous RVs, the probability at a single point is always 0	Use intervals: $P(0.4 \leq X \leq 0.6)$
Treating PDF values as probabilities	$f(x)$ is a density, not a probability; it can exceed 1	Probabilities are areas under the curve: $\int f(x)dx$
Using PMF for continuous variables	PMFs are only defined for discrete variables	Use PDF for continuous, PMF for discrete
Forgetting $\sum p(x) = 1$ or $\int f(x)dx = 1$	If these don't hold, it's not a valid distribution	Always verify normalization
Confusing $\mu$ and $\bar{x}$	$\mu$ is the population mean (parameter); $\bar{x}$ is the sample mean (statistic)	$\mu$ is fixed; $\bar{x}$ varies by sample
Assuming independence when it's not given	Independence is a strong assumption that must be justified	Check the problem statement carefully
Using Binomial when trials are not independent	Binomial requires independent trials	Use Hypergeometric for sampling without replacement
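
The last row of the table can be illustrated with a card-drawing example, which is a hypothetical scenario not taken from the text: draw 5 cards from a 52-card deck containing 13 hearts and ask for the probability of exactly 2 hearts. Drawing without replacement makes the trials dependent, so the hypergeometric PMF is the right model; the binomial value is shown only for comparison.

```python
from math import comb


def hypergeom_pmf(k: int, population: int, successes: int, draws: int) -> float:
    """P(k successes) when drawing `draws` items without replacement."""
    return (
        comb(successes, k)
        * comb(population - successes, draws - k)
        / comb(population, draws)
    )


def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Binomial(n, p), which assumes independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


population, successes, draws, k = 52, 13, 5, 2
exact = hypergeom_pmf(k, population, successes, draws)
approx = binomial_pmf(k, draws, successes / population)

print(f"hypergeometric (without replacement): {exact:.4f}")
print(f"binomial (pretends draws are independent): {approx:.4f}")
```

The two values differ by about one percentage point here; the gap narrows when the population is much larger than the number of draws.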

Interview Questions

📝Q1: What is the difference between a PMF and a PDF?

Answer: A PMF (Probability Mass Function) applies to discrete random variables and gives the probability at each point: $p(x) = P(X = x)$ . A PDF (Probability Density Function) applies to continuous random variables and gives the density at each point. For continuous variables, $P(X = c) = 0$ for any single point — probabilities are only defined over intervals using the PDF: $P(a \leq X \leq b) = \int_a^b f(x)dx$ . The key difference is that PMF values are actual probabilities (between 0 and 1), while PDF values are densities (can exceed 1).

📝Q2: Why is the variance of a Poisson distribution equal to its mean?

Answer: This is a mathematical property of the Poisson distribution that arises from its derivation. The Poisson distribution models the limit of the Binomial distribution as $n \to \infty$ and $p \to 0$ with $np = \lambda$ fixed. For Binomial, the variance is $np(1-p)$ . As $p \to 0$ , $(1-p) \to 1$ , so the variance approaches $np = \lambda$ , which is also the mean. This equality ( $\mu = \sigma^2 = \lambda$ ) is a defining characteristic of the Poisson distribution and is used as a diagnostic: if the sample mean and variance are approximately equal, the data may follow a Poisson distribution.

📝Q3: How would you choose between Bernoulli, Binomial, and Poisson distributions?

Answer:

Bernoulli: Single binary trial (yes/no, success/failure). Example: "Will this customer buy?"
Binomial: Fixed number $n$ of independent binary trials. Example: "How many of 10 customers will buy?"
Poisson: Count of events in a continuous interval (time/space) at a constant rate. Example: "How many customers arrive per hour?"

The key distinctions: Bernoulli is for one trial, Binomial is for a fixed number of trials, and Poisson is for events over a continuous interval. Use Binomial when you know $n$ ; use Poisson when $n$ is effectively infinite and $p$ is small.

📝Q4: What is the CDF and why is it useful?

Answer: The CDF (Cumulative Distribution Function) is defined as $F(x) = P(X \leq x)$ . It gives the probability that a random variable takes a value less than or equal to $x$ . The CDF is useful because: (1) It works for both discrete and continuous random variables, unlike PMF/PDF which are type-specific. (2) It always exists and is well-defined. (3) You can compute interval probabilities: $P(a < X \leq b) = F(b) - F(a)$ . (4) In hypothesis testing, p-values are computed using CDFs. (5) For continuous variables, $f(x) = F'(x)$ , connecting CDF to PDF.

📝Q5: A website gets 3 visits per minute on average. What is the probability of getting exactly 5 visits in the next minute? Model this and compute.

Answer: This is a Poisson process with $\lambda = 3$ .

P(X = 5) = \frac{3^5 e^{-3}}{5!} \approx \frac{243 \times 0.0498}{120} \approx \frac{12.1}{120} \approx 0.1008

There is approximately a 10.1% chance of getting exactly 5 visits in the next minute. This can be computed in Python using stats.poisson.pmf(5, mu=3).

📝Q6: Why can't we use a Gaussian distribution to model the number of heads in 10 coin flips?

Answer: The number of heads in 10 flips follows $\text{Binomial}(10, 0.5)$ , which is a discrete distribution with support $\{0, 1, 2, \ldots, 10\}$ . The Gaussian is a continuous distribution defined on $(-\infty, \infty)$ . Using a Gaussian would imply that 2.5 heads is possible, which is nonsensical. Additionally, the Gaussian allows negative values (impossible for counts) and has nonzero probability density at non-integer values. While the Gaussian can approximate the Binomial for large $n$ (by the Central Limit Theorem), it is not the correct model for small $n$ with discrete outcomes.

Practice Problems

📝Problem 1: Binomial Calculation

A quality control test has a 95% detection rate. If 20 items pass through the test, what is the probability that exactly 19 are correctly detected? Assume independence.

💡Solution

This is a Binomial distribution with $n = 20$ , $p = 0.95$ , and $k = 19$ :

P(X = 19) = \binom{20}{19} (0.95)^{19} (0.05)^{1}

= 20 \times (0.95)^{19} \times 0.05

(0.95)^{19} \approx 0.3774

P(X = 19) = 20 \times 0.3774 \times 0.05 \approx 0.377

There is approximately a 37.7% chance that exactly 19 out of 20 items are correctly detected.

📝Problem 2: Poisson Process

A call center receives an average of 8 calls per 10-minute interval. What is the probability of receiving exactly 3 calls in a 10-minute interval? What is the probability of receiving 0 calls?

💡Solution

With $\lambda = 8$ :

Exactly 3 calls:

P(X = 3) = \frac{8^3 e^{-8}}{3!} \approx \frac{512 \times 0.0003355}{6} \approx \frac{0.1718}{6} \approx 0.0286

0 calls:

P(X = 0) = \frac{8^0 e^{-8}}{0!} = e^{-8} \approx 0.000335

There is a 2.86% chance of exactly 3 calls and a 0.034% chance of no calls. The call center is very unlikely to be idle.

📝Problem 3: Uniform Distribution

You arrive at a bus stop, and the time until the next bus arrives is uniformly distributed between 0 and 20 minutes. What is the probability that you wait less than 5 minutes for the bus?

💡Solution

Let $X$ = your waiting time in minutes, so $X \sim \text{Uniform}(0, 20)$ with CDF $F(x) = \frac{x}{20}$ for $0 \leq x \leq 20$. You wait less than 5 minutes exactly when $X < 5$:

P(X < 5) = F(5) = \frac{5 - 0}{20 - 0} = 0.25 = 25\%

Equivalently, the density is constant at $\frac{1}{20}$, so the interval $[0, 5)$ of width 5 has probability $5 \times \frac{1}{20} = \frac{5}{20}$.
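
The answer can be sanity-checked by simulation. The sketch below assumes the waiting time itself is Uniform(0, 20) and counts how often a simulated wait falls below 5 minutes; the seed and number of trials are arbitrary.

```python
import random

rng = random.Random(7)  # fixed seed so the run is repeatable
n_trials = 100_000

# Each trial draws one waiting time (in minutes) from Uniform(0, 20).
waits = [rng.uniform(0.0, 20.0) for _ in range(n_trials)]
fraction_short = sum(w < 5.0 for w in waits) / n_trials

print(f"simulated P(wait < 5) = {fraction_short:.3f}, exact = {5 / 20:.3f}")
```

The simulated fraction should be close to 0.25, with small deviations due to sampling noise.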

📝Problem 4: Connecting Distributions

Show that as $n \to \infty$ with $p = \lambda/n$ , the Binomial PMF approaches the Poisson PMF.

💡Solution

Starting with the Binomial PMF:

P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}

Substitute $p = \lambda/n$ :

P(X = k) = \binom{n}{k} \left(\frac{\lambda}{n}\right)^k \left(1 - \frac{\lambda}{n}\right)^{n-k}

Expand:

= \frac{n!}{k!(n-k)!} \cdot \frac{\lambda^k}{n^k} \cdot \left(1 - \frac{\lambda}{n}\right)^{n-k}

As $n \to \infty$ :

$\frac{n!}{(n-k)! \cdot n^k} = \frac{n(n-1)\cdots(n-k+1)}{n^k} \to 1$ (a product of $k$ factors, each tending to 1 for fixed $k$)
$\left(1 - \frac{\lambda}{n}\right)^n \to e^{-\lambda}$ (fundamental limit)
$\left(1 - \frac{\lambda}{n}\right)^{-k} \to 1$

Therefore:

P(X = k) \to \frac{\lambda^k e^{-\lambda}}{k!}

This is the Poisson PMF. The result is known as the Poisson limit theorem, and it explains why the Poisson distribution is used for rare events with large $n$ and small $p$.

Quick Reference

📋Key Takeaways

Concept	Formula	Notes
Random Variable	$X: \Omega \rightarrow \mathbb{R}$	Maps outcomes to numbers
PMF	$p(x) = P(X = x)$	Discrete only; $\sum p(x) = 1$
PDF	$f(x) \geq 0$ ; $\int f(x)dx = 1$	Continuous only; $f(x)$ is density, not probability
CDF	$F(x) = P(X \leq x)$	Works for both types; non-decreasing, $F(-\infty)=0$ , $F(\infty)=1$
Bernoulli	$p^x(1-p)^{1-x}$	Single binary trial; $\mu = p$ , $\sigma^2 = p(1-p)$
Binomial	$\binom{n}{k}p^k(1-p)^{n-k}$	$n$ independent trials; $\mu = np$ , $\sigma^2 = np(1-p)$
Poisson	$\frac{\lambda^k e^{-\lambda}}{k!}$	Rare events; $\mu = \sigma^2 = \lambda$
Uniform	$f(x) = \frac{1}{b-a}$	Equal likelihood; $\mu = \frac{a+b}{2}$ , $\sigma^2 = \frac{(b-a)^2}{12}$

Cross-References

Previous: Probability Fundamentals — Sample spaces, events, conditional probability, Bayes' theorem
Next: Expectation and Variance — Mean, variance, standard deviation, higher moments
Related: Linear Algebra — Vector spaces, matrix operations
Related: Information Theory — Entropy, KL divergence, mutual information
Applied: Loss Functions in ML — Cross-entropy, MSE, and their distributional origins
Applied: Bayesian Methods — Prior distributions, posterior inference, MCMC