Random Variables & Probability Distributions
ℹ️ Why It Matters
Every machine learning model makes predictions under uncertainty. A patient might have a disease or not. A stock price tomorrow could be higher or lower. A spam filter must decide if an email is junk. Random variables are the mathematical language that lets us quantify, model, and reason about this uncertainty. Without them, there is no probability theory, and without probability theory, there is no modern AI, statistics, or data science. Every loss function, every sampling strategy, every Bayesian model, and every generative AI system is built on the foundation of random variables and their distributions.
What is a Random Variable?
Definition: Random Variable
A random variable is a function that maps outcomes from a sample space to real numbers. Formally, $X: \Omega \to \mathbb{R}$. It assigns a numerical value to each possible outcome of a random experiment.
Random Variable
$$X: \Omega \to \mathbb{R}$$
Here,
- $X$ = The random variable
- $\Omega$ = The sample space (all possible outcomes)
- $\mathbb{R}$ = The set of real numbers
Simple Analogy: Think of a random variable like a score in a game. The game has many possible outcomes (roll of a die, draw of a card), but the random variable converts each outcome into a number you can work with, such as points, dollars, or centimeters.
Real-World Examples:
- Coin flip: $X = 1$ if heads, $X = 0$ if tails
- Die roll: $X$ = the number shown on the die (1 through 6)
- Height of a person: $X$ = height in centimeters (can be any value in a range)
- Number of customers: $X$ = count of customers arriving in an hour (0, 1, 2, ...)
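To make the "function from outcomes to numbers" idea concrete, here is a minimal sketch. The two-coin-flip experiment and the helper name `num_heads` are illustrative assumptions, not part of the text above.

```python
from fractions import Fraction

# Sample space for two fair coin flips; each outcome is assumed equally likely.
sample_space = ["HH", "HT", "TH", "TT"]

def num_heads(outcome: str) -> int:
    """Random variable X: maps an outcome to the number of heads in it."""
    return outcome.count("H")

# Probability of each value of X, accumulated over the outcomes that map to it.
distribution = {}
for outcome in sample_space:
    x = num_heads(outcome)
    distribution[x] = distribution.get(x, Fraction(0)) + Fraction(1, len(sample_space))

print(distribution)
```

The randomness lives in the experiment; `num_heads` itself is an ordinary deterministic function.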
Discrete vs. Continuous
Definition: Discrete Random Variable
A random variable $X$ is discrete if it takes on a countable number of distinct values. The set of possible values is finite or countably infinite.
Examples of discrete variables:
- Number of heads in 10 coin flips: $X \in \{0, 1, \dots, 10\}$
- Number of emails received per day: $X \in \{0, 1, 2, \dots\}$
- Customer rating: $X \in \{1, 2, 3, 4, 5\}$
Definition: Continuous Random Variable
A random variable $X$ is continuous if it can take on any value within a real interval. The set of possible values is uncountably infinite.
Examples of continuous variables:
- Temperature: any value in degrees Celsius
- Weight: any value in kilograms
- Time to complete a task: any value in seconds
💡 Key Distinction
For a discrete random variable, $P(X = x) > 0$ for specific values. For a continuous random variable, $P(X = x) = 0$ for any single point; probabilities are only meaningful over intervals, e.g., $P(a \le X \le b)$.
Probability Mass Function (PMF)
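Definition: Probability Mass Function (PMF)
For a discrete random variable $X$, the probability mass function $p(x)$ gives the probability that $X$ takes on the exact value $x$:
$$p(x) = P(X = x)$$
PMF Properties
$$p(x) \ge 0 \ \text{for all } x, \qquad \sum_{x} p(x) = 1$$
Here,
- $p(x) = P(X = x)$ = Probability that $X$ equals $x$
- $\sum_{x} p(x) = 1$ = All probabilities sum to 1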
Example: PMF of a Fair Die
For a fair six-sided die, $X$ = the outcome:
| $x$ | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| $p(x)$ | 1/6 | 1/6 | 1/6 | 1/6 | 1/6 | 1/6 |
- $\sum_{x=1}^{6} p(x) = 6 \times \frac{1}{6} = 1$ ✓
Example: PMF of an Unfair Coin
Suppose a biased coin has $P(\text{heads}) = 0.7$ and $P(\text{tails}) = 0.3$:
| $x$ | 0 (tails) | 1 (heads) |
|---|---|---|
| $p(x)$ | 0.3 | 0.7 |
- $0.3 + 0.7 = 1$ ✓
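The two PMF properties (non-negativity and normalization) can be checked mechanically. The sketch below assumes a PMF stored as a plain `{value: probability}` dictionary; `is_valid_pmf` and the `broken` example are hypothetical names introduced for illustration.

```python
import math

def is_valid_pmf(pmf: dict, tol: float = 1e-9) -> bool:
    """Check non-negativity and normalization of a PMF given as {value: probability}."""
    non_negative = all(prob >= 0 for prob in pmf.values())
    normalized = math.isclose(sum(pmf.values()), 1.0, abs_tol=tol)
    return non_negative and normalized

fair_die = {x: 1 / 6 for x in range(1, 7)}
biased_coin = {0: 0.3, 1: 0.7}
broken = {0: 0.5, 1: 0.6}  # sums to 1.1, so it is not a PMF

print(is_valid_pmf(fair_die), is_valid_pmf(biased_coin), is_valid_pmf(broken))
```

The tolerance is there because floating-point sums such as six copies of `1 / 6` rarely equal 1 exactly.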
Probability Density Function (PDF)
Definition: Probability Density Function (PDF)
For a continuous random variable $X$, the probability density function $f(x)$ describes the relative likelihood of $X$ taking on a value near $x$. The probability that $X$ falls in an interval $[a, b]$ is the area under $f(x)$ over that interval:
$$P(a \le X \le b) = \int_a^b f(x)\,dx$$
PDF Properties
$$f(x) \ge 0 \ \text{for all } x, \qquad \int_{-\infty}^{\infty} f(x)\,dx = 1$$
Here,
- $f(x)$ = The probability density at $x$
- $\int_{-\infty}^{\infty} f(x)\,dx = 1$ = Total area under the curve equals 1
⚠️ Important
Unlike a PMF, $f(x)$ is not a probability. It is a density. The value $f(x)$ can exceed 1; only probabilities (areas under the curve) must be between 0 and 1. For a continuous variable, $P(X = x) = 0$ for any specific point $x$.
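Example: PDF of a Uniform Distribution
For $X \sim \text{Uniform}(a, b)$:
$$f(x) = \begin{cases} \dfrac{1}{b - a} & a \le x \le b \\ 0 & \text{otherwise} \end{cases}$$
- The total area: $\int_a^b \frac{1}{b - a}\,dx = \frac{b - a}{b - a} = 1$ ✓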
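A quick numerical sketch shows a density value above 1 coexisting with a total area of 1. The interval $[0, 0.5]$ is chosen here purely for illustration, and `uniform_pdf` and `integrate` are hypothetical helpers (a simple midpoint rule, not a production integrator).

```python
def uniform_pdf(x: float, a: float, b: float) -> float:
    """Density of Uniform(a, b): 1/(b - a) inside [a, b], 0 outside."""
    return 1.0 / (b - a) if a <= x <= b else 0.0

def integrate(f, lo: float, hi: float, steps: int = 10_000) -> float:
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    width = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * width) for i in range(steps)) * width

a, b = 0.0, 0.5
density_at_point = uniform_pdf(0.25, a, b)  # a density value, not a probability
total_area = integrate(lambda x: uniform_pdf(x, a, b), a, b)
prob_interval = integrate(lambda x: uniform_pdf(x, a, b), 0.1, 0.2)

print(density_at_point, total_area, prob_interval)
```

The density at 0.25 is 2, yet the probability of any interval, such as $[0.1, 0.2]$, stays between 0 and 1.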
Cumulative Distribution Function (CDF)
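Definition: Cumulative Distribution Function (CDF)
The cumulative distribution function $F(x)$ gives the probability that the random variable $X$ takes on a value less than or equal to $x$:
CDF Definition
$$F(x) = P(X \le x)$$
Here,
- $F(x)$ = Cumulative probability up to $x$
- $P(X \le x)$ = Probability that $X$ is at most $x$
CDF for Continuous Variables
$$F(x) = \int_{-\infty}^{x} f(t)\,dt$$
Here,
- $F(x)$ = The CDF
- $f(t)$ = The probability density function
CDF for Discrete Variables
$$F(x) = \sum_{x_i \le x} p(x_i)$$
Here,
- $F(x)$ = The CDF
- $p(x_i)$ = PMF evaluated at $x_i$
Theorem: CDF Properties
- Non-decreasing: If $a \le b$, then $F(a) \le F(b)$
- Limits: $\lim_{x \to -\infty} F(x) = 0$ and $\lim_{x \to \infty} F(x) = 1$
- Right-continuous: $F(x)$ is always right-continuous
- Interval probability: $P(a < X \le b) = F(b) - F(a)$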
Example: CDF of a Fair Die
For $X$ = outcome of a fair die roll:
| $x$ | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| $F(x)$ | 1/6 | 2/6 | 3/6 | 4/6 | 5/6 | 1 |
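The discrete CDF is a running sum of the PMF, which the following sketch reproduces for the fair die using exact fractions. The variable names are illustrative.

```python
from fractions import Fraction
from itertools import accumulate

values = [1, 2, 3, 4, 5, 6]
pmf = [Fraction(1, 6)] * 6

# F(x_i) is the running sum of the PMF up to and including x_i.
cdf = dict(zip(values, accumulate(pmf)))

# Interval probability P(2 < X <= 5) = F(5) - F(2).
prob_3_to_5 = cdf[5] - cdf[2]

print(cdf)
print(prob_3_to_5)
```

Note that `Fraction` reduces automatically, so 3/6 prints as 1/2.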
💡 CDF vs PMF/PDF
The CDF works for both discrete and continuous random variables. It always exists and is always well-defined, making it a universal tool. The PMF only works for discrete variables; the PDF only works for continuous variables.
Bernoulli Distribution
The simplest distribution: a single yes/no trial.
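Definition: Bernoulli Distribution
A random variable $X$ follows a Bernoulli distribution with parameter $p$ if it takes value 1 with probability $p$ and value 0 with probability $1 - p$.
Bernoulli PMF
$$P(X = x) = p^x (1 - p)^{1 - x}, \quad x \in \{0, 1\}$$
Here,
- $p$ = Probability of success ($X = 1$)
- $1 - p$ = Probability of failure ($X = 0$)
- $x$ = Outcome: 0 or 1
Bernoulli Parameters
$$X \sim \text{Bernoulli}(p), \qquad E[X] = p, \qquad \text{Var}(X) = p(1 - p)$$
Here,
- $E[X] = p$ = Mean (expected value)
- $\text{Var}(X) = p(1 - p)$ = Variance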
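Example: Bernoulli Distribution
A drug has a 90% cure rate. Let $X = 1$ if cured, $X = 0$ if not.
- $X \sim \text{Bernoulli}(0.9)$, so $P(X = 1) = 0.9$, $P(X = 0) = 0.1$
- Mean: $E[X] = p = 0.9$
- Variance: $\text{Var}(X) = p(1 - p) = 0.9 \times 0.1 = 0.09$
- Standard deviation: $\sqrt{0.09} = 0.3$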
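The mean and variance above can also be recovered directly from the PMF, without relying on the closed-form formulas. This is a minimal sketch; `bernoulli_pmf` is a hypothetical helper name.

```python
import math

def bernoulli_pmf(x: int, p: float) -> float:
    """P(X = x) = p^x (1 - p)^(1 - x) for x in {0, 1}."""
    return p**x * (1 - p) ** (1 - x)

p = 0.9
# E[X] = sum of x * P(X = x); Var(X) = sum of (x - E[X])^2 * P(X = x).
mean = sum(x * bernoulli_pmf(x, p) for x in (0, 1))
variance = sum((x - mean) ** 2 * bernoulli_pmf(x, p) for x in (0, 1))
std_dev = math.sqrt(variance)

print(mean, variance, std_dev)
```

Up to floating-point rounding, the results agree with $p$, $p(1-p)$, and $\sqrt{p(1-p)}$.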
ℹ️ AI/ML Connection
The Bernoulli distribution models binary decisions: spam/not spam, click/no-click, cat/dog. It is the output distribution of binary classifiers and the foundation of logistic regression.
Binomial Distribution
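How many successes in $n$ independent Bernoulli trials?
Definition: Binomial Distribution
A random variable $X$ follows a binomial distribution if it counts the number of successes in $n$ independent Bernoulli trials, each with success probability $p$.
Binomial PMF
$$P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}, \quad k = 0, 1, \dots, n$$
Here,
- $\binom{n}{k} = \frac{n!}{k!\,(n - k)!}$ = Binomial coefficient: n choose k
- $n$ = Number of trials
- $p$ = Probability of success on each trial
- $k$ = Number of successes
Binomial Parameters
$$X \sim \text{Binomial}(n, p), \qquad E[X] = np, \qquad \text{Var}(X) = np(1 - p)$$
Here,
- $n$ = Number of trials
- $p$ = Success probability
- $E[X] = np$ = Mean
- $\text{Var}(X) = np(1 - p)$ = Variance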
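Example: Binomial Distribution
Flip a fair coin 10 times. Let $X$ = number of heads. Then $X \sim \text{Binomial}(10, 0.5)$.
- Mean: $E[X] = np = 10 \times 0.5 = 5$
- Variance: $\text{Var}(X) = np(1 - p) = 10 \times 0.5 \times 0.5 = 2.5$
- $P(X = 5) = \binom{10}{5} (0.5)^{10} = \frac{252}{1024} \approx 0.246$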
The most likely outcome is 5 heads (24.6% probability), which makes intuitive sense.
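The claim about the most likely outcome can be checked by evaluating the PMF at every $k$. The sketch below uses only the standard library; `binomial_pmf` is a hypothetical helper name.

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.5
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

mode = max(range(n + 1), key=lambda k: pmf[k])  # k with the largest probability
mean = sum(k * pmf[k] for k in range(n + 1))

print(mode, round(pmf[mode], 4), mean)
```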
💡 Relationship to Bernoulli
A Bernoulli distribution is a special case of the binomial distribution where $n = 1$. That is, $\text{Bernoulli}(p) = \text{Binomial}(1, p)$.
Poisson Distribution
Counting rare events over a fixed interval.
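Definition: Poisson Distribution
A random variable $X$ follows a Poisson distribution if it counts the number of events occurring in a fixed interval of time or space, given a constant average rate $\lambda$ and independent events.
Poisson PMF
$$P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad k = 0, 1, 2, \dots$$
Here,
- $\lambda$ = Average rate (expected number of events)
- $k$ = Number of events
- $e$ = Euler's number ≈ 2.71828
Poisson Parameters
$$X \sim \text{Poisson}(\lambda), \qquad E[X] = \lambda, \qquad \text{Var}(X) = \lambda$$
Here,
- $\lambda$ = Rate parameter
- $E[X] = \lambda$ = Mean equals lambda
- $\text{Var}(X) = \lambda$ = Variance also equals lambda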
Example: Poisson Distribution
A server receives an average of 4 requests per minute ($\lambda = 4$). What is the probability of getting exactly 6 requests in a minute?
$$P(X = 6) = \frac{4^6 e^{-4}}{6!} = \frac{4096 \times 0.01832}{720} \approx 0.1042$$
There is about a 10.4% chance of receiving exactly 6 requests.
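The same calculation can be sketched in a few lines of standard-library Python. `poisson_pmf` is a hypothetical helper, and the cumulative probability is an extra quantity computed here only for illustration.

```python
from math import exp, factorial

def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for X ~ Poisson(lam)."""
    return lam**k * exp(-lam) / factorial(k)

lam = 4.0
p_exactly_6 = poisson_pmf(6, lam)
# P(X <= 6) is the sum of the PMF over k = 0, ..., 6.
p_at_most_6 = sum(poisson_pmf(k, lam) for k in range(7))

print(round(p_exactly_6, 4), round(p_at_most_6, 4))
```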
ℹ️ Poisson Limit
When $n$ is large and $p$ is small, the $\text{Binomial}(n, p)$ distribution is well approximated by $\text{Poisson}(\lambda)$ where $\lambda = np$. This is why Poisson is used for rare events.
Uniform Distribution
Every outcome equally likely over an interval.
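Definition: Uniform Distribution
A random variable $X$ follows a continuous uniform distribution if every value in the interval $[a, b]$ is equally likely.
Uniform PDF
$$f(x) = \begin{cases} \dfrac{1}{b - a} & a \le x \le b \\ 0 & \text{otherwise} \end{cases}$$
Here,
- $a$ = Lower bound
- $b$ = Upper bound
- $b - a$ = Width of the interval
Uniform CDF
$$F(x) = \begin{cases} 0 & x < a \\ \dfrac{x - a}{b - a} & a \le x \le b \\ 1 & x > b \end{cases}$$
Here,
- $F(x)$ = Cumulative probability at $x$
Uniform Parameters
$$E[X] = \frac{a + b}{2}, \qquad \text{Var}(X) = \frac{(b - a)^2}{12}$$
Here,
- $E[X] = \frac{a + b}{2}$ = Mean (midpoint of the interval)
- $\text{Var}(X) = \frac{(b - a)^2}{12}$ = Variance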
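Example: Uniform Distribution
A random number generator produces values uniformly between 0 and 1: $X \sim \text{Uniform}(0, 1)$.
- PDF: $f(x) = 1$ for $0 \le x \le 1$
- Mean: $E[X] = \frac{0 + 1}{2} = 0.5$
- Variance: $\text{Var}(X) = \frac{(1 - 0)^2}{12} = \frac{1}{12} \approx 0.0833$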
ℹ️ Why Uniform Matters
The Uniform(0,1) distribution is the starting point for most random number generation. A continuous random variable with CDF $F$ can be generated from a uniform random number $U$ using the inverse transform method: $X = F^{-1}(U)$.
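As a hedged illustration of the inverse transform method, the sketch below draws exponential samples from uniform ones. The exponential distribution is not covered in this section; it is used here only because its CDF, $F(x) = 1 - e^{-\text{rate} \cdot x}$, inverts in closed form. `sample_exponential` is a hypothetical helper.

```python
import math
import random

def sample_exponential(rate: float, rng: random.Random) -> float:
    """Inverse transform: solve F(x) = 1 - exp(-rate * x) = u for x."""
    u = rng.random()  # u ~ Uniform[0, 1)
    return -math.log(1.0 - u) / rate

rng = random.Random(0)  # fixed seed so the run is reproducible
rate = 2.0
samples = [sample_exponential(rate, rng) for _ in range(100_000)]
sample_mean = sum(samples) / len(samples)  # should be close to 1 / rate

print(round(sample_mean, 3))
```

Using `1.0 - u` rather than `u` keeps the argument of the logarithm strictly positive, since `rng.random()` can return 0 but never 1.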
Python Implementation
```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# --- Bernoulli Distribution ---
# Simulate 10000 coin flips with p=0.7
bernoulli_rv = stats.bernoulli(p=0.7)
samples = bernoulli_rv.rvs(size=10000)
print(f"Bernoulli Mean: {samples.mean():.3f}")  # ~0.70
print(f"Bernoulli Var: {samples.var():.3f}")  # ~0.21
print(f"P(X=1): {bernoulli_rv.pmf(1):.3f}")  # 0.700
print(f"P(X<=0): {bernoulli_rv.cdf(0):.3f}")  # 0.300

# --- Binomial Distribution ---
# 10 trials, p=0.3, simulate 10000 experiments
binom_rv = stats.binom(n=10, p=0.3)
samples = binom_rv.rvs(size=10000)
print(f"Binomial Mean: {samples.mean():.3f}")  # ~3.0
print(f"Binomial Var: {samples.var():.3f}")  # ~2.1
print(f"P(X=5): {binom_rv.pmf(5):.4f}")  # ~0.1029
print(f"P(X<=3): {binom_rv.cdf(3):.4f}")  # ~0.6496

# --- Poisson Distribution ---
# Average 4 events per interval
poisson_rv = stats.poisson(mu=4)
samples = poisson_rv.rvs(size=10000)
print(f"Poisson Mean: {samples.mean():.3f}")  # ~4.0
print(f"Poisson Var: {samples.var():.3f}")  # ~4.0
print(f"P(X=6): {poisson_rv.pmf(6):.4f}")  # ~0.1042

# --- Uniform Distribution ---
# Continuous uniform on [0, 1]
uniform_rv = stats.uniform(loc=0, scale=1)
samples = uniform_rv.rvs(size=10000)
print(f"Uniform Mean: {samples.mean():.3f}")  # ~0.50
print(f"Uniform Var: {samples.var():.4f}")  # ~0.0833
print(f"P(0.25<=X<=0.75): {uniform_rv.cdf(0.75) - uniform_rv.cdf(0.25):.3f}")  # 0.500

# --- Visualization ---
fig, axes = plt.subplots(2, 2, figsize=(12, 8))

# Bernoulli
axes[0, 0].bar([0, 1], [0.3, 0.7], color=['steelblue', 'coral'])
axes[0, 0].set_title('Bernoulli(p=0.7)')
axes[0, 0].set_xlabel('x')
axes[0, 0].set_ylabel('P(X=x)')

# Binomial
x_binom = np.arange(0, 11)
axes[0, 1].bar(x_binom, binom_rv.pmf(x_binom), color='steelblue')
axes[0, 1].set_title('Binomial(n=10, p=0.3)')
axes[0, 1].set_xlabel('k')
axes[0, 1].set_ylabel('P(X=k)')

# Poisson
x_poisson = np.arange(0, 12)
axes[1, 0].bar(x_poisson, poisson_rv.pmf(x_poisson), color='coral')
axes[1, 0].set_title('Poisson(λ=4)')
axes[1, 0].set_xlabel('k')
axes[1, 0].set_ylabel('P(X=k)')

# Uniform
x_uniform = np.linspace(-0.2, 1.2, 1000)
axes[1, 1].fill_between(x_uniform, uniform_rv.pdf(x_uniform), alpha=0.3, color='steelblue')
axes[1, 1].plot(x_uniform, uniform_rv.pdf(x_uniform), color='steelblue')
axes[1, 1].set_title('Uniform(0, 1)')
axes[1, 1].set_xlabel('x')
axes[1, 1].set_ylabel('f(x)')

plt.tight_layout()
plt.savefig('distributions.png', dpi=150)
plt.show()
```
Applications in AI/ML
ℹ️ Why Distributions Matter in ML
Probability distributions are not just theory; they are the engine behind much of modern machine learning. Here are the most important applications.
Loss Functions Derived from Distributions
Many common loss functions in ML are negative log-likelihoods of probability distributions:
| Loss Function | Distribution | Use Case |
|---|---|---|
| Binary Cross-Entropy | Bernoulli | Binary classification |
| Categorical Cross-Entropy | Categorical | Multi-class classification |
| MSE (Mean Squared Error) | Gaussian | Regression |
| Poisson Loss | Poisson | Count prediction |
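Example: Cross-Entropy Loss
For a binary classifier predicting $\hat{y} = P(y = 1 \mid x)$, the cross-entropy loss for a single example with true label $y \in \{0, 1\}$ is:
$$L(y, \hat{y}) = -\left[\, y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \,\right]$$
This is the negative log-likelihood of a Bernoulli distribution with parameter $\hat{y}$.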
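The equivalence between cross-entropy and the Bernoulli negative log-likelihood can be sketched numerically. The function names and the clipping constant `eps` are illustrative assumptions, not taken from any particular library.

```python
import math

def binary_cross_entropy(y_true: int, y_pred: float, eps: float = 1e-12) -> float:
    """Cross-entropy for one example; y_pred is clipped to avoid log(0)."""
    y_pred = min(max(y_pred, eps), 1.0 - eps)
    return -(y_true * math.log(y_pred) + (1 - y_true) * math.log(1.0 - y_pred))

def bernoulli_nll(y: int, p: float) -> float:
    """Negative log of the Bernoulli PMF p^y (1 - p)^(1 - y)."""
    return -math.log(p**y * (1 - p) ** (1 - y))

confident_right = binary_cross_entropy(1, 0.9)
confident_wrong = binary_cross_entropy(1, 0.1)

print(confident_right, bernoulli_nll(1, 0.9))
print(confident_wrong)
```

A confident wrong prediction is penalized far more heavily than a confident right one, which is the behavior the log-likelihood view predicts.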
Sampling and Data Augmentation
- Monte Carlo methods: Draw samples from distributions to estimate integrals and expectations
- Reparameterization trick: Used in VAEs (Variational Autoencoders) to backpropagate through random sampling
- Data augmentation: Add noise sampled from known distributions to training data
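As a small sketch of the Monte Carlo idea from the list above, the following estimates $E[X^2]$ for $X \sim \text{Uniform}(0, 1)$, whose exact value is $\int_0^1 x^2\,dx = 1/3$. The target quantity, seed, and sample size are arbitrary choices for illustration.

```python
import random

rng = random.Random(42)  # fixed seed so the run is reproducible
n = 200_000

# Average of g(X) = X^2 over samples approximates the integral of x^2 on [0, 1].
estimate = sum(rng.random() ** 2 for _ in range(n)) / n
exact = 1.0 / 3.0

print(round(estimate, 4), round(exact, 4))
```

The error of such an estimate shrinks roughly like $1/\sqrt{n}$, so halving the error requires about four times as many samples.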
Generative Models
- Gaussian Mixture Models (GMM): Model data as a mixture of Gaussians
- Naive Bayes: Assume features follow specific distributions (Gaussian, Bernoulli, Multinomial)
- Normalizing Flows: Transform simple distributions (Uniform, Gaussian) into complex ones
💡 The Big Picture
Choosing the right distribution for your data is one of the most important modeling decisions. If your data are counts, use Poisson or Negative Binomial. If they are binary, use Bernoulli. If they are continuous and symmetric, consider Gaussian. Mis-specifying the distribution leads to poor models and misleading conclusions.
Common Mistakes
| Mistake | Why It's Wrong | Correct Approach |
|---|---|---|
| Saying $P(X = x) > 0$ for a continuous variable | For continuous RVs, the probability at a single point is always 0 | Use intervals: $P(a \le X \le b)$ |
| Treating PDF values as probabilities | $f(x)$ is a density, not a probability; it can exceed 1 | Probabilities are areas under the curve: $\int_a^b f(x)\,dx$ |
| Using PMF for continuous variables | PMFs are only defined for discrete variables | Use PDF for continuous, PMF for discrete |
| Forgetting $\sum_x p(x) = 1$ or $\int_{-\infty}^{\infty} f(x)\,dx = 1$ | If these don't hold, it's not a valid distribution | Always verify normalization |
| Confusing $\mu$ and $\bar{x}$ | $\mu$ is the population mean (parameter); $\bar{x}$ is the sample mean (statistic) | $\mu$ is fixed; $\bar{x}$ varies by sample |
| Assuming independence when it's not given | Independence is a strong assumption that must be justified | Check the problem statement carefully |
| Using Binomial when trials are not independent | Binomial requires independent trials | Use Hypergeometric for sampling without replacement |
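The last row of the table can be illustrated with a small sketch. The scenario (a batch of 20 items with 5 defective, 4 drawn without replacement) is a made-up example, and both helper functions are hypothetical.

```python
from math import comb

def hypergeom_pmf(k: int, population: int, successes: int, draws: int) -> float:
    """P(k successes) when drawing without replacement."""
    return comb(successes, k) * comb(population - successes, draws - k) / comb(population, draws)

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(k successes) in n independent trials with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Assumed scenario: 20 items, 5 defective, draw 4, ask for exactly 2 defective.
exact = hypergeom_pmf(2, population=20, successes=5, draws=4)
binomial_guess = binomial_pmf(2, n=4, p=5 / 20)

print(round(exact, 4), round(binomial_guess, 4))
```

The two answers differ because each draw without replacement changes the composition of what remains; the gap narrows as the population grows relative to the number of draws.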
Interview Questions
Q1: What is the difference between a PMF and a PDF?
Answer: A PMF (Probability Mass Function) applies to discrete random variables and gives the probability at each point: $p(x) = P(X = x)$. A PDF (Probability Density Function) applies to continuous random variables and gives the density at each point. For continuous variables, $P(X = x) = 0$ for any single point; probabilities are only defined over intervals using the PDF: $P(a \le X \le b) = \int_a^b f(x)\,dx$. The key difference is that PMF values are actual probabilities (between 0 and 1), while PDF values are densities (can exceed 1).
Q2: Why is the variance of a Poisson distribution equal to its mean?
Answer: This is a mathematical property of the Poisson distribution that arises from its derivation. The Poisson distribution models the limit of the Binomial distribution as $n \to \infty$ and $p \to 0$ with $np = \lambda$ fixed. For Binomial, the variance is $np(1 - p)$. As $p \to 0$, $(1 - p) \to 1$, so the variance approaches $np = \lambda$, which is also the mean. This equality ($E[X] = \text{Var}(X) = \lambda$) is a defining characteristic of the Poisson distribution and is used as a diagnostic: if the sample mean and variance are approximately equal, the data may follow a Poisson distribution.
Q3: How would you choose between Bernoulli, Binomial, and Poisson distributions?
Answer:
- Bernoulli: Single binary trial (yes/no, success/failure). Example: "Will this customer buy?"
- Binomial: Fixed number of independent binary trials. Example: "How many of 10 customers will buy?"
- Poisson: Count of events in a continuous interval (time/space) at a constant rate. Example: "How many customers arrive per hour?"
The key distinctions: Bernoulli is for one trial, Binomial is for a fixed number of trials, and Poisson is for events over a continuous interval. Use Binomial when you know $n$; use Poisson when $n$ is effectively infinite and $p$ is small.
Q4: What is the CDF and why is it useful?
Answer: The CDF (Cumulative Distribution Function) is defined as $F(x) = P(X \le x)$. It gives the probability that a random variable takes a value less than or equal to $x$. The CDF is useful because: (1) It works for both discrete and continuous random variables, unlike PMF/PDF which are type-specific. (2) It always exists and is well-defined. (3) You can compute interval probabilities: $P(a < X \le b) = F(b) - F(a)$. (4) In hypothesis testing, p-values are computed using CDFs. (5) For continuous variables, $f(x) = F'(x)$, connecting CDF to PDF.
Q5: A website gets 3 visits per minute on average. What is the probability of getting exactly 5 visits in the next minute? Model this and compute.
Answer: This is a Poisson process with $\lambda = 3$.
$$P(X = 5) = \frac{3^5 e^{-3}}{5!} = \frac{243 \times 0.04979}{120} \approx 0.1008$$
There is approximately a 10.1% chance of getting exactly 5 visits in the next minute. This can be computed in Python using stats.poisson.pmf(5, mu=3).
Q6: Why can't we use a Gaussian distribution to model the number of heads in 10 coin flips?
Answer: The number of heads in 10 flips follows $\text{Binomial}(10, 0.5)$, which is a discrete distribution with support $\{0, 1, \dots, 10\}$. The Gaussian is a continuous distribution defined on $(-\infty, \infty)$. Using a Gaussian would imply that 2.5 heads is possible, which is nonsensical. Additionally, the Gaussian allows negative values (impossible for counts) and has nonzero probability density at non-integer values. While the Gaussian can approximate the Binomial for large $n$ (by the Central Limit Theorem), it is not the correct model for small $n$ with discrete outcomes.
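To see how rough the Gaussian approximation is at $n = 10$, the sketch below compares the exact $P(X \le 3)$ with two normal approximations, one of which uses a continuity correction (evaluating the normal CDF at 3.5 rather than 3). The correction is a standard technique brought in here as an assumption; the answer above does not mention it. `normal_cdf` is a hypothetical helper.

```python
from math import comb, erf, sqrt

def normal_cdf(x: float, mean: float, std: float) -> float:
    """CDF of a Gaussian with the given mean and standard deviation."""
    return 0.5 * (1.0 + erf((x - mean) / (std * sqrt(2.0))))

n, p = 10, 0.5
mean, std = n * p, sqrt(n * p * (1 - p))

# Exact P(X <= 3) from the Binomial PMF.
exact = sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(4))
approx_plain = normal_cdf(3.0, mean, std)      # ignores discreteness
approx_corrected = normal_cdf(3.5, mean, std)  # continuity correction

print(round(exact, 4), round(approx_plain, 4), round(approx_corrected, 4))
```

The uncorrected approximation is noticeably off, which is one practical consequence of treating a discrete count as continuous.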
Practice Problems
Problem 1: Binomial Calculation
A quality control test has a 95% detection rate. If 20 items pass through the test, what is the probability that exactly 19 are correctly detected? Assume independence.
💡 Solution
This is a Binomial distribution with $n = 20$, $p = 0.95$, and $k = 19$:
$$P(X = 19) = \binom{20}{19} (0.95)^{19} (0.05)^{1} = 20 \times 0.3774 \times 0.05 \approx 0.3774$$
There is approximately a 37.7% chance that exactly 19 out of 20 items are correctly detected.
Problem 2: Poisson Process
A call center receives an average of 8 calls per 10-minute interval. What is the probability of receiving exactly 3 calls in a 10-minute interval? What is the probability of receiving 0 calls?
💡 Solution
With $\lambda = 8$:
Exactly 3 calls: $P(X = 3) = \frac{8^3 e^{-8}}{3!} = \frac{512 \times 0.0003355}{6} \approx 0.0286$
0 calls: $P(X = 0) = \frac{8^0 e^{-8}}{0!} = e^{-8} \approx 0.00034$
There is a 2.86% chance of exactly 3 calls and a 0.034% chance of no calls. The call center is very unlikely to be idle.
Problem 3: Uniform Distribution
A bus arrives uniformly between 0 and 20 minutes. You arrive at a random time. What is the probability that you wait less than 5 minutes for the bus?
💡 Solution
Let $X$ = time until the bus arrives, measured from the moment you reach the stop, so $X \sim \text{Uniform}(0, 20)$. You wait less than 5 minutes exactly when the bus arrives in the interval $[0, 5)$.
Since the bus arrival is uniform, the probability you wait less than 5 minutes is:
$$P(X < 5) = F(5) = \frac{5 - 0}{20 - 0} = 0.25$$
This is because the bus arrival time is uniform, and any 5-minute window inside the 20-minute interval has probability $\frac{5}{20} = \frac{1}{4}$.
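Problem 4: Connecting Distributions
Show that as $n \to \infty$ with $p = \lambda / n$, the $\text{Binomial}(n, p)$ PMF approaches the $\text{Poisson}(\lambda)$ PMF.
💡 Solution
Starting with the Binomial PMF:
$$P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}$$
Substitute $p = \lambda / n$:
$$P(X = k) = \frac{n!}{k!\,(n - k)!} \left(\frac{\lambda}{n}\right)^k \left(1 - \frac{\lambda}{n}\right)^{n - k}$$
Expand:
$$P(X = k) = \frac{\lambda^k}{k!} \cdot \frac{n(n - 1)\cdots(n - k + 1)}{n^k} \cdot \left(1 - \frac{\lambda}{n}\right)^{n} \cdot \left(1 - \frac{\lambda}{n}\right)^{-k}$$
As $n \to \infty$:
- $\frac{n(n - 1)\cdots(n - k + 1)}{n^k} \to 1$ (the leading terms cancel)
- $\left(1 - \frac{\lambda}{n}\right)^{n} \to e^{-\lambda}$ (fundamental limit)
- $\left(1 - \frac{\lambda}{n}\right)^{-k} \to 1$ (since $k$ is fixed)
Therefore:
$$\lim_{n \to \infty} P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}$$
This is the Poisson PMF. This is the Poisson Limit Theorem and explains why Poisson is used for rare events with large $n$ and small $p$.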
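The limit can also be observed numerically. The sketch below fixes $\lambda = 4$ and $k = 6$ (arbitrary illustrative values) and tracks the gap between the two PMFs as $n$ grows; both helper functions are hypothetical.

```python
from math import comb, exp, factorial

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for X ~ Poisson(lam)."""
    return lam**k * exp(-lam) / factorial(k)

lam, k = 4.0, 6
errors = {}
for n in (10, 100, 1_000, 10_000):
    # Keep n * p = lam fixed while n grows, as in the derivation.
    errors[n] = abs(binomial_pmf(k, n, lam / n) - poisson_pmf(k, lam))

for n, err in errors.items():
    print(n, err)
```

This is a sanity check rather than a proof: it shows the gap shrinking for one choice of $\lambda$ and $k$.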
Quick Reference
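Key Takeaways
| Concept | Formula | Notes |
|---|---|---|
| Random Variable | $X: \Omega \to \mathbb{R}$ | Maps outcomes to numbers |
| PMF | $p(x) = P(X = x)$ | Discrete only; $\sum_x p(x) = 1$ |
| PDF | $P(a \le X \le b) = \int_a^b f(x)\,dx$; $\int_{-\infty}^{\infty} f(x)\,dx = 1$ | Continuous only; $f(x)$ is density, not probability |
| CDF | $F(x) = P(X \le x)$ | Works for both types; non-decreasing, $F(-\infty) = 0$, $F(\infty) = 1$ |
| Bernoulli | $P(X = x) = p^x (1 - p)^{1 - x}$ | Single binary trial; $E[X] = p$, $\text{Var}(X) = p(1 - p)$ |
| Binomial | $P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}$ | $n$ independent trials; $E[X] = np$, $\text{Var}(X) = np(1 - p)$ |
| Poisson | $P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}$ | Rare events; $E[X] = \text{Var}(X) = \lambda$ |
| Uniform | $f(x) = \frac{1}{b - a}$ for $a \le x \le b$ | Equal likelihood; $E[X] = \frac{a + b}{2}$, $\text{Var}(X) = \frac{(b - a)^2}{12}$ |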
Cross-References
- Previous: Probability Fundamentals (sample spaces, events, conditional probability, Bayes' theorem)
- Next: Expectation and Variance (mean, variance, standard deviation, higher moments)
- Related: Linear Algebra (vector spaces, matrix operations)
- Related: Information Theory (entropy, KL divergence, mutual information)
- Applied: Loss Functions in ML (cross-entropy, MSE, and their distributional origins)
- Applied: Bayesian Methods (prior distributions, posterior inference, MCMC)