Probability — The Math of Uncertainty

ℹ️ Why It Matters

AI makes decisions under uncertainty. "Is this email spam?" "Is this tumor cancerous?" "What word comes next?" Probability is the math that handles uncertainty.

What is Probability?

Probability measures how likely something is to happen. It ranges from 0 (impossible) to 1 (certain).

Probability Definition

$P(\text{event}) = \frac{\text{number of favorable outcomes}}{\text{total number of outcomes}}$

Here,

$P(\text{event})$ = Probability of the event occurring

This formula assumes every outcome in the sample space is equally likely.

📝Example: Rolling a Die

$P(6) = \frac{1}{6}$
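The counting definition above can be sketched in a few lines of Python. This is a minimal illustration that assumes equally likely outcomes; the helper name `classical_probability` is made up for this example.

```python
from fractions import Fraction

def classical_probability(event, sample_space):
    """Favorable outcomes / total outcomes, assuming equally likely outcomes."""
    favorable = event & sample_space  # outcomes of the event that can actually occur
    return Fraction(len(favorable), len(sample_space))

die = {1, 2, 3, 4, 5, 6}  # sample space of one fair die
p_six = classical_probability({6}, die)
p_even = classical_probability({2, 4, 6}, die)
print(p_six, p_even)  # 1/6 1/2
```

Using `Fraction` keeps the results exact, which makes them easy to compare with the hand calculation.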

Key Terminology

Term	Meaning	Example
Experiment	An action with uncertain outcome	Flipping a coin
Sample Space (S)	All possible outcomes	{Heads, Tails}
Event	A subset of outcomes	{Heads}
Favorable outcomes	Outcomes we want	Getting Heads
P(event)	Probability of event	P(Heads) = 0.5

Fundamental Rules

Addition Rule

$P(A \cup B) = P(A) + P(B) - P(A \cap B)$

Here,

$P(A \cup B)$ = Probability of A or B occurring
$P(A \cap B)$ = Probability of both A and B occurring

📝Example: Addition Rule

$P(\text{rolling 1 or 6}) = \frac{1}{6} + \frac{1}{6} - 0 = \frac{2}{6} = \frac{1}{3}$

The intersection term is 0 because a single roll cannot show both 1 and 6.

ℹ️ Mutually Exclusive Events

If A and B are mutually exclusive (can't happen together):

$P(A \cup B) = P(A) + P(B)$
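To see why the overlap is subtracted, the addition rule can be checked by enumerating die outcomes as Python sets. The events `even` and `high` are hypothetical choices for this sketch, picked because they share one outcome.

```python
from fractions import Fraction

die = set(range(1, 7))  # one fair six-sided die

def prob(event):
    """Probability of an event (a set of faces), assuming a fair die."""
    return Fraction(len(event & die), len(die))

even = {2, 4, 6}
high = {5, 6}  # overlaps with `even` at 6

lhs = prob(even | high)                            # P(A or B), counted directly
rhs = prob(even) + prob(high) - prob(even & high)  # addition rule
print(lhs, rhs)  # 2/3 2/3

# Mutually exclusive events: the intersection is empty, so nothing is subtracted
one, six = {1}, {6}
print(prob(one | six), prob(one) + prob(six))  # 1/3 1/3
```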

Multiplication Rule

$P(A \cap B) = P(A) \times P(B|A)$

Here,

$P(B|A)$ = Conditional probability of B given A

📝Example: Multiplication Rule

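$P(\text{Head then Tail}) = P(\text{Head}) \times P(\text{Tail}|\text{Head}) = P(\text{Head}) \times P(\text{Tail}) = 0.5 \times 0.5 = 0.25$

The two flips are independent, so $P(\text{Tail}|\text{Head}) = P(\text{Tail})$.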

ℹ️ Independent Events

If A and B are independent:

$P(A \cap B) = P(A) \times P(B)$
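The coin example only exercises the independent case. As a hedged sketch of the general rule $P(A \cap B) = P(A) \times P(B|A)$, consider a hypothetical card example (not from the text above): drawing two aces in a row without replacement, where the second draw depends on the first.

```python
from fractions import Fraction
from itertools import permutations

# Hypothetical deck: 4 aces ("A") and 48 other cards ("x")
deck = ["A"] * 4 + ["x"] * 48

# Every ordered pair of distinct cards is equally likely
draws = list(permutations(range(52), 2))
both_aces = sum(1 for i, j in draws if deck[i] == "A" and deck[j] == "A")
p_enumerated = Fraction(both_aces, len(draws))

# Multiplication rule: P(first ace) * P(second ace | first ace)
p_rule = Fraction(4, 52) * Fraction(3, 51)
print(p_enumerated, p_rule)  # 1/221 1/221
```

If the first card were put back before the second draw, the draws would be independent and the second factor would be 4/52 again.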

Conditional Probability

P(A|B) = Probability of A given that B has happened.

Conditional Probability

$P(A|B) = \frac{P(A \cap B)}{P(B)}, \quad P(B) > 0$

Here,

$P(A|B)$ = Probability of A given B has occurred

📝Example: Conditional Probability

$P(\text{Rain} | \text{Cloudy}) = \frac{P(\text{Rain and Cloudy})}{P(\text{Cloudy})}$

Analogy: You know it's cloudy (B happened). Given that information, what's the chance it rains (A)?
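The rain example has no numbers attached, so here is a small sketch with made-up counts. The figures (100 days, 40 cloudy, 30 both rainy and cloudy) are assumptions for illustration only.

```python
from fractions import Fraction

# Hypothetical record of 100 days (made-up counts)
days = 100
cloudy = 40
rain_and_cloudy = 30

p_cloudy = Fraction(cloudy, days)
p_rain_and_cloudy = Fraction(rain_and_cloudy, days)

# P(Rain | Cloudy) = P(Rain and Cloudy) / P(Cloudy)
p_rain_given_cloudy = p_rain_and_cloudy / p_cloudy
print(p_rain_given_cloudy)  # 3/4
```

Conditioning on "cloudy" shrinks the sample space from all 100 days to the 40 cloudy ones, and 30 of those had rain.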

Bayes' Theorem — The Crown Jewel of Probability

Bayes' Theorem

$P(A|B) = \frac{P(B|A) \times P(A)}{P(B)}$

Here,

$P(A|B)$ = Posterior probability
$P(B|A)$ = Likelihood
$P(A)$ = Prior probability
$P(B)$ = Evidence

In plain English:

ℹ️ Bayes' Theorem

$\text{Posterior} = \frac{\text{Likelihood} \times \text{Prior}}{\text{Evidence}}$

📝Example: Medical Test

Disease affects 1% of people: $P(\text{Disease}) = 0.01$
Test detects 99% of true cases (sensitivity): $P(\text{Positive}|\text{Disease}) = 0.99$
False positive rate: $P(\text{Positive}|\text{No Disease}) = 0.05$

You test positive. What's the probability you have the disease?

$P(\text{Disease}|\text{Positive}) = \frac{P(\text{Positive}|\text{Disease}) \times P(\text{Disease})}{P(\text{Positive})}$

$P(\text{Positive}) = P(\text{Positive}|\text{Disease}) \times P(\text{Disease}) + P(\text{Positive}|\text{No Disease}) \times P(\text{No Disease})$

$= 0.99 \times 0.01 + 0.05 \times 0.99 = 0.0099 + 0.0495 = 0.0594$

$P(\text{Disease}|\text{Positive}) = \frac{0.0099}{0.0594} \approx 0.1667 = 16.7\%$

⚠️ Surprise

Even with a positive test, you only have a 16.7% chance of having the disease! This is why base rates matter.
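The medical test calculation can be sketched as a small function, which makes it easy to see how the answer moves with the base rate. The function and parameter names are made up for this example, and the second call uses a hypothetical 50% prior that is not part of the example above.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(Disease | Positive) via Bayes' theorem."""
    # Evidence: total probability of a positive test
    evidence = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / evidence

p_rare = posterior(prior=0.01, sensitivity=0.99, false_positive_rate=0.05)
print(round(p_rare, 4))  # 0.1667

# Same test, hypothetical population where half the people have the disease
p_common = posterior(prior=0.5, sensitivity=0.99, false_positive_rate=0.05)
print(round(p_common, 3))  # 0.952
```

The test itself is unchanged between the two calls; only the prior differs.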

Applications in AI:

Naive Bayes classifier: Text classification, spam filtering
Bayesian networks: Causal reasoning
Bayesian optimization: Hyperparameter tuning
Probabilistic programming: Stan, PyMC, Edward

Random Variables

Random Variable

A random variable assigns a number to each outcome of a random experiment, so its value is determined by chance.

Discrete Random Variable: Can take specific values (countable)

Number of emails per day: {0, 1, 2, 3, ...}
Coin flips: {0, 1} (0=tails, 1=heads)

Continuous Random Variable: Can take any value in a range

Height: any value between 0 and 3 meters
Temperature: any real number

Probability Distributions

Discrete Distributions

Bernoulli Distribution: Single coin flip

Bernoulli Distribution

$P(X=1) = p, \quad P(X=0) = 1-p$

Here,

$p$ = Probability of success

Mean: $p$
Variance: $p(1-p)$
Used in: Binary classification output

Binomial Distribution: Number of successes in n trials

Binomial Distribution

$P(X=k) = \binom{n}{k} p^k (1-p)^{n-k}$

Here,

$n$ = Number of trials
$k$ = Number of successes
$p$ = Probability of success

Mean: $np$
Variance: $np(1-p)$

📝Example: Binomial Distribution

In 10 coin flips, P(exactly 5 heads):

$P(X=5) = \binom{10}{5} \times 0.5^5 \times 0.5^5 = \frac{252}{1024} \approx 0.246$
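The binomial formula translates directly into code with the standard library's `math.comb`. This sketch reproduces the coin example and checks that the probabilities sum to 1 and that the mean matches $np$; `binomial_pmf` is a name chosen for this example.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for a Binomial(n, p) random variable."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

p_five = binomial_pmf(5, 10, 0.5)
print(round(p_five, 3))  # 0.246

# Sanity checks: probabilities sum to 1, and the mean equals n * p
total = sum(binomial_pmf(k, 10, 0.5) for k in range(11))
mean = sum(k * binomial_pmf(k, 10, 0.5) for k in range(11))
print(total, mean)
```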

Poisson Distribution: Number of events in a fixed time/area

Poisson Distribution

$P(X=k) = \frac{\lambda^k e^{-\lambda}}{k!}$

Here,

$\lambda$ = Average rate of events

Mean: $\lambda$
Variance: $\lambda$

📝Example: Poisson Distribution

If you receive 3 emails/hour on average:

$P(5 \text{ emails in an hour}) = \frac{3^5 \times e^{-3}}{5!} \approx 0.1008$
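The email example can be reproduced with the standard library alone. This is a minimal sketch; `poisson_pmf` is a name chosen for this example.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson random variable with rate lam."""
    return lam**k * exp(-lam) / factorial(k)

p_five = poisson_pmf(5, 3)  # 5 emails in an hour at an average of 3 per hour
print(round(p_five, 4))  # 0.1008

# Probability of at most 5 emails: sum the PMF over k = 0..5
p_at_most_five = sum(poisson_pmf(k, 3) for k in range(6))
```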

Geometric Distribution: Number of trials until first success

Geometric Distribution

$P(X=k) = (1-p)^{k-1} \times p \quad \text{for } k = 1, 2, 3, \ldots$

Here,

$p$ = Probability of success
$k$ = Trial on which the first success occurs

Mean: $1/p$
Variance: $(1-p)/p^2$
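As a rough numerical check of the geometric mean and variance formulas, the sums can be truncated at a large $k$. The value $p = 0.2$ and the cutoff of 1000 terms are arbitrary choices for this sketch.

```python
def geometric_pmf(k, p):
    """P(first success happens on trial k), for k = 1, 2, 3, ..."""
    return (1 - p) ** (k - 1) * p

p = 0.2              # hypothetical success probability
ks = range(1, 1001)  # the tail beyond 1000 trials is negligible for p = 0.2

total = sum(geometric_pmf(k, p) for k in ks)
mean = sum(k * geometric_pmf(k, p) for k in ks)
variance = sum(k**2 * geometric_pmf(k, p) for k in ks) - mean**2

print(round(total, 6), round(mean, 6), round(variance, 6))
# Compare with the formulas: 1/p = 5 and (1-p)/p^2 = 20
```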

Continuous Distributions

Uniform Distribution: Every value equally likely

Uniform Distribution

$f(x) = \frac{1}{b-a} \quad \text{for } a \leq x \leq b$

Here,

$a$ = Lower bound
$b$ = Upper bound

Mean: $(a+b)/2$
Variance: $(b-a)^2/12$
Used in: Random initialization, Monte Carlo methods

Normal (Gaussian) Distribution — THE Most Important Distribution

Normal Distribution

$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$

Here,

$\mu$ = Mean (center)
$\sigma$ = Standard deviation (spread)
$\sigma^2$ = Variance

ℹ️ Properties of Normal Distribution

Bell-shaped curve
68% of data within $\mu \pm \sigma$
95% within $\mu \pm 2\sigma$
99.7% within $\mu \pm 3\sigma$ (the "3-sigma rule")
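The 68-95-99.7 figures can be checked with `statistics.NormalDist` from the Python standard library (available since Python 3.8). The helper name `coverage` is made up for this sketch.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mu = 0, sigma = 1

def coverage(k):
    """Probability that a normal value lies within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(coverage(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```

The same proportions hold for any $\mu$ and $\sigma$, because the interval is measured in units of $\sigma$.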

Why Normal Distribution is EVERYWHERE:

Central Limit Theorem (see below)
Heights, weights, test scores are approximately normal
Noise in measurements is typically Gaussian
Many ML models assume Gaussian noise (for example, linear regression trained with squared-error loss)

Standard Normal Distribution: $\mu=0$, $\sigma=1$

Standardization

$Z = \frac{X - \mu}{\sigma}$

Here,

$Z$ = Standardized value
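A short sketch of standardization in practice, using a hypothetical exam with mean 70 and standard deviation 10 (numbers invented for illustration). It shows that a probability computed on the original scale matches the one computed on the standard normal after converting to a z-score.

```python
from statistics import NormalDist

mu, sigma = 70.0, 10.0  # hypothetical exam scores
x = 85.0

z = (x - mu) / sigma  # how many standard deviations x lies above the mean
print(z)  # 1.5

p_raw = NormalDist(mu, sigma).cdf(x)  # P(X <= 85) on the original scale
p_std = NormalDist().cdf(z)           # P(Z <= 1.5) on the standard normal
print(round(p_raw, 4), round(p_std, 4))
```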

Exponential Distribution: Time between events

Exponential Distribution

$f(x) = \lambda e^{-\lambda x} \quad \text{for } x \geq 0$

Here,

$\lambda$ = Rate parameter

Mean: $1/\lambda$
Variance: $1/\lambda^2$
Used in: Modeling waiting times, survival analysis

Expected Value and Variance

Expected Value (Mean): The "average" outcome if you repeat the experiment many times.

Expected Value

$E[X] = \sum_i x_i \times P(x_i)$

Here,

$E[X]$ = Expected value of X

📝Example: Expected Value

Roll a fair die:

$E[X] = 1 \times \frac{1}{6} + 2 \times \frac{1}{6} + 3 \times \frac{1}{6} + 4 \times \frac{1}{6} + 5 \times \frac{1}{6} + 6 \times \frac{1}{6} = 3.5$

Variance: How spread out the values are.

Variance

$\text{Var}(X) = E[(X - \mu)^2] = E[X^2] - (E[X])^2$

Here,

$\text{Var}(X)$ = Variance of X

Standard deviation: $\sigma = \sqrt{\text{Var}(X)}$

Properties:

$E[aX + b] = aE[X] + b$
$\text{Var}(aX + b) = a^2\text{Var}(X)$
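The die example and the two properties can be verified exactly with fractions. In this sketch the helper `expectation` and the constants `a = 2`, `b = 3` are arbitrary choices for illustration.

```python
from fractions import Fraction

# PMF of a fair die: each face has probability 1/6
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def expectation(pmf, g=lambda x: x):
    """E[g(X)] = sum of g(x) * P(x) over all values x."""
    return sum(g(x) * p for x, p in pmf.items())

mean = expectation(pmf)                            # E[X]
var = expectation(pmf, lambda x: x**2) - mean**2   # E[X^2] - (E[X])^2
print(mean, var)  # 7/2 35/12

# Check the linear-transformation properties with a = 2, b = 3
a, b = 2, 3
mean_lin = expectation(pmf, lambda x: a * x + b)
var_lin = expectation(pmf, lambda x: (a * x + b) ** 2) - mean_lin**2
print(mean_lin == a * mean + b, var_lin == a**2 * var)  # True True
```

Adding the constant `b` shifts every value equally, so it changes the mean but not the spread.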

Covariance: How two variables move together

Covariance

$\text{Cov}(X,Y) = E[(X-\mu_X)(Y-\mu_Y)] = E[XY] - E[X]E[Y]$

Here,

$\text{Cov}(X,Y)$ = Covariance between X and Y

$\text{Cov} > 0$: X and Y tend to increase together
$\text{Cov} < 0$: One increases while the other decreases
$\text{Cov} = 0$: No linear relationship

Correlation: Normalized covariance (-1 to 1)

Correlation

$\rho(X,Y) = \frac{\text{Cov}(X,Y)}{\sigma_X \times \sigma_Y}$

Here,

$\rho(X,Y)$ = Correlation coefficient

$\rho = 1$: Perfect positive correlation
$\rho = -1$: Perfect negative correlation
$\rho = 0$: No linear correlation
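Covariance and correlation can be computed from paired data using the formulas above. This sketch treats the data as the whole population (dividing by $n$, not $n-1$), and the small datasets are invented for illustration.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def covariance(xs, ys):
    """Population covariance: E[XY] - E[X]E[Y]."""
    return mean([x * y for x, y in zip(xs, ys)]) - mean(xs) * mean(ys)

def correlation(xs, ys):
    """Covariance divided by the product of the standard deviations."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))

xs = [1, 2, 3, 4, 5]
up = [2, 4, 6, 8, 10]    # rises with xs
down = [10, 8, 6, 4, 2]  # falls as xs rises

print(covariance(xs, up), correlation(xs, up))      # 4.0 1.0
print(covariance(xs, down), correlation(xs, down))  # -4.0 -1.0
```

Note that `covariance(xs, xs)` is just the variance of `xs`, which is how the standard deviations are obtained here.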

Joint, Marginal, and Conditional Distributions

Joint Distribution: $P(X=x, Y=y)$ — probability of both happening simultaneously

Marginal Distribution: Get one variable by summing/integrating out the other

Marginal Distribution

$P(X=x) = \sum_y P(X=x, Y=y)$

Here,

$P(X=x)$ = Marginal probability

Conditional Distribution: Probability of one variable given another

Conditional Distribution

$P(X|Y=y) = \frac{P(X, Y=y)}{P(Y=y)}$

Here,

$P(X|Y=y)$ = Conditional probability

Independence:

ℹ️ Independence

X and Y are independent if:

$P(X=x, Y=y) = P(X=x) \times P(Y=y) \quad \text{for all values } x, y$
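Joint, marginal, and conditional distributions are easiest to see on a small table. The joint probabilities below are hypothetical numbers chosen for this sketch, and `marginal` is a made-up helper name.

```python
from fractions import Fraction as F

# Hypothetical joint distribution P(X=x, Y=y) for two binary variables
joint = {
    (0, 0): F(3, 10), (0, 1): F(2, 10),
    (1, 0): F(1, 10), (1, 1): F(4, 10),
}

def marginal(joint, axis):
    """Sum out the other variable; axis 0 keeps X, axis 1 keeps Y."""
    out = {}
    for outcome, p in joint.items():
        out[outcome[axis]] = out.get(outcome[axis], 0) + p
    return out

p_x = marginal(joint, 0)  # P(X=0) = 1/2, P(X=1) = 1/2
p_y = marginal(joint, 1)  # P(Y=0) = 2/5, P(Y=1) = 3/5

# Conditional: P(X=1 | Y=1) = P(X=1, Y=1) / P(Y=1)
p_x1_given_y1 = joint[(1, 1)] / p_y[1]
print(p_x1_given_y1)  # 2/3

# Independence requires the product rule to hold in every cell
independent = all(joint[(x, y)] == p_x[x] * p_y[y] for (x, y) in joint)
print(independent)  # False
```

Here knowing $Y=1$ raises the probability of $X=1$ from 1/2 to 2/3, which is exactly what a failed independence check means.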

Central Limit Theorem (CLT)

Central Limit Theorem

For independent samples from a distribution with finite mean $\mu$ and variance $\sigma^2$, the distribution of the sample mean approaches a normal distribution as the sample size increases, whatever the shape of the original distribution.

CLT Statement

$\bar{X} \sim \text{approximately } N\left(\mu, \frac{\sigma^2}{n}\right) \text{ for large } n$

Here,

$\bar{X}$ = Sample mean
$n$ = Sample size

💡 Why this is HUGE

It explains why the normal distribution appears everywhere
It allows us to make confidence intervals
It justifies hypothesis testing
It holds for any original distribution with finite variance

Rule of thumb: $n \geq 30$ is often enough for the CLT approximation to be reasonable, though heavily skewed or heavy-tailed data can need a larger $n$.
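A quick simulation can make the CLT concrete. This sketch draws from an exponential distribution with rate 1 (mean 1, variance 1), which is strongly skewed, and looks at the means of many samples of size 30. The seed, sample size, and number of repetitions are arbitrary choices.

```python
import random
from statistics import mean, pstdev

random.seed(0)   # fixed seed so the run is repeatable
n = 30           # size of each sample
repeats = 5000   # number of sample means to collect

sample_means = [
    sum(random.expovariate(1.0) for _ in range(n)) / n
    for _ in range(repeats)
]

mean_of_means = mean(sample_means)
sd_of_means = pstdev(sample_means)

# CLT prediction: centered at mu = 1 with spread sigma / sqrt(n)
print(round(mean_of_means, 3), round(sd_of_means, 3))
print("predicted spread:", round(1 / n**0.5, 3))
```

Plotting a histogram of `sample_means` would show a roughly bell-shaped curve even though individual exponential draws are far from symmetric.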

Maximum Likelihood Estimation (MLE)

The Idea: Given some data, find the parameters that make the data MOST probable.

Likelihood Function

$L(\theta) = P(\text{data} | \theta) = \prod_i P(x_i | \theta)$

Here,

$\theta$ = Parameters to estimate

The product form assumes the observations $x_i$ are independent.

Log-likelihood: $\log L(\theta) = \sum_i \log P(x_i | \theta)$ (easier to work with)

MLE: $\hat{\theta} = \arg\max_\theta \log L(\theta)$

📝Example: MLE for Coin Flip

Data: H, H, T, H, H, T, H (5 heads, 2 tails)

Model: $P(H) = p$, $P(T) = 1-p$

$L(p) = p^5 \times (1-p)^2$

$\log L(p) = 5\log(p) + 2\log(1-p)$

$\frac{d}{dp}[\log L(p)] = \frac{5}{p} - \frac{2}{1-p} = 0$

$5(1-p) = 2p$

$5 - 5p = 2p$

$5 = 7p$

$\hat{p} = \frac{5}{7} \approx 0.714$
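The coin-flip derivation can be cross-checked numerically. This sketch evaluates the log-likelihood on a grid of candidate values and compares the best one with the closed-form answer; the grid search is a simple stand-in for illustration, not how MLE is usually solved in practice.

```python
import math

data = "HHTHHTH"  # the seven flips from the example
heads = data.count("H")
tails = data.count("T")

def log_likelihood(p):
    """log L(p) = heads * log(p) + tails * log(1 - p)."""
    return heads * math.log(p) + tails * math.log(1 - p)

# Candidate values 0.001, 0.002, ..., 0.999 (endpoints excluded: log(0) is undefined)
grid = [i / 1000 for i in range(1, 1000)]
p_grid = max(grid, key=log_likelihood)

p_closed = heads / (heads + tails)  # closed-form MLE: 5/7
print(p_grid, round(p_closed, 3))  # 0.714 0.714
```

Working with the log-likelihood rather than the raw product avoids multiplying many small numbers together, which matters once datasets get large.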

Applications in AI:

Logistic regression uses MLE
Training neural networks with cross-entropy loss ≡ MLE
Gaussian Mixture Models use MLE (via EM algorithm)

Common Probability Mistakes

⚠️ Common Mistakes

Base rate neglect: Ignoring prior probability (the medical test example)
Confusion of the inverse: $P(A|B) \neq P(B|A)$
Gambler's fallacy: Believing that past outcomes affect independent events (after five heads in a row, a fair coin still has $P(\text{Heads}) = 0.5$)
Small sample bias: Small samples can look very different from the population
Correlation ≠ Causation: Two things moving together doesn't mean one causes the other

📋Key Takeaways

Probability quantifies uncertainty from 0 to 1. $P(\text{event}) = \frac{\text{favorable outcomes}}{\text{total outcomes}}$ is the foundation for all statistical reasoning in AI.
Bayes' Theorem reverses conditional probabilities. $P(A|B) = \frac{P(B|A) \times P(A)}{P(B)}$ lets you update beliefs with evidence — the engine behind Naive Bayes classifiers and Bayesian optimization.
The Normal Distribution is everywhere. $f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$ with the 68-95-99.7 rule: most data falls within 3 standard deviations of the mean.
The Central Limit Theorem explains why normality appears everywhere. For large $n$, sample means are approximately normal, $\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$, whatever the shape of the underlying distribution (provided its variance is finite). This is the theoretical basis for confidence intervals and hypothesis testing.
MLE finds the parameters that maximize data likelihood. $\hat{\theta} = \arg\max_\theta \sum_i \log P(x_i | \theta)$ — used in logistic regression, training with cross-entropy loss, and Gaussian Mixture Models.
Independence means $P(X,Y) = P(X) \times P(Y)$. Understanding when variables are independent vs. correlated is critical for feature selection and avoiding spurious patterns in data.