Mathematics for Data Science & AI

Probability — The Math of Uncertainty

Master probability for AI and data science: distributions, Bayes theorem, MLE, and real-world applications in machine learning.


ℹ️ Why It Matters

AI makes decisions under uncertainty. "Is this email spam?" "Is this tumor cancerous?" "What word comes next?" Probability is the math that handles uncertainty.


What is Probability?

Probability measures how likely something is to happen. It ranges from 0 (impossible) to 1 (certain).

Probability Definition

$$P(\text{event}) = \frac{\text{number of favorable outcomes}}{\text{total number of outcomes}}$$

Here,

  • $P(\text{event})$ = Probability of the event occurring

This counting formula assumes every outcome is equally likely, as with a fair die or a fair coin.

📝Example: Rolling a Die

$$P(6) = \frac{1}{6}$$

Key Terminology

| Term | Meaning | Example |
|---|---|---|
| Experiment | An action with an uncertain outcome | Flipping a coin |
| Sample Space (S) | All possible outcomes | {Heads, Tails} |
| Event | A subset of outcomes | {Heads} |
| Favorable outcomes | Outcomes we want | Getting Heads |
| P(event) | Probability of the event | P(Heads) = 0.5 |
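
The counting definition can be checked directly in code. The following is a minimal Python sketch, assuming every outcome is equally likely; the helper name `probability` and the variables `die` and `coin` are illustrative and not part of the lesson.

```python
from fractions import Fraction

def probability(event, sample_space):
    """Classical probability: favorable outcomes / total outcomes.

    Assumes every outcome in sample_space is equally likely.
    """
    favorable = event & sample_space  # ignore anything outside the sample space
    return Fraction(len(favorable), len(sample_space))

die = {1, 2, 3, 4, 5, 6}     # sample space for one fair die
coin = {"Heads", "Tails"}    # sample space for one fair coin

p_six = probability({6}, die)
p_heads = probability({"Heads"}, coin)
print(p_six, p_heads)  # 1/6 1/2
```

Using `Fraction` keeps the results exact, which makes them easy to compare with the hand calculations.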

Fundamental Rules

Addition Rule


$$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$

Here,

  • $P(A \cup B)$ = Probability of A or B occurring
  • $P(A \cap B)$ = Probability of both A and B occurring

📝Example: Addition Rule

$$P(\text{rolling 1 or 6}) = \frac{1}{6} + \frac{1}{6} - 0 = \frac{2}{6} = \frac{1}{3}$$

ℹ️ Mutually Exclusive Events

If A and B are mutually exclusive (can't happen together):

$$P(A \cup B) = P(A) + P(B)$$
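
To see why the overlap term is subtracted, it can help to count outcomes for two events that do overlap. This is a small sketch on a fair die; the events `even` and `big` and the helper `prob` are chosen here for illustration.

```python
from fractions import Fraction

die = set(range(1, 7))  # sample space of one fair die

def prob(event):
    """P(event) by counting, assuming equally likely outcomes."""
    return Fraction(len(event & die), len(die))

# Overlapping events: "even" and "greater than 3" share the outcomes 4 and 6.
even = {2, 4, 6}
big = {4, 5, 6}
union_direct = prob(even | big)
union_by_rule = prob(even) + prob(big) - prob(even & big)
print(union_direct, union_by_rule)  # 2/3 2/3

# Mutually exclusive events: rolling a 1 and rolling a 6 cannot both happen,
# so the overlap term is zero and the probabilities simply add.
one, six = {1}, {6}
exclusive_union = prob(one | six)
print(exclusive_union, prob(one) + prob(six))  # 1/3 1/3
```

Without the subtraction, the outcomes 4 and 6 would be counted twice.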

Multiplication Rule


$$P(A \cap B) = P(A) \times P(B|A)$$

Here,

  • $P(B|A)$ = Conditional probability of B given A

📝Example: Multiplication Rule

$$P(\text{Head then Tail}) = P(\text{Head}) \times P(\text{Tail}|\text{Head}) = 0.5 \times 0.5 = 0.25$$

The two flips are independent, so $P(\text{Tail}|\text{Head}) = P(\text{Tail}) = 0.5$.

ℹ️ Independent Events

If A and B are independent:

$$P(A \cap B) = P(A) \times P(B)$$
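
The product rule for independent events can be confirmed by listing every outcome of two fair coin flips. This is only a sketch; the helper `prob` and the lambda predicates are illustrative names.

```python
from fractions import Fraction
from itertools import product

# All equally likely outcomes of two fair coin flips: HH, HT, TH, TT.
outcomes = list(product("HT", repeat=2))

def prob(predicate):
    """Fraction of outcomes for which predicate(outcome) is True."""
    hits = sum(1 for o in outcomes if predicate(o))
    return Fraction(hits, len(outcomes))

p_first_head = prob(lambda o: o[0] == "H")
p_second_tail = prob(lambda o: o[1] == "T")
p_head_then_tail = prob(lambda o: o == ("H", "T"))

# For independent flips the joint probability equals the product.
print(p_head_then_tail, p_first_head * p_second_tail)  # 1/4 1/4
```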

Conditional Probability

P(A|B) = Probability of A given that B has happened.

Conditional Probability

$$P(A|B) = \frac{P(A \cap B)}{P(B)}$$

Here,

  • $P(A|B)$ = Probability of A given B has occurred

📝Example: Conditional Probability

$$P(\text{Rain}|\text{Cloudy}) = \frac{P(\text{Rain and Cloudy})}{P(\text{Cloudy})}$$

Analogy: You know it's cloudy (B happened). Given that information, what's the chance it rains (A)?
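
A numeric case may make the conditional-probability formula more concrete. The sketch below uses two fair dice rather than weather, because the outcomes can be counted exactly; the events `A` and `B` are chosen here for illustration.

```python
from fractions import Fraction
from itertools import product

# 36 equally likely (first, second) pairs from two fair dice.
rolls = list(product(range(1, 7), repeat=2))

def prob(event):
    return Fraction(len(event), len(rolls))

A = {r for r in rolls if sum(r) >= 10}  # total is at least 10
B = {r for r in rolls if r[0] == 6}     # first die shows 6

p_a = prob(A)
p_a_given_b = prob(A & B) / prob(B)     # P(A|B) = P(A and B) / P(B)
print(p_a, p_a_given_b)  # 1/6 1/2

# Multiplication rule, read the other way: P(A and B) = P(B) * P(A|B)
p_both = prob(B) * p_a_given_b
print(p_both)  # 1/12
```

Knowing that the first die shows 6 raises the chance of a total of at least 10 from 1/6 to 1/2, which is the sense in which conditioning updates a probability.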

Bayes' Theorem — The Crown Jewel of Probability

Bayes' Theorem

$$P(A|B) = \frac{P(B|A) \times P(A)}{P(B)}$$

Here,

  • $P(A|B)$ = Posterior probability
  • $P(B|A)$ = Likelihood
  • $P(A)$ = Prior probability
  • $P(B)$ = Evidence

In plain English:

ℹ️ Bayes' Theorem

$$\text{Posterior} = \frac{\text{Likelihood} \times \text{Prior}}{\text{Evidence}}$$

📝Example: Medical Test

  • Disease affects 1% of people: $P(\text{Disease}) = 0.01$
  • Test sensitivity is 99% (it flags 99% of people who have the disease): $P(\text{Positive}|\text{Disease}) = 0.99$
  • False positive rate is 5%: $P(\text{Positive}|\text{No Disease}) = 0.05$

You test positive. What's the probability you have the disease?

$$P(\text{Disease}|\text{Positive}) = \frac{P(\text{Positive}|\text{Disease}) \times P(\text{Disease})}{P(\text{Positive})}$$
$$P(\text{Positive}) = P(\text{Positive}|\text{Disease}) \times P(\text{Disease}) + P(\text{Positive}|\text{No Disease}) \times P(\text{No Disease})$$
$$= 0.99 \times 0.01 + 0.05 \times 0.99 = 0.0099 + 0.0495 = 0.0594$$
$$P(\text{Disease}|\text{Positive}) = \frac{0.0099}{0.0594} \approx 0.1667 = 16.7\%$$

⚠️ Surprise

Even with a positive test, you only have a 16.7% chance of having the disease! This is why base rates matter.
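
The medical test calculation can be reproduced in a few lines. This is a minimal sketch of Bayes' theorem for a yes/no condition; the function name `posterior` and its parameter names are assumptions made for this example.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(Disease | Positive) for a binary condition via Bayes' theorem."""
    # Evidence P(Positive): true positives plus false positives.
    evidence = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / evidence

p = posterior(prior=0.01, sensitivity=0.99, false_positive_rate=0.05)
print(f"{p:.1%}")  # 16.7%

# Same test, but in a group where the disease is much more common.
p_high_prior = posterior(prior=0.30, sensitivity=0.99, false_positive_rate=0.05)
print(p_high_prior > p)  # True
```

Changing only the prior shows the base-rate effect: the same positive result is far more convincing when the condition is common in the group being tested.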

Applications in AI:

  • Naive Bayes classifier: Text classification, spam filtering
  • Bayesian networks: Causal reasoning
  • Bayesian optimization: Hyperparameter tuning
  • Probabilistic programming: Stan, PyMC, Edward

Random Variables

Definition: Random Variable

A random variable assigns a numerical value to each outcome of a random experiment, so its value is determined by chance.

Discrete Random Variable: Can take specific values (countable)

  • Number of emails per day: {0, 1, 2, 3, ...}
  • Coin flips: {0, 1} (0=tails, 1=heads)

Continuous Random Variable: Can take any value in a range

  • Height: any value between 0 and 3 meters
  • Temperature: any real number

Probability Distributions

Discrete Distributions

Bernoulli Distribution: Single coin flip

Bernoulli Distribution

$$P(X=1) = p, \quad P(X=0) = 1-p$$

Here,

  • $p$ = Probability of success
  • Mean: $p$
  • Variance: $p(1-p)$
  • Used in: Binary classification output

Binomial Distribution: Number of successes in n trials

Binomial Distribution

$$P(X=k) = \binom{n}{k} p^k (1-p)^{n-k}$$

Here,

  • $n$ = Number of trials
  • $k$ = Number of successes
  • $p$ = Probability of success
  • Mean: $np$
  • Variance: $np(1-p)$

📝Example: Binomial Distribution

In 10 coin flips, P(exactly 5 heads):

$$P(X=5) = \binom{10}{5} \times 0.5^5 \times 0.5^5 = \frac{252}{1024} \approx 0.246$$
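
The binomial formula translates almost directly into code. The sketch below uses `math.comb` from the Python standard library (Python 3.8+); the function name `binomial_pmf` is an illustrative choice.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k): exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

p5 = binomial_pmf(5, 10, 0.5)
print(round(p5, 3))  # 0.246

# Sanity checks: probabilities sum to 1 and the mean equals n * p.
total = sum(binomial_pmf(k, 10, 0.5) for k in range(11))
mean = sum(k * binomial_pmf(k, 10, 0.5) for k in range(11))
print(round(total, 6), round(mean, 6))  # 1.0 5.0
```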

Poisson Distribution: Number of events in a fixed time/area

Poisson Distribution

$$P(X=k) = \frac{\lambda^k e^{-\lambda}}{k!}$$

Here,

  • $\lambda$ = Average rate of events
  • Mean: $\lambda$
  • Variance: $\lambda$

📝Example: Poisson Distribution

If you receive 3 emails/hour on average:

$$P(5 \text{ emails in an hour}) = \frac{3^5 \times e^{-3}}{5!} \approx 0.1008$$
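
The email example can be checked numerically with the standard library. This is a short sketch; `poisson_pmf` is a name chosen for this example.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) when events arrive at an average rate of lam per interval."""
    return lam**k * exp(-lam) / factorial(k)

p_five = poisson_pmf(5, 3)
print(round(p_five, 4))  # 0.1008

# The most likely counts sit near the rate itself.
most_likely = max(range(20), key=lambda k: poisson_pmf(k, 3))
print(most_likely)  # 2
```

For an integer rate such as 3, the counts 2 and 3 are exactly tied as most likely; `max` returns the first of them.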

Geometric Distribution: Number of trials until first success

Geometric Distribution

$$P(X=k) = (1-p)^{k-1} \times p$$

Here,

  • $p$ = Probability of success
  • Mean: $1/p$
  • Variance: $(1-p)/p^2$
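
The mean and variance formulas for the geometric distribution can be checked by summing the probability mass function directly. The sketch below assumes a hypothetical success probability of 0.25 and truncates the infinite sum at 500 terms, where the remaining tail is negligible.

```python
def geometric_pmf(k, p):
    """P(X = k): the first success happens on trial k (k = 1, 2, 3, ...)."""
    return (1 - p) ** (k - 1) * p

p = 0.25            # hypothetical success probability
ks = range(1, 501)  # truncated support; the tail beyond 500 is negligible here

total = sum(geometric_pmf(k, p) for k in ks)
mean = sum(k * geometric_pmf(k, p) for k in ks)
variance = sum((k - mean) ** 2 * geometric_pmf(k, p) for k in ks)

print(round(total, 6), round(mean, 6), round(variance, 6))  # 1.0 4.0 12.0
```

The results match $1/p = 4$ and $(1-p)/p^2 = 12$.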

Continuous Distributions

Uniform Distribution: Every value equally likely

Uniform Distribution

$$f(x) = \frac{1}{b-a} \quad \text{for } a \leq x \leq b$$

Here,

  • $a$ = Lower bound
  • $b$ = Upper bound
  • Mean: $(a+b)/2$
  • Variance: $(b-a)^2/12$
  • Used in: Random initialization, Monte Carlo methods

Normal (Gaussian) Distribution — THE Most Important Distribution

Normal Distribution

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

Here,

  • $\mu$ = Mean (center)
  • $\sigma$ = Standard deviation (spread)
  • $\sigma^2$ = Variance

ℹ️ Properties of Normal Distribution

  • Bell-shaped curve
  • 68% of data within $\mu \pm \sigma$
  • 95% within $\mu \pm 2\sigma$
  • 99.7% within $\mu \pm 3\sigma$ (the "3-sigma rule")

Why Normal Distribution is EVERYWHERE:

  • Central Limit Theorem (see below)
  • Heights, weights, test scores are approximately normal
  • Noise in measurements is typically Gaussian
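  • Many ML models assume Gaussian noise; for example, least-squares regression corresponds to maximum likelihood under Gaussian errors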

Standard Normal Distribution: $\mu = 0$, $\sigma = 1$

Standardization

$$Z = \frac{X - \mu}{\sigma}$$

Here,

  • $Z$ = Standardized value
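
Both the 68-95-99.7 rule and standardization can be checked with `statistics.NormalDist` from the Python standard library (Python 3.8+). In this sketch the test-score mean of 70 and standard deviation of 10 are hypothetical numbers picked for illustration.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

def prob_within(k):
    """P(-k <= Z <= k) for a standard normal Z."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(prob_within(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973

# Standardization: a score of 85 on a test with mean 70 and std 10 (hypothetical).
mu, sigma, x = 70.0, 10.0, 85.0
z = (x - mu) / sigma
share_below = std_normal.cdf(z)
print(z, round(share_below, 4))  # 1.5 0.9332
```

Once a value is converted to a z-score, a single standard normal table or function answers questions about any normal distribution.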

Exponential Distribution: Time between events

Exponential Distribution

$$f(x) = \lambda e^{-\lambda x} \quad \text{for } x \geq 0$$

Here,

  • $\lambda$ = Rate parameter
  • Mean: $1/\lambda$
  • Variance: $1/\lambda^2$
  • Used in: Modeling waiting times, survival analysis

Expected Value and Variance

Expected Value (Mean): The "average" outcome if you repeat the experiment many times.

Expected Value

$$E[X] = \sum_i x_i \times P(x_i)$$

Here,

  • $E[X]$ = Expected value of X

📝Example: Expected Value

Roll a fair die:

$$E[X] = 1 \times \frac{1}{6} + 2 \times \frac{1}{6} + 3 \times \frac{1}{6} + 4 \times \frac{1}{6} + 5 \times \frac{1}{6} + 6 \times \frac{1}{6} = 3.5$$

Variance: How spread out the values are.

Variance

$$\text{Var}(X) = E[(X - \mu)^2] = E[X^2] - (E[X])^2$$

Here,

  • $\text{Var}(X)$ = Variance of X

$$\text{Standard Deviation: } \sigma = \sqrt{\text{Var}(X)}$$

Properties:

  • $E[aX + b] = aE[X] + b$
  • $\text{Var}(aX + b) = a^2\text{Var}(X)$
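
The expected value, the variance, and the two properties above can all be verified exactly for a fair die. This sketch uses `Fraction` to avoid rounding; the helper `expectation` and the constants `a = 2`, `b = 3` are illustrative choices.

```python
from fractions import Fraction

# Fair die: each face 1..6 has probability 1/6.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def expectation(g):
    """E[g(X)] = sum of g(x) * P(x) over all values x."""
    return sum(g(x) * p for x, p in pmf.items())

mean = expectation(lambda x: x)
variance = expectation(lambda x: x**2) - mean**2  # E[X^2] - (E[X])^2
print(mean, variance)  # 7/2 35/12

# Check the linear-transformation properties with a = 2, b = 3.
a, b = 2, 3
mean_t = expectation(lambda x: a * x + b)
var_t = expectation(lambda x: (a * x + b) ** 2) - mean_t**2
print(mean_t == a * mean + b, var_t == a**2 * variance)  # True True
```

The shift `b` moves the mean but leaves the variance unchanged, while the scale `a` enters the variance squared.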

Covariance: How two variables move together

Covariance

$$\text{Cov}(X,Y) = E[(X-\mu_X)(Y-\mu_Y)] = E[XY] - E[X]E[Y]$$

Here,

  • $\text{Cov}(X,Y)$ = Covariance between X and Y
  • $\text{Cov} > 0$: X and Y tend to increase together
  • $\text{Cov} < 0$: One increases while the other decreases
  • $\text{Cov} = 0$: No linear relationship

Correlation: Normalized covariance (-1 to 1)

Correlation

$$\rho(X,Y) = \frac{\text{Cov}(X,Y)}{\sigma_X \times \sigma_Y}$$

Here,

  • $\rho(X,Y)$ = Correlation coefficient
  • $\rho = 1$: Perfect positive correlation
  • $\rho = -1$: Perfect negative correlation
  • $\rho = 0$: No linear correlation
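
Covariance and correlation can be computed from paired data with the formulas above. The sketch below treats a small made-up dataset as the whole population (so it divides by the number of points); the names `hours` and `scores` and their values are hypothetical.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def covariance(xs, ys):
    """Population covariance: E[XY] - E[X]E[Y]."""
    return mean([x * y for x, y in zip(xs, ys)]) - mean(xs) * mean(ys)

def correlation(xs, ys):
    """Covariance divided by the product of the standard deviations."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))

hours = [1, 2, 3, 4, 5]    # hypothetical study hours
scores = [2, 4, 5, 4, 5]   # hypothetical quiz scores

cov = covariance(hours, scores)
rho = correlation(hours, scores)
print(round(cov, 3), round(rho, 3))  # 1.2 0.775

# A perfectly linear decreasing relationship gives a correlation of -1.
rho_neg = correlation(hours, [-2 * h for h in hours])
```

Covariance depends on the units of the data, while correlation is unit-free, which is why correlation is usually easier to interpret.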

Joint, Marginal, and Conditional Distributions

Joint Distribution: $P(X=x, Y=y)$, the probability of both happening simultaneously

Marginal Distribution: Get one variable by summing/integrating out the other

Marginal Distribution

$$P(X=x) = \sum_y P(X=x, Y=y)$$

Here,

  • $P(X=x)$ = Marginal probability

Conditional Distribution: Probability of one variable given another

Conditional Distribution

$$P(X|Y=y) = \frac{P(X, Y=y)}{P(Y=y)}$$

Here,

  • $P(X|Y=y)$ = Conditional probability

Independence:

ℹ️ Independence

X and Y are independent if:

$$P(X=x, Y=y) = P(X=x) \times P(Y=y) \quad \text{for all values } x, y$$
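
A small joint table shows how marginal and conditional distributions, and the independence check, all come from the same numbers. In this sketch the weather and traffic probabilities are made up for illustration, and the function names are hypothetical.

```python
from fractions import Fraction as F

# Hypothetical joint distribution P(weather, traffic); the numbers are made up.
joint = {
    ("rain", "jam"): F(3, 10),
    ("rain", "clear"): F(1, 10),
    ("dry", "jam"): F(2, 10),
    ("dry", "clear"): F(4, 10),
}

def marginal_weather(w):
    """P(weather = w): sum the joint over all traffic values."""
    return sum(p for (wx, _), p in joint.items() if wx == w)

def marginal_traffic(t):
    """P(traffic = t): sum the joint over all weather values."""
    return sum(p for (_, tx), p in joint.items() if tx == t)

def weather_given_traffic(w, t):
    """P(weather = w | traffic = t) = joint / marginal."""
    return joint[(w, t)] / marginal_traffic(t)

p_rain = marginal_weather("rain")
p_jam = marginal_traffic("jam")
p_rain_given_jam = weather_given_traffic("rain", "jam")

# Independent only if joint == product of marginals for every pair of values.
independent = all(
    joint[(w, t)] == marginal_weather(w) * marginal_traffic(t)
    for (w, t) in joint
)
print(p_rain, p_jam, p_rain_given_jam, independent)  # 2/5 1/2 3/5 False
```

Here a jam raises the probability of rain from 2/5 to 3/5, so the two variables are not independent.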

Central Limit Theorem (CLT)

Theorem: Central Limit Theorem

For independent samples from any distribution with finite variance, the distribution of the sample mean approaches a normal distribution as the sample size increases.

CLT Statement

$$\bar{X} \sim \text{approximately } N\left(\mu, \frac{\sigma^2}{n}\right) \text{ for large } n$$

Here,

  • $\bar{X}$ = Sample mean
  • $n$ = Sample size

💡 Why this is HUGE

  1. It explains why the normal distribution appears everywhere
  2. It allows us to make confidence intervals
  3. It justifies hypothesis testing
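  4. It works for any original distribution, as long as its variance is finite!

Rule of thumb: $n \geq 30$ is often enough for the CLT approximation to be reasonable, although heavily skewed or heavy-tailed data may need larger samples.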
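
A quick simulation can illustrate the theorem. The sketch below draws samples from an exponential distribution, which is strongly skewed, and looks at how the sample means behave; the seed, the sample size of 30, and the 5000 repetitions are arbitrary choices for this example.

```python
import random
from math import sqrt
from statistics import stdev

random.seed(0)  # fixed seed so the run is repeatable

n = 30          # size of each sample
trials = 5000   # number of sample means to collect

# Exponential(1) is skewed, with mean 1 and standard deviation 1.
sample_means = [
    sum(random.expovariate(1.0) for _ in range(n)) / n
    for _ in range(trials)
]

mean_of_means = sum(sample_means) / trials
sd_of_means = stdev(sample_means)

# CLT prediction: mean close to 1 and spread close to sigma / sqrt(n).
print(round(mean_of_means, 3), round(sd_of_means, 3), round(1 / sqrt(n), 3))

# Roughly 95% of sample means should fall within 2 predicted standard errors.
se = 1 / sqrt(n)
share_within_2se = sum(abs(m - 1.0) <= 2 * se for m in sample_means) / trials
print(round(share_within_2se, 3))
```

Even though individual draws are far from bell-shaped, the sample means cluster around 1 with a spread close to $\sigma/\sqrt{n}$.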


Maximum Likelihood Estimation (MLE)

The Idea: Given some data, find the parameters that make the data MOST probable.

Likelihood Function

$$L(\theta) = P(\text{data} | \theta) = \prod_i P(x_i | \theta)$$

Here,

  • $\theta$ = Parameters to estimate

The product form assumes the observations $x_i$ are independent.

Log-likelihood: $\log L(\theta) = \sum_i \log P(x_i | \theta)$ (easier to work with)

MLE: $\hat{\theta} = \arg\max_\theta \log L(\theta)$

📝Example: MLE for Coin Flip

Data: H, H, T, H, H, T, H (5 heads, 2 tails)

Model: $P(H) = p$, $P(T) = 1-p$

$$L(p) = p^5 \times (1-p)^2$$
$$\log L(p) = 5\log(p) + 2\log(1-p)$$
$$\frac{d}{dp}[\log L(p)] = \frac{5}{p} - \frac{2}{1-p} = 0$$
$$5(1-p) = 2p$$
$$5 - 5p = 2p$$
$$5 = 7p$$
$$\hat{p} = \frac{5}{7} \approx 0.714$$
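
The same estimate can be found numerically, which is how MLE is usually carried out when no closed-form solution exists. This sketch uses a simple grid search over candidate values of p; the grid resolution of 0.001 is an arbitrary choice, and real models would typically use gradient-based optimizers instead.

```python
from math import log

flips = "HHTHHTH"  # the data from the example
heads = flips.count("H")
tails = flips.count("T")

def log_likelihood(p):
    """log L(p) = heads * log(p) + tails * log(1 - p)."""
    return heads * log(p) + tails * log(1 - p)

# Evaluate candidates 0.001, 0.002, ..., 0.999 (endpoints excluded: log(0) is undefined).
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=log_likelihood)

closed_form = heads / (heads + tails)
print(p_hat, round(closed_form, 3))  # 0.714 0.714
```

The grid maximum agrees with the closed-form answer, which for a coin is simply the observed fraction of heads.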

Applications in AI:

  • Logistic regression uses MLE
  • Training neural networks with cross-entropy loss ≡ MLE
  • Gaussian Mixture Models use MLE (via EM algorithm)

Common Probability Mistakes

⚠️ Common Mistakes

  1. Base rate neglect: Ignoring prior probability (the medical test example)
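  2. Confusion of the inverse: Treating $P(A|B)$ as if it were $P(B|A)$; in general $P(A|B) \neq P(B|A)$
  3. Gambler's fallacy: Believing that past outcomes change the odds of independent events, for example that tails is "due" after a run of heads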
  4. Small sample bias: Small samples can look very different from the population
  5. Correlation ≠ Causation: Two things moving together doesn't mean one causes the other

📋Key Takeaways

  • Probability quantifies uncertainty from 0 to 1. $P(\text{event}) = \frac{\text{favorable outcomes}}{\text{total outcomes}}$ is the foundation for all statistical reasoning in AI.

  • Bayes' Theorem reverses conditional probabilities. $P(A|B) = \frac{P(B|A) \times P(A)}{P(B)}$ lets you update beliefs with evidence, and it is the engine behind Naive Bayes classifiers and Bayesian optimization.

  • The Normal Distribution is everywhere. $f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$ comes with the 68-95-99.7 rule: about 99.7% of data falls within 3 standard deviations of the mean.

  • The Central Limit Theorem explains why normality appears everywhere. For large $n$, sample means are approximately normal, $\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$, for any underlying distribution with finite variance. This is the theoretical basis for confidence intervals and hypothesis testing.

  • MLE finds the parameters that maximize data likelihood. $\hat{\theta} = \arg\max_\theta \sum_i \log P(x_i | \theta)$ is used in logistic regression, training with cross-entropy loss, and Gaussian Mixture Models.

  • Independence means $P(X,Y) = P(X) \times P(Y)$. Understanding when variables are independent vs. correlated is critical for feature selection and avoiding spurious patterns in data.
