Probability — The Math of Uncertainty
ℹ️ Why It Matters
AI makes decisions under uncertainty. "Is this email spam?" "Is this tumor cancerous?" "What word comes next?" Probability is the math that handles uncertainty.
What is Probability?
Probability measures how likely something is to happen. It ranges from 0 (impossible) to 1 (certain).
Probability Definition
$$P(\text{event}) = \frac{\text{number of favorable outcomes}}{\text{total outcomes}}$$
Here,
$P(\text{event})$ = Probability of the event occurring (this counting formula assumes all outcomes are equally likely)
📝 Example: Rolling a Die
$$P(6) = \frac{1}{6}$$
Key Terminology
| Term | Meaning | Example |
| --- | --- | --- |
| Experiment | An action with uncertain outcome | Flipping a coin |
| Sample Space (S) | All possible outcomes | {Heads, Tails} |
| Event | A subset of outcomes | {Heads} |
| Favorable outcomes | Outcomes we want | Getting Heads |
| P(event) | Probability of event | P(Heads) = 0.5 |
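To make the terminology concrete, here is a minimal sketch that treats the sample space and events as Python sets and applies the counting definition. The helper name `probability` and the die example are illustrative choices, and the sketch assumes every outcome is equally likely.

```python
from fractions import Fraction

# Sample space for one roll of a fair six-sided die.
sample_space = {1, 2, 3, 4, 5, 6}

def probability(event, space):
    """P(event) = favorable outcomes / total outcomes (equally likely outcomes)."""
    favorable = event & space  # keep only outcomes that are actually possible
    return Fraction(len(favorable), len(space))

p_six = probability({6}, sample_space)         # event: rolling a 6
p_even = probability({2, 4, 6}, sample_space)  # event: rolling an even number
print(p_six, p_even)  # 1/6 1/2
```

Using `Fraction` keeps the results exact, which makes it easy to compare them with hand calculations.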
Fundamental Rules
Addition Rule
Addition Rule
$$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$
Here,
$P(A \cup B)$ = Probability of A or B occurring
$P(A \cap B)$ = Probability of both A and B occurring
📝 Example: Addition Rule
$$P(\text{rolling 1 or 6}) = \frac{1}{6} + \frac{1}{6} - 0 = \frac{2}{6} = \frac{1}{3}$$
ℹ️ Mutually Exclusive Events
If A and B are mutually exclusive (can't happen together):
$$P(A \cup B) = P(A) + P(B)$$
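The addition rule can be checked by brute-force counting on a die. In this sketch the events `A`, `B`, `C`, and `D` are illustrative choices, not part of the original example; set union `|` plays the role of $\cup$ and intersection `&` the role of $\cap$.

```python
from fractions import Fraction

space = set(range(1, 7))  # one roll of a fair die

def prob(event):
    """Counting probability for equally likely outcomes."""
    return Fraction(len(event & space), len(space))

# Overlapping events: the shared outcomes {4, 6} must not be counted twice.
A = {2, 4, 6}  # even roll
B = {4, 5, 6}  # roll greater than 3
lhs = prob(A | B)
rhs = prob(A) + prob(B) - prob(A & B)
print(lhs, rhs)  # 2/3 2/3

# Mutually exclusive events: the intersection is empty, so the last term vanishes.
C, D = {1}, {6}
print(prob(C | D), prob(C) + prob(D))  # 1/3 1/3
```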
Multiplication Rule
Multiplication Rule
$$P(A \cap B) = P(A) \times P(B|A)$$
Here,
$P(B|A)$ = Conditional probability of B given A
📝 Example: Multiplication Rule
$$P(\text{Head then Tail}) = P(\text{Head}) \times P(\text{Tail}) = 0.5 \times 0.5 = 0.25$$
The second flip does not depend on the first, so $P(\text{Tail}|\text{Head}) = P(\text{Tail})$.
ℹ️ Independent Events
If A and B are independent:
$$P(A \cap B) = P(A) \times P(B)$$
Conditional Probability
P(A|B) = Probability of A given that B has happened.
Conditional Probability
$$P(A|B) = \frac{P(A \cap B)}{P(B)}$$
Here,
$P(A|B)$ = Probability of A given B has occurred
📝 Example: Conditional Probability
$$P(\text{Rain} | \text{Cloudy}) = \frac{P(\text{Rain and Cloudy})}{P(\text{Cloudy})}$$
Analogy: You know it's cloudy (B happened). Given that information, what's the chance it rains (A)?
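Since the weather example has no numbers attached, the sketch below illustrates the same formula on a die instead: given that the roll is even (B), how likely is a 6 (A)? The events and the helper name `conditional` are assumptions made for illustration.

```python
from fractions import Fraction

space = set(range(1, 7))  # one roll of a fair die

def prob(event):
    return Fraction(len(event & space), len(space))

def conditional(a, b):
    """P(A|B) = P(A and B) / P(B); undefined when P(B) is zero."""
    p_b = prob(b)
    if p_b == 0:
        raise ValueError("P(B) is zero, so P(A|B) is undefined")
    return prob(a & b) / p_b

A = {6}        # roll is a 6
B = {2, 4, 6}  # roll is even
print(prob(A))            # 1/6 before knowing anything
print(conditional(A, B))  # 1/3 once we know the roll is even
```

Knowing B shrinks the sample space from six outcomes to three, which is why the probability of a 6 doubles.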
Bayes' Theorem — The Crown Jewel of Probability
Bayes' Theorem
$$P(A|B) = \frac{P(B|A) \times P(A)}{P(B)}$$
Here,
$P(A|B)$ = Posterior probability
$P(B|A)$ = Likelihood
$P(A)$ = Prior probability
$P(B)$ = Evidence
In plain English:
ℹ️ Bayes' Theorem
$$\text{Posterior} = \frac{\text{Likelihood} \times \text{Prior}}{\text{Evidence}}$$
📝 Example: Medical Test
Disease affects 1% of people: $P(\text{Disease}) = 0.01$
Test sensitivity is 99% (it catches 99% of true cases): $P(\text{Positive}|\text{Disease}) = 0.99$
False positive rate: $P(\text{Positive}|\text{No Disease}) = 0.05$
You test positive. What's the probability you have the disease?
$$P(\text{Disease}|\text{Positive}) = \frac{P(\text{Positive}|\text{Disease}) \times P(\text{Disease})}{P(\text{Positive})}$$
$$P(\text{Positive}) = P(\text{Positive}|\text{Disease}) \times P(\text{Disease}) + P(\text{Positive}|\text{No Disease}) \times P(\text{No Disease})$$
$$= 0.99 \times 0.01 + 0.05 \times 0.99 = 0.0099 + 0.0495 = 0.0594$$
$$P(\text{Disease}|\text{Positive}) = \frac{0.0099}{0.0594} \approx 0.1667 = 16.7\%$$
⚠️ Surprise
Even with a positive test, you only have a 16.7% chance of having the disease! This is why base rates matter.
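The medical test calculation can be reproduced in a few lines. This is a minimal sketch; the function name `posterior` and its parameter names are illustrative, and the numbers are the ones from the example above.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(Disease | Positive) via Bayes' theorem for a two-outcome test."""
    # Evidence: total probability of a positive result (true + false positives).
    evidence = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / evidence

p = posterior(prior=0.01, sensitivity=0.99, false_positive_rate=0.05)
print(round(p, 4))  # 0.1667

# Same test, different base rates: the posterior depends heavily on the prior.
for base_rate in (0.001, 0.01, 0.1):
    print(base_rate, posterior(base_rate, 0.99, 0.05))
```

Rerunning with a larger prior shows the posterior climbing quickly, which is the base-rate effect in action.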
Applications in AI:
Naive Bayes classifier: Text classification, spam filtering
Bayesian networks: Causal reasoning
Bayesian optimization: Hyperparameter tuning
Probabilistic programming: Stan, PyMC, Edward
Random Variables
Definition: Random Variable
A random variable is a variable whose value is determined by chance.
Discrete Random Variable: Can take specific values (countable)
Number of emails per day: {0, 1, 2, 3, ...}
Coin flips: {0, 1} (0=tails, 1=heads)
Continuous Random Variable: Can take any value in a range
Height: any value between 0 and 3 meters
Temperature: any real number
Probability Distributions
Discrete Distributions
Bernoulli Distribution: Single coin flip
Bernoulli Distribution
$$P(X=1) = p, \quad P(X=0) = 1-p$$
Here,
$p$ = Probability of success
Mean: $p$
Variance: $p(1-p)$
Used in: Binary classification output
Binomial Distribution: Number of successes in n trials
Binomial Distribution
$$P(X=k) = \binom{n}{k} p^k (1-p)^{n-k}$$
Here,
$n$ = Number of trials
$k$ = Number of successes
$p$ = Probability of success
Mean: $np$
Variance: $np(1-p)$
📝 Example: Binomial Distribution
In 10 coin flips, P(exactly 5 heads):
$$P(X=5) = \binom{10}{5} \times 0.5^5 \times 0.5^5 = \frac{252}{1024} \approx 0.246$$
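The binomial formula translates directly into code. The sketch below uses `math.comb` for the binomial coefficient; the function name `binomial_pmf` is an illustrative choice.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for n independent trials with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

p5 = binomial_pmf(5, 10, 0.5)
print(round(p5, 3))  # 0.246

# Sanity checks: the probabilities sum to 1 and the mean equals n * p.
total = sum(binomial_pmf(k, 10, 0.5) for k in range(11))
mean = sum(k * binomial_pmf(k, 10, 0.5) for k in range(11))
print(total, mean)
```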
Poisson Distribution: Number of events in a fixed time/area
Poisson Distribution
$$P(X=k) = \frac{\lambda^k e^{-\lambda}}{k!}$$
Here,
$\lambda$ = Average rate of events
Mean: $\lambda$
Variance: $\lambda$
📝 Example: Poisson Distribution
If you receive 3 emails/hour on average:
$$P(5 \text{ emails in an hour}) = \frac{3^5 \times e^{-3}}{5!} \approx 0.1008$$
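The email example can be verified with a short function. This is a sketch using only the standard library; `poisson_pmf` is an illustrative name.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) when events arrive at an average rate of lam per interval."""
    return lam**k * exp(-lam) / factorial(k)

p = poisson_pmf(5, 3)  # 5 emails in an hour at 3 emails/hour on average
print(round(p, 4))  # 0.1008

# The most likely counts cluster around the rate lam = 3.
for k in range(7):
    print(k, round(poisson_pmf(k, 3), 4))
```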
Geometric Distribution: Number of trials until first success
Geometric Distribution
$$P(X=k) = (1-p)^{k-1} \times p$$
Here,
$p$ = Probability of success
Mean: $1/p$
Variance: $(1-p)/p^2$
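As an illustration of the geometric distribution, the sketch below asks how long it takes to roll the first 6 on a fair die. The die scenario and the name `geometric_pmf` are assumptions for the example; the infinite sum for the mean is truncated at a large cutoff.

```python
def geometric_pmf(k, p):
    """P(first success occurs on trial k), for k = 1, 2, 3, ..."""
    if k < 1:
        raise ValueError("k counts trials, so it must be at least 1")
    return (1 - p) ** (k - 1) * p

p = 1 / 6  # success = rolling a 6
first_on_third = geometric_pmf(3, p)  # two failures, then a success
print(first_on_third)  # (5/6)^2 * (1/6) = 25/216

# Approximate the mean by truncating the infinite sum; it should be close to 1/p = 6.
approx_mean = sum(k * geometric_pmf(k, p) for k in range(1, 2000))
print(approx_mean)
```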
Continuous Distributions
Uniform Distribution: Every value equally likely
Uniform Distribution
$$f(x) = \frac{1}{b-a} \quad \text{for } a \leq x \leq b$$
Here,
$a$ = Lower bound
$b$ = Upper bound
Mean: $(a+b)/2$
Variance: $(b-a)^2/12$
Used in: Random initialization, Monte Carlo methods
Normal (Gaussian) Distribution — THE Most Important Distribution
Normal Distribution
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
Here,
$\mu$ = Mean (center)
$\sigma$ = Standard deviation (spread)
$\sigma^2$ = Variance
ℹ️ Properties of Normal Distribution
Bell-shaped curve
68% of data within $\mu \pm \sigma$
95% within $\mu \pm 2\sigma$
99.7% within $\mu \pm 3\sigma$ (the "3-sigma rule")
Why Normal Distribution is EVERYWHERE:
Central Limit Theorem (see below)
Heights, weights, test scores are approximately normal
Noise in measurements is typically Gaussian
Many ML models assume Gaussian noise (for example, least-squares regression)
Standard Normal Distribution: $\mu=0$, $\sigma=1$
Standardization
$$Z = \frac{X - \mu}{\sigma}$$
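The 68-95-99.7 rule and standardization can both be checked numerically. The sketch below builds the normal CDF from `math.erf`; the function names and the height figures (mean 170 cm, standard deviation 10 cm) are hypothetical values chosen for illustration.

```python
from math import erf, sqrt

def normal_cdf(x, mu=0.0, sigma=1.0):
    """P(X <= x) for X ~ N(mu, sigma^2), written in terms of the error function."""
    z = (x - mu) / sigma  # standardize first
    return 0.5 * (1 + erf(z / sqrt(2)))

def within(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return normal_cdf(k) - normal_cdf(-k)

for k in (1, 2, 3):
    print(k, round(within(k), 4))  # 0.6827, 0.9545, 0.9973

# Standardization with hypothetical numbers: a height of 185 cm when mu = 170, sigma = 10.
z = (185 - 170) / 10
print(z, round(normal_cdf(185, mu=170, sigma=10), 4))  # z-score 1.5 and its percentile
```

Because any normal variable reduces to the standard normal after standardization, one CDF function covers every choice of $\mu$ and $\sigma$.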
Exponential Distribution: Time between events
Exponential Distribution
$$f(x) = \lambda e^{-\lambda x} \quad \text{for } x \geq 0$$
Mean: $1/\lambda$
Variance: $1/\lambda^2$
Used in: Modeling waiting times, survival analysis
Expected Value and Variance
Expected Value (Mean): The "average" outcome if you repeat the experiment many times.
Expected Value
$$E[X] = \sum x_i \times P(x_i)$$
Here,
$E[X]$ = Expected value of X
📝 Example: Expected Value
Roll a fair die:
$$E[X] = 1 \times \frac{1}{6} + 2 \times \frac{1}{6} + 3 \times \frac{1}{6} + 4 \times \frac{1}{6} + 5 \times \frac{1}{6} + 6 \times \frac{1}{6} = 3.5$$
Variance: How spread out the values are.
Variance
$$\text{Var}(X) = E[(X - \mu)^2] = E[X^2] - (E[X])^2$$
Here,
$\text{Var}(X)$ = Variance of X
Standard Deviation: $\sigma = \sqrt{\text{Var}(X)}$
Properties:
$E[aX + b] = aE[X] + b$
$\text{Var}(aX + b) = a^2\text{Var}(X)$
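The expected value, both variance formulas, and the two properties above can be confirmed exactly for a fair die. In this sketch the helper `expectation` and the constants `a = 2`, `b = 3` are illustrative choices.

```python
from fractions import Fraction

faces = range(1, 7)
p = Fraction(1, 6)  # each face of a fair die is equally likely

def expectation(f=lambda x: x):
    """E[f(X)] = sum of f(x) * P(x) over all faces."""
    return sum(f(x) * p for x in faces)

mean = expectation()                                    # E[X]
var_shortcut = expectation(lambda x: x**2) - mean**2    # E[X^2] - (E[X])^2
var_definition = expectation(lambda x: (x - mean) ** 2) # E[(X - mu)^2]
print(mean, var_shortcut, var_definition)  # 7/2 35/12 35/12

# Linear transformation Y = aX + b: the mean shifts and scales, the variance only scales.
a, b = 2, 3
mean_y = expectation(lambda x: a * x + b)
var_y = expectation(lambda x: (a * x + b - mean_y) ** 2)
print(mean_y == a * mean + b, var_y == a**2 * var_shortcut)  # True True
```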
Covariance: How two variables move together
Covariance
$$\text{Cov}(X,Y) = E[(X-\mu_X)(Y-\mu_Y)] = E[XY] - E[X]E[Y]$$
Here,
$\text{Cov}(X,Y)$ = Covariance between X and Y
$\text{Cov} > 0$: X and Y tend to increase together
$\text{Cov} < 0$: One increases while the other decreases
$\text{Cov} = 0$: No linear relationship
Correlation: Normalized covariance (-1 to 1)
Correlation
$$\rho(X,Y) = \frac{\text{Cov}(X,Y)}{\sigma_X \times \sigma_Y}$$
Here,
$\rho(X,Y)$ = Correlation coefficient
$\rho = 1$: Perfect positive correlation
$\rho = -1$: Perfect negative correlation
$\rho = 0$: No linear correlation
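The three correlation cases can be demonstrated on tiny made-up datasets. This sketch computes population covariance (dividing by n, matching the expectation formula); the data lists are hypothetical and chosen so each case comes out exactly.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def covariance(xs, ys):
    """Population covariance: the average of (x - mean_x) * (y - mean_y)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def correlation(xs, ys):
    """Covariance divided by the product of the standard deviations."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))

xs = [1, 2, 3, 4, 5]
ys_up = [2 * x + 1 for x in xs]        # rises with x
ys_down = [-x for x in xs]             # falls as x rises
ys_curve = [(x - 3) ** 2 for x in xs]  # depends on x, but not linearly

print(correlation(xs, ys_up))     # 1.0
print(correlation(xs, ys_down))   # -1.0
print(correlation(xs, ys_curve))  # 0.0
```

The last case is worth noting: `ys_curve` is completely determined by `xs`, yet the correlation is zero, because correlation only measures linear relationships.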
Joint, Marginal, and Conditional Distributions
Joint Distribution: $P(X=x, Y=y)$, the probability of both happening simultaneously
Marginal Distribution: Get one variable by summing/integrating out the other
Marginal Distribution
$$P(X=x) = \sum_y P(X=x, Y=y)$$
Here,
$P(X=x)$ = Marginal probability
Conditional Distribution: Probability of one variable given another
Conditional Distribution
$$P(X|Y=y) = \frac{P(X, Y=y)}{P(Y=y)}$$
Here,
$P(X|Y=y)$ = Conditional probability
Independence:
ℹ️ Independence
X and Y are independent if:
$$P(X=x, Y=y) = P(X=x) \times P(Y=y) \quad \text{for all } x, y$$
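A small joint table makes these definitions concrete. The probabilities in `joint` below are hypothetical values for two binary variables, invented purely for illustration; the helper names are likewise assumptions.

```python
from fractions import Fraction

# Hypothetical joint distribution P(X=x, Y=y) for two binary variables.
joint = {
    (0, 0): Fraction(1, 4), (0, 1): Fraction(1, 4),
    (1, 0): Fraction(1, 8), (1, 1): Fraction(3, 8),
}

def marginal_x(x):
    """P(X=x): sum the joint probabilities over every value of Y."""
    return sum(p for (xi, _), p in joint.items() if xi == x)

def marginal_y(y):
    """P(Y=y): sum the joint probabilities over every value of X."""
    return sum(p for (_, yi), p in joint.items() if yi == y)

def conditional_x_given_y(x, y):
    """P(X=x | Y=y) = P(X=x, Y=y) / P(Y=y)."""
    return joint[(x, y)] / marginal_y(y)

def is_independent():
    """True only if the joint equals the product of marginals for every (x, y)."""
    return all(p == marginal_x(x) * marginal_y(y) for (x, y), p in joint.items())

print(marginal_x(1), marginal_y(1))  # 1/2 5/8
print(conditional_x_given_y(1, 1))   # 3/5
print(is_independent())              # False
```

Here knowing Y=1 raises the probability of X=1 from 1/2 to 3/5, which is exactly what it means for the two variables to be dependent.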
Central Limit Theorem (CLT)
Theorem: Central Limit Theorem
For independent samples from any distribution with finite mean $\mu$ and finite variance $\sigma^2$, the distribution of the sample mean approaches a normal distribution as the sample size increases.
CLT Statement
$$\bar{X} \sim \text{approximately } N\left(\mu, \frac{\sigma^2}{n}\right) \text{ for large } n$$
Here,
$\bar{X}$ = Sample mean
$n$ = Sample size
$\mu$, $\sigma^2$ = Mean and variance of the underlying distribution
💡 Why this is HUGE
It explains why the normal distribution appears everywhere
It allows us to make confidence intervals
It justifies hypothesis testing
It works regardless of the shape of the original distribution, as long as its variance is finite!
Rule of thumb: $n \geq 30$ is usually enough for the CLT to kick in, though heavily skewed or heavy-tailed data may need more.
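A quick simulation shows the CLT at work on a distribution that is far from normal. This sketch draws from an exponential distribution with rate 1 (mean 1, variance 1); the sample size of 30, the 5,000 repetitions, and the fixed seed are arbitrary choices for illustration.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 30                  # size of each sample
repeats = 5000          # number of sample means to collect

def sample_mean(size):
    """Mean of `size` draws from a skewed Exponential(1) distribution."""
    return statistics.fmean(rng.expovariate(1.0) for _ in range(size))

means = [sample_mean(n) for _ in range(repeats)]

# CLT prediction: the means are roughly N(mu, sigma^2 / n) = N(1, 1/30).
print("mean of sample means:    ", statistics.fmean(means))
print("variance of sample means:", statistics.pvariance(means), "vs", 1 / n)

# Roughly 68% of the sample means should land within one standard error of mu.
standard_error = (1 / n) ** 0.5
share = sum(abs(m - 1.0) <= standard_error for m in means) / repeats
print("share within one standard error:", share)
```

Plotting a histogram of `means` would show the familiar bell shape, even though individual exponential draws are strongly skewed.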
Maximum Likelihood Estimation (MLE)
The Idea: Given some data, find the parameters that make the data MOST probable.
Likelihood Function
$$L(\theta) = P(\text{data} | \theta) = \prod_i P(x_i | \theta)$$
Here,
$\theta$ = Parameters to estimate
Log-likelihood: $\log L(\theta) = \sum_i \log P(x_i | \theta)$ (easier to work with)
MLE: $\hat{\theta} = \arg\max_\theta \log L(\theta)$
📝 Example: MLE for Coin Flip
Data: H, H, T, H, H, T, H (5 heads, 2 tails)
$P(H) = p$, $P(T) = 1-p$
$$L(p) = p^5 \times (1-p)^2$$
$$\log L(p) = 5\log(p) + 2\log(1-p)$$
$$\frac{d}{dp}[\log L(p)] = \frac{5}{p} - \frac{2}{1-p} = 0$$
$$5(1-p) = 2p$$
$$5 - 5p = 2p$$
$$\hat{p} = \frac{5}{7} \approx 0.714$$
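The coin-flip derivation can be double-checked numerically. The sketch below evaluates the log-likelihood on a grid of candidate values for $p$ and compares the best one with the closed-form answer; the grid resolution of 0.001 is an arbitrary choice, and a grid search stands in here for the calculus used above.

```python
from math import log

flips = "HHTHHTH"  # the observed data: 5 heads, 2 tails
heads = flips.count("H")
tails = flips.count("T")

def log_likelihood(p):
    """log L(p) = heads * log(p) + tails * log(1 - p)."""
    return heads * log(p) + tails * log(1 - p)

# Candidate values strictly between 0 and 1 (log is undefined at the endpoints).
grid = [i / 1000 for i in range(1, 1000)]
p_grid = max(grid, key=log_likelihood)

p_closed_form = heads / (heads + tails)  # the 5/7 from the derivation
print(p_grid, round(p_closed_form, 3))   # 0.714 0.714
```

For models with many parameters a grid is impractical, which is why gradient-based optimizers are typically used to maximize the log-likelihood instead.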
Applications in AI:
Logistic regression uses MLE
Training neural networks with cross-entropy loss ≡ MLE
Gaussian Mixture Models use MLE (via EM algorithm)
Common Probability Mistakes
⚠️ Common Mistakes
Base rate neglect: Ignoring prior probability (the medical test example)
Confusion of the inverse: $P(A|B) \neq P(B|A)$ in general
Gambler's fallacy: Expecting past outcomes to change the odds of independent events (they don't)
Small sample bias: Small samples can look very different from the population
Correlation ≠ Causation: Two things moving together doesn't mean one causes the other
📋 Key Takeaways
Probability quantifies uncertainty from 0 to 1. $P(\text{event}) = \frac{\text{favorable outcomes}}{\text{total outcomes}}$ is the foundation for all statistical reasoning in AI.
Bayes' Theorem reverses conditional probabilities. $P(A|B) = \frac{P(B|A) \times P(A)}{P(B)}$ lets you update beliefs with evidence, and it is the engine behind Naive Bayes classifiers and Bayesian optimization.
The Normal Distribution is everywhere. $f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$ with the 68-95-99.7 rule: most data falls within 3 standard deviations of the mean.
The Central Limit Theorem explains why normality appears everywhere. Sample means approach a normal distribution $\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$ regardless of the shape of the underlying distribution (given finite variance), which is the theoretical basis for confidence intervals and hypothesis testing.
MLE finds the parameters that maximize data likelihood. $\hat{\theta} = \arg\max_\theta \sum_i \log P(x_i | \theta)$ is used in logistic regression, training with cross-entropy loss, and Gaussian Mixture Models.
Independence means $P(X,Y) = P(X) \times P(Y)$. Understanding when variables are independent vs. correlated is critical for feature selection and avoiding spurious patterns in data.