← Math|34 of 100
Probability

Probability Foundations

Master sample spaces, axioms of probability, counting, combinatorics, and practical applications in AI/ML.

📂 Basics📖 Lesson 34 of 100🎓 Free Course

Advertisement

Probability Foundations

💡 Why It Matters

Probability is the mathematical framework for quantifying uncertainty. In machine learning, every model—from simple linear regression to complex neural networks—operates under uncertainty. Probabilistic thinking is crucial for model evaluation, risk assessment, and decision-making. Understanding probability foundations helps you debug models, design experiments, and interpret results critically.


What is Probability

DfProbability

Probability is a measure of the likelihood that an event will occur. It is a number between 0 and 1, where 0 indicates impossibility and 1 indicates certainty. Formally, for a random experiment, the probability of an event AA is denoted P(A)P(A) and satisfies the axioms of probability.


Sample Space and Events

DfSample Space

The sample space Ω\Omega is the set of all possible outcomes of a random experiment. For example:

  • Tossing a coin: Ω={heads,tails}\Omega = \{\text{heads}, \text{tails}\}
  • Rolling a die: Ω={1,2,3,4,5,6}\Omega = \{1, 2, 3, 4, 5, 6\}
  • Measuring height: Ω=[0,)\Omega = [0, \infty) (continuous)

DfEvents

An event is a subset of the sample space. Events can be:

  • Simple event: A single outcome, e.g., {3}\{3\} when rolling a die.
  • Compound event: Multiple outcomes, e.g., {2,4,6}\{2, 4, 6\} (even numbers).
  • Impossible event: \emptyset (empty set), probability 0.
  • Certain event: Ω\Omega, probability 1.

Axioms of Probability

ThKolmogorov Axioms of Probability

For a sample space Ω\Omega and event AΩA \subseteq \Omega:

  1. Non-negativity: P(A)0P(A) \geq 0
  2. Normalization: P(Ω)=1P(\Omega) = 1
  3. Additivity: For mutually exclusive events AA and BB (i.e., AB=A \cap B = \emptyset):
P(AB)=P(A)+P(B)P(A \cup B) = P(A) + P(B)

These axioms form the foundation of all probability theory. The third axiom extends to countably many events via sigma-additivity.


Basic Properties

Derived from the axioms, these properties are essential:

  • Complement Rule: P(Ac)=1P(A)P(A^c) = 1 - P(A), where AcA^c is the complement of AA.
  • Monotonicity: If ABA \subseteq B, then P(A)P(B)P(A) \leq P(B).
  • General Addition Rule: For any events AA and BB:
P(AB)=P(A)+P(B)P(AB)P(A \cup B) = P(A) + P(B) - P(A \cap B)
  • Bonferroni's Inequality: P(AB)P(A)+P(B)1P(A \cap B) \geq P(A) + P(B) - 1.
  • Boole's Inequality: P(i=1nAi)i=1nP(Ai)P(\bigcup_{i=1}^n A_i) \leq \sum_{i=1}^n P(A_i) (union bound).

Counting Principles

Counting techniques are fundamental for computing probabilities in discrete settings.

Permutations (Ordered Selection)

P(n,k)=n!(nk)!P(n, k) = \frac{n!}{(n-k)!}

Here,

  • nn=Total number of distinct items
  • kk=Number of items to select and arrange
  • P(n,k)P(n,k)=Number of ordered arrangements of k items from n

Combinations (Unordered Selection)

(nk)=n!k!(nk)!\binom{n}{k} = \frac{n!}{k!(n-k)!}

Here,

  • nn=Total number of distinct items
  • kk=Number of items to choose
  • (nk)\binom{n}{k}=Number of unordered selections of k items from n

Extended Counting Formulas

  • Permutations with Repetition: n!n1!n2!nk!\frac{n!}{n_1! n_2! \cdots n_k!} for multiset.
  • Circular Permutations: (n1)!(n-1)! for arranging nn items in a circle.
  • Stars and Bars: Number of ways to distribute nn identical items into kk distinct bins: (n+k1k1)\binom{n+k-1}{k-1}.

Inclusion-Exclusion

Inclusion-Exclusion Principle

P(i=1nAi)=iP(Ai)i<jP(AiAj)+i<j<kP(AiAjAk)+(1)n+1P(A1An)P\left(\bigcup_{i=1}^n A_i\right) = \sum_{i} P(A_i) - \sum_{i<j} P(A_i \cap A_j) + \sum_{i<j<k} P(A_i \cap A_j \cap A_k) - \cdots + (-1)^{n+1} P(A_1 \cap \cdots \cap A_n)

Here,

  • AiA_i=Events in the sample space
  • P(Ai)P(\cup A_i)=Probability of at least one event occurring

Example: Two Events

P(AB)=P(A)+P(B)P(AB)P(A \cup B) = P(A) + P(B) - P(A \cap B)

Example: Three Events

P(ABC)=P(A)+P(B)+P(C)P(AB)P(AC)P(BC)+P(ABC)P(A \cup B \cup C) = P(A) + P(B) + P(C) - P(A \cap B) - P(A \cap C) - P(B \cap C) + P(A \cap B \cap C)

Complement Rule

Complement Rule

P(Ac)=1P(A)P(A^c) = 1 - P(A)

Here,

  • AA=Event of interest
  • AcA^c=Complement of A (event not occurring)
  • P(Ac)P(A^c)=Probability that A does not occur

This rule is particularly useful for "at least one" problems:

P(at least one A)=1P(no A)P(\text{at least one } A) = 1 - P(\text{no } A)

Python Implementation

import numpy as np
from math import comb, perm, factorial
from itertools import combinations, permutations

# Basic counting
print(f"Permutations P(10,3): {perm(10, 3)}")  # 720
print(f"Combinations C(10,3): {comb(10, 3)}")  # 120

# Probability of union
def prob_union(p_a, p_b, p_ab):
    """P(A ∪ B) = P(A) + P(B) - P(A ∩ B)"""
    return p_a + p_b - p_ab

# Inclusion-Exclusion for three events
def prob_union_three(p_a, p_b, p_c, p_ab, p_ac, p_bc, p_abc):
    """P(A ∪ B ∪ C) using inclusion-exclusion"""
    return (p_a + p_b + p_c 
            - p_ab - p_ac - p_bc 
            + p_abc)

# Birthday problem
def birthday_prob(n, days=365):
    """Probability of at least one collision in n birthdays"""
    if n > days:
        return 1.0
    prob_no_collision = 1.0
    for i in range(n):
        prob_no_collision *= (days - i) / days
    return 1 - prob_no_collision

# Simulation approach
def simulate_birthday(n, days=365, trials=100000):
    """Monte Carlo simulation for birthday problem"""
    collisions = 0
    for _ in range(trials):
        birthdays = np.random.randint(0, days, size=n)
        if len(np.unique(birthdays)) < n:
            collisions += 1
    return collisions / trials

# Dice problems
def dice_probability(target_sum, num_dice=2, faces=6):
    """Probability of getting target_sum with num_dice dice"""
    # Count favorable outcomes
    favorable = 0
    total = faces ** num_dice
    
    # Brute force for small cases
    if num_dice == 2:
        for i in range(1, faces+1):
            for j in range(1, faces+1):
                if i + j == target_sum:
                    favorable += 1
    else:
        # Use dynamic programming
        dp = np.zeros((num_dice + 1, target_sum + 1))
        dp[0][0] = 1
        for die in range(1, num_dice + 1):
            for sum_val in range(1, target_sum + 1):
                for face in range(1, min(faces, sum_val) + 1):
                    dp[die][sum_val] += dp[die-1][sum_val-face]
        favorable = dp[num_dice][target_sum]
    
    return favorable / total

# Bayes' theorem
def bayes_theorem(p_b_given_a, p_a, p_b):
    """P(A|B) = P(B|A) * P(A) / P(B)"""
    return (p_b_given_a * p_a) / p_b

# Example: Medical test
p_disease = 0.01  # Prior probability
p_positive_given_disease = 0.95  # Sensitivity
p_positive_given_healthy = 0.05  # False positive rate

p_healthy = 1 - p_disease
p_positive = (p_positive_given_disease * p_disease + 
              p_positive_given_healthy * p_healthy)
p_disease_given_positive = bayes_theorem(
    p_positive_given_disease, p_disease, p_positive
)

print(f"Probability of disease given positive test: {p_disease_given_positive:.4f}")

# Examples
print(f"\nBirthday problem (23 people): {birthday_prob(23):.4f}")
print(f"Dice: P(sum=7 with 2 dice) = {dice_probability(7):.4f}")
print(f"Simulation: {simulate_birthday(23):.4f}")

Applications in AI/ML

ℹ️ Probability in Machine Learning

Probability theory underpins modern AI/ML systems:

  1. Uncertainty Quantification: Neural networks with dropout provide uncertainty estimates via Monte Carlo sampling.
  2. Bayesian Reasoning: Updating beliefs with evidence—core to Bayesian optimization and probabilistic graphical models.
  3. Loss Functions: Cross-entropy loss is derived from maximum likelihood estimation.
  4. A/B Testing: Comparing conversion rates requires hypothesis testing based on probability.
  5. Generative Models: VAEs and GANs learn probability distributions over data.

Key Concepts

  • Bayes' Theorem: P(AB)=P(BA)P(A)P(B)P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}
  • Maximum Likelihood Estimation (MLE): θ^=argmaxθP(dataθ)\hat{\theta} = \arg\max_\theta P(\text{data}|\theta)
  • Posterior Distribution: P(θdata)P(dataθ)P(θ)P(\theta|\text{data}) \propto P(\text{data}|\theta) \cdot P(\theta)

Common Mistakes

MistakeExplanationCorrect Approach
Assuming events are independentP(AB)=P(A)P(B)P(A \cap B) = P(A)P(B) only if independentCheck if one event affects the other
Confusing permutations and combinationsOrder matters for permutationsUse combinations when order doesn't matter
Ignoring the complement rule"At least one" problems are easier with complementsAlways consider 1P(none)1 - P(\text{none})
Double-counting in inclusion-exclusionOverlapping probabilities counted multiple timesUse the full inclusion-exclusion formula
Forgetting conditional probabilityP(AB)P(A)P(A|B) \neq P(A) in generalApply Bayes' theorem when conditioning
Assuming equally likely outcomesNot all sample spaces have uniform probabilitiesUse the definition of probability, not symmetry

Interview Questions

1. What is the probability of getting at least one heads in 3 coin flips?

Answer: 1P(all tails)=1(1/2)3=7/8=0.8751 - P(\text{all tails}) = 1 - (1/2)^3 = 7/8 = 0.875

2. Explain the difference between independent and mutually exclusive events.

Answer:

  • Independent: P(AB)=P(A)P(B)P(A \cap B) = P(A)P(B); one event doesn't affect the other.
  • Mutually exclusive: AB=A \cap B = \emptyset, so P(AB)=0P(A \cap B) = 0; they cannot occur together.
  • Note: Mutually exclusive events are generally not independent (unless one has probability 0).

3. How would you calculate the probability of being dealt a flush in poker?

Answer: A flush is 5 cards of the same suit. Total 5-card hands: (525)\binom{52}{5}. Favorable: 4×(135)4 \times \binom{13}{5}. Probability: 4×(135)(525)0.00198\frac{4 \times \binom{13}{5}}{\binom{52}{5}} \approx 0.00198.

4. What is Bayes' theorem and why is it important in ML?

Answer: Bayes' theorem: P(AB)=P(BA)P(A)P(B)P(A|B) = \frac{P(B|A)P(A)}{P(B)}. It updates prior beliefs with evidence. In ML, it's used for classification (Naive Bayes), Bayesian optimization, and probabilistic models.

5. A test has 99% accuracy and 1% prevalence of disease. If you test positive, what's the probability you have the disease?

Answer: Using Bayes' theorem: P(diseasepositive)=0.99×0.010.99×0.01+0.01×0.99=0.5P(\text{disease}|\text{positive}) = \frac{0.99 \times 0.01}{0.99 \times 0.01 + 0.01 \times 0.99} = 0.5 (50%).

6. How many ways can you arrange 5 books on a shelf?

Answer: 5!=1205! = 120 permutations.

7. What's the probability of rolling a sum of 7 with two dice?

Answer: Favorable outcomes: (1,6), (2,5), (3,4), (4,3), (5,2), (6,1) = 6. Total: 36. Probability: 6/36=1/60.16676/36 = 1/6 \approx 0.1667.


Practice Problems

📝Problem 1: Card Draw

You draw 2 cards from a standard deck without replacement. What is the probability both are aces?

💡Solution 1

Total ways to draw 2 cards: (522)=1326\binom{52}{2} = 1326. Favorable: (42)=6\binom{4}{2} = 6. Probability: 61326=12210.00452\frac{6}{1326} = \frac{1}{221} \approx 0.00452.


📝Problem 2: Conditional Probability

In a class, 60% study math, 40% study science, and 25% study both. If a student studies math, what's the probability they also study science?

💡Solution 2

P(sciencemath)=P(both)P(math)=0.250.600.4167P(\text{science}|\text{math}) = \frac{P(\text{both})}{P(\text{math})} = \frac{0.25}{0.60} \approx 0.4167.


📝Problem 3: Inclusion-Exclusion

In a survey of 100 people: 70 like tea, 50 like coffee, and 30 like both. How many like neither?

💡Solution 3

P(teacoffee)=0.7+0.50.3=0.9P(\text{tea} \cup \text{coffee}) = 0.7 + 0.5 - 0.3 = 0.9. So 90% like at least one. Thus, 10% like neither: 100×0.10=10100 \times 0.10 = 10 people.


📝Problem 4: Bayes' Theorem

A factory has two machines. Machine A produces 60% of items, with 5% defect rate. Machine B produces 40%, with 10% defect rate. An item is defective. What's the probability it came from Machine A?

💡Solution 4

Let AA = from machine A, DD = defective. We want P(AD)P(A|D).

P(D)=P(DA)P(A)+P(DB)P(B)=0.05×0.6+0.10×0.4=0.03+0.04=0.07P(D) = P(D|A)P(A) + P(D|B)P(B) = 0.05 \times 0.6 + 0.10 \times 0.4 = 0.03 + 0.04 = 0.07.

P(AD)=P(DA)P(A)P(D)=0.05×0.60.07=0.030.070.4286P(A|D) = \frac{P(D|A)P(A)}{P(D)} = \frac{0.05 \times 0.6}{0.07} = \frac{0.03}{0.07} \approx 0.4286.


📝Problem 5: Counting with Repetition

How many 4-letter words can be formed from the letters in "PROBABILITY" (with repetition allowed)?

💡Solution 5

Distinct letters: P, R, O, B, A, I, T, Y (8 letters). With repetition allowed: 84=40968^4 = 4096 possible words.


Probability Inequalities

ℹ️ Markov's Inequality

For a non-negative random variable XX and a>0a > 0:

P(Xa)E[X]aP(X \geq a) \leq \frac{E[X]}{a}

This provides an upper bound on the probability that a random variable exceeds a certain value.

ℹ️ Chebyshev's Inequality

For a random variable XX with mean μ\mu and variance σ2\sigma^2, and for any k>0k > 0:

P(Xμkσ)1k2P(|X - \mu| \geq k\sigma) \leq \frac{1}{k^2}

This bounds the probability that XX deviates from its mean by more than kk standard deviations.

ℹ️ Union Bound (Boole's Inequality)

For any events A1,A2,,AnA_1, A_2, \ldots, A_n:

P(i=1nAi)i=1nP(Ai)P\left(\bigcup_{i=1}^n A_i\right) \leq \sum_{i=1}^n P(A_i)

Useful for upper-bounding the probability of at least one event occurring.


Quick Reference

ConceptFormulaNotes
Sample SpaceΩ\OmegaSet of all outcomes
EventAΩA \subseteq \OmegaSubset of outcomes
ComplementP(Ac)=1P(A)P(A^c) = 1 - P(A)Probability event doesn't occur
Addition RuleP(AB)=P(A)+P(B)P(AB)P(A \cup B) = P(A) + P(B) - P(A \cap B)For any events
Multiplication RuleP(AB)=P(AB)P(B)P(A \cap B) = P(A|B)P(B)Chain rule
IndependenceP(AB)=P(A)P(B)P(A \cap B) = P(A)P(B)No influence
Mutually ExclusiveP(AB)=0P(A \cap B) = 0Cannot co-occur
Bayes' TheoremP(AB)=P(BA)P(A)P(B)P(A|B) = \frac{P(B|A)P(A)}{P(B)}Update beliefs
PermutationsP(n,k)=n!(nk)!P(n,k) = \frac{n!}{(n-k)!}Order matters
Combinations(nk)=n!k!(nk)!\binom{n}{k} = \frac{n!}{k!(n-k)!}Order doesn't matter
Inclusion-ExclusionP(Ai)=P(Ai)P(AiAj)+P(\cup A_i) = \sum P(A_i) - \sum P(A_i \cap A_j) + \cdotsAvoid double-counting
Law of Total ProbabilityP(A)=iP(ABi)P(Bi)P(A) = \sum_i P(A|B_i)P(B_i)Partition of sample space

Cross-References


📋Key Takeaways

  • Probability axioms (P(A)0P(A) \geq 0, P(Ω)=1P(\Omega) = 1, additivity) are the foundation. All properties derive from them.
  • Counting principles (permutations, combinations) enable exact probability calculations in discrete settings.
  • Inclusion-exclusion and the complement rule simplify complex probability calculations.
  • Bayes' theorem is essential for updating probabilities with new evidence—critical in AI/ML.
  • Python implementation allows simulation and computation of probabilities for real-world problems.
  • Common mistakes include confusing independence with mutual exclusion and forgetting conditional probability.
  • Applications span from A/B testing to uncertainty quantification in neural networks.
  • Practice regularly to build intuition—probability is both conceptual and computational.
Lesson Progress34 / 100