Probability Foundations

💡 Why It Matters

Probability is the mathematical framework for quantifying uncertainty. In machine learning, every model—from simple linear regression to complex neural networks—operates under uncertainty. Probabilistic thinking is crucial for model evaluation, risk assessment, and decision-making. Understanding probability foundations helps you debug models, design experiments, and interpret results critically.

What is Probability

Definition: Probability

Probability is a measure of the likelihood that an event will occur. It is a number between 0 and 1, where 0 indicates impossibility and 1 indicates certainty. Formally, for a random experiment, the probability of an event $A$ is denoted $P(A)$ and satisfies the axioms of probability.

Sample Space and Events

Definition: Sample Space

The sample space $\Omega$ is the set of all possible outcomes of a random experiment. For example:

Tossing a coin: $\Omega = \{\text{heads}, \text{tails}\}$
Rolling a die: $\Omega = \{1, 2, 3, 4, 5, 6\}$
Measuring height: $\Omega = [0, \infty)$ (continuous)

Definition: Events

An event is a subset of the sample space. Events can be:

Simple event: A single outcome, e.g., $\{3\}$ when rolling a die.
Compound event: Multiple outcomes, e.g., $\{2, 4, 6\}$ (even numbers).
Impossible event: $\emptyset$ (empty set), probability 0.
Certain event: $\Omega$, probability 1.
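As a minimal sketch of these definitions, events can be modeled as Python sets and their probabilities computed by counting. The fair-die assumption (all six outcomes equally likely) and the helper name `prob` are illustrative choices, not part of the definitions themselves.

```python
from fractions import Fraction

# Sample space for one roll of a six-sided die (assumed fair, so outcomes are equally likely).
omega = frozenset(range(1, 7))

def prob(event, space=omega):
    """P(event) under equally likely outcomes: |event ∩ space| / |space|."""
    return Fraction(len(event & space), len(space))

simple = frozenset({3})         # simple event: a single outcome
even = frozenset({2, 4, 6})     # compound event: several outcomes
impossible = frozenset()        # impossible event: the empty set
certain = omega                 # certain event: the whole sample space

print(prob(simple), prob(even), prob(impossible), prob(certain))  # 1/6 1/2 0 1
```

Using `Fraction` keeps the results exact, which makes it easier to compare them with hand calculations.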

Axioms of Probability

Theorem: Kolmogorov Axioms of Probability

For a sample space $\Omega$ and event $A \subseteq \Omega$:

Non-negativity: $P(A) \geq 0$
Normalization: $P(\Omega) = 1$
Additivity: For mutually exclusive events $A$ and $B$ (i.e., $A \cap B = \emptyset$):

$$P(A \cup B) = P(A) + P(B)$$

These axioms form the foundation of all probability theory. In Kolmogorov's formulation the third axiom is stated for any countable collection of pairwise disjoint events (sigma-additivity); the two-event form above is a special case.
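The axioms can be checked mechanically for any finite assignment of probabilities. The sketch below does this for a hypothetical loaded die; the `pmf` values and the helper `prob` are made up for illustration.

```python
from fractions import Fraction

# Hypothetical loaded die: outcome -> probability (values chosen only for illustration).
pmf = {1: Fraction(1, 10), 2: Fraction(1, 10), 3: Fraction(1, 10),
       4: Fraction(1, 10), 5: Fraction(1, 10), 6: Fraction(1, 2)}

def prob(event):
    """P(event) as the sum of the probabilities of its outcomes."""
    return sum((pmf[w] for w in event), Fraction(0))

omega = set(pmf)
non_negative = all(p >= 0 for p in pmf.values())   # axiom 1 (outcome-level check)
normalized = prob(omega) == 1                      # axiom 2
a, b = {1, 2}, {6}                                 # two disjoint events
additive = prob(a | b) == prob(a) + prob(b)        # axiom 3 for this pair
print(non_negative, normalized, additive)          # True True True
```

Because event probabilities here are sums of non-negative outcome probabilities, non-negativity of every event follows from the outcome-level check.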

Basic Properties

Derived from the axioms, these properties are essential:

Complement Rule: $P(A^c) = 1 - P(A)$, where $A^c$ is the complement of $A$.
Monotonicity: If $A \subseteq B$, then $P(A) \leq P(B)$.
General Addition Rule: For any events $A$ and $B$:

$$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$

Bonferroni's Inequality: $P(A \cap B) \geq P(A) + P(B) - 1$.
Boole's Inequality: $P(\bigcup_{i=1}^n A_i) \leq \sum_{i=1}^n P(A_i)$ (union bound).
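These properties are easy to sanity-check on a small example. The sketch below assumes a fair die and two illustrative events; it verifies each property for that one case and is not a proof.

```python
from fractions import Fraction

omega = set(range(1, 7))   # fair die, outcomes assumed equally likely

def prob(event):
    return Fraction(len(event), len(omega))

a = {2, 4, 6}   # roll is even
b = {4, 5, 6}   # roll is at least four

complement_ok = prob(omega - a) == 1 - prob(a)
monotone_ok = prob(a & b) <= prob(a)                      # (A ∩ B) ⊆ A
addition_ok = prob(a | b) == prob(a) + prob(b) - prob(a & b)
bonferroni_ok = prob(a & b) >= prob(a) + prob(b) - 1
boole_ok = prob(a | b) <= prob(a) + prob(b)
print(complement_ok, monotone_ok, addition_ok, bonferroni_ok, boole_ok)
```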

Counting Principles

Counting techniques are fundamental for computing probabilities in discrete settings.

Permutations (Ordered Selection)

$$P(n, k) = \frac{n!}{(n-k)!}$$

Here,

$n$ = total number of distinct items
$k$ = number of items to select and arrange
$P(n,k)$ = number of ordered arrangements of $k$ items from $n$
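One way to build trust in the formula is to enumerate the arrangements and count them. The sketch below uses four placeholder items and compares the enumeration with the factorial formula and with `math.perm`.

```python
from itertools import permutations
from math import factorial, perm

items = ["a", "b", "c", "d"]   # placeholder items
k = 2

arrangements = list(permutations(items, k))                     # ordered selections
formula = factorial(len(items)) // factorial(len(items) - k)    # n! / (n-k)!
print(len(arrangements), formula, perm(len(items), k))          # 12 12 12
```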

Combinations (Unordered Selection)

$$\binom{n}{k} = \frac{n!}{k!(n-k)!}$$

Here,

$n$ = total number of distinct items
$k$ = number of items to choose
$\binom{n}{k}$ = number of unordered selections of $k$ items from $n$
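The same enumeration check works for unordered selections. The sketch below, again with placeholder items, also shows that each unordered selection of $k$ items corresponds to $k!$ ordered arrangements.

```python
from itertools import combinations
from math import comb, factorial, perm

items = ["a", "b", "c", "d", "e"]   # placeholder items
k = 3

selections = list(combinations(items, k))            # unordered selections
via_permutations = perm(len(items), k) // factorial(k)
print(len(selections), comb(len(items), k), via_permutations)   # 10 10 10
```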

Extended Counting Formulas

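Permutations with Repetition: $\frac{n!}{n_1! n_2! \cdots n_k!}$ distinct arrangements of a multiset of $n$ items in which type $i$ appears $n_i$ times and $n_1 + \cdots + n_k = n$.
Circular Permutations: $(n-1)!$ ways to arrange $n$ distinct items in a circle, counting rotations as the same arrangement.
Stars and Bars: The number of ways to distribute $n$ identical items into $k$ distinct bins (empty bins allowed) is $\binom{n+k-1}{k-1}$.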
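Each of these formulas can be compared against brute-force enumeration on a small case. In the sketch below the word "BANANA", the circle of 5 items, and the 4-items-into-3-bins split are arbitrary examples, and `multiset_permutations` is a hypothetical helper name.

```python
from itertools import permutations, product
from math import comb, factorial

def multiset_permutations(counts):
    """n! / (n_1! n_2! ... n_k!) for item types with the given multiplicities."""
    total = factorial(sum(counts))
    for c in counts:
        total //= factorial(c)
    return total

# Multiset: "BANANA" has B x1, A x3, N x2.
multiset_brute = len(set(permutations("BANANA")))
multiset_formula = multiset_permutations([1, 3, 2])

# Circular: pin item 0 in the first seat so that rotations are not counted twice.
n = 5
circular_brute = sum(1 for p in permutations(range(n)) if p[0] == 0)
circular_formula = factorial(n - 1)

# Stars and bars: non-negative bin sizes that add up to n_items.
n_items, k_bins = 4, 3
bins_brute = sum(1 for alloc in product(range(n_items + 1), repeat=k_bins)
                 if sum(alloc) == n_items)
bins_formula = comb(n_items + k_bins - 1, k_bins - 1)

print(multiset_brute, multiset_formula)   # 60 60
print(circular_brute, circular_formula)   # 24 24
print(bins_brute, bins_formula)           # 15 15
```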

Inclusion-Exclusion

Inclusion-Exclusion Principle

$$P\left(\bigcup_{i=1}^n A_i\right) = \sum_{i} P(A_i) - \sum_{i<j} P(A_i \cap A_j) + \sum_{i<j<k} P(A_i \cap A_j \cap A_k) - \cdots + (-1)^{n+1} P(A_1 \cap \cdots \cap A_n)$$

Here,

$A_i$ = events in the sample space
$P(\cup A_i)$ = probability of at least one event occurring

Example: Two Events

$$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$

Example: Three Events

$$P(A \cup B \cup C) = P(A) + P(B) + P(C) - P(A \cap B) - P(A \cap C) - P(B \cap C) + P(A \cap B \cap C)$$
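For a finite sample space, the general formula can be written as a loop over all non-empty subsets of the events. The sketch below assumes the integers 1 to 30 are equally likely and uses divisibility by 2, 3, and 5 as illustrative events; `union_prob` is a hypothetical helper name.

```python
from fractions import Fraction
from itertools import combinations

omega = set(range(1, 31))   # integers 1..30, assumed equally likely

def prob(event):
    return Fraction(len(event), len(omega))

def union_prob(events):
    """P(A_1 ∪ ... ∪ A_n) by inclusion-exclusion over all non-empty subsets."""
    total = Fraction(0)
    for r in range(1, len(events) + 1):
        sign = (-1) ** (r + 1)                    # + for odd r, - for even r
        for subset in combinations(events, r):
            total += sign * prob(set.intersection(*subset))
    return total

div2 = {x for x in omega if x % 2 == 0}
div3 = {x for x in omega if x % 3 == 0}
div5 = {x for x in omega if x % 5 == 0}

print(union_prob([div2, div3, div5]), prob(div2 | div3 | div5))   # 11/15 11/15
```

The number of terms grows as $2^n - 1$, so this direct form is only practical for a handful of events.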

Complement Rule

$$P(A^c) = 1 - P(A)$$

Here,

$A$ = event of interest
$A^c$ = complement of $A$ (the event that $A$ does not occur)
$P(A^c)$ = probability that $A$ does not occur

This rule is particularly useful for "at least one" problems:

$$P(\text{at least one } A) = 1 - P(\text{no } A)$$
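As a worked instance, consider the chance of at least one six in four rolls of a fair die (an illustrative problem, not taken from the text above). The sketch compares the complement-rule shortcut with a direct count over all $6^4$ outcomes.

```python
from fractions import Fraction
from itertools import product

rolls = 4

# Complement rule: 1 - P(no six in any roll), assuming independent fair rolls.
p_at_least_one = 1 - Fraction(5, 6) ** rolls

# Direct count over every possible sequence of rolls.
outcomes = list(product(range(1, 7), repeat=rolls))
p_brute = Fraction(sum(1 for o in outcomes if 6 in o), len(outcomes))

print(p_at_least_one, p_brute)   # 671/1296 671/1296
```

The complement needs one multiplication per roll, while the direct approach has to consider every sequence that contains one, two, three, or four sixes.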

Python Implementation

import numpy as np
from math import comb, perm

# Basic counting
print(f"Permutations P(10,3): {perm(10, 3)}")  # 720
print(f"Combinations C(10,3): {comb(10, 3)}")  # 120

# Probability of union
def prob_union(p_a, p_b, p_ab):
    """P(A ∪ B) = P(A) + P(B) - P(A ∩ B)"""
    return p_a + p_b - p_ab

# Inclusion-Exclusion for three events
def prob_union_three(p_a, p_b, p_c, p_ab, p_ac, p_bc, p_abc):
    """P(A ∪ B ∪ C) using inclusion-exclusion"""
    return (p_a + p_b + p_c 
            - p_ab - p_ac - p_bc 
            + p_abc)

# Birthday problem
def birthday_prob(n, days=365):
    """Probability of at least one collision in n birthdays"""
    if n > days:
        return 1.0
    prob_no_collision = 1.0
    for i in range(n):
        prob_no_collision *= (days - i) / days
    return 1 - prob_no_collision

# Simulation approach
def simulate_birthday(n, days=365, trials=100000):
    """Monte Carlo simulation for birthday problem"""
    collisions = 0
    for _ in range(trials):
        birthdays = np.random.randint(0, days, size=n)
        if len(np.unique(birthdays)) < n:
            collisions += 1
    return collisions / trials

# Dice problems
def dice_probability(target_sum, num_dice=2, faces=6):
    """Probability of getting target_sum with num_dice fair dice"""
    # Sums outside [num_dice, num_dice * faces] are impossible
    if target_sum < num_dice or target_sum > num_dice * faces:
        return 0.0
    total = faces ** num_dice

    # Dynamic programming: dp[d][s] = number of ways to roll sum s with d dice
    dp = [[0] * (target_sum + 1) for _ in range(num_dice + 1)]
    dp[0][0] = 1
    for die in range(1, num_dice + 1):
        for sum_val in range(1, target_sum + 1):
            for face in range(1, min(faces, sum_val) + 1):
                dp[die][sum_val] += dp[die-1][sum_val-face]
    favorable = dp[num_dice][target_sum]

    return favorable / total

# Bayes' theorem
def bayes_theorem(p_b_given_a, p_a, p_b):
    """P(A|B) = P(B|A) * P(A) / P(B)"""
    return (p_b_given_a * p_a) / p_b

# Example: Medical test
p_disease = 0.01  # Prior probability
p_positive_given_disease = 0.95  # Sensitivity
p_positive_given_healthy = 0.05  # False positive rate

p_healthy = 1 - p_disease
p_positive = (p_positive_given_disease * p_disease + 
              p_positive_given_healthy * p_healthy)
p_disease_given_positive = bayes_theorem(
    p_positive_given_disease, p_disease, p_positive
)

print(f"Probability of disease given positive test: {p_disease_given_positive:.4f}")

# Examples
print(f"\nBirthday problem (23 people): {birthday_prob(23):.4f}")
print(f"Dice: P(sum=7 with 2 dice) = {dice_probability(7):.4f}")
print(f"Simulation: {simulate_birthday(23):.4f}")

Applications in AI/ML

ℹ️ Probability in Machine Learning

Probability theory underpins modern AI/ML systems:

Uncertainty Quantification: Monte Carlo dropout keeps dropout active at inference time and aggregates several stochastic forward passes to estimate predictive uncertainty.
Bayesian Reasoning: Updating beliefs with evidence is the core of Bayesian optimization and probabilistic graphical models.
Loss Functions: Minimizing cross-entropy loss is equivalent to maximizing the likelihood of the labels under the model's predicted distribution.
A/B Testing: Comparing conversion rates requires hypothesis testing based on probability.
Generative Models: VAEs and GANs learn to generate samples from a distribution that approximates the data distribution.

Key Concepts

Bayes' Theorem: $P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$
Maximum Likelihood Estimation (MLE): $\hat{\theta} = \arg\max_\theta P(\text{data}|\theta)$
Posterior Distribution: $P(\theta|\text{data}) \propto P(\text{data}|\theta) \cdot P(\theta)$
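The difference between the MLE and a posterior-based estimate can be seen on a coin-flip example. The sketch below assumes 7 heads in 10 flips and a prior proportional to $\theta^4(1-\theta)^4$; both the data and the prior are made up for illustration, and the grid search stands in for proper optimization.

```python
# Observed data (illustrative): 7 heads in 10 flips of a coin with unknown bias theta.
heads, flips = 7, 10
thetas = [i / 100 for i in range(101)]   # candidate values 0.00, 0.01, ..., 1.00

def likelihood(theta):
    """P(data | theta), up to a constant factor that does not depend on theta."""
    return theta ** heads * (1 - theta) ** (flips - heads)

def prior(theta):
    """Unnormalized prior that favors values near 0.5 (an assumed choice)."""
    return theta ** 4 * (1 - theta) ** 4

# MLE: maximize the likelihood alone.
theta_mle = max(thetas, key=likelihood)

# Posterior mode: maximize likelihood * prior, which is proportional to the posterior.
theta_map = max(thetas, key=lambda t: likelihood(t) * prior(t))

print(theta_mle, theta_map)
```

The MLE lands on the observed frequency of heads, while the prior pulls the posterior mode toward 0.5. With more data the likelihood dominates and the two estimates move closer together.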

Common Mistakes

Mistake	Explanation	Correct Approach
Assuming events are independent	$P(A \cap B) = P(A)P(B)$ only if independent	Check if one event affects the other
Confusing permutations and combinations	Order matters for permutations	Use combinations when order doesn't matter
Ignoring the complement rule	"At least one" problems are easier with complements	Always consider $1 - P(\text{none})$
Double-counting in inclusion-exclusion	Overlapping probabilities counted multiple times	Use the full inclusion-exclusion formula
Forgetting conditional probability	$P(A \mid B) \neq P(A)$ in general	Condition on the known information; use Bayes' theorem to reverse the direction of conditioning
Assuming equally likely outcomes	Not all sample spaces have uniform probabilities	Confirm that outcomes are equally likely before counting favorable over total outcomes

Interview Questions

1. What is the probability of getting at least one heads in 3 coin flips?

Answer: $1 - P(\text{all tails}) = 1 - (1/2)^3 = 7/8 = 0.875$

2. Explain the difference between independent and mutually exclusive events.

Answer:

Independent: $P(A \cap B) = P(A)P(B)$; one event doesn't affect the other.
Mutually exclusive: $A \cap B = \emptyset$, so $P(A \cap B) = 0$; they cannot occur together.
Note: Mutually exclusive events are not independent unless at least one of them has probability 0.
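The distinction becomes concrete with events on a single fair die roll. In the sketch below the three events and the helper names are illustrative choices.

```python
from fractions import Fraction

omega = frozenset(range(1, 7))   # fair die, outcomes assumed equally likely

def prob(event):
    return Fraction(len(event), len(omega))

def independent(a, b):
    return prob(a & b) == prob(a) * prob(b)

def mutually_exclusive(a, b):
    return len(a & b) == 0

even = frozenset({2, 4, 6})
at_most_four = frozenset({1, 2, 3, 4})
odd = frozenset({1, 3, 5})

# Even and "at most four" overlap in {2, 4}: P = 1/3 = (1/2)(2/3), so they are independent.
print(independent(even, at_most_four), mutually_exclusive(even, at_most_four))  # True False
# Even and odd never co-occur: P(both) = 0 but P(even)P(odd) = 1/4.
print(independent(even, odd), mutually_exclusive(even, odd))                    # False True
```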

3. How would you calculate the probability of being dealt a flush in poker?

Answer: A flush is 5 cards of the same suit. Total 5-card hands: $\binom{52}{5}$. Favorable: $4 \times \binom{13}{5}$ (choose a suit, then 5 of its 13 cards). Probability: $\frac{4 \times \binom{13}{5}}{\binom{52}{5}} \approx 0.00198$. This count includes straight flushes.
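The arithmetic can be reproduced with `math.comb`. Since $4 \times \binom{13}{5}$ counts every same-suit hand, the sketch also shows the figure with the 40 straight flushes (10 per suit, royal flushes included) removed, which is how poker hand rankings usually define a flush.

```python
from math import comb

total_hands = comb(52, 5)        # all five-card hands
same_suit = 4 * comb(13, 5)      # pick a suit, then 5 of its 13 cards
straight_flushes = 4 * 10        # 10 per suit: A-2-3-4-5 up to 10-J-Q-K-A

p_same_suit = same_suit / total_hands
p_plain_flush = (same_suit - straight_flushes) / total_hands
print(total_hands, same_suit)                        # 2598960 5148
print(f"{p_same_suit:.5f} {p_plain_flush:.5f}")      # 0.00198 0.00197
```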

4. What is Bayes' theorem and why is it important in ML?

Answer: Bayes' theorem: $P(A|B) = \frac{P(B|A)P(A)}{P(B)}$. It updates prior beliefs with evidence. In ML, it's used for classification (Naive Bayes), Bayesian optimization, and probabilistic models.

5. A test has 99% accuracy and 1% prevalence of disease. If you test positive, what's the probability you have the disease?

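Answer: Interpreting "99% accuracy" as 99% sensitivity and 99% specificity, Bayes' theorem gives $P(\text{disease}|\text{positive}) = \frac{0.99 \times 0.01}{0.99 \times 0.01 + 0.01 \times 0.99} = 0.5$ (50%). The false positives from the large healthy group are as numerous as the true positives from the small diseased group.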
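The dependence on prevalence is easy to explore numerically. The sketch below assumes the test's sensitivity and specificity are both 99%; `posterior` is a hypothetical helper, and the 10% prevalence case is added only for contrast.

```python
from fractions import Fraction

def posterior(prevalence, sensitivity, specificity):
    """P(disease | positive) via Bayes' theorem and the law of total probability."""
    true_pos = sensitivity * prevalence                 # P(positive and diseased)
    false_pos = (1 - specificity) * (1 - prevalence)    # P(positive and healthy)
    return true_pos / (true_pos + false_pos)

acc = Fraction(99, 100)   # assumed value for both sensitivity and specificity
print(posterior(Fraction(1, 100), acc, acc))   # 1/2
print(posterior(Fraction(1, 10), acc, acc))    # 11/12
```

With the same test, raising the prevalence from 1% to 10% moves the posterior from 50% to about 92%, which is why the base rate cannot be ignored.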

6. How many ways can you arrange 5 books on a shelf?

Answer: $5! = 120$ permutations.

7. What's the probability of rolling a sum of 7 with two dice?

Answer: Favorable outcomes: (1,6), (2,5), (3,4), (4,3), (5,2), (6,1), so 6 in total. Total outcomes: 36. Probability: $6/36 = 1/6 \approx 0.1667$.

Practice Problems

📝Problem 1: Card Draw

You draw 2 cards from a standard deck without replacement. What is the probability both are aces?

💡Solution 1

Total ways to draw 2 cards: $\binom{52}{2} = 1326$. Favorable: $\binom{4}{2} = 6$. Probability: $\frac{6}{1326} = \frac{1}{221} \approx 0.00452$.

📝Problem 2: Conditional Probability

In a class, 60% study math, 40% study science, and 25% study both. If a student studies math, what's the probability they also study science?

💡Solution 2

$P(\text{science}|\text{math}) = \frac{P(\text{both})}{P(\text{math})} = \frac{0.25}{0.60} \approx 0.4167$.

📝Problem 3: Inclusion-Exclusion

In a survey of 100 people: 70 like tea, 50 like coffee, and 30 like both. How many like neither?

💡Solution 3

$P(\text{tea} \cup \text{coffee}) = 0.7 + 0.5 - 0.3 = 0.9$. So 90% like at least one. Thus, 10% like neither: $100 \times 0.10 = 10$ people.

📝Problem 4: Bayes' Theorem

A factory has two machines. Machine A produces 60% of items, with 5% defect rate. Machine B produces 40%, with 10% defect rate. An item is defective. What's the probability it came from Machine A?

💡Solution 4

Let $A$ = from machine A, $D$ = defective. We want $P(A|D)$.

$P(D) = P(D|A)P(A) + P(D|B)P(B) = 0.05 \times 0.6 + 0.10 \times 0.4 = 0.03 + 0.04 = 0.07$.

$P(A|D) = \frac{P(D|A)P(A)}{P(D)} = \frac{0.05 \times 0.6}{0.07} = \frac{0.03}{0.07} \approx 0.4286$.
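The same two steps (law of total probability, then Bayes' theorem) generalize to any number of machines. The sketch below uses the figures from the problem statement; the dictionary layout is an illustrative choice.

```python
from fractions import Fraction

# machine -> (share of production, defect rate), taken from the problem statement
machines = {"A": (Fraction(60, 100), Fraction(5, 100)),
            "B": (Fraction(40, 100), Fraction(10, 100))}

# Law of total probability: P(D) = sum over machines of P(D | machine) * P(machine)
p_defect = sum(share * rate for share, rate in machines.values())

# Bayes' theorem for each machine: P(machine | D)
posteriors = {name: share * rate / p_defect
              for name, (share, rate) in machines.items()}

print(p_defect, posteriors["A"], posteriors["B"])   # 7/100 3/7 4/7
```

Machine A makes most of the items, yet a defective item is more likely to come from machine B because of its higher defect rate.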

📝Problem 5: Counting with Repetition

How many 4-letter words can be formed from the letters in "PROBABILITY" (with repetition allowed)?

💡Solution 5

Distinct letters: P, R, O, B, A, I, L, T, Y (9 letters; B and I each appear twice in the word but count once). With repetition allowed: $9^4 = 6561$ possible words.
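Counting distinct letters by hand is easy to get wrong when some letters repeat, so it can help to let Python do the bookkeeping. The sketch below builds the set of distinct letters and cross-checks the power formula by enumerating every 4-letter string.

```python
from itertools import product

word = "PROBABILITY"
letters = sorted(set(word))      # distinct letters only
n_words = len(letters) ** 4      # each of the 4 positions is chosen independently

# Brute-force check: enumerate every 4-letter string over the distinct letters.
enumerated = sum(1 for _ in product(letters, repeat=4))

print("".join(letters), len(letters), n_words, enumerated)
```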

Probability Inequalities

ℹ️ Markov's Inequality

For a non-negative random variable $X$ with finite mean and any $a > 0$:

$$P(X \geq a) \leq \frac{E[X]}{a}$$

This provides an upper bound on the probability that $X$ is at least $a$, using only its mean.
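A quick simulation shows how loose or tight the bound is in practice. The sketch below assumes exponentially distributed samples with rate 1 (an arbitrary non-negative distribution) and compares empirical tail frequencies with the Markov bound.

```python
import random

rng = random.Random(0)   # fixed seed so the run is reproducible
samples = [rng.expovariate(1.0) for _ in range(10_000)]   # non-negative draws
mean = sum(samples) / len(samples)

rows = []
for a in (1.0, 2.0, 4.0):
    tail = sum(x >= a for x in samples) / len(samples)   # empirical P(X >= a)
    bound = mean / a                                     # Markov bound E[X] / a
    rows.append((a, tail, bound))
    print(f"a={a}: empirical tail {tail:.4f} <= bound {bound:.4f}")
```

The bound holds for every threshold but is far from tight here, which is typical: Markov's inequality uses only the mean and nothing else about the distribution.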

ℹ️ Chebyshev's Inequality

For a random variable $X$ with mean $\mu$ and finite variance $\sigma^2$, and for any $k > 0$:

$$P(|X - \mu| \geq k\sigma) \leq \frac{1}{k^2}$$

This bounds the probability that $X$ deviates from its mean by at least $k$ standard deviations.
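The inequality can be checked the same way on simulated data. The sketch below assumes standard normal samples, an arbitrary choice since the bound holds for any distribution with finite variance, and uses the sample mean and population standard deviation in place of $\mu$ and $\sigma$.

```python
import random
from statistics import fmean, pstdev

rng = random.Random(0)   # fixed seed so the run is reproducible
samples = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
mu, sigma = fmean(samples), pstdev(samples)

rows = []
for k in (1.5, 2.0, 3.0):
    # Empirical P(|X - mu| >= k * sigma)
    tail = sum(abs(x - mu) >= k * sigma for x in samples) / len(samples)
    bound = 1 / k ** 2
    rows.append((k, tail, bound))
    print(f"k={k}: empirical tail {tail:.4f} <= bound {bound:.4f}")
```

For normal data the true tail probabilities are much smaller than $1/k^2$; the value of Chebyshev's inequality is that it requires no distributional assumption.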

ℹ️ Union Bound (Boole's Inequality)

For any events $A_1, A_2, \ldots, A_n$:

$$P\left(\bigcup_{i=1}^n A_i\right) \leq \sum_{i=1}^n P(A_i)$$

Useful for upper-bounding the probability of at least one event occurring.

Quick Reference

Concept	Formula	Notes
Sample Space	$\Omega$	Set of all outcomes
Event	$A \subseteq \Omega$	Subset of outcomes
Complement	$P(A^c) = 1 - P(A)$	Probability event doesn't occur
Addition Rule	$P(A \cup B) = P(A) + P(B) - P(A \cap B)$	For any events
Multiplication Rule	$P(A \cap B) = P(A \mid B)P(B)$	Chain rule
Independence	$P(A \cap B) = P(A)P(B)$	No influence
Mutually Exclusive	$P(A \cap B) = 0$	Cannot co-occur
Bayes' Theorem	$P(A \mid B) = \frac{P(B \mid A)P(A)}{P(B)}$	Update beliefs
Permutations	$P(n,k) = \frac{n!}{(n-k)!}$	Order matters
Combinations	$\binom{n}{k} = \frac{n!}{k!(n-k)!}$	Order doesn't matter
Inclusion-Exclusion	$P(\cup A_i) = \sum P(A_i) - \sum P(A_i \cap A_j) + \cdots$	Avoid double-counting
Law of Total Probability	$P(A) = \sum_i P(A \mid B_i)P(B_i)$	Partition of sample space

Cross-References

Conditional Probability: See Conditional Probability for a deeper dive into $P(A|B)$.
Random Variables: See Random Variables for discrete and continuous distributions.
Distributions: See Common Distributions for Bernoulli, Binomial, Poisson, etc.
Statistical Inference: See Hypothesis Testing for applying probability in decision-making.
Machine Learning: See Bayesian Methods for probabilistic ML models.

📋Key Takeaways

Probability axioms ($P(A) \geq 0$, $P(\Omega) = 1$, additivity) are the foundation. All properties derive from them.
Counting principles (permutations, combinations) enable exact probability calculations in discrete settings.
Inclusion-exclusion and the complement rule simplify complex probability calculations.
Bayes' theorem is essential for updating probabilities with new evidence—critical in AI/ML.
Python implementation allows simulation and computation of probabilities for real-world problems.
Common mistakes include confusing independence with mutual exclusion and forgetting conditional probability.
Applications span from A/B testing to uncertainty quantification in neural networks.
Practice regularly to build intuition—probability is both conceptual and computational.