← Math|36 of 100
Probability

Bayes' Theorem

Master Bayes' theorem, prior/posterior distributions, Bayesian updating, and practical applications in AI/ML.

Bayesian · Lesson 36 of 100 · Free Course


Why It Matters

The Power of Bayesian Inference

Bayesian inference is a framework for updating beliefs in the light of evidence. Unlike frequentist statistics, which treats parameters as fixed and data as random, Bayesian methods treat parameters as random variables with distributions that evolve as data arrives. This paradigm shift enables:

  • Uncertainty quantification: Not just a point estimate, but a full probability distribution over possible values.
  • Sequential learning: Each new observation refines the posterior, which becomes the prior for the next update.
  • Prior knowledge incorporation: Domain expertise can be encoded formally before any data is collected.
  • Decision-making under uncertainty: Posterior distributions can be integrated into loss functions for optimal decisions.

Bayesian reasoning underpins modern machine learning, from Gaussian processes and variational autoencoders to reinforcement learning and Bayesian neural networks. Understanding Bayes' theorem is foundational to building systems that reason about what they don't know.


Bayes' Theorem

Theorem: Bayes' Theorem (General Form)

Let $A$ and $B$ be events with $P(B) > 0$. Then:

$$P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)}$$

Equivalently, when the sample space is partitioned into mutually exclusive and exhaustive hypotheses $\{H_1, H_2, \ldots, H_n\}$:

$$P(H_i \mid B) = \frac{P(B \mid H_i) \, P(H_i)}{\sum_{j=1}^{n} P(B \mid H_j) \, P(H_j)}$$

Derivation from Conditional Probability

Starting from the definition of conditional probability:

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \quad P(B \mid A) = \frac{P(A \cap B)}{P(A)}$$

Solving the second equation for $P(A \cap B)$:

$$P(A \cap B) = P(B \mid A) \, P(A)$$

Substituting into the first equation yields Bayes' Theorem:

$$P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)} \qquad \blacksquare$$

Bayesian Inference Components

$$\text{Posterior} = \frac{\text{Likelihood} \times \text{Prior}}{\text{Evidence}}$$

Here,

  • $P(H \mid D)$ = Posterior: updated belief about hypothesis $H$ after observing data $D$
  • $P(D \mid H)$ = Likelihood: probability of the data given that hypothesis $H$ is true
  • $P(H)$ = Prior: initial belief about $H$ before seeing data
  • $P(D)$ = Evidence (marginal likelihood): total probability of the data across all hypotheses
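The four components above map directly onto a few lines of code. The following is a minimal sketch for a finite set of hypotheses; the function name `posterior_over_hypotheses` and the three-urn numbers are illustrative assumptions, not part of the lesson.

```python
def posterior_over_hypotheses(priors, likelihoods):
    """Return P(H_i | D) for each hypothesis, given P(H_i) and P(D | H_i)."""
    if len(priors) != len(likelihoods):
        raise ValueError("priors and likelihoods must have the same length")
    # Numerator of Bayes' theorem for each hypothesis: likelihood * prior
    joint = [lik * pri for lik, pri in zip(likelihoods, priors)]
    # Evidence P(D): law of total probability over the partition
    evidence = sum(joint)
    if evidence == 0:
        raise ValueError("the data has zero probability under every hypothesis")
    return [j / evidence for j in joint]


# Hypothetical example: three urns, equally likely a priori,
# with made-up chances of drawing a red ball from each.
priors = [1 / 3, 1 / 3, 1 / 3]
likelihoods = [0.2, 0.5, 0.8]  # P(red | urn i)
posterior = posterior_over_hypotheses(priors, likelihoods)
print([round(p, 3) for p in posterior])
```

Because the priors are equal in this example, the posterior is simply the likelihoods rescaled to sum to 1.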

Prior, Likelihood, Posterior

Definition: Prior Distribution

The prior $P(\theta)$ encodes beliefs about parameter $\theta$ before observing any data. It can be:

  • Informative: Based on previous studies or domain expertise (e.g., "disease prevalence is ~1%").
  • Non-informative (vague): Spreads probability broadly to let the data dominate (e.g., Uniform(0,1) or Beta(1,1)).
  • Conjugate: Chosen so the posterior belongs to the same family as the prior (mathematical convenience).

The choice of prior is a modeling decision. Sensitivity analysis (checking how conclusions change under different priors) is essential.

Definition: Likelihood Function

The likelihood $P(D \mid \theta)$ measures how probable the observed data $D$ is for different values of $\theta$. Note: the likelihood is not a probability distribution over $\theta$. It is a function of $\theta$ with the data $D$ held fixed, and it need not integrate to 1 over $\theta$.

For independent observations $x_1, x_2, \ldots, x_n$:

$$P(D \mid \theta) = \prod_{i=1}^{n} P(x_i \mid \theta)$$

The likelihood surface may be flat (the data are uninformative about $\theta$) or sharply peaked (the data strongly constrain $\theta$).

Definition: Posterior Distribution

The posterior $P(\theta \mid D)$ is the result of applying Bayes' Theorem:

$$P(\theta \mid D) = \frac{P(D \mid \theta) \, P(\theta)}{\int P(D \mid \theta) \, P(\theta) \, d\theta}$$

The posterior encodes all updated knowledge about $\theta$ given the data. Key summaries include:

  • Posterior mean: $\mathbb{E}[\theta \mid D] = \int \theta \, P(\theta \mid D) \, d\theta$
  • Posterior mode (MAP): $\theta_{\text{MAP}} = \arg\max_\theta P(\theta \mid D)$
  • Posterior variance: $\text{Var}[\theta \mid D] = \int (\theta - \mathbb{E}[\theta \mid D])^2 \, P(\theta \mid D) \, d\theta$
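When the integrals above have no convenient closed form, a grid approximation is one simple way to compute these summaries. The sketch below assumes a Bernoulli likelihood, a flat prior on $[0, 1]$, and hypothetical data of 7 successes in 10 trials; the function name `grid_posterior` is an illustrative choice.

```python
def grid_posterior(successes, trials, grid_size=1001):
    """Approximate the posterior over a Bernoulli parameter on an evenly spaced grid."""
    thetas = [i / (grid_size - 1) for i in range(grid_size)]
    prior = [1.0] * grid_size                       # flat prior (assumption)
    likelihood = [t ** successes * (1 - t) ** (trials - successes) for t in thetas]
    unnormalized = [lik * pri for lik, pri in zip(likelihood, prior)]
    evidence = sum(unnormalized)                    # discrete stand-in for the integral
    weights = [u / evidence for u in unnormalized]  # posterior mass at each grid point
    return thetas, weights


thetas, weights = grid_posterior(successes=7, trials=10)

# Posterior mean: weighted average of theta
post_mean = sum(t * w for t, w in zip(thetas, weights))
# Posterior mode (MAP): grid point with the largest posterior mass
post_map = thetas[max(range(len(weights)), key=weights.__getitem__)]
# Posterior variance: weighted squared deviation from the mean
post_var = sum((t - post_mean) ** 2 * w for t, w in zip(thetas, weights))

print(f"mean={post_mean:.4f}  MAP={post_map:.3f}  variance={post_var:.5f}")
```

With a flat prior the exact posterior here is Beta(8, 4), so the grid values can be checked against its mean 8/12 and mode 7/10.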

Bayesian Updating

The Bayesian updating process is iterative: each observation refines the posterior, and the posterior becomes the prior for the next update.

Theorem: Sequential Bayesian Updating

Given prior $P(\theta)$ and sequential observations $x_1, x_2, \ldots, x_n$ that are conditionally independent given $\theta$:

Step 1 (Initialize): Start with prior $P(\theta)$.

Step 2 (Observe $x_1$): Compute posterior after first observation:

$$P(\theta \mid x_1) = \frac{P(x_1 \mid \theta) \, P(\theta)}{\int P(x_1 \mid \theta) \, P(\theta) \, d\theta}$$

Step 3 (Observe $x_2$): Treat the posterior from Step 2 as the new prior:

$$P(\theta \mid x_1, x_2) = \frac{P(x_2 \mid \theta) \, P(\theta \mid x_1)}{\int P(x_2 \mid \theta) \, P(\theta \mid x_1) \, d\theta}$$

Step $k+1$ (Observe $x_k$): General update:

$$P(\theta \mid x_1, \ldots, x_k) \propto P(x_k \mid \theta) \cdot P(\theta \mid x_1, \ldots, x_{k-1})$$

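This process converges: as $n \to \infty$, the posterior concentrates around the true parameter value under regularity conditions, provided the prior assigns positive probability to every neighborhood of that value. The Bernstein–von Mises theorem sharpens this statement: the posterior becomes approximately normal, centered near the maximum likelihood estimate, so the influence of the prior vanishes.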
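A small numerical check can make the "posterior becomes prior" step concrete. The sketch below assumes a coin whose bias is restricted to five candidate values with equal prior weight, plus a made-up sequence of four flips; it compares one-at-a-time updating with a single batch update.

```python
def update(prior, thetas, x):
    """One Bayesian update for a single Bernoulli observation x in {0, 1}."""
    likelihood = [t if x == 1 else 1 - t for t in thetas]
    unnormalized = [lik * pri for lik, pri in zip(likelihood, prior)]
    evidence = sum(unnormalized)
    return [u / evidence for u in unnormalized]


thetas = [0.1, 0.3, 0.5, 0.7, 0.9]   # candidate coin biases (assumption)
prior = [0.2] * 5                    # equal prior weight on each candidate
data = [1, 1, 0, 1]                  # hypothetical flips, 1 = heads

# Sequential: the posterior after each flip is the prior for the next flip.
sequential = prior
for x in data:
    sequential = update(sequential, thetas, x)

# Batch: apply the full likelihood of all flips to the original prior at once.
heads, n = sum(data), len(data)
unnormalized = [p * t ** heads * (1 - t) ** (n - heads) for p, t in zip(prior, thetas)]
batch = [u / sum(unnormalized) for u in unnormalized]

print([round(p, 4) for p in sequential])
print([round(p, 4) for p in batch])
```

Both routes give the same posterior (up to floating-point rounding) because the likelihood of independent observations factorizes into a product.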


Medical Diagnosis

Medical diagnosis is the canonical application of Bayes' Theorem. It highlights why base rates (prevalence) matter enormously.

Medical Test Problem

A disease affects 1 in 1000 people in a population. A diagnostic test has:

  • Sensitivity (true positive rate): $P(+ \mid D) = 0.99$
  • Specificity (true negative rate): $P(- \mid \neg D) = 0.99$

A randomly selected person tests positive. What is $P(D \mid +)$?

Solution

Define events:

  • $D$ = has disease, $\neg D$ = does not have disease
  • $+$ = tests positive, $-$ = tests negative

Given:

  • $P(D) = 0.001$ (prevalence)
  • $P(+ \mid D) = 0.99$ (sensitivity)
  • $P(- \mid \neg D) = 0.99$ → $P(+ \mid \neg D) = 0.01$ (false positive rate)

Apply Bayes' Theorem:

$$P(D \mid +) = \frac{P(+ \mid D) \, P(D)}{P(+ \mid D) \, P(D) + P(+ \mid \neg D) \, P(\neg D)}$$
$$= \frac{0.99 \times 0.001}{0.99 \times 0.001 + 0.01 \times 0.999}$$
$$= \frac{0.00099}{0.00099 + 0.00999} = \frac{0.00099}{0.01098} \approx 0.0902$$

Result: Only about 9% of positive tests are true positives!

Intuition: In a population of 10,000:

  • 10 people have the disease → ~9.9 test positive
  • 9,990 people are healthy → ~99.9 test positive (false alarms)
  • Among ~110 positive tests, only ~10 are genuine

Key takeaway: Even with a 99%-accurate test, a rare disease produces mostly false positives when screened broadly. This is why confirmatory tests are essential.
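The calculation above generalizes to any prevalence, sensitivity, and specificity. A minimal sketch follows; the function name `positive_predictive_value` and the extra prevalence values 0.01 and 0.1 are illustrative assumptions added to show how strongly the base rate drives the result.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """P(disease | positive test) via Bayes' theorem."""
    true_positive = sensitivity * prevalence                 # P(+ | D) P(D)
    false_positive = (1 - specificity) * (1 - prevalence)    # P(+ | not D) P(not D)
    return true_positive / (true_positive + false_positive)


# Same 99% / 99% test, applied at three different base rates.
for prevalence in (0.001, 0.01, 0.1):
    ppv = positive_predictive_value(prevalence, sensitivity=0.99, specificity=0.99)
    print(f"prevalence={prevalence}: P(D | +) = {ppv:.3f}")
```

The same test yields roughly 0.090 at a prevalence of 0.001, exactly 0.5 at 0.01, and about 0.917 at 0.1.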


Spam Classification

Naive Bayes for Text Classification

The Naive Bayes classifier applies Bayes' Theorem to classification problems with multiple features. It assumes features are conditionally independent given the class, a strong assumption that rarely holds in practice but works surprisingly well, especially for text.

For a document with word features $w_1, w_2, \ldots, w_n$:

$$P(\text{spam} \mid w_1, \ldots, w_n) = \frac{P(w_1, \ldots, w_n \mid \text{spam}) \, P(\text{spam})}{P(w_1, \ldots, w_n)}$$

Under the Naive Bayes assumption:

$$P(w_1, \ldots, w_n \mid \text{spam}) = \prod_{i=1}^{n} P(w_i \mid \text{spam})$$

We classify by choosing the class with maximum posterior:

$$\hat{y} = \arg\max_c \left[ \log P(c) + \sum_{i=1}^{n} \log P(w_i \mid c) \right]$$

Log-space avoids numerical underflow from multiplying many small probabilities.
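To see why the log form matters, the short sketch below multiplies many small probabilities directly and compares the result with a sum of logs. The 400 terms and the value 0.01 are arbitrary illustrative choices.

```python
import math

# Hypothetical document: 400 words, each with per-class probability 0.01.
word_probs = [0.01] * 400

# Direct product: the true value is 1e-800, far below the smallest positive double.
product = 1.0
for p in word_probs:
    product *= p

# Sum of logs: the same quantity on a log scale, comfortably representable.
log_total = sum(math.log(p) for p in word_probs)

print(product)    # underflows to 0.0
print(log_total)  # about -1842.07
```

Once the product has collapsed to zero for every class, the classes can no longer be ranked, whereas the log scores remain comparable.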

Spam Filter Example

Given a vocabulary of two words ("free" and "meeting"), with:

| Class $c$ | Prior $P(c)$ | $P(\text{free} \mid c)$ | $P(\text{meeting} \mid c)$ |
| --- | --- | --- | --- |
| Spam | 0.4 | 0.9 | 0.1 |
| Ham | 0.6 | 0.2 | 0.7 |

An email contains both "free" and "meeting". Classify it.

Solution

Spam:

$$P(\text{spam} \mid \text{free}, \text{meeting}) \propto 0.4 \times 0.9 \times 0.1 = 0.036$$

Ham:

$$P(\text{ham} \mid \text{free}, \text{meeting}) \propto 0.6 \times 0.2 \times 0.7 = 0.084$$

Posterior for spam:

$$P(\text{spam} \mid \text{free}, \text{meeting}) = \frac{0.036}{0.036 + 0.084} = \frac{0.036}{0.120} = 0.3$$

Classification: Ham (0.7) > Spam (0.3) → classified as not spam.

Despite "free" being strongly associated with spam, the word "meeting" and the higher prior probability of ham shift the classification.
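The worked example can be reproduced in log-space, which is how the decision rule is usually implemented. This is a minimal sketch using the table values from the example; the dictionary layout and the helper names `log_scores` and `normalize` are illustrative assumptions.

```python
import math

priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"free": 0.9, "meeting": 0.1},
    "ham": {"free": 0.2, "meeting": 0.7},
}


def log_scores(words):
    """Unnormalized log-posterior per class: log P(c) + sum of log P(w | c)."""
    return {
        c: math.log(priors[c]) + sum(math.log(likelihoods[c][w]) for w in words)
        for c in priors
    }


def normalize(scores):
    """Turn log scores into probabilities, subtracting the max for stability."""
    top = max(scores.values())
    shifted = {c: math.exp(s - top) for c, s in scores.items()}
    total = sum(shifted.values())
    return {c: v / total for c, v in shifted.items()}


scores = log_scores(["free", "meeting"])
posterior = normalize(scores)
label = max(posterior, key=posterior.get)
print(label, {c: round(p, 3) for c, p in posterior.items()})
```

Normalization is only needed when calibrated probabilities are wanted; for the classification decision itself, comparing the log scores is enough.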


Prior Distributions

The choice of prior significantly impacts inference, especially with limited data. Conjugate priors offer mathematical tractability.

Definition: Conjugate Priors

A conjugate prior is a prior distribution that, when combined with a likelihood, produces a posterior in the same family. This makes computation of the posterior closed-form.

| Likelihood | Conjugate Prior | Posterior |
| --- | --- | --- |
| Bernoulli / Binomial | Beta($\alpha, \beta$) | Beta($\alpha + s, \beta + n - s$) |
| Poisson | Gamma($\alpha, \beta$) | Gamma($\alpha + \sum x_i, \beta + n$) |
| Normal (known $\sigma^2$) | Normal($\mu_0, \sigma_0^2$) | Normal($\mu_n, \sigma_n^2$) |
| Multinomial | Dirichlet($\alpha_1, \ldots, \alpha_k$) | Dirichlet($\alpha_1 + c_1, \ldots, \alpha_k + c_k$) |
| Normal (known $\mu$, unknown $\sigma^2$) | Inverse-Gamma($\alpha, \beta$) | Inverse-Gamma($\alpha + n/2, \beta + \text{SS}/2$) |
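The table uses a few symbols without defining them. Assuming the standard conventions, $s$ is the number of successes in $n$ trials, $c_j$ is the observed count in category $j$, the Gamma prior is written with shape $\alpha$ and rate $\beta$, and $\text{SS} = \sum_{i=1}^{n} (x_i - \mu)^2$. Under the same conventions, the Normal posterior parameters for known $\sigma^2$ are given by the usual precision-weighted update:

```latex
\frac{1}{\sigma_n^2} = \frac{1}{\sigma_0^2} + \frac{n}{\sigma^2},
\qquad
\mu_n = \sigma_n^2 \left( \frac{\mu_0}{\sigma_0^2} + \frac{\sum_{i=1}^{n} x_i}{\sigma^2} \right)
```

Precisions (inverse variances) add, and the posterior mean weights the prior mean and the data by their respective precisions.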

Beta-Binomial Example:

$$P(\theta \mid s, n) \propto \underbrace{\theta^{s} (1 - \theta)^{n - s}}_{\text{likelihood}} \cdot \underbrace{\theta^{\alpha - 1} (1 - \theta)^{\beta - 1}}_{\text{prior}} = \theta^{\alpha + s - 1} (1 - \theta)^{\beta + n - s - 1}$$

which is the kernel of a Beta distribution, so $\theta \mid s, n \sim \text{Beta}(\alpha + s, \beta + n - s)$.

The posterior mean is $\frac{\alpha + s}{\alpha + \beta + n}$, which is a weighted average of the prior mean $\frac{\alpha}{\alpha + \beta}$ and the sample proportion $\frac{s}{n}$.
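The weighted-average claim can be checked numerically. In the sketch below the prior mean receives weight $\frac{\alpha + \beta}{\alpha + \beta + n}$ and the sample proportion receives the remainder; the Beta(2, 2) prior and the data (8 successes in 10 trials) are illustrative assumptions.

```python
def beta_binomial_update(alpha, beta, successes, trials):
    """Conjugate update: Beta(alpha, beta) prior plus binomial data."""
    return alpha + successes, beta + (trials - successes)


def beta_mean(alpha, beta):
    return alpha / (alpha + beta)


alpha0, beta0 = 2, 2   # prior pseudo-counts (assumption)
s, n = 8, 10           # hypothetical data: 8 successes in 10 trials

alpha_n, beta_n = beta_binomial_update(alpha0, beta0, s, n)
post_mean = beta_mean(alpha_n, beta_n)

# Weight on the prior mean shrinks as n grows relative to alpha0 + beta0.
w_prior = (alpha0 + beta0) / (alpha0 + beta0 + n)
blended = w_prior * beta_mean(alpha0, beta0) + (1 - w_prior) * (s / n)

print(alpha_n, beta_n, round(post_mean, 4), round(blended, 4))
```

Reading $\alpha + \beta$ as a prior sample size explains the "pseudo-count" interpretation: a Beta(2, 2) prior carries as much weight as four earlier observations.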

Sensitivity to Priors

With small sample sizes, the posterior is sensitive to the choice of prior. As data accumulates, the likelihood dominates and the posterior becomes robust to the prior. Always perform prior sensitivity analysis: check how conclusions change under different priors.


MAP vs MLE

Definition: Maximum Likelihood Estimation (MLE)

The MLE finds the parameter that maximizes the likelihood:

$$\theta_{\text{MLE}} = \arg\max_\theta P(D \mid \theta) = \arg\max_\theta \log P(D \mid \theta)$$

MLE treats the parameter as a fixed unknown and finds the single value that best explains the data. It is consistent (converges to the true parameter) and efficient (minimum variance) asymptotically.

Definition: Maximum A Posteriori (MAP) Estimation

The MAP estimate finds the parameter that maximizes the posterior:

$$\theta_{\text{MAP}} = \arg\max_\theta P(\theta \mid D) = \arg\max_\theta \left[ \log P(D \mid \theta) + \log P(\theta) \right]$$

MAP is equivalent to MLE with a log-prior penalty. This is the Bayesian analog of regularized estimation:

| Prior | MAP Equivalent |
| --- | --- |
| Uniform | MLE (no regularization) |
| Gaussian $N(0, \sigma^2)$ | Ridge regression ($L_2$ penalty) |
| Laplace (double exponential) | Lasso ($L_1$ penalty) |

Key distinction: MLE and MAP produce point estimates; full Bayesian inference produces the entire posterior distribution, capturing uncertainty.
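For a Bernoulli parameter with a Beta prior, both point estimates have closed forms, which makes the effect of the prior easy to see. The sketch below uses the mode of a Beta($a$, $b$) distribution, $\frac{a - 1}{a + b - 2}$ for $a, b > 1$; the data (8 successes in 10 trials) and the three priors are illustrative assumptions.

```python
def bernoulli_mle(successes, trials):
    """MLE of a Bernoulli parameter: the sample proportion."""
    return successes / trials


def bernoulli_map(successes, trials, alpha, beta):
    """MAP under a Beta(alpha, beta) prior: mode of the Beta posterior."""
    a = alpha + successes
    b = beta + trials - successes
    if a <= 1 or b <= 1:
        raise ValueError("interior mode requires both posterior parameters > 1")
    return (a - 1) / (a + b - 2)


s, n = 8, 10  # hypothetical data

mle = bernoulli_mle(s, n)
map_uniform = bernoulli_map(s, n, alpha=1, beta=1)    # flat prior
map_mild = bernoulli_map(s, n, alpha=2, beta=2)       # mild pull toward 0.5
map_strong = bernoulli_map(s, n, alpha=20, beta=20)   # strong pull toward 0.5

print(mle, map_uniform, map_mild, map_strong)
```

The uniform prior reproduces the MLE, while the stronger priors shrink the estimate toward 0.5, the same behavior a regularization penalty produces.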


Python Implementation

import numpy as np
from scipy import stats


class NaiveBayesClassifier:
    """Gaussian Naive Bayes classifier for continuous features."""

    def fit(self, X, y):
        self.classes = np.unique(y)
        self.params = {}
        for c in self.classes:
            X_c = X[y == c]
            self.params[c] = {
                "mean": X_c.mean(axis=0),
                "var": X_c.var(axis=0) + 1e-9,  # smoothing
                "prior": len(X_c) / len(X),
            }
        return self

    def _log_likelihood(self, x, mean, var):
        return -0.5 * np.sum(np.log(2 * np.pi * var) + ((x - mean) ** 2) / var)

    def predict(self, X):
        predictions = []
        for x in X:
            posteriors = []
            for c in self.classes:
                log_prior = np.log(self.params[c]["prior"])
                log_lik = self._log_likelihood(
                    x, self.params[c]["mean"], self.params[c]["var"]
                )
                posteriors.append(log_prior + log_lik)
            predictions.append(self.classes[np.argmax(posteriors)])
        return np.array(predictions)


# Bayesian updating for a coin flip (Beta-Binomial)
def bayesian_coin_update(prior_alpha, prior_beta, flips):
    """Sequential Bayesian updating for a biased coin."""
    alpha, beta = prior_alpha, prior_beta
    history = [(alpha, beta)]

    for flip in flips:
        if flip == 1:
            alpha += 1
        else:
            beta += 1
        history.append((alpha, beta))

    return history


# Example usage
prior_alpha, prior_beta = 2, 2  # Beta(2,2) prior: mildly informative
flips = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]  # observed sequence
history = bayesian_coin_update(prior_alpha, prior_beta, flips)

print("Step | Alpha | Beta | Mean | 95% CI")
print("-" * 50)
for i, (a, b) in enumerate(history):
    mean = a / (a + b)
    ci_low = stats.beta.ppf(0.025, a, b)
    ci_high = stats.beta.ppf(0.975, a, b)
    label = "Prior" if i == 0 else f"Obs {i}"
    print(f"{label:5s} | {a:5.1f} | {b:4.1f} | {mean:.3f} | [{ci_low:.3f}, {ci_high:.3f}]")

Output (interval endpoints are approximate; rerun the script for exact values):

Step | Alpha | Beta  | Mean  | 95% CI
--------------------------------------------------
Prior |   2.0 |   2.0 | 0.500 | [0.079, 0.921]
Obs 1 |   3.0 |   2.0 | 0.600 | [0.170, 0.934]
Obs 2 |   4.0 |   2.0 | 0.667 | [0.263, 0.942]
Obs 3 |   4.0 |   3.0 | 0.571 | [0.245, 0.855]
Obs 4 |   5.0 |   3.0 | 0.625 | [0.306, 0.882]
Obs 5 |   6.0 |   3.0 | 0.667 | [0.359, 0.900]
Obs 6 |   7.0 |   3.0 | 0.700 | [0.400, 0.915]
Obs 7 |   7.0 |   4.0 | 0.636 | [0.363, 0.860]
Obs 8 |   8.0 |   4.0 | 0.667 | [0.410, 0.872]
Obs 9 |   9.0 |   4.0 | 0.692 | [0.444, 0.887]
Obs10 |  10.0 |   4.0 | 0.714 | [0.472, 0.898]

The posterior mean starts at 0.5 (prior belief) and moves toward the observed proportion (8/10 = 0.8) as data accumulates, ending at 10/14 ≈ 0.714 after ten flips.


Applications in AI/ML

Bayesian Methods in Modern AI

Bayesian inference is not a niche technique; it is woven into the fabric of modern AI and machine learning.

Bayesian Deep Learning:

  • Replaces point estimates of neural network weights with distributions.
  • MC Dropout: Applying dropout at test time approximates Bayesian inference over weights.
  • Bayes by Backprop: Maintains a variational distribution over weights, trained by minimizing the KL divergence to the true posterior (equivalently, maximizing the evidence lower bound).
  • Enables uncertainty estimation, which is critical for safety-critical applications (medical AI, autonomous driving).

A/B Testing:

  • Frequentist A/B testing produces p-values and confidence intervals.
  • Bayesian A/B testing directly computes $P(A > B \mid \text{data})$: the probability that variant A outperforms B.
  • Advantages: natural interpretation, the ability to incorporate prior knowledge, and often a reduced (though not eliminated) need for multiple-testing corrections.
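For conversion-rate experiments, $P(A > B \mid \text{data})$ can be estimated by sampling from the two Beta posteriors. The following is a minimal Monte Carlo sketch using only the standard library; the conversion counts, the Beta(1, 1) prior, and the function name `prob_a_beats_b` are all illustrative assumptions.

```python
import random


def prob_a_beats_b(conv_a, n_a, conv_b, n_b, prior=(1, 1), draws=20000, seed=0):
    """Monte Carlo estimate of P(rate_A > rate_B | data) under Beta posteriors."""
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    a_alpha, a_beta = prior[0] + conv_a, prior[1] + (n_a - conv_a)
    b_alpha, b_beta = prior[0] + conv_b, prior[1] + (n_b - conv_b)
    wins = 0
    for _ in range(draws):
        # Draw one plausible conversion rate for each variant and compare them.
        if rng.betavariate(a_alpha, a_beta) > rng.betavariate(b_alpha, b_beta):
            wins += 1
    return wins / draws


# Hypothetical experiment: A converts 120 of 1000 visitors, B converts 100 of 1000.
p_a_better = prob_a_beats_b(120, 1000, 100, 1000)
# Sanity check: identical data should give a probability near 0.5.
p_tie = prob_a_beats_b(100, 1000, 100, 1000)

print(f"P(A > B | data) is approximately {p_a_better:.3f}")
print(f"With identical data: {p_tie:.3f}")
```

The estimate carries Monte Carlo error on the order of $1/\sqrt{\text{draws}}$, so more draws are needed when the two variants are close.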

Bayesian Optimization:

  • Uses a probabilistic surrogate model, most commonly a Gaussian process, to find the optimum of expensive black-box functions.
  • Acquisition function (e.g., Expected Improvement) balances exploration vs. exploitation.
  • Used for hyperparameter tuning (e.g., with Gaussian process surrogates or Tree-structured Parzen Estimators).

Gaussian Processes:

  • Non-parametric Bayesian models for regression and classification.
  • Provide uncertainty estimates alongside predictions.
  • Kernel choice encodes assumptions about function smoothness.

Natural Language Processing:

  • Latent Dirichlet Allocation (LDA) for topic modeling.
  • Bayesian smoothing for language models (Kneser-Ney smoothing has a Bayesian interpretation).
  • Hierarchical Bayesian models for transfer learning across languages.

Common Mistakes

| Mistake | Why It's Wrong | Correct Approach |
| --- | --- | --- |
| Ignoring the base rate (prevalence) | $P(D) \neq P(D \mid +)$: with a 99%-accurate test for a 1-in-1,000 disease, about 91% of positive results are still false positives | Always include the prior; use Bayes' Theorem explicitly |
| Confusing $P(A \mid B)$ with $P(B \mid A)$ | These are generally not equal; $P(\text{smoke} \mid \text{cancer}) \neq P(\text{cancer} \mid \text{smoke})$ | Draw the conditional structure; apply Bayes' Theorem |
| Treating likelihood as a distribution over parameters | The likelihood $P(D \mid \theta)$ is a function of $\theta$, not a probability distribution over $\theta$ | Use the posterior for inference about $\theta$ |
| Using a single point estimate (MAP) and calling it "Bayesian" | MAP is a point estimate; true Bayesian inference uses the full posterior | Compute posterior summaries (mean, credible intervals) |
| Assuming the Naive Bayes independence assumption always holds | Conditional independence is often violated; performance degrades when features are highly dependent | Test independence; use semi-naive or Tree-augmented Naive Bayes |
| Ignoring prior sensitivity | Strong priors can dominate small datasets, leading to biased conclusions | Perform sensitivity analysis with multiple priors |
| Not normalizing the posterior | $\text{Posterior} \propto \text{Likelihood} \times \text{Prior}$: the denominator is needed for a proper distribution | Use the normalizing constant or MCMC for proper inference |

Interview Questions

Q1: Why is Bayes' Theorem important in machine learning?

A: Bayes' Theorem provides a principled framework for updating beliefs given evidence. In ML, it enables: (1) incorporating prior knowledge into models, (2) quantifying uncertainty in predictions, (3) building models that improve with data sequentially, and (4) regularizing estimates through priors. It underpins Bayesian neural networks, Gaussian processes, and probabilistic graphical models.

Q2: What is the difference between MLE and MAP?

A: MLE maximizes $P(D \mid \theta)$ (the likelihood); MAP maximizes $P(\theta \mid D)$ (the posterior). MAP is equivalent to MLE with a log-prior penalty term. For example, with a Gaussian prior, MAP is equivalent to $L_2$-regularized MLE. MAP produces a point estimate; full Bayesian inference computes the entire posterior distribution.

Q3: A test for a disease with 1% prevalence has 90% sensitivity and 95% specificity. If someone tests positive, what is the probability they have the disease?

A: $P(D \mid +) = \frac{0.9 \times 0.01}{0.9 \times 0.01 + 0.05 \times 0.99} = \frac{0.009}{0.009 + 0.0495} = \frac{0.009}{0.0585} \approx 0.154$. Despite the test being quite accurate, only ~15.4% of positive results are true positives due to the low base rate.

Q4: Why does Naive Bayes work well in practice despite its independence assumption?

A: Three reasons: (1) Classification only requires ranking posteriors, not estimating them accurately, so errors in $P(y \mid x)$ can cancel out if the ranking is preserved. (2) With enough training data, the likelihood term dominates and corrects for prior misspecification. (3) Feature dependencies tend to partially cancel each other's effects. Naive Bayes is particularly effective for high-dimensional sparse data like text.

Q5: What is a conjugate prior and why is it useful?

A: A conjugate prior is one where the posterior belongs to the same distribution family as the prior. For example, the Beta distribution is conjugate to the Bernoulli likelihood: after observing data, the posterior is also Beta. This makes computation analytically tractable (no numerical integration needed), enables elegant closed-form updates, and allows intuitive interpretation of hyperparameters as "pseudo-counts."

Q6: How do you choose a prior in practice?

A: Common approaches: (1) Use domain knowledge if available (e.g., known disease prevalence). (2) Use non-informative priors (Jeffreys prior, uniform) when knowledge is limited. (3) Use weakly informative priors (e.g., Beta(2,2)) to regularize without dominating. (4) Always perform prior sensitivity analysis: check if conclusions are robust to different priors. With sufficient data, the likelihood dominates and the prior's influence diminishes.


Practice Problems

Problem 1: Weather Prediction

You observe that on 8 out of 10 days when it was cloudy in the morning, it rained by afternoon. On 3 out of 10 days when it was sunny in the morning, it rained by afternoon. Historically, it rains on 40% of days.

Question: If it is cloudy this morning, what is the probability of rain this afternoon?

Solution

Given:

  • $P(\text{rain} \mid \text{cloudy}) = 0.8$
  • $P(\text{rain} \mid \text{sunny}) = 0.3$
  • $P(\text{rain}) = 0.4$

The observed frequency already answers the question directly: $P(\text{rain} \mid \text{cloudy}) = 0.8$. Bayes' Theorem serves here as a consistency check, assuming every morning is either cloudy or sunny.

First, recover $P(\text{cloudy})$ from the Law of Total Probability:

$$P(\text{rain}) = P(\text{rain} \mid \text{cloudy}) \, P(\text{cloudy}) + P(\text{rain} \mid \text{sunny}) \, P(\text{sunny})$$
$$0.4 = 0.8 \cdot P(\text{cloudy}) + 0.3 \cdot (1 - P(\text{cloudy}))$$
$$0.4 = 0.5 \cdot P(\text{cloudy}) + 0.3$$
$$P(\text{cloudy}) = 0.2$$

Next, compute $P(\text{cloudy} \mid \text{rain})$:

$$P(\text{cloudy} \mid \text{rain}) = \frac{P(\text{rain} \mid \text{cloudy}) \cdot P(\text{cloudy})}{P(\text{rain})} = \frac{0.8 \times 0.2}{0.4} = 0.4$$

Finally, apply Bayes' Theorem in the other direction:

$$P(\text{rain} \mid \text{cloudy}) = \frac{P(\text{cloudy} \mid \text{rain}) \cdot P(\text{rain})}{P(\text{cloudy})} = \frac{0.4 \times 0.4}{0.2} = 0.8$$

Answer: The probability is 0.8 (or 80%), consistent with the observed frequency.


Problem 2: Bayesian Spam Filter

A spam filter uses two features: whether the email contains "free" and whether it contains "winner". Training data:

| Class $c$ | $P(\text{free} \mid c)$ | $P(\text{winner} \mid c)$ | Prior $P(c)$ |
| --- | --- | --- | --- |
| Spam | 0.95 | 0.70 | 0.30 |
| Ham | 0.10 | 0.05 | 0.70 |

An email contains both "free" and "winner". Should it be classified as spam?

Solution

Using Naive Bayes:

$$P(\text{spam} \mid \text{free}, \text{winner}) \propto 0.30 \times 0.95 \times 0.70 = 0.1995$$
$$P(\text{ham} \mid \text{free}, \text{winner}) \propto 0.70 \times 0.10 \times 0.05 = 0.0035$$

Posterior for spam:

$$P(\text{spam} \mid \text{free}, \text{winner}) = \frac{0.1995}{0.1995 + 0.0035} = \frac{0.1995}{0.203} \approx 0.983$$

Answer: The email should be classified as spam with ~98.3% probability.


Problem 3: Beta-Binomial Updating

You believe a coin is fair, modeled as a Beta(5, 5) prior. You flip it 20 times and get 14 heads.

Question: What is the posterior distribution? What is the posterior probability that $P(\text{heads}) > 0.5$?

Solution

Posterior: Beta($\alpha + s, \beta + n - s$) = Beta(5 + 14, 5 + 6) = Beta(19, 11)

Posterior mean: $\frac{19}{19 + 11} = \frac{19}{30} \approx 0.633$

95% credible interval: Using the Beta distribution, approximately $[0.454, 0.796]$.

$P(\theta > 0.5 \mid \text{data})$:

$$P(\theta > 0.5 \mid \text{data}) = 1 - F_{\text{Beta}(19,11)}(0.5)$$

Using a calculator or Python:

from scipy import stats
1 - stats.beta.cdf(0.5, 19, 11)  # ≈ 0.932

Answer: The posterior is Beta(19, 11) with mean ~0.633. The probability that the coin is biased toward heads is approximately 0.932 (93.2%).


Problem 4: Medical Test with Two Tests

A disease has 2% prevalence. Test A has sensitivity 95% and specificity 90%. Test B (confirmatory) has sensitivity 99% and specificity 98%. If someone tests positive on both tests, what is $P(\text{disease} \mid A+, B+)$?

Solution

Assume the two test results are conditionally independent given disease status, so the posterior after Test A can serve as the prior for Test B.

After Test A positive:

$$P(D \mid A+) = \frac{0.95 \times 0.02}{0.95 \times 0.02 + 0.10 \times 0.98} = \frac{0.019}{0.019 + 0.098} = \frac{0.019}{0.117} \approx 0.1624$$

After Test B positive (using updated prior):

$P(D) = 0.1624$, $P(\neg D) = 0.8376$

$$P(D \mid A+, B+) = \frac{0.99 \times 0.1624}{0.99 \times 0.1624 + 0.02 \times 0.8376} = \frac{0.16078}{0.16078 + 0.01675} = \frac{0.16078}{0.17753} \approx 0.906$$

Answer: After two positive tests, the probability of disease is approximately 90.6%. Sequential Bayesian updating dramatically increases confidence.
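The two-step calculation can be reproduced by chaining a single update function. This is a minimal sketch that assumes the tests are conditionally independent given disease status; the function name `update_on_positive` is an illustrative choice.

```python
def update_on_positive(prior, sensitivity, specificity):
    """Posterior probability of disease after one positive test result."""
    true_positive = sensitivity * prior
    false_positive = (1 - specificity) * (1 - prior)
    return true_positive / (true_positive + false_positive)


prevalence = 0.02

# Test A first, then Test B, feeding each posterior in as the next prior.
after_a = update_on_positive(prevalence, sensitivity=0.95, specificity=0.90)
after_a_then_b = update_on_positive(after_a, sensitivity=0.99, specificity=0.98)

# Reversing the order should not change the final answer.
after_b = update_on_positive(prevalence, sensitivity=0.99, specificity=0.98)
after_b_then_a = update_on_positive(after_b, sensitivity=0.95, specificity=0.90)

print(f"After A: {after_a:.4f}")
print(f"After A then B: {after_a_then_b:.4f}")
print(f"After B then A: {after_b_then_a:.4f}")
```

Under the independence assumption the order of the tests does not matter, because each positive result multiplies the odds of disease by its own likelihood ratio.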


Quick Reference

Bayes' Theorem: Quick Reference

| Concept | Formula | Description |
| --- | --- | --- |
| Bayes' Theorem | $P(A \mid B) = \frac{P(B \mid A)P(A)}{P(B)}$ | Reverse conditioning |
| Posterior | $P(\theta \mid D) \propto P(D \mid \theta)P(\theta)$ | Updated belief |
| MAP | $\theta_{\text{MAP}} = \arg\max_\theta P(\theta \mid D)$ | Most probable parameter |
| MLE | $\theta_{\text{MLE}} = \arg\max_\theta P(D \mid \theta)$ | Best-fitting parameter |
| Naive Bayes | $P(y \mid x_1, \ldots, x_n) \propto P(y) \prod_i P(x_i \mid y)$ | Independence assumption |
| Beta-Binomial | $\text{Beta}(\alpha + s, \beta + n - s)$ | Conjugate updating |
| Law of Total Prob. | $P(B) = \sum_i P(B \mid A_i)P(A_i)$ | Compute the evidence |
| Sequential Update | $P(\theta \mid D_1, D_2) \propto P(D_2 \mid \theta)P(\theta \mid D_1)$ | Posterior → prior |

Key Properties:

  • The posterior is always a proper distribution (integrates to 1) if the prior is proper.
  • With enough data, the posterior becomes insensitive to the prior (Bernstein–von Mises theorem).
  • Log-posterior = log-likelihood + log-prior (up to an additive constant); this additive structure in log-space enables efficient computation.
  • Bayesian and frequentist methods converge with large samples.
