Bayes' Theorem

Why It Matters

ℹ️ The Power of Bayesian Inference

Bayesian inference is a framework for updating beliefs in the light of evidence. Unlike frequentist statistics, which treats parameters as fixed and data as random, Bayesian methods treat parameters as random variables with distributions that evolve as data arrives. This paradigm shift enables:

Uncertainty quantification: Not just a point estimate, but a full probability distribution over possible values.
Sequential learning: Each new observation refines the posterior, which becomes the prior for the next update.
Prior knowledge incorporation: Domain expertise can be encoded formally before any data is collected.
Decision-making under uncertainty: Posterior distributions can be integrated into loss functions for optimal decisions.

Bayesian reasoning underpins modern machine learning — from Gaussian processes and variational autoencoders to reinforcement learning and Bayesian neural networks. Understanding Bayes' theorem is foundational to building systems that reason about what they don't know.

Theorem: Bayes' Theorem (General Form)

Let $A$ and $B$ be events with $P(B) > 0$ . Then:

P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)}

More generally, when the sample space is partitioned into mutually exclusive and exhaustive hypotheses $\{H_1, H_2, \ldots, H_n\}$ , the law of total probability expands the denominator $P(B)$ :

P(H_i \mid B) = \frac{P(B \mid H_i) \, P(H_i)}{\sum_{j=1}^{n} P(B \mid H_j) \, P(H_j)}
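The partition form above can be sketched in a few lines of Python. The function name `posterior_over_hypotheses` and the three-machine defect scenario are illustrative assumptions, not part of the original text.

```python
import numpy as np


def posterior_over_hypotheses(priors, likelihoods):
    """Return P(H_i | B) for every hypothesis H_i in a partition."""
    priors = np.asarray(priors, dtype=float)
    likelihoods = np.asarray(likelihoods, dtype=float)
    joint = likelihoods * priors   # P(B | H_i) * P(H_i)
    evidence = joint.sum()         # P(B), by the law of total probability
    if evidence <= 0:
        raise ValueError("P(B) must be positive")
    return joint / evidence


# Hypothetical example: three machines make 50%, 30%, and 20% of all parts,
# with defect rates of 1%, 2%, and 5%. B = "a randomly chosen part is defective".
machine_posterior = posterior_over_hypotheses([0.5, 0.3, 0.2], [0.01, 0.02, 0.05])
print(machine_posterior)
```

Under these made-up numbers, the machine with the smallest share of production ends up with the largest posterior probability, because its defect rate is the highest.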

Derivation from Conditional Probability

Starting from the definition of conditional probability:

P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \quad P(B \mid A) = \frac{P(A \cap B)}{P(A)}

Solving the second equation for $P(A \cap B)$ :

P(A \cap B) = P(B \mid A) \, P(A)

Substituting into the first equation yields Bayes' Theorem:

P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)} \qquad \blacksquare

Bayesian Inference Components

\text{Posterior} = \frac{\text{Likelihood} \times \text{Prior}}{\text{Evidence}}

Here,

$P(H \mid D)$ = Posterior: updated belief about hypothesis $H$ after observing data $D$
$P(D \mid H)$ = Likelihood: probability of the data given that hypothesis $H$ is true
$P(H)$ = Prior: initial belief about $H$ before seeing data
$P(D)$ = Evidence (marginal likelihood): total probability of the data across all hypotheses

Prior, Likelihood, Posterior

Definition: Prior Distribution

The prior $P(\theta)$ encodes beliefs about parameter $\theta$ before observing any data. It can be:

Informative: Based on previous studies or domain expertise (e.g., "disease prevalence is ~1%").
Non-informative (vague): Spreads probability broadly to let the data dominate (e.g., Uniform(0,1), which is the same distribution as Beta(1,1)).
Conjugate: Chosen so the posterior belongs to the same family as the prior (mathematical convenience).

The choice of prior is a modeling decision. Sensitivity analysis — checking how conclusions change under different priors — is essential.

Definition: Likelihood Function

The likelihood $P(D \mid \theta)$ measures how probable the observed data $D$ is under different values of $\theta$ . Note: the likelihood is not a probability distribution over $\theta$ ; it is a function of $\theta$ with the data $D$ held fixed, and it need not integrate to 1 over $\theta$ .

For independent observations $x_1, x_2, \ldots, x_n$ :

P(D \mid \theta) = \prod_{i=1}^{n} P(x_i \mid \theta)

The likelihood surface may be flat (data is uninformative about $\theta$ ) or sharply peaked (data strongly constrains $\theta$ ).
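As a rough illustration of the product formula, the sketch below evaluates the log-likelihood of a Bernoulli model on a grid of $\theta$ values. The flip data and the grid resolution are assumptions chosen for the example.

```python
import numpy as np

data = np.array([1, 0, 1, 1, 0, 1, 1, 1])   # hypothetical coin flips (1 = heads)
theta_grid = np.linspace(0.01, 0.99, 99)    # candidate values of theta

# log P(D | theta) = sum_i log P(x_i | theta) for independent Bernoulli draws
log_lik = np.array([
    np.sum(data * np.log(t) + (1 - data) * np.log(1 - t)) for t in theta_grid
])

# The grid point with the highest likelihood
theta_hat = theta_grid[np.argmax(log_lik)]
print(theta_hat)
```

With 6 heads in 8 flips the curve peaks at the sample proportion, 0.75. Adding more flips with the same proportion would make the peak sharper.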

Definition: Posterior Distribution

The posterior $P(\theta \mid D)$ is the result of applying Bayes' Theorem:

P(\theta \mid D) = \frac{P(D \mid \theta) \, P(\theta)}{\int P(D \mid \theta) \, P(\theta) \, d\theta}

The posterior encodes all updated knowledge about $\theta$ given the data. Key summaries include:

Posterior mean: $\mathbb{E}[\theta \mid D] = \int \theta \, P(\theta \mid D) \, d\theta$
Posterior mode (MAP): $\theta_{\text{MAP}} = \arg\max_\theta P(\theta \mid D)$
Posterior variance: $\text{Var}[\theta \mid D] = \int (\theta - \mathbb{E}[\theta \mid D])^2 \, P(\theta \mid D) \, d\theta$
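When the integrals above have no convenient closed form, a grid approximation is one simple way to compute these summaries. The sketch below assumes a coin-flip model with a flat prior and made-up data (7 heads, 3 tails); the variable names are illustrative.

```python
import numpy as np

# Grid of candidate values for theta (the probability of heads)
theta = np.linspace(0.0, 1.0, 10_001)
dtheta = theta[1] - theta[0]

heads, tails = 7, 3                          # hypothetical data
prior = np.ones_like(theta)                  # flat prior on [0, 1]
likelihood = theta**heads * (1 - theta)**tails

# Normalize so the posterior density integrates to roughly 1 on the grid
unnormalized = likelihood * prior
posterior = unnormalized / (unnormalized.sum() * dtheta)

post_mean = np.sum(theta * posterior) * dtheta
post_map = theta[np.argmax(posterior)]
post_var = np.sum((theta - post_mean) ** 2 * posterior) * dtheta
print(post_mean, post_map, post_var)
```

In this case the exact posterior is Beta(8, 4), so the grid results can be checked against the closed-form mean 8/12 and mode 7/10.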

Bayesian Updating

The Bayesian updating process is iterative — each observation refines the posterior, and the posterior becomes the prior for the next update.

Theorem: Sequential Bayesian Updating

Given prior $P(\theta)$ and sequential observations $x_1, x_2, \ldots, x_n$ :

Step 1 (Initialize): Start with prior $P(\theta)$ .

Step 2 (Observe $x_1$ ): Compute posterior after first observation:

P(\theta \mid x_1) = \frac{P(x_1 \mid \theta) \, P(\theta)}{\int P(x_1 \mid \theta) \, P(\theta) \, d\theta}

Step 3 (Observe $x_2$ ): Treat the posterior from Step 2 as the new prior:

P(\theta \mid x_1, x_2) = \frac{P(x_2 \mid \theta) \, P(\theta \mid x_1)}{\int P(x_2 \mid \theta) \, P(\theta \mid x_1) \, d\theta}

Step $k$ (Observe $x_k$ ): General update:

P(\theta \mid x_1, \ldots, x_k) \propto P(x_k \mid \theta) \cdot P(\theta \mid x_1, \ldots, x_{k-1})

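Under regularity conditions, and provided the prior assigns positive probability to a neighborhood of the true parameter value, the posterior concentrates around that value as $n \to \infty$ , so the influence of the prior vanishes (posterior consistency). The Bernstein–von Mises theorem sharpens this: the posterior becomes approximately normal, centered near the maximum likelihood estimate.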
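The update steps above can be checked numerically. The sketch below assumes observations that are conditionally independent given $\theta$ and uses a discrete grid with a hypothetical flip sequence. It compares updating one observation at a time against a single batch update.

```python
import numpy as np

theta = np.linspace(0.001, 0.999, 999)           # grid over the coin's bias
prior = np.full(theta.size, 1.0 / theta.size)    # discrete uniform prior


def update(belief, x):
    """One Bayesian update for a single Bernoulli observation x in {0, 1}."""
    likelihood = theta if x == 1 else 1.0 - theta
    unnormalized = likelihood * belief
    return unnormalized / unnormalized.sum()


observations = [1, 1, 0, 1]                      # hypothetical flips

# Sequential: the posterior after each flip becomes the next prior
sequential = prior
for x in observations:
    sequential = update(sequential, x)

# Batch: multiply the prior by the full likelihood once, then normalize
heads = sum(observations)
tails = len(observations) - heads
batch = theta**heads * (1.0 - theta)**tails * prior
batch /= batch.sum()

print(np.allclose(sequential, batch))
```

Both routes give the same posterior, which is why the order in which independent observations arrive does not matter.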

Medical Diagnosis

Medical diagnosis is the canonical application of Bayes' Theorem. It highlights why base rates (prevalence) matter enormously.

📝Medical Test Problem

A disease affects 1 in 1000 people in a population. A diagnostic test has:

Sensitivity (true positive rate): $P(+ \mid D) = 0.99$
Specificity (true negative rate): $P(- \mid \neg D) = 0.99$

A randomly selected person tests positive. What is $P(D \mid +)$ ?

💡Solution

Define events:

$D$ = has disease, $\neg D$ = does not have disease
$+$ = tests positive, $-$ = tests negative

Given:

$P(D) = 0.001$ (prevalence)
$P(+ \mid D) = 0.99$ (sensitivity)
$P(- \mid \neg D) = 0.99$ → $P(+ \mid \neg D) = 0.01$ (false positive rate)

Apply Bayes' Theorem:

P(D \mid +) = \frac{P(+ \mid D) \, P(D)}{P(+ \mid D) \, P(D) + P(+ \mid \neg D) \, P(\neg D)}

= \frac{0.99 \times 0.001}{0.99 \times 0.001 + 0.01 \times 0.999}

= \frac{0.00099}{0.00099 + 0.00999} = \frac{0.00099}{0.01098} \approx 0.0902

Result: Only about 9% of positive tests are true positives!

Intuition: In a population of 10,000:

10 people have the disease → ~9.9 test positive
9,990 people are healthy → ~99.9 test positive (false alarms)
Among ~110 positive tests, only ~10 are genuine

Key takeaway: Even with a 99%-accurate test, a rare disease produces mostly false positives when screened broadly. This is why confirmatory tests are essential.
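To see how strongly the answer depends on prevalence, the calculation can be wrapped in a small function. The function name and the extra prevalence values are assumptions added for illustration.

```python
def prob_disease_given_positive(prevalence, sensitivity, specificity):
    """P(D | +) from Bayes' Theorem for a binary test."""
    true_pos = sensitivity * prevalence                    # P(+ | D) P(D)
    false_pos = (1.0 - specificity) * (1.0 - prevalence)   # P(+ | not D) P(not D)
    return true_pos / (true_pos + false_pos)


# Same 99% sensitive, 99% specific test at three prevalence levels
for prevalence in (0.001, 0.01, 0.1):
    ppv = prob_disease_given_positive(prevalence, 0.99, 0.99)
    print(f"prevalence={prevalence:.3f} -> P(D | +) = {ppv:.3f}")
```

With the same test, the probability of disease after a positive result is roughly 0.09 at 0.1% prevalence, 0.50 at 1%, and 0.92 at 10%.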

Spam Classification

ℹ️ Naive Bayes for Text Classification

The Naive Bayes classifier applies Bayes' Theorem to classification problems with multiple features. It assumes features are conditionally independent given the class — a strong assumption that rarely holds in practice but works surprisingly well, especially for text.

For a document with word features $w_1, w_2, \ldots, w_n$ :

P(\text{spam} \mid w_1, \ldots, w_n) = \frac{P(w_1, \ldots, w_n \mid \text{spam}) \, P(\text{spam})}{P(w_1, \ldots, w_n)}

Under the Naive Bayes assumption:

P(w_1, \ldots, w_n \mid \text{spam}) = \prod_{i=1}^{n} P(w_i \mid \text{spam})

We classify by choosing the class with maximum posterior:

\hat{y} = \arg\max_c \left[ \log P(c) + \sum_{i=1}^{n} \log P(w_i \mid c) \right]

Log-space avoids numerical underflow from multiplying many small probabilities.
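The underflow problem is easy to reproduce. The sketch below uses a hypothetical document of 1,000 words, each with a class-conditional probability of 1e-5; these numbers are assumptions chosen to force underflow.

```python
import numpy as np

# Hypothetical document: 1,000 words, each with probability 1e-5 under the class
word_probs = np.full(1000, 1e-5)

direct_product = np.prod(word_probs)      # underflows to 0.0 in float64
log_score = np.sum(np.log(word_probs))    # stays finite: 1000 * log(1e-5)

print(direct_product, log_score)
```

The direct product is indistinguishable from zero for every class, so the classes cannot be ranked, while the log-space score remains a usable finite number.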

📝Spam Filter Example

Given a vocabulary of two words ("free" and "meeting"), with:

Class	Prior $P(c)$	$P(\text{free} \mid c)$	$P(\text{meeting} \mid c)$
Spam	0.4	0.9	0.1
Ham	0.6	0.2	0.7

An email contains both "free" and "meeting". Classify it.

💡Solution

Spam:

P(\text{spam} \mid \text{free}, \text{meeting}) \propto 0.4 \times 0.9 \times 0.1 = 0.036

Ham:

P(\text{ham} \mid \text{free}, \text{meeting}) \propto 0.6 \times 0.2 \times 0.7 = 0.084

Posterior for spam:

P(\text{spam} \mid \text{free}, \text{meeting}) = \frac{0.036}{0.036 + 0.084} = \frac{0.036}{0.120} = 0.3

Classification: Ham (0.7) > Spam (0.3) → classified as not spam.

Despite "free" being strongly associated with spam, the word "meeting" and the higher prior probability of ham shift the classification.
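The worked example can be reproduced in log space. This is a minimal sketch that, like the example, scores only the words present in the email; the dictionary layout and the `classify` helper are illustrative assumptions.

```python
import math

priors = {"spam": 0.4, "ham": 0.6}
word_given_class = {
    "spam": {"free": 0.9, "meeting": 0.1},
    "ham": {"free": 0.2, "meeting": 0.7},
}


def classify(words):
    """Return the most probable class and the normalized posteriors."""
    # log P(c) + sum_i log P(w_i | c) for each class
    log_scores = {
        c: math.log(priors[c]) + sum(math.log(word_given_class[c][w]) for w in words)
        for c in priors
    }
    # Normalize with the log-sum-exp trick for numerical stability
    peak = max(log_scores.values())
    log_evidence = peak + math.log(sum(math.exp(s - peak) for s in log_scores.values()))
    posteriors = {c: math.exp(s - log_evidence) for c, s in log_scores.items()}
    return max(posteriors, key=posteriors.get), posteriors


label, posteriors = classify(["free", "meeting"])
print(label, posteriors)
```

The result matches the hand calculation: ham wins with posterior 0.7 against 0.3 for spam.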

Prior Distributions

The choice of prior significantly impacts inference, especially with limited data. Conjugate priors offer mathematical tractability.

Definition: Conjugate Priors

A conjugate prior is a prior distribution that, when combined with a given likelihood, produces a posterior in the same family, so the posterior can be computed in closed form.

Likelihood	Conjugate Prior	Posterior
Bernoulli / Binomial	Beta( $\alpha, \beta$ )	Beta( $\alpha + s, \beta + n - s$ ), with $s$ successes in $n$ trials
Poisson	Gamma( $\alpha, \beta$ ), with rate $\beta$	Gamma( $\alpha + \sum x_i, \beta + n$ )
Normal (known $\sigma^2$ )	Normal( $\mu_0, \sigma_0^2$ )	Normal( $\mu_n, \sigma_n^2$ ), with $\sigma_n^2 = (1/\sigma_0^2 + n/\sigma^2)^{-1}$ and $\mu_n = \sigma_n^2 (\mu_0/\sigma_0^2 + \sum x_i/\sigma^2)$
Multinomial	Dirichlet( $\alpha_1, \ldots, \alpha_k$ )	Dirichlet( $\alpha_1 + c_1, \ldots, \alpha_k + c_k$ ), with category counts $c_i$
Normal (known $\mu$ , unknown $\sigma^2$ )	Inverse-Gamma( $\alpha, \beta$ )	Inverse-Gamma( $\alpha + n/2, \beta + \text{SS}/2$ ), with $\text{SS} = \sum_i (x_i - \mu)^2$

Beta-Binomial Example:

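P(\theta \mid s, n) = \frac{\theta^{\alpha + s - 1} \, (1 - \theta)^{\beta + n - s - 1}}{B(\alpha + s, \, \beta + n - s)}

Here $B(\cdot, \cdot)$ is the Beta function, so the posterior is the Beta( $\alpha + s, \beta + n - s$ ) density.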

The posterior mean is $\frac{\alpha + s}{\alpha + \beta + n}$ , which is a weighted average of the prior mean $\frac{\alpha}{\alpha + \beta}$ and the sample proportion $\frac{s}{n}$ .
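The weighted-average reading can be verified directly. In the sketch below, the weight on the prior mean is $(\alpha + \beta)/(\alpha + \beta + n)$ ; the function names and the example numbers (a Beta(2, 2) prior with 8 successes in 10 trials) are assumptions for illustration.

```python
def beta_posterior_mean(alpha, beta, successes, n):
    """Mean of the Beta(alpha + s, beta + n - s) posterior."""
    return (alpha + successes) / (alpha + beta + n)


def weighted_average_form(alpha, beta, successes, n):
    """Same quantity written as a blend of prior mean and sample proportion."""
    weight_on_prior = (alpha + beta) / (alpha + beta + n)
    prior_mean = alpha / (alpha + beta)
    sample_proportion = successes / n
    return weight_on_prior * prior_mean + (1 - weight_on_prior) * sample_proportion


direct = beta_posterior_mean(2, 2, 8, 10)
blended = weighted_average_form(2, 2, 8, 10)
print(direct, blended)
```

Both expressions give 10/14. The prior behaves like $\alpha + \beta$ pseudo-observations, so its weight shrinks as $n$ grows.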

⚠️ Sensitivity to Priors

With small sample sizes, the posterior is sensitive to the choice of prior. As data accumulates, the likelihood dominates and the posterior becomes robust to the prior. Always perform prior sensitivity analysis — check how conclusions change under different priors.

MAP vs MLE

Definition: Maximum Likelihood Estimation (MLE)

The MLE finds the parameter that maximizes the likelihood:

\theta_{\text{MLE}} = \arg\max_\theta P(D \mid \theta) = \arg\max_\theta \log P(D \mid \theta)

MLE treats the parameter as a fixed unknown and finds the single value that best explains the data. Under standard regularity conditions it is consistent (converges to the true parameter as the sample grows) and asymptotically efficient (its variance approaches the Cramér–Rao lower bound).

Definition: Maximum A Posteriori (MAP) Estimation

The MAP estimate finds the parameter that maximizes the posterior:

\theta_{\text{MAP}} = \arg\max_\theta P(\theta \mid D) = \arg\max_\theta \left[ \log P(D \mid \theta) + \log P(\theta) \right]

MAP is equivalent to MLE with a log-prior penalty. This is the Bayesian analog of regularized estimation:

Prior	MAP Equivalent
Uniform	MLE (no regularization)
Gaussian $N(0, \sigma^2)$	Ridge regression ( $L_2$ penalty)
Laplace (double exponential)	Lasso ( $L_1$ penalty)

Key distinction: MLE and MAP produce point estimates; full Bayesian inference produces the entire posterior distribution, capturing uncertainty.
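For a coin-flip model with a Beta prior, both estimators have closed forms, which makes the effect of the prior easy to see. The sketch below is illustrative; the function names and the data (7 heads in 10 flips) are assumptions.

```python
def bernoulli_mle(successes, n):
    """Maximum likelihood estimate of the success probability."""
    return successes / n


def beta_bernoulli_map(alpha, beta, successes, n):
    """Mode of the Beta(alpha + s, beta + n - s) posterior."""
    a = alpha + successes
    b = beta + n - successes
    if a <= 1 or b <= 1:
        raise ValueError("interior mode requires both posterior parameters > 1")
    return (a - 1) / (a + b - 2)


mle = bernoulli_mle(7, 10)
map_flat = beta_bernoulli_map(1, 1, 7, 10)     # uniform prior: same as the MLE
map_beta22 = beta_bernoulli_map(2, 2, 7, 10)   # Beta(2, 2) prior pulls toward 0.5
print(mle, map_flat, map_beta22)
```

With a uniform prior the MAP estimate coincides with the MLE (0.7), while the Beta(2, 2) prior shrinks it to 8/12, about 0.667.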

Python Implementation

import numpy as np
from scipy import stats


class NaiveBayesClassifier:
    """Gaussian Naive Bayes classifier for continuous features."""

    def fit(self, X, y):
        self.classes = np.unique(y)
        self.params = {}
        for c in self.classes:
            X_c = X[y == c]
            self.params[c] = {
                "mean": X_c.mean(axis=0),
                "var": X_c.var(axis=0) + 1e-9,  # smoothing
                "prior": len(X_c) / len(X),
            }
        return self

    def _log_likelihood(self, x, mean, var):
        return -0.5 * np.sum(np.log(2 * np.pi * var) + ((x - mean) ** 2) / var)

    def predict(self, X):
        predictions = []
        for x in X:
            posteriors = []
            for c in self.classes:
                log_prior = np.log(self.params[c]["prior"])
                log_lik = self._log_likelihood(
                    x, self.params[c]["mean"], self.params[c]["var"]
                )
                posteriors.append(log_prior + log_lik)
            predictions.append(self.classes[np.argmax(posteriors)])
        return np.array(predictions)


# Bayesian updating for a coin flip (Beta-Binomial)
def bayesian_coin_update(prior_alpha, prior_beta, flips):
    """Sequential Bayesian updating for a biased coin."""
    alpha, beta = prior_alpha, prior_beta
    history = [(alpha, beta)]

    for flip in flips:
        if flip == 1:
            alpha += 1
        else:
            beta += 1
        history.append((alpha, beta))

    return history


# Example usage
prior_alpha, prior_beta = 2, 2  # Beta(2,2) prior — mildly informative
flips = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]  # observed sequence
history = bayesian_coin_update(prior_alpha, prior_beta, flips)

print("Step | Alpha | Beta | Mean | 95% CI")
print("-" * 50)
for i, (a, b) in enumerate(history):
    mean = a / (a + b)
    ci_low = stats.beta.ppf(0.025, a, b)
    ci_high = stats.beta.ppf(0.975, a, b)
    label = "Prior" if i == 0 else f"Obs {i}"
    print(f"{label:5s} | {a:5.1f} | {b:4.1f} | {mean:.3f} | [{ci_low:.3f}, {ci_high:.3f}]")

Output (the interval endpoints below are approximate; rerun the script for exact values):

Step | Alpha | Beta  | Mean  | 95% CI
--------------------------------------------------
Prior |   2.0 |   2.0 | 0.500 | [0.079, 0.921]
Obs 1 |   3.0 |   2.0 | 0.600 | [0.170, 0.934]
Obs 2 |   4.0 |   2.0 | 0.667 | [0.263, 0.942]
Obs 3 |   4.0 |   3.0 | 0.571 | [0.245, 0.855]
Obs 4 |   5.0 |   3.0 | 0.625 | [0.306, 0.882]
Obs 5 |   6.0 |   3.0 | 0.667 | [0.359, 0.900]
Obs 6 |   7.0 |   3.0 | 0.700 | [0.400, 0.915]
Obs 7 |   7.0 |   4.0 | 0.636 | [0.363, 0.860]
Obs 8 |   8.0 |   4.0 | 0.667 | [0.410, 0.872]
Obs 9 |   9.0 |   4.0 | 0.692 | [0.444, 0.887]
Obs10 |  10.0 |   4.0 | 0.714 | [0.472, 0.898]

The posterior mean starts at 0.5 (prior belief) and moves toward the observed proportion (8/10 = 0.8) as data accumulates, reaching 10/14 ≈ 0.714 after ten flips.

Applications in AI/ML

ℹ️ Bayesian Methods in Modern AI

Bayesian inference is not a niche technique — it is woven into the fabric of modern AI and machine learning.

Bayesian Deep Learning:

Replaces point estimates of neural network weights with distributions.
MC Dropout: Applying dropout at test time approximates Bayesian inference over weights.
Bayes by Backprop: Maintains a variational distribution over weights, trained by minimizing KL divergence.
Enables uncertainty estimation — critical for safety-critical applications (medical AI, autonomous driving).

A/B Testing:

Frequentist A/B testing produces p-values and confidence intervals.
Bayesian A/B testing directly computes $P(A > B \mid \text{data})$ — the probability that variant A outperforms B.
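Commonly cited advantages: a direct probabilistic interpretation, the ability to incorporate prior knowledge, and less dependence on a fixed sample size (although repeated looks at the data and multiple comparisons still call for care).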
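One common way to estimate $P(A > B \mid \text{data})$ for conversion rates is Monte Carlo sampling from conjugate Beta posteriors. The sketch below assumes uniform Beta(1, 1) priors and entirely hypothetical conversion counts.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: conversions and visitors for each variant
conv_a, n_a = 120, 1000
conv_b, n_b = 100, 1000

# Beta(1, 1) priors with binomial data give Beta posteriors (conjugacy)
samples_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, size=100_000)
samples_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, size=100_000)

# Fraction of posterior draws in which A's rate exceeds B's
prob_a_beats_b = np.mean(samples_a > samples_b)
print(prob_a_beats_b)
```

The estimate is the share of joint posterior draws in which variant A has the higher conversion rate, so it carries Monte Carlo error that shrinks as the number of draws grows.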

Bayesian Optimization:

Uses a Gaussian process surrogate model to find the optimum of expensive black-box functions.
Acquisition function (e.g., Expected Improvement) balances exploration vs. exploitation.
Used for hyperparameter tuning (e.g., GP-based optimizers, or Tree-structured Parzen Estimators, which replace the Gaussian process with density estimates).

Gaussian Processes:

Non-parametric Bayesian models for regression and classification.
Provide uncertainty estimates alongside predictions.
Kernel choice encodes assumptions about function smoothness.

Natural Language Processing:

Latent Dirichlet Allocation (LDA) for topic modeling.
Bayesian smoothing for language models (Kneser-Ney smoothing has a Bayesian interpretation).
Hierarchical Bayesian models for transfer learning across languages.

Common Mistakes

Mistake	Why It's Wrong	Correct Approach
Ignoring the base rate (prevalence)	$P(+ \mid D) \neq P(D \mid +)$ : with a 99%-sensitive, 99%-specific test for a 1-in-1,000 disease, ~91% of positive results are false positives	Always include the prior; use Bayes' Theorem explicitly
Confusing $P(A \mid B)$ with $P(B \mid A)$	These are generally not equal; $P(\text{smoke} \mid \text{cancer}) \neq P(\text{cancer} \mid \text{smoke})$	Draw the conditional structure; apply Bayes' Theorem
Treating likelihood as a distribution over parameters	The likelihood $P(D \mid \theta)$ is a function of $\theta$ , not a probability distribution over $\theta$	Use the posterior for inference about $\theta$
Using a single point estimate (MAP) and calling it "Bayesian"	MAP is a point estimate; true Bayesian inference uses the full posterior	Compute posterior summaries (mean, credible intervals)
Assuming the Naive Bayes independence assumption always holds	Conditional independence is often violated; performance degrades when features are highly dependent	Test independence; use semi-naive or Tree-augmented Naive Bayes
Ignoring prior sensitivity	Strong priors can dominate small datasets, leading to biased conclusions	Perform sensitivity analysis with multiple priors
Not normalizing the posterior	$\text{Posterior} \propto \text{Likelihood} \times \text{Prior}$ — the denominator is needed for a proper distribution	Use the normalizing constant or MCMC for proper inference

Interview Questions

Q1: Why is Bayes' Theorem important in machine learning?

A: Bayes' Theorem provides a principled framework for updating beliefs given evidence. In ML, it enables: (1) incorporating prior knowledge into models, (2) quantifying uncertainty in predictions, (3) building models that improve with data sequentially, and (4) regularizing estimates through priors. It underpins Bayesian neural networks, Gaussian processes, and probabilistic graphical models.

Q2: What is the difference between MLE and MAP?

A: MLE maximizes $P(D \mid \theta)$ (the likelihood); MAP maximizes $P(\theta \mid D)$ (the posterior). MAP is equivalent to MLE with a log-prior penalty term. For example, with a Gaussian prior, MAP is equivalent to $L_2$ -regularized MLE. MAP produces a point estimate; full Bayesian inference computes the entire posterior distribution.

Q3: A test for a disease with 1% prevalence has 90% sensitivity and 95% specificity. If someone tests positive, what is the probability they have the disease?

A: $P(D \mid +) = \frac{0.9 \times 0.01}{0.9 \times 0.01 + 0.05 \times 0.99} = \frac{0.009}{0.009 + 0.0495} = \frac{0.009}{0.0585} \approx 0.154$ . Despite the test being quite accurate, only ~15.4% of positive results are true positives due to the low base rate.

Q4: Why does Naive Bayes work well in practice despite its independence assumption?

A: Three reasons: (1) Classification only requires ranking posteriors, not estimating them accurately — errors in $P(y \mid x)$ can cancel out if the ranking is preserved. (2) With enough training data, the likelihood term dominates and corrects for prior misspecification. (3) Feature dependencies tend to partially cancel each other's effects. Naive Bayes is particularly effective for high-dimensional sparse data like text.

Q5: What is a conjugate prior and why is it useful?

A: A conjugate prior is one where the posterior belongs to the same distribution family as the prior. For example, the Beta distribution is conjugate to the Bernoulli likelihood — after observing data, the posterior is also Beta. This makes computation analytically tractable (no numerical integration needed), enables elegant closed-form updates, and allows intuitive interpretation of hyperparameters as "pseudo-counts."

Q6: How do you choose a prior in practice?

A: Common approaches: (1) Use domain knowledge if available (e.g., known disease prevalence). (2) Use non-informative priors (Jeffreys prior, uniform) when knowledge is limited. (3) Use weakly informative priors (e.g., Beta(2,2)) to regularize without dominating. (4) Always perform prior sensitivity analysis — check if conclusions are robust to different priors. With sufficient data, the likelihood dominates and the prior's influence diminishes.

Practice Problems

Problem 1: Weather Prediction

You observe that on 8 out of 10 days when it was cloudy in the morning, it rained by afternoon. On 3 out of 10 days when it was sunny in the morning, it rained by afternoon. Historically, it rains on 40% of days.

Question: If it is cloudy this morning, what is the probability of rain this afternoon?

💡Solution

Given:

$P(\text{rain} \mid \text{cloudy}) = 0.8$
$P(\text{rain} \mid \text{sunny}) = 0.3$
$P(\text{rain}) = 0.4$

By Bayes' Theorem:

P(\text{rain} \mid \text{cloudy}) = \frac{P(\text{cloudy} \mid \text{rain}) \cdot P(\text{rain})}{P(\text{cloudy})}

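The problem already states $P(\text{rain} \mid \text{cloudy})$ , so this formula acts as a consistency check. It needs $P(\text{cloudy})$ and $P(\text{cloudy} \mid \text{rain})$ . First recover $P(\text{cloudy})$ from the Law of Total Probability, assuming every morning is either cloudy or sunny: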

P(\text{rain}) = P(\text{rain} \mid \text{cloudy})P(\text{cloudy}) + P(\text{rain} \mid \text{sunny})P(\text{sunny})

0.4 = 0.8 \cdot P(\text{cloudy}) + 0.3 \cdot (1 - P(\text{cloudy}))

0.4 = 0.5 \cdot P(\text{cloudy}) + 0.3

P(\text{cloudy}) = 0.2

Now apply Bayes' Theorem:

P(\text{rain} \mid \text{cloudy}) = \frac{P(\text{cloudy} \mid \text{rain}) \cdot P(\text{rain})}{P(\text{cloudy})}

We compute $P(\text{cloudy} \mid \text{rain})$ :

P(\text{cloudy} \mid \text{rain}) = \frac{P(\text{rain} \mid \text{cloudy}) \cdot P(\text{cloudy})}{P(\text{rain})} = \frac{0.8 \times 0.2}{0.4} = 0.4

Therefore:

P(\text{rain} \mid \text{cloudy}) = \frac{0.4 \times 0.4}{0.2} = 0.8

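Answer: The probability is 0.8 (or 80%), which is the observed frequency given in the problem. The derivation also yields $P(\text{cloudy}) = 0.2$ and $P(\text{cloudy} \mid \text{rain}) = 0.4$ .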

Problem 2: Bayesian Spam Filter

A spam filter uses two features: whether the email contains "free" and whether it contains "winner". Training data:

Class	$P(\text{free} \mid c)$	$P(\text{winner} \mid c)$	Prior $P(c)$
Spam	0.95	0.70	0.30
Ham	0.10	0.05	0.70

An email contains both "free" and "winner". Should it be classified as spam?

💡Solution

Using Naive Bayes:

P(\text{spam} \mid \text{free}, \text{winner}) \propto 0.30 \times 0.95 \times 0.70 = 0.1995

P(\text{ham} \mid \text{free}, \text{winner}) \propto 0.70 \times 0.10 \times 0.05 = 0.0035

Posterior for spam:

P(\text{spam} \mid \text{free}, \text{winner}) = \frac{0.1995}{0.1995 + 0.0035} = \frac{0.1995}{0.203} \approx 0.983

Answer: The email should be classified as spam with ~98.3% probability.

Problem 3: Beta-Binomial Updating

You believe a coin is fair, modeled as a Beta(5, 5) prior. You flip it 20 times and get 14 heads.

Question: What is the posterior distribution? What is the posterior probability that $P(\text{heads}) > 0.5$ ?

💡Solution

Posterior: Beta( $\alpha + s, \beta + n - s$ ) = Beta(5 + 14, 5 + 6) = Beta(19, 11)

Posterior mean: $\frac{19}{19 + 11} = \frac{19}{30} \approx 0.633$

95% credible interval: Using the Beta distribution, approximately $[0.454, 0.796]$ .

$P(\theta > 0.5 \mid \text{data})$ :

P(\theta > 0.5) = 1 - F_{\text{Beta}(19,11)}(0.5)

Using a calculator or Python:

from scipy import stats
1 - stats.beta.cdf(0.5, 19, 11)  # ≈ 0.932

Answer: The posterior is Beta(19, 11) with mean ~0.633. The probability that the coin is biased toward heads is approximately 0.932 (93.2%).

Problem 4: Medical Test with Two Tests

A disease has 2% prevalence. Test A has sensitivity 95% and specificity 90%. Test B (confirmatory) has sensitivity 99% and specificity 98%. If someone tests positive on both tests, what is $P(\text{disease} \mid A+, B+)$ ?

💡Solution

After Test A positive:

P(D \mid A+) = \frac{0.95 \times 0.02}{0.95 \times 0.02 + 0.10 \times 0.98} = \frac{0.019}{0.019 + 0.098} = \frac{0.019}{0.117} \approx 0.1624

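After Test B positive (using the posterior from Test A as the prior, and assuming the two tests are conditionally independent given disease status):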

$P(D) = 0.1624$ , $P(\neg D) = 0.8376$

P(D \mid A+, B+) = \frac{0.99 \times 0.1624}{0.99 \times 0.1624 + 0.02 \times 0.8376} = \frac{0.1608}{0.1608 + 0.0168} = \frac{0.1608}{0.1776} \approx 0.905

Answer: After two positive tests, the probability of disease is approximately 90.5%. Sequential Bayesian updating dramatically increases confidence.
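The same result can be reached in odds form, where each positive test multiplies the current odds by its likelihood ratio, sensitivity / (1 - specificity). The sketch below assumes, as the solution does, that the tests are conditionally independent given disease status; the helper name is illustrative.

```python
def positive_likelihood_ratio(sensitivity, specificity):
    """How much a positive result multiplies the odds of disease."""
    return sensitivity / (1.0 - specificity)


prevalence = 0.02
odds = prevalence / (1.0 - prevalence)        # prior odds of disease

# Test A (95% sensitivity, 90% specificity), then Test B (99%, 98%)
for sensitivity, specificity in [(0.95, 0.90), (0.99, 0.98)]:
    odds *= positive_likelihood_ratio(sensitivity, specificity)

posterior = odds / (1.0 + odds)               # convert odds back to a probability
print(round(posterior, 4))
```

Test A multiplies the odds by 9.5 and Test B by 49.5, which is why the confirmatory test contributes most of the gain in confidence.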

Quick Reference

📋Bayes' Theorem — Quick Reference

Concept	Formula	Description
Bayes' Theorem	$P(A \mid B) = \frac{P(B \mid A)P(A)}{P(B)}$	Reverse conditioning
Posterior	$P(\theta \mid D) \propto P(D \mid \theta)P(\theta)$	Updated belief
MAP	$\theta_{\text{MAP}} = \arg\max_\theta P(\theta \mid D)$	Most probable parameter
MLE	$\theta_{\text{MLE}} = \arg\max_\theta P(D \mid \theta)$	Best-fitting parameter
Naive Bayes	$P(y \mid x_1, \ldots, x_n) \propto P(y) \prod_i P(x_i \mid y)$	Independence assumption
Beta-Binomial	$\text{Beta}(\alpha + s, \beta + n - s)$	Conjugate updating
Law of Total Prob.	$P(B) = \sum_i P(B \mid A_i)P(A_i)$	Compute the evidence
Sequential Update	$P(\theta \mid D_1, D_2) \propto P(D_2 \mid \theta)P(\theta \mid D_1)$	Posterior → prior

Key Properties:

The posterior is always a proper distribution (integrates to 1) if the prior is proper.
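With enough data, the posterior becomes insensitive to the prior, provided the prior gives positive probability to the true parameter value; the Bernstein–von Mises theorem adds that the posterior is then approximately normal.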
Log-posterior = log-likelihood + log-prior — the additive structure in log-space enables efficient computation.
Bayesian and frequentist estimates typically agree in large samples, under the same regularity conditions.

Cross-References

034 - Probability Conditional — Conditional probability and the Law of Total Probability, prerequisites for Bayes' Theorem.
035 - Probability Random Variables — Random variables and distributions underlying Bayesian inference.
037 - Statistics Hypothesis Testing — Frequentist hypothesis testing; contrast with Bayesian hypothesis testing.
038 - Statistics Regression — Linear regression; Bayesian linear regression uses priors on coefficients.
040 - ML Classification — Naive Bayes classifier in the context of broader classification methods.
042 - ML Model Evaluation — Metrics for evaluating classifiers; calibration of probabilistic predictions.
044 - ML Optimization — MAP estimation and regularization as optimization problems.