CW

Probability, Bayes' Theorem and PDF/CDF

Module 4: Statistics & ProbabilityFree Lesson

Advertisement

Why Probability Matters

Probability is the mathematics of uncertainty. In data science, we use probability to quantify uncertainty in predictions, make inferences about populations from samples, build probabilistic models, and test hypotheses.

Architecture Diagram
Certainty <----------------------------------------------> Uncertainty
   |                                                          |
   +---------------- Probability Theory --------------------+

Fundamental Concepts

Sample Space and Events

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Sample space: All possible outcomes
# Event: A subset of the sample space

# Example: Rolling two dice
sample_space = {(i, j) for i in range(1, 7) for j in range(1, 7)}
print(f"Sample space size: {len(sample_space)}")  # 36

# Event: Sum equals 7
event_sum_7 = {(i, j) for i, j in sample_space if i + j == 7}
print(f"Event sum=7: {event_sum_7}")
print(f"Probability: {len(event_sum_7)/len(sample_space):.4f}")  # 6/36 = 0.1667

Basic Probability Rules

Addition Rule

P(AB)=P(A)+P(B)P(AB)P(A \cup B) = P(A) + P(B) - P(A \cap B)

Here,

  • P(Aa^ˆªB)P(A ∪ B)=Probability of A or B occurring
  • P(Aa^ˆ©B)P(A ∩ B)=Probability of both A and B occurring

Multiplication Rule

P(AB)=P(AB)P(B)=P(BA)P(A)P(A \cap B) = P(A|B) \cdot P(B) = P(B|A) \cdot P(A)

Here,

  • P(AB)P(A|B)=Conditional probability of A given B
Conditional Probability
P(AB)=P(AB)P(B)P(A|B) = \frac{P(A \cap B)}{P(B)}
# Verify addition rule
p_red = 26/52
p_king = 4/52
p_red_and_king = 2/52
p_red_or_king = p_red + p_king - p_red_and_king
print(f"P(Red or King): {p_red_or_king:.4f}")

Independence and Conditional Probability

Two events A and B are independent if and only if P(A ∩ B) = P(A)·P(B), equivalently P(A|B) = P(A). Independence means knowing B occurred gives no information about A. Always check independence before applying simplified multiplication rules.

Probability Distributions

Discrete Distributions

# 1. Bernoulli Distribution (single trial)
p = 0.3
bernoulli = stats.bernoulli(p)
print(f"Bernoulli(p={p}):")
print(f"  P(X=0): {bernoulli.pmf(0):.2f}")
print(f"  P(X=1): {bernoulli.pmf(1):.2f}")
print(f"  Mean: {bernoulli.mean():.2f}")
print(f"  Variance: {bernoulli.var():.2f}")

# 2. Binomial Distribution (n trials)
n, p = 10, 0.5
binomial = stats.binom(n, p)
x = np.arange(0, n+1)

plt.figure(figsize=(10, 5))
plt.bar(x, binomial.pmf(x), color='skyblue', edgecolor='black')
plt.title(f'Binomial Distribution (n={n}, p={p})')
plt.xlabel('Number of Successes')
plt.ylabel('Probability')
plt.show()

# 3. Poisson Distribution (events per interval)
lambda_param = 5
poisson = stats.poisson(lambda_param)
x = np.arange(0, 15)

plt.figure(figsize=(10, 5))
plt.bar(x, poisson.pmf(x), color='lightgreen', edgecolor='black')
plt.title(f'Poisson Distribution (λ={lambda_param})')
plt.xlabel('Number of Events')
plt.ylabel('Probability')
plt.show()

Binomial Distribution PMF

P(X=k)=(nk)pk(1p)nkP(X = k) = \binom{n}{k} p^k (1-p)^{n-k}

Here,

  • nn=Number of trials
  • kk=Number of successes
  • pp=Probability of success on each trial

Poisson Distribution PMF

P(X=k)=λkeλk!P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}

Here,

  • I^»Î»=Average rate of events per interval
  • kk=Number of events

Continuous Distributions

# 1. Normal Distribution (Gaussian)
mu, sigma = 0, 1
normal = stats.norm(mu, sigma)
x = np.linspace(-4, 4, 1000)

fig, axes = plt.subplots(1, 2, figsize=(12, 5))
axes[0].plot(x, normal.pdf(x), 'b-', linewidth=2)
axes[0].fill_between(x, normal.pdf(x), alpha=0.3)
axes[0].set_title('Normal Distribution PDF')

axes[1].plot(x, normal.cdf(x), 'r-', linewidth=2)
axes[1].set_title('Normal Distribution CDF')
plt.tight_layout()
plt.show()

# 2. Exponential Distribution
lambda_param = 1
exponential = stats.expon(scale=1/lambda_param)
x = np.linspace(0, 5, 1000)

plt.figure(figsize=(10, 5))
plt.plot(x, exponential.pdf(x), 'r-', linewidth=2)
plt.fill_between(x, exponential.pdf(x), alpha=0.3)
plt.title(f'Exponential Distribution (λ={lambda_param})')
plt.show()

# 3. Beta Distribution
alpha, beta_param = 2, 5
beta = stats.beta(alpha, beta_param)
x = np.linspace(0, 1, 1000)

plt.figure(figsize=(10, 5))
plt.plot(x, beta.pdf(x), 'g-', linewidth=2)
plt.fill_between(x, beta.pdf(x), alpha=0.3)
plt.title(f'Beta Distribution (α={alpha}, β={beta_param})')
plt.show()

PDF vs CDF

x = np.linspace(-4, 4, 1000)
pdf = stats.norm.pdf(x)
cdf = stats.norm.cdf(x)

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

axes[0].plot(x, pdf, 'b-', linewidth=2)
axes[0].fill_between(x, pdf, where=(x >= -1) & (x <= 1), alpha=0.5, label='P(-1 ≤ X ≤ 1)')
axes[0].set_title('PDF: P(-1 ≤ X ≤ 1) ≈ 0.68')
axes[0].legend()

axes[1].plot(x, cdf, 'r-', linewidth=2)
axes[1].axhline(y=0.84, color='g', linestyle='--', label='P(X ≤ 1)')
axes[1].axvline(x=1, color='g', linestyle='--')
axes[1].set_title('CDF: P(X ≤ 1)')
axes[1].legend()
plt.tight_layout()
plt.show()

# Calculating probabilities
p_between = stats.norm.cdf(1) - stats.norm.cdf(-1)
print(f"P(-1 ≤ X ≤ 1): {p_between:.4f}")

p_greater = 1 - stats.norm.cdf(2)
print(f"P(X > 2): {p_greater:.4f}")

Computing Probabilities with CDFs

For any continuous distribution, compute interval probabilities using the CDF:

  • P(a ≤ X ≤ b) = F(b) - F(a)
  • P(X > a) = 1 - F(a)
  • P(X ≤ a) = F(a) This works because the CDF encodes all probabilistic information about the distribution.

Bayes' Theorem

Bayes' Theorem
P(AB)=P(BA)×P(A)P(B)P(A|B) = \frac{P(B|A) \times P(A)}{P(B)}

Intuitive Understanding

Architecture Diagram
Bayes' Theorem:

P(Hypothesis | Data) = P(Data | Hypothesis) × P(Hypothesis)
                       -------------------------------------
                                    P(Data)

Posterior = Likelihood × Prior
            ------------------
               Evidence

Bayes' Theorem (Expanded)

P(AB)=P(BA)P(A)P(BA)P(A)+P(B¬A)P(¬A)P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B|A) \cdot P(A) + P(B|\neg A) \cdot P(\neg A)}

Here,

  • P(AB)P(A|B)=Posterior: probability of hypothesis given evidence
  • P(BA)P(B|A)=Likelihood: probability of evidence given hypothesis
  • P(A)P(A)=Prior: initial belief in hypothesis
  • P(BA^¬A)P(B|¬A)=False positive rate
  • P(A^¬A)P(¬A)=Prior probability of complement

Example: Medical Testing

# Medical test example
p_disease = 0.01
p_positive_given_disease = 0.99
p_positive_given_healthy = 0.05  # False positive rate

# P(Positive) = P(Positive|Disease)×P(Disease) + P(Positive|Healthy)×P(Healthy)
p_positive = (p_positive_given_disease * p_disease + 
              p_positive_given_healthy * (1 - p_disease))

# Apply Bayes' theorem
p_disease_given_positive = (p_positive_given_disease * p_disease) / p_positive

print("Medical Test Analysis:")
print(f"P(Disease): {p_disease:.2%}")
print(f"P(Positive|Disease): {p_positive_given_disease:.2%}")
print(f"P(Positive|Healthy): {p_positive_given_healthy:.2%}")
print(f"P(Disease|Positive): {p_disease_given_positive:.2%}")

Worked Example: Medical Test (Bayes' Theorem)

Given: Disease prevalence = 1%, Sensitivity = 99%, Specificity = 95%

Step 1: Identify the quantities

  • P(Disease) = 0.01 (prior)
  • P(Positive | Disease) = 0.99 (sensitivity/likelihood)
  • P(Positive | Healthy) = 1 - 0.95 = 0.05 (false positive rate)

Step 2: Compute evidence P(Positive) P(Positive) = 0.99 × 0.01 + 0.05 × 0.99 = 0.0099 + 0.0495 = 0.0594

Step 3: Apply Bayes' Theorem P(Disease | Positive) = 0.0099 / 0.0594 ≈ 16.67%

Interpretation: Despite the test being "99% accurate," only ~17% of positive results are true positives. This is the base rate fallacy.

Naive Bayes Classification

# Spam classification
p_spam = 0.3
p_ham = 0.7

likelihoods = {
    'free': {'spam': 0.8, 'ham': 0.1},
    'money': {'spam': 0.7, 'ham': 0.05},
    'meeting': {'spam': 0.1, 'ham': 0.6},
    'project': {'spam': 0.2, 'ham': 0.5}
}

words = ['free', 'money', 'meeting']

p_words_given_spam = np.prod([likelihoods[w]['spam'] for w in words])
p_words_given_ham = np.prod([likelihoods[w]['ham'] for w in words])

p_words = p_words_given_spam * p_spam + p_words_given_ham * p_ham
p_spam_given_words = (p_words_given_spam * p_spam) / p_words

print(f"P(Spam|Words): {p_spam_given_words:.4f}")
print(f"Classification: {'Spam' if p_spam_given_words > 0.5 else 'Ham'}")

The 'Naive' Assumption

Naive Bayes assumes features are conditionally independent given the class label: P(x₁, x₂, ..., xₙ | C) = ∏ᵢ P(xᵢ | C). This assumption is rarely true in practice, yet Naive Bayes performs remarkably well for text classification because the ranking of posterior probabilities is often correct even if the absolute values are not.

Central Limit Theorem (CLT)

Central Limit Theorem
XˉN(μ,σ2n) as n\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right) \text{ as } n \to \infty
# Demonstrate CLT with different distributions
np.random.seed(42)

distributions = {
    'Uniform': np.random.uniform(0, 10, 10000),
    'Exponential': np.random.exponential(2, 10000),
    'Chi-Square': np.random.chisquare(2, 10000),
    'Bimodal': np.concatenate([np.random.normal(-2, 1, 5000), 
                                np.random.normal(2, 1, 5000)])
}

sample_sizes = [1, 5, 10, 30]

fig, axes = plt.subplots(len(distributions), len(sample_sizes), figsize=(16, 12))

for i, (dist_name, data) in enumerate(distributions.items()):
    for j, n in enumerate(sample_sizes):
        sample_means = [np.mean(np.random.choice(data, n)) for _ in range(1000)]
        axes[i, j].hist(sample_means, bins=30, density=True, alpha=0.7, edgecolor='black')
        axes[i, j].set_title(f'{dist_name}\nn={n}')

plt.suptitle('Central Limit Theorem Demonstration', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

Rules of Thumb for CLT

  • n ≥ 30: CLT approximation is generally reasonable for most distributions
  • n ≥ 50: Good approximation even for moderately skewed distributions
  • n ≥ 100: Excellent approximation for nearly all distributions
  • For heavily skewed or heavy-tailed distributions, larger samples are needed

Practical Example: A/B Testing with Bayesian Analysis

np.random.seed(42)

# A/B test data
n_control, conversions_control = 1000, 120
n_treatment, conversions_treatment = 1000, 150

# Bayesian approach with Beta prior
prior_a, prior_b = 1, 1

posterior_control = stats.beta(conversions_control + prior_a,
                               n_control - conversions_control + prior_b)
posterior_treatment = stats.beta(conversions_treatment + prior_a,
                                 n_treatment - conversions_treatment + prior_b)

n_samples = 10000
samples_control = posterior_control.rvs(n_samples)
samples_treatment = posterior_treatment.rvs(n_samples)

prob_b_better = np.mean(samples_treatment > samples_control)
lift = (samples_treatment - samples_control) / samples_control

print(f"P(B > A): {prob_b_better:.2%}")
print(f"Expected Lift: {np.mean(lift)*100:.1f}% ± {np.std(lift)*100:.1f}%")

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

x = np.linspace(0.08, 0.20, 1000)
axes[0].plot(x, posterior_control.pdf(x), 'b-', label='Control', linewidth=2)
axes[0].plot(x, posterior_treatment.pdf(x), 'r-', label='Treatment', linewidth=2)
axes[0].fill_between(x, posterior_control.pdf(x), alpha=0.3)
axes[0].fill_between(x, posterior_treatment.pdf(x), alpha=0.3)
axes[0].set_title('Posterior Distributions')
axes[0].legend()

axes[1].hist(lift * 100, bins=50, density=True, alpha=0.7, edgecolor='black')
axes[1].axvline(x=0, color='r', linestyle='--', label='No Effect')
axes[1].axvline(x=np.mean(lift)*100, color='g', linestyle='-',
                label=f'Mean: {np.mean(lift)*100:.1f}%')
axes[1].set_title('Lift Distribution (B vs A)')
axes[1].legend()
plt.tight_layout()
plt.show()

Key Takeaways

Summary: Probability, Bayes, PDF/CDF, CLT

  1. Probability provides a rigorous axiomatic framework for quantifying uncertainty — P(A) ≥ 0, P(Ω) = 1, countable additivity
  2. PDF describes density (area = probability); CDF gives cumulative probability directly — use CDFs for interval calculations
  3. Bayes' Theorem enables principled belief updating: Posterior ∝ Likelihood × Prior. Watch for the base rate fallacy
  4. CLT guarantees the sampling distribution of the mean is approximately normal for large n — this underpins confidence intervals and hypothesis tests
  5. Discrete distributions (Binomial, Poisson) model counts; continuous distributions (Normal, Exponential, Beta) model measurements
  6. Naive Bayes assumes conditional independence — rarely true but often effective for classification

Practice Exercise

  1. Calculate the probability of getting exactly 3 heads in 5 fair coin flips using the Binomial distribution
  2. Apply Bayes' Theorem to a drug testing scenario: 2% prevalence, 95% sensitivity, 90% specificity
  3. Demonstrate CLT by sampling from an exponential distribution with n = 2, 10, 30, 100
  4. Build a Bayesian A/B testing framework that computes P(B > A) for different sample sizes
  5. For a Poisson process with λ = 3 events/hour, compute the probability of observing 0-5+ events
  6. Compare the PDF and CDF of the normal distribution visually
  7. Verify that P(μ - σ ≤ X ≤ μ + σ) ≈ 0.6827 for a standard normal

Advertisement

Need Expert Data Science Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement