Central Limit Theorem

Why It Matters

💡 Why It Matters

The Central Limit Theorem (CLT) is arguably the most important result in statistics. It explains why the Normal distribution appears so often (in measurement errors, aggregated test scores, and sample means) even when the underlying data is far from Normal. Most large-sample confidence intervals, hypothesis tests, and A/B experiments rest on it: whenever you compute a p-value or a confidence interval from a Normal approximation, you are relying on the CLT. For AI practitioners, the CLT underpins the statistical guarantees behind model evaluation and feature importance testing, and it motivates the common Gaussian approximation for mini-batch gradient noise.

Theorem: Central Limit Theorem (Lindeberg-Lévy)

If $X_1, X_2, \ldots, X_n$ are independent and identically distributed random variables with finite mean $E[X_i] = \mu$ and finite variance $\text{Var}(X_i) = \sigma^2 > 0$ , then the standardized sample mean converges in distribution to the standard Normal:

Z_n = \frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}} \xrightarrow{d} N(0, 1) \quad \text{as } n \to \infty

where $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$ .

CLT: Sample Mean Distribution

\bar{X}_n \;\dot{\sim}\; N\!\left(\mu,\; \frac{\sigma^2}{n}\right)

Here,

$\mu$ =Population mean — the center of the sampling distribution
$\sigma^2$ =Population variance — controls spread of individual observations
$n$ =Sample size — the number of i.i.d. observations
$\sigma^2/n$ =Variance of the sample mean (standard error squared)

ℹ️ What the CLT Really Says

The CLT is a statement about the sampling distribution of the mean. It does not say the data becomes Normal — the data stays whatever distribution it started with. It says that if you repeatedly draw samples of size $n$ and compute each sample mean, the collection of those means will be approximately Normal. The larger $n$ is, the better the approximation.
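The distinction between the data and the sampling distribution of the mean can be checked numerically. The sketch below is illustrative only: the Exponential(1) population, the sample size `n = 50`, and all variable names are assumptions chosen for the demo, not part of the theorem.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

n = 50               # observations per sample (assumed for the demo)
n_samples = 20_000   # number of repeated samples

# Raw data: Exponential(1) is strongly right-skewed (theoretical skewness = 2)
data = rng.exponential(scale=1.0, size=(n_samples, n))

# One mean per sample: this is the quantity the CLT describes
sample_means = data.mean(axis=1)

raw_skew = stats.skew(data.ravel())
mean_skew = stats.skew(sample_means)
means_std = sample_means.std(ddof=1)

print(f"Skewness of raw data:          {raw_skew:.3f}")
print(f"Skewness of sample means:      {mean_skew:.3f}")
print(f"Std of sample means:           {means_std:.4f}")
print(f"CLT prediction sigma/sqrt(n):  {1.0 / np.sqrt(n):.4f}")
```

The raw data keeps its skewness no matter how many samples are drawn; only the collection of sample means moves toward symmetry, with a spread close to $\sigma/\sqrt{n}$.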

Intuition and Proof Sketch

ℹ️ Why Does Averaging Produce Normality?

Consider adding up many independent random variables, each contributing a small random perturbation. The Central Limit Theorem says that the cumulative effect of many small, independent perturbations with finite variance is approximately Normal once the sum is centered and scaled, regardless of the shape of each individual perturbation. One way to see this: summing independent variables convolves their distributions, and among finite-variance distributions the Normal is the only one whose shape is preserved by this operation (up to rescaling). Repeatedly convolving a finite-variance distribution with itself and standardizing the result therefore drives it toward a Gaussian.

Proof Sketch via Moment Generating Functions

Theorem: CLT Proof Outline (MGF Approach)

Step 1: Define the standardized variable $Z_n = \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} = \frac{1}{\sigma\sqrt{n}}\sum_{i=1}^n (X_i - \mu)$ .

Step 2: Compute the MGF of $Z_n$ . Let $Y_i = \frac{X_i - \mu}{\sigma}$ (standardized, so $E[Y_i] = 0$ , $\text{Var}(Y_i) = 1$ ). Then:

M_{Z_n}(t) = \left[M_Y\!\left(\frac{t}{\sqrt{n}}\right)\right]^n

Step 3: Taylor-expand $M_Y(s)$ around $s = 0$ :

M_Y(s) = 1 + \frac{s^2}{2} + O(s^3)

(since $M_Y(0) = 1$ , $M_Y'(0) = E[Y] = 0$ , $M_Y''(0) = E[Y^2] = 1$ ).

Step 4: Substitute $s = t/\sqrt{n}$ :

M_{Z_n}(t) = \left[1 + \frac{t^2}{2n} + O\!\left(\frac{t^3}{n^{3/2}}\right)\right]^n \;\xrightarrow{n\to\infty}\; e^{t^2/2}

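Step 5: The limit in Step 4 uses the fact that $(1 + a/n + o(1/n))^n \to e^{a}$ with $a = t^2/2$ . Since $e^{t^2/2}$ is the MGF of $N(0,1)$ , the continuity theorem for MGFs gives $Z_n \xrightarrow{d} N(0,1)$ . This sketch assumes $M_Y$ is finite in a neighborhood of 0; when it is not, the same argument goes through with characteristic functions.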
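The limit in the proof outline can be checked numerically for one concrete case. This is a minimal sketch assuming $X \sim \text{Exponential}(1)$, so that $Y = X - 1$ has the closed-form MGF $M_Y(s) = e^{-s}/(1 - s)$ for $s < 1$; the function name `mgf_z_n` is hypothetical.

```python
import numpy as np

def mgf_z_n(t, n):
    """MGF of the standardized mean Z_n for Exponential(1) data.

    Uses M_Zn(t) = [M_Y(t / sqrt(n))]^n with M_Y(s) = exp(-s) / (1 - s),
    which is valid only when t / sqrt(n) < 1.
    """
    s = t / np.sqrt(n)
    # Work on the log scale: log M_Y(s) = -s - log(1 - s)
    log_m_y = -s - np.log1p(-s)
    return np.exp(n * log_m_y)

t = 1.0
limit = np.exp(t**2 / 2)  # MGF of N(0, 1) evaluated at t
for n in [10, 100, 10_000, 1_000_000]:
    print(f"n={n:>9d}: M_Zn(t)={mgf_z_n(t, n):.6f}   limit={limit:.6f}")
```

The gap to the limit shrinks slowly here because the Exponential distribution is skewed, so the $O(t^3/n^{3/2})$ term in Step 4 contributes a correction of order $1/\sqrt{n}$ after raising to the $n$-th power.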

💡 Convolution Intuition

When you sum two independent random variables, their distributions convolve. Repeated convolution of any distribution with finite variance produces a result that, once standardized, increasingly resembles a Gaussian. This is the "smoothing" effect of the Central Limit Theorem: each convolution removes structure from the original distribution, and the Gaussian is the universal attractor for finite-variance distributions.
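The smoothing effect of repeated convolution can be seen with a fair die, whose pmf is flat rather than bell-shaped. The sketch below is an illustration under assumed choices (a six-sided die, the helper names `pmf_of_sum` and `relative_gap_to_normal`); it compares the exact pmf of the sum with the matching Normal density.

```python
import numpy as np
from scipy import stats

die = np.full(6, 1 / 6)      # pmf of one fair die on faces 1..6
mu, var = 3.5, 35 / 12       # mean and variance of a single roll

def pmf_of_sum(n_dice):
    """Exact pmf of the sum of n_dice fair dice via repeated convolution."""
    pmf = die.copy()
    for _ in range(n_dice - 1):
        pmf = np.convolve(pmf, die)
    support = np.arange(n_dice, 6 * n_dice + 1)
    return support, pmf

def relative_gap_to_normal(n_dice):
    """Largest |pmf - Normal density| on the support, relative to the Normal peak."""
    support, pmf = pmf_of_sum(n_dice)
    normal = stats.norm.pdf(support, loc=n_dice * mu, scale=np.sqrt(n_dice * var))
    return np.max(np.abs(pmf - normal)) / normal.max()

for k in [1, 2, 5, 30]:
    print(f"{k:>2d} dice: relative gap to Normal = {relative_gap_to_normal(k):.4f}")
```

Because the support of the sum is the integers, the pmf can be compared directly with the Normal density evaluated at those integers.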

When Does CLT Apply?

Theorem: Sufficient Conditions for CLT

The classical CLT holds when all of the following conditions are satisfied:

Independence: The random variables $X_1, X_2, \ldots, X_n$ must be independent (weaker versions allow limited dependence, such as martingale differences or mixing sequences; uncorrelatedness alone is not enough).
Identical Distribution: All $X_i$ come from the same distribution (Lindeberg-Lévy version; the Lindeberg-Feller version relaxes this).
Finite Mean: $E[|X_i|] < \infty$ so that $\mu$ exists.
Finite Variance: $0 < \text{Var}(X_i) = \sigma^2 < \infty$ . This is critical: distributions with infinite variance (e.g., Cauchy, Pareto with $\alpha \leq 2$ ) do not satisfy the classical CLT.

For non-identically distributed variables, the Lindeberg condition or the stronger Lyapunov condition provides the general framework.

Lyapunov Condition (for non-i.i.d. variables)

\lim_{n \to \infty} \frac{1}{s_n^{2+\delta}} \sum_{i=1}^{n} E\!\left[|X_i - \mu_i|^{2+\delta}\right] = 0

Here,

$s_n^2$ =Sum of variances: $s_n^2 = \sum_{i=1}^n \sigma_i^2$
$\delta$ =Any positive constant for which the limit holds (commonly $\delta = 1$)
$\mu_i$ =Mean of the $i$-th variable
$\sigma_i^2$ =Variance of the $i$-th variable
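To make the Lyapunov condition concrete, the ratio can be evaluated in closed form for independent but non-identical Bernoulli variables. This is a sketch under assumed inputs: the success probabilities cycling through $[0.2, 0.8]$ and the function name `lyapunov_ratio` are hypothetical, and $\delta = 1$ is used.

```python
import numpy as np

def lyapunov_ratio(p, delta=1.0):
    """Lyapunov ratio for independent Bernoulli(p_i) variables.

    p: array of success probabilities p_1..p_n (hypothetical example values).
    """
    p = np.asarray(p, dtype=float)
    q = 1.0 - p
    variances = p * q
    # X_i - p_i equals q_i with probability p_i and -p_i with probability q_i
    abs_moments = p * q ** (2 + delta) + q * p ** (2 + delta)
    s_n = np.sqrt(variances.sum())
    return abs_moments.sum() / s_n ** (2 + delta)

ratios = []
for n in [10, 100, 1_000, 10_000]:
    # Non-identical probabilities cycling through 0.2, 0.3, ..., 0.8
    p = 0.2 + 0.6 * (np.arange(n) % 7) / 6
    ratios.append(lyapunov_ratio(p))
    print(f"n={n:>6d}: Lyapunov ratio = {ratios[-1]:.5f}")
```

For this example the ratio decays like $1/\sqrt{n}$, so the condition holds and the standardized sum is asymptotically Normal even though the variables are not identically distributed.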

ℹ️ Rule of Thumb for Sample Size

A common heuristic: $n \geq 30$ is "large enough" for the CLT to provide a good approximation. However, this depends on the underlying distribution:

Symmetric distributions (e.g., Uniform): CLT works well even for $n \geq 10$
Moderately skewed (e.g., Poisson, Exponential): $n \geq 30$ is usually sufficient
Heavily skewed or heavy-tailed: may need $n \geq 50$ or more
Cauchy distribution (undefined mean, infinite variance): the CLT never applies; the sample mean remains Cauchy regardless of $n$

When in doubt, check a Q-Q plot (or a normality test) of the sample means, for example bootstrap means, rather than of the raw data.

ℹ️ CLT Does NOT Apply When

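The variance is infinite (e.g., Cauchy distribution, Pareto with $\alpha \leq 2$ )
The variables are strongly dependent (e.g., long-memory time series); weakly autocorrelated series can still satisfy a CLT, but with a variance that accounts for the autocorrelation
The sample size is too small for the specific distribution (the theorem is asymptotic, so the Normal approximation can be poor at small $n$ )
You are looking at the distribution of the data itself, not the sample mean
The data comes from a mixture with widely separated or rare components: the CLT still holds if the variance is finite, but the approximation can require a very large $n$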

Berry-Esseen Theorem

ℹ️ How Fast Does CLT Converge?

The CLT tells us the sample mean eventually becomes Normal, but how fast? The Berry-Esseen theorem quantifies the rate of convergence by bounding the maximum difference between the true distribution and the Normal approximation.

Theorem: Berry-Esseen

Let $X_1, X_2, \ldots, X_n$ be i.i.d. with $E[X_i] = 0$ , $\text{Var}(X_i) = \sigma^2$ , and $E[|X_i|^3] = \rho < \infty$ . Let $F_n$ be the CDF of $\frac{\bar{X}_n}{\sigma/\sqrt{n}}$ and $\Phi$ be the standard Normal CDF. Then:

\sup_{x \in \mathbb{R}} \left| F_n(x) - \Phi(x) \right| \leq \frac{C \cdot \rho}{\sigma^3 \sqrt{n}}

where $C$ is a universal constant (best known bound: $C \leq 0.4748$ ; originally $C \leq 7.59$ ).

Berry-Esseen Convergence Rate

\sup_x |F_n(x) - \Phi(x)| \leq \frac{C \cdot E[|X - \mu|^3]}{\sigma^3 \sqrt{n}}

Here,

$C$ =Universal constant (≤ 0.4748)
$E[|X - \mu|^3]$ =Third absolute central moment (skewness-related)
$\sigma^3$ =Cube of the standard deviation
$n$ =Sample size — error shrinks as $1/\sqrt{n}$

💡 Practical Implications of Berry-Esseen

The convergence rate is $O(1/\sqrt{n})$ : to halve the approximation error, you need 4x the sample size
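Distributions with a larger standardized third absolute moment $\rho/\sigma^3$ (typically the more skewed or heavier-tailed ones) have a larger bound and tend to converge more slowly
The Berry-Esseen bound is a worst case; actual convergence is often faster
For symmetric, continuous distributions with a finite fourth moment, the error typically improves to $O(1/n)$ because the leading skewness term vanishes; lattice distributions such as the symmetric Bernoulli remain at $O(1/\sqrt{n})$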
This explains why $n = 30$ works for many distributions but not all
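For Bernoulli data the CLT error can be computed exactly rather than simulated, because the sum is Binomial. The sketch below assumes $p = 0.3$ and a few arbitrary sample sizes; the helper names `exact_clt_error` and `berry_esseen_bound` are hypothetical. It compares the exact $\sup_x |F_n(x) - \Phi(x)|$ with the Berry-Esseen bound.

```python
import numpy as np
from scipy import stats

def exact_clt_error(n, p):
    """Exact sup_x |F_n(x) - Phi(x)| for the standardized mean of n Bernoulli(p) draws."""
    k = np.arange(0, n + 1)
    z = (k - n * p) / np.sqrt(n * p * (1 - p))   # standardized jump locations
    phi = stats.norm.cdf(z)
    cdf_at = stats.binom.cdf(k, n, p)            # F_n just after each jump
    cdf_before = stats.binom.cdf(k - 1, n, p)    # F_n just before each jump
    return max(np.max(np.abs(cdf_at - phi)), np.max(np.abs(cdf_before - phi)))

def berry_esseen_bound(n, p, C=0.4748):
    """Berry-Esseen bound C * rho / (sigma^3 * sqrt(n)) for Bernoulli(p)."""
    q = 1 - p
    rho = p * q**3 + q * p**3        # E|X - p|^3
    sigma = np.sqrt(p * q)
    return C * rho / (sigma**3 * np.sqrt(n))

p = 0.3
for n in [25, 100, 400, 1600]:
    err = exact_clt_error(n, p)
    print(f"n={n:>5d}: exact error={err:.4f}  "
          f"bound={berry_esseen_bound(n, p):.4f}  error*sqrt(n)={err * np.sqrt(n):.3f}")
```

The supremum is attained at a jump of the step function $F_n$, which is why the code checks the value just before and just after each jump. If `error*sqrt(n)` stays roughly constant across rows, the error is shrinking at the $1/\sqrt{n}$ rate.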

Law of Large Numbers

ℹ️ CLT vs. LLN: What's the Difference?

The Law of Large Numbers (LLN) and the Central Limit Theorem answer complementary questions:

LLN: The sample mean converges to the true mean (as a point)
CLT: The sample mean is approximately Normal around the true mean (describes the shape)

The LLN tells you where the sample mean ends up. The CLT tells you how it fluctuates around that target.

Theorem: Strong Law of Large Numbers (Kolmogorov)

If $X_1, X_2, \ldots$ are i.i.d. with $E[|X_1|] < \infty$ , then:

P\!\left(\lim_{n \to \infty} \bar{X}_n = \mu\right) = 1

That is, the sample mean converges to the population mean with probability 1 (almost sure convergence).

Theorem: Weak Law of Large Numbers

If $X_1, X_2, \ldots$ are i.i.d. with $E[X_i] = \mu$ and $\text{Var}(X_i) = \sigma^2 < \infty$ , then for every $\epsilon > 0$ :

\lim_{n \to \infty} P\!\left(|\bar{X}_n - \mu| > \epsilon\right) = 0

The sample mean converges in probability to the population mean.

LLN vs. CLT Summary

\text{LLN: } \bar{X}_n \xrightarrow{P} \mu \qquad \text{CLT: } \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \xrightarrow{d} N(0,1)

Here,

$\xrightarrow{P}$ =Convergence in probability (LLN)
$\xrightarrow{d}$ =Convergence in distribution (CLT)
$\sigma/\sqrt{n}$ =Standard error — the scale of CLT fluctuations
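The two statements can be seen side by side in a short simulation. This is a sketch with assumed settings (an Exponential(1) population, 2,000 repetitions per sample size, and a hypothetical `results` dictionary): the raw error of the mean shrinks as the LLN predicts, while the rescaled error keeps a spread near 1 as the CLT predicts.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 1.0, 1.0      # mean and std of Exponential(1), the assumed population
n_reps = 2_000            # independent repetitions per sample size

results = {}
for n in [10, 100, 1_000]:
    means = rng.exponential(scale=1.0, size=(n_reps, n)).mean(axis=1)
    # LLN view: the raw error |mean - mu| shrinks toward 0 as n grows
    typical_error = np.mean(np.abs(means - mu))
    # CLT view: after rescaling by sqrt(n) / sigma the spread stays near 1
    z = (means - mu) / (sigma / np.sqrt(n))
    results[n] = (typical_error, z.std(ddof=1))
    print(f"n={n:>5d}: mean |error|={typical_error:.4f}   std of Z_n={results[n][1]:.3f}")
```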

CLT for Proportions

ℹ️ Applying CLT to Binary Data

When each observation is a Bernoulli trial ( $X_i \in \{0, 1\}$ with $P(X_i = 1) = p$ ), the sample mean $\bar{X}_n = \hat{p}$ is the sample proportion. The CLT gives us the sampling distribution of $\hat{p}$ , which is the foundation for hypothesis tests and confidence intervals for proportions.

Theorem: CLT for Sample Proportions

If $X_1, X_2, \ldots, X_n$ are i.i.d. Bernoulli( $p$ ), then:

\hat{p} = \frac{1}{n}\sum_{i=1}^n X_i \quad \text{with} \quad E[\hat{p}] = p, \quad \text{Var}(\hat{p}) = \frac{p(1-p)}{n}

By the CLT:

\frac{\hat{p} - p}{\sqrt{p(1-p)/n}} \xrightarrow{d} N(0, 1) \quad \text{as } n \to \infty

Confidence Interval for a Proportion

\hat{p} \pm z_{\alpha/2} \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}

Here,

$\hat{p}$ =Sample proportion (number of successes / n)
$z_{\alpha/2}$ =Critical value: 1.96 for 95%, 2.576 for 99%
$n$ =Sample size
$\sqrt{\hat{p}(1-\hat{p})/n}$ =Standard error of the proportion

💡 When Is the Normal Approximation Valid for Proportions?

A common rule of thumb for the normal approximation is that both $np \geq 10$ and $n(1-p) \geq 10$ . If $p$ is close to 0 or 1, you need a larger sample. When these conditions fail, use the Wilson score interval or exact binomial methods instead. The Agresti-Coull interval (adding 2 successes and 2 failures) is a simple correction.
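The difference between the normal-approximation (Wald) interval and the Wilson score interval is easiest to see on a small, lopsided sample. The following is a minimal sketch; the function names and the example counts are hypothetical, and only the standard textbook formulas are used.

```python
import numpy as np
from scipy import stats

def wald_interval(successes, n, conf=0.95):
    """Normal-approximation (Wald) interval for a proportion."""
    z = stats.norm.ppf(1 - (1 - conf) / 2)
    p_hat = successes / n
    half = z * np.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

def wilson_interval(successes, n, conf=0.95):
    """Wilson score interval for a proportion."""
    z = stats.norm.ppf(1 - (1 - conf) / 2)
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = (z / denom) * np.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return center - half, center + half

# Hypothetical counts: a comfortable case and a rare-event case
for successes, n in [(280, 500), (2, 40)]:
    lo_w, hi_w = wald_interval(successes, n)
    lo_s, hi_s = wilson_interval(successes, n)
    print(f"{successes}/{n}: Wald=({lo_w:.4f}, {hi_w:.4f})  Wilson=({lo_s:.4f}, {hi_s:.4f})")
```

In the rare-event case the Wald interval extends below 0, which is one practical symptom of the normal approximation breaking down; the Wilson interval does not have this problem.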

Confidence Intervals

ℹ️ The CLT as the Engine of Confidence Intervals

Confidence intervals for means are built directly on the CLT. The theorem guarantees that $\bar{X}_n$ is approximately Normal for large $n$ , which lets us construct intervals whose coverage probability (over repeated sampling) is approximately the specified confidence level.

Confidence Interval for the Mean (Known σ)

\bar{x} \pm z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}}

Here,

$\bar{x}$ =Observed sample mean
$z_{\alpha/2}$ =Critical value from standard Normal (1.96 for 95%)
$\sigma$ =Known population standard deviation
$n$ =Sample size

Confidence Interval for the Mean (Unknown σ)

\bar{x} \pm t_{\alpha/2, \, n-1} \cdot \frac{s}{\sqrt{n}}

Here,

$s$ =Sample standard deviation (estimates σ)
$t_{\alpha/2, \, n-1}$ =Critical value from t-distribution with $n-1$ degrees of freedom
$n$ =Sample size

Theorem: Sample Size Determination

To achieve a desired margin of error $E$ at confidence level $1 - \alpha$ :

n = \left(\frac{z_{\alpha/2} \cdot \sigma}{E}\right)^2

To halve the margin of error, you need 4 times the sample size — a direct consequence of the $1/\sqrt{n}$ rate in the CLT.
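The sample size formula translates directly into a small helper. This is a sketch; the function name `sample_size_for_mean` and the example inputs are hypothetical, and the result is rounded up because $n$ must be an integer.

```python
import math
from scipy import stats

def sample_size_for_mean(sigma, margin, conf=0.95):
    """Smallest n with z * sigma / sqrt(n) <= margin, assuming sigma is known."""
    z = stats.norm.ppf(1 - (1 - conf) / 2)
    return math.ceil((z * sigma / margin) ** 2)

# Hypothetical inputs: sigma = 20, shrinking margin of error
for margin in [10, 5, 2.5]:
    print(f"margin={margin:>4}: n = {sample_size_for_mean(sigma=20, margin=margin)}")
```

Each halving of the margin roughly quadruples the required $n$, which is the $1/\sqrt{n}$ rate made visible.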

Python Implementation

📝CLT Simulation: Exponential to Normal

The following simulation draws samples from an Exponential distribution (which is heavily skewed) and demonstrates that the distribution of sample means becomes increasingly Normal as $n$ grows.

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

np.random.seed(42)

# True distribution: Exponential (heavily skewed, not Normal)
lam = 2.0
true_mean = 1 / lam  # 0.5
true_std = 1 / lam   # 0.5

# Simulate CLT: draw samples of increasing size
sample_sizes = [1, 5, 15, 50, 100]
n_experiments = 10000

fig, axes = plt.subplots(1, 5, figsize=(20, 4))

for ax, n in zip(axes, sample_sizes):
    # Draw n_experiments samples, each of size n
    samples = np.random.exponential(1/lam, (n_experiments, n))
    means = samples.mean(axis=1)

    # CLT prediction: N(true_mean, true_std^2 / n)
    theoretical_mean = true_mean
    theoretical_std = true_std / np.sqrt(n)

    # Histogram of sample means
    ax.hist(means, bins=50, density=True, alpha=0.7,
            edgecolor='black', linewidth=0.5)

    # Overlay theoretical Normal
    x = np.linspace(means.min(), means.max(), 200)
    ax.plot(x, stats.norm.pdf(x, theoretical_mean, theoretical_std),
            'r-', linewidth=2, label='CLT prediction')

    # Shapiro-Wilk test for normality
    stat, p_val = stats.shapiro(means[:1000])
    ax.set_title(f'n = {n}\nShapiro p = {p_val:.4f}')
    ax.legend(fontsize=8)

plt.suptitle('CLT: Exponential → Normal via Averaging', fontsize=14)
plt.tight_layout()
plt.savefig('clt_exponential.png', dpi=150)
plt.show()

📝CLT for Proportions: A/B Testing Simulation

Simulate an A/B test comparing two conversion rates and verify that the difference in sample proportions is approximately Normal.

import numpy as np
from scipy import stats

np.random.seed(42)

# True conversion rates
p_control = 0.10   # 10% conversion
p_treatment = 0.12  # 12% conversion

n_per_group = 5000
n_simulations = 10000

# Simulate A/B test many times
diffs = []
for _ in range(n_simulations):
    control = np.random.binomial(1, p_control, n_per_group)
    treatment = np.random.binomial(1, p_treatment, n_per_group)
    diffs.append(treatment.mean() - control.mean())

diffs = np.array(diffs)

# CLT prediction for the difference
se_theory = np.sqrt(p_control*(1-p_control)/n_per_group +
                     p_treatment*(1-p_treatment)/n_per_group)
print(f"Observed SE:       {diffs.std():.6f}")
print(f"Theoretical SE:    {se_theory:.6f}")
print(f"True difference:   {p_treatment - p_control:.4f}")
print(f"Mean of diffs:     {diffs.mean():.6f}")

# Normality test
stat, p_val = stats.shapiro(diffs[:1000])
print(f"Shapiro-Wilk p:    {p_val:.4f}")

# With a true difference of 0.02, a two-sided 5% z-test should reject
# H0 (difference = 0) in most simulations; the rejection rate estimates power.
# Under the null (p_c = p_t) the rejection rate would be about 5%.
z_crit = stats.norm.ppf(0.975)
fails_to_reject = np.mean(np.abs(diffs) < z_crit * se_theory)
print(f"Fraction of simulations rejecting H0 (power): {1 - fails_to_reject:.1%}")

📝Berry-Esseen Convergence Rate in Practice

Compare convergence speed for distributions with different skewness.

import numpy as np
from scipy import stats

np.random.seed(42)

def berry_esseen_error(means, mu, se):
    """Empirical sup_x |F_n(x) - Phi(x)| for sample means with standard error se."""
    standardized = np.sort((means - mu) / se)
    m = len(standardized)
    normal_cdf = stats.norm.cdf(standardized)
    # The empirical CDF is a step function: compare just after and just before each jump
    after_jump = np.arange(1, m + 1) / m
    before_jump = np.arange(0, m) / m
    return max(np.max(np.abs(after_jump - normal_cdf)),
               np.max(np.abs(before_jump - normal_cdf)))

distributions = {
    'Uniform(0,1)': {'rvs': np.random.uniform, 'args': (0, 1),
                     'mu': 0.5, 'sigma': 1/np.sqrt(12)},
    'Exponential(1)': {'rvs': np.random.exponential, 'args': (1,),
                       'mu': 1.0, 'sigma': 1.0},
    'Chi-Square(3)': {'rvs': np.random.chisquare, 'args': (3,),
                      'mu': 3.0, 'sigma': np.sqrt(6)},
}

n_values = [10, 30, 50, 100, 500]
n_sims = 5000

for name, dist in distributions.items():
    mu, sigma = dist['mu'], dist['sigma']
    # Monte Carlo estimate of rho = E|X - mu|^3 for the Berry-Esseen bound
    big_sample = dist['rvs'](*dist['args'], 1_000_000)
    rho = np.mean(np.abs(big_sample - mu) ** 3)
    print(f"\n{name}:")
    for n in n_values:
        # n_sims independent samples of size n -> n_sims sample means
        samples = dist['rvs'](*dist['args'], (n_sims, n))
        means = samples.mean(axis=1)
        err = berry_esseen_error(means, mu, sigma / np.sqrt(n))
        bound = 0.4748 * rho / (sigma**3 * np.sqrt(n))
        # err also contains Monte Carlo noise of order 1/sqrt(n_sims)
        print(f"  n={n:>4d}: empirical={err:.4f}  bound={bound:.4f}")

Applications in AI and Machine Learning

A/B Testing

ℹ️ CLT Powers Every A/B Test

Every A/B test relies on the CLT. When you compare conversion rates, click-through rates, or revenue per user between a control and treatment group, you are computing a difference in sample means. The CLT guarantees that this difference is approximately Normal, enabling you to compute p-values, confidence intervals, and statistical power.
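A CLT-based analysis of a single A/B test can be sketched as a two-proportion z-test. The function name `two_proportion_ztest` and the counts below are hypothetical; the pooled standard error is used for the test statistic (it is the standard error under $H_0$), and the unpooled one for the confidence interval.

```python
import numpy as np
from scipy import stats

def two_proportion_ztest(x_control, n_control, x_treatment, n_treatment, conf=0.95):
    """Two-sided z-test and CI for the difference in conversion rates."""
    p_c = x_control / n_control
    p_t = x_treatment / n_treatment
    diff = p_t - p_c

    # Pooled SE: valid under H0 (both groups share one conversion rate)
    p_pool = (x_control + x_treatment) / (n_control + n_treatment)
    se_pooled = np.sqrt(p_pool * (1 - p_pool) * (1 / n_control + 1 / n_treatment))
    z = diff / se_pooled
    p_value = 2 * stats.norm.sf(abs(z))

    # Unpooled SE: used for the confidence interval around the observed difference
    se = np.sqrt(p_c * (1 - p_c) / n_control + p_t * (1 - p_t) / n_treatment)
    z_crit = stats.norm.ppf(1 - (1 - conf) / 2)
    ci = (diff - z_crit * se, diff + z_crit * se)
    return z, p_value, ci

# Hypothetical data: 500/5000 conversions in control, 600/5000 in treatment
z, p_value, ci = two_proportion_ztest(500, 5000, 600, 5000)
print(f"z = {z:.3f}, p-value = {p_value:.4f}, 95% CI = ({ci[0]:.4f}, {ci[1]:.4f})")
```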

Theorem: A/B Test Sample Size Formula

To detect a minimum detectable effect (MDE) of $\delta$ with power $1 - \beta$ and two-sided significance level $\alpha$ , the required sample size per group is:

n = \frac{(z_{\alpha/2} + z_{\beta})^2 \cdot 2\sigma^2}{\delta^2}

For proportions, replace $\sigma^2$ with $p(1-p)$ (use $p = 0.5$ for worst case).
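The formula can be wrapped in a small planning helper. This is a sketch that follows the approximation above (variance $p(1-p)$ at the baseline rate for both groups); the function name `ab_sample_size_per_group` and the example rates are hypothetical.

```python
import math
from scipy import stats

def ab_sample_size_per_group(baseline_rate, mde, alpha=0.05, power=0.80):
    """Approximate users needed per group to detect an absolute lift of `mde`."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)   # two-sided critical value
    z_beta = stats.norm.ppf(power)            # quantile matching the desired power
    variance = baseline_rate * (1 - baseline_rate)
    return math.ceil((z_alpha + z_beta) ** 2 * 2 * variance / mde ** 2)

# Hypothetical planning scenario: 10% baseline conversion rate
for mde in [0.02, 0.01, 0.005]:
    n = ab_sample_size_per_group(baseline_rate=0.10, mde=mde)
    print(f"MDE = {mde:.3f}: about {n} users per group")
```

Halving the MDE roughly quadruples the required sample size, for the same reason that halving a margin of error does.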

Hypothesis Testing

ℹ️ CLT as the Basis for Z-Tests and T-Tests

The CLT justifies using Normal-based test statistics for means. When $n$ is large, the test statistic $Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$ is approximately $N(0,1)$ under $H_0$ , even if the data is not Normal. For small samples with unknown $\sigma$ , the t-distribution (which accounts for the extra uncertainty of estimating $\sigma$ ) is used instead.

Other ML Applications

📋CLT in Machine Learning

Application	How CLT Is Used
Model evaluation	Confidence intervals for accuracy, AUC, loss metrics across folds
Feature importance	Permutation test statistics are approximately Normal by CLT
Batch normalization	Mini-batch means are approximately Normal around the population mean, with standard error shrinking as $1/\sqrt{B}$ for batch size $B$
Stochastic optimization	SGD noise is approximately Normal for large batches (CLT on gradients)
Ensemble methods	Averaging $n$ models with independent errors: the standard deviation of the averaged prediction shrinks as $O(1/\sqrt{n})$
Uncertainty quantification	Bootstrap distributions of averaged metrics are often approximately Normal by CLT
Reward estimation in RL	Average return over many independent episodes is approximately Normal

💡 CLT in Bayesian Inference

The CLT also appears in Bayesian statistics. Under regularity conditions, the posterior distribution of a parameter becomes approximately Normal when the sample size is large, largely independent of the prior as long as the prior puts positive density near the true parameter (Bernstein-von Mises theorem). This is why posterior means and credible intervals behave like frequentist estimates and confidence intervals for large $n$ .
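A conjugate example makes this concrete. The sketch below assumes Bernoulli data with a uniform Beta(1, 1) prior, so the posterior is a Beta distribution in closed form; the function name `posterior_vs_normal_gap` and the 30% success rate are hypothetical choices. It measures the largest CDF gap between the exact posterior and the Normal approximation $N(\hat{p}, \hat{p}(1-\hat{p})/n)$.

```python
import numpy as np
from scipy import stats

def posterior_vs_normal_gap(successes, n, prior_a=1.0, prior_b=1.0):
    """Max CDF gap between the Beta posterior and N(p_hat, p_hat * (1 - p_hat) / n)."""
    posterior = stats.beta(prior_a + successes, prior_b + n - successes)
    p_hat = successes / n
    approx = stats.norm(loc=p_hat, scale=np.sqrt(p_hat * (1 - p_hat) / n))
    grid = np.linspace(0.0, 1.0, 10_001)
    return np.max(np.abs(posterior.cdf(grid) - approx.cdf(grid)))

for n in [10, 100, 1_000, 10_000]:
    successes = 3 * n // 10   # hypothetical data with a 30% success rate
    gap = posterior_vs_normal_gap(successes, n)
    print(f"n={n:>6d}: max CDF gap = {gap:.4f}")
```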

Common Mistakes

💡 Learn from Others' Errors

The following table captures frequent mistakes practitioners make with the CLT. Avoiding these errors will save debugging time and prevent incorrect conclusions.

Mistake	Why It Is Wrong	Correct Approach
Applying CLT to the Cauchy distribution	Cauchy has no finite mean or variance; sample mean stays Cauchy	Use median or trimmed mean instead
Assuming CLT means "data is Normal"	CLT applies to the sample mean, not the data itself	The data can be any distribution with finite variance
Using $n \geq 30$ as a universal rule	Convergence rate depends on skewness and tail heaviness	Check with Q-Q plots; use Berry-Esseen to gauge adequacy
Forgetting finite variance requirement	Infinite variance distributions (Pareto with $\alpha \leq 2$ ) violate CLT	Verify variance exists; consider stable distributions
Applying CLT to dependent data	CLT requires independence (or weak dependence)	Use time-series CLT (e.g., martingale CLT) or block bootstrap
Ignoring small-sample bias in proportions	$np < 10$ or $n(1-p) < 10$ breaks the normal approximation	Use Wilson score interval or exact binomial CI
Using z-test when $\sigma$ is unknown and $n$ is small	Z-test assumes known $\sigma$	Use t-test which accounts for estimating $\sigma$
Confusing LLN with CLT	LLN says where the mean converges; CLT says how it fluctuates	LLN: point convergence; CLT: distributional shape
Using CLT for heavy-tailed data without checking	Skewed distributions converge slowly; heavy tails may not converge	Increase $n$ , use bootstrap, or use robust methods
Assuming the CLT holds for max/min	CLT is about sums/means, not order statistics	Max/min follow extreme value distributions (Gumbel, Fréchet)

Interview Questions

📝Interview Question 1: CLT Fundamentals

Q: Explain the Central Limit Theorem in plain language. Why is it important?

A: The CLT states that if you take many independent random samples from any distribution with a finite mean and variance, and compute the mean of each sample, those sample means will be approximately Normally distributed — regardless of the original distribution's shape. This is powerful because it lets us use the well-understood Normal distribution to make inferences about population means, even when we have no idea what the underlying distribution looks like. It's the foundation of confidence intervals, hypothesis tests, and A/B testing.

📝Interview Question 2: When CLT Fails

Q: Give an example where the CLT does not apply. What happens instead?

A: The Cauchy distribution has no finite mean or variance. If you average Cauchy-distributed random variables, the sample mean has exactly the same Cauchy distribution as each individual observation — it never concentrates, and never becomes Normal. The CLT requires finite variance. More practically, Pareto distributions with tail index $\alpha \leq 2$ also have infinite variance and do not satisfy the classical CLT. For such data, you should use the median, trimmed mean, or robust statistical methods instead.

📝Interview Question 3: CLT vs. LLN

Q: What is the difference between the Law of Large Numbers and the Central Limit Theorem?

A: The LLN says the sample mean converges to the population mean as $n \to \infty$ — it tells you the destination. The CLT says the sample mean fluctuates around the population mean with a distribution that is approximately $N(\mu, \sigma^2/n)$ — it tells you the shape of the fluctuations. The LLN is about convergence; the CLT is about the distribution of the error. You need the LLN first (to know the mean exists and is the target), and the CLT adds the quantitative description of how the approximation improves with $n$ .

📝Interview Question 4: A/B Testing

Q: You're running an A/B test with 10,000 users per group. The control conversion rate is 8% and you observe 9.5% in treatment. Is this significant? Walk through the CLT-based analysis.

A: By the CLT, the difference $\hat{p}_T - \hat{p}_C$ is approximately Normal. The (unpooled) standard error is $\sqrt{\hat{p}_C(1-\hat{p}_C)/n + \hat{p}_T(1-\hat{p}_T)/n} = \sqrt{0.08 \times 0.92/10000 + 0.095 \times 0.905/10000} \approx 0.0040$ . The z-statistic is $(0.095 - 0.08)/0.0040 \approx 3.75$ , which gives a two-sided $p \approx 0.0002$ . Yes, this is highly significant at $\alpha = 0.05$ . The 95% CI for the difference is approximately $0.015 \pm 1.96 \times 0.0040 = [0.007, 0.023]$ .

📝Interview Question 5: Berry-Esseen

Q: The CLT says the approximation improves with $n$ . How fast? What determines the rate?

A: The Berry-Esseen theorem bounds the error as $O(1/\sqrt{n})$ , with the constant depending on the ratio $\rho/\sigma^3$ where $\rho = E[|X-\mu|^3]$ is the third absolute central moment. Distributions with a larger ratio (typically the more skewed ones) tend to converge more slowly. For a symmetric, continuous distribution like the Uniform, the rate typically improves to $O(1/n)$ because the leading skewness term vanishes; this does not carry over to lattice distributions such as the symmetric Bernoulli. Practically, this means a heavily skewed distribution might need $n = 100+$ for a good Normal approximation, while a symmetric one might need only $n = 10$ .

📝Interview Question 6: Practical Application

Q: You have a dataset of user session times that is heavily right-skewed. You need to estimate the mean session time with a 95% CI. What do you do?

A: Despite the skewness, the CLT applies if the variance is finite. I would: (1) compute $\bar{x}$ and $s$ ; (2) if $n$ is large relative to the skewness (for heavily skewed data, $n \geq 30$ may not be enough), use the CLT-based CI $\bar{x} \pm 1.96 \cdot s/\sqrt{n}$ ; (3) if $n$ is small, use the bootstrap to construct the CI empirically; (4) verify with a Q-Q plot of bootstrap means that the Normal approximation is reasonable. If the variance appears to be infinite (very heavy tails), neither the CLT interval nor the bootstrap for the mean is reliable, so switch to the median with a bootstrap CI.
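The comparison in steps (2) and (3) can be sketched as follows. The data here is synthetic: lognormal "session times" are an assumption standing in for a real right-skewed dataset with finite variance, and the variable names are hypothetical. The bootstrap shown is the simple percentile bootstrap.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Hypothetical session times (minutes): lognormal is right-skewed with finite variance
session_times = rng.lognormal(mean=1.0, sigma=1.0, size=400)

n = session_times.size
x_bar = session_times.mean()
s = session_times.std(ddof=1)

# CLT-based 95% interval
z = stats.norm.ppf(0.975)
clt_ci = (x_bar - z * s / np.sqrt(n), x_bar + z * s / np.sqrt(n))

# Percentile bootstrap: resample rows with replacement, recompute the mean each time
n_boot = 5_000
idx = rng.integers(0, n, size=(n_boot, n))
boot_means = session_times[idx].mean(axis=1)
boot_ci = tuple(np.percentile(boot_means, [2.5, 97.5]))

print(f"Sample mean:  {x_bar:.3f}")
print(f"CLT 95% CI:   ({clt_ci[0]:.3f}, {clt_ci[1]:.3f})")
print(f"Bootstrap CI: ({boot_ci[0]:.3f}, {boot_ci[1]:.3f})")
```

When the two intervals roughly agree, the Normal approximation is probably adequate; a strongly asymmetric bootstrap interval is a hint that $n$ is still too small for the CLT interval.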

Practice Problems

📝Problem 1: CLT Applied to Dice Rolls

A fair die is rolled $n = 90$ times. What is the approximate probability that the sample mean is between 3.3 and 3.7?

💡Solution

For a fair die: $\mu = 3.5$ , $\sigma^2 = 35/12 \approx 2.9167$ , so $\sigma \approx 1.7078$ .

By the CLT: $\bar{X}_{90} \;\dot{\sim}\; N(3.5, \, 2.9167/90)$ , so $\text{SE} = \sigma/\sqrt{90} \approx 0.1800$ .

P(3.3 < \bar{X} < 3.7) = P\!\left(\frac{3.3 - 3.5}{0.1800} < Z < \frac{3.7 - 3.5}{0.1800}\right) = P(-1.11 < Z < 1.11)

= \Phi(1.11) - \Phi(-1.11) = 2\Phi(1.11) - 1 \approx 2(0.8665) - 1 \approx 0.7330

The probability is approximately 73.3%.
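The CLT answer can be compared with a direct simulation. This is a sketch with an assumed number of simulated experiments (50,000) and hypothetical variable names. The comparison is done on integer sums ($3.3 \times 90 = 297$ and $3.7 \times 90 = 333$) to avoid floating-point ties at the boundaries.

```python
import numpy as np
from scipy import stats

n = 90
mu = 3.5
sigma = np.sqrt(35 / 12)
se = sigma / np.sqrt(n)

# CLT approximation of P(3.3 < mean < 3.7)
clt_prob = stats.norm.cdf(3.7, mu, se) - stats.norm.cdf(3.3, mu, se)

# Monte Carlo estimate: each row is one experiment of 90 rolls
rng = np.random.default_rng(3)
rolls = rng.integers(1, 7, size=(50_000, n))   # faces 1..6 (upper bound is exclusive)
sums = rolls.sum(axis=1)
sim_prob = np.mean((sums > 297) & (sums < 333))

print(f"CLT approximation: {clt_prob:.4f}")
print(f"Simulation:        {sim_prob:.4f}")
```

The simulated value may come out slightly below the CLT value because the sum of dice is discrete and the strict inequalities exclude the boundary sums; a continuity correction would narrow the gap.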

📝Problem 2: Sample Size for Desired Precision

A researcher wants to estimate the average reaction time with a margin of error of no more than 5 ms. The population standard deviation is estimated at 20 ms. How many subjects are needed for a 95% CI?

💡Solution

n = \left(\frac{z_{0.025} \cdot \sigma}{E}\right)^2 = \left(\frac{1.96 \times 20}{5}\right)^2 = \left(\frac{39.2}{5}\right)^2 = (7.84)^2 \approx 61.47

Round up to $n = 62$ subjects.

Note: if we used a conservative estimate of $\sigma = 25$ , we would need $n = (1.96 \times 25 / 5)^2 = 96.04 \approx 97$ subjects.

📝Problem 3: CLT for Proportions

In a poll of $n = 500$ voters, 280 support candidate A. Construct a 95% CI for the true proportion.

💡Solution

$\hat{p} = 280/500 = 0.56$ . Check: $n\hat{p} = 280 \geq 10$ and $n(1-\hat{p}) = 220 \geq 10$ ✓.

\hat{p} \pm z_{0.025}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = 0.56 \pm 1.96\sqrt{\frac{0.56 \times 0.44}{500}}

= 0.56 \pm 1.96 \times 0.0222 = 0.56 \pm 0.0435

95% CI: (0.5165, 0.6035)

We are 95% confident that the true proportion of supporters is between 51.7% and 60.4%.

📝Problem 4: When CLT Fails

You sample $n = 100$ observations from a Cauchy distribution and compute the sample mean. You repeat this 10,000 times. What distribution do the 10,000 sample means follow? Why doesn't the CLT help?

💡Solution

The sample means follow the same Cauchy distribution as the original data. This is a famous property of the Cauchy distribution: the mean of $n$ i.i.d. Cauchy random variables has exactly the same Cauchy distribution as a single observation.

The CLT does not apply because the Cauchy distribution has an undefined mean and infinite variance, so the CLT's key requirement of finite variance is violated. No matter how large $n$ is, the sample mean never becomes Normal, never concentrates, and never converges to a point. This is why robust statistics (using the median instead of the mean) is essential for heavy-tailed data.
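The lack of concentration is easy to observe by tracking the interquartile range (IQR) of the sample means, which is well defined even when the variance is not. This is a sketch with assumed settings (20,000 repetitions, a hypothetical `iqr` helper and `results` dictionary); Normal data is included as a contrast.

```python
import numpy as np

rng = np.random.default_rng(5)
n_reps = 20_000

def iqr(x):
    """Interquartile range: 75th percentile minus 25th percentile."""
    q25, q75 = np.percentile(x, [25, 75])
    return q75 - q25

results = {}
for n in [1, 10, 100]:
    cauchy_means = rng.standard_cauchy(size=(n_reps, n)).mean(axis=1)
    normal_means = rng.standard_normal(size=(n_reps, n)).mean(axis=1)
    results[n] = (iqr(cauchy_means), iqr(normal_means))
    print(f"n={n:>4d}: IQR of Cauchy means={results[n][0]:.3f}   "
          f"IQR of Normal means={results[n][1]:.3f}")
```

The standard Cauchy has quartiles at $\pm 1$, so an IQR near 2 at every $n$ is what the theory predicts, while the Normal column shrinks like $1/\sqrt{n}$.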

📝Problem 5: Berry-Esseen Bound

For a Bernoulli( $p = 0.3$ ) distribution, compute the Berry-Esseen upper bound on the maximum error of the CLT approximation for $n = 100$ .

💡Solution

For Bernoulli( $p$ ): $\mu = p$ , $\sigma^2 = p(1-p)$ . Since $|X - p|$ equals $1-p$ with probability $p$ and $p$ with probability $1-p$ , we get $E[|X - \mu|^3] = p(1-p)^3 + (1-p)p^3 = p(1-p)[(1-p)^2 + p^2]$ .

For $p = 0.3$ : $\sigma^2 = 0.21$ , $\sigma = 0.4583$ .

$E[|X - 0.3|^3] = 0.3 \times 0.7 \times [0.7^2 + 0.3^2] = 0.21 \times 0.58 = 0.1218$ .

Berry-Esseen bound:

\sup_x |F_n(x) - \Phi(x)| \leq \frac{0.4748 \times 0.1218}{(0.4583)^3 \times \sqrt{100}} = \frac{0.0578}{0.0963 \times 10} = \frac{0.0578}{0.963} \approx 0.060

The maximum difference between the true CDF of the standardized mean and the Normal CDF is at most about 0.06 (6 percentage points). Because the bound is a worst case over all distributions with these moments, the actual error can be smaller.

Quick Reference

📋CLT Quick Reference

Concept	Formula / Statement
CLT (Lindeberg-Lévy)	$\frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \xrightarrow{d} N(0,1)$
Sample mean distribution	$\bar{X}_n \;\dot{\sim}\; N(\mu, \sigma^2/n)$
Standard error	$\text{SE} = \sigma/\sqrt{n}$
CI for mean (known σ)	$\bar{x} \pm z_{\alpha/2} \cdot \sigma/\sqrt{n}$
CI for mean (unknown σ)	$\bar{x} \pm t_{\alpha/2, n-1} \cdot s/\sqrt{n}$
CI for proportion	$\hat{p} \pm z_{\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n}$
Sample size (mean)	$n = (z_{\alpha/2} \cdot \sigma / E)^2$
Sample size (A/B test)	$n = (z_{\alpha/2} + z_{\beta})^2 \cdot 2\sigma^2 / \delta^2$
Berry-Esseen bound	$\sup_x |F_n(x) - \Phi(x)| \leq \frac{C\rho}{\sigma^3\sqrt{n}}$
SLLN	$\bar{X}_n \xrightarrow{a.s.} \mu$
WLLN	$\bar{X}_n \xrightarrow{P} \mu$

💡 Common Critical Values

Confidence Level	$z_{\alpha/2}$
90%	1.645
95%	1.960
98%	2.326
99%	2.576

For small samples ( $n < 30$ ) with unknown $\sigma$ , replace $z$ with the corresponding $t$ critical value from the t-distribution with $n-1$ degrees of freedom.

Cross-References

📋Related Topics

Probability Distributions → Understanding the Normal distribution that CLT converges to
Expectation and Variance → Required parameters for CLT: $E[X] = \mu$ and $\text{Var}(X) = \sigma^2$
Law of Large Numbers → Complementary result: where the sample mean converges
Confidence Intervals → Direct application of CLT for parameter estimation
Hypothesis Testing → Z-tests and t-tests rely on CLT for the sampling distribution
A/B Testing → Real-world application of CLT for comparing proportions
Bootstrapping → When CLT conditions are uncertain, bootstrap provides distribution-free inference
Maximum Likelihood Estimation → MLE asymptotic normality follows from CLT
Bayesian Inference → Bernstein-von Mises theorem: posterior becomes Normal by CLT
Machine Learning → SGD noise, ensemble averaging, and model evaluation all use CLT