Measure Theory

ℹ️ Why It Matters

Measure theory provides the rigorous mathematical foundation for probability theory, enabling continuous distributions, conditional expectations, and convergence theorems that justify interchange of limits and integrals. Without measure theory, we cannot properly define probabilities over uncountable sets (like the real line), and the central limit theorem, law of large numbers, and Bayesian inference lack rigorous foundations. In machine learning, measure theory underlies expectation-maximization algorithms, variational inference, and the theoretical analysis of generalization error. The Lebesgue integral generalizes the Riemann integral to handle highly discontinuous functions and provides the framework for $L^p$ spaces.

Core Definitions

Definition: Sigma-Algebra (σ-Algebra)

Let $\Omega$ be a set. A σ-algebra $\mathcal{F}$ on $\Omega$ is a collection of subsets of $\Omega$ satisfying:

$\Omega \in \mathcal{F}$
If $A \in \mathcal{F}$ , then $A^c = \Omega \setminus A \in \mathcal{F}$ (closed under complement)
If $A_1, A_2, \ldots \in \mathcal{F}$ , then $\bigcup_{n=1}^\infty A_n \in \mathcal{F}$ (closed under countable union)

The pair $(\Omega, \mathcal{F})$ is called a measurable space. The Borel σ-algebra $\mathcal{B}(\mathbb{R})$ is generated by all open intervals in $\mathbb{R}$ .
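
The three axioms can be made concrete on a finite set, where countable unions reduce to finite unions. The sketch below is only an illustration under that finiteness assumption; the helper name `generate_sigma_algebra` and the choice of generators are hypothetical and not part of the definition above.

```python
from itertools import combinations

def generate_sigma_algebra(omega, generators):
    """Close a family of subsets of a finite set under complement and union."""
    omega = frozenset(omega)
    family = {frozenset(), omega} | {frozenset(g) for g in generators}
    changed = True
    while changed:
        changed = False
        current = list(family)
        # Closure under complement.
        for a in current:
            comp = omega - a
            if comp not in family:
                family.add(comp)
                changed = True
        # Closure under pairwise union (sufficient because omega is finite).
        for a, b in combinations(current, 2):
            union = a | b
            if union not in family:
                family.add(union)
                changed = True
    return family

# Smallest sigma-algebra on {1, 2, 3, 4} containing {1} and {2}.
sigma = generate_sigma_algebra({1, 2, 3, 4}, [{1}, {2}])
print(sorted(sorted(s) for s in sigma))
```

The generated family has the atoms {1}, {2}, and {3, 4}, so the set {3} on its own is not measurable in this example.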

Definition: Measure

A measure $\mu$ on a measurable space $(\Omega, \mathcal{F})$ is a function $\mu: \mathcal{F} \to [0, \infty]$ satisfying:

$\mu(\emptyset) = 0$ (null empty set)
Countable additivity: For pairwise disjoint sets $A_1, A_2, \ldots \in \mathcal{F}$ , $\mu\left(\bigcup_{n=1}^\infty A_n\right) = \sum_{n=1}^\infty \mu(A_n)$

The triple $(\Omega, \mathcal{F}, \mu)$ is called a measure space. Countable additivity is essential—it enables limits and infinite processes to be well-defined.

Definition: Probability Measure

A probability measure $P$ is a measure on $(\Omega, \mathcal{F})$ with $P(\Omega) = 1$ . The triple $(\Omega, \mathcal{F}, P)$ is a probability space. Events are elements of $\mathcal{F}$ , and $P(A)$ is the probability of event $A$ . A random variable $X: \Omega \to \mathbb{R}$ is a measurable function, and its distribution $P_X$ is the pushforward measure $P_X(B) = P(X^{-1}(B))$ .
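
The pushforward formula $P_X(B) = P(X^{-1}(B))$ is easiest to see on a finite probability space. The following sketch assumes two fair dice as the sample space and takes $X$ to be their sum; both choices, and the helper name `pushforward`, are illustrative assumptions.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

# Sample space: ordered outcomes of two fair dice, each with probability 1/36.
omega = list(product(range(1, 7), repeat=2))
P = {w: Fraction(1, 36) for w in omega}

def X(w):
    """Random variable: the sum of the two dice."""
    return w[0] + w[1]

def pushforward(P, X):
    """Return the distribution P_X as a dict: value -> P(X^{-1}({value}))."""
    dist = defaultdict(Fraction)
    for w, p in P.items():
        dist[X(w)] += p
    return dict(dist)

P_X = pushforward(P, X)
print(P_X[7])             # probability that the sum equals 7
print(sum(P_X.values()))  # total mass of the pushforward measure
```

Exact fractions are used so that the total mass of $P_X$ can be compared with 1 without rounding.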

Definition: Lebesgue Measure

The Lebesgue measure $\lambda$ on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ is the unique translation-invariant measure on the Borel σ-algebra satisfying $\lambda([a,b]) = b - a$ . It extends to $\mathbb{R}^n$ with $\lambda([a_1,b_1] \times \cdots \times [a_n,b_n]) = \prod_{i=1}^n (b_i - a_i)$ . The Lebesgue measure generalizes the notion of "length," "area," and "volume" to arbitrary measurable sets.

Definition: Measurable Function

A function $f: (\Omega_1, \mathcal{F}_1) \to (\Omega_2, \mathcal{F}_2)$ is measurable if $f^{-1}(B) \in \mathcal{F}_1$ for every $B \in \mathcal{F}_2$ . For real-valued functions, it suffices that $f^{-1}((a, \infty)) \in \mathcal{F}_1$ for all $a \in \mathbb{R}$ . Measurable functions are the "nice" functions that can be integrated.

Definition: Lebesgue Integral

For a non-negative simple function $s = \sum_{i=1}^n a_i \mathbf{1}_{A_i}$ (where $A_i$ are measurable and disjoint), the Lebesgue integral is $\int s \, d\mu = \sum_{i=1}^n a_i \mu(A_i)$ . For a general non-negative measurable function $f$ :

\int f \, d\mu = \sup\left\{\int s \, d\mu : 0 \leq s \leq f, \, s \text{ simple}\right\}

For general measurable $f$ , write $f = f^+ - f^-$ where $f^+ = \max(f, 0)$ and $f^- = \max(-f, 0)$ , and define $\int f \, d\mu = \int f^+ \, d\mu - \int f^- \, d\mu$ whenever at least one of the two integrals on the right is finite. The function $f$ is called integrable when both are finite.
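
The supremum over simple functions can be approximated by slicing the range of $f$ into dyadic levels. The sketch below assumes $f(x) = x^2$ on $[0,1]$ with Lebesgue measure, an example chosen only because the level sets $\{f \geq t\}$ have the closed-form measure $1 - \sqrt{t}$ ; the function names are hypothetical.

```python
import math

def level_set_measure(t):
    """Lebesgue measure of {x in [0,1] : x**2 >= t}, known in closed form."""
    if t <= 0:
        return 1.0
    if t > 1:
        return 0.0
    return 1.0 - math.sqrt(t)

def simple_function_integral(n):
    """Integral of the dyadic simple function s_n = floor(2^n f) / 2^n, f(x) = x^2.

    Since floor(m) counts the integers k >= 1 with m >= k, the integral of s_n
    is the sum over k of 2^-n times the measure of {f >= k / 2^n}.
    """
    levels = 2 ** n
    return sum(level_set_measure(k / levels) for k in range(1, levels + 1)) / levels

for n in (1, 2, 4, 8, 12):
    print(n, simple_function_integral(n))
```

The printed values are lower approximations of $\int_0^1 x^2 \, dx = 1/3$ that improve as the range is sliced more finely.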

Definition: Lᵖ Space

For $1 \leq p < \infty$ , $L^p(\Omega, \mathcal{F}, \mu)$ is the space of measurable functions $f$ with $\|f\|_p = \left(\int |f|^p \, d\mu\right)^{1/p} < \infty$ . $L^\infty$ consists of essentially bounded functions with $\|f\|_\infty = \text{ess sup} |f|$ . These are Banach spaces; $L^2$ is additionally a Hilbert space with inner product $\langle f, g \rangle = \int f\bar{g} \, d\mu$ .

Definition: Absolute Continuity

A measure $\nu$ is absolutely continuous with respect to $\mu$ (written $\nu \ll \mu$ ) if $\mu(A) = 0 \implies \nu(A) = 0$ for all $A \in \mathcal{F}$ . The Radon-Nikodym theorem states that if $\nu \ll \mu$ and both are σ-finite, there exists a measurable function $\frac{d\nu}{d\mu}$ (the Radon-Nikodym derivative) such that $\nu(A) = \int_A \frac{d\nu}{d\mu} \, d\mu$ for all $A \in \mathcal{F}$ .

Definition: Product Measure

Given two measure spaces $(X, \mathcal{A}, \mu)$ and $(Y, \mathcal{B}, \nu)$ , the product measure $\mu \times \nu$ on $(X \times Y, \mathcal{A} \otimes \mathcal{B})$ satisfies $(\mu \times \nu)(A \times B) = \mu(A)\nu(B)$ for measurable rectangles. The product σ-algebra $\mathcal{A} \otimes \mathcal{B}$ is generated by sets of the form $A \times B$ with $A \in \mathcal{A}$ and $B \in \mathcal{B}$ .

Key Formulas

Monotone Convergence Theorem (MCT)

0 \leq f_1 \leq f_2 \leq \cdots, \quad f_n \to f \implies \lim_{n \to \infty} \int f_n \, d\mu = \int f \, d\mu

Here,

$f_n$ =Sequence of non-negative measurable functions, monotonically increasing
$f$ =Pointwise limit of $f_n$
$\mu$ =Measure on the measurable space

Fatou's Lemma

\int \liminf_{n \to \infty} f_n \, d\mu \leq \liminf_{n \to \infty} \int f_n \, d\mu

Here,

$f_n$ =Sequence of non-negative measurable functions
$\liminf$ =Limit inferior: $\liminf_{n \to \infty} a_n = \sup_{n} \inf_{k \geq n} a_k$ (the supremum of the tail infima)

Dominated Convergence Theorem (DCT)

f_n \to f \text{ a.e.}, \quad |f_n| \leq g \text{ with } \int g \, d\mu < \infty \implies \lim \int f_n \, d\mu = \int f \, d\mu

Here,

$f_n$ =Sequence of measurable functions converging a.e. to f
$g$ =Integrable dominating function: |f_n| ≤ g a.e.

Fubini's Theorem

\int_{X \times Y} f(x,y) \, d(\mu \times \nu) = \int_X \left(\int_Y f(x,y) \, d\nu(y)\right) d\mu(x) = \int_Y \left(\int_X f(x,y) \, d\mu(x)\right) d\nu(y)

Here,

$f$ =Integrable function on the product space X × Y
$\mu, \nu$ =Measures on X and Y respectively
$\mu \times \nu$ =Product measure on X × Y

Change of Variables Formula

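\int_{T(X)} f(y) \, dy = \int_X f(T(x)) \left|\det DT(x)\right| \, dx

Here,

$T$ =Continuously differentiable, injective map from an open set $X \subseteq \mathbb{R}^n$ onto $T(X)$ , with Lebesgue measure on both sides
$\det DT$ =Jacobian determinant of T (written $dy/dx$ in one dimension)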

Radon-Nikodym Theorem

\nu(A) = \int_A \frac{d\nu}{d\mu} \, d\mu \quad \text{for all } A \in \mathcal{F}

Here,

$\nu$ =σ-finite measure absolutely continuous w.r.t. μ
$d\nu/d\mu$ =Radon-Nikodym derivative (density function)

Markov's Inequality

P(X \geq a) \leq \frac{E[X]}{a}

Here,

$X$ =Non-negative random variable
$a$ =Positive constant
$E[X]$ =Expected value (Lebesgue integral of X w.r.t. P)

Chebyshev's Inequality

P(|X - \mu| \geq k\sigma) \leq \frac{1}{k^2}

Here,

$X$ =Random variable with mean μ and variance σ²
$k$ =Number of standard deviations
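
Both inequalities can be checked exactly on a small discrete distribution. The sketch below assumes a fair six-sided die, chosen only for illustration, and states Chebyshev in the equivalent form $P(|X - \mu| \geq t) \leq \sigma^2 / t^2$ (set $t = k\sigma$ ) so that the arithmetic stays in exact fractions.

```python
from fractions import Fraction

values = range(1, 7)   # faces of a fair die (illustrative assumption)
p = Fraction(1, 6)     # probability of each face

mean = sum(p * x for x in values)                     # E[X]
variance = sum(p * (x - mean) ** 2 for x in values)   # Var(X)

def markov_check(a):
    """Return (P(X >= a), E[X] / a) for a positive constant a."""
    tail = sum(p for x in values if x >= a)
    return tail, mean / a

def chebyshev_check(t):
    """Return (P(|X - mean| >= t), Var(X) / t^2) for t > 0."""
    tail = sum(p for x in values if abs(x - mean) >= t)
    return tail, variance / t ** 2

print(markov_check(5))
print(chebyshev_check(Fraction(5, 2)))
```

In each pair the first entry is the exact tail probability and the second is the bound, so the first should never exceed the second.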

Important Theorems

Theorem: Lebesgue's Dominated Convergence Theorem

If $\{f_n\}$ is a sequence of measurable functions converging pointwise a.e. to $f$ , and there exists an integrable function $g$ (i.e., $\int g \, d\mu < \infty$ ) such that $|f_n| \leq g$ a.e. for all $n$ , then $f$ is integrable and $\lim_{n \to \infty} \int f_n \, d\mu = \int f \, d\mu$ .

This theorem justifies interchange of limit and integral under the domination condition, and is the workhorse of measure theory. The domination condition prevents "escape of mass" to infinity or to singularities.

Theorem: Fubini's Theorem

Let $(X, \mathcal{A}, \mu)$ and $(Y, \mathcal{B}, \nu)$ be σ-finite measure spaces. If $f$ is integrable on the product space $(X \times Y, \mathcal{A} \otimes \mathcal{B}, \mu \times \nu)$ , then:

For a.e. $x \in X$ , $f(x, \cdot)$ is integrable on $Y$
The function $x \mapsto \int_Y f(x,y) \, d\nu(y)$ is integrable on $X$
$\int_{X \times Y} f \, d(\mu \times \nu) = \int_X \left(\int_Y f(x,y) \, d\nu(y)\right) d\mu(x) = \int_Y \left(\int_X f(x,y) \, d\mu(x)\right) d\nu(y)$

Fubini's theorem allows evaluation of multi-dimensional integrals as iterated one-dimensional integrals, in either order.

Theorem: Hahn Decomposition Theorem

Let $\mu$ be a signed measure on $(\Omega, \mathcal{F})$ . There exists a measurable set $P$ (the positive set) such that $\mu(A) \geq 0$ for all measurable $A \subseteq P$ , and $N = P^c$ (the negative set) such that $\mu(A) \leq 0$ for all measurable $A \subseteq N$ . The decomposition is essentially unique (unique up to μ-null sets).

Theorem: Vitali Convergence Theorem

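Let $\{f_n\}$ be a sequence of integrable functions on a finite measure space $(\Omega, \mathcal{F}, \mu)$ with $\mu(\Omega) < \infty$ , and let $f$ be measurable. Then $f$ is integrable and $f_n \to f$ in $L^1$ (i.e., $\int |f_n - f| \, d\mu \to 0$ ) if and only if:

$f_n \to f$ in measure: for every $\epsilon > 0$ , $\mu(\{|f_n - f| > \epsilon\}) \to 0$
$\{f_n\}$ is uniformly integrable: $\lim_{M \to \infty} \sup_n \int_{|f_n| > M} |f_n| \, d\mu = 0$

On a space of infinite measure a tightness condition is needed in addition: for every $\epsilon > 0$ there exists a set $E$ with $\mu(E) < \infty$ and $\sup_n \int_{E^c} |f_n| \, d\mu < \epsilon$ . On a finite measure space this holds automatically with $E = \Omega$ .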

Theorem: Carathéodory's Extension Theorem

If $\mu_0$ is a premeasure on an algebra $\mathcal{A}$ on $\Omega$ , then $\mu_0$ extends to a measure $\mu$ on the σ-algebra $\sigma(\mathcal{A})$ generated by $\mathcal{A}$ . If $\mu_0$ is σ-finite, the extension is unique. This theorem justifies the construction of Lebesgue measure from the premeasure on intervals.

Worked Examples

📝Lebesgue Measure of the Cantor Set

Problem: Show that the Lebesgue measure of the Cantor set is 0.

Solution:

Step 1: The Cantor set $C$ is constructed by repeatedly removing middle thirds from $[0,1]$ .

Step 2: At stage $n$ , we remove $2^{n-1}$ intervals, each of length $3^{-n}$ .

Step 3: Total length removed:

L = \sum_{n=1}^{\infty} 2^{n-1} \cdot 3^{-n} = \frac{1}{3} \sum_{n=0}^{\infty} \left(\frac{2}{3}\right)^n = \frac{1}{3} \cdot \frac{1}{1 - 2/3} = \frac{1}{3} \cdot 3 = 1

Step 4: Since $[0,1]$ has Lebesgue measure 1 and we removed a total of 1, the remaining set $C$ has measure:

\lambda(C) = 1 - 1 = 0

Result: The Cantor set is uncountable (it has cardinality $2^{\aleph_0}$ ) but has Lebesgue measure 0. This demonstrates that "size" (measure) and cardinality are fundamentally different notions, and motivates the need for σ-algebras to handle such pathological sets.
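
The geometric series in Step 3 can be checked stage by stage with exact arithmetic. This is a minimal sketch; the function names are hypothetical.

```python
from fractions import Fraction

def removed_length(stages):
    """Total length removed after the given number of stages:
    stage n removes 2^(n-1) intervals of length 3^(-n)."""
    return sum(Fraction(2 ** (n - 1), 3 ** n) for n in range(1, stages + 1))

def remaining_length(stages):
    """Length of the set that is left inside [0, 1] after the given stages."""
    return 1 - removed_length(stages)

for n in (1, 2, 3, 10, 50):
    print(n, remaining_length(n), float(remaining_length(n)))
```

After $n$ stages the remaining length equals $(2/3)^n$ , which tends to 0 and matches $\lambda(C) = 0$ .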

📝Applying the Monotone Convergence Theorem

Problem: Compute $\lim_{n \to \infty} \int_0^\infty \left(1 + \frac{x}{n}\right)^{-n} e^{-x/2} \, dx$ .

Solution:

Step 1: Identify the pointwise limit. For each fixed $x \geq 0$ :

\lim_{n \to \infty} \left(1 + \frac{x}{n}\right)^{-n} = e^{-x}

So $f_n(x) = \left(1 + \frac{x}{n}\right)^{-n} e^{-x/2} \to e^{-x} \cdot e^{-x/2} = e^{-3x/2}$ .

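Step 2: Verify monotonicity and find an integrable bound. For fixed $x \geq 0$ , $\left(1 + \frac{x}{n}\right)^{n}$ is increasing in $n$ , so $\left(1 + \frac{x}{n}\right)^{-n}$ is decreasing in $n$ . Therefore $f_n(x)$ decreases to its limit, and $0 \leq f_n(x) \leq f_1(x) \leq e^{-x/2}$ , which is integrable on $[0, \infty)$ .

Step 3: Apply the Monotone Convergence Theorem in its decreasing form, which is valid because $f_1$ is integrable (apply the increasing version to $f_1 - f_n \geq 0$ ; the Dominated Convergence Theorem with dominating function $e^{-x/2}$ gives the same conclusion):

\lim_{n \to \infty} \int_0^\infty f_n(x) \, dx = \int_0^\infty \lim_{n \to \infty} f_n(x) \, dx = \int_0^\infty e^{-3x/2} \, dx

Step 4: Compute the integral:

\int_0^\infty e^{-3x/2} \, dx = \left[-\frac{2}{3} e^{-3x/2}\right]_0^\infty = 0 + \frac{2}{3} = \frac{2}{3}

Result: $\lim_{n \to \infty} \int_0^\infty \left(1 + \frac{x}{n}\right)^{-n} e^{-x/2} \, dx = \frac{2}{3}$

The swap of limit and integral was justified because the sequence is monotonically decreasing and its first term is integrable. Monotone decrease alone is not enough: $f_n = \mathbf{1}_{[n, \infty)}$ decreases to 0, yet every $\int f_n \, d\lambda$ is infinite.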
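
A numerical check of this limit is sketched below. It assumes a plain midpoint rule on the truncated interval $[0, 60]$ ; the truncation is harmless here because $f_n(x) \leq e^{-x/2}$ , so the discarded tail is at most $2e^{-30}$ . The step count and function names are arbitrary illustrative choices.

```python
import math

def f(n, x):
    """Integrand f_n(x) = (1 + x/n)^(-n) * exp(-x/2)."""
    return (1.0 + x / n) ** (-n) * math.exp(-x / 2.0)

def integral(n, upper=60.0, steps=60_000):
    """Midpoint-rule approximation of the integral of f_n over [0, upper]."""
    h = upper / steps
    return h * sum(f(n, (i + 0.5) * h) for i in range(steps))

for n in (1, 10, 100, 1000):
    print(n, integral(n))
print("target", 2.0 / 3.0)
```

Printing the values for several $n$ shows how the integrals move toward $2/3$ as $n$ grows.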

📝Applying Fatou's Lemma

Problem: Let $f_n = n \cdot \mathbf{1}_{(0, 1/n]}$ on $([0,1], \mathcal{B}, \lambda)$ . Compute $\liminf \int f_n \, d\lambda$ and $\int \liminf f_n \, d\lambda$ .

Solution:

Step 1: Compute $\int f_n \, d\lambda = n \cdot \lambda((0, 1/n]) = n \cdot \frac{1}{n} = 1$ for all $n$ .

So $\liminf_{n \to \infty} \int f_n \, d\lambda = 1$ .

Step 2: Find the pointwise limit. For any $x > 0$ , there exists $N$ such that $x > 1/N$ , so $f_n(x) = 0$ for all $n \geq N$ . Thus $\lim_{n \to \infty} f_n(x) = 0$ for all $x > 0$ .

Also $f_n(0) = 0$ for all $n$ . So $\liminf f_n = 0$ pointwise.

Step 3: Compute $\int \liminf f_n \, d\lambda = \int 0 \, d\lambda = 0$ .

Step 4: Fatou's Lemma gives:

\int \liminf f_n \, d\lambda = 0 \leq 1 = \liminf \int f_n \, d\lambda

Result: Fatou's inequality is strict: $0 < 1$ . The functions $f_n$ escape to infinity near 0 (they "spike" higher but over smaller intervals), so the integral of the limit (0) is strictly less than the limit of the integrals (1). This is why Fatou gives an inequality, not an equality.
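
The two sides of Fatou's inequality for this spike sequence can be tabulated exactly. The sketch below uses exact fractions; the evaluation point $x = 3/10$ is an arbitrary illustrative choice.

```python
from fractions import Fraction

def f(n, x):
    """f_n = n times the indicator of the interval (0, 1/n]."""
    return n if 0 < x <= Fraction(1, n) else 0

def integral(n):
    """Exact Lebesgue integral of f_n: height n times the length of (0, 1/n]."""
    return n * Fraction(1, n)

x = Fraction(3, 10)
for n in (1, 2, 3, 4, 10, 100):
    # The integral stays at 1 while the value at the fixed point x drops to 0.
    print(n, integral(n), f(n, x))
```

For a fixed $x > 0$ the values $f_n(x)$ are eventually 0, while every integral stays equal to 1.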

📝When the Dominated Convergence Theorem Does Not Apply

Problem: Compute $\lim_{n \to \infty} \int_0^1 \frac{n x^{n-1}}{1 + x^n} \, dx$ .

Solution:

Step 1: Let $f_n(x) = \frac{nx^{n-1}}{1+x^n}$ . Find the pointwise limit.

For $x \in [0, 1)$ : $x^n \to 0$ , so $f_n(x) \to \frac{0}{1+0} = 0$ .

For $x = 1$ : $f_n(1) = \frac{n}{2} \to \infty$ .

Since $\{1\}$ has measure 0, $f_n \to 0$ a.e.

Step 2: Look for a dominating function. The obvious bound

f_n(x) = \frac{nx^{n-1}}{1+x^n} \leq n \quad \text{for } x \in [0,1]

grows with $n$ , so it does not provide a single integrable function that dominates every $f_n$ . Rather than search further, observe that $f_n$ is an exact derivative:

f_n(x) = \frac{d}{dx} \ln(1 + x^n)

Step 3: Integrate directly:

\int_0^1 f_n(x) \, dx = \int_0^1 \frac{nx^{n-1}}{1+x^n} \, dx = [\ln(1+x^n)]_0^1 = \ln 2 - \ln 1 = \ln 2

The integral is $\ln 2$ for all $n$ , so the limit is $\ln 2$ .

Step 4: Compare with the DCT: $f_n \to 0$ a.e., but the integrals converge to $\ln 2 \neq 0 = \int_0^1 0 \, dx$ . If an integrable dominating function existed, the DCT would force the limit to be 0, so no such function exists. The "mass" concentrates near the point $x=1$ , which has measure 0 but is where the functions blow up.

Result: $\lim_{n \to \infty} \int_0^1 f_n \, dx = \ln 2$ , despite $f_n \to 0$ a.e. This demonstrates that pointwise convergence alone does not guarantee convergence of integrals—we need the domination condition.
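
The contrast between the pointwise limit and the limit of the integrals can be seen numerically. The sketch below assumes a midpoint rule with a fixed number of steps; the sample point $x = 0.9$ and the values of $n$ are arbitrary illustrative choices.

```python
import math

def f(n, x):
    """Integrand f_n(x) = n x^(n-1) / (1 + x^n)."""
    return n * x ** (n - 1) / (1.0 + x ** n)

def integral(n, steps=200_000):
    """Midpoint-rule approximation of the integral of f_n over [0, 1]."""
    h = 1.0 / steps
    return h * sum(f(n, (i + 0.5) * h) for i in range(steps))

for n in (2, 10, 50):
    # Columns: n, approximate integral, value of f_n at the fixed point 0.9.
    print(n, integral(n), f(n, 0.9))
print("ln 2 =", math.log(2.0))
```

The second column stays near $\ln 2$ , while the value at the fixed point shrinks toward 0 once $n$ is large.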

Practice Problems

📝Problem 1: Show That the Rationals Have Lebesgue Measure Zero

Problem: Prove that $\lambda(\mathbb{Q} \cap [0,1]) = 0$ .

Solution:

Step 1: $\mathbb{Q} \cap [0,1]$ is countable. Enumerate it as $\{q_1, q_2, q_3, \ldots\}$ .

Step 2: For any $\epsilon > 0$ , cover each $q_n$ with an interval of length $\epsilon/2^n$ :

U = \bigcup_{n=1}^{\infty} \left(q_n - \frac{\epsilon}{2^{n+1}}, q_n + \frac{\epsilon}{2^{n+1}}\right)

Step 3: By countable subadditivity of Lebesgue measure:

\lambda(U) \leq \sum_{n=1}^{\infty} \frac{\epsilon}{2^n} = \epsilon

Step 4: Since $\mathbb{Q} \cap [0,1] \subseteq U$ and $\lambda(U) \leq \epsilon$ for all $\epsilon > 0$ :

\lambda(\mathbb{Q} \cap [0,1]) = 0

Result: The rationals in $[0,1]$ have Lebesgue measure zero. This means Lebesgue integration "ignores" the rationals—they form a set of measure zero, so modifying a function on the rationals does not change its Lebesgue integral.
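
The covering argument can be carried out explicitly for a finite initial segment of an enumeration. The sketch below assumes the rationals are listed by increasing denominator; any enumeration would do, and the helper names are hypothetical.

```python
from fractions import Fraction

def rationals_in_unit_interval(max_denominator):
    """List the distinct rationals p/q in [0, 1] with q <= max_denominator."""
    seen = []
    for q in range(1, max_denominator + 1):
        for p in range(0, q + 1):
            r = Fraction(p, q)
            if r not in seen:
                seen.append(r)
    return seen

def cover(points, eps):
    """Interval n (1-indexed) is centred at q_n and has total length eps / 2^n."""
    return [(q - eps / 2 ** (n + 1), q + eps / 2 ** (n + 1))
            for n, q in enumerate(points, start=1)]

def total_length(intervals):
    """Sum of interval lengths, an upper bound for the measure of their union."""
    return sum(b - a for a, b in intervals)

points = rationals_in_unit_interval(6)
eps = Fraction(1, 100)
intervals = cover(points, eps)
print(len(points), total_length(intervals), total_length(intervals) < eps)
```

However many rationals are included, the total length of the cover stays below the chosen $\epsilon$ .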

📝Problem 2: Verify Countable Additivity

Problem: Let $\mu$ be the counting measure on $(\mathbb{N}, 2^{\mathbb{N}})$ . Verify that $\mu$ is a measure.

Solution:

Step 1: Check $\mu(\emptyset) = 0$ . The counting measure of the empty set is 0 (no elements to count). ✓

Step 2: Check countable additivity. Let $A_1, A_2, \ldots$ be pairwise disjoint subsets of $\mathbb{N}$ .

\mu\left(\bigcup_{n=1}^{\infty} A_n\right) = \left|\bigcup_{n=1}^{\infty} A_n\right|

Since the $A_n$ are disjoint, each element of $\bigcup A_n$ belongs to exactly one $A_n$ :

\left|\bigcup_{n=1}^{\infty} A_n\right| = \sum_{n=1}^{\infty} |A_n| = \sum_{n=1}^{\infty} \mu(A_n)

Step 3: Both sides are either finite and equal, or both are infinite. ✓

Result: The counting measure is a valid measure. It is σ-finite because $\mathbb{N} = \bigcup_n \{n\}$ is a countable union of sets of measure 1, and it is the measure used to define $\ell^p$ spaces.

📝Problem 3: Lebesgue Integral vs. Riemann Integral

Problem: Compute $\int_0^1 \mathbf{1}_{\mathbb{Q}}(x) \, d\lambda$ (Lebesgue integral of the Dirichlet function).

Solution:

Step 1: The Dirichlet function is $f(x) = \mathbf{1}_{\mathbb{Q}}(x) = \begin{cases} 1 & x \in \mathbb{Q} \\ 0 & x \notin \mathbb{Q} \end{cases}$

Step 2: We showed that $\lambda(\mathbb{Q} \cap [0,1]) = 0$ .

Step 3: Since $f = 0$ almost everywhere (a.e.) with respect to Lebesgue measure:

\int_0^1 \mathbf{1}_{\mathbb{Q}} \, d\lambda = \int_0^1 0 \, d\lambda = 0

Step 4: Contrast with Riemann integral: The Riemann integral of $\mathbf{1}_{\mathbb{Q}}$ does not exist because in every subinterval of $[0,1]$ the function takes both values 0 and 1, so every upper Riemann sum equals 1 while every lower Riemann sum equals 0.

Result: The Lebesgue integral equals 0, while the Riemann integral does not exist. This demonstrates the power of Lebesgue integration: it can integrate highly discontinuous functions that Riemann integration cannot handle.

📝Problem 4: Applying Fubini's Theorem

Problem: Compute $\int_0^1 \int_0^1 \frac{x - y}{(x + y)^3} \, dx \, dy$ and $\int_0^1 \int_0^1 \frac{x - y}{(x + y)^3} \, dy \, dx$ . Do they give the same answer?

Solution:

Step 1: Note that $f(x,y) = \frac{x-y}{(x+y)^3}$ is NOT non-negative, so we need to check integrability.

Step 2: Compute the iterated integral $\int_0^1 \left(\int_0^1 \frac{x-y}{(x+y)^3} \, dx\right) dy$ .

Inner integral: $\int_0^1 \frac{x-y}{(x+y)^3} \, dx$ . Let $u = x + y$ , $du = dx$ :

\int_y^{1+y} \frac{u - 2y}{u^3} \, du = \int_y^{1+y} (u^{-2} - 2y u^{-3}) \, du = \left[-\frac{1}{u} + \frac{y}{u^2}\right]_y^{1+y}

= \left(-\frac{1}{1+y} + \frac{y}{(1+y)^2}\right) - \left(-\frac{1}{y} + \frac{y}{y^2}\right)

The lower-limit term vanishes because $\frac{y}{y^2} = \frac{1}{y}$ , leaving (for $y > 0$ ):

= \frac{-(1+y) + y}{(1+y)^2} - 0 = \frac{-1}{(1+y)^2}

Step 3: Outer integral: $\int_0^1 \frac{-1}{(1+y)^2} \, dy = \left[\frac{1}{1+y}\right]_0^1 = \frac{1}{2} - 1 = -\frac{1}{2}$

Step 4: By antisymmetry ( $f(x,y) = -f(y,x)$ ), exchanging the roles of $x$ and $y$ flips the sign:

\int_0^1 \int_0^1 \frac{x-y}{(x+y)^3} \, dy \, dx = \frac{1}{2}

Result: The two iterated integrals give different values ( $-1/2$ vs $1/2$ ). Fubini's theorem does NOT apply because $\int \int |f(x,y)| \, dx \, dy = \infty$ (the function is not integrable on the product space). This demonstrates the importance of checking the integrability condition in Fubini's theorem.
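
The closed form of the inner integral can be sanity-checked numerically away from the singularity at the origin. The sketch below assumes a midpoint rule and a few sample values of $y > 0$ ; it then integrates the closed form $-1/(1+y)^2$ over $y$ instead of attempting a two-dimensional quadrature near $(0,0)$ .

```python
def f(x, y):
    """Integrand (x - y) / (x + y)^3."""
    return (x - y) / (x + y) ** 3

def inner_dx(y, steps=20_000):
    """Midpoint-rule approximation of the integral over x in [0, 1] for fixed y > 0."""
    h = 1.0 / steps
    return h * sum(f((i + 0.5) * h, y) for i in range(steps))

def closed_form_inner(y):
    """Inner integral obtained analytically: -1 / (1 + y)^2."""
    return -1.0 / (1.0 + y) ** 2

def outer_dy(steps=20_000):
    """Midpoint-rule integral of the closed-form inner integral over y in [0, 1]."""
    h = 1.0 / steps
    return h * sum(closed_form_inner((j + 0.5) * h) for j in range(steps))

for y in (0.25, 0.5, 1.0):
    print(y, inner_dx(y), closed_form_inner(y))
print("dx then dy:", outer_dy())
```

By the antisymmetry of $f$ , the same computation with the roles of $x$ and $y$ exchanged gives the opposite sign.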

📝Problem 5: Radon-Nikodym Derivative

Problem: Let $P$ be the uniform distribution on $[0,1]$ (Lebesgue measure) and $Q$ be the distribution with density $q(x) = 2x$ . Find the Radon-Nikodym derivative $\frac{dQ}{dP}$ .

Solution:

Step 1: Check absolute continuity. If $P(A) = \lambda(A) = 0$ , then $Q(A) = \int_A 2x \, dx = 0$ . So $Q \ll P$ .

Step 2: By the Radon-Nikodym theorem, there exists $\frac{dQ}{dP}$ such that $Q(A) = \int_A \frac{dQ}{dP} \, dP$ .

Step 3: Since $P$ is Lebesgue measure and $Q$ has density $2x$ with respect to Lebesgue measure:

Q(A) = \int_A 2x \, dx = \int_A 2x \, dP

Step 4: Therefore $\frac{dQ}{dP}(x) = 2x$ .

Verification: $\int_0^1 2x \, dx = [x^2]_0^1 = 1 = Q([0,1])$ ✓

Result: The Radon-Nikodym derivative $\frac{dQ}{dP} = 2x$ is the likelihood ratio of $Q$ to $P$ . In statistics, likelihood ratios of this kind are used in hypothesis testing and maximum likelihood estimation. The Kullback-Leibler divergence can be expressed as $D_{KL}(Q \| P) = \int \log\left(\frac{dQ}{dP}\right) dQ$ .
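
The defining identity $Q(A) = \int_A \frac{dQ}{dP} \, dP$ and the KL formula can both be checked numerically for this example. The sketch below assumes intervals $A = [a, b]$ and a midpoint rule; the closed forms used for comparison are $Q([a,b]) = b^2 - a^2$ and $D_{KL}(Q \| P) = \int_0^1 2x \log(2x) \, dx = \ln 2 - \frac{1}{2}$ .

```python
import math

def density(x):
    """Radon-Nikodym derivative dQ/dP at x in [0, 1]."""
    return 2.0 * x

def Q_interval(a, b, steps=100_000):
    """Q([a, b]) as the integral of dQ/dP over [a, b] against P (Lebesgue measure)."""
    h = (b - a) / steps
    return h * sum(density(a + (i + 0.5) * h) for i in range(steps))

def kl_divergence(steps=100_000):
    """D_KL(Q || P): integral of log(dQ/dP) with respect to Q over [0, 1]."""
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h  # midpoints avoid log(0) at x = 0
        total += density(x) * math.log(density(x)) * h
    return total

print(Q_interval(0.2, 0.7), 0.7 ** 2 - 0.2 ** 2)
print(kl_divergence(), math.log(2.0) - 0.5)
```

The divergence is strictly positive here because $Q$ differs from $P$ .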

📝Problem 6: Borel-Cantelli Lemma

Problem: Let $A_n$ be events in a probability space with $\sum_{n=1}^{\infty} P(A_n) < \infty$ . Show that $P(\limsup A_n) = 0$ .

Solution:

Step 1: Define $B = \limsup A_n = \bigcap_{n=1}^{\infty} \bigcup_{k=n}^{\infty} A_k$ (infinitely many $A_n$ occur).

Step 2: For each $N$ , $\bigcup_{k=N}^{\infty} A_k$ is an event containing $B$ .

Step 3: By countable subadditivity:

P\left(\bigcup_{k=N}^{\infty} A_k\right) \leq \sum_{k=N}^{\infty} P(A_k)

Step 4: Since $\sum P(A_n) < \infty$ , the tail $\sum_{k=N}^{\infty} P(A_k) \to 0$ as $N \to \infty$ .

Step 5: Therefore $P(B) \leq P\left(\bigcup_{k=N}^{\infty} A_k\right) \leq \sum_{k=N}^{\infty} P(A_k) \to 0$ .

Result: $P(\limsup A_n) = 0$ . This means "almost surely, only finitely many $A_n$ occur." The Borel-Cantelli lemma is fundamental in probability theory and connects measure theory to the study of random events.

📝Problem 7: Egorov's Theorem

Problem: State Egorov's theorem and explain its significance.

Solution:

Statement: Let $(\Omega, \mathcal{F}, \mu)$ be a finite measure space ( $\mu(\Omega) < \infty$ ), and let $f_n$ and $f$ be measurable functions with $f_n \to f$ pointwise a.e. Then for every $\epsilon > 0$ , there exists a measurable set $E \subseteq \Omega$ with $\mu(E^c) < \epsilon$ such that $f_n \to f$ uniformly on $E$ .

Significance: Egorov's theorem bridges pointwise convergence and uniform convergence. It says that on a finite measure space, pointwise convergence is "almost uniform"—we can find a set of arbitrarily small complement on which convergence is uniform.

Example: For $f_n = n \cdot \mathbf{1}_{(0,1/n]}$ on $[0,1]$ , $f_n \to 0$ pointwise but not uniformly on $[0,1]$ . However, for any $\epsilon \in (0, 1)$ , take $E = [\epsilon/2, 1]$ . Then $\mu(E^c) = \epsilon/2 < \epsilon$ and $f_n = 0$ on $E$ for $n > 2/\epsilon$ , so $f_n \to 0$ uniformly on $E$ .

Result: Egorov's theorem is important because it shows that pointwise convergence on finite measure spaces is "close to" uniform convergence, which is stronger and more useful for analysis.

📝Problem 8: Lebesgue Differentiation Theorem

Problem: State the Lebesgue differentiation theorem and explain its significance.

Solution:

Statement: For $f \in L^1_{\text{loc}}(\mathbb{R})$ (locally integrable), for almost every $x \in \mathbb{R}$ :

\lim_{h \to 0} \frac{1}{h} \int_x^{x+h} f(t) \, dt = f(x)

Equivalently: $\lim_{r \to 0} \frac{1}{|B(x,r)|} \int_{B(x,r)} f(y) \, dy = f(x)$ a.e.

Significance: This theorem says that the "average value" of $f$ over small balls converges to $f(x)$ for a.e. $x$ . It justifies the informal idea that integrals "recover" the integrand.

Example: For $f = \mathbf{1}_{[0,1]}$ , at $x = 0.5$ :

\frac{1}{2h}\int_{0.5-h}^{0.5+h} \mathbf{1}_{[0,1]}(t)dt = \frac{2h}{2h} = 1 = f(0.5)

At $x = 0$ (boundary): $\frac{1}{2h}\int_{-h}^{h} \mathbf{1}_{[0,1]}(t)dt = \frac{h}{2h} = 1/2 \neq f(0) = 1$

But $\{0\}$ has measure zero, so the theorem holds a.e.

Result: The Lebesgue differentiation theorem is fundamental in real analysis—it connects the integral and the function value, and is used in the study of Hardy-Littlewood maximal functions, singular integrals, and the theory of differentiation.
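
The averages in the example can be computed exactly, because the integral of an indicator over an interval is just the length of the overlap. The sketch below is a minimal illustration with a hypothetical helper name.

```python
def symmetric_average(x, h, a=0.0, b=1.0):
    """Average of the indicator of [a, b] over [x - h, x + h]:
    the length of the overlap divided by 2h."""
    overlap = max(0.0, min(b, x + h) - max(a, x - h))
    return overlap / (2.0 * h)

for h in (0.1, 0.01, 0.001):
    # Interior point, boundary point, and a point outside [0, 1].
    print(h, symmetric_average(0.5, h), symmetric_average(0.0, h), symmetric_average(2.0, h))
```

At the interior point and at the outside point the averages agree with the function value; only the boundary points behave differently, and those form a null set.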

📝Problem 9: Jensen's Inequality

Problem: State Jensen's inequality and apply it to show that $E[\log X] \leq \log E[X]$ for positive random variable $X$ .

Solution:

Statement: If $\phi$ is a convex function and $X$ is an integrable random variable taking values in the domain of $\phi$ , then $\phi(E[X]) \leq E[\phi(X)]$ .

Step 1: Let $\phi(x) = -\log x$ (which is convex for $x > 0$ ).

Step 2: Apply Jensen's inequality:

\phi(E[X]) \leq E[\phi(X)]

-\log E[X] \leq E[-\log X]

\log E[X] \geq E[\log X]

Step 3: Therefore $E[\log X] \leq \log E[X]$ .

Step 4: This is the AM-GM inequality in expectation: the geometric mean is always less than or equal to the arithmetic mean.

Example: If $X$ takes values 1 and 3 with equal probability:

$E[X] = 2$ , $\log E[X] = \log 2 \approx 0.693$
$E[\log X] = \frac{1}{2}(\log 1 + \log 3) = \frac{1}{2}\log 3 \approx 0.549$
$0.549 \leq 0.693$ ✓

Result: Jensen's inequality is fundamental in information theory (KL divergence is non-negative) and in machine learning (variational inference bounds).
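
The numbers in the example can be reproduced directly. The sketch below uses the same two-point distribution as the example and also prints the geometric mean $\exp(E[\log X])$ next to the arithmetic mean.

```python
import math

values = [1.0, 3.0]   # values taken by X in the example
probs = [0.5, 0.5]    # equal probabilities

expected_x = sum(p * x for p, x in zip(probs, values))
expected_log_x = sum(p * math.log(x) for p, x in zip(probs, values))

print("E[log X] =", expected_log_x)
print("log E[X] =", math.log(expected_x))
print("geometric mean =", math.exp(expected_log_x))
print("arithmetic mean =", expected_x)
```

The geometric mean comes out as $\sqrt{3}$ , below the arithmetic mean 2, which is the AM-GM reading of Step 4.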

📝Problem 10: Completion of a Measure Space

Problem: Explain the completion of a measure space and why it matters.

Solution:

Definition: Given a measure space $(\Omega, \mathcal{F}, \mu)$ , the completion $(\Omega, \bar{\mathcal{F}}, \bar{\mu})$ is defined by:

\bar{\mathcal{F}} = \{A \cup B : A \in \mathcal{F}, B \subseteq N \text{ for some } N \in \mathcal{F} \text{ with } \mu(N) = 0\}

and $\bar{\mu}(A \cup B) = \mu(A)$ for $B \subseteq N$ with $\mu(N) = 0$ .

Step 1: The completion adds all subsets of $\mu$ -null sets to $\mathcal{F}$ . This ensures that subsets of measure-zero sets are measurable.

Step 2: The Lebesgue measure is the completion of the Borel measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ . The Borel σ-algebra is generated by open sets, and there are Lebesgue measurable sets that are not Borel sets. The Lebesgue σ-algebra consists of all sets of the form $A \cup B$ with $A$ a Borel set and $B$ a subset of a Borel null set.

Step 3: Why it matters:

Without completion, subsets of null sets might not be measurable
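In a complete measure space, any function that equals a measurable function almost everywhere is itself measurable, which keeps "almost everywhere" statements free of measurability caveats
In probability, completing the probability space ensures that every subset of a probability-zero event is itself an event with probability 0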

Step 4: Example: The Cantor set has Lebesgue measure 0. Any subset of the Cantor set has Lebesgue measure 0 in the completed measure space, even though it may not be a Borel set.

Result: Completion ensures that measure theory is "maximally inclusive"—all subsets of negligible sets are automatically measurable. This is important for rigorous statements about "almost everywhere" convergence and for working with completions of probability spaces.

📝Problem 11: Product Measure Construction

Problem: Construct the product measure on $[0,1]^2$ from Lebesgue measure on $[0,1]$ .

Solution:

Step 1: Start with the product σ-algebra $\mathcal{B}([0,1]) \otimes \mathcal{B}([0,1]) = \mathcal{B}([0,1]^2)$ .

Step 2: Define the product measure $\mu \times \mu$ on rectangles: $(\mu \times \mu)(A \times B) = \mu(A) \cdot \mu(B)$ for $A, B \in \mathcal{B}([0,1])$ .

Step 3: By Carathéodory's extension theorem, this extends uniquely to a measure on $\mathcal{B}([0,1]^2)$ .

Step 4: The product measure on $[0,1]^2$ is just the 2-dimensional Lebesgue measure $\lambda_2$ .

Step 5: Verification via Tonelli (Fubini's theorem for non-negative functions, applied to $\mathbf{1}_A$ ): For any $A \in \mathcal{B}([0,1]^2)$ :

\lambda_2(A) = \int_{[0,1]} \lambda(\{y : (x,y) \in A\}) \, dx = \int_{[0,1]} \lambda(A_x) \, dx

where $A_x = \{y : (x,y) \in A\}$ is the $x$ -section of $A$ .

Result: The product of 1-dimensional Lebesgue measure with itself gives 2-dimensional Lebesgue measure. This generalizes: $\lambda_n = \lambda_1 \times \cdots \times \lambda_1$ ( $n$ times) gives $n$ -dimensional Lebesgue measure.
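
The section formula $\lambda_2(A) = \int \lambda(A_x) \, dx$ can be tried on a concrete set. The sketch below assumes $A$ is the quarter disc $\{(x,y) \in [0,1]^2 : x^2 + y^2 \leq 1\}$ , an example chosen here for illustration, whose sections have length $\sqrt{1 - x^2}$ and whose area is $\pi/4$ .

```python
import math

def section_length(x):
    """Lebesgue measure of the x-section {y in [0, 1] : x^2 + y^2 <= 1}."""
    return math.sqrt(max(0.0, 1.0 - x * x))

def area_by_sections(steps=200_000):
    """Midpoint-rule integral of the section lengths over x in [0, 1]."""
    h = 1.0 / steps
    return h * sum(section_length((i + 0.5) * h) for i in range(steps))

print(area_by_sections(), math.pi / 4.0)
```

Integrating one-dimensional section lengths recovers the two-dimensional measure of the set.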

📝Problem 12: Tonelli's Theorem

Problem: State Tonelli's theorem and explain how it differs from Fubini's theorem.

Solution:

Statement: Let $(X, \mathcal{A}, \mu)$ and $(Y, \mathcal{B}, \nu)$ be σ-finite measure spaces. If $f: X \times Y \to [0, \infty]$ is measurable (non-negative), then:

The function $x \mapsto \int_Y f(x,y) \, d\nu(y)$ is measurable on $X$
$\int_{X \times Y} f \, d(\mu \times \nu) = \int_X \left(\int_Y f(x,y) \, d\nu(y)\right) d\mu(x) = \int_Y \left(\int_X f(x,y) \, d\mu(x)\right) d\nu(y)$

Key difference from Fubini: Tonelli requires $f \geq 0$ (non-negative) but does NOT require integrability. Fubini requires integrability but allows signed functions. For non-negative functions, you can always swap the order of integration (Tonelli). For signed functions, you must check integrability first (Fubini).

Example: For $f(x,y) = e^{-xy} \sin(x)$ on $[0,\infty) \times [0,\infty)$ :

Tonelli applies to $|f|$ to check integrability
If $\int \int |f| < \infty$ , then Fubini applies and we can swap

Result: Tonelli's theorem is often used as a stepping stone: first apply Tonelli to $|f|$ to verify integrability, then apply Fubini to $f$ to swap integration order.

Common Mistakes

Mistake	Correct Approach
Assuming pointwise convergence implies convergence of integrals	Need dominated convergence (DCT) or monotone convergence (MCT)
Confusing "almost everywhere" with "everywhere"	Sets of measure zero can be ignored in Lebesgue integration
Forgetting Fubini's theorem requires integrability	Check $\int \int |f| \, d\mu \, d\nu < \infty$ before swapping integration order
Assuming the Riemann and Lebesgue integrals always agree	They agree for Riemann-integrable functions, but Lebesgue handles more cases
Confusing σ-algebra with topology	σ-algebras are closed under complement and countable union; topologies are closed under arbitrary union and finite intersection
Assuming every subset is measurable	There exist non-measurable sets (Vitali sets) requiring the Axiom of Choice
Forgetting that measures are only countably additive	Finite additivity is weaker; countable additivity enables limits

Connections to Machine Learning

ℹ️ Connections to Machine Learning

Measure theory is the mathematical foundation of probability and statistics, which underpin all of ML: (1) Probability distributions are measures on sample spaces; continuous densities exist by the Radon-Nikodym theorem. (2) Expectation is the Lebesgue integral $E[X] = \int X \, dP$ , enabling rigorous treatment of expectations over infinite spaces. (3) Convergence theorems justify interchange of limits in EM algorithms, variational inference, and stochastic gradient descent analysis. (4) Hypothesis testing uses likelihood ratios $\frac{dQ}{dP}$ (Radon-Nikodym derivatives). (5) Bayesian inference updates prior measures to posterior measures via Bayes' rule, which is a statement about conditional Radon-Nikodym derivatives. (6) Generalization bounds use concentration inequalities (Markov, Chebyshev, Hoeffding) that are consequences of measure-theoretic integration.

Exam/Interview Questions

Q1: State the three convergence theorems and explain when each applies.

Answer: (1) Monotone Convergence Theorem: If $0 \leq f_1 \leq f_2 \leq \cdots$ with $f_n \to f$ , then $\int f_n \to \int f$ . Requires monotone increase. (2) Fatou's Lemma: For non-negative $f_n$ , $\int \liminf f_n \leq \liminf \int f_n$ . Always holds but gives inequality. (3) Dominated Convergence Theorem: If $f_n \to f$ a.e. and $|f_n| \leq g$ with $g$ integrable, then $\int f_n \to \int f$ . Requires an integrable bound. MCT is used for increasing sequences, DCT for general convergence with domination, and Fatou for lower bounds.

Q2: Why is the Lebesgue integral preferred over the Riemann integral in probability theory?

Answer: The Lebesgue integral handles: (1) highly discontinuous functions (like indicator functions of rationals), (2) infinite-dimensional spaces (function spaces, sequence spaces), (3) general measures (not just length on $\mathbb{R}$ ), and (4) convergence theorems that justify limit-interchange operations essential in probability (law of large numbers, central limit theorem). Riemann integration cannot handle the Dirichlet function, and cannot be extended to general measure spaces.

Q3: What is a σ-algebra and why do we need it?

Answer: A σ-algebra $\mathcal{F}$ on $\Omega$ is a collection of subsets closed under complement and countable union. We need it because: (1) Not all subsets of $\mathbb{R}$ are Lebesgue measurable (Vitali sets). (2) The complement axiom ensures $P(A^c) = 1 - P(A)$ . (3) Countable union closure enables countable additivity: $P(\bigcup A_n) = \sum P(A_n)$ for disjoint $A_n$ , which is essential for taking limits of events. The Borel σ-algebra $\mathcal{B}(\mathbb{R})$ is generated by open sets.

Q4: Give an example where the Dominated Convergence Theorem does not apply and the conclusion fails.

Answer: Let $f_n = n \cdot \mathbf{1}_{(0, 1/n]}$ on $([0,1], \lambda)$ . Then $f_n \to 0$ pointwise, but $\int f_n = 1$ for every $n$ , so $\lim \int f_n = 1 \neq 0 = \int 0$ . The DCT does not apply because there is no integrable dominating function: any dominating $g$ must satisfy $g \geq \sup_n f_n$ , and $\sup_n f_n = n$ on $(1/(n+1), 1/n]$ , so $\int g \geq \sum_{n=1}^\infty n \left(\frac{1}{n} - \frac{1}{n+1}\right) = \sum_{n=1}^\infty \frac{1}{n+1} = \infty$ . This illustrates the necessity of the domination condition.

Q5: Explain the Radon-Nikodym theorem and its interpretation as a likelihood ratio.

Answer: If $Q \ll P$ (Q is absolutely continuous w.r.t. P) and both are σ-finite, then $\frac{dQ}{dP}$ exists such that $Q(A) = \int_A \frac{dQ}{dP} dP$ . In probability: if $P$ and $Q$ are probability measures on the same space with densities $p$ and $q$ with respect to a common reference measure, then $\frac{dQ}{dP} = \frac{q}{p}$ wherever $p > 0$ , which is the likelihood ratio. This is fundamental in: hypothesis testing (Neyman-Pearson lemma), importance sampling (reweighting samples), and KL divergence $D_{KL}(Q\|P) = E_Q[\log\frac{dQ}{dP}]$ .

Quick Reference

Concept	Formula	Key Insight
σ-Algebra	$\Omega \in \mathcal{F}$ , closed under complement and countable union	Domain of measurable sets
Measure	$\mu(\emptyset) = 0$ , countably additive	Assigns "size" to sets
Probability Measure	$P(\Omega) = 1$	Normalized measure
Lebesgue Integral	$\int f \, d\mu = \sup\{\int s \, d\mu : 0 \leq s \leq f, s \text{ simple}\}$	Integrates via range slicing
MCT	$f_n \uparrow f \implies \int f_n \to \int f$	Swap limit and integral for increasing sequences
Fatou's Lemma	$\int \liminf f_n \leq \liminf \int f_n$	Lower bound on limit of integrals
DCT	$|f_n| \leq g \in L^1, f_n \to f \implies \int f_n \to \int f$	Swap limit with domination
Fubini's Theorem	$\int_{X \times Y} f = \int_X \int_Y f = \int_Y \int_X f$	Swap order of integration
Radon-Nikodym	$\frac{d\nu}{d\mu}$ with $\nu(A) = \int_A \frac{d\nu}{d\mu} d\mu$	Density of one measure w.r.t. another
Markov's Inequality	$P(X \geq a) \leq E[X]/a$	Bound tail probabilities
Chebyshev's Inequality	$P(|X - \mu| \geq k\sigma) \leq 1/k^2$	Bound deviations from mean

Cross-References

096-advanced-tensor-calculus — Integration of tensor fields requires measure-theoretic foundations
097-advanced-differential-geometry — Integration on manifolds uses differential forms and measures
098-advanced-functional-analysis — $L^p$ spaces are Banach/Hilbert spaces defined via Lebesgue integration
100-advanced-topological — Borel σ-algebras connect measure theory to topology
Probability Theory: All probability axioms are measure-theoretic statements