Why It Matters
💡 Why It Matters
Expectation and variance are the cornerstones of probability theory and statistical inference. The expected value tells you the long-run average outcome of a random variable, while variance quantifies uncertainty around that average. In machine learning, these quantities underpin everything from loss function design (expected risk minimization) to model evaluation (mean squared error, cross-entropy). Understanding them is essential for building robust, well-calibrated models. In reinforcement learning, agents maximize expected cumulative reward. In finance, portfolio optimization balances expected return against variance (risk). In deep learning, batch normalization and dropout regularization both exploit properties of expectation and variance to stabilize training. Without a firm grasp of these concepts, you cannot reason properly about uncertainty, make optimal decisions under risk, or debug statistical pipelines.
Expected Value
Definition: Expected Value (Discrete Case)
For a discrete random variable $X$ taking values $x_1, x_2, \ldots$ with probabilities $P(X = x_i)$, the expected value (or mean) is defined as:
$$E[X] = \sum_i x_i \, P(X = x_i)$$
provided the sum converges absolutely, i.e., $\sum_i |x_i| \, P(X = x_i) < \infty$. The expected value represents the "center of mass" of the probability distribution: the value you would obtain on average if you repeated the random experiment infinitely many times.
Expected Value — Discrete
$$E[X] = \sum_i x_i \, p(x_i)$$
Here,
- $E[X]$ = Expected value of random variable $X$
- $x_i$ = Possible values $X$ can take
- $p(x_i) = P(X = x_i)$ = Probability mass function (PMF) at $x_i$
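Definition: Expected Value (Continuous Case)
For a continuous random variable $X$ with probability density function $f(x)$, the expected value is:
$$E[X] = \int_{-\infty}^{\infty} x \, f(x) \, dx$$
provided the integral converges absolutely, i.e., $\int_{-\infty}^{\infty} |x| \, f(x) \, dx < \infty$. For distributions with heavy tails (e.g., the Cauchy distribution), the expected value may not exist.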
Expected Value — Continuous
$$E[X] = \int_{-\infty}^{\infty} x \, f(x) \, dx$$
Here,
- $E[X]$ = Expected value of continuous random variable $X$
- $f(x)$ = Probability density function (PDF)
- $x$ = Integration variable
📝Expected Value Examples
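Example 1 (Discrete): Fair six-sided die. $E[X] = \sum_{k=1}^{6} k \cdot \frac{1}{6} = \frac{21}{6} = 3.5$.
Example 2 (Continuous): Uniform distribution $X \sim U(a, b)$. $E[X] = \int_a^b \frac{x}{b - a} \, dx = \frac{a + b}{2}$.
Example 3 (Bernoulli): $X \sim \text{Bernoulli}(p)$. $E[X] = 1 \cdot p + 0 \cdot (1 - p) = p$.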
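The continuous definition can also be checked by numerical integration. The following is a minimal sketch, assuming SciPy is installed; the helper name `expectation_continuous` and the two densities (an exponential with rate 2 and a uniform on [2, 6]) are illustrative choices, not part of the original text.

```python
import numpy as np
from scipy import integrate, stats

def expectation_continuous(pdf, lower, upper):
    """Approximate E[X] as the integral of x * f(x) over [lower, upper]."""
    value, _abs_err = integrate.quad(lambda x: x * pdf(x), lower, upper)
    return value

# Exponential with rate 2 (scale = 1/2): theoretical mean is 0.5
mean_exp = expectation_continuous(stats.expon(scale=0.5).pdf, 0, np.inf)

# Uniform on [2, 6]: theoretical mean is (2 + 6) / 2 = 4
mean_unif = expectation_continuous(stats.uniform(loc=2, scale=4).pdf, 2, 6)

print(f"Exponential mean: {mean_exp:.6f}")
print(f"Uniform mean:     {mean_unif:.6f}")
```

Both results should agree with the closed-form means up to the integrator's tolerance.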
Properties of Expectation
Theorem: Properties of Expectation (Linearity)
The expectation operator is linear. For any random variables $X$, $Y$ and constants $a$, $b$, $c$:
- Constant rule: $E[c] = c$ for any constant $c$.
- Linearity: $E[aX + bY] = aE[X] + bE[Y]$.
- Iterated expectation: $E[X] = E[E[X \mid Y]]$ (law of total expectation).
- Monotonicity: If $X \le Y$ with probability 1, then $E[X] \le E[Y]$.
- Non-negativity: If $X \ge 0$ with probability 1, then $E[X] \ge 0$.
Important: Linearity holds even when $X$ and $Y$ are dependent. This is what makes expectation so powerful: no independence assumption is needed for $E[X + Y] = E[X] + E[Y]$.
⚠️ Linearity vs Independence
Linearity of expectation ($E[X + Y] = E[X] + E[Y]$) always holds. However, $E[XY] = E[X]E[Y]$ holds exactly when $X$ and $Y$ are uncorrelated; independence is sufficient for this but not necessary. Confusing these two facts is a common source of errors.
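A short simulation makes the contrast concrete. This is a sketch under assumed settings (a normal $X$ with mean 1 and the dependent variable $Y = X^2$, both chosen only for illustration): the sum rule holds to rounding error, while the product rule is off by the covariance.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=1.0, size=200_000)
y = x**2  # Y is a function of X, so X and Y are strongly dependent

# Linearity: E[X + Y] matches E[X] + E[Y] up to floating-point rounding
lhs_sum = np.mean(x + y)
rhs_sum = np.mean(x) + np.mean(y)

# Product rule fails: E[XY] - E[X]E[Y] is the covariance, which is nonzero here
lhs_prod = np.mean(x * y)
rhs_prod = np.mean(x) * np.mean(y)

print(f"E[X+Y] = {lhs_sum:.4f}   E[X]+E[Y] = {rhs_sum:.4f}")
print(f"E[XY]  = {lhs_prod:.4f}   E[X]E[Y]  = {rhs_prod:.4f}")
```

For this choice of $X$, theory gives $E[XY] = E[X^3] = 4$ and $E[X]E[Y] = 2$, so the gap should be close to 2.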
Variance
Definition: Variance
The variance of a random variable $X$ measures the spread of its distribution around the mean $\mu = E[X]$. It is defined as:
$$\text{Var}(X) = E\left[(X - \mu)^2\right]$$
Variance is always non-negative. $\text{Var}(X) = 0$ if and only if $X$ is a constant with probability 1. The units of variance are the square of the units of $X$, which is why the standard deviation $\sigma = \sqrt{\text{Var}(X)}$ is often more interpretable.
Variance — Computational Formula
$$\text{Var}(X) = E[X^2] - (E[X])^2$$
Here,
- $\text{Var}(X)$ = Variance of random variable $X$
- $E[X^2]$ = Second moment of $X$
- $(E[X])^2$ = Square of the first moment (mean)
Variance — Definition Form
$$\text{Var}(X) = E\left[(X - \mu)^2\right]$$
Here,
- $\mu$ = Mean (expected value) of $X$
- $(X - \mu)^2$ = Squared deviation from the mean
💡 Why the Computational Formula?
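The formula $\text{Var}(X) = E[X^2] - (E[X])^2$ is called the computational formula because it often simplifies hand calculations. To find the variance, you compute $E[X^2]$ and $E[X]$ separately, then subtract. This avoids computing deviations from the mean for each outcome. However, the definition form is conceptually clearer and, in floating-point arithmetic, more numerically stable: when the mean is large relative to the spread, $E[X^2]$ and $(E[X])^2$ are nearly equal and their difference loses precision (catastrophic cancellation).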
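The stability point can be seen on a tiny data set. The sketch below uses four made-up values shifted by a large offset, so the spread is small relative to the mean; the true population variance is 22.5.

```python
import numpy as np

# Four values with a huge common offset; the true population variance is 22.5
x = np.array([4.0, 7.0, 13.0, 16.0]) + 1e9

# Computational formula: subtracts two numbers near 1e18, so the small
# difference is lost to float64 rounding
var_computational = np.mean(x**2) - np.mean(x)**2

# Definition form: center first, then average the squared deviations
var_definition = np.mean((x - np.mean(x))**2)

print(f"computational formula: {var_computational}")
print(f"definition form:       {var_definition}")
```

In float64 the definition form recovers 22.5, while the computational formula returns a value dominated by rounding error. Centering the data first, or using a streaming method such as Welford's algorithm, avoids the problem.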
📝Variance of a Fair Die
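For a fair six-sided die, $E[X] = 3.5$ and $E[X^2] = \frac{1 + 4 + 9 + 16 + 25 + 36}{6} = \frac{91}{6}$, so $\text{Var}(X) = \frac{91}{6} - 3.5^2 = \frac{35}{12} \approx 2.92$.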
Properties of Variance
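Theorem: Properties of Variance
For any random variables $X$, $Y$ and constants $a$, $b$:
- Scaling: $\text{Var}(aX + b) = a^2 \, \text{Var}(X)$.
  - Adding a constant $b$ shifts the mean but does not change the spread.
  - Multiplying by $a$ scales the variance by $a^2$.
- Variance of a sum: $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\,\text{Cov}(X, Y)$.
- Independent case: If $X$ and $Y$ are independent, $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)$.
- Generalized sum: $\text{Var}\left(\sum_i a_i X_i\right) = \sum_i a_i^2 \, \text{Var}(X_i) + 2 \sum_{i < j} a_i a_j \, \text{Cov}(X_i, X_j)$.
- Non-negativity: $\text{Var}(X) \ge 0$ for all $X$, with equality if and only if $X$ is a constant a.s.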
⚠️ Variance of a Sum Is NOT the Sum of Variances
A very common mistake is assuming $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)$ always holds. This is true only when $X$ and $Y$ are uncorrelated ($\text{Cov}(X, Y) = 0$). For correlated variables, you must include the covariance term $2\,\text{Cov}(X, Y)$. In deep learning, this matters when analyzing gradient noise across correlated mini-batches.
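The role of the covariance term can be checked on simulated correlated data. This is a sketch with an assumed covariance matrix (variances 1 and 4, covariance 1.2); the numbers are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)
# Correlated pair with Var(X) = 1, Var(Y) = 4, Cov(X, Y) = 1.2 (illustrative)
cov_matrix = np.array([[1.0, 1.2],
                       [1.2, 4.0]])
x, y = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov_matrix, size=200_000).T

var_sum = np.var(x + y)
naive = np.var(x) + np.var(y)              # ignores the covariance term
sample_cov = np.cov(x, y, ddof=0)[0, 1]    # ddof=0 to match np.var's default
corrected = naive + 2 * sample_cov         # includes 2 Cov(X, Y)

print(f"Var(X+Y)                 = {var_sum:.4f}")
print(f"Var(X) + Var(Y)          = {naive:.4f}")
print(f"... + 2 Cov(X, Y)        = {corrected:.4f}")
```

The corrected value matches `var_sum` to rounding error, while the naive sum is short by roughly $2 \times 1.2 = 2.4$.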
Standard Deviation
Standard Deviation
$$\sigma_X = \sqrt{\text{Var}(X)}$$
Here,
- $\sigma_X$ = Standard deviation of $X$
- $\text{Var}(X)$ = Variance of $X$
Definition: Why Standard Deviation?
The standard deviation $\sigma = \sqrt{\text{Var}(X)}$ is the square root of the variance. Its key advantage is that it shares the same units as the original random variable $X$, making it directly interpretable. For a normal distribution, approximately 68% of observations fall within $\mu \pm \sigma$, 95% within $\mu \pm 2\sigma$, and 99.7% within $\mu \pm 3\sigma$ (the 68-95-99.7 rule). In machine learning, we often report mean $\pm$ standard deviation to convey both the average performance and its variability.
📝Standard Deviation in Practice
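If a model's test accuracy across runs has mean $\mu$ and standard deviation $\sigma$, and the run-to-run variation is roughly normal, then approximately:
- 68% of runs achieve accuracy in $[\mu - \sigma, \mu + \sigma]$
- 95% of runs achieve accuracy in $[\mu - 2\sigma, \mu + 2\sigma]$
This tells you the model is reasonably stable when $\sigma$ is small relative to $\mu$.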
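The coverage percentages quoted for the normal distribution can be reproduced from the normal CDF. A minimal sketch, assuming SciPy is available:

```python
from scipy import stats

# P(|Z| <= k) for a standard normal Z, for k = 1, 2, 3
coverage = {k: stats.norm.cdf(k) - stats.norm.cdf(-k) for k in (1, 2, 3)}

for k, p in coverage.items():
    print(f"within {k} sigma: {p:.4f}")
```

These figures rely on normality. For other distributions the coverage of $\mu \pm k\sigma$ can differ substantially.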
Moments
Definition: Raw Moments
The $n$-th raw moment of a random variable $X$ is:
$$\mu_n' = E[X^n]$$
The first raw moment is the mean: $\mu_1' = E[X] = \mu$. The second raw moment is $\mu_2' = E[X^2]$, which appears in the computational variance formula. Higher raw moments capture increasingly detailed information about the shape of the distribution.
Definition: Central Moments
The $n$-th central moment of a random variable $X$ with mean $\mu$ is:
$$\mu_n = E\left[(X - \mu)^n\right]$$
Central moments are invariant to shifts in the distribution (adding a constant to $X$ does not change central moments). The second central moment is the variance: $\mu_2 = \text{Var}(X) = \sigma^2$.
Moments — Summary Table
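$$\mu_n' = E[X^n], \qquad \mu_n = E\left[(X - \mu)^n\right]$$
Here,
- $\mu_n'$ = n-th raw moment
- $\mu_n$ = n-th central moment
- Skewness = $\gamma_1 = \mu_3 / \sigma^3$ (measures asymmetry)
- Kurtosis = $\kappa = \mu_4 / \sigma^4$ (measures tail heaviness)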
💡 Skewness and Kurtosis
Skewness ($\gamma_1$) measures the asymmetry of a distribution. Positive skew means a longer right tail. Kurtosis ($\kappa$) measures the heaviness of tails. The normal distribution has kurtosis = 3 (excess kurtosis = 0). Leptokurtic distributions (kurtosis > 3) have heavier tails, which matters in risk management: extreme events are more likely than the normal model predicts.
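Skewness and kurtosis can be computed directly from sample central moments. The sketch below uses an Exponential(1) sample as an illustrative skewed distribution (theoretical skewness 2, kurtosis 9); the helper `central_moment` is a hypothetical name introduced here.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.exponential(scale=1.0, size=500_000)  # Exponential(1): skewness 2, kurtosis 9

def central_moment(samples, n):
    """n-th sample central moment: mean of (x - mean)^n."""
    return np.mean((samples - samples.mean())**n)

sigma = np.sqrt(central_moment(x, 2))
skewness = central_moment(x, 3) / sigma**3
kurt = central_moment(x, 4) / sigma**4  # plain (non-excess) kurtosis

print(f"skewness: {skewness:.3f}  (scipy: {stats.skew(x):.3f})")
print(f"kurtosis: {kurt:.3f}  (scipy: {stats.kurtosis(x, fisher=False):.3f})")
```

Note that `scipy.stats.kurtosis` returns excess kurtosis by default; passing `fisher=False` gives $\mu_4 / \sigma^4$ as defined above.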
Moment Generating Function
Moment Generating Function (MGF)
$$M_X(t) = E\left[e^{tX}\right]$$
Here,
- $M_X(t)$ = Moment generating function of $X$ evaluated at $t$
- $t$ = Real parameter (near 0)
- $e^{tX}$ = Exponential transform of $X$
Definition: Why the MGF Matters
The moment generating function $M_X(t)$ uniquely determines the distribution of $X$ (when it exists in a neighborhood of $t = 0$). Its key properties:
- Moment extraction: $M_X^{(n)}(0) = E[X^n]$, i.e., the $n$-th derivative at $t = 0$ gives the $n$-th raw moment.
- Sum of independent variables: If $X$ and $Y$ are independent, $M_{X+Y}(t) = M_X(t) \, M_Y(t)$. This makes it easy to find the distribution of sums.
- Uniqueness: If $M_X(t) = M_Y(t)$ for all $t$ in a neighborhood of 0, then $X$ and $Y$ have the same distribution.
The MGF is related to the Laplace transform. The characteristic function $\varphi_X(t) = E[e^{itX}]$ always exists (even when the MGF does not) and serves a similar role using complex exponentials.
📝MGF of the Normal Distribution
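If $X \sim N(\mu, \sigma^2)$, then:
$$M_X(t) = \exp\left(\mu t + \tfrac{1}{2} \sigma^2 t^2\right)$$
Taking derivatives and evaluating at $t = 0$ recovers all moments. For instance, $M_X'(0) = \mu$ and $M_X''(0) = \mu^2 + \sigma^2$, confirming $\text{Var}(X) = M_X''(0) - (M_X'(0))^2 = \sigma^2$.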
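Moment extraction can be checked numerically by differentiating the normal MGF with central finite differences. This is a rough sketch, not a general-purpose method: the parameters $\mu = 5$, $\sigma = 2$ and the step size `h` are assumptions chosen for illustration.

```python
import numpy as np

def mgf_normal(t, mu, sigma):
    """MGF of N(mu, sigma^2): exp(mu*t + sigma^2 * t^2 / 2)."""
    return np.exp(mu * t + 0.5 * sigma**2 * t**2)

mu, sigma, h = 5.0, 2.0, 1e-3

# Central finite differences approximate M'(0) and M''(0)
first_moment = (mgf_normal(h, mu, sigma) - mgf_normal(-h, mu, sigma)) / (2 * h)
second_moment = (mgf_normal(h, mu, sigma) - 2 * mgf_normal(0.0, mu, sigma)
                 + mgf_normal(-h, mu, sigma)) / h**2
variance = second_moment - first_moment**2

print(f"M'(0)  ~ {first_moment:.4f}   (E[X] = mu = {mu})")
print(f"M''(0) ~ {second_moment:.4f}  (E[X^2] = mu^2 + sigma^2 = {mu**2 + sigma**2})")
print(f"Var(X) ~ {variance:.4f}")
```

Finite differences carry truncation error of order $h^2$, so the printed values agree with theory only approximately.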
Python Implementation
💡 Python for Expectation and Variance
Python's numpy and scipy libraries provide efficient tools for computing and verifying theoretical expectations, variances, and moments.
import numpy as np
from scipy.stats import skew, kurtosis

# === Theoretical Values ===
# Normal distribution: mu=5, sigma=2
mu, sigma = 5, 2
print(f"E[X] = {mu}")
print(f"Var(X) = {sigma**2}")

# === Empirical Verification via Sampling ===
np.random.seed(42)
samples = np.random.normal(mu, sigma, size=100_000)
print(f"Sample mean: {samples.mean():.4f}")
# ddof=1 gives the unbiased sample variance (NumPy's default ddof=0 divides by n)
print(f"Sample variance: {samples.var(ddof=1):.4f}")
print(f"Sample std dev: {samples.std(ddof=1):.4f}")

# === Discrete Distribution Moments ===
# Fair die: computational formula Var(X) = E[X^2] - (E[X])^2
die = np.arange(1, 7)
prob = np.ones(6) / 6
E_die = np.sum(die * prob)
E_die2 = np.sum(die**2 * prob)
Var_die = E_die2 - E_die**2
print(f"E[die] = {E_die:.4f}, Var(die) = {Var_die:.4f}")

# === Custom Random Variable ===
def compute_expectation(values, probs):
    """Compute E[X] for a discrete random variable."""
    return np.sum(values * probs)

def compute_variance(values, probs):
    """Compute Var(X) for a discrete random variable (definition form)."""
    mean = compute_expectation(values, probs)
    return compute_expectation((values - mean)**2, probs)

values = np.array([0, 1, 2, 3, 4])
probs = np.array([0.1, 0.2, 0.3, 0.25, 0.15])
print(f"E[X] = {compute_expectation(values, probs):.4f}")
print(f"Var(X) = {compute_variance(values, probs):.4f}")

# === Moment Generating Function ===
def mgf_normal(t, mu, sigma):
    """MGF of N(mu, sigma^2)."""
    return np.exp(mu * t + 0.5 * sigma**2 * t**2)

t = 0.1
print(f"M_X({t}) = {mgf_normal(t, mu, sigma):.6f}")

# === Empirical MGF ===
# Sample average of exp(t * X) estimates E[exp(tX)]
empirical_mgf = np.mean(np.exp(samples * t))
print(f"Empirical M_X({t}) = {empirical_mgf:.6f}")

# === Higher Moments with Scipy ===
# scipy's kurtosis returns excess kurtosis by default (0 for a normal distribution)
print(f"Skewness: {skew(samples):.4f}")
print(f"Excess Kurtosis: {kurtosis(samples):.4f}")
Applications in AI/ML
💡 Why ML Engineers Care About Moments
Expectation and variance are not abstract math — they directly inform how we design, train, and evaluate machine learning systems.
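Definition: Expected Risk Minimization
In supervised learning, the population risk (expected loss) of a predictor $f$ under a loss function $L$ is:
$$R(f) = E_{(x, y) \sim P}\left[L(f(x), y)\right]$$
We approximate this with the empirical risk (average loss over $n$ training examples):
$$\hat{R}_n(f) = \frac{1}{n} \sum_{i=1}^{n} L(f(x_i), y_i)$$
By the law of large numbers, $\hat{R}_n(f) \to R(f)$ as $n \to \infty$ for a fixed $f$. The variance of the loss estimator tells us how much our risk estimate fluctuates with different training samples; high variance indicates the estimate is unreliable.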
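The convergence of empirical risk to population risk can be illustrated with a toy problem. This sketch assumes squared-error loss, targets $Y \sim N(0, 1)$, and a fixed constant predictor of 0.5, so the population risk is $\text{Var}(Y) + 0.5^2 = 1.25$; all of these are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

def empirical_risk(prediction, y):
    """Average squared-error loss of a constant predictor over the sample y."""
    return np.mean((y - prediction)**2)

prediction = 0.5                        # fixed constant predictor
population_risk = 1.0 + prediction**2   # E[(Y - 0.5)^2] for Y ~ N(0, 1)

# Empirical risk on independent samples of increasing size
risks = {n: empirical_risk(prediction, rng.normal(size=n))
         for n in (100, 10_000, 1_000_000)}

for n, r in risks.items():
    print(f"n = {n:>9,}: empirical risk = {r:.4f}  (population risk = {population_risk})")
```

Small samples give noisy risk estimates; the estimate tightens around 1.25 as $n$ grows.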
Definition: Gradient Variance in SGD
Stochastic gradient descent (SGD) approximates the true gradient with a mini-batch estimate $\hat{g}_B$. The variance of this estimate directly affects convergence speed:
$$\text{Var}(\hat{g}_B) = \frac{\sigma_g^2}{B}$$
where $B$ is the batch size and $\sigma_g^2$ is the variance of a single-example gradient, assuming examples are sampled independently. Doubling the batch size halves the gradient variance. This is why larger batches produce smoother training curves, though not always better generalization.
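The inverse scaling with batch size can be simulated without a real model. In this sketch, i.i.d. normal draws stand in for per-example gradients of a single parameter (an assumption; real per-example gradients are neither normal nor exactly independent).

```python
import numpy as np

rng = np.random.default_rng(4)
# Stand-in for per-example gradients of one parameter: i.i.d. with variance 9
per_example_grads = rng.normal(loc=1.0, scale=3.0, size=1_024_000)

def minibatch_gradient_variance(grads, batch_size):
    """Variance of mini-batch mean gradients across non-overlapping batches."""
    batches = grads.reshape(-1, batch_size)
    return batches.mean(axis=1).var()

var_b32 = minibatch_gradient_variance(per_example_grads, 32)
var_b64 = minibatch_gradient_variance(per_example_grads, 64)

print(f"B = 32: variance = {var_b32:.4f}  (theory 9/32 = {9 / 32:.4f})")
print(f"B = 64: variance = {var_b64:.4f}  (theory 9/64 = {9 / 64:.4f})")
print(f"ratio: {var_b32 / var_b64:.3f}")
```

The ratio should be close to 2, matching the claim that doubling the batch size halves the variance.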
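Definition: Bias-Variance Tradeoff
The expected prediction error at a point $x$, under squared-error loss, decomposes as:
$$E\left[(y - \hat{f}(x))^2\right] = \text{Bias}\left[\hat{f}(x)\right]^2 + \text{Var}\left[\hat{f}(x)\right] + \sigma^2$$
- Bias ($\text{Bias}[\hat{f}(x)]^2$): error from wrong assumptions (underfitting).
- Variance ($\text{Var}[\hat{f}(x)]$): error from sensitivity to training data (overfitting).
- Irreducible noise ($\sigma^2$): inherent data noise.
Complex models (deep neural networks) tend to have low bias but high variance. Regularization techniques (dropout, weight decay, early stopping) reduce variance at the cost of slightly increased bias.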
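The decomposition can be verified by simulation on a deliberately simple estimator. This sketch assumes a constant target $\theta$ observed with Gaussian noise and a shrunken sample mean as the predictor; `theta`, `noise_sd`, `n_train`, and `shrink` are illustrative values, not from the original text.

```python
import numpy as np

rng = np.random.default_rng(5)
theta, noise_sd, n_train, shrink = 2.0, 1.0, 10, 0.8  # illustrative values
n_trials = 200_000

# Each row is one training set; the estimator is a shrunken sample mean,
# which trades a little bias for lower variance
train = theta + noise_sd * rng.normal(size=(n_trials, n_train))
f_hat = shrink * train.mean(axis=1)

# Fresh test targets, independent of the training sets
y_test = theta + noise_sd * rng.normal(size=n_trials)

bias_sq = (f_hat.mean() - theta)**2
variance = f_hat.var()
noise = noise_sd**2
expected_error = np.mean((y_test - f_hat)**2)

print(f"bias^2 + variance + noise = {bias_sq + variance + noise:.4f}")
print(f"simulated expected error  = {expected_error:.4f}")
```

With these settings theory gives $\text{Bias}^2 = 0.16$, variance $= 0.064$, and noise $= 1$, so both printed numbers should be near 1.224.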
📝Mean-Variance Portfolio Optimization (Finance ML)
In portfolio optimization, a financial ML model estimates the expected return $E[R_p]$ and risk $\text{Var}(R_p)$ of a portfolio. A risk-averse investor maximizes:
$$E[R_p] - \lambda \, \text{Var}(R_p)$$
where $\lambda$ is the risk aversion parameter. This mean-variance framework directly uses expectation and variance as the two fundamental quantities.
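A two-asset example shows how the trade-off plays out. This is only a sketch: the objective is assumed to be $E[R_p] - \lambda \, \text{Var}(R_p)$ with $E[R_p] = w^\top \mu$ and $\text{Var}(R_p) = w^\top \Sigma w$, and the expected returns, covariance matrix, and risk aversion value are hypothetical.

```python
import numpy as np

# Hypothetical inputs: expected returns, covariance of returns, risk aversion
mu = np.array([0.08, 0.12])
cov = np.array([[0.04, 0.01],
                [0.01, 0.09]])
risk_aversion = 3.0

def mean_variance_objective(weights, mu, cov, risk_aversion):
    """E[R_p] - lambda * Var(R_p), with E[R_p] = w.mu and Var(R_p) = w' Sigma w."""
    expected_return = weights @ mu
    variance = weights @ cov @ weights
    return expected_return - risk_aversion * variance

# Scan the weight on asset 0; the remainder goes to asset 1
grid = np.linspace(0.0, 1.0, 101)
scores = [mean_variance_objective(np.array([w, 1 - w]), mu, cov, risk_aversion)
          for w in grid]
best_w = grid[int(np.argmax(scores))]

print(f"best weight on asset 0: {best_w:.2f}")
```

A grid scan keeps the example transparent. For these inputs the objective is a quadratic in the weight, maximized near $w = 2/3$.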
Common Mistakes
| Mistake | Why It's Wrong | Correct Approach |
|---|---|---|
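| $E[XY] = E[X]E[Y]$ always | Only true for uncorrelated variables (independence is sufficient) | $E[XY] = E[X]E[Y] + \text{Cov}(X, Y)$ |
| $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)$ always | Only true for uncorrelated variables (independence is sufficient) | Compute directly or use covariance: $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\,\text{Cov}(X, Y)$ |
| Variance is in the same units as $X$ | Variance has units of $X^2$ | Use standard deviation for interpretable units |
| Expected value always exists | Some distributions (e.g., Cauchy) have no finite mean | Check convergence before using expectation-based results |
| "Variance = standard deviation" | They are different quantities; $\sigma = \sqrt{\text{Var}(X)}$ | $\text{Var}(X) = \sigma^2$, $\text{SD}(X) = \sigma$ |
| Forgetting $a^2$ in $\text{Var}(aX)$ | $\text{Var}(aX) = a^2 \, \text{Var}(X)$, not $a \, \text{Var}(X)$ | The factor is $a^2$ because variance involves squaring deviations |
| Confusing moments with central moments | Raw moments $E[X^n]$ and central moments $E[(X - \mu)^n]$ are different | Use the correct definition for skewness, kurtosis, etc. |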
Interview Questions
📝Question 1: Expectation of a Function
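Q: If $X \sim \text{Poisson}(\lambda)$, what is $E[X^2]$?
A: $E[X^2] = \text{Var}(X) + (E[X])^2$. Using the fact that $E[X] = \lambda$ and $\text{Var}(X) = \lambda$ for a Poisson distribution, so $E[X^2] = \lambda + \lambda^2$. This technique is useful for computing higher moments.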
📝Question 2: Variance of a Sum
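Q: If $\text{Var}(X) = \sigma_X^2$, $\text{Var}(Y) = \sigma_Y^2$, and $\text{Cov}(X, Y) = c$, what is $\text{Var}(aX + bY + 5)$?
A: $\text{Var}(aX + bY + 5) = a^2 \sigma_X^2 + b^2 \sigma_Y^2 + 2ab\,c$. The constant 5 drops out, and the coefficient of the covariance term is $2ab$.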
📝Question 3: Linearity of Expectation
Q: A class of 30 students each flip a fair coin. What is the expected number of heads?
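A: Let $X_i$ be the indicator for student $i$ getting heads, so $E[X_i] = \frac{1}{2}$. By linearity, $E\left[\sum_{i=1}^{30} X_i\right] = \sum_{i=1}^{30} E[X_i] = 30 \cdot \frac{1}{2} = 15$. The flips happen to be independent here, but linearity does not need that: the same calculation works for dependent indicators.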
📝Question 4: Conditional Expectation
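Q: What is $E[X \mid X > 0]$ when $X \sim N(0, 1)$?
A: By symmetry of the standard normal, $P(X > 0) = \frac{1}{2}$. This uses the truncated normal distribution: $E[X \mid X > 0] = \frac{\phi(0)}{1 - \Phi(0)} = \frac{1/\sqrt{2\pi}}{1/2} = \sqrt{2/\pi} \approx 0.798$.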
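The conditional mean of a standard normal given that it is positive is $\sqrt{2/\pi}$, which a quick Monte Carlo check can confirm. A minimal sketch, with the sample size chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(6)
z = rng.standard_normal(1_000_000)

conditional_mean = z[z > 0].mean()   # Monte Carlo estimate of E[X | X > 0]
closed_form = np.sqrt(2 / np.pi)     # phi(0) / (1 - Phi(0))

print(f"simulated:   {conditional_mean:.4f}")
print(f"closed form: {closed_form:.4f}")
```

Conditioning is done by filtering the sample, which mirrors the definition $E[X \mid A] = E[X \cdot 1_A] / P(A)$.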
📝Question 5: Moment Generating Functions
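Q: If $X \sim N(\mu_1, \sigma_1^2)$ and $Y \sim N(\mu_2, \sigma_2^2)$ are independent, what is the distribution of $X + Y$?
A: $M_{X+Y}(t) = M_X(t) \, M_Y(t) = \exp\left((\mu_1 + \mu_2) t + \tfrac{1}{2} (\sigma_1^2 + \sigma_2^2) t^2\right)$, which is the MGF of $N(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$. By uniqueness of MGFs, $X + Y \sim N(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$.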
📝Question 6: Jensen's Inequality
Q: State Jensen's inequality and give an example.
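A: For a convex function $\varphi$ and random variable $X$: $\varphi(E[X]) \le E[\varphi(X)]$. Example: By convexity of $\varphi(x) = x^2$, $(E[X])^2 \le E[X^2]$, which implies $\text{Var}(X) = E[X^2] - (E[X])^2 \ge 0$. This is a fundamental inequality used in variational inference (ELBO derivation).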
Practice Problems
📝Problem 1: Expected Value of a Function
Let have PMF , , . Find , , and .
💡Solution
📝Problem 2: Linear Transformation
If and , find and .
💡Solution
Note: Adding 5 shifts the mean but does not affect variance.
📝Problem 3: Linearity with Dependent Variables
Let and be random variables with , , , , and . Find and .
💡Solution
The covariance term has a coefficient because the formula gives .
📝Problem 4: MGF Application
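If $M_X(t) = \frac{1}{1 - t}$ for $t < 1$, identify the distribution of $X$ and find $E[X]$ and $\text{Var}(X)$.
💡Solution
The MGF $M_X(t) = \frac{1}{1 - t}$ is the MGF of an Exponential(1) distribution.
$$M_X'(t) = \frac{1}{(1 - t)^2} \Rightarrow E[X] = M_X'(0) = 1, \qquad M_X''(t) = \frac{2}{(1 - t)^3} \Rightarrow E[X^2] = M_X''(0) = 2$$
$$\text{Var}(X) = E[X^2] - (E[X])^2 = 2 - 1 = 1$$
This confirms $E[X] = \frac{1}{\lambda}$ and $\text{Var}(X) = \frac{1}{\lambda^2}$ with $\lambda = 1$.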
📝Problem 5: Chebyshev's Inequality
A random variable has and . Use Chebyshev's inequality to bound .
💡Solution
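Chebyshev's inequality states: $P(|X - \mu| \ge k\sigma) \le \frac{1}{k^2}$ for any $k > 0$, where $\mu = E[X]$ and $\sigma^2 = \text{Var}(X)$.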
Here , so .
This holds for any distribution with the given mean and variance — no normality assumption needed.
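Because Chebyshev's inequality is distribution-free, it can be checked on a clearly non-normal sample. The sketch below uses an Exponential(1) sample and a few values of $k$; both are illustrative choices unrelated to the numbers in the problem.

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.exponential(scale=1.0, size=500_000)  # skewed, clearly non-normal sample
mu, sigma = x.mean(), x.std()

results = {}
for k in (1.5, 2.0, 3.0):
    tail = np.mean(np.abs(x - mu) >= k * sigma)  # empirical P(|X - mu| >= k*sigma)
    results[k] = (tail, 1 / k**2)
    print(f"k = {k}: empirical tail = {tail:.4f}, Chebyshev bound = {1 / k**2:.4f}")
```

The empirical tails sit well below the bound, which is typical: Chebyshev is a worst-case guarantee and is often loose for any particular distribution.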
Quick Reference
| Quantity | Formula | Python |
|---|---|---|
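| Expectation (discrete) | $E[X] = \sum_i x_i \, p(x_i)$ | np.sum(x * p) |
| Expectation (continuous) | $E[X] = \int x \, f(x) \, dx$ | np.trapezoid(x * pdf, x) (np.trapz before NumPy 2.0) |
| Variance | $\text{Var}(X) = E[X^2] - (E[X])^2$ | np.var(x) |
| Standard Deviation | $\sigma = \sqrt{\text{Var}(X)}$ | np.std(x) |
| $n$-th raw moment | $\mu_n' = E[X^n]$ | np.mean(x**n) |
| $n$-th central moment | $\mu_n = E[(X - \mu)^n]$ | np.mean((x - mu)**n) |
| Skewness | $\gamma_1 = \mu_3 / \sigma^3$ | scipy.stats.skew(x) |
| Kurtosis | $\kappa = \mu_4 / \sigma^4$ | scipy.stats.kurtosis(x, fisher=False) |
| MGF | $M_X(t) = E[e^{tX}]$ | np.mean(np.exp(x * t)) |
| Linearity of $E$ | $E[aX + bY] = aE[X] + bE[Y]$ | n/a |
| Variance scaling | $\text{Var}(aX + b) = a^2 \, \text{Var}(X)$ | n/a |
| Covariance rule | $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\,\text{Cov}(X, Y)$ | np.cov(x, y) |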
Cross-References
- Probability Distributions: 030-discrete-distributions — Bernoulli, Binomial, Poisson distributions and their moments
- Continuous Distributions: 031-continuous-distributions — Normal, Exponential, Uniform distributions
- Law of Large Numbers: 040-lln-clt — Why sample means converge to the expected value
- Central Limit Theorem: 040-lln-clt — Normal approximation for sums of random variables
- Covariance and Correlation: 038-probability-covariance — Joint distributions and dependence
- Bayesian Inference: 041-bayesian-inference — Posterior expectations and credible intervals
- Information Theory: 042-information-theory — Entropy, KL divergence, and their expectations