Expectation and Variance | ChatWhole Learn

Why It Matters

💡 Why It Matters

Expectation and variance are the cornerstones of probability theory and statistical inference. The expected value tells you the long-run average outcome of a random variable, while variance quantifies uncertainty around that average. In machine learning, these quantities underpin everything from loss function design (expected risk minimization) to model evaluation (mean squared error, cross-entropy). Understanding them is essential for building robust, well-calibrated models. In reinforcement learning, agents maximize expected cumulative reward. In finance, portfolio optimization balances expected return against variance (risk). In deep learning, batch normalization and dropout regularization both exploit properties of expectation and variance to stabilize training. Without a firm grasp of these concepts, you cannot reason properly about uncertainty, make optimal decisions under risk, or debug statistical pipelines.

Expected Value

Definition: Expected Value (Discrete Case)

For a discrete random variable $X$ taking values $x_1, x_2, \dots$ with probabilities $p(x_1), p(x_2), \dots$ , the expected value (or mean) is defined as:

E[X] = \sum_{i} x_i \cdot p(x_i)

provided the sum converges absolutely, i.e., $\sum_i |x_i| \cdot p(x_i) < \infty$ . The expected value represents the "center of mass" of the probability distribution — the value you would obtain on average if you repeated the random experiment infinitely many times.

Expected Value — Discrete

E[X] = \sum_{i=1}^{\infty} x_i \cdot P(X = x_i)

Here,

$E[X]$ = Expected value of random variable $X$
$x_i$ = Possible values $X$ can take
$P(X=x_i)$ = Probability mass function (PMF) at $x_i$

Definition: Expected Value (Continuous Case)

For a continuous random variable $X$ with probability density function $f(x)$ , the expected value is:

E[X] = \int_{-\infty}^{\infty} x \cdot f(x) \, dx

provided the integral converges absolutely. For distributions with heavy tails (e.g., Cauchy distribution), the expected value may not exist.

Expected Value — Continuous

E[X] = \int_{-\infty}^{\infty} x \, f(x) \, dx

Here,

$E[X]$ = Expected value of continuous random variable $X$
$f(x)$ = Probability density function (PDF)
$dx$ = Differential of the integration variable $x$

📝Expected Value Examples

Example 1 (Discrete): Fair six-sided die.

E[X] = \frac{1}{6}(1+2+3+4+5+6) = \frac{21}{6} = 3.5

Example 2 (Continuous): Uniform distribution $X \sim U(0,1)$ .

E[X] = \int_0^1 x \, dx = \frac{1}{2}

Example 3 (Bernoulli): $X \sim \text{Bernoulli}(p)$ .

E[X] = 0 \cdot (1-p) + 1 \cdot p = p
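The three examples above can be checked numerically straight from the definitions. The following is a minimal sketch assuming NumPy is available; the grid size and the value `p = 0.3` are illustrative choices, not part of the examples.

```python
import numpy as np

# Example 1: fair die, E[X] = sum_i x_i * p(x_i)
faces = np.arange(1, 7)
e_die = np.sum(faces * (1 / 6))

# Example 2: U(0, 1), E[X] = integral of x * f(x) dx with f(x) = 1 on [0, 1].
# Approximated with a midpoint Riemann sum on an illustrative grid.
n_cells = 10_000
midpoints = (np.arange(n_cells) + 0.5) / n_cells
e_uniform = np.sum(midpoints * 1.0) / n_cells

# Example 3: Bernoulli(p) with an illustrative p
p = 0.3
e_bernoulli = 0 * (1 - p) + 1 * p

print(e_die, e_uniform, e_bernoulli)
```

The die and Bernoulli cases are exact sums; the uniform case is a numerical approximation of the integral that should land very close to 1/2.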

Properties of Expectation

Theorem: Properties of Expectation (Linearity)

The expectation operator is linear. For any random variables $X, Y$ and constants $a, b$ :

Constant rule: $E[c] = c$ for any constant $c$ .
Linearity: $E[aX + bY] = aE[X] + bE[Y]$ .
Iterated expectation: $E[X] = E[E[X|Y]]$ (law of total expectation).
Monotonicity: If $X \leq Y$ with probability 1, then $E[X] \leq E[Y]$ .
Non-negativity: If $X \geq 0$ with probability 1, then $E[X] \geq 0$ .

Important: Linearity holds even when $X$ and $Y$ are dependent. This is what makes expectation so powerful — no independence assumption is needed for $E[X+Y] = E[X] + E[Y]$ .

⚠️ Linearity vs Independence

Linearity of expectation ( $E[X+Y] = E[X]+E[Y]$ ) always holds. In contrast, $E[XY] = E[X] \cdot E[Y]$ holds exactly when $X$ and $Y$ are uncorrelated, i.e., $\text{Cov}(X,Y) = 0$ . Independence is sufficient for this but not necessary. Confusing these two facts is a common source of errors.
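To see the contrast between the sum rule and the product rule on strongly dependent variables, here is a small simulation sketch. The choice $Y = X^2$ with $X \sim N(1, 4)$ is an illustrative assumption; for it, $E[XY] = E[X^3] = 13$ while $E[X]E[Y] = 1 \cdot 5 = 5$ .

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=200_000)
y = x**2  # y is a deterministic function of x, so the two are strongly dependent

# Linearity: both sides agree up to floating-point error,
# because the sample mean is itself a linear operation.
lhs_sum = np.mean(x + y)
rhs_sum = np.mean(x) + np.mean(y)

# Product rule: fails here because Cov(X, Y) != 0.
lhs_prod = np.mean(x * y)
rhs_prod = np.mean(x) * np.mean(y)

print(f"E[X+Y] = {lhs_sum:.4f}  vs  E[X]+E[Y] = {rhs_sum:.4f}")
print(f"E[XY]  = {lhs_prod:.4f}  vs  E[X]E[Y]  = {rhs_prod:.4f}")
```

The gap between the last two numbers is an estimate of $\text{Cov}(X, Y)$ .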

Variance

Definition: Variance

The variance of a random variable $X$ measures the spread of its distribution around the mean $\mu = E[X]$ . It is defined as:

\text{Var}(X) = E[(X - \mu)^2]

Variance is always non-negative. $\text{Var}(X) = 0$ if and only if $X$ is a constant with probability 1. The units of variance are the square of the units of $X$ , which is why the standard deviation is often more interpretable.

Variance — Computational Formula

\text{Var}(X) = E[X^2] - (E[X])^2

Here,

$\text{Var}(X)$ = Variance of random variable $X$
$E[X^2]$ = Second moment of $X$
$(E[X])^2$ = Square of the first moment (mean)

Variance — Definition Form

\text{Var}(X) = E[(X - E[X])^2]

Here,

$E[X]$ = Mean (expected value) of $X$
$(X - E[X])^2$ = Squared deviation from the mean

💡 Why the Computational Formula?

The formula $\text{Var}(X) = E[X^2] - (E[X])^2$ is called the computational formula because it often simplifies calculations. To find variance, you compute $E[X^2]$ and $(E[X])^2$ separately, then subtract. This avoids computing deviations from the mean for each outcome. However, the definition form $E[(X-\mu)^2]$ is conceptually clearer and more numerically stable in practice.
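The stability remark can be made concrete with data whose mean is huge relative to its spread. The sketch below is an illustration under assumed values (offset `1e9`, unit variance, float64 arithmetic): the computational formula subtracts two numbers near $10^{18}$ , where adjacent doubles are more than 100 apart, so the true variance of about 1 is lost.

```python
import numpy as np

rng = np.random.default_rng(0)
# Data with a huge mean relative to its spread (true variance is 1).
x = 1e9 + rng.normal(0.0, 1.0, size=100_000)

var_computational = np.mean(x**2) - np.mean(x)**2   # E[X^2] - (E[X])^2
var_definition = np.mean((x - np.mean(x))**2)       # E[(X - mu)^2]

print(f"computational formula: {var_computational}")
print(f"definition form:       {var_definition:.6f}")
```

The definition form subtracts the mean first, so it works with small deviations and stays close to 1.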

📝Variance of a Fair Die

For a fair six-sided die, $E[X] = 3.5$ .

E[X^2] = \frac{1}{6}(1^2 + 2^2 + 3^2 + 4^2 + 5^2 + 6^2) = \frac{91}{6}

\text{Var}(X) = \frac{91}{6} - \left(\frac{7}{2}\right)^2 = \frac{91}{6} - \frac{49}{4} = \frac{35}{12} \approx 2.917

Properties of Variance

Theorem: Properties of Variance

For any random variable $X$ and constants $a, b$ :

Scaling: $\text{Var}(aX + b) = a^2 \text{Var}(X)$ .
- Adding a constant shifts the mean but does not change the spread.
- Multiplying by $a$ scales the variance by $a^2$ .
Sum of variances: $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\text{Cov}(X, Y)$ .
Independent case: If $X$ and $Y$ are independent, $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)$ .
Generalized sum: $\text{Var}\!\left(\sum_{i=1}^n a_i X_i\right) = \sum_{i=1}^n \sum_{j=1}^n a_i a_j \text{Cov}(X_i, X_j)$ .
Non-negativity: $\text{Var}(X) \geq 0$ for all $X$ , with equality if and only if $X$ is a constant a.s.

⚠️ Variance of a Sum Is NOT the Sum of Variances

A very common mistake is assuming $\text{Var}(X+Y) = \text{Var}(X) + \text{Var}(Y)$ always holds. This is true only when $X$ and $Y$ are uncorrelated ( $\text{Cov}(X,Y) = 0$ ). For correlated variables, you must include the covariance term. In deep learning, this matters when analyzing gradient noise across correlated mini-batches.
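A short simulation makes the missing covariance term visible. This sketch assumes a bivariate normal pair with unit variances and $\text{Cov}(X,Y) = 0.8$ (illustrative values), so the true $\text{Var}(X+Y)$ is 3.6 rather than 2.

```python
import numpy as np

rng = np.random.default_rng(1)
# Illustrative covariance matrix: Var(X) = Var(Y) = 1, Cov(X, Y) = 0.8
cov = np.array([[1.0, 0.8],
                [0.8, 1.0]])
xy = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=200_000)
x, y = xy[:, 0], xy[:, 1]

var_sum = np.var(x + y)                              # what we want
naive = np.var(x) + np.var(y)                        # ignores the covariance
with_cov = naive + 2 * np.cov(x, y, ddof=0)[0, 1]    # full formula

print(f"Var(X+Y) = {var_sum:.4f}, naive = {naive:.4f}, with covariance = {with_cov:.4f}")
```

Passing `ddof=0` to `np.cov` matches the default normalization of `np.var`, so the full formula agrees with `var_sum` up to rounding.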

Standard Deviation

\sigma_X = \sqrt{\text{Var}(X)}

Here,

$\sigma_X$ = Standard deviation of $X$
$\text{Var}(X)$ = Variance of $X$

Why Standard Deviation?

The standard deviation $\sigma$ is the square root of the variance. Its key advantage is that it shares the same units as the original random variable $X$ , making it directly interpretable. For a normal distribution, approximately 68% of observations fall within $\mu \pm \sigma$ , 95% within $\mu \pm 2\sigma$ , and 99.7% within $\mu \pm 3\sigma$ (the 68-95-99.7 rule). In machine learning, we often report mean $\pm$ standard deviation to convey both the average performance and its variability.

📝Standard Deviation in Practice

If a model's test accuracy across repeated runs is approximately normally distributed with $\mu = 0.92$ and $\sigma = 0.03$ , then approximately:

68% of runs achieve accuracy in $[0.89, 0.95]$
95% of runs achieve accuracy in $[0.86, 0.98]$

This tells you the model is reasonably stable ( $\sigma$ is small relative to $\mu$ ).

Moments

Definition: Raw Moments

The $n$ -th raw moment of a random variable $X$ is:

\mu_n' = E[X^n]

The first raw moment is the mean: $\mu_1' = E[X]$ . The second raw moment is $E[X^2]$ , which appears in the computational variance formula. Higher raw moments capture increasingly detailed information about the shape of the distribution.

Definition: Central Moments

The $n$ -th central moment of a random variable $X$ with mean $\mu$ is:

\mu_n = E[(X - \mu)^n]

Central moments are invariant to shifts in the distribution (adding a constant to $X$ does not change central moments). The second central moment is the variance: $\mu_2 = \text{Var}(X)$ .

Moments — Summary Table

\begin{aligned} \mu_1' &= E[X] = \text{mean} \\ \mu_2 &= E[(X-\mu)^2] = \text{variance} \\ \mu_3 &= E[(X-\mu)^3] \text{ (skewness numerator)} \\ \mu_4 &= E[(X-\mu)^4] \text{ (kurtosis numerator)} \end{aligned}

Here,

$\mu_n'$ = n-th raw moment
$\mu_n$ = n-th central moment
$\gamma_1 = \mu_3 / \sigma^3$ = Skewness (measures asymmetry)
$\kappa = \mu_4 / \sigma^4$ = Kurtosis (measures tail heaviness)

💡 Skewness and Kurtosis

Skewness ( $\mu_3 / \sigma^3$ ) measures the asymmetry of a distribution. Positive skew means a longer right tail. Kurtosis ( $\mu_4 / \sigma^4$ ) measures the heaviness of tails. The normal distribution has kurtosis = 3 (excess kurtosis = 0). Leptokurtic distributions (kurtosis > 3) have heavier tails, which matters in risk management — extreme events are more likely than the normal model predicts.
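The standardized moments can be computed directly from their definitions. The helper name `standardized_moment` below is hypothetical, and the two sample distributions are illustrative: a standard normal (skewness 0, kurtosis 3) and an Exponential(1) (skewness 2, kurtosis 9).

```python
import numpy as np

def standardized_moment(x, n):
    """n-th central moment divided by sigma^n (population-style, ddof=0)."""
    x = np.asarray(x, dtype=float)
    mu = x.mean()
    sigma = x.std()
    return np.mean((x - mu) ** n) / sigma ** n

rng = np.random.default_rng(2)
normal_samples = rng.normal(size=200_000)
expo_samples = rng.exponential(scale=1.0, size=200_000)

for name, s in (("normal", normal_samples), ("exponential", expo_samples)):
    skewness = standardized_moment(s, 3)
    kurt = standardized_moment(s, 4)
    print(f"{name}: skewness={skewness:.3f}, kurtosis={kurt:.3f}, excess={kurt - 3:.3f}")
```

Fourth-moment estimates are noisy for heavy-tailed data, so the exponential kurtosis will wander around 9 more than the normal one wanders around 3.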

Moment Generating Function

Moment Generating Function (MGF)

M_X(t) = E[e^{tX}]

Here,

$M_X(t)$ = Moment generating function of $X$ evaluated at $t$
$t$ = Real parameter (near 0)
$e^{tX}$ = Exponential transform of $X$

Why the MGF Matters

The moment generating function $M_X(t) = E[e^{tX}]$ uniquely determines the distribution of $X$ (when it exists in a neighborhood of $t=0$ ). Its key properties:

Moment extraction: $M_X^{(n)}(0) = E[X^n]$ , i.e., the $n$ -th derivative at $t=0$ gives the $n$ -th raw moment.
Sum of independent variables: If $X$ and $Y$ are independent, $M_{X+Y}(t) = M_X(t) \cdot M_Y(t)$ . This makes it easy to find the distribution of sums.
Uniqueness: If $M_X(t) = M_Y(t)$ for all $t$ in a neighborhood of 0, then $X$ and $Y$ have the same distribution.

The MGF is related to the Laplace transform. The characteristic function $\phi_X(t) = E[e^{itX}]$ always exists (even when the MGF does not) and serves a similar role using complex exponentials.

📝MGF of the Normal Distribution

If $X \sim N(\mu, \sigma^2)$ , then:

M_X(t) = \exp\!\left(\mu t + \frac{\sigma^2 t^2}{2}\right)

Taking derivatives and evaluating at $t=0$ recovers all moments. For instance, $M_X'(0) = \mu$ and $M_X''(0) = \mu^2 + \sigma^2$ , confirming $E[X^2] = \mu^2 + \sigma^2$ .
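The two derivative values can be sanity-checked numerically with central finite differences. This is a rough sketch, assuming illustrative parameters $\mu = 5$ , $\sigma = 2$ and a step size `h = 1e-4`; for these, $M_X'(0) = 5$ and $M_X''(0) = 25 + 4 = 29$ .

```python
import numpy as np

def mgf_normal(t, mu, sigma):
    """MGF of N(mu, sigma^2): exp(mu*t + sigma^2 * t^2 / 2)."""
    return np.exp(mu * t + 0.5 * sigma**2 * t**2)

mu, sigma, h = 5.0, 2.0, 1e-4  # illustrative parameters and step size

# Central finite differences around t = 0
first = (mgf_normal(h, mu, sigma) - mgf_normal(-h, mu, sigma)) / (2 * h)
second = (mgf_normal(h, mu, sigma) - 2 * mgf_normal(0.0, mu, sigma)
          + mgf_normal(-h, mu, sigma)) / h**2

print(f"M'(0)  ~ {first:.6f}  (mu = {mu})")
print(f"M''(0) ~ {second:.6f}  (mu^2 + sigma^2 = {mu**2 + sigma**2})")
```

Finite differences are only approximate; a much smaller `h` would make rounding error in the second difference dominate.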

Python Implementation

💡 Python for Expectation and Variance

Python's numpy and scipy libraries provide efficient tools for computing and verifying theoretical expectations, variances, and moments.

import numpy as np
from scipy import stats

# === Theoretical Values ===
# Normal distribution: mu=5, sigma=2
mu, sigma = 5, 2
print(f"E[X] = {mu}")
print(f"Var(X) = {sigma**2}")

# === Empirical Verification via Sampling ===
np.random.seed(42)
samples = np.random.normal(mu, sigma, size=100_000)
print(f"Sample mean: {samples.mean():.4f}")
print(f"Sample variance: {samples.var():.4f}")
print(f"Sample std dev: {samples.std():.4f}")

# === Discrete Distribution Moments ===
# Fair die
die = np.arange(1, 7)
prob = np.ones(6) / 6
E_die = np.sum(die * prob)
E_die2 = np.sum(die**2 * prob)
Var_die = E_die2 - E_die**2
print(f"E[die] = {E_die:.4f}, Var(die) = {Var_die:.4f}")

# === Custom Random Variable ===
def compute_expectation(values, probs):
    """Compute E[X] for a discrete random variable."""
    return np.sum(values * probs)

def compute_variance(values, probs):
    """Compute Var(X) for a discrete random variable."""
    mean = compute_expectation(values, probs)
    return compute_expectation((values - mean)**2, probs)

values = np.array([0, 1, 2, 3, 4])
probs = np.array([0.1, 0.2, 0.3, 0.25, 0.15])
print(f"E[X] = {compute_expectation(values, probs):.4f}")
print(f"Var(X) = {compute_variance(values, probs):.4f}")

# === Moment Generating Function ===
def mgf_normal(t, mu, sigma):
    """MGF of N(mu, sigma^2)."""
    return np.exp(mu * t + 0.5 * sigma**2 * t**2)

t = 0.1
print(f"M_X({t}) = {mgf_normal(t, mu, sigma):.6f}")

# === Empirical MGF ===
empirical_mgf = np.mean(np.exp(samples * t))
print(f"Empirical M_X({t}) = {empirical_mgf:.6f}")

# === Higher Moments with Scipy ===
from scipy.stats import skew, kurtosis
print(f"Skewness: {skew(samples):.4f}")
print(f"Excess Kurtosis: {kurtosis(samples):.4f}")

Applications in AI/ML

💡 Why ML Engineers Care About Moments

Expectation and variance are not abstract math — they directly inform how we design, train, and evaluate machine learning systems.

Definition: Expected Risk Minimization

In supervised learning, the population risk (expected loss) is:

R(f) = E_{(x,y) \sim P}[L(f(x), y)]

We approximate this with the empirical risk (average loss over training data):

\hat{R}(f) = \frac{1}{n} \sum_{i=1}^n L(f(x_i), y_i)

By the law of large numbers, for a fixed $f$ and i.i.d. samples, $\hat{R}(f) \to R(f)$ as $n \to \infty$ . The variance of the loss estimator tells us how much our risk estimate fluctuates with different training samples: high variance indicates the estimate is unreliable.

Definition: Gradient Variance in SGD

Stochastic gradient descent (SGD) approximates the true gradient with a mini-batch estimate. The variance of this estimate directly affects convergence speed:

\text{Var}(\hat{g}) = \frac{\text{Var}(\nabla L_i)}{B}

where $B$ is the batch size and the examples in a batch are sampled independently. Under that assumption, doubling the batch size halves the gradient variance. This is why larger batches produce smoother training curves, though not always better generalization.
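The $1/B$ scaling can be checked empirically. The sketch below uses synthetic scalar "per-example gradients" as a stand-in for real ones (hypothetical data), and draws batches with replacement so that the independence assumption holds; `minibatch_grad_variance` is an illustrative helper name.

```python
import numpy as np

rng = np.random.default_rng(3)
# Stand-in for per-example gradients of a single scalar parameter (hypothetical data).
per_example_grads = rng.normal(loc=0.5, scale=2.0, size=100_000)

def minibatch_grad_variance(grads, batch_size, n_batches, rng):
    """Empirical variance of the mini-batch mean gradient (sampling with replacement)."""
    idx = rng.integers(0, len(grads), size=(n_batches, batch_size))
    return grads[idx].mean(axis=1).var()

for B in (16, 32, 64):
    estimate = minibatch_grad_variance(per_example_grads, B, 20_000, rng)
    predicted = per_example_grads.var() / B
    print(f"B={B:3d}: empirical={estimate:.5f}, Var/B={predicted:.5f}")
```

Each doubling of `B` should roughly halve the empirical variance, matching the formula.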

Definition: Bias-Variance Tradeoff

The expected prediction error decomposes as:

E[(y - \hat{f}(x))^2] = \text{Bias}^2(\hat{f}) + \text{Var}(\hat{f}) + \sigma^2

Bias ( $\text{Bias}^2$ ): error from wrong assumptions (underfitting).
Variance: error from sensitivity to training data (overfitting).
Irreducible noise ( $\sigma^2$ ): inherent data noise.

Complex models (deep neural networks) tend to have low bias but high variance. Regularization techniques (dropout, weight decay, early stopping) reduce variance at the cost of slightly increased bias.
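The decomposition can be estimated by refitting a model on many noisy training sets. The sketch below is a toy setup, not a general recipe: it assumes a true function $\sin(2\pi x)$ , a fixed grid of 30 inputs, Gaussian noise with standard deviation 0.3, and polynomial fits via `np.polyfit`; the helper name `bias_variance` is hypothetical.

```python
import numpy as np

def bias_variance(degree, n_points=30, n_trials=500, noise_sd=0.3, seed=0):
    """Estimate average bias^2 and variance of a polynomial fit on a fixed grid."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, n_points)        # fixed design (an assumption)
    f_true = np.sin(2 * np.pi * x)             # true function (an assumption)
    preds = np.empty((n_trials, n_points))
    for k in range(n_trials):
        # Fresh noisy labels play the role of a new training set.
        y = f_true + rng.normal(0.0, noise_sd, n_points)
        coefs = np.polyfit(x, y, degree)       # least-squares polynomial fit
        preds[k] = np.polyval(coefs, x)
    bias_sq = np.mean((preds.mean(axis=0) - f_true) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance

for degree in (1, 6):
    b2, var = bias_variance(degree)
    print(f"degree={degree}: bias^2={b2:.4f}, variance={var:.4f}")
```

In this setup the straight line underfits (large bias, small variance), while the degree-6 polynomial has far smaller bias but a larger variance.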

📝Mean-Variance Utility (Finance ML)

In portfolio optimization, a financial ML model estimates the expected return $\mu$ and risk $\sigma$ of a portfolio. A risk-averse investor maximizes:

U = \mu - \frac{\lambda}{2} \sigma^2

where $\lambda > 0$ is the risk aversion parameter. This mean-variance framework directly uses expectation and variance as the two fundamental quantities.

Common Mistakes

Mistake	Why It's Wrong	Correct Approach
$\text{Var}(X+Y) = \text{Var}(X) + \text{Var}(Y)$ always	Only true for uncorrelated variables (independence is sufficient, not necessary)	$\text{Var}(X+Y) = \text{Var}(X) + \text{Var}(Y) + 2\text{Cov}(X,Y)$
$E[XY] = E[X] \cdot E[Y]$ always	Only true for uncorrelated variables (independence is sufficient, not necessary)	Compute $E[XY]$ directly or use covariance: $E[XY] = \text{Cov}(X,Y) + E[X]E[Y]$
Variance is in the same units as $X$	Variance has units of $X^2$	Use standard deviation $\sigma = \sqrt{\text{Var}(X)}$ for interpretable units
Expected value always exists	Some distributions (e.g., Cauchy) have no finite mean	Check convergence before using expectation-based results
"Variance = standard deviation"	They are different quantities; $\sigma \neq \sigma^2$	$\text{Var}(X) = \sigma^2$ , $\sigma = \sqrt{\text{Var}(X)}$
Forgetting $a^2$ in $\text{Var}(aX+b)$	$\text{Var}(aX+b) = a^2 \text{Var}(X)$ , not $a \cdot \text{Var}(X)$	The factor is $a^2$ because variance involves squaring deviations
Confusing moments with central moments	Raw moments $E[X^n]$ and central moments $E[(X-\mu)^n]$ are different	Use the correct definition for skewness, kurtosis, etc.

Interview Questions

📝Question 1: Expectation of a Function

Q: If $X \sim \text{Poisson}(\lambda)$ , what is $E[X(X-1)]$ ?

A: $E[X(X-1)] = \lambda^2$ . Expand $E[X(X-1)] = E[X^2] - E[X]$ . For a Poisson distribution, $E[X] = \lambda$ and $E[X^2] = \lambda + \lambda^2$ , so $E[X(X-1)] = \lambda + \lambda^2 - \lambda = \lambda^2$ . Factorial moments like this one are often easier to compute than raw moments, which can then be recovered from them.

📝Question 2: Variance of a Sum

Q: If $\text{Var}(X) = 4$ , $\text{Var}(Y) = 9$ , and $\text{Cov}(X,Y) = 2$ , what is $\text{Var}(2X - 3Y + 5)$ ?

A: $\text{Var}(2X - 3Y + 5) = 4\text{Var}(X) + 9\text{Var}(Y) - 12\text{Cov}(X,Y) = 16 + 81 - 24 = 73$ . The constant 5 drops out, and the coefficient of the covariance term is $2 \cdot (-3) \cdot 2 = -12$ .
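The same answer falls out of the generalized sum formula written as a quadratic form, $\text{Var}(a^\top Z) = a^\top \Sigma a$ , where $\Sigma$ is the covariance matrix of $Z = (X, Y)$ . A minimal check with NumPy, using only the numbers from the question:

```python
import numpy as np

a = np.array([2.0, -3.0])        # coefficients of X and Y; the constant 5 plays no role
cov = np.array([[4.0, 2.0],      # [[Var(X),    Cov(X,Y)],
                [2.0, 9.0]])     #  [Cov(X,Y),  Var(Y)  ]]

var_linear = a @ cov @ a
print(var_linear)  # 73.0
```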

📝Question 3: Linearity of Expectation

Q: A class of 30 students each flip a fair coin. What is the expected number of heads?

A: Let $X_i$ be the indicator for student $i$ getting heads. $E[X_i] = 0.5$ . By linearity, $E[\sum X_i] = \sum E[X_i] = 30 \cdot 0.5 = 15$ . The flips here happen to be independent, but linearity does not need that: the same answer would hold even if the flips were dependent.

📝Question 4: Conditional Expectation

Q: What is $E[X \mid X > 0]$ when $X \sim N(0, 1)$ ?

A: $E[X \mid X > 0] = \sqrt{2/\pi} \approx 0.7979$ . By symmetry of the standard normal, $P(X > 0) = 1 - \Phi(0) = 0.5$ , and the truncated normal mean formula gives $E[X \mid X > 0] = \frac{\phi(0)}{1 - \Phi(0)} = \frac{1/\sqrt{2\pi}}{0.5} = \sqrt{2/\pi}$ .
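A Monte Carlo check of this value is straightforward: sample standard normals, keep the positive ones, and average. The sample size and seed below are arbitrary choices.

```python
import math
import numpy as np

rng = np.random.default_rng(4)
z = rng.standard_normal(1_000_000)

conditional_mean = z[z > 0].mean()       # Monte Carlo estimate of E[X | X > 0]
closed_form = math.sqrt(2 / math.pi)     # phi(0) / (1 - Phi(0))

print(f"simulated: {conditional_mean:.4f}, closed form: {closed_form:.4f}")
```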

📝Question 5: Moment Generating Functions

Q: If $X \sim N(\mu_1, \sigma_1^2)$ and $Y \sim N(\mu_2, \sigma_2^2)$ are independent, what is the distribution of $X+Y$ ?

A: $M_{X+Y}(t) = M_X(t) \cdot M_Y(t) = e^{(\mu_1+\mu_2)t + (\sigma_1^2+\sigma_2^2)t^2/2}$ , which is the MGF of $N(\mu_1+\mu_2, \sigma_1^2+\sigma_2^2)$ . By uniqueness of MGFs, $X+Y \sim N(\mu_1+\mu_2, \sigma_1^2+\sigma_2^2)$ .

📝Question 6: Jensen's Inequality

Q: State Jensen's inequality and give an example.

A: For a convex function $\phi$ and random variable $X$ : $\phi(E[X]) \leq E[\phi(X)]$ . Example: By convexity of $x^2$ , $(E[X])^2 \leq E[X^2]$ , which implies $\text{Var}(X) = E[X^2] - (E[X])^2 \geq 0$ . This is a fundamental inequality used in variational inference (ELBO derivation).

Practice Problems

📝Problem 1: Expected Value of a Function

Let $X$ have PMF $P(X=0)=0.2$ , $P(X=1)=0.5$ , $P(X=2)=0.3$ . Find $E[X]$ , $E[X^2]$ , and $\text{Var}(X)$ .

💡Solution

E[X] = 0(0.2) + 1(0.5) + 2(0.3) = 1.1

E[X^2] = 0^2(0.2) + 1^2(0.5) + 2^2(0.3) = 1.7

\text{Var}(X) = 1.7 - 1.1^2 = 1.7 - 1.21 = 0.49

📝Problem 2: Linear Transformation

If $Y = 3X + 5$ and $\text{Var}(X) = 4$ , find $\text{Var}(Y)$ and $\text{SD}(Y)$ .

💡Solution

\text{Var}(Y) = \text{Var}(3X + 5) = 9 \cdot \text{Var}(X) = 9 \cdot 4 = 36

\text{SD}(Y) = \sqrt{36} = 6

Note: Adding 5 shifts the mean but does not affect variance.

📝Problem 3: Linearity with Dependent Variables

Let $X$ and $Y$ be random variables with $E[X] = 2$ , $E[Y] = 3$ , $\text{Var}(X) = 1$ , $\text{Var}(Y) = 4$ , and $\text{Cov}(X,Y) = -1$ . Find $E[2X - Y + 3]$ and $\text{Var}(2X - Y + 3)$ .

💡Solution

E[2X - Y + 3] = 2E[X] - E[Y] + 3 = 4 - 3 + 3 = 4

\text{Var}(2X - Y + 3) = 4\text{Var}(X) + \text{Var}(Y) - 4\text{Cov}(X,Y) = 4(1) + 4 - 4(-1) = 12

The covariance term has a $-4$ coefficient because the formula gives $2(2)(-1)\text{Cov}(X,Y)$ .

📝Problem 4: MGF Application

If $M_X(t) = \frac{1}{1-t}$ for $t < 1$ , identify the distribution of $X$ and find $E[X]$ and $\text{Var}(X)$ .

💡Solution

The MGF $M_X(t) = (1-t)^{-1}$ is the MGF of an Exponential(1) distribution.

E[X] = M_X'(0) = \frac{1}{(1-0)^2} = 1

E[X^2] = M_X''(0) = \frac{2}{(1-0)^3} = 2

\text{Var}(X) = E[X^2] - (E[X])^2 = 2 - 1 = 1

This confirms $X \sim \text{Exponential}(1)$ with $\lambda = 1$ .

📝Problem 5: Chebyshev's Inequality

A random variable $X$ has $\mu = 10$ and $\sigma^2 = 4$ . Use Chebyshev's inequality to bound $P(|X - 10| \geq 5)$ .

💡Solution

Chebyshev's inequality states: $P(|X - \mu| \geq k\sigma) \leq \frac{1}{k^2}$ .

Here $k\sigma = 5$ , so $k = 5/2 = 2.5$ .

P(|X - 10| \geq 5) \leq \frac{1}{(2.5)^2} = \frac{1}{6.25} = 0.16

This holds for any distribution with the given mean and variance — no normality assumption needed.
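To illustrate the distribution-free claim, the sketch below compares the bound with simulated tail probabilities for two distributions that share mean 10 and variance 4. Both distributions (a normal and a shifted exponential) are illustrative picks, not part of the problem.

```python
import numpy as np

rng = np.random.default_rng(5)
mu, var, threshold = 10.0, 4.0, 5.0
chebyshev_bound = var / threshold**2     # sigma^2 / (k*sigma)^2 = 1/k^2

# Normal with mean 10 and variance 4
normal = rng.normal(mu, np.sqrt(var), size=500_000)
# Exponential with scale 2 has mean 2 and variance 4; shifting by 8 gives mean 10
shifted_expo = 8.0 + rng.exponential(scale=2.0, size=500_000)

tail_normal = np.mean(np.abs(normal - mu) >= threshold)
tail_expo = np.mean(np.abs(shifted_expo - mu) >= threshold)

print(f"Chebyshev bound: {chebyshev_bound:.4f}")
print(f"normal tail: {tail_normal:.4f}, shifted exponential tail: {tail_expo:.4f}")
```

Both simulated tails sit well below the bound, which is expected: Chebyshev is a worst-case guarantee and is usually loose for any particular distribution.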

Quick Reference

Quantity	Formula	Python
Expectation (discrete)	$E[X] = \sum x \cdot p(x)$	`np.sum(x * p)`
Expectation (continuous)	$E[X] = \int x f(x) dx$	`np.trapezoid(x * pdf, x)` (`np.trapz` before NumPy 2.0)
Variance	$\text{Var}(X) = E[X^2] - (E[X])^2$	`np.var(x)`
Standard Deviation	$\sigma = \sqrt{\text{Var}(X)}$	`np.std(x)`
$n$ -th raw moment	$E[X^n]$	`np.mean(x**n)`
$n$ -th central moment	$E[(X-\mu)^n]$	`np.mean((x - mu)**n)`
Skewness	$\gamma_1 = \mu_3 / \sigma^3$	`scipy.stats.skew(x)`
Kurtosis	$\kappa = \mu_4 / \sigma^4$	`scipy.stats.kurtosis(x, fisher=False)`
MGF	$M_X(t) = E[e^{tX}]$	`np.mean(np.exp(x * t))`
Linearity of $E$	$E[aX+b] = aE[X]+b$	—
Variance scaling	$\text{Var}(aX+b) = a^2\text{Var}(X)$	—
Covariance rule	$\text{Var}(X+Y) = \text{Var}(X)+\text{Var}(Y)+2\text{Cov}(X,Y)$	`np.cov(x, y)`

Cross-References

Probability Distributions: 030-discrete-distributions — Bernoulli, Binomial, Poisson distributions and their moments
Continuous Distributions: 031-continuous-distributions — Normal, Exponential, Uniform distributions
Law of Large Numbers: 040-lln-clt — Why sample means converge to the expected value
Central Limit Theorem: 040-lln-clt — Normal approximation for sums of random variables
Covariance and Correlation: 038-probability-covariance — Joint distributions and dependence
Bayesian Inference: 041-bayesian-inference — Posterior expectations and credible intervals
Information Theory: 042-information-theory — Entropy, KL divergence, and their expectations