Probability

Expectation and Variance

Master expectation, variance, higher moments, and moment generating functions.

Why It Matters

💡 Why It Matters

Expectation and variance are the cornerstones of probability theory and statistical inference. The expected value tells you the long-run average outcome of a random variable, while variance quantifies uncertainty around that average. In machine learning, these quantities underpin everything from loss function design (expected risk minimization) to model evaluation (mean squared error, cross-entropy). Understanding them is essential for building robust, well-calibrated models. In reinforcement learning, agents maximize expected cumulative reward. In finance, portfolio optimization balances expected return against variance (risk). In deep learning, batch normalization and dropout regularization both exploit properties of expectation and variance to stabilize training. Without a firm grasp of these concepts, you cannot reason properly about uncertainty, make optimal decisions under risk, or debug statistical pipelines.


Expected Value

Definition: Expected Value (Discrete Case)

For a discrete random variable $X$ taking values $x_1, x_2, \dots$ with probabilities $p(x_1), p(x_2), \dots$, the expected value (or mean) is defined as:

$$E[X] = \sum_{i} x_i \cdot p(x_i)$$

provided the sum converges absolutely, i.e., $\sum_i |x_i| \cdot p(x_i) < \infty$. The expected value represents the "center of mass" of the probability distribution: the value you would obtain on average if you repeated the random experiment infinitely many times.

Expected Value — Discrete

$$E[X] = \sum_{i=1}^{\infty} x_i \cdot P(X = x_i)$$

Here,

  • $E[X]$ = Expected value of random variable $X$
  • $x_i$ = Possible values $X$ can take
  • $P(X = x_i)$ = Probability mass function (PMF) at $x_i$

Definition: Expected Value (Continuous Case)

For a continuous random variable $X$ with probability density function $f(x)$, the expected value is:

$$E[X] = \int_{-\infty}^{\infty} x \cdot f(x) \, dx$$

provided the integral converges absolutely. For distributions with heavy tails (e.g., Cauchy distribution), the expected value may not exist.

Expected Value — Continuous

$$E[X] = \int_{-\infty}^{\infty} x \, f(x) \, dx$$

Here,

  • $E[X]$ = Expected value of continuous random variable $X$
  • $f(x)$ = Probability density function (PDF)
  • $dx$ = Integration variable

📝Expected Value Examples

Example 1 (Discrete): Fair six-sided die.

$$E[X] = \frac{1}{6}(1+2+3+4+5+6) = \frac{21}{6} = 3.5$$

Example 2 (Continuous): Uniform distribution $X \sim U(0,1)$.

$$E[X] = \int_0^1 x \, dx = \frac{1}{2}$$

Example 3 (Bernoulli): $X \sim \text{Bernoulli}(p)$.

$$E[X] = 0 \cdot (1-p) + 1 \cdot p = p$$
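
The three examples above can be checked numerically. The following is a minimal sketch, assuming SciPy is installed; the helper name `discrete_expectation` and the value p = 0.3 are illustrative choices, not part of the lesson.

```python
from fractions import Fraction

from scipy.integrate import quad


def discrete_expectation(values, probs):
    """Return the sum of x * p(x) for a finite discrete distribution."""
    return sum(x * p for x, p in zip(values, probs))


# Example 1: fair six-sided die, computed exactly with rational arithmetic.
die_mean = discrete_expectation(range(1, 7), [Fraction(1, 6)] * 6)

# Example 2: U(0, 1) has density f(x) = 1 on [0, 1], so E[X] is the integral of x.
uniform_mean, _abs_err = quad(lambda x: x * 1.0, 0.0, 1.0)

# Example 3: Bernoulli(p) with an illustrative p = 0.3.
p = 0.3
bernoulli_mean = discrete_expectation([0, 1], [1 - p, p])

print(die_mean, uniform_mean, bernoulli_mean)
```

Using `Fraction` for the die keeps the result exact (7/2), which makes it easy to compare against the hand calculation.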

Properties of Expectation

Theorem: Properties of Expectation (Linearity)

The expectation operator is linear. For any random variables $X, Y$ with finite expectations and constants $a, b, c$:

  1. Constant rule: $E[c] = c$ for any constant $c$.
  2. Linearity: $E[aX + bY] = aE[X] + bE[Y]$.
  3. Iterated expectation: $E[X] = E[E[X \mid Y]]$ (law of total expectation).
  4. Monotonicity: If $X \leq Y$ with probability 1, then $E[X] \leq E[Y]$.
  5. Non-negativity: If $X \geq 0$ with probability 1, then $E[X] \geq 0$.

Important: Linearity holds even when $X$ and $Y$ are dependent. This is what makes expectation so powerful: no independence assumption is needed for $E[X+Y] = E[X] + E[Y]$.

⚠️ Linearity vs Independence

Linearity of expectation ($E[X+Y] = E[X]+E[Y]$) always holds. However, $E[XY] = E[X] \cdot E[Y]$ holds exactly when $X$ and $Y$ are uncorrelated, i.e., $\text{Cov}(X,Y) = 0$. Independence guarantees this, but uncorrelated variables need not be independent. Confusing these two facts is a common source of errors.
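
To make the contrast concrete, here is a small exact check on a dependent pair. It is a sketch under an assumed setup: $X$ is a fair die roll and $Y = X^2$, so $Y$ is fully determined by $X$. The variable names are illustrative.

```python
from fractions import Fraction

faces = range(1, 7)
p = Fraction(1, 6)  # probability of each face of a fair die

# Y = X**2 is a deterministic function of X, so X and Y are strongly dependent.
E_X = sum(p * x for x in faces)
E_Y = sum(p * x**2 for x in faces)
E_sum = sum(p * (x + x**2) for x in faces)
E_prod = sum(p * (x * x**2) for x in faces)

print("E[X+Y] =", E_sum, " E[X]+E[Y] =", E_X + E_Y)
print("E[XY]  =", E_prod, " E[X]E[Y]  =", E_X * E_Y)
```

The sum rule matches exactly despite the dependence, while the product rule fails because $\text{Cov}(X, Y) \neq 0$ here.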


Variance

Definition: Variance

The variance of a random variable $X$ measures the spread of its distribution around the mean $\mu = E[X]$. It is defined as:

$$\text{Var}(X) = E[(X - \mu)^2]$$

Variance is always non-negative. $\text{Var}(X) = 0$ if and only if $X$ is a constant with probability 1. The units of variance are the square of the units of $X$, which is why the standard deviation is often more interpretable.

Variance — Computational Formula

$$\text{Var}(X) = E[X^2] - (E[X])^2$$

Here,

  • $\text{Var}(X)$ = Variance of random variable $X$
  • $E[X^2]$ = Second moment of $X$
  • $(E[X])^2$ = Square of the first moment (mean)

Variance — Definition Form

$$\text{Var}(X) = E[(X - E[X])^2]$$

Here,

  • $E[X]$ = Mean (expected value) of $X$
  • $(X - E[X])^2$ = Squared deviation from the mean

💡 Why the Computational Formula?

The formula $\text{Var}(X) = E[X^2] - (E[X])^2$ is called the computational formula because it often simplifies calculations by hand. To find the variance, you compute $E[X^2]$ and $(E[X])^2$ separately, then subtract. This avoids computing deviations from the mean for each outcome. However, the definition form $E[(X-\mu)^2]$ is conceptually clearer and more numerically stable in floating-point arithmetic: when the mean is large relative to the spread, $E[X^2]$ and $(E[X])^2$ are two large, nearly equal numbers, and subtracting them can cancel most of the significant digits.
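
The stability difference is easy to see on data with a large offset. The sketch below uses four made-up observations near $10^9$ whose population variance is exactly 22.5; the data and variable names are illustrative assumptions.

```python
import numpy as np

# Four observations with a large common offset; the population variance is 22.5.
x = np.array([4.0, 7.0, 13.0, 16.0]) + 1e9

# Computational formula: E[X^2] - (E[X])^2.
var_computational = np.mean(x**2) - np.mean(x) ** 2

# Definition form: E[(X - mu)^2], computed in two passes.
mu = np.mean(x)
var_definition = np.mean((x - mu) ** 2)

print("computational formula:", var_computational)
print("definition form:      ", var_definition)
```

In double precision the squares are near $10^{18}$, where adjacent representable numbers are 128 apart, so the computational formula cannot resolve a difference of 22.5. The two-pass definition form works with small deviations and recovers the variance.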

📝Variance of a Fair Die

For a fair six-sided die, $E[X] = 3.5$.

$$E[X^2] = \frac{1}{6}(1^2 + 2^2 + 3^2 + 4^2 + 5^2 + 6^2) = \frac{91}{6}$$

$$\text{Var}(X) = \frac{91}{6} - \left(\frac{7}{2}\right)^2 = \frac{91}{6} - \frac{49}{4} = \frac{35}{12} \approx 2.917$$

Properties of Variance

Theorem: Properties of Variance

For random variables $X, Y, X_1, \dots, X_n$ with finite variances and constants $a, b, a_1, \dots, a_n$:

  1. Scaling: $\text{Var}(aX + b) = a^2 \text{Var}(X)$.

    • Adding a constant shifts the mean but does not change the spread.
    • Multiplying by $a$ scales the variance by $a^2$.
  2. Variance of a sum: $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\text{Cov}(X, Y)$.

  3. Independent case: If $X$ and $Y$ are independent, $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)$.

  4. Generalized sum: $\text{Var}\!\left(\sum_{i=1}^n a_i X_i\right) = \sum_{i=1}^n \sum_{j=1}^n a_i a_j \text{Cov}(X_i, X_j)$.

  5. Non-negativity: $\text{Var}(X) \geq 0$ for all $X$, with equality if and only if $X$ is constant almost surely (a.s.).

⚠️ Variance of a Sum Is NOT the Sum of Variances

A very common mistake is assuming $\text{Var}(X+Y) = \text{Var}(X) + \text{Var}(Y)$ always holds. This is true only when $X$ and $Y$ are uncorrelated ($\text{Cov}(X,Y) = 0$). For correlated variables, you must include the covariance term. In deep learning, this matters when analyzing gradient noise across correlated mini-batches.
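
A quick simulation shows how much the covariance term matters. This is a sketch with an assumed construction ($Y = 0.5X + \text{noise}$) and an arbitrary seed; only the identity itself comes from the text.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility

# Build a correlated pair: y shares a component with x.
x = rng.normal(size=10_000)
y = 0.5 * x + rng.normal(size=10_000)

# Use ddof=0 everywhere so variance and covariance are normalized the same way.
var_x, var_y = np.var(x), np.var(y)
cov_xy = np.cov(x, y, ddof=0)[0, 1]

lhs = np.var(x + y)
with_cov = var_x + var_y + 2 * cov_xy
without_cov = var_x + var_y

print(f"Var(X+Y) = {lhs:.4f}, with covariance = {with_cov:.4f}, without = {without_cov:.4f}")
```

The identity with the covariance term holds for sample statistics too (up to rounding), whereas dropping the term underestimates the variance by roughly $2\,\text{Cov}(X, Y)$.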


Standard Deviation

Standard Deviation

$$\sigma_X = \sqrt{\text{Var}(X)}$$

Here,

  • $\sigma_X$ = Standard deviation of $X$
  • $\text{Var}(X)$ = Variance of $X$

Definition: Why Standard Deviation?

The standard deviation $\sigma$ is the square root of the variance. Its key advantage is that it shares the same units as the original random variable $X$, making it directly interpretable. For a normal distribution, approximately 68% of observations fall within $\mu \pm \sigma$, 95% within $\mu \pm 2\sigma$, and 99.7% within $\mu \pm 3\sigma$ (the 68-95-99.7 rule). In machine learning, we often report mean $\pm$ standard deviation to convey both the average performance and its variability.

📝Standard Deviation in Practice

If a model's test accuracy across repeated runs has $\mu = 0.92$ and $\sigma = 0.03$, and the run-to-run accuracies are roughly normally distributed, then approximately:

  • 68% of runs achieve accuracy in $[0.89, 0.95]$
  • 95% of runs achieve accuracy in $[0.86, 0.98]$

This tells you the model is reasonably stable ($\sigma$ is small relative to $\mu$).
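
The coverage figures can be reproduced from the normal CDF. The following sketch assumes the accuracies are exactly normal with the stated mean and standard deviation, which is an idealization.

```python
from scipy.stats import norm

mu, sigma = 0.92, 0.03  # values from the example above

for k in (1, 2, 3):
    lo, hi = mu - k * sigma, mu + k * sigma
    coverage = norm.cdf(hi, loc=mu, scale=sigma) - norm.cdf(lo, loc=mu, scale=sigma)
    print(f"within {k} sigma: [{lo:.2f}, {hi:.2f}] covers {coverage:.4f}")

# The same probabilities on the standard normal scale.
one_sigma = norm.cdf(1) - norm.cdf(-1)
two_sigma = norm.cdf(2) - norm.cdf(-2)
```

Note that the 3-sigma interval extends past an accuracy of 1.0, a reminder that the normal model is only an approximation for a bounded metric.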


Moments

Definition: Raw Moments

The $n$-th raw moment of a random variable $X$ is:

$$\mu_n' = E[X^n]$$

The first raw moment is the mean: $\mu_1' = E[X]$. The second raw moment is $E[X^2]$, which appears in the computational variance formula. Higher raw moments capture increasingly detailed information about the shape of the distribution.

Definition: Central Moments

The $n$-th central moment of a random variable $X$ with mean $\mu$ is:

$$\mu_n = E[(X - \mu)^n]$$

Central moments are invariant to shifts in the distribution (adding a constant to $X$ does not change central moments). The second central moment is the variance: $\mu_2 = \text{Var}(X)$.

Moments — Summary Table

$$\begin{aligned} \mu_1' &= E[X] = \text{mean} \\ \mu_2 &= E[(X-\mu)^2] = \text{variance} \\ \mu_3 &= E[(X-\mu)^3] \text{ (skewness numerator)} \\ \mu_4 &= E[(X-\mu)^4] \text{ (kurtosis numerator)} \end{aligned}$$

Here,

  • $\mu_n'$ = $n$-th raw moment
  • $\mu_n$ = $n$-th central moment
  • $\gamma_1 = \mu_3 / \sigma^3$ = Skewness (measures asymmetry)
  • $\kappa = \mu_4 / \sigma^4$ = Kurtosis (measures tail heaviness)

💡 Skewness and Kurtosis

Skewness ($\mu_3 / \sigma^3$) measures the asymmetry of a distribution. Positive skew means a longer right tail. Kurtosis ($\mu_4 / \sigma^4$) measures the heaviness of tails. The normal distribution has kurtosis = 3 (excess kurtosis = 0). Leptokurtic distributions (kurtosis > 3) have heavier tails, which matters in risk management: extreme events are more likely than the normal model predicts.
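
As a worked instance of these definitions, the sketch below computes skewness and kurtosis of a fair die directly from its central moments. The helper `central_moment` is a hypothetical name introduced for illustration.

```python
import numpy as np

values = np.arange(1, 7)
probs = np.full(6, 1 / 6)


def central_moment(n):
    """n-th central moment E[(X - mu)^n] of the fair die."""
    mu = np.sum(values * probs)
    return np.sum((values - mu) ** n * probs)


sigma = np.sqrt(central_moment(2))
skewness = central_moment(3) / sigma**3
kurt = central_moment(4) / sigma**4

print(f"skewness = {skewness:.4f}, kurtosis = {kurt:.4f}, excess = {kurt - 3:.4f}")
```

The die is symmetric, so its skewness is zero, and its kurtosis is $303/175 \approx 1.73$, below 3: a uniform-like distribution has lighter tails than the normal.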


Moment Generating Function

Moment Generating Function (MGF)

$$M_X(t) = E[e^{tX}]$$

Here,

  • $M_X(t)$ = Moment generating function of $X$ evaluated at $t$
  • $t$ = Real parameter (near 0)
  • $e^{tX}$ = Exponential transform of $X$

Definition: Why the MGF Matters

The moment generating function $M_X(t) = E[e^{tX}]$ uniquely determines the distribution of $X$ (when it exists in a neighborhood of $t=0$). Its key properties:

  1. Moment extraction: $M_X^{(n)}(0) = E[X^n]$, i.e., the $n$-th derivative at $t=0$ gives the $n$-th raw moment.
  2. Sum of independent variables: If $X$ and $Y$ are independent, $M_{X+Y}(t) = M_X(t) \cdot M_Y(t)$. This makes it easy to find the distribution of sums.
  3. Uniqueness: If $M_X(t) = M_Y(t)$ for all $t$ in a neighborhood of 0, then $X$ and $Y$ have the same distribution.

The MGF is related to the Laplace transform. The characteristic function $\phi_X(t) = E[e^{itX}]$ always exists (even when the MGF does not) and serves a similar role using complex exponentials.

📝MGF of the Normal Distribution

If $X \sim N(\mu, \sigma^2)$, then:

$$M_X(t) = \exp\!\left(\mu t + \frac{\sigma^2 t^2}{2}\right)$$

Taking derivatives and evaluating at $t=0$ recovers all moments. For instance, $M_X'(0) = \mu$ and $M_X''(0) = \mu^2 + \sigma^2$, confirming $E[X^2] = \mu^2 + \sigma^2$.
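
The moment-extraction property can be checked numerically without symbolic differentiation. The sketch below approximates $M_X'(0)$ and $M_X''(0)$ with central finite differences; the parameters $\mu = 5$, $\sigma = 2$ and the step size `h` are illustrative assumptions.

```python
import numpy as np


def mgf_normal(t, mu, sigma):
    """MGF of N(mu, sigma^2): exp(mu*t + sigma^2 * t^2 / 2)."""
    return np.exp(mu * t + 0.5 * sigma**2 * t**2)


mu, sigma = 5.0, 2.0  # illustrative parameters
h = 1e-4  # finite-difference step (an assumption; not tuned)

# Central differences approximate the first and second derivatives at t = 0.
m_plus, m_zero, m_minus = (mgf_normal(t, mu, sigma) for t in (h, 0.0, -h))
first = (m_plus - m_minus) / (2 * h)
second = (m_plus - 2 * m_zero + m_minus) / h**2

print(f"M'(0)  ~ {first:.5f}  (mu = {mu})")
print(f"M''(0) ~ {second:.5f}  (mu^2 + sigma^2 = {mu**2 + sigma**2})")
```

Finite differences are only approximate, so the results agree with $\mu$ and $\mu^2 + \sigma^2$ to several decimal places rather than exactly.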


Python Implementation

💡 Python for Expectation and Variance

Python's numpy and scipy libraries provide efficient tools for computing and verifying theoretical expectations, variances, and moments.

import numpy as np
from scipy import stats

# === Theoretical Values ===
# Normal distribution: mu=5, sigma=2
mu, sigma = 5, 2
print(f"E[X] = {mu}")
print(f"Var(X) = {sigma**2}")

# === Empirical Verification via Sampling ===
np.random.seed(42)
samples = np.random.normal(mu, sigma, size=100_000)
print(f"Sample mean: {samples.mean():.4f}")
print(f"Sample variance: {samples.var():.4f}")
print(f"Sample std dev: {samples.std():.4f}")

# === Discrete Distribution Moments ===
# Fair die
die = np.arange(1, 7)
prob = np.ones(6) / 6
E_die = np.sum(die * prob)
E_die2 = np.sum(die**2 * prob)
Var_die = E_die2 - E_die**2
print(f"E[die] = {E_die:.4f}, Var(die) = {Var_die:.4f}")

# === Custom Random Variable ===
def compute_expectation(values, probs):
    """Compute E[X] for a discrete random variable."""
    return np.sum(values * probs)

def compute_variance(values, probs):
    """Compute Var(X) for a discrete random variable."""
    mean = compute_expectation(values, probs)
    return compute_expectation((values - mean)**2, probs)

values = np.array([0, 1, 2, 3, 4])
probs = np.array([0.1, 0.2, 0.3, 0.25, 0.15])
print(f"E[X] = {compute_expectation(values, probs):.4f}")
print(f"Var(X) = {compute_variance(values, probs):.4f}")

# === Moment Generating Function ===
def mgf_normal(t, mu, sigma):
    """MGF of N(mu, sigma^2)."""
    return np.exp(mu * t + 0.5 * sigma**2 * t**2)

t = 0.1
print(f"M_X({t}) = {mgf_normal(t, mu, sigma):.6f}")

# === Empirical MGF ===
empirical_mgf = np.mean(np.exp(samples * t))
print(f"Empirical M_X({t}) = {empirical_mgf:.6f}")

# === Higher Moments with Scipy ===
from scipy.stats import skew, kurtosis
print(f"Skewness: {skew(samples):.4f}")
print(f"Excess Kurtosis: {kurtosis(samples):.4f}")

Applications in AI/ML

💡 Why ML Engineers Care About Moments

Expectation and variance are not abstract math — they directly inform how we design, train, and evaluate machine learning systems.

Definition: Expected Risk Minimization

In supervised learning, the population risk (expected loss) is:

$$R(f) = E_{(x,y) \sim P}[L(f(x), y)]$$

We approximate this with the empirical risk (average loss over training data):

$$\hat{R}(f) = \frac{1}{n} \sum_{i=1}^n L(f(x_i), y_i)$$

By the law of large numbers, for a fixed $f$ and i.i.d. samples, $\hat{R}(f) \to R(f)$ as $n \to \infty$. The variance of the loss estimator tells us how much our risk estimate fluctuates with different training samples: high variance indicates the estimate is unreliable.

Definition: Gradient Variance in SGD

Stochastic gradient descent (SGD) approximates the true gradient with a mini-batch estimate. The variance of this estimate directly affects convergence speed. When the $B$ examples in a batch are sampled independently:

$$\text{Var}(\hat{g}) = \frac{\text{Var}(\nabla L_i)}{B}$$

where $B$ is the batch size and $\nabla L_i$ is the gradient of the loss on a single example. Doubling the batch size halves the gradient variance. This is why larger batches produce smoother training curves, though not always better generalization.
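
The $1/B$ scaling can be illustrated with a toy simulation. This sketch stands in scalar draws from $N(1, 2^2)$ for per-example gradients, which is an assumption made purely for illustration; the function name and seed are likewise hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed


def batch_mean_variance(batch_size, n_batches=20_000):
    """Empirical variance of the mean of `batch_size` i.i.d. scalar 'gradients'."""
    per_sample = rng.normal(loc=1.0, scale=2.0, size=(n_batches, batch_size))
    return per_sample.mean(axis=1).var()


var_b16 = batch_mean_variance(16)
var_b32 = batch_mean_variance(32)
ratio = var_b16 / var_b32

print(f"B=16: {var_b16:.4f}, B=32: {var_b32:.4f}, ratio: {ratio:.2f}")
```

With a per-sample variance of 4, the theory predicts $4/16 = 0.25$ and $4/32 = 0.125$, so the ratio should be close to 2.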

Definition: Bias-Variance Tradeoff

The expected prediction error decomposes as:

$$E[(y - \hat{f}(x))^2] = \text{Bias}^2(\hat{f}) + \text{Var}(\hat{f}) + \sigma^2$$

  • Bias ($\text{Bias}^2$): error from wrong assumptions (underfitting).
  • Variance: error from sensitivity to training data (overfitting).
  • Irreducible noise ($\sigma^2$): inherent data noise.

Complex models (deep neural networks) tend to have low bias but high variance. Regularization techniques (dropout, weight decay, early stopping) reduce variance at the cost of slightly increased bias.

📝Mean-Variance Utility (Finance ML)

In portfolio optimization, a financial ML model estimates the expected return $\mu$ and the risk, measured by the standard deviation $\sigma$, of a portfolio. A risk-averse investor maximizes:

$$U = \mu - \frac{\lambda}{2} \sigma^2$$

where $\lambda > 0$ is the risk aversion parameter. This mean-variance framework directly uses expectation and variance as the two fundamental quantities.


Common Mistakes

| Mistake | Why It's Wrong | Correct Approach |
| --- | --- | --- |
| $\text{Var}(X+Y) = \text{Var}(X) + \text{Var}(Y)$ always | Only true for uncorrelated variables (independence is sufficient) | $\text{Var}(X+Y) = \text{Var}(X) + \text{Var}(Y) + 2\text{Cov}(X,Y)$ |
| $E[XY] = E[X] \cdot E[Y]$ always | Only true for uncorrelated variables (independence is sufficient) | Compute $E[XY]$ directly or use covariance: $E[XY] = \text{Cov}(X,Y) + E[X]E[Y]$ |
| Variance is in the same units as $X$ | Variance has units of $X^2$ | Use standard deviation $\sigma = \sqrt{\text{Var}(X)}$ for interpretable units |
| Expected value always exists | Some distributions (e.g., Cauchy) have no finite mean | Check convergence before using expectation-based results |
| "Variance = standard deviation" | They are different quantities; in general $\sigma \neq \sigma^2$ | $\text{Var}(X) = \sigma^2$, $\sigma = \sqrt{\text{Var}(X)}$ |
| Forgetting $a^2$ in $\text{Var}(aX+b)$ | $\text{Var}(aX+b) = a^2 \text{Var}(X)$, not $a \cdot \text{Var}(X)$ | The factor is $a^2$ because variance involves squaring deviations |
| Confusing moments with central moments | Raw moments $E[X^n]$ and central moments $E[(X-\mu)^n]$ are different | Use the correct definition for skewness, kurtosis, etc. |

Interview Questions

📝Question 1: Expectation of a Function

Q: If $X \sim \text{Poisson}(\lambda)$, what is $E[X(X-1)]$?

A: $E[X(X-1)] = \lambda^2$. By linearity, $E[X(X-1)] = E[X^2] - E[X]$. For a Poisson distribution, $E[X] = \lambda$ and $E[X^2] = \lambda + \lambda^2$, so $E[X(X-1)] = \lambda + \lambda^2 - \lambda = \lambda^2$. This factorial-moment technique is useful for computing higher moments.

📝Question 2: Variance of a Sum

Q: If $\text{Var}(X) = 4$, $\text{Var}(Y) = 9$, and $\text{Cov}(X,Y) = 2$, what is $\text{Var}(2X - 3Y + 5)$?

A: $\text{Var}(2X - 3Y + 5) = 4\text{Var}(X) + 9\text{Var}(Y) - 12\text{Cov}(X,Y) = 16 + 81 - 24 = 73$. The constant 5 drops out, and the coefficient of the covariance term is $2 \cdot (-3) \cdot 2 = -12$.
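
The same answer follows from the generalized sum formula written as a quadratic form, $\text{Var}(a^\top Z) = a^\top \Sigma a$, where $\Sigma$ is the covariance matrix of $Z = (X, Y)$. A minimal sketch with NumPy, using the numbers from the question (the variable names are illustrative):

```python
import numpy as np

a = np.array([2.0, -3.0])  # coefficients of X and Y; the constant 5 does not enter
cov = np.array([
    [4.0, 2.0],  # Var(X),    Cov(X, Y)
    [2.0, 9.0],  # Cov(Y, X), Var(Y)
])

var_linear_combo = a @ cov @ a
print(var_linear_combo)  # 73.0
```

This form scales to any number of variables, which is convenient when the covariance matrix is already available.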

📝Question 3: Linearity of Expectation

Q: A class of 30 students each flip a fair coin. What is the expected number of heads?

A: Let $X_i$ be the indicator for student $i$ getting heads, so $E[X_i] = 0.5$. By linearity, $E\left[\sum_{i=1}^{30} X_i\right] = \sum_{i=1}^{30} E[X_i] = 30 \cdot 0.5 = 15$. The flips here happen to be independent, but linearity does not need that: the same answer would hold even if the flips were dependent.

📝Question 4: Conditional Expectation

Q: What is $E[X \mid X > 0]$ when $X \sim N(0, 1)$?

A: $E[X \mid X > 0] = \sqrt{2/\pi} \approx 0.7979$. By symmetry of the standard normal, $P(X > 0) = 1 - \Phi(0) = 0.5$, and the truncated normal mean formula gives $E[X \mid X > 0] = \frac{\phi(0)}{1 - \Phi(0)} = \frac{1/\sqrt{2\pi}}{0.5} = \sqrt{2/\pi}$.
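
This value can be cross-checked with SciPy in two ways: by evaluating the truncated-normal formula directly, and by using the fact that a standard normal conditioned on being positive is a half-normal variable. A short sketch, assuming SciPy is available:

```python
import numpy as np
from scipy.stats import halfnorm, norm

# Truncated-normal formula from the answer above: phi(0) / (1 - Phi(0)).
by_formula = norm.pdf(0.0) / (1.0 - norm.cdf(0.0))

# For X ~ N(0, 1), X conditioned on X > 0 has the same law as |X| (half-normal).
by_halfnorm = halfnorm.mean()

print(by_formula, by_halfnorm, np.sqrt(2 / np.pi))
```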

📝Question 5: Moment Generating Functions

Q: If $X \sim N(\mu_1, \sigma_1^2)$ and $Y \sim N(\mu_2, \sigma_2^2)$ are independent, what is the distribution of $X+Y$?

A: $M_{X+Y}(t) = M_X(t) \cdot M_Y(t) = e^{(\mu_1+\mu_2)t + (\sigma_1^2+\sigma_2^2)t^2/2}$, which is the MGF of $N(\mu_1+\mu_2, \sigma_1^2+\sigma_2^2)$. By uniqueness of MGFs, $X+Y \sim N(\mu_1+\mu_2, \sigma_1^2+\sigma_2^2)$.

📝Question 6: Jensen's Inequality

Q: State Jensen's inequality and give an example.

A: For a convex function $\phi$ and random variable $X$: $\phi(E[X]) \leq E[\phi(X)]$. Example: By convexity of $x^2$, $(E[X])^2 \leq E[X^2]$, which implies $\text{Var}(X) = E[X^2] - (E[X])^2 \geq 0$. This is a fundamental inequality used in variational inference (ELBO derivation).


Practice Problems

📝Problem 1: Expected Value of a Function

Let $X$ have PMF $P(X=0)=0.2$, $P(X=1)=0.5$, $P(X=2)=0.3$. Find $E[X]$, $E[X^2]$, and $\text{Var}(X)$.

💡Solution

$$E[X] = 0(0.2) + 1(0.5) + 2(0.3) = 1.1$$

$$E[X^2] = 0^2(0.2) + 1^2(0.5) + 2^2(0.3) = 1.7$$

$$\text{Var}(X) = 1.7 - 1.1^2 = 1.7 - 1.21 = 0.49$$

📝Problem 2: Linear Transformation

If $Y = 3X + 5$ and $\text{Var}(X) = 4$, find $\text{Var}(Y)$ and $\text{SD}(Y)$.

💡Solution

$$\text{Var}(Y) = \text{Var}(3X + 5) = 9 \cdot \text{Var}(X) = 9 \cdot 4 = 36$$

$$\text{SD}(Y) = \sqrt{36} = 6$$

Note: Adding 5 shifts the mean but does not affect variance.

📝Problem 3: Linearity with Dependent Variables

Let $X$ and $Y$ be random variables with $E[X] = 2$, $E[Y] = 3$, $\text{Var}(X) = 1$, $\text{Var}(Y) = 4$, and $\text{Cov}(X,Y) = -1$. Find $E[2X - Y + 3]$ and $\text{Var}(2X - Y + 3)$.

💡Solution

$$E[2X - Y + 3] = 2E[X] - E[Y] + 3 = 4 - 3 + 3 = 4$$

$$\text{Var}(2X - Y + 3) = 4\text{Var}(X) + \text{Var}(Y) - 4\text{Cov}(X,Y) = 4(1) + 4 - 4(-1) = 12$$

The covariance term has a $-4$ coefficient because the formula gives $2(2)(-1)\text{Cov}(X,Y)$.

📝Problem 4: MGF Application

If $M_X(t) = \frac{1}{1-t}$ for $t < 1$, identify the distribution of $X$ and find $E[X]$ and $\text{Var}(X)$.

💡Solution

The MGF $M_X(t) = (1-t)^{-1}$ is the MGF of an Exponential(1) distribution.

$$E[X] = M_X'(0) = \frac{1}{(1-0)^2} = 1$$

$$E[X^2] = M_X''(0) = \frac{2}{(1-0)^3} = 2$$

$$\text{Var}(X) = E[X^2] - (E[X])^2 = 2 - 1 = 1$$

These values match the mean $1/\lambda$ and variance $1/\lambda^2$ of $X \sim \text{Exponential}(\lambda)$ with $\lambda = 1$.
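
The identification can also be checked by integrating $e^{tx}$ against the Exponential(1) density $e^{-x}$ numerically. The sketch below assumes SciPy's `quad`; the function name `mgf_exponential_1` is hypothetical, and the two exponentials are combined into $e^{(t-1)x}$ to avoid overflow for large $x$.

```python
import numpy as np
from scipy.integrate import quad


def mgf_exponential_1(t):
    """E[exp(tX)] for X ~ Exponential(1), by numerical integration (needs t < 1)."""
    # exp(t*x) * exp(-x) is written as exp((t - 1) * x) so it never overflows.
    value, _abs_err = quad(lambda x: np.exp((t - 1.0) * x), 0.0, np.inf)
    return value


for t in (-1.0, 0.0, 0.5):
    print(f"t = {t:+.1f}: integral = {mgf_exponential_1(t):.6f}, 1/(1-t) = {1 / (1 - t):.6f}")
```

The integral only converges for $t < 1$, which is exactly the domain stated in the problem.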

📝Problem 5: Chebyshev's Inequality

A random variable $X$ has $\mu = 10$ and $\sigma^2 = 4$. Use Chebyshev's inequality to bound $P(|X - 10| \geq 5)$.

💡Solution

Chebyshev's inequality states: $P(|X - \mu| \geq k\sigma) \leq \frac{1}{k^2}$.

Here $\sigma = 2$ and $k\sigma = 5$, so $k = 5/2 = 2.5$.

$$P(|X - 10| \geq 5) \leq \frac{1}{(2.5)^2} = \frac{1}{6.25} = 0.16$$

This holds for any distribution with the given mean and variance; no normality assumption is needed.
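
To see how conservative the bound can be, the sketch below compares it with the exact tail probability under one specific distribution. The normality of $X$ here is an added assumption for comparison only; Chebyshev itself does not require it.

```python
from scipy.stats import norm

mu, sigma, threshold = 10.0, 2.0, 5.0
k = threshold / sigma

chebyshev_bound = 1.0 / k**2

# Exact two-sided tail if X happened to be N(10, 4); this is an added assumption.
normal_tail = 2.0 * norm.sf(k)

print(f"Chebyshev bound: {chebyshev_bound:.4f}, normal tail: {normal_tail:.4f}")
```

For a normal variable the true probability is more than ten times smaller than the bound, which is the price of making no distributional assumption.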


Quick Reference

| Quantity | Formula | Python |
| --- | --- | --- |
| Expectation (discrete) | $E[X] = \sum x \cdot p(x)$ | `np.sum(x * p)` |
| Expectation (continuous) | $E[X] = \int x f(x) \, dx$ | `np.trapz(x * pdf, x)` (named `np.trapezoid` in NumPy 2.0+) |
| Variance | $\text{Var}(X) = E[X^2] - (E[X])^2$ | `np.var(x)` |
| Standard Deviation | $\sigma = \sqrt{\text{Var}(X)}$ | `np.std(x)` |
| $n$-th raw moment | $E[X^n]$ | `np.mean(x**n)` |
| $n$-th central moment | $E[(X-\mu)^n]$ | `np.mean((x - mu)**n)` |
| Skewness | $\gamma_1 = \mu_3 / \sigma^3$ | `scipy.stats.skew(x)` |
| Kurtosis | $\kappa = \mu_4 / \sigma^4$ | `scipy.stats.kurtosis(x, fisher=False)` (the default returns excess kurtosis, $\kappa - 3$) |
| MGF | $M_X(t) = E[e^{tX}]$ | `np.mean(np.exp(x * t))` |
| Linearity of $E$ | $E[aX+b] = aE[X]+b$ | |
| Variance scaling | $\text{Var}(aX+b) = a^2\text{Var}(X)$ | |
| Covariance rule | $\text{Var}(X+Y) = \text{Var}(X)+\text{Var}(Y)+2\text{Cov}(X,Y)$ | `np.cov(x, y)` (returns the 2×2 covariance matrix) |
