Maximum Likelihood Estimation

Master MLE, Fisher information, and their role in model fitting.

ℹ️ Why It Matters

MLE is the most common method for fitting statistical models. It underlies logistic regression, neural network training with likelihood-based losses, and much of machine learning. Understanding MLE gives you the theoretical foundation for why common loss functions work, how parameters are estimated, and what guarantees exist about estimator quality. It is the bridge between probability theory and practical model fitting.


Overview

Given data $x_1, \ldots, x_n$ drawn independently from a distribution $f(x|\theta)$, the maximum likelihood estimator is the parameter value under which the observed data are most probable (for continuous data, the value with the highest joint density). The likelihood function is $L(\theta) = \prod_{i=1}^n f(x_i|\theta)$, and we maximize it or, equivalently, minimize the negative log-likelihood. For the Normal distribution, the MLEs have closed forms: $\hat{\mu} = \bar{x}$ and $\hat{\sigma}^2 = \frac{1}{n}\sum(x_i - \bar{x})^2$ (biased, because it divides by $n$ rather than $n-1$). The Fisher information $I(\theta)$ measures how much each observation tells us about $\theta$, and the Cramér-Rao bound sets a floor on estimator variance: with $n$ independent observations, no unbiased estimator can have variance less than $1/(n I(\theta))$.


Key Concepts

MLE Estimator

$$\hat{\theta}_{MLE} = \arg\max_\theta \prod_{i=1}^n f(x_i|\theta)$$

Here,

  • $f(x_i|\theta)$ = Probability density/mass function of $x_i$ given parameter $\theta$
  • $\prod_{i=1}^n$ = Product over all $n$ observations
  • $\hat{\theta}_{MLE}$ = The parameter value that maximizes the likelihood

Log-Likelihood

$$\ell(\theta) = \sum_{i=1}^n \log f(x_i|\theta)$$

Here,

  • $\ell(\theta)$ = Log-likelihood (converts products to sums)
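Working with sums of logs also matters numerically. The sketch below uses a synthetic Normal sample (an assumption made for illustration, not data from the lesson) to show that the raw product of densities underflows to zero in floating point while the log-likelihood stays finite.

```python
import numpy as np
from scipy import stats

# Synthetic sample for illustration only (assumed, not from the lesson)
rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=2000)

# Raw likelihood: a product of 2000 densities, each below 0.4,
# is far smaller than the smallest positive float and underflows
likelihood = np.prod(stats.norm.pdf(data, loc=2.0, scale=1.0))

# Log-likelihood: the sum of log-densities remains a finite number
log_likelihood = np.sum(stats.norm.logpdf(data, loc=2.0, scale=1.0))

print(likelihood, log_likelihood)
```

This is one practical reason optimizers are usually handed the (negative) log-likelihood rather than the likelihood itself.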

Normal Distribution MLE

$$\hat{\mu} = \bar{x}, \quad \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n(x_i - \bar{x})^2$$

Here,

  • $\hat{\mu}$ = MLE of the mean (unbiased)
  • $\hat{\sigma}^2$ = MLE of the variance (biased: divides by $n$, not $n-1$)
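The two closed forms can be computed directly with NumPy. The following is a minimal sketch on a synthetic sample (the sample and variable names are assumptions for illustration). It also shows that `np.var` with its default `ddof=0` gives the MLE, while `ddof=1` gives the unbiased estimator.

```python
import numpy as np

# Synthetic sample for illustration only
rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=2.0, size=50)
n = len(x)

mu_hat = x.mean()                         # MLE of the mean
sigma2_mle = np.mean((x - mu_hat) ** 2)   # MLE of the variance, divides by n
sigma2_unbiased = np.var(x, ddof=1)       # sample variance, divides by n - 1

# The two variance estimates differ exactly by the factor (n - 1) / n
print(sigma2_mle, sigma2_unbiased * (n - 1) / n)
```

The gap between the two estimates shrinks as $n$ grows, so the bias of the MLE vanishes in large samples.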

Fisher Information

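$$I(\theta) = -E\left[\frac{\partial^2}{\partial \theta^2} \log f(X|\theta)\right]$$

Here,

  • $I(\theta)$ = Fisher information per observation
  • For $n$ i.i.d. observations, the total information is $n I(\theta) = -E\left[\partial^2 \ell / \partial \theta^2\right]$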
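As a worked illustration (not taken from the lesson), consider a single Bernoulli($p$) observation with $f(x|p) = p^x (1-p)^{1-x}$. Differentiating the log-density twice and using $E[X] = p$ gives the per-observation information:

```latex
\begin{aligned}
\log f(x|p) &= x \log p + (1 - x)\log(1 - p) \\
\frac{\partial^2}{\partial p^2}\log f(x|p) &= -\frac{x}{p^2} - \frac{1 - x}{(1 - p)^2} \\
I(p) &= \frac{E[X]}{p^2} + \frac{1 - E[X]}{(1 - p)^2}
      = \frac{1}{p} + \frac{1}{1 - p}
      = \frac{1}{p(1 - p)}
\end{aligned}
```

The information is smallest at $p = 0.5$, where $I(p) = 4$, and grows as $p$ approaches 0 or 1: outcomes from a nearly deterministic coin pin down $p$ more tightly.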

Cramér-Rao Lower Bound

$$\text{Var}(\hat{\theta}) \geq \frac{1}{n I(\theta)}$$

Here,

  • $\hat{\theta}$ = Any unbiased estimator of $\theta$ based on $n$ i.i.d. observations
  • $I(\theta)$ = Fisher information per observation
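A small simulation can make the bound concrete. The sketch below assumes a Poisson model, for which the per-observation information is $I(\lambda) = 1/\lambda$, so the bound for $n$ independent observations is $\lambda/n$. The parameter values and replication count are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
lam, n, reps = 4.0, 50, 20000   # illustrative values, chosen arbitrarily

# Draw `reps` independent samples of size n and compute the MLE (sample mean) of each
samples = rng.poisson(lam, size=(reps, n))
lam_hats = samples.mean(axis=1)

empirical_var = lam_hats.var()  # variance of the MLE across replications
crlb = lam / n                  # 1 / (n * I(lam)) with I(lam) = 1 / lam

print(empirical_var, crlb)
```

In this model the sample mean is unbiased and its variance matches the bound up to simulation noise, which is what it means for an estimator to be efficient.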

Score Function

$$S(\theta) = \frac{\partial \ell}{\partial \theta}$$

Here,

  • $S(\theta)$ = Score function; set to 0 to find the MLE
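When the score equation is awkward to solve by hand, it can be solved with a root finder instead. The sketch below assumes an Exponential model with rate $\lambda$, whose score is $S(\lambda) = n/\lambda - \sum x_i$, and uses `scipy.optimize.brentq` on a synthetic sample. The bracketing interval is an assumption that happens to contain the root for this data.

```python
import numpy as np
from scipy.optimize import brentq

# Synthetic Exponential sample with true rate 2.5 (illustration only)
rng = np.random.default_rng(3)
x = rng.exponential(scale=1 / 2.5, size=200)

def score(lam):
    """Derivative of the Exponential log-likelihood with respect to the rate."""
    return len(x) / lam - x.sum()

# The score is positive near 0 and negative for large lam, so a root is bracketed
lam_hat = brentq(score, 1e-6, 100.0)

print(lam_hat, 1 / x.mean())
```

The numerical root agrees with the closed-form answer $1/\bar{x}$, which is a useful check before applying the same approach to a model with no closed form.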

MLE for Common Distributions

| Distribution | Parameter | MLE | Notes |
|---|---|---|---|
| Normal | $\mu$ | $\bar{x}$ | Unbiased |
| Normal | $\sigma^2$ | $\frac{1}{n}\sum(x_i-\bar{x})^2$ | Biased (uses $n$, not $n-1$) |
| Poisson | $\lambda$ | $\bar{x}$ | Closed-form |
| Bernoulli | $p$ | $\text{successes}/n$ | Closed-form |
| Exponential | $\lambda$ | $1/\bar{x}$ | Closed-form |

Quick Example

📝 MLE for Poisson Distribution

Data: $x_1, \ldots, x_n$ from Poisson($\lambda$). Log-likelihood:

$$\ell(\lambda) = \sum_{i=1}^n (x_i \log\lambda - \lambda - \log x_i!)$$

Setting the derivative to zero, $\frac{d\ell}{d\lambda} = \frac{\sum x_i}{\lambda} - n = 0$, gives:

$$\hat{\lambda} = \bar{x}$$

The MLE of $\lambda$ is the sample mean, which is intuitive because the Poisson mean equals $\lambda$.
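The derivation can be checked numerically. The sketch below, on synthetic Poisson counts (an assumption for illustration), minimizes the negative log-likelihood over $\lambda$ with `scipy.optimize.minimize_scalar` and compares the result with the sample mean. The search bounds are arbitrary but wide enough for this data.

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize_scalar

# Synthetic Poisson counts with true rate 3.0 (illustration only)
rng = np.random.default_rng(4)
x = rng.poisson(3.0, size=500)

def neg_log_lik(lam):
    """Negative Poisson log-likelihood of the sample at rate lam."""
    return -np.sum(stats.poisson.logpmf(x, lam))

# Bounded one-dimensional minimization over an assumed plausible range
res = minimize_scalar(neg_log_lik, bounds=(1e-6, 20.0), method="bounded")

print(res.x, x.mean())
```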

📝 Numerical MLE

For complex models without closed-form MLEs, minimize the negative log-likelihood numerically:

import numpy as np
from scipy import stats
from scipy.optimize import minimize
# data: 1-D array of observations; params = [mu, sigma]
neg_log_lik = lambda params: -np.sum(stats.norm.logpdf(data, params[0], params[1]))
result = minimize(neg_log_lik, [0, 1], bounds=[(None, None), (0.01, None)])  # keep sigma > 0
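A self-contained version of this pattern is sketched below. It deliberately fits a Normal model to synthetic data (an assumption for illustration) so that the optimizer's answer can be compared against the known closed forms before the same code is trusted on a model without one.

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize

# Synthetic data for illustration only
rng = np.random.default_rng(5)
data = rng.normal(loc=3.0, scale=2.0, size=1000)

def neg_log_lik(params):
    """Negative Normal log-likelihood; params = [mu, sigma]."""
    mu, sigma = params
    return -np.sum(stats.norm.logpdf(data, loc=mu, scale=sigma))

# The lower bound on sigma keeps the optimizer away from invalid scale values
result = minimize(neg_log_lik, x0=[0.0, 1.0],
                  bounds=[(None, None), (0.01, None)])

mu_hat, sigma_hat = result.x
# Closed-form MLEs for comparison: sample mean and the ddof=0 standard deviation
print(mu_hat, data.mean())
print(sigma_hat, data.std())
```

The fitted values in `result.x` should match the closed forms to within optimizer tolerance. For a different model, only `neg_log_lik`, the starting point, and the bounds need to change.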

Key Takeaways

📋 Summary: Maximum Likelihood Estimation

  • Definition: $\hat{\theta}_{MLE}$ is the parameter value that maximizes the likelihood of the observed data. It is the most widely used estimation method.
  • Log-Likelihood: Convert products to sums via $\ell(\theta) = \sum \log f(x_i|\theta)$. Because the logarithm is strictly increasing, this preserves the argmax and simplifies optimization.
  • Closed-Form MLEs: Normal ($\hat{\mu} = \bar{x}$, biased $\hat{\sigma}^2$), Poisson ($\hat{\lambda} = \bar{x}$), Bernoulli ($\hat{p} = \bar{x}$).
  • Fisher Information: Measures how much information the data carry about $\theta$. Higher $I(\theta)$ → tighter estimates. Cramér-Rao: $\text{Var}(\hat{\theta}) \geq 1/(n I(\theta))$ for unbiased estimators based on $n$ i.i.d. observations.
  • Score Function: Set $S(\theta) = \partial\ell/\partial\theta = 0$ to find the MLE analytically.
  • Numerical Optimization: When no closed form exists, minimize the negative log-likelihood with scipy.optimize.minimize.
  • Connection to ML: Many ML loss functions are negative log-likelihoods: cross-entropy is the negative log-likelihood of a Bernoulli or categorical model, and MSE is that of a Gaussian model with fixed variance, up to constants. MLE unifies statistical and machine learning estimation.
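The link between loss functions and likelihoods can be verified on a few numbers. In the sketch below, the labels, predicted probabilities, and regression targets are all hypothetical values chosen for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical binary labels and predicted probabilities
y = np.array([1, 0, 1, 1, 0])
p = np.array([0.9, 0.2, 0.7, 0.6, 0.1])

# Binary cross-entropy versus the mean Bernoulli negative log-likelihood
cross_entropy = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
bernoulli_nll = -np.mean(stats.bernoulli.logpmf(y, p))

# Hypothetical regression targets and predictions
t = np.array([1.0, 2.0, 3.0])
pred = np.array([1.5, 1.5, 2.0])

# Gaussian NLL with sigma = 1 equals half the squared-error sum plus a constant
gauss_nll = -np.sum(stats.norm.logpdf(t, loc=pred, scale=1.0))
half_sse = 0.5 * np.sum((t - pred) ** 2)
const = 0.5 * len(t) * np.log(2 * np.pi)

print(cross_entropy, bernoulli_nll)
print(gauss_nll, half_sse + const)
```

Because the additive constant does not depend on the predictions, minimizing squared error and maximizing the Gaussian likelihood select the same parameters.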

Deep Dive

For detailed explanations, worked examples, and Python implementations, explore the dedicated statistics lessons:

Point Estimation

Properties of Estimators
