Why It Matters
💡 Why It Matters
Markov chains are foundational to modern AI and machine learning. They model sequential data where the future depends only on the present—not the entire history. This memoryless property makes them computationally tractable and mathematically elegant. PageRank ranks web pages via Markov chains, Hidden Markov Models power speech recognition, and MCMC methods sample from complex distributions. Mastering Markov chains unlocks understanding of reinforcement learning (Markov Decision Processes), natural language processing, computational biology, and financial modeling.
Markov Property
DfMarkov Property
A stochastic process satisfies the Markov property if for all states and all :
The future state depends only on the current state, not on the sequence of past states. This is the memoryless property.
ℹ️ Intuition
Think of a chess game where each move depends only on the current board position—not how you got there. Or weather forecasting: tomorrow's weather depends on today's conditions, not the entire month's history. The Markov property captures this "memoryless" behavior mathematically.
DfMarkov Chain
A Markov chain is a sequence of random variables taking values in a finite or countable state space , satisfying the Markov property. The chain is homogeneous (time-homogeneous) if the transition probabilities do not change with time.
Transition Matrix
DfTransition Probability
The transition probability from state to state in one step is:
For a chain with states, the transition matrix is the matrix where:
Row-Stochastic Matrix
Here,
- =Probability of transitioning from state i to state j
- =Number of states in the state space
💡 Key Properties
- Every entry (probabilities are non-negative)
- Each row sums to 1:
- is a row-stochastic matrix
- The entry of gives the probability of going from state to state in exactly steps
📝Weather Model
Suppose weather has three states: Sunny (S), Cloudy (C), Rainy (R). The transition matrix is:
- Row 1: From Sunny ? 60% Sunny, 30% Cloudy, 10% Rainy
- Row 2: From Cloudy ? 30% Sunny, 40% Cloudy, 30% Rainy
- Row 3: From Rainy ? 20% Sunny, 30% Cloudy, 50% Rainy
State Distribution
State Distribution Evolution
Here,
- =Initial state distribution (row vector)
- =State distribution after n steps
- =n-step transition matrix
The state distribution at time is obtained by multiplying the initial distribution by . If we start in a specific state , then (the -th standard basis vector), and is the -th row of .
📝Distribution Evolution
Starting from Sunny: .
After 1 step:
After 2 steps:
The distribution gradually "forgets" the initial state.
n-Step Transitions
Chapman-Kolmogorov Equation
Here,
- =Probability of going from state i to state j in exactly n steps
- =State space (set of all states)
This is equivalent to matrix multiplication: .
n-Step Transition Matrix
Here,
- =The n-step transition matrix
- =Probability of going from i to j in n steps
⚠️ Computing Large Powers
For large , computing by repeated multiplication is inefficient. Use eigendecomposition: , so . The diagonal matrix is trivial to compute. This also reveals long-term behavior through the eigenvalue 1.
Stationary Distribution
DfStationary Distribution
A probability distribution over the state space is a stationary distribution (or invariant distribution) if:
Equivalently, is a left eigenvector of with eigenvalue 1. Once the chain reaches , it remains in for all future time steps.
Stationary Distribution Equations
Here,
- =Stationary probability of being in state j
- =Transition probability from state i to state j
📝Computing Stationary Distribution
For :
Solve with :
From the first equation:
With the constraint:
Stationary distribution:
💡Python Verification
import numpy as np
P = np.array([[0.9, 0.1],
[0.2, 0.8]])
# Method 1: Eigendecomposition
eigenvalues, eigenvectors = np.linalg.eig(P.T)
idx = np.argmin(np.abs(eigenvalues - 1))
stationary = np.real(eigenvectors[:, idx])
stationary = stationary / stationary.sum()
print(f"Stationary distribution: {stationary}")
# Output: [0.6667 0.3333]
# Method 2: Power iteration
pi = np.array([0.5, 0.5])
for _ in range(100):
pi = pi @ P
print(f"Stationary distribution: {pi}")
# Output: [0.6667 0.3333]
Ergodicity
ThErgodic Theorem for Markov Chains
A finite-state, time-homogeneous Markov chain is ergodic if it is:
- Irreducible: Every state is reachable from every other state (the chain has a single communicating class)
- Aperiodic: The period of every state is 1 (no deterministic cycling)
For an ergodic chain:
The long-run distribution is independent of the initial state, and the stationary distribution exists, is unique, and is positive.
DfIrreducibility
A Markov chain is irreducible if for every pair of states , there exists some such that . Equivalently, the chain can eventually reach any state from any other state.
DfAperiodicity
The period of a state is . A chain is aperiodic if for all . Aperiodicity ensures the chain doesn't get trapped in deterministic cycles.
💡 Practical Implication
If a chain is ergodic, you can find the stationary distribution by simply running the chain for many steps from any starting state. The empirical distribution of visits converges to . This is the basis of Markov Chain Monte Carlo (MCMC) methods.
Absorbing States
DfAbsorbing State
A state is absorbing if once entered, it is never left: . A Markov chain is absorbing if it has at least one absorbing state and it is possible to reach an absorbing state from every state.
Absorption Probability
Here,
- =Probability of eventually being absorbed into state j from state i
For an absorbing chain with transient states and absorbing states , the fundamental matrix is:
where is the submatrix of transitions among transient states. The entry of gives the expected number of times the chain visits transient state starting from transient state before absorption.
📝Gambler's Ruin
A gambler starts with <MathBlock tex=2 and plays rounds of fair coin-flip betting. States: \ />0 (absorbing, ruined), <MathBlock tex=1, \ />2, $3 (absorbing, won). Transition matrix:
Starting with <MathBlock tex=2, what is the probability of ruin? Using the fundamental matrix: />f_{2,0} = 0.5$.
Python Implementation
💡Complete Markov Chain Library
import numpy as np
from typing import List, Tuple, Optional
class MarkovChain:
"""A complete Markov chain implementation."""
def __init__(self, transition_matrix: np.ndarray, state_names: Optional[List[str]] = None):
self.P = np.array(transition_matrix, dtype=float)
n = self.P.shape[0]
assert self.P.shape == (n, n), "Matrix must be square"
assert np.all(self.P >= 0), "All entries must be non-negative"
assert np.allclose(self.P.sum(axis=1), 1), "Rows must sum to 1"
self.n = n
self.state_names = state_names or [str(i) for i in range(n)]
def step(self, state: int) -> int:
"""Take one step from the given state."""
return np.random.choice(self.n, p=self.P[state])
def simulate(self, start: int, steps: int) -> List[int]:
"""Simulate a path of given length starting from state."""
path = [start]
for _ in range(steps):
path.append(self.step(path[-1]))
return path
def n_step_matrix(self, n: int) -> np.ndarray:
"""Compute the n-step transition matrix P^n."""
return np.linalg.matrix_power(self.P, n)
def stationary_distribution(self) -> np.ndarray:
"""Find the stationary distribution via eigendecomposition."""
eigenvalues, eigenvectors = np.linalg.eig(self.P.T)
idx = np.argmin(np.abs(eigenvalues - 1))
stationary = np.real(eigenvectors[:, idx])
return stationary / stationary.sum()
def is_irreducible(self) -> bool:
"""Check if the chain is irreducible."""
reach = np.linalg.matrix_power(self.P, self.n - 1)
return np.all(reach > 0)
def is_aperiodic(self) -> bool:
"""Check if the chain is aperiodic."""
from math import gcd
for i in range(self.n):
periods = []
P_n = np.eye(self.n)
for k in range(1, self.n * self.n + 1):
P_n = P_n @ self.P
if P_n[i, i] > 0:
periods.append(k)
if periods:
g = periods[0]
for p in periods[1:]:
g = gcd(g, p)
if g > 1:
return False
return True
def is_ergodic(self) -> bool:
"""Check if the chain is ergodic (irreducible + aperiodic)."""
return self.is_irreducible() and self.is_aperiodic()
def absorbing_states(self) -> List[int]:
"""Find all absorbing states."""
return [i for i in range(self.n) if self.P[i, i] == 1.0]
def fundamental_matrix(self) -> np.ndarray:
"""Compute the fundamental matrix for absorbing chains."""
absorbing = self.absorbing_states()
transient = [i for i in range(self.n) if i not in absorbing]
if not transient:
return np.eye(0)
Q = self.P[np.ix_(transient, transient)]
return np.linalg.inv(np.eye(len(transient)) - Q)
def steady_state_distribution(self, start: int, steps: int = 10000) -> np.ndarray:
"""Estimate stationary distribution via simulation."""
counts = np.zeros(self.n)
state = start
for _ in range(steps):
state = self.step(state)
counts[state] += 1
return counts / counts.sum()
# === Example Usage ===
# Weather model
P_weather = [[0.6, 0.3, 0.1],
[0.3, 0.4, 0.3],
[0.2, 0.3, 0.5]]
mc = MarkovChain(P_weather, ["Sunny", "Cloudy", "Rainy"])
print("Stationary distribution:", mc.stationary_distribution())
print("Is ergodic:", mc.is_ergodic())
print("10-step transitions:\n", mc.n_step_matrix(10))
# Simulate 30 days
path = mc.simulate(0, 30) # Start from Sunny
weather_counts = [path.count(i) for i in range(3)]
print("Weather distribution:", [c/len(path) for c in weather_counts])
Applications in AI/ML
PageRank
PageRank Formula
Here,
- =PageRank vector (stationary distribution)
- =Damping factor (typically 0.85)
- =Column-stochastic web graph matrix
- =Total number of pages
ℹ️ How PageRank Works
The web is modeled as a directed graph where pages are states and links are transitions. The damping factor means: 85% of the time, follow a random link; 15% of the time, jump to a random page (teleportation). This ensures the chain is ergodic. The PageRank vector is the stationary distribution—pages with higher probability are more "important."
def pagerank(adjacency: np.ndarray, d: float = 0.85, max_iter: int = 100) -> np.ndarray:
"""Compute PageRank via power iteration."""
n = adjacency.shape[0]
col_sums = adjacency.sum(axis=0)
col_sums[col_sums == 0] = 1 # Handle dangling nodes
M = adjacency / col_sums
r = np.ones(n) / n
for _ in range(max_iter):
r_new = (1 - d) / n + d * M @ r
if np.allclose(r, r_new, atol=1e-8):
break
r = r_new
return r
# Example: 4-page web
links = np.array([[0, 1, 0, 1],
[1, 0, 1, 0],
[0, 1, 0, 1],
[1, 0, 1, 0]])
print("PageRank:", pagerank(links))
Markov Chain Monte Carlo (MCMC)
DfMCMC Sampling
Markov Chain Monte Carlo methods construct a Markov chain whose stationary distribution is the target distribution . By running the chain long enough, samples approximate . The Metropolis-Hastings algorithm is the most general MCMC method.
def metropolis_hastings(target_log_prob, proposal, n_samples, x0):
"""Generic Metropolis-Hastings sampler."""
samples = [x0]
x = x0
for _ in range(n_samples):
x_prop = proposal(x)
log_alpha = target_log_prob(x_prop) - target_log_prob(x)
if np.log(np.random.uniform()) < log_alpha:
x = x_prop
samples.append(x)
return np.array(samples)
# Example: Sample from N(0,1)
target = lambda x: -0.5 * x**2
proposal = lambda x: x + np.random.normal(0, 0.5)
samples = metropolis_hastings(target, proposal, 10000, 0.0)
print(f"Mean: {samples.mean():.4f}, Std: {samples.std():.4f}")
Hidden Markov Models (HMMs)
DfHidden Markov Model
A Hidden Markov Model consists of:
- A Markov chain of hidden states with transition matrix
- Observable emissions where
- Initial distribution over hidden states
The joint probability is:
ℹ️ HMM Applications
- Speech recognition: Hidden states = phonemes, observations = audio features
- Bioinformatics: Hidden states = gene types, observations = DNA bases
- Part-of-speech tagging: Hidden states = POS tags, observations = words
- Gesture recognition: Hidden states = gesture phases, observations = sensor data
Forward Algorithm
Here,
- =Forward probability of being in state j at time t
- =Emission probability of observation X_t from state j
Common Mistakes
| Mistake | Why It's Wrong | Correct Approach |
|---|---|---|
| Assuming rows sum to 1 for column-stochastic matrices | PageRank uses column-stochastic matrices; rows may not sum to 1 | Check the convention; normalize columns for PageRank |
| Confusing left and right eigenvectors | Stationary distribution is a left eigenvector: | Solve , not |
| Forgetting the normalization constraint | The stationary distribution must satisfy | Always normalize the eigenvector to sum to 1 |
| Assuming all chains have a stationary distribution | Periodic or reducible chains may not converge | Verify ergodicity (irreducible + aperiodic) first |
| Ignoring dangling nodes in PageRank | Pages with no outgoing links break the Markov chain | Add teleportation or connect dangling nodes to all pages |
| Using power iteration without convergence check | May run too many or too few iterations | Check |
| Confusing transient with absorbing states | Not all states with are absorbing | An absorbing state requires exactly |
Interview Questions
ℹ️ Common Interview Questions
-
What is the Markov property? Give a real-world example. The future depends only on the current state. Example: stock price tomorrow depends only on today's price, not its history.
-
How do you find the stationary distribution of a Markov chain? Solve with . Use eigendecomposition of or power iteration.
-
What conditions guarantee convergence to a stationary distribution? The chain must be ergodic: irreducible and aperiodic. For finite chains, this ensures a unique stationary distribution.
-
Explain how PageRank works as a Markov chain. Each web page is a state; links define transitions. The damping factor adds teleportation for ergodicity. PageRank is the stationary distribution.
-
What is the fundamental matrix in an absorbing chain? where is the transient-to-transient transition submatrix. = expected visits to transient state from state before absorption.
-
How does MCMC use Markov chains? Construct a chain whose stationary distribution is the target distribution. After a burn-in period, samples approximate the target.
-
What is the difference between time-homogeneous and time-inhomogeneous chains? Homogeneous: transition probabilities are constant over time. Inhomogeneous: they change with time varies with .
-
How do you check if a chain is irreducible? Compute for . If all entries are positive, the chain is irreducible.
Practice Problems
📝Problem 1: Two-State Chain
Given , find the stationary distribution and the 3-step transition matrix.
💡Solution 1
Stationary distribution: Solve with :
3-step transition:
📝Problem 2: Absorbing Chain
A rat moves in a 4-cell maze: cells 1, 2, 3 are transient; cell 4 is absorbing (trap). Transitions: from cell 1, equal probability to 2 or 3. From cell 2, equal probability to 1 or 4. From cell 3, equal probability to 1 or 4. Find the probability of reaching cell 4 from cell 1.
💡Solution 2
Transition matrix (states 1, 2, 3, 4):
,
Probability of absorption into cell 4 from cell 1:
📝Problem 3: Random Walk
A symmetric random walk on states has reflecting boundaries: and . Find the stationary distribution.
💡Solution 3
Transition matrix:
Solving : By symmetry, and . From balance equations: and , so .
Normalization:
📝Problem 4: PageRank by Hand
Three pages A, B, C with links: A?B, A?C, B?C, C?A. Compute PageRank with .
💡Solution 4
Adjacency matrix:
Column-stochastic:
PageRank equation:
Solving: (uniform due to the cycle A?B?C?A)
Quick Reference
📋Markov Chains Quick Reference
- Markov Property:
- Transition Matrix: , rows sum to 1
- n-Step: = probability of in steps;
- Chapman-Kolmogorov:
- Stationary Distribution: ,
- Ergodicity: Irreducible + Aperiodic ? unique stationary distribution, convergence from any start
- Irreducible: All states communicate; for large enough
- Aperiodic:
- Absorbing State: ; once entered, never left
- Fundamental Matrix: for absorbing chains
- PageRank: ,
- MCMC: Construct chain with target as stationary distribution; sample after burn-in
- HMMs: Hidden Markov chain + emissions; use forward/backward/Viterbi algorithms
- Python:
np.linalg.eig(P.T)for stationary;np.linalg.matrix_power(P, n)for
Cross-References
ℹ️ Related Topics
- Probability Foundations — Prerequisites: sample spaces, conditional probability
- Bayesian Inference — Posterior updating uses similar matrix computations
- Linear Algebra — Eigendecomposition, matrix powers, and stochastic matrices
- Optimization — PageRank is an eigenvalue problem; MCMC is optimization-adjacent
- Reinforcement Learning — Markov Decision Processes extend Markov chains with actions and rewards
- Information Theory — Entropy and KL divergence relate to MCMC convergence diagnostics