← Math|23 of 100
Linear Algebra

Numerical Linear Algebra

Understand conditioning, stability, iterative methods, and numerical considerations in ML.

📂 Numerical Methods📖 Lesson 23 of 100🎓 Free Course

Advertisement

Numerical Linear Algebra

Numerical linear algebra is the bridge between elegant mathematical theory and practical computation. In theory, we solve Ax=bAx = b by computing x=A−1bx = A^{-1}b. In practice, computing A−1A^{-1} directly is almost never done — it is slow, numerically unstable, and wasteful. Instead, we use carefully designed algorithms that exploit matrix structure and maintain numerical stability. This module covers the essential numerical concepts every ML practitioner must understand: conditioning, stability, floating-point arithmetic, direct and iterative solvers, and their applications in modern AI systems.


1. Why It Matters

â„šī¸ Why It Matters

Numerical linear algebra is not a theoretical curiosity — it is the foundation of nearly every ML system. Training a neural network involves solving large linear systems (in optimizers like L-BFGS), computing matrix factorizations (in PCA, SVD-based embeddings), and inverting Hessians (in second-order optimization). Getting the numerics wrong means silent bugs: models that train on one GPU but diverge on another, solutions that appear correct but are off by 10^6, or systems that hang indefinitely without converging.

In production ML, the difference between a numerically stable and unstable implementation can mean the difference between a model that works and one that silently produces garbage. The 1996 Ariane 5 rocket failure — caused by a floating-point overflow converting a 64-bit float to a 16-bit integer — is a stark reminder that numerical precision matters in safety-critical systems. In ML, the consequences are less dramatic but still costly: wasted compute, incorrect gradients, and models that fail to generalize.

Computational Efficiency

Naive matrix inversion via cofactor expansion is O(n!)O(n!) — completely impractical. Even the adjugate method is O(n4)O(n^4). Modern algorithms achieve:

OperationNaiveOptimizedNotes
Matrix multiplyO(n3)O(n^3)O(n2.373)O(n^{2.373})Strassen, Coppersmith-Winograd
Solve Ax=bAx=bO(n4)O(n^4)O(n3)O(n^3)LU decomposition
EigenvaluesO(n4)O(n^4)O(n3)O(n^3)QR algorithm
SVDO(n4)O(n^4)O(n3)O(n^3)Golub-Reinsch

For sparse matrices with kk nonzero entries, iterative methods reduce complexity to O(k)O(k) per iteration, making problems with millions of variables tractable.

The Cost of Ignoring Numerics

import numpy as np

# Hilbert matrix: H_ij = 1/(i+j-1)
# Theoretical solution: x = [1, 1, ..., 1]
n = 12
H = np.array([[1/(i+j-1) for j in range(n)] for i in range(n)])
b = H @ np.ones(n)

# "Solve" Ax = b
x_numpy = np.linalg.solve(H, b)
print(f"Error (np.linalg.solve): {np.linalg.norm(x_numpy - np.ones(n)):.2e}")
# ~1e-4 (terrible for a 12x12 system!)

# The condition number tells you WHY
print(f"Condition number: {np.linalg.cond(H):.2e}")
# ~10^16 — the matrix is nearly singular in floating point

2. Condition Number

DfCondition Number

The condition number of a matrix AA measures how sensitive the solution of Ax=bAx = b is to perturbations in bb (or AA). For a nonsingular matrix with respect to the 2-norm:

Îē2(A)=âˆĨAâˆĨ2⋅âˆĨA−1âˆĨ2=΃max⁥(A)΃min⁥(A)\kappa_2(A) = \|A\|_2 \cdot \|A^{-1}\|_2 = \frac{\sigma_{\max}(A)}{\sigma_{\min}(A)}

where ΃max⁥\sigma_{\max} and ΃min⁥\sigma_{\min} are the largest and smallest singular values of AA.

Condition Number Properties

What the Condition Number Tells You

âš ī¸ Interpreting Îē(A)

If Îē(A)=10k\kappa(A) = 10^k, then you lose approximately kk digits of accuracy when solving Ax=bAx = b. For double-precision arithmetic (Īĩ≈10−16\epsilon \approx 10^{-16}):

  • Îē(A)=101\kappa(A) = 10^1: 15 digits accurate — excellent
  • Îē(A)=104\kappa(A) = 10^4: 12 digits accurate — good
  • Îē(A)=108\kappa(A) = 10^8: 8 digits accurate — moderate
  • Îē(A)=1012\kappa(A) = 10^{12}: 4 digits accurate — poor
  • Îē(A)=1016\kappa(A) = 10^{16}: 0 digits accurate — essentially random

Sensitivity Analysis

For the system Ax=bAx = b, if bb is perturbed by δb\delta b, the relative error in xx satisfies:

âˆĨδxâˆĨâˆĨxâˆĨ≤Îē(A)⋅âˆĨδbâˆĨâˆĨbâˆĨ\frac{\|\delta x\|}{\|x\|} \leq \kappa(A) \cdot \frac{\|\delta b\|}{\|b\|}

Similarly, if AA is perturbed by δA\delta A:

âˆĨδxâˆĨâˆĨx+δxâˆĨ≤Îē(A)⋅âˆĨδAâˆĨâˆĨAâˆĨ\frac{\|\delta x\|}{\|x + \delta x\|} \leq \kappa(A) \cdot \frac{\|\delta A\|}{\|A\|}

This is a worst-case bound — the actual error may be much smaller, but you cannot know this without computing Îē(A)\kappa(A).

📝Condition Number of Common Matrices

Identity matrix InI_n: Îē(I)=1\kappa(I) = 1 (perfectly conditioned)

Diagonal D=diag(d1,â€Ļ,dn)D = \text{diag}(d_1, \ldots, d_n): Îē(D)=âˆŖdmaxâĄâˆŖ/âˆŖdminâĄâˆŖ\kappa(D) = |d_{\max}| / |d_{\min}|

Hilbert matrix Hij=1i+j−1H_{ij} = \frac{1}{i+j-1}: Îē(Hn)âˆŧ(1+2)4nn\kappa(H_n) \sim \frac{(1+\sqrt{2})^{4n}}{\sqrt{n}} — grows exponentially

Vandermonde matrix: Îē\kappa grows exponentially with nn

Random Gaussian matrix AijâˆŧN(0,1)A_{ij} \sim \mathcal{N}(0,1): Îē(A)≈2n\kappa(A) \approx 2\sqrt{n} with high probability

import numpy as np
import matplotlib.pyplot as plt

# Condition number growth for Hilbert matrices
ns = range(2, 15)
kappas = []
for n in ns:
    H = np.array([[1/(i+j+1) for j in range(n)] for i in range(n)])
    kappas.append(np.linalg.cond(H))

for n, k in zip(ns, kappas):
    print(f"H_{n}: Îē = {k:.2e}")

# For n >= 12, Îē exceeds 10^16 — beyond double precision capability

3. Stability

DfForward Stability

An algorithm is forward stable if it computes a result x~x˃ that is close to the true answer xx. That is, the error âˆĨx~−xâˆĨ\|x˃ - x\| is small. This is the intuitive notion of "getting the right answer."

DfBackward Stability

An algorithm is backward stable if it computes x~x˃ that is the exact solution to a slightly perturbed problem. That is, there exists a small δA\delta A (or δb\delta b) with âˆĨδAâˆĨ/âˆĨAâˆĨ\|\delta A\| / \|A\| small such that (A+δA)x~=b(A + \delta A)x˃ = b (or Ax~=b+δbAx˃ = b + \delta b).

Backward stability is stronger and more desirable than forward stability: if an algorithm is backward stable, the error is always bounded by Îē(A)⋅Īĩ\kappa(A) \cdot \epsilon — the algorithm does the best possible job given the conditioning of the problem.

ThBackward Error Theorem

For a backward stable algorithm solving Ax=bAx = b:

âˆĨx~−xâˆĨâˆĨxâˆĨ≤Îē(A)⋅Īĩalgo\frac{\|x˃ - x\|}{\|x\|} \leq \kappa(A) \cdot \epsilon_{\text{algo}}

where Īĩalgo\epsilon_{\text{algo}} is the backward error of the algorithm. This is the fundamental relationship between conditioning and stability.

Stability of Common Algorithms

AlgorithmForward Stable?Backward Stable?Notes
Naive Gaussian eliminationYesNoCan have large backward error
Gaussian elimination + partial pivotingYesYes (nearly)Standard in LAPACK
QR decomposition (Householder)YesYesMost stable general method
Normal equations for least squaresNoNoSquare condition number
QR for least squaresYesYesRecommended approach
SVDYesYesMost robust, most expensive

âš ī¸ Pivoting in Gaussian Elimination

Without pivoting, Gaussian elimination can be catastrophically unstable even for well-conditioned matrices. Consider:

A=(Īĩ111)A = \begin{pmatrix} \epsilon & 1 \\ 1 & 1 \end{pmatrix}

Without pivoting, we divide by Īĩ\epsilon, amplifying errors by 1/Īĩ1/\epsilon. With partial pivoting (swapping rows to put the largest element on the diagonal), the backward error is bounded by a modest factor of nn.

import numpy as np

def gaussian_elimination_no_pivot(A, b):
    """Unstable Gaussian elimination (no pivoting)."""
    n = len(b)
    Ab = np.hstack([A.astype(float), b.reshape(-1, 1)])
    for col in range(n):
        for row in range(col + 1, n):
            factor = Ab[row, col] / Ab[col, col]
            Ab[row, col:] -= factor * Ab[col, col:]
    # Back substitution
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (Ab[i, -1] - Ab[i, i+1:n] @ x[i+1:]) / Ab[i, i]
    return x

# Near-singular example
A = np.array([[1e-10, 1], [1, 1]], dtype=float)
b = A @ np.array([1.0, 1.0])

# numpy uses stable LAPACK (LU with partial pivoting)
x_stable = np.linalg.solve(A, b)
print(f"Stable: {x_stable} (error: {np.linalg.norm(x_stable - [1,1]):.2e})")

# No-pivot version may fail
try:
    x_unstable = gaussian_elimination_no_pivot(A, b)
    print(f"Unstable: {x_unstable} (error: {np.linalg.norm(x_unstable - [1,1]):.2e})")
except ZeroDivisionError:
    print("Unstable: Division by zero!")

4. Floating Point Arithmetic

IEEE 754 Standard

DfFloating Point Representation

A double-precision float (IEEE 754) stores a number as:

x=(−1)s×1.f×2e−1023x = (-1)^s \times 1.f \times 2^{e - 1023}

where ss is the sign bit, ff is the 52-bit mantissa (fraction), and ee is the 11-bit biased exponent. This gives:

  • Range: Âą10Âą308\pm 10^{\pm 308}
  • Precision: ≈15−16\approx 15-16 decimal digits
  • Machine epsilon: Īĩ=2−52≈2.22×10−16\epsilon = 2^{-52} \approx 2.22 \times 10^{-16}

â„šī¸ Machine Epsilon

Machine epsilon Īĩ\epsilon is the smallest number such that fl(1+Īĩ)≠1\text{fl}(1 + \epsilon) \neq 1 in floating-point arithmetic. Equivalently, it bounds the relative rounding error of any basic operation:

fl(a∘b)=(a∘b)(1+δ),âˆŖÎ´âˆŖâ‰¤Īĩ\text{fl}(a \circ b) = (a \circ b)(1 + \delta), \quad |\delta| \leq \epsilon

For double precision: Īĩ≈2.22×10−16\epsilon \approx 2.22 \times 10^{-16}. For single precision: Īĩ≈1.19×10−7\epsilon \approx 1.19 \times 10^{-7}.

Roundoff Error

Every floating-point operation introduces a relative error of at most Īĩ\epsilon. After nn operations, errors can accumulate. For a sum of nn numbers:

fl(∑i=1nxi)=∑i=1nxi(1+δi),âˆŖÎ´iâˆŖâ‰¤nĪĩ\text{fl}\left(\sum_{i=1}^n x_i\right) = \sum_{i=1}^n x_i (1 + \delta_i), \quad |\delta_i| \leq n\epsilon

This linear growth in error is why compensated summation (Kahan summation) is important for large reductions.

Catastrophic Cancellation

DfCatastrophic Cancellation

Catastrophic cancellation occurs when subtracting two nearly equal floating-point numbers, causing a dramatic loss of significant digits. If aa and bb agree to kk significant digits, a−ba - b has only k−log⁡10(a/b)k - \log_{10}(a/b) significant digits.

Example: a=1.23456789012345a = 1.23456789012345, b=1.23456789012344b = 1.23456789012344. Both have 16 digits, but a−b=10−14a - b = 10^{-14} has only 1 digit.

import numpy as np

# Classic catastrophic cancellation
a = 1.0
b = 1.0 + 1e-15
print(f"Direct subtraction: {(b - a) - 1e-15}")  # May not be exactly 0

# Catastrophic cancellation in quadratic formula
# For b >> sqrt(b^2 - 4ac), the standard formula loses precision
a_coeff, b_coeff, c_coeff = 1, 1e8, 1
sqrt_disc = np.sqrt(b_coeff**2 - 4*a_coeff*c_coeff)

x1_standard = (-b_coeff + sqrt_disc) / (2 * a_coeff)
x2_standard = (-b_coeff - sqrt_disc) / (2 * a_coeff)
print(f"Standard: x1 = {x1_standard:.6e}, x2 = {x2_standard:.6e}")

# Stable alternative: use the other root to avoid cancellation
x2_stable = c_coeff / (a_coeff * x1_standard)  # = c/(a*x1)
x1_stable = (-b_coeff - np.sign(b_coeff) * sqrt_disc) / (2 * a_coeff)
print(f"Stable:   x1 = {x1_stable:.6e}, x2 = {x2_stable:.6e}")

Kahan Summation

Kahan Compensated Summation

def kahan_sum(values):
    """Kahan compensated summation — nearly exact for large sums."""
    s = 0.0
    c = 0.0  # compensation for lost low-order bits
    for x in values:
        y = x - c
        t = s + y
        c = (t - s) - y  # algebraically zero, but captures rounding error
        s = t
    return s

# Demonstration: sum 1e7 values of 1.0
n = 10_000_000
values = [1.0] * n
print(f"Naive sum:   {sum(values) - n}")       # May show rounding error
print(f"Kahan sum:   {kahan_sum(values) - n}")  # Exact (or nearly so)

# More dramatic: sum decreasing values
vals = [1e16, 1.0, -1e16]
print(f"Naive:   {sum(vals)}")          # 0.0 (correct by luck)
vals2 = [1e16, 1.0, -1e16, 1.0, -1e16]
print(f"Naive:   {sum(vals2)}")         # 0.0 (lost the 1.0!)
print(f"Kahan:   {kahan_sum(vals2)}")   # 1.0 (correct)

5. Direct Methods

Gaussian Elimination

DfGaussian Elimination

Gaussian elimination transforms Ax=bAx = b into an upper triangular system Ux=yUx = y via row operations, then solves by back substitution. With partial pivoting (row swaps to maximize the pivot element), the algorithm is:

  1. For each column k=1,â€Ļ,n−1k = 1, \ldots, n-1:
    • Find row pâ‰Ĩkp \geq k with âˆŖapkâˆŖ=max⁥iâ‰ĨkâˆŖaikâˆŖ|a_{pk}| = \max_{i \geq k} |a_{ik}|
    • Swap rows kk and pp
    • For each row i>ki > k: eliminate aika_{ik} by row subtraction
  2. Back substitution: solve Ux=yUx = y from bottom to top

Total cost: 23n3\frac{2}{3}n^3 flops for the factorization, O(n2)O(n^2) for the solve.

LU Decomposition

DfLU Decomposition

The LU decomposition factors a matrix as A=PLUA = PLU where:

  • PP is a permutation matrix (from partial pivoting)
  • LL is lower triangular with 1s on the diagonal
  • UU is upper triangular

Once computed, solving Ax=bAx = b reduces to two triangular solves: Ly=PbLy = Pb, then Ux=yUx = y.

import numpy as np
from scipy.linalg import lu, solve_triangular

A = np.array([[2, 1, 1], [4, 3, 3], [8, 7, 9]], dtype=float)
b = np.array([1, 2, 3], dtype=float)

# LU decomposition via scipy
P, L, U = lu(A)

# Verify: PA = LU
print("PA = LU check:", np.allclose(P @ A, L @ U))

# Solve using LU factors
y = solve_triangular(L, P @ b, lower=True)
x = solve_triangular(U, y)
print(f"Solution: {x}")
print(f"Verify:   {A @ x}")  # Should equal b

# numpy's solve uses LAPACK's LU implementation internally
x_numpy = np.linalg.solve(A, b)
print(f"numpy:    {x_numpy}")

Cholesky Decomposition

ThCholesky Decomposition

For a symmetric positive definite (SPD) matrix AA, the Cholesky decomposition gives A=LLTA = LL^T where LL is lower triangular with positive diagonal entries.

  • Cost: 13n3\frac{1}{3}n^3 flops (half the cost of LU)
  • Uniquely defined (no pivoting needed)
  • Numerically stable without pivoting
  • Exists if and only if AA is SPD

The Cholesky decomposition is the method of choice for solving SPD systems (e.g., normal equations, covariance matrices in Gaussian processes).

import numpy as np
from scipy.linalg import cho_factor, cho_solve

# Create SPD matrix
A = np.array([[4, 2, 1], [2, 5, 3], [1, 3, 6]], dtype=float)
b = np.array([1, 2, 3], dtype=float)

# Cholesky decomposition
L = np.linalg.cholesky(A)
print("A = LL^T check:", np.allclose(L @ L.T, A))

# Solve using Cholesky
c, low = cho_factor(A)
x = cho_solve((c, low), b)
print(f"Solution: {x}")
print(f"Verify:   {A @ x}")

6. Iterative Methods

Iterative methods are essential for large-scale problems where direct methods are too expensive (e.g., finite element systems with millions of unknowns, or sparse systems arising in graph algorithms).

Jacobi Iteration

DfJacobi Iteration

The Jacobi method splits A=D+L+UA = D + L + U (diagonal, strictly lower, strictly upper) and iterates:

x(k+1)=D−1(b−(L+U)x(k))x^{(k+1)} = D^{-1}(b - (L + U)x^{(k)})

Each iteration costs O(n2)O(n^2) (or O(nnz)O(\text{nnz}) for sparse AA). Convergence is guaranteed if AA is strictly diagonally dominant: âˆŖaiiâˆŖ>∑j≠iâˆŖaijâˆŖ|a_{ii}| > \sum_{j \neq i} |a_{ij}| for all ii.

Gauss-Seidel Iteration

DfGauss-Seidel Iteration

Gauss-Seidel is identical to Jacobi but uses the most recently computed values immediately:

xi(k+1)=1aii(bi−∑j<iaijxj(k+1)−∑j>iaijxj(k))x_i^{(k+1)} = \frac{1}{a_{ii}} \left(b_i - \sum_{j < i} a_{ij} x_j^{(k+1)} - \sum_{j > i} a_{ij} x_j^{(k)}\right)

This typically converges twice as fast as Jacobi but is harder to parallelize (each update depends on the previous one in the same sweep).

Conjugate Gradient

DfConjugate Gradient Method

The Conjugate Gradient (CG) method solves Ax=bAx = b where AA is SPD. It generates a sequence of iterates x(k)x^{(k)} that minimize the AA-norm of the error over expanding Krylov subspaces:

x(k)=arg⁥min⁥x∈x0+Kk(A,r0)âˆĨx−x∗âˆĨAx^{(k)} = \arg\min_{x \in x_0 + \mathcal{K}_k(A, r_0)} \|x - x^*\|_A

where Kk(A,r0)=span{r0,Ar0,A2r0,â€Ļ,Ak−1r0}\mathcal{K}_k(A, r_0) = \text{span}\{r_0, Ar_0, A^2r_0, \ldots, A^{k-1}r_0\} is the Krylov subspace.

CG Algorithm

Iterative Methods Comparison

MethodPer-Iteration CostMemoryConvergenceParallelizable?
JacobiO(nnz)O(\text{nnz})O(n)O(n)SlowYes
Gauss-SeidelO(nnz)O(\text{nnz})O(n)O(n)MediumNo
SORO(nnz)O(\text{nnz})O(n)O(n)Fast (tuned Ή\omega)No
Conjugate GradientO(nnz)O(\text{nnz})O(n)O(n)Fast (for SPD)Partially
GMRESO(k⋅nnz)O(k \cdot \text{nnz})O(kn)O(kn)Fast (general)Partially
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import cg, gmres

# Create a large SPD tridiagonal system
n = 10000
A = diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format='csr')
b = np.ones(n)

# Conjugate Gradient
x_cg, info = cg(A, b, tol=1e-10, maxiter=1000)
print(f"CG: {np.linalg.norm(A @ x_cg - b):.2e} residual, info={info}")

# Jacobi iteration (manual)
def jacobi(A, b, x0=None, tol=1e-10, maxiter=1000):
    x = x0 if x0 is not None else np.zeros_like(b)
    D = A.diagonal()
    for k in range(maxiter):
        r = b - A @ x
        if np.linalg.norm(r) < tol:
            return x, k
        x = x + r / D  # Jacobi step: x_new = D^{-1}(b - (A-D)x)
    return x, maxiter

x_jac, iters = jacobi(A, b)
print(f"Jacobi: {np.linalg.norm(A @ x_jac - b):.2e} residual, {iters} iterations")

7. Matrix Norms for Numerical Analysis

DfSpectral Norm

The spectral norm (operator 2-norm) of AA is:

âˆĨAâˆĨ2=΃max⁥(A)=Îģmax⁥(ATA)\|A\|_2 = \sigma_{\max}(A) = \sqrt{\lambda_{\max}(A^T A)}

This is the maximum stretching factor of AA: âˆĨAâˆĨ2=max⁥âˆĨxâˆĨ=1âˆĨAxâˆĨ\|A\|_2 = \max_{\|x\|=1} \|Ax\|.

Key Matrix Norms

â„šī¸ Why the Spectral Norm for Conditioning?

The condition number Îē2(A)=âˆĨAâˆĨ2âˆĨA−1âˆĨ2\kappa_2(A) = \|A\|_2 \|A^{-1}\|_2 uses the spectral norm because it captures the true geometric distortion: AA maps the unit sphere to an ellipsoid with semi-axes equal to the singular values. The ratio of the longest to shortest semi-axis is Îē2\kappa_2. Other norms give different condition numbers, but Îē2\kappa_2 is the most meaningful for understanding sensitivity.


8. Least Squares Problems

When Ax=bAx = b has no exact solution (overdetermined system, m>nm > n), we seek xx that minimizes âˆĨAx−bâˆĨ2\|Ax - b\|_2.

Normal Equations

ThNormal Equations

The least squares solution satisfies the normal equations:

ATAx=ATbA^T A x = A^T b

If AA has full column rank, ATAA^T A is SPD and the unique solution is x=(ATA)−1ATbx = (A^T A)^{-1} A^T b.

Problem: Îē(ATA)=Îē(A)2\kappa(A^T A) = \kappa(A)^2. If Îē(A)=108\kappa(A) = 10^8, then Îē(ATA)=1016\kappa(A^T A) = 10^{16} — catastrophic loss of precision!

QR Decomposition Approach

DfQR Decomposition for Least Squares

Factor A=QRA = QR where QQ is orthogonal (m×nm \times n) and RR is upper triangular (n×nn \times n). Then:

âˆĨAx−bâˆĨ2=âˆĨQRx−bâˆĨ2=âˆĨRx−QTbâˆĨ2\|Ax - b\|_2 = \|QRx - b\|_2 = \|Rx - Q^T b\|_2

since orthogonal transformations preserve the 2-norm. The solution is Rx=QTbRx = Q^T b (upper triangular solve), with Îē=Îē(A)\kappa = \kappa(A) — no squaring!

SVD Approach

DfSVD for Least Squares

The SVD A=UÎŖVTA = U\Sigma V^T gives the most robust solution:

x=VÎŖ+UTbx = V \Sigma^+ U^T b

where ÎŖ+\Sigma^+ is the pseudoinverse (reciprocal of nonzero singular values, zero otherwise). This handles rank-deficient matrices gracefully and makes the numerical rank explicit.

import numpy as np

# Overdetermined system: 5 equations, 3 unknowns
np.random.seed(42)
A = np.random.randn(5, 3)
b = np.random.randn(5)

# Method 1: Normal equations (UNSTABLE)
x_normal = np.linalg.solve(A.T @ A, A.T @ b)
print(f"Normal equations: residual = {np.linalg.norm(A @ x_normal - b):.2e}")
print(f"  Îē(A^T A) = {np.linalg.cond(A.T @ A):.2e}")

# Method 2: QR decomposition (STABLE)
x_qr, res_qr, rank_qr, sv_qr = np.linalg.lstsq(A, b, rcond=None)
print(f"QR (lstsq):       residual = {np.linalg.norm(A @ x_qr - b):.2e}")

# Method 3: SVD (MOST STABLE)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
x_svd = Vt.T @ np.diag(1/s) @ U.T @ b
print(f"SVD:              residual = {np.linalg.norm(A @ x_svd - b):.2e}")

# Rank-deficient example
A_rankdef = np.array([[1, 2], [2, 4], [3, 6]], dtype=float)  # rank 1
b_rankdef = np.array([1, 2, 3.1], dtype=float)
x_rd = np.linalg.lstsq(A_rankdef, b_rankdef, rcond=None)[0]
print(f"\nRank-deficient least squares: x = {x_rd}")
print(f"  Residual: {np.linalg.norm(A_rankdef @ x_rd - b_rankdef):.2e}")

9. Sparse Matrix Algorithms

Why Sparse?

â„šī¸ Sparsity in Practice

Real-world matrices are often sparse: adjacency matrices of graphs, discretizations of PDEs, TF-IDF matrices in NLP. Storing only nonzero entries reduces memory from O(n2)O(n^2) to O(nnz)O(\text{nnz}), and operations like matrix-vector multiply from O(n2)O(n^2) to O(nnz)O(\text{nnz}).

For a 2D Laplacian on an n×nn \times n grid: the matrix is n2×n2n^2 \times n^2 (106×10610^6 \times 10^6 for n=1000n=1000) but has only âˆŧ5n2\sim 5n^2 nonzeros — a density of 5/n2≈0.0005%5/n^2 \approx 0.0005\%.

Sparse Storage Formats

FormatBest ForMemoryAccess
COOConstruction3×nnz3 \times \text{nnz}Slow
CSRRow slicing, SpMVnnz+n+1\text{nnz} + n + 1Fast row access
CSCColumn slicingnnz+n+1\text{nnz} + n + 1Fast column access
DIADiagonal structureO(nnz)O(\text{nnz})Very fast for banded
BSRBlock structureVariesFast for block patterns

Preconditioning

DfPreconditioning

A preconditioner MM approximates A−1A^{-1} and transforms the system to M−1Ax=M−1bM^{-1}Ax = M^{-1}b, which has better spectral properties. The ideal preconditioner M=AM = A makes the system trivial, but MM must be cheap to apply.

Common preconditioners:

  • Jacobi (diagonal): M=diag(A)M = \text{diag}(A) — trivial, sometimes effective
  • Incomplete LU (ILU): approximate LU factorization with fill-in limit
  • Algebraic Multigrid (AMG): hierarchical method, optimal for elliptic PDEs
  • SSOR: symmetric successive over-relaxation
import numpy as np
from scipy.sparse import diags, random
from scipy.sparse.linalg import cg, spilu, LinearOperator

# Create a difficult SPD system (convection-diffusion)
n = 500
A = diags([-1, 4, -1], [-1, 0, 1], shape=(n, n), format='csr')
A = A + 0.1 * random(n, n, density=0.01, format='csr')  # Add some non-symmetric perturbation
A = A @ A.T  # Make SPD
b = np.random.randn(n)

# Without preconditioning
x_no_precon, info1 = cg(A, b, tol=1e-10, maxiter=1000)
print(f"No preconditioner: {np.linalg.norm(A @ x_no_precon - b):.2e}")

# With Jacobi (diagonal) preconditioner
M_inv = 1.0 / A.diagonal()
M = LinearOperator(A.shape, matvec=lambda x: M_inv * x)
x_jac, info2 = cg(A, b, tol=1e-10, maxiter=1000, M=M)
print(f"Jacobi precon:     {np.linalg.norm(A @ x_jac - b):.2e}")

# With incomplete Cholesky
ilu = spilu(A.tocsc(), drop_tol=1e-4)
M_ilu = LinearOperator(A.shape, matvec=ilu.solve)
x_ilu, info3 = cg(A, b, tol=1e-10, maxiter=1000, M=M_ilu)
print(f"ILU precon:        {np.linalg.norm(A @ x_ilu - b):.2e}")

10. Python Implementation

NumPy Linear Algebra

import numpy as np

# === Core Operations ===

# Solve Ax = b
A = np.array([[3, 1], [1, 2]], dtype=float)
b = np.array([9, 8], dtype=float)
x = np.linalg.solve(A, b)
print(f"Solve: x = {x}")

# Determinant and inverse (AVOID inverting in production!)
print(f"det(A) = {np.linalg.det(A):.4f}")
print(f"inv(A) = \n{np.linalg.inv(A)}")

# Matrix factorizations
Q, R = np.linalg.qr(A)
print(f"\nQR: Q = \n{Q}\nR = \n{R}")

L = np.linalg.cholesky(A @ A.T + np.eye(2))  # Ensure SPD
print(f"Cholesky: L = \n{L}")

U, s, Vt = np.linalg.svd(A)
print(f"\nSVD: singular values = {s}")

# Eigenvalues
eigvals, eigvecs = np.linalg.eig(A)
print(f"Eigenvalues: {eigvals}")

# Matrix norms
print(f"\n||A||_F = {np.linalg.norm(A, 'fro'):.4f}")
print(f"||A||_2 = {np.linalg.norm(A, 2):.4f}")
print(f"||A||_1 = {np.linalg.norm(A, 1):.4f}")
print(f"||A||_inf = {np.linalg.norm(A, np.inf):.4f}")

# Condition number
print(f"Îē(A) = {np.linalg.cond(A):.4f}")
print(f"Îē(A^T A) = {np.linalg.cond(A.T @ A):.4f}")  # Îē^2

SciPy Sparse Linear Algebra

from scipy import sparse
from scipy.sparse import linalg as spla
import numpy as np

# Create sparse matrix (5-point Laplacian)
n = 100
A = sparse.diags([-1, 4, -1], [-1, 0, 1], shape=(n, n), format='csr')
b = np.random.randn(n)

# Solve sparse system
x, info = spla.cg(A, b, tol=1e-10)
print(f"CG solve: residual = {np.linalg.norm(A @ x - b):.2e}")

# Sparse eigenvalues (largest/smallest)
vals, vecs = spla.eigs(A, k=5, which='LM')
print(f"Largest eigenvalues: {np.sort(vals.real)}")

vals_sm, _ = spla.eigs(A, k=3, which='SM')
print(f"Smallest eigenvalues: {np.sort(vals_sm.real)}")

# Sparse SVD (useful for large-scale PCA)
U, s, Vt = spla.svds(A, k=5)
print(f"Top 5 singular values: {s}")

Practical Patterns

📝Avoiding Common Pitfalls

DO: Use np.linalg.solve for dense, scipy.sparse.linalg.cg for sparse SPD, scipy.sparse.linalg.gmres for sparse non-symmetric.

DON'T: Use np.linalg.inv(A) @ b — it is slower and less stable than np.linalg.solve(A, b). Don't use normal equations A.T @ A for least squares — use np.linalg.lstsq.


11. Applications in AI/ML

Training Stability

â„šī¸ Numerical Issues in Deep Learning

Deep learning training involves repeated matrix-vector products (forward pass), matrix transposes (backward pass), and parameter updates. Numerical issues manifest as:

  • Exploding/vanishing gradients: The condition number of the Jacobian determines whether gradients grow or shrink exponentially through layers
  • Loss of precision in mixed-precision training: FP16 has Īĩ≈5×10−3\epsilon \approx 5 \times 10^{-3}, making operations on small gradients or large activations unstable
  • Ill-conditioned Hessians: Second-order optimizers must solve systems involving ∇2L\nabla^2 L, which is often extremely ill-conditioned

Layer normalization and batch normalization are partly motivated by keeping intermediate activations well-conditioned.

Numerical Precision in Deep Learning

import numpy as np

# Simulating mixed-precision issues
def forward_pass_fp32(W, x):
    """Full precision forward pass."""
    return W @ x

def forward_pass_fp16(W, x):
    """Simulated mixed-precision forward pass."""
    W_16 = W.astype(np.float16)
    x_16 = x.astype(np.float16)
    result = W_16 @ x_16
    return result.astype(np.float32)

np.random.seed(42)
W = np.random.randn(100, 100) * 0.1
x = np.random.randn(100)

y_fp32 = forward_pass_fp32(W, x)
y_fp16 = forward_pass_fp16(W, x)

print(f"FP32 result norm: {np.linalg.norm(y_fp32):.6f}")
print(f"FP16 result norm: {np.linalg.norm(y_fp16):.6f}")
print(f"Relative error:   {np.linalg.norm(y_fp32 - y_fp16) / np.linalg.norm(y_fp32):.2e}")

# Condition number of a typical layer's weight matrix
W_layer = np.random.randn(512, 512) * (2 / 512)**0.5  # Xavier init
print(f"\nLayer condition number: {np.linalg.cond(W_layer):.2f}")
# Usually ~10-100 with good initialization, much worse after training

Key Applications

ApplicationNumerical ChallengeSolution
PCA / SVD embeddingsLarge-scale SVDTruncated SVD, randomized SVD
Gaussian processesO(n3)O(n^3) CholeskySparse approximations, inducing points
Neural ODE trainingStiff ODE systemsImplicit solvers, adjoint method
Recommendation systemsSparse least squaresALS, SGD (avoids forming normal equations)
Graph neural networksSparse matrix operationsSpMV with CSR format

12. Common Mistakes

âš ī¸ Common Mistakes in Numerical Linear Algebra

MistakeWhy It's WrongCorrect Approach
Using inv(A) @ b instead of solve(A, b)2x slower, less stableUse np.linalg.solve
Solving via normal equations ATAx=ATbA^T A x = A^T bSquares condition numberUse QR or SVD
Ignoring condition numberMay get garbage results with no errorAlways check np.linalg.cond
Not using pivoting in Gaussian eliminationCatastrophic instabilityUse LAPACK's pivoted LU
Comparing floating-point numbers with ==Roundoff makes exact equality rareUse np.allclose or tolerance
Using float32 for large summationsAccumulated error growsUse float64 or Kahan summation
Assuming iterative methods convergeDivergence is silentMonitor residual; use preconditioning
Using dense algorithms on sparse matricesO(n2)O(n^2) memory, O(n3)O(n^3) timeUse scipy.sparse and sparse solvers
Not checking matrix properties (SPD, symmetry)Wrong solver, wrong resultsVerify before choosing solver
import numpy as np

# BAD: Using inv(A) @ b
A = np.array([[1, 2], [3, 4]], dtype=float)
b = np.array([5, 6], dtype=float)

# Wrong
x_bad = np.linalg.inv(A) @ b  # Slower, less stable

# Correct
x_good = np.linalg.solve(A, b)  # Direct, stable

print(f"inv @ b: {x_bad}")
print(f"solve:  {x_good}")
print(f"Equal:  {np.allclose(x_bad, x_good)}")

# BAD: Comparing floats with ==
a = 0.1 + 0.2
print(f"\n0.1 + 0.2 == 0.3? {a == 0.3}")          # False!
print(f"0.1 + 0.2 ≈ 0.3?  {np.isclose(a, 0.3)}")  # True

13. Interview Questions

📝Interview Question 1: Why Not Invert the Matrix?

Q: Why should we never compute A−1A^{-1} and multiply by bb to solve Ax=bAx = b?

A: Three reasons: (1) Computing A−1A^{-1} costs 23n3\frac{2}{3}n^3 flops, then multiplying A−1bA^{-1}b costs another n2n^2 — LU decomposition with back substitution costs 23n3\frac{2}{3}n^3 total and does both at once. (2) The product A−1bA^{-1}b can amplify roundoff errors more than the triangular solves in LU. (3) Storing A−1A^{-1} is expensive (n2n^2 elements) and dense even when AA is sparse. LAPACK and all serious linear algebra libraries compute solves, not inverses.

📝Interview Question 2: Condition Number of Normal Equations

Q: If Îē(A)=106\kappa(A) = 10^6, what is the condition number when solving via normal equations?

A: Îē(ATA)=Îē(A)2=1012\kappa(A^T A) = \kappa(A)^2 = 10^{12}. With double precision (Īĩ≈10−16\epsilon \approx 10^{-16}), you lose about 12 digits, leaving only 4 digits of accuracy. This is why QR or SVD should be used instead of normal equations for least squares problems.

📝Interview Question 3: When to Use Iterative vs Direct Methods?

Q: When would you choose Conjugate Gradient over Cholesky decomposition?

A: Cholesky is preferred when AA is small-to-medium sized (n<104n < 10^4) and dense — it gives the exact solution in O(n3)O(n^3) time and O(n2)O(n^2) memory. CG is preferred when AA is large and sparse — it uses only matrix-vector products (costing O(nnz)O(\text{nnz}) each), needs O(nnz)O(\text{nnz}) memory, and converges in O(Îē)O(\sqrt{\kappa}) iterations. For a sparse n=106n = 10^6 system with Îē=100\kappa = 100, CG converges in âˆŧ100\sim 100 iterations, each costing O(nnz)O(\text{nnz}).

📝Interview Question 4: Floating Point Identity

Q: Is floating-point addition associative? Give an example.

A: No. Let a=1016a = 10^{16}, b=1.0b = 1.0, c=−1016c = -10^{16}. Then (a+b)+c=1016+1−1016=0(a + b) + c = 10^{16} + 1 - 10^{16} = 0, but a+(b+c)=1016+(1−1016)=1016+(−1016)=0a + (b + c) = 10^{16} + (1 - 10^{16}) = 10^{16} + (-10^{16}) = 0. Actually in this case both are 0, but consider: (a+c)+b=0+1=1(a + c) + b = 0 + 1 = 1, while (a+b)+c=0(a + b) + c = 0. The ordering matters because intermediate results may overflow or underflow.

📝Interview Question 5: SVD vs Eigenvalues

Q: Why might SVD be preferred over eigendecomposition for numerical computation?

A: (1) SVD exists for all matrices (rectangular, singular, defective); eigendecomposition only for square diagonalizable matrices. (2) SVD is always numerically stable; eigendecomposition can be unstable for defective matrices. (3) SVD of A=UÎŖVTA = U\Sigma V^T separates the input space (VV) from the output space (UU), which is geometrically meaningful. (4) SVD truncation gives the best low-rank approximation (Eckart-Young theorem), making it ideal for PCA, dimensionality reduction, and regularization.

📝Interview Question 6: Preconditioning

Q: What is a preconditioner and why do iterative methods need them?

A: A preconditioner MM approximates A−1A^{-1} such that M−1AM^{-1}A has a smaller condition number than AA. Without preconditioning, CG converges in O(Îē)O(\sqrt{\kappa}) iterations — if Îē=108\kappa = 10^8, that's 10410^4 iterations. A good preconditioner reduces Îē\kappa to âˆŧ10\sim 10, requiring only âˆŧ10\sim 10 iterations. The challenge is that MM must be cheap to apply (ideally O(n)O(n) or O(nnz)O(\text{nnz})) while still being a good approximation to A−1A^{-1}. Common choices: incomplete Cholesky, algebraic multigrid, block Jacobi for distributed systems.


14. Practice Problems

📝Problem 1: Condition Number Estimation

Compute the condition number of the n×nn \times n matrix with aij=1/(i+j−1)a_{ij} = 1/(i + j - 1) (Hilbert matrix) for n=5,10,15n = 5, 10, 15. At what nn does the matrix become effectively singular in double precision?

💡Solution

import numpy as np

for n in [5, 10, 15, 20]:
    H = np.array([[1/(i+j+1) for j in range(n)] for i in range(n)])
    kappa = np.linalg.cond(H)
    digits_lost = int(np.log10(kappa))
    print(f"n={n:2d}: Îē = {kappa:.2e}, ~{digits_lost} digits lost, "
          f"{max(0, 16-digits_lost)} digits remaining")

For n=5n = 5: Îē≈5×105\kappa \approx 5 \times 10^5 (11 digits remain) For n=10n = 10: Îē≈2×1013\kappa \approx 2 \times 10^{13} (3 digits remain) For n=15n = 15: Îē≈4×1019\kappa \approx 4 \times 10^{19} (matrix is singular in double precision!) For nâ‰Ĩ12n \geq 12, the Hilbert matrix is effectively singular — solving Hx=bHx = b gives meaningless results.

📝Problem 2: Backward Error Demonstration

Write a function that computes the backward error of a solution x~x˃ to Ax=bAx = b:

Ρ(x~)=âˆĨb−Ax~âˆĨâˆĨAâˆĨ⋅âˆĨx~âˆĨ\eta(x˃) = \frac{\|b - A\tilde{x}\|}{\|A\| \cdot \|\tilde{x}\|}

Show that Gaussian elimination with partial pivoting achieves small backward error even for ill-conditioned systems.

💡Solution

import numpy as np

def backward_error(A, b, x_approx):
    """Compute the backward error of an approximate solution."""
    residual = b - A @ x_approx
    return np.linalg.norm(residual) / (np.linalg.norm(A) * np.linalg.norm(x_approx))

# Ill-conditioned system
np.random.seed(42)
n = 100
H = np.array([[1/(i+j+1) for j in range(n)] for i in range(n)])
b = H @ np.ones(n)  # True solution is x = [1,...,1]

x_solve = np.linalg.solve(H, b)
x_perturbed = x_solve + np.random.randn(n) * 1e-10  # Slightly wrong solution

print(f"np.linalg.solve backward error: {backward_error(H, b, x_solve):.2e}")
print(f"Perturbed solution backward error: {backward_error(H, b, x_perturbed):.2e}")
print(f"Solve error: {np.linalg.norm(x_solve - np.ones(n)):.2e}")
print(f"Perturbed error: {np.linalg.norm(x_perturbed - np.ones(n)):.2e}")

The backward error of np.linalg.solve is always O(Īĩ⋅n)O(\epsilon \cdot n) regardless of conditioning. The forward error may be large if Îē\kappa is large, but this is the problem's fault, not the algorithm's.

📝Problem 3: Convergence Analysis

For the matrix A=(4113)A = \begin{pmatrix} 4 & 1 \\ 1 & 3 \end{pmatrix}, compute the convergence rate of Jacobi and Gauss-Seidel iteration. Which converges faster and by how much?

💡Solution

The spectral radius of the Jacobi iteration matrix TJ=−D−1(L+U)T_J = -D^{-1}(L+U):

TJ=−(1/4001/3)(0110)=(0−1/4−1/30)T_J = -\begin{pmatrix} 1/4 & 0 \\ 0 & 1/3 \end{pmatrix} \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} = \begin{pmatrix} 0 & -1/4 \\ -1/3 & 0 \end{pmatrix}

Eigenvalues of TJT_J: Îģ2=1/12\lambda^2 = 1/12, so Îģ=Âą1/12≈±0.2887\lambda = \pm 1/\sqrt{12} \approx \pm 0.2887.

΁(TJ)=0.2887\rho(T_J) = 0.2887. After kk iterations, error reduced by factor (0.2887)k(0.2887)^k.

For Gauss-Seidel, ΁(TGS)â‰ˆĪ(TJ)2≈0.0833\rho(T_{GS}) \approx \rho(T_J)^2 \approx 0.0833. So GS converges roughly twice as fast per iteration.

import numpy as np

A = np.array([[4, 1], [1, 3]], dtype=float)
D = np.diag(np.diag(A))
L = np.tril(A, -1)
U = np.triu(A, 1)

T_J = -np.linalg.solve(D, L + U)
T_GS = -np.linalg.solve(D + L, U)

print(f"΁(T_Jacobi) = {max(abs(np.linalg.eigvals(T_J))):.4f}")
print(f"΁(T_GS)     = {max(abs(np.linalg.eigvals(T_GS))):.4f}")
print(f"Ratio:       {max(abs(np.linalg.eigvals(T_J))) / max(abs(np.linalg.eigvals(T_GS))):.2f}x")

📝Problem 4: Mixed Precision Stability

A neural network layer computes y=΃(Wx+b)y = \sigma(Wx + b) where W∈R1024×1024W \in \mathbb{R}^{1024 \times 1024}, x∈R1024x \in \mathbb{R}^{1024}. If WW has entries in [−1,1][-1, 1] and xx has entries in [−0.1,0.1][-0.1, 0.1], estimate the condition number needed for the computation to be reliable in FP16 vs FP32.

💡Solution

In FP16, Īĩ≈5×10−3\epsilon \approx 5 \times 10^{-3}. The accumulation of 10241024 multiply-adds introduces relative error of âˆŧ1024×Īĩ≈5\sim 1024 \times \epsilon \approx 5. This means the result can be completely unreliable.

In FP32, Īĩ≈1.2×10−7\epsilon \approx 1.2 \times 10^{-7}. Accumulation error: âˆŧ1024×1.2×10−7≈1.2×10−4\sim 1024 \times 1.2 \times 10^{-7} \approx 1.2 \times 10^{-4}. Reliable if condition number Îē<104\kappa < 10^4.

In FP64, Īĩ≈2.2×10−16\epsilon \approx 2.2 \times 10^{-16}. Accumulation error: âˆŧ1024×2.2×10−16≈2.2×10−13\sim 1024 \times 2.2 \times 10^{-16} \approx 2.2 \times 10^{-13}. Reliable for any practical condition number.

This is why mixed-precision training uses FP16 for speed but maintains FP32 copies of weights and uses loss scaling to keep gradients in FP16-representable range.


15. Quick Reference

📋Numerical Linear Algebra Quick Reference

Key Formulas

FormulaNameUse
Îē(A)=âˆĨAâˆĨâˆĨA−1âˆĨ\kappa(A) = \|A\| \|A^{-1}\|Condition numberMeasure sensitivity
Îē(A)=΃max⁥/΃min⁥\kappa(A) = \sigma_{\max}/\sigma_{\min}Spectral condition numberSVD-based computation
Îē(ATA)=Îē(A)2\kappa(A^T A) = \kappa(A)^2Normal equationsWhy to avoid them
fl(a∘b)=(a∘b)(1+δ)\text{fl}(a \circ b) = (a \circ b)(1+\delta), âˆŖÎ´âˆŖâ‰¤Īĩ|\delta| \leq \epsilonRelative error boundFloating-point model

Solver Selection Guide

ProblemRecommended MethodAvoid
Dense, general Ax=bAx = bLU with partial pivotinginv(A) @ b
Dense, SPD Ax=bAx = bCholeskyLU (wasteful)
Dense, overdeterminedQR or SVD (lstsq)Normal equations
Sparse, SPDCG (+ preconditioner)Dense LU
Sparse, generalGMRES (+ preconditioner)Dense methods
Sparse, least squaresLSMR or LSQRNormal equations

Accuracy vs Speed Tradeoff

MethodAccuracySpeedWhen to Use
SVDHighestSlowestRank-deficient, ill-conditioned
QRHighFastGeneral least squares
CholeskyHighFastest (SPD)SPD systems
LUGoodFastGeneral square systems
CG/GMRESGoodFast (sparse)Large sparse systems

Error Budget

For solving Ax=bAx = b with Îē(A)=10k\kappa(A) = 10^k in double precision (Īĩ≈10−16\epsilon \approx 10^{-16}):

  • Forward error: ≤Îē⋅Īĩ≈10k−16\leq \kappa \cdot \epsilon \approx 10^{k-16}
  • If k<10k < 10: 6+ digits — good
  • If k=12k = 12: 4 digits — acceptable
  • If k=15k = 15: 1 digit — poor, consider higher precision
  • If k=16k = 16: 0 digits — use arbitrary precision or reformulate

16. Cross-References

Lesson Progress23 / 100