Numerical Linear Algebra

Numerical linear algebra is the bridge between elegant mathematical theory and practical computation. In theory, we solve $Ax = b$ by computing $x = A^{-1}b$ . In practice, computing $A^{-1}$ directly is almost never done — it is slow, numerically unstable, and wasteful. Instead, we use carefully designed algorithms that exploit matrix structure and maintain numerical stability. This module covers the essential numerical concepts every ML practitioner must understand: conditioning, stability, floating-point arithmetic, direct and iterative solvers, and their applications in modern AI systems.

1. Why It Matters

ℹ️ Why It Matters

Numerical linear algebra is not a theoretical curiosity — it is the foundation of nearly every ML system. Training a neural network involves solving large linear systems (in optimizers like L-BFGS), computing matrix factorizations (in PCA, SVD-based embeddings), and inverting Hessians (in second-order optimization). Getting the numerics wrong means silent bugs: models that train on one GPU but diverge on another, solutions that appear correct but are off by 10^6, or systems that hang indefinitely without converging.

In production ML, the difference between a numerically stable and unstable implementation can mean the difference between a model that works and one that silently produces garbage. The 1996 Ariane 5 rocket failure — caused by a floating-point overflow converting a 64-bit float to a 16-bit integer — is a stark reminder that numerical precision matters in safety-critical systems. In ML, the consequences are less dramatic but still costly: wasted compute, incorrect gradients, and models that fail to generalize.

Computational Efficiency

Naive matrix inversion via cofactor expansion is $O(n!)$ — completely impractical. Even the adjugate method is $O(n^4)$ . Modern algorithms achieve:

Operation	Naive	Optimized	Notes
Matrix multiply	$O(n^3)$	$O(n^{2.373})$	Strassen, Coppersmith-Winograd
Solve $Ax=b$	$O(n^4)$	$O(n^3)$	LU decomposition
Eigenvalues	$O(n^4)$	$O(n^3)$	QR algorithm
SVD	$O(n^4)$	$O(n^3)$	Golub-Reinsch

For sparse matrices with $k$ nonzero entries, iterative methods reduce complexity to $O(k)$ per iteration, making problems with millions of variables tractable.

The Cost of Ignoring Numerics

import numpy as np

# Hilbert matrix: H_ij = 1/(i+j-1)
# Theoretical solution: x = [1, 1, ..., 1]
n = 12
H = np.array([[1/(i+j-1) for j in range(n)] for i in range(n)])
b = H @ np.ones(n)

# "Solve" Ax = b
x_numpy = np.linalg.solve(H, b)
print(f"Error (np.linalg.solve): {np.linalg.norm(x_numpy - np.ones(n)):.2e}")
# ~1e-4 (terrible for a 12x12 system!)

# The condition number tells you WHY
print(f"Condition number: {np.linalg.cond(H):.2e}")
# ~10^16 — the matrix is nearly singular in floating point

2. Condition Number

DfCondition Number

The condition number of a matrix $A$ measures how sensitive the solution of $Ax = b$ is to perturbations in $b$ (or $A$ ). For a nonsingular matrix with respect to the 2-norm:

\kappa_2(A) = \|A\|_2 \cdot \|A^{-1}\|_2 = \frac{\sigma_{\max}(A)}{\sigma_{\min}(A)}

where $\sigma_{\max}$ and $\sigma_{\min}$ are the largest and smallest singular values of $A$ .

Condition Number Properties

What the Condition Number Tells You

⚠️ Interpreting κ(A)

If $\kappa(A) = 10^k$ , then you lose approximately $k$ digits of accuracy when solving $Ax = b$ . For double-precision arithmetic ( $\epsilon \approx 10^{-16}$ ):

$\kappa(A) = 10^1$ : 15 digits accurate — excellent
$\kappa(A) = 10^4$ : 12 digits accurate — good
$\kappa(A) = 10^8$ : 8 digits accurate — moderate
$\kappa(A) = 10^{12}$ : 4 digits accurate — poor
$\kappa(A) = 10^{16}$ : 0 digits accurate — essentially random

Sensitivity Analysis

For the system $Ax = b$ , if $b$ is perturbed by $\delta b$ , the relative error in $x$ satisfies:

\frac{\|\delta x\|}{\|x\|} \leq \kappa(A) \cdot \frac{\|\delta b\|}{\|b\|}

Similarly, if $A$ is perturbed by $\delta A$ :

\frac{\|\delta x\|}{\|x + \delta x\|} \leq \kappa(A) \cdot \frac{\|\delta A\|}{\|A\|}

This is a worst-case bound — the actual error may be much smaller, but you cannot know this without computing $\kappa(A)$ .

📝Condition Number of Common Matrices

Identity matrix $I_n$ : $\kappa(I) = 1$ (perfectly conditioned)

Diagonal $D = \text{diag}(d_1, \ldots, d_n)$ : $\kappa(D) = |d_{\max}| / |d_{\min}|$

Hilbert matrix $H_{ij} = \frac{1}{i+j-1}$ : $\kappa(H_n) \sim \frac{(1+\sqrt{2})^{4n}}{\sqrt{n}}$ — grows exponentially

Vandermonde matrix: $\kappa$ grows exponentially with $n$

Random Gaussian matrix $A_{ij} \sim \mathcal{N}(0,1)$ : $\kappa(A) \approx 2\sqrt{n}$ with high probability

import numpy as np
import matplotlib.pyplot as plt

# Condition number growth for Hilbert matrices
ns = range(2, 15)
kappas = []
for n in ns:
    H = np.array([[1/(i+j+1) for j in range(n)] for i in range(n)])
    kappas.append(np.linalg.cond(H))

for n, k in zip(ns, kappas):
    print(f"H_{n}: κ = {k:.2e}")

# For n >= 12, κ exceeds 10^16 — beyond double precision capability

3. Stability

DfForward Stability

An algorithm is forward stable if it computes a result $x̃$ that is close to the true answer $x$ . That is, the error $\|x̃ - x\|$ is small. This is the intuitive notion of "getting the right answer."

DfBackward Stability

An algorithm is backward stable if it computes $x̃$ that is the exact solution to a slightly perturbed problem. That is, there exists a small $\delta A$ (or $\delta b$ ) with $\|\delta A\| / \|A\|$ small such that $(A + \delta A)x̃ = b$ (or $Ax̃ = b + \delta b$ ).

Backward stability is stronger and more desirable than forward stability: if an algorithm is backward stable, the error is always bounded by $\kappa(A) \cdot \epsilon$ — the algorithm does the best possible job given the conditioning of the problem.

ThBackward Error Theorem

For a backward stable algorithm solving $Ax = b$ :

\frac{\|x̃ - x\|}{\|x\|} \leq \kappa(A) \cdot \epsilon_{\text{algo}}

where $\epsilon_{\text{algo}}$ is the backward error of the algorithm. This is the fundamental relationship between conditioning and stability.

Stability of Common Algorithms

Algorithm	Forward Stable?	Backward Stable?	Notes
Naive Gaussian elimination	Yes	No	Can have large backward error
Gaussian elimination + partial pivoting	Yes	Yes (nearly)	Standard in LAPACK
QR decomposition (Householder)	Yes	Yes	Most stable general method
Normal equations for least squares	No	No	Square condition number
QR for least squares	Yes	Yes	Recommended approach
SVD	Yes	Yes	Most robust, most expensive

⚠️ Pivoting in Gaussian Elimination

Without pivoting, Gaussian elimination can be catastrophically unstable even for well-conditioned matrices. Consider:

A = \begin{pmatrix} \epsilon & 1 \\ 1 & 1 \end{pmatrix}

Without pivoting, we divide by $\epsilon$ , amplifying errors by $1/\epsilon$ . With partial pivoting (swapping rows to put the largest element on the diagonal), the backward error is bounded by a modest factor of $n$ .

import numpy as np

def gaussian_elimination_no_pivot(A, b):
    """Unstable Gaussian elimination (no pivoting)."""
    n = len(b)
    Ab = np.hstack([A.astype(float), b.reshape(-1, 1)])
    for col in range(n):
        for row in range(col + 1, n):
            factor = Ab[row, col] / Ab[col, col]
            Ab[row, col:] -= factor * Ab[col, col:]
    # Back substitution
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (Ab[i, -1] - Ab[i, i+1:n] @ x[i+1:]) / Ab[i, i]
    return x

# Near-singular example
A = np.array([[1e-10, 1], [1, 1]], dtype=float)
b = A @ np.array([1.0, 1.0])

# numpy uses stable LAPACK (LU with partial pivoting)
x_stable = np.linalg.solve(A, b)
print(f"Stable: {x_stable} (error: {np.linalg.norm(x_stable - [1,1]):.2e})")

# No-pivot version may fail
try:
    x_unstable = gaussian_elimination_no_pivot(A, b)
    print(f"Unstable: {x_unstable} (error: {np.linalg.norm(x_unstable - [1,1]):.2e})")
except ZeroDivisionError:
    print("Unstable: Division by zero!")

4. Floating Point Arithmetic

IEEE 754 Standard

DfFloating Point Representation

A double-precision float (IEEE 754) stores a number as:

x = (-1)^s \times 1.f \times 2^{e - 1023}

where $s$ is the sign bit, $f$ is the 52-bit mantissa (fraction), and $e$ is the 11-bit biased exponent. This gives:

Range: $\pm 10^{\pm 308}$
Precision: $\approx 15-16$ decimal digits
Machine epsilon: $\epsilon = 2^{-52} \approx 2.22 \times 10^{-16}$

ℹ️ Machine Epsilon

Machine epsilon $\epsilon$ is the smallest number such that $\text{fl}(1 + \epsilon) \neq 1$ in floating-point arithmetic. Equivalently, it bounds the relative rounding error of any basic operation:

\text{fl}(a \circ b) = (a \circ b)(1 + \delta), \quad |\delta| \leq \epsilon

For double precision: $\epsilon \approx 2.22 \times 10^{-16}$ . For single precision: $\epsilon \approx 1.19 \times 10^{-7}$ .

Roundoff Error

Every floating-point operation introduces a relative error of at most $\epsilon$ . After $n$ operations, errors can accumulate. For a sum of $n$ numbers:

\text{fl}\left(\sum_{i=1}^n x_i\right) = \sum_{i=1}^n x_i (1 + \delta_i), \quad |\delta_i| \leq n\epsilon

This linear growth in error is why compensated summation (Kahan summation) is important for large reductions.

Catastrophic Cancellation

DfCatastrophic Cancellation

Catastrophic cancellation occurs when subtracting two nearly equal floating-point numbers, causing a dramatic loss of significant digits. If $a$ and $b$ agree to $k$ significant digits, $a - b$ has only $k - \log_{10}(a/b)$ significant digits.

Example: $a = 1.23456789012345$ , $b = 1.23456789012344$ . Both have 16 digits, but $a - b = 10^{-14}$ has only 1 digit.

import numpy as np

# Classic catastrophic cancellation
a = 1.0
b = 1.0 + 1e-15
print(f"Direct subtraction: {(b - a) - 1e-15}")  # May not be exactly 0

# Catastrophic cancellation in quadratic formula
# For b >> sqrt(b^2 - 4ac), the standard formula loses precision
a_coeff, b_coeff, c_coeff = 1, 1e8, 1
sqrt_disc = np.sqrt(b_coeff**2 - 4*a_coeff*c_coeff)

x1_standard = (-b_coeff + sqrt_disc) / (2 * a_coeff)
x2_standard = (-b_coeff - sqrt_disc) / (2 * a_coeff)
print(f"Standard: x1 = {x1_standard:.6e}, x2 = {x2_standard:.6e}")

# Stable alternative: use the other root to avoid cancellation
x2_stable = c_coeff / (a_coeff * x1_standard)  # = c/(a*x1)
x1_stable = (-b_coeff - np.sign(b_coeff) * sqrt_disc) / (2 * a_coeff)
print(f"Stable:   x1 = {x1_stable:.6e}, x2 = {x2_stable:.6e}")

Kahan Summation

Kahan Compensated Summation

def kahan_sum(values):
    """Kahan compensated summation — nearly exact for large sums."""
    s = 0.0
    c = 0.0  # compensation for lost low-order bits
    for x in values:
        y = x - c
        t = s + y
        c = (t - s) - y  # algebraically zero, but captures rounding error
        s = t
    return s

# Demonstration: sum 1e7 values of 1.0
n = 10_000_000
values = [1.0] * n
print(f"Naive sum:   {sum(values) - n}")       # May show rounding error
print(f"Kahan sum:   {kahan_sum(values) - n}")  # Exact (or nearly so)

# More dramatic: sum decreasing values
vals = [1e16, 1.0, -1e16]
print(f"Naive:   {sum(vals)}")          # 0.0 (correct by luck)
vals2 = [1e16, 1.0, -1e16, 1.0, -1e16]
print(f"Naive:   {sum(vals2)}")         # 0.0 (lost the 1.0!)
print(f"Kahan:   {kahan_sum(vals2)}")   # 1.0 (correct)

5. Direct Methods

Gaussian Elimination

DfGaussian Elimination

Gaussian elimination transforms $Ax = b$ into an upper triangular system $Ux = y$ via row operations, then solves by back substitution. With partial pivoting (row swaps to maximize the pivot element), the algorithm is:

For each column $k = 1, \ldots, n-1$ :
- Find row $p \geq k$ with $|a_{pk}| = \max_{i \geq k} |a_{ik}|$
- Swap rows $k$ and $p$
- For each row $i > k$ : eliminate $a_{ik}$ by row subtraction
Back substitution: solve $Ux = y$ from bottom to top

Total cost: $\frac{2}{3}n^3$ flops for the factorization, $O(n^2)$ for the solve.

LU Decomposition

DfLU Decomposition

The LU decomposition factors a matrix as $A = PLU$ where:

$P$ is a permutation matrix (from partial pivoting)
$L$ is lower triangular with 1s on the diagonal
$U$ is upper triangular

Once computed, solving $Ax = b$ reduces to two triangular solves: $Ly = Pb$ , then $Ux = y$ .

import numpy as np
from scipy.linalg import lu, solve_triangular

A = np.array([[2, 1, 1], [4, 3, 3], [8, 7, 9]], dtype=float)
b = np.array([1, 2, 3], dtype=float)

# LU decomposition via scipy
P, L, U = lu(A)

# Verify: PA = LU
print("PA = LU check:", np.allclose(P @ A, L @ U))

# Solve using LU factors
y = solve_triangular(L, P @ b, lower=True)
x = solve_triangular(U, y)
print(f"Solution: {x}")
print(f"Verify:   {A @ x}")  # Should equal b

# numpy's solve uses LAPACK's LU implementation internally
x_numpy = np.linalg.solve(A, b)
print(f"numpy:    {x_numpy}")

Cholesky Decomposition

ThCholesky Decomposition

For a symmetric positive definite (SPD) matrix $A$ , the Cholesky decomposition gives $A = LL^T$ where $L$ is lower triangular with positive diagonal entries.

Cost: $\frac{1}{3}n^3$ flops (half the cost of LU)
Uniquely defined (no pivoting needed)
Numerically stable without pivoting
Exists if and only if $A$ is SPD

The Cholesky decomposition is the method of choice for solving SPD systems (e.g., normal equations, covariance matrices in Gaussian processes).

import numpy as np
from scipy.linalg import cho_factor, cho_solve

# Create SPD matrix
A = np.array([[4, 2, 1], [2, 5, 3], [1, 3, 6]], dtype=float)
b = np.array([1, 2, 3], dtype=float)

# Cholesky decomposition
L = np.linalg.cholesky(A)
print("A = LL^T check:", np.allclose(L @ L.T, A))

# Solve using Cholesky
c, low = cho_factor(A)
x = cho_solve((c, low), b)
print(f"Solution: {x}")
print(f"Verify:   {A @ x}")

6. Iterative Methods

Iterative methods are essential for large-scale problems where direct methods are too expensive (e.g., finite element systems with millions of unknowns, or sparse systems arising in graph algorithms).

Jacobi Iteration

DfJacobi Iteration

The Jacobi method splits $A = D + L + U$ (diagonal, strictly lower, strictly upper) and iterates:

x^{(k+1)} = D^{-1}(b - (L + U)x^{(k)})

Each iteration costs $O(n^2)$ (or $O(\text{nnz})$ for sparse $A$ ). Convergence is guaranteed if $A$ is strictly diagonally dominant: $|a_{ii}| > \sum_{j \neq i} |a_{ij}|$ for all $i$ .

Gauss-Seidel Iteration

DfGauss-Seidel Iteration

Gauss-Seidel is identical to Jacobi but uses the most recently computed values immediately:

x_i^{(k+1)} = \frac{1}{a_{ii}} \left(b_i - \sum_{j < i} a_{ij} x_j^{(k+1)} - \sum_{j > i} a_{ij} x_j^{(k)}\right)

This typically converges twice as fast as Jacobi but is harder to parallelize (each update depends on the previous one in the same sweep).

Conjugate Gradient

DfConjugate Gradient Method

The Conjugate Gradient (CG) method solves $Ax = b$ where $A$ is SPD. It generates a sequence of iterates $x^{(k)}$ that minimize the $A$ -norm of the error over expanding Krylov subspaces:

x^{(k)} = \arg\min_{x \in x_0 + \mathcal{K}_k(A, r_0)} \|x - x^*\|_A

where $\mathcal{K}_k(A, r_0) = \text{span}\{r_0, Ar_0, A^2r_0, \ldots, A^{k-1}r_0\}$ is the Krylov subspace.

CG Algorithm

Iterative Methods Comparison

Method	Per-Iteration Cost	Memory	Convergence	Parallelizable?
Jacobi	$O(\text{nnz})$	$O(n)$	Slow	Yes
Gauss-Seidel	$O(\text{nnz})$	$O(n)$	Medium	No
SOR	$O(\text{nnz})$	$O(n)$	Fast (tuned $\omega$ )	No
Conjugate Gradient	$O(\text{nnz})$	$O(n)$	Fast (for SPD)	Partially
GMRES	$O(k \cdot \text{nnz})$	$O(kn)$	Fast (general)	Partially

import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import cg, gmres

# Create a large SPD tridiagonal system
n = 10000
A = diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format='csr')
b = np.ones(n)

# Conjugate Gradient
x_cg, info = cg(A, b, tol=1e-10, maxiter=1000)
print(f"CG: {np.linalg.norm(A @ x_cg - b):.2e} residual, info={info}")

# Jacobi iteration (manual)
def jacobi(A, b, x0=None, tol=1e-10, maxiter=1000):
    x = x0 if x0 is not None else np.zeros_like(b)
    D = A.diagonal()
    for k in range(maxiter):
        r = b - A @ x
        if np.linalg.norm(r) < tol:
            return x, k
        x = x + r / D  # Jacobi step: x_new = D^{-1}(b - (A-D)x)
    return x, maxiter

x_jac, iters = jacobi(A, b)
print(f"Jacobi: {np.linalg.norm(A @ x_jac - b):.2e} residual, {iters} iterations")

7. Matrix Norms for Numerical Analysis

DfSpectral Norm

The spectral norm (operator 2-norm) of $A$ is:

\|A\|_2 = \sigma_{\max}(A) = \sqrt{\lambda_{\max}(A^T A)}

This is the maximum stretching factor of $A$ : $\|A\|_2 = \max_{\|x\|=1} \|Ax\|$ .

Key Matrix Norms

ℹ️ Why the Spectral Norm for Conditioning?

The condition number $\kappa_2(A) = \|A\|_2 \|A^{-1}\|_2$ uses the spectral norm because it captures the true geometric distortion: $A$ maps the unit sphere to an ellipsoid with semi-axes equal to the singular values. The ratio of the longest to shortest semi-axis is $\kappa_2$ . Other norms give different condition numbers, but $\kappa_2$ is the most meaningful for understanding sensitivity.

8. Least Squares Problems

When $Ax = b$ has no exact solution (overdetermined system, $m > n$ ), we seek $x$ that minimizes $\|Ax - b\|_2$ .

Normal Equations

ThNormal Equations

The least squares solution satisfies the normal equations:

A^T A x = A^T b

If $A$ has full column rank, $A^T A$ is SPD and the unique solution is $x = (A^T A)^{-1} A^T b$ .

Problem: $\kappa(A^T A) = \kappa(A)^2$ . If $\kappa(A) = 10^8$ , then $\kappa(A^T A) = 10^{16}$ — catastrophic loss of precision!

QR Decomposition Approach

DfQR Decomposition for Least Squares

Factor $A = QR$ where $Q$ is orthogonal ( $m \times n$ ) and $R$ is upper triangular ( $n \times n$ ). Then:

\|Ax - b\|_2 = \|QRx - b\|_2 = \|Rx - Q^T b\|_2

since orthogonal transformations preserve the 2-norm. The solution is $Rx = Q^T b$ (upper triangular solve), with $\kappa = \kappa(A)$ — no squaring!

SVD Approach

DfSVD for Least Squares

The SVD $A = U\Sigma V^T$ gives the most robust solution:

x = V \Sigma^+ U^T b

where $\Sigma^+$ is the pseudoinverse (reciprocal of nonzero singular values, zero otherwise). This handles rank-deficient matrices gracefully and makes the numerical rank explicit.

import numpy as np

# Overdetermined system: 5 equations, 3 unknowns
np.random.seed(42)
A = np.random.randn(5, 3)
b = np.random.randn(5)

# Method 1: Normal equations (UNSTABLE)
x_normal = np.linalg.solve(A.T @ A, A.T @ b)
print(f"Normal equations: residual = {np.linalg.norm(A @ x_normal - b):.2e}")
print(f"  κ(A^T A) = {np.linalg.cond(A.T @ A):.2e}")

# Method 2: QR decomposition (STABLE)
x_qr, res_qr, rank_qr, sv_qr = np.linalg.lstsq(A, b, rcond=None)
print(f"QR (lstsq):       residual = {np.linalg.norm(A @ x_qr - b):.2e}")

# Method 3: SVD (MOST STABLE)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
x_svd = Vt.T @ np.diag(1/s) @ U.T @ b
print(f"SVD:              residual = {np.linalg.norm(A @ x_svd - b):.2e}")

# Rank-deficient example
A_rankdef = np.array([[1, 2], [2, 4], [3, 6]], dtype=float)  # rank 1
b_rankdef = np.array([1, 2, 3.1], dtype=float)
x_rd = np.linalg.lstsq(A_rankdef, b_rankdef, rcond=None)[0]
print(f"\nRank-deficient least squares: x = {x_rd}")
print(f"  Residual: {np.linalg.norm(A_rankdef @ x_rd - b_rankdef):.2e}")

9. Sparse Matrix Algorithms

Why Sparse?

ℹ️ Sparsity in Practice

Real-world matrices are often sparse: adjacency matrices of graphs, discretizations of PDEs, TF-IDF matrices in NLP. Storing only nonzero entries reduces memory from $O(n^2)$ to $O(\text{nnz})$ , and operations like matrix-vector multiply from $O(n^2)$ to $O(\text{nnz})$ .

For a 2D Laplacian on an $n \times n$ grid: the matrix is $n^2 \times n^2$ ( $10^6 \times 10^6$ for $n=1000$ ) but has only $\sim 5n^2$ nonzeros — a density of $5/n^2 \approx 0.0005\%$ .

Sparse Storage Formats

Format	Best For	Memory	Access
COO	Construction	$3 \times \text{nnz}$	Slow
CSR	Row slicing, SpMV	$\text{nnz} + n + 1$	Fast row access
CSC	Column slicing	$\text{nnz} + n + 1$	Fast column access
DIA	Diagonal structure	$O(\text{nnz})$	Very fast for banded
BSR	Block structure	Varies	Fast for block patterns

Preconditioning

DfPreconditioning

A preconditioner $M$ approximates $A^{-1}$ and transforms the system to $M^{-1}Ax = M^{-1}b$ , which has better spectral properties. The ideal preconditioner $M = A$ makes the system trivial, but $M$ must be cheap to apply.

Common preconditioners:

Jacobi (diagonal): $M = \text{diag}(A)$ — trivial, sometimes effective
Incomplete LU (ILU): approximate LU factorization with fill-in limit
Algebraic Multigrid (AMG): hierarchical method, optimal for elliptic PDEs
SSOR: symmetric successive over-relaxation

import numpy as np
from scipy.sparse import diags, random
from scipy.sparse.linalg import cg, spilu, LinearOperator

# Create a difficult SPD system (convection-diffusion)
n = 500
A = diags([-1, 4, -1], [-1, 0, 1], shape=(n, n), format='csr')
A = A + 0.1 * random(n, n, density=0.01, format='csr')  # Add some non-symmetric perturbation
A = A @ A.T  # Make SPD
b = np.random.randn(n)

# Without preconditioning
x_no_precon, info1 = cg(A, b, tol=1e-10, maxiter=1000)
print(f"No preconditioner: {np.linalg.norm(A @ x_no_precon - b):.2e}")

# With Jacobi (diagonal) preconditioner
M_inv = 1.0 / A.diagonal()
M = LinearOperator(A.shape, matvec=lambda x: M_inv * x)
x_jac, info2 = cg(A, b, tol=1e-10, maxiter=1000, M=M)
print(f"Jacobi precon:     {np.linalg.norm(A @ x_jac - b):.2e}")

# With incomplete Cholesky
ilu = spilu(A.tocsc(), drop_tol=1e-4)
M_ilu = LinearOperator(A.shape, matvec=ilu.solve)
x_ilu, info3 = cg(A, b, tol=1e-10, maxiter=1000, M=M_ilu)
print(f"ILU precon:        {np.linalg.norm(A @ x_ilu - b):.2e}")

10. Python Implementation

NumPy Linear Algebra

import numpy as np

# === Core Operations ===

# Solve Ax = b
A = np.array([[3, 1], [1, 2]], dtype=float)
b = np.array([9, 8], dtype=float)
x = np.linalg.solve(A, b)
print(f"Solve: x = {x}")

# Determinant and inverse (AVOID inverting in production!)
print(f"det(A) = {np.linalg.det(A):.4f}")
print(f"inv(A) = \n{np.linalg.inv(A)}")

# Matrix factorizations
Q, R = np.linalg.qr(A)
print(f"\nQR: Q = \n{Q}\nR = \n{R}")

L = np.linalg.cholesky(A @ A.T + np.eye(2))  # Ensure SPD
print(f"Cholesky: L = \n{L}")

U, s, Vt = np.linalg.svd(A)
print(f"\nSVD: singular values = {s}")

# Eigenvalues
eigvals, eigvecs = np.linalg.eig(A)
print(f"Eigenvalues: {eigvals}")

# Matrix norms
print(f"\n||A||_F = {np.linalg.norm(A, 'fro'):.4f}")
print(f"||A||_2 = {np.linalg.norm(A, 2):.4f}")
print(f"||A||_1 = {np.linalg.norm(A, 1):.4f}")
print(f"||A||_inf = {np.linalg.norm(A, np.inf):.4f}")

# Condition number
print(f"κ(A) = {np.linalg.cond(A):.4f}")
print(f"κ(A^T A) = {np.linalg.cond(A.T @ A):.4f}")  # κ^2

SciPy Sparse Linear Algebra

from scipy import sparse
from scipy.sparse import linalg as spla
import numpy as np

# Create sparse matrix (5-point Laplacian)
n = 100
A = sparse.diags([-1, 4, -1], [-1, 0, 1], shape=(n, n), format='csr')
b = np.random.randn(n)

# Solve sparse system
x, info = spla.cg(A, b, tol=1e-10)
print(f"CG solve: residual = {np.linalg.norm(A @ x - b):.2e}")

# Sparse eigenvalues (largest/smallest)
vals, vecs = spla.eigs(A, k=5, which='LM')
print(f"Largest eigenvalues: {np.sort(vals.real)}")

vals_sm, _ = spla.eigs(A, k=3, which='SM')
print(f"Smallest eigenvalues: {np.sort(vals_sm.real)}")

# Sparse SVD (useful for large-scale PCA)
U, s, Vt = spla.svds(A, k=5)
print(f"Top 5 singular values: {s}")

Practical Patterns

📝Avoiding Common Pitfalls

DO: Use np.linalg.solve for dense, scipy.sparse.linalg.cg for sparse SPD, scipy.sparse.linalg.gmres for sparse non-symmetric.

DON'T: Use np.linalg.inv(A) @ b — it is slower and less stable than np.linalg.solve(A, b). Don't use normal equations A.T @ A for least squares — use np.linalg.lstsq.

11. Applications in AI/ML

Training Stability

ℹ️ Numerical Issues in Deep Learning

Deep learning training involves repeated matrix-vector products (forward pass), matrix transposes (backward pass), and parameter updates. Numerical issues manifest as:

Exploding/vanishing gradients: The condition number of the Jacobian determines whether gradients grow or shrink exponentially through layers
Loss of precision in mixed-precision training: FP16 has $\epsilon \approx 5 \times 10^{-3}$ , making operations on small gradients or large activations unstable
Ill-conditioned Hessians: Second-order optimizers must solve systems involving $\nabla^2 L$ , which is often extremely ill-conditioned

Layer normalization and batch normalization are partly motivated by keeping intermediate activations well-conditioned.

Numerical Precision in Deep Learning

import numpy as np

# Simulating mixed-precision issues
def forward_pass_fp32(W, x):
    """Full precision forward pass."""
    return W @ x

def forward_pass_fp16(W, x):
    """Simulated mixed-precision forward pass."""
    W_16 = W.astype(np.float16)
    x_16 = x.astype(np.float16)
    result = W_16 @ x_16
    return result.astype(np.float32)

np.random.seed(42)
W = np.random.randn(100, 100) * 0.1
x = np.random.randn(100)

y_fp32 = forward_pass_fp32(W, x)
y_fp16 = forward_pass_fp16(W, x)

print(f"FP32 result norm: {np.linalg.norm(y_fp32):.6f}")
print(f"FP16 result norm: {np.linalg.norm(y_fp16):.6f}")
print(f"Relative error:   {np.linalg.norm(y_fp32 - y_fp16) / np.linalg.norm(y_fp32):.2e}")

# Condition number of a typical layer's weight matrix
W_layer = np.random.randn(512, 512) * (2 / 512)**0.5  # Xavier init
print(f"\nLayer condition number: {np.linalg.cond(W_layer):.2f}")
# Usually ~10-100 with good initialization, much worse after training

Key Applications

Application	Numerical Challenge	Solution
PCA / SVD embeddings	Large-scale SVD	Truncated SVD, randomized SVD
Gaussian processes	$O(n^3)$ Cholesky	Sparse approximations, inducing points
Neural ODE training	Stiff ODE systems	Implicit solvers, adjoint method
Recommendation systems	Sparse least squares	ALS, SGD (avoids forming normal equations)
Graph neural networks	Sparse matrix operations	SpMV with CSR format

12. Common Mistakes

⚠️ Common Mistakes in Numerical Linear Algebra

Mistake	Why It's Wrong	Correct Approach
Using `inv(A) @ b` instead of `solve(A, b)`	2x slower, less stable	Use `np.linalg.solve`
Solving via normal equations $A^T A x = A^T b$	Squares condition number	Use QR or SVD
Ignoring condition number	May get garbage results with no error	Always check `np.linalg.cond`
Not using pivoting in Gaussian elimination	Catastrophic instability	Use LAPACK's pivoted LU
Comparing floating-point numbers with `==`	Roundoff makes exact equality rare	Use `np.allclose` or tolerance
Using float32 for large summations	Accumulated error grows	Use float64 or Kahan summation
Assuming iterative methods converge	Divergence is silent	Monitor residual; use preconditioning
Using dense algorithms on sparse matrices	$O(n^2)$ memory, $O(n^3)$ time	Use `scipy.sparse` and sparse solvers
Not checking matrix properties (SPD, symmetry)	Wrong solver, wrong results	Verify before choosing solver

import numpy as np

# BAD: Using inv(A) @ b
A = np.array([[1, 2], [3, 4]], dtype=float)
b = np.array([5, 6], dtype=float)

# Wrong
x_bad = np.linalg.inv(A) @ b  # Slower, less stable

# Correct
x_good = np.linalg.solve(A, b)  # Direct, stable

print(f"inv @ b: {x_bad}")
print(f"solve:  {x_good}")
print(f"Equal:  {np.allclose(x_bad, x_good)}")

# BAD: Comparing floats with ==
a = 0.1 + 0.2
print(f"\n0.1 + 0.2 == 0.3? {a == 0.3}")          # False!
print(f"0.1 + 0.2 ≈ 0.3?  {np.isclose(a, 0.3)}")  # True

13. Interview Questions

📝Interview Question 1: Why Not Invert the Matrix?

Q: Why should we never compute $A^{-1}$ and multiply by $b$ to solve $Ax = b$ ?

A: Three reasons: (1) Computing $A^{-1}$ costs $\frac{2}{3}n^3$ flops, then multiplying $A^{-1}b$ costs another $n^2$ — LU decomposition with back substitution costs $\frac{2}{3}n^3$ total and does both at once. (2) The product $A^{-1}b$ can amplify roundoff errors more than the triangular solves in LU. (3) Storing $A^{-1}$ is expensive ( $n^2$ elements) and dense even when $A$ is sparse. LAPACK and all serious linear algebra libraries compute solves, not inverses.

📝Interview Question 2: Condition Number of Normal Equations

Q: If $\kappa(A) = 10^6$ , what is the condition number when solving via normal equations?

A: $\kappa(A^T A) = \kappa(A)^2 = 10^{12}$ . With double precision ( $\epsilon \approx 10^{-16}$ ), you lose about 12 digits, leaving only 4 digits of accuracy. This is why QR or SVD should be used instead of normal equations for least squares problems.

📝Interview Question 3: When to Use Iterative vs Direct Methods?

Q: When would you choose Conjugate Gradient over Cholesky decomposition?

A: Cholesky is preferred when $A$ is small-to-medium sized ( $n < 10^4$ ) and dense — it gives the exact solution in $O(n^3)$ time and $O(n^2)$ memory. CG is preferred when $A$ is large and sparse — it uses only matrix-vector products (costing $O(\text{nnz})$ each), needs $O(\text{nnz})$ memory, and converges in $O(\sqrt{\kappa})$ iterations. For a sparse $n = 10^6$ system with $\kappa = 100$ , CG converges in $\sim 100$ iterations, each costing $O(\text{nnz})$ .

📝Interview Question 4: Floating Point Identity

Q: Is floating-point addition associative? Give an example.

A: No. Let $a = 10^{16}$ , $b = 1.0$ , $c = -10^{16}$ . Then $(a + b) + c = 10^{16} + 1 - 10^{16} = 0$ , but $a + (b + c) = 10^{16} + (1 - 10^{16}) = 10^{16} + (-10^{16}) = 0$ . Actually in this case both are 0, but consider: $(a + c) + b = 0 + 1 = 1$ , while $(a + b) + c = 0$ . The ordering matters because intermediate results may overflow or underflow.

📝Interview Question 5: SVD vs Eigenvalues

Q: Why might SVD be preferred over eigendecomposition for numerical computation?

A: (1) SVD exists for all matrices (rectangular, singular, defective); eigendecomposition only for square diagonalizable matrices. (2) SVD is always numerically stable; eigendecomposition can be unstable for defective matrices. (3) SVD of $A = U\Sigma V^T$ separates the input space ( $V$ ) from the output space ( $U$ ), which is geometrically meaningful. (4) SVD truncation gives the best low-rank approximation (Eckart-Young theorem), making it ideal for PCA, dimensionality reduction, and regularization.

📝Interview Question 6: Preconditioning

Q: What is a preconditioner and why do iterative methods need them?

A: A preconditioner $M$ approximates $A^{-1}$ such that $M^{-1}A$ has a smaller condition number than $A$ . Without preconditioning, CG converges in $O(\sqrt{\kappa})$ iterations — if $\kappa = 10^8$ , that's $10^4$ iterations. A good preconditioner reduces $\kappa$ to $\sim 10$ , requiring only $\sim 10$ iterations. The challenge is that $M$ must be cheap to apply (ideally $O(n)$ or $O(\text{nnz})$ ) while still being a good approximation to $A^{-1}$ . Common choices: incomplete Cholesky, algebraic multigrid, block Jacobi for distributed systems.

14. Practice Problems

📝Problem 1: Condition Number Estimation

Compute the condition number of the $n \times n$ matrix with $a_{ij} = 1/(i + j - 1)$ (Hilbert matrix) for $n = 5, 10, 15$ . At what $n$ does the matrix become effectively singular in double precision?

💡Solution

import numpy as np

for n in [5, 10, 15, 20]:
    H = np.array([[1/(i+j+1) for j in range(n)] for i in range(n)])
    kappa = np.linalg.cond(H)
    digits_lost = int(np.log10(kappa))
    print(f"n={n:2d}: κ = {kappa:.2e}, ~{digits_lost} digits lost, "
          f"{max(0, 16-digits_lost)} digits remaining")

For $n = 5$ : $\kappa \approx 5 \times 10^5$ (11 digits remain) For $n = 10$ : $\kappa \approx 2 \times 10^{13}$ (3 digits remain) For $n = 15$ : $\kappa \approx 4 \times 10^{19}$ (matrix is singular in double precision!) For $n \geq 12$ , the Hilbert matrix is effectively singular — solving $Hx = b$ gives meaningless results.

📝Problem 2: Backward Error Demonstration

Write a function that computes the backward error of a solution $x̃$ to $Ax = b$ :

\eta(x̃) = \frac{\|b - A\tilde{x}\|}{\|A\| \cdot \|\tilde{x}\|}

Show that Gaussian elimination with partial pivoting achieves small backward error even for ill-conditioned systems.

💡Solution

import numpy as np

def backward_error(A, b, x_approx):
    """Compute the backward error of an approximate solution."""
    residual = b - A @ x_approx
    return np.linalg.norm(residual) / (np.linalg.norm(A) * np.linalg.norm(x_approx))

# Ill-conditioned system
np.random.seed(42)
n = 100
H = np.array([[1/(i+j+1) for j in range(n)] for i in range(n)])
b = H @ np.ones(n)  # True solution is x = [1,...,1]

x_solve = np.linalg.solve(H, b)
x_perturbed = x_solve + np.random.randn(n) * 1e-10  # Slightly wrong solution

print(f"np.linalg.solve backward error: {backward_error(H, b, x_solve):.2e}")
print(f"Perturbed solution backward error: {backward_error(H, b, x_perturbed):.2e}")
print(f"Solve error: {np.linalg.norm(x_solve - np.ones(n)):.2e}")
print(f"Perturbed error: {np.linalg.norm(x_perturbed - np.ones(n)):.2e}")

The backward error of np.linalg.solve is always $O(\epsilon \cdot n)$ regardless of conditioning. The forward error may be large if $\kappa$ is large, but this is the problem's fault, not the algorithm's.

📝Problem 3: Convergence Analysis

For the matrix $A = \begin{pmatrix} 4 & 1 \\ 1 & 3 \end{pmatrix}$ , compute the convergence rate of Jacobi and Gauss-Seidel iteration. Which converges faster and by how much?

💡Solution

The spectral radius of the Jacobi iteration matrix $T_J = -D^{-1}(L+U)$ :

T_J = -\begin{pmatrix} 1/4 & 0 \\ 0 & 1/3 \end{pmatrix} \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} = \begin{pmatrix} 0 & -1/4 \\ -1/3 & 0 \end{pmatrix}

Eigenvalues of $T_J$ : $\lambda^2 = 1/12$ , so $\lambda = \pm 1/\sqrt{12} \approx \pm 0.2887$ .

$\rho(T_J) = 0.2887$ . After $k$ iterations, error reduced by factor $(0.2887)^k$ .

For Gauss-Seidel, $\rho(T_{GS}) \approx \rho(T_J)^2 \approx 0.0833$ . So GS converges roughly twice as fast per iteration.

import numpy as np

A = np.array([[4, 1], [1, 3]], dtype=float)
D = np.diag(np.diag(A))
L = np.tril(A, -1)
U = np.triu(A, 1)

T_J = -np.linalg.solve(D, L + U)
T_GS = -np.linalg.solve(D + L, U)

print(f"ρ(T_Jacobi) = {max(abs(np.linalg.eigvals(T_J))):.4f}")
print(f"ρ(T_GS)     = {max(abs(np.linalg.eigvals(T_GS))):.4f}")
print(f"Ratio:       {max(abs(np.linalg.eigvals(T_J))) / max(abs(np.linalg.eigvals(T_GS))):.2f}x")

📝Problem 4: Mixed Precision Stability

A neural network layer computes $y = \sigma(Wx + b)$ where $W \in \mathbb{R}^{1024 \times 1024}$ , $x \in \mathbb{R}^{1024}$ . If $W$ has entries in $[-1, 1]$ and $x$ has entries in $[-0.1, 0.1]$ , estimate the condition number needed for the computation to be reliable in FP16 vs FP32.

💡Solution

In FP16, $\epsilon \approx 5 \times 10^{-3}$ . The accumulation of $1024$ multiply-adds introduces relative error of $\sim 1024 \times \epsilon \approx 5$ . This means the result can be completely unreliable.

In FP32, $\epsilon \approx 1.2 \times 10^{-7}$ . Accumulation error: $\sim 1024 \times 1.2 \times 10^{-7} \approx 1.2 \times 10^{-4}$ . Reliable if condition number $\kappa < 10^4$ .

In FP64, $\epsilon \approx 2.2 \times 10^{-16}$ . Accumulation error: $\sim 1024 \times 2.2 \times 10^{-16} \approx 2.2 \times 10^{-13}$ . Reliable for any practical condition number.

This is why mixed-precision training uses FP16 for speed but maintains FP32 copies of weights and uses loss scaling to keep gradients in FP16-representable range.

15. Quick Reference

📋Numerical Linear Algebra Quick Reference

Key Formulas

Formula	Name	Use
$\kappa(A) = \\|A\\| \\|A^{-1}\\|$	Condition number	Measure sensitivity
$\kappa(A) = \sigma_{\max}/\sigma_{\min}$	Spectral condition number	SVD-based computation
$\kappa(A^T A) = \kappa(A)^2$	Normal equations	Why to avoid them
$\text{fl}(a \circ b) = (a \circ b)(1+\delta)$ , $\|\delta\| \leq \epsilon$	Relative error bound	Floating-point model

Solver Selection Guide

Problem	Recommended Method	Avoid
Dense, general $Ax = b$	LU with partial pivoting	`inv(A) @ b`
Dense, SPD $Ax = b$	Cholesky	LU (wasteful)
Dense, overdetermined	QR or SVD (`lstsq`)	Normal equations
Sparse, SPD	CG (+ preconditioner)	Dense LU
Sparse, general	GMRES (+ preconditioner)	Dense methods
Sparse, least squares	LSMR or LSQR	Normal equations

Accuracy vs Speed Tradeoff

Method	Accuracy	Speed	When to Use
SVD	Highest	Slowest	Rank-deficient, ill-conditioned
QR	High	Fast	General least squares
Cholesky	High	Fastest (SPD)	SPD systems
LU	Good	Fast	General square systems
CG/GMRES	Good	Fast (sparse)	Large sparse systems

Error Budget

For solving $Ax = b$ with $\kappa(A) = 10^k$ in double precision ( $\epsilon \approx 10^{-16}$ ):

Forward error: $\leq \kappa \cdot \epsilon \approx 10^{k-16}$
If $k < 10$ : 6+ digits — good
If $k = 12$ : 4 digits — acceptable
If $k = 15$ : 1 digit — poor, consider higher precision
If $k = 16$ : 0 digits — use arbitrary precision or reformulate

16. Cross-References

Linear Algebra Basics — Vector spaces, subspaces, and the four fundamental subspaces
Linear Algebra SVD — SVD, eigendecomposition, QR, Cholesky (theory)
Linear Algebra Eigenvalues — Eigenvalue computation and the QR algorithm
Linear Algebra Applications — PCA, SVD embeddings, and matrix factorizations in ML
Optimization Convex — Gradient descent, second-order methods, and Hessian computation
Probability Distributions — Multivariate Gaussians require positive definite covariance matrices (Cholesky)
Calculus Differential Equations — Backpropagation, batch normalization, and training dynamics