Numerical Linear Algebra
Numerical linear algebra is the bridge between elegant mathematical theory and practical computation. In theory, we solve by computing . In practice, computing directly is almost never done â it is slow, numerically unstable, and wasteful. Instead, we use carefully designed algorithms that exploit matrix structure and maintain numerical stability. This module covers the essential numerical concepts every ML practitioner must understand: conditioning, stability, floating-point arithmetic, direct and iterative solvers, and their applications in modern AI systems.
1. Why It Matters
âšī¸ Why It Matters
Numerical linear algebra is not a theoretical curiosity â it is the foundation of nearly every ML system. Training a neural network involves solving large linear systems (in optimizers like L-BFGS), computing matrix factorizations (in PCA, SVD-based embeddings), and inverting Hessians (in second-order optimization). Getting the numerics wrong means silent bugs: models that train on one GPU but diverge on another, solutions that appear correct but are off by 10^6, or systems that hang indefinitely without converging.
In production ML, the difference between a numerically stable and unstable implementation can mean the difference between a model that works and one that silently produces garbage. The 1996 Ariane 5 rocket failure â caused by a floating-point overflow converting a 64-bit float to a 16-bit integer â is a stark reminder that numerical precision matters in safety-critical systems. In ML, the consequences are less dramatic but still costly: wasted compute, incorrect gradients, and models that fail to generalize.
Computational Efficiency
Naive matrix inversion via cofactor expansion is â completely impractical. Even the adjugate method is . Modern algorithms achieve:
| Operation | Naive | Optimized | Notes |
|---|---|---|---|
| Matrix multiply | Strassen, Coppersmith-Winograd | ||
| Solve | LU decomposition | ||
| Eigenvalues | QR algorithm | ||
| SVD | Golub-Reinsch |
For sparse matrices with nonzero entries, iterative methods reduce complexity to per iteration, making problems with millions of variables tractable.
The Cost of Ignoring Numerics
import numpy as np
# Hilbert matrix: H_ij = 1/(i+j-1)
# Theoretical solution: x = [1, 1, ..., 1]
n = 12
H = np.array([[1/(i+j-1) for j in range(n)] for i in range(n)])
b = H @ np.ones(n)
# "Solve" Ax = b
x_numpy = np.linalg.solve(H, b)
print(f"Error (np.linalg.solve): {np.linalg.norm(x_numpy - np.ones(n)):.2e}")
# ~1e-4 (terrible for a 12x12 system!)
# The condition number tells you WHY
print(f"Condition number: {np.linalg.cond(H):.2e}")
# ~10^16 â the matrix is nearly singular in floating point
2. Condition Number
DfCondition Number
The condition number of a matrix measures how sensitive the solution of is to perturbations in (or ). For a nonsingular matrix with respect to the 2-norm:
where and are the largest and smallest singular values of .
Condition Number Properties
What the Condition Number Tells You
â ī¸ Interpreting Îē(A)
If , then you lose approximately digits of accuracy when solving . For double-precision arithmetic ():
- : 15 digits accurate â excellent
- : 12 digits accurate â good
- : 8 digits accurate â moderate
- : 4 digits accurate â poor
- : 0 digits accurate â essentially random
Sensitivity Analysis
For the system , if is perturbed by , the relative error in satisfies:
Similarly, if is perturbed by :
This is a worst-case bound â the actual error may be much smaller, but you cannot know this without computing .
đCondition Number of Common Matrices
Identity matrix : (perfectly conditioned)
Diagonal :
Hilbert matrix : â grows exponentially
Vandermonde matrix: grows exponentially with
Random Gaussian matrix : with high probability
import numpy as np
import matplotlib.pyplot as plt
# Condition number growth for Hilbert matrices
ns = range(2, 15)
kappas = []
for n in ns:
H = np.array([[1/(i+j+1) for j in range(n)] for i in range(n)])
kappas.append(np.linalg.cond(H))
for n, k in zip(ns, kappas):
print(f"H_{n}: Îē = {k:.2e}")
# For n >= 12, Îē exceeds 10^16 â beyond double precision capability
3. Stability
DfForward Stability
An algorithm is forward stable if it computes a result that is close to the true answer . That is, the error is small. This is the intuitive notion of "getting the right answer."
DfBackward Stability
An algorithm is backward stable if it computes that is the exact solution to a slightly perturbed problem. That is, there exists a small (or ) with small such that (or ).
Backward stability is stronger and more desirable than forward stability: if an algorithm is backward stable, the error is always bounded by â the algorithm does the best possible job given the conditioning of the problem.
ThBackward Error Theorem
For a backward stable algorithm solving :
where is the backward error of the algorithm. This is the fundamental relationship between conditioning and stability.
Stability of Common Algorithms
| Algorithm | Forward Stable? | Backward Stable? | Notes |
|---|---|---|---|
| Naive Gaussian elimination | Yes | No | Can have large backward error |
| Gaussian elimination + partial pivoting | Yes | Yes (nearly) | Standard in LAPACK |
| QR decomposition (Householder) | Yes | Yes | Most stable general method |
| Normal equations for least squares | No | No | Square condition number |
| QR for least squares | Yes | Yes | Recommended approach |
| SVD | Yes | Yes | Most robust, most expensive |
â ī¸ Pivoting in Gaussian Elimination
Without pivoting, Gaussian elimination can be catastrophically unstable even for well-conditioned matrices. Consider:
Without pivoting, we divide by , amplifying errors by . With partial pivoting (swapping rows to put the largest element on the diagonal), the backward error is bounded by a modest factor of .
import numpy as np
def gaussian_elimination_no_pivot(A, b):
"""Unstable Gaussian elimination (no pivoting)."""
n = len(b)
Ab = np.hstack([A.astype(float), b.reshape(-1, 1)])
for col in range(n):
for row in range(col + 1, n):
factor = Ab[row, col] / Ab[col, col]
Ab[row, col:] -= factor * Ab[col, col:]
# Back substitution
x = np.zeros(n)
for i in range(n - 1, -1, -1):
x[i] = (Ab[i, -1] - Ab[i, i+1:n] @ x[i+1:]) / Ab[i, i]
return x
# Near-singular example
A = np.array([[1e-10, 1], [1, 1]], dtype=float)
b = A @ np.array([1.0, 1.0])
# numpy uses stable LAPACK (LU with partial pivoting)
x_stable = np.linalg.solve(A, b)
print(f"Stable: {x_stable} (error: {np.linalg.norm(x_stable - [1,1]):.2e})")
# No-pivot version may fail
try:
x_unstable = gaussian_elimination_no_pivot(A, b)
print(f"Unstable: {x_unstable} (error: {np.linalg.norm(x_unstable - [1,1]):.2e})")
except ZeroDivisionError:
print("Unstable: Division by zero!")
4. Floating Point Arithmetic
IEEE 754 Standard
DfFloating Point Representation
A double-precision float (IEEE 754) stores a number as:
where is the sign bit, is the 52-bit mantissa (fraction), and is the 11-bit biased exponent. This gives:
- Range:
- Precision: decimal digits
- Machine epsilon:
âšī¸ Machine Epsilon
Machine epsilon is the smallest number such that in floating-point arithmetic. Equivalently, it bounds the relative rounding error of any basic operation:
For double precision: . For single precision: .
Roundoff Error
Every floating-point operation introduces a relative error of at most . After operations, errors can accumulate. For a sum of numbers:
This linear growth in error is why compensated summation (Kahan summation) is important for large reductions.
Catastrophic Cancellation
DfCatastrophic Cancellation
Catastrophic cancellation occurs when subtracting two nearly equal floating-point numbers, causing a dramatic loss of significant digits. If and agree to significant digits, has only significant digits.
Example: , . Both have 16 digits, but has only 1 digit.
import numpy as np
# Classic catastrophic cancellation
a = 1.0
b = 1.0 + 1e-15
print(f"Direct subtraction: {(b - a) - 1e-15}") # May not be exactly 0
# Catastrophic cancellation in quadratic formula
# For b >> sqrt(b^2 - 4ac), the standard formula loses precision
a_coeff, b_coeff, c_coeff = 1, 1e8, 1
sqrt_disc = np.sqrt(b_coeff**2 - 4*a_coeff*c_coeff)
x1_standard = (-b_coeff + sqrt_disc) / (2 * a_coeff)
x2_standard = (-b_coeff - sqrt_disc) / (2 * a_coeff)
print(f"Standard: x1 = {x1_standard:.6e}, x2 = {x2_standard:.6e}")
# Stable alternative: use the other root to avoid cancellation
x2_stable = c_coeff / (a_coeff * x1_standard) # = c/(a*x1)
x1_stable = (-b_coeff - np.sign(b_coeff) * sqrt_disc) / (2 * a_coeff)
print(f"Stable: x1 = {x1_stable:.6e}, x2 = {x2_stable:.6e}")
Kahan Summation
Kahan Compensated Summation
def kahan_sum(values):
"""Kahan compensated summation â nearly exact for large sums."""
s = 0.0
c = 0.0 # compensation for lost low-order bits
for x in values:
y = x - c
t = s + y
c = (t - s) - y # algebraically zero, but captures rounding error
s = t
return s
# Demonstration: sum 1e7 values of 1.0
n = 10_000_000
values = [1.0] * n
print(f"Naive sum: {sum(values) - n}") # May show rounding error
print(f"Kahan sum: {kahan_sum(values) - n}") # Exact (or nearly so)
# More dramatic: sum decreasing values
vals = [1e16, 1.0, -1e16]
print(f"Naive: {sum(vals)}") # 0.0 (correct by luck)
vals2 = [1e16, 1.0, -1e16, 1.0, -1e16]
print(f"Naive: {sum(vals2)}") # 0.0 (lost the 1.0!)
print(f"Kahan: {kahan_sum(vals2)}") # 1.0 (correct)
5. Direct Methods
Gaussian Elimination
DfGaussian Elimination
Gaussian elimination transforms into an upper triangular system via row operations, then solves by back substitution. With partial pivoting (row swaps to maximize the pivot element), the algorithm is:
- For each column :
- Find row with
- Swap rows and
- For each row : eliminate by row subtraction
- Back substitution: solve from bottom to top
Total cost: flops for the factorization, for the solve.
LU Decomposition
DfLU Decomposition
The LU decomposition factors a matrix as where:
- is a permutation matrix (from partial pivoting)
- is lower triangular with 1s on the diagonal
- is upper triangular
Once computed, solving reduces to two triangular solves: , then .
import numpy as np
from scipy.linalg import lu, solve_triangular
A = np.array([[2, 1, 1], [4, 3, 3], [8, 7, 9]], dtype=float)
b = np.array([1, 2, 3], dtype=float)
# LU decomposition via scipy
P, L, U = lu(A)
# Verify: PA = LU
print("PA = LU check:", np.allclose(P @ A, L @ U))
# Solve using LU factors
y = solve_triangular(L, P @ b, lower=True)
x = solve_triangular(U, y)
print(f"Solution: {x}")
print(f"Verify: {A @ x}") # Should equal b
# numpy's solve uses LAPACK's LU implementation internally
x_numpy = np.linalg.solve(A, b)
print(f"numpy: {x_numpy}")
Cholesky Decomposition
ThCholesky Decomposition
For a symmetric positive definite (SPD) matrix , the Cholesky decomposition gives where is lower triangular with positive diagonal entries.
- Cost: flops (half the cost of LU)
- Uniquely defined (no pivoting needed)
- Numerically stable without pivoting
- Exists if and only if is SPD
The Cholesky decomposition is the method of choice for solving SPD systems (e.g., normal equations, covariance matrices in Gaussian processes).
import numpy as np
from scipy.linalg import cho_factor, cho_solve
# Create SPD matrix
A = np.array([[4, 2, 1], [2, 5, 3], [1, 3, 6]], dtype=float)
b = np.array([1, 2, 3], dtype=float)
# Cholesky decomposition
L = np.linalg.cholesky(A)
print("A = LL^T check:", np.allclose(L @ L.T, A))
# Solve using Cholesky
c, low = cho_factor(A)
x = cho_solve((c, low), b)
print(f"Solution: {x}")
print(f"Verify: {A @ x}")
6. Iterative Methods
Iterative methods are essential for large-scale problems where direct methods are too expensive (e.g., finite element systems with millions of unknowns, or sparse systems arising in graph algorithms).
Jacobi Iteration
DfJacobi Iteration
The Jacobi method splits (diagonal, strictly lower, strictly upper) and iterates:
Each iteration costs (or for sparse ). Convergence is guaranteed if is strictly diagonally dominant: for all .
Gauss-Seidel Iteration
DfGauss-Seidel Iteration
Gauss-Seidel is identical to Jacobi but uses the most recently computed values immediately:
This typically converges twice as fast as Jacobi but is harder to parallelize (each update depends on the previous one in the same sweep).
Conjugate Gradient
DfConjugate Gradient Method
The Conjugate Gradient (CG) method solves where is SPD. It generates a sequence of iterates that minimize the -norm of the error over expanding Krylov subspaces:
where is the Krylov subspace.
CG Algorithm
Iterative Methods Comparison
| Method | Per-Iteration Cost | Memory | Convergence | Parallelizable? |
|---|---|---|---|---|
| Jacobi | Slow | Yes | ||
| Gauss-Seidel | Medium | No | ||
| SOR | Fast (tuned ) | No | ||
| Conjugate Gradient | Fast (for SPD) | Partially | ||
| GMRES | Fast (general) | Partially |
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import cg, gmres
# Create a large SPD tridiagonal system
n = 10000
A = diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format='csr')
b = np.ones(n)
# Conjugate Gradient
x_cg, info = cg(A, b, tol=1e-10, maxiter=1000)
print(f"CG: {np.linalg.norm(A @ x_cg - b):.2e} residual, info={info}")
# Jacobi iteration (manual)
def jacobi(A, b, x0=None, tol=1e-10, maxiter=1000):
x = x0 if x0 is not None else np.zeros_like(b)
D = A.diagonal()
for k in range(maxiter):
r = b - A @ x
if np.linalg.norm(r) < tol:
return x, k
x = x + r / D # Jacobi step: x_new = D^{-1}(b - (A-D)x)
return x, maxiter
x_jac, iters = jacobi(A, b)
print(f"Jacobi: {np.linalg.norm(A @ x_jac - b):.2e} residual, {iters} iterations")
7. Matrix Norms for Numerical Analysis
DfSpectral Norm
The spectral norm (operator 2-norm) of is:
This is the maximum stretching factor of : .
Key Matrix Norms
âšī¸ Why the Spectral Norm for Conditioning?
The condition number uses the spectral norm because it captures the true geometric distortion: maps the unit sphere to an ellipsoid with semi-axes equal to the singular values. The ratio of the longest to shortest semi-axis is . Other norms give different condition numbers, but is the most meaningful for understanding sensitivity.
8. Least Squares Problems
When has no exact solution (overdetermined system, ), we seek that minimizes .
Normal Equations
ThNormal Equations
The least squares solution satisfies the normal equations:
If has full column rank, is SPD and the unique solution is .
Problem: . If , then â catastrophic loss of precision!
QR Decomposition Approach
DfQR Decomposition for Least Squares
Factor where is orthogonal () and is upper triangular (). Then:
since orthogonal transformations preserve the 2-norm. The solution is (upper triangular solve), with â no squaring!
SVD Approach
DfSVD for Least Squares
The SVD gives the most robust solution:
where is the pseudoinverse (reciprocal of nonzero singular values, zero otherwise). This handles rank-deficient matrices gracefully and makes the numerical rank explicit.
import numpy as np
# Overdetermined system: 5 equations, 3 unknowns
np.random.seed(42)
A = np.random.randn(5, 3)
b = np.random.randn(5)
# Method 1: Normal equations (UNSTABLE)
x_normal = np.linalg.solve(A.T @ A, A.T @ b)
print(f"Normal equations: residual = {np.linalg.norm(A @ x_normal - b):.2e}")
print(f" Îē(A^T A) = {np.linalg.cond(A.T @ A):.2e}")
# Method 2: QR decomposition (STABLE)
x_qr, res_qr, rank_qr, sv_qr = np.linalg.lstsq(A, b, rcond=None)
print(f"QR (lstsq): residual = {np.linalg.norm(A @ x_qr - b):.2e}")
# Method 3: SVD (MOST STABLE)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
x_svd = Vt.T @ np.diag(1/s) @ U.T @ b
print(f"SVD: residual = {np.linalg.norm(A @ x_svd - b):.2e}")
# Rank-deficient example
A_rankdef = np.array([[1, 2], [2, 4], [3, 6]], dtype=float) # rank 1
b_rankdef = np.array([1, 2, 3.1], dtype=float)
x_rd = np.linalg.lstsq(A_rankdef, b_rankdef, rcond=None)[0]
print(f"\nRank-deficient least squares: x = {x_rd}")
print(f" Residual: {np.linalg.norm(A_rankdef @ x_rd - b_rankdef):.2e}")
9. Sparse Matrix Algorithms
Why Sparse?
âšī¸ Sparsity in Practice
Real-world matrices are often sparse: adjacency matrices of graphs, discretizations of PDEs, TF-IDF matrices in NLP. Storing only nonzero entries reduces memory from to , and operations like matrix-vector multiply from to .
For a 2D Laplacian on an grid: the matrix is ( for ) but has only nonzeros â a density of .
Sparse Storage Formats
| Format | Best For | Memory | Access |
|---|---|---|---|
| COO | Construction | Slow | |
| CSR | Row slicing, SpMV | Fast row access | |
| CSC | Column slicing | Fast column access | |
| DIA | Diagonal structure | Very fast for banded | |
| BSR | Block structure | Varies | Fast for block patterns |
Preconditioning
DfPreconditioning
A preconditioner approximates and transforms the system to , which has better spectral properties. The ideal preconditioner makes the system trivial, but must be cheap to apply.
Common preconditioners:
- Jacobi (diagonal): â trivial, sometimes effective
- Incomplete LU (ILU): approximate LU factorization with fill-in limit
- Algebraic Multigrid (AMG): hierarchical method, optimal for elliptic PDEs
- SSOR: symmetric successive over-relaxation
import numpy as np
from scipy.sparse import diags, random
from scipy.sparse.linalg import cg, spilu, LinearOperator
# Create a difficult SPD system (convection-diffusion)
n = 500
A = diags([-1, 4, -1], [-1, 0, 1], shape=(n, n), format='csr')
A = A + 0.1 * random(n, n, density=0.01, format='csr') # Add some non-symmetric perturbation
A = A @ A.T # Make SPD
b = np.random.randn(n)
# Without preconditioning
x_no_precon, info1 = cg(A, b, tol=1e-10, maxiter=1000)
print(f"No preconditioner: {np.linalg.norm(A @ x_no_precon - b):.2e}")
# With Jacobi (diagonal) preconditioner
M_inv = 1.0 / A.diagonal()
M = LinearOperator(A.shape, matvec=lambda x: M_inv * x)
x_jac, info2 = cg(A, b, tol=1e-10, maxiter=1000, M=M)
print(f"Jacobi precon: {np.linalg.norm(A @ x_jac - b):.2e}")
# With incomplete Cholesky
ilu = spilu(A.tocsc(), drop_tol=1e-4)
M_ilu = LinearOperator(A.shape, matvec=ilu.solve)
x_ilu, info3 = cg(A, b, tol=1e-10, maxiter=1000, M=M_ilu)
print(f"ILU precon: {np.linalg.norm(A @ x_ilu - b):.2e}")
10. Python Implementation
NumPy Linear Algebra
import numpy as np
# === Core Operations ===
# Solve Ax = b
A = np.array([[3, 1], [1, 2]], dtype=float)
b = np.array([9, 8], dtype=float)
x = np.linalg.solve(A, b)
print(f"Solve: x = {x}")
# Determinant and inverse (AVOID inverting in production!)
print(f"det(A) = {np.linalg.det(A):.4f}")
print(f"inv(A) = \n{np.linalg.inv(A)}")
# Matrix factorizations
Q, R = np.linalg.qr(A)
print(f"\nQR: Q = \n{Q}\nR = \n{R}")
L = np.linalg.cholesky(A @ A.T + np.eye(2)) # Ensure SPD
print(f"Cholesky: L = \n{L}")
U, s, Vt = np.linalg.svd(A)
print(f"\nSVD: singular values = {s}")
# Eigenvalues
eigvals, eigvecs = np.linalg.eig(A)
print(f"Eigenvalues: {eigvals}")
# Matrix norms
print(f"\n||A||_F = {np.linalg.norm(A, 'fro'):.4f}")
print(f"||A||_2 = {np.linalg.norm(A, 2):.4f}")
print(f"||A||_1 = {np.linalg.norm(A, 1):.4f}")
print(f"||A||_inf = {np.linalg.norm(A, np.inf):.4f}")
# Condition number
print(f"Îē(A) = {np.linalg.cond(A):.4f}")
print(f"Îē(A^T A) = {np.linalg.cond(A.T @ A):.4f}") # Îē^2
SciPy Sparse Linear Algebra
from scipy import sparse
from scipy.sparse import linalg as spla
import numpy as np
# Create sparse matrix (5-point Laplacian)
n = 100
A = sparse.diags([-1, 4, -1], [-1, 0, 1], shape=(n, n), format='csr')
b = np.random.randn(n)
# Solve sparse system
x, info = spla.cg(A, b, tol=1e-10)
print(f"CG solve: residual = {np.linalg.norm(A @ x - b):.2e}")
# Sparse eigenvalues (largest/smallest)
vals, vecs = spla.eigs(A, k=5, which='LM')
print(f"Largest eigenvalues: {np.sort(vals.real)}")
vals_sm, _ = spla.eigs(A, k=3, which='SM')
print(f"Smallest eigenvalues: {np.sort(vals_sm.real)}")
# Sparse SVD (useful for large-scale PCA)
U, s, Vt = spla.svds(A, k=5)
print(f"Top 5 singular values: {s}")
Practical Patterns
đAvoiding Common Pitfalls
DO: Use np.linalg.solve for dense, scipy.sparse.linalg.cg for sparse SPD, scipy.sparse.linalg.gmres for sparse non-symmetric.
DON'T: Use np.linalg.inv(A) @ b â it is slower and less stable than np.linalg.solve(A, b). Don't use normal equations A.T @ A for least squares â use np.linalg.lstsq.
11. Applications in AI/ML
Training Stability
âšī¸ Numerical Issues in Deep Learning
Deep learning training involves repeated matrix-vector products (forward pass), matrix transposes (backward pass), and parameter updates. Numerical issues manifest as:
- Exploding/vanishing gradients: The condition number of the Jacobian determines whether gradients grow or shrink exponentially through layers
- Loss of precision in mixed-precision training: FP16 has , making operations on small gradients or large activations unstable
- Ill-conditioned Hessians: Second-order optimizers must solve systems involving , which is often extremely ill-conditioned
Layer normalization and batch normalization are partly motivated by keeping intermediate activations well-conditioned.
Numerical Precision in Deep Learning
import numpy as np
# Simulating mixed-precision issues
def forward_pass_fp32(W, x):
"""Full precision forward pass."""
return W @ x
def forward_pass_fp16(W, x):
"""Simulated mixed-precision forward pass."""
W_16 = W.astype(np.float16)
x_16 = x.astype(np.float16)
result = W_16 @ x_16
return result.astype(np.float32)
np.random.seed(42)
W = np.random.randn(100, 100) * 0.1
x = np.random.randn(100)
y_fp32 = forward_pass_fp32(W, x)
y_fp16 = forward_pass_fp16(W, x)
print(f"FP32 result norm: {np.linalg.norm(y_fp32):.6f}")
print(f"FP16 result norm: {np.linalg.norm(y_fp16):.6f}")
print(f"Relative error: {np.linalg.norm(y_fp32 - y_fp16) / np.linalg.norm(y_fp32):.2e}")
# Condition number of a typical layer's weight matrix
W_layer = np.random.randn(512, 512) * (2 / 512)**0.5 # Xavier init
print(f"\nLayer condition number: {np.linalg.cond(W_layer):.2f}")
# Usually ~10-100 with good initialization, much worse after training
Key Applications
| Application | Numerical Challenge | Solution |
|---|---|---|
| PCA / SVD embeddings | Large-scale SVD | Truncated SVD, randomized SVD |
| Gaussian processes | Cholesky | Sparse approximations, inducing points |
| Neural ODE training | Stiff ODE systems | Implicit solvers, adjoint method |
| Recommendation systems | Sparse least squares | ALS, SGD (avoids forming normal equations) |
| Graph neural networks | Sparse matrix operations | SpMV with CSR format |
12. Common Mistakes
â ī¸ Common Mistakes in Numerical Linear Algebra
| Mistake | Why It's Wrong | Correct Approach |
|---|---|---|
Using inv(A) @ b instead of solve(A, b) | 2x slower, less stable | Use np.linalg.solve |
| Solving via normal equations | Squares condition number | Use QR or SVD |
| Ignoring condition number | May get garbage results with no error | Always check np.linalg.cond |
| Not using pivoting in Gaussian elimination | Catastrophic instability | Use LAPACK's pivoted LU |
Comparing floating-point numbers with == | Roundoff makes exact equality rare | Use np.allclose or tolerance |
| Using float32 for large summations | Accumulated error grows | Use float64 or Kahan summation |
| Assuming iterative methods converge | Divergence is silent | Monitor residual; use preconditioning |
| Using dense algorithms on sparse matrices | memory, time | Use scipy.sparse and sparse solvers |
| Not checking matrix properties (SPD, symmetry) | Wrong solver, wrong results | Verify before choosing solver |
import numpy as np
# BAD: Using inv(A) @ b
A = np.array([[1, 2], [3, 4]], dtype=float)
b = np.array([5, 6], dtype=float)
# Wrong
x_bad = np.linalg.inv(A) @ b # Slower, less stable
# Correct
x_good = np.linalg.solve(A, b) # Direct, stable
print(f"inv @ b: {x_bad}")
print(f"solve: {x_good}")
print(f"Equal: {np.allclose(x_bad, x_good)}")
# BAD: Comparing floats with ==
a = 0.1 + 0.2
print(f"\n0.1 + 0.2 == 0.3? {a == 0.3}") # False!
print(f"0.1 + 0.2 â 0.3? {np.isclose(a, 0.3)}") # True
13. Interview Questions
đInterview Question 1: Why Not Invert the Matrix?
Q: Why should we never compute and multiply by to solve ?
A: Three reasons: (1) Computing costs flops, then multiplying costs another â LU decomposition with back substitution costs total and does both at once. (2) The product can amplify roundoff errors more than the triangular solves in LU. (3) Storing is expensive ( elements) and dense even when is sparse. LAPACK and all serious linear algebra libraries compute solves, not inverses.
đInterview Question 2: Condition Number of Normal Equations
Q: If , what is the condition number when solving via normal equations?
A: . With double precision (), you lose about 12 digits, leaving only 4 digits of accuracy. This is why QR or SVD should be used instead of normal equations for least squares problems.
đInterview Question 3: When to Use Iterative vs Direct Methods?
Q: When would you choose Conjugate Gradient over Cholesky decomposition?
A: Cholesky is preferred when is small-to-medium sized () and dense â it gives the exact solution in time and memory. CG is preferred when is large and sparse â it uses only matrix-vector products (costing each), needs memory, and converges in iterations. For a sparse system with , CG converges in iterations, each costing .
đInterview Question 4: Floating Point Identity
Q: Is floating-point addition associative? Give an example.
A: No. Let , , . Then , but . Actually in this case both are 0, but consider: , while . The ordering matters because intermediate results may overflow or underflow.
đInterview Question 5: SVD vs Eigenvalues
Q: Why might SVD be preferred over eigendecomposition for numerical computation?
A: (1) SVD exists for all matrices (rectangular, singular, defective); eigendecomposition only for square diagonalizable matrices. (2) SVD is always numerically stable; eigendecomposition can be unstable for defective matrices. (3) SVD of separates the input space () from the output space (), which is geometrically meaningful. (4) SVD truncation gives the best low-rank approximation (Eckart-Young theorem), making it ideal for PCA, dimensionality reduction, and regularization.
đInterview Question 6: Preconditioning
Q: What is a preconditioner and why do iterative methods need them?
A: A preconditioner approximates such that has a smaller condition number than . Without preconditioning, CG converges in iterations â if , that's iterations. A good preconditioner reduces to , requiring only iterations. The challenge is that must be cheap to apply (ideally or ) while still being a good approximation to . Common choices: incomplete Cholesky, algebraic multigrid, block Jacobi for distributed systems.
14. Practice Problems
đProblem 1: Condition Number Estimation
Compute the condition number of the matrix with (Hilbert matrix) for . At what does the matrix become effectively singular in double precision?
đĄSolution
import numpy as np
for n in [5, 10, 15, 20]:
H = np.array([[1/(i+j+1) for j in range(n)] for i in range(n)])
kappa = np.linalg.cond(H)
digits_lost = int(np.log10(kappa))
print(f"n={n:2d}: Îē = {kappa:.2e}, ~{digits_lost} digits lost, "
f"{max(0, 16-digits_lost)} digits remaining")
For : (11 digits remain) For : (3 digits remain) For : (matrix is singular in double precision!) For , the Hilbert matrix is effectively singular â solving gives meaningless results.
đProblem 2: Backward Error Demonstration
Write a function that computes the backward error of a solution to :
Show that Gaussian elimination with partial pivoting achieves small backward error even for ill-conditioned systems.
đĄSolution
import numpy as np
def backward_error(A, b, x_approx):
"""Compute the backward error of an approximate solution."""
residual = b - A @ x_approx
return np.linalg.norm(residual) / (np.linalg.norm(A) * np.linalg.norm(x_approx))
# Ill-conditioned system
np.random.seed(42)
n = 100
H = np.array([[1/(i+j+1) for j in range(n)] for i in range(n)])
b = H @ np.ones(n) # True solution is x = [1,...,1]
x_solve = np.linalg.solve(H, b)
x_perturbed = x_solve + np.random.randn(n) * 1e-10 # Slightly wrong solution
print(f"np.linalg.solve backward error: {backward_error(H, b, x_solve):.2e}")
print(f"Perturbed solution backward error: {backward_error(H, b, x_perturbed):.2e}")
print(f"Solve error: {np.linalg.norm(x_solve - np.ones(n)):.2e}")
print(f"Perturbed error: {np.linalg.norm(x_perturbed - np.ones(n)):.2e}")
The backward error of np.linalg.solve is always regardless of conditioning. The forward error may be large if is large, but this is the problem's fault, not the algorithm's.
đProblem 3: Convergence Analysis
For the matrix , compute the convergence rate of Jacobi and Gauss-Seidel iteration. Which converges faster and by how much?
đĄSolution
The spectral radius of the Jacobi iteration matrix :
Eigenvalues of : , so .
. After iterations, error reduced by factor .
For Gauss-Seidel, . So GS converges roughly twice as fast per iteration.
import numpy as np
A = np.array([[4, 1], [1, 3]], dtype=float)
D = np.diag(np.diag(A))
L = np.tril(A, -1)
U = np.triu(A, 1)
T_J = -np.linalg.solve(D, L + U)
T_GS = -np.linalg.solve(D + L, U)
print(f"Ī(T_Jacobi) = {max(abs(np.linalg.eigvals(T_J))):.4f}")
print(f"Ī(T_GS) = {max(abs(np.linalg.eigvals(T_GS))):.4f}")
print(f"Ratio: {max(abs(np.linalg.eigvals(T_J))) / max(abs(np.linalg.eigvals(T_GS))):.2f}x")
đProblem 4: Mixed Precision Stability
A neural network layer computes where , . If has entries in and has entries in , estimate the condition number needed for the computation to be reliable in FP16 vs FP32.
đĄSolution
In FP16, . The accumulation of multiply-adds introduces relative error of . This means the result can be completely unreliable.
In FP32, . Accumulation error: . Reliable if condition number .
In FP64, . Accumulation error: . Reliable for any practical condition number.
This is why mixed-precision training uses FP16 for speed but maintains FP32 copies of weights and uses loss scaling to keep gradients in FP16-representable range.
15. Quick Reference
đNumerical Linear Algebra Quick Reference
Key Formulas
| Formula | Name | Use |
|---|---|---|
| Condition number | Measure sensitivity | |
| Spectral condition number | SVD-based computation | |
| Normal equations | Why to avoid them | |
| , | Relative error bound | Floating-point model |
Solver Selection Guide
| Problem | Recommended Method | Avoid |
|---|---|---|
| Dense, general | LU with partial pivoting | inv(A) @ b |
| Dense, SPD | Cholesky | LU (wasteful) |
| Dense, overdetermined | QR or SVD (lstsq) | Normal equations |
| Sparse, SPD | CG (+ preconditioner) | Dense LU |
| Sparse, general | GMRES (+ preconditioner) | Dense methods |
| Sparse, least squares | LSMR or LSQR | Normal equations |
Accuracy vs Speed Tradeoff
| Method | Accuracy | Speed | When to Use |
|---|---|---|---|
| SVD | Highest | Slowest | Rank-deficient, ill-conditioned |
| QR | High | Fast | General least squares |
| Cholesky | High | Fastest (SPD) | SPD systems |
| LU | Good | Fast | General square systems |
| CG/GMRES | Good | Fast (sparse) | Large sparse systems |
Error Budget
For solving with in double precision ():
- Forward error:
- If : 6+ digits â good
- If : 4 digits â acceptable
- If : 1 digit â poor, consider higher precision
- If : 0 digits â use arbitrary precision or reformulate
16. Cross-References
- Linear Algebra Basics â Vector spaces, subspaces, and the four fundamental subspaces
- Linear Algebra SVD â SVD, eigendecomposition, QR, Cholesky (theory)
- Linear Algebra Eigenvalues â Eigenvalue computation and the QR algorithm
- Linear Algebra Applications â PCA, SVD embeddings, and matrix factorizations in ML
- Optimization Convex â Gradient descent, second-order methods, and Hessian computation
- Probability Distributions â Multivariate Gaussians require positive definite covariance matrices (Cholesky)
- Calculus Differential Equations â Backpropagation, batch normalization, and training dynamics