Numerical Methods — Computing with Math

ℹ️ Why It Matters

Computers store real numbers in a finite number of bits, so most real-number arithmetic is only approximate. Numerical methods bridge the gap between mathematical theory and computer implementation.

Floating Point Arithmetic

How Computers Store Numbers

IEEE 754 Standard:

ℹ️ IEEE 754 Standard

32-bit float:

1 bit sign + 8 bits exponent + 23 bits mantissa

64-bit double:

1 bit sign + 11 bits exponent + 52 bits mantissa

Precision Limits:

⚠️ Precision Limits

Float: ~7 decimal digits
Double: ~15 decimal digits

Example: $0.1 + 0.2 \neq 0.3$ (in floating point!)

0.1 + 0.2 = 0.30000000000000004
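
A minimal Python sketch of this pitfall, assuming an interpreter whose `float` is an IEEE 754 double. The tolerance `rel_tol=1e-9` is an illustrative choice, not a value taken from the text:

```python
import math
import sys

total = 0.1 + 0.2
print(total)            # 0.30000000000000004
print(total == 0.3)     # False: exact comparison fails

# Compare with a tolerance instead of ==.
close = math.isclose(total, 0.3, rel_tol=1e-9)
print(close)            # True

# Machine epsilon: the gap between 1.0 and the next larger double.
eps = sys.float_info.epsilon
print(eps)              # about 2.22e-16
```

The epsilon value is where the "~15 decimal digits" figure for doubles comes from.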

Common Numerical Issues

⚠️ Common Numerical Issues

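Loss of significance: Subtracting nearly equal numbers

1.0000001 - 1.0000000 = 0.0000001

Both operands carry 8 significant digits, but the result carries only 1. The leading digits cancel, so any rounding error already present in the operands dominates what is left.

Overflow: Number too large to represent
Underflow: Number too small (close to zero) to represent
Catastrophic cancellation: When subtracting nearly equal, already-rounded values leads to a huge relative error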
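
One place cancellation often shows up is the quadratic formula. The sketch below is an illustration rather than part of the original notes: the function names and the test equation $x^2 - 10^8 x + 1 = 0$ are assumptions chosen to make the effect visible.

```python
import math

def small_root_naive(a, b, c):
    """Smaller-magnitude root via the textbook formula (assumes b < 0)."""
    return (-b - math.sqrt(b * b - 4 * a * c)) / (2 * a)

def small_root_stable(a, b, c):
    """Same root, rearranged so no nearly equal values are subtracted."""
    q = -(b + math.copysign(math.sqrt(b * b - 4 * a * c), b)) / 2
    return c / q

a, b, c = 1.0, -1e8, 1.0          # roots are close to 1e8 and 1e-8
naive = small_root_naive(a, b, c)
stable = small_root_stable(a, b, c)
exact = 1e-8                      # true small root, to about 16 digits

rel_err_naive = abs(naive - exact) / exact
rel_err_stable = abs(stable - exact) / exact
print(rel_err_naive, rel_err_stable)
```

The naive version subtracts two numbers that agree in almost every digit, so its relative error is large. The rearranged version only adds numbers of the same sign and stays accurate to roughly machine precision.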

Root Finding

Bisection Method

ℹ️ Bisection Method

Start with $[a, b]$ where $f(a)$ and $f(b)$ have opposite signs
Compute midpoint $c = (a+b)/2$
If $f(c)$ has the same sign as $f(a)$, replace $a$ with $c$
Otherwise, replace $b$ with $c$
Repeat until converged

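Convergence: Linear; the bracket halves every iteration, gaining one binary digit of accuracy per step (slow but guaranteed for continuous $f$)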
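
The steps above can be sketched in Python as follows. The function name `bisect_root`, the tolerance, and the test function $f(x) = x^2 - 2$ are illustrative assumptions:

```python
def bisect_root(f, a, b, tol=1e-12, max_iter=200):
    """Find a root of f in [a, b], assuming f(a) and f(b) differ in sign."""
    fa, fb = f(a), f(b)
    if fa * fb > 0:
        raise ValueError("f(a) and f(b) must have opposite signs")
    for _ in range(max_iter):
        c = a + (b - a) / 2          # midpoint
        fc = f(c)
        if fc == 0 or (b - a) / 2 < tol:
            return c
        if fa * fc > 0:              # same sign as f(a): root is in [c, b]
            a, fa = c, fc
        else:                        # otherwise the root is in [a, c]
            b = c
    return a + (b - a) / 2

bisection_root = bisect_root(lambda x: x * x - 2, 0.0, 2.0)
print(bisection_root)                # close to sqrt(2)
```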

Newton's Method

x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}

Here,

$x_n$ = Current estimate
$f'(x_n)$ = Derivative at current estimate

ℹ️ Newton's Method Properties

Requires: $f'(x)$
Convergence: Quadratic near a simple root (very fast)
Can fail if $f'(x) \approx 0$ or if the starting guess is far from the root
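
A minimal sketch of the iteration, assuming the caller supplies both $f$ and $f'$. The name `newton_root`, the stopping rule on the step size, and the example $f(x) = x^2 - 2$ are illustrative choices:

```python
def newton_root(f, df, x0, tol=1e-12, max_iter=50):
    """Newton iteration x <- x - f(x)/f'(x), stopping when the step is tiny."""
    x = x0
    for _ in range(max_iter):
        slope = df(x)
        if slope == 0:
            raise ZeroDivisionError("derivative vanished; Newton step undefined")
        step = f(x) / slope
        x -= step
        if abs(step) < tol:
            return x
    raise RuntimeError("Newton iteration did not converge")

newton_result = newton_root(lambda x: x * x - 2, lambda x: 2 * x, x0=1.0)
print(newton_result)                 # close to sqrt(2)
```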

Secant Method

ℹ️ Secant Method

Like Newton's but approximates the derivative:

f'(x_n) \approx \frac{f(x_n) - f(x_{n-1})}{x_n - x_{n-1}}

Doesn't need the derivative, only two starting points
Superlinear convergence (order $\approx 1.618$, between bisection and Newton's)
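
Substituting the finite-difference slope into Newton's update gives the sketch below. The name `secant_root`, the tolerance, and the starting points are assumptions made for the example:

```python
def secant_root(f, x0, x1, tol=1e-12, max_iter=100):
    """Secant iteration: Newton's update with a finite-difference slope."""
    f0, f1 = f(x0), f(x1)
    for _ in range(max_iter):
        if f1 == f0:
            raise ZeroDivisionError("flat secant line; cannot continue")
        x2 = x1 - f1 * (x1 - x0) / (f1 - f0)
        if abs(x2 - x1) < tol:
            return x2
        x0, f0 = x1, f1
        x1, f1 = x2, f(x2)
    raise RuntimeError("secant iteration did not converge")

secant_result = secant_root(lambda x: x * x - 2, 1.0, 2.0)
print(secant_result)                 # close to sqrt(2)
```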

Numerical Integration

Riemann Sum

\int_a^b f(x) \, dx \approx \sum_i f(x_i) \times \Delta x

Here,

$\Delta x$ = Width of each subinterval

ℹ️ Riemann Sum

Simple but inaccurate: with left or right endpoints, the error shrinks only in proportion to $\Delta x$
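
A short sketch of a left-endpoint Riemann sum, assuming equal-width subintervals. The function name `riemann_left` and the test integral $\int_0^1 x^2 \, dx = 1/3$ are illustrative:

```python
def riemann_left(f, a, b, n):
    """Left-endpoint Riemann sum with n equal subintervals."""
    dx = (b - a) / n
    return sum(f(a + i * dx) for i in range(n)) * dx

square = lambda x: x * x
riemann_err_1000 = abs(riemann_left(square, 0.0, 1.0, 1000) - 1 / 3)
riemann_err_2000 = abs(riemann_left(square, 0.0, 1.0, 2000) - 1 / 3)
print(riemann_err_1000, riemann_err_2000)
```

Doubling the number of subintervals only roughly halves the error, which is why the method is considered inaccurate.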

Trapezoidal Rule

\int_a^b f(x) \, dx \approx \frac{\Delta x}{2} \left[f(a) + 2\sum_i f(x_i) + f(b)\right]

Here,

$\Delta x$ = Width of each subinterval

ℹ️ Trapezoidal Rule

Better accuracy than Riemann sum: for smooth $f$ the error shrinks in proportion to $\Delta x^2$
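
The trapezoidal formula translates almost directly into code. In this sketch the sum runs over the interior points $x_1, \dots, x_{n-1}$; the function name and test integral are assumptions for illustration:

```python
def trapezoid(f, a, b, n):
    """Composite trapezoidal rule with n equal subintervals."""
    dx = (b - a) / n
    interior = sum(f(a + i * dx) for i in range(1, n))
    return dx / 2 * (f(a) + 2 * interior + f(b))

trapezoid_err = abs(trapezoid(lambda x: x * x, 0.0, 1.0, 1000) - 1 / 3)
print(trapezoid_err)
```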

Simpson's Rule

\int_a^b f(x) \, dx \approx \frac{\Delta x}{3} \left[f(a) + 4\sum_{\text{odd}} f(x_i) + 2\sum_{\text{even}} f(x_i) + f(b)\right]

Here,

$\Delta x$ = Width of each subinterval

ℹ️ Simpson's Rule

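Even better accuracy: for smooth $f$ the error shrinks in proportion to $\Delta x^4$
Uses parabolic approximation, so the number of subintervals must be even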
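
A sketch of composite Simpson's rule, assuming an even number of subintervals `n`. The function name and the test integral $\int_0^\pi \sin x \, dx = 2$ are illustrative choices:

```python
import math

def simpson(f, a, b, n):
    """Composite Simpson's rule; n must be even."""
    if n % 2:
        raise ValueError("n must be even")
    dx = (b - a) / n
    odd = sum(f(a + i * dx) for i in range(1, n, 2))    # weight 4
    even = sum(f(a + i * dx) for i in range(2, n, 2))   # weight 2
    return dx / 3 * (f(a) + 4 * odd + 2 * even + f(b))

simpson_err = abs(simpson(math.sin, 0.0, math.pi, 100) - 2.0)
print(simpson_err)
```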

Monte Carlo Integration

\int_a^b f(x) \, dx \approx \frac{b-a}{N} \sum_i f(x_i)

Here,

$x_i$ = Random sample points
$N$ = Number of samples

💡 Monte Carlo Integration

Works well in high dimensions: the error shrinks like $1/\sqrt{N}$ regardless of the dimension

Used in: Ray tracing, physics simulations, Bayesian inference
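
A one-dimensional sketch of the estimator, assuming uniform samples on $[a, b]$. The function name, the fixed seed (used only to make the run repeatable), and the test integral are assumptions:

```python
import random

def monte_carlo_integrate(f, a, b, n, seed=0):
    """Average f at n uniform random points, scaled by the interval length."""
    rng = random.Random(seed)
    total = sum(f(rng.uniform(a, b)) for _ in range(n))
    return (b - a) / n * total

mc_estimate = monte_carlo_integrate(lambda x: x * x, 0.0, 1.0, 100_000)
print(mc_estimate)                   # near 1/3, with statistical noise
```

In one dimension this is far less accurate than Simpson's rule for the same number of function evaluations; the advantage only appears as the dimension grows.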

Linear System Solving

Gaussian Elimination

ℹ️ Gaussian Elimination

Write augmented matrix $[A|b]$
Forward elimination: Convert to upper triangular
Back substitution: Solve from bottom up

Time: $O(n^3)$
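
The three steps can be sketched with plain Python lists. Partial pivoting (swapping in the row with the largest pivot) is an addition not mentioned above, included because it is a common safeguard against dividing by tiny pivots; the function name is illustrative:

```python
def gauss_solve(A, b):
    """Solve Ax = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]     # augmented matrix [A|b]
    for k in range(n):                               # forward elimination
        p = max(range(k, n), key=lambda r: abs(M[r][k]))
        if M[p][k] == 0:
            raise ValueError("matrix is singular")
        M[k], M[p] = M[p], M[k]
        for r in range(k + 1, n):
            m = M[r][k] / M[k][k]
            for col in range(k, n + 1):
                M[r][col] -= m * M[k][col]
    x = [0.0] * n                                    # back substitution
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

gauss_x = gauss_solve([[2.0, 1.0, -1.0], [-3.0, -1.0, 2.0], [-2.0, 1.0, 2.0]],
                      [8.0, -11.0, -3.0])
print(gauss_x)                       # close to [2, 3, -1]
```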

LU Decomposition

A = LU

Here,

$L$ = Lower triangular matrix
$U$ = Upper triangular matrix

ℹ️ LU Decomposition

Solve $Ly = b$ (forward substitution)
Solve $Ux = y$ (back substitution)

Time: $O(n^3)$ for decomposition, $O(n^2)$ for each solve
Efficient when solving for multiple $b$ vectors
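
A sketch of factor-once, solve-many. It assumes a Doolittle factorization (unit diagonal in $L$) without pivoting, so every pivot must be nonzero; the function names are illustrative:

```python
def lu_decompose(A):
    """Doolittle LU without pivoting; assumes all pivots are nonzero."""
    n = len(A)
    L = [[float(i == j) for j in range(n)] for i in range(n)]
    U = [row[:] for row in A]
    for k in range(n):
        for i in range(k + 1, n):
            L[i][k] = U[i][k] / U[k][k]
            for j in range(k, n):
                U[i][j] -= L[i][k] * U[k][j]
    return L, U

def lu_solve(L, U, b):
    """Solve Ly = b (forward), then Ux = y (back)."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = b[i] - sum(L[i][j] * y[j] for j in range(i))
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (y[i] - sum(U[i][j] * x[j] for j in range(i + 1, n))) / U[i][i]
    return x

L_factor, U_factor = lu_decompose([[4.0, 3.0], [6.0, 3.0]])   # O(n^3), once
lu_x1 = lu_solve(L_factor, U_factor, [10.0, 12.0])            # O(n^2) each
lu_x2 = lu_solve(L_factor, U_factor, [7.0, 9.0])
print(lu_x1, lu_x2)                  # close to [1, 2] and [1, 1]
```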

Iterative Methods

Jacobi Method:

x_i^{(k+1)} = \frac{b_i - \sum_{j \neq i} a_{ij} x_j^{(k)}}{a_{ii}}

Here,

$x_i^{(k+1)}$ = Updated value
$x_j^{(k)}$ = Previous values

ℹ️ Jacobi Method

Simple but slow convergence; guaranteed to converge when $A$ is strictly diagonally dominant

Gauss-Seidel Method:

ℹ️ Gauss-Seidel Method

Like Jacobi but uses updated values immediately. Typically converges faster than Jacobi.
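
The difference between the two methods is one line of code: whether the sweep reads from the old vector or from the one being updated. The sketch below assumes a small diagonally dominant test matrix; the helper names and the stopping rule are illustrative:

```python
def jacobi_step(A, b, x):
    """One Jacobi sweep: every component uses only the previous iterate."""
    n = len(b)
    return [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
            for i in range(n)]

def gauss_seidel_step(A, b, x):
    """One Gauss-Seidel sweep: updated components are reused immediately."""
    x = x[:]
    n = len(b)
    for i in range(n):
        s = sum(A[i][j] * x[j] for j in range(n) if j != i)
        x[i] = (b[i] - s) / A[i][i]
    return x

def iterate(step, A, b, tol=1e-10, max_iter=1000):
    """Repeat sweeps from x = 0 until the largest update falls below tol."""
    x = [0.0] * len(b)
    for k in range(1, max_iter + 1):
        x_new = step(A, b, x)
        if max(abs(u - v) for u, v in zip(x_new, x)) < tol:
            return x_new, k
        x = x_new
    raise RuntimeError("iteration did not converge")

A_dd = [[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]]
b_dd = [2.0, 4.0, 10.0]                       # exact solution is [1, 2, 3]
x_jacobi, iters_jacobi = iterate(jacobi_step, A_dd, b_dd)
x_gs, iters_gs = iterate(gauss_seidel_step, A_dd, b_dd)
print(iters_jacobi, iters_gs)
```

On this matrix Gauss-Seidel needs noticeably fewer sweeps than Jacobi to reach the same tolerance.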

Conjugate Gradient:

ℹ️ Conjugate Gradient

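For symmetric positive definite matrices
Typically much faster than Gauss-Seidel
Time: about $O(\sqrt{\kappa})$ iterations, each costing one matrix-vector product, so roughly $O(n\sqrt{\kappa})$ in total for a sparse matrix with $O(n)$ nonzeros, where $\kappa$ is the condition number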

Eigenvalue Algorithms

Power Method:

ℹ️ Power Method

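Find the largest-magnitude (dominant) eigenvalue:

Start with random vector $v$
Repeat: $v \leftarrow Av / \|Av\|$
Eigenvalue $\approx v^TAv$ (the Rayleigh quotient, valid because $\|v\| = 1$)

Simple, but only finds the dominant eigenvalue, and converges slowly when the two largest eigenvalue magnitudes are close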
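
A sketch of the iteration with plain lists. It uses a fixed all-ones start vector instead of a random one so the run is repeatable; that choice, the function name, and the test matrix are assumptions:

```python
def power_method(A, iters=100):
    """Estimate the dominant eigenvalue and eigenvector of a square matrix."""
    n = len(A)
    matvec = lambda v: [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
    v = [1.0] * n                      # fixed start vector (text: random)
    for _ in range(iters):
        w = matvec(v)
        norm = sum(wi * wi for wi in w) ** 0.5
        v = [wi / norm for wi in w]    # v <- Av / ||Av||
    eigenvalue = sum(vi * wi for vi, wi in zip(v, matvec(v)))   # v^T A v
    return eigenvalue, v

dominant, dominant_vec = power_method([[2.0, 1.0], [1.0, 3.0]])
print(dominant)                        # close to (5 + sqrt(5)) / 2
```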

QR Algorithm:

ℹ️ QR Algorithm

Start with $A_0 = A$
Decompose $A_k = Q_kR_k$
$A_{k+1} = R_kQ_k$
$A_k$ converges to upper triangular (eigenvalues on diagonal)

Finds all eigenvalues. Time: $O(n^3)$ per iteration for a dense matrix
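
A sketch of the unshifted iteration on a small symmetric matrix. The QR step here uses modified Gram-Schmidt, which is an assumption made to keep the example self-contained (production codes typically use Householder reflections and shifts); the helper names are illustrative:

```python
def qr_decompose(A):
    """QR factorization via modified Gram-Schmidt; assumes A is nonsingular."""
    n = len(A)
    cols = [[A[i][j] for i in range(n)] for j in range(n)]
    q_cols = []
    R = [[0.0] * n for _ in range(n)]
    for j, v in enumerate(cols):
        w = v[:]
        for i, q in enumerate(q_cols):
            R[i][j] = sum(qi * wi for qi, wi in zip(q, w))
            w = [wi - R[i][j] * qi for wi, qi in zip(w, q)]
        R[j][j] = sum(wi * wi for wi in w) ** 0.5
        q_cols.append([wi / R[j][j] for wi in w])
    Q = [[q_cols[j][i] for j in range(n)] for i in range(n)]
    return Q, R

def matmul(X, Y):
    """Plain triple-loop matrix product."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def qr_eigenvalues(A, iters=200):
    """Repeat A_k = Q_k R_k, A_{k+1} = R_k Q_k and read off the diagonal."""
    Ak = [row[:] for row in A]
    for _ in range(iters):
        Q, R = qr_decompose(Ak)
        Ak = matmul(R, Q)
    return [Ak[i][i] for i in range(len(Ak))]

qr_eigs = qr_eigenvalues([[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]])
print(sorted(qr_eigs))                 # close to 3 - sqrt(3), 3, 3 + sqrt(3)
```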

Optimization Algorithms

Line Search

ℹ️ Line Search

Given direction $d$, find step size $\alpha$:

\alpha = \arg\min f(x + \alpha d)

Methods:

Exact line search (expensive)
Backtracking line search (cheap, practical)
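
A sketch of backtracking line search. The text does not say which acceptance test to use, so this assumes the common Armijo sufficient-decrease condition $f(x + \alpha d) \le f(x) + c\,\alpha\,\nabla f(x)^T d$; the parameters `rho` and `c` and the test function are illustrative:

```python
def backtracking(f, grad_fx, x, d, alpha=1.0, rho=0.5, c=1e-4, max_halvings=60):
    """Shrink alpha until the Armijo sufficient-decrease condition holds."""
    fx = f(x)
    slope = sum(g * di for g, di in zip(grad_fx, d))   # directional derivative
    if slope >= 0:
        raise ValueError("d is not a descent direction")
    for _ in range(max_halvings):
        trial = [xi + alpha * di for xi, di in zip(x, d)]
        if f(trial) <= fx + c * alpha * slope:
            return alpha
        alpha *= rho
    raise RuntimeError("no acceptable step size found")

quad = lambda p: p[0] ** 2 + 10 * p[1] ** 2
x_start = [1.0, 1.0]
grad_start = [2.0, 20.0]                     # gradient of quad at x_start
direction = [-g for g in grad_start]         # steepest descent direction
step_size = backtracking(quad, grad_start, x_start, direction)
x_next = [xi + step_size * di for xi, di in zip(x_start, direction)]
print(step_size, quad(x_next))
```

Each trial costs one function evaluation, which is why this is cheaper than minimizing along the line exactly.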

Trust Region Methods

ℹ️ Trust Region Methods

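Build a model of $f$ (typically quadratic) near the current point
Find the best point of the model within a "trust region" around the current point
If the model predicted the actual change in $f$ well, accept the step and expand the trust region
If not, reject the step and shrink the trust region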

Conjugate Gradient for Optimization

Conjugate Gradient Optimization

\min f(x) = \frac{1}{2}x^TAx - b^Tx

Here,

$A$ = Symmetric positive definite matrix
$b$ = Right-hand-side vector (the minimizer of $f$ solves $Ax = b$)

ℹ️ Conjugate Gradient Algorithm

Start with $x_0 = 0$, $r_0 = b$
$p_0 = r_0$
Repeat:
- $\alpha_k = r_k^Tr_k / p_k^TAp_k$
- $x_{k+1} = x_k + \alpha_kp_k$
- $r_{k+1} = r_k - \alpha_kAp_k$
- $\beta_k = r_{k+1}^Tr_{k+1} / r_k^Tr_k$
- $p_{k+1} = r_{k+1} + \beta_kp_k$

Very efficient for large sparse systems
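
The update rules above map one-to-one onto code. This sketch uses dense Python lists for clarity, whereas a practical version would use a sparse matrix-vector product; the function name, tolerance, and test system are assumptions:

```python
def conjugate_gradient(A, b, tol=1e-10, max_iter=None):
    """Solve Ax = b for symmetric positive definite A."""
    n = len(b)
    matvec = lambda v: [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
    dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
    x = [0.0] * n                 # x_0 = 0
    r = b[:]                      # r_0 = b - A x_0 = b
    p = r[:]                      # p_0 = r_0
    rs = dot(r, r)
    for _ in range(max_iter or 10 * n):
        if rs ** 0.5 < tol:       # residual norm small enough
            break
        Ap = matvec(p)
        alpha = rs / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        beta = rs_new / rs
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x

cg_x = conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
print(cg_x)                       # close to [1/11, 7/11]
```

In exact arithmetic the method reaches the solution of an $n \times n$ system in at most $n$ iterations; for large sparse problems it is usually stopped far earlier, once the residual is small.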

📋 Key Takeaways

Floating point arithmetic has precision limits. IEEE 754 doubles give ~15 decimal digits; $0.1 + 0.2 \neq 0.3$ in floating point — always use tolerances when comparing real numbers.
Newton's Method converges quadratically. $x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}$ roughly doubles the number of correct digits each iteration near a simple root, but requires the derivative and can fail if $f'(x) \approx 0$.
Integration methods trade accuracy for simplicity. Riemann sums are simple but inaccurate; Simpson's rule uses parabolic approximations for better accuracy; Monte Carlo integration $\int_a^b f(x) \, dx \approx \frac{b-a}{N} \sum_i f(x_i)$ excels in high dimensions.
LU Decomposition $A = LU$ efficiently solves linear systems. Factor once in $O(n^3)$, then solve $Ly = b$ and $Ux = y$ in $O(n^2)$ each, which is ideal when solving for multiple right-hand sides.
The Conjugate Gradient method solves sparse symmetric positive definite systems in roughly $O(\sqrt{\kappa})$ iterations, or about $O(n\sqrt{\kappa})$ work when the matrix has $O(n)$ nonzeros. It is far more efficient than direct methods for the large-scale problems common in ML.
The condition number $\kappa(A) = \|A\| \cdot \|A^{-1}\|$ predicts numerical stability. Large $\kappa$ means small input changes cause large output changes — always check conditioning before solving linear systems in production.