← Math|86 of 100
Numerical Methods

Floating Point Arithmetic

Understand IEEE 754, rounding errors, and their impact on numerical computations.

📂 Floating Point📖 Lesson 86 of 100🎓 Free Course

Advertisement

Floating Point Arithmetic

ℹ️ Why It Matters

Every ML model, every simulation, every digital computation ultimately relies on floating point arithmetic. A single misunderstood rounding behavior can cause silent bugs that corrupt training, break inference, or produce incorrect scientific results. Mastering IEEE 754 is non-negotiable for anyone working with numerical software.

💡 Real-World Impact

In 1996, the Ariane 5 rocket was destroyed 37 seconds after launch due to a 64-bit floating point to 16-bit integer conversion overflow — a $370 million lesson in floating point awareness. Financial systems have lost millions due to rounding errors in currency conversions. Even modern deep learning training can diverge silently due to float16 overflow in gradient accumulation.


Definitions

DfIEEE 754 Floating Point Standard

IEEE 754 is the industry standard for representing and manipulating real numbers in binary. It defines formats for single precision (float32), double precision (float64), half precision (float16), and extended precision, specifying rounding rules, special values (NaN, infinity), and exception handling behavior.

DfFloating Point Representation

A floating point number represents a real value as: sign × significand × base^exponent. The significand (or mantissa) stores the significant digits, the exponent scales the number, and the sign bit indicates positive or negative. This allows representing a huge range of values at the cost of non-uniform precision.

DfMachine Epsilon (ε)

Machine epsilon is the smallest positive number ε such that fl(1 + ε) ≠ 1 in floating point arithmetic. It quantifies the upper bound on the relative error of rounding. For float64, ε ≈ 2.22 × 10⁻¹⁶. It represents the gap between 1.0 and the next representable float64 value.

DfNormalized vs Denormalized Numbers

Normalized numbers have an implicit leading 1 in the significand: 1.xxxx... × 2^e. Denormalized (subnormal) numbers have an implicit leading 0, allowing representation of values closer to zero at the cost of reduced precision. Denormals prevent underflow from being abrupt — gradually flushing to zero.

DfRounding Modes

IEEE 754 defines four rounding modes: round to nearest, ties to even (default); round toward positive infinity (ceiling); round toward negative infinity (floor); round toward zero (truncation). The default mode minimizes systematic bias in long sequences of arithmetic operations.


Formulas

IEEE 754 Double Precision (float64) Layout

V=(1)s×2e1023×(1+i=152bi2i)V = (-1)^s \times 2^{e - 1023} \times (1 + \sum_{i=1}^{52} b_i \cdot 2^{-i})

Here,

  • ss=Sign bit (0 = positive, 1 = negative), 1 bit
  • ee=Exponent, biased by 1023, stored in 11 bits (range 0–2047)
  • bib_i=Mantissa bits, 52 bits storing the fractional part after the implicit leading 1

Machine Epsilon

εmach=2522.2204×1016\varepsilon_{\text{mach}} = 2^{-52} \approx 2.2204 \times 10^{-16}

Here,

  • εmachε_mach=For IEEE 754 double precision (float64)

Relative Rounding Error Bound

fl(x)xεmach2x|\text{fl}(x) - x| \leq \frac{\varepsilon_{\text{mach}}}{2} |x|

Here,

  • fl(x)fl(x)=Floating point representation of x
  • εmachε_mach=Machine epsilon

Condition Number for Addition

κadd(a,b)=a+ba+b\kappa_{\text{add}}(a, b) = \frac{|a| + |b|}{|a + b|}

Here,

  • a,ba, b=Operands
  • κκ=Large κ indicates catastrophic cancellation risk

Kahan Summation Compensation

Sn+1=Sn+(xnCn);Cn+1=(Sn+1Sn)(xnCn)S_{n+1} = S_n + (x_n - C_n); \quad C_{n+1} = (S_{n+1} - S_n) - (x_n - C_n)

Here,

  • SS=Running sum
  • CC=Compensation term tracking lost low-order bits
  • xnx_n=Next value to add

IEEE 754 Deep Dive

Double Precision (float64) — 64 bits total

FieldBitsPurpose
Sign (s)1 bit0 = positive, 1 = negative
Exponent (e)11 bitsBiased by 1023 (actual exponent = e − 1023)
Mantissa (m)52 bitsFractional part; implicit leading 1 for normalized

Exponent special values:

  • e = 0, m = 0: ±0
  • e = 0, m ≠ 0: Denormalized numbers
  • e = 2047, m = 0: ±Infinity
  • e = 2047, m ≠ 0: NaN (Not a Number)

Range of float64:

  • Minimum positive normal: 2^(−1022) ≈ 2.2 × 10^(−308)
  • Maximum finite: (2 − 2^(−52)) × 2^1023 ≈ 1.8 × 10^308

Half Precision (float16) — Used in ML Training

FieldBitsPurpose
Sign1 bitPositive or negative
Exponent5 bitsBiased by 15
Mantissa10 bitsFractional part
  • Range: ±65504
  • Smallest positive normal: 6.1 × 10^(−5)
  • Machine epsilon: 9.77 × 10^(−4)
  • Used extensively in deep learning for memory savings and speed

Single Precision (float32)

FieldBitsPurpose
Sign1 bitPositive or negative
Exponent8 bitsBiased by 127
Mantissa23 bitsFractional part
  • Machine epsilon: ≈ 1.19 × 10^(−7)
  • Range: ±3.4 × 10^38
  • Common default for ML inference

Examples

📝Why 0.1 + 0.2 ≠ 0.3

The decimal numbers 0.1 and 0.2 cannot be exactly represented in binary floating point. When stored, they become the closest representable float64 values, which are slightly off from the true decimal values. Adding them produces a result that, when compared to the float64 representation of 0.3, differs by approximately 5.55 × 10^(−17).

💡Demonstration: 0.1 + 0.2 vs 0.3

import numpy as np

# The classic floating point surprise
a = 0.1
b = 0.2
c = 0.3

print(f"a + b = {a + b}")          # 0.30000000000000004
print(f"c     = {c}")              # 0.3
print(f"a + b == c: {a + b == c}") # False!

# The actual binary representations
print(f"0.1 is stored as: {a:.20f}")  # 0.10000000000000000555
print(f"0.2 is stored as: {b:.20f}")  # 0.20000000000000001110
print(f"0.3 is stored as: {c:.20f}")  # 0.29999999999999998890

# Correct comparison
print(f"np.isclose: {np.isclose(a + b, c)}")  # True

📝Catastrophic Cancellation

When subtracting two nearly equal numbers, the significant digits cancel, leaving mostly noise from the less significant bits. The relative error of the result can be much larger than the relative error of the inputs.

💡Demonstration: Catastrophic Cancellation

import numpy as np

# Nearly equal numbers
a = 1.000000000000001
b = 1.000000000000000

diff = a - b
print(f"a - b = {diff}")          # 1.1102230246251565e-15
print(f"Exact: 1e-15")            # True difference is 1e-15
print(f"Relative error: {abs(diff - 1e-15) / 1e-15:.2%}")  # ~11%

# Another example: quadratic formula
import math

def quadratic_naive(a, b, c):
    """Naive quadratic formula — suffers from cancellation."""
    disc = b**2 - 4*a*c
    x1 = (-b + math.sqrt(disc)) / (2*a)
    x2 = (-b - math.sqrt(disc)) / (2*a)
    return x1, x2

def quadratic_stable(a, b, c):
    """Stable quadratic formula — avoids cancellation."""
    disc = b**2 - 4*a*c
    q = -0.5 * (b + math.copysign(math.sqrt(disc), b))
    x1 = q / a
    x2 = c / q
    return x1, x2

# Example: x^2 - 1000000x + 1 = 0
# True roots: x ≈ 999999.999999 and x ≈ 1e-6
a, b, c = 1, -1e6, 1
x1_naive, x2_naive = quadratic_naive(a, b, c)
x1_stable, x2_stable = quadratic_stable(a, b, c)

print(f"Naive:  x1={x1_naive}, x2={x2_naive}")
print(f"Stable: x1={x1_stable}, x2={x2_stable}")
# Naive x2 loses accuracy; stable version preserves it

📝Overflow and Underflow

Overflow occurs when a result exceeds the maximum representable finite value, producing infinity. Underflow occurs when a result is too close to zero for the normal range, either flushing to zero (abrupt) or entering denormalized territory (gradual).

💡Demonstration: Overflow and Underflow

import numpy as np

# Overflow to infinity
big = np.float64(1.8e308)
print(f"Max finite: {np.finfo(np.float64).max}")
print(f"big * 2 = {big * 2}")  # inf
print(f"big + big = {big + big}")  # inf

# Underflow to denormalized / zero
tiny = np.float64(2.2e-308)
print(f"Tiny: {tiny}")
print(f"Tiny / 2 = {tiny / 2}")  # Denormalized, eventually 0.0

# Gradual underflow
print(f"Smallest normal: {np.finfo(np.float64).tiny}")
print(f"Smallest subnormal: {np.finfo(np.float64).smallest_subnormal}")

# Demonstrate loss of precision with underflow
x = 1e-300
for i in range(20):
    x = x * x
    print(f"Step {i+1}: {x}")  # Quickly goes to 0.0

📝Kahan Summation

Standard sequential summation accumulates rounding error proportional to n × ε, where n is the number of terms. Kahan summation uses a compensation variable to track and correct for lost low-order bits, achieving accuracy close to the theoretical O(ε) bound regardless of n.

💡Demonstration: Kahan vs Naive Summation

import numpy as np

def naive_sum(x):
    """Simple sequential summation."""
    s = 0.0
    for val in x:
        s += val
    return s

def kahan_sum(x):
    """Kahan compensated summation."""
    s = 0.0
    c = 0.0  # Compensation for lost low-order bits
    for val in x:
        y = val - c      # Compensated value
        t = s + y         # New sum
        c = (t - s) - y  # Recover lost part
        s = t
    return s

# Sum 10^7 values of 0.1
n = 10_000_000
values = np.full(n, 0.1)

naive_result = naive_sum(values)
kahan_result = kahan_sum(values)
exact = n * 0.1

print(f"Naive sum: {naive_result:.10f}")
print(f"Kahan sum: {kahan_result:.10f}")
print(f"Exact:     {exact:.10f}")
print(f"Naive error: {abs(naive_result - exact):.2e}")
print(f"Kahan error: {abs(kahan_result - exact):.2e}")

Theorems

ThSterbenz Lemma

If x and y are floating point numbers with y/2 ≤ x ≤ 2y, then x − y is computed exactly (with no rounding error) in floating point arithmetic, provided the result is representable.

ℹ️ Why Sterbenz Matters

This theorem guarantees that subtraction of nearby numbers — the very operation that causes catastrophic cancellation — is exact when the operands satisfy the proximity condition. It explains why algorithms designed to keep operands close together achieve better numerical stability.

ThRelative Error of Floating Point Addition

For normalized floating point arithmetic with rounding to nearest, the relative error of a single addition operation satisfies: |fl(a + b) − (a + b)| / |a + b| ≤ ε_mach / 2, provided a + b ≠ 0 and no overflow or underflow occurs.

ThError Growth in Summation

For naive sequential summation of n floating point numbers, the accumulated error satisfies: |S_computed − S_exact| ≤ (n − 1) × (ε_mach / 2) × Σ|x_i| + O(ε²)

Kahan summation reduces this to: |S_computed − S_exact| ≤ 2 × (ε_mach / 2) × Σ|x_i| + O(ε²)

ThBackward Error of Floating Point Operations

For any basic floating point operation ⊕ (add, subtract, multiply, divide), there exists a mathematically exact operation on slightly perturbed inputs such that: fl(a ⊕ b) = (a(1 + δ₁)) ⊕ (b(1 + δ₂))

where |δ_i| ≤ ε_mach / 2. This is the backward error interpretation — the computed result is the exact answer to a nearby problem.


Python Implementation

import numpy as np
import struct
import sys


def inspect_float64(value):
    """Break down a float64 into its IEEE 754 components."""
    # Pack as binary (8 bytes) and unpack as unsigned int
    bits = struct.unpack('Q', struct.pack('d', value))[0]

    sign = (bits >> 63) & 1
    exponent = ((bits >> 52) & 0x7FF) - 1023  # Remove bias
    mantissa = bits & 0xFFFFFFFFFFFFF          # Lower 52 bits

    print(f"Value: {value}")
    print(f"  Binary: {bits:064b}")
    print(f"  Sign:     {sign} ({'negative' if sign else 'positive'})")
    print(f"  Exponent: {exponent + 1023} (unbiased: {exponent})")
    print(f"  Mantissa: {mantissa:052b}")
    print(f"  Implicit leading 1: {1 + mantissa / (1 << 52)}")
    print()


def demonstrate_epsilon():
    """Show machine epsilon and its effects."""
    eps = np.finfo(float).eps
    print(f"Machine epsilon (float64): {eps}")
    print(f"  = 2^(-52) = {2**-52}")

    # Find the next representable float after 1.0
    one_plus = np.nextafter(1.0, 2.0)
    print(f"Next float after 1.0: {one_plus}")
    print(f"  = 1 + {one_plus - 1.0}")
    print(f"  = 1 + 2^(-52) = 1 + {2**-52}")
    print()


def demonstrate_rounding():
    """Show how different rounding modes affect results."""
    # 0.1 in binary is a repeating fraction
    print("Rounding of 0.1 in float64:")
    print(f"  fl(0.1) = {np.float64(0.1):.20f}")
    print(f"  Error:   {np.float64(0.1) - 0.1:.2e}")
    print()

    # Show precision loss in subtraction
    a = 1.0
    b = 1.0 + 1e-15
    print(f"Precision loss demonstration:")
    print(f"  1.0 + 1e-15 = {b:.20f}")
    print(f"  (1.0 + 1e-15) - 1.0 = {b - a:.20f}")
    print(f"  Expected: 1e-15 = {1e-15:.20f}")
    print()


def demonstrate_accumulation_error():
    """Show error accumulation in repeated operations."""
    # Add 0.1 ten times
    result = 0.0
    for i in range(10):
        result += 0.1

    print(f"Summing 0.1 ten times: {result}")
    print(f"  Expected: 1.0")
    print(f"  Error: {result - 1.0:.2e}")
    print(f"  Direct: 10 * 0.1 = {10 * 0.1}")
    print()


def demonstrate_denormals():
    """Show denormalized number behavior."""
    tiny = np.finfo(np.float64).tiny
    print(f"Smallest normal: {tiny}")
    print(f"Smallest subnormal: {np.finfo(np.float64).smallest_subnormal}")
    print()

    # Gradual underflow
    x = tiny
    step = 0
    while x > 0 and step < 60:
        x /= 2
        step += 1
        if step <= 5 or x == 0:
            print(f"  Divide by 2 (step {step}): {x}")
    print()


def demonstrate_special_values():
    """Show IEEE 754 special values."""
    print("Special values:")
    print(f"  +inf: {np.inf}")
    print(f"  -inf: {-np.inf}")
    print(f"  NaN: {np.nan}")
    print(f"  inf + inf = {np.inf + np.inf}")
    print(f"  inf - inf = {np.inf - np.inf}")  # NaN
    print(f"  0/0 = {0.0 / 0.0}")              # NaN
    print(f"  1/0 = {1.0 / 0.0}")              # inf
    print(f"  NaN == NaN: {np.nan == np.nan}")  # False!
    print(f"  isnan(NaN): {np.isnan(np.nan)}")  # True
    print()


def kahan_summation(values):
    """Kahan compensated summation algorithm."""
    total = 0.0
    compensation = 0.0

    for value in values:
        y = value - compensation
        temp = total + y
        compensation = (temp - total) - y
        total = temp

    return total


if __name__ == "__main__":
    print("=" * 60)
    print("IEEE 754 FLOATING POINT ANALYSIS")
    print("=" * 60)

    # Inspect some values
    print("\n--- Value Inspection ---")
    inspect_float64(0.1)
    inspect_float64(0.2)
    inspect_float64(0.1 + 0.2)

    # Epsilon
    print("\n--- Machine Epsilon ---")
    demonstrate_epsilon()

    # Rounding
    print("\n--- Rounding Effects ---")
    demonstrate_rounding()

    # Accumulation
    print("\n--- Accumulation Error ---")
    demonstrate_accumulation_error()

    # Denormals
    print("\n--- Denormalized Numbers ---")
    demonstrate_denormals()

    # Special values
    print("\n--- Special Values ---")
    demonstrate_special_values()

    # Kahan summation
    print("\n--- Kahan Summation ---")
    np.random.seed(42)
    values = np.random.uniform(1e10, 1e10 + 1, 1_000_000)
    naive = sum(values)
    kahan = kahan_summation(values)
    print(f"Naive:  {naive:.6f}")
    print(f"Kahan:  {kahan:.6f}")
    print(f"Difference: {abs(naive - kahan):.6f}")

Applications in AI/ML

ℹ️ Floating Point in Deep Learning

Modern deep learning relies heavily on mixed-precision training, where forward passes use float16 for speed while gradient accumulation uses float32 for accuracy. Understanding float16 limitations is critical — its maximum value of 65504 means gradients exceeding this cause overflow, and its machine epsilon of ~10^(-4) means small updates can be lost entirely.

Training Considerations

Aspectfloat32float16bfloat16
Memory4 bytes2 bytes2 bytes
Max value3.4 × 10^3865,5043.4 × 10^38
Precision~7 digits~3 digits~3 digits
Dynamic rangeExcellentPoorExcellent
ML useBaselineSpeedBalance

Loss Scaling

When using float16, gradients often fall below the smallest representable normal (6.1 × 10^(-5)). Loss scaling multiplies the loss by a large factor before backpropagation, shifting gradients into the representable range, then divides after the optimizer step.

Gradient Clipping

To prevent float16 overflow during gradient accumulation, gradients are clipped to a maximum norm before the optimizer step. Without clipping, exploding gradients in float16 cause NaN propagation that corrupts the entire model.

Quantization

Model quantization converts float32 weights to int8 or float4 representations. Understanding the rounding behavior of float32 is essential for designing quantization schemes that minimize accuracy loss.

Common Pitfalls in ML

  • Log-sum-exp overflow: Computing log(Σexp(x_i)) where x_i are large causes exp to overflow. Use the identity log(Σexp(x_i)) = max(x) + log(Σexp(x_i − max(x))).
  • Softmax instability: If inputs differ by more than ~30 in float32, the softmax denominator overflows. Always subtract the max value before exponentiating.
  • Batch norm with float16: Running statistics computed in float16 lose precision. Always compute batch norm in float32.

Common Mistakes

MistakeProblemSolution
Comparing floats with ==0.1 + 0.2 ≠ 0.3 alwaysUse np.isclose() or threshold comparison
Ignoring cancellationSubtraction of nearly equal numbers amplifies errorUse Kahan summation or reformulate
Summing large + small firstSmall values get lost in roundingSort values by magnitude, sum smallest first
Using float16 without scalingGradients overflow or underflowApply loss scaling, keep accumulators in float32
Assuming associativity(a + b) + c ≠ a + (b + c) in floating pointUse compensated summation
Ignoring denormalsUnderflow can be slow on some hardwareSet flush-to-zero mode if acceptable
Printing floats with full precisionMisleading outputUse repr() or .20f format
Assuming float is exactEvery float has rounding errorDesign algorithms for numerical stability

Interview Questions

Q1: Why is 0.1 + 0.2 ≠ 0.3 in floating point? A: 0.1 and 0.2 are not exactly representable in binary. Their float64 representations are slightly off from the true decimal values. When added, the result is close to 0.3 but not equal to the float64 representation of 0.3. The difference is approximately 5.55 × 10^(−17), which is within one ULP (unit in the last place) of the true sum.

Q2: What is machine epsilon and why does it matter? A: Machine epsilon is the smallest ε such that fl(1 + ε) ≠ 1. For float64 it is 2^(−52) ≈ 2.22 × 10^(−16). It bounds the relative rounding error of every floating point operation. Algorithms with more than ~1/ε operations may lose all significant digits.

Q3: How would you implement a numerically stable softmax? A: Subtract the maximum value from all inputs before exponentiation: softmax(x_i) = exp(x_i − max(x)) / Σ exp(x_j − max(x)). This prevents overflow since all exponent arguments are ≤ 0. The result is mathematically identical but numerically stable.

Q4: What is the difference between float16 and bfloat16? A: float16 has 5 exponent bits and 10 mantissa bits (range ±65504, ~3 digits precision). bfloat16 has 8 exponent bits and 7 mantissa bits (range ±3.4 × 10^38, ~2 digits precision). bfloat16 is preferred for ML because its float32-like dynamic range prevents overflow, even though it has slightly less precision.

Q5: Explain Kahan summation and why it works. A: Kahan summation tracks a compensation variable that captures the low-order bits lost when adding a value to the running sum. Each iteration: (1) subtract compensation from the input, (2) add to sum, (3) recover the lost part using algebraic identity, (4) store it as new compensation. This achieves O(1) error regardless of n, versus O(n) for naive summation.

Q6: When would you use float32 vs float64 in ML? A: float64 is rarely needed in ML — it doubles memory and compute with no accuracy benefit for most models. float32 is the standard for training. float16/bfloat16 are used for inference and mixed-precision training to save memory and leverage tensor cores. Use float64 only for numerical algorithms requiring high precision (e.g., solving linear systems, computing condition numbers).


Practice Problems

📝Problem 1: Implement a Function to Check IEEE 754 Compliance

Write a Python function that verifies whether the system uses IEEE 754 by checking that 0.1 + 0.2 ≠ 0.3 and that infinity arithmetic produces the expected results.

💡Solution

import numpy as np

def check_ieee754():
    """Verify IEEE 754 compliance."""
    checks = []

    # Check 1: Basic rounding
    checks.append(("0.1+0.2 != 0.3", 0.1 + 0.2 != 0.3))

    # Check 2: Infinity
    checks.append(("inf + inf = inf", np.inf + np.inf == np.inf))

    # Check 3: NaN
    checks.append(("NaN != NaN", np.nan != np.nan))

    # Check 4: Overflow to inf
    checks.append(("overflow to inf", np.float64(1.8e308) * 2 == np.inf))

    # Check 5: Denormalized
    checks.append(("subnormal exists",
                    np.finfo(np.float64).smallest_subnormal > 0))

    for name, passed in checks:
        print(f"  {'PASS' if passed else 'FAIL'}: {name}")

check_ieee754()

📝Problem 2: Compute Variance Numerically Stable

The naive variance formula Var(X) = E[X²] − (E[X])² suffers from catastrophic cancellation when E[X²] ≈ (E[X])². Implement Welford's online algorithm that computes variance in a single pass with O(1) error.

💡Solution

import numpy as np

def welford_variance(values):
    """Welford's online algorithm for numerically stable variance."""
    n = 0
    mean = 0.0
    M2 = 0.0

    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        delta2 = x - mean
        M2 += delta * delta2

    variance = M2 / n  # Population variance
    # For sample variance: M2 / (n - 1)
    return mean, variance

# Compare with naive formula
np.random.seed(42)
data = np.random.uniform(1e7, 1e7 + 1, 1_000_000)

# Naive: E[X²] - (E[X])²
naive_var = np.mean(data**2) - np.mean(data)**2

# Welford
mean, welford_var = welford_variance(data)

# True variance (known for uniform distribution)
true_var = (1/12)  # Variance of U(a,b) = (b-a)^2/12

print(f"Naive variance:   {naive_var:.6f}")
print(f"Welford variance: {welford_var:.6f}")
print(f"True variance:    {true_var:.6f}")

📝Problem 3: Design a Log-Sum-Exp Function

The expression log(Σexp(x_i)) overflows when any x_i is large. Design a numerically stable version.

💡Solution

import numpy as np

def log_sum_exp(x):
    """Numerically stable log-sum-exp.
    Uses the identity: log(Σexp(x_i)) = max(x) + log(Σexp(x_i - max(x)))
    """
    max_x = np.max(x)
    return max_x + np.log(np.sum(np.exp(x - max_x)))

# Test
x = np.array([1000, 1001, 1002])  # Would overflow naive approach
result = log_sum_exp(x)
print(f"Log-sum-exp: {result:.6f}")

# Verify against scipy
from scipy.special import logsumexp
print(f"scipy result: {logsumexp(x):.6f}")

Quick Reference

📋Floating Point Quick Reference

Propertyfloat16float32float64
Bits163264
Sign bits111
Exponent bits5811
Mantissa bits102352
Machine epsilon9.77 × 10⁻⁴1.19 × 10⁻⁷2.22 × 10⁻¹⁶
Max value65,5043.4 × 10³⁸1.8 × 10³⁰⁸
Min normal6.1 × 10⁻⁵1.2 × 10⁻³⁸2.2 × 10⁻³⁰⁸
Precision (digits)~3~7~15

Key Python Functions:

  • np.finfo(float).eps — machine epsilon
  • np.finfo(float).max — maximum finite value
  • np.finfo(float).tiny — smallest positive normal
  • np.nextafter(a, b) — next representable float toward b
  • np.isclose(a, b) — approximate equality check
  • np.isnan(x) — check for NaN
  • np.isinf(x) — check for infinity

Golden Rules:

  1. Never compare floats with ==
  2. Use np.isclose() with appropriate tolerances
  3. Sum smallest values first
  4. Use Kahan summation for large accumulations
  5. Subtract a common value before exponentiating (softmax stability)
  6. Keep accumulators in float32 if using float16

Cross-References

  • 087-Root-Finding — Newton's method depends on floating point arithmetic; understanding rounding is essential for choosing convergence tolerances
  • 088-Integration — Quadrature error analysis requires understanding truncation vs roundoff error tradeoffs
  • 089-Interpolation — Polynomial interpolation accuracy is limited by floating point precision; Runge's phenomenon interacts with roundoff
  • 090-Error Analysis — Condition numbers and stability analysis build directly on IEEE 754 properties
  • Linear Algebra — Matrix operations (LU decomposition, eigenvalue computation) are highly sensitive to floating point rounding; pivoting strategies address this
  • Optimization — Gradient descent convergence depends on learning rates being representable; underflow in gradient updates causes silent failures
Lesson Progress86 / 100