Floating Point Arithmetic
ℹ️ Why It Matters
Every ML model, every simulation, every digital computation ultimately relies on floating point arithmetic. A single misunderstood rounding behavior can cause silent bugs that corrupt training, break inference, or produce incorrect scientific results. Mastering IEEE 754 is non-negotiable for anyone working with numerical software.
💡 Real-World Impact
In 1996, the Ariane 5 rocket was destroyed 37 seconds after launch due to a 64-bit floating point to 16-bit integer conversion overflow — a $370 million lesson in floating point awareness. Financial systems have lost millions due to rounding errors in currency conversions. Even modern deep learning training can diverge silently due to float16 overflow in gradient accumulation.
Definitions
DfIEEE 754 Floating Point Standard
IEEE 754 is the industry standard for representing and manipulating real numbers in binary. It defines formats for single precision (float32), double precision (float64), half precision (float16), and extended precision, specifying rounding rules, special values (NaN, infinity), and exception handling behavior.
DfFloating Point Representation
A floating point number represents a real value as: sign × significand × base^exponent. The significand (or mantissa) stores the significant digits, the exponent scales the number, and the sign bit indicates positive or negative. This allows representing a huge range of values at the cost of non-uniform precision.
DfMachine Epsilon (ε)
Machine epsilon is the smallest positive number ε such that fl(1 + ε) ≠ 1 in floating point arithmetic. It quantifies the upper bound on the relative error of rounding. For float64, ε ≈ 2.22 × 10⁻¹⁶. It represents the gap between 1.0 and the next representable float64 value.
DfNormalized vs Denormalized Numbers
Normalized numbers have an implicit leading 1 in the significand: 1.xxxx... × 2^e. Denormalized (subnormal) numbers have an implicit leading 0, allowing representation of values closer to zero at the cost of reduced precision. Denormals prevent underflow from being abrupt — gradually flushing to zero.
DfRounding Modes
IEEE 754 defines four rounding modes: round to nearest, ties to even (default); round toward positive infinity (ceiling); round toward negative infinity (floor); round toward zero (truncation). The default mode minimizes systematic bias in long sequences of arithmetic operations.
Formulas
IEEE 754 Double Precision (float64) Layout
Here,
- =Sign bit (0 = positive, 1 = negative), 1 bit
- =Exponent, biased by 1023, stored in 11 bits (range 0–2047)
- =Mantissa bits, 52 bits storing the fractional part after the implicit leading 1
Machine Epsilon
Here,
- =For IEEE 754 double precision (float64)
Relative Rounding Error Bound
Here,
- =Floating point representation of x
- =Machine epsilon
Condition Number for Addition
Here,
- =Operands
- =Large κ indicates catastrophic cancellation risk
Kahan Summation Compensation
Here,
- =Running sum
- =Compensation term tracking lost low-order bits
- =Next value to add
IEEE 754 Deep Dive
Double Precision (float64) — 64 bits total
| Field | Bits | Purpose |
|---|---|---|
| Sign (s) | 1 bit | 0 = positive, 1 = negative |
| Exponent (e) | 11 bits | Biased by 1023 (actual exponent = e − 1023) |
| Mantissa (m) | 52 bits | Fractional part; implicit leading 1 for normalized |
Exponent special values:
- e = 0, m = 0: ±0
- e = 0, m ≠ 0: Denormalized numbers
- e = 2047, m = 0: ±Infinity
- e = 2047, m ≠ 0: NaN (Not a Number)
Range of float64:
- Minimum positive normal: 2^(−1022) ≈ 2.2 × 10^(−308)
- Maximum finite: (2 − 2^(−52)) × 2^1023 ≈ 1.8 × 10^308
Half Precision (float16) — Used in ML Training
| Field | Bits | Purpose |
|---|---|---|
| Sign | 1 bit | Positive or negative |
| Exponent | 5 bits | Biased by 15 |
| Mantissa | 10 bits | Fractional part |
- Range: ±65504
- Smallest positive normal: 6.1 × 10^(−5)
- Machine epsilon: 9.77 × 10^(−4)
- Used extensively in deep learning for memory savings and speed
Single Precision (float32)
| Field | Bits | Purpose |
|---|---|---|
| Sign | 1 bit | Positive or negative |
| Exponent | 8 bits | Biased by 127 |
| Mantissa | 23 bits | Fractional part |
- Machine epsilon: ≈ 1.19 × 10^(−7)
- Range: ±3.4 × 10^38
- Common default for ML inference
Examples
📝Why 0.1 + 0.2 ≠ 0.3
The decimal numbers 0.1 and 0.2 cannot be exactly represented in binary floating point. When stored, they become the closest representable float64 values, which are slightly off from the true decimal values. Adding them produces a result that, when compared to the float64 representation of 0.3, differs by approximately 5.55 × 10^(−17).
💡Demonstration: 0.1 + 0.2 vs 0.3
import numpy as np
# The classic floating point surprise
a = 0.1
b = 0.2
c = 0.3
print(f"a + b = {a + b}") # 0.30000000000000004
print(f"c = {c}") # 0.3
print(f"a + b == c: {a + b == c}") # False!
# The actual binary representations
print(f"0.1 is stored as: {a:.20f}") # 0.10000000000000000555
print(f"0.2 is stored as: {b:.20f}") # 0.20000000000000001110
print(f"0.3 is stored as: {c:.20f}") # 0.29999999999999998890
# Correct comparison
print(f"np.isclose: {np.isclose(a + b, c)}") # True
📝Catastrophic Cancellation
When subtracting two nearly equal numbers, the significant digits cancel, leaving mostly noise from the less significant bits. The relative error of the result can be much larger than the relative error of the inputs.
💡Demonstration: Catastrophic Cancellation
import numpy as np
# Nearly equal numbers
a = 1.000000000000001
b = 1.000000000000000
diff = a - b
print(f"a - b = {diff}") # 1.1102230246251565e-15
print(f"Exact: 1e-15") # True difference is 1e-15
print(f"Relative error: {abs(diff - 1e-15) / 1e-15:.2%}") # ~11%
# Another example: quadratic formula
import math
def quadratic_naive(a, b, c):
"""Naive quadratic formula — suffers from cancellation."""
disc = b**2 - 4*a*c
x1 = (-b + math.sqrt(disc)) / (2*a)
x2 = (-b - math.sqrt(disc)) / (2*a)
return x1, x2
def quadratic_stable(a, b, c):
"""Stable quadratic formula — avoids cancellation."""
disc = b**2 - 4*a*c
q = -0.5 * (b + math.copysign(math.sqrt(disc), b))
x1 = q / a
x2 = c / q
return x1, x2
# Example: x^2 - 1000000x + 1 = 0
# True roots: x ≈ 999999.999999 and x ≈ 1e-6
a, b, c = 1, -1e6, 1
x1_naive, x2_naive = quadratic_naive(a, b, c)
x1_stable, x2_stable = quadratic_stable(a, b, c)
print(f"Naive: x1={x1_naive}, x2={x2_naive}")
print(f"Stable: x1={x1_stable}, x2={x2_stable}")
# Naive x2 loses accuracy; stable version preserves it
📝Overflow and Underflow
Overflow occurs when a result exceeds the maximum representable finite value, producing infinity. Underflow occurs when a result is too close to zero for the normal range, either flushing to zero (abrupt) or entering denormalized territory (gradual).
💡Demonstration: Overflow and Underflow
import numpy as np
# Overflow to infinity
big = np.float64(1.8e308)
print(f"Max finite: {np.finfo(np.float64).max}")
print(f"big * 2 = {big * 2}") # inf
print(f"big + big = {big + big}") # inf
# Underflow to denormalized / zero
tiny = np.float64(2.2e-308)
print(f"Tiny: {tiny}")
print(f"Tiny / 2 = {tiny / 2}") # Denormalized, eventually 0.0
# Gradual underflow
print(f"Smallest normal: {np.finfo(np.float64).tiny}")
print(f"Smallest subnormal: {np.finfo(np.float64).smallest_subnormal}")
# Demonstrate loss of precision with underflow
x = 1e-300
for i in range(20):
x = x * x
print(f"Step {i+1}: {x}") # Quickly goes to 0.0
📝Kahan Summation
Standard sequential summation accumulates rounding error proportional to n × ε, where n is the number of terms. Kahan summation uses a compensation variable to track and correct for lost low-order bits, achieving accuracy close to the theoretical O(ε) bound regardless of n.
💡Demonstration: Kahan vs Naive Summation
import numpy as np
def naive_sum(x):
"""Simple sequential summation."""
s = 0.0
for val in x:
s += val
return s
def kahan_sum(x):
"""Kahan compensated summation."""
s = 0.0
c = 0.0 # Compensation for lost low-order bits
for val in x:
y = val - c # Compensated value
t = s + y # New sum
c = (t - s) - y # Recover lost part
s = t
return s
# Sum 10^7 values of 0.1
n = 10_000_000
values = np.full(n, 0.1)
naive_result = naive_sum(values)
kahan_result = kahan_sum(values)
exact = n * 0.1
print(f"Naive sum: {naive_result:.10f}")
print(f"Kahan sum: {kahan_result:.10f}")
print(f"Exact: {exact:.10f}")
print(f"Naive error: {abs(naive_result - exact):.2e}")
print(f"Kahan error: {abs(kahan_result - exact):.2e}")
Theorems
ThSterbenz Lemma
If x and y are floating point numbers with y/2 ≤ x ≤ 2y, then x − y is computed exactly (with no rounding error) in floating point arithmetic, provided the result is representable.
ℹ️ Why Sterbenz Matters
This theorem guarantees that subtraction of nearby numbers — the very operation that causes catastrophic cancellation — is exact when the operands satisfy the proximity condition. It explains why algorithms designed to keep operands close together achieve better numerical stability.
ThRelative Error of Floating Point Addition
For normalized floating point arithmetic with rounding to nearest, the relative error of a single addition operation satisfies: |fl(a + b) − (a + b)| / |a + b| ≤ ε_mach / 2, provided a + b ≠ 0 and no overflow or underflow occurs.
ThError Growth in Summation
For naive sequential summation of n floating point numbers, the accumulated error satisfies: |S_computed − S_exact| ≤ (n − 1) × (ε_mach / 2) × Σ|x_i| + O(ε²)
Kahan summation reduces this to: |S_computed − S_exact| ≤ 2 × (ε_mach / 2) × Σ|x_i| + O(ε²)
ThBackward Error of Floating Point Operations
For any basic floating point operation ⊕ (add, subtract, multiply, divide), there exists a mathematically exact operation on slightly perturbed inputs such that: fl(a ⊕ b) = (a(1 + δ₁)) ⊕ (b(1 + δ₂))
where |δ_i| ≤ ε_mach / 2. This is the backward error interpretation — the computed result is the exact answer to a nearby problem.
Python Implementation
import numpy as np
import struct
import sys
def inspect_float64(value):
"""Break down a float64 into its IEEE 754 components."""
# Pack as binary (8 bytes) and unpack as unsigned int
bits = struct.unpack('Q', struct.pack('d', value))[0]
sign = (bits >> 63) & 1
exponent = ((bits >> 52) & 0x7FF) - 1023 # Remove bias
mantissa = bits & 0xFFFFFFFFFFFFF # Lower 52 bits
print(f"Value: {value}")
print(f" Binary: {bits:064b}")
print(f" Sign: {sign} ({'negative' if sign else 'positive'})")
print(f" Exponent: {exponent + 1023} (unbiased: {exponent})")
print(f" Mantissa: {mantissa:052b}")
print(f" Implicit leading 1: {1 + mantissa / (1 << 52)}")
print()
def demonstrate_epsilon():
"""Show machine epsilon and its effects."""
eps = np.finfo(float).eps
print(f"Machine epsilon (float64): {eps}")
print(f" = 2^(-52) = {2**-52}")
# Find the next representable float after 1.0
one_plus = np.nextafter(1.0, 2.0)
print(f"Next float after 1.0: {one_plus}")
print(f" = 1 + {one_plus - 1.0}")
print(f" = 1 + 2^(-52) = 1 + {2**-52}")
print()
def demonstrate_rounding():
"""Show how different rounding modes affect results."""
# 0.1 in binary is a repeating fraction
print("Rounding of 0.1 in float64:")
print(f" fl(0.1) = {np.float64(0.1):.20f}")
print(f" Error: {np.float64(0.1) - 0.1:.2e}")
print()
# Show precision loss in subtraction
a = 1.0
b = 1.0 + 1e-15
print(f"Precision loss demonstration:")
print(f" 1.0 + 1e-15 = {b:.20f}")
print(f" (1.0 + 1e-15) - 1.0 = {b - a:.20f}")
print(f" Expected: 1e-15 = {1e-15:.20f}")
print()
def demonstrate_accumulation_error():
"""Show error accumulation in repeated operations."""
# Add 0.1 ten times
result = 0.0
for i in range(10):
result += 0.1
print(f"Summing 0.1 ten times: {result}")
print(f" Expected: 1.0")
print(f" Error: {result - 1.0:.2e}")
print(f" Direct: 10 * 0.1 = {10 * 0.1}")
print()
def demonstrate_denormals():
"""Show denormalized number behavior."""
tiny = np.finfo(np.float64).tiny
print(f"Smallest normal: {tiny}")
print(f"Smallest subnormal: {np.finfo(np.float64).smallest_subnormal}")
print()
# Gradual underflow
x = tiny
step = 0
while x > 0 and step < 60:
x /= 2
step += 1
if step <= 5 or x == 0:
print(f" Divide by 2 (step {step}): {x}")
print()
def demonstrate_special_values():
"""Show IEEE 754 special values."""
print("Special values:")
print(f" +inf: {np.inf}")
print(f" -inf: {-np.inf}")
print(f" NaN: {np.nan}")
print(f" inf + inf = {np.inf + np.inf}")
print(f" inf - inf = {np.inf - np.inf}") # NaN
print(f" 0/0 = {0.0 / 0.0}") # NaN
print(f" 1/0 = {1.0 / 0.0}") # inf
print(f" NaN == NaN: {np.nan == np.nan}") # False!
print(f" isnan(NaN): {np.isnan(np.nan)}") # True
print()
def kahan_summation(values):
"""Kahan compensated summation algorithm."""
total = 0.0
compensation = 0.0
for value in values:
y = value - compensation
temp = total + y
compensation = (temp - total) - y
total = temp
return total
if __name__ == "__main__":
print("=" * 60)
print("IEEE 754 FLOATING POINT ANALYSIS")
print("=" * 60)
# Inspect some values
print("\n--- Value Inspection ---")
inspect_float64(0.1)
inspect_float64(0.2)
inspect_float64(0.1 + 0.2)
# Epsilon
print("\n--- Machine Epsilon ---")
demonstrate_epsilon()
# Rounding
print("\n--- Rounding Effects ---")
demonstrate_rounding()
# Accumulation
print("\n--- Accumulation Error ---")
demonstrate_accumulation_error()
# Denormals
print("\n--- Denormalized Numbers ---")
demonstrate_denormals()
# Special values
print("\n--- Special Values ---")
demonstrate_special_values()
# Kahan summation
print("\n--- Kahan Summation ---")
np.random.seed(42)
values = np.random.uniform(1e10, 1e10 + 1, 1_000_000)
naive = sum(values)
kahan = kahan_summation(values)
print(f"Naive: {naive:.6f}")
print(f"Kahan: {kahan:.6f}")
print(f"Difference: {abs(naive - kahan):.6f}")
Applications in AI/ML
ℹ️ Floating Point in Deep Learning
Modern deep learning relies heavily on mixed-precision training, where forward passes use float16 for speed while gradient accumulation uses float32 for accuracy. Understanding float16 limitations is critical — its maximum value of 65504 means gradients exceeding this cause overflow, and its machine epsilon of ~10^(-4) means small updates can be lost entirely.
Training Considerations
| Aspect | float32 | float16 | bfloat16 |
|---|---|---|---|
| Memory | 4 bytes | 2 bytes | 2 bytes |
| Max value | 3.4 × 10^38 | 65,504 | 3.4 × 10^38 |
| Precision | ~7 digits | ~3 digits | ~3 digits |
| Dynamic range | Excellent | Poor | Excellent |
| ML use | Baseline | Speed | Balance |
Loss Scaling
When using float16, gradients often fall below the smallest representable normal (6.1 × 10^(-5)). Loss scaling multiplies the loss by a large factor before backpropagation, shifting gradients into the representable range, then divides after the optimizer step.
Gradient Clipping
To prevent float16 overflow during gradient accumulation, gradients are clipped to a maximum norm before the optimizer step. Without clipping, exploding gradients in float16 cause NaN propagation that corrupts the entire model.
Quantization
Model quantization converts float32 weights to int8 or float4 representations. Understanding the rounding behavior of float32 is essential for designing quantization schemes that minimize accuracy loss.
Common Pitfalls in ML
- Log-sum-exp overflow: Computing log(Σexp(x_i)) where x_i are large causes exp to overflow. Use the identity log(Σexp(x_i)) = max(x) + log(Σexp(x_i − max(x))).
- Softmax instability: If inputs differ by more than ~30 in float32, the softmax denominator overflows. Always subtract the max value before exponentiating.
- Batch norm with float16: Running statistics computed in float16 lose precision. Always compute batch norm in float32.
Common Mistakes
| Mistake | Problem | Solution |
|---|---|---|
| Comparing floats with == | 0.1 + 0.2 ≠ 0.3 always | Use np.isclose() or threshold comparison |
| Ignoring cancellation | Subtraction of nearly equal numbers amplifies error | Use Kahan summation or reformulate |
| Summing large + small first | Small values get lost in rounding | Sort values by magnitude, sum smallest first |
| Using float16 without scaling | Gradients overflow or underflow | Apply loss scaling, keep accumulators in float32 |
| Assuming associativity | (a + b) + c ≠ a + (b + c) in floating point | Use compensated summation |
| Ignoring denormals | Underflow can be slow on some hardware | Set flush-to-zero mode if acceptable |
| Printing floats with full precision | Misleading output | Use repr() or .20f format |
| Assuming float is exact | Every float has rounding error | Design algorithms for numerical stability |
Interview Questions
Q1: Why is 0.1 + 0.2 ≠ 0.3 in floating point? A: 0.1 and 0.2 are not exactly representable in binary. Their float64 representations are slightly off from the true decimal values. When added, the result is close to 0.3 but not equal to the float64 representation of 0.3. The difference is approximately 5.55 × 10^(−17), which is within one ULP (unit in the last place) of the true sum.
Q2: What is machine epsilon and why does it matter? A: Machine epsilon is the smallest ε such that fl(1 + ε) ≠ 1. For float64 it is 2^(−52) ≈ 2.22 × 10^(−16). It bounds the relative rounding error of every floating point operation. Algorithms with more than ~1/ε operations may lose all significant digits.
Q3: How would you implement a numerically stable softmax? A: Subtract the maximum value from all inputs before exponentiation: softmax(x_i) = exp(x_i − max(x)) / Σ exp(x_j − max(x)). This prevents overflow since all exponent arguments are ≤ 0. The result is mathematically identical but numerically stable.
Q4: What is the difference between float16 and bfloat16? A: float16 has 5 exponent bits and 10 mantissa bits (range ±65504, ~3 digits precision). bfloat16 has 8 exponent bits and 7 mantissa bits (range ±3.4 × 10^38, ~2 digits precision). bfloat16 is preferred for ML because its float32-like dynamic range prevents overflow, even though it has slightly less precision.
Q5: Explain Kahan summation and why it works. A: Kahan summation tracks a compensation variable that captures the low-order bits lost when adding a value to the running sum. Each iteration: (1) subtract compensation from the input, (2) add to sum, (3) recover the lost part using algebraic identity, (4) store it as new compensation. This achieves O(1) error regardless of n, versus O(n) for naive summation.
Q6: When would you use float32 vs float64 in ML? A: float64 is rarely needed in ML — it doubles memory and compute with no accuracy benefit for most models. float32 is the standard for training. float16/bfloat16 are used for inference and mixed-precision training to save memory and leverage tensor cores. Use float64 only for numerical algorithms requiring high precision (e.g., solving linear systems, computing condition numbers).
Practice Problems
📝Problem 1: Implement a Function to Check IEEE 754 Compliance
Write a Python function that verifies whether the system uses IEEE 754 by checking that 0.1 + 0.2 ≠ 0.3 and that infinity arithmetic produces the expected results.
💡Solution
import numpy as np
def check_ieee754():
"""Verify IEEE 754 compliance."""
checks = []
# Check 1: Basic rounding
checks.append(("0.1+0.2 != 0.3", 0.1 + 0.2 != 0.3))
# Check 2: Infinity
checks.append(("inf + inf = inf", np.inf + np.inf == np.inf))
# Check 3: NaN
checks.append(("NaN != NaN", np.nan != np.nan))
# Check 4: Overflow to inf
checks.append(("overflow to inf", np.float64(1.8e308) * 2 == np.inf))
# Check 5: Denormalized
checks.append(("subnormal exists",
np.finfo(np.float64).smallest_subnormal > 0))
for name, passed in checks:
print(f" {'PASS' if passed else 'FAIL'}: {name}")
check_ieee754()
📝Problem 2: Compute Variance Numerically Stable
The naive variance formula Var(X) = E[X²] − (E[X])² suffers from catastrophic cancellation when E[X²] ≈ (E[X])². Implement Welford's online algorithm that computes variance in a single pass with O(1) error.
💡Solution
import numpy as np
def welford_variance(values):
"""Welford's online algorithm for numerically stable variance."""
n = 0
mean = 0.0
M2 = 0.0
for x in values:
n += 1
delta = x - mean
mean += delta / n
delta2 = x - mean
M2 += delta * delta2
variance = M2 / n # Population variance
# For sample variance: M2 / (n - 1)
return mean, variance
# Compare with naive formula
np.random.seed(42)
data = np.random.uniform(1e7, 1e7 + 1, 1_000_000)
# Naive: E[X²] - (E[X])²
naive_var = np.mean(data**2) - np.mean(data)**2
# Welford
mean, welford_var = welford_variance(data)
# True variance (known for uniform distribution)
true_var = (1/12) # Variance of U(a,b) = (b-a)^2/12
print(f"Naive variance: {naive_var:.6f}")
print(f"Welford variance: {welford_var:.6f}")
print(f"True variance: {true_var:.6f}")
📝Problem 3: Design a Log-Sum-Exp Function
The expression log(Σexp(x_i)) overflows when any x_i is large. Design a numerically stable version.
💡Solution
import numpy as np
def log_sum_exp(x):
"""Numerically stable log-sum-exp.
Uses the identity: log(Σexp(x_i)) = max(x) + log(Σexp(x_i - max(x)))
"""
max_x = np.max(x)
return max_x + np.log(np.sum(np.exp(x - max_x)))
# Test
x = np.array([1000, 1001, 1002]) # Would overflow naive approach
result = log_sum_exp(x)
print(f"Log-sum-exp: {result:.6f}")
# Verify against scipy
from scipy.special import logsumexp
print(f"scipy result: {logsumexp(x):.6f}")
Quick Reference
📋Floating Point Quick Reference
| Property | float16 | float32 | float64 |
|---|---|---|---|
| Bits | 16 | 32 | 64 |
| Sign bits | 1 | 1 | 1 |
| Exponent bits | 5 | 8 | 11 |
| Mantissa bits | 10 | 23 | 52 |
| Machine epsilon | 9.77 × 10⁻⁴ | 1.19 × 10⁻⁷ | 2.22 × 10⁻¹⁶ |
| Max value | 65,504 | 3.4 × 10³⁸ | 1.8 × 10³⁰⁸ |
| Min normal | 6.1 × 10⁻⁵ | 1.2 × 10⁻³⁸ | 2.2 × 10⁻³⁰⁸ |
| Precision (digits) | ~3 | ~7 | ~15 |
Key Python Functions:
np.finfo(float).eps— machine epsilonnp.finfo(float).max— maximum finite valuenp.finfo(float).tiny— smallest positive normalnp.nextafter(a, b)— next representable float toward bnp.isclose(a, b)— approximate equality checknp.isnan(x)— check for NaNnp.isinf(x)— check for infinity
Golden Rules:
- Never compare floats with
== - Use
np.isclose()with appropriate tolerances - Sum smallest values first
- Use Kahan summation for large accumulations
- Subtract a common value before exponentiating (softmax stability)
- Keep accumulators in float32 if using float16
Cross-References
- 087-Root-Finding — Newton's method depends on floating point arithmetic; understanding rounding is essential for choosing convergence tolerances
- 088-Integration — Quadrature error analysis requires understanding truncation vs roundoff error tradeoffs
- 089-Interpolation — Polynomial interpolation accuracy is limited by floating point precision; Runge's phenomenon interacts with roundoff
- 090-Error Analysis — Condition numbers and stability analysis build directly on IEEE 754 properties
- Linear Algebra — Matrix operations (LU decomposition, eigenvalue computation) are highly sensitive to floating point rounding; pivoting strategies address this
- Optimization — Gradient descent convergence depends on learning rates being representable; underflow in gradient updates causes silent failures