Calculus — The Engine of Learning
ℹ️ Why It Matters
Machine learning is all about making things better step by step. Calculus tells us HOW to improve — which direction to move and by how much. Gradient descent, the core of training AI, is pure calculus.
What is a Function?
A function takes an input and gives an output.
Definition: Function
A function $f$ maps an input $x$ to an output $y = f(x)$.
📝Example: Linear Function
Let
ℹ️ In ML
The "loss function" or "cost function" takes your model's predictions and tells you how wrong they are.
Goal: Find the parameters that MINIMIZE this loss
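To make the idea of a loss function concrete, here is a minimal sketch using mean squared error. The function name `mse_loss` and the sample numbers are illustrative assumptions, not something the text above prescribes; many other loss functions exist.

```python
def mse_loss(predictions, targets):
    """Mean squared error: the average of the squared differences."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return sum(squared_errors) / len(targets)


# Hypothetical targets and two sets of predictions.
targets = [3.0, 5.0]
good = mse_loss([2.9, 5.1], targets)   # predictions close to the targets
bad = mse_loss([0.0, 10.0], targets)   # predictions far from the targets
print(good, bad)
```

The closer the predictions are to the targets, the smaller the returned number, which is why training tries to minimize it.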
Derivatives — The Slope of Change
A derivative tells you: "If I change x a tiny bit, how much does f(x) change?"
Analogy: If you're hiking on a mountain, the derivative tells you the steepness of the ground right where you're standing.
The Formula
Derivative Definition
$$f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$$
Here,
- $f'(x)$ = The derivative of f at x
- $h$ = A small change in x
Simpler way: The derivative is the slope of the tangent line.
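The limit definition of the derivative can be approximated on a computer by picking a small but nonzero step. The sketch below uses a central difference (stepping both left and right of x), which is a common variant of the one-sided definition and usually more accurate. The helper name `numerical_derivative` and the test function are assumptions chosen for illustration.

```python
def numerical_derivative(f, x, h=1e-6):
    """Approximate f'(x) with a central difference of half-width h."""
    return (f(x + h) - f(x - h)) / (2 * h)


def cube(x):
    return x ** 3


# The power rule gives d/dx(x^3) = 3x^2, so the exact slope at x = 2 is 12.
slope = numerical_derivative(cube, 2.0)
print(slope)
```

The result should agree with the analytic slope up to a small numerical error.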
Basic Derivative Rules
| Function | Derivative | Example |
|---|---|---|
| c (constant) | 0 | d/dx(5) = 0 |
| xⁿ | nxⁿ⁻¹ | d/dx(x³) = 3x² |
| eˣ | eˣ | d/dx(eˣ) = eˣ |
| ln(x) | 1/x | d/dx(ln(x)) = 1/x |
| sin(x) | cos(x) | d/dx(sin(x)) = cos(x) |
| cos(x) | -sin(x) | d/dx(cos(x)) = -sin(x) |
Rules of Differentiation
Power Rule:
Power Rule
$$\frac{d}{dx} x^n = n x^{n-1}$$
Here,
- $n$ = The exponent
- $x^n$ = The function
📝Example: Power Rule
Sum Rule:
Sum Rule
$$\frac{d}{dx}\left[f(x) + g(x)\right] = f'(x) + g'(x)$$
Here,
- $f, g$ = Two functions
📝Example: Sum Rule
Product Rule:
Product Rule
$$\frac{d}{dx}\left[f(x)\,g(x)\right] = f'(x)\,g(x) + f(x)\,g'(x)$$
Here,
- $f, g$ = Two functions
📝Example: Product Rule
Chain Rule (THE MOST IMPORTANT RULE IN ML):
Chain Rule
$$\frac{d}{dx} f(g(x)) = f'(g(x)) \cdot g'(x)$$
Here,
- $f(g(x))$ = Composition of functions
📝Example: Chain Rule
Let
- Outer function: → derivative:
- Inner function: → derivative:
💡 Why Chain Rule Matters
Neural networks are functions inside functions inside functions (deep layers). The chain rule lets us compute how each weight affects the final output. This is called backpropagation.
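As a quick sanity check of the chain rule, the sketch below differentiates a composed function two ways: analytically (outer derivative times inner derivative) and numerically. The function $(3x + 1)^2$ is an illustrative choice made here, not the example from the original text.

```python
def inner(x):
    """Inner function g(x) = 3x + 1."""
    return 3 * x + 1


def inner_prime(x):
    """g'(x) = 3."""
    return 3.0


def outer(u):
    """Outer function f(u) = u^2."""
    return u ** 2


def outer_prime(u):
    """f'(u) = 2u."""
    return 2 * u


def chain_rule_derivative(x):
    """Derivative of f(g(x)) via the chain rule: f'(g(x)) * g'(x)."""
    return outer_prime(inner(x)) * inner_prime(x)


def finite_difference(x, h=1e-6):
    """Numerical derivative of f(g(x)) for comparison."""
    return (outer(inner(x + h)) - outer(inner(x - h))) / (2 * h)


analytic = chain_rule_derivative(1.0)
numeric = finite_difference(1.0)
print(analytic, numeric)
```

Both values should match up to numerical error, which is the same kind of check people use to validate backpropagation code.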
Partial Derivatives
When a function has multiple inputs, we take the derivative with respect to ONE input at a time.
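Partial Derivative
$$\frac{\partial f}{\partial x} = \lim_{h \to 0} \frac{f(x + h, y) - f(x, y)}{h}$$
Here,
- $\frac{\partial f}{\partial x}$ = Partial derivative with respect to x (all other inputs held fixed)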
📝Example: Partial Derivatives
Let
Analogy: If you're on a hilly terrain and you can walk east-west (x) or north-south (y), the partial derivative tells you the steepness in the east-west direction only.
The Gradient
The gradient is a vector of all partial derivatives. It points in the direction of steepest increase.
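Gradient
$$\nabla f = \left[ \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n} \right]$$
Here,
- $\nabla f$ = The gradient of f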
📝Example: Gradient
Let
At point :
💡 Key Insight
The gradient points UPHILL. To go downhill (minimize the function), move in the opposite direction: $-\nabla f$.
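A gradient can be estimated by nudging one input at a time, which is exactly what a partial derivative describes. The sketch below does this for a bowl-shaped function $x^2 + y^2$; the function, the point, and the helper name `numerical_gradient` are assumptions made for illustration.

```python
def numerical_gradient(f, point, h=1e-6):
    """Estimate the gradient of f at `point`, one coordinate at a time."""
    grad = []
    for i in range(len(point)):
        forward = list(point)
        backward = list(point)
        forward[i] += h    # nudge only coordinate i up
        backward[i] -= h   # and down, holding the others fixed
        grad.append((f(forward) - f(backward)) / (2 * h))
    return grad


def bowl(p):
    """Illustrative function f(x, y) = x^2 + y^2, with gradient (2x, 2y)."""
    x, y = p
    return x ** 2 + y ** 2


grad = numerical_gradient(bowl, [1.0, 2.0])   # analytic answer: (2, 4)
downhill = [-g for g in grad]                 # direction of steepest descent
print(grad, downhill)
```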
Gradient Descent — The Heart of ML Training
The Idea: Start somewhere on the loss landscape. Look at the gradient (which way is steepest?). Take a small step downhill. Repeat until you reach the bottom.
Gradient Descent Update Rule
$$\theta \leftarrow \theta - \alpha \nabla L(\theta)$$
Here,
- $\theta$ = Parameters (weights) of your model
- $\alpha$ = Learning rate (step size)
- $\nabla L(\theta)$ = Gradient of the loss function
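The update rule is short enough to run by hand. The sketch below minimizes a one-parameter loss $(\theta - 3)^2$, whose minimum is at $\theta = 3$. The loss, the starting point, and the default learning rate are illustrative assumptions.

```python
def loss(theta):
    """Illustrative loss with its minimum at theta = 3."""
    return (theta - 3.0) ** 2


def loss_gradient(theta):
    """Derivative of the loss above: 2 * (theta - 3)."""
    return 2.0 * (theta - 3.0)


def gradient_descent(theta, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient."""
    for _ in range(steps):
        theta = theta - learning_rate * loss_gradient(theta)
    return theta


theta_final = gradient_descent(theta=0.0)
print(theta_final, loss(theta_final))
```

With these settings each step shrinks the distance to the minimum by a constant factor, so the parameter settles very close to 3.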
Types of Gradient Descent
Batch Gradient Descent:
- Uses ALL training data to compute gradient
- Stable but slow
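Batch Gradient Descent
$$\theta \leftarrow \theta - \alpha \cdot \frac{1}{N} \sum_{i=1}^{N} \nabla L_i(\theta)$$
Here,
- $N$ = Total number of training samples
- $L_i$ = Loss on the i-th training sample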
Stochastic Gradient Descent (SGD):
- Uses ONE random data point
- Noisy but fast
Stochastic Gradient Descent
$$\theta \leftarrow \theta - \alpha \nabla L_i(\theta)$$
Here,
- $i$ = Single random sample
Mini-Batch Gradient Descent:
- Uses a small batch (e.g., 32 or 64 samples)
- Best of both worlds — this is what's actually used
Mini-Batch Gradient Descent
$$\theta \leftarrow \theta - \alpha \cdot \frac{1}{B} \sum_{i=1}^{B} \nabla L_i(\theta)$$
Here,
- $B$ = Batch size
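The three variants differ only in how many samples feed each gradient estimate, so one function with a `batch_size` argument can cover all of them. The sketch below fits a single weight `w` in the model `y = w * x` on noise-free synthetic data; the data, the model, and the hyperparameters are assumptions chosen to keep the example small.

```python
import random


def grad_single(w, x, y):
    """Gradient of the squared error (w*x - y)^2 with respect to w."""
    return 2.0 * (w * x - y) * x


def train(data, batch_size, learning_rate=0.05, epochs=200, seed=0):
    """Gradient descent where each update averages over `batch_size` samples."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(epochs):
        shuffled = data[:]
        rng.shuffle(shuffled)
        for start in range(0, len(shuffled), batch_size):
            batch = shuffled[start:start + batch_size]
            grad = sum(grad_single(w, x, y) for x, y in batch) / len(batch)
            w -= learning_rate * grad
    return w


# Synthetic data generated from y = 2x, so the ideal weight is 2.
data = [(k / 10.0, 2.0 * k / 10.0) for k in range(1, 21)]

w_batch = train(data, batch_size=len(data))  # batch: all samples per update
w_sgd = train(data, batch_size=1)            # SGD: one sample per update
w_mini = train(data, batch_size=4)           # mini-batch: a few samples
print(w_batch, w_sgd, w_mini)
```

On this easy problem all three reach the same answer; on real data the difference shows up as a trade-off between the cost of each update and the noise in it.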
Learning Rate
The learning rate ($\alpha$) controls how big your steps are.
⚠️ Learning Rate Guidelines
- Too large: You overshoot the minimum, might never converge
- Too small: Takes forever to converge
- Just right: Converges quickly to a good solution
Typical values: 0.1, 0.01, 0.001, 0.0001
Learning Rate Schedules:
- Start large, get smaller over time
- Cosine annealing
- Warm-up then decay
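One way to combine the schedules listed above is a linear warm-up followed by cosine decay. The sketch below is a hand-written version under assumed defaults (`base_lr`, `warmup_steps`, and `min_lr` are hypothetical names and values); deep learning libraries ship their own schedulers with different interfaces.

```python
import math


def warmup_cosine_lr(step, total_steps, base_lr=0.1, warmup_steps=10, min_lr=0.0):
    """Learning rate at `step`: linear warm-up, then cosine decay to min_lr."""
    if step < warmup_steps:
        # Ramp up linearly so the last warm-up step reaches base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, from 0 up to just under 1.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


schedule = [warmup_cosine_lr(s, total_steps=100) for s in range(100)]
print(schedule[0], schedule[9], schedule[50], schedule[-1])
```

The rate starts small, peaks at the end of warm-up, and then falls smoothly toward the minimum.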
Higher-Order Derivatives
Second Derivative: The derivative of the derivative. It tells you about curvature.
Second Derivative
$$f''(x) = \frac{d^2 f}{dx^2} = \frac{d}{dx}\left(\frac{df}{dx}\right)$$
Here,
- $f''(x)$ = The second derivative of f
ℹ️ Interpreting Second Derivative
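- If $f''(x) > 0$: curve is concave UP (like a cup) → a critical point here is a local minimum
- If $f''(x) < 0$: curve is concave DOWN (like a cap) → a critical point here is a local maximum
- If $f''(x) = 0$: the test is inconclusive (possibly an inflection point; in higher dimensions, possibly a saddle point)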
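The second-derivative test can be tried numerically at points where the first derivative is zero. The sketch below uses the illustrative function $x^3 - 3x$, whose critical points are $x = -1$ and $x = 1$; the function and the tolerance are assumptions made for this example.

```python
def second_derivative(f, x, h=1e-4):
    """Approximate f''(x) with a central second difference."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)


def f(x):
    """Illustrative function with critical points at x = -1 and x = 1."""
    return x ** 3 - 3.0 * x


def classify(x, tol=1e-3):
    """Apply the second-derivative test at a critical point x."""
    curvature = second_derivative(f, x)
    if curvature > tol:
        return "local minimum"
    if curvature < -tol:
        return "local maximum"
    return "inconclusive"


labels = {x: classify(x) for x in (-1.0, 1.0)}
print(labels)
```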
Hessian Matrix: The matrix of all second partial derivatives.
Hessian Matrix
$$H_{ij} = \frac{\partial^2 f}{\partial x_i \, \partial x_j}$$
Here,
- $H$ = The Hessian matrix
Newton's Method (Second-order optimization):
Newton's Method
$$\theta \leftarrow \theta - H^{-1} \nabla L(\theta)$$
Here,
- $H$ = The Hessian matrix
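In one dimension the Hessian is just the second derivative, so a Newton step divides the first derivative by the second. The sketch below minimizes the illustrative function $e^x - 2x$, whose minimum sits at $x = \ln 2$; the function, starting point, and step count are assumptions.

```python
import math


def f_prime(x):
    """First derivative of f(x) = e^x - 2x."""
    return math.exp(x) - 2.0


def f_double_prime(x):
    """Second derivative of f(x) = e^x - 2x (the 1-D 'Hessian')."""
    return math.exp(x)


def newton_minimize(x, steps=10):
    """Newton's method for minimization: x <- x - f'(x) / f''(x)."""
    for _ in range(steps):
        x = x - f_prime(x) / f_double_prime(x)
    return x


x_min = newton_minimize(0.0)
print(x_min, math.log(2.0))
```

Because it uses curvature, Newton's method needs no learning rate here and converges in a handful of steps, at the price of computing second derivatives.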
ℹ️ In AI
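- Adam optimizer uses first and second moments (running averages of gradient and squared gradient); these are statistics of first-order gradients, not second derivatives
- L-BFGS builds a low-memory approximation of the (inverse) Hessian from recent gradients, which makes it usable on large-scale problems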
Integrals
An integral is the reverse of a derivative. It computes the area under a curve.
Definite Integral
$$\int_a^b f(x)\,dx$$
Here,
- $a, b$ = Integration bounds
- $f(x)$ = The function to integrate
📝Example: Integral
Basic Integration Rules:
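The rules themselves are not shown here, so as a hedged stand-in, these are the standard antiderivatives obtained by reversing the entries of the derivative table earlier on this page ($C$ is the constant of integration):

```latex
\begin{align*}
\int x^n \, dx &= \frac{x^{n+1}}{n+1} + C \quad (n \neq -1) \\
\int e^x \, dx &= e^x + C \\
\int \frac{1}{x} \, dx &= \ln|x| + C \\
\int \cos(x) \, dx &= \sin(x) + C \\
\int \sin(x) \, dx &= -\cos(x) + C
\end{align*}
```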
Applications in ML:
- Probability density functions: $\int_{-\infty}^{\infty} p(x)\,dx = 1$
- Bayesian inference: Computing posterior distributions
- Expected values: $E[X] = \int_{-\infty}^{\infty} x\,p(x)\,dx$
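Both the normalization of a density and an expected value are integrals, so they can be checked with simple numerical integration. The sketch below applies the trapezoid rule to a normal density with assumed parameters (mean 1, standard deviation 2) over a range wide enough that the tails are negligible.

```python
import math


def trapezoid(f, a, b, n=10_000):
    """Approximate the integral of f over [a, b] using n trapezoids."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return total * h


MEAN, STD = 1.0, 2.0  # hypothetical distribution parameters


def normal_pdf(x):
    """Density of a normal distribution with the parameters above."""
    z = (x - MEAN) / STD
    return math.exp(-0.5 * z * z) / (STD * math.sqrt(2.0 * math.pi))


lo, hi = MEAN - 10 * STD, MEAN + 10 * STD
total_probability = trapezoid(normal_pdf, lo, hi)                # close to 1
expected_value = trapezoid(lambda x: x * normal_pdf(x), lo, hi)  # close to MEAN
print(total_probability, expected_value)
```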
Multivariable Calculus
Chain Rule for Multiple Variables
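Multivariable Chain Rule
$$\frac{dz}{dt} = \frac{\partial z}{\partial x} \cdot \frac{dx}{dt} + \frac{\partial z}{\partial y} \cdot \frac{dy}{dt}$$
Here,
- $z = f(x, y)$ = Function of two variables
- $x(t), y(t)$ = Functions of t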
Gradient in Neural Networks
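A neural network is a composition of functions:
$$\hat{y} = f_n(f_{n-1}(\cdots f_2(f_1(x))))$$
ℹ️ Backpropagation
To find $\frac{\partial L}{\partial w_1}$ (how the loss $L$ changes with a weight $w_1$ in the first layer), we use the chain rule repeatedly:
$$\frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial f_n} \cdot \frac{\partial f_n}{\partial f_{n-1}} \cdots \frac{\partial f_2}{\partial f_1} \cdot \frac{\partial f_1}{\partial w_1}$$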
This is BACKPROPAGATION!
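To see the repeated chain rule in action, the sketch below hand-codes the backward pass for a tiny network with one weight per layer and compares the result with finite differences. The architecture (a tanh hidden unit followed by a linear output and squared-error loss) and the input values are assumptions chosen for illustration.

```python
import math


def forward(w1, w2, x, y):
    """Forward pass: x -> z = w1*x -> a = tanh(z) -> y_hat = w2*a -> loss."""
    z = w1 * x
    a = math.tanh(z)
    y_hat = w2 * a
    loss = (y_hat - y) ** 2
    return z, a, y_hat, loss


def backward(w1, w2, x, y):
    """Backward pass: apply the chain rule from the loss back to each weight."""
    z, a, y_hat, _ = forward(w1, w2, x, y)
    dL_dyhat = 2.0 * (y_hat - y)        # derivative of the squared error
    dL_dw2 = dL_dyhat * a               # y_hat = w2 * a
    dL_da = dL_dyhat * w2
    dL_dz = dL_da * (1.0 - a ** 2)      # derivative of tanh(z) is 1 - tanh(z)^2
    dL_dw1 = dL_dz * x                  # z = w1 * x
    return dL_dw1, dL_dw2


def finite_difference_grads(w1, w2, x, y, h=1e-6):
    """Numerical gradients of the loss, used only to check the backward pass."""
    def loss_at(a, b):
        return forward(a, b, x, y)[3]
    d_w1 = (loss_at(w1 + h, w2) - loss_at(w1 - h, w2)) / (2 * h)
    d_w2 = (loss_at(w1, w2 + h) - loss_at(w1, w2 - h)) / (2 * h)
    return d_w1, d_w2


bp = backward(0.5, -1.5, 2.0, 1.0)
fd = finite_difference_grads(0.5, -1.5, 2.0, 1.0)
print(bp, fd)
```

Each line of `backward` multiplies by one more local derivative, moving from the loss toward the input, which is the order backpropagation follows in larger networks.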
Taylor Series
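Approximate a smooth function near a point as a polynomial:
Taylor Series
$$f(x) = \sum_{n=0}^{\infty} \frac{f^{(n)}(a)}{n!}(x - a)^n = f(a) + f'(a)(x - a) + \frac{f''(a)}{2!}(x - a)^2 + \cdots$$
Here,
- $a$ = The point around which we expand
At a = 0 (Maclaurin series):
$$f(x) = f(0) + f'(0)\,x + \frac{f''(0)}{2!}\,x^2 + \frac{f'''(0)}{3!}\,x^3 + \cdots$$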
💡 Why it matters
- Linear regression assumes a linear (first-order Taylor) approximation
- Neural networks can be seen as learned nonlinear approximations
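A short experiment shows how a Taylor polynomial improves as terms are added. The sketch below uses the Maclaurin series of $e^x$, where every derivative at 0 equals 1, so the n-th term is $x^n / n!$. The choice of $e^x$ and of the evaluation point are assumptions made for illustration.

```python
import math


def maclaurin_exp(x, terms):
    """Sum the first `terms` terms of the Maclaurin series of e^x."""
    return sum(x ** n / math.factorial(n) for n in range(terms))


# Approximations of e^1 using 1, 2, 3, 5, and 10 terms.
approximations = [maclaurin_exp(1.0, k) for k in (1, 2, 3, 5, 10)]
errors = [abs(value - math.e) for value in approximations]
print(approximations)
print(errors)
```

The error shrinks quickly with each extra term, which is why low-order Taylor approximations are so widely used near the expansion point.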
📋Key Takeaways
- Derivatives measure rates of change. The derivative tells you the slope of a function at any point, and its sign tells you which direction increases the function.
- The Chain Rule is the backbone of deep learning. For composed functions, $\frac{d}{dx} f(g(x)) = f'(g(x)) \cdot g'(x)$. Backpropagation applies this rule layer by layer through neural networks.
- The Gradient points uphill. $\nabla f$ gives the direction of steepest ascent; move in $-\nabla f$ to minimize.
- Gradient Descent iterates toward the minimum. The update rule $\theta \leftarrow \theta - \alpha \nabla L(\theta)$ is how most neural networks are trained; the learning rate $\alpha$ controls step size.
- Second derivatives reveal curvature. The Hessian tells you whether you're at a minimum ($f''(x) > 0$ in one dimension), maximum, or saddle point, which is crucial for understanding optimization landscape geometry.
- Integrals are the reverse of derivatives and appear in probability ($\int p(x)\,dx = 1$), Bayesian inference, and computing expected values over distributions.