Calculus — The Engine of Learning

ℹ️ Why It Matters

Machine learning is all about making things better step by step. Calculus tells us HOW to improve — which direction to move and by how much. Gradient descent, the core of training AI, is pure calculus.

What is a Function?

A function takes an input and gives an output.

Definition: Function

A function $f$ maps an input $x$ to an output $f(x)$ .

📝Example: Linear Function

Let $f(x) = 2x + 3$

f(1) = 2(1) + 3 = 5

f(2) = 2(2) + 3 = 7

f(10) = 2(10) + 3 = 23

ℹ️ In ML

The "loss function" or "cost function" takes your model's predictions and tells you how wrong they are.

\text{Loss} = f(\text{predictions}, \text{actual\_values})

Goal: Find the parameters that MINIMIZE this loss
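As a rough illustration, here is one possible loss function in Python. The choice of mean squared error and the name `mse_loss` are assumptions made for this sketch; the text above does not commit to a particular loss.

```python
def mse_loss(predictions, actual_values):
    """Mean squared error: the average of the squared differences."""
    if len(predictions) != len(actual_values):
        raise ValueError("inputs must have the same length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predictions, actual_values)]
    return sum(squared_errors) / len(squared_errors)


# Perfect predictions give zero loss; worse predictions give a larger loss.
perfect = mse_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
worse = mse_loss([1.0, 2.0, 3.0], [2.0, 2.0, 5.0])
print(perfect, worse)
```

The smaller the returned number, the closer the predictions are to the actual values.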

Derivatives — The Slope of Change

A derivative tells you: "If I change x a tiny bit, how much does f(x) change?"

Analogy: If you're hiking on a mountain, the derivative tells you the steepness of the ground right where you're standing.

The Formula

Derivative Definition

f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}

Here,

$f'(x)$ =The derivative of f at x
$h$ =A small change in x

Simpler way: The derivative is the slope of the tangent line.
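To see the limit at work, the sketch below evaluates the difference quotient for smaller and smaller $h$. The function $x^2$, the point $x = 3$, and the helper name `difference_quotient` are choices made for this illustration.

```python
def difference_quotient(f, x, h):
    """The fraction inside the limit: (f(x + h) - f(x)) / h."""
    return (f(x + h) - f(x)) / h


def square(x):
    return x ** 2


# For f(x) = x**2 at x = 3 the quotient equals 6 + h,
# so it approaches the true slope 6 as h shrinks.
step_sizes = (0.1, 0.01, 0.001)
quotients = [difference_quotient(square, 3.0, h) for h in step_sizes]
for h, q in zip(step_sizes, quotients):
    print(h, q)
```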

Basic Derivative Rules

Function	Derivative	Example
c (constant)	0	d/dx(5) = 0
xⁿ	nxⁿ⁻¹	d/dx(x³) = 3x²
eˣ	eˣ	d/dx(eˣ) = eˣ
ln(x)	1/x	d/dx(ln(x)) = 1/x
sin(x)	cos(x)	d/dx(sin(x)) = cos(x)
cos(x)	-sin(x)	d/dx(cos(x)) = -sin(x)

Rules of Differentiation

Power Rule:

Power Rule

\frac{d}{dx}(x^n) = nx^{n-1}

Here,

$n$ =The exponent
$x^n$ =The function

📝Example: Power Rule

\frac{d}{dx}(x^2) = 2x

\frac{d}{dx}(x^5) = 5x^4

\frac{d}{dx}(\sqrt{x}) = \frac{d}{dx}(x^{0.5}) = 0.5x^{-0.5} = \frac{1}{2\sqrt{x}}

Sum Rule:

Sum Rule

\frac{d}{dx}[f(x) + g(x)] = f'(x) + g'(x)

Here,

$f(x), g(x)$ =Two functions

📝Example: Sum Rule

\frac{d}{dx}(x^2 + 3x) = 2x + 3

Product Rule:

Product Rule

\frac{d}{dx}[f(x) \times g(x)] = f'(x)g(x) + f(x)g'(x)

Here,

$f(x), g(x)$ =Two functions

📝Example: Product Rule

\frac{d}{dx}(x^2 \times e^x) = 2x \times e^x + x^2 \times e^x = e^x(2x + x^2)

Chain Rule (THE MOST IMPORTANT RULE IN ML):

Chain Rule

\frac{d}{dx}[f(g(x))] = f'(g(x)) \times g'(x)

Here,

$f(g(x))$ =Composition of functions

📝Example: Chain Rule

Let $f(x) = (3x + 1)^5$

Outer function: $u^5$ → derivative: $5u^4$
Inner function: $3x + 1$ → derivative: $3$

\frac{d}{dx}(3x + 1)^5 = 5(3x + 1)^4 \times 3 = 15(3x + 1)^4
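A quick numerical check of this result, as a sketch: the chain-rule answer is compared with a central-difference estimate at $x = 1$. The helper `central_difference` and the evaluation point are assumptions for illustration.

```python
def f(x):
    return (3 * x + 1) ** 5


def f_prime(x):
    # Chain rule result from the example: 15 * (3x + 1)**4
    return 15 * (3 * x + 1) ** 4


def central_difference(func, x, h=1e-5):
    """Numerical slope estimate using points on both sides of x."""
    return (func(x + h) - func(x - h)) / (2 * h)


x = 1.0
analytic = f_prime(x)            # 15 * 4**4
numeric = central_difference(f, x)
print(analytic, numeric)
```

The two values should agree to several decimal places, which is a cheap way to catch mistakes in a hand-derived derivative.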

💡 Why Chain Rule Matters

Neural networks are functions inside functions inside functions (deep layers). The chain rule lets us compute how each weight affects the final loss. Backpropagation is the algorithm that applies it efficiently, layer by layer, from the output back to the input.

Partial Derivatives

When a function has multiple inputs, we take the derivative with respect to ONE input at a time.

Partial Derivative

\frac{\partial f}{\partial x} = \lim_{h \to 0} \frac{f(x+h, y) - f(x, y)}{h}

Here,

$\frac{\partial f}{\partial x}$ =Partial derivative with respect to x

📝Example: Partial Derivatives

Let $f(x, y) = x^2 + 3xy + y^2$

\frac{\partial f}{\partial x} = 2x + 3y \quad \text{(treat y as constant)}

\frac{\partial f}{\partial y} = 3x + 2y \quad \text{(treat x as constant)}

Analogy: If you're on a hilly terrain and you can walk east-west (x) or north-south (y), the partial derivative $\frac{\partial f}{\partial x}$ tells you the steepness in the east-west direction only.
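The partial derivatives of $f(x, y) = x^2 + 3xy + y^2$ can be checked numerically by nudging one input while holding the other fixed. This is a minimal sketch; the helper names and the point $(1, 2)$ are chosen for illustration.

```python
def f(x, y):
    return x ** 2 + 3 * x * y + y ** 2


def partial_x(func, x, y, h=1e-6):
    """Nudge x only; y is held constant."""
    return (func(x + h, y) - func(x - h, y)) / (2 * h)


def partial_y(func, x, y, h=1e-6):
    """Nudge y only; x is held constant."""
    return (func(x, y + h) - func(x, y - h)) / (2 * h)


x0, y0 = 1.0, 2.0
df_dx = partial_x(f, x0, y0)   # formula: 2x + 3y
df_dy = partial_y(f, x0, y0)   # formula: 3x + 2y
print(df_dx, df_dy)
```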

The Gradient

The gradient is a vector of all partial derivatives. It points in the direction of steepest increase.

Gradient

\nabla f = \left[\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n}\right]

Here,

$\nabla f$ =The gradient of f

📝Example: Gradient

Let $f(x, y) = x^2 + y^2$

\nabla f = [2x, 2y]

At point $(3, 4)$ : $\nabla f = [6, 8]$

💡 Key Insight

The gradient points UPHILL. To go downhill (minimize the function), move in the opposite direction: $-\nabla f$ .
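A gradient can be estimated numerically by collecting one partial derivative per input. The sketch below does this for $f(x, y) = x^2 + y^2$ at $(3, 4)$; the function names `numerical_gradient` and `bowl` are assumptions for illustration.

```python
def numerical_gradient(func, point, h=1e-6):
    """Estimate the gradient by nudging one coordinate at a time."""
    grad = []
    for i in range(len(point)):
        forward = list(point)
        backward = list(point)
        forward[i] += h
        backward[i] -= h
        grad.append((func(forward) - func(backward)) / (2 * h))
    return grad


def bowl(p):
    x, y = p
    return x ** 2 + y ** 2


grad = numerical_gradient(bowl, [3.0, 4.0])
downhill = [-g for g in grad]   # direction that decreases the function
print(grad, downhill)
```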

Gradient Descent — The Heart of ML Training

The Idea: Start somewhere on the loss landscape. Look at the gradient (which way is steepest?). Take a small step downhill. Repeat until you reach the bottom.

Gradient Descent Update Rule

\theta_{\text{new}} = \theta_{\text{old}} - \alpha \times \nabla L(\theta_{\text{old}})

Here,

$\theta$ =Parameters (weights) of your model
$\alpha$ =Learning rate (step size)
$\nabla L$ =Gradient of the loss function
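Here is a minimal sketch of the update rule in action. It assumes the toy loss $L(\theta) = \theta_1^2 + \theta_2^2$, whose gradient is $[2\theta_1, 2\theta_2]$, with a learning rate of 0.1; the function names are illustrative.

```python
def grad_loss(theta):
    # Gradient of the toy loss theta_1**2 + theta_2**2
    return [2 * t for t in theta]


def gradient_descent(theta, alpha=0.1, steps=50):
    for _ in range(steps):
        g = grad_loss(theta)
        # theta_new = theta_old - alpha * gradient
        theta = [t - alpha * gi for t, gi in zip(theta, g)]
    return theta


theta_final = gradient_descent([3.0, 4.0])
print(theta_final)
```

Each step shrinks both parameters by the same factor, so the iterate slides toward the minimum at the origin.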

Types of Gradient Descent

Batch Gradient Descent:

Uses ALL training data to compute gradient
Stable but slow

Batch Gradient Descent

\theta = \theta - \alpha \times \frac{1}{N} \times \sum_{i=1}^{N} \nabla L(\theta, x_i)

Here,

$N$ =Total number of training samples

Stochastic Gradient Descent (SGD):

Uses ONE random data point
Noisy but fast

Stochastic Gradient Descent

\theta = \theta - \alpha \times \nabla L(\theta, x_i)

Here,

$x_i$ =Single random sample

Mini-Batch Gradient Descent:

Uses a small batch (e.g., 32 or 64 samples)
Balances stability and speed; this is the variant most commonly used in practice

Mini-Batch Gradient Descent

\theta = \theta - \alpha \times \frac{1}{B} \times \sum_{i=1}^{B} \nabla L(\theta, x_i)

Here,

$B$ =Batch size
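The mini-batch update can be sketched on a small problem. Everything specific here is an assumption for illustration: a linear model $y = wx + b$, a squared-error loss, noise-free synthetic data drawn from $y = 2x + 3$, and the function name `minibatch_gd`.

```python
import random


def minibatch_gd(xs, ys, alpha=0.05, batch_size=4, epochs=500, seed=0):
    """Fit y = w*x + b with mini-batch gradient descent on squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)                      # new random batches each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            grad_w = 0.0
            grad_b = 0.0
            for i in batch:
                error = (w * xs[i] + b) - ys[i]
                grad_w += 2 * error * xs[i]       # d/dw of error**2
                grad_b += 2 * error               # d/db of error**2
            # Average the gradient over the batch, then step downhill.
            w -= alpha * grad_w / len(batch)
            b -= alpha * grad_b / len(batch)
    return w, b


xs = [i / 10 for i in range(20)]
ys = [2 * x + 3 for x in xs]
w, b = minibatch_gd(xs, ys)
print(w, b)
```

Setting `batch_size=len(xs)` turns this into batch gradient descent, and `batch_size=1` turns it into SGD, so the three variants differ only in how many samples feed each update.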

Learning Rate

The learning rate ( $\alpha$ ) controls how big your steps are.

⚠️ Learning Rate Guidelines

Too large: You overshoot the minimum, might never converge
Too small: Takes forever to converge
Just right: Converges quickly to a good solution

Typical values: 0.1, 0.01, 0.001, 0.0001
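The three regimes are easy to reproduce on a toy problem. This sketch assumes the loss $f(x) = x^2$ (gradient $2x$), a start at $x = 1$, and 20 steps; the specific learning rates are chosen only to make each behaviour visible.

```python
def run_gd(alpha, x0=1.0, steps=20):
    x = x0
    for _ in range(steps):
        x = x - alpha * 2 * x   # gradient of x**2 is 2x
    return x


too_large = run_gd(alpha=1.1)     # overshoots: |x| grows every step
too_small = run_gd(alpha=0.001)   # barely moves away from the start
reasonable = run_gd(alpha=0.1)    # shrinks steadily toward the minimum at 0
print(too_large, too_small, reasonable)
```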

Learning Rate Schedules:

Start large, get smaller over time
Cosine annealing
Warm-up then decay
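One way to combine these ideas is a linear warm-up followed by cosine decay. The sketch below is an illustrative schedule, not a standard library API; the function name `lr_schedule` and its default values are assumptions.

```python
import math


def lr_schedule(step, total_steps, base_lr=0.01, warmup_steps=10):
    """Linear warm-up to base_lr, then cosine decay toward zero."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))


rates = [lr_schedule(s, total_steps=100) for s in range(100)]
print(rates[0], rates[9], rates[50], rates[-1])
```

The rate climbs for the first 10 steps, peaks at `base_lr`, and then falls smoothly for the rest of training.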

Higher-Order Derivatives

Second Derivative: The derivative of the derivative. It tells you about curvature.

Second Derivative

f''(x) = \frac{d^2f}{dx^2}

Here,

$f''(x)$ =The second derivative of f

ℹ️ Interpreting Second Derivative

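If $f''(x) > 0$ : curve is concave UP (like a cup); a point where $f'(x) = 0$ is then a local minimum
If $f''(x) < 0$ : curve is concave DOWN (like a cap); a point where $f'(x) = 0$ is then a local maximum
If $f''(x) = 0$ : the test is inconclusive (the point may be a minimum, a maximum, or an inflection point)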
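Curvature can be estimated numerically with a second difference. The sketch below looks at three functions at $x = 0$, where each has zero slope; the helper name `second_derivative` and the test functions are assumptions for illustration.

```python
def second_derivative(f, x, h=1e-3):
    """Second central difference: approximates f''(x)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / h ** 2


cup = second_derivative(lambda x: x ** 2, 0.0)     # positive: local minimum
cap = second_derivative(lambda x: -x ** 2, 0.0)    # negative: local maximum
flat = second_derivative(lambda x: x ** 4, 0.0)    # near zero, yet x = 0 is a minimum of x**4
print(cup, cap, flat)
```

The last case shows why a zero second derivative settles nothing on its own: $x^4$ has a minimum at 0 even though its curvature vanishes there.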

Hessian Matrix: The matrix of all second partial derivatives.

Hessian Matrix

H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}

Here,

$H$ =The Hessian matrix

Newton's Method (Second-order optimization):

Newton's Method

\theta_{\text{new}} = \theta_{\text{old}} - H^{-1} \times \nabla L

Here,

$H$ =The Hessian matrix
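In one dimension the Hessian is just the second derivative, so the update becomes $\theta - L'(\theta) / L''(\theta)$. The sketch below assumes the toy loss $L(\theta) = e^{\theta} - 2\theta$, whose minimum is at $\theta = \ln 2$; the loss and the function names are chosen for illustration.

```python
import math


def loss_grad(theta):
    return math.exp(theta) - 2      # L'(theta) for L = e**theta - 2*theta


def loss_hess(theta):
    return math.exp(theta)          # L''(theta), the 1x1 Hessian


def newton_minimize(theta, steps=8):
    for _ in range(steps):
        # theta_new = theta_old - H^{-1} * gradient
        theta = theta - loss_grad(theta) / loss_hess(theta)
    return theta


theta_star = newton_minimize(0.0)
print(theta_star, math.log(2))
```

A handful of steps is enough here because Newton's method uses curvature to pick the step size. The cost is that the Hessian must be computed and inverted, which is expensive when there are many parameters.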

ℹ️ In AI

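Adam is a first-order method: it keeps running averages of the gradient (first moment) and the squared gradient (second moment) to scale each step, without computing the Hessian
L-BFGS builds a low-memory approximation of the inverse Hessian from recent gradients, which makes second-order-style updates feasible for larger problems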

Integrals

Integration reverses differentiation: an antiderivative $F$ of $f$ satisfies $F'(x) = f(x)$ . A definite integral computes the (signed) area under a curve.

Definite Integral

\int_a^b f(x) \, dx

Here,

$a, b$ =Integration bounds
$f(x)$ =The function to integrate

📝Example: Integral

\int_0^3 x^2 \, dx = \left[\frac{x^3}{3}\right]_0^3 = \frac{27}{3} - 0 = 9
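The same area can be approximated by adding up thin rectangles. This is a minimal sketch using the midpoint rule; the helper name `midpoint_integral` and the number of slices are assumptions.

```python
def midpoint_integral(f, a, b, n=1000):
    """Approximate the integral of f over [a, b] with n midpoint rectangles."""
    width = (b - a) / n
    return sum(f(a + (i + 0.5) * width) for i in range(n)) * width


area = midpoint_integral(lambda x: x ** 2, 0.0, 3.0)
print(area)   # close to the exact value 9
```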

Basic Integration Rules:

$\int x^n \, dx = \frac{x^{n+1}}{n+1} + C$ (for $n \neq -1$ )
$\int e^x \, dx = e^x + C$
$\int \frac{1}{x} \, dx = \ln|x| + C$
$\int \sin(x) \, dx = -\cos(x) + C$

Applications in ML:

Probability density functions: $P(a < X < b) = \int_a^b f(x) \, dx$
Bayesian inference: Computing posterior distributions
Expected values: $E[X] = \int x \cdot f(x) \, dx$
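As a small sketch of the probability formulas, take the density $f(x) = 2x$ on $[0, 1]$. This density is an assumption chosen because the answers are easy to verify by hand: $P(0.5 < X < 1) = 0.75$ and $E[X] = 2/3$. The integrals are approximated with a midpoint rule.

```python
def midpoint_integral(f, a, b, n=10_000):
    """Approximate the integral of f over [a, b] with n midpoint rectangles."""
    width = (b - a) / n
    return sum(f(a + (i + 0.5) * width) for i in range(n)) * width


def density(x):
    return 2 * x   # a valid density on [0, 1]: non-negative, integrates to 1


total = midpoint_integral(density, 0.0, 1.0)
prob = midpoint_integral(density, 0.5, 1.0)                    # P(0.5 < X < 1)
mean = midpoint_integral(lambda x: x * density(x), 0.0, 1.0)   # E[X]
print(total, prob, mean)
```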

Multivariable Calculus

Chain Rule for Multiple Variables

Multivariable Chain Rule

\frac{dz}{dt} = \frac{\partial f}{\partial x} \frac{dx}{dt} + \frac{\partial f}{\partial y} \frac{dy}{dt}

Here,

$z = f(x, y)$ =Function of two variables
$x = g(t), y = h(t)$ =Functions of t
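A worked check, as a sketch: take $f(x, y) = xy$ with $x = t^2$ and $y = t^3$ (choices made for illustration). Then $z = t^5$, so the direct derivative is $5t^4$, and the chain rule should give the same value.

```python
def dz_dt_chain(t):
    x, y = t ** 2, t ** 3
    df_dx, df_dy = y, x                 # partials of f(x, y) = x * y
    dx_dt, dy_dt = 2 * t, 3 * t ** 2    # derivatives of x(t) and y(t)
    return df_dx * dx_dt + df_dy * dy_dt


def dz_dt_direct(t):
    return 5 * t ** 4                   # since z = t**2 * t**3 = t**5


t = 2.0
chain_value = dz_dt_chain(t)
direct_value = dz_dt_direct(t)
print(chain_value, direct_value)
```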

Gradient in Neural Networks

A neural network is a composition of functions:

ℹ️ Backpropagation

\text{Loss} = L(f_3(f_2(f_1(x, W_1), W_2), W_3), y)

To find $\frac{\partial \text{Loss}}{\partial W_1}$ , we use the chain rule repeatedly:

\frac{\partial \text{Loss}}{\partial W_1} = \frac{\partial \text{Loss}}{\partial f_3} \times \frac{\partial f_3}{\partial f_2} \times \frac{\partial f_2}{\partial f_1} \times \frac{\partial f_1}{\partial W_1}

This is BACKPROPAGATION!
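The product of derivatives can be traced by hand on a tiny network. The sketch below assumes scalar weights, $f_1 = \tanh(W_1 x)$, $f_2 = \tanh(W_2 f_1)$, $f_3 = W_3 f_2$, and a squared-error loss; these layer choices and all numeric values are assumptions for illustration. The result is compared with a finite-difference estimate.

```python
import math


def forward(x, w1, w2, w3):
    f1 = math.tanh(w1 * x)
    f2 = math.tanh(w2 * f1)
    f3 = w3 * f2
    return f1, f2, f3


def loss(x, y, w1, w2, w3):
    _, _, f3 = forward(x, w1, w2, w3)
    return (f3 - y) ** 2


def grad_w1(x, y, w1, w2, w3):
    f1, f2, f3 = forward(x, w1, w2, w3)
    dloss_df3 = 2 * (f3 - y)
    df3_df2 = w3
    df2_df1 = (1 - f2 ** 2) * w2     # derivative of tanh(u) is 1 - tanh(u)**2
    df1_dw1 = (1 - f1 ** 2) * x
    # Multiply the local derivatives from the loss back to W1.
    return dloss_df3 * df3_df2 * df2_df1 * df1_dw1


x, y = 0.5, 1.0
w1, w2, w3 = 0.8, -1.2, 0.7
h = 1e-6
analytic = grad_w1(x, y, w1, w2, w3)
numeric = (loss(x, y, w1 + h, w2, w3) - loss(x, y, w1 - h, w2, w3)) / (2 * h)
print(analytic, numeric)
```

Real frameworks compute the same product automatically and reuse the shared factors across all weights, which is what makes backpropagation efficient.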

Taylor Series

Approximate a smooth function near a point $a$ by a polynomial built from its derivatives at $a$ :

Taylor Series

f(x) \approx f(a) + f'(a)(x-a) + \frac{f''(a)(x-a)^2}{2!} + \frac{f'''(a)(x-a)^3}{3!} + \cdots

Here,

$a$ =The point around which we expand

At a = 0 (Maclaurin series):

$e^x \approx 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots$
$\sin(x) \approx x - \frac{x^3}{3!} + \frac{x^5}{5!} - \cdots$
$\cos(x) \approx 1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \cdots$
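To see how the approximation improves with more terms, the sketch below sums the first few terms of the $e^x$ series at $x = 1$ and compares against `math.e`. The helper name `exp_taylor` is an assumption.

```python
import math


def exp_taylor(x, terms):
    """Sum the first `terms` terms of the Maclaurin series for e**x."""
    return sum(x ** k / math.factorial(k) for k in range(terms))


term_counts = (2, 4, 8)
approximations = [exp_taylor(1.0, n) for n in term_counts]
errors = [abs(a - math.e) for a in approximations]
for n, a, err in zip(term_counts, approximations, errors):
    print(n, a, err)
```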

💡 Why it matters

A linear model can be viewed as a first-order Taylor approximation of the underlying relationship: accurate near the expansion point, less so farther away
Neural networks can be seen as learned nonlinear approximations

📋Key Takeaways

Derivatives measure rates of change. The derivative $f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$ gives the slope of a function at a point; its sign tells you whether the function rises or falls as $x$ increases.
The Chain Rule is the backbone of deep learning. For composed functions, $\frac{d}{dx}[f(g(x))] = f'(g(x)) \times g'(x)$ . Backpropagation applies this rule layer by layer through neural networks.
The Gradient points uphill. $\nabla f = \left[\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n}\right]$ gives the direction of steepest ascent; move in $-\nabla f$ to minimize.
Gradient Descent iterates toward a minimum. The update rule $\theta_{\text{new}} = \theta_{\text{old}} - \alpha \times \nabla L(\theta_{\text{old}})$ underlies the training of nearly all neural networks; the learning rate $\alpha$ controls the step size.
Second derivatives reveal curvature. At a critical point (where $\nabla f = 0$ ), the Hessian $H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$ tells you whether you're at a local minimum ( $H \succ 0$ , positive definite), a local maximum ( $H \prec 0$ , negative definite), or a saddle point (eigenvalues of both signs). This is central to understanding the geometry of optimization landscapes.
Integrals reverse differentiation and appear throughout probability: in expected values ( $E[X] = \int x \cdot f(x) \, dx$ ), in probabilities over intervals, and in Bayesian inference.