Mathematics for Data Science & AI

Calculus — The Engine of Learning

Master calculus for machine learning: derivatives, gradients, chain rule, gradient descent, and backpropagation explained simply.

📂 Calculus · 📖 Lesson 2 of 100 · 🎓 Free Course


ℹ️ Why It Matters

Machine learning is all about making things better step by step. Calculus tells us HOW to improve — which direction to move and by how much. Gradient descent, the core of training AI, is pure calculus.


What is a Function?

A function takes an input and gives an output.

Definition: Function

A function $f$ maps an input $x$ to an output $f(x)$.

📝Example: Linear Function

Let $f(x) = 2x + 3$

$$f(1) = 2(1) + 3 = 5$$
$$f(2) = 2(2) + 3 = 7$$
$$f(10) = 2(10) + 3 = 23$$

ℹ️ In ML

The "loss function" or "cost function" takes your model's predictions and tells you how wrong they are.

$$\text{Loss} = f(\text{predictions}, \text{actual\_values})$$

Goal: Find the parameters that MINIMIZE this loss.
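
To make the idea of a loss function concrete, here is a minimal sketch using mean squared error. MSE is only one common choice and is an assumption here, not something this lesson prescribes; the name `mse_loss` and the sample numbers are made up for illustration.

```python
def mse_loss(predictions, actual_values):
    """Mean squared error: the average of the squared prediction errors."""
    if len(predictions) != len(actual_values):
        raise ValueError("predictions and actual_values must have the same length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predictions, actual_values)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical data: two sets of predictions for the same targets.
actual = [3.0, 5.0, 7.0]
good_predictions = [3.1, 4.9, 7.2]
bad_predictions = [1.0, 8.0, 4.0]

loss_good = mse_loss(good_predictions, actual)
loss_bad = mse_loss(bad_predictions, actual)
print(loss_good, loss_bad)  # the closer predictions give the smaller loss
```

Training a model amounts to adjusting its parameters so that a number like `loss_bad` moves toward a number like `loss_good`.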


Derivatives — The Slope of Change

A derivative tells you: "If I change x a tiny bit, how much does f(x) change?"

Analogy: If you're hiking on a mountain, the derivative tells you the steepness of the ground right where you're standing.

The Formula

Derivative Definition

$$f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$$

Here,

  • $f'(x)$ = The derivative of f at x
  • $h$ = A small change in x

Simpler way: The derivative is the slope of the tangent line.
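
The limit definition can be tried out numerically by picking a small but nonzero `h`. The sketch below is an approximation under that assumption, not an exact derivative; the helper name `numerical_derivative` and the step size `1e-6` are illustrative choices.

```python
def numerical_derivative(f, x, h=1e-6):
    """Approximate f'(x) with the difference quotient (f(x+h) - f(x)) / h."""
    return (f(x + h) - f(x)) / h


def line(x):
    return 2 * x + 3      # the linear function from the earlier example


def cube(x):
    return x ** 3


slope_f = numerical_derivative(line, 1.0)   # close to 2, the slope of the line
slope_g = numerical_derivative(cube, 2.0)   # close to 3 * 2**2 = 12
print(slope_f, slope_g)
```

Making `h` smaller improves the approximation only up to a point, because floating-point rounding eventually dominates.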

Basic Derivative Rules

| Function | Derivative | Example |
|---|---|---|
| c (constant) | 0 | d/dx(5) = 0 |
| xⁿ | nxⁿ⁻¹ | d/dx(x³) = 3x² |
| eˣ | eˣ | d/dx(eˣ) = eˣ |
| ln(x) | 1/x | d/dx(ln(x)) = 1/x |
| sin(x) | cos(x) | d/dx(sin(x)) = cos(x) |
| cos(x) | -sin(x) | d/dx(cos(x)) = -sin(x) |
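
As a sanity check, each rule in the table can be compared against a numerical slope. This sketch uses a central difference; the test point `x0 = 1.5` and the helper names are arbitrary assumptions for illustration.

```python
import math


def central_difference(f, x, h=1e-5):
    """Approximate f'(x) using points on both sides of x."""
    return (f(x + h) - f(x - h)) / (2 * h)


# (function, claimed derivative) pairs taken from the table rows
rules = {
    "constant 5": (lambda x: 5.0, lambda x: 0.0),
    "x**3": (lambda x: x ** 3, lambda x: 3 * x ** 2),
    "exp(x)": (math.exp, math.exp),
    "ln(x)": (math.log, lambda x: 1 / x),
    "sin(x)": (math.sin, math.cos),
    "cos(x)": (math.cos, lambda x: -math.sin(x)),
}

x0 = 1.5  # arbitrary positive test point (ln needs x > 0)
errors = {
    name: abs(central_difference(f, x0) - df(x0))
    for name, (f, df) in rules.items()
}
for name, err in errors.items():
    print(f"{name:12s} error = {err:.2e}")
```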

Rules of Differentiation

Power Rule:

Power Rule

$$\frac{d}{dx}(x^n) = nx^{n-1}$$

Here,

  • $n$ = The exponent
  • $x^n$ = The function

📝Example: Power Rule

$$\frac{d}{dx}(x^2) = 2x$$
$$\frac{d}{dx}(x^5) = 5x^4$$
$$\frac{d}{dx}(\sqrt{x}) = \frac{d}{dx}(x^{0.5}) = 0.5x^{-0.5} = \frac{1}{2\sqrt{x}}$$

Sum Rule:

Sum Rule

$$\frac{d}{dx}[f(x) + g(x)] = f'(x) + g'(x)$$

Here,

  • $f(x), g(x)$ = Two functions

📝Example: Sum Rule

$$\frac{d}{dx}(x^2 + 3x) = 2x + 3$$

Product Rule:

Product Rule

$$\frac{d}{dx}[f(x) \times g(x)] = f'(x)g(x) + f(x)g'(x)$$

Here,

  • $f(x), g(x)$ = Two functions

📝Example: Product Rule

$$\frac{d}{dx}(x^2 \times e^x) = 2x \times e^x + x^2 \times e^x = e^x(2x + x^2)$$

Chain Rule (THE MOST IMPORTANT RULE IN ML):

Chain Rule

$$\frac{d}{dx}[f(g(x))] = f'(g(x)) \times g'(x)$$

Here,

  • $f(g(x))$ = Composition of functions

📝Example: Chain Rule

Let $f(x) = (3x + 1)^5$

  • Outer function: $u^5$ → derivative: $5u^4$
  • Inner function: $3x + 1$ → derivative: $3$

$$\frac{d}{dx}(3x + 1)^5 = 5(3x + 1)^4 \times 3 = 15(3x + 1)^4$$
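
The chain rule result above can be checked against a numerical slope. The following is a small sketch; the test point `x0 = 0.5` and the helper names are assumptions chosen for illustration.

```python
def f(x):
    return (3 * x + 1) ** 5


def f_prime(x):
    # outer derivative 5u^4 evaluated at u = 3x + 1, times inner derivative 3
    return 5 * (3 * x + 1) ** 4 * 3


def central_difference(func, x, h=1e-6):
    return (func(x + h) - func(x - h)) / (2 * h)


x0 = 0.5
analytic = f_prime(x0)               # 15 * 2.5**4 = 585.9375
numeric = central_difference(f, x0)  # should agree to several decimal places
print(analytic, numeric)
```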

💡 Why Chain Rule Matters

Neural networks are functions inside functions inside functions (deep layers). The chain rule lets us compute how each weight affects the final loss. Applying it layer by layer, from the output back toward the input, is the algorithm called backpropagation.


Partial Derivatives

When a function has multiple inputs, we take the derivative with respect to ONE input at a time.

Partial Derivative

$$\frac{\partial f}{\partial x} = \lim_{h \to 0} \frac{f(x+h, y) - f(x, y)}{h}$$

Here,

  • $\frac{\partial f}{\partial x}$ = Partial derivative with respect to x

📝Example: Partial Derivatives

Let $f(x, y) = x^2 + 3xy + y^2$

$$\frac{\partial f}{\partial x} = 2x + 3y \quad \text{(treat y as constant)}$$
$$\frac{\partial f}{\partial y} = 3x + 2y \quad \text{(treat x as constant)}$$

Analogy: If you're on a hilly terrain and you can walk east-west (x) or north-south (y), the partial derivative $\frac{\partial f}{\partial x}$ tells you the steepness in the east-west direction only.
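
"One input at a time" translates directly into code: nudge one argument and hold the other fixed. This sketch reuses the example function; the evaluation point `(1, 2)` and the helper names are illustrative assumptions.

```python
def f(x, y):
    return x ** 2 + 3 * x * y + y ** 2


def partial_x(func, x, y, h=1e-6):
    # vary x only; y is held fixed
    return (func(x + h, y) - func(x - h, y)) / (2 * h)


def partial_y(func, x, y, h=1e-6):
    # vary y only; x is held fixed
    return (func(x, y + h) - func(x, y - h)) / (2 * h)


x0, y0 = 1.0, 2.0
df_dx = partial_x(f, x0, y0)   # analytic value: 2*1 + 3*2 = 8
df_dy = partial_y(f, x0, y0)   # analytic value: 3*1 + 2*2 = 7
print(df_dx, df_dy)
```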


The Gradient

The gradient is a vector of all partial derivatives. It points in the direction of steepest increase.

Gradient

$$\nabla f = \left[\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n}\right]$$

Here,

  • $\nabla f$ = The gradient of f

📝Example: Gradient

Let $f(x, y) = x^2 + y^2$

$$\nabla f = [2x, 2y]$$

At point $(3, 4)$: $\nabla f = [6, 8]$

💡 Key Insight

The gradient points UPHILL. To go downhill (minimize the function), move in the opposite direction: $-\nabla f$.
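
A quick way to see the uphill/downhill claim is to take one small step along the gradient and one against it, then compare function values. This is a sketch; the step length `0.1` is an arbitrary assumption.

```python
import math


def f(x, y):
    return x ** 2 + y ** 2


def grad_f(x, y):
    return (2 * x, 2 * y)


point = (3.0, 4.0)
g = grad_f(*point)             # (6.0, 8.0)
steepness = math.hypot(*g)     # length of the gradient vector: 10.0

step = 0.1
uphill = (point[0] + step * g[0], point[1] + step * g[1])
downhill = (point[0] - step * g[0], point[1] - step * g[1])

print(f(*downhill), f(*point), f(*uphill))  # increasing from left to right
```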


Gradient Descent — The Heart of ML Training

The Idea: Start somewhere on the loss landscape. Look at the gradient (which way is steepest?). Take a small step downhill. Repeat until you reach the bottom.

Gradient Descent Update Rule

$$\theta_{\text{new}} = \theta_{\text{old}} - \alpha \times \nabla L(\theta_{\text{old}})$$

Here,

  • $\theta$ = Parameters (weights) of your model
  • $\alpha$ = Learning rate (step size)
  • $\nabla L$ = Gradient of the loss function
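
The update rule can be sketched in a few lines. The loss used here, `L(theta) = theta_0**2 + theta_1**2`, is a toy stand-in chosen because its minimum is known to be at the origin; the starting point, learning rate, and step count are assumptions for illustration.

```python
def grad_loss(theta):
    # gradient of the toy loss L(theta) = theta_0**2 + theta_1**2
    return [2 * t for t in theta]


def gradient_descent(theta, alpha=0.1, steps=100):
    for _ in range(steps):
        g = grad_loss(theta)
        # theta_new = theta_old - alpha * gradient
        theta = [t - alpha * gi for t, gi in zip(theta, g)]
    return theta


theta_final = gradient_descent([3.0, 4.0])
print(theta_final)  # both coordinates end up very close to 0
```

With this loss and `alpha = 0.1`, each step multiplies every coordinate by 0.8, which is why the iterates shrink steadily toward the minimum.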

Types of Gradient Descent

Batch Gradient Descent:

  • Uses ALL training data to compute gradient
  • Stable but slow

Batch Gradient Descent

$$\theta = \theta - \alpha \times \frac{1}{N} \times \sum_{i=1}^{N} \nabla L(\theta, x_i)$$

Here,

  • $N$ = Total number of training samples

Stochastic Gradient Descent (SGD):

  • Uses ONE random data point
  • Noisy but fast

Stochastic Gradient Descent

$$\theta = \theta - \alpha \times \nabla L(\theta, x_i)$$

Here,

  • $x_i$ = Single random sample

Mini-Batch Gradient Descent:

  • Uses a small batch (e.g., 32 or 64 samples)
  • Balances the stability of batch updates with the speed of SGD; this is the variant most commonly used in practice

Mini-Batch Gradient Descent

$$\theta = \theta - \alpha \times \frac{1}{B} \times \sum_{i=1}^{B} \nabla L(\theta, x_i)$$

Here,

  • $B$ = Batch size
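
The three variants differ only in how many samples feed each update, so one training loop with a `batch_size` argument can cover all of them. The sketch below fits `y = w*x + b` to noise-free data generated from `y = 2x + 3`. The squared-error loss, the synthetic data, and all names are assumptions for illustration; the loop visits shuffled samples once per epoch rather than drawing them independently.

```python
import random


def make_data(n=200, seed=0):
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, 2.0 * x + 3.0) for x in xs]   # noise-free targets


def batch_gradient(w, b, batch):
    # gradient of (1/B) * sum((w*x + b - y)**2) with respect to w and b
    gw = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
    gb = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
    return gw, gb


def train(data, batch_size, alpha=0.1, epochs=200, seed=1):
    rng = random.Random(seed)
    data = list(data)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            gw, gb = batch_gradient(w, b, batch)
            w, b = w - alpha * gw, b - alpha * gb
    return w, b


data = make_data()
w_batch, b_batch = train(data, batch_size=len(data))  # batch GD: all N samples per step
w_sgd, b_sgd = train(data, batch_size=1)              # SGD: one sample per step
w_mini, b_mini = train(data, batch_size=32)           # mini-batch: B = 32
print(w_batch, b_batch, w_sgd, b_sgd, w_mini, b_mini)
```

All three runs should land near `w = 2`, `b = 3`; on real, noisy data the SGD and mini-batch estimates would fluctuate around the optimum rather than settle exactly.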

Learning Rate

The learning rate ($\alpha$) controls how big your steps are.

⚠️ Learning Rate Guidelines

  • Too large: You overshoot the minimum, might never converge
  • Too small: Takes forever to converge
  • Just right: Converges quickly to a good solution

Typical values: 0.1, 0.01, 0.001, 0.0001
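
The three regimes are easy to reproduce on the toy function `f(x) = x**2`, whose gradient is `2x`. The specific learning rates below are assumptions picked to make each behaviour visible; what counts as too large depends on the function being minimized.

```python
def run(alpha, steps=50, x0=1.0):
    """Gradient descent on f(x) = x**2, whose gradient is 2x."""
    x = x0
    for _ in range(steps):
        x = x - alpha * 2 * x
    return x


too_large = run(alpha=1.1)     # each step multiplies x by -1.2: the iterates blow up
too_small = run(alpha=0.0001)  # each step multiplies x by 0.9998: barely moves
reasonable = run(alpha=0.1)    # each step multiplies x by 0.8: shrinks quickly
print(too_large, too_small, reasonable)
```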

Learning Rate Schedules:

  • Start large, get smaller over time
  • Cosine annealing
  • Warm-up then decay
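
As one possible illustration of these schedules, the sketch below combines a linear warm-up with cosine annealing. The function name `warmup_cosine`, the peak rate `0.1`, and the step counts are hypothetical; deep learning libraries ship their own schedulers with different parameterizations.

```python
import math


def warmup_cosine(step, total_steps, warmup_steps, lr_max=0.1, lr_min=0.0):
    """Linear warm-up to lr_max, then cosine decay down to lr_min."""
    if step < warmup_steps:
        return lr_max * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))


schedule = [warmup_cosine(s, total_steps=100, warmup_steps=10) for s in range(101)]
print(schedule[0], schedule[10], schedule[55], schedule[100])
```

The rate starts small, reaches its peak at the end of warm-up, and then follows half a cosine wave down to the minimum.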

Higher-Order Derivatives

Second Derivative: The derivative of the derivative. It tells you about curvature.

Second Derivative

$$f''(x) = \frac{d^2f}{dx^2}$$

Here,

  • $f''(x)$ = The second derivative of f

ℹ️ Interpreting Second Derivative

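  • If $f'(x) = 0$ and $f''(x) > 0$: curve is concave UP (like a cup) → local minimum
  • If $f'(x) = 0$ and $f''(x) < 0$: curve is concave DOWN (like a cap) → local maximum
  • If $f''(x) = 0$: the test is inconclusive; the point may be an inflection point (in several dimensions, the analogous case may be a saddle point)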
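
Curvature can be estimated numerically with a second difference. The sketch below evaluates it at `x = 0`, where the first derivative of each test function is zero; the three functions and the step size are assumptions chosen for illustration.

```python
def second_derivative(f, x, h=1e-4):
    # central second difference: (f(x+h) - 2 f(x) + f(x-h)) / h^2
    return (f(x + h) - 2 * f(x) + f(x - h)) / (h * h)


def cup(x):       # x**2: shaped like a cup
    return x ** 2


def cap(x):       # -x**2: shaped like a cap
    return -(x ** 2)


def flat(x):      # x**3: zero slope and zero curvature at 0, yet no min or max
    return x ** 3


curv_cup = second_derivative(cup, 0.0)    # about  2: local minimum at 0
curv_cap = second_derivative(cap, 0.0)    # about -2: local maximum at 0
curv_flat = second_derivative(flat, 0.0)  # about  0: curvature alone does not decide
print(curv_cup, curv_cap, curv_flat)
```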

Hessian Matrix: The matrix of all second partial derivatives.

Hessian Matrix

$$H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$$

Here,

  • $H$ = The Hessian matrix

Newton's Method (Second-order optimization):

Newton's Method

$$\theta_{\text{new}} = \theta_{\text{old}} - H^{-1} \times \nabla L$$

Here,

  • $H$ = The Hessian matrix
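
In one dimension the Hessian is just the second derivative, so the Newton update becomes `theta - L'(theta) / L''(theta)`. The sketch below applies it to the toy loss `L(theta) = theta**2 + exp(-theta)`, which is an assumption picked because it is convex and has simple derivatives.

```python
import math


def grad(theta):
    # first derivative of L(theta) = theta**2 + exp(-theta)
    return 2 * theta - math.exp(-theta)


def hessian(theta):
    # second derivative of L; always positive, so L is convex
    return 2 + math.exp(-theta)


theta = 3.0
for _ in range(10):
    # Newton step: divide the gradient by the curvature instead of using a fixed learning rate
    theta = theta - grad(theta) / hessian(theta)

print(theta, grad(theta))  # theta is near 0.3517 and the gradient is essentially zero
```

Newton's method needs very few iterations here, but for a model with many parameters forming and inverting the full Hessian is expensive, which is one reason first-order methods dominate in deep learning.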

ℹ️ In AI

  • Adam optimizer uses first and second moments (running averages of gradient and squared gradient)
  • L-BFGS builds a low-memory approximation of the inverse Hessian from recent gradients, which makes second-order ideas usable on large-scale problems

Integrals

An integral reverses differentiation: if $F'(x) = f(x)$, then $F$ is an antiderivative of $f$. A definite integral computes the (signed) area under a curve between two bounds.

Definite Integral

$$\int_a^b f(x) \, dx$$

Here,

  • $a, b$ = Integration bounds
  • $f(x)$ = The function to integrate

📝Example: Integral

$$\int_0^3 x^2 \, dx = \left[\frac{x^3}{3}\right]_0^3 = \frac{27}{3} - 0 = 9$$
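
The same area can be approximated numerically by summing thin trapezoids under the curve. This is a sketch of the trapezoid rule; the helper name and the number of slices are illustrative assumptions.

```python
def trapezoid(f, a, b, n=10_000):
    """Approximate the integral of f from a to b with n trapezoids."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return total * h


area = trapezoid(lambda x: x ** 2, 0.0, 3.0)
print(area)  # close to the exact value 9
```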

Basic Integration Rules:

  • $\int x^n \, dx = \frac{x^{n+1}}{n+1} + C$ (for $n \neq -1$)
  • $\int e^x \, dx = e^x + C$
  • $\int \frac{1}{x} \, dx = \ln|x| + C$
  • $\int \sin(x) \, dx = -\cos(x) + C$

Applications in ML:

  • Probability density functions: $P(a < X < b) = \int_a^b f(x) \, dx$
  • Bayesian inference: Computing posterior distributions
  • Expected values: $E[X] = \int x \cdot f(x) \, dx$
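
The probability and expected-value integrals above can be approximated numerically. The sketch below assumes a standard normal density as the example distribution and truncates the expected-value integral to the range -10 to 10, where almost all of the probability mass lies; both choices are illustrative.

```python
import math


def normal_pdf(x):
    # density of a standard normal distribution (mean 0, variance 1)
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)


def trapezoid(f, a, b, n=20_000):
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, n))
    return total * h


prob = trapezoid(normal_pdf, -1.0, 1.0)                      # P(-1 < X < 1)
mean = trapezoid(lambda x: x * normal_pdf(x), -10.0, 10.0)   # E[X], tails beyond 10 ignored
exact_prob = math.erf(1 / math.sqrt(2))                      # closed form for comparison
print(prob, exact_prob, mean)
```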

Multivariable Calculus

Chain Rule for Multiple Variables

Multivariable Chain Rule

$$\frac{dz}{dt} = \frac{\partial f}{\partial x} \frac{dx}{dt} + \frac{\partial f}{\partial y} \frac{dy}{dt}$$

Here,

  • $z = f(x, y)$ = Function of two variables
  • $x = g(t), y = h(t)$ = Functions of t
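
To see the multivariable chain rule at work, the sketch below takes `f(x, y) = x**2 + 3xy + y**2` with `x = cos(t)` and `y = sin(t)`. These particular choices of `g` and `h` are assumptions for illustration; with them, `z` simplifies to `1 + 3*cos(t)*sin(t)`, so `dz/dt = 3*cos(2t)`.

```python
import math


def f(x, y):
    return x ** 2 + 3 * x * y + y ** 2


def z(t):
    return f(math.cos(t), math.sin(t))


def dz_dt_chain(t):
    x, y = math.cos(t), math.sin(t)
    df_dx, df_dy = 2 * x + 3 * y, 3 * x + 2 * y    # partial derivatives of f
    dx_dt, dy_dt = -math.sin(t), math.cos(t)       # derivatives of x(t) and y(t)
    return df_dx * dx_dt + df_dy * dy_dt


t0 = 0.7
chain_value = dz_dt_chain(t0)
numeric_value = (z(t0 + 1e-6) - z(t0 - 1e-6)) / 2e-6
print(chain_value, numeric_value, 3 * math.cos(2 * t0))
```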

Gradient in Neural Networks

A neural network is a composition of functions:

ℹ️ Backpropagation

$$\text{Loss} = L(f_3(f_2(f_1(x, W_1), W_2), W_3), y)$$

To find $\frac{\partial \text{Loss}}{\partial W_1}$, we use the chain rule repeatedly:

$$\frac{\partial \text{Loss}}{\partial W_1} = \frac{\partial \text{Loss}}{\partial f_3} \times \frac{\partial f_3}{\partial f_2} \times \frac{\partial f_2}{\partial f_1} \times \frac{\partial f_1}{\partial W_1}$$

This is BACKPROPAGATION!
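
The product of four factors can be written out by hand for a tiny network with scalar weights. Everything in this sketch is an assumption made for illustration: the `tanh` activations, the squared-error loss, and the numeric values. Real networks use matrices, and frameworks compute these products automatically.

```python
import math


def forward(x, w1, w2, w3):
    f1 = math.tanh(w1 * x)     # layer 1
    f2 = math.tanh(w2 * f1)    # layer 2
    f3 = w3 * f2               # layer 3 (linear output)
    return f1, f2, f3


def loss(x, y, w1, w2, w3):
    return (forward(x, w1, w2, w3)[2] - y) ** 2


def grad_w1(x, y, w1, w2, w3):
    f1, f2, f3 = forward(x, w1, w2, w3)
    dloss_df3 = 2 * (f3 - y)
    df3_df2 = w3
    df2_df1 = w2 * (1 - f2 ** 2)    # derivative of tanh(u) is 1 - tanh(u)**2
    df1_dw1 = x * (1 - f1 ** 2)
    return dloss_df3 * df3_df2 * df2_df1 * df1_dw1


x, y = 0.5, 1.0
w1, w2, w3 = 0.8, -1.2, 0.5
backprop_value = grad_w1(x, y, w1, w2, w3)

h = 1e-6  # finite-difference check on the same quantity
numeric_value = (loss(x, y, w1 + h, w2, w3) - loss(x, y, w1 - h, w2, w3)) / (2 * h)
print(backprop_value, numeric_value)
```

Comparing a hand-derived gradient against a finite-difference estimate like this is a common way to catch mistakes in gradient code.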


Taylor Series

Approximate a smooth function near a point $a$ with a polynomial:

Taylor Series

$$f(x) \approx f(a) + f'(a)(x-a) + \frac{f''(a)(x-a)^2}{2!} + \frac{f'''(a)(x-a)^3}{3!} + \cdots$$

Here,

  • $a$ = The point around which we expand

At a = 0 (Maclaurin series):

  • $e^x \approx 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots$
  • $\sin(x) \approx x - \frac{x^3}{3!} + \frac{x^5}{5!} - \cdots$
  • $\cos(x) \approx 1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \cdots$
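
Adding more terms of the series should shrink the approximation error. The sketch below checks this for the exponential series at `x = 1`; the helper name `exp_maclaurin` and the chosen term counts are illustrative assumptions.

```python
import math


def exp_maclaurin(x, terms):
    """Sum of the first `terms` terms of 1 + x + x^2/2! + x^3/3! + ..."""
    return sum(x ** k / math.factorial(k) for k in range(terms))


x = 1.0
term_counts = (1, 2, 4, 8, 12)
approximations = [exp_maclaurin(x, n) for n in term_counts]
errors = [abs(a - math.exp(x)) for a in approximations]

for n, err in zip(term_counts, errors):
    print(f"{n:2d} terms: error = {err:.2e}")
```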

💡 Why it matters

  • Linear regression fits a model that is linear in its inputs, the same form as a first-order Taylor approximation
  • Neural networks can be seen as learned nonlinear approximations

📋Key Takeaways

  • Derivatives measure rates of change. The derivative $f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$ gives the slope of a function at a point: how fast the output changes per unit change in the input.

  • The Chain Rule is the backbone of deep learning. For composed functions, $\frac{d}{dx}[f(g(x))] = f'(g(x)) \times g'(x)$. Backpropagation applies this rule layer by layer through neural networks.

  • The Gradient points uphill. $\nabla f = \left[\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n}\right]$ gives the direction of steepest ascent; move in the direction $-\nabla f$ to minimize.

  • Gradient Descent iterates toward a minimum. The update rule $\theta_{\text{new}} = \theta_{\text{old}} - \alpha \times \nabla L(\theta_{\text{old}})$ underlies how most neural networks are trained; the learning rate $\alpha$ controls the step size.

  • Second derivatives reveal curvature. At a point where the gradient is zero, the Hessian $H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$ tells you whether you're at a minimum ($H \succ 0$), a maximum ($H \prec 0$), or a saddle point (mixed signs), which is crucial for understanding optimization landscape geometry.

  • Integrals are the reverse of derivatives and appear in probability ($E[X] = \int x \cdot f(x) \, dx$), Bayesian inference, and computing expected values over distributions.
