Derivatives and Differentiation

ℹ️ Why It Matters

Derivatives measure the instantaneous rate of change of a function. In machine learning, every gradient descent update, from simple linear regression to training large language models, relies on derivatives of a loss function with respect to model parameters. Differentiation is therefore a core skill for building and optimizing ML systems.

What is a Derivative

Definition: Derivative

The derivative of a function $f(x)$ at a point $x$ is defined as the limit of the difference quotient as the increment $h$ approaches zero, provided that limit exists. Geometrically, the derivative represents the slope of the tangent line to the curve at that point.

Limit Definition of the Derivative

f'(x) = \lim_{h \to 0}\frac{f(x+h) - f(x)}{h}

Here,

$f'(x)$ =The derivative of f with respect to x
$h$ =A small increment in $x$ that is taken to zero in the limit
$f(x+h) - f(x)$ =The change in function value over the interval
$\frac{f(x+h)-f(x)}{h}$ =The difference quotient (average rate of change)
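
To make the limit concrete, here is a minimal sketch that evaluates the difference quotient for shrinking values of $h$. The helper names `difference_quotient` and `square` are illustrative and not part of the original text.

```python
def difference_quotient(f, x, h):
    """Average rate of change of f over [x, x + h]."""
    return (f(x + h) - f(x)) / h

def square(x):
    return x * x

# For f(x) = x^2 at x = 3 the quotient equals 6 + h, so it approaches 6 as h shrinks.
for h in (1.0, 0.1, 0.01, 0.001):
    print(f"h = {h:<6} quotient = {difference_quotient(square, 3.0, h):.6f}")
```

The printed values move toward $f'(3) = 6$, which is what the limit definition predicts.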

💡 Geometric Meaning

The derivative $f'(x_0)$ gives the slope of the line tangent to $y = f(x)$ at the point $(x_0, f(x_0))$. If $f'(x_0) > 0$, the function is increasing at $x_0$. If $f'(x_0) < 0$, it is decreasing. If $f'(x_0) = 0$, the tangent is horizontal, which makes $x_0$ a candidate for a local maximum or minimum.

⚠️ Differentiability vs Continuity

If $f$ is differentiable at $a$, then $f$ is continuous at $a$. The converse is not true: $f(x) = |x|$ is continuous at $x = 0$ but not differentiable there (the left and right derivatives differ).
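
The $|x|$ counterexample can be checked with one-sided difference quotients. This is a small sketch; `one_sided_quotients` is a hypothetical helper introduced only for illustration.

```python
def one_sided_quotients(f, a, h=1e-6):
    """Return (left, right) difference quotients of f at a for a small step h."""
    left = (f(a) - f(a - h)) / h
    right = (f(a + h) - f(a)) / h
    return left, right

left, right = one_sided_quotients(abs, 0.0)
print(left, right)  # the two slopes disagree, so no single tangent slope exists at 0
```

The left quotient is $-1$ and the right quotient is $+1$ for every $h > 0$, so the two-sided limit in the definition does not exist at $x = 0$.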

Basic Derivative Rules

The following rules allow us to differentiate complex functions by breaking them into simpler parts.

Rule	Formula	Example
Constant	$\frac{d}{dx}[c] = 0$	$\frac{d}{dx}[7] = 0$
Constant Multiple	$\frac{d}{dx}[cf(x)] = c \cdot f'(x)$	$\frac{d}{dx}[3x^2] = 6x$
Power Rule	$\frac{d}{dx}x^n = nx^{n-1}$	$\frac{d}{dx}x^5 = 5x^4$
Sum Rule	$\frac{d}{dx}[f(x) + g(x)] = f'(x) + g'(x)$	$\frac{d}{dx}[x^2 + \sin x] = 2x + \cos x$
Difference Rule	$\frac{d}{dx}[f(x) - g(x)] = f'(x) - g'(x)$	$\frac{d}{dx}[x^3 - e^x] = 3x^2 - e^x$
Product Rule	$\frac{d}{dx}[f(x)g(x)] = f'(x)g(x) + f(x)g'(x)$	$\frac{d}{dx}[x \cdot e^x] = e^x + xe^x$
Quotient Rule	$\frac{d}{dx}\!\left[\frac{f(x)}{g(x)}\right] = \frac{f'g - fg'}{g^2}$	$\frac{d}{dx}\!\left[\frac{x}{x^2+1}\right] = \frac{1-x^2}{(x^2+1)^2}$
Chain Rule	$\frac{d}{dx}f(g(x)) = f'(g(x)) \cdot g'(x)$	$\frac{d}{dx}\sin(3x) = 3\cos(3x)$
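
As a sanity check on the Example column of the table above, the sketch below compares three of the closed-form results with a central difference approximation. The test point `x0` and the helper `central_diff` are arbitrary choices made for this illustration.

```python
import math

def central_diff(f, x, h=1e-6):
    """Central difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

# name -> (function, derivative claimed in the table's Example column)
examples = {
    "product":  (lambda x: x * math.exp(x), lambda x: math.exp(x) + x * math.exp(x)),
    "quotient": (lambda x: x / (x**2 + 1),  lambda x: (1 - x**2) / (x**2 + 1) ** 2),
    "chain":    (lambda x: math.sin(3 * x), lambda x: 3 * math.cos(3 * x)),
}

x0 = 0.7  # arbitrary test point
errors = {name: abs(central_diff(f, x0) - df(x0)) for name, (f, df) in examples.items()}
for name, err in errors.items():
    print(f"{name:<8} |numerical - formula| = {err:.2e}")
```

Each discrepancy should be at the level of floating-point noise, which supports the formulas without proving them.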

💡 Product Rule Mnemonic

Remember: "derivative of the first times the second, plus the first times the derivative of the second." In symbols: $f'g + fg'$.

Derivatives of Common Functions

Function $f(x)$	Derivative $f'(x)$	Notes
$x^n$	$nx^{n-1}$	Power rule, for any real $n$
$e^x$	$e^x$	Its own derivative
$a^x$	$a^x \ln a$	General exponential
$\ln x$	$\frac{1}{x}$	Logarithmic derivative
$\log_a x$	$\frac{1}{x \ln a}$	Change of base
$\sin x$	$\cos x$	Cyclic pattern
$\cos x$	$-\sin x$	Note the negative sign
$\tan x$	$\sec^2 x$	From quotient rule on $\sin/\cos$
$\csc x$	$-\csc x \cot x$
$\sec x$	$\sec x \tan x$
$\cot x$	$-\csc^2 x$
$\arcsin x$	$\frac{1}{\sqrt{1-x^2}}$	Inverse trig
$\arctan x$	$\frac{1}{1+x^2}$	Inverse trig
$\sqrt{x}$	$\frac{1}{2\sqrt{x}}$	Power rule with $n=\frac{1}{2}$
$\sigma(x)$	$\sigma(x)(1-\sigma(x))$	Sigmoid (ML critical)
$\tanh(x)$	$1 - \tanh^2(x)$	Hyperbolic tangent

💡 ML Sigmoid Derivative

The sigmoid function $\sigma(x) = \frac{1}{1+e^{-x}}$ has the remarkably elegant derivative $\sigma'(x) = \sigma(x)(1-\sigma(x))$. Once you compute $\sigma(x)$ during the forward pass, you can reuse it in the backward pass, which is a key efficiency in backpropagation.

Higher-Order Derivatives

Definition: Higher-Order Derivatives

The second derivative $f''(x) = \frac{d^2}{dx^2}f(x)$ is the derivative of the derivative. In general, the $n$-th derivative is denoted $f^{(n)}(x)$.

Notation	Meaning
$f'(x)$	First derivative — rate of change
$f''(x)$	Second derivative — concavity / acceleration
$f'''(x)$	Third derivative — rate of change of acceleration
$f^{(n)}(x)$	$n$ -th derivative
$\frac{d^2y}{dx^2}$	Leibniz notation for second derivative

Theorem: Second Derivative Test

If $f'(c) = 0$ and $f''(c)$ exists:

If $f''(c) > 0$ , then $f$ has a local minimum at $c$ .
If $f''(c) < 0$ , then $f$ has a local maximum at $c$ .
If $f''(c) = 0$ , the test is inconclusive.
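
The three cases of the test can be sketched as a small classifier. The function `classify_critical_point` and the example functions are illustrative assumptions, and the caller is responsible for passing a point where $f'(c) = 0$.

```python
def classify_critical_point(second_derivative, c):
    """Apply the second derivative test at a critical point c (where f'(c) = 0)."""
    value = second_derivative(c)
    if value > 0:
        return "local minimum"
    if value < 0:
        return "local maximum"
    return "inconclusive"

# f(x) = x^3 - 3x has f'(x) = 3x^2 - 3, so the critical points are x = -1 and x = 1.
f_second = lambda x: 6 * x
print(classify_critical_point(f_second, -1.0))  # local maximum
print(classify_critical_point(f_second, 1.0))   # local minimum

# g(x) = x^4 has g'(0) = 0 and g''(0) = 0, so the test gives no verdict at 0.
print(classify_critical_point(lambda x: 12 * x**2, 0.0))  # inconclusive
```

The last case shows why "inconclusive" matters: $x^4$ does have a minimum at $0$, but the second derivative alone cannot detect it.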

Taylor Series Expansion

f(x) = \sum_{n=0}^{\infty}\frac{f^{(n)}(a)}{n!}(x-a)^n = f(a) + f'(a)(x-a) + \frac{f''(a)}{2!}(x-a)^2 + \cdots

Here,

$f^{(n)}(a)$ =The n-th derivative evaluated at a
$n!$ =Factorial of n
$a$ =The point about which the expansion is made
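
A quick way to see the expansion at work is to sum its first few terms for $e^x$, where every derivative at $a$ equals $e^a$. The function name `taylor_exp` is a hypothetical helper for this sketch.

```python
import math

def taylor_exp(x, terms, a=0.0):
    """Partial sum of the Taylor series of e^x about a, using `terms` terms."""
    # Every derivative of e^x is e^x, so f^(n)(a) = e^a for all n.
    return sum(math.exp(a) * (x - a) ** n / math.factorial(n) for n in range(terms))

for terms in (1, 2, 4, 8):
    approx = taylor_exp(1.0, terms)
    print(f"{terms} terms: {approx:.6f}  error = {abs(math.e - approx):.2e}")
```

The error shrinks rapidly as terms are added because the factorial in the denominator grows faster than the power in the numerator.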

ℹ️ Concavity and Inflection Points

$f''(x) > 0$ on an interval means $f$ is concave up (shaped like a cup).
$f''(x) < 0$ on an interval means $f$ is concave down (shaped like a frown).
An inflection point occurs where $f''(x)$ changes sign.

Implicit Differentiation

Definition: Implicit Differentiation

When $y$ is defined implicitly by an equation $F(x, y) = 0$ rather than explicitly as $y = f(x)$, we differentiate both sides with respect to $x$, treating $y$ as a function of $x$ and applying the chain rule to every $y$ term.

Implicit Derivative

\frac{dy}{dx} = -\frac{F_x}{F_y} \quad (F_y \neq 0)

Here,

$F(x, y)$ =The implicit function
$F_x$ =Partial derivative of F with respect to x
$F_y$ =Partial derivative of F with respect to y

📝Implicit Differentiation Example

Problem: Find $\frac{dy}{dx}$ for $x^2 + y^2 = 25$ (a circle of radius 5).

💡Solution

Differentiate both sides with respect to $x$ :

\frac{d}{dx}[x^2] + \frac{d}{dx}[y^2] = \frac{d}{dx}[25]

2x + 2y\frac{dy}{dx} = 0

\frac{dy}{dx} = -\frac{x}{y}

At the point $(3, 4)$: $\frac{dy}{dx} = -\frac{3}{4}$.
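
The same slope follows from the formula $\frac{dy}{dx} = -F_x / F_y$ with $F(x, y) = x^2 + y^2 - 25$. The sketch below evaluates it and cross-checks against the explicit upper half of the circle; `implicit_slope` is an illustrative helper, not an established API.

```python
import math

def implicit_slope(F_x, F_y, x, y):
    """dy/dx = -F_x / F_y for a curve F(x, y) = 0, valid where F_y != 0."""
    return -F_x(x, y) / F_y(x, y)

# F(x, y) = x^2 + y^2 - 25, so F_x = 2x and F_y = 2y.
slope = implicit_slope(lambda x, y: 2 * x, lambda x, y: 2 * y, 3.0, 4.0)

# Cross-check on the upper half of the circle, where y = sqrt(25 - x^2) explicitly.
upper = lambda x: math.sqrt(25 - x**2)
h = 1e-6
numeric = (upper(3.0 + h) - upper(3.0 - h)) / (2 * h)

print(slope, numeric)
```

Both numbers agree with $-\frac{3}{4}$ up to finite-difference error.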

Logarithmic Differentiation

Definition: Logarithmic Differentiation

A technique for differentiating functions of the form $f(x)^{g(x)}$ or products/quotients of many factors. Take the natural logarithm of both sides, then differentiate implicitly.

Logarithmic Differentiation Technique

\frac{d}{dx}\ln|f(x)| = \frac{f'(x)}{f(x)}

Here,

$f(x)$ =The original function
$f'(x)$ =Its derivative

📝Logarithmic Differentiation Example

Problem: Find $\frac{d}{dx}[x^x]$ for $x > 0$ .

💡Solution

Let $y = x^x$ . Take the natural log of both sides:

\ln y = x \ln x

Differentiate both sides:

\frac{1}{y}\frac{dy}{dx} = \ln x + 1

\frac{dy}{dx} = x^x(\ln x + 1)
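
A numerical check of this result at a sample point is sketched below. The point `x0 = 2.0` and the helper names are arbitrary choices for illustration.

```python
import math

def xx_derivative(x):
    """Closed form from the worked example: d/dx x^x = x^x (ln x + 1), for x > 0."""
    return x**x * (math.log(x) + 1)

def central_diff(f, x, h=1e-6):
    """Central difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

x0 = 2.0
formula = xx_derivative(x0)                  # 4 * (ln 2 + 1)
numeric = central_diff(lambda x: x**x, x0)
print(f"formula = {formula:.6f}, numeric = {numeric:.6f}")
```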

💡 When to Use

Logarithmic differentiation is most useful when:

The function is a product or quotient of many factors
The function has the form $f(x)^{g(x)}$ where both base and exponent depend on $x$
The function involves radicals combined with other operations

Mean Value Theorem

Theorem: Mean Value Theorem (MVT)

If $f$ is continuous on $[a, b]$ and differentiable on $(a, b)$, then there exists at least one point $c \in (a, b)$ such that:

f'(c) = \frac{f(b) - f(a)}{b - a}

ℹ️ Intuition

The MVT guarantees that at some point, the instantaneous rate of change (derivative) equals the average rate of change over the interval. Geometrically, there is a point where the tangent line is parallel to the secant line connecting the endpoints.
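
The theorem only asserts that such a point exists, but for a concrete function it can be located numerically. The sketch below uses bisection on $f'(x)$ minus the average slope. It assumes that difference has opposite signs at the two endpoints, which holds for the example but not for every function. `mvt_point` is a hypothetical helper.

```python
def mvt_point(f, f_prime, a, b, tol=1e-10):
    """Find c in (a, b) with f'(c) = (f(b) - f(a)) / (b - a) by bisection.

    Assumes f'(x) minus the average slope has opposite signs at a and b.
    """
    avg_slope = (f(b) - f(a)) / (b - a)
    g = lambda x: f_prime(x) - avg_slope
    lo, hi = a, b
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if g(lo) * g(mid) <= 0:
            hi = mid   # sign change lies in [lo, mid]
        else:
            lo = mid   # sign change lies in [mid, hi]
    return (lo + hi) / 2

# f(x) = x^3 on [0, 2]: the average slope is 4, so 3c^2 = 4 and c = 2 / sqrt(3).
c = mvt_point(lambda x: x**3, lambda x: 3 * x**2, 0.0, 2.0)
print(c)
```

At the returned point the tangent to $y = x^3$ is parallel to the secant through $(0, 0)$ and $(2, 8)$.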

Theorem: Rolle's Theorem (Special Case of MVT)

If $f$ is continuous on $[a, b]$, differentiable on $(a, b)$, and $f(a) = f(b)$, then there exists at least one $c \in (a, b)$ such that $f'(c) = 0$.

⚠️ Important Corollary

If $f'(x) = 0$ for all $x$ in an interval, then $f$ is constant on that interval. This is used to prove that antiderivatives are unique up to an additive constant.

Applications

Related Rates

Definition: Related Rates

When two or more quantities change with respect to time and are related by an equation, we can differentiate that equation with respect to $t$ to find relationships between their rates of change.

📝Related Rates: Expanding Sphere

Problem: A sphere's radius increases at $\frac{dr}{dt} = 3$ cm/s. How fast is the volume increasing when $r = 5$ cm?

💡Solution

V = \frac{4}{3}\pi r^3

$\frac{dV}{dt} = 4\pi r^2 \frac{dr}{dt} = 4\pi(25)(3) = 300\pi$ cm$^3$/s
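
The answer can be cross-checked by differentiating the volume numerically along an assumed radius history. The linear model `radius(t) = 5 + 3t` is an assumption chosen so that $r = 5$ and $\frac{dr}{dt} = 3$ at $t = 0$.

```python
import math

def sphere_volume(r):
    return 4.0 / 3.0 * math.pi * r**3

def radius(t):
    """Radius in cm at time t seconds: r = 5 at t = 0, growing at 3 cm/s."""
    return 5.0 + 3.0 * t

h = 1e-6
numeric_dV_dt = (sphere_volume(radius(h)) - sphere_volume(radius(-h))) / (2 * h)
formula_dV_dt = 4 * math.pi * 5.0**2 * 3.0   # 4 pi r^2 dr/dt = 300 pi

print(f"numeric = {numeric_dV_dt:.4f}, formula = {formula_dV_dt:.4f}")
```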

Optimization

💡 Optimization Procedure

To find the absolute maximum or minimum of $f(x)$ on $[a, b]$ :

Find all critical points: solve $f'(x) = 0$ and identify where $f'(x)$ is undefined.
Evaluate $f$ at each critical point.
Evaluate $f$ at the endpoints $a$ and $b$ .
The largest value is the absolute maximum; the smallest is the absolute minimum.
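
The procedure above can be sketched as a short function. It assumes the critical points have already been found by hand, and the name `closed_interval_extrema` is hypothetical.

```python
def closed_interval_extrema(f, critical_points, a, b):
    """Return ((x_min, f_min), (x_max, f_max)) for f on [a, b].

    `critical_points` must already contain every point where f'(x) = 0
    or f'(x) is undefined; points outside (a, b) are ignored.
    """
    candidates = [a, b] + [c for c in critical_points if a < c < b]
    values = [(x, f(x)) for x in candidates]
    return min(values, key=lambda p: p[1]), max(values, key=lambda p: p[1])

# f(x) = x^3 - 3x on [-2, 3]; f'(x) = 3x^2 - 3 vanishes at x = -1 and x = 1.
f = lambda x: x**3 - 3 * x
(x_min, f_min), (x_max, f_max) = closed_interval_extrema(f, [-1.0, 1.0], -2.0, 3.0)
print(f"min {f_min} at x = {x_min}, max {f_max} at x = {x_max}")
```

In this example the minimum value $-2$ is attained both at the endpoint $x = -2$ and at the critical point $x = 1$. The sketch reports only one of them, which is why checking endpoints and critical points together matters.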

📝Optimization: Minimum Surface Area

Problem: A box with a square base and no top has volume $V = 32$ . Find the dimensions that minimize surface area.

💡Solution

Let $x$ = side of square base, $h$ = height.

Constraint: $x^2 h = 32 \implies h = \frac{32}{x^2}$

Surface area: $S = x^2 + 4xh = x^2 + \frac{128}{x}$

S'(x) = 2x - \frac{128}{x^2} = 0 \implies x^3 = 64 \implies x = 4

$S''(x) = 2 + \frac{256}{x^3} > 0$ (confirms minimum)

Dimensions: $x = 4$ , $h = 2$ .

Linear Approximation

f(x) \approx f(a) + f'(a)(x - a)

Here,

$f(a)$ =Function value at the known point a
$f'(a)$ =Derivative at a (slope of tangent line)
$x - a$ =Small displacement from a
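
As a worked instance of the formula, the sketch below estimates $\sqrt{4.1}$ from the known point $a = 4$. The helper `linear_approx` is an illustrative name.

```python
import math

def linear_approx(f_a, fprime_a, a, x):
    """Tangent-line estimate f(a) + f'(a)(x - a)."""
    return f_a + fprime_a * (x - a)

# Estimate sqrt(4.1) from a = 4, where f(a) = 2 and f'(a) = 1 / (2 * sqrt(4)) = 0.25.
estimate = linear_approx(2.0, 0.25, 4.0, 4.1)
exact = math.sqrt(4.1)
print(f"estimate = {estimate:.5f}, exact = {exact:.5f}, error = {abs(estimate - exact):.1e}")
```

The estimate is $2.025$. The small remaining error comes from the curvature that the tangent line ignores.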

Python Implementation

Symbolic Differentiation with SymPy

import sympy as sp

x = sp.Symbol('x')

# Symbolic derivatives
f = sp.ln(sp.sin(x))
print(f"Derivative of ln(sin(x)): {sp.diff(f, x)}")

g = x**2 * sp.exp(x)
print(f"Derivative of x^2 * e^x: {sp.diff(g, x)}")

# Higher-order derivatives
h = sp.sin(x)
print(f"Second derivative of sin(x): {sp.diff(h, x, 2)}")
print(f"Third derivative of sin(x):  {sp.diff(h, x, 3)}")

# Partial derivatives (multivariate)
y = sp.Symbol('y')
f_multi = x**2 * y + sp.sin(x * y)
print(f"∂f/∂x = {sp.diff(f_multi, x)}")
print(f"∂f/∂y = {sp.diff(f_multi, y)}")

Numerical Differentiation

import numpy as np

def numerical_derivative(f, x, h=1e-7):
    """Central difference approximation."""
    return (f(x + h) - f(x - h)) / (2 * h)

def numerical_second_derivative(f, x, h=1e-5):
    """Second derivative via finite differences."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / (h ** 2)

# Examples
f = np.sin
print(f"sin'(0.5) numerical:  {numerical_derivative(f, 0.5):.6f}")
print(f"sin'(0.5) exact:      {np.cos(0.5):.6f}")

g = lambda x: x**3 - 2*x + 1
print(f"g''(1) numerical:    {numerical_second_derivative(g, 1):.6f}")
print(f"g''(1) exact:        {6 * 1:.6f}")

Sigmoid and Its Derivative (ML)

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)

# Evaluate the sigmoid and its analytic derivative at a sample point
x = 0.5
print(f"Sigmoid({x}) = {sigmoid(x):.6f}")
print(f"Sigmoid'({x}) = {sigmoid_derivative(x):.6f}")

# Verify numerically
h = 1e-7
numerical = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
print(f"Numerical approx:     {numerical:.6f}")

Applications in AI/ML

Gradient Descent

Gradient Descent Update Rule

\theta_{t+1} = \theta_t - \eta \nabla_\theta \mathcal{L}(\theta_t)

Here,

$\theta_t$ =Model parameters at iteration t
$\eta$ =Learning rate (step size)
$\nabla_\theta \mathcal{L}$ =Gradient of the loss function
$\mathcal{L}$ =Loss function to minimize
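
A minimal sketch of the update rule for a single scalar parameter follows. The toy loss $(\theta - 3)^2$, the learning rate, and the step count are assumptions chosen only to make convergence easy to see.

```python
def gradient_descent(grad, theta0, eta=0.1, steps=100):
    """Iterate theta <- theta - eta * grad(theta) for a scalar parameter."""
    theta = theta0
    for _ in range(steps):
        theta = theta - eta * grad(theta)
    return theta

# Toy loss L(theta) = (theta - 3)^2 with gradient 2 * (theta - 3); the minimizer is theta = 3.
theta_star = gradient_descent(lambda th: 2 * (th - 3), theta0=0.0)
print(theta_star)
```

With these settings each step shrinks the distance to the minimizer by a factor of $0.8$, so the iterate approaches $3$ geometrically.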

ℹ️ Why Derivatives Matter in ML

Every step of training a neural network involves:

Forward pass: Compute the prediction using the current parameters.
Compute loss: Measure how wrong the prediction is.
Backward pass: Compute derivatives (gradients) of the loss with respect to every parameter using the chain rule.
Update parameters: Move each parameter in the direction that reduces the loss, scaled by the learning rate.
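
The four steps above can be sketched for the smallest possible model, a single weight `w` in the prediction `w * x`. The data, learning rate, and mean-squared-error loss are assumptions made for this illustration.

```python
# Toy data generated by y = 2x; the model is the single-parameter prediction w * x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

w = 0.0      # initial parameter
eta = 0.05   # learning rate

for step in range(200):
    # 1. Forward pass: predictions with the current parameter.
    preds = [w * x for x in xs]
    # 2. Loss: mean squared error between predictions and targets.
    loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)
    # 3. Backward pass: dL/dw via the chain rule, 2 * (p - y) * dp/dw with dp/dw = x.
    grad = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)
    # 4. Update: step against the gradient, scaled by the learning rate.
    w -= eta * grad

print(f"w = {w:.6f}, last loss = {loss:.2e}")
```

The weight converges to $2$, the slope that generated the data. A real network repeats the same four steps with many parameters and lets an autodiff library carry out step 3.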

Chain Rule in Backpropagation

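For a simple neural network layer $z = Wx + b$, $a = \sigma(z)$, with loss $\mathcal{L}$. The gradient $\partial\mathcal{L}/\partial a = a - y$ used below corresponds to the squared-error loss $\mathcal{L} = \frac{1}{2}\|a - y\|^2$: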

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Forward pass
x = np.array([1.0, 2.0])
W = np.array([[0.5, 0.3], [0.2, 0.7]])
b = np.array([0.1, 0.1])
z = W @ x + b
a = sigmoid(z)

# Backward pass via chain rule
y_true = np.array([1.0, 0.0])
dL_da = a - y_true              # ∂L/∂a for squared-error loss L = 0.5 * ||a - y||^2
da_dz = a * (1 - a)            # ∂a/∂z (sigmoid derivative)
dL_dz = dL_da * da_dz          # ∂L/∂z
dL_dW = np.outer(dL_dz, x)     # ∂L/∂W
dL_db = dL_dz                  # ∂L/∂b
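
A common way to validate a hand-written backward pass is a gradient check against finite differences. The sketch below does this for a single scalar neuron rather than the full layer above, and it assumes the squared-error loss that matches `dL_da = a - y_true`. All names and values are illustrative.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def loss(w, b, x, y):
    """Squared-error loss of one sigmoid neuron: 0.5 * (sigmoid(w*x + b) - y)^2."""
    return 0.5 * (sigmoid(w * x + b) - y) ** 2

def analytic_grad_w(w, b, x, y):
    """Chain rule: dL/dw = (a - y) * a * (1 - a) * x."""
    a = sigmoid(w * x + b)
    return (a - y) * a * (1 - a) * x

w, b, x, y = 0.5, 0.1, 2.0, 1.0   # illustrative values
h = 1e-6
numeric = (loss(w + h, b, x, y) - loss(w - h, b, x, y)) / (2 * h)
analytic = analytic_grad_w(w, b, x, y)
print(f"analytic = {analytic:.8f}, numeric = {numeric:.8f}")
```

If the two values disagree beyond finite-difference error, the analytic gradient usually has a missing chain-rule factor.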

Rate of Change in Feature Engineering

Velocity: First derivative of position w.r.t. time.
Acceleration: Second derivative.
Jerk: Third derivative.
In time-series ML, these derivatives (computed numerically) become features.
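
For uniformly sampled data, these features can be sketched with repeated forward differences. The sample series and the helper `finite_differences` are assumptions for illustration.

```python
def finite_differences(values, dt=1.0):
    """Forward differences of a uniformly sampled series, divided by the step dt."""
    return [(b - a) / dt for a, b in zip(values, values[1:])]

# Position sampled once per second along x(t) = t^2.
position = [0.0, 1.0, 4.0, 9.0, 16.0]
velocity = finite_differences(position)        # [1.0, 3.0, 5.0, 7.0]
acceleration = finite_differences(velocity)    # [2.0, 2.0, 2.0]
jerk = finite_differences(acceleration)        # [0.0, 0.0]
print(velocity, acceleration, jerk)
```

Each differencing pass shortens the series by one sample, so the resulting feature columns need to be aligned or padded before they are joined back to the original data.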

Common Mistakes

Mistake	Incorrect	Correct	Explanation
Forgetting chain rule	$\frac{d}{dx}\sin(x^2) = \cos(x^2)$	$\frac{d}{dx}\sin(x^2) = 2x\cos(x^2)$	Must multiply by inner derivative
Product rule	$(fg)' = f'g'$	$(fg)' = f'g + fg'$	The derivative of a product is not the product of the derivatives
Quotient rule sign	$(\frac{f}{g})' = \frac{f'g + fg'}{g^2}$	$(\frac{f}{g})' = \frac{f'g - fg'}{g^2}$	Numerator uses subtraction
Power rule on $e^x$	$\frac{d}{dx}e^x = xe^{x-1}$	$\frac{d}{dx}e^x = e^x$	Exponential rule differs from power rule
$\ln$ of a composite	$\frac{d}{dx}\ln(x^2) = \frac{1}{x^2}$	$\frac{d}{dx}\ln(x^2) = \frac{2x}{x^2} = \frac{2}{x}$	$\frac{d}{dx}\ln(g(x)) = \frac{g'(x)}{g(x)}$ requires the chain rule
Derivative of constant	$\frac{d}{dx}[c] = c$	$\frac{d}{dx}[c] = 0$	Constants have zero rate of change

⚠️ Double-Check Your Work

After computing a derivative, check it against a difference quotient at a simple point. For example, if you think $\frac{d}{dx}x^2 = 2x$, test at $x=1$ with $h = 0.001$: $\frac{(1.001)^2 - 1^2}{0.001} = 2.001$, which is close to the predicted slope of 2. Always sanity-check!

Interview Questions

Q1: What is the derivative of $e^{x^2}$ ?

💡Answer

Using the chain rule with outer function $e^u$ and inner function $u = x^2$ :

\frac{d}{dx}e^{x^2} = e^{x^2} \cdot 2x = 2xe^{x^2}

Q2: Explain the product rule in plain English. When do you use it?

💡Answer

The product rule states that the derivative of a product $f(x) \cdot g(x)$ is $f'(x)g(x) + f(x)g'(x)$ . Intuitively, when differentiating a product, either factor could be changing, so you must account for the rate of change of each factor while holding the other constant. Use it whenever two functions of $x$ are multiplied together.

Q3: Why is the sigmoid derivative so important in neural networks?

💡Answer

The sigmoid derivative $\sigma'(x) = \sigma(x)(1-\sigma(x))$ appears in backpropagation whenever a gradient passes through a sigmoid activation. Its key property is that it depends only on the forward-pass output $\sigma(x)$, so the backward pass needs one subtraction and one multiplication rather than another exponential. The drawback is that $\sigma'(x) \le \frac{1}{4}$ everywhere and $\sigma'(x) \approx 0$ for large $|x|$, which contributes to the vanishing gradient problem.

Q4: What is the difference between $\frac{d}{dx}[x^x]$ and $\frac{d}{dx}[x^n]$ ?

💡Answer

$\frac{d}{dx}[x^n] = nx^{n-1}$ (power rule, $n$ is constant)
$\frac{d}{dx}[x^x] = x^x(\ln x + 1)$ (requires logarithmic differentiation, both base and exponent depend on $x$ )

The power rule applies only when the exponent is a constant. When both base and exponent are variables, you must use logarithmic differentiation.

Q5: Prove that if $f'(x) = 0$ for all $x$ in an interval, then $f$ is constant.

💡Answer

By the Mean Value Theorem, for any $a, b$ in the interval with $a < b$ , there exists $c \in (a, b)$ such that $f(b) - f(a) = f'(c)(b - a)$ . Since $f'(c) = 0$ , we get $f(b) - f(a) = 0$ , so $f(b) = f(a)$ . Since $a$ and $b$ were arbitrary, $f$ is constant on the interval.

Q6: What are the conditions for a critical point to be a local minimum?

💡Answer

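Let $c$ be a critical point of a function $f$ that is continuous at $c$. Any one of the following is sufficient for a local minimum at $c$: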

First derivative test: $f'(c) = 0$ (or undefined) and $f'$ changes from negative to positive at $c$ .
Second derivative test: $f'(c) = 0$ and $f''(c) > 0$ .
Higher-order test: If the first non-zero derivative at $c$ is of even order $n$ and $f^{(n)}(c) > 0$ , then $c$ is a local minimum.

Practice Problems

Problem 1: Chain Rule

📝Find the Derivative

Problem: Find $\frac{d}{dx}\ln(\sin x)$ .

💡Solution

Let $u = \sin x$ , so $y = \ln u$ .

\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx} = \frac{1}{\sin x} \cdot \cos x = \cot x

Problem 2: Product Rule

📝Find the Derivative

Problem: Find $\frac{d}{dx}[x^2 e^x \sin x]$ .

💡Solution

Apply the three-factor product rule with $f = x^2$, $g = e^x$, $h = \sin x$:

(fgh)' = f'gh + fg'h + fgh'

= 2x \cdot e^x \cdot \sin x + x^2 \cdot e^x \cdot \sin x + x^2 \cdot e^x \cdot \cos x

= xe^x(2\sin x + x\sin x + x\cos x)

Problem 3: Implicit Differentiation

📝Find the Derivative

Problem: Find $\frac{dy}{dx}$ for $\cos(xy) = x + y$ .

💡Solution

Differentiate both sides:

-\sin(xy) \cdot (y + x\frac{dy}{dx}) = 1 + \frac{dy}{dx}

-y\sin(xy) - x\sin(xy)\frac{dy}{dx} = 1 + \frac{dy}{dx}

\frac{dy}{dx}(-x\sin(xy) - 1) = 1 + y\sin(xy)

\frac{dy}{dx} = -\frac{1 + y\sin(xy)}{1 + x\sin(xy)}

Problem 4: Optimization

📝Find the Maximum

Problem: Find the maximum area of a rectangle inscribed in a semicircle of radius $r$ .

💡Solution

Place the semicircle on the coordinate plane: $x^2 + y^2 = r^2$ , $y \geq 0$ .

Rectangle dimensions: width $= 2x$ , height $= y = \sqrt{r^2 - x^2}$

A(x) = 2x\sqrt{r^2 - x^2}

A'(x) = 2\sqrt{r^2 - x^2} + 2x \cdot \frac{-x}{\sqrt{r^2 - x^2}} = \frac{2(r^2 - 2x^2)}{\sqrt{r^2 - x^2}}

Set $A'(x) = 0$ : $r^2 = 2x^2 \implies x = \frac{r}{\sqrt{2}}$

Maximum area $= 2 \cdot \frac{r}{\sqrt{2}} \cdot \frac{r}{\sqrt{2}} = r^2$ .

Problem 5: Logarithmic Differentiation

📝Find the Derivative

Problem: Find $\frac{d}{dx}[(\sin x)^{\cos x}]$ for $\sin x > 0$ .

💡Solution

Let $y = (\sin x)^{\cos x}$ . Take ln: $\ln y = \cos x \cdot \ln(\sin x)$ .

Differentiate:

\frac{1}{y}\frac{dy}{dx} = -\sin x \cdot \ln(\sin x) + \cos x \cdot \frac{\cos x}{\sin x}

\frac{dy}{dx} = (\sin x)^{\cos x}\left[-\sin x \ln(\sin x) + \frac{\cos^2 x}{\sin x}\right]

Quick Reference

📋Key Takeaways

Derivative Definition: $f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$ measures the instantaneous rate of change.
Power Rule: $\frac{d}{dx}x^n = nx^{n-1}$ — the most fundamental differentiation tool.
Product Rule: $(fg)' = f'g + fg'$ — for products of functions.
Quotient Rule: $(\frac{f}{g})' = \frac{f'g - fg'}{g^2}$ — for ratios of functions.
Chain Rule: $(f(g(x)))' = f'(g(x)) \cdot g'(x)$ — the foundation of backpropagation.
Second Derivative: $f''(x)$ measures concavity and acceleration; used in optimization.
ML Connection: Gradient descent $\theta \leftarrow \theta - \eta \nabla\mathcal{L}$ uses derivatives to minimize loss.
Sigmoid Derivative: $\sigma'(x) = \sigma(x)(1-\sigma(x))$ — efficient because it reuses the forward pass output.
Taylor Series: $f(x) = \sum \frac{f^{(n)}(a)}{n!}(x-a)^n$ — approximates functions using derivatives.
Common Pitfall: Always apply the chain rule to composite functions — forgetting it is the #1 mistake.

Cross-References

Limits: Understanding limits is prerequisite for the derivative definition → Limits and Continuity
Chain Rule: In-depth coverage of the chain rule and implicit differentiation → Chain Rule and Implicit Differentiation
Partial Derivatives: Derivatives with multiple variables → Partial Derivatives
Multivariable Calculus: Gradients, Jacobians, and Hessians → Multivariable Calculus
Taylor Series: Polynomial approximations using derivatives → Taylor Series
Optimization: Gradient descent and convex optimization → Optimization Fundamentals
Linear Algebra: Matrix calculus for neural network derivatives → Matrix Calculus
Numerical Methods: Numerical differentiation and integration → Numerical Methods
