Mathematical Concepts Used in AI

ℹ️ Why It Matters

This chapter connects all the math to actual AI/ML applications. These are the concepts that power modern AI systems.

Neural Network Math

Forward Propagation

ℹ️ Forward Propagation

For layer $l$ :

z^{(l)} = W^{(l)}a^{(l-1)} + b^{(l)} \quad \leftarrow \text{Linear transformation}

a^{(l)} = \sigma(z^{(l)}) \quad \leftarrow \text{Activation function}

Where:

$W^{(l)}$ = weight matrix
$b^{(l)}$ = bias vector
$a^{(l)}$ = activation (output) of layer $l$
$\sigma$ = activation function
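The two formulas above can be sketched for a single layer in plain Python. This is a minimal illustration, not a production implementation: the helper name `dense_forward`, the choice of sigmoid, and the example numbers are all assumptions made for the sketch.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense_forward(W, a_prev, b, activation=sigmoid):
    """Return (z, a) for one layer: z = W a_prev + b, a = activation(z)."""
    # Each row of W produces one pre-activation value.
    z = [sum(w_ij * a_j for w_ij, a_j in zip(row, a_prev)) + b_i
         for row, b_i in zip(W, b)]
    a = [activation(z_i) for z_i in z]
    return z, a

# Hypothetical 2x2 layer, chosen so both pre-activations are exactly 0.
W = [[0.5, -1.0], [2.0, 0.0]]
b = [0.0, -2.0]
a_prev = [1.0, 0.5]
z, a = dense_forward(W, a_prev, b)
print(z, a)
```

Because both pre-activations are 0 here, both sigmoid outputs are 0.5, the midpoint of the sigmoid's range.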

Activation Functions

Sigmoid:

Sigmoid Activation

\sigma(z) = \frac{1}{1 + e^{-z}}

Here,

$z$ = Input to activation function

\sigma'(z) = \sigma(z) \times (1 - \sigma(z))

ℹ️ Sigmoid Properties

Range: $(0, 1)$
Problem: Vanishing gradients ($\sigma'(z) \leq 0.25$ everywhere and approaches 0 for large $|z|$)

Tanh:

Tanh Activation

\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}

Here,

$z$ = Input to activation function

\tanh'(z) = 1 - \tanh^2(z)

ℹ️ Tanh Properties

Range: $(-1, 1)$
Outputs are centered at 0, which usually makes optimization easier than with sigmoid

ReLU (Rectified Linear Unit):

ReLU Activation

\text{ReLU}(z) = \max(0, z)

Here,

$z$ = Input to activation function

\text{ReLU}'(z) = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{otherwise} \end{cases}

ℹ️ ReLU Properties

Range: $[0, \infty)$
A common default activation for hidden layers
Problem: "Dead neurons" (a neuron whose input stays negative outputs 0, receives zero gradient, and may never activate again)

Leaky ReLU:

Leaky ReLU

\text{LeakyReLU}(z) = \max(\alpha z, z)

Here,

$\alpha$ = Small constant (e.g., 0.01)

Softmax (for classification):

Softmax

\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}

Here,

$z_i$ = Logit for class $i$

ℹ️ Softmax

Converts logits to probabilities that sum to 1
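A small sketch of softmax in plain Python follows. Subtracting the largest logit before exponentiating is an addition not in the formula above; it leaves the result unchanged mathematically but avoids overflow for large logits. The example logits are arbitrary.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    m = max(logits)  # shift for numerical stability (assumption, see lead-in)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
big = softmax([1000.0, 1000.0])  # would overflow without the shift
print(probs, big)
```

Larger logits get larger probabilities, and two equal logits split the probability mass evenly.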

Backpropagation

The Chain Rule Applied Layer by Layer:

ℹ️ Backpropagation Formulas

For output layer $l=L$ :

\delta^{(L)} = \nabla_a L \odot \sigma'(z^{(L)})

For hidden layers $l = L-1, \ldots, 1$ :

\delta^{(l)} = (W^{(l+1)})^T\delta^{(l+1)} \odot \sigma'(z^{(l)})

Gradients:

\frac{\partial L}{\partial W^{(l)}} = \delta^{(l)} \times (a^{(l-1)})^T

\frac{\partial L}{\partial b^{(l)}} = \delta^{(l)}

💡 Why backpropagation is efficient

Computes gradients for ALL weights in one forward + backward pass
Time complexity: $O(n)$ per training example, where $n$ is the number of weights
This is what makes deep learning possible
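As a sanity check on the formulas above, the sketch below applies them to a single sigmoid layer and compares one analytic gradient entry with a finite-difference estimate. The squared-error loss $\frac{1}{2}\|a - y\|^2$, the function names, and the numbers are assumptions chosen for the illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(W, b, a_prev):
    z = [sum(w * x for w, x in zip(row, a_prev)) + b_i for row, b_i in zip(W, b)]
    return z, [sigmoid(z_i) for z_i in z]

def loss(W, b, a_prev, y):
    # Assumed loss: 0.5 * ||a - y||^2, so grad_a L = a - y.
    _, a = forward(W, b, a_prev)
    return 0.5 * sum((a_i - y_i) ** 2 for a_i, y_i in zip(a, y))

def backward(W, b, a_prev, y):
    _, a = forward(W, b, a_prev)
    # delta = grad_a L (elementwise) sigma'(z), using sigma'(z) = a * (1 - a)
    delta = [(a_i - y_i) * a_i * (1.0 - a_i) for a_i, y_i in zip(a, y)]
    dW = [[d * x for x in a_prev] for d in delta]  # delta times a_prev transposed
    db = list(delta)
    return dW, db

W = [[0.1, -0.3], [0.4, 0.2]]
b = [0.0, 0.1]
a_prev = [1.0, 2.0]
y = [1.0, 0.0]
dW, db = backward(W, b, a_prev, y)

# Central finite difference on the single entry W[0][1].
h = 1e-6
W_plus = [row[:] for row in W]
W_minus = [row[:] for row in W]
W_plus[0][1] += h
W_minus[0][1] -= h
numeric = (loss(W_plus, b, a_prev, y) - loss(W_minus, b, a_prev, y)) / (2 * h)
print(dW[0][1], numeric)
```

The finite-difference estimate needs two extra loss evaluations per weight, while the backward pass produces every entry of `dW` and `db` at once.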

Loss Functions

Regression Losses

Mean Squared Error (MSE):

MSE Loss

L = \frac{1}{n} \sum_i (y_i - \hat{y}_i)^2

Here,

$y_i$ = True value
$\hat{y}_i$ = Predicted value

ℹ️ MSE

Penalizes large errors more (squared)

Mean Absolute Error (MAE):

MAE Loss

L = \frac{1}{n} \sum_i |y_i - \hat{y}_i|

Here,

$y_i$ = True value
$\hat{y}_i$ = Predicted value

ℹ️ MAE

More robust to outliers

Huber Loss (combination):

Huber Loss

L = \begin{cases} \frac{1}{2}(y-\hat{y})^2 & \text{if } |y-\hat{y}| \leq \delta \\ \delta|y-\hat{y}| - \frac{1}{2}\delta^2 & \text{otherwise} \end{cases}

Here,

$\delta$ = Threshold parameter

ℹ️ Huber Loss

Robust to outliers, smooth around zero
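The piecewise definition above is easy to check numerically. The sketch below evaluates the Huber loss for one small and one large residual; the default $\delta = 1$ and the sample values are assumptions for illustration.

```python
def huber(y, y_hat, delta=1.0):
    """Huber loss for a single prediction."""
    r = abs(y - y_hat)
    if r <= delta:
        return 0.5 * r * r                   # quadratic near zero, like MSE
    return delta * r - 0.5 * delta * delta   # linear in the tails, like MAE

small = huber(0.0, 0.5)    # residual 0.5 is inside the threshold
large = huber(0.0, 10.0)   # residual 10 is outside the threshold
half_squared = 0.5 * 10.0 ** 2
print(small, large, half_squared)
```

For the residual of 10 the Huber loss is 9.5, while half the squared error would be 50, which is why a single outlier pulls on the fit much less.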

Classification Losses

Binary Cross-Entropy:

Binary Cross-Entropy

L = -[y \times \log(\hat{y}) + (1-y) \times \log(1-\hat{y})]

Here,

$y$ = True label (0 or 1)
$\hat{y}$ = Predicted probability

ℹ️ Binary Cross-Entropy

Same as negative log-likelihood for binary classification

Categorical Cross-Entropy:

Categorical Cross-Entropy

L = -\sum_i y_i \times \log(\hat{y}_i)

Here,

$y_i$ = True label (one-hot)
$\hat{y}_i$ = Predicted probability for class $i$

ℹ️ Categorical Cross-Entropy

For multi-class classification

Focal Loss:

Focal Loss

L = -\alpha_t(1-p_t)^\gamma \times \log(p_t)

Here,

$p_t$ = Predicted probability of the true class
$\alpha_t$ = Balancing factor
$\gamma$ = Focusing parameter

ℹ️ Focal Loss

Down-weights easy examples, focuses on hard ones
Used in object detection
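To see the down-weighting in numbers, the sketch below compares binary cross-entropy with focal loss on an easy and a hard positive example. It assumes the usual binary convention that $p_t = \hat{y}$ and $\alpha_t = \alpha$ when $y = 1$, and $p_t = 1 - \hat{y}$ and $\alpha_t = 1 - \alpha$ otherwise; the defaults $\alpha = 0.25$ and $\gamma = 2$ are illustrative choices, not values given in the text.

```python
import math

def binary_cross_entropy(y, y_hat):
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def focal_loss(y, y_hat, alpha=0.25, gamma=2.0):
    # p_t is the probability the model assigns to the true class.
    p_t = y_hat if y == 1 else 1.0 - y_hat
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

easy_ratio = focal_loss(1, 0.9) / binary_cross_entropy(1, 0.9)
hard_ratio = focal_loss(1, 0.1) / binary_cross_entropy(1, 0.1)
print(easy_ratio, hard_ratio)
```

The ratio is far smaller for the confident, correct prediction than for the badly wrong one. With $\gamma = 0$ and $\alpha_t = 1$ the focal loss reduces to plain cross-entropy.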

Regularization in Neural Networks

Dropout

ℹ️ Dropout

During training:

Each neuron is "dropped" (set to 0) with probability $p$
$a' = a \times \text{mask} / (1-p)$ ← Inverted dropout

During inference:

No dropout, all neurons active
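A minimal sketch of inverted dropout at training time, using only the standard library. The function name, the seeded generator, and the all-ones activations are assumptions made so the effect on the mean is easy to see.

```python
import random

def inverted_dropout(a, p, rng):
    """Zero each activation with probability p; rescale survivors by 1/(1-p)."""
    keep = 1.0 - p
    mask = [1.0 if rng.random() < keep else 0.0 for _ in a]
    return [a_i * m_i / keep for a_i, m_i in zip(a, mask)]

rng = random.Random(0)   # seeded so the run is repeatable
a = [1.0] * 10000
dropped = inverted_dropout(a, p=0.5, rng=rng)
mean_after = sum(dropped) / len(dropped)
print(mean_after)
```

The rescaling keeps the expected activation the same as without dropout, which is why no correction is needed at inference time.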

💡 Why dropout works

Prevents co-adaptation of neurons
Ensemble effect (averaging many sub-networks)
Behaves approximately like an adaptive form of L2 regularization (the equivalence is not exact)

Batch Normalization

ℹ️ Batch Normalization

For each mini-batch:

$\mu$ = mean of activations
$\sigma^2$ = variance of activations
$\hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}$ ← Normalize
$y = \gamma\hat{x} + \beta$ ← Scale and shift

Where $\gamma$ and $\beta$ are learned parameters
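The four steps above can be sketched for one feature across a mini-batch. This is a training-time forward pass only, under stated assumptions: the variance is the biased (divide by $n$) estimate, $\epsilon = 10^{-5}$, and $\gamma$, $\beta$ are passed in as plain numbers rather than learned.

```python
import math

def batch_norm(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a mini-batch, then scale and shift."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n   # biased variance (assumption)
    x_hat = [(x - mu) / math.sqrt(var + eps) for x in xs]
    return [gamma * xh + beta for xh in x_hat]

out = batch_norm([1.0, 2.0, 3.0, 4.0])
out_mean = sum(out) / len(out)
out_var = sum((o - out_mean) ** 2 for o in out) / len(out)
print(out_mean, out_var)
```

With $\gamma = 1$ and $\beta = 0$ the output has mean close to 0 and variance close to 1. Running statistics for inference are left out of this sketch.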

💡 Benefits of Batch Normalization

Faster training
Allows higher learning rates
Reduces sensitivity to initialization
Acts as a regularizer

Layer Normalization

ℹ️ Layer Normalization

Like Batch Norm, but normalizes across the features of each example rather than across the batch
Independent of batch size
Used in: Transformers, RNNs

Weight Decay (L2 Regularization)

Weight Decay

L_{\text{total}} = L_{\text{original}} + \frac{\lambda}{2} \times \sum \|W\|^2

Here,

$\lambda$ = Weight decay coefficient

ℹ️ Weight Decay Effect

Weights shrink toward zero

Attention Mechanism (The Math Behind Transformers)

Scaled Dot-Product Attention

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) \times V

Here,

$Q$ = Query matrix
$K$ = Key matrix
$V$ = Value matrix
$d_k$ = Dimension of keys

ℹ️ Step by step

Compute similarity: $QK^T$ (How relevant is each key to each query?)
Scale: $\frac{QK^T}{\sqrt{d_k}}$ (Prevent large values from saturating softmax)
Normalize: $\text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)$ (Convert to probabilities)
Aggregate: $\text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) \times V$ (Weighted sum of values based on attention weights)
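The four steps above can be sketched with nested lists standing in for matrices. This is an illustration under simple assumptions: a single head, no masking, no batching, and toy $Q$, $K$, $V$ values chosen by hand.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for lists of row vectors."""
    d_k = len(K[0])
    # Steps 1 and 2: similarity of every query with every key, scaled by sqrt(d_k).
    scores = [[sum(q_i * k_i for q_i, k_i in zip(q, k)) / math.sqrt(d_k)
               for k in K] for q in Q]
    # Step 3: each row of scores becomes a probability distribution.
    weights = [softmax(row) for row in scores]
    # Step 4: weighted sum of the value vectors.
    out = [[sum(w * v[j] for w, v in zip(w_row, V)) for j in range(len(V[0]))]
           for w_row in weights]
    return out, weights

Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
out, weights = attention(Q, K, V)
print(weights)
print(out)
```

Each query matches its own key best, so each output row leans toward the corresponding value row without ignoring the other one.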

Multi-Head Attention

\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)

Here,

$W_i^Q, W_i^K, W_i^V$ = Learned projection matrices for each head

\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) \times W^O

💡 Why multi-head?

Different heads can attend to different types of relationships
Head 1 might focus on syntax, head 2 on semantics, etc.

Position Encoding

PE(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d}}\right)

Here,

$pos$ = Position in sequence
$i$ = Dimension index
$d$ = Model dimension

PE(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/d}}\right)

💡 Why sinusoidal?

Each position gets a unique encoding
The model can learn relative positions
Smooth interpolation to unseen positions
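A short sketch of the two formulas above for a single position. It assumes the model dimension $d$ is even; the function name is illustrative.

```python
import math

def positional_encoding(pos, d):
    """Sinusoidal encoding for one position (d is assumed to be even)."""
    pe = [0.0] * d
    for i in range(d // 2):
        angle = pos / (10000 ** (2 * i / d))
        pe[2 * i] = math.sin(angle)       # even dimensions use sine
        pe[2 * i + 1] = math.cos(angle)   # odd dimensions use cosine
    return pe

pe0 = positional_encoding(0, 4)
pe1 = positional_encoding(1, 4)
print(pe0)
print(pe1)
```

Position 0 encodes to alternating 0 and 1, and higher dimension indices oscillate more slowly as the position grows.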

Generative Models

Variational Autoencoders (VAE)

ℹ️ VAE Components

Encoder: $q(z|x)$ — maps input to latent distribution
Decoder: $p(x|z)$ — reconstructs input from latent vector

VAE Loss

L = -\mathbb{E}_{q(z|x)}[\log p(x|z)] + KL(q(z|x) \| p(z))

Here,

$q(z|x)$ = Encoder distribution
$p(z)$ = Prior distribution

ℹ️ VAE Loss Components

Reconstruction: How well does the decoder reconstruct the input?
KL divergence: How close is the learned distribution to the prior?

Reparameterization trick:

Reparameterization Trick

z = \mu + \sigma \times \varepsilon, \quad \text{where } \varepsilon \sim N(0,1)

Here,

$\mu$ = Mean from encoder
$\sigma$ = Standard deviation from encoder
$\varepsilon$ = Random noise

ℹ️ Reparameterization Trick

This makes the sampling differentiable (can backpropagate through it)
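A scalar sketch of the trick: the randomness lives entirely in $\varepsilon$, so $z$ is a deterministic function of $\mu$ and $\sigma$ with $\partial z / \partial \mu = 1$ and $\partial z / \partial \sigma = \varepsilon$. The values of `mu` and `sigma` and the seeded generator are assumptions for the illustration.

```python
import random

def reparameterize(mu, sigma, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1); also return eps."""
    eps = rng.gauss(0.0, 1.0)   # noise is drawn independently of mu and sigma
    return mu + sigma * eps, eps

rng = random.Random(0)
mu, sigma = 2.0, 0.5
samples = [reparameterize(mu, sigma, rng)[0] for _ in range(20000)]
sample_mean = sum(samples) / len(samples)
sample_var = sum((s - sample_mean) ** 2 for s in samples) / len(samples)
print(sample_mean, sample_var)
```

The samples have mean near $\mu$ and variance near $\sigma^2$, as if drawn from $N(\mu, \sigma^2)$ directly, but gradients can now flow to $\mu$ and $\sigma$.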

GANs (Generative Adversarial Networks)

ℹ️ GAN Components

Generator: $G(z)$ — generates fake data from noise
Discriminator: $D(x)$ — classifies real vs fake

Minimax game:

GAN Objective

\min_G \max_D V(D,G) = \mathbb{E}[\log D(x)] + \mathbb{E}[\log(1 - D(G(z)))]

Here,

$D$ = Discriminator
$G$ = Generator

ℹ️ GAN Training

$D$ tries to maximize: correctly classify real and fake
$G$ tries to minimize: fool $D$ into thinking fake is real

Training:

Train D: Maximize $\log D(x) + \log(1-D(G(z)))$
Train G: Maximize $\log D(G(z))$ [or minimize $\log(1-D(G(z)))$ ]
Repeat

Diffusion Models

Forward process: Add noise gradually

Forward Process

q(x_t|x_{t-1}) = N(x_t; \sqrt{1-\beta_t}x_{t-1}, \beta_t I)

Here,

$\beta_t$ = Noise schedule

ℹ️ Forward Process

After $T$ steps: $x_T \approx N(0, I)$ (pure noise)
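The claim that $x_T$ is close to pure noise can be checked by tracking the moments of $x_t$ given $x_0$. Each step multiplies the mean by $\sqrt{1-\beta_t}$ and maps the variance $v$ to $(1-\beta_t)v + \beta_t$. The linear schedule from $10^{-4}$ to $0.02$ over $T = 1000$ steps is an assumption for this sketch, not something specified in the text.

```python
import math

def forward_moments(betas):
    """Return (scale of x_0 in the mean, per-coordinate variance) of x_T given x_0."""
    mean_scale, var = 1.0, 0.0
    for beta in betas:
        mean_scale *= math.sqrt(1.0 - beta)   # mean shrinks at every step
        var = (1.0 - beta) * var + beta       # variance moves toward 1
    return mean_scale, var

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]  # assumed schedule
mean_scale, var = forward_moments(betas)
print(mean_scale, var)
```

Under this schedule almost nothing of $x_0$ remains in the mean and the variance is close to 1, which matches $x_T \approx N(0, I)$.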

Reverse process: Learn to remove noise

Reverse Process

p_\theta(x_{t-1}|x_t) = N(x_{t-1}; \mu_\theta(x_t,t), \sigma_t^2 I)

Here,

$\theta$ = Learned parameters

ℹ️ Reverse Process

The mean $\mu_\theta(x_t, t)$ is produced by a neural network whose parameters $\theta$ are learned

Training objective:

Diffusion Training Objective

L = \mathbb{E}[\|\varepsilon - \varepsilon_\theta(x_t, t)\|^2]

Here,

$\varepsilon$ = Noise that was added
$\varepsilon_\theta$ = Predicted noise

ℹ️ Diffusion Training

The network is trained to predict the noise that was added; at sampling time that prediction is used to remove noise step by step

Reinforcement Learning Math

Markov Decision Process (MDP)

Definition: MDP

An MDP is defined as $(S, A, P, R, \gamma)$ :

$S$ = set of states
$A$ = set of actions
$P(s'|s,a)$ = transition probability
$R(s,a)$ = reward function
$\gamma$ = discount factor ( $0 \leq \gamma \leq 1$ )

Bellman Equation

V(s) = \max_a \left[R(s,a) + \gamma \times \sum_{s'} P(s'|s,a) \times V(s')\right]

Here,

$V(s)$ = Value of state $s$
$\gamma$ = Discount factor

ℹ️ Bellman Equation

The value of a state = the maximum over actions of the immediate reward plus the discounted expected value of the next state
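Repeatedly applying the Bellman equation as an update rule is known as value iteration. The sketch below runs it on a hypothetical two-state MDP with deterministic transitions; the states, actions, rewards, and $\gamma = 0.5$ are all made up for the example.

```python
def value_iteration(P, R, gamma, n_iter=100):
    """P[s][a] maps next states to probabilities; R[s][a] is the reward."""
    V = {s: 0.0 for s in P}
    for _ in range(n_iter):
        # Apply the right-hand side of the Bellman equation to every state.
        V = {s: max(R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())
                    for a in P[s])
             for s in P}
    return V

# Hypothetical MDP: two states, two actions, deterministic transitions.
P = {0: {"stay": {0: 1.0}, "go": {1: 1.0}},
     1: {"stay": {1: 1.0}, "go": {0: 1.0}}}
R = {0: {"stay": 0.0, "go": 1.0},
     1: {"stay": 2.0, "go": 0.0}}
V = value_iteration(P, R, gamma=0.5)
print(V)
```

Staying in state 1 forever is worth $2 / (1 - 0.5) = 4$, and from state 0 the best move is to go there, giving $1 + 0.5 \times 4 = 3$.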

Q-Learning

Q-Learning Update

Q(s,a) \leftarrow Q(s,a) + \alpha \times [r + \gamma \times \max_{a'} Q(s',a') - Q(s,a)]

Here,

$\alpha$ = Learning rate
$r$ = Reward received
$\gamma$ = Discount factor
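One application of the update rule above, using a nested dictionary as the Q-table. The state names, action names, and numbers are hypothetical.

```python
def q_update(Q, s, a, r, s_next, alpha, gamma):
    """Apply one Q-learning update in place and return the new Q(s, a)."""
    td_target = r + gamma * max(Q[s_next].values())   # best value in the next state
    Q[s][a] += alpha * (td_target - Q[s][a])
    return Q[s][a]

Q = {"s0": {"left": 0.0, "right": 0.0},
     "s1": {"left": 1.0, "right": 3.0}}
new_q = q_update(Q, "s0", "right", r=1.0, s_next="s1", alpha=0.5, gamma=0.9)
print(new_q)
```

Here the target is $1 + 0.9 \times 3 = 3.7$, and with $\alpha = 0.5$ the estimate moves halfway from 0 toward it.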

Policy Gradient

J(\theta) = \mathbb{E}\left[\sum_t \gamma^t \times r_t\right]

Here,

$\theta$ = Policy parameters

\nabla_\theta J(\theta) = \mathbb{E}\left[\nabla_\theta \log \pi_\theta(a|s) \times G_t\right]

ℹ️ Policy Gradient Components

$\pi_\theta$ = policy parameterized by $\theta$
$G_t$ = return from time $t$

REINFORCE Algorithm:

ℹ️ REINFORCE Algorithm

Sample trajectory using current policy
Compute returns $G_t$ for each step
Update: $\theta \leftarrow \theta + \alpha \times \nabla_\theta \log \pi_\theta(a_t|s_t) \times G_t$
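The second step, computing the returns, can be done in one backward pass over the rewards. The sketch assumes the common convention $G_t = \sum_{k \geq t} \gamma^{k-t} r_k$, which the text does not spell out; the rewards and $\gamma$ are example values.

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, G_1, ...] using the recursion G_t = r_t + gamma * G_{t+1}."""
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return out[::-1]

returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
print(returns)
```

Each $G_t$ then weights the corresponding $\nabla_\theta \log \pi_\theta(a_t|s_t)$ term in the update.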

Proximal Policy Optimization (PPO)

PPO Objective

L^{CLIP} = \mathbb{E}\left[\min(r_t(\theta)A_t, \text{clip}(r_t(\theta), 1-\varepsilon, 1+\varepsilon)A_t)\right]

Here,

$r_t(\theta)$ = Probability ratio
$A_t$ = Advantage estimate
$\varepsilon$ = Clip parameter

ℹ️ PPO

$r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$ ← probability ratio
$A_t$ = advantage estimate
clip prevents large policy updates
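The term inside the expectation can be evaluated for a single sample to see when clipping takes effect. The default $\varepsilon = 0.2$ and the example ratios are assumptions for this sketch.

```python
def ppo_clip_term(ratio, advantage, eps=0.2):
    """Clipped surrogate for one (state, action) sample."""
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped * advantage)

gain_capped = ppo_clip_term(1.5, 1.0)     # ratio above 1 + eps, positive advantage
penalty_kept = ppo_clip_term(0.5, -1.0)   # ratio below 1 - eps, negative advantage
unchanged = ppo_clip_term(1.0, 2.0)       # ratio inside the clip range
print(gain_capped, penalty_kept, unchanged)
```

Once the ratio leaves $[1-\varepsilon, 1+\varepsilon]$ in the direction the advantage favors, the objective stops improving, so there is no incentive to push the policy further in one update.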

Kernel Methods

Kernel Trick

Definition: Kernel Trick

Compute similarities in high-dimensional space without actually going there.

Kernel Function

K(x_1, x_2) = \phi(x_1)^T\phi(x_2)

Here,

$\phi$ = Mapping to high-dimensional space

ℹ️ Common Kernels

Linear: $K(x_1,x_2) = x_1^Tx_2$
RBF: $K(x_1,x_2) = \exp(-\gamma\|x_1-x_2\|^2)$
Polynomial: $K(x_1,x_2) = (x_1^Tx_2 + c)^d$
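The identity $K(x_1, x_2) = \phi(x_1)^T\phi(x_2)$ can be verified directly for a small case. The sketch uses the polynomial kernel with $c = 0$ and $d = 2$ on 2-D inputs, for which an explicit feature map is $\phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$, where $x_1, x_2$ here denote the two coordinates of one input. The input vectors are arbitrary.

```python
import math

def poly_kernel(x, y, c=0.0, d=2):
    return (sum(x_i * y_i for x_i, y_i in zip(x, y)) + c) ** d

def phi(x):
    """Explicit feature map for the 2-D polynomial kernel with c=0, d=2."""
    a, b = x
    return [a * a, math.sqrt(2.0) * a * b, b * b]

x, y = [1.0, 2.0], [3.0, 4.0]
k_implicit = poly_kernel(x, y)                                # stays in 2-D
k_explicit = sum(p * q for p, q in zip(phi(x), phi(y)))       # goes through 3-D
print(k_implicit, k_explicit)
```

Both routes give the same number, but the kernel never builds the higher-dimensional vectors. For the RBF kernel the implicit feature space is infinite-dimensional, so only the kernel route is practical.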

SVM Optimization:

SVM Primal

\min \frac{1}{2}\|w\|^2 + C \times \sum_i \xi_i

Here,

$w$ = Weight vector
$C$ = Regularization parameter
$\xi_i$ = Slack variables

Subject to: $y_i(w^Tx_i + b) \geq 1 - \xi_i$ and $\xi_i \geq 0$

Dual form: Uses kernel trick

\alpha^* = \arg\max_\alpha \sum_i \alpha_i - \frac{1}{2}\sum_i\sum_j \alpha_i\alpha_jy_iy_jK(x_i,x_j)

Subject to: $0 \leq \alpha_i \leq C$ and $\sum_i \alpha_i y_i = 0$

Dimensionality Reduction

PCA (Principal Component Analysis)

ℹ️ PCA Steps

Center the data: $X = X - \text{mean}(X)$
Compute covariance: $C = X^TX / (n-1)$
Eigendecompose: $C = V\Lambda V^T$
Top $k$ eigenvectors: $V_k$
Project: $Z = XV_k$

Keeps maximum variance with fewest dimensions

Choosing k (number of components):

ℹ️ Choosing k

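Explained variance ratio of component $j$: $\frac{\lambda_j}{\sum_i \lambda_i}$

Choose the smallest $k$ such that the cumulative ratio $\frac{\sum_{j \leq k} \lambda_j}{\sum_i \lambda_i} \geq 95\%$ (with eigenvalues sorted in decreasing order)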
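The rule above can be sketched as a small helper that takes the eigenvalues of the covariance matrix. The function name, the example eigenvalues, and treating 95% as a default threshold are assumptions for illustration.

```python
def choose_k(eigenvalues, threshold=0.95):
    """Smallest k whose top-k eigenvalues explain at least `threshold` of the variance."""
    lams = sorted(eigenvalues, reverse=True)
    total = sum(lams)
    cumulative = 0.0
    for k, lam in enumerate(lams, start=1):
        cumulative += lam / total
        if cumulative >= threshold:
            return k
    return len(lams)

k = choose_k([6.0, 3.0, 0.7, 0.2, 0.1])
print(k)
```

With these eigenvalues the cumulative ratios are roughly 0.60, 0.90, and 0.97, so three components are kept.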

t-SNE

ℹ️ t-SNE

Preserves local structure (nearby points stay nearby)

Probability in high dim: $p_{j|i} = \frac{\exp(-\|x_i-x_j\|^2/2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|x_i-x_k\|^2/2\sigma_i^2)}$

Probability in low dim: $q_{ij} = \frac{(1+\|y_i-y_j\|^2)^{-1}}{\sum_{k \neq l} (1+\|y_k-y_l\|^2)^{-1}}$

Minimize $KL(p \| q)$

UMAP

ℹ️ UMAP

Similar to t-SNE but:

Faster
Better preserves global structure
Based on topological data analysis

Uses fuzzy simplicial sets and stochastic gradient descent

📋 Key Takeaways

Neural networks compute $z^{(l)} = W^{(l)}a^{(l-1)} + b^{(l)}$ then apply activation $\sigma(z^{(l)})$ . ReLU $\text{ReLU}(z) = \max(0, z)$ is the default activation; Softmax $\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}$ converts logits to class probabilities.
Backpropagation is the chain rule applied layer by layer. $\frac{\partial L}{\partial W^{(l)}} = \delta^{(l)} (a^{(l-1)})^T$ computes gradients for all weights in one forward + backward pass in $O(n)$ time — this is what makes deep learning possible.
Cross-entropy loss $L = -\sum_i y_i \log(\hat{y}_i)$ is equivalent to maximum likelihood estimation (MLE). For classification, minimizing cross-entropy maximizes the likelihood of the correct class, and it heavily penalizes confident wrong predictions.
Attention is $\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$ . The scaling factor $\sqrt{d_k}$ prevents softmax saturation; multi-head attention lets different heads learn different relationship types.
VAEs and GANs generate data differently. VAEs minimize reconstruction loss + $KL(q(z|x) \| p(z))$ ; GANs play a minimax game $\min_G \max_D \mathbb{E}[\log D(x)] + \mathbb{E}[\log(1 - D(G(z)))]$ where generator and discriminator compete.
RL learns via the Bellman equation $V(s) = \max_a [R(s,a) + \gamma \sum_{s'} P(s'|s,a) V(s')]$ . Policy gradient methods optimize $\nabla_\theta J(\theta) = \mathbb{E}[\nabla_\theta \log \pi_\theta(a|s) \times G_t]$ directly — PPO clips updates for stability.