Mathematics for Data Science & AI

Mathematical Concepts Used in AI

Master the mathematical concepts behind modern AI: neural networks, transformers, GANs, reinforcement learning, and more.


ℹ️ Why It Matters

This chapter connects all the math to actual AI/ML applications. These are the concepts that power modern AI systems.


Neural Network Math

Forward Propagation

ℹ️ Forward Propagation

For layer $l$:

$$z^{(l)} = W^{(l)}a^{(l-1)} + b^{(l)} \quad \leftarrow \text{Linear transformation}$$
$$a^{(l)} = \sigma(z^{(l)}) \quad \leftarrow \text{Activation function}$$

Where:

  • $W^{(l)}$ = weight matrix
  • $b^{(l)}$ = bias vector
  • $a^{(l)}$ = activation (output) of layer $l$
  • $\sigma$ = activation function
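The two equations above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the layer widths, the random weights, and the choice of sigmoid for every layer are assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Return the activations a^(0), ..., a^(L) for a column-vector input x."""
    activations = [x]
    for W, b in zip(weights, biases):
        z = W @ activations[-1] + b      # linear transformation
        activations.append(sigmoid(z))   # activation function
    return activations

rng = np.random.default_rng(0)
sizes = [3, 4, 2]  # hypothetical layer widths: 3 inputs, 4 hidden units, 2 outputs
weights = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros((m, 1)) for m in sizes[1:]]

acts = forward(rng.normal(size=(3, 1)), weights, biases)
print([a.shape for a in acts])
```

Each weight matrix has shape (units in layer $l$, units in layer $l-1$), which is what makes the product $W^{(l)}a^{(l-1)}$ well defined.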

Activation Functions

Sigmoid:

Sigmoid Activation

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Here,

  • $z$ = input to the activation function

$$\sigma'(z) = \sigma(z) \times (1 - \sigma(z))$$

ℹ️ Sigmoid Properties

  • Range: $(0, 1)$
  • Problem: vanishing gradients, since $\sigma'(z) \leq 0.25$ and approaches 0 as $|z|$ grows

Tanh:

Tanh Activation

$$\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$$

Here,

  • $z$ = input to the activation function

$$\tanh'(z) = 1 - \tanh^2(z)$$

ℹ️ Tanh Properties

  • Range: $(-1, 1)$
  • Zero-centered output, which often makes optimization easier than with sigmoid

ReLU (Rectified Linear Unit):

ReLU Activation

$$\text{ReLU}(z) = \max(0, z)$$

Here,

  • $z$ = input to the activation function

$$\text{ReLU}'(z) = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{otherwise} \end{cases}$$

ℹ️ ReLU Properties

  • Range: $[0, \infty)$
  • A common default activation for hidden layers
  • Problem: "dead neurons", units whose input stays negative, so they output 0 and receive zero gradient

Leaky ReLU:

Leaky ReLU

$$\text{LeakyReLU}(z) = \max(\alpha z, z)$$

Here,

  • $\alpha$ = small constant with $0 < \alpha < 1$ (e.g., 0.01)

Softmax (for classification):

Softmax

$$\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}$$

Here,

  • $z_i$ = logit for class $i$

ℹ️ Softmax

Converts logits to probabilities that sum to 1
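As a small sketch, softmax can be computed as below. Subtracting the largest logit before exponentiating is a common numerical-stability step that is not part of the formula above; it leaves the result unchanged because the common factor cancels in the ratio. The example logits are made up.

```python
import numpy as np

def softmax(z):
    shifted = z - np.max(z)   # avoids overflow in exp; the ratio is unchanged
    exp = np.exp(shifted)
    return exp / exp.sum()

logits = np.array([2.0, 1.0, 0.1])   # hypothetical logits for three classes
probs = softmax(logits)
print(probs, probs.sum())
```

The largest logit always receives the largest probability, and adding the same constant to every logit does not change the output.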

Backpropagation

The Chain Rule Applied Layer by Layer:

ℹ️ Backpropagation Formulas

For the output layer $l = L$:

$$\delta^{(L)} = \nabla_a L \odot \sigma'(z^{(L)})$$

For hidden layers $l = L-1, \ldots, 1$:

$$\delta^{(l)} = (W^{(l+1)})^T\delta^{(l+1)} \odot \sigma'(z^{(l)})$$

Gradients:

$$\frac{\partial L}{\partial W^{(l)}} = \delta^{(l)} \times (a^{(l-1)})^T$$
$$\frac{\partial L}{\partial b^{(l)}} = \delta^{(l)}$$

💡 Why backpropagation is efficient

  • Computes gradients for ALL weights in one forward + backward pass
  • Time complexity: $O(n)$ where $n$ is the number of weights
  • This is what makes deep learning possible
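The backpropagation formulas can be checked numerically on a tiny network. The sketch below assumes two sigmoid layers and a squared-error loss $\frac{1}{2}\|a^{(2)} - y\|^2$ (so $\nabla_a L = a^{(2)} - y$); the sizes and random values are illustrative. It compares one analytic gradient entry with a finite-difference estimate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_grads(W1, b1, W2, b2, x, y):
    # Forward pass
    a1 = sigmoid(W1 @ x + b1)
    a2 = sigmoid(W2 @ a1 + b2)
    loss = 0.5 * np.sum((a2 - y) ** 2)
    # Backward pass: sigma'(z) = a * (1 - a) for the sigmoid
    delta2 = (a2 - y) * a2 * (1 - a2)            # output-layer delta
    delta1 = (W2.T @ delta2) * a1 * (1 - a1)     # hidden-layer delta
    grads = {"W1": delta1 @ x.T, "b1": delta1,
             "W2": delta2 @ a1.T, "b2": delta2}
    return loss, grads

rng = np.random.default_rng(0)
x, y = rng.normal(size=(3, 1)), np.array([[0.0], [1.0]])
W1, b1 = rng.normal(size=(4, 3)), np.zeros((4, 1))
W2, b2 = rng.normal(size=(2, 4)), np.zeros((2, 1))
loss, grads = loss_and_grads(W1, b1, W2, b2, x, y)

# Central finite difference on a single weight, W1[0, 0]
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 0] += eps
W1_minus[0, 0] -= eps
numeric = (loss_and_grads(W1_plus, b1, W2, b2, x, y)[0]
           - loss_and_grads(W1_minus, b1, W2, b2, x, y)[0]) / (2 * eps)
print(numeric, grads["W1"][0, 0])
```

The finite-difference check needs two extra forward passes per weight, whereas the backward pass produces every gradient at once. That difference is the efficiency argument made above.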

Loss Functions

Regression Losses

Mean Squared Error (MSE):

MSE Loss

$$L = \frac{1}{n} \sum_i (y_i - \hat{y}_i)^2$$

Here,

  • $y_i$ = true value
  • $\hat{y}_i$ = predicted value

ℹ️ MSE

Penalizes large errors more (squared)

Mean Absolute Error (MAE):

MAE Loss

$$L = \frac{1}{n} \sum_i |y_i - \hat{y}_i|$$

Here,

  • $y_i$ = true value
  • $\hat{y}_i$ = predicted value

ℹ️ MAE

More robust to outliers

Huber Loss (combination):

Huber Loss

$$L = \begin{cases} \frac{1}{2}(y-\hat{y})^2 & \text{if } |y-\hat{y}| \leq \delta \\ \delta|y-\hat{y}| - \frac{1}{2}\delta^2 & \text{otherwise} \end{cases}$$

Here,

  • $\delta$ = threshold parameter

ℹ️ Huber Loss

Quadratic for small errors (smooth around zero) and linear for large errors (robust to outliers)
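To see how the three regression losses react to an outlier, here is a short sketch. The data are made up, and averaging the per-sample Huber loss over the batch is an assumption of this example (the formula above is stated per sample).

```python
import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))

def huber(y, y_hat, delta=1.0):
    err = np.abs(y - y_hat)
    quadratic = 0.5 * err ** 2                  # used where err <= delta
    linear = delta * err - 0.5 * delta ** 2     # used where err > delta
    return np.mean(np.where(err <= delta, quadratic, linear))

y = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.5, 2.0, 2.5, 14.0])   # the last prediction is an outlier

print(mse(y, y_hat), mae(y, y_hat), huber(y, y_hat))
# prints 25.125 2.75 2.4375
```

The single error of 10 dominates MSE because it is squared, while MAE and Huber grow only linearly with it.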

Classification Losses

Binary Cross-Entropy:

Binary Cross-Entropy

$$L = -[y \times \log(\hat{y}) + (1-y) \times \log(1-\hat{y})]$$

Here,

  • $y$ = true label (0 or 1)
  • $\hat{y}$ = predicted probability

ℹ️ Binary Cross-Entropy

Same as negative log-likelihood for binary classification

Categorical Cross-Entropy:

Categorical Cross-Entropy

$$L = -\sum_i y_i \times \log(\hat{y}_i)$$

Here,

  • $y_i$ = true label (one-hot)
  • $\hat{y}_i$ = predicted probability for class $i$

ℹ️ Categorical Cross-Entropy

For multi-class classification

Focal Loss:

Focal Loss

$$L = -\alpha_t(1-p_t)^\gamma \times \log(p_t)$$

Here,

  • $p_t$ = predicted probability of the true class
  • $\alpha_t$ = balancing factor
  • $\gamma$ = focusing parameter

ℹ️ Focal Loss

  • Down-weights easy examples, focuses on hard ones
  • Used in object detection
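The relationship between binary cross-entropy and focal loss can be sketched as follows. The clipping constant `eps` and the default values `alpha=0.25`, `gamma=2.0` are illustrative choices for this example, not values given in the lesson.

```python
import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)               # guard against log(0)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def focal_loss(y, p, alpha=0.25, gamma=2.0, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)
    p_t = np.where(y == 1, p, 1 - p)           # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)

y = np.array([1, 1])
p = np.array([0.9, 0.1])   # an easy example and a hard example

bce = binary_cross_entropy(y, p)
focal = focal_loss(y, p)
print(focal / bce)   # the easy example is scaled down far more than the hard one
```

With `gamma=0` the modulating factor $(1-p_t)^\gamma$ equals 1, so focal loss reduces to cross-entropy weighted by $\alpha_t$.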

Regularization in Neural Networks

Dropout

ℹ️ Dropout

During training:

  • Each neuron is "dropped" (set to 0) with probability $p$
  • $a' = a \times \text{mask} / (1-p)$ ← Inverted dropout

During inference:

  • No dropout, all neurons active
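A minimal sketch of inverted dropout, assuming a NumPy random generator and an all-ones activation vector chosen purely for illustration:

```python
import numpy as np

def dropout(a, p, rng, training=True):
    if not training:
        return a                       # inference: all neurons active
    mask = rng.random(a.shape) >= p    # each unit is kept with probability 1 - p
    return a * mask / (1.0 - p)        # rescale so the expected value is unchanged

rng = np.random.default_rng(0)
a = np.ones(100_000)
dropped = dropout(a, p=0.5, rng=rng)
print(dropped.mean())   # close to 1.0
```

Dividing by $1-p$ during training keeps the expected activation the same in both modes, which is why no rescaling is needed at inference time.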

💡 Why dropout works

  • Prevents co-adaptation of neurons
  • Ensemble effect (averaging many sub-networks)
  • Behaves like an adaptive form of L2 regularization in some settings (for example, linear models)

Batch Normalization

ℹ️ Batch Normalization

For each mini-batch:

  1. $\mu$ = mean of the activations over the mini-batch
  2. $\sigma^2$ = variance of the activations over the mini-batch
  3. $\hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}$ ← Normalize
  4. $y = \gamma\hat{x} + \beta$ ← Scale and shift

Where $\gamma$ and $\beta$ are learned parameters
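The four steps map directly onto array operations. This sketch covers the training-time computation only and assumes a (batch, features) input; the running statistics that frameworks keep for inference are left out.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)                       # 1. per-feature mean over the batch
    var = x.var(axis=0)                       # 2. per-feature variance over the batch
    x_hat = (x - mu) / np.sqrt(var + eps)     # 3. normalize
    return gamma * x_hat + beta               # 4. scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 4))   # hypothetical mini-batch
y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0), y.std(axis=0))
```

Computing the statistics along `axis=1` (with `keepdims=True`) instead would normalize each sample across its features, which is the idea behind layer normalization.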

💡 Benefits of Batch Normalization

  • Faster training
  • Allows higher learning rates
  • Reduces sensitivity to initialization
  • Acts as a regularizer

Layer Normalization

ℹ️ Layer Normalization

Like Batch Norm, but the mean and variance are computed across the features of each individual sample rather than across the batch. The result is therefore independent of batch size. Used in: Transformers, RNNs

Weight Decay (L2 Regularization)

Weight Decay

$$L_{\text{total}} = L_{\text{original}} + \frac{\lambda}{2} \times \sum \|W\|^2$$

Here,

  • $\lambda$ = weight decay coefficient

ℹ️ Weight Decay Effect

The penalty adds $\lambda W$ to each weight gradient, so every update shrinks the weights toward zero


Attention Mechanism (The Math Behind Transformers)

Scaled Dot-Product Attention

Scaled Dot-Product Attention

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) \times V$$

Here,

  • $Q$ = query matrix
  • $K$ = key matrix
  • $V$ = value matrix
  • $d_k$ = dimension of the keys

ℹ️ Step by step

  1. Compute similarity: $QK^T$ (How relevant is each key to each query?)

  2. Scale: $\frac{QK^T}{\sqrt{d_k}}$ (Prevent large values from saturating softmax)

  3. Normalize: $\text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)$ (Convert each row to attention weights that sum to 1)

  4. Aggregate: $\text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) \times V$ (Weighted sum of values based on attention weights)
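The four steps can be sketched for a single head without batching or masking. The shapes below (2 queries, 5 keys, $d_k = 8$, value dimension 16) are arbitrary choices for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # steps 1 and 2: similarity, then scale
    weights = softmax(scores, axis=-1)     # step 3: one distribution per query
    return weights @ V, weights            # step 4: weighted sum of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 8))
K = rng.normal(size=(5, 8))
V = rng.normal(size=(5, 16))
out, weights = attention(Q, K, V)
print(out.shape, weights.sum(axis=1))
```

Each row of `weights` belongs to one query and sums to 1, so each output row is a convex combination of the rows of `V`.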

Multi-Head Attention

Multi-Head Attention

$$\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$

Here,

  • $W_i^Q, W_i^K, W_i^V$ = learned projection matrices for each head

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) \times W^O$$

💡 Why multi-head?

  • Different heads can attend to different types of relationships
  • Head 1 might focus on syntax, head 2 on semantics, etc.

Position Encoding

Position Encoding

$$PE(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d}}\right)$$

Here,

  • $pos$ = position in the sequence
  • $i$ = dimension index
  • $d$ = model dimension

$$PE(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/d}}\right)$$

💡 Why sinusoidal?

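  • Each position gets a unique encoding
  • For a fixed offset $k$, $PE(pos+k)$ is a linear function of $PE(pos)$, which helps the model learn relative positions
  • Defined for every position, so it may extrapolate to sequences longer than those seen in training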
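A short sketch of the sinusoidal encoding, assuming an even model dimension `d`; the sizes used below are arbitrary.

```python
import numpy as np

def positional_encoding(num_positions, d):
    pos = np.arange(num_positions)[:, None]     # shape (num_positions, 1)
    i = np.arange(d // 2)[None, :]              # shape (1, d / 2)
    angles = pos / (10000 ** (2 * i / d))
    pe = np.zeros((num_positions, d))
    pe[:, 0::2] = np.sin(angles)                # even dimensions 2i
    pe[:, 1::2] = np.cos(angles)                # odd dimensions 2i + 1
    return pe

pe = positional_encoding(num_positions=50, d=16)
print(pe.shape)
print(pe[0])   # position 0: sin terms are 0, cos terms are 1
```

Low dimension indices oscillate quickly with position and high ones slowly, which is how each position ends up with a distinct vector.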

Generative Models

Variational Autoencoders (VAE)

ℹ️ VAE Components

  • Encoder: $q(z|x)$, maps the input to a latent distribution
  • Decoder: $p(x|z)$, reconstructs the input from a latent vector

VAE Loss

$$L = -\mathbb{E}_{q(z|x)}[\log p(x|z)] + KL(q(z|x) \| p(z))$$

This is the negative of the evidence lower bound (ELBO), so minimizing $L$ maximizes the ELBO.

Here,

  • $q(z|x)$ = encoder distribution
  • $p(z)$ = prior distribution

ℹ️ VAE Loss Components

  • Reconstruction: How well does the decoder reconstruct the input?
  • KL divergence: How close is the learned distribution to the prior?

Reparameterization trick:

Reparameterization Trick

$$z = \mu + \sigma \times \varepsilon, \quad \text{where } \varepsilon \sim N(0,1)$$

Here,

  • $\mu$ = mean from the encoder
  • $\sigma$ = standard deviation from the encoder
  • $\varepsilon$ = random noise

ℹ️ Reparameterization Trick

This moves the randomness into $\varepsilon$, so $z$ is a differentiable function of $\mu$ and $\sigma$ and gradients can be backpropagated through the sampling step
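The trick can be illustrated with scalar values. The numbers for `mu` and `sigma` are hypothetical encoder outputs. The closed-form KL term for a Gaussian against a standard normal prior is a standard result added here for context; it does not appear in the lesson.

```python
import numpy as np

def kl_to_standard_normal(mu, sigma):
    # KL( N(mu, sigma^2) || N(0, 1) ) for scalar mu and sigma
    return 0.5 * (mu ** 2 + sigma ** 2 - 1.0 - np.log(sigma ** 2))

rng = np.random.default_rng(0)
mu, sigma = 2.0, 0.5                   # hypothetical encoder outputs
eps = rng.standard_normal(100_000)     # noise that does not depend on mu or sigma
z = mu + sigma * eps                   # dz/dmu = 1, dz/dsigma = eps

print(z.mean(), z.std())
print(kl_to_standard_normal(mu, sigma))
```

The samples `z` follow $N(\mu, \sigma^2)$, yet every sample is a plain arithmetic function of `mu` and `sigma`, which is what lets gradients flow to the encoder.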

GANs (Generative Adversarial Networks)

ℹ️ GAN Components

  • Generator: $G(z)$, generates fake data from noise
  • Discriminator: $D(x)$, classifies real vs fake

Minimax game:

GAN Objective

$$\min_G \max_D V(D,G) = \mathbb{E}[\log D(x)] + \mathbb{E}[\log(1 - D(G(z)))]$$

Here,

  • $D$ = discriminator
  • $G$ = generator

ℹ️ GAN Training

  • $D$ tries to maximize: correctly classify real and fake
  • $G$ tries to minimize: fool $D$ into thinking fake is real

Training:

  1. Train D: Maximize $\log D(x) + \log(1-D(G(z)))$
  2. Train G: Minimize $\log(1-D(G(z)))$, or in practice maximize $\log D(G(z))$ (the non-saturating variant, which gives stronger gradients early in training)
  3. Repeat

Diffusion Models

Forward process: Add noise gradually

Forward Process

$$q(x_t|x_{t-1}) = N(x_t; \sqrt{1-\beta_t}x_{t-1}, \beta_t I)$$

Here,

  • $\beta_t$ = noise schedule

ℹ️ Forward Process

After $T$ steps: $x_T \approx N(0, I)$ (pure noise)

Reverse process: Learn to remove noise

Reverse Process

$$p_\theta(x_{t-1}|x_t) = N(x_{t-1}; \mu_\theta(x_t,t), \sigma_t^2 I)$$

Here,

  • $\theta$ = learned parameters

ℹ️ Reverse Process

The mean $\mu_\theta$ is produced by a neural network with parameters $\theta$

Training objective:

Diffusion Training Objective

$$L = \mathbb{E}[\|\varepsilon - \varepsilon_\theta(x_t, t)\|^2]$$

Here,

  • $\varepsilon$ = noise that was added
  • $\varepsilon_\theta$ = predicted noise

ℹ️ Diffusion Training

The network is trained to predict the noise that was added; at sampling time that prediction is used to remove noise step by step


Reinforcement Learning Math

Markov Decision Process (MDP)

Definition: MDP

An MDP is defined as $(S, A, P, R, \gamma)$:

  • $S$ = set of states
  • $A$ = set of actions
  • $P(s'|s,a)$ = transition probability
  • $R(s,a)$ = reward function
  • $\gamma$ = discount factor ($0 \leq \gamma \leq 1$)

Bellman Equation

Bellman Equation

$$V(s) = \max_a \left[R(s,a) + \gamma \times \sum_{s'} P(s'|s,a) \times V(s')\right]$$

Here,

  • $V(s)$ = value of state $s$
  • $\gamma$ = discount factor

ℹ️ Bellman Equation

The value of a state = the best achievable sum, over actions, of the immediate reward plus the discounted expected value of the next state
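Applying the Bellman equation repeatedly as an update rule is known as value iteration. The sketch below runs it on a made-up MDP with two states and two actions; the transition table `P`, the rewards `R`, and the discount factor are all assumptions for the example.

```python
import numpy as np

# Hypothetical MDP: P[a, s, s'] = transition probability, R[s, a] = reward.
P = np.array([
    [[1.0, 0.0], [0.0, 1.0]],   # action 0: stay in the current state
    [[0.0, 1.0], [1.0, 0.0]],   # action 1: switch to the other state
])
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])
gamma = 0.9

V = np.zeros(2)
for _ in range(500):
    # Q[s, a] = R(s, a) + gamma * sum_s' P(s'|s, a) * V(s')
    Q = R + gamma * np.einsum("ast,t->sa", P, V)
    V = Q.max(axis=1)           # Bellman update: best action in each state

print(V)   # approximately [19. 20.]
```

In state 1 the best policy is to stay and collect a reward of 2 forever, worth $2 / (1 - 0.9) = 20$. From state 0 it is best to switch, worth $1 + 0.9 \times 20 = 19$.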

Q-Learning

Q-Learning Update

$$Q(s,a) \leftarrow Q(s,a) + \alpha \times [r + \gamma \times \max_{a'} Q(s',a') - Q(s,a)]$$

Here,

  • $\alpha$ = learning rate
  • $r$ = reward received
  • $\gamma$ = discount factor

Policy Gradient

Policy Gradient

$$J(\theta) = \mathbb{E}\left[\sum_t \gamma^t \times r_t\right]$$

Here,

  • $\theta$ = policy parameters

$$\nabla_\theta J(\theta) = \mathbb{E}\left[\nabla_\theta \log \pi_\theta(a|s) \times G_t\right]$$

ℹ️ Policy Gradient Components

  • $\pi_\theta$ = policy parameterized by $\theta$
  • $G_t$ = return from time $t$

REINFORCE Algorithm:

ℹ️ REINFORCE Algorithm

  1. Sample trajectory using current policy
  2. Compute returns $G_t$ for each step
  3. Update: $\theta \leftarrow \theta + \alpha \times \nabla_\theta \log \pi_\theta(a_t|s_t) \times G_t$
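Step 2 is usually done with one backward pass over the rewards, assuming the standard definition $G_t = \sum_{k \geq t} \gamma^{k-t} r_k$. The reward sequence below is made up.

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, G_1, ...] using the recursion G_t = r_t + gamma * G_{t+1}."""
    returns = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return returns[::-1]

returns = discounted_returns([1.0, 0.0, 2.0], gamma=0.5)
print(returns)   # [1.5, 1.0, 2.0]
```

Working backward avoids recomputing the discounted sum from scratch at every time step.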

Proximal Policy Optimization (PPO)

PPO Objective

$$L^{CLIP} = \mathbb{E}\left[\min(r_t(\theta)A_t, \text{clip}(r_t(\theta), 1-\varepsilon, 1+\varepsilon)A_t)\right]$$

Here,

  • $r_t(\theta)$ = probability ratio
  • $A_t$ = advantage estimate
  • $\varepsilon$ = clip parameter

ℹ️ PPO

  • $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$ ← probability ratio
  • $A_t$ = advantage estimate
  • clip prevents large policy updates
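The clipped objective is easy to evaluate once the ratios and advantages are known. The values below are invented to show both clipping directions, and `eps=0.2` is an illustrative choice of clip parameter.

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantage
    return np.minimum(unclipped, clipped).mean()   # pessimistic of the two

ratio = np.array([1.5, 0.5])         # hypothetical probability ratios
advantage = np.array([1.0, -1.0])    # hypothetical advantage estimates
objective = ppo_clip_objective(ratio, advantage)
print(objective)   # approximately 0.2
```

For the first sample the ratio is capped at 1.2, so pushing it higher earns nothing extra. For the second, the clipped term (-0.8) is lower than the unclipped one (-0.5), and the minimum keeps the lower value.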

Kernel Methods

Kernel Trick

Definition: Kernel Trick

Compute inner products in a high-dimensional feature space without ever computing the mapping $\phi$ explicitly.

Kernel Function

$$K(x_1, x_2) = \phi(x_1)^T\phi(x_2)$$

Here,

  • $\phi$ = mapping to the high-dimensional space

ℹ️ Common Kernels

  • Linear: $K(x_1,x_2) = x_1^Tx_2$
  • RBF: $K(x_1,x_2) = \exp(-\gamma\|x_1-x_2\|^2)$
  • Polynomial: $K(x_1,x_2) = (x_1^Tx_2 + c)^d$
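To make the kernel trick concrete, this sketch checks that a degree-2 polynomial kernel with $c = 0$ on 2-D inputs equals an ordinary dot product after an explicit feature map. The map `phi` below is specific to this case and is written out only for the comparison.

```python
import numpy as np

def poly_kernel(x, y, c=0.0, d=2):
    return (x @ y + c) ** d

def rbf_kernel(x, y, gamma=0.5):
    return np.exp(-gamma * np.sum((x - y) ** 2))

def phi(x):
    # Explicit feature map for (x^T y)^2 with 2-D inputs
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x1 = np.array([1.0, 2.0])
x2 = np.array([3.0, 0.5])

print(poly_kernel(x1, x2))    # kernel evaluated in the 2-D input space
print(phi(x1) @ phi(x2))      # same value via the 3-D feature space (about 16)
print(rbf_kernel(x1, x1))     # a point compared with itself gives 1.0
```

The kernel needs one dot product in the input space, while the explicit route first builds the higher-dimensional vectors. For the RBF kernel the corresponding feature space is infinite-dimensional, so only the kernel route is practical.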

SVM Optimization:

SVM Primal

$$\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + C \times \sum_i \xi_i$$

Here,

  • $w$ = weight vector
  • $C$ = regularization parameter
  • $\xi_i$ = slack variables

Subject to: $y_i(w^Tx_i + b) \geq 1 - \xi_i$ and $\xi_i \geq 0$

Dual form (uses the kernel trick):

$$\alpha = \arg\max_\alpha \sum_i \alpha_i - \frac{1}{2}\sum_i\sum_j \alpha_i\alpha_jy_iy_jK(x_i,x_j)$$

Subject to: $0 \leq \alpha_i \leq C$ and $\sum_i \alpha_i y_i = 0$

Dimensionality Reduction

PCA (Principal Component Analysis)

ℹ️ PCA Steps

  1. Center the data: $X = X - \text{mean}(X)$
  2. Compute covariance: $C = X^TX / (n-1)$
  3. Eigendecompose: $C = V\Lambda V^T$
  4. Top $k$ eigenvectors: $V_k$
  5. Project: $Z = XV_k$

Keeps maximum variance with fewest dimensions

Choosing k (number of components):

ℹ️ Choosing k

Explained variance ratio of component $k$: $\frac{\lambda_k}{\sum_i \lambda_i}$

Choose the smallest $k$ whose cumulative explained variance reaches a target such as $\geq 95\%$
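The PCA steps and the rule for choosing $k$ can be sketched with an eigendecomposition of the covariance matrix. The synthetic data set (three features that mostly vary along one direction) is an assumption for the example.

```python
import numpy as np

def pca(X, k):
    Xc = X - X.mean(axis=0)                      # 1. center the data
    C = Xc.T @ Xc / (len(X) - 1)                 # 2. covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)         # 3. eigh returns ascending order
    order = np.argsort(eigvals)[::-1]            #    sort descending by variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    Z = Xc @ eigvecs[:, :k]                      # 4 and 5. project onto the top k
    explained = eigvals / eigvals.sum()          # explained variance ratios
    return Z, explained

rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t,
                     2 * t + 0.01 * rng.normal(size=200),
                     0.01 * rng.normal(size=200)])

Z, explained = pca(X, k=1)
k95 = int(np.searchsorted(np.cumsum(explained), 0.95) + 1)
print(explained, k95)
```

Because the second feature is almost a multiple of the first, a single component captures nearly all of the variance and the 95% rule selects $k = 1$.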

t-SNE

ℹ️ t-SNE

Preserves local structure (nearby points stay nearby)

Probability in high dim: $p_{j|i} = \frac{\exp(-\|x_i-x_j\|^2/2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|x_i-x_k\|^2/2\sigma_i^2)}$

Probability in low dim: $q_{ij} = \frac{(1+\|y_i-y_j\|^2)^{-1}}{\sum_{k \neq l} (1+\|y_k-y_l\|^2)^{-1}}$

Minimize $KL(p \| q)$

UMAP

ℹ️ UMAP

Similar to t-SNE but:

  • Faster
  • Better preserves global structure
  • Based on topological data analysis

Uses fuzzy simplicial sets and stochastic gradient descent


📋 Key Takeaways

  • Neural networks compute $z^{(l)} = W^{(l)}a^{(l-1)} + b^{(l)}$ then apply activation $\sigma(z^{(l)})$. ReLU, $\text{ReLU}(z) = \max(0, z)$, is the default activation; softmax, $\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}$, converts logits to class probabilities.

  • Backpropagation is the chain rule applied layer by layer. $\frac{\partial L}{\partial W^{(l)}} = \delta^{(l)} (a^{(l-1)})^T$ computes gradients for all weights in one forward + backward pass in $O(n)$ time; this is what makes deep learning possible.

  • Cross-entropy loss $L = -\sum_i y_i \log(\hat{y}_i)$ is equivalent to MLE. For classification, minimizing cross-entropy maximizes the likelihood of the correct class, so confident wrong predictions are heavily penalized.

  • Attention is $\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$. The scaling factor $\sqrt{d_k}$ prevents softmax saturation; multi-head attention lets different heads learn different relationship types.

  • VAEs and GANs generate data differently. VAEs minimize reconstruction loss + $KL(q(z|x) \| p(z))$; GANs play a minimax game $\min_G \max_D \mathbb{E}[\log D(x)] + \mathbb{E}[\log(1 - D(G(z)))]$ where generator and discriminator compete.

  • RL learns via the Bellman equation $V(s) = \max_a [R(s,a) + \gamma \sum_{s'} P(s'|s,a) V(s')]$. Policy gradient methods optimize $\nabla_\theta J(\theta) = \mathbb{E}[\nabla_\theta \log \pi_\theta(a|s) \times G_t]$ directly; PPO clips updates for stability.
