Mathematical Concepts Used in AI
ℹ️ Why It Matters
This chapter connects all the math to actual AI/ML applications. These are the concepts that power modern AI systems.
Neural Network Math
Forward Propagation
ℹ️ Forward Propagation
For layer $l$:
$$z^{(l)} = W^{(l)}a^{(l-1)} + b^{(l)} \quad \leftarrow \text{Linear transformation}$$
$$a^{(l)} = \sigma(z^{(l)}) \quad \leftarrow \text{Activation function}$$
Where:
$W^{(l)}$ = weight matrix
$b^{(l)}$ = bias vector
$a^{(l)}$ = activation (output) of layer $l$
$\sigma$ = activation function
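As a minimal sketch of the two formulas above, the loop below applies the linear transformation and then the activation, layer by layer. The layer sizes, the random weights, and the choice of sigmoid for every layer are illustrative assumptions, not part of the original text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases, activation=sigmoid):
    """Return the activations a^(0), ..., a^(L) for a single input vector x."""
    activations = [x]
    for W, b in zip(weights, biases):
        z = W @ activations[-1] + b          # z^(l) = W^(l) a^(l-1) + b^(l)
        activations.append(activation(z))    # a^(l) = sigma(z^(l))
    return activations

# Hypothetical 3 -> 4 -> 2 network with random weights
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
biases = [np.zeros(4), np.zeros(2)]
acts = forward(np.array([0.5, -1.0, 2.0]), weights, biases)
print([a.shape for a in acts])
```

Each weight matrix has shape (outputs, inputs), so `W @ a` maps the previous layer's activation to the next layer's pre-activation.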
Activation Functions
Sigmoid:
Sigmoid Activation
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
Here,
$z$ = Input to activation function
$$\sigma'(z) = \sigma(z)(1 - \sigma(z))$$
ℹ️ Sigmoid Properties
Range: $(0, 1)$
Problem: Vanishing gradients ($\sigma'(z) \leq 0.25$ everywhere and approaches 0 as $|z|$ grows)
Tanh:
Tanh Activation
$$\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$$
Here,
$z$ = Input to activation function
$$\tanh'(z) = 1 - \tanh^2(z)$$
ℹ️ Tanh Properties
Range: $(-1, 1)$
Outputs are centered at 0, which often makes optimization easier than with sigmoid
ReLU (Rectified Linear Unit):
ReLU Activation
$$\text{ReLU}(z) = \max(0, z)$$
Here,
$z$ = Input to activation function
$$\text{ReLU}'(z) = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{otherwise} \end{cases}$$
ℹ️ ReLU Properties
Range: $[0, \infty)$
A common default activation for hidden layers
Problem: "Dead neurons" (a unit whose input stays negative outputs 0, receives zero gradient, and stops learning)
Leaky ReLU:
Leaky ReLU
$$\text{LeakyReLU}(z) = \max(\alpha z, z)$$
Here,
$\alpha$ = Small constant (e.g., 0.01)
Softmax (for classification):
Softmax
$$\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}$$
Here,
$z_i$ = Logit for class $i$
ℹ️ Softmax
Converts logits to probabilities that sum to 1
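The activation functions and derivatives in this section can be checked numerically. The sketch below implements them with NumPy and compares the analytic derivatives against a central finite difference. The max-subtraction inside `softmax` is a standard numerical-stability step that is not in the formula above; it does not change the result.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    return np.maximum(alpha * z, z)

def softmax(z):
    shifted = z - np.max(z)      # same result as softmax(z), but avoids overflow in exp
    e = np.exp(shifted)
    return e / e.sum()

def numeric_derivative(f, z, h=1e-6):
    """Central finite-difference approximation of f'(z), element-wise."""
    return (f(z + h) - f(z - h)) / (2 * h)

z = np.array([-2.0, 0.5, 3.0])
sig_grad = sigmoid(z) * (1 - sigmoid(z))     # sigma'(z) = sigma(z)(1 - sigma(z))
tanh_grad = 1 - np.tanh(z) ** 2              # tanh'(z) = 1 - tanh^2(z)
relu_grad = (z > 0).astype(float)            # 1 if z > 0, else 0

print(np.allclose(sig_grad, numeric_derivative(sigmoid, z), atol=1e-6))
print(np.allclose(tanh_grad, numeric_derivative(np.tanh, z), atol=1e-6))
print(softmax(z).sum())
```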
Backpropagation
The Chain Rule Applied Layer by Layer:
ℹ️ Backpropagation Formulas
For output layer $l = L$:
$$\delta^{(L)} = \nabla_a L \odot \sigma'(z^{(L)})$$
For hidden layers $l = L-1, \ldots, 1$:
$$\delta^{(l)} = (W^{(l+1)})^T\delta^{(l+1)} \odot \sigma'(z^{(l)})$$
Gradients:
$$\frac{\partial L}{\partial W^{(l)}} = \delta^{(l)} (a^{(l-1)})^T$$
$$\frac{\partial L}{\partial b^{(l)}} = \delta^{(l)}$$
Here $\delta^{(l)} = \partial L / \partial z^{(l)}$ is the error at layer $l$ and $\odot$ is element-wise multiplication. Note that $L$ denotes the loss inside derivatives and the index of the last layer in superscripts.
💡 Why backpropagation is efficient
Computes gradients for ALL weights in one forward + backward pass
Time complexity: $O(n)$ where $n$ is the number of weights
This is what makes deep learning possible
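To make the formulas concrete, here is a small sketch for a two-layer network with sigmoid activations. The squared-error loss $\frac{1}{2}\|a^{(2)} - y\|^2$ (so that $\nabla_a L = a^{(2)} - y$), the layer sizes, and the random data are assumptions chosen for illustration. A finite-difference check on one weight confirms the backpropagated gradient.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_grads(x, y, W1, b1, W2, b2):
    # Forward pass
    a1 = sigmoid(W1 @ x + b1)
    a2 = sigmoid(W2 @ a1 + b2)
    loss = 0.5 * np.sum((a2 - y) ** 2)
    # Backward pass; for sigmoid, sigma'(z) = a * (1 - a)
    delta2 = (a2 - y) * a2 * (1 - a2)            # output layer error
    delta1 = (W2.T @ delta2) * a1 * (1 - a1)     # hidden layer error
    grads = {
        "W2": np.outer(delta2, a1), "b2": delta2,
        "W1": np.outer(delta1, x),  "b1": delta1,
    }
    return loss, grads

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), np.array([0.0, 1.0])
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)
loss, grads = loss_and_grads(x, y, W1, b1, W2, b2)

# Finite-difference check on a single weight, W1[0, 0]
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 0] += eps
W1_minus[0, 0] -= eps
numeric = (loss_and_grads(x, y, W1_plus, b1, W2, b2)[0]
           - loss_and_grads(x, y, W1_minus, b1, W2, b2)[0]) / (2 * eps)
print(abs(numeric - grads["W1"][0, 0]) < 1e-6)
```

The finite-difference check needs two extra forward passes per weight, while backpropagation gets every gradient from a single backward pass.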
Loss Functions
Regression Losses
Mean Squared Error (MSE):
MSE Loss
$$L = \frac{1}{n} \sum_i (y_i - \hat{y}_i)^2$$
Here,
$y_i$ = True value
$\hat{y}_i$ = Predicted value
ℹ️ MSE
Penalizes large errors more (squared)
Mean Absolute Error (MAE):
MAE Loss
$$L = \frac{1}{n} \sum_i |y_i - \hat{y}_i|$$
Here,
$y_i$ = True value
$\hat{y}_i$ = Predicted value
Huber Loss (combination):
Huber Loss
$$L = \begin{cases} \frac{1}{2}(y-\hat{y})^2 & \text{if } |y-\hat{y}| \leq \delta \\ \delta|y-\hat{y}| - \frac{1}{2}\delta^2 & \text{otherwise} \end{cases}$$
Here,
$\delta$ = Threshold parameter
ℹ️ Huber Loss
Robust to outliers, smooth around zero
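A short sketch comparing the three regression losses on data with one outlier. The sample values are made up, and averaging the per-sample Huber loss over the batch is an assumption (the formula above is written for a single sample).

```python
import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))

def huber(y, y_hat, delta=1.0):
    err = np.abs(y - y_hat)
    quadratic = 0.5 * err ** 2                    # used where |error| <= delta
    linear = delta * err - 0.5 * delta ** 2       # used where |error| > delta
    return np.mean(np.where(err <= delta, quadratic, linear))

y = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.5, 2.0, 3.0, 14.0])   # last prediction is an outlier (error of 10)

print(mse(y, y_hat), mae(y, y_hat), huber(y, y_hat))
```

The single outlier dominates MSE because its error is squared, while MAE and Huber grow only linearly with it.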
Classification Losses
Binary Cross-Entropy:
Binary Cross-Entropy
$$L = -[y \log(\hat{y}) + (1-y) \log(1-\hat{y})]$$
Here,
$y$ = True label (0 or 1)
$\hat{y}$ = Predicted probability
ℹ️ Binary Cross-Entropy
Same as negative log-likelihood for binary classification
Categorical Cross-Entropy:
Categorical Cross-Entropy
$$L = -\sum_i y_i \log(\hat{y}_i)$$
Here,
$y_i$ = True label (one-hot)
$\hat{y}_i$ = Predicted probability for class $i$
ℹ️ Categorical Cross-Entropy
For multi-class classification
Focal Loss:
Focal Loss
$$L = -\alpha_t(1-p_t)^\gamma \log(p_t)$$
Here,
$p_t$ = Predicted probability of the true class
$\alpha_t$ = Balancing factor
$\gamma$ = Focusing parameter
ℹ️ Focal Loss
Down-weights easy examples, focuses on hard ones
Used in object detection
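The classification losses above can be sketched as follows. The clipping constant `eps` (to avoid `log(0)`) and the default values `alpha=0.25`, `gamma=2.0` are assumptions for illustration. The focal loss here is the binary form, with $p_t = \hat{y}$ when $y = 1$ and $p_t = 1 - \hat{y}$ otherwise.

```python
import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)                  # guard against log(0)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def categorical_cross_entropy(y_onehot, p, eps=1e-12):
    return -np.sum(y_onehot * np.log(np.clip(p, eps, 1.0)))

def focal_loss(y, p, alpha=0.25, gamma=2.0, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)
    p_t = np.where(y == 1, p, 1 - p)              # probability assigned to the true class
    alpha_t = np.where(y == 1, alpha, 1 - alpha)  # class-dependent balancing factor
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)

# An easy example (p_t = 0.9) is down-weighted far more than a hard one (p_t = 0.6)
easy, hard = focal_loss(1, 0.9), focal_loss(1, 0.6)
print(float(easy), float(hard))
```

With `gamma=0` the modulating factor $(1 - p_t)^\gamma$ is 1, so the focal loss reduces to cross-entropy weighted by $\alpha_t$.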
Regularization in Neural Networks
Dropout
ℹ️ Dropout
During training:
Each neuron is "dropped" (set to 0) with probability $p$
$a' = a \odot \text{mask} / (1-p)$ ← Inverted dropout (mask is 1 for kept neurons and 0 for dropped ones; dividing by $1-p$ keeps the expected activation unchanged)
During inference:
No dropout, all neurons active
💡 Why dropout works
Prevents co-adaptation of neurons
Ensemble effect (averaging many sub-networks)
Behaves like a form of L2 regularization (the equivalence is approximate and holds most cleanly for linear models)
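A minimal sketch of inverted dropout as described above. The function name, the `training` flag, and the test array of ones are illustrative assumptions.

```python
import numpy as np

def dropout(a, p, rng, training=True):
    """Inverted dropout: zero each entry with probability p, rescale the rest by 1/(1-p)."""
    if not training or p == 0.0:
        return a                                  # inference: all neurons active, no scaling
    mask = rng.random(a.shape) >= p               # True (keep) with probability 1 - p
    return a * mask / (1.0 - p)

rng = np.random.default_rng(0)
a = np.ones(100_000)
dropped = dropout(a, p=0.3, rng=rng)

print(dropped.mean())            # close to 1.0: the expected activation is preserved
print(np.mean(dropped == 0.0))   # close to 0.3: fraction of dropped neurons
```

Because the rescaling happens at training time, inference needs no adjustment, which is why the function returns its input unchanged when `training=False`.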
Batch Normalization
ℹ️ Batch Normalization
For each mini-batch:
$\mu$ = mean of activations
$\sigma^2$ = variance of activations
$\hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}$ ← Normalize
$y = \gamma\hat{x} + \beta$ ← Scale and shift
Where $\gamma$ and $\beta$ are learned parameters
💡 Benefits of Batch Normalization
Faster training
Allows higher learning rates
Reduces sensitivity to initialization
Acts as a regularizer
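The batch normalization steps can be sketched as below for the training-time forward pass. Computing the statistics per feature over the batch axis and the value `eps=1e-5` are assumptions; this sketch leaves out the running averages that real implementations keep for use at inference.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """x has shape (batch, features); statistics are computed per feature over the batch."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)     # normalize
    return gamma * x_hat + beta               # scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 4))
y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))

print(y.mean(axis=0))   # approximately 0 for every feature
print(y.std(axis=0))    # approximately 1 for every feature
```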
Layer Normalization
ℹ️ Layer Normalization
Like Batch Norm, but normalizes across the features of each example rather than across the batch, so it is independent of batch size. Used in: Transformers, RNNs
Weight Decay (L2 Regularization)
Weight Decay
$$L_{\text{total}} = L_{\text{original}} + \frac{\lambda}{2} \sum \|W\|^2$$
Here,
$\lambda$ = Weight decay coefficient
ℹ️ Weight Decay Effect
Weights shrink toward zero
Attention Mechanism (The Math Behind Transformers)
Scaled Dot-Product Attention
Scaled Dot-Product Attention
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$
Here,
$Q$ = Query matrix
$K$ = Key matrix
$V$ = Value matrix
$d_k$ = Dimension of keys
ℹ️ Step by step
Compute similarity: $QK^T$
(How relevant is each key to each query?)
Scale: $\frac{QK^T}{\sqrt{d_k}}$
(Prevent large values from saturating softmax)
Normalize: $\text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)$
(Convert to probabilities)
Aggregate: $\text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$
(Weighted sum of values based on attention weights)
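The four steps above map directly onto a few lines of NumPy. This is a single-head sketch without masking or batching; the matrix sizes are arbitrary assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # numerical stability; result is unchanged
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T                  # 1. similarity of each query to each key
    scaled = scores / np.sqrt(d_k)    # 2. scale
    weights = softmax(scaled)         # 3. normalize each row to probabilities
    return weights @ V, weights       # 4. weighted sum of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 8))    # 2 queries, d_k = 8
K = rng.normal(size=(5, 8))    # 5 keys
V = rng.normal(size=(5, 16))   # 5 values, d_v = 16
out, weights = attention(Q, K, V)

print(out.shape, weights.shape)
print(weights.sum(axis=1))       # each row sums to 1
```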
Multi-Head Attention
Multi-Head Attention
$$\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$
Here,
$W_i^Q, W_i^K, W_i^V$ = Learned projection matrices for each head
$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) W^O$$
$W^O$ = Learned output projection matrix
💡 Why multi-head?
Different heads can attend to different types of relationships
Head 1 might focus on syntax, head 2 on semantics, etc.
Position Encoding
Position Encoding
$$PE(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d}}\right)$$
$$PE(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/d}}\right)$$
Here,
$pos$ = Position in sequence
$i$ = Dimension index
$d$ = Model dimension
💡 Why sinusoidal?
Each position gets a unique encoding
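The model can learn relative positions (for a fixed offset $k$, $PE(pos + k)$ is a linear function of $PE(pos)$)
The encoding is defined for every position, including positions not seen during training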
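A sketch of the sinusoidal encoding defined above, building the full table of shape (positions, $d$). The function name and the requirement that $d$ be even are assumptions of this sketch.

```python
import numpy as np

def positional_encoding(max_len, d):
    """Return a (max_len, d) table; assumes d is even."""
    pos = np.arange(max_len)[:, None]           # shape (max_len, 1)
    i = np.arange(d // 2)[None, :]              # shape (1, d/2)
    angles = pos / (10000 ** (2 * i / d))       # pos / 10000^(2i/d)
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(angles)                # even dimensions 2i
    pe[:, 1::2] = np.cos(angles)                # odd dimensions 2i+1
    return pe

pe = positional_encoding(max_len=50, d=16)
print(pe.shape)
print(pe[0, :4])   # position 0: sin(0) = 0 and cos(0) = 1, alternating
```

Low dimension indices oscillate quickly with position and high indices slowly, which is what gives each position a distinct pattern.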
Generative Models
Variational Autoencoders (VAE)
ℹ️ VAE Components
Encoder: $q(z|x)$, maps input to latent distribution
Decoder: $p(x|z)$, reconstructs input from latent vector
VAE Objective (ELBO)
$$L = \mathbb{E}_{q(z|x)}[\log p(x|z)] - KL(q(z|x) \| p(z))$$
Training maximizes $L$; equivalently, the loss that is minimized is $-L$ (reconstruction loss plus KL divergence).
Here,
$q(z|x)$ = Encoder distribution
$p(z)$ = Prior distribution
ℹ️ VAE Loss Components
Reconstruction: How well does the decoder reconstruct the input?
KL divergence: How close is the learned distribution to the prior?
Reparameterization trick:
Reparameterization Trick
$$z = \mu + \sigma \times \varepsilon, \quad \text{where } \varepsilon \sim N(0,1)$$
Here,
$\mu$ = Mean from encoder
$\sigma$ = Standard deviation from encoder
$\varepsilon$ = Random noise
ℹ️ Reparameterization Trick
This makes the sampling differentiable (can backpropagate through it)
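A small sketch of the reparameterization trick: all randomness comes from $\varepsilon$, so $z$ is a deterministic, differentiable function of $\mu$ and $\sigma$ (with $\partial z / \partial \mu = 1$ and $\partial z / \partial \sigma = \varepsilon$). The specific values of `mu` and `sigma` stand in for encoder outputs and are made up.

```python
import numpy as np

def reparameterize(mu, sigma, rng):
    eps = rng.standard_normal(mu.shape)   # epsilon ~ N(0, 1), independent of mu and sigma
    return mu + sigma * eps               # z = mu + sigma * epsilon

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])       # hypothetical encoder mean
sigma = np.array([0.5, 2.0])     # hypothetical encoder standard deviation
samples = np.stack([reparameterize(mu, sigma, rng) for _ in range(20_000)])

print(samples.mean(axis=0))   # close to mu
print(samples.std(axis=0))    # close to sigma
```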
GANs (Generative Adversarial Networks)
ℹ️ GAN Components
Generator: $G(z)$, generates fake data from noise
Discriminator: $D(x)$, classifies real vs fake
Minimax game:
GAN Objective
$$\min_G \max_D V(D,G) = \mathbb{E}[\log D(x)] + \mathbb{E}[\log(1 - D(G(z)))]$$
Here,
$D$ = Discriminator
$G$ = Generator
ℹ️ GAN Training
$D$ tries to maximize: correctly classify real and fake
$G$ tries to minimize: fool $D$ into thinking fake is real
Training:
Train D: Maximize $\log D(x) + \log(1-D(G(z)))$
Train G: Maximize $\log D(G(z))$ [or minimize $\log(1-D(G(z)))$]
Repeat
Diffusion Models
Forward process: Add noise gradually
Forward Process
$$q(x_t|x_{t-1}) = N(x_t; \sqrt{1-\beta_t}\,x_{t-1}, \beta_t I)$$
Here,
$\beta_t$ = Noise schedule
ℹ️ Forward Process
After $T$ steps: $x_T \approx N(0, I)$ (pure noise)
Reverse process: Learn to remove noise
Reverse Process
$$p_\theta(x_{t-1}|x_t) = N(x_{t-1}; \mu_\theta(x_t,t), \sigma_t^2 I)$$
Here,
$\theta$ = Learned parameters
ℹ️ Reverse Process
$\theta$ denotes the parameters of a neural network, learned during training
Training objective:
Diffusion Training Objective
$$L = \mathbb{E}[\|\varepsilon - \varepsilon_\theta(x_t, t)\|^2]$$
Here,
$\varepsilon$ = Noise that was added
$\varepsilon_\theta$ = Predicted noise
ℹ️ Diffusion Training
Predict the noise that was added, then subtract it
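To illustrate the forward process only, the sketch below applies $q(x_t|x_{t-1})$ step by step to a constant "data" vector and checks that the result looks like $N(0, I)$. The linear schedule from $10^{-4}$ to $0.02$ over 1000 steps and the toy data are assumptions; no network or reverse process is involved.

```python
import numpy as np

def forward_step(x_prev, beta_t, rng):
    """Sample x_t ~ N(sqrt(1 - beta_t) * x_prev, beta_t * I)."""
    noise = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * noise

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)   # assumed linear noise schedule, T = 1000
x = np.full(10_000, 3.0)                # toy "data": every entry equals 3.0
for beta_t in betas:
    x = forward_step(x, beta_t, rng)

print(x.mean(), x.std())   # near 0 and 1: the original signal has been destroyed
```

The factor $\sqrt{1-\beta_t}$ shrinks the signal while $\sqrt{\beta_t}$ injects noise, so the variance stays bounded instead of growing with every step.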
Reinforcement Learning Math
Markov Decision Process (MDP)
Definition (MDP)
An MDP is defined as $(S, A, P, R, \gamma)$:
$S$ = set of states
$A$ = set of actions
$P(s'|s,a)$ = transition probability
$R(s,a)$ = reward function
$\gamma$ = discount factor ($0 \leq \gamma \leq 1$)
Bellman Equation
Bellman Equation
$$V(s) = \max_a \left[R(s,a) + \gamma \sum_{s'} P(s'|s,a) V(s')\right]$$
Here,
$V(s)$ = Value of state $s$
$\gamma$ = Discount factor
ℹ️ Bellman Equation
The value of a state = the best, over all actions, of immediate reward + discounted expected value of the next state
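One common way to solve the Bellman equation is value iteration: apply the right-hand side repeatedly until $V$ stops changing. The sketch below does this for a made-up two-state MDP; the array layout `P[a, s, s']`, `R[s, a]` and the tolerance are assumptions.

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-10):
    """P[a, s, t] = P(t | s, a); R[s, a] = reward. Returns the optimal V."""
    V = np.zeros(R.shape[0])
    while True:
        # Q[s, a] = R(s, a) + gamma * sum_t P(t | s, a) * V(t)
        Q = R + gamma * np.einsum("ast,t->sa", P, V)
        V_new = Q.max(axis=1)               # max over actions
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

# Toy MDP: action 0 stays in place, action 1 switches state.
P = np.array([
    [[1.0, 0.0], [0.0, 1.0]],   # action 0: stay
    [[0.0, 1.0], [1.0, 0.0]],   # action 1: switch
])
R = np.array([[0.0, 0.0],
              [1.0, 0.0]])      # reward 1 only for staying in state 1
V = value_iteration(P, R, gamma=0.9)
print(V)
```

Here staying in state 1 forever yields $1/(1-0.9) = 10$, and state 0 is worth $0.9 \times 10 = 9$ because it takes one unrewarded step to get there.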
Q-Learning
Q-Learning Update
$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[r + \gamma \max_{a'} Q(s',a') - Q(s,a)\right]$$
Here,
$\alpha$ = Learning rate
$r$ = Reward received
$\gamma$ = Discount factor
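A single application of the update rule, as a sketch with a tabular `Q` array indexed by `[state, action]`. The table values and the transition are invented for the example.

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    td_target = r + gamma * np.max(Q[s_next])     # r + gamma * max_a' Q(s', a')
    Q[s, a] += alpha * (td_target - Q[s, a])      # move Q(s, a) toward the target
    return Q

Q = np.zeros((2, 2))
Q[1] = [1.0, 3.0]                                 # pretend state 1 already has estimates
Q = q_update(Q, s=0, a=1, r=2.0, s_next=1)
print(Q[0, 1])   # 0 + 0.1 * (2.0 + 0.9 * 3.0 - 0) = 0.47
```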
Policy Gradient
Policy Gradient
$$J(\theta) = \mathbb{E}\left[\sum_t \gamma^t r_t\right]$$
Here,
$\theta$ = Policy parameters
$$\nabla_\theta J(\theta) = \mathbb{E}\left[\nabla_\theta \log \pi_\theta(a|s) \, G_t\right]$$
ℹ️ Policy Gradient Components
$\pi_\theta$ = policy parameterized by $\theta$
$G_t$ = return from time $t$
REINFORCE Algorithm:
ℹ️ REINFORCE Algorithm
Sample trajectory using current policy
Compute returns $G_t$ for each step
Update: $\theta \leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta(a_t|s_t) \, G_t$
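The "compute returns" step can be sketched with a single backward pass over the rewards. This assumes the discounted return $G_t = \sum_{k \geq t} \gamma^{k-t} r_k$, which the text above does not spell out; the reward sequence is made up.

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, G_1, ...] using the recursion G_t = r_t + gamma * G_{t+1}."""
    returns = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return returns[::-1]

returns = discounted_returns([1.0, 0.0, 2.0], gamma=0.5)
print(returns)   # [1.5, 1.0, 2.0]
```

Working backward avoids recomputing the tail sum for every step, so the whole trajectory costs one pass.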
Proximal Policy Optimization (PPO)
PPO Objective
$$L^{CLIP} = \mathbb{E}\left[\min(r_t(\theta)A_t, \text{clip}(r_t(\theta), 1-\varepsilon, 1+\varepsilon)A_t)\right]$$
Here,
$r_t(\theta)$ = Probability ratio
$A_t$ = Advantage estimate
$\varepsilon$ = Clip parameter
ℹ️ PPO
$r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$ ← probability ratio
$A_t$ = advantage estimate
clip prevents large policy updates
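A sketch of the clipped objective evaluated on a few hand-picked ratios and advantages. The sample values and `clip_eps=0.2` are assumptions; a real implementation would compute the ratios from policy log-probabilities.

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, clip_eps=0.2):
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * advantage
    return np.mean(np.minimum(unclipped, clipped))   # pessimistic of the two

ratio = np.array([1.5, 0.5, 1.0])        # pi_theta / pi_theta_old
advantage = np.array([1.0, -1.0, 2.0])
objective = ppo_clip_objective(ratio, advantage)
print(objective)   # (1.2 + (-0.8) + 2.0) / 3 = 0.8
```

In the first sample the ratio 1.5 is clipped to 1.2, so pushing the ratio further brings no extra gain. In the second, the minimum selects the clipped term $-0.8$ over $-0.5$, so the objective stays pessimistic.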
Kernel Methods
Kernel Trick
Definition (Kernel Trick)
Compute similarities in a high-dimensional space without actually going there.
Kernel Function
$$K(x_1, x_2) = \phi(x_1)^T\phi(x_2)$$
Here,
$\phi$ = Mapping to high-dimensional space
ℹ️ Common Kernels
Linear: $K(x_1,x_2) = x_1^Tx_2$
RBF: $K(x_1,x_2) = \exp(-\gamma\|x_1-x_2\|^2)$
Polynomial: $K(x_1,x_2) = (x_1^Tx_2 + c)^d$
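To see the kernel trick at work, the sketch below compares the degree-2 polynomial kernel (with $c = 1$) in two dimensions against an explicit 6-dimensional feature map $\phi$. The map `phi_poly2` is written out by hand for this one case and the input vectors are arbitrary.

```python
import numpy as np

def poly_kernel(x1, x2, c=1.0, d=2):
    return (x1 @ x2 + c) ** d

def rbf_kernel(x1, x2, gamma=0.5):
    return np.exp(-gamma * np.sum((x1 - x2) ** 2))

def phi_poly2(x):
    """Explicit feature map for (x1^T x2 + 1)^2 with 2-D inputs."""
    a, b = x
    s = np.sqrt(2.0)
    return np.array([a * a, b * b, s * a * b, s * a, s * b, 1.0])

x1 = np.array([1.0, 2.0])
x2 = np.array([3.0, -1.0])

print(poly_kernel(x1, x2))               # computed in 2 dimensions
print(phi_poly2(x1) @ phi_poly2(x2))     # same value via the 6-dimensional map
```

The kernel needs one 2-D dot product, while the explicit route needs a 6-D one. For the RBF kernel the corresponding feature space is infinite-dimensional, so only the kernel route is practical.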
SVM Optimization:
SVM Primal
$$\min_{w, b, \xi} \frac{1}{2}\|w\|^2 + C \sum_i \xi_i$$
Here,
$w$ = Weight vector
$C$ = Regularization parameter
$\xi_i$ = Slack variables
Subject to: $y_i(w^Tx_i + b) \geq 1 - \xi_i$ and $\xi_i \geq 0$
Dual form: Uses kernel trick
$$\alpha = \arg\max_\alpha \sum_i \alpha_i - \frac{1}{2}\sum_i\sum_j \alpha_i\alpha_jy_iy_jK(x_i,x_j)$$
Subject to: $0 \leq \alpha_i \leq C$ and $\sum_i \alpha_i y_i = 0$
Dimensionality Reduction
PCA (Principal Component Analysis)
ℹ️ PCA Steps
Center the data: $X = X - \text{mean}(X)$
Compute covariance: $C = X^TX / (n-1)$
Eigendecompose: $C = V\Lambda V^T$
Top $k$ eigenvectors: $V_k$
Project: $Z = XV_k$
Keeps maximum variance with fewest dimensions
Choosing k (number of components):
ℹ️ Choosing k
Explained variance ratio of component $k$: $\frac{\lambda_k}{\sum_i \lambda_i}$
Choose the smallest $k$ such that the cumulative explained variance $\frac{\sum_{j \leq k} \lambda_j}{\sum_i \lambda_i} \geq 95\%$ (a common rule of thumb)
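The PCA steps and the 95% rule can be sketched with NumPy's symmetric eigensolver. The synthetic data (two strongly correlated columns plus a small-noise column) is an assumption chosen so that one component dominates.

```python
import numpy as np

def pca(X, k):
    Xc = X - X.mean(axis=0)                       # center the data
    C = Xc.T @ Xc / (X.shape[0] - 1)              # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)          # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]             # sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    Z = Xc @ eigvecs[:, :k]                       # project onto the top-k eigenvectors
    explained = eigvals / eigvals.sum()           # explained variance ratios
    return Z, explained

rng = np.random.default_rng(0)
t = rng.normal(size=500)
X = np.column_stack([t,
                     2 * t + 0.1 * rng.normal(size=500),
                     0.1 * rng.normal(size=500)])
Z, explained = pca(X, k=1)

# Smallest k whose cumulative explained variance reaches 95%
k95 = int(np.searchsorted(np.cumsum(explained), 0.95) + 1)
print(explained, k95)
```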
t-SNE
ℹ️ t-SNE
Preserves local structure (nearby points stay nearby)
Probability in high dim: $p_{j|i} = \frac{\exp(-\|x_i-x_j\|^2 / (2\sigma_i^2))}{\sum_{k \neq i} \exp(-\|x_i-x_k\|^2 / (2\sigma_i^2))}$
Probability in low dim: $q_{ij} = \frac{(1+\|y_i-y_j\|^2)^{-1}}{\sum_{k \neq l} (1+\|y_k-y_l\|^2)^{-1}}$
Minimize $KL(p \| q)$
UMAP
ℹ️ UMAP
Similar to t-SNE but:
Faster
Better preserves global structure
Based on topological data analysis
Uses fuzzy simplicial sets and stochastic gradient descent
📋 Key Takeaways
Neural networks compute $z^{(l)} = W^{(l)}a^{(l-1)} + b^{(l)}$ then apply activation $\sigma(z^{(l)})$. ReLU, $\text{ReLU}(z) = \max(0, z)$, is the default activation; softmax, $\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}$, converts logits to class probabilities.
Backpropagation is the chain rule applied layer by layer. $\frac{\partial L}{\partial W^{(l)}} = \delta^{(l)} (a^{(l-1)})^T$ computes gradients for all weights in one forward + backward pass in $O(n)$ time. This is what makes deep learning possible.
Cross-entropy loss $L = -\sum_i y_i \log(\hat{y}_i)$ is equivalent to MLE. For classification, minimizing cross-entropy maximizes the likelihood of the correct class and heavily penalizes confident wrong predictions.
Attention is $\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$. The scaling factor $\sqrt{d_k}$ prevents softmax saturation; multi-head attention lets different heads learn different relationship types.
VAEs and GANs generate data differently. VAEs minimize reconstruction loss + $KL(q(z|x) \| p(z))$; GANs play a minimax game $\min_G \max_D \mathbb{E}[\log D(x)] + \mathbb{E}[\log(1 - D(G(z)))]$ where generator and discriminator compete.
RL learns via the Bellman equation $V(s) = \max_a [R(s,a) + \gamma \sum_{s'} P(s'|s,a) V(s')]$. Policy gradient methods optimize $\nabla_\theta J(\theta) = \mathbb{E}[\nabla_\theta \log \pi_\theta(a|s) \, G_t]$ directly; PPO clips updates for stability.