Neural Networks: Perceptron to MLP — Full Mathematical Foundation

Module 12: Deep Learning

Neural Networks: Perceptron to MLP

The Building Block of Deep Learning

Every deep learning model — from CNNs to Transformers — is built from neural networks. Understanding the math behind them is essential.

Deep Learning Architecture Family

[Figure: Neural Network, CNN (Vision), RNN (Sequential), Transformer (NLP/All), GAN (Generation), Diffusion (Images), GNN (Graphs)]

All architectures are built from the same fundamental building blocks


1. The Perceptron (Single Neuron)

Mathematical Model:

$$z = \sum_{i=1}^{n} w_i x_i + b = \mathbf{w}^T\mathbf{x} + b$$
$$\hat{y} = \sigma(z) = \sigma\left(\sum_{i=1}^{n} w_i x_i + b\right)$$

Where:

  • $\mathbf{x}$ = input vector
  • $\mathbf{w}$ = weight vector (learned parameters)
  • $b$ = bias term
  • $\sigma$ = activation function

[Figure: The Perceptron: Single Neuron. Inputs x₁ to x₄ with weights w₁ to w₄ and bias b feed the sum Σwᵢxᵢ + b, followed by the activation σ(z), which produces the output ŷ]

z = w₁x₁ + w₂x₂ + w₃x₃ + w₄x₄ + b → ŷ = σ(z)
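The weighted sum and activation above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names and the input, weight, and bias values are made up for the example, and sigmoid is assumed as the activation.

```python
import math

def sigmoid(z):
    """Logistic activation: squashes any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def perceptron(x, w, b, activation=sigmoid):
    """Compute y_hat = activation(w . x + b) for a single neuron."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b  # weighted sum plus bias
    return activation(z)

# Illustrative (made-up) values for a 4-input neuron.
x = [1.0, 2.0, -1.0, 0.5]
w = [0.5, -0.25, 0.75, 1.0]
b = 0.25

y_hat = perceptron(x, w, b)
print(y_hat)  # z = 0.5 - 0.5 - 0.75 + 0.5 + 0.25 = 0, so sigmoid(0) = 0.5
```

Here the weighted sum happens to be exactly zero, so the neuron sits on its decision boundary and outputs 0.5.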


2. Activation Functions

Why Activation Functions?

Without activation functions, stacked layers collapse into a single linear (affine) map, so the network is no more expressive than linear regression:

$$\hat{y} = W_2(W_1\mathbf{x} + b_1) + b_2 = W_2W_1\mathbf{x} + W_2b_1 + b_2 = W'\mathbf{x} + b'$$

Activation functions introduce non-linearity, allowing the network to learn complex patterns.
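The collapse of two linear layers into one can be checked numerically. The sketch below uses made-up parameters for a 2 → 3 → 1 network with no activation; the helper names (`matvec`, `matmul`, `vec_add`) are illustrative only.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrices A (m x k) and B (k x n), both given as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def vec_add(u, v):
    """Element-wise sum of two vectors."""
    return [a + b for a, b in zip(u, v)]

# Made-up parameters for a 2 -> 3 -> 1 network without activations.
W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]
b1 = [1.0, 0.0, -2.0]
W2 = [[2.0, 1.0, -1.0]]
b2 = [0.5]

# Collapse the two layers: W' = W2 W1 and b' = W2 b1 + b2.
W_prime = matmul(W2, W1)
b_prime = vec_add(matvec(W2, b1), b2)

x = [4.0, -3.0]
two_layer = vec_add(matvec(W2, vec_add(matvec(W1, x), b1)), b2)
one_layer = vec_add(matvec(W_prime, x), b_prime)

print(two_layer, one_layer)  # both are [-5.5]
```

Whatever input is chosen, the two computations agree, which is why depth alone adds nothing without a non-linearity between layers.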

Sigmoid, Tanh, ReLU Comparison

[Figure: Activation Functions Comparison, plotted for z from -4 to +4. Sigmoid: σ(z) = 1/(1+e⁻ᶻ), range (0, 1). Tanh: tanh(z) = (eᶻ-e⁻ᶻ)/(eᶻ+e⁻ᶻ), range (-1, 1). ReLU: max(0, z), range [0, ∞)]

Activation Function Formulas

Sigmoid

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

$$\sigma'(z) = \sigma(z)(1 - \sigma(z))$$

Output: (0, 1) — good for probabilities

Tanh

$$\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$$

$$\tanh'(z) = 1 - \tanh^2(z)$$

Output: (-1, 1) — zero-centered

ReLU

$$\text{ReLU}(z) = \max(0, z)$$

$$\text{ReLU}'(z) = \begin{cases} 1 & z > 0 \\ 0 & z \leq 0 \end{cases}$$

Output: [0, ∞) — most popular in hidden layers
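As a sanity check on the derivative formulas above, the sketch below implements each activation with its analytic derivative and compares it against a central finite difference. The helper `numeric_derivative` and the test points are assumptions made for this example only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh_prime(z):
    return 1.0 - math.tanh(z) ** 2

def relu(z):
    return max(0.0, z)

def relu_prime(z):
    # The derivative is undefined at z = 0; this follows the convention
    # used in the formula above and returns 0 there.
    return 1.0 if z > 0 else 0.0

def numeric_derivative(f, z, h=1e-6):
    """Central finite difference, used only to sanity-check the formulas."""
    return (f(z + h) - f(z - h)) / (2.0 * h)

# Compare analytic and numeric derivatives away from the ReLU kink at 0.
for z in (-2.0, -0.5, 0.5, 2.0):
    print(
        z,
        abs(sigmoid_prime(z) - numeric_derivative(sigmoid, z)),
        abs(tanh_prime(z) - numeric_derivative(math.tanh, z)),
        abs(relu_prime(z) - numeric_derivative(relu, z)),
    )
```

The printed differences should all be tiny, which is the same kind of check that is commonly used to validate a hand-written backward pass.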

Leaky ReLU and GELU (Modern Alternatives)

Leaky ReLU: $f(z) = \max(\alpha z, z)$, with $\alpha = 0.01$

GELU (Transformers): $f(z) = z \cdot \Phi(z)$, where $\Phi$ is the standard Gaussian CDF
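Both functions are short enough to write directly. The sketch below is one possible reading of the formulas: it uses the exact GELU definition with the Gaussian CDF expressed through `math.erf`, whereas libraries often offer a tanh-based approximation instead. The function names are illustrative.

```python
import math

def leaky_relu(z, alpha=0.01):
    """max(alpha * z, z): positives pass through, negatives are scaled by alpha.

    The max form matches Leaky ReLU only when 0 < alpha < 1.
    """
    return max(alpha * z, z)

def gaussian_cdf(z):
    """Standard normal CDF, Phi(z), written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def gelu(z):
    """Exact GELU: z * Phi(z)."""
    return z * gaussian_cdf(z)

for z in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(z, leaky_relu(z), gelu(z))
```

Unlike ReLU, both keep a small non-zero response for negative inputs, so a unit that receives negative pre-activations still passes some gradient.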

3. Multi-Layer Perceptron (MLP)

[Figure: Multi-Layer Perceptron (MLP) Architecture. Input layer (x₁ to x₅) → Hidden 1 (64 neurons) → Hidden 2 (32 neurons) → Hidden 3 (16 neurons) → Output (1 neuron, ŷ). Forward propagation runs left to right.]

4. Forward Propagation (Mathematical)

Layer-by-Layer Computation:

Layer 1:

$$\mathbf{z}^{[1]} = \mathbf{W}^{[1]}\mathbf{x} + \mathbf{b}^{[1]}$$
$$\mathbf{a}^{[1]} = \sigma(\mathbf{z}^{[1]})$$

Layer 2:

$$\mathbf{z}^{[2]} = \mathbf{W}^{[2]}\mathbf{a}^{[1]} + \mathbf{b}^{[2]}$$
$$\mathbf{a}^{[2]} = \sigma(\mathbf{z}^{[2]})$$

General Layer $l$:

$$\mathbf{z}^{[l]} = \mathbf{W}^{[l]}\mathbf{a}^{[l-1]} + \mathbf{b}^{[l]}$$
$$\mathbf{a}^{[l]} = \sigma(\mathbf{z}^{[l]})$$

Output:

$$\hat{y} = \mathbf{a}^{[L]} = \sigma(\mathbf{z}^{[L]})$$
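The layer-by-layer recursion translates into a short loop. The sketch below assumes sigmoid in every layer (as in the equations above), treats the input as $\mathbf{a}^{[0]} = \mathbf{x}$, and uses a made-up 2 → 2 → 1 network; the `forward` function and its `layers` argument are illustrative names, not an established API.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, layers):
    """Run a forward pass through sigmoid layers.

    layers: list of (W, b) pairs, where W is a list of rows (n_out x n_in)
    and b has length n_out.
    Returns the activations of every layer, starting with a[0] = x.
    """
    activations = [x]
    for W, b in layers:
        a_prev = activations[-1]
        # z[l] = W[l] a[l-1] + b[l], computed one output neuron at a time
        z = [sum(w * a for w, a in zip(row, a_prev)) + b_i
             for row, b_i in zip(W, b)]
        # a[l] = sigma(z[l])
        activations.append([sigmoid(z_i) for z_i in z])
    return activations

# Made-up 2 -> 2 -> 1 network.
layers = [
    ([[0.5, -0.5], [1.0, 1.0]], [0.0, -1.0]),
    ([[1.0, -1.0]], [0.0]),
]

acts = forward([1.0, 1.0], layers)
y_hat = acts[-1][0]
print(acts)
```

Returning every layer's activations, rather than only the final output, is a deliberate choice: the backward pass needs them to compute gradients.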

5. Backpropagation (The Chain Rule)

Loss Function (Binary Cross-Entropy):

$$\mathcal{L}(\hat{y}, y) = -[y \log(\hat{y}) + (1-y)\log(1-\hat{y})]$$

Gradient for Output Layer:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{[L]}} = \frac{\partial \mathcal{L}}{\partial \mathbf{a}^{[L]}} \cdot \frac{\partial \mathbf{a}^{[L]}}{\partial \mathbf{z}^{[L]}} \cdot \frac{\partial \mathbf{z}^{[L]}}{\partial \mathbf{W}^{[L]}}$$

With a sigmoid output and binary cross-entropy loss, the first two factors combine into the output error $\delta^{[L]} = \partial \mathcal{L} / \partial \mathbf{z}^{[L]}$:

$$\delta^{[L]} = \mathbf{a}^{[L]} - \mathbf{y}$$
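The simple form of the output error follows from a short calculation. The derivation below is a sketch for a single sigmoid output $a = \sigma(z)$ with the binary cross-entropy loss defined above; the layer superscripts are dropped for readability.

```latex
\begin{align*}
\frac{\partial \mathcal{L}}{\partial a}
  &= -\frac{y}{a} + \frac{1-y}{1-a}
   = \frac{a - y}{a(1-a)} \\
\frac{\partial a}{\partial z}
  &= \sigma(z)\bigl(1 - \sigma(z)\bigr) = a(1-a) \\
\delta = \frac{\partial \mathcal{L}}{\partial z}
  &= \frac{a - y}{a(1-a)} \cdot a(1-a) = a - y
\end{align*}
```

The factor $a(1-a)$ cancels, which is one reason the sigmoid and cross-entropy pairing is numerically convenient.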

Gradient for Hidden Layer $l$:

$$\delta^{[l]} = (\mathbf{W}^{[l+1]})^T \delta^{[l+1]} \odot \sigma'(\mathbf{z}^{[l]})$$

Weight Update:

$$\mathbf{W}^{[l]} := \mathbf{W}^{[l]} - \alpha \frac{\partial \mathcal{L}}{\partial \mathbf{W}^{[l]}}$$

[Figure: Backpropagation: Computing Gradients. Forward pass (compute predictions): input x → hidden (z⁽¹⁾, a⁽¹⁾) → hidden (z⁽²⁾, a⁽²⁾) → output ŷ → loss L(ŷ, y). Backward pass (compute gradients): δ⁽³⁾ = ŷ - y → δ⁽²⁾ = (W⁽³⁾)ᵀδ⁽³⁾ ⊙ σ' → δ⁽¹⁾ = (W⁽²⁾)ᵀδ⁽²⁾ ⊙ σ' → ∂L/∂W⁽¹⁾]

Weight Update: W := W - α · ∂L/∂W

α = learning rate, ∂L/∂W = gradient from backpropagation
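The backward recursion and the weight update can be sketched for a small sigmoid network with a single output and binary cross-entropy loss. Two things here are assumptions that the text does not state explicitly: the weight gradient is taken as $\delta^{[l]} (\mathbf{a}^{[l-1]})^T$ and the bias gradient as $\delta^{[l]}$ (the standard results for this setup), and the network values, function names, and learning rate are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, layers):
    """Return activations a[0..L], with a[0] = x, for sigmoid layers."""
    acts = [x]
    for W, b in layers:
        z = [sum(w * a for w, a in zip(row, acts[-1])) + b_i
             for row, b_i in zip(W, b)]
        acts.append([sigmoid(z_i) for z_i in z])
    return acts

def bce(y_hat, y):
    return -(y * math.log(y_hat) + (1.0 - y) * math.log(1.0 - y_hat))

def backward(acts, layers, y):
    """Return [(dW, db), ...] per layer for one sigmoid output with BCE loss."""
    grads = [None] * len(layers)
    delta = [acts[-1][0] - y]                          # delta[L] = a[L] - y
    for l in range(len(layers) - 1, -1, -1):
        a_prev = acts[l]                               # input to this layer
        dW = [[d * a for a in a_prev] for d in delta]  # delta (a_prev)^T
        grads[l] = (dW, list(delta))                   # dL/db = delta
        if l > 0:
            W = layers[l][0]
            # (W^T delta) times sigma'(z) element-wise, with sigma' = a (1 - a)
            delta = [
                sum(W[k][j] * delta[k] for k in range(len(delta)))
                * a_prev[j] * (1.0 - a_prev[j])
                for j in range(len(a_prev))
            ]
    return grads

def sgd_step(layers, grads, lr):
    """Apply W := W - lr * dL/dW, and the same for b."""
    return [
        ([[w - lr * g for w, g in zip(row, grow)] for row, grow in zip(W, dW)],
         [b_i - lr * g for b_i, g in zip(b, db)])
        for (W, b), (dW, db) in zip(layers, grads)
    ]

# Made-up 2 -> 2 -> 1 network and a single training example.
layers = [
    ([[0.5, -0.5], [1.0, 1.0]], [0.0, -1.0]),
    ([[1.0, -1.0]], [0.0]),
]
x, y = [1.0, 1.0], 1.0

acts = forward(x, layers)
grads = backward(acts, layers, y)
loss_before = bce(acts[-1][0], y)

layers = sgd_step(layers, grads, lr=0.1)
loss_after = bce(forward(x, layers)[-1][0], y)
print(loss_before, loss_after)
```

After one small gradient step the loss on this example should go down. Comparing an entry of `dW` against a finite-difference estimate is a quick way to catch indexing mistakes in the backward pass.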


6. Loss Functions

Regression Loss (MSE):

$$\mathcal{L}_{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$

Classification Loss (Cross-Entropy):

Binary:

$$\mathcal{L}_{BCE} = -\frac{1}{n}\sum_{i=1}^{n}[y_i\log(\hat{y}_i) + (1-y_i)\log(1-\hat{y}_i)]$$

Multi-class:

$$\mathcal{L}_{CE} = -\frac{1}{n}\sum_{i=1}^{n}\sum_{c=1}^{C} y_{ic}\log(\hat{y}_{ic})$$
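The three losses can be written directly from their formulas. In the sketch below, the `eps` clipping is an added safeguard (not part of the formulas) that keeps `log` finite when a prediction is exactly 0 or 1, and the function names are illustrative.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error over n examples."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Binary cross-entropy; y_pred holds probabilities of the positive class."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip away from 0 and 1 (added safeguard)
        total += y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return -total / len(y_true)

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Multi-class cross-entropy; y_true rows are one-hot, y_pred rows are probabilities."""
    total = 0.0
    for y_row, p_row in zip(y_true, y_pred):
        total += sum(y * math.log(max(p, eps)) for y, p in zip(y_row, p_row))
    return -total / len(y_true)

print(mse([1.0, 2.0], [1.0, 4.0]))                      # (0 + 4) / 2 = 2.0
print(binary_cross_entropy([1.0, 0.0], [0.5, 0.5]))     # ln 2
print(categorical_cross_entropy([[1.0, 0.0, 0.0]], [[0.5, 0.25, 0.25]]))  # ln 2
```

With a one-hot target, only the log-probability of the true class contributes to the multi-class loss, which is why the last two calls give the same value.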

7. Weight Initialization

Why Not Initialize All Weights to Zero?

If $W^{[l]} = 0$ in every layer, all neurons in a layer compute the same function and receive the same gradient, so the symmetry is never broken and the network can't learn distinct features.

Xavier/Glorot Initialization (Sigmoid/Tanh):

$$W \sim \mathcal{N}\left(0, \frac{2}{n_{in} + n_{out}}\right)$$

He Initialization (ReLU):

$$W \sim \mathcal{N}\left(0, \frac{2}{n_{in}}\right)$$
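Both schemes amount to drawing weights from a zero-mean Gaussian with a layer-dependent spread. The sketch below reads the second argument of $\mathcal{N}$ as a variance, so the standard deviation passed to the sampler is its square root. The function names, the 64 → 32 layer size, and the fixed seed are assumptions for the example.

```python
import math
import random

def xavier_normal(n_in, n_out, rng):
    """Sample an n_out x n_in matrix with variance 2 / (n_in + n_out)."""
    std = math.sqrt(2.0 / (n_in + n_out))
    return [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]

def he_normal(n_in, n_out, rng):
    """Sample an n_out x n_in matrix with variance 2 / n_in."""
    std = math.sqrt(2.0 / n_in)
    return [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]

def sample_variance(W):
    """Empirical variance of all entries of W."""
    values = [w for row in W for w in row]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

rng = random.Random(0)  # fixed seed so the run is repeatable

# Example layer: 64 inputs feeding 32 neurons.
W_he = he_normal(n_in=64, n_out=32, rng=rng)
W_xavier = xavier_normal(n_in=64, n_out=32, rng=rng)

print(sample_variance(W_he), 2.0 / 64)             # empirical vs target variance
print(sample_variance(W_xavier), 2.0 / (64 + 32))
```

The empirical variances should land close to the targets. He initialization uses the larger variance because ReLU zeroes out roughly half of its inputs.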

8. Optimizers

[Figure: Optimization Algorithms Comparison. Convergence paths for Vanilla GD, SGD (noisy), and Adam (fastest convergence)]

Adam Optimizer (Most Popular)

Adam Update Rules:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t \quad \text{(1st moment: mean)}$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \quad \text{(2nd moment: uncentered variance)}$$
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} \quad \text{(bias correction)}$$
$$\theta_t = \theta_{t-1} - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

Defaults: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$, $\alpha = 0.001$
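The four update rules map onto a short loop. The sketch below applies them per parameter to a made-up two-parameter quadratic. The function `adam_minimize`, the toy objective, and the learning rate of 0.05 (larger than the default, so the toy problem converges in few steps) are all assumptions for illustration.

```python
import math

def adam_minimize(grad_fn, theta, steps, alpha=0.001, beta1=0.9,
                  beta2=0.999, eps=1e-8):
    """Minimize with Adam; grad_fn(theta) returns the gradient as a list."""
    m = [0.0] * len(theta)  # 1st moment estimate
    v = [0.0] * len(theta)  # 2nd moment estimate
    theta = list(theta)
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        for i, g_i in enumerate(g):
            m[i] = beta1 * m[i] + (1.0 - beta1) * g_i
            v[i] = beta2 * v[i] + (1.0 - beta2) * g_i * g_i
            m_hat = m[i] / (1.0 - beta1 ** t)  # bias correction
            v_hat = v[i] / (1.0 - beta2 ** t)
            theta[i] -= alpha * m_hat / (math.sqrt(v_hat) + eps)
    return theta

# Toy objective (made up): f(theta) = (theta_0 - 3)^2 + (theta_1 + 1)^2
def grad(theta):
    return [2.0 * (theta[0] - 3.0), 2.0 * (theta[1] + 1.0)]

theta = adam_minimize(grad, [0.0, 0.0], steps=2000, alpha=0.05)
print(theta)  # should end up close to the minimum at [3, -1]
```

On the first step the bias-corrected ratio is approximately the sign of the gradient, so each parameter moves by about `alpha` regardless of the gradient's scale. That scale invariance is a large part of why Adam needs little tuning.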


Key Takeaways

  1. Perceptron = weighted sum + activation — the basic unit
  2. MLP = multiple perceptrons in layers — universal approximator
  3. Activation functions introduce non-linearity — ReLU is default
  4. Backpropagation = chain rule applied recursively — computes gradients
  5. Adam optimizer is the default choice for most problems
  6. Weight initialization matters — use He init for ReLU
