# Neural Networks: Perceptron to MLP

## The Building Block of Deep Learning

Every deep learning model, from CNNs to Transformers, is built from the same neural-network components. Understanding the math behind them is essential.

### Deep Learning Architecture Family

*Figure: all architectures are built from the same fundamental building blocks.*
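## 1. The Perceptron (Single Neuron)

**Mathematical Model:**

<MathBlock tex={`y = \\sigma(z), \\quad z = \\mathbf{w}^T\\mathbf{x} + b = w_1x_1 + w_2x_2 + \\cdots + w_nx_n + b`} display={true} />

Where:

- <MathBlock tex={`\\mathbf{x}`} /> = input vector
- <MathBlock tex={`\\mathbf{w}`} /> = weight vector (learned parameters)
- <MathBlock tex={`b`} /> = bias term
- <MathBlock tex={`\\sigma`} /> = activation function

*Figure: a single neuron. Each input is multiplied by its weight, the products are summed with the bias b to give z, and the activation σ(z) produces the output y.*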
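The perceptron model above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names `sigmoid` and `perceptron`, and the example inputs, weights and bias, are assumptions chosen so the arithmetic is easy to follow by hand.

```python
import math

def sigmoid(z):
    """Logistic activation: squashes any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def perceptron(x, w, b, activation=sigmoid):
    """Single neuron: weighted sum of the inputs plus bias, then activation."""
    z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return activation(z)

# Hypothetical inputs, weights and bias, chosen so that z is exactly 0.
x = [1.0, 2.0, 3.0]
w = [0.5, -1.0, 0.5]
b = 0.0

y = perceptron(x, w, b)
print(y)  # z = 0.5 - 2.0 + 1.5 = 0.0, and sigmoid(0) = 0.5
```

Passing a different `activation` callable swaps the non-linearity without touching the weighted sum.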



---



## 2. Activation Functions



<div className="my-6 rounded-xl bg-gradient-to-r from-purple-50 to-violet-50 p-6 border border-purple-200">



**Why Activation Functions?**



Without activation functions, stacked layers collapse into a single linear map, so the network is no more expressive than linear regression:





<MathBlock tex={`\\hat{y} = W_2(W_1\\mathbf{x} + b_1) + b_2 = W_2W_1\\mathbf{x} + W_2b_1 + b_2 = W'\\mathbf{x} + b'`} display={true} />





Activation functions introduce **non-linearity**, allowing the network to learn complex patterns.



</div>
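The collapse of two linear layers into one can be checked numerically. The sketch below uses small hand-picked matrices (all values are illustrative assumptions) and plain-Python helpers `matvec`, `matmul` and `add` written only for this example.

```python
def matvec(W, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def matmul(A, B):
    """Matrix-matrix product for list-of-rows matrices."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    """Element-wise vector addition."""
    return [a + b for a, b in zip(u, v)]

# Hypothetical 2 -> 3 -> 1 network with NO activation between the layers.
W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]
b1 = [1.0, 0.0, -2.0]
W2 = [[2.0, -1.0, 0.5]]
b2 = [4.0]
x = [3.0, -2.0]

# Two linear layers applied in sequence.
two_layer = add(matvec(W2, add(matvec(W1, x), b1)), b2)

# The equivalent single layer: W' = W2 W1 and b' = W2 b1 + b2.
W_prime = matmul(W2, W1)
b_prime = add(matvec(W2, b1), b2)
one_layer = add(matvec(W_prime, x), b_prime)

print(two_layer, one_layer)  # both are [4.5]
```

Inserting any non-linear function between the two layers breaks this equivalence, which is exactly what gives depth its extra expressive power.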



### Sigmoid, Tanh, ReLU Comparison



<svg viewBox="0 0 700 300" className="w-full my-6" xmlns="http://www.w3.org/2000/svg">

  <defs>

    <linearGradient id="sigGrad" x1="0%" y1="0%" x2="100%" y2="0%">

      <stop offset="0%" style={{stopColor:'#6366F1',stopOpacity:1}} />

      <stop offset="100%" style={{stopColor:'#8B5CF6',stopOpacity:1}} />

    </linearGradient>

  </defs>

  

  <text x="350" y="25" textAnchor="middle" fill="#1E293B" fontSize="16" fontWeight="bold">Activation Functions Comparison</text>

  

  {/* Axes */}

  <line x1="50" y1="150" x2="650" y2="150" stroke="#CBD5E1" strokeWidth="1.5" />

  <line x1="350" y1="30" x2="350" y2="270" stroke="#CBD5E1" strokeWidth="1.5" />

  

  {/* Grid */}

  {[-4,-3,-2,-1,0,1,2,3,4].map(x => (

    <line key={x} x1={350 + x * 75} y1="145" x2={350 + x * 75} y2="155" stroke="#CBD5E1" strokeWidth="1" />

  ))}

  

  {/* Sigmoid */}

  <path d="M 50,240 Q 100,238 150,235 Q 200,230 250,215 Q 300,180 350,150 Q 400,120 450,85 Q 500,70 550,62 Q 600,60 650,58" 

        fill="none" stroke="#6366F1" strokeWidth="3" />

  

  {/* Tanh */}

  <path d="M 50,260 Q 100,255 150,248 Q 200,235 250,200 Q 300,160 350,150 Q 400,140 450,100 Q 500,65 550,42 Q 600,38 650,36" 

        fill="none" stroke="#10B981" strokeWidth="2.5" strokeDasharray="8 4" />

  

  {/* ReLU */}

  <path d="M 50,150 L 350,150 L 650,60" fill="none" stroke="#EF4444" strokeWidth="3" />

  

  {/* Labels */}

  <circle cx="60" cy="60" r="12" fill="#6366F1" />

  <text x="78" y="65" fill="#6366F1" fontSize="12" fontWeight="bold">Sigmoid</text>

  <text x="78" y="80" fill="#64748B" fontSize="9">σ(z) = 1/(1+e⁻ᶻ)</text>

  

  <circle cx="200" cy="60" r="12" fill="#10B981" />

  <text x="218" y="65" fill="#10B981" fontSize="12" fontWeight="bold">Tanh</text>

  <text x="218" y="80" fill="#64748B" fontSize="9">tanh(z) = (eᶻ-e⁻ᶻ)/(eᶻ+e⁻ᶻ)</text>

  

  <circle cx="370" cy="60" r="12" fill="#EF4444" />

  <text x="388" y="65" fill="#EF4444" fontSize="12" fontWeight="bold">ReLU</text>

  <text x="388" y="80" fill="#64748B" fontSize="9">max(0, z)</text>

  

  {/* Key points */}

  <text x="50" y="255" fill="#64748B" fontSize="10">-4</text>

  <text x="650" y="255" fill="#64748B" fontSize="10">+4</text>

  <text x="350" y="170" fill="#64748B" fontSize="10">0</text>

  

  {/* Comparison table */}

  <rect x="50" y="280" width="600" height="20" rx="4" fill="#F1F5F9" />

  <text x="150" y="293" textAnchor="middle" fill="#475569" fontSize="10" fontWeight="bold">Sigmoid range: (0,1)</text>

  <text x="350" y="293" textAnchor="middle" fill="#475569" fontSize="10" fontWeight="bold">Tanh range: (-1,1)</text>

  <text x="550" y="293" textAnchor="middle" fill="#475569" fontSize="10" fontWeight="bold">ReLU range: [0,∞)</text>

</svg>



### Activation Function Formulas



<div className="grid grid-cols-1 md:grid-cols-3 gap-4 my-6">

  <div className="rounded-xl bg-blue-50 p-4 border border-blue-200">

    <h4 className="font-bold text-blue-800 mb-2">Sigmoid</h4>

    <div className="text-sm text-blue-700">

      

<MathBlock tex={`\\sigma(z) = \\frac{1}{1 + e^{-z}}`} display={true} />



      <br/><br/>

      

<MathBlock tex={`\\sigma'(z) = \\sigma(z)(1 - \\sigma(z))`} display={true} />



    </div>

    <p className="text-xs text-blue-600 mt-2">Output: (0, 1), good for probabilities</p>

  </div>

  

  <div className="rounded-xl bg-green-50 p-4 border border-green-200">

    <h4 className="font-bold text-green-800 mb-2">Tanh</h4>

    <div className="text-sm text-green-700">

      

<MathBlock tex={`\\tanh(z) = \\frac{e^z - e^{-z}}{e^z + e^{-z}}`} display={true} />



      <br/><br/>

      

<MathBlock tex={`\\tanh'(z) = 1 - \\tanh^2(z)`} display={true} />



    </div>

    <p className="text-xs text-green-600 mt-2">Output: (-1, 1), zero-centered</p>

  </div>

  

  <div className="rounded-xl bg-red-50 p-4 border border-red-200">

    <h4 className="font-bold text-red-800 mb-2">ReLU</h4>

    <div className="text-sm text-red-700">

      

<MathBlock tex={`\\text{ReLU}(z) = \\max(0, z)`} display={true} />



      <br/><br/>

      

<MathBlock tex={`\\text{ReLU}'(z) = \\begin{cases} 1 & z > 0 \\\\ 0 & z \\leq 0 \\end{cases}`} display={true} />



    </div>

    <p className="text-xs text-red-600 mt-2">Output: [0, ∞), most popular in hidden layers</p>

  </div>

</div>
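The three activations and their derivatives translate directly into code. The sketch below compares each analytic derivative against a central finite difference; the helper `numerical_derivative` and the sample points are assumptions made for this check only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh_prime(z):
    return 1.0 - math.tanh(z) ** 2

def relu(z):
    return max(0.0, z)

def relu_prime(z):
    return 1.0 if z > 0 else 0.0

def numerical_derivative(f, z, h=1e-6):
    """Central finite difference, used only to sanity-check the formulas."""
    return (f(z + h) - f(z - h)) / (2.0 * h)

# z = 0 is skipped because ReLU has a kink there.
for z in (-2.0, 0.5, 3.0):
    print(
        z,
        abs(sigmoid_prime(z) - numerical_derivative(sigmoid, z)),
        abs(tanh_prime(z) - numerical_derivative(math.tanh, z)),
        abs(relu_prime(z) - numerical_derivative(relu, z)),
    )
```

Each printed difference should be tiny, which confirms that the closed-form derivatives match the slopes of the functions.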



### Leaky ReLU and GELU (Modern Alternatives)



<svg viewBox="0 0 700 200" className="w-full my-6" xmlns="http://www.w3.org/2000/svg">

  <text x="350" y="25" textAnchor="middle" fill="#1E293B" fontSize="14" fontWeight="bold">Modern Activation Functions</text>

  

  {/* Leaky ReLU */}

  <rect x="30" y="45" width="310" height="140" rx="10" fill="#FEF3C7" stroke="#F59E0B" strokeWidth="1.5" />

  <text x="185" y="65" textAnchor="middle" fill="#92400E" fontSize="12" fontWeight="bold">Leaky ReLU</text>

  <path d="M 50,134 L 185,130 L 320,75" fill="none" stroke="#F59E0B" strokeWidth="2.5" />

  <text x="185" y="175" textAnchor="middle" fill="#B45309" fontSize="10">f(z) = max(αz, z), α = 0.01</text>

  

  {/* GELU */}

  <rect x="360" y="45" width="310" height="140" rx="10" fill="#EDE9FE" stroke="#8B5CF6" strokeWidth="1.5" />

  <text x="515" y="65" textAnchor="middle" fill="#6D28D9" fontSize="12" fontWeight="bold">GELU (Transformers)</text>

  <path d="M 395,140 L 455,141 Q 490,148 515,140 Q 535,128 545,115 L 605,50"

        fill="none" stroke="#8B5CF6" strokeWidth="2.5" />

  <text x="515" y="175" textAnchor="middle" fill="#7C3AED" fontSize="10">f(z) = z·Φ(z), Φ = standard Gaussian CDF</text>

</svg>
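Both modern activations are one-liners. This sketch assumes the exact GELU form, with the standard Gaussian CDF written via `math.erf`; many libraries also offer a tanh-based approximation, which is not shown here.

```python
import math

def leaky_relu(z, alpha=0.01):
    """max(alpha * z, z): a small slope alpha for negative inputs (0 < alpha < 1)."""
    return max(alpha * z, z)

def gelu(z):
    """z * Phi(z), where Phi is the standard Gaussian CDF."""
    phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return z * phi

for z in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(z, leaky_relu(z), round(gelu(z), 4))
```

Unlike ReLU, both functions pass a non-zero signal for negative inputs: Leaky ReLU with a fixed small slope, GELU with a smooth dip just below zero.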



---



## 3. Multi-Layer Perceptron (MLP)



*Figure: Multi-Layer Perceptron (MLP) architecture. Input layer → Hidden 1 (64 neurons) → Hidden 2 (32 neurons) → Hidden 3 (16 neurons) → Output (1 neuron). Forward propagation runs left to right, with the hidden activations h1, h2 and h3 feeding the output y.*



---



## 4. Forward Propagation (Mathematical)



<div className="my-6 rounded-xl bg-gradient-to-r from-indigo-50 to-blue-50 p-6 border border-indigo-200">



**Layer-by-Layer Computation:**



**Layer 1:**



<MathBlock tex={`\\mathbf{z}^{[1]} = \\mathbf{W}^{[1]}\\mathbf{x} + \\mathbf{b}^{[1]}`} display={true} />





<MathBlock tex={`\\mathbf{a}^{[1]} = \\sigma(\\mathbf{z}^{[1]})`} display={true} />





**Layer 2:**



<MathBlock tex={`\\mathbf{z}^{[2]} = \\mathbf{W}^{[2]}\\mathbf{a}^{[1]} + \\mathbf{b}^{[2]}`} display={true} />





<MathBlock tex={`\\mathbf{a}^{[2]} = \\sigma(\\mathbf{z}^{[2]})`} display={true} />





**General Layer <MathBlock tex={`l`} />:**



<MathBlock tex={`\\mathbf{z}^{[l]} = \\mathbf{W}^{[l]}\\mathbf{a}^{[l-1]} + \\mathbf{b}^{[l]}`} display={true} />





<MathBlock tex={`\\mathbf{a}^{[l]} = \\sigma(\\mathbf{z}^{[l]})`} display={true} />





**Output:**



<MathBlock tex={`\\hat{y} = \\mathbf{a}^{[L]} = \\sigma(\\mathbf{z}^{[L]})`} display={true} />





</div>
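The layer-by-layer recurrence maps onto a short loop. The sketch below assumes sigmoid activations in every layer and a tiny 2-3-1 network whose weights are made up for illustration; `dense` and `forward` are hypothetical helper names.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(W, b, a_prev):
    """Pre-activation of one layer: z = W a_prev + b."""
    return [sum(w * a for w, a in zip(row, a_prev)) + b_i
            for row, b_i in zip(W, b)]

def forward(x, layers):
    """Run x through every (W, b) pair; return the activations a[0], ..., a[L]."""
    activations = [x]  # a[0] is the input itself
    for W, b in layers:
        z = dense(W, b, activations[-1])
        activations.append([sigmoid(z_i) for z_i in z])
    return activations

# Hypothetical 2-3-1 network: each W has one row per neuron in that layer.
layers = [
    ([[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]], [0.0, 0.1, -0.1]),
    ([[0.3, -0.1, 0.2]], [0.05]),
]

acts = forward([1.0, 2.0], layers)
y_hat = acts[-1][0]
print(len(acts[1]), y_hat)  # 3 hidden activations and one prediction in (0, 1)
```

Keeping every intermediate activation, rather than only the final output, is deliberate: backpropagation reuses them.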



---



## 5. Backpropagation (The Chain Rule)



<div className="my-6 rounded-xl bg-gradient-to-r from-rose-50 to-pink-50 p-6 border border-rose-200">



**Loss Function (Binary Cross-Entropy):**





<MathBlock tex={`\\mathcal{L}(\\hat{y}, y) = -[y \\log(\\hat{y}) + (1-y)\\log(1-\\hat{y})]`} display={true} />





**Gradient for Output Layer:**





<MathBlock tex={`\\frac{\\partial \\mathcal{L}}{\\partial \\mathbf{W}^{[L]}} = \\frac{\\partial \\mathcal{L}}{\\partial \\mathbf{a}^{[L]}} \\cdot \\frac{\\partial \\mathbf{a}^{[L]}}{\\partial \\mathbf{z}^{[L]}} \\cdot \\frac{\\partial \\mathbf{z}^{[L]}}{\\partial \\mathbf{W}^{[L]}}`} display={true} />







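For a sigmoid output trained with binary cross-entropy, the first two factors combine into the output-layer error <MathBlock tex={`\\delta^{[L]} = \\partial \\mathcal{L} / \\partial \\mathbf{z}^{[L]}`} />:

<MathBlock tex={`\\delta^{[L]} = \\mathbf{a}^{[L]} - \\mathbf{y}`} display={true} />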





**Gradient for Hidden Layer <MathBlock tex={`l`} />:**





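<MathBlock tex={`\\delta^{[l]} = (\\mathbf{W}^{[l+1]})^T \\delta^{[l+1]} \\odot \\sigma'(\\mathbf{z}^{[l]})`} display={true} />

The parameter gradients for any layer follow from its error term, with <MathBlock tex={`\\mathbf{a}^{[0]} = \\mathbf{x}`} />:

<MathBlock tex={`\\frac{\\partial \\mathcal{L}}{\\partial \\mathbf{W}^{[l]}} = \\delta^{[l]} (\\mathbf{a}^{[l-1]})^T, \\quad \\frac{\\partial \\mathcal{L}}{\\partial \\mathbf{b}^{[l]}} = \\delta^{[l]}`} display={true} />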





**Weight Update:**





<MathBlock tex={`\\mathbf{W}^{[l]} := \\mathbf{W}^{[l]} - \\alpha \\frac{\\partial \\mathcal{L}}{\\partial \\mathbf{W}^{[l]}}`} display={true} />





</div>



<svg viewBox="0 0 700 300" className="w-full my-6" xmlns="http://www.w3.org/2000/svg">

  <text x="350" y="25" textAnchor="middle" fill="#1E293B" fontSize="16" fontWeight="bold">Backpropagation: Computing Gradients</text>

  

  {/* Forward pass */}

  <text x="350" y="55" textAnchor="middle" fill="#10B981" fontSize="12" fontWeight="bold">Forward Pass (compute predictions)</text>

  

  {[

    {x: 80, label: 'Input\nx', color: '#6366F1'},

    {x: 230, label: 'Hidden\nz[1], a[1]', color: '#8B5CF6'},

    {x: 380, label: 'Hidden\nz[2], a[2]', color: '#A78BFA'},

    {x: 530, label: 'Output\nŷ', color: '#10B981'}

  ].map((node, i) => (

    <g key={i}>

      <rect x={node.x - 40} y="70" width="80" height="45" rx="10" fill={node.color} opacity="0.2" stroke={node.color} strokeWidth="1.5" />

      <text x={node.x} y="88" textAnchor="middle" fill={node.color} fontSize="10" fontWeight="bold">

        {node.label.split('\n')[0]}

      </text>

      <text x={node.x} y="103" textAnchor="middle" fill="#475569" fontSize="9">

        {node.label.split('\n')[1]}

      </text>

      {i < 3 && (

        <line x1={node.x + 40} y1="92" x2={node.x + 110} y2="92" stroke={node.color} strokeWidth="2" />

      )}

    </g>

  ))}

  

  {/* Loss */}

  <rect x="600" y="70" width="80" height="45" rx="10" fill="#EF4444" opacity="0.2" stroke="#EF4444" strokeWidth="1.5" />

  <text x="640" y="88" textAnchor="middle" fill="#EF4444" fontSize="10" fontWeight="bold">Loss</text>

  <text x="640" y="103" textAnchor="middle" fill="#991B1B" fontSize="9">L(ŷ, y)</text>

  

  {/* Backward pass */}

  <text x="350" y="150" textAnchor="middle" fill="#EF4444" fontSize="12" fontWeight="bold">Backward Pass (compute gradients)</text>

  

  {[

    {x: 530, label: 'δ[3] = ŷ - y', color: '#EF4444'},

    {x: 380, label: 'δ[2] = Wᵀδ[3] ⊙ σ\'', color: '#F59E0B'},

    {x: 230, label: 'δ[1] = Wᵀδ[2] ⊙ σ\'', color: '#F97316'},

    {x: 80, label: '∂L/∂W[l]', color: '#DC2626'}

  ].map((node, i) => (

    <g key={i}>

      <rect x={node.x - 55} y="170" width="110" height="35" rx="8" fill={node.color} opacity="0.15" stroke={node.color} strokeWidth="1.5" />

      <text x={node.x} y="192" textAnchor="middle" fill={node.color} fontSize="9" fontWeight="bold">

        {node.label}

      </text>

      {i < 3 && (

        <line x1={node.x - 55} y1="187" x2={node.x - 120} y2="187" stroke={node.color} strokeWidth="2" />

      )}

    </g>

  ))}

  

  {/* Update rule */}

  <rect x="150" y="230" width="400" height="50" rx="12" fill="#D1FAE5" stroke="#10B981" strokeWidth="2" />

  <text x="350" y="250" textAnchor="middle" fill="#065F46" fontSize="12" fontWeight="bold">

    Weight Update: W := W - α · ∂L/∂W

  </text>

  <text x="350" y="268" textAnchor="middle" fill="#059669" fontSize="10">

    α = learning rate, ∂L/∂W = gradient from backpropagation

  </text>

</svg>
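A useful way to trust a backpropagation implementation is to compare it against a finite-difference gradient. The sketch below assumes sigmoid activations in every layer, a single sigmoid output with binary cross-entropy, and one training example; the network weights and the helper names `backward` and `loss_at` are illustrative assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, layers):
    """Return the activations a[0], ..., a[L] for a list of (W, b) layers."""
    activations = [x]
    for W, b in layers:
        z = [sum(w * a for w, a in zip(row, activations[-1])) + b_i
             for row, b_i in zip(W, b)]
        activations.append([sigmoid(z_i) for z_i in z])
    return activations

def bce(y_hat, y):
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def backward(x, y, layers):
    """Per-layer (dW, db) for one example, using the delta recurrences."""
    acts = forward(x, layers)
    delta = [acts[-1][0] - y]  # output error: a[L] - y
    grads = [None] * len(layers)
    for l in reversed(range(len(layers))):
        a_prev = acts[l]
        dW = [[d * a for a in a_prev] for d in delta]  # delta times a_prev^T
        grads[l] = (dW, list(delta))
        if l > 0:
            W = layers[l][0]
            # (W^T delta) times sigma'(z), where sigma'(z) = a (1 - a)
            delta = [sum(W[k][j] * delta[k] for k in range(len(delta)))
                     * a_prev[j] * (1.0 - a_prev[j])
                     for j in range(len(a_prev))]
    return grads

# Hypothetical 2-3-1 network and a single labelled example.
layers = [
    ([[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]], [0.0, 0.1, -0.1]),
    ([[0.3, -0.1, 0.2]], [0.05]),
]
x, y = [1.0, 2.0], 1.0
grads = backward(x, y, layers)

def loss_at(w00):
    """Loss as a function of the single weight W1[0][0], all else fixed."""
    W1 = [row[:] for row in layers[0][0]]
    W1[0][0] = w00
    return bce(forward(x, [(W1, layers[0][1]), layers[1]])[-1][0], y)

eps = 1e-6
numeric = (loss_at(0.1 + eps) - loss_at(0.1 - eps)) / (2 * eps)
analytic = grads[0][0][0][0]
print(analytic, numeric)  # the two values should agree closely
```

If the analytic and numeric values disagree by more than rounding noise, the usual suspects are a missing transpose or an activation derivative evaluated at the wrong layer.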



---



## 6. Loss Functions



<div className="my-6 rounded-xl bg-gradient-to-r from-amber-50 to-yellow-50 p-6 border border-amber-200">



**Regression Loss (MSE):**





<MathBlock tex={`\\mathcal{L}_{MSE} = \\frac{1}{n}\\sum_{i=1}^{n}(y_i - \\hat{y}_i)^2`} display={true} />





**Classification Loss (Cross-Entropy):**



Binary: 

<MathBlock tex={`\\mathcal{L}_{BCE} = -\\frac{1}{n}\\sum_{i=1}^{n}[y_i\\log(\\hat{y}_i) + (1-y_i)\\log(1-\\hat{y}_i)]`} display={true} />





Multi-class: 

<MathBlock tex={`\\mathcal{L}_{CE} = -\\frac{1}{n}\\sum_{i=1}^{n}\\sum_{c=1}^{C} y_{ic}\\log(\\hat{y}_{ic})`} display={true} />





</div>
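The three losses can be written directly from their formulas. In this sketch the clipping constant `eps`, which guards against taking the log of zero, is an added assumption and not part of the formulas above; the sample values are made up.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error over n examples."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean BCE; predictions are clipped to [eps, 1 - eps] to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean multi-class CE; y_true rows are one-hot, y_pred rows are probabilities."""
    total = 0.0
    for row_t, row_p in zip(y_true, y_pred):
        total += sum(t * math.log(max(p, eps)) for t, p in zip(row_t, row_p))
    return -total / len(y_true)

print(mse([1.0, 2.0], [1.0, 4.0]))                    # (0 + 4) / 2 = 2.0
print(binary_cross_entropy([1, 0], [0.5, 0.5]))       # log(2), about 0.693
print(cross_entropy([[1, 0, 0]], [[0.5, 0.25, 0.25]]))  # also log(2)
```

A prediction of 0.5 on every binary example gives a loss of log 2, a handy baseline: a trained classifier should score well below it.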



---



## 7. Weight Initialization



<div className="my-6 rounded-xl bg-gradient-to-r from-cyan-50 to-teal-50 p-6 border border-cyan-200">



**Why Not Initialize All Weights to Zero?**



If <MathBlock tex={`W^{[l]} = 0`} />, all neurons in a layer compute the same function and receive the same gradient → no symmetry breaking → the network can't learn distinct features.



**Xavier/Glorot Initialization (Sigmoid/Tanh):**





<MathBlock tex={`W \\sim \\mathcal{N}\\left(0, \\frac{2}{n_{in} + n_{out}}\\right)`} display={true} />





**He Initialization (ReLU):**





<MathBlock tex={`W \\sim \\mathcal{N}\\left(0, \\frac{2}{n_{in}}\\right)`} display={true} />





</div>
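Both schemes draw from a zero-mean Gaussian and differ only in the variance. The sketch below reads the second argument of the normal distribution as a variance, so the standard deviation passed to the sampler is its square root; the layer sizes and the function names `xavier_normal` and `he_normal` are illustrative assumptions.

```python
import math
import random

def xavier_normal(n_in, n_out, rng):
    """Weights with variance 2 / (n_in + n_out), shaped n_out rows by n_in columns."""
    std = math.sqrt(2.0 / (n_in + n_out))
    return [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]

def he_normal(n_in, n_out, rng):
    """Weights with variance 2 / n_in, shaped n_out rows by n_in columns."""
    std = math.sqrt(2.0 / n_in)
    return [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]

def sample_variance(W):
    """Mean of squared entries (the mean is zero by construction)."""
    flat = [w for row in W for w in row]
    return sum(w * w for w in flat) / len(flat)

rng = random.Random(0)  # fixed seed so the run is repeatable
W_he = he_normal(256, 128, rng)
W_xavier = xavier_normal(256, 128, rng)

print(sample_variance(W_he), 2.0 / 256)              # empirical vs target variance
print(sample_variance(W_xavier), 2.0 / (256 + 128))
```

Because each neuron gets different random weights, the symmetry described above is broken from the first step.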



---



## 8. Optimizers



<svg viewBox="0 0 700 250" className="w-full my-6" xmlns="http://www.w3.org/2000/svg">

  <text x="350" y="25" textAnchor="middle" fill="#1E293B" fontSize="16" fontWeight="bold">Optimization Algorithms Comparison</text>

  

  {/* Gradient Descent path */}

  <path d="M 100,200 Q 150,195 200,180 Q 250,150 300,120 Q 350,100 400,90 Q 450,85 500,80" 

        fill="none" stroke="#CBD5E1" strokeWidth="2" />

  <circle cx="100" cy="200" r="5" fill="#CBD5E1" />

  <circle cx="500" cy="80" r="8" fill="#CBD5E1" />

  <text x="300" y="215" textAnchor="middle" fill="#64748B" fontSize="10">Vanilla GD</text>

  

  {/* SGD path (noisy) */}

  <path d="M 100,200 Q 130,190 160,195 Q 190,170 220,160 Q 250,145 280,130 Q 310,120 340,105 Q 370,95 400,88 Q 430,82 460,78 Q 490,75 520,72" 

        fill="none" stroke="#6366F1" strokeWidth="2" strokeDasharray="4 3" />

  <text x="300" y="235" textAnchor="middle" fill="#6366F1" fontSize="10">SGD (noisy)</text>

  

  {/* Adam path (smooth, fast) */}

  <path d="M 100,200 Q 180,185 260,140 Q 340,95 420,75 Q 500,65 550,60" 

        fill="none" stroke="#10B981" strokeWidth="3" />

  <circle cx="550" cy="60" r="10" fill="#10B981" />

  <text x="350" y="250" textAnchor="middle" fill="#10B981" fontSize="11" fontWeight="bold">Adam (often fastest convergence)</text>

  

  {/* Legend */}

  <line x1="100" y1="45" x2="140" y2="45" stroke="#CBD5E1" strokeWidth="2" />

  <text x="148" y="49" fill="#64748B" fontSize="10">Vanilla GD</text>

  <line x1="220" y1="45" x2="260" y2="45" stroke="#6366F1" strokeWidth="2" strokeDasharray="4 3" />

  <text x="268" y="49" fill="#6366F1" fontSize="10">SGD</text>

  <line x1="320" y1="45" x2="360" y2="45" stroke="#10B981" strokeWidth="3" />

  <text x="368" y="49" fill="#10B981" fontSize="10">Adam</text>

</svg>



### Adam Optimizer (Most Popular)



<div className="my-6 rounded-xl bg-gradient-to-r from-emerald-50 to-green-50 p-6 border border-emerald-200">



**Adam Update Rules:**





<MathBlock tex={`m_t = \\beta_1 m_{t-1} + (1 - \\beta_1) g_t \\quad \\text{(1st moment: mean)}`} display={true} />







<MathBlock tex={`v_t = \\beta_2 v_{t-1} + (1 - \\beta_2) g_t^2 \\quad \\text{(2nd moment: uncentered variance)}`} display={true} />







<MathBlock tex={`\\hat{m}_t = \\frac{m_t}{1 - \\beta_1^t}, \\quad \\hat{v}_t = \\frac{v_t}{1 - \\beta_2^t} \\quad \\text{(bias correction)}`} display={true} />







<MathBlock tex={`\\theta_t = \\theta_{t-1} - \\alpha \\frac{\\hat{m}_t}{\\sqrt{\\hat{v}_t} + \\epsilon}`} display={true} />





**Defaults:** <MathBlock tex={`\\beta_1 = 0.9`} />, <MathBlock tex={`\\beta_2 = 0.999`} />, <MathBlock tex={`\\epsilon = 10^{-8}`} />, <MathBlock tex={`\\alpha = 0.001`} />



</div>
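The four update rules fit in one small function. The sketch below applies them element-wise to a list of parameters and runs them on a toy quadratic; the objective, the starting point and the larger learning rate of 0.05 are assumptions chosen so the example converges in a short loop.

```python
import math

def adam_step(theta, grad, m, v, t,
              alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count used for bias correction."""
    m = [beta1 * m_i + (1 - beta1) * g for m_i, g in zip(m, grad)]
    v = [beta2 * v_i + (1 - beta2) * g * g for v_i, g in zip(v, grad)]
    m_hat = [m_i / (1 - beta1 ** t) for m_i in m]
    v_hat = [v_i / (1 - beta2 ** t) for v_i in v]
    theta = [th - alpha * mh / (math.sqrt(vh) + eps)
             for th, mh, vh in zip(theta, m_hat, v_hat)]
    return theta, m, v

# Toy problem (an assumption for illustration): minimize (theta - 3)^2.
theta, m, v = [0.0], [0.0], [0.0]
history = []
for t in range(1, 2001):
    grad = [2.0 * (theta[0] - 3.0)]
    theta, m, v = adam_step(theta, grad, m, v, t, alpha=0.05)
    history.append(theta[0])

print(history[0])  # first step moves by roughly alpha, whatever the gradient size
print(theta[0])    # ends close to the minimum at 3
```

The first printed value shows a characteristic property of Adam: after bias correction the first step has size close to the learning rate, regardless of how large the raw gradient is.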



---



## Key Takeaways



1. **Perceptron** = weighted sum + activation: the basic unit

2. **MLP** = multiple perceptrons in layers: a universal approximator

3. **Activation functions** introduce non-linearity: ReLU is the usual default for hidden layers

4. **Backpropagation** = chain rule applied recursively: computes gradients

5. **Adam optimizer** is a common default choice for most problems

6. **Weight initialization** matters: use He init for ReLU



<div className="my-8 rounded-2xl bg-gradient-to-r from-violet-600 to-indigo-600 p-6 text-white">

  <h3 className="text-xl font-bold mb-2">Next: PyTorch Fundamentals</h3>

  <p className="text-violet-100">Implement neural networks in PyTorch with autograd, tensors, and GPU support.</p>

</div>