Neural Networks: Perceptron to MLP


Introduction

Neural networks are computational systems inspired by biological neurons. They learn to map inputs to outputs through layered transformations, which lets them approximate a very broad class of functions (made precise by the Universal Approximation Theorem later in this lesson).

Architecture Diagram
Neural Network Architecture:
═══════════════════════════════════════════════════════════════════

 Input Layer        Hidden Layers        Output Layer
 (Features)         (Learning)           (Prediction)

    x₁ ──────────┐
                 │    ┌───────┐
                 ├───►│ h₁⁽¹⁾ ├───┐
    x₂ ──────────┤    └───────┘   │    ┌───────┐
                 │                ├───►│ h₁⁽²⁾ ├───┐
                 │    ┌───────┐   │    └───────┘   │    ┌───────┐
    x₃ ──────────┼───►│ h₂⁽¹⁾ ├───┤                ├───►│   ŷ   │
                 │    └───────┘   │                │    └───────┘
                 │    ┌───────┐   │    ┌───────┐   │
    x₄ ──────────┼───►│ h₃⁽¹⁾ ├───┴───►│ h₂⁽²⁾ ├───┘
                 │    └───────┘        └───────┘
    x₅ ──────────┘

    ● = Neuron      ─ = Connection (weight)     │ = Layer
═══════════════════════════════════════════════════════════════════

The Perceptron

Single Perceptron

The simplest neural unit computes a weighted sum and applies an activation:

Definition: Perceptron

A perceptron computes a weighted sum of inputs, adds a bias, and applies a non-linear activation function to produce an output.

Perceptron Output

$$z = \sum_{i=1}^{n} w_i x_i + b = \mathbf{w}^T \mathbf{x} + b$$

Here,

  • $w_i$ = Weight for feature i
  • $x_i$ = Input feature i
  • $b$ = Bias term
  • $z$ = Weighted sum (pre-activation)

$$\hat{y} = \phi(z)$$

where $\phi$ is the activation function.

Architecture Diagram
Single Perceptron:
═══════════════════════════════════════════════════

    x₁ ──── w₁ ────┐
                   │
    x₂ ──── w₂ ────┤    ┌─────────┐
                   ├───►│ Σ wᵢxᵢ  │───► z ───► φ(z) ───► ŷ
    x₃ ──── w₃ ────┤    │ + b     │
                   │    └─────────┘
                 bias

    Mathematical:
    z = w₁x₁ + w₂x₂ + w₃x₃ + b
    ŷ = φ(z)
═══════════════════════════════════════════════════
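
The perceptron computation above can be sketched in a few lines of NumPy. This is a minimal illustration, not the lesson's own code: the function name `perceptron_forward`, the choice of sigmoid for φ, and the input, weight, and bias values are all assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    """Logistic activation, used here as one possible choice of phi."""
    return 1.0 / (1.0 + np.exp(-z))

def perceptron_forward(x, w, b, phi=sigmoid):
    """Return (z, y_hat) where z = w^T x + b and y_hat = phi(z)."""
    z = np.dot(w, x) + b
    return z, phi(z)

# Illustrative values (assumed, not taken from the lesson)
x = np.array([1.0, 2.0, 3.0])
w = np.array([0.2, -0.1, 0.4])
b = 0.5

z, y_hat = perceptron_forward(x, w, b)
print(f"z = {z:.2f}, y_hat = {y_hat:.4f}")
```

Passing a different callable as `phi` (for example `np.tanh`) changes only the last step; the weighted sum `z` stays the same.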

Activation Functions

import numpy as np
import matplotlib.pyplot as plt

# ═══════════════════════════════════════════════════
# Activation Functions Visualization
# ═══════════════════════════════════════════════════
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def relu(z):
    return np.maximum(0, z)

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

def softmax(z):
    exp_z = np.exp(z - np.max(z))
    return exp_z / exp_z.sum()

def gelu(z):
    return 0.5 * z * (1 + np.tanh(np.sqrt(2/np.pi) * (z + 0.044715 * z**3)))

# Plot all activations
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
z = np.linspace(-5, 5, 200)

activations = [
    ('Sigmoid', sigmoid(z), r'$\sigma(z) = \frac{1}{1+e^{-z}}$'),
    ('Tanh', tanh(z), r'$\tanh(z)$'),
    ('ReLU', relu(z), r'$\max(0, z)$'),
    ('Leaky ReLU', leaky_relu(z), r'$\max(0.01z, z)$'),
    ('GELU', gelu(z), r'$0.5z(1 + \tanh(\sqrt{2/\pi}(z+0.044715z^3)))$'),
    ('Derivatives', None, '')
]

for ax, (name, func, formula) in zip(axes.flat, activations):
    if name == 'Derivatives':
        # Plot derivatives
        sig_deriv = sigmoid(z) * (1 - sigmoid(z))
        tanh_deriv = 1 - tanh(z)**2
        relu_deriv = (z > 0).astype(float)

        ax.plot(z, sig_deriv, label='Sigmoid', linewidth=2)
        ax.plot(z, tanh_deriv, label='Tanh', linewidth=2)
        ax.plot(z, relu_deriv, label='ReLU', linewidth=2)
        ax.legend()
        ax.set_title('Derivatives')
    else:
        ax.plot(z, func, linewidth=2, color='steelblue')
        ax.set_title(f'{name}\n{formula}')
    ax.grid(True, alpha=0.3)
    ax.axhline(y=0, color='k', linewidth=0.5)
    ax.axvline(x=0, color='k', linewidth=0.5)

plt.suptitle("Activation Functions", fontsize=16)
plt.tight_layout()
plt.savefig('activation_functions.png', dpi=150)
plt.show()

Activation Function Properties

| Function | Range | Output Property | Vanishing Gradient | Use Case |
|---|---|---|---|---|
| Sigmoid | (0, 1) | Probabilities | Yes | Binary output |
| Tanh | (-1, 1) | Zero-centered | Yes | Hidden layers |
| ReLU | [0, ∞) | Sparse | No | Default choice |
| Leaky ReLU | (-∞, ∞) | No dead neurons | No | Avoiding the dying ReLU problem |
| GELU | ≈ [-0.17, ∞) | Smooth ReLU | No | Transformers |
| Softmax | (0, 1), sum = 1 | Probabilities | Yes | Multi-class output |
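
To make the "Vanishing Gradient" column concrete, the sketch below measures the largest slope each activation can contribute and raises it to the power of the depth. It is a rough best-case bound under a simplifying assumption: weight terms are ignored, and only the activation derivative is multiplied once per layer. The depth of 10 is an arbitrary choice for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5, 5, 1001)

# Largest value each derivative reaches on the grid
max_slope = {
    "sigmoid": np.max(sigmoid(z) * (1 - sigmoid(z))),
    "tanh": np.max(1 - np.tanh(z) ** 2),
    "relu": np.max((z > 0).astype(float)),
}

depth = 10  # assumed depth, for illustration only
for name, slope in max_slope.items():
    # slope ** depth is the best-case factor from activations alone
    print(f"{name:8s} max derivative = {slope:.2f}, "
          f"best case after {depth} layers = {slope ** depth:.1e}")
```

The sigmoid derivative never exceeds 0.25, so even in the best case its contribution shrinks geometrically with depth. Tanh peaks at 1 but only at z = 0; it still vanishes once units saturate. ReLU passes a slope of exactly 1 for every positive input.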

Multi-Layer Perceptron (MLP)

Forward Propagation

$$\mathbf{a}^{(l)} = \phi\left(\mathbf{W}^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}\right)$$

Here,

  • $\mathbf{a}^{(l)}$ = Activations at layer l
  • $\mathbf{W}^{(l)}$ = Weight matrix at layer l
  • $\mathbf{b}^{(l)}$ = Bias vector at layer l
  • $\phi$ = Activation function

For a network with $L$ layers:

$$\mathbf{a}^{(0)} = \mathbf{x} \quad \text{(input)}$$
$$\mathbf{z}^{(l)} = \mathbf{W}^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}$$
$$\mathbf{a}^{(l)} = \phi(\mathbf{z}^{(l)})$$
$$\hat{\mathbf{y}} = \mathbf{a}^{(L)} \quad \text{(output)}$$

Architecture Diagram
Forward Propagation Flow:
═══════════════════════════════════════════════════

 Layer 0 (Input)      Layer 1 (Hidden)      Layer 2 (Output)
 ─────────────────    ─────────────────     ─────────────────

 x₁ ────── w₁₁ ──────► a₁⁽¹⁾ ────── w₁₂ ──────►
               w₂₁ ↘              ↗ w₂₂
 x₂ ────── w₁₂ ──────► a₂⁽¹⁾ ────── w₂₂ ──────► ŷ
               w₃₁ ↗              ↘ w₃₂
 x₃ ────── w₁₃ ──────► a₃⁽¹⁾ ────── w₃₂ ──────►

 Step 1: z⁽¹⁾ = W⁽¹⁾x + b⁽¹⁾
 Step 2: a⁽¹⁾ = φ(z⁽¹⁾)
 Step 3: z⁽²⁾ = W⁽²⁾a⁽¹⁾ + b⁽²⁾
 Step 4: ŷ = φ(z⁽²⁾)
═══════════════════════════════════════════════════
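
The four steps above generalize to any number of layers with a simple loop. The following is a minimal NumPy sketch under stated assumptions: the function name `forward`, ReLU for hidden layers, sigmoid for the output, and the random weights are illustrative choices, not part of the lesson's code. The layer sizes mirror the 3-3-1 layout of the diagram.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases, hidden_act=relu, output_act=sigmoid):
    """Apply a^(l) = phi(W^(l) a^(l-1) + b^(l)) for l = 1..L.

    Returns the list [a^(0), a^(1), ..., a^(L)].
    """
    activations = [x]
    a = x
    num_layers = len(weights)
    for l, (W, b) in enumerate(zip(weights, biases), start=1):
        z = W @ a + b
        # The last layer uses the output activation, all others the hidden one
        a = output_act(z) if l == num_layers else hidden_act(z)
        activations.append(a)
    return activations

# Assumed layer sizes: 3 inputs, 3 hidden units, 1 output
rng = np.random.default_rng(0)
sizes = [3, 3, 1]
weights = [rng.normal(size=(n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n_out) for n_out in sizes[1:]]

acts = forward(np.array([1.0, 2.0, 3.0]), weights, biases)
print([a.shape for a in acts])
```

Keeping every intermediate activation in a list is deliberate: backpropagation needs those cached values when it walks the layers in reverse.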

Backpropagation

The chain rule enables efficient gradient computation:

Theorem: Chain Rule for Backpropagation

The gradient of the loss with respect to any weight is the product of gradients flowing backward through the network, computed via the chain rule.

$$\frac{\partial \mathcal{L}}{\partial w_{ij}^{(l)}} = \frac{\partial \mathcal{L}}{\partial a_j^{(l)}} \cdot \frac{\partial a_j^{(l)}}{\partial z_j^{(l)}} \cdot \frac{\partial z_j^{(l)}}{\partial w_{ij}^{(l)}}$$

Output Layer Gradient:

Output Layer Delta

$$\delta^{(L)} = \nabla_{\mathbf{a}} \mathcal{L} \odot \phi'(\mathbf{z}^{(L)})$$

Here,

  • $\delta^{(L)}$ = Error signal at the output layer
  • $\nabla_{\mathbf{a}} \mathcal{L}$ = Gradient of loss w.r.t. the output activations
  • $\phi'$ = Derivative of activation function
  • $\odot$ = Element-wise (Hadamard) product

Hidden Layer Gradient (Recursive):

Hidden Layer Delta (Backpropagation)

$$\delta^{(l)} = \left(\mathbf{W}^{(l+1)T} \delta^{(l+1)}\right) \odot \phi'(\mathbf{z}^{(l)})$$

Here,

  • $\delta^{(l)}$ = Error signal at layer l
  • $\mathbf{W}^{(l+1)T}$ = Transpose of next layer's weights
  • $\delta^{(l+1)}$ = Error signal from next layer

Weight Update:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}} = \delta^{(l)} \mathbf{a}^{(l-1)T}$$
$$\frac{\partial \mathcal{L}}{\partial \mathbf{b}^{(l)}} = \delta^{(l)}$$

Gradient descent then moves each parameter against its gradient, for example $\mathbf{W}^{(l)} \leftarrow \mathbf{W}^{(l)} - \eta \, \partial \mathcal{L} / \partial \mathbf{W}^{(l)}$ with learning rate $\eta$.
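
The delta recursion can be checked numerically. The sketch below implements the output delta, the hidden delta, and the weight and bias gradients for a small network, then compares one analytic gradient entry against a central finite difference. Several things are assumptions made for the example: sigmoid activations in every layer, the squared-error loss $\mathcal{L} = \tfrac{1}{2}\lVert \mathbf{a}^{(L)} - \mathbf{y} \rVert^2$, the layer sizes, and the helper names.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Ws, bs):
    """Return pre-activations zs and activations acts (acts[0] is the input)."""
    acts, zs = [x], []
    for W, b in zip(Ws, bs):
        zs.append(W @ acts[-1] + b)
        acts.append(sigmoid(zs[-1]))
    return zs, acts

def loss(x, y, Ws, bs):
    _, acts = forward(x, Ws, bs)
    return 0.5 * np.sum((acts[-1] - y) ** 2)

def backprop(x, y, Ws, bs):
    """Gradients of the squared-error loss w.r.t. every weight matrix and bias."""
    zs, acts = forward(x, Ws, bs)
    phi_prime = [sigmoid(z) * (1 - sigmoid(z)) for z in zs]
    # Output delta: (gradient of loss w.r.t. output) * phi'(z^(L))
    delta = (acts[-1] - y) * phi_prime[-1]
    grads_W = [None] * len(Ws)
    grads_b = [None] * len(Ws)
    for l in reversed(range(len(Ws))):   # 0-based layer index
        grads_W[l] = np.outer(delta, acts[l])  # delta^(l) a^(l-1)T
        grads_b[l] = delta
        if l > 0:
            # Hidden delta: (W^(l+1)T delta^(l+1)) * phi'(z^(l))
            delta = (Ws[l].T @ delta) * phi_prime[l - 1]
    return grads_W, grads_b

rng = np.random.default_rng(0)
Ws = [rng.normal(size=(3, 2)), rng.normal(size=(1, 3))]
bs = [rng.normal(size=3), rng.normal(size=1)]
x, y = np.array([1.0, 2.0]), np.array([1.0])

grads_W, grads_b = backprop(x, y, Ws, bs)

# Central finite difference on a single weight as a sanity check
eps = 1e-6
Ws_plus = [W.copy() for W in Ws]
Ws_minus = [W.copy() for W in Ws]
Ws_plus[0][1, 0] += eps
Ws_minus[0][1, 0] -= eps
numeric = (loss(x, y, Ws_plus, bs) - loss(x, y, Ws_minus, bs)) / (2 * eps)
analytic = grads_W[0][1, 0]
print(f"analytic = {analytic:.8f}, numeric = {numeric:.8f}")
```

A finite-difference check like this is slow (one loss evaluation per parameter), so it is useful for verifying an implementation rather than for training.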

ℹ️ Computational Complexity of Backpropagation

The forward pass requires $O\left(\sum_{l=1}^{L} n_l \cdot n_{l-1}\right)$ operations (matrix multiplications), where $n_l$ is the number of units in layer $l$. The backward pass has approximately the same computational cost. A combined forward and backward pass therefore costs roughly 2x a single forward pass, which makes backpropagation an extremely efficient way to compute gradients for all parameters simultaneously.
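
As a quick illustration of the cost formula, the sum of $n_l \cdot n_{l-1}$ can be evaluated directly from a list of layer widths. The helper name `forward_multiply_adds` is hypothetical; the widths `[2, 64, 32, 16, 1]` are those of the MLP built in the Keras section of this lesson (biases, batch normalization, and dropout are ignored).

```python
def forward_multiply_adds(layer_sizes):
    """Sum n_l * n_(l-1) over layers, given [n_0, n_1, ..., n_L]."""
    return sum(n_in * n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

sizes = [2, 64, 32, 16, 1]
forward_cost = forward_multiply_adds(sizes)
print(forward_cost)        # 2704
print(2 * forward_cost)    # 5408, the rough forward + backward estimate
```

The count is dominated by the 64 x 32 layer, which shows why the widest pair of adjacent layers usually determines the cost of a dense network.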

Architecture Diagram
Backpropagation (Reverse Mode):
═══════════════════════════════════════════════════

 Forward Pass (Compute Predictions):
 ───────────────────────────────────
 x → [z⁽¹⁾] → [a⁽¹⁾] → [z⁽²⁾] → [a⁽²⁾] → ŷ → L

 Backward Pass (Compute Gradients):
 ───────────────────────────────────
 ∂L/∂x        ←── δ⁽¹⁾     ←── δ⁽²⁾     ←── L
 (not needed)     (hidden)     (output)

 Key Insight: Gradients flow backward through the network
              Each layer needs gradient from the layer after it
═══════════════════════════════════════════════════

Universal Approximation Theorem

Theorem: Universal Approximation Theorem (Cybenko, 1989)

A feedforward network with a single hidden layer containing a finite number of neurons can approximate any continuous function on compact subsets of $\mathbb{R}^n$ to arbitrary accuracy, given appropriate weights and a continuous sigmoidal activation function. Later results extend this to any non-polynomial activation, which includes ReLU.

$$\forall \epsilon > 0, \; \exists N, \; \mathbf{W}^{(1)}, \mathbf{b}^{(1)}, \mathbf{W}^{(2)}, \mathbf{b}^{(2)} : \left| f(\mathbf{x}) - \sum_{j=1}^{N} \alpha_j \phi(\mathbf{w}_j^T \mathbf{x} + b_j) \right| < \epsilon$$

for every $\mathbf{x}$ in the compact set, where $\mathbf{w}_j^T$ is the $j$-th row of $\mathbf{W}^{(1)}$, $b_j$ is the $j$-th entry of $\mathbf{b}^{(1)}$, $\alpha_j$ is the $j$-th entry of $\mathbf{W}^{(2)}$, and the output bias $\mathbf{b}^{(2)}$ is zero in this form.

ℹ️ Theoretical vs. Practical Implications

While the theorem guarantees that a single hidden layer is theoretically sufficient, it does not specify how many neurons are needed (the required number may grow exponentially), nor does it address learnability, that is, whether gradient-based training will actually find suitable weights. In practice, deeper networks are often more parameter-efficient: there are functions that a depth-$L$ network represents compactly but that would require exponentially many neurons in a single hidden layer. This is one reason deep learning emphasizes depth rather than width alone.

Architecture Diagram
Universal Approximation:
═══════════════════════════════════════════════════

 Any continuous function can be approximated:
 ────────────────────────────────────────────

 True Function:           Approximation with 5 neurons:
     ╱╲                        ╱╲
    ╱  ╲    ╱╲                ╱  ╲   ╱╲
   ╱    ╲  ╱  ╲              ╱    ╲ ╱  ╲
  ╱      ╲╱    ╲            ╱      ╲    ╲
 ╱              ╲──        ╱        ╲────╲──

 More neurons = better approximation
 But: width vs depth tradeoff exists
═══════════════════════════════════════════════════
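
The "more neurons = better approximation" idea can be probed with a small experiment. The sketch below builds a single hidden layer of tanh units with fixed random weights and fits only the output weights $\alpha_j$ by least squares. This is an assumption made to keep the example short; it is not how networks are normally trained and it is not the theorem's construction. The target $\sin(x)$, the interval, and the neuron counts are also illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-np.pi, np.pi, 200)
target = np.sin(x)          # the continuous function f to approximate

max_neurons = 50
w = rng.normal(size=max_neurons)   # hidden weights w_j (random, fixed)
b = rng.normal(size=max_neurons)   # hidden biases b_j (random, fixed)

def rms_error(n_neurons):
    """Fit alpha for the first n_neurons hidden units and return the RMS error."""
    # Hidden features phi(w_j * x + b_j), shape (200, n_neurons)
    H = np.tanh(np.outer(x, w[:n_neurons]) + b[:n_neurons])
    alpha, *_ = np.linalg.lstsq(H, target, rcond=None)
    return np.sqrt(np.mean((H @ alpha - target) ** 2))

errors = {n: rms_error(n) for n in (2, 5, 20, 50)}
for n, err in errors.items():
    print(f"N = {n:2d}  RMS error = {err:.2e}")
```

Because each larger model reuses the hidden units of the smaller one, the least-squares fit on these sample points cannot get worse as N grows. The error is measured only on the sampled grid, so it says nothing about generalization.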

📝 Forward Pass Worked Example

Consider a 2-layer network: input $\mathbf{x} = [1, 2]^T$, weights $\mathbf{W}^{(1)} = \begin{bmatrix} 0.5 & 0.3 \\ 0.2 & 0.7 \end{bmatrix}$, $\mathbf{W}^{(2)} = [0.4, 0.6]$, biases $\mathbf{b}^{(1)} = [0.1, 0.2]^T$, $b^{(2)} = 0.0$.

Layer 1: $\mathbf{z}^{(1)} = \mathbf{W}^{(1)}\mathbf{x} + \mathbf{b}^{(1)} = [0.5(1)+0.3(2)+0.1, \; 0.2(1)+0.7(2)+0.2]^T = [1.2, 1.8]^T$

$\mathbf{a}^{(1)} = \text{ReLU}(\mathbf{z}^{(1)}) = [1.2, 1.8]^T$

Layer 2: $z^{(2)} = \mathbf{W}^{(2)}\mathbf{a}^{(1)} + b^{(2)} = 0.4(1.2) + 0.6(1.8) + 0.0 = 1.56$

$\hat{y} = \sigma(1.56) \approx 0.826$

This forward pass computes the prediction. Backpropagation would then compute gradients flowing backward through these same operations.
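
The arithmetic in this worked example can be reproduced with NumPy. The values below are copied from the example; only the variable names are introduced here.

```python
import numpy as np

x = np.array([1.0, 2.0])
W1 = np.array([[0.5, 0.3],
               [0.2, 0.7]])
b1 = np.array([0.1, 0.2])
W2 = np.array([0.4, 0.6])
b2 = 0.0

z1 = W1 @ x + b1                      # layer 1 pre-activation
a1 = np.maximum(0.0, z1)              # ReLU
z2 = W2 @ a1 + b2                     # layer 2 pre-activation (scalar)
y_hat = 1.0 / (1.0 + np.exp(-z2))     # sigmoid output

print("z1 =", z1)
print("a1 =", a1)
print(f"z2 = {z2:.2f}")
print(f"y_hat = {y_hat:.3f}")
```

Both entries of `z1` are positive, so ReLU leaves them unchanged, which is why the example's $\mathbf{a}^{(1)}$ equals $\mathbf{z}^{(1)}$.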

Complete Keras Implementation

import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, callbacks
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# ═══════════════════════════════════════════════════
# Generate Dataset (Non-linear)
# ═══════════════════════════════════════════════════
X, y = make_moons(n_samples=2000, noise=0.2, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Visualize
plt.figure(figsize=(8, 6))
plt.scatter(X[y==0, 0], X[y==0, 1], alpha=0.5, label='Class 0')
plt.scatter(X[y==1, 0], X[y==1, 1], alpha=0.5, label='Class 1')
plt.title("Make Moons Dataset")
plt.legend()
plt.show()

# ═══════════════════════════════════════════════════
# Model 1: Simple Perceptron (Linear Boundary)
# ═══════════════════════════════════════════════════
# An explicit Input layer replaces the input_shape argument on Dense,
# which newer Keras versions warn about.
perceptron = keras.Sequential([
    keras.Input(shape=(2,)),
    layers.Dense(1, activation='sigmoid')
])

perceptron.compile(
    optimizer='sgd',
    loss='binary_crossentropy',
    metrics=['accuracy']
)

print("Model 1: Single Perceptron")
perceptron.summary()

history_p = perceptron.fit(
    X_train, y_train,
    epochs=50,
    batch_size=32,
    validation_split=0.2,
    verbose=0
)

# ═══════════════════════════════════════════════════
# Model 2: Deep MLP
# ═══════════════════════════════════════════════════
mlp_model = keras.Sequential([
    # Input: 2 features per sample
    keras.Input(shape=(2,)),

    # First hidden layer
    layers.Dense(64, activation='relu'),
    layers.BatchNormalization(),
    layers.Dropout(0.3),

    # Hidden layers
    layers.Dense(32, activation='relu'),
    layers.BatchNormalization(),
    layers.Dropout(0.2),

    layers.Dense(16, activation='relu'),

    # Output layer
    layers.Dense(1, activation='sigmoid')
])

# Compile with custom learning rate
optimizer = keras.optimizers.Adam(learning_rate=0.001)

mlp_model.compile(
    optimizer=optimizer,
    loss='binary_crossentropy',
    metrics=['accuracy', keras.metrics.AUC(name='auc')]
)

print("\nModel 2: Deep MLP")
mlp_model.summary()

# ═══════════════════════════════════════════════════
# Training with Callbacks
# ═══════════════════════════════════════════════════
callback_list = [
    # Early stopping
    callbacks.EarlyStopping(
        monitor='val_loss',
        patience=10,
        restore_best_weights=True,
        verbose=1
    ),
    # Reduce learning rate on plateau
    callbacks.ReduceLROnPlateau(
        monitor='val_loss',
        factor=0.5,
        patience=5,
        min_lr=1e-6,
        verbose=1
    ),
    # Model checkpoint
    callbacks.ModelCheckpoint(
        'best_mlp_model.keras',
        monitor='val_auc',
        mode='max',
        save_best_only=True,
        verbose=1
    )
]

history = mlp_model.fit(
    X_train, y_train,
    epochs=200,
    batch_size=32,
    validation_split=0.2,
    callbacks=callback_list,
    verbose=1
)

# ═══════════════════════════════════════════════════
# Evaluation
# ═══════════════════════════════════════════════════
test_loss, test_acc, test_auc = mlp_model.evaluate(X_test, y_test)
print(f"\nTest Accuracy: {test_acc:.4f}")
print(f"Test AUC: {test_auc:.4f}")

# ═══════════════════════════════════════════════════
# Plot Training History
# ═══════════════════════════════════════════════════
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Loss
axes[0].plot(history.history['loss'], label='Train Loss')
axes[0].plot(history.history['val_loss'], label='Val Loss')
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Loss')
axes[0].set_title('Training & Validation Loss')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Accuracy
axes[1].plot(history.history['accuracy'], label='Train Acc')
axes[1].plot(history.history['val_accuracy'], label='Val Acc')
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Accuracy')
axes[1].set_title('Training & Validation Accuracy')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('training_history.png', dpi=150)
plt.show()

# ═══════════════════════════════════════════════════
# Decision Boundary Visualization
# ═══════════════════════════════════════════════════
def plot_decision_boundary(model, X, y, title):
    h = 0.02
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))

    Z = model.predict(np.c_[xx.ravel(), yy.ravel()], verbose=0)
    Z = Z.reshape(xx.shape)

    plt.figure(figsize=(10, 8))
    plt.contourf(xx, yy, Z, alpha=0.8, cmap=plt.cm.RdBu)
    plt.scatter(X[y==0, 0], X[y==0, 1], c='red', label='Class 0', edgecolors='k')
    plt.scatter(X[y==1, 0], X[y==1, 1], c='blue', label='Class 1', edgecolors='k')
    plt.title(title)
    plt.legend()
    plt.savefig(f'{title.lower().replace(" ", "_")}.png', dpi=150)
    plt.show()

plot_decision_boundary(mlp_model, X_test, y_test, "MLP Decision Boundary")

Weight Initialization

Why Initialization Matters

Random initialization breaks symmetry between neurons. If all weights start the same, all neurons compute the same thing and learn the same features. But the SCALE of initialization determines whether signals survive through deep networks.

⚠️ Vanishing/Exploding Gradients

If weights are too small ($|w| < 1$), signals shrink exponentially through layers: with $|w| = 0.5$, a signal of 1.0 becomes about 0.001 after 10 layers. If weights are too large ($|w| > 1$), signals grow exponentially: with $|w| = 2$, the same signal reaches 1024 after 10 layers. A good initialization keeps $|w| \approx 1/\sqrt{\text{fan\_in}}$, so that summing over fan_in inputs neither shrinks nor grows the signal on average.

Architecture Diagram
The Vanishing/Exploding Gradient Problem:

  Layer 1      Layer 2      Layer 3      Layer 4
  [h1] ------- [h2] ------- [h3] ------- [h4] -------> output
    |             |             |             |
    w1            w2            w3            w4

  If |w| < 1 (e.g., 0.5):
    Signal: 1.0 -> 0.5 -> 0.25 -> 0.125 -> 0.0625
    After 10 layers: 0.001 (vanishes!)
    Gradients also shrink --> early layers learn NOTHING

  If |w| > 1 (e.g., 2.0):
    Signal: 1.0 -> 2.0 -> 4.0 -> 8.0 -> 16.0
    After 10 layers: 1024 (explodes!)
    Gradients also grow --> unstable training

  Optimal: |w| ≈ 1/sqrt(fan_in)
    Signal stays roughly constant through layers
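
The numbers in the diagram follow from repeated multiplication, as this short sketch shows. It models the simplified picture used above: one scalar weight per layer and no activation function. The function name `signal_after` is introduced here for illustration.

```python
def signal_after(depth, w, signal=1.0):
    """Multiply a scalar signal by the same weight w once per layer."""
    for _ in range(depth):
        signal *= w
    return signal

print(signal_after(4, 0.5))    # 0.0625
print(signal_after(10, 0.5))   # 0.0009765625
print(signal_after(10, 2.0))   # 1024.0
print(signal_after(10, 1.0))   # 1.0
```

Real layers sum over many inputs, which is why the useful scale for an individual weight is about 1/sqrt(fan_in) rather than 1.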

Xavier/Glorot Initialization (for Sigmoid/Tanh)

Architecture Diagram
Formula:
  W ~ N(0, stddev^2), with stddev = sqrt(2 / (fan_in + fan_out))

  Or uniform: W ~ U(-sqrt(6/(fan_in+fan_out)), sqrt(6/(fan_in+fan_out)))

Why this works:
  - fan_in:  number of inputs to this layer
  - fan_out: number of outputs from this layer
  - Variance of output = variance of input (when activation is linear)
  - Sigmoid is approximately linear near z=0, so this works well

Example:
  Layer: 256 inputs -> 128 outputs
  stddev = sqrt(2 / (256 + 128)) = sqrt(2/384) = 0.072
  Each weight drawn from a normal distribution with mean 0 and stddev 0.072
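
The example values can be reproduced, and the normal and uniform variants compared, with a few lines of NumPy. The helper names `glorot_stddev` and `glorot_uniform_limit` are hypothetical and written for this sketch; they are not library functions.

```python
import numpy as np

def glorot_stddev(fan_in, fan_out):
    """Standard deviation for the normal variant."""
    return np.sqrt(2.0 / (fan_in + fan_out))

def glorot_uniform_limit(fan_in, fan_out):
    """Half-width of the interval for the uniform variant."""
    return np.sqrt(6.0 / (fan_in + fan_out))

fan_in, fan_out = 256, 128
std = glorot_stddev(fan_in, fan_out)
limit = glorot_uniform_limit(fan_in, fan_out)

rng = np.random.default_rng(0)
W_normal = rng.normal(0.0, std, size=(fan_out, fan_in))
W_uniform = rng.uniform(-limit, limit, size=(fan_out, fan_in))

print(f"target stddev     = {std:.3f}")
print(f"uniform limit     = {limit:.3f}")
print(f"sampled (normal)  = {W_normal.std():.3f}")
print(f"sampled (uniform) = {W_uniform.std():.3f}")
```

A uniform distribution on (-a, a) has standard deviation a/sqrt(3), so the limit sqrt(6/(fan_in+fan_out)) gives the same spread as the normal variant.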

He Initialization (for ReLU)

Architecture Diagram
Formula:
  W ~ N(0, stddev^2), with stddev = sqrt(2 / fan_in)

Why different from Xavier:
  - ReLU kills half the activations (sets negative to 0)
  - This halves the variance of the output
  - We need to compensate by doubling the weight variance
  - Hence variance 2/fan_in; when fan_in = fan_out, Xavier's variance
    2/(fan_in+fan_out) = 1/fan_in is exactly half of that

Example:
  Layer: 256 inputs -> 128 outputs, ReLU activation
  stddev = sqrt(2 / 256) = 0.088
  Each weight drawn from a normal distribution with mean 0 and stddev 0.088
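
The effect of the factor of two can be observed by pushing random inputs through a stack of ReLU layers and measuring the size of the final activations. This is a rough simulation under assumed settings (20 layers, width 256, a batch of 1000 Gaussian inputs, no biases); the function name `final_activation_rms` is introduced for this sketch.

```python
import numpy as np

def final_activation_rms(stddev_fn, depth=20, width=256, seed=0):
    """RMS of the activations after `depth` ReLU layers with the given init."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(1000, width))          # unit-variance inputs
    for _ in range(depth):
        W = rng.normal(0.0, stddev_fn(width, width), size=(width, width))
        a = np.maximum(0.0, a @ W)              # linear layer followed by ReLU
    return np.sqrt(np.mean(a ** 2))

def he_stddev(fan_in, fan_out):
    return np.sqrt(2.0 / fan_in)

def xavier_stddev(fan_in, fan_out):
    return np.sqrt(2.0 / (fan_in + fan_out))

he_rms = final_activation_rms(he_stddev)
xavier_rms = final_activation_rms(xavier_stddev)
print(f"He init:     final RMS = {he_rms:.3g}")
print(f"Xavier init: final RMS = {xavier_rms:.3g}")
```

With He scaling the activation size stays on the order of the input scale, while with Xavier scaling each ReLU layer shrinks it by about a factor of sqrt(2), which compounds over 20 layers.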

Initialization Summary

Architecture Diagram
Activation     Best Init        Stddev Formula
-----------    -----------      --------------
Sigmoid        Xavier/Glorot    sqrt(2/(fan_in+fan_out))
Tanh           Xavier/Glorot    sqrt(2/(fan_in+fan_out))
ReLU           He               sqrt(2/fan_in)
Leaky ReLU     He               sqrt(2/((1+alpha^2)*fan_in))
ELU            He               sqrt(2/fan_in)

Rule of thumb: Use He for ReLU variants, Xavier for saturating activations.

import tensorflow as tf
from tensorflow.keras import layers

# He initialization for ReLU layers
layer_he = layers.Dense(64, activation='relu', kernel_initializer='he_normal')

# Xavier for sigmoid/tanh layers
layer_xavier = layers.Dense(64, activation='tanh', kernel_initializer='glorot_normal')

# Custom initialization: small Gaussian weights with stddev 0.01
def custom_init(shape, dtype=None):
    return tf.random.normal(shape, stddev=0.01, dtype=dtype)

layer_custom = layers.Dense(64, kernel_initializer=custom_init)


<MathSummary title="Key Takeaways">
- **Perceptrons** compute linear combinations <MathBlock tex={`\\mathbf{w}^T \\mathbf{x} + b`} /> followed by a non-linear activation
- **MLPs** stack layers to learn hierarchical non-linear representations; deeper networks are more parameter-efficient
- **Backpropagation** computes gradients efficiently via the chain rule in <MathBlock tex={`O(\\text{forward pass})`} /> time
- **Activation functions** introduce non-linearity; ReLU is the default, but GELU is preferred in transformers
- **Universal Approximation** guarantees theoretical capacity for single hidden layers, but depth provides exponential efficiency gains in practice
- **Weight initialization** (Xavier for sigmoid/tanh, He for ReLU) and **batch normalization** are critical for stable training
- **Vanishing/Exploding Gradients** are a central challenge in deep networks; they are mitigated by skip connections, proper initialization, and normalization
</MathSummary>

---

## Practice Exercises

1. **Activation Comparison**: Train the same network with ReLU, Sigmoid, and Tanh - compare convergence
2. **Depth vs Width**: Experiment with different architectures (many narrow vs few wide layers)
3. **Spiral Dataset**: Create a spiral dataset and train a network to classify it
4. **Gradient Visualization**: Plot gradient magnitudes across layers to understand vanishing gradients
