Neural Networks: Perceptron to MLP

Introduction

Neural networks are computational systems inspired by biological neurons. They learn to map inputs to outputs through layered transformations, and with enough hidden units they can approximate any continuous function on a bounded domain to arbitrary accuracy.

Architecture Diagram

Neural Network Architecture:
═══════════════════════════════════════════════════════════════════

 Input Layer        Hidden Layers        Output Layer
 (Features)         (Learning)           (Prediction)

    x₁ ──────────┐
                  │    ┌───────┐
                  ├───►│ h₁⁽¹⁾ ├───┐
    x₂ ──────────┤    └───────┘   │    ┌───────┐
                  │                ├───►│ h₁⁽²⁾ ├───┐
                  │    ┌───────┐  │    └───────┘   │    ┌───────┐
    x₃ ──────────┼───►│ h₂⁽¹⁾ ├──┘                ├───►│  ŷ     │
                  │    └───────┘                    │    └───────┘
                  │    ┌───────┐  │    ┌───────┐   │
    x₄ ──────────┼───►│ h₃⁽¹⁾ ├──┘───►│ h₂⁽²⁾ ├───┘
                  │    └───────┘       └───────┘
    x₅ ──────────┘

    [ box ] = Neuron      ─ │ = Connections (weights)
═══════════════════════════════════════════════════════════════════

The Perceptron

Single Perceptron

The simplest neural unit computes a weighted sum and applies an activation:

Definition: Perceptron

A perceptron computes a weighted sum of inputs, adds a bias, and applies a non-linear activation function to produce an output.

Perceptron Output

z = \sum_{i=1}^{n} w_i x_i + b = \mathbf{w}^T \mathbf{x} + b

Here,

$w_i$ = Weight for feature $i$
$x_i$ = Input feature $i$
$b$ = Bias term
$z$ = Weighted sum (pre-activation)

\hat{y} = \phi(z)

where $\phi$ is the activation function.
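
The two formulas above translate almost line for line into code. The following is a minimal sketch in plain Python; the function names and the input, weight, and bias values are illustrative assumptions, not taken from the text.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def step(z):
    # Classic threshold activation: fires when the weighted sum is non-negative
    return 1.0 if z >= 0 else 0.0

def perceptron(x, w, b, phi):
    """Return (z, y_hat) where z = w . x + b and y_hat = phi(z)."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return z, phi(z)

# Illustrative values (assumed for this example)
x = [1.0, 0.0, -1.0]
w = [0.5, -0.25, 0.25]
b = 0.25

z, y_sigmoid = perceptron(x, w, b, sigmoid)
_, y_step = perceptron(x, w, b, step)
print(z, round(y_sigmoid, 4), y_step)
```

Swapping `phi` changes only the last step: the weighted sum `z` is identical for both activations.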

Architecture Diagram

Single Perceptron:
═══════════════════════════════════════════════════

    x₁ ──── w₁ ────┐
                    │
    x₂ ──── w₂ ────┤    ┌─────────┐
                    ├───►│ Σ wᵢxᵢ  │───► z ───► φ(z) ───► ŷ
    x₃ ──── w₃ ────┤    │ + b     │
                    │    └─────────┘
                 bias

    Mathematical:
    z = w₁x₁ + w₂x₂ + w₃x₃ + b
    ŷ = φ(z)
═══════════════════════════════════════════════════

Activation Functions

import numpy as np
import matplotlib.pyplot as plt

# ═══════════════════════════════════════════════════
# Activation Functions Visualization
# ═══════════════════════════════════════════════════
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def relu(z):
    return np.maximum(0, z)

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

def softmax(z):
    exp_z = np.exp(z - np.max(z))
    return exp_z / exp_z.sum()

def gelu(z):
    return 0.5 * z * (1 + np.tanh(np.sqrt(2/np.pi) * (z + 0.044715 * z**3)))

# Plot all activations
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
z = np.linspace(-5, 5, 200)

activations = [
    ('Sigmoid', sigmoid(z), r'$\sigma(z) = \frac{1}{1+e^{-z}}$'),
    ('Tanh', tanh(z), r'$\tanh(z)$'),
    ('ReLU', relu(z), r'$\max(0, z)$'),
    ('Leaky ReLU', leaky_relu(z), r'$\max(0.01z, z)$'),
    ('GELU', gelu(z), r'$0.5z(1 + \tanh(\sqrt{2/\pi}(z+0.044715z^3)))$'),
    ('Derivatives', None, '')
]

for ax, (name, func, formula) in zip(axes.flat, activations):
    if name == 'Derivatives':
        # Plot derivatives
        sig_deriv = sigmoid(z) * (1 - sigmoid(z))
        tanh_deriv = 1 - tanh(z)**2
        relu_deriv = (z > 0).astype(float)

        ax.plot(z, sig_deriv, label='Sigmoid', linewidth=2)
        ax.plot(z, tanh_deriv, label='Tanh', linewidth=2)
        ax.plot(z, relu_deriv, label='ReLU', linewidth=2)
        ax.legend()
        ax.set_title('Derivatives')
    else:
        ax.plot(z, func, linewidth=2, color='steelblue')
        ax.set_title(f'{name}\n{formula}')
    ax.grid(True, alpha=0.3)
    ax.axhline(y=0, color='k', linewidth=0.5)
    ax.axvline(x=0, color='k', linewidth=0.5)

plt.suptitle("Activation Functions", fontsize=16)
plt.tight_layout()
plt.savefig('activation_functions.png', dpi=150)
plt.show()
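
The `softmax` function above subtracts the maximum score before exponentiating. The sketch below, written in plain Python with assumed example logits, suggests why: the shift leaves the result unchanged mathematically but avoids overflow for large inputs.

```python
import math

def softmax(scores):
    # Shifting by the maximum keeps every exponent <= 0, so exp() cannot overflow
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_naive(scores):
    # Same formula without the shift; fails for large scores
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

logits = [1000.0, 1001.0, 1002.0]  # assumed values, chosen to trigger overflow
probs = softmax(logits)

try:
    softmax_naive(logits)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True

print([round(p, 4) for p in probs], naive_overflowed)
```

With NumPy the naive version would not raise; `np.exp` returns `inf` with a warning and the division then yields `nan`. Note also that the NumPy `softmax` above normalizes over the whole array, so a 2-D batch would need an `axis` argument to normalize per row.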

Activation Function Properties

Function	Range	Key Property	Vanishing Gradient	Use Case
Sigmoid	(0, 1)	Probabilities	Yes	Binary output
Tanh	(-1, 1)	Zero-centered	Yes	Hidden layers
ReLU	[0, ∞)	Sparse	No for z > 0 (units can die)	Default choice
Leaky ReLU	(-∞, ∞)	No dead neurons	No	When ReLU units die
GELU	≈ [-0.17, ∞)	Smooth ReLU	No	Transformers
Softmax	(0, 1), sums to 1	Probabilities	Yes	Multi-class output
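
To make the "Vanishing Gradient" column concrete, the sketch below multiplies activation derivatives across 10 layers. It is a simplification under a stated assumption: weights are ignored, so only the contribution of the activation function is shown. The depth and the sample pre-activation values are illustrative choices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh_grad(z):
    return 1.0 - math.tanh(z) ** 2

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

depth = 10  # assumed number of stacked layers

# Best case for sigmoid: every pre-activation sits at z = 0, where the slope peaks at 0.25
sigmoid_best = sigmoid_grad(0.0) ** depth
# Saturated case: pre-activations far from zero give much smaller slopes
sigmoid_saturated = sigmoid_grad(5.0) ** depth
# An active ReLU unit passes the gradient through unchanged
relu_active = relu_grad(1.0) ** depth

print(f"sigmoid, z=0: {sigmoid_best:.2e}")
print(f"sigmoid, z=5: {sigmoid_saturated:.2e}")
print(f"relu, z>0:    {relu_active:.1f}")
```

Even in its best case the sigmoid factor is 0.25 per layer, so the product shrinks geometrically with depth, while an active ReLU path contributes a factor of exactly 1.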

Multi-Layer Perceptron (MLP)

Forward Propagation

\mathbf{a}^{(l)} = \phi\left(\mathbf{W}^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}\right)

Here,

$\mathbf{a}^{(l)}$ = Activations at layer $l$
$\mathbf{W}^{(l)}$ = Weight matrix at layer $l$
$\mathbf{b}^{(l)}$ = Bias vector at layer $l$
$\phi$ = Activation function

For a network with $L$ layers:

\mathbf{a}^{(0)} = \mathbf{x} \quad \text{(input)}

\mathbf{z}^{(l)} = \mathbf{W}^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}

\mathbf{a}^{(l)} = \phi(\mathbf{z}^{(l)})

\hat{\mathbf{y}} = \mathbf{a}^{(L)} \quad \text{(output)}
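
The recursion above can be sketched as a short loop. This is a minimal plain-Python version; the helper names, the 3-2-1 layer sizes, and all numeric values are assumptions made for illustration.

```python
def matvec(W, v):
    # Each row of W holds the incoming weights of one neuron
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in W]

def relu(z):
    return [max(0.0, z_i) for z_i in z]

def identity(z):
    return list(z)

def forward(x, layers):
    """layers is a list of (W, b, phi); returns [a^(0), a^(1), ..., a^(L)]."""
    activations = [list(x)]                      # a^(0) = x
    for W, b, phi in layers:
        z = [s + b_i for s, b_i in zip(matvec(W, activations[-1]), b)]
        activations.append(phi(z))               # a^(l) = phi(z^(l))
    return activations

# Assumed toy network: 3 inputs -> 2 hidden units (ReLU) -> 1 linear output
layers = [
    ([[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]], [0.0, -1.0], relu),
    ([[1.0, 2.0]], [0.5], identity),
]
acts = forward([2.0, 1.0, 3.0], layers)
print(acts[1], acts[-1])  # [0.0, 2.0] [4.5]
```

Keeping every intermediate activation, rather than only the final output, is what backpropagation later relies on.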

Architecture Diagram

Forward Propagation Flow:
═══════════════════════════════════════════════════

 Layer 0 (Input)      Layer 1 (Hidden)      Layer 2 (Output)
 ─────────────────    ─────────────────     ─────────────────

 x₁ ────── w₁₁ ──────► a₁⁽¹⁾ ────── w₁₂ ──────►
              w₂₁ ↘              ↗ w₂₂
 x₂ ────── w₁₂ ──────► a₂⁽¹⁾ ────── w₂₂ ──────► ŷ
              w₃₁ ↗              ↘ w₃₂
 x₃ ────── w₁₃ ──────► a₃⁽¹⁾ ────── w₃₂ ──────►

 Step 1: z⁽¹⁾ = W⁽¹⁾x + b⁽¹⁾
 Step 2: a⁽¹⁾ = φ(z⁽¹⁾)
 Step 3: z⁽²⁾ = W⁽²⁾a⁽¹⁾ + b⁽²⁾
 Step 4: ŷ = φ(z⁽²⁾)
═══════════════════════════════════════════════════

Backpropagation

The chain rule enables efficient gradient computation:

Theorem: Chain Rule for Backpropagation

The gradient of the loss with respect to any weight is the product of gradients flowing backward through the network, computed via the chain rule.

\frac{\partial \mathcal{L}}{\partial w_{ij}^{(l)}} = \frac{\partial \mathcal{L}}{\partial a_j^{(l)}} \cdot \frac{\partial a_j^{(l)}}{\partial z_j^{(l)}} \cdot \frac{\partial z_j^{(l)}}{\partial w_{ij}^{(l)}}

Output Layer Gradient:

Output Layer Delta

\delta^{(L)} = \nabla_{\mathbf{a}} \mathcal{L} \odot \phi'(\mathbf{z}^{(L)})

Here,

$\delta^{(L)}$ = Error signal at the output layer
$\nabla_{\mathbf{a}} \mathcal{L}$ = Gradient of loss w.r.t. activations
$\phi'$ = Derivative of activation function

Hidden Layer Gradient (Recursive):

Hidden Layer Delta (Backpropagation)

\delta^{(l)} = \left(\mathbf{W}^{(l+1)T} \delta^{(l+1)}\right) \odot \phi'(\mathbf{z}^{(l)})

Here,

$\delta^{(l)}$ = Error signal at layer $l$
$\mathbf{W}^{(l+1)T}$ = Transpose of next layer's weights
$\delta^{(l+1)}$ = Error signal from next layer

Weight and Bias Gradients:

\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}} = \delta^{(l)} \mathbf{a}^{(l-1)T}

\frac{\partial \mathcal{L}}{\partial \mathbf{b}^{(l)}} = \delta^{(l)}
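
The delta equations can be checked numerically. The sketch below applies them to a tiny network and compares the result with central finite differences. Everything specific here is an assumption for illustration: the 2-2-1 architecture, tanh hidden units, a sigmoid output, the squared-error loss $\frac{1}{2}(\hat{y} - y)^2$, and all parameter values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(params, x):
    """Forward pass: tanh hidden layer, single sigmoid output unit."""
    W1, b1, W2, b2 = params
    z1 = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
    a1 = [math.tanh(z) for z in z1]
    z2 = sum(w * a for w, a in zip(W2, a1)) + b2
    return a1, sigmoid(z2)

def loss(params, x, y):
    _, a2 = forward(params, x)
    return 0.5 * (a2 - y) ** 2

def backprop(params, x, y):
    W1, b1, W2, b2 = params
    a1, a2 = forward(params, x)
    # Output delta: dL/da times sigmoid'(z2), where sigmoid' = a2 * (1 - a2)
    delta2 = (a2 - y) * a2 * (1.0 - a2)
    # Hidden delta: (W2^T delta2) times tanh'(z1), where tanh' = 1 - a1^2
    delta1 = [w * delta2 * (1.0 - a ** 2) for w, a in zip(W2, a1)]
    grad_W1 = [[d * xi for xi in x] for d in delta1]   # delta1 x^T
    grad_W2 = [delta2 * a for a in a1]                 # delta2 a1^T
    return grad_W1, delta1, grad_W2, delta2            # bias gradients equal the deltas

def numeric_grad_W1(params, x, y, i, j, eps=1e-6):
    """Central finite difference for dL/dW1[i][j]."""
    W1, b1, W2, b2 = params
    def loss_at(value):
        W1_mod = [row[:] for row in W1]
        W1_mod[i][j] = value
        return loss((W1_mod, b1, W2, b2), x, y)
    return (loss_at(W1[i][j] + eps) - loss_at(W1[i][j] - eps)) / (2 * eps)

# Assumed parameters and a single training example
params = ([[0.5, -0.3], [0.8, 0.2]], [0.1, -0.1], [0.7, -0.4], 0.05)
x, y = [1.0, 2.0], 1.0

grad_W1, grad_b1, grad_W2, grad_b2 = backprop(params, x, y)
max_diff = max(
    abs(grad_W1[i][j] - numeric_grad_W1(params, x, y, i, j))
    for i in range(2) for j in range(2)
)
print(f"largest |analytic - numeric| over W1: {max_diff:.2e}")
```

A gradient check like this is a common sanity test when implementing backpropagation by hand; agreement to many decimal places suggests the delta recursion is implemented consistently with the forward pass.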

ℹ️ Computational Complexity of Backpropagation

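The forward pass requires $O(\sum_{l=1}^{L} n_l \cdot n_{l-1})$ operations, dominated by matrix multiplications. The backward pass performs about two such products per layer (one for $\delta^{(l)} \mathbf{a}^{(l-1)T}$ and one for $\mathbf{W}^{(l+1)T} \delta^{(l+1)}$), so it has the same asymptotic cost. A full forward-plus-backward step therefore costs roughly two to three forward passes, which makes backpropagation an extremely efficient way to compute gradients for all parameters simultaneously.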
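
As a rough illustration of the operation count, the sketch below evaluates $\sum_l n_l \cdot n_{l-1}$ for a list of layer widths. The helper name is hypothetical; the widths are the Dense-layer sizes of the Keras MLP later on this page, with batch normalization and dropout ignored.

```python
def forward_multiply_adds(layer_sizes):
    """layer_sizes = [n_0, n_1, ..., n_L]; returns the sum of n_l * n_(l-1)."""
    return sum(n_out * n_in for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

sizes = [2, 64, 32, 16, 1]  # input, three hidden layers, output
per_layer = [n_out * n_in for n_in, n_out in zip(sizes, sizes[1:])]
total = forward_multiply_adds(sizes)
print(per_layer, total)  # [128, 2048, 512, 16] 2704
```

The count is per input example; the widest pair of adjacent layers dominates it.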

Architecture Diagram

Backpropagation (Reverse Mode):
═══════════════════════════════════════════════════

 Forward Pass (Compute Predictions):
 ───────────────────────────────────
 x → [z⁽¹⁾] → [a⁽¹⁾] → [z⁽²⁾] → [a⁽²⁾] → ŷ → L

 Backward Pass (Compute Gradients):
 ───────────────────────────────────
 ∂L/∂x ←── δ⁽¹⁾ ←── δ⁽²⁾ ←── ∂L/∂ŷ
 (not needed) (hidden) (output)

 Key Insight: Gradients flow backward through the network
              Each layer needs gradient from the layer after it
═══════════════════════════════════════════════════

Universal Approximation Theorem

Theorem: Universal Approximation Theorem (Cybenko, 1989)

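A feedforward network with a single hidden layer containing a finite number of neurons can approximate any continuous function on a compact subset $K \subset \mathbb{R}^n$ to arbitrary accuracy, given appropriate weights and a continuous sigmoidal activation function. Later results extend this to any non-polynomial activation, which includes ReLU.

\forall \epsilon > 0, \; \exists N, \; \{\alpha_j, \mathbf{w}_j, b_j\}_{j=1}^{N} : \sup_{\mathbf{x} \in K} \left| f(\mathbf{x}) - \sum_{j=1}^{N} \alpha_j \phi(\mathbf{w}_j^T \mathbf{x} + b_j) \right| < \epsilon

Here $\mathbf{w}_j$ and $b_j$ are the hidden-layer weights and biases (the rows of $\mathbf{W}^{(1)}$ and the entries of $\mathbf{b}^{(1)}$), and $\alpha_j$ are the output weights (the entries of $\mathbf{W}^{(2)}$).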

ℹ️ Theoretical vs. Practical Implications

While the theorem guarantees that a single hidden layer is theoretically sufficient, it does not specify how many neurons are needed (the required width may be exponentially large), nor does it address learnability. In practice, deeper networks are often more parameter-efficient: some functions that a depth-$L$ network represents compactly would require exponentially many neurons in a single hidden layer. This is one reason deep learning emphasizes depth rather than width alone.

Architecture Diagram

Universal Approximation:
═══════════════════════════════════════════════════

 Any continuous function can be approximated:
 ────────────────────────────────────────────

 True Function:           Approximation with 5 neurons:
     ╱╲                        ╱╲
    ╱  ╲    ╱╲                ╱  ╲   ╱╲
   ╱    ╲  ╱  ╲              ╱    ╲ ╱  ╲
  ╱      ╲╱    ╲            ╱      ╲    ╲
 ╱              ╲──        ╱        ╲────╲──

 More neurons = better approximation
 But: width vs depth tradeoff exists
═══════════════════════════════════════════════════
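
The claim "more neurons = better approximation" can be illustrated with an explicit construction. The sketch below builds a one-hidden-layer network by hand (no training) that linearly interpolates a target function. It uses ReLU units rather than the sigmoidal units of Cybenko's statement because the weights can then be written down directly; the target `sin` on $[0, 2\pi]$, the neuron counts, and all helper names are assumptions for illustration.

```python
import math

def relu(z):
    return max(0.0, z)

def build_relu_interpolant(f, lo, hi, n_hidden):
    """One-hidden-layer ReLU net interpolating f at n_hidden + 1 equally spaced knots."""
    knots = [lo + (hi - lo) * k / n_hidden for k in range(n_hidden + 1)]
    slopes = [
        (f(knots[k + 1]) - f(knots[k])) / (knots[k + 1] - knots[k])
        for k in range(n_hidden)
    ]
    # Output weight of neuron k is the change in slope at knot k
    alphas = [slopes[0]] + [slopes[k] - slopes[k - 1] for k in range(1, n_hidden)]
    bias_out = f(lo)

    def net(x):
        # Hidden unit k computes relu(1 * x - knot_k), i.e. w_k = 1 and b_k = -knot_k
        return bias_out + sum(a * relu(x - t) for a, t in zip(alphas, knots[:-1]))

    return net

def max_error(f, net, lo, hi, n_grid=2001):
    xs = [lo + (hi - lo) * i / (n_grid - 1) for i in range(n_grid)]
    return max(abs(f(x) - net(x)) for x in xs)

lo, hi = 0.0, 2 * math.pi
errors = {
    n: max_error(math.sin, build_relu_interpolant(math.sin, lo, hi, n), lo, hi)
    for n in (5, 20, 80)
}
for n, err in errors.items():
    print(f"{n:3d} hidden neurons: max error {err:.5f}")
```

For a smooth target the error of this piecewise-linear construction falls roughly with the square of the knot spacing, so quadrupling the number of neurons cuts the error by about a factor of 16.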

📝 Forward Pass Worked Example

Consider a 2-layer network with a ReLU hidden layer and a sigmoid output: input $\mathbf{x} = [1, 2]^T$, weights $\mathbf{W}^{(1)} = \begin{bmatrix} 0.5 & 0.3 \\ 0.2 & 0.7 \end{bmatrix}$, $\mathbf{W}^{(2)} = [0.4, 0.6]$, biases $\mathbf{b}^{(1)} = [0.1, 0.2]^T$, $b^{(2)} = 0.0$.

Layer 1: $\mathbf{z}^{(1)} = \mathbf{W}^{(1)}\mathbf{x} + \mathbf{b}^{(1)} = [0.5(1)+0.3(2)+0.1, \; 0.2(1)+0.7(2)+0.2]^T = [1.2, 1.8]^T$

\mathbf{a}^{(1)} = \text{ReLU}(\mathbf{z}^{(1)}) = [1.2, 1.8]^T

Layer 2: $z^{(2)} = \mathbf{W}^{(2)}\mathbf{a}^{(1)} + b^{(2)} = 0.4(1.2) + 0.6(1.8) = 1.56$

\hat{y} = \sigma(1.56) \approx 0.826

This forward pass computes the prediction. Backpropagation would then compute gradients flowing backward through these same operations.
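
The arithmetic of the worked example can be reproduced in a few lines. The sketch below uses plain Python and the exact weights, biases, and input given above; only the variable names are assumptions.

```python
import math

W1 = [[0.5, 0.3], [0.2, 0.7]]
b1 = [0.1, 0.2]
W2 = [0.4, 0.6]
b2 = 0.0
x = [1.0, 2.0]

# Layer 1: weighted sums, then ReLU
z1 = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
a1 = [max(0.0, z) for z in z1]

# Layer 2: single output unit with a sigmoid
z2 = sum(w * a for w, a in zip(W2, a1)) + b2
y_hat = 1.0 / (1.0 + math.exp(-z2))

print([round(z, 4) for z in z1])  # [1.2, 1.8]
print(round(z2, 4))               # 1.56
print(round(y_hat, 3))            # 0.826
```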

Complete Keras Implementation

import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, callbacks
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# ═══════════════════════════════════════════════════
# Generate Dataset (Non-linear)
# ═══════════════════════════════════════════════════
X, y = make_moons(n_samples=2000, noise=0.2, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Visualize
plt.figure(figsize=(8, 6))
plt.scatter(X[y==0, 0], X[y==0, 1], alpha=0.5, label='Class 0')
plt.scatter(X[y==1, 0], X[y==1, 1], alpha=0.5, label='Class 1')
plt.title("Make Moons Dataset")
plt.legend()
plt.show()

# ═══════════════════════════════════════════════════
# Model 1: Simple Perceptron (Linear Boundary)
# ═══════════════════════════════════════════════════
perceptron = keras.Sequential([
    keras.Input(shape=(2,)),
    layers.Dense(1, activation='sigmoid')
])

perceptron.compile(
    optimizer='sgd',
    loss='binary_crossentropy',
    metrics=['accuracy']
)

print("Model 1: Single Perceptron")
perceptron.summary()

history_p = perceptron.fit(
    X_train, y_train,
    epochs=50,
    batch_size=32,
    validation_split=0.2,
    verbose=0
)

# ═══════════════════════════════════════════════════
# Model 2: Deep MLP
# ═══════════════════════════════════════════════════
mlp_model = keras.Sequential([
    # Input specification, then the first hidden block
    keras.Input(shape=(2,)),
    layers.Dense(64, activation='relu'),
    layers.BatchNormalization(),
    layers.Dropout(0.3),

    # Hidden layers
    layers.Dense(32, activation='relu'),
    layers.BatchNormalization(),
    layers.Dropout(0.2),

    layers.Dense(16, activation='relu'),

    # Output layer
    layers.Dense(1, activation='sigmoid')
])

# Compile with custom learning rate
optimizer = keras.optimizers.Adam(learning_rate=0.001)

mlp_model.compile(
    optimizer=optimizer,
    loss='binary_crossentropy',
    metrics=['accuracy', keras.metrics.AUC(name='auc')]
)

print("\nModel 2: Deep MLP")
mlp_model.summary()

# ═══════════════════════════════════════════════════
# Training with Callbacks
# ═══════════════════════════════════════════════════
callback_list = [
    # Early stopping
    callbacks.EarlyStopping(
        monitor='val_loss',
        patience=10,
        restore_best_weights=True,
        verbose=1
    ),
    # Reduce learning rate on plateau
    callbacks.ReduceLROnPlateau(
        monitor='val_loss',
        factor=0.5,
        patience=5,
        min_lr=1e-6,
        verbose=1
    ),
    # Model checkpoint
    callbacks.ModelCheckpoint(
        'best_mlp_model.keras',
        monitor='val_auc',
        mode='max',
        save_best_only=True,
        verbose=1
    )
]

history = mlp_model.fit(
    X_train, y_train,
    epochs=200,
    batch_size=32,
    validation_split=0.2,
    callbacks=callback_list,
    verbose=1
)

# ═══════════════════════════════════════════════════
# Evaluation
# ═══════════════════════════════════════════════════
test_loss, test_acc, test_auc = mlp_model.evaluate(X_test, y_test)
print(f"\nTest Accuracy: {test_acc:.4f}")
print(f"Test AUC: {test_auc:.4f}")

# ═══════════════════════════════════════════════════
# Plot Training History
# ═══════════════════════════════════════════════════
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Loss
axes[0].plot(history.history['loss'], label='Train Loss')
axes[0].plot(history.history['val_loss'], label='Val Loss')
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Loss')
axes[0].set_title('Training & Validation Loss')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Accuracy
axes[1].plot(history.history['accuracy'], label='Train Acc')
axes[1].plot(history.history['val_accuracy'], label='Val Acc')
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Accuracy')
axes[1].set_title('Training & Validation Accuracy')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('training_history.png', dpi=150)
plt.show()

# ═══════════════════════════════════════════════════
# Decision Boundary Visualization
# ═══════════════════════════════════════════════════
def plot_decision_boundary(model, X, y, title):
    h = 0.02
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))

    Z = model.predict(np.c_[xx.ravel(), yy.ravel()], verbose=0)
    Z = Z.reshape(xx.shape)

    plt.figure(figsize=(10, 8))
    plt.contourf(xx, yy, Z, alpha=0.8, cmap=plt.cm.RdBu)
    plt.scatter(X[y==0, 0], X[y==0, 1], c='red', label='Class 0', edgecolors='k')
    plt.scatter(X[y==1, 0], X[y==1, 1], c='blue', label='Class 1', edgecolors='k')
    plt.title(title)
    plt.legend()
    plt.savefig(f'{title.lower().replace(" ", "_")}.png', dpi=150)
    plt.show()

plot_decision_boundary(mlp_model, X_test, y_test, "MLP Decision Boundary")

Weight Initialization

Why Initialization Matters

Random initialization breaks symmetry between neurons. If all weights start the same, all neurons compute the same thing and learn the same features. But the SCALE of initialization determines whether signals survive through deep networks.

⚠️ Vanishing/Exploding Gradients

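If weights are too small (for example, an effective per-layer gain of 0.5), signals shrink exponentially through layers: after 10 layers, a signal of 1.0 becomes about 0.001. If weights are too large (for example, a gain of 2.0), the signal grows to 1024 after 10 layers. A good initialization keeps the per-layer gain near 1, which for a layer with fan_in inputs means a weight standard deviation of roughly $1/\sqrt{\text{fan\_in}}$.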

Architecture Diagram

The Vanishing/Exploding Gradient Problem:

  Layer 1      Layer 2      Layer 3      Layer 4
  [h1] ------- [h2] ------- [h3] ------- [h4] -------> output
    |             |             |             |
    w1            w2            w3            w4

  If |w| < 1 (e.g., 0.5):
    Signal: 1.0 -> 0.5 -> 0.25 -> 0.125 -> 0.0625
    After 10 layers: 0.001 (vanishes!)
    Gradients also shrink --> early layers learn NOTHING

  If |w| > 1 (e.g., 2.0):
    Signal: 1.0 -> 2.0 -> 4.0 -> 8.0 -> 16.0
    After 10 layers: 1024 (explodes!)
    Gradients also grow --> unstable training

  Optimal: std(w) ≈ 1/sqrt(fan_in)
    Signal stays roughly constant through layers
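
The scalar picture above carries over to full weight matrices. The sketch below pushes a random signal through 10 random linear layers and reports how its root-mean-square size changes when the weight standard deviation is 0.5, 1, or 2 times 1/sqrt(fan_in). The width, depth, seed, and the absence of activations and biases are simplifying assumptions.

```python
import math
import random

def propagate(width, depth, weight_std, seed=0):
    """Return RMS(output) / RMS(input) after `depth` random linear layers."""
    rng = random.Random(seed)
    signal = [rng.gauss(0.0, 1.0) for _ in range(width)]
    rms_in = math.sqrt(sum(s * s for s in signal) / width)
    for _ in range(depth):
        W = [[rng.gauss(0.0, weight_std) for _ in range(width)] for _ in range(width)]
        signal = [sum(w * s for w, s in zip(row, signal)) for row in W]
    rms_out = math.sqrt(sum(s * s for s in signal) / width)
    return rms_out / rms_in

width, depth = 64, 10
base = 1.0 / math.sqrt(width)  # the 1/sqrt(fan_in) scale
ratios = {gain: propagate(width, depth, gain * base) for gain in (0.5, 1.0, 2.0)}

for gain, ratio in ratios.items():
    print(f"std = {gain} / sqrt(fan_in): output/input RMS = {ratio:.4g}")
```

Because the same seed is reused, the three runs differ only by the scale factor, so the gap between them is close to the 0.5^10 and 2^10 factors in the diagram.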

Xavier/Glorot Initialization (for Sigmoid/Tanh)

Formula:
  W ~ N(0, σ²) with σ = sqrt(2 / (fan_in + fan_out))

  Or uniform: W ~ U(-sqrt(6/(fan_in+fan_out)), sqrt(6/(fan_in+fan_out)))

Why this works:
  - fan_in:  number of inputs to this layer
  - fan_out: number of outputs from this layer
  - Variance of output = variance of input (when activation is linear)
  - Sigmoid is approximately linear near z=0, so this works well

Example:
  Layer: 256 inputs -> 128 outputs
  stddev = sqrt(2 / (256 + 128)) = sqrt(2/384) = 0.072
  Each weight drawn from a zero-mean normal with standard deviation 0.072
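
The example above can be reproduced numerically. The sketch below computes the normal standard deviation and the uniform limit for the 256-to-128 layer, then draws weights with the standard library and compares the empirical spread. The helper names and the seed are assumptions; this is not how any particular framework implements its initializer.

```python
import math
import random
import statistics

def glorot_normal_std(fan_in, fan_out):
    return math.sqrt(2.0 / (fan_in + fan_out))

def glorot_uniform_limit(fan_in, fan_out):
    return math.sqrt(6.0 / (fan_in + fan_out))

fan_in, fan_out = 256, 128
std = glorot_normal_std(fan_in, fan_out)
limit = glorot_uniform_limit(fan_in, fan_out)

rng = random.Random(0)
normal_w = [rng.gauss(0.0, std) for _ in range(fan_in * fan_out)]
uniform_w = [rng.uniform(-limit, limit) for _ in range(fan_in * fan_out)]

print(f"target std:        {std:.4f}")
print(f"uniform limit:     {limit:.4f}")
print(f"empirical normal:  {statistics.pstdev(normal_w):.4f}")
print(f"empirical uniform: {statistics.pstdev(uniform_w):.4f}")
```

Both variants target the same standard deviation: a uniform distribution on [-a, a] has standard deviation a/sqrt(3), which is where the factor 6 in the uniform limit comes from.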

He Initialization (for ReLU)

Formula:
  W ~ N(0, σ²) with σ = sqrt(2 / fan_in)

Why different from Xavier:
  - ReLU zeroes the negative half of its inputs
  - This halves the second moment of the output, so the signal shrinks layer by layer
  - We compensate by doubling the weight variance from 1/fan_in to 2/fan_in
  - Hence std = sqrt(2/fan_in) instead of sqrt(2/(fan_in+fan_out))

Example:
  Layer: 256 inputs -> 128 outputs, ReLU activation
  stddev = sqrt(2 / 256) = 0.088
  Each weight drawn from a zero-mean normal with standard deviation 0.088
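
The effect of the factor of 2 can be observed directly. The sketch below sends a random input through 10 ReLU layers twice, once with weight variance 2/fan_in (He) and once with variance 1/fan_in, and compares the mean-square activation at the end with that of the input. Width, depth, seed, and the omission of biases are assumptions chosen to keep the example small.

```python
import math
import random

def relu_forward_ratio(width, depth, weight_var, seed=0):
    """Mean-square activation after `depth` ReLU layers, divided by mean-square input."""
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 1.0) for _ in range(width)]
    ms_in = sum(v * v for v in a) / width
    std = math.sqrt(weight_var)
    for _ in range(depth):
        W = [[rng.gauss(0.0, std) for _ in range(width)] for _ in range(width)]
        a = [max(0.0, sum(w * v for w, v in zip(row, a))) for row in W]
    return (sum(v * v for v in a) / width) / ms_in

width, depth = 128, 10
he_ratio = relu_forward_ratio(width, depth, 2.0 / width)     # He: variance 2/fan_in
plain_ratio = relu_forward_ratio(width, depth, 1.0 / width)  # no compensation for ReLU

print(f"variance 2/fan_in: {he_ratio:.4g}")
print(f"variance 1/fan_in: {plain_ratio:.4g}")
```

Since ReLU is positively homogeneous and both runs share a seed, the two results differ by a factor of about 2^10: without the compensation the signal loses roughly half its mean-square size at every layer.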

Initialization Summary

Activation     Best Init        Std Dev Formula
-----------    -----------      ---------------
Sigmoid        Xavier/Glorot    sqrt(2/(fan_in+fan_out))
Tanh           Xavier/Glorot    sqrt(2/(fan_in+fan_out))
ReLU           He               sqrt(2/fan_in)
Leaky ReLU     He               sqrt(2/((1+alpha^2)*fan_in))
ELU            He               sqrt(2/fan_in)

Rule of thumb: Use He for ReLU variants, Xavier for saturating activations.
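
The summary table can be captured in a small lookup helper. This is a sketch only: the function name `recommended_std` and its string keys are hypothetical, and the formulas are exactly those listed in the table.

```python
import math

def recommended_std(activation, fan_in, fan_out, alpha=0.01):
    """Weight standard deviation suggested by the table for a given activation."""
    if activation in ("sigmoid", "tanh"):
        return math.sqrt(2.0 / (fan_in + fan_out))             # Xavier/Glorot
    if activation in ("relu", "elu"):
        return math.sqrt(2.0 / fan_in)                         # He
    if activation == "leaky_relu":
        return math.sqrt(2.0 / ((1.0 + alpha ** 2) * fan_in))  # He with leak correction
    raise ValueError(f"no rule of thumb for activation {activation!r}")

for name in ("tanh", "relu", "leaky_relu"):
    print(f"{name:10s} {recommended_std(name, 256, 128):.4f}")
```

For the default leak of 0.01 the Leaky ReLU value is nearly identical to the plain He value, which is why the two are usually treated alike.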

# Assumes `tf` and `layers` are imported as in the Keras implementation above

# He initialization for ReLU layers
layer_he = layers.Dense(64, activation='relu', kernel_initializer='he_normal')

# Xavier for sigmoid/tanh layers
layer_xavier = layers.Dense(64, activation='tanh', kernel_initializer='glorot_normal')

# Custom initialization
def custom_init(shape, dtype=None):
    return tf.random.normal(shape, stddev=0.01, dtype=dtype)

layer_custom = layers.Dense(64, kernel_initializer=custom_init)

## Key Takeaways

- **Perceptrons** compute a linear combination $\mathbf{w}^T \mathbf{x} + b$ followed by a non-linear activation
- **MLPs** stack layers to learn hierarchical non-linear representations; deeper networks are often more parameter-efficient
- **Backpropagation** computes all gradients via the chain rule at a cost that is a small constant multiple of one forward pass
- **Activation functions** introduce non-linearity; ReLU is the default, but GELU is preferred in transformers
- **Universal Approximation** guarantees theoretical capacity for single hidden layers, but depth can provide exponential efficiency gains for some functions
- **Weight initialization** (Xavier for sigmoid/tanh, He for ReLU) and **batch normalization** are critical for stable training
- **Vanishing/Exploding Gradients** are a central challenge in deep networks; they are mitigated by skip connections, proper initialization, and normalization

---

## Practice Exercises

1. **Activation Comparison**: Train the same network with ReLU, Sigmoid, and Tanh - compare convergence
2. **Depth vs Width**: Experiment with different architectures (many narrow vs few wide layers)
3. **Spiral Dataset**: Create a spiral dataset and train a network to classify it
4. **Gradient Visualization**: Plot gradient magnitudes across layers to understand vanishing gradients
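
As a possible starting point for the spiral exercise, the sketch below generates interleaved spiral arms in plain Python. The function name, its parameters, and their defaults are all assumptions; training a classifier on the result is left to the exercise.

```python
import math
import random

def make_spirals(n_per_class=200, n_classes=2, turns=1.5, noise=0.05, seed=0):
    """Return (X, y): 2-D points on interleaved spiral arms and their class labels."""
    rng = random.Random(seed)
    X, y = [], []
    for c in range(n_classes):
        offset = 2 * math.pi * c / n_classes   # rotate each arm evenly around the origin
        for i in range(n_per_class):
            r = (i + 1) / n_per_class           # radius grows from near 0 to 1
            theta = turns * 2 * math.pi * r + offset
            X.append((
                r * math.cos(theta) + rng.gauss(0.0, noise),
                r * math.sin(theta) + rng.gauss(0.0, noise),
            ))
            y.append(c)
    return X, y

X_spiral, y_spiral = make_spirals()
print(len(X_spiral), sorted(set(y_spiral)))
```

The lists convert directly with `np.array(X_spiral)` and `np.array(y_spiral)` for use with the Keras code earlier on this page.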
