Neural Networks: Perceptron to MLP
Introduction
Neural networks are computational systems inspired by biological neurons. They learn to map inputs to outputs through layered transformations; given enough hidden units, they can approximate any continuous function on a bounded domain to arbitrary accuracy.
Neural Network Architecture:
Input Layer (Features) → Hidden Layers (Learning) → Output Layer (Prediction)
Inputs x₁ … x₅ → hidden layer 1 (h₁⁽¹⁾, h₂⁽¹⁾, h₃⁽¹⁾) → hidden layer 2 (h₁⁽²⁾, h₂⁽²⁾) → output ŷ
Each node is a neuron, each arrow is a connection (weight), and each column of nodes is a layer.
The Perceptron
Single Perceptron
The simplest neural unit computes a weighted sum and applies an activation:
Definition: Perceptron
A perceptron computes a weighted sum of inputs, adds a bias, and applies a non-linear activation function to produce an output.
Perceptron Output
$$\hat{y} = \sigma(z), \qquad z = \sum_{i=1}^{n} w_i x_i + b$$
Here,
- $w_i$ = Weight for feature $i$
- $x_i$ = Input feature $i$
- $b$ = Bias term
- $z$ = Weighted sum (pre-activation)
where $\sigma$ is the activation function.
Single Perceptron:
Inputs x₁, x₂, x₃ (weights w₁, w₂, w₃) and bias b → Σ wᵢxᵢ + b → z → σ(z) → ŷ
Mathematical:
z = w₁x₁ + w₂x₂ + w₃x₃ + b
ŷ = σ(z)
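The weighted-sum-then-activation computation can be sketched in a few lines of NumPy. The input, weight, and bias values below are illustrative assumptions chosen so the arithmetic is easy to check by hand, and the helper name `perceptron_forward` is hypothetical.

```python
import numpy as np

def sigmoid(z):
    """Logistic activation: squashes any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def perceptron_forward(x, w, b, activation=sigmoid):
    """Return (z, y_hat) for a single perceptron."""
    z = np.dot(w, x) + b        # weighted sum plus bias (pre-activation)
    return z, activation(z)     # activation turns z into the output

# Illustrative values (assumed, not taken from the text)
x = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -1.0, 0.5])
b = 0.0

z, y_hat = perceptron_forward(x, w, b)
print(f"z = {z:.2f}, y_hat = {y_hat:.2f}")  # z = 0.00, y_hat = 0.50
```

Here z = 0.5·1 − 1.0·2 + 0.5·3 + 0 = 0, and the sigmoid of 0 is 0.5. Swapping in a different `activation` changes only the last step.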
Activation Functions
import numpy as np
import matplotlib.pyplot as plt

# ===================================================
# Activation Functions Visualization
# ===================================================
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def relu(z):
    return np.maximum(0, z)

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

def softmax(z):
    # Subtract the max before exponentiating for numerical stability
    exp_z = np.exp(z - np.max(z))
    return exp_z / exp_z.sum()

def gelu(z):
    # tanh approximation of GELU
    return 0.5 * z * (1 + np.tanh(np.sqrt(2/np.pi) * (z + 0.044715 * z**3)))

# Plot all activations
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
z = np.linspace(-5, 5, 200)
activations = [
    ('Sigmoid', sigmoid(z), r'$\sigma(z) = \frac{1}{1+e^{-z}}$'),
    ('Tanh', tanh(z), r'$\tanh(z)$'),
    ('ReLU', relu(z), r'$\max(0, z)$'),
    ('Leaky ReLU', leaky_relu(z), r'$\max(0.01z, z)$'),
    ('GELU', gelu(z), r'$0.5z(1 + \tanh(\sqrt{2/\pi}(z+0.044715z^3)))$'),
    ('Derivatives', None, '')
]
for ax, (name, func, formula) in zip(axes.flat, activations):
    if name == 'Derivatives':
        # Plot derivatives
        sig_deriv = sigmoid(z) * (1 - sigmoid(z))
        tanh_deriv = 1 - tanh(z)**2
        relu_deriv = (z > 0).astype(float)
        ax.plot(z, sig_deriv, label='Sigmoid', linewidth=2)
        ax.plot(z, tanh_deriv, label='Tanh', linewidth=2)
        ax.plot(z, relu_deriv, label='ReLU', linewidth=2)
        ax.legend()
        ax.set_title('Derivatives')
    else:
        ax.plot(z, func, linewidth=2, color='steelblue')
        ax.set_title(f'{name}\n{formula}')
    # Shared styling for every panel
    ax.grid(True, alpha=0.3)
    ax.axhline(y=0, color='k', linewidth=0.5)
    ax.axvline(x=0, color='k', linewidth=0.5)
plt.suptitle("Activation Functions", fontsize=16)
plt.tight_layout()
plt.savefig('activation_functions.png', dpi=150)
plt.show()
Activation Function Properties
| Function | Range | Key Property | Vanishing Gradient | Use Case |
|---|---|---|---|---|
| Sigmoid | (0, 1) | Outputs probabilities | Yes | Binary output |
| Tanh | (-1, 1) | Zero-centered | Yes | Hidden layers |
| ReLU | [0, ∞) | Sparse activations | No (for z > 0) | Default choice |
| Leaky ReLU | (-∞, ∞) | No dead neurons | No | When ReLU units die |
| GELU | (≈ -0.17, ∞) | Smooth ReLU | No | Transformers |
| Softmax | (0, 1), sums to 1 | Outputs probabilities | Yes | Multi-class output |
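To see where the "Vanishing Gradient" column comes from, one can compare the largest slope each activation can pass backward. This sketch assumes a chain of 10 layers purely for illustration. Real networks also multiply by the weights at every layer, so the numbers are indicative only.

```python
import numpy as np

z = np.linspace(-5, 5, 1001)
sig = 1.0 / (1.0 + np.exp(-z))

# Largest derivative each activation reaches on this grid
max_slopes = {
    "sigmoid": np.max(sig * (1.0 - sig)),       # peaks at 0.25 when z = 0
    "tanh": np.max(1.0 - np.tanh(z) ** 2),      # peaks at 1.0 when z = 0
    "relu": np.max((z > 0).astype(float)),      # exactly 1 for every z > 0
}

n_layers = 10  # assumed depth for the illustration
for name, slope in max_slopes.items():
    best_case = slope ** n_layers
    print(f"{name:8s} max slope = {slope:.2f}, "
          f"best case after {n_layers} layers = {best_case:.2e}")
```

Even in the best case, sigmoid shrinks the gradient by a factor of 4 per layer. Tanh and ReLU can pass it through unchanged, but tanh still saturates for large |z|, which is why the table marks it as prone to vanishing gradients.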
Multi-Layer Perceptron (MLP)
Forward Propagation
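Forward Propagation
$$\mathbf{a}^{(l)} = \sigma\left(\mathbf{W}^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}\right)$$
Here,
- $\mathbf{a}^{(l)}$ = Activations at layer $l$
- $\mathbf{W}^{(l)}$ = Weight matrix at layer $l$
- $\mathbf{b}^{(l)}$ = Bias vector at layer $l$
- $\sigma$ = Activation function
For a network with $L$ layers, the recursion starts from the input, $\mathbf{a}^{(0)} = \mathbf{x}$, and ends at the prediction, $\hat{\mathbf{y}} = \mathbf{a}^{(L)}$.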
Forward Propagation Flow:
Layer 0 (Input): x₁, x₂, x₃ → Layer 1 (Hidden): a₁⁽¹⁾, a₂⁽¹⁾, a₃⁽¹⁾ → Layer 2 (Output): ŷ
Step 1: z⁽¹⁾ = W⁽¹⁾x + b⁽¹⁾
Step 2: a⁽¹⁾ = σ(z⁽¹⁾)
Step 3: z⁽²⁾ = W⁽²⁾a⁽¹⁾ + b⁽²⁾
Step 4: ŷ = σ(z⁽²⁾)
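The four steps above generalize to a loop over layers. The following is a minimal NumPy sketch, assuming sigmoid at every layer (as in the steps), randomly drawn weights, and a hypothetical helper name `forward`. It is meant to show the data flow, not a tuned implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, params, activation=sigmoid):
    """Apply a^(l) = activation(W^(l) a^(l-1) + b^(l)) layer by layer.

    params is a list of (W, b) pairs, one per layer. Every activation is
    returned, starting with a^(0) = x, so the values can be reused later.
    """
    activations = [x]
    for W, b in params:
        z = W @ activations[-1] + b          # pre-activation of this layer
        activations.append(activation(z))
    return activations

# Assumed layer sizes: 3 inputs -> 3 hidden units -> 1 output
rng = np.random.default_rng(0)
sizes = [3, 3, 1]
params = [
    (rng.normal(size=(n_out, n_in)), np.zeros(n_out))
    for n_in, n_out in zip(sizes[:-1], sizes[1:])
]

activations = forward(rng.normal(size=sizes[0]), params)
for l, a in enumerate(activations):
    print(f"a^({l}) has shape {a.shape}")
y_hat = activations[-1]
```

Keeping every intermediate activation is a deliberate choice: backpropagation needs them to compute the gradients.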
Backpropagation
The chain rule enables efficient gradient computation:
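Theorem: Chain Rule for Backpropagation
The gradient of the loss with respect to any weight is the product of gradients flowing backward through the network, computed via the chain rule.
Output Layer Gradient:
Output Layer Delta
$$\boldsymbol{\delta}^{(L)} = \nabla_{\mathbf{a}^{(L)}} \mathcal{L} \odot \sigma'\left(\mathbf{z}^{(L)}\right)$$
Here,
- $\boldsymbol{\delta}^{(L)}$ = Error signal at the output layer $L$
- $\nabla_{\mathbf{a}^{(L)}} \mathcal{L}$ = Gradient of loss w.r.t. activations
- $\sigma'(\mathbf{z}^{(L)})$ = Derivative of activation function
Hidden Layer Gradient (Recursive):
Hidden Layer Delta (Backpropagation)
$$\boldsymbol{\delta}^{(l)} = \left( (\mathbf{W}^{(l+1)})^T \boldsymbol{\delta}^{(l+1)} \right) \odot \sigma'\left(\mathbf{z}^{(l)}\right)$$
Here,
- $\boldsymbol{\delta}^{(l)}$ = Error signal at layer $l$
- $(\mathbf{W}^{(l+1)})^T$ = Transpose of next layer's weights
- $\boldsymbol{\delta}^{(l+1)}$ = Error signal from next layer
Weight Update:
$$\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}} = \boldsymbol{\delta}^{(l)} (\mathbf{a}^{(l-1)})^T, \qquad \mathbf{W}^{(l)} \leftarrow \mathbf{W}^{(l)} - \eta \, \frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}}$$
where $\eta$ is the learning rate and $\odot$ denotes element-wise multiplication.
ℹ️ Computational Complexity of Backpropagation
The forward pass is dominated by matrix multiplications, roughly one multiply-add per weight. The backward pass has the same asymptotic cost: it is a small constant multiple of the forward pass, because each layer needs gradients with respect to both its weights and its inputs. A full forward-plus-backward step therefore costs only a small constant multiple of a single forward pass, which makes backpropagation an extremely efficient way to compute gradients for all parameters simultaneously.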
Backpropagation (Reverse Mode):
Forward Pass (Compute Predictions):
x → [z⁽¹⁾] → [a⁽¹⁾] → [z⁽²⁾] → [a⁽²⁾] → ŷ → ℒ (loss)
Backward Pass (Compute Gradients):
∂ℒ/∂x (not needed) ← δ⁽¹⁾ (hidden) ← δ⁽²⁾ (output) ← ℒ (loss)
Key Insight: Gradients flow backward through the network
Each layer needs the gradient from the layer after it
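The delta recursion can be checked numerically. The sketch below assumes a 3-4-1 network with sigmoid activations and a squared-error loss; the sizes, the loss, and the function names are illustrative choices, not something the text prescribes. It compares one backpropagated gradient against a finite-difference estimate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    a1 = sigmoid(W1 @ x + b1)
    a2 = sigmoid(W2 @ a1 + b2)
    return a1, a2

def loss(x, y, W1, b1, W2, b2):
    _, a2 = forward(x, W1, b1, W2, b2)
    return 0.5 * np.sum((a2 - y) ** 2)

def backward(x, y, W1, b1, W2, b2):
    a1, a2 = forward(x, W1, b1, W2, b2)
    # Output delta: dL/da2 times sigma'(z2), using sigma'(z) = a * (1 - a)
    delta2 = (a2 - y) * a2 * (1.0 - a2)
    # Hidden delta: push delta2 back through W2, then through sigma'(z1)
    delta1 = (W2.T @ delta2) * a1 * (1.0 - a1)
    return {
        "W1": np.outer(delta1, x), "b1": delta1,
        "W2": np.outer(delta2, a1), "b2": delta2,
    }

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), np.array([1.0])
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)

grads = backward(x, y, W1, b1, W2, b2)

# Central finite difference on a single weight as an independent check
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 0] += eps
W1_minus[0, 0] -= eps
numeric = (loss(x, y, W1_plus, b1, W2, b2)
           - loss(x, y, W1_minus, b1, W2, b2)) / (2 * eps)
print(f"backprop: {grads['W1'][0, 0]:.8f}  numeric: {numeric:.8f}")
```

The finite-difference check needs two forward passes per weight, whereas `backward` returns every gradient from one forward and one backward sweep.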
Universal Approximation Theorem
Theorem: Universal Approximation Theorem (Cybenko, 1989)
A feedforward network with a single hidden layer containing a finite number of neurons can approximate any continuous function on compact subsets of $\mathbb{R}^n$ to arbitrary accuracy, given appropriate weights and a suitable non-linear activation function. Cybenko's proof covers sigmoidal activations; later results extend it to any non-polynomial activation.
ℹ️ Theoretical vs. Practical Implications
While the theorem guarantees that a single hidden layer is theoretically sufficient, it does not specify how many neurons are needed (the bound may be exponentially large), nor does it address learnability. In practice, deeper networks are more parameter-efficient: a deep network can represent certain functions that would require exponentially many neurons in a single hidden layer. This is one reason deep learning emphasizes depth rather than width alone.
Universal Approximation:
Any continuous function can be approximated.
[Figure: a true function (left) next to its approximation with 5 neurons (right), drawn as piecewise-linear curves.]
More neurons = better approximation
But: width vs depth tradeoff exists
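The "more neurons = better approximation" claim can be illustrated with a small experiment. This is a sketch under simplifying assumptions: the hidden weights are random and fixed, only the output weights are fitted by least squares (not gradient descent), and the target function sin(2x) is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-np.pi, np.pi, 200)
target = np.sin(2 * x)                      # assumed target function

# One pool of random tanh hidden units; narrower networks use a prefix of it
max_width = 50
w = rng.normal(scale=2.0, size=max_width)
b = rng.uniform(-np.pi, np.pi, size=max_width)
H = np.tanh(np.outer(x, w) + b)             # shape (200, max_width)

def fit_error(k):
    """RMS error of the best output layer on top of the first k hidden units."""
    features = np.column_stack([H[:, :k], np.ones_like(x)])  # add output bias
    coef, *_ = np.linalg.lstsq(features, target, rcond=None)
    return np.sqrt(np.mean((features @ coef - target) ** 2))

errors = {k: fit_error(k) for k in (2, 5, 20, 50)}
for k, err in errors.items():
    print(f"{k:2d} hidden units -> RMS error {err:.4f}")
```

Because each wider network contains the narrower ones, the fitted error can only go down as units are added. How fast it falls depends on the target and on the random draw.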
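Forward Pass Worked Example
Consider a 2-layer network with input $\mathbf{x}$, weights $\mathbf{W}^{(1)}$, $\mathbf{W}^{(2)}$, and biases $\mathbf{b}^{(1)}$, $\mathbf{b}^{(2)}$.
Layer 1: $\mathbf{z}^{(1)} = \mathbf{W}^{(1)}\mathbf{x} + \mathbf{b}^{(1)}$, then $\mathbf{a}^{(1)} = \sigma(\mathbf{z}^{(1)})$
Layer 2: $\mathbf{z}^{(2)} = \mathbf{W}^{(2)}\mathbf{a}^{(1)} + \mathbf{b}^{(2)}$, then $\hat{y} = \sigma(\mathbf{z}^{(2)})$
This forward pass computes the prediction. Backpropagation would then compute gradients flowing backward through these same operations.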
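One concrete instance of such a forward pass, with numbers small enough to verify by hand, might look like the following. All values are assumptions made up for this illustration, as is the choice of ReLU for the hidden layer and sigmoid for the output.

```python
import numpy as np

# Assumed example values (not from the text)
x = np.array([1.0, 2.0])
W1 = np.array([[0.5, -0.5],
               [0.25, 0.75]])
b1 = np.array([0.0, -1.0])
W2 = np.array([[1.0, 2.0]])
b2 = np.array([-1.5])

# Layer 1: affine map followed by ReLU
z1 = W1 @ x + b1                    # [0.5 - 1.0 + 0.0, 0.25 + 1.5 - 1.0]
a1 = np.maximum(0.0, z1)            # the negative entry is clipped to 0

# Layer 2: affine map followed by sigmoid
z2 = W2 @ a1 + b2                   # 1.0 * 0 + 2.0 * 0.75 - 1.5
y_hat = 1.0 / (1.0 + np.exp(-z2))

print("z1 =", z1)        # [-0.5   0.75]
print("a1 =", a1)        # [0.   0.75]
print("z2 =", z2)        # [0.]
print("y_hat =", y_hat)  # [0.5]
```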
Complete Keras Implementation
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, callbacks
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# ===================================================
# Generate Dataset (Non-linear)
# ===================================================
X, y = make_moons(n_samples=2000, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Fit the scaler on the training split only, then reuse it on the test split
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Visualize
plt.figure(figsize=(8, 6))
plt.scatter(X[y==0, 0], X[y==0, 1], alpha=0.5, label='Class 0')
plt.scatter(X[y==1, 0], X[y==1, 1], alpha=0.5, label='Class 1')
plt.title("Make Moons Dataset")
plt.legend()
plt.show()
# ===================================================
# Model 1: Simple Perceptron (Linear Boundary)
# ===================================================
perceptron = keras.Sequential([
    keras.Input(shape=(2,)),  # explicit input; avoids the deprecated input_shape argument
    layers.Dense(1, activation='sigmoid')
])
perceptron.compile(
    optimizer='sgd',
    loss='binary_crossentropy',
    metrics=['accuracy']
)
print("Model 1: Single Perceptron")
perceptron.summary()
history_p = perceptron.fit(
    X_train, y_train,
    epochs=50,
    batch_size=32,
    validation_split=0.2,
    verbose=0
)
# ===================================================
# Model 2: Deep MLP
# ===================================================
mlp_model = keras.Sequential([
    keras.Input(shape=(2,)),
    # First hidden layer
    layers.Dense(64, activation='relu'),
    layers.BatchNormalization(),
    layers.Dropout(0.3),
    # Remaining hidden layers
    layers.Dense(32, activation='relu'),
    layers.BatchNormalization(),
    layers.Dropout(0.2),
    layers.Dense(16, activation='relu'),
    # Output layer
    layers.Dense(1, activation='sigmoid')
])
# Compile with an explicit learning rate
optimizer = keras.optimizers.Adam(learning_rate=0.001)
mlp_model.compile(
    optimizer=optimizer,
    loss='binary_crossentropy',
    metrics=['accuracy', keras.metrics.AUC(name='auc')]
)
print("\nModel 2: Deep MLP")
mlp_model.summary()
# ===================================================
# Training with Callbacks
# ===================================================
callback_list = [
    # Early stopping
    callbacks.EarlyStopping(
        monitor='val_loss',
        patience=10,
        restore_best_weights=True,
        verbose=1
    ),
    # Reduce learning rate on plateau
    callbacks.ReduceLROnPlateau(
        monitor='val_loss',
        factor=0.5,
        patience=5,
        min_lr=1e-6,
        verbose=1
    ),
    # Model checkpoint
    callbacks.ModelCheckpoint(
        'best_mlp_model.keras',
        monitor='val_auc',
        mode='max',
        save_best_only=True,
        verbose=1
    )
]
history = mlp_model.fit(
    X_train, y_train,
    epochs=200,
    batch_size=32,
    validation_split=0.2,
    callbacks=callback_list,
    verbose=1
)
# ===================================================
# Evaluation
# ===================================================
test_loss, test_acc, test_auc = mlp_model.evaluate(X_test, y_test)
print(f"\nTest Accuracy: {test_acc:.4f}")
print(f"Test AUC: {test_auc:.4f}")
# ===================================================
# Plot Training History
# ===================================================
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Loss
axes[0].plot(history.history['loss'], label='Train Loss')
axes[0].plot(history.history['val_loss'], label='Val Loss')
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Loss')
axes[0].set_title('Training & Validation Loss')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
# Accuracy
axes[1].plot(history.history['accuracy'], label='Train Acc')
axes[1].plot(history.history['val_accuracy'], label='Val Acc')
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Accuracy')
axes[1].set_title('Training & Validation Accuracy')
axes[1].legend()
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('training_history.png', dpi=150)
plt.show()
# ===================================================
# Decision Boundary Visualization
# ===================================================
def plot_decision_boundary(model, X, y, title):
    h = 0.02  # grid step size
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    # Predict the class-1 probability at every grid point
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()], verbose=0)
    Z = Z.reshape(xx.shape)
    plt.figure(figsize=(10, 8))
    plt.contourf(xx, yy, Z, alpha=0.8, cmap=plt.cm.RdBu)
    plt.scatter(X[y==0, 0], X[y==0, 1], c='red', label='Class 0', edgecolors='k')
    plt.scatter(X[y==1, 0], X[y==1, 1], c='blue', label='Class 1', edgecolors='k')
    plt.title(title)
    plt.legend()
    plt.savefig(f'{title.lower().replace(" ", "_")}.png', dpi=150)
    plt.show()

plot_decision_boundary(mlp_model, X_test, y_test, "MLP Decision Boundary")
Weight Initialization
Why Initialization Matters
Random initialization breaks symmetry between neurons. If all weights in a layer start at the same value, every neuron in that layer computes the same output, receives the same gradient, and so learns the same feature. The scale of the initial weights matters just as much: it determines whether signals survive through deep networks.
⚠️ Vanishing/Exploding Gradients
If weights are too small ($|w| < 1$, e.g. 0.5), signals shrink exponentially through layers: after 10 layers, a signal of 1.0 becomes about 0.001. If weights are too large ($|w| > 1$, e.g. 2.0), signals explode to 1024 after 10 layers. Optimal initialization keeps $|w| \approx 1/\sqrt{\text{fan\_in}}$.
The Vanishing/Exploding Gradient Problem:
Layer 1 Layer 2 Layer 3 Layer 4
[h1] ------- [h2] ------- [h3] ------- [h4] -------> output
| | | |
w1 w2 w3 w4
If |w| < 1 (e.g., 0.5):
Signal: 1.0 -> 0.5 -> 0.25 -> 0.125 -> 0.0625
After 10 layers: 0.001 (vanishes!)
Gradients also shrink --> early layers learn NOTHING
If |w| > 1 (e.g., 2.0):
Signal: 1.0 -> 2.0 -> 4.0 -> 8.0 -> 16.0
After 10 layers: 1024 (explodes!)
Gradients also grow --> unstable training
Optimal: |w| ≈ 1/sqrt(fan_in)
Signal stays roughly constant through layers
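The same effect appears with full weight matrices. The sketch below is a simplification: it assumes purely linear layers (no activation) of width 256 and draws weights with standard deviation `scale / sqrt(fan_in)`, so `scale` plays the role of |w| in the diagram. The function name `propagate` is hypothetical.

```python
import numpy as np

def propagate(scale, n_layers=10, width=256, seed=0):
    """Return the ratio of output RMS to input RMS after n_layers linear layers."""
    rng = np.random.default_rng(seed)
    h = rng.normal(size=width)
    start_rms = np.sqrt(np.mean(h ** 2))
    for _ in range(n_layers):
        # Weight std is scale / sqrt(fan_in); scale = 1 is the balanced case
        W = rng.normal(scale=scale / np.sqrt(width), size=(width, width))
        h = W @ h
    return np.sqrt(np.mean(h ** 2)) / start_rms

for scale in (0.5, 1.0, 2.0):
    print(f"scale = {scale}: signal ratio after 10 layers = {propagate(scale):.4g}")
```

With `scale = 0.5` the ratio lands near 0.5¹⁰ ≈ 0.001 and with `scale = 2.0` near 2¹⁰ = 1024, while `scale = 1.0` keeps it close to 1. Exact values vary with the random draw.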
Xavier/Glorot Initialization (for Sigmoid/Tanh)
Formula:
W ~ N(0, σ²) with σ = sqrt(2 / (fan_in + fan_out))
Or uniform: W ~ U(-sqrt(6/(fan_in+fan_out)), sqrt(6/(fan_in+fan_out)))
Why this works:
- fan_in: number of inputs to this layer
- fan_out: number of outputs from this layer
- Variance of output = variance of input (when activation is linear); averaging fan_in and fan_out balances this for both the forward signal and the backward gradient
- Sigmoid and tanh are approximately linear near z=0, so this works well
Example:
Layer: 256 inputs -> 128 outputs
stddev = sqrt(2 / (256 + 128)) = sqrt(2/384) ≈ 0.072
Each weight drawn from a normal distribution with mean 0 and stddev 0.072
He Initialization (for ReLU)
Formula:
W ~ N(0, σ²) with σ = sqrt(2 / fan_in)
Why different from Xavier:
- ReLU kills half the activations (sets negative to 0)
- This halves the variance of the output
- We need to compensate by doubling the weight variance
- Hence variance 2/fan_in, twice the 1/fan_in that Xavier gives when fan_in = fan_out
Example:
Layer: 256 inputs -> 128 outputs, ReLU activation
stddev = sqrt(2 / 256) ≈ 0.088
Each weight drawn from a normal distribution with mean 0 and stddev 0.088
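The two example standard deviations, and the claim that ReLU needs the larger one, can be checked with a quick simulation. This is a sketch assuming standard-normal inputs and a single 256 -> 128 ReLU layer; the batch size of 1000 is arbitrary.

```python
import numpy as np

fan_in, fan_out = 256, 128
xavier_std = np.sqrt(2.0 / (fan_in + fan_out))
he_std = np.sqrt(2.0 / fan_in)
print(f"Xavier std = {xavier_std:.3f}, He std = {he_std:.3f}")  # 0.072, 0.088

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, fan_in))          # inputs with mean square close to 1

mean_square = {}
for name, std in [("Xavier", xavier_std), ("He", he_std)]:
    W = rng.normal(0.0, std, size=(fan_out, fan_in))
    out = np.maximum(0.0, X @ W.T)           # one ReLU layer
    mean_square[name] = np.mean(out ** 2)    # signal strength after the layer
    print(f"{name:6s} init: mean square after ReLU = {mean_square[name]:.3f}")
```

With He initialization the mean square after the ReLU stays close to the input's value of 1. With Xavier it drops to roughly two thirds for these layer sizes, a loss that would compound over many layers.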
Initialization Summary
| Activation | Best Init | Weight Std Dev |
|---|---|---|
| Sigmoid | Xavier/Glorot | sqrt(2/(fan_in+fan_out)) |
| Tanh | Xavier/Glorot | sqrt(2/(fan_in+fan_out)) |
| ReLU | He | sqrt(2/fan_in) |
| Leaky ReLU | He | sqrt(2/((1+alpha^2)*fan_in)) |
| ELU | He | sqrt(2/fan_in) |
Rule of thumb: Use He for ReLU variants, Xavier for saturating activations.
import tensorflow as tf
from tensorflow.keras import layers

# He initialization for ReLU layers
layer_he = layers.Dense(64, activation='relu', kernel_initializer='he_normal')

# Xavier for sigmoid/tanh layers
layer_xavier = layers.Dense(64, activation='tanh', kernel_initializer='glorot_normal')

# Custom initialization: small Gaussian weights with a fixed stddev
def custom_init(shape, dtype=None):
    return tf.random.normal(shape, stddev=0.01, dtype=dtype)

layer_custom = layers.Dense(64, kernel_initializer=custom_init)
<MathSummary title="Key Takeaways">
- **Perceptrons** compute linear combinations <MathBlock tex={`\\mathbf{w}^T \\mathbf{x} + b`} /> followed by a non-linear activation
- **MLPs** stack layers to learn hierarchical non-linear representations; deeper networks are more parameter-efficient
- **Backpropagation** computes gradients efficiently via the chain rule in <MathBlock tex={`O(\\text{forward pass})`} /> time
- **Activation functions** introduce non-linearity; ReLU is the default, but GELU is preferred in transformers
- **Universal Approximation** guarantees theoretical capacity for single hidden layers, but depth provides exponential efficiency gains in practice
- **Weight initialization** (Xavier for sigmoid/tanh, He for ReLU) and **batch normalization** are critical for stable training
- **Vanishing/Exploding Gradients** are a central challenge in deep networks; they are mitigated by skip connections, proper initialization, and normalization
</MathSummary>
---
## Practice Exercises
1. **Activation Comparison**: Train the same network with ReLU, Sigmoid, and Tanh - compare convergence
2. **Depth vs Width**: Experiment with different architectures (many narrow vs few wide layers)
3. **Spiral Dataset**: Create a spiral dataset and train a network to classify it
4. **Gradient Visualization**: Plot gradient magnitudes across layers to understand vanishing gradients
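As a possible starting point for the spiral exercise, a two-class spiral dataset can be generated with NumPy alone. The function name `make_spirals` and its parameters (points per class, noise level, number of turns) are hypothetical choices for this sketch, not part of any library.

```python
import numpy as np

def make_spirals(n_per_class=500, noise=0.2, turns=1.5, seed=0):
    """Return (X, y): two interleaved spiral arms labelled 0 and 1."""
    rng = np.random.default_rng(seed)
    # Square root spreads the points more evenly along the arm
    t = np.sqrt(rng.uniform(size=n_per_class)) * turns * 2 * np.pi
    arm = np.column_stack([t * np.cos(t), t * np.sin(t)])
    # The second arm is the first one rotated by 180 degrees
    X = np.vstack([arm, -arm])
    X = X + rng.normal(scale=noise, size=X.shape)
    y = np.concatenate([np.zeros(n_per_class, dtype=int),
                        np.ones(n_per_class, dtype=int)])
    return X, y

X_spiral, y_spiral = make_spirals()
print(X_spiral.shape, y_spiral.shape)  # (1000, 2) (1000,)
```

The resulting arrays can be split and scaled in the same way as the moons data before training.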