What Is a Neural Network?
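A neural network is a function approximator built from layers of interconnected nodes (neurons) that transform input data into predictions. Loosely inspired by the brain, neural networks excel at finding complex non-linear patterns.
Key Insight: A feedforward network with a suitable non-linear activation and enough hidden neurons can approximate any continuous function on a compact (closed and bounded) input domain to arbitrary accuracy. This is the Universal Approximation Theorem. It guarantees that suitable weights exist, not that training will find them.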
Anatomy of One Neuron
x₁ ──w₁──┐
x₂ ──w₂──┤
x₃ ──w₃──┼──→ [Σ xᵢwᵢ + b] ──→ [activation f] ──→ output
... │
xₙ ──wₙ──┘
Mathematically:
$$\text{output} = f\left(\sum_{i=1}^{n} w_i x_i + b\right)$$
Where $f$ is the activation function, $w_i$ are the weights, and $b$ is the bias term.
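The single-neuron computation can be sketched in a few lines of NumPy. This is a minimal illustration, not part of the original implementation: the function name `neuron` and the input values are hypothetical, chosen so that the weighted sum is exactly zero.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b, f):
    """Weighted sum of the inputs plus bias, passed through activation f."""
    z = np.dot(w, x) + b
    return f(z)

# Hypothetical example values: 0.5*1 + (-0.25)*2 + 0*3 + 0 = 0
x = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -0.25, 0.0])
b = 0.0

output = neuron(x, w, b, sigmoid)
print(output)  # 0.5, because sigmoid(0) = 0.5
```

Swapping `sigmoid` for another function changes the neuron's behaviour without touching the weighted-sum step.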
Activation Functions
Activation functions introduce non-linearity — without them, stacking layers would still give a linear model.
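The claim that stacked layers without activations stay linear can be checked numerically. The sketch below uses random, hypothetical weights and assumes samples are stored as rows. It shows that two linear layers equal one linear layer with combined weights, and that inserting ReLU breaks that equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))
W1, b1 = rng.normal(size=(4, 8)), rng.normal(size=(1, 8))
W2, b2 = rng.normal(size=(8, 3)), rng.normal(size=(1, 3))

# Two stacked layers with no activation in between
two_layers = (X @ W1 + b1) @ W2 + b2

# The same map written as a single linear layer
W_combined = W1 @ W2
b_combined = b1 @ W2 + b2
one_layer = X @ W_combined + b_combined

collapses = np.allclose(two_layers, one_layer)
print(collapses)  # True

# With ReLU between the layers, the single linear layer no longer matches
with_relu = np.maximum(0, X @ W1 + b1) @ W2 + b2
differs = not np.allclose(with_relu, one_layer)
print(differs)
```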
| Function | Formula | Range | Use Case |
|---|---|---|---|
| Sigmoid | 1/(1+e^-z) | (0, 1) | Binary output layer |
| Tanh | (e^z - e^-z)/(e^z + e^-z) | (-1, 1) | Hidden layers (older) |
| ReLU | max(0, z) | [0, ∞) | Default for hidden layers |
| Leaky ReLU | max(0.01z, z) | (-∞, ∞) | Mitigates the dying-ReLU problem |
| Softmax | e^(z_i) / Σ_j e^(z_j) | (0, 1), outputs sum to 1 | Multi-class output layer |
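Softmax is the one entry in the table that acts on a whole vector rather than a single value. A minimal sketch follows, with hypothetical logits. Subtracting the row maximum before exponentiating is a common stability trick that leaves the result unchanged.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax; subtracting the row maximum avoids overflow in exp."""
    shifted = z - z.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

logits = np.array([
    [2.0, 1.0, 0.1],
    [1000.0, 1000.0, 1000.0],  # a naive exp(1000) would overflow
])
probs = softmax(logits)
print(probs.sum(axis=1))  # each row sums to 1
print(probs[1])           # equal logits give equal probabilities of 1/3
```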
[Figure: sketches of the Sigmoid, ReLU, and Leaky ReLU curves. Sigmoid rises from 0 to 1 and passes through 0.5 at z = 0.]
import numpy as np
import matplotlib.pyplot as plt
# Evaluate each activation on the same grid of inputs
z = np.linspace(-5, 5, 300)
activations = {
    'Sigmoid': 1 / (1 + np.exp(-z)),
    'Tanh': np.tanh(z),
    'ReLU': np.maximum(0, z),
    'Leaky ReLU': np.where(z >= 0, z, 0.01 * z),
}
# One subplot per activation function
fig, axes = plt.subplots(1, 4, figsize=(16, 3))
colors = ['#3b82f6', '#10b981', '#f59e0b', '#ef4444']
for ax, (name, a), color in zip(axes, activations.items(), colors):
    ax.plot(z, a, color=color, linewidth=2.5)
    ax.axhline(0, color='gray', linewidth=0.5, linestyle='--')
    ax.axvline(0, color='gray', linewidth=0.5, linestyle='--')
    ax.set_title(name, fontsize=12, fontweight='bold')
    ax.set_ylim(-1.5, 1.5)
    ax.grid(True, alpha=0.3)
plt.suptitle('Activation Functions', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
Multi-Layer Perceptron (MLP) Architecture
INPUT LAYER HIDDEN LAYER 1 HIDDEN LAYER 2 OUTPUT LAYER
(4 neurons) (8 neurons) (8 neurons) (3 neurons)
x₁ ──────→ ○ ○ ○ ○ → ○ ○ ○ ○ → ŷ₁ (class 1)
x₂ ──────→ ○ ○ ○ ○ → ○ ○ ○ ○ → ŷ₂ (class 2)
x₃ ──────→ ○ ○ ○ ○ → ○ ○ ○ ○ → ŷ₃ (class 3)
x₄ ──────→ ○ ○ ○ ○ → ○ ○ ○ ○
Every neuron connects to every neuron in the next layer (dense/fully connected)
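Layer computations:
$$z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}, \qquad a^{(l)} = f\left(z^{(l)}\right)$$
Where $l$ is the layer index, $W^{(l)}$ is the weight matrix, and $b^{(l)}$ is the bias vector. The input serves as the first activation, $a^{(0)} = x$.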
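A forward pass through the 4-8-8-3 architecture from the diagram can be sketched as a loop over layers. This is an illustration under assumptions: the weights are random, the helper names are hypothetical, and samples are stored as rows, so each layer is computed as `a @ W + b` (the transposed form of the column-vector convention).

```python
import numpy as np

rng = np.random.default_rng(0)
layer_sizes = [4, 8, 8, 3]  # input, hidden 1, hidden 2, output

# One (weights, bias) pair per transition between layers
params = [
    (rng.normal(size=(n_in, n_out)) * np.sqrt(2 / n_in), np.zeros((1, n_out)))
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]

def forward(X, params):
    a = X
    for i, (W, b) in enumerate(params):
        z = a @ W + b                      # linear step
        if i == len(params) - 1:           # output layer: softmax
            e = np.exp(z - z.max(axis=1, keepdims=True))
            a = e / e.sum(axis=1, keepdims=True)
        else:                              # hidden layers: ReLU
            a = np.maximum(0, z)
        print(f"layer {i + 1}: output shape {a.shape}")
    return a

X = rng.normal(size=(5, 4))                # 5 samples, 4 features
y_hat = forward(X, params)

n_params = sum(W.size + b.size for W, b in params)
print(n_params)  # (4*8 + 8) + (8*8 + 8) + (8*3 + 3) = 139
```

Counting parameters this way makes the cost of dense connectivity concrete: every added neuron brings one weight per neuron in the previous layer, plus a bias.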
Backpropagation — How Neural Networks Learn
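Training repeats three steps: a forward pass to compute the loss, a backward pass to compute the gradients of the loss with respect to every weight, and a weight update that uses those gradients.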
Step-by-Step
Forward Pass:
Input X → [Layer 1] → [Layer 2] → Output ŷ → Loss L(y, ŷ)
Backward Pass (chain rule):
∂L/∂W² = ∂L/∂ŷ · ∂ŷ/∂z² · ∂z²/∂W²
∂L/∂W¹ = ∂L/∂ŷ · ∂ŷ/∂z² · ∂z²/∂a¹ · ∂a¹/∂z¹ · ∂z¹/∂W¹
Weight Update (Gradient Descent):
W ← W - η · ∂L/∂W (η = learning rate)
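The chain rule unrolled for the first-layer weights (superscripts in parentheses are layer indices, not powers):
$$\frac{\partial L}{\partial W^{(1)}} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z^{(2)}} \cdot \frac{\partial z^{(2)}}{\partial a^{(1)}} \cdot \frac{\partial a^{(1)}}{\partial z^{(1)}} \cdot \frac{\partial z^{(1)}}{\partial W^{(1)}}$$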
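One way to sanity-check backward-pass formulas is a numerical gradient check: perturb each weight slightly, measure the change in loss, and compare against the chain-rule gradient. The sketch below is an illustration under assumptions. The tiny network, its sizes, and the function names are hypothetical, and it uses tanh instead of ReLU so that the loss is smooth everywhere and finite differences behave well.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
y = np.eye(3)[rng.integers(0, 3, size=6)]  # one-hot labels
W1 = rng.normal(size=(4, 5)) * 0.5
W2 = rng.normal(size=(5, 3)) * 0.5

def forward(W1, W2):
    a1 = np.tanh(X @ W1)
    z2 = a1 @ W2
    e = np.exp(z2 - z2.max(axis=1, keepdims=True))
    return a1, e / e.sum(axis=1, keepdims=True)

def loss_fn(W1, W2):
    _, p = forward(W1, W2)
    return -np.sum(y * np.log(p)) / len(X)

def analytic_grads(W1, W2):
    a1, p = forward(W1, W2)
    dz2 = (p - y) / len(X)        # dL/dz2 for softmax + cross-entropy
    dW2 = a1.T @ dz2              # dL/dW2
    da1 = dz2 @ W2.T              # dL/da1
    dz1 = da1 * (1 - a1 ** 2)     # tanh'(z) = 1 - tanh(z)^2
    dW1 = X.T @ dz1               # dL/dW1
    return dW1, dW2

def numeric_grad(f, W, eps=1e-6):
    """Central finite differences; W is perturbed in place and restored."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        old = W[idx]
        W[idx] = old + eps
        plus = f()
        W[idx] = old - eps
        minus = f()
        W[idx] = old
        grad[idx] = (plus - minus) / (2 * eps)
    return grad

dW1, dW2 = analytic_grads(W1, W2)
err_W1 = np.abs(dW1 - numeric_grad(lambda: loss_fn(W1, W2), W1)).max()
err_W2 = np.abs(dW2 - numeric_grad(lambda: loss_fn(W1, W2), W2)).max()
print(f"max abs error  W1: {err_W1:.2e}  W2: {err_W2:.2e}")
```

If the two gradients disagree by much more than the finite-difference noise, the backward pass likely contains a bug.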
Full Implementation — NumPy from Scratch
import numpy as np
import matplotlib.pyplot as plt  # used for the learning-curve plot at the end
class NeuralNetwork:
    """2-layer neural network built from scratch with NumPy."""
    def __init__(self, input_size, hidden_size, output_size, lr=0.01):
        # He initialisation (scale sqrt(2 / fan_in)), suited to ReLU;
        # helps keep gradients from vanishing or exploding
        self.W1 = np.random.randn(input_size, hidden_size) * np.sqrt(2 / input_size)
        self.b1 = np.zeros((1, hidden_size))
        self.W2 = np.random.randn(hidden_size, output_size) * np.sqrt(2 / hidden_size)
        self.b2 = np.zeros((1, output_size))
        self.lr = lr
        self.losses = []
    def _relu(self, z): return np.maximum(0, z)
    def _relu_grad(self, z): return (z > 0).astype(float)
    def _sigmoid(self, z): return 1 / (1 + np.exp(-np.clip(z, -500, 500)))
    def _softmax(self, z):
        # Subtract the row maximum for numerical stability
        e = np.exp(z - z.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)
    def forward(self, X):
        self.z1 = X @ self.W1 + self.b1        # (n, hidden)
        self.a1 = self._relu(self.z1)          # (n, hidden)
        self.z2 = self.a1 @ self.W2 + self.b2  # (n, output)
        self.a2 = self._softmax(self.z2)       # (n, output)
        return self.a2
    def loss(self, y_pred, y_true):
        """Cross-entropy loss with one-hot y_true."""
        n = y_true.shape[0]
        return -np.sum(y_true * np.log(y_pred + 1e-9)) / n
    def backward(self, X, y_true):
        n = X.shape[0]
        # Output layer gradient (softmax + cross-entropy combined)
        dz2 = (self.a2 - y_true) / n               # (n, output)
        dW2 = self.a1.T @ dz2                      # (hidden, output)
        db2 = dz2.sum(axis=0, keepdims=True)       # (1, output)
        # Hidden layer gradient
        da1 = dz2 @ self.W2.T                      # (n, hidden)
        dz1 = da1 * self._relu_grad(self.z1)       # (n, hidden)
        dW1 = X.T @ dz1                            # (input, hidden)
        db1 = dz1.sum(axis=0, keepdims=True)       # (1, hidden)
        # Update weights
        self.W2 -= self.lr * dW2
        self.b2 -= self.lr * db2
        self.W1 -= self.lr * dW1
        self.b1 -= self.lr * db1
    def train(self, X, y, epochs=500, verbose=True):
        for epoch in range(epochs):
            y_pred = self.forward(X)
            loss = self.loss(y_pred, y)
            self.backward(X, y)
            self.losses.append(loss)
            if verbose and epoch % 100 == 0:
                acc = (y_pred.argmax(1) == y.argmax(1)).mean()
                print(f"Epoch {epoch:4d} | Loss: {loss:.4f} | Acc: {acc:.4f}")
    def predict(self, X):
        return self.forward(X).argmax(axis=1)
# ── Test on Iris dataset ─────────────────────────────────────────────
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split
iris = load_iris()
X, y_raw = iris.data, iris.target.reshape(-1, 1)
# sparse_output requires scikit-learn 1.2 or newer
y_ohe = OneHotEncoder(sparse_output=False).fit_transform(y_raw)
X_scaled = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y_ohe, test_size=0.2, random_state=42
)
nn = NeuralNetwork(input_size=4, hidden_size=16, output_size=3, lr=0.05)
nn.train(X_train, y_train, epochs=500)
# Evaluate
y_pred = nn.predict(X_test)
y_true = y_test.argmax(1)
acc = (y_pred == y_true).mean()
print(f"\nTest Accuracy: {acc:.4f}")
# Plot learning curve
plt.figure(figsize=(8, 4))
plt.plot(nn.losses, color='#3b82f6', linewidth=2)
plt.xlabel('Epoch')
plt.ylabel('Cross-Entropy Loss')
plt.title('Training Loss Curve')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
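Example output (the weights are initialised randomly without a fixed seed, so exact values vary between runs):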
Epoch 0 | Loss: 1.1542 | Acc: 0.3667
Epoch 100 | Loss: 0.5423 | Acc: 0.8250
Epoch 200 | Loss: 0.2871 | Acc: 0.9083
Epoch 300 | Loss: 0.1934 | Acc: 0.9417
Epoch 400 | Loss: 0.1482 | Acc: 0.9583
Test Accuracy: 0.9667
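The `np.sqrt(2 / input_size)` factor in the constructor above is what keeps signals at a usable scale. A rough sketch of why the scale matters: push random data through a deep stack of ReLU layers and measure the spread of the final activations. The depth, width, and function name here are hypothetical choices for illustration, not values from the Iris example.

```python
import numpy as np

def final_activation_std(scale_fn, depth=20, width=256, seed=0):
    """Std of activations after `depth` ReLU layers with weights scaled by scale_fn(fan_in)."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(512, width))
    for _ in range(depth):
        W = rng.normal(size=(width, width)) * scale_fn(width)
        a = np.maximum(0, a @ W)
    return a.std()

too_small = final_activation_std(lambda fan_in: 0.01)
he_scaled = final_activation_std(lambda fan_in: np.sqrt(2 / fan_in))
too_large = final_activation_std(lambda fan_in: 1.0)

print(f"scale 0.01          -> {too_small:.3e}")  # activations shrink towards zero
print(f"scale sqrt(2/fan_in) -> {he_scaled:.3e}")  # activations stay at a moderate scale
print(f"scale 1.0           -> {too_large:.3e}")  # activations blow up
```

Gradients pass back through the same weight matrices, so a scale that shrinks or inflates activations tends to do the same to gradients.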
Keras/TensorFlow Implementation
import matplotlib.pyplot as plt  # used for the training-history plots below
import tensorflow as tf
from tensorflow import keras
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
# Data
iris = load_iris()
X = StandardScaler().fit_transform(iris.data)
y = keras.utils.to_categorical(iris.target, 3)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Build model
model = keras.Sequential([
    keras.layers.Dense(64, activation='relu', input_shape=(4,),
                       kernel_initializer='he_normal'),
    keras.layers.BatchNormalization(),
    keras.layers.Dropout(0.3),
    keras.layers.Dense(32, activation='relu'),
    keras.layers.Dropout(0.2),
    keras.layers.Dense(3, activation='softmax'),
])
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    loss='categorical_crossentropy',
    metrics=['accuracy']
)
model.summary()
history = model.fit(
    X_train, y_train,
    epochs=100,
    batch_size=16,
    validation_split=0.2,
    callbacks=[keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True)],
    verbose=0,
)
loss, acc = model.evaluate(X_test, y_test, verbose=0)
print(f"Test Accuracy: {acc:.4f}")
# Plot training history
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
ax1.plot(history.history['loss'], label='Train Loss', color='#3b82f6')
ax1.plot(history.history['val_loss'], label='Val Loss', color='#ef4444')
ax1.set_title('Loss'); ax1.legend(); ax1.grid(True, alpha=0.3)
ax2.plot(history.history['accuracy'], label='Train Acc', color='#3b82f6')
ax2.plot(history.history['val_accuracy'], label='Val Acc', color='#ef4444')
ax2.set_title('Accuracy'); ax2.legend(); ax2.grid(True, alpha=0.3)
plt.tight_layout(); plt.show()
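The `Dropout` layers in the model above randomly zero a fraction of activations during training and do nothing at inference time. A minimal NumPy sketch of the commonly used "inverted dropout" scheme follows. It illustrates the idea and is not Keras's actual implementation; the function name and signature are hypothetical.

```python
import numpy as np

def dropout(a, rate, training, rng):
    """Inverted dropout: zero a `rate` fraction of units and rescale the rest."""
    if not training or rate == 0.0:
        return a                                # inference: identity
    keep_prob = 1.0 - rate
    mask = rng.random(a.shape) < keep_prob      # True for units that are kept
    return a * mask / keep_prob                 # rescale so the expected value is unchanged

rng = np.random.default_rng(0)
a = np.ones((1000, 100))

train_out = dropout(a, rate=0.3, training=True, rng=rng)
eval_out = dropout(a, rate=0.3, training=False, rng=rng)

print(f"fraction zeroed in training: {(train_out == 0).mean():.3f}")  # close to 0.3
print(f"mean activation in training: {train_out.mean():.3f}")         # close to 1.0
print(f"unchanged at inference: {np.array_equal(eval_out, a)}")
```

Rescaling by `1 / keep_prob` during training is what allows the layer to be skipped at inference without changing the expected activation.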
Hyperparameters and Their Effect
| Hyperparameter | Too Small | Too Large | Good Starting Point |
|---|---|---|---|
| Learning rate | Slow convergence | Diverges / oscillates | 0.001 (Adam) |
| Hidden units | Underfitting | Overfitting + slow | 64–256 |
| Batch size | Noisy gradients | Memory issues | 32–128 |
| Number of layers | Underfitting | Harder to train (vanishing gradients), overfitting | 2–5 |
| Dropout rate | Little or no regularisation | Underfitting | 0.2–0.5 |
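The learning-rate row can be illustrated on the simplest possible problem. The sketch below runs plain gradient descent on the hypothetical loss L(w) = w², whose minimum is at w = 0. The three rates are chosen only to show the three regimes and are not recommendations.

```python
def gradient_descent(lr, steps=50, w0=5.0):
    """Minimise L(w) = w^2 with the update w <- w - lr * dL/dw."""
    w = w0
    for _ in range(steps):
        grad = 2 * w          # dL/dw
        w = w - lr * grad
    return w

too_small = gradient_descent(lr=0.001)   # barely moves from the starting point
reasonable = gradient_descent(lr=0.1)    # converges close to the minimum
too_large = gradient_descent(lr=1.1)     # overshoots further on every step

for name, w in [("0.001", too_small), ("0.1", reasonable), ("1.1", too_large)]:
    print(f"lr={name}: |w| after 50 steps = {abs(w):.3g}")
```

Each step multiplies w by (1 - 2·lr), so the iterate shrinks only when that factor has magnitude below 1. For this loss, any rate above 1.0 diverges.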
Key Takeaways
- Neurons apply a weighted sum + activation to produce outputs
- ReLU is the default activation for hidden layers — fast and effective
- Backpropagation uses the chain rule to compute gradients efficiently
- Xavier/He initialisation helps prevent vanishing/exploding gradients
- Dropout is a widely used regulariser; BatchNorm mainly stabilises and speeds up training, with a mild regularising side effect
- Always monitor validation loss: if it starts rising while training loss keeps falling, the model is overfitting
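Monitoring validation loss can be turned into a stopping rule: stop once the validation loss has not improved for a set number of epochs, and keep the weights from the best epoch. The sketch below is a simplified illustration of that patience logic. The function name and the loss values are hypothetical, and it omits options that real callbacks such as Keras's `EarlyStopping` provide.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch) for a sequence of per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch   # new best: reset the patience window
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch              # no improvement for `patience` epochs
    return len(val_losses) - 1, best_epoch        # never triggered: ran to the end

# Hypothetical validation losses: improving until epoch 3, then rising
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.53, 0.56, 0.60]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
print(stop_epoch, best_epoch)  # 6 3
```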