Neural Networks — Complete Introduction

What Is a Neural Network?

A neural network is a function approximator built from layers of interconnected nodes (neurons) that transform input data into predictions. Loosely inspired by the brain, neural networks excel at finding complex non-linear patterns.

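Key Insight: By the Universal Approximation Theorem, a feed-forward network with a single hidden layer, a non-polynomial activation (such as sigmoid or ReLU), and enough neurons can approximate any continuous function on a bounded (compact) input domain to arbitrary accuracy. The theorem guarantees that suitable weights exist; it does not guarantee that training will find them.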
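
To make the approximation claim concrete, here is a rough sketch rather than a proof. It fits sin(x) with a single hidden tanh layer and reports how the error changes as the layer gets wider. The hidden weights are fixed at random and only the output weights are fitted by least squares; that shortcut, the target function, and the widths are all assumptions chosen to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target: a continuous function on a bounded interval
x = np.linspace(-np.pi, np.pi, 400).reshape(-1, 1)
y = np.sin(x).ravel()

# One hidden tanh layer with fixed random weights (an assumption made to keep
# the sketch short); only the output weights are fitted, by least squares.
max_width = 200
W = rng.normal(scale=2.0, size=(1, max_width))
b = rng.normal(scale=2.0, size=max_width)
H = np.tanh(x @ W + b)                      # (400, max_width) hidden activations

def rmse_with_width(k):
    """Fit output weights on the first k hidden units and return the RMSE."""
    H_k = np.hstack([H[:, :k], np.ones((len(x), 1))])   # add an output bias
    coef, *_ = np.linalg.lstsq(H_k, y, rcond=None)
    return np.sqrt(np.mean((H_k @ coef - y) ** 2))

errors = {k: rmse_with_width(k) for k in (2, 10, 50, 200)}
for k, err in errors.items():
    print(f"{k:3d} hidden units -> RMSE {err:.2e}")
```

The wide layer should fit the curve far more closely than the two-unit layer, which is the behaviour the theorem describes.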

Anatomy of One Neuron

    x₁ ──w₁──┐
    x₂ ──w₂──┤
    x₃ ──w₃──┼──→ [Σ xᵢwᵢ + b] ──→ [activation f] ──→ output
    ...       │
    xₙ ──wₙ──┘

Mathematically:

$z = \sum_{i=1}^{n} w_i x_i + b = \mathbf{w}^\top \mathbf{x} + b$

$\text{output} = f(z)$

Where $f$ is the activation function and $b$ is the bias term.
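
The two formulas above can be checked by hand with a small sketch. The input values, weights, and bias below are made up for illustration, and sigmoid is used as one possible choice of $f$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 2.0, 3.0])      # inputs x_1..x_3 (made-up values)
w = np.array([0.5, -0.25, 0.1])    # weights w_1..w_3 (made-up values)
b = 0.2                            # bias

z = w @ x + b                      # weighted sum: 0.5 - 0.5 + 0.3 + 0.2 = 0.5
output = sigmoid(z)                # activation f(z)

print(f"z = {z:.4f}, output = {output:.4f}")   # z = 0.5000, output = 0.6225
```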

Activation Functions

Activation functions introduce non-linearity — without them, stacking layers would still give a linear model.
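
A quick numerical check of this claim, as a sketch with random weights (the shapes are arbitrary): two stacked layers without an activation equal one linear layer with weights W1 @ W2 and bias b1 @ W2 + b2, while inserting a ReLU breaks that equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))                  # 5 samples, 4 features
W1, b1 = rng.normal(size=(4, 8)), rng.normal(size=8)
W2, b2 = rng.normal(size=(8, 3)), rng.normal(size=3)

# Two stacked layers with no activation in between
two_layers = (X @ W1 + b1) @ W2 + b2

# The same map written as a single linear layer
W_eq, b_eq = W1 @ W2, b1 @ W2 + b2
one_layer = X @ W_eq + b_eq
print(np.allclose(two_layers, one_layer))    # True

# With a ReLU in between, the single-layer rewrite no longer matches
with_relu = np.maximum(0, X @ W1 + b1) @ W2 + b2
print(np.allclose(with_relu, one_layer))     # False
```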

Function	Formula	Range	Use Case
Sigmoid	1/(1+e^-z)	(0, 1)	Binary output layer
Tanh	(e^z - e^-z)/(e^z + e^-z)	(-1, 1)	Hidden layers (older)
ReLU	max(0, z)	[0, ∞)	Default for hidden layers
Leaky ReLU	max(0.01z, z)	(-∞, ∞)	Fixes dying ReLU problem
Softmax	e^(z_i) / Σ_j e^(z_j)	(0, 1), sums to 1	Multi-class output layer

Sigmoid    ReLU       Leaky ReLU
   1 ─╮         /         /
      │        /        ╱
   0.5│       /        /
      │  ────/   ────╱
   0  └─────     ──╱
      -3  3    0    0

import numpy as np
import matplotlib.pyplot as plt

z = np.linspace(-5, 5, 300)

activations = {
    'Sigmoid':     1 / (1 + np.exp(-z)),
    'Tanh':        np.tanh(z),
    'ReLU':        np.maximum(0, z),
    'Leaky ReLU':  np.where(z >= 0, z, 0.01 * z),
}

fig, axes = plt.subplots(1, 4, figsize=(16, 3))
colors = ['#3b82f6', '#10b981', '#f59e0b', '#ef4444']

for ax, (name, a), color in zip(axes, activations.items(), colors):
    ax.plot(z, a, color=color, linewidth=2.5)
    ax.axhline(0, color='gray', linewidth=0.5, linestyle='--')
    ax.axvline(0, color='gray', linewidth=0.5, linestyle='--')
    ax.set_title(name, fontsize=12, fontweight='bold')
    ax.set_ylim(-1.5, 1.5)
    ax.grid(True, alpha=0.3)

plt.suptitle('Activation Functions', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
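
Softmax is missing from the plots above because it maps a whole vector of scores to a probability distribution instead of acting on one number at a time. The following is a minimal sketch with made-up scores. Subtracting the maximum before exponentiating is a common stability trick and does not change the result.

```python
import numpy as np

def softmax(z):
    """Softmax over the last axis; subtracting the max avoids overflow."""
    shifted = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=-1, keepdims=True)

logits = np.array([1.0, 2.0, 3.0])
probs = softmax(logits)
print(probs.round(4))        # approximately 0.0900, 0.2447, 0.6652
print(probs.sum())           # 1.0 up to floating-point rounding

# Large logits would overflow a naive np.exp(z) but are fine here
big = softmax(np.array([1000.0, 1001.0, 1002.0]))
print(big.round(4))          # same probabilities as above
```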

Multi-Layer Perceptron (MLP) Architecture

INPUT LAYER      HIDDEN LAYER 1   HIDDEN LAYER 2   OUTPUT LAYER
  (4 neurons)     (8 neurons)      (8 neurons)       (3 neurons)

    x₁ ──────→ ○  ○  ○  ○   →  ○  ○  ○  ○   →   ŷ₁ (class 1)
    x₂ ──────→ ○  ○  ○  ○   →  ○  ○  ○  ○   →   ŷ₂ (class 2)
    x₃ ──────→ ○  ○  ○  ○   →  ○  ○  ○  ○   →   ŷ₃ (class 3)
    x₄ ──────→ ○  ○  ○  ○   →  ○  ○  ○  ○

Every neuron connects to every neuron in the next layer (a dense, or fully connected, layer).

Layer computations:

$\mathbf{a}^{[l]} = f^{[l]}\!\left(\mathbf{W}^{[l]}\, \mathbf{a}^{[l-1]} + \mathbf{b}^{[l]}\right)$

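Where $l$ is the layer index, $\mathbf{W}^{[l]}$ is the weight matrix, $\mathbf{b}^{[l]}$ is the bias vector, and $\mathbf{a}^{[0]} = \mathbf{x}$ is the input. The formula treats each sample as a column vector; the NumPy code later in this article stores samples as rows, so the same step is written X @ W + b.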
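
The layer formula can be traced through the 4-8-8-3 architecture from the diagram with a short sketch. Weights are random and untrained, samples are stored as rows, and the choice of ReLU for hidden layers and softmax for the output is an assumption that matches the rest of this article.

```python
import numpy as np

rng = np.random.default_rng(0)
layer_sizes = [4, 8, 8, 3]                       # sizes from the diagram above

# One (W, b) pair per layer; W has shape (fan_in, fan_out) for row-wise samples
params = [
    (rng.normal(size=(n_in, n_out)) * np.sqrt(2 / n_in), np.zeros(n_out))
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]

def relu(z):
    return np.maximum(0, z)

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def forward(X, params):
    a = X                                        # a^[0] is the input
    for i, (W, b) in enumerate(params):
        z = a @ W + b                            # row-wise form of W a + b
        is_output = i == len(params) - 1
        a = softmax(z) if is_output else relu(z)
        print(f"layer {i + 1}: W {W.shape}, a {a.shape}")
    return a

X = rng.normal(size=(5, 4))                      # 5 samples, 4 features
y_hat = forward(X, params)

n_params = sum(W.size + b.size for W, b in params)
print(f"trainable parameters: {n_params}")       # 40 + 72 + 27 = 139
```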

Backpropagation — How Neural Networks Learn

Training = forward pass to compute loss, then backward pass to compute gradients.

Step-by-Step

Forward Pass:
  Input X → [Layer 1] → [Layer 2] → Output ŷ → Loss L(y, ŷ)

Backward Pass (chain rule; superscripts are layer indices, not powers):
  ∂L/∂W² = ∂L/∂ŷ · ∂ŷ/∂z² · ∂z²/∂W²
  ∂L/∂W¹ = ∂L/∂ŷ · ∂ŷ/∂z² · ∂z²/∂a¹ · ∂a¹/∂z¹ · ∂z¹/∂W¹

Weight Update (Gradient Descent):
  W ← W - η · ∂L/∂W      (η = learning rate)

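The chain rule unrolled for a network with $L$ layers, written schematically (each factor is a Jacobian, and $\mathcal{L}$ is the loss):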

$\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{[l]}} = \frac{\partial \mathcal{L}}{\partial \mathbf{a}^{[L]}} \cdot \prod_{k=l+1}^{L}\left(\frac{\partial \mathbf{a}^{[k]}}{\partial \mathbf{a}^{[k-1]}}\right) \cdot \frac{\partial \mathbf{a}^{[l]}}{\partial \mathbf{W}^{[l]}}$
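
One way to gain confidence in a hand-derived backward pass is a gradient check: compare the chain-rule gradient with a finite-difference estimate. The sketch below does this for a tiny two-layer network. The sizes, the random data, and the tanh hidden activation (chosen because it is smooth everywhere, unlike ReLU) are assumptions made for the check only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_hidden, d_out = 6, 4, 5, 3

X = rng.normal(size=(n, d_in))
y = np.eye(d_out)[rng.integers(0, d_out, size=n)]     # one-hot labels
W1 = rng.normal(size=(d_in, d_hidden)) * 0.5
b1 = np.zeros(d_hidden)
W2 = rng.normal(size=(d_hidden, d_out)) * 0.5
b2 = np.zeros(d_out)

def forward(W1, b1, W2, b2):
    a1 = np.tanh(X @ W1 + b1)
    z2 = a1 @ W2 + b2
    e = np.exp(z2 - z2.max(axis=1, keepdims=True))
    a2 = e / e.sum(axis=1, keepdims=True)             # softmax
    loss = -np.sum(y * np.log(a2)) / n                # cross-entropy
    return a1, a2, loss

# Analytic gradients via the chain rule
a1, a2, _ = forward(W1, b1, W2, b2)
dz2 = (a2 - y) / n                                    # dL/dz2
dW2 = a1.T @ dz2                                      # dL/dW2
dz1 = (dz2 @ W2.T) * (1 - a1 ** 2)                    # tanh'(z) = 1 - tanh(z)^2
dW1 = X.T @ dz1                                       # dL/dW1

# Numerical gradient of the loss w.r.t. W1 by central differences
eps = 1e-5
dW1_num = np.zeros_like(W1)
for i in range(d_in):
    for j in range(d_hidden):
        W_plus, W_minus = W1.copy(), W1.copy()
        W_plus[i, j] += eps
        W_minus[i, j] -= eps
        loss_plus = forward(W_plus, b1, W2, b2)[2]
        loss_minus = forward(W_minus, b1, W2, b2)[2]
        dW1_num[i, j] = (loss_plus - loss_minus) / (2 * eps)

max_diff = np.max(np.abs(dW1 - dW1_num))
print(f"max |analytic - numerical| = {max_diff:.2e}")
```

A very small difference suggests the backward pass is consistent with the loss; a large one usually points to a sign, transpose, or scaling mistake.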

Full Implementation — NumPy from Scratch

import numpy as np

class NeuralNetwork:
    """2-layer neural network built from scratch with NumPy."""

    def __init__(self, input_size, hidden_size, output_size, lr=0.01):
        # He initialisation (scale sqrt(2 / fan_in)), the usual choice for ReLU layers;
        # it helps keep activations and gradients from vanishing or exploding
        self.W1 = np.random.randn(input_size, hidden_size) * np.sqrt(2 / input_size)
        self.b1 = np.zeros((1, hidden_size))
        self.W2 = np.random.randn(hidden_size, output_size) * np.sqrt(2 / hidden_size)
        self.b2 = np.zeros((1, output_size))
        self.lr = lr
        self.losses = []

    def _relu(self, z):          return np.maximum(0, z)
    def _relu_grad(self, z):     return (z > 0).astype(float)
    def _sigmoid(self, z):       return 1 / (1 + np.exp(-np.clip(z, -500, 500)))
    def _softmax(self, z):
        e = np.exp(z - z.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)

    def forward(self, X):
        self.z1 = X @ self.W1 + self.b1           # (n, hidden)
        self.a1 = self._relu(self.z1)              # (n, hidden)
        self.z2 = self.a1 @ self.W2 + self.b2     # (n, output)
        self.a2 = self._softmax(self.z2)           # (n, output)
        return self.a2

    def loss(self, y_pred, y_true):
        """Cross-entropy loss with one-hot y_true."""
        n = y_true.shape[0]
        return -np.sum(y_true * np.log(y_pred + 1e-9)) / n

    def backward(self, X, y_true):
        n = X.shape[0]

        # Output layer gradient (softmax + cross-entropy combined)
        dz2 = (self.a2 - y_true) / n              # (n, output)
        dW2 = self.a1.T @ dz2                     # (hidden, output)
        db2 = dz2.sum(axis=0, keepdims=True)      # (1, output)

        # Hidden layer gradient
        da1 = dz2 @ self.W2.T                     # (n, hidden)
        dz1 = da1 * self._relu_grad(self.z1)      # (n, hidden)
        dW1 = X.T @ dz1                           # (input, hidden)
        db1 = dz1.sum(axis=0, keepdims=True)      # (1, hidden)

        # Update weights
        self.W2 -= self.lr * dW2
        self.b2 -= self.lr * db2
        self.W1 -= self.lr * dW1
        self.b1 -= self.lr * db1

    def train(self, X, y, epochs=500, verbose=True):
        for epoch in range(epochs):
            y_pred = self.forward(X)
            loss   = self.loss(y_pred, y)
            self.backward(X, y)
            self.losses.append(loss)
            if verbose and epoch % 100 == 0:
                acc = (y_pred.argmax(1) == y.argmax(1)).mean()
                print(f"Epoch {epoch:4d} | Loss: {loss:.4f} | Acc: {acc:.4f}")

    def predict(self, X):
        return self.forward(X).argmax(axis=1)


# ── Test on Iris dataset ─────────────────────────────────────────────
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split

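iris = load_iris()
X, y_raw = iris.data, iris.target.reshape(-1, 1)

# sparse_output requires scikit-learn >= 1.2
y_ohe = OneHotEncoder(sparse_output=False).fit_transform(y_raw)
X_train, X_test, y_train, y_test = train_test_split(
    X, y_ohe, test_size=0.2, random_state=42
)

# Fit the scaler on the training split only, so test-set statistics do not leak into training
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)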

nn = NeuralNetwork(input_size=4, hidden_size=16, output_size=3, lr=0.05)
nn.train(X_train, y_train, epochs=500)

# Evaluate
y_pred = nn.predict(X_test)
y_true = y_test.argmax(1)
acc = (y_pred == y_true).mean()
print(f"\nTest Accuracy: {acc:.4f}")

# Plot learning curve
plt.figure(figsize=(8, 4))
plt.plot(nn.losses, color='#3b82f6', linewidth=2)
plt.xlabel('Epoch')
plt.ylabel('Cross-Entropy Loss')
plt.title('Training Loss Curve')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

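Output (exact values vary from run to run because the weights are initialised randomly and no seed is set):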

Epoch    0 | Loss: 1.1542 | Acc: 0.3667
Epoch  100 | Loss: 0.5423 | Acc: 0.8250
Epoch  200 | Loss: 0.2871 | Acc: 0.9083
Epoch  300 | Loss: 0.1934 | Acc: 0.9417
Epoch  400 | Loss: 0.1482 | Acc: 0.9583
Test Accuracy: 0.9667

Keras/TensorFlow Implementation

import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow import keras
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Data
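iris = load_iris()
y = keras.utils.to_categorical(iris.target, 3)
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, y, test_size=0.2, random_state=42
)
# Fit the scaler on the training split only to avoid leaking test-set statistics
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)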

# Build model
model = keras.Sequential([
    keras.Input(shape=(4,)),  # explicit input; passing input_shape to a layer is deprecated in recent Keras
    keras.layers.Dense(64, activation='relu', kernel_initializer='he_normal'),
    keras.layers.BatchNormalization(),
    keras.layers.Dropout(0.3),
    keras.layers.Dense(32, activation='relu'),
    keras.layers.Dropout(0.2),
    keras.layers.Dense(3,  activation='softmax'),
])

model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    loss='categorical_crossentropy',
    metrics=['accuracy']
)

model.summary()

history = model.fit(
    X_train, y_train,
    epochs=100,
    batch_size=16,
    validation_split=0.2,
    callbacks=[keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True)],
    verbose=0,
)

loss, acc = model.evaluate(X_test, y_test, verbose=0)
print(f"Test Accuracy: {acc:.4f}")

# Plot training history
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
ax1.plot(history.history['loss'],     label='Train Loss',     color='#3b82f6')
ax1.plot(history.history['val_loss'], label='Val Loss',       color='#ef4444')
ax1.set_title('Loss'); ax1.legend(); ax1.grid(True, alpha=0.3)

ax2.plot(history.history['accuracy'],     label='Train Acc', color='#3b82f6')
ax2.plot(history.history['val_accuracy'], label='Val Acc',   color='#ef4444')
ax2.set_title('Accuracy'); ax2.legend(); ax2.grid(True, alpha=0.3)
plt.tight_layout(); plt.show()

Hyperparameters and Their Effect

Hyperparameter	Too Small	Too Large	Good Starting Point
Learning rate	Slow convergence	Diverges / oscillates	0.001 (Adam)
Hidden units per layer	Underfitting	Overfitting + slow	64–256
Batch size	Noisy gradients	Memory issues	32–128
Number of hidden layers	Underfitting	Vanishing gradients	2–5
Dropout rate	No regularisation	Underfitting	0.2–0.5
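
The learning-rate row can be illustrated on a toy problem. The sketch below minimises the made-up loss L(w) = (w - 3)^2 with plain gradient descent; the three rates are chosen only to show slow convergence, good convergence, and divergence, and are not recommendations for real networks.

```python
def gradient_descent(lr, steps=50, w0=0.0):
    """Minimise L(w) = (w - 3)^2 with plain gradient descent."""
    w = w0
    for _ in range(steps):
        grad = 2 * (w - 3)       # dL/dw
        w = w - lr * grad        # W <- W - eta * dL/dW
    return w

results = {lr: gradient_descent(lr) for lr in (0.001, 0.1, 1.1)}
for lr, w in results.items():
    print(f"lr={lr:<5} -> w after 50 steps = {w:.4g} (optimum is 3)")
```

Each step multiplies the distance to the optimum by (1 - 2·lr). With lr = 0.001 the factor is 0.998, so progress is slow. With lr = 0.1 it is 0.8, so w converges quickly. With lr = 1.1 it is -1.2, so the iterates overshoot and grow without bound.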

Key Takeaways

Neurons apply a weighted sum + activation to produce outputs
ReLU is the default activation for hidden layers — fast and effective
Backpropagation uses the chain rule to compute gradients efficiently
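He initialisation (for ReLU) and Xavier initialisation (for tanh/sigmoid) keep activations and gradients at a stable scale, reducing the risk of vanishing or exploding gradients
Dropout is a standard regulariser; BatchNorm mainly stabilises and speeds up training, with a mild regularising side effect
Always monitor validation loss: if it rises while training loss keeps falling, the model is overfitting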
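
The initialisation takeaway can be checked empirically with a rough sketch. It pushes random inputs through ten ReLU layers and compares the scale of the final activations under three weight scales. The depth, width, and the two "bad" scales (0.01 and 1.0) are arbitrary assumptions chosen to make the effect visible.

```python
import numpy as np

def final_activation_std(weight_std_fn, depth=10, width=256, seed=0):
    """Push random inputs through `depth` ReLU layers and return the
    standard deviation of the last layer's activations."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(512, width))
    for _ in range(depth):
        W = rng.normal(size=(width, width)) * weight_std_fn(width)
        a = np.maximum(0, a @ W)
    return a.std()

he_std    = final_activation_std(lambda fan_in: np.sqrt(2 / fan_in))
small_std = final_activation_std(lambda fan_in: 0.01)
large_std = final_activation_std(lambda fan_in: 1.0)

print(f"He init    : {he_std:.3e}")      # stays on the order of 1
print(f"std = 0.01 : {small_std:.3e}")   # shrinks towards zero
print(f"std = 1.0  : {large_std:.3e}")   # blows up
```

Because gradients flow back through the same weight matrices, a forward signal that vanishes or explodes usually comes with gradients that do the same.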