Neural Networks — Complete Introduction

What Is a Neural Network?

A neural network is a function approximator built from layers of interconnected nodes (neurons) that transform input data into predictions. Loosely inspired by the brain, neural networks excel at modelling complex non-linear patterns.

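Key Insight: By the Universal Approximation Theorem, a feed-forward network with a single hidden layer, a suitable non-linear activation, and enough neurons can approximate any continuous function on a closed, bounded input domain to arbitrary accuracy. The theorem guarantees that such weights exist, not that training will find them.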

Anatomy of One Neuron

    x₁ ──w₁──┐
    x₂ ──w₂──┤
    x₃ ──w₃──┼──→ [Σ xᵢwᵢ + b] ──→ [activation f] ──→ output
    ...       │
    xₙ ──wₙ──┘

Mathematically:

$$z = \sum_{i=1}^{n} w_i x_i + b = \mathbf{w}^\top \mathbf{x} + b$$

$$\text{output} = f(z)$$

Where $f$ is the activation function and $b$ is the bias term.
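As a minimal sketch of the formula above, the snippet below computes one neuron's output with NumPy. The input values, weights, bias, and the choice of sigmoid as the activation are illustrative assumptions, not values from the lesson.

```python
import numpy as np

def sigmoid(z):
    """Logistic activation: squashes z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b, f):
    """One neuron: weighted sum plus bias, then activation f."""
    z = np.dot(w, x) + b
    return f(z)

# Illustrative values (assumed for this example)
x = np.array([1.0, 2.0, 3.0])      # inputs x1..x3
w = np.array([0.5, -0.25, 0.1])    # weights w1..w3
b = 0.2                            # bias

z = np.dot(w, x) + b               # 0.5 - 0.5 + 0.3 + 0.2 = 0.5
out = neuron(x, w, b, sigmoid)     # sigmoid(0.5), roughly 0.6225
print(f"z = {z:.4f}, output = {out:.4f}")
```

Swapping `sigmoid` for another function changes only the last step; the weighted sum stays the same.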


Activation Functions

Activation functions introduce non-linearity — without them, stacking layers would still give a linear model.
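To make the claim concrete, here is a small sketch with hand-picked (assumed) weights: two stacked layers without an activation reduce exactly to a single linear layer, while inserting a ReLU between them gives a different result.

```python
import numpy as np

# Hand-picked illustrative weights (assumed for this example)
W1 = np.array([[1.0, -1.0],
               [-1.0, 1.0]])
b1 = np.array([0.0, 0.0])
W2 = np.array([[1.0, 1.0]])
b2 = np.array([0.5])
x = np.array([1.0, 0.0])

# Two layers with no activation in between
two_layers = W2 @ (W1 @ x + b1) + b2

# The same map written as ONE linear layer
W_eq = W2 @ W1
b_eq = W2 @ b1 + b2
one_layer = W_eq @ x + b_eq

# Two layers with a ReLU in between
with_relu = W2 @ np.maximum(0.0, W1 @ x + b1) + b2

print("no activation :", two_layers)   # identical to one_layer
print("single layer  :", one_layer)
print("with ReLU     :", with_relu)    # differs from the linear result
```

The collapse holds for any depth: a product of weight matrices is just another weight matrix.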

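| Function   | Formula                       | Range              | Use Case                  |
|------------|-------------------------------|--------------------|---------------------------|
| Sigmoid    | 1 / (1 + e^-z)                | (0, 1)             | Binary output layer       |
| Tanh       | (e^z - e^-z) / (e^z + e^-z)   | (-1, 1)            | Hidden layers (older)     |
| ReLU       | max(0, z)                     | [0, ∞)             | Default for hidden layers |
| Leaky ReLU | max(0.01z, z)                 | (-∞, ∞)            | Fixes dying ReLU problem  |
| Softmax    | e^z_i / Σ_j e^z_j             | (0, 1), sums to 1  | Multi-class output        |

Rough shapes:

    Sigmoid    ReLU       Leaky ReLU
       1 ─╮         /         /
          │        /        ╱
       0.5│       /        /
          │  ────/   ────╱
       0  └─────     ──╱
          -3  3    0    0

The script below plots the four element-wise activations with NumPy and Matplotlib:
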
import numpy as np
import matplotlib.pyplot as plt

z = np.linspace(-5, 5, 300)

activations = {
    'Sigmoid':     1 / (1 + np.exp(-z)),
    'Tanh':        np.tanh(z),
    'ReLU':        np.maximum(0, z),
    'Leaky ReLU':  np.where(z >= 0, z, 0.01 * z),
}

fig, axes = plt.subplots(1, 4, figsize=(16, 3))
colors = ['#3b82f6', '#10b981', '#f59e0b', '#ef4444']

for ax, (name, a), color in zip(axes, activations.items(), colors):
    ax.plot(z, a, color=color, linewidth=2.5)
    ax.axhline(0, color='gray', linewidth=0.5, linestyle='--')
    ax.axvline(0, color='gray', linewidth=0.5, linestyle='--')
    ax.set_title(name, fontsize=12, fontweight='bold')
    ax.set_ylim(-1.5, 1.5)
    ax.grid(True, alpha=0.3)

plt.suptitle('Activation Functions', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
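
Softmax is missing from the plot above because it acts on a whole vector rather than on one value at a time. The sketch below is one common way to write it; subtracting the row maximum before exponentiating is a standard stability trick, and the logits are made-up illustrative values.

```python
import numpy as np

def softmax(z):
    """Softmax over the last axis, shifted by the max for numerical stability."""
    z = np.asarray(z, dtype=float)
    shifted = z - z.max(axis=-1, keepdims=True)   # largest entry becomes 0
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

logits = np.array([2.0, 1.0, 0.1])                # illustrative scores
probs = softmax(logits)
print(probs, probs.sum())

# Large logits would overflow np.exp without the shift
big = softmax(np.array([1000.0, 1001.0, 1002.0]))
print(big)
```

Because softmax only depends on differences between logits, the shift leaves the result unchanged while keeping `np.exp` in a safe range.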

Multi-Layer Perceptron (MLP) Architecture

INPUT LAYER      HIDDEN LAYER 1   HIDDEN LAYER 2   OUTPUT LAYER
  (4 neurons)     (8 neurons)      (8 neurons)       (3 neurons)

    x₁ ──────→ ○  ○  ○  ○   →  ○  ○  ○  ○   →   ŷ₁ (class 1)
    x₂ ──────→ ○  ○  ○  ○   →  ○  ○  ○  ○   →   ŷ₂ (class 2)
    x₃ ──────→ ○  ○  ○  ○   →  ○  ○  ○  ○   →   ŷ₃ (class 3)
    x₄ ──────→ ○  ○  ○  ○   →  ○  ○  ○  ○

Every neuron connects to every neuron in the next layer (dense/fully connected)

Layer computations:

$$\mathbf{a}^{[l]} = f^{[l]}\!\left(\mathbf{W}^{[l]}\, \mathbf{a}^{[l-1]} + \mathbf{b}^{[l]}\right)$$

Where $l$ is the layer index, $\mathbf{W}^{[l]}$ is the weight matrix, and $\mathbf{b}^{[l]}$ is the bias vector.
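
The layer rule can be sketched as a simple loop. This example assumes the 4-8-8-3 layout from the diagram, column-vector inputs (so each weight matrix has shape `(n_out, n_in)`), ReLU hidden layers, and a linear output; the random weights are placeholders, not trained values.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def identity(z):
    return z

def forward(x, params, activations):
    """Apply a = f(W a_prev + b) layer by layer, starting from a = x."""
    a = x
    for (W, b), f in zip(params, activations):
        a = f(W @ a + b)
    return a

rng = np.random.default_rng(0)
sizes = [4, 8, 8, 3]   # input, hidden 1, hidden 2, output

# One (W, b) pair per layer; W has shape (n_out, n_in)
params = [
    (rng.standard_normal((n_out, n_in)) * np.sqrt(2.0 / n_in), np.zeros(n_out))
    for n_in, n_out in zip(sizes[:-1], sizes[1:])
]

x = rng.standard_normal(4)                       # one illustrative input
logits = forward(x, params, [relu, relu, identity])
print(logits.shape)   # (3,)
```

The NumPy class later in this lesson uses the row-vector convention instead (`X @ W`, with samples as rows), which is the same computation transposed.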


Backpropagation — How Neural Networks Learn

Training = forward pass to compute loss, then backward pass to compute gradients.

Step-by-Step

Forward Pass:
  Input X → [Layer 1] → [Layer 2] → Output ŷ → Loss L(y, ŷ)

Backward Pass (chain rule):
  ∂L/∂W² = ∂L/∂ŷ · ∂ŷ/∂z² · ∂z²/∂W²
  ∂L/∂W¹ = ∂L/∂ŷ · ∂ŷ/∂z² · ∂z²/∂a¹ · ∂a¹/∂z¹ · ∂z¹/∂W¹

Weight Update (Gradient Descent):
  W ← W - η · ∂L/∂W      (η = learning rate)

The chain rule unrolled:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{[l]}} = \frac{\partial \mathcal{L}}{\partial \mathbf{a}^{[L]}} \cdot \prod_{k=l+1}^{L}\left(\frac{\partial \mathbf{a}^{[k]}}{\partial \mathbf{a}^{[k-1]}}\right) \cdot \frac{\partial \mathbf{a}^{[l]}}{\partial \mathbf{W}^{[l]}}$$
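
As a worked check of the chain rule, the sketch below uses a deliberately tiny network with one scalar weight per layer. The tanh hidden activation, linear output, squared-error loss, and all numeric values are assumptions chosen to keep the arithmetic visible. The hand-derived gradients are compared against finite differences, then one gradient descent step is applied.

```python
import numpy as np

def forward(w1, b1, w2, b2, x, y):
    """Scalar 2-layer net: tanh hidden unit, linear output, squared-error loss."""
    z1 = w1 * x + b1
    a1 = np.tanh(z1)
    z2 = w2 * a1 + b2
    y_hat = z2
    loss = 0.5 * (y_hat - y) ** 2
    return loss, a1, y_hat

# Illustrative parameters and a single training example
w1, b1, w2, b2 = 0.6, -0.1, 1.5, 0.2
x, y = 0.8, 1.0

loss, a1, y_hat = forward(w1, b1, w2, b2, x, y)

# Backward pass: multiply the local derivatives along each path
dL_dyhat = y_hat - y
dL_dw2 = dL_dyhat * 1.0 * a1                        # dL/dy_hat * dy_hat/dz2 * dz2/dw2
dL_dw1 = dL_dyhat * 1.0 * w2 * (1 - a1 ** 2) * x    # ... * dz2/da1 * da1/dz1 * dz1/dw1

# Finite-difference check of both gradients
eps = 1e-6
num_dw1 = (forward(w1 + eps, b1, w2, b2, x, y)[0]
           - forward(w1 - eps, b1, w2, b2, x, y)[0]) / (2 * eps)
num_dw2 = (forward(w1, b1, w2 + eps, b2, x, y)[0]
           - forward(w1, b1, w2 - eps, b2, x, y)[0]) / (2 * eps)
print(f"dL/dw1: chain rule {dL_dw1:.6f}, numeric {num_dw1:.6f}")
print(f"dL/dw2: chain rule {dL_dw2:.6f}, numeric {num_dw2:.6f}")

# One gradient descent step: W <- W - eta * dL/dW
eta = 0.1
new_loss = forward(w1 - eta * dL_dw1, b1, w2 - eta * dL_dw2, b2, x, y)[0]
print(f"loss before {loss:.5f}, after one step {new_loss:.5f}")
```

Backpropagation in a real network does the same thing with matrices, reusing the shared factors instead of recomputing them for every weight.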


Full Implementation — NumPy from Scratch

import numpy as np
import matplotlib.pyplot as plt   # used for the learning curve at the end

class NeuralNetwork:
    """2-layer neural network built from scratch with NumPy."""

    def __init__(self, input_size, hidden_size, output_size, lr=0.01):
        # He initialisation (std = sqrt(2 / fan_in)), suited to ReLU; helps keep gradients from vanishing/exploding
        self.W1 = np.random.randn(input_size, hidden_size) * np.sqrt(2 / input_size)
        self.b1 = np.zeros((1, hidden_size))
        self.W2 = np.random.randn(hidden_size, output_size) * np.sqrt(2 / hidden_size)
        self.b2 = np.zeros((1, output_size))
        self.lr = lr
        self.losses = []

    def _relu(self, z):          return np.maximum(0, z)
    def _relu_grad(self, z):     return (z > 0).astype(float)
    def _sigmoid(self, z):       return 1 / (1 + np.exp(-np.clip(z, -500, 500)))
    def _softmax(self, z):
        e = np.exp(z - z.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)

    def forward(self, X):
        self.z1 = X @ self.W1 + self.b1           # (n, hidden)
        self.a1 = self._relu(self.z1)              # (n, hidden)
        self.z2 = self.a1 @ self.W2 + self.b2     # (n, output)
        self.a2 = self._softmax(self.z2)           # (n, output)
        return self.a2

    def loss(self, y_pred, y_true):
        """Cross-entropy loss with one-hot y_true."""
        n = y_true.shape[0]
        return -np.sum(y_true * np.log(y_pred + 1e-9)) / n

    def backward(self, X, y_true):
        n = X.shape[0]

        # Output layer gradient (softmax + cross-entropy combined)
        dz2 = (self.a2 - y_true) / n              # (n, output)
        dW2 = self.a1.T @ dz2                     # (hidden, output)
        db2 = dz2.sum(axis=0, keepdims=True)      # (1, output)

        # Hidden layer gradient
        da1 = dz2 @ self.W2.T                     # (n, hidden)
        dz1 = da1 * self._relu_grad(self.z1)      # (n, hidden)
        dW1 = X.T @ dz1                           # (input, hidden)
        db1 = dz1.sum(axis=0, keepdims=True)      # (1, hidden)

        # Update weights
        self.W2 -= self.lr * dW2
        self.b2 -= self.lr * db2
        self.W1 -= self.lr * dW1
        self.b1 -= self.lr * db1

    def train(self, X, y, epochs=500, verbose=True):
        for epoch in range(epochs):
            y_pred = self.forward(X)
            loss   = self.loss(y_pred, y)
            self.backward(X, y)
            self.losses.append(loss)
            if verbose and epoch % 100 == 0:
                acc = (y_pred.argmax(1) == y.argmax(1)).mean()
                print(f"Epoch {epoch:4d} | Loss: {loss:.4f} | Acc: {acc:.4f}")

    def predict(self, X):
        return self.forward(X).argmax(axis=1)


# ── Test on Iris dataset ─────────────────────────────────────────────
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y_raw = iris.data, iris.target.reshape(-1, 1)

y_ohe = OneHotEncoder(sparse_output=False).fit_transform(y_raw)
X_scaled = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y_ohe, test_size=0.2, random_state=42
)

nn = NeuralNetwork(input_size=4, hidden_size=16, output_size=3, lr=0.05)
nn.train(X_train, y_train, epochs=500)

# Evaluate
y_pred = nn.predict(X_test)
y_true = y_test.argmax(1)
acc = (y_pred == y_true).mean()
print(f"\nTest Accuracy: {acc:.4f}")

# Plot learning curve
plt.figure(figsize=(8, 4))
plt.plot(nn.losses, color='#3b82f6', linewidth=2)
plt.xlabel('Epoch')
plt.ylabel('Cross-Entropy Loss')
plt.title('Training Loss Curve')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

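Output (exact numbers vary from run to run, since the weights are initialised randomly and no seed is set):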

Epoch    0 | Loss: 1.1542 | Acc: 0.3667
Epoch  100 | Loss: 0.5423 | Acc: 0.8250
Epoch  200 | Loss: 0.2871 | Acc: 0.9083
Epoch  300 | Loss: 0.1934 | Acc: 0.9417
Epoch  400 | Loss: 0.1482 | Acc: 0.9583
Test Accuracy: 0.9667
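
The `backward` method above relies on the combined softmax + cross-entropy gradient `(a2 - y_true) / n`. The sketch below checks that formula against finite differences on random logits; it is a standalone sanity check with assumed shapes and data, not part of the original class.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(z, y_true):
    """Mean cross-entropy of softmax(z) against one-hot y_true."""
    p = softmax(z)
    return -np.sum(y_true * np.log(p)) / z.shape[0]

rng = np.random.default_rng(0)
n, k = 5, 3                                        # 5 samples, 3 classes (illustrative)
z = rng.standard_normal((n, k))                    # random logits
y_true = np.eye(k)[rng.integers(0, k, size=n)]     # random one-hot labels

# Analytic gradient of the loss with respect to the logits
analytic = (softmax(z) - y_true) / n

# Central finite differences, one logit at a time
eps = 1e-6
numeric = np.zeros_like(z)
for i in range(n):
    for j in range(k):
        z_plus, z_minus = z.copy(), z.copy()
        z_plus[i, j] += eps
        z_minus[i, j] -= eps
        numeric[i, j] = (cross_entropy(z_plus, y_true)
                         - cross_entropy(z_minus, y_true)) / (2 * eps)

max_abs_diff = np.max(np.abs(analytic - numeric))
print(f"max |analytic - numeric| = {max_abs_diff:.2e}")
```

A check like this is a cheap way to catch sign or scaling mistakes before training a hand-written network.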

Keras/TensorFlow Implementation

import tensorflow as tf
from tensorflow import keras
import matplotlib.pyplot as plt   # used for the training-history plots
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Data
iris = load_iris()
X = StandardScaler().fit_transform(iris.data)
y = keras.utils.to_categorical(iris.target, 3)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Build model
model = keras.Sequential([
    keras.Input(shape=(4,)),   # explicit input layer; newer Keras versions warn about input_shape on Dense
    keras.layers.Dense(64, activation='relu', kernel_initializer='he_normal'),
    keras.layers.BatchNormalization(),
    keras.layers.Dropout(0.3),
    keras.layers.Dense(32, activation='relu'),
    keras.layers.Dropout(0.2),
    keras.layers.Dense(3,  activation='softmax'),
])

model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    loss='categorical_crossentropy',
    metrics=['accuracy']
)

model.summary()

history = model.fit(
    X_train, y_train,
    epochs=100,
    batch_size=16,
    validation_split=0.2,
    callbacks=[keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True)],
    verbose=0,
)

loss, acc = model.evaluate(X_test, y_test, verbose=0)
print(f"Test Accuracy: {acc:.4f}")

# Plot training history
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
ax1.plot(history.history['loss'],     label='Train Loss',     color='#3b82f6')
ax1.plot(history.history['val_loss'], label='Val Loss',       color='#ef4444')
ax1.set_title('Loss'); ax1.legend(); ax1.grid(True, alpha=0.3)

ax2.plot(history.history['accuracy'],     label='Train Acc', color='#3b82f6')
ax2.plot(history.history['val_accuracy'], label='Val Acc',   color='#ef4444')
ax2.set_title('Accuracy'); ax2.legend(); ax2.grid(True, alpha=0.3)
plt.tight_layout(); plt.show()

Hyperparameters and Their Effect

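| Hyperparameter | Too Small         | Too Large             | Good Starting Point |
|----------------|-------------------|-----------------------|---------------------|
| Learning rate  | Slow convergence  | Diverges / oscillates | 0.001 (Adam)        |
| Hidden units   | Underfitting      | Overfitting + slow    | 64–256              |
| Batch size     | Noisy gradients   | Memory issues         | 32–128              |
| Layers         | Underfitting      | Vanishing gradients   | 2–5                 |
| Dropout        | No regularisation | Underfitting          | 0.2–0.5             |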
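
The learning-rate row can be illustrated on a toy problem. The sketch below runs plain gradient descent on f(w) = w², an assumed stand-in for a real loss surface. The three rates are chosen to show the behaviours in the table for this particular function only; they are not recommendations for real networks or for Adam.

```python
def gradient_descent(lr, steps=50, w0=1.0):
    """Minimise f(w) = w**2 with plain gradient descent; returns the final w."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w          # derivative of w**2
        w = w - lr * grad       # W <- W - eta * dL/dW
    return w

w_small = gradient_descent(lr=0.001)   # too small: barely moves from w0
w_good = gradient_descent(lr=0.1)      # reasonable: converges towards 0
w_large = gradient_descent(lr=1.1)     # too large: overshoots and diverges

print(f"lr=0.001 -> w = {w_small:.4f}")
print(f"lr=0.1   -> w = {w_good:.2e}")
print(f"lr=1.1   -> w = {w_large:.2e}")
```

Each step multiplies w by (1 - 2·lr), so the iterate shrinks when that factor is between -1 and 1 and grows when it is not.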

Key Takeaways

  1. Neurons apply a weighted sum + activation to produce outputs
  2. ReLU is the default activation for hidden layers — fast and effective
  3. Backpropagation uses the chain rule to compute gradients efficiently
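  4. Xavier/He initialisation keeps activation and gradient scales stable, which helps avoid vanishing/exploding gradients
  5. Dropout is a standard regulariser; BatchNorm mainly stabilises and speeds up training, with a mild regularising side effect
  6. Always monitor validation loss: if it rises while training loss keeps falling, the model is overfitting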
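
To illustrate the initialisation takeaway, the rough experiment below pushes random inputs through a stack of ReLU layers and reports the root-mean-square of the final activations under three weight scales. The depth, width, and the two "bad" scales are arbitrary assumptions for demonstration.

```python
import numpy as np

def output_rms(scale_fn, depth=20, width=256, seed=0):
    """Pass random inputs through `depth` ReLU layers; return RMS of the final activations."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((512, width))            # batch of random inputs
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * scale_fn(width)
        a = np.maximum(0.0, a @ W)                   # ReLU layer, no bias
    return float(np.sqrt(np.mean(a ** 2)))

rms_he = output_rms(lambda fan_in: np.sqrt(2.0 / fan_in))   # He scaling
rms_tiny = output_rms(lambda fan_in: 0.01)                  # too small: signal dies out
rms_large = output_rms(lambda fan_in: 1.0)                  # too large: signal blows up

print(f"He init    : {rms_he:.3e}")
print(f"std = 0.01 : {rms_tiny:.3e}")
print(f"std = 1.0  : {rms_large:.3e}")
```

Gradients flow back through the same weight matrices, so a forward signal that vanishes or explodes usually means the backward signal does too.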
