Neural Networks Fundamentals
Neural networks learn complex patterns by stacking simple computational units (neurons) in layers.
The Perceptron
Single Neuron:
Inputs: x₁, x₂, ..., xₙ
Weights: w₁, w₂, ..., wₙ
Bias: b
Output = activation(Σ(wᵢxᵢ) + b)
This is just a LINEAR MODEL (the same weighted sum as linear regression) + activation function! With a sigmoid activation it is logistic regression.
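The single-neuron formula above can be sketched in plain Python. This is a minimal illustration, not a library API: the names `neuron` and `sigmoid` and all numeric values are made up for the example.

```python
import math

def neuron(inputs, weights, bias, activation):
    """Weighted sum of inputs plus bias, passed through an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative values, chosen only for this example.
x = [1.0, 2.0, 3.0]
w = [0.5, -0.25, 0.1]
b = 0.2

linear_out = neuron(x, w, b, lambda z: z)   # identity activation: plain linear model
sigmoid_out = neuron(x, w, b, sigmoid)      # sigmoid activation: logistic-regression form
print(linear_out, sigmoid_out)
```

Swapping the activation argument is the only difference between the two calls, which is the point of the comparison with linear regression.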
Activation Functions
Sigmoid: σ(x) = 1/(1+e⁻ˣ)
├─ Output: (0, 1)
├─ Good for: Binary classification output
└─ Problem: Vanishing gradients
Tanh: tanh(x) = (eˣ - e⁻ˣ)/(eˣ + e⁻ˣ)
├─ Output: (-1, 1)
├─ Zero-centered output, which usually trains better than sigmoid
└─ Still vanishing gradients
ReLU: f(x) = max(0, x)
├─ Output: [0, ∞)
├─ Fast computation
├─ Default choice for hidden layers
└─ Problem: Dead neurons (output always 0)
Leaky ReLU: f(x) = max(0.01x, x)
├─ Mitigates the dead neuron problem (gradient is never exactly 0)
└─ Slight negative slope for x < 0
GELU: f(x) = x × Φ(x), where Φ is the standard normal CDF
├─ Used in Transformers
└─ Smooth approximation of ReLU
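The activation formulas listed above translate almost directly into code. The sketch below uses only the standard library; the function names are this example's own, and GELU is written with `math.erf` on the assumption that the exact (non-approximated) form is wanted. The last lines illustrate the vanishing-gradient remark: the sigmoid derivative never exceeds 0.25, so a chain of sigmoid layers shrinks gradients quickly.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    return math.tanh(x)

def relu(x):
    return max(0.0, x)

def leaky_relu(x, slope=0.01):
    return max(slope * x, x)

def gelu(x):
    # Phi is the standard normal CDF, written here with the error function.
    phi = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return x * phi

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# sigmoid' peaks at 0.25 (at x = 0), so a chain of 10 sigmoid layers scales
# the gradient by at most 0.25**10 from the activations alone.
max_grad_through_10_layers = sigmoid_grad(0.0) ** 10

for name, fn in [("sigmoid", sigmoid), ("tanh", tanh), ("relu", relu),
                 ("leaky_relu", leaky_relu), ("gelu", gelu)]:
    print(name, [round(fn(v), 4) for v in (-2.0, 0.0, 2.0)])
print("max gradient factor through 10 sigmoids:", max_grad_through_10_layers)
```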
Multi-Layer Network
Input Layer      Hidden Layers       Output Layer
   (x₁)         (h₁¹)     (h₁²)          (y₁)
   (x₂)    →    (h₂¹)  →  (h₂²)    →     (y₂)
   (x₃)         (h₃¹)     (h₃²)
Each layer:
h = activation(Wx + b)
Width: Number of neurons per layer
Depth: Number of hidden layers
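To make the per-layer rule h = activation(Wx + b) concrete, here is a small forward pass written with plain lists. It mirrors the 3-3-3-2 shape of the diagram above (depth 2, width 3); the helper `dense` and every weight value are hypothetical, picked only so the arithmetic is easy to follow.

```python
def relu(z):
    return max(0.0, z)

def dense(W, x, b, activation):
    """One layer: h = activation(Wx + b), with W given as a list of rows."""
    return [activation(sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i)
            for row, b_i in zip(W, b)]

# Hypothetical weights for a 3 -> 3 -> 3 -> 2 network (depth 2, width 3).
x = [1.0, -1.0, 2.0]
W1 = [[0.2, 0.0, 0.1], [0.0, 0.5, 0.0], [-0.3, 0.1, 0.4]]
b1 = [0.0, 0.1, 0.0]
W2 = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0], [0.5, 0.5, 0.5]]
b2 = [0.0, 0.0, -0.1]
W3 = [[1.0, -1.0, 0.0], [0.0, 0.0, 2.0]]
b3 = [0.0, 0.0]

h1 = dense(W1, x, b1, relu)             # first hidden layer
h2 = dense(W2, h1, b2, relu)            # second hidden layer
y = dense(W3, h2, b3, lambda z: z)      # linear output layer
print(h1, h2, y)
```

Each row of a weight matrix belongs to one neuron, so the number of rows is the layer's width and the number of `dense` calls before the output is the depth.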
Universal Approximation Theorem:
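A network with 1 hidden layer (and a non-polynomial activation) can approximate any CONTINUOUS function on a closed, bounded input domain to arbitrary accuracy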
(may need exponentially many neurons)
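A one-dimensional illustration of the idea, not a proof of the theorem: the sketch below sets the weights of a single ReLU hidden layer by hand (no training) so that the network equals a piecewise-linear interpolant of a target function. The helper names `build_relu_net` and `max_error`, the choice of sin on [0, 2π], and the hidden sizes are all assumptions made for the example.

```python
import math

def relu(z):
    return max(0.0, z)

def build_relu_net(f, a, b, n_hidden):
    """Hand-set weights for one hidden layer of ReLU units so the network
    equals the piecewise-linear interpolant of f on n_hidden equal segments."""
    knots = [a + (b - a) * k / n_hidden for k in range(n_hidden + 1)]
    slopes = [(f(knots[k + 1]) - f(knots[k])) / (knots[k + 1] - knots[k])
              for k in range(n_hidden)]
    # Output weight of unit k is the change in slope at knot k.
    out_w = [slopes[0]] + [slopes[k] - slopes[k - 1] for k in range(1, n_hidden)]
    out_b = f(a)

    def net(x):
        # Hidden unit k computes relu(1 * x - knots[k]).
        hidden = [relu(x - knots[k]) for k in range(n_hidden)]
        return out_b + sum(w * h for w, h in zip(out_w, hidden))

    return net

def max_error(net, f, a, b, samples=1000):
    xs = [a + (b - a) * i / samples for i in range(samples + 1)]
    return max(abs(net(x) - f(x)) for x in xs)

errors = {n: max_error(build_relu_net(math.sin, 0.0, 2 * math.pi, n),
                       math.sin, 0.0, 2 * math.pi)
          for n in (4, 16, 64)}
print(errors)
```

The worst-case error shrinks as hidden units are added, which is the "wide enough" part of the statement. In one dimension the growth in units is mild; the exponential blow-up mentioned above concerns harder, higher-dimensional targets.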
Backpropagation
How neural networks learn:
1. Forward Pass
Input → Compute outputs layer by layer
2. Compute Loss
Loss = L(predicted, actual)
3. Backward Pass (Backprop)
Compute gradient of loss w.r.t. EACH weight
Using chain rule of calculus, with z = Σ(wᵢxᵢ) + b and a = activation(z):
∂L/∂w = ∂L/∂a × ∂a/∂z × ∂z/∂w
4. Update Weights
w = w - α × ∂L/∂w
(Gradient Descent, where α is the learning rate)
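The four steps above can be traced by hand for the smallest possible case: one sigmoid neuron with one input. This is a sketch under stated assumptions, not a general backprop implementation: squared error is used as the loss for simplicity, and the data point, starting weights, and learning rate are illustrative. A finite-difference check is included to confirm the chain-rule gradient.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w, b, x):
    z = w * x + b          # pre-activation
    a = sigmoid(z)         # activation (the prediction)
    return z, a

def loss_fn(a, y):
    return 0.5 * (a - y) ** 2   # squared error, chosen for simplicity

def backward(w, b, x, y):
    """Chain rule: dL/dw = dL/da * da/dz * dz/dw (and likewise for b)."""
    z, a = forward(w, b, x)
    dL_da = a - y
    da_dz = a * (1.0 - a)
    dz_dw, dz_db = x, 1.0
    return dL_da * da_dz * dz_dw, dL_da * da_dz * dz_db

# Illustrative single training example and starting parameters.
x, y = 1.5, 1.0
w, b = -0.4, 0.1
alpha = 0.5   # learning rate

# Sanity check against a finite-difference estimate of dL/dw.
eps = 1e-6
numeric_dw = (loss_fn(forward(w + eps, b, x)[1], y)
              - loss_fn(forward(w - eps, b, x)[1], y)) / (2 * eps)
analytic_dw, _ = backward(w, b, x, y)

initial_loss = loss_fn(forward(w, b, x)[1], y)
for step in range(200):
    dw, db = backward(w, b, x, y)   # steps 1-3: forward pass, loss, gradients
    w -= alpha * dw                 # step 4: gradient descent update
    b -= alpha * db
final_loss = loss_fn(forward(w, b, x)[1], y)
print(initial_loss, final_loss)
```

In a multi-layer network the same product of local derivatives is extended backward one layer at a time, reusing the upstream factor instead of recomputing it, which is what makes backpropagation efficient.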
PyTorch Implementation
import torch
import torch.nn as nn
import torch.optim as optim
class NeuralNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(10, 64),
            nn.ReLU(),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, 1),
            nn.Sigmoid()
        )

    def forward(self, x):
        return self.layers(x)
# Initialize
model = NeuralNet()
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
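# Placeholder data so the loop runs; substitute a real dataset.
# Shapes match the model: 10 input features, 1 target in {0, 1}.
X_train = torch.randn(100, 10)
y_train = torch.randint(0, 2, (100, 1)).float()
# Training loop
for epoch in range(100):
    # Forward pass
    y_pred = model(X_train)
    loss = criterion(y_pred, y_train)
    # Backward pass: clear old gradients, backpropagate, update weights
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if epoch % 10 == 0:
        print(f"Epoch {epoch}: Loss={loss.item():.4f}")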
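For intuition about the loss used above: with its default settings, `nn.BCELoss` returns the mean binary cross-entropy over the batch. The plain-Python sketch below computes that same formula. The function name and the `eps` clamp are this example's own choices (PyTorch guards against log(0) in a different way), and the predictions and labels are hypothetical.

```python
import math

def binary_cross_entropy(y_pred, y_true, eps=1e-12):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all examples."""
    total = 0.0
    for p, y in zip(y_pred, y_true):
        p = min(max(p, eps), 1.0 - eps)   # keep log() away from 0
        total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / len(y_pred)

# Hypothetical sigmoid outputs and 0/1 labels.
confident_right = binary_cross_entropy([0.9, 0.1], [1.0, 0.0])
confident_wrong = binary_cross_entropy([0.1, 0.9], [1.0, 0.0])
print(confident_right, confident_wrong)
```

Confident wrong predictions are penalized far more heavily than confident correct ones are rewarded, which is why this loss pairs naturally with a sigmoid output.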
Key Takeaways
- Neural networks are universal function approximators
- ReLU is the default activation for hidden layers
- Backpropagation computes gradients efficiently
- Learning rate is often the most important hyperparameter to tune
- Deeper networks can learn more complex patterns
- Overfitting can be reduced with dropout, regularization, and early stopping
- GPU acceleration is essential for training large networks
- PyTorch and TensorFlow are two of the most widely used frameworks
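A toy illustration of why the learning rate matters so much: plain gradient descent on f(w) = w², using the update rule w = w - α × ∂L/∂w from the Backpropagation section. The function name and the three α values are chosen only for this example; real networks behave less cleanly, but the same three regimes (too slow, reasonable, divergent) show up in practice.

```python
def gradient_descent(alpha, steps=50, w0=1.0):
    """Minimize f(w) = w**2, whose gradient is 2*w, starting from w0."""
    w = w0
    for _ in range(steps):
        w = w - alpha * 2.0 * w
    return w

results = {alpha: gradient_descent(alpha) for alpha in (0.01, 0.1, 1.1)}
for alpha, w in results.items():
    print(f"alpha={alpha}: final w = {w:.3e}")
```

The smallest rate is still far from the minimum at w = 0 after 50 steps, the middle rate converges, and the largest rate overshoots further on every step and diverges.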