Activation Functions — Sigmoid, ReLU, GELU & The Dead Neuron Problem
Activation functions introduce non-linearity into neural networks, enabling them to learn complex patterns. Without them, a deep network would collapse into a linear model.
See our Neural Networks tutorial for the basics of perceptrons and multi-layer networks.
Why Activation Functions?
Definition: Non-Linearity
Without activation functions, a deep network is just a linear transformation:
$$f(x) = W_L \cdots W_2 W_1 x = W' x, \qquad W' = W_L \cdots W_2 W_1$$
No matter how many layers you stack, the result is equivalent to a single linear layer (bias terms fold into a single bias in the same way). Activation functions break this linearity, allowing networks to learn arbitrary decision boundaries.
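The collapse can be checked numerically. The sketch below uses plain Python lists and made-up 2x2 weights (`W1`, `W2`, and `x` are illustrative values, not taken from any model) to compare two stacked linear layers with one merged layer, and then shows that placing a ReLU between the layers changes the result.

```python
# Two stacked linear layers (no activation) versus one merged layer.
# W1, W2 and x are arbitrary illustrative values.

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A by matrix B (both given as lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

W1 = [[1.0, 2.0], [0.5, -1.0]]
W2 = [[3.0, 0.0], [-2.0, 1.0]]
x = [1.0, 4.0]

two_layers = matvec(W2, matvec(W1, x))   # W2 (W1 x)
one_layer = matvec(matmul(W2, W1), x)    # (W2 W1) x

# Inserting a ReLU between the layers breaks the equivalence.
hidden = [max(0.0, h) for h in matvec(W1, x)]
with_relu = matvec(W2, hidden)

print("two linear layers:", two_layers)
print("one merged layer: ", one_layer)
print("with ReLU between:", with_relu)
```

The first two results agree, while the third differs because the ReLU zeroes the negative hidden value before the second layer sees it.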
ℹ️ Universal Approximation Requirement
The universal approximation theorem requires a non-linear activation function. A network with linear activations can only learn linear functions, regardless of depth or width.
Sigmoid
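Definition: Sigmoid Function
The sigmoid function maps any real number to $(0, 1)$:
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
- Output range: $(0, 1)$
- Derivative: $\sigma'(x) = \sigma(x)\,(1 - \sigma(x))$
- Peak derivative: $\sigma'(0) = 0.25$ at $x = 0$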
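Sigmoid Function
$$\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad \sigma'(x) = \sigma(x)\,(1 - \sigma(x))$$
Here,
- $x$ = Input (any real number)
- $\sigma(x)$ = Output in $(0, 1)$
- $\sigma'(x)$ = Derivative (max 0.25 at $x = 0$)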
Theorem: Vanishing Gradient in Sigmoid
The maximum derivative of the sigmoid function is $\sigma'(0) = 0.25$. In a network with $n$ sigmoid layers, the product of activation derivatives along the backward pass is bounded by $0.25^n$, which decays exponentially. For $n = 10$, this factor is less than $10^{-6}$, making training extremely slow.
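A quick way to see the size of this bound is to evaluate it directly. The following is a minimal sketch in plain Python. It only tracks the best-case product of sigmoid derivatives and ignores the weight matrices, which also scale the real gradient. The helper names are illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# The derivative peaks at x = 0.
peak = sigmoid_grad(0.0)

# Best-case product of derivatives through n sigmoid layers.
bounds = {n: peak ** n for n in (2, 10, 20)}
for n, bound in bounds.items():
    print(f"n = {n:2d}: bound = {bound:.3e}")
```

Away from $x = 0$ the derivative is smaller than the peak, so the actual factor is typically well below these values.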
⚠️ Sigmoid Issues
- Vanishing gradients: Peak derivative is only 0.25
- Not zero-centered: Outputs are always positive, causing zigzagging in optimization
- Expensive: computing $e^{-x}$ is slower than ReLU's simple max
- Use only in output layer for binary classification (with BCE loss)
Tanh
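Definition: Hyperbolic Tangent
Tanh maps inputs to $(-1, 1)$:
$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$
- Output range: $(-1, 1)$
- Derivative: $\tanh'(x) = 1 - \tanh^2(x)$
- Peak derivative: $\tanh'(0) = 1$ at $x = 0$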
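Tanh Function
$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}, \qquad \tanh'(x) = 1 - \tanh^2(x)$$
Here,
- $x$ = Input (any real number)
- $\tanh(x)$ = Output in $(-1, 1)$
- $\tanh'(x)$ = Derivative (max 1.0 at $x = 0$)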
💡 Tanh vs. Sigmoid
Tanh is zero-centered (outputs range from -1 to 1), which helps optimization. However, it still suffers from vanishing gradients for large $|x|$ (derivative approaches 0). Prefer ReLU for hidden layers.
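The saturation is easy to observe by evaluating $1 - \tanh^2(x)$ at a few points. This is a small standard-library sketch. The sample inputs are arbitrary.

```python
import math

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2

samples = (0.0, 1.0, 2.0, 5.0)
grads = [tanh_grad(x) for x in samples]
for x, g in zip(samples, grads):
    print(f"x = {x:4.1f}  tanh'(x) = {g:.6f}")
```

The derivative starts at 1 for $x = 0$ and shrinks quickly as $|x|$ grows, which is the vanishing-gradient behaviour described above.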
ReLU (Rectified Linear Unit)
Definition: ReLU
ReLU is the most widely used activation for hidden layers:
$$\text{ReLU}(x) = \max(0, x)$$
- Output range: $[0, \infty)$
- Derivative: $1$ for $x > 0$, $0$ for $x < 0$
- Advantages: Computationally cheap, no vanishing gradient for positive inputs, sparse activations
ReLU Function
$$\text{ReLU}(x) = \max(0, x)$$
Here,
- $x$ = Input value
- $\text{ReLU}(x)$ = Output (0 if negative, $x$ if positive)
ℹ️ Why ReLU Works
ReLU's constant derivative of 1 for positive inputs prevents vanishing gradients. The sparsity (many zeros) provides a form of regularization. ReLU is also extremely fast to compute — just a max operation. This combination made deep training practical.
Dead Neuron Problem
Definition: Dead Neurons
A "dead neuron" is a ReLU neuron that always outputs zero. If the weights are such that $w^\top x + b < 0$ for all training inputs, the neuron never activates, and its gradient is always zero. With a zero gradient its weights stop updating, so the neuron cannot recover through gradient descent.
Causes: Learning rate too high, poor initialization, large negative biases
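To make the mechanism concrete, here is a minimal sketch of a single ReLU neuron whose bias is strongly negative. The weights, bias, and inputs are hypothetical values chosen so that the pre-activation is negative for every input. The gradient with respect to the weights is then zero everywhere.

```python
def relu(z):
    return max(0.0, z)

def relu_grad(z):
    # Convention used here: derivative 1 for z > 0, otherwise 0.
    return 1.0 if z > 0 else 0.0

# Hypothetical neuron with a large negative bias.
w, b = [0.5, -0.3], -10.0
inputs = [[1.0, 2.0], [3.0, -1.0], [-2.0, 4.0], [0.5, 0.5]]

pre_activations = []
outputs = []
weight_grads = []
for x in inputs:
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    pre_activations.append(z)
    outputs.append(relu(z))
    # d(output)/dw = relu'(z) * x, which is all zeros when z < 0.
    weight_grads.append([relu_grad(z) * xi for xi in x])

print("pre-activations:", pre_activations)
print("outputs:        ", outputs)
print("weight grads:   ", weight_grads)
```

Because every weight gradient is zero, no update step can move the pre-activation back above zero for this data.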
⚠️ Dead Neuron Prevention
- Use Leaky ReLU or PReLU instead of ReLU
- Initialize biases to small positive values
- Use lower learning rates
- Monitor activation statistics during training
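One simple statistic to monitor is the fraction of neurons that output zero for every input in a batch. The helper below is a sketch in plain Python. `dead_fraction` is an illustrative name, and the batch of post-ReLU activations is made up.

```python
def dead_fraction(activations):
    """Fraction of neurons that are zero for every input in the batch.

    activations: list of rows, one row per input, one column per neuron.
    """
    n_neurons = len(activations[0])
    dead = sum(
        all(row[j] == 0.0 for row in activations) for j in range(n_neurons)
    )
    return dead / n_neurons

# Made-up post-ReLU activations: 3 inputs, 4 neurons.
batch = [
    [0.0, 1.2, 0.0, 0.3],
    [0.0, 0.0, 0.0, 2.1],
    [0.0, 0.7, 0.0, 0.0],
]

fraction = dead_fraction(batch)
print(f"dead fraction: {fraction:.2f}")
```

A neuron that is silent on one batch is not necessarily dead, so in practice this statistic would be tracked over many batches. A value that stays high throughout training suggests a dying-ReLU problem.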
Leaky ReLU
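Definition: Leaky ReLU
Leaky ReLU allows a small gradient for negative inputs:
$$\text{LeakyReLU}(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha x & \text{otherwise} \end{cases}$$
where $\alpha$ is a small constant (typically 0.01). This ensures the gradient is never exactly zero, preventing dead neurons.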
Leaky ReLU
$$\text{LeakyReLU}(x) = \max(\alpha x, x)$$
Here,
- $x$ = Input value
- $\alpha$ = Slope for negative inputs (typically 0.01)
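A minimal sketch of Leaky ReLU and its derivative in plain Python, assuming the typical slope of 0.01 for negative inputs (the function names are illustrative):

```python
def leaky_relu(x, alpha=0.01):
    """Identity for positive inputs, small linear slope otherwise."""
    return x if x > 0 else alpha * x

def leaky_relu_grad(x, alpha=0.01):
    """Derivative: 1 for positive inputs, alpha otherwise."""
    return 1.0 if x > 0 else alpha

for x in (-3.0, -0.5, 2.0):
    print(f"x = {x:5.1f}  f(x) = {leaky_relu(x):8.4f}  f'(x) = {leaky_relu_grad(x)}")
```

Unlike plain ReLU, the derivative for negative inputs is `alpha` rather than zero, so a neuron with a negative pre-activation still receives a small weight update.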
GELU (Gaussian Error Linear Unit)
Definition: GELU
GELU is a smooth approximation to ReLU, used in Transformers (BERT, GPT):
$$\text{GELU}(x) = x \, \Phi(x)$$
where $\Phi(x)$ is the CDF of the standard Gaussian distribution. GELU smoothly transitions between zero and identity, providing a probabilistic interpretation.
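GELU Function
$$\text{GELU}(x) = x \, \Phi(x) = \frac{x}{2}\left[1 + \operatorname{erf}\!\left(\frac{x}{\sqrt{2}}\right)\right]$$
Here,
- $x$ = Input value
- $\Phi(x)$ = Standard Gaussian CDF
- $\operatorname{erf}$ = Error function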
💡 GELU in Practice
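GELU is the default activation in most Transformer models (BERT, GPT-2/3, Vision Transformers). It provides a smooth non-linearity that works well with self-attention. A common approximation is the tanh form: $\text{GELU}(x) \approx 0.5\,x\left(1 + \tanh\left[\sqrt{2/\pi}\,\left(x + 0.044715\,x^3\right)\right]\right)$.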
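The exact and approximate forms can be compared with the standard library alone. This sketch assumes the tanh-based approximation $0.5\,x\,(1 + \tanh[\sqrt{2/\pi}\,(x + 0.044715\,x^3)])$, which is the one commonly paired with GELU. The function names are illustrative.

```python
import math

def gelu_exact(x):
    """x * Phi(x), with the Gaussian CDF written via the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """Tanh-based approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

# Largest absolute gap on a grid over [-5, 5].
grid = [i / 10 for i in range(-50, 51)]
max_gap = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in grid)
print(f"max |exact - tanh approx| on [-5, 5]: {max_gap:.2e}")
```

On this grid the two versions differ only in the fourth decimal place or beyond, which is why the approximation is often treated as interchangeable with the exact form.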
Swish (SiLU)
Definition: Swish / SiLU
Swish is a smooth, non-monotonic activation function found through an automated search over candidate activation functions:
$$\text{Swish}(x) = x \cdot \sigma(\beta x)$$
When $\beta = 1$, this is the SiLU (Sigmoid Linear Unit). Swish is self-gated: the input is modulated by its own sigmoid transformation.
Swish Function
$$\text{Swish}(x) = x \cdot \sigma(\beta x) = \frac{x}{1 + e^{-\beta x}}$$
Here,
- $x$ = Input value
- $\beta$ = Learnable or fixed parameter (default 1)
- $\sigma$ = Sigmoid function
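The self-gating and the non-monotonic dip can be seen by evaluating $x \cdot \sigma(\beta x)$ at a few points. This is a plain-Python sketch. The function names and sample inputs are illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def swish(x, beta=1.0):
    """x * sigmoid(beta * x); beta = 1 gives SiLU."""
    return x * sigmoid(beta * x)

# Non-monotonic: the output dips below zero and then rises back toward it.
for x in (-5.0, -1.0, 0.0, 1.0):
    print(f"x = {x:5.1f}  swish(x) = {swish(x):8.4f}")

# With a large beta the gate becomes nearly a step, so Swish behaves like ReLU.
print(swish(2.0, beta=50.0), swish(-2.0, beta=50.0))
```

The value at `x = -1` is lower than at both `x = -5` and `x = 0`, which is the non-monotonic region that distinguishes Swish from ReLU.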
Softmax
Definition: Softmax
Softmax converts a vector of logits $z = (z_1, \dots, z_K)$ into a probability distribution:
$$\text{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$
Properties: outputs sum to 1, all outputs are positive, it is differentiable. Used in the output layer for multi-class classification.
ℹ️ Numerical Stability of Softmax
Direct computation of $e^{z_i}$ can overflow for large $z_i$. The numerically stable version subtracts the maximum: $\text{softmax}(z)_i = \frac{e^{z_i - \max_k z_k}}{\sum_j e^{z_j - \max_k z_k}}$. PyTorch's F.softmax and nn.CrossEntropyLoss handle this automatically.
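The effect of subtracting the maximum can be demonstrated without any tensor library. In this plain-Python sketch, `softmax_naive` and `softmax_stable` are illustrative helpers and the logits are made up. With `math.exp` the naive version raises `OverflowError`, whereas tensor libraries typically produce `inf` or `nan` instead.

```python
import math

def softmax_naive(z):
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_stable(z):
    m = max(z)                           # shift so the largest exponent is 0
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

logits = [1000.0, 999.0, 998.0]

try:
    softmax_naive(logits)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True              # exp(1000) exceeds the float range

probs = softmax_stable(logits)
print("naive overflowed:", naive_overflowed)
print("stable softmax:  ", probs)
```

Subtracting the maximum does not change the result mathematically, because the common factor $e^{-\max_k z_k}$ cancels between numerator and denominator.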
Summary Comparison
| Activation | Formula | Range | Derivative | Use Case |
|---|---|---|---|---|
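| Sigmoid | $\frac{1}{1 + e^{-x}}$ | $(0, 1)$ | $\sigma(x)(1 - \sigma(x))$ | Output (binary) |
| Tanh | $\tanh(x)$ | $(-1, 1)$ | $1 - \tanh^2(x)$ | Hidden (legacy) |
| ReLU | $\max(0, x)$ | $[0, \infty)$ | $0$ or $1$ | Hidden (default) |
| Leaky ReLU | $\max(\alpha x, x)$ | $(-\infty, \infty)$ | $\alpha$ or $1$ | Hidden (no dead neurons) |
| GELU | $x \, \Phi(x)$ | $\approx [-0.17, \infty)$ | Smooth | Transformers |
| Swish | $x \, \sigma(\beta x)$ | $\approx [-0.28, \infty)$ for $\beta = 1$ | Smooth | EfficientNet |
| Softmax | $e^{z_i} / \sum_j e^{z_j}$ | $(0, 1)$, sums to 1 | Jacobian | Output (multi-class) |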
PyTorch Implementation
📝Example: All Activation Functions
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt
x = torch.linspace(-5, 5, 200)
activations = {
    'Sigmoid': torch.sigmoid(x),
    'Tanh': torch.tanh(x),
    'ReLU': F.relu(x),
    'Leaky ReLU': F.leaky_relu(x, negative_slope=0.01),
    'GELU': F.gelu(x),
    'Swish (SiLU)': F.silu(x),
}
# Plot all activations
fig, axes = plt.subplots(2, 3, figsize=(12, 8))
for ax, (name, y) in zip(axes.flat, activations.items()):
    ax.plot(x.numpy(), y.numpy())
    ax.set_title(name)
    ax.grid(True)
    ax.axhline(y=0, color='k', linewidth=0.5)
    ax.axvline(x=0, color='k', linewidth=0.5)
plt.tight_layout()
plt.savefig('activations.png')
plt.show()
📝Example: Softmax vs. Log-Softmax
import torch
import torch.nn.functional as F
logits = torch.tensor([2.0, 1.0, 0.5, 0.1])
# Standard softmax
probs = F.softmax(logits, dim=0)
print(f"Softmax: {probs}")
print(f"Sum: {probs.sum():.4f}")
# Log-softmax (numerically stable)
log_probs = F.log_softmax(logits, dim=0)
print(f"\nLog-softmax: {log_probs}")
print(f"Sum of exp(log_probs): {log_probs.exp().sum():.4f}")
# Cross-entropy loss combines log-softmax and negative log-likelihood
target = torch.tensor([0]) # Class 0
ce_loss = F.cross_entropy(logits.unsqueeze(0), target)
manual_ce = -log_probs[target]
print(f"\nCrossEntropy: {ce_loss.item():.4f}")
print(f"Manual CE: {manual_ce.item():.4f}")
Summary
📋Summary: Activation Functions
- ReLU is the default for hidden layers — fast, prevents vanishing gradients
- Sigmoid is only for binary classification output layers
- Tanh is zero-centered but still has vanishing gradients
- GELU/Swish are smooth alternatives to ReLU: GELU is standard in Transformers, Swish in EfficientNet
- Softmax converts logits to probabilities for multi-class output
- Dead neurons occur when ReLU neurons are always inactive — use Leaky ReLU or careful initialization
- Numerical stability: Use log-softmax + NLL loss instead of softmax + log
- Choice matters: Activation function choice affects training speed, convergence, and final performance
Practice Exercises
- Conceptual: Why does the sigmoid's maximum derivative of 0.25 cause vanishing gradients? What happens to the gradient after passing through 5 sigmoid layers?
- Coding: Implement a "dying ReLU" scenario: create a network where most neurons are dead. Show how Leaky ReLU prevents this. Monitor the fraction of zero activations during training.
- Research: Read the original Swish paper (Ramachandran et al., 2017). How was Swish discovered? What search space was used?
- Experiment: Train the same network on CIFAR-10 with ReLU, GELU, and Swish. Compare convergence speed and final accuracy. Which works best and why?
- Mathematical: Derive the gradient of GELU with respect to $x$. Implement a custom GELU autograd function in PyTorch with both forward and backward passes.