Activation Functions — Sigmoid, ReLU, GELU & The Dead Neuron Problem


Activation functions introduce non-linearity into neural networks, enabling them to learn complex patterns. Without them, a deep network would collapse into a linear model.

See our Neural Networks tutorial for the basics of perceptrons and multi-layer networks.


Why Activation Functions?

Definition: Non-Linearity

Without activation functions, a deep network is just a linear transformation:

$$f(\mathbf{x}) = \mathbf{W}_L \mathbf{W}_{L-1} \cdots \mathbf{W}_1 \mathbf{x} = \mathbf{W}'\mathbf{x}$$

No matter how many layers you stack, the result is equivalent to a single linear layer (with bias terms, a single affine layer). Activation functions break this linearity, allowing networks to learn non-linear decision boundaries.
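
The collapse described above can be checked numerically. The following is a minimal sketch in plain Python; the matrices `W1`, `W2` and the input `x` are arbitrary illustrative values, not taken from the lesson.

```python
# Two stacked linear layers without an activation collapse into one matrix.
# W1, W2 and x are arbitrary illustrative values (assumptions for this sketch).

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def matvec(A, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]

def relu_vec(v):
    """Apply ReLU element-wise."""
    return [max(0.0, t) for t in v]

W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]   # layer 1: 2 inputs -> 3 units
W2 = [[2.0, 0.0, 1.0], [-1.0, 1.0, 0.0]]     # layer 2: 3 units -> 2 outputs
x = [0.5, -2.0]

two_layers = matvec(W2, matvec(W1, x))           # W2 (W1 x)
W_prime = matmul(W2, W1)                         # single equivalent matrix W'
one_layer = matvec(W_prime, x)                   # W' x
with_relu = matvec(W2, relu_vec(matvec(W1, x)))  # W2 ReLU(W1 x)

print("two linear layers:", two_layers)
print("single layer W'x: ", one_layer)
print("with ReLU between:", with_relu)
```

The first two results agree, while inserting a ReLU between the layers gives a different output that no single matrix reproduces for all inputs.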

ℹ️ Universal Approximation Requirement

The universal approximation theorem requires a non-linear activation function. A network with linear activations can only learn linear functions, regardless of depth or width.


Sigmoid

Definition: Sigmoid Function

The sigmoid function maps any real number to $(0, 1)$:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

  • Output range: $(0, 1)$
  • Derivative: $\sigma'(x) = \sigma(x)(1 - \sigma(x))$
  • Peak derivative: $0.25$ at $x = 0$

Sigmoid Function

$$\sigma(x) = \frac{1}{1 + e^{-x}}, \quad \sigma'(x) = \sigma(x)(1 - \sigma(x))$$

Here,

  • $x$ = Input (any real number)
  • $\sigma(x)$ = Output in $(0, 1)$
  • $\sigma'(x)$ = Derivative (max 0.25 at $x = 0$)
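
As a quick numerical check of the sigmoid formulas, here is a minimal sketch in plain Python. The helper names (`sigmoid`, `sigmoid_grad`, `finite_diff`) are illustrative, and the two-branch evaluation is one common way to avoid overflow, not the only one.

```python
import math

def sigmoid(x):
    """Sigmoid evaluated in a form that avoids overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)          # x < 0, so exp(x) cannot overflow
    return z / (1.0 + z)

def sigmoid_grad(x):
    """Analytic derivative: sigma'(x) = sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

def finite_diff(f, x, h=1e-5):
    """Central finite-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

for x in (-4.0, 0.0, 4.0):
    print(f"x={x:+.1f}  sigma={sigmoid(x):.6f}  "
          f"analytic={sigmoid_grad(x):.6f}  numeric={finite_diff(sigmoid, x):.6f}")
```

The analytic and numeric derivatives should agree closely, and the derivative is largest at `x = 0`, where it equals 0.25.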

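Theorem: Vanishing Gradient in Sigmoid

The maximum derivative of the sigmoid function is $0.25$. In a network with $L$ sigmoid layers, the product of activation derivatives that multiplies the backpropagated gradient is bounded by $(0.25)^L$, which decays exponentially in $L$. For $L = 10$, this factor is below $10^{-6}$ ($0.25^{10} \approx 9.5 \times 10^{-7}$), so unless the weights are large enough to compensate, the early layers receive almost no gradient and training becomes extremely slow.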
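
The bound is easy to tabulate. The sketch below prints $(0.25)^L$ for a few depths and compares it with the product of sigmoid derivatives along one path; the pre-activation values are hypothetical, and the weight matrices of a real network (which also scale the gradient) are deliberately ignored.

```python
import math

def sigmoid_grad(x):
    """sigma'(x) = sigma(x) * (1 - sigma(x))."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

# Upper bound on the product of L sigmoid derivatives.
for L in (1, 2, 5, 10, 20):
    print(f"L = {L:2d}   bound = {0.25 ** L:.3e}")

bound_10 = 0.25 ** 10

# Hypothetical pre-activations, one per layer, along a 10-layer path.
pre_activations = [0.0, 0.5, -1.0, 2.0, -0.3, 1.2, 0.0, -2.5, 0.8, 0.1]
product = math.prod(sigmoid_grad(z) for z in pre_activations)
print(f"product of derivatives along the path = {product:.3e}")
```

Because each factor is at most 0.25, the product along any path can only be smaller than the bound.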

⚠️ Sigmoid Issues

  1. Vanishing gradients: Peak derivative is only 0.25
  2. Not zero-centered: Outputs are always positive, so the weight gradients of a downstream neuron all share one sign, which causes zigzagging in optimization
  3. Expensive: Computing $e^{-x}$ is slower than ReLU's max operation
  4. Use only in the output layer for binary classification (with BCE loss)

Tanh

Definition: Hyperbolic Tangent

Tanh maps inputs to $(-1, 1)$:

$$\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$$

  • Output range: $(-1, 1)$
  • Derivative: $\tanh'(x) = 1 - \tanh^2(x)$
  • Peak derivative: $1.0$ at $x = 0$

Tanh Function

$$\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}, \quad \tanh'(x) = 1 - \tanh^2(x)$$

Here,

  • $x$ = Input (any real number)
  • $\tanh(x)$ = Output in $(-1, 1)$
  • $\tanh'(x)$ = Derivative (max 1.0 at $x = 0$)
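
A short sketch in plain Python can confirm the derivative formula and show how tanh relates to sigmoid through the identity $\tanh(x) = 2\sigma(2x) - 1$. The helper names are illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh_grad(x):
    """Derivative: tanh'(x) = 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2

for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    rescaled = 2.0 * sigmoid(2.0 * x) - 1.0   # tanh as a shifted, rescaled sigmoid
    print(f"x={x:+.1f}  tanh={math.tanh(x):+.6f}  "
          f"2*sigmoid(2x)-1={rescaled:+.6f}  tanh'={tanh_grad(x):.6f}")
```

The derivative peaks at 1.0 for `x = 0` and shrinks toward 0 as `|x|` grows, which is the saturation behaviour behind tanh's vanishing gradients.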

💡 Tanh vs. Sigmoid

Tanh is zero-centered (outputs range from -1 to 1), which helps optimization. However, it still suffers from vanishing gradients for large $|x|$ (derivative approaches 0). Prefer ReLU for hidden layers.


ReLU (Rectified Linear Unit)

Definition: ReLU

ReLU is the most widely used activation for hidden layers:

$$\text{ReLU}(x) = \max(0, x)$$

  • Output range: $[0, \infty)$
  • Derivative: $\text{ReLU}'(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x < 0 \end{cases}$ (undefined at $x = 0$; implementations conventionally use 0 there)
  • Advantages: Computationally cheap, no vanishing gradient for positive inputs, sparse activations

ReLU Function

$$\text{ReLU}(x) = \max(0, x), \quad \text{ReLU}'(x) = \begin{cases} 1 & x > 0 \\ 0 & x \leq 0 \end{cases}$$

Here,

  • $x$ = Input value
  • $\text{ReLU}(x)$ = Output (0 if negative, $x$ if positive)
  • The value 0 at $x = 0$ is the usual convention for the derivative, which is mathematically undefined at that point

ℹ️ Why ReLU Works

ReLU's derivative is exactly 1 for positive inputs, so active units pass gradients through without shrinking them, which mitigates vanishing gradients. The sparsity (many zeros) is often described as a mild form of regularization. ReLU is also extremely fast to compute, since it is just a max operation. This combination made training deep networks practical.
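
To make the gradient argument concrete, the sketch below multiplies ReLU derivatives along a path of layers. The pre-activation values are hypothetical, weights are ignored, and the derivative at exactly 0 is taken to be 0 by convention.

```python
def relu(x):
    return max(0.0, x)

def relu_grad(x):
    """Derivative of ReLU, using the common convention of 0 at x = 0."""
    return 1.0 if x > 0 else 0.0

def path_gradient(pre_activations):
    """Product of ReLU derivatives along one path through the layers."""
    g = 1.0
    for z in pre_activations:
        g *= relu_grad(z)
    return g

# Hypothetical pre-activations, one per layer.
active_path = [0.7, 2.3, 0.1, 5.0, 1.4]     # every unit is active
blocked_path = [0.7, 2.3, -0.1, 5.0, 1.4]   # one unit is inactive

print("all units active: ", path_gradient(active_path))
print("one unit inactive:", path_gradient(blocked_path))
```

Along a fully active path the factor stays at 1 regardless of depth, while a single inactive unit blocks the gradient on that path entirely.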

Dead Neuron Problem

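Definition: Dead Neurons

A "dead neuron" is a ReLU neuron that outputs zero for every input. If the weights and bias are such that $W\mathbf{x} + b < 0$ for all training inputs, the neuron never activates, and the gradient with respect to its weights and bias is always zero. Gradient descent therefore cannot update the neuron, and it typically stays dead for the rest of training.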

Causes: Learning rate too high, poor initialization, large negative biases

⚠️ Dead Neuron Prevention

  • Use Leaky ReLU or PReLU instead of ReLU
  • Initialize biases to small positive values
  • Use lower learning rates
  • Monitor activation statistics during training
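
The monitoring advice above can be sketched for a single ReLU neuron in plain Python. Everything here is illustrative: the inputs are random values in $[-1, 1]$, the weights and biases are made up, and `zero_fraction` and `weight_gradient` are hypothetical helper names.

```python
import random

random.seed(0)
# 200 hypothetical training inputs with 3 features each, all in [-1, 1].
inputs = [[random.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(200)]

def pre_activation(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def zero_fraction(w, b, data):
    """Fraction of inputs for which the ReLU neuron outputs exactly zero."""
    zeros = sum(1 for x in data if pre_activation(w, b, x) <= 0)
    return zeros / len(data)

def weight_gradient(w, b, data, upstream=1.0):
    """Batch-summed gradient of the ReLU output w.r.t. w, for a constant upstream gradient."""
    grad = [0.0] * len(w)
    for x in data:
        if pre_activation(w, b, x) > 0:      # ReLU passes gradient only when active
            for i, xi in enumerate(x):
                grad[i] += upstream * xi
    return grad

w = [0.5, -0.3, 0.8]
healthy = zero_fraction(w, 0.1, inputs)
dead = zero_fraction(w, -5.0, inputs)        # large negative bias: never active

print(f"zero fraction, bias +0.1: {healthy:.2f}")
print(f"zero fraction, bias -5.0: {dead:.2f}")
print("weight gradient of the dead neuron:", weight_gradient(w, -5.0, inputs))
```

With the large negative bias the pre-activation can never exceed $0.5 + 0.3 + 0.8 - 5 < 0$ on these inputs, so the neuron outputs zero everywhere and its weight gradient is exactly zero. In a real training loop the same zero-fraction statistic could be logged per layer.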

Leaky ReLU

Definition: Leaky ReLU

Leaky ReLU allows a small gradient for negative inputs:

$$\text{LeakyReLU}(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha x & \text{if } x \leq 0 \end{cases}$$

where $\alpha$ is a small constant (typically 0.01). This ensures the gradient is never exactly zero, preventing dead neurons.

Leaky ReLU

$$\text{LeakyReLU}(x) = \begin{cases} x & x > 0 \\ \alpha x & x \leq 0 \end{cases}, \quad \alpha = 0.01$$

Here,

  • $x$ = Input value
  • $\alpha$ = Slope for negative inputs (typically 0.01)
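
A minimal sketch of Leaky ReLU and its derivative in plain Python, compared with ReLU on a negative input. The function names are illustrative and `alpha=0.01` follows the typical value given above.

```python
def relu_grad(x):
    """ReLU derivative (0 at x = 0 by convention)."""
    return 1.0 if x > 0 else 0.0

def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: identity for positive inputs, slope alpha otherwise."""
    return x if x > 0 else alpha * x

def leaky_relu_grad(x, alpha=0.01):
    """Derivative of Leaky ReLU: 1 for positive inputs, alpha otherwise."""
    return 1.0 if x > 0 else alpha

for x in (-2.0, -0.5, 0.5, 2.0):
    print(f"x={x:+.1f}  leaky_relu={leaky_relu(x):+.4f}  "
          f"relu'={relu_grad(x):.2f}  leaky_relu'={leaky_relu_grad(x):.2f}")
```

For negative inputs the ReLU derivative is 0 while the Leaky ReLU derivative is `alpha`, so a neuron whose pre-activation is negative on every input still receives a small gradient and can move back into the active region.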

GELU (Gaussian Error Linear Unit)

Definition: GELU

GELU is a smooth approximation to ReLU, used in Transformers (BERT, GPT):

$$\text{GELU}(x) = x \cdot \Phi(x) = x \cdot \frac{1}{2}\left[1 + \text{erf}\left(\frac{x}{\sqrt{2}}\right)\right]$$

where $\Phi(x)$ is the CDF of the standard Gaussian distribution. GELU smoothly transitions between zero and identity, and it has a probabilistic interpretation: each input is scaled by the probability that a standard Gaussian variable falls below it.

GELU Function

$$\text{GELU}(x) = x \cdot \Phi(x) = x \cdot \frac{1}{2}\left[1 + \text{erf}\left(\frac{x}{\sqrt{2}}\right)\right]$$

Here,

  • $x$ = Input value
  • $\Phi(x)$ = Standard Gaussian CDF
  • $\text{erf}$ = Error function

💡 GELU in Practice

GELU is the default activation in most Transformer models (BERT, GPT-2/3, Vision Transformers). It provides a smooth non-linearity that works well with self-attention. Approximate GELU: $\text{GELU}(x) \approx 0.5x\left(1 + \tanh\left(\sqrt{2/\pi}\,(x + 0.044715x^3)\right)\right)$.
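
The exact and approximate forms can be compared directly. This is a minimal sketch using only the standard library (`math.erf` and `math.tanh`); the function names `gelu_exact` and `gelu_tanh` are illustrative.

```python
import math

def gelu_exact(x):
    """GELU(x) = x * Phi(x), with Phi written via the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """Tanh-based approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

grid = [i / 100.0 for i in range(-500, 501)]   # x from -5 to 5 in steps of 0.01
max_gap = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in grid)
lowest = min(gelu_exact(x) for x in grid)

print(f"max |exact - tanh approx| on [-5, 5]: {max_gap:.2e}")
print(f"lowest GELU value on the grid: {lowest:.4f}")
```

The gap between the two forms stays small across the grid. The scan also shows that GELU dips slightly below zero for moderately negative inputs before returning toward zero, unlike ReLU.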


Swish (SiLU)

Definition: Swish / SiLU

Swish is a smooth, non-monotonic activation function discovered through an automated search over candidate activation functions (Ramachandran et al., 2017):

$$\text{Swish}(x) = x \cdot \sigma(\beta x)$$

When $\beta = 1$, this is the SiLU (Sigmoid Linear Unit). Swish is self-gated: the input is modulated by its own sigmoid transformation.

Swish Function

$$\text{Swish}(x) = x \cdot \sigma(\beta x), \quad \beta = 1 \text{ (SiLU)}$$

Here,

  • $x$ = Input value
  • $\beta$ = Learnable or fixed parameter (default 1)
  • $\sigma$ = Sigmoid function
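
The non-monotonic shape is easiest to see numerically. The sketch below implements Swish in plain Python and scans a grid for its minimum; the helper names and the grid resolution are illustrative choices.

```python
import math

def sigmoid(x):
    """Sigmoid in a form that avoids overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def swish(x, beta=1.0):
    """Swish(x) = x * sigmoid(beta * x); beta = 1 gives SiLU."""
    return x * sigmoid(beta * x)

grid = [i / 1000.0 for i in range(-5000, 5001)]   # x from -5 to 5
x_min = min(grid, key=swish)                       # location of the dip (beta = 1)

print(f"SiLU minimum on the grid: swish({x_min:.3f}) = {swish(x_min):.4f}")
print(f"swish(-3.0) = {swish(-3.0):.4f}, swish(0.0) = {swish(0.0):.4f}")
print(f"beta = 50:  swish(2.0) = {swish(2.0, beta=50.0):.4f}, "
      f"swish(-2.0) = {swish(-2.0, beta=50.0):.4f}")
```

The output dips below zero for moderately negative inputs and rises back toward zero further left, which is what makes Swish non-monotonic. For large `beta` the sigmoid gate becomes nearly a step and Swish behaves much like ReLU.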

Softmax

Definition: Softmax

Softmax converts a vector of logits into a probability distribution:

$$\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$

Properties: outputs sum to 1, all outputs are positive, it is differentiable. Used in the output layer for multi-class classification.

ℹ️ Numerical Stability of Softmax

Direct computation of $e^{z_i}$ can overflow for large $z_i$. The numerically stable version subtracts the maximum: $\text{softmax}(z_i) = \frac{e^{z_i - \max(z)}}{\sum_j e^{z_j - \max(z)}}$. PyTorch's F.softmax and nn.CrossEntropyLoss handle this automatically.
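
The max-subtraction trick can be sketched in plain Python. Here `softmax_naive` and `softmax_stable` are illustrative names, and the logits are made-up values chosen to trigger overflow. Note that Python's `math.exp` raises `OverflowError`, whereas floating-point tensor libraries typically produce `inf` and then `nan` instead.

```python
import math

def softmax_naive(z):
    """Direct formula: exp(z_i) / sum_j exp(z_j). Overflows for large logits."""
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_stable(z):
    """Subtract max(z) first, so the largest exponent is exp(0) = 1."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

large_logits = [1000.0, 999.0, 998.0]

try:
    softmax_naive(large_logits)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True

probs = softmax_stable(large_logits)
print("naive version overflowed:", naive_overflowed)
print("stable softmax:", [round(p, 4) for p in probs])
```

Subtracting the maximum does not change the result, because the common factor $e^{-\max(z)}$ cancels between numerator and denominator.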


Summary Comparison

| Activation | Formula | Range | Derivative | Use Case |
|---|---|---|---|---|
| Sigmoid | $\frac{1}{1+e^{-x}}$ | $(0,1)$ | $\sigma(1-\sigma)$ | Output (binary) |
| Tanh | $\frac{e^x-e^{-x}}{e^x+e^{-x}}$ | $(-1,1)$ | $1-\tanh^2$ | Hidden (legacy) |
| ReLU | $\max(0,x)$ | $[0,\infty)$ | $0$ or $1$ | Hidden (default) |
| Leaky ReLU | $\max(\alpha x, x)$ | $(-\infty,\infty)$ | $\alpha$ or $1$ | Hidden (no dead neurons) |
| GELU | $x\Phi(x)$ | $\approx [-0.17,\infty)$ | Smooth | Transformers |
| Swish | $x\sigma(\beta x)$ | $\approx [-0.28,\infty)$ for $\beta = 1$ | Smooth | EfficientNet |
| Softmax | $\frac{e^{z_i}}{\sum_j e^{z_j}}$ | $(0,1)$, sums to 1 | Jacobian | Output (multi-class) |

PyTorch Implementation

📝Example: All Activation Functions

import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt

x = torch.linspace(-5, 5, 200)

activations = {
    'Sigmoid': torch.sigmoid(x),
    'Tanh': torch.tanh(x),
    'ReLU': F.relu(x),
    'Leaky ReLU': F.leaky_relu(x, negative_slope=0.01),
    'GELU': F.gelu(x),
    'Swish (SiLU)': F.silu(x),
}

# Plot all activations
fig, axes = plt.subplots(2, 3, figsize=(12, 8))
for ax, (name, y) in zip(axes.flat, activations.items()):
    ax.plot(x.numpy(), y.numpy())
    ax.set_title(name)
    ax.grid(True)
    ax.axhline(y=0, color='k', linewidth=0.5)
    ax.axvline(x=0, color='k', linewidth=0.5)
plt.tight_layout()
plt.savefig('activations.png')
plt.show()

📝Example: Softmax vs. Log-Softmax

import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.5, 0.1])

# Standard softmax
probs = F.softmax(logits, dim=0)
print(f"Softmax: {probs}")
print(f"Sum: {probs.sum():.4f}")

# Log-softmax (numerically stable)
log_probs = F.log_softmax(logits, dim=0)
print(f"\nLog-softmax: {log_probs}")
print(f"Sum of exp(log_probs): {log_probs.exp().sum():.4f}")

# Cross-entropy loss = log-softmax followed by negative log-likelihood
target = torch.tensor([0])  # Class 0
ce_loss = F.cross_entropy(logits.unsqueeze(0), target)
manual_ce = -log_probs[target]
print(f"\nCrossEntropy: {ce_loss.item():.4f}")
print(f"Manual CE: {manual_ce.item():.4f}")
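
To show what log-softmax computes internally, here is a minimal plain-Python sketch based on the log-sum-exp trick. It reuses the logits from the example above; `log_softmax` here is an illustrative re-implementation, not PyTorch's code.

```python
import math

def log_softmax(z):
    """log softmax(z_i) = z_i - logsumexp(z), with the max subtracted for stability."""
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))
    return [v - lse for v in z]

logits = [2.0, 1.0, 0.5, 0.1]
log_probs = log_softmax(logits)
ce_class0 = -log_probs[0]                 # cross-entropy when the target is class 0

print("log-probabilities:", [round(lp, 4) for lp in log_probs])
print(f"cross-entropy for class 0: {ce_class0:.4f}")

# Extreme logits: softmax of the second entry underflows to 0, so log(softmax)
# would fail, while the log-sum-exp form still returns a finite value.
print("extreme logits:", log_softmax([1000.0, 0.0]))
```

Working in log space avoids taking the logarithm of a probability that has underflowed to zero, which is why log-softmax followed by NLL loss is preferred over softmax followed by a separate log.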

Summary

📋Summary: Activation Functions

  • ReLU is the default for hidden layers: it is fast, and its unit derivative for positive inputs mitigates vanishing gradients
  • Sigmoid is only for binary classification output layers
  • Tanh is zero-centered but still has vanishing gradients
  • GELU/Swish are smooth alternatives to ReLU: GELU is standard in Transformers, and Swish is used in EfficientNet
  • Softmax converts logits to probabilities for multi-class output
  • Dead neurons occur when ReLU neurons are always inactive — use Leaky ReLU or careful initialization
  • Numerical stability: Use log-softmax + NLL loss instead of softmax + log
  • Choice matters: Activation function choice affects training speed, convergence, and final performance

Practice Exercises

  1. Conceptual: Why does the sigmoid's maximum derivative of 0.25 cause vanishing gradients? What happens to the gradient after passing through 5 sigmoid layers?

  2. Coding: Implement a "dying ReLU" scenario: create a network where most neurons are dead. Show how Leaky ReLU prevents this. Monitor the fraction of zero activations during training.

  3. Research: Read the original Swish paper (Ramachandran et al., 2017). How was Swish discovered? What search space was used?

  4. Experiment: Train the same network on CIFAR-10 with ReLU, GELU, and Swish. Compare convergence speed and final accuracy. Which works best and why?

  5. Mathematical: Derive the gradient of GELU with respect to $x$. Implement a custom GELU autograd function in PyTorch with both forward and backward passes.
