DL Foundations

Activation Functions — Adding Non-Linearity to Neural Networks

Without activation functions, a deep network collapses into a linear model no matter how many layers are stacked, because a composition of linear maps is itself linear. Activation functions introduce the non-linearity that lets a network learn non-linear decision boundaries.

ReLU is the Default: Fast to compute, mitigates vanishing gradients, and produces sparse activations in hidden layers
GELU for Transformers: Smooth, ReLU-like activation used in BERT, GPT, and Vision Transformers
Dead Neuron Problem: ReLU neurons that get stuck outputting zero stop learning; careful initialization and architecture choices reduce the risk
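
To make the ReLU and GELU points concrete, here is a minimal sketch of both functions using only Python's standard library. It assumes the exact GELU definition, x times the standard normal CDF; frameworks often use a tanh-based approximation instead. The function names are illustrative, not taken from any particular library.

```python
import math


def relu(x: float) -> float:
    """ReLU: pass positive inputs through, clamp negative inputs to zero."""
    return max(0.0, x)


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


# Compare the two on a few sample inputs.
for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"x={x:+.1f}  relu={relu(x):.4f}  gelu={gelu(x):.4f}")
```

For large positive inputs the two nearly coincide. The difference is around zero, where GELU bends smoothly and lets small negative values through instead of cutting off sharply.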

Activation Functions — Sigmoid, ReLU, GELU and The Dead Neuron Problem

Activation functions introduce non-linearity into neural networks, enabling them to learn complex patterns. Without them, a deep network would collapse into a linear model.
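
The collapse can be checked numerically. The sketch below uses two small, made-up weight matrices (biases omitted for brevity; including them gives an affine map and the same conclusion) and shows that two stacked linear layers give the same output as a single layer whose weight is their product, while inserting ReLU between them breaks that equivalence. All names and values are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def relu(m):
    """Apply ReLU element-wise to a matrix."""
    return [[max(0.0, v) for v in row] for row in m]


W1 = [[1.0, -2.0], [0.5, 1.0]]   # first layer weights (2x2), illustrative
W2 = [[2.0, 1.0], [-1.0, 3.0]]   # second layer weights (2x2), illustrative
x = [[1.0], [1.0]]               # input as a column vector

# No activation: applying W1 then W2 equals applying the product W2 @ W1 once.
stacked = matmul(W2, matmul(W1, x))
collapsed = matmul(matmul(W2, W1), x)

# ReLU between the layers: the single-matrix shortcut no longer reproduces it.
with_relu = matmul(W2, relu(matmul(W1, x)))

print(stacked, collapsed, with_relu)
```

Here `stacked` and `collapsed` agree, so the second layer adds no expressive power on its own. `with_relu` differs because the negative hidden value is zeroed before the second layer sees it.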

Why Activation Functions?

Sigmoid

Tanh

Activation Function Comparison

ReLU and Variants

GELU and Swish

Softmax

When to Use Each Activation

The Dead Neuron Problem

Summary

Activation functions introduce non-linearity, enabling deep networks to learn complex patterns
ReLU is the default for hidden layers: fast and effective at mitigating vanishing gradients, but it can produce dead neurons
GELU/Swish are preferred for transformers and deep networks: smooth, self-gated
Softmax for multi-class output, sigmoid for binary output, and no activation (identity) for regression
Dead neurons are a practical concern with ReLU; use Leaky ReLU or careful initialization to mitigate
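
The dead-neuron point can be illustrated with a single hypothetical neuron whose bias has drifted far negative. In this sketch, which is not a full training loop, the ReLU gradient with respect to the weight is zero for every input, so gradient descent cannot move the neuron, whereas Leaky ReLU keeps a small non-zero gradient. The slope `alpha = 0.01` and all other values are assumptions chosen for illustration.

```python
def relu_grad(z: float) -> float:
    """Derivative of ReLU with respect to its input (taken as 0 for z <= 0)."""
    return 1.0 if z > 0 else 0.0


def leaky_relu_grad(z: float, alpha: float = 0.01) -> float:
    """Derivative of Leaky ReLU; alpha is the slope used for negative inputs."""
    return 1.0 if z > 0 else alpha


# Hypothetical neuron: the bias is so negative that the pre-activation
# z = w * x + b is below zero for every input in the batch.
w, b = 0.5, -10.0
inputs = [0.0, 1.0, 2.0, 3.0]
pre_activations = [w * x + b for x in inputs]

# Gradient of the activation output with respect to w is act'(z) * x.
relu_grads = [relu_grad(z) * x for z, x in zip(pre_activations, inputs)]
leaky_grads = [leaky_relu_grad(z) * x for z, x in zip(pre_activations, inputs)]

print("ReLU gradients:      ", relu_grads)
print("Leaky ReLU gradients:", leaky_grads)
```

With ReLU every gradient is zero, so the weight never updates and the neuron stays dead. With Leaky ReLU the gradients are small but non-zero for non-zero inputs, which gives the neuron a chance to recover.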

