Neural Networks: Forward/Backward Propagation, Activation Functions — Asked at Google & OpenAI

🎯 The Interview Question

"Walk us through the complete forward and backward propagation process in a multi-layer neural network. Explain how activation functions contribute to non-linearity, and discuss the trade-offs between different activation functions. What would happen if you removed all activation functions?"

This question is a cornerstone of deep learning interviews at top companies. It tests your fundamental understanding of how neural networks learn.

📚 Detailed Answer

Forward Propagation: The Complete Picture

Forward propagation is the process of passing input data through the network to obtain a prediction. Let's break this down mathematically and conceptually.

Given an input vector $\mathbf{x} \in \mathbb{R}^n$, a neural network with $L$ layers computes its output through a series of transformations, one per layer $l = 1, \dots, L$:

\mathbf{z}^{(l)} = \mathbf{W}^{(l)}\mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}

\mathbf{a}^{(l)} = f^{(l)}(\mathbf{z}^{(l)})

where:

$\mathbf{W}^{(l)} \in \mathbb{R}^{n_l \times n_{l-1}}$ is the weight matrix for layer $l$, where $n_l$ is the number of units in layer $l$ and $n_0 = n$
$\mathbf{b}^{(l)} \in \mathbb{R}^{n_l}$ is the bias vector
$f^{(l)}$ is the activation function, applied element-wise
$\mathbf{a}^{(l)}$ is the activation output (with $\mathbf{a}^{(0)} = \mathbf{x}$)
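
The two equations above can be sketched in plain Python. This is a minimal illustration, not a production implementation: the 2-3-1 layer sizes, the hand-picked weights, and the helper names `matvec` and `forward` are all assumptions made for the example.

```python
import math

def matvec(W, a):
    """Multiply matrix W (a list of rows) by vector a."""
    return [sum(w_ij * a_j for w_ij, a_j in zip(row, a)) for row in W]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, layers):
    """Run the forward pass; layers is a list of (W, b, f) tuples.

    Returns the pre-activations z and activations a of every layer,
    because backpropagation needs both later.
    """
    a = x
    zs, activations = [], [x]              # activations[0] is a^(0) = x
    for W, b, f in layers:
        z = [wa + b_i for wa, b_i in zip(matvec(W, a), b)]  # z = W a + b
        a = [f(z_i) for z_i in z]                            # a = f(z)
        zs.append(z)
        activations.append(a)
    return zs, activations

# Hypothetical 2-3-1 network with hand-picked weights (illustration only).
layers = [
    ([[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]], [0.0, 0.1, -0.1], sigmoid),
    ([[0.3, -0.1, 0.2]], [0.05], lambda z: z),   # linear output layer
]
zs, activations = forward([1.0, 2.0], layers)
```

Storing every `z` and `a` costs memory, but it is what lets the backward pass reuse them instead of recomputing the forward pass.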

💡

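The beauty of neural networks lies in the Universal Approximation Theorem: a single hidden layer with sufficient neurons and a suitable non-linear activation function can approximate any continuous function on a compact set to arbitrary accuracy. The theorem guarantees that such weights exist, not that training will find them.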

Why Non-Linearity Matters

Without activation functions, a multi-layer network collapses into a single linear transformation:

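\mathbf{a}^{(L)} = \mathbf{W}^{(L)} \mathbf{W}^{(L-1)} \cdots \mathbf{W}^{(1)} \mathbf{x} + \mathbf{b}' = \mathbf{W}'\mathbf{x} + \mathbf{b}'

Here $\mathbf{W}'$ is the product of all the weight matrices and $\mathbf{b}'$ collects every bias after it has been multiplied through the layers above it. No matter how many layers you stack, the network can only represent an affine function of its input, and therefore only linear decision boundaries. Real-world problems (image recognition, language understanding, speech processing) are inherently non-linear.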
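
The collapse can be checked numerically. The sketch below stacks two layers with no activation and compares the result against a single layer built from the merged weights; the matrices and the names `W_prime` and `b_prime` are made up for the illustration.

```python
def matvec(W, a):
    """Multiply matrix W (a list of rows) by vector a."""
    return [sum(w * v for w, v in zip(row, a)) for row in W]

def matmul(A, B):
    """Multiply matrix A by matrix B (both lists of rows)."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

# Two layers with identity "activation" (hypothetical values).
W1, b1 = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]], [0.1, 0.2, 0.3]
W2, b2 = [[1.0, -1.0, 0.5]], [0.4]
x = [2.0, -3.0]

# Stacked: y = W2 (W1 x + b1) + b2
y_stacked = add(matvec(W2, add(matvec(W1, x), b1)), b2)

# Collapsed: W' = W2 W1 and b' = W2 b1 + b2
W_prime = matmul(W2, W1)
b_prime = add(matvec(W2, b1), b2)
y_single = add(matvec(W_prime, x), b_prime)
```

Both `y_stacked` and `y_single` hold the same value up to floating-point rounding, so the extra layer adds parameters without adding expressive power.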

Backward Propagation: The Chain Rule in Action

Backpropagation efficiently computes the gradients of the loss function with respect to all parameters using the chain rule. For a loss function $\mathcal{L}$, we need $\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}}$ and $\frac{\partial \mathcal{L}}{\partial \mathbf{b}^{(l)}}$ for each layer.

Starting from the output layer:

\frac{\partial \mathcal{L}}{\partial \mathbf{z}^{(L)}} = \frac{\partial \mathcal{L}}{\partial \mathbf{a}^{(L)}} \odot f'^{(L)}(\mathbf{z}^{(L)})

Then, propagating backward through each layer $l = L, L-1, \dots, 1$:

\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}^{(l)}} (\mathbf{a}^{(l-1)})^T

\frac{\partial \mathcal{L}}{\partial \mathbf{b}^{(l)}} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}^{(l)}}

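\frac{\partial \mathcal{L}}{\partial \mathbf{a}^{(l-1)}} = (\mathbf{W}^{(l)})^T \frac{\partial \mathcal{L}}{\partial \mathbf{z}^{(l)}}

Applying the activation derivative at layer $l-1$, $\frac{\partial \mathcal{L}}{\partial \mathbf{z}^{(l-1)}} = \frac{\partial \mathcal{L}}{\partial \mathbf{a}^{(l-1)}} \odot f'^{(l-1)}(\mathbf{z}^{(l-1)})$, closes the recursion, so the same three formulas apply at the next layer down.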
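
A common way to convince yourself (and an interviewer) that these formulas are right is a gradient check. The sketch below applies them to a tiny network and compares one analytic gradient with a central finite difference. Everything specific here is an assumption for the example: two inputs, two tanh hidden units, one linear output, a squared-error loss $\frac{1}{2}(\hat{y} - y)^2$, and the weight values.

```python
import math

def forward(x, W1, b1, W2, b2):
    z1 = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
    a1 = [math.tanh(z) for z in z1]
    z2 = sum(w * a for w, a in zip(W2, a1)) + b2    # single linear output unit
    return z1, a1, z2

def loss(x, y, W1, b1, W2, b2):
    return 0.5 * (forward(x, W1, b1, W2, b2)[2] - y) ** 2

def backward(x, y, W1, b1, W2, b2):
    z1, a1, z2 = forward(x, W1, b1, W2, b2)
    dz2 = z2 - y                                    # dL/dz2 (identity output, f' = 1)
    dW2 = [dz2 * a for a in a1]                     # dL/dW2 = dz2 * a1^T
    db2 = dz2                                       # dL/db2 = dz2
    da1 = [w * dz2 for w in W2]                     # dL/da1 = W2^T dz2
    dz1 = [da * (1.0 - a * a) for da, a in zip(da1, a1)]  # multiply by tanh'(z1)
    dW1 = [[d * xi for xi in x] for d in dz1]       # dL/dW1 = dz1 * x^T
    db1 = dz1
    return dW1, db1, dW2, db2

# Hypothetical parameters and one training example.
W1, b1 = [[0.2, -0.4], [0.7, 0.1]], [0.0, 0.1]
W2, b2 = [0.5, -0.3], 0.2
x, y = [1.0, -2.0], 0.5

dW1, db1, dW2, db2 = backward(x, y, W1, b1, W2, b2)

# Central finite difference for a single weight, W1[0][1].
eps = 1e-6
W1_plus = [row[:] for row in W1]
W1_minus = [row[:] for row in W1]
W1_plus[0][1] += eps
W1_minus[0][1] -= eps
numeric = (loss(x, y, W1_plus, b1, W2, b2)
           - loss(x, y, W1_minus, b1, W2, b2)) / (2 * eps)
```

`numeric` and `dW1[0][1]` should agree to several decimal places. Note that the finite-difference estimate needs two extra forward passes for every parameter it checks, while the backward pass produces all gradients at once.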

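The key insight is that gradients are computed layer by layer, reusing each $\frac{\partial \mathcal{L}}{\partial \mathbf{z}^{(l)}}$ instead of recomputing it. One backward pass therefore costs about as much as a forward pass, $O(P)$ for a network with $P$ parameters, whereas numerical differentiation needs at least one extra forward pass per parameter, for $O(P^2)$ in total.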

Activation Functions: Deep Dive

Sigmoid Function

\sigma(z) = \frac{1}{1 + e^{-z}}

Range: $(0, 1)$
Derivative: $\sigma(z)(1 - \sigma(z))$
Use case: Binary classification output layer
Problems: Vanishing gradients (the derivative peaks at 0.25, at $z = 0$), outputs are not zero-centered, and the exponential is costlier to evaluate than ReLU's simple comparison
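
The 0.25 ceiling on the derivative is easy to verify, and it shows why stacking sigmoids shrinks gradients. The sketch below is illustrative: the sign-based branching is one common way to avoid overflow in `math.exp`, and the 10-layer figure is a best-case bound that ignores the weight matrices.

```python
import math

def sigmoid(z):
    # Branch on the sign so math.exp never receives a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

peak = sigmoid_grad(0.0)            # the derivative is largest at z = 0
best_case_10_layers = peak ** 10    # activation-derivative factor after 10 layers
```

Even in this best case, the activation derivatives alone scale the gradient by less than one millionth after ten layers.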

Tanh Function

\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}

Range: $(-1, 1)$
Derivative: $1 - \tanh^2(z)$
Advantage: Zero-centered, stronger gradients than sigmoid
Still suffers: Vanishing gradient problem for large $|z|$
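
A short sketch makes both points concrete: the derivative reaches 1 at $z = 0$ (four times sigmoid's peak), yet it still collapses for large $|z|$. It also checks the identity $\tanh(z) = 2\sigma(2z) - 1$, which shows tanh is a rescaled, shifted sigmoid. The sample points are arbitrary choices for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def tanh_grad(z):
    t = math.tanh(z)
    return 1.0 - t * t

grad_at_zero = tanh_grad(0.0)    # peak derivative
grad_at_five = tanh_grad(5.0)    # saturated region, close to zero

# tanh(z) = 2 * sigmoid(2z) - 1 at a few sample points
identity_gaps = [abs(math.tanh(z) - (2.0 * sigmoid(2.0 * z) - 1.0))
                 for z in (-3.0, -1.0, 0.0, 0.5, 2.0)]
```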

ReLU (Rectified Linear Unit)

\text{ReLU}(z) = \max(0, z)

Range: $[0, \infty)$
Derivative: 0 if $z < 0$, 1 if $z > 0$, undefined at $z = 0$ (in practice a fixed value such as 0 is used there)
Advantages: Computationally efficient, mitigates vanishing gradients, sparse activation
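Problem: "Dying ReLU": a neuron whose pre-activation is negative for every input outputs 0, receives zero gradient, and can stay stuck that way for the rest of training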
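
The dying-ReLU failure mode can be reproduced with a single neuron. In this sketch the weights, the large negative bias, and the three inputs are invented so that every pre-activation is negative; the upstream gradient is taken to be 1 for simplicity.

```python
def relu(z):
    return max(0.0, z)

def relu_grad(z):
    # The derivative is undefined at z = 0; returning 0 there is a common convention.
    return 1.0 if z > 0 else 0.0

# Hypothetical neuron pushed far into the negative region by its bias.
w, b = [0.5, -0.3], -10.0
inputs = [[0.1, 0.9], [0.8, 0.2], [1.0, 1.0]]

pre_activations = [sum(wi * xi for wi, xi in zip(w, x)) + b for x in inputs]
outputs = [relu(z) for z in pre_activations]

# Gradient of the summed outputs with respect to each weight (upstream gradient = 1).
grad_w = [sum(relu_grad(z) * x[j] for z, x in zip(pre_activations, inputs))
          for j in range(len(w))]
```

Every output is 0 and every entry of `grad_w` is 0, so gradient descent has no signal with which to move the neuron back into the active region.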

Leaky ReLU

\text{LeakyReLU}(z) = \begin{cases} z & \text{if } z > 0 \\ \alpha z & \text{if } z \leq 0 \end{cases}

where $\alpha$ is a small constant, typically 0.01. This mitigates dying ReLU by keeping a small, non-zero gradient for negative inputs.
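
As a minimal sketch, with `alpha` defaulting to the typical 0.01 mentioned above:

```python
def leaky_relu(z, alpha=0.01):
    return z if z > 0 else alpha * z

def leaky_relu_grad(z, alpha=0.01):
    # Negative inputs keep a gradient of alpha instead of 0.
    return 1.0 if z > 0 else alpha

negative_output = leaky_relu(-2.0)        # small negative value, not 0
negative_grad = leaky_relu_grad(-2.0)     # alpha, so the neuron can still learn
```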

GELU (Gaussian Error Linear Unit)

\text{GELU}(z) = z \cdot \Phi(z) = z \cdot \frac{1}{2}\left[1 + \text{erf}\left(\frac{z}{\sqrt{2}}\right)\right]

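Used in Transformers (BERT, GPT), GELU is a smooth alternative to ReLU: it weights the input by $\Phi(z)$, the probability that a standard normal variable falls below $z$, so it approaches $z$ for large positive inputs and 0 for large negative inputs. It has been reported to train well empirically in these models.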
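
The formula maps directly onto `math.erf` from the standard library. The sketch below also includes the widely used tanh-based approximation, $0.5z\left(1 + \tanh\left(\sqrt{2/\pi}\,(z + 0.044715 z^3)\right)\right)$, for comparison; the function names are chosen for this example.

```python
import math

def gelu(z):
    """Exact GELU: z times the standard normal CDF at z."""
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

def gelu_tanh(z):
    """Tanh-based approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (z + 0.044715 * z ** 3)
    return 0.5 * z * (1.0 + math.tanh(inner))

sample_points = [-3.0, -1.0, -0.5, 0.0, 0.5, 1.0, 3.0]
max_gap = max(abs(gelu(z) - gelu_tanh(z)) for z in sample_points)
```

Unlike ReLU, `gelu` returns a small negative value for moderately negative inputs, for example at $z = -1$.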

Swish / SiLU

\text{Swish}(z) = z \cdot \sigma(z)

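A self-gated activation function proposed by Google Brain researchers, who found it through an automated search over candidate activation functions. The same function is also known as SiLU (Sigmoid Linear Unit). It is used in EfficientNet and other modern architectures.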
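
A minimal sketch of the function and its derivative, $\sigma(z) + z\,\sigma(z)(1 - \sigma(z))$, which follows from the product rule. The function names are assumptions for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def swish(z):
    return z * sigmoid(z)

def swish_grad(z):
    s = sigmoid(z)
    return s + z * s * (1.0 - s)    # product rule on z * sigmoid(z)

dip = swish(-1.0)                   # slightly negative: Swish is not monotonic
slope_at_zero = swish_grad(0.0)
```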

Real-World Selection Guide

Scenario	Recommended Activation
Hidden layers (general)	ReLU or GELU
Output (binary classification)	Sigmoid
Output (multi-class)	Softmax
RNNs/LSTMs	Tanh, Sigmoid
Transformers	GELU, Swish
Very deep networks	Swish, Mish
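
Softmax appears in the table but is not defined above. For a vector of logits $\mathbf{z}$ it computes $\text{softmax}(\mathbf{z})_i = e^{z_i} / \sum_j e^{z_j}$. The sketch below subtracts the maximum logit first, a standard trick that leaves the result unchanged but avoids overflow; the sample logits are arbitrary.

```python
import math

def softmax(logits):
    m = max(logits)                              # shift for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
large = softmax([1000.0, 1000.0])                # would overflow without the shift
```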

Follow-Up Questions

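Q: What happens if all weights are initialized to zero? A: Symmetry is never broken: every neuron in a layer computes the same output and receives the same gradient, so they all learn the same feature. Initialize weights randomly instead, for example with He initialization (suited to ReLU-family activations) or Xavier initialization (suited to tanh and sigmoid).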
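
Both schemes draw weights from a zero-mean distribution whose variance depends on the layer size. The sketch below uses the normal-distribution variants, with standard deviation $\sqrt{2 / n_{\text{in}}}$ for He and $\sqrt{2 / (n_{\text{in}} + n_{\text{out}})}$ for Xavier. The function names, the layer sizes, and the fixed seed are assumptions for the example.

```python
import math
import random

def he_normal(fan_in, fan_out, rng):
    """He initialization: std = sqrt(2 / fan_in)."""
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]

def xavier_normal(fan_in, fan_out, rng):
    """Xavier (Glorot) initialization: std = sqrt(2 / (fan_in + fan_out))."""
    std = math.sqrt(2.0 / (fan_in + fan_out))
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]

rng = random.Random(0)                      # fixed seed so the example is repeatable
W_he = he_normal(256, 128, rng)             # hypothetical 256 -> 128 layer
W_xavier = xavier_normal(256, 128, rng)
```

Because every weight is drawn independently, no two neurons start out identical, which is what breaks the symmetry.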

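Q: How does batch normalization interact with activation functions? A: BN normalizes the pre-activations feeding each activation function, which helps keep them out of the saturated regions of sigmoid or tanh. It was originally motivated as a way to reduce internal covariate shift, and in practice it allows higher learning rates.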
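
A minimal training-mode sketch of that ordering, normalizing first and applying the activation afterward. It omits the running statistics used at inference time; the batch values, the `eps` default, and the choice of ReLU are assumptions for the example.

```python
import math

def batch_norm(batch, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then scale by gamma and shift by beta."""
    n = len(batch)
    num_features = len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(num_features)]
    variances = [sum((row[j] - means[j]) ** 2 for row in batch) / n
                 for j in range(num_features)]
    return [[gamma[j] * (row[j] - means[j]) / math.sqrt(variances[j] + eps) + beta[j]
             for j in range(num_features)] for row in batch]

def relu(z):
    return max(0.0, z)

# Hypothetical batch of 3 examples with 2 pre-activation features each.
pre_activations = [[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]]
normalized = batch_norm(pre_activations, gamma=[1.0, 1.0], beta=[0.0, 0.0])
activated = [[relu(v) for v in row] for row in normalized]
```

With `gamma = 1` and `beta = 0`, each feature of `normalized` has mean approximately 0 and variance approximately 1, regardless of the scale of the raw pre-activations.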

Q: Why is GELU preferred over ReLU in Transformers? A: GELU is smooth (differentiable everywhere), has non-zero gradients for negative values, and empirically improves training on large-scale language models.