
Neural Networks: Forward/Backward Propagation, Activation Functions — Asked at Google & OpenAI



Premium Interview Preparation — Deep Learning Fundamentals

🎯 The Interview Question

"Walk us through the complete forward and backward propagation process in a multi-layer neural network. Explain how activation functions contribute to non-linearity, and discuss the trade-offs between different activation functions. What would happen if you removed all activation functions?"

This question is a cornerstone of deep learning interviews at top companies. It tests your fundamental understanding of how neural networks learn.


📚 Detailed Answer

Forward Propagation: The Complete Picture

Forward propagation is the process of passing input data through the network to obtain a prediction. Let's break this down mathematically and conceptually.

Given an input vector $\mathbf{x} \in \mathbb{R}^n$, a neural network with $L$ layers computes the output through a series of transformations:

$$\mathbf{z}^{(l)} = \mathbf{W}^{(l)}\mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}$$
$$\mathbf{a}^{(l)} = f^{(l)}(\mathbf{z}^{(l)})$$

where:

  • $\mathbf{W}^{(l)} \in \mathbb{R}^{n_l \times n_{l-1}}$ is the weight matrix for layer $l$, where $n_l$ is the number of units in layer $l$ and $n_0 = n$
  • $\mathbf{b}^{(l)} \in \mathbb{R}^{n_l}$ is the bias vector
  • $f^{(l)}$ is the activation function, usually applied element-wise
  • $\mathbf{a}^{(l)}$ is the activation output (with $\mathbf{a}^{(0)} = \mathbf{x}$)
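The recurrence above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a production implementation: the helper names (`matvec`, `forward`), the 2-2-1 layer sizes, and the weight values are all assumptions chosen so the arithmetic is easy to check by hand.

```python
def matvec(W, a):
    """Multiply matrix W (a list of rows) by vector a."""
    return [sum(w * v for w, v in zip(row, a)) for row in W]

def relu(z):
    return [max(0.0, v) for v in z]

def identity(z):
    return list(z)

def forward(x, layers):
    """Run the forward pass.

    layers is a list of (W, b, f) tuples, one per layer.
    Returns every pre-activation z^(l) and every activation a^(l).
    """
    activations = [list(x)]  # a^(0) = x
    pre_activations = []
    for W, b, f in layers:
        # z^(l) = W^(l) a^(l-1) + b^(l)
        z = [s + bi for s, bi in zip(matvec(W, activations[-1]), b)]
        pre_activations.append(z)
        # a^(l) = f^(l)(z^(l))
        activations.append(f(z))
    return pre_activations, activations

# Hypothetical 2-2-1 network with hand-picked weights.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, -1.0], relu),
    ([[2.0, 1.0]], [0.5], identity),
]
zs, acts = forward([1.0, 2.0], layers)
print(zs[0])     # [-1.0, 0.5]
print(acts[-1])  # [1.0]
```

Keeping both `zs` and `acts` is deliberate: the backward pass needs the stored pre-activations and activations of every layer.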

💡 The beauty of neural networks lies in the Universal Approximation Theorem: a single hidden layer with enough neurons and a suitable non-linear (non-polynomial) activation function can approximate any continuous function on a compact set to arbitrary accuracy.

Why Non-Linearity Matters

Without activation functions, a multi-layer network collapses into a single linear transformation:

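$$\mathbf{y} = \mathbf{W}^{(L)}\mathbf{W}^{(L-1)} \cdots \mathbf{W}^{(1)}\mathbf{x} + \mathbf{b}' = \mathbf{W}'\mathbf{x} + \mathbf{b}'$$

where $\mathbf{W}' = \mathbf{W}^{(L)} \cdots \mathbf{W}^{(1)}$ and $\mathbf{b}'$ collects the bias terms of all layers.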

This means no matter how many layers you stack, the network can only learn linear decision boundaries. Real-world problems (image recognition, language understanding, speech processing) are inherently non-linear.
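A small numeric check makes the collapse concrete. The sketch below stacks two layers with no activation and compares the result with a single layer built from the combined parameters. The weights, the input, and the helper names `matmul` and `affine` are assumptions for illustration.

```python
def matmul(A, B):
    """Matrix product of A (m x k) and B (k x n), both lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

def affine(W, b, x):
    """Compute W x + b."""
    return [sum(w * v for w, v in zip(row, x)) + bi for row, bi in zip(W, b)]

# Two layers without any activation function (hypothetical values).
W1, b1 = [[1.0, 2.0], [0.0, -1.0]], [1.0, 0.0]
W2, b2 = [[3.0, 1.0]], [-2.0]

# Combined parameters: W' = W2 W1 and b' = W2 b1 + b2.
W_prime = matmul(W2, W1)
b_prime = affine(W2, b2, b1)

x = [2.0, -1.0]
two_layer = affine(W2, b2, affine(W1, b1, x))
collapsed = affine(W_prime, b_prime, x)
print(W_prime, b_prime)      # [[3.0, 5.0]] [1.0]
print(two_layer, collapsed)  # [2.0] [2.0]
```

The same identity holds for any input, so the extra layer adds parameters but no expressive power.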

Backward Propagation: The Chain Rule in Action

Backpropagation efficiently computes gradients of the loss function with respect to all parameters using the chain rule. For a loss function $\mathcal{L}$, we need $\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}}$ for each layer.

Starting from the output layer:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{z}^{(L)}} = \frac{\partial \mathcal{L}}{\partial \mathbf{a}^{(L)}} \odot f'^{(L)}(\mathbf{z}^{(L)})$$

Then propagating backward:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}^{(l)}} (\mathbf{a}^{(l-1)})^T$$
$$\frac{\partial \mathcal{L}}{\partial \mathbf{b}^{(l)}} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}^{(l)}}$$
$$\frac{\partial \mathcal{L}}{\partial \mathbf{a}^{(l-1)}} = (\mathbf{W}^{(l)})^T \frac{\partial \mathcal{L}}{\partial \mathbf{z}^{(l)}}$$

Applying the activation derivative then closes the recursion for the next layer down:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{z}^{(l-1)}} = \frac{\partial \mathcal{L}}{\partial \mathbf{a}^{(l-1)}} \odot f'^{(l-1)}(\mathbf{z}^{(l-1)})$$

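The key insight is that we compute gradients layer by layer, reusing intermediate computations. This makes backpropagation roughly $O(n)$ in the number of parameters $n$ (a small constant multiple of the cost of one forward pass), versus $O(n^2)$ for numerical differentiation by finite differences, which needs at least one extra forward pass per parameter.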
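The backward recursion can be sketched and sanity-checked against a finite-difference estimate. The example assumes tanh hidden layers, a linear output layer, and a squared-error loss $\mathcal{L} = \frac{1}{2}\lVert \mathbf{a}^{(L)} - \mathbf{y} \rVert^2$. None of these choices come from the text above, and the weights and function names are hypothetical.

```python
import copy
import math

def forward(x, params):
    """tanh hidden layers, linear output. Returns (pre-activations, activations)."""
    acts, zs = [list(x)], []
    for i, (W, b) in enumerate(params):
        z = [sum(w * v for w, v in zip(row, acts[-1])) + bi
             for row, bi in zip(W, b)]
        zs.append(z)
        is_last = i == len(params) - 1
        acts.append(list(z) if is_last else [math.tanh(v) for v in z])
    return zs, acts

def loss(x, y, params):
    _, acts = forward(x, params)
    return 0.5 * sum((a - t) ** 2 for a, t in zip(acts[-1], y))

def backward(x, y, params):
    """Return [(dW, db), ...] per layer, following the recursion above."""
    zs, acts = forward(x, params)
    # Output layer: dL/da = a - y, and f' = 1 for the linear output.
    delta = [a - t for a, t in zip(acts[-1], y)]
    grads = [None] * len(params)
    for l in reversed(range(len(params))):
        W, _ = params[l]
        # dL/dW = delta (a^(l-1))^T and dL/db = delta
        dW = [[d * a for a in acts[l]] for d in delta]
        grads[l] = (dW, list(delta))
        if l > 0:
            # dL/da^(l-1) = W^T delta, then multiply by tanh'(z^(l-1))
            da = [sum(W[i][j] * delta[i] for i in range(len(delta)))
                  for j in range(len(W[0]))]
            delta = [g * (1.0 - math.tanh(z) ** 2)
                     for g, z in zip(da, zs[l - 1])]
    return grads

# Hypothetical 2-2-1 network and a single training example.
params = [([[0.5, -0.3], [0.8, 0.2]], [0.1, -0.1]),
          ([[1.0, -1.5]], [0.2])]
x, y = [1.0, 2.0], [0.5]

analytic = backward(x, y, params)[0][0][0][1]  # dL/dW for layer 1, entry (0, 1)

# Central finite difference on the same weight.
eps = 1e-6
plus, minus = copy.deepcopy(params), copy.deepcopy(params)
plus[0][0][0][1] += eps
minus[0][0][0][1] -= eps
numeric = (loss(x, y, plus) - loss(x, y, minus)) / (2 * eps)

print(analytic, numeric)  # the two values should agree to several decimals
```

The check costs two forward passes for one weight, which is why finite differences scale poorly: repeating it for every parameter multiplies the cost by the parameter count, while `backward` returns all gradients in one sweep.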

Activation Functions: Deep Dive

Sigmoid Function

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
  • Range: $(0, 1)$
  • Derivative: $\sigma(z)(1 - \sigma(z))$
  • Use case: Binary classification output layer
  • Problems: Vanishing gradients (the derivative peaks at 0.25 at $z = 0$ and decays toward 0 as $|z|$ grows), not zero-centered, requires an exponential per evaluation
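The 0.25 ceiling on the derivative is easy to verify numerically. The sketch below is illustrative only: the split into two branches is one common way to avoid overflow in `exp`, and the 10-layer depth is an arbitrary assumption used to show how the bound compounds.

```python
import math

def sigmoid(z):
    # Branch on the sign so exp() never receives a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

print(sigmoid_grad(0.0))   # 0.25, the maximum of the derivative
print(sigmoid_grad(10.0))  # close to zero: the unit is saturated

# Upper bound on the product of activation derivatives across 10 sigmoid layers.
depth = 10
bound = 0.25 ** depth
print(bound)               # 9.5367431640625e-07
```

Weights also enter the product, so this is only the activation's contribution, but it shows why deep sigmoid stacks tend to pass very small gradients to early layers.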

Tanh Function

$$\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$$
  • Range: $(-1, 1)$
  • Derivative: $1 - \tanh^2(z)$
  • Advantage: Zero-centered, stronger gradients than sigmoid near zero (the derivative peaks at 1 versus 0.25)
  • Still suffers: Vanishing gradient problem for large $|z|$
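A quick comparison of the two derivatives illustrates both bullets. This is a sketch with arbitrarily chosen sample points, and the function names are assumptions.

```python
import math

def tanh_grad(z):
    return 1.0 - math.tanh(z) ** 2

def sigmoid_grad(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

for z in (0.0, 1.0, 3.0, 5.0):
    print(z, tanh_grad(z), sigmoid_grad(z))
```

At `z = 0.0` the tanh derivative is 1.0 against 0.25 for sigmoid, but it also decays faster: by `z = 3.0` it has already dropped below the sigmoid derivative, which is the saturation the last bullet refers to.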

ReLU (Rectified Linear Unit)

$$\text{ReLU}(z) = \max(0, z)$$
  • Range: $[0, \infty)$
  • Derivative: 0 if $z < 0$, 1 if $z > 0$, undefined at 0
  • Advantages: Computationally efficient, mitigates vanishing gradients, sparse activation
  • Problem: "Dying ReLU", where a neuron whose pre-activation is negative for every training input outputs 0, receives zero gradient, and can stay stuck there
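The dying-ReLU failure mode can be reproduced with a single neuron. In this sketch the weights, the large negative bias, and the inputs are all hypothetical, and taking the derivative at exactly 0 to be 0 is a convention assumed here, not something the definition fixes.

```python
def relu(z):
    return max(0.0, z)

def relu_grad(z):
    # Assumed convention: use 0 at z == 0, where the derivative is undefined.
    return 1.0 if z > 0 else 0.0

# A hypothetical neuron whose bias has been pushed far into negative territory.
w, b = [0.5, -0.2], -10.0
inputs = [[1.0, 2.0], [3.0, -1.0], [-2.0, 4.0]]

outputs, grads = [], []
for x in inputs:
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    outputs.append(relu(z))
    grads.append(relu_grad(z))

print(outputs)  # [0.0, 0.0, 0.0]
print(grads)    # [0.0, 0.0, 0.0]
```

Because the local gradient is zero for every input, no gradient-based update can change `w` or `b` through this neuron, so it stays inactive unless the inputs themselves shift.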

Leaky ReLU

$$\text{LeakyReLU}(z) = \begin{cases} z & \text{if } z > 0 \\ \alpha z & \text{if } z \leq 0 \end{cases}$$

where $\alpha$ is typically 0.01. This prevents dying ReLU by allowing a small gradient for negative inputs.
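As a minimal sketch, the piecewise definition and its derivative translate directly into code. The default `alpha=0.01` follows the typical value mentioned above, and the function names are assumptions.

```python
def leaky_relu(z, alpha=0.01):
    return z if z > 0 else alpha * z

def leaky_relu_grad(z, alpha=0.01):
    # The negative side keeps a small, constant slope instead of 0.
    return 1.0 if z > 0 else alpha

for z in (-5.0, -0.5, 0.5, 5.0):
    print(z, leaky_relu(z), leaky_relu_grad(z))
```

Unlike plain ReLU, the gradient on the negative side is `alpha` rather than 0, so a neuron that is inactive for every input can still receive updates.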

GELU (Gaussian Error Linear Unit)

$$\text{GELU}(z) = z \cdot \Phi(z) = z \cdot \frac{1}{2}\left[1 + \text{erf}\left(\frac{z}{\sqrt{2}}\right)\right]$$

Used in Transformers (BERT, GPT), GELU provides a smooth approximation to ReLU that has been shown to improve training stability.
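The exact form can be evaluated with the standard library's `math.erf`. The sketch below compares it with ReLU at a few arbitrarily chosen points. Deep learning frameworks ship their own implementations, and nothing here is meant to mirror any particular one.

```python
import math

def gelu(z):
    # z * Phi(z), where Phi is the standard normal CDF.
    return z * 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def relu(z):
    return max(0.0, z)

for z in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(z, round(gelu(z), 4), relu(z))
```

For large positive inputs the two functions nearly coincide, while for moderately negative inputs GELU returns a small negative value instead of a hard zero.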

Swish / SiLU

$$\text{Swish}(z) = z \cdot \sigma(z)$$

A self-gated activation function popularized by Google Brain (the same form had been proposed earlier under the name SiLU), used in EfficientNet and modern architectures.
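Differentiating $z \cdot \sigma(z)$ with the product rule gives $\sigma(z) + z\,\sigma(z)(1 - \sigma(z))$. The sketch below implements both and checks the derivative against a central finite difference. The test point and step size are arbitrary assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def swish(z):
    return z * sigmoid(z)

def swish_grad(z):
    s = sigmoid(z)
    # Product rule: d/dz [z * s] = s + z * s * (1 - s)
    return s + z * s * (1.0 - s)

z0, eps = -1.0, 1e-6
numeric = (swish(z0 + eps) - swish(z0 - eps)) / (2 * eps)
print(swish(z0), swish_grad(z0), numeric)
```

Note that `swish(-1.0)` is negative: like GELU, Swish lets small negative values through rather than clipping them to zero.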

Real-World Selection Guide

| Scenario | Recommended Activation |
| --- | --- |
| Hidden layers (general) | ReLU or GELU |
| Output (binary classification) | Sigmoid |
| Output (multi-class) | Softmax |
| RNNs/LSTMs | Tanh, Sigmoid |
| Transformers | GELU, Swish |
| Very deep networks | Swish, Mish |

Follow-Up Questions

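Q: What happens if all weights are initialized to zero? A: Symmetry problem: every neuron in a layer computes the same output and receives the same gradient, so they all learn the same features. Break the symmetry with random initialization such as He (common with ReLU) or Xavier (common with tanh and sigmoid).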
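The symmetry problem can be observed directly by training a tiny network from all-zero parameters. The sketch assumes a 2-2-1 network with a sigmoid hidden layer, a linear output, squared-error loss, and plain gradient descent on a single made-up example. All of these choices are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_step(W1, b1, W2, b2, x, y, lr=0.1):
    """One gradient step for a sigmoid-hidden, linear-output network."""
    z1 = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(W1, b1)]
    a1 = [sigmoid(z) for z in z1]
    out = sum(w * a for w, a in zip(W2, a1)) + b2
    delta2 = out - y  # dL/dz at the output for squared error
    # Hidden-layer error: W2^T delta2, times sigmoid'(z1)
    delta1 = [W2[j] * delta2 * a1[j] * (1.0 - a1[j]) for j in range(len(a1))]
    new_W2 = [w - lr * delta2 * a for w, a in zip(W2, a1)]
    new_b2 = b2 - lr * delta2
    new_W1 = [[w - lr * d * v for w, v in zip(row, x)]
              for row, d in zip(W1, delta1)]
    new_b1 = [b - lr * d for b, d in zip(b1, delta1)]
    return new_W1, new_b1, new_W2, new_b2

# Everything starts at zero.
W1, b1 = [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0]
W2, b2 = [0.0, 0.0], 0.0

for _ in range(50):
    W1, b1, W2, b2 = train_step(W1, b1, W2, b2, x=[1.0, -2.0], y=1.0)

print(W1[0], W1[1])  # the two hidden neurons have identical weights
print(W2)            # and identical outgoing weights
```

The weights do move away from zero, but both hidden neurons receive exactly the same update at every step, so the layer behaves like a single neuron no matter how wide it is.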

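Q: How does batch normalization interact with activation functions? A: BN normalizes the inputs to activations (typically the pre-activations $\mathbf{z}^{(l)}$) to zero mean and unit variance over the mini-batch, then applies a learned scale and shift. This keeps saturating activations such as sigmoid and tanh away from their flat regions, reduces internal covariate shift, and allows higher learning rates.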
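The normalization step itself is short. The sketch below handles one feature across a mini-batch in training mode. The scale `gamma`, shift `beta`, and `eps` values are assumed defaults, and the running statistics used at inference time are omitted.

```python
import math

def batch_norm(zs, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a mini-batch, then scale and shift."""
    mean = sum(zs) / len(zs)
    var = sum((z - mean) ** 2 for z in zs) / len(zs)  # biased batch variance
    return [gamma * (z - mean) / math.sqrt(var + eps) + beta for z in zs]

# Hypothetical pre-activations for one unit over a batch of four examples.
batch = [2.0, 4.0, 6.0, 8.0]
normalized = batch_norm(batch)

norm_mean = sum(normalized) / len(normalized)
norm_var = sum((v - norm_mean) ** 2 for v in normalized) / len(normalized)
print(normalized)
print(norm_mean, norm_var)  # approximately 0 and 1
```

Feeding `normalized` rather than `batch` into a sigmoid keeps the inputs near zero, where the derivative is largest.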

Q: Why is GELU preferred over ReLU in Transformers? A: GELU is smooth (differentiable everywhere), has non-zero gradients for negative values, and empirically improves training on large-scale language models.
