LSTM Networks — Long Short-Term Memory for Sequential Data

LSTM networks solve the vanishing gradient problem of vanilla RNNs by using gating mechanisms to control information flow across time steps. The cell state acts as an information highway.

Three Gates Control Flow: Forget, input, and output gates decide what to discard, what to store, and what to output
Cell State Limits Vanishing: The additive cell state update lets gradients survive across hundreds of time steps when the forget gate stays close to 1
Bidirectional Uses Context: Processing the sequence in both directions captures past and future context

"LSTMs gave RNNs a long-term memory that doesn't fade."

LSTM Networks — Gates, Cell State and Bidirectional Architectures

See our RNN Deep Dive tutorial for the fundamentals of vanilla RNNs and why LSTMs were invented.

Why LSTMs?

LSTM Cell Architecture

The Three Gates

Forget Gate

Input Gate

Cell State Update

Output Gate

Complete LSTM Equations
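
For reference, the standard LSTM formulation is usually written as below. The symbol names are assumptions chosen for this sketch: $x_t$ is the input at time step $t$, $h_{t-1}$ and $c_{t-1}$ are the previous hidden and cell states, $[h_{t-1}, x_t]$ is their concatenation, $W_*$ and $b_*$ are the weights and bias of each block, $\sigma$ is the logistic sigmoid, and $\odot$ is element-wise multiplication.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f [h_{t-1}, x_t] + b_f\right) && \text{(forget gate)} \\
i_t &= \sigma\left(W_i [h_{t-1}, x_t] + b_i\right) && \text{(input gate)} \\
\tilde{c}_t &= \tanh\left(W_c [h_{t-1}, x_t] + b_c\right) && \text{(candidate values)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state update)} \\
o_t &= \sigma\left(W_o [h_{t-1}, x_t] + b_o\right) && \text{(output gate)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state)}
\end{aligned}
```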

How this diagram works: The diagram shows the internal architecture of a single LSTM cell. The cell state (blue bar at top) acts as an information highway, flowing across time steps with only element-wise interactions (multiplication and addition). The three gates, forget (red), input (green), and output (purple), are sigmoid layers that output values between 0 and 1 and act as soft switches. The forget gate controls what to discard from the cell state, the input gate controls what new information to store, and the output gate controls what to reveal as the hidden state. The critical insight is that the cell state update is additive: $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$, where $f_t$ and $i_t$ are the forget and input gate activations, $\tilde{c}_t$ is the candidate cell state, and $\odot$ is element-wise multiplication. Along this path the gradient is scaled only by the forget gate, not by repeated multiplication with a weight matrix and an activation derivative, so it can survive across many time steps when the forget gate stays close to 1. This mitigates the vanishing gradient problem that plagues vanilla RNNs.
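
The cell described above can be sketched as a single forward step in NumPy. This is a minimal illustration, not a library implementation: the helper name `lstm_cell_step`, the packed weight matrix `W`, and the block order (forget, input, candidate, output) are all assumptions made for the example, and real libraries may pack their weights in a different order.

```python
import numpy as np


def sigmoid(z):
    """Logistic sigmoid, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def lstm_cell_step(x_t, h_prev, c_prev, W, b):
    """One forward step of an LSTM cell (hypothetical helper, not a library API).

    W has shape (4 * hidden, hidden + input) and b has shape (4 * hidden,).
    The four row blocks are assumed to be ordered: forget, input, candidate, output.
    """
    hidden = h_prev.shape[0]
    # All four pre-activations come from one matrix product on [h_prev, x_t].
    z = W @ np.concatenate([h_prev, x_t]) + b
    f_t = sigmoid(z[0 * hidden:1 * hidden])      # forget gate: what to keep of c_prev
    i_t = sigmoid(z[1 * hidden:2 * hidden])      # input gate: how much new content to write
    c_tilde = np.tanh(z[2 * hidden:3 * hidden])  # candidate values in (-1, 1)
    o_t = sigmoid(z[3 * hidden:4 * hidden])      # output gate: what to reveal
    c_t = f_t * c_prev + i_t * c_tilde           # additive cell state update
    h_t = o_t * np.tanh(c_t)                     # hidden state passed to the next step
    return h_t, c_t, (f_t, i_t, o_t)


# Run the cell over a short random sequence (sizes chosen arbitrarily).
rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4
W = rng.normal(scale=0.1, size=(4 * hidden_size, hidden_size + input_size))
b = np.zeros(4 * hidden_size)
h = np.zeros(hidden_size)
c = np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):
    h, c, gates = lstm_cell_step(x_t, h, c, W, b)
```

Computing all four blocks with one matrix product is a common design choice because it replaces four small multiplications with one larger one; the element-wise lines that follow are exactly the soft switches the diagram describes.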

Why LSTMs Solve Vanishing Gradients
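
One way to see the effect numerically is to compare how a gradient is scaled over many steps along the two paths. The sketch below rests on strong simplifying assumptions: everything is scalar, the LSTM path counts only the direct cell-state route (where each step multiplies the gradient by the forget gate value), and the vanilla RNN is the input-free recurrence h_t = tanh(w * h_{t-1}). The function names and the chosen constants are hypothetical.

```python
import math


def gradient_through_cell_state(forget_gates):
    """Product of forget-gate values: d c_T / d c_0 along the direct cell-state path."""
    grad = 1.0
    for f_t in forget_gates:
        grad *= f_t
    return grad


def gradient_through_vanilla_rnn(w, h0, steps):
    """d h_T / d h_0 for the scalar recurrence h_t = tanh(w * h_{t-1})."""
    grad, h = 1.0, h0
    for _ in range(steps):
        h = math.tanh(w * h)
        grad *= w * (1.0 - h * h)  # chain rule: tanh'(z) = 1 - tanh(z)^2
    return grad


steps = 100
lstm_grad = gradient_through_cell_state([0.99] * steps)
rnn_grad = gradient_through_vanilla_rnn(w=0.9, h0=0.5, steps=steps)
print(f"cell-state path: {lstm_grad:.3e}")
print(f"vanilla RNN:     {rnn_grad:.3e}")
```

With a forget gate held at 0.99, the cell-state gradient after 100 steps is still about 0.37, while the vanilla recurrence with w = 0.9 is bounded above by 0.9^100, roughly 2.7e-5. The same product also shows the limit of the argument: a forget gate that sits near 0.5 would shrink the gradient just as quickly, so the benefit depends on the network learning to keep the gate open.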

Bidirectional LSTM

Stacked LSTM

PyTorch Implementation
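
A minimal sketch of how a stacked, bidirectional LSTM might be wired up with `torch.nn.LSTM`. The class name `BiLSTMClassifier`, the layer sizes, and the choice to classify from the final hidden states are assumptions made for illustration, not a prescribed design.

```python
import torch
import torch.nn as nn


class BiLSTMClassifier(nn.Module):
    """Hypothetical sequence classifier: stacked bidirectional LSTM plus a linear head."""

    def __init__(self, input_size, hidden_size, num_layers, num_classes):
        super().__init__()
        self.lstm = nn.LSTM(
            input_size=input_size,
            hidden_size=hidden_size,
            num_layers=num_layers,   # values above 1 stack LSTM layers
            batch_first=True,        # inputs are (batch, seq_len, input_size)
            bidirectional=True,      # run a forward and a backward pass
        )
        # Forward and backward states are concatenated, hence 2 * hidden_size.
        self.head = nn.Linear(2 * hidden_size, num_classes)

    def forward(self, x):
        _, (h_n, _) = self.lstm(x)
        # h_n has shape (num_layers * 2, batch, hidden_size); the last two
        # entries are the final forward and backward states of the top layer.
        final = torch.cat([h_n[-2], h_n[-1]], dim=1)
        return self.head(final)


model = BiLSTMClassifier(input_size=8, hidden_size=16, num_layers=2, num_classes=3)
batch = torch.randn(4, 10, 8)  # 4 sequences, 10 time steps, 8 features each
logits = model(batch)          # shape (4, 3)
```

Using the final hidden states of both directions gives the classifier one summary that has seen the whole sequence from each end; pooling over the per-step outputs is a common alternative.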

LSTM vs. GRU Comparison

Feature	LSTM	GRU
Gates	3 (forget, input, output)	2 (update, reset)
Cell state	Yes	No
Parameters (per layer)	More ($4(h^2 + hd + h)$ for hidden size $h$, input size $d$)	Fewer ($3(h^2 + hd + h)$)
Speed	Slower	Faster
Performance	Similar	Similar
Memory	Better for very long sequences	Competitive
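
The parameter row of the table can be made concrete with a small calculation. This sketch assumes one weight matrix over the concatenated hidden state and input plus a single bias vector per block; the function names are hypothetical, and some libraries keep two bias vectors per block, which makes their reported totals slightly higher.

```python
def lstm_param_count(input_size, hidden_size):
    """Weights and one bias vector for the 4 LSTM blocks (3 gates + candidate)."""
    return 4 * (hidden_size * (hidden_size + input_size) + hidden_size)


def gru_param_count(input_size, hidden_size):
    """Same layout for the 3 GRU blocks (2 gates + candidate)."""
    return 3 * (hidden_size * (hidden_size + input_size) + hidden_size)


d, h = 128, 256  # example sizes, chosen arbitrarily
lstm_params = lstm_param_count(d, h)
gru_params = gru_param_count(d, h)
print(lstm_params, gru_params, gru_params / lstm_params)
```

Under these assumptions a GRU layer always has three quarters of the parameters of an LSTM layer of the same size, which is where most of its speed advantage comes from.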

Summary

Practice Exercises

Mathematical: Derive the gradient of the LSTM cell state with respect to the previous cell state. Why does this help against vanishing gradients, and under what condition on the forget gate?
Coding: Implement an LSTM cell from scratch using only basic torch.Tensor operations (no nn.LSTM). Verify that your implementation matches the output of nn.LSTM when given the same weights.
Experiment: Compare LSTM vs. GRU vs. Transformer on a long-range dependency task (e.g., copying the first element of a 500-element sequence). Which handles long sequences best?
Application: Build a bidirectional LSTM for sentiment analysis on IMDB reviews. Compare with a Transformer-based model. What are the tradeoffs?
Research: Read the original LSTM paper (Hochreiter and Schmidhuber, 1997). What was the original motivation? How has the architecture evolved?

What to Learn Next

-> GRU Networks: Gated Recurrent Unit, a simpler alternative to LSTM with similar performance.

-> RNN Deep Dive: Vanilla RNNs, BPTT, and why vanishing gradients motivated the invention of LSTM and GRU.

-> Sequence to Sequence: Encoder-decoder architectures for translation, summarization, and text generation.

-> Attention Mechanisms: Self-attention, multi-head attention, and how Transformers replaced RNNs for sequence modeling.

-> Vision Transformers: ViT, DeiT, and Swin, and how Transformers are challenging CNN dominance in computer vision.

-> DL Systems Design: Designing production ML systems, including data pipelines, training infrastructure, and serving at scale.
