
RNNs and LSTMs — Neural Networks That Remember

Explore recurrent neural networks designed to process sequential data with memory of past inputs.

Sequential processing: handle time series and text data
LSTM gates: mitigate the vanishing gradient problem
GRU simplification: a lighter, more efficient recurrent architecture

Memory is the diary we all carry about with us.

RNN, LSTM and GRU — Complete Guide

Recurrent networks process sequential data by maintaining a hidden state that carries information across time steps. Unlike Transformers, they process one token at a time, which needs only O(1) memory per step but O(n) sequential operations for a sequence of length n.

Vanilla RNN

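At each time step t, the RNN computes:

h_t = tanh(W·[h_{t-1}, x_t] + b)

where x_t is the current input, h_{t-1} is the previous hidden state, and [h_{t-1}, x_t] is their concatenation. The output y_t is then read from h_t, typically through a further linear layer.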
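The recurrence can be sketched in plain Python to make the data flow concrete. This is a minimal illustration, not a reference implementation: the function names `rnn_step` and `rnn_forward`, the sizes, and the random weights are all assumptions chosen for the example.

```python
import math
import random


def rnn_step(x_t, h_prev, W, b):
    """One step: h_t = tanh(W·[h_prev, x_t] + b)."""
    concat = h_prev + x_t  # list concatenation: [h_{t-1}, x_t]
    return [
        math.tanh(sum(w * v for w, v in zip(row, concat)) + b_i)
        for row, b_i in zip(W, b)
    ]


def rnn_forward(xs, W, b, hidden_size):
    """Apply the same cell (same W, b) at every step, from a zero state."""
    h = [0.0] * hidden_size
    states = []
    for x_t in xs:  # strictly sequential: step t needs h from step t-1
        h = rnn_step(x_t, h, W, b)
        states.append(h)
    return states


# Illustrative sizes and random weights (assumptions, not from the article).
random.seed(0)
input_size, hidden_size = 3, 4
W = [[random.uniform(-0.5, 0.5) for _ in range(hidden_size + input_size)]
     for _ in range(hidden_size)]
b = [0.0] * hidden_size
xs = [[random.gauss(0, 1) for _ in range(input_size)] for _ in range(5)]

states = rnn_forward(xs, W, b, hidden_size)
print(len(states), len(states[-1]))  # 5 4
```

Because the loop reuses one `W` and `b`, the same function handles a sequence of any length; only the number of sequential steps changes.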

How the RNN processes sequences: picture the RNN "unrolled" across time steps. At each step t, the RNN cell takes two inputs: the current input x_t (e.g., a word in a sentence) and the previous hidden state h_{t-1} (a memory of all past inputs). It combines them through W·[h_{t-1}, x_t] + b and applies tanh to produce the new hidden state h_t, from which the output y_t is computed.

The crucial detail is that the same weights W are used at every time step. This weight sharing across time means the model learns a single function that works regardless of sequence length.

The hidden state h_t acts as a compressed memory of everything seen so far, which creates an information bottleneck: all past information must fit into one fixed-size vector.

The critical flaw appears during backpropagation. Gradients are multiplied through the chain of time steps, so they tend to vanish (or explode) exponentially with sequence length. In practice this makes it very hard for a vanilla RNN to learn dependencies longer than roughly 10 to 20 steps.
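The vanishing-gradient effect can be seen numerically in a toy setting. The sketch below assumes a scalar RNN, h_t = tanh(w·h_{t-1} + x_t), with a hypothetical recurrent weight w = 0.9 and a constant input; it multiplies the per-step derivatives exactly as the chain rule does during backpropagation through time.

```python
import math


def gradient_through_time(w, xs, h0=0.0):
    """Return d h_T / d h_0 for the scalar RNN h_t = tanh(w*h_{t-1} + x_t)."""
    h, grad = h0, 1.0
    for x_t in xs:
        h = math.tanh(w * h + x_t)
        grad *= w * (1.0 - h * h)  # chain rule factor: d h_t / d h_{t-1}
    return grad


# Toy inputs (assumed for illustration): constant input, |w| < 1.
xs = [0.1] * 50
for steps in (1, 10, 20, 50):
    print(steps, abs(gradient_through_time(0.9, xs[:steps])))
```

Each factor has magnitude at most |w|, so with |w| < 1 the product shrinks at least as fast as |w|^T. In the matrix case the same role is played by the norms of the per-step Jacobians, which can also exceed 1 and make the product explode.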

LSTM (Long Short-Term Memory)

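LSTM (Hochreiter and Schmidhuber, 1997), in the form commonly used today, introduces a cell state c_t as an information highway, with three gates controlling information flow:

Forget gate: f_t = σ(W_f·[h_{t-1}, x_t] + b_f) decides how much of the old cell state to keep.
Input gate: i_t = σ(W_i·[h_{t-1}, x_t] + b_i) decides how much of the candidate c̃_t = tanh(W_c·[h_{t-1}, x_t] + b_c) to write.
Output gate: o_t = σ(W_o·[h_{t-1}, x_t] + b_o) decides how much of the cell state to expose.

The states are updated as c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t and h_t = o_t ⊙ tanh(c_t), where σ is the logistic sigmoid and ⊙ is element-wise multiplication.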
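As a rough illustration, one LSTM step can be written in plain Python. The helper names (`sigmoid`, `affine`, `lstm_step`), the parameter layout, and the hand-picked biases are assumptions made for this sketch; `g` denotes the tanh candidate that the input gate scales before it is written to the cell. The demo fixes the forget gate near 1 and the input gate near 0 to show the cell state acting as a highway.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, b, v):
    """Return W·v + b for a list-of-rows matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + b_i for row, b_i in zip(W, b)]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step; params maps 'f', 'i', 'o', 'g' to (W, b) pairs."""
    concat = h_prev + x_t  # [h_{t-1}, x_t]
    f = [sigmoid(z) for z in affine(*params["f"], concat)]    # forget gate
    i = [sigmoid(z) for z in affine(*params["i"], concat)]    # input gate
    o = [sigmoid(z) for z in affine(*params["o"], concat)]    # output gate
    g = [math.tanh(z) for z in affine(*params["g"], concat)]  # candidate values
    # Additive update: old state scaled by f, new content scaled by i.
    c = [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c_prev, i, g)]
    h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]
    return h, c


# Hand-picked parameters (illustrative only): zero weights, biases set the gates.
hidden, inp = 2, 1
zeros_W = [[0.0] * (hidden + inp) for _ in range(hidden)]
params = {
    "f": (zeros_W, [10.0] * hidden),   # forget gate ~1: keep the old cell state
    "i": (zeros_W, [-10.0] * hidden),  # input gate ~0: write almost nothing
    "o": (zeros_W, [0.0] * hidden),
    "g": (zeros_W, [0.0] * hidden),
}
h, c = [0.0] * hidden, [1.0, -1.0]
for _ in range(100):
    h, c = lstm_step([0.5], h, c, params)
print(c)  # stays close to the initial [1.0, -1.0] after 100 steps
```

The key design point is the additive cell update: when the forget gate is close to 1, the cell state passes through a step almost unchanged, so the gradient along that path is not repeatedly squashed by tanh.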

GRU (Gated Recurrent Unit)

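GRU (Cho et al., 2014) simplifies LSTM by merging the cell and hidden state and using only two gates:

Update gate: z_t = σ(W_z·[h_{t-1}, x_t] + b_z) decides how much of the previous state to carry forward.
Reset gate: r_t = σ(W_r·[h_{t-1}, x_t] + b_r) decides how much of the previous state feeds into the candidate.

The candidate state is h̃_t = tanh(W_h·[r_t ⊙ h_{t-1}, x_t] + b_h), and the new state is h_t = z_t ⊙ h_{t-1} + (1 - z_t) ⊙ h̃_t. Some references swap the roles of z_t and 1 - z_t; the two conventions are equivalent.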
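A single GRU step can be sketched the same way. The names `gru_step`, `affine`, and the parameter layout are assumptions for illustration, and the code uses the convention in which an update gate near 1 keeps the old state; some libraries use the opposite convention.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, b, v):
    """Return W·v + b for a list-of-rows matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + b_i for row, b_i in zip(W, b)]


def gru_step(x_t, h_prev, params):
    """One GRU step; params maps 'z', 'r', 'n' to (W, b) pairs."""
    concat = h_prev + x_t  # [h_{t-1}, x_t]
    z = [sigmoid(a) for a in affine(*params["z"], concat)]  # update gate
    r = [sigmoid(a) for a in affine(*params["r"], concat)]  # reset gate
    # The reset gate scales the old state before it enters the candidate.
    gated = [r_k * h_k for r_k, h_k in zip(r, h_prev)] + x_t
    n = [math.tanh(a) for a in affine(*params["n"], gated)]  # candidate state
    # Convention assumed here: z near 1 keeps the old state.
    return [z_k * h_k + (1.0 - z_k) * n_k for z_k, h_k, n_k in zip(z, h_prev, n)]


# Hand-picked parameters (illustrative only): zero weights, biases set the gates.
hidden, inp = 2, 1
zeros_W = [[0.0] * (hidden + inp) for _ in range(hidden)]
params = {
    "z": (zeros_W, [20.0] * hidden),  # update gate ~1: carry the old state
    "r": (zeros_W, [0.0] * hidden),
    "n": (zeros_W, [0.0] * hidden),
}
h_prev = [0.5, -0.5]
h_next = gru_step([0.3], h_prev, params)
print(h_next)  # nearly identical to h_prev
```

Compared with the LSTM there is one state vector instead of two and three weight blocks instead of four, which is where the parameter savings come from.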

Bidirectional RNN

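A bidirectional RNN processes the sequence in both directions and concatenates the two hidden states at each position:

h_t = [h_t^fwd ; h_t^bwd]

where h_t^fwd summarizes x_1..x_t and h_t^bwd summarizes x_t..x_n.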

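Use cases: named entity recognition (NER) and sentiment analysis, where the full sequence is available at inference time. A bidirectional RNN cannot be used for autoregressive generation, because the backward pass needs future tokens that have not been generated yet.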
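The sketch below illustrates the bidirectional idea with a toy scalar tanh RNN. The functions `rnn_states` and `bidirectional_states` and the weights are hypothetical, chosen only to keep the example short. It runs one pass left to right, one pass right to left, and pairs the states position by position.

```python
import math


def rnn_states(xs, w_h, w_x):
    """Hidden states of a scalar tanh RNN run left to right over xs."""
    h, states = 0.0, []
    for x_t in xs:
        h = math.tanh(w_h * h + w_x * x_t)
        states.append(h)
    return states


def bidirectional_states(xs, fwd, bwd):
    """Pair each position's forward state with its backward state."""
    forward = rnn_states(xs, *fwd)
    backward = rnn_states(xs[::-1], *bwd)[::-1]  # run right to left, then realign
    return list(zip(forward, backward))


# Toy weights (w_h, w_x) for each direction, assumed for illustration.
xs = [1.0, 0.0, 0.0, -1.0]
states = bidirectional_states(xs, fwd=(0.5, 1.0), bwd=(0.5, 1.0))

# Changing only the LAST input alters the backward state at position 0.
alt = bidirectional_states([1.0, 0.0, 0.0, 2.0], fwd=(0.5, 1.0), bwd=(0.5, 1.0))
print(states[0], alt[0])
```

The forward half of the first position is unaffected by the change, while the backward half shifts. That dependence on later inputs is exactly what rules out step-by-step generation.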

Sequence-to-Sequence Architecture

PyTorch Implementation
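A minimal PyTorch sketch of a recurrent sequence classifier is shown below. The class name `SequenceClassifier` and all sizes are hypothetical; the building blocks `nn.Embedding`, `nn.LSTM`, and `nn.Linear` are standard PyTorch modules. It is meant as a starting point under these assumptions rather than a tuned model.

```python
import torch
from torch import nn


class SequenceClassifier(nn.Module):
    """Hypothetical example model: embed tokens, run an LSTM, classify."""

    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes,
                 bidirectional=False):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                           bidirectional=bidirectional)
        out_dim = hidden_dim * (2 if bidirectional else 1)
        self.head = nn.Linear(out_dim, num_classes)

    def forward(self, tokens):
        # tokens: (batch, seq_len) integer ids
        outputs, (h_n, c_n) = self.rnn(self.embed(tokens))
        # h_n: (num_directions, batch, hidden_dim) for a single-layer LSTM
        last = torch.cat(list(h_n), dim=-1)  # join directions: (batch, out_dim)
        return self.head(last)


# Illustrative sizes (assumptions): vocabulary of 100 ids, 2 output classes.
model = SequenceClassifier(vocab_size=100, embed_dim=16, hidden_dim=32,
                           num_classes=2, bidirectional=True)
tokens = torch.randint(0, 100, (8, 20))  # batch of 8 sequences, 20 tokens each
logits = model(tokens)
print(logits.shape)  # torch.Size([8, 2])
```

Swapping in `nn.GRU` would need one change in `forward`: a GRU returns only `h_n` as its second value, since it has no separate cell state.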

Key Takeaways

What to Learn Next

-> Transformers: Learn the architecture replacing RNNs.

-> NLP Fundamentals: Master natural language processing basics.

-> Time Series Analysis: Apply RNNs to time-dependent data.

-> Attention Deep Dive: Understand how attention solves the bottleneck.

-> Neural Networks: Understand the foundation of deep learning.

-> Sequence-to-Sequence: Build models for translation and summarization.
