Sequence-to-Sequence Models
Seq2seq models map an input sequence to an output sequence of potentially different length. They are the foundation of machine translation, summarization, and dialogue systems.
Encoder-Decoder Architecture
Definition: Encoder-Decoder Framework
The encoder reads the input sequence and produces a context representation. The decoder generates the output sequence conditioned on the encoder's representation.
- Encoder: Processes input and produces hidden states
- Decoder: Generates output tokens autoregressively, conditioned on encoder states
Encoder Hidden States
$$h_t = f_{\text{enc}}(x_t, h_{t-1})$$
Here,
- $h_t$ = Encoder hidden state at time $t$
- $x_t$ = Input token at time $t$
- $f_{\text{enc}}$ = Encoder RNN (GRU/LSTM)
Decoder Output
$$s_t = f_{\text{dec}}(y_{t-1}, s_{t-1}, c)$$
Here,
- $s_t$ = Decoder hidden state at time $t$
- $y_{t-1}$ = Previous output token
- $c$ = Context vector from encoder
Output Probability
$$P(y_t \mid y_{<t}, x) = \text{softmax}(W_o s_t + b_o)$$
Here,
- $W_o$ = Output projection weights
- $s_t$ = Decoder hidden state
- $b_o$ = Output bias
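The encoder recurrence, the decoder step, and the output softmax described above can be traced in a few lines of PyTorch. This is a minimal sketch under stated assumptions, not the lesson's full model: the toy sizes, the shared embedding table, the start token id 0, and the choice to initialise the decoder state from the context are all illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab_size, embed_dim, hidden_dim = 20, 8, 16  # toy sizes, chosen for illustration

embedding = nn.Embedding(vocab_size, embed_dim)          # shared by source and target (simplification)
f_enc = nn.GRUCell(embed_dim, hidden_dim)                # encoder RNN
f_dec = nn.GRUCell(embed_dim + hidden_dim, hidden_dim)   # decoder RNN, input is [y_{t-1}; c]
W_o = nn.Linear(hidden_dim, vocab_size)                  # output projection (weights and bias)

src = torch.tensor([[3, 7, 1, 4]])  # (batch=1, src_len=4)

# Encoder: h_t = f_enc(x_t, h_{t-1}), starting from h_0 = 0
h = torch.zeros(1, hidden_dim)
for t in range(src.shape[1]):
    h = f_enc(embedding(src[:, t]), h)
c = h  # context vector: final encoder hidden state

# Decoder, one step: s_t = f_dec(y_{t-1}, s_{t-1}, c)
y_prev = torch.tensor([0])  # hypothetical start-of-sequence token id
s = c                       # assumption: initialise the decoder state from the context
s = f_dec(torch.cat([embedding(y_prev), c], dim=1), s)

# Output distribution over the vocabulary: softmax(W_o s_t + b_o)
probs = torch.softmax(W_o(s), dim=1)
print(probs.shape)  # torch.Size([1, 20])
```

Feeding the context as an extra decoder input at every step is one common way to condition on it; using it only as the initial decoder state is another.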
Context Vector
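Definition: Context Vector
The context vector is typically the final hidden state of the encoder, where $T_x$ is the input length:
$$c = h_{T_x}$$
For bidirectional encoders, the context is the concatenation of forward and backward final states (the backward pass finishes at the first input position):
$$c = [\overrightarrow{h}_{T_x}; \overleftarrow{h}_1]$$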
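For the bidirectional case, a short sketch with `nn.GRU` shows where the two final states live and how they are concatenated. The tensor sizes and the random stand-in for embedded tokens are assumptions made for illustration; a single-layer GRU is used so the indexing of `hidden` stays simple.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

batch, src_len, embed_dim, hidden_dim = 2, 5, 8, 16  # toy sizes

gru = nn.GRU(embed_dim, hidden_dim, num_layers=1,
             batch_first=True, bidirectional=True)
embedded = torch.randn(batch, src_len, embed_dim)  # stand-in for embedded source tokens

outputs, hidden = gru(embedded)
# outputs: (batch, src_len, 2 * hidden_dim), forward and backward features side by side
# hidden:  (2, batch, hidden_dim), index 0 = forward final state, index 1 = backward final state

# Context vector: concatenate the two final states
context = torch.cat([hidden[0], hidden[1]], dim=1)  # (batch, 2 * hidden_dim)

# The forward state ends at the last position, the backward state at the first
assert torch.allclose(context[:, :hidden_dim], outputs[:, -1, :hidden_dim])
assert torch.allclose(context[:, hidden_dim:], outputs[:, 0, hidden_dim:])
```

Because the concatenated context is twice as wide as a single hidden state, a decoder consuming it needs either a matching hidden size or a linear projection back to `hidden_dim`.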
ℹ️ Information Bottleneck
The fixed-length context vector creates an information bottleneck — the entire input sequence must be compressed into a single vector. This motivates attention mechanisms (covered in the next lesson) that allow the decoder to attend to all encoder states.
Teacher Forcing
Definition: Teacher Forcing
During training, the decoder receives the ground truth output $y^*_{t-1}$ from the previous step as input, rather than its own prediction $\hat{y}_{t-1}$:
$$s_t = f_{\text{dec}}(y^*_{t-1}, s_{t-1}, c)$$
This stabilizes training but creates a train-test discrepancy (exposure bias) since the decoder never sees its own errors during training.
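Scheduled Sampling
$$\text{input}_t = \begin{cases} y^*_{t-1} & \text{with probability } \epsilon \\ \hat{y}_{t-1} & \text{with probability } 1 - \epsilon \end{cases}$$
Here,
- $\epsilon$ = Probability of using ground truth (decays during training)
- $y^*_{t-1}$ = Ground truth previous token
- $\hat{y}_{t-1}$ = Model's predicted previous token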
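The per-step choice in scheduled sampling can be sketched in plain Python. The function names, the linear schedule, and its default endpoints are assumptions for illustration; other decay schedules (exponential, inverse sigmoid) are also used.

```python
import random

def epsilon_linear(step, total_steps, start=1.0, end=0.0):
    """Probability of feeding the ground truth, decayed linearly over training."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac

def choose_decoder_input(gold_prev, pred_prev, epsilon, rng=random):
    """Feed the ground-truth token with probability epsilon, else the model's own prediction."""
    return gold_prev if rng.random() < epsilon else pred_prev

rng = random.Random(0)
for step in (0, 500, 1000):
    eps = epsilon_linear(step, total_steps=1000)
    token = choose_decoder_input("gold", "predicted", eps, rng)
    print(step, eps, token)
```

At `epsilon = 1.0` this reduces to pure teacher forcing, and at `epsilon = 0.0` the decoder always consumes its own predictions, as it does at inference time.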
Decoding Strategies
Greedy Decoding
Definition: Greedy Decoding
At each step, select the token with highest probability:
$$\hat{y}_t = \arg\max_{y} P(y \mid \hat{y}_{<t}, x)$$
Fast but globally suboptimal — a locally optimal choice may lead to a poor overall sequence.
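A small worked example makes the suboptimality concrete. The probability table below is a hypothetical toy model, not output from a trained network: greedy decoding commits to the locally best first token and ends up with a less probable sequence than the alternative.

```python
import math

EOS = "<eos>"

# Hypothetical toy model: next-token probabilities for each prefix
TOY_MODEL = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.55, "y": 0.45},
    ("B",): {"z": 1.0},
}

def next_probs(prefix):
    # Any prefix not listed in the table ends the sequence with certainty
    return TOY_MODEL.get(tuple(prefix), {EOS: 1.0})

def greedy_decode(next_probs, max_len=10):
    tokens, log_prob = [], 0.0
    for _ in range(max_len):
        probs = next_probs(tokens)
        token = max(probs, key=probs.get)  # locally best token
        log_prob += math.log(probs[token])
        if token == EOS:
            break
        tokens.append(token)
    return tokens, log_prob

tokens, log_prob = greedy_decode(next_probs)
print(tokens, round(math.exp(log_prob), 2))  # ['A', 'x'] 0.33
```

Greedy decoding returns `A x` with probability 0.6 × 0.55 = 0.33, while the sequence `B z` has probability 0.4 × 1.0 = 0.4.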
Beam Search
Definition: Beam Search
Maintain $k$ candidates (beams) at each step, where $k$ is the beam width. At each time step:
- Expand each beam with all possible next tokens
- Score each new beam as the sum of its log-probabilities
- Keep the top-$k$ beams
- Repeat until end-of-sequence or max length
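Score:
Beam Search Score
$$\text{score}(Y) = \sum_{t=1}^{T} \log P(y_t \mid y_{<t}, x)$$
Here,
- $Y$ = Candidate output sequence
- $P(y_t \mid y_{<t}, x)$ = Conditional probability of token $t$
- $T$ = Output sequence length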
💡 Beam Search Tips
- A small beam width $k$ works well in practice
- Apply length normalization: divide the score by the output length $T$ to avoid favoring short sequences
- For diverse outputs, use diverse beam search with a penalty for repetition
- Beam search is not suitable for open-ended generation (use sampling instead)
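The procedure above, including length normalization, can be sketched in plain Python. The toy probability table is hypothetical, and so is the exponent `alpha` applied to the length: `alpha=1.0` divides by the plain length, and smaller values weaken the correction.

```python
import math

EOS = "<eos>"

# Hypothetical toy model: next-token probabilities for each prefix
TOY_MODEL = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.55, "y": 0.45},
    ("B",): {"z": 1.0},
}

def next_probs(prefix):
    # Any prefix not listed in the table ends the sequence with certainty
    return TOY_MODEL.get(tuple(prefix), {EOS: 1.0})

def beam_search(next_probs, k=2, max_len=10, alpha=1.0):
    beams = [((), 0.0)]  # (tokens, sum of log-probabilities)
    finished = []
    for _ in range(max_len):
        # Expand every live beam with every possible next token
        candidates = [
            (tokens + (tok,), score + math.log(p))
            for tokens, score in beams
            for tok, p in next_probs(tokens).items()
            if p > 0.0
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:k]:  # keep the top-k
            if tokens[-1] == EOS:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_len
    # Length normalization: compare hypotheses by score / length**alpha
    return max(finished, key=lambda c: c[1] / (len(c[0]) ** alpha))

best_tokens, best_score = beam_search(next_probs, k=2)
print(best_tokens, round(math.exp(best_score), 2))  # ('B', 'z', '<eos>') 0.4
```

With `k=2` the search keeps the `B` branch alive long enough to find the more probable sequence; with `k=1` it behaves like greedy decoding and returns the `A x` sequence instead.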
Complete PyTorch Implementation
📝 Example: Seq2Seq with GRU and Additive Attention
import torch
import torch.nn as nn


class Encoder(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_layers=2,
                 dropout=0.3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, num_layers,
                          batch_first=True, dropout=dropout)
        self.dropout = nn.Dropout(dropout)

    def forward(self, src):
        # src: (batch, src_len)
        embedded = self.dropout(self.embedding(src))
        outputs, hidden = self.gru(embedded)
        # outputs: (batch, src_len, hidden_dim)
        # hidden: (num_layers, batch, hidden_dim)
        return outputs, hidden


class BahdanauAttention(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.W_a = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.U_a = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.v = nn.Linear(hidden_dim, 1, bias=False)

    def forward(self, decoder_state, encoder_outputs):
        # decoder_state: (batch, hidden_dim)
        # encoder_outputs: (batch, src_len, hidden_dim)
        src_len = encoder_outputs.shape[1]
        decoder_state = decoder_state.unsqueeze(1).repeat(1, src_len, 1)
        energy = torch.tanh(
            self.W_a(encoder_outputs) + self.U_a(decoder_state)
        )
        attention = self.v(energy).squeeze(-1)
        # attention weights: (batch, src_len), normalized over source positions
        return torch.softmax(attention, dim=1)


class Decoder(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_layers=2,
                 dropout=0.3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.attention = BahdanauAttention(hidden_dim)
        self.gru = nn.GRU(embed_dim + hidden_dim, hidden_dim, num_layers,
                          batch_first=True, dropout=dropout)
        self.fc_out = nn.Linear(hidden_dim * 2 + embed_dim, vocab_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, input_token, hidden, encoder_outputs):
        # input_token: (batch, 1)
        embedded = self.dropout(self.embedding(input_token))

        # Attention (queried with the top-layer decoder state)
        attn_weights = self.attention(hidden[-1], encoder_outputs)
        context = torch.bmm(
            attn_weights.unsqueeze(1), encoder_outputs
        )

        # GRU input
        gru_input = torch.cat([embedded, context], dim=2)
        output, hidden = self.gru(gru_input, hidden)

        # Predict
        prediction = self.fc_out(
            torch.cat([output, context, embedded], dim=2)
        ).squeeze(1)
        return prediction, hidden, attn_weights


class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device

    def forward(self, src, trg, teacher_forcing_ratio=0.5):
        batch_size = src.shape[0]
        trg_len = trg.shape[1]
        trg_vocab_size = self.decoder.fc_out.out_features
        outputs = torch.zeros(batch_size, trg_len, trg_vocab_size).to(self.device)

        encoder_outputs, hidden = self.encoder(src)
        input_token = trg[:, 0].unsqueeze(1)

        for t in range(1, trg_len):
            output, hidden, _ = self.decoder(
                input_token, hidden, encoder_outputs
            )
            outputs[:, t] = output

            # Teacher forcing
            if torch.rand(1).item() < teacher_forcing_ratio:
                input_token = trg[:, t].unsqueeze(1)
            else:
                input_token = output.argmax(dim=1, keepdim=True)

        return outputs


# Build model
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
encoder = Encoder(vocab_size=8000, embed_dim=256, hidden_dim=512)
decoder = Decoder(vocab_size=8000, embed_dim=256, hidden_dim=512)
model = Seq2Seq(encoder, decoder, device).to(device)

# Count parameters
total_params = sum(p.numel() for p in model.parameters())
print(f"Parameters: {total_params:,}")
Training
Seq2Seq Loss
$$\mathcal{L} = -\sum_{t=1}^{T} \log P(y^*_t \mid y^*_{<t}, x)$$
Here,
- $y^*_t$ = Ground truth token at time $t$
- $y^*_{<t}$ = Ground truth tokens before time $t$
- $x$ = Input sequence
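The loss is the summed negative log-probability of the ground truth tokens, which is what PyTorch's cross-entropy computes from logits. The sketch below checks that equivalence on random stand-in tensors; the sizes and the random logits are assumptions for illustration, not outputs of the model in this lesson.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

batch, T, vocab_size = 2, 5, 11                     # toy sizes for illustration
logits = torch.randn(batch, T, vocab_size)          # stand-in for decoder outputs
targets = torch.randint(0, vocab_size, (batch, T))  # ground-truth token ids

# Explicit form: minus the sum over steps of log P(ground-truth token)
log_probs = F.log_softmax(logits, dim=-1)
token_log_probs = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
loss_manual = -token_log_probs.sum()

# Same quantity via the built-in cross-entropy on flattened tensors
loss_builtin = F.cross_entropy(
    logits.reshape(-1, vocab_size), targets.reshape(-1), reduction="sum"
)

print(torch.allclose(loss_manual, loss_builtin, atol=1e-4))  # True
```

In practice the loss is usually averaged over tokens rather than summed, and padding positions are excluded with the `ignore_index` argument.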
💡 Training Tips
- Use label smoothing (0.1) to prevent overconfident predictions
- Apply gradient clipping at norm 1.0
- Start the teacher forcing ratio at 1.0 and decay it toward 0.0 over training
- Use learning rate warmup for first few thousand steps
- Monitor both teacher-forced loss and greedy decoding BLEU
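Several of these tips map directly onto PyTorch calls. The following is a minimal sketch of one training step with label smoothing, gradient clipping, and linear warmup. The tiny stand-in model, the padding id, the learning rate, and the very short warmup are all hypothetical values chosen so the example runs quickly.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

PAD_ID, vocab_size, warmup_steps, base_lr = 0, 50, 4, 1e-3  # illustrative values

# Hypothetical stand-in for a seq2seq model: token ids -> per-position logits
model = nn.Sequential(nn.Embedding(vocab_size, 16), nn.Linear(16, vocab_size))

# Label smoothing of 0.1; padding positions do not contribute to the loss
criterion = nn.CrossEntropyLoss(label_smoothing=0.1, ignore_index=PAD_ID)
optimizer = torch.optim.Adam(model.parameters(), lr=base_lr)
# Linear warmup: scale the learning rate from base_lr / warmup_steps up to base_lr
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: min(1.0, (step + 1) / warmup_steps)
)

def train_step(inputs, targets):
    model.train()
    optimizer.zero_grad()
    logits = model(inputs)  # (batch, trg_len, vocab_size)
    loss = criterion(logits.reshape(-1, vocab_size), targets.reshape(-1))
    loss.backward()
    # Clip the global gradient norm at 1.0 before the update
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()
    return loss.item()

inputs = torch.randint(1, vocab_size, (8, 6))
targets = torch.randint(1, vocab_size, (8, 6))
loss_value = train_step(inputs, targets)
print(f"loss={loss_value:.3f} lr={scheduler.get_last_lr()[0]:.2e}")
```

When this pattern is applied to the `Seq2Seq` module from this lesson, the first target position would typically be dropped before computing the loss, since `outputs[:, 0]` is never filled in by the decoder.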
Practice Exercises
- Implement beam search: Write a beam search decoder with length normalization. Compare outputs with greedy decoding.
- Scheduled sampling: Modify the training loop to use scheduled sampling with linear decay of $\epsilon$ from 1.0 to 0.0.
- Transformer decoder: Replace the RNN decoder with a Transformer decoder. Compare performance and training speed.
- Summarization task: Train a seq2seq model on the CNN/DailyMail dataset for text summarization. Evaluate with ROUGE scores.
Key Takeaways
📋 Summary: Seq2Seq Models
- Encoder-decoder architecture maps variable-length input to variable-length output
- Context vector compresses entire input into fixed-length representation
- Teacher forcing accelerates training but causes exposure bias
- Greedy decoding is fast but only locally optimal
- Beam search finds better sequences by maintaining $k$ candidates
- Attention mechanisms solve the information bottleneck problem
- Seq2seq models are foundational for translation, summarization, dialogue
- Modern systems use Transformers instead of RNNs for better parallelism
- BLEU is a standard evaluation metric for machine translation
- See also: Transformers for the attention-based architecture