Sequence-to-Sequence Models — Encoder-Decoder Architecture

Sequence-to-Sequence Models

Seq2seq models map an input sequence to an output sequence of potentially different length. They are the foundation of machine translation, summarization, and dialogue systems.


Encoder-Decoder Architecture

Definition: Encoder-Decoder Framework

The encoder reads the input sequence $\mathbf{x} = (x_1, \ldots, x_T)$ and produces a context representation. The decoder generates the output sequence $\mathbf{y} = (y_1, \ldots, y_{T'})$ conditioned on the encoder's representation.

  • Encoder: Processes the input and produces hidden states $h_1, \ldots, h_T$
  • Decoder: Generates output tokens autoregressively, conditioned on encoder states

Encoder Hidden States

$$h_t = \text{RNN}_{\text{enc}}(x_t, h_{t-1})$$

Here,

  • $h_t$ = Encoder hidden state at time $t$
  • $x_t$ = Input token at time $t$
  • $\text{RNN}_{\text{enc}}$ = Encoder RNN (GRU/LSTM)
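
The recurrence can be traced by hand on a tiny example. The sketch below assumes a plain tanh (Elman) cell as a simplified stand-in for the GRU/LSTM; the weights `W_x`, `W_h`, `b` and the input values are made up for illustration.

```python
import math

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One vanilla tanh RNN step: h_t = tanh(W_x x_t + W_h h_prev + b)."""
    hidden_dim = len(b)
    h_t = []
    for i in range(hidden_dim):
        total = b[i]
        total += sum(W_x[i][j] * x_t[j] for j in range(len(x_t)))
        total += sum(W_h[i][j] * h_prev[j] for j in range(hidden_dim))
        h_t.append(math.tanh(total))
    return h_t

def encode(inputs, W_x, W_h, b):
    """Run the recurrence over the whole input and return h_1 .. h_T."""
    h = [0.0] * len(b)  # h_0 is initialised to zeros
    states = []
    for x_t in inputs:
        h = rnn_step(x_t, h, W_x, W_h, b)
        states.append(h)
    return states

# Toy setup (assumed values): T = 3 inputs of dimension 2, hidden size 2
W_x = [[0.5, -0.3], [0.1, 0.8]]
W_h = [[0.2, 0.0], [0.0, 0.2]]
b = [0.0, 0.0]
inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

states = encode(inputs, W_x, W_h, b)
for t, h_t in enumerate(states, start=1):
    print(f"h_{t} =", [round(v, 4) for v in h_t])
```

Each `h_t` depends on `x_t` and on `h_{t-1}`, so the last state carries information about the whole sequence.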

Decoder Output

$$s_t = \text{RNN}_{\text{dec}}(y_{t-1}, s_{t-1}, c)$$

Here,

  • $s_t$ = Decoder hidden state at time $t$
  • $y_{t-1}$ = Previous output token
  • $c$ = Context vector from encoder

Output Probability

$$P(y_t \mid y_{<t}, \mathbf{x}) = \text{softmax}(W_o \cdot s_t + b_o)$$

Here,

  • $W_o$ = Output projection weights
  • $s_t$ = Decoder hidden state
  • $b_o$ = Output bias
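
As a small worked example of the output projection, the sketch below computes the softmax over a three-token vocabulary. The state `s_t`, the matrix `W_o` and the bias `b_o` are assumed toy values, not parameters of any trained model.

```python
import math

def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def output_distribution(s_t, W_o, b_o):
    """Compute softmax(W_o . s_t + b_o) for a single decoder state."""
    logits = [
        sum(w * s for w, s in zip(row, s_t)) + bias
        for row, bias in zip(W_o, b_o)
    ]
    return softmax(logits)

# Toy values (assumed): hidden size 2, vocabulary size 3
s_t = [0.2, -0.1]
W_o = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b_o = [0.0, 0.0, 0.0]

probs = output_distribution(s_t, W_o, b_o)
print([round(p, 4) for p in probs], "sum =", round(sum(probs), 6))
```

The logits here are 0.2, -0.1 and 0.1, so the first token receives the highest probability.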

Context Vector

Definition: Context Vector

The context vector $c$ is typically the final hidden state of the encoder:

$$c = h_T$$

For bidirectional encoders, the context is the concatenation of forward and backward final states:

$$c = [\overrightarrow{h_T}; \overleftarrow{h_1}]$$
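
The indexing in the bidirectional case is easy to get wrong: the backward RNN finishes at position 1, not at position $T$. A minimal sketch, assuming a scalar tanh RNN and made-up weights, shows which states end up in the context:

```python
import math

def run_rnn(inputs, w_x, w_h):
    """Scalar tanh RNN (illustrative stand-in); returns every hidden state."""
    h, states = 0.0, []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states

inputs = [0.5, -1.0, 2.0]  # x_1, x_2, x_3 (assumed toy values)

forward_states = run_rnn(inputs, w_x=0.7, w_h=0.3)
# The backward RNN reads x_T .. x_1; reverse its outputs so index t lines up with x_t.
backward_states = run_rnn(inputs[::-1], w_x=-0.4, w_h=0.3)[::-1]

c_unidirectional = [forward_states[-1]]                      # c = h_T
c_bidirectional = [forward_states[-1], backward_states[0]]   # [forward h_T ; backward h_1]

print("unidirectional context:", c_unidirectional)
print("bidirectional context: ", c_bidirectional)
```

Both entries of the bidirectional context have seen the full input, one reading left to right and the other right to left.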

ℹ️ Information Bottleneck

The fixed-length context vector creates an information bottleneck — the entire input sequence must be compressed into a single vector. This motivates attention mechanisms (covered in the next lesson) that allow the decoder to attend to all encoder states.


Teacher Forcing

Definition: Teacher Forcing

During training, the decoder receives the ground truth output from the previous step as input, rather than its own prediction:

$$s_t = \text{RNN}_{\text{dec}}(y_{t-1}^{\text{true}}, s_{t-1}, c)$$

This stabilizes training but creates a train-test discrepancy (exposure bias) since the decoder never sees its own errors during training.

Scheduled Sampling

$$\text{input}_t = \begin{cases} y_{t-1}^{\text{true}} & \text{with probability } \epsilon \\ \hat{y}_{t-1} & \text{with probability } 1 - \epsilon \end{cases}$$

Here,

  • $\epsilon$ = Probability of using ground truth (decays during training)
  • $y_{t-1}^{\text{true}}$ = Ground truth previous token
  • $\hat{y}_{t-1}$ = Model's predicted previous token
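
The sampling rule can be sketched in a few lines. The linear decay schedule, the helper names `epsilon_linear` and `choose_decoder_input`, and the step counts are assumptions made for illustration; other decay shapes are equally possible.

```python
import random

def epsilon_linear(step, total_steps, start=1.0, end=0.0):
    """Linearly decay epsilon from `start` to `end` over `total_steps`."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac

def choose_decoder_input(y_true_prev, y_pred_prev, epsilon, rng=random):
    """Return the ground-truth token with probability epsilon, else the prediction."""
    return y_true_prev if rng.random() < epsilon else y_pred_prev

rng = random.Random(0)  # seeded so the sketch is reproducible
for step in (0, 5000, 10000):
    eps = epsilon_linear(step, total_steps=10000)
    picks = [choose_decoder_input("true", "pred", eps, rng) for _ in range(1000)]
    print(f"step={step} epsilon={eps:.2f} ground-truth share={picks.count('true') / 1000:.2f}")
```

Early in training the decoder sees almost only ground-truth tokens; as `epsilon` shrinks it is increasingly fed its own predictions.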

Decoding Strategies

Greedy Decoding

Definition: Greedy Decoding

At each step, select the token with the highest probability:

$$\hat{y}_t = \arg\max_{y} P(y \mid y_{<t}, \mathbf{x})$$

Fast but globally suboptimal — a locally optimal choice may lead to a poor overall sequence.
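
A toy example makes the weakness concrete. The sketch below uses a hypothetical hand-written lookup table `next_token_probs` in place of a real decoder (it ignores the source sentence entirely), chosen so that the most likely first token leads to a less likely full sequence.

```python
import math

EOS = "<eos>"

def next_token_probs(prefix):
    """Hypothetical toy model returning P(y_t | y_<t) from a fixed table."""
    table = {
        (): {"a": 0.6, "b": 0.4},
        ("a",): {EOS: 0.4, "a": 0.3, "b": 0.3},
        ("b",): {EOS: 0.9, "a": 0.05, "b": 0.05},
    }
    return table.get(tuple(prefix), {EOS: 1.0})

def greedy_decode(max_len=5):
    """Pick the arg-max token at every step until <eos> or max_len."""
    prefix, log_prob = [], 0.0
    for _ in range(max_len):
        probs = next_token_probs(prefix)
        token = max(probs, key=probs.get)
        prefix.append(token)
        log_prob += math.log(probs[token])
        if token == EOS:
            break
    return prefix, log_prob

sequence, log_prob = greedy_decode()
print(sequence, "probability =", round(math.exp(log_prob), 4))
```

Greedy decoding commits to "a" (probability 0.6) and ends with a sequence of probability 0.6 × 0.4 = 0.24, while the sequence starting with "b" has probability 0.4 × 0.9 = 0.36.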

Beam Search

Definition: Beam Search

Maintain $k$ candidates (beams). At each time step:

  1. Expand each beam with all possible next tokens
  2. Score each new beam as the sum of its log-probabilities
  3. Keep the top-$k$ beams
  4. Repeat until end-of-sequence or max length

Beam Search Score

$$\text{score}(\mathbf{y}) = \sum_{t=1}^{T'} \log P(y_t \mid y_{<t}, \mathbf{x})$$

Here,

  • $\mathbf{y}$ = Candidate output sequence
  • $P(y_t \mid y_{<t}, \mathbf{x})$ = Conditional probability of token $t$
  • $T'$ = Output sequence length
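
The four steps and the score can be sketched as follows. This is a minimal illustration, not a production decoder: `next_token_probs` is a hypothetical fixed lookup table standing in for the model, and finished hypotheses are simply moved out of the beam (one common simplification).

```python
import math

EOS = "<eos>"

def next_token_probs(prefix):
    """Hypothetical toy model returning P(y_t | y_<t) from a fixed table."""
    table = {
        (): {"a": 0.6, "b": 0.4},
        ("a",): {EOS: 0.4, "a": 0.3, "b": 0.3},
        ("b",): {EOS: 0.9, "a": 0.05, "b": 0.05},
    }
    return table.get(tuple(prefix), {EOS: 1.0})

def beam_search(k=2, max_len=5):
    """Return finished hypotheses sorted by summed log-probability."""
    beams = [([], 0.0)]  # (tokens, sum of log-probs)
    finished = []
    for _ in range(max_len):
        # 1. Expand every beam with every possible next token
        candidates = []
        for tokens, score in beams:
            for token, p in next_token_probs(tokens).items():
                # 2. Score = running sum of log-probabilities
                candidates.append((tokens + [token], score + math.log(p)))
        # 3. Keep the top-k candidates
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:k]:
            if tokens[-1] == EOS:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        # 4. Stop once every surviving hypothesis has ended
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_len
    return sorted(finished, key=lambda c: c[1], reverse=True)

for tokens, score in beam_search(k=2):
    print(tokens, "probability =", round(math.exp(score), 4))
```

With `k=1` this reduces to greedy decoding and returns `["a", "<eos>"]` (probability 0.24). With `k=2` the hypothesis starting with "b" survives the first step, and `["b", "<eos>"]` (probability 0.36) is ranked first.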

💡 Beam Search Tips

  • Beam width $k = 4$ to $10$ works well in practice
  • Apply length normalization: divide the score by $T'$ to avoid favoring short sequences
  • For diverse outputs, use diverse beam search with a penalty for repetition
  • Beam search is not suitable for open-ended generation (use sampling instead)
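
To see why length normalization matters, compare a short and a long hypothesis. The per-token probabilities below are made-up numbers, and dividing by $T'$ is the simplest form of normalization.

```python
import math

def raw_score(token_probs):
    """Sum of log-probabilities, as in the beam search score."""
    return sum(math.log(p) for p in token_probs)

def length_normalized_score(token_probs):
    """Average log-probability per token: raw score divided by T'."""
    return raw_score(token_probs) / len(token_probs)

short = [0.5, 0.5]                 # 2 tokens, each fairly uncertain
long = [0.7, 0.7, 0.7, 0.7, 0.7]   # 5 tokens, each more confident

for name, hyp in (("short", short), ("long", long)):
    print(f"{name}: raw={raw_score(hyp):.3f} normalized={length_normalized_score(hyp):.3f}")
```

Every extra token adds a negative term, so the raw score prefers the short hypothesis even though its individual tokens are less likely. After dividing by the length, the long hypothesis ranks higher.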

Complete PyTorch Implementation

📝 Example: Seq2Seq with GRU and Bahdanau Attention

import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_layers=2,
                 dropout=0.3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, num_layers,
                          batch_first=True, dropout=dropout)
        self.dropout = nn.Dropout(dropout)

    def forward(self, src):
        # src: (batch, src_len)
        embedded = self.dropout(self.embedding(src))
        outputs, hidden = self.gru(embedded)
        # outputs: (batch, src_len, hidden_dim)
        # hidden: (num_layers, batch, hidden_dim)
        return outputs, hidden


class BahdanauAttention(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.W_a = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.U_a = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.v = nn.Linear(hidden_dim, 1, bias=False)

    def forward(self, decoder_state, encoder_outputs):
        # decoder_state: (batch, hidden_dim)
        # encoder_outputs: (batch, src_len, hidden_dim)
        src_len = encoder_outputs.shape[1]

        decoder_state = decoder_state.unsqueeze(1).repeat(1, src_len, 1)
        energy = torch.tanh(
            self.W_a(encoder_outputs) + self.U_a(decoder_state)
        )
        attention = self.v(energy).squeeze(-1)
        return torch.softmax(attention, dim=1)


class Decoder(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_layers=2,
                 dropout=0.3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.attention = BahdanauAttention(hidden_dim)
        self.gru = nn.GRU(embed_dim + hidden_dim, hidden_dim, num_layers,
                          batch_first=True, dropout=dropout)
        self.fc_out = nn.Linear(hidden_dim * 2 + embed_dim, vocab_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, input_token, hidden, encoder_outputs):
        # input_token: (batch, 1)
        embedded = self.dropout(self.embedding(input_token))

        # Attention
        attn_weights = self.attention(hidden[-1], encoder_outputs)
        context = torch.bmm(
            attn_weights.unsqueeze(1), encoder_outputs
        )

        # GRU input
        gru_input = torch.cat([embedded, context], dim=2)
        output, hidden = self.gru(gru_input, hidden)

        # Predict
        prediction = self.fc_out(
            torch.cat([output, context, embedded], dim=2)
        ).squeeze(1)

        return prediction, hidden, attn_weights


class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device

    def forward(self, src, trg, teacher_forcing_ratio=0.5):
        batch_size = src.shape[0]
        trg_len = trg.shape[1]
        trg_vocab_size = self.decoder.fc_out.out_features

        outputs = torch.zeros(batch_size, trg_len, trg_vocab_size, device=self.device)
        encoder_outputs, hidden = self.encoder(src)

        input_token = trg[:, 0].unsqueeze(1)

        for t in range(1, trg_len):
            output, hidden, _ = self.decoder(
                input_token, hidden, encoder_outputs
            )
            outputs[:, t] = output

            # Teacher forcing
            if torch.rand(1).item() < teacher_forcing_ratio:
                input_token = trg[:, t].unsqueeze(1)
            else:
                input_token = output.argmax(dim=1, keepdim=True)

        return outputs


# Build model
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
encoder = Encoder(vocab_size=8000, embed_dim=256, hidden_dim=512)
decoder = Decoder(vocab_size=8000, embed_dim=256, hidden_dim=512)
model = Seq2Seq(encoder, decoder, device).to(device)

# Count parameters
total_params = sum(p.numel() for p in model.parameters())
print(f"Parameters: {total_params:,}")

Training

Seq2Seq Loss

$$\mathcal{L} = -\sum_{t=1}^{T'} \log P(y_t^* \mid y_{<t}^*, \mathbf{x})$$

Here,

  • $y_t^*$ = Ground truth token at time $t$
  • $y_{<t}^*$ = Ground truth tokens before time $t$
  • $\mathbf{x}$ = Input sequence
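
A short numeric example shows how the loss accumulates over time steps. The per-step distributions and the three-token target are invented for illustration; in practice they would come from the decoder's softmax under teacher forcing.

```python
import math

def sequence_nll(step_distributions, target_tokens):
    """Negative log-likelihood: -sum_t log P(y_t* | y_<t*, x)."""
    loss = 0.0
    for probs, target in zip(step_distributions, target_tokens):
        loss -= math.log(probs[target])
    return loss

# Assumed model outputs at each step, given the ground-truth prefix
step_distributions = [
    {"le": 0.7, "chat": 0.2, "<eos>": 0.1},
    {"le": 0.1, "chat": 0.8, "<eos>": 0.1},
    {"le": 0.05, "chat": 0.05, "<eos>": 0.9},
]
target = ["le", "chat", "<eos>"]

loss = sequence_nll(step_distributions, target)
print("loss =", round(loss, 4))
print("sequence probability =", round(math.exp(-loss), 4))
```

The loss equals the negative log of the product 0.7 × 0.8 × 0.9, so minimizing it maximizes the probability the model assigns to the reference sequence.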

💡 Training Tips

  1. Use label smoothing (0.1) to prevent overconfident predictions
  2. Apply gradient clipping at norm 1.0
  3. Start teacher forcing at 1.0 and decay to 0.0 over training
  4. Use learning rate warmup for first few thousand steps
  5. Monitor both teacher-forced loss and greedy decoding BLEU
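
Tips 1 and 2 can be illustrated without any framework. The sketch below assumes one common label smoothing convention (spread the smoothing mass uniformly over all classes) and clips a gradient by its global L2 norm; the helper names and numbers are illustrative only.

```python
import math

def smoothed_targets(target_index, vocab_size, smoothing=0.1):
    """Mix a one-hot target with a uniform distribution over the vocabulary."""
    base = smoothing / vocab_size
    dist = [base] * vocab_size
    dist[target_index] += 1.0 - smoothing
    return dist

def cross_entropy(pred_probs, target_dist):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(target_dist, pred_probs) if t > 0)

def clip_by_norm(grads, max_norm=1.0):
    """Rescale the gradient vector so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

# A very confident prediction for class 1 out of 4 (assumed values)
pred = [0.01, 0.97, 0.01, 0.01]
one_hot = [0.0, 1.0, 0.0, 0.0]
smooth = smoothed_targets(1, vocab_size=4, smoothing=0.1)

print("hard-target loss:    ", round(cross_entropy(pred, one_hot), 4))
print("smoothed-target loss:", round(cross_entropy(pred, smooth), 4))
print("clipped gradient:    ", clip_by_norm([3.0, 4.0], max_norm=1.0))
```

With smoothed targets the overconfident prediction is penalized more than with hard targets, which is the intended regularizing effect. The gradient `[3.0, 4.0]` has norm 5, so it is scaled down to norm 1 while keeping its direction.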

Practice Exercises

  1. Implement beam search: Write a beam search decoder with length normalization. Compare outputs with greedy decoding.

  2. Scheduled sampling: Modify the training loop to use scheduled sampling with linear decay of $\epsilon$ from 1.0 to 0.0.

  3. Transformer decoder: Replace the RNN decoder with a Transformer decoder. Compare performance and training speed.

  4. Summarization task: Train a seq2seq model on CNN/DailyMail dataset for text summarization. Evaluate with ROUGE scores.


Key Takeaways

📋Summary: Seq2Seq Models

  • Encoder-decoder architecture maps variable-length input to output
  • Context vector compresses entire input into fixed-length representation
  • Teacher forcing accelerates training but causes exposure bias
  • Greedy decoding is fast but only locally optimal, so the overall sequence can be suboptimal
  • Beam search finds better sequences by maintaining $k$ candidates
  • Attention mechanisms address the information bottleneck problem
  • Seq2seq models are foundational for translation, summarization, dialogue
  • Modern systems use Transformers instead of RNNs for better parallelism
  • BLEU score is standard evaluation for machine translation
  • See also: Transformers for the attention-based architecture
