Overview
- Encoder-decoder models map input sequences to output sequences.
- They are the standard seq2seq architecture for translation, summarization, and speech transcription.
Core Idea
- Encoder builds a latent representation.
- Decoder generates the target sequence.
- Decoder uses both:
- Causal self-attention over previous target tokens,
- Cross-attention over encoder outputs.
Transformer Encoder-Decoder Mechanics
Encoder
- Input tokens are embedded and combined with positional encodings.
- A stack of self-attention + FFN blocks produces the contextual memory (the encoder outputs).
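The embedding step can be sketched as follows. This is a minimal illustration, assuming fixed sinusoidal positional encodings (the notes do not say which kind is used; learned position embeddings are also common) and an even `d_model`. The names `sinusoidal_positions`, `embed`, and `table` are hypothetical.

```python
import numpy as np

def sinusoidal_positions(seq_len: int, d_model: int) -> np.ndarray:
    """Return a (seq_len, d_model) matrix of sinusoidal position encodings.

    Assumes d_model is even.
    """
    positions = np.arange(seq_len)[:, None]        # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]       # even feature indices
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even columns: sine
    pe[:, 1::2] = np.cos(angles)                   # odd columns: cosine
    return pe

def embed(token_ids: np.ndarray, table: np.ndarray) -> np.ndarray:
    """Look up token embeddings and add position information."""
    x = table[token_ids]                           # (seq_len, d_model)
    return x + sinusoidal_positions(len(token_ids), table.shape[1])

rng = np.random.default_rng(0)
table = rng.normal(size=(100, 16))  # hypothetical vocab of 100 tokens, d_model = 16
encoder_input = embed(np.array([5, 42, 7]), table)
print(encoder_input.shape)  # (3, 16)
```

The result is what the first encoder block would receive; the self-attention + FFN stack itself is not shown here.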
Decoder
- Uses masked self-attention over generated prefix.
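- Uses cross-attention where decoder queries attend to encoder memory:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
- Here, $Q$ comes from decoder states, and $K$ and $V$ come from encoder outputs; $d_k$ is the key dimension.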
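The two decoder attention types can be illustrated with a small NumPy sketch. It assumes single-head scaled dot-product attention and, for brevity, omits the learned query/key/value projections, residual connections, and layer normalization that a real Transformer block would have. The helper names `softmax` and `attention` are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)    # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, mask=None):
    """Scaled dot-product attention; mask is True where attention is allowed."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)            # (len_q, len_k)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # blocked positions get ~zero weight
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

rng = np.random.default_rng(0)
d = 8
memory = rng.normal(size=(5, d))   # encoder outputs for 5 source tokens
dec = rng.normal(size=(3, d))      # decoder states for 3 target tokens

# Masked self-attention: position i may only attend to positions <= i.
causal = np.tril(np.ones((3, 3), dtype=bool))
self_out, self_w = attention(dec, dec, dec, mask=causal)

# Cross-attention: queries from the decoder, keys and values from the encoder memory.
cross_out, cross_w = attention(self_out, memory, memory)

print(self_w.round(2))   # upper triangle is zero
print(cross_w.shape)     # (3, 5): each target position weights all 5 source tokens
```

Note that cross-attention needs no causal mask: the whole source sequence is available before decoding starts.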
Training Objective
- Teacher forcing feeds the ground-truth previous token to the decoder during training, rather than the model's own prediction.
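- Loss is token-level cross-entropy over the target tokens $y_1, \dots, y_T$ given the source $x$:
$$\mathcal{L} = -\sum_{t=1}^{T} \log p_\theta(y_t \mid y_{<t}, x)$$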
- Label smoothing is commonly used for better generalization.
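A minimal sketch of the training objective, assuming one common label-smoothing variant in which a mass `eps` is spread uniformly over the whole vocabulary. The token ids, the `BOS`/`EOS` values, and the random `logits` standing in for decoder outputs are all hypothetical.

```python
import numpy as np

def log_softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def smoothed_cross_entropy(logits, targets, eps=0.1):
    """Mean token-level cross-entropy with label smoothing (eps=0 disables it)."""
    vocab = logits.shape[-1]
    log_probs = log_softmax(logits)
    # Smoothed target: eps spread uniformly, remaining 1 - eps on the gold token.
    target_dist = np.full_like(log_probs, eps / vocab)
    target_dist[np.arange(len(targets)), targets] += 1.0 - eps
    return float(-(target_dist * log_probs).sum(axis=-1).mean())

BOS, EOS = 1, 2                           # hypothetical special-token ids
target = np.array([BOS, 7, 9, 4, EOS])

# Teacher forcing: the decoder sees the gold prefix and predicts the next gold token.
decoder_input = target[:-1]               # [BOS, 7, 9, 4]
labels = target[1:]                       # [7, 9, 4, EOS]

rng = np.random.default_rng(0)
logits = rng.normal(size=(len(decoder_input), 12))  # stand-in for decoder outputs

print(smoothed_cross_entropy(logits, labels, eps=0.0))  # plain cross-entropy
print(smoothed_cross_entropy(logits, labels, eps=0.1))  # with label smoothing
```

Because the gold prefix is known in advance, all target positions can be scored in one parallel pass during training.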
Inference Workflow
Step 1: Encode source sequence
- Run the input sequence through the encoder once.
- Cache the encoder outputs and reuse them at every decoding step.
Step 2: Start decoding
- Initialize the decoder input with the start-of-sequence token.
- Predict the next-token distribution from the current prefix.
Step 3: Search strategy
- Greedy decoding: fastest, less diverse.
- Beam search: usually finds higher-probability sequences than greedy decoding, at higher compute cost.
- Sampling (top-k/top-p): more diverse generation.
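The difference between greedy selection and top-k/top-p sampling can be sketched on a single next-token distribution. This is an illustrative implementation under one common convention (the token that crosses the cumulative threshold `p` is kept); `top_k_top_p_filter` and the example probabilities are hypothetical.

```python
import numpy as np

def top_k_top_p_filter(probs, k=0, p=1.0):
    """Zero out tokens outside the top-k set and the top-p nucleus, then renormalize."""
    order = np.argsort(probs)[::-1]            # token ids, most probable first
    sorted_probs = probs[order]
    keep = np.ones(len(probs), dtype=bool)     # indexed in sorted order
    if k > 0:
        keep[k:] = False                       # keep only the k most probable tokens
    if p < 1.0:
        cumulative = np.cumsum(sorted_probs)
        # Keep a token if the mass strictly before it is still below p.
        keep &= (cumulative - sorted_probs) < p
    filtered = np.zeros_like(probs)
    filtered[order[keep]] = sorted_probs[keep]
    return filtered / filtered.sum()

probs = np.array([0.5, 0.3, 0.15, 0.05])       # hypothetical next-token distribution

greedy_token = int(np.argmax(probs))           # greedy: always the most probable token
top2 = top_k_top_p_filter(probs, k=2)          # mass only on tokens 0 and 1
nucleus = top_k_top_p_filter(probs, p=0.7)     # smallest prefix reaching 0.7 mass

rng = np.random.default_rng(0)
sampled_token = int(rng.choice(len(probs), p=nucleus))
print(greedy_token, top2, nucleus, sampled_token)
```

Beam search is not shown here because it operates on whole prefixes rather than on a single distribution.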
Step 4: Stop condition
- Stop at end-of-sequence token or maximum length.
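The four steps above can be tied together in a short greedy decoding loop. The "model" here is a hypothetical stand-in that simply copies the source tokens and then emits the end token, so that the control flow is runnable without trained weights; `toy_encode`, `toy_decode_step`, and the token ids are assumptions made for illustration.

```python
import numpy as np

BOS, EOS, VOCAB = 0, 1, 10   # hypothetical special tokens and vocabulary size

def toy_encode(source):
    """Stand-in encoder: a real model would return contextual vectors."""
    return np.array(source)

def toy_decode_step(memory, prefix):
    """Stand-in decoder: next-token logits given encoder memory and the prefix."""
    logits = np.zeros(VOCAB)
    step = len(prefix) - 1                    # tokens generated so far
    next_token = memory[step] if step < len(memory) else EOS
    logits[next_token] = 5.0
    return logits

def greedy_decode(source, max_len=20):
    memory = toy_encode(source)               # Step 1: encode once, reuse every step
    prefix = [BOS]                            # Step 2: start token
    while len(prefix) - 1 < max_len:          # Step 4: maximum-length stop
        logits = toy_decode_step(memory, prefix)
        token = int(np.argmax(logits))        # Step 3: greedy choice
        if token == EOS:                      # Step 4: end-of-sequence stop
            break
        prefix.append(token)
    return prefix[1:]                         # drop the start token

print(greedy_decode([4, 7, 3]))               # [4, 7, 3]
print(greedy_decode([4, 7, 3], max_len=2))    # [4, 7]
```

Swapping the `argmax` line for a sampling or beam-search step changes the search strategy without touching the rest of the loop.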
Use Cases
- Machine translation.
- Summarization.
- Vision-language modeling.
- Speech recognition (audio encoder + text decoder).
Practical Notes
Decoding Tradeoffs
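- A larger beam size usually improves sequence quality, with diminishing returns, but increases compute and latency because more candidate prefixes are scored at every step.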
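A toy example makes the tradeoff concrete. The sketch below runs beam search over a hand-written table of next-token probabilities (entirely hypothetical) in which the greedy first choice leads to a worse overall sequence. It uses raw cumulative log-probability with no length normalization, which real systems often add.

```python
import math

# Hypothetical next-token distributions, keyed by the prefix generated so far.
TABLE = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.5, "y": 0.5},
    ("B",): {"x": 0.9, "y": 0.1},
}

def next_log_probs(prefix):
    # Any prefix not in the table ends the sequence with probability 1.
    dist = TABLE.get(tuple(prefix), {"<eos>": 1.0})
    return {tok: math.log(p) for tok, p in dist.items()}

def beam_search(beam_size, max_len=5):
    beams = [((), 0.0)]                       # (prefix, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:           # work per step grows with beam_size
            for tok, lp in next_log_probs(prefix).items():
                if tok == "<eos>":
                    finished.append((prefix, score + lp))
                else:
                    candidates.append((prefix + (tok,), score + lp))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]        # keep only the best beam_size prefixes
    best = max(finished or beams, key=lambda c: c[1])
    return list(best[0]), math.exp(best[1])

print(beam_search(beam_size=1))   # greedy path: A then x, probability about 0.30
print(beam_search(beam_size=2))   # finds B then x, probability about 0.36
```

With `beam_size=1` the search commits to "A" and cannot recover; with `beam_size=2` it keeps "B" alive long enough to find the better sequence, at the price of scoring twice as many prefixes per step.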
Training-Inference Mismatch
- Exposure bias can appear because training uses teacher forcing but inference uses model outputs.
Long-Context Cost
- For long inputs, memory can become a bottleneck because standard self-attention cost grows quadratically with sequence length.
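A back-of-the-envelope calculation shows how quickly this grows. The sketch assumes a naive implementation that materializes the full attention score matrix for every head, 16 heads, and 2 bytes per value (fp16); these numbers are illustrative assumptions, and memory-efficient attention kernels avoid storing the full matrix.

```python
def attention_scores_bytes(seq_len, num_heads, bytes_per_value=2):
    """Bytes needed to hold one layer's (seq_len x seq_len) score matrices."""
    return seq_len * seq_len * num_heads * bytes_per_value

for n in (1024, 4096, 8192):
    mib = attention_scores_bytes(n, num_heads=16) / 2**20
    print(f"{n:>5} tokens -> {mib:,.0f} MiB per layer")
```

Under these assumptions, doubling the input length from 4096 to 8192 tokens quadruples the score-matrix memory, from 512 MiB to 2,048 MiB per layer.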