Overview
- Vision Transformers (ViT) treat an image as a sequence of patch tokens.
- Self-attention models global context from the start.
- Strong performance with large-scale pretraining.
Core Idea
Patchify
- Split the image into fixed-size patches.
- Flatten each patch and project to an embedding.
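If the image is $H \times W$ and the patch size is $P \times P$, then $N = HW / P^2$, where:
- $N$: number of patch tokens.
- $P$: patch size.
- Each patch has raw dimension $P^2 C$ (with $C$ input channels).
- Linear projection maps each flattened patch $x_p^i \in \mathbb{R}^{P^2 C}$ to model dimension $D$: $x_p^i E \in \mathbb{R}^{D}$.
- $E \in \mathbb{R}^{(P^2 C) \times D}$ is the learnable patch embedding matrix.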
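The patchify and projection steps can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `patchify`, the 224x224x3 random input, and the randomly initialised matrix `E` are assumptions chosen for the example (in a real model `E` is learned, and a bias term is usually added).

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into N flattened patches of length P*P*C."""
    H, W, C = image.shape
    P = patch_size
    assert H % P == 0 and W % P == 0, "H and W must be divisible by P"
    # (H/P, P, W/P, P, C) -> (H/P, W/P, P, P, C) -> (N, P*P*C)
    patches = image.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, P * P * C)

rng = np.random.default_rng(0)
image = rng.standard_normal((224, 224, 3))       # toy input, not real data
P, D = 16, 768                                   # patch size and model dimension
x_p = patchify(image, P)                         # shape (N, P*P*C)
E = rng.standard_normal((P * P * 3, D)) * 0.02   # stand-in for the learned matrix
tokens = x_p @ E                                 # shape (N, D)
print(x_p.shape, tokens.shape)
```

Patches are ordered row by row, so patch index `i * (W // P) + j` corresponds to the block at grid row `i`, column `j`.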
Tokens + Position
- Add a learnable [CLS] token for classification.
- Add positional embeddings so the model knows where each patch came from; self-attention alone is order-agnostic.
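- Final encoder input is: $z_0 = [x_{\text{class}};\ x_p^1 E;\ x_p^2 E;\ \dots;\ x_p^N E] + E_{\text{pos}}$
- $x_{\text{class}}$ is the class token, $x_p^i E$ is the embedding of the $i$-th patch, and $E_{\text{pos}}$ is the learnable positional embedding.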
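As a small sketch of how the encoder input is assembled, the snippet below prepends a class token to a set of patch embeddings and adds positional embeddings. All arrays are random or zero placeholders (assumptions for illustration); in a trained model `x_class` and `E_pos` are learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 196, 768                                   # patch tokens, model dimension
patch_tokens = rng.standard_normal((N, D))        # stand-in for the projected patches
x_class = np.zeros((1, D))                        # class token (learnable in practice)
E_pos = rng.standard_normal((N + 1, D)) * 0.02    # one position row per token, incl. [CLS]

# Concatenate along the token axis, then add position information elementwise.
z0 = np.concatenate([x_class, patch_tokens], axis=0) + E_pos
print(z0.shape)
```

The sequence length becomes N + 1 because the class token occupies position 0.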
Transformer Encoder
- Stack multi-head self-attention + MLP blocks.
- Output at [CLS] token is used for classification.
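- Layer structure (pre-norm style): $z'_\ell = \mathrm{MSA}(\mathrm{LN}(z_{\ell-1})) + z_{\ell-1}$, then $z_\ell = \mathrm{MLP}(\mathrm{LN}(z'_\ell)) + z'_\ell$, for $\ell = 1, \dots, L$.
- Classification head: $y = \mathrm{LN}(z_L^0)$, the normalized final-layer [CLS] output, which a linear layer maps to class logits.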
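The pre-norm block and classification head can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions: biases and the learnable LayerNorm scale/shift are omitted, GELU uses the tanh approximation, weights are random, and the tiny sizes (5 tokens, dimension 8, 2 heads, 2 layers, 3 classes) are arbitrary. The helper names are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)      # learnable scale/shift omitted

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def msa(z, Wqkv, Wo, heads):
    """Multi-head self-attention over a (T, D) token matrix."""
    T, D = z.shape
    d = D // heads
    qkv = (z @ Wqkv).reshape(T, 3, heads, d).transpose(1, 2, 0, 3)  # (3, heads, T, d)
    q, k, v = qkv
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d))           # (heads, T, T)
    out = (attn @ v).transpose(1, 0, 2).reshape(T, D)               # merge heads
    return out @ Wo

def encoder_block(z, p, heads):
    z = z + msa(layer_norm(z), p["Wqkv"], p["Wo"], heads)   # attention sub-layer
    z = z + gelu(layer_norm(z) @ p["W1"]) @ p["W2"]         # MLP sub-layer
    return z

rng = np.random.default_rng(0)
T, D, heads, mlp_dim, L, classes = 5, 8, 2, 16, 2, 3

def init(*shape):
    return rng.standard_normal(shape) * 0.1

layers = [{"Wqkv": init(D, 3 * D), "Wo": init(D, D),
           "W1": init(D, mlp_dim), "W2": init(mlp_dim, D)} for _ in range(L)]

z = init(T, D)                                # stand-in for the encoder input
for p in layers:
    z = encoder_block(z, p, heads)
logits = layer_norm(z[0]) @ init(D, classes)  # head reads only the [CLS] row
print(z.shape, logits.shape)
```

Normalizing before each sub-layer (rather than after the residual sum) is what "pre-norm" refers to; the residual path itself is left untouched.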
Attention Computation
- Queries, keys, values are linear projections of tokens.
- Attention weights are scaled dot products.
- Full self-attention cost grows quadratically with the token count $N$: $O(N^2 D)$ for model dimension $D$.
- Smaller patch size increases $N$, which increases compute and memory.
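A single-head version of the computation makes the quadratic term visible: the attention weight matrix has one entry per pair of tokens. The sketch below uses the standard scaled dot-product form, softmax(Q K^T / sqrt(d_k)) V; the function name, the random weights, and the toy sizes are assumptions for illustration.

```python
import numpy as np

def attention(Z, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over tokens Z of shape (N, D)."""
    Q, K, V = Z @ W_q, Z @ W_k, Z @ W_v
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (N, N): one score per token pair
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)              # each row is a distribution
    return A @ V, A

rng = np.random.default_rng(0)
N, D, d_k = 6, 8, 4
Z = rng.standard_normal((N, D))
W_q, W_k, W_v = (rng.standard_normal((D, d_k)) for _ in range(3))
out, A = attention(Z, W_q, W_k, W_v)
print(out.shape, A.shape)
```

Doubling the number of tokens quadruples the size of `A`, which is where the quadratic growth in compute and memory comes from.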
Why It Works
- Global receptive field from the first layer.
- Scales well with data and compute.
- Pretrained ViTs transfer well to detection and segmentation.
- Patch tokens learn semantically rich features when pretrained at scale (for example, ImageNet-21k / JFT-style corpora).
Key Variants
- ViT: baseline patch + global attention.
- DeiT: strong augmentation + distillation token to train with less data.
- Swin Transformer: shifted local windows for near-linear scaling with image size.
- Hybrid ViT: CNN stem before transformer to improve local inductive bias.
When To Use
- Large datasets: ViTs shine with lots of pretraining data.
- Transfer learning: strong backbone for downstream tasks.
- Hybrid setups: CNN stem + transformer for efficiency.
- High-resolution tasks: use hierarchical/windowed variants (for example, Swin) to reduce quadratic attention cost.
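To see why windowed variants help at high resolution, the rough count below compares how many query-key pairs global attention and fixed-window attention evaluate on a token grid. It is a back-of-the-envelope sketch only: it ignores window shifting, hierarchical merging, and constant factors, and the 56x56 grid with 7x7 windows is an assumed example configuration.

```python
def global_attention_pairs(h_tokens, w_tokens):
    """Every token attends to every token: n^2 pairs."""
    n = h_tokens * w_tokens
    return n * n

def windowed_attention_pairs(h_tokens, w_tokens, window):
    """Each token attends only within its window: n * window^2 pairs."""
    if h_tokens % window or w_tokens % window:
        raise ValueError("token grid must be divisible by the window size")
    n = h_tokens * w_tokens
    return n * window * window

for side in (56, 112):   # doubling the side length of the token grid
    g = global_attention_pairs(side, side)
    w = windowed_attention_pairs(side, side, window=7)
    print(f"{side}x{side} tokens: global={g:,} windowed={w:,}")
```

Doubling each side of the grid multiplies the global count by 16 but the windowed count by only 4, which tracks the number of tokens.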
Practical Notes
Patch Size and Compute
- Patch size trades off detail vs compute.
- Smaller $P$ gives finer detail but larger $N$ and higher memory.
- Typical settings: $P = 16$ for classification baselines, smaller patches for dense tasks.
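The trade-off is easy to quantify. The short script below counts patch tokens for a 224x224 input at a few patch sizes and the number of entries in one attention matrix (per head, per layer, including the class token). The helper name and the choice of 224x224 are assumptions for illustration.

```python
def num_patch_tokens(height, width, patch_size):
    """Number of non-overlapping patch_size x patch_size patches in the image."""
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by the patch size")
    return (height // patch_size) * (width // patch_size)

for P in (32, 16, 8):
    N = num_patch_tokens(224, 224, P)
    T = N + 1                       # plus the [CLS] token
    print(f"P={P:2d}  patch tokens={N:4d}  attention entries per head={T * T:,}")
```

Halving the patch size quadruples the token count and therefore multiplies the attention matrix size by roughly 16.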
Training and Transfer
- Training recipe matters for data-limited settings.
- Use RandAugment/Mixup/CutMix, label smoothing, stochastic depth, and AdamW.
- Positional embedding interpolation is needed when fine-tuning at different image resolutions.
- For dense prediction (segmentation/detection), attach FPN/UPerNet-style heads instead of only using [CLS].
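Positional embedding interpolation can be sketched as resizing the 2D grid of patch position embeddings while leaving the class-token row unchanged. The version below uses plain bilinear interpolation with an align-corners convention, written in NumPy so it is self-contained. That choice is an assumption for illustration; real pipelines typically call a library resampling routine, and the interpolation mode may differ.

```python
import numpy as np

def resize_pos_embed(E_pos, old_grid, new_grid):
    """Resize (1 + old_grid**2, D) position embeddings to (1 + new_grid**2, D)."""
    cls_pos, patch_pos = E_pos[:1], E_pos[1:]
    D = patch_pos.shape[1]
    grid = patch_pos.reshape(old_grid, old_grid, D)

    # Sample locations in old-grid coordinates (corners map to corners).
    coords = np.linspace(0, old_grid - 1, new_grid)
    lo = np.floor(coords).astype(int)
    hi = np.minimum(lo + 1, old_grid - 1)
    w = coords - lo                                   # fractional offsets, (new_grid,)

    # Interpolate along rows, then along columns.
    rows = grid[lo] * (1 - w)[:, None, None] + grid[hi] * w[:, None, None]
    out = rows[:, lo] * (1 - w)[None, :, None] + rows[:, hi] * w[None, :, None]
    return np.concatenate([cls_pos, out.reshape(new_grid * new_grid, D)], axis=0)

rng = np.random.default_rng(0)
E_pos = rng.standard_normal((1 + 14 * 14, 768))      # e.g. 224x224 input, P = 16
E_new = resize_pos_embed(E_pos, old_grid=14, new_grid=24)   # e.g. 384x384 input
print(E_new.shape)
```

The resized table has one row per token at the new resolution, so the rest of the model can be fine-tuned without changing any other weights.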
Model Scales
- ViT-Tiny/Small: faster experimentation, lower memory.
- ViT-Base: common transfer-learning default.
- ViT-Large/Huge: best quality with large-scale pretraining and strong compute budget.
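For a rough sense of how these scales differ, the estimator below counts parameters from depth, width, and MLP width. The configurations are the commonly cited ones (Tiny and Small follow the DeiT sizes) and are stated here as assumptions, as are the 224x224 input, patch size 16 for every model, and a 1000-class head; the function name is hypothetical.

```python
def vit_param_count(depth, dim, mlp_dim, patch_size=16, image_size=224,
                    channels=3, num_classes=1000):
    """Approximate parameter count of a plain ViT classifier (weights and biases)."""
    n_tokens = (image_size // patch_size) ** 2 + 1          # patches + [CLS]
    patch_embed = patch_size * patch_size * channels * dim + dim
    cls_and_pos = dim + n_tokens * dim
    attn = 4 * dim * dim + 4 * dim                          # Q, K, V, output projection
    mlp = 2 * dim * mlp_dim + mlp_dim + dim
    norms = 2 * 2 * dim                                     # two LayerNorms per block
    final_norm = 2 * dim
    head = dim * num_classes + num_classes
    return patch_embed + cls_and_pos + depth * (attn + mlp + norms) + final_norm + head

# (depth, dim, mlp_dim): assumed standard configurations
configs = {
    "Tiny":  (12, 192, 768),
    "Small": (12, 384, 1536),
    "Base":  (12, 768, 3072),
    "Large": (24, 1024, 4096),
    "Huge":  (32, 1280, 5120),
}
for name, (depth, dim, mlp_dim) in configs.items():
    print(f"ViT-{name}: {vit_param_count(depth, dim, mlp_dim) / 1e6:.1f}M parameters")
```

Almost all parameters sit in the encoder blocks, so the count grows roughly with depth times the square of the width.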