IOAI ML Notes: Computer Vision, Deep Learning

Vision Transformers

How vision transformers model images as token sequences.

Syllabus Map


Overview


Core Idea

Patchify

If the image is $H \times W$ with $C$ channels and the patch size is $P \times P$:

$$N = \frac{HW}{P^2}$$

$$x_p \in \mathbb{R}^{N \times (P^2 C)}, \quad z_0 = x_p E \in \mathbb{R}^{N \times D}$$
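The patchify step can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the sizes (a 32×32 RGB image, `P = 8`, `D = 64`), the helper name `patchify`, and the randomly initialised `E` (standing in for the learned projection) are all assumptions chosen for the example.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into N = HW / P^2 flattened patches."""
    H, W, C = image.shape
    P = patch_size
    assert H % P == 0 and W % P == 0, "H and W must be divisible by P"
    # (H/P, P, W/P, P, C) -> (H/P, W/P, P, P, C) -> (N, P*P*C)
    patches = image.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, P * P * C)

rng = np.random.default_rng(0)
H, W, C, P, D = 32, 32, 3, 8, 64                 # illustrative sizes (assumed)
image = rng.standard_normal((H, W, C))

x_p = patchify(image, P)                         # (N, P^2 C) = (16, 192)
E = rng.standard_normal((P * P * C, D)) * 0.02   # random stand-in for the learned projection
z_0 = x_p @ E                                    # (N, D) = (16, 64)

print(x_p.shape, z_0.shape)                      # (16, 192) (16, 64)
```

Each row of `x_p` is one patch read in row-major order, so the first row equals the top-left $P \times P$ block of the image flattened.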

Tokens + Position

$$z_0 = [x_{\text{cls}}; x_p E] + E_{\text{pos}}$$
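A small sketch of this step, assuming the class token is prepended as the first row and that $E_{\text{pos}}$ has one row per token, i.e. shape $(N + 1) \times D$, so that the addition is well defined. In a trained model both `x_cls` and `E_pos` would be learned; the zeros and random values below are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 16, 64                                    # illustrative sizes (assumed)

patch_tokens = rng.standard_normal((N, D))       # stands in for x_p E
x_cls = np.zeros((1, D))                         # learned in practice; zeros here
E_pos = rng.standard_normal((N + 1, D)) * 0.02   # learned in practice; random here

# Prepend the class token, then add one position embedding per token.
z_0 = np.concatenate([x_cls, patch_tokens], axis=0) + E_pos

print(z_0.shape)                                 # (17, 64)
```

The sequence length grows from $N$ to $N + 1$, which is why the encoder sketches elsewhere in these notes operate on 17 tokens for a 16-patch image.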

Transformer Encoder

$$z_l' = z_{l-1} + \text{MSA}(\text{LN}(z_{l-1}))$$

$$z_l = z_l' + \text{MLP}(\text{LN}(z_l'))$$

$$\hat{y} = \text{softmax}(W z_L^{\text{cls}})$$
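The three equations above map onto code fairly directly. The sketch below is a simplified NumPy version under several assumptions: weights are random rather than trained, layer norm has no learned scale or shift, biases are omitted, and the MLP uses ReLU for brevity (ViT implementations commonly use GELU). The parameter names (`Wq`, `Wk`, `Wv`, `Wo`, `W1`, `W2`, `W_head`) and the sizes are illustrative only.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalise each token over its feature dimension (no learned scale/shift here).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)      # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def msa(x, Wq, Wk, Wv, Wo, num_heads):
    # Multi-head self-attention over a (T, D) token matrix.
    T, D = x.shape
    d_k = D // num_heads
    def split(t):                                # (T, D) -> (heads, T, d_k)
        return t.reshape(T, num_heads, d_k).transpose(1, 0, 2)
    q, k, v = split(x @ Wq), split(x @ Wk), split(x @ Wv)
    weights = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_k))  # (heads, T, T)
    heads = weights @ v                          # (heads, T, d_k)
    return heads.transpose(1, 0, 2).reshape(T, D) @ Wo

def mlp(x, W1, W2):
    return np.maximum(x @ W1, 0.0) @ W2          # ReLU for brevity (assumption)

def encoder_block(z_prev, p, num_heads):
    # z_l' = z_{l-1} + MSA(LN(z_{l-1}))
    z_mid = z_prev + msa(layer_norm(z_prev), p["Wq"], p["Wk"], p["Wv"], p["Wo"], num_heads)
    # z_l = z_l' + MLP(LN(z_l'))
    return z_mid + mlp(layer_norm(z_mid), p["W1"], p["W2"])

rng = np.random.default_rng(0)
T, D, heads, depth, num_classes = 17, 64, 4, 2, 10   # illustrative sizes (assumed)

def init_params():
    shapes = {"Wq": (D, D), "Wk": (D, D), "Wv": (D, D), "Wo": (D, D),
              "W1": (D, 4 * D), "W2": (4 * D, D)}
    return {name: rng.standard_normal(s) * 0.02 for name, s in shapes.items()}

z = rng.standard_normal((T, D))                  # stands in for z_0
for _ in range(depth):                           # L = depth stacked blocks
    z = encoder_block(z, init_params(), heads)

W_head = rng.standard_normal((num_classes, D)) * 0.02
y_hat = softmax(W_head @ z[0])                   # classify from the class token (row 0)

print(z.shape, y_hat.shape)                      # (17, 64) (10,)
```

Note that layer norm is applied before each sublayer and the residual path adds the unnormalised input, matching the pre-norm form of the equations.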

Attention Computation

$$\text{Attention}(Q,K,V)=\text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

$$\mathcal{O}(N^2 D)$$
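To make the quadratic term concrete, the sketch below implements the attention formula for a single head and prints the size of the score matrix for two patch sizes. The 224×224 input resolution, `D = 64`, and the random `Q`, `K`, `V` are assumptions for illustration; the class token is left out so that the token count is exactly $N$.

```python
import numpy as np

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                          # (N, N): one score per token pair
    scores = scores - scores.max(axis=-1, keepdims=True)     # stabilise exp
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(0)
D = 64
for P in (16, 8):                                # patch sizes for an assumed 224x224 image
    N = (224 // P) ** 2
    Q, K, V = (rng.standard_normal((N, D)) for _ in range(3))
    out, weights = attention(Q, K, V)
    print(P, N, weights.shape, out.shape)
# 16 196 (196, 196) (196, 64)
# 8 784 (784, 784) (784, 64)
```

Halving the patch size quadruples $N$ (196 to 784) and therefore multiplies the number of entries in the $N \times N$ score matrix by 16, which is where the $N^2$ factor in the cost comes from.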

Why It Works


Key Variants


When To Use


Practical Notes

Patch Size and Compute

Training and Transfer

Model Scales
