IOAI ML Notes: Neural Network, Deep Learning

Attention and Transformers

Attention mechanisms and transformer architectures for sequence modelling.

Syllabus Map


Overview


Scaled Dot-Product Attention

Definition

$$\text{Attention}(Q,K,V)=\text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
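To make the formula concrete, here is a minimal NumPy sketch. The function names `softmax` and `scaled_dot_product_attention`, the toy shapes, and the random inputs are illustrative assumptions, not part of the notes.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtracting the row max avoids overflow and does not change the result.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> output (n_q, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_q, n_k) similarity of each query to each key
    weights = softmax(scores, axis=-1)   # each row is a distribution over the keys
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))   # 3 queries, d_k = 4
K = rng.normal(size=(5, 4))   # 5 keys
V = rng.normal(size=(5, 2))   # 5 values, d_v = 2
out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (3, 2)
```

Each output row is a weighted average of the rows of `V`, with weights given by the softmax over that query's scores.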

Intuition


Multi-Head Attention

$$\text{head}_i=\text{Attention}(QW_i^Q,KW_i^K,VW_i^V)$$

$$\text{MHA}(Q,K,V)=\text{Concat}(\text{head}_1,\dots,\text{head}_h)W^O$$
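The two equations can be sketched in NumPy as follows. The stacked weight layout `(h, d_model, d_head)`, the choice `d_head = d_model // h`, and all helper names are assumptions made for this illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # stabilise before exponentiating
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o):
    # W_q, W_k, W_v: (h, d_model, d_head), one projection per head.
    # W_o: (h * d_head, d_model), mixes the concatenated heads.
    heads = [
        attention(Q @ W_q[i], K @ W_k[i], V @ W_v[i])
        for i in range(W_q.shape[0])
    ]
    return np.concatenate(heads, axis=-1) @ W_o

rng = np.random.default_rng(0)
n, d_model, h = 6, 16, 4
d_head = d_model // h
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v = (rng.normal(size=(h, d_model, d_head)) for _ in range(3))
W_o = rng.normal(size=(h * d_head, d_model))

# Self-attention: queries, keys and values all come from the same sequence X.
out = multi_head_attention(X, X, X, W_q, W_k, W_v, W_o)
print(out.shape)  # (6, 16)
```

A loop over heads keeps the correspondence with the equations obvious; practical implementations usually batch the heads into a single tensor operation.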

KV-Head Design Variants

Multi-Head Attention (MHA)

$$Q_i=XW_i^Q,\quad K_i=XW_i^K,\quad V_i=XW_i^V,\quad \text{head}_i=\text{Attn}(Q_i,K_i,V_i)$$

Multi-Query Attention (MQA)

$$Q_i=XW_i^Q,\quad K=XW^K,\quad V=XW^V,\quad \text{head}_i=\text{Attn}(Q_i,K,V)$$

Grouped-Query Attention (GQA)

$$Q_i=XW_i^Q,\quad K_{g}=XW_{g}^K,\quad V_{g}=XW_{g}^V,\quad \text{head}_i=\text{Attn}(Q_i,K_{g(i)},V_{g(i)})$$
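The three variants so far differ only in how many key/value projections exist, so one sketch can cover them: with as many groups as heads it behaves like MHA, and with a single group it behaves like MQA. The mapping `g(i) = i // heads_per_group` (consecutive heads share a group) and every name below are assumptions for illustration; the output projection is omitted to match the equations in this section.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def grouped_query_attention(X, W_q, W_k, W_v):
    # W_q: (h, d_model, d_head); W_k, W_v: (n_groups, d_model, d_head).
    h, n_groups = W_q.shape[0], W_k.shape[0]
    assert h % n_groups == 0, "heads must divide evenly into groups"
    heads_per_group = h // n_groups
    # Keys and values are computed once per group, not once per head.
    K = [X @ W_k[g] for g in range(n_groups)]
    V = [X @ W_v[g] for g in range(n_groups)]
    heads = []
    for i in range(h):
        g = i // heads_per_group  # assumed g(i): consecutive heads share a group
        heads.append(attention(X @ W_q[i], K[g], V[g]))
    return np.concatenate(heads, axis=-1)

rng = np.random.default_rng(0)
n, d_model, h, d_head = 5, 16, 4, 4
X = rng.normal(size=(n, d_model))
W_q = rng.normal(size=(h, d_model, d_head))

shapes = {}
for n_groups in (4, 2, 1):  # 4 -> MHA, 2 -> GQA, 1 -> MQA
    W_k = rng.normal(size=(n_groups, d_model, d_head))
    W_v = rng.normal(size=(n_groups, d_model, d_head))
    shapes[n_groups] = grouped_query_attention(X, W_q, W_k, W_v).shape
print(shapes)
```

The output shape is the same in all three cases; what shrinks is the number of distinct key/value tensors that must be computed and stored.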

Multi-head Latent Attention (MLA)

$$Z=XW^{KV}_{\downarrow},\quad K=ZW^{K}_{\uparrow},\quad V=ZW^{V}_{\uparrow}$$

$$Q_i=XW_i^Q,\quad \text{head}_i=\text{Attn}(Q_i,K,V)$$
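A minimal sketch of the MLA equations exactly as written here, where $K$ and $V$ carry no head index and are therefore shared across heads. The names `W_down`, `W_k_up`, `W_v_up` and the latent width `d_latent` are assumptions chosen for this illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def latent_attention(X, W_q, W_down, W_k_up, W_v_up):
    # W_down: (d_model, d_latent) compresses each token into a small latent Z.
    # W_k_up, W_v_up: (d_latent, d_head) expand Z back into keys and values.
    Z = X @ W_down
    K = Z @ W_k_up
    V = Z @ W_v_up
    heads = [attention(X @ W_q[i], K, V) for i in range(W_q.shape[0])]
    return np.concatenate(heads, axis=-1), Z

rng = np.random.default_rng(0)
n, d_model, h, d_head, d_latent = 5, 16, 4, 4, 3
X = rng.normal(size=(n, d_model))
W_q = rng.normal(size=(h, d_model, d_head))
W_down = rng.normal(size=(d_model, d_latent))
W_k_up = rng.normal(size=(d_latent, d_head))
W_v_up = rng.normal(size=(d_latent, d_head))

out, Z = latent_attention(X, W_q, W_down, W_k_up, W_v_up)
print(out.shape, Z.shape)  # (5, 16) (5, 3)
```

Because both `K` and `V` are recomputed from `Z`, only the `(n, d_latent)` latent needs to be kept around to rebuild them later.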

Transformer Block

$$X' = X + \text{MHA}(\text{LN}(X))$$

$$Y = X' + \text{FFN}(\text{LN}(X'))$$
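The two residual updates can be sketched as below. To keep the structure visible, the attention and feed-forward sublayers are passed in as plain callables; `toy_attn` (one head, identity projections) and `toy_ffn` (ReLU, no biases) are stand-ins invented for this example, and `layer_norm` omits the learned scale and shift.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalise each token vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def transformer_block(X, attn, ffn):
    X_prime = X + attn(layer_norm(X))        # X' = X + MHA(LN(X))
    Y = X_prime + ffn(layer_norm(X_prime))   # Y  = X' + FFN(LN(X'))
    return Y

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W1 = rng.normal(size=(8, 32))
W2 = rng.normal(size=(32, 8))

def toy_attn(H):
    # Hypothetical stand-in for MHA: single head with identity projections.
    scores = H @ H.T / np.sqrt(H.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ H

def toy_ffn(H):
    # Hypothetical stand-in for the FFN: ReLU between two linear maps.
    return np.maximum(0.0, H @ W1) @ W2

Y = transformer_block(X, toy_attn, toy_ffn)
print(Y.shape)  # (4, 8)
```

Normalisation is applied to the input of each sublayer while the residual path carries `X` through unchanged, so a sublayer that outputs zeros leaves the block as the identity.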

How FFN Works

$$\text{FFN}(x)=W_2\,\sigma(W_1x+b_1)+b_2$$
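A small sketch of the FFN applied independently to every position. The formula leaves $\sigma$ unspecified; ReLU is assumed here, and the hidden width `d_ff = 4 * d_model` is likewise just an illustrative choice.

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    # x: (..., d_model); W1: (d_ff, d_model); W2: (d_model, d_ff).
    # For row vectors, W1 x + b1 becomes x @ W1.T + b1.
    hidden = np.maximum(0.0, x @ W1.T + b1)  # sigma = ReLU (assumption)
    return hidden @ W2.T + b2

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32
W1, b1 = rng.normal(size=(d_ff, d_model)), rng.normal(size=d_ff)
W2, b2 = rng.normal(size=(d_model, d_ff)), rng.normal(size=d_model)

X = rng.normal(size=(5, d_model))  # 5 positions
Y = ffn(X, W1, b1, W2, b2)
print(Y.shape)  # (5, 8)
```

No position sees any other position inside `ffn`: row `i` of `Y` depends only on row `i` of `X`.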

Positional Information


Encoder vs Decoder Attention

Encoder Self-Attention

Decoder Self-Attention (Causal)
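One common way to realise causal self-attention is to set the scores of future positions to $-\infty$ before the softmax, so they receive zero weight. The sketch below assumes that approach; the function name and toy shapes are illustrative.

```python
import numpy as np

def causal_self_attention(Q, K, V):
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    allowed = np.tril(np.ones((n, n), dtype=bool))  # position i may attend to j <= i
    scores = np.where(allowed, scores, -np.inf)     # future positions get -inf
    # The diagonal is always allowed, so every row has a finite maximum.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)                        # exp(-inf) = 0
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d = 4, 8
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
out, weights = causal_self_attention(Q, K, V)
print(np.round(weights, 2))  # lower-triangular: zeros above the diagonal
```

The first position can only attend to itself, so its output is exactly the first row of `V`.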

Cross-Attention


Complexity and Scaling

$$\mathcal{O}(n^2 d)$$
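As a rough worked example of the quadratic term, the count below tallies the multiply-adds in the two matrix products of a single attention head ($QK^\top$ and the weighted sum over $V$). It ignores the softmax, projections, and constant factors, and the sizes are arbitrary illustrative choices.

```python
def attention_multiply_adds(n, d):
    # QK^T: n * n dot products of length d.
    # weights @ V: another n * n * d multiply-adds.
    return 2 * n * n * d

short = attention_multiply_adds(1024, 64)
long = attention_multiply_adds(2048, 64)
print(short)          # 134217728
print(long)           # 536870912
print(long // short)  # 4
```

Doubling the sequence length $n$ quadruples the cost, while doubling $d$ only doubles it.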

Practical Notes

Architecture and Context

Inference Behavior
