Overview

Masked Language Modeling (MLM) is a self-supervised objective for learning contextual token representations.
A subset of tokens is masked, and the model predicts original tokens using both left and right context.
MLM is widely used to pretrain encoder models such as BERT-style architectures.

Let $\mathcal{M}$ be the set of masked token positions in the sequence $x_{1:T}$.
A “mask” means we hide the original token at position $i \in \mathcal{M}$ from the model input and ask the model to recover it.
In practice, selected tokens are usually replaced by [MASK], but the prediction target remains the original token.
$x_{\backslash \mathcal{M}}$ denotes the visible context after masking the positions in $\mathcal{M}$.
The model minimizes negative log-likelihood over masked positions:

$$
\mathcal{L}_{\text{MLM}} = -\sum_{i \in \mathcal{M}} \log P_\theta(x_i \mid x_{\backslash \mathcal{M}})
$$

Only masked positions contribute to the loss, while unmasked tokens provide context.
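To make the objective concrete, here is a minimal pure-Python sketch of the loss. The function name `mlm_loss` and the list-of-lists layout for the model scores are assumptions made for illustration; a real model would produce these scores as a tensor. The sketch sums over masked positions, matching the formula above; many implementations average instead.

```python
import math

def mlm_loss(logits, targets, masked_positions):
    """Summed negative log-likelihood over masked positions only.

    logits: list of length T; logits[i] is a list of V unnormalized scores.
    targets: list of length T holding the original token ids x_i.
    masked_positions: iterable of indices i in M.
    """
    total = 0.0
    for i in masked_positions:
        row = logits[i]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(row)
        log_z = m + math.log(sum(math.exp(s - m) for s in row))
        # log P(x_i | context) = score of the true token minus log normalizer.
        total -= row[targets[i]] - log_z
    return total

# Uniform scores over a vocabulary of 4 tokens, with positions 0 and 2 masked:
# each masked position contributes log(4), so the loss equals 2 * log(4).
example_logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
example_loss = mlm_loss(example_logits, targets=[1, 2, 3], masked_positions=[0, 2])
print(example_loss)
```

Changing the scores at position 1 leaves the result unchanged, because that position is not in the masked set.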

Bidirectional context helps capture syntax and semantics more effectively than one-directional prediction.
The encoder learns reusable representations for classification, tagging, and retrieval.

Sample a small fraction of input tokens for corruption (commonly 15%).
For selected positions, a typical replacement split (the one used by BERT) is:
- replace with [MASK] 80% of the time,
- replace with a random token 10% of the time,
- keep the original token 10% of the time.
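The corruption procedure above can be sketched as follows. This is an illustrative version only: it assumes the 80/10/10 split used by BERT, a hypothetical `MASK_ID` of 0, and integer token ids. The function name `corrupt_for_mlm` is made up for this example, and real pipelines typically also exclude special tokens from selection and from random replacement.

```python
import random

MASK_ID = 0  # hypothetical id for the [MASK] token

def corrupt_for_mlm(tokens, vocab_size, mask_id=MASK_ID, select_prob=0.15, rng=None):
    """Return (corrupted_tokens, labels) for one sequence.

    labels[i] is the original token at selected positions and None elsewhere,
    so only selected positions would contribute to the loss.
    """
    if rng is None:
        rng = random.Random()
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= select_prob:
            continue  # position not selected: stays visible, no loss
        labels[i] = tok  # the target is always the original token
        r = rng.random()
        if r < 0.8:
            corrupted[i] = mask_id  # replace with [MASK]
        elif r < 0.9:
            corrupted[i] = rng.randrange(vocab_size)  # replace with a random token
        # otherwise: keep the original token in the input
    return corrupted, labels

corrupted, labels = corrupt_for_mlm([12, 7, 45, 3, 99, 21], vocab_size=1000,
                                    rng=random.Random(0))
print(corrupted, labels)
```

Passing an explicit `random.Random` instance keeps the corruption reproducible, which helps when debugging a data pipeline.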

This mixed replacement scheme reduces the mismatch between pretraining, where [MASK] appears in the input, and downstream text, where it does not.
It also encourages robust contextual reasoning instead of shortcut reliance on a single mask token.

After pretraining, fine-tune with a task-specific objective (classification, token labeling, retrieval).
Optionally, run domain-adaptive MLM before the final fine-tuning.

Continuing MLM on in-domain text often improves the handling of specialized vocabulary.
Gains are strongest when the downstream data distribution differs from general web text.
Stop early if validation metrics indicate over-specialization.
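One simple way to act on that stopping advice is a patience rule on a held-out validation loss. The sketch below is an assumption, not a prescribed recipe: the function name `should_stop`, the `patience` and `min_delta` parameters, and the choice of validation loss as the monitored metric are all illustrative.

```python
def should_stop(val_losses, patience=2, min_delta=0.0):
    """Return True if the last `patience` evaluations failed to improve.

    val_losses: validation losses in evaluation order (lower is better).
    An evaluation counts as an improvement only if it beats the best
    earlier loss by more than `min_delta`.
    """
    if len(val_losses) <= patience:
        return False  # not enough history yet
    best_before = min(val_losses[:-patience])
    recent = val_losses[-patience:]
    return all(v >= best_before - min_delta for v in recent)

# Hypothetical loss history: improvement stalls after the third evaluation.
history = [1.0, 0.8, 0.7, 0.75, 0.9]
print(should_stop(history, patience=2))
```

Which metric to monitor is a design choice: tracking a downstream validation set, or a general-domain held-out set, would target over-specialization more directly than in-domain MLM loss alone.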