
Vision-Text Encoders

Joint vision-language embedding models.

Syllabus Map


Overview


Core Idea

$$v = \frac{f_\theta(I)}{\|f_\theta(I)\|_2}, \quad t = \frac{g_\phi(T)}{\|g_\phi(T)\|_2}$$

$$s_{ij} = \tau \, v_i^\top t_j$$
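A minimal NumPy sketch of the two formulas above: each encoder output is scaled to unit L2 norm, and the score is the dot product of the normalized vectors multiplied by the scale factor τ. The function names and the toy arrays are assumptions for illustration; the arrays stand in for real encoder outputs f_θ(I) and g_φ(T).

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row of x to unit L2 norm (eps guards against division by zero)."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)

def similarity_scores(image_feats, text_feats, tau=1.0):
    """Return the matrix s with s[i, j] = tau * v_i . t_j."""
    v = l2_normalize(image_feats)  # v = f(I) / ||f(I)||_2
    t = l2_normalize(text_feats)   # t = g(T) / ||g(T)||_2
    return tau * (v @ t.T)

# Toy stand-ins for encoder outputs (hypothetical values).
image_feats = np.array([[3.0, 4.0], [1.0, 0.0]])
text_feats = np.array([[6.0, 8.0], [0.0, 2.0]])
s = similarity_scores(image_feats, text_feats, tau=1.0)
```

Because both vectors have unit norm, each entry of `s` is a cosine similarity scaled by `tau`, so rescaling the raw features leaves the scores unchanged.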

Contrastive Training Objective

$$\mathcal{L}_{i\rightarrow t} = -\frac{1}{N}\sum_{i=1}^N \log \frac{\exp(s_{ii})}{\sum_{j=1}^N \exp(s_{ij})}$$

$$\mathcal{L}_{t\rightarrow i} = -\frac{1}{N}\sum_{i=1}^N \log \frac{\exp(s_{ii})}{\sum_{j=1}^N \exp(s_{ji})}$$

$$\mathcal{L} = \frac{1}{2}\left(\mathcal{L}_{i\rightarrow t}+\mathcal{L}_{t\rightarrow i}\right)$$
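The objective above can be sketched in NumPy as two cross-entropy terms over the same N x N score matrix: one normalizes each row (image to text), the other each column (text to image), and the matched pairs lie on the diagonal. The function names are assumptions, and the score matrix is taken as given; this is an illustration of the formula, not a training loop.

```python
import numpy as np

def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    return shifted - np.log(np.sum(np.exp(shifted), axis=axis, keepdims=True))

def symmetric_contrastive_loss(s):
    """s: (N, N) score matrix with s[i, j] = tau * v_i . t_j."""
    # Image -> text: for image i, the denominator sums over texts j (row i).
    loss_i2t = -np.mean(np.diag(log_softmax(s, axis=1)))
    # Text -> image: for text i, the denominator sums over images j (column i).
    loss_t2i = -np.mean(np.diag(log_softmax(s, axis=0)))
    return 0.5 * (loss_i2t + loss_t2i)

# With uninformative scores, every pair is equally likely and the loss is log N.
uniform_loss = symmetric_contrastive_loss(np.zeros((4, 4)))
```

A score matrix with a dominant diagonal drives the loss toward zero, which is the behaviour the objective rewards.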

Example Models


CLIP Step-by-Step

Training

Step 1: Build paired mini-batch

Step 2: Encode image and text

Step 3: Compute similarity matrix

$$S_{ij} = \tau \, v_i^\top t_j$$

Step 4: Compute symmetric contrastive loss

Step 5: Update both encoders

Inference

Step 1: Prepare class prompts or queries

Step 2: Encode once, compare many

Step 3: Rank by cosine similarity

$$\hat{y} = \arg\max_c \; v^\top t_c$$

Step 4: Optional prompt ensembling
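One common way to ensemble prompts, sketched here under assumptions: embed several prompt templates per class, normalize each embedding, average over templates, and renormalize so the class vector is again unit length. The function names and the input layout `(num_classes, num_prompts, dim)` are hypothetical, and the toy array stands in for real text-encoder outputs.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit L2 norm."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def ensemble_class_embeddings(prompt_embeds):
    """(num_classes, num_prompts, dim) -> (num_classes, dim) class vectors."""
    per_prompt = l2_normalize(prompt_embeds)  # normalize every prompt embedding
    mean = per_prompt.mean(axis=1)            # average over the prompt templates
    return l2_normalize(mean)                 # renormalize for cosine scoring

# Two classes, two prompt templates each (hypothetical embeddings).
prompt_embeds = np.array([
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 0.0], [4.0, 0.0]],
])
class_embeds = ensemble_class_embeddings(prompt_embeds)
```

The resulting `class_embeds` can be used in place of single-prompt text embeddings t_c when ranking.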


Inference Patterns

Zero-Shot Classification

$$\hat{y} = \arg\max_{c} \; v^\top t_c$$
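The argmax rule above, as a small NumPy sketch: class text embeddings are computed once, and each image embedding is assigned the class with the highest cosine similarity. The function names and toy embeddings are assumptions; in practice the rows would come from the text and image encoders.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row of x to unit L2 norm."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def zero_shot_predict(image_embeds, class_embeds):
    """Return (predicted class index per image, full score matrix)."""
    v = l2_normalize(image_embeds)   # (num_images, dim)
    t = l2_normalize(class_embeds)   # (num_classes, dim), encoded once and reused
    scores = v @ t.T                 # scores[n, c] = v_n . t_c
    return scores.argmax(axis=1), scores

# Hypothetical embeddings: two images, two class prompts.
image_embeds = np.array([[0.9, 0.1], [0.2, 0.8]])
class_embeds = np.array([[1.0, 0.0], [0.0, 1.0]])
preds, scores = zero_shot_predict(image_embeds, class_embeds)
```

The scale τ is omitted because multiplying all scores by a positive constant does not change the argmax.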

Image-to-Text Retrieval

Text-to-Image Retrieval
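Both retrieval directions reduce to the same operation: score every query against every gallery item and keep the top k. The sketch below assumes precomputed embeddings; `retrieve_top_k` is a hypothetical helper name. Pass image embeddings as queries and text embeddings as the gallery for image-to-text retrieval, and swap the arguments for text-to-image.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row of x to unit L2 norm."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def retrieve_top_k(query_embeds, gallery_embeds, k):
    """Return indices of the k most similar gallery items for each query."""
    scores = l2_normalize(query_embeds) @ l2_normalize(gallery_embeds).T
    # argsort is ascending, so negate the scores to rank best-first.
    return np.argsort(-scores, axis=1)[:, :k]

# Hypothetical embeddings: one query, three gallery items.
query = np.array([[1.0, 0.1]])
gallery = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
top2 = retrieve_top_k(query, gallery, k=2)
```

For large galleries a full sort is wasteful; `np.argpartition` or an approximate nearest-neighbour index is the usual replacement, which this sketch leaves out.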


Practical Notes

Common Uses

Prompting and Retrieval

Metrics

Limitations
