Overview
- Vision-text encoders map images and text into a shared space.
- Goal: make semantically matching image-text pairs close, and mismatched pairs far apart.
- Main use cases: zero-shot classification, cross-modal retrieval, and multimodal search.
Core Idea
- Train with paired image-text data.
- Use contrastive objectives to align modalities.
- Typical architecture is a dual encoder:
- Image encoder maps image $x_i$ to vector $v_i$.
- Text encoder maps caption/prompt $t_j$ to vector $u_j$.
- Embeddings are L2-normalized before similarity scoring.
- Similarity is scaled cosine similarity: $s_{ij} = \tau \, v_i^\top u_j$
- $s_{ij}$ is the similarity between image $i$ and text $j$, and $\tau$ is a learnable temperature scale.
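A minimal sketch of the scoring step, written with plain Python lists to stay dependency-free. The function names `l2_normalize` and `scaled_similarity`, the toy vectors, and the scale value of 10.0 are assumptions made for illustration; a real model would learn the scale and produce the embeddings from its encoders.

```python
import math

def l2_normalize(vec):
    """Return vec rescaled to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def scaled_similarity(image_vec, text_vec, scale):
    """Scaled cosine similarity: scale times the dot product of unit vectors."""
    v = l2_normalize(image_vec)
    u = l2_normalize(text_vec)
    return scale * sum(a * b for a, b in zip(v, u))

# Toy 2-D embeddings (made up for illustration).
print(scaled_similarity([3.0, 4.0], [6.0, 8.0], scale=10.0))  # parallel vectors
print(scaled_similarity([1.0, 0.0], [0.0, 1.0], scale=10.0))  # orthogonal vectors
```

Because both vectors are normalized first, the score depends only on the angle between them; the scale just sharpens or flattens the resulting distribution once a softmax is applied.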
Contrastive Training Objective
- For a batch of $N$ paired examples, compute the $N \times N$ similarity matrix $S$, whose entry $s_{ij}$ is the scaled similarity between image $i$ and text $j$.
- Use symmetric InfoNCE loss: image-to-text and text-to-image.
- The diagonal terms are matched pairs; off-diagonals are in-batch negatives.
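Written out for a batch of $N$ pairs, one common form of the symmetric loss is given below. The notation is an assumption chosen for this note: $s_{ij}$ denotes the scaled similarity between image $i$ and text $j$, and the two directions are averaged with equal weight.

```latex
\mathcal{L}_{\text{i2t}} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp(s_{ii})}{\sum_{j=1}^{N} \exp(s_{ij})}, \qquad
\mathcal{L}_{\text{t2i}} = -\frac{1}{N}\sum_{j=1}^{N} \log \frac{\exp(s_{jj})}{\sum_{i=1}^{N} \exp(s_{ij})}, \qquad
\mathcal{L} = \tfrac{1}{2}\left(\mathcal{L}_{\text{i2t}} + \mathcal{L}_{\text{t2i}}\right)
```

The image-to-text term is a softmax cross-entropy over each row of the similarity matrix, and the text-to-image term is the same over each column.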
Example Models
- CLIP: ViT/ResNet image encoder + Transformer text encoder; trained on large noisy web image-text pairs.
- OpenCLIP: open-source CLIP reproductions trained on LAION-scale datasets.
- ALIGN: large-scale weakly supervised contrastive image-text pretraining.
CLIP Step-by-Step
Training
Step 1: Build paired mini-batch
- Sample a batch of $N$ image-caption pairs $\{(x_i, t_i)\}_{i=1}^{N}$.
- Each pair is treated as a positive match.
Step 2: Encode image and text
- Compute image embeddings with vision encoder.
- Compute text embeddings with text encoder.
- L2-normalize both embeddings.
Step 3: Compute similarity matrix
- Compute all pairwise similarities between image and text embeddings in the batch.
Step 4: Compute symmetric contrastive loss
- Optimize image-to-text and text-to-image objectives jointly.
- Positive pairs are the diagonal entries $s_{ii}$; all other entries are negatives.
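Steps 2 to 4 can be sketched on a toy batch as follows. This is an illustration in plain Python rather than a training implementation: the embeddings are made-up stand-ins for encoder outputs, the helper names are hypothetical, and the fixed scale of 10.0 replaces the learnable temperature.

```python
import math

def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def similarity_matrix(image_embs, text_embs, scale):
    """sim[i][j] = scale * cosine(image i, text j)."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(u) for u in text_embs]
    return [[scale * sum(a * b for a, b in zip(v, u)) for u in txts] for v in imgs]

def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def symmetric_contrastive_loss(sim):
    """Mean of image-to-text (rows) and text-to-image (columns) cross-entropy."""
    n = len(sim)
    columns = [list(col) for col in zip(*sim)]
    i2t = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (i2t + t2i)

# Toy batch of 3 pairs (values invented for illustration).
image_embs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned_texts = [[0.9, 0.1], [0.1, 0.9], [-0.9, 0.1]]
# Rotating the captions breaks the pairing, so the diagonal no longer matches.
shuffled_texts = [aligned_texts[1], aligned_texts[2], aligned_texts[0]]

loss_aligned = symmetric_contrastive_loss(
    similarity_matrix(image_embs, aligned_texts, scale=10.0))
loss_shuffled = symmetric_contrastive_loss(
    similarity_matrix(image_embs, shuffled_texts, scale=10.0))
print(loss_aligned, loss_shuffled)  # the aligned batch should score lower
```

In an actual training loop, a framework with automatic differentiation would compute this loss on encoder outputs and propagate gradients back to both encoders and the temperature.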
Step 5: Update both encoders
- Backpropagate total contrastive loss.
- Update vision encoder, text encoder, and temperature parameter.
Inference
Step 1: Prepare class prompts or queries
- For zero-shot classification, create a text prompt per class.
- For retrieval, use arbitrary text/image queries.
Step 2: Encode once, compare many
- Encode query image/text.
- Encode candidate texts/images and cache embeddings.
Step 3: Rank by cosine similarity
- Compute similarity and rank candidates.
- Pick top class (classification) or top-$K$ results (retrieval).
Step 4: Optional prompt ensembling
- Use multiple templates per class and average embeddings.
- Improves robustness to wording choices.
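One common way to ensemble is to normalize each template's embedding, average them, and renormalize the mean. The sketch below assumes that variant; the function name `ensemble_class_embedding` and the toy vectors are hypothetical stand-ins for real text-encoder outputs.

```python
import math

def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def ensemble_class_embedding(template_embeddings):
    """Average L2-normalized template embeddings, then renormalize the mean."""
    normalized = [l2_normalize(e) for e in template_embeddings]
    dim = len(normalized[0])
    mean = [sum(e[d] for e in normalized) / len(normalized) for d in range(dim)]
    return l2_normalize(mean)

# Hypothetical embeddings of one class under two prompt templates,
# e.g. "a photo of a dog" and "a picture of a dog".
dog_templates = [[1.0, 0.0], [0.0, 1.0]]
dog_embedding = ensemble_class_embedding(dog_templates)
print(dog_embedding)
```

The renormalization keeps the ensembled vector on the unit sphere, so it can be compared to image embeddings with the same cosine scoring as a single prompt.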
Inference Patterns
Zero-Shot Classification
- Create text prompts per class (for example, “a photo of a {class}”).
- Encode image once and all prompts once.
- Choose class with highest image-text similarity.
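The three steps above amount to an argmax over class-prompt similarities. The following sketch assumes the embeddings have already been produced by the encoders; `zero_shot_classify` and the 3-D toy vectors are hypothetical and chosen only to make the example self-contained.

```python
import math

def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cosine(a, b):
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

def zero_shot_classify(image_embedding, class_prompt_embeddings):
    """Return (best_class, scores) for a dict of class name -> prompt embedding."""
    scores = {name: cosine(image_embedding, emb)
              for name, emb in class_prompt_embeddings.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Stand-in embeddings; a real pipeline would encode prompts such as
# "a photo of a cat" with the text encoder and cache the results.
prompt_embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.1, 0.9, 0.0],
    "car": [0.0, 0.1, 0.9],
}
image_embedding = [0.2, 0.8, 0.1]

label, scores = zero_shot_classify(image_embedding, prompt_embeddings)
print(label)  # "dog" for these toy vectors
```

Since the prompt embeddings do not depend on the image, they can be computed once per label set and reused for every image.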
Image-to-Text Retrieval
- Encode all candidate captions/documents.
- Retrieve top-$K$ texts by similarity to image embedding.
Text-to-Image Retrieval
- Encode all candidate images.
- Retrieve top-$K$ images by similarity to text embedding.
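Both retrieval directions reduce to the same operation: score a query embedding against a cached set of candidate embeddings and keep the best few. The sketch below does this by brute force in plain Python; `build_index`, `top_k`, and the toy vectors are hypothetical names and values used only for illustration.

```python
import heapq
import math

def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def build_index(candidate_embeddings):
    """Normalize candidates once so they can be cached and reused across queries."""
    return [l2_normalize(c) for c in candidate_embeddings]

def top_k(query_embedding, index, k):
    """Return the k (candidate_id, score) pairs with the highest cosine similarity."""
    q = l2_normalize(query_embedding)
    scored = [(i, sum(a * b for a, b in zip(q, c))) for i, c in enumerate(index)]
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])

# Toy candidate embeddings (images or captions, depending on the direction).
index = build_index([[1.0, 0.0], [0.7, 0.7], [0.0, 1.0], [-1.0, 0.0]])
results = top_k([0.9, 0.2], index, k=2)
print(results)  # candidates 0 and 1 rank highest for this query
```

Brute-force scoring is linear in the number of candidates, which is fine for small collections but is usually replaced by an approximate index at larger scale.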
Practical Notes
Common Uses
- Enables zero-shot classification.
- Useful for retrieval and captioning pipelines.
Prompting and Retrieval
- Prompt engineering matters: template choice can change zero-shot accuracy.
- Ensembling multiple prompts (averaging the embeddings of each class's prompts) often improves results.
- For retrieval at scale, index normalized embeddings with ANN search (for example, FAISS/HNSW).
Metrics
- Zero-shot classification: top-1/top-5 accuracy.
- Retrieval: Recall@K and mean reciprocal rank.
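The two retrieval metrics can be computed from ranked result lists as sketched below. The simplifying assumption here is that each query has exactly one relevant item; the function names and toy rankings are hypothetical.

```python
def recall_at_k(rankings, relevant_ids, k):
    """Fraction of queries whose relevant item appears in the top k results."""
    hits = sum(1 for ranked, rel in zip(rankings, relevant_ids) if rel in ranked[:k])
    return hits / len(relevant_ids)

def mean_reciprocal_rank(rankings, relevant_ids):
    """Mean of 1/rank of the relevant item; contributes 0 if it was not retrieved."""
    total = 0.0
    for ranked, rel in zip(rankings, relevant_ids):
        if rel in ranked:
            total += 1.0 / (ranked.index(rel) + 1)
    return total / len(relevant_ids)

# Three queries, each with item "a" as the single relevant result.
rankings = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
relevant_ids = ["a", "a", "a"]

print(recall_at_k(rankings, relevant_ids, k=1))   # 1 of 3 queries hit at rank 1
print(recall_at_k(rankings, relevant_ids, k=2))   # 2 of 3 queries hit within top 2
print(mean_reciprocal_rank(rankings, relevant_ids))  # (1 + 1/2 + 1/3) / 3
```

Recall@K only checks whether the relevant item made the cut, while mean reciprocal rank also rewards placing it nearer the top.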
Limitations
- Sensitive to web-data bias and prompt wording.
- Struggles with fine-grained counting or OCR-heavy scenes without task adaptation.