IOAI ML Notes · Computer Vision · Deep Learning

Self-Supervised Learning for Vision

Pretext tasks and representation learning without labels.

Syllabus Map


Overview


Core Ideas

Pretext Tasks

Representation Learning


Main Families (Vision)

Contrastive Learning

Bootstrap / Non-Contrastive

Masked Image Modeling


Step-by-Step: Contrastive Learning Objectives

Objective 1: Instance Discrimination (Image-Level)

$$s(a, b) = \frac{a^T b}{\|a\| \|b\|}$$

$$L_i = -\log \left( \frac{\exp(s(z_i, z_i^+) / \tau)}{\sum_{k \ne i} \exp(s(z_i, z_k) / \tau)} \right)$$

$$L_{\text{batch}} = \frac{1}{2N} \sum_{i=1}^{2N} L_i$$
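The three formulas above (cosine similarity, per-sample loss, batch average) can be sketched in a few lines of NumPy. This is a minimal illustration and not a reference implementation: the function name `nt_xent_loss`, the default `tau=0.5`, and the row layout (rows `i` and `i + N` hold the two views of image `i`) are assumptions made for the example.

```python
import numpy as np

def nt_xent_loss(z, tau=0.5):
    """Batch-averaged contrastive loss over 2N embeddings.

    Assumed layout: rows i and i + N of `z` are the two views of image i.
    """
    two_n = z.shape[0]
    n = two_n // 2
    # Cosine similarity s(a, b): L2-normalise rows, then take dot products.
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    logits = (z @ z.T) / tau
    # Exclude k = i from the denominator by sending the diagonal to -inf.
    np.fill_diagonal(logits, -np.inf)
    # Column index of the positive z_i^+ for every row i.
    pos = (np.arange(two_n) + n) % two_n
    # Numerically stable log of the denominator (log-sum-exp over k != i).
    row_max = logits.max(axis=1, keepdims=True)
    log_denom = row_max[:, 0] + np.log(np.exp(logits - row_max).sum(axis=1))
    # L_i = log(denominator) - s(z_i, z_i^+) / tau
    losses = log_denom - logits[np.arange(two_n), pos]
    return losses.mean()

rng = np.random.default_rng(0)
a = rng.normal(size=(4, 8))
loss_aligned = nt_xent_loss(np.concatenate([a, a]))    # identical views
loss_opposed = nt_xent_loss(np.concatenate([a, -a]))   # opposite views
```

With identical views the positive pair has the highest possible similarity, so `loss_aligned` comes out lower than `loss_opposed`. Subtracting the row maximum before exponentiating is only a stability measure and does not change the value of the loss.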

Objective 2: Image Subsampling / Patching (Patch-Level)

$$L_{\text{patch}}(i, p) = -\log \left( \frac{\exp(s(z_{i,p}^a, z_{i,p}^b) / \tau)}{\sum_{(j,q) \ne (i,p)} \exp(s(z_{i,p}^a, z_{j,q}) / \tau)} \right)$$

$$L_{\text{total}} = \lambda_{\text{img}} L_{\text{batch}} + \lambda_{\text{patch}} \frac{1}{P} \sum_{p=1}^{P} L_{\text{patch}}(:, p)$$
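A hedged NumPy sketch of the patch-level loss follows. The formula does not say which view the candidates $z_{j,q}$ come from, so the example assumes, by analogy with Objective 1, that the pool contains every patch embedding from both views except the anchor itself. The function name `patch_contrastive_loss` and the `(B, P, D)` array layout are also assumptions.

```python
import numpy as np

def patch_contrastive_loss(z_a, z_b, tau=0.5):
    """Mean of L_patch(i, p) over all images i and patch positions p.

    z_a, z_b: arrays of shape (B, P, D) with projected patch embeddings
    from views a and b. Assumption: each anchor from view a is compared
    against every patch of both views except itself.
    """
    b, p, d = z_a.shape
    m = b * p
    anchors = z_a.reshape(m, d)
    pool = np.concatenate([anchors, z_b.reshape(m, d)], axis=0)
    # Cosine similarity via L2-normalised dot products.
    anchors = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    pool = pool / np.linalg.norm(pool, axis=1, keepdims=True)
    logits = anchors @ pool.T / tau              # shape (M, 2M)
    idx = np.arange(m)
    logits[idx, idx] = -np.inf                   # drop the anchor itself
    # Stable log-sum-exp over the remaining candidates.
    row_max = logits.max(axis=1, keepdims=True)
    log_denom = row_max[:, 0] + np.log(np.exp(logits - row_max).sum(axis=1))
    # The positive for anchor (i, p) is the same position in view b.
    return (log_denom - logits[idx, m + idx]).mean()

rng = np.random.default_rng(0)
z_a = rng.normal(size=(2, 3, 8))                 # B=2 images, P=3 patches, D=8
loss_same = patch_contrastive_loss(z_a, z_a)
loss_flipped = patch_contrastive_loss(z_a, -z_a)
```

Under a different reading of the denominator (for example, candidates from view b only), just the construction of `pool` and the masking line would change.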

Step-by-Step: Instance Discrimination

Step 1: Sample a Batch

Step 2: Create Two Augmented Views
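As an illustration of this step, the sketch below draws two independent random views per image using a random crop and a coin-flip horizontal flip. The choice of augmentations, the helper names `random_view` and `two_views`, and the crop size are assumptions for the example. Real pipelines usually add further transforms such as colour jitter and blur.

```python
import numpy as np

def random_view(x, crop, rng):
    """One stochastic view of an image (C, H, W): random crop, then maybe flip."""
    c, h, w = x.shape
    top = rng.integers(0, h - crop + 1)
    left = rng.integers(0, w - crop + 1)
    view = x[:, top:top + crop, left:left + crop]
    if rng.random() < 0.5:
        view = view[:, :, ::-1]                  # horizontal flip
    return view

def two_views(batch, crop, rng):
    """Return (2N, C, crop, crop); rows i and i + N come from the same image."""
    first = [random_view(x, crop, rng) for x in batch]
    second = [random_view(x, crop, rng) for x in batch]
    return np.stack(first + second)

rng = np.random.default_rng(0)
batch = rng.normal(size=(4, 3, 32, 32))          # N = 4 toy images
views = two_views(batch, crop=24, rng=rng)       # shape (8, 3, 24, 24)
```

Stacking the second views after the first keeps the two views of image `i` at rows `i` and `i + N`, which makes the positive index easy to compute later.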

Step 3: Encode and Project

$$h_i = f(x_i), \quad z_i = g(h_i)$$
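To make the roles of $f$ and $g$ concrete, here is a toy NumPy sketch with random weights. The single dense layer standing in for the encoder, the two-layer MLP head, and all dimensions are assumptions chosen for brevity. An actual encoder would be a CNN or ViT.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_linear(in_dim, out_dim):
    """Small random weights and a zero bias for one dense layer."""
    return rng.normal(0.0, in_dim ** -0.5, size=(in_dim, out_dim)), np.zeros(out_dim)

def encoder_f(x, params):
    """Toy stand-in for the backbone f: flatten, dense layer, ReLU."""
    w, b = params
    return np.maximum(x.reshape(x.shape[0], -1) @ w + b, 0.0)

def projector_g(h, params):
    """Two-layer MLP head g mapping h_i to the embedding z_i used by the loss."""
    (w1, b1), (w2, b2) = params
    return np.maximum(h @ w1 + b1, 0.0) @ w2 + b2

x = rng.normal(size=(8, 3, 16, 16))              # 2N = 8 augmented images
f_params = init_linear(3 * 16 * 16, 64)
g_params = (init_linear(64, 64), init_linear(64, 32))
h = encoder_f(x, f_params)                       # representation h_i
z = projector_g(h, g_params)                     # projection z_i
```

The split matters because the contrastive loss is computed on `z`, while the transfer step reuses the encoder that produces `h`.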

Step 4: Build Positive/Negative Sets

Step 5: Optimize InfoNCE

Step 6: Transfer Encoder


Step-by-Step: Image Subsampling / Patching

Step 1: Generate Local Regions or Tokens

Step 2: Encode Local Features

$$h_{i,p} = f(x_i, p), \quad z_{i,p} = g_{\text{patch}}(h_{i,p})$$
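One simple reading of $f(x_i, p)$ is to cut each image into non-overlapping patches and encode every patch separately. The sketch below follows that reading. The helper `extract_patches`, the linear stand-ins for $f$ and $g_{\text{patch}}$, and the sizes are assumptions. Many models instead read local features from a feature map of the whole image.

```python
import numpy as np

def extract_patches(x, size):
    """Split images (B, C, H, W) into non-overlapping patches (B, P, C*size*size)."""
    b, c, h, w = x.shape
    gh, gw = h // size, w // size
    x = x[:, :, :gh * size, :gw * size]          # drop any ragged border
    x = x.reshape(b, c, gh, size, gw, size)
    x = x.transpose(0, 2, 4, 1, 3, 5)            # (B, gh, gw, C, size, size)
    return x.reshape(b, gh * gw, c * size * size)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 3, 8, 8))
patches = extract_patches(x, 4)                  # (2, 4, 48): P = 4 patches per image
w_f = rng.normal(0.0, 48 ** -0.5, size=(48, 16)) # toy weights for f
w_g = rng.normal(0.0, 16 ** -0.5, size=(16, 8))  # toy weights for g_patch
h = np.maximum(patches @ w_f, 0.0)               # h_{i,p}, shape (2, 4, 16)
z = h @ w_g                                      # z_{i,p}, shape (2, 4, 8)
```

Patches are ordered row by row, so position `p = 1` is the block immediately to the right of `p = 0`.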

Step 3: Define Patch Positives and Negatives

Step 4: Compute Patch Contrastive Loss

$$L_{\text{total}} = \lambda_{\text{img}} L_{\text{batch}} + \lambda_{\text{patch}} L_{\text{patch\_avg}}$$
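A small worked example of the weighted sum, assuming $L_{\text{patch\_avg}}$ is the mean of the per-position patch losses as in the Objective 2 formula. The function name `total_loss` and the numbers are illustrative only.

```python
import numpy as np

def total_loss(l_batch, l_patch_per_position, lambda_img=1.0, lambda_patch=1.0):
    """Weighted sum of the image-level loss and the position-averaged patch loss."""
    l_patch_avg = float(np.mean(l_patch_per_position))   # (1/P) * sum over p
    return lambda_img * l_batch + lambda_patch * l_patch_avg

# L_batch = 2.0, patch losses 1.0 and 3.0 (average 2.0), lambda_patch = 0.5
example = total_loss(2.0, [1.0, 3.0], lambda_img=1.0, lambda_patch=0.5)
print(example)  # 3.0
```

Setting `lambda_patch=0.0` recovers pure instance discrimination, so the ratio of the two weights controls how much the local objective influences training.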

Step 5: Update and Transfer


Key Design Choices


Practical Notes

Data and Compute

Evaluation and Transfer
