IOAI ML Notes · Computer Vision · Deep Learning

CNN Tasks

Core CNN-based vision tasks and common model families.

Syllabus Map


Overview


Image Classification

Core Idea


Object Detection

Core Idea

You-Only-Look-Once (YOLO)

Step 1: Prepare Labels

Step 2: Train Grid-Based Predictions

Step 3: Optimize Detection Loss

$$\mathcal{L}_{\text{YOLO}} = \lambda_{\text{box}}\mathcal{L}_{\text{box}} + \lambda_{\text{obj}}\mathcal{L}_{\text{obj}} + \lambda_{\text{cls}}\mathcal{L}_{\text{cls}}$$

$$\mathcal{L}_{\text{box}} = \sum_{i\in \text{Pos}} \left(1 - \text{IoU}(b_i, \hat{b}_i)\right)$$

$$\mathcal{L}_{\text{obj}} = -\sum_i \left[y_i \log p_i + (1 - y_i)\log(1 - p_i)\right]$$

$$\mathcal{L}_{\text{cls}} = -\sum_{i\in \text{Pos}} \sum_{c=1}^{C} y_{i,c} \log \hat{p}_{i,c}$$
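
The three terms of the detection loss can be sketched in plain Python. This is only an illustration of the formulas above, not any particular YOLO version's implementation: the function names, the list-based inputs, and the default weights of 1.0 are assumptions, and the IoU of each positive pair is taken as precomputed.

```python
import math

def bce(y, p, eps=1e-12):
    """Binary cross-entropy for one target y in {0, 1} and probability p."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def yolo_loss(pos_ious, obj_targets, obj_probs, cls_onehot, cls_probs,
              lam_box=1.0, lam_obj=1.0, lam_cls=1.0, eps=1e-12):
    """Weighted sum of box, objectness, and class terms (hypothetical helper).

    pos_ious:    IoU(b_i, b_hat_i) for each positive prediction
    obj_targets: objectness labels y_i over all predictions
    obj_probs:   predicted objectness p_i over all predictions
    cls_onehot:  one-hot class labels for each positive prediction
    cls_probs:   predicted class probabilities for each positive prediction
    """
    l_box = sum(1.0 - iou for iou in pos_ious)
    l_obj = sum(bce(y, p) for y, p in zip(obj_targets, obj_probs))
    l_cls = -sum(y * math.log(max(p, eps))
                 for ys, ps in zip(cls_onehot, cls_probs)
                 for y, p in zip(ys, ps))
    total = lam_box * l_box + lam_obj * l_obj + lam_cls * l_cls
    return total, (l_box, l_obj, l_cls)

total, parts = yolo_loss(
    pos_ious=[0.5],
    obj_targets=[1], obj_probs=[0.5],
    cls_onehot=[[1, 0]], cls_probs=[[0.5, 0.5]],
)
```

The box term sums only over positives, while the objectness term covers every prediction, which is why the weights are usually tuned to keep the many negatives from dominating.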

Step 4: Run Post-Processing

$$\text{IoU}(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

$$\text{suppress } b_j \text{ if } \text{IoU}(b_i, b_j) > \tau_{\text{nms}} \text{ and } s_i > s_j$$
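
A minimal sketch of IoU and greedy non-maximum suppression follows. It assumes boxes are `(x1, y1, x2, y2)` corner tuples and applies the suppression rule in its common greedy form: visit boxes in descending score order and drop any box that overlaps an already kept box by more than the threshold. The function names and the default threshold of 0.5 are illustrative.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, tau_nms=0.5):
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for j in order:
        # b_j survives only if no kept, higher-scoring b_i overlaps it too much
        if all(iou(boxes[i], boxes[j]) <= tau_nms for i in keep):
            keep.append(j)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

In this example the second box overlaps the first with IoU 81/119 and a lower score, so it is suppressed, while the third box overlaps nothing and is kept.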

Step 5: Evaluate and Deploy


Single Shot MultiBox Detector (SSD)

Step 1: Build Multi-Scale Feature Maps

Step 2: Match Ground Truth to Default Boxes

Step 3: Train Classification and Localization

$$\mathcal{L}_{\text{SSD}} = \frac{1}{N}\left(\mathcal{L}_{\text{cls}} + \alpha \mathcal{L}_{\text{loc}}\right)$$

$$\mathcal{L}_{\text{loc}} = \sum_{i\in \text{Pos}} \; \sum_{m\in\{c_x, c_y, w, h\}} \text{SmoothL1}\left(t_i^m - \hat{t}_i^m\right)$$
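
The localization term can be sketched as below. The sketch assumes the usual Smooth L1 definition with a transition point `beta` of 1.0, and that N is the number of matched (positive) default boxes, with the loss set to zero when there are none; the notes do not define N, so treat this as an assumption. All function names are illustrative.

```python
def smooth_l1(x, beta=1.0):
    """Quadratic near zero, linear for |x| >= beta."""
    ax = abs(x)
    return 0.5 * x * x / beta if ax < beta else ax - 0.5 * beta

def loc_loss(pos_targets, pos_preds):
    """Sum Smooth L1 over the (cx, cy, w, h) offsets of each positive box."""
    return sum(smooth_l1(t - p)
               for t_vec, p_vec in zip(pos_targets, pos_preds)
               for t, p in zip(t_vec, p_vec))

def ssd_loss(l_cls, l_loc, n_pos, alpha=1.0):
    """Combine classification and localization terms, normalized by N."""
    if n_pos == 0:
        return 0.0  # assumed convention when no default box is matched
    return (l_cls + alpha * l_loc) / n_pos

l_loc = loc_loss([(0.0, 0.0, 0.0, 0.0)], [(0.5, 0.0, 0.0, 2.0)])
total = ssd_loss(l_cls=2.0, l_loc=l_loc, n_pos=1)
```

The small offset error (0.5) is penalized quadratically and the large one (2.0) linearly, which is what makes the loss less sensitive to outlier boxes than a plain squared error.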

Step 4: Apply Hard Negative Mining

$$\frac{N_{\text{neg}}}{N_{\text{pos}}} \le 3$$
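
One way to enforce this ratio is to keep only the negatives with the highest classification loss. The sketch below assumes per-box losses for the negative default boxes are already available as a list; the function name and signature are hypothetical.

```python
def hard_negative_mining(neg_losses, n_pos, ratio=3):
    """Return indices of the hardest negatives, at most ratio * n_pos of them."""
    n_keep = min(len(neg_losses), ratio * n_pos)
    # Rank negatives by loss, hardest (largest loss) first
    order = sorted(range(len(neg_losses)),
                   key=lambda i: neg_losses[i], reverse=True)
    return sorted(order[:n_keep])

neg_losses = [0.1, 2.0, 0.5, 1.5, 0.05, 0.9, 0.3]
selected = hard_negative_mining(neg_losses, n_pos=1)
```

With one positive box, only the three highest-loss negatives contribute to the classification term; the remaining easy negatives are ignored for that step.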

Step 5: Decode and Filter Predictions


Detection Transformers (DETR)

Step 1: Encode Image Features

Step 2: Decode with Object Queries

Step 3: Hungarian Matching

$$\hat{\sigma} = \arg\min_{\sigma} \sum_i \left[ -\log \hat{p}_{\sigma(i)}(c_i) + \lambda_1 \|b_i - \hat{b}_{\sigma(i)}\|_1 + \lambda_2 \left(1 - \text{GIoU}(b_i, \hat{b}_{\sigma(i)})\right) \right]$$
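
To make the matching objective concrete, the sketch below evaluates the cost for every assignment of predictions to ground-truth boxes and returns the cheapest one. This is a brute-force search, not the Hungarian algorithm itself: it finds the same optimum but only scales to a handful of boxes, whereas practical code uses a linear assignment solver. Boxes are assumed to be normalized `(x1, y1, x2, y2)` tuples, and the weights 5.0 and 2.0 are illustrative values for the two lambdas.

```python
import math
from itertools import permutations

def giou(a, b):
    """Generalized IoU: IoU minus the fraction of the enclosing box not covered."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    hull = ((max(a[2], b[2]) - min(a[0], b[0]))
            * (max(a[3], b[3]) - min(a[1], b[1])))
    return inter / union - (hull - union) / hull

def pair_cost(gt_box, gt_cls, pred_box, pred_probs, lam1=5.0, lam2=2.0):
    """Cost of assigning one prediction to one ground-truth object."""
    cls_term = -math.log(max(pred_probs[gt_cls], 1e-12))
    l1_term = sum(abs(g - p) for g, p in zip(gt_box, pred_box))
    return cls_term + lam1 * l1_term + lam2 * (1.0 - giou(gt_box, pred_box))

def match(gt_boxes, gt_classes, pred_boxes, pred_probs):
    """Brute-force search for sigma; returns (None, inf) if predictions < targets."""
    best, best_cost = None, math.inf
    for sigma in permutations(range(len(pred_boxes)), len(gt_boxes)):
        cost = sum(pair_cost(gt_boxes[i], gt_classes[i],
                             pred_boxes[j], pred_probs[j])
                   for i, j in enumerate(sigma))
        if cost < best_cost:
            best, best_cost = sigma, cost
    return best, best_cost

gt_boxes = [(0.1, 0.1, 0.4, 0.4), (0.6, 0.6, 0.9, 0.9)]
gt_classes = [0, 1]
pred_boxes = [(0.6, 0.6, 0.9, 0.9), (0.5, 0.1, 0.6, 0.2), (0.1, 0.1, 0.4, 0.4)]
pred_probs = [[0.1, 0.9], [0.5, 0.5], [0.9, 0.1]]
sigma, cost = match(gt_boxes, gt_classes, pred_boxes, pred_probs)
```

Here `sigma[i]` is the index of the prediction assigned to ground-truth object `i`; the unassigned prediction would be trained toward the "no object" class.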

Step 4: Optimize Set Prediction Loss

$$\mathcal{L}_{\text{DETR}} = \mathcal{L}_{\text{cls}} + \lambda_1 \mathcal{L}_{L1} + \lambda_2 \mathcal{L}_{\text{GIoU}}$$

Step 5: Inference without NMS


Image Segmentation

Core Idea

Semantic vs Instance Segmentation

Semantic Segmentation

Instance Segmentation

U-Net

Step 1: Encode Context

Step 2: Decode Spatial Detail

Step 3: Predict Pixel Masks

Step 4: Train with Dense Losses

$$\mathcal{L}_{\text{seg}} = \lambda_{\text{ce}}\mathcal{L}_{\text{CE}} + \lambda_{\text{dice}}\mathcal{L}_{\text{Dice}}$$

$$\mathcal{L}_{\text{Dice}} = 1 - \frac{2\sum_i p_i y_i + \epsilon}{\sum_i p_i + \sum_i y_i + \epsilon}$$
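
The combined loss can be sketched for the binary case, with the mask flattened to a list of per-pixel foreground probabilities. The binary setting, the mean-reduced cross-entropy, the smoothing value of 1.0, and the unit weights are all assumptions made for illustration.

```python
import math

def dice_loss(probs, targets, eps=1.0):
    """Soft Dice loss over flattened pixels; eps smooths empty masks."""
    inter = sum(p * y for p, y in zip(probs, targets))
    return 1.0 - (2.0 * inter + eps) / (sum(probs) + sum(targets) + eps)

def ce_loss(probs, targets, eps=1e-12):
    """Mean binary cross-entropy over flattened pixels."""
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)

def seg_loss(probs, targets, lam_ce=1.0, lam_dice=1.0):
    """Weighted sum of cross-entropy and Dice terms."""
    return (lam_ce * ce_loss(probs, targets)
            + lam_dice * dice_loss(probs, targets))

targets = [1, 1, 0, 0]
perfect = dice_loss([1.0, 1.0, 0.0, 0.0], targets)
inverted = dice_loss([0.0, 0.0, 1.0, 1.0], targets)
uncertain = seg_loss([0.5, 0.5, 0.5, 0.5], targets)
```

Cross-entropy scores each pixel independently, while the Dice term scores overlap of the whole mask, so the combination tends to behave better when foreground pixels are rare.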

Step 5: Post-Process Masks
