IOAI ML Notes · Neural Network · Deep Learning

Data Embeddings

Embeddings for text, images, audio and structured data with practical usage notes.

Syllabus Map


Overview


Core Idea


Sparse vs Dense Embeddings

Sparse Embeddings

$$x \in \mathbb{R}^V,\quad \|x\|_0 \ll V$$

Dense Embeddings

$$z = f(x),\quad z \in \mathbb{R}^d,\quad d \ll V$$
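
A minimal sketch of the two definitions above, assuming the simplest possible choice of f: a lookup into a dense table E of shape (V, d). The sizes, the random table and the helper name `one_hot` are illustrative assumptions, and NumPy is used only to keep the sketch dependency-light.

```python
import numpy as np

V, d = 10, 3                      # illustrative vocabulary size and embedding size
rng = np.random.default_rng(0)
E = rng.normal(size=(V, d))       # dense embedding table (random here, learned in practice)


def one_hot(index, size):
    """Sparse representation: a length-`size` vector with a single non-zero entry."""
    x = np.zeros(size)
    x[index] = 1.0
    return x


x = one_hot(7, V)                 # x in R^V with ||x||_0 = 1
z_matmul = x @ E                  # dense embedding via a matrix product
z_lookup = E[7]                   # the same vector via a direct row lookup
```

The matrix product and the row lookup give the same vector, which is why embedding layers are implemented as table lookups rather than as multiplications by one-hot vectors.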

Key Tradeoffs

Practical Rule


Text Embeddings

Tokenisation matters

Static vs Contextual

Training objectives

Practical Notes

Default to cosine similarity for semantic comparisons

Normalize before cross-batch dot-product comparisons

Use mean pooling as a sentence-level baseline
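
The three notes above can be combined in one short sketch: mean pooling that ignores padding, followed by L2 normalisation so that a dot product equals cosine similarity. The function names and the padding-mask convention (1 for real tokens, 0 for padding) are assumptions made for illustration.

```python
import numpy as np


def mean_pool(token_emb, mask):
    """Average token embeddings over the sequence axis, skipping padded positions.

    token_emb: (batch, seq, dim) array; mask: (batch, seq) with 1 = real token.
    """
    mask = mask[..., None].astype(token_emb.dtype)   # (batch, seq, 1)
    summed = (token_emb * mask).sum(axis=1)          # (batch, dim)
    counts = np.maximum(mask.sum(axis=1), 1.0)       # avoid division by zero
    return summed / counts


def cosine_similarity_matrix(a, b, eps=1e-12):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = a / np.maximum(np.linalg.norm(a, axis=1, keepdims=True), eps)
    b = b / np.maximum(np.linalg.norm(b, axis=1, keepdims=True), eps)
    return a @ b.T


token_emb = np.array([[[1.0, 0.0], [3.0, 0.0], [100.0, 100.0]]])  # last token is padding
mask = np.array([[1, 1, 0]])
sentence = mean_pool(token_emb, mask)
sims = cosine_similarity_matrix(sentence, np.array([[5.0, 0.0], [0.0, 1.0]]))
```

Without the mask, the padded position would dominate the average in this toy input.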


BPE (Byte-Pair Encoding)

Core Idea

Steps

Step 1: Initialise the vocabulary

Step 2: Count symbol pairs

Step 3: Merge the best pair

Step 4: Update the corpus

Step 5: Repeat until target size

Step 6: Tokenise new text
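
The six steps above can be sketched in plain Python. This is a toy illustration rather than a production tokeniser: the end-of-word marker `</w>`, the tie-breaking rule and the function names are assumptions.

```python
from collections import Counter


def get_pair_counts(corpus):
    """Step 2: count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in corpus.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(pair, corpus):
    """Steps 3-4: replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(word_freqs, num_merges):
    """Steps 1-5: start from characters and learn an ordered list of merges."""
    corpus = {tuple(w) + ("</w>",): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(corpus)
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair; ties are broken by the pair itself (an arbitrary choice).
        best = max(pairs, key=lambda p: (pairs[p], p))
        corpus = merge_pair(best, corpus)
        merges.append(best)
    return merges


def tokenize(word, merges):
    """Step 6: apply the learned merges, in training order, to a new word."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = next(iter(merge_pair(pair, {symbols: 1})))
    return list(symbols)


merges = train_bpe({"low": 5, "lowest": 2}, num_merges=10)
```

Here `num_merges` stands in for the target vocabulary size: each merge adds one new symbol to the vocabulary.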

Strengths and Weaknesses

WordPiece

Core Idea

Steps

Step 1: Initialise the vocabulary

Step 2: Score segmentations

Step 3: Propose new subwords

Step 4: Add the best candidate

Step 5: Re-tokenise the corpus

Step 6: Repeat until target size

Step 7: Tokenise new text
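
For the final step, a common way to tokenise new text is greedy longest-match-first lookup, as in BERT's WordPiece tokeniser. The sketch below assumes the BERT conventions of a `##` prefix for word-internal pieces and an `[UNK]` fallback; the vocabulary is a made-up example, and the training steps (scoring and adding candidates) are not shown.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split `word` into the longest vocabulary pieces, scanning left to right."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate   # mark a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1                           # try a shorter piece
        if piece is None:
            return [unk]                       # no piece fits: whole word is unknown
        tokens.append(piece)
        start = end
    return tokens


vocab = {"un", "##aff", "##able", "##a"}       # hypothetical vocabulary
pieces = wordpiece_tokenize("unaffable", vocab)
```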

Strengths and Weaknesses

TF-IDF

Core Idea

Steps

Step 1: Build the vocabulary

Step 2: Compute term frequency

$$\text{tf}(t, d) = \begin{cases} 1 + \lg \text{count}(t, d) & \text{if } \text{count}(t, d) > 0 \\ 0 & \text{otherwise} \end{cases}$$

Step 3: Compute document frequency

Step 4: Compute inverse document frequency

Step 5: Build TF-IDF vectors

Step 6: Normalise (optional)
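
The steps above can be sketched with the standard library only. The term-frequency weighting follows the formula in Step 2. Two details are assumptions, since the notes do not fix them: the logarithm is taken base 10, and the inverse document frequency is idf(t) = lg(N / df(t)) with N the number of documents.

```python
import math
from collections import Counter


def tfidf_vectors(docs, normalise=True):
    """Return (vocab, vectors) for a list of tokenised documents."""
    vocab = sorted({t for doc in docs for t in doc})           # Step 1
    n_docs = len(docs)
    df = Counter(t for doc in docs for t in set(doc))          # Step 3
    idf = {t: math.log10(n_docs / df[t]) for t in vocab}       # Step 4 (assumed form)

    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vec = [
            (1 + math.log10(counts[t])) * idf[t] if counts[t] > 0 else 0.0  # Steps 2 and 5
            for t in vocab
        ]
        if normalise:                                          # Step 6: unit L2 norm
            norm = math.sqrt(sum(v * v for v in vec))
            if norm > 0:
                vec = [v / norm for v in vec]
        vectors.append(vec)
    return vocab, vectors


docs = [["the", "cat", "sat"], ["the", "dog", "sat", "sat"], ["the", "bird"]]
vocab, vectors = tfidf_vectors(docs)
```

A term that occurs in every document, such as "the" here, gets idf 0 and so contributes nothing to any vector.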

Strengths and Weaknesses

Word2Vec

Core Idea

Steps (CBOW)

Step 1: Build training pairs

Step 2: Predict the centre word

Step 3: Update embeddings

Steps (Skip-gram)

Step 1: Build training pairs

Step 2: Predict context words
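
The pair-building step differs between the two variants only in which side is the input. A small sketch, assuming a symmetric context window and hypothetical function names:

```python
def context_indices(i, length, window):
    """Indices within `window` positions of i, excluding i itself."""
    lo, hi = max(0, i - window), min(length, i + window + 1)
    return [j for j in range(lo, hi) if j != i]


def cbow_pairs(tokens, window=2):
    """CBOW: (list of context words) -> centre word."""
    pairs = []
    for i, centre in enumerate(tokens):
        context = [tokens[j] for j in context_indices(i, len(tokens), window)]
        if context:
            pairs.append((context, centre))
    return pairs


def skipgram_pairs(tokens, window=2):
    """Skip-gram: centre word -> one context word per pair."""
    return [
        (centre, tokens[j])
        for i, centre in enumerate(tokens)
        for j in context_indices(i, len(tokens), window)
    ]


tokens = ["the", "cat", "sat", "on", "mat"]
cbow = cbow_pairs(tokens, window=1)
skip = skipgram_pairs(tokens, window=1)
```

Skip-gram produces one training example per (centre, context) pair, so it yields more examples per sentence than CBOW for the same window.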

Strengths and Weaknesses

BERT (Contextual Embeddings)

Core Idea

Steps (pretraining)

Pretraining Step 1: Prepare inputs

Pretraining Step 2: Apply masking

Pretraining Step 3: Encode with the transformer

Pretraining Step 4: Predict masked tokens

Pretraining Step 5: Optional NSP objective

Pretraining Step 6: Optimise
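
The masking step can be sketched as follows. The 15% selection rate and the 80/10/10 split (mask token, random token, unchanged) follow the original BERT recipe and are not stated in these notes, so treat them as assumptions; the token ids in the usage lines are likewise illustrative. The label value -100 matches the default ignore index of PyTorch's cross-entropy loss.

```python
import random


def mask_tokens(token_ids, vocab_size, mask_id, special_ids=(), mask_prob=0.15, seed=None):
    """Return (inputs, labels) for masked language modelling.

    labels[i] holds the original token at selected positions and -100 elsewhere,
    so the loss is computed only on selected positions.
    """
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels = [-100] * len(inputs)
    for i, tok in enumerate(token_ids):
        if tok in special_ids or rng.random() >= mask_prob:
            continue                               # position not selected
        labels[i] = tok
        r = rng.random()
        if r < 0.8:
            inputs[i] = mask_id                    # 80%: replace with the mask token
        elif r < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return inputs, labels


token_ids = [101] + list(range(1000, 1050)) + [102]   # illustrative ids with two special tokens
inputs, labels = mask_tokens(token_ids, vocab_size=30000, mask_id=103,
                             special_ids={101, 102}, seed=0)
```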

Steps (finetuning)

Finetuning Step 1: Add a task head

Finetuning Step 2: Train on task data

Finetuning Step 3: Tune for the task

Strengths and Weaknesses

GPT-Style (Autoregressive) Encodings

Core Idea

Steps (pretraining)

Pretraining Step 1: Prepare inputs

Pretraining Step 2: Predict next tokens

Pretraining Step 3: Optimise
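
The autoregressive objective reduces to three small pieces: shift the sequence by one to get targets, restrict attention with a causal mask, and minimise cross-entropy on the next token. A dependency-free sketch, with hypothetical function names:

```python
import math


def next_token_pairs(token_ids):
    """Inputs are all tokens but the last; targets are the same tokens shifted left."""
    return token_ids[:-1], token_ids[1:]


def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits), computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


inputs, targets = next_token_pairs([5, 6, 7, 8])
mask = causal_mask(3)
```

In a real model the logits come from a transformer decoder, and the loss is averaged over all positions in the batch.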

Steps (finetuning)

Finetuning Step 1: Add task formatting

Finetuning Step 2: Train on task data

Strengths and Weaknesses


Positional Embeddings

Why positions matter

Absolute (Learned) Positions

Sinusoidal Positions
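
A minimal sketch of the fixed sinusoidal table, assuming the formulation from the original Transformer paper: PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d)). The function name and the `base` parameter are illustrative.

```python
import math


def sinusoidal_positions(max_len, d_model, base=10000.0):
    """Return a max_len x d_model table with sin on even and cos on odd dimensions."""
    if d_model % 2 != 0:
        raise ValueError("d_model must be even")
    table = []
    for pos in range(max_len):
        row = []
        for i in range(0, d_model, 2):             # i is the even dimension index 2k
            angle = pos / (base ** (i / d_model))  # wavelength grows with the dimension
            row.extend([math.sin(angle), math.cos(angle)])
        table.append(row)
    return table


pe = sinusoidal_positions(max_len=4, d_model=6)
```

Because the table is a fixed function of the position, it needs no training and can be computed for positions longer than any sequence seen during training.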

Relative Positions

Rotary Positional Encoding (RoPE)

$$\text{RoPE}_{1D}(q) = R(\theta_{pos})\, q$$

$$R(\theta)= \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}, \quad \theta_{pos} = pos \cdot \omega$$
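
The rotation above can be checked numerically on a single 2D pair of features. The sketch below applies R(pos · ω) to a query and a key; the frequency ω = 0.3 and the example vectors are arbitrary assumptions. Full RoPE applies the same rotation to every consecutive pair of dimensions, each with its own ω.

```python
import math


def rope_2d(vec, pos, omega=1.0):
    """Rotate a 2D vector by the angle theta_pos = pos * omega."""
    x, y = vec
    theta = pos * omega
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y, s * x + c * y)


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]


q, k = (1.0, 2.0), (0.5, -1.0)
# Same offset (2) between query and key positions, at two different absolute positions.
score_near = dot(rope_2d(q, 5, 0.3), rope_2d(k, 3, 0.3))
score_far = dot(rope_2d(q, 12, 0.3), rope_2d(k, 10, 0.3))
```

The two scores agree up to rounding: after rotation, the query-key dot product depends only on the relative offset between positions, which is the property that makes RoPE behave like a relative encoding.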

Practical Notes

Choose add vs concat based on model budget

Use learned 2D positions for ViT patch tokens

Prefer relative or RoPE-style methods for long context


Image Embeddings

Common approaches

CNN Feature Embeddings

Vision Transformer (ViT) Patch Embeddings

CLIP-Style Multimodal Embeddings

CLIP (Step-by-Step)

Step 1: Build image-text pairs

Step 2: Encode each modality

Step 3: Project into shared space

$$\tilde{u}_i=\frac{u_i}{\|u_i\|},\quad \tilde{v}_i=\frac{v_i}{\|v_i\|}$$

Step 4: Compute similarity matrix

$$s_{ij}=\frac{\tilde{u}_i^\top \tilde{v}_j}{\tau}$$

Step 5: Optimize contrastive loss

Step 6: Use for inference
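
Steps 3 to 5 can be sketched in a few lines. The loss below is the symmetric cross-entropy over the similarity matrix s, with matching image-text pairs on the diagonal. The temperature value 0.07 and the function names are assumptions; NumPy is used instead of PyTorch only to keep the sketch dependency-light, and the encoders that produce u and v are not shown.

```python
import numpy as np


def l2_normalise(x, eps=1e-12):
    """Scale each row to unit length (Step 3)."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def log_softmax(s, axis):
    """Numerically stable log-softmax along the given axis."""
    s = s - s.max(axis=axis, keepdims=True)
    return s - np.log(np.exp(s).sum(axis=axis, keepdims=True))


def clip_loss(u, v, tau=0.07):
    """Return (loss, s) for image embeddings u and text embeddings v, both (n, d)."""
    s = l2_normalise(u) @ l2_normalise(v).T / tau          # Step 4: s_ij
    idx = np.arange(s.shape[0])
    loss_i2t = -log_softmax(s, axis=1)[idx, idx].mean()    # each image picks its text
    loss_t2i = -log_softmax(s, axis=0)[idx, idx].mean()    # each text picks its image
    return 0.5 * (loss_i2t + loss_t2i), s                  # Step 5


rng = np.random.default_rng(0)
u = rng.normal(size=(4, 8))
aligned_loss, s = clip_loss(u, u)            # perfectly matched pairs
shuffled_loss, _ = clip_loss(u, u[::-1])     # every pair mismatched
```

At inference time (Step 6) the same similarity matrix is used directly, for example by taking the argmax over candidate text prompts for zero-shot classification.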

Self-Supervised Vision Embeddings

Practical Notes

Start with linear probing for transfer setup

Normalize embeddings for retrieval


Audio Embeddings

What the model sees (representations)

Typical pipeline

Encoder families

Training objectives (how embeddings are learned)

Pooling to a fixed vector

Augmentations that matter

Practical Notes

Use a strong audio front-end baseline first

Keep normalization policy consistent

Match embedding post-processing to objective

Watch for shortcut features

Evaluate robustness under domain shift


Structured Data Embeddings

Categorical features

Numerical features

Mixed data


Graph Embeddings (Brief)


How To Choose Embedding Size


Practical Tips


PyTorch Examples

import torch
import torch.nn as nn

# Token embeddings
tok_emb = nn.Embedding(num_embeddings=30000, embedding_dim=256)

# Positional embeddings (learned)
pos_emb = nn.Embedding(num_embeddings=512, embedding_dim=256)

# Categorical feature embedding
city_emb = nn.Embedding(num_embeddings=1000, embedding_dim=32)

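# Simple sentence embedding: mean pooling over the sequence dimension
tokens = torch.randint(0, 30000, (8, 32))  # (batch, seq) of token ids
emb = tok_emb(tokens)                      # (batch, seq, 256)
sent_emb = emb.mean(dim=1)                 # (batch, 256); assumes no padding tokens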