Overview

This note covers pooling, batch normalisation, and layer normalisation.
Pooling reduces spatial resolution and compute, while the two normalisation layers stabilise and speed up training; all three can also help generalisation in deep networks.

Pooling

Core idea

Pooling downsamples the spatial dimensions of a feature map, which reduces compute and enlarges the receptive field of later layers.
It adds tolerance to small local translations at the cost of spatial precision.

Common types

Max pooling: keeps the largest activation in a window.
Average pooling: averages activations in a window.
Global pooling: reduces each feature map to a single value.

How it works (2D)

Input feature map: $H \times W$, window $K \times K$, stride $S$, padding $P$.
Output size:

$$H_{out} = \left\lfloor \frac{H + 2P - K}{S} \right\rfloor + 1,\quad W_{out} = \left\lfloor \frac{W + 2P - K}{S} \right\rfloor + 1$$
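
The output-size formula can be checked with a few lines of plain Python. This is only a sketch: the helper name `pool_output_size` is made up for this note and is not a PyTorch function.

```python
def pool_output_size(size: int, kernel: int, stride: int, padding: int = 0) -> int:
    """Output length along one spatial axis: floor((size + 2P - K) / S) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


# 32-pixel axis, 2x2 window, stride 2, no padding
h_out = pool_output_size(32, kernel=2, stride=2)
# 7-pixel axis, 3x3 window, stride 2, padding 1
w_out = pool_output_size(7, kernel=3, stride=2, padding=1)
print(h_out, w_out)  # 16 4
```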

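For a window $\mathcal{W}$ of activations (the set of values under the window, distinct from the width $W$):
- Max pooling: $y = \max_{x \in \mathcal{W}} x$
- Average pooling: $y = \frac{1}{|\mathcal{W}|}\sum_{x \in \mathcal{W}} x$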
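
To make the two reductions concrete, here is a small from-scratch sketch in plain Python, assuming no padding and a nested-list input. The helper names `pool2d` and `mean` are illustrative only; in practice the PyTorch layers shown later do this work.

```python
def pool2d(x, k, stride, reduce_fn):
    """Slide a k x k window over a 2D list (no padding) and reduce each window."""
    h_out = (len(x) - k) // stride + 1
    w_out = (len(x[0]) - k) // stride + 1
    out = []
    for i in range(h_out):
        row = []
        for j in range(w_out):
            # Collect the k*k activations under the current window
            window = [x[i * stride + di][j * stride + dj]
                      for di in range(k) for dj in range(k)]
            row.append(reduce_fn(window))
        out.append(row)
    return out


def mean(values):
    return sum(values) / len(values)


x = [[1, 2, 5, 6],
     [3, 4, 7, 8],
     [9, 10, 13, 14],
     [11, 12, 15, 16]]

max_out = pool2d(x, k=2, stride=2, reduce_fn=max)   # [[4, 8], [12, 16]]
avg_out = pool2d(x, k=2, stride=2, reduce_fn=mean)  # [[2.5, 6.5], [10.5, 14.5]]
```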

Gradient flow

Max pooling: gradient flows only to the argmax element in each window.
Average pooling: gradient is evenly distributed across all elements in the window.
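
The gradient routing can be observed directly with autograd. A minimal sketch, assuming a single 2x2 window so that each pooling op produces one output value:

```python
import torch
import torch.nn.functional as F

x = torch.tensor([[1.0, 2.0], [3.0, 4.0]]).reshape(1, 1, 2, 2)

# Max pooling: backpropagate through a single 2x2 window
x_max = x.clone().requires_grad_(True)
F.max_pool2d(x_max, kernel_size=2).sum().backward()

# Average pooling: same window, same upstream gradient of 1
x_avg = x.clone().requires_grad_(True)
F.avg_pool2d(x_avg, kernel_size=2).sum().backward()

print(x_max.grad.reshape(2, 2))  # only the position of 4.0 receives gradient
print(x_avg.grad.reshape(2, 2))  # every position receives 1/4
```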

Design knobs

Window size: larger windows discard more spatial detail.
Stride: larger stride downsamples faster.
Padding: used to preserve size when needed (“same” style pooling).

Practical Notes

Use pooling cautiously for localization-heavy tasks

Pooling can hurt detection/segmentation quality by discarding spatial detail.

Consider strided convolutions as learnable alternatives

Many modern CNNs replace fixed pooling with strided conv downsampling.
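
As a rough illustration, a stride-2 convolution can produce the same output shape as 2x2 max pooling while adding trainable weights. The channel count and kernel size below are arbitrary choices for the sketch, not a recommendation.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 16, 32, 32)

# Fixed downsampling: no parameters
pool_down = nn.MaxPool2d(kernel_size=2, stride=2)
# Learnable downsampling: 3x3 conv with stride 2 (sizes are illustrative)
conv_down = nn.Conv2d(16, 16, kernel_size=3, stride=2, padding=1)

pool_shape = tuple(pool_down(x).shape)
conv_shape = tuple(conv_down(x).shape)
n_pool_params = sum(p.numel() for p in pool_down.parameters())
n_conv_params = sum(p.numel() for p in conv_down.parameters())

print(pool_shape, conv_shape)        # both (1, 16, 16, 16)
print(n_pool_params, n_conv_params)  # 0 vs 16*16*3*3 + 16 = 2320
```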

Prefer global average pooling before classifiers

Global average pooling is a common, strong default for classification heads.
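
A sketch of such a head, assuming 64 input channels and 10 classes (both numbers are placeholders). Because the pooling collapses H and W to 1x1, the same head accepts feature maps of different spatial sizes.

```python
import torch
import torch.nn as nn

head = nn.Sequential(
    nn.AdaptiveAvgPool2d((1, 1)),  # (N, C, H, W) -> (N, C, 1, 1)
    nn.Flatten(),                  # (N, C, 1, 1) -> (N, C)
    nn.Linear(64, 10),             # 10 classes is an illustrative choice
)

logits_small = head(torch.randn(2, 64, 7, 7))
logits_large = head(torch.randn(2, 64, 14, 14))
print(logits_small.shape, logits_large.shape)  # both torch.Size([2, 10])
```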

Avoid aggressive early downsampling for small objects

Early spatial compression can remove small-object signal before deeper layers.

PyTorch examples

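import torch.nn as nn

# 2x2 windows with stride 2 halve H and W (rounding down for odd sizes)
max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)
# Global average pooling: each feature map is reduced to a single 1x1 value
global_avg = nn.AdaptiveAvgPool2d((1, 1))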

Batch Normalisation

Core idea

Batch norm normalises activations per feature/channel using batch statistics.
It stabilises training, enables higher learning rates, and adds mild regularisation through the noise in per-batch statistics.

How it works

For a batch $B$ with $m$ examples:
- Mean: $\mu_B = \frac{1}{m}\sum_{i=1}^m x_i$
- Variance: $\sigma_B^2 = \frac{1}{m}\sum_{i=1}^m (x_i - \mu_B)^2$
- Normalise: $\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$
- Scale/shift: $y_i = \gamma \hat{x}_i + \beta$
Conv layers: stats are computed per channel over N, H, W.
Affine params: $\gamma$ and $\beta$ are learned per channel/feature.
Running stats: moving averages of mean/variance are stored for inference.
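
The four steps can be reproduced by hand and compared against `nn.BatchNorm2d` in training mode. This is a sketch under the assumption that $\gamma = 1$ and $\beta = 0$ (their initial values), so the scale/shift step is the identity.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(8, 3, 4, 4)  # (N, C, H, W)
eps = 1e-5

# Per-channel statistics over N, H, W (biased variance, as in the formula)
mu = x.mean(dim=(0, 2, 3), keepdim=True)
var = ((x - mu) ** 2).mean(dim=(0, 2, 3), keepdim=True)
x_hat = (x - mu) / torch.sqrt(var + eps)
y_manual = x_hat  # gamma = 1, beta = 0 at initialisation

bn = nn.BatchNorm2d(3, eps=eps)
bn.train()
y_ref = bn(x)

matches = torch.allclose(y_manual, y_ref, atol=1e-5)
print(matches)
```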

Practical Notes

Handle train/eval mode correctly

Train vs eval:
- model.train() uses batch stats and updates running averages.
- model.eval() uses running averages only.
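
The difference between the two modes shows up on a freshly initialised layer. A minimal sketch, with input data deliberately shifted away from zero mean and unit variance so the effect is visible:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(4)
x = torch.randn(16, 4) * 3.0 + 5.0  # far from mean 0, variance 1

bn.train()
y_train = bn(x)  # normalised with this batch's statistics; running stats updated

bn.eval()
with torch.no_grad():
    y_eval = bn(x)  # normalised with the stored running statistics

train_mean = y_train.mean().item()  # close to 0
eval_mean = y_eval.mean().item()    # far from 0: running stats have seen one batch only
print(train_mean, eval_mean)
```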

Mitigate small-batch instability

Small batch sizes make the batch mean and variance noisy estimates, which can destabilise BN.
- Options: SyncBatchNorm (shares statistics across devices), GroupNorm, or LayerNorm.
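
As one example of these options, GroupNorm computes its statistics within each sample, so its output for a sample does not depend on what else is in the batch. The group and channel counts below are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
gn = nn.GroupNorm(num_groups=8, num_channels=64)  # group count is an illustrative choice

batch = torch.randn(4, 64, 8, 8)
out_batch = gn(batch)       # all four samples together
out_single = gn(batch[:1])  # first sample alone (batch size 1)

# The first sample is normalised identically in both cases
same = torch.allclose(out_batch[:1], out_single, atol=1e-5)
print(same)
```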

Use standard layer ordering

Placement:
- Common: Conv → BatchNorm → ReLU.
- Avoid bias in conv layers when followed by BN: the mean subtraction cancels any per-channel constant, and $\beta$ already provides the shift.
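
The redundancy of a conv bias before BN can be checked numerically. In this sketch a random tensor stands in for a bias-free conv output, and a per-channel constant plays the role of the bias. BN in training mode gives the same result either way.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm2d(8)
bn.train()

features = torch.randn(4, 8, 5, 5)  # stand-in for a conv output without bias
bias = torch.randn(1, 8, 1, 1)      # a per-channel constant, like a conv bias

out_without_bias = bn(features)
out_with_bias = bn(features + bias)

# Subtracting the per-channel batch mean removes the added constant
bias_cancelled = torch.allclose(out_without_bias, out_with_bias, atol=1e-4)
print(bias_cancelled)
```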

Watch BN momentum and inference behavior

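Momentum controls how fast running stats update (PyTorch default is 0.1). In PyTorch the update is running = (1 - momentum) * running + momentum * batch statistic, so a larger momentum tracks recent batches more closely.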
Inference with wrong mode settings can cause large accuracy drops.
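
The running-statistics update can be verified after a single training step. A sketch assuming a fresh layer, whose buffers start at mean 0 and variance 1; PyTorch uses the unbiased batch variance for the running estimate.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(3, momentum=0.1)
bn.train()

x = torch.randn(32, 3) * 2.0 + 4.0
batch_mean = x.mean(dim=0)
batch_var_unbiased = x.var(dim=0)  # default is the unbiased estimate

bn(x)  # one forward pass in train mode updates the buffers

# running = (1 - momentum) * running + momentum * batch statistic
expected_mean = 0.9 * torch.zeros(3) + 0.1 * batch_mean
expected_var = 0.9 * torch.ones(3) + 0.1 * batch_var_unbiased

print(torch.allclose(bn.running_mean, expected_mean, atol=1e-5))
print(torch.allclose(bn.running_var, expected_var, atol=1e-5))
```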

PyTorch examples

import torch.nn as nn

bn1 = nn.BatchNorm1d(num_features=128)  # input (N, C) or (N, C, L)
bn2 = nn.BatchNorm2d(num_features=64)   # input (N, C, H, W)
bn3 = nn.BatchNorm3d(num_features=32)   # input (N, C, D, H, W)

# Typical conv block
block = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True)
)

Layer Normalisation

Core idea

Layer norm normalises activations within each sample across its feature dimensions.
It does not depend on batch statistics, so it behaves the same in train and eval.

How it works

For a sample with $d$ features:
- Mean: $\mu = \frac{1}{d}\sum_{j=1}^d x_j$
- Variance: $\sigma^2 = \frac{1}{d}\sum_{j=1}^d (x_j - \mu)^2$
- Normalise: $\hat{x}_j = \frac{x_j - \mu}{\sqrt{\sigma^2 + \epsilon}}$
- Scale/shift: $y_j = \gamma \hat{x}_j + \beta$
Stats are computed per sample, not across the batch.
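
The same steps written out by hand and compared with `nn.LayerNorm`. A sketch assuming $\gamma = 1$ and $\beta = 0$ (the initial values) and normalisation over the last dimension only:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(2, 5, 16)  # (batch, sequence, features)
eps = 1e-5

# Statistics over the feature dimension, separately for every sample and position
mu = x.mean(dim=-1, keepdim=True)
var = ((x - mu) ** 2).mean(dim=-1, keepdim=True)
y_manual = (x - mu) / torch.sqrt(var + eps)  # gamma = 1, beta = 0 at initialisation

ln = nn.LayerNorm(16, eps=eps)
y_ref = ln(x)

ln_matches = torch.allclose(y_manual, y_ref, atol=1e-5)
print(ln_matches)
```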

Practical Notes

Prefer for small-batch or sequence-heavy workloads

Because its statistics are computed per sample, LayerNorm is batch-size agnostic and stays stable even with very small batches.
Common in RNNs and Transformers, where batch stats can be noisy.

Leverage consistent train/eval behavior

No running averages are needed; train and eval behave identically.
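
A quick check of this claim: switching modes does not change the output, and the layer stores no buffers, whereas a BatchNorm layer carries its running statistics. The feature size 32 is an arbitrary choice for the sketch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(4, 32)
ln = nn.LayerNorm(32)

ln.train()
y_train = ln(x)
ln.eval()
y_eval = ln(x)

identical = torch.allclose(y_train, y_eval)
ln_buffers = len(list(ln.buffers()))                  # no running statistics stored
bn_buffers = len(list(nn.BatchNorm1d(32).buffers()))  # running_mean, running_var, num_batches_tracked
print(identical, ln_buffers, bn_buffers)  # True 0 3
```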

Use proven placement patterns

Placement:
- Classic: Linear → LayerNorm → ReLU or Linear → LayerNorm.
- Transformers: often use pre-norm (LayerNorm → sublayer) for stability.
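
A sketch of the pre-norm pattern. The class name `PreNormResidual` is hypothetical, and the residual connection around the sublayer is the usual Transformer arrangement, assumed here rather than stated in this note. Layer sizes are placeholders.

```python
import torch
import torch.nn as nn


class PreNormResidual(nn.Module):
    """Pre-norm wrapper: x + sublayer(LayerNorm(x)). Hypothetical helper for illustration."""

    def __init__(self, dim: int, sublayer: nn.Module):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.sublayer = sublayer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalise first, apply the sublayer, then add the untouched input back
        return x + self.sublayer(self.norm(x))


# A small feed-forward sublayer; sizes are illustrative
ffn = nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64))
prenorm_block = PreNormResidual(64, ffn)
out = prenorm_block(torch.randn(2, 10, 64))
print(out.shape)  # torch.Size([2, 10, 64])
```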

PyTorch examples

import torch.nn as nn

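# Normalises over the last dimension (size 512)
ln = nn.LayerNorm(normalized_shape=512)

block = nn.Sequential(
    # bias=False relies on LayerNorm's own shift (beta); unlike BatchNorm,
    # LayerNorm does not cancel a per-feature bias
    nn.Linear(512, 512, bias=False),
    nn.LayerNorm(512),
    nn.ReLU(inplace=True)
)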

Layer Norm vs Batch Norm

Key differences

Statistics source:
- BatchNorm uses batch-level statistics (across samples).
- LayerNorm uses per-sample statistics (across features).
Train vs eval behavior:
- BatchNorm behaves differently in train/eval because of running statistics.
- LayerNorm behaves the same in train/eval.
Batch-size sensitivity:
- BatchNorm can degrade with very small batches.
- LayerNorm is batch-size agnostic.
Typical use cases:
- BatchNorm: CNNs with reasonably large batches.
- LayerNorm: Transformers/RNNs or small-batch regimes.
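
The difference in statistics source comes down to which axis is reduced. A small sketch on a (batch, features) tensor, with both layers at their default initialisation:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(8, 6) * 2.0 + 1.0  # (batch, features)

bn_out = nn.BatchNorm1d(6).train()(x)  # statistics across the batch, per feature
ln_out = nn.LayerNorm(6)(x)            # statistics across features, per sample

bn_mean_per_feature = bn_out.mean(dim=0)  # approximately 0 for each feature column
ln_mean_per_sample = ln_out.mean(dim=1)   # approximately 0 for each sample row
print(bn_mean_per_feature.abs().max().item(), ln_mean_per_sample.abs().max().item())
```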

Rule of thumb

Use BatchNorm for vision models with stable batch statistics.
Use LayerNorm when batch statistics are unreliable or sequence modeling dominates.
