Syllabus Map
Overview
- This note covers pooling, batch normalisation, and layer normalisation.
- Pooling trades spatial resolution for lower compute and a larger receptive field; the normalisation layers improve training stability and speed, and batch norm adds mild regularisation.
Pooling
Core idea
- Pooling downsamples spatial dimensions to reduce compute and increase receptive field.
- It introduces local translation tolerance but loses spatial precision.
Common types
- Max pooling: keeps the largest activation in a window.
- Average pooling: averages activations in a window.
- Global pooling: reduces each feature map to a single value.
How it works (2D)
- Input feature map: H×W, window K×K, stride S, padding P.
- Output size:
$$H_{out} = \left\lfloor \frac{H + 2P - K}{S} \right\rfloor + 1,\quad
W_{out} = \left\lfloor \frac{W + 2P - K}{S} \right\rfloor + 1$$
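- For a pooling window $\mathcal{W}$ of activations (a set of values, not the width W):
- Max pooling: $y = \max_{x \in \mathcal{W}} x$
- Average pooling: $y = \frac{1}{|\mathcal{W}|} \sum_{x \in \mathcal{W}} x$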
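The output-size formula and the two window reductions can be checked on a tiny input. This is a minimal sketch: `pool_out_size` is a hypothetical helper written for this note, not a PyTorch function, and it assumes the default floor rounding (no `ceil_mode`).

```python
import torch
import torch.nn as nn

def pool_out_size(size: int, k: int, s: int, p: int = 0) -> int:
    """Output length along one axis: floor((size + 2p - k) / s) + 1."""
    return (size + 2 * p - k) // s + 1

# 4x4 feature map holding 0..15, shaped (N, C, H, W)
x = torch.arange(16, dtype=torch.float32).reshape(1, 1, 4, 4)

max_out = nn.MaxPool2d(kernel_size=2, stride=2)(x)
avg_out = nn.AvgPool2d(kernel_size=2, stride=2)(x)

print(pool_out_size(4, k=2, s=2))  # 2
print(max_out.squeeze())           # [[5, 7], [13, 15]]
print(avg_out.squeeze())           # [[2.5, 4.5], [10.5, 12.5]]

# Odd size with padding: floor((7 + 2 - 3) / 2) + 1 = 4
padded = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)(torch.zeros(1, 1, 7, 7))
print(pool_out_size(7, k=3, s=2, p=1), tuple(padded.shape))  # 4 (1, 1, 4, 4)
```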
Gradient flow
- Max pooling: gradient flows only to the argmax element in each window.
- Average pooling: gradient is evenly distributed across all elements in the window.
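The gradient routing can be observed directly with autograd. A small sketch on a single 2×2 window, assuming a summed output so each upstream gradient is 1:

```python
import torch
import torch.nn.functional as F

x = torch.tensor([[1.0, 2.0], [3.0, 4.0]]).reshape(1, 1, 2, 2)

# Max pooling: only the argmax position (value 4) receives gradient.
x_max = x.clone().requires_grad_(True)
F.max_pool2d(x_max, kernel_size=2).sum().backward()

# Average pooling: each of the 4 positions receives 1/4 of the gradient.
x_avg = x.clone().requires_grad_(True)
F.avg_pool2d(x_avg, kernel_size=2).sum().backward()

print(x_max.grad.squeeze())  # [[0, 0], [0, 1]]
print(x_avg.grad.squeeze())  # [[0.25, 0.25], [0.25, 0.25]]
```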
Design knobs
- Window size: larger windows discard more spatial detail.
- Stride: larger stride downsamples faster.
- Padding: used to preserve size when needed (“same” style pooling).
Practical Notes
Use pooling cautiously for localization-heavy tasks
- Pooling can hurt detection/segmentation quality by discarding spatial detail.
Consider strided convolutions as learnable alternatives
- Many modern CNNs replace fixed pooling with strided conv downsampling.
Prefer global average pooling before classifiers
- Global average pooling is a common, strong default for classification heads.
Avoid aggressive early downsampling for small objects
- Early spatial compression can remove small-object signal before deeper layers.
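Two of these notes can be sketched together: a strided convolution that downsamples by the same factor as a 2×2 pool, followed by a global-average-pooling classification head. The channel count, input size, and the 10-class output are illustrative assumptions, not values from this note.

```python
import torch
import torch.nn as nn

x = torch.randn(8, 64, 32, 32)  # (N, C, H, W), sizes chosen for illustration

# Fixed downsampling vs a learnable strided conv; both halve H and W.
pooled = nn.MaxPool2d(kernel_size=2, stride=2)(x)
strided = nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1)(x)

# Classification head: global average pool -> flatten -> linear.
head = nn.Sequential(
    nn.AdaptiveAvgPool2d((1, 1)),  # (N, 64, 1, 1) regardless of input H, W
    nn.Flatten(),                  # (N, 64)
    nn.Linear(64, 10),             # 10 classes is a hypothetical choice
)
logits = head(strided)

print(tuple(pooled.shape), tuple(strided.shape), tuple(logits.shape))
```

Because the head pools to 1×1, it accepts feature maps of any spatial size without changes to the linear layer.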
PyTorch examples
import torch.nn as nn
max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)
global_avg = nn.AdaptiveAvgPool2d((1, 1))
Batch Normalisation
Core idea
- Batch norm normalises activations per feature/channel using batch statistics.
- It stabilises training, enables higher learning rates, and adds mild regularisation.
How it works
- For a batch B with m examples:
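- Mean: $\mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i$
- Variance: $\sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2$
- Normalise: $\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$
- Scale/shift: $y_i = \gamma \hat{x}_i + \beta$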
- Conv layers: stats are computed per channel over N, H, W.
- Affine params: γ and β are learned per channel/feature.
- Running stats: moving averages of mean/variance are stored for inference.
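The four steps can be reproduced by hand and compared against `nn.BatchNorm2d` in training mode. A minimal sketch, assuming a freshly initialised layer (γ = 1, β = 0) and random input:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(16, 3, 8, 8)  # (N, C, H, W)

bn = nn.BatchNorm2d(3)
bn.train()
reference = bn(x)

# Per-channel statistics over N, H, W (biased variance, as in the formula).
mean = x.mean(dim=(0, 2, 3), keepdim=True)
var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
x_hat = (x - mean) / torch.sqrt(var + bn.eps)

# Learned per-channel scale (gamma) and shift (beta).
gamma = bn.weight.view(1, -1, 1, 1)
beta = bn.bias.view(1, -1, 1, 1)
manual = gamma * x_hat + beta

print(torch.allclose(manual, reference, atol=1e-5))
```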
Practical Notes
Handle train/eval mode correctly
- Train vs eval:
- model.train() normalises with batch stats and updates the running averages.
- model.eval() normalises with the stored running averages only.
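The mode switch changes what a sample's output depends on. A sketch with made-up data (shifted and scaled so the batch statistics differ clearly from the initial running stats):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(4)
x = torch.randn(32, 4) * 3 + 5

bn.train()
y_train = bn(x)  # normalised with this batch's mean/variance

bn.eval()
with torch.no_grad():
    y_eval_full = bn(x)        # normalised with running stats
    y_eval_single = bn(x[:1])  # same sample, batch of one

# Train mode: per-feature batch mean of the output is ~0.
print(y_train.mean(dim=0))
# Eval mode: a sample's output does not depend on what else is in the batch.
print(torch.allclose(y_eval_full[:1], y_eval_single, atol=1e-6))
```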
Mitigate small-batch instability
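- With small batches, the per-batch mean and variance are noisy estimates, which can make BN unstable.
- Options: SyncBatchNorm (pools statistics across devices), or GroupNorm/LayerNorm (no dependence on the batch dimension).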
Use standard layer ordering
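- Placement:
- Common: Conv → BatchNorm → ReLU.
- Omit the bias in conv layers directly followed by BN: the mean subtraction cancels any constant per-channel offset, and BN's β already provides a learned shift.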
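The redundancy of the conv bias can be verified numerically. A sketch that gives two convolutions the same weights, one with a random bias and one without, and compares their outputs after BN in training mode (layer sizes are arbitrary):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv_bias = nn.Conv2d(3, 8, kernel_size=3, padding=1, bias=True)
conv_nobias = nn.Conv2d(3, 8, kernel_size=3, padding=1, bias=False)
with torch.no_grad():
    conv_nobias.weight.copy_(conv_bias.weight)  # identical filters
    conv_bias.bias.uniform_(-1.0, 1.0)          # clearly non-zero bias

bn = nn.BatchNorm2d(8).train()
x = torch.randn(4, 3, 16, 16)

out_bias = bn(conv_bias(x))
out_nobias = bn(conv_nobias(x))

# The per-channel bias shifts the batch mean by the same amount,
# so it disappears when the mean is subtracted.
print(torch.allclose(out_bias, out_nobias, atol=1e-4))
```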
Watch BN momentum and inference behavior
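- Momentum controls how fast running stats update (PyTorch default is 0.1). In PyTorch it is the weight given to the newest batch statistic: running = (1 − momentum) × running + momentum × batch.
- Running inference with the model left in train mode (or training with it left in eval mode) can cause large accuracy drops.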
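The running-stat update can be traced on a two-sample batch. This sketch relies on PyTorch's documented behaviour that `running_mean` starts at 0, `running_var` starts at 1, and the running variance is updated with the unbiased batch variance:

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(2, momentum=0.1)
x = torch.tensor([[1.0, 10.0],
                  [3.0, 30.0]])

bn.train()
bn(x)  # one forward pass in train mode updates the running stats

batch_mean = x.mean(dim=0)               # [2, 20]
batch_var = x.var(dim=0, unbiased=True)  # [2, 200]

expected_mean = 0.9 * torch.zeros(2) + 0.1 * batch_mean  # [0.2, 2.0]
expected_var = 0.9 * torch.ones(2) + 0.1 * batch_var     # [1.1, 20.9]

print(bn.running_mean, expected_mean)
print(bn.running_var, expected_var)
```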
PyTorch examples
import torch.nn as nn
bn1 = nn.BatchNorm1d(num_features=128)
bn2 = nn.BatchNorm2d(num_features=64)
bn3 = nn.BatchNorm3d(num_features=32)
# Typical conv block
block = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False),  # bias is redundant before BN
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
)
Layer Normalisation
Core idea
- Layer norm normalises activations within each sample across its feature dimensions.
- It does not depend on batch statistics, so it behaves the same in train and eval.
How it works
- For a sample with d features:
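- Mean: $\mu = \frac{1}{d} \sum_{j=1}^{d} x_j$
- Variance: $\sigma^2 = \frac{1}{d} \sum_{j=1}^{d} (x_j - \mu)^2$
- Normalise: $\hat{x}_j = \frac{x_j - \mu}{\sqrt{\sigma^2 + \epsilon}}$
- Scale/shift: $y_j = \gamma \hat{x}_j + \beta$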
- Stats are computed per sample, not across the batch.
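The same steps can be written out by hand and compared with `nn.LayerNorm`. A minimal sketch on random input, where statistics are taken over the last (feature) dimension of each sample:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(4, 512)  # (batch, features)

ln = nn.LayerNorm(512)
reference = ln(x)

# Per-sample statistics across the d = 512 features.
mu = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)
x_hat = (x - mu) / torch.sqrt(var + ln.eps)

# gamma and beta are learned per feature (shape: 512).
manual = ln.weight * x_hat + ln.bias

print(torch.allclose(manual, reference, atol=1e-5))
```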
Practical Notes
Prefer for small-batch or sequence-heavy workloads
- Because statistics are computed per sample, behavior is independent of batch size and stays stable even with a batch of one.
- Common in RNNs and Transformers, where batch stats can be noisy.
Leverage consistent train/eval behavior
- No running averages are needed; train and eval behave identically.
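Both properties can be checked in a few lines. A sketch with arbitrary sizes; it also confirms that `nn.LayerNorm` registers no buffers, meaning there are no running statistics to maintain:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
ln = nn.LayerNorm(16)
x = torch.randn(8, 16)

ln.train()
y_train = ln(x)
ln.eval()
y_eval = ln(x)
y_single = ln(x[:1])  # same first sample, batch of one

print(torch.equal(y_train, y_eval))                    # identical in both modes
print(torch.allclose(y_eval[:1], y_single, atol=1e-6)) # independent of batch size
print(len(list(ln.buffers())))                         # 0: no running stats stored
```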
Use proven placement patterns
- Placement:
- Classic: Linear → LayerNorm → ReLU or Linear → LayerNorm.
- Transformers: often use pre-norm (LayerNorm → sublayer) for stability.
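A pre-norm block might look like the sketch below. `PreNormResidual` is a hypothetical wrapper named for this note, and the residual connection around the sublayer is an assumption based on the usual Transformer layout; the feed-forward sizes are arbitrary.

```python
import torch
import torch.nn as nn

class PreNormResidual(nn.Module):
    """Hypothetical wrapper: x + sublayer(LayerNorm(x))."""

    def __init__(self, dim: int, sublayer: nn.Module):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.sublayer = sublayer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalise first, then apply the sublayer; the skip path stays un-normalised.
        return x + self.sublayer(self.norm(x))

ffn = nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64))
block = PreNormResidual(64, ffn)

x = torch.randn(2, 10, 64)  # (batch, sequence, features)
y = block(x)
print(tuple(y.shape))  # (2, 10, 64)
```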
PyTorch examples
import torch.nn as nn
ln = nn.LayerNorm(normalized_shape=512)
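block = nn.Sequential(
    # bias=False is a design choice here; unlike BN, LayerNorm does not
    # cancel a per-feature bias, but its β still provides a learned shift.
    nn.Linear(512, 512, bias=False),
    nn.LayerNorm(512),
    nn.ReLU(inplace=True),
)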
Layer Norm vs Batch Norm
Key differences
- Statistics source:
- BatchNorm uses batch-level statistics (across samples).
- LayerNorm uses per-sample statistics (across features).
- Train vs eval behavior:
- BatchNorm behaves differently in train/eval because of running statistics.
- LayerNorm behaves the same in train/eval.
- Batch-size sensitivity:
- BatchNorm can degrade with very small batches.
- LayerNorm is batch-size agnostic.
- Typical use cases:
- BatchNorm: CNNs with reasonably large batches.
- LayerNorm: Transformers/RNNs or small-batch regimes.
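The difference in statistics source comes down to which axis is reduced. A sketch on a (batch, features) tensor with the affine parameters disabled so only the normalisation is visible:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(32, 8) * 2 + 3  # (batch, features), made-up data

bn = nn.BatchNorm1d(8, affine=False).train()
ln = nn.LayerNorm(8, elementwise_affine=False)

bn_out = bn(x)
ln_out = ln(x)

# BatchNorm: each feature (column) has ~zero mean across the batch.
print(bn_out.mean(dim=0))
# LayerNorm: each sample (row) has ~zero mean across its features.
print(ln_out.mean(dim=1))
```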
Rule of thumb
- Use BatchNorm for vision models with stable batch statistics.
- Use LayerNorm when batch statistics are unreliable or sequence modeling dominates.