Overview
- This note covers weight initialisation methods and why they matter.
- Good initialisation helps prevent vanishing or exploding activations and gradients.
Distributions (Normal vs Uniform)
Normal distribution
- Bell‑shaped curve centred at zero.
- Higher probability near the mean, fewer extreme values.
Uniform distribution
- Flat distribution where all values in a range are equally likely.
- Produces bounded weights: no sample falls outside the chosen range.
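The contrast between the two distributions can be checked numerically. The sketch below uses only Python's standard library and an illustrative standard deviation of 0.1 (an assumption, not a recommended value). It relies on the fact that a uniform distribution on (-a, a) has variance a²/3, so choosing a = σ√3 matches the spread of the normal distribution.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

sigma = 0.1                  # illustrative target standard deviation
a = sigma * math.sqrt(3.0)   # U(-a, a) has variance a**2 / 3, so this matches sigma

n_samples = 100_000
normal_samples = [rng.gauss(0.0, sigma) for _ in range(n_samples)]
uniform_samples = [rng.uniform(-a, a) for _ in range(n_samples)]


def std(values):
    """Population standard deviation of a list of floats."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


normal_std = std(normal_samples)
uniform_std = std(uniform_samples)
normal_max = max(abs(v) for v in normal_samples)
uniform_max = max(abs(v) for v in uniform_samples)

print(f"normal:  std={normal_std:.4f}  largest |w|={normal_max:.4f}")
print(f"uniform: std={uniform_std:.4f}  largest |w|={uniform_max:.4f}  bound a={a:.4f}")
```

Both sample sets should show nearly the same standard deviation, but only the uniform samples are guaranteed to stay within the bound a.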
Notation / Terminology
- $W$: weight matrix for a layer.
- $b$: bias vector.
- $n_{\text{in}}$: number of input units (fan‑in).
- $n_{\text{out}}$: number of output units (fan‑out).
- $\sigma^2$: variance of the weight distribution.
- Symmetry breaking: ensuring neurons start with different weights.
Core idea
- Initialisation sets the starting scale of weights.
- The goal is to keep activations and gradients in a stable range.
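To see why the starting scale matters, the following sketch pushes one random input through a stack of random layers and records the root-mean-square (RMS) activation. It makes simplifying assumptions that are not from the note: layers are purely linear with no bias, all layers share one width, and weights are Gaussian. The helper `forward_rms` is a hypothetical name used only for this illustration.

```python
import math
import random


def forward_rms(weight_std, width=64, depth=10, seed=0):
    """RMS activation after each layer of a random bias-free linear network."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]  # unit-scale input
    rms_per_layer = []
    for _ in range(depth):
        # Fresh weights w ~ N(0, weight_std^2) for every connection: y = W x.
        y = [sum(rng.gauss(0.0, weight_std) * xi for xi in x) for _ in range(width)]
        rms_per_layer.append(math.sqrt(sum(v * v for v in y) / width))
        x = y
    return rms_per_layer


width = 64
balanced_std = 1.0 / math.sqrt(width)  # makes width * weight_std**2 equal to 1

results = {
    "too small": forward_rms(0.5 * balanced_std, width),
    "balanced": forward_rms(balanced_std, width),
    "too large": forward_rms(2.0 * balanced_std, width),
}
for label, rms in results.items():
    print(f"{label:>9}: layer 1 RMS={rms[0]:.3g}, layer 10 RMS={rms[-1]:.3g}")
```

Halving or doubling the weight scale changes the activation scale by the same factor at every layer, so after ten layers the three runs differ by several orders of magnitude.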
Common schemes
Random initialisation
- Initialise weights with small random values (uniform or normal).
- Helps break symmetry so neurons learn different features.
- Weights that are too large can cause exploding activations; weights that are too small can cause activations to vanish.
Constant initialisation
- Sets all weights to the same value.
- Useful for bias terms or controlled experiments.
- Bad for hidden weights because it does not break symmetry.
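The symmetry problem can be made concrete with a tiny network. The sketch below assumes a single hidden tanh layer, a scalar output, and a squared-error loss; the input, target, and constant value 0.3 are arbitrary illustrative choices, and `hidden_outputs_and_grads` is a hypothetical helper written for this note.

```python
import math
import random


def hidden_outputs_and_grads(w_hidden, w_out, x, target):
    """Forward and backward pass for a one-hidden-layer tanh network.

    w_hidden holds one row of input weights per hidden unit and w_out one
    output weight per hidden unit. The loss is (y - target) ** 2.
    Returns the hidden activations and d(loss)/d(w_hidden), row by row.
    """
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    y = sum(wo * hi for wo, hi in zip(w_out, h))
    dloss_dy = 2.0 * (y - target)
    grads = []
    for wo, hi in zip(w_out, h):
        dloss_dpre = dloss_dy * wo * (1.0 - hi * hi)  # chain rule through tanh
        grads.append([dloss_dpre * xi for xi in x])
    return h, grads


x = [0.5, -1.0, 2.0]
n_hidden = 4

# Constant initialisation: every hidden unit gets the same weights.
const_h, const_grads = hidden_outputs_and_grads(
    [[0.3] * len(x) for _ in range(n_hidden)], [0.3] * n_hidden, x, target=1.0
)

# Small random initialisation for comparison.
rng = random.Random(0)
rand_hidden = [[rng.uniform(-0.5, 0.5) for _ in x] for _ in range(n_hidden)]
rand_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
rand_h, rand_grads = hidden_outputs_and_grads(rand_hidden, rand_out, x, target=1.0)

print("constant init, identical gradients:", all(g == const_grads[0] for g in const_grads))
print("random init, identical gradients:  ", all(g == rand_grads[0] for g in rand_grads))
```

With constant weights every hidden unit computes the same activation and receives the same gradient, so a gradient step keeps the units identical. Random weights give each unit its own gradient.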
Xavier / Glorot
- This initialisation aims to keep signals from exploding or vanishing as they pass through many layers.
- It ties the weight variance to the number of input and output connections, so that activations in the forward pass and gradients in the backward pass keep roughly the same scale from layer to layer.
- Best for tanh/sigmoid activations.
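- Variance: $\sigma^2 = \frac{2}{n_{\text{in}} + n_{\text{out}}}$ (fan‑in plus fan‑out in the denominator).
Uniform form
- $W \sim \mathcal{U}(-a, a)$ with $a = \sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}$, which gives variance $a^2/3 = \frac{2}{n_{\text{in}} + n_{\text{out}}}$.
- This keeps the initial activations from exploding or shrinking by matching variance across layers.
Normal form
- $W \sim \mathcal{N}(0, \sigma^2)$ with $\sigma = \sqrt{\frac{2}{n_{\text{in}} + n_{\text{out}}}}$.
- Equivalent variance to the uniform version, but with a Gaussian distribution.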
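A minimal sketch of both Xavier forms in plain Python follows. It assumes the usual Glorot constants (uniform bound √(6/(fan-in + fan-out)), normal variance 2/(fan-in + fan-out)) and stores weights as a nested list with one row per output unit, which is a layout chosen for this illustration. The function names are hypothetical, not a library API; in practice a framework's built-in initialisers would normally be used.

```python
import math
import random


def xavier_uniform(n_in, n_out, rng):
    """n_out x n_in matrix from U(-a, a) with a = sqrt(6 / (n_in + n_out))."""
    a = math.sqrt(6.0 / (n_in + n_out))
    return [[rng.uniform(-a, a) for _ in range(n_in)] for _ in range(n_out)]


def xavier_normal(n_in, n_out, rng):
    """n_out x n_in matrix from N(0, sigma^2) with sigma^2 = 2 / (n_in + n_out)."""
    sigma = math.sqrt(2.0 / (n_in + n_out))
    return [[rng.gauss(0.0, sigma) for _ in range(n_in)] for _ in range(n_out)]


def empirical_variance(matrix):
    """Population variance over all entries of a nested-list matrix."""
    values = [w for row in matrix for w in row]
    mean = sum(values) / len(values)
    return sum((w - mean) ** 2 for w in values) / len(values)


rng = random.Random(0)
n_in, n_out = 300, 200
target_variance = 2.0 / (n_in + n_out)

uniform_variance = empirical_variance(xavier_uniform(n_in, n_out, rng))
normal_variance = empirical_variance(xavier_normal(n_in, n_out, rng))

print(f"target variance:  {target_variance:.5f}")
print(f"uniform variance: {uniform_variance:.5f}")
print(f"normal variance:  {normal_variance:.5f}")
```

Both empirical variances should land close to the target, which illustrates that the two forms differ in shape but not in scale.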
He / Kaiming
- To keep signal scale stable under ReLU, which zeroes roughly half of the pre‑activations, it uses a larger variance than Xavier (twice as large when fan‑in equals fan‑out), based only on the number of input connections.
- Best for ReLU‑like activations.
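- Variance: $\sigma^2 = \frac{2}{n_{\text{in}}}$ (fan‑in only).
Uniform form
- $W \sim \mathcal{U}(-a, a)$ with $a = \sqrt{\frac{6}{n_{\text{in}}}}$, which gives variance $a^2/3 = \frac{2}{n_{\text{in}}}$.
- Keeps activation variance stable for ReLU‑like layers.
Normal form
- $W \sim \mathcal{N}(0, \sigma^2)$ with $\sigma = \sqrt{\frac{2}{n_{\text{in}}}}$.
- Standard Kaiming normal initialisation for ReLU‑like activations.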
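The sketch below illustrates the He scheme on a single layer, assuming the usual constants (normal variance 2/fan-in, uniform bound √(6/fan-in)). It draws unit-variance pre-activations, applies ReLU, feeds the result through a He-initialised layer without bias, and compares the mean square of the pre-activations before and after. The layer sizes and function names are illustrative assumptions.

```python
import math
import random


def he_normal(n_in, n_out, rng):
    """n_out x n_in matrix from N(0, sigma^2) with sigma^2 = 2 / n_in."""
    sigma = math.sqrt(2.0 / n_in)
    return [[rng.gauss(0.0, sigma) for _ in range(n_in)] for _ in range(n_out)]


def he_uniform(n_in, n_out, rng):
    """n_out x n_in matrix from U(-a, a) with a = sqrt(6 / n_in)."""
    a = math.sqrt(6.0 / n_in)
    return [[rng.uniform(-a, a) for _ in range(n_in)] for _ in range(n_out)]


def mean_square(values):
    """Average of the squared entries."""
    return sum(v * v for v in values) / len(values)


rng = random.Random(0)
n_in, n_out = 1000, 250

z_prev = [rng.gauss(0.0, 1.0) for _ in range(n_in)]  # previous pre-activations
x = [max(0.0, z) for z in z_prev]                     # ReLU output fed to this layer

ratios = {}
for name, init in [("he_normal", he_normal), ("he_uniform", he_uniform)]:
    weights = init(n_in, n_out, rng)
    z_next = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    ratios[name] = mean_square(z_next) / mean_square(z_prev)
    print(f"{name}: mean square after / before = {ratios[name]:.2f}")
```

ReLU discards about half of the signal's mean square, and the factor of 2 in the He variance restores it, so both ratios should come out near 1 up to sampling noise.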
Practical Notes
Match initialisation to activation function
- Use Xavier-type schemes for tanh/sigmoid and He/Kaiming schemes for ReLU-like activations.
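The effect of a mismatch can be sketched by running the same random ReLU network under both scales. The example assumes equal-width layers with no bias, Gaussian weights, and a non-negative input; `relu_net_rms` is a hypothetical helper, and the width and depth are arbitrary illustrative values.

```python
import math
import random


def relu_net_rms(weight_std, width=128, depth=10, seed=0):
    """RMS of the last layer's pre-activations in a random bias-free ReLU network."""
    rng = random.Random(seed)
    # Input that already looks like a ReLU output of unit-variance values.
    x = [max(0.0, rng.gauss(0.0, 1.0)) for _ in range(width)]
    pre = x
    for _ in range(depth):
        pre = [sum(rng.gauss(0.0, weight_std) * xi for xi in x) for _ in range(width)]
        x = [max(0.0, p) for p in pre]  # ReLU
    return math.sqrt(sum(p * p for p in pre) / width)


width = 128
xavier_std = math.sqrt(2.0 / (width + width))  # Xavier normal with fan-in = fan-out
he_std = math.sqrt(2.0 / width)                # He normal with fan-in = width

xavier_rms = relu_net_rms(xavier_std, width)
he_rms = relu_net_rms(he_std, width)

print(f"Xavier scale + ReLU: final pre-activation RMS = {xavier_rms:.4f}")
print(f"He scale + ReLU:     final pre-activation RMS = {he_rms:.4f}")
```

With the He scale the signal should stay on the order of 1 after ten layers, while the Xavier scale loses a factor of about √2 per ReLU layer, roughly 32 in total here.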
Initialise biases simply unless task-specific priors exist
- Biases are commonly set to zero, with selective non-zero initialisation only when justified.