IOAI ML Notes · Neural Network · Deep Learning

Optimisers, Convergence, and Regularisation

Optimisers, learning rate behaviour, and regularisation techniques such as dropout and weight decay.

Syllabus Map


Overview


Optimisers

Core idea

Gradient descent

$$\theta_{t+1} = \theta_t - \eta \nabla \mathcal{L}(\theta_t)$$

SGD

$$\theta_{t+1} = \theta_t - \eta g_t$$

Momentum

$$v_t = \mu v_{t-1} + g_t$$
$$\theta_{t+1} = \theta_t - \eta v_t$$
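
To make the momentum update concrete, here is a minimal pure-Python sketch on a toy 1-D objective. The function name `sgd_momentum`, the quadratic objective, and the hyperparameter values are illustrative assumptions, not part of the notes.

```python
def sgd_momentum(grad_fn, theta, lr=0.1, mu=0.9, steps=200):
    """Minimise a 1-D function using the momentum update rule."""
    v = 0.0  # velocity, v_0 = 0
    for _ in range(steps):
        g = grad_fn(theta)       # g_t: gradient at the current parameter
        v = mu * v + g           # v_t = mu * v_{t-1} + g_t
        theta = theta - lr * v   # theta_{t+1} = theta_t - lr * v_t
    return theta


# Toy objective L(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
theta_final = sgd_momentum(lambda th: 2.0 * (th - 3.0), theta=0.0)
print(theta_final)  # ends up very close to the minimiser at 3
```

Setting `mu=0.0` reduces the loop to the plain SGD update, which is a quick way to compare the two on the same objective.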

Adagrad

$$r_t = r_{t-1} + g_t^2$$
$$\theta_{t+1} = \theta_t - \eta \frac{g_t}{\sqrt{r_t} + \epsilon}$$

RMSprop

$$r_t = \rho r_{t-1} + (1-\rho) g_t^2$$
$$\theta_{t+1} = \theta_t - \eta \frac{g_t}{\sqrt{r_t} + \epsilon}$$
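
The only difference between Adagrad and RMSprop is how the accumulator is built: a running sum versus an exponential moving average. The sketch below feeds both a constant gradient to show the consequence. The helper names and the constant-gradient setup are assumptions made for illustration.

```python
import math


def adagrad_step(theta, g, r, lr=0.1, eps=1e-8):
    r = r + g * g                                   # running sum of squared gradients
    theta = theta - lr * g / (math.sqrt(r) + eps)
    return theta, r


def rmsprop_step(theta, g, r, lr=0.1, rho=0.9, eps=1e-8):
    r = rho * r + (1.0 - rho) * g * g               # exponential moving average
    theta = theta - lr * g / (math.sqrt(r) + eps)
    return theta, r


# Feed both optimisers the same constant gradient g_t = 1.0 for 100 steps.
th_ada, r_ada = 0.0, 0.0
th_rms, r_rms = 0.0, 0.0
for _ in range(100):
    th_ada, r_ada = adagrad_step(th_ada, 1.0, r_ada)
    th_rms, r_rms = rmsprop_step(th_rms, 1.0, r_rms)

# Effective step size lr / sqrt(r) after 100 steps.
ada_step_size = 0.1 / math.sqrt(r_ada)   # keeps shrinking as the sum grows
rms_step_size = 0.1 / math.sqrt(r_rms)   # settles near lr because the average is bounded
print(ada_step_size, rms_step_size)
```

Adagrad's accumulator grows without bound, so its step size decays towards zero, while the RMSprop average forgets old gradients and keeps the step size roughly constant here.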

Adam

$$m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$$
$$v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$$
$$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t}$$
$$\theta_{t+1} = \theta_t - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$
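
A single Adam step can be sketched in a few lines of plain Python. The hatted quantities are taken here to be the usual bias-corrected moments, m_t / (1 - beta1^t) and v_t / (1 - beta2^t); the function name `adam_step` and the default hyperparameters are illustrative assumptions.

```python
import math


def adam_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * g          # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * g * g      # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)             # bias correction for m
    v_hat = v / (1.0 - beta2 ** t)             # bias correction for v
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


theta, m, v = 1.0, 0.0, 0.0
theta, m, v = adam_step(theta, g=0.5, m=m, v=v, t=1)
print(theta)  # the first step has magnitude close to lr, whatever the size of g
```

On the first step the bias-corrected ratio is g / |g|, so the parameter moves by roughly `lr` regardless of the gradient scale.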

AdamW

$$\theta_{t+1} = \theta_t - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} - \eta \lambda \theta_t$$

Decoupled weight decay vs L2 in Adam

Adam + L2 (coupled)

$$g'_t = g_t + \lambda \theta_t$$
$$m_t = \beta_1 m_{t-1} + (1-\beta_1) g'_t$$
$$v_t = \beta_2 v_{t-1} + (1-\beta_2) (g'_t)^2$$
$$\theta_{t+1} = \theta_t - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

AdamW (decoupled)

$$m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$$
$$v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$$
$$\theta_{t+1} = \theta_t - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} - \eta \lambda \theta_t$$
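
The practical difference between the two variants shows up most clearly when the loss gradient is zero and only the decay term acts. The following sketch is a scalar illustration under that assumption; the helper names (`_adam_core`, `adam_l2_step`, `adamw_step`) and the values of `lr` and `wd` are hypothetical.

```python
import math


def _adam_core(theta, g, m, v, t, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam update with bias correction, shared by both variants."""
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g * g
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    return theta - lr * m_hat / (math.sqrt(v_hat) + eps), m, v


def adam_l2_step(theta, g, m, v, t, lr=1e-3, wd=0.1):
    # Coupled: the decay term is added to the gradient, so it passes
    # through the adaptive normalisation along with everything else.
    return _adam_core(theta, g + wd * theta, m, v, t, lr)


def adamw_step(theta, g, m, v, t, lr=1e-3, wd=0.1):
    # Decoupled: the moments see only the loss gradient; the decay is
    # applied directly to the parameter, scaled by lr * wd.
    new_theta, m, v = _adam_core(theta, g, m, v, t, lr)
    return new_theta - lr * wd * theta, m, v


# Zero loss gradient, so any movement comes from the decay term alone.
th_l2, _, _ = adam_l2_step(2.0, 0.0, 0.0, 0.0, t=1)
th_w, _, _ = adamw_step(2.0, 0.0, 0.0, 0.0, t=1)
print(2.0 - th_l2)  # about lr: normalisation hides the value of wd
print(2.0 - th_w)   # lr * wd * theta: shrinkage proportional to the weight
```

In the coupled version the shrinkage on this step is about `lr` no matter what `wd` is, whereas the decoupled version shrinks the weight by exactly `lr * wd * theta`.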

Practical Notes

Start with a strong optimiser baseline

Compare with SGD + momentum for final quality

Tune the learning rate before switching optimisers


Convergence

What convergence means

Learning rate behaviour

Learning rate schedulers

Step-based schedulers

Exponential schedulers
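
Step-based and exponential schedules both have simple closed forms in terms of the epoch index. The sketch below writes them as plain functions; the names `step_lr` and `exponential_lr` and the default values for `step_size` and `gamma` are assumptions chosen for illustration.

```python
def step_lr(base_lr, epoch, step_size=10, gamma=0.1):
    """Multiply the learning rate by gamma once every step_size epochs."""
    return base_lr * gamma ** (epoch // step_size)


def exponential_lr(base_lr, epoch, gamma=0.95):
    """Multiply the learning rate by gamma after every epoch."""
    return base_lr * gamma ** epoch


for epoch in (0, 9, 10, 25):
    print(epoch, step_lr(0.1, epoch), exponential_lr(0.1, epoch))
```

The step schedule stays flat between drops, while the exponential schedule decays smoothly every epoch.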

Cosine annealing schedulers

Cyclical schedulers

Performance-based schedulers
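
A performance-based scheduler lowers the learning rate only when a monitored metric stops improving. The class below is a simplified, hypothetical sketch of that idea, loosely in the spirit of PyTorch's `ReduceLROnPlateau` but not a reproduction of it; the name `PlateauScheduler` and its defaults are assumptions.

```python
class PlateauScheduler:
    """Cut the learning rate when validation loss has stalled for too long."""

    def __init__(self, lr, factor=0.5, patience=2, min_lr=1e-6):
        self.lr = lr
        self.factor = factor        # multiplicative cut applied on a plateau
        self.patience = patience    # stalled epochs tolerated before cutting
        self.min_lr = min_lr
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.bad_epochs = 0
        return self.lr


sched = PlateauScheduler(lr=0.1)
history = [sched.step(loss) for loss in [1.0, 0.8, 0.8, 0.8, 0.8, 0.7]]
print(history)
```

With `patience=2`, the rate is halved on the third consecutive epoch without improvement and stays at the reduced value afterwards.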

Loss performance on MNIST data

Warmup + decay (typical recipe)
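
One common form of this recipe is a linear warmup followed by cosine decay. The sketch below computes the learning rate for a given step under that assumption; the function name `warmup_cosine_lr` and the default step counts are illustrative, and other decay shapes (linear, inverse square root) are equally plausible.

```python
import math


def warmup_cosine_lr(step, total_steps, base_lr=1e-3, warmup_steps=100, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Ramp linearly; step 0 already gets a small non-zero rate.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clipped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 99, 100, 550, 1000):
    print(step, warmup_cosine_lr(step, total_steps=1000))
```

The rate peaks at `base_lr` when warmup ends, passes through half of `base_lr` at the midpoint of the decay phase, and reaches `min_lr` at `total_steps`.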

Diagnostics


Regularisation

Core idea

Dropout

PyTorch example

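import torch.nn as nn

model = nn.Sequential(
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Dropout(p=0.3),  # zeroes each activation with probability 0.3 during training
    nn.Linear(128, 10),
)

# Dropout is active only in training mode: call model.train() before training
# and model.eval() before validation or inference.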
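
To show what a dropout layer does to its input, here is a pure-Python sketch of inverted dropout, the variant in which kept activations are scaled by 1 / (1 - p) during training so that no rescaling is needed at inference. The function name `dropout` and its `rng` argument are assumptions for illustration and are not the PyTorch API.

```python
import random


def dropout(x, p=0.3, training=True, rng=random):
    """Inverted dropout on a list of activations."""
    if not training or p == 0.0:
        return list(x)                 # identity at inference time
    keep = 1.0 - p
    # Keep each unit with probability `keep` and rescale it so the
    # expected value of every activation is unchanged.
    return [xi / keep if rng.random() < keep else 0.0 for xi in x]


rng = random.Random(0)                 # fixed seed for a reproducible demo
x = [1.0] * 10000
y_train = dropout(x, p=0.3, training=True, rng=rng)
y_eval = dropout(x, p=0.3, training=False)

mean_train = sum(y_train) / len(y_train)
print(mean_train)                      # close to 1.0 on average
print(y_eval == x)                     # inference leaves the input untouched
```

Each surviving activation becomes 1 / 0.7, and roughly 30% of entries are zero, so the mean stays near the original value.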

Early stopping

PyTorch example

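best_val = float("inf")
patience = 5       # epochs to wait without improvement before stopping
wait = 0
best_state = None

for epoch in range(100):
    # train ...
    # validate() is assumed to return the mean validation loss as a float
    val_loss = validate(model, val_loader)

    if val_loss < best_val:
        best_val = val_loss
        wait = 0
        # clone so that later training steps cannot overwrite the snapshot
        # (.cpu() alone returns the same tensor when the model is already on CPU)
        best_state = {k: v.detach().cpu().clone() for k, v in model.state_dict().items()}
    else:
        wait += 1
        if wait >= patience:
            break

# restore the best weights whether or not training stopped early
if best_state is not None:
    model.load_state_dict(best_state)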

Weight decay

Label smoothing
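
Label smoothing replaces the one-hot target with a slightly softened distribution before computing cross-entropy. The sketch below uses one common convention, mixing the one-hot vector with a uniform distribution over all classes; the helper names and the value `eps=0.1` are illustrative assumptions.

```python
import math


def smooth_labels(target, num_classes, eps=0.1):
    """Return (1 - eps) * one_hot(target) + eps * uniform as a list."""
    off = eps / num_classes
    return [1.0 - eps + off if k == target else off for k in range(num_classes)]


def cross_entropy(logits, target_dist):
    """Cross-entropy between a target distribution and softmax(logits)."""
    m = max(logits)                                    # subtract max for stability
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return -sum(q * (z - log_z) for q, z in zip(target_dist, logits))


q = smooth_labels(target=2, num_classes=4, eps=0.1)
print(q)                                               # true class 0.925, others 0.025

confident = [0.0, 0.0, 20.0, 0.0]                      # very peaked prediction
hard_loss = cross_entropy(confident, smooth_labels(2, 4, eps=0.0))
soft_loss = cross_entropy(confident, q)
print(hard_loss, soft_loss)
```

With hard labels the peaked prediction has a loss near zero, whereas the smoothed target still penalises it, which discourages the model from pushing logits to extremes. In PyTorch the same effect is usually obtained through the `label_smoothing` argument of `nn.CrossEntropyLoss` rather than by building the targets by hand.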

Practical usage
