
L1 and L2 Regularisation

A comprehensive guide to L1 and L2 Regularisation: exploring how LASSO and Ridge improve generalisation.

Syllabus Map


Introduction

Limitations of Linear Regression

Use of Regularisation

$$\arg\min_w \big( L(w) + R(w) \big)$$

Stability

Uniform Stability

Definition (Uniform Stability)

$$\big| \ell(A(S), z) - \ell(A(S^{(i)}), z) \big| \le \beta$$

Why Stability Matters

Generalisation Bound

$$|L_{\text{train}} - L_{\text{true}}| \le \beta + O\!\left(\frac{1}{m}\right)$$

Why Regularisation Improves Stability

$$w = (X^\top X)^{-1} X^\top y$$
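The closed-form solution $w = (X^\top X)^{-1} X^\top y$ becomes numerically fragile when $X^\top X$ is close to singular, for example when two features are almost copies of each other. The sketch below illustrates this on hypothetical toy data (the data, the seed, and the value of `lam` are all assumptions chosen for illustration). It compares the condition number of $X^\top X$ with that of $X^\top X + 2m\lambda I$, which is the matrix obtained by setting the gradient of the ridge objective $\frac{1}{m}\sum_i \frac{1}{2}(w^\top x_i - y_i)^2 + \lambda \|w\|_2^2$ to zero.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: two nearly collinear features,
# so X^T X is badly conditioned.
m = 50
x1 = rng.normal(size=m)
x2 = x1 + 1e-6 * rng.normal(size=m)  # almost an exact copy of x1
X = np.column_stack([x1, x2])
y = x1 + 0.1 * rng.normal(size=m)

lam = 0.1  # assumed regularisation strength

gram = X.T @ X
# Setting the ridge gradient to zero gives (X^T X + 2*m*lam*I) w = X^T y
# under the 1/m and 1/2 scaling used in the objective above.
ridge_gram = gram + 2 * m * lam * np.eye(X.shape[1])

cond_ols = np.linalg.cond(gram)
cond_ridge = np.linalg.cond(ridge_gram)

w_ols = np.linalg.solve(gram, X.T @ y)
w_ridge = np.linalg.solve(ridge_gram, X.T @ y)

print("condition number without regularisation:", cond_ols)
print("condition number with ridge term:       ", cond_ridge)
print("OLS weights:  ", w_ols)
print("ridge weights:", w_ridge)
```

Adding $2m\lambda I$ shifts every eigenvalue of $X^\top X$ up by $2m\lambda$, so the smallest eigenvalue can no longer sit near zero. That is the mechanism by which the ridge term makes the solution less sensitive to small changes in the training set.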

L1 Regularisation (LASSO)

Overview

$$R(w) = \lambda \|w\|_1 = \lambda \sum_{i=1}^{d} |w_i|$$

$$J(w) = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{2}(w^\top x_i - y_i)^2 + \lambda \|w\|_1$$

Features of L1 Regularisation

Produces Sparse Weights

Leads to Simpler Models

Gradient Calculation

$$\frac{\partial}{\partial w_j} \left( \lambda |w_j| \right) = \begin{cases} \lambda & \text{if } w_j > 0 \\ -\lambda & \text{if } w_j < 0 \\ [-\lambda, \lambda] & \text{if } w_j = 0 \end{cases}$$
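Because $\lambda |w_j|$ has no single derivative at $w_j = 0$, plain gradient descent needs some convention there. The sketch below shows two options. `l1_subgradient` picks one valid element of the subdifferential (it chooses 0 at $w_j = 0$). `lasso_ista` uses proximal gradient descent with soft-thresholding (often called ISTA), which is a technique swapped in here for illustration and not something the notes above prescribe. The function names, the toy data, and `lam = 0.1` are all assumptions.

```python
import numpy as np

def l1_subgradient(w, lam):
    """One valid subgradient of lam * ||w||_1: lam * sign(w), with 0 at w_j = 0."""
    return lam * np.sign(w)

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1: move each entry towards 0 by t, clipping at 0."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iters=500):
    """Minimise (1/m) * sum(0.5 * (Xw - y)^2) + lam * ||w||_1 by proximal gradient steps."""
    m, d = X.shape
    # Step size 1/L, where L = sigma_max(X)^2 / m bounds the curvature of the loss term.
    step = m / np.linalg.norm(X, 2) ** 2
    w = np.zeros(d)
    for _ in range(n_iters):
        grad = X.T @ (X @ w - y) / m  # gradient of the squared-error term only
        w = soft_threshold(w - step * grad, step * lam)
    return w

# Hypothetical toy problem: only features 0 and 3 actually matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
w_true = np.array([2.0, 0.0, 0.0, -1.5, 0.0])
y = X @ w_true + 0.01 * rng.normal(size=200)

w_hat = lasso_ista(X, y, lam=0.1)
print("subgradient example:", l1_subgradient(np.array([1.0, -2.0, 0.0]), 0.5))
print("estimated weights:  ", w_hat)
```

The soft-thresholding step is what produces exact zeros: any coordinate whose update lands inside $[-t, t]$ is set to 0 instead of merely being made small, which is one way to see why L1 regularisation yields sparse weights.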

L2 Regularisation (Ridge Regression)

Overview

$$R(w) = \lambda \|w\|_2^2 = \lambda \sum_{i=1}^{d} w_i^2$$

$$J(w) = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{2}(w^\top x_i - y_i)^2 + \lambda \|w\|_2^2$$

Features of L2 Regularisation

Improves numerical stability:

Reduces model variance:

Gradient Calculation

$$J(w) = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{2}(w^\top x_i - y_i)^2 + \lambda \|w\|_2^2$$

Gradient of L2 Norm

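$$\|w\| = \sqrt{\sum_{i=1}^{d} w_i^2}$$

Writing $r = \sum_{i=1}^{d} w_i^2$, so that $\|w\| = \sqrt{r}$, the chain rule gives, for $w \neq 0$:

$$\frac{\partial \|w\|}{\partial w_j} = \frac{1}{2\sqrt{r}} \cdot 2 w_j = \frac{w_j}{\sqrt{r}}$$

$$\nabla_w \|w\| = \frac{w}{\|w\|}$$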

Gradient of Ridge Regulariser

$$R(w) = \lambda \|w\|_2^2 = \lambda \sum_{i} w_i^2$$

$$\frac{\partial R}{\partial w_j} = 2\lambda w_j$$

$$\nabla_w R(w) = 2\lambda w$$
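A quick way to sanity-check $\nabla_w R(w) = 2\lambda w$ is to compare it with a finite-difference estimate. The sketch below does this for the full ridge objective, whose gradient is $\frac{1}{m} X^\top (Xw - y) + 2\lambda w$ under the $\frac{1}{m}$ and $\frac{1}{2}$ scaling used above. The function names, the random data, and `lam = 0.3` are assumptions made for illustration.

```python
import numpy as np

def ridge_objective(w, X, y, lam):
    """J(w) = (1/m) * sum(0.5 * (w^T x_i - y_i)^2) + lam * ||w||_2^2."""
    residual = X @ w - y
    return 0.5 * np.mean(residual ** 2) + lam * np.sum(w ** 2)

def ridge_gradient(w, X, y, lam):
    """Analytic gradient: (1/m) * X^T (Xw - y) + 2 * lam * w."""
    m = X.shape[0]
    return X.T @ (X @ w - y) / m + 2.0 * lam * w

def numerical_gradient(f, w, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = np.zeros_like(w)
    for j in range(w.size):
        step = np.zeros_like(w)
        step[j] = eps
        grad[j] = (f(w + step) - f(w - step)) / (2 * eps)
    return grad

# Hypothetical random problem, used only to compare the two gradients.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
w = rng.normal(size=3)
lam = 0.3

analytic = ridge_gradient(w, X, y, lam)
numeric = numerical_gradient(lambda v: ridge_objective(v, X, y, lam), w)
max_error = np.max(np.abs(analytic - numeric))
print("largest difference between analytic and numerical gradient:", max_error)
```

Unlike the L1 penalty, the ridge term is differentiable everywhere, and its gradient $2\lambda w$ shrinks each weight in proportion to its current size. Weights are pulled towards zero but are not set exactly to zero.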

Shape of L1 and L2 Constraints

[Figure: L1 vs L2 constraint geometry]

L1 and L2 Regularisation In Practice

When to Use L1 and L2 Regularisation

When Not to Use L1 and L2 Regularisation

Practical Notes

Preprocessing

Model Selection

Interpretation
