IOAI ML Notes Classical Machine LearningSupervised Learning

Support Vector Machines

A comprehensive guide to support vector machines: exploring how margin maximisation and regularisation create powerful and flexible classifiers.

Syllabus Map


Overview

What is a Hyperplane?

Defining the separating hyperplane

w1x1+w2x2++wpxp+b=0w_1 x_1 + w_2 x_2 + \cdots + w_p x_p + b = 0 (H1) w1xi1+w2xi2++wpxip+b1if yi=1(H_1)\space w_1 x_{i1} + w_2 x_{i2} + \cdots + w_p x_{ip} + b \ge 1 \quad \text{if } y_i = 1 (H2) w1xi1+w2xi2++wpxip+b1if yi=1(H_2)\space w_1 x_{i1} + w_2 x_{i2} + \cdots + w_p x_{ip} + b \le -1 \quad \text{if } y_i = -1

Finding the optimal separating hyperplane

Defining the margin

1. Distance from a point to a hyperplane

d=w1x1+w2x2++wpxp+bw12+w22++wp2d = \frac{| w_1 x_1 + w_2 x_2 + \cdots + w_p x_p + b |} {\sqrt{w_1^2 + w_2^2 + \cdots + w_p^2}}

2. Distance between the two margin hyperplanes

(H1) w1x1+w2x2++wpxp+b=1\text{(H1) } w_1 x_1 + w_2 x_2 + \cdots + w_p x_p + b = 1 (H2) w1x1+w2x2++wpxp+b=1\text{(H2) } w_1 x_1 + w_2 x_2 + \cdots + w_p x_p + b = -1 γ (margin)=2w12+w22++wp2\gamma \space \text{(margin)} = \frac{2}{\sqrt{w_1^2 + w_2^2 + \cdots + w_p^2}}

3. Optimiser Function

maxγ s.t. i yi(w1x1++wpxp+b)0\max \gamma \text{ s.t. } \forall i \space y_i(w_1 x_1 + \cdots + w_p x_p + b) \ge 0 max(2w12++wp2) s.t. i yi(w1x1++wpxp+b)1\max (\frac{2}{\sqrt{w_1^2 + \cdots + w_p^2}}) \text{ s.t. } \forall i \space y_i(w_1 x_1 + \cdots + w_p x_p + b) \ge 1

4. Primal Optimisation Problem

minw1,,wp,b12(w12++wp2)\min_{w_1, \dots, w_p, b} \frac{1}{2}(w_1^2 + \cdots + w_p^2)

5. Lagrangian Optimisation

L(x,λ)=f(x)λg(x)\mathcal L (x, \lambda) = f(x) - \lambda \cdot g(x) L(w1,,wp,b,λ)=12(w12++wp2)i=1nλi[yi(w1xi,1++wpxi,p+b)1]\mathcal{L}(w_1, \dots, w_p, b, \lambda) = \frac{1}{2}(w_1^2 + \cdots + w_p^2) - \sum_{i=1}^n \lambda_i \left[ y_i(w_1 x_{i,1} + \cdots + w_p x_{i,p} + b) - 1 \right]

6. Finding Partial Derivatives

Lwi=wij=1nλj yj xj,i\frac{\partial \mathcal{L}}{\partial w_i} = w_i - \sum^{n}_{j=1} \lambda_j \space y_j \space x_{j, i} Lb=j=1nλj yj\frac{\partial \mathcal{L}}{\partial b} = - \sum^{n}_{j=1} \lambda_j \space y_j Lλi=[yj(w1xj,1++wpxj,p+b)1]\frac{\partial \mathcal{L}}{\partial \lambda_i} = - \big[ y_j (w_1 x_{j,1} + \cdots + w_p x_{j,p} + b) - 1 \big ]

7. Support Vectors

Lλi=[yj(w1xj,1++wpxj,p+b)1]=0\frac{\partial \mathcal{L}}{\partial \lambda_i} = \big[ y_j (w_1 x_{j,1} + \cdots + w_p x_{j,p} + b) - 1 \big ] = 0 [yj(w1xj,1++wpxj,p+b)1]=0- \big[ y_j (w_1 x_{j,1} + \cdots + w_p x_{j,p} + b) - 1 \big ] = 0 yj(w1xj,1++wpxj,p+b)1=0y_j (w_1 x_{j,1} + \cdots + w_p x_{j,p} + b) - 1 = 0 yi(w1xi,1++wpxi,p+b)=1y_i(w_1 x_{i,1} + \cdots + w_p x_{i,p} + b) = 1

8. Decision Rule

y^=sign(wTx+b)\hat y = \text{sign}(w^Tx+b)

Soft Constraints

minw1,,wp,b12(w12++wp2)+Ci=1nξi  s.t.  yi(w1xi,1++wpxi,p+b)1ξi s.t. ξi0\min_{w_1, \dots, w_p, b} \frac{1}{2}(w_1^2 + \cdots + w_p^2) + C \sum^{n}_{i=1} \xi_i \space \\ \text{ s.t. } \space \forall y_i(w_1 x_{i,1} + \cdots + w_p x_{i,p} + b) \ge 1 - \xi_i \\ \text{ s.t. } \xi_i \ge 0

Bias–Variance Tradeoff via CC

Acts as a regularisation parameter.

Geometric Interpretation of the SVM

Kernel Trick and RBF Kernel

K(xi,xj)=exp(γxixj2)K(x_i, x_j) = \exp\left(-\gamma \|x_i - x_j\|^2\right)

Support Vector Machines In Practice

When to Use SVMs

When Not to Use SVMs

Practical Notes

Preprocessing and Tuning

Model Choice

Probability Outputs

← Back to Blog