
Logistic Regression

A comprehensive guide to Logistic Regression: exploring how it transforms Linear Regression into a powerful tool for binary and multi-class classification.



Overview


Preface: Relation to Linear Regression

$$\hat{y}_i = w x_i + b$$

Linear Function

$$\hat{y}_i = w x_i + b$$

Normalisation Functions

1. Sigmoid Function (Binary Classification)

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

$$\hat{y}_i = P(y_i = 1 \mid x_i, \theta) = \sigma(w x_i + b)$$
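The sigmoid and the resulting prediction can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `sigmoid` and the example values of `w`, `b` and `x` are assumptions chosen for the demo. The sign-dependent form is a common trick to avoid overflow in `exp` for large negative inputs.

```python
import numpy as np

def sigmoid(z):
    """Map real-valued scores z to probabilities in (0, 1)."""
    z = np.asarray(z, dtype=float)
    # exp(-|z|) never overflows; choose the algebraically equivalent form per sign.
    e = np.exp(-np.abs(z))
    return np.where(z >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

# Hypothetical parameters and a single scalar feature, for illustration only.
w, b, x = 2.0, -1.0, 1.5
y_hat = sigmoid(w * x + b)  # sigma(2.0), roughly 0.88
```

Because the output lies strictly between 0 and 1, it can be read as the estimated probability that the label is 1.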

2. Softmax Function (Multi-class Classification)

For multi-class classification, the Softmax function generalises the sigmoid to multiple outputs:

$$\sigma(\vec{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}$$
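A short NumPy sketch of the softmax formula follows, where the sum runs over the class scores. The function name `softmax` and the sample scores are assumptions for illustration. Subtracting the maximum score before exponentiating is a standard stabilisation step that does not change the result, since the common factor cancels in the ratio.

```python
import numpy as np

def softmax(z):
    """Turn a vector (or rows) of class scores into probabilities summing to 1."""
    z = np.asarray(z, dtype=float)
    # Shifting by the maximum leaves the ratio unchanged but avoids overflow in exp.
    shifted = z - np.max(z, axis=-1, keepdims=True)
    exp_z = np.exp(shifted)
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)

scores = np.array([1.0, 2.0, 3.0])  # hypothetical class scores
probs = softmax(scores)

# With two classes and scores (z, 0), the first softmax output is the sigmoid of z.
z = 0.7
two_class = softmax(np.array([z, 0.0]))[0]
sigmoid_z = 1.0 / (1.0 + np.exp(-z))
```

The two-class check at the end illustrates in what sense softmax generalises the sigmoid.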

Cost Function

Maximum Likelihood Estimation (MLE)

1. Probability for a Single Data Point

$$P(y_i = 1 \mid x_i, \theta) = \sigma(w x_i + b)$$

$$P(y_i = 0 \mid x_i, \theta) = 1 - \sigma(w x_i + b)$$

$$P(y_i \mid x_i, \theta) = (\hat{y}_i)^{y_i} (1 - \hat{y}_i)^{1 - y_i}$$

2. Likelihood of the Entire Dataset

$$P(y \mid x, \theta) = \prod_{i=1}^{n} P(y_i \mid x_i, \theta)$$

$$\hat{\theta} = \arg\max_{\theta} \prod_{i=1}^{n} P(y_i \mid x_i, \theta)$$

$$\log P(y \mid x, \theta) = \sum_{i=1}^{n} [y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i)]$$

$$C(w, b) = - \sum_{i=1}^{n} [y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i)]$$
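To see that the cost is simply the negative log of the dataset likelihood, the two can be computed side by side. The sketch below assumes a tiny made-up set of labels and predictions. The helper name `binary_cross_entropy` and the clipping tolerance `eps` are illustrative choices, not part of the derivation.

```python
import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Summed negative log-likelihood C for binary labels y and predictions y_hat."""
    y = np.asarray(y, dtype=float)
    # Clip so log never receives exactly 0 or 1 (eps is an assumed tolerance).
    y_hat = np.clip(np.asarray(y_hat, dtype=float), eps, 1.0 - eps)
    return -np.sum(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

# Hypothetical labels and predicted probabilities.
y = np.array([1, 0, 1])
y_hat = np.array([0.9, 0.2, 0.6])

# Product of per-sample Bernoulli probabilities: 0.9 * 0.8 * 0.6.
likelihood = np.prod(y_hat**y * (1 - y_hat)**(1 - y))
cost = binary_cross_entropy(y, y_hat)
# cost agrees with -log(likelihood) up to floating-point error.
```

Working with the sum of logs rather than the raw product also avoids numerical underflow when the number of samples is large.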

Relationship with Cross-Entropy Loss

$$H(p, q) = - \sum_{i=1}^{n} p(x_i) \log q(x_i)$$
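The link to the cost function can be made explicit for a single sample. The worked step below assumes that $p$ is the true label distribution of sample $i$, namely $(y_i, 1 - y_i)$, and that $q$ is the predicted distribution $(\hat{y}_i, 1 - \hat{y}_i)$. The class index $k$ is introduced here only to avoid clashing with the sample index $i$.

```latex
% Assumption: p = (y_i, 1 - y_i) and q = (\hat{y}_i, 1 - \hat{y}_i),
% with p(1) = y_i, p(0) = 1 - y_i, q(1) = \hat{y}_i, q(0) = 1 - \hat{y}_i.
\begin{aligned}
H(p, q) &= -\sum_{k \in \{0, 1\}} p(k) \log q(k) \\
        &= -\left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]
\end{aligned}
% Summing this over i = 1, \dots, n recovers C(w, b).
```

Under these assumptions, minimising the negative log-likelihood and minimising the summed cross-entropy are the same problem.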

Gradient Descent Optimisation

$$w = w - \alpha \frac{\partial C}{\partial w}$$

$$b = b - \alpha \frac{\partial C}{\partial b}$$

Derivation of Gradients

$$\hat{y}_i = \sigma(w x_i + b)$$

$$C(w, b) = - \sum_{i=1}^{n} [y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i)]$$

1. Derivative of Cost with Respect to Predictions

$$\frac{\partial C}{\partial \hat{y}_i} = - \left( \frac{y_i}{\hat{y}_i} - \frac{1 - y_i}{1 - \hat{y}_i} \right) = \frac{\hat{y}_i - y_i}{\hat{y}_i(1 - \hat{y}_i)}$$

2. Derivative of Predictions with Respect to Parameters

Since $\hat{y}_i = \sigma(z_i) = \sigma(w x_i + b)$:

$$\frac{\partial \hat{y}_i}{\partial z_i} = \hat{y}_i (1 - \hat{y}_i)$$

$$\frac{\partial z_i}{\partial w} = x_i, \quad \frac{\partial z_i}{\partial b} = 1$$

$$\frac{\partial \hat{y}_i}{\partial w} = x_i \hat{y}_i (1 - \hat{y}_i)$$

$$\frac{\partial \hat{y}_i}{\partial b} = \hat{y}_i (1 - \hat{y}_i)$$
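The first identity, the derivative of the sigmoid with respect to its input, is used without proof. A short worked derivation from the definition $\sigma(z) = 1 / (1 + e^{-z})$ is sketched here for completeness:

```latex
\begin{aligned}
\frac{d\sigma}{dz}
  &= \frac{d}{dz} \left( 1 + e^{-z} \right)^{-1}
   = \frac{e^{-z}}{\left( 1 + e^{-z} \right)^{2}} \\
  &= \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}}
   = \sigma(z) \left( 1 - \sigma(z) \right)
\end{aligned}
```

The last step uses $1 - \sigma(z) = e^{-z} / (1 + e^{-z})$.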

3. Applying the Chain Rule

$$\frac{\partial C}{\partial w} = \sum_{i=1}^{n} (\hat{y}_i - y_i) x_i$$

$$\frac{\partial C}{\partial b} = \sum_{i=1}^{n} (\hat{y}_i - y_i)$$
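These closed-form gradients can be sanity-checked numerically by comparing them with central finite differences of the cost. The sketch below uses made-up data and parameter values. The function names, the step size `h` and the test point `(w, b)` are all assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, b, x, y):
    """Summed negative log-likelihood C(w, b) for scalar features x."""
    y_hat = sigmoid(w * x + b)
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def analytic_grads(w, b, x, y):
    """Gradients from the chain rule: sum((y_hat - y) * x) and sum(y_hat - y)."""
    err = sigmoid(w * x + b) - y
    return np.sum(err * x), np.sum(err)

# Hypothetical data and parameters.
x = np.array([-2.0, -0.5, 0.3, 1.5])
y = np.array([0.0, 0.0, 1.0, 1.0])
w, b, h = 0.4, -0.1, 1e-6

dw, db = analytic_grads(w, b, x, y)
# Central differences approximate the partial derivatives of the cost.
dw_num = (cost(w + h, b, x, y) - cost(w - h, b, x, y)) / (2 * h)
db_num = (cost(w, b + h, x, y) - cost(w, b - h, x, y)) / (2 * h)
```

If the derivation is right, `dw` and `dw_num` (and likewise `db` and `db_num`) agree to several decimal places.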

Final Gradient Update Rules

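Dividing the summed gradients by the number of samples $n$ averages them, which keeps the step size independent of the dataset size:

$$w = w - \alpha \cdot \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i) x_i$$

$$b = b - \alpha \cdot \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)$$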
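Putting the update rules into a loop gives a minimal batch gradient descent trainer. This is a sketch under simplifying assumptions: a single scalar feature, a fixed learning rate `alpha`, a fixed number of steps `n_steps`, and a small made-up dataset that is deliberately not linearly separable so that the optimum is finite. The function name `fit_logistic` is hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(x, y, alpha=0.1, n_steps=2000):
    """Batch gradient descent on the averaged logistic cost, one scalar feature."""
    w, b = 0.0, 0.0
    n = len(x)
    for _ in range(n_steps):
        err = sigmoid(w * x + b) - y          # y_hat - y for every sample
        w -= alpha * np.sum(err * x) / n      # averaged gradient w.r.t. w
        b -= alpha * np.sum(err) / n          # averaged gradient w.r.t. b
    return w, b

# Hypothetical data: two labels are "flipped", so the classes overlap.
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])

w, b = fit_logistic(x, y)
# Threshold the predicted probabilities at 0.5 to obtain class labels.
preds = (sigmoid(w * x + b) >= 0.5).astype(float)
```

On perfectly separable data the unregularised weights would keep growing, which is one reason practical implementations usually add regularisation or a stopping criterion.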

Logistic Regression In Practice

When to Use Logistic Regression

When Not to Use Logistic Regression

Practical Notes

Preprocessing and Tuning

Imbalance and Multiclass

Calibration
