Overview
- Logistic Regression builds upon the foundation of Linear Regression.
- While Linear Regression models continuous outcomes, Logistic Regression adapts the same linear model to predict probabilities for classification tasks.
Preface: Relation to Linear Regression
- In Linear Regression, predictions are made using:
  $$\hat{y} = w^\top x + b$$
- However, such predictions are unbounded and not suitable for probabilities (which must lie between 0 and 1).
- To address this, Logistic Regression introduces normalisation functions that map the linear output to a probability space.
Linear Function
- The base linear model used in Logistic Regression is identical to that of Linear Regression:
  $$z = w^\top x + b$$
- This linear component serves as the input (logit) to the normalisation (activation) function.
Normalisation Functions
- Normalisation functions map the unbounded linear output into a constrained probability range.
1. Sigmoid Function (Binary Classification)
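- For binary classification, we use the Sigmoid (or Logistic) function:
  $$\sigma(z) = \frac{1}{1 + e^{-z}}$$
- This function ensures $0 < \sigma(z) < 1$, interpreting the output as the probability that a given input belongs to class $1$:
  $$\hat{y} = P(y = 1 \mid x) = \sigma(w^\top x + b)$$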
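A minimal sketch of the sigmoid and the resulting class-1 probability, using only the standard library. The function names (`sigmoid`, `predict_proba`) and the example weights are illustrative assumptions, not part of the notes. The two-branch form is one common way to avoid overflow in `exp` for large negative inputs.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function 1 / (1 + exp(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z use the equivalent form exp(z) / (1 + exp(z)),
    # so that exp is never called on a large positive number.
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(w, x, b):
    """P(y = 1 | x) for weights w, features x and bias b (hypothetical helper)."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b  # linear score (logit)
    return sigmoid(z)

# Illustrative values only: z = 1*3 + (-2)*1 + 0.5 = 1.5
p = predict_proba([1.0, -2.0], [3.0, 1.0], 0.5)
print(p)
```

In floating point the result can round to exactly 0.0 or 1.0 for extreme inputs, even though the mathematical function stays strictly inside the interval.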
2. Softmax Function (Multi-class Classification)
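- For multi-class classification with $K$ classes, the Softmax function generalises the sigmoid to multiple outputs:
  $$\mathrm{softmax}(z)_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}, \qquad k = 1, \dots, K$$
- Softmax produces a vector of probabilities across all classes that sum to 1.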
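The softmax can be sketched as follows. Subtracting the maximum score before exponentiating is a standard stability trick that leaves the result unchanged; the example scores are made up for illustration. The last line checks the two-class case, where softmax over the scores $(z, 0)$ gives the sigmoid of $z$.

```python
import math

def softmax(z):
    """Map a list of real scores to probabilities that sum to 1."""
    shift = max(z)  # subtracting the max avoids overflow in exp
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])  # illustrative scores for K = 3 classes
print(probs, sum(probs))

# Two-class special case: softmax([z, 0])[0] equals sigmoid(z).
two_class = softmax([1.5, 0.0])[0]
print(two_class, 1.0 / (1.0 + math.exp(-1.5)))
```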
Cost Function
- Logistic Regression is trained by Maximum Likelihood Estimation (MLE): finding parameters that maximise the likelihood of the observed data.
Maximum Likelihood Estimation (MLE)
1. Probability for a Single Data Point
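- For a binary target $y \in \{0, 1\}$ with predicted probability $\hat{y} = \sigma(w^\top x + b)$:
  $$P(y = 1 \mid x) = \hat{y}, \qquad P(y = 0 \mid x) = 1 - \hat{y}$$
- We can express both cases compactly as:
  $$P(y \mid x) = \hat{y}^{\,y} (1 - \hat{y})^{1 - y}$$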
2. Likelihood of the Entire Dataset
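- Assuming independent samples $(x_i, y_i)$, $i = 1, \dots, m$, with $\hat{y}_i = \sigma(w^\top x_i + b)$:
  $$L(w, b) = \prod_{i=1}^{m} \hat{y}_i^{\,y_i} (1 - \hat{y}_i)^{1 - y_i}$$
- We maximise this likelihood with respect to $w$ and $b$:
  $$w^*, b^* = \arg\max_{w, b} L(w, b)$$
- To simplify computation, we take the log of the likelihood, which turns the product into a sum:
  $$\log L(w, b) = \sum_{i=1}^{m} \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right]$$
- Since optimisation algorithms typically minimise functions, we take the negative log-likelihood, averaged over the $m$ samples, to obtain the cost function:
  $$J(w, b) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right]$$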
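A small sketch of the cost function as an average negative log-likelihood. The helper names, the `eps` clipping constant and the toy data are assumptions for illustration; clipping simply keeps `log` away from zero when a predicted probability rounds to 0 or 1.

```python
import math

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def nll_cost(w, b, X, y, eps=1e-12):
    """Average negative log-likelihood of labels y under the model (w, b)."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
        p = min(max(p, eps), 1.0 - eps)  # guard against log(0)
        total -= yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total / len(y)

# Toy data (made up): one feature, two samples.
X = [[-1.0], [1.0]]
y = [0, 1]

cost_uninformed = nll_cost([0.0], 0.0, X, y)  # every prediction is 0.5
cost_better = nll_cost([2.0], 0.0, X, y)      # predictions lean the right way
print(cost_uninformed, cost_better)
```

With all-zero parameters every prediction is 0.5, so the cost equals $\log 2$; parameters that assign higher probability to the observed labels give a lower cost.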
Relationship with Cross-Entropy Loss
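- The Logistic Regression cost function is equivalent to Cross-Entropy Loss, which measures the dissimilarity between two probability distributions: the true labels $y$ and predicted probabilities $\hat{y}$:
  $$H(y, \hat{y}) = -\sum_{k} y_k \log \hat{y}_k$$
  With two classes, the label distribution is $(1 - y,\, y)$ and the predicted distribution is $(1 - \hat{y},\, \hat{y})$, so this reduces to $-\left[ y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \right]$, the negative log-likelihood of a single sample.
- In this context, minimising cross-entropy pushes the predicted probabilities towards the true class labels.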
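The equivalence can be checked numerically. This sketch compares a general cross-entropy between two discrete distributions with the binary negative log-likelihood of one sample; the function names and the example values are illustrative assumptions.

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_k p_k * log(q_k); terms with p_k = 0 contribute nothing."""
    return -sum(pk * math.log(qk) for pk, qk in zip(p, q) if pk > 0)

def binary_nll(y, y_hat):
    """Negative log-likelihood of one binary label y under probability y_hat."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat))

results = []
for y, y_hat in [(1, 0.8), (0, 0.3)]:
    # Write the label and the prediction as distributions over (class 0, class 1).
    ce = cross_entropy([1 - y, y], [1.0 - y_hat, y_hat])
    results.append((ce, binary_nll(y, y_hat)))

print(results)
```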
Gradient Descent Optimisation
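- We optimise $w$ and $b$ using Gradient Descent, updating parameters iteratively to minimise the cost function $J(w, b)$:
  $$w \leftarrow w - \alpha \frac{\partial J}{\partial w}, \qquad b \leftarrow b - \alpha \frac{\partial J}{\partial b}$$
- where $\alpha$ is the learning rate.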
Derivation of Gradients
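- We start from:
  $$J(w, b) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right], \qquad \hat{y}_i = \sigma(z_i), \qquad z_i = w^\top x_i + b$$
1. Derivative of Cost with Respect to Predictions
- Differentiating the $i$-th term of the sum:
  $$\frac{\partial J}{\partial \hat{y}_i} = -\frac{1}{m} \left( \frac{y_i}{\hat{y}_i} - \frac{1 - y_i}{1 - \hat{y}_i} \right) = \frac{1}{m} \cdot \frac{\hat{y}_i - y_i}{\hat{y}_i (1 - \hat{y}_i)}$$
2. Derivative of Predictions with Respect to Parameters
- Since $\hat{y}_i = \sigma(z_i)$ and $\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$:
  $$\frac{\partial \hat{y}_i}{\partial z_i} = \hat{y}_i (1 - \hat{y}_i), \qquad \frac{\partial z_i}{\partial w} = x_i, \qquad \frac{\partial z_i}{\partial b} = 1$$
- Combining these equations, the factor $\hat{y}_i (1 - \hat{y}_i)$ cancels and we have:
  $$\frac{\partial J}{\partial z_i} = \frac{\partial J}{\partial \hat{y}_i} \cdot \frac{\partial \hat{y}_i}{\partial z_i} = \frac{1}{m} (\hat{y}_i - y_i)$$
3. Applying the Chain Rule
- Gradient w.r.t. $w$:
  $$\frac{\partial J}{\partial w} = \sum_{i=1}^{m} \frac{\partial J}{\partial z_i} \cdot \frac{\partial z_i}{\partial w} = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}_i - y_i)\, x_i$$
- Gradient w.r.t. $b$:
  $$\frac{\partial J}{\partial b} = \sum_{i=1}^{m} \frac{\partial J}{\partial z_i} \cdot \frac{\partial z_i}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}_i - y_i)$$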
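One way to sanity-check the derived gradients is to compare them with central finite differences of the cost. The sketch below assumes the cost is averaged over the samples; the data, the parameter values and the step size `h` are arbitrary choices for illustration.

```python
import math

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def cost(w, b, X, y):
    """Average negative log-likelihood J(w, b)."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
        total -= yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total / len(y)

def analytic_gradients(w, b, X, y):
    """Gradients from the chain rule: mean of (y_hat - y) * x and of (y_hat - y)."""
    m = len(y)
    errors = [sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
              for xi, yi in zip(X, y)]
    dw = [sum(e * xi[j] for e, xi in zip(errors, X)) / m for j in range(len(w))]
    db = sum(errors) / m
    return dw, db

def numeric_gradients(w, b, X, y, h=1e-6):
    """Central finite differences of the cost, one parameter at a time."""
    dw = []
    for j in range(len(w)):
        w_plus = w[:j] + [w[j] + h] + w[j + 1:]
        w_minus = w[:j] + [w[j] - h] + w[j + 1:]
        dw.append((cost(w_plus, b, X, y) - cost(w_minus, b, X, y)) / (2 * h))
    db = (cost(w, b + h, X, y) - cost(w, b - h, X, y)) / (2 * h)
    return dw, db

# Made-up data and parameters, used only for the comparison.
X = [[0.5, 1.2], [-1.0, 0.3], [1.5, -0.7], [0.2, 0.1]]
y = [1, 0, 1, 0]
w, b = [0.3, -0.2], 0.1

dw_a, db_a = analytic_gradients(w, b, X, y)
dw_n, db_n = numeric_gradients(w, b, X, y)
max_diff = max([abs(a - n) for a, n in zip(dw_a, dw_n)] + [abs(db_a - db_n)])
print(max_diff)
```

A small `max_diff` indicates that the analytic expressions agree with the numerical slope of the cost at that point.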
Final Gradient Update Rules
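- Weight update (with learning rate $\alpha$ and $m$ training samples):
  $$w \leftarrow w - \frac{\alpha}{m} \sum_{i=1}^{m} (\hat{y}_i - y_i)\, x_i$$
- Bias update:
  $$b \leftarrow b - \frac{\alpha}{m} \sum_{i=1}^{m} (\hat{y}_i - y_i)$$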
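The update rules can be put together into a short batch gradient descent loop. This is a sketch under stated assumptions rather than a reference implementation: the names `fit` and `cost`, the learning rate, the epoch count and the toy dataset are all illustrative, and the data is deliberately not perfectly separable so the weights settle at finite values.

```python
import math

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(w, b, x):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def cost(w, b, X, y):
    """Average negative log-likelihood."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = predict_proba(w, b, xi)
        total -= yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total / len(y)

def fit(X, y, lr=0.1, epochs=2000):
    """Batch gradient descent starting from all-zero parameters."""
    m, n = len(X), len(X[0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        errors = [predict_proba(w, b, xi) - yi for xi, yi in zip(X, y)]
        dw = [sum(e * xi[j] for e, xi in zip(errors, X)) / m for j in range(n)]
        db = sum(errors) / m
        w = [wj - lr * dwj for wj, dwj in zip(w, dw)]  # weight update
        b -= lr * db                                   # bias update
    return w, b

# Toy one-feature dataset (made up); labels overlap around x = 2..3.
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y = [0, 0, 1, 0, 1, 1]

initial_cost = cost([0.0], 0.0, X, y)
w, b = fit(X, y)
final_cost = cost(w, b, X, y)
print(initial_cost, final_cost, w, b)
```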
Logistic Regression In Practice
When to Use Logistic Regression
- When the decision boundary is roughly linear in feature space.
- When you need well-calibrated class probabilities.
- When interpretability of feature weights is important.
- When you have high-dimensional sparse features (e.g., text).
When Not to Use Logistic Regression
- When class separation is highly nonlinear without feature engineering.
- When severe class imbalance is not handled by weighting or resampling.
- When label noise is high and margins are weak.
- When you need structured or sequence outputs.
Practical Notes
Preprocessing and Tuning
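- Standardise features (zero mean, unit variance) so that a single regularisation strength penalises all weights on a comparable scale, and tune that strength, for example by cross-validation.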
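A minimal standardisation sketch, with hypothetical helper names `fit_standardiser` and `transform`. The statistics are computed on the training data only and then reused for any later data; constant columns are given a scale of 1.0 to avoid division by zero, which is one possible convention.

```python
import math

def fit_standardiser(X):
    """Return per-column means and standard deviations of the training data."""
    m, n = len(X), len(X[0])
    means = [sum(row[j] for row in X) / m for j in range(n)]
    stds = []
    for j in range(n):
        var = sum((row[j] - means[j]) ** 2 for row in X) / m
        std = math.sqrt(var)
        stds.append(std if std > 0 else 1.0)  # constant column: leave scale alone
    return means, stds

def transform(X, means, stds):
    """Apply (value - mean) / std column by column."""
    return [[(v - mu) / sd for v, mu, sd in zip(row, means, stds)] for row in X]

# Made-up data: columns on very different scales, plus one constant column.
X_train = [[1.0, 100.0, 7.0], [2.0, 200.0, 7.0], [3.0, 300.0, 7.0]]
means, stds = fit_standardiser(X_train)
Z_train = transform(X_train, means, stds)
print(Z_train)
```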
Imbalance and Multiclass
- Use class weights or decision thresholds to control precision and recall.
- Prefer multinomial loss for multi-class problems when available.
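To illustrate how the decision threshold trades precision against recall, the sketch below evaluates both at a few thresholds. The labels and predicted probabilities are invented for illustration, and returning 0.0 when a denominator is zero is an assumed convention.

```python
def precision_recall(y_true, probs, threshold):
    """Precision and recall when predicting class 1 for probs >= threshold."""
    tp = fp = fn = 0
    for yt, p in zip(y_true, probs):
        pred = 1 if p >= threshold else 0
        if pred == 1 and yt == 1:
            tp += 1
        elif pred == 1 and yt == 0:
            fp += 1
        elif pred == 0 and yt == 1:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Invented labels and model probabilities.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
probs = [0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]

for t in (0.3, 0.5, 0.85):
    print(t, precision_recall(y_true, probs, t))
```

On this toy data, lowering the threshold raises recall at the expense of precision, and raising it does the opposite.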
Calibration
- Check calibration on held-out data (for example with a reliability diagram) and use Platt scaling if needed.
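A simple way to check calibration is to bin predictions by predicted probability and compare the mean prediction in each bin with the observed fraction of positives. The sketch below does this and summarises the gap as an expected calibration error; the helper names, the bin count and the toy inputs are assumptions, and this is only the check, not Platt scaling itself.

```python
def reliability_table(y_true, probs, n_bins=5):
    """Per-bin (mean predicted probability, fraction of positives, count)."""
    bins = [[] for _ in range(n_bins)]
    for yt, p in zip(y_true, probs):
        idx = min(int(p * n_bins), n_bins - 1)  # p = 1.0 goes in the last bin
        bins[idx].append((p, yt))
    table = []
    for items in bins:
        if not items:
            continue  # skip empty bins
        mean_p = sum(p for p, _ in items) / len(items)
        frac_pos = sum(yt for _, yt in items) / len(items)
        table.append((mean_p, frac_pos, len(items)))
    return table

def expected_calibration_error(table):
    """Count-weighted average of |mean prediction - observed frequency|."""
    n = sum(count for _, _, count in table)
    return sum(count / n * abs(mp - fp) for mp, fp, count in table)

# Invented examples: the first set matches its predictions, the second is overconfident.
ece_matched = expected_calibration_error(
    reliability_table([1, 0, 0, 0], [0.25, 0.25, 0.25, 0.25]))
ece_overconfident = expected_calibration_error(
    reliability_table([1, 1, 0, 0], [0.9, 0.9, 0.9, 0.9]))
print(ece_matched, ece_overconfident)
```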