
Ensemble Methods

A comprehensive guide to ensemble learning: exploring how combining multiple models improves accuracy, robustness, and generalisation.

Syllabus Map


Overview


Motivation for Ensemble Learning

Why Single Models Fail

Error Decomposition


General Ensemble Framework

Base Learners

Combining Predictions


Bagging (Bootstrap Aggregation) and Pasting

Core Idea

Why Bagging Works

Algorithm Outline

  1. Random subsets of the training set are drawn, with replacement for bagging and without replacement for pasting.
  2. One base learner is trained on each of these random subsets.
  3. A prediction for a new instance is made by simply aggregating the predictions of all predictors.
    • For classifiers, this is done by the statistical mode.
    • For regressors, this is done by the statistical mean.
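The sampling and aggregation steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the helper names (`draw_subsets`, `aggregate_classifiers`, `aggregate_regressors`) and the `max_samples` parameter are hypothetical, class labels are assumed to be integers `0..n_classes-1`, and training the base learners themselves is left out.

```python
import numpy as np

def draw_subsets(n_samples, n_estimators, rng, replace=True, max_samples=1.0):
    """Return one index array per predictor.

    replace=True corresponds to bagging, replace=False to pasting.
    """
    size = int(max_samples * n_samples)
    return [rng.choice(n_samples, size=size, replace=replace)
            for _ in range(n_estimators)]

def aggregate_classifiers(preds, n_classes):
    """Statistical mode over predictors.

    preds has shape (n_estimators, n_instances) and holds integer labels.
    """
    preds = np.asarray(preds)
    # votes[k, i] = number of predictors that chose class k for instance i
    votes = np.stack([(preds == k).sum(axis=0) for k in range(n_classes)])
    return votes.argmax(axis=0)  # ties go to the smallest class label

def aggregate_regressors(preds):
    """Statistical mean over predictors."""
    return np.asarray(preds, dtype=float).mean(axis=0)
```

Each index array from `draw_subsets` would be used to slice the training data for one base learner; the two aggregation helpers then combine the per-predictor outputs for new instances.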

Out-of-Bag Evaluation

Random Patching and Random Subspaces

Random Subspaces

Random Patching


Random Forests

Motivation

Core Principles

Algorithm Overview

  1. For each tree in the forest, a random subset of the original training data is selected with replacement.
  2. At each node during the tree-building process, only a random subset of features is considered for the best split.
  3. Each decision tree is grown on its own bootstrap sample, using these per-node feature subsets, until a stopping criterion is met.
  4. A prediction for a new instance is made by simply aggregating the predictions of all predictors.
    • For classifiers, this is done by the statistical mode.
    • For regressors, this is done by the statistical mean.
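To make steps 1 and 2 concrete, the sketch below draws a bootstrap sample and then searches for the best split at a single node while looking only at a random subset of features. It is a minimal illustration, not a full tree builder: the function names, the `max_features` parameter, and the choice of Gini impurity as the split criterion are assumptions made for this example.

```python
import numpy as np

def bootstrap_rows(n_samples, rng):
    """Step 1: row indices sampled with replacement, one set per tree."""
    return rng.integers(0, n_samples, size=n_samples)

def gini(y):
    """Gini impurity of a label array."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split_random_features(X, y, max_features, rng):
    """Step 2: best (feature, threshold, impurity) among a random feature subset."""
    n_samples, n_features = X.shape
    candidates = rng.choice(n_features, size=max_features, replace=False)
    best = (None, None, np.inf)
    for f in candidates:
        # Dropping the largest value keeps both children non-empty.
        for t in np.unique(X[:, f])[:-1]:
            left = X[:, f] <= t
            score = (left.sum() * gini(y[left])
                     + (~left).sum() * gini(y[~left])) / n_samples
            if score < best[2]:
                best = (int(f), float(t), float(score))
    return best, candidates
```

A full tree would call `best_split_random_features` recursively on each child, drawing a fresh feature subset at every node, which is what decorrelates the trees in the forest.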

Extra-Trees


Boosting

Core Idea

Why Boosting Works


AdaBoost

Key Concepts

Algorithm Outline

  1. Initialise equal sample weights and train a first weak classifier.
  2. Compute the classifier’s weighted error and assign it a model weight.
  3. Increase weights of misclassified samples, decrease weights of correctly classified samples, then normalise.
  4. Train the next weak classifier on the reweighted data and repeat steps 2–3 for a fixed number of rounds.
  5. Make the final prediction using a weighted majority vote of all weak classifiers.
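The five steps above can be sketched end to end with decision stumps as the weak classifiers. This is a minimal sketch under stated assumptions: the stump search is brute force, `eta` is the learning rate from the predictor-weight formula in the in-depth steps, the error rate is clipped to avoid division by zero, and all function names are hypothetical.

```python
import numpy as np

def fit_stump(X, y, w):
    """Weighted decision stump: (feature, threshold, left_label, right_label)."""
    classes = np.unique(y)
    best, best_err = None, np.inf
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f]):
            left = X[:, f] <= t
            for left_label in classes:
                for right_label in classes:
                    pred = np.where(left, left_label, right_label)
                    err = w[pred != y].sum()  # weight on misclassified samples
                    if err < best_err:
                        best, best_err = (f, t, left_label, right_label), err
    return best

def stump_predict(stump, X):
    f, t, left_label, right_label = stump
    return np.where(X[:, f] <= t, left_label, right_label)

def adaboost_fit(X, y, n_rounds=10, eta=1.0):
    m = len(y)
    w = np.full(m, 1.0 / m)                      # step 1: equal sample weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = fit_stump(X, y, w)
        miss = stump_predict(stump, X) != y
        r = np.clip(w[miss].sum() / w.sum(), 1e-10, 1 - 1e-10)  # step 2: weighted error
        alpha = 0.5 * eta * np.log((1 - r) / r)                 # step 2: model weight
        w = w * np.exp(np.where(miss, alpha, -alpha))           # step 3: reweight
        w = w / w.sum()                                         # step 3: normalise
        stumps.append(stump)                                    # step 4: next round
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X, classes):
    """Step 5: weighted majority vote over all weak classifiers."""
    votes = np.zeros((len(classes), X.shape[0]))
    for stump, alpha in zip(stumps, alphas):
        pred = stump_predict(stump, X)
        for k, c in enumerate(classes):
            votes[k] += alpha * (pred == c)
    return np.asarray(classes)[votes.argmax(axis=0)]
```

On a one-dimensional pattern such as labels `1, 1, 0, 0, 1, 1`, no single stump can separate the classes, but a weighted vote of a few stumps can, which is the point of reweighting the hard samples between rounds.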

Intuition

In-Depth Algorithm

Step 1: Instantiation and Setting Weights

Step 2: Calculate Weighted Error Rate

$$r_j = \frac{\sum_{i=1,\ \hat{y}_{i,j} \ne y_i}^{m} w_i}{\sum_{i=1}^{m} w_i}$$
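The weighted error rate is the share of the total sample weight that sits on instances the $j$-th predictor misclassifies. A small sketch, with a hypothetical helper name and made-up toy numbers:

```python
import numpy as np

def weighted_error_rate(w, y_true, y_pred):
    """r_j: weight on misclassified samples divided by total weight."""
    w = np.asarray(w, dtype=float)
    miss = np.asarray(y_true) != np.asarray(y_pred)
    return w[miss].sum() / w.sum()

# Toy example: the second and fourth samples are misclassified,
# so r_j is (0.2 + 0.4) / 1.0.
r = weighted_error_rate([0.1, 0.2, 0.3, 0.4], [1, 0, 1, 0], [1, 1, 1, 1])
print(r)
```

With equal weights this reduces to the ordinary misclassification rate.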

Step 3: Calculate the Predictor’s Weight

$$\alpha_j = \frac{1}{2} \eta \log \frac{1 - r_j}{r_j}$$
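The formula gives a predictor more say the further its weighted error is below 0.5, no say at exactly 0.5, and a negative weight above 0.5. The sketch below evaluates it directly; the function name and the `eps` clipping (added here to avoid a division by zero when $r_j$ is 0 or 1) are assumptions of this example, not part of the formula.

```python
import math

def predictor_weight(r, eta=1.0, eps=1e-10):
    """alpha_j = 0.5 * eta * log((1 - r_j) / r_j), with r_j clipped to (0, 1)."""
    r = min(max(r, eps), 1.0 - eps)
    return 0.5 * eta * math.log((1.0 - r) / r)

for r in (0.1, 0.25, 0.5, 0.75):
    print(r, predictor_weight(r))
```

The weight is antisymmetric around $r_j = 0.5$: an error of 0.75 yields exactly the negative of the weight for an error of 0.25.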

Step 4: Update the Weights of the Samples

$$w_i \leftarrow \begin{cases} w_i \exp(-\alpha_j) & \text{if } \hat{y}_{i,j} = y_i \\ w_i \exp(\alpha_j) & \text{if } \hat{y}_{i,j} \ne y_i \end{cases} \quad \forall i \in \{1, 2, \dots, m\}$$

$$w_i \leftarrow \frac{w_i}{\sum_{i'=1}^{m} w_{i'}} \quad \forall i \in \{1, 2, \dots, m\}$$
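A short sketch of the update followed by normalisation, using a hypothetical helper name and a toy case with four equally weighted samples, one of which is misclassified (so $r_j = 0.25$ and, with $\eta = 1$, $\alpha_j = \tfrac{1}{2}\log 3$):

```python
import numpy as np

def update_weights(w, y_true, y_pred, alpha):
    """Scale weights by exp(+alpha) on mistakes, exp(-alpha) otherwise, then normalise."""
    w = np.asarray(w, dtype=float)
    miss = np.asarray(y_true) != np.asarray(y_pred)
    w = w * np.exp(np.where(miss, alpha, -alpha))
    return w / w.sum()

w = np.full(4, 0.25)
alpha = 0.5 * np.log((1 - 0.25) / 0.25)
new_w = update_weights(w, [1, 0, 1, 0], [0, 0, 1, 0], alpha)
print(new_w)  # the misclassified first sample now carries half the total weight
```

With $\eta = 1$ the misclassified samples always end up holding half of the total weight after normalisation, so the next weak classifier cannot do well by repeating the previous one's mistakes.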

Step 5: Make Predictions

$$\hat{y} = \operatorname*{arg\,max}_k \sum_{j=1,\ \hat{y}_j = k}^{N} \alpha_j$$
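The prediction rule sums the weights $\alpha_j$ of all predictors that vote for each class and picks the class with the largest total. A minimal sketch, assuming integer class labels `0..n_classes-1` and a hypothetical function name:

```python
import numpy as np

def weighted_vote(preds, alphas, n_classes):
    """preds: (N predictors, n instances) integer labels; alphas: (N,) predictor weights."""
    preds = np.asarray(preds)
    alphas = np.asarray(alphas, dtype=float)
    scores = np.zeros((n_classes, preds.shape[1]))
    for k in range(n_classes):
        # Sum alpha_j over the predictors whose prediction equals class k.
        scores[k] = (alphas[:, None] * (preds == k)).sum(axis=0)
    return scores.argmax(axis=0)

# Three predictors, two instances. For the first instance two predictors vote
# for class 1, but the single vote for class 0 has the larger weight.
print(weighted_vote([[0, 1], [1, 1], [1, 0]], [1.0, 0.3, 0.4], n_classes=2))
```

This differs from a plain majority vote whenever a few accurate predictors outweigh several weak ones.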

Gradient Boosting

Core Idea

Algorithm Outline

  1. Initialise the model with a constant prediction that minimises the loss.
  2. Compute the negative gradient of the loss with respect to the current predictions (the pseudo-residuals; for squared error these are the ordinary residuals).
  3. Fit a new weak learner to these pseudo-residuals.
  4. Scale the learner by a step size and add it to the ensemble.
  5. Repeat steps 2–4 for the desired number of iterations.
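The loop above can be sketched for the simplest case: squared-error loss, where the negative gradient is just `y - prediction`, and regression stumps on a single feature as the weak learners. These choices, the function names, and the default `n_rounds` and `learning_rate` values are assumptions for illustration; the feature is assumed to have at least two distinct values.

```python
import numpy as np

def fit_stump(x, target):
    """Least-squares regression stump on a 1-D feature: (threshold, left_value, right_value)."""
    best, best_sse = None, np.inf
    for t in np.unique(x)[:-1]:  # dropping the max keeps both sides non-empty
        left = x <= t
        lv, rv = target[left].mean(), target[~left].mean()
        sse = ((target[left] - lv) ** 2).sum() + ((target[~left] - rv) ** 2).sum()
        if sse < best_sse:
            best, best_sse = (t, lv, rv), sse
    return best

def stump_predict(stump, x):
    t, lv, rv = stump
    return np.where(x <= t, lv, rv)

def gradient_boost_fit(x, y, n_rounds=100, learning_rate=0.1):
    f0 = y.mean()                          # step 1: constant minimising squared loss
    pred = np.full(len(y), f0)
    stumps = []
    for _ in range(n_rounds):              # step 5: repeat
        pseudo_residuals = y - pred        # step 2: negative gradient of 0.5 * (y - F)^2
        stump = fit_stump(x, pseudo_residuals)                    # step 3
        pred = pred + learning_rate * stump_predict(stump, x)     # step 4
        stumps.append(stump)
    return f0, stumps, learning_rate

def gradient_boost_predict(model, x):
    f0, stumps, learning_rate = model
    pred = np.full(len(x), f0)
    for stump in stumps:
        pred = pred + learning_rate * stump_predict(stump, x)
    return pred
```

For other losses only step 1 (the initial constant) and step 2 (the gradient) change; the rest of the loop stays the same.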

Loss Functions

How XGBoost Handles Missing Values


Stacking

Core Idea


Architecture


Why Stacking Works


Blending vs Stacking

Blending

Stacking


Bias–Variance Tradeoff in Ensembles

How Bagging Affects Bias & Variance


How Boosting Affects Bias & Variance


How Stacking Affects Both


Ensemble Methods In Practice

When to Use Ensemble Methods

When Not to Use Ensemble Methods

Practical Notes

Tuning and Diversity

Reliability and Overfitting
