Syllabus Map

Overview

Bias and variance are commonly used to describe underfitting and overfitting behaviour.
Decomposing loss into bias and variance helps interpret model performance.

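Squared loss: $S = (y - \hat{y})^2$.
The expected squared loss decomposes into squared bias, variance, and a noise term (often called irreducible error). Treating the target $y$ as fixed, so that the noise term drops out, and taking the expectation over training sets: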

E[(y - \hat{y})^2] = \big(y - E[\hat{y}]\big)^2 + E\big[(\hat{y} - E[\hat{y}])^2\big]
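
As a quick sanity check, the identity above can be verified numerically for a fixed target. The sketch below is illustrative only: the function name `decompose_fixed_target` and the example numbers are made up for this check, and each entry in the list is assumed to be the prediction of a model trained on a different training set.

```python
import statistics


def decompose_fixed_target(y, predictions):
    """Return (expected squared loss, squared bias, variance) for a fixed target y.

    `predictions` holds one prediction per (hypothetical) training set.
    """
    expected_loss = statistics.fmean((y - p) ** 2 for p in predictions)
    mean_pred = statistics.fmean(predictions)
    bias_sq = (y - mean_pred) ** 2
    # Population variance: matches E[(y_hat - E[y_hat])^2] over the given predictions.
    variance = statistics.pvariance(predictions)
    return expected_loss, bias_sq, variance


loss, bias_sq, variance = decompose_fixed_target(3.0, [1.0, 2.0, 3.0, 2.0])
print(loss, bias_sq + variance)  # the two numbers agree up to rounding
```

Because the mean and population variance are computed over the same list, the two sides agree exactly (up to floating-point rounding) rather than only approximately.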

Total error (expected loss) can be viewed as the sum of:
- Bias (systematic error from wrong assumptions),
- Variance (error from sensitivity to training data),
- Irreducible error (noise in the data that no model can remove).
In the squared-loss decomposition, the irreducible error is the noise term; it is often omitted for simplicity, as in the equation above.

Notation:

Assume the data-generating process:

y = f(x) + \varepsilon, \quad E[\varepsilon] = 0, \quad Var(\varepsilon) = \sigma^2

Then:

E[(y - \hat{y})^2] = E[(f(x) + \varepsilon - \hat{y})^2]

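Expand the square. Assuming the noise $\varepsilon$ is independent of $\hat{y}$, the cross term $2E[\varepsilon (f(x) - \hat{y})]$ vanishes because $E[\varepsilon] = 0$, leaving:

E[(y - \hat{y})^2] = E[(f(x) - \hat{y})^2] + E[\varepsilon^2]

where $E[\varepsilon^2] = Var(\varepsilon) = \sigma^2$.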

And decompose the first term:

E[(f(x) - \hat{y})^2] = (f(x) - E[\hat{y}])^2 + E[(\hat{y} - E[\hat{y}])^2]
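
This step relies on another cross term vanishing. As a sketch, write $\bar{y} = E[\hat{y}]$ (a shorthand introduced here for readability, not part of the notation above) and add and subtract it inside the square:

(f(x) - \hat{y})^2 = (f(x) - \bar{y})^2 + 2(f(x) - \bar{y})(\bar{y} - \hat{y}) + (\bar{y} - \hat{y})^2

Taking the expectation over training sets, $f(x) - \bar{y}$ is a constant and $E[\bar{y} - \hat{y}] = 0$, so the middle term is zero and only the squared bias and the variance remain.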

So:

E[(y - \hat{y})^2] = (f(x) - E[\hat{y}])^2 + E[(\hat{y} - E[\hat{y}])^2] + \sigma^2

Therefore:

E[(y - \hat{y})^2] = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}
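
The full decomposition can also be checked by simulation. The following is a minimal sketch under assumptions that are not in the notes: $f(x) = \sin(x)$, Gaussian noise with $\sigma = 0.3$, a deliberately crude model that predicts the mean of its training targets, and a single query point `x0`. The function name `simulate` is hypothetical.

```python
import math
import random
import statistics


def f(x):
    """Assumed true function (chosen only for illustration)."""
    return math.sin(x)


def simulate(n_trials=5000, n_train=20, sigma=0.3, x0=1.0, seed=0):
    """Estimate expected loss, squared bias, variance, and noise at x0."""
    rng = random.Random(seed)
    preds, losses = [], []
    for _ in range(n_trials):
        # Draw a fresh training set from y = f(x) + eps.
        ys = [f(rng.uniform(0.0, 3.0)) + rng.gauss(0.0, sigma)
              for _ in range(n_train)]
        y_hat = statistics.fmean(ys)            # constant model: mean of targets
        y_test = f(x0) + rng.gauss(0.0, sigma)  # independent noisy test target
        preds.append(y_hat)
        losses.append((y_test - y_hat) ** 2)
    bias_sq = (f(x0) - statistics.fmean(preds)) ** 2
    variance = statistics.pvariance(preds)
    noise = sigma ** 2                          # known by construction
    return statistics.fmean(losses), bias_sq, variance, noise


total, bias_sq, variance, noise = simulate()
print(f"expected loss      = {total:.4f}")
print(f"bias^2 + var + s^2 = {bias_sq + variance + noise:.4f}")
```

The two printed values should be close but not identical, since the expectations are replaced by finite-sample averages and the cross terms are only zero in expectation.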

Bagging typically reduces variance compared to a single decision tree in the provided examples.
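
To illustrate the variance reduction, here is a small, self-contained sketch. It is not the course's example code: it uses a hand-written one-dimensional regression tree (hypothetical helpers `fit_tree`, `predict`, `bagged_predict`) and an assumed data-generating process $y = \sin(x) + \varepsilon$. It compares the variance of predictions at one query point across many training sets.

```python
import math
import random
import statistics


def fit_tree(points, depth):
    """Fit a 1-D regression tree to (x, y) pairs sorted by x.

    Returns a float (leaf mean) or a tuple (threshold, left, right).
    """
    ys = [y for _, y in points]
    n = len(points)
    leaf = sum(ys) / n
    if depth == 0:
        return leaf
    total, total_sq = sum(ys), sum(y * y for y in ys)
    left_sum = left_sq = 0.0
    best_sse, best_i = None, None
    for i in range(1, n):
        left_sum += ys[i - 1]
        left_sq += ys[i - 1] ** 2
        if points[i - 1][0] == points[i][0]:
            continue  # cannot split between identical x values
        right_sum, right_sq = total - left_sum, total_sq - left_sq
        # Sum of squared errors around each side's mean.
        sse = (left_sq - left_sum ** 2 / i) + (right_sq - right_sum ** 2 / (n - i))
        if best_sse is None or sse < best_sse:
            best_sse, best_i = sse, i
    if best_i is None:
        return leaf
    threshold = (points[best_i - 1][0] + points[best_i][0]) / 2
    return (threshold,
            fit_tree(points[:best_i], depth - 1),
            fit_tree(points[best_i:], depth - 1))


def predict(tree, x):
    """Walk the tree down to a leaf and return its mean."""
    while isinstance(tree, tuple):
        threshold, left, right = tree
        tree = left if x <= threshold else right
    return tree


def bagged_predict(points, x, n_bags, depth, rng):
    """Average the predictions of trees fit on bootstrap resamples."""
    preds = []
    for _ in range(n_bags):
        boot = sorted(rng.choices(points, k=len(points)))
        preds.append(predict(fit_tree(boot, depth), x))
    return sum(preds) / len(preds)


def compare_variance(n_trials=200, n_train=30, n_bags=25, depth=4,
                     sigma=0.3, x0=1.0, seed=0):
    """Return (variance of single-tree predictions, variance of bagged predictions)."""
    rng = random.Random(seed)
    single, bagged = [], []
    for _ in range(n_trials):
        xs = [rng.uniform(0.0, 3.0) for _ in range(n_train)]
        points = sorted((x, math.sin(x) + rng.gauss(0.0, sigma)) for x in xs)
        single.append(predict(fit_tree(points, depth), x0))
        bagged.append(bagged_predict(points, x0, n_bags, depth, rng))
    return statistics.pvariance(single), statistics.pvariance(bagged)


var_single, var_bagged = compare_variance()
print(f"single tree variance: {var_single:.4f}")
print(f"bagged trees variance: {var_bagged:.4f}")
```

In this setup the bagged variance is expected to come out lower, because averaging over bootstrap resamples smooths out the dependence of each tree on individual noisy points. The exact numbers depend on the seed, tree depth, and noise level.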

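For 0-1 loss, if bias is 1 (the most frequent prediction across training sets is wrong), increasing variance can reduce loss, because predictions that deviate from the wrong majority may be correct (a counterintuitive edge case).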
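
A small worked example makes this edge case concrete. The sketch assumes the common definitions for 0-1 loss: the main prediction is the mode of the predictions across training sets, bias is 1 if the main prediction is wrong and 0 otherwise, and variance is the fraction of predictions that differ from the main prediction. The function name `zero_one_decomposition` and the prediction lists are made up for illustration.

```python
from collections import Counter


def zero_one_decomposition(y_true, predictions):
    """Return (bias, variance, expected 0-1 loss) for one test point."""
    main_pred = Counter(predictions).most_common(1)[0][0]  # mode of predictions
    bias = int(main_pred != y_true)
    variance = sum(p != main_pred for p in predictions) / len(predictions)
    loss = sum(p != y_true for p in predictions) / len(predictions)
    return bias, variance, loss


y_true = 1
low_variance = [0] * 9 + [1]       # almost always predicts the wrong class
high_variance = [0] * 6 + [1] * 4  # still wrong on average, but less consistently

print(zero_one_decomposition(y_true, low_variance))
print(zero_one_decomposition(y_true, high_variance))
```

Both prediction sets have bias 1, yet the one with higher variance has lower expected loss: in this binary, noise-free setting every prediction that departs from the wrong main prediction is correct, so the loss equals bias minus variance.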