IOAI ML Notes Classical Machine LearningSupervised Learning

Decision Tree

A comprehensive guide to Decision Trees: exploring how they recursively split data to model complex decision boundaries for classification and regression tasks.

Syllabus Map


Introduction

Overall Goals


Structure of a Decision Tree

Root Node

Decision Node

Leaf Node

Parent Node

Child Node


Training Algorithm

Step-by-Step Procedure

1. Start with the full training dataset at the root node

2. Check if the node should stop splitting

For each feature

For numeric features:

For categorical features:

3. Compute the impurity or loss for each possible split

4. Select the split with the best score

5. Partition the dataset into child nodes

7. Repeat recursively for each child node

Recursive Splitting Logic


Splitting Criteria

Information Gain

IG(S,t)=H(S)SLSH(SL)SRSH(SR)IG(S, t) = H(S) - \frac{|S_L|}{|S|}H(S_L) - \frac{|S_R|}{|S|}H(S_R)

Entropy

H(S)=y  Sp(y) log2p(y)H(S) = - \sum_{y \space \in \space \mathcal{S}} p(y) \space \log_2 p(y)

GINI impurity

H(S)=1y  Sp(y)2H(S) = 1 - \sum_{y \space \in \space \mathcal{S}} p(y)^2

Mean Squared Loss (MSE)

MSE(S)=1S(x,yS)S(yyˉS)2\text{MSE}(S) = \frac{1}{|S|} \sum_{(x, y_S) \in S} (y - \bar y_S)^2

Where:

yˉS=1S(x,ys)Sy\bar y_S = \frac{1}{|S|} \sum_{(x, y_s) \in S} y

Data Splitting (CART)

Feature TypeSplit ConditionSplit DecisionExample Condition
Numeric / Continuousx < t or x > t\cdot Sort all unique values of (x)
\cdot Compute midpoints between consecutive values
\cdot Each midpoint = possible threshold (t)
Age < 32.5
Ordered Categoriesrank(x) < t\cdot Map categories to ranks
\cdot Treat as numeric thresholds
EducationLevel ≤ 2
Binary Categoricalx = v\cdot Single split into {v} vs. {not v}IsStudent = Yes
Multi-Categoricalx ∈ S or x ∉ S\cdot Enumerate all possible subsets S of categories
\cdot Often one-hot encode and treat as binary features
Colour ∈ {Red, Blue}
Booleanx < 0.5 or x ≥ 0.5\cdot Treated as numeric with only one possible thresholdIsWeekend < 0.5

Stopping Condition


Pruning Branches

Goals of Pruning

Pre-Pruning (Early Stopping)

Post-Pruning (Reducing Nodes)


Variants of Decision Tree Algorithms

Iterative Dichotomiser 3 (ID3)

Splitting Criterion: Information Gain / Entropy.

Output: Classification (yˉ\bar y).

Feature Handling: * Works primarily with categorical features. * Does not handle continuous variables directly.

Pruning: None.

Limitation: Biased toward features with many unique values.

Base Cases

ID3(S)={if yˉ s.t. (x,y)S,y=yˉ  return leaf with label yˉ,else if xˉ s.t. (x,y)S,x=xˉ  return leaf with {mode(y), for classificationmean(y), for regressionelse  continue splitting recursively\text{ID3}(S) = \begin{cases} \text{if } \exists\, \bar{y} \text{ s.t. } \forall (x, y) \in S,\, y = \bar{y} &\Rightarrow\; \text{return leaf with label } \bar{y}, \\ \text{else if } \exists\, \bar{x} \text{ s.t. } \forall (x, y) \in S,\, x = \bar{x} &\Rightarrow\; \text{return leaf with } \begin{cases} \text{mode}(y), \text{ for classification} \\ \text{mean}(y), \text{ for regression} \end{cases} \\ \text{else} &\Rightarrow\; \text{continue splitting recursively} \end{cases}

Case 1: All values of yy in SS is equal to yˉ\bar y

Case 2: No more attributes xx to split the set SS

Recursive Case

SL={(x,y)S:xft}S_L = \{ (x, y) \in S: x_f \le t \} SR={(x,y)S:xf>t}S_R = \{ (x, y) \in S: x_f \gt t \}

Key Limitation

A=argmaxAF IG(S,A)A^* = \underset{A \in \mathcal F}{\arg \max} \space IG(S, A) H(Sv)0H(S_v) \approx 0 IG(S,A)=H(S)H(Sv)H(S)IG(S, A) = H(S) - H(S_v) \approx H(S)

C4.5

Splitting Criterion: Gain Ratio.

Output: Classification (yˉ\bar y).

Feature Handling:

Pruning: Pessimistic pruning: Replaces subtrees with leaves when the Error estimate shows no significant improvement.

Limitation: Gain Ratio can over-penalise attributes with many branches, sometimes ignoring good splits.

Gain Ratio

GR(A)=IG(A)IS(A) \text{GR}(A) = \frac{\text{IG(A)}}{\text{IS}(A)}

Such that:

IS(A)=vVASvSlog2SvS \text{IS}(A) = \sum_{v \in \mathcal{V}_A} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}

Pessimistic Pruning

Step 1: Grow the Full Tree

Step 2: Estimate the True Error Rate of a Node

E(t)=etNt E(t) = \frac{e_t}{N_t} E(t)=et+0.5Nt E(t) = \frac{e_t + 0.5}{N_t}

Step 3: Estimate the True Error Rate of a Sub-tree

E(T)=tTE(t)=tTet+0.5Nt E'(T) = \sum_{t \in T} E'(t) = \sum_{t \in T} \frac{e_t + 0.5}{N_t}

CART

Cost-Complexity Pruning

Step 1: Grow the Full Tree

Step 2: Calculate Cost Complexity Criterion of Tree and Subtree

Rα(T)=R(T)+αL(T)R_\alpha(T) = R(T) + \alpha{|L(T)|} R(t)=Ntr(t)R(t) = N_t \cdot r(t) R(T)=tL(T)r(t)p(t)=tL(T)R(t)R(T) = \sum_{t \in L(T)}{r(t) \cdot p(t)} = \sum_{t \in L(T)} R(t) g(t)=R(t)R(Tt)L(Tt)1g(t) = \frac{R(t) - R(T_t)}{|L(T_t) - 1|}

Step 4: Generate Pruning Path

T0T1T2TKTRT_0 \supset T_1 \supset T_2 \dots T_K \subset T_R

Step 5: Select the best subtree


Bias-Variance Trade Off

Overview

Effect of Tree Depth

Shallow Trees

Deep Trees


Inference and Predictions

Classification Trees

  1. Start at the root node
  2. At each node, evaluate the condition
  3. Move to the respective branch
  4. Continue until a leaf node is reached
  5. Output the class label and probability

Regression Trees

  1. Start at the root node
  2. At each node, evaluate the condition
  3. Move to the respective branch
  4. Continue until a leaf node is reached
  5. Output the numeric value

Decision Trees In Practice

When to Use Decision Trees

When Not to Use Decision Trees

Practical Notes

Regularization

Data Imbalance and Generalization

← Back to Blog