
Decision Trees

Decision trees partition the feature space by recursively splitting on feature values. Ensemble methods (Random Forest, Gradient Boosting) combine many trees for improved predictive performance.

Tree Structure

(Diagram: decision tree structure)
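In place of the diagram, a minimal sketch that prints the same kind of structure from a fitted tree: each internal node tests one feature against a threshold, and each leaf holds a prediction. scikit-learn is assumed; the dataset and depth are illustrative.

```python
# Print the structure of a small fitted decision tree
# (assumes scikit-learn; the iris dataset is only for illustration).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()

# Keep the tree shallow so the printed structure stays readable.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(data.data, data.target)

# Internal nodes show "feature <= threshold"; leaves show the predicted class.
print(export_text(tree, feature_names=list(data.feature_names)))
```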

Splitting Criteria

Classification

Gini Impurity:

Gini = 1 - Sum(p_i^2) for all classes

Entropy / Information Gain:

Entropy = -Sum(p_i * log2(p_i)) for all classes

Information gain is the reduction in entropy from the parent node to the weighted entropy of the child nodes produced by a split.
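A short sketch of both criteria, computed from class proportions with NumPy (the example label array is arbitrary):

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum(p_i^2) over class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Entropy: -sum(p_i * log2(p_i)) over class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

labels = np.array([0, 0, 0, 1, 1, 2])
print(gini(labels))     # ~0.611
print(entropy(labels))  # ~1.459
```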

Regression

Variance Reduction:

Variance = (1/n) * Sum((y_i - y_mean)^2)

A split is chosen to maximize the reduction from the parent node's variance to the weighted variance of the child nodes.
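A sketch of how a candidate regression split would be scored, with a hypothetical boolean mask standing in for a feature threshold (NumPy assumed):

```python
import numpy as np

def variance(y):
    """Mean squared deviation from the mean: (1/n) * sum((y_i - y_mean)^2)."""
    return np.mean((y - y.mean()) ** 2)

def variance_reduction(y, left_mask):
    """Parent variance minus the weighted variance of the two child nodes."""
    left, right = y[left_mask], y[~left_mask]
    n = len(y)
    weighted = (len(left) / n) * variance(left) + (len(right) / n) * variance(right)
    return variance(y) - weighted

y = np.array([1.0, 1.2, 0.9, 5.0, 5.3, 4.8])
split = np.array([True, True, True, False, False, False])

# Large reduction: this split cleanly separates the two clusters of targets.
print(variance_reduction(y, split))
```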

Overfitting Prevention

| Technique | Description |
|---|---|
| Max depth | Limit the maximum depth of the tree |
| Min samples split | Minimum number of samples required to split a node |
| Min samples leaf | Minimum number of samples required in a leaf |
| Pruning | Remove branches post-training |
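In scikit-learn these controls correspond directly to constructor parameters. A minimal sketch with illustrative values (not tuned recommendations; the bundled breast-cancer dataset is assumed only for the example):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(
    max_depth=4,           # limit tree depth
    min_samples_split=20,  # minimum samples required to split a node
    min_samples_leaf=10,   # minimum samples required in a leaf
    ccp_alpha=0.01,        # cost-complexity pruning: remove weak branches after growing
    random_state=0,
).fit(X_train, y_train)

print(tree.get_depth(), tree.score(X_test, y_test))
```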

Ensemble Methods

Random Forest

Bagging with feature randomization:

| Component | Description |
|---|---|
| Bootstrap sampling | Train each tree on a random subset drawn with replacement |
| Feature randomization | Each split considers a random subset of features |
| Aggregation | Average predictions (regression) or majority vote (classification) |

Effect: Reduces variance by averaging trees that overfit differently.
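The three components map onto constructor parameters of scikit-learn's Random Forest. A minimal sketch with illustrative values:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

forest = RandomForestClassifier(
    n_estimators=200,     # number of trees whose predictions are aggregated
    bootstrap=True,       # bootstrap sampling: each tree sees a resampled training set
    max_features="sqrt",  # feature randomization: random feature subset at each split
    random_state=0,
)

# Classification predictions are aggregated by majority vote across the trees.
print(cross_val_score(forest, X, y, cv=5).mean())
```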

Gradient Boosting (XGBoost, LightGBM)

Sequential training to correct errors:

| Step | Description |
|---|---|
| 1 | Fit an initial model |
| 2 | Compute the residuals |
| 3 | Fit a new tree to the residuals |
| 4 | Add it to the ensemble, scaled by the learning rate |
| 5 | Repeat |

Effect: Reduces bias through iterative error correction.
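A from-scratch sketch of that loop for squared-error loss, where the residuals are the negative gradients and shallow regression trees serve as base learners. Libraries such as XGBoost and LightGBM add regularization, second-order terms, and many optimizations on top of this; the synthetic data here is only for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
pred = np.full_like(y, y.mean())  # step 1: initial model (here, the target mean)
trees = []

for _ in range(100):              # step 5: repeat
    residuals = y - pred          # step 2: compute residuals
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)  # step 3: fit tree to residuals
    pred += learning_rate * tree.predict(X)                      # step 4: add with learning rate
    trees.append(tree)

print(np.mean((y - pred) ** 2))   # training error shrinks as trees are added
```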

Feature Importance

| Method | Description |
|---|---|
| Gini importance | Total reduction in impurity across all splits using that feature |
| Permutation importance | Performance drop when the feature's values are shuffled |

Permutation importance is generally more reliable (Gini importance can be biased toward high-cardinality features) but is slower to compute.
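A sketch comparing the two measures on a fitted forest (dataset and hyperparameters are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Gini importance: derived from impurity reductions seen during training.
gini_importance = forest.feature_importances_

# Permutation importance: measured on held-out data by shuffling one feature at a time.
perm = permutation_importance(forest, X_test, y_test, n_repeats=10, random_state=0)

print(gini_importance[:5])
print(perm.importances_mean[:5])
```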

Reference

| Topic | Description |
|---|---|
| Splitting decision | Evaluates all features and split points and selects the maximum information gain (or minimum impurity). Greedy algorithm, not globally optimal. |
| Bagging vs boosting | Bagging: parallel training on bootstrap samples, average predictions, reduces variance. Boosting: sequential training to correct errors, reduces bias. |
| Random Forest overfitting reduction | Each tree overfits to different data and feature subsets; averaging cancels the noise. Trade-off: reduced interpretability. |
| XGBoost vs Random Forest | XGBoost typically achieves higher accuracy with proper tuning; Random Forest is simpler to configure. Start with Random Forest, use XGBoost for additional performance. |
| Feature importance interpretation | Gini importance: impurity reduction across splits. Permutation importance: accuracy drop when a feature is shuffled. Permutation is more reliable. |