Gradient Boosting: XGBoost, LightGBM, CatBoost

Introduction

Gradient boosting is among the strongest machine learning methods for tabular data and a frequent component of winning Kaggle solutions. Unlike bagging methods, which build models independently, boosting trains weak learners sequentially, with each new model correcting the errors of its predecessors.

Boosting Concept (Sequential Error Correction):
═══════════════════════════════════════════════════

 Data: ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

 Step 1: Weak Learner 1 (High Bias)
 ────────────────────────────────────
 Prediction: ───────────────────────────
 Error: ●●●●  ○○○  ●●●●●●  ○○  ●●●●●

 Step 2: Weak Learner 2 (Fits Residuals)
 ────────────────────────────────────
 Prediction: ──╱╲──╱╲────────╱╲────────
 Error: ●●●  ○  ○  ○○  ●●●  ○  ●●●●

 Step 3: Weak Learner 3 (Fits Residuals of Residuals)
 ────────────────────────────────────
 Prediction: ─╲╱─╲╱─╲╱─────╲╱─╲╱───────
 Error: ●●  ○  ○  ○  ○●  ○  ○  ○ ●●

 Final Ensemble: F₁ + η·F₂ + η·F₃ → Strong Learner
 ═══════════════════════════════════════════════════

 Accuracy: 60% → 75% → 88% → 95%

Theoretical Foundation

Gradient Descent in Function Space

The key insight of gradient boosting is performing gradient descent in function space rather than parameter space.

Definition: Gradient Boosting Objective

The goal is to minimize a loss function $L$ by iteratively adding weak learners that fit the negative gradient (pseudo-residuals) of the loss with respect to the current ensemble prediction.

Theorem: Functional Gradient Descent

Gradient boosting performs gradient descent in function space. Each iteration fits a new weak learner to the negative gradient of the loss function with respect to the current ensemble's predictions. This is equivalent to a greedy function-space gradient descent where the step direction is the pseudo-residual.

Initialization:

Gradient Boosting Initial Model

The ensemble starts from the constant prediction that minimizes the total loss:

$$F_0(x) = \arg\min_c \sum_{i=1}^{N} L(y_i, c)$$

Here,

  • $L$ = Loss function
  • $y_i$ = True label for instance $i$
  • $c$ = Constant prediction minimizing loss
  • $N$ = Number of training instances

For regression with squared loss:

$$F_0(x) = \bar{y} = \frac{1}{N}\sum_{i=1}^{N} y_i$$
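As a quick numerical check of the initialization step, the sketch below scans candidate constants on a grid and confirms that the mean minimizes the summed squared loss. The target values and the grid resolution are illustrative assumptions, and NumPy is assumed to be available.

```python
import numpy as np

# Illustrative targets (assumed values, not taken from a real dataset)
y = np.array([200.0, 250.0, 180.0, 320.0, 270.0])

def total_squared_loss(c, y):
    """Sum of L(y_i, c) = 0.5 * (y_i - c)^2 for a constant prediction c."""
    return 0.5 * np.sum((y - c) ** 2)

# Scan candidate constants and keep the one with the smallest total loss
candidates = np.linspace(y.min(), y.max(), 1401)  # grid step of 0.1
losses = np.array([total_squared_loss(c, y) for c in candidates])
best_c = candidates[np.argmin(losses)]

print(f"grid argmin: {best_c:.1f}")
print(f"mean of y:   {y.mean():.1f}")
```

For other losses the minimizing constant differs; for absolute error, for example, it is the median rather than the mean.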

Iterative Update:

Gradient Boosting Update Rule

$$F_m(x) = F_{m-1}(x) + \eta \cdot h_m(x)$$

Here,

  • $F_{m-1}(x)$ = Current ensemble prediction
  • $\eta$ = Learning rate (shrinkage parameter)
  • $h_m(x)$ = New weak learner fitted to pseudo-residuals

Pseudo-Residuals:

Pseudo-Residuals (Negative Gradient)

$$r_{im} = -\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \Bigg|_{F = F_{m-1}}$$

Here,

  • $r_{im}$ = Pseudo-residual for instance $i$ at iteration $m$
  • $L$ = Loss function
  • $F_{m-1}$ = Current ensemble prediction

For squared loss $L = \frac{1}{2}(y - F(x))^2$:

$$r_{im} = y_i - F_{m-1}(x_i)$$

💡 Pseudo-Residual Intuition

For squared loss, the pseudo-residual is simply the true residual (actual minus predicted). For other losses like log-loss, the pseudo-residual captures the direction in which the prediction should move to reduce the loss. This generalization is what makes gradient boosting applicable to any differentiable loss function.
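To make the pseudo-residual idea concrete, here is a minimal sketch that evaluates the negative gradient for squared loss and for binary log-loss. The function names and sample values are hypothetical, and the log-loss version assumes the ensemble output is a raw score (log-odds) passed through a sigmoid.

```python
import numpy as np

def pseudo_residuals_squared(y, F):
    """Negative gradient of L = 0.5 * (y - F)^2 with respect to F."""
    return y - F

def pseudo_residuals_logloss(y, F):
    """Negative gradient of binary log-loss when F is a raw score (log-odds).

    L = -[y * log(p) + (1 - y) * log(1 - p)] with p = sigmoid(F),
    which gives -dL/dF = y - p.
    """
    p = 1.0 / (1.0 + np.exp(-F))
    return y - p

# Regression: the current ensemble predicts 244 for every instance (illustrative)
y_reg = np.array([200.0, 250.0, 180.0, 320.0, 270.0])
F_reg = np.full_like(y_reg, 244.0)
print(pseudo_residuals_squared(y_reg, F_reg))

# Classification: raw scores of 0 correspond to p = 0.5 for every instance
y_clf = np.array([1.0, 0.0, 1.0])
F_clf = np.zeros(3)
print(pseudo_residuals_logloss(y_clf, F_clf))
```

In both cases the result points in the direction the prediction should move: positive where the model predicts too low, negative where it predicts too high.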

Regularized Objective

XGBoost adds regularization to prevent overfitting:

XGBoost Regularized Objective

$$\mathcal{L}(\phi) = \sum_{i=1}^{N} L(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$

Here,

  • $L$ = Training loss
  • $\hat{y}_i$ = Prediction for instance $i$
  • $\Omega(f_k)$ = Regularization term for tree $k$
  • $K$ = Number of trees

Where:

Tree Complexity Regularization

$$\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2$$

Here,

  • $T$ = Number of leaves in the tree
  • $w_j$ = Weight of leaf $j$
  • $\gamma$ = Complexity penalty per leaf (controls tree pruning)
  • $\lambda$ = L2 regularization term on leaf weights

Combined:

$$\mathcal{L}(\phi) = \underbrace{\sum_{i=1}^{N} L(y_i, \hat{y}_i)}_{\text{Training Loss}} + \underbrace{\sum_{k=1}^{K} \left( \gamma T_k + \frac{1}{2}\lambda \sum_{j=1}^{T_k} w_j^2 \right)}_{\text{Regularization}}$$

ℹ️ Why Regularization Matters

Without regularization, gradient boosting tends to memorize the training data as more trees are added. The term $\gamma T$ penalizes tree complexity (more leaves = more penalty), while $\lambda$ penalizes large leaf weights. The parameter $\gamma$ effectively sets the minimum loss reduction required to make a split, so it acts as a pruning threshold.
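The complexity penalty is simple enough to evaluate by hand. The sketch below computes $\Omega(f)$ for two hypothetical trees that have the same number of leaves but different leaf-weight magnitudes; the weights and the values of $\gamma$ and $\lambda$ are made up for illustration.

```python
import numpy as np

def tree_complexity(leaf_weights, gamma, lam):
    """Omega(f) = gamma * T + 0.5 * lambda * sum_j w_j^2 for a single tree."""
    w = np.asarray(leaf_weights, dtype=float)
    return gamma * w.size + 0.5 * lam * np.sum(w ** 2)

# Two hypothetical trees: same leaf count, different weight magnitudes
small_weights = [0.1, -0.2, 0.05, 0.15]
large_weights = [1.0, -2.0, 0.5, 1.5]

penalty_small = tree_complexity(small_weights, gamma=0.1, lam=1.0)
penalty_large = tree_complexity(large_weights, gamma=0.1, lam=1.0)

print(f"penalty (small weights): {penalty_small:.4f}")
print(f"penalty (large weights): {penalty_large:.4f}")
```

Both trees pay the same $\gamma T$ cost for their four leaves; the difference comes entirely from the $\lambda$ term on the squared weights.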

Second-Order Approximation

XGBoost uses both first and second-order gradients (Hessian):

Second-Order Taylor Expansion of Loss

$$\mathcal{L}^{(t)} \approx \sum_{i=1}^{N} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t)$$

Here,

  • $g_i$ = First-order gradient of loss w.r.t. prediction
  • $h_i$ = Second-order gradient (Hessian) of loss w.r.t. prediction
  • $f_t$ = New tree added at iteration $t$
  • $\Omega(f_t)$ = Regularization of the new tree

Where:

$$g_i = \frac{\partial L(y_i, \hat{y}_i^{(t-1)})}{\partial \hat{y}_i^{(t-1)}}, \quad h_i = \frac{\partial^2 L(y_i, \hat{y}_i^{(t-1)})}{\partial (\hat{y}_i^{(t-1)})^2}$$

ℹ️ Why Second-Order Methods Are Faster

Using the Hessian (second-order information) typically lets XGBoost converge in fewer iterations than first-order-only methods such as standard gradient boosting. The Taylor expansion provides a more accurate local approximation of the loss, enabling larger, more effective steps. This is analogous to Newton's method vs. gradient descent in numerical optimization.
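As an illustration of what $g_i$ and $h_i$ look like for a concrete loss, the sketch below uses binary log-loss on a raw score, where the derivatives have the closed forms $g = p - y$ and $h = p(1 - p)$, and checks them against finite differences. The labels and scores are arbitrary illustrative values, not output from any library.

```python
import numpy as np

def logloss(y, score):
    """Binary log-loss as a function of the raw score (log-odds)."""
    p = 1.0 / (1.0 + np.exp(-score))
    return -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def grad_hess_logloss(y, score):
    """Analytic first and second derivatives of log-loss w.r.t. the raw score."""
    p = 1.0 / (1.0 + np.exp(-score))
    g = p - y            # first-order gradient g_i
    h = p * (1.0 - p)    # second-order gradient (Hessian) h_i
    return g, h

y = np.array([1.0, 0.0, 1.0, 0.0])
score = np.array([0.3, -1.2, 2.0, 0.8])  # illustrative current predictions

g, h = grad_hess_logloss(y, score)

# Central finite differences as an independent check of the closed forms
eps = 1e-4
g_num = (logloss(y, score + eps) - logloss(y, score - eps)) / (2 * eps)
h_num = (logloss(y, score + eps) - 2 * logloss(y, score)
         + logloss(y, score - eps)) / eps ** 2

print("max |g - g_num|:", np.max(np.abs(g - g_num)))
print("max |h - h_num|:", np.max(np.abs(h - h_num)))
```

Note that $h_i$ is largest when $p$ is near 0.5 and shrinks as predictions become confident, so confident instances contribute less curvature to the leaf-weight denominator.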

Optimal Leaf Weights

$$w_j^* = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}$$

Here,

  • $g_i$ = First-order gradient of the loss
  • $h_i$ = Second-order gradient (Hessian) of the loss
  • $\lambda$ = L2 regularization term
  • $I_j$ = Set of instances assigned to leaf $j$
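The closed-form leaf weight can be sanity-checked numerically. The sketch below computes $w_j^*$ for one leaf and compares the per-leaf quadratic objective $G w + \frac{1}{2}(H + \lambda) w^2$ at $w_j^*$ and at nearby weights. The gradient statistics are hypothetical, and the constant $\gamma$ per leaf is left out because it does not depend on $w$.

```python
import numpy as np

def optimal_leaf_weight(g, h, lam):
    """w* = -sum(g) / (sum(h) + lambda) for the instances in one leaf."""
    return -np.sum(g) / (np.sum(h) + lam)

def leaf_objective(w, g, h, lam):
    """Second-order objective restricted to one leaf: G*w + 0.5*(H + lambda)*w^2."""
    return np.sum(g) * w + 0.5 * (np.sum(h) + lam) * w ** 2

# Hypothetical gradient statistics for the instances that fall in one leaf
g = np.array([-0.6, -0.4, 0.3, -0.5])
h = np.array([0.24, 0.24, 0.21, 0.25])
lam = 1.0

w_star = optimal_leaf_weight(g, h, lam)

# The closed form should beat nearby weights on the quadratic objective
for w in (w_star - 0.1, w_star, w_star + 0.1):
    print(f"w = {w:+.4f}  objective = {leaf_objective(w, g, h, lam):+.5f}")
```

Increasing `lam` enlarges the denominator and pulls the weight toward zero, which is the shrinkage effect of the L2 term.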

Optimal Split Gain:

Optimal Split Gain (XGBoost)

$$\text{Gain} = \frac{1}{2} \left[ \frac{(\sum_{i \in I_L} g_i)^2}{\sum_{i \in I_L} h_i + \lambda} + \frac{(\sum_{i \in I_R} g_i)^2}{\sum_{i \in I_R} h_i + \lambda} - \frac{(\sum_{i \in I} g_i)^2}{\sum_{i \in I} h_i + \lambda} \right] - \gamma$$

Here,

  • $I_L$ = Set of instances assigned to the left child
  • $I_R$ = Set of instances assigned to the right child
  • $I$ = Set of all instances at the current node
  • $\gamma$ = Pruning threshold (minimum gain required)

📝 Computing Split Gain

Consider a node with 100 instances. The sum of gradients is $\sum_{i \in I} g_i = -50$ and the sum of Hessians is $\sum_{i \in I} h_i = 80$. With $\lambda = 1$ and $\gamma = 0.1$:

A candidate split produces a left child (60 instances) with $\sum g_i = -30$, $\sum h_i = 50$ and a right child (40 instances) with $\sum g_i = -20$, $\sum h_i = 30$.

The score before the split is: $\frac{(-50)^2}{80 + 1} = \frac{2500}{81} \approx 30.864$

After the split: $\frac{(-30)^2}{50 + 1} + \frac{(-20)^2}{30 + 1} = \frac{900}{51} + \frac{400}{31} \approx 17.647 + 12.903 = 30.550$

Gain = $\frac{1}{2}(30.550 - 30.864) - 0.1 \approx -0.157 - 0.1 = -0.257$

Since Gain < 0, this split would NOT be made. Here the gain is already negative before $\gamma$ is subtracted; $\gamma$ raises the bar further, acting as a pruning threshold so that only splits with positive gain are accepted.
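The arithmetic in this example can be reproduced in a few lines. This is a minimal sketch of the gain formula using the gradient and Hessian sums from the example; the function names are hypothetical and not part of XGBoost's API. Carrying full precision through the intermediate scores gives a gain of about -0.257; rounding the intermediate scores to two decimals shifts the third decimal slightly, and the split is rejected either way.

```python
def leaf_score(G, H, lam):
    """Structure score G^2 / (H + lambda) for a node with gradient sum G and Hessian sum H."""
    return G ** 2 / (H + lam)

def split_gain(G_left, H_left, G_right, H_right, lam, gamma):
    """Gain from splitting a node into left and right children, per the formula above."""
    parent = leaf_score(G_left + G_right, H_left + H_right, lam)
    children = leaf_score(G_left, H_left, lam) + leaf_score(G_right, H_right, lam)
    return 0.5 * (children - parent) - gamma

gain = split_gain(
    G_left=-30.0, H_left=50.0,
    G_right=-20.0, H_right=30.0,
    lam=1.0, gamma=0.1,
)

print(f"gain = {gain:.3f}")
print("split accepted" if gain > 0 else "split rejected")
```

The split is unhelpful here because both children have nearly the same gradient-to-Hessian ratio as the parent, so separating them barely changes the optimal leaf weights.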

Algorithm Comparison

Theorem: Bias-Variance Tradeoff in Boosting

Boosting reduces bias by sequentially fitting residuals, while variance is controlled through regularization (learning rate, tree depth, subsampling). The key insight is that each weak learner only needs to be slightly better than random guessing (high bias, low variance); combining enough of them can drive the ensemble's training error arbitrarily low.

Algorithm Feature Comparison:
══════════════════════════════════════════════════════════════════

 Feature         │ XGBoost     │ LightGBM    │ CatBoost
 ════════════════╪═════════════╪═════════════╪══════════════════════
 Growth          │ Level-wise  │ Leaf-wise   │ Symmetric (Oblivious)
 Algorithm       │ Pre-sorted  │ GOSS + EFB  │ Ordered Boosting
 Categorical     │ Label Enc   │ Native      │ Native + Target Stats
 GPU Support     │ Yes         │ Yes         │ Yes (Best)
 Memory          │ High        │ Low         │ Medium
 Speed           │ Medium      │ Fastest     │ Medium
 Overfitting     │ Moderate    │ Higher Risk │ Lower Risk
 ════════════════╧═════════════╧═════════════╧══════════════════════

 Tree Growth Strategies:
 ───────────────────────

 Level-wise (XGBoost):          Leaf-wise (LightGBM):
        ◎                           ◎
       / \                         / \
      ◎   ◎                       ◎   ◎
     / \ / \                     / \
    ◎  ◎ ◎  ◎                   ◎   ◎
   ↑ grows all nodes at            ↑ grows nodes with
     same depth                       max delta loss

Python Implementation

Complete Comparison

import numpy as np
import pandas as pd
import time
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import (
    accuracy_score, roc_auc_score, classification_report,
    mean_squared_error
)
import warnings
warnings.filterwarnings('ignore')

# ══════════════════════════════════════════════════
# Generate Synthetic Dataset
# ══════════════════════════════════════════════════
X, y = make_classification(
    n_samples=50000,
    n_features=20,
    n_informative=15,
    n_redundant=3,
    n_clusters_per_class=2,
    random_state=42
)

# Add categorical-like features (seeded generator so the columns are reproducible)
rng = np.random.default_rng(42)
X_df = pd.DataFrame(X, columns=[f'feat_{i}' for i in range(20)])
X_df['cat_1'] = rng.choice(['A', 'B', 'C', 'D'], size=50000)
X_df['cat_2'] = rng.choice(['low', 'medium', 'high'], size=50000)
X_df['cat_3'] = rng.choice(
    ['red', 'blue', 'green', 'yellow', 'purple'], size=50000
)

X_train, X_test, y_train, y_test = train_test_split(
    X_df, y, test_size=0.2, random_state=42
)

print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")
print(f"Features: {X_df.shape[1]}")

# ══════════════════════════════════════════════════
# XGBoost Implementation
# ══════════════════════════════════════════════════
import xgboost as xgb
from sklearn.preprocessing import LabelEncoder

# Label encode categorical features for XGBoost
le_dict = {}
X_train_xgb = X_train.copy()
X_test_xgb = X_test.copy()

for col in ['cat_1', 'cat_2', 'cat_3']:
    le = LabelEncoder()
    X_train_xgb[col] = le.fit_transform(X_train_xgb[col])
    X_test_xgb[col] = le.transform(X_test_xgb[col])
    le_dict[col] = le

# XGBoost DMatrix for optimized training
dtrain = xgb.DMatrix(X_train_xgb, label=y_train)
dtest = xgb.DMatrix(X_test_xgb, label=y_test)

# XGBoost Parameters
xgb_params = {
    'objective': 'binary:logistic',
    'eval_metric': 'auc',
    'max_depth': 6,
    'learning_rate': 0.1,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'min_child_weight': 5,
    'gamma': 0.1,
    'reg_alpha': 0.1,
    'reg_lambda': 1.0,
    'seed': 42,
    'nthread': -1
}

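# Training with early stopping
# NOTE: for brevity, early stopping below monitors the test split, which makes
# the reported test metrics optimistic; hold out a separate validation split in practice.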
print("\n" + "=" * 50)
print("XGBoost Training")
print("=" * 50)

start_time = time.time()
xgb_model = xgb.train(
    xgb_params,
    dtrain,
    num_boost_round=1000,
    evals=[(dtrain, 'train'), (dtest, 'eval')],
    early_stopping_rounds=50,
    verbose_eval=100
)
xgb_time = time.time() - start_time

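# Predictions (restricted to the best iteration found by early stopping)
xgb_probs = xgb_model.predict(
    dtest, iteration_range=(0, xgb_model.best_iteration + 1)
)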
xgb_preds = (xgb_probs > 0.5).astype(int)

print(f"\nXGBoost Results:")
print(f"  Training Time: {xgb_time:.2f}s")
print(f"  Best Iteration: {xgb_model.best_iteration}")
print(f"  Accuracy: {accuracy_score(y_test, xgb_preds):.4f}")
print(f"  AUC: {roc_auc_score(y_test, xgb_probs):.4f}")

# ══════════════════════════════════════════════════
# LightGBM Implementation
# ══════════════════════════════════════════════════
import lightgbm as lgb

# LightGBM handles categorical features natively
X_train_lgb = X_train.copy()
X_test_lgb = X_test.copy()

for col in ['cat_1', 'cat_2', 'cat_3']:
    X_train_lgb[col] = X_train_lgb[col].astype('category')
    X_test_lgb[col] = X_test_lgb[col].astype('category')

# LightGBM Parameters
lgb_params = {
    'objective': 'binary',
    'metric': 'auc',
    'boosting_type': 'gbdt',
    'num_leaves': 31,
    'learning_rate': 0.1,
    'feature_fraction': 0.8,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'min_child_samples': 20,
    'reg_alpha': 0.1,
    'reg_lambda': 0.1,
    'verbose': -1,
    'n_jobs': -1,
    'seed': 42
}

print("\n" + "=" * 50)
print("LightGBM Training")
print("=" * 50)

start_time = time.time()
lgb_model = lgb.LGBMClassifier(**lgb_params, n_estimators=1000)
lgb_model.fit(
    X_train_lgb, y_train,
    eval_set=[(X_test_lgb, y_test)],
    callbacks=[lgb.early_stopping(50), lgb.log_evaluation(100)]
)
lgb_time = time.time() - start_time

# Predictions
lgb_probs = lgb_model.predict_proba(X_test_lgb)[:, 1]
lgb_preds = lgb_model.predict(X_test_lgb)

print(f"\nLightGBM Results:")
print(f"  Training Time: {lgb_time:.2f}s")
print(f"  Best Iteration: {lgb_model.best_iteration_}")
print(f"  Accuracy: {accuracy_score(y_test, lgb_preds):.4f}")
print(f"  AUC: {roc_auc_score(y_test, lgb_probs):.4f}")

# ══════════════════════════════════════════════════
# CatBoost Implementation
# ══════════════════════════════════════════════════
from catboost import CatBoostClassifier, Pool

# CatBoost handles categorical features natively
cat_features = ['cat_1', 'cat_2', 'cat_3']

print("\n" + "=" * 50)
print("CatBoost Training")
print("=" * 50)

start_time = time.time()
cat_model = CatBoostClassifier(
    iterations=1000,
    learning_rate=0.1,
    depth=6,
    l2_leaf_reg=3,
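    # min_data_in_leaf is omitted: CatBoost documents it as applying only to the
    # Depthwise and Lossguide grow policies, not the default symmetric trees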
    random_seed=42,
    eval_metric='AUC',
    early_stopping_rounds=50,
    verbose=100,
    cat_features=cat_features,
    task_type='CPU'
)

cat_model.fit(X_train, y_train, eval_set=(X_test, y_test))
cat_time = time.time() - start_time

# Predictions
cat_probs = cat_model.predict_proba(X_test)[:, 1]
cat_preds = cat_model.predict(X_test)

print(f"\nCatBoost Results:")
print(f"  Training Time: {cat_time:.2f}s")
print(f"  Best Iteration: {cat_model.best_iteration_}")
print(f"  Accuracy: {accuracy_score(y_test, cat_preds):.4f}")
print(f"  AUC: {roc_auc_score(y_test, cat_probs):.4f}")

# ══════════════════════════════════════════════════
# Comparison Summary
# ══════════════════════════════════════════════════
print("\n" + "=" * 50)
print("COMPARISON SUMMARY")
print("=" * 50)

results = pd.DataFrame({
    'Algorithm': ['XGBoost', 'LightGBM', 'CatBoost'],
    'Accuracy': [
        accuracy_score(y_test, xgb_preds),
        accuracy_score(y_test, lgb_preds),
        accuracy_score(y_test, cat_preds)
    ],
    'AUC': [
        roc_auc_score(y_test, xgb_probs),
        roc_auc_score(y_test, lgb_probs),
        roc_auc_score(y_test, cat_probs)
    ],
    'Time (s)': [xgb_time, lgb_time, cat_time]
})

print(results.to_string(index=False))

Hyperparameter Tuning with Optuna

import optuna
from optuna.samplers import TPESampler

# ══════════════════════════════════════════════════
# XGBoost Hyperparameter Tuning
# ══════════════════════════════════════════════════
def objective_xgb(trial):
    params = {
        'objective': 'binary:logistic',
        'eval_metric': 'auc',
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'subsample': trial.suggest_float('subsample', 0.6, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.6, 1.0),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 10),
        'gamma': trial.suggest_float('gamma', 0, 5),
        'reg_alpha': trial.suggest_float('reg_alpha', 1e-8, 10, log=True),
        'reg_lambda': trial.suggest_float('reg_lambda', 1e-8, 10, log=True),
        'seed': 42
    }

    dtrain = xgb.DMatrix(X_train_xgb, label=y_train)
    dval = xgb.DMatrix(X_test_xgb, label=y_test)

    model = xgb.train(
        params, dtrain,
        num_boost_round=500,
        evals=[(dval, 'eval')],
        early_stopping_rounds=30,
        verbose_eval=False
    )

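    # Score the best iteration found by early stopping.
    # NOTE: this tunes on the test split for brevity; in practice, tune on a
    # separate validation split so the final test score stays unbiased.
    preds = model.predict(dval, iteration_range=(0, model.best_iteration + 1))
    return roc_auc_score(y_test, preds)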

# Run optimization
study_xgb = optuna.create_study(
    direction='maximize',
    sampler=TPESampler(seed=42)
)
study_xgb.optimize(objective_xgb, n_trials=50, show_progress_bar=True)

print(f"\nBest XGBoost AUC: {study_xgb.best_value:.4f}")
print(f"Best Parameters: {study_xgb.best_params}")

Feature Importance Analysis

import matplotlib.pyplot as plt

# ══════════════════════════════════════════════════
# Feature Importance Comparison
# ══════════════════════════════════════════════════
fig, axes = plt.subplots(1, 3, figsize=(18, 6))

# XGBoost importance
xgb_importance = xgb_model.get_score(importance_type='weight')
xgb_imp_df = pd.DataFrame({
    'feature': list(xgb_importance.keys()),
    'importance': list(xgb_importance.values())
}).sort_values('importance', ascending=True).tail(15)

axes[0].barh(xgb_imp_df['feature'], xgb_imp_df['importance'])
axes[0].set_title('XGBoost Feature Importance')
axes[0].set_xlabel('Weight')

# LightGBM importance
lgb_importance = pd.DataFrame({
    'feature': X_train_lgb.columns,
    'importance': lgb_model.feature_importances_
}).sort_values('importance', ascending=True).tail(15)

axes[1].barh(lgb_importance['feature'], lgb_importance['importance'])
axes[1].set_title('LightGBM Feature Importance')
axes[1].set_xlabel('Split Count')

# CatBoost importance
cat_importance = cat_model.get_feature_importance()
cat_imp_df = pd.DataFrame({
    'feature': X_train.columns,
    'importance': cat_importance
}).sort_values('importance', ascending=True).tail(15)

axes[2].barh(cat_imp_df['feature'], cat_imp_df['importance'])
axes[2].set_title('CatBoost Feature Importance')
axes[2].set_xlabel('Importance')

plt.tight_layout()
plt.savefig('feature_importance_comparison.png', dpi=150, bbox_inches='tight')
plt.show()

Real-World Use Cases

 Domain      │ Use Case                                │ Best Algorithm
 ════════════╪═════════════════════════════════════════╪════════════════
 Finance     │ Credit scoring, fraud detection         │ XGBoost
 E-commerce  │ Click-through rate prediction           │ LightGBM
 Healthcare  │ Disease diagnosis, patient readmission  │ CatBoost
 Marketing   │ Customer churn prediction               │ LightGBM
 Insurance   │ Risk assessment, claim prediction       │ XGBoost

📋 Key Takeaways

  • Gradient Boosting performs gradient descent in function space, sequentially adding weak learners that fit pseudo-residuals
  • XGBoost: Most mature, uses second-order gradients (Hessian), excellent regularization with $\lambda, \alpha, \gamma$
  • LightGBM: Fastest training with leaf-wise growth, GOSS + EFB for large datasets, native categorical support
  • CatBoost: Best for categorical features, ordered boosting reduces target leakage and overfitting
  • Regularization: The objective combines training loss with complexity penalty: $\mathcal{L} = \sum L(y_i, \hat{y}_i) + \sum \Omega(f_k)$
  • Learning Rate: Lower values (0.01-0.1) with more trees usually outperform higher rates; acts as shrinkage
  • Second-Order Methods: XGBoost's use of the Hessian enables faster convergence than first-order-only methods

Practice Exercises

  1. Dataset Comparison: Train all three algorithms on a real dataset (e.g., Ames Housing) and compare performance
  2. Categorical Feature Study: Create a dataset with mixed features and compare how each algorithm handles categoricals
  3. Hyperparameter Sensitivity: Plot how accuracy changes with different max_depth and learning_rate values
  4. Stacking Ensemble: Use XGBoost, LightGBM, and CatBoost as base learners in a stacking ensemble

📝 Gradient Boosting Walkthrough (3 Rounds)

Setup: Predict house prices with squared loss. Training data: 5 houses with true prices [200K, 250K, 180K, 320K, 270K].

Round 1: Initial prediction is the mean: $F_0 = 244\text{K}$. Residuals (actual minus predicted): [-44K, 6K, -64K, 76K, 26K].

Round 2: Fit a small tree to the residuals. Suppose the tree splits on "square footage > 1500" and predicts +10K for large houses, -20K for small. With learning rate $\eta = 0.1$: $F_1(x) = 244\text{K} + 0.1 \cdot h_1(x)$. New residuals shrink.

Round 3: Fit another tree to the new residuals. Each iteration reduces the remaining error. After $M$ rounds: $F_M(x) = F_0(x) + \eta \sum_{m=1}^{M} h_m(x)$.

Key Insight: Each tree only needs to be a weak learner (better than random). The ensemble becomes strong through additive combination.
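The walkthrough can be turned into a small from-scratch loop. The sketch below uses the five house prices from the setup, a hypothetical square-footage feature (the values are invented for illustration), and a depth-1 regression tree written in NumPy as the weak learner. The helper names `fit_stump` and `predict_stump` and the choice of 50 rounds are assumptions; this is a teaching sketch, not how XGBoost, LightGBM, or CatBoost build trees.

```python
import numpy as np

def fit_stump(x, r):
    """Fit a depth-1 regression tree to residuals r on a single feature x.

    Returns (threshold, left_value, right_value) minimizing squared error.
    """
    xs = np.unique(x)
    thresholds = (xs[:-1] + xs[1:]) / 2.0  # midpoints between distinct values
    best = None
    for t in thresholds:
        left, right = r[x <= t], r[x > t]
        sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    return best[1], best[2], best[3]

def predict_stump(stump, x):
    """Predict the leaf value on each side of the stump's threshold."""
    t, left_value, right_value = stump
    return np.where(x <= t, left_value, right_value)

# Prices from the walkthrough (in $K); square footage is a hypothetical feature
y = np.array([200.0, 250.0, 180.0, 320.0, 270.0])
sqft = np.array([1100.0, 1600.0, 1000.0, 2200.0, 1800.0])

eta = 0.1                       # learning rate
F = np.full_like(y, y.mean())   # F_0: the mean prediction
mse_history = [np.mean((y - F) ** 2)]

for m in range(1, 51):
    residuals = y - F                          # pseudo-residuals for squared loss
    stump = fit_stump(sqft, residuals)         # weak learner h_m
    F = F + eta * predict_stump(stump, sqft)   # F_m = F_{m-1} + eta * h_m
    mse_history.append(np.mean((y - F) ** 2))

print(f"F_0 = {y.mean():.0f}K")
print(f"training MSE at rounds 0, 1, 10, 50: "
      f"{mse_history[0]:.1f}, {mse_history[1]:.1f}, "
      f"{mse_history[10]:.1f}, {mse_history[50]:.1f}")
```

Because each stump predicts the mean residual in its leaves and only a fraction `eta` of that correction is applied, the training error falls gradually from round to round rather than in one jump, which is the shrinkage behavior described by the learning rate.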
