Gradient Boosting: XGBoost, LightGBM, CatBoost

Introduction

Gradient boosting is one of the strongest machine learning methods for tabular data and a frequent component of winning Kaggle solutions. Unlike bagging methods, which build independent models, boosting trains weak learners sequentially, with each new model correcting the errors of its predecessors.

Architecture Diagram

Boosting Concept (Sequential Error Correction):
═══════════════════════════════════════════════════

 Data: ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

 Step 1: Weak Learner 1 (High Bias)
 ────────────────────────────────────
 Prediction: ───────────────────────────
 Error: ●●●●  ○○○  ●●●●●●  ○○  ●●●●●

 Step 2: Weak Learner 2 (Fits Residuals)
 ────────────────────────────────────
 Prediction: ──╱╲──╱╲────────╱╲────────
 Error: ●●●  ○  ○  ○○  ●●●  ○  ●●●●

 Step 3: Weak Learner 3 (Fits Residuals of Residuals)
 ────────────────────────────────────
 Prediction: ─╲╱─╲╱─╲╱─────╲╱─╲╱───────
 Error: ●●  ○  ○  ○  ○●  ○  ○  ○ ●●

 Final Ensemble: F₁ + η·F₂ + η·F₃ → Strong Learner
 ═══════════════════════════════════════════════════

 Accuracy (illustrative): 60% → 75% → 88% → 95%

Theoretical Foundation

Gradient Descent in Function Space

The key insight of gradient boosting is performing gradient descent in function space rather than parameter space.

Definition: Gradient Boosting Objective

The goal is to minimize a loss function $L$ by iteratively adding weak learners that fit the negative gradient (pseudo-residuals) of the loss with respect to the current ensemble prediction.

Theorem: Functional Gradient Descent

Gradient boosting performs gradient descent in function space. Each iteration fits a new weak learner to the negative gradient of the loss function with respect to the current ensemble's predictions. This is equivalent to a greedy function-space gradient descent where the step direction is the pseudo-residual.

Initialization:

Gradient Boosting Initial Model (Constant That Minimizes the Total Loss)

F_0(x) = \arg\min_c \sum_{i=1}^{N} L(y_i, c)

Here,

$L$ = Loss function
$y_i$ = True label for instance $i$
$c$ = Constant prediction minimizing the loss
$N$ = Number of training instances

For regression with squared loss:

F_0(x) = \bar{y} = \frac{1}{N}\sum_{i=1}^{N} y_i
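
As a quick numerical check of this initialization, the sketch below scans candidate constants and confirms that the one minimizing the summed squared loss coincides with the sample mean. The target values and the search grid are illustrative assumptions, not data from this article.

```python
import numpy as np

# Hypothetical targets, chosen only for illustration
y = np.array([3.0, 5.0, 4.0, 10.0, 8.0])

def total_squared_loss(c, y):
    """Sum over instances of L(y_i, c) = 0.5 * (y_i - c)^2."""
    return 0.5 * np.sum((y - c) ** 2)

# Brute-force search over a grid of candidate constants (step 0.001)
candidates = np.linspace(y.min(), y.max(), 7001)
losses = np.array([total_squared_loss(c, y) for c in candidates])
best_c = candidates[np.argmin(losses)]

print(f"grid minimizer: {best_c:.3f}")
print(f"sample mean:    {y.mean():.3f}")
```

For other losses the minimizing constant differs (for example, the median for absolute error), which is why the initialization is written as a general arg min.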

Iterative Update:

Gradient Boosting Update Rule

F_m(x) = F_{m-1}(x) + \eta \cdot h_m(x)

Here,

$F_{m-1}(x)$ = Current ensemble prediction
$\eta$ = Learning rate (shrinkage parameter)
$h_m(x)$ = New weak learner fitted to the pseudo-residuals

Pseudo-Residuals:

Pseudo-Residuals (Negative Gradient)

r_{im} = -\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \Bigg|_{F = F_{m-1}}

Here,

$r_{im}$ = Pseudo-residual for instance $i$ at iteration $m$
$L$ = Loss function
$F_{m-1}$ = Current ensemble prediction

For squared loss $L = \frac{1}{2}(y - F(x))^2$:

r_{im} = y_i - F_{m-1}(x_i)
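
The update rule and the squared-loss pseudo-residuals can be combined into a minimal boosting loop. The sketch below is an illustration under assumed settings (synthetic sine data, depth-2 scikit-learn regression trees, 50 rounds, $\eta = 0.1$; the function names are hypothetical). It is not how XGBoost, LightGBM, or CatBoost are implemented internally.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gradient_boosting(X, y, n_rounds=50, eta=0.1, max_depth=2):
    """Gradient boosting for squared loss with shallow regression trees."""
    f0 = float(np.mean(y))                 # F_0: constant minimizing squared loss
    pred = np.full(len(y), f0)
    trees = []
    for _ in range(n_rounds):
        residuals = y - pred               # pseudo-residuals for squared loss
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residuals)             # h_m is fitted to the pseudo-residuals
        pred = pred + eta * tree.predict(X)    # F_m = F_{m-1} + eta * h_m
        trees.append(tree)
    return f0, trees

def predict_gradient_boosting(f0, trees, X, eta=0.1):
    """Sum the initial constant and the shrunken tree outputs."""
    pred = np.full(X.shape[0], f0)
    for tree in trees:
        pred = pred + eta * tree.predict(X)
    return pred

# Synthetic regression problem (illustrative data only)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

f0, trees = fit_gradient_boosting(X, y)
mse_start = np.mean((y - f0) ** 2)
mse_end = np.mean((y - predict_gradient_boosting(f0, trees, X)) ** 2)
print(f"training MSE: {mse_start:.4f} -> {mse_end:.4f}")
```

The same `eta` must be used for fitting and prediction; production libraries store it with the model rather than passing it twice.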

💡 Pseudo-Residual Intuition

For squared loss, the pseudo-residual is simply the true residual (actual minus predicted). For other losses like log-loss, the pseudo-residual captures the direction in which the prediction should move to reduce the loss. This generalization is what makes gradient boosting applicable to any differentiable loss function.
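
To make the log-loss case concrete, the following sketch compares the analytic pseudo-residual $y - p$ (with $p = \sigma(F(x))$) against a finite-difference estimate of the negative gradient. It assumes binary labels in {0, 1} and a model that outputs a raw score passed through a sigmoid; the label and score values are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y, raw_score):
    """Binary log-loss as a function of the raw (pre-sigmoid) score F(x)."""
    p = sigmoid(raw_score)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def pseudo_residual_log_loss(y, raw_score):
    """Negative gradient of the log-loss w.r.t. the raw score: y - p."""
    return y - sigmoid(raw_score)

# Illustrative labels and current raw scores (assumed values)
y = np.array([1.0, 0.0, 1.0, 0.0])
raw = np.array([0.3, -1.2, -0.5, 2.0])

# A central finite difference approximates dL/dF; its negative should match y - p
eps = 1e-6
numeric = -(log_loss(y, raw + eps) - log_loss(y, raw - eps)) / (2 * eps)
analytic = pseudo_residual_log_loss(y, raw)

print("analytic:", np.round(analytic, 4))
print("numeric: ", np.round(numeric, 4))
```

A positive pseudo-residual means the raw score should increase for that instance, and a negative one means it should decrease.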

Regularized Objective

XGBoost adds regularization to prevent overfitting:

XGBoost Regularized Objective

\mathcal{L}(\phi) = \sum_{i=1}^{N} L(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)

Here,

$L$ = Training loss
$\hat{y}_i$ = Prediction for instance $i$
$\Omega(f_k)$ = Regularization term for tree $k$
$K$ = Number of trees

Where:

Tree Complexity Regularization

\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2

Here,

$T$ = Number of leaves in the tree
$w_j$ = Weight of leaf $j$
$\gamma$ = Complexity penalty per leaf (controls tree pruning)
$\lambda$ = L2 regularization term on leaf weights

\mathcal{L}(\phi) = \underbrace{\sum_{i=1}^{N} L(y_i, \hat{y}_i)}_{\text{Training Loss}} + \underbrace{\sum_{k=1}^{K} \left( \gamma T_k + \frac{1}{2}\lambda \sum_{j=1}^{T_k} w_j^2 \right)}_{\text{Regularization}}

ℹ️ Why Regularization Matters

Without regularization, gradient boosting will eventually memorize the training data. The term $\gamma T$ penalizes tree complexity (more leaves, more penalty), while $\lambda$ penalizes large leaf weights. The parameter $\gamma$ also sets the minimum loss reduction required to make a split, so it acts as a pruning threshold.
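
The complexity term is simple enough to evaluate by hand. The sketch below computes $\Omega(f)$ for two hypothetical trees that have the same number of leaves but different leaf-weight magnitudes; the weights and the values of $\gamma$ and $\lambda$ are assumptions chosen for illustration.

```python
def tree_complexity(leaf_weights, gamma, lam):
    """Omega(f) = gamma * T + 0.5 * lambda * sum_j w_j^2."""
    n_leaves = len(leaf_weights)
    return gamma * n_leaves + 0.5 * lam * sum(w * w for w in leaf_weights)

# Two hypothetical trees with three leaves each
small_weights = [0.1, -0.2, 0.05]
large_weights = [1.0, -2.0, 0.5]

penalty_small = tree_complexity(small_weights, gamma=0.1, lam=1.0)
penalty_large = tree_complexity(large_weights, gamma=0.1, lam=1.0)

print(f"small weights: {penalty_small:.5f}")
print(f"large weights: {penalty_large:.5f}")
```

Both trees pay the same $\gamma T$ cost for their three leaves; the difference in penalty comes entirely from the $\lambda$ term.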

Second-Order Approximation

XGBoost uses both the first-order gradient and the second-order derivative (Hessian) of the loss:

Second-Order Taylor Expansion of Loss

\mathcal{L}^{(t)} \approx \sum_{i=1}^{N} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t)

Here,

$g_i$ = First-order gradient of the loss w.r.t. the prediction
$h_i$ = Second-order gradient (Hessian) of the loss w.r.t. the prediction
$f_t$ = New tree added at iteration $t$
$\Omega(f_t)$ = Regularization of the new tree

Where:

g_i = \frac{\partial L(y_i, \hat{y}_i^{(t-1)})}{\partial \hat{y}_i^{(t-1)}}, \quad h_i = \frac{\partial^2 L(y_i, \hat{y}_i^{(t-1)})}{\partial (\hat{y}_i^{(t-1)})^2}

ℹ️ Why Second-Order Methods Are Faster

Using the Hessian (second-order information) allows XGBoost to converge in fewer iterations compared to first-order-only methods like standard gradient boosting. The Taylor expansion provides a more accurate local approximation of the loss, enabling larger, more effective steps. This is analogous to Newton's method vs. gradient descent in numerical optimization.
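
A small numerical sketch can illustrate the Newton analogy. It assumes binary log-loss on a raw score, for which $g_i = p_i - y_i$ and $h_i = p_i(1 - p_i)$, and compares a single constant step computed from the gradient alone (the mean pseudo-residual) with a Newton-style step $-\sum g_i / \sum h_i$. The four labels are made up, and regularization is omitted to keep the comparison simple.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_grad_hess(y, raw_score):
    """g_i and h_i of the binary log-loss with respect to the raw score."""
    p = sigmoid(raw_score)
    return p - y, p * (1.0 - p)

# Illustrative labels; all raw scores start at 0, i.e. p = 0.5
y = np.array([1.0, 1.0, 1.0, 0.0])
raw = np.zeros_like(y)

g, h = logistic_grad_hess(y, raw)

first_order_step = -g.mean()          # mean pseudo-residual
newton_step = -g.sum() / h.sum()      # second-order (Newton) step
best_constant = np.log(y.mean() / (1 - y.mean()))  # exact optimum: log-odds of the base rate

print(f"first-order step: {first_order_step:.4f}")
print(f"Newton step:      {newton_step:.4f}")
print(f"optimal constant: {best_constant:.4f}")
```

In this toy case the Newton step lands much closer to the optimal raw score after one update, which is the behavior the Taylor-expansion argument predicts.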

Optimal Leaf Weights

w_j^* = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}

Here,

$g_i$ = First-order gradient of the loss
$h_i$ = Second-order gradient (Hessian) of the loss
$\lambda$ = L2 regularization term
$I_j$ = Set of instances assigned to leaf $j$
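
The leaf-weight formula can be checked on a tiny case. The sketch below assumes squared loss, for which $g_i = \hat{y}_i - y_i$ and $h_i = 1$, with three made-up targets in one leaf and a current prediction of zero. Without regularization the optimal weight is the mean target; with $\lambda = 1$ it is shrunk toward zero.

```python
def optimal_leaf_weight(grads, hessians, lam):
    """w* = -sum(g_i) / (sum(h_i) + lambda) for the instances in one leaf."""
    return -sum(grads) / (sum(hessians) + lam)

# Illustrative leaf: three targets, current prediction 0 for all of them
targets = [1.0, 2.0, 3.0]
current_pred = 0.0
grads = [current_pred - t for t in targets]   # squared loss: g = prediction - y
hessians = [1.0] * len(targets)               # squared loss: h = 1

w_unreg = optimal_leaf_weight(grads, hessians, lam=0.0)
w_reg = optimal_leaf_weight(grads, hessians, lam=1.0)

print(f"lambda = 0: w* = {w_unreg}")
print(f"lambda = 1: w* = {w_reg}")
```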

Optimal Split Gain:

Optimal Split Gain (XGBoost)

\text{Gain} = \frac{1}{2} \left[ \frac{(\sum_{i \in I_L} g_i)^2}{\sum_{i \in I_L} h_i + \lambda} + \frac{(\sum_{i \in I_R} g_i)^2}{\sum_{i \in I_R} h_i + \lambda} - \frac{(\sum_{i \in I} g_i)^2}{\sum_{i \in I} h_i + \lambda} \right] - \gamma

Here,

$I_L$ = Set of instances assigned to the left child
$I_R$ = Set of instances assigned to the right child
$I$ = Set of all instances at the current node
$\gamma$ = Pruning threshold (minimum gain required)

📝 Computing Split Gain

Consider a node with 100 instances. The sum of gradients is $\sum_{i \in I} g_i = -50$ and the sum of Hessians is $\sum_{i \in I} h_i = 80$. Take $\lambda = 1$ and $\gamma = 0.1$.

A candidate split produces a left child (60 instances) with $\sum g_i = -30, \sum h_i = 50$ and a right child (40 instances) with $\sum g_i = -20, \sum h_i = 30$.

The score before the split is: $\frac{(-50)^2}{80 + 1} = \frac{2500}{81} \approx 30.86$

After the split: $\frac{(-30)^2}{50 + 1} + \frac{(-20)^2}{30 + 1} = \frac{900}{51} + \frac{400}{31} \approx 17.65 + 12.90 = 30.55$

Gain = $\frac{1}{2}(30.5503 - 30.8642) - 0.1 \approx -0.157 - 0.1 = -0.257$

Since Gain < 0, this split would NOT be made. The gain is already slightly negative before $\gamma$ is subtracted, since the children would receive nearly the same leaf weights as the parent; $\gamma$ then raises the threshold further. Only splits with positive gain are accepted.
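
The worked example can be reproduced in a few lines. The sketch below implements the gain formula directly (the function names are hypothetical). Computed at full precision, the gain for the split above is about -0.257; rounding the intermediate scores by hand can shift the last digit. A second, made-up split with very different gradient sums in the two children is included for contrast.

```python
def leaf_score(grad_sum, hess_sum, lam):
    """Structure score G^2 / (H + lambda) for one node."""
    return grad_sum ** 2 / (hess_sum + lam)

def split_gain(g_left, h_left, g_right, h_right, lam, gamma):
    """Gain = 0.5 * (left score + right score - parent score) - gamma."""
    parent = leaf_score(g_left + g_right, h_left + h_right, lam)
    children = leaf_score(g_left, h_left, lam) + leaf_score(g_right, h_right, lam)
    return 0.5 * (children - parent) - gamma

# Split from the worked example
gain_example = split_gain(-30, 50, -20, 30, lam=1.0, gamma=0.1)

# Hypothetical alternative split of the same node that separates the gradients well
gain_alternative = split_gain(-45, 40, -5, 40, lam=1.0, gamma=0.1)

print(f"worked example: {gain_example:.3f}")
print(f"alternative:    {gain_alternative:.3f}")
```

The alternative split is accepted because its children would receive clearly different leaf weights, while the worked-example split is rejected.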

Algorithm Comparison

Theorem: Bias-Variance Tradeoff in Boosting

Boosting reduces bias by sequentially fitting residuals, while variance is controlled through regularization (learning rate, tree depth, subsampling). The key insight is that each weak learner only needs to be slightly better than random guessing (high bias, low variance); given enough rounds, the ensemble's training error can then be driven arbitrarily low.
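
One way to watch this happen is to track accuracy round by round. The sketch below uses scikit-learn's `GradientBoostingClassifier` with depth-1 trees (stumps) as deliberately weak learners and `staged_predict` to score the ensemble after each round. The synthetic dataset and all settings are assumptions for illustration; exact numbers will vary with the data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Small synthetic problem (illustrative only)
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Depth-1 trees are weak, high-bias learners
model = GradientBoostingClassifier(
    n_estimators=200, learning_rate=0.1, max_depth=1, random_state=0
)
model.fit(X_tr, y_tr)

# staged_predict yields class predictions after 1, 2, ..., n_estimators rounds
train_acc = [np.mean(pred == y_tr) for pred in model.staged_predict(X_tr)]
test_acc = [np.mean(pred == y_te) for pred in model.staged_predict(X_te)]

for m in (1, 10, 50, 200):
    print(f"rounds={m:3d}  train acc={train_acc[m - 1]:.3f}  test acc={test_acc[m - 1]:.3f}")
```

Training accuracy keeps improving as rounds are added, whereas test accuracy typically levels off, which is where the regularization controls mentioned above come in.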

Architecture Diagram

Algorithm Feature Comparison:
═══════════════════════════════════════════════════════════════════

 Feature         │ XGBoost     │ LightGBM    │ CatBoost
 ════════════════╪═════════════╪═════════════╪═════════════════════
 Growth          │ Level-wise  │ Leaf-wise   │ Symmetric (Oblivious)
 Key Technique   │ Hist/Exact  │ GOSS + EFB  │ Ordered Boosting
 Categorical     │ Label Enc   │ Native      │ Native + Target Stats
 GPU Support     │ Yes         │ Yes         │ Yes
 Memory          │ High        │ Low         │ Medium
 Speed           │ Medium      │ Fastest     │ Medium
 Overfitting     │ Moderate    │ Higher Risk │ Lower Risk
 ════════════════╧═════════════╧═════════════╧═════════════════════

 Tree Growth Strategies:
 ───────────────────────

 Level-wise (XGBoost):          Leaf-wise (LightGBM):
        ◎                           ◎
       / \                         / \
      ◎   ◎                       ◎   ◎
     / \ / \                     / \
    ◎  ◎ ◎  ◎                   ◎   ◎
   ↑ grows all nodes at            ↑ grows nodes with
     same depth                       max delta loss
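
The difference between the two growth strategies can be sketched with a toy simulation. The code below is a simplified illustration, not either library's actual algorithm: nodes are numbered as in a binary heap (root = 1, children of node i are 2i and 2i + 1), and the split gains are made-up values. Level-wise growth splits every leaf at each depth, while leaf-wise growth always splits the single leaf with the largest gain.

```python
import heapq

# Hypothetical split gains per node id; nodes not listed have gain 0
GAINS = {1: 10.0, 2: 8.0, 3: 1.0, 4: 6.0, 5: 0.5, 6: 0.4, 7: 0.3, 8: 5.0, 9: 0.2}

def grow_level_wise(max_depth):
    """Split every leaf at each depth until max_depth is reached."""
    leaves = [1]
    for _ in range(max_depth):
        leaves = [child for node in leaves for child in (2 * node, 2 * node + 1)]
    return sorted(leaves)

def grow_leaf_wise(num_leaves):
    """Repeatedly split the leaf with the largest gain until num_leaves exist."""
    heap = [(-GAINS.get(1, 0.0), 1)]   # max-heap via negated gains
    leaves = {1}
    while len(leaves) < num_leaves and heap:
        _, node = heapq.heappop(heap)
        leaves.remove(node)
        for child in (2 * node, 2 * node + 1):
            leaves.add(child)
            heapq.heappush(heap, (-GAINS.get(child, 0.0), child))
    return sorted(leaves)

def depth(node):
    """Depth of a node in heap numbering (root has depth 0)."""
    return node.bit_length() - 1

level_leaves = grow_level_wise(max_depth=2)
leaf_leaves = grow_leaf_wise(num_leaves=4)

print("level-wise leaves:", level_leaves, "max depth:", max(map(depth, level_leaves)))
print("leaf-wise leaves: ", leaf_leaves, "max depth:", max(map(depth, leaf_leaves)))
```

With the same budget of four leaves, the leaf-wise tree grows deeper along the high-gain branch. That is why leaf-wise growth is usually paired with a limit on leaves or depth to keep overfitting in check.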

Python Implementation

Complete Comparison

import numpy as np
import pandas as pd
import time
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import (
    accuracy_score, roc_auc_score, classification_report,
    mean_squared_error
)
import warnings
warnings.filterwarnings('ignore')

# ═══════════════════════════════════════════════════
# Generate Synthetic Dataset
# ═══════════════════════════════════════════════════
X, y = make_classification(
    n_samples=50000,
    n_features=20,
    n_informative=15,
    n_redundant=3,
    n_clusters_per_class=2,
    random_state=42
)

# Add categorical-like features (a seeded generator keeps the data reproducible)
rng = np.random.default_rng(42)
X_df = pd.DataFrame(X, columns=[f'feat_{i}' for i in range(20)])
X_df['cat_1'] = rng.choice(['A', 'B', 'C', 'D'], size=50000)
X_df['cat_2'] = rng.choice(['low', 'medium', 'high'], size=50000)
X_df['cat_3'] = rng.choice(
    ['red', 'blue', 'green', 'yellow', 'purple'], size=50000
)

X_train, X_test, y_train, y_test = train_test_split(
    X_df, y, test_size=0.2, random_state=42
)

print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")
print(f"Features: {X_df.shape[1]}")

# ═══════════════════════════════════════════════════
# XGBoost Implementation
# ═══════════════════════════════════════════════════
import xgboost as xgb
from sklearn.preprocessing import LabelEncoder

# Label encode categorical features for XGBoost
le_dict = {}
X_train_xgb = X_train.copy()
X_test_xgb = X_test.copy()

for col in ['cat_1', 'cat_2', 'cat_3']:
    le = LabelEncoder()
    X_train_xgb[col] = le.fit_transform(X_train_xgb[col])
    X_test_xgb[col] = le.transform(X_test_xgb[col])
    le_dict[col] = le

# XGBoost DMatrix for optimized training
dtrain = xgb.DMatrix(X_train_xgb, label=y_train)
dtest = xgb.DMatrix(X_test_xgb, label=y_test)

# XGBoost Parameters
xgb_params = {
    'objective': 'binary:logistic',
    'eval_metric': 'auc',
    'max_depth': 6,
    'learning_rate': 0.1,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'min_child_weight': 5,
    'gamma': 0.1,
    'reg_alpha': 0.1,
    'reg_lambda': 1.0,
    'seed': 42,
    'nthread': -1
}

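# Training with early stopping
# Note: the test set doubles as the early-stopping set in this comparison, so the
# reported test metrics are optimistic; use a separate validation split in practice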
print("\n" + "=" * 50)
print("XGBoost Training")
print("=" * 50)

start_time = time.time()
xgb_model = xgb.train(
    xgb_params,
    dtrain,
    num_boost_round=1000,
    evals=[(dtrain, 'train'), (dtest, 'eval')],
    early_stopping_rounds=50,
    verbose_eval=100
)
xgb_time = time.time() - start_time

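# Predictions (restricted to the best iteration found by early stopping)
xgb_probs = xgb_model.predict(
    dtest, iteration_range=(0, xgb_model.best_iteration + 1)
)
xgb_preds = (xgb_probs > 0.5).astype(int)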

print(f"\nXGBoost Results:")
print(f"  Training Time: {xgb_time:.2f}s")
print(f"  Best Iteration: {xgb_model.best_iteration}")
print(f"  Accuracy: {accuracy_score(y_test, xgb_preds):.4f}")
print(f"  AUC: {roc_auc_score(y_test, xgb_probs):.4f}")

# ═══════════════════════════════════════════════════
# LightGBM Implementation
# ═══════════════════════════════════════════════════
import lightgbm as lgb

# LightGBM handles categorical features natively
X_train_lgb = X_train.copy()
X_test_lgb = X_test.copy()

for col in ['cat_1', 'cat_2', 'cat_3']:
    X_train_lgb[col] = X_train_lgb[col].astype('category')
    X_test_lgb[col] = X_test_lgb[col].astype('category')

# LightGBM Parameters
lgb_params = {
    'objective': 'binary',
    'metric': 'auc',
    'boosting_type': 'gbdt',
    'num_leaves': 31,
    'learning_rate': 0.1,
    'feature_fraction': 0.8,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'min_child_samples': 20,
    'reg_alpha': 0.1,
    'reg_lambda': 0.1,
    'verbose': -1,
    'n_jobs': -1,
    'seed': 42
}

print("\n" + "=" * 50)
print("LightGBM Training")
print("=" * 50)

start_time = time.time()
lgb_model = lgb.LGBMClassifier(**lgb_params, n_estimators=1000)
lgb_model.fit(
    X_train_lgb, y_train,
    eval_set=[(X_test_lgb, y_test)],
    callbacks=[lgb.early_stopping(50), lgb.log_evaluation(100)]
)
lgb_time = time.time() - start_time

# Predictions
lgb_probs = lgb_model.predict_proba(X_test_lgb)[:, 1]
lgb_preds = lgb_model.predict(X_test_lgb)

print(f"\nLightGBM Results:")
print(f"  Training Time: {lgb_time:.2f}s")
print(f"  Best Iteration: {lgb_model.best_iteration_}")
print(f"  Accuracy: {accuracy_score(y_test, lgb_preds):.4f}")
print(f"  AUC: {roc_auc_score(y_test, lgb_probs):.4f}")

# ═══════════════════════════════════════════════════
# CatBoost Implementation
# ═══════════════════════════════════════════════════
from catboost import CatBoostClassifier, Pool

# CatBoost handles categorical features natively
cat_features = ['cat_1', 'cat_2', 'cat_3']

print("\n" + "=" * 50)
print("CatBoost Training")
print("=" * 50)

start_time = time.time()
cat_model = CatBoostClassifier(
    iterations=1000,
    learning_rate=0.1,
    depth=6,
    l2_leaf_reg=3,
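    # min_data_in_leaf is omitted: the CatBoost documentation lists it as usable
    # only with the Lossguide and Depthwise grow policies, not the default SymmetricTree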
    random_seed=42,
    eval_metric='AUC',
    early_stopping_rounds=50,
    verbose=100,
    cat_features=cat_features,
    task_type='CPU'
)

cat_model.fit(X_train, y_train, eval_set=(X_test, y_test))
cat_time = time.time() - start_time

# Predictions
cat_probs = cat_model.predict_proba(X_test)[:, 1]
cat_preds = cat_model.predict(X_test)

print(f"\nCatBoost Results:")
print(f"  Training Time: {cat_time:.2f}s")
print(f"  Best Iteration: {cat_model.best_iteration_}")
print(f"  Accuracy: {accuracy_score(y_test, cat_preds):.4f}")
print(f"  AUC: {roc_auc_score(y_test, cat_probs):.4f}")

# ═══════════════════════════════════════════════════
# Comparison Summary
# ═══════════════════════════════════════════════════
print("\n" + "=" * 50)
print("COMPARISON SUMMARY")
print("=" * 50)

results = pd.DataFrame({
    'Algorithm': ['XGBoost', 'LightGBM', 'CatBoost'],
    'Accuracy': [
        accuracy_score(y_test, xgb_preds),
        accuracy_score(y_test, lgb_preds),
        accuracy_score(y_test, cat_preds)
    ],
    'AUC': [
        roc_auc_score(y_test, xgb_probs),
        roc_auc_score(y_test, lgb_probs),
        roc_auc_score(y_test, cat_probs)
    ],
    'Time (s)': [xgb_time, lgb_time, cat_time]
})

print(results.to_string(index=False))

Hyperparameter Tuning with Optuna

import optuna
from optuna.samplers import TPESampler

# ═══════════════════════════════════════════════════
# XGBoost Hyperparameter Tuning
# ═══════════════════════════════════════════════════
def objective_xgb(trial):
    params = {
        'objective': 'binary:logistic',
        'eval_metric': 'auc',
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'subsample': trial.suggest_float('subsample', 0.6, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.6, 1.0),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 10),
        'gamma': trial.suggest_float('gamma', 0, 5),
        'reg_alpha': trial.suggest_float('reg_alpha', 1e-8, 10, log=True),
        'reg_lambda': trial.suggest_float('reg_lambda', 1e-8, 10, log=True),
        'seed': 42
    }

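    # Tune on a validation split carved from the training data so the
    # held-out test set never influences hyperparameter selection
    X_tr, X_val, y_tr, y_val = train_test_split(
        X_train_xgb, y_train, test_size=0.2, random_state=42
    )
    dtrain = xgb.DMatrix(X_tr, label=y_tr)
    dval = xgb.DMatrix(X_val, label=y_val)

    model = xgb.train(
        params, dtrain,
        num_boost_round=500,
        evals=[(dval, 'eval')],
        early_stopping_rounds=30,
        verbose_eval=False
    )

    # Score the best iteration found by early stopping
    preds = model.predict(dval, iteration_range=(0, model.best_iteration + 1))
    return roc_auc_score(y_val, preds)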

# Run optimization
study_xgb = optuna.create_study(
    direction='maximize',
    sampler=TPESampler(seed=42)
)
study_xgb.optimize(objective_xgb, n_trials=50, show_progress_bar=True)

print(f"\nBest XGBoost AUC: {study_xgb.best_value:.4f}")
print(f"Best Parameters: {study_xgb.best_params}")

Feature Importance Analysis

import matplotlib.pyplot as plt

# ═══════════════════════════════════════════════════
# Feature Importance Comparison
# ═══════════════════════════════════════════════════
fig, axes = plt.subplots(1, 3, figsize=(18, 6))

# XGBoost importance
xgb_importance = xgb_model.get_score(importance_type='weight')
xgb_imp_df = pd.DataFrame({
    'feature': list(xgb_importance.keys()),
    'importance': list(xgb_importance.values())
}).sort_values('importance', ascending=True).tail(15)

axes[0].barh(xgb_imp_df['feature'], xgb_imp_df['importance'])
axes[0].set_title('XGBoost Feature Importance')
axes[0].set_xlabel('Weight')

# LightGBM importance
lgb_importance = pd.DataFrame({
    'feature': X_train_lgb.columns,
    'importance': lgb_model.feature_importances_
}).sort_values('importance', ascending=True).tail(15)

axes[1].barh(lgb_importance['feature'], lgb_importance['importance'])
axes[1].set_title('LightGBM Feature Importance')
axes[1].set_xlabel('Split Count')

# CatBoost importance
cat_importance = cat_model.get_feature_importance()
cat_imp_df = pd.DataFrame({
    'feature': X_train.columns,
    'importance': cat_importance
}).sort_values('importance', ascending=True).tail(15)

axes[2].barh(cat_imp_df['feature'], cat_imp_df['importance'])
axes[2].set_title('CatBoost Feature Importance')
axes[2].set_xlabel('Importance')

plt.tight_layout()
plt.savefig('feature_importance_comparison.png', dpi=150, bbox_inches='tight')
plt.show()

Real-World Use Cases

Domain	Use Case	Commonly Used Algorithm
Finance	Credit scoring, fraud detection	XGBoost
E-commerce	Click-through rate prediction	LightGBM
Healthcare	Disease diagnosis, patient readmission	CatBoost
Marketing	Customer churn prediction	LightGBM
Insurance	Risk assessment, claim prediction	XGBoost

📋 Key Takeaways

Gradient Boosting performs gradient descent in function space, sequentially adding weak learners that fit pseudo-residuals
XGBoost — Most mature, uses second-order gradients (Hessian), excellent regularization with $\lambda, \alpha, \gamma$
LightGBM — Fastest training with leaf-wise growth, GOSS + EFB for large datasets, native categorical support
CatBoost — Best for categorical features, ordered boosting reduces target leakage and overfitting
Regularization — The objective combines training loss with complexity penalty: $\mathcal{L} = \sum L(y_i, \hat{y}_i) + \sum \Omega(f_k)$
Learning Rate — Lower values (0.01-0.1) with more trees usually outperform higher rates; acts as shrinkage
Second-Order Methods — XGBoost's use of the Hessian enables faster convergence than first-order-only methods

Practice Exercises

Dataset Comparison: Train all three algorithms on a real dataset (e.g., Ames Housing) and compare performance
Categorical Feature Study: Create a dataset with mixed features and compare how each algorithm handles categoricals
Hyperparameter Sensitivity: Plot how accuracy changes with different max_depth and learning_rate values
Stacking Ensemble: Use XGBoost, LightGBM, and CatBoost as base learners in a stacking ensemble

📝 Gradient Boosting Walkthrough (3 Rounds)

Setup: Predict house prices with squared loss. Training data: 5 houses with true prices [200K, 250K, 180K, 320K, 270K].

Round 1: Initial prediction is the mean: $F_0 = 244K$. Residuals (errors): [-44K, 6K, -64K, 76K, 26K].

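Round 2: Fit a small tree to the residuals. Suppose the tree splits on "square footage > 1500", with the 200K and 180K houses on the small side. Each leaf predicts the mean residual of its houses: -54K for small houses and +36K for large ones. With learning rate $\eta = 0.1$: $F_1(x) = 244K + 0.1 \cdot h_1(x)$, which gives 238.6K for small houses and 247.6K for large ones. The new residuals shrink to [-38.6K, 2.4K, -58.4K, 72.4K, 22.4K].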

Round 3: Fit another tree to the new residuals. Each iteration reduces the remaining error. After M rounds: $F_M(x) = F_0(x) + \eta \sum_{m=1}^{M} h_m(x)$.

Key Insight: Each tree only needs to be a weak learner (better than random). The ensemble becomes strong through additive combination.
