CW

Hyperparameter Tuning: Grid Search, Random Search and Optuna

Module 8: Tree-Based ModelsFree Lesson

Advertisement

Hyperparameter Tuning: Grid Search, Random Search and Optuna

â„šī¸ Module 8 — Tree-Based Models

This lesson covers systematic approaches to hyperparameter optimization, from exhaustive grid search to intelligent Bayesian methods and modern tools like Optuna.

1. Hyperparameters vs Parameters

Understanding the distinction between hyperparameters and learned parameters is fundamental to model optimization.

Parameters are learned from data during training — weights in a neural network, split thresholds in a decision tree. Hyperparameters control the learning process itself and must be set before training begins.

AspectParametersHyperparameters
Set whenDuring trainingBefore training
Learned fromDataManual / search
ExamplesWeights, biases, split pointsLearning rate, max depth, n_estimators
Optimized viaGradient descent, EMGrid search, random search, Bayesian

Formal Definition

A machine learning model fθf_\theta has parameters θ\theta learned from training data D={(xi,yi)}i=1n\mathcal{D} = \{(x_i, y_i)\}_{i=1}^n via:

θ∗=arg⁡min⁡θL(θ;D)\theta^* = \arg\min_\theta \mathcal{L}(\theta; \mathcal{D})

Hyperparameters Îģ\lambda govern the learning process:

θ∗(Îģ)=arg⁥min⁥θL(θ;D,Îģ)\theta^*(\lambda) = \arg\min_\theta \mathcal{L}(\theta; \mathcal{D}, \lambda)

The goal of hyperparameter tuning is to find:

Îģ∗=arg⁥min⁥ÎģLval(θ∗(Îģ);Dval)\lambda^* = \arg\min_\lambda \mathcal{L}_{\text{val}}(\theta^*(\lambda); \mathcal{D}_{\text{val}})

where Lval\mathcal{L}_{\text{val}} is the validation loss — we never use test data for this.

Why Tuning Matters

A well-tuned model can outperform a more complex model with default settings. The bias-variance tradeoff is directly controlled by hyperparameters:

  • Too restrictive (e.g., max_depth=2): high bias, underfitting
  • Too flexible (e.g., max_depth=50): high variance, overfitting
  • Just right: optimal generalization

2. Grid Search — Exhaustive Exploration

Grid Search is the most straightforward approach: define a discrete set of values for each hyperparameter and evaluate every possible combination.

Algorithm

Architecture Diagram
GridSearch(model, param_grid, X, y, cv):
    best_score = -inf
    best_params = null
    for each combination in product(param_grid.values()):
        score = CrossValScore(model, combination, X, y, cv)
        if score > best_score:
            best_score = score
            best_params = combination
    return best_params, best_score

Python Implementation

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier

param_grid = {
    'n_estimators': [100, 200, 500],
    'max_depth': [3, 5, 7, 10],
    'learning_rate': [0.01, 0.05, 0.1, 0.2],
    'min_samples_split': [2, 5, 10],
    'subsample': [0.8, 0.9, 1.0]
}

grid_search = GridSearchCV(
    estimator=GradientBoostingClassifier(random_state=42),
    param_grid=param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=2,
    return_train_score=True
)

grid_search.fit(X_train, y_train)
print(f"Best params: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.4f}")

The Curse of Dimensionality in Grid Search

With dd hyperparameters and kk values per parameter, grid search evaluates kdk^d combinations:

HyperparametersValues eachCombinationsTime (10s each)
25254 min
3512521 min
456251.7 hours
553,1258.7 hours
6515,6254.3 days

Grid Search vs Random Search

The visual below demonstrates how random search covers the search space more efficiently. Each axis represents one hyperparameter, and colored dots represent evaluations.

function GridSearchSVG() {
  return (
    <svg viewBox="0 0 600 400" xmlns="http://www.w3.org/2000/svg" style={{width: '100%', maxWidth: '600px'}}>
      <defs>
        <linearGradient id="gridBg" x1="0%" y1="0%" x2="100%" y2="100%">
          <stop offset="0%" stopColor="#1a1a2e" />
          <stop offset="100%" stopColor="#16213e" />
        </linearGradient>
      </defs>
      <rect width="600" height="400" fill="url(#gridBg)" rx="12" />

      <text x="300" y="35" textAnchor="middle" fill="#e0e0e0" fontSize="16" fontWeight="bold" fontFamily="monospace">
        Grid Search vs Random Search
      </text>

      {/* Grid Search Side */}
      <text x="150" y="65" textAnchor="middle" fill="#00d2ff" fontSize="13" fontWeight="bold" fontFamily="monospace">
        Grid Search
      </text>

      {[0,1,2,3,4,5,6,7].map(i => (
        <line key={`gh${i}`} x1="60" y1={80 + i * 35} x2="240" y2={80 + i * 35} stroke="#333" strokeWidth="0.5" />
      ))}
      {[0,1,2,3,4,5,6,7].map(i => (
        <line key={`gv${i}`} x1={60 + i * 22.5} y1="80" x2={60 + i * 22.5} y2="350" stroke="#333" strokeWidth="0.5" />
      ))}

      {[0,1,2,3,4,5,6,7].map(row =>
        [0,1,2,3,4,5,6,7].map(col => (
          <circle
            key={`gp${row}-${col}`}
            cx={60 + col * 22.5 + 11}
            cy={80 + row * 35 + 17}
            r="3"
            fill="#3a7bd5"
            opacity="0.7"
          />
        ))
      )}

      <rect x="148" y="218" width="25" height="25" fill="none" stroke="#ff6b6b" strokeWidth="2" rx="3" />
      <text x="160" y="235" textAnchor="middle" fill="#ff6b6b" fontSize="8" fontFamily="monospace">Best</text>

      <text x="150" y="380" textAnchor="middle" fill="#888" fontSize="10" fontFamily="monospace">
        8 x 8 = 64 evaluations
      </text>

      {/* Random Search Side */}
      <text x="450" y="65" textAnchor="middle" fill="#00ff88" fontSize="13" fontWeight="bold" fontFamily="monospace">
        Random Search
      </text>

      {[0,1,2,3,4,5,6,7].map(i => (
        <line key={`rh${i}`} x1="360" y1={80 + i * 35} x2="540" y2={80 + i * 35} stroke="#333" strokeWidth="0.5" />
      ))}
      {[0,1,2,3,4,5,6,7].map(i => (
        <line key={`rv${i}`} x1={360 + i * 22.5} y1="80" x2={360 + i * 22.5} y2="350" stroke="#333" strokeWidth="0.5" />
      ))}

      {[[372,135],[395,240],[420,100],[445,280],[470,160],[495,310],[515,200],[380,330],[410,180],[460,120],[500,260],[520,340],[375,200],[430,300],[485,90]].map(([cx, cy], i) => (
        <circle key={`rp${i}`} cx={cx} cy={cy} r="3" fill="#00ff88" opacity="0.7" />
      ))}

      <circle cx="495" cy="200" r="8" fill="none" stroke="#ff6b6b" strokeWidth="2" />
      <text x="495" y="220" textAnchor="middle" fill="#ff6b6b" fontSize="8" fontFamily="monospace">Best</text>

      <text x="450" y="380" textAnchor="middle" fill="#888" fontSize="10" fontFamily="monospace">
        15 random evaluations
      </text>
    </svg>
  );
}

Key Insight: Grid search wastes budget on unimportant hyperparameters. If max_depth matters more than subsample, grid search still evaluates all subsample values for every max_depth.

3. Random Search — Efficient Sampling

Random search samples hyperparameter combinations from specified distributions. Bergstra and Bengio (2012) showed random search is more efficient than grid search when only a few hyperparameters are truly important.

Why Random Search Wins

The key insight: if one hyperparameter (e.g., learning rate) dominates performance, grid search wastes k−1k-1 evaluations per dimension. Random search explores the dominant dimension k×k \times more effectively.

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform, randint

param_distributions = {
    'n_estimators': randint(50, 500),
    'max_depth': randint(2, 20),
    'learning_rate': uniform(0.001, 0.3),
    'min_samples_split': randint(2, 20),
    'subsample': uniform(0.6, 0.4),
    'min_samples_leaf': randint(1, 10)
}

random_search = RandomizedSearchCV(
    estimator=GradientBoostingClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=200,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    random_state=42,
    verbose=1
)

random_search.fit(X_train, y_train)
print(f"Best params: {random_search.best_params_}")

Log-Uniform Distributions for Learning Rate

Learning rates span orders of magnitude, so uniform sampling is suboptimal:

import numpy as np
from scipy.stats import loguniform

# Bad: uniform sampling concentrates points in [0.1, 0.3]
# Good: log-uniform samples proportionally across scales
learning_rates_log = loguniform(1e-3, 1e-1)

# Manual implementation
def log_uniform(low, high, size=1):
    return np.exp(np.random.uniform(np.log(low), np.log(high), size))

Comparison: Grid vs Random

CriterionGrid SearchRandom Search
CoverageUniform gridStratified sampling
Curse of dimensionalityExponentialLinear in n_iter
Important dimensionsWasted budgetBetter coverage
ReproducibilityDeterministicDepends on seed
ParallelizationDifficultTrivial
Budget efficiencyLowHigh

4. Bayesian Optimization — Intelligent Search

Bayesian optimization builds a surrogate model of the objective function and uses an acquisition function to decide where to sample next.

The Loop

function BayesianOptSVG() {
  return (
    <svg viewBox="0 0 700 350" xmlns="http://www.w3.org/2000/svg" style={{width: '100%', maxWidth: '700px'}}>
      <defs>
        <linearGradient id="bayesBg" x1="0%" y1="0%" x2="100%" y2="100%">
          <stop offset="0%" stopColor="#0d1117" />
          <stop offset="100%" stopColor="#161b22" />
        </linearGradient>
      </defs>
      <rect width="700" height="350" fill="url(#bayesBg)" rx="12" />

      <text x="350" y="30" textAnchor="middle" fill="#e0e0e0" fontSize="15" fontWeight="bold" fontFamily="monospace">
        Bayesian Optimization Loop
      </text>

      <path d="M 60 280 Q 120 200 180 220 Q 240 240 300 150 Q 360 60 420 180 Q 480 300 540 120 Q 600 60 640 100"
        fill="none" stroke="#3a7bd5" strokeWidth="2" opacity="0.5" />

      <path d="M 60 290 Q 120 230 180 240 Q 240 250 300 170 Q 360 90 420 200 Q 480 290 540 140 Q 600 80 640 120"
        fill="none" stroke="#00d2ff" strokeWidth="2.5" />

      <path d="M 60 260 Q 120 190 180 200 Q 240 210 300 120 Q 360 40 420 150 Q 480 260 540 100 Q 600 40 640 80
               L 640 160 Q 600 120 540 180 Q 480 320 420 250 Q 360 140 300 220 Q 240 290 180 280 Q 120 270 60 320 Z"
        fill="#00d2ff" opacity="0.1" />

      {[[120,225],[200,235],[350,130],[500,210],[580,95]].map(([x, y], i) => (
        <circle key={i} cx={x} cy={y} r="5" fill="#ff6b6b" stroke="#fff" strokeWidth="1.5" />
      ))}

      <circle cx="420" cy="150" r="7" fill="none" stroke="#00ff88" strokeWidth="2" strokeDasharray="4,2" />
      <text x="420" y="140" textAnchor="middle" fill="#00ff88" fontSize="9" fontFamily="monospace">next</text>

      <text x="80" y="330" fill="#3a7bd5" fontSize="10" fontFamily="monospace">True objective</text>
      <text x="220" y="330" fill="#00d2ff" fontSize="10" fontFamily="monospace">Surrogate (GP)</text>
      <text x="370" y="330" fill="#ff6b6b" fontSize="10" fontFamily="monospace">Observations</text>
      <text x="520" y="330" fill="#00ff88" fontSize="10" fontFamily="monospace">Next eval</text>

      <path d="M 60 310 Q 150 315 240 305 Q 330 280 420 250 Q 510 310 600 300"
        fill="none" stroke="#ffd93d" strokeWidth="1.5" strokeDasharray="5,3" />
      <text x="500" y="260" fill="#ffd93d" fontSize="9" fontFamily="monospace">Acquisition</text>
    </svg>
  );
}

Gaussian Process Surrogate

The surrogate model is typically a Gaussian Process (GP):

f(x)âˆŧGP(m(x),k(x,x′))f(x) \sim \mathcal{GP}(m(x), k(x, x'))

where m(x)m(x) is the mean function and k(x,x′)k(x, x') is the kernel (covariance function). Given observations Dn={xi,yi}i=1n\mathcal{D}_n = \{x_i, y_i\}_{i=1}^n, the posterior predictive is:

f(x∗)âˆŖDnâˆŧN(Îŧn(x∗),΃n2(x∗))f(x^*) | \mathcal{D}_n \sim \mathcal{N}(\mu_n(x^*), \sigma_n^2(x^*))
Îŧn(x∗)=k(x∗,X)[K(X,X)+΃2I]−1y\mu_n(x^*) = k(x^*, X) [K(X, X) + \sigma^2 I]^{-1} y
΃n2(x∗)=k(x∗,x∗)−k(x∗,X)[K(X,X)+΃2I]−1k(X,x∗)\sigma_n^2(x^*) = k(x^*, x^*) - k(x^*, X) [K(X, X) + \sigma^2 I]^{-1} k(X, x^*)

Acquisition Functions

The acquisition function Îą(x)\alpha(x) balances exploration and exploitation:

Expected Improvement (EI):

EI(x)=E[max⁡(0,f∗−f(x))]\text{EI}(x) = \mathbb{E}[\max(0, f^* - f(x))]
=(Îŧ(x)−f∗)ÎĻ(Îŧ(x)−fâˆ—Īƒ(x))+΃(x)Ī•(Îŧ(x)−fâˆ—Īƒ(x))= (\mu(x) - f^*) \Phi\left(\frac{\mu(x) - f^*}{\sigma(x)}\right) + \sigma(x) \phi\left(\frac{\mu(x) - f^*}{\sigma(x)}\right)

where f∗f^* is the best observed value, ÎĻ\Phi and Ī•\phi are the standard normal CDF and PDF.

Upper Confidence Bound (UCB):

UCB(x)=Îŧ(x)+Îēâ‹…Īƒ(x)\text{UCB}(x) = \mu(x) + \kappa \cdot \sigma(x)

where Îē\kappa controls exploration. Higher Îē\kappa = more exploration.

Thompson Sampling: Sample f∗âˆŧGP(Îŧn(x),΃n2(x))f^* \sim \mathcal{GP}(\mu_n(x), \sigma_n^2(x)) and optimize the sample.

Exploration vs Exploitation

The acquisition function encodes a fundamental tradeoff:

  • Exploration: sample where uncertainty ΃(x)\sigma(x) is high (learning the landscape)
  • Exploitation: sample where predicted value Îŧ(x)\mu(x) is good (refining the optimum)

EI naturally balances both: high Îŧ(x)\mu(x) increases exploitation, high ΃(x)\sigma(x) increases exploration (only when Îŧ(x)\mu(x) is near f∗f^*).

from skopt import gp_minimize
from skopt.space import Real, Integer
from skopt.utils import use_named_args

space = [
    Real(0.001, 0.3, name='learning_rate', prior='log-uniform'),
    Integer(50, 500, name='n_estimators'),
    Integer(2, 20, name='max_depth'),
    Real(0.6, 1.0, name='subsample')
]

@use_named_args(space)
def objective(learning_rate, n_estimators, max_depth, subsample):
    model = GradientBoostingClassifier(
        learning_rate=learning_rate,
        n_estimators=n_estimators,
        max_depth=max_depth,
        subsample=subsample,
        random_state=42
    )
    return -cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy').mean()

result = gp_minimize(
    objective,
    space,
    n_calls=50,
    n_initial_points=10,
    random_state=42,
    acq_func='EI'
)

print(f"Best score: {-result.fun:.4f}")
print(f"Best params: {result.x}")

5. Optuna — State-of-the-Art Optimization

Optuna uses Tree-structured Parzen Estimator (TPE) and supports advanced features like pruning, conditional hyperparameters, and rich visualization.

TPE Algorithm

TPE models p(xâˆŖy)p(x | y) instead of p(yâˆŖx)p(y | x) — a key departure from Gaussian Process approaches:

  1. Split observations into "good" (y<y∗y < y^*) and "bad" (yâ‰Ĩy∗y \geq y^*) using quantile Îŗ\gamma
  2. Model good observations: ℓ(x)=p(xâˆŖy<y∗)\ell(x) = p(x | y < y^*)
  3. Model bad observations: g(x)=p(xâˆŖyâ‰Ĩy∗)g(x) = p(x | y \geq y^*)
  4. Maximize ratio: ℓ(x)g(x)\frac{\ell(x)}{g(x)}

TPE is non-parametric (uses kernel density estimation), scales better than GP-based methods, and naturally handles conditional hyperparameters.

Basic Usage

import optuna
from sklearn.model_selection import cross_val_score

def objective(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 50, 500),
        'max_depth': trial.suggest_int('max_depth', 2, 20),
        'learning_rate': trial.suggest_float('learning_rate', 1e-3, 0.3, log=True),
        'min_samples_split': trial.suggest_int('min_samples_split', 2, 20),
        'min_samples_leaf': trial.suggest_int('min_samples_leaf', 1, 10),
        'subsample': trial.suggest_float('subsample', 0.6, 1.0),
        'max_features': trial.suggest_categorical('max_features', ['sqrt', 'log2', None])
    }

    model = GradientBoostingClassifier(**params, random_state=42)
    score = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy').mean()
    return score

study = optuna.create_study(direction='maximize', study_name='gbm_tuning')
study.optimize(objective, n_trials=100, show_progress_bar=True)

print(f"Best value: {study.best_value:.4f}")
print(f"Best params: {study.best_params}")

Conditional Hyperparameters

Optuna handles conditional spaces natively — parameters that only apply when another parameter takes a specific value:

def objective_advanced(trial):
    classifier = trial.suggest_categorical('classifier', ['rf', 'gbm', 'xgb'])

    if classifier == 'rf':
        params = {
            'n_estimators': trial.suggest_int('rf_n_estimators', 50, 500),
            'max_depth': trial.suggest_int('rf_max_depth', 2, 32),
            'min_samples_split': trial.suggest_int('rf_min_samples_split', 2, 20)
        }
        model = RandomForestClassifier(**params, random_state=42)

    elif classifier == 'gbm':
        params = {
            'n_estimators': trial.suggest_int('gbm_n_estimators', 50, 500),
            'max_depth': trial.suggest_int('gbm_max_depth', 2, 15),
            'learning_rate': trial.suggest_float('gbm_lr', 1e-3, 0.3, log=True)
        }
        model = GradientBoostingClassifier(**params, random_state=42)

    else:  # xgb
        params = {
            'n_estimators': trial.suggest_int('xgb_n_estimators', 50, 500),
            'max_depth': trial.suggest_int('xgb_max_depth', 2, 15),
            'learning_rate': trial.suggest_float('xgb_lr', 1e-3, 0.3, log=True)
        }
        model = XGBClassifier(**params, use_label_encoder=False, eval_metric='logloss')

    score = cross_val_score(model, X_train, y_train, cv=5).mean()
    return score

Pruning — Early Termination of Bad Trials

Pruning terminates unpromising trials early, saving computational budget:

import optuna
from optuna.pruners import MedianPruner, SuccessiveHalvingPruner

def objective_with_pruning(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 50, 500),
        'max_depth': trial.suggest_int('max_depth', 2, 20),
        'learning_rate': trial.suggest_float('learning_rate', 1e-3, 0.3, log=True)
    }

    scores = []
    kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

    for fold, (train_idx, val_idx) in enumerate(kf.split(X_train, y_train)):
        X_tr, X_val = X_train.iloc[train_idx], X_train.iloc[val_idx]
        y_tr, y_val = y_train.iloc[train_idx], y_train.iloc[val_idx]

        model = GradientBoostingClassifier(**params, random_state=42)
        model.fit(X_tr, y_tr)

        score = model.score(X_val, y_val)
        trial.report(score, step=fold)

        if trial.should_prune():
            raise optuna.TrialPruned()

        scores.append(score)

    return np.mean(scores)

pruner = MedianPruner(n_startup_trials=10, n_warmup_steps=3)
study = optuna.create_study(
    direction='maximize',
    pruner=pruner,
    study_name='pruning_demo'
)

Pruning Strategies

PrunerMechanismBest For
MedianPrunerPrune if below median of previous trialsGeneral purpose
SuccessiveHalvingPrunerEliminate bottom fraction each roundLarge search spaces
HyperbandPrunerBudget allocation with early stoppingResource-constrained
PatientPrunerWait N trials before pruningNoisy objectives

Optuna Visualization

import optuna.visualization as vis

# Optimization history — shows improvement over trials
fig1 = vis.plot_optimization_history(study)
fig1.show()

# Parameter importances — which hyperparameters matter most
fig2 = vis.plot_param_importances(study)
fig2.show()

# Slice plot — objective value for each parameter
fig3 = vis.plot_slice(study)
fig3.show()

# Contour plot — interaction between two parameters
fig4 = vis.plot_contour(study, params=['learning_rate', 'max_depth'])
fig4.show()

# Parallel coordinate — high-dimensional view
fig5 = vis.plot_parallel_coordinate(study)
fig5.show()
function OptunaProcessSVG() {
  return (
    <svg viewBox="0 0 700 400" xmlns="http://www.w3.org/2000/svg" style={{width: '100%', maxWidth: '700px'}}>
      <defs>
        <linearGradient id="optBg" x1="0%" y1="0%" x2="100%" y2="100%">
          <stop offset="0%" stopColor="#1a1a2e" />
          <stop offset="100%" stopColor="#0f3460" />
        </linearGradient>
      </defs>
      <rect width="700" height="400" fill="url(#optBg)" rx="12" />

      <text x="350" y="30" textAnchor="middle" fill="#e0e0e0" fontSize="15" fontWeight="bold" fontFamily="monospace">
        Optuna Optimization Process
      </text>

      {/* Trial history */}
      <text x="120" y="60" textAnchor="middle" fill="#00d2ff" fontSize="11" fontWeight="bold" fontFamily="monospace">Trial History</text>

      {[
        {trial: 1, score: 0.82, pruned: false},
        {trial: 2, score: 0.78, pruned: false},
        {trial: 3, score: 0.75, pruned: false},
        {trial: 4, score: 0.65, pruned: true},
        {trial: 5, score: 0.85, pruned: false},
        {trial: 6, score: 0.83, pruned: false},
        {trial: 7, score: 0.58, pruned: true},
        {trial: 8, score: 0.88, pruned: false},
        {trial: 9, score: 0.87, pruned: false},
        {trial: 10, score: 0.91, pruned: false},
        {trial: 11, score: 0.70, pruned: true},
        {trial: 12, score: 0.89, pruned: false}
      ].map(({trial, score, pruned}, i) => {
        const barWidth = (score - 0.5) * 200;
        const color = pruned ? '#ff6b6b' : score >= 0.9 ? '#00ff88' : score >= 0.85 ? '#ffd93d' : '#3a7bd5';
        return (
          <g key={i}>
            <rect x="30" y={75 + i * 25} width={barWidth} height="18" fill={color} opacity={pruned ? 0.4 : 0.8} rx="3" />
            {pruned && <text x={35 + barWidth} y={88 + i * 25} fill="#ff6b6b" fontSize="8" fontFamily="monospace">pruned</text>}
            <text x="28" y={88 + i * 25} textAnchor="end" fill="#888" fontSize="9" fontFamily="monospace">T{trial}</text>
          </g>
        );
      })}

      {/* Best value tracker */}
      <text x="450" y="60" textAnchor="middle" fill="#00ff88" fontSize="11" fontWeight="bold" fontFamily="monospace">Best Value</text>
      <path d="M 350 200 L 380 200 L 380 170 L 420 170 L 420 120 L 500 120 L 500 90 L 600 90"
        fill="none" stroke="#00ff88" strokeWidth="2" />
      {[
        [350, 200, '0.82'], [420, 170, '0.85'], [500, 120, '0.88'], [600, 90, '0.91']
      ].map(([x, y, label], i) => (
        <g key={`best${i}`}>
          <circle cx={x} cy={y} r="4" fill="#00ff88" />
          <text x={x} y={y - 10} textAnchor="middle" fill="#00ff88" fontSize="9" fontFamily="monospace">{label}</text>
        </g>
      ))}

      <text x="475" y="390" textAnchor="middle" fill="#888" fontSize="10" fontFamily="monospace">Trials</text>
    </svg>
  );
}

6. Learning Rate Schedules

Learning rate scheduling reduces the learning rate during training, allowing fast convergence early and fine-grained updates late.

Common Schedules

ηt=η0⋅f(t)\eta_t = \eta_0 \cdot f(t)

where Ρ0\eta_0 is the initial learning rate and tt is the current step.

ScheduleFormulaCharacteristics
Step DecayΡt=Ρ0â‹…ÎŗâŒŠt/s⌋\eta_t = \eta_0 \cdot \gamma^{\lfloor t/s \rfloor}Reduce by factor Îŗ\gamma every ss steps
Exponential Decayηt=η0⋅e−kt\eta_t = \eta_0 \cdot e^{-kt}Smooth continuous decay
Cosine AnnealingΡt=Ρmin⁥+12(Ρ0−ηmin⁥)(1+cos⁥(Ī€tT))\eta_t = \eta_{\min} + \frac{1}{2}(\eta_0 - \eta_{\min})(1 + \cos(\frac{\pi t}{T}))Periodic warm restarts
Linear Warmupηt=η0⋅tTwarmup\eta_t = \eta_0 \cdot \frac{t}{T_{\text{warmup}}} for t<Twarmupt < T_{\text{warmup}}Avoid early instability
Polynomial Decayηt=η0(1−tT)p\eta_t = \eta_0 (1 - \frac{t}{T})^pFlexible power control
function LearningRateSVG() {
  const width = 600;
  const height = 350;
  const pad = {top: 40, right: 30, bottom: 50, left: 60};
  const plotW = width - pad.left - pad.right;
  const plotH = height - pad.top - pad.bottom;

  function x(t) { return pad.left + (t / 100) * plotW; }
  function y(lr) { return pad.top + plotH - (lr / 0.3) * plotH; }

  // Schedule curves
  const stepDecay = Array.from({length: 100}, (_, t) => [t, 0.3 * Math.pow(0.5, Math.floor(t / 25))]);
  const expDecay = Array.from({length: 100}, (_, t) => [t, 0.3 * Math.exp(-0.03 * t)]);
  const cosine = Array.from({length: 100}, (_, t) => [t, 0.001 + 0.5 * (0.3 - 0.001) * (1 + Math.cos(Math.PI * t / 100))]);
  const warmup = Array.from({length: 100}, (_, t) => [t, t < 20 ? 0.3 * t / 20 : 0.3 * Math.exp(-0.03 * (t - 20))]);

  function makePath(data) {
    return data.map(([t, lr], i) => `${i === 0 ? 'M' : 'L'} ${x(t).toFixed(1)} ${y(lr).toFixed(1)}`).join(' ');
  }

  return (
    <svg viewBox={`0 0 ${width} ${height}`} xmlns="http://www.w3.org/2000/svg" style={{width: '100%', maxWidth: '600px'}}>
      <defs>
        <linearGradient id="lrBg" x1="0%" y1="0%" x2="100%" y2="100%">
          <stop offset="0%" stopColor="#0d1117" />
          <stop offset="100%" stopColor="#161b22" />
        </linearGradient>
      </defs>
      <rect width={width} height={height} fill="url(#lrBg)" rx="12" />

      <text x={width/2} y="25" textAnchor="middle" fill="#e0e0e0" fontSize="14" fontWeight="bold" fontFamily="monospace">
        Learning Rate Schedules
      </text>

      {/* Axes */}
      <line x1={pad.left} y1={pad.top} x2={pad.left} y2={pad.top + plotH} stroke="#444" strokeWidth="1" />
      <line x1={pad.left} y1={pad.top + plotH} x2={pad.left + plotW} y2={pad.top + plotH} stroke="#444" strokeWidth="1" />

      {/* Grid lines */}
      {[0, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3].map(lr => (
        <g key={`gy${lr}`}>
          <line x1={pad.left} y1={y(lr)} x2={pad.left + plotW} y2={y(lr)} stroke="#222" strokeWidth="0.5" />
          <text x={pad.left - 8} y={y(lr) + 3} textAnchor="end" fill="#666" fontSize="9" fontFamily="monospace">{lr.toFixed(2)}</text>
        </g>
      ))}

      {/* Curves */}
      <path d={makePath(stepDecay)} fill="none" stroke="#ff6b6b" strokeWidth="2" />
      <path d={makePath(expDecay)} fill="none" stroke="#00d2ff" strokeWidth="2" />
      <path d={makePath(cosine)} fill="none" stroke="#ffd93d" strokeWidth="2" />
      <path d={makePath(warmup)} fill="none" stroke="#00ff88" strokeWidth="2" strokeDasharray="6,3" />

      {/* Legend */}
      {[
        [400, '#ff6b6b', 'Step Decay'],
        [400, '#00d2ff', 'Exponential'],
        [420, '#ffd93d', 'Cosine Annealing'],
        [420, '#00ff88', 'Warmup + Decay']
      ].map(([yOff, color, label], i) => (
        <g key={`leg${i}`}>
          <line x1={width - 150} y1={yOff + i * 18} x2={width - 130} y2={yOff + i * 18} stroke={color} strokeWidth="2" />
          <text x={width - 125} y={yOff + i * 18 + 3} fill="#aaa" fontSize="9" fontFamily="monospace">{label}</text>
        </g>
      ))}

      {/* Axis labels */}
      <text x={width/2} y={height - 8} textAnchor="middle" fill="#888" fontSize="11" fontFamily="monospace">Epoch</text>
      <text x="15" y={height/2} textAnchor="middle" fill="#888" fontSize="11" fontFamily="monospace" transform={`rotate(-90, 15, ${height/2})`}>Learning Rate</text>
    </svg>
  );
}

Practical Implementation

from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import HalvingRandomSearchCV

# Step decay with GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import learning_curve

# Manual learning rate schedule
import numpy as np

def step_decay(initial_lr=0.1, drop=0.5, epochs_drop=10):
    def schedule(epoch):
        return initial_lr * (drop ** np.floor(epoch / epochs_drop))
    return schedule

# Keras callback example
from tensorflow.keras.callbacks import LearningRateScheduler

lr_scheduler = LearningRateScheduler(step_decay(initial_lr=0.1, drop=0.5, epochs_drop=10))

# Cosine annealing with warm restarts
def cosine_annealing(epoch, T_max=10, eta_min=1e-5, eta_max=0.1):
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + np.cos(np.pi * epoch / T_max))

lr_callback = LearningRateScheduler(lambda epoch: cosine_annealing(epoch))

7. Early Stopping

Early stopping halts training when validation performance stops improving, preventing overfitting and saving compute.

Mathematical Formulation

Let Lt\mathcal{L}_t be the validation loss at epoch tt. Training stops when:

Ltâ‰Ĩmin⁥s≤tLs+Īĩfor p consecutive epochs\mathcal{L}_t \geq \min_{s \leq t} \mathcal{L}_s + \epsilon \quad \text{for } p \text{ consecutive epochs}

where pp is the patience parameter and Īĩ\epsilon is a tolerance threshold.

Implementation

import numpy as np
from sklearn.base import clone

class EarlyStopping:
    def __init__(self, patience=10, min_delta=1e-4, restore_best=True):
        self.patience = patience
        self.min_delta = min_delta
        self.restore_best = restore_best
        self.counter = 0
        self.best_loss = np.inf
        self.best_model = None

    def __call__(self, model, X_val, y_val):
        current_loss = -cross_val_score(model, X_val, y_val, cv=3).mean()

        if current_loss < self.best_loss - self.min_delta:
            self.best_loss = current_loss
            self.best_model = clone(model)
            self.counter = 0
            return False  # don't stop
        else:
            self.counter += 1
            if self.counter >= self.patience:
                if self.restore_best:
                    return self.best_model
                return True  # signal to stop
            return False

# Usage with iterative training
def train_with_early_stopping(model, X_train, y_train, X_val, y_val, max_epochs=100):
    stopper = EarlyStopping(patience=10, min_delta=1e-4)

    for epoch in range(max_epochs):
        model.fit(X_train, y_train)
        should_stop = stopper(model, X_val, y_val)

        if should_stop is True:
            print(f"Early stopping at epoch {epoch}")
            break
        elif should_stop is not False:
            best_model = should_stop

    return best_model if stopper.restore_best else model

Early Stopping for Gradient Boosting

# XGBoost / LightGBM built-in early stopping
import xgboost as xgb
import lightgbm as lgb

# XGBoost
dtrain = xgb.DMatrix(X_train, label=y_train)
dval = xgb.DMatrix(X_val, label=y_val)

params = {
    'max_depth': 6,
    'learning_rate': 0.1,
    'objective': 'binary:logistic',
    'eval_metric': 'auc'
}

model = xgb.train(
    params,
    dtrain,
    num_boost_round=1000,
    evals=[(dtrain, 'train'), (dval, 'val')],
    early_stopping_rounds=50,
    verbose_eval=100
)

# LightGBM
lgb_train = lgb.Dataset(X_train, y_train)
lgb_val = lgb.Dataset(X_val, y_val, reference=lgb_train)

model = lgb.train(
    {'num_leaves': 31, 'learning_rate': 0.05},
    lgb_train,
    num_boost_round=1000,
    valid_sets=[lgb_val],
    callbacks=[lgb.early_stopping(50), lgb.log_evaluation(100)]
)

Patience and Overfitting

function EarlyStoppingSVG() {
  const width = 600;
  const height = 320;
  const pad = {top: 40, right: 30, bottom: 50, left: 60};
  const plotW = width - pad.left - pad.right;
  const plotH = height - pad.top - pad.bottom;

  function xp(t) { return pad.left + (t / 50) * plotW; }
  function yp(loss) { return pad.top + plotH - ((loss - 0.2) / 0.8) * plotH; }

  // Simulated loss curves
  const trainLoss = Array.from({length: 50}, (_, t) => [t, 0.9 * Math.exp(-0.05 * t) + 0.05 + 0.02 * Math.random()]);
  const valLoss = Array.from({length: 50}, (_, t) => {
    const base = 0.9 * Math.exp(-0.04 * t) + 0.15;
    const overfit = t > 25 ? 0.003 * (t - 25) : 0;
    return [t, base + overfit + 0.02 * Math.random()];
  });

  function makePath(data) {
    return data.map(([t, loss], i) => `${i === 0 ? 'M' : 'L'} ${xp(t).toFixed(1)} ${yp(loss).toFixed(1)}`).join(' ');
  }

  const bestEpoch = 22;

  return (
    <svg viewBox={`0 0 ${width} ${height}`} xmlns="http://www.w3.org/2000/svg" style={{width: '100%', maxWidth: '600px'}}>
      <defs>
        <linearGradient id="esBg" x1="0%" y1="0%" x2="100%" y2="100%">
          <stop offset="0%" stopColor="#0d1117" />
          <stop offset="100%" stopColor="#161b22" />
        </linearGradient>
      </defs>
      <rect width={width} height={height} fill="url(#esBg)" rx="12" />

      <text x={width/2} y="25" textAnchor="middle" fill="#e0e0e0" fontSize="14" fontWeight="bold" fontFamily="monospace">
        Early Stopping: Patience = 10
      </text>

      <line x1={pad.left} y1={pad.top} x2={pad.left} y2={pad.top + plotH} stroke="#444" strokeWidth="1" />
      <line x1={pad.left} y1={pad.top + plotH} x2={pad.left + plotW} y2={pad.top + plotH} stroke="#444" strokeWidth="1" />

      <path d={makePath(trainLoss)} fill="none" stroke="#00d2ff" strokeWidth="2" />
      <path d={makePath(valLoss)} fill="none" stroke="#ff6b6b" strokeWidth="2" />

      {/* Best epoch marker */}
      <line x1={xp(bestEpoch)} y1={pad.top} x2={xp(bestEpoch)} y2={pad.top + plotH} stroke="#00ff88" strokeWidth="1" strokeDasharray="4,4" />
      <text x={xp(bestEpoch)} y={pad.top - 5} textAnchor="middle" fill="#00ff88" fontSize="9" fontFamily="monospace">best</text>

      {/* Stop marker */}
      <line x1={xp(32)} y1={pad.top} x2={xp(32)} y2={pad.top + plotH} stroke="#ffd93d" strokeWidth="1" strokeDasharray="4,4" />
      <text x={xp(32)} y={pad.top - 5} textAnchor="middle" fill="#ffd93d" fontSize="9" fontFamily="monospace">stop</text>

      {/* Patience region */}
      <rect x={xp(bestEpoch)} y={pad.top} width={xp(32) - xp(bestEpoch)} height={plotH} fill="#ffd93d" opacity="0.05" />

      {/* Legend */}
      <line x1={width - 150} y1={50} x2={width - 130} y2={50} stroke="#00d2ff" strokeWidth="2" />
      <text x={width - 125} y={53} fill="#aaa" fontSize="9" fontFamily="monospace">Train loss</text>
      <line x1={width - 150} y1={68} x2={width - 130} y2={68} stroke="#ff6b6b" strokeWidth="2" />
      <text x={width - 125} y={71} fill="#aaa" fontSize="9" fontFamily="monospace">Val loss</text>

      <text x={width/2} y={height - 8} textAnchor="middle" fill="#888" fontSize="11" fontFamily="monospace">Epoch</text>
    </svg>
  );
}

8. Implementation in Python — Complete Pipeline

End-to-End Tuning Pipeline

import optuna
import numpy as np
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import classification_report
import warnings
warnings.filterwarnings('ignore')

def create_objective(X_train, y_train, model_type='gbm'):
    """Create an Optuna objective function for the specified model type."""

    def objective(trial):
        if model_type == 'gbm':
            params = {
                'n_estimators': trial.suggest_int('n_estimators', 50, 500),
                'max_depth': trial.suggest_int('max_depth', 2, 15),
                'learning_rate': trial.suggest_float('learning_rate', 1e-3, 0.3, log=True),
                'min_samples_split': trial.suggest_int('min_samples_split', 2, 20),
                'min_samples_leaf': trial.suggest_int('min_samples_leaf', 1, 10),
                'subsample': trial.suggest_float('subsample', 0.6, 1.0),
                'max_features': trial.suggest_categorical('max_features', ['sqrt', 'log2', None])
            }
            model = GradientBoostingClassifier(**params, random_state=42)

        elif model_type == 'rf':
            params = {
                'n_estimators': trial.suggest_int('n_estimators', 50, 500),
                'max_depth': trial.suggest_int('max_depth', 2, 30),
                'min_samples_split': trial.suggest_int('min_samples_split', 2, 20),
                'min_samples_leaf': trial.suggest_int('min_samples_leaf', 1, 10),
                'max_features': trial.suggest_categorical('max_features', ['sqrt', 'log2', None]),
                'bootstrap': trial.suggest_categorical('bootstrap', [True, False])
            }
            model = RandomForestClassifier(**params, random_state=42)

        cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
        scores = cross_val_score(model, X_train, y_train, cv=cv, scoring='accuracy', n_jobs=-1)
        return scores.mean()

    return objective


# Create and run study
study = optuna.create_study(
    direction='maximize',
    study_name='complete_pipeline',
    storage='sqlite:///optuna_study.db',  # persist results
    load_if_exists=True
)

objective_fn = create_objective(X_train, y_train, model_type='gbm')
study.optimize(objective_fn, n_trials=100, show_progress_bar=True)

# Analyze results
print(f"Best trial score: {study.best_trial.value:.4f}")
print(f"Best trial params: {study.best_trial.params}")

# Visualization
import optuna.visualization as vis
vis.plot_optimization_history(study).show()
vis.plot_param_importances(study).show()

# Final evaluation
best_model = GradientBoostingClassifier(**study.best_params, random_state=42)
best_model.fit(X_train, y_train)
y_pred = best_model.predict(X_test)
print(classification_report(y_test, y_pred))

Multi-Objective Optimization

# Optimize accuracy AND training time simultaneously
def objective_multi(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 50, 500),
        'max_depth': trial.suggest_int('max_depth', 2, 20),
        'learning_rate': trial.suggest_float('learning_rate', 1e-3, 0.3, log=True)
    }

    model = GradientBoostingClassifier(**params, random_state=42)

    import time
    start = time.time()
    score = cross_val_score(model, X_train, y_train, cv=5).mean()
    elapsed = time.time() - start

    return score, elapsed

study_multi = optuna.create_study(
    directions=['maximize', 'minimize'],  # maximize accuracy, minimize time
    study_name='multi_objective'
)
study_multi.optimize(objective_multi, n_trials=100)

# Pareto front
best_trials = study_multi.best_trials
for trial in best_trials:
    print(f"Accuracy: {trial.values[0]:.4f}, Time: {trial.values[1]:.1f}s")

Saving and Loading Studies

# Save study to database
study = optuna.create_study(
    study_name='my_study',
    storage='sqlite:///optuna_studies.db',
    load_if_exists=True
)

# Resume later
study.optimize(objective, n_trials=50)  # continues from where it left off

# Export results to dataframe
df = study.trials_dataframe()
df.to_csv('study_results.csv', index=False)

# Load a completed study
loaded_study = optuna.load_study(
    study_name='my_study',
    storage='sqlite:///optuna_studies.db'
)

Key Takeaways

â„šī¸ Summary

Grid Search is simple but exponential — use only when search space is small (3-4 parameters).

Random Search is efficient for low-effective-dimension problems — always better than grid with equal budget.

Bayesian Optimization is sample-efficient — best when evaluations are expensive (deep learning, hyperparameter tuning of expensive models).

Optuna with TPE is the modern standard — handles conditional parameters, pruning, multi-objective, and scales to hundreds of trials.

Early stopping is cheap insurance — always use it for iterative learners like gradient boosting.

Learning rate schedules enable fast convergence with fine-tuned solutions — cosine annealing with warm restarts is often a strong default.

References

  1. Bergstra, J., and Bengio, Y. (2012). Random search for hyper-parameter optimization. JMLR, 13, 281-305.
  2. Snoek, J., Larochelle, H., and Adams, R. P. (2012). Practical Bayesian optimization of machine learning algorithms. NeurIPS.
  3. Akiba, T., et al. (2019). Optuna: A next-generation hyperparameter optimization framework. KDD.
  4. Li, L., et al. (2018). Hyperband: A novel bandit-based approach to hyperparameter optimization. JMLR, 18(185), 1-52.
  5. Smith, L. N. (2017). Cyclical learning rates for training neural networks. WACV.

Advertisement

Need Expert Data Science Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement