Hyperparameter Tuning: Grid Search, Random Search and Optuna
âšī¸ Module 8 â Tree-Based Models
This lesson covers systematic approaches to hyperparameter optimization, from exhaustive grid search to intelligent Bayesian methods and modern tools like Optuna.
1. Hyperparameters vs Parameters
Understanding the distinction between hyperparameters and learned parameters is fundamental to model optimization.
Parameters are learned from data during training â weights in a neural network, split thresholds in a decision tree. Hyperparameters control the learning process itself and must be set before training begins.
| Aspect | Parameters | Hyperparameters |
|---|---|---|
| Set when | During training | Before training |
| Learned from | Data | Manual / search |
| Examples | Weights, biases, split points | Learning rate, max depth, n_estimators |
| Optimized via | Gradient descent, EM | Grid search, random search, Bayesian |
Formal Definition
A machine learning model has parameters learned from training data via:
Hyperparameters govern the learning process:
The goal of hyperparameter tuning is to find:
where is the validation loss â we never use test data for this.
Why Tuning Matters
A well-tuned model can outperform a more complex model with default settings. The bias-variance tradeoff is directly controlled by hyperparameters:
- Too restrictive (e.g.,
max_depth=2): high bias, underfitting - Too flexible (e.g.,
max_depth=50): high variance, overfitting - Just right: optimal generalization
2. Grid Search â Exhaustive Exploration
Grid Search is the most straightforward approach: define a discrete set of values for each hyperparameter and evaluate every possible combination.
Algorithm
GridSearch(model, param_grid, X, y, cv):
best_score = -inf
best_params = null
for each combination in product(param_grid.values()):
score = CrossValScore(model, combination, X, y, cv)
if score > best_score:
best_score = score
best_params = combination
return best_params, best_score
Python Implementation
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier
param_grid = {
'n_estimators': [100, 200, 500],
'max_depth': [3, 5, 7, 10],
'learning_rate': [0.01, 0.05, 0.1, 0.2],
'min_samples_split': [2, 5, 10],
'subsample': [0.8, 0.9, 1.0]
}
grid_search = GridSearchCV(
estimator=GradientBoostingClassifier(random_state=42),
param_grid=param_grid,
cv=5,
scoring='accuracy',
n_jobs=-1,
verbose=2,
return_train_score=True
)
grid_search.fit(X_train, y_train)
print(f"Best params: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.4f}")
The Curse of Dimensionality in Grid Search
With hyperparameters and values per parameter, grid search evaluates combinations:
| Hyperparameters | Values each | Combinations | Time (10s each) |
|---|---|---|---|
| 2 | 5 | 25 | 4 min |
| 3 | 5 | 125 | 21 min |
| 4 | 5 | 625 | 1.7 hours |
| 5 | 5 | 3,125 | 8.7 hours |
| 6 | 5 | 15,625 | 4.3 days |
Grid Search vs Random Search
The visual below demonstrates how random search covers the search space more efficiently. Each axis represents one hyperparameter, and colored dots represent evaluations.
function GridSearchSVG() {
return (
<svg viewBox="0 0 600 400" xmlns="http://www.w3.org/2000/svg" style={{width: '100%', maxWidth: '600px'}}>
<defs>
<linearGradient id="gridBg" x1="0%" y1="0%" x2="100%" y2="100%">
<stop offset="0%" stopColor="#1a1a2e" />
<stop offset="100%" stopColor="#16213e" />
</linearGradient>
</defs>
<rect width="600" height="400" fill="url(#gridBg)" rx="12" />
<text x="300" y="35" textAnchor="middle" fill="#e0e0e0" fontSize="16" fontWeight="bold" fontFamily="monospace">
Grid Search vs Random Search
</text>
{/* Grid Search Side */}
<text x="150" y="65" textAnchor="middle" fill="#00d2ff" fontSize="13" fontWeight="bold" fontFamily="monospace">
Grid Search
</text>
{[0,1,2,3,4,5,6,7].map(i => (
<line key={`gh${i}`} x1="60" y1={80 + i * 35} x2="240" y2={80 + i * 35} stroke="#333" strokeWidth="0.5" />
))}
{[0,1,2,3,4,5,6,7].map(i => (
<line key={`gv${i}`} x1={60 + i * 22.5} y1="80" x2={60 + i * 22.5} y2="350" stroke="#333" strokeWidth="0.5" />
))}
{[0,1,2,3,4,5,6,7].map(row =>
[0,1,2,3,4,5,6,7].map(col => (
<circle
key={`gp${row}-${col}`}
cx={60 + col * 22.5 + 11}
cy={80 + row * 35 + 17}
r="3"
fill="#3a7bd5"
opacity="0.7"
/>
))
)}
<rect x="148" y="218" width="25" height="25" fill="none" stroke="#ff6b6b" strokeWidth="2" rx="3" />
<text x="160" y="235" textAnchor="middle" fill="#ff6b6b" fontSize="8" fontFamily="monospace">Best</text>
<text x="150" y="380" textAnchor="middle" fill="#888" fontSize="10" fontFamily="monospace">
8 x 8 = 64 evaluations
</text>
{/* Random Search Side */}
<text x="450" y="65" textAnchor="middle" fill="#00ff88" fontSize="13" fontWeight="bold" fontFamily="monospace">
Random Search
</text>
{[0,1,2,3,4,5,6,7].map(i => (
<line key={`rh${i}`} x1="360" y1={80 + i * 35} x2="540" y2={80 + i * 35} stroke="#333" strokeWidth="0.5" />
))}
{[0,1,2,3,4,5,6,7].map(i => (
<line key={`rv${i}`} x1={360 + i * 22.5} y1="80" x2={360 + i * 22.5} y2="350" stroke="#333" strokeWidth="0.5" />
))}
{[[372,135],[395,240],[420,100],[445,280],[470,160],[495,310],[515,200],[380,330],[410,180],[460,120],[500,260],[520,340],[375,200],[430,300],[485,90]].map(([cx, cy], i) => (
<circle key={`rp${i}`} cx={cx} cy={cy} r="3" fill="#00ff88" opacity="0.7" />
))}
<circle cx="495" cy="200" r="8" fill="none" stroke="#ff6b6b" strokeWidth="2" />
<text x="495" y="220" textAnchor="middle" fill="#ff6b6b" fontSize="8" fontFamily="monospace">Best</text>
<text x="450" y="380" textAnchor="middle" fill="#888" fontSize="10" fontFamily="monospace">
15 random evaluations
</text>
</svg>
);
}
Key Insight: Grid search wastes budget on unimportant hyperparameters. If max_depth matters more than subsample, grid search still evaluates all subsample values for every max_depth.
3. Random Search â Efficient Sampling
Random search samples hyperparameter combinations from specified distributions. Bergstra and Bengio (2012) showed random search is more efficient than grid search when only a few hyperparameters are truly important.
Why Random Search Wins
The key insight: if one hyperparameter (e.g., learning rate) dominates performance, grid search wastes evaluations per dimension. Random search explores the dominant dimension more effectively.
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform, randint
param_distributions = {
'n_estimators': randint(50, 500),
'max_depth': randint(2, 20),
'learning_rate': uniform(0.001, 0.3),
'min_samples_split': randint(2, 20),
'subsample': uniform(0.6, 0.4),
'min_samples_leaf': randint(1, 10)
}
random_search = RandomizedSearchCV(
estimator=GradientBoostingClassifier(random_state=42),
param_distributions=param_distributions,
n_iter=200,
cv=5,
scoring='accuracy',
n_jobs=-1,
random_state=42,
verbose=1
)
random_search.fit(X_train, y_train)
print(f"Best params: {random_search.best_params_}")
Log-Uniform Distributions for Learning Rate
Learning rates span orders of magnitude, so uniform sampling is suboptimal:
import numpy as np
from scipy.stats import loguniform
# Bad: uniform sampling concentrates points in [0.1, 0.3]
# Good: log-uniform samples proportionally across scales
learning_rates_log = loguniform(1e-3, 1e-1)
# Manual implementation
def log_uniform(low, high, size=1):
return np.exp(np.random.uniform(np.log(low), np.log(high), size))
Comparison: Grid vs Random
| Criterion | Grid Search | Random Search |
|---|---|---|
| Coverage | Uniform grid | Stratified sampling |
| Curse of dimensionality | Exponential | Linear in n_iter |
| Important dimensions | Wasted budget | Better coverage |
| Reproducibility | Deterministic | Depends on seed |
| Parallelization | Difficult | Trivial |
| Budget efficiency | Low | High |
4. Bayesian Optimization â Intelligent Search
Bayesian optimization builds a surrogate model of the objective function and uses an acquisition function to decide where to sample next.
The Loop
function BayesianOptSVG() {
return (
<svg viewBox="0 0 700 350" xmlns="http://www.w3.org/2000/svg" style={{width: '100%', maxWidth: '700px'}}>
<defs>
<linearGradient id="bayesBg" x1="0%" y1="0%" x2="100%" y2="100%">
<stop offset="0%" stopColor="#0d1117" />
<stop offset="100%" stopColor="#161b22" />
</linearGradient>
</defs>
<rect width="700" height="350" fill="url(#bayesBg)" rx="12" />
<text x="350" y="30" textAnchor="middle" fill="#e0e0e0" fontSize="15" fontWeight="bold" fontFamily="monospace">
Bayesian Optimization Loop
</text>
<path d="M 60 280 Q 120 200 180 220 Q 240 240 300 150 Q 360 60 420 180 Q 480 300 540 120 Q 600 60 640 100"
fill="none" stroke="#3a7bd5" strokeWidth="2" opacity="0.5" />
<path d="M 60 290 Q 120 230 180 240 Q 240 250 300 170 Q 360 90 420 200 Q 480 290 540 140 Q 600 80 640 120"
fill="none" stroke="#00d2ff" strokeWidth="2.5" />
<path d="M 60 260 Q 120 190 180 200 Q 240 210 300 120 Q 360 40 420 150 Q 480 260 540 100 Q 600 40 640 80
L 640 160 Q 600 120 540 180 Q 480 320 420 250 Q 360 140 300 220 Q 240 290 180 280 Q 120 270 60 320 Z"
fill="#00d2ff" opacity="0.1" />
{[[120,225],[200,235],[350,130],[500,210],[580,95]].map(([x, y], i) => (
<circle key={i} cx={x} cy={y} r="5" fill="#ff6b6b" stroke="#fff" strokeWidth="1.5" />
))}
<circle cx="420" cy="150" r="7" fill="none" stroke="#00ff88" strokeWidth="2" strokeDasharray="4,2" />
<text x="420" y="140" textAnchor="middle" fill="#00ff88" fontSize="9" fontFamily="monospace">next</text>
<text x="80" y="330" fill="#3a7bd5" fontSize="10" fontFamily="monospace">True objective</text>
<text x="220" y="330" fill="#00d2ff" fontSize="10" fontFamily="monospace">Surrogate (GP)</text>
<text x="370" y="330" fill="#ff6b6b" fontSize="10" fontFamily="monospace">Observations</text>
<text x="520" y="330" fill="#00ff88" fontSize="10" fontFamily="monospace">Next eval</text>
<path d="M 60 310 Q 150 315 240 305 Q 330 280 420 250 Q 510 310 600 300"
fill="none" stroke="#ffd93d" strokeWidth="1.5" strokeDasharray="5,3" />
<text x="500" y="260" fill="#ffd93d" fontSize="9" fontFamily="monospace">Acquisition</text>
</svg>
);
}
Gaussian Process Surrogate
The surrogate model is typically a Gaussian Process (GP):
where is the mean function and is the kernel (covariance function). Given observations , the posterior predictive is:
Acquisition Functions
The acquisition function balances exploration and exploitation:
Expected Improvement (EI):
where is the best observed value, and are the standard normal CDF and PDF.
Upper Confidence Bound (UCB):
where controls exploration. Higher = more exploration.
Thompson Sampling: Sample and optimize the sample.
Exploration vs Exploitation
The acquisition function encodes a fundamental tradeoff:
- Exploration: sample where uncertainty is high (learning the landscape)
- Exploitation: sample where predicted value is good (refining the optimum)
EI naturally balances both: high increases exploitation, high increases exploration (only when is near ).
from skopt import gp_minimize
from skopt.space import Real, Integer
from skopt.utils import use_named_args
space = [
Real(0.001, 0.3, name='learning_rate', prior='log-uniform'),
Integer(50, 500, name='n_estimators'),
Integer(2, 20, name='max_depth'),
Real(0.6, 1.0, name='subsample')
]
@use_named_args(space)
def objective(learning_rate, n_estimators, max_depth, subsample):
model = GradientBoostingClassifier(
learning_rate=learning_rate,
n_estimators=n_estimators,
max_depth=max_depth,
subsample=subsample,
random_state=42
)
return -cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy').mean()
result = gp_minimize(
objective,
space,
n_calls=50,
n_initial_points=10,
random_state=42,
acq_func='EI'
)
print(f"Best score: {-result.fun:.4f}")
print(f"Best params: {result.x}")
5. Optuna â State-of-the-Art Optimization
Optuna uses Tree-structured Parzen Estimator (TPE) and supports advanced features like pruning, conditional hyperparameters, and rich visualization.
TPE Algorithm
TPE models instead of â a key departure from Gaussian Process approaches:
- Split observations into "good" () and "bad" () using quantile
- Model good observations:
- Model bad observations:
- Maximize ratio:
TPE is non-parametric (uses kernel density estimation), scales better than GP-based methods, and naturally handles conditional hyperparameters.
Basic Usage
import optuna
from sklearn.model_selection import cross_val_score
def objective(trial):
params = {
'n_estimators': trial.suggest_int('n_estimators', 50, 500),
'max_depth': trial.suggest_int('max_depth', 2, 20),
'learning_rate': trial.suggest_float('learning_rate', 1e-3, 0.3, log=True),
'min_samples_split': trial.suggest_int('min_samples_split', 2, 20),
'min_samples_leaf': trial.suggest_int('min_samples_leaf', 1, 10),
'subsample': trial.suggest_float('subsample', 0.6, 1.0),
'max_features': trial.suggest_categorical('max_features', ['sqrt', 'log2', None])
}
model = GradientBoostingClassifier(**params, random_state=42)
score = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy').mean()
return score
study = optuna.create_study(direction='maximize', study_name='gbm_tuning')
study.optimize(objective, n_trials=100, show_progress_bar=True)
print(f"Best value: {study.best_value:.4f}")
print(f"Best params: {study.best_params}")
Conditional Hyperparameters
Optuna handles conditional spaces natively â parameters that only apply when another parameter takes a specific value:
def objective_advanced(trial):
classifier = trial.suggest_categorical('classifier', ['rf', 'gbm', 'xgb'])
if classifier == 'rf':
params = {
'n_estimators': trial.suggest_int('rf_n_estimators', 50, 500),
'max_depth': trial.suggest_int('rf_max_depth', 2, 32),
'min_samples_split': trial.suggest_int('rf_min_samples_split', 2, 20)
}
model = RandomForestClassifier(**params, random_state=42)
elif classifier == 'gbm':
params = {
'n_estimators': trial.suggest_int('gbm_n_estimators', 50, 500),
'max_depth': trial.suggest_int('gbm_max_depth', 2, 15),
'learning_rate': trial.suggest_float('gbm_lr', 1e-3, 0.3, log=True)
}
model = GradientBoostingClassifier(**params, random_state=42)
else: # xgb
params = {
'n_estimators': trial.suggest_int('xgb_n_estimators', 50, 500),
'max_depth': trial.suggest_int('xgb_max_depth', 2, 15),
'learning_rate': trial.suggest_float('xgb_lr', 1e-3, 0.3, log=True)
}
model = XGBClassifier(**params, use_label_encoder=False, eval_metric='logloss')
score = cross_val_score(model, X_train, y_train, cv=5).mean()
return score
Pruning â Early Termination of Bad Trials
Pruning terminates unpromising trials early, saving computational budget:
import optuna
from optuna.pruners import MedianPruner, SuccessiveHalvingPruner
def objective_with_pruning(trial):
params = {
'n_estimators': trial.suggest_int('n_estimators', 50, 500),
'max_depth': trial.suggest_int('max_depth', 2, 20),
'learning_rate': trial.suggest_float('learning_rate', 1e-3, 0.3, log=True)
}
scores = []
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(kf.split(X_train, y_train)):
X_tr, X_val = X_train.iloc[train_idx], X_train.iloc[val_idx]
y_tr, y_val = y_train.iloc[train_idx], y_train.iloc[val_idx]
model = GradientBoostingClassifier(**params, random_state=42)
model.fit(X_tr, y_tr)
score = model.score(X_val, y_val)
trial.report(score, step=fold)
if trial.should_prune():
raise optuna.TrialPruned()
scores.append(score)
return np.mean(scores)
pruner = MedianPruner(n_startup_trials=10, n_warmup_steps=3)
study = optuna.create_study(
direction='maximize',
pruner=pruner,
study_name='pruning_demo'
)
Pruning Strategies
| Pruner | Mechanism | Best For |
|---|---|---|
| MedianPruner | Prune if below median of previous trials | General purpose |
| SuccessiveHalvingPruner | Eliminate bottom fraction each round | Large search spaces |
| HyperbandPruner | Budget allocation with early stopping | Resource-constrained |
| PatientPruner | Wait N trials before pruning | Noisy objectives |
Optuna Visualization
import optuna.visualization as vis
# Optimization history â shows improvement over trials
fig1 = vis.plot_optimization_history(study)
fig1.show()
# Parameter importances â which hyperparameters matter most
fig2 = vis.plot_param_importances(study)
fig2.show()
# Slice plot â objective value for each parameter
fig3 = vis.plot_slice(study)
fig3.show()
# Contour plot â interaction between two parameters
fig4 = vis.plot_contour(study, params=['learning_rate', 'max_depth'])
fig4.show()
# Parallel coordinate â high-dimensional view
fig5 = vis.plot_parallel_coordinate(study)
fig5.show()
function OptunaProcessSVG() {
return (
<svg viewBox="0 0 700 400" xmlns="http://www.w3.org/2000/svg" style={{width: '100%', maxWidth: '700px'}}>
<defs>
<linearGradient id="optBg" x1="0%" y1="0%" x2="100%" y2="100%">
<stop offset="0%" stopColor="#1a1a2e" />
<stop offset="100%" stopColor="#0f3460" />
</linearGradient>
</defs>
<rect width="700" height="400" fill="url(#optBg)" rx="12" />
<text x="350" y="30" textAnchor="middle" fill="#e0e0e0" fontSize="15" fontWeight="bold" fontFamily="monospace">
Optuna Optimization Process
</text>
{/* Trial history */}
<text x="120" y="60" textAnchor="middle" fill="#00d2ff" fontSize="11" fontWeight="bold" fontFamily="monospace">Trial History</text>
{[
{trial: 1, score: 0.82, pruned: false},
{trial: 2, score: 0.78, pruned: false},
{trial: 3, score: 0.75, pruned: false},
{trial: 4, score: 0.65, pruned: true},
{trial: 5, score: 0.85, pruned: false},
{trial: 6, score: 0.83, pruned: false},
{trial: 7, score: 0.58, pruned: true},
{trial: 8, score: 0.88, pruned: false},
{trial: 9, score: 0.87, pruned: false},
{trial: 10, score: 0.91, pruned: false},
{trial: 11, score: 0.70, pruned: true},
{trial: 12, score: 0.89, pruned: false}
].map(({trial, score, pruned}, i) => {
const barWidth = (score - 0.5) * 200;
const color = pruned ? '#ff6b6b' : score >= 0.9 ? '#00ff88' : score >= 0.85 ? '#ffd93d' : '#3a7bd5';
return (
<g key={i}>
<rect x="30" y={75 + i * 25} width={barWidth} height="18" fill={color} opacity={pruned ? 0.4 : 0.8} rx="3" />
{pruned && <text x={35 + barWidth} y={88 + i * 25} fill="#ff6b6b" fontSize="8" fontFamily="monospace">pruned</text>}
<text x="28" y={88 + i * 25} textAnchor="end" fill="#888" fontSize="9" fontFamily="monospace">T{trial}</text>
</g>
);
})}
{/* Best value tracker */}
<text x="450" y="60" textAnchor="middle" fill="#00ff88" fontSize="11" fontWeight="bold" fontFamily="monospace">Best Value</text>
<path d="M 350 200 L 380 200 L 380 170 L 420 170 L 420 120 L 500 120 L 500 90 L 600 90"
fill="none" stroke="#00ff88" strokeWidth="2" />
{[
[350, 200, '0.82'], [420, 170, '0.85'], [500, 120, '0.88'], [600, 90, '0.91']
].map(([x, y, label], i) => (
<g key={`best${i}`}>
<circle cx={x} cy={y} r="4" fill="#00ff88" />
<text x={x} y={y - 10} textAnchor="middle" fill="#00ff88" fontSize="9" fontFamily="monospace">{label}</text>
</g>
))}
<text x="475" y="390" textAnchor="middle" fill="#888" fontSize="10" fontFamily="monospace">Trials</text>
</svg>
);
}
6. Learning Rate Schedules
Learning rate scheduling reduces the learning rate during training, allowing fast convergence early and fine-grained updates late.
Common Schedules
where is the initial learning rate and is the current step.
| Schedule | Formula | Characteristics |
|---|---|---|
| Step Decay | Reduce by factor every steps | |
| Exponential Decay | Smooth continuous decay | |
| Cosine Annealing | Periodic warm restarts | |
| Linear Warmup | for | Avoid early instability |
| Polynomial Decay | Flexible power control |
function LearningRateSVG() {
const width = 600;
const height = 350;
const pad = {top: 40, right: 30, bottom: 50, left: 60};
const plotW = width - pad.left - pad.right;
const plotH = height - pad.top - pad.bottom;
function x(t) { return pad.left + (t / 100) * plotW; }
function y(lr) { return pad.top + plotH - (lr / 0.3) * plotH; }
// Schedule curves
const stepDecay = Array.from({length: 100}, (_, t) => [t, 0.3 * Math.pow(0.5, Math.floor(t / 25))]);
const expDecay = Array.from({length: 100}, (_, t) => [t, 0.3 * Math.exp(-0.03 * t)]);
const cosine = Array.from({length: 100}, (_, t) => [t, 0.001 + 0.5 * (0.3 - 0.001) * (1 + Math.cos(Math.PI * t / 100))]);
const warmup = Array.from({length: 100}, (_, t) => [t, t < 20 ? 0.3 * t / 20 : 0.3 * Math.exp(-0.03 * (t - 20))]);
function makePath(data) {
return data.map(([t, lr], i) => `${i === 0 ? 'M' : 'L'} ${x(t).toFixed(1)} ${y(lr).toFixed(1)}`).join(' ');
}
return (
<svg viewBox={`0 0 ${width} ${height}`} xmlns="http://www.w3.org/2000/svg" style={{width: '100%', maxWidth: '600px'}}>
<defs>
<linearGradient id="lrBg" x1="0%" y1="0%" x2="100%" y2="100%">
<stop offset="0%" stopColor="#0d1117" />
<stop offset="100%" stopColor="#161b22" />
</linearGradient>
</defs>
<rect width={width} height={height} fill="url(#lrBg)" rx="12" />
<text x={width/2} y="25" textAnchor="middle" fill="#e0e0e0" fontSize="14" fontWeight="bold" fontFamily="monospace">
Learning Rate Schedules
</text>
{/* Axes */}
<line x1={pad.left} y1={pad.top} x2={pad.left} y2={pad.top + plotH} stroke="#444" strokeWidth="1" />
<line x1={pad.left} y1={pad.top + plotH} x2={pad.left + plotW} y2={pad.top + plotH} stroke="#444" strokeWidth="1" />
{/* Grid lines */}
{[0, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3].map(lr => (
<g key={`gy${lr}`}>
<line x1={pad.left} y1={y(lr)} x2={pad.left + plotW} y2={y(lr)} stroke="#222" strokeWidth="0.5" />
<text x={pad.left - 8} y={y(lr) + 3} textAnchor="end" fill="#666" fontSize="9" fontFamily="monospace">{lr.toFixed(2)}</text>
</g>
))}
{/* Curves */}
<path d={makePath(stepDecay)} fill="none" stroke="#ff6b6b" strokeWidth="2" />
<path d={makePath(expDecay)} fill="none" stroke="#00d2ff" strokeWidth="2" />
<path d={makePath(cosine)} fill="none" stroke="#ffd93d" strokeWidth="2" />
<path d={makePath(warmup)} fill="none" stroke="#00ff88" strokeWidth="2" strokeDasharray="6,3" />
{/* Legend */}
{[
[400, '#ff6b6b', 'Step Decay'],
[400, '#00d2ff', 'Exponential'],
[420, '#ffd93d', 'Cosine Annealing'],
[420, '#00ff88', 'Warmup + Decay']
].map(([yOff, color, label], i) => (
<g key={`leg${i}`}>
<line x1={width - 150} y1={yOff + i * 18} x2={width - 130} y2={yOff + i * 18} stroke={color} strokeWidth="2" />
<text x={width - 125} y={yOff + i * 18 + 3} fill="#aaa" fontSize="9" fontFamily="monospace">{label}</text>
</g>
))}
{/* Axis labels */}
<text x={width/2} y={height - 8} textAnchor="middle" fill="#888" fontSize="11" fontFamily="monospace">Epoch</text>
<text x="15" y={height/2} textAnchor="middle" fill="#888" fontSize="11" fontFamily="monospace" transform={`rotate(-90, 15, ${height/2})`}>Learning Rate</text>
</svg>
);
}
Practical Implementation
from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import HalvingRandomSearchCV
# Step decay with GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import learning_curve
# Manual learning rate schedule
import numpy as np
def step_decay(initial_lr=0.1, drop=0.5, epochs_drop=10):
def schedule(epoch):
return initial_lr * (drop ** np.floor(epoch / epochs_drop))
return schedule
# Keras callback example
from tensorflow.keras.callbacks import LearningRateScheduler
lr_scheduler = LearningRateScheduler(step_decay(initial_lr=0.1, drop=0.5, epochs_drop=10))
# Cosine annealing with warm restarts
def cosine_annealing(epoch, T_max=10, eta_min=1e-5, eta_max=0.1):
return eta_min + 0.5 * (eta_max - eta_min) * (1 + np.cos(np.pi * epoch / T_max))
lr_callback = LearningRateScheduler(lambda epoch: cosine_annealing(epoch))
7. Early Stopping
Early stopping halts training when validation performance stops improving, preventing overfitting and saving compute.
Mathematical Formulation
Let be the validation loss at epoch . Training stops when:
where is the patience parameter and is a tolerance threshold.
Implementation
import numpy as np
from sklearn.base import clone
class EarlyStopping:
def __init__(self, patience=10, min_delta=1e-4, restore_best=True):
self.patience = patience
self.min_delta = min_delta
self.restore_best = restore_best
self.counter = 0
self.best_loss = np.inf
self.best_model = None
def __call__(self, model, X_val, y_val):
current_loss = -cross_val_score(model, X_val, y_val, cv=3).mean()
if current_loss < self.best_loss - self.min_delta:
self.best_loss = current_loss
self.best_model = clone(model)
self.counter = 0
return False # don't stop
else:
self.counter += 1
if self.counter >= self.patience:
if self.restore_best:
return self.best_model
return True # signal to stop
return False
# Usage with iterative training
def train_with_early_stopping(model, X_train, y_train, X_val, y_val, max_epochs=100):
stopper = EarlyStopping(patience=10, min_delta=1e-4)
for epoch in range(max_epochs):
model.fit(X_train, y_train)
should_stop = stopper(model, X_val, y_val)
if should_stop is True:
print(f"Early stopping at epoch {epoch}")
break
elif should_stop is not False:
best_model = should_stop
return best_model if stopper.restore_best else model
Early Stopping for Gradient Boosting
# XGBoost / LightGBM built-in early stopping
import xgboost as xgb
import lightgbm as lgb
# XGBoost
dtrain = xgb.DMatrix(X_train, label=y_train)
dval = xgb.DMatrix(X_val, label=y_val)
params = {
'max_depth': 6,
'learning_rate': 0.1,
'objective': 'binary:logistic',
'eval_metric': 'auc'
}
model = xgb.train(
params,
dtrain,
num_boost_round=1000,
evals=[(dtrain, 'train'), (dval, 'val')],
early_stopping_rounds=50,
verbose_eval=100
)
# LightGBM
lgb_train = lgb.Dataset(X_train, y_train)
lgb_val = lgb.Dataset(X_val, y_val, reference=lgb_train)
model = lgb.train(
{'num_leaves': 31, 'learning_rate': 0.05},
lgb_train,
num_boost_round=1000,
valid_sets=[lgb_val],
callbacks=[lgb.early_stopping(50), lgb.log_evaluation(100)]
)
Patience and Overfitting
function EarlyStoppingSVG() {
const width = 600;
const height = 320;
const pad = {top: 40, right: 30, bottom: 50, left: 60};
const plotW = width - pad.left - pad.right;
const plotH = height - pad.top - pad.bottom;
function xp(t) { return pad.left + (t / 50) * plotW; }
function yp(loss) { return pad.top + plotH - ((loss - 0.2) / 0.8) * plotH; }
// Simulated loss curves
const trainLoss = Array.from({length: 50}, (_, t) => [t, 0.9 * Math.exp(-0.05 * t) + 0.05 + 0.02 * Math.random()]);
const valLoss = Array.from({length: 50}, (_, t) => {
const base = 0.9 * Math.exp(-0.04 * t) + 0.15;
const overfit = t > 25 ? 0.003 * (t - 25) : 0;
return [t, base + overfit + 0.02 * Math.random()];
});
function makePath(data) {
return data.map(([t, loss], i) => `${i === 0 ? 'M' : 'L'} ${xp(t).toFixed(1)} ${yp(loss).toFixed(1)}`).join(' ');
}
const bestEpoch = 22;
return (
<svg viewBox={`0 0 ${width} ${height}`} xmlns="http://www.w3.org/2000/svg" style={{width: '100%', maxWidth: '600px'}}>
<defs>
<linearGradient id="esBg" x1="0%" y1="0%" x2="100%" y2="100%">
<stop offset="0%" stopColor="#0d1117" />
<stop offset="100%" stopColor="#161b22" />
</linearGradient>
</defs>
<rect width={width} height={height} fill="url(#esBg)" rx="12" />
<text x={width/2} y="25" textAnchor="middle" fill="#e0e0e0" fontSize="14" fontWeight="bold" fontFamily="monospace">
Early Stopping: Patience = 10
</text>
<line x1={pad.left} y1={pad.top} x2={pad.left} y2={pad.top + plotH} stroke="#444" strokeWidth="1" />
<line x1={pad.left} y1={pad.top + plotH} x2={pad.left + plotW} y2={pad.top + plotH} stroke="#444" strokeWidth="1" />
<path d={makePath(trainLoss)} fill="none" stroke="#00d2ff" strokeWidth="2" />
<path d={makePath(valLoss)} fill="none" stroke="#ff6b6b" strokeWidth="2" />
{/* Best epoch marker */}
<line x1={xp(bestEpoch)} y1={pad.top} x2={xp(bestEpoch)} y2={pad.top + plotH} stroke="#00ff88" strokeWidth="1" strokeDasharray="4,4" />
<text x={xp(bestEpoch)} y={pad.top - 5} textAnchor="middle" fill="#00ff88" fontSize="9" fontFamily="monospace">best</text>
{/* Stop marker */}
<line x1={xp(32)} y1={pad.top} x2={xp(32)} y2={pad.top + plotH} stroke="#ffd93d" strokeWidth="1" strokeDasharray="4,4" />
<text x={xp(32)} y={pad.top - 5} textAnchor="middle" fill="#ffd93d" fontSize="9" fontFamily="monospace">stop</text>
{/* Patience region */}
<rect x={xp(bestEpoch)} y={pad.top} width={xp(32) - xp(bestEpoch)} height={plotH} fill="#ffd93d" opacity="0.05" />
{/* Legend */}
<line x1={width - 150} y1={50} x2={width - 130} y2={50} stroke="#00d2ff" strokeWidth="2" />
<text x={width - 125} y={53} fill="#aaa" fontSize="9" fontFamily="monospace">Train loss</text>
<line x1={width - 150} y1={68} x2={width - 130} y2={68} stroke="#ff6b6b" strokeWidth="2" />
<text x={width - 125} y={71} fill="#aaa" fontSize="9" fontFamily="monospace">Val loss</text>
<text x={width/2} y={height - 8} textAnchor="middle" fill="#888" fontSize="11" fontFamily="monospace">Epoch</text>
</svg>
);
}
8. Implementation in Python â Complete Pipeline
End-to-End Tuning Pipeline
import optuna
import numpy as np
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import classification_report
import warnings
warnings.filterwarnings('ignore')
def create_objective(X_train, y_train, model_type='gbm'):
"""Create an Optuna objective function for the specified model type."""
def objective(trial):
if model_type == 'gbm':
params = {
'n_estimators': trial.suggest_int('n_estimators', 50, 500),
'max_depth': trial.suggest_int('max_depth', 2, 15),
'learning_rate': trial.suggest_float('learning_rate', 1e-3, 0.3, log=True),
'min_samples_split': trial.suggest_int('min_samples_split', 2, 20),
'min_samples_leaf': trial.suggest_int('min_samples_leaf', 1, 10),
'subsample': trial.suggest_float('subsample', 0.6, 1.0),
'max_features': trial.suggest_categorical('max_features', ['sqrt', 'log2', None])
}
model = GradientBoostingClassifier(**params, random_state=42)
elif model_type == 'rf':
params = {
'n_estimators': trial.suggest_int('n_estimators', 50, 500),
'max_depth': trial.suggest_int('max_depth', 2, 30),
'min_samples_split': trial.suggest_int('min_samples_split', 2, 20),
'min_samples_leaf': trial.suggest_int('min_samples_leaf', 1, 10),
'max_features': trial.suggest_categorical('max_features', ['sqrt', 'log2', None]),
'bootstrap': trial.suggest_categorical('bootstrap', [True, False])
}
model = RandomForestClassifier(**params, random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X_train, y_train, cv=cv, scoring='accuracy', n_jobs=-1)
return scores.mean()
return objective
# Create and run study
study = optuna.create_study(
direction='maximize',
study_name='complete_pipeline',
storage='sqlite:///optuna_study.db', # persist results
load_if_exists=True
)
objective_fn = create_objective(X_train, y_train, model_type='gbm')
study.optimize(objective_fn, n_trials=100, show_progress_bar=True)
# Analyze results
print(f"Best trial score: {study.best_trial.value:.4f}")
print(f"Best trial params: {study.best_trial.params}")
# Visualization
import optuna.visualization as vis
vis.plot_optimization_history(study).show()
vis.plot_param_importances(study).show()
# Final evaluation
best_model = GradientBoostingClassifier(**study.best_params, random_state=42)
best_model.fit(X_train, y_train)
y_pred = best_model.predict(X_test)
print(classification_report(y_test, y_pred))
Multi-Objective Optimization
# Optimize accuracy AND training time simultaneously
def objective_multi(trial):
params = {
'n_estimators': trial.suggest_int('n_estimators', 50, 500),
'max_depth': trial.suggest_int('max_depth', 2, 20),
'learning_rate': trial.suggest_float('learning_rate', 1e-3, 0.3, log=True)
}
model = GradientBoostingClassifier(**params, random_state=42)
import time
start = time.time()
score = cross_val_score(model, X_train, y_train, cv=5).mean()
elapsed = time.time() - start
return score, elapsed
study_multi = optuna.create_study(
directions=['maximize', 'minimize'], # maximize accuracy, minimize time
study_name='multi_objective'
)
study_multi.optimize(objective_multi, n_trials=100)
# Pareto front
best_trials = study_multi.best_trials
for trial in best_trials:
print(f"Accuracy: {trial.values[0]:.4f}, Time: {trial.values[1]:.1f}s")
Saving and Loading Studies
# Save study to database
study = optuna.create_study(
study_name='my_study',
storage='sqlite:///optuna_studies.db',
load_if_exists=True
)
# Resume later
study.optimize(objective, n_trials=50) # continues from where it left off
# Export results to dataframe
df = study.trials_dataframe()
df.to_csv('study_results.csv', index=False)
# Load a completed study
loaded_study = optuna.load_study(
study_name='my_study',
storage='sqlite:///optuna_studies.db'
)
Key Takeaways
âšī¸ Summary
Grid Search is simple but exponential â use only when search space is small (3-4 parameters).
Random Search is efficient for low-effective-dimension problems â always better than grid with equal budget.
Bayesian Optimization is sample-efficient â best when evaluations are expensive (deep learning, hyperparameter tuning of expensive models).
Optuna with TPE is the modern standard â handles conditional parameters, pruning, multi-objective, and scales to hundreds of trials.
Early stopping is cheap insurance â always use it for iterative learners like gradient boosting.
Learning rate schedules enable fast convergence with fine-tuned solutions â cosine annealing with warm restarts is often a strong default.
References
- Bergstra, J., and Bengio, Y. (2012). Random search for hyper-parameter optimization. JMLR, 13, 281-305.
- Snoek, J., Larochelle, H., and Adams, R. P. (2012). Practical Bayesian optimization of machine learning algorithms. NeurIPS.
- Akiba, T., et al. (2019). Optuna: A next-generation hyperparameter optimization framework. KDD.
- Li, L., et al. (2018). Hyperband: A novel bandit-based approach to hyperparameter optimization. JMLR, 18(185), 1-52.
- Smith, L. N. (2017). Cyclical learning rates for training neural networks. WACV.