Hyperparameter Tuning: Grid Search, Random Search and Optuna

ℹ️Module 8 — Tree-Based Models

This lesson covers systematic approaches to hyperparameter optimization, from exhaustive grid search to intelligent Bayesian methods and modern tools like Optuna.

1. Hyperparameters vs Parameters

Understanding the distinction between hyperparameters and learned parameters is fundamental to model optimization.

Parameters are learned from data during training — weights in a neural network, split thresholds in a decision tree. Hyperparameters control the learning process itself and must be set before training begins.

Aspect	Parameters	Hyperparameters
Set when	During training	Before training
Learned from	Data	Manual / search
Examples	Weights, biases, split points	Learning rate, max depth, n_estimators
Optimized via	Gradient descent, EM	Grid search, random search, Bayesian

Formal Definition

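A machine learning model $f(x; \theta, \lambda)$ has parameters $\theta$ learned from training data via:

$$\theta^*(\lambda) = \arg\min_{\theta} \mathcal{L}_{\text{train}}(\theta; \lambda)$$

Hyperparameters $\lambda \in \Lambda$ govern the learning process and are held fixed while $\theta$ is optimized.

The goal of hyperparameter tuning is to find:

$$\lambda^* = \arg\min_{\lambda \in \Lambda} \mathcal{L}_{\text{val}}(\theta^*(\lambda); \lambda)$$

where $\mathcal{L}_{\text{val}}$ is the validation loss. Test data is never used for this, so the final test score remains an honest estimate of generalization.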

Why Tuning Matters

A well-tuned model can outperform a more complex model with default settings. The bias-variance tradeoff is directly controlled by hyperparameters:

Too restrictive (e.g., max_depth=2): high bias, underfitting
Too flexible (e.g., max_depth=50): high variance, overfitting
Just right: optimal generalization

2. Grid Search — Exhaustive Exploration

Grid Search is the most straightforward approach: define a discrete set of values for each hyperparameter and evaluate every possible combination.

Algorithm

Figure: GridSearch algorithm flowchart.

Input: model, param_grid, X, y, cv
Initialize: best_score = -∞, best_params = null
Generate grid: product(param_grid)
For each combination: score = CrossValScore(model, params, cv); if score > best_score, update best_score = score and best_params = params
Return: best_params, best_score

Figure: search space visualization (max_depth vs n_estimators). With d params and k values each, the grid has k^d combinations; 5 params × 5 values = 3,125 evaluations.
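
The loop in the flowchart can be sketched in a few lines of plain Python. This is a minimal illustration rather than scikit-learn's implementation: grid_search and toy_score are hypothetical names, and toy_score stands in for a cross-validated score.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Evaluate every combination in param_grid and return the best one."""
    names = list(param_grid)
    best_score, best_params = float("-inf"), None
    n_evaluated = 0
    # product() yields the Cartesian product of all value lists
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)  # stand-in for a cross-validated score
        n_evaluated += 1
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score, n_evaluated

# Hypothetical score surface that peaks at max_depth=5, n_estimators=200
def toy_score(max_depth, n_estimators):
    return -((max_depth - 5) ** 2) - ((n_estimators - 200) / 100) ** 2

grid = {"max_depth": [3, 5, 7], "n_estimators": [100, 200, 500]}
best_params, best_score, n_evaluated = grid_search(grid, toy_score)
print(best_params, best_score, n_evaluated)
```

Every combination is scored exactly once, so the number of evaluations is the product of the list lengths (here 3 × 3 = 9).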

Python Implementation

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier

param_grid = {
    'n_estimators': [100, 200, 500],
    'max_depth': [3, 5, 7, 10],
    'learning_rate': [0.01, 0.05, 0.1, 0.2],
    'min_samples_split': [2, 5, 10],
    'subsample': [0.8, 0.9, 1.0]
}

grid_search = GridSearchCV(
    estimator=GradientBoostingClassifier(random_state=42),
    param_grid=param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=2,
    return_train_score=True
)

grid_search.fit(X_train, y_train)
print(f"Best params: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.4f}")

The Curse of Dimensionality in Grid Search

With $d$ hyperparameters and $k$ values per parameter, grid search evaluates $k^d$ combinations:

Hyperparameters	Values each	Combinations	Time (10s each)
2	5	25	4 min
3	5	125	21 min
4	5	625	1.7 hours
5	5	3,125	8.7 hours
6	5	15,625	1.8 days
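
The growth in the table can be recomputed directly. The sketch below assumes 10 seconds per evaluation, as in the table header; grid_cost and humanize are illustrative helper names.

```python
def grid_cost(n_params, n_values, seconds_per_eval=10):
    """Return (combinations, total seconds) for a full grid."""
    combos = n_values ** n_params
    return combos, combos * seconds_per_eval

def humanize(seconds):
    """Format a duration using the largest convenient unit."""
    if seconds < 3600:
        return f"{seconds / 60:.0f} min"
    if seconds < 86400:
        return f"{seconds / 3600:.1f} hours"
    return f"{seconds / 86400:.1f} days"

for d in range(2, 7):
    combos, seconds = grid_cost(d, 5)
    print(d, combos, humanize(seconds))
```

Each extra hyperparameter multiplies the cost by the number of values per parameter, which is why the jump from five to six parameters moves the runtime from hours to days.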

Grid Search vs Random Search

The visual below demonstrates how random search covers the search space more efficiently. Each axis represents one hyperparameter, and colored dots represent evaluations.

Key Insight: Grid search wastes budget on unimportant hyperparameters. If max_depth matters more than subsample, grid search still evaluates all subsample values for every max_depth.

3. Random Search — Efficient Sampling

Random search samples hyperparameter combinations from specified distributions. Bergstra and Bengio (2012) showed random search is more efficient than grid search when only a few hyperparameters are truly important.

Why Random Search Wins

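The key insight: if one hyperparameter (e.g., learning rate) dominates performance, a grid with $k$ values per dimension tries only $k$ distinct values of it, no matter how many of the $k^d$ combinations are evaluated. Random search with $n$ trials tries $n$ distinct values of every hyperparameter, so it explores the dominant dimension more effectively.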
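
A small experiment makes the difference concrete. The sketch below spends the same budget of nine evaluations on a 3 × 3 grid and on random sampling, then counts how many different values of the first (assumed important) hyperparameter each strategy tried. The unit-interval scaling and the helper name distinct_values are illustrative assumptions.

```python
import random
from itertools import product

budget = 9  # same number of evaluations for both strategies

# 3 x 3 grid over two hyperparameters scaled to [0, 1]
axis = [0.0, 0.5, 1.0]
grid_points = list(product(axis, axis))

# Random search: every trial draws a fresh value for each hyperparameter
rng = random.Random(0)
random_points = [(rng.random(), rng.random()) for _ in range(budget)]

def distinct_values(points, dim):
    """Count how many different values were tried along one dimension."""
    return len({p[dim] for p in points})

print("grid:  ", distinct_values(grid_points, 0), "distinct values")
print("random:", distinct_values(random_points, 0), "distinct values")
```

The grid repeats each value of the important hyperparameter three times, while random search never repeats one.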

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform, randint

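# scipy's uniform(loc, scale) samples from [loc, loc + scale];
# randint(low, high) samples integers from low up to high - 1
param_distributions = {
    'n_estimators': randint(50, 500),
    'max_depth': randint(2, 20),
    'learning_rate': uniform(0.001, 0.299),  # range [0.001, 0.3]
    'min_samples_split': randint(2, 20),
    'subsample': uniform(0.6, 0.4),          # range [0.6, 1.0]
    'min_samples_leaf': randint(1, 10)
}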

random_search = RandomizedSearchCV(
    estimator=GradientBoostingClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=200,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    random_state=42,
    verbose=1
)

random_search.fit(X_train, y_train)
print(f"Best params: {random_search.best_params_}")

Log-Uniform Distributions for Learning Rate

Learning rates span orders of magnitude, so uniform sampling is suboptimal:

import numpy as np
from scipy.stats import loguniform

# Bad: uniform sampling concentrates points in [0.1, 0.3]
# Good: log-uniform samples proportionally across scales
learning_rates_log = loguniform(1e-3, 1e-1)

# Manual implementation
def log_uniform(low, high, size=1):
    return np.exp(np.random.uniform(np.log(low), np.log(high), size))
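
To see how differently the two distributions spend a budget, the following sketch draws from both over the range [0.001, 0.3] and measures the share of samples below 0.01. It uses only the standard library; the range and the threshold are illustrative choices.

```python
import math
import random

rng = random.Random(0)
low, high, n = 1e-3, 0.3, 10_000

uniform_draws = [rng.uniform(low, high) for _ in range(n)]
# Log-uniform: sample uniformly in log space, then exponentiate
log_draws = [math.exp(rng.uniform(math.log(low), math.log(high))) for _ in range(n)]

def share_below(draws, threshold):
    """Fraction of draws that fall below the threshold."""
    return sum(d < threshold for d in draws) / len(draws)

# Share of the budget spent on learning rates below 0.01
print(f"uniform:     {share_below(uniform_draws, 0.01):.3f}")
print(f"log-uniform: {share_below(log_draws, 0.01):.3f}")
```

Uniform sampling puts only a few percent of trials in the decade from 0.001 to 0.01, while log-uniform sampling gives that decade a share proportional to its width on a log scale.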

Comparison: Grid vs Random

Criterion	Grid Search	Random Search
Coverage	Regular grid	Independent random samples
Cost vs. dimensionality	Exponential in the number of hyperparameters	Fixed by n_iter
Important dimensions	Wasted budget	Better coverage
Reproducibility	Deterministic	Depends on seed
Parallelization	Trivial (independent trials)	Trivial (independent trials)
Budget efficiency	Low	High

4. Bayesian Optimization — Intelligent Search

Bayesian optimization builds a surrogate model of the objective function and uses an acquisition function to decide where to sample next.

The Loop

Each iteration repeats four steps:

Fit the surrogate model to all (hyperparameters, score) observations collected so far
Maximize the acquisition function over the search space to choose the next configuration
Train and validate the model at that configuration
Add the new observation to the history and repeat until the evaluation budget is spent

Gaussian Process Surrogate

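The surrogate model is typically a Gaussian Process (GP):

$$f(x) \sim \mathcal{GP}\left(m(x), k(x, x')\right)$$

where $m(x)$ is the mean function and $k(x, x')$ is the kernel (covariance function). Given observations $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$, the posterior predictive at a new point $x_*$ is Gaussian with mean and variance:

$$\mu(x_*) = m(x_*) + k_*^\top (K + \sigma_n^2 I)^{-1} (y - m(X))$$

$$\sigma^2(x_*) = k(x_*, x_*) - k_*^\top (K + \sigma_n^2 I)^{-1} k_*$$

Here $K_{ij} = k(x_i, x_j)$, $k_* = [k(x_1, x_*), \dots, k(x_n, x_*)]^\top$, and $\sigma_n^2$ is the observation noise variance.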

Acquisition Functions

The acquisition function balances exploration and exploitation:

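Expected Improvement (EI):

$$\text{EI}(x) = (\mu(x) - f^+)\,\Phi(Z) + \sigma(x)\,\phi(Z), \qquad Z = \frac{\mu(x) - f^+}{\sigma(x)}$$

where $f^+$ is the best observed value (for a maximization problem), $\mu(x)$ and $\sigma(x)$ are the surrogate's posterior mean and standard deviation, and $\Phi$ and $\phi$ are the standard normal CDF and PDF.

Upper Confidence Bound (UCB):

$$\text{UCB}(x) = \mu(x) + \kappa\,\sigma(x)$$

where $\kappa$ controls exploration. Higher $\kappa$ = more exploration.

Thompson Sampling: Sample a function $\tilde{f}$ from the surrogate's posterior and optimize the sample.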
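
As a rough sketch, EI and UCB can be computed from a surrogate's predicted mean and standard deviation at a single point using only the standard library. The maximization convention, the default kappa of 2.0, and the two candidate points are assumptions made for illustration.

```python
import math

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_improvement(mu, sigma, best):
    """EI for maximization, given the surrogate's mean and std at one point."""
    if sigma == 0.0:
        return max(mu - best, 0.0)  # no uncertainty: improvement is deterministic
    z = (mu - best) / sigma
    return (mu - best) * normal_cdf(z) + sigma * normal_pdf(z)

def upper_confidence_bound(mu, sigma, kappa=2.0):
    """UCB for maximization; larger kappa rewards uncertainty more."""
    return mu + kappa * sigma

best = 0.90  # best validation score observed so far (hypothetical)
candidates = {
    "confident, slightly better": (0.91, 0.01),
    "uncertain, slightly worse": (0.88, 0.10),
}
for name, (mu, sigma) in candidates.items():
    ei = expected_improvement(mu, sigma, best)
    ucb = upper_confidence_bound(mu, sigma)
    print(f"{name}: EI={ei:.4f}, UCB={ucb:.4f}")
```

In this example both acquisition functions prefer the uncertain candidate even though its predicted mean is lower, because its wide uncertainty leaves more room for a large improvement.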

Exploration vs Exploitation

The acquisition function encodes a fundamental tradeoff:

Exploration: sample where uncertainty is high (learning the landscape)
Exploitation: sample where predicted value is good (refining the optimum)

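EI naturally balances both: a high posterior mean $\mu(x)$ increases exploitation, and a high posterior standard deviation $\sigma(x)$ increases exploration (this matters mainly when $\mu(x)$ is near the best observed value $f^+$).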

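from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from skopt import gp_minimize
from skopt.space import Real, Integer
from skopt.utils import use_named_args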

space = [
    Real(0.001, 0.3, name='learning_rate', prior='log-uniform'),
    Integer(50, 500, name='n_estimators'),
    Integer(2, 20, name='max_depth'),
    Real(0.6, 1.0, name='subsample')
]

@use_named_args(space)
def objective(learning_rate, n_estimators, max_depth, subsample):
    model = GradientBoostingClassifier(
        learning_rate=learning_rate,
        n_estimators=n_estimators,
        max_depth=max_depth,
        subsample=subsample,
        random_state=42
    )
    return -cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy').mean()

result = gp_minimize(
    objective,
    space,
    n_calls=50,
    n_initial_points=10,
    random_state=42,
    acq_func='EI'
)

print(f"Best score: {-result.fun:.4f}")
print(f"Best params: {result.x}")

5. Optuna — State-of-the-Art Optimization

Optuna uses the Tree-structured Parzen Estimator (TPE) as its default sampler and supports advanced features like pruning, conditional hyperparameters, and rich visualization.

TPE Algorithm

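TPE models $p(x \mid y)$ instead of $p(y \mid x)$, a key departure from Gaussian Process approaches. For a minimization problem:

Split observations into "good" ($y < y^*$) and "bad" ($y \ge y^*$) using quantile $\gamma$
Model good observations: $l(x) = p(x \mid y < y^*)$
Model bad observations: $g(x) = p(x \mid y \ge y^*)$
Maximize ratio: $l(x) / g(x)$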

TPE is non-parametric (uses kernel density estimation), scales better than GP-based methods, and naturally handles conditional hyperparameters.
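
The four steps can be sketched for a single continuous hyperparameter. This is a simplified illustration of the idea and not Optuna's sampler: it uses a fixed-bandwidth Gaussian kernel density estimate, draws candidates around the good points, and assumes a minimization problem. All function names, the bandwidth, and the toy objective are hypothetical.

```python
import math
import random

def kde(points, bandwidth):
    """Return a Gaussian kernel density estimate over 1-D points."""
    norm = len(points) * bandwidth * math.sqrt(2.0 * math.pi)
    def density(x):
        return sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points) / norm
    return density

def tpe_suggest(observations, gamma=0.25, bandwidth=0.1, n_candidates=50, rng=None):
    """Suggest the next x for a minimization problem from (x, y) observations."""
    if len(observations) < 2:
        raise ValueError("need at least two observations to split into good and bad")
    rng = rng or random.Random()
    ranked = sorted(observations, key=lambda xy: xy[1])  # lowest loss first
    n_good = max(1, int(gamma * len(ranked)))
    good = [x for x, _ in ranked[:n_good]]
    bad = [x for x, _ in ranked[n_good:]]
    good_density, bad_density = kde(good, bandwidth), kde(bad, bandwidth)
    # Draw candidates from the good density: pick a good point and perturb it
    candidates = [rng.gauss(rng.choice(good), bandwidth) for _ in range(n_candidates)]
    # Keep the candidate with the largest ratio of good density to bad density
    return max(candidates, key=lambda x: good_density(x) / (bad_density(x) + 1e-12))

# Hypothetical objective with its minimum at x = 0.7
rng = random.Random(0)
observations = [(x, (x - 0.7) ** 2) for x in (rng.random() for _ in range(40))]
suggestion = tpe_suggest(observations, rng=rng)
print(f"next x to try: {suggestion:.3f}")
```

Because candidates are scored by a density ratio instead of a joint Gaussian Process fit, the cost per suggestion grows only linearly with the number of observations.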

Basic Usage

import optuna
from sklearn.model_selection import cross_val_score

def objective(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 50, 500),
        'max_depth': trial.suggest_int('max_depth', 2, 20),
        'learning_rate': trial.suggest_float('learning_rate', 1e-3, 0.3, log=True),
        'min_samples_split': trial.suggest_int('min_samples_split', 2, 20),
        'min_samples_leaf': trial.suggest_int('min_samples_leaf', 1, 10),
        'subsample': trial.suggest_float('subsample', 0.6, 1.0),
        'max_features': trial.suggest_categorical('max_features', ['sqrt', 'log2', None])
    }

    model = GradientBoostingClassifier(**params, random_state=42)
    score = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy').mean()
    return score

study = optuna.create_study(direction='maximize', study_name='gbm_tuning')
study.optimize(objective, n_trials=100, show_progress_bar=True)

print(f"Best value: {study.best_value:.4f}")
print(f"Best params: {study.best_params}")

Conditional Hyperparameters

Optuna handles conditional spaces natively — parameters that only apply when another parameter takes a specific value:

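from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

def objective_advanced(trial):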
    classifier = trial.suggest_categorical('classifier', ['rf', 'gbm', 'xgb'])

    if classifier == 'rf':
        params = {
            'n_estimators': trial.suggest_int('rf_n_estimators', 50, 500),
            'max_depth': trial.suggest_int('rf_max_depth', 2, 32),
            'min_samples_split': trial.suggest_int('rf_min_samples_split', 2, 20)
        }
        model = RandomForestClassifier(**params, random_state=42)

    elif classifier == 'gbm':
        params = {
            'n_estimators': trial.suggest_int('gbm_n_estimators', 50, 500),
            'max_depth': trial.suggest_int('gbm_max_depth', 2, 15),
            'learning_rate': trial.suggest_float('gbm_lr', 1e-3, 0.3, log=True)
        }
        model = GradientBoostingClassifier(**params, random_state=42)

    else:  # xgb
        params = {
            'n_estimators': trial.suggest_int('xgb_n_estimators', 50, 500),
            'max_depth': trial.suggest_int('xgb_max_depth', 2, 15),
            'learning_rate': trial.suggest_float('xgb_lr', 1e-3, 0.3, log=True)
        }
        # use_label_encoder is deprecated in recent XGBoost releases, so it is omitted
        model = XGBClassifier(**params, eval_metric='logloss', random_state=42)

    score = cross_val_score(model, X_train, y_train, cv=5).mean()
    return score

Pruning — Early Termination of Bad Trials

Pruning terminates unpromising trials early, saving computational budget:

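import numpy as np
import optuna
from optuna.pruners import MedianPruner, SuccessiveHalvingPruner
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold

# X_train and y_train are assumed to be pandas objects (the loop below uses .iloc)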

def objective_with_pruning(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 50, 500),
        'max_depth': trial.suggest_int('max_depth', 2, 20),
        'learning_rate': trial.suggest_float('learning_rate', 1e-3, 0.3, log=True)
    }

    scores = []
    kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

    for fold, (train_idx, val_idx) in enumerate(kf.split(X_train, y_train)):
        X_tr, X_val = X_train.iloc[train_idx], X_train.iloc[val_idx]
        y_tr, y_val = y_train.iloc[train_idx], y_train.iloc[val_idx]

        model = GradientBoostingClassifier(**params, random_state=42)
        model.fit(X_tr, y_tr)

        score = model.score(X_val, y_val)
        trial.report(score, step=fold)

        if trial.should_prune():
            raise optuna.TrialPruned()

        scores.append(score)

    return np.mean(scores)

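# Steps are fold indices 0-4, so a short warmup leaves later folds for pruning to skip
pruner = MedianPruner(n_startup_trials=10, n_warmup_steps=1)
study = optuna.create_study(
    direction='maximize',
    pruner=pruner,
    study_name='pruning_demo'
)
study.optimize(objective_with_pruning, n_trials=100)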

Pruning Strategies

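Pruner	Mechanism	Best For
MedianPruner	Prune if the trial's best intermediate value is worse than the median of previous trials at the same step	General purpose
SuccessiveHalvingPruner	Keep only the top fraction of trials at each rung (asynchronous successive halving)	Large search spaces
HyperbandPruner	Run several successive-halving brackets with different budgets	Resource-constrained
PatientPruner	Wrap another pruner and delay pruning until a set number of steps pass without improvement	Noisy objectives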
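
The median rule is simple enough to sketch by hand. The function below is a simplified illustration for a study that maximizes its objective. It is not Optuna's MedianPruner, which also handles startup trials and reporting intervals; should_prune and the score histories are hypothetical.

```python
from statistics import median

def should_prune(current_history, completed_histories, n_warmup_steps=0):
    """Median rule for a maximization study.

    current_history: intermediate scores reported so far by the running trial.
    completed_histories: intermediate score lists of earlier trials.
    """
    step = len(current_history) - 1
    if step < n_warmup_steps:
        return False
    # Compare only against trials that reported a value at this step
    peers = [h[step] for h in completed_histories if len(h) > step]
    if not peers:
        return False
    return max(current_history) < median(peers)

completed = [[0.80, 0.82, 0.85], [0.78, 0.81, 0.83], [0.70, 0.72, 0.74]]
print(should_prune([0.60, 0.65], completed))  # running trial lags behind
print(should_prune([0.79, 0.84], completed))  # running trial is competitive
```

A trial is stopped only when even its best value so far falls below what a typical earlier trial had reached at the same step.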

Optuna Visualization

import optuna.visualization as vis

# Optimization history — shows improvement over trials
fig1 = vis.plot_optimization_history(study)
fig1.show()

# Parameter importances — which hyperparameters matter most
fig2 = vis.plot_param_importances(study)
fig2.show()

# Slice plot — objective value for each parameter
fig3 = vis.plot_slice(study)
fig3.show()

# Contour plot — interaction between two parameters
fig4 = vis.plot_contour(study, params=['learning_rate', 'max_depth'])
fig4.show()

# Parallel coordinate — high-dimensional view
fig5 = vis.plot_parallel_coordinate(study)
fig5.show()

6. Learning Rate Schedules

Learning rate scheduling reduces the learning rate during training, allowing fast convergence early and fine-grained updates late.

Common Schedules

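In the formulas below, $\eta_0$ is the initial learning rate and $t$ is the current step.

Schedule	Formula	Characteristics
Step Decay	$\eta_t = \eta_0 \, \gamma^{\lfloor t / s \rfloor}$	Reduce by factor $\gamma$ every $s$ steps
Exponential Decay	$\eta_t = \eta_0 \, e^{-kt}$	Smooth continuous decay
Cosine Annealing	$\eta_t = \eta_{\min} + \tfrac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos(\pi t / T_{\max})\right)$	Smooth decay; periodic when combined with warm restarts
Linear Warmup	$\eta_t = \eta_0 \, t / T_w$ for $t \le T_w$	Avoid early instability
Polynomial Decay	$\eta_t = \eta_0 \, (1 - t / T)^p$	Flexible power control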
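
The schedules in the table can be written as small pure-Python functions of the step t. The sketch uses common textbook forms; the parameter names and default values are assumptions chosen for illustration, not settings recommended by any library.

```python
import math

def step_decay(t, eta0=0.1, gamma=0.5, step_size=10):
    """Multiply the rate by gamma every step_size steps."""
    return eta0 * gamma ** (t // step_size)

def exponential_decay(t, eta0=0.1, k=0.05):
    """Decay smoothly at rate k."""
    return eta0 * math.exp(-k * t)

def cosine_annealing(t, eta_min=1e-5, eta_max=0.1, t_max=50):
    """Follow half a cosine wave from eta_max down to eta_min over t_max steps."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t / t_max))

def linear_warmup(t, eta0=0.1, warmup_steps=5):
    """Ramp linearly from 0 to eta0, then hold."""
    return eta0 * min(1.0, t / warmup_steps)

def polynomial_decay(t, eta0=0.1, t_total=50, power=2.0):
    """Decay to 0 at t_total; power controls the curvature."""
    return eta0 * (1 - min(t, t_total) / t_total) ** power

for t in (0, 5, 10, 25, 50):
    print(t,
          round(step_decay(t), 5),
          round(exponential_decay(t), 5),
          round(cosine_annealing(t), 5),
          round(linear_warmup(t), 5),
          round(polynomial_decay(t), 5))
```

Printing a few steps side by side shows the different shapes: step decay moves in plateaus, while the exponential, cosine, and polynomial schedules change at every step.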

Practical Implementation

import numpy as np
from tensorflow.keras.callbacks import LearningRateScheduler

# Step decay: multiply the rate by `drop` every `epochs_drop` epochs
def step_decay(initial_lr=0.1, drop=0.5, epochs_drop=10):
    def schedule(epoch, lr=None):  # Keras passes (epoch, current_lr); lr is unused here
        return float(initial_lr * (drop ** np.floor(epoch / epochs_drop)))
    return schedule

lr_scheduler = LearningRateScheduler(step_decay(initial_lr=0.1, drop=0.5, epochs_drop=10))

# Cosine annealing with warm restarts: the rate decays from eta_max towards
# eta_min over T_max epochs, then jumps back to eta_max
def cosine_annealing(epoch, T_max=10, eta_min=1e-5, eta_max=0.1):
    t = epoch % T_max  # position inside the current cycle; the modulo gives the restart
    return float(eta_min + 0.5 * (eta_max - eta_min) * (1 + np.cos(np.pi * t / T_max)))

lr_callback = LearningRateScheduler(lambda epoch, lr=None: cosine_annealing(epoch))

7. Early Stopping

Early stopping halts training when validation performance stops improving, preventing overfitting and saving compute.

Mathematical Formulation

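Let $\mathcal{L}_{\text{val}}^{(t)}$ be the validation loss at epoch $t$. Training stops at the first epoch $t$ for which:

$$\mathcal{L}_{\text{val}}^{(s)} \ge \min_{s' \le t - p} \mathcal{L}_{\text{val}}^{(s')} - \delta \quad \text{for all } s \in \{t - p + 1, \dots, t\}$$

where $p$ is the patience parameter and $\delta$ is a tolerance threshold. In words: the loss has failed to improve on its previous best by more than $\delta$ for $p$ consecutive epochs.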
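
The stopping rule can be traced on a recorded list of validation losses. This is a minimal sketch with hypothetical loss values; early_stopping_epoch is an illustrative helper that returns both the epoch at which training would stop and the epoch of the best loss seen before stopping.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses."""
    best_loss, best_epoch, waited = float("inf"), None, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # Improvement larger than min_delta: record it and reset the counter
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch  # ran out of epochs without stopping

losses = [0.90, 0.70, 0.60, 0.55, 0.56, 0.57, 0.58, 0.50]
print(early_stopping_epoch(losses, patience=3))
print(early_stopping_epoch(losses, patience=5))
```

With a patience of 3 the run stops at epoch 6 and never sees the later improvement at epoch 7; a patience of 5 waits long enough to reach it. Patience therefore trades compute against the risk of stopping on a temporary plateau.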

Implementation

import copy
import numpy as np
from sklearn.metrics import log_loss

class EarlyStopping:
    """Track validation loss and signal when it has stopped improving."""

    def __init__(self, patience=10, min_delta=1e-4, restore_best=True):
        self.patience = patience
        self.min_delta = min_delta
        self.restore_best = restore_best
        self.counter = 0
        self.best_loss = np.inf
        self.best_model = None

    def __call__(self, model, X_val, y_val):
        """Return True when training should stop."""
        # Score the already-fitted model on held-out data; do not refit it here
        current_loss = log_loss(y_val, model.predict_proba(X_val), labels=model.classes_)

        if current_loss < self.best_loss - self.min_delta:
            self.best_loss = current_loss
            if self.restore_best:
                # deepcopy keeps the fitted state; sklearn's clone() would discard it
                self.best_model = copy.deepcopy(model)
            self.counter = 0
            return False  # don't stop

        self.counter += 1
        return self.counter >= self.patience


# Usage with an incremental learner: any estimator that exposes partial_fit
# and predict_proba, for example SGDClassifier(loss='log_loss')
def train_with_early_stopping(model, X_train, y_train, X_val, y_val, max_epochs=100):
    stopper = EarlyStopping(patience=10, min_delta=1e-4)
    classes = np.unique(y_train)

    for epoch in range(max_epochs):
        # One more pass over the data; fit() would restart training from scratch
        model.partial_fit(X_train, y_train, classes=classes)

        if stopper(model, X_val, y_val):
            print(f"Early stopping at epoch {epoch}")
            break

    if stopper.restore_best and stopper.best_model is not None:
        return stopper.best_model
    return model

Early Stopping for Gradient Boosting

# XGBoost / LightGBM built-in early stopping
import xgboost as xgb
import lightgbm as lgb

# XGBoost
dtrain = xgb.DMatrix(X_train, label=y_train)
dval = xgb.DMatrix(X_val, label=y_val)

params = {
    'max_depth': 6,
    'learning_rate': 0.1,
    'objective': 'binary:logistic',
    'eval_metric': 'auc'
}

model = xgb.train(
    params,
    dtrain,
    num_boost_round=1000,
    evals=[(dtrain, 'train'), (dval, 'val')],
    early_stopping_rounds=50,
    verbose_eval=100
)

# LightGBM
lgb_train = lgb.Dataset(X_train, y_train)
lgb_val = lgb.Dataset(X_val, y_val, reference=lgb_train)

model = lgb.train(
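    # Without an explicit objective, LightGBM defaults to regression
    {'objective': 'binary', 'metric': 'auc', 'num_leaves': 31, 'learning_rate': 0.05},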
    lgb_train,
    num_boost_round=1000,
    valid_sets=[lgb_val],
    callbacks=[lgb.early_stopping(50), lgb.log_evaluation(100)]
)

Patience and Overfitting

8. Implementation in Python — Complete Pipeline

End-to-End Tuning Pipeline

import optuna
import numpy as np
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import classification_report
import warnings
warnings.filterwarnings('ignore')

def create_objective(X_train, y_train, model_type='gbm'):
    """Create an Optuna objective function for the specified model type."""

    def objective(trial):
        if model_type == 'gbm':
            params = {
                'n_estimators': trial.suggest_int('n_estimators', 50, 500),
                'max_depth': trial.suggest_int('max_depth', 2, 15),
                'learning_rate': trial.suggest_float('learning_rate', 1e-3, 0.3, log=True),
                'min_samples_split': trial.suggest_int('min_samples_split', 2, 20),
                'min_samples_leaf': trial.suggest_int('min_samples_leaf', 1, 10),
                'subsample': trial.suggest_float('subsample', 0.6, 1.0),
                'max_features': trial.suggest_categorical('max_features', ['sqrt', 'log2', None])
            }
            model = GradientBoostingClassifier(**params, random_state=42)

        elif model_type == 'rf':
            params = {
                'n_estimators': trial.suggest_int('n_estimators', 50, 500),
                'max_depth': trial.suggest_int('max_depth', 2, 30),
                'min_samples_split': trial.suggest_int('min_samples_split', 2, 20),
                'min_samples_leaf': trial.suggest_int('min_samples_leaf', 1, 10),
                'max_features': trial.suggest_categorical('max_features', ['sqrt', 'log2', None]),
                'bootstrap': trial.suggest_categorical('bootstrap', [True, False])
            }
            model = RandomForestClassifier(**params, random_state=42)

        else:
            raise ValueError(f"Unknown model_type: {model_type!r}")

        cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
        scores = cross_val_score(model, X_train, y_train, cv=cv, scoring='accuracy', n_jobs=-1)
        return scores.mean()

    return objective


# Create and run study
study = optuna.create_study(
    direction='maximize',
    study_name='complete_pipeline',
    storage='sqlite:///optuna_study.db',  # persist results
    load_if_exists=True
)

objective_fn = create_objective(X_train, y_train, model_type='gbm')
study.optimize(objective_fn, n_trials=100, show_progress_bar=True)

# Analyze results
print(f"Best trial score: {study.best_trial.value:.4f}")
print(f"Best trial params: {study.best_trial.params}")

# Visualization
import optuna.visualization as vis
vis.plot_optimization_history(study).show()
vis.plot_param_importances(study).show()

# Final evaluation
best_model = GradientBoostingClassifier(**study.best_params, random_state=42)
best_model.fit(X_train, y_train)
y_pred = best_model.predict(X_test)
print(classification_report(y_test, y_pred))

Multi-Objective Optimization

# Optimize accuracy AND training time simultaneously
def objective_multi(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 50, 500),
        'max_depth': trial.suggest_int('max_depth', 2, 20),
        'learning_rate': trial.suggest_float('learning_rate', 1e-3, 0.3, log=True)
    }

    model = GradientBoostingClassifier(**params, random_state=42)

    import time
    start = time.time()
    score = cross_val_score(model, X_train, y_train, cv=5).mean()
    elapsed = time.time() - start

    return score, elapsed

study_multi = optuna.create_study(
    directions=['maximize', 'minimize'],  # maximize accuracy, minimize time
    study_name='multi_objective'
)
study_multi.optimize(objective_multi, n_trials=100)

# Pareto front
best_trials = study_multi.best_trials
for trial in best_trials:
    print(f"Accuracy: {trial.values[0]:.4f}, Time: {trial.values[1]:.1f}s")
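
To make the idea of a Pareto front concrete, the sketch below filters a list of (accuracy, time) pairs by hand: a trial stays on the front if no other trial is at least as good on both objectives and strictly better on one. The trial values are hypothetical and pareto_front is an illustrative helper, not part of Optuna.

```python
def pareto_front(trials):
    """Return trials not dominated by any other (maximize accuracy, minimize time)."""
    def dominates(a, b):
        no_worse = a[0] >= b[0] and a[1] <= b[1]
        strictly_better = a[0] > b[0] or a[1] < b[1]
        return no_worse and strictly_better
    return [t for t in trials if not any(dominates(other, t) for other in trials)]

# (accuracy, training time in seconds) for five hypothetical trials
trials = [(0.90, 12.0), (0.92, 30.0), (0.89, 15.0), (0.92, 45.0), (0.85, 5.0)]
front = pareto_front(trials)
print(front)
```

Three trials survive: each one is either the fastest at its accuracy level or the most accurate within its time budget, so choosing among them is a judgment about the trade-off and not a matter of one being strictly better.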

Saving and Loading Studies

# Save study to database
study = optuna.create_study(
    study_name='my_study',
    storage='sqlite:///optuna_studies.db',
    direction='maximize',  # the objective returns accuracy; the default is 'minimize'
    load_if_exists=True
)

# Resume later
study.optimize(objective, n_trials=50)  # continues from where it left off

# Export results to dataframe
df = study.trials_dataframe()
df.to_csv('study_results.csv', index=False)

# Load a completed study
loaded_study = optuna.load_study(
    study_name='my_study',
    storage='sqlite:///optuna_studies.db'
)

Key Takeaways

ℹ️Summary

Grid Search is simple but exponential: use it only when the search space is small (at most 3-4 parameters with a few values each).

Random Search is efficient for low-effective-dimension problems: at equal budget it usually matches or beats grid search, though this is not guaranteed on every problem.

Bayesian Optimization is sample-efficient: best when each evaluation is expensive (deep learning, long training runs).

Optuna with TPE is a strong modern default: it handles conditional parameters, pruning, and multi-objective studies, and scales to hundreds of trials.

Early stopping is cheap insurance: use it for iterative learners like gradient boosting.

Learning rate schedules enable fast convergence with fine-tuned solutions: cosine annealing with warm restarts is often a strong default.

References

Bergstra, J., and Bengio, Y. (2012). Random search for hyper-parameter optimization. JMLR, 13, 281-305.
Snoek, J., Larochelle, H., and Adams, R. P. (2012). Practical Bayesian optimization of machine learning algorithms. NeurIPS.
Akiba, T., et al. (2019). Optuna: A next-generation hyperparameter optimization framework. KDD.
Li, L., et al. (2018). Hyperband: A novel bandit-based approach to hyperparameter optimization. JMLR, 18(185), 1-52.
Smith, L. N. (2017). Cyclical learning rates for training neural networks. WACV.
