Hyperparameter Optimization

Why It Matters

💡 Why It Matters

Good hyperparameters can improve model performance more than architecture changes. A well-tuned model with a simple architecture often outperforms a poorly tuned model with a complex architecture. Hyperparameter optimization (HPO) is the process of finding the optimal set of hyperparameters for a machine learning model, and it's a critical step in the machine learning pipeline.

Hyperparameter Optimization

DfHyperparameter Optimization

Hyperparameter optimization (HPO) is the problem of finding a set of hyperparameters for a learning algorithm that yields an optimal model. Given a learning algorithm $\mathcal{A}$ with hyperparameters $\lambda \in \Lambda$ , and a dataset $D$ , the goal is to find:

\lambda^* = \arg\min_{\lambda \in \Lambda} \mathcal{L}(\mathcal{A}_\lambda, D)

where $\mathcal{L}$ is the loss function (e.g., cross-validation error) and $\mathcal{A}_\lambda$ denotes algorithm $\mathcal{A}$ trained with hyperparameters $\lambda$ .

DfTypes of Hyperparameters

Hyperparameters can be categorized into:

Continuous: Learning rate, regularization strength, dropout rate
Integer: Number of layers, number of neurons, batch size
Categorical: Activation function, optimizer type, kernel type
Conditional: Some hyperparameters only apply when others take specific values (e.g., number of trees in a forest)

Grid Search

DfGrid Search

Grid search is an exhaustive search over a specified parameter grid. It evaluates every combination of hyperparameters in the grid. For hyperparameters $\lambda_1, \lambda_2, \dots, \lambda_k$ with $n_1, n_2, \dots, n_k$ possible values respectively, grid search evaluates $N = n_1 \times n_2 \times \cdots \times n_k$ combinations.

Grid Search Complexity

⚠️ Curse of Dimensionality

Grid search suffers from the curse of dimensionality. Adding just one hyperparameter with 10 possible values increases the search space by 10x. With 5 hyperparameters each with 10 values, you need 100,000 evaluations.

Random Search

DfRandom Search

Random search samples hyperparameter configurations from a specified distribution. Unlike grid search, it doesn't evaluate every point in a grid. For each iteration, it independently samples each hyperparameter from its distribution.

ThRandom Search Efficiency

Random search is more efficient than grid search in high-dimensional spaces because it explores more of the hyperparameter space. Specifically, random search can find a good solution with probability $p$ using $O(\log(1/(1-p)))$ evaluations, while grid search requires $O(k \log(1/(1-p)))$ evaluations where $k$ is the number of hyperparameters.

Random Search Sampling

Bayesian Optimization

DfBayesian Optimization

Bayesian optimization is a sequential model-based optimization approach for black-box functions. It builds a surrogate model of the objective function and uses an acquisition function to decide where to evaluate next. The process iterates:

Fit surrogate model to observed data
Maximize acquisition function to propose next point
Evaluate the true objective at that point
Update the surrogate model

Acquisition Functions

DfExpected Improvement (EI)

The expected improvement acquisition function measures the expected improvement over the current best value:

\text{EI}(x) = \mathbb{E}[\max(0, f_{\text{best}} - \hat{f}(x))]

where $f_{\text{best}}$ is the best observed value and $\hat{f}(x)$ is the surrogate model's prediction.

Expected Improvement Formula

DfUpper Confidence Bound (UCB)

The UCB acquisition function balances exploration and exploitation:

\text{UCB}(x) = \mu(x) + \kappa \sigma(x)

where $\kappa$ controls the exploration-exploitation tradeoff. Larger $\kappa$ encourages exploration.

DfProbability of Improvement (PI)

The PI acquisition function measures the probability of improving over the current best:

\text{PI}(x) = \Phi\left(\frac{\mu(x) - f_{\text{best}}}{\sigma(x)}\right)

PI tends to get stuck in local optima compared to EI.

Gaussian Process for HPO

DfGaussian Process Regression

A Gaussian Process (GP) is a collection of random variables, any finite number of which have a multivariate Gaussian distribution. A GP is completely specified by its mean function $m(x)$ and covariance function $k(x, x')$ :

f(x) \sim \mathcal{GP}(m(x), k(x, x'))

GP Predictive Distribution

DfCommon Kernel Functions

Squared Exponential (RBF): $k(x, x') = \sigma^2 \exp\left(-\frac{\|x - x'\|^2}{2l^2}\right)$
Matérn: $k(x, x') = \sigma^2 \frac{2^{1-\nu}}{\Gamma(\nu)}\left(\frac{\sqrt{2\nu}\|x - x'\|}{l}\right)^\nu K_\nu\left(\frac{\sqrt{2\nu}\|x - x'\|}{l}\right)$
Rational Quadratic: $k(x, x') = \sigma^2\left(1 + \frac{\|x - x'\|^2}{2\alpha l^2}\right)^{-\alpha}$

Tree-structured Parzen Estimators

DfTree-structured Parzen Estimators (TPE)

TPE is a sequential model-based optimization algorithm that models $p(x|y)$ and $p(y)$ instead of $p(y|x)$ like GP-based methods. It uses two density estimators:

$l(x)$ : density of $x$ given $y < y^*$ (promising configurations)
$g(x)$ : density of $x$ given $y \geq y^*$ (non-promising configurations)

The algorithm maximizes the ratio $\frac{l(x)}{g(x)}$ to propose new configurations.

TPE Acquisition Function

💡 TPE Advantages

TPE handles conditional hyperparameters naturally and can work with any metric. It doesn't require a kernel function like GPs and scales better with the number of observations.

Multi-Fidelity Methods

DfMulti-Fidelity Optimization

Multi-fidelity methods use approximations (low-fidelity) of the objective function to reduce computational cost. The idea is to evaluate many configurations with low fidelity (e.g., fewer epochs, smaller dataset) and only evaluate the most promising ones with full fidelity.

Successive Halving

DfSuccessive Halving

Successive halving is a multi-fidelity algorithm that:

Start with $n$ configurations and small resource budget $r_0$
Evaluate all configurations for $r_0$ iterations
Keep the top half and double the budget
Repeat until one configuration remains

Successive Halving Budget Allocation

Hyperband

DfHyperband

Hyperband is an extension of successive halving that tries different amounts of resource allocation. It runs multiple "brackets" with different starting configurations and resource levels:

Start with $n$ configurations at minimum resources
Successive halving eliminates half each round
Multiple brackets with different starting points

Early Stopping

DfEarly Stopping

Early stopping is a regularization technique where training is stopped before convergence to prevent overfitting. The model is trained for multiple epochs, and training is stopped when performance on a validation set stops improving.

Early Stopping Criterion

⚠️ Patience Parameter

Early stopping with patience $\tau$ waits $\tau$ epochs of no improvement before stopping. Too small $\tau$ may stop too early; too large $\tau$ wastes computation. Typical values: $\tau \in [5, 20]$ .

Python Implementation

Optuna

📝Optuna Implementation

import optuna
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

def objective(trial):
    n_estimators = trial.suggest_int('n_estimators', 10, 300)
    max_depth = trial.suggest_int('max_depth', 2, 20)
    min_samples_split = trial.suggest_float('min_samples_split', 0.01, 0.5)
    
    model = RandomForestClassifier(
        n_estimators=n_estimators,
        max_depth=max_depth,
        min_samples_split=min_samples_split,
        random_state=42
    )
    
    score = cross_val_score(model, X, y, cv=5, scoring='accuracy').mean()
    return score

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)

print(f"Best params: {study.best_params}")
print(f"Best score: {study.best_value:.4f}")

Scikit-optimize

📝Scikit-Optimize Implementation

from skopt import BayesSearchCV
from skopt.space import Real, Integer, Categorical
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

search_spaces = {
    'n_estimators': Integer(10, 300),
    'max_depth': Integer(2, 20),
    'min_samples_split': Real(0.01, 0.5),
    'criterion': Categorical(['gini', 'entropy'])
}

opt = BayesSearchCV(
    RandomForestClassifier(random_state=42),
    search_spaces,
    n_iter=50,
    cv=5,
    scoring='accuracy',
    random_state=42
)

opt.fit(X, y)
print(f"Best params: {opt.best_params_}")
print(f"Best score: {opt.best_score_:.4f}")

Optuna with Pruning

📝Optuna with Pruning

import optuna
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification
from optuna.pruners import MedianPruner

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

def objective(trial):
    n_estimators = trial.suggest_int('n_estimators', 10, 300)
    learning_rate = trial.suggest_float('learning_rate', 0.001, 0.3, log=True)
    max_depth = trial.suggest_int('max_depth', 2, 10)
    subsample = trial.suggest_float('subsample', 0.5, 1.0)
    
    model = GradientBoostingClassifier(
        n_estimators=n_estimators,
        learning_rate=learning_rate,
        max_depth=max_depth,
        subsample=subsample,
        random_state=42
    )
    
    scores = []
    for fold, (train_idx, val_idx) in enumerate(
        cross_val_score(model, X, y, cv=3, return_estimator=False)
    ):
        model.fit(X[train_idx], y[train_idx])
        score = model.score(X[val_idx], y[val_idx])
        scores.append(score)
        
        trial.report(sum(scores) / len(scores), fold)
        if trial.should_prune():
            raise optuna.TrialPruned()
    
    return sum(scores) / len(scores)

study = optuna.create_study(
    direction='maximize',
    pruner=MedianPruner(n_startup_trials=10, n_warmup_steps=5)
)
study.optimize(objective, n_trials=100, timeout=3600)

print(f"Best params: {study.best_params}")
print(f"Best score: {study.best_value:.4f}")

Applications in AI/ML

ℹ️ Applications

Hyperparameter optimization is crucial in:

Deep Learning: Learning rate, batch size, architecture depth/width
Ensemble Methods: Number of estimators, max depth, min samples
SVMs: C parameter, kernel type, gamma
Neural Architecture Search: Layer types, connections, activation functions
Reinforcement Learning: Discount factor, exploration rate, network architecture

Common Mistakes

Mistake	Description	Solution
Using grid search for many parameters	Exponential cost explosion	Use random search or Bayesian optimization
Not setting a random seed	Results not reproducible	Always set random_state or seed
Ignoring conditional parameters	Wasting evaluations on invalid configs	Use TPE or define conditional spaces
Evaluating too few configurations	Missing good solutions	Start with 20-50 random trials
Not using early stopping	Training poor configs for full duration	Use successive halving or patience
Overfitting validation set	Tuning too aggressively on small validation sets	Use nested cross-validation
Not scaling resources	Using same budget for all configs	Use multi-fidelity methods

Interview Questions

Q: When should you use Bayesian optimization over random search?

A: Use Bayesian optimization when each evaluation is expensive (e.g., deep learning models) and you have a limited budget. Random search is better for cheap evaluations or when you have many parallel resources.

Q: Explain the exploration-exploitation tradeoff in HPO.

A: Exploration means trying new, uncertain regions of the hyperparameter space to potentially find better configurations. Exploitation means focusing on regions that have shown good results. The tradeoff is balancing between trying new things and refining what works.

Q: What are the advantages of TPE over Gaussian Processes for HPO?

A: TPE handles conditional hyperparameters naturally, doesn't require kernel specification, scales better with observations, and can work with any metric. GPs provide better uncertainty estimates but struggle with high dimensions.

Q: How does Hyperband improve upon successive halving?

A: Hyperband runs multiple brackets with different starting points and resource allocations, reducing the risk of early elimination of good configurations. It's more robust to the unknown difficulty of the optimization problem.

Q: What is the role of acquisition functions in Bayesian optimization?

A: Acquisition functions guide the search by balancing exploration (high uncertainty regions) and exploitation (high predicted value regions). Common choices include EI, UCB, and PI.

Practice Problems

📝Problem 1: Grid Search Complexity

You have 4 hyperparameters with 5 possible values each. Grid search requires evaluating 625 configurations. If each evaluation takes 2 minutes, how long does the complete grid search take?

💡Solution 1

Total time = 625 configurations × 2 minutes = 1250 minutes = 20.83 hours

📝Problem 2: Expected Improvement

A GP model predicts $\mu(x) = 0.85$ , $\sigma(x) = 0.05$ at a test point. The current best validation score is 0.80. Calculate the expected improvement at this point.

💡Solution 2

Given: $\mu(x) = 0.85$ , $\sigma(x) = 0.05$ , $f_{\text{best}} = 0.80$

First, calculate $Z$ :

Z = \frac{\mu(x) - f_{\text{best}}}{\sigma(x)} = \frac{0.85 - 0.80}{0.05} = 1.0

Using standard normal tables:

\Phi(1.0) \approx 0.8413

\phi(1.0) \approx 0.2420

\text{EI}(x) = (0.85 - 0.80) \times 0.8413 + 0.05 \times 0.2420

\text{EI}(x) = 0.05 \times 0.8413 + 0.05 \times 0.2420

\text{EI}(x) = 0.0421 + 0.0121 = 0.0542

📝Problem 3: Successive Halving

You have 16 configurations and a total budget of 800 iterations. How many rounds does successive halving need, and how many resources does each configuration get in round 2?

💡Solution 3

Number of rounds: $k = \log_2(16) + 1 = 4 + 1 = 5$ rounds

Resources per configuration at round $i$ :

r_i = B \cdot \frac{2^i}{n} = 800 \cdot \frac{2^i}{16}

At round 2 ( $i = 2$ ): $r_2 = 800 \cdot \frac{2^2}{16} = 800 \cdot \frac{4}{16} = 800 \cdot 0.25 = 200$ iterations

Configurations surviving to round 2: $n_2 = 16 \cdot 2^{-2} = 16 \cdot 0.25 = 4$ configurations

📝Problem 4: Early Stopping Patience

A model's validation loss over 20 epochs is: [0.5, 0.45, 0.42, 0.40, 0.39, 0.38, 0.385, 0.39, 0.395, 0.40, 0.41, 0.42, 0.43, 0.44, 0.45, 0.46, 0.47, 0.48, 0.49, 0.50]

With patience = 5, at what epoch should training stop?

💡Solution 4

Minimum validation loss: 0.38 at epoch 6

Epoch 7: 0.385 > 0.38 (no improvement, counter = 1) Epoch 8: 0.39 > 0.38 (no improvement, counter = 2) Epoch 9: 0.395 > 0.38 (no improvement, counter = 3) Epoch 10: 0.40 > 0.38 (no improvement, counter = 4) Epoch 11: 0.41 > 0.38 (no improvement, counter = 5)

At epoch 11, patience of 5 is reached. Training should stop at epoch 11.

Quick Reference

📋Quick Reference

Grid Search: Exhaustive, exponential cost, good for small spaces
Random Search: Samples randomly, more efficient than grid search
Bayesian Optimization: Model-based, sample-efficient, uses acquisition functions
TPE: Handles conditional parameters, doesn't need kernel specification
Successive Halving: Multi-fidelity, eliminates poor configurations early
Hyperband: Robust extension of successive halving with multiple brackets
Early Stopping: Prevents overfitting, saves training time

Cross-References

Related Topics: Gradient Descent, Regularization, Cross-Validation, Ensemble Methods
Prerequisites: Linear Algebra, Probability, Statistics, Machine Learning Basics
Advanced Topics: Neural Architecture Search, Meta-Learning, AutoML Pipelines
Libraries: Optuna, Scikit-optimize, Hyperopt, BOHB, Ray Tune
Applications — Deep Learning, Reinforcement Learning, Computer Vision, NLP