← Math|69 of 100
Optimization

Hyperparameter Optimization

Master grid search, random search, Bayesian optimization, and hyperband for model tuning.

πŸ“‚ AutoMLπŸ“– Lesson 69 of 100πŸŽ“ Free Course

Advertisement

Why It Matters

πŸ’‘ Why It Matters

Good hyperparameters can improve model performance more than architecture changes. A well-tuned model with a simple architecture often outperforms a poorly tuned model with a complex architecture. Hyperparameter optimization (HPO) is the process of finding the optimal set of hyperparameters for a machine learning model, and it's a critical step in the machine learning pipeline.


Hyperparameter Optimization

DfHyperparameter Optimization

Hyperparameter optimization (HPO) is the problem of finding a set of hyperparameters for a learning algorithm that yields an optimal model. Given a learning algorithm A\mathcal{A} with hyperparameters Ξ»βˆˆΞ›\lambda \in \Lambda, and a dataset DD, the goal is to find:

Ξ»βˆ—=arg⁑minβ‘Ξ»βˆˆΞ›L(AΞ»,D)\lambda^* = \arg\min_{\lambda \in \Lambda} \mathcal{L}(\mathcal{A}_\lambda, D)

where L\mathcal{L} is the loss function (e.g., cross-validation error) and AΞ»\mathcal{A}_\lambda denotes algorithm A\mathcal{A} trained with hyperparameters Ξ»\lambda.

DfTypes of Hyperparameters

Hyperparameters can be categorized into:

  1. Continuous: Learning rate, regularization strength, dropout rate
  2. Integer: Number of layers, number of neurons, batch size
  3. Categorical: Activation function, optimizer type, kernel type
  4. Conditional: Some hyperparameters only apply when others take specific values (e.g., number of trees in a forest)

Grid Search

DfGrid Search

Grid search is an exhaustive search over a specified parameter grid. It evaluates every combination of hyperparameters in the grid. For hyperparameters Ξ»1,Ξ»2,…,Ξ»k\lambda_1, \lambda_2, \dots, \lambda_k with n1,n2,…,nkn_1, n_2, \dots, n_k possible values respectively, grid search evaluates N=n1Γ—n2Γ—β‹―Γ—nkN = n_1 \times n_2 \times \cdots \times n_k combinations.

Grid Search Complexity

⚠️ Curse of Dimensionality

Grid search suffers from the curse of dimensionality. Adding just one hyperparameter with 10 possible values increases the search space by 10x. With 5 hyperparameters each with 10 values, you need 100,000 evaluations.


Random Search

DfRandom Search

Random search samples hyperparameter configurations from a specified distribution. Unlike grid search, it doesn't evaluate every point in a grid. For each iteration, it independently samples each hyperparameter from its distribution.

ThRandom Search Efficiency

Random search is more efficient than grid search in high-dimensional spaces because it explores more of the hyperparameter space. Specifically, random search can find a good solution with probability pp using O(log⁑(1/(1βˆ’p)))O(\log(1/(1-p))) evaluations, while grid search requires O(klog⁑(1/(1βˆ’p)))O(k \log(1/(1-p))) evaluations where kk is the number of hyperparameters.

Random Search Sampling


Bayesian Optimization

DfBayesian Optimization

Bayesian optimization is a sequential model-based optimization approach for black-box functions. It builds a surrogate model of the objective function and uses an acquisition function to decide where to evaluate next. The process iterates:

  1. Fit surrogate model to observed data
  2. Maximize acquisition function to propose next point
  3. Evaluate the true objective at that point
  4. Update the surrogate model

Acquisition Functions

DfExpected Improvement (EI)

The expected improvement acquisition function measures the expected improvement over the current best value:

EI(x)=E[max⁑(0,fbestβˆ’f^(x))]\text{EI}(x) = \mathbb{E}[\max(0, f_{\text{best}} - \hat{f}(x))]

where fbestf_{\text{best}} is the best observed value and f^(x)\hat{f}(x) is the surrogate model's prediction.

Expected Improvement Formula

DfUpper Confidence Bound (UCB)

The UCB acquisition function balances exploration and exploitation:

UCB(x)=ΞΌ(x)+ΞΊΟƒ(x)\text{UCB}(x) = \mu(x) + \kappa \sigma(x)

where ΞΊ\kappa controls the exploration-exploitation tradeoff. Larger ΞΊ\kappa encourages exploration.

DfProbability of Improvement (PI)

The PI acquisition function measures the probability of improving over the current best:

PI(x)=Ξ¦(ΞΌ(x)βˆ’fbestΟƒ(x))\text{PI}(x) = \Phi\left(\frac{\mu(x) - f_{\text{best}}}{\sigma(x)}\right)

PI tends to get stuck in local optima compared to EI.


Gaussian Process for HPO

DfGaussian Process Regression

A Gaussian Process (GP) is a collection of random variables, any finite number of which have a multivariate Gaussian distribution. A GP is completely specified by its mean function m(x)m(x) and covariance function k(x,xβ€²)k(x, x'):

f(x)∼GP(m(x),k(x,xβ€²))f(x) \sim \mathcal{GP}(m(x), k(x, x'))

GP Predictive Distribution

DfCommon Kernel Functions

  • Squared Exponential (RBF): k(x,xβ€²)=Οƒ2exp⁑(βˆ’βˆ₯xβˆ’xβ€²βˆ₯22l2)k(x, x') = \sigma^2 \exp\left(-\frac{\|x - x'\|^2}{2l^2}\right)
  • MatΓ©rn: k(x,xβ€²)=Οƒ221βˆ’Ξ½Ξ“(Ξ½)(2Ξ½βˆ₯xβˆ’xβ€²βˆ₯l)Ξ½KΞ½(2Ξ½βˆ₯xβˆ’xβ€²βˆ₯l)k(x, x') = \sigma^2 \frac{2^{1-\nu}}{\Gamma(\nu)}\left(\frac{\sqrt{2\nu}\|x - x'\|}{l}\right)^\nu K_\nu\left(\frac{\sqrt{2\nu}\|x - x'\|}{l}\right)
  • Rational Quadratic: k(x,xβ€²)=Οƒ2(1+βˆ₯xβˆ’xβ€²βˆ₯22Ξ±l2)βˆ’Ξ±k(x, x') = \sigma^2\left(1 + \frac{\|x - x'\|^2}{2\alpha l^2}\right)^{-\alpha}

Tree-structured Parzen Estimators

DfTree-structured Parzen Estimators (TPE)

TPE is a sequential model-based optimization algorithm that models p(x∣y)p(x|y) and p(y)p(y) instead of p(y∣x)p(y|x) like GP-based methods. It uses two density estimators:

  • l(x)l(x): density of xx given y<yβˆ—y < y^* (promising configurations)
  • g(x)g(x): density of xx given yβ‰₯yβˆ—y \geq y^* (non-promising configurations)

The algorithm maximizes the ratio l(x)g(x)\frac{l(x)}{g(x)} to propose new configurations.

TPE Acquisition Function

πŸ’‘ TPE Advantages

TPE handles conditional hyperparameters naturally and can work with any metric. It doesn't require a kernel function like GPs and scales better with the number of observations.


Multi-Fidelity Methods

DfMulti-Fidelity Optimization

Multi-fidelity methods use approximations (low-fidelity) of the objective function to reduce computational cost. The idea is to evaluate many configurations with low fidelity (e.g., fewer epochs, smaller dataset) and only evaluate the most promising ones with full fidelity.

Successive Halving

DfSuccessive Halving

Successive halving is a multi-fidelity algorithm that:

  1. Start with nn configurations and small resource budget r0r_0
  2. Evaluate all configurations for r0r_0 iterations
  3. Keep the top half and double the budget
  4. Repeat until one configuration remains

Successive Halving Budget Allocation

Hyperband

DfHyperband

Hyperband is an extension of successive halving that tries different amounts of resource allocation. It runs multiple "brackets" with different starting configurations and resource levels:

  1. Start with nn configurations at minimum resources
  2. Successive halving eliminates half each round
  3. Multiple brackets with different starting points

Early Stopping

DfEarly Stopping

Early stopping is a regularization technique where training is stopped before convergence to prevent overfitting. The model is trained for multiple epochs, and training is stopped when performance on a validation set stops improving.

Early Stopping Criterion

⚠️ Patience Parameter

Early stopping with patience Ο„\tau waits Ο„\tau epochs of no improvement before stopping. Too small Ο„\tau may stop too early; too large Ο„\tau wastes computation. Typical values: Ο„βˆˆ[5,20]\tau \in [5, 20].


Python Implementation

Optuna

πŸ“Optuna Implementation

import optuna
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

def objective(trial):
    n_estimators = trial.suggest_int('n_estimators', 10, 300)
    max_depth = trial.suggest_int('max_depth', 2, 20)
    min_samples_split = trial.suggest_float('min_samples_split', 0.01, 0.5)
    
    model = RandomForestClassifier(
        n_estimators=n_estimators,
        max_depth=max_depth,
        min_samples_split=min_samples_split,
        random_state=42
    )
    
    score = cross_val_score(model, X, y, cv=5, scoring='accuracy').mean()
    return score

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)

print(f"Best params: {study.best_params}")
print(f"Best score: {study.best_value:.4f}")

Scikit-optimize

πŸ“Scikit-Optimize Implementation

from skopt import BayesSearchCV
from skopt.space import Real, Integer, Categorical
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

search_spaces = {
    'n_estimators': Integer(10, 300),
    'max_depth': Integer(2, 20),
    'min_samples_split': Real(0.01, 0.5),
    'criterion': Categorical(['gini', 'entropy'])
}

opt = BayesSearchCV(
    RandomForestClassifier(random_state=42),
    search_spaces,
    n_iter=50,
    cv=5,
    scoring='accuracy',
    random_state=42
)

opt.fit(X, y)
print(f"Best params: {opt.best_params_}")
print(f"Best score: {opt.best_score_:.4f}")

Optuna with Pruning

πŸ“Optuna with Pruning

import optuna
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification
from optuna.pruners import MedianPruner

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

def objective(trial):
    n_estimators = trial.suggest_int('n_estimators', 10, 300)
    learning_rate = trial.suggest_float('learning_rate', 0.001, 0.3, log=True)
    max_depth = trial.suggest_int('max_depth', 2, 10)
    subsample = trial.suggest_float('subsample', 0.5, 1.0)
    
    model = GradientBoostingClassifier(
        n_estimators=n_estimators,
        learning_rate=learning_rate,
        max_depth=max_depth,
        subsample=subsample,
        random_state=42
    )
    
    scores = []
    for fold, (train_idx, val_idx) in enumerate(
        cross_val_score(model, X, y, cv=3, return_estimator=False)
    ):
        model.fit(X[train_idx], y[train_idx])
        score = model.score(X[val_idx], y[val_idx])
        scores.append(score)
        
        trial.report(sum(scores) / len(scores), fold)
        if trial.should_prune():
            raise optuna.TrialPruned()
    
    return sum(scores) / len(scores)

study = optuna.create_study(
    direction='maximize',
    pruner=MedianPruner(n_startup_trials=10, n_warmup_steps=5)
)
study.optimize(objective, n_trials=100, timeout=3600)

print(f"Best params: {study.best_params}")
print(f"Best score: {study.best_value:.4f}")

Applications in AI/ML

ℹ️ Applications

Hyperparameter optimization is crucial in:

  • Deep Learning: Learning rate, batch size, architecture depth/width
  • Ensemble Methods: Number of estimators, max depth, min samples
  • SVMs: C parameter, kernel type, gamma
  • Neural Architecture Search: Layer types, connections, activation functions
  • Reinforcement Learning: Discount factor, exploration rate, network architecture

Common Mistakes

MistakeDescriptionSolution
Using grid search for many parametersExponential cost explosionUse random search or Bayesian optimization
Not setting a random seedResults not reproducibleAlways set random_state or seed
Ignoring conditional parametersWasting evaluations on invalid configsUse TPE or define conditional spaces
Evaluating too few configurationsMissing good solutionsStart with 20-50 random trials
Not using early stoppingTraining poor configs for full durationUse successive halving or patience
Overfitting validation setTuning too aggressively on small validation setsUse nested cross-validation
Not scaling resourcesUsing same budget for all configsUse multi-fidelity methods

Interview Questions

Q: When should you use Bayesian optimization over random search?

A: Use Bayesian optimization when each evaluation is expensive (e.g., deep learning models) and you have a limited budget. Random search is better for cheap evaluations or when you have many parallel resources.

Q: Explain the exploration-exploitation tradeoff in HPO.

A: Exploration means trying new, uncertain regions of the hyperparameter space to potentially find better configurations. Exploitation means focusing on regions that have shown good results. The tradeoff is balancing between trying new things and refining what works.

Q: What are the advantages of TPE over Gaussian Processes for HPO?

A: TPE handles conditional hyperparameters naturally, doesn't require kernel specification, scales better with observations, and can work with any metric. GPs provide better uncertainty estimates but struggle with high dimensions.

Q: How does Hyperband improve upon successive halving?

A: Hyperband runs multiple brackets with different starting points and resource allocations, reducing the risk of early elimination of good configurations. It's more robust to the unknown difficulty of the optimization problem.

Q: What is the role of acquisition functions in Bayesian optimization?

A: Acquisition functions guide the search by balancing exploration (high uncertainty regions) and exploitation (high predicted value regions). Common choices include EI, UCB, and PI.


Practice Problems

πŸ“Problem 1: Grid Search Complexity

You have 4 hyperparameters with 5 possible values each. Grid search requires evaluating 625 configurations. If each evaluation takes 2 minutes, how long does the complete grid search take?

πŸ’‘Solution 1

Total time = 625 configurations Γ— 2 minutes = 1250 minutes = 20.83 hours

πŸ“Problem 2: Expected Improvement

A GP model predicts ΞΌ(x)=0.85\mu(x) = 0.85, Οƒ(x)=0.05\sigma(x) = 0.05 at a test point. The current best validation score is 0.80. Calculate the expected improvement at this point.

πŸ’‘Solution 2

Given: ΞΌ(x)=0.85\mu(x) = 0.85, Οƒ(x)=0.05\sigma(x) = 0.05, fbest=0.80f_{\text{best}} = 0.80

First, calculate ZZ:

Z=ΞΌ(x)βˆ’fbestΟƒ(x)=0.85βˆ’0.800.05=1.0Z = \frac{\mu(x) - f_{\text{best}}}{\sigma(x)} = \frac{0.85 - 0.80}{0.05} = 1.0

Using standard normal tables:

Ξ¦(1.0)β‰ˆ0.8413\Phi(1.0) \approx 0.8413Ο•(1.0)β‰ˆ0.2420\phi(1.0) \approx 0.2420EI(x)=(0.85βˆ’0.80)Γ—0.8413+0.05Γ—0.2420\text{EI}(x) = (0.85 - 0.80) \times 0.8413 + 0.05 \times 0.2420EI(x)=0.05Γ—0.8413+0.05Γ—0.2420\text{EI}(x) = 0.05 \times 0.8413 + 0.05 \times 0.2420EI(x)=0.0421+0.0121=0.0542\text{EI}(x) = 0.0421 + 0.0121 = 0.0542

πŸ“Problem 3: Successive Halving

You have 16 configurations and a total budget of 800 iterations. How many rounds does successive halving need, and how many resources does each configuration get in round 2?

πŸ’‘Solution 3

Number of rounds: k=log⁑2(16)+1=4+1=5k = \log_2(16) + 1 = 4 + 1 = 5 rounds

Resources per configuration at round ii:

ri=Bβ‹…2in=800β‹…2i16r_i = B \cdot \frac{2^i}{n} = 800 \cdot \frac{2^i}{16}

At round 2 (i=2i = 2): r2=800β‹…2216=800β‹…416=800β‹…0.25=200r_2 = 800 \cdot \frac{2^2}{16} = 800 \cdot \frac{4}{16} = 800 \cdot 0.25 = 200 iterations

Configurations surviving to round 2: n2=16β‹…2βˆ’2=16β‹…0.25=4n_2 = 16 \cdot 2^{-2} = 16 \cdot 0.25 = 4 configurations

πŸ“Problem 4: Early Stopping Patience

A model's validation loss over 20 epochs is: [0.5, 0.45, 0.42, 0.40, 0.39, 0.38, 0.385, 0.39, 0.395, 0.40, 0.41, 0.42, 0.43, 0.44, 0.45, 0.46, 0.47, 0.48, 0.49, 0.50]

With patience = 5, at what epoch should training stop?

πŸ’‘Solution 4

Minimum validation loss: 0.38 at epoch 6

Epoch 7: 0.385 > 0.38 (no improvement, counter = 1) Epoch 8: 0.39 > 0.38 (no improvement, counter = 2) Epoch 9: 0.395 > 0.38 (no improvement, counter = 3) Epoch 10: 0.40 > 0.38 (no improvement, counter = 4) Epoch 11: 0.41 > 0.38 (no improvement, counter = 5)

At epoch 11, patience of 5 is reached. Training should stop at epoch 11.


Quick Reference

πŸ“‹Quick Reference

  • Grid Search: Exhaustive, exponential cost, good for small spaces
  • Random Search: Samples randomly, more efficient than grid search
  • Bayesian Optimization: Model-based, sample-efficient, uses acquisition functions
  • TPE: Handles conditional parameters, doesn't need kernel specification
  • Successive Halving: Multi-fidelity, eliminates poor configurations early
  • Hyperband: Robust extension of successive halving with multiple brackets
  • Early Stopping: Prevents overfitting, saves training time

Cross-References

  • Related Topics: Gradient Descent, Regularization, Cross-Validation, Ensemble Methods
  • Prerequisites: Linear Algebra, Probability, Statistics, Machine Learning Basics
  • Advanced Topics: Neural Architecture Search, Meta-Learning, AutoML Pipelines
  • Libraries: Optuna, Scikit-optimize, Hyperopt, BOHB, Ray Tune
  • Applications β€” Deep Learning, Reinforcement Learning, Computer Vision, NLP
Lesson Progress69 / 100