Hyperparameter Tuning: GridSearch, Random Search & Optuna

Hyperparameters vs Parameters

Parameters are learned from the data during training; hyperparameters are set before training and control how that learning happens. Keeping the two apart is the starting point for model optimization.

Definitions

MODEL COMPONENTS

PARAMETERS (Learned from Data):
  • Linear Regression: coefficients (β), intercept
  • Neural Networks: weights, biases
  • Decision Trees: split thresholds
  • SVM: support vectors

HYPERPARAMETERS (Set Before Training):
  • Linear Regression: None (or regularization α)
  • Neural Networks: learning rate, layers, neurons
  • Decision Trees: max_depth, min_samples_split
  • SVM: C, kernel, gamma
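
The distinction can be seen directly on a scikit-learn estimator. The sketch below is illustrative only: `LogisticRegression`, the synthetic dataset, and the names `X_demo` and `y_demo` are assumptions chosen for brevity, not part of the lesson's own examples.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Small synthetic dataset (illustrative assumption).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Hyperparameters: chosen by us and passed to the constructor before training.
model = LogisticRegression(C=0.5, max_iter=500)
hyperparams = model.get_params()
print("Hyperparameter C:", hyperparams["C"])

# Parameters: they do not exist until fit() learns them from the data.
has_coef_before_fit = hasattr(model, "coef_")
model.fit(X_demo, y_demo)
print("coef_ present before fit:", has_coef_before_fit)
print("Learned coefficients shape:", model.coef_.shape)
print("Learned intercept:", model.intercept_)
```

`get_params()` lists what was set by hand, while attributes ending in an underscore (`coef_`, `intercept_`) hold what was learned.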

Mathematical Perspective

For a model $f_\theta$ with parameters $\theta$:

  • Parameters $\theta$ are learned by minimizing the loss: $\theta^* = \arg\min_\theta L(\theta)$
  • Hyperparameters $\lambda$ control the learning process: $\theta^*(\lambda) = \arg\min_\theta L(\theta; \lambda)$

The goal of hyperparameter tuning is:

$$\lambda^* = \arg\min_{\lambda} \; \text{CV}\left(L(\theta^*(\lambda))\right)$$
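
Written as a loop, this nested objective is an outer search over candidate hyperparameter values with an inner training step for each one. The following is a minimal sketch under assumptions: the hyperparameter is taken to be `C` of `LogisticRegression`, the loss is taken as 1 minus accuracy, and the candidate list is arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# Candidate hyperparameter values (assumed; here lambda plays the role of C).
candidate_lambdas = [0.01, 0.1, 1.0, 10.0]

cv_losses = []
for lam in candidate_lambdas:
    model = LogisticRegression(C=lam, max_iter=1000)
    # Inner problem: each fold's fit() finds theta*(lambda) on its training split.
    acc = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    # Outer objective: cross-validated loss, taken here as 1 - mean accuracy.
    cv_losses.append(1.0 - acc.mean())

# lambda* is the candidate with the lowest cross-validated loss.
best_lambda = candidate_lambdas[int(np.argmin(cv_losses))]
print("CV losses:", [round(loss, 4) for loss in cv_losses])
print("Best lambda:", best_lambda)
```

Every method in this lesson is a different strategy for choosing which values of lambda to try in the outer loop.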

Grid Search

The most straightforward approach: exhaustively try all combinations.

Visual Representation

Hyperparameter Space for SVM:

  gamma
    ↑
  1.0 │  ●     ●     ●     ●
      │
  0.1 │  ●     ●     ●     ●
      │
 0.01 │  ●     ●     ●     ●
      │
0.001 │  ●     ●     ●     ●
      └──────────────────────→ C
        0.1   1.0   10   100

Total combinations: 4 × 4 = 16 models to train
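
The count above can be checked by enumerating the grid. This short sketch uses scikit-learn's `ParameterGrid` with the values from the diagram; the third hyperparameter added at the end is a hypothetical extension to show how quickly the grid grows.

```python
from sklearn.model_selection import ParameterGrid

# The 4 x 4 grid from the diagram.
grid = ParameterGrid({
    "C": [0.1, 1.0, 10, 100],
    "gamma": [0.001, 0.01, 0.1, 1.0],
})
combos = list(grid)
print("Combinations:", len(combos))

# Adding one more hyperparameter multiplies the count (hypothetical extension).
bigger_grid = ParameterGrid({
    "C": [0.1, 1.0, 10, 100],
    "gamma": [0.001, 0.01, 0.1, 1.0],
    "kernel": ["rbf", "poly", "sigmoid"],
})
print("With a third hyperparameter:", len(bigger_grid))
```

With 5-fold cross-validation, each combination is trained five times, so the number of model fits is five times the grid size.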

Complete Grid Search Implementation

📝 Grid Search Cross-Validation

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import time

np.random.seed(42)
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=10,
    n_classes=2, random_state=42
)

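# Define parameter grid for SVM.
# Keys use the '<step>__<param>' form because the estimator passed to
# GridSearchCV is a Pipeline whose SVC step is named 'svm'.
param_grid_svm = {
    'svm__C': [0.01, 0.1, 1, 10, 100],
    'svm__gamma': [0.001, 0.01, 0.1, 1],
    'svm__kernel': ['rbf']
}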

print(f"Total combinations: {np.prod([len(v) for v in param_grid_svm.values()])}")

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC(random_state=42))
])

start_time = time.time()
grid_search = GridSearchCV(
    pipeline, param_grid_svm, cv=5, scoring='accuracy',
    n_jobs=-1, verbose=1, return_train_score=True
)
grid_search.fit(X, y)
grid_time = time.time() - start_time

print(f"\nBest parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.4f}")
print(f"Time taken: {grid_time:.2f}s")

Random Search

Random Search evaluates a fixed number of randomly sampled configurations, which is often more efficient than Grid Search in large parameter spaces:

💡 Why Random Search is Often Better

Random Search is more efficient because:

  1. Budget control: The number of trials is set directly (n_iter), independent of how many parameters are searched
  2. Continuous parameters: Can sample from distributions, not just grids
  3. Unimportant parameters: If only a few parameters matter, every random trial still tests a new value of the important ones, whereas a grid repeats the same few values while varying the unimportant ones
  4. Mathematical insight: For $d$ dimensions with budget $n$, Random Search tries up to $n$ distinct values of each parameter, whereas a grid tries only $n^{1/d}$ distinct values per dimension

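
The last point can be checked numerically. The sketch below assumes a budget of 16 trials over 2 hyperparameters and a log-uniform range of 0.001 to 1 for both; these numbers are chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_budget, n_dims = 16, 2

# Grid: n^(1/d) = 4 values per axis, combined exhaustively into 16 points.
per_axis = round(n_budget ** (1 / n_dims))
axis_values = np.logspace(-3, 0, per_axis)
grid_points = np.array([(a, b) for a in axis_values for b in axis_values])

# Random: 16 independent log-uniform draws for each axis.
random_points = 10 ** rng.uniform(-3, 0, size=(n_budget, n_dims))

# Count how many distinct values of the first hyperparameter each method tried.
grid_distinct = len(np.unique(grid_points[:, 0]))
random_distinct = len(np.unique(random_points[:, 0]))
print("Grid: distinct values of parameter 1:", grid_distinct)
print("Random: distinct values of parameter 1:", random_distinct)
```

If the second hyperparameter turned out to be irrelevant, the grid would have spent 16 fits learning about only 4 values of the one that matters.
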
from scipy.stats import uniform, randint, loguniform

param_distributions = {
    'svm__C': loguniform(0.01, 100),
    'svm__gamma': loguniform(0.001, 1),
    'svm__kernel': ['rbf', 'linear']
}

random_search = RandomizedSearchCV(
    pipeline, param_distributions, n_iter=50, cv=5,
    scoring='accuracy', n_jobs=-1, random_state=42
)

start_time = time.time()
random_search.fit(X, y)
random_time = time.time() - start_time

print(f"Random Search (50 iterations): {random_search.best_score_:.4f} ({random_time:.2f}s)")
print(f"Best params: {random_search.best_params_}")

Bayesian Optimization with Optuna

By default, Optuna uses the Tree-structured Parzen Estimator (TPE), a form of Bayesian optimization, to concentrate new trials in the regions of the hyperparameter space that have scored well so far.

How Optuna Works

Optuna Optimization Process:

Iteration 1: Random sampling → Score = 0.75
Iteration 2: Random sampling → Score = 0.82
Iteration 3: TPE suggests → Score = 0.85 (explores promising area)
Iteration 4: TPE suggests → Score = 0.87 (refines search)
Iteration 5: TPE suggests → Score = 0.86 (explores alternative)
   ...
Iteration 50: TPE suggests → Score = 0.91 (converges)

Key Features:
  • Pruning: Stop bad trials early
  • Conditional parameters: Different params for different conditions
  • Dashboard: Visual monitoring of optimization
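
Pruning needs an objective that reports intermediate scores, so that Optuna can compare a running trial against earlier ones. The sketch below is one possible setup, not the lesson's own: the incrementally trained `SGDClassifier`, the 20 training passes, and the `MedianPruner` settings are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

try:
    import optuna
    OPTUNA_AVAILABLE = True
except ImportError:
    OPTUNA_AVAILABLE = False

X_demo, y_demo = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_demo, y_demo, random_state=0)

def objective(trial):
    alpha = trial.suggest_float("alpha", 1e-5, 1e-1, log=True)
    clf = SGDClassifier(alpha=alpha, random_state=0)
    score = 0.0
    for step in range(20):
        # One incremental pass over the training data per step.
        clf.partial_fit(X_tr, y_tr, classes=np.unique(y_tr))
        score = clf.score(X_val, y_val)
        # Report the intermediate score so the pruner can judge this trial.
        trial.report(score, step)
        if trial.should_prune():
            raise optuna.TrialPruned()
    return score

if OPTUNA_AVAILABLE:
    optuna.logging.set_verbosity(optuna.logging.WARNING)
    study = optuna.create_study(
        direction="maximize",
        sampler=optuna.samplers.TPESampler(seed=0),
        # Never prune the first 5 trials or the first 3 steps of any trial.
        pruner=optuna.pruners.MedianPruner(n_startup_trials=5, n_warmup_steps=3),
    )
    study.optimize(objective, n_trials=20)
    n_pruned = sum(t.state == optuna.trial.TrialState.PRUNED for t in study.trials)
    print(f"Best score: {study.best_value:.4f}, pruned trials: {n_pruned}")
```

An objective that only returns a final cross-validation score gives the pruner nothing to act on, so pruning applies only to models that are trained in steps.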

Complete Optuna Implementation

try:
    import optuna
    from optuna.samplers import TPESampler
    OPTUNA_AVAILABLE = True
except ImportError:
    OPTUNA_AVAILABLE = False
    print("Optuna not installed. Install with: pip install optuna")

if OPTUNA_AVAILABLE:
    optuna.logging.set_verbosity(optuna.logging.WARNING)
    
    def objective(trial):
        classifier_name = trial.suggest_categorical('classifier', ['SVM', 'RF', 'GBM'])
        
        if classifier_name == 'SVM':
            C = trial.suggest_float('svm__C', 0.01, 100, log=True)
            gamma = trial.suggest_float('svm__gamma', 0.001, 1, log=True)
            kernel = trial.suggest_categorical('svm__kernel', ['rbf', 'linear'])
            clf = SVC(C=C, gamma=gamma, kernel=kernel, random_state=42)
            
        elif classifier_name == 'RF':
            n_estimators = trial.suggest_int('rf__n_estimators', 50, 300)
            max_depth = trial.suggest_int('rf__max_depth', 3, 20)
            min_samples_split = trial.suggest_int('rf__min_samples_split', 2, 20)
            clf = RandomForestClassifier(
                n_estimators=n_estimators, max_depth=max_depth,
                min_samples_split=min_samples_split, random_state=42, n_jobs=-1
            )
            
        else:
            n_estimators = trial.suggest_int('gbm__n_estimators', 50, 300)
            learning_rate = trial.suggest_float('gbm__learning_rate', 0.01, 0.3, log=True)
            max_depth = trial.suggest_int('gbm__max_depth', 3, 10)
            from sklearn.ensemble import GradientBoostingClassifier
            clf = GradientBoostingClassifier(
                n_estimators=n_estimators, learning_rate=learning_rate,
                max_depth=max_depth, random_state=42
            )
        
        # cross_val_score is used below and was not imported earlier
        from sklearn.model_selection import StratifiedKFold, cross_val_score
        cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
        scores = cross_val_score(clf, X, y, cv=cv, scoring='accuracy', n_jobs=-1)
        return scores.mean()
    
    study = optuna.create_study(direction='maximize', sampler=TPESampler(seed=42))
    study.optimize(objective, n_trials=50, show_progress_bar=True)
    
    print(f"Best trial:")
    print(f"  Value (CV Score): {study.best_trial.value:.4f}")
    print(f"  Params: {study.best_trial.params}")

Resource Allocation Strategies

Parallelization Strategies

Strategy 1: Independent Jobs (Easy)
┌─────────────────────────────────────┐
│ Job 1: Trial 1 ───────────────────→ │
│ Job 2: Trial 2 ───────────────────→ │
│ Job 3: Trial 3 ───────────────────→ │
│ Job 4: Trial 4 ───────────────────→ │
└─────────────────────────────────────┘
  • Each trial independent
  • Easy to parallelize
  • Wastes resources on bad trials

Strategy 2: Successive Halving (Efficient)
┌─────────────────────────────────────┐
│ All 16 trials (small budget)        │
│ ├── Top 8 advance ────────────────→ │
│ │   ├── Top 4 advance ────────────→ │
│ │   │   ├── Top 2 advance ────────→ │
│ │   │   │   └── Winner!             │
└─────────────────────────────────────┘
  • Early stopping for bad configs
  • More efficient resource use
  • Requires careful budget allocation
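
The elimination schedule in Strategy 2 can be sketched in plain Python. Everything here is a toy: the function `successive_halving`, the hidden `qualities` table, and the noise model are hypothetical and exist only to show the 16 → 8 → 4 → 2 → 1 pattern.

```python
import random

def successive_halving(configs, evaluate, min_budget=1, factor=2):
    """Keep the top 1/factor of configs each round while multiplying the budget."""
    survivors = list(configs)
    budget = min_budget
    history = []  # (number of configs evaluated, budget per config) for each round
    while len(survivors) > 1:
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget), reverse=True)
        history.append((len(survivors), budget))
        survivors = ranked[: max(1, len(survivors) // factor)]
        budget *= factor
    return survivors[0], history

# Toy objective (assumption): each config has a hidden quality, and a larger
# budget gives a less noisy estimate of it.
rng = random.Random(0)
qualities = {i: i / 15 for i in range(16)}

def noisy_score(config, budget):
    return qualities[config] + rng.gauss(0, 0.05 / budget)

winner, history = successive_halving(range(16), noisy_score)
print("Rounds (configs, budget each):", history)
print("Winner:", winner)
```

Each round costs roughly the same total budget (16×1, 8×2, 4×4, 2×8), which is why most of the compute ends up on the strongest candidates.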

Successive Halving

Successive Halving Budget

$$B_k = B_0 \cdot \eta^{k-K}$$

Here,

  • $B_0$ = Initial total budget
  • $\eta$ = Elimination factor (typically 2 or 3)
  • $K$ = Number of rounds

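
Reading the formula literally, and assuming the round index k runs from 0 to K, the budget is multiplied by the elimination factor each round and equals B_0 when k = K. The helper `halving_budgets` and the numbers below are hypothetical, chosen so the arithmetic is easy to follow.

```python
def halving_budgets(b0, eta, n_rounds):
    """Return B_k = b0 * eta**(k - n_rounds) for k = 0..n_rounds (assumed index range)."""
    return [b0 * eta ** (k - n_rounds) for k in range(n_rounds + 1)]

# Example values (assumed): B_0 = 1600, eta = 2, K = 4.
budgets = halving_budgets(b0=1600, eta=2, n_rounds=4)
for k, budget in enumerate(budgets):
    print(f"round k={k}: budget = {budget:g}")
```

In scikit-learn's halving searches the resource is the number of training samples by default, and `factor` plays the role of the elimination factor.
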
from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import HalvingGridSearchCV, HalvingRandomSearchCV

param_grid = {
    'n_estimators': [50, 100, 200, 300],
    'max_depth': [3, 5, 7, 10],
    'min_samples_split': [2, 5, 10]
}

rf = RandomForestClassifier(random_state=42)

halving_search = HalvingGridSearchCV(
    rf, param_grid, cv=5, scoring='accuracy',
    n_jobs=-1, factor=2, random_state=42
)

start = time.time()
halving_search.fit(X, y)
halving_time = time.time() - start

print(f"Halving Grid Search:")
print(f"  Time: {halving_time:.2f}s")
print(f"  Best score: {halving_search.best_score_:.4f}")

Best Practices

When to Use Each Method

| Method | When to Use | Pros | Cons |
| --- | --- | --- | --- |
| Grid Search | Small parameter space, few params | Exhaustive, simple | Slow for large spaces |
| Random Search | Large parameter spaces | Fast, good coverage | Might miss optimal |
| Optuna | Complex spaces, limited budget | Smart search, pruning | Requires library |
| Halving | Many combinations, limited compute | Very efficient | Less stable |

Practical Tips

ℹ️ Hyperparameter Tuning Best Practices

  1. START COARSE, THEN REFINE: First pass with wide ranges, second pass narrow around best region
  2. USE LOG SCALE: For learning rates (0.001 to 0.1) and regularization (0.001 to 100)
  3. MONITOR OVERFITTING: Compare train vs CV scores; large gap = overfitting
  4. SET RANDOM SEEDS: Ensures reproducibility and fair comparison
  5. USE PIPELINES: Prevents data leakage; proper preprocessing in CV
  6. ALLOCATE BUDGET WISELY: Start with fewer CV folds, increase for final selection
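
Tip 3 can be applied with `return_train_score=True`, which makes the search record training scores next to the cross-validation scores. The sketch below is a minimal illustration; the decision tree, the depth values, and the synthetic data are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": [2, 4, 8, None]},
    cv=5, scoring="accuracy", return_train_score=True,
)
search.fit(X_demo, y_demo)

# A large train-minus-CV gap suggests the configuration is overfitting.
results = search.cv_results_
gaps = results["mean_train_score"] - results["mean_test_score"]
for depth, gap in zip(results["param_max_depth"], gaps):
    print(f"max_depth={depth}: train - CV gap = {gap:.3f}")
```

Comparing the gap across configurations shows whether a higher CV score came from a better model or merely a more flexible one.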

Key Takeaways

📋 Summary: Hyperparameter Tuning

  1. Grid Search is simple, but its cost grows exponentially with the number of hyperparameters
  2. Random Search is often more efficient than Grid Search
  3. Optuna uses Bayesian optimization for intelligent search with pruning
  4. Halving strategies dramatically reduce computation
  5. Always use pipelines to prevent data leakage
  6. Start coarse, then refine the search space
  7. Monitor train vs CV scores for overfitting
  8. Use log scale for learning rates and regularization

Practice Exercises

Exercise 1: Grid vs Random Search

# Compare Grid Search and Random Search on:
# 1. SVM with 3 hyperparameters
# 2. Random Forest with 4 hyperparameters
# 3. Gradient Boosting with 4 hyperparameters
# Measure: Time taken, Best score, Number of evaluations

Exercise 2: Optuna Optimization

# Use Optuna to optimize a Gradient Boosting model:
# 1. Define search space with conditional parameters
# 2. Implement pruning
# 3. Visualize optimization history
# 4. Analyze parameter importance
# 5. Compare with Grid/Random Search

Exercise 3: Resource Allocation

# Implement and compare:
# 1. Standard Grid Search (5-fold CV)
# 2. Halving Grid Search
# 3. Halving Random Search
# On a dataset with 1000+ samples and 10+ features.

Exercise 4: Practical Optimization Pipeline

from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)

# Build a complete hyperparameter optimization pipeline:
# 1. Split into train/test (80/20)
# 2. Try 3 different algorithms
# 3. For each: Use Optuna with 100 trials + pruning
# 4. Select best model overall
# 5. Evaluate on test set
# 6. Compare CV estimate with test performance

Summary

| Aspect | Grid Search | Random Search | Optuna | Halving |
| --- | --- | --- | --- | --- |
| Strategy | Exhaustive | Random sampling | Bayesian (TPE) | Progressive elimination |
| Speed | Slow | Fast | Fast | Very Fast |
| Effectiveness | Good | Good | Best | Good |
| Complexity | Low | Low | Medium | Medium |
| Best For | Small spaces | Large spaces | Complex spaces | Many combinations |
