Hyperparameter Tuning: GridSearch, Random Search & Optuna

Hyperparameters vs Parameters

Understanding the difference between hyperparameters and parameters is crucial for model optimization.

Definitions

┌─────────────────────────────────────────────────────────────────┐
│                      MODEL COMPONENTS                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  PARAMETERS (Learned from Data):                                │
│  ├── Linear Regression: coefficients (β), intercept            │
│  ├── Neural Networks: weights, biases                           │
│  ├── Decision Trees: split thresholds                          │
│  └── SVM: support vectors                                      │
│                                                                 │
│  HYPERPARAMETERS (Set Before Training):                         │
│  ├── Linear Regression: None (or regularization α)             │
│  ├── Neural Networks: learning rate, layers, neurons           │
│  ├── Decision Trees: max_depth, min_samples_split              │
│  └── SVM: C, kernel, gamma                                     │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
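As a minimal sketch of the split shown in the diagram, the snippet below fits a scikit-learn `Ridge` model. The synthetic dataset and the value `alpha=1.0` are illustrative assumptions, not part of the original material. `alpha` is chosen before training (a hyperparameter), while `coef_` and `intercept_` only exist after `fit` (parameters).

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Synthetic regression data (illustrative assumption)
X_demo, y_demo = make_regression(n_samples=200, n_features=3, noise=5.0, random_state=0)

model = Ridge(alpha=1.0)  # hyperparameter: set by us before training
had_coef_before_fit = hasattr(model, "coef_")
print("Hyperparameter alpha:", model.get_params()["alpha"])
print("Learned coefficients exist before fit?", had_coef_before_fit)

model.fit(X_demo, y_demo)  # parameters: estimated from the data
print("Learned coefficients:", model.coef_)
print("Learned intercept:", model.intercept_)
```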

Mathematical Perspective

For a model $f_\theta$ with parameters $\theta$:

Parameters $\theta$ are learned by minimizing the loss: $\theta^* = \arg\min_\theta L(\theta)$
Hyperparameters $\lambda$ control the learning process: $\theta^*(\lambda) = \arg\min_\theta L(\theta; \lambda)$

The goal of hyperparameter tuning is:

$$\lambda^* = \arg\min_{\lambda} \; \text{CV}\left(L(\theta^*(\lambda))\right)$$
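The two nested minimizations can be sketched in a few lines. This is only an illustration under assumptions: the model (`Ridge`), the synthetic data, and the candidate list `candidate_lambdas` are hypothetical choices. The inner problem (fitting $\theta^*(\lambda)$) happens inside `cross_val_score`; the outer problem is the `min` over candidates.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

candidate_lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]  # hypothetical search set
cv_losses = {}
for lam in candidate_lambdas:
    # Inner problem: each CV fold fits theta*(lambda) on its training part
    scores = cross_val_score(Ridge(alpha=lam), X_demo, y_demo, cv=5,
                             scoring="neg_mean_squared_error")
    cv_losses[lam] = -scores.mean()  # sklearn reports negative MSE, so flip the sign

# Outer problem: keep the lambda with the smallest cross-validated loss
best_lambda = min(cv_losses, key=cv_losses.get)
print(cv_losses)
print("lambda* =", best_lambda)
```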

Grid Search

The most straightforward approach: exhaustively try all combinations.

Visual Representation

Hyperparameter Space for SVM:

  gamma
    ↑
  1.0 │  ●     ●     ●     ●
      │
  0.1 │  ●     ●     ●     ●
      │
 0.01 │  ●     ●     ●     ●
      │
0.001 │  ●     ●     ●     ●
      └──────────────────────→ C
        0.1   1.0   10   100

Total combinations: 4 × 4 = 16 configurations to evaluate (each one is trained once per cross-validation fold)
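To make the count concrete, the grid in the diagram can be enumerated with scikit-learn's `ParameterGrid`. This is a small sketch; the dictionary name `grid` is just an illustrative choice.

```python
from sklearn.model_selection import ParameterGrid

# Same values as the diagram: 4 settings of C, 4 settings of gamma
grid = {
    "C": [0.1, 1.0, 10, 100],
    "gamma": [0.001, 0.01, 0.1, 1.0],
}
combos = list(ParameterGrid(grid))
print(len(combos))   # 4 x 4 = 16
print(combos[:3])    # each entry is one configuration to train
```

Adding a third hyperparameter with 4 values would multiply the total to 64, which is why grids become expensive quickly.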

Complete Grid Search Implementation

📝Grid Search Cross-Validation

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import time

np.random.seed(42)
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=10,
    n_classes=2, random_state=42
)

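# Define parameter grid for SVM.
# Keys use the "<step>__<param>" form because the estimator is a Pipeline
# whose SVC step is named 'svm'; bare names like 'C' would raise an error.
param_grid_svm = {
    'svm__C': [0.01, 0.1, 1, 10, 100],
    'svm__gamma': [0.001, 0.01, 0.1, 1],
    'svm__kernel': ['rbf']
}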

print(f"Total combinations: {np.prod([len(v) for v in param_grid_svm.values()])}")

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC(random_state=42))
])

start_time = time.time()
grid_search = GridSearchCV(
    pipeline, param_grid_svm, cv=5, scoring='accuracy',
    n_jobs=-1, verbose=1, return_train_score=True
)
grid_search.fit(X, y)
grid_time = time.time() - start_time

print(f"\nBest parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.4f}")
print(f"Time taken: {grid_time:.2f}s")

Random Search

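Random Search evaluates a fixed number of configurations sampled from the search space, which usually makes it more efficient than Grid Search when that space is large: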

💡 Why Random Search is Often Better

Random Search is often more efficient because:

Budget-constrained: With a fixed number of trials, Random Search tries more distinct values of each hyperparameter than a grid does
Continuous parameters: It can sample from distributions, not just from a fixed list of grid values
Unimportant parameters: If only a few hyperparameters matter, no trials are wasted repeating the same value of an important one while only an unimportant one changes
Mathematical insight: For $d$ dimensions with budget $n$, Random Search tests up to $n$ distinct values per dimension, whereas a full grid tests only $n^{1/d}$ values per dimension
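The per-dimension argument can be checked numerically. The sketch below assumes a budget of 16 trials in 2 dimensions on the unit square (illustrative values) and counts how many distinct values each strategy tries along the first axis.

```python
import numpy as np

rng = np.random.default_rng(0)
budget, dims = 16, 2

# Grid: budget**(1/dims) = 4 values per axis, fully crossed
axis_values = np.linspace(0.0, 1.0, round(budget ** (1 / dims)))
grid_points = np.array([(a, b) for a in axis_values for b in axis_values])

# Random: 16 independent draws from the unit square
random_points = rng.uniform(0.0, 1.0, size=(budget, dims))

grid_unique = len(np.unique(grid_points[:, 0]))
random_unique = len(np.unique(random_points[:, 0]))
print("distinct values tried on axis 0, grid:  ", grid_unique)
print("distinct values tried on axis 0, random:", random_unique)
```

If only the first axis affects the score, the grid has effectively run 4 experiments and the random sample 16, for the same cost.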

from scipy.stats import uniform, randint, loguniform

param_distributions = {
    'svm__C': loguniform(0.01, 100),
    'svm__gamma': loguniform(0.001, 1),
    'svm__kernel': ['rbf', 'linear']
}

random_search = RandomizedSearchCV(
    pipeline, param_distributions, n_iter=50, cv=5,
    scoring='accuracy', n_jobs=-1, random_state=42
)

start_time = time.time()
random_search.fit(X, y)
random_time = time.time() - start_time

print(f"Random Search (50 iterations): {random_search.best_score_:.4f} ({random_time:.2f}s)")
print(f"Best params: {random_search.best_params_}")
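The search above draws `C` and `gamma` from `loguniform` rather than a plain uniform distribution. A short sketch of why, using the same range as `svm__C` (the sample size and variable names are illustrative assumptions): a log-uniform draw spends equal effort on each order of magnitude, while a uniform draw almost never visits small values.

```python
import numpy as np
from scipy.stats import loguniform, uniform

n_draws = 10_000
log_samples = loguniform(0.01, 100).rvs(size=n_draws, random_state=0)
# scipy's uniform(loc, scale) covers the interval [loc, loc + scale]
lin_samples = uniform(loc=0.01, scale=99.99).rvs(size=n_draws, random_state=0)

frac_log_below_1 = np.mean(log_samples < 1.0)
frac_lin_below_1 = np.mean(lin_samples < 1.0)
print(f"share of draws below 1.0, log-uniform: {frac_log_below_1:.2f}")
print(f"share of draws below 1.0, uniform:     {frac_lin_below_1:.2f}")
```

About half of the log-uniform draws fall in [0.01, 1), compared with roughly 1% of the uniform draws.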

Bayesian Optimization with Optuna

Optuna's default sampler is the Tree-structured Parzen Estimator (TPE), a Bayesian optimization method that uses the results of earlier trials to decide which hyperparameters to try next.

How Optuna Works

Optuna Optimization Process:

Iteration 1: Random sampling → Score = 0.75
Iteration 2: Random sampling → Score = 0.82
Iteration 3: TPE suggests → Score = 0.85 (explores promising area)
Iteration 4: TPE suggests → Score = 0.87 (refines search)
Iteration 5: TPE suggests → Score = 0.86 (explores alternative)
   ...
Iteration 50: TPE suggests → Score = 0.91 (converges)

Key Features:
• Pruning: Stop bad trials early
• Conditional parameters: Different params for different conditions
• Dashboard: Visual monitoring of optimization

Complete Optuna Implementation

try:
    import optuna
    from optuna.samplers import TPESampler
    OPTUNA_AVAILABLE = True
except ImportError:
    OPTUNA_AVAILABLE = False
    print("Optuna not installed. Install with: pip install optuna")

if OPTUNA_AVAILABLE:
    optuna.logging.set_verbosity(optuna.logging.WARNING)
    
    def objective(trial):
        classifier_name = trial.suggest_categorical('classifier', ['SVM', 'RF', 'GBM'])
        
        if classifier_name == 'SVM':
            C = trial.suggest_float('svm__C', 0.01, 100, log=True)
            gamma = trial.suggest_float('svm__gamma', 0.001, 1, log=True)
            kernel = trial.suggest_categorical('svm__kernel', ['rbf', 'linear'])
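            # Scale inside a Pipeline so each CV fold is standardized without leakage
            clf = Pipeline([
                ('scaler', StandardScaler()),
                ('svm', SVC(C=C, gamma=gamma, kernel=kernel, random_state=42))
            ])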
            
        elif classifier_name == 'RF':
            n_estimators = trial.suggest_int('rf__n_estimators', 50, 300)
            max_depth = trial.suggest_int('rf__max_depth', 3, 20)
            min_samples_split = trial.suggest_int('rf__min_samples_split', 2, 20)
            clf = RandomForestClassifier(
                n_estimators=n_estimators, max_depth=max_depth,
                min_samples_split=min_samples_split, random_state=42, n_jobs=-1
            )
            
        else:
            n_estimators = trial.suggest_int('gbm__n_estimators', 50, 300)
            learning_rate = trial.suggest_float('gbm__learning_rate', 0.01, 0.3, log=True)
            max_depth = trial.suggest_int('gbm__max_depth', 3, 10)
            from sklearn.ensemble import GradientBoostingClassifier
            clf = GradientBoostingClassifier(
                n_estimators=n_estimators, learning_rate=learning_rate,
                max_depth=max_depth, random_state=42
            )
        
        # cross_val_score was never imported earlier, so import it here as well
        from sklearn.model_selection import StratifiedKFold, cross_val_score
        cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
        scores = cross_val_score(clf, X, y, cv=cv, scoring='accuracy', n_jobs=-1)
        return scores.mean()
    
    study = optuna.create_study(direction='maximize', sampler=TPESampler(seed=42))
    study.optimize(objective, n_trials=50, show_progress_bar=True)
    
    print(f"Best trial:")
    print(f"  Value (CV Score): {study.best_trial.value:.4f}")
    print(f"  Params: {study.best_trial.params}")
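Pruning is listed as a key feature, but an objective that returns a single cross-validation score gives the pruner nothing to act on. The following is a minimal sketch of how pruning could look, assuming an iteratively trained model (`SGDClassifier` with `partial_fit`), a single hold-out split, and `MedianPruner`. The names `pruned_objective` and `pruning_study`, the data, and all numeric settings are illustrative assumptions.

```python
try:
    import optuna
    OPTUNA_AVAILABLE = True
except ImportError:
    OPTUNA_AVAILABLE = False

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

def pruned_objective(trial):
    alpha = trial.suggest_float("alpha", 1e-6, 1e-1, log=True)
    clf = SGDClassifier(alpha=alpha, random_state=0)
    score = 0.0
    for epoch in range(20):
        clf.partial_fit(X_tr, y_tr, classes=np.array([0, 1]))  # one pass over the data
        score = clf.score(X_val, y_val)
        trial.report(score, step=epoch)   # send the intermediate value to Optuna
        if trial.should_prune():          # pruner compares it with earlier trials
            raise optuna.TrialPruned()
    return score

if OPTUNA_AVAILABLE:
    optuna.logging.set_verbosity(optuna.logging.WARNING)
    pruning_study = optuna.create_study(
        direction="maximize",
        sampler=optuna.samplers.TPESampler(seed=0),
        pruner=optuna.pruners.MedianPruner(n_startup_trials=5, n_warmup_steps=3),
    )
    pruning_study.optimize(pruned_objective, n_trials=20)
    n_pruned = sum(t.state == optuna.trial.TrialState.PRUNED for t in pruning_study.trials)
    print(f"Best validation accuracy: {pruning_study.best_value:.4f}")
    print(f"Trials pruned early: {n_pruned} of {len(pruning_study.trials)}")
```

How many trials get pruned depends on the data and the pruner settings, so the count printed here should be read as an observation, not a guarantee.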

Resource Allocation Strategies

Parallelization Strategies

Strategy 1: Independent Jobs (Easy)
┌─────────────────────────────────────┐
│ Job 1: Trial 1 ──────────────────→ │
│ Job 2: Trial 2 ──────────────────→ │
│ Job 3: Trial 3 ──────────────────→ │
│ Job 4: Trial 4 ──────────────────→ │
└─────────────────────────────────────┘
• Each trial independent
• Easy to parallelize
• Wastes resources on bad trials

Strategy 2: Successive Halving (Efficient)
┌─────────────────────────────────────┐
│ All 16 trials (small budget)        │
│ ├── Top 8 advance ────────────────→ │
│ │   ├── Top 4 advance ────────────→ │
│ │   │   ├── Top 2 advance ────────→ │
│ │   │   │   └── Winner!            │
└─────────────────────────────────────┘
• Early stopping for bad configs
• More efficient resource use
• Requires careful budget allocation

Successive Halving

Successive Halving Budget

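$$B_k = B_0 \cdot \eta^{k-K}$$

Here,

$B_k$ = Budget given to each surviving configuration in round $k$ ($k = 0, \dots, K$)
$B_0$ = Full budget (the formula gives $B_K = B_0$ in the final round; earlier rounds receive the fraction $\eta^{k-K}$ of it)
$\eta$ = Elimination factor (typically 2 or 3)
$K$ = Number of elimination rounds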
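A worked schedule helps when reading the formula. The sketch below assumes that $B_0$ is the budget a configuration receives in the final round ($k = K$), which is what the formula implies, and uses hypothetical numbers: 16 starting configurations, $\eta = 2$, $K = 4$, and $B_0 = 1600$ (for example, training samples).

```python
eta = 2          # elimination factor
K = 4            # number of elimination rounds
B_0 = 1600       # hypothetical full budget per configuration
n_configs = 16   # starting number of configurations

schedule = []
for k in range(K + 1):
    budget_k = B_0 * eta ** (k - K)     # B_k = B_0 * eta^(k - K)
    survivors = n_configs // eta ** k   # keep the top 1/eta after each round
    schedule.append((k, survivors, budget_k))
    print(f"round {k}: {survivors:2d} configs x budget {budget_k:6.0f} "
          f"= {survivors * budget_k:6.0f} total")
```

With these numbers every round costs the same in total (1600), because the budget per configuration doubles exactly as the number of survivors halves.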

# This import is required: it enables the experimental Halving* search classes
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingGridSearchCV, HalvingRandomSearchCV

param_grid = {
    'n_estimators': [50, 100, 200, 300],
    'max_depth': [3, 5, 7, 10],
    'min_samples_split': [2, 5, 10]
}

rf = RandomForestClassifier(random_state=42)

halving_search = HalvingGridSearchCV(
    rf, param_grid, cv=5, scoring='accuracy',
    n_jobs=-1, factor=2, random_state=42
)

start = time.time()
halving_search.fit(X, y)
halving_time = time.time() - start

print(f"Halving Grid Search:")
print(f"  Time: {halving_time:.2f}s")
print(f"  Best score: {halving_search.best_score_:.4f}")
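After fitting, a halving search exposes how many candidates and how many resources each iteration used, through the `n_candidates_` and `n_resources_` attributes. The sketch below is a smaller, self-contained run for inspecting them; the decision tree, the 8-candidate grid, and the dataset are illustrative assumptions chosen to keep it fast.

```python
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=1000, n_features=20, random_state=0)

small_grid = {"max_depth": [2, 4, 6, 8], "min_samples_split": [2, 10]}  # 8 candidates
demo_search = HalvingGridSearchCV(
    DecisionTreeClassifier(random_state=0), small_grid,
    cv=3, factor=2, random_state=0,
)
demo_search.fit(X_demo, y_demo)

# By default the resource is the number of training samples
for i, (n_cand, n_res) in enumerate(
    zip(demo_search.n_candidates_, demo_search.n_resources_)
):
    print(f"iteration {i}: {n_cand} candidates, {n_res} samples each")
print("best params:", demo_search.best_params_)
```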

Best Practices

When to Use Each Method

Method	When to Use	Pros	Cons
Grid Search	Small parameter space, few params	Exhaustive, simple	Slow for large spaces
Random Search	Large or continuous parameter spaces	Fast, good coverage	May miss the optimum
Optuna	Complex or conditional spaces, limited budget	Adaptive search, pruning	Extra dependency
Halving	Many combinations, limited compute	Very efficient	Can discard good configurations that score poorly on small budgets

Practical Tips

ℹ️ Hyperparameter Tuning Best Practices

START COARSE, THEN REFINE: First pass with wide ranges, second pass narrow around best region
USE LOG SCALE: For learning rates (0.001 to 0.1) and regularization (0.001 to 100)
MONITOR OVERFITTING: Compare train vs CV scores; large gap = overfitting
SET RANDOM SEEDS: Ensures reproducibility and fair comparison
USE PIPELINES: Prevents data leakage; proper preprocessing in CV
ALLOCATE BUDGET WISELY: Start with fewer CV folds, increase for final selection
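The "monitor overfitting" tip can be sketched with `return_train_score=True`, which adds training scores to `cv_results_`. The example assumes a decision tree and a synthetic dataset (both illustrative) and prints the train/CV gap for each `max_depth`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)

depth_search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": [1, 3, 5, 10, None]},
    cv=5, scoring="accuracy", return_train_score=True,
)
depth_search.fit(X_demo, y_demo)

results = depth_search.cv_results_
gaps = results["mean_train_score"] - results["mean_test_score"]
for params, train, test, gap in zip(
    results["params"], results["mean_train_score"], results["mean_test_score"], gaps
):
    print(f"{params}: train={train:.3f}  cv={test:.3f}  gap={gap:.3f}")
```

A gap that keeps growing as the model gets more flexible, while the CV score stalls or drops, is the pattern to watch for.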

Key Takeaways

📋Summary: Hyperparameter Tuning

Grid Search is simple, but its cost grows exponentially with the number of hyperparameters
Random Search is often more efficient than Grid Search
Optuna uses Bayesian optimization for intelligent search with pruning
Halving strategies dramatically reduce computation
Always use pipelines to prevent data leakage
Start coarse, then refine the search space
Monitor train vs CV scores for overfitting
Use log scale for learning rates and regularization

Practice Exercises

Exercise 1: Grid vs Random Search

# Compare Grid Search and Random Search on:
# 1. SVM with 3 hyperparameters
# 2. Random Forest with 4 hyperparameters
# 3. Gradient Boosting with 4 hyperparameters
# Measure: Time taken, Best score, Number of evaluations

Exercise 2: Optuna Optimization

# Use Optuna to optimize a Gradient Boosting model:
# 1. Define search space with conditional parameters
# 2. Implement pruning
# 3. Visualize optimization history
# 4. Analyze parameter importance
# 5. Compare with Grid/Random Search

Exercise 3: Resource Allocation

# Implement and compare:
# 1. Standard Grid Search (5-fold CV)
# 2. Halving Grid Search
# 3. Halving Random Search
# On a dataset with 1000+ samples and 10+ features.

Exercise 4: Practical Optimization Pipeline

from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)

# Build a complete hyperparameter optimization pipeline:
# 1. Split into train/test (80/20)
# 2. Try 3 different algorithms
# 3. For each: Use Optuna with 100 trials + pruning
# 4. Select best model overall
# 5. Evaluate on test set
# 6. Compare CV estimate with test performance

Summary

Aspect	Grid Search	Random Search	Optuna	Halving
Strategy	Exhaustive	Random sampling	Bayesian (TPE)	Progressive elimination
Speed	Slow	Fast	Fast	Very Fast
Effectiveness	Good	Good	Often best	Good
Complexity	Low	Low	Medium	Medium
Best For	Small spaces	Large spaces	Complex spaces	Many combinations
