Model Selection & Hyperparameter Tuning
Choosing the right model and tuning it properly are both crucial for ML success.
Algorithm Selection
Quick Guide:
Small dataset (<1K samples):
├─ SVM with RBF kernel
├─ KNN
├─ Naive Bayes
└─ Random Forest
Medium dataset (1K-100K samples):
├─ XGBoost / LightGBM
├─ Random Forest
├─ Neural Networks (simple)
└─ SVM with linear kernel
Large dataset (>100K samples):
├─ XGBoost / LightGBM
├─ Neural Networks
├─ Linear models
└─ SGDClassifier
High dimensional (features > samples):
├─ Linear models (L1/L2)
├─ SVM
└─ Naive Bayes
Interpretability needed:
├─ Decision Trees
├─ Linear/Logistic Regression
└─ Rule-based models
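A guide like this only narrows the shortlist; the candidates still need to be compared on the data at hand. The following is a minimal sketch of such a comparison, assuming a binary classification task and using a synthetic dataset from `make_classification` as a stand-in for real data. The candidate set and the names `candidates` and `cv_scores` are illustrative choices, not part of the guide itself.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a real dataset (assumption: binary classification).
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)

# A trivial baseline plus two shortlisted models.
candidates = {
    "baseline (most frequent)": DummyClassifier(strategy="most_frequent"),
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Mean 5-fold cross-validated accuracy for each candidate.
cv_scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in candidates.items()
}

# Print candidates from best to worst.
for name, score in sorted(cv_scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:26s} {score:.3f}")
```

Including a trivial baseline makes it obvious whether a more complex model is actually learning anything beyond the class frequencies.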
Hyperparameter Tuning
Grid Search:
├─ Tries EVERY combination in the grid
├─ Guaranteed to find the best point in the grid (not necessarily the global best)
├─ Cost grows exponentially with the number of hyperparameters
└─ Use for small parameter spaces
Random Search:
├─ Samples random combinations
├─ Often finds good results faster
├─ Better use of a fixed budget
└─ Default choice for most cases
Bayesian Optimization:
├─ Uses past results to guide the search
├─ Usually needs the fewest trials
├─ Best for models that are expensive to train
└─ Use a library such as Optuna
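The cost of grid search can be made concrete by counting model fits. The sketch below uses scikit-learn's `ParameterGrid` on a small random-forest grid; the grid values, the extra `max_features` entry, and the 5-fold setting are illustrative assumptions.

```python
from sklearn.model_selection import ParameterGrid

# Illustrative grid: 3 x 4 x 3 values.
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [5, 10, 20, None],
    "min_samples_split": [2, 5, 10],
}
cv_folds = 5

# Each combination is fitted once per cross-validation fold.
n_combinations = len(ParameterGrid(param_grid))
n_fits = n_combinations * cv_folds
print(n_combinations, n_fits)  # → 36 180

# Adding one more hyperparameter with 3 values triples the cost.
bigger_grid = {**param_grid, "max_features": ["sqrt", "log2", None]}
n_fits_bigger = len(ParameterGrid(bigger_grid)) * cv_folds
print(n_fits_bigger)  # → 540
```

Random search with `n_iter=20` on the same 5 folds would cost 100 fits regardless of how many hyperparameters are added, which is why it scales better to larger spaces.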
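from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier

# X_train, y_train: training features and labels, assumed to be defined already

# Grid Search: evaluates all 3 * 4 * 3 = 36 combinations
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 20, None],
    'min_samples_split': [2, 5, 10],
}
grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)
print(f"Best: {grid.best_params_}")

# Random Search (faster): samples 20 of the 36 combinations
random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42), param_grid,
    n_iter=20, cv=5, scoring='accuracy', random_state=42,
)
random_search.fit(X_train, y_train)
print(f"Best: {random_search.best_params_}")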
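Random search is not limited to a fixed list of values: `RandomizedSearchCV` also accepts distributions to sample from, so each trial can explore a value that a grid would never contain. Here is a minimal sketch using `scipy.stats.randint`, with a synthetic dataset standing in for real data; the ranges, `n_iter=10`, and `cv=3` are assumptions chosen to keep the example fast.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for real training data.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# randint(low, high) samples integers from low up to high - 1.
param_distributions = {
    "n_estimators": randint(50, 201),     # integers in [50, 200]
    "max_depth": randint(3, 21),          # integers in [3, 20]
    "min_samples_split": randint(2, 11),  # integers in [2, 10]
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions,
    n_iter=10,         # total budget: 10 sampled configurations
    cv=3,
    scoring="accuracy",
    random_state=0,    # makes the sampled configurations reproducible
)
search.fit(X, y)

print(search.best_params_)
print(f"best CV accuracy: {search.best_score_:.3f}")
```

The budget is set directly through `n_iter`, independent of how wide the ranges are.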
Optuna (Bayesian Optimization)
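import optuna
import xgboost as xgb
from sklearn.model_selection import cross_val_score

# X, y: features and labels, assumed to be defined already

def objective(trial):
    # Each trial samples one candidate configuration
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 50, 300),
        'max_depth': trial.suggest_int('max_depth', 3, 20),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
    }
    model = xgb.XGBClassifier(**params)
    # Mean 5-fold CV score (accuracy by default for classifiers) is the value to maximize
    return cross_val_score(model, X, y, cv=5).mean()

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)
print(f"Best params: {study.best_params}")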
Key Takeaways
- Start with simple models as baselines
- Random search is usually a better default than grid search
- Bayesian optimization (e.g. Optuna) usually needs the fewest trials, which matters most when each fit is expensive
- Always use cross-validation for evaluation
- XGBoost/LightGBM are often the strongest models on tabular data
- Scale features for SVM, KNN, and neural networks
- Feature engineering often matters more than model choice
- Ensembling multiple models often gives the best performance
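Several of these takeaways can be combined in one short sketch: scaling inside a pipeline (so the scaler is fit on each training fold only), ensembling, and cross-validated evaluation. The example below is one possible arrangement, assuming a synthetic binary classification dataset; the choice of base models and of hard (majority) voting is illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for a real dataset.
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)

# Distance- and margin-based models get a scaler; the tree ensemble does not need one.
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
forest = RandomForestClassifier(n_estimators=100, random_state=0)

# Hard voting: each model casts one vote and the majority class wins.
ensemble = VotingClassifier(
    estimators=[("svm", svm), ("knn", knn), ("forest", forest)],
    voting="hard",
)

# The whole ensemble, scalers included, is refit inside every CV fold.
ensemble_score = cross_val_score(ensemble, X, y, cv=5, scoring="accuracy").mean()
print(f"ensemble CV accuracy: {ensemble_score:.3f}")
```

Keeping the scalers inside the pipelines avoids leaking statistics from the validation fold into training, which would make the cross-validation estimate optimistic.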