Ensemble Methods — Complete Guide
Ensemble methods combine the predictions of multiple models, which often yields better predictions than any of the individual models alone.
Types of Ensembles
Bagging (Bootstrap Aggregating):
├─ Train models INDEPENDENTLY on different data samples
├─ Combine by averaging/voting
├─ Reduces variance
└─ Example: Random Forest
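The bagging recipe above can be sketched by hand to show roughly what a bagging ensemble automates. This is a minimal illustration, not scikit-learn's actual implementation; the synthetic dataset, the choice of 25 trees, and all variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (an assumption for illustration)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
n_models = 25  # odd, so a 0/1 majority vote can never tie
trees = []
for _ in range(n_models):
    # Bootstrap sample: draw as many rows as the training set, with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(random_state=0)
    trees.append(tree.fit(X_train[idx], y_train[idx]))

# Each row holds one tree's predictions for the test set
all_preds = np.array([t.predict(X_test) for t in trees])

# Majority vote: labels are 0/1, so the column mean is the share of votes for class 1
bagged_pred = (all_preds.mean(axis=0) >= 0.5).astype(int)

single_acc = (trees[0].predict(X_test) == y_test).mean()
bagged_acc = (bagged_pred == y_test).mean()
print(f"single tree: {single_acc:.3f}, bagged: {bagged_acc:.3f}")
```

Each tree is trained without reference to the others, which is why bagging parallelizes easily.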
Boosting:
├─ Train models SEQUENTIALLY
├─ Each model corrects previous errors
├─ Primarily reduces bias; can also reduce variance
└─ Examples: XGBoost, AdaBoost
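The idea that each model corrects the errors of the previous ones can be sketched with a gradient-boosting-style loop for regression under squared loss. This is a simplified illustration, not the algorithm used by XGBoost or AdaBoost; the toy sine data, the learning rate, and the number of rounds are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression problem: noisy sine wave (an assumption for illustration)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
n_rounds = 50

# Start from a constant prediction: the mean of the targets
prediction = np.full_like(y, y.mean())
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(n_rounds):
    # Residuals are the errors the ensemble still makes
    residuals = y - prediction
    # A depth-1 tree (stump) is fit to those errors, not to the raw targets
    stump = DecisionTreeRegressor(max_depth=1).fit(X, residuals)
    # Add a damped version of the correction to the running prediction
    prediction += learning_rate * stump.predict(X)
    mse_history.append(np.mean((y - prediction) ** 2))

print(f"training MSE: {mse_history[0]:.4f} -> {mse_history[-1]:.4f}")
```

Unlike bagging, round k cannot start until round k-1 has finished, because its training target depends on the current ensemble's residuals.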
Stacking:
├─ Train different model types
├─ Train a meta-learner to combine predictions
├─ Uses diverse base models
└─ Common in top machine learning competition solutions
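The stacking steps above can be sketched manually with out-of-fold predictions. This is a minimal sketch under stated assumptions: the synthetic data, the two base models (a random forest and a k-nearest-neighbors classifier), and the 5-fold split are illustrative choices, not a prescribed setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=600, n_features=12, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Two different model types as base learners (an assumption for illustration)
base_models = [
    RandomForestClassifier(n_estimators=50, random_state=0),
    KNeighborsClassifier(),
]

# Out-of-fold probabilities: every training row is predicted by a model
# that did not see that row, so the meta-learner is not fit on leaked labels
meta_train = np.column_stack([
    cross_val_predict(m, X_train, y_train, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])

# Refit each base model on the full training set to produce test-time features
meta_test = np.column_stack([
    m.fit(X_train, y_train).predict_proba(X_test)[:, 1] for m in base_models
])

# The meta-learner sees only the base models' predictions
meta_learner = LogisticRegression().fit(meta_train, y_train)
stacked_acc = meta_learner.score(meta_test, y_test)
print(f"stacked accuracy: {stacked_acc:.3f}")
```

Training the meta-learner on in-sample base predictions instead would reward whichever base model overfits most, which is the main pitfall this layout avoids.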
Voting:
├─ Hard voting: majority wins
├─ Soft voting: average probabilities
└─ Simple but effective
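Hard and soft voting can disagree on the same inputs, which a small worked example makes concrete. The probabilities below are hypothetical values chosen to show the difference, not output from real models.

```python
import numpy as np

# Hypothetical predicted probabilities of class 1:
# three classifiers (rows) scoring four samples (columns)
proba = np.array([
    [0.90, 0.40, 0.45, 0.20],
    [0.60, 0.45, 0.45, 0.30],
    [0.30, 0.95, 0.90, 0.10],
])

# Hard voting: each model casts a 0/1 vote, and at least 2 of 3 must agree
votes = (proba >= 0.5).astype(int)
hard = (votes.sum(axis=0) >= 2).astype(int)

# Soft voting: average the probabilities first, then threshold
soft = (proba.mean(axis=0) >= 0.5).astype(int)

print("hard:", hard)  # hard: [1 0 0 0]
print("soft:", soft)  # soft: [1 1 1 0]
```

For the second and third samples, two models lean slightly toward class 0 while one is very confident in class 1. Hard voting ignores that confidence; soft voting lets it tip the average to 0.6.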
Python Implementation
from sklearn.ensemble import (
    VotingClassifier, StackingClassifier,
    BaggingClassifier, AdaBoostClassifier,
    RandomForestClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
import xgboost as xgb  # third-party package, needed for XGBClassifier below

# Voting: soft voting averages predicted probabilities,
# so SVC needs probability=True to expose predict_proba
voting = VotingClassifier(estimators=[
    ('lr', LogisticRegression()),
    ('rf', RandomForestClassifier()),
    ('svc', SVC(probability=True))
], voting='soft')

# Bagging: 100 decision trees, each fit on a bootstrap sample
# (the parameter is `estimator` in scikit-learn >= 1.2;
# older releases call it `base_estimator`)
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=100, random_state=42
)

# Boosting: AdaBoost over decision stumps (depth-1 trees)
boosting = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),
    n_estimators=100, learning_rate=0.1
)

# Stacking: a logistic regression meta-learner combines three base models
stacking = StackingClassifier(estimators=[
    ('rf', RandomForestClassifier()),
    ('svm', SVC(probability=True)),
    ('xgb', xgb.XGBClassifier())
], final_estimator=LogisticRegression())
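The estimators above are only constructed, not trained. A short usage sketch shows one way to fit and compare such ensembles with cross-validation. The synthetic dataset, the reduced estimator counts, and the `models` and `scores` names are assumptions for illustration. The base estimator is passed positionally so the snippet does not depend on whether the installed scikit-learn names that parameter `estimator` or `base_estimator`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier, BaggingClassifier,
    RandomForestClassifier, VotingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for a real dataset
X, y = make_classification(n_samples=400, n_features=10, random_state=42)

models = {
    "single tree": DecisionTreeClassifier(random_state=42),
    "bagging": BaggingClassifier(
        DecisionTreeClassifier(), n_estimators=50, random_state=42
    ),
    "boosting": AdaBoostClassifier(
        DecisionTreeClassifier(max_depth=1), n_estimators=50, random_state=42
    ),
    "voting": VotingClassifier(
        estimators=[
            ("lr", LogisticRegression(max_iter=1000)),
            ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
        ],
        voting="soft",
    ),
}

# 5-fold cross-validated accuracy for each model
scores = {}
for name, model in models.items():
    scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name:12s} {scores[name]:.3f}")
```

Including a single decision tree as a baseline makes it easier to judge whether an ensemble's extra cost buys anything on a given dataset.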
Key Takeaways
- Ensembles often outperform their individual member models, though the gain is not guaranteed
- Bagging reduces variance (Random Forest)
- Boosting primarily reduces bias and can also reduce variance (XGBoost)
- Stacking combines different model types
- Diversity between models is key to ensemble success
- Voting is simple and makes a good baseline ensemble
- XGBoost is one of the most widely used boosting libraries
- Ensembles trade interpretability for performance
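Why diversity matters can be illustrated with a small simulation. If M models each make errors with unit variance and pairwise correlation rho, the variance of their averaged error is rho + (1 - rho) / M, so highly correlated models gain little from averaging. The simulation below is a sketch under that idealized assumption (Gaussian errors, equal variances, equal correlations); real model errors are messier.

```python
import numpy as np

rng = np.random.default_rng(0)
n_models, n_trials = 10, 100_000

results = {}
for rho in (0.0, 0.5, 0.9):
    # Error component shared by all models in a trial
    shared = rng.normal(size=(n_trials, 1))
    # Error component unique to each model
    private = rng.normal(size=(n_trials, n_models))
    # Each model's error has variance 1 and pairwise correlation rho
    errors = np.sqrt(rho) * shared + np.sqrt(1 - rho) * private
    # Variance of the averaged (ensembled) error across trials
    empirical = errors.mean(axis=1).var()
    theory = rho + (1 - rho) / n_models
    results[rho] = (empirical, theory)
    print(f"rho={rho:.1f}  empirical={empirical:.3f}  theory={theory:.3f}")
```

With uncorrelated errors, averaging 10 models cuts the variance to about a tenth; at rho = 0.9 it stays above 0.9, no matter how many models are added.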