Decision Trees: Gini vs Entropy, Pruning & Feature Importance
Understanding the building blocks of ensemble methods
Interview Question
"Compare Gini impurity and entropy as splitting criteria for decision trees. What is pruning and why is it necessary? How do you interpret feature importance in a decision tree?"
Difficulty: Medium | Frequently asked at Amazon, Netflix, Google
Theoretical Foundation
What is a Decision Tree?
A decision tree is a recursive, partition-based model that splits the feature space into regions and makes predictions based on the majority class (classification) or mean value (regression) in each region.
The tree is built by:
- Selecting the best feature to split on
- Dividing the data based on that feature
- Recursively applying steps 1-2 to each subset
- Stopping when a termination criterion is met
Splitting Criteria
The goal is to find splits that create the most "pure" child nodes. Three main criteria exist:
1. Gini Impurity
where is the proportion of class in node .
Properties:
- Range: for binary classification
- Minimum (0) when all samples belong to one class
- Maximum (0.5) when classes are perfectly balanced
- Measures the probability of misclassifying a randomly chosen element
2. Entropy (Information Gain)
Information Gain:
Properties:
- Range: for binary classification
- Minimum (0) when all samples belong to one class
- Maximum (1) when classes are perfectly balanced
- Based on information theory (Shannon entropy)
3. Classification Error
Not commonly used for tree building because it's not differentiable and less sensitive to node purity.
βΉοΈ
Key Insight: Gini and Entropy produce very similar trees in practice. The main difference is that Entropy tends to produce slightly more balanced trees because it's more sensitive to changes in class probabilities.
Gini vs Entropy: Detailed Comparison
| Aspect | Gini Impurity | Entropy |
|---|---|---|
| Range (binary) | [0, 0.5] | [0, 1] |
| Computation | Faster (no logarithm) | Slower (logarithm) |
| Sensitivity | Less sensitive to small probability changes | More sensitive |
| Tree Structure | Tends to isolate largest class | Tends to produce balanced trees |
| Bias | Slight bias toward large classes | Less bias |
| scikit-learn Default | Yes (criterion='gini') | No |
| XGBoost Default | Yes | No |
β οΈ
Interview Trap: Many candidates claim Gini is always faster than Entropy. While this is generally true (no logarithm computation), the difference is negligible for most practical datasets. The choice should be based on the specific problem, not computational considerations.
Pruning: Preventing Overfitting
Decision trees are prone to overfitting because they can grow arbitrarily deep, creating rules that capture noise in the training data. Pruning reduces tree complexity.
Pre-Pruning (Early Stopping)
Stop growing the tree before it becomes too complex:
- max_depth: Maximum depth of the tree
- min_samples_split: Minimum samples required to split a node
- min_samples_leaf: Minimum samples required in a leaf node
- max_leaf_nodes: Maximum number of leaf nodes
- min_impurity_decrease: Minimum impurity decrease required for a split
Post-Pruning
Grow the full tree first, then remove branches that don't improve validation performance:
- Cost Complexity Pruning (CCP): Minimizes where is the error and is the number of leaves
- Reduced Error Pruning: Remove nodes that don't improve validation accuracy
π‘
Production Tip: In practice, pre-pruning with cross-validation is more common because it's computationally cheaper. Post-pruning is theoretically more principled but requires growing the full tree first.
Feature Importance
Decision trees provide feature importance measures:
Impurity-Based Importance (Gini Importance)
where is the impurity decrease at node .
Limitations:
- Biased toward high-cardinality features
- Can be misleading when features are correlated
- Not normalized by default (sums to 1 in scikit-learn)
Permutation-Based Importance
Measures the decrease in model performance when a feature's values are randomly shuffled:
Advantages:
- Model-agnostic
- Less biased toward high-cardinality features
- Captures feature interactions
Code Implementation
Explanation of Code
-
Gini vs Entropy: Compares the two criteria on the same dataset, showing they produce similar results but may differ in tree structure.
-
Pruning: Demonstrates how different pruning parameters affect model complexity and generalization.
-
Feature Importance: Compares impurity-based and permutation-based importance measures, showing their differences.
-
Tree Visualization: Shows how to visualize and interpret the tree structure.
-
Decision Boundary: Illustrates the axis-parallel decision boundaries characteristic of decision trees.
Real-World Applications
Amazon: Product Recommendation
Amazon uses decision trees (within ensemble methods) for:
- Customer Segmentation: Grouping customers based on purchase behavior
- Churn Prediction: Identifying customers likely to cancel Prime
- Dynamic Pricing: Tree-based rules for price optimization
Netflix: Content Classification
Netflix uses decision trees for:
- Genre Classification: Categorizing content based on viewing patterns
- Personalization: Tree-based rules for UI customization
- A/B Test Analysis: Segmenting users for experiment analysis
π‘
Amazon Interview Tip: Be prepared to discuss how decision trees handle missing values. Mention that XGBoost and LightGBM have built-in handling for missing values, while scikit-learn requires imputation.
Common Follow-Up Questions
Q1: Why don't decision trees usually outperform ensemble methods?
Single decision trees have high variance because small changes in data can lead to completely different tree structures. Ensemble methods (Random Forest, Gradient Boosting) reduce variance by combining multiple trees.
Q2: How do decision trees handle categorical variables?
Scikit-learn decision trees only handle numerical data. Categorical variables must be encoded (one-hot or ordinal). However, some implementations (CatBoost, LightGBM) handle categories natively.
Q3: What is the computational complexity of building a decision tree?
where is samples and is features. At each node, we evaluate all features and all possible thresholds.
Q4: How do you handle imbalanced datasets with decision trees?
- Use
class_weight='balanced'to adjust the loss function - Use SMOTE for oversampling
- Adjust the decision threshold
- Use appropriate evaluation metrics (F1, AUC-PR)
Company-Specific Tips
Amazon Interview Tips
- Emphasize interpretability as a key advantage
- Discuss rule-based systems that can be derived from trees
- Be ready to explain why pruning is necessary (bias-variance tradeoff)
- Mention production considerations (model size, inference speed)
Netflix Interview Tips
- Focus on how trees handle non-linear relationships
- Discuss feature engineering for categorical features
- Be prepared to explain ensemble methods built on trees
- Mention scalability concerns with very large datasets