Amazon & Netflix Interview

Decision Trees: Gini vs Entropy, Pruning & Feature Importance

Understanding the building blocks of ensemble methods

Interview Question

"Compare Gini impurity and entropy as splitting criteria for decision trees. What is pruning and why is it necessary? How do you interpret feature importance in a decision tree?"

Difficulty: Medium | Frequently asked at Amazon, Netflix, Google

Theoretical Foundation

What is a Decision Tree?

A decision tree is a recursive, partition-based model that splits the feature space into regions and makes predictions based on the majority class (classification) or mean value (regression) in each region.

The tree is built by:

Selecting the best feature to split on
Dividing the data based on that feature
Recursively applying steps 1-2 to each subset
Stopping when a termination criterion is met

Splitting Criteria

The goal is to find splits that create the most "pure" child nodes. Three main criteria exist:

1. Gini Impurity

\text{Gini}(t) = 1 - \sum_{k=1}^{K} p_k^2

where $p_k$ is the proportion of class $k$ in node $t$ .

Properties:

Range: $[0, 0.5]$ for binary classification
Minimum (0) when all samples belong to one class
Maximum (0.5) when classes are perfectly balanced
Measures the probability of misclassifying a randomly chosen element

2. Entropy (Information Gain)

\text{Entropy}(t) = -\sum_{k=1}^{K} p_k \log_2(p_k)

Information Gain:

\text{IG}(t, \text{split}) = \text{Entropy}(t) - \sum_{i \in \{L,R\}} \frac{n_i}{n} \text{Entropy}(t_i)

Properties:

Range: $[0, 1]$ for binary classification
Minimum (0) when all samples belong to one class
Maximum (1) when classes are perfectly balanced
Based on information theory (Shannon entropy)

3. Classification Error

\text{Error}(t) = 1 - \max_k p_k

Not commonly used for tree building because it's not differentiable and less sensitive to node purity.

ℹ️

Key Insight: Gini and Entropy produce very similar trees in practice. The main difference is that Entropy tends to produce slightly more balanced trees because it's more sensitive to changes in class probabilities.

Gini vs Entropy: Detailed Comparison

Aspect	Gini Impurity	Entropy
Range (binary)	[0, 0.5]	[0, 1]
Computation	Faster (no logarithm)	Slower (logarithm)
Sensitivity	Less sensitive to small probability changes	More sensitive
Tree Structure	Tends to isolate largest class	Tends to produce balanced trees
Bias	Slight bias toward large classes	Less bias
scikit-learn Default	Yes (`criterion='gini'`)	No
XGBoost Default	Yes	No

⚠️

Interview Trap: Many candidates claim Gini is always faster than Entropy. While this is generally true (no logarithm computation), the difference is negligible for most practical datasets. The choice should be based on the specific problem, not computational considerations.

Pruning: Preventing Overfitting

Decision trees are prone to overfitting because they can grow arbitrarily deep, creating rules that capture noise in the training data. Pruning reduces tree complexity.

Pre-Pruning (Early Stopping)

Stop growing the tree before it becomes too complex:

max_depth: Maximum depth of the tree
min_samples_split: Minimum samples required to split a node
min_samples_leaf: Minimum samples required in a leaf node
max_leaf_nodes: Maximum number of leaf nodes
min_impurity_decrease: Minimum impurity decrease required for a split

Post-Pruning

Grow the full tree first, then remove branches that don't improve validation performance:

Cost Complexity Pruning (CCP): Minimizes $R_\alpha(T) = R(T) + \alpha |T|$ where $R(T)$ is the error and $|T|$ is the number of leaves
Reduced Error Pruning: Remove nodes that don't improve validation accuracy

💡

Production Tip: In practice, pre-pruning with cross-validation is more common because it's computationally cheaper. Post-pruning is theoretically more principled but requires growing the full tree first.

Feature Importance

Decision trees provide feature importance measures:

Impurity-Based Importance (Gini Importance)

\text{Importance}(j) = \sum_{t \in \text{nodes}} \mathbb{1}[\text{feature } j \text{ used at } t] \cdot \frac{n_t}{n} \cdot \Delta I(t)

where $\Delta I(t)$ is the impurity decrease at node $t$ .

Limitations:

Biased toward high-cardinality features
Can be misleading when features are correlated
Not normalized by default (sums to 1 in scikit-learn)

Permutation-Based Importance

Measures the decrease in model performance when a feature's values are randomly shuffled:

\text{Perm Importance}(j) = \text{Score}_{\text{original}} - \text{Score}_{\text{shuffled}}

Advantages:

Model-agnostic
Less biased toward high-cardinality features
Captures feature interactions

Code Implementation

Explanation of Code

Gini vs Entropy: Compares the two criteria on the same dataset, showing they produce similar results but may differ in tree structure.
Pruning: Demonstrates how different pruning parameters affect model complexity and generalization.
Feature Importance: Compares impurity-based and permutation-based importance measures, showing their differences.
Tree Visualization: Shows how to visualize and interpret the tree structure.
Decision Boundary: Illustrates the axis-parallel decision boundaries characteristic of decision trees.

Real-World Applications

Amazon: Product Recommendation

Amazon uses decision trees (within ensemble methods) for:

Customer Segmentation: Grouping customers based on purchase behavior
Churn Prediction: Identifying customers likely to cancel Prime
Dynamic Pricing: Tree-based rules for price optimization

Netflix: Content Classification

Netflix uses decision trees for:

Genre Classification: Categorizing content based on viewing patterns
Personalization: Tree-based rules for UI customization
A/B Test Analysis: Segmenting users for experiment analysis

💡

Amazon Interview Tip: Be prepared to discuss how decision trees handle missing values. Mention that XGBoost and LightGBM have built-in handling for missing values, while scikit-learn requires imputation.

Common Follow-Up Questions

Q1: Why don't decision trees usually outperform ensemble methods?

Single decision trees have high variance because small changes in data can lead to completely different tree structures. Ensemble methods (Random Forest, Gradient Boosting) reduce variance by combining multiple trees.

Q2: How do decision trees handle categorical variables?

Scikit-learn decision trees only handle numerical data. Categorical variables must be encoded (one-hot or ordinal). However, some implementations (CatBoost, LightGBM) handle categories natively.

Q3: What is the computational complexity of building a decision tree?

$O(n \cdot p \cdot \log n)$ where $n$ is samples and $p$ is features. At each node, we evaluate all features and all possible thresholds.

Q4: How do you handle imbalanced datasets with decision trees?

Use class_weight='balanced' to adjust the loss function
Use SMOTE for oversampling
Adjust the decision threshold
Use appropriate evaluation metrics (F1, AUC-PR)

Company-Specific Tips

Amazon Interview Tips

Emphasize interpretability as a key advantage
Discuss rule-based systems that can be derived from trees
Be ready to explain why pruning is necessary (bias-variance tradeoff)
Mention production considerations (model size, inference speed)

Netflix Interview Tips

Focus on how trees handle non-linear relationships
Discuss feature engineering for categorical features
Be prepared to explain ensemble methods built on trees
Mention scalability concerns with very large datasets

Decision Trees: Gini vs Entropy, Pruning & Feature Importance

Decision Trees: Gini vs Entropy, Pruning & Feature Importance

Interview Question

Theoretical Foundation

What is a Decision Tree?

Splitting Criteria

1. Gini Impurity

2. Entropy (Information Gain)

3. Classification Error

Gini vs Entropy: Detailed Comparison

Pruning: Preventing Overfitting

Pre-Pruning (Early Stopping)

Post-Pruning

Feature Importance

Impurity-Based Importance (Gini Importance)

Permutation-Based Importance

Code Implementation

Explanation of Code

Real-World Applications

Amazon: Product Recommendation

Netflix: Content Classification

Common Follow-Up Questions

Company-Specific Tips

Amazon Interview Tips

Netflix Interview Tips

Related Topics