Feature Engineering: Encoding, Scaling & Feature Selection
The art and science of creating predictive features
Interview Question
"Explain different encoding techniques for categorical variables. When would you use target encoding vs one-hot encoding? What are the best methods for feature selection and why?"
Difficulty: Medium | Frequently asked at Google, Apple, Amazon
Theoretical Foundation
Why Feature Engineering Matters
Feature engineering is often more important than model selection. Good features can:
- Improve model performance significantly
- Reduce training time
- Improve interpretability
- Reduce overfitting
Categorical Variable Encoding
1. Label Encoding
Maps each category to an integer, for example A → 0, B → 1, C → 2.
Pros: Simple, no dimensionality increase
Cons: Implies an ordinal relationship (A < B < C)
2. One-Hot Encoding
Creates one binary column per category; each row has a 1 in the column for its own category and 0 everywhere else.
Pros: No ordinal assumption, widely supported
Cons: High-cardinality features produce many columns (curse of dimensionality)
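As a minimal sketch of the two encodings above, the following plain-Python helpers (`label_encode` and `one_hot_encode` are illustrative names, not library functions) assign integers in sorted category order and build one binary column per category. The toy `colors` column is made up for the example.

```python
def label_encode(values):
    """Map each distinct category to an integer, in sorted order."""
    mapping = {cat: idx for idx, cat in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


def one_hot_encode(values):
    """Return one binary column per distinct category (sorted order)."""
    categories = sorted(set(values))
    rows = [[1 if v == cat else 0 for cat in categories] for v in values]
    return rows, categories


colors = ["red", "green", "blue", "green"]  # hypothetical toy column

labels, mapping = label_encode(colors)
onehot, categories = one_hot_encode(colors)

print(labels)      # [2, 1, 0, 1]
print(categories)  # ['blue', 'green', 'red']
print(onehot[0])   # [0, 0, 1]
```

The label-encoded column implies blue < green < red, which is arbitrary here; the one-hot rows carry no such ordering but need one column per category.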
3. Target Encoding
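Replaces each category with the mean of the target variable for that category, usually smoothed toward the global mean:

$$\text{enc}(c) = \frac{n_c \, \bar{y}_c + m \, \bar{y}}{n_c + m}$$

where $n_c$ is the number of rows in category $c$, $\bar{y}_c$ is the target mean within that category, $\bar{y}$ is the global target mean, and $m$ is a smoothing parameter.
Pros: Captures target relationship, handles high cardinality
Cons: Can cause target leakage, requires careful regularization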
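One common way to regularize target encoding is to blend each category's target mean with the global mean, weighting the global mean by a smoothing constant. The sketch below assumes that formulation; the function names and the toy data are hypothetical. To limit leakage, the mapping is fitted on training rows only and then applied to other rows.

```python
from collections import defaultdict


def fit_target_encoding(categories, targets, smoothing=10.0):
    """Learn a smoothed per-category target mean from training data only."""
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    # Blend the category mean with the global mean; rare categories
    # (small count) are pulled more strongly toward the global mean.
    encoding = {
        cat: (sums[cat] + smoothing * global_mean) / (counts[cat] + smoothing)
        for cat in counts
    }
    return encoding, global_mean


def apply_target_encoding(categories, encoding, global_mean):
    """Encode categories; unseen ones fall back to the global mean."""
    return [encoding.get(cat, global_mean) for cat in categories]


# Hypothetical toy training data.
train_city = ["a", "a", "a", "b"]
train_y = [1, 1, 0, 0]

encoding, global_mean = fit_target_encoding(train_city, train_y, smoothing=1.0)
test_encoded = apply_target_encoding(["a", "b", "zzz"], encoding, global_mean)

print(encoding["a"])  # 0.625
print(encoding["b"])  # 0.25
print(test_encoded)   # [0.625, 0.25, 0.5]
```

In a real pipeline the fit step would typically run inside each cross-validation fold, so that no row is encoded using its own target value.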
4. Frequency Encoding
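Replaces each category with its relative frequency in the training data, $\text{enc}(c) = n_c / N$, where $n_c$ is the number of rows in category $c$ and $N$ is the total number of rows.
Pros: Simple, captures popularity
Cons: Loses category-specific information (categories with equal frequency become indistinguishable)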
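A minimal sketch of frequency encoding, assuming relative frequencies computed on the training column and a fallback of 0.0 for unseen categories (both are choices made for this example, as are the names):

```python
from collections import Counter


def fit_frequency_encoding(values):
    """Map each category to its share of the training rows."""
    counts = Counter(values)
    total = len(values)
    return {cat: n / total for cat, n in counts.items()}


# Hypothetical toy column.
browsers = ["chrome", "chrome", "safari", "chrome", "firefox"]
freq = fit_frequency_encoding(browsers)

# "opera" never appeared in training, so it gets the fallback value 0.0.
encoded = [freq.get(b, 0.0) for b in browsers + ["opera"]]
print(encoded)  # [0.6, 0.6, 0.2, 0.6, 0.2, 0.0]
```

Note that `safari` and `firefox` both map to 0.2, which illustrates how this encoding can merge distinct categories.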
5. Binary Encoding
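Converts each category's integer code to its binary representation and stores one bit per column, so $k$ categories need about $\lceil \log_2 k \rceil$ columns.
Pros: Fewer columns than one-hot, no ordinal assumption
Cons: Less interpretable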
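The idea can be sketched in a few lines of plain Python. The helper name `binary_encode` and the toy `plans` column are assumptions for illustration; the integer assigned to each category is simply its position in sorted order.

```python
def binary_encode(values):
    """Encode each category's integer index as a fixed-width list of bits."""
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    # Bits needed to represent the largest index (at least one column).
    width = max(1, (len(categories) - 1).bit_length())
    rows = [
        [(index[v] >> bit) & 1 for bit in reversed(range(width))]
        for v in values
    ]
    return rows, width


plans = ["a", "b", "c", "d", "e"]  # hypothetical toy column
encoded, width = binary_encode(plans)

print(width)       # 3 columns instead of 5 one-hot columns
print(encoded[4])  # [1, 0, 0]  (index 4 written in binary)
```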
6. Hash Encoding
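Hashes each category into a fixed number of buckets, $\text{enc}(c) = h(c) \bmod k$, where $h$ is a hash function and $k$ is the number of buckets.
Pros: Fixed dimensionality, handles new categories
Cons: Hash collisions, less interpretable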
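A minimal sketch of hash encoding, assuming MD5 is used purely as a stable, non-cryptographic hash (Python's built-in `hash` is salted per process for strings, so it would not give reproducible buckets). The function names and bucket count are illustrative.

```python
import hashlib


def hash_bucket(value, n_buckets=8):
    """Map a string to a bucket index in [0, n_buckets) deterministically."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets


def hash_encode(value, n_buckets=8):
    """One-hot vector over buckets; distinct values may share a bucket."""
    vec = [0] * n_buckets
    vec[hash_bucket(value, n_buckets)] = 1
    return vec


# Works for categories never seen during training, e.g. a new user agent.
vector = hash_encode("some-brand-new-category")
print(len(vector), sum(vector))  # 8 1
```

The output width stays at `n_buckets` no matter how many distinct categories arrive; the price is that unrelated categories can collide in the same bucket.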
Encoding Comparison
| Method | Cardinality | Memory | Information | Use Case |
|---|---|---|---|---|
| Label | Any | Low | Ordinal only | Tree models |
| One-Hot | Low | High | Full | Linear models |
| Target | Any | Low | Target-aware | High cardinality |
| Frequency | Any | Low | Frequency only | Simple baselines |
| Binary | Any | Medium | Partial | Medium cardinality |
| Hash | Any | Fixed | Compressed | Very high cardinality |
Key Insight: The choice of encoding depends on the model. Tree models can handle label encoding because they don't assume linearity. Linear models require one-hot encoding to avoid false ordinal relationships.
Feature Scaling
1. Min-Max Scaling
- Formula: $x' = (x - x_{\min}) / (x_{\max} - x_{\min})$
- Range: $[0, 1]$
- Sensitive to outliers
- Best for neural networks
2. Standard Scaling (Z-score)
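- Formula: $z = (x - \mu) / \sigma$
- Range: unbounded, mean=0, std=1
- Less sensitive to outliers than min-max scaling, although the mean and standard deviation are still affected by them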
- Best for SVM, logistic regression
3. Robust Scaling
- Uses median and IQR instead of mean and std
- Robust to outliers
- Best for data with outliers
4. Normalization (L2)
- Unit norm vectors
- Best for text classification (TF-IDF)
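The four scalers above can be sketched with the standard library alone. The function names are hypothetical, the quartiles use the `inclusive` method of `statistics.quantiles` (other quantile conventions give slightly different IQRs), and the toy `values` column contains one deliberate outlier.

```python
import math
import statistics


def min_max_scale(xs):
    """Rescale to [0, 1] using the observed min and max."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def standard_scale(xs):
    """Subtract the mean and divide by the population standard deviation."""
    mu = statistics.fmean(xs)
    sigma = statistics.pstdev(xs)
    return [(x - mu) / sigma for x in xs]


def robust_scale(xs):
    """Subtract the median and divide by the interquartile range."""
    q1, median, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    return [(x - median) / (q3 - q1) for x in xs]


def l2_normalize(row):
    """Scale one sample (row vector) to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row]


values = [1.0, 2.0, 3.0, 4.0, 100.0]  # hypothetical column with an outlier

scaled_minmax = min_max_scale(values)
scaled_standard = standard_scale(values)
scaled_robust = robust_scale(values)

print(scaled_robust)             # [-1.0, -0.5, 0.0, 0.5, 48.5]
print(l2_normalize([3.0, 4.0]))  # [0.6, 0.8]
```

With min-max scaling the outlier maps to 1.0 and squeezes the other four values below 0.05, while robust scaling keeps them spread between -1 and 0.5. Note that L2 normalization works per row (sample), whereas the other three work per column (feature).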
Feature Selection Methods
1. Filter Methods
Select features based on statistical tests:
- Correlation: Remove highly correlated features
- Chi-Squared: Test independence with target
- Mutual Information: Measure dependency with target
- Variance Threshold: Remove low-variance features
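A minimal sketch of two of these filters, assuming a variance threshold followed by dropping the later column of any highly correlated pair. The thresholds, the helper names, and the toy columns are all illustrative choices.

```python
import math
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length columns."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def filter_features(columns, var_threshold=0.0, corr_threshold=0.95):
    """Drop (near-)constant columns, then drop redundant correlated ones."""
    kept = [
        name for name, col in columns.items()
        if statistics.pvariance(col) > var_threshold
    ]
    selected = []
    for name in kept:
        # Keep a column only if it is not too correlated with any kept one.
        if all(
            abs(pearson(columns[name], columns[other])) <= corr_threshold
            for other in selected
        ):
            selected.append(name)
    return selected


# Hypothetical toy columns.
columns = {
    "height_cm": [150.0, 160.0, 170.0, 180.0],
    "height_in": [59.1, 63.0, 66.9, 70.9],  # near-duplicate of height_cm
    "constant": [1.0, 1.0, 1.0, 1.0],       # zero variance
    "age": [30.0, 22.0, 41.0, 35.0],
}

print(filter_features(columns))  # ['height_cm', 'age']
```

Neither filter looks at the target or at any model, which is why this kind of screening is cheap.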
2. Wrapper Methods
Use model performance to select features:
- Forward Selection: Add features one by one
- Backward Elimination: Remove features one by one
- Recursive Feature Elimination (RFE): Iteratively remove least important
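Forward selection can be sketched as a greedy loop around any scoring function. The example below is an illustration under stated assumptions: the scorer is the training accuracy of a tiny nearest-centroid classifier (in practice a cross-validated score of the real model would normally be used), and the data, with one informative feature and one noisy feature, is made up.

```python
def centroid_accuracy(X, y, features):
    """Training accuracy of a nearest-centroid classifier on `features`."""
    classes = sorted(set(y))
    centroids = {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        centroids[c] = [sum(r[f] for r in rows) / len(rows) for f in features]
    correct = 0
    for x, label in zip(X, y):
        pred = min(
            classes,
            key=lambda c: sum(
                (x[f] - m) ** 2 for f, m in zip(features, centroids[c])
            ),
        )
        correct += pred == label
    return correct / len(y)


def forward_selection(X, y, score_fn):
    """Greedily add the feature that most improves the score; stop when none does."""
    remaining = list(range(len(X[0])))
    selected, best_score = [], 0.0
    while remaining:
        scored = [(score_fn(X, y, selected + [f]), f) for f in remaining]
        score, feature = max(scored, key=lambda t: t[0])
        if score <= best_score:
            break
        selected.append(feature)
        remaining.remove(feature)
        best_score = score
    return selected, best_score


# Feature 0 separates the classes; feature 1 is noise.
X = [[0.0, 5.0], [1.0, 1.0], [0.5, 3.0], [9.0, 2.0], [10.0, 4.0], [9.5, 0.0]]
y = [0, 0, 0, 1, 1, 1]

selected, score = forward_selection(X, y, centroid_accuracy)
print(selected, score)  # [0] 1.0
```

Each round refits the model once per remaining feature, which is where the high computational cost of wrapper methods comes from.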
3. Embedded Methods
Feature selection during model training:
- L1 Regularization (Lasso): Shrinks some coefficients exactly to zero, which removes those features from the model
- Tree-Based Importance: Use feature importances from trees
- Elastic Net: Combines L1 and L2
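To see why L1 regularization produces exact zeros, consider the special case of orthonormal features with the objective $\tfrac{1}{2}\lVert y - X\beta \rVert^2 + \lambda \lVert \beta \rVert_1$: there the Lasso solution is the soft-thresholded least-squares coefficient. The sketch below shows only that operator; the coefficient values are invented, and a general Lasso solver would apply this step repeatedly (for example in coordinate descent).

```python
def soft_threshold(coef, lam):
    """Shrink a coefficient toward zero by lam; small ones become exactly 0."""
    if coef > lam:
        return coef - lam
    if coef < -lam:
        return coef + lam
    return 0.0


ols_coefs = [3.0, -0.4, 0.9, -2.5]  # hypothetical least-squares coefficients
lasso_coefs = [soft_threshold(b, 1.0) for b in ols_coefs]
selected = [i for i, b in enumerate(lasso_coefs) if b != 0.0]

print(lasso_coefs)  # [2.0, 0.0, 0.0, -1.5]
print(selected)     # [0, 3]
```

An L2 penalty, by contrast, only rescales coefficients, so it shrinks them without setting any of them exactly to zero.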
Feature Selection Comparison
| Method | Computational Cost | Accuracy | Interpretability |
|---|---|---|---|
| Filter | Low | Medium | High |
| Wrapper | High | High | Medium |
| Embedded | Medium | High | Medium |
Production Tip: Prefer embedded methods (L1, tree importance) in production. They select features as a by-product of training, so they cost far less than wrapper methods, and tree-based importances can account for feature interactions. Filter methods are good for initial screening.
Feature Engineering Best Practices
- Understand the domain: Domain knowledge guides feature creation
- Start simple: Begin with basic features before complex ones
- Validate: Always validate feature engineering with cross-validation
- Document: Keep track of feature transformations for reproducibility
- Monitor: Track feature distributions in production for drift
Code Implementation
Explanation of Code
- Categorical Encoding: Demonstrates label, one-hot, frequency, and target encoding.
- Feature Scaling: Compares Min-Max, Standard, and Robust scaling.
- Filter Methods: Shows variance threshold, correlation, and mutual information.
- Wrapper Methods: Demonstrates recursive feature elimination.
- Embedded Methods: Shows L1 regularization and tree-based importance.
- Pipeline: Creates a complete feature engineering pipeline.
Real-World Applications
Google: Search Features
Google engineers features for:
- Query Understanding: N-grams, TF-IDF, word embeddings
- User Features: Click history, session patterns
- Document Features: PageRank, content quality scores
Apple: Siri Features
Apple engineers features for:
- Voice Recognition: MFCCs, spectral features
- Intent Classification: Entity embeddings, context features
- Personalization: User behavior patterns
Google Interview Tip: Emphasize the importance of feature engineering in production systems. Even with deep learning, feature engineering for categorical and temporal features is crucial.
Common Follow-Up Questions
Q1: When should you use target encoding over one-hot encoding?
Use target encoding when:
- Categorical feature has high cardinality (>20 categories)
- You have enough data to avoid overfitting
- The category has a strong relationship with the target
Use one-hot encoding when:
- Categorical feature has low cardinality
- You're using linear models
- Interpretability is important
Q2: How do you handle high-cardinality categorical features?
- Target encoding with regularization
- Frequency encoding
- Hash encoding (feature hashing, also known as the hashing trick)
- Grouping rare categories
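Grouping rare categories is the simplest of these options to sketch. The helper below is a hypothetical illustration: the `min_count` threshold and the `__other__` label are arbitrary choices, and the counts would normally come from the training set only.

```python
from collections import Counter


def group_rare(values, min_count=2, other_label="__other__"):
    """Replace categories seen fewer than min_count times with one label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]


# Hypothetical toy column.
cities = ["nyc", "nyc", "sf", "sf", "sf", "boise", "tulsa"]
grouped = group_rare(cities, min_count=2)

print(grouped)
# ['nyc', 'nyc', 'sf', 'sf', 'sf', '__other__', '__other__']
```

After grouping, the column has three distinct values instead of four, and any of the encodings above can be applied to the reduced set.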
Q3: What is the difference between filter and wrapper methods?
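Filter methods score features with statistics computed independently of any model, so they are fast but usually less accurate. Wrapper methods score feature subsets by training and evaluating a model, so they are slower but usually more accurate. Embedded methods sit in between: selection happens during model training, which gives model-aware selection at roughly the cost of a single fit.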
Q4: How do you detect and handle feature drift in production?
- Monitor feature distributions (KS test, PSI)
- Retrain models periodically
- Use online learning algorithms
- Implement fallback models
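As a sketch of the monitoring step, the Population Stability Index (PSI) compares the binned distribution of a feature in production against a baseline: $\text{PSI} = \sum_i (a_i - e_i) \ln(a_i / e_i)$, where $e_i$ and $a_i$ are the baseline and production shares in bin $i$. The version below assumes equal-width bins over the baseline range, clips out-of-range values into the edge bins, and floors empty bins at a small `eps`; those are choices made for this example. The 0.25 alert level mentioned in the comments is a common rule of thumb, not a fixed standard.

```python
import math


def psi(expected, actual, n_bins=4, eps=1e-6):
    """Population Stability Index between a baseline and a new sample.

    Assumes `expected` is not constant (otherwise the bin width is zero).
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins

    def proportions(xs):
        counts = [0] * n_bins
        for x in xs:
            idx = int((x - lo) / width)
            idx = min(max(idx, 0), n_bins - 1)  # clip into the edge bins
            counts[idx] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(xs), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


baseline = [float(v) for v in range(8)]  # hypothetical training sample
shifted = [v + 4.0 for v in baseline]    # same shape, shifted upward

print(psi(baseline, baseline))       # 0.0
print(psi(baseline, shifted) > 0.25)  # True: well above the rule-of-thumb alert level
```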
Company-Specific Tips
Google Interview Tips
- Discuss scalable feature engineering for massive datasets
- Be ready to explain feature hashing for high cardinality
- Mention online feature engineering for streaming data
- Talk about feature stores for production systems
Apple Interview Tips
- Focus on on-device feature engineering for privacy
- Discuss feature selection for model compression
- Be prepared to explain temporal feature engineering
- Mention federated feature engineering