
Feature Engineering: Encoding, Scaling & Feature Selection


Google & Apple Interview


The art and science of creating predictive features

Interview Question

"Explain different encoding techniques for categorical variables. When would you use target encoding vs one-hot encoding? What are the best methods for feature selection and why?"

Difficulty: Medium | Frequently asked at Google, Apple, Amazon


Theoretical Foundation

Why Feature Engineering Matters

Feature engineering is often more important than model selection. Good features can:

  • Improve model performance significantly
  • Reduce training time
  • Improve interpretability
  • Reduce overfitting

Categorical Variable Encoding

1. Label Encoding

Maps categories to integers: $\{A, B, C\} \to \{0, 1, 2\}$

Pros: Simple, no dimensionality increase

Cons: Implies ordinal relationship (A < B < C)
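To make the mapping concrete, here is a minimal pure-Python sketch of label encoding. The helper names `fit_label_encoder` and `label_encode` are illustrative assumptions, and assigning codes in sorted order is just one possible convention.

```python
def fit_label_encoder(values):
    """Map each distinct category to an integer code, assigned in sorted order."""
    return {cat: idx for idx, cat in enumerate(sorted(set(values)))}


def label_encode(values, mapping):
    """Replace each category with its integer code."""
    return [mapping[v] for v in values]


colors = ["red", "green", "blue", "green"]
mapping = fit_label_encoder(colors)   # {'blue': 0, 'green': 1, 'red': 2}
codes = label_encode(colors, mapping)  # [2, 1, 0, 1]
```

Note that the codes make "red" (2) look larger than "blue" (0) even though the colors have no natural order, which is exactly the drawback listed above.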

2. One-Hot Encoding

Creates binary columns for each category:

$$x_i = \begin{cases} 1 & \text{if category is } i \\ 0 & \text{otherwise} \end{cases}$$

Pros: No ordinal assumption, widely supported

Cons: High-cardinality features → many columns (curse of dimensionality)
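A minimal sketch of one-hot encoding in plain Python follows. The function name `one_hot_encode` is an assumption made for illustration; in practice a library encoder such as scikit-learn's `OneHotEncoder` would normally be used.

```python
def one_hot_encode(values):
    """Return the sorted category list and one binary row per input value."""
    categories = sorted(set(values))
    rows = [[1 if v == cat else 0 for cat in categories] for v in values]
    return categories, rows


categories, rows = one_hot_encode(["red", "green", "blue", "green"])
# categories -> ['blue', 'green', 'red']
# rows       -> [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```

Each row has exactly one 1, and the number of columns equals the number of distinct categories, which is why high-cardinality features become expensive.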

3. Target Encoding

Replaces category with mean of target variable:

$$\text{encoding}(c) = \frac{\sum_{i: x_i = c} y_i + \alpha \cdot \bar{y}}{n_c + \alpha}$$

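where $\alpha$ is a smoothing parameter, $n_c$ is the number of training rows with category $c$, and $\bar{y}$ is the global mean of the target. A larger $\alpha$ pulls rare categories more strongly toward $\bar{y}$.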

Pros: Captures target relationship, handles high cardinality

Cons: Can cause target leakage, requires careful regularization
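The smoothed formula can be sketched directly in plain Python. The function name `fit_target_encoding` and the toy click data are assumptions for illustration. To limit target leakage, the encoding should be fitted on training rows only (or out-of-fold), which this sketch leaves to the caller.

```python
from collections import defaultdict


def fit_target_encoding(categories, targets, alpha=1.0):
    """Smoothed mean of the target per category: (sum_y + alpha * y_bar) / (n_c + alpha)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, y in zip(categories, targets):
        sums[c] += y
        counts[c] += 1
    global_mean = sum(targets) / len(targets)
    encoding = {
        c: (sums[c] + alpha * global_mean) / (counts[c] + alpha) for c in counts
    }
    return encoding, global_mean


cities = ["NY", "NY", "NY", "SF", "LA"]
clicked = [1, 1, 0, 1, 0]
encoding, global_mean = fit_target_encoding(cities, clicked, alpha=2.0)
# global_mean is 0.6; NY -> 3.2 / 5 = 0.64, SF -> 2.2 / 3, LA -> 1.2 / 3 = 0.4
```

"SF" appears once with target 1, but smoothing keeps its encoding near 0.73 rather than 1.0. A category never seen in training can fall back to `global_mean`.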

4. Frequency Encoding

Replaces category with its frequency:

$$\text{encoding}(c) = \frac{n_c}{n}$$

Pros: Simple, captures popularity

Cons: Loses category-specific information
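A short sketch of frequency encoding, with `fit_frequency_encoding` as an illustrative name:

```python
from collections import Counter


def fit_frequency_encoding(values):
    """Map each category to its relative frequency n_c / n."""
    counts = Counter(values)
    n = len(values)
    return {cat: count / n for cat, count in counts.items()}


freq = fit_frequency_encoding(["a", "b", "a", "c", "a"])
# 'a' -> 3/5, 'b' -> 1/5, 'c' -> 1/5
```

Here "b" and "c" receive the same value, so the model can no longer tell them apart. That is the loss of category-specific information noted above.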

5. Binary Encoding

Converts integer encoding to binary representation:

$$3 \to 011, \quad 5 \to 101$$

Pros: Fewer columns than one-hot, no ordinal assumption

Cons: Less interpretable
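As a sketch of the idea, the integer code from label encoding can be expanded into bit columns. The helper names `binary_encode` and `bits_needed` are assumptions for illustration.

```python
def bits_needed(n_categories):
    """Number of bit columns needed for codes 0 .. n_categories - 1."""
    return max(1, (n_categories - 1).bit_length())


def binary_encode(code, n_bits):
    """Return the n_bits-wide binary digits of a non-negative integer code."""
    if code < 0 or code >= 2 ** n_bits:
        raise ValueError("code does not fit in n_bits")
    return [(code >> shift) & 1 for shift in range(n_bits - 1, -1, -1)]


three = binary_encode(3, 3)  # [0, 1, 1]
five = binary_encode(5, 3)   # [1, 0, 1]
```

With 1000 categories, `bits_needed(1000)` is 10, so binary encoding uses 10 columns where one-hot would use 1000.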

6. Hash Encoding

Hashes category to fixed number of buckets:

$$\text{encoding}(c) = \text{hash}(c) \bmod k$$

Pros: Fixed dimensionality, handles new categories

Cons: Hash collisions, less interpretable
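A minimal sketch of hash encoding follows. It assumes SHA-256 from `hashlib` as the hash function, because Python's built-in `hash()` is randomized per process for strings. The function names are illustrative.

```python
import hashlib


def hash_bucket(category, k):
    """Deterministically map a category string to one of k buckets."""
    digest = hashlib.sha256(category.encode("utf-8")).hexdigest()
    return int(digest, 16) % k


def hash_encode(category, k):
    """One-hot vector over the k hash buckets."""
    row = [0] * k
    row[hash_bucket(category, k)] = 1
    return row


row = hash_encode("user_12345", 8)
unseen = hash_encode("category_never_seen_in_training", 8)
```

Any string, including one never seen during training, lands in one of the `k` buckets, so the output width never changes. The price is that unrelated categories can share a bucket.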

Encoding Comparison

| Method | Cardinality | Memory | Information | Use Case |
|---|---|---|---|---|
| Label | Any | Low | Ordinal only | Tree models |
| One-Hot | Low | High | Full | Linear models |
| Target | Any | Low | Target-aware | High cardinality |
| Frequency | Any | Low | Frequency only | Simple baselines |
| Binary | Any | Medium | Partial | Medium cardinality |
| Hash | Any | Fixed | Compressed | Very high cardinality |

ℹ️

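Key Insight: The choice of encoding depends on the model. Tree models can work with label encoding because they split on thresholds and do not assume a linear relationship between the integer code and the target. Linear models treat the integer codes as magnitudes, so nominal categories generally need one-hot encoding (or another non-ordinal encoding) to avoid false ordinal relationships.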

Feature Scaling

1. Min-Max Scaling

$$x_{\text{scaled}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$
  • Range: $[0, 1]$
  • Sensitive to outliers, because a single extreme value sets $x_{\min}$ or $x_{\max}$
  • Common choice for neural networks and other models that expect bounded inputs
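A small sketch of min-max scaling, which also shows the outlier sensitivity. The function name and the sample values are illustrative assumptions.

```python
def min_max_scale(values):
    """Scale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: avoid division by zero
    return [(v - lo) / (hi - lo) for v in values]


scaled = min_max_scale([10, 20, 30, 40])
with_outlier = min_max_scale([10, 20, 30, 40, 1000])
```

Without the outlier, the four values spread evenly over [0, 1]. With the value 1000 added, the original four are squeezed below about 0.03.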

2. Standard Scaling (Z-score)

$$x_{\text{scaled}} = \frac{x - \mu}{\sigma}$$
  • Range: $(-\infty, \infty)$, mean = 0, std = 1
  • Less sensitive to outliers than min-max scaling, although $\mu$ and $\sigma$ are still affected by them
  • Common choice for SVM and logistic regression
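The z-score transform can be sketched with the standard library. This sketch assumes the population standard deviation (`statistics.pstdev`). The function name is illustrative.

```python
import statistics


def standard_scale(values):
    """Subtract the mean and divide by the population standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant column
    return [(v - mu) / sigma for v in values]


z = standard_scale([2, 4, 4, 4, 5, 5, 7, 9])
# mean is 5 and std is 2, so z -> [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```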

3. Robust Scaling

$$x_{\text{scaled}} = \frac{x - \text{median}}{\text{IQR}}$$
  • Uses median and IQR instead of mean and std
  • Robust to outliers
  • Best for data with outliers
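A sketch of robust scaling using `statistics.quantiles`. The choice of the "inclusive" quartile method is an assumption, since different quartile conventions give slightly different IQR values.

```python
import statistics


def robust_scale(values):
    """Center on the median and divide by the interquartile range."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        return [0.0 for _ in values]
    return [(v - median) / iqr for v in values]


robust = robust_scale([1, 2, 3, 4, 100])
# quartiles are 2, 3, 4, so robust -> [-1.0, -0.5, 0.0, 0.5, 48.5]
```

The outlier 100 remains far from the rest, but it does not distort the scaling of the other four points, because the median and IQR ignore it.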

4. Normalization (L2)

$$x_{\text{scaled}} = \frac{x}{\|x\|_2}$$
  • Rescales each sample vector (row) to unit L2 norm, rather than scaling each feature column
  • Best for text classification (TF-IDF)
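A minimal sketch of L2 normalization for a single sample vector. The function name is an illustrative assumption.

```python
import math


def l2_normalize(vector):
    """Divide a vector by its Euclidean length so that its L2 norm becomes 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        return list(vector)  # the zero vector has no direction to preserve
    return [x / norm for x in vector]


unit = l2_normalize([3.0, 4.0])  # norm is 5, so unit is approximately [0.6, 0.8]
```

For TF-IDF vectors this removes the effect of document length, so similarity depends on the direction of the vector only.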

Feature Selection Methods

1. Filter Methods

Select features based on statistical tests:

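  • Correlation: Remove one feature from each highly correlated pair, since the second adds little new information
  • Chi-Squared: Test independence between a categorical (or non-negative count) feature and the target
  • Mutual Information: Measure dependency with target, including non-linear dependency
  • Variance Threshold: Remove low-variance (near-constant) features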
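Two of the filters above, variance threshold and pairwise correlation, can be sketched in plain Python. The function names, the thresholds, and the toy feature table are assumptions for illustration.

```python
import math


def variance(column):
    """Population variance of a numeric column."""
    mean = sum(column) / len(column)
    return sum((x - mean) ** 2 for x in column) / len(column)


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length columns."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm = math.sqrt(
        sum((x - mean_a) ** 2 for x in a) * sum((y - mean_b) ** 2 for y in b)
    )
    return cov / norm if norm else 0.0


def filter_features(features, var_threshold=0.0, corr_threshold=0.95):
    """Keep features that vary and are not near-duplicates of a kept feature."""
    kept = []
    for name, column in features.items():
        if variance(column) <= var_threshold:
            continue  # near-constant feature carries little information
        if any(abs(pearson(column, features[k])) > corr_threshold for k in kept):
            continue  # redundant with a feature that was already kept
        kept.append(name)
    return kept


features = {
    "height_cm": [150, 160, 170, 180],
    "height_in": [59.1, 63.0, 66.9, 70.9],
    "constant": [1, 1, 1, 1],
    "age": [30, 22, 45, 28],
}
kept = filter_features(features)  # ['height_cm', 'age']
```

No model is trained at any point, which is what makes filter methods cheap enough for initial screening.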

2. Wrapper Methods

Use model performance to select features:

  • Forward Selection: Add features one by one
  • Backward Elimination: Remove features one by one
  • Recursive Feature Elimination (RFE): Iteratively remove least important
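Forward selection can be sketched as a greedy loop around a scoring function. In real use `score_fn` would train a model on the given feature subset and return a cross-validated score. The `toy_score` function below is a hypothetical stand-in with made-up gains, used only so that the loop can run.

```python
def forward_selection(candidates, score_fn, max_features=None):
    """Greedily add the feature that improves the score most; stop when none helps."""
    selected = []
    best_score = score_fn(selected)
    remaining = list(candidates)
    while remaining and (max_features is None or len(selected) < max_features):
        scored = [(score_fn(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(scored, key=lambda pair: pair[0])
        if top_score <= best_score:
            break  # no remaining feature improves the score
        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score
    return selected, best_score


def toy_score(subset):
    """Hypothetical scorer: a baseline of 0.5 plus a fixed gain per feature."""
    gains = {"income": 0.20, "age": 0.10, "noise": -0.02}
    return 0.5 + sum(gains[f] for f in subset)


selected, best = forward_selection(["noise", "age", "income"], toy_score)
# selected -> ['income', 'age']; 'noise' is never added because it lowers the score
```

Every candidate is re-scored in each round, so the number of model fits grows quickly with the number of features. This is the high computational cost associated with wrapper methods.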

3. Embedded Methods

Feature selection during model training:

  • L1 Regularization (Lasso): Shrinks some coefficients exactly to zero, which removes those features from the model
  • Tree-Based Importance: Use feature importances from trees
  • Elastic Net: Combines L1 and L2 penalties
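Why L1 produces exact zeros while L2 does not can be seen in a simplified setting. Assuming an orthonormal design matrix, the Lasso solution for the objective $\tfrac{1}{2}\|y - X\beta\|^2 + \lambda\|\beta\|_1$ is the soft-thresholded least-squares coefficient, while ridge with penalty $\tfrac{\lambda}{2}\|\beta\|^2$ only rescales it. The coefficients below are made-up illustrative values.

```python
def soft_threshold(value, lam):
    """Lasso update under an orthonormal design: shrink toward 0, clip at 0."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def ridge_shrink(value, lam):
    """Ridge update under an orthonormal design: rescale, never exactly 0."""
    return value / (1.0 + lam)


ols_coefs = [2.5, -0.3, 0.1, -1.2]  # hypothetical least-squares coefficients
lasso_coefs = [soft_threshold(b, 0.5) for b in ols_coefs]
ridge_coefs = [ridge_shrink(b, 0.5) for b in ols_coefs]
# lasso_coefs is approximately [2.0, 0.0, 0.0, -0.7]: two features are dropped
```

Every ridge coefficient stays non-zero, so ridge regularizes but does not select. With correlated features the Lasso solution no longer has this closed form, and solvers typically apply the same thresholding step iteratively.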

Feature Selection Comparison

| Method | Computational Cost | Accuracy | Interpretability |
|---|---|---|---|
| Filter | Low | Medium | High |
| Wrapper | High | High | Medium |
| Embedded | Medium | High | Medium |


Production Tip: In production, prefer embedded methods (L1, tree importance): selection happens as part of a single model fit, so they are much cheaper than wrapper methods, and the model scores features jointly rather than one at a time. Filter methods are good for initial screening.

Feature Engineering Best Practices

  1. Understand the domain: Domain knowledge guides feature creation
  2. Start simple: Begin with basic features before complex ones
  3. Validate: Always validate feature engineering with cross-validation
  4. Document: Keep track of feature transformations for reproducibility
  5. Monitor: Track feature distributions in production for drift

Code Implementation

Explanation of Code

  1. Categorical Encoding: Demonstrates label, one-hot, frequency, and target encoding.

  2. Feature Scaling: Compares Min-Max, Standard, and Robust scaling.

  3. Filter Methods: Shows variance threshold, correlation, and mutual information.

  4. Wrapper Methods: Demonstrates recursive feature elimination.

  5. Embedded Methods: Shows L1 regularization and tree-based importance.

  6. Pipeline: Creates a complete feature engineering pipeline.
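As an illustration of the pipeline idea in step 6, the sketch below fits its statistics on training rows only and then applies them unchanged to new rows. The class `SimplePreprocessor` and the column names `amount` and `plan` are hypothetical. Libraries such as scikit-learn offer the same fit/transform pattern through `Pipeline` and `ColumnTransformer`.

```python
import statistics


class SimplePreprocessor:
    """Hypothetical helper: z-scores one numeric column, one-hot encodes one categorical."""

    def __init__(self, numeric_key, categorical_key):
        self.numeric_key = numeric_key
        self.categorical_key = categorical_key

    def fit(self, rows):
        # Learn the mean, std and category list from the training rows only.
        values = [row[self.numeric_key] for row in rows]
        self.mean_ = statistics.fmean(values)
        self.std_ = statistics.pstdev(values) or 1.0
        self.categories_ = sorted({row[self.categorical_key] for row in rows})
        return self

    def transform(self, rows):
        # Reuse the fitted statistics; unseen categories become an all-zero block.
        out = []
        for row in rows:
            scaled = (row[self.numeric_key] - self.mean_) / self.std_
            one_hot = [
                1 if row[self.categorical_key] == c else 0 for c in self.categories_
            ]
            out.append([scaled] + one_hot)
        return out


train = [{"amount": 10.0, "plan": "free"}, {"amount": 30.0, "plan": "pro"}]
test = [{"amount": 20.0, "plan": "team"}]
pre = SimplePreprocessor("amount", "plan").fit(train)
train_X = pre.transform(train)  # [[-1.0, 1, 0], [1.0, 0, 1]]
test_X = pre.transform(test)    # [[0.0, 0, 0]]
```

Keeping `fit` and `transform` separate is what prevents test-set statistics from leaking into training, and it makes the same transformation reproducible at serving time.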


Real-World Applications

Google: Search Features

Google engineers features for:

  • Query Understanding: N-grams, TF-IDF, word embeddings
  • User Features: Click history, session patterns
  • Document Features: PageRank, content quality scores

Apple: Siri Features

Apple engineers features for:

  • Voice Recognition: MFCCs, spectral features
  • Intent Classification: Entity embeddings, context features
  • Personalization: User behavior patterns


Google Interview Tip: Emphasize the importance of feature engineering in production systems. Even with deep learning, feature engineering for categorical and temporal features is crucial.


Common Follow-Up Questions

Q1: When should you use target encoding over one-hot encoding?

Use target encoding when:

  • Categorical feature has high cardinality (as a rough guide, more than about 20 categories)
  • You have enough data to avoid overfitting
  • The category has a strong relationship with the target

Use one-hot encoding when:

  • Categorical feature has low cardinality
  • You're using linear models
  • Interpretability is important

Q2: How do you handle high-cardinality categorical features?

  1. Target encoding with regularization
  2. Frequency encoding
  3. Hash encoding (feature hashing, also known as the hashing trick)
  4. Grouping rare categories
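Grouping rare categories, the last option in the list above, can be sketched as follows. The function name, the `min_count` threshold, and the `"OTHER"` label are illustrative assumptions.

```python
from collections import Counter


def group_rare(values, min_count=2, other_label="OTHER"):
    """Replace categories that occur fewer than min_count times with one label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]


grouped = group_rare(["US", "US", "DE", "FR", "US", "DE", "NZ"])
# -> ['US', 'US', 'DE', 'OTHER', 'US', 'DE', 'OTHER']
```

After grouping, the reduced category set can be one-hot or target encoded with far fewer columns and with less noise from categories that have only a handful of rows.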

Q3: What is the difference between filter and wrapper methods?

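Filter methods score features with statistics computed independently of the model (faster, but they can miss what the model actually uses). Wrapper methods score feature subsets by retraining the model and measuring performance (slower, usually more accurate). Embedded methods sit in between: selection happens during model training, so they are model-aware at roughly the cost of a single fit.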

Q4: How do you detect and handle feature drift in production?

  1. Monitor feature distributions (KS test, PSI)
  2. Retrain models periodically
  3. Use online learning algorithms
  4. Implement fallback models
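The Population Stability Index (PSI) mentioned in step 1 can be sketched in plain Python. This version assumes equal-width bins taken from the reference sample and a small floor `eps` for empty bins. Both are common but not universal choices, and the function name and sample data are illustrative.

```python
import math


def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index between a reference and a current sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 for a constant column

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), n_bins - 1)  # clamp values outside the range
            counts[idx] += 1
        # Floor empty bins at eps so the logarithm below stays defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = bin_fractions(expected), bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


reference = [i / 100 for i in range(100)]      # training-time feature values
shifted = [0.5 + i / 200 for i in range(100)]  # production values, shifted upward
no_drift = psi(reference, reference)  # 0.0
drift = psi(reference, shifted)       # large positive value
```

A PSI of 0 means the binned distributions are identical. A commonly quoted rule of thumb treats values above roughly 0.25 as a significant shift, although the right threshold depends on the feature and the binning.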

Company-Specific Tips

Google Interview Tips

  • Discuss scalable feature engineering for massive datasets
  • Be ready to explain feature hashing for high cardinality
  • Mention online feature engineering for streaming data
  • Talk about feature stores for production systems

Apple Interview Tips

  • Focus on on-device feature engineering for privacy
  • Discuss feature selection for model compression
  • Be prepared to explain temporal feature engineering
  • Mention federated feature engineering
