Google & Apple Interview

Feature Engineering: Encoding, Scaling & Feature Selection

The art and science of creating predictive features

Interview Question

"Explain different encoding techniques for categorical variables. When would you use target encoding vs one-hot encoding? What are the best methods for feature selection and why?"

Difficulty: Medium | Frequently asked at Google, Apple, Amazon

Theoretical Foundation

Why Feature Engineering Matters

Feature engineering is often more important than model selection. Good features can:

Improve model performance significantly
Reduce training time
Improve interpretability
Reduce overfitting

Categorical Variable Encoding

1. Label Encoding

Maps categories to integers: $\{A, B, C\} \to \{0, 1, 2\}$

Pros: Simple, no dimensionality increase
Cons: Implies ordinal relationship (A < B < C)
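As a minimal sketch of the mapping above, categories can be assigned integers in sorted order. The helper name `fit_label_encoder` is illustrative and not taken from any library.

```python
def fit_label_encoder(values):
    """Map each distinct category to an integer, in sorted order."""
    return {cat: idx for idx, cat in enumerate(sorted(set(values)))}


colors = ["B", "A", "C", "A"]
mapping = fit_label_encoder(colors)          # learned on training data
encoded = [mapping[c] for c in colors]       # apply the same mapping at inference

print(mapping)
print(encoded)
```

The integers carry an order (0 < 1 < 2) that the categories themselves may not have, which is the drawback listed above.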

2. One-Hot Encoding

Creates binary columns for each category:

$$x_i = \begin{cases} 1 & \text{if category is } i \\ 0 & \text{otherwise} \end{cases}$$

Pros: No ordinal assumption, widely supported
Cons: High-cardinality features produce many columns (curse of dimensionality)
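The indicator definition above can be sketched in plain Python as follows. The function name `one_hot` is an assumption made for illustration; in practice scikit-learn's `OneHotEncoder` or pandas' `get_dummies` would usually be used instead.

```python
def one_hot(values):
    """Return the sorted category list and one indicator row per value."""
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)   # one column per category
        row[index[v]] = 1             # exactly one column is set
        rows.append(row)
    return categories, rows


categories, rows = one_hot(["B", "A", "C", "A"])
print(categories)
for row in rows:
    print(row)
```

Each row has as many columns as there are distinct categories, which is why the column count grows with cardinality.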

3. Target Encoding

Replaces category with mean of target variable:

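$$\text{encoding}(c) = \frac{\sum_{i: x_i = c} y_i + \alpha \cdot \bar{y}}{n_c + \alpha}$$

where $n_c$ is the number of rows in category $c$, $\bar{y}$ is the global target mean, and $\alpha$ is a smoothing parameter that pulls small categories toward $\bar{y}$.

Pros: Captures target relationship, handles high cardinality
Cons: Can cause target leakage, requires careful regularization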
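A minimal sketch of the smoothed formula above, using only the standard library. The function name `fit_target_encoding` and the toy data are assumptions for illustration. To limit leakage, the encoding should be fitted on training rows only (or out-of-fold) and then applied to validation and test rows.

```python
from collections import defaultdict


def fit_target_encoding(categories, targets, alpha=10.0):
    """Smoothed target mean per category, plus the global mean for unseen ones."""
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, y in zip(categories, targets):
        sums[c] += y
        counts[c] += 1
    encoding = {
        c: (sums[c] + alpha * global_mean) / (counts[c] + alpha) for c in counts
    }
    return encoding, global_mean


cats = ["a", "a", "a", "b"]
y = [1, 1, 0, 1]
encoding, global_mean = fit_target_encoding(cats, y, alpha=2.0)

# Unseen categories fall back to the global mean.
encoded_new = [encoding.get(c, global_mean) for c in ["a", "b", "z"]]
print(encoding, encoded_new)
```

With `alpha=2.0`, category `b` has a single row with target 1, yet its encoding is pulled from 1.0 toward the global mean of 0.75.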

4. Frequency Encoding

Replaces category with its frequency:

$$\text{encoding}(c) = \frac{n_c}{n}$$

Pros: Simple, captures popularity
Cons: Loses category-specific information
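The frequency formula above reduces to a count divided by the number of rows. A short sketch, with `fit_frequency_encoding` as an illustrative name:

```python
from collections import Counter


def fit_frequency_encoding(values):
    """Map each category to its share of the rows, n_c / n."""
    counts = Counter(values)
    n = len(values)
    return {c: k / n for c, k in counts.items()}


freq = fit_frequency_encoding(["a", "a", "a", "b"])
print(freq)
```

Two categories with the same frequency receive the same value, which is the information loss noted above.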

5. Binary Encoding

Converts integer encoding to binary representation:

$$3 \to 011, \quad 5 \to 101$$

Pros: Fewer columns than one-hot, no ordinal assumption
Cons: Less interpretable
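A sketch of binary encoding under one simple assumption: integer codes start at 0 and are assigned in sorted category order. The function name `binary_encode` is illustrative; library implementations may assign codes differently.

```python
def binary_encode(values):
    """Integer-code each category, then split the code into bit columns."""
    categories = sorted(set(values))
    code = {c: i for i, c in enumerate(categories)}
    # Number of bits needed to represent the largest code.
    n_bits = max(1, (len(categories) - 1).bit_length())
    rows = [
        [(code[v] >> b) & 1 for b in reversed(range(n_bits))]  # most significant bit first
        for v in values
    ]
    return code, rows


code, rows = binary_encode(["A", "B", "C", "D", "E", "F"])
print(code["D"], rows[3])
print(code["F"], rows[5])
```

Six categories fit in three bit columns here, compared with six columns for one-hot encoding.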

6. Hash Encoding

Hashes category to fixed number of buckets:

$$\text{encoding}(c) = \text{hash}(c) \bmod k$$

Pros: Fixed dimensionality, handles new categories
Cons: Hash collisions, less interpretable
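A sketch of the bucket assignment above. It uses `hashlib.md5` rather than Python's built-in `hash`, because the built-in string hash is salted per process and would not give stable buckets across runs. The function name `hash_bucket` and the choice of MD5 are assumptions for illustration.

```python
import hashlib


def hash_bucket(category, k):
    """Deterministically map a category string to one of k buckets."""
    digest = hashlib.md5(category.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % k


k = 8
cities = ["paris", "tokyo", "lima", "a-city-never-seen-in-training"]
buckets = [hash_bucket(c, k) for c in cities]
print(buckets)
```

No fitted vocabulary is needed, so a category unseen during training still lands in a valid bucket. Distinct categories can share a bucket, and a larger `k` reduces how often that happens.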

Encoding Comparison

Method	Cardinality	Memory	Information	Use Case
Label	Any	Low	Ordinal only	Tree models
One-Hot	Low	High	Full	Linear models
Target	Any	Low	Target-aware	High cardinality
Frequency	Any	Low	Frequency only	Simple baselines
Binary	Any	Medium	Partial	Medium cardinality
Hash	Any	Fixed	Compressed	Very high cardinality

ℹ️

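Key Insight: The choice of encoding depends on the model. Tree models can often use label encoding because they split on thresholds instead of fitting a coefficient to the integer value, so an arbitrary ordering does less harm. Linear models treat the integer as a magnitude, so nominal features generally need one-hot (or another non-ordinal) encoding to avoid false ordinal relationships.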

Feature Scaling

1. Min-Max Scaling

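$$x_{\text{scaled}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

Range: $[0, 1]$
Sensitive to outliers, since a single extreme value sets $x_{\min}$ or $x_{\max}$
Common choice for neural networks and other inputs that need a bounded range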

2. Standard Scaling (Z-score)

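$$x_{\text{scaled}} = \frac{x - \mu}{\sigma}$$

Range: $(-\infty, \infty)$, with mean 0 and standard deviation 1
Less sensitive to outliers than min-max scaling, although outliers still shift $\mu$ and $\sigma$
Good default for SVMs, logistic regression, and other scale-sensitive models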

3. Robust Scaling

$$x_{\text{scaled}} = \frac{x - \text{median}}{\text{IQR}}$$

Uses median and IQR instead of mean and std
Robust to outliers
Best for data with outliers
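To compare the three per-feature scalers above on the same data, here is a small NumPy sketch on a toy feature with one outlier. The data values are made up for illustration. In a real workflow the statistics (min, max, mean, median, IQR) would be computed on the training set only and reused at inference.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # 100.0 is the outlier

# Min-max scaling: the outlier squeezes the other values near 0.
min_max = (x - x.min()) / (x.max() - x.min())

# Standard scaling (population standard deviation).
standard = (x - x.mean()) / x.std()

# Robust scaling: median and interquartile range ignore the extreme value.
q1, median, q3 = np.percentile(x, [25, 50, 75])
robust = (x - median) / (q3 - q1)

print(min_max)
print(standard)
print(robust)
```

Under robust scaling the four ordinary values stay evenly spread around zero, while min-max scaling compresses them into a narrow band just above 0.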

4. Normalization (L2)

$$x_{\text{scaled}} = \frac{x}{\|x\|_2}$$

Applied per sample (row), not per feature: here $x$ is one sample's feature vector
Produces unit-norm vectors
Best for text classification (TF-IDF)
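Unlike the scalers before it, L2 normalization works on rows. A minimal NumPy sketch with made-up values, guarding against all-zero rows:

```python
import numpy as np

X = np.array([[3.0, 4.0],
              [0.0, 2.0],
              [0.0, 0.0]])  # last row is all zeros

# One norm per row; keepdims lets the division broadcast across columns.
norms = np.linalg.norm(X, axis=1, keepdims=True)

# Leave all-zero rows unchanged instead of dividing by zero.
X_unit = X / np.where(norms == 0.0, 1.0, norms)

print(X_unit)
```

Each non-zero row now has length 1, so dot products between rows equal their cosine similarity.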

Feature Selection Methods

1. Filter Methods

Select features based on statistical tests:

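Correlation: Remove one feature from each pair of highly correlated features
Chi-Squared: Test independence between a non-negative (count or categorical) feature and a categorical target
Mutual Information: Measure dependency with target, including non-linear dependency
Variance Threshold: Remove low-variance features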
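The filter steps above can be chained as a screening pass. This sketch assumes scikit-learn and NumPy are available and uses synthetic data; the 0.95 correlation cutoff is an arbitrary illustrative threshold, not a recommended value.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold, mutual_info_classif

rng = np.random.default_rng(0)
n = 500
informative = rng.normal(size=n)
noise = rng.normal(size=n)
constant = np.zeros(n)
duplicate = informative + rng.normal(scale=0.01, size=n)  # near-copy of informative
X = np.column_stack([informative, noise, constant, duplicate])
y = (informative > 0).astype(int)

# 1) Variance threshold: drop features that never change.
vt = VarianceThreshold(threshold=0.0)
X_var = vt.fit_transform(X)
kept_after_variance = vt.get_support(indices=True)

# 2) Correlation: for each highly correlated pair, drop the later column.
corr = np.abs(np.corrcoef(X_var, rowvar=False))
to_drop = set()
for i in range(corr.shape[0]):
    for j in range(i + 1, corr.shape[0]):
        if i not in to_drop and j not in to_drop and corr[i, j] > 0.95:
            to_drop.add(j)
keep = [i for i in range(X_var.shape[1]) if i not in to_drop]
X_filtered = X_var[:, keep]

# 3) Mutual information with the target for the surviving features.
mi = mutual_info_classif(X_filtered, y, random_state=0)
print(kept_after_variance, keep, mi)
```

None of these steps fits a predictive model, which is why filter methods are cheap enough for a first pass over many features.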

2. Wrapper Methods

Use model performance to select features:

Forward Selection: Add features one by one
Backward Elimination: Remove features one by one
Recursive Feature Elimination (RFE): Iteratively remove least important
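As one example of a wrapper method, recursive feature elimination can be sketched with scikit-learn's `RFE`. The synthetic dataset, the logistic regression estimator, and the target of three features are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(
    n_samples=300, n_features=10, n_informative=3, n_redundant=0, random_state=0
)

# Refit the model repeatedly, dropping the weakest feature each round (step=1).
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3, step=1)
selector.fit(X, y)

selected = selector.get_support(indices=True)  # indices of surviving features
X_selected = selector.transform(X)
print(selected, selector.ranking_)
```

Every elimination round retrains the estimator, so the cost grows with the number of features, which is the high computational cost associated with wrapper methods.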

3. Embedded Methods

Feature selection during model training:

L1 Regularization (Lasso): Shrinks some coefficients exactly to zero, which removes those features
Tree-Based Importance: Rank features by the importances a fitted tree ensemble reports
Elastic Net: Combines L1 and L2 penalties
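A sketch of two embedded approaches, assuming scikit-learn and synthetic regression data. The penalty strength `alpha=1.0` and the forest size are illustrative choices that would normally be tuned with cross-validation.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso

X, y, true_coef = make_regression(
    n_samples=300, n_features=10, n_informative=3, noise=1.0,
    coef=True, random_state=0,
)

# L1 regularization: features whose coefficient is exactly zero are dropped.
lasso = Lasso(alpha=1.0).fit(X, y)
lasso_selected = np.flatnonzero(lasso.coef_)

# Tree-based importance: rank features by impurity-based importance.
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
forest_ranking = np.argsort(forest.feature_importances_)[::-1]

print("truly informative:", np.flatnonzero(true_coef))
print("kept by lasso:", lasso_selected)
print("forest ranking:", forest_ranking)
```

In both cases selection falls out of a single model fit, with no separate search over feature subsets.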

Feature Selection Comparison

Method	Computational Cost	Accuracy	Interpretability
Filter	Low	Medium	High
Wrapper	High	High	Medium
Embedded	Medium	High	Medium

💡

Production Tip: In production, prefer embedded methods (L1, tree importance). Selection happens as part of a single model fit, so they are much cheaper than wrapper methods, and tree-based importance also reflects feature interactions. Filter methods are good for initial screening.

Feature Engineering Best Practices

Understand the domain: Domain knowledge guides feature creation
Start simple: Begin with basic features before complex ones
Validate: Always validate feature engineering with cross-validation
Document: Keep track of feature transformations for reproducibility
Monitor: Track feature distributions in production for drift

Code Implementation

Explanation of Code

Categorical Encoding: Demonstrates label, one-hot, frequency, and target encoding.
Feature Scaling: Compares Min-Max, Standard, and Robust scaling.
Filter Methods: Shows variance threshold, correlation, and mutual information.
Wrapper Methods: Demonstrates recursive feature elimination.
Embedded Methods: Shows L1 regularization and tree-based importance.
Pipeline: Creates a complete feature engineering pipeline.
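As a minimal sketch of the pipeline item above, encoding and scaling can be combined with a model so that every preprocessing step is refitted inside each cross-validation fold. This assumes scikit-learn and pandas; the column names (`city`, `age`, `income`) and the synthetic data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "city": rng.choice(["NYC", "SF", "LA"], size=n),
    "age": rng.normal(40, 10, size=n),
    "income": rng.lognormal(10, 1, size=n),
})
# Synthetic target driven by age and city, plus noise.
signal = df["age"] + 10.0 * (df["city"] == "SF") + rng.normal(0, 2, size=n)
y = (signal > 43).astype(int)

preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),  # low-cardinality nominal
    ("num", StandardScaler(), ["age", "income"]),               # numeric features
])
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Preprocessing is fitted on each training fold only, which avoids leakage.
scores = cross_val_score(model, df, y, cv=5)
print(scores.mean())
```

Keeping the transformations inside the pipeline also means the same fitted object can be saved and applied unchanged at inference time.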

Real-World Applications

Google: Search Features

Google engineers features for:

Query Understanding: N-grams, TF-IDF, word embeddings
User Features: Click history, session patterns
Document Features: PageRank, content quality scores

Apple: Siri Features

Apple engineers features for:

Voice Recognition: MFCCs, spectral features
Intent Classification: Entity embeddings, context features
Personalization: User behavior patterns

💡

Google Interview Tip: Emphasize the importance of feature engineering in production systems. Even with deep learning, feature engineering for categorical and temporal features is crucial.

Common Follow-Up Questions

Q1: When should you use target encoding over one-hot encoding?

Use target encoding when:

Categorical feature has high cardinality (as a rough rule of thumb, more than about 20 categories)
You have enough rows per category to avoid overfitting
The category has a strong relationship with the target

Use one-hot encoding when:

Categorical feature has low cardinality
You're using linear models
Interpretability is important

Q2: How do you handle high-cardinality categorical features?

Target encoding with regularization
Frequency encoding
Hash encoding (feature hashing, also called the hashing trick)
Grouping rare categories

Q3: What is the difference between filter and wrapper methods?

Filter methods evaluate features independently of the model (faster, less accurate). Wrapper methods evaluate features using model performance (slower, more accurate). Embedded methods sit in between: selection happens during model training, so they are model-aware at roughly the cost of one fit.

Q4: How do you detect and handle feature drift in production?

Monitor feature distributions (Kolmogorov-Smirnov test, Population Stability Index (PSI))
Retrain models periodically
Use online learning algorithms
Implement fallback models
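For the monitoring step above, the Population Stability Index can be sketched with NumPy. This version bins by quantiles of the reference (training) sample; the bin count, the `eps` floor, and the function name `population_stability_index` are illustrative assumptions.

```python
import numpy as np


def population_stability_index(expected, actual, n_bins=10, eps=1e-6):
    """PSI between a reference sample and a production sample of one feature."""
    # Interior bin edges come from quantiles of the reference sample.
    interior = np.quantile(expected, np.linspace(0, 1, n_bins + 1))[1:-1]
    e_counts = np.bincount(np.searchsorted(interior, expected, side="right"),
                           minlength=n_bins)
    a_counts = np.bincount(np.searchsorted(interior, actual, side="right"),
                           minlength=n_bins)
    # Floor the fractions so empty bins do not produce log(0).
    e_frac = np.clip(e_counts / len(expected), eps, None)
    a_frac = np.clip(a_counts / len(actual), eps, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))


rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=5000)
prod_same = rng.normal(0.0, 1.0, size=5000)     # no drift
prod_shifted = rng.normal(1.0, 1.0, size=5000)  # mean shifted by one std

psi_same = population_stability_index(train, prod_same)
psi_shifted = population_stability_index(train, prod_shifted)
print(psi_same, psi_shifted)
```

A PSI near zero suggests the production distribution matches the reference. The alert threshold is a convention that varies by team, so it is best calibrated on historical data.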

Company-Specific Tips

Google Interview Tips

Discuss scalable feature engineering for massive datasets
Be ready to explain feature hashing for high cardinality
Mention online feature engineering for streaming data
Talk about feature stores for production systems

Apple Interview Tips

Focus on on-device feature engineering for privacy
Discuss feature selection for model compression
Be prepared to explain temporal feature engineering
Mention federated feature engineering
