Naive Bayes: Assumptions, Laplace Smoothing & Text Classification
The probabilistic classifier that powers spam filters and search engines
Interview Question
"Explain the 'naive' assumption in Naive Bayes. Why does this assumption often work well in practice despite being incorrect? What is Laplace smoothing and why is it necessary?"
Difficulty: Medium | Frequently asked at Google, Baidu, Amazon
Theoretical Foundation
Bayes' Theorem
Naive Bayes is based on Bayes' theorem:
where:
- : Posterior probability of class given features
- : Likelihood of features given class
- : Prior probability of class
- : Evidence (normalizing constant)
The Naive Assumption
The "naive" assumption is conditional independence: features are independent given the class label.
This simplifies the posterior to:
Why is this "naive"?
- In reality, features are often correlated
- For example, word frequency and sentence length are correlated in text
- The assumption is rarely true in practice
βΉοΈ
Key Insight: Despite the unrealistic independence assumption, Naive Bayes often performs well because:
- It only needs to get the relative ordering of class probabilities correct
- Correlated features may cancel each other's effects
- The decision boundary is robust to small violations of independence
Why Naive Bayes Works Despite the Assumption
-
Classification Only Needs Ordering: We only need , not exact probabilities
-
Feature Correlation Cancellation: If features and are positively correlated, their individual likelihoods may overestimate the joint likelihood, but this affects all classes similarly
-
Regularization Effect: The independence assumption acts as implicit regularization, preventing overfitting
-
High-Dimensional Data: In high dimensions, the independence assumption becomes less problematic because the curse of dimensionality affects all models
Types of Naive Bayes
1. Gaussian Naive Bayes
Assumes features follow a Gaussian distribution:
- For continuous features
- Estimates and for each class and feature
2. Multinomial Naive Bayes
Assumes features are counts from a multinomial distribution:
where is the count of feature in class , and is the smoothing parameter.
- For discrete features (word counts in text)
- Commonly used for document classification
3. Bernoulli Naive Bayes
Assumes features are binary:
- For binary features (word presence/absence)
- Considers feature absence as informative
Laplace Smoothing
The Problem: If a feature value never appears with a class in training data, , making the entire posterior zero.
The Solution: Laplace smoothing adds to each count:
where:
- : Standard Laplace smoothing
- : Lidstone smoothing
- : No smoothing (can cause zero probabilities)
β οΈ
Interview Trap: Many candidates forget to mention Laplace smoothing. Without it, Naive Bayes fails completely when encountering unseen feature values. Always mention this when discussing Naive Bayes.
Log Probability
To avoid numerical underflow with many features, we use log probabilities:
This converts multiplication to addition, which is more numerically stable.
Feature Selection for Naive Bayes
Since Naive Bayes assumes independence, irrelevant features hurt performance. Common feature selection methods:
- Mutual Information: Select features most informative about the class
- Chi-Squared Test: Select features with highest statistic
- Document Frequency: Remove very rare or very common terms
Code Implementation
Explanation of Code
-
Gaussian NB: Demonstrates Naive Bayes on continuous data with class statistics.
-
Multinomial NB: Shows text classification using word counts.
-
Laplace Smoothing: Demonstrates how alpha affects performance.
-
Bernoulli NB: Shows binary feature classification.
-
Feature Importance: Analyzes which words are most indicative of each class.
-
NB Variants: Compares different Naive Bayes variants on mixed data.
-
Probability Calibration: Analyzes the quality of probability estimates.
Real-World Applications
Google: Spam Detection
Google uses Naive Bayes for:
- Gmail Spam Filter: Multinomial NB on word frequencies
- Search Query Classification: Categorizing search intent
- Language Detection: Identifying document language
Baidu: Search Ranking
Baidu uses Naive Bayes for:
- Query Understanding: Classifying search queries
- Content Classification: Categorizing web pages
- Ad Targeting: Predicting user interests
π‘
Google Interview Tip: Be prepared to discuss why Naive Bayes is still used despite being "outdated." Mention its speed, interpretability, and effectiveness as a baseline.
Common Follow-Up Questions
Q1: Why does Naive Bayes work well for text classification?
Text data has many features (words) that are somewhat independent. While words are correlated, the independence assumption acts as regularization. Also, the bag-of-words representation ignores word order, making the assumption less problematic.
Q2: How do you handle zero probabilities in Naive Bayes?
Use Laplace smoothing (additive smoothing) with . This ensures no probability is exactly zero. The choice of can be tuned via cross-validation.
Q3: What is the difference between Gaussian, Multinomial, and Bernoulli NB?
- Gaussian: For continuous features, assumes normal distribution
- Multinomial: For count data (word frequencies), assumes multinomial distribution
- Bernoulli: For binary features (word presence/absence), assumes Bernoulli distribution
Q4: How do you handle missing values in Naive Bayes?
Simply ignore the missing feature when computing the likelihood. The posterior is proportional to the product of available feature likelihoods.
Company-Specific Tips
Google Interview Tips
- Discuss scalability to massive datasets (billions of documents)
- Be ready to explain online learning for streaming text
- Mention feature hashing for memory efficiency
- Talk about semi-supervised Naive Bayes
Baidu Interview Tips
- Focus on Chinese text classification challenges
- Discuss character-level vs word-level features
- Be prepared to explain smoothing choices
- Mention domain adaptation for different topics