Google & Baidu Interview

Naive Bayes: Assumptions, Laplace Smoothing & Text Classification

The probabilistic classifier that powers spam filters and search engines

Interview Question

"Explain the 'naive' assumption in Naive Bayes. Why does this assumption often work well in practice despite being incorrect? What is Laplace smoothing and why is it necessary?"

Difficulty: Medium | Frequently asked at Google, Baidu, Amazon

Theoretical Foundation

Bayes' Theorem

Naive Bayes is based on Bayes' theorem:

P(C_k | x) = \frac{P(x | C_k) P(C_k)}{P(x)}

where:

$P(C_k | x)$ : Posterior probability of class $C_k$ given features $x$
$P(x | C_k)$ : Likelihood of features given class $C_k$
$P(C_k)$ : Prior probability of class $C_k$
$P(x)$ : Evidence (normalizing constant)

The Naive Assumption

The "naive" assumption is conditional independence: features are independent given the class label.

P(x | C_k) = \prod_{i=1}^{p} P(x_i | C_k)

This simplifies the posterior to:

P(C_k | x) \propto P(C_k) \prod_{i=1}^{p} P(x_i | C_k)

Why is this "naive"?

In reality, features are often correlated
For example, word frequency and sentence length are correlated in text
The assumption is rarely true in practice

ℹ️

Key Insight: Despite the unrealistic independence assumption, Naive Bayes often performs well because:

It only needs to get the relative ordering of class probabilities correct
Correlated features may cancel each other's effects
The decision boundary is robust to small violations of independence

Why Naive Bayes Works Despite the Assumption

Classification Only Needs Ordering: We only need $P(C_1|x) > P(C_2|x)$ , not exact probabilities
Feature Correlation Cancellation: If features $x_i$ and $x_j$ are positively correlated, their individual likelihoods may overestimate the joint likelihood, but this affects all classes similarly
Regularization Effect: The independence assumption acts as implicit regularization, preventing overfitting
High-Dimensional Data: In high dimensions, the independence assumption becomes less problematic because the curse of dimensionality affects all models

Types of Naive Bayes

1. Gaussian Naive Bayes

Assumes features follow a Gaussian distribution:

P(x_i | C_k) = \frac{1}{\sqrt{2\pi \sigma_k^2}} \exp\left(-\frac{(x_i - \mu_k)^2}{2\sigma_k^2}\right)

For continuous features
Estimates $\mu_k$ and $\sigma_k^2$ for each class and feature

2. Multinomial Naive Bayes

Assumes features are counts from a multinomial distribution:

P(x_i | C_k) = \frac{N_{ki} + \alpha}{N_k + \alpha p}

where $N_{ki}$ is the count of feature $i$ in class $C_k$ , and $\alpha$ is the smoothing parameter.

For discrete features (word counts in text)
Commonly used for document classification

3. Bernoulli Naive Bayes

Assumes features are binary:

P(x_i | C_k) = P(x_i = 1 | C_k)^{x_i} (1 - P(x_i = 1 | C_k))^{1-x_i}

For binary features (word presence/absence)
Considers feature absence as informative

Laplace Smoothing

The Problem: If a feature value never appears with a class in training data, $P(x_i | C_k) = 0$ , making the entire posterior zero.

The Solution: Laplace smoothing adds $\alpha$ to each count:

P(x_i | C_k) = \frac{N_{ki} + \alpha}{N_k + \alpha p}

where:

$\alpha = 1$ : Standard Laplace smoothing
$\alpha > 1$ : Lidstone smoothing
$\alpha = 0$ : No smoothing (can cause zero probabilities)

⚠️

Interview Trap: Many candidates forget to mention Laplace smoothing. Without it, Naive Bayes fails completely when encountering unseen feature values. Always mention this when discussing Naive Bayes.

Log Probability

To avoid numerical underflow with many features, we use log probabilities:

\log P(C_k | x) \propto \log P(C_k) + \sum_{i=1}^{p} \log P(x_i | C_k)

This converts multiplication to addition, which is more numerically stable.

Feature Selection for Naive Bayes

Since Naive Bayes assumes independence, irrelevant features hurt performance. Common feature selection methods:

Mutual Information: Select features most informative about the class
Chi-Squared Test: Select features with highest $\chi^2$ statistic
Document Frequency: Remove very rare or very common terms

Code Implementation

Explanation of Code

Gaussian NB: Demonstrates Naive Bayes on continuous data with class statistics.
Multinomial NB: Shows text classification using word counts.
Laplace Smoothing: Demonstrates how alpha affects performance.
Bernoulli NB: Shows binary feature classification.
Feature Importance: Analyzes which words are most indicative of each class.
NB Variants: Compares different Naive Bayes variants on mixed data.
Probability Calibration: Analyzes the quality of probability estimates.

Real-World Applications

Google: Spam Detection

Google uses Naive Bayes for:

Gmail Spam Filter: Multinomial NB on word frequencies
Search Query Classification: Categorizing search intent
Language Detection: Identifying document language

Baidu: Search Ranking

Baidu uses Naive Bayes for:

Query Understanding: Classifying search queries
Content Classification: Categorizing web pages
Ad Targeting: Predicting user interests

💡

Google Interview Tip: Be prepared to discuss why Naive Bayes is still used despite being "outdated." Mention its speed, interpretability, and effectiveness as a baseline.

Common Follow-Up Questions

Q1: Why does Naive Bayes work well for text classification?

Text data has many features (words) that are somewhat independent. While words are correlated, the independence assumption acts as regularization. Also, the bag-of-words representation ignores word order, making the assumption less problematic.

Q2: How do you handle zero probabilities in Naive Bayes?

Use Laplace smoothing (additive smoothing) with $\alpha \geq 1$ . This ensures no probability is exactly zero. The choice of $\alpha$ can be tuned via cross-validation.

Q3: What is the difference between Gaussian, Multinomial, and Bernoulli NB?

Gaussian: For continuous features, assumes normal distribution
Multinomial: For count data (word frequencies), assumes multinomial distribution
Bernoulli: For binary features (word presence/absence), assumes Bernoulli distribution

Q4: How do you handle missing values in Naive Bayes?

Simply ignore the missing feature when computing the likelihood. The posterior is proportional to the product of available feature likelihoods.

Naive Bayes: Assumptions, Laplace Smoothing & Text Classification

Naive Bayes: Assumptions, Laplace Smoothing & Text Classification

Interview Question

Theoretical Foundation

Bayes' Theorem

The Naive Assumption

Why Naive Bayes Works Despite the Assumption

Types of Naive Bayes

1. Gaussian Naive Bayes

2. Multinomial Naive Bayes

3. Bernoulli Naive Bayes

Laplace Smoothing

Log Probability

Feature Selection for Naive Bayes

Code Implementation

Explanation of Code

Real-World Applications

Google: Spam Detection

Baidu: Search Ranking

Common Follow-Up Questions

Company-Specific Tips

Google Interview Tips

Baidu Interview Tips

Related Topics