Gmail, Outlook, Yahoo Mail

Design Email Spam Detection at Scale

Building real-time spam classification for billions of emails with high precision

Interview Question

"Design an email spam detection system like Gmail that can classify billions of emails daily with >99.9% precision, handle adversarial attacks, and provide real-time protection against evolving spam patterns."

Difficulty: Hard | Frequently asked at Google (Gmail), Microsoft (Outlook), Yahoo, Proofpoint

1. Requirements Gathering

Functional Requirements

Real-time Classification: Classify incoming emails as spam/not-spam in real-time
Multi-class Detection: Detect different types of spam (phishing, malware, promotional, etc.)
User-specific Learning: Learn from individual user preferences and reporting
Quarantine Management: Move spam to separate folder with user control
Reporting and Feedback: Users can mark emails as spam/not-spam
Explainability: Explain why an email was classified as spam
Continuous Learning: Adapt to new spam patterns without full retraining

Non-Functional Requirements

Latency: < 100ms for classification (critical for email delivery)
Throughput: 100B+ emails/day, 10M+ emails/second at peak
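Precision: > 99.9% (at most 1 in 1,000 emails flagged as spam is legitimate; this is a different quantity from the false positive rate, which is measured over legitimate mail)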
Recall: > 99% (catch most spam)
Availability: 99.999% uptime
Scale: 1.8B+ Gmail users, petabytes of email data
Privacy: End-to-end encryption compatibility, GDPR compliance
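A quick back-of-the-envelope check of these targets can be sketched as follows. The 50% spam share is an assumed figure chosen only for illustration; the daily volume, peak rate, recall, and 0.1% false positive rate come from the list above.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def average_rate(emails_per_day: float) -> float:
    """Average emails per second implied by a daily volume."""
    return emails_per_day / SECONDS_PER_DAY

def precision_from_rates(spam_share: float, recall: float, fpr: float) -> float:
    """Precision implied by spam prevalence, recall, and false positive rate."""
    true_pos = spam_share * recall          # spam correctly flagged
    false_pos = (1.0 - spam_share) * fpr    # legitimate mail wrongly flagged
    return true_pos / (true_pos + false_pos)

avg = average_rate(100e9)                   # roughly 1.16M emails/second
peak_to_avg = 10e6 / avg                    # the 10M/s peak is about 8.6x the average

# ASSUMPTION: half of all traffic is spam (illustrative only).
p = precision_from_rates(spam_share=0.5, recall=0.99, fpr=0.001)
misfiled_per_day = 100e9 * 0.5 * 0.001      # legitimate emails misfiled per day

print(f"avg {avg:,.0f}/s, peak/avg {peak_to_avg:.1f}x, "
      f"precision {p:.5f}, misfiled/day {misfiled_per_day:,.0f}")
```

Under these assumed inputs the implied precision lands just under 99.9%, which illustrates that precision depends on spam prevalence as well as on the false positive rate, so the two targets are worth stating separately.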

ℹ️

Scale Perspective: Global email traffic is estimated at more than 300 billion messages per day, and Google reports that Gmail's filters block over 99.9% of spam, phishing, and malware before it reaches users. A system at this scale must handle massive volumes while keeping the false positive rate extremely low, so that legitimate emails are not blocked.

2. High-Level Architecture Overview

The spam detection system follows a multi-layered architecture: Pre-filtering → parallel Content Analysis, Behavioral Analysis, and User-specific Filtering → Decision Engine → Delivery.

Architecture Diagram

┌─────────────────────────────────────────────────────────────────────────────┐
│                         INCOMING EMAILS                                      │
│  SMTP Servers │ Email APIs │ Mobile Clients │ Web Clients │ IMAP/POP3       │
└─────────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                         PRE-FILTERING LAYER                                 │
│  IP Reputation │ Sender Authentication │ Rate Limiting │ Block Lists        │
└─────────────────────────────────────────────────────────────────────────────┘
                                    │
                    ┌───────────────┼───────────────┐
                    ▼               ▼               ▼
┌────────────────────────┐ ┌───────────────┐ ┌──────────────────────┐
│  CONTENT ANALYSIS      │ │ BEHAVIORAL    │ │ USER-SPECIFIC        │
│  (NLP + ML)            │ │ ANALYSIS      │ │ FILTERING            │
│  (Real-time)           │ │ (Batch)       │ │ (Real-time)          │
└────────────────────────┘ └───────────────┘ └──────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                        DECISION ENGINE                                       │
│  Score Aggregation │ Threshold Tuning │ Explanation │ Action Execution      │
└─────────────────────────────────────────────────────────────────────────────┘
                                    │
                    ┌───────────────┼───────────────┐
                    ▼               ▼               ▼
┌────────────────────────┐ ┌───────────────┐ ┌──────────────────────┐
│  INBOX DELIVERY        │ │ SPAM FOLDER   │ │ QUARANTINE           │
│  (Safe emails)         │ │ (Suspicious)  │ │ (Malicious)          │
└────────────────────────┘ └───────────────┘ └──────────────────────┘

💡

Key Insight: Spam detection requires multiple layers of defense. No single model can catch all spam while maintaining low false positives. The system must combine content analysis, sender reputation, user behavior, and real-time feedback.
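As a minimal sketch of how the Decision Engine in the diagram might map layer outputs to the three delivery outcomes: the function name, the inputs, and the 0.5 threshold are all assumptions made for illustration, not tuned or documented values.

```python
from enum import Enum

class Action(Enum):
    INBOX = "inbox"
    SPAM_FOLDER = "spam_folder"
    QUARANTINE = "quarantine"

# ASSUMPTION: illustrative threshold, not a tuned value.
SPAM_THRESHOLD = 0.5

def decide(prefilter_blocked: bool, spam_score: float, malicious: bool) -> Action:
    """Map layer outputs to one of the three delivery outcomes."""
    if prefilter_blocked or malicious:
        return Action.QUARANTINE     # blocklisted sender or malware/phishing verdict
    if spam_score >= SPAM_THRESHOLD:
        return Action.SPAM_FOLDER    # suspicious, but recoverable by the user
    return Action.INBOX

print(decide(False, 0.12, False), decide(False, 0.80, False), decide(True, 0.10, False))
```

Checking the cheap pre-filter verdict first means obviously bad traffic never reaches the more expensive model layers.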

3. Data Pipeline Design

3.1 Email Data Model

from dataclasses import dataclass
from typing import List, Dict, Optional
from datetime import datetime

@dataclass
class Email:
    message_id: str
    sender: str
    recipients: List[str]
    subject: str
    timestamp: datetime
    body_text: str
    body_html: Optional[str]
    attachments: List[Dict]
    headers: Dict[str, str]
    sender_domain: str
    sender_ip: str
    sender_authenticated: bool
    email_size: int
    has_attachments: bool
    attachment_types: List[str]

@dataclass
class SpamLabel:
    message_id: str
    is_spam: bool
    spam_type: Optional[str]
    confidence: float
    reporter_id: Optional[str]
    report_reason: Optional[str]
    resolved_at: Optional[datetime]
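To show where several of the Email fields could come from, here is a small sketch that parses a raw message with Python's standard `email` package. The sample message and the `parse_basic_fields` helper are hypothetical; `sender_ip` is not covered because it comes from the SMTP connection rather than from the message itself.

```python
from email import message_from_string
from email.utils import getaddresses, parseaddr

# Hypothetical sample message, used only for illustration.
RAW = """\
From: Alice <alice@example.com>
To: bob@example.org, carol@example.org
Subject: Quarterly report
Message-ID: <abc123@example.com>
Authentication-Results: mx.example.org; spf=pass; dkim=pass; dmarc=pass

Hi Bob, the report is attached.
"""

def parse_basic_fields(raw: str) -> dict:
    """Extract a subset of the Email fields from a raw single-part message."""
    msg = message_from_string(raw)
    _, sender = parseaddr(msg.get("From", ""))
    recipients = [addr for _, addr in getaddresses(msg.get_all("To", []))]
    return {
        "message_id": msg.get("Message-ID", ""),
        "sender": sender,
        "sender_domain": sender.rpartition("@")[2].lower(),
        "recipients": recipients,
        "subject": msg.get("Subject", ""),
        "body_text": msg.get_payload(),   # a string for non-multipart messages
        "headers": dict(msg.items()),
    }

fields = parse_basic_fields(RAW)
print(fields["sender_domain"], fields["recipients"])
```

Multipart messages and attachments would need `msg.walk()` instead of a single `get_payload()` call; that is left out to keep the sketch short.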

3.2 Feature Engineering Pipeline

class SpamFeatureExtractor:
    def __init__(self):
        self.text_analyzer = TextAnalyzer()
        self.sender_analyzer = SenderAnalyzer()
        self.url_analyzer = URLAnalyzer()
        
    async def extract_features(self, email: Email) -> Dict:
        features = {}
        content_features = await self.extract_content_features(email)
        features.update(content_features)
        sender_features = await self.extract_sender_features(email)
        features.update(sender_features)
        url_features = await self.extract_url_features(email)
        features.update(url_features)
        header_features = self.extract_header_features(email)
        features.update(header_features)
        user_features = await self.extract_user_features(email)
        features.update(user_features)
        return features
    
    async def extract_content_features(self, email: Email):
        text = email.body_text or ""
        html = email.body_html or ""
        return {
            'text_length': len(text),
            'word_count': len(text.split()),
            'uppercase_ratio': sum(1 for c in text if c.isupper()) / max(len(text), 1),
            'spam_keyword_count': self.count_spam_keywords(text),
            'urgency_keyword_count': self.count_urgency_keywords(text),
            'has_html': bool(html),
            'html_to_text_ratio': len(html) / max(len(text), 1),
            'attachment_count': len(email.attachments),
            'has_executable': self.has_executable_attachment(email.attachments),
            'sentiment_score': await self.analyze_sentiment(text),
            'language': await self.detect_language(text)
        }
    
    async def extract_sender_features(self, email: Email):
        sender_domain = email.sender_domain
        sender_ip = email.sender_ip
        return {
            'spf_pass': 'spf=pass' in email.headers.get('Authentication-Results', ''),
            'dkim_pass': 'dkim=pass' in email.headers.get('Authentication-Results', ''),
            'dmarc_pass': 'dmarc=pass' in email.headers.get('Authentication-Results', ''),
            'domain_age_days': await self.get_domain_age(sender_domain),
            'domain_is_free_provider': self.is_free_email_provider(sender_domain),
            'ip_reputation_score': await self.get_ip_reputation(sender_ip),
            'ip_in_blocklist': await self.check_ip_blocklist(sender_ip),
            'sender_daily_volume': await self.get_sender_daily_volume(email.sender),
            'sender_spam_history': await self.get_sender_spam_history(email.sender)
        }
    
    async def extract_url_features(self, email: Email):
        urls = self.extract_urls(email.body_text, email.body_html)
        url_features = {
            'url_count': len(urls),
            'has_shortened_urls': self.has_shortened_urls(urls),
            'has_ip_urls': self.has_ip_urls(urls),
            'has_https': self.has_https(urls)
        }
        for i, url in enumerate(urls[:5]):
            url_analysis = await self.url_analyzer.analyze(url)
            url_features[f'url_{i}_reputation'] = url_analysis['reputation']
            url_features[f'url_{i}_is_phishing'] = url_analysis['is_phishing']
        return url_features
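The URL helpers called in extract_url_features are not defined in the code above. One possible sketch, using only the standard library, is shown below. The regular expression and the shortener list are simplifying assumptions; a production system would use a proper HTML parser and a maintained domain list.

```python
import ipaddress
import re
from typing import List, Optional
from urllib.parse import urlparse

# ASSUMPTION: a simple pattern; trailing punctuation may be captured.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)
# ASSUMPTION: a tiny illustrative list of shortener domains.
SHORTENER_DOMAINS = {"bit.ly", "t.co", "tinyurl.com"}

def extract_urls(body_text: str, body_html: Optional[str] = None) -> List[str]:
    """Collect unique URLs from the text and HTML bodies, preserving order."""
    combined = f"{body_text or ''}\n{body_html or ''}"
    return list(dict.fromkeys(URL_PATTERN.findall(combined)))

def has_shortened_urls(urls: List[str]) -> bool:
    """True if any URL points at a known shortener domain."""
    return any((urlparse(u).hostname or "") in SHORTENER_DOMAINS for u in urls)

def has_ip_urls(urls: List[str]) -> bool:
    """True if any URL uses a literal IP address as its host."""
    for u in urls:
        host = urlparse(u).hostname or ""
        try:
            ipaddress.ip_address(host)
            return True
        except ValueError:
            continue
    return False

urls = extract_urls(
    "Visit http://192.168.0.1/login now or https://bit.ly/x1",
    '<a href="https://example.com/a">x</a>',
)
print(urls, has_shortened_urls(urls), has_ip_urls(urls))
```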

⚠️

Critical Feature Engineering Considerations:

Temporal features: Spam patterns change rapidly - features must be fresh
User-specific features: Different users have different spam tolerance
Adversarial robustness: Spammers constantly try to evade detection
Privacy: Features must be computed without exposing email content

4. Model Selection and Training Approach

4.1 Multi-Model Architecture

import numpy as np
from typing import Dict

class SpamDetectionEnsemble:
    def __init__(self):
        # One specialist model per signal family, plus a meta-model that combines them
        self.content_model = self.load_model('content_classifier_v2')
        self.sender_model = self.load_model('sender_reputation_v1')
        self.url_model = self.load_model('url_classifier_v1')
        self.user_model = self.load_model('user_classifier_v1')
        self.meta_model = self.load_model('meta_scorer_v1')
    
    def predict(self, features: Dict) -> Dict:
        # predict_proba returns shape (n_samples, n_classes);
        # [0][1] is P(spam) for the single email being scored
        predictions = {}
        predictions['content'] = self.content_model.predict_proba(features['content_features'])[0][1]
        predictions['sender'] = self.sender_model.predict_proba(features['sender_features'])[0][1]
        predictions['url'] = self.url_model.predict_proba(features['url_features'])[0][1]
        predictions['user'] = self.user_model.predict_proba(features['user_features'])[0][1]
        
        # Stack component scores (plus any precomputed scores) as meta-model input
        meta_features = np.array([
            predictions['content'], predictions['sender'],
            predictions['url'], predictions['user'],
            features.get('spam_score', 0), features.get('phishing_score', 0)
        ]).reshape(1, -1)
        
        final_score = self.meta_model.predict_proba(meta_features)[0][1]
        return {
            'spam_score': final_score,
            'component_scores': predictions,
            'spam_type': self.classify_spam_type(predictions, features)
        }
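The classify_spam_type method is called above but never shown. A rule-based version might look like the sketch below; the category names follow the multi-class requirement in section 1, while the rules and the 0.5 threshold are purely illustrative assumptions.

```python
from typing import Dict, Optional

def classify_spam_type(predictions: Dict[str, float], features: Dict,
                       threshold: float = 0.5) -> Optional[str]:
    """Pick a coarse spam category from component scores (illustrative rules)."""
    # No component is confident enough: do not assign a spam type at all.
    if max(predictions.values(), default=0.0) < threshold:
        return None
    if features.get("has_executable"):
        return "malware"
    if predictions.get("url", 0.0) >= threshold:
        return "phishing"
    if predictions.get("content", 0.0) >= threshold:
        return "promotional"
    return "other"

print(classify_spam_type({"content": 0.2, "url": 0.9}, {}))
```

A learned multi-class head would likely replace these rules in practice; the point here is only the shape of the inputs and output.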

4.2 Transformer-based Content Model

import torch.nn as nn
from transformers import AutoModel

class TransformerSpamClassifier(nn.Module):
    def __init__(self, model_name='bert-base-uncased'):
        super().__init__()
        # Pretrained encoder with a small binary classification head on top
        self.encoder = AutoModel.from_pretrained(model_name)
        self.dropout = nn.Dropout(0.1)
        self.classifier = nn.Linear(self.encoder.config.hidden_size, 2)
    
    def forward(self, input_ids, attention_mask=None):
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        # pooler_output is the pooled [CLS] representation for BERT-style encoders
        pooled_output = self.dropout(outputs.pooler_output)
        logits = self.classifier(pooled_output)
        return logits

import torch

class SpamTrainingPipeline:
    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer
    
    def focal_loss(self, y_true, y_pred, alpha=0.25, gamma=2.0):
        # y_true: float tensor of 0/1 labels; y_pred: predicted P(spam) in [0, 1]
        y_pred = torch.clamp(y_pred, 1e-7, 1 - 1e-7)
        bce = -y_true * torch.log(y_pred) - (1 - y_true) * torch.log(1 - y_pred)
        # p_t is the probability assigned to the true class
        p_t = y_true * y_pred + (1 - y_true) * (1 - y_pred)
        alpha_t = y_true * alpha + (1 - y_true) * (1 - alpha)
        # Down-weight easy examples (p_t close to 1); returns per-example losses
        focal_weight = alpha_t * torch.pow(1 - p_t, gamma)
        return focal_weight * bce

ℹ️

Training Strategy for spam detection:

Use focal loss to handle extreme class imbalance
Combine multiple data sources (user reports, honey pots, threat intelligence)
Use data augmentation for rare spam types
Implement hard negative mining for difficult examples
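To make the effect of focal loss concrete, the sketch below evaluates the same formula as the training pipeline above for a single example in plain Python, using the same default alpha and gamma. The two example probabilities are arbitrary values chosen for illustration.

```python
import math

def bce(y_true: int, p: float) -> float:
    """Binary cross-entropy for one example, where p is the predicted P(spam)."""
    p_t = p if y_true == 1 else 1 - p
    return -math.log(p_t)

def focal_loss(y_true: int, p: float, alpha: float = 0.25, gamma: float = 2.0) -> float:
    """Binary focal loss for one example, where p is the predicted P(spam)."""
    p = min(max(p, 1e-7), 1 - 1e-7)
    p_t = p if y_true == 1 else 1 - p              # probability of the true class
    alpha_t = alpha if y_true == 1 else 1 - alpha  # class-balancing weight
    return -alpha_t * (1 - p_t) ** gamma * math.log(p_t)

easy = focal_loss(1, 0.95)   # spam the model already classifies confidently
hard = focal_loss(1, 0.30)   # spam the model currently misses

print(f"BCE ratio hard/easy:   {bce(1, 0.30) / bce(1, 0.95):.1f}")
print(f"focal ratio hard/easy: {hard / easy:.1f}")
```

The hard example dominates far more strongly under focal loss than under plain cross-entropy, which is the property that helps when easy examples vastly outnumber difficult ones.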

5. Serving Architecture

5.1 Real-time Classification Pipeline

Architecture Diagram

Email → Pre-filtering → Feature Extraction → Model Scoring → Decision → Delivery
         (< 10ms)          (< 20ms)           (< 30ms)       (< 10ms)
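The per-stage budgets above sum to 70 ms, which leaves 30 ms of headroom under the 100 ms target. One way to enforce a budget, sketched below with `asyncio.wait_for`, is to give each stage a deadline and fall back to a default when it is exceeded. The stage names, the `run_stage` helper, and the simulated scoring functions are hypothetical.

```python
import asyncio

# Per-stage budgets in seconds, taken from the pipeline above.
BUDGETS = {"prefilter": 0.010, "features": 0.020, "scoring": 0.030, "decision": 0.010}

async def run_stage(name, coro, fallback):
    """Run one stage under its budget; return the fallback value on timeout."""
    try:
        return await asyncio.wait_for(coro, timeout=BUDGETS[name])
    except asyncio.TimeoutError:
        return fallback

async def slow_scoring():
    await asyncio.sleep(0.2)   # simulates a model call that blows its budget
    return 0.99

async def fast_scoring():
    return 0.12

async def main():
    timed_out = await run_stage("scoring", slow_scoring(), fallback=None)
    on_time = await run_stage("scoring", fast_scoring(), fallback=None)
    return timed_out, on_time

results = asyncio.run(main())
print(results)
```

What the fallback should be is a product decision: for example, delivering with only the pre-filter verdict and rescoring asynchronously, rather than delaying the email.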

5.2 Model Serving

import asyncio
import time
from typing import Dict

import onnxruntime as ort

class SpamModelServing:
    def __init__(self):
        # ONNX Runtime session; falls back to CPU when the CUDA provider is unavailable
        self.onnx_session = ort.InferenceSession(
            "spam_classifier.onnx",
            providers=['CUDAExecutionProvider', 'CPUExecutionProvider']
        )
        self.models = {
            'transformer': TransformerModel(),
            'gradient_boosting': GradientBoostingModel(),
            'logistic_regression': LogisticRegressionModel()
        }
        # Ensemble weights sum to 1.0
        self.model_weights = {'transformer': 0.5, 'gradient_boosting': 0.3, 'logistic_regression': 0.2}
    
    async def predict(self, features: Dict) -> Dict:
        start = time.perf_counter()
        # Score all models concurrently instead of awaiting them one by one
        names = list(self.models)
        scores = await asyncio.gather(*(self.models[name].predict(features) for name in names))
        predictions = dict(zip(names, scores))
        
        ensemble_prediction = sum(predictions[name] * self.model_weights[name] for name in predictions)
        return {
            'spam_score': float(ensemble_prediction),
            'is_spam': bool(ensemble_prediction > 0.5),
            'individual_predictions': predictions,
            'inference_time_ms': (time.perf_counter() - start) * 1000
        }

💡

Latency Optimization Tips:

Implement fast pre-filtering with blocklists/allowlists
Cache features aggressively (sender reputation changes slowly)
Use model distillation for faster inference
Implement request batching for GPU efficiency
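As an illustration of the caching tip, here is a minimal in-process TTL cache for slow-changing lookups such as sender reputation. The class and the injected `clock` parameter are assumptions made for this sketch; at scale the same idea would typically sit in front of a shared store.

```python
import time
from typing import Callable, Dict, Tuple

class TTLCache:
    """Cache computed values for a fixed time-to-live."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock                      # injectable so expiry can be tested
        self._store: Dict[str, Tuple[float, float]] = {}

    def get_or_compute(self, key: str, compute: Callable[[str], float]) -> float:
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]                       # still fresh: skip the expensive lookup
        value = compute(key)
        self._store[key] = (now, value)
        return value

# Usage with a fake clock and a stand-in for the expensive reputation lookup.
fake_now = [0.0]
calls = []

def lookup_reputation(domain: str) -> float:
    calls.append(domain)
    return 0.8

cache = TTLCache(ttl_seconds=60, clock=lambda: fake_now[0])
cache.get_or_compute("example.com", lookup_reputation)
cache.get_or_compute("example.com", lookup_reputation)   # served from cache
fake_now[0] = 61.0
cache.get_or_compute("example.com", lookup_reputation)   # expired, recomputed
print(len(calls))
```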

6. Monitoring and Observability

6.1 Key Metrics

class SpamMonitoringMetrics:
    MODEL_METRICS = ['precision', 'recall', 'f1_score', 'auc_roc', 'false_positive_rate']
    BUSINESS_METRICS = ['spam_detection_rate', 'user_report_rate', 'false_positive_complaints']
    OPERATIONAL_METRICS = ['classification_latency_p50', 'classification_latency_p99', 'throughput_emails_per_second']
    ADVERSARIAL_METRICS = ['new_spam_patterns_detected', 'evasion_attempts_detected', 'model_robustness_score']

⚠️

Critical Monitoring Points:

False positives: Even 0.1% FPR can block millions of legitimate emails
Adversarial attacks: Monitor for new evasion techniques
Model drift: Spam patterns evolve rapidly
User complaints: Track user reports of missed spam or false positives
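For the model drift point, one widely used signal is the population stability index (PSI) between a baseline score distribution and a recent one. This is a swapped-in technique that the text above does not name, and the bin count and smoothing constant below are assumptions.

```python
import math
from typing import List, Sequence

def histogram(scores: Sequence[float], bins: int = 10) -> List[float]:
    """Fraction of scores in each equal-width bin over [0, 1]."""
    counts = [0] * bins
    for s in scores:
        idx = min(max(int(s * bins), 0), bins - 1)
        counts[idx] += 1
    n = max(len(scores), 1)
    return [c / n for c in counts]

def psi(expected: Sequence[float], actual: Sequence[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population stability index between two score samples."""
    e = histogram(expected, bins)
    a = histogram(actual, bins)
    # eps avoids log(0) when a bin is empty in one of the samples
    return sum((ai - ei) * math.log((ai + eps) / (ei + eps)) for ei, ai in zip(e, a))

baseline = [i / 100 for i in range(100)]          # last week's spam scores
shifted = [min(s + 0.3, 0.99) for s in baseline]  # this week's scores, drifted upward

print(psi(baseline, baseline), psi(baseline, shifted) > 0.25)
```

A commonly cited rule of thumb treats PSI above roughly 0.25 as a significant shift worth investigating, though the right alert threshold depends on the system.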

7. Scale Considerations and Trade-offs

7.1 Horizontal Scaling

Architecture Diagram

Email Volume: Partition by recipient domain or message hash
Model Serving: GPU instances for transformer models, CPU for gradient boosting
Feature Store: Redis cluster with read replicas
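Partitioning by message hash can be sketched as below. The `shard_for` helper is hypothetical; it uses `hashlib` because Python's built-in `hash()` for strings is randomized per process, so it would not give a stable assignment across workers.

```python
import hashlib
from collections import Counter

def shard_for(message_id: str, num_shards: int) -> int:
    """Stable shard assignment derived from a hash of the message ID."""
    digest = hashlib.sha256(message_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# Rough check that 10,000 synthetic IDs spread evenly over 16 shards.
counts = Counter(shard_for(f"<msg-{i}@example.com>", 16) for i in range(10_000))
print(min(counts.values()), max(counts.values()))
```

Plain modulo hashing reassigns most keys when the shard count changes, so a consistent hashing scheme is a common alternative when shards are added or removed often.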

7.2 Cost vs Performance Trade-offs

Dimension	Option A (Cost Optimized)	Option B (Performance Optimized)
Model Complexity	Logistic regression (fast)	Transformer model (accurate)
Feature Freshness	Batch features (hourly)	Real-time features (streaming)
Model Retraining	Weekly retraining	Daily retraining
Inference Hardware	CPU instances	GPU instances
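The two columns do not have to be an either/or choice. One possible middle ground, which the table does not itself describe, is a cascade: score every email with the cheap model and call the expensive model only when the cheap score is uncertain. The function name and the 0.05/0.95 band below are assumptions for illustration.

```python
from typing import Callable, Tuple

def cascade_score(features: dict,
                  cheap_model: Callable[[dict], float],
                  expensive_model: Callable[[dict], float],
                  low: float = 0.05, high: float = 0.95) -> Tuple[float, str]:
    """Return (score, model_used); escalate only when the cheap model is unsure."""
    score = cheap_model(features)
    if score <= low or score >= high:
        return score, "cheap"                     # confident either way: stop here
    return expensive_model(features), "expensive"

# Stand-in models for the usage example.
confident = cascade_score({}, lambda f: 0.01, lambda f: 0.50)
uncertain = cascade_score({}, lambda f: 0.40, lambda f: 0.90)
print(confident, uncertain)
```

The share of traffic that falls inside the uncertain band determines how much accelerator capacity the expensive model needs.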

8. Advanced Topics

8.1 Adversarial Robustness

class AdversarialDetector:
    def __init__(self):
        self.text_analyzer = AdversarialTextAnalyzer()
        self.image_analyzer = AdversarialImageAnalyzer()
    
    async def detect(self, email, features):
        scores = []
        text_score = await self.text_analyzer.detect_obfuscation(email.body_text)
        scores.append(text_score)
        if email.attachments:
            image_score = await self.image_analyzer.detect_image_text(email.attachments)
            scores.append(image_score)
        encoding_score = self.detect_encoding_tricks(email)
        scores.append(encoding_score)
        return max(scores) if scores else 0
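The detect_obfuscation call above is left undefined. A simple text-level version might normalize common tricks (zero-width characters, compatibility forms such as fullwidth letters, digit-for-letter swaps) and score how much of the text was suspicious. Everything below is a hypothetical sketch: the character lists are small samples, and NFKC normalization does not catch cross-script homoglyphs such as Cyrillic look-alikes.

```python
import unicodedata

ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}
# ASSUMPTION: a small sample of digit/symbol-for-letter substitutions.
LEET_MAP = {"0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "@": "a", "$": "s"}
LEET_TABLE = str.maketrans(LEET_MAP)

def normalize_for_matching(text: str) -> str:
    """Canonical form for keyword matching only (it also rewrites real digits)."""
    text = unicodedata.normalize("NFKC", text)
    text = "".join(ch for ch in text if ch not in ZERO_WIDTH)
    return text.lower().translate(LEET_TABLE)

def obfuscation_score(text: str) -> float:
    """Fraction of characters that look like deliberate obfuscation (0 = clean)."""
    if not text:
        return 0.0
    suspicious = 0
    for i, ch in enumerate(text):
        if ch in ZERO_WIDTH or unicodedata.normalize("NFKC", ch) != ch:
            suspicious += 1
        elif (ch in LEET_MAP and 0 < i < len(text) - 1
              and text[i - 1].isalpha() and text[i + 1].isalpha()):
            suspicious += 1   # digit or symbol sandwiched between letters
    return suspicious / len(text)

sample = "fr\u200bee m0ney"
print(normalize_for_matching(sample), obfuscation_score(sample))
```

The sandwich rule also fires on legitimate strings such as email addresses, so a score like this is better used as one feature among many than as a verdict on its own.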

8.2 User-Specific Filtering

class UserSpecificFiltering:
    def __init__(self, global_model, user_models):
        self.global_model = global_model
        self.user_models = user_models  # maps user_id -> per-user model
    
    async def classify_with_user_context(self, email, user_id):
        global_prediction = await self.global_model.predict(email)
        if user_id in self.user_models:
            # Blend the global score with the user's own model when one exists
            user_prediction = await self.user_models[user_id].predict(email)
            final_prediction = 0.7 * global_prediction + 0.3 * user_prediction
        else:
            final_prediction = global_prediction
        user_threshold = await self.get_user_threshold(user_id)
        return {
            'spam_score': final_prediction,
            'is_spam': final_prediction > user_threshold,
            'user_threshold': user_threshold
        }
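How get_user_threshold derives a per-user threshold is not specified. One simple possibility is to nudge a base threshold using the user's own feedback, as in the sketch below. The function name, step size, and clamping bounds are all assumptions.

```python
def user_threshold(base: float, false_positive_reports: int, missed_spam_reports: int,
                   step: float = 0.02, lower: float = 0.3, upper: float = 0.9) -> float:
    """Raise the threshold for users who rescue mail from spam; lower it for
    users who report missed spam. The result is clamped to [lower, upper]."""
    adjusted = base + step * (false_positive_reports - missed_spam_reports)
    return min(max(adjusted, lower), upper)

print(user_threshold(0.5, false_positive_reports=3, missed_spam_reports=0))
```

Clamping keeps a burst of reports from pushing any single user to a threshold where nearly everything, or nearly nothing, is filtered.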

8.3 Explainable Spam Detection

class SpamExplainer:
    def generate_narrative(self, features, top_features):
        # top_features: iterable of (feature_name, importance) pairs, most important first
        narrative_parts = []
        for feature_name, importance in top_features:
            if feature_name == 'sender_reputation' and features[feature_name] < 0.3:
                narrative_parts.append("Sender has poor reputation score")
            elif feature_name == 'spam_keyword_count' and features[feature_name] > 5:
                narrative_parts.append(f"Email contains {features[feature_name]} spam-related keywords")
            elif feature_name == 'url_reputation' and features[feature_name] < 0.3:
                narrative_parts.append("Email contains URLs with poor reputation")
        if not narrative_parts:
            # Avoid returning a dangling "Reasons: " when no rule matched
            return "No specific reasons identified"
        return "Reasons: " + "; ".join(narrative_parts)

ℹ️

Explainability Requirements for spam detection:

Provide clear explanations to users
Allow users to override spam decisions
Track explanation quality metrics
Use explanations to improve model performance
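The explainer above expects a list of (feature_name, importance) pairs but does not show where it comes from. For a linear model, one simple source is weight times feature value, as in the sketch below. The weights and feature values are made up for illustration, and non-linear models would need an attribution method instead.

```python
from typing import Dict, List, Tuple

def top_contributions(features: Dict[str, float], weights: Dict[str, float],
                      k: int = 3) -> List[Tuple[str, float]]:
    """Rank features by absolute contribution (weight * value) in a linear model."""
    contributions = {
        name: weights[name] * float(value)
        for name, value in features.items()
        if name in weights
    }
    return sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)[:k]

# Hypothetical feature values and weights.
features = {"spam_keyword_count": 8, "sender_reputation": 0.1, "url_reputation": 0.9}
weights = {"spam_keyword_count": 0.4, "sender_reputation": -2.0, "url_reputation": -1.0}

top = top_contributions(features, weights, k=2)
print(top)
```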

9. Implementation Roadmap

Phase 1: Rule-based System (Weeks 1-2)

Implement basic rule engine
Set up email ingestion pipeline
Create basic monitoring dashboard

Phase 2: ML Models (Weeks 3-6)

Feature engineering pipeline
Train initial gradient boosting model
Implement model serving

Phase 3: Advanced Detection (Weeks 7-10)

Transformer-based content model
User-specific filtering
Adversarial robustness

Phase 4: Optimization (Weeks 11-14)

Latency optimization
Cost optimization
Advanced monitoring

10. Summary and Key Takeaways

Architecture Recap

Multi-layered defense: Pre-filtering + Content analysis + Behavioral analysis
Real-time features: Streaming features for up-to-date classification
Ensemble approach: Combine multiple models for better performance
Explainable decisions: All decisions must be explainable to users

Key Metrics

Precision/Recall: > 99.9% precision, > 99% recall
Latency: < 100ms for classification
Throughput: 10M+ emails/second

Common Interview Mistakes

Not discussing false positive rate (critical for user experience)
Ignoring adversarial robustness
Forgetting about user-specific preferences
Not considering privacy requirements

ℹ️

Final Interview Tip: Emphasize the balance between catching spam and avoiding false positives. Discuss how you'd handle adversarial attacks and user feedback. Show understanding of both ML techniques and production requirements for email systems.
