The Case Study Interview
Data science case studies test your ability to think through messy, real-world problems. Unlike coding interviews, there's no single correct answer β interviewers want to see your reasoning process.
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β What Interviewers Evaluate β
β β
β ββββββββββββββββ ββββββββββββββββ ββββββββββββββββ β
β β Structure β β Technical β β Communicationβ β
β β β β Depth β β β β
β β Can you β β Do you know β β Can you β β
β β break down β β the right β β explain β β
β β a problem? β β tools? β β clearly? β β
β ββββββββββββββββ ββββββββββββββββ ββββββββββββββββ β
β 40% 30% 30% β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Case Study Format
The Standard Framework
DfCase Study Response Structure
A structured approach to answering data science case study questions.
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Case Study Response Structure β
β β
β 1. CLARIFY (2-3 min) β
β ββ Ask questions to understand the problem β
β β
β 2. STRUCTURE (3-5 min) β
β ββ Break into components, choose approach β
β β
β 3. DIVE DEEP (15-20 min) β
β ββ Work through the analysis β
β β
β 4. SYNTHESIZE (3-5 min) β
β ββ Summarize findings, recommend action β
β β
β Total: 25-30 minutes β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Clarifying Questions to Always Ask
clarifying_questions = {
"Problem Definition": [
"What business problem are we trying to solve?",
"Who is the stakeholder and what decisions will this inform?",
"What does success look like? How will we measure it?",
"Is this a one-time analysis or an ongoing system?"
],
"Data Understanding": [
"What data do we have access to?",
"How is the data collected? What's the data generation process?",
"What's the time range and granularity?",
"Are there known data quality issues?",
"What are the key entities and relationships?"
],
"Constraints": [
"What's the timeline for this project?",
"Are there engineering resources for deployment?",
"What's the current process and why does it need to change?",
"Are there regulatory or ethical constraints?"
],
"Scope": [
"What's the minimum viable solution?",
"Are there existing baselines or benchmarks?",
"What has been tried before and what happened?"
]
}
Product Metrics
The Metric Framework
DfProduct Metrics Hierarchy
A hierarchical approach to defining product metrics, starting with a north star metric and breaking down into input, process, and output metrics.
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Product Metrics Hierarchy β
β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β NORTH STAR METRIC β β
β β (Single metric that captures value) β β
β β e.g., Weekly Active Buyers β β
β ββββββββββββββββββββββββββ¬ββββββββββββββββββββββββββββββ β
β β β
β ββββββββββββββββββΌβββββββββββββββββ β
β β β β β
β βββββββΌββββββ βββββββΌββββββ βββββββΌββββββ β
β β Input β β Process β β Output β β
β β Metrics β β Metrics β β Metrics β β
β β β β β β β β
β β # of new β β Conversionβ β Revenue β β
β β signups β β rate β β per user β β
β βββββββββββββ βββββββββββββ βββββββββββββ β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Common Product Metrics Definitions
# Key metrics every data scientist should know
METRICS = {
# Acquisition
"CAC": "Customer Acquisition Cost = Marketing Spend / New Customers",
# Activation
"Activation Rate": "% of signups completing key action within X days",
"Time to Value": "Time from signup to first 'aha moment'",
# Engagement
"DAU/MAU": "Daily Active Users / Monthly Active Users = Stickiness",
"Session Duration": "Average time spent per session",
"Feature Adoption": "% of users using a specific feature",
# Revenue
"ARPU": "Average Revenue Per User = Total Revenue / Active Users",
"LTV": "Customer Lifetime Value = ARPU Γ Average Lifespan",
"LTV/CAC": "Target: >3 for healthy unit economics",
# Retention
"Retention Rate": "% of users returning after N days",
"Churn Rate": "% of users leaving per period",
"Net Revenue Retention": "Expansion + Renewals - Churn - Contraction",
# Satisfaction
"NPS": "Net Promoter Score = % Promoters - % Detractors",
"CSAT": "Customer Satisfaction Score"
}
Retention Curve Analysis
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
def retention_curve_analysis(retention_data: dict) -> dict:
"""
Analyze retention curves and identify key patterns.
Parameters:
retention_data: {day: retention_rate} e.g., {1: 1.0, 7: 0.45, 30: 0.25}
"""
days = np.array(list(retention_data.keys()))
rates = np.array(list(retention_data.values()))
# Fit exponential decay: R(t) = a * exp(-b*t) + c
def exp_decay(t, a, b, c):
return a * np.exp(-b * t) + c
try:
popt, pcov = curve_fit(exp_decay, days, rates, p0=[0.8, 0.1, 0.1],
maxfev=5000)
a, b, c = popt
# Steady-state retention (asymptote)
steady_state = c
# Half-life (days to lose half of remaining users)
half_life = np.log(2) / b if b > 0 else float('inf')
# Critical period (when curve flattens)
# Find where derivative < threshold
derivatives = np.gradient(exp_decay(days, *popt), days)
flattening_day = days[np.argmin(np.abs(derivatives + 0.01))]
except Exception:
steady_state = rates[-1]
half_life = None
flattening_day = None
return {
"steady_state_retention": steady_state,
"half_life_days": half_life,
"flattening_day": flattening_day,
"d1_retention": retention_data.get(1, None),
"d7_retention": retention_data.get(7, None),
"d30_retention": retention_data.get(30, None),
"is_healthy": steady_state > 0.15 if steady_state else False
}
# Example usage
retention = {
1: 1.00, 2: 0.65, 3: 0.52, 7: 0.38,
14: 0.28, 30: 0.22, 60: 0.18, 90: 0.16
}
analysis = retention_curve_analysis(retention)
print(f"Steady-state retention: {analysis['steady_state_retention']:.1%}")
print(f"Healthy: {analysis['is_healthy']}")
Experimentation Questions
A/B Test Design Framework
class ABTestDesign:
def __init__(self, baseline_rate: float, mde: float, alpha: float = 0.05,
power: float = 0.80):
"""
Design an A/B test.
Args:
baseline_rate: Current conversion rate (e.g., 0.05 for 5%)
mde: Minimum detectable effect (e.g., 0.1 for 10% relative lift)
alpha: Significance level (default 0.05)
power: Statistical power (default 0.80)
"""
self.baseline = baseline_rate
self.mde = mde
self.alpha = alpha
self.power = power
def calculate_sample_size(self) -> int:
"""Calculate required sample size per variant."""
from scipy import stats
# Effect size (Cohen's h for proportions)
p1 = self.baseline
p2 = self.baseline * (1 + self.mde)
h = 2 * (np.arcsin(np.sqrt(p2)) - np.arcsin(np.sqrt(p1)))
# Z-scores
z_alpha = stats.norm.ppf(1 - self.alpha / 2)
z_beta = stats.norm.ppf(self.power)
# Sample size formula
n = ((z_alpha + z_beta) / h) ** 2 * 2
return int(np.ceil(n))
def calculate_duration(self, daily_traffic: int) -> int:
"""Calculate test duration in days."""
sample_per_variant = self.calculate_sample_size()
total_sample = sample_per_variant * 2
days = total_sample / daily_traffic
return int(np.ceil(days))
def analyze_results(self, control_n: int, control_conv: int,
treatment_n: int, treatment_conv: int) -> dict:
"""Analyze A/B test results."""
from scipy import stats
p_control = control_conv / control_n
p_treatment = treatment_conv / treatment_n
# Pooled proportion
p_pooled = (control_conv + treatment_conv) / (control_n + treatment_n)
# Standard error
se = np.sqrt(p_pooled * (1 - p_pooled) * (1/control_n + 1/treatment_n))
# Z-test
z_stat = (p_treatment - p_control) / se
p_value = 2 * (1 - stats.norm.cdf(abs(z_stat)))
# Confidence interval for difference
se_diff = np.sqrt(p_control*(1-p_control)/control_n +
p_treatment*(1-p_treatment)/treatment_n)
ci_lower = (p_treatment - p_control) - 1.96 * se_diff
ci_upper = (p_treatment - p_control) + 1.96 * se_diff
# Relative lift
relative_lift = (p_treatment - p_control) / p_control * 100
return {
"control_rate": p_control,
"treatment_rate": p_treatment,
"absolute_difference": p_treatment - p_control,
"relative_lift_pct": relative_lift,
"p_value": p_value,
"significant": p_value < self.alpha,
"ci_95": (ci_lower, ci_upper),
"recommendation": "Ship treatment" if p_value < self.alpha
and p_treatment > p_control else "No change"
}
# Example: Design a test
design = ABTestDesign(baseline_rate=0.05, mde=0.10)
print(f"Required sample size: {design.calculate_sample_size():,} per variant")
print(f"At 10K daily users: {design.calculate_duration(10000)} days")
Common Experimentation Pitfalls
β οΈ Common Experimentation Pitfalls
Be aware of these common pitfalls when designing and analyzing experiments.
pitfalls = {
"Peeking Problem": {
"issue": "Checking results before reaching sample size inflates false positives",
"solution": "Use sequential testing or pre-commit to a sample size"
},
"Multiple Comparisons": {
"issue": "Testing many variants increases chance of false positive",
"solution": "Apply Bonferroni correction or use False Discovery Rate"
},
"Novelty Effect": {
"issue": "New feature gets temporary boost, then fades",
"solution": "Run test for 2+ weeks, exclude new users initially"
},
"Simpson's Paradox": {
"issue": "Overall trend reverses when segmented",
"solution": "Always check results by key segments (platform, geography)"
},
"Network Effects": {
"issue": "Control group affected by treatment users",
"solution": "Use cluster randomization (by user group, not individual)"
}
}
Problem-Solving Frameworks
Framework 1: Metric Definition
DfMetric Definition Framework
A structured approach to defining product metrics for any feature or product.
"How would you measure success for [feature/product]?"
Step 1: Clarify the product and its goals
Step 2: Identify the user journey (acquire β activate β retain β revenue)
Step 3: Define metrics at each stage
Step 4: Pick ONE north star metric
Step 5: Identify guardrail metrics (things that shouldn't get worse)
def metric_definition_framework(product: str) -> dict:
"""
Framework for defining product metrics.
"""
return {
"product": product,
"user_journey": {
"acquisition": {
"metrics": ["signups", "downloads", "traffic"],
"example": "Daily new signups"
},
"activation": {
"metrics": ["first_action", "setup_complete", "aha_moment"],
"example": "% completing onboarding within 24h"
},
"engagement": {
"metrics": ["DAU/MAU", "session_length", "features_used"],
"example": "Weekly sessions per user"
},
"retention": {
"metrics": ["D1/D7/D30 retention", "churn rate"],
"example": "D30 retention rate"
},
"revenue": {
"metrics": ["conversion", "ARPU", "LTV"],
"example": "Revenue per active user"
}
},
"north_star": "Pick the metric that best captures user value delivery",
"guardrails": "Metrics that must not degrade (e.g., latency, errors)"
}
Framework 2: Metric Drop Investigation
DfMetric Drop Investigation Framework
A structured approach to investigating why a metric dropped.
"Metric X dropped by Y%. How would you investigate?"
Step 1: Validate the drop (is it real or a data artifact?)
Step 2: Segment the drop (by platform, geography, user type, time)
Step 3: Identify the most affected segment
Step 4: Generate hypotheses for that segment
Step 5: Prioritize hypotheses by likelihood and impact
Step 6: Design analysis to test top hypothesis
Step 7: Recommend action
def metric_drop_investigation(metric_name: str, drop_pct: float) -> dict:
"""
Structured investigation plan for a metric drop.
"""
return {
"step_1_validate": {
"questions": [
f"Is the {metric_name} drop statistically significant?",
"Is this within normal variance?",
"Did the metric definition change?",
"Is there a data pipeline issue?"
],
"tools": ["Check data freshness", "Run significance test",
"Compare to historical variance"]
},
"step_2_segment": {
"dimensions": [
"Platform (iOS vs Android vs Web)",
"Geography (country, region)",
"User type (new vs returning)",
"Time of day / day of week",
"App version / release"
],
"goal": "Find the segment with the largest drop"
},
"step_3_hypothesize": {
"categories": {
"Technical": ["Bug in latest release", "Performance degradation",
"Data pipeline issue"],
"External": ["Seasonality", "Competitor action",
"Market event"],
"Product": ["Recent feature change", "UI change",
"Pricing change"]
}
},
"step_4_analyze": {
"approach": "Compare affected segment vs unaffected segment",
"tools": ["Funnel analysis", "Cohort analysis",
"Before/after comparison"]
}
}
Framework 3: Design a Prediction System
DfPrediction System Framework
A structured approach to designing a prediction system for any problem.
"Design a system to predict [X]"
Step 1: Define the problem (what, why, for whom)
Step 2: Define the prediction target and time horizon
Step 3: Identify features and data sources
Step 4: Choose evaluation metrics
Step 5: Baseline model (simple heuristics)
Step 6: Production requirements (latency, throughput, freshness)
Step 7: Monitoring and iteration plan
def prediction_system_framework(problem: str) -> dict:
return {
"problem_definition": {
"what": problem,
"why": "Business value and decision it enables",
"who": "Stakeholders who will use it"
},
"prediction_target": {
"definition": "Precise, measurable, time-bounded",
"examples": {
"churn": "Will this user be inactive in 30 days?",
"conversion": "Will this user purchase within 7 days?",
"fraud": "Is this transaction fraudulent?"
}
},
"features": {
"behavioral": ["click patterns", "session data", "usage frequency"],
"transactional": ["purchase history", "payment method", "AOV"],
"contextual": ["device", "location", "time of day"],
"derived": ["trends", "ratios", "aggregations"]
},
"evaluation": {
"classification": ["AUC-ROC", "Precision@K", "Recall@K",
"F1", "Business-specific metric"],
"regression": ["RMSE", "MAE", "MAPE", "RΒ²"]
},
"production": {
"latency": "Real-time vs batch",
"throughput": "Requests per second",
"freshness": "How often to retrain",
"monitoring": "Drift detection, performance tracking"
}
}
Framework 4: Experimentation Design
DfExperimentation Design Framework
A structured approach to designing experiments.
def experimentation_framework(hypothesis: str) -> dict:
return {
"hypothesis": hypothesis,
"components": {
"control": "Current experience",
"treatment": "New experience to test",
"unit_of_randomization": "User level vs session level",
"primary_metric": "Main success metric",
"guardrail_metrics": "Metrics that must not degrade"
},
"design": {
"sample_size": "Calculate using power analysis",
"duration": "Minimum 1-2 weeks for weekly patterns",
"segments": "Pre-specify key segments to analyze",
"novelty_effect": "Run long enough or exclude new users"
},
"analysis": {
"method": "Two-proportion z-test or t-test",
"significance": "p < 0.05 (or adjusted for multiple tests)",
"practical_significance": "Is the effect size meaningful?",
"segments": "Check for heterogeneous treatment effects"
}
}
Practice Problems
Problem 1: Metric Definition
Question: You're the data scientist for a music streaming app.
How would you measure the success of a new "Discover Weekly" playlist feature?
Your answer should cover:
1. Define the user journey
2. Suggest metrics at each stage
3. Pick a north star metric
4. Identify guardrail metrics
Problem 2: Metric Drop
Question: Daily active users dropped 8% week-over-week.
Walk through your investigation.
Your answer should cover:
1. Validation steps
2. Segmentation strategy
3. Hypothesis generation
4. Analysis plan
5. Communication of findings
Problem 3: Experimentation
Question: You want to test a new checkout flow that simplifies
the purchase process from 5 steps to 3 steps.
Your answer should cover:
1. Hypothesis and prediction
2. Primary metric selection
3. Sample size calculation approach
4. Test duration
5. Potential pitfalls to watch for
Problem 4: Prediction System
Question: Design a system to predict which users will churn
in the next 30 days.
Your answer should cover:
1. Problem definition and business value
2. Target variable definition
3. Feature ideas (at least 5)
4. Evaluation metrics
5. Baseline approach
6. Production requirements
Problem 5: Trade-off Analysis
Question: Your model has 85% precision and 70% recall.
The business wants to reduce false positives (wrongly flagging
good users). How do you approach this trade-off?
Your answer should cover:
1. Explain precision-recall trade-off
2. How to adjust the threshold
3. Impact on business metrics
4. How to find the optimal balance
# Solution to Problem 5: Threshold optimization
import numpy as np
from sklearn.metrics import precision_recall_curve
def optimize_threshold(y_true, y_scores, cost_fp=10, cost_fn=50):
"""
Find optimal classification threshold based on business costs.
Args:
y_true: True labels
y_scores: Predicted probabilities
cost_fp: Cost of false positive (e.g., $10 discount given wrongly)
cost_fn: Cost of false negative (e.g., $50 lost from churn)
"""
precisions, recalls, thresholds = precision_recall_curve(y_true, y_scores)
# Calculate total cost for each threshold
costs = []
for i, threshold in enumerate(thresholds):
predictions = (y_scores >= threshold).astype(int)
tp = np.sum((predictions == 1) & (y_true == 1))
fp = np.sum((predictions == 1) & (y_true == 0))
fn = np.sum((predictions == 0) & (y_true == 1))
total_cost = (fp * cost_fp) + (fn * cost_fn)
costs.append(total_cost)
optimal_idx = np.argmin(costs)
optimal_threshold = thresholds[optimal_idx]
return {
"optimal_threshold": optimal_threshold,
"precision_at_optimal": precisions[optimal_idx],
"recall_at_optimal": recalls[optimal_idx],
"min_cost": costs[optimal_idx]
}
Key Takeaways
πSummary: DS Case Study Prep
- Always structure your answer: Clarify -> Structure -> Dive Deep -> Synthesize β this demonstrates maturity and ensures completeness
- Ask clarifying questions -- it shows maturity and avoids wasted effort on the wrong problem
- Know your product metrics cold: LTV, CAC, retention, churn, activation -- these are the language of product teams
- For experimentation, always think about sample size, duration, and pitfalls -- most A/B test failures are methodological, not statistical
- Communicate trade-offs explicitly -- every decision has a cost, and interviewers want to see you reason about them
- Start with a simple baseline before proposing complex solutions -- this shows pragmatic engineering judgment
- Tie everything back to business impact and actionable recommendations -- the best analysis is useless if it doesn't drive action
Practice Exercises
- Mock interview: Practice with a friend using problems above (25 min limit)
- Metric journal: For every app you use, write down what metrics they likely track
- Read case studies: Study published case studies from tech companies
- Build a framework doc: Create your own cheat sheet for common case types
- Time yourself: Practice staying within the 25-30 minute window