Statistics

Applications in Data Science

See how statistics powers A/B testing, causal inference, and experimental design.


â„šī¸ Why It Matters

Statistical thinking is essential for trustworthy data science, from experiments to causal claims. Without rigorous statistics, A/B tests produce false positives, models overfit, and causal claims confuse correlation with causation. Mastering the full statistical toolkit — from hypothesis testing to causal inference — ensures your conclusions are reliable, reproducible, and actionable.


Overview

Statistics powers the complete data science lifecycle. A/B testing uses two-sample proportion or mean tests to compare treatment and control groups, enabling data-driven product decisions. Power analysis determines required sample sizes before experiments, preventing wasted resources on underpowered studies. Causal inference distinguishes correlation from causation using randomized experiments (gold standard), propensity scores, instrumental variables, and difference-in-differences for observational data. Feature selection uses chi-square tests, permutation importance, and mutual information. Model evaluation relies on cross-validation, AUC-ROC, and calibration curves. Understanding how these pieces fit together transforms data analysis from ad hoc number-crunching into rigorous, reproducible science.


Key Concepts

Two-Proportion Z-Test (A/B Testing)

$$Z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1-\hat{p})(1/n_1 + 1/n_2)}}$$

Here,

  • $\hat{p}_1, \hat{p}_2$ = Sample proportions for control and treatment
  • $\hat{p}$ = Pooled proportion: $(x_1 + x_2)/(n_1 + n_2)$

Power Analysis (Sample Size)

$$n = \frac{(z_{\alpha/2} + z_{\beta})^2 \cdot 2\sigma^2}{\delta^2}$$

Here,

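  • $n$ = Required sample size per group
  • $\sigma^2$ = Variance of the metric within each group
  • $\delta$ = Minimum detectable effect (MDE)
  • $z_{\alpha/2}$ = Significance level critical value (1.96 for $\alpha = 0.05$)
  • $z_{\beta}$ = Power critical value (0.842 for power = 80%)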
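The formula above can be evaluated directly. The following is a minimal sketch using only the Python standard library; the function name `sample_size_per_group` and the input values are illustrative assumptions, not part of the lesson.

```python
import math
from statistics import NormalDist


def sample_size_per_group(sigma, delta, alpha=0.05, power=0.80):
    """Per-group n for a two-sided, two-sample comparison (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.842 for power = 0.80
    n = (z_alpha + z_beta) ** 2 * 2 * sigma ** 2 / delta ** 2
    return math.ceil(n)  # round up so the target power is still met


# Hypothetical inputs: standard deviation 10, minimum detectable effect 2.
n = sample_size_per_group(sigma=10.0, delta=2.0)
print(n)  # 393
```

Halving the minimum detectable effect roughly quadruples the required sample size, because $\delta$ enters the formula squared.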

Cohen's d (Effect Size)

$$d = \frac{\bar{x}_1 - \bar{x}_2}{s_p}$$

Here,

  • $s_p$ = Pooled standard deviation
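As a rough sketch of how Cohen's d might be computed, the snippet below pools the two sample variances by their degrees of freedom (a common convention, assumed here). The function name `cohens_d` and the data are hypothetical.

```python
import math
from statistics import mean, variance


def cohens_d(group1, group2):
    """Standardized mean difference using the pooled standard deviation."""
    n1, n2 = len(group1), len(group2)
    # Pooled variance weights each sample variance by its degrees of freedom.
    pooled_var = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / (n1 + n2 - 2)
    return (mean(group1) - mean(group2)) / math.sqrt(pooled_var)


# Hypothetical measurements, for illustration only.
a = [2.0, 4.0, 6.0]
b = [1.0, 3.0, 5.0]
d = cohens_d(a, b)
print(d)  # 0.5: the means differ by 1 and the pooled SD is 2
```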

Chi-Square Feature Selection

$$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$$

Here,

  • $O_i$ = Observed frequency of feature-category combination
  • $E_i$ = Expected frequency under independence
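To make the statistic concrete, here is a minimal sketch for a 2x2 table (feature present/absent by class). The counts are made up, and the function name `chi_square_2x2` is an assumption for illustration. The p-value uses the closed form for one degree of freedom, so it does not apply to larger tables.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square statistic (no continuity correction) and p-value for a 2x2 table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical counts: rows = word present / absent, columns = spam / ham.
counts = [[30, 10],
          [20, 40]]
stat, p_value = chi_square_2x2(counts)
print(f"chi2 = {stat:.2f}, p = {p_value:.5f}")
```

For these counts the expected frequencies are 20, 20, 30, 30, giving a statistic of about 16.67.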

Causal Inference Methods

| Method | Description | Key Assumption | When to Use |
| --- | --- | --- | --- |
| Randomized Experiment | Gold standard | Random assignment | When feasible |
| Propensity Score Matching | Match treated and control on covariates | No unmeasured confounders | Observational, pre-treatment covariates available |
| Instrumental Variables | Use exogenous variation | Exclusion restriction | When confounders are unmeasurable |
| Difference-in-Differences | Compare pre/post changes | Parallel trends | Before/after with control group |
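As an illustration of the simplest of these estimators, the sketch below computes a two-period difference-in-differences estimate from four group means. The numbers are hypothetical, and the result is only a causal estimate if the parallel-trends assumption holds.

```python
def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """Two-period DiD: change in the treated group minus change in the control group."""
    treated_change = treated_post - treated_pre
    control_change = control_post - control_pre
    return treated_change - control_change


# Hypothetical group means of some outcome before and after an intervention.
effect = diff_in_diff(treated_pre=10.0, treated_post=15.0,
                      control_pre=9.0, control_post=11.0)
print(effect)  # 3.0: the treated group rose by 5, the control group by 2
```

The control group's change of 2 stands in for what would have happened to the treated group without the intervention, which is exactly what the parallel-trends assumption asserts.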

A/B Testing Workflow

  1. Define the metric: Choose what to measure (conversion rate, revenue, latency)
  2. Formulate hypotheses: $H_0$: no difference, $H_1$: difference exists
  3. Power analysis: Determine sample size before collecting data
  4. Random assignment: Users randomly assigned to control (A) and treatment (B)
  5. Collect data: Run experiment for predetermined duration
  6. Compute test statistic: Z-test for proportions or t-test for means
  7. Make decision: Reject $H_0$ if the p-value ≤ $\alpha$, and act on the result only if the effect is also practically meaningful
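For step 4, one possible way to assign users is to hash a user identifier, which gives each user a stable, effectively random bucket. This is a sketch under assumptions: the lesson does not prescribe a mechanism, and the names `assign_variant`, `exp_demo`, and the `user_…` ids are hypothetical.

```python
import hashlib


def assign_variant(user_id, experiment="exp_demo"):
    """Deterministically map a user to 'A' (control) or 'B' (treatment)."""
    # Including the experiment name keeps assignments independent across experiments.
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    bucket = int.from_bytes(digest[:8], "big") % 2
    return "A" if bucket == 0 else "B"


# Hypothetical user ids; the split should come out close to 50/50.
assignments = [assign_variant(f"user_{i}") for i in range(10_000)]
share_b = assignments.count("B") / len(assignments)
print(f"share in B: {share_b:.3f}")
```

Because the assignment depends only on the id and the experiment name, a returning user always sees the same variant without any stored state.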

Quick Example

📝A/B Test: Conversion Rate

Control: 50/1000 converted. Treatment: 70/1000 converted.

$$\hat{p}_1 = 0.05, \quad \hat{p}_2 = 0.07, \quad \hat{p} = 0.06$$
$$Z = \frac{0.05 - 0.07}{\sqrt{0.06 \times 0.94 \times (1/1000 + 1/1000)}} = \frac{-0.02}{0.01062} \approx -1.883$$

The two-sided p-value is $p \approx 0.060 > 0.05$, so we fail to reject $H_0$ at $\alpha = 0.05$: the difference is not statistically significant. However, the effect size (2 percentage points) may be practically meaningful; collect more data or consider the business context.
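The same calculation can be sketched in a few lines of standard-library Python. The function name `two_proportion_z_test` is an assumption for illustration. The code carries full precision throughout, so its last digit can differ slightly from a hand calculation that rounds the standard error.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(x1, n1, x2, n2):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Control: 50/1000 converted. Treatment: 70/1000 converted.
z, p_value = two_proportion_z_test(50, 1000, 70, 1000)
print(f"Z = {z:.3f}, p = {p_value:.3f}")  # Z = -1.883, p = 0.060
```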

📝Sample Size Calculation

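To detect a 5 percentage point improvement in conversion rate (from 10% to 15%) with 80% power at $\alpha = 0.05$:

Using power analysis with $\sigma^2 \approx \bar{p}(1-\bar{p}) = 0.125 \times 0.875 \approx 0.1094$ and $\delta = 0.05$: $n \approx (1.96 + 0.842)^2 \times 2 \times 0.1094 / 0.05^2 \approx 687$, so roughly 690 per group. This gives the study an 80% chance of detecting the effect if it exists. Always compute this before running the experiment: underpowered studies waste resources and produce inconclusive results.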

📝Feature Selection with Chi-Square

In NLP, you have 1000 word features and a binary target (spam/ham). For each word, test whether it's independent of the target using chi-square. Words with low p-values (strong association) are kept; words with high p-values are removed. Apply Benjamini-Hochberg FDR correction to control false discoveries across 1000 tests. Select top 50 features for your classifier.
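The Benjamini-Hochberg step can be sketched as follows. This is a minimal standard-library version with hypothetical p-values. The function name `benjamini_hochberg` is chosen here for illustration and is not a library API.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the hypothesis is rejected at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose sorted p-value satisfies p_(k) <= (k / m) * q.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            max_rank = rank
    # Reject every hypothesis ranked at or below that k (the step-up rule).
    rejected = [False] * m
    for idx in order[:max_rank]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values for five word features.
p_values = [0.001, 0.008, 0.039, 0.035, 0.60]
keep = benjamini_hochberg(p_values, q=0.05)
print(keep)  # [True, True, True, True, False]
```

Note that 0.035 is kept even though it exceeds its own threshold (3/5 × 0.05 = 0.03), because a larger p-value (0.039) passes the threshold at rank 4. That is the step-up behaviour that distinguishes this procedure from a per-test cutoff.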

Common Pitfalls in Applied Statistics

| Pitfall | Why It's Wrong | Correct Approach |
| --- | --- | --- |
| Stopping the experiment as soon as p < 0.05 | Inflates false positive rate | Pre-specify sample size, run to completion |
| Ignoring practical significance | Trivial effects become "significant" with large $n$ | Report effect sizes and confidence intervals |
| Cherry-picking subgroups | Inflates false discovery rate | Pre-specify subgroups, adjust for multiple testing |
| Using accuracy for imbalanced classes | 95% accuracy is achievable by always predicting a majority class that makes up 95% of the data | Use F1, AUC-ROC, or precision-recall curves |
| Treating correlation as causation | Observational association doesn't imply causation | Use experiments or causal inference methods |
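The first pitfall can be demonstrated with a small simulation. The sketch below runs A/A tests (both arms share the same true rate, so every rejection is a false positive) and compares "stop at the first p < 0.05" against testing once at the planned end. All parameter values are illustrative assumptions.

```python
import math
import random
from statistics import NormalDist


def z_test_p_value(x1, n1, x2, n2):
    """Two-sided p-value of the pooled two-proportion z-test."""
    pooled = (x1 + x2) / (n1 + n2)
    if pooled in (0.0, 1.0):
        return 1.0  # no variation, so no evidence of a difference
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (x1 / n1 - x2 / n2) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate(n_sims=500, looks=10, batch=100, rate=0.10, alpha=0.05, seed=0):
    """Return (false positive rate with peeking, false positive rate at the final look)."""
    rng = random.Random(seed)
    peeking_hits = fixed_hits = 0
    for sim in range(n_sims):
        x1 = x2 = n = 0
        rejected_at_any_look = False
        for look in range(looks):
            # Both arms share the same true rate, so H0 is true by construction.
            x1 += sum(rng.random() < rate for _ in range(batch))
            x2 += sum(rng.random() < rate for _ in range(batch))
            n += batch
            p = z_test_p_value(x1, n, x2, n)
            if p < alpha:
                rejected_at_any_look = True
        peeking_hits += rejected_at_any_look
        fixed_hits += p < alpha  # only the final, pre-specified look counts
    return peeking_hits / n_sims, fixed_hits / n_sims


peek_rate, fixed_rate = simulate()
print(f"peeking: {peek_rate:.3f}, fixed sample size: {fixed_rate:.3f}")
```

With ten looks, the peeking rate typically lands well above the nominal 5%, while the fixed-sample rate stays close to it.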

Key Takeaways

📋Summary: Applications in Data Science

  • A/B Testing: Use two-sample proportion or mean tests. Randomize, pre-specify $\alpha$, compute power before collecting data.
  • Power Analysis: Determine sample size using Cohen's d, desired power (0.80), and $\alpha = 0.05$. Underpowered studies waste resources.
  • Causal Inference: Randomized experiments are the gold standard. For observational data, use propensity scores, IV, or DiD under strong assumptions.
  • Feature Selection: Chi-square tests for categorical features; permutation importance for any model; mutual information for non-linear relationships.
  • Model Evaluation: Cross-validate for unbiased performance estimates. Use AUC-ROC for threshold-independent evaluation. Check calibration.
  • Multiple Comparisons: Each additional test raises the chance of at least one false positive. Use Bonferroni, Holm, or FDR correction when running many tests.
  • Reproducibility: Pre-register hypotheses, report all tests, provide confidence intervals alongside p-values, and share code/data.
  • Beyond p-values: Effect sizes, confidence intervals, and practical significance matter more than binary significant/not-significant decisions.

Deep Dive

For detailed explanations, worked examples, and Python implementations, explore the dedicated statistics lessons:

Machine Learning Applications

  • Statistics in Machine Learning — How statistical methods power ML: hypothesis testing for model comparison, confidence intervals for metrics, and Bayesian approaches
