Statistical Foundations for Data Science

Importance of Statistics in Data Science

Statistics provides the mathematical foundation for making inferences and predictions from data. Without it, data science reduces to pattern matching with no way to quantify uncertainty.

Population vs Sample

  • Population: The complete set of all items of interest
  • Sample: A subset of the population used for analysis

Population parameters: $\mu, \sigma, p$

Sample statistics: $\bar{x}, s, \hat{p}$
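
To make the distinction concrete, the sketch below simulates a population and draws a random sample from it. The population (100,000 simulated heights), its distribution, the seed, and the sample size of 200 are all assumptions chosen for illustration, not real data.

```python
import random
import statistics

random.seed(42)  # fixed seed so the illustration is reproducible

# Hypothetical population: 100,000 simulated heights in cm (assumed values)
population = [random.gauss(170, 10) for _ in range(100_000)]

# Population parameters: computed over every member of the population
mu = statistics.fmean(population)
sigma = statistics.pstdev(population)  # population formula, divides by N

# Sample statistics: computed from a random subset of 200 members
sample = random.sample(population, 200)
x_bar = statistics.fmean(sample)
s = statistics.stdev(sample)  # sample formula, divides by n - 1

print(f"Population: mu = {mu:.2f}, sigma = {sigma:.2f}")
print(f"Sample:     x_bar = {x_bar:.2f}, s = {s:.2f}")
```

The sample statistics land close to the population parameters but not exactly on them. That gap is the sampling error that inferential statistics tries to quantify.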

Types of Data

| Type | Description | Examples |
|---|---|---|
| Numerical | Quantitative continuous values | Height, weight, temperature |
| Categorical | Qualitative discrete values | Gender, color, city |
| Ordinal | Ordered categorical data | Education level, rating |
| Time Series | Data over time | Stock prices, temperature |
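
As a rough illustration, here is one way these types might be represented in pandas. The column names and values are made up for the example, and the ordering of education levels is an assumption.

```python
import pandas as pd

# Hypothetical records, one column per data type
df = pd.DataFrame({
    "height_cm": [172.5, 160.2, 181.0],                       # numerical
    "city": ["Lagos", "Lima", "Oslo"],                        # categorical
    "education": ["Bachelor", "High School", "Master"],       # ordinal
    "date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),  # time index
})

# Unordered categories: labels only, no ranking
df["city"] = df["city"].astype("category")

# Ordered categories: the assumed ranking makes comparisons meaningful
levels = ["High School", "Bachelor", "Master"]
df["education"] = pd.Categorical(df["education"], categories=levels, ordered=True)

print(df.dtypes)
print(df["education"].max())  # highest level under the assumed ordering
```

Declaring the ordering explicitly matters: without `ordered=True`, pandas treats the education levels as plain labels and operations such as `max()` are not defined.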

Descriptive Statistics

Measures of Central Tendency:

$$\text{Mean}\;(\bar{x}) = \frac{1}{n}\sum_{i=1}^{n} x_i$$

$$\text{Median} = \begin{cases} x_{\frac{n+1}{2}} & n\text{ odd} \\ \frac{x_{\frac{n}{2}} + x_{\frac{n}{2}+1}}{2} & n\text{ even} \end{cases}$$

$$\text{Mode} = \text{Most frequent value}$$
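
The piecewise median formula is easiest to follow on small inputs. The sketch below applies it to two made-up lists, one of odd and one of even length. It assumes the positions in the formula are 1-based and refer to the values after sorting. `median_by_formula` is a helper written for this example, not a library function.

```python
import statistics

odd = [3, 1, 4, 1, 5]       # n = 5, sorted: 1 1 3 4 5
even = [3, 1, 4, 1, 5, 9]   # n = 6, sorted: 1 1 3 4 5 9

def median_by_formula(values):
    """Median via the piecewise formula, using 1-based positions on sorted data."""
    xs = sorted(values)
    n = len(xs)
    if n % 2 == 1:
        return xs[(n + 1) // 2 - 1]            # position (n + 1) / 2
    return (xs[n // 2 - 1] + xs[n // 2]) / 2   # positions n / 2 and n / 2 + 1

print(statistics.mean(odd))     # 2.8
print(median_by_formula(odd))   # 3
print(median_by_formula(even))  # 3.5
print(statistics.mode(odd))     # 1
```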

Measures of Dispersion:

$$\text{Variance} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

$$\text{Standard Deviation} = \sqrt{\text{Variance}}$$

$$\text{Range} = \text{Max} - \text{Min}$$

$$IQR = Q_3 - Q_1$$
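
A short sketch of these four measures on a made-up data set follows. Quartiles have several competing definitions. This example uses the default method of Python's `statistics.quantiles`, so other tools may report a slightly different IQR for the same data.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # assumed example values

x_bar = statistics.fmean(data)  # 5.0

# Sample variance: squared deviations from the mean, divided by n - 1
variance = sum((x - x_bar) ** 2 for x in data) / (len(data) - 1)
std_dev = variance ** 0.5

value_range = max(data) - min(data)  # 9 - 2 = 7

# Quartiles split the sorted data into four parts; IQR spans the middle half
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

print(f"variance = {variance:.3f}, std dev = {std_dev:.3f}")
print(f"range = {value_range}, IQR = {iqr}")
```

Unlike the range, the IQR ignores the smallest and largest quarters of the data, which makes it much less sensitive to outliers.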

Probability Distributions

Common distributions used in data science:

Binomial Distribution - Number of successes in n trials: $P(X=k) = \binom{n}{k} p^k (1-p)^{n-k}$

Normal Distribution - Bell-shaped continuous distribution: $f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$

Poisson Distribution - Count of events in fixed interval: $P(X=k) = \frac{\lambda^k e^{-\lambda}}{k!}$
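
The three formulas translate almost directly into code. The functions below are a minimal sketch using only the standard library. The function names and the example arguments are choices made for this illustration. In practice `scipy.stats` offers equivalent, more robust implementations.

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k) for k successes in n independent trials with success probability p."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def normal_pdf(x, mu, sigma):
    """Density of the normal distribution with mean mu and standard deviation sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

def poisson_pmf(k, lam):
    """P(X = k) for a count of events with average rate lam per interval."""
    return lam**k * math.exp(-lam) / math.factorial(k)

# Probability of exactly 3 heads in 10 fair coin flips: 120 / 1024
print(binomial_pmf(3, 10, 0.5))

# Height of the standard normal curve at its peak: 1 / sqrt(2 * pi)
print(normal_pdf(0, 0, 1))

# Probability of zero events when the average is 2 per interval: e^-2
print(poisson_pmf(0, 2))
```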

Hypothesis Testing

The foundation of statistical inference:

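  1. Null Hypothesis (H₀): The default assumption, typically "no effect" or "no difference"
  2. Alternative Hypothesis (H₁): The claim we are seeking evidence for
  3. Test Statistic: A value calculated from the sample data
  4. P-value: Probability of observing results at least as extreme as the sample's, assuming H₀ is true
  5. Significance Level (α): Threshold for rejecting H₀ (typically 0.05); reject when the p-value falls below α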
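
The five components can be traced through a one-sample t-test. In the sketch below the scores and the hypothesized mean of 80 are assumptions made for illustration. The test statistic is computed by hand as t = (x̄ − μ₀) / (s / √n) and the p-value is taken from the t distribution with n − 1 degrees of freedom.

```python
import math
import statistics
from scipy import stats

scores = [85, 87, 92, 78, 88]  # hypothetical sample
mu_0 = 80                      # H0: the population mean is 80 (assumed value)
alpha = 0.05                   # significance level; H1: the mean differs from 80

# Test statistic: distance of the sample mean from mu_0, in standard errors
n = len(scores)
x_bar = statistics.fmean(scores)
s = statistics.stdev(scores)
t_stat = (x_bar - mu_0) / (s / math.sqrt(n))

# Two-sided p-value: probability of a |t| at least this large if H0 is true
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)

reject_h0 = p_value < alpha
print(f"t = {t_stat:.3f}, p = {p_value:.4f}, reject H0: {reject_h0}")
```

With these numbers the p-value comes out slightly above 0.05, so H₀ is not rejected even though the sample mean of 86 is well above 80. With only five observations the evidence is too weak. `stats.ttest_1samp(scores, mu_0)` should give the same statistic and p-value in one call.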

Common tests:

  • t-test: Comparing means
  • Chi-square test: Testing independence
  • ANOVA: Comparing multiple means
  • Correlation test: Testing relationship between variables
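
Each of these tests has a counterpart in `scipy.stats`. The sketch below runs a chi-square test of independence, a one-way ANOVA, and a Pearson correlation test. The contingency table and the third group are made-up values for illustration.

```python
from scipy import stats

# Chi-square test of independence on a hypothetical 2x2 contingency table
# (rows: two groups, columns: two outcomes)
observed = [[30, 10],
            [20, 40]]
chi2, p_chi2, dof, expected = stats.chi2_contingency(observed)
print(f"chi-square = {chi2:.3f}, p = {p_chi2:.4f}, dof = {dof}")

# One-way ANOVA: do three groups share the same mean?
g1 = [85, 87, 92, 78, 88]
g2 = [79, 82, 89, 75, 81]
g3 = [90, 94, 91, 96, 89]  # hypothetical third group
f_stat, p_anova = stats.f_oneway(g1, g2, g3)
print(f"F = {f_stat:.3f}, p = {p_anova:.4f}")

# Correlation test: is the linear relationship between x and y significant?
r, p_corr = stats.pearsonr([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(f"r = {r:.3f}, p = {p_corr:.4f}")
```

In every case the decision rule is the same: compare the returned p-value with the chosen significance level α.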

Confidence Intervals

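A range of values that, at a chosen confidence level (such as 95%), is expected to contain the population parameter. For a mean with known population standard deviation σ: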

$$CI = \bar{x} \pm z_{\alpha/2} \times \frac{\sigma}{\sqrt{n}}$$
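
As a worked example, the sketch below computes a 95% interval for a small sample. The value σ = 30 is an assumed known population standard deviation. When σ is unknown, the usual approach replaces it with the sample standard deviation and the z critical value with a t critical value.

```python
import math
from statistics import NormalDist, fmean

sample = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
sigma = 30.0       # assumed known population standard deviation
confidence = 0.95

n = len(sample)
x_bar = fmean(sample)  # 55.0

# Critical value z_{alpha/2}: cuts off alpha/2 in each tail of the standard normal
alpha = 1 - confidence
z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96

# Margin of error: critical value times the standard error of the mean
margin = z * sigma / math.sqrt(n)
ci = (x_bar - margin, x_bar + margin)

print(f"{confidence:.0%} CI: ({ci[0]:.2f}, {ci[1]:.2f})")
```

A larger sample shrinks the margin in proportion to 1/√n, so quadrupling n halves the width of the interval.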

Correlation

$$r = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum(x_i - \bar{x})^2}\sqrt{\sum(y_i - \bar{y})^2}}$$

  • r = 1: Perfect positive correlation
  • r = -1: Perfect negative correlation
  • r = 0: No linear correlation
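
The three cases can be checked by coding the formula directly. `pearson_r` below is a helper written for this example and the input lists are made up. The last case shows why r = 0 means only "no linear correlation": y is fully determined by x there, yet r is zero.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient computed term by term from the formula."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    denominator = (math.sqrt(sum((x - x_bar) ** 2 for x in xs))
                   * math.sqrt(sum((y - y_bar) ** 2 for y in ys)))
    return numerator / denominator

print(pearson_r([1, 2, 3], [2, 4, 6]))               # approximately 1 (y = 2x)
print(pearson_r([1, 2, 3], [6, 4, 2]))               # approximately -1 (y falls as x rises)
print(pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])) # 0 although y = x squared
```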

Python Implementation

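import numpy as np
from scipy import stats

# Descriptive statistics
data = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
mean = np.mean(data)
median = np.median(data)
std = np.std(data, ddof=1)        # ddof=1 divides by n - 1 (sample standard deviation)
variance = np.var(data, ddof=1)   # sample variance, matching the formula above
print(f"mean={mean}, median={median}, std={std:.2f}, variance={variance:.2f}")

# Hypothesis testing - independent two-sample t-test
# (assumes equal variances by default; pass equal_var=False for Welch's t-test)
group1 = [85, 87, 92, 78, 88]
group2 = [79, 82, 89, 75, 81]
t_stat, p_value = stats.ttest_ind(group1, group2)
print(f"t={t_stat:.3f}, p={p_value:.3f}")

# Correlation - Pearson's r is the off-diagonal entry of the 2x2 correlation matrix
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
correlation = np.corrcoef(x, y)[0, 1]
print(f"r={correlation:.3f}")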

Key Takeaways

  1. Statistics provides the framework for data-driven decision making
  2. Understanding populations vs samples is crucial
  3. Descriptive statistics summarize data characteristics
  4. Probability distributions model real-world phenomena
  5. Hypothesis testing enables statistical inference
