Survey Analysis

Survey Research Fundamentals

Survey research collects structured information from samples to understand populations. It is widely used in social science, marketing, public health, and many other fields. Well-designed surveys produce reliable estimates; poorly designed surveys produce misleading results.

Survey methods have evolved from paper questionnaires to online platforms, phone interviews, and mixed-mode designs. Each mode has strengths and weaknesses. Mode effects can influence responses and must be considered.

Understanding survey design and analysis is essential for data scientists working with survey data. Many datasets come from surveys, and interpreting these data requires understanding their structure.

Survey Sampling

Survey sampling selects units from populations for study. Proper sampling enables inference to populations while minimizing cost.

Sampling Methods

Simple random sampling gives each unit an equal selection probability. Stratified sampling divides the population into strata and samples independently within each stratum. Cluster sampling groups units into clusters and samples whole clusters. These methods can be combined in multi-stage designs.
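The three basic designs can be sketched with Python's standard library. This is a minimal illustration, not a production sampler: the frame, the field names (region, school), and the sample sizes are all hypothetical.

```python
import random
from collections import defaultdict

def simple_random_sample(units, n, rng):
    """Draw n units without replacement; each unit has probability n/N."""
    return rng.sample(units, n)

def stratified_sample(units, stratum_of, n_per_stratum, rng):
    """Draw an independent simple random sample inside each stratum."""
    strata = defaultdict(list)
    for u in units:
        strata[stratum_of(u)].append(u)
    sample = []
    for name, members in strata.items():
        sample.extend(rng.sample(members, n_per_stratum[name]))
    return sample

def cluster_sample(units, cluster_of, n_clusters, rng):
    """Select whole clusters at random and keep every unit in them."""
    clusters = defaultdict(list)
    for u in units:
        clusters[cluster_of(u)].append(u)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [u for c in chosen for u in clusters[c]]

# Hypothetical frame: 1,000 people in 20 schools (50 each), 10 schools per region.
rng = random.Random(42)
frame = [
    {"id": i, "school": i % 20, "region": "north" if i % 20 < 10 else "south"}
    for i in range(1000)
]

srs = simple_random_sample(frame, 100, rng)
strat = stratified_sample(frame, lambda u: u["region"], {"north": 60, "south": 40}, rng)
clus = cluster_sample(frame, lambda u: u["school"], 4, rng)
```

The stratified draw fixes the number of respondents per region in advance, while the cluster draw returns every person in the four selected schools.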

Probability sampling methods assign known selection probabilities. This enables design-based inference with known sampling properties.

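Non-probability methods (convenience, quota, snowball) do not assign known selection probabilities. They can reach populations that are hard to sample otherwise, but inference to the population then rests on modeling assumptions rather than on the sampling design.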

Sample Size Determination

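Sample size depends on desired precision, population variability, and budget constraints. In stratified designs the sample must also be allocated across strata. Proportional allocation gives each stratum a share of the sample equal to its share of the population. Optimal allocation assigns more sample to strata with higher variance.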
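The two allocation rules can be compared with a short sketch. It assumes equal per-unit costs across strata, in which case optimal allocation reduces to Neyman allocation (sample proportional to stratum size times stratum standard deviation). The stratum names, sizes, and standard deviations are made up for illustration, and results are left unrounded.

```python
def proportional_allocation(n, stratum_sizes):
    """Allocate n in proportion to stratum population sizes N_h."""
    total = sum(stratum_sizes.values())
    return {h: n * N_h / total for h, N_h in stratum_sizes.items()}

def neyman_allocation(n, stratum_sizes, stratum_sds):
    """Allocate n in proportion to N_h * S_h (equal costs assumed)."""
    products = {h: stratum_sizes[h] * stratum_sds[h] for h in stratum_sizes}
    total = sum(products.values())
    return {h: n * p / total for h, p in products.items()}

# Hypothetical business survey: few large firms with highly variable revenue.
sizes = {"small_firms": 8000, "large_firms": 2000}
sds = {"small_firms": 10.0, "large_firms": 60.0}

prop = proportional_allocation(500, sizes)   # 400 small, 100 large
neyman = neyman_allocation(500, sizes, sds)  # 200 small, 300 large
```

In this example the high-variance stratum of large firms receives three times as much sample under Neyman allocation as under proportional allocation.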

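For a given precision, the required sample size depends only weakly on population size unless the sample is a sizable fraction of the population. The finite population correction reduces the required sample size when sampling a large proportion of the population, so small populations need smaller samples.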
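As a rough sketch, the sample size for estimating a proportion under simple random sampling can be computed from the usual normal-approximation formula and then reduced by the finite population correction. The function name and defaults are illustrative; p = 0.5 is the conservative worst case.

```python
import math

def sample_size_proportion(margin, p=0.5, z=1.96, population=None):
    """Sample size for a proportion under simple random sampling.

    n0 = z^2 * p * (1 - p) / margin^2, optionally reduced by the
    finite population correction n = n0 / (1 + (n0 - 1) / N).
    """
    n0 = z ** 2 * p * (1 - p) / margin ** 2
    if population is not None:
        n0 = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n0)

n_infinite = sample_size_proportion(0.05)                     # 385
n_small = sample_size_proportion(0.05, population=2000)       # 323
n_large = sample_size_proportion(0.05, population=1_000_000)  # 385
```

The correction matters for the population of 2,000 but is negligible for the population of one million.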

Survey costs vary by mode and population. Budget constraints might limit sample size or require less efficient designs.

Survey Question Design

Question design determines what information is collected. Good questions collect desired information accurately; poor questions introduce error.

Question Types

Closed-ended questions provide response options. They are easier to analyze but might not capture all relevant responses. Open-ended questions allow free responses but are harder to analyze.

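Rating scales measure attitudes or frequencies. Likert scales measure agreement with statements. They produce ordinal data, so analyses that treat the response codes as numbers (such as means) assume equal spacing between categories and should be interpreted with care.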

Dichotomous questions have two response options. They are simple but might force false dichotomies.

Question Wording

Wordings should be clear, specific, and unbiased. Technical jargon confuses respondents. Leading questions bias responses. Double-barreled questions ask two things at once.

Question order can affect responses. Early questions might prime later responses. Sensitive questions should come later after rapport is established.

Pre-testing reveals problems that designers miss. Cognitive interviews reveal how respondents interpret questions. Pilot surveys test entire questionnaires.

Survey Modes

Different survey modes have different characteristics. Mode choice affects cost, coverage, response rates, and response patterns.

Interview Modes

Telephone interviews reach many people efficiently. They allow probing but cannot show visual aids such as answer cards. Response rates have declined with caller ID and mobile phones.

Face-to-face interviews allow showing cards and asking complex questions. They typically achieve the highest response rates but are the most expensive mode. Interviewer effects can occur.

Video interviews combine features of phone and in-person. They are growing with internet penetration.

Self-Administered Modes

Paper questionnaires can be distributed and collected. They avoid interviewer effects but might have lower response rates and more missing data.

Web surveys are increasingly common. They are inexpensive and allow complex question logic. Coverage might be limited to those with internet access.

Mixed-mode surveys combine modes. They might use different modes for different populations or phases. Mode effects complicate analysis.

Data Processing and Cleaning

Raw survey data require processing before analysis. This includes editing, coding, and handling missing data.

Data Editing

Data editing checks for errors and inconsistencies. Range checks flag values outside plausible ranges. Consistency checks verify logical relationships.
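A minimal sketch of both kinds of check on a single record follows. The field names (age, employed, hours_worked) and the plausible age range are assumptions made for the example, not part of any particular survey.

```python
def edit_checks(record):
    """Return a list of problems found in one survey record."""
    problems = []

    # Range check: flag values outside a plausible range.
    age = record.get("age")
    if age is not None and not (0 <= age <= 120):
        problems.append("age out of range")

    # Consistency check: answers must agree with each other.
    hours = record.get("hours_worked") or 0
    if record.get("employed") == "no" and hours > 0:
        problems.append("hours reported by non-employed respondent")

    return problems

clean = edit_checks({"age": 34, "employed": "yes", "hours_worked": 40})
flagged = edit_checks({"age": 150, "employed": "no", "hours_worked": 20})
```

Missing values are skipped here rather than flagged, on the assumption that missing data are handled in a separate step.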

Outliers might be legitimate values or errors, and each should be investigated before deciding how to handle it. Some errors can be corrected from other sources; others require setting the value to missing.

Human review catches some errors that automated checks miss. Editing adds cost but improves quality.

Coding Open-Ended Responses

Open-ended responses require coding to analyze. Code lists are developed from responses. Coders assign codes. Inter-coder reliability is assessed.
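One common way to assess inter-coder reliability for two coders is Cohen's kappa, which compares observed agreement with the agreement expected by chance. The sketch below is one option among several, and the code labels are hypothetical.

```python
from collections import Counter

def cohens_kappa(codes_a, codes_b):
    """Cohen's kappa for two coders who coded the same responses."""
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("need two non-empty code lists of equal length")
    n = len(codes_a)
    observed = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    freq_a, freq_b = Counter(codes_a), Counter(codes_b)
    # Chance agreement: product of each coder's marginal proportions.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n ** 2
    if expected == 1:
        return 1.0  # both coders used a single identical category
    return (observed - expected) / (1 - expected)

coder_a = ["price", "price", "service", "service", "other",
           "price", "service", "other", "price", "service"]
coder_b = ["price", "price", "service", "other", "other",
           "price", "service", "other", "service", "service"]
kappa = cohens_kappa(coder_a, coder_b)
```

Here the coders agree on 8 of 10 responses, chance agreement is 0.34, and kappa is about 0.70.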

Machine learning can assist coding. Text classification algorithms can classify responses automatically. Human verification is often needed.

Complex coding schemes with many categories are difficult to use reliably. Coding should be kept as simple as possible.

Missing Data

Missing data occur for various reasons. Item non-response happens when respondents skip individual questions. Unit non-response happens when a sampled unit provides no usable response at all, for example through refusal or non-contact.

Weighting adjustments address unit non-response by giving greater weight to respondents who resemble the non-respondents. This requires information about non-respondents, such as frame variables known for every sampled unit.

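Imputation fills missing values with plausible values. Mean imputation is simple but shrinks variance and distorts relationships between variables. Multiple imputation creates several completed datasets and combines their results, so that uncertainty due to the missing values carries through to the standard errors.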
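The distortion caused by mean imputation is easy to see in a toy example. The income values below are invented, and the sketch only illustrates the effect on variance; it is not a recommended imputation method.

```python
import statistics

def mean_impute(values):
    """Replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]

incomes = [20, 30, None, 50, None, 60, 40]  # hypothetical, in thousands
observed = [v for v in incomes if v is not None]
completed = mean_impute(incomes)

# The mean is unchanged, but the sample variance shrinks because the
# imputed values sit exactly at the centre of the distribution.
var_observed = statistics.variance(observed)
var_completed = statistics.variance(completed)
```

The completed data look more precise than the observed data justify, which is the problem multiple imputation is designed to avoid.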

Survey Weighting

Survey weights adjust for unequal selection probabilities and non-response. They enable estimates that represent the target population.

Weighting Process

Base weights reflect selection probabilities. Each unit's base weight is the inverse of its selection probability, so units sampled at lower rates receive larger weights.
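For a stratified simple random sample, the selection probability in stratum h is n_h / N_h, so the base weight is N_h / n_h. A small sketch with hypothetical stratum sizes:

```python
def base_weight(selection_probability):
    """Base weight is the inverse of the selection probability."""
    if not 0 < selection_probability <= 1:
        raise ValueError("selection probability must be in (0, 1]")
    return 1.0 / selection_probability

# Hypothetical design: rural areas are oversampled.
stratum_sizes = {"urban": 6000, "rural": 4000}
sample_sizes = {"urban": 300, "rural": 400}

weights = {
    h: base_weight(sample_sizes[h] / stratum_sizes[h]) for h in stratum_sizes
}
# Each urban respondent represents about 20 people, each rural one about 10.
estimated_population = sum(weights[h] * sample_sizes[h] for h in stratum_sizes)
```

The weights sum to the population size over the sample, which is a quick sanity check on any set of base weights.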

Non-response adjustments increase the weights of respondents to compensate for non-respondents. Raking iteratively adjusts weights to match known population marginals. This requires external population information.
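Raking can be sketched as a loop that rescales weights one margin at a time until all margins match. The records, variable names, and population targets below are hypothetical, and the function assumes every record has a level listed in the targets and that the targets for different variables sum to the same total.

```python
def rake(records, weights, margins, max_iter=200, tol=1e-10):
    """Iteratively scale weights so weighted totals match each margin."""
    w = list(weights)
    for _ in range(max_iter):
        max_change = 0.0
        for var, targets in margins.items():
            # Current weighted total for each level of this variable.
            totals = {level: 0.0 for level in targets}
            for r, wi in zip(records, w):
                totals[r[var]] += wi
            # Scale each weight by target / current total for its level.
            for i, r in enumerate(records):
                factor = targets[r[var]] / totals[r[var]]
                w[i] *= factor
                max_change = max(max_change, abs(factor - 1))
        if max_change < tol:
            break
    return w

records = [
    {"sex": "f", "age": "young"}, {"sex": "f", "age": "old"},
    {"sex": "f", "age": "old"}, {"sex": "m", "age": "young"},
    {"sex": "m", "age": "old"}, {"sex": "m", "age": "young"},
]
margins = {"sex": {"f": 520, "m": 480}, "age": {"young": 300, "old": 700}}
raked = rake(records, [1.0] * len(records), margins)
```

Only the marginal totals are matched; the joint distribution of sex and age in the weighted sample still reflects the pattern in the respondents.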

Calibration adjusts weights to conform to known totals. This improves efficiency while maintaining design properties.

Weighting Considerations

Weights should not be extreme. Extreme weights increase variance. Sometimes weights are trimmed or winsorized.
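The variance cost of unequal weights is often summarized with Kish's approximate design effect, n * sum(w^2) / (sum w)^2. The sketch below pairs it with a simple one-pass trimming rule; the cap and the weights are illustrative, and because the trimmed weights are rescaled to keep the original total, some may end up slightly above the cap.

```python
def kish_design_effect(weights):
    """Kish's approximate design effect due to unequal weighting."""
    n = len(weights)
    return n * sum(w * w for w in weights) / sum(weights) ** 2

def trim_weights(weights, cap):
    """Cap weights at `cap`, then rescale to preserve the weight total."""
    total = sum(weights)
    trimmed = [min(w, cap) for w in weights]
    scale = total / sum(trimmed)
    return [w * scale for w in trimmed]

weights = [1.0] * 9 + [11.0]          # one extreme weight
deff_before = kish_design_effect(weights)   # 3.25
trimmed = trim_weights(weights, cap=3.0)
deff_after = kish_design_effect(trimmed)    # 1.25
```

Trimming lowers the design effect from 3.25 to 1.25 in this example, at the price of some bias if the unit with the large weight differs from the rest.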

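Weighted data from complex designs require special analysis. Ignoring weights can bias point estimates, and ignoring stratification and clustering produces incorrect standard errors. Survey procedures incorporate weights and design information in variance estimation.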

Weighting introduces a tradeoff between bias (from non-response) and variance (from weights). Optimal weights balance these.

Analyzing Survey Data

Survey data require special analysis methods that account for design features. Standard methods might produce incorrect results.

Descriptive Analysis

Population estimates use weights to produce representative statistics. Means, proportions, and totals incorporate survey weights. They estimate population quantities.
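Weighted totals, means, and proportions follow directly from their definitions. In the sketch below, a proportion is computed as the weighted mean of a 0/1 indicator; the data and weights are made up.

```python
def weighted_total(values, weights):
    """Estimated population total: sum of w_i * y_i."""
    return sum(w * y for w, y in zip(weights, values))

def weighted_mean(values, weights):
    """Estimated population mean: weighted total over sum of weights."""
    return weighted_total(values, weights) / sum(weights)

# Hypothetical indicator: 1 if the respondent owns a car, else 0.
owns_car = [1, 0, 1, 1, 0]
weights = [10, 10, 20, 20, 40]

unweighted_share = sum(owns_car) / len(owns_car)   # 0.6
weighted_share = weighted_mean(owns_car, weights)  # 0.5
estimated_owners = weighted_total(owns_car, weights)  # 50
```

The weighted share is lower than the unweighted share because the respondent with the largest weight does not own a car.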

Graphics visualize survey data. Error bars show confidence intervals. Weighted histograms show distributions.

Subgroup analysis requires careful consideration. Small subgroups might have large margins of error. Combining categories might be necessary.

Hypothesis Testing

Comparing groups requires accounting for the design. Survey t-tests use weighted means and design-based variance estimates. Standard errors reflect stratification and clustering.

Chi-square tests for categorical data use weighted counts and design-based variance. The Rao-Scott adjustment accounts for design effects.

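Regression analysis for survey data obtains coefficient estimates by weighted least squares, but the standard errors reported by an ordinary weighted least squares routine are not valid for survey weights. Survey regression procedures incorporate design information to produce design-based standard errors.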
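For a single predictor, the weighted least squares point estimates have a closed form. The sketch below computes only the coefficients; it deliberately returns no standard errors, since those would need a design-based method. Data and weights are hypothetical.

```python
def weighted_simple_regression(x, y, w):
    """Return (intercept, slope) minimizing sum of w_i * (y_i - a - b*x_i)^2."""
    total_w = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / total_w
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / total_w
    s_xy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    s_xx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope

x = [1, 2, 3, 4, 5]
y = [2.0, 4.1, 5.9, 8.2, 9.8]
w = [10, 10, 20, 20, 40]
intercept, slope = weighted_simple_regression(x, y, w)
```

Observations with larger weights pull the fitted line toward themselves, which is how the estimate reflects the population rather than the sample.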

Variance Estimation

Survey estimates have sampling variability. Variance estimation quantifies uncertainty from sampling.

Design-Based Variance

Design-based variance comes from the random mechanism of sampling. Different samples would produce different estimates. The variance measures this variability.

Taylor series linearization provides variance estimates for many statistics. It approximates variance using the design and data.
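As an illustration, the linearized variance of a weighted mean can be computed from the linearized values z_i = w_i * (y_i - mean) / sum(w). The sketch assumes the simplest possible design: a single stage, no strata, each respondent treated as its own sampling unit, and the with-replacement approximation. Real designs need the stratum and cluster structure as well.

```python
def linearized_variance_of_mean(values, weights):
    """Taylor-linearized variance of a weighted mean (unstratified,
    single-stage, with-replacement approximation)."""
    n = len(values)
    total_w = sum(weights)
    mean = sum(w * y for w, y in zip(weights, values)) / total_w
    # Linearized values; they sum to zero by construction.
    z = [w * (y - mean) / total_w for w, y in zip(weights, values)]
    return n / (n - 1) * sum(zi ** 2 for zi in z)

values = [2, 4, 4, 5, 7, 8]
equal_weights = [10.0] * len(values)
variance = linearized_variance_of_mean(values, equal_weights)  # 0.8
```

With equal weights the result reduces to the familiar s^2 / n, which is a useful check on the formula.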

Resampling methods (jackknife, bootstrap) simulate repeated sampling. They estimate variance by computing estimates for modified samples.
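A delete-one jackknife can be sketched in a few lines. As a simplifying assumption, each respondent is treated as its own sampling unit in an unstratified design; with clustered data, whole clusters would be dropped instead. The estimator passed in here is a hypothetical weighted mean.

```python
def weighted_mean(values, weights):
    return sum(w * y for w, y in zip(weights, values)) / sum(weights)

def jackknife_variance(values, weights, estimator):
    """Delete-one jackknife variance for an unstratified sample."""
    n = len(values)
    full = estimator(values, weights)
    replicates = []
    for i in range(n):
        # Drop unit i and reweight the rest to keep the weight total.
        vals = values[:i] + values[i + 1:]
        wts = [w * n / (n - 1) for w in weights[:i] + weights[i + 1:]]
        replicates.append(estimator(vals, wts))
    return (n - 1) / n * sum((r - full) ** 2 for r in replicates)

values = [2, 4, 4, 5, 7, 8]
weights = [1.0] * len(values)
variance = jackknife_variance(values, weights, weighted_mean)  # 0.8
```

Because the estimator is passed in as a function, the same loop works for statistics such as ratios or medians where a closed-form variance is awkward.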

Confidence Intervals

Confidence intervals combine point estimates with margins of error. The level (typically 95%) is the nominal coverage probability: the proportion of intervals, over repeated samples, that would contain the population value.
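A normal-approximation interval for a proportion can be sketched as follows. The deff argument is an assumed design effect that inflates the simple random sampling variance; in practice the standard error would come from a design-based variance estimate, and the numbers below are invented.

```python
import math
from statistics import NormalDist

def proportion_confidence_interval(p_hat, n, deff=1.0, level=0.95):
    """Normal-approximation interval: p_hat +/- z * sqrt(deff * p(1-p) / n)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    se = math.sqrt(deff * p_hat * (1 - p_hat) / n)
    margin = z * se
    return p_hat - margin, p_hat + margin

# Hypothetical estimate: 40% from 600 respondents, design effect 1.5.
low, high = proportion_confidence_interval(0.40, 600, deff=1.5)
margin_of_error = (high - low) / 2  # about 0.048
```

A design effect of 1.5 widens the interval by a factor of about 1.22 relative to a simple random sample of the same size.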

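Confidence intervals from complex designs can have poor coverage, for example when the design has few primary sampling units and the variance estimate therefore has few degrees of freedom. Bootstrap or other methods might be needed.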

Small sample sizes or rare outcomes might produce unreliable intervals. Exact methods or Bayesian approaches might be alternatives.

Key Takeaways

Survey sampling methods affect what populations can be studied and what conclusions are possible
Question design determines what information is collected and its quality
Survey mode affects response rates and response patterns
Weighting adjusts for unequal selection and non-response to produce representative estimates
Survey analysis requires methods accounting for design features
Variance estimation quantifies uncertainty from sampling