Introduction to Survival Analysis
Survival analysis analyzes time-to-event data, where the event of interest is often death or failure. It is used in medical research (time to death or disease recurrence), engineering (time to component failure), and economics (duration of unemployment). The key complication is censoring: for some subjects, the event has not occurred by the end of observation.
Survival analysis requires special methods because standard approaches cannot handle censored data properly. Treating censored observations as non-events, or discarding them, would produce biased estimates. Survival methods instead use the partial information that the event had not yet occurred by the censoring time.
Understanding survival data structure, appropriate estimation methods, and regression models for survival outcomes provides powerful tools for time-to-event problems.
Survival Data Structure
Survival data include a time variable and an indicator of whether the event occurred. This structure requires special handling.
Time Variables
The time variable measures duration from a starting point to the event or censoring. This might be time from diagnosis to death, time from treatment to recurrence, or time from hire to termination.
Time origin must be clearly defined and consistently applied. Starting points might be diagnosis, randomization, treatment start, or hire date. Different origins produce different analyses.
Time scales might be calendar time, time on test, age, or other relevant scales. The choice should reflect the scientific question.
Censoring
Right censoring occurs when the event has not happened by the end of observation. The subject might still be event-free at study end or might have dropped out. This is the most common censoring type.
Left censoring occurs when the event happened before study entry but the exact time is unknown, for example when a condition is already present at the first examination. This is less common than right censoring.
Interval censoring occurs when we know the event happened between two times but not exactly when. This occurs in periodic follow-up.
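As an illustration of the time-plus-indicator structure, the following minimal sketch builds right-censored records from calendar dates. The names `SurvivalRecord` and `make_record`, the use of days as the time unit, and the example dates are assumptions made for this example only.

```python
from datetime import date
from typing import NamedTuple, Optional


class SurvivalRecord(NamedTuple):
    duration: int  # days from the time origin to the event or to censoring
    event: int     # 1 = event observed, 0 = right-censored


def make_record(origin: date, event_date: Optional[date], last_seen: date) -> SurvivalRecord:
    """Build one right-censored record from calendar dates."""
    if event_date is not None and event_date <= last_seen:
        return SurvivalRecord((event_date - origin).days, 1)
    # No event observed: follow-up stops at the last date the subject was seen.
    return SurvivalRecord((last_seen - origin).days, 0)


records = [
    make_record(date(2020, 1, 1), date(2020, 3, 1), date(2020, 12, 31)),  # event observed
    make_record(date(2020, 2, 1), None, date(2020, 12, 31)),              # event-free at study end
    make_record(date(2020, 3, 1), None, date(2020, 6, 1)),                # dropped out early
]
```

Both the subject who reached the end of the study and the subject who dropped out get `event = 0`; only the duration differs. Left- and interval-censored data would need a different encoding, such as a pair of bounds per subject.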
Hazard and Survival Functions
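The hazard function describes the instantaneous rate of the event given survival to that time. It is a rate rather than a probability: the probability of the event in a small interval, conditional on surviving to the beginning of the interval, divided by the length of that interval. Unlike a probability, it can exceed 1.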
The survival function is the probability of surviving beyond a given time. It starts at 1 (everyone alive at time 0) and declines toward 0 as events occur.
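The two functions are mathematically related, and each determines the other. The survival function equals the exponential of the negative cumulative hazard, S(t) = exp(-H(t)), where H(t) is the hazard accumulated from time 0 to t. In discrete time this reduces to a product of conditional survival probabilities.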
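The relationship can be written out explicitly. The notation here (T for the event time, f for its density, H for the cumulative hazard) is introduced for this sketch and assumes a continuous event time.

```latex
\begin{align*}
h(t) &= \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t}
      = \frac{f(t)}{S(t)} = -\frac{d}{dt} \log S(t), \\
H(t) &= \int_0^t h(u)\, du, \\
S(t) &= \exp\{-H(t)\}.
\end{align*}
```

For example, a constant hazard h(t) = \lambda gives H(t) = \lambda t and S(t) = e^{-\lambda t}, the exponential distribution.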
Kaplan-Meier Estimation
The Kaplan-Meier (product limit) estimator provides nonparametric estimation of the survival function. It handles censored observations properly.
Estimation Method
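The Kaplan-Meier estimator multiplies conditional probabilities of surviving past each observed event time. At each event time, the probability of surviving past it, given survival up to it, is estimated as 1 - d/n, where d is the number of events at that time and n is the number of subjects still at risk. These estimates are multiplied across event times.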
At times with events, the survival estimate drops. At times with only censored observations, the estimate stays constant. This produces step functions.
The estimator is nonparametric—it doesn't assume a specific distribution for survival times. This makes it widely applicable.
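The product-limit calculation can be sketched in a few lines of plain Python. The function name `kaplan_meier` and the toy data are illustrative assumptions. When an event and a censoring share the same time, this sketch counts the censored subject as still at risk at that time, which is the usual convention.

```python
from collections import Counter


def kaplan_meier(durations, events):
    """Return [(t, S(t))] at each distinct event time (product-limit estimate)."""
    deaths = Counter(t for t, e in zip(durations, events) if e == 1)
    leaving = Counter(durations)  # events plus censorings leaving the risk set at t
    n_at_risk = len(durations)
    surv = 1.0
    curve = []
    for t in sorted(leaving):
        d = deaths.get(t, 0)
        if d > 0:
            surv *= 1.0 - d / n_at_risk  # conditional survival past t
            curve.append((t, surv))
        n_at_risk -= leaving[t]          # censored subjects only shrink the risk set
    return curve


# Toy data: 1 = event, 0 = censored
durations = [1, 2, 3, 4, 5]
events = [1, 0, 1, 1, 0]
curve = kaplan_meier(durations, events)
# curve is approximately [(1, 0.8), (3, 0.533), (4, 0.267)]
```

The censored subject at time 2 produces no step, but it reduces the risk set from 4 to 3, so the drop at time 3 is larger than it would otherwise be.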
Confidence Intervals
Confidence intervals for the survival function account for estimation uncertainty. The Greenwood formula provides standard errors.
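The log-log transformation keeps the interval within [0, 1] and tends to produce more accurate confidence intervals, especially for survival probabilities near 0 or 1. Software defaults differ (some packages use the log or the untransformed scale instead), so it is worth checking which transformation is in use.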
Pointwise confidence intervals have the stated coverage at each time. Simultaneous confidence bands are wider but cover the entire curve with stated confidence.
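A pointwise interval at a single time can be sketched as follows, using Greenwood's variance sum and the log-log transformation. The function name `greenwood_loglog_ci` and the input format (one `(n_at_risk, n_events)` pair per event time up to the time of interest) are assumptions for this example. It requires an estimate strictly between 0 and 1.

```python
import math


def greenwood_loglog_ci(risk_table, z=1.96):
    """Pointwise CI for S(t) from (n_at_risk, n_events) pairs at event times <= t.

    Assumes 0 < S(t) < 1, i.e. at least one event and n_events < n_at_risk throughout.
    """
    surv = 1.0
    gw_sum = 0.0
    for n, d in risk_table:
        surv *= 1.0 - d / n
        gw_sum += d / (n * (n - d))                      # Greenwood's sum
    se_plain = surv * math.sqrt(gw_sum)                  # SE of S(t) itself
    se_loglog = math.sqrt(gw_sum) / abs(math.log(surv))  # SE of log(-log S(t))
    lower = surv ** math.exp(z * se_loglog)
    upper = surv ** math.exp(-z * se_loglog)
    return surv, se_plain, (lower, upper)


# Two event times: 1 event among 5 at risk, then 1 event among 3 at risk
surv, se, (lower, upper) = greenwood_loglog_ci([(5, 1), (3, 1)])
```

Because the interval is computed on the log(-log) scale and transformed back, its limits always stay inside (0, 1), unlike a plain `surv ± z * se` interval.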
Comparing Survival Curves
The log-rank test compares survival curves across two or more groups. Its null hypothesis is that all groups share the same survival function. It is a nonparametric test that accommodates right-censored data.
The test compares observed and expected events in each group at each event time. The standard log-rank test weights all event times equally, which makes it most powerful when hazards are proportional. Weighted versions, such as the Gehan-Breslow (generalized Wilcoxon) test, give more weight to early differences.
The test provides a p-value for overall difference. With more than two groups, it does not indicate which specific groups differ. Pairwise comparisons, adjusted for multiple testing, can identify specific differences.
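The observed-versus-expected calculation for two groups can be sketched as below. The function name `logrank_test` and the toy data are assumptions for this example. The variance uses the standard hypergeometric form, and the p-value comes from a chi-square distribution with 1 degree of freedom, computed here with `math.erfc`.

```python
import math


def logrank_test(times_a, events_a, times_b, events_b):
    """Unweighted two-group log-rank test. Returns (chi-square statistic, p-value)."""
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e == 1})

    observed_a = expected_a = variance = 0.0
    for t in event_times:
        n_a = sum(1 for u, _, g in data if u >= t and g == 0)  # at risk in group A
        n_b = sum(1 for u, _, g in data if u >= t and g == 1)  # at risk in group B
        d_a = sum(1 for u, e, g in data if u == t and e == 1 and g == 0)
        d_b = sum(1 for u, e, g in data if u == t and e == 1 and g == 1)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n  # expected events in A if the groups were alike
        if n > 1:
            variance += n_a * n_b * d * (n - d) / (n * n * (n - 1))

    chi2 = (observed_a - expected_a) ** 2 / variance
    p_value = math.erfc(math.sqrt(chi2 / 2.0))  # upper tail of chi-square, 1 df
    return chi2, p_value


# Toy data: every subject has an event; group A fails earlier than group B
chi2, p_value = logrank_test([1, 2, 3, 4, 5], [1] * 5, [6, 7, 8, 9, 10], [1] * 5)
```

Replacing the implicit weight of 1 at each event time with the number at risk would give a Gehan-Breslow style test that emphasizes early differences.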
Cox Proportional Hazards Model
The Cox model provides regression modeling for survival data. It relates covariates to the hazard function while making minimal distributional assumptions.
Model Specification
The Cox model specifies: h(t|X) = h₀(t) × exp(β'X), where h(t|X) is the hazard given covariates X, h₀(t) is the baseline hazard, and β are coefficients.
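The baseline hazard is left unspecified, which is the "semi-parametric" aspect. The model describes how covariates scale the hazard relative to this baseline, not the absolute level of the hazard.
The exponentiated coefficients are hazard ratios: exp(β) is the multiplicative change in hazard for a one-unit increase in a covariate, holding the others fixed. A hazard ratio of 2 means twice the hazard (event more likely per unit time) at any given time.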
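A short worked equation shows why the exponentiated coefficient is a hazard ratio and why the baseline hazard does not enter it. The two-covariate setup (x_1, x_2 with coefficients \beta_1, \beta_2) is an assumption chosen for illustration.

```latex
\[
\frac{h(t \mid x_1 + 1,\, x_2)}{h(t \mid x_1,\, x_2)}
= \frac{h_0(t)\,\exp\{\beta_1 (x_1 + 1) + \beta_2 x_2\}}
       {h_0(t)\,\exp\{\beta_1 x_1 + \beta_2 x_2\}}
= \exp(\beta_1).
\]
```

The baseline hazard h_0(t) cancels, so the ratio does not depend on t. For instance, \beta_1 = \log 2 \approx 0.693 corresponds to a hazard ratio of 2.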
Estimation
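Cox model estimation uses partial likelihood. This likelihood depends only on the ordering of events, not their absolute times. Censoring is handled through the risk sets: a censored subject belongs to the risk set at every event time up to its censoring time but never contributes an event term.
Estimation proceeds by maximizing the partial likelihood, usually by Newton-Raphson iteration. Each event contributes a term resembling a conditional logistic regression: the probability that the subject who failed is the one to fail, among all subjects at risk at that time.
The baseline hazard is not needed to estimate the coefficients. It can be estimated after fitting, for example with the Breslow estimator, when absolute survival predictions are required.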
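The partial likelihood for a single covariate can be sketched as below. The function names, the toy data, and the use of a golden-section search are assumptions for this example; the search is swapped in to keep the code short and is not how standard software fits the model. Ties, if present, are handled in the Breslow manner.

```python
import math


def neg_log_partial_likelihood(beta, times, events, x):
    """Negative log partial likelihood for one covariate (Breslow form for ties)."""
    total = 0.0
    for i, (t_i, e_i) in enumerate(zip(times, events)):
        if e_i != 1:
            continue  # censored subjects contribute only through risk sets
        risk_sum = sum(math.exp(beta * x[j]) for j, t_j in enumerate(times) if t_j >= t_i)
        total += beta * x[i] - math.log(risk_sum)
    return -total


def fit_beta(times, events, x, lo=-10.0, hi=10.0, tol=1e-8):
    """Minimize the convex negative log partial likelihood by golden-section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if neg_log_partial_likelihood(c, times, events, x) < neg_log_partial_likelihood(d, times, events, x):
            b = d
        else:
            a = c
    return (a + b) / 2.0


# Toy data: x = 1 marks the exposed group; the subject at time 5 is censored
times = [1, 2, 3, 4, 5, 6]
events = [1, 1, 1, 1, 0, 1]
x = [1, 0, 1, 0, 1, 0]
beta_hat = fit_beta(times, events, x)
hazard_ratio = math.exp(beta_hat)
```

Applying any increasing transformation to `times` (squaring them, say) leaves the risk sets, and therefore the fitted coefficient, unchanged. This is the sense in which only the ordering of events matters.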
Proportional Hazards Assumption
The proportional hazards assumption states that hazard ratios are constant over time. This is the key assumption of the Cox model.
Graphical checks plot log(-log(survival)) versus log(time). Parallel lines across groups suggest proportional hazards. Schoenfeld residuals test the assumption formally.
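The points for the graphical check can be computed from any estimated survival curve. The sketch below uses a hypothetical helper, `cloglog_points`, and two illustrative exponential curves whose hazards are proportional by construction, so the vertical gap between the transformed curves is constant.

```python
import math


def cloglog_points(curve):
    """Map [(t, S(t))] to [(log t, log(-log S(t)))], skipping S(t) equal to 0 or 1."""
    return [(math.log(t), math.log(-math.log(s)))
            for t, s in curve if t > 0 and 0.0 < s < 1.0]


# Illustrative curves: exponential survival with hazard rates 0.1 and 0.2
grid = [1.0, 2.0, 4.0, 8.0]
group_a = [(t, math.exp(-0.1 * t)) for t in grid]
group_b = [(t, math.exp(-0.2 * t)) for t in grid]

points_a = cloglog_points(group_a)
points_b = cloglog_points(group_b)

# Under proportional hazards the vertical gap equals the log hazard ratio at every time
gaps = [yb - ya for (_, ya), (_, yb) in zip(points_a, points_b)]
```

With real data the curves would come from Kaplan-Meier estimates per group, and the lines would be judged by eye for roughly constant separation rather than exact parallelism.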
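Violations of the assumption matter because a single hazard ratio then summarizes an effect that changes over time and can mislead. Time-dependent covariates (for example, covariate-by-time interactions), stratification, or alternative models might be needed.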
Accelerated Failure Time Models
Accelerated failure time models provide an alternative to proportional hazards. They model the effect of covariates on time rather than hazard.
Model Specification
The accelerated failure time model is: log(T) = β'X + ε, where T is survival time, X are covariates, and ε is an error term. The effect of covariates is to accelerate or decelerate time to event.
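If the acceleration factor for a group is 2, the group experiences events twice as fast: its survival times are those of the reference group divided by 2. In the log-linear form above, exp(β) multiplies survival time (a time ratio), so an acceleration factor of 2 corresponds to exp(β) = 0.5. Sign conventions differ between texts and software.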
This is different from the proportional hazards interpretation. In proportional hazards, the hazard ratio multiplies hazard at each time. In accelerated failure time, time is scaled.
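The contrast between scaling time and scaling hazard can be made explicit. The notation here is introduced for this sketch: T_0 = e^{\varepsilon} is the survival time of a subject with all covariates equal to zero, and S_0 is its survival function.

```latex
\begin{align*}
\text{AFT:}\quad & T = e^{\beta' X}\, T_0
  \;\Longrightarrow\; S(t \mid X) = P\!\left(T_0 > t\, e^{-\beta' X}\right) = S_0\!\left(t\, e^{-\beta' X}\right), \\
\text{PH:}\quad & h(t \mid X) = h_0(t)\, e^{\beta' X}
  \;\Longrightarrow\; S(t \mid X) = S_0(t)^{\exp(\beta' X)}.
\end{align*}
```

Under the accelerated failure time form, every quantile of survival time, including the median, is multiplied by e^{\beta' X}. Under proportional hazards, the time axis is untouched and the survival curve is raised to a power.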
Error Distributions
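Different error distributions produce different survival distributions. An extreme-value error gives Weibull survival times, a normal error gives lognormal, and a logistic error gives log-logistic. The Weibull allows only monotone hazards (increasing, decreasing, or constant), whereas the lognormal and log-logistic can represent hazards that rise and then fall.
The Weibull (with the exponential as a special case) is the only family that can be expressed as both proportional hazards and accelerated failure time, providing a connection between the approaches.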
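The Weibull connection can be checked numerically. In the sketch below, the parameterization S_0(t) = exp(-(t/scale)^shape), the function names, and the parameter values are assumptions for illustration. Under this parameterization, an accelerated failure time coefficient `beta_aft` on the log-time scale corresponds to a proportional hazards coefficient of `-shape * beta_aft`.

```python
import math


def weibull_surv(t, scale, shape):
    """Baseline Weibull survival function S0(t) = exp(-(t / scale) ** shape)."""
    return math.exp(-((t / scale) ** shape))


def aft_surv(t, x, beta_aft, scale, shape):
    """AFT form: covariates rescale time, S(t | x) = S0(t * exp(-beta_aft * x))."""
    return weibull_surv(t * math.exp(-beta_aft * x), scale, shape)


def ph_surv(t, x, beta_ph, scale, shape):
    """PH form: covariates rescale hazard, S(t | x) = S0(t) ** exp(beta_ph * x)."""
    return weibull_surv(t, scale, shape) ** math.exp(beta_ph * x)


scale, shape = 5.0, 1.5
beta_aft = 0.4               # lengthens survival time by a factor exp(0.4)
beta_ph = -shape * beta_aft  # equivalent coefficient on the log-hazard scale

pairs = [(aft_surv(t, 1.0, beta_aft, scale, shape), ph_surv(t, 1.0, beta_ph, scale, shape))
         for t in (0.5, 2.0, 6.0, 12.0)]
```

The two columns of `pairs` agree at every time point. The sign flip reflects that a covariate which lengthens survival time lowers the hazard.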
Model selection might be based on theoretical considerations, model fit, or predictive performance.
Stratified Models and Time-Dependent Covariates
Extended Cox models handle more complex situations than the basic model.
Stratified Models
Stratified Cox models allow different baseline hazards across strata. This is useful when proportional hazards holds within strata but not across them.
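Common stratification variables include gender, study site, or other factors whose effects might violate proportional hazards. The stratification variable is not included as a covariate, so no hazard ratio is estimated for it.
The coefficients are assumed to be the same in every stratum, so the test of the treatment effect is based on a single estimate pooled across strata.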
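In symbols, the stratified model and its partial likelihood can be written as follows. The notation is introduced for this sketch: s indexes the K strata, D_s is the set of subjects with observed events in stratum s, and R_s(t_i) is the set of subjects in stratum s still at risk at time t_i. Tied event times are ignored for simplicity.

```latex
\begin{align*}
h_s(t \mid X) &= h_{0s}(t)\, \exp(\beta' X), \qquad s = 1, \dots, K, \\
L(\beta) &= \prod_{s=1}^{K} \; \prod_{i \in D_s}
  \frac{\exp(\beta' X_i)}{\sum_{j \in R_s(t_i)} \exp(\beta' X_j)}.
\end{align*}
```

Each subject is compared only with subjects in the same stratum, while the single coefficient vector \beta is shared by all strata.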
Time-Dependent Covariates
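Time-dependent covariates change over follow-up. External covariates, such as air pollution levels or a planned dosing schedule, follow a path that does not depend on the subject's own event process. Internal covariates, such as blood pressure or a biomarker measurement, are generated by the subject and can only be observed while the subject is alive and under observation.
The Cox model accommodates time-dependent covariates through the risk sets. At each event time, the covariate values current at that time are used for every subject at risk. This requires a careful data structure, typically one row per subject per interval over which the covariates are constant, with a start time, a stop time, and an event indicator.
Testing proportional hazards using Schoenfeld residuals might reveal effects that change over time. If proportional hazards fails for a covariate, adding an interaction between that covariate and a function of time, which acts as a time-dependent covariate, might help.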
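One common way to arrange such data is the start-stop (counting process) layout, with one row per interval over which a subject's covariate is constant. The sketch below is a minimal illustration; the function name `split_follow_up`, the dictionary keys, and the example subject are assumptions.

```python
def split_follow_up(total_time, event, changes, initial_value):
    """Split one subject's follow-up into intervals with a constant covariate value.

    changes: list of (time, new_value) pairs, sorted by time, with times > 0.
    Returns rows with keys 'start', 'stop', 'event', 'value'.
    """
    rows = []
    start, value = 0.0, initial_value
    for change_time, new_value in changes:
        if change_time >= total_time:
            break  # changes after the end of follow-up are never observed
        # Interval ends at a covariate change, not at an event
        rows.append({"start": start, "stop": change_time, "event": 0, "value": value})
        start, value = change_time, new_value
    # Only the final interval can carry the event indicator
    rows.append({"start": start, "stop": total_time, "event": event, "value": value})
    return rows


# A subject who starts treatment (value 0 -> 1) at time 4 and has the event at time 10
rows = split_follow_up(total_time=10.0, event=1, changes=[(4.0, 1)], initial_value=0)
# rows[0]: start 0.0, stop 4.0, event 0, value 0
# rows[1]: start 4.0, stop 10.0, event 1, value 1
```

Coding the subject as treated from time 0 instead would credit the untreated period before time 4 to the treated group, which is the kind of error this layout avoids.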
Key Takeaways
- Survival analysis handles censored time-to-event data that standard methods cannot address properly
- Kaplan-Meier estimation provides nonparametric survival function estimation
- The Cox model provides semi-parametric regression for survival data
- The proportional hazards assumption is key to Cox model interpretation
- Accelerated failure time models provide an alternative approach with different interpretation
- Extended models handle violations of basic assumptions through stratification or time-dependent covariates