Understanding Correlation
Correlation analysis quantifies the strength and direction of relationships between variables. Unlike regression, which models a dependent variable as a function of predictors, correlation treats the variables symmetrically and measures association only; it does not imply causation. Correlation is essential for exploratory analysis, variable selection, and spotting patterns in data.
Correlation coefficients range from -1 to +1. Values near 0 indicate weak or no linear relationship. Values near -1 or +1 indicate strong linear relationships. The sign indicates direction: positive values mean variables increase together, negative values mean one decreases as the other increases.
Correlation is fundamental to many statistical methods. It forms the basis for principal component analysis and factor analysis. It is used in regression diagnostics to detect multicollinearity. It guides feature selection in machine learning.
Pearson Correlation
Pearson correlation measures the strength of linear relationship between two continuous variables. It is the most commonly used correlation measure.
Computation
The Pearson correlation coefficient equals the covariance of two variables divided by the product of their standard deviations: r = Σ(xi - x̄)(yi - ȳ) / √[Σ(xi - x̄)² × Σ(yi - ȳ)²].
This formula produces values between -1 and +1. It equals +1 when all points lie on an increasing straight line. It equals -1 when all points lie on a decreasing straight line. It equals 0 when there is no linear relationship.
Software computes this automatically. Understanding the formula helps interpret what the correlation means and identify potential issues.
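As a minimal sketch of the formula above, the following function computes r directly from the deviations. The function name and the small data set are made up for illustration; in practice a statistics library would normally be used.

```python
import math

def pearson_r(x, y):
    """Pearson r: summed co-deviations over the root of the product of squared deviations."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [xi - mean_x for xi in x]
    dy = [yi - mean_y for yi in y]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if denominator == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return numerator / denominator

# Made-up illustrative data (hypothetical study hours and test scores).
hours = [1, 2, 3, 4, 5]
scores = [2, 4, 5, 4, 5]
r = pearson_r(hours, scores)
print(round(r, 4))  # 0.7746
```

The explicit check for a zero denominator reflects one of the "potential issues" the formula exposes: r is undefined when either variable has no variation.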
Interpretation
The magnitude of r indicates relationship strength regardless of sign. Values near 0.1 indicate a weak relationship, near 0.3 moderate, near 0.5 substantial, near 0.7 strong, and near 0.9 very strong. These are rough guidelines; actual interpretation depends on context and field.
The sign indicates direction. Positive r means both variables tend to be above or below their means together. Negative r means one tends to be above while the other is below.
Correlation is affected by range restriction. Narrowing the range of either variable typically reduces the observed correlation, because less of the variation that the variables share remains in the sample. This can create misleading impressions about relationships in restricted populations.
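A small simulation can illustrate range restriction. This is only a sketch under assumed conditions: the data are synthetic (y is x plus independent noise), and the seed, sample size, and cutoff of 1 are arbitrary choices.

```python
import math
import random

def pearson_r(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(42)  # fixed seed so the run is repeatable
x = [rng.gauss(0, 1) for _ in range(5000)]
y = [xi + rng.gauss(0, 1) for xi in x]  # population correlation is about 0.71

r_full = pearson_r(x, y)

# Keep only cases with x above 1, mimicking a selected (range-restricted) sample.
kept = [(xi, yi) for xi, yi in zip(x, y) if xi > 1]
r_restricted = pearson_r([p[0] for p in kept], [p[1] for p in kept])

print(f"full range: {r_full:.2f}, restricted range: {r_restricted:.2f}")
```

The underlying relationship between x and y is identical in both samples; only the spread of x differs, yet the restricted sample shows a noticeably weaker correlation.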
Significance Testing
Significance tests evaluate whether an observed correlation could plausibly have arisen by chance when the population correlation is zero. The null hypothesis states no linear relationship (ρ = 0). Under this hypothesis, the test statistic t = r√(n - 2) / √(1 - r²) follows a t-distribution with n - 2 degrees of freedom.
The test assumes the two variables are jointly (bivariate) normally distributed and the relationship is linear. It is reasonably robust to moderate violations, especially with larger samples.
Statistical significance does not indicate practical importance. Large samples can produce significant correlations for tiny effects. Effect size measures (like r²) help assess practical significance.
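The sketch below computes the usual test statistic, t = r√(n - 2) / √(1 - r²), for two hypothetical cases. The function name and the input values are illustrative assumptions; converting t to a p-value would need a t-distribution routine from a statistics library, so the comments compare against standard two-sided 5% critical values instead.

```python
import math

def correlation_t_statistic(r, n):
    """t statistic for testing rho = 0, with n - 2 degrees of freedom."""
    if n < 3:
        raise ValueError("at least 3 observations are needed")
    if abs(r) >= 1:
        raise ValueError("r must lie strictly between -1 and 1")
    return r * math.sqrt((n - 2) / (1 - r * r))

# A fairly strong correlation in a tiny sample (r about 0.77, n = 5).
t_small = correlation_t_statistic(math.sqrt(0.6), 5)
# A tiny correlation in a very large sample (r = 0.05, n = 10,000).
t_large = correlation_t_statistic(0.05, 10_000)

print(round(t_small, 3))  # 2.121, below the critical value of 3.182 for 3 df
print(round(t_large, 3))  # about 5.0, well above the large-sample critical value of 1.96
```

The first correlation explains 60% of the variance but is not significant at the 5% level, while the second is highly significant yet explains only 0.25% of the variance (r² = 0.0025).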
Spearman Rank Correlation
Spearman correlation measures monotonic relationship using ranks rather than raw values. It is the nonparametric alternative to Pearson correlation.
Rank-Based Computation
Spearman correlation is the Pearson correlation of ranks. First, convert both variables to ranks (1 to n). Then compute Pearson correlation on the ranks.
Tied values receive average ranks. This maintains the correlation structure while handling duplicates. The resulting coefficient has the same interpretation as Pearson correlation.
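When there are no ties, the coefficient can also be computed from rank differences: r_s = 1 - (6 × Σdi²) / (n(n² - 1)), where di is the difference between the ranks of the i-th paired observation. With tied values this shortcut is only approximate, and the Pearson formula applied to the average ranks should be used instead.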
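The two routes to Spearman's coefficient can be sketched as follows: ranking with average ranks for ties and then applying the Pearson formula, and the rank-difference shortcut, which is exact only when no values are tied. All function names and data are illustrative assumptions.

```python
import math

def average_ranks(values):
    """Ranks from 1 to n; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # positions i..j are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson_r(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman_r(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

def spearman_shortcut(x, y):
    """Rank-difference formula; exact only when there are no ties."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(average_ranks(x), average_ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))

x = [10, 20, 30, 40, 50]
y = [1, 3, 2, 5, 4]
print(round(spearman_r(x, y), 4), round(spearman_shortcut(x, y), 4))  # 0.8 0.8

# A monotonic but curved relationship: the ranks agree perfectly.
cubes = [v ** 3 for v in [1, 2, 3, 4, 5]]
print(spearman_r([1, 2, 3, 4, 5], cubes))  # 1.0
```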
When to Use Spearman
Spearman is appropriate when the relationship is monotonic but not linear. It handles ordinal data and is robust to outliers. It is appropriate when variables are not normally distributed.
Spearman is appropriate for ranked data, Likert scale data, and other ordinal measurements. It does not require interval-scale assumptions.
With normal distributions and linear relationships, Spearman is slightly less efficient than Pearson. However, it is nearly as efficient in many practical situations and is more robust to violations.
Kendall's Tau
Kendall's tau is another rank correlation coefficient. It is based on concordance and discordance of pairs rather than ranks directly.
Concordance and Discordance
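For each pair of observations, determine whether the order is the same in both variables (concordant) or different (discordant). Pairs tied on either variable count as neither concordant nor discordant.
The tau coefficient equals (C - D) / (n(n-1)/2), where C is the number of concordant pairs, D is the number of discordant pairs, and the denominator is the total number of pairs. This form is often called tau-a; when ties are present, the tau-b variant adjusts the denominator for the tied pairs.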
This measures the probability of concordance minus the probability of discordance. Values range from -1 to +1 with the same interpretation as other correlation coefficients.
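Counting pairs directly makes the definition concrete. The sketch below implements the simple (C - D) / (n(n-1)/2) form, sometimes called tau-a, with no adjustment for ties; the function name and data are illustrative.

```python
from itertools import combinations

def kendall_tau_a(x, y):
    """Kendall's tau from pair counts: (concordant - discordant) / total pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            concordant += 1      # both variables order the pair the same way
        elif product < 0:
            discordant += 1      # the variables order the pair in opposite ways
        # product == 0 means a tie on x or y: counted in neither group
    n = len(x)
    return (concordant - discordant) / (n * (n - 1) / 2)

x = [1, 2, 3, 4, 5]
y = [1, 3, 2, 5, 4]
tau = kendall_tau_a(x, y)
print(tau)  # 0.6  (8 concordant pairs, 2 discordant, 10 pairs in total)
```

This brute-force loop examines every pair, so its cost grows with the square of the sample size; it is meant to mirror the definition rather than to be efficient.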
Properties and Use
Kendall's tau is appropriate for ordinal data and small samples. It is often preferred over Spearman when sample sizes are small or when there are many ties.
It has a more direct probabilistic interpretation than Spearman. It is less influenced by tied ranks than Spearman.
The test statistic for significance has an approximate normal distribution for moderate to large samples, enabling hypothesis testing.
Partial Correlation
Partial correlation measures the relationship between two variables while controlling for other variables. This isolates direct associations from spurious correlations due to common causes.
Controlling Variables
Partial correlation computes correlation after removing the effect of control variables. For example, the partial correlation between education and income after controlling for age measures their relationship independent of age.
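The formula for partial correlation removes the linear effect of the control variables from both variables: it is the Pearson correlation between the residuals left after regressing each variable on the controls. More controls can be added, but interpretation becomes complex with many controls.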
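With a single control variable z, the partial correlation can be written in terms of the three pairwise correlations: r_xy.z = (r_xy - r_xz × r_yz) / √((1 - r_xz²)(1 - r_yz²)). The sketch below applies this first-order formula; the function name and the three input correlations are hypothetical numbers chosen for illustration, not results from any data set.

```python
import math

def partial_correlation(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y, controlling for one variable z."""
    denominator = math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    if denominator == 0:
        raise ValueError("undefined when x or y is perfectly correlated with z")
    return (r_xy - r_xz * r_yz) / denominator

# Hypothetical values: education-income 0.50, education-age 0.60, income-age 0.70.
r_partial = partial_correlation(0.50, 0.60, 0.70)
print(round(r_partial, 2))  # 0.14
```

In this made-up case most of the raw education-income correlation is accounted for by the shared association with age, which is the pattern partial correlation is designed to expose.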
The technique is valuable for addressing confounding. It helps identify relationships that remain after accounting for alternative explanations.
Applications
Partial correlation is used to identify direct relationships in the presence of confounding. It can help select variables for regression by identifying unique contributions.
It is useful in experimental designs where randomization might not be complete. It can also help in observational studies where certain confounders can be measured.
The number of control variables should be limited relative to sample size. Too many controls can lead to unreliable estimates.
Correlation Matrices
Correlation matrices display pairwise correlations among multiple variables. They provide a comprehensive view of relationships in multivariate data.
Construction and Display
A correlation matrix has variables in rows and columns. Each cell contains the correlation between the row and column variables. The diagonal always contains 1 (correlation with self).
Matrices can be displayed as tables or heatmaps. Color coding helps visualize patterns. Clustering by correlation groups similar variables.
Computing matrices is straightforward in most statistical software. They are often the first step in multivariate exploration.
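As a library-free sketch, a correlation matrix is just the Pearson correlation evaluated for every pair of columns. The column names and values below are made up for illustration.

```python
import math

def pearson_r(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_matrix(columns):
    """Pairwise Pearson correlations for a dict of equal-length columns."""
    names = list(columns)
    return [[pearson_r(columns[a], columns[b]) for b in names] for a in names]

# Hypothetical measurements on five people.
data = {
    "height": [150, 160, 170, 180, 190],
    "weight": [52, 58, 69, 77, 88],
    "age": [34, 25, 41, 29, 38],
}
matrix = correlation_matrix(data)

# Print as a small table: rows and columns follow the order of the dict keys.
for name, row in zip(data, matrix):
    print(f"{name:>8}", "  ".join(f"{value:+.2f}" for value in row))
```

The result is symmetric with ones on the diagonal, so only the upper or lower triangle carries distinct information.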
Pattern Identification
Examining correlation matrices reveals clusters of correlated variables. High correlations among predictors suggest potential multicollinearity in regression. Low correlations between a predictor and the outcome might reflect a nonlinear relationship that a transformation could reveal.
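Singularities (correlations of exactly 1 or -1) indicate that one variable is an exact linear function of another, as with duplicate or rescaled variables. Near-singularities indicate nearly redundant variables that might cause computational problems.
The determinant of the correlation matrix indicates multicollinearity severity. It equals 1 when all variables are uncorrelated and approaches 0 as variables become linearly dependent, so small determinants indicate severe multicollinearity requiring attention.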
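To illustrate how the determinant responds to redundancy, the sketch below computes it by Gaussian elimination for two hypothetical 3×3 correlation matrices. The matrices are invented for the example, and a numerical library would normally supply the determinant routine.

```python
def determinant(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]  # work on a copy
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if a[pivot][col] == 0:
            return 0.0              # exactly singular
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det              # a row swap flips the sign
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det

uncorrelated = [[1.0, 0.0, 0.0],
                [0.0, 1.0, 0.0],
                [0.0, 0.0, 1.0]]
# The first two variables are almost duplicates (r = 0.99).
nearly_redundant = [[1.00, 0.99, 0.50],
                    [0.99, 1.00, 0.50],
                    [0.50, 0.50, 1.00]]

print(determinant(uncorrelated))                # 1.0
print(round(determinant(nearly_redundant), 4))  # 0.0149
```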
Limitations of Correlation
Correlation analysis has important limitations that must be considered to avoid misinterpretation.
Correlation Does Not Imply Causation
The classic warning applies: correlation measures association, not causation. Two variables might be correlated because one causes the other, the other causes the first, a third variable causes both, or the relationship is coincidental.
Establishing causation requires additional evidence beyond correlation. Experimental designs with randomization provide the strongest evidence. Observational studies can suggest but not establish causation.
Spurious correlations are common. With many variables and limited data, some sizable correlations will arise by chance alone. Replication helps distinguish real relationships from chance findings.
Other Limitations
Pearson correlation only measures linear relationships, and rank correlations only measure monotonic ones. Strong curved relationships, such as a symmetric U-shape, can produce near-zero correlations. Visual examination should always accompany correlation computation.
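A deliberately simple example shows how a perfect but curved relationship can yield a Pearson correlation of zero. Here y is completely determined by x (y = x²), with x values chosen symmetrically around zero; the data are constructed for illustration.

```python
import math

def pearson_r(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [-3, -2, -1, 0, 1, 2, 3]
y = [v ** 2 for v in x]          # a perfect U-shaped dependence on x

r_curved = pearson_r(x, y)
print(r_curved)  # 0.0
```

The positive co-deviations on the right half of the U cancel the negative ones on the left half, which is why a scatter plot reveals what the coefficient hides.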
Outliers can dramatically affect correlation. A single extreme point might create or obscure relationships. Robust correlation measures like Spearman are less affected.
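The influence of a single extreme point can be sketched with a few made-up observations: six points with a weak negative correlation, then the same points plus one outlier far from the rest.

```python
import math

def pearson_r(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5, 6]
y = [2, 3, 1, 3, 2, 1]
r_without = pearson_r(x, y)

# Add one extreme observation at (30, 30).
r_with = pearson_r(x + [30], y + [30])

print(round(r_without, 2))  # -0.36
print(round(r_with, 2))     # 0.98
```

One added point reverses the sign and produces a near-perfect coefficient, even though the relationship among the original six points is unchanged.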
The distinction between statistical significance and practical significance matters. Correlations might be statistically significant but explain little variance, limiting practical utility.
Biserial and Point-Biserial Correlation
These correlation types handle situations where one variable is dichotomous.
Point-Biserial Correlation
Point-biserial correlation applies when one variable is truly dichotomous (categorical with two levels representing a binary characteristic) and the other is continuous. It is mathematically equivalent to Pearson correlation after coding the binary variable as 0 and 1.
It measures the relationship between a binary and continuous variable. The interpretation is the same as Pearson correlation: values near 0 indicate no relationship, values near ±1 indicate strong relationships.
The test of significance is equivalent to the independent-samples t-test (with pooled variance) for comparing the means of the two groups. This connection provides a useful way to think about the correlation.
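The equivalence with the t-test can be checked numerically. The sketch below computes the point-biserial correlation as a Pearson correlation on a 0/1 variable, converts it to t with t = r√(n - 2) / √(1 - r²), and compares that with a pooled-variance two-sample t statistic computed from the group means. The data and variable names are invented for illustration.

```python
import math

def pearson_r(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

group = [0, 0, 0, 0, 1, 1, 1, 1]   # binary variable coded 0/1
score = [3, 4, 5, 4, 6, 7, 8, 7]   # continuous variable
n = len(score)

# Route 1: point-biserial r, then the correlation t statistic.
r_pb = pearson_r(group, score)
t_from_r = r_pb * math.sqrt((n - 2) / (1 - r_pb ** 2))

# Route 2: pooled-variance t statistic for the difference in group means.
g0 = [s for g, s in zip(group, score) if g == 0]
g1 = [s for g, s in zip(group, score) if g == 1]
m0, m1 = sum(g0) / len(g0), sum(g1) / len(g1)
ss_within = sum((s - m0) ** 2 for s in g0) + sum((s - m1) ** 2 for s in g1)
pooled_var = ss_within / (n - 2)
t_pooled = (m1 - m0) / math.sqrt(pooled_var * (1 / len(g0) + 1 / len(g1)))

print(round(r_pb, 4), round(t_from_r, 3), round(t_pooled, 3))  # 0.9045 5.196 5.196
```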
Biserial Correlation
Biserial correlation applies when one variable is artificially dichotomous (a continuous variable that has been cut into two categories). This is different from a naturally binary variable.
The biserial correlation estimates what the correlation would be if the original continuous variable had been used. It requires assumptions about the underlying distribution, typically that the variable behind the dichotomy is normally distributed.
It is appropriate for variables like pass/fail that come from continuous underlying tendencies. The computation is more complex than point-biserial correlation.
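One common way to obtain the biserial coefficient is to rescale the point-biserial value: r_b = r_pb × √(pq) / h, where p and q are the proportions in the two categories and h is the height of the standard normal density at the cut point that separates them. The sketch below assumes a normally distributed underlying variable; the function name and input values are hypothetical.

```python
import math
from statistics import NormalDist

def biserial_from_point_biserial(r_pb, p):
    """Rescale a point-biserial r, assuming a normal variable behind the dichotomy.

    p is the proportion of cases in the upper category (for example "pass").
    """
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")
    q = 1 - p
    z_cut = NormalDist().inv_cdf(q)      # cut point with proportion p above it
    ordinate = NormalDist().pdf(z_cut)   # normal density height at the cut point
    return r_pb * math.sqrt(p * q) / ordinate

# Hypothetical case: point-biserial r of 0.40 with a 50/50 pass/fail split.
r_b = biserial_from_point_biserial(0.40, 0.5)
print(round(r_b, 3))  # 0.501
```

The biserial value is always larger in magnitude than the point-biserial value it is derived from, reflecting the information lost by dichotomizing.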
Canonical Correlation
Canonical correlation examines relationships between sets of multiple variables. It finds linear combinations of each set that correlate maximally.
Method Overview
The first canonical correlation maximizes correlation between linear combinations of the two variable sets. Subsequent canonical correlations maximize remaining correlation subject to being uncorrelated with previous combinations.
Each canonical variate is a weighted combination of variables in its set. The weights indicate relative contributions. They are chosen to maximize the correlation with the other set.
The number of canonical correlations equals the number of variables in the smaller of the two sets. Typically only the first few canonical correlations are large enough to be worth interpreting.
Interpretation and Use
Canonical correlation provides a holistic view of multivariate relationships. It summarizes the association between two sets more compactly than a large collection of separate pairwise correlations. However, the canonical variates themselves can be difficult to interpret.
The technique is used in psychology, social sciences, and marketing to understand relationships between sets of variables like attitudes and behaviors, or motivations and outcomes.
It can be seen as an extension of multiple regression to multiple dependent variables, allowing exploration of how predictor sets relate to outcome sets.
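The link to multiple regression can be seen in the simplest special case. When one set contains a single variable y and the other contains two variables x1 and x2, there is only one canonical correlation, and it equals the multiple correlation R from regressing y on x1 and x2. The sketch below uses the standard two-predictor formula for R; it is not a general canonical correlation routine, and the function name and input correlations are hypothetical.

```python
import math

def multiple_correlation(r_y1, r_y2, r_12):
    """Multiple correlation of y with x1 and x2, from the pairwise correlations.

    The three inputs must be mutually consistent (form a valid correlation matrix).
    """
    if abs(r_12) >= 1:
        raise ValueError("the two predictors must not be perfectly correlated")
    r_squared = (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)
    return math.sqrt(r_squared)

# Hypothetical correlations: y-x1 = 0.5, y-x2 = 0.4, x1-x2 = 0.3.
canonical_r = multiple_correlation(0.5, 0.4, 0.3)
print(round(canonical_r, 3))  # 0.565
```

The combined value is at least as large as either pairwise correlation, because the weights of the linear combination are chosen to maximize the correlation with y.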
Key Takeaways
- Correlation measures strength and direction of linear (Pearson) or monotonic (Spearman, Kendall) relationships
- Partial correlation measures relationships while controlling for other variables
- Correlation matrices provide comprehensive views of multivariate relationships
- Correlation does not imply causation; additional evidence is needed to establish causal relationships
- Pearson correlation only captures linear relationships; strong non-linear relationships can produce near-zero correlations
- Special correlation types handle dichotomous and ordinal data