Covariance — Measuring Joint Variation of Two Variables


Covariance

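Covariance measures how two variables vary together. A positive covariance means that when one variable is above its mean, the other tends to be above its mean too; a negative covariance means they tend to fall on opposite sides of their means. For n paired observations, the sample covariance is: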

$$\text{Cov}(X, Y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n-1}$$

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

np.random.seed(42)
n = 100

# Three relationships
x = np.random.normal(50, 10, n)
y_pos = 2*x + np.random.normal(0, 10, n)   # positive covariance
y_neg = -2*x + np.random.normal(0, 10, n)  # negative covariance
y_zero = np.random.normal(50, 10, n)        # near-zero covariance (drawn independently of x)

for y, label in [(y_pos,"Positive"),(y_neg,"Negative"),(y_zero,"Near-Zero")]:
    cov = np.cov(x, y)[0,1]
    print(f"Cov(X, Y) = {cov:+.4f}  → {label}")

Manual Calculation

data = pd.DataFrame({
    'study_hours': [2, 3, 5, 4, 6, 1, 7, 3, 5, 8],
    'exam_score':  [60,65,75,70,80,55,85,65,72,90]
})

mean_x = data['study_hours'].mean()
mean_y = data['exam_score'].mean()
deviations_xy = (data['study_hours'] - mean_x) * (data['exam_score'] - mean_y)
cov_manual = deviations_xy.sum() / (len(data) - 1)
cov_numpy  = np.cov(data['study_hours'], data['exam_score'])[0, 1]

print(f"Manual covariance: {cov_manual:.4f}")
print(f"NumPy covariance:  {cov_numpy:.4f}")
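
The manual calculation above divides by n - 1, which is the sample covariance. As a minimal sketch of how that differs from the population form (dividing by n), the snippet below recomputes both on the same study-hours data and checks them against np.cov, whose ddof argument controls the divisor. The variable names are illustrative only.

```python
import numpy as np

# Same study-hours / exam-score values as in the lesson's example.
study_hours = np.array([2, 3, 5, 4, 6, 1, 7, 3, 5, 8], dtype=float)
exam_score = np.array([60, 65, 75, 70, 80, 55, 85, 65, 72, 90], dtype=float)
n = len(study_hours)

# Sum of the products of deviations from each mean.
cross = ((study_hours - study_hours.mean())
         * (exam_score - exam_score.mean())).sum()

cov_sample = cross / (n - 1)   # sample covariance (divides by n - 1)
cov_population = cross / n     # population covariance (divides by n)

# np.cov divides by n - 1 by default; ddof=0 switches to dividing by n.
np_sample = np.cov(study_hours, exam_score)[0, 1]
np_population = np.cov(study_hours, exam_score, ddof=0)[0, 1]

print(f"Sample covariance (n - 1): {cov_sample:.4f}  (np.cov: {np_sample:.4f})")
print(f"Population covariance (n): {cov_population:.4f}  (np.cov: {np_population:.4f})")
```

With only ten observations the two values differ noticeably; the gap shrinks as n grows.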

Covariance Matrix

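For p variables, the covariance matrix is a symmetric p × p matrix whose (i, j) entry is the covariance between variable i and variable j:

# load_dataset fetches the data online on first use, then reads it from a local cache
iris = sns.load_dataset('iris')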
numeric = iris.select_dtypes(include='number')

cov_matrix = numeric.cov()
print("Covariance Matrix:")
print(cov_matrix.round(4))

fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(cov_matrix, annot=True, fmt='.2f', cmap='RdBu_r', center=0, ax=ax)
ax.set_title('Covariance Matrix (Iris Dataset)')
plt.tight_layout()
plt.savefig('covariance_matrix.png', dpi=150)
plt.show()
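
To make the structure of a covariance matrix concrete, here is a small sketch that checks three of its properties: the diagonal holds the variances, the matrix is symmetric, and it has no negative eigenvalues. It uses a synthetic DataFrame with hypothetical column names (a, b, c) rather than the Iris data, so it runs without a download.

```python
import numpy as np
import pandas as pd

# Synthetic data with made-up column names; "b" is built from "a" on purpose.
rng = np.random.default_rng(0)
a = rng.normal(0, 1, 200)
df = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(0, 1, 200),
    "c": rng.normal(5, 3, 200),
})

cov_matrix = df.cov()          # pandas uses the n - 1 divisor by default
cov_values = cov_matrix.to_numpy()

# Diagonal entries equal each column's sample variance.
diag_matches_var = np.allclose(np.diag(cov_values), df.var(ddof=1).to_numpy())

# Cov(X, Y) equals Cov(Y, X), so the matrix is symmetric.
is_symmetric = np.allclose(cov_values, cov_values.T)

# A covariance matrix is positive semi-definite: its eigenvalues are
# non-negative, up to floating-point error.
eigenvalues = np.linalg.eigvalsh(cov_values)
is_psd = bool((eigenvalues >= -1e-10).all())

print(cov_matrix.round(3))
print("Diagonal equals variances:", diag_matches_var)
print("Symmetric:", is_symmetric)
print("Positive semi-definite:", is_psd)
```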

Covariance vs Correlation

| Feature | Covariance | Pearson Correlation |
|---|---|---|
| Range | (−∞, +∞) | [−1, +1] |
| Units | Units of X × Units of Y | Dimensionless |
| Scale-dependent? | Yes | No |
| Interpretable magnitude? | No | Yes |

$$r = \frac{\text{Cov}(X, Y)}{s_X \cdot s_Y}$$

cov = np.cov(data['study_hours'], data['exam_score'])[0,1]
sx = data['study_hours'].std(ddof=1)
sy = data['exam_score'].std(ddof=1)
r_from_cov = cov / (sx * sy)
r_numpy = np.corrcoef(data['study_hours'], data['exam_score'])[0,1]

print(f"r from covariance formula: {r_from_cov:.6f}")
print(f"r from np.corrcoef:        {r_numpy:.6f}")
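
The scale-dependence noted in the comparison can be seen directly by changing units. The sketch below, using the same study-hours data, re-expresses hours as minutes: the covariance is multiplied by 60, while the correlation stays the same. The study_minutes variable is introduced here purely for illustration.

```python
import numpy as np

study_hours = np.array([2, 3, 5, 4, 6, 1, 7, 3, 5, 8], dtype=float)
exam_score = np.array([60, 65, 75, 70, 80, 55, 85, 65, 72, 90], dtype=float)

# Same measurements, expressed in minutes instead of hours.
study_minutes = study_hours * 60

cov_hours = np.cov(study_hours, exam_score)[0, 1]
cov_minutes = np.cov(study_minutes, exam_score)[0, 1]

r_hours = np.corrcoef(study_hours, exam_score)[0, 1]
r_minutes = np.corrcoef(study_minutes, exam_score)[0, 1]

print(f"Covariance (hours):    {cov_hours:.4f}")
print(f"Covariance (minutes):  {cov_minutes:.4f}")
print(f"Correlation (hours):   {r_hours:.6f}")
print(f"Correlation (minutes): {r_minutes:.6f}")
```

Because rescaling X by a constant c multiplies both Cov(X, Y) and s_X by c, the factor cancels in r.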

Key Takeaways

  1. Positive covariance: both variables tend to be above (or below) their means at the same time.
  2. Negative covariance: one variable tends to be above its mean when the other is below.
  3. Covariance is scale-dependent; use correlation (covariance divided by both standard deviations) to judge the strength of a linear relationship.
  4. np.cov(x, y) returns the 2 × 2 covariance matrix; the [0,1] element (equal to [1,0]) is the covariance between x and y.
  5. The diagonal of the covariance matrix contains each variable's own variance.
  6. Portfolio variance = wᵀ Σ w, where w is the vector of asset weights and Σ is the covariance matrix of asset returns; covariance between assets drives diversification.
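
The portfolio-variance identity in the last takeaway can be sketched for two assets as follows. The returns are simulated with made-up parameters (they are not real market data), and the weights are arbitrary; the point is only that wᵀ Σ w matches the variance of the weighted return series.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical daily returns for two assets, simulated for illustration only.
returns = rng.multivariate_normal(
    mean=[0.001, 0.0005],
    cov=[[0.0004, -0.0001],
         [-0.0001, 0.0002]],
    size=500,
)

weights = np.array([0.6, 0.4])          # arbitrary portfolio weights
sigma = np.cov(returns, rowvar=False)   # 2 x 2 sample covariance matrix

# Quadratic form w^T Sigma w.
portfolio_var = weights @ sigma @ weights

# The same quantity, computed as the variance of the portfolio's return series.
direct_var = np.var(returns @ weights, ddof=1)

# Two-asset expansion: w1^2 s1^2 + w2^2 s2^2 + 2 w1 w2 Cov(1, 2).
expanded_var = (weights[0] ** 2 * sigma[0, 0]
                + weights[1] ** 2 * sigma[1, 1]
                + 2 * weights[0] * weights[1] * sigma[0, 1])

print(f"w^T Sigma w:          {portfolio_var:.8f}")
print(f"Variance of returns:  {direct_var:.8f}")
print(f"Two-asset expansion:  {expanded_var:.8f}")
```

The cross term 2·w₁·w₂·Cov(1, 2) is where diversification enters: a negative covariance between the assets lowers the portfolio variance.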
