Covariance
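Covariance measures how two variables vary together. A positive covariance means they tend to move in the same direction; a negative covariance means they tend to move in opposite directions; a value near zero means there is little or no linear relationship.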
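In symbols, the sample covariance of n paired observations can be written as follows. The notation (x̄ and ȳ for the sample means, n − 1 as the divisor) is an assumption chosen to match the manual calculation later in this section and NumPy's default behavior.

$$
\operatorname{Cov}(X, Y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
$$

Each term is positive when both values lie on the same side of their means and negative when they lie on opposite sides, which is where the sign of the covariance comes from.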
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
np.random.seed(42)
n = 100
# Three relationships
x = np.random.normal(50, 10, n)
y_pos = 2*x + np.random.normal(0, 10, n) # positive covariance
y_neg = -2*x + np.random.normal(0, 10, n) # negative covariance
y_zero = np.random.normal(50, 10, n) # independent of x, so covariance is near zero
for y, label in [(y_pos, "Positive"), (y_neg, "Negative"), (y_zero, "Near-Zero")]:
    cov = np.cov(x, y)[0, 1]
    print(f"Cov(X, Y) = {cov:+.4f} → {label}")
Manual Calculation
data = pd.DataFrame({
'study_hours': [2, 3, 5, 4, 6, 1, 7, 3, 5, 8],
'exam_score': [60,65,75,70,80,55,85,65,72,90]
})
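# Step 1: mean of each variable
mean_x = data['study_hours'].mean()
mean_y = data['exam_score'].mean()
# Step 2: product of the paired deviations from the means
deviations_xy = (data['study_hours'] - mean_x) * (data['exam_score'] - mean_y)
# Step 3: sum the products and divide by n - 1 (sample covariance, np.cov's default)
cov_manual = deviations_xy.sum() / (len(data) - 1)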
cov_numpy = np.cov(data['study_hours'], data['exam_score'])[0, 1]
print(f"Manual covariance: {cov_manual:.4f}")
print(f"NumPy covariance: {cov_numpy:.4f}")
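The manual calculation divides by n − 1, which gives the sample covariance. As a minimal sketch of how the divisor changes the result, the block below recomputes both the sample and the population version on the same ten observations and checks them against `np.cov` with its `ddof` argument. The variable names (`hours`, `scores`, `cross`) are illustrative and not part of the original example.

```python
import numpy as np

# Same study-hours / exam-score values as in the example above.
hours = np.array([2, 3, 5, 4, 6, 1, 7, 3, 5, 8], dtype=float)
scores = np.array([60, 65, 75, 70, 80, 55, 85, 65, 72, 90], dtype=float)

n = len(hours)
# Sum of the products of paired deviations from the means.
cross = ((hours - hours.mean()) * (scores - scores.mean())).sum()

cov_sample = cross / (n - 1)   # sample covariance (NumPy's default divisor)
cov_population = cross / n     # population covariance

# np.cov exposes the divisor through ddof: ddof=1 -> n - 1, ddof=0 -> n.
assert np.isclose(cov_sample, np.cov(hours, scores, ddof=1)[0, 1])
assert np.isclose(cov_population, np.cov(hours, scores, ddof=0)[0, 1])

print(f"Sample covariance (n - 1): {cov_sample:.4f}")
print(f"Population covariance (n): {cov_population:.4f}")
```

The two values differ by a factor of (n − 1) / n, so the gap shrinks as the sample grows.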
Covariance Matrix
For multiple variables, the covariance matrix contains pairwise covariances:
iris = sns.load_dataset('iris')
numeric = iris.select_dtypes(include='number')
cov_matrix = numeric.cov()
print("Covariance Matrix:")
print(cov_matrix.round(4))
fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(cov_matrix, annot=True, fmt='.2f', cmap='RdBu_r', center=0, ax=ax)
ax.set_title('Covariance Matrix (Iris Dataset)')
plt.tight_layout()
plt.savefig('covariance_matrix.png', dpi=150)
plt.show()
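Two structural properties of a covariance matrix are worth checking directly: it is symmetric, and its diagonal holds each variable's variance. The sketch below verifies both on a small synthetic DataFrame. The data and the column names `a`, `b`, `c` are hypothetical stand-ins so the block runs without downloading a dataset; the same checks should apply to the Iris matrix above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data: 50 rows, 3 numeric columns.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(50, 3)), columns=["a", "b", "c"])

cov_matrix = df.cov()  # pandas uses the n - 1 divisor by default

# The diagonal entries equal the per-column sample variances.
diag_matches_var = np.allclose(
    np.diag(cov_matrix.to_numpy()), df.var(ddof=1).to_numpy()
)

# Cov(X, Y) == Cov(Y, X), so the matrix equals its transpose.
is_symmetric = np.allclose(cov_matrix.to_numpy(), cov_matrix.to_numpy().T)

print(f"Diagonal equals variances: {diag_matches_var}")
print(f"Matrix is symmetric: {is_symmetric}")
```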
Covariance vs Correlation
| Feature | Covariance | Pearson Correlation |
|---|---|---|
| Range | (−∞, +∞) | [−1, +1] |
| Units | Units of X × Units of Y | Dimensionless |
| Scale-dependent? | Yes | No |
| Interpretable magnitude? | No | Yes |
cov = np.cov(data['study_hours'], data['exam_score'])[0,1]
sx = data['study_hours'].std(ddof=1)
sy = data['exam_score'].std(ddof=1)
r_from_cov = cov / (sx * sy)
r_numpy = np.corrcoef(data['study_hours'], data['exam_score'])[0,1]
print(f"r from covariance formula: {r_from_cov:.6f}")
print(f"r from np.corrcoef: {r_numpy:.6f}")
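To make the scale dependence in the table concrete, here is a small sketch that re-expresses study time in minutes instead of hours. The conversion and the variable names (`hours`, `minutes`, `scores`) are illustrative additions; the underlying values are the same ten observations used above.

```python
import numpy as np

hours = np.array([2, 3, 5, 4, 6, 1, 7, 3, 5, 8], dtype=float)
scores = np.array([60, 65, 75, 70, 80, 55, 85, 65, 72, 90], dtype=float)
minutes = hours * 60  # same information, different unit

cov_hours = np.cov(hours, scores)[0, 1]
cov_minutes = np.cov(minutes, scores)[0, 1]

r_hours = np.corrcoef(hours, scores)[0, 1]
r_minutes = np.corrcoef(minutes, scores)[0, 1]

# Rescaling X by 60 rescales the covariance by 60 ...
assert np.isclose(cov_minutes, 60 * cov_hours)
# ... but leaves the correlation unchanged.
assert np.isclose(r_minutes, r_hours)

print(f"Covariance (hours):    {cov_hours:.4f}")
print(f"Covariance (minutes):  {cov_minutes:.4f}")
print(f"Correlation (hours):   {r_hours:.6f}")
print(f"Correlation (minutes): {r_minutes:.6f}")
```

The relationship between study time and score has not changed, yet the covariance is 60 times larger, which is why its magnitude alone says little about strength.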
Key Takeaways
- Positive covariance: both variables tend to be above (or below) their means together
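- Negative covariance: one variable tends to be above its mean when the other is below its mean
- Covariance is scale-dependent, so its magnitude is hard to interpret; use correlation (standardized covariance) to judge the strength of a relationship
- np.cov() returns the full covariance matrix; for two inputs, the [0,1] and [1,0] elements are equal and hold the covariance between the two variables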
- The diagonal of the covariance matrix contains each variable's own variance
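- Portfolio variance = wᵀ Σ w, where w is the vector of asset weights and Σ is the covariance matrix of asset returns; low or negative covariances between assets reduce portfolio variance, which is how covariance drives diversification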
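The portfolio variance formula in the last takeaway can be sketched in a few lines. Everything numeric here is hypothetical: the two-asset weights `w` and the covariance matrix `sigma` are made-up illustration values, not estimates from real returns.

```python
import numpy as np

# Hypothetical two-asset portfolio (illustrative values only).
w = np.array([0.5, 0.5])              # portfolio weights, summing to 1
sigma = np.array([[0.04, -0.01],      # covariance matrix of asset returns
                  [-0.01, 0.09]])

# Portfolio variance: w^T Sigma w
portfolio_var = w @ sigma @ w

# Same variances, but with the off-diagonal covariances set to zero.
sigma_uncorrelated = np.diag(np.diag(sigma))
portfolio_var_uncorrelated = w @ sigma_uncorrelated @ w

print(f"Variance with negative covariance: {portfolio_var:.4f}")
print(f"Variance with zero covariance:     {portfolio_var_uncorrelated:.4f}")
```

With these made-up numbers, the negative covariance between the two assets lowers the portfolio variance relative to the zero-covariance case, which is the diversification effect the takeaway refers to.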