Covariance


Do Two Variables Move Together? Covariance Tells You

Covariance measures how two variables change together. Positive covariance means they tend to move in the same direction; negative means opposite directions.

Direction of relationship — Positive, negative, or zero covariance reveals the sign of association
Magnitude is scale-dependent — The raw number is hard to interpret without normalization
Covariance matrix — The foundation of multivariate statistics and portfolio theory
Gateway to correlation — Standardizing covariance produces the Pearson correlation coefficient

Covariance is the raw material from which correlation is built, and it underlies the covariance matrix, PCA, and portfolio risk calculations covered below.

What is Covariance?

Definition

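The covariance of two variables is the average product of their deviations from their respective means. A positive value means above-average values of one variable tend to pair with above-average values of the other; a negative value means above-average values of one tend to pair with below-average values of the other.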
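Written out for n paired observations, the sample covariance (the quantity np.cov returns by default) can be stated as follows. The symbols x_i, y_i and the bar notation for means are introduced here for illustration:

```latex
\operatorname{Cov}(X, Y) = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
```

The n - 1 divisor gives the usual unbiased sample estimate; dividing by n instead gives the population covariance.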

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

np.random.seed(42)
n = 100

# Three relationships
x = np.random.normal(50, 10, n)
y_pos = 2*x + np.random.normal(0, 10, n)   # positive covariance
y_neg = -2*x + np.random.normal(0, 10, n)  # negative covariance
y_zero = np.random.normal(50, 10, n)        # independent of x: covariance near zero

for y, label in [(y_pos,"Positive"),(y_neg,"Negative"),(y_zero,"Near-Zero")]:
    cov = np.cov(x, y)[0,1]
    print(f"Cov(X, Y) = {cov:+.4f}  -> {label}")
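A near-zero covariance only rules out a linear relationship, not dependence in general. The sketch below uses a made-up symmetric grid for x and sets y = x squared, so y is completely determined by x and yet the covariance cancels out:

```python
import numpy as np

# Symmetric grid around zero, so the mean of x is 0
x = np.linspace(-3, 3, 101)
# y is a deterministic function of x: the two are fully dependent
y = x ** 2

# Positive and negative deviations of x cancel against the same y values
cov_xy = np.cov(x, y)[0, 1]
print(f"Cov(X, X^2) = {cov_xy:.6f}")
```

The positive half of x contributes exactly what the negative half takes away, which is why the result is zero up to floating-point error.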

Manual Calculation

data = pd.DataFrame({
    'study_hours': [2, 3, 5, 4, 6, 1, 7, 3, 5, 8],
    'exam_score':  [60,65,75,70,80,55,85,65,72,90]
})

mean_x = data['study_hours'].mean()
mean_y = data['exam_score'].mean()
deviations_xy = (data['study_hours'] - mean_x) * (data['exam_score'] - mean_y)
cov_manual = deviations_xy.sum() / (len(data) - 1)
cov_numpy  = np.cov(data['study_hours'], data['exam_score'])[0, 1]

print(f"Manual covariance: {cov_manual:.4f}")
print(f"NumPy covariance:  {cov_numpy:.4f}")
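The manual calculation divides by n - 1, which matches the np.cov default. As a minimal sketch of the difference between the sample and population estimators, the same study-hours data can be run through both divisors (the array names here are illustrative):

```python
import numpy as np

hours = np.array([2, 3, 5, 4, 6, 1, 7, 3, 5, 8])
scores = np.array([60, 65, 75, 70, 80, 55, 85, 65, 72, 90])
n = len(hours)

# Sum of cross-deviations, shared by both estimators
cross = ((hours - hours.mean()) * (scores - scores.mean())).sum()

cov_sample = cross / (n - 1)    # sample covariance (np.cov default)
cov_population = cross / n      # population covariance (np.cov with ddof=0)

print(f"Sample covariance:     {cov_sample:.4f}")
print(f"Population covariance: {cov_population:.4f}")
```

The two estimates differ by the factor (n - 1) / n, so the gap shrinks as the sample grows.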

Covariance Matrix

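For more than two variables, the covariance matrix collects every pairwise covariance: entry (i, j) is the covariance of variables i and j, and the diagonal holds each variable's variance: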

iris = sns.load_dataset('iris')
numeric = iris.select_dtypes(include='number')

cov_matrix = numeric.cov()
print("Covariance Matrix:")
print(cov_matrix.round(4))

fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(cov_matrix, annot=True, fmt='.2f', cmap='RdBu_r', center=0, ax=ax)
ax.set_title('Covariance Matrix (Iris Dataset)')
plt.tight_layout()
plt.savefig('covariance_matrix.png', dpi=150)
plt.show()
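Any covariance matrix should be symmetric, carry the variances on its diagonal, and be positive semidefinite. A quick check of those three properties, sketched on synthetic data rather than Iris so it runs without a download (the data and variable names are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 200 observations (rows) of 3 variables (columns)
X = rng.normal(size=(200, 3))
X[:, 1] += 0.8 * X[:, 0]   # make the second column covary with the first

S = np.cov(X, rowvar=False)   # 3 x 3 covariance matrix

# Cov(X_i, X_j) equals Cov(X_j, X_i)
is_symmetric = np.allclose(S, S.T)
# Cov(X_i, X_i) is the sample variance of X_i
diag_is_variance = np.allclose(np.diag(S), X.var(axis=0, ddof=1))
# All eigenvalues are non-negative (small tolerance for rounding)
eigenvalues = np.linalg.eigvalsh(S)
is_psd = bool(np.all(eigenvalues >= -1e-12))

print(is_symmetric, diag_is_variance, is_psd)
```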

Covariance vs Correlation

Feature	Covariance	Pearson Correlation
Range	(−∞, +∞)	[−1, +1]
Units	Units of X × Units of Y	Dimensionless
Scale-dependent?	Yes	No
Interpretable magnitude?	No	Yes

cov = np.cov(data['study_hours'], data['exam_score'])[0,1]
sx = data['study_hours'].std(ddof=1)
sy = data['exam_score'].std(ddof=1)
r_from_cov = cov / (sx * sy)
r_numpy = np.corrcoef(data['study_hours'], data['exam_score'])[0,1]

print(f"r from covariance formula: {r_from_cov:.6f}")
print(f"r from np.corrcoef:        {r_numpy:.6f}")
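The scale-dependence row of the table can be checked directly. In this sketch, study time is re-expressed in minutes instead of hours (the minutes array is an illustrative addition): the covariance grows by the conversion factor while the correlation stays put.

```python
import numpy as np

hours = np.array([2, 3, 5, 4, 6, 1, 7, 3, 5, 8], dtype=float)
scores = np.array([60, 65, 75, 70, 80, 55, 85, 65, 72, 90], dtype=float)
minutes = hours * 60   # same quantity, different unit

cov_hours = np.cov(hours, scores)[0, 1]
cov_minutes = np.cov(minutes, scores)[0, 1]
r_hours = np.corrcoef(hours, scores)[0, 1]
r_minutes = np.corrcoef(minutes, scores)[0, 1]

print(f"Covariance (hours):   {cov_hours:.4f}")
print(f"Covariance (minutes): {cov_minutes:.4f}")
print(f"Correlation (hours):   {r_hours:.6f}")
print(f"Correlation (minutes): {r_minutes:.6f}")
```

Changing the unit changes the covariance by a factor of 60 without changing the underlying relationship, which is why raw covariances are hard to compare across variable pairs.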

Covariance in Machine Learning

ML Application	Covariance Usage	Why
PCA	Covariance matrix → eigenvectors	Dimensionality reduction
Multicollinearity	High cov between features relative to their variances	Identify redundant features
Portfolio optimization	Asset covariance → risk	Financial ML
Feature selection	Near-zero cov with target → candidate for removal	Little linear signal (nonlinear dependence is still possible)
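For the portfolio row, the standard link between asset covariance and risk is that portfolio variance equals w' Σ w, where w holds the weights and Σ is the asset covariance matrix. A minimal sketch with a made-up two-asset covariance matrix and made-up weights:

```python
import numpy as np

# Hypothetical annualized covariance matrix for two assets
# (variances 0.04 and 0.09 on the diagonal, covariance 0.006 off it)
cov_assets = np.array([[0.04, 0.006],
                       [0.006, 0.09]])
weights = np.array([0.6, 0.4])   # hypothetical allocation, sums to 1

# Portfolio variance is the quadratic form w' * Sigma * w
portfolio_variance = weights @ cov_assets @ weights
portfolio_volatility = np.sqrt(portfolio_variance)

# Weighted average of the individual volatilities, for comparison
avg_volatility = weights @ np.sqrt(np.diag(cov_assets))

print(f"Portfolio variance:   {portfolio_variance:.5f}")
print(f"Portfolio volatility: {portfolio_volatility:.4f}")
print(f"Weighted avg vol:     {avg_volatility:.4f}")
```

Because the off-diagonal covariance is small, the portfolio volatility comes out below the weighted average of the two individual volatilities.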

import numpy as np
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data

# Covariance matrix
cov_matrix = np.cov(X.T)
print("Covariance matrix:")
print(cov_matrix.round(2))

# PCA directions are the eigenvectors of the covariance matrix
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
print(f"\nExplained variance ratio: {pca.explained_variance_ratio_}")
print(f"Total variance explained: {pca.explained_variance_ratio_.sum():.3f}")
print("PCA finds directions of maximum variance!")
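To make the link between PCA and the covariance matrix concrete, the sketch below computes component variances two ways with NumPy alone: from the eigenvalues of the covariance matrix, and from an SVD of the centered data (the route scikit-learn's PCA is documented as using). The synthetic data and variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data: 150 observations of 4 correlated features
X = rng.normal(size=(150, 4)) @ rng.normal(size=(4, 4))

Xc = X - X.mean(axis=0)            # center each feature
S = np.cov(Xc, rowvar=False)       # 4 x 4 covariance matrix

# Route 1: eigen-decomposition of the covariance matrix
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]  # sort by variance, largest first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Route 2: SVD of the centered data (already sorted largest first)
_, singular_values, _ = np.linalg.svd(Xc, full_matrices=False)
variances_from_svd = singular_values ** 2 / (len(X) - 1)

explained_ratio = eigvals / eigvals.sum()
print("Eigenvalues of covariance matrix:", eigvals.round(4))
print("Variances from SVD:              ", variances_from_svd.round(4))
print("Explained variance ratio:        ", explained_ratio.round(4))
```

Both routes give the same variances; the SVD route avoids forming the covariance matrix explicitly, which tends to be more stable numerically.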
