Scatter Plots
A scatter plot (scatter diagram) displays the relationship between two continuous variables by drawing each observation as a point, with one variable on each axis. It's the primary tool for exploring associations before computing correlations.
Reading a Scatter Plot
Look for four features:
- Direction: Positive (↗), Negative (↘), or No pattern
- Form: Linear, Curved (nonlinear), or Unclear
- Strength: How tightly points cluster around the pattern
- Outliers: Points far from the main cloud
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
np.random.seed(42)
n = 100
fig, axes = plt.subplots(2, 3, figsize=(14, 9))
# 1. Strong positive linear
x1 = np.random.uniform(0, 10, n)
y1 = 2 * x1 + np.random.normal(0, 1, n)
r1, _ = stats.pearsonr(x1, y1)
axes[0,0].scatter(x1, y1, alpha=0.6, color='steelblue')
axes[0,0].set_title(f'Strong Positive Linear\nr = {r1:.3f}')
# 2. Weak positive linear
y2 = 1.5 * x1 + np.random.normal(0, 4, n)
r2, _ = stats.pearsonr(x1, y2)
axes[0,1].scatter(x1, y2, alpha=0.6, color='coral')
axes[0,1].set_title(f'Weak Positive Linear\nr = {r2:.3f}')
# 3. Strong negative
y3 = -2 * x1 + 20 + np.random.normal(0, 1, n)
r3, _ = stats.pearsonr(x1, y3)
axes[0,2].scatter(x1, y3, alpha=0.6, color='mediumseagreen')
axes[0,2].set_title(f'Strong Negative Linear\nr = {r3:.3f}')
# 4. No relationship
y4 = np.random.normal(5, 3, n)
r4, _ = stats.pearsonr(x1, y4)
axes[1,0].scatter(x1, y4, alpha=0.6, color='orchid')
axes[1,0].set_title(f'No Linear Relationship\nr = {r4:.3f}')
# 5. Nonlinear (quadratic) — r is MISLEADING!
y5 = (x1 - 5)**2 + np.random.normal(0, 2, n)
r5, _ = stats.pearsonr(x1, y5)
axes[1,1].scatter(x1, y5, alpha=0.6, color='orange')
axes[1,1].set_title(f'Nonlinear (Quadratic)\nr = {r5:.3f} — MISLEADING!')
axes[1,1].annotate('Pearson r misses\ncurved patterns!', xy=(5, 1), fontsize=9,
                   color='red', ha='center')
# 6. Outlier effect
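y_main = x1 + np.random.normal(0, 0.5, n)  # tight linear pattern, no outlier
x6 = np.concatenate([x1, [9]])
y6 = np.concatenate([y_main, [0]])  # one point at (9, 0) breaks the pattern
r6, _ = stats.pearsonr(x6, y6)
# Compare against the same data without the outlier, not a fresh random draw
r6_no_out, _ = stats.pearsonr(x1, y_main)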
axes[1,2].scatter(x6[:-1], y6[:-1], alpha=0.6, color='steelblue', label='Main data')
axes[1,2].scatter(x6[-1], y6[-1], color='red', s=100, zorder=5, label='Outlier')
axes[1,2].set_title(f'Outlier Effect\nr = {r6:.3f} (was {r6_no_out:.3f})')
axes[1,2].legend()
plt.tight_layout()
plt.savefig('scatter_patterns.png', dpi=150)
plt.show()
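Panel 5 claims that Pearson r misses curved patterns. A minimal sketch of that claim follows, on freshly generated synthetic data rather than the arrays above. The seed, the sample size, and the names `r_linear` and `r2_quad` are illustrative assumptions. The linear correlation should come out near zero, while a degree-2 polynomial fit should explain most of the variance.

```python
import numpy as np

# Synthetic U-shaped data (seed and sample size are arbitrary choices)
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 200)
y = (x - 5) ** 2 + rng.normal(0, 2, 200)

# Pearson r only captures the linear component of the relationship
r_linear = np.corrcoef(x, y)[0, 1]

# Fit a quadratic and compute its coefficient of determination
coeffs = np.polyfit(x, y, deg=2)
y_hat = np.polyval(coeffs, x)
ss_res = np.sum((y - y_hat) ** 2)      # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)   # total sum of squares
r2_quad = 1 - ss_res / ss_tot

print(f"Pearson r          = {r_linear:.3f}")
print(f"R² (quadratic fit) = {r2_quad:.3f}")
```

A near-zero r here does not mean the variables are unrelated. It means only that a straight line is the wrong summary.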
Enhanced Scatter Plots
# Load real dataset
df = sns.load_dataset('tips')
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
# 1. Color by category
for sex, group in df.groupby('sex', observed=True):
    axes[0].scatter(group['total_bill'], group['tip'],
                    alpha=0.6, label=sex, s=50)
# Add regression line
slope, intercept, r, p, se = stats.linregress(df['total_bill'], df['tip'])
x_line = np.linspace(df['total_bill'].min(), df['total_bill'].max(), 100)
axes[0].plot(x_line, slope*x_line + intercept, 'k--', linewidth=2)
axes[0].set_title(f'Total Bill vs Tip (by Sex)\nr = {r:.3f}, p = {p:.4f}')
axes[0].set_xlabel('Total Bill ($)')
axes[0].set_ylabel('Tip ($)')
axes[0].legend()
# 2. Bubble chart (3rd variable = size)
axes[1].scatter(df['total_bill'], df['tip'],
                s=df['size'] * 30, alpha=0.5,
                c=df['size'], cmap='viridis')
axes[1].set_title('Bubble Chart: Size = Party Size')
axes[1].set_xlabel('Total Bill ($)')
axes[1].set_ylabel('Tip ($)')
# 3. Seaborn regplot (the axes-level counterpart of lmplot)
sns.regplot(data=df, x='total_bill', y='tip', ax=axes[2],
            scatter_kws={'alpha': 0.5}, line_kws={'color': 'red'})
axes[2].set_title('Regression with Confidence Band')
plt.tight_layout()
plt.savefig('scatter_enhanced.png', dpi=150)
plt.show()
# Pearson r and its interpretation
r, p = stats.pearsonr(df['total_bill'], df['tip'])
print(f"r = {r:.4f}, p = {p:.6f}")
print(f"r² = {r**2:.4f} ({r**2*100:.1f}% of variance in tip explained by bill)")
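The "variance explained" reading of r² can be checked directly. For a simple least-squares line, r² equals 1 − SS_res / SS_tot. The sketch below uses synthetic data so that it runs without downloading the tips dataset. The arrays `bill` and `tip` and the coefficients that generate them are hypothetical stand-ins, not values from the real data.

```python
import numpy as np
from scipy import stats

# Hypothetical bill/tip data; the slope, intercept, and noise level are made up
rng = np.random.default_rng(0)
bill = rng.uniform(5, 50, 150)
tip = 1.0 + 0.1 * bill + rng.normal(0, 1, 150)

fit = stats.linregress(bill, tip)
pred = fit.intercept + fit.slope * bill

ss_res = np.sum((tip - pred) ** 2)         # variation left over after the fit
ss_tot = np.sum((tip - tip.mean()) ** 2)   # total variation in tip
explained = 1 - ss_res / ss_tot

print(f"r²                 = {fit.rvalue ** 2:.4f}")
print(f"1 - SS_res/SS_tot  = {explained:.4f}")
```

The two printed values agree up to floating-point error. This identity holds for a straight-line fit with an intercept. It does not carry over to arbitrary models.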
Pair Plots (Scatter Plot Matrix)
# Visualize all pairwise relationships at once
iris = sns.load_dataset('iris')
pair_grid = sns.pairplot(iris, hue='species', plot_kws={'alpha': 0.5})
pair_grid.figure.suptitle('Iris Dataset: All Pairwise Scatter Plots', y=1.02)
plt.savefig('pair_plot.png', dpi=150, bbox_inches='tight')
plt.show()
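A pair plot is often read alongside a correlation matrix, which gives one number per panel. The sketch below builds one with `DataFrame.corr` on a small synthetic frame. The column names and the way the data is generated are assumptions for illustration, not part of the iris dataset.

```python
import numpy as np
import pandas as pd

# Synthetic frame: 'width' depends on 'length', 'noise' is unrelated
rng = np.random.default_rng(1)
n = 200
length = rng.normal(5, 1, n)
df_demo = pd.DataFrame({
    "length": length,
    "width": 0.5 * length + rng.normal(0, 0.2, n),
    "noise": rng.normal(0, 1, n),
})

# Pairwise Pearson correlations: one value per off-diagonal pair plot panel
corr = df_demo.corr(method="pearson")
print(corr.round(2))
```

The matrix is a summary, not a replacement for the plots. Each coefficient is still subject to the caveats about linearity and outliers.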
Key Takeaways
- Always plot before computing r: Anscombe's Quartet shows why (four datasets with nearly identical r, about 0.816, but wildly different scatter plots)
- Pearson r only measures linear association: it can miss curved patterns and clusters, and a single outlier can distort it
- Describe direction, form, strength, and any outliers when reading a scatter plot
- Color, size, and shape encodings add dimensions beyond x and y
- Regression lines summarize the trend, but always show the scatter alongside the line, never the line alone
- Outliers can dramatically inflate or deflate correlation coefficients
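The Anscombe's Quartet point can be verified numerically. The sketch below hard-codes the four published datasets and computes Pearson r for each. The names `datasets` and `r_values` are illustrative.

```python
import numpy as np

# Anscombe's quartet: datasets I-III share the same x values
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

datasets = {"I": (x123, y1), "II": (x123, y2), "III": (x123, y3), "IV": (x4, y4)}

# Pearson r for each dataset; all four come out close to 0.816
r_values = {}
for name, (x, y) in datasets.items():
    r_values[name] = np.corrcoef(x, y)[0, 1]
    print(f"Dataset {name:>3}: r = {r_values[name]:.3f}")
```

Plotting the four pairs shows what the coefficients hide: a linear cloud, a smooth curve, a line with one outlier, and a vertical stack of points with a single high-leverage point.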