Matplotlib + Seaborn: Visualization Grammar

Module 1: Foundations

The Grammar of Graphics

Data visualization is the graphical representation of information and data. A good visualization tells a story and reveals patterns that numbers alone cannot.

Architecture Diagram
Data → Aesthetics → Geometries → Statistics → Coordinates → Facets → Theme
  │        │            │            │            │           │         │
  └────────┴────────────┴────────────┴────────────┴───────────┴─────────┘
                          Grammar of Graphics

ℹ️ Why the Grammar of Graphics Matters

The grammar separates the what (data) from the how (geometries, aesthetics), allowing you to compose complex plots from simple, reusable components. Mastering this mental model makes learning any plotting library faster.

Matplotlib: The Foundation

Basic Plot Structure

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Figure anatomy
"""
┌─────────────────────────────────────┐
│              Figure                 │
│  ┌───────────────────────────────┐  │
│  │            Axes               │  │
│  │  ┌─────────────────────────┐  │  │
│  │  │      Plot Area          │  │  │
│  │  │   (lines, bars, etc.)   │  │  │
│  │  └─────────────────────────┘  │  │
│  │  X-axis    Title    Y-axis    │  │
│  └───────────────────────────────┘  │
│             Legend                  │
└─────────────────────────────────────┘
"""

# Basic line plot
x = np.linspace(0, 10, 100)
y = np.sin(x)

fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(x, y, label='sin(x)', color='blue', linewidth=2)
ax.set_xlabel('X Axis', fontsize=12)
ax.set_ylabel('Y Axis', fontsize=12)
ax.set_title('Basic Line Plot', fontsize=14)
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

💡 Object-Oriented vs pyplot Interface

Always prefer the object-oriented interface (fig, ax = plt.subplots()) over the state-based plt.plot() approach. The OO interface gives explicit control, works better in scripts, and avoids ambiguity when multiple figures exist.
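The ambiguity is easiest to see with two figures open at once. The following is a minimal sketch (the variable names fig_a, ax_a, fig_b, ax_b are chosen for illustration): the pyplot call draws on whichever axes happens to be current, while the method call draws on the axes it is invoked on.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)

# Two figures exist at once; fig_b was created last, so it is the "current" figure
fig_a, ax_a = plt.subplots()
fig_b, ax_b = plt.subplots()

plt.plot(x, np.sin(x))     # implicit target: lands on ax_b, the current axes
ax_a.plot(x, np.cos(x))    # explicit target: lands on ax_a regardless of creation order
ax_a.set_title("cos(x) on figure A")

print(len(ax_a.lines), len(ax_b.lines))  # 1 1
print(plt.gca() is ax_b)                 # True: calling methods on ax_a did not change the current axes
```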

Essential Plot Types

# 1. Line Plot (trends over time)
dates = pd.date_range('2024-01-01', periods=12)
values = np.random.randn(12).cumsum() + 100

fig, ax = plt.subplots(figsize=(12, 6))
ax.plot(dates, values, marker='o', linestyle='-', color='#2196F3')
ax.fill_between(dates, values - 10, values + 10, alpha=0.2)
ax.set_title('Stock Price Trend')
ax.set_xlabel('Date')
ax.set_ylabel('Price')
ax.tick_params(axis='x', rotation=45)  # rotate date labels via the Axes, not pyplot state
plt.tight_layout()
plt.show()

# 2. Bar Plot (comparisons)
categories = ['A', 'B', 'C', 'D', 'E']
values = [23, 45, 56, 78, 32]

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Vertical bars
axes[0].bar(categories, values, color=['#FF6B6B', '#4ECDC4', '#45B7D1', '#96CEB4', '#FFEAA7'])
axes[0].set_title('Vertical Bar Plot')

# Horizontal bars
axes[1].barh(categories, values, color=['#FF6B6B', '#4ECDC4', '#45B7D1', '#96CEB4', '#FFEAA7'])
axes[1].set_title('Horizontal Bar Plot')

plt.tight_layout()
plt.show()

# 3. Histogram (distributions)
data = np.random.normal(0, 1, 1000)

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Basic histogram
axes[0].hist(data, bins=30, edgecolor='black', alpha=0.7)
axes[0].set_title('Basic Histogram')

# Cumulative histogram
axes[1].hist(data, bins=30, cumulative=True, edgecolor='black', alpha=0.7)
axes[1].set_title('Cumulative Histogram')

plt.tight_layout()
plt.show()

# 4. Scatter Plot (relationships)
x = np.random.randn(100)
y = x * 2 + np.random.randn(100) * 0.5
colors = np.random.rand(100)
sizes = np.random.rand(100) * 200

plt.figure(figsize=(10, 6))
scatter = plt.scatter(x, y, c=colors, s=sizes, alpha=0.6, cmap='viridis')
plt.colorbar(scatter)
plt.title('Scatter Plot with Color and Size')
plt.xlabel('X')
plt.ylabel('Y')
plt.show()

# 5. Box Plot (distributions and outliers)
data = [np.random.normal(0, std, 100) for std in range(1, 4)]

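fig, ax = plt.subplots(figsize=(10, 6))
ax.boxplot(data)
# Set tick labels separately: the labels= keyword was renamed tick_labels in Matplotlib 3.9
ax.set_xticklabels(['Group 1', 'Group 2', 'Group 3'])
ax.set_title('Box Plot Comparison')
ax.set_ylabel('Values')
plt.show()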

# 6. Pie Chart (proportions)
sizes = [35, 25, 20, 15, 5]
labels = ['Product A', 'Product B', 'Product C', 'Product D', 'Other']
colors = ['#FF6B6B', '#4ECDC4', '#45B7D1', '#96CEB4', '#FFEAA7']
explode = (0.1, 0, 0, 0, 0)  # Explode first slice

plt.figure(figsize=(8, 8))
plt.pie(sizes, explode=explode, labels=labels, colors=colors,
        autopct='%1.1f%%', shadow=True, startangle=90)
plt.title('Market Share')
plt.show()

ℹ️ Choosing the Right Plot

  • Line: Trends over continuous variables (time series)
  • Bar: Comparing discrete categories
  • Histogram: Distribution of a single numerical variable
  • Scatter: Relationship between two numerical variables
  • Box: Distribution summary with outliers (five-number summary)
  • Pie: Parts of a whole (use sparingly; bar charts are usually better)

Subplots and Layouts

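# Multiple subplots
# x and data were reassigned by earlier examples (random samples, a list of arrays),
# so build fresh inputs for the line plot and the histogram
x_line = np.linspace(0, 10, 100)
hist_data = np.random.normal(0, 1, 1000)

fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Plot 1: Line
axes[0, 0].plot(x_line, np.sin(x_line), 'b-', label='sin')
axes[0, 0].plot(x_line, np.cos(x_line), 'r--', label='cos')
axes[0, 0].set_title('Trigonometric Functions')
axes[0, 0].legend()

# Plot 2: Bar
axes[0, 1].bar(categories, values, color='skyblue')
axes[0, 1].set_title('Bar Chart')

# Plot 3: Scatter (x and y are the random samples from the scatter example)
axes[1, 0].scatter(x[:50], y[:50], c='green', alpha=0.6)
axes[1, 0].set_title('Scatter Plot')

# Plot 4: Histogram (a single array, so a single color is valid)
axes[1, 1].hist(hist_data, bins=20, color='orange', edgecolor='black')
axes[1, 1].set_title('Histogram')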

plt.tight_layout()
plt.show()

# Complex layout with GridSpec
import matplotlib.gridspec as gridspec

fig = plt.figure(figsize=(14, 10))
gs = gridspec.GridSpec(3, 3, figure=fig)

# Sorted x values for the line plots (x currently holds unsorted random samples)
x_line = np.linspace(0, 10, 100)

# Large plot spanning 2 rows, 2 columns
ax_main = fig.add_subplot(gs[0:2, 0:2])
ax_main.plot(x_line, np.sin(x_line), 'b-')
ax_main.set_title('Main Plot')

# Side plots
ax_right1 = fig.add_subplot(gs[0, 2])
ax_right1.barh(categories[:3], values[:3])

ax_right2 = fig.add_subplot(gs[1, 2])
ax_right2.pie(sizes[:3], labels=labels[:3])

# Bottom plot
ax_bottom = fig.add_subplot(gs[2, :])
ax_bottom.plot(x_line, np.sin(x_line) * 100, 'r-')
ax_bottom.set_title('Bottom Plot')

plt.tight_layout()
plt.show()

Seaborn: Statistical Visualization

Distribution Plots

import seaborn as sns

# Set theme
sns.set_theme(style="whitegrid")

# Load example dataset
tips = sns.load_dataset('tips')
iris = sns.load_dataset('iris')

# 1. Histogram with KDE
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

sns.histplot(data=tips, x='total_bill', kde=True, ax=axes[0])
axes[0].set_title('Histogram with KDE')

sns.histplot(data=tips, x='total_bill', hue='time', kde=True, ax=axes[1])
axes[1].set_title('Histogram by Time')

plt.tight_layout()
plt.show()

# 2. KDE Plot (Kernel Density Estimation)
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

sns.kdeplot(data=tips, x='total_bill', fill=True, ax=axes[0])
axes[0].set_title('KDE Plot')

sns.kdeplot(data=tips, x='total_bill', hue='day', fill=True, ax=axes[1])
axes[1].set_title('KDE by Day')

plt.tight_layout()
plt.show()

# 3. ECDF Plot (Empirical CDF)
plt.figure(figsize=(10, 6))
sns.ecdfplot(data=tips, x='total_bill', hue='time')
plt.title('Empirical CDF')
plt.show()

# 4. Rug Plot
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

sns.rugplot(data=tips, x='total_bill', ax=axes[0])
axes[0].set_title('Rug Plot')

sns.scatterplot(data=tips, x='total_bill', y='tip', ax=axes[1])
sns.rugplot(data=tips, x='total_bill', y='tip', ax=axes[1])
axes[1].set_title('Scatter with Rugs')

plt.tight_layout()
plt.show()

Categorical Plots

# 1. Box Plot
plt.figure(figsize=(12, 6))
sns.boxplot(data=tips, x='day', y='total_bill', hue='sex')
plt.title('Total Bill by Day and Gender')
plt.show()

# 2. Violin Plot
plt.figure(figsize=(12, 6))
sns.violinplot(data=tips, x='day', y='total_bill', hue='sex', split=True)
plt.title('Violin Plot')
plt.show()

# 3. Swarm Plot
plt.figure(figsize=(12, 6))
sns.swarmplot(data=tips, x='day', y='total_bill', hue='sex', size=4)
plt.title('Swarm Plot')
plt.show()

# 4. Strip Plot
plt.figure(figsize=(12, 6))
sns.stripplot(data=tips, x='day', y='total_bill', hue='sex', jitter=True)
plt.title('Strip Plot')
plt.show()

# 5. Bar Plot
plt.figure(figsize=(12, 6))
sns.barplot(data=tips, x='day', y='total_bill', hue='sex', errorbar='sd')
plt.title('Bar Plot with Error Bars')
plt.show()

# 6. Count Plot
plt.figure(figsize=(10, 5))
sns.countplot(data=tips, x='day', hue='sex')
plt.title('Count Plot')
plt.show()

# 7. Point Plot
plt.figure(figsize=(10, 5))
sns.pointplot(data=tips, x='day', y='total_bill', hue='sex')
plt.title('Point Plot')
plt.show()

💡 Box vs Violin vs Swarm

  • Box plot: Compact summary (median, quartiles, outliers). Best for comparing distributions across groups.
  • Violin plot: Combines box plot with KDE. Reveals multimodal distributions that box plots hide.
  • Swarm plot: Shows every data point without overlap. Best for small datasets (< 1000 points).
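To see the difference in practice, here is a small sketch on synthetic data (the column names value and shape are assumptions for this example). One group is unimodal and the other bimodal; the box plot gives no direct view of the two clusters, while the violin and swarm plots make them visible.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(2)
unimodal = rng.normal(0, 2, 200)
bimodal = np.concatenate([rng.normal(-2, 0.6, 100), rng.normal(2, 0.6, 100)])
df = pd.DataFrame({
    "value": np.concatenate([unimodal, bimodal]),
    "shape": ["unimodal"] * 200 + ["bimodal"] * 200,
})

# Same data, three categorical plot types side by side
fig, axes = plt.subplots(1, 3, figsize=(14, 4), sharey=True)
sns.boxplot(data=df, x="shape", y="value", ax=axes[0])
sns.violinplot(data=df, x="shape", y="value", ax=axes[1])
sns.swarmplot(data=df, x="shape", y="value", size=3, ax=axes[2])
for ax, name in zip(axes, ["Box", "Violin", "Swarm"]):
    ax.set_title(name)
fig.tight_layout()

# Call plt.show() to display the figure interactively.
```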

Relationship Plots

# 1. Scatter Plot with Regression
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

sns.scatterplot(data=tips, x='total_bill', y='tip', ax=axes[0])
axes[0].set_title('Basic Scatter')

sns.regplot(data=tips, x='total_bill', y='tip', ax=axes[1])
axes[1].set_title('Scatter with Regression')

plt.tight_layout()
plt.show()

# 2. Joint Plot
g = sns.jointplot(data=tips, x='total_bill', y='tip', kind='scatter')
g.fig.suptitle('Joint Plot', y=1.02)  # y > 1 keeps the title clear of the top marginal plot
plt.show()

# Different kinds
sns.jointplot(data=tips, x='total_bill', y='tip', kind='kde')
plt.show()

sns.jointplot(data=tips, x='total_bill', y='tip', kind='hex')
plt.show()

# 3. Pair Plot (matrix of relationships)
g = sns.pairplot(iris, hue='species')
g.fig.suptitle('Pair Plot - Iris Dataset', y=1.02)  # y > 1 keeps the title above the top row
plt.show()

# 4. Heatmap (correlation matrix)
plt.figure(figsize=(10, 8))
corr = tips.select_dtypes(include=[np.number]).corr()
sns.heatmap(corr, annot=True, cmap='coolwarm', center=0, 
            square=True, linewidths=0.5)
plt.title('Correlation Heatmap')
plt.show()

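# 5. Clustermap
# standard_scale=1 rescales each column to [0, 1], so the annotations show scaled values
g = sns.clustermap(corr, annot=True, cmap='coolwarm',
                   standard_scale=1, figsize=(10, 8))
# clustermap builds its own figure, so title the figure rather than the current axes
g.fig.suptitle('Clustermap', y=1.02)
plt.show()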

Pearson Correlation Coefficient

r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \cdot \sum_{i=1}^{n}(y_i - \bar{y})^2}}

Here,

  • r = Pearson correlation coefficient (-1 to 1)
  • x̄, ȳ = Sample means of x and y
  • n = Number of data points
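The formula can be checked term by term in NumPy. This is a minimal sketch: pearson_r is a helper name introduced here for illustration, and the data is synthetic. The result is compared against np.corrcoef, which computes the same coefficient.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation computed directly from the definition."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()   # deviations from the sample mean of x
    dy = y - y.mean()   # deviations from the sample mean of y
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Synthetic, positively correlated data
rng = np.random.default_rng(3)
x = rng.normal(size=200)
y = 2 * x + rng.normal(scale=0.5, size=200)

r_manual = pearson_r(x, y)
r_numpy = np.corrcoef(x, y)[0, 1]
print(np.isclose(r_manual, r_numpy))  # True
```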

Multi-Plot Grids

# FacetGrid for complex layouts
g = sns.FacetGrid(tips, col='time', row='sex', height=4, aspect=1.2)
g.map_dataframe(sns.histplot, x='total_bill', bins=15)
g.set_titles('{row_name} - {col_name}')
plt.show()

# PairGrid
g = sns.PairGrid(iris, hue='species')
g.map_upper(sns.scatterplot)
g.map_lower(sns.kdeplot)
g.map_diag(sns.histplot)
g.add_legend()
plt.show()

# JointGrid
g = sns.JointGrid(data=tips, x='total_bill', y='tip')
g.plot_joint(sns.scatterplot)
g.plot_marginals(sns.histplot)
plt.show()

Customization and Themes

# Available themes
themes = ['darkgrid', 'whitegrid', 'dark', 'white', 'ticks']

# Set theme
sns.set_theme(style="whitegrid", palette="muted")

# Custom color palettes
palette = sns.color_palette("husl", 10)
sns.set_palette(palette)

# Custom styling
plt.rcParams.update({
    'figure.figsize': (10, 6),
    'font.size': 12,
    'axes.titlesize': 14,
    'axes.labelsize': 12,
    'xtick.labelsize': 10,
    'ytick.labelsize': 10,
    'legend.fontsize': 10,
    'figure.dpi': 100,
    'savefig.dpi': 300,
    'savefig.bbox': 'tight'
})

# Plotting context (temporarily scales fonts and line widths)
with sns.plotting_context("notebook"):
    sns.lineplot(data=tips, x='total_bill', y='tip')
    plt.title('With Notebook Context')
    plt.show()

# Color codes
colors = {
    'red': '#FF6B6B',
    'green': '#4ECDC4',
    'blue': '#45B7D1',
    'yellow': '#FFEAA7',
    'purple': '#9B59B6'
}

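# Use in plots: one color per day, passed as a list and tied to hue
day_colors = list(colors.values())[:4]
plt.figure(figsize=(10, 6))
sns.barplot(data=tips, x='day', y='total_bill', hue='day',
            palette=day_colors, dodge=False)
plt.show()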

Publication-Quality Visualizations

def create_publication_plot(data, x, y, hue=None, title="", filename=None):
    """Create publication-quality visualization"""
    
    # Set style
    sns.set_theme(style="whitegrid", context="paper")
    
    # Create figure
    fig, ax = plt.subplots(figsize=(8, 6))
    
    # Plot (hue=None draws a single ungrouped series)
    sns.scatterplot(data=data, x=x, y=y, hue=hue, s=100, alpha=0.7, ax=ax)
    
    # Customize
    ax.set_title(title, fontsize=14, fontweight='bold', pad=20)
    ax.set_xlabel(x.replace('_', ' ').title(), fontsize=12)
    ax.set_ylabel(y.replace('_', ' ').title(), fontsize=12)
    
    # Remove top and right spines
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    
    # Add grid
    ax.grid(True, alpha=0.3, linestyle='--')
    
    plt.tight_layout()
    
    if filename:
        plt.savefig(f'{filename}.png', dpi=300, bbox_inches='tight')
        plt.savefig(f'{filename}.pdf', bbox_inches='tight')
    
    plt.show()

# Usage
create_publication_plot(tips, 'total_bill', 'tip', 'day', 
                       'Tip vs Total Bill', 'tip_vs_bill')

\text{Data-Ink Ratio} = \frac{\text{Ink used for data}}{\text{Total ink used in graphic}}

💡 Tufte's Principles for Publication Figures

Edward Tufte's principles: (1) maximize the data-ink ratio by removing chartjunk; (2) avoid redundant data representations; (3) prefer small multiples over 3D effects; (4) label data directly rather than through legends when possible.
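As one possible reading of principles (1) and (4), the sketch below labels each line directly at its right-hand end instead of using a legend, and removes the top and right spines and the grid. The data and styling choices are illustrative assumptions, not a prescribed recipe.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)
series = {"sin(x)": np.sin(x), "cos(x)": np.cos(x)}

fig, ax = plt.subplots(figsize=(8, 4))
for name, y in series.items():
    (line,) = ax.plot(x, y, linewidth=1.5)
    # Direct label at the end of the line, in the line's own color, replacing a legend
    ax.annotate(name, xy=(x[-1], y[-1]), xytext=(5, 0),
                textcoords="offset points", va="center",
                color=line.get_color())

# Remove non-data ink: top/right spines and the grid
for side in ("top", "right"):
    ax.spines[side].set_visible(False)
ax.grid(False)
fig.tight_layout()

# Call plt.show() to display the figure interactively.
```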

Practical Example: Sales Dashboard

# Create comprehensive dashboard
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

# Simulate data
np.random.seed(42)
dates = pd.date_range('2024-01-01', '2024-12-31', freq='D')
products = ['Product A', 'Product B', 'Product C']
regions = ['North', 'South', 'East', 'West']

data = {
    'date': np.random.choice(dates, 500),
    'product': np.random.choice(products, 500),
    'region': np.random.choice(regions, 500),
    'sales': np.random.randint(100, 1000, 500),
    'quantity': np.random.randint(1, 50, 500)
}
df = pd.DataFrame(data)
df['revenue'] = df['sales'] * df['quantity']

# Create dashboard
fig = plt.figure(figsize=(16, 12))
gs = fig.add_gridspec(3, 3, hspace=0.3, wspace=0.3)

# 1. Sales Trend
ax1 = fig.add_subplot(gs[0, :2])
daily_sales = df.groupby('date')['revenue'].sum()
ax1.plot(daily_sales.index, daily_sales.values, color='#2196F3', linewidth=1)
ax1.fill_between(daily_sales.index, daily_sales.values, alpha=0.2)
ax1.set_title('Daily Revenue Trend', fontweight='bold')
ax1.set_ylabel('Revenue ($)')

# 2. Revenue by Product (Pie)
ax2 = fig.add_subplot(gs[0, 2])
product_revenue = df.groupby('product')['revenue'].sum()
ax2.pie(product_revenue, labels=product_revenue.index, autopct='%1.1f%%',
        colors=['#FF6B6B', '#4ECDC4', '#45B7D1'])
ax2.set_title('Revenue by Product')

# 3. Regional Performance
ax3 = fig.add_subplot(gs[1, :2])
region_product = df.groupby(['region', 'product'])['revenue'].sum().unstack()
region_product.plot(kind='bar', ax=ax3, colormap='Set2')
ax3.set_title('Revenue by Region & Product')
ax3.set_xlabel('Region')
ax3.set_ylabel('Revenue ($)')
ax3.legend(title='Product')

# 4. Sales Distribution
ax4 = fig.add_subplot(gs[1, 2])
sns.histplot(data=df, x='sales', kde=True, ax=ax4, color='#45B7D1')
ax4.set_title('Sales Distribution')

# 5. Top Performing Days
ax5 = fig.add_subplot(gs[2, :2])
top_days = df.groupby('date')['revenue'].sum().nlargest(10)
top_days.plot(kind='barh', ax=ax5, color='#4ECDC4')
ax5.set_title('Top 10 Revenue Days')
ax5.set_xlabel('Revenue ($)')

# 6. Summary Statistics
ax6 = fig.add_subplot(gs[2, 2])
ax6.axis('off')
summary_text = f"""
Summary Statistics
─────────────────
Total Revenue: ${df['revenue'].sum():,.0f}
Average Daily: ${daily_sales.mean():,.0f}
Best Day: ${daily_sales.max():,.0f}
Worst Day: ${daily_sales.min():,.0f}
─────────────────
Total Transactions: {len(df)}
Avg Transaction: ${df['revenue'].mean():,.0f}
"""
ax6.text(0.1, 0.5, summary_text, transform=ax6.transAxes,
         fontsize=11, verticalalignment='center',
         bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

plt.suptitle('Sales Dashboard 2024', fontsize=16, fontweight='bold', y=1.02)
plt.show()

📝 Worked Example: Choosing the Right Visualization

Goal: Compare monthly revenue across 3 product lines over 12 months.

Analysis:

  • 1 continuous variable (revenue) over time → line chart
  • 3 categories (products) → multiple lines with distinct colors
  • Need trend clarity → add markers and fill_between for uncertainty bands

Decision: Multi-series line chart with:

  • X-axis: month (continuous)
  • Y-axis: revenue (continuous)
  • Color encoding: product category (discrete)
  • Shaded region: ±1 std dev uncertainty band
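The decision above could be sketched as follows. The revenue figures are simulated, and the band is assumed here to be the standard deviation of daily revenue within each month, since the example does not say what the ±1 std dev is computed over.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
dates = pd.date_range("2024-01-01", "2024-12-31", freq="D")
products = ["Product A", "Product B", "Product C"]

# Simulated daily revenue per product (values are made up for illustration)
daily = pd.DataFrame(
    {name: 1000 + 300 * i + rng.normal(0, 100, len(dates))
     for i, name in enumerate(products)},
    index=dates,
)

# Monthly mean and standard deviation of daily revenue for each product
monthly = daily.groupby(daily.index.month).agg(["mean", "std"])

fig, ax = plt.subplots(figsize=(10, 5))
for name in products:
    mean = monthly[(name, "mean")]
    std = monthly[(name, "std")]
    ax.plot(monthly.index, mean, marker="o", label=name)                # one line per product
    ax.fill_between(monthly.index, mean - std, mean + std, alpha=0.2)  # +/- 1 std band

ax.set_xticks(range(1, 13))
ax.set_xlabel("Month")
ax.set_ylabel("Revenue ($)")
ax.set_title("Monthly Revenue by Product")
ax.legend(title="Product")
fig.tight_layout()

# Call plt.show() to display the figure interactively.
```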

Key Takeaways

📋 Summary: Matplotlib & Seaborn Visualization

  1. Matplotlib provides full control via the object-oriented interface (fig, ax); prefer it over plt.plot() for production code
  2. Seaborn is built on Matplotlib and excels at statistical plots: it handles grouping, aggregation, and estimation automatically
  3. Grammar of Graphics teaches you to decompose any plot into data, aesthetics, geometries, statistics, coordinates, facets, and themes
  4. KDE provides a smooth estimate of the probability density function; bandwidth selection is critical
  5. Choose the right plot for your data type and question (see the plot selection guide above)
  6. Publication quality: maximize data-ink ratio, remove chartjunk, use direct labels, save at 300+ DPI
  7. Customization: themes (sns.set_theme), palettes (sns.color_palette), and plt.rcParams give full control over appearance

Practice Exercise

  1. Create a multi-panel figure with 4 different plot types using plt.subplots(2, 2)
  2. Customize colors, fonts, and layout using sns.set_theme and plt.rcParams
  3. Build a Seaborn FacetGrid that facets a dataset by two categorical variables
  4. Compute and visualize a correlation heatmap with annotations
  5. Create a publication-quality scatter plot with regression line, removing top/right spines
  6. Export your final figure in both PNG (300 DPI) and PDF formats
  7. Build a mini-dashboard with at least 5 panels summarizing a dataset
