SQL for Data Scientists

Why SQL for Data Scientists?

SQL (Structured Query Language) is the universal language for data retrieval. While Python and R handle analysis, SQL is how you get data from databases.

Architecture Diagram

Data Sources          Data Warehouse         Data Science
┌──────────┐         ┌──────────────┐       ┌────────────┐
│ PostgreSQL│────────→│              │──────→│  Analysis  │
│ MySQL     │────────→│  SQL Queries │──────→│  Modeling  │
│ BigQuery  │────────→│              │──────→│  Viz       │
│ Snowflake │────────→│              │──────→│            │
└──────────┘         └──────────────┘       └────────────┘

ℹ️ Why SQL Before Python?

Most enterprise data lives in relational databases. SQL is optimized for filtering, aggregating, and joining millions of rows server-side — far more efficient than loading everything into Python. Master SQL first, then use Python for advanced analytics that SQL cannot do.

SQL Basics

SELECT Statements

-- Basic SELECT
SELECT * FROM customers;

-- Select specific columns
SELECT customer_id, name, email 
FROM customers;

-- Limit results
SELECT * FROM customers 
LIMIT 10;

-- Unique values
SELECT DISTINCT city 
FROM customers;

-- Aliases
SELECT 
    name AS customer_name,
    email AS contact_email
FROM customers;

Filtering with WHERE

-- Comparison operators
SELECT * FROM orders 
WHERE amount > 100;

SELECT * FROM orders 
WHERE status = 'completed';

SELECT * FROM customers 
WHERE age >= 18 AND age <= 65;

-- BETWEEN (inclusive)
SELECT * FROM orders 
WHERE order_date BETWEEN '2024-01-01' AND '2024-12-31';

-- IN
SELECT * FROM customers 
WHERE city IN ('New York', 'Los Angeles', 'Chicago');

-- LIKE (pattern matching)
SELECT * FROM customers 
WHERE name LIKE 'J%';          -- Starts with J

SELECT * FROM customers 
WHERE email LIKE '%@gmail.com'; -- Gmail users

-- NULL handling
SELECT * FROM customers 
WHERE phone IS NOT NULL;

SELECT * FROM customers 
WHERE phone IS NULL;

💡 NULL Semantics

NULL is not a value — it represents missing or unknown data. Comparisons with NULL always return NULL (unknown), not TRUE or FALSE. Always use IS NULL / IS NOT NULL instead of = or !=. Be aware: NULL = NULL evaluates to NULL, not TRUE.

Sorting and Limiting

-- ORDER BY
SELECT * FROM orders 
ORDER BY amount DESC;

-- Multiple columns
SELECT * FROM customers 
ORDER BY last_name ASC, first_name ASC;

-- LIMIT and OFFSET (pagination)
SELECT * FROM orders 
ORDER BY order_date DESC 
LIMIT 10 OFFSET 20;  -- Get rows 21-30

Aggregation Functions

-- COUNT
SELECT COUNT(*) FROM customers;
SELECT COUNT(DISTINCT city) FROM customers;

-- SUM, AVG, MIN, MAX
SELECT 
    SUM(amount) AS total_revenue,
    AVG(amount) AS avg_order_value,
    MIN(amount) AS min_order,
    MAX(amount) AS max_order
FROM orders;

-- ROUND
SELECT 
    ROUND(AVG(amount), 2) AS avg_amount
FROM orders;

-- COUNT with conditions
SELECT 
    COUNT(*) AS total_orders,
    COUNT(CASE WHEN status = 'completed' THEN 1 END) AS completed,
    COUNT(CASE WHEN status = 'pending' THEN 1 END) AS pending
FROM orders;

GROUP BY

-- Basic GROUP BY
SELECT 
    city,
    COUNT(*) AS customer_count
FROM customers
GROUP BY city;

-- Multiple columns
SELECT 
    city,
    status,
    COUNT(*) AS order_count,
    SUM(amount) AS total_amount
FROM orders
GROUP BY city, status;

-- GROUP BY with HAVING
SELECT 
    city,
    COUNT(*) AS customer_count
FROM customers
GROUP BY city
HAVING COUNT(*) > 10;

-- Common aggregations
SELECT 
    DATE_TRUNC('month', order_date) AS month,
    COUNT(*) AS order_count,
    SUM(amount) AS revenue,
    AVG(amount) AS avg_order_value
FROM orders
GROUP BY DATE_TRUNC('month', order_date)
ORDER BY month;

ℹ️ WHERE vs HAVING

WHERE filters rows before grouping — cannot use aggregate functions
HAVING filters groups after grouping — can use aggregate functions
Execution order: FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY → LIMIT

JOIN Operations

Types of JOINs

-- INNER JOIN (only matching rows)
SELECT 
    c.customer_id,
    c.name,
    o.order_id,
    o.amount
FROM customers c
INNER JOIN orders o ON c.customer_id = o.customer_id;

-- LEFT JOIN (all from left, matching from right)
SELECT 
    c.customer_id,
    c.name,
    o.order_id,
    o.amount
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id;

-- RIGHT JOIN (all from right, matching from left)
SELECT 
    c.customer_id,
    c.name,
    o.order_id,
    o.amount
FROM customers c
RIGHT JOIN orders o ON c.customer_id = o.customer_id;

-- FULL OUTER JOIN (all from both)
SELECT 
    c.customer_id,
    c.name,
    o.order_id,
    o.amount
FROM customers c
FULL OUTER JOIN orders o ON c.customer_id = o.customer_id;

-- SELF JOIN
SELECT 
    e.name AS employee,
    m.name AS manager
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.employee_id;

💡 JOIN Types Visualized

INNER: Only rows with matches in both tables (intersection)
LEFT: All rows from left table + matched rows from right (NULLs where no match)
RIGHT: All rows from right table + matched rows from left
FULL OUTER: All rows from both tables (NULLs where no match on either side)
CROSS: Every combination of rows from both tables (Cartesian product)

JOIN Examples

-- Multi-table JOIN
SELECT 
    c.name AS customer_name,
    o.order_date,
    p.product_name,
    oi.quantity,
    oi.unit_price
FROM customers c
INNER JOIN orders o ON c.customer_id = o.customer_id
INNER JOIN order_items oi ON o.order_id = oi.order_id
INNER JOIN products p ON oi.product_id = p.product_id;

-- JOIN with aggregation
SELECT 
    c.customer_name,
    SUM(o.amount) AS total_spent,
    COUNT(o.order_id) AS order_count
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_name
ORDER BY total_spent DESC;

-- Cohort analysis
SELECT 
    DATE_TRUNC('month', first_order) AS cohort_month,
    COUNT(DISTINCT customer_id) AS cohort_size,
    SUM(CASE WHEN order_date >= first_order + INTERVAL '30 days' 
        THEN 1 ELSE 0 END) AS retained_after_30d
FROM (
    SELECT 
        customer_id,
        order_date,
        MIN(order_date) OVER (PARTITION BY customer_id) AS first_order
    FROM orders
) customer_orders
GROUP BY DATE_TRUNC('month', first_order);

Window Functions

Window functions perform calculations across rows related to the current row.

-- ROW_NUMBER
SELECT 
    name,
    department,
    salary,
    ROW_NUMBER() OVER (ORDER BY salary DESC) AS salary_rank
FROM employees;

-- RANK and DENSE_RANK
SELECT 
    name,
    department,
    salary,
    RANK() OVER (ORDER BY salary DESC) AS rank_with_gaps,
    DENSE_RANK() OVER (ORDER BY salary DESC) AS rank_no_gaps
FROM employees;

-- Partition by department
SELECT 
    name,
    department,
    salary,
    RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS dept_rank
FROM employees;

-- Running totals
SELECT 
    order_date,
    amount,
    SUM(amount) OVER (ORDER BY order_date) AS running_total
FROM orders;

-- Moving average
SELECT 
    order_date,
    amount,
    AVG(amount) OVER (
        ORDER BY order_date 
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS moving_avg_7d
FROM orders;

-- LAG and LEAD
SELECT 
    order_date,
    amount,
    LAG(amount, 1) OVER (ORDER BY order_date) AS prev_amount,
    LEAD(amount, 1) OVER (ORDER BY order_date) AS next_amount,
    amount - LAG(amount, 1) OVER (ORDER BY order_date) AS change
FROM orders;

-- Percentiles
SELECT 
    name,
    salary,
    NTILE(4) OVER (ORDER BY salary) AS salary_quartile,
    PERCENT_RANK() OVER (ORDER BY salary) AS percentile_rank
FROM employees;

ℹ️ ROW_NUMBER vs RANK vs DENSE_RANK

ROW_NUMBER: Unique sequential integers (1, 2, 3, 4) — no ties
RANK: Gaps for ties (1, 2, 2, 4) — if two rows tie for 2nd, next is 4th
DENSE_RANK: No gaps for ties (1, 2, 2, 3) — if two rows tie for 2nd, next is 3rd Choose based on whether you need to break ties and whether gaps matter.

Subqueries and CTEs

Subqueries

-- Scalar subquery
SELECT 
    name,
    salary,
    salary - (SELECT AVG(salary) FROM employees) AS diff_from_avg
FROM employees;

-- IN subquery
SELECT * FROM customers
WHERE customer_id IN (
    SELECT customer_id FROM orders
    WHERE amount > 1000
);

-- EXISTS
SELECT * FROM customers c
WHERE EXISTS (
    SELECT 1 FROM orders o 
    WHERE o.customer_id = c.customer_id 
    AND o.amount > 1000
);

-- Correlated subquery
SELECT 
    name,
    salary,
    (SELECT COUNT(*) FROM employees e2 
     WHERE e2.department = e1.department 
     AND e2.salary > e1.salary) + 1 AS dept_salary_rank
FROM employees e1;

💡 IN vs EXISTS

IN: Best when the subquery result set is small. Evaluates the subquery once.
EXISTS: Best when the outer table is small. Evaluates the subquery once per outer row and short-circuits on first match.
For large subquery results, EXISTS is often faster because it can stop as soon as a match is found.

Common Table Expressions (CTEs)

-- Basic CTE
WITH customer_stats AS (
    SELECT 
        customer_id,
        COUNT(*) AS order_count,
        SUM(amount) AS total_spent
    FROM orders
    GROUP BY customer_id
)
SELECT 
    c.name,
    cs.order_count,
    cs.total_spent
FROM customers c
JOIN customer_stats cs ON c.customer_id = cs.customer_id
WHERE cs.total_spent > 1000;

-- Multiple CTEs
WITH monthly_revenue AS (
    SELECT 
        DATE_TRUNC('month', order_date) AS month,
        SUM(amount) AS revenue
    FROM orders
    GROUP BY 1
),
revenue_growth AS (
    SELECT 
        month,
        revenue,
        LAG(revenue) OVER (ORDER BY month) AS prev_month_revenue,
        (revenue - LAG(revenue) OVER (ORDER BY month)) / 
            LAG(revenue) OVER (ORDER BY month) * 100 AS growth_rate
    FROM monthly_revenue
)
SELECT * FROM revenue_growth
WHERE growth_rate > 10;

-- Recursive CTE (for hierarchical data)
WITH RECURSIVE org_chart AS (
    -- Base case
    SELECT 
        employee_id,
        name,
        manager_id,
        1 AS level
    FROM employees
    WHERE manager_id IS NULL
    
    UNION ALL
    
    -- Recursive case
    SELECT 
        e.employee_id,
        e.name,
        e.manager_id,
        oc.level + 1
    FROM employees e
    INNER JOIN org_chart oc ON e.manager_id = oc.employee_id
)
SELECT * FROM org_chart
ORDER BY level, name;

Advanced SQL Techniques

PIVOT and UNPIVOT

-- PIVOT (using CASE)
SELECT 
    customer_id,
    SUM(CASE WHEN EXTRACT(MONTH FROM order_date) = 1 THEN amount END) AS jan,
    SUM(CASE WHEN EXTRACT(MONTH FROM order_date) = 2 THEN amount END) AS feb,
    SUM(CASE WHEN EXTRACT(MONTH FROM order_date) = 3 THEN amount END) AS mar
FROM orders
GROUP BY customer_id;

-- Dynamic PIVOT (PostgreSQL)
-- Using crosstab function
CREATE EXTENSION IF NOT EXISTS tablefunc;

SELECT * FROM crosstab(
    'SELECT customer_id, 
            EXTRACT(MONTH FROM order_date) AS month, 
            SUM(amount)
     FROM orders
     GROUP BY 1, 2
     ORDER BY 1, 2'
) AS ct(customer_id INT, jan DECIMAL, feb DECIMAL, mar DECIMAL);

Window Functions for Analytics

-- Cohort retention analysis
WITH first_purchase AS (
    SELECT 
        customer_id,
        DATE_TRUNC('month', MIN(order_date)) AS cohort_month
    FROM orders
    GROUP BY 1
),
cohort_activity AS (
    SELECT 
        fp.cohort_month,
        DATE_TRUNC('month', o.order_date) AS activity_month,
        COUNT(DISTINCT o.customer_id) AS active_customers
    FROM first_purchase fp
    JOIN orders o ON fp.customer_id = o.customer_id
    GROUP BY 1, 2
)
SELECT 
    cohort_month,
    activity_month,
    active_customers,
    EXTRACT(MONTH FROM AGE(activity_month, cohort_month)) AS months_since_first,
    active_customers / FIRST_VALUE(active_customers) OVER (
        PARTITION BY cohort_month ORDER BY activity_month
    ) AS retention_rate
FROM cohort_activity
ORDER BY cohort_month, activity_month;

-- Funnel analysis
WITH funnel AS (
    SELECT 
        COUNT(DISTINCT CASE WHEN event = 'page_view' THEN user_id END) AS viewers,
        COUNT(DISTINCT CASE WHEN event = 'add_to_cart' THEN user_id END) AS cart_adders,
        COUNT(DISTINCT CASE WHEN event = 'checkout' THEN user_id END) AS checkouters,
        COUNT(DISTINCT CASE WHEN event = 'purchase' THEN user_id END) AS purchasers
    FROM events
    WHERE event_date BETWEEN '2024-01-01' AND '2024-01-31'
)
SELECT 
    viewers,
    cart_adders,
    checkouters,
    purchasers,
    cart_adders::float / viewers AS view_to_cart_rate,
    checkouters::float / cart_adders AS cart_to_checkout_rate,
    purchasers::float / checkouters AS checkout_to_purchase_rate,
    purchasers::float / viewers AS overall_conversion_rate
FROM funnel;

Python + SQL Integration

import pandas as pd
from sqlalchemy import create_engine
import sqlite3

# Method 1: SQLAlchemy (recommended)
engine = create_engine('postgresql://user:password@host:port/database')
df = pd.read_sql('SELECT * FROM customers', engine)

# Method 2: SQLite
conn = sqlite3.connect('database.db')
df = pd.read_sql('SELECT * FROM customers', conn)

# Method 3: pandasql (SQL syntax in Python)
from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())

result = pysqldf("""
    SELECT city, COUNT(*) as customer_count
    FROM df
    GROUP BY city
    ORDER BY customer_count DESC
""")

# Complex query with pandas
query = """
WITH customer_metrics AS (
    SELECT 
        customer_id,
        COUNT(*) as order_count,
        SUM(amount) as total_spent,
        AVG(amount) as avg_order_value
    FROM orders
    WHERE order_date >= '2024-01-01'
    GROUP BY customer_id
)
SELECT 
    c.customer_id,
    c.name,
    c.city,
    cm.order_count,
    cm.total_spent,
    cm.avg_order_value,
    CASE 
        WHEN cm.total_spent > 1000 THEN 'High Value'
        WHEN cm.total_spent > 500 THEN 'Medium Value'
        ELSE 'Low Value'
    END as customer_segment
FROM customers c
JOIN customer_metrics cm ON c.customer_id = cm.customer_id
ORDER BY cm.total_spent DESC
"""

df = pd.read_sql(query, engine)

# Save results back to database
df.to_sql('customer_segments', engine, if_exists='replace', index=False)

# Execute DDL/DML
with engine.connect() as conn:
    conn.execute("""
        CREATE TABLE IF NOT EXISTS daily_stats (
            date DATE,
            metric_name VARCHAR(100),
            metric_value DECIMAL(10, 2)
        )
    """)

💡 SQLAlchemy Best Practices

Use connection pooling (create_engine(..., pool_size=5)) for repeated queries
Always use parameterized queries to prevent SQL injection
Use pd.read_sql() with chunksize for large result sets: pd.read_sql(query, engine, chunksize=10000)
For writes, use method='multi' for faster bulk inserts

Performance Optimization

-- 1. Use indexes
CREATE INDEX idx_orders_customer_id ON orders(customer_id);
CREATE INDEX idx_orders_order_date ON orders(order_date);
CREATE INDEX idx_orders_composite ON orders(customer_id, order_date);

-- 2. Avoid SELECT *
SELECT customer_id, name FROM customers;  -- Good
SELECT * FROM customers;                  -- Bad

-- 3. Use EXPLAIN to analyze queries
EXPLAIN ANALYZE
SELECT * FROM orders 
WHERE customer_id = 123;

-- 4. Filter early
SELECT * FROM orders
WHERE order_date >= '2024-01-01'  -- Filter first
AND customer_id = 123;

-- 5. Use appropriate JOIN types
-- Prefer INNER JOIN over FULL OUTER JOIN
-- Ensure JOIN columns are indexed

-- 6. Avoid functions on indexed columns
-- Bad
WHERE EXTRACT(YEAR FROM order_date) = 2024
-- Good
WHERE order_date >= '2024-01-01' AND order_date < '2025-01-01'

ℹ️ Index Design Principles

Index columns used in WHERE, JOIN ON, ORDER BY, and GROUP BY
Composite indexes follow the leftmost prefix rule: index (a, b, c) helps queries on a, (a, b), or (a, b, c) — but NOT on b alone
Over-indexing slows writes (INSERT, UPDATE, DELETE) because each index must be maintained
Use partial indexes for frequently filtered subsets: CREATE INDEX idx ON orders(status) WHERE status = 'pending'

Common Data Science SQL Patterns

-- 1. Recency, Frequency, Monetary (RFM) Analysis
WITH rfm AS (
    SELECT 
        customer_id,
        DATEDIFF(day, MAX(order_date), CURRENT_DATE) AS recency,
        COUNT(DISTINCT order_id) AS frequency,
        SUM(amount) AS monetary
    FROM orders
    GROUP BY customer_id
),
rfm_scored AS (
    SELECT 
        *,
        NTILE(5) OVER (ORDER BY recency DESC) AS r_score,
        NTILE(5) OVER (ORDER BY frequency) AS f_score,
        NTILE(5) OVER (ORDER BY monetary) AS m_score
    FROM rfm
)
SELECT 
    *,
    CONCAT(r_score, f_score, m_score) AS rfm_cell
FROM rfm_scored;

-- 2. User retention cohort analysis
WITH user_cohorts AS (
    SELECT 
        user_id,
        DATE_TRUNC('month', MIN(event_date)) AS cohort_month
    FROM events
    GROUP BY 1
),
cohort_retention AS (
    SELECT 
        uc.cohort_month,
        EXTRACT(MONTH FROM AGE(e.event_date, uc.cohort_month)) AS months_since,
        COUNT(DISTINCT e.user_id) AS active_users
    FROM user_cohorts uc
    JOIN events e ON uc.user_id = e.user_id
    GROUP BY 1, 2
)
SELECT 
    cohort_month,
    months_since,
    active_users,
    FIRST_VALUE(active_users) OVER (
        PARTITION BY cohort_month ORDER BY months_since
    ) AS cohort_size,
    active_users::float / FIRST_VALUE(active_users) OVER (
        PARTITION BY cohort_month ORDER BY months_since
    ) AS retention_rate
FROM cohort_retention;

-- 3. Session analysis
WITH sessions AS (
    SELECT 
        user_id,
        session_id,
        MIN(event_time) AS session_start,
        MAX(event_time) AS session_end,
        COUNT(*) AS event_count
    FROM events
    GROUP BY user_id, session_id
)
SELECT 
    user_id,
    session_id,
    session_start,
    session_end,
    EXTRACT(EPOCH FROM (session_end - session_start)) / 60 AS session_duration_min,
    event_count,
    event_count / EXTRACT(EPOCH FROM (session_end - session_start)) * 3600 AS events_per_hour
FROM sessions;

📝Worked Example: RFM Segmentation

Given: Orders table with customer_id, order_date, amount

Step 1: Compute RFM metrics per customer

Recency = days since last purchase
Frequency = count of distinct orders
Monetary = total amount spent

Step 2: Score each metric into quintiles (1-5) using NTILE(5)

Step 3: Combine into RFM cell (e.g., "531" = high recency, medium frequency, low monetary)

Step 4: Map to segments:

555, 554, 545 → Champions
544, 534, 435 → Loyal Customers
333, 332, 323 → Need Attention
111, 112, 121 → Lost/At Risk

Key Takeaways

📋Summary: SQL for Data Scientists

SELECT, FROM, WHERE, GROUP BY, HAVING, ORDER BY, LIMIT — learn the execution order, not just the syntax
JOINs combine data from multiple tables — always specify the join condition and prefer INNER JOIN when appropriate
Window functions (OVER clause) enable complex analytics without self-joins — they compute across related rows without collapsing them
CTEs (WITH clauses) make queries readable and maintainable — use them to break complex logic into named building blocks
Indexes are critical for performance — index columns used in WHERE, JOIN, ORDER BY; use EXPLAIN ANALYZE to verify
Python + SQL integration (SQLAlchemy, pandas) is essential — use parameterized queries and chunked reads for production
Common patterns: RFM analysis, cohort retention, funnel analysis, session analysis — these are the bread and butter of data science SQL

Practice Exercise

Write a query to find the top 10 customers by total revenue using GROUP BY and ORDER BY
Calculate month-over-month revenue growth rate using LAG() window function
Build a cohort retention analysis that tracks users from their first purchase month
Create an RFM segmentation using NTILE() for scoring
Write a recursive CTE to traverse an organizational hierarchy
Optimize a slow-running query using EXPLAIN ANALYZE and appropriate indexes
Use a window function to compute a 7-day moving average of daily sales