Why SQL for Data Scientists?
SQL (Structured Query Language) is the universal language for data retrieval. While Python and R handle analysis, SQL is how you get data from databases.
Data Sources Data Warehouse Data Science
ββββββββββββ ββββββββββββββββ ββββββββββββββ
β PostgreSQLβββββββββββ βββββββββ Analysis β
β MySQL βββββββββββ SQL Queries βββββββββ Modeling β
β BigQuery βββββββββββ βββββββββ Viz β
β Snowflake βββββββββββ βββββββββ β
ββββββββββββ ββββββββββββββββ ββββββββββββββ
βΉοΈ Why SQL Before Python?
Most enterprise data lives in relational databases. SQL is optimized for filtering, aggregating, and joining millions of rows server-side β far more efficient than loading everything into Python. Master SQL first, then use Python for advanced analytics that SQL cannot do.
SQL Basics
SELECT Statements
-- Basic SELECT
SELECT * FROM customers;
-- Select specific columns
SELECT customer_id, name, email
FROM customers;
-- Limit results
SELECT * FROM customers
LIMIT 10;
-- Unique values
SELECT DISTINCT city
FROM customers;
-- Aliases
SELECT
name AS customer_name,
email AS contact_email
FROM customers;
Filtering with WHERE
-- Comparison operators
SELECT * FROM orders
WHERE amount > 100;
SELECT * FROM orders
WHERE status = 'completed';
SELECT * FROM customers
WHERE age >= 18 AND age <= 65;
-- BETWEEN (inclusive)
SELECT * FROM orders
WHERE order_date BETWEEN '2024-01-01' AND '2024-12-31';
-- IN
SELECT * FROM customers
WHERE city IN ('New York', 'Los Angeles', 'Chicago');
-- LIKE (pattern matching)
SELECT * FROM customers
WHERE name LIKE 'J%'; -- Starts with J
SELECT * FROM customers
WHERE email LIKE '%@gmail.com'; -- Gmail users
-- NULL handling
SELECT * FROM customers
WHERE phone IS NOT NULL;
SELECT * FROM customers
WHERE phone IS NULL;
π‘ NULL Semantics
NULL is not a value β it represents missing or unknown data. Comparisons with NULL always return NULL (unknown), not TRUE or FALSE. Always use IS NULL / IS NOT NULL instead of = or !=. Be aware: NULL = NULL evaluates to NULL, not TRUE.
Sorting and Limiting
-- ORDER BY
SELECT * FROM orders
ORDER BY amount DESC;
-- Multiple columns
SELECT * FROM customers
ORDER BY last_name ASC, first_name ASC;
-- LIMIT and OFFSET (pagination)
SELECT * FROM orders
ORDER BY order_date DESC
LIMIT 10 OFFSET 20; -- Get rows 21-30
Aggregation Functions
-- COUNT
SELECT COUNT(*) FROM customers;
SELECT COUNT(DISTINCT city) FROM customers;
-- SUM, AVG, MIN, MAX
SELECT
SUM(amount) AS total_revenue,
AVG(amount) AS avg_order_value,
MIN(amount) AS min_order,
MAX(amount) AS max_order
FROM orders;
-- ROUND
SELECT
ROUND(AVG(amount), 2) AS avg_amount
FROM orders;
-- COUNT with conditions
SELECT
COUNT(*) AS total_orders,
COUNT(CASE WHEN status = 'completed' THEN 1 END) AS completed,
COUNT(CASE WHEN status = 'pending' THEN 1 END) AS pending
FROM orders;
GROUP BY
-- Basic GROUP BY
SELECT
city,
COUNT(*) AS customer_count
FROM customers
GROUP BY city;
-- Multiple columns
SELECT
city,
status,
COUNT(*) AS order_count,
SUM(amount) AS total_amount
FROM orders
GROUP BY city, status;
-- GROUP BY with HAVING
SELECT
city,
COUNT(*) AS customer_count
FROM customers
GROUP BY city
HAVING COUNT(*) > 10;
-- Common aggregations
SELECT
DATE_TRUNC('month', order_date) AS month,
COUNT(*) AS order_count,
SUM(amount) AS revenue,
AVG(amount) AS avg_order_value
FROM orders
GROUP BY DATE_TRUNC('month', order_date)
ORDER BY month;
βΉοΈ WHERE vs HAVING
- WHERE filters rows before grouping β cannot use aggregate functions
- HAVING filters groups after grouping β can use aggregate functions
- Execution order: FROM β WHERE β GROUP BY β HAVING β SELECT β ORDER BY β LIMIT
JOIN Operations
Types of JOINs
-- INNER JOIN (only matching rows)
SELECT
c.customer_id,
c.name,
o.order_id,
o.amount
FROM customers c
INNER JOIN orders o ON c.customer_id = o.customer_id;
-- LEFT JOIN (all from left, matching from right)
SELECT
c.customer_id,
c.name,
o.order_id,
o.amount
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id;
-- RIGHT JOIN (all from right, matching from left)
SELECT
c.customer_id,
c.name,
o.order_id,
o.amount
FROM customers c
RIGHT JOIN orders o ON c.customer_id = o.customer_id;
-- FULL OUTER JOIN (all from both)
SELECT
c.customer_id,
c.name,
o.order_id,
o.amount
FROM customers c
FULL OUTER JOIN orders o ON c.customer_id = o.customer_id;
-- SELF JOIN
SELECT
e.name AS employee,
m.name AS manager
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.employee_id;
π‘ JOIN Types Visualized
- INNER: Only rows with matches in both tables (intersection)
- LEFT: All rows from left table + matched rows from right (NULLs where no match)
- RIGHT: All rows from right table + matched rows from left
- FULL OUTER: All rows from both tables (NULLs where no match on either side)
- CROSS: Every combination of rows from both tables (Cartesian product)
JOIN Examples
-- Multi-table JOIN
SELECT
c.name AS customer_name,
o.order_date,
p.product_name,
oi.quantity,
oi.unit_price
FROM customers c
INNER JOIN orders o ON c.customer_id = o.customer_id
INNER JOIN order_items oi ON o.order_id = oi.order_id
INNER JOIN products p ON oi.product_id = p.product_id;
-- JOIN with aggregation
SELECT
c.customer_name,
SUM(o.amount) AS total_spent,
COUNT(o.order_id) AS order_count
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_name
ORDER BY total_spent DESC;
-- Cohort analysis
SELECT
DATE_TRUNC('month', first_order) AS cohort_month,
COUNT(DISTINCT customer_id) AS cohort_size,
SUM(CASE WHEN order_date >= first_order + INTERVAL '30 days'
THEN 1 ELSE 0 END) AS retained_after_30d
FROM (
SELECT
customer_id,
order_date,
MIN(order_date) OVER (PARTITION BY customer_id) AS first_order
FROM orders
) customer_orders
GROUP BY DATE_TRUNC('month', first_order);
Window Functions
Window functions perform calculations across rows related to the current row.
-- ROW_NUMBER
SELECT
name,
department,
salary,
ROW_NUMBER() OVER (ORDER BY salary DESC) AS salary_rank
FROM employees;
-- RANK and DENSE_RANK
SELECT
name,
department,
salary,
RANK() OVER (ORDER BY salary DESC) AS rank_with_gaps,
DENSE_RANK() OVER (ORDER BY salary DESC) AS rank_no_gaps
FROM employees;
-- Partition by department
SELECT
name,
department,
salary,
RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS dept_rank
FROM employees;
-- Running totals
SELECT
order_date,
amount,
SUM(amount) OVER (ORDER BY order_date) AS running_total
FROM orders;
-- Moving average
SELECT
order_date,
amount,
AVG(amount) OVER (
ORDER BY order_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS moving_avg_7d
FROM orders;
-- LAG and LEAD
SELECT
order_date,
amount,
LAG(amount, 1) OVER (ORDER BY order_date) AS prev_amount,
LEAD(amount, 1) OVER (ORDER BY order_date) AS next_amount,
amount - LAG(amount, 1) OVER (ORDER BY order_date) AS change
FROM orders;
-- Percentiles
SELECT
name,
salary,
NTILE(4) OVER (ORDER BY salary) AS salary_quartile,
PERCENT_RANK() OVER (ORDER BY salary) AS percentile_rank
FROM employees;
βΉοΈ ROW_NUMBER vs RANK vs DENSE_RANK
- ROW_NUMBER: Unique sequential integers (1, 2, 3, 4) β no ties
- RANK: Gaps for ties (1, 2, 2, 4) β if two rows tie for 2nd, next is 4th
- DENSE_RANK: No gaps for ties (1, 2, 2, 3) β if two rows tie for 2nd, next is 3rd Choose based on whether you need to break ties and whether gaps matter.
Subqueries and CTEs
Subqueries
-- Scalar subquery
SELECT
name,
salary,
salary - (SELECT AVG(salary) FROM employees) AS diff_from_avg
FROM employees;
-- IN subquery
SELECT * FROM customers
WHERE customer_id IN (
SELECT customer_id FROM orders
WHERE amount > 1000
);
-- EXISTS
SELECT * FROM customers c
WHERE EXISTS (
SELECT 1 FROM orders o
WHERE o.customer_id = c.customer_id
AND o.amount > 1000
);
-- Correlated subquery
SELECT
name,
salary,
(SELECT COUNT(*) FROM employees e2
WHERE e2.department = e1.department
AND e2.salary > e1.salary) + 1 AS dept_salary_rank
FROM employees e1;
π‘ IN vs EXISTS
- IN: Best when the subquery result set is small. Evaluates the subquery once.
- EXISTS: Best when the outer table is small. Evaluates the subquery once per outer row and short-circuits on first match.
- For large subquery results, EXISTS is often faster because it can stop as soon as a match is found.
Common Table Expressions (CTEs)
-- Basic CTE
WITH customer_stats AS (
SELECT
customer_id,
COUNT(*) AS order_count,
SUM(amount) AS total_spent
FROM orders
GROUP BY customer_id
)
SELECT
c.name,
cs.order_count,
cs.total_spent
FROM customers c
JOIN customer_stats cs ON c.customer_id = cs.customer_id
WHERE cs.total_spent > 1000;
-- Multiple CTEs
WITH monthly_revenue AS (
SELECT
DATE_TRUNC('month', order_date) AS month,
SUM(amount) AS revenue
FROM orders
GROUP BY 1
),
revenue_growth AS (
SELECT
month,
revenue,
LAG(revenue) OVER (ORDER BY month) AS prev_month_revenue,
(revenue - LAG(revenue) OVER (ORDER BY month)) /
LAG(revenue) OVER (ORDER BY month) * 100 AS growth_rate
FROM monthly_revenue
)
SELECT * FROM revenue_growth
WHERE growth_rate > 10;
-- Recursive CTE (for hierarchical data)
WITH RECURSIVE org_chart AS (
-- Base case
SELECT
employee_id,
name,
manager_id,
1 AS level
FROM employees
WHERE manager_id IS NULL
UNION ALL
-- Recursive case
SELECT
e.employee_id,
e.name,
e.manager_id,
oc.level + 1
FROM employees e
INNER JOIN org_chart oc ON e.manager_id = oc.employee_id
)
SELECT * FROM org_chart
ORDER BY level, name;
Advanced SQL Techniques
PIVOT and UNPIVOT
-- PIVOT (using CASE)
SELECT
customer_id,
SUM(CASE WHEN EXTRACT(MONTH FROM order_date) = 1 THEN amount END) AS jan,
SUM(CASE WHEN EXTRACT(MONTH FROM order_date) = 2 THEN amount END) AS feb,
SUM(CASE WHEN EXTRACT(MONTH FROM order_date) = 3 THEN amount END) AS mar
FROM orders
GROUP BY customer_id;
-- Dynamic PIVOT (PostgreSQL)
-- Using crosstab function
CREATE EXTENSION IF NOT EXISTS tablefunc;
SELECT * FROM crosstab(
'SELECT customer_id,
EXTRACT(MONTH FROM order_date) AS month,
SUM(amount)
FROM orders
GROUP BY 1, 2
ORDER BY 1, 2'
) AS ct(customer_id INT, jan DECIMAL, feb DECIMAL, mar DECIMAL);
Window Functions for Analytics
-- Cohort retention analysis
WITH first_purchase AS (
SELECT
customer_id,
DATE_TRUNC('month', MIN(order_date)) AS cohort_month
FROM orders
GROUP BY 1
),
cohort_activity AS (
SELECT
fp.cohort_month,
DATE_TRUNC('month', o.order_date) AS activity_month,
COUNT(DISTINCT o.customer_id) AS active_customers
FROM first_purchase fp
JOIN orders o ON fp.customer_id = o.customer_id
GROUP BY 1, 2
)
SELECT
cohort_month,
activity_month,
active_customers,
EXTRACT(MONTH FROM AGE(activity_month, cohort_month)) AS months_since_first,
active_customers / FIRST_VALUE(active_customers) OVER (
PARTITION BY cohort_month ORDER BY activity_month
) AS retention_rate
FROM cohort_activity
ORDER BY cohort_month, activity_month;
-- Funnel analysis
WITH funnel AS (
SELECT
COUNT(DISTINCT CASE WHEN event = 'page_view' THEN user_id END) AS viewers,
COUNT(DISTINCT CASE WHEN event = 'add_to_cart' THEN user_id END) AS cart_adders,
COUNT(DISTINCT CASE WHEN event = 'checkout' THEN user_id END) AS checkouters,
COUNT(DISTINCT CASE WHEN event = 'purchase' THEN user_id END) AS purchasers
FROM events
WHERE event_date BETWEEN '2024-01-01' AND '2024-01-31'
)
SELECT
viewers,
cart_adders,
checkouters,
purchasers,
cart_adders::float / viewers AS view_to_cart_rate,
checkouters::float / cart_adders AS cart_to_checkout_rate,
purchasers::float / checkouters AS checkout_to_purchase_rate,
purchasers::float / viewers AS overall_conversion_rate
FROM funnel;
Python + SQL Integration
import pandas as pd
from sqlalchemy import create_engine
import sqlite3
# Method 1: SQLAlchemy (recommended)
engine = create_engine('postgresql://user:password@host:port/database')
df = pd.read_sql('SELECT * FROM customers', engine)
# Method 2: SQLite
conn = sqlite3.connect('database.db')
df = pd.read_sql('SELECT * FROM customers', conn)
# Method 3: pandasql (SQL syntax in Python)
from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())
result = pysqldf("""
SELECT city, COUNT(*) as customer_count
FROM df
GROUP BY city
ORDER BY customer_count DESC
""")
# Complex query with pandas
query = """
WITH customer_metrics AS (
SELECT
customer_id,
COUNT(*) as order_count,
SUM(amount) as total_spent,
AVG(amount) as avg_order_value
FROM orders
WHERE order_date >= '2024-01-01'
GROUP BY customer_id
)
SELECT
c.customer_id,
c.name,
c.city,
cm.order_count,
cm.total_spent,
cm.avg_order_value,
CASE
WHEN cm.total_spent > 1000 THEN 'High Value'
WHEN cm.total_spent > 500 THEN 'Medium Value'
ELSE 'Low Value'
END as customer_segment
FROM customers c
JOIN customer_metrics cm ON c.customer_id = cm.customer_id
ORDER BY cm.total_spent DESC
"""
df = pd.read_sql(query, engine)
# Save results back to database
df.to_sql('customer_segments', engine, if_exists='replace', index=False)
# Execute DDL/DML
with engine.connect() as conn:
conn.execute("""
CREATE TABLE IF NOT EXISTS daily_stats (
date DATE,
metric_name VARCHAR(100),
metric_value DECIMAL(10, 2)
)
""")
π‘ SQLAlchemy Best Practices
- Use connection pooling (
create_engine(..., pool_size=5)) for repeated queries - Always use parameterized queries to prevent SQL injection
- Use
pd.read_sql()with chunksize for large result sets:pd.read_sql(query, engine, chunksize=10000) - For writes, use
method='multi'for faster bulk inserts
Performance Optimization
-- 1. Use indexes
CREATE INDEX idx_orders_customer_id ON orders(customer_id);
CREATE INDEX idx_orders_order_date ON orders(order_date);
CREATE INDEX idx_orders_composite ON orders(customer_id, order_date);
-- 2. Avoid SELECT *
SELECT customer_id, name FROM customers; -- Good
SELECT * FROM customers; -- Bad
-- 3. Use EXPLAIN to analyze queries
EXPLAIN ANALYZE
SELECT * FROM orders
WHERE customer_id = 123;
-- 4. Filter early
SELECT * FROM orders
WHERE order_date >= '2024-01-01' -- Filter first
AND customer_id = 123;
-- 5. Use appropriate JOIN types
-- Prefer INNER JOIN over FULL OUTER JOIN
-- Ensure JOIN columns are indexed
-- 6. Avoid functions on indexed columns
-- Bad
WHERE EXTRACT(YEAR FROM order_date) = 2024
-- Good
WHERE order_date >= '2024-01-01' AND order_date < '2025-01-01'
βΉοΈ Index Design Principles
- Index columns used in WHERE, JOIN ON, ORDER BY, and GROUP BY
- Composite indexes follow the leftmost prefix rule: index (a, b, c) helps queries on a, (a, b), or (a, b, c) β but NOT on b alone
- Over-indexing slows writes (INSERT, UPDATE, DELETE) because each index must be maintained
- Use partial indexes for frequently filtered subsets:
CREATE INDEX idx ON orders(status) WHERE status = 'pending'
Common Data Science SQL Patterns
-- 1. Recency, Frequency, Monetary (RFM) Analysis
WITH rfm AS (
SELECT
customer_id,
DATEDIFF(day, MAX(order_date), CURRENT_DATE) AS recency,
COUNT(DISTINCT order_id) AS frequency,
SUM(amount) AS monetary
FROM orders
GROUP BY customer_id
),
rfm_scored AS (
SELECT
*,
NTILE(5) OVER (ORDER BY recency DESC) AS r_score,
NTILE(5) OVER (ORDER BY frequency) AS f_score,
NTILE(5) OVER (ORDER BY monetary) AS m_score
FROM rfm
)
SELECT
*,
CONCAT(r_score, f_score, m_score) AS rfm_cell
FROM rfm_scored;
-- 2. User retention cohort analysis
WITH user_cohorts AS (
SELECT
user_id,
DATE_TRUNC('month', MIN(event_date)) AS cohort_month
FROM events
GROUP BY 1
),
cohort_retention AS (
SELECT
uc.cohort_month,
EXTRACT(MONTH FROM AGE(e.event_date, uc.cohort_month)) AS months_since,
COUNT(DISTINCT e.user_id) AS active_users
FROM user_cohorts uc
JOIN events e ON uc.user_id = e.user_id
GROUP BY 1, 2
)
SELECT
cohort_month,
months_since,
active_users,
FIRST_VALUE(active_users) OVER (
PARTITION BY cohort_month ORDER BY months_since
) AS cohort_size,
active_users::float / FIRST_VALUE(active_users) OVER (
PARTITION BY cohort_month ORDER BY months_since
) AS retention_rate
FROM cohort_retention;
-- 3. Session analysis
WITH sessions AS (
SELECT
user_id,
session_id,
MIN(event_time) AS session_start,
MAX(event_time) AS session_end,
COUNT(*) AS event_count
FROM events
GROUP BY user_id, session_id
)
SELECT
user_id,
session_id,
session_start,
session_end,
EXTRACT(EPOCH FROM (session_end - session_start)) / 60 AS session_duration_min,
event_count,
event_count / EXTRACT(EPOCH FROM (session_end - session_start)) * 3600 AS events_per_hour
FROM sessions;
πWorked Example: RFM Segmentation
Given: Orders table with customer_id, order_date, amount
Step 1: Compute RFM metrics per customer
- Recency = days since last purchase
- Frequency = count of distinct orders
- Monetary = total amount spent
Step 2: Score each metric into quintiles (1-5) using NTILE(5)
Step 3: Combine into RFM cell (e.g., "531" = high recency, medium frequency, low monetary)
Step 4: Map to segments:
- 555, 554, 545 β Champions
- 544, 534, 435 β Loyal Customers
- 333, 332, 323 β Need Attention
- 111, 112, 121 β Lost/At Risk
Key Takeaways
πSummary: SQL for Data Scientists
- SELECT, FROM, WHERE, GROUP BY, HAVING, ORDER BY, LIMIT β learn the execution order, not just the syntax
- JOINs combine data from multiple tables β always specify the join condition and prefer INNER JOIN when appropriate
- Window functions (OVER clause) enable complex analytics without self-joins β they compute across related rows without collapsing them
- CTEs (WITH clauses) make queries readable and maintainable β use them to break complex logic into named building blocks
- Indexes are critical for performance β index columns used in WHERE, JOIN, ORDER BY; use EXPLAIN ANALYZE to verify
- Python + SQL integration (SQLAlchemy, pandas) is essential β use parameterized queries and chunked reads for production
- Common patterns: RFM analysis, cohort retention, funnel analysis, session analysis β these are the bread and butter of data science SQL
Practice Exercise
- Write a query to find the top 10 customers by total revenue using GROUP BY and ORDER BY
- Calculate month-over-month revenue growth rate using LAG() window function
- Build a cohort retention analysis that tracks users from their first purchase month
- Create an RFM segmentation using NTILE() for scoring
- Write a recursive CTE to traverse an organizational hierarchy
- Optimize a slow-running query using EXPLAIN ANALYZE and appropriate indexes
- Use a window function to compute a 7-day moving average of daily sales