Why SQL Matters for Data Engineers
SQL is the lingua franca of data. Whether you're building ETL pipelines, writing transformations, debugging data quality issues, or optimizing queries ā SQL is your primary tool. Data engineers who master SQL are significantly more productive and effective.
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
ā SQL IN DATA ENGINEERING ā
ā ā
ā ETL Pipelines ā SELECT, INSERT, UPDATE, DELETE ā
ā Data Modeling ā CREATE TABLE, ALTER TABLE, constraints ā
ā Analytics ā JOIN, GROUP BY, Window Functions ā
ā Optimization ā Indexes, EXPLAIN, query plans ā
ā Data Quality ā DISTINCT, NULL handling, validation ā
ā Administration ā GRANT, REVOKE, transactions ā
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
Basic SELECT Queries
Retrieving Data
-- Select specific columns
SELECT first_name, last_name, email
FROM customers;
-- Select all columns
SELECT *
FROM orders;
-- Select with aliases
SELECT
first_name AS "First Name",
last_name AS "Last Name",
email AS "Email Address"
FROM customers;
-- Select distinct values
SELECT DISTINCT country
FROM customers;
-- Select with LIMIT
SELECT *
FROM orders
ORDER BY order_date DESC
LIMIT 100;
Filtering with WHERE
-- Comparison operators
SELECT *
FROM orders
WHERE status = 'completed'
AND amount > 100.00;
-- IN operator
SELECT *
FROM products
WHERE category_id IN (1, 3, 5);
-- BETWEEN operator
SELECT *
FROM orders
WHERE order_date BETWEEN '2024-01-01' AND '2024-12-31';
-- LIKE operator
SELECT *
FROM customers
WHERE email LIKE '%@gmail.com';
-- IS NULL check
SELECT *
FROM customers
WHERE phone_number IS NULL;
-- NOT operators
SELECT *
FROM orders
WHERE status != 'cancelled'
AND amount NOT BETWEEN 0 AND 10;
Sorting and Limiting
-- ORDER BY
SELECT *
FROM orders
ORDER BY order_date DESC, amount ASC;
-- LIMIT and OFFSET (pagination)
SELECT *
FROM orders
ORDER BY order_id
LIMIT 20 OFFSET 40; -- Page 3 (records 41-60)
-- TOP N per group (using window function)
SELECT *
FROM (
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY category_id ORDER BY rating DESC) as rank
FROM products
) ranked
WHERE rank <= 3;
Aggregate Functions
-- Basic aggregations
SELECT
COUNT(*) AS total_orders,
COUNT(DISTINCT customer_id) AS unique_customers,
SUM(amount) AS total_revenue,
AVG(amount) AS avg_order_value,
MIN(amount) AS min_order,
MAX(amount) AS max_order
FROM orders;
-- GROUP BY
SELECT
status,
COUNT(*) AS order_count,
AVG(amount) AS avg_amount
FROM orders
GROUP BY status;
-- GROUP BY with HAVING (filter after aggregation)
SELECT
customer_id,
COUNT(*) AS order_count,
SUM(amount) AS total_spent
FROM orders
GROUP BY customer_id
HAVING COUNT(*) >= 5
AND SUM(amount) > 1000;
-- Multiple grouping columns
SELECT
country,
city,
COUNT(*) AS customer_count
FROM customers
GROUP BY country, city
ORDER BY country, customer_count DESC;
JOIN Operations
JOIN Types Visualized
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
ā JOIN TYPES ā
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā¤
ā ā
ā INNER JOIN LEFT JOIN RIGHT JOIN ā
ā āāāāā¬āāāā āāāāā¬āāāā āāāāā¬āāāā ā
ā ā A ā B ā ā A ā B ā ā A ā B ā ā
ā ā ā ā ā ā ā ā ā ā ā ā ā
ā ā ā ā ā ā ā ā ā ā ā ā ā ā ā ā ā
ā ā ā ā ā ā ā ā ā ā ā ā ā
ā āāāāā“āāāā āāāāā“āāāā āāāāā“āāāā ā
ā Only matching All A + matching All B + matching ā
ā ā
ā FULL OUTER JOIN CROSS JOIN ā
ā āāāāā¬āāāā āāāāā¬āāāā ā
ā ā A ā B ā ā A ā B ā ā
ā ā ā ā ā ā ā ā āāāā ā
ā ā ā ā ā ā ā ā ā āāāā ā
ā ā ā ā ā ā ā ā āāāā ā
ā āāāāā“āāāā āāāāā“āāāā ā
ā All rows Every A Ć every B ā
āāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāāā
Practical JOIN Examples
-- INNER JOIN: Orders with customer information
SELECT
o.order_id,
o.order_date,
o.amount,
c.first_name,
c.last_name,
c.email
FROM orders o
INNER JOIN customers c ON o.customer_id = c.customer_id;
-- LEFT JOIN: All customers, even those without orders
SELECT
c.customer_id,
c.first_name,
c.last_name,
COUNT(o.order_id) AS total_orders,
COALESCE(SUM(o.amount), 0) AS total_spent
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.first_name, c.last_name;
-- Multiple JOINs
SELECT
o.order_id,
c.first_name AS customer_name,
p.product_name,
oi.quantity,
oi.unit_price
FROM orders o
INNER JOIN customers c ON o.customer_id = c.customer_id
INNER JOIN order_items oi ON o.order_id = oi.order_id
INNER JOIN products p ON oi.product_id = p.product_id;
-- Self JOIN: Find customers in the same city
SELECT
c1.first_name AS customer_1,
c2.first_name AS customer_2,
c1.city
FROM customers c1
INNER JOIN customers c2 ON c1.city = c2.city
WHERE c1.customer_id < c2.customer_id;
Subqueries
-- Scalar subquery: Get orders above average
SELECT *
FROM orders
WHERE amount > (SELECT AVG(amount) FROM orders);
-- IN subquery: Customers who placed orders
SELECT *
FROM customers
WHERE customer_id IN (SELECT DISTINCT customer_id FROM orders);
-- Correlated subquery: Latest order per customer
SELECT *
FROM orders o1
WHERE order_date = (
SELECT MAX(order_date)
FROM orders o2
WHERE o2.customer_id = o1.customer_id
);
-- EXISTS subquery: Customers with high-value orders
SELECT *
FROM customers c
WHERE EXISTS (
SELECT 1
FROM orders o
WHERE o.customer_id = c.customer_id
AND o.amount > 1000
);
-- Subquery in FROM clause
SELECT
customer_id,
total_spent,
RANK() OVER (ORDER BY total_spent DESC) AS spending_rank
FROM (
SELECT
customer_id,
SUM(amount) AS total_spent
FROM orders
GROUP BY customer_id
) customer_totals;
Window Functions
Window functions perform calculations across a set of rows related to the current row ā without collapsing them like GROUP BY.
-- ROW_NUMBER: Unique sequential numbering
SELECT
order_id,
customer_id,
order_date,
amount,
ROW_NUMBER() OVER (
PARTITION BY customer_id
ORDER BY order_date DESC
) AS order_sequence
FROM orders;
-- RANK and DENSE_RANK: Ranking with ties
SELECT
product_name,
category,
total_sales,
RANK() OVER (PARTITION BY category ORDER BY total_sales DESC) AS rank,
DENSE_RANK() OVER (PARTITION BY category ORDER BY total_sales DESC) AS dense_rank
FROM product_sales;
-- Running total (cumulative sum)
SELECT
order_date,
daily_revenue,
SUM(daily_revenue) OVER (
ORDER BY order_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS cumulative_revenue
FROM daily_sales;
-- Moving average (window frame)
SELECT
order_date,
daily_revenue,
AVG(daily_revenue) OVER (
ORDER BY order_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS rolling_7day_avg
FROM daily_sales;
-- LAG and LEAD: Access previous/next rows
SELECT
month,
revenue,
LAG(revenue, 1) OVER (ORDER BY month) AS prev_month,
LEAD(revenue, 1) OVER (ORDER BY month) AS next_month,
revenue - LAG(revenue, 1) OVER (ORDER BY month) AS month_over_month
FROM monthly_revenue;
Window Function Reference
| Function | Description | Use Case |
|---|---|---|
| ROW_NUMBER() | Sequential number, no ties | Deduplication, pagination |
| RANK() | Rank with gaps for ties | Leaderboards |
| DENSE_RANK() | Rank without gaps | Continuous ranking |
| NTILE(n) | Divide into n buckets | Percentiles, quartiles |
| LAG(col, n) | Value from n rows before | Time series comparison |
| LEAD(col, n) | Value from n rows after | Time series comparison |
| SUM() OVER | Cumulative sum | Running totals |
| AVG() OVER | Moving average | Trend analysis |
Common Table Expressions (CTEs)
CTEs make complex queries readable and reusable.
-- Basic CTE
WITH customer_orders AS (
SELECT
customer_id,
COUNT(*) AS order_count,
SUM(amount) AS total_spent
FROM orders
GROUP BY customer_id
)
SELECT
c.first_name,
c.last_name,
co.order_count,
co.total_spent
FROM customers c
INNER JOIN customer_orders co ON c.customer_id = co.customer_id
WHERE co.order_count >= 5;
-- Multiple CTEs
WITH
monthly_sales AS (
SELECT
DATE_TRUNC('month', order_date) AS month,
SUM(amount) AS revenue
FROM orders
GROUP BY DATE_TRUNC('month', order_date)
),
monthly_avg AS (
SELECT
AVG(revenue) AS avg_monthly_revenue
FROM monthly_sales
)
SELECT
ms.month,
ms.revenue,
ma.avg_monthly_revenue,
ms.revenue - ma.avg_monthly_revenue AS variance
FROM monthly_sales ms
CROSS JOIN monthly_avg ma;
-- CTE for data quality checks
WITH order_validation AS (
SELECT
order_id,
CASE
WHEN amount <= 0 THEN 'INVALID_AMOUNT'
WHEN order_date > CURRENT_DATE THEN 'FUTURE_DATE'
WHEN customer_id IS NULL THEN 'MISSING_CUSTOMER'
ELSE 'VALID'
END AS validation_status
FROM orders
)
SELECT
validation_status,
COUNT(*) AS record_count
FROM order_validation
GROUP BY validation_status;
Indexes and Performance Basics
What is an Index?
An index is a data structure that speeds up data retrieval at the cost of additional storage and write overhead.
-- Create index on frequently queried columns
CREATE INDEX idx_orders_customer_id ON orders(customer_id);
CREATE INDEX idx_orders_order_date ON orders(order_date);
-- Composite index (multiple columns)
CREATE INDEX idx_orders_status_date ON orders(status, order_date);
-- Partial index (PostgreSQL)
CREATE INDEX idx_orders_pending ON orders(order_date)
WHERE status = 'pending';
-- Check if index is being used
EXPLAIN ANALYZE
SELECT * FROM orders WHERE customer_id = 12345;
EXPLAIN ANALYZE
-- Before optimization
EXPLAIN ANALYZE
SELECT COUNT(*)
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.order_date >= '2024-01-01';
-- Output might show:
-- Seq Scan on orders (cost=0.00..1234.00 rows=50000)
-- Filter: (order_date >= '2024-01-01')
-- Rows Removed by Filter: 950000
-- Hash Join (cost=234.56..1567.89 rows=50000)
-- -> Seq Scan on customers (cost=0.00..234.56 rows=10000)
Common Performance Issues
| Issue | Symptom | Solution |
|---|---|---|
| Sequential scan | Full table scan on large tables | Add appropriate index |
| Missing join index | Slow JOINs on large tables | Index foreign keys |
| SELECT * | Fetching unnecessary columns | Select only needed columns |
| N+1 queries | Many small queries in loops | Use JOINs or batch queries |
| Unparameterized queries | Plan cache pollution | Use parameterized queries |
SQL Best Practices for Data Engineers
Naming Conventions
-- Use snake_case for identifiers
CREATE TABLE order_items ( -- Good
order_item_id SERIAL PRIMARY KEY,
order_id INTEGER REFERENCES orders(order_id),
product_id INTEGER REFERENCES products(product_id),
quantity INTEGER NOT NULL,
unit_price DECIMAL(10,2) NOT NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Use descriptive names
total_revenue -- Good
revenue -- Acceptable
rev -- Bad (ambiguous)
Writing Readable SQL
-- Bad: All on one line
SELECT o.order_id,c.first_name,c.last_name,p.product_name,oi.quantity,oi.unit_price FROM orders o INNER JOIN customers c ON o.customer_id=c.customer_id INNER JOIN order_items oi ON o.order_id=oi.order_id INNER JOIN products p ON oi.product_id=p.product_id WHERE o.order_date BETWEEN '2024-01-01' AND '2024-12-31' AND o.status='completed';
-- Good: Formatted and readable
SELECT
o.order_id,
c.first_name,
c.last_name,
p.product_name,
oi.quantity,
oi.unit_price
FROM orders o
INNER JOIN customers c
ON o.customer_id = c.customer_id
INNER JOIN order_items oi
ON o.order_id = oi.order_id
INNER JOIN products p
ON oi.product_id = p.product_id
WHERE o.order_date BETWEEN '2024-01-01' AND '2024-12-31'
AND o.status = 'completed';
Key Takeaways
- SQL is essential for every data engineer ā it's the primary language for data manipulation
- Master the basics first: SELECT, WHERE, JOIN, GROUP BY, ORDER BY, LIMIT
- Window functions are powerful ā learn ROW_NUMBER, RANK, LAG/LEAD, and cumulative aggregates
- CTEs improve readability ā use them to break complex queries into logical steps
- Indexes are critical ā understand when and how to index for performance
- Always use EXPLAIN ANALYZE ā understand your query execution plans
- Write readable SQL ā format consistently, use aliases, comment complex logic
Practice Exercises
-
Basic queries: Write a query to find the top 10 customers by total order amount in the last 30 days.
-
Joins: Write a query showing all products that have never been ordered, along with the number of times each product has been ordered.
-
Window functions: Write a query that calculates the running total of revenue by day, and the percentage change from the previous day.
-
CTEs: Using CTEs, write a query that identifies customers whose average order value is above the company-wide average.
-
Performance: Take a slow query from your work, run EXPLAIN ANALYZE, identify the bottleneck, and optimize it with indexes or query restructuring.