SQL Fundamentals for Data Engineers

Module 1: FoundationsSQL BasicsFree Lesson

Advertisement

Why SQL Matters for Data Engineers

SQL is the lingua franca of data. Whether you're building ETL pipelines, writing transformations, debugging data quality issues, or optimizing queries — SQL is your primary tool. Data engineers who master SQL are significantly more productive and effective.

ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”
│                    SQL IN DATA ENGINEERING                    │
│                                                             │
│  ETL Pipelines    │  SELECT, INSERT, UPDATE, DELETE         │
│  Data Modeling    │  CREATE TABLE, ALTER TABLE, constraints │
│  Analytics        │  JOIN, GROUP BY, Window Functions       │
│  Optimization     │  Indexes, EXPLAIN, query plans          │
│  Data Quality     │  DISTINCT, NULL handling, validation    │
│  Administration   │  GRANT, REVOKE, transactions            │
ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜

Basic SELECT Queries

Retrieving Data

-- Select specific columns
SELECT first_name, last_name, email
FROM customers;

-- Select all columns
SELECT *
FROM orders;

-- Select with aliases
SELECT 
    first_name AS "First Name",
    last_name AS "Last Name",
    email AS "Email Address"
FROM customers;

-- Select distinct values
SELECT DISTINCT country
FROM customers;

-- Select with LIMIT
SELECT *
FROM orders
ORDER BY order_date DESC
LIMIT 100;

Filtering with WHERE

-- Comparison operators
SELECT *
FROM orders
WHERE status = 'completed'
  AND amount > 100.00;

-- IN operator
SELECT *
FROM products
WHERE category_id IN (1, 3, 5);

-- BETWEEN operator
SELECT *
FROM orders
WHERE order_date BETWEEN '2024-01-01' AND '2024-12-31';

-- LIKE operator
SELECT *
FROM customers
WHERE email LIKE '%@gmail.com';

-- IS NULL check
SELECT *
FROM customers
WHERE phone_number IS NULL;

-- NOT operators
SELECT *
FROM orders
WHERE status != 'cancelled'
  AND amount NOT BETWEEN 0 AND 10;

Sorting and Limiting

-- ORDER BY
SELECT *
FROM orders
ORDER BY order_date DESC, amount ASC;

-- LIMIT and OFFSET (pagination)
SELECT *
FROM orders
ORDER BY order_id
LIMIT 20 OFFSET 40;  -- Page 3 (records 41-60)

-- TOP N per group (using window function)
SELECT *
FROM (
    SELECT 
        *,
        ROW_NUMBER() OVER (PARTITION BY category_id ORDER BY rating DESC) as rank
    FROM products
) ranked
WHERE rank <= 3;

Aggregate Functions

-- Basic aggregations
SELECT 
    COUNT(*) AS total_orders,
    COUNT(DISTINCT customer_id) AS unique_customers,
    SUM(amount) AS total_revenue,
    AVG(amount) AS avg_order_value,
    MIN(amount) AS min_order,
    MAX(amount) AS max_order
FROM orders;

-- GROUP BY
SELECT 
    status,
    COUNT(*) AS order_count,
    AVG(amount) AS avg_amount
FROM orders
GROUP BY status;

-- GROUP BY with HAVING (filter after aggregation)
SELECT 
    customer_id,
    COUNT(*) AS order_count,
    SUM(amount) AS total_spent
FROM orders
GROUP BY customer_id
HAVING COUNT(*) >= 5
   AND SUM(amount) > 1000;

-- Multiple grouping columns
SELECT 
    country,
    city,
    COUNT(*) AS customer_count
FROM customers
GROUP BY country, city
ORDER BY country, customer_count DESC;

JOIN Operations

JOIN Types Visualized

ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”
│                         JOIN TYPES                               │
ā”œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”¤
│                                                                 │
│  INNER JOIN          LEFT JOIN           RIGHT JOIN             │
│  ā”Œā”€ā”€ā”€ā”¬ā”€ā”€ā”€ā”          ā”Œā”€ā”€ā”€ā”¬ā”€ā”€ā”€ā”           ā”Œā”€ā”€ā”€ā”¬ā”€ā”€ā”€ā”              │
│  │ A │ B │          │ A │ B │           │ A │ B │              │
│  │   │   │          │ ā— │   │           │   │ ā— │              │
│  │ ā— │ ā— │          │ ā— │ ā— │           │ ā— │ ā— │              │
│  │   │   │          │ ā— │   │           │   │ ā— │              │
│  ā””ā”€ā”€ā”€ā”“ā”€ā”€ā”€ā”˜          ā””ā”€ā”€ā”€ā”“ā”€ā”€ā”€ā”˜           ā””ā”€ā”€ā”€ā”“ā”€ā”€ā”€ā”˜              │
│  Only matching      All A + matching    All B + matching       │
│                                                                 │
│  FULL OUTER JOIN     CROSS JOIN                                  │
│  ā”Œā”€ā”€ā”€ā”¬ā”€ā”€ā”€ā”          ā”Œā”€ā”€ā”€ā”¬ā”€ā”€ā”€ā”                                  │
│  │ A │ B │          │ A │ B │                                  │
│  │ ā— │   │          │ ā— │ ā—ā—ā—ā”‚                                │
│  │ ā— │ ā— │          │ ā— │ ā—ā—ā—ā”‚                                │
│  │   │ ā— │          │ ā— │ ā—ā—ā—ā”‚                                │
│  ā””ā”€ā”€ā”€ā”“ā”€ā”€ā”€ā”˜          ā””ā”€ā”€ā”€ā”“ā”€ā”€ā”€ā”˜                                  │
│  All rows           Every A Ɨ every B                          │
ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜

Practical JOIN Examples

-- INNER JOIN: Orders with customer information
SELECT 
    o.order_id,
    o.order_date,
    o.amount,
    c.first_name,
    c.last_name,
    c.email
FROM orders o
INNER JOIN customers c ON o.customer_id = c.customer_id;

-- LEFT JOIN: All customers, even those without orders
SELECT 
    c.customer_id,
    c.first_name,
    c.last_name,
    COUNT(o.order_id) AS total_orders,
    COALESCE(SUM(o.amount), 0) AS total_spent
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.first_name, c.last_name;

-- Multiple JOINs
SELECT 
    o.order_id,
    c.first_name AS customer_name,
    p.product_name,
    oi.quantity,
    oi.unit_price
FROM orders o
INNER JOIN customers c ON o.customer_id = c.customer_id
INNER JOIN order_items oi ON o.order_id = oi.order_id
INNER JOIN products p ON oi.product_id = p.product_id;

-- Self JOIN: Find customers in the same city
SELECT 
    c1.first_name AS customer_1,
    c2.first_name AS customer_2,
    c1.city
FROM customers c1
INNER JOIN customers c2 ON c1.city = c2.city
WHERE c1.customer_id < c2.customer_id;

Subqueries

-- Scalar subquery: Get orders above average
SELECT *
FROM orders
WHERE amount > (SELECT AVG(amount) FROM orders);

-- IN subquery: Customers who placed orders
SELECT *
FROM customers
WHERE customer_id IN (SELECT DISTINCT customer_id FROM orders);

-- Correlated subquery: Latest order per customer
SELECT *
FROM orders o1
WHERE order_date = (
    SELECT MAX(order_date)
    FROM orders o2
    WHERE o2.customer_id = o1.customer_id
);

-- EXISTS subquery: Customers with high-value orders
SELECT *
FROM customers c
WHERE EXISTS (
    SELECT 1
    FROM orders o
    WHERE o.customer_id = c.customer_id
      AND o.amount > 1000
);

-- Subquery in FROM clause
SELECT 
    customer_id,
    total_spent,
    RANK() OVER (ORDER BY total_spent DESC) AS spending_rank
FROM (
    SELECT 
        customer_id,
        SUM(amount) AS total_spent
    FROM orders
    GROUP BY customer_id
) customer_totals;

Window Functions

Window functions perform calculations across a set of rows related to the current row — without collapsing them like GROUP BY.

-- ROW_NUMBER: Unique sequential numbering
SELECT 
    order_id,
    customer_id,
    order_date,
    amount,
    ROW_NUMBER() OVER (
        PARTITION BY customer_id 
        ORDER BY order_date DESC
    ) AS order_sequence
FROM orders;

-- RANK and DENSE_RANK: Ranking with ties
SELECT 
    product_name,
    category,
    total_sales,
    RANK() OVER (PARTITION BY category ORDER BY total_sales DESC) AS rank,
    DENSE_RANK() OVER (PARTITION BY category ORDER BY total_sales DESC) AS dense_rank
FROM product_sales;

-- Running total (cumulative sum)
SELECT 
    order_date,
    daily_revenue,
    SUM(daily_revenue) OVER (
        ORDER BY order_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS cumulative_revenue
FROM daily_sales;

-- Moving average (window frame)
SELECT 
    order_date,
    daily_revenue,
    AVG(daily_revenue) OVER (
        ORDER BY order_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS rolling_7day_avg
FROM daily_sales;

-- LAG and LEAD: Access previous/next rows
SELECT 
    month,
    revenue,
    LAG(revenue, 1) OVER (ORDER BY month) AS prev_month,
    LEAD(revenue, 1) OVER (ORDER BY month) AS next_month,
    revenue - LAG(revenue, 1) OVER (ORDER BY month) AS month_over_month
FROM monthly_revenue;

Window Function Reference

FunctionDescriptionUse Case
ROW_NUMBER()Sequential number, no tiesDeduplication, pagination
RANK()Rank with gaps for tiesLeaderboards
DENSE_RANK()Rank without gapsContinuous ranking
NTILE(n)Divide into n bucketsPercentiles, quartiles
LAG(col, n)Value from n rows beforeTime series comparison
LEAD(col, n)Value from n rows afterTime series comparison
SUM() OVERCumulative sumRunning totals
AVG() OVERMoving averageTrend analysis

Common Table Expressions (CTEs)

CTEs make complex queries readable and reusable.

-- Basic CTE
WITH customer_orders AS (
    SELECT 
        customer_id,
        COUNT(*) AS order_count,
        SUM(amount) AS total_spent
    FROM orders
    GROUP BY customer_id
)
SELECT 
    c.first_name,
    c.last_name,
    co.order_count,
    co.total_spent
FROM customers c
INNER JOIN customer_orders co ON c.customer_id = co.customer_id
WHERE co.order_count >= 5;

-- Multiple CTEs
WITH 
monthly_sales AS (
    SELECT 
        DATE_TRUNC('month', order_date) AS month,
        SUM(amount) AS revenue
    FROM orders
    GROUP BY DATE_TRUNC('month', order_date)
),
monthly_avg AS (
    SELECT 
        AVG(revenue) AS avg_monthly_revenue
    FROM monthly_sales
)
SELECT 
    ms.month,
    ms.revenue,
    ma.avg_monthly_revenue,
    ms.revenue - ma.avg_monthly_revenue AS variance
FROM monthly_sales ms
CROSS JOIN monthly_avg ma;

-- CTE for data quality checks
WITH order_validation AS (
    SELECT 
        order_id,
        CASE 
            WHEN amount <= 0 THEN 'INVALID_AMOUNT'
            WHEN order_date > CURRENT_DATE THEN 'FUTURE_DATE'
            WHEN customer_id IS NULL THEN 'MISSING_CUSTOMER'
            ELSE 'VALID'
        END AS validation_status
    FROM orders
)
SELECT 
    validation_status,
    COUNT(*) AS record_count
FROM order_validation
GROUP BY validation_status;

Indexes and Performance Basics

What is an Index?

An index is a data structure that speeds up data retrieval at the cost of additional storage and write overhead.

-- Create index on frequently queried columns
CREATE INDEX idx_orders_customer_id ON orders(customer_id);
CREATE INDEX idx_orders_order_date ON orders(order_date);

-- Composite index (multiple columns)
CREATE INDEX idx_orders_status_date ON orders(status, order_date);

-- Partial index (PostgreSQL)
CREATE INDEX idx_orders_pending ON orders(order_date) 
WHERE status = 'pending';

-- Check if index is being used
EXPLAIN ANALYZE
SELECT * FROM orders WHERE customer_id = 12345;

EXPLAIN ANALYZE

-- Before optimization
EXPLAIN ANALYZE
SELECT COUNT(*)
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE o.order_date >= '2024-01-01';

-- Output might show:
-- Seq Scan on orders (cost=0.00..1234.00 rows=50000)
--   Filter: (order_date >= '2024-01-01')
--   Rows Removed by Filter: 950000
-- Hash Join (cost=234.56..1567.89 rows=50000)
--   -> Seq Scan on customers (cost=0.00..234.56 rows=10000)

Common Performance Issues

IssueSymptomSolution
Sequential scanFull table scan on large tablesAdd appropriate index
Missing join indexSlow JOINs on large tablesIndex foreign keys
SELECT *Fetching unnecessary columnsSelect only needed columns
N+1 queriesMany small queries in loopsUse JOINs or batch queries
Unparameterized queriesPlan cache pollutionUse parameterized queries

SQL Best Practices for Data Engineers

Naming Conventions

-- Use snake_case for identifiers
CREATE TABLE order_items (  -- Good
    order_item_id SERIAL PRIMARY KEY,
    order_id INTEGER REFERENCES orders(order_id),
    product_id INTEGER REFERENCES products(product_id),
    quantity INTEGER NOT NULL,
    unit_price DECIMAL(10,2) NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Use descriptive names
total_revenue  -- Good
revenue        -- Acceptable
rev            -- Bad (ambiguous)

Writing Readable SQL

-- Bad: All on one line
SELECT o.order_id,c.first_name,c.last_name,p.product_name,oi.quantity,oi.unit_price FROM orders o INNER JOIN customers c ON o.customer_id=c.customer_id INNER JOIN order_items oi ON o.order_id=oi.order_id INNER JOIN products p ON oi.product_id=p.product_id WHERE o.order_date BETWEEN '2024-01-01' AND '2024-12-31' AND o.status='completed';

-- Good: Formatted and readable
SELECT 
    o.order_id,
    c.first_name,
    c.last_name,
    p.product_name,
    oi.quantity,
    oi.unit_price
FROM orders o
INNER JOIN customers c 
    ON o.customer_id = c.customer_id
INNER JOIN order_items oi 
    ON o.order_id = oi.order_id
INNER JOIN products p 
    ON oi.product_id = p.product_id
WHERE o.order_date BETWEEN '2024-01-01' AND '2024-12-31'
  AND o.status = 'completed';

Key Takeaways

  1. SQL is essential for every data engineer — it's the primary language for data manipulation
  2. Master the basics first: SELECT, WHERE, JOIN, GROUP BY, ORDER BY, LIMIT
  3. Window functions are powerful — learn ROW_NUMBER, RANK, LAG/LEAD, and cumulative aggregates
  4. CTEs improve readability — use them to break complex queries into logical steps
  5. Indexes are critical — understand when and how to index for performance
  6. Always use EXPLAIN ANALYZE — understand your query execution plans
  7. Write readable SQL — format consistently, use aliases, comment complex logic

Practice Exercises

  1. Basic queries: Write a query to find the top 10 customers by total order amount in the last 30 days.

  2. Joins: Write a query showing all products that have never been ordered, along with the number of times each product has been ordered.

  3. Window functions: Write a query that calculates the running total of revenue by day, and the percentage change from the previous day.

  4. CTEs: Using CTEs, write a query that identifies customers whose average order value is above the company-wide average.

  5. Performance: Take a slow query from your work, run EXPLAIN ANALYZE, identify the bottleneck, and optimize it with indexes or query restructuring.

Advertisement

Need Expert Data Engineering Help?

Professional DE consulting, pipeline architecture, and data platform services.

Advertisement