Modern SQL

You write SELECT * FROM users WHERE age > 25 and hit enter. Simple, right? Three seconds later, your result appears. You're happy.

But what you don't see is the absolute chaos that just happened behind the scenes. Your innocent little query triggered an optimizer that considered 47 different execution strategies, ran statistical analysis on your data distribution, predicted I/O costs down to the millisecond, and ultimately chose an algorithm you've probably never heard of - all in a fraction of a second.

Modern SQL databases are frighteningly smart. They're doing things that would make a PhD dissertation look simple. Let's dive into the wizard's workshop and see what kind of sorcery is actually happening.

The Query Journey: From SQL to Execution

First, let's trace the path your query takes through the database:

Your SQL
   ↓
Parser → Check syntax, build parse tree
   ↓
Binder → Verify tables/columns exist, resolve names
   ↓
Optimizer → THIS IS WHERE THE MAGIC HAPPENS
   ↓
Execution Plan → The actual algorithm to run
   ↓
Execution Engine → Just do what the optimizer said
   ↓
Results!

Most people focus on writing SQL or tuning indexes. But the optimizer? That's where databases flex their 50 years of computer science research.

ℹ️
The Optimizer's Job

Given one SQL query, the optimizer might generate hundreds or thousands of possible execution plans. Its job: find the fastest one without actually running them all. It's like trying to predict which route through the city is fastest without actually driving each one.

The Cost Model: Predicting the Future

Here's the first bit of magic: the optimizer doesn't just guess. It models the cost of each possible plan.

Cost factors:

  • I/O cost: How many pages to read from disk?
  • CPU cost: How many tuples to process?
  • Network cost: (for distributed databases) How much data to transfer?
  • Memory cost: Will this fit in buffer pool or require disk spills?

Let's say you have:

SELECT * FROM users 
WHERE age > 25 AND city = 'New York';

The optimizer considers:

Option 1: Scan the whole table

  • Cost: Read all 10,000 pages = 10,000 I/O ops
  • Then filter in memory
  • Estimated time: ~10 seconds

Option 2: Use index on age

  • Cost: Read index (height=3) = 3 I/O ops
  • Then read matching data pages = ~3,000 pages = 3,000 I/O ops
  • Estimated time: ~3 seconds

Option 3: Use index on city

  • Cost: Read index = 3 I/O ops
  • Read matching pages = 500 pages = 500 I/O ops
  • Estimated time: ~0.5 seconds ← WINNER!

The optimizer picks Option 3. But how did it know city='New York' would only match 500 pages?

Statistics.

💡
The Statistics System

Databases maintain statistics about your data: number of rows, distinct values per column, data distribution histograms, correlation between columns, and more. Run ANALYZE or UPDATE STATISTICS regularly, or your optimizer is flying blind!

Cardinality Estimation: The Art of Fortune Telling

Cardinality = how many rows a query will return. Getting this right is CRITICAL because it affects every downstream decision.

Simple Predicate

WHERE age = 30

If the table has 1,000,000 rows and age has 70 distinct values (ages 18-87), the optimizer estimates:

Cardinality = 1,000,000 / 70 ≈ 14,285 rows

This assumes uniform distribution - a simplification, but reasonable.

Multiple Predicates (The Independence Assumption)

WHERE age = 30 AND city = 'New York'

Optimizer assumes age and city are independent:

Selectivity(age=30) = 1/70 = 0.014
Selectivity(city='NY') = 0.05 (5% of users in NY)
Combined = 0.014 × 0.05 = 0.0007
Cardinality = 1,000,000 × 0.0007 = 700 rows
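
That whole estimate is just a few multiplications. Here's the same arithmetic as a tiny Python sketch (the row counts and selectivities are the made-up numbers from above, not real statistics):

# Toy cardinality estimate under the independence assumption.
# Every number here is the illustrative one from the text, not a real statistic.
total_rows = 1_000_000
sel_age = 1 / 70           # age = 30: 70 distinct ages, assumed uniform
sel_city = 0.05            # 5% of users live in New York

combined = sel_age * sel_city          # the independence assumption, right here
print(round(total_rows * combined))    # ~714, which the text rounds to ~700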

But what if young people prefer cities? Then age and city are correlated, and this estimate is wrong!

⚠️
When Estimates Go Wrong

The optimizer estimated 700 rows, so it chose a nested loop join. Reality: 50,000 rows. Now your query takes 10 minutes instead of 10 seconds because the wrong algorithm was chosen. This is why DBAs obsess over statistics quality!

Modern Solution: Histograms and Multi-Dimensional Statistics

PostgreSQL, SQL Server, and Oracle now maintain histograms - bucketed distributions of actual data:

age histogram:
[18-25]: 200,000 rows  (young users!)
[26-35]: 400,000 rows  (peak)
[36-50]: 300,000 rows
[51+]:   100,000 rows
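
Here's a rough Python sketch of how a histogram like that becomes a selectivity estimate - the bucket boundaries and counts are the made-up ones above, and real systems use far more buckets:

# Toy histogram: (low, high, row_count) per bucket, from the example above.
histogram = [(18, 25, 200_000), (26, 35, 400_000), (36, 50, 300_000), (51, 87, 100_000)]

def estimate_age_gt(value):
    """Estimate rows matching age > value, interpolating inside the boundary bucket."""
    rows = 0.0
    for low, high, count in histogram:
        if value < low:                 # whole bucket qualifies
            rows += count
        elif value <= high:             # boundary bucket: assume uniform inside it
            rows += count * (high - value) / (high - low + 1)
    return rows

print(round(estimate_age_gt(40)))   # ~300,000: the skew is captured, not assumed away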

Even better, some databases track multi-column statistics to capture correlations:

CREATE STATISTICS young_city_corr 
ON age, city FROM users;

Now the optimizer knows that age and city ARE correlated and adjusts estimates accordingly.

Join Algorithms: More Than You Ever Wanted to Know

Here's where databases really show off. You write:

SELECT u.name, o.total
FROM users u
JOIN orders o ON u.id = o.user_id
WHERE u.city = 'Boston';

Simple, right? But the optimizer has to choose from dozens of algorithms:

Nested Loop Join (The Simple One)

for each row in users where city='Boston':
    for each row in orders where user_id = user.id:
        output joined row

Cost: If 100 Boston users and 1,000,000 orders:

  • Outer loop: 100 iterations
  • Inner loop: with an index on orders(user_id), ~10 matching orders per user (1,000,000 orders across ~100,000 users)
  • Total: 100 × 10 = 1,000 comparisons

When to use: Small outer table, index on inner table's join key. Perfect for this query!

Hash Join (The Clever One)

1. Build hash table on smaller table (users from Boston)
2. Probe: for each order, hash user_id and look up in hash table
3. Output matches

Cost:

  • Build phase: Read Boston users (100 rows)
  • Probe phase: Read all orders (1,000,000 rows), O(1) lookup each
  • Total: ~1,000,100 operations, but no random I/O!

When to use: No indexes available, joining large tables, can fit build side in memory.
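
To make the build/probe phases concrete, here's a toy hash join in Python (the table data is invented for illustration):

# Toy hash join: build on the small (filtered) side, probe with the big side.
users = [(1, "Ann", "Boston"), (2, "Bob", "Miami"), (3, "Cam", "Boston")]
orders = [(101, 1, 50.0), (102, 3, 20.0), (103, 2, 75.0), (104, 1, 10.0)]

# Build phase: hash table keyed on users.id, Boston users only.
build = {}
for user_id, name, city in users:
    if city == "Boston":
        build[user_id] = name

# Probe phase: one sequential pass over orders, O(1) lookup per row.
result = []
for order_id, user_id, total in orders:
    if user_id in build:                   # the hash lookup replaces an index seek
        result.append((build[user_id], total))

print(result)   # [('Ann', 50.0), ('Cam', 20.0), ('Ann', 10.0)]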

💻
The Hash Join Trick

Hash joins are I/O efficient because they read each table sequentially (no random seeks). Even if nested loop needs fewer comparisons, hash join might be faster because sequential I/O is so much quicker than random access!

Sort-Merge Join (The Sophisticated One)

1. Sort users by id
2. Sort orders by user_id  
3. Merge: walk through both sorted lists simultaneously

Cost:

  • Sort users: 100 × log(100) ≈ 664
  • Sort orders: 1,000,000 × log(1,000,000) ≈ 20,000,000
  • Merge: 100 + 1,000,000 = 1,000,100
  • Total: ~21,000,000 operations

Looks expensive! But if the data is ALREADY sorted (because of an index or previous operation), the sorts are free. Then merge is just two sequential scans - super fast!

When to use: Data already sorted, or you need sorted output anyway (for ORDER BY or GROUP BY downstream).
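
And here's a minimal merge step in Python, assuming both inputs are already sorted on the join key and user ids are unique (the rows are made up):

# Toy merge join: both inputs already sorted on the join key, user ids unique.
users = [(1, "Ann"), (3, "Cam"), (7, "Eve")]                # sorted by id
orders = [(1, 50.0), (1, 10.0), (3, 20.0), (9, 99.0)]       # sorted by user_id

result, i, j = [], 0, 0
while i < len(users) and j < len(orders):
    uid, oid = users[i][0], orders[j][0]
    if uid < oid:
        i += 1                          # this user has no (more) orders
    elif uid > oid:
        j += 1                          # this order's user isn't in the user list
    else:
        # Emit every consecutive order for this user, then advance the user side.
        k = j
        while k < len(orders) and orders[k][0] == uid:
            result.append((users[i][1], orders[k][1]))
            k += 1
        i += 1

print(result)   # [('Ann', 50.0), ('Ann', 10.0), ('Cam', 20.0)]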

The Optimizer's Decision

The optimizer estimates costs for ALL of these (and more), considering:

  • Available indexes
  • Data cardinalities
  • Memory available
  • Whether output needs to be sorted

Then it picks the winner. And it does this for EVERY join in your query, considering all possible orderings!

SELECT *
FROM A JOIN B ON A.id = B.id
       JOIN C ON B.id = C.id
       JOIN D ON C.id = D.id;

Possible join orders:

  • ((A ⋈ B) ⋈ C) ⋈ D
  • (A ⋈ (B ⋈ C)) ⋈ D
  • A ⋈ ((B ⋈ C) ⋈ D)
  • ... and many more

For N tables, there are roughly (2(N-1))! / (N-1)! possible orderings once you count every tree shape. For 10 tables? Over 17 billion possibilities.

The optimizer can't check them all. So it uses heuristics, dynamic programming, and sometimes genetic algorithms to search the space efficiently.
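
Here's a sketch of the classic dynamic-programming enumeration in Python, with a toy cost function and a single made-up selectivity standing in for the real cost model (real optimizers also avoid cross products and prune aggressively):

from itertools import combinations

# Toy base-table cardinalities; a real optimizer also tracks per-predicate selectivities.
card = {"A": 1_000, "B": 10_000, "C": 100, "D": 50_000}
SELECTIVITY = 0.001          # pretend every join keeps 0.1% of the cross product

def join_cost(left_rows, right_rows):
    return left_rows + right_rows          # toy cost: just read both inputs

best = {frozenset([t]): (0.0, rows, t) for t, rows in card.items()}   # subset -> (cost, rows, plan)

tables = list(card)
for size in range(2, len(tables) + 1):
    for subset in combinations(tables, size):
        s = frozenset(subset)
        candidates = []
        # Try every way to split this subset into two already-solved subproblems.
        for left_size in range(1, size):
            for left in combinations(subset, left_size):
                l = frozenset(left)
                r = s - l
                lcost, lrows, lplan = best[l]
                rcost, rrows, rplan = best[r]
                rows = lrows * rrows * SELECTIVITY
                cost = lcost + rcost + join_cost(lrows, rrows)
                candidates.append((cost, rows, f"({lplan} ⋈ {rplan})"))
        best[s] = min(candidates)          # keep only the cheapest plan per subset

print(best[frozenset(tables)])             # cheapest plan (and its cost) for A ⋈ B ⋈ C ⋈ D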

🔬
The Join Ordering Problem

Finding the optimal join order is NP-hard. Modern optimizers use sophisticated search strategies: PostgreSQL uses exhaustive dynamic programming below about 12 tables and switches to its genetic GEQO search above that, SQL Server uses a "memo" structure to cache subproblems, and some experimental optimizers use machine learning!

Adaptive Query Processing: Learning on the Fly

Here's where it gets wild. Modern databases don't just plan and execute - they adapt mid-query.

Adaptive Join Selection (SQL Server)

SQL Server's optimizer might say: "I'm not sure if nested loop or hash join is better. Let me start with nested loop, but if I process more than 1000 rows, switch to hash join mid-execution."

Start: Nested Loop Join
  → After 500 rows: "This is fine, keep going"
  → After 1500 rows: "Wait, this is taking forever!"
  → Switch to Hash Join without restarting query

The database is literally changing algorithms WHILE YOUR QUERY IS RUNNING.
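
Here's a toy Python version of that decision: buffer the outer rows up to a threshold, then commit to nested loop (index lookups) or hash join based on how many rows actually showed up. The threshold and data are invented:

# Toy adaptive join: decide the algorithm after seeing how many outer rows arrive.
THRESHOLD = 1_000

def adaptive_join(outer_rows, inner_rows, inner_index):
    outer = list(outer_rows)              # a real engine buffers these lazily
    if len(outer) <= THRESHOLD:
        # Few outer rows: nested loop with index lookups on the inner side.
        for key, payload in outer:
            for inner_payload in inner_index.get(key, []):
                yield (payload, inner_payload)
    else:
        # Lots of outer rows: build a hash table once, stream the inner side past it.
        build = {}
        for key, payload in outer:
            build.setdefault(key, []).append(payload)
        for key, inner_payload in inner_rows:
            for payload in build.get(key, []):
                yield (payload, inner_payload)

# inner_index plays the role of an index on orders(user_id).
orders = [(1, "order-101"), (2, "order-102"), (1, "order-103")]
index = {}
for user_id, order in orders:
    index.setdefault(user_id, []).append(order)

print(list(adaptive_join([(1, "Ann"), (2, "Bob")], orders, index)))
# [('Ann', 'order-101'), ('Ann', 'order-103'), ('Bob', 'order-102')]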

Runtime Filter Pushdown (ClickHouse, Snowflake)

Consider:

SELECT * FROM big_table b
JOIN small_table s ON b.id = s.id
WHERE s.category = 'active';

Traditional plan:

  1. Scan big_table (1 billion rows)
  2. Scan small_table, filter to 'active' (100 rows)
  3. Join (now only need to check 100 IDs from big_table)

But we wasted time scanning 1 billion rows!

Runtime filter pushdown:

  1. Scan small_table first, get IDs: {42, 87, 153, ...} (100 IDs)
  2. Build a bloom filter or hash set
  3. Scan big_table, but skip rows where ID not in filter
  4. Now the scan of big_table skips almost everything, and only the ~100 matching rows reach the join!

The filter is computed AT RUNTIME and pushed down dynamically. You didn't ask for this. The database just decided to do it because it's smarter than you.

💡
Bloom Filters: Space Magic

A bloom filter is a probabilistic data structure that answers "is X in the set?" in O(1) time using a small, fixed-size bit array. It can produce false positives (saying yes when the answer is no) but never false negatives. Perfect for filtering billions of rows with just a few KB of memory!
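
Here's a tiny Bloom filter sketch in Python - the bit count and hash count are arbitrary; real engines size them to the expected number of keys:

import hashlib

class BloomFilter:
    def __init__(self, num_bits=8192, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item):
        # Derive several bit positions from one strong hash of the item.
        digest = hashlib.sha256(str(item).encode()).digest()
        for i in range(self.num_hashes):
            yield int.from_bytes(digest[i * 4:(i + 1) * 4], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # Can say "yes" when the answer is no (false positive), never the reverse.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

# Build from the small side of the join, probe with IDs from the big side.
bf = BloomFilter()
for user_id in (42, 87, 153):
    bf.add(user_id)

print(bf.might_contain(42), bf.might_contain(99_999))   # True, almost certainly False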

Cardinality Re-Estimation (Oracle)

Oracle's optimizer can detect when its estimates were wrong:

Expected: 1,000 rows after filter
Reality: 500,000 rows (oops!)

Oracle: "My estimate was garbage. Let me re-plan 
         the rest of the query with correct cardinality."

Mid-query re-optimization. Because plans go stale, and modern databases know it.

Parallel Execution: Divide and Conquer

Your query:

SELECT COUNT(*) FROM huge_table WHERE value > 1000;

Traditional: One thread scans 10 million rows. Takes 10 seconds.

Parallel execution:

Thread 1: Scan rows 0-2.5M
Thread 2: Scan rows 2.5M-5M  
Thread 3: Scan rows 5M-7.5M
Thread 4: Scan rows 7.5M-10M

Each thread: COUNT(*)
Final: SUM(all counts)

Now it takes 2.5 seconds (assuming 4 cores and perfect scaling).

But wait, there's more! Modern databases do parallel everything:

Parallel Hash Join:

1. Partition users into 4 buckets by hash(id)
2. Partition orders into 4 buckets by hash(user_id)
3. Four threads, each joins one bucket pair
4. Merge results

Parallel Aggregation:

SELECT city, AVG(age) FROM users GROUP BY city;
1. Each thread scans part of table, computes local aggregates
2. Combine phase: merge partial aggregates
3. Compute final AVG from combined SUM/COUNT
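
Here's the partial-aggregate pattern as a Python sketch, with a thread pool standing in for the engine's worker threads (the rows are synthetic, and the point is the structure, not the speedup):

from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Synthetic rows: (city, age). Threads only illustrate the shape of the plan;
# a real engine runs the partial aggregates on separate CPU cores.
rows = [("Boston", 30), ("NY", 40), ("Boston", 20), ("NY", 50)] * 1_000

def partial_aggregate(chunk):
    # Phase 1: each worker computes a local SUM and COUNT per group.
    acc = defaultdict(lambda: [0, 0])          # city -> [sum_of_ages, row_count]
    for city, age in chunk:
        acc[city][0] += age
        acc[city][1] += 1
    return acc

chunks = [rows[i::4] for i in range(4)]        # deal the rows out to 4 workers
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(partial_aggregate, chunks))

# Phase 2: merge the partial aggregates, then finish AVG from SUM/COUNT.
merged = defaultdict(lambda: [0, 0])
for part in partials:
    for city, (s, c) in part.items():
        merged[city][0] += s
        merged[city][1] += c

print({city: s / c for city, (s, c) in merged.items()})   # {'Boston': 25.0, 'NY': 45.0}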

The optimizer decides:

  • How many threads to use
  • How to partition the data
  • Where to place exchange operators (data shuffling points)
  • Whether parallelism is even worth it (overhead vs speedup)

⚠️
Parallelism Isn't Free

Coordinating threads, partitioning data, and merging results has overhead. For small queries, parallel execution is SLOWER. The optimizer must predict when parallelism helps vs hurts. Getting this wrong means your "optimization" made things worse!

Vectorized Execution: SIMD on Steroids

Traditional query execution (Volcano model):

while (tuple = next()) {
    result = apply_filter(tuple);
    emit(result);
}

One tuple at a time. Lots of function calls, branches, cache misses.

Vectorized execution (DuckDB, ClickHouse):

while (batch = next_batch()) {  // Get 1024 tuples
    results = apply_filter_vectorized(batch);  // Process all at once
    emit_batch(results);
}

Process tuples in batches of 1024-2048. The filter function operates on arrays:

// Instead of:
for (int i = 0; i < 1024; i++) {
    if (ages[i] > 25) output[j++] = rows[i];
}

// Compiler generates SIMD:
// Check 8 ages at once with AVX2 instructions
// 128 loop iterations instead of 1,024, better cache locality

Modern CPUs have SIMD (Single Instruction Multiple Data) that can process 8-16 values simultaneously. Vectorized engines exploit this automatically.

Result: 10-100x speedup on analytical queries. DuckDB crushes Postgres on aggregations because of this.

💻
Real-World Impact

Query: SELECT SUM(price) FROM orders WHERE status = 'completed'

PostgreSQL (tuple-at-a-time): 5 seconds
DuckDB (vectorized): 0.3 seconds

Same data, same machine. The execution model matters THAT much.

Just-In-Time (JIT) Compilation: Compiling Your Query

Here's some next-level sorcery: compile your query to machine code.

Traditional interpretation:

For each row:
    Push onto stack
    Call filter function
    Call projection function
    Pop from stack
    Emit result

Thousands of function calls, stack operations, indirection.

JIT compilation (PostgreSQL with LLVM, Hyper/Tableau):

1. Take query plan
2. Generate C code or LLVM IR
3. Compile to native machine code
4. Execute compiled function directly

Your query becomes a tight loop with no function call overhead:

; Pseudo-assembly for: WHERE age > 25 AND city = 'Boston'
loop:
    load age from [rdi]
    cmp age, 25
    jle skip
    load city_ptr from [rdi+8]
    cmp [city_ptr], 'Boston'
    jne skip
    ; emit row
skip:
    add rdi, 32  ; next row
    jmp loop

No interpretation, no indirection. Just raw CPU instructions.

Cost: Compilation takes 10-100ms. So JIT only helps for long-running queries (seconds or more). The optimizer must predict if compilation overhead is worth it!
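
You can get a feel for the idea in Python by generating specialized source code for a predicate and compiling it once, instead of interpreting a generic expression tree for every row. A toy sketch, not how any real engine does it:

# Toy "query compilation": turn WHERE age > 25 AND city = 'Boston' into a
# specialized Python function, compiled once and reused for every batch of rows.
predicate_source = """
def compiled_filter(rows):
    out = []
    for age, city in rows:
        if age > 25 and city == 'Boston':   # constants baked right in, nothing interpreted
            out.append((age, city))
    return out
"""

namespace = {}
exec(compile(predicate_source, "<generated>", "exec"), namespace)
compiled_filter = namespace["compiled_filter"]

rows = [(30, "Boston"), (22, "Boston"), (40, "Miami")]
print(compiled_filter(rows))   # [(30, 'Boston')]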

🔬
HyPer/Umbra Innovation

The HyPer database (now Tableau's engine) pioneered query compilation. Their approach: compile the entire query pipeline into one tight loop with no materialization. Result: analytical queries 10-100x faster than traditional row-at-a-time execution.

Approximate Query Processing: Good Enough is Perfect

Sometimes you don't need exact answers:

SELECT AVG(price) FROM orders;

Do you REALLY need to scan all 1 billion rows to get an average? Or would "approximately $47.32 ± $0.50" be fine?

Sampling

SELECT AVG(price) FROM orders TABLESAMPLE BERNOULLI(1);

Read only 1% of rows, compute average on sample. 100x faster, answer is usually within 1% of truth.

Sketches (HyperLogLog for COUNT DISTINCT)

SELECT COUNT(DISTINCT user_id) FROM events;

Traditional: Hash all user_ids into a set, count size. Memory = O(cardinality).

HyperLogLog sketch: Use a few KB of memory, get the count with ~2% error.

For each user_id:
    hash = hash(user_id)
    bucket = hash % 16384
    leading_zeros = count_leading_zeros(hash)
    max_zeros[bucket] = max(max_zeros[bucket], leading_zeros)

Cardinality ≈ num_buckets × 2^(average of max_zeros) × a correction constant
(HyperLogLog proper uses a bias-corrected harmonic mean, but that's the idea)

Sounds like magic? It is. But it works.

Result: COUNT(DISTINCT) on billions of rows in seconds, not hours.
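
Here's a compact Python sketch of the idea. It skips HyperLogLog's small- and large-range corrections, so treat it as an illustration, not a production implementation:

import hashlib

M = 1 << 12                          # 4,096 buckets (the low 12 bits of the hash)
ALPHA = 0.7213 / (1 + 1.079 / M)     # standard bias-correction constant for large M
registers = [0] * M

def add(item):
    h = int.from_bytes(hashlib.sha256(str(item).encode()).digest()[:8], "big")
    bucket = h & (M - 1)                           # low bits pick the bucket
    rest = h >> 12                                 # remaining 52 bits of the hash
    rank = 52 - rest.bit_length() + 1              # position of the first 1-bit
    registers[bucket] = max(registers[bucket], rank)

def estimate():
    # Bias-corrected harmonic mean of 2^register across all buckets.
    return ALPHA * M * M / sum(2.0 ** -r for r in registers)

for i in range(100_000):
    add(f"user-{i}")

print(round(estimate()))    # ~100,000, give or take a couple of percent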

💡
When to Use Approximation

Dashboards, analytics, exploration - approximation is perfect. Financial reports, compliance - need exact answers. Modern databases like ClickHouse and Snowflake make sampling trivial, and many have built-in sketch algorithms.

Push-Based vs Pull-Based Execution

Traditional (pull-based / Volcano model):

Top operator: "Give me next row"
  ↓
Join: "Give me next row from both inputs"
  ↓
Scan: "Read next row from disk"

Data is pulled up through the pipeline. Simple, but lots of function call overhead.
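
Python generators give a decent feel for the pull model: each operator asks its child for the next row. A toy pipeline, not any particular engine's API:

# Pull-based (Volcano-style) pipeline: every operator pulls rows from its child.
def scan(table):
    for row in table:                      # "read the next row from disk"
        yield row

def filter_op(child, predicate):
    for row in child:                      # "give me your next row"
        if predicate(row):
            yield row

def project(child, columns):
    for row in child:
        yield tuple(row[c] for c in columns)

table = [(30, "Boston", 50.0), (22, "Miami", 10.0), (41, "Boston", 99.0)]
plan = project(filter_op(scan(table), lambda row: row[0] > 25), columns=(1, 2))
print(list(plan))                          # [('Boston', 50.0), ('Boston', 99.0)]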

Push-based (MonetDB, Vectorwise):

Scan: "I have 1024 rows, pushing to filter"
  ↓
Filter: "Got 1024, filtered to 800, pushing to join"
  ↓
Join: "Got 800, joined to 600, pushing to output"

Data is pushed through operators. Fewer function calls, better cache locality, easier to vectorize.

Morsel-Driven (HyPer): Hybrid approach. Work is handed out in "morsels" (small chunks of input); data is pushed through the operators within a pipeline, and pipeline breakers (like a hash join's build phase) materialize their results before the next pipeline starts.

An engine commits to an execution model up front, and that choice shapes every plan the optimizer produces and how fast it runs!

Zone Maps / Small Materialized Aggregates

Here's a sneaky optimization you never asked for:

When writing pages to disk, the database tracks metadata:

Page 42: 
  min(timestamp) = 2024-01-01
  max(timestamp) = 2024-01-07
  min(price) = 10.50
  max(price) = 999.99

Query:

SELECT * FROM orders WHERE timestamp > '2024-06-01';

Optimizer: "Page 42 has max timestamp of 2024-01-07. Skip it entirely!"

Without reading the page, we know it has no matching rows. This is called zone map filtering or small materialized aggregates.

Result: Prune entire pages/partitions without I/O. Analytical queries get 10-1000x faster.

ClickHouse, Snowflake, and Redshift do this automatically. You didn't ask for it. The database just does it because it's clever.
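
Here's the pruning check as a Python sketch (the page metadata is invented):

# Toy zone maps: per-page min/max for the timestamp column (ISO date strings).
pages = [
    {"id": 41, "min_ts": "2023-12-20", "max_ts": "2023-12-31"},
    {"id": 42, "min_ts": "2024-01-01", "max_ts": "2024-01-07"},
    {"id": 43, "min_ts": "2024-06-02", "max_ts": "2024-06-09"},
]

def pages_to_scan(pages, cutoff):
    # WHERE timestamp > cutoff: a page can only match if its max exceeds the cutoff.
    return [page["id"] for page in pages if page["max_ts"] > cutoff]

print(pages_to_scan(pages, "2024-06-01"))   # [43]: pages 41 and 42 never leave disk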

💻
Real Example: Time-Series Data

Table with 1 year of data, partitioned by day (365 partitions).
Query: WHERE timestamp > NOW() - INTERVAL '7 days'

Zone maps let optimizer skip 358 partitions immediately.
Scan 7 days of data instead of 365 days = 50x speedup!

Machine Learning in the Optimizer

This is where databases officially become science fiction.

Learned Cardinality Estimation (Research / Neo, Bao)

Traditional: Use statistics and independence assumption.

ML approach: Train a neural network on query workload:

Input: Query features (predicates, joins, tables)
Output: Estimated cardinality

Training data: Actual query executions

The model learns correlations, data skew, and patterns that statistics miss.

Result: 10-100x better estimates than traditional methods in research papers. Production adoption is starting.

Learned Indexes (Research)

B-Trees are great, but what if we could do better?

Key insight: An index is just a function mapping keys to positions.

Traditional B-Tree: 
  key → traverse tree → find position

Learned Index:
  key → neural network → predict position → verify

Train a neural network to predict "where is key X in the sorted array?"

Result: In some workloads, learned indexes are 2-3x faster and 10x smaller than B-Trees. Still research, but Google is experimenting.
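
Here's a toy "learned index" sketch in Python: fit a straight line from key to position over a sorted array, then fix up the prediction with a bounded local search. Real learned indexes use hierarchies of models, but the shape of the idea is the same:

# Toy learned index over a sorted array: a model predicts the position,
# a bounded local search corrects it. (Single linear model; real ones are hierarchical.)
keys = sorted(x * x % 100_003 for x in range(50_000))
n = len(keys)

# "Train": least-squares fit of position as a linear function of key.
mean_k = sum(keys) / n
mean_p = (n - 1) / 2
num = sum((k - mean_k) * (i - mean_p) for i, k in enumerate(keys))
den = sum((k - mean_k) ** 2 for k in keys)
slope, intercept = num / den, mean_p - (num / den) * mean_k

# Remember the worst prediction error so lookups know how far to search.
max_err = max(abs(i - (slope * k + intercept)) for i, k in enumerate(keys))

def lookup(key):
    guess = int(slope * key + intercept)
    lo = max(0, guess - int(max_err) - 1)
    hi = min(n, guess + int(max_err) + 2)
    for i in range(lo, hi):               # bounded search around the prediction
        if keys[i] == key:
            return i
    return -1

probe = keys[12_345]
print(lookup(probe) == 12_345)            # True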

📝
The ML-Database Convergence

We're seeing ML infuse databases (learned optimizers) AND databases infuse ML (vector databases, embedding search). The lines are blurring. In 10 years, every database will have ML components under the hood.

The Explain Plan: Your Window Into the Optimizer's Mind

Want to see what the optimizer chose?

EXPLAIN (ANALYZE, BUFFERS) 
SELECT * FROM users u
JOIN orders o ON u.id = o.user_id
WHERE u.city = 'Boston';

PostgreSQL output:

Nested Loop  (cost=0.56..892.34 rows=100 width=64) 
             (actual time=0.043..5.231 rows=112 loops=1)
  Buffers: shared hit=245 read=12
  ->  Index Scan on users u (cost=0.42..23.45 rows=100 width=32)
                              (actual time=0.021..0.156 rows=112 loops=1)
        Index Cond: (city = 'Boston')
        Buffers: shared hit=45
  ->  Index Scan on orders o (cost=0.14..8.68 rows=1 width=32)
                              (actual time=0.002..0.042 rows=1 loops=112)
        Index Cond: (user_id = u.id)
        Buffers: shared hit=200 read=12
Planning Time: 0.342 ms
Execution Time: 5.487 ms

This tells you EVERYTHING:

  • Nested loop join chosen
  • Index scans on both tables
  • Estimated 100 rows, actually got 112 (pretty good!)
  • 245 buffer hits (cache!), only 12 disk reads
  • Execution took 5.4ms

If your query is slow, start with EXPLAIN. It shows you what the optimizer thought vs reality.

💡
Reading Explain Plans

Key things to look for:
- Seq Scan on large table? Probably need an index
- Estimated rows << actual rows? Stats are stale
- Lots of disk reads? Need more buffer pool memory
- Hash Join on tiny tables? Optimizer confused, maybe outdated stats

Modern SQL Features You Should Know

The SQL standard has evolved. Modern databases support wild features:

Window Functions (Every Modern DB)

SELECT name, salary,
       AVG(salary) OVER (PARTITION BY department) as dept_avg,
       ROW_NUMBER() OVER (ORDER BY salary DESC) as rank
FROM employees;

Compute aggregates over "windows" of rows without GROUP BY collapsing. Incredibly powerful for analytics.

CTEs and Recursive Queries (SQL:1999)

WITH RECURSIVE subordinates AS (
    SELECT id, name, manager_id FROM employees WHERE id = 1
    UNION ALL
    SELECT e.id, e.name, e.manager_id 
    FROM employees e
    JOIN subordinates s ON e.manager_id = s.id
)
SELECT * FROM subordinates;

Traverse hierarchies, compute transitive closures. This is graph traversal in SQL!

Lateral Joins (PostgreSQL, Oracle)

SELECT u.name, o.*
FROM users u
CROSS JOIN LATERAL (
    SELECT * FROM orders 
    WHERE user_id = u.id 
    ORDER BY created_at DESC 
    LIMIT 5
) o;

For each user, get their 5 most recent orders. The subquery can reference the outer query! This was impossible in old SQL.

JSON Support (PostgreSQL, MySQL, SQL Server)

SELECT data->>'name' as name,
       jsonb_array_elements(data->'tags') as tag
FROM documents
WHERE data @> '{"status": "active"}';

Store JSON, query it with SQL, index it, join it. The relational/document boundary is gone.

GROUPING SETS / CUBE / ROLLUP

SELECT city, product, SUM(sales)
FROM orders
GROUP BY GROUPING SETS (
    (city, product),
    (city),
    (product),
    ()
);

Compute multiple group-by aggregations in one pass. Used to require UNION of multiple queries. Now it's one efficient operation.

ℹ️
SQL is Not Dead

People keep predicting SQL's death. But SQL keeps getting MORE powerful. Modern SQL can express complex analytics, graph traversals, time-series operations, and even some ML tasks. It's 50 years old and more relevant than ever.

When the Optimizer Gets It Wrong

Optimizers are smart but not perfect. Common failure modes:

Stale Statistics

-- Yesterday: 1000 rows
-- Today: 10,000,000 rows (bulk insert)
-- Optimizer still thinks: 1000 rows

Solution: ANALYZE / UPDATE STATISTICS after bulk changes!

Correlated Columns

WHERE age < 25 AND student = true

If young people are usually students (correlation), independence assumption fails.

Solution: Multi-column statistics or hints.

Parameter Sniffing (SQL Server)

EXEC GetUsers @city = 'Boston'  -- Optimizer plans for Boston (100 rows)
EXEC GetUsers @city = 'New York'  -- Reuses plan, but NY has 10M rows!

Plan was optimal for first parameter, terrible for second.

Solution: OPTION (RECOMPILE) or plan guides.

Function Calls Hide Selectivity

WHERE UPPER(name) = 'ALICE'

Optimizer can't use index on name (function applied). Also can't estimate selectivity.

Solution: Create an expression (functional) index on UPPER(name), or keep a normalized copy of the column and index that.

⚠️
The 80-20 Rule of Query Performance

80% of slow queries are due to:
- Missing indexes (40%)
- Stale statistics (20%)
- Poorly written SQL (15%)
- Wrong data types/implicit conversions (5%)

Only 20% are actually hard optimization problems requiring deep tuning.

The Future: What's Coming

Autonomous Databases (Oracle, Azure SQL)

Databases that automatically:

  • Tune themselves
  • Create indexes
  • Adjust memory allocation
  • Detect and fix performance issues

The DBA becomes optional.

Unified OLTP/OLAP (TiDB, CockroachDB + Analytics)

One database for both transactions AND analytics. No more ETL to data warehouses.

Hybrid storage engines (row + column), workload-aware optimization.

Serverless Query Engines (BigQuery, Athena, Snowflake)

Separate storage from compute. Scale to petabytes, pay only for queries run.

No servers to manage, infinite scale.

GPU-Accelerated Databases (BlazingSQL, OmniSci)

Push operations to GPUs for 10-100x speedup on analytics.

Thousands of cores processing data in parallel.

🔬
The Pace of Innovation

In the last 10 years, we've seen: columnar execution, vectorization, JIT compilation, adaptive optimization, GPU acceleration, and ML-driven tuning. Database systems research is THRIVING. The next 10 years will be even wilder.

TL;DR

Modern SQL databases are absurdly sophisticated:

Query Optimization:

  • Cost models predict execution time with scary accuracy
  • Consider hundreds/thousands of possible plans
  • Use statistics, histograms, and ML for cardinality estimation
  • Find optimal join orders in exponential search space

Execution Innovations:

  • Adaptive algorithms switch strategies mid-query
  • Parallel execution across cores automatically
  • Vectorized/SIMD processing for 10-100x speedup
  • JIT compilation turns queries into machine code
  • Push-based execution for better cache performance

Smart Shortcuts:

  • Zone maps skip entire partitions without reading
  • Runtime filter pushdown avoids billions of rows
  • Approximate processing for "good enough" answers
  • Learned indexes and ML-powered optimizers (coming soon)

Modern SQL:

  • Window functions, CTEs, lateral joins
  • JSON support, recursive queries
  • GROUPING SETS for multi-dimensional analytics
  • Still evolving after 50 years!

The next time you write a simple SELECT statement, remember: you've just triggered a cascade of algorithms that would make a PhD dissertation look trivial. The database is working HARD to make your query look easy.

And that's beautiful.