Concurrency Control Theory

Remember our ACID article? We talked about how databases promise to keep your data safe and correct. But there's a problem we glossed over: what happens when multiple transactions run at the same time?

Spoiler alert: chaos. Beautiful, fascinating, wallet-draining chaos.

The $25 That Vanished Into Thin Air

Let's start with a horror story. You've got $100 in your bank account. You try to pay for something that costs $25. Simple, right?

Read Balance: $100
Check if $100 > $25? ✓
Pay $25
New Balance: $75
Write Balance: $75

Works perfectly! Until the power goes out right after you read the balance but before you write it back. Now what? Did the payment go through? Is your money gone? This is where Atomicity saves you - either the entire transaction happens or none of it does.

But here's an even scarier scenario: What if TWO payments of $25 try to execute at the exact same time?

Transaction 1: Read Balance ($100) → Check funds → Pay $25
Transaction 2: Read Balance ($100) → Check funds → Pay $25
Transaction 1: Write Balance ($75)
Transaction 2: Write Balance ($75)

Both transactions read $100, both think they have enough money, both pay $25... and your final balance is $75 instead of $50. You just got a free $25! (Your bank is not happy.)
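
Here's that race as a runnable sketch in Python. Everything is contrived to force the bad interleaving - the shared variable, the sleep, the thread timing - but the read-check-write pattern is exactly the one above:

import threading
import time

balance = 100  # shared account balance, deliberately unprotected

def pay(amount):
    global balance
    local = balance               # Read Balance
    if local >= amount:           # Check funds
        time.sleep(0.01)          # simulate work; lets the other thread read too
        balance = local - amount  # Write Balance (from a stale read!)

t1 = threading.Thread(target=pay, args=(25,))
t2 = threading.Thread(target=pay, args=(25,))
t1.start(); t2.start()
t1.join(); t2.join()

print(balance)  # almost always 75, not 50: one debit was lost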

This is the nightmare that keeps database architects awake at night. And it's exactly why concurrency control exists.

đŸšĢ
The Real World Impact

These aren't theoretical problems. In 2012, Knight Capital lost $440 million in about 45 minutes when a botched deployment left stale trading code running live alongside the new system. Concurrent systems acting on inconsistent state cost real money!

The Strawman Solution: Just Don't

The simplest solution? Don't allow concurrency at all. Execute one transaction at a time, in order, like a polite British queue.

Transaction 1 → Complete → Transaction 2 → Complete → Transaction 3 → ...

Before each transaction starts, copy the entire database to a new file. If it succeeds, overwrite the original. If it fails, delete the copy. Done!

This actually works! It's perfectly correct! It also has the performance of a potato.

Why? Because while one transaction is waiting for a slow disk read, every other transaction in the world is just... waiting. Doing nothing. Your expensive multi-core server is running one thing at a time like it's 1975.

We can do better.

The Goal: Having Your Cake and Eating It Too

What we actually want:

  • Better utilization: Use all those CPU cores! Don't let them sit idle!
  • Faster response times: When one transaction waits for I/O, let another one run
  • Correctness: Don't lose money or corrupt data
  • Fairness: Don't let one transaction starve forever

The challenge is allowing transactions to interleave their operations while still maintaining the illusion that they ran one at a time.

📖
Key Concept: Serializability

A schedule (interleaving of operations) is serializable if its result is equivalent to *some* serial execution of the transactions. We don't care which order, just that there exists *some* valid serial order that produces the same result.

The DBMS View: It's All About Reads and Writes

The database doesn't understand your application logic. It doesn't know you're transferring money or booking hotel rooms. All it sees is:

Transaction T1: R(A), W(A), R(B), W(B)
Transaction T2: R(A), W(A), R(B), W(B)

Where R = Read and W = Write. That's it. The DBMS's job is to interleave these operations in a way that doesn't break correctness.
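
For the sketches in this article, let's model a schedule as an ordered list of (transaction, operation, object) tuples. This representation is my choice for illustration - a real DBMS doesn't expose anything like it:

# A schedule is an ordered list of (transaction, operation, object) steps.
schedule = [
    ("T1", "R", "A"),
    ("T1", "W", "A"),
    ("T2", "R", "A"),
    ("T2", "W", "A"),
]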

The Classic Example: Interest vs Transfer

You've got two accounts, A and B, each with $1000. Two transactions run:

T1: Transfer $100 from A to B

A = A - 100  // A becomes $900
B = B + 100  // B becomes $1100

T2: Add 6% interest to both accounts

A = A * 1.06
B = B * 1.06

What should the final balance be? Well, A + B should equal $2120 (the original $2000 plus 6% interest).

Serial Execution: The Safe Path

If T1 runs completely before T2:

A = 1000 - 100 = 900
B = 1000 + 100 = 1100
Then apply interest:
A = 900 * 1.06 = 954
B = 1100 * 1.06 = 1166
Total: $2120 ✓

If T2 runs completely before T1:

A = 1000 * 1.06 = 1060
B = 1000 * 1.06 = 1060
Then transfer:
A = 1060 - 100 = 960
B = 1060 + 100 = 1160
Total: $2120 ✓

Both valid! Different final states, but both correct because A + B = $2120.

Good Interleaving: Still Correct

T1: A = A - 100  (A = 900)
T1: B = B + 100  (B = 1100)
T2: A = A * 1.06 (A = 954)
T2: B = B * 1.06 (B = 1166)
Total: $2120 ✓

This interleaving is equivalent to running T1 then T2 serially. We're good!

Bad Interleaving: Money Disappears

T1: A = A - 100  (A = 900)
T2: A = A * 1.06 (A = 954)  ← Interest applied to the already-debited A
T2: B = B * 1.06 (B = 1060) ← Interest applied to the old value of B!
T1: B = B + 100  (B = 1160) ← The $100 arrives after interest was computed
Total: $2114 ✗

We lost $6! The $100 in transit was debited from A before interest was applied but credited to B afterward, so it earned no interest - that's exactly the missing 6% of $100. This schedule is NOT equivalent to any serial execution. It's incorrect.

âš ī¸
The Problem

T2 saw A after T1's debit but B before T1's credit, and T1 credited a B that had already earned its interest. Each transaction observed a mix of the other's old and new values - a database state that no serial order could ever produce.
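
You can replay all three schedules with plain arithmetic. A toy sketch in Python - each step is atomic here, so we're modeling interleaving at statement granularity, not how a real DBMS evaluates anything:

def run(steps, a=1000.0, b=1000.0):
    # Apply account-update steps in order; return the combined balance.
    state = {"A": a, "B": b}
    for step in steps:
        step(state)
    return state["A"] + state["B"]

t1_debit  = lambda s: s.update(A=s["A"] - 100)   # T1: A = A - 100
t1_credit = lambda s: s.update(B=s["B"] + 100)   # T1: B = B + 100
t2_int_a  = lambda s: s.update(A=s["A"] * 1.06)  # T2: A = A * 1.06
t2_int_b  = lambda s: s.update(B=s["B"] * 1.06)  # T2: B = B * 1.06

print(run([t1_debit, t1_credit, t2_int_a, t2_int_b]))  # 2120.0 (serial T1 then T2)
print(run([t2_int_a, t2_int_b, t1_debit, t1_credit]))  # 2120.0 (serial T2 then T1)
print(run([t1_debit, t2_int_a, t2_int_b, t1_credit]))  # 2114.0 (the bad interleaving)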

Conflicting Operations: The Root of All Evil

When do operations actually conflict - that is, when can an unlucky interleaving cause problems?

Two operations conflict if:

  1. They're from different transactions
  2. They're on the same object (same data item)
  3. At least one is a write
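
The three-part definition translates directly into a tiny predicate over the (transaction, operation, object) tuples from earlier - a sketch, not a DBMS internal:

def conflicts(step1, step2):
    # True if two schedule steps conflict per the three conditions above.
    txn1, op1, obj1 = step1
    txn2, op2, obj2 = step2
    return (txn1 != txn2             # 1. different transactions
            and obj1 == obj2         # 2. same object
            and "W" in (op1, op2))   # 3. at least one is a write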

This gives us three types of conflicts:

Read-Write Conflicts: The Unrepeatable Read

T1: R(A) → sees $10
T2: W(A) → writes $19
T1: R(A) → sees $19

T1 reads A twice in the same transaction and gets different values! The data changed underneath it. This is called an unrepeatable read.

Write-Read Conflicts: The Dirty Read

T1: W(A) → writes $12 (not committed yet)
T2: R(A) → reads $12
T2: W(A) → writes $14 (based on dirty data)
T2: COMMIT
T1: ROLLBACK ← Oh no!

T2 read data that T1 wrote but never committed. That data never "really existed" because T1 rolled back. T2 made decisions based on a lie. This is a dirty read.

đŸ’ģ
Real World Example

You're booking the last seat on a flight. The reservation system reads "1 seat available" from a transaction that's updating inventory but hasn't committed. You book the seat. That transaction rolls back. Turns out there were actually 0 seats. Now you're stuck at the airport arguing with gate agents.

Write-Write Conflicts: The Lost Update

T1: W(A) → writes "Bob"
T2: W(A) → writes "Alice"

T2 overwrites data that T1 wrote but hasn't committed. T1's update vanishes without any transaction ever having seen it. This is the lost update problem.

Conflict Serializability: The Practical Standard

Now we can formally define what makes a schedule acceptable. A schedule is conflict serializable if we can transform it into a serial schedule by swapping adjacent non-conflicting operations.

The Dependency Graph Trick

Here's a clever way to check if a schedule is conflict serializable:

  1. Draw one node for each transaction
  2. Draw an edge from Ti to Tj if Ti has an operation that conflicts with an operation in Tj, and Ti's operation comes first
  3. If the graph has a cycle, the schedule is NOT conflict serializable
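
Here's a minimal sketch of the whole check, reusing the conflicts() predicate from earlier. Cycle detection is a depth-first search for back edges:

from itertools import combinations

def dependency_graph(schedule):
    # Edge Ti -> Tj for every conflicting pair where Ti's step comes first.
    return {(s1[0], s2[0])
            for s1, s2 in combinations(schedule, 2)
            if conflicts(s1, s2)}

def has_cycle(edges):
    # Depth-first search; a back edge to an in-progress node means a cycle.
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, set()).add(dst)
        graph.setdefault(dst, set())
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {node: WHITE for node in graph}

    def visit(node):
        color[node] = GRAY
        for nxt in graph[node]:
            if color[nxt] == GRAY:  # back edge found
                return True
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(visit(n) for n in graph if color[n] == WHITE)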

Example: The Bad Schedule

T1: R(A), W(A), R(B), W(B)
T2: R(A), W(A), R(B), W(B)

With interleaving:

T1: R(A), W(A)
T2: R(A), W(A)
T2: R(B), W(B)
T1: R(B), W(B)

Dependency graph:

T1 → T2  (T1 writes A, T2 reads A - T1 must come first)
T2 → T1  (T2 writes B, T1 reads B - T2 must come first)

There's a cycle! T1 needs to come before T2 AND T2 needs to come before T1. Impossible! This schedule is not conflict serializable.
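
Feeding this schedule to the sketch above finds exactly that cycle:

bad = [
    ("T1", "R", "A"), ("T1", "W", "A"),
    ("T2", "R", "A"), ("T2", "W", "A"),
    ("T2", "R", "B"), ("T2", "W", "B"),
    ("T1", "R", "B"), ("T1", "W", "B"),
]
edges = dependency_graph(bad)
print(edges)             # {('T1', 'T2'), ('T2', 'T1')} (set order may vary)
print(has_cycle(edges))  # True -> not conflict serializable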

💡
Why This Matters

The dependency graph gives us a mechanical way to check serializability. If there's no cycle, we can find a valid serial order by doing a topological sort of the graph. This is how the DBMS reasons about schedules!

View Serializability: The Broader Definition

Conflict serializability is practical, but it's also conservative - it rejects some schedules that are actually correct.

View serializability is more permissive. Two schedules are view equivalent if:

  1. If Ti reads the initial value of A in one schedule, it reads the initial value in the other
  2. If Ti reads a value of A written by Tj in one schedule, it does so in the other
  3. If Ti writes the final value of A in one schedule, it does so in the other

Consider this schedule:

T1: R(A)
T2: W(A)
T1: W(A)
T3: W(A)

The dependency graph has a cycle between T1 and T2 (T1 reads A before T2's write, but T2 writes A before T1's write), so it's not conflict serializable. But it IS view serializable! Why? Because it's view equivalent to the serial schedule T1→T2→T3: T1 reads the initial value of A in both, and T3 writes the final value of A in both. The intermediate writes by T1 and T2 don't matter - they're overwritten anyway.

This is called a blind write - writing a value without reading it first.
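
The earlier sketch confirms the cycle, even though a quick check shows T3 performs the final write, just as it would in the serial order T1→T2→T3:

blind = [
    ("T1", "R", "A"),
    ("T2", "W", "A"),
    ("T1", "W", "A"),
    ("T3", "W", "A"),
]
print(has_cycle(dependency_graph(blind)))  # True: not conflict serializable

# View equivalence with serial T1 -> T2 -> T3: T1 reads the initial A in
# both schedules, and the last write to A comes from T3 in both.
final_writer = [txn for txn, op, obj in blind if op == "W" and obj == "A"][-1]
print(final_writer)  # T3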

â„šī¸
Why Don't Databases Use View Serializability?

Checking view serializability is NP-Complete. It's computationally expensive and impractical for real-time transaction processing. Conflict serializability is polynomial time and good enough for 99.9% of cases.

The Universe of Schedules

┌─────────────────────────────────────┐
│      All Possible Schedules         │
│  ┌───────────────────────────────┐  │
│  │   View Serializable           │  │
│  │  ┌─────────────────────────┐  │  │
│  │  │ Conflict Serializable   │  │  │
│  │  │  ┌───────────────────┐  │  │  │
│  │  │  │  Serial Schedules │  │  │  │
│  │  │  └───────────────────┘  │  │  │
│  │  └─────────────────────────┘  │  │
│  └───────────────────────────────┘  │
└─────────────────────────────────────┘

Most databases enforce conflict serializability because:

  • It's efficient to check
  • It covers the vast majority of practical cases
  • It can be enforced with locks, timestamps, or optimistic methods

How Do We Actually Enforce This?

We've talked about what serializability means, but not how to enforce it. That's the job of concurrency control protocols, which come in two flavors:

Pessimistic: Assume conflicts will happen, prevent them proactively

  • Two-Phase Locking (2PL) - most common
  • Timestamp Ordering
  • "Don't let problems arise in the first place"

Optimistic: Assume conflicts are rare, deal with them when detected

  • Optimistic Concurrency Control (OCC)
  • Multi-Version Concurrency Control (MVCC)
  • "Let transactions run freely, check for conflicts at commit time"

We'll dive deep into these in the next article, but the key insight is that all of them are trying to ensure the schedules they produce are serializable.

📝
Important Distinction

This article is about checking whether schedules are correct. The next article is about generating correct schedules in the first place. The theory tells us what's correct; the protocols tell us how to achieve it.

The NoSQL Backlash (That's Now Backtracking)

Around 2010, the NoSQL movement said "transactions are slow, ACID is overkill, eventual consistency is fine!" Systems like early MongoDB and Cassandra threw out strict serializability for performance.

And you know what? They were fast! They could handle millions of writes per second!

They also had data corruption, lost writes, and developers pulling their hair out debugging race conditions.

The pendulum has swung back. Modern databases (NewSQL, distributed SQL) are proving you can have both performance AND correctness. Turns out the computer scientists in the 1970s knew what they were doing.

đŸ”Ŧ
Historical Note

The theory of serializability was developed in the 1970s-1980s by pioneers like Jim Gray, Phil Bernstein, and Christos Papadimitriou. It's stood the test of time because it's based on fundamental principles, not implementation details.

TL;DR

The Problem: Multiple concurrent transactions can interfere with each other, causing lost updates, dirty reads, and inconsistent data.

The Solution: Ensure all schedules are serializable - equivalent to some serial execution.

Key Concepts:

  • Conflicting operations: Two operations on the same object from different transactions, at least one is a write
  • Conflict serializability: Can transform the schedule into a serial one by swapping non-conflicting operations (check with dependency graphs)
  • View serializability: Broader definition, but too expensive to enforce in practice

Types of Conflicts:

  • Read-Write: Unrepeatable reads
  • Write-Read: Dirty reads
  • Write-Write: Lost updates

Next Time: We'll learn about Two-Phase Locking, MVCC, and how databases actually enforce serializability in practice. The theory is beautiful; the implementation is where the magic happens! 🔒