How Databases Actually Store Your Data

You write INSERT INTO users (name, email) VALUES ('Alice', 'alice@example.com') and hit enter. It works! Magic!

But have you ever wondered what actually happens? Where does "Alice" go? How does the database find her again when you run SELECT * FROM users WHERE name = 'Alice'?

The beautiful abstraction of tables, rows, and columns is just that - an abstraction. Under the hood, your database is playing Tetris with bytes on a spinning disk (or SSD), trying to pack data efficiently while making it fast to retrieve.

Let's pop the hood and see how this really works.

The Great Illusion: Logical vs Physical

When you think about a database, you probably imagine something like this:

users table:
┌────┬───────┬──────────────────┬─────┐
│ id │ name  │ email            │ age │
├────┼───────┼──────────────────┼─────┤
│ 1  │ Alice │ alice@example.com│ 28  │
│ 2  │ Bob   │ bob@example.com  │ 35  │
│ 3  │ Carol │ carol@example.com│ 42  │
└────┴───────┴──────────────────┴─────┘

Nice, neat rows and columns. Very spreadsheet-like. This is the logical view - how humans think about data.

But on disk? It looks more like:

01001000 01100101 01101100 01101100 01101111 00100000
01010111 01101111 01110010 01101100 01100100 00100001
...millions more bytes...

The physical view is just bytes in files. The database's job is to bridge this gap - to take your neat logical tables and figure out how to jam them into bytes efficiently.

ℹ️
The Storage Manager's Job

The storage manager is the part of the DBMS that translates between "give me the user with id=42" (logical) and "read bytes 8192-8256 from file users.db" (physical). It's like a translator between two completely different languages.

The Storage Hierarchy: A Tale of Speed and Money

Before we dive into how data is stored, we need to understand the hardware reality. Not all storage is created equal:

CPU Registers:     ~1 ns         (Tiny, blazing fast, $$$$$)
CPU Cache:         ~1-10 ns      (Small, very fast, $$$$)
RAM:               ~100 ns       (Medium, fast, $$$)
SSD:               ~100 µs       (Large, pretty fast, $$)
HDD:               ~10 ms        (Huge, slow, $)
Network Storage:   ~100+ ms      (Infinite, slower, $)

Notice that gap between RAM and SSD? 1,000x slower. And HDD? 100,000x slower than RAM.

This is why databases are obsessed with keeping data in memory (RAM) and avoiding disk I/O at all costs. Every disk access is a tragedy. Every cache hit is a celebration.

⚠️
The Performance Reality

You can execute millions of CPU instructions in the time it takes to read one block from a hard disk. This is why database design is all about minimizing I/O - the CPU is sitting there twiddling its thumbs waiting for the disk.
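
Here's that claim as back-of-envelope arithmetic in Python, using the round numbers from the table above:

# Rough figures: ~1 ns per simple CPU instruction
instruction_ns = 1
hdd_read_ns = 10_000_000    # ~10 ms for one random HDD read
ssd_read_ns = 100_000       # ~100 µs for one random SSD read

print(f"{hdd_read_ns // instruction_ns:,} instructions per HDD read")  # 10,000,000
print(f"{ssd_read_ns // instruction_ns:,} instructions per SSD read")  # 100,000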

Pages: The Fundamental Unit of I/O

Here's a key insight: databases don't read individual rows from disk. That would be insane. Instead, they work with pages (also called blocks).

A page is a fixed-size chunk of data, typically 4KB, 8KB, or 16KB. When you ask for one row, the database reads an entire page containing that row (and probably many other rows too).

Why? Because of how disks work. Reading 1 byte from disk takes about the same time as reading 8KB - you pay for the seek time either way. Might as well read a decent chunk while you're there.

Disk File:
┌────────────┬────────────┬────────────┬────────────┐
│  Page 0    │  Page 1    │  Page 2    │  Page 3    │
│  (8 KB)    │  (8 KB)    │  (8 KB)    │  (8 KB)    │
└────────────┴────────────┴────────────┴────────────┘
     ↓
 Contains multiple rows:
 ┌──────────┐
 │ Row 1    │
 │ Row 2    │
 │ Row 3    │
 │ Row 4    │
 │ ...      │
 └──────────┘

Everything in a database happens at page granularity:

  • Read a row? Read the whole page
  • Update a row? Read the page, modify it in memory, write the whole page back
  • Lock a row? Actually lock the whole page (in some systems)
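
At the file level, a page read is just a seek to a page boundary plus one fixed-size read. Here's a minimal Python sketch, assuming 8KB pages and the users.db file from earlier:

import os

PAGE_SIZE = 8192  # 8 KB, as in PostgreSQL

def read_page(file_path, page_no):
    """Read one whole page from a data file; this is the unit of I/O."""
    with open(file_path, "rb") as f:
        f.seek(page_no * PAGE_SIZE)   # jump to the page boundary
        return f.read(PAGE_SIZE)      # read the entire page, never less

# page = read_page("users.db", 0)    # one I/O, brings in every row on the page
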
💡
Page Size Matters

Bigger pages = fewer I/O operations but more wasted space and higher contention. Smaller pages = more I/O but better space utilization. Most databases settle on 8KB as a reasonable compromise. PostgreSQL uses 8KB, MySQL InnoDB uses 16KB.

Inside a Page: Slotted Page Layout

So we've got an 8KB page. How do we store rows in it? The most common approach is the slotted page structure:

┌──────────────────────────────────────────┐ ← Page Start (8KB)
│           Page Header                     │
│  - Number of slots used                   │
│  - Free space pointer                     │
│  - Page checksum                          │
├──────────────────────────────────────────┤
│           Slot Array                      │
│  Slot 0: [offset=7800, length=120]       │
│  Slot 1: [offset=7500, length=180]       │
│  Slot 2: [offset=7200, length=150]       │
│  ...                                      │
├──────────────────────────────────────────┤
│                                           │
│         Free Space (grows down)           │
│                                           │
├──────────────────────────────────────────┤
│  Tuple 2: [data...]                      │ ← Offset 7200
│  Tuple 1: [data...]                      │ ← Offset 7500
│  Tuple 0: [data...]                      │ ← Offset 7800
└──────────────────────────────────────────┘ ← Page End

The clever bit: the slot array grows down from the top, the actual tuple data grows up from the bottom. They meet in the middle. When they collide, the page is full.

Why this design?

  • Indirection: Want to move a tuple within the page? Just update the slot's offset, don't touch anything else
  • Efficient deletion: Mark a slot as empty, reuse it later
  • Variable-length records: No problem, just store the actual length in the slot

💻
Example: Finding Row 5

1. Database knows row 5 is on page 12
2. Read page 12 into memory (8KB I/O operation)
3. Look at slot 5 in the slot array: offset=7500, length=180
4. Jump to byte 7500 in the page, read 180 bytes
5. That's your row!

All this happens in microseconds once the page is in memory.
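
Here's a sketch of steps 3-4 in Python, assuming a toy header format (a 16-bit slot count, then 16-bit offset/length pairs - real headers carry more, like checksums and free space pointers):

import struct

def read_tuple(page: bytes, slot_no: int) -> bytes:
    """Fetch one tuple from a slotted page via the slot array."""
    (num_slots,) = struct.unpack_from("<H", page, 0)   # header: slot count
    assert slot_no < num_slots, "no such slot"
    # Each slot entry is two 16-bit ints (offset, length), right after the header
    offset, length = struct.unpack_from("<HH", page, 2 + slot_no * 4)
    return page[offset:offset + length]                # jump and slice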

Tuple Layout: How Rows Become Bytes

Inside each slot, we've got the actual row data (called a tuple). How is it laid out?

Fixed-Length Fields (Simple):

Row: (id=42, age=28, salary=50000)

┌────────┬────────┬────────────┐
│   42   │   28   │   50000    │
│ 4 bytes│ 4 bytes│  4 bytes   │
└────────┴────────┴────────────┘

Easy! Just concatenate the values. To find the age field, jump to byte offset 4. To find salary, jump to byte offset 8.
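
Python's struct module makes this concrete - a sketch assuming three little-endian 32-bit integers:

import struct

ROW_FMT = "<iii"  # id, age, salary: three 4-byte ints, 12 bytes total

row = struct.pack(ROW_FMT, 42, 28, 50000)

# Field access is pure offset arithmetic - no parsing needed:
(age,)    = struct.unpack_from("<i", row, 4)   # age lives at byte offset 4
(salary,) = struct.unpack_from("<i", row, 8)   # salary at byte offset 8
print(age, salary)  # 28 50000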

Variable-Length Fields (Tricky):

Row: (id=42, name="Alice", email="alice@example.com")

┌────────┬────────┬────────┬───────┬──────────────────────┐
│   42   │ off=12 │ off=17 │ Alice │ alice@example.com    │
│ 4 bytes│ 4 bytes│ 4 bytes│5 bytes│     17 bytes         │
└────────┴────────┴────────┴───────┴──────────────────────┘
         ↑        ↑
         └────────┴── Offsets to variable-length data

The fixed-length header contains offsets pointing to where the variable-length data actually lives. When you want the name, you look up its offset, jump there, and read up to the next offset - adjacent offsets (or the end of the tuple) tell you each field's length.

NULL Handling:

Many databases use a null bitmap at the start of each tuple:

┌──────────────┬────────┬────────┬────────┐
│ Null Bitmap  │ Field1 │ Field2 │ Field3 │
│ (bits: 010)  │   42   │  NULL  │   28   │
└──────────────┴────────┴────────┴────────┘

Each bit indicates whether the corresponding field is NULL. If it is, you don't store the value at all - the diagram shows the logical view, but on disk Field2 contributes zero bytes. Saves space!
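
Here's a toy encoder combining the null bitmap with variable-length fields. Note this sketch uses length prefixes for strings rather than the offset array shown above - same goal, simpler to code:

import struct

def encode_tuple(fields):
    """Pack a row: 1-byte null bitmap, then only the non-NULL values."""
    null_bitmap = 0
    body = b""
    for i, value in enumerate(fields):
        if value is None:
            null_bitmap |= 1 << i               # set the bit; store zero bytes
        elif isinstance(value, int):
            body += struct.pack("<i", value)    # fixed-length 4-byte int
        else:
            raw = str(value).encode()
            body += struct.pack("<H", len(raw)) + raw   # length-prefixed string
    return bytes([null_bitmap]) + body

print(encode_tuple([42, None, 28]).hex())       # the NULL costs one bit, no bytes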

Heap Files: The Simplest Storage Structure

Now that we know how to store rows in pages, how do we organize pages into files? The simplest approach is a heap file - just a random collection of pages with no particular order.

users.heap file:
┌──────────┬──────────┬──────────┬──────────┐
│ Page 0   │ Page 1   │ Page 2   │ Page 3   │
│ [rows]   │ [rows]   │ [rows]   │ [rows]   │
└──────────┴──────────┴──────────┴──────────┘
   ↓ No particular order!
   Rows inserted wherever there's space

Insertion: Find a page with free space (keep a free space map), stick the new row there.

Lookup by ID: Scan every single page until you find it. Slow! This is why we need indexes.

Deletion: Mark the row as deleted, or compact the page to reclaim space.
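
A toy in-memory version of insertion and lookup (deletion works similarly), with a free space map tracking room per page - the four-tuple page capacity is an illustrative assumption:

PAGE_CAPACITY = 4                      # tuples per page, tiny for illustration

pages = [[]]                           # heap file: unordered list of pages
free_space_map = {0: PAGE_CAPACITY}    # page_no -> free slots

def insert(row):
    for page_no, free in free_space_map.items():
        if free > 0:                   # first page with room wins
            pages[page_no].append(row)
            free_space_map[page_no] -= 1
            return
    pages.append([row])                # no room anywhere: grow the file
    free_space_map[len(pages) - 1] = PAGE_CAPACITY - 1

def lookup(row_id):
    for page in pages:                 # full scan: every page, every row
        for row in page:
            if row["id"] == row_id:
                return row
    return None

insert({"id": 1, "name": "Alice"})
print(lookup(1))                       # found, but only after scanning everything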

Heap files are simple but have terrible performance for searches. Finding one specific row means reading the entire table. For a million-row table, that's thousands of I/O operations.

This is where indexes save the day.

📝
When Heap Files Are Okay

If you're always scanning the entire table anyway (like for analytics), heap files are fine. No point in maintaining indexes if you're going to read everything. But for OLTP workloads with point queries? You absolutely need indexes.

Indexes: The Database's Phone Book

An index is a separate data structure that organizes keys for fast lookup, so the database can find rows without scanning. It's like the index in the back of a book - instead of reading every page to find "Serializability," you look it up in the index and jump straight to page 347.

B-Tree Index: The King of Indexes

The B-Tree (actually B+Tree in most databases) is the workhorse index structure. It's a balanced tree where:

  • Internal nodes contain keys and pointers to child nodes
  • Leaf nodes contain keys and pointers to actual rows (or row IDs)
  • All leaf nodes are at the same depth
  • Tree stays balanced on inserts/deletes

                        [50, 100]
                       /    |    \
            [10,30,40]   [60,80]   [120,150]
             / | | \     /  |  \    /  |  \
           [ ... leaf pages with data / row pointers ... ]

Finding id=75:

  1. Start at root: 75 is between 50 and 100, go middle
  2. At [60, 80]: 75 is between 60 and 80, go middle
  3. At leaf node, find the record or pointer to page containing id=75
  4. Read that page, extract the row

For a million-row table, a B-Tree might have height 3-4. That's only 3-4 I/O operations to find any row! Compare that to scanning thousands of pages in a heap file.
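
The height claim is easy to sanity-check: with fanout f, a tree of height h covers roughly f^h keys. Assuming a fanout of 250 (plausible for 8KB pages, though the real number depends on key size):

import math

fanout = 250                     # children per node, an assumption for 8KB pages
for rows in (1_000_000, 1_000_000_000):
    height = math.ceil(math.log(rows, fanout))
    print(f"{rows:>13,} rows -> height {height}")
# 1,000,000 rows -> height 3; even a billion rows only needs height 4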

💡
Why B-Trees?

B-Trees have high fanout (hundreds of children per node), which keeps the tree shallow. Fewer levels = fewer I/O operations. They're also self-balancing and handle range queries beautifully (all leaves are linked, just traverse left to right).

Hash Index: Fast but Limited

Hash indexes use a hash function to map keys directly to buckets:

hash(id=42) = 7 → Bucket 7 → [pointers to rows with id=42]
hash(id=100) = 3 → Bucket 3 → [pointers to rows with id=100]

Pros: O(1) lookups for exact matches - incredibly fast!

Cons: Can't do range queries. WHERE id > 50 requires scanning all buckets. Also, hash collisions need to be handled.

Hash indexes are great for equality lookups (WHERE id = 42) but terrible for anything else. B-Trees handle both equality and ranges, which is why they're more popular.
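
A Python dict is essentially a hash index (it handles collisions internally), which makes the trade-off easy to see:

rows = [{"id": i, "email": f"user{i}@example.com"} for i in range(100_000)]

hash_index = {row["id"]: row for row in rows}   # hash(key) -> bucket -> row

print(hash_index[42])                           # O(1): hash once, jump to bucket
big = [r for r in rows if r["id"] > 99_990]     # range query: the index is
print(len(big))                                 # useless, full scan required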

Clustered vs Non-Clustered Indexes

Clustered Index: The table data itself is organized by the index key. The leaf nodes of the index ARE the actual rows.

B-Tree (clustered on id):
Leaf nodes contain: [id=10, name="Alice", ...full row data...]
                    [id=20, name="Bob", ...full row data...]

Benefit: Finding a row by the clustered key is super fast - one index lookup and you have the whole row.

Cost: You can only have ONE clustered index per table (because the data can only be physically sorted one way). In MySQL InnoDB, the primary key is always clustered.

Non-Clustered Index: Leaf nodes contain row IDs or pointers, not the actual data.

B-Tree (non-clustered on email):
Leaf nodes contain: [email="alice@ex.com", row_id=1]
                    [email="bob@ex.com", row_id=2]

To get the full row, you need two lookups:

  1. Search the index to find row_id
  2. Look up row_id in the main table (clustered index or heap)

This is called an index lookup or bookmark lookup. It's slower than a clustered-index lookup but still way faster than scanning the whole table.
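
Here's the two-step dance in miniature, with plain dicts standing in for the index and the table (a sketch of the concept, not any engine's actual code):

# Non-clustered index: key -> row_id only, no row data
email_index = {"alice@ex.com": 1, "bob@ex.com": 2}

# Clustered index / heap: row_id -> full row
table = {1: {"id": 1, "name": "Alice", "email": "alice@ex.com"},
         2: {"id": 2, "name": "Bob",   "email": "bob@ex.com"}}

row_id = email_index["alice@ex.com"]   # lookup 1: search the secondary index
row = table[row_id]                    # lookup 2: bookmark lookup into the table
print(row)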

💻
Real Query Example

SELECT * FROM users WHERE email = 'alice@example.com'

Without index on email: Scan entire heap file (1000+ I/O operations)

With non-clustered index on email:
1. Search B-Tree index (3-4 I/O operations) → find row_id=42
2. Look up row_id=42 in clustered index (1-2 I/O operations)
Total: ~5 I/O operations vs 1000+

That's a 200x speedup!

The Buffer Pool: RAM to the Rescue

Remember how disk I/O is 100,000x slower than RAM? The buffer pool (also called buffer cache) is the database's attempt to minimize this pain.

The buffer pool is a large chunk of RAM (often gigabytes) that caches pages from disk:

┌─────────────────────────────────────────┐
│         Buffer Pool (RAM)                │
├─────────────────────────────────────────┤
│  Frame 0: Page 42 (dirty)               │
│  Frame 1: Page 17 (clean)               │
│  Frame 2: Page 99 (dirty)               │
│  Frame 3: Page 5  (clean)               │
│  ...                                     │
│  Frame N: Empty                          │
└─────────────────────────────────────────┘
         ↕ (only on cache miss)
┌─────────────────────────────────────────┐
│           Disk Storage                   │
└─────────────────────────────────────────┘

How it works:

  1. Query needs page 42
  2. Check buffer pool: Is page 42 already in memory?
  3. Cache hit: Great! Use it directly. No disk I/O! 🎉
  4. Cache miss: Sad. Read page 42 from disk, put it in buffer pool, evict something else if full

Dirty pages: Pages that have been modified in memory but not yet written to disk. Eventually they need to be flushed back to disk (called write-back).

Replacement policy: When the buffer pool is full and you need to load a new page, which one do you evict? Most databases use LRU (Least Recently Used) or variants like Clock or LRU-K.
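
An LRU buffer pool fits in a few lines with an ordered dict - a sketch only; real pools add latches, dirty-page tracking, and background flushing:

from collections import OrderedDict

class BufferPool:
    def __init__(self, capacity, read_page_from_disk):
        self.capacity = capacity
        self.read_from_disk = read_page_from_disk
        self.frames = OrderedDict()            # page_no -> page bytes, LRU order

    def get_page(self, page_no):
        if page_no in self.frames:             # cache hit: no disk I/O!
            self.frames.move_to_end(page_no)   # mark as most recently used
            return self.frames[page_no]
        if len(self.frames) >= self.capacity:  # full: evict least recently used
            self.frames.popitem(last=False)    # (a dirty page would be flushed first)
        page = self.read_from_disk(page_no)    # cache miss: the expensive path
        self.frames[page_no] = page
        return page

# pool = BufferPool(capacity=1000, read_page_from_disk=lambda n: read_page("users.db", n))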

ℹ️
The 80-20 Rule

Typically, 80% of queries access 20% of the data. If your buffer pool can hold that "hot" 20%, your cache hit rate will be ~80%. This is why throwing more RAM at a database often dramatically improves performance - more cache hits!

Sequential vs Random I/O: The Secret to Performance

Not all I/O is created equal. Sequential I/O (reading consecutive pages) is MUCH faster than random I/O (reading scattered pages).

Why? Mechanical sympathy. On an HDD:

  • Sequential read: Read head is already in position, just keep reading. Fast!
  • Random read: Move read head to new location (seek time ~10ms), then read. Slow!

Even on SSDs, sequential I/O is faster - controllers can prefetch and parallelize large contiguous reads in ways they can't for scattered ones.

This is why database design obsesses over data locality:

  • Keep related data on the same page or adjacent pages
  • Use clustered indexes to physically sort data by common access patterns
  • Partition large tables to keep hot data together

Table scans (reading the entire table sequentially) are actually pretty fast IF you're going to read most of the data anyway. Reading 1000 pages sequentially might be faster than reading 50 pages randomly!

This is why the query optimizer sometimes chooses a table scan over using an index - if you're retrieving a large percentage of rows, scanning is more efficient.
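
You can sanity-check that claim with rough HDD numbers - assuming ~10ms per seek and ~150MB/s sequential transfer (illustrative figures):

SEEK_MS = 10                                  # one random HDD seek
TRANSFER_MS_PER_PAGE = 8 / 150_000 * 1000     # 8KB page at ~150 MB/s ≈ 0.05 ms

sequential_1000 = SEEK_MS + 1000 * TRANSFER_MS_PER_PAGE   # one seek, then stream
random_50       = 50 * (SEEK_MS + TRANSFER_MS_PER_PAGE)   # a seek per page

print(f"1000 sequential pages: ~{sequential_1000:.0f} ms")  # ~63 ms
print(f"  50 random pages:     ~{random_50:.0f} ms")        # ~503 ms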

⚠️
Index Selectivity Matters

An index on gender (2 values) is almost useless - the optimizer will likely ignore it and scan the table.

An index on email (unique values) is incredibly valuable - it makes queries 1000x faster.

The more selective (fewer duplicates) the index, the more useful it is.

Column-Oriented Storage: A Different Approach

Everything we've discussed so far assumes row-oriented storage - rows are stored together. But there's another way: column-oriented storage.

Row-oriented (traditional):

Page 0: [Row1: id=1, name="Alice", age=28]
        [Row2: id=2, name="Bob", age=35]
        [Row3: id=3, name="Carol", age=42]

Column-oriented:

Page 0: [id column: 1, 2, 3, 4, 5, ...]
Page 1: [name column: "Alice", "Bob", "Carol", ...]
Page 2: [age column: 28, 35, 42, ...]

All values for one column are stored together!

Benefits:

  • Analytical queries: SELECT AVG(age) FROM users only reads the age column, ignores name/email. Huge I/O savings!
  • Compression: Similar values compress better. A column of integers compresses 10x-100x better than mixed row data
  • SIMD: Modern CPUs can process arrays of similar values super fast

Drawbacks:

  • OLTP queries: SELECT * FROM users WHERE id=42 needs to read multiple column files and reassemble the row. Slow!
  • Updates: Updating one row requires touching multiple column files

This is why column stores like ClickHouse, Vertica, and Redshift are amazing for analytics (read-heavy, aggregate queries) but terrible for OLTP (transactional, row-level updates).

Row stores are increasingly borrowing these tricks - PostgreSQL, for example, is row-oriented at its core but has column-store extensions for analytics.
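
The I/O difference shows up even in a toy model with columns as parallel Python lists (a sketch of the idea, not any particular engine):

# Row-oriented: every query touches full rows
rows = [(1, "Alice", 28), (2, "Bob", 35), (3, "Carol", 42)]

# Column-oriented: each column is its own array (its own pages on disk)
ids   = [1, 2, 3]
names = ["Alice", "Bob", "Carol"]
ages  = [28, 35, 42]

avg_age = sum(ages) / len(ages)          # reads only the age column: tiny I/O
one_row = (ids[1], names[1], ages[1])    # point query: must touch all three arrays
print(avg_age, one_row)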

Data Files in Practice: PostgreSQL Example

Let's see how PostgreSQL actually organizes data on disk:

/var/lib/postgresql/data/
├── base/                    ← Database files
│   ├── 16384/              ← Database OID
│   │   ├── 16385           ← Table file (heap)
│   │   ├── 16385_fsm       ← Free space map
│   │   ├── 16385_vm        ← Visibility map
│   │   ├── 16386           ← Index file (B-Tree)
│   │   └── ...
├── pg_wal/                 ← Write-ahead log
└── pg_xact/                ← Transaction commit log

  • Table file (16385): Heap of pages, each 8KB
  • Free space map: Tracks which pages have free space for inserts
  • Visibility map: Tracks which pages have all rows visible to all transactions (for vacuum optimization)
  • Index files: B-Tree structures, also page-based

When you INSERT a row, PostgreSQL:

  1. Checks the free space map for a page with room
  2. Loads that page into the buffer pool
  3. Adds the row to the page using the slotted layout
  4. Records the change in the write-ahead log (pg_wal)
  5. Marks the page as dirty
  6. Eventually writes the page back to disk

🔬
PostgreSQL Page Anatomy

You can actually inspect pages using the pageinspect extension:

CREATE EXTENSION pageinspect;
SELECT * FROM heap_page_items(get_raw_page('users', 0));

This shows you the slot array, tuple offsets, free space - everything we've discussed! It's like an X-ray of your database.

TL;DR

The Storage Hierarchy:

  • RAM is fast (~100ns), disk is slow (~10ms)
  • Minimize I/O at all costs!

Pages are the fundamental unit:

  • Fixed-size chunks (typically 8KB)
  • Everything happens at page granularity
  • Slotted page layout for flexible tuple storage

Heap files are simple but slow:

  • Unordered collection of pages
  • Scans require reading everything
  • Need indexes for fast lookups

Indexes make queries fast:

  • B-Trees: balanced, support ranges, most common
  • Hash indexes: fast equality, no ranges
  • Clustered vs non-clustered trade-offs

Buffer pool caches hot data:

  • Keep frequently accessed pages in RAM
  • LRU eviction policy
  • High cache hit rate = fast database

Sequential I/O >> Random I/O:

  • Keep related data together
  • Data locality matters enormously
  • Sometimes scans beat indexes!

Column stores for analytics:

  • Store columns separately
  • Great compression and SIMD
  • Fast aggregates, slow row retrieval

Next time you run a query, picture the journey: SQL → query plan → index traversal → page reads → buffer pool → disk → pages → slots → tuples → bytes. It's a beautiful dance of abstraction layers, all working together to make SELECT look simple!