Luke Sun

Chapter 1: B-Tree and B+Tree — The Mathematical Skeleton of Indexes


1. Definition

A B-Tree (Balanced Multi-way Search Tree) is a self-balancing tree data structure that maintains sorted data and allows searches, sequential access, insertions, and deletions in logarithmic time. It was defined by Rudolf Bayer and Edward M. McCreight in 1972.

A B+Tree is a variant of the B-Tree with the following characteristics:

  1. Internal nodes do not store data: They only store keys to act as indexes.
  2. Linked Leaf Nodes: All data is stored in the leaf nodes, which are linked together (often singly or doubly) to facilitate range scans.
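The two node kinds described above can be sketched as minimal Python classes. This is an illustrative skeleton only, not how a real storage engine lays out pages; the class and field names are my own:

```python
class InternalNode:
    """Routing node: holds separator keys only, never row data."""
    def __init__(self):
        self.keys = []       # sorted separator keys
        self.children = []   # len(children) == len(keys) + 1

class LeafNode:
    """Leaf node: holds the actual records, linked to its right sibling."""
    def __init__(self):
        self.keys = []       # sorted keys
        self.rows = []       # the record for keys[i] is rows[i]
        self.next = None     # sibling pointer used for range scans
```

The sibling pointer (`next`) is what makes a range scan a horizontal walk rather than repeated root-to-leaf descents.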

2. Technical Depth: Why B+Tree instead of Binary Trees?

The primary reason database indexes use B+Trees rather than Red-Black trees or AVL trees is to minimize disk I/O operations.

A. Fan-out and Tree Height

On disk, data is read in units called Pages (InnoDB defaults to 16KB).

  • Binary Trees: Each node has only 2 children. The height H is approximately log₂ N. For 10 million rows, H ≈ 24, requiring up to 24 disk I/Os.
  • B+Trees: By increasing the Fan-out (e.g., storing 100 keys per node), the tree height H is drastically reduced. For the same 10 million rows, the height is typically only 3-4 levels.
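The height numbers above follow directly from the logarithm base. A quick back-of-the-envelope check:

```python
import math

def tree_height(n_rows, fanout):
    """Approximate height of a balanced search tree: ceil(log_fanout(n_rows))."""
    return math.ceil(math.log(n_rows) / math.log(fanout))

n = 10_000_000
print(tree_height(n, 2))    # binary tree: 24 levels
print(tree_height(n, 100))  # B+Tree with fan-out 100: 4 levels
```

With a 16KB page holding on the order of hundreds of keys, three to four page reads reach any of 10 million rows, and the top levels usually stay cached in the Buffer Pool.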

B. Disk Pre-fetching and Range Scans

Since internal nodes do not store data, a single Page can hold significantly more keys, further increasing fan-out. Additionally, the linked structure of leaf nodes allows Range Scans to move horizontally across the leaf layer without re-traversing parent nodes.

3. Visualizing the Invisible: The Page Split Process

When a Page is full (reaches its fill-factor limit) and a new record is inserted, the database triggers a low-level operation: the Page Split.

```mermaid
sequenceDiagram
    participant App as Insert Request
    participant Buffer as Buffer Pool (RAM)
    participant PageA as Page 101 (Full)
    participant PageB as Page 102 (New)

    App->>Buffer: Insert Key 50
    Buffer->>PageA: Attempt Write
    Note over PageA: Overflow Detected

    Buffer->>PageB: Request New Page
    PageA->>PageB: Migrate portion of records (e.g., Keys 30-49)
    Note over PageA,PageB: Nodes Rebalanced

    Buffer-->>Buffer: Update Parent Index Pointer
    Note over Buffer: Disk I/O Amplification begins
```
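The migration step in the diagram can be modeled as a simple 50/50 split. This is a toy sketch: real engines split at a configurable point, special-case rightmost (sequential) inserts, and must also insert the separator key into the parent, recursively splitting it if the parent is full too:

```python
def page_split(full_page):
    """Split an overflowing sorted page in half.

    Returns (new_page, separator): the new right-hand page and the key
    that the parent index must now point at.
    """
    mid = len(full_page) // 2
    new_page = full_page[mid:]   # upper half migrates to the new page
    del full_page[mid:]          # old page keeps the lower half
    separator = new_page[0]      # smallest key of the new page
    return new_page, separator

page = [10, 20, 30, 40, 50]
new_page, sep = page_split(page)
print(page, new_page, sep)  # [10, 20] [30, 40, 50] 30
```

Note that after the split both pages are only about half full, which is exactly how random inserts erode the Fill Factor.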

4. Real-World Case: Instagram’s ID Generation Strategy (2011)

Background: In 2011, the Instagram engineering team needed a high-performance ID generation system that could scale across shards.

Technical Choice: They initially considered UUID (Universally Unique Identifier). Reasons for Rejection:

  1. Index Size: UUIDs are 128-bit, while bigint is 64-bit. For tables with billions of records, the index becomes bloated and difficult to fit in RAM.
  2. B+Tree Performance Nightmare: UUIDs are random. In a B+Tree, this forces data to be inserted into random Pages. Frequent Page Splits cause physical storage to become fragmented and drastically lower the Fill Factor. This results in heavy random disk I/O, slowing down writes and squeezing Buffer Pool space.

Final Solution: Inspired by Twitter’s Snowflake, they designed a 64-bit ID with a Timestamp Prefix.

  • Advantage: Because IDs are chronologically ordered, new records are mostly appended to the rightmost leaf nodes of the B+Tree. This avoids frequent random splits in the middle of the tree and generally maintains a higher Fill Factor (depending on the split policy), significantly improving cache hit rates and Write Locality.
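A Snowflake-style ID of this shape can be sketched as below. The bit layout (41 bits of milliseconds since a custom epoch, 13 bits of shard ID, 10 bits of per-shard sequence) and the epoch constant are illustrative assumptions, not Instagram's exact production values:

```python
import time

EPOCH_MS = 1_293_840_000_000  # assumed custom epoch: 2011-01-01 00:00:00 UTC

def make_id(shard_id, sequence, now_ms=None):
    """Pack a 64-bit ID: [41-bit timestamp | 13-bit shard | 10-bit sequence]."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    ts = now_ms - EPOCH_MS
    return (ts << 23) | ((shard_id & 0x1FFF) << 10) | (sequence & 0x3FF)
```

Because the timestamp occupies the high bits, IDs generated later always compare greater, so B+Tree inserts land on the rightmost leaf regardless of which shard produced them.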

Lesson: In high-performance database design, the sequentiality of an index column is often more important than its randomness. High-quality architects trade some randomness for physical B+Tree continuity.

5. Detailed Defense & Optimization

A. Root Fix: Choose the Right Clustered Index

  • Prioritize Auto-increment IDs or keys with a natural chronological order.
  • For non-sequential keys, consider lowering the FILLFACTOR (Postgres) or innodb_fill_factor (MySQL) to reserve space for random inserts and mitigate splitting.

B. Defense in Depth: Covering Index

  • Include frequently queried fields in the secondary index.
  • Benefit: The data exists directly in the secondary index's leaf node, allowing the engine to return results without an additional back-to-table lookup; the index traversal itself remains O(log N).

C. Fragmentation Management

  • Regularly perform OPTIMIZE TABLE (MySQL) or REINDEX (Postgres) to reclaim physical space and rebalance the tree structure.

6. References