11 - Database Replication and Sharding

Replication

Replication means keeping copies of your data on multiple servers. Two main purposes: fault tolerance (if one dies, others have the data) and read scalability (spread read queries across replicas).

Leader-Follower Replication

The most common pattern. One server is the leader (primary). It handles all writes. Followers (replicas) receive a copy of every write and serve read queries.

Client writes → Leader
                  ↓
            ┌─────┼─────┐
            ▼     ▼     ▼
         Follower 1  2  3

Client reads → any Follower

Advantages: simple model, read scalability, fault tolerance.

Disadvantages: the leader is a single point of failure for writes. Followers may serve stale data (replication lag).

If the leader dies, one follower gets promoted. This is called failover. It can be automatic or manual.

Synchronous vs Asynchronous Replication

Synchronous — the leader waits for followers to confirm the write before responding to the client. Guarantees no data loss but adds latency to every write.

Asynchronous — the leader responds immediately and replicates in the background. Fast writes but followers may lag behind. If the leader crashes before replicating, data is lost.

Most systems use semi-synchronous: one follower is synchronous (guarantees at least one copy), the rest are asynchronous.

Multi-Leader Replication

Multiple servers accept writes. Used for multi-region deployments where you want low-latency writes in every region.

The problem: write conflicts. Two leaders can accept different writes to the same record at the same time. Example: a user in Tokyo edits a shared document's title via the Asia leader, while their colleague in London edits the same title via the EU leader. Both writes succeed locally. When the leaders sync, which value wins?

Conflict resolution strategies:

Last write wins — use timestamps, latest write overwrites. Simple but silently loses the other write.
Merge the changes — combine both values (works for some data types like lists, not for a name field). Complex.
Let the application resolve — store both versions, show the user a conflict to resolve (like Git merge conflicts).

Multi-leader is complex. Avoid it unless you have a strong reason (multi-region low-latency writes).

Sharding (Partitioning)

Replication copies the same data to multiple servers. Sharding splits the data across multiple servers. Each shard holds a subset of the total data.

Why shard? When your dataset is too large for a single machine, or write throughput exceeds what one server can handle.

Sharding Strategies

Range-based — shard by ranges of a key. Users A-M on shard 1, N-Z on shard 2. Simple but can create hot spots (if most users have names starting with S).

Hash-based — hash the key and mod by the number of shards. Distributes data evenly. But range queries become expensive (you must query all shards).

Directory-based — a lookup service maps each key to its shard. Flexible but the directory is a single point of failure and a potential bottleneck.

Challenges of Sharding

Sharding adds significant complexity:

Joins across shards — expensive or impossible. You must denormalize.
Rebalancing — adding or removing shards requires moving data.
Hot spots — if you shard by user ID and one user goes viral (millions of people viewing their profile), the shard holding that user gets hammered while others sit idle. Fixes include splitting the hot shard further, caching the hot data, or adding a random suffix to spread one key across multiple shards.
Transactions across shards — distributed transactions are slow and complex (two-phase commit).

The rule: don't shard until you have to. Vertical scaling and read replicas solve most problems. Shard when a single machine can't hold your data or handle your write throughput.

Consistent Hashing

With simple hash-based sharding (hash(key) % num_shards), adding or removing a shard remaps almost every key. If you go from 4 shards to 5, roughly 80% of your data needs to move. That's a nightmare at scale.

Hash Ring Structure

Both approaches use the same hash function (e.g., MD5, murmur3) to hash the key. The difference is what happens after:

Naive modulo: hash(key) % num_nodes — the result is tied to the number of nodes. Change the node count and almost every key remaps.
Hash ring: the hash maps to a fixed position on a circle. Nodes also sit on the circle. Each node owns the range from the previous node's position to its own.

Consistent hashing arranges all possible hash values in a circle (ring) from 0 to 2³²-1. Both keys and nodes are hashed onto this ring:

           0
         /   \
       N1      N3
      /           \
     |      ring    |
      \           /
       N2      N4
         \   /
         2³²-1

Each node owns a range — all keys from the previous node (clockwise) up to itself. To find which node owns a key: hash the key, walk clockwise on the ring until you hit a node. That node owns the key.

Example (ring 0 to 200):
  N1 at position 50   → owns range (150, 50]  (wraps around from N4)
  N3 at position 80   → owns range (50, 80]
  N2 at position 120  → owns range (80, 120]
  N4 at position 150  → owns range (120, 150]

hash("user:99") = 65  → lands in (50, 80] → owned by N3
hash("user:42") = 140 → lands in (120, 150] → owned by N4

Node Addition and Removal

When a new node N5 joins between N2 and N4, it only takes keys that fall between N2 and N5's position on the ring. Keys assigned to N1, N3, and the rest of N4's range don't move. Only ~1/N of data migrates instead of nearly all of it.

When N3 dies, its keys move to the next node clockwise. Only N3's keys are affected — everything else stays.

Virtual Nodes (Vnodes)

With only a few physical nodes, the ring can be unbalanced — one node might own 60% of the hash space by chance. Solution: each physical node gets multiple positions (virtual nodes) on the ring:

Physical Node A → vnode_A1, vnode_A2, vnode_A3, ... (e.g., 128 vnodes)
Physical Node B → vnode_B1, vnode_B2, vnode_B3, ...

More points on the ring means more even distribution. This also lets you assign more vnodes to beefier machines for weighted distribution.

Consistent Hashing vs Naive Hash Mod

Scenario	Naive `hash % N`	Consistent hashing
Add 1 node to 10	~90% keys move	~10% keys move
Remove 1 node from 10	~90% keys move	~10% keys move
Node failure	Massive reshuffling	Only failed node's keys redistribute

Used by: Cassandra, Memcached, Redis Cluster, and most distributed caches. (Note: the original Dynamo paper (2007) used consistent hashing, but AWS DynamoDB uses a different proprietary partitioning scheme.)

Key Takeaways

Replication provides fault tolerance and read scalability
Leader-follower is the standard pattern; multi-leader adds complexity
Async replication is fast but risks data loss; sync is safe but slow
Sharding splits data across machines for write scalability
Hash-based sharding distributes evenly; range-based enables range queries
Don't shard prematurely — it adds significant operational complexity