Sharding a Relational Database with Foreign Keys
Sharding splits data across multiple database nodes. The challenge with relational databases: foreign keys assume related data lives together. This example walks through strategies to shard tables that reference each other.
The Problem
You have an e-commerce database:
users(id, name, email)
orders(id, user_id FK→users, total, created_at)
order_items(id, order_id FK→orders, product_id FK→products, quantity)
products(id, name, price, category)A single PostgreSQL instance can't handle the write load anymore. You need to shard — but orders references users, and order_items references both orders and products. How do you split this across nodes without breaking joins and constraints?
Strategy 1: Co-located Sharding
Pick a shard key shared across related tables. Rows that reference each other land on the same shard.
Shard key: user_id
Shard 1 (user_id 1–1000):
users → rows where id 1–1000
orders → rows where user_id 1–1000
order_items → rows where order.user_id 1–1000
Shard 2 (user_id 1001–2000):
users → rows where id 1001–2000
orders → rows where user_id 1001–2000
order_items → rows where order.user_id 1001–2000Foreign keys between users → orders → order_items work normally within each shard because all related data is co-located.
The catch: order_items doesn't have user_id directly. You must denormalize — add user_id to order_items so the shard router knows where to place each row.
-- Before: order_items only references order_id
CREATE TABLE order_items (
id BIGINT PRIMARY KEY,
order_id BIGINT REFERENCES orders(id),
product_id BIGINT,
quantity INT
);
-- After: add user_id for shard routing
CREATE TABLE order_items (
id BIGINT PRIMARY KEY,
user_id BIGINT NOT NULL, -- denormalized shard key
order_id BIGINT,
product_id BIGINT,
quantity INT
);Strategy 2: Reference Tables
The products table is referenced by order_items but can't be co-located with users — products don't belong to a user. Solution: replicate small, rarely-changing tables to every shard.
Shard 1: Shard 2:
users (1–1000) users (1001–2000)
orders (user 1–1000) orders (user 1001–2000)
order_items (user 1–1000) order_items (user 1001–2000)
products (FULL COPY) products (FULL COPY)Products change infrequently (new products added, prices updated). Replicate changes asynchronously to all shards. Now order_items.product_id can still join locally.
When this works: small tables (< 100K rows), low write frequency. Categories, countries, config tables are good candidates.
When it doesn't: if the table is large or frequently updated, replication overhead becomes a problem.
Strategy 3: Drop Cross-Shard FKs
For references that can't be co-located or replicated, remove the database-level FK constraint and enforce integrity in application code.
-- No FK constraint on product_id
CREATE TABLE order_items (
id BIGINT PRIMARY KEY,
user_id BIGINT NOT NULL,
order_id BIGINT REFERENCES orders(id), -- same shard, FK works
product_id BIGINT NOT NULL, -- cross-shard, no FK
quantity INT
);The application validates that product_id exists before inserting. You lose automatic cascade deletes and referential guarantees, but gain the ability to shard independently.
Choosing the Right Shard Key
Decision framework:
- Identify the most common query path — "get all orders for user X" means
user_idis a strong candidate - Check that related tables can include the key — if not, you'll need to denormalize
- Verify even distribution — one user with 50% of all orders creates a hot shard
- Accept some cross-shard queries — optimize the common path, tolerate the rare one (e.g., "find all orders for product Y" will scatter across shards)
Multi-Tenant SaaS Example
A natural fit for co-located sharding:
Shard key: tenant_id
tenants(id, name, plan)
users(id, tenant_id, email)
projects(id, tenant_id, name)
tasks(id, tenant_id, project_id, title, status)
comments(id, tenant_id, task_id, body)Every table includes tenant_id. All data for one tenant lives on one shard. FKs work within a shard:
tasks.project_id → projects.id✓ (same tenant, same shard)comments.task_id → tasks.id✓ (same tenant, same shard)
Cross-tenant queries (admin dashboards, analytics) go through a scatter-gather layer that queries all shards and merges results.
Tradeoffs Summary
| Approach | Works when | Breaks when |
|---|---|---|
| Co-located sharding | Tables share a natural parent key | No common key exists across all tables |
| Reference tables | Table is small and rarely written | Table is large or frequently updated |
| Drop FK constraints | Cross-shard references are unavoidable | You need strong referential integrity guarantees |
| Scatter-gather queries | Cross-shard reads are rare | Cross-shard reads are on the hot path |
Key Insight
Sharding a relational DB means partitioning into independent "mini-databases" where relational semantics (joins, FKs, transactions) still hold within a shard. Cross-shard relationships are handled explicitly through replication, application logic, or eventual consistency.
Concepts Used
| Concept | Lesson | How it's used here |
|---|---|---|
| Sharding strategies | 11 | Hash and range sharding to distribute data |
| Replication | 11 | Reference tables replicated to all shards |
| Referential integrity tradeoffs | 13 | Dropping FKs trades strict consistency for operational flexibility in distributed schemas |
| Back-of-envelope estimation | 02 | Deciding when a single instance is no longer enough |