Sharding a Relational Database with Foreign Keys

Sharding splits data across multiple database nodes. The challenge with relational databases: foreign keys assume related data lives together. This example walks through strategies to shard tables that reference each other.

The Problem

You have an e-commerce database:

users(id, name, email)
orders(id, user_id FK→users, total, created_at)
order_items(id, order_id FK→orders, product_id FK→products, quantity)
products(id, name, price, category)

A single PostgreSQL instance can't handle the write load anymore. You need to shard — but orders references users, and order_items references both orders and products. How do you split this across nodes without breaking joins and constraints?

Strategy 1: Co-located Sharding

Pick a shard key shared across related tables. Rows that reference each other land on the same shard.

Shard key: user_id

Shard 1 (user_id 1–1000):
  users       → rows where id 1–1000
  orders      → rows where user_id 1–1000
  order_items → rows where order.user_id 1–1000

Shard 2 (user_id 1001–2000):
  users       → rows where id 1001–2000
  orders      → rows where user_id 1001–2000
  order_items → rows where order.user_id 1001–2000

Foreign keys between users → orders → order_items work normally within each shard because all related data is co-located.

The catch: order_items doesn't have user_id directly. You must denormalize — add user_id to order_items so the shard router knows where to place each row.

-- Before: order_items only references order_id
CREATE TABLE order_items (
  id BIGINT PRIMARY KEY,
  order_id BIGINT REFERENCES orders(id),
  product_id BIGINT,
  quantity INT
);

-- After: add user_id for shard routing
CREATE TABLE order_items (
  id BIGINT PRIMARY KEY,
  user_id BIGINT NOT NULL,   -- denormalized shard key
  order_id BIGINT,
  product_id BIGINT,
  quantity INT
);

Strategy 2: Reference Tables

The products table is referenced by order_items but can't be co-located with users — products don't belong to a user. Solution: replicate small, rarely-changing tables to every shard.

Shard 1:                    Shard 2:
  users (1–1000)              users (1001–2000)
  orders (user 1–1000)        orders (user 1001–2000)
  order_items (user 1–1000)   order_items (user 1001–2000)
  products (FULL COPY)        products (FULL COPY)

Products change infrequently (new products added, prices updated). Replicate changes asynchronously to all shards. Now order_items.product_id can still join locally.

When this works: small tables (< 100K rows), low write frequency. Categories, countries, config tables are good candidates.

When it doesn't: if the table is large or frequently updated, replication overhead becomes a problem.

Strategy 3: Drop Cross-Shard FKs

For references that can't be co-located or replicated, remove the database-level FK constraint and enforce integrity in application code.

-- No FK constraint on product_id
CREATE TABLE order_items (
  id BIGINT PRIMARY KEY,
  user_id BIGINT NOT NULL,
  order_id BIGINT REFERENCES orders(id),  -- same shard, FK works
  product_id BIGINT NOT NULL,             -- cross-shard, no FK
  quantity INT
);

The application validates that product_id exists before inserting. You lose automatic cascade deletes and referential guarantees, but gain the ability to shard independently.

Choosing the Right Shard Key

Decision framework:

Identify the most common query path — "get all orders for user X" means user_id is a strong candidate
Check that related tables can include the key — if not, you'll need to denormalize
Verify even distribution — one user with 50% of all orders creates a hot shard
Accept some cross-shard queries — optimize the common path, tolerate the rare one (e.g., "find all orders for product Y" will scatter across shards)

Multi-Tenant SaaS Example

A natural fit for co-located sharding:

Shard key: tenant_id

tenants(id, name, plan)
users(id, tenant_id, email)
projects(id, tenant_id, name)
tasks(id, tenant_id, project_id, title, status)
comments(id, tenant_id, task_id, body)

Every table includes tenant_id. All data for one tenant lives on one shard. FKs work within a shard:

tasks.project_id → projects.id ✓ (same tenant, same shard)
comments.task_id → tasks.id ✓ (same tenant, same shard)

Cross-tenant queries (admin dashboards, analytics) go through a scatter-gather layer that queries all shards and merges results.

Tradeoffs Summary

Approach	Works when	Breaks when
Co-located sharding	Tables share a natural parent key	No common key exists across all tables
Reference tables	Table is small and rarely written	Table is large or frequently updated
Drop FK constraints	Cross-shard references are unavoidable	You need strong referential integrity guarantees
Scatter-gather queries	Cross-shard reads are rare	Cross-shard reads are on the hot path

Key Insight

Sharding a relational DB means partitioning into independent "mini-databases" where relational semantics (joins, FKs, transactions) still hold within a shard. Cross-shard relationships are handled explicitly through replication, application logic, or eventual consistency.

Concepts Used

Concept	Lesson	How it's used here
Sharding strategies	11	Hash and range sharding to distribute data
Replication	11	Reference tables replicated to all shards
Referential integrity tradeoffs	13	Dropping FKs trades strict consistency for operational flexibility in distributed schemas
Back-of-envelope estimation	02	Deciding when a single instance is no longer enough