Notification System

A complete system design walkthrough that follows the four-step process from the lesson. The task: design a system that sends push notifications, emails, and SMS messages to users.

Step 1: Requirements

Functional:

Send push notifications, emails, and SMS to users
Users set preferences (opt out of channels, mute hours)
Scheduled notifications (send at 9am user's local time)
Track delivery status (sent, delivered, failed)
Other services trigger notifications — we only deliver

Non-functional:

10M notifications/day (~100/sec average, ~300/sec peak)
Delivery within 30 seconds of trigger in the normal case (late retries can take longer)
99.9% delivery rate (retries + fallback providers; some permanently fail)
No duplicate deliveries

Out of scope: notification content creation, user-facing UI for reading notifications.

Step 2: Estimation

10M notifications/day ÷ 86,400 sec ≈ 116/sec, rounded to ~100/sec average
Peak (3x average): ~300/sec
Storage per notification: ~500 bytes
Daily storage: 10M × 500B = 5 GB/day
Monthly: ~150 GB

300/sec is modest: a single server could handle that throughput, although you would still run several instances for availability. The challenge is reliability (retries, multiple channels, deduplication), not raw throughput.
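The arithmetic above can be checked with a few lines of Python. This is only a sketch of the estimate; the variable names are illustrative, and the 30-day month is an assumption. The exact average is closer to 116/sec than to the rounded ~100/sec, which does not change the conclusion.

```python
SECONDS_PER_DAY = 86_400

notifications_per_day = 10_000_000
bytes_per_notification = 500   # estimate from the text
peak_factor = 3                # peak assumed to be 3x the average

avg_per_sec = notifications_per_day / SECONDS_PER_DAY
peak_per_sec = avg_per_sec * peak_factor
daily_storage_gb = notifications_per_day * bytes_per_notification / 1e9
monthly_storage_gb = daily_storage_gb * 30  # assumes a 30-day month

print(f"average: {avg_per_sec:.0f}/sec")
print(f"peak: {peak_per_sec:.0f}/sec")
print(f"daily storage: {daily_storage_gb:.0f} GB")
print(f"monthly storage: {monthly_storage_gb:.0f} GB")
```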

Step 3: High-Level Architecture

Triggering Services (order-service, auth-service, etc.)
         │
         ▼
  Notification Service (validates, applies preferences, deduplicates)
         │
         ▼
    Message Queue (one topic per channel)
         │
   ┌─────┼─────┐
   ▼     ▼     ▼
 Push   Email  SMS     ← Worker pools (one per channel)
   │     │     │
   ▼     ▼     ▼
  FCM  SendGrid Twilio  ← Third-party delivery APIs

Why a queue? Delivery is slow and unreliable (third-party APIs, network issues). The queue decouples "decide to send" from "actually send." If Twilio is down, SMS messages wait instead of being lost.

Why separate topics per channel? SMS failures shouldn't block email delivery. Each channel scales independently.
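As a rough illustration of what the Notification Service does before publishing, the sketch below applies user preferences and picks a per-channel topic. Everything in it is hypothetical: the `Preferences` shape, the `notifications.<channel>` topic names, and the choice to defer (rather than drop) a notification that arrives during mute hours are assumptions, not part of the design above.

```python
from dataclasses import dataclass, field
from datetime import time
from typing import Optional


@dataclass
class Preferences:
    opted_out: set = field(default_factory=set)  # channels the user disabled
    mute_start: Optional[time] = None            # user's local time, inclusive
    mute_end: Optional[time] = None              # user's local time, exclusive


def is_muted(prefs, local_now):
    """True if local_now falls inside the mute window (which may wrap midnight)."""
    if prefs.mute_start is None or prefs.mute_end is None:
        return False
    if prefs.mute_start <= prefs.mute_end:
        return prefs.mute_start <= local_now < prefs.mute_end
    # Window wraps past midnight, e.g. 22:00 to 07:00.
    return local_now >= prefs.mute_start or local_now < prefs.mute_end


def route(channel, prefs, local_now):
    """Return (decision, topic): skip, defer, or publish to the channel's topic."""
    if channel in prefs.opted_out:
        return ("skip", None)
    if is_muted(prefs, local_now):
        return ("defer", None)
    return ("publish", f"notifications.{channel}")  # hypothetical topic name


prefs = Preferences(opted_out={"sms"}, mute_start=time(22, 0), mute_end=time(7, 0))
print(route("email", prefs, time(12, 0)))
print(route("sms", prefs, time(12, 0)))
```

Because each channel maps to its own topic, a backlog on the SMS topic never sits in front of email or push messages.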

Step 4: Deep Dive — Deduplication & Retries

Preventing Duplicates

Each notification gets a unique ID at creation. Before sending, the worker checks a deduplication store:

Redis key: "dedup:{notification_id}"
TTL: 24 hours

Worker logic:
  1. SET dedup:{id} 1 NX EX 86400 → if the key already exists, skip (duplicate)
  2. If new, send the notification
  3. On success, mark status = delivered
  4. On failure, delete dedup:{id}, then re-enqueue with retry count + 1
     (otherwise the retry would be skipped as a duplicate)

This makes the system idempotent. Even if the queue delivers a message twice (at-least-once semantics), the user gets only one notification.
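The worker logic can be sketched in Python as follows. To keep the example runnable without a server, `DedupStore` is an in-memory stand-in for Redis; against real Redis the claim would be a single `SET key value NX EX seconds` command. The `handle` function and its `send` and `requeue` callbacks are hypothetical names. The sketch also releases the key on failure, an addition assumed here so that the re-enqueued message is not rejected as a duplicate.

```python
import time


class DedupStore:
    """In-memory stand-in for Redis, for illustration only."""

    def __init__(self):
        self._expiry = {}  # key -> expiry timestamp

    def set_if_absent(self, key, ttl_seconds, now=None):
        """Mimics `SET key 1 NX EX ttl`: True only if the key was newly set."""
        now = time.time() if now is None else now
        expires_at = self._expiry.get(key)
        if expires_at is not None and expires_at > now:
            return False
        self._expiry[key] = now + ttl_seconds
        return True

    def delete(self, key):
        self._expiry.pop(key, None)


DEDUP_TTL_SECONDS = 24 * 60 * 60


def handle(message, store, send, requeue):
    """Process one queue message; returns 'duplicate', 'delivered', or 'retry'."""
    key = f"dedup:{message['id']}"
    if not store.set_if_absent(key, DEDUP_TTL_SECONDS):
        return "duplicate"  # another delivery of this message already claimed it
    try:
        send(message)
    except Exception:
        store.delete(key)  # release the claim so the retry is not skipped
        requeue({**message, "retry_count": message.get("retry_count", 0) + 1})
        return "retry"
    return "delivered"


store, sent, requeued = DedupStore(), [], []
msg = {"id": "n1", "retry_count": 0}
print(handle(msg, store, sent.append, requeued.append))  # first delivery
print(handle(msg, store, sent.append, requeued.append))  # queue redelivers it
```

One limitation of this pattern: if a worker crashes after claiming the key but before sending, the redelivered message is skipped until the key expires. Closing that gap needs extra state, which this sketch leaves out.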

Retry Strategy

Attempt 1: immediate
Attempt 2: after 5 seconds
Attempt 3: after 30 seconds
Attempt 4: after 5 minutes
After attempt 4 fails: move to dead-letter queue (permanent failure)

The growing delays (a rough exponential backoff) keep workers from hammering a failing provider. Once the retries are exhausted, the notification moves to a dead-letter queue for manual investigation or alerting.
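The schedule above maps naturally onto a small lookup. This is a minimal sketch; `RETRY_DELAYS_SECONDS` and `next_action` are illustrative names, and the delay is interpreted as the wait before each attempt.

```python
# Delay before attempts 1 through 4, taken from the schedule above.
RETRY_DELAYS_SECONDS = [0, 5, 30, 300]


def next_action(attempt):
    """Decide what to do for the given 1-based attempt number.

    Returns ("send", delay_seconds) while attempts remain,
    or ("dead_letter", None) once they are exhausted.
    """
    if attempt < 1:
        raise ValueError("attempt numbers start at 1")
    if attempt <= len(RETRY_DELAYS_SECONDS):
        return ("send", RETRY_DELAYS_SECONDS[attempt - 1])
    return ("dead_letter", None)


for attempt in range(1, 6):
    print(attempt, next_action(attempt))
```

A production version would often add random jitter to each delay so that many failed messages do not retry at the same instant; that is omitted here to match the schedule as written.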

Scheduled Notifications

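A scheduler service runs every minute, queries for notifications where scheduled_at <= now() and status = pending, publishes them to the queue, and moves them out of the pending state so the next run does not pick them up again. An index on (status, scheduled_at) keeps this query fast. With one-minute polling, a scheduled notification can be published up to a minute after its scheduled time.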
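One scheduler pass might look like the sketch below. It uses SQLite from the standard library so it runs as-is; the real system would run the same query against the production database. The `publish_due` function, the reduced table, and the `queued` status used to take rows out of the pending set are assumptions made for the example. Timestamps are ISO 8601 UTC strings, which compare correctly as text.

```python
import sqlite3


def publish_due(conn, now, publish):
    """Publish pending notifications due at or before `now`; returns how many."""
    rows = conn.execute(
        "SELECT id, channel FROM notifications "
        "WHERE status = 'pending' AND scheduled_at <= ? "
        "ORDER BY scheduled_at",
        (now,),
    ).fetchall()
    for notification_id, channel in rows:
        publish(channel, notification_id)
        # 'queued' is a hypothetical status that keeps the next run from re-publishing.
        conn.execute(
            "UPDATE notifications SET status = 'queued' WHERE id = ?",
            (notification_id,),
        )
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE notifications "
    "(id TEXT PRIMARY KEY, channel TEXT, status TEXT, scheduled_at TEXT)"
)
conn.executemany(
    "INSERT INTO notifications VALUES (?, ?, ?, ?)",
    [
        ("a", "email", "pending", "2026-01-15T09:00:00Z"),
        ("b", "sms", "pending", "2026-01-15T10:00:00Z"),
    ],
)

published = []
count = publish_due(
    conn, "2026-01-15T09:00:30Z", lambda ch, nid: published.append((ch, nid))
)
print(count, published)
```

If a run crashes between publishing and the status update, the notification is published again on the next run. That is acceptable here because the workers deduplicate by notification ID.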

Data Model

CREATE TABLE notifications (
  id            UUID PRIMARY KEY,
  user_id       UUID NOT NULL,
  channel       VARCHAR(10) NOT NULL,  -- push, email, sms
  title         TEXT,
  body          TEXT,
  status        VARCHAR(20) DEFAULT 'pending',  -- pending, sent, delivered, failed
  retry_count   INT DEFAULT 0,
  scheduled_at  TIMESTAMP,                       -- NULL for immediate sends
  sent_at       TIMESTAMP,
  delivered_at  TIMESTAMP,
  created_at    TIMESTAMP DEFAULT NOW()
);

CREATE INDEX idx_pending_scheduled
  ON notifications (status, scheduled_at)
  WHERE status = 'pending';

API

POST /notifications
{
  "user_id": "uuid",
  "channel": "push",
  "title": "Your order shipped",
  "body": "Track it here...",
  "scheduled_at": "2026-01-15T09:00:00Z"  // optional
}

Response: 201 Created
{ "id": "notification-uuid", "status": "pending" }

GET /notifications/{id}/status
Response: { "status": "delivered", "delivered_at": "..." }
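A handler for POST /notifications could validate the body along these lines before storing and enqueuing anything. This is a framework-free sketch: `create_notification`, the 400 error bodies, and the exact validation rules are assumptions, since the API above only specifies the success response.

```python
import uuid
from datetime import datetime

CHANNELS = {"push", "email", "sms"}


def create_notification(payload):
    """Validate a POST /notifications body; returns (status_code, response_body)."""
    for name in ("user_id", "channel"):
        if name not in payload:
            return 400, {"error": f"missing field: {name}"}
    if payload["channel"] not in CHANNELS:
        return 400, {"error": "unknown channel"}

    scheduled_at = payload.get("scheduled_at")  # optional
    if scheduled_at is not None:
        if not isinstance(scheduled_at, str):
            return 400, {"error": "scheduled_at must be an ISO 8601 string"}
        try:
            # Older Python versions do not accept a trailing "Z", so normalize it.
            datetime.fromisoformat(scheduled_at.replace("Z", "+00:00"))
        except ValueError:
            return 400, {"error": "scheduled_at must be an ISO 8601 string"}

    # Persisting the row and publishing to the queue would happen here.
    return 201, {"id": str(uuid.uuid4()), "status": "pending"}


status, body = create_notification(
    {
        "user_id": "uuid",
        "channel": "push",
        "title": "Your order shipped",
        "body": "Track it here...",
        "scheduled_at": "2026-01-15T09:00:00Z",
    }
)
print(status, body["status"])
```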

Key Tradeoffs

Decision	Tradeoff
Queue per channel	More infrastructure, but channels fail independently
Redis deduplication	Extra dependency, but prevents duplicate sends
At-least-once + idempotency	Simpler than exactly-once, same user experience
24h dedup TTL	Covers retries; old IDs expire to save memory
Exponential backoff	Slower recovery, but doesn't overwhelm failing providers

What This Demonstrates

This example follows all four steps of the process:

Requirements — scoped what we build and what we don't
Estimation — proved scale is manageable, focused design on reliability
High-level design — queue-based architecture for decoupling and resilience
Deep dive — solved deduplication (the hardest correctness problem)