Notification System
A complete system design walkthrough following the four-step process from the lesson. Design a system that sends push notifications, emails, and SMS to users.
Step 1: Requirements
Functional:
- Send push notifications, emails, and SMS to users
- Users set preferences (opt out of channels, mute hours)
- Scheduled notifications (send at 9am user's local time)
- Track delivery status (sent, delivered, failed)
- Other services trigger notifications — we only deliver
Non-functional:
- 10M notifications/day (~100/sec average, ~300/sec peak)
- Delivery within 30 seconds of trigger
- 99.9% delivery rate (retries + fallback providers; some permanently fail)
- No duplicate deliveries
Out of scope: notification content creation, user-facing UI for reading notifications.
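The preference requirement (channel opt-outs and mute hours) can be sketched as a small decision function. This is only an illustration: the `Preferences` type, the `decide` function, and the choice to defer rather than drop notifications during mute hours are assumptions, not part of the requirements above.

```python
from dataclasses import dataclass, field
from datetime import time
from typing import Optional


@dataclass
class Preferences:
    """Hypothetical per-user preferences record."""
    opted_out: set = field(default_factory=set)   # channels the user disabled
    mute_start: Optional[time] = None             # local time the mute window opens
    mute_end: Optional[time] = None               # local time the mute window closes


def in_mute_hours(prefs: Preferences, local_time: time) -> bool:
    """True if local_time falls inside the mute window (may wrap past midnight)."""
    if prefs.mute_start is None or prefs.mute_end is None:
        return False
    if prefs.mute_start <= prefs.mute_end:
        return prefs.mute_start <= local_time < prefs.mute_end
    # Window crosses midnight, e.g. 22:00 to 07:00.
    return local_time >= prefs.mute_start or local_time < prefs.mute_end


def decide(prefs: Preferences, channel: str, local_time: time) -> str:
    """Return 'drop', 'defer', or 'send' for one notification."""
    if channel in prefs.opted_out:
        return "drop"
    if in_mute_hours(prefs, local_time):
        return "defer"   # assumption: muted notifications are delayed, not discarded
    return "send"


prefs = Preferences(opted_out={"sms"}, mute_start=time(22, 0), mute_end=time(7, 0))
```

With these preferences, an SMS is always dropped, while a push notification at 23:30 local time is deferred until the mute window ends.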
Step 2: Estimation
10M notifications/day ÷ 86,400 sec ≈ 116/sec, rounded to ~100/sec average
Peak (3x average): ~300/sec
Storage per notification: ~500 bytes
Daily storage: 10M × 500B = 5 GB/day
Monthly: ~150 GB
300/sec is modest: a single server handles this. The challenge is reliability (retries, multiple channels, deduplication), not raw throughput.
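The arithmetic can be checked with a few lines of Python. The constants come from this section; the 30-day month is an assumption. The exact average is closer to 116/sec, so the ~100/sec and ~300/sec figures are best read as order-of-magnitude roundings.

```python
# Back-of-envelope estimate using the figures from this section.
NOTIFICATIONS_PER_DAY = 10_000_000
SECONDS_PER_DAY = 86_400
PEAK_FACTOR = 3                 # peak assumed to be 3x the average
BYTES_PER_NOTIFICATION = 500
DAYS_PER_MONTH = 30             # assumption for the monthly figure

avg_per_sec = NOTIFICATIONS_PER_DAY / SECONDS_PER_DAY
peak_per_sec = avg_per_sec * PEAK_FACTOR
daily_gb = NOTIFICATIONS_PER_DAY * BYTES_PER_NOTIFICATION / 1e9
monthly_gb = daily_gb * DAYS_PER_MONTH

print(f"average: {avg_per_sec:.0f}/sec")
print(f"peak:    {peak_per_sec:.0f}/sec")
print(f"storage: {daily_gb:.0f} GB/day, {monthly_gb:.0f} GB/month")
```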
Step 3: High-Level Architecture
Triggering Services (order-service, auth-service, etc.)
              │
              ▼
Notification Service (validates, applies preferences, deduplicates)
              │
              ▼
Message Queue (one topic per channel)
              │
    ┌─────────┼─────────┐
    ▼         ▼         ▼
   Push     Email      SMS       ← Worker pools (one per channel)
    │         │         │
    ▼         ▼         ▼
   FCM     SendGrid   Twilio     ← Third-party delivery APIs
Why a queue? Delivery is slow and unreliable (third-party APIs, network issues). The queue decouples "decide to send" from "actually send." If Twilio is down, SMS messages wait instead of being lost.
Why separate topics per channel? SMS failures shouldn't block email delivery. Each channel scales independently.
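To make the isolation argument concrete, here is a minimal sketch with an in-memory stand-in for the message queue. `InMemoryBroker`, `topic_for`, and the `notifications.<channel>` topic names are hypothetical; a real deployment would use an actual broker.

```python
from collections import defaultdict, deque


class InMemoryBroker:
    """Stand-in for a real message queue: one FIFO per topic."""

    def __init__(self):
        self.topics = defaultdict(deque)

    def publish(self, topic, message):
        self.topics[topic].append(message)

    def consume(self, topic):
        """Pop the oldest message on a topic, or None if it is empty."""
        queue = self.topics[topic]
        return queue.popleft() if queue else None


CHANNELS = {"push", "email", "sms"}


def topic_for(channel):
    """Map a channel to its own topic (naming scheme is an assumption)."""
    if channel not in CHANNELS:
        raise ValueError(f"unknown channel: {channel}")
    return f"notifications.{channel}"


broker = InMemoryBroker()
broker.publish(topic_for("sms"), {"id": "n1", "channel": "sms"})
broker.publish(topic_for("email"), {"id": "n2", "channel": "email"})

# Suppose the SMS workers are stalled. The email workers still drain
# their own topic, and the SMS message simply waits.
email_msg = broker.consume(topic_for("email"))
sms_backlog = len(broker.topics[topic_for("sms")])
```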
Step 4: Deep Dive — Deduplication & Retries
Preventing Duplicates
Each notification gets a unique ID at creation. Before sending, the worker checks a deduplication store:
Redis key: "dedup:{notification_id}"
TTL: 24 hours
Worker logic:
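1. SET dedup:{id} 1 NX EX 86400 (set-if-absent plus the 24-hour TTL in one atomic command) → if the key already exists, skip (duplicate)
2. If new, send the notification
3. On success, mark status = delivered
4. On failure, delete dedup:{id} and re-enqueue with retry count + 1
This makes delivery idempotent. Even if the queue delivers a message twice (at-least-once semantics), the user gets only one notification. The key is deleted on failure so that the retry is not itself skipped as a duplicate.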
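The worker logic can be sketched as follows. `FakeDedupStore` is an in-memory stand-in for Redis so the example runs on its own; `handle`, `flaky_send`, and the return values are hypothetical names. One detail the sketch makes explicit: the dedup key is released on failure, otherwise the re-enqueued retry would find its own key and be skipped.

```python
import time


class FakeDedupStore:
    """In-memory stand-in for Redis `SET key value NX EX ttl`."""

    def __init__(self):
        self._expiry = {}   # key -> expiry timestamp

    def set_if_absent(self, key, ttl_seconds, now=None):
        now = time.time() if now is None else now
        expires_at = self._expiry.get(key)
        if expires_at is not None and expires_at > now:
            return False                      # key exists and has not expired
        self._expiry[key] = now + ttl_seconds
        return True

    def delete(self, key):
        self._expiry.pop(key, None)


DEDUP_TTL_SECONDS = 24 * 60 * 60


def handle(message, store, send):
    """Process one queue message; returns 'skipped', 'delivered', or 'retry'."""
    key = f"dedup:{message['id']}"
    if not store.set_if_absent(key, DEDUP_TTL_SECONDS):
        return "skipped"                      # duplicate delivery from the queue
    try:
        send(message)
    except Exception:
        store.delete(key)                     # release the key so the retry can run
        message["retry_count"] = message.get("retry_count", 0) + 1
        return "retry"                        # caller re-enqueues the message
    return "delivered"


def flaky_send(message):
    """Hypothetical sender that always fails, to exercise the retry path."""
    raise RuntimeError("provider unavailable")


store = FakeDedupStore()
sent = []
msg = {"id": "n1", "channel": "push"}
first = handle(msg, store, sent.append)
second = handle(msg, store, sent.append)      # queue redelivers the same message

failed = {"id": "n2", "channel": "sms"}
third = handle(failed, store, flaky_send)     # fails, key released
fourth = handle(failed, store, sent.append)   # retry succeeds
```

With a real Redis client, `set_if_absent` would correspond to a single `SET` call with the `NX` and `EX` options.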
Retry Strategy
Attempt 1: immediate
Attempt 2: after 5 seconds
Attempt 3: after 30 seconds
Attempt 4: after 5 minutes
Attempt 5: move to dead-letter queue (permanent failure)
Exponential backoff prevents hammering a failing provider. After max retries, the notification moves to a dead-letter queue for manual investigation or alerting.
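The schedule above maps directly onto a lookup table. This is a minimal sketch; the function name `next_step` and its return shape are assumptions.

```python
# Delay before each attempt, in seconds, taken from the schedule above.
RETRY_DELAYS_SECONDS = [0, 5, 30, 300]   # attempts 1 through 4


def next_step(attempt):
    """Return ('send', delay_seconds) for attempts 1-4, else ('dead_letter', None)."""
    if attempt < 1:
        raise ValueError("attempt numbers start at 1")
    if attempt <= len(RETRY_DELAYS_SECONDS):
        return ("send", RETRY_DELAYS_SECONDS[attempt - 1])
    return ("dead_letter", None)
```

A worker that fails on attempt 3 would call `next_step(4)` and re-enqueue the message with a 300-second delay; a failure on attempt 4 routes the message to the dead-letter queue.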
Scheduled Notifications
A scheduler service runs every minute, queries for notifications where scheduled_at <= now() and status = pending, then publishes them to the queue. An index on (status, scheduled_at) keeps this query fast.
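One scheduler tick might look like the sketch below. It uses SQLite in memory with a trimmed-down table so it runs standalone; the `publish_due` function, the intermediate `queued` status, and storing timestamps as UTC ISO-8601 strings are all assumptions made for the example.

```python
import sqlite3
from datetime import datetime, timezone


def publish_due(conn, publish, now=None):
    """Publish pending notifications whose scheduled_at has passed."""
    now = now or datetime.now(timezone.utc)
    now_iso = now.strftime("%Y-%m-%dT%H:%M:%SZ")
    rows = conn.execute(
        "SELECT id, channel FROM notifications "
        "WHERE status = 'pending' AND scheduled_at <= ? "
        "ORDER BY scheduled_at",
        (now_iso,),
    ).fetchall()
    for notification_id, channel in rows:
        publish(channel, notification_id)
        # Hypothetical 'queued' status so the next tick does not publish it again.
        conn.execute(
            "UPDATE notifications SET status = 'queued' WHERE id = ?",
            (notification_id,),
        )
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE notifications ("
    "id TEXT PRIMARY KEY, channel TEXT, "
    "status TEXT DEFAULT 'pending', scheduled_at TEXT)"
)
conn.execute(
    "CREATE INDEX idx_pending_scheduled ON notifications (status, scheduled_at)"
)
conn.executemany(
    "INSERT INTO notifications (id, channel, scheduled_at) VALUES (?, ?, ?)",
    [
        ("n1", "email", "2026-01-15T09:00:00Z"),
        ("n2", "sms", "2026-01-15T10:00:00Z"),
    ],
)

published = []
tick = datetime(2026, 1, 15, 9, 0, 30, tzinfo=timezone.utc)
first_count = publish_due(conn, lambda ch, nid: published.append((ch, nid)), now=tick)
second_count = publish_due(conn, lambda ch, nid: published.append((ch, nid)), now=tick)
```

Only `n1` is due at 09:00:30, and the second tick publishes nothing because its status is no longer pending.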
Data Model
CREATE TABLE notifications (
    id            UUID PRIMARY KEY,
    user_id       UUID NOT NULL,
    channel       VARCHAR(10) NOT NULL,  -- push, email, sms
    title         TEXT,
    body          TEXT,
    status        VARCHAR(20) DEFAULT 'pending',
    retry_count   INT DEFAULT 0,
    scheduled_at  TIMESTAMP,
    sent_at       TIMESTAMP,
    delivered_at  TIMESTAMP,
    created_at    TIMESTAMP DEFAULT NOW()
);
CREATE INDEX idx_pending_scheduled
    ON notifications (status, scheduled_at)
    WHERE status = 'pending';
API
POST /notifications
{
    "user_id": "uuid",
    "channel": "push",
    "title": "Your order shipped",
    "body": "Track it here...",
    "scheduled_at": "2026-01-15T09:00:00Z"  // optional
}
Response: 201 Created
{ "id": "notification-uuid", "status": "pending" }
GET /notifications/{id}/status
Response: { "status": "delivered", "delivered_at": "..." }
Key Tradeoffs
| Decision | Tradeoff |
|---|---|
| Queue per channel | More infrastructure, but channels fail independently |
| Redis deduplication | Extra dependency, but prevents duplicate sends |
| At-least-once + idempotency | Simpler than exactly-once, same user experience |
| 24h dedup TTL | Covers retries; old IDs expire to save memory |
| Exponential backoff | Slower recovery, but doesn't overwhelm failing providers |
What This Demonstrates
This example follows all four steps of the process:
- Requirements — scoped what we build and what we don't
- Estimation — proved scale is manageable, focused design on reliability
- High-level design — queue-based architecture for decoupling and resilience
- Deep dive — solved deduplication (the hardest correctness problem)