15 - Message Queues

We've covered how data is stored, replicated, and kept consistent. Now let's look at how services communicate without tight coupling.

Why Queues

Not everything needs to happen immediately. When a user places an order, you validate payment and reserve inventory synchronously (the user waits for this). But sending the confirmation email, generating the invoice, and notifying the warehouse can happen asynchronously.

Message queues decouple producers (who create work) from consumers (who process it). The producer drops a message on the queue and moves on. The consumer picks it up when ready.
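As a minimal sketch of that hand-off, the example below uses Python's standard `queue.Queue` and a background thread as an in-process stand-in for a real broker. The function name `run_order_pipeline` and the "email sent" strings are made up for illustration; a production system would talk to an external queue service instead.

```python
import queue
import threading

def run_order_pipeline(order_ids):
    """Producer enqueues follow-up work; a background consumer drains it."""
    tasks = queue.Queue()          # in-process stand-in for a real broker
    processed = []

    def consumer():
        while True:
            task = tasks.get()     # blocks until a message is available
            if task is None:       # sentinel value: no more work is coming
                break
            processed.append(f"email sent for {task}")

    worker = threading.Thread(target=consumer)
    worker.start()

    for order_id in order_ids:
        tasks.put(order_id)        # producer returns as soon as the message is enqueued

    tasks.put(None)                # tell the consumer to stop
    worker.join()
    return processed
```

The producer loop never calls the email logic directly. It only knows about the queue, which is the decoupling described above.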

Benefits of Async Processing

Decoupling — the producer doesn't need to know who processes the message or how. Services evolve independently.

Buffering — if consumers are slow, messages queue up instead of causing failures. The system absorbs traffic spikes gracefully.

Reliability — if a consumer crashes, the message stays in the queue. Another consumer picks it up. Work doesn't get lost.

Scalability — add more consumers to process messages faster. No changes to the producer.

Queue vs Pub/Sub

Queue (point-to-point) — each message is consumed by exactly one consumer. Used for task distribution. Example: processing uploaded images.

Pub/Sub (publish-subscribe) — each message is delivered to all subscribers. Used for event broadcasting. Example: "order placed" event triggers email service, inventory service, and analytics service simultaneously.

Some systems support both patterns. Kafka uses consumer groups: within a group, each message goes to one consumer (queue behavior). Across groups, each message goes to all groups (pub/sub behavior).
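The consumer-group idea can be illustrated with a toy model. This is not Kafka's API or its assignment algorithm (Kafka assigns partitions to group members rather than rotating per message); the `Topic` class and its round-robin choice are assumptions made only to show the two behaviors side by side.

```python
from itertools import cycle

class Topic:
    """Toy topic: every group sees every message; one member per group handles it."""

    def __init__(self):
        self._groups = {}   # group name -> round-robin iterator over its members

    def subscribe(self, group, members):
        self._groups[group] = cycle(members)

    def publish(self, message):
        deliveries = []
        for group, members in self._groups.items():
            # Across groups: each group gets a copy (pub/sub behavior).
            # Within a group: only the next member gets it (queue behavior).
            deliveries.append((group, next(members), message))
        return deliveries
```

With an "email" group of two workers and an "analytics" group of one, each published message produces two deliveries: one to a single email worker and one to the analytics worker.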

Message Ordering and Delivery

At-most-once — fire and forget. The producer sends the message and doesn't wait for confirmation. If the queue drops it, it's gone. Example: logging or metrics. If you lose one data point out of millions, nobody notices. Fast, no retries, no overhead.

At-least-once — the queue guarantees delivery, but might deliver the same message twice. This happens when the consumer processes a message but crashes before acknowledging it. The queue thinks it wasn't processed and redelivers. Example: sending a welcome email. If the email worker crashes after sending but before acknowledging, the user might get two welcome emails. To prevent this, make your consumer idempotent — check "did I already send this?" before sending (e.g., store the message ID in a database).
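A rough sketch of the idempotency check, assuming each message carries a unique ID. The class name `WelcomeEmailConsumer` is hypothetical, and the in-memory set stands in for the database table mentioned above. In a real system the ID would have to be recorded durably, and ideally atomically with the side effect, which this sketch does not attempt.

```python
class WelcomeEmailConsumer:
    """Skips messages whose ID has already been handled."""

    def __init__(self):
        self.processed_ids = set()   # stand-in for a durable table of handled message IDs
        self.sent = []               # stand-in for emails actually sent

    def handle(self, message_id, user):
        if message_id in self.processed_ids:
            return False             # duplicate delivery: skip the side effect
        self.sent.append(user)       # the side effect (sending the email)
        self.processed_ids.add(message_id)
        return True
```

If the queue redelivers message `m1`, the second `handle` call returns `False` and the user still receives only one email.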

Exactly-once — each message is processed exactly once. Extremely hard in distributed systems because you'd need the queue and the consumer to agree atomically. Most systems don't bother. They use at-least-once delivery with idempotent consumers, which gives you effectively-once behavior in practice.

Ordering is usually guaranteed only within a partition (or, in brokers without partitions, within a single queue). Messages in the same partition arrive in the order they were sent. Example: if you partition by user ID, all events for user 123 arrive in order. But events for user 123 and user 456 might arrive in any relative order. Global ordering across all messages requires a single partition, which caps throughput at what one partition and one consumer can handle.

When do you need ordering? When the sequence affects correctness: bank transactions (deposit before withdraw), state machines (placed → shipped → delivered), chat messages. You don't need it when events are independent — sending emails to different users, processing image uploads, recording analytics.
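Partitioning by user ID can be sketched as hashing the key and taking it modulo the partition count. The hash choice here (SHA-256) and the helper names `partition_for` and `route` are assumptions for illustration; real brokers use their own partitioners.

```python
import hashlib

def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition deterministically, so equal keys always land together."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

def route(events, num_partitions):
    """Append each (user_id, event) pair to the partition chosen by its user ID."""
    partitions = [[] for _ in range(num_partitions)]
    for user_id, event in events:
        partitions[partition_for(user_id, num_partitions)].append((user_id, event))
    return partitions
```

Because every event for a given user goes to the same list in arrival order, that user's events stay ordered, while nothing is promised about the relative order of events in different partitions.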

Common Systems

RabbitMQ — traditional message broker. Supports complex routing, priorities, and multiple protocols. Good for task queues.

Apache Kafka — distributed log. Messages are persisted and can be replayed. High throughput, good for event streaming and data pipelines.

Amazon SQS — managed queue service. Simple, scalable, no infrastructure to manage. Good default for AWS workloads.

Redis Streams — lightweight streaming built into Redis. Good for simpler use cases where you already run Redis.

Dead Letter Queues

When a message fails processing repeatedly, you don't want it blocking the queue forever. A dead letter queue (DLQ) catches these failed messages for later inspection.

The pattern: after N failed attempts, move the message to the DLQ and alert the team. Then investigate, and either reprocess the message manually or fix the bug and replay it from the DLQ.
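The retry-then-park pattern might look like the sketch below, which uses an in-memory `deque` as the main queue and a plain list as the DLQ. The function name `consume_with_dlq` and the default of three attempts are illustrative assumptions; managed brokers typically offer this as configuration rather than code you write yourself.

```python
from collections import deque

def consume_with_dlq(messages, handler, max_attempts=3):
    """Process messages; after max_attempts failures, move a message to the DLQ."""
    main_queue = deque((msg, 0) for msg in messages)  # (message, failed attempts so far)
    dead_letters = []
    while main_queue:
        msg, attempts = main_queue.popleft()
        try:
            handler(msg)
        except Exception:
            attempts += 1
            if attempts >= max_attempts:
                dead_letters.append(msg)            # give up and park it for inspection
            else:
                main_queue.append((msg, attempts))  # requeue for another try
    return dead_letters
```

Healthy messages behind the failing one still get processed, because the failing message goes to the back of the queue on each retry instead of blocking the front.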

Backpressure

When producers generate messages faster than consumers can process them, the queue grows without bound. Eventually the broker runs out of memory or disk.

Solutions:

Rate limit producers — reject or slow down producers when the queue is too deep
Scale consumers — auto-scale based on queue depth
Drop messages — acceptable for some use cases (metrics, logs)
Set queue limits — reject new messages when the queue hits a max size
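The "set queue limits" option can be sketched with a bounded `queue.Queue` from the Python standard library. The helper name `offer` is hypothetical; what the producer does on rejection (retry later, slow down, or drop the message) depends on the use case.

```python
import queue

def offer(q, message):
    """Try to enqueue without blocking; report whether the message was accepted."""
    try:
        q.put_nowait(message)
        return True
    except queue.Full:
        return False     # queue is at its limit: caller can retry, slow down, or drop

bounded = queue.Queue(maxsize=2)   # hard cap on queue depth
accepted = [offer(bounded, m) for m in ("a", "b", "c")]
```

Here the third message is rejected because the queue already holds two. Once a consumer removes one, `offer` succeeds again, which is how the limit pushes pressure back onto producers.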

Key Takeaways

Queues decouple services and absorb traffic spikes
Use point-to-point for task distribution, pub/sub for event broadcasting
Design consumers to be idempotent (at-least-once delivery is the practical default)
Dead letter queues catch poison messages that fail repeatedly
Monitor queue depth and implement backpressure to prevent unbounded growth