20 - Monitoring and Observability

Monitoring vs Observability

Monitoring tells you when something is wrong. Dashboards, alerts, thresholds. "CPU is at 95%" or "error rate spiked."

Observability tells you why something is wrong. It's the ability to understand your system's internal state from its external outputs. You can ask new questions without deploying new code.

Monitoring is a subset of observability. You need both.

Three Pillars of Observability

The three pillars are: Metrics, Logs, and Traces. Together they answer what is broken (metrics), why it broke (logs), and where in the call chain it broke (traces).

Metrics

Numeric measurements over time. Counters, gauges, histograms.

Request rate — how many requests per second
Error rate — percentage of requests that fail
Latency — how long requests take (p50, p95, p99)
P95 vs P99:
- P95 — good default for most services. Catches the majority of slow requests without being too sensitive to rare outliers.
- P99 — use when tail latency matters (e.g., user-facing APIs, payment systems). A bad P99 means your heaviest users (who make many requests) will regularly hit slow responses.
Example: If your API handles 10,000 requests/minute:
- P95 = 200ms → ~500 requests per minute (5%) are slower than 200ms
- P99 = 200ms → only ~100 requests per minute (1%) are slower than 200ms
Saturation — how full your resources are (CPU, memory, disk, connections)

These four — rate, errors, latency, saturation — are the "golden signals." If you monitor nothing else, monitor these.
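To make the four signals concrete, here is a minimal sketch that derives them from raw request samples. The function names, the `(latency_ms, succeeded)` tuple shape, and the CPU figures are assumptions for illustration; in practice a metrics system such as Prometheus computes these for you, usually from histogram buckets rather than raw samples.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]

def golden_signals(requests, window_seconds, cpu_used, cpu_capacity):
    """Summarize one time window. `requests` is a list of (latency_ms, succeeded) tuples."""
    latencies = [latency for latency, _ in requests]
    failures = sum(1 for _, succeeded in requests if not succeeded)
    return {
        "rate_rps": len(requests) / window_seconds,
        "error_rate_pct": 100 * failures / len(requests),
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
        "p99_ms": percentile(latencies, 99),
        "saturation_pct": 100 * cpu_used / cpu_capacity,
    }

# Hypothetical window: 100 requests in 10 seconds, latencies 1..100 ms, two failures.
sample = [(ms, ms not in (10, 20)) for ms in range(1, 101)]
print(golden_signals(sample, window_seconds=10, cpu_used=6, cpu_capacity=8))
```

Nearest-rank is only one of several percentile definitions; interpolating methods give slightly different values on small samples.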

Tools: Prometheus, Datadog, CloudWatch.

Logs

Structured records of events. Each log entry captures what happened, when, and in what context.

{"timestamp": "2026-05-18T10:30:00Z", "level": "error", "service": "payment", "user_id": "u_123", "message": "charge failed", "error": "card_declined"}

Use structured logs (JSON), not unstructured text. Structured logs are searchable and aggregatable.
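One way to emit entries shaped like the example above is a small JSON formatter on top of Python's standard `logging` module. This is a sketch, not a production setup: the `JsonFormatter` class and the choice of extra fields (`service`, `user_id`, `error`) are assumptions made for this example.

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "service": getattr(record, "service", "unknown"),
            "message": record.getMessage(),
        }
        # Values passed via `extra=` become attributes on the record.
        for key in ("user_id", "error"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

logger = logging.getLogger("payment")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error(
    "charge failed",
    extra={"service": "payment", "user_id": "u_123", "error": "card_declined"},
)
```

Because every entry is one JSON object per line, a log backend can index fields such as `user_id` and `error` directly instead of parsing free text.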

Tools: ELK Stack (Elasticsearch, Logstash, Kibana), Loki, CloudWatch Logs.

Traces

A trace follows a single request as it flows through multiple services. Each service adds a "span" with timing information.

API Gateway     2ms
  ↓
Auth Service    5ms
  ↓
Order Service   50ms
  ↓
Database        45ms

Traces show you where time is spent and which service is the bottleneck. Essential for debugging latency in microservices.
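The idea of nested spans can be sketched with a small in-process recorder. The `Tracer` class below is hypothetical and single-threaded; real tracing libraries (OpenTelemetry, for example) also propagate the trace ID across service boundaries, which this sketch does not attempt.

```python
import time
import uuid
from contextlib import contextmanager

class Tracer:
    """Toy tracer: records one span per `with` block, linked to its parent span."""

    def __init__(self):
        self.trace_id = uuid.uuid4().hex
        self.spans = []    # finished spans, in completion order
        self._stack = []   # names of spans that are currently open

    @contextmanager
    def span(self, name):
        parent = self._stack[-1] if self._stack else None
        self._stack.append(name)
        start = time.perf_counter()
        try:
            yield
        finally:
            duration_ms = (time.perf_counter() - start) * 1000
            self._stack.pop()
            self.spans.append({
                "trace_id": self.trace_id,
                "name": name,
                "parent": parent,
                "duration_ms": duration_ms,
            })

tracer = Tracer()
with tracer.span("api-gateway"):
    with tracer.span("auth-service"):
        time.sleep(0.005)   # stand-in for real work
    with tracer.span("order-service"):
        with tracer.span("database"):
            time.sleep(0.02)

for span in tracer.spans:
    print(f'{span["name"]:<14} parent={span["parent"]} {span["duration_ms"]:.1f}ms')
```

Every span shares the same `trace_id`, which is what lets a trace viewer reassemble the request's path and show which span dominates the total time.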

Tools: Jaeger, Zipkin, AWS X-Ray, OpenTelemetry.

Alerting

Alerts notify you when something needs attention. Good alerting has:

Clear thresholds — alert on symptoms (high error rate), not causes (high CPU). High CPU that doesn't affect users isn't urgent.
Low noise — too many alerts and the team ignores them all. Every alert should be actionable.
Severity levels — page someone at 2am for a site outage, not for a non-critical warning.
Runbooks — each alert links to a document explaining what to check and how to fix it.
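These properties can be sketched as a small rule evaluator. Everything here is illustrative: the `AlertRule` fields, the 5% threshold, and the runbook URL are assumptions, not a real alerting system's API. Requiring several consecutive breaching windows is one common way to keep noise down.

```python
from dataclasses import dataclass

@dataclass
class AlertRule:
    name: str
    threshold_pct: float  # error rate that counts as a breach (a symptom, not a cause)
    for_windows: int      # consecutive breaching windows required before firing
    severity: str         # e.g. "page" or "ticket"
    runbook: str          # link to the document explaining what to check

def evaluate(rule, error_rates_pct):
    """Return an alert dict if the latest `for_windows` samples all breach, else None."""
    recent = error_rates_pct[-rule.for_windows:]
    if len(recent) < rule.for_windows:
        return None  # not enough data yet
    if any(rate <= rule.threshold_pct for rate in recent):
        return None  # a brief spike is not actionable
    return {
        "alert": rule.name,
        "severity": rule.severity,
        "runbook": rule.runbook,
        "current_pct": recent[-1],
    }

rule = AlertRule(
    name="HighErrorRate",
    threshold_pct=5.0,
    for_windows=3,
    severity="page",
    runbook="https://runbooks.example.com/high-error-rate",  # hypothetical URL
)
print(evaluate(rule, [1.0, 9.0, 1.0, 2.0]))  # one-off spike
print(evaluate(rule, [1.0, 6.0, 7.0, 8.0]))  # sustained breach
```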

SLIs, SLOs, and SLAs

SLI (Service Level Indicator) — a metric that measures service quality. "99.2% of requests complete in under 200ms."

SLO (Service Level Objective) — a target for an SLI. "We aim for 99.9% of requests under 200ms." Internal goal.

SLA (Service Level Agreement) — a contract with customers. "If availability drops below 99.9%, we issue credits." External commitment with consequences.

Set SLOs slightly tighter than SLAs. This gives you a buffer before you breach the contract.
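The relationship between the three can be sketched in a few lines. The function names and the 99.5% SLA figure are assumptions chosen for this example; the point is only that the SLO sits above the SLA, so missing the SLO is a warning rather than a contract breach.

```python
def latency_sli(latencies_ms, threshold_ms=200):
    """SLI: percentage of requests that completed in under `threshold_ms`."""
    fast = sum(1 for ms in latencies_ms if ms < threshold_ms)
    return 100 * fast / len(latencies_ms)

def classify(sli_pct, slo_pct, sla_pct):
    """Compare a measured SLI against the internal target and the external contract."""
    if sli_pct >= slo_pct:
        return "meeting SLO"
    if sli_pct >= sla_pct:
        return "SLO missed, SLA intact"  # the buffer is doing its job
    return "SLA breached"

# Hypothetical sample: 1,000 requests, 8 of them slower than 200 ms.
latencies = [120] * 992 + [450] * 8
sli = latency_sli(latencies)
print(sli, classify(sli, slo_pct=99.9, sla_pct=99.5))
```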

Availability in "Nines"

Availability	Downtime/year	Downtime/month	Downtime/day
99% (two nines)	3d 15h 39m	7h 18m	14m 24s
99.9% (three nines)	8h 45m 57s	43m 50s	1m 26s
99.99% (four nines)	52m 35.7s	4m 23s	8.6s
99.999% (five nines)	5m 15.6s	26.3s	0.86s

Each additional nine cuts the allowed downtime by a factor of ten, which makes it far harder (and more expensive) to achieve. Most web applications target three or four nines.
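The downtime figures follow directly from the availability percentage. The sketch below assumes an average year of 365.2425 days (and a month of one twelfth of that); a flat 365-day year gives slightly smaller yearly numbers. The function name is made up for this example.

```python
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_YEAR = 365.2425  # assumption: average Gregorian year

def allowed_downtime_seconds(availability_pct, period_days):
    """Seconds of downtime permitted in `period_days` at the given availability."""
    return (100 - availability_pct) / 100 * period_days * SECONDS_PER_DAY

for availability in (99.0, 99.9, 99.99, 99.999):
    per_year = allowed_downtime_seconds(availability, DAYS_PER_YEAR)
    per_month = allowed_downtime_seconds(availability, DAYS_PER_YEAR / 12)
    per_day = allowed_downtime_seconds(availability, 1)
    print(f"{availability}%: {per_year / 60:.1f} min/year, "
          f"{per_month / 60:.2f} min/month, {per_day:.2f} s/day")
```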

Error Budgets

If your SLO is 99.9% availability, you have a 0.1% error budget per month. That's about 43 minutes of downtime.

When you're within budget, ship fast and take risks. When you're burning through it, slow down and focus on reliability. Error budgets turn the reliability vs velocity debate into a data-driven decision.
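A rough sketch of the bookkeeping, assuming a 30-day month (43,200 minutes) and a simple two-state policy. The function names and the "ship"/"freeze" labels are illustrative; real teams usually track burn rate over several windows rather than a single total.

```python
def error_budget(slo_pct, period_minutes, downtime_minutes):
    """Return the total budget, what is left, and the share already consumed."""
    budget = (100 - slo_pct) / 100 * period_minutes
    return {
        "budget_minutes": budget,
        "remaining_minutes": budget - downtime_minutes,
        "consumed_pct": 100 * downtime_minutes / budget,
    }

def release_policy(status):
    """Hypothetical policy: keep shipping while any budget remains."""
    return "ship" if status["remaining_minutes"] > 0 else "freeze"

MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200

status = error_budget(slo_pct=99.9, period_minutes=MINUTES_PER_30_DAYS, downtime_minutes=30)
print(status, release_policy(status))
```

With a 99.9% SLO the 30-day budget comes to roughly 43 minutes, so 30 minutes of downtime leaves about 13 minutes before releases would be frozen under this policy.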

Real-World Example: E-Commerce Checkout is Slow

Monitoring tells you something is wrong

You have alerts configured:

✅ CPU < 80% — fine
✅ Memory < 70% — fine
✅ Database connections < 100 — fine
🚨 P95 latency > 2s on /checkout — ALERT FIRES

Monitoring tells you: "Checkout is slow." But why? All your predefined metrics look normal. You're stuck.

Observability tells you why

Now you dig in with observability tools:

1. Traces — Pull up a slow checkout request in your APM:

POST /checkout         total: 4.2s
├── auth-service       12ms  ✓
├── inventory-service  45ms  ✓
├── payment-service    3.9s  ← HERE
│   └── stripe-api     3.8s  ← external call hanging
└── email-service      skipped (timeout)

2. Logs — Filter payment-service logs for that trace ID:

[WARN] Stripe API request retry #3: connection timeout to api.stripe.com

3. Metrics — Check payment-service outbound connections: Stripe API latency spiked from 200ms → 4s at 2:15 PM.

Root cause: Stripe had a partial outage. Your services are fine.
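The manual walk through the trace in step 1 can also be automated. The sketch below uses the span durations from the checkout trace above, encoded by hand as `(parent, duration_ms)` pairs; the `critical_path` function and that data layout are assumptions for illustration, not the output format of any particular APM.

```python
def critical_path(spans, root):
    """Follow the slowest child from `root` down to a leaf span."""
    children = {}
    for name, (parent, _duration) in spans.items():
        children.setdefault(parent, []).append(name)
    path = [root]
    while children.get(path[-1]):
        slowest = max(children[path[-1]], key=lambda name: spans[name][1])
        path.append(slowest)
    return path

# name -> (parent, duration in ms), taken from the checkout trace
spans = {
    "POST /checkout": (None, 4200),
    "auth-service": ("POST /checkout", 12),
    "inventory-service": ("POST /checkout", 45),
    "payment-service": ("POST /checkout", 3900),
    "stripe-api": ("payment-service", 3800),
}
print(" -> ".join(critical_path(spans, "POST /checkout")))
```

The path ends at `stripe-api`, matching the manual diagnosis: the time is spent in the external call, not in your own services.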

The difference

	Question	Answer
Monitoring	Is something wrong?	Yes, checkout is slow
Observability	Why is it wrong?	Stripe's API is timing out, causing payment-service to block for 3.8s

Key Takeaways

Monitor the four golden signals: rate, errors, latency, saturation
Use structured logs, not plain text
Traces are essential for debugging latency across services
Alert on symptoms, not causes. Keep alerts actionable.
SLOs define your reliability target; error budgets balance speed and stability