20 - Monitoring and Observability


Monitoring vs Observability

Monitoring tells you when something is wrong. Dashboards, alerts, thresholds. "CPU is at 95%" or "error rate spiked."

Observability tells you why something is wrong. It's the ability to understand your system's internal state from its external outputs. You can ask new questions without deploying new code.

Monitoring is a subset of observability. You need both.

Three Pillars of Observability

The three pillars are: Metrics, Logs, and Traces. Together they answer what is broken (metrics), why it broke (logs), and where in the call chain it broke (traces).

Metrics

Numeric measurements over time. Counters, gauges, histograms.

  1. Request rate — how many requests per second

  2. Error rate — percentage of requests that fail

  3. Latency — how long requests take (p50, p95, p99)

    P95 vs P99:

    • P95 — good default for most services. Catches the majority of slow requests without being too sensitive to rare outliers.
    • P99 — use when tail latency matters (e.g., user-facing APIs, payment systems). A bad P99 means your heaviest users (who make many requests) will regularly hit slow responses.

    Example: If your API handles 10,000 requests/minute:

    • P95 = 200ms → ~500 requests per minute (5%) are slower than 200ms
    • P99 = 200ms → only ~100 requests per minute (1%) are slower than 200ms
  4. Saturation — how full your resources are (CPU, memory, disk, connections)

These four — rate, errors, latency, saturation — are the "golden signals." If you monitor nothing else, monitor these.
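To make the percentile arithmetic concrete, here is a minimal sketch that computes latency percentiles with the nearest-rank method and counts how many requests fall above each one. The function names and the sample data (latencies of 1ms through 10,000ms) are assumptions chosen so the numbers are easy to check by hand; metrics backends typically estimate percentiles from histogram buckets instead.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples <= it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # 1-based rank; multiply before dividing so whole-number cases stay exact.
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def slower_than(samples, threshold_ms):
    """Count the requests that took longer than threshold_ms."""
    return sum(1 for ms in samples if ms > threshold_ms)


# Hypothetical minute of traffic: 10,000 requests taking 1ms, 2ms, ..., 10,000ms.
latencies_ms = list(range(1, 10_001))

for p in (50, 95, 99):
    value = percentile(latencies_ms, p)
    print(f"p{p} = {value}ms, slower requests: {slower_than(latencies_ms, value)}")
```

With 10,000 samples, 500 requests sit above the P95 value and 100 sit above the P99 value, matching the example above.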

Tools: Prometheus, Datadog, CloudWatch.

Logs

Structured records of events. Each log entry captures what happened, when, and in what context.

{"timestamp": "2026-05-18T10:30:00Z", "level": "error", "service": "payment", "user_id": "u_123", "message": "charge failed", "error": "card_declined"}

Use structured logs (JSON), not unstructured text. Structured logs are searchable and aggregatable.
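One way to emit entries shaped like the example above is a custom formatter on Python's standard logging module. This is a sketch, not a recommended production setup: the JsonFormatter class, the build_logger helper, and the convention of passing extra fields under a "context" key are all assumptions made for illustration.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "service": record.name,
            "message": record.getMessage(),
        }
        # Assumed convention: callers pass extra={"context": {...}} for custom fields.
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry)


def build_logger(service):
    """Return a logger for `service` that writes JSON lines to stderr."""
    logger = logging.getLogger(service)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


logger = build_logger("payment")
logger.error("charge failed", extra={"context": {"user_id": "u_123", "error": "card_declined"}})
```

Because every entry is a JSON object with consistent keys, a log backend can filter on fields such as user_id or error without parsing free text.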

Tools: ELK Stack (Elasticsearch, Logstash, Kibana), Loki, CloudWatch Logs.

Traces

A trace follows a single request as it flows through multiple services. Each service adds a "span" with timing information.

API Gateway     2ms

Auth Service    5ms

Order Service   50ms

Database        45ms

Traces show you where time is spent and which service is the bottleneck. Essential for debugging latency in microservices.
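The idea of spans can be sketched in a few lines without any tracing library. The Trace class below is a hypothetical, in-process toy: it records a flat list of timed spans and does not model parent/child relationships or propagate context between services, which real tracing systems do. The sleeps stand in for actual work.

```python
import time
from contextlib import contextmanager


class Trace:
    """Toy trace: collects (span name, duration in ms) for one request."""

    def __init__(self, trace_id):
        self.trace_id = trace_id
        self.spans = []

    @contextmanager
    def span(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the span even if the wrapped work raised an exception.
            elapsed_ms = (time.perf_counter() - start) * 1000
            self.spans.append((name, elapsed_ms))

    def bottleneck(self):
        """Return the (name, duration_ms) pair of the slowest span."""
        return max(self.spans, key=lambda span: span[1])


trace = Trace("t_demo")  # hypothetical trace ID
with trace.span("Auth Service"):
    time.sleep(0.002)  # stand-in for real work
with trace.span("Order Service"):
    time.sleep(0.02)

name, duration_ms = trace.bottleneck()
print(f"slowest span: {name} ({duration_ms:.1f}ms)")
```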

Tools: Jaeger, Zipkin, AWS X-Ray, OpenTelemetry.

Alerting

Alerts notify you when something needs attention. Good alerting has:

  • Clear thresholds — alert on symptoms (high error rate), not causes (high CPU). High CPU that doesn't affect users isn't urgent.
  • Low noise — too many alerts and the team ignores them all. Every alert should be actionable.
  • Severity levels — page someone at 2am for a site outage, not for a non-critical warning.
  • Runbooks — each alert links to a document explaining what to check and how to fix it.
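The properties in the list above can be sketched as data plus one evaluation function. Everything here is illustrative: the AlertRule fields, the thresholds, the metric names, and the runbook paths are assumptions, not the configuration format of any particular alerting tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    name: str
    metric: str       # key to look up in the metrics snapshot
    threshold: float  # fire when the metric is above this value
    severity: str     # "page" wakes someone up, "ticket" waits for business hours
    runbook: str      # where the responder should look first


# Hypothetical rules: each one targets a user-visible symptom or an imminent failure.
RULES = [
    AlertRule("High error rate", "error_rate", 0.05, "page", "runbooks/high-error-rate"),
    AlertRule("Slow checkout", "checkout_p95_s", 2.0, "page", "runbooks/slow-checkout"),
    AlertRule("Disk filling up", "disk_used_ratio", 0.85, "ticket", "runbooks/disk-usage"),
]


def evaluate(rules, metrics):
    """Return the rules whose metric is currently above its threshold."""
    return [rule for rule in rules if metrics.get(rule.metric, 0.0) > rule.threshold]


# CPU is high in this snapshot, but no rule watches it, so it does not alert by itself.
snapshot = {"error_rate": 0.08, "checkout_p95_s": 0.4, "disk_used_ratio": 0.9, "cpu": 0.95}

for rule in evaluate(RULES, snapshot):
    print(f"[{rule.severity}] {rule.name} -> see {rule.runbook}")
```

Keeping severity and runbook on the rule itself means every alert that fires already says who should act and where to start.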

SLIs, SLOs, and SLAs

SLI (Service Level Indicator) — a metric that measures service quality. "99.2% of requests complete in under 200ms."

SLO (Service Level Objective) — a target for an SLI. "We aim for 99.9% of requests under 200ms." Internal goal.

SLA (Service Level Agreement) — a contract with customers. "If availability drops below 99.9%, we issue credits." External commitment with consequences.

Set SLOs slightly tighter than SLAs. This gives you a buffer before you breach the contract.
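A short sketch of how the three terms relate in code: compute the SLI from raw measurements, then compare it against the SLO and SLA targets. The function names and the 99% SLA target are assumptions for illustration; the 200ms threshold, the 99.9% SLO, and the 99.2% measurement come from the examples above.

```python
def latency_sli(latencies_ms, threshold_ms=200):
    """SLI: fraction of requests that completed in under threshold_ms."""
    if not latencies_ms:
        raise ValueError("need at least one request")
    good = sum(1 for ms in latencies_ms if ms < threshold_ms)
    return good / len(latencies_ms)


def meets(sli, target):
    """True when the measured SLI is at or above the target."""
    return sli >= target


SLO_TARGET = 0.999  # internal goal
SLA_TARGET = 0.99   # hypothetical contractual floor, looser than the SLO

# Hypothetical window: 992 fast requests and 8 slow ones.
window = [120] * 992 + [350] * 8
sli = latency_sli(window)

print(f"SLI = {sli:.1%}")
print("meets SLO:", meets(sli, SLO_TARGET))
print("meets SLA:", meets(sli, SLA_TARGET))
```

In this window the service misses its SLO but is still inside the SLA, which is exactly the buffer that a tighter SLO is meant to provide.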

Error Budgets

If your SLO is 99.9% availability, you have a 0.1% error budget. Over a 30-day month (43,200 minutes), that's about 43 minutes of downtime.

When you're within budget, ship fast and take risks. When you're burning through it, slow down and focus on reliability. Error budgets turn the reliability vs velocity debate into a data-driven decision.
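The budget arithmetic is simple enough to sketch directly. The function names and the 30-day window default are assumptions; teams pick their own window and their own policy for what happens when the budget runs low.

```python
def error_budget_minutes(slo, days=30):
    """Allowed downtime in minutes for an availability SLO over a window of `days`."""
    return (1 - slo) * days * 24 * 60


def budget_remaining(slo, downtime_minutes, days=30):
    """Fraction of the error budget still unspent (negative once the SLO is breached)."""
    budget = error_budget_minutes(slo, days)
    return (budget - downtime_minutes) / budget


budget = error_budget_minutes(0.999)
print(f"budget: {budget:.1f} minutes per 30 days")

# Hypothetical month so far: 21.6 minutes of downtime.
print(f"remaining: {budget_remaining(0.999, 21.6):.0%}")
```

A negative result from budget_remaining means the budget is overspent, which is the signal to slow down and prioritize reliability work.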

Real-World Example: E-Commerce Checkout is Slow

Monitoring tells you something is wrong

You have alerts configured:

  • ✅ CPU < 80% — fine
  • ✅ Memory < 70% — fine
  • ✅ Database connections < 100 — fine
  • 🚨 P95 latency > 2s on /checkout — ALERT FIRES

Monitoring tells you: "Checkout is slow." But why? All your predefined metrics look normal. You're stuck.

Observability tells you why

Now you dig in with observability tools:

1. Traces — Pull up a slow checkout request in your APM:

POST /checkout         total: 4.2s
├── auth-service       12ms  ✓
├── inventory-service  45ms  ✓
├── payment-service    3.9s  ← HERE
│   └── stripe-api     3.8s  ← external call hanging
└── email-service      skipped (timeout)

2. Logs — Filter payment-service logs for that trace ID:

[WARN] Stripe webhook retry #3 — connection timeout to api.stripe.com

3. Metrics — Check payment-service outbound connections: Stripe API latency spiked from 200ms → 4s at 2:15 PM.

Root cause: Stripe had a partial outage. Your services are fine.
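The log-filtering step works because each service writes the trace ID into its log entries. A minimal sketch of that lookup, assuming JSON-per-line logs with a "trace_id" field; the field name, the sample lines, and the IDs are hypothetical and not taken from a real system.

```python
import json


def logs_for_trace(lines, trace_id):
    """Return the parsed log entries that belong to one trace."""
    matches = []
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not structured
        if isinstance(entry, dict) and entry.get("trace_id") == trace_id:
            matches.append(entry)
    return matches


# Hypothetical payment-service log lines.
raw_lines = [
    '{"level": "info", "trace_id": "t_111", "message": "charge started"}',
    '{"level": "warn", "trace_id": "t_222", "message": "connection timeout to api.stripe.com"}',
    'plain text line without structure',
    '{"level": "info", "trace_id": "t_111", "message": "charge succeeded"}',
]

for entry in logs_for_trace(raw_lines, "t_222"):
    print(entry["level"], entry["message"])
```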

The difference

              | Question            | Answer
Monitoring    | Is something wrong? | Yes, checkout is slow
Observability | Why is it wrong?    | Stripe's API is timing out, causing payment-service to block for 3.8s

Key Takeaways

  • Monitor the four golden signals: rate, errors, latency, saturation
  • Use structured logs, not plain text
  • Traces are essential for debugging latency across services
  • Alert on symptoms, not causes. Keep alerts actionable.
  • SLOs define your reliability target; error budgets balance speed and stability


© 2026 ByteLearn.dev. Free courses for developers.