The first time I set up API monitoring I was tracking 40 different metrics. Each one theoretically mattered. In practice, during an incident, figuring out which one had gone bad took 20 minutes. I moved to the Google SRE book’s “Four Golden Signals” approach, the dashboard got simpler, and incident detection sped up. Here it is.
The four core signals
- Latency: how long requests take.
- Error rate: the percentage of requests that fail.
- Throughput: requests per second.
- Saturation: how full your capacity is.
Everything else is a derivation or a drill-down. The main dashboard should hold only these.
Latency
Averages lie. Use percentiles:
- p50 (median): how the majority experience the API
- p95: the slower-end users
- p99: worst case
- p99.9: tail latency, the outliers
An API can average 200ms and have a p99 of 3 seconds. If you stare at the average you’ll never see it.
Prometheus query:
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
Alert thresholds depend on the project, but as a starting point:
- p95 > 500ms: warning
- p95 > 1s: page
- p99 > 2s: page
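As a sketch, the p95 warning above can be wired up as a Prometheus alerting rule. The metric name comes from the query earlier; the rule and alert names, the sum by (le) aggregation, and the 5-minute hold are assumptions:

groups:
  - name: api-latency
    rules:
      - alert: ApiP95LatencyHigh
        # p95 over the last 5 minutes, aggregated across instances
        expr: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 0.5
        for: 5m  # stay above the threshold for 5 minutes before firing, to avoid flapping
        labels:
          severity: warning
        annotations:
          summary: "p95 latency above 500ms"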
Latency creeping up is usually the first sign a downstream service is slowing. A tired database, a cold cache, network issues, anything like that.
Error rate
The rate of HTTP 5xx responses. Measure 4xx separately; those are client-side errors.
Prometheus query:
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
Normally this should sit below 0.1%. 1% is “pay attention”, 5% is “wake everyone up”.
Alert thresholds:
- 0.5%: warning
- 1%: page
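The page threshold is just the same ratio with a comparison on the end; a minimal version, reusing the query above with the 1% figure as the literal:

sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.01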
Error rate spikes usually mean a bad deploy, exhausted database connections, or a third-party API outage.
Throughput
Requests per second (RPS). Both a capacity metric and a proxy for business health.
sum(rate(http_requests_total[1m]))
Three patterns to watch:
- Sudden drop: traffic fell off a cliff. Load balancer problem? DNS? The frontend might be failing and not firing requests.
- Sudden spike: DDoS? Viral moment? Retry storm? Rate limiting may need to kick in.
- Off-pattern anomaly: normal Friday 2pm is 200 RPS, this Friday is 50 RPS. Something’s wrong.
For throughput, anomaly detection works better than a fixed threshold. Alert when it drifts outside the average ± one standard deviation.
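A rough PromQL version of that band, using the sudden-drop case as the example. The one-standard-deviation band mirrors the text; the 1-hour baseline window is an assumption, and widening the band to two or three standard deviations is a common way to cut noise:

# Fire when current RPS falls below the last hour's average minus one
# standard deviation; flip the comparison and the sign for the spike case.
sum(rate(http_requests_total[1m]))
  <
    avg_over_time(sum(rate(http_requests_total[1m]))[1h:1m])
  - stddev_over_time(sum(rate(http_requests_total[1m]))[1h:1m])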
Saturation
How full the resources are. CPU, memory, disk I/O, network bandwidth, DB connection pool, thread pool.
What to keep an eye on:
- CPU: persistent over 80% is a bottleneck.
- Memory: over 85%, OOM risk.
- DB connection pool: 80% in use, exhaustion is close.
- Disk I/O: high iowait means the DB gets slow.
- Queue length: backed-up job queue means you don’t have enough consumers.
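A few example queries for these, assuming the standard node_exporter metric names for CPU and memory; the connection pool gauges are hypothetical app-level metrics, so substitute whatever your pool actually exports:

# CPU: fraction of time spent non-idle, per instance, over 5 minutes
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

# Memory: fraction of RAM in use
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

# DB connection pool: fraction of the pool in use (hypothetical gauge names)
db_connections_in_use / db_connections_max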
Saturation is a leading indicator for the other three. Saturation climbing before latency moves buys you time to act.
Dashboard layout
Four signals on one page, traffic-light colours:
┌─────────────┬─────────────┐
│ Latency     │ Error rate  │
│ p95: 180ms  │ 0.2%        │
│ ● green     │ ● green     │
├─────────────┼─────────────┤
│ Throughput  │ Saturation  │
│ 450 rps     │ CPU 62%     │
│ ● green     │ DB 45/100   │
└─────────────┴─────────────┘
Per-service breakdowns, endpoint lists, and database metrics live in the drill-downs below.
Avoid alert fatigue
Don’t alert on every anomaly. An anomaly is not an incident.
- Alert only if it needs action.
- Multi-window: 5-minute and 1-hour windows. Both have to cross the threshold before the alert fires.
- Symptom-based alerts. “CPU 80%” alone is noise. “Latency is degrading and CPU is at 80%” is useful.
The SRE book covers burn-rate alerts in detail. They track your SLO budget and fire when you’re burning through it fast.
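A sketch of a fast-burn alert in that style, combining the multi-window idea from the list above with the error-rate query from earlier. It assumes a 99.9% availability SLO (0.1% error budget); the 14.4 multiplier is the usual fast-burn factor for a 1-hour window from the SRE workbook:

# Page only when both the short and long windows are burning the error
# budget at roughly 14x the sustainable rate.
(
  sum(rate(http_requests_total{status=~"5.."}[5m]))
    / sum(rate(http_requests_total[5m])) > 14.4 * 0.001
)
and
(
  sum(rate(http_requests_total{status=~"5.."}[1h]))
    / sum(rate(http_requests_total[1h])) > 14.4 * 0.001
)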
Logging and tracing
Metrics tell you “something happened”. To figure out what:
- Logging: structured JSON logs with fields like error, user_id, request_id, and endpoint.
- Tracing: how a request moved across your service graph. OpenTelemetry, Jaeger.
- Profiling: CPU and memory profiles on demand. Pprof or continuous profiling.
Metrics + logs + traces = the observability triad. Without all three, debugging takes forever.
Tool stack
The combination I typically use:
- Metrics: Prometheus + Grafana.
- Logs: Loki or ELK.
- Tracing: Tempo or Jaeger.
- Alerting: Alertmanager + PagerDuty.
- SLO tracking: Nobl9 or a custom Grafana dashboard.
At small scale a single integrated tool (Datadog, New Relic) saves you work, at the cost of a much higher bill. At larger scale the open-source stack pays off.
Where to start
Day one, put up the four signals. One dashboard. Only critical alerts. As incidents happen, you add drill-down metrics. Don’t start with 40.