The first time I set up API monitoring I was tracking 40 different metrics. Each one theoretically mattered. In practice, during an incident, figuring out which one had gone bad took 20 minutes. I moved to the Google SRE book’s “Four Golden Signals” approach, the dashboard got simpler, and incident detection sped up. Here it is.
The four core signals
- Latency: how long requests take.
- Error rate: the percentage of requests that fail.
- Throughput: requests per second.
- Saturation: how full your capacity is.
Everything else is a derivation or a drill-down. The main dashboard should hold only these.
Latency
Averages lie. Use percentiles:
- p50 (median): how the majority experience the API
- p95: the slower-end users
- p99: worst case
- p99.9: tail latency, the outliers
An API can average 200ms and have a p99 of 3 seconds. If you stare at the average you’ll never see it.
Prometheus query:
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
Alert thresholds depend on the project, but as a starting point:
- p95 > 500ms: warning
- p95 > 1s: page
- p99 > 2s: page
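As a sketch, the p95 warning above can be wired up as a Prometheus alerting rule. The metric name comes from the query earlier; the rule and alert names, the sum by (le) aggregation, and the 5-minute hold are assumptions:

groups:
  - name: api-latency
    rules:
      - alert: ApiP95LatencyHigh
        # p95 over the last 5 minutes, aggregated across instances
        expr: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 0.5
        for: 5m  # stay above the threshold for 5 minutes before firing, to avoid flapping
        labels:
          severity: warning
        annotations:
          summary: "p95 latency above 500ms"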
Latency creeping up is usually the first sign a downstream service is slowing. A tired database, a cold cache, network issues, anything like that.
Error rate
The rate of HTTP 5xx responses. Measure 4xx separately; those are client-side errors.
Prometheus query:
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
Normally this should sit below 0.1%. 1% is “pay attention”, 5% is “wake everyone up”.
Alert thresholds:
- 0.5%: warning
- 1%: page
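The page threshold is just the same ratio with a comparison on the end; a minimal version, reusing the query above with the 1% figure as the literal:

sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.01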
Error rate spikes usually mean a bad deploy, exhausted database connections, or a third-party API outage.
Throughput
Requests per second (RPS). Both a capacity metric and a proxy for business health.
sum(rate(http_requests_total[1m]))
Three patterns to watch:
- Sudden drop: traffic fell off a cliff. Load balancer problem? DNS? The frontend might be failing and not firing requests.
- Sudden spike: DDoS? Viral moment? Retry storm? Rate limiting may need to kick in.
- Off-pattern anomaly: normal Friday 2pm is 200 RPS, this Friday is 50 RPS. Something’s wrong.
For throughput, anomaly detection works better than a fixed threshold. Alert when it drifts outside the average ± one standard deviation.
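A rough PromQL version of that band, using the sudden-drop case as the example. The one-standard-deviation band mirrors the text; the 1-hour baseline window is an assumption, and widening the band to two or three standard deviations is a common way to cut noise:

# Fire when current RPS falls below the last hour's average minus one
# standard deviation; flip the comparison and the sign for the spike case.
sum(rate(http_requests_total[1m]))
  <
    avg_over_time(sum(rate(http_requests_total[1m]))[1h:1m])
  - stddev_over_time(sum(rate(http_requests_total[1m]))[1h:1m])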
Saturation
How full the resources are. CPU, memory, disk I/O, network bandwidth, DB connection pool, thread pool.
What to keep an eye on:
- CPU: persistent over 80% is a bottleneck.
- Memory: over 85%, OOM risk.
- DB connection pool: 80% in use, exhaustion is close.
- Disk I/O: high iowait means the DB gets slow.
- Queue length: backed-up job queue means you don’t have enough consumers.
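A few example queries for these, assuming the standard node_exporter metric names for CPU and memory; the connection pool gauges are hypothetical app-level metrics, so substitute whatever your pool actually exports:

# CPU: fraction of time spent non-idle, per instance, over 5 minutes
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

# Memory: fraction of RAM in use
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

# DB connection pool: fraction of the pool in use (hypothetical gauge names)
db_connections_in_use / db_connections_max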
Saturation is a leading indicator for the other three. Saturation climbing before latency moves buys you time to act.
Dashboard layout
Four signals on one page, traffic-light colours:
┌─────────────┬─────────────┐
│ Latency     │ Error rate  │
│ p95: 180ms  │ 0.2%        │
│ ● green     │ ● green     │
├─────────────┼─────────────┤
│ Throughput  │ Saturation  │
│ 450 rps     │ CPU 62%     │
│ ● green     │ DB 45/100   │
└─────────────┴─────────────┘
Per-service breakdowns, endpoint lists, and database metrics live in the drill-downs below.
Avoid alert fatigue
Don’t alert on every anomaly. An anomaly is not an incident.
- Alert only if it needs action.
- Multi-window: 5-minute and 1-hour windows. Both have to cross the threshold before the alert fires.
- Symptom-based alerts. “CPU 80%” alone is noise. “Latency is degrading and CPU is at 80%” is useful.
The SRE book covers burn-rate alerts in detail. They track your SLO budget and fire when you’re burning through it fast.
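A sketch of a fast-burn alert in that style, combining the multi-window idea from the list above with the error-rate query from earlier. It assumes a 99.9% availability SLO (0.1% error budget); the 14.4 multiplier is the usual fast-burn factor for a 1-hour window from the SRE workbook:

# Page only when both the short and long windows are burning the error
# budget at roughly 14x the sustainable rate.
(
  sum(rate(http_requests_total{status=~"5.."}[5m]))
    / sum(rate(http_requests_total[5m])) > 14.4 * 0.001
)
and
(
  sum(rate(http_requests_total{status=~"5.."}[1h]))
    / sum(rate(http_requests_total[1h])) > 14.4 * 0.001
)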
Logging and tracing
Metrics tell you “something happened”. To figure out what:
- Logging: structured JSON logs with fields like error, user_id, request_id, and endpoint.
- Tracing: how a request moved across your service graph. OpenTelemetry, Jaeger.
- Profiling: CPU and memory profiles on demand. Pprof or continuous profiling.
Metrics + logs + traces = the observability triad. Without all three, debugging takes forever.
Tool stack
The combination I typically use:
- Metrics: Prometheus + Grafana.
- Logs: Loki or ELK.
- Tracing: Tempo or Jaeger.
- Alerting: Alertmanager + PagerDuty.
- SLO tracking: Nobl9 or a custom Grafana dashboard.
At small scale a single integrated tool (Datadog, New Relic) saves you work, at the cost of a much higher bill. At larger scale the open-source stack pays off.
Where to start
Day one, put up the four signals. One dashboard. Only critical alerts. As incidents happen, you add drill-down metrics. Don’t start with 40.