The biggest adjustment in moving from freelance to a small team was on-call. When you work solo, the alarm is always yours: customer on the phone, critical bug in the channel, midnight deploy, all of it lands on you. Once you build a team, you have to learn to distribute that load properly, otherwise nobody lasts a year.
Over the last 18 months I’ve set up on-call at 3 team sizes: 2, 4, and 8 people. Different patterns work at each size. Here’s what I learned.
The point of on-call
It helps to think of on-call not as a stress source but as a health barometer for the system. A good on-call rotation delivers on these 3 goals:
- Fast response: minimize the time from detecting a production outage to resolving it
- Distributed knowledge: every engineer understands the critical parts of the system, no single person silo
- Sustainable life: nobody feels burned out a year in
A solo developer only hits goal 1. No distribution, no sustainability. That’s the real win of moving to a team.
Rotation cadence
By team size:
2 people: weekly rotation, Friday evening handoff. With only two of you, the person coming off rotation still quietly covers the weekend, and it takes a toll. Burnout arrives fast.
4 people: weekly rotation, 1 week on per month. Three weeks between rotations is enough to recover.
8 people: weekly or biweekly, primary plus secondary running in parallel. Primary takes the first page, secondary is the backup.
Daily rotation doesn’t work for me: too many context switches, incidents rarely close within 24 hours, and the incoming person barely warms up before they hand off again.
Monthly rotation is too long: a full month on call destroys people, even 3 months of recovery afterwards isn’t enough.
Weekly is the sweet spot.
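To make the cadence concrete, here’s a minimal sketch of a weekly primary/secondary schedule with a Friday-evening handoff. The team names are placeholders, and a real setup would live in your paging tool rather than a script:

```python
from datetime import date, timedelta

# Hypothetical team; the order defines the rotation.
TEAM = ["alice", "bora", "chen", "dara"]

def weekly_schedule(start_friday: date, weeks: int):
    """Yield (handoff_date, primary, secondary) for each rotation week.

    The primary takes the first page; the secondary (last week's primary)
    is the backup, so context about open incidents carries over.
    """
    for i in range(weeks):
        handoff = start_friday + timedelta(weeks=i)
        primary = TEAM[i % len(TEAM)]
        secondary = TEAM[(i - 1) % len(TEAM)]
        yield handoff, primary, secondary

if __name__ == "__main__":
    for handoff, primary, secondary in weekly_schedule(date(2024, 3, 1), 8):
        print(f"{handoff}  primary={primary:<6} secondary={secondary}")
```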
Tool: PagerDuty, Opsgenie, or something simple
On my first team we started with a Discord bot plus manual tracking. It collapsed 3 months in. Missed pages, wrong escalations, no rotation record kept.
Past the earliest stage, a proper paging tool is non-negotiable. PagerDuty or Opsgenie gives you:
- Rotation schedule definition
- Escalation policy (if the primary doesn’t respond, go to the secondary, then the manager)
- Alert source integrations (Datadog, Sentry, UptimeRobot)
- Automatic incident timeline
- Mobile app (yes, it will wake you up, and that’s exactly the point)
PagerDuty gets pricey for small teams (around $30+ per person per month); Opsgenie lives in the Atlassian ecosystem (good if you’re on Jira).
If budget is tight: Grafana OnCall (open source, self-hosted), or a simple rotation sheet plus Discord/Slack alert bot combo. Holds up in early-stage work.
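If you go the sheet-plus-bot route, the bot really can be tiny. A sketch, assuming a Slack incoming-webhook URL and a hard-coded rotation list (both hypothetical):

```python
import json
import urllib.request
from datetime import date

# Both values are placeholders: the URL comes from Slack's incoming-webhook
# integration, the rotation order from your sheet.
WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"
ROTATION = ["alice", "bora", "chen", "dara"]

def post_oncall_reminder(today: date) -> None:
    """Post who is on call this week to the team channel."""
    week_index = today.isocalendar()[1] % len(ROTATION)
    payload = {"text": f"On call this week: @{ROTATION[week_index]}"}
    req = urllib.request.Request(
        WEBHOOK_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

if __name__ == "__main__":
    post_oncall_reminder(date.today())
```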
Alert fatigue: the real threat
The fastest way to break a team is to flood on-call with junk alerts. If “CPU hit 85%” fires 50 times a week, nobody takes it seriously, and the real incident slips through.
The discipline:
Every alert has to be actionable. An alert exists to be acted on, not observed. “CPU high” is a metric to watch, not an alert. “CPU p95 over 90% for the last 15 minutes” plus a linked runbook is an alert.
Tune thresholds deliberately. Too loose and they spew false alerts; too tight and they never fire for the real problems. The ideal: when an alert goes off, there’s a 90% chance something is genuinely wrong.
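As a sketch of what “sustained breach, not a spike” means in practice, here’s a plain-Python check, assuming you can pull a list of per-minute CPU samples from your metrics store:

```python
import math

def p95(samples: list[float]) -> float:
    """95th percentile via the nearest-rank method."""
    ordered = sorted(samples)
    rank = math.ceil(0.95 * len(ordered)) - 1
    return ordered[rank]

def should_alert(cpu_per_minute: list[float],
                 threshold: float = 90.0,
                 window_minutes: int = 15) -> bool:
    """Fire only when the p95 of the last 15 minutes exceeds the threshold.

    A single spike does not page anyone; a sustained breach does.
    """
    if len(cpu_per_minute) < window_minutes:
        return False
    window = cpu_per_minute[-window_minutes:]
    return p95(window) > threshold
```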
Every alert needs a runbook. When it fires, the on-call engineer follows the steps: did it crash? Check this, read this log, run this command; if it doesn’t resolve, escalate.
Without a runbook, the engineer is debugging blind, time-to-resolution goes up, stress goes up, and everyone resents it.
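One way to enforce “no alert without a runbook” is to make the runbook link a required field wherever alerts are defined. A tiny sketch (the wiki URL is hypothetical):

```python
def define_alert(name: str, condition: str, runbook_url: str) -> dict:
    """Refuse to register an alert that has no runbook attached."""
    if not runbook_url:
        raise ValueError(f"alert {name!r} has no runbook; write one first")
    return {"name": name, "condition": condition, "runbook_url": runbook_url}

alert = define_alert(
    name="cpu_p95_high",
    condition="cpu p95 > 90% for 15 minutes",
    runbook_url="https://wiki.example.com/runbooks/cpu-p95-high",  # hypothetical
)
```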
Weekly alert review. Every week the person rolling off on-call reviews the alerts that fired: “this shouldn’t be an alert”, “the threshold on this one is wrong”, “these 3 alerts are actually one root cause”, and they get fixed.
Teams that stick to this discipline typically cut alert volume in half within 3 months, and the alerts that remain get taken seriously.
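The review goes faster with a quick tally of what actually fired. A sketch, assuming you can export the week’s alerts from your tool as a list of records with a name field:

```python
from collections import Counter

def noisiest_alerts(alerts: list[dict], top_n: int = 5) -> list[tuple[str, int]]:
    """Count the week's alerts by name and return the noisiest ones.

    The top of this list is where the review time goes: tighten the
    threshold, merge duplicates, or delete the alert outright.
    """
    counts = Counter(a["name"] for a in alerts)
    return counts.most_common(top_n)

# Example: three firings that are really one root cause show up together.
week = [{"name": "cpu_p95_high"}, {"name": "cpu_p95_high"},
        {"name": "disk_full"}, {"name": "cpu_p95_high"}]
print(noisiest_alerts(week))  # [('cpu_p95_high', 3), ('disk_full', 1)]
```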
Severity levels
Making everything P1 equals making nothing P1.
The levels I use:
- P1 (critical): a meaningful chunk of production is down, revenue impact. Immediate response, on-call gets out of bed.
- P2 (major): feature degradation, some users affected. Response within the hour.
- P3 (minor): small problem, limited user impact. Fixed during business hours.
- P4 (low): cosmetic, non-urgent. Ticket opened, handled outside the rotation.
Only P1 wakes you at night. P2 is an off-hours Slack notification, no immediate action. P3 and P4 go by email, picked up at the start of the workday.
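That routing is easy to encode once, so it never depends on anyone’s judgment at 3 a.m. A sketch with hypothetical channel names:

```python
from enum import Enum

class Severity(Enum):
    P1 = 1  # critical: page, wake someone up
    P2 = 2  # major: off-hours Slack notification
    P3 = 3  # minor: email, business hours
    P4 = 4  # low: ticket, handled outside the rotation

def route(severity: Severity) -> str:
    """Map a severity to its notification channel, per the policy above."""
    return {
        Severity.P1: "page",    # PagerDuty / Opsgenie, phone call
        Severity.P2: "slack",
        Severity.P3: "email",
        Severity.P4: "ticket",
    }[severity]

assert route(Severity.P1) == "page"
```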
Incident response protocol
When the on-call engineer gets paged, the sequence is:
- Acknowledge (within 5 minutes): hit “ack” in PagerDuty so the tool stops re-paging and escalating
- Triage: how bad is this, do we escalate?
- Communicate: open an incident channel in Slack, update the status page
- Mitigate: stop the bleeding the fastest way possible (rollback, circuit breaker, traffic redirect)
- Resolve: get to root cause, fix it, verify
- Postmortem: write a blameless postmortem within 24 to 48 hours
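Recording a timestamp at each step pays off later: the postmortem’s detection and mitigation times fall out of it for free. A minimal sketch of what I mean:

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional

@dataclass
class Incident:
    """Collects the timestamps the postmortem will need later."""
    detected_at: datetime
    acknowledged_at: Optional[datetime] = None
    mitigated_at: Optional[datetime] = None
    resolved_at: Optional[datetime] = None
    timeline: list = field(default_factory=list)

    def log(self, event: str) -> None:
        """Append a timestamped line to the incident timeline."""
        self.timeline.append(f"{datetime.utcnow().isoformat()} {event}")

    def time_to_mitigate(self) -> Optional[timedelta]:
        """Detection-to-mitigation duration, once mitigation has happened."""
        if self.mitigated_at is None:
            return None
        return self.mitigated_at - self.detected_at
```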
Blameless postmortem
The most important piece of work after an incident is the postmortem. A good one covers:
- What happened, chronologically
- Why it happened, down to root cause (5 Whys)
- How it was detected (detection time)
- How it was resolved (mitigation time)
- Action items to prevent a repeat (owner plus due date)
- “Nobody is at fault”, it’s a system problem, not a human error
A blameless culture is the whole point. Write “Ali’s deploy broke it” once and nobody takes risks again, people stop showing up to postmortems, and the learning dies.
Compensation
On-call is work. If you’re doing weekly rotations, you should be paid for it. Options:
- Hourly on-call pay: a rate per on-call hour (e.g. primary hours paid, the secondary rotation unpaid)
- Flat weekly bonus: fixed premium for the rotation week
- Comp time: if you got paged during the rotation week, next day off
- Equity / bonus pool weighting: on-call hours weighted into year-end bonus
On-call without compensation is a burnout guarantee. Make it a clear company policy.
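Whichever model you pick, write the math down so nobody has to negotiate it after a rough week. A sketch of the flat-bonus-plus-comp-time variant, with placeholder amounts:

```python
def oncall_compensation(weeks_on_call: int,
                        nights_paged: int,
                        weekly_bonus: float = 150.0) -> dict:
    """Flat weekly bonus plus one comp day per night actually paged.

    The amounts are placeholders; the point is that the policy is
    written down and computed the same way for everyone.
    """
    return {
        "bonus": weeks_on_call * weekly_bonus,
        "comp_days": nights_paged,
    }

print(oncall_compensation(weeks_on_call=1, nights_paged=2))
# {'bonus': 150.0, 'comp_days': 2}
```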
Solo-to-team transition tip
Going from 1 to 2, this is where things fall over. “I’ve been doing it already, they don’t need to help”: the tendency is not to hand anything off. Three months later the new person still isn’t ramped up and the founder is still buried alone.
Flip it. Put the new hire on rotation from week one. The founder takes secondary, the new hire is primary. Every alarm goes to the new person first, escalating to the founder only if they can’t handle it. Knowledge transfer accelerates, responsibility is shared, you get to breathe.
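In tooling terms this is just a two-step escalation policy. A sketch of the idea, with hypothetical names and timeouts:

```python
# The new hire is always paged first, the founder is the safety net,
# and the team channel is the last resort for an unclaimed P1.
ESCALATION_POLICY = [
    {"notify": "new_hire", "escalate_after_minutes": 15},
    {"notify": "founder", "escalate_after_minutes": 15},
    {"notify": "team_channel", "escalate_after_minutes": None},
]

def who_to_notify(minutes_since_page: int) -> str:
    """Walk the policy and return who should be getting paged right now."""
    elapsed = 0
    for step in ESCALATION_POLICY:
        wait = step["escalate_after_minutes"]
        if wait is None or minutes_since_page < elapsed + wait:
            return step["notify"]
        elapsed += wait
    return ESCALATION_POLICY[-1]["notify"]

assert who_to_notify(5) == "new_hire"
assert who_to_notify(20) == "founder"
assert who_to_notify(45) == "team_channel"
```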
Final note
On-call culture is a mirror of company culture. Spam alerts, people suffering in silence, and a burned-out team are the marks of absent management. A good rotation plus runbooks plus blameless postmortems plus fair compensation adds up to a healthy engineering organization.
Teams that invest in that discipline end up in a far better place long-term.