A critical bug hit production. We were down for 45 minutes. Site down, checkout broken, support channels exploding. The team did everything right and fixed it. Hero moment.
The next day, what do you do? Most teams say “it’s handled, on to the next thing”. Skipping the postmortem guarantees you’ll hit the same bug again in six months.
I’ve written 15 incident postmortems in the last three years. None of them were pleasant. But none of those incidents have recurred since. Here’s the practical framework.
What a postmortem is actually for
A postmortem does four things:
- Learning: what happened, why, how we noticed, how we fixed it
- System visibility: hidden weaknesses are now on paper
- Team memory: six months later a new hire can read it and pick up the context
- Accountability without blame: what happened, yes, but targeting the system, not the person
Postmortems get confused with bug root cause reports. A root cause report is purely technical. A postmortem is technical plus organizational: why did the process let this incident through, why didn’t tests catch it, why did the alarm come in late?
The 48-hour rule
Within 48 hours of the incident closing, the postmortem draft should be ready. Any later and:
- People forget details
- Energy drops, the “it’s over” feeling sets in
- New work piles on top
At 48 hours it’s still fresh in your mind, and you’re calm but sharp.
Template
The template I use:
# Incident Postmortem: [Summary title]
**Date**: 2026-04-20
**Duration**: 45 minutes (14:22 - 15:07 TRT)
**Severity**: P1 (critical, checkout down)
**Author**: [name]
**Status**: Draft / In Review / Final
## Summary
2-3 sentences: what happened, what was the impact, how was it resolved.
## Timeline
14:22 - PagerDuty alarm: checkout 500 error rate >10%
14:23 - On-call acknowledged
14:25 - Incident channel opened
14:32 - Suspected root cause: recent deploy
14:40 - Rollback started
14:45 - Error rate dropping
15:07 - Error rate at 0, incident closed
## Impact
- X orders failed (revenue impact: ~$Y)
- Z users saw error page
- W tickets reached support
- SLA impact: quarterly availability dropped to 99.87% against the 99.9% target
## Root Cause
[Technical, to the deepest level]
A migration ran against the `orders` table during a deploy. The migration dropped an index, then rebuilt a new one under production traffic. The rebuild held a lock, `SELECT ... FOR UPDATE` queries queued behind it, the connection pool filled up, and the API started returning 500s.
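As an aside to this example: the non-blocking version of that index rebuild is a small change. A minimal sketch, assuming Postgres and a plain psycopg2 connection (the table and index names are illustrative, not from the actual incident):

```python
# Sketch only: rebuild an index on a hot table without blocking writes (Postgres).
# Assumes psycopg2; "shop", "orders" and the index names are illustrative.
import psycopg2

conn = psycopg2.connect("dbname=shop")
conn.autocommit = True  # CREATE INDEX CONCURRENTLY cannot run inside a transaction

with conn.cursor() as cur:
    # Build the replacement index first, without taking a long write lock.
    cur.execute("CREATE INDEX CONCURRENTLY idx_orders_status_new ON orders (status)")
    # Only then drop the old one; CONCURRENTLY avoids blocking queued queries.
    cur.execute("DROP INDEX CONCURRENTLY IF EXISTS idx_orders_status")
```

The concurrent build is slower and needs more care (if it fails it leaves an invalid index behind), but it doesn’t stall production traffic the way the blocking rebuild did here.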
## 5 Whys
1. Why did checkout go down? -> DB connection pool filled up
2. Why was the pool full? -> Queries were queueing
3. Why were queries queueing? -> Index rebuild was holding a lock
4. Why did the index rebuild run under production traffic? -> The migration script didn't use online index creation
5. Why wasn't it caught in migration review? -> The review checklist didn't include an online DDL check
## What went well
- Alarm fired fast (within 2 minutes)
- On-call ack was fast (1 minute)
- Incident channel and communication were clear
- Rollback pattern was ready, executed in 8 minutes
## What went badly
- Migration review was thin, so a breaking change shipped to prod
- Staging doesn't have the same traffic pattern, so the issue couldn't be caught before production
- During rollback the deploy history UI was slow and versioning was unclear
- Customer communication was late (status page updated at minute 12)
## Action Items
| Action | Owner | Due | Priority |
|--------|-------|-----|----------|
| Add online DDL check to migration review checklist | Ayse | 2026-04-25 | P1 |
| Set up production traffic replay in staging | Burak | 2026-05-15 | P2 |
| Improve version pinning visibility in deploy UI | DevOps team | 2026-05-30 | P2 |
| Status page auto-update (triggered from alarms) | Cem | 2026-05-10 | P1 |
## Lessons Learned
- Online DDL, always. Especially on large tables.
- Without traffic replication in staging, production-only issues only surface in prod.
- Auto status page updates are critical for customer trust.
Blameless language
The most sensitive part of a postmortem is language. Instead of “Ali was slow on the rollback”, write “the rollback took 8 minutes because searching version history in the deploy UI is slow”. Target the system, not the person.
Blameless rewrites:
- “X made a mistake” -> “The system misled the engineer at this step”
- “Why didn’t you notice?” -> “Was there a signal that would have made this visible?”
- “You shouldn’t have done that” -> “Could a better tool have supported this action?”
This isn’t just politeness. In teams with a blame culture, people hide mistakes, don’t attend the postmortem, and the real root cause stays invisible.
The review meeting
After the draft, a review meeting. Typically 30 to 45 minutes.
Attendees:
- Everyone involved in the incident
- Anyone trained for the on-call rotation
- Engineering manager
- (For major incidents) product and support
In the meeting:
- Author walks through the draft (5 min)
- Timeline gets verified (5 min)
- Agreement on root cause (10 min)
- Action item owners and due dates get nailed down (15 min)
- Q&A (5 min)
At the end, the draft moves from “In Review” to “Final”.
Action item tracking
If you write the postmortem but the action items never close, the next incident has the same root cause.
Practical discipline:
- Action items go into Jira or Linear as tickets
- In sprint planning, P1 action items get picked up first
- Every monthly engineering all-hands reports the 90-day action item completion rate (a rough sketch of that calculation follows this list)
- If it’s under 70%, a retro digs into why they aren’t closing
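The completion-rate number itself is trivial once the tickets are exported. A rough sketch, assuming the tracker export is a list of records with created/closed dates (the field names are mine, not Jira's or Linear's API):

```python
# Rough sketch: 90-day action item completion rate.
# `items` is assumed to be exported from your tracker; the `created`/`closed`
# field names are illustrative, not real Jira or Linear fields.
from datetime import date, timedelta

def completion_rate(items: list[dict], today: date) -> float:
    """Share of action items created in the last 90 days that are closed."""
    window_start = today - timedelta(days=90)
    recent = [i for i in items if i["created"] >= window_start]
    if not recent:
        return 1.0
    closed = sum(1 for i in recent if i["closed"] is not None)
    return closed / len(recent)

items = [
    {"created": date(2026, 3, 1), "closed": date(2026, 3, 20)},
    {"created": date(2026, 4, 2), "closed": None},
    {"created": date(2026, 4, 21), "closed": date(2026, 5, 2)},
]
print(f"{completion_rate(items, date(2026, 5, 30)):.0%}")  # 67%: under 70%, time for the retro
```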
Looking for patterns
A single postmortem is a microscope. Once you have ten or more, patterns start showing up:
- “7 of the last 12 incidents were deploy-related” -> invest in the deploy pipeline
- “3 incidents had slow on-call alarm detection” -> review PagerDuty configuration
- “5 incidents had root causes staging didn’t catch” -> invest in staging
I run a quarterly incident review. I open every postmortem and pull out the patterns. The output shapes the engineering roadmap.
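The mechanical part of that quarterly review is small. A sketch of the pattern-counting step, assuming each postmortem carries a couple of tags in its metadata (the tags and data shape here are made up for illustration):

```python
# Sketch: count recurring causes across postmortems to see where to invest.
# Each record mirrors a postmortem's metadata; tags are whatever your team uses.
from collections import Counter

postmortems = [
    {"id": "2026-04-20-checkout", "tags": ["deploy", "database"]},
    {"id": "2026-03-11-search",   "tags": ["deploy", "staging-gap"]},
    {"id": "2026-02-02-payments", "tags": ["third-party", "alerting"]},
    # ... the rest of the quarter's incidents
]

tag_counts = Counter(tag for pm in postmortems for tag in pm["tags"])
for tag, count in tag_counts.most_common():
    print(f"{tag}: {count} of {len(postmortems)} incidents")
```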
Small incidents
Is a postmortem necessary for a 5-minute minor outage? Not the full one. I do a “lightweight postmortem”: two paragraphs plus one action item. Forcing the full template is overkill.
But if the same minor outage happens twice, it’s now a pattern. Lightweight for the first two, full postmortem for the third.
Sharing
Postmortems aren’t confidential; they should be visible to the whole team: in the knowledge base, in Notion, somewhere shared.
Some companies publish public postmortems (Cloudflare, GitHub, and GitLab are known for this). It’s a strong signal for customer trust, but not required. At minimum, keep them open internally.
Closing note
A postmortem isn’t a formality; it’s a learning practice. Done well, the same bug doesn’t repeat, the team feels safer, and the system gets a little more refined each time.
Done badly, or skipped entirely, it lets silent technical debt pile up. Six months later, when a new engineer hits the same bug, nobody can say “this was already known”, because nobody wrote it down. Lost institutional memory doesn’t come back.