Documentation discipline: from README to runbook

When people talk about software documentation, the standard line is “it’s important”. In practice, docs are the thing that constantly falls behind, goes stale, and never gets updated.

Across 10 teams and projects I’ve tried different approaches to docs. Some flopped, some stuck. Here are the ones that stuck.

The four kinds of documentation

Not all documentation is the same. Different goals, audiences, formats:

1. Onboarding (README, setup guide). For new developers. “Where do I start?”

2. Reference (API docs, config reference). For looking things up while you’re coding. “What does this function return?”

3. Explanation (architecture, design docs, ADRs). “Why is it built this way?”

4. How-to and runbooks. “How do I do this task?” “What do I do during this incident?”

Lumping all of these under a single “documentation” umbrella makes writing and maintaining them harder. Each has its own place, format, and cadence.

README: the front door

The README is the most-read piece of documentation. It needs:

A one-line description: what is this project?

Quick start: five commands to get it running.

Prerequisites: Node >= 18, Docker, and so on.

Contributing guide link: if you take contributions.

Links: to detailed docs, API reference, changelog.

If the README is too long, no one reads it. 200 lines is the ceiling. Push detail to other files.
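The README checklist is simple enough to lint. Here is a minimal sketch, assuming the README uses Markdown headings. The section names in `REQUIRED_SECTIONS` are hypothetical, so adapt them to however your team spells its headings.

```javascript
// Hypothetical section names: the checklist above says what a README needs,
// not how the headings are spelled.
const REQUIRED_SECTIONS = ['Quick start', 'Prerequisites', 'Links'];
const MAX_LINES = 200; // the ceiling suggested above

// Returns a list of human-readable problems; an empty list means the README passes.
function lintReadme(text) {
  const lines = text.split('\n');
  const problems = [];

  if (lines.length > MAX_LINES) {
    problems.push(`README has ${lines.length} lines (max ${MAX_LINES})`);
  }

  // Collect Markdown heading titles, lowercased for a forgiving comparison.
  const headings = lines
    .filter((line) => /^#{1,6}\s/.test(line))
    .map((line) => line.replace(/^#{1,6}\s+/, '').trim().toLowerCase());

  for (const section of REQUIRED_SECTIONS) {
    if (!headings.includes(section.toLowerCase())) {
      problems.push(`Missing section: ${section}`);
    }
  }
  return problems;
}
```

Run it in CI against the README contents and fail the build when the returned list is not empty.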

Architecture doc: the big picture

This is the doc that lets a new engineer grok the big picture. I typically reach for this template:

# Architecture

## Overview

[2-3 paragraphs: what this system does, core principles]

## System Diagram

[Mermaid or Excalidraw]

## Services

### API Gateway
- Purpose: public traffic entry
- Tech: NGINX + custom Lua
- Dependencies: auth service, rate limiter

### User Service
- Purpose: auth, user CRUD
- Tech: Go + PostgreSQL
- Dependencies: session store

...

## Data Flow

[Diagrams of critical user journeys: signup, login, order creation]

## Key Design Decisions

- Why PostgreSQL over MongoDB: ACID requirements, complex joins
- Why a message queue: async processing, retry guarantees
- ...

Readable in half an hour, gets you the big picture, leaves detail elsewhere.

ADRs: Architecture Decision Records

An ADR records why a decision was made. It answers the question someone asks six months later: “why this framework?”

ADR format:

# ADR-0012: Choosing PostgreSQL

## Status
Accepted

## Context
CRUD-heavy system, 10K orders a day, reporting needs complex joins.

## Decision
We use PostgreSQL instead of MySQL or MongoDB.

## Consequences
- Advanced window functions and CTEs available (positive)
- Managed Postgres is common, but there are fewer providers than for MySQL (negative)
- Team Postgres knowledge is average (risk: learning curve)

Every significant technology or pattern decision gets an ADR. Store them in the repo under /docs/adr/, numbered sequentially.

ADRs earn their keep over time. Two years later, “why did we do this?” has an answer.
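Sequential numbering is easy to automate so nobody has to look up the last number by hand. This is a rough sketch. The `NNNN-slug.md` file naming is my assumption, chosen to match the “ADR-0012” style above, and `nextAdrFilename` is a hypothetical helper.

```javascript
// Given the file names already in /docs/adr/, return the next ADR file name.
// Assumes files are named like "0012-choosing-postgresql.md".
function nextAdrFilename(existingFiles, title) {
  const numbers = existingFiles
    .map((name) => /^(\d{4})-.*\.md$/.exec(name))
    .filter((match) => match !== null)
    .map((match) => Number(match[1]));

  const next = numbers.length === 0 ? 1 : Math.max(...numbers) + 1;

  // Turn "Choosing PostgreSQL" into "choosing-postgresql".
  const slug = title
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-')
    .replace(/^-+|-+$/g, '');

  return `${String(next).padStart(4, '0')}-${slug}.md`;
}

console.log(
  nextAdrFilename(['0011-message-queue.md', '0012-choosing-postgresql.md'], 'Adopt MkDocs')
);
// prints "0013-adopt-mkdocs.md"
```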

Runbook: operational scenarios

The system is running, someone’s on call. The runbook answers “which switch do I flip?”

One runbook per alert:

# Alert: High API Error Rate

## Context
Fires when `api_error_rate > 5%`.

## Impact
Checkout may be failing for users. Revenue at risk.

## Investigation
1. Check Grafana for the affected endpoint: https://grafana/...
2. Look for a recent deploy: `kubectl rollout history deployment/<name>`
3. Is an upstream service down? Stripe status page, payment provider status.

## Mitigation
- Spike after a recent deploy: roll back with `kubectl rollout undo deployment/<name>`
- Upstream down: confirm circuit breaker is active, feature-flag it off
- DB connection pool full: scale up workers

## Postmortem
Write up the postmortem on the wiki once the incident is resolved.

The runbook is linked directly from the alert. The on-call engineer clicks the alert and lands on the runbook.

Writing the runbook forces you to make the action plan concrete. You’re not figuring it out mid-incident.
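“One runbook per alert” can be checked mechanically. Here is a minimal sketch, assuming your alert definitions can be loaded as objects. The `{ name, annotations: { runbook_url } }` shape is hypothetical (loosely modelled on how alert annotations are often written), and the URL is a placeholder.

```javascript
// Hypothetical alert shape: { name, annotations: { runbook_url } }.
// Returns the names of alerts that have no runbook link.
function alertsWithoutRunbook(alerts) {
  return alerts
    .filter((alert) => {
      const url = alert.annotations && alert.annotations.runbook_url;
      return typeof url !== 'string' || url.trim() === '';
    })
    .map((alert) => alert.name);
}

const alerts = [
  {
    name: 'HighApiErrorRate',
    annotations: { runbook_url: 'https://example.com/runbooks/high-api-error-rate' },
  },
  { name: 'DbPoolExhausted', annotations: {} },
  { name: 'QueueBacklog' },
];

console.log(alertsWithoutRunbook(alerts));
// prints [ 'DbPoolExhausted', 'QueueBacklog' ]
```

Failing CI on a non-empty result means no alert ships without its runbook.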

Who writes the docs

Classic mistake: “we’ll hire a technical writer, they’ll handle docs”. Result: written once, goes stale, details missing.

Good pattern: the owner of the code owns the docs. A feature PR updates the docs. Docs-as-code.

CI enforcement:
- Feature PRs must touch docs (reviewer checkpoint)
- README references a file that doesn’t exist = CI fail
- Broken link = CI fail

Automation stops the rot.
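The broken-link check is the easiest one to start with. Below is a sketch of the core logic as a pure function. It assumes link targets are written relative to the repo root and that the caller supplies the list of files that exist. `findBrokenLinks` is a hypothetical helper, not part of any tool.

```javascript
// Returns relative Markdown link targets in `text` that are not in `existingPaths`.
// External links (http, https, mailto) and in-page anchors are skipped:
// checking those needs network access and is out of scope for this sketch.
function findBrokenLinks(text, existingPaths) {
  const known = new Set(existingPaths);
  const linkPattern = /\[[^\]]*\]\(([^)\s]+)\)/g;
  const broken = [];

  for (const match of text.matchAll(linkPattern)) {
    const target = match[1];
    if (/^(https?:|mailto:|#)/.test(target)) continue;

    // Drop any "#section" suffix and a leading "./" before the lookup.
    const path = target.split('#')[0].replace(/^\.\//, '');
    if (!known.has(path)) broken.push(path);
  }
  return broken;
}

const readme =
  '[Setup](docs/setup.md) [API](./docs/api.md#auth) ' +
  '[Site](https://example.com) [Gone](docs/old.md)';

console.log(findBrokenLinks(readme, ['docs/setup.md', 'docs/api.md']));
// prints [ 'docs/old.md' ]
```

The same function covers the “README references a file that doesn’t exist” rule, since a missing file shows up as a broken relative link.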

Docs-as-code: keep them in the repo

Putting docs in external tools like Confluence or Notion silos them. They live outside version control and don’t go through PR review.

I keep /docs/ in the repo. Markdown format. Git history is docs history. Changes get reviewed in PRs.

Bonus: turn the docs into a static site with MkDocs, Docusaurus, or Nextra and deploy from CI. The team reads them on their own site.

Docs review in every PR

Add this to the PR review checklist:

- Has the new feature been documented?
- Have the docs been updated for changed behaviour?
- Is there a migration guide for breaking changes?

Don’t say “it’s a small fix, no doc change needed”. That’s where gaps start.
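A small script can nudge reviewers toward that checklist. This sketch assumes code lives under `src/` and docs under `docs/` plus `README.md`, which is a hypothetical layout. Treat the result as a prompt for the reviewer rather than a hard gate.

```javascript
// Returns true when a change set touches code but no documentation.
// `changedFiles` is the list of paths changed in the PR.
function needsDocsUpdate(changedFiles) {
  const touchesCode = changedFiles.some((file) => file.startsWith('src/'));
  const touchesDocs = changedFiles.some(
    (file) => file.startsWith('docs/') || file === 'README.md'
  );
  return touchesCode && !touchesDocs;
}

console.log(needsDocsUpdate(['src/orders/create.js']));
// prints true
console.log(needsDocsUpdate(['src/orders/create.js', 'docs/api/orders.md']));
// prints false
```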

Stale docs: the moving target

Documentation is never done. When the code changes, the docs have to change with it. Without discipline, docs are stale in six months.

Anti-pattern: adding “note: this info is from 2024 and may be outdated” to a live doc. If the doc is wrong, fix it. Don’t slap a disclaimer on it.

Better pattern: docs maintenance as sprint work. Every sprint, review a specific area of docs and bring it current.

On larger teams, a “doc gardener” rotation: each month one person puts four hours into reviewing docs and fixing stale pages.
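To give the gardener a starting list, you can rank docs by how long they have gone untouched. This is a minimal sketch. The `{ path, lastModified }` input shape is hypothetical, and the 180-day default is just the “six months” above turned into a number.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// `docs` is a list of { path, lastModified } where lastModified is a Date.
// Returns the paths of docs untouched for more than `maxAgeDays`, oldest first.
function findStaleDocs(docs, now, maxAgeDays = 180) {
  return docs
    .filter((doc) => now - doc.lastModified > maxAgeDays * DAY_MS)
    .sort((a, b) => a.lastModified - b.lastModified)
    .map((doc) => doc.path);
}

const docs = [
  { path: 'docs/a.md', lastModified: new Date('2024-01-01T00:00:00Z') },
  { path: 'docs/b.md', lastModified: new Date('2024-12-01T00:00:00Z') },
  { path: 'docs/c.md', lastModified: new Date('2023-06-01T00:00:00Z') },
];

console.log(findStaleDocs(docs, new Date('2025-01-01T00:00:00Z')));
// prints [ 'docs/c.md', 'docs/a.md' ]
```

One way to fill in `lastModified` is the last commit date per file, for example from `git log -1 --format=%cI -- <path>`. Age is only a hint: an old page can still be correct, so a human decides what actually needs fixing.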

Common mistake 1: writer’s block

“I’m not a good enough writer”, some developers say. Documentation doesn’t need perfect prose.

Short sentences, clear information. “This function does X. It returns Y. On failure it throws Z.” That’s a good doc.

Writing docs is a muscle. Write docs for every feature; three months in you’re fast.

Common mistake 2: no examples

Description without an example, or example without a description: both half-baked. Good docs combine the two.

### createOrder(data)

Creates a new order.

**Parameters:**
- `data.userId` (string): user ID
- `data.items` (array): order items
- `data.totalCents` (integer): total in cents

**Returns:** Order object

**Example:**

```javascript
const order = await createOrder({
  userId: 'user_123',
  items: [{ productId: 'p1', quantity: 2 }],
  totalCents: 15000
});
// order.id, order.status, …
```


**Throws:**
- `InsufficientStockError`: if an item is out of stock
- `PaymentRequiredError`: if the user has no payment method

That’s the pattern for reference docs.

Documentation-first development

Sometimes you write the docs before you build the feature. You write down “this API endpoint will do X and take these parameters”, get team feedback, then implement.

This pattern forces you to think about the API upfront. Asking “how will this read in the docs?” surfaces ergonomic issues early.

Metrics: how good are your docs

Doc quality is hard to measure directly, but there are proxies:

- Onboarding time (how fast does a new developer become productive)
- Self-serve vs Slack-help ratio (can questions be answered by the docs)
- Doc PR frequency (is it being touched)
- Broken link count (maintenance level)

Quarterly review: how current are the docs?
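The self-serve ratio is the one proxy you can compute from a simple log. Here is a rough sketch, assuming someone tags each question asked during the quarter with how it got resolved. The `resolvedBy` field and its values are hypothetical.

```javascript
// `questions` is a list of { resolvedBy } where resolvedBy is 'docs' or 'slack'.
// Returns the share of questions answered by the docs, or null with no data.
function selfServeRatio(questions) {
  if (questions.length === 0) return null;
  const selfServed = questions.filter((q) => q.resolvedBy === 'docs').length;
  return selfServed / questions.length;
}

const quarter = [
  { resolvedBy: 'docs' },
  { resolvedBy: 'docs' },
  { resolvedBy: 'slack' },
  { resolvedBy: 'docs' },
];

console.log(selfServeRatio(quarter));
// prints 0.75
```

The absolute number matters less than the trend from one quarterly review to the next.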

Closing thought

Docs aren’t a hobby, they’re an engineering practice. Not writing them is technical debt.

A six-item minimum:
1. README in every repo
2. Architecture doc per system
3. ADRs for important decisions
4. Runbooks for every alert
5. Docs check in PR review
6. Quarterly doc maintenance sprint

Teams that apply all six onboard quickly, respond to incidents quickly, and keep knowledge siloing to a minimum. Teams without docs are fighting every problem alone.