
Disaster recovery: the mistakes I made before a real incident exposed them

Most companies have a DR plan on paper that nobody tests. Here are the failures a real drill will expose.

Every company has a disaster recovery (DR) plan somewhere. A 50-page PDF, a runbook, step-by-step instructions. Run it during a real incident and 70% of it doesn’t work.

I’ve designed DR plans on a few projects and tested two of them under real incidents. This is the write-up of the mistakes I made and what I learned.

Mistake 1: The plan was never tested

The most common one. A DR plan gets written, it sits on paper, nobody ever runs it for real.

What falls apart during an actual incident:

  • Commands in the runbook no longer work (syntax has changed)
  • Backup restore doesn’t finish in the expected window
  • Recovery server credentials have expired
  • DNS TTL is too high, so propagation takes hours
  • Team members have rotated off, and the people on call are seeing the runbook for the first time

Fix: quarterly DR drill in a staging environment that’s close to production. Realistic simulation.
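
As a cheap first step before the full drill, the runbook’s commands can be smoke-tested automatically so syntax rot shows up in CI rather than mid-incident. A minimal sketch, assuming the steps are captured as shell commands; the commands and hostnames here are hypothetical:

```python
"""Smoke-test runbook commands in staging. Commands are illustrative."""
import subprocess

# Each entry: (description, command). In a real setup these would be
# extracted from the runbook itself so the test can't drift from the doc.
RUNBOOK_STEPS = [
    ("check replica reachable", "psql -h staging-db -c 'SELECT now()'"),
    ("list latest backups", "aws s3 ls s3://example-backups/ --recursive"),
]

def smoke_test(steps):
    failures = []
    for description, command in steps:
        result = subprocess.run(command, shell=True,
                                capture_output=True, text=True, timeout=60)
        if result.returncode != 0:
            failures.append((description, result.stderr.strip()))
    return failures

if __name__ == "__main__":
    for description, error in smoke_test(RUNBOOK_STEPS):
        print(f"RUNBOOK STEP BROKEN: {description}: {error}")
```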

Mistake 2: RTO and RPO undefined

RTO (Recovery Time Objective): how much downtime is tolerable? (“4 hours”)

RPO (Recovery Point Objective): how much data loss is tolerable? (“15 minutes”)

These two numbers are the foundation of everything else in the plan. Without them:

  • Backup frequency is unclear
  • Replica strategy is unclear
  • Investment level is unclear

Fix: talk to the business. Is 8 hours of downtime acceptable, or does the service have to be back within 2 hours? A 24-hour RPO versus a 15-minute RPO is a completely different architecture.
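
To make the RPO-to-architecture link concrete, here’s a back-of-envelope sketch; all numbers are illustrative:

```python
"""RPO dictates backup cadence. Numbers are made up for illustration."""

rpo_minutes = 15            # business says: lose at most 15 minutes of data
backup_runtime_minutes = 5  # how long a snapshot takes to complete

# Worst-case data loss is roughly the interval between backups plus the
# time the last backup needs to finish, so the interval must leave room.
max_interval = rpo_minutes - backup_runtime_minutes
print(f"Snapshot at most every {max_interval} min for a {rpo_minutes}-min RPO")
# With a 24-hour RPO a nightly dump is enough; at 15 minutes you are
# in WAL-shipping / continuous-replication territory.
```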

Mistake 3: Backups never restored

You’re taking backups, but you’ve never tried restoring them.

What shows up when you actually try:

  • Backup is corrupted (silent failure)
  • Restore script doesn’t work against the prod environment
  • Backup decryption key was lost
  • No permission to access the storage bucket
  • Incremental backups were taken, but the chain needed for a full restore can’t be assembled

Fix: monthly backup restore test. Restore to staging, verify data integrity. Automate it.
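
Automating it could look something like this sketch, assuming Postgres dumps in S3; the bucket, host, database and table names are all placeholders:

```python
"""Monthly restore test: pull the latest dump, restore, verify."""
import subprocess

def restore_and_verify():
    # 1. Pull the most recent dump (path is illustrative).
    subprocess.run(
        ["aws", "s3", "cp", "s3://example-backups/latest.dump", "/tmp/latest.dump"],
        check=True)
    # 2. Restore into a scratch database on staging.
    subprocess.run(
        ["pg_restore", "--clean", "--no-owner",
         "-h", "staging-db", "-d", "restore_test", "/tmp/latest.dump"],
        check=True)
    # 3. Integrity check: the restore only passes if the data looks sane.
    out = subprocess.run(
        ["psql", "-h", "staging-db", "-d", "restore_test", "-tAc",
         "SELECT count(*) FROM orders"],
        check=True, capture_output=True, text=True)
    assert int(out.stdout) > 0, "restored database is empty"

if __name__ == "__main__":
    restore_and_verify()
    print("restore test passed")
```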

Mistake 4: Hidden single points of failure

On paper the obvious SPOFs look handled. In a real incident, the hidden ones show up.

Example SPOFs:

  • A single CI/CD runner (required for prod deploy, not covered in the DR plan)
  • A single DNS provider (what if Cloudflare is down?)
  • A single certificate authority
  • A single secrets manager instance
  • A single monitoring provider

Fix: dependency mapping. Ask “what does it take to keep prod up?” and give every dependency a failover plan.
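
The mapping doesn’t need fancy tooling. A toy sketch; the dependency map itself is illustrative:

```python
"""Flag any capability that depends on exactly one provider."""

DEPENDENCIES = {
    "prod deploy": ["ci-runner-1"],                    # hidden SPOF
    "dns":         ["cloudflare"],                     # hidden SPOF
    "tls certs":   ["letsencrypt", "zerossl"],
    "secrets":     ["vault-primary"],                  # hidden SPOF
    "monitoring":  ["datadog", "self-hosted-prometheus"],
}

for capability, providers in DEPENDENCIES.items():
    if len(providers) == 1:
        print(f"SPOF: '{capability}' depends only on {providers[0]} -> needs a failover plan")
```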

Mistake 5: Procedural knowledge lives in one or two heads

The DR plan exists as a runbook, but only one or two people on the team actually know it in detail. And those people are on vacation or unreachable.

Fix:

  • Rotation (everyone has actually executed the runbook)
  • Documentation that’s step-by-step and assumption-free
  • Quarterly game days
  • New on-call joiners always paired with someone experienced

Mistake 6: Communication plan missing

The incident starts. Who do you call?

Typical gaps:

  • No customer communication plan
  • No stakeholder notification template
  • Incident channel (Slack) not prepared
  • Status page not updated automatically
  • No owner for post-incident communication

Fix:

  1. Incident communication ladder: who, to whom, when.
  2. Template emails: ready to go, customized per situation.
  3. Status page: automated + manual, updated in real time.
  4. Incident Slack channel: pre-created.
  5. Customer communication: “We’re aware, investigating” then “ETA” then “Resolved” pattern.
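
Parts of this ladder can be wired up ahead of time. A minimal sketch using a Slack incoming webhook; the webhook URL is a placeholder, and a real version would also hit the status-page API and send the templated stakeholder email from the same call site:

```python
"""Fan an incident update out to a pre-created Slack channel."""
import json
import urllib.request

SLACK_WEBHOOK = "https://hooks.slack.com/services/T000/B000/XXXX"  # placeholder

def post_incident_update(phase: str, message: str) -> None:
    # phase follows the pattern above: "investigating", "eta", "resolved"
    payload = {"text": f"[incident:{phase}] {message}"}
    req = urllib.request.Request(
        SLACK_WEBHOOK,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)

post_incident_update("investigating",
                     "Elevated error rates on checkout; we're aware and investigating.")
```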

Mistake 7: Treating everything as a total outage

The DR plan is written for “full collapse”. But 80% of the time it’s a partial failure: one service is down, the rest are fine.

Triggering full failover on a partial failure is overshoot. It creates a second outage.

Fix:

  • Graduated response. Partial failure gets limited action; full failure triggers the DR plan.
  • Service-level failover. Bring one service up from backup while the rest of the cluster keeps running.
  • Decision framework: “full DR is triggered only when X, Y and Z hold; otherwise graduated response.”
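
A sketch of such a decision gate; the three trigger conditions are examples standing in for X, Y and Z:

```python
"""Full DR fires only when every trigger condition holds."""

def full_dr_required(*, primary_region_down: bool,
                     data_plane_unreachable: bool,
                     eta_exceeds_rto: bool) -> bool:
    # All three must hold ("only when X, Y and Z"); anything less gets
    # a graduated, service-level response instead.
    return primary_region_down and data_plane_unreachable and eta_exceeds_rto

# One service down, region fine, fix ETA inside RTO: graduated response.
print(full_dr_required(primary_region_down=False,
                       data_plane_unreachable=True,
                       eta_exceeds_rto=False))  # False
```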

Mistake 8: Skipping the post-incident review

Incident resolved. The team is relieved, “that’s over”. No blameless postmortem happens.

This step is critical:

  • Root cause analysis
  • Contributing factors
  • What went well
  • What could be improved
  • Action items (tracked to completion)

Without a postmortem, the same incident is very likely to repeat.

Fix: postmortem mandatory for every major incident. Template:

  1. Timeline (what happened when)
  2. Impact (customer, business, technical)
  3. Detection (how it was discovered)
  4. Resolution (what fixed it)
  5. Root cause (5 whys)
  6. Action items (prevent + detect + mitigate)

Mistake 9: Treating recovery as just restore

Recovery isn’t only data restore. Full service recovery looks like:

  • Data integrity check
  • Dependency service restart ordering
  • Cache warming (cold cache means slow performance)
  • User session handling (lost sessions vs persistent)
  • Monitoring re-enabled
  • Alerting silences reset
  • Customer communication update

Fix: checklist-based recovery. “Restore completed” is only the first item.
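
One way to enforce that: encode the checklist and refuse to declare recovery until every item is done. A sketch with stubbed steps:

```python
"""Recovery as an ordered checklist; restore is only step one."""

RECOVERY_CHECKLIST = [
    "data restore completed",
    "data integrity verified",
    "dependent services restarted in order",
    "caches warmed",
    "user sessions handled",
    "monitoring re-enabled",
    "alert silences cleared",
    "customer communication updated",
]

def run_recovery(completed: set) -> None:
    for step in RECOVERY_CHECKLIST:
        status = "done" if step in completed else "OPEN"
        print(f"[{status:4}] {step}")
    open_steps = [s for s in RECOVERY_CHECKLIST if s not in completed]
    if open_steps:
        raise SystemExit(f"recovery NOT complete: {len(open_steps)} step(s) open")

run_recovery(completed={"data restore completed"})
```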

Mistake 10: No infrastructure as code

In a DR scenario you need to spin up a secondary region. Creating infrastructure manually takes forever.

IaC (infrastructure as code):

  • Terraform, CloudFormation, Pulumi
  • Primary + secondary region templates
  • “terraform apply” brings up a fresh environment

On the last project, secondary region spin-up dropped from 4 hours to 30 minutes because of IaC.

Fix: critical infrastructure lives in IaC; manual setup is the exception.
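
For illustration, here’s what the same-template-per-region idea looks like with Pulumi’s Python SDK (Terraform HCL would be the equivalent); resource names are illustrative:

```python
"""Minimal Pulumi sketch: one definition serves both regions."""
import pulumi
import pulumi_aws as aws

# The region comes from per-stack config (aws:region), so bringing up the
# secondary environment is `pulumi up --stack secondary`, not a manual build.
backups = aws.s3.Bucket(
    "app-backups",
    versioning=aws.s3.BucketVersioningArgs(enabled=True))

pulumi.export("backup_bucket", backups.id)
```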

Drill planning

Good setup for a quarterly DR drill:

Pre-drill:

  • Pick the scenario (a specific failure)
  • Participant list (on-call team plus support)
  • Time window (hours)
  • Success criteria (RTO met? RPO met? Communication ran? Did the docs work?)

During the drill:

  • Real-time observation. What is the team doing, and why?
  • Stopwatch. RTO measurement.
  • Note the gaps. Steps the runbook didn’t cover.

Post-drill:

  • Debrief session
  • Findings document
  • Action items tracked
  • Runbook updates

One scenario per quarter. Over 12 months you end up testing four different failure modes.
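
Turning the success criteria into numbers keeps the debrief honest. A sketch; the targets and stopwatch values are made up:

```python
"""Score a drill against the RTO/RPO targets."""

TARGETS = {"rto_minutes": 240, "rpo_minutes": 15}

def evaluate_drill(measured: dict) -> dict:
    return {
        "rto_met": measured["recovery_minutes"] <= TARGETS["rto_minutes"],
        "rpo_met": measured["data_loss_minutes"] <= TARGETS["rpo_minutes"],
    }

# Stopwatch numbers from the drill debrief:
print(evaluate_drill({"recovery_minutes": 310, "data_loss_minutes": 10}))
# {'rto_met': False, 'rpo_met': True}: the runbook needs work before
# the plan can honestly claim a 4-hour RTO.
```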

Budgeting for DR

DR is expensive. Far less expensive than the real incident.

DR budget items:

  • Secondary region infrastructure (a warm standby typically runs 30 to 50% of primary cost, even when idle)
  • Backup storage (monthly cost)
  • Monitoring tools (paid tier)
  • DR testing time (team hours)
  • Training (workshops, simulations)

Incident cost estimate:

  • Downtime hours × revenue per hour
  • Loss of customer trust
  • Team overtime and recovery work
  • Regulatory penalties (compliance)

Balance matters: fully redundant everything isn’t practical. Budget inside a “tolerable downtime” framework.
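
A back-of-envelope comparison makes that conversation easier. Every number below is invented; plug in your own:

```python
"""Toy DR budget check: spend vs avoided downtime cost."""

revenue_per_hour = 5_000             # EUR, illustrative
expected_outage_hours_per_year = 8   # without DR investment
incident_cost = revenue_per_hour * expected_outage_hours_per_year  # 40,000

secondary_region_monthly = 1_500     # warm standby at ~40% of primary
dr_budget_yearly = secondary_region_monthly * 12                   # 18,000

# If DR cuts expected downtime from 8h to 1h, it avoids 35,000/year of
# lost revenue against an 18,000/year spend, before counting trust
# damage and regulatory penalties.
saved = revenue_per_hour * (8 - 1)
print(f"DR spend {dr_budget_yearly}, avoided downtime cost {saved}")
```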

Cloud-specific considerations

AWS, GCP and Azure all have outages. “Cloud reliability” isn’t infinite.

Multi-region strategies:

  • Same cloud, different region (e.g. AWS eu-west-1 + eu-west-2)
  • Multi-cloud (AWS + GCP)
  • Hybrid (cloud + on-prem backup)

Each step up multiplies operational complexity. For most teams, a single region with multi-AZ is enough.

Takeaway

The DR plan looks good on paper. Actual testing surfaces the problems. Ten common mistakes: an untested plan, fuzzy RTO/RPO, backups never restored, hidden SPOFs, knowledge concentrated in a few heads, a missing communication plan, all-or-nothing failover, skipped postmortems, recovery treated as a mere restore, and no IaC.

Quarterly drills, blameless postmortems, continuous improvement. That’s the discipline that makes a DR plan actually work.

The first drill is going to be painful. Issues will surface. The good news: they surface in a controlled environment, not in a production incident.
