
Disaster recovery: the mistakes I made before a real incident exposed them

Most companies have a DR plan on paper that nobody tests. Here are the failures a real drill will expose.

Every company has a disaster recovery (DR) plan somewhere. A 50-page PDF, a runbook, step-by-step instructions. Run it during a real incident and 70% of it doesn’t work.

I’ve designed DR plans on a few projects and tested two of them under real incidents. This is the write-up of the mistakes I made and what I learned.

Mistake 1: The plan was never tested

The most common one. A DR plan gets written, it sits on paper, nobody ever runs it for real.

What falls apart during an actual incident:

  • Commands in the runbook no longer work (syntax has changed)
  • Backup restore doesn’t finish in the expected window
  • Recovery server credentials have expired
  • DNS TTL is too high, so propagation takes hours
  • Team members have rotated off, and the people on call are seeing the runbook for the first time

Fix: quarterly DR drill in a staging environment that’s close to production. Realistic simulation.
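
As a cheap first step before the full drill, the runbook’s commands can be smoke-tested automatically so syntax rot shows up in CI rather than mid-incident. A minimal sketch, assuming the steps are captured as shell commands; the commands and hostnames here are hypothetical:

```python
"""Smoke-test runbook commands in staging. Commands are illustrative."""
import subprocess

# Each entry: (description, command). In a real setup these would be
# extracted from the runbook itself so the test can't drift from the doc.
RUNBOOK_STEPS = [
    ("check replica reachable", "psql -h staging-db -c 'SELECT now()'"),
    ("list latest backups", "aws s3 ls s3://example-backups/ --recursive"),
]

def smoke_test(steps):
    failures = []
    for description, command in steps:
        result = subprocess.run(command, shell=True,
                                capture_output=True, text=True, timeout=60)
        if result.returncode != 0:
            failures.append((description, result.stderr.strip()))
    return failures

if __name__ == "__main__":
    for description, error in smoke_test(RUNBOOK_STEPS):
        print(f"RUNBOOK STEP BROKEN: {description}: {error}")
```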

Mistake 2: RTO and RPO undefined

RTO (Recovery Time Objective): how much downtime is tolerable? (“4 hours”)

RPO (Recovery Point Objective): how much data loss is tolerable? (“15 minutes”)

These two numbers are the foundation of everything else in the plan. Without them:

  • Backup frequency is unclear
  • Replica strategy is unclear
  • Investment level is unclear

Fix: talk to the business. Is 8 hours of downtime acceptable, or does the service have to be back within 2 hours? A 24-hour RPO versus a 15-minute RPO is a completely different architecture.
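
To make the RPO-to-architecture link concrete, here’s a back-of-envelope sketch; all numbers are illustrative:

```python
"""RPO dictates backup cadence. Numbers are made up for illustration."""

rpo_minutes = 15            # business says: lose at most 15 minutes of data
backup_runtime_minutes = 5  # how long a snapshot takes to complete

# Worst-case data loss is roughly the interval between backups plus the
# time the last backup needs to finish, so the interval must leave room.
max_interval = rpo_minutes - backup_runtime_minutes
print(f"Snapshot at most every {max_interval} min for a {rpo_minutes}-min RPO")
# With a 24-hour RPO a nightly dump is enough; at 15 minutes you are
# in WAL-shipping / continuous-replication territory.
```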

Mistake 3: Backups never restored

You’re taking backups, but you’ve never tried restoring them.

What shows up when you actually try:

  • Backup is corrupted (silent failure)
  • Restore script doesn’t work against the prod environment
  • Backup decryption key was lost
  • No permission to access the storage bucket
  • Incremental backups were taken, but the chain needed for a full restore can’t be assembled

Fix: monthly backup restore test. Restore to staging, verify data integrity. Automate it.
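
Automating it could look something like this sketch, assuming Postgres dumps in S3; the bucket, host, database and table names are all placeholders:

```python
"""Monthly restore test: pull the latest dump, restore, verify."""
import subprocess

def restore_and_verify():
    # 1. Pull the most recent dump (path is illustrative).
    subprocess.run(
        ["aws", "s3", "cp", "s3://example-backups/latest.dump", "/tmp/latest.dump"],
        check=True)
    # 2. Restore into a scratch database on staging.
    subprocess.run(
        ["pg_restore", "--clean", "--no-owner",
         "-h", "staging-db", "-d", "restore_test", "/tmp/latest.dump"],
        check=True)
    # 3. Integrity check: the restore only passes if the data looks sane.
    out = subprocess.run(
        ["psql", "-h", "staging-db", "-d", "restore_test", "-tAc",
         "SELECT count(*) FROM orders"],
        check=True, capture_output=True, text=True)
    assert int(out.stdout) > 0, "restored database is empty"

if __name__ == "__main__":
    restore_and_verify()
    print("restore test passed")
```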

Mistake 4: Hidden single points of failure

On paper the obvious SPOFs look handled. In a real incident, the hidden ones show up.

Example SPOFs:

  • A single CI/CD runner (required for prod deploy, not covered in the DR plan)
  • A single DNS provider (what if Cloudflare is down?)
  • A single certificate authority
  • A single secrets manager instance
  • A single monitoring provider

Fix: dependency mapping. Ask “what does it take to keep prod up?” and give every dependency a failover plan.
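
The mapping doesn’t need fancy tooling. A toy sketch; the dependency map itself is illustrative:

```python
"""Flag any capability that depends on exactly one provider."""

DEPENDENCIES = {
    "prod deploy": ["ci-runner-1"],                    # hidden SPOF
    "dns":         ["cloudflare"],                     # hidden SPOF
    "tls certs":   ["letsencrypt", "zerossl"],
    "secrets":     ["vault-primary"],                  # hidden SPOF
    "monitoring":  ["datadog", "self-hosted-prometheus"],
}

for capability, providers in DEPENDENCIES.items():
    if len(providers) == 1:
        print(f"SPOF: '{capability}' depends only on {providers[0]} -> needs a failover plan")
```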

Mistake 5: Procedural knowledge lives in one or two heads

The DR plan exists as a runbook, but only one or two people on the team actually know it in detail. And those people are on vacation or unreachable.

Fix:

  • Rotation (everyone has actually executed the runbook)
  • Documentation that’s step-by-step and assumption-free
  • Quarterly game days
  • New on-call joiners always paired with someone experienced

Mistake 6: Communication plan missing

The incident starts. Who do you call?

Typical gaps:

  • No customer communication plan
  • No stakeholder notification template
  • Incident channel (Slack) not prepared
  • Status page not updated automatically
  • No owner for post-incident communication

Fix:

  1. Incident communication ladder: who, to whom, when.
  2. Template emails: ready to go, customized per situation.
  3. Status page: automated + manual, updated in real time.
  4. Incident Slack channel: pre-created.
  5. Customer communication: “We’re aware, investigating” then “ETA” then “Resolved” pattern.
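
Parts of this ladder can be wired up ahead of time. A minimal sketch using a Slack incoming webhook; the webhook URL is a placeholder, and a real version would also hit the status-page API and send the templated stakeholder email from the same call site:

```python
"""Fan an incident update out to a pre-created Slack channel."""
import json
import urllib.request

SLACK_WEBHOOK = "https://hooks.slack.com/services/T000/B000/XXXX"  # placeholder

def post_incident_update(phase: str, message: str) -> None:
    # phase follows the pattern above: "investigating", "eta", "resolved"
    payload = {"text": f"[incident:{phase}] {message}"}
    req = urllib.request.Request(
        SLACK_WEBHOOK,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)

post_incident_update("investigating",
                     "Elevated error rates on checkout; we're aware and investigating.")
```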

Mistake 7: Treating everything as a total outage

The DR plan is written for “full collapse”. But 80% of the time it’s a partial failure: one service is down, the rest are fine.

Triggering full failover on a partial failure is overshoot. It creates a second outage.

Fix:

  • Graduated response. Partial failure gets limited action; full failure triggers the DR plan.
  • Service-level failover. Bring one service up from backup while the rest of the cluster keeps running.
  • Decision framework: “full DR is triggered only when X, Y and Z hold; otherwise graduated response.”
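
A sketch of such a decision gate; the three trigger conditions are examples standing in for X, Y and Z:

```python
"""Full DR fires only when every trigger condition holds."""

def full_dr_required(*, primary_region_down: bool,
                     data_plane_unreachable: bool,
                     eta_exceeds_rto: bool) -> bool:
    # All three must hold ("only when X, Y and Z"); anything less gets
    # a graduated, service-level response instead.
    return primary_region_down and data_plane_unreachable and eta_exceeds_rto

# One service down, region fine, fix ETA inside RTO: graduated response.
print(full_dr_required(primary_region_down=False,
                       data_plane_unreachable=True,
                       eta_exceeds_rto=False))  # False
```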

Mistake 8: Skipping the post-incident review

Incident resolved. The team is relieved, “that’s over”. No blameless postmortem happens.

This step is critical:

  • Root cause analysis
  • Contributing factors
  • What went well
  • What could be improved
  • Action items (tracked to completion)

Without a postmortem, the same incident is very likely to repeat.

Fix: postmortem mandatory for every major incident. Template:

  1. Timeline (what happened when)
  2. Impact (customer, business, technical)
  3. Detection (how it was discovered)
  4. Resolution (what fixed it)
  5. Root cause (5 whys)
  6. Action items (prevent + detect + mitigate)

Mistake 9: Treating recovery as just restore

Recovery isn’t only data restore. Full service recovery looks like:

  • Data integrity check
  • Dependency service restart ordering
  • Cache warming (cold cache means slow performance)
  • User session handling (lost sessions vs persistent)
  • Monitoring re-enabled
  • Alerting silences reset
  • Customer communication update

Fix: checklist-based recovery. “Restore completed” is only the first item.
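
One way to enforce that: encode the checklist and refuse to declare recovery until every item is done. A sketch with stubbed steps:

```python
"""Recovery as an ordered checklist; restore is only step one."""

RECOVERY_CHECKLIST = [
    "data restore completed",
    "data integrity verified",
    "dependent services restarted in order",
    "caches warmed",
    "user sessions handled",
    "monitoring re-enabled",
    "alert silences cleared",
    "customer communication updated",
]

def run_recovery(completed: set) -> None:
    for step in RECOVERY_CHECKLIST:
        status = "done" if step in completed else "OPEN"
        print(f"[{status:4}] {step}")
    open_steps = [s for s in RECOVERY_CHECKLIST if s not in completed]
    if open_steps:
        raise SystemExit(f"recovery NOT complete: {len(open_steps)} step(s) open")

run_recovery(completed={"data restore completed"})
```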

Mistake 10: No infrastructure as code

In a DR scenario you need to spin up a secondary region. Creating infrastructure manually takes forever.

IaC (infrastructure as code):

  • Terraform, CloudFormation, Pulumi
  • Primary + secondary region templates
  • “terraform apply” brings up a fresh environment

On the last project, secondary region spin-up dropped from 4 hours to 30 minutes because of IaC.

Fix: critical infrastructure lives in IaC; manual setup is the exception.
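
For illustration, here’s what the same-template-per-region idea looks like with Pulumi’s Python SDK (Terraform HCL would be the equivalent); resource names are illustrative:

```python
"""Minimal Pulumi sketch: one definition serves both regions."""
import pulumi
import pulumi_aws as aws

# The region comes from per-stack config (aws:region), so bringing up the
# secondary environment is `pulumi up --stack secondary`, not a manual build.
backups = aws.s3.Bucket(
    "app-backups",
    versioning=aws.s3.BucketVersioningArgs(enabled=True))

pulumi.export("backup_bucket", backups.id)
```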

Drill planning

Good setup for a quarterly DR drill:

Pre-drill:

  • Pick the scenario (a specific failure)
  • Participant list (on-call team plus support)
  • Time window (hours)
  • Success criteria (RTO met? RPO met? Communication ran? Did the docs work?)

During the drill:

  • Real-time observation. What is the team doing, and why?
  • Stopwatch. RTO measurement.
  • Note the gaps. Steps the runbook didn’t cover.

Post-drill:

  • Debrief session
  • Findings document
  • Action items tracked
  • Runbook updates

One scenario per quarter. Over 12 months you end up testing four different failure modes.
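
Turning the success criteria into numbers keeps the debrief honest. A sketch; the targets and stopwatch values are made up:

```python
"""Score a drill against the RTO/RPO targets."""

TARGETS = {"rto_minutes": 240, "rpo_minutes": 15}

def evaluate_drill(measured: dict) -> dict:
    return {
        "rto_met": measured["recovery_minutes"] <= TARGETS["rto_minutes"],
        "rpo_met": measured["data_loss_minutes"] <= TARGETS["rpo_minutes"],
    }

# Stopwatch numbers from the drill debrief:
print(evaluate_drill({"recovery_minutes": 310, "data_loss_minutes": 10}))
# {'rto_met': False, 'rpo_met': True}: the runbook needs work before
# the plan can honestly claim a 4-hour RTO.
```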

Budgeting for DR

DR is expensive. Far less expensive than the real incident.

DR budget items:

  • Secondary region infrastructure (a warm standby typically runs 30 to 50% of primary cost, even when idle)
  • Backup storage (monthly cost)
  • Monitoring tools (paid tier)
  • DR testing time (team hours)
  • Training (workshops, simulations)

Incident cost estimate:

  • Downtime hours × revenue per hour
  • Loss of customer trust
  • Team overtime and recovery work
  • Regulatory penalties (compliance)

Balance matters: fully redundant everything isn’t practical. Budget inside a “tolerable downtime” framework.
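
A back-of-envelope comparison makes that conversation easier. Every number below is invented; plug in your own:

```python
"""Toy DR budget check: spend vs avoided downtime cost."""

revenue_per_hour = 5_000             # EUR, illustrative
expected_outage_hours_per_year = 8   # without DR investment
incident_cost = revenue_per_hour * expected_outage_hours_per_year  # 40,000

secondary_region_monthly = 1_500     # warm standby at ~40% of primary
dr_budget_yearly = secondary_region_monthly * 12                   # 18,000

# If DR cuts expected downtime from 8h to 1h, it avoids 35,000/year of
# lost revenue against an 18,000/year spend, before counting trust
# damage and regulatory penalties.
saved = revenue_per_hour * (8 - 1)
print(f"DR spend {dr_budget_yearly}, avoided downtime cost {saved}")
```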

Cloud-specific considerations

AWS, GCP and Azure all have outages. “Cloud reliability” isn’t infinite.

Multi-region strategies:

  • Same cloud, different region (e.g. AWS eu-west-1 + eu-west-2)
  • Multi-cloud (AWS + GCP)
  • Hybrid (cloud + on-prem backup)

Each step up multiplies operational complexity. For most teams, a single region with multi-AZ is enough.

Takeaway

The DR plan looks good on paper. Actual testing surfaces the problems. Ten common mistakes: an untested plan, fuzzy RTO/RPO, backups never restored, hidden SPOFs, knowledge concentrated in a few heads, a missing communication plan, all-or-nothing failover, skipped postmortems, recovery treated as a mere restore, and no IaC.

Quarterly drills, blameless postmortems, continuous improvement. That’s the discipline that makes a DR plan actually work.

The first drill is going to be painful. Issues will surface. The good news: they surface in a controlled environment, not in a production incident.
