In modern web development, “deploy strategy” means more than just pushing the new version to the servers. Blue/green, canary, and rolling are three popular approaches, each with a different risk, cost, and complexity trade-off.
Here are the production deploy scenarios I’ve collected over 19 years.
Rolling deploy
The most common and simplest approach. Upgrade the instances in your server cluster to the new version one at a time.
Process:
- Server 1: old version to new version (30 seconds offline)
- Server 2: old to new (30 seconds offline)
- Server 3, 4, 5… same
The load balancer health check ensures only healthy servers serve traffic.
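To make that loop concrete, here's a minimal Python sketch of a rolling deploy with a health gate. Everything in it is hypothetical: the host names, the deploy_to() helper, and the /healthz path are made up, and in real setups the orchestrator (Kubernetes, ECS) runs this loop for you.

```python
import time
import urllib.request

SERVERS = ["app-1.internal", "app-2.internal", "app-3.internal"]  # hypothetical hosts
HEALTH_PATH = "/healthz"    # assumed health-check endpoint, same one the LB polls
NEW_VERSION = "v2.4.0"

def deploy_to(host: str, version: str) -> None:
    """Placeholder for the real upgrade step (SSH, agent call, image pull...)."""
    print(f"deploying {version} to {host}")
    time.sleep(1)  # stand-in for the ~30 second upgrade window

def is_healthy(host: str) -> bool:
    """Hit the same health endpoint the load balancer uses."""
    try:
        with urllib.request.urlopen(f"http://{host}{HEALTH_PATH}", timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False

def rolling_deploy() -> None:
    for host in SERVERS:
        deploy_to(host, NEW_VERSION)
        # Wait until the upgraded server passes its health check before moving
        # on; abort early instead of walking a broken version across the cluster.
        deadline = time.time() + 120
        while not is_healthy(host):
            if time.time() > deadline:
                raise RuntimeError(f"{host} never became healthy, stopping rollout")
            time.sleep(5)
        print(f"{host} healthy on {NEW_VERSION}")

if __name__ == "__main__":
    rolling_deploy()
```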
Pros:
- Simple setup
- No extra infrastructure
- Low memory overhead
- Most Kubernetes deployments default to this
Cons:
- Rollback is gradual (every server has to roll back)
- Brief version mismatch (some users on v1, some on v2)
- Database schema migration is tricky
- Partial failure mid-deploy is messy
When to use it:
- Small to medium team
- Standard web app
- No breaking changes
- Cost-conscious
Blue/green deploy
Two identical production environments. Blue is current, Green is new. The traffic switch is atomic.
Process:
- Blue environment is live (current version)
- Deploy new version to Green (no traffic yet)
- Test Green (smoke tests, performance checks)
- Load balancer switches traffic to Green
- Blue sits idle, ready for rollback
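A minimal sketch of the cutover, assuming a reverse proxy that picks its upstream pool from a small state file. The pool URLs, the file path, and the /healthz endpoint are all made up; with a cloud load balancer this would be an API call instead of a file write.

```python
import json
import urllib.request
from pathlib import Path

# Hypothetical setup: the reverse proxy reads this file to decide which
# upstream pool ("blue" or "green") receives production traffic.
ACTIVE_POOL_FILE = Path("active_pool.json")

POOLS = {
    "blue":  "http://blue.internal:8080",
    "green": "http://green.internal:8080",
}

def smoke_test(base_url: str) -> bool:
    """Minimal smoke test: the new environment must answer its health check."""
    try:
        with urllib.request.urlopen(f"{base_url}/healthz", timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False

def switch_to(pool: str) -> None:
    """The 'atomic' part: one small write flips all traffic at once."""
    ACTIVE_POOL_FILE.write_text(json.dumps({"active": pool}))
    print(f"traffic now routed to {pool}")

def blue_green_cutover() -> None:
    if not smoke_test(POOLS["green"]):
        raise RuntimeError("green failed smoke tests, staying on blue")
    switch_to("green")   # blue stays warm, ready for rollback
    # Rollback, if needed, is just: switch_to("blue")

if __name__ == "__main__":
    blue_green_cutover()
```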
Pros:
- Zero-downtime deploy
- Instant rollback (switch traffic back)
- Pre-production testing on an actual environment
- Database migrations are easier
- No version mismatch
Cons:
- Double the infrastructure cost
- Stateful app migration is complex
- Session handling needs care
- Extra operational complexity
When to use it:
- High-traffic production
- Zero-downtime requirement
- Critical systems (payment, healthcare)
- Rollback speed is critical
Canary deploy
Progressive rollout. Deploy to a small user group first, expand if it goes well.
Process:
- Deploy v2 to one server (5 to 10% of the cluster)
- Monitor for 15 to 30 minutes (error rate, latency, business metrics)
- If OK, expand to 25%, monitor
- If OK, 50%
- If OK, 100%
If an issue appears: immediate rollback (only 5% of traffic was affected).
Advanced version: route by user ID hash. The same user always hits the same version.
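A minimal sketch of that kind of routing (the function name and bucket labels are made up): hash the user ID, map it to a stable 0-99 bucket, and compare against the current canary percentage.

```python
import hashlib

def canary_bucket(user_id: str, canary_percent: int) -> str:
    """Deterministically map a user to 'canary' or 'stable'.

    The same user_id always lands in the same bucket, so a user never
    bounces between v1 and v2 mid-session.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100          # stable value in 0..99
    return "canary" if bucket < canary_percent else "stable"

# Rolling the canary forward is just raising the percentage:
for pct in (5, 25, 50, 100):
    routed = sum(canary_bucket(f"user-{i}", pct) == "canary" for i in range(10_000))
    print(f"{pct}% target -> {routed / 100:.1f}% of users actually on canary")
```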
Pros:
- Lowest-risk deployment
- Real user testing on real data
- Gradual confidence building
- Easy rollback
Cons:
- Complex traffic splitting
- Feature-flag-like complexity
- Database schema changes require care (backward compat across versions)
- Longer deploy cycle (hours vs minutes)
When to use it:
- High-stakes changes (core algorithms, payment logic)
- Large user base
- Breaking change expected but not certain
- A/B testing alongside
Comparison matrix
| Aspect | Rolling | Blue/Green | Canary |
|--------|---------|------------|--------|
| Infrastructure cost | Low | High (2x) | Medium |
| Rollback speed | Slow | Instant | Medium |
| Risk level | Medium | Low | Lowest |
| Complexity | Low | Medium | High |
| Zero downtime | No (brief) | Yes | Yes |
| Database migration | Hard | Medium | Hardest |
| User-level consistency | No | Yes | Yes (with routing) |
Database schema migration
The sneakiest part of any deploy strategy: should code and schema change at the same time?
Rolling deploy: v1 and v2 coexist for a few minutes, so the schema has to be compatible with both. That’s exactly what the expand-contract pattern is for.
Phases:
1. Expand: add the new column (nullable, or with a default value)
2. Deploy new code: writes and reads the new column
3. Backfill: populate the new column for old data
4. Contract: drop the old column (next deploy)
Spread across 3 to 4 deploy cycles. Requires patience.
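A toy sketch of the four phases using an in-memory SQLite table. The users table and the full_name-to-first/last split are invented, and a real project would run these steps through its migration tool over several deploy cycles.

```python
import sqlite3

# Toy schema: we want to replace users.full_name with first/last columns
# without breaking the old code that still reads full_name.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, full_name TEXT)")
conn.execute("INSERT INTO users (full_name) VALUES ('Ada Lovelace')")

# Phase 1 - Expand: add the new columns as nullable; old code keeps working.
conn.execute("ALTER TABLE users ADD COLUMN first_name TEXT")
conn.execute("ALTER TABLE users ADD COLUMN last_name TEXT")

# Phase 2 happens in application code: the new version writes (and reads)
# both the old and the new columns.

# Phase 3 - Backfill: populate the new columns for existing rows.
rows = conn.execute("SELECT id, full_name FROM users").fetchall()
for row_id, full_name in rows:
    first, _, last = full_name.partition(" ")
    conn.execute(
        "UPDATE users SET first_name = ?, last_name = ? WHERE id = ?",
        (first, last, row_id),
    )

# Phase 4 - Contract (a later deploy, once nothing reads full_name anymore).
# Note: DROP COLUMN needs SQLite >= 3.35.
conn.execute("ALTER TABLE users DROP COLUMN full_name")
```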
Blue/green: Green could have its own database, but replication gets hairy. Usually both environments share one DB with an expand-contract schema.
Canary: v1 and v2 run in parallel even longer. Schema compatibility matters the most here.
Tool support
Kubernetes: rolling by default. Canary via Flagger, Argo Rollouts. Blue/green via Spinnaker.
AWS ECS: rolling by default. Blue/green via CodeDeploy.
Vercel, Netlify: atomic deploy (instant switch). Basic blue/green.
Heroku: rolling by default. Canary via add-ons.
Custom: roll your own routing with Nginx config changes, HAProxy, Traefik.
Real project scenarios
Scenario 1: SaaS e-commerce, 100K monthly users
- Kubernetes cluster, 10 pods
- Regular database migrations
- Critical checkout flow
Decision: rolling by default, canary for checkout changes. Blue/green infrastructure cost isn’t worth it.
Scenario 2: mobile app backend, 1M+ users
- High traffic, low-latency requirement
- Payment processing at the core
Decision: blue/green. Zero downtime is mandatory. Instant rollback matters for payment issues.
Scenario 3: analytics platform, internal users
- 500 internal users
- Complex data pipelines
- Latency is tolerable
Decision: rolling. Canary or blue/green would be overkill. Internal user feedback is fast.
Scenario 4: fintech, critical ML model update
- Trading algorithms
- Millions ride on each decision
Decision: aggressive canary. 1% rollout, 2 weeks of monitoring, then expand. Risk minimization is paramount.
Monitoring during deploy
What to watch during a deploy:
1. Error rate. Is it climbing from baseline? 0.5% tolerance is typical.
2. Latency p50/p95/p99. Any regression?
3. Business metrics. Checkout conversion, signups, and so on. Any anomaly after deploy?
4. Infrastructure health. CPU, memory, disk I/O. Hardware stress?
5. Synthetic checks. Smoke tests every minute.
A failed-deploy state is an automatic rollback trigger; the thresholds are defined in the CI/CD pipeline.
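A sketch of what such a guardrail can look like in the pipeline. The metric names, baselines, and tolerances are illustrative, not prescriptive.

```python
from dataclasses import dataclass

@dataclass
class DeployMetrics:
    error_rate: float          # fraction of failed requests, e.g. 0.007 = 0.7%
    p99_latency_ms: float
    checkout_conversion: float

# Hypothetical baseline and tolerances, in the spirit of the list above.
BASELINE = DeployMetrics(error_rate=0.002, p99_latency_ms=450, checkout_conversion=0.031)
MAX_ERROR_RATE_DELTA = 0.005     # 0.5 percentage points over baseline
MAX_LATENCY_REGRESSION = 1.20    # p99 may grow at most 20%
MAX_CONVERSION_DROP = 0.90       # conversion may fall at most 10%

def should_rollback(current: DeployMetrics) -> bool:
    """Return True when any guardrail is breached; the pipeline would then
    run the rollback step automatically."""
    if current.error_rate > BASELINE.error_rate + MAX_ERROR_RATE_DELTA:
        return True
    if current.p99_latency_ms > BASELINE.p99_latency_ms * MAX_LATENCY_REGRESSION:
        return True
    if current.checkout_conversion < BASELINE.checkout_conversion * MAX_CONVERSION_DROP:
        return True
    return False

print(should_rollback(DeployMetrics(error_rate=0.009, p99_latency_ms=480,
                                    checkout_conversion=0.030)))
# -> True: error rate is 0.7 points above baseline, past the 0.5 tolerance
```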
Common failure modes
Rolling deploy failure: 50% of servers are on the new version, and the new version is buggy. Rollback has to walk every upgraded server back to the old version, so there is real downtime risk.
Fix: extensive staging tests before the deploy, and a rolling deploy that fails fast: smoke-test the first upgraded pod before touching the rest.
Blue/green switchover disaster: the Green environment passed tests, but once it takes production traffic, unknown behavior emerges (real data, real load). Rollback is instant, but some users are affected.
Fix: route pre-production traffic to Green (mirror real requests, compare responses).
Canary silent failure: the 5% group is hitting a bug but metrics don’t reflect it. The rollout slowly widens to 100%.
Fix: comprehensive metric coverage. A user feedback channel. Error tracking on every path.
Rollback strategy
An important part of deploy strategy: rollback.
Rolling: gradual rollback to the previous version. Every server downgrades.
Blue/green: load balancer switch. Instant.
Canary: disable the routing rule. The new version stops receiving traffic.
Target: rollback under 5 minutes. During a production issue, every minute is lost revenue.
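For a Kubernetes rolling deployment, that target is easy to measure. A minimal sketch, assuming kubectl access; the deployment name is made up.

```python
import subprocess
import time

DEPLOYMENT = "deployment/web"   # hypothetical deployment name
ROLLBACK_BUDGET_S = 300         # the "under 5 minutes" target

def rollback() -> float:
    """Revert to the previous ReplicaSet and wait until the rollout settles."""
    start = time.monotonic()
    subprocess.run(["kubectl", "rollout", "undo", DEPLOYMENT], check=True)
    subprocess.run(["kubectl", "rollout", "status", DEPLOYMENT,
                    "--timeout", f"{ROLLBACK_BUDGET_S}s"], check=True)
    return time.monotonic() - start

if __name__ == "__main__":
    elapsed = rollback()
    print(f"rollback completed in {elapsed:.0f}s "
          f"({'within' if elapsed <= ROLLBACK_BUDGET_S else 'OVER'} budget)")
```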
Testing deploys
How do you test deploy automation?
1. A staging environment that mirrors production. Test deploys there.
2. Chaos engineering. Random pod kills, network blips during deploy. Test recovery.
3. Game days. Quarterly. Deploy plus failure scenario simulation. Test team reaction.
4. Disaster recovery drills. “Production deploy failed, do the rollback drill”. Timed.
Without this discipline, a real failed deploy means panic mode.
Wrap-up
Deploy strategy is a function of risk tolerance, infrastructure budget, and operational capability. Rolling is enough for most projects. Blue/green for high-stakes systems. Canary for critical changes.
Database schema migration makes the expand-contract pattern essential. Rollback strategy is the most critical piece of preparation.
Monitoring during deploy is as important as monitoring after. Early detection plus instant rollback saves production.