A user tries to log in and the auth server times out after 5 seconds. With 1,000 users hitting it at once, the server’s entire connection pool is tied up waiting, and eventually the main application collapses too. That’s a cascade failure.
The pattern that prevents this: the circuit breaker. If the downstream service is sick, stop sending requests to it. Fail fast to the user. Protect the server.
I’ve seen circuit breaker implementations in 12 SaaS projects. Here are the four common mistakes and the correct implementation.
Circuit breaker states
The classic pattern has three states:
Closed: normal operation. Every request goes downstream. Failures are counted.
Open: threshold exceeded (X failures in Y time window). Requests don’t go downstream. Users get an immediate failure response. The server is protected.
Half-Open: after a period (30 seconds, 1 minute) a trial request is allowed. Success: back to Closed. Failure: back to Open.
Closed -> (X failures) -> Open -> (timeout) -> Half-Open -> (success) -> Closed
                                                         -> (failure) -> Open

This simple state machine protects both the caller and the struggling downstream service.
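Stripped to its essentials, the state machine fits in a few lines. This is an illustrative sketch (the class and option names are invented here), not a replacement for a library:

```javascript
// Minimal three-state circuit breaker: Closed -> Open -> Half-Open.
class SimpleBreaker {
  constructor({ failureThreshold = 5, resetTimeoutMs = 30000 } = {}) {
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
    this.failures = 0;
    this.state = 'CLOSED';
    this.openedAt = 0;
  }

  // Ask before each downstream request; may flip Open -> Half-Open.
  canRequest(now = Date.now()) {
    if (this.state === 'OPEN' && now - this.openedAt >= this.resetTimeoutMs) {
      this.state = 'HALF_OPEN'; // allow one trial request through
    }
    return this.state !== 'OPEN';
  }

  onSuccess() {
    this.failures = 0;
    this.state = 'CLOSED'; // Half-Open trial succeeded (or normal traffic)
  }

  onFailure(now = Date.now()) {
    this.failures += 1;
    // A failed Half-Open trial, or too many failures while Closed, opens it.
    if (this.state === 'HALF_OPEN' || this.failures >= this.failureThreshold) {
      this.state = 'OPEN';
      this.openedAt = now;
    }
  }
}
```

Real libraries add rolling windows, percentages, and event hooks on top of exactly this skeleton.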
A real implementation
The most famous library was Netflix Hystrix (no longer actively developed, but still the reference implementation of the pattern). Modern alternatives: resilience4j (Java), Polly (.NET), opossum (Node.js), gobreaker (Go).
Node.js example (opossum):
```javascript
const CircuitBreaker = require('opossum');

const options = {
  timeout: 3000,                // count the call as failed if no response in 3 seconds
  errorThresholdPercentage: 50, // open at a 50% failure rate
  resetTimeout: 30000           // wait 30 seconds before going Half-Open
};

const breaker = new CircuitBreaker(callAuthService, options);

breaker.fallback(() => {
  return { authenticated: false, reason: 'Auth service unavailable' };
});

breaker.on('open', () => console.log('Auth service circuit is OPEN'));
breaker.on('halfOpen', () => console.log('Circuit trying recovery'));

// Usage
try {
  const result = await breaker.fire(userId, password);
} catch (error) {
  // Handles downstream errors and an open circuit alike
}
```

breaker.fire() calls callAuthService if the circuit is Closed. If Open, it returns the fallback immediately.
The four common mistakes
Mistake 1: circuit breakers on every downstream call
Wrapping every external call in a breaker is overkill. Managing breaker state adds overhead.
Right move: apply to critical downstream services. Non-critical calls (logging, analytics) fail directly with a fallback.
Mistake 2: threshold too low
An errorThresholdPercentage of 10% opens the circuit on every transient network glitch. To catch real outages without tripping on noise, 40 to 60% is a reasonable range.
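Pairing the percentage with a minimum request volume also guards against tiny samples; in opossum this is the volumeThreshold option (the sketch below assumes its documented semantics: the percentage is only evaluated once enough requests exist in the rolling window):

```javascript
// opossum-style options that resist tripping on a single glitch.
const options = {
  errorThresholdPercentage: 50, // open only at a sustained 50% failure rate...
  volumeThreshold: 10,          // ...and only after at least 10 requests in the window
  rollingCountTimeout: 10000,   // 10-second rolling statistics window
  resetTimeout: 30000           // wait 30 seconds before trying Half-Open
};
```

With these values, 1 failure out of 2 requests (50%!) at low traffic does not open the circuit, but 5 failures out of 10 does.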
Mistake 3: fixed reset timeout
30 seconds is a fine default, but different services need different recovery times. The payment API takes longer to recover than the email service.
Use adaptive timeouts: dynamic, based on downstream response times.
Mistake 4: no fallback logic
What happens when the circuit opens? You need a fallback. Options:
- Cached response (stale data)
- Default value (guest experience)
- Degraded feature (basic search instead of advanced search)
- Graceful error message
Without a fallback, a circuit breaker only makes the error message arrive faster. It doesn’t improve user experience.
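The first two options can be chained: try the live call, serve stale cache on failure, and only then degrade to a graceful error. This is an illustrative sketch (searchWithFallback and liveSearch are invented names; liveSearch stands in for a breaker-wrapped call):

```javascript
// Fallback chain: live result -> stale cached copy -> graceful error.
const cache = new Map();

async function searchWithFallback(query, liveSearch) {
  try {
    const result = await liveSearch(query); // may throw: downstream error or open circuit
    cache.set(query, result);               // keep a copy for future fallbacks
    return result;
  } catch (err) {
    if (cache.has(query)) {
      // Stale data beats no data; flag it so the UI can say so.
      return { ...cache.get(query), stale: true };
    }
    return { error: 'Search is temporarily unavailable' }; // graceful message
  }
}
```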
When to use it
Scenarios where circuit breakers shine:
1. Network calls to external services. Third-party APIs (Stripe, SendGrid, a geocoder). Provider down, you stay up.
2. Database calls (rare but possible). Database down, graceful degradation.
3. Cache servers. Redis down, the app bypasses cache and hits the DB directly.
4. Internal microservice calls. Service A depends on service B. B slow, A stays fast.
When not to use it
1. Synchronous mission-critical operations. Payment. Even if the circuit is open, the user wants to pay. You need an alternative payment path; a breaker alone isn’t enough.
2. Transient network issues. Retries are a better fit. Circuit breakers are for persistent failures.
3. Rate-limited-by-design operations. If you’re hitting a provider’s rate limits, a breaker treats an expected throttle as an outage. Exponential backoff fits better.
Circuit breaker + retry combination
The two are typically used together:
- Retry: for transient failures, 2 or 3 attempts with exponential backoff.
- Circuit breaker: for persistent failures, after the retries.
Order matters: retry first, breaker second. Retry handles temporary issues, the breaker handles persistent ones.
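The retryWithBackoff helper used in the snippet below is not a library function; one hedged sketch of it (signature assumed for illustration):

```javascript
// Retry a flaky async call with exponential backoff: 100ms, 200ms, 400ms, ...
async function retryWithBackoff(fn, { maxAttempts = 3, baseDelayMs = 100 } = {}) {
  let lastError;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Wait longer after each failed attempt before retrying.
      await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** attempt));
    }
  }
  throw lastError; // exhausted retries surface as ONE failure to the breaker
}
```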
```javascript
// opossum's fire() invokes the action the breaker was constructed with,
// so the breaker wraps the already-retried call: transient errors are
// absorbed by the retries, and one exhausted retry run counts as a
// single failure toward the breaker's threshold.
const resilientBreaker = new CircuitBreaker(
  () => retryWithBackoff(() => downstream.call(), { maxAttempts: 3 }),
  options
);

async function callWithResilience() {
  return resilientBreaker.fire();
}
```

Monitoring circuit breakers
Monitoring state changes is critical:
Metrics:
– Current state (Closed/Open/Half-Open)
– Time in each state
– Failure rate
– Request count per state
Alerts:
– Circuit opens -> PagerDuty
– Open state > 5 minutes -> Slack notification
– Multiple circuits open at once -> major incident
Dashboards:
– Per-service circuit breaker state
– Historical state transitions
– Failure rate trends
Without this visibility, the breaker operates silently and the team doesn’t notice the system is degraded.
Bulkhead pattern (complementary)
The breaker’s cousin: the bulkhead pattern. Separate connection pools so one service’s calls can’t block the others.
Scenario: service A calls auth, database, payment. Auth is slow, it fills the connection pool. Database and payment calls also block.
Bulkhead fix: a dedicated thread pool or connection pool per downstream. If auth is slow, only auth’s pool drains; database and payment are unaffected.
Circuit breaker + bulkhead = comprehensive resilience.
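A bulkhead can be sketched as a per-downstream concurrency cap; this Bulkhead class is illustrative, not a library API (rejecting instead of queueing is a deliberate fail-fast choice):

```javascript
// Cap concurrent in-flight calls per downstream so one slow dependency
// cannot exhaust a shared pool.
class Bulkhead {
  constructor(maxConcurrent) {
    this.maxConcurrent = maxConcurrent;
    this.inFlight = 0;
  }

  async run(fn) {
    if (this.inFlight >= this.maxConcurrent) {
      throw new Error('Bulkhead full: rejecting instead of queueing');
    }
    this.inFlight++;
    try {
      return await fn();
    } finally {
      this.inFlight--; // always free the slot, success or failure
    }
  }
}

// One compartment per downstream: a slow auth service can only occupy
// its own slots, never the database's or payment's.
const pools = {
  auth: new Bulkhead(10),
  database: new Bulkhead(20),
  payment: new Bulkhead(5),
};
```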
Service mesh integration
In Kubernetes microservice systems, service meshes like Istio and Linkerd provide built-in circuit breakers. Each pod has a sidecar proxy, the proxy implements the breaker.
Advantage: application code doesn’t change. It’s handled at the infrastructure layer.
Disadvantage: customization is limited. Writing language-specific business logic in the sidecar is hard.
Hybrid: service mesh for basic circuit breaking, application-level fine-tuned breakers on important services.
Real story: Stripe API integration
On an e-commerce project I wrapped Stripe API calls in a circuit breaker. Stripe had a 2-hour outage (rare, but it happens).
Without the breaker:
– Every payment attempt hangs for 30 seconds
– The connection pool is stuck waiting on Stripe
– The main application starts to collapse
– Support tickets explode
With the breaker:
– Stripe times out after 30 seconds, circuit opens within 2 minutes
– Payment attempts fail immediately, user sees “Payment is temporarily unavailable, please try again in a few minutes”
– The rest of the app (catalog browsing, account management) keeps working
– Stripe recovers, circuit goes Half-Open then Closed
During the 2 hours Stripe was down, the system was degraded but up. Customer impact contained.
Testing circuit breakers
Chaos engineering mindset:
Unit test: test the state transitions of the breaker library. Mock downstream errors, check state.
Integration test: simulate slow or failing downstream. Does the circuit actually open?
Production testing: controlled downstream failure injection, chaos-monkey style. Start in a test environment and off-peak windows (weekends) before touching production.
Game days: quarterly. Simulate full service failure. Exercise the team’s response.
Configuration tuning
Defaults are rarely optimal. Tune per service:
Fast downstream (response < 100ms):
– Timeout: 1 second
– Failure threshold: 30%
– Reset: 10 seconds
Medium downstream (response 100 to 500ms):
– Timeout: 3 seconds
– Failure threshold: 50%
– Reset: 30 seconds
Slow downstream (response 1 to 5 seconds):
– Timeout: 10 seconds
– Failure threshold: 60%
– Reset: 60 seconds
Tune these based on your actual traffic patterns. Look at Prometheus metrics.
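Those tiers can live in one options map so every breaker is constructed from an explicit profile rather than ad-hoc numbers (values copied from the table above; tune them against real metrics):

```javascript
// Starting-point breaker profiles per downstream speed tier.
const breakerProfiles = {
  fast:   { timeout: 1000,  errorThresholdPercentage: 30, resetTimeout: 10000 },
  medium: { timeout: 3000,  errorThresholdPercentage: 50, resetTimeout: 30000 },
  slow:   { timeout: 10000, errorThresholdPercentage: 60, resetTimeout: 60000 },
};

// e.g. with opossum: new CircuitBreaker(callGeocoder, breakerProfiles.slow)
```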
Bottom line
The circuit breaker is a foundational tool for microservice reliability. Critical for cascade failure prevention.
Don’t apply it to every call; focus on critical downstreams. Combine retry and circuit breaker. Fallback logic matters. Monitoring and alerting are mandatory.
Implementation: opossum, resilience4j, Polly. Proven libraries exist. You don’t need to roll your own. Tune the configuration; the wrong defaults are very common.
Price out one production incident and a circuit breaker pays for its 2 to 3 hour implementation many times over.