A user tries to log in and the auth server times out after 5 seconds. With 1,000 users hitting it at once, the server’s entire connection pool is tied up waiting, and eventually the main application collapses too. That’s a cascade failure.
The pattern that prevents this: the circuit breaker. If the downstream service is sick, stop sending requests to it. Fail fast to the user. Protect the server.
I’ve seen circuit breaker implementations in 12 SaaS projects. Here are the four common mistakes and the correct implementation.
Circuit breaker states
The classic pattern has three states:
Closed: normal operation. Every request goes downstream. Failures are counted.
Open: threshold exceeded (X failures in Y time window). Requests don’t go downstream. Users get an immediate failure response. The server is protected.
Half-Open: after a period (30 seconds, 1 minute) a trial request is allowed. Success: back to Closed. Failure: back to Open.
Closed -> (X failures) -> Open -> (timeout) -> Half-Open -> (success) -> Closed
                                                         -> (failure) -> Open

This simple state machine protects both the caller and the struggling downstream service.
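Stripped to its essentials, the state machine fits in a few lines. This is an illustrative sketch (the class and option names are invented here), not a replacement for a library:

```javascript
// Minimal three-state circuit breaker: Closed -> Open -> Half-Open.
class SimpleBreaker {
  constructor({ failureThreshold = 5, resetTimeoutMs = 30000 } = {}) {
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
    this.failures = 0;
    this.state = 'CLOSED';
    this.openedAt = 0;
  }

  // Ask before each downstream request; may flip Open -> Half-Open.
  canRequest(now = Date.now()) {
    if (this.state === 'OPEN' && now - this.openedAt >= this.resetTimeoutMs) {
      this.state = 'HALF_OPEN'; // allow one trial request through
    }
    return this.state !== 'OPEN';
  }

  onSuccess() {
    this.failures = 0;
    this.state = 'CLOSED'; // Half-Open trial succeeded (or normal traffic)
  }

  onFailure(now = Date.now()) {
    this.failures += 1;
    // A failed Half-Open trial, or too many failures while Closed, opens it.
    if (this.state === 'HALF_OPEN' || this.failures >= this.failureThreshold) {
      this.state = 'OPEN';
      this.openedAt = now;
    }
  }
}
```

Real libraries add rolling windows, percentages, and event hooks on top of exactly this skeleton.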
A real implementation
The most famous library was Netflix Hystrix (no longer actively developed, but still the reference implementation of the pattern). Modern alternatives: resilience4j (Java), Polly (.NET), opossum (Node.js), gobreaker (Go).
Node.js example (opossum):
```javascript
const CircuitBreaker = require('opossum');

const options = {
  timeout: 3000,                // count the call as failed if no response in 3 seconds
  errorThresholdPercentage: 50, // open at a 50% failure rate
  resetTimeout: 30000           // wait 30 seconds before going Half-Open
};

const breaker = new CircuitBreaker(callAuthService, options);

breaker.fallback(() => {
  return { authenticated: false, reason: 'Auth service unavailable' };
});

breaker.on('open', () => console.log('Auth service circuit is OPEN'));
breaker.on('halfOpen', () => console.log('Circuit trying recovery'));

// Usage
try {
  const result = await breaker.fire(userId, password);
} catch (error) {
  // Handles downstream errors and an open circuit alike
}
```

breaker.fire() calls callAuthService if the circuit is Closed. If Open, it returns the fallback immediately.
The four common mistakes
Mistake 1: circuit breakers on every downstream call
Wrapping every external call in a breaker is overkill. Managing breaker state adds overhead.
Right move: apply to critical downstream services. Non-critical calls (logging, analytics) fail directly with a fallback.
Mistake 2: threshold too low
An errorThresholdPercentage of 10% opens the circuit on every transient network glitch. To catch real outages without tripping on noise, 40 to 60% is a reasonable range.
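Pairing the percentage with a minimum request volume also guards against tiny samples; in opossum this is the volumeThreshold option (the sketch below assumes its documented semantics: the percentage is only evaluated once enough requests exist in the rolling window):

```javascript
// opossum-style options that resist tripping on a single glitch.
const options = {
  errorThresholdPercentage: 50, // open only at a sustained 50% failure rate...
  volumeThreshold: 10,          // ...and only after at least 10 requests in the window
  rollingCountTimeout: 10000,   // 10-second rolling statistics window
  resetTimeout: 30000           // wait 30 seconds before trying Half-Open
};
```

With these values, 1 failure out of 2 requests (50%!) at low traffic does not open the circuit, but 5 failures out of 10 does.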
Mistake 3: fixed reset timeout
30 seconds is a fine default, but different services need different recovery times. The payment API takes longer to recover than the email service.
Use adaptive timeouts: dynamic, based on downstream response times.
Mistake 4: no fallback logic
What happens when the circuit opens? You need a fallback. Options:
- Cached response (stale data)
- Default value (guest experience)
- Degraded feature (basic search instead of advanced search)
- Graceful error message
Without a fallback, a circuit breaker only makes the error message arrive faster. It doesn’t improve user experience.
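The first two options can be chained: try the live call, serve stale cache on failure, and only then degrade to a graceful error. This is an illustrative sketch (searchWithFallback and liveSearch are invented names; liveSearch stands in for a breaker-wrapped call):

```javascript
// Fallback chain: live result -> stale cached copy -> graceful error.
const cache = new Map();

async function searchWithFallback(query, liveSearch) {
  try {
    const result = await liveSearch(query); // may throw: downstream error or open circuit
    cache.set(query, result);               // keep a copy for future fallbacks
    return result;
  } catch (err) {
    if (cache.has(query)) {
      // Stale data beats no data; flag it so the UI can say so.
      return { ...cache.get(query), stale: true };
    }
    return { error: 'Search is temporarily unavailable' }; // graceful message
  }
}
```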
When to use it
Scenarios where circuit breakers shine:
1. Network calls to external services. Third-party APIs (Stripe, SendGrid, a geocoder). Provider down, you stay up.
2. Database calls (rare but possible). Database down, graceful degradation.
3. Cache servers. Redis down, the app bypasses cache and hits the DB directly.
4. Internal microservice calls. Service A depends on service B. B slow, A stays fast.
When not to use it
1. Synchronous mission-critical operations. Payment. Even if the circuit is open, the user wants to pay. You need an alternative payment path; a breaker alone isn’t enough.
2. Transient network issues. Retries are a better fit. Circuit breakers are for persistent failures.
3. Rate-limited-by-design operations. If you’re hitting a provider’s rate limits, a breaker treats an expected throttle as an outage. Exponential backoff fits better.
Circuit breaker + retry combination
The two are typically used together:
- Retry: for transient failures, 2 or 3 attempts with exponential backoff.
- Circuit breaker: for persistent failures, after the retries.
Order matters: retry first, breaker second. Retry handles temporary issues, the breaker handles persistent ones.
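The retryWithBackoff helper used in the snippet below is not a library function; one hedged sketch of it (signature assumed for illustration):

```javascript
// Retry a flaky async call with exponential backoff: 100ms, 200ms, 400ms, ...
async function retryWithBackoff(fn, { maxAttempts = 3, baseDelayMs = 100 } = {}) {
  let lastError;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Wait longer after each failed attempt before retrying.
      await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** attempt));
    }
  }
  throw lastError; // exhausted retries surface as ONE failure to the breaker
}
```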
```javascript
// opossum's fire() invokes the action the breaker was constructed with,
// so the breaker wraps the already-retried call: transient errors are
// absorbed by the retries, and one exhausted retry run counts as a
// single failure toward the breaker's threshold.
const resilientBreaker = new CircuitBreaker(
  () => retryWithBackoff(() => downstream.call(), { maxAttempts: 3 }),
  options
);

async function callWithResilience() {
  return resilientBreaker.fire();
}
```

Monitoring circuit breakers
Monitoring state changes is critical:
Metrics:
– Current state (Closed/Open/Half-Open)
– Time in each state
– Failure rate
– Request count per state
Alerts:
– Circuit opens -> PagerDuty
– Open state > 5 minutes -> Slack notification
– Multiple circuits open at once -> major incident
Dashboards:
– Per-service circuit breaker state
– Historical state transitions
– Failure rate trends
Without this visibility, the breaker operates silently and the team doesn’t notice the system is degraded.
Bulkhead pattern (complementary)
The breaker’s cousin: the bulkhead pattern. Separate connection pools so one service’s calls can’t block the others.
Scenario: service A calls auth, database, payment. Auth is slow, it fills the connection pool. Database and payment calls also block.
Bulkhead fix: a dedicated thread pool or connection pool per downstream. If auth is slow, only auth’s pool drains; database and payment are unaffected.
Circuit breaker + bulkhead = comprehensive resilience.
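A bulkhead can be sketched as a per-downstream concurrency cap; this Bulkhead class is illustrative, not a library API (rejecting instead of queueing is a deliberate fail-fast choice):

```javascript
// Cap concurrent in-flight calls per downstream so one slow dependency
// cannot exhaust a shared pool.
class Bulkhead {
  constructor(maxConcurrent) {
    this.maxConcurrent = maxConcurrent;
    this.inFlight = 0;
  }

  async run(fn) {
    if (this.inFlight >= this.maxConcurrent) {
      throw new Error('Bulkhead full: rejecting instead of queueing');
    }
    this.inFlight++;
    try {
      return await fn();
    } finally {
      this.inFlight--; // always free the slot, success or failure
    }
  }
}

// One compartment per downstream: a slow auth service can only occupy
// its own slots, never the database's or payment's.
const pools = {
  auth: new Bulkhead(10),
  database: new Bulkhead(20),
  payment: new Bulkhead(5),
};
```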
Service mesh integration
In Kubernetes microservice systems, service meshes like Istio and Linkerd provide built-in circuit breakers. Each pod has a sidecar proxy, the proxy implements the breaker.
Advantage: application code doesn’t change. It’s handled at the infrastructure layer.
Disadvantage: customization is limited. Writing language-specific business logic in the sidecar is hard.
Hybrid: service mesh for basic circuit breaking, application-level fine-tuned breakers on important services.
Real story: Stripe API integration
On an e-commerce project I wrapped Stripe API calls in a circuit breaker. Stripe had a 2-hour outage (rare, but it happens).
Without the breaker:
– Every payment attempt hangs for 30 seconds
– The connection pool is stuck waiting on Stripe
– The main application starts to collapse
– Support tickets explode
With the breaker:
– Stripe times out after 30 seconds, circuit opens within 2 minutes
– Payment attempts fail immediately, user sees “Payment is temporarily unavailable, please try again in a few minutes”
– The rest of the app (catalog browsing, account management) keeps working
– Stripe recovers, circuit goes Half-Open then Closed
During the 2 hours Stripe was down, the system was degraded but up. Customer impact contained.
Testing circuit breakers
Chaos engineering mindset:
Unit test: test the state transitions of the breaker library. Mock downstream errors, check state.
Integration test: simulate slow or failing downstream. Does the circuit actually open?
Production testing: controlled downstream failure injection, chaos-monkey style. Start in a test environment and off-peak windows (weekends) before touching production.
Game days: quarterly. Simulate full service failure. Exercise the team’s response.
Configuration tuning
Defaults are rarely optimal. Tune per service:
Fast downstream (response < 100ms):
– Timeout: 1 second
– Failure threshold: 30%
– Reset: 10 seconds
Medium downstream (response 100 to 500ms):
– Timeout: 3 seconds
– Failure threshold: 50%
– Reset: 30 seconds
Slow downstream (response 1 to 5 seconds):
– Timeout: 10 seconds
– Failure threshold: 60%
– Reset: 60 seconds
Tune these based on your actual traffic patterns. Look at Prometheus metrics.
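Those tiers can live in one options map so every breaker is constructed from an explicit profile rather than ad-hoc numbers (values copied from the table above; tune them against real metrics):

```javascript
// Starting-point breaker profiles per downstream speed tier.
const breakerProfiles = {
  fast:   { timeout: 1000,  errorThresholdPercentage: 30, resetTimeout: 10000 },
  medium: { timeout: 3000,  errorThresholdPercentage: 50, resetTimeout: 30000 },
  slow:   { timeout: 10000, errorThresholdPercentage: 60, resetTimeout: 60000 },
};

// e.g. with opossum: new CircuitBreaker(callGeocoder, breakerProfiles.slow)
```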
Bottom line
The circuit breaker is a foundational tool for microservice reliability. Critical for cascade failure prevention.
Don’t apply it to every call; focus on critical downstreams. Combine retry and circuit breaker. Fallback logic matters. Monitoring and alerting are mandatory.
Implementation: opossum, resilience4j, Polly. Proven libraries exist. You don’t need to roll your own. Tune the configuration; the wrong defaults are very common.
Price out one production incident and a circuit breaker pays for its 2 to 3 hour implementation many times over.