
Chaos engineering for small teams: the stripped-down version

Netflix's Chaos Monkey is famous, but how does a five-person team actually benefit from chaos engineering? Here's the tiered approach I've landed on.

Mention chaos engineering and everyone thinks of Netflix, AWS, massive cloud infrastructure. “We’re a small team, this isn’t for us”, and we stay away. But there’s a scaled-down version that’s perfectly doable on a small team. Here’s how I run it.

The core idea is simple. Before the system falls over on its own, we deliberately break pieces of it, watch what happens, and take the lessons. Instead of production imploding on a Friday night, we implode it on a Tuesday afternoon in a controlled setting.

Entry level: what breaks when each component fails?

The first exercise is pure documentation. Sit down with the team and answer “what are the critical components of our system, and what happens if one of them goes down?” Example:

| Component | What breaks if it fails? | Is there a fallback? |
| --- | --- | --- |
| PostgreSQL primary | System down | Read replica, manual failover |
| Redis cache | Slower, DB takes the load | None, straight to DB |
| S3 image bucket | Images don’t render | None |
| SMTP provider | Mail doesn’t send | Backup provider |
| Payment gateway | Can’t take payments | Second gateway |

That table alone is the most valuable part of chaos engineering. If the team can’t answer “what happens if Redis goes down?”, testing it is critical.

Level 1: manually kill one component

On staging, kill a single component by hand and watch the system.

# kill Redis on staging
sudo systemctl stop redis

# what's in the app logs?
tail -f /var/log/app.log

# what's changing in our metrics?
# response time? error rate?

The first time we did this, Redis went down, the app sat through a 30-second connection timeout, and then threw an unhandled exception, so the raw error reached the user. We had “if Redis is down, fall back to the DB” in the design, but nobody had ever written that code. Without this test, we would have learned it in production.

Fix: set the Redis client to connect_timeout: 1s and command_timeout: 500ms. If it can’t connect, the client returns null, the app treats it as a cache miss, and the request goes straight to the DB.
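
A minimal sketch of that fallback in Python with redis-py (the timeout parameters are real redis-py options; get_user and the db helper are hypothetical stand-ins, not our production code):

import json
import redis

# fail fast instead of hanging for 30 seconds
cache = redis.Redis(
    host="localhost",
    socket_connect_timeout=1.0,  # connect_timeout: 1s
    socket_timeout=0.5,          # command_timeout: 500ms
)

def get_user(user_id, db):
    key = f"user:{user_id}"
    try:
        cached = cache.get(key)
        if cached is not None:
            return json.loads(cached)
    except redis.RedisError:
        pass  # Redis is down or slow: treat it as a cache miss
    user = db.fetch_user(user_id)  # hypothetical DB helper
    try:
        cache.set(key, json.dumps(user), ex=300)
    except redis.RedisError:
        pass  # best effort; never let the cache block the request
    return user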

Level 2: simulated network problems

With Toxiproxy you can inject controlled latency, timeouts, and dropped connections between services. What happens if we add 500 ms of latency and the occasional dropped connection on staging?

toxiproxy-cli toxic add -t latency -a latency=500 redis_proxy
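
The same toxic can be driven from a script through Toxiproxy’s HTTP admin API (port 8474 by default). A rough sketch with Python’s requests, assuming a proxy named redis_proxy has already been created:

import requests

TOXIPROXY = "http://localhost:8474"

# add 500 ms of latency (plus jitter) to traffic flowing towards Redis
requests.post(
    f"{TOXIPROXY}/proxies/redis_proxy/toxics",
    json={
        "name": "redis_latency",
        "type": "latency",
        "stream": "downstream",
        "toxicity": 1.0,
        "attributes": {"latency": 500, "jitter": 50},
    },
).raise_for_status()

# ... run the experiment, watch logs and metrics ...

# remove the toxic so staging goes back to normal
requests.delete(
    f"{TOXIPROXY}/proxies/redis_proxy/toxics/redis_latency"
).raise_for_status()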

This revealed dependencies whose behaviour quietly degraded once latency crept in. A microservice that was fine under normal conditions exhausted its connection pool when an upstream slowed down by 500 ms. New requests started queueing. A cascading failure was under way.

Fix: grow the connection pool and add a circuit breaker.
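
A circuit breaker doesn’t have to mean a heavyweight framework. A hand-rolled sketch like the one below captures the idea; the thresholds are illustrative, and in practice you would likely reach for an existing library:

import time

class CircuitBreaker:
    """Fail fast after repeated errors; allow a retry after a cooldown."""

    def __init__(self, max_failures=5, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open, failing fast")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result

# usage: breaker = CircuitBreaker(); breaker.call(upstream.get_profile, user_id)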

Level 3: load plus failure combined

The earlier levels test an idle system; real failures rarely arrive that politely. With k6 pushing 100 concurrent virtual users, we slow the database down and watch what gives first.

This test showed that slowing the database primary locked up every API endpoint at once: a single uncached endpoint was draining the shared connection pool for everyone. The fix: separate database connection pools per endpoint.
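
Sketched with SQLAlchemy, under assumptions (the DSN, the endpoint split, and the pool sizes are illustrative): the slow, uncached endpoint gets its own small pool, so exhausting it no longer starves the rest of the API.

from sqlalchemy import create_engine, text

DB_URL = "postgresql://app@db-primary/app"  # hypothetical DSN

# small, dedicated pool for the slow, uncached reporting endpoint
reporting_engine = create_engine(DB_URL, pool_size=5, max_overflow=0, pool_timeout=2)

# the rest of the API shares a larger pool
api_engine = create_engine(DB_URL, pool_size=20, max_overflow=10, pool_timeout=2)

def run_report(sql, params=None):
    # if this pool is exhausted, only the reporting endpoint waits or fails
    with reporting_engine.connect() as conn:
        return conn.execute(text(sql), params or {}).fetchall()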

The discipline for a small team

Netflix automates this; we can’t match that. But we can carve out two hours on one Tuesday afternoon a month. The goals for each session:

  1. Kill one component. Document the result.
  2. If a fix is needed, open a task. Write an action plan.
  3. Verify the fix before the next chaos session.

That’s it. Not flashy, just disciplined.

What not to test

  • Things with unknown outcomes in production. Staging first.
  • Weekends or the night before holidays. On-call has to be around.
  • During peak user hours. Pick a low-traffic window.
  • Components without a designed fallback. Design first, break second.

Gamedays

Four months ago we ran a half-day “gameday” workshop. We killed a service and the team ran a full incident drill: who does what, are the runbooks current, do our communication channels work, all of it got tested. Even with only four people in the room, our coordination during the next real incident was visibly faster.

A gameday isn’t just an investment, it’s training. New developers get up to speed on the system much faster this way.

The culture it creates

The discipline seeps into the culture over time. When you’re designing a new feature, “what if this component goes down?” becomes reflexive. A fallback section becomes mandatory in design docs. Hunting for single points of failure becomes a habit.

Being a small team shouldn’t keep you from this. Just shrink the scope, do it on a schedule, and extract a lesson each time.
