
Distributed rate limiting: sharing state across a server cluster

Rate limiting on a single server is easy. Across a 5-node cluster it is a shared state problem. Redis Lua scripts, database counters, and consistent hashing compared.

On a single application server, rate limiting is trivial. You keep an in-memory counter and check it on every request. In production there are usually 5 to 10 servers. If every server keeps its own counter, a user can burn through 5 to 10 times the rate limit.

Distributed rate limiting solves that shared state problem. This post walks through the practical approaches and their trade-offs.

The problem: sharing state across a cluster

User user_id=123, rate limit 10 requests per second.

A 5-server cluster. The load balancer sprays requests across servers. Each server keeps its own counter.

The user sends 50 requests in the same second. Every server sees 10 of them. Every server’s counter says “this user has made 10 requests”. They all pass. Actual traffic: 50 requests through, the rate limit abused by 5x.
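To make the failure concrete, here is a minimal sketch of that naive per-server limiter (hypothetical class, fixed 1-second window):

import time
from collections import defaultdict

class LocalRateLimiter:
    """Naive per-server limiter. Every server runs its own copy,
    so the effective cluster-wide limit is limit * server_count."""

    def __init__(self, limit: int):
        self.limit = limit
        self.counts = defaultdict(int)  # (user_id, second) -> count

    def allow(self, user_id: str) -> bool:
        window = int(time.time())  # fixed 1-second window
        self.counts[(user_id, window)] += 1
        return self.counts[(user_id, window)] <= self.limit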

Fix: keep the counter in shared storage.

Approach 1: Redis + Lua script

The most common solution. Redis as centralized storage, a Lua script for the atomic operation.

Implementation:

-- Lua script (atomic, no race condition)
-- KEYS[1] = counter key, ARGV[1] = limit, ARGV[2] = window in seconds
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])

local current = redis.call('INCR', key)
if current == 1 then
    redis.call('EXPIRE', key, window)
end

if current > limit then
    return {0, current}  -- deny
else
    return {1, current}  -- allow
end

Usage:

# redis-py passes keys and args positionally: eval(script, numkeys, *keys, *args)
result = redis.eval(script, 1, f'rate:{user_id}', 10, 1)
if result[0] == 0:
    return HTTPError(429)  # Too Many Requests

Upsides:

  • Atomic: no race condition is possible (single-threaded Redis plus a Lua script).
  • Low latency: about 1 ms per call.
  • A proven pattern in production, simple to implement.

Downsides:

  • Redis bottleneck: at very high throughput (100K+ RPS) a single Redis instance is not enough.
  • A network hop on every request.
  • If Redis is down, rate limiting breaks, forcing the fail-open vs fail-closed choice (covered below).

For most production systems this is enough.

Approach 2: sliding window log (Redis sorted set)

More precise rate limiting:

import time
import uuid

key = f'rate:{user_id}'
now = time.time()

# MULTI/EXEC pipeline so another server's update can't interleave
pipe = redis.pipeline()
# Record each request: unique member (two requests can share a timestamp),
# timestamp as the score
pipe.zadd(key, {f'{now}:{uuid.uuid4()}': now})
# Drop timestamps outside the 60 second window
pipe.zremrangebyscore(key, 0, now - 60)
# Count what is left in the window
pipe.zcard(key)
_, _, count = pipe.execute()

if count > 10:
    return HTTPError(429)

Solves the boundary issue of fixed windows (with a fixed 1-minute window, 10 requests at 0:59 plus 10 more at 1:01 all pass: 20 requests in two seconds). The count is exact.

Downside: memory intensive. Every request lives in the log.

Use case: low volume scenarios where the exact rate matters.

Approach 3: token bucket (Redis-backed)

Token bucket algorithm in a cluster:

-- Bucket state: {tokens, lastRefill}, stored in a Redis hash
-- KEYS[1] = bucket key
-- ARGV[1] = capacity, ARGV[2] = refill rate (tokens/sec), ARGV[3] = current time

local capacity = tonumber(ARGV[1])
local refillRate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])

-- HGETALL returns a flat array in Lua, so HMGET is easier to unpack
local bucket = redis.call('HMGET', KEYS[1], 'tokens', 'lastRefill')
local tokens = tonumber(bucket[1]) or capacity
local lastRefill = tonumber(bucket[2]) or now

-- Refill for the time elapsed since the last call
local elapsed = now - lastRefill
local newTokens = math.min(capacity, tokens + elapsed * refillRate)

if newTokens >= 1 then
    redis.call('HMSET', KEYS[1], 'tokens', newTokens - 1, 'lastRefill', now)
    return 1  -- allow
else
    redis.call('HMSET', KEYS[1], 'tokens', newTokens, 'lastRefill', now)
    return 0  -- deny
end

Tolerates bursts while keeping the rate smooth. Common in API gateways.
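Calling it from Python might look like this (a minimal sketch assuming redis-py, with the script above stored in TOKEN_BUCKET_LUA; the capacity of 10 and refill rate of 10 tokens/sec are illustrative):

import time
import redis

r = redis.Redis()
token_bucket = r.register_script(TOKEN_BUCKET_LUA)

def allow(user_id: str) -> bool:
    # KEYS = bucket key; ARGV = capacity, refill rate, current time
    return token_bucket(keys=[f'bucket:{user_id}'],
                        args=[10, 10, time.time()]) == 1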

Approach 4: database-backed (not recommended)

If Redis is not available, the database can stand in:

SELECT request_count FROM rate_limits WHERE user_id = 123 AND window_start > NOW() - INTERVAL '1 minute';
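The read alone is not enough; the increment has to be atomic too. A hedged sketch assuming Postgres and a psycopg2-style connection (the rate_limits schema, with a unique index on (user_id, window_start), is hypothetical):

UPSERT = """
    INSERT INTO rate_limits (user_id, window_start, request_count)
    VALUES (%s, date_trunc('minute', NOW()), 1)
    ON CONFLICT (user_id, window_start)
    DO UPDATE SET request_count = rate_limits.request_count + 1
    RETURNING request_count
"""

with conn.cursor() as cur:
    cur.execute(UPSERT, (user_id,))
    count = cur.fetchone()[0]
if count > 10:
    return HTTPError(429)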

Problem: a database hit per request. Latency in the 10 to 50ms range. Under load the DB connection pool saturates.

Use case: only when Redis is genuinely not an option (very rare).

Approach 5: consistent hashing (sticky routing)

A different angle: no shared state, the same user always routes to the same server.

The load balancer uses consistent hashing on user_id. User 123 always lands on server 3. Server 3 keeps its own local counter.
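A minimal hash-ring sketch (illustrative only; real load balancers such as Envoy or HAProxy implement consistent hashing for you):

import bisect
import hashlib

class HashRing:
    """Consistent hash ring with virtual nodes per server."""

    def __init__(self, servers, vnodes=100):
        self.ring = sorted(
            (self._hash(f'{s}#{i}'), s)
            for s in servers
            for i in range(vnodes)
        )
        self.hashes = [h for h, _ in self.ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def server_for(self, user_id: str) -> str:
        # First ring position clockwise from the key's hash
        idx = bisect.bisect(self.hashes, self._hash(user_id)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing([f'server-{n}' for n in range(1, 6)])
ring.server_for('123')  # always the same server for user 123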

Upsides:

  • No network hop for the rate limit check, so it is very fast.
  • No external dependency.

Downsides:

  • A server goes down and the user vanishes (rate limit resets).
  • A server restart zeroes the counter.
  • Rebalance messiness when servers get added or removed.
  • Load distribution can end up uneven.

Use case: very high throughput systems where approximate rate limiting is acceptable.

Hybrid approach: local + global

Combine them: a local counter plus a periodic Redis sync.

Every server has a local counter. Every 10 seconds it flushes that counter to Redis and syncs the aggregate back.

Formula: global_count = sum of all servers' local counts. Accuracy is roughly within 10 seconds.
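A rough sketch of that sync loop (assuming redis-py and a background thread; key names and the 10 second interval are illustrative, and window expiry is omitted for brevity):

import threading
import time
from collections import Counter

local = Counter()      # this server's counts since the last flush
global_counts = {}     # last aggregate seen from Redis
lock = threading.Lock()

def sync_loop(r, interval=10):
    """Flush local counts to Redis, pull back the cluster-wide totals."""
    while True:
        time.sleep(interval)
        with lock:
            pending = dict(local)
            local.clear()
        pipe = r.pipeline()
        for user_id, n in pending.items():
            pipe.incrby(f'rate:global:{user_id}', n)   # push our delta
        for user_id, total in zip(pending, pipe.execute()):
            global_counts[user_id] = total             # INCRBY returns the new total

def allow(user_id: str, limit: int = 100) -> bool:
    with lock:
        local[user_id] += 1
        seen = global_counts.get(user_id, 0) + local[user_id]
    return seen <= limit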

Upsides:

  • The local counter keeps the hot path fast.
  • The periodic sync makes it cluster-aware.
  • If Redis dies, you fall back to local counting.

Downsides:

  • Not exact: decisions can lag by up to 10 seconds.
  • Implementation is more complex.
  • Edge cases around a server restarting mid-window.

Use case: very high throughput where approximate accuracy is fine.

Fail-open vs fail-closed

Redis is down. Rate limit decision:

Fail-open: with no Redis, skip the rate limit and let every request through. Availability wins.

Fail-closed: with no Redis, deny everything. Security wins.

The call depends on the use case.

Public API: fail-open (users keep getting served). Security-critical endpoint: fail-closed (prevent abuse).

Don’t let a cache outage break rate limiting:

from redis.exceptions import RedisError

try:
    result = redis.eval(script, ...)
    return result[0] == 1  # allow/deny
except RedisError:
    logger.warning("Redis down, failing open")
    return True  # allow (fail-open)

Detect Redis downtime through monitoring and fix it quickly: left alone, fail-open masks abuse for as long as the outage lasts.

Minimising network overhead

A Redis call per request adds network cost. Optimizations:

1. Connection pooling. Pool Redis connections, do not open a new TCP connection per request.

2. Pipelining. Batch multiple Redis commands into a single round trip (see the sketch after this list).

3. Batch check. Group 10 requests together, check and decide in one shot. Latency ticks up a bit, throughput roughly doubles.

4. Local cache (short TTL). A 1 to 2 second cached decision. Not exact, but fast.
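A hedged sketch of the first two optimizations, assuming redis-py (host, pool size, and key names are illustrative):

import redis

# Connection pooling: reuse TCP connections instead of reconnecting per request
pool = redis.ConnectionPool(host='localhost', port=6379, max_connections=50)
r = redis.Redis(connection_pool=pool)

# Pipelining: several commands, one network round trip
pipe = r.pipeline(transaction=False)
for user_id in ('123', '456', '789'):
    pipe.incr(f'rate:{user_id}')
counts = pipe.execute()  # three INCRs, one round trip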

These only matter at high throughput. At normal throughput the standard approach is enough.

Multi-region consideration

Cluster in multiple regions: US, EU, Asia. Is the rate limit global or per region?

Global rate limit: every region’s requests feed a single counter, which means Redis must be writable from every region (a single primary, since replicas are read-only). Latency climbs, complexity climbs.

Per region rate limit: each region has its own Redis and its own counter. Not a global limit, but a per region one. A user who burns 100% in one region is still fresh in another.
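In key terms that is just a region prefix (hypothetical naming):

# Each region's Redis only ever sees its own keys
key = f'rate:{region}:{user_id}'  # e.g. rate:eu:123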

Choice: per region is enough for most scenarios. A user typically lives in a single region anyway. A per region abuse threshold covers it.

Monitoring

Watch the distributed rate limiter:

  • Redis latency: p50, p99. A sudden jump means rate limiting is slow.
  • Rate limit hit rate: what percentage of requests are being rate limited? Too high means the limit is too aggressive, too low means it is not doing anything.
  • Redis availability: downtime alert.
  • Per user rate limit hits: who are the top users hitting limits, and why?

Those four metrics go on the dashboard. Anomalies jump out fast.
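A minimal instrumentation sketch using prometheus_client (metric names are illustrative; the redis client and script come from the earlier examples):

from prometheus_client import Counter, Histogram
from redis.exceptions import RedisError

REDIS_LATENCY = Histogram('rate_limit_redis_seconds',
                          'Latency of the Redis rate limit call')
DECISIONS = Counter('rate_limit_decisions_total',
                    'Rate limit decisions', ['outcome'])

def check(user_id: str) -> bool:
    try:
        with REDIS_LATENCY.time():
            allowed = redis.eval(script, 1, f'rate:{user_id}', 10, 1)[0] == 1
        DECISIONS.labels('allow' if allowed else 'deny').inc()
        return allowed
    except RedisError:
        DECISIONS.labels('fail_open').inc()  # alert when this climbs
        return True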

Start simple, evolve

When a project is new:

Phase 1: in-memory rate limiter (single server). Simple. Fine for dev and early production.

Phase 2: Redis + Lua script. Multi-server. Production standard.

Phase 3: sliding window log or token bucket (if needed). Exact rate control.

Phase 4: hybrid local + global (very high throughput). Rare, specific use cases.

Expect each phase to evolve into the next one 6 to 12 months later. Jumping straight from Phase 1 to Phase 3 is premature optimization.

Takeaway

Distributed rate limiting is a shared state problem. Redis + Lua script solves 80% of cases. Sliding window log for exact accuracy. Consistent hashing for ultra-high throughput.

Decision criteria: required accuracy, throughput, ops complexity. Start simple (Redis + Lua), upgrade when you need to.

The fail-open vs fail-closed decision, monitoring discipline, and the multi-region question: those three details are what separate a production-ready rate limiter from a toy demo.
