
API rate limiting: token bucket vs sliding window, in practice

Rate limiting is how you stop API abuse and enforce fair use. Two popular algorithms and their practical implementations.

Every public API eventually needs rate limiting. Some user writes a script hitting you 1000 times a second, or a bot hammers an API key. Rate limiting stops the abuse and enforces fair use.

There are two popular algorithms: token bucket and sliding window. There’s a third classic, fixed window, but it’s not worth reaching for in practice. Here’s how they compare, with a pragmatic implementation.

Fixed window (simplest, and broken)

Fixed window is the simplest. “100 requests per hour.” You keep a counter, reset it when the hour rolls over.
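
In code it's a handful of lines. A minimal in-memory sketch (names illustrative):

const windows = new Map(); // `${userId}:${hourBucket}` -> request count

function allowFixedWindow(userId, limit) {
  const hour = Math.floor(Date.now() / 3600000); // current hour bucket
  const key = `${userId}:${hour}`;
  const count = (windows.get(key) || 0) + 1;
  windows.set(key, count);
  return count <= limit; // the counter implicitly "resets" when the hour bucket rolls over
}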

The problem: 2x traffic at the boundary. The user fires 100 requests at 10:59. At 11:00 they fire another 100. 200 requests in a minute, even though the rule was 100 per hour.

That’s wide open to burst abuse.

Which is why nobody uses it seriously. Use sliding window or token bucket instead.

Token bucket

You have a bucket that fills up with tokens at a fixed rate. Every request costs one token. Empty bucket, request rejected.

Parameters:
- capacity: max bucket size (for bursts)
- refillRate: tokens added per second

Algorithm:
1. Request arrives
2. Compute elapsed time since last refill
3. Add (elapsed * refillRate) tokens, capped at capacity
4. If the bucket holds at least 1 token, consume one and allow the request
5. Otherwise, reject

Example: capacity=100, refillRate=10 tokens/sec. That translates to “10 requests per second sustained, bursts up to 100 accepted”.
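
Before the Redis version, here's the same algorithm in-memory, as a minimal single-process sketch (names illustrative):

class TokenBucket {
  constructor(capacity, refillRate) {
    this.capacity = capacity;     // max tokens (burst size)
    this.refillRate = refillRate; // tokens added per second
    this.tokens = capacity;
    this.lastRefill = Date.now();
  }

  allow() {
    const now = Date.now();
    const elapsed = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.refillRate);
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

// capacity=100, refillRate=10: 10 req/sec sustained, bursts up to 100
const bucket = new TokenBucket(100, 10);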

Redis implementation:

KEY: user:123:bucket
VALUE (hash): {tokens: 100, lastRefill: 1710000000}

Lua script (atomic):
  -- KEYS[1] = user:123:bucket
  -- ARGV[1] = now (seconds), ARGV[2] = capacity, ARGV[3] = refillRate
  local state = redis.call('HMGET', KEYS[1], 'tokens', 'lastRefill')
  local now = tonumber(ARGV[1])
  local capacity = tonumber(ARGV[2])
  local refillRate = tonumber(ARGV[3])
  local tokens = tonumber(state[1]) or capacity
  local lastRefill = tonumber(state[2]) or now
  local elapsed = now - lastRefill
  local newTokens = math.min(capacity, tokens + elapsed * refillRate)
  local allowed = 0
  if newTokens >= 1 then
    newTokens = newTokens - 1
    allowed = 1
  end
  redis.call('HSET', KEYS[1], 'tokens', newTokens, 'lastRefill', now)
  return allowed
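
Calling it from Node might look like this, assuming an ioredis client (the client setup and helper names are mine, not part of the original):

const Redis = require("ioredis");
const redis = new Redis();

async function allowRequest(userId, capacity, refillRate) {
  const allowed = await redis.eval(
    tokenBucketScript,        // the Lua script above, loaded as a string
    1,                        // number of KEYS
    `user:${userId}:bucket`,  // KEYS[1]
    Date.now() / 1000,        // ARGV[1]: now, in seconds
    capacity,                 // ARGV[2]
    refillRate                // ARGV[3]
  );
  return allowed === 1;
}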

Upsides:
– Burst-tolerant (up to capacity)
– Memory efficient (just {tokens, timestamp} per user)
– Smooth rate limit, no boundary issue

Downsides:
– A bursty user gets an unfair advantage (once the bucket is drained, subsequent requests starve until it refills)
– Refill rate needs tuning

Sliding window

Sliding window tracks the request count over the last N seconds (or minutes). On each request it asks: how many requests landed inside the window ending right now?

Two variants:

Sliding window log: store every request timestamp. On request, count “how many logs in the last 60 seconds”.

KEY: user:123:requests (Redis sorted set)
MEMBERS: [timestamp1, timestamp2, ...]

On request:
1. now = current timestamp
2. Drop old timestamps: ZREMRANGEBYSCORE key -inf (now-60)
3. Count: ZCARD key
4. If count < limit: ZADD key now now, allow
5. Else: deny

For a 10 req/sec limit: use a one-second window and deny once ZCARD reaches 10.
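
A sketch of the log variant with the same ioredis client as above (note the count-then-add here is not atomic; wrapping it in a Lua script fixes that):

async function allowRequestLog(userId, limit, windowSeconds) {
  const key = `user:${userId}:requests`;
  const now = Date.now();
  const windowStart = now - windowSeconds * 1000;

  const results = await redis
    .multi()
    .zremrangebyscore(key, 0, windowStart) // drop timestamps outside the window
    .zcard(key)                            // count what's left
    .exec();
  const count = results[1][1]; // exec() returns [err, value] pairs

  if (count >= limit) return false;

  // The member must be unique; a bare timestamp collides within the same ms.
  await redis.zadd(key, now, `${now}:${Math.random()}`);
  await redis.expire(key, windowSeconds);
  return true;
}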

Pro: exact, fair.
Con: memory intensive, every request is logged.

Sliding window counter (approximate): proportionally blend two fixed windows. Thirty seconds into the current minute, the estimate is the current minute's count plus 50% of the previous minute's count.

Less memory, close to exact. Plenty good in practice.

currentMinuteCount + (previousMinuteCount * (60 - secondsIntoCurrentMinute) / 60)

That formula gives you a “last 60 seconds” approximation. You only hold two counters.
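
As an in-memory sketch, two counters per user, keyed by minute (names illustrative):

const counters = new Map(); // `${userId}:${minuteBucket}` -> count

function allowRequestCounter(userId, limit) {
  const now = Date.now();
  const minute = Math.floor(now / 60000);
  const secondsIntoMinute = (now % 60000) / 1000;

  const current = counters.get(`${userId}:${minute}`) || 0;
  const previous = counters.get(`${userId}:${minute - 1}`) || 0;

  // Weight the previous minute by how much of it still overlaps the window.
  const estimated = current + previous * ((60 - secondsIntoMinute) / 60);
  if (estimated >= limit) return false;

  counters.set(`${userId}:${minute}`, current + 1);
  return true;
}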

Which one should you pick?

Token bucket:
– Public APIs (GitHub, Stripe, Twitter)
– You want to allow bursts (“the user spikes occasionally but overall is fine”)
– Memory-constrained environments

Sliding window log:
– Exact rate tracking is critical
– You need request history for audit or debug
– Small scale (up to roughly 10K users)

Sliding window counter:
– Large scale, memory matters
– Approximation is acceptable (real vs computed rate can differ by ~5%)
– A good default for a standard web API

My default is sliding window counter. Memory efficient, fair, production-ready.

Tiered rate limiting

You’ll want different limits per tier. Free gets one thing, pro another:

async function checkRateLimit(userId, endpoint) {
    const user = await getUserTier(userId);
    const limit = rateLimits[user.tier][endpoint];
    // limit shape: {requests: 100, per: 3600} = 100 requests per hour
    return await rateLimiter.check(userId, endpoint, limit);
}

Tiers:
– Free: 100 req/hour per user
– Pro: 1000 req/hour per user
– Enterprise: 10000 req/hour per user plus a higher burst
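
One possible shape for the rateLimits object used above, mirroring these tiers (endpoint names and burst values illustrative):

const rateLimits = {
  free:       { "/api/search": { requests: 100,   per: 3600 } },
  pro:        { "/api/search": { requests: 1000,  per: 3600 } },
  enterprise: { "/api/search": { requests: 10000, per: 3600, burst: 500 } },
};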

Response headers

Tell the client about the rate limit. The de facto standard headers:

X-RateLimit-Limit: 100          # per window
X-RateLimit-Remaining: 42       # how many are left
X-RateLimit-Reset: 1710000060   # when it resets (Unix timestamp)
Retry-After: 30                 # (with 429) how many seconds to wait

When client developers see these headers they can tune their own rate-limiting logic. It saves everyone time.
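
Setting them on the server side might look like this (an Express-style res.set is assumed; the result object comes from whatever limiter you use):

function setRateLimitHeaders(res, result) {
  res.set("X-RateLimit-Limit", String(result.limit));
  res.set("X-RateLimit-Remaining", String(result.remaining));
  res.set("X-RateLimit-Reset", String(result.resetAt)); // Unix timestamp
  if (result.remaining === 0) {
    res.set("Retry-After", String(result.retryAfterSeconds));
  }
}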

The 429 response

When the rate limit is exceeded, return 429 Too Many Requests:

HTTP/1.1 429 Too Many Requests
Retry-After: 30
X-RateLimit-Limit: 100
X-RateLimit-Reset: 1710000060
Content-Type: application/json

{
  "error": "rate_limit_exceeded",
  "message": "Too many requests. Please try again in 30 seconds.",
  "retry_after": 30
}

On seeing this, the client should retry with exponential backoff.
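
A client-side sketch that honors Retry-After and falls back to exponential backoff (assumes a fetch-capable runtime, Node 18+ or a browser):

async function fetchWithBackoff(url, maxRetries = 5) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const res = await fetch(url);
    if (res.status !== 429) return res;

    const retryAfter = Number(res.headers.get("Retry-After"));
    const delayMs = retryAfter
      ? retryAfter * 1000
      : Math.min(60000, 1000 * 2 ** attempt); // 1s, 2s, 4s, ... capped at 60s
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  throw new Error("rate limited: retries exhausted");
}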

Distributed rate limiting

If your API runs on 5 instances and each keeps its own in-memory limiter, a user effectively gets 5x their limit.

For shared state:

Redis: enough for most setups. Low latency, atomic operations via Lua scripts.

Memcached: a touch faster than Redis but not persistent.

In-memory sharing (consistent hashing): each user’s requests always hit the same instance via sticky routing on the load balancer. Complex, but low latency.

My default is a Redis Lua script. Atomic check-and-decrement in a single operation, no race conditions.

Abuse patterns

When you’re rate-limiting, watch for:

The weakness of IP-based rate limiting: multiple users sit behind NAT on the same IP. A 100-person company office hits 5x the limit from a single address.

The weakness of user-based rate limiting: bots create multiple accounts. Per-user limits can be bypassed by farming accounts.

Endpoint-based rate limiting: expensive endpoints (large searches, batch operations) need their own separate limits. A blanket "1000 per hour" across the whole API isn't enough on its own.

Best practice: layer multiple rate limits:
– Per IP: looser (for NAT)
– Per user: standard
– Per endpoint: especially expensive ones

Any one of them tripping returns 429.
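
Wiring the layers together might look like this, reusing the hypothetical rateLimiter.check from the tiered example (keys and numbers illustrative):

async function checkAllLimits(req, userId) {
  const results = await Promise.all([
    rateLimiter.check(`ip:${req.ip}`, "global", { requests: 5000, per: 3600 }),   // per IP: looser
    rateLimiter.check(`user:${userId}`, "global", { requests: 1000, per: 3600 }), // per user: standard
    rateLimiter.check(`user:${userId}`, req.path, { requests: 100, per: 3600 }),  // per endpoint
  ]);
  return results.every(Boolean); // any false -> respond with 429
}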

Monitoring

To verify rate limiting is configured right:

  • 429 rate: what percentage of total requests are 429s. Too high and your limits are too aggressive, too low and they’re not doing their job.
  • Per-user usage distribution: what’s the top 1% of users consuming?
  • Spike detection: a single user suddenly pushes 10x their normal traffic, that’s suspicious.
  • Abuse patterns: attempts to bypass the limit (multiple accounts, IP rotation).

Takeaway

Rate limiting is something your API needs before it ships. Token bucket and sliding window counter are the pragmatic picks. A Redis-backed distributed implementation is enough for almost any scale.

Tiered limits (free vs pro), standard response headers, and a proper 429 keep your API ecosystem healthy. Add monitoring, watch for abuse patterns, adjust limits as needed.

A simple counter is enough for v1. You can get sophisticated as the scale grows.
