
5 things I missed while scaling background job queues

Across Redis + BullMQ, RabbitMQ, and SQS, I've relearned the same 5 lessons about scaling job queues: poison messages, retry storms, priority queues, at-least-once semantics, and dead letter discipline.

Background job queues are the backbone of a modern web app. Email sends, report generation, third-party API calls, image processing: all of it goes onto a queue and gets picked up by workers. Easy to stand up, messy to scale.

I’ve used Redis (BullMQ), RabbitMQ, and AWS SQS across 6 different projects. Every time, I hit a new edge case. Here are the 5 things I missed and later fixed.

1. Isolate the poison message

My first big production incident was an email send job stuck in an infinite retry loop. The address was malformed, SMTP returned the same error every time, the queue dutifully did exponential backoff, and nothing ever gave up. Within 4 hours, 15,000 retries had piled up, Redis ran out of memory, workers couldn’t touch any other jobs, and the system ran at half capacity.

A poison message is one that will never succeed, no matter how many times it’s retried. These have to be detected and moved to a dead letter queue (DLQ):

  • Set a max retry count (usually 3 to 5)
  • Move the job to a separate queue once it exceeds retries
  • Monitor the DLQ separately (set an alarm), but don’t let it affect the main worker pool

BullMQ gives you an attempts option (e.g., attempts: 5) and configurable backoff. Once the attempts run out, the job lands in the failed state. On top of that, I layer a custom DLQ so the ops team can review the backlog.
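A minimal sketch of that setup with BullMQ (the queue names and the sendEmail handler are mine, and the DLQ move in the failed listener is the custom layer, not a BullMQ built-in):

import { Queue, Worker } from 'bullmq';

const connection = { host: 'localhost', port: 6379 };
const emailQueue = new Queue('email', { connection });
const emailDLQ = new Queue('email-dlq', { connection });

// Give up after 5 attempts, with exponential backoff between them.
await emailQueue.add('send', { to: 'user@example.com' }, {
    attempts: 5,
    backoff: { type: 'exponential', delay: 1000 },
});

const worker = new Worker('email', sendEmail, { connection });

// After the final attempt fails, copy the job to the DLQ for review.
worker.on('failed', async (job) => {
    if (job && job.attemptsMade >= (job.opts.attempts ?? 1)) {
        await emailDLQ.add(job.name, job.data, { jobId: job.id });
    }
});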

2. Avoid retry storms

When a downstream service goes down, dozens of workers retry their failed jobs at the same time. That retry storm keeps the downed service from recovering and floods the queue.

The fix: circuit breaker plus jittered backoff.

Circuit breaker: once the downstream error rate crosses a threshold, the worker fails fast instead of retrying. After a cooldown (say, 30 seconds) the breaker goes half-open and lets a test request through. If that succeeds, the circuit closes again.
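A hand-rolled sketch of that state machine (the threshold and cooldown numbers are illustrative, and this one counts consecutive failures rather than a true error rate; libraries like opossum package the same idea):

class CircuitBreaker {
    constructor({ threshold = 5, cooldownMs = 30000 } = {}) {
        this.failures = 0;
        this.threshold = threshold;
        this.cooldownMs = cooldownMs;
        this.openedAt = null; // null means the circuit is closed
    }

    async call(fn) {
        if (this.openedAt !== null) {
            // Open and still cooling down: fail fast, don't hammer the service.
            if (Date.now() - this.openedAt < this.cooldownMs) {
                throw new Error('circuit open');
            }
            // Cooldown elapsed: half-open, let this call through as a probe.
        }
        try {
            const result = await fn();
            this.failures = 0;    // a healthy response closes the circuit
            this.openedAt = null;
            return result;
        } catch (err) {
            this.failures += 1;
            if (this.failures >= this.threshold) this.openedAt = Date.now();
            throw err;
        }
    }
}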

Jittered backoff: add random jitter to the retry interval (e.g., 100ms plus or minus 50ms) so workers don’t all retry on the same clock tick, which defuses the thundering herd problem:

// Exponential backoff capped at 30 seconds, plus up to 500ms of random jitter
const delay = Math.min(30000, 1000 * Math.pow(2, attempt)) + Math.random() * 500;

3. Priority queues are mandatory: one queue isn’t enough

On my first few projects I used a single queue and tossed every job into the same pool. A critical job (the user’s 2FA SMS) ended up waiting behind a non-critical one (a weekly report email).

Priority queues are non-negotiable:

  • High priority: user-facing, latency-sensitive (SMS, login email, payment confirmation)
  • Normal priority: general work (notifications, webhooks)
  • Low priority: background (reporting, cleanup, analytics rollups)

BullMQ lets you prioritize within a single queue via the priority option. RabbitMQ supports priorities natively if you declare the queue with x-max-priority. SQS has no priority concept at all, so you run separate queues with a dedicated worker pool per level.

Worker pools have to be split by priority too. The high-priority queue needs its own workers so low-priority jobs can’t starve it.
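Roughly what both patterns look like in BullMQ (processJob is a placeholder handler, and in BullMQ a lower priority number runs first):

import { Queue, Worker } from 'bullmq';

const connection = { host: 'localhost', port: 6379 };

// Pattern A: one queue, the priority option (lower number = higher priority).
const jobs = new Queue('jobs', { connection });
await jobs.add('2fa-sms', { userId: 42 }, { priority: 1 });
await jobs.add('weekly-report', { userId: 42 }, { priority: 10 });

// Pattern B: separate queues with dedicated worker pools, so low-priority
// work can never starve the high-priority pool (the only option on SQS).
const highWorkers = new Worker('jobs-high', processJob, { connection, concurrency: 10 });
const lowWorkers = new Worker('jobs-low', processJob, { connection, concurrency: 2 });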

4. At-least-once vs exactly-once

Queue systems generally guarantee at-least-once delivery: a message may be processed once or more than once. Most application code, though, quietly assumes exactly-once.

Example bug: a credit card charge job ran twice under at-least-once delivery, and the user got charged twice. The feedback was devastating.

The fix is idempotency. Every job is processed with a unique idempotency key, and the first thing the handler does is check “has this key already been processed?”. If yes, it’s a no-op.

import Redis from 'ioredis'; // client assumed here; any Redis client with get/setex works
const redis = new Redis();

async function processPayment(job) {
    // One key per logical charge; a redelivered job builds the same key.
    const key = `payment:${job.data.orderId}:${job.data.attemptId}`;

    // Seen this key before? The work already happened; return the cached result.
    const already = await redis.get(key);
    if (already) return JSON.parse(already);

    const result = await chargeCard(job.data);

    // Remember the result for 24 hours so redeliveries become no-ops.
    await redis.setex(key, 86400, JSON.stringify(result));
    return result;
}

The idempotency key strategy has to be designed in on day one. Retrofitting works, but threading a key through already-decoupled code is painful.
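Designing it in can be as small as minting the key’s components at enqueue time (attemptId mirrors the handler above; paymentQueue and order are placeholders):

import { randomUUID } from 'node:crypto';

// The producer mints attemptId exactly once. Every redelivery of this job
// carries the same id, so the handler's key check stays reliable.
await paymentQueue.add('charge', {
    orderId: order.id,
    attemptId: randomUUID(),
});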

5. Observability: queue depth and processing time

A queue you can’t see into is a queue you can’t operate. At a minimum, you need:

  • Queue depth (per queue): how many jobs are waiting
  • Processing time p50 / p95 / p99: how long jobs take
  • Failure rate: percentage of jobs that fail
  • Age of oldest job: how long the oldest job has been sitting in the queue

Alarm thresholds:

  • Queue depth greater than 10,000 (or whatever limit fits your workload): scale workers up
  • p95 processing time over threshold: there’s a worker performance issue
  • Failure rate over 5%: downstream or code problem
  • Oldest job age over 10 minutes (on the high-priority queue): the worker pool is undersized

Arena (for BullMQ), the RabbitMQ Management UI, and SQS’s CloudWatch metrics expose most of this out of the box. But building dashboards and alarms on top of them is still extra work.
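For BullMQ, a minimal poller over getJobCounts covers queue depth and failure counts; shipping the numbers somewhere is the part you still own (recordMetric is a placeholder for your metrics client):

import { Queue } from 'bullmq';

const queue = new Queue('jobs', { connection: { host: 'localhost', port: 6379 } });

setInterval(async () => {
    // waiting ≈ queue depth; failed feeds the failure-rate alarm
    const counts = await queue.getJobCounts('waiting', 'active', 'failed', 'delayed');
    recordMetric('queue.depth', counts.waiting);
    recordMetric('queue.failed', counts.failed);
}, 15000);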

Bonus: graceful shutdown

What should happen to half-processed jobs when a worker is being deployed? On SIGTERM the worker should:

  1. Stop accepting new jobs
  2. Try to finish the job it’s currently working on (with a timeout)
  3. If it can’t finish, requeue the job and exit

Without this pattern, every deploy leaves half-finished work behind. BullMQ workers support this via close(), which drains the active job before exiting; RabbitMQ needs prefetch = 1 plus manual ack discipline.
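With BullMQ, the whole pattern hangs off worker.close(), which stops fetching new jobs and waits for the in-flight one; the hard-cap timer is my addition:

process.on('SIGTERM', async () => {
    // Hard cap so a stuck job can't block the deploy forever.
    const timer = setTimeout(() => process.exit(1), 30000);
    await worker.close(); // stop fetching new jobs, drain the active one
    clearTimeout(timer);
    process.exit(0);
});

If the hard cap ever fires, BullMQ’s stalled-job check should eventually hand the unfinished job back to another worker.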

Stability before performance, always

Scaling a queue out is easy: spin up 100 workers and throughput goes up. But stability comes first:

  • Poison messages go to the DLQ
  • Retry storms stop at the circuit breaker
  • Priority queues are split
  • Idempotency is assumed
  • Observability on day one

Without these 5 principles, the queue system is a bomb. Quiet while it runs, loud when it breaks.

The lesson after 6 projects: a queue system isn’t “set it and forget it”; it’s “set it, watch it, tune it”. Queue incidents are one of the most common outage causes for growing startups.
