Retry and idempotency for long-running jobs: the strategy that holds up

I once built an invoice automation system. It sent a single invoice to each user once a month. One day a customer called: two invoices that month, two charges on the card. I went to the logs and saw it: a retry chain had misfired and the same job had run twice. That was the day I learned my biggest lesson about idempotency.

Retry and idempotency have to travel together. If one exists, the other has to exist. Otherwise you aren’t solving a problem, you’re creating a new one.

When is retry actually needed?

Transient network errors
Short outages
Rate limit responses
Database deadlocks
Timeouts (dangerous, because the other side may have succeeded)

Simple rule: retry when you know for sure the call failed. When the outcome is ambiguous, when you’re thinking “maybe it went through”, don’t retry without idempotency.

What is idempotency?

Calling the same operation n times produces the same result as calling it once. Mathematically, f(f(x)) = f(x). In practice: fire the same “send invoice” request five times and only one invoice exists.
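
To make the definition concrete, here is a minimal sketch that contrasts a non-idempotent and an idempotent "send invoice". The function names and the in-memory stores are hypothetical, standing in for a real billing backend.

```python
def send_invoice_naive(sent: list, user_id: str, period: str) -> None:
    # Not idempotent: every call adds another invoice.
    sent.append({"user_id": user_id, "period": period})


def send_invoice(sent: dict, user_id: str, period: str) -> dict:
    # Idempotent: the (user, period) pair is the identity of the invoice,
    # and setdefault only inserts when that identity is not present yet.
    return sent.setdefault((user_id, period), {"user_id": user_id, "period": period})


naive_store: list = []
invoices: dict = {}
for _ in range(5):
    send_invoice_naive(naive_store, "user-42", "2026-02")
    send_invoice(invoices, "user-42", "2026-02")

print(len(naive_store), len(invoices))  # 5 1
```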

How do you design for it?

The most common approach is an idempotency key. The client generates a unique ID per logical operation and sends the same ID on every retry of that operation. If the server has already processed that ID, it returns the same response. Stripe’s model is the well-known one: a request that comes in with an Idempotency-Key header gets the original response replayed for that key for about 24 hours.

Implementation outline:

Request comes in; read the idempotency_key header.
Take a lock on that key (if two requests arrive at the same time, the second one waits).
Look for a record in the database. If it exists, return the stored response.
If not, run the operation.
Save the result, release the lock.

The lock is critical. Without it, if two requests race, both see “not processed before” and the operation runs twice.
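
A minimal in-process sketch of that flow, with the lookup and the operation under the same lock. The class name, the dict that stands in for the database table and the threading.Lock that stands in for a row-level or distributed lock are all assumptions made for illustration.

```python
import threading
from typing import Callable, Dict


class IdempotentHandler:
    def __init__(self) -> None:
        self._responses: Dict[str, dict] = {}  # stand-in for the database table
        self._lock = threading.Lock()          # stand-in for a real lock service

    def handle(self, idempotency_key: str, operation: Callable[[], dict]) -> dict:
        # Lookup and execution share one critical section, so two racing
        # requests can never both observe "not processed before".
        with self._lock:
            if idempotency_key in self._responses:
                return self._responses[idempotency_key]
            response = operation()
            self._responses[idempotency_key] = response
            return response


calls = []


def charge() -> dict:
    calls.append(1)
    return {"status": "charged", "attempt": len(calls)}


handler = IdempotentHandler()
threads = [
    threading.Thread(target=handler.handle, args=("key-123", charge))
    for _ in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(calls))  # 1
```

A single global lock serializes every key, which keeps the sketch short. A real service would more likely lock per key.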

That was exactly my mistake on the invoice system. I had the idempotency key but no lock. Cron fired twice, two jobs ran in parallel, both saw “not sent yet”, both sent.

Retry for long jobs

For long-running work, exponential backoff is the right shape. Wait 1 second, then 2, then 4, then 8. Cap it (say 60 seconds). Add jitter (say a random plus or minus 30 percent of the delay) so that when a service comes back, every client doesn’t hit at the exact same instant.
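
As a sketch of the delay calculation: the function below doubles the wait per attempt, caps it at 60 seconds and applies jitter. Making the jitter proportional to the delay (30 percent either way) is an assumption of this sketch, as are the function and parameter names.

```python
import random


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  jitter: float = 0.3, rng=random) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    delay = min(cap, base * (2 ** attempt))
    # Scale the delay by a random factor in [1 - jitter, 1 + jitter].
    return delay * (1 + rng.uniform(-jitter, jitter))


# With jitter switched off, the underlying schedule is visible.
schedule = [backoff_delay(n, jitter=0.0) for n in range(8)]
print(schedule)  # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
```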

Set a retry budget. Unlimited retry creates a flood, bounded retry gives you a guarantee. Three to five attempts is enough for most cases.
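
A bounded retry loop might look like the sketch below. TransientError is a hypothetical marker for failures that are known to have failed, and the sleep function is injectable so the example runs without waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures that are safe to retry."""


def retry(operation: Callable[[], T], max_attempts: int = 4,
          sleep: Callable[[float], None] = time.sleep) -> T:
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # budget exhausted: surface the failure to the caller
            sleep(min(60.0, 2.0 ** attempt))
    raise AssertionError("unreachable")


attempts = []
waits = []


def flaky() -> str:
    attempts.append(1)
    if len(attempts) < 3:
        raise TransientError("connection reset")
    return "ok"


result = retry(flaky, sleep=waits.append)
print(result, len(attempts), waits)  # ok 3 [1.0, 2.0]
```

Anything that is not a TransientError propagates immediately, which matches the rule of retrying only when the failure is known.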

Track the retry count and the original request time in headers:
– X-Retry-Count: 3
– X-Original-Request-At: 2026-02-16T10:00:00Z

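This is critical for traceability: with these in the logs you can tell a first attempt from a third, and tie every retry back to the request that started the chain.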
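
A small sketch of building those two headers on the client side. The helper name is hypothetical, and the timestamp format simply mirrors the example above.

```python
from datetime import datetime, timezone


def retry_headers(retry_count: int, original_request_at: datetime) -> dict:
    # Normalize to UTC so every service in the chain logs the same instant.
    stamp = original_request_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "X-Retry-Count": str(retry_count),
        "X-Original-Request-At": stamp,
    }


first_sent = datetime(2026, 2, 16, 10, 0, 0, tzinfo=timezone.utc)
headers = retry_headers(3, first_sent)
print(headers)
```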

Queue-based jobs

If you use a job queue (Sidekiq, Laravel Queue, BullMQ), the queue has its own retry. But the queue will occasionally redeliver a job because it can’t tell whether the job succeeded: if a worker dies before acknowledging, the job comes back. Your worker has to be fully idempotent.

Especially on at-least-once queues (SQS standard queues, RabbitMQ with manual acknowledgements), duplicate delivery is something to expect, not a hypothetical. You need idempotency by design.
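
A minimal sketch of a worker that tolerates duplicate delivery by remembering which job IDs it has already handled. The class, the job shape and the in-memory set are hypothetical. In a real system the "already processed" record and the side effect would need to be committed together.

```python
from typing import List, Set


class InvoiceWorker:
    def __init__(self) -> None:
        self.processed: Set[str] = set()  # stand-in for a durable dedup table
        self.sent: List[str] = []         # stand-in for the real side effect

    def handle(self, job: dict) -> bool:
        """Return True if work was done, False if the delivery was a duplicate."""
        if job["id"] in self.processed:
            return False  # acknowledge the redelivery and do nothing
        self.sent.append(job["user_id"])
        self.processed.add(job["id"])
        return True


worker = InvoiceWorker()
deliveries = [
    {"id": "job-1", "user_id": "user-42"},
    {"id": "job-1", "user_id": "user-42"},  # redelivered after a lost ack
    {"id": "job-2", "user_id": "user-43"},
]
results = [worker.handle(job) for job in deliveries]
print(results, worker.sent)  # [True, False, True] ['user-42', 'user-43']
```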

Fallback plan

What happens when the retry chain runs out? Use a dead letter queue (DLQ). The job moves to the DLQ, where it waits for manual review. Log, alert, page on-call.
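
The hand-off to a DLQ can be sketched with plain in-memory queues. The drain function, the job shape and the attempt limit of three are assumptions for illustration, not the behavior of any particular queue product.

```python
from collections import deque
from typing import Callable, List


def drain(queue: deque, handler: Callable[[str], None], max_attempts: int = 3) -> List[dict]:
    """Process jobs until the queue is empty; return the dead-lettered ones."""
    dead_letters: List[dict] = []
    while queue:
        job = queue.popleft()
        try:
            handler(job["payload"])
        except Exception as exc:
            job["attempts"] += 1
            job["last_error"] = repr(exc)  # keep the reason for manual review
            if job["attempts"] >= max_attempts:
                dead_letters.append(job)   # retry chain exhausted
            else:
                queue.append(job)          # back of the queue for another try
    return dead_letters


def handler(payload: str) -> None:
    if payload == "poison":
        raise ValueError("cannot parse payload")


jobs = deque([
    {"payload": "ok", "attempts": 0},
    {"payload": "poison", "attempts": 0},
])
dead = drain(jobs, handler)
print(len(dead), dead[0]["attempts"])  # 1 3
```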

Database-level idempotency

For anything involving money, the database needs its own guarantee. Unique constraints are non-negotiable. For example, a unique index on (idempotency_key, user_id) in the transactions table. If the same operation gets inserted twice, the second insert violates the constraint, and the retry can hand the existing record back to its caller instead of charging again.
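
Here is a sketch of that constraint using SQLite from the Python standard library. The table layout and the record_charge helper are hypothetical. The point is only that the second insert fails and the caller gets the original row back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE transactions (
        id INTEGER PRIMARY KEY,
        user_id TEXT NOT NULL,
        idempotency_key TEXT NOT NULL,
        amount_cents INTEGER NOT NULL,
        UNIQUE (idempotency_key, user_id)
    )
    """
)


def record_charge(user_id: str, key: str, amount_cents: int) -> tuple:
    """Return (row_id, created). A duplicate returns the existing row."""
    try:
        with conn:  # commits on success, rolls back on error
            cur = conn.execute(
                "INSERT INTO transactions (user_id, idempotency_key, amount_cents) "
                "VALUES (?, ?, ?)",
                (user_id, key, amount_cents),
            )
        return cur.lastrowid, True
    except sqlite3.IntegrityError:
        # The unique constraint fired: this operation was already recorded.
        row = conn.execute(
            "SELECT id FROM transactions WHERE idempotency_key = ? AND user_id = ?",
            (key, user_id),
        ).fetchone()
        return row[0], False


first = record_charge("user-42", "key-123", 1999)
second = record_charge("user-42", "key-123", 1999)
count = conn.execute("SELECT COUNT(*) FROM transactions").fetchone()[0]
print(first, second, count)  # (1, True) (1, False) 1
```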

Completion guarantees for async work

If you tell a user “it worked”, the work has to actually be done. Use polling or webhooks for long work. Report “finished” only once the work has completed; never let “started” pass for “it worked”.
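
A minimal sketch of the polling side: submitting returns a job ID with a "pending" status, and "finished" only appears once the worker records a result. The JobStore class and the status names are hypothetical.

```python
import uuid
from typing import Any, Dict


class JobStore:
    def __init__(self) -> None:
        self.jobs: Dict[str, Dict[str, Any]] = {}

    def submit(self, payload: Any) -> str:
        # The caller gets an ID to poll, not a promise that the work is done.
        job_id = str(uuid.uuid4())
        self.jobs[job_id] = {"status": "pending", "payload": payload, "result": None}
        return job_id

    def complete(self, job_id: str, result: Any) -> None:
        # Called by the worker only after the work has actually completed.
        self.jobs[job_id].update(status="finished", result=result)

    def status(self, job_id: str) -> str:
        return self.jobs[job_id]["status"]


store = JobStore()
job_id = store.submit({"user_id": "user-42", "period": "2026-02"})
before = store.status(job_id)
store.complete(job_id, {"invoice_id": "inv-1"})
after = store.status(job_id)
print(before, after)  # pending finished
```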

After the duplicate charges on my invoice system, here’s what I put in:

Unique constraint on the invoice table: (user_id, billing_period_start, billing_period_end). Any second insert triggers the constraint.
Distributed lock instead of raw cron. Before the job starts, it takes a Redis SETNX lock. If it can’t, it skips with “already running”.
Job log table. Every firing leaves a trace, and we can analyse it.
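
The lock step can be sketched without a Redis server by using an in-memory stand-in. InMemoryLockStore and run_billing_job are hypothetical names, and set_if_absent only imitates the "set if not exists, with an expiry" behavior. With redis-py the corresponding call would be roughly `client.set(key, value, nx=True, ex=ttl)`. The expiry is an assumption added here so that a crashed worker cannot hold the lock forever.

```python
import time
from typing import Callable, Dict, Tuple


class InMemoryLockStore:
    """Single-process stand-in for Redis; not safe across threads or hosts."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._entries: Dict[str, Tuple[str, float]] = {}

    def set_if_absent(self, key: str, value: str, ttl_seconds: float) -> bool:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            return False  # someone else holds an unexpired lock
        self._entries[key] = (value, now + ttl_seconds)
        return True

    def delete_if_owner(self, key: str, value: str) -> None:
        entry = self._entries.get(key)
        if entry is not None and entry[0] == value:
            del self._entries[key]


def run_billing_job(store: InMemoryLockStore, period: str, owner: str,
                    job: Callable[[], None]) -> str:
    lock_key = f"lock:billing:{period}"
    if not store.set_if_absent(lock_key, owner, ttl_seconds=600):
        return "skipped: already running"
    try:
        job()
        return "ran"
    finally:
        store.delete_if_owner(lock_key, owner)


store = InMemoryLockStore()
outcomes = []


def first_job() -> None:
    # A second cron firing arrives while the first run is still in progress.
    outcomes.append(run_billing_job(store, "2026-02", "worker-b", lambda: None))


outcomes.append(run_billing_job(store, "2026-02", "worker-a", first_job))
print(outcomes)  # ['skipped: already running', 'ran']
```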

Since then, zero duplicate invoices. But the lesson stuck: in retry design, idempotency is step one and retry is step two. Not the other way around.