
Six design principles behind batch pipelines that actually survive production

Shared patterns from three projects where I built reporting, data migration, and nightly aggregation pipelines. Chunking, checkpoints, retry discipline, idempotency.

Batch processing is one of web engineering’s overlooked corners. Real-time APIs get the attention; batch jobs are the tired uncle of the backend. But batch work is the backbone of most systems: nightly reports, aggregate table refreshes, data migrations, bulk email, invoice generation.

Over the past two years I designed batch pipelines for three very different projects:
1. Nightly user report generation for a SaaS (500K users)
2. Year-end invoice batch for an e-commerce platform (100K orders)
3. Legacy DB to new schema migration for a fintech (12M rows)

Six design principles showed up in all three.

1. Chunking: atomicity guarantees

First and foundational: break the big job into small chunks. Instead of processing 500K users in one transaction, run them in chunks of 1000.

Choosing chunk size:
– Small enough to hold in memory (stay under 100MB)
– Small enough that a single transaction can roll back cleanly
– Large enough that progress tracking is meaningful (1 chunk per ms makes tracking impossible)

Sweet spot is usually 500 to 5000. I prefer 1000 for user-level jobs, 5000 for row-level jobs.

def process_in_chunks(query, chunk_size=1000):
    # Offset pagination: simple, but each page scan restarts from row 0.
    offset = 0
    while True:
        chunk = db.fetch(query, limit=chunk_size, offset=offset)
        if not chunk:  # no rows left
            break
        process_chunk(chunk)
        offset += chunk_size

This is naive but it’s the starting point.

2. Checkpoint: restartability

With the naive approach, if the pipeline crashes on record 300,000 you start over. A three-hour job from scratch. Checkpointing fixes this: after each chunk, the pipeline persists a “here’s where I am” marker.

The checkpoint table is simple:

CREATE TABLE batch_checkpoint (
    pipeline_name VARCHAR(255),
    last_processed_id BIGINT,
    updated_at TIMESTAMP,
    PRIMARY KEY (pipeline_name)
);

Update after every chunk. On restart, the pipeline picks up from the checkpoint.

def process_with_checkpoint(pipeline_name):
    last_id = get_checkpoint(pipeline_name)
    while True:
        chunk = db.fetch("SELECT * FROM users WHERE id > %s ORDER BY id LIMIT 1000", last_id)
        if not chunk:
            break
        process_chunk(chunk)
        last_id = chunk[-1].id
        save_checkpoint(pipeline_name, last_id)
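The get_checkpoint and save_checkpoint helpers can be as small as a select and an upsert against the batch_checkpoint table. A minimal sketch using sqlite3 so it runs standalone — in production you’d use your real DB driver, but the shape is the same:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE batch_checkpoint (
    pipeline_name TEXT PRIMARY KEY,
    last_processed_id INTEGER,
    updated_at TEXT DEFAULT CURRENT_TIMESTAMP)""")

def get_checkpoint(pipeline_name, default=0):
    # Last persisted position, or `default` on a fresh run.
    row = db.execute(
        "SELECT last_processed_id FROM batch_checkpoint WHERE pipeline_name = ?",
        (pipeline_name,)).fetchone()
    return row[0] if row else default

def save_checkpoint(pipeline_name, last_id):
    # Upsert: one row per pipeline, overwritten after every chunk.
    db.execute(
        """INSERT INTO batch_checkpoint (pipeline_name, last_processed_id)
           VALUES (?, ?)
           ON CONFLICT(pipeline_name)
           DO UPDATE SET last_processed_id = excluded.last_processed_id""",
        (pipeline_name, last_id))
    db.commit()
```

The primary key on pipeline_name is what makes the upsert safe: there is exactly one checkpoint row per pipeline, no matter how often the job restarts.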

Important: use cursor-based pagination (id > last_id) instead of offset pagination. Offset gets slower as the offset grows; a cursor seek on an indexed id stays O(log n) per chunk.

3. Idempotency: safe to reprocess a chunk

A crash can happen after a chunk is processed but before its checkpoint is saved. On restart, that chunk runs twice. Idempotency is non-negotiable.

Idempotency patterns:

Use upserts. INSERT ... ON DUPLICATE KEY UPDATE or MERGE. The same record never gets inserted twice.

Deduplication via unique keys. If you generate IDs during processing, make them deterministic. Python’s uuid.uuid5(NAMESPACE, source_id) is deterministic: the same inputs always produce the same ID.
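For example, uuid5 hashes a fixed namespace plus a name, so the same source row always maps to the same ID, and a unique constraint dedupes any reprocessed insert (the namespace here is arbitrary — it just has to stay constant across runs):

```python
import uuid

# Any fixed UUID works as the namespace; NAMESPACE_DNS is a stdlib constant.
NAMESPACE = uuid.NAMESPACE_DNS

def report_id(source_id):
    # Deterministic: rerunning a chunk regenerates the identical ID.
    return uuid.uuid5(NAMESPACE, str(source_id))
```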

Idempotency key table. Record “this item has been processed”; on restart, skip anything already recorded.

CREATE TABLE batch_processed (
    pipeline_name VARCHAR(255),
    item_id BIGINT,
    processed_at TIMESTAMP,
    PRIMARY KEY (pipeline_name, item_id)
);

At the start of each chunk, query “which of these items have I already processed?” and skip the ones that match.
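That pre-chunk filter can look like the sketch below, assuming the batch_processed table above (sqlite3 is used here only to keep the example runnable; the helper names are illustrative):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE batch_processed (
    pipeline_name TEXT,
    item_id INTEGER,
    processed_at TEXT DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (pipeline_name, item_id))""")

def filter_unprocessed(pipeline_name, item_ids):
    # One query per chunk: which of these IDs are already marked done?
    if not item_ids:
        return []
    placeholders = ",".join("?" * len(item_ids))
    done = {row[0] for row in db.execute(
        f"SELECT item_id FROM batch_processed "
        f"WHERE pipeline_name = ? AND item_id IN ({placeholders})",
        (pipeline_name, *item_ids))}
    return [i for i in item_ids if i not in done]

def mark_processed(pipeline_name, item_ids):
    # INSERT OR IGNORE keeps the bookkeeping itself idempotent.
    db.executemany(
        "INSERT OR IGNORE INTO batch_processed (pipeline_name, item_id) VALUES (?, ?)",
        [(pipeline_name, i) for i in item_ids])
    db.commit()
```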

4. Retry with dead letter

Some chunks will fail: DB timeout, external API rate limit, corrupted data. Two strategies together:

Retryable errors: retry with exponential backoff. Cap at 5 attempts.

Non-retryable errors (data corruption, schema mismatch): send to a dead letter table and let the pipeline keep moving.

import time

def process_chunk_with_retry(chunk, max_attempts=5):
    for attempt in range(max_attempts):
        try:
            process_chunk(chunk)
            return
        except RetryableError:
            if attempt == max_attempts - 1:
                break  # attempts exhausted; fall through to dead letter
            time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, 4s, 8s
        except NonRetryableError as e:
            save_to_dead_letter(chunk, str(e))
            return
    save_to_dead_letter(chunk, "max_retries_exhausted")
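save_to_dead_letter itself can be a plain insert into a table that keeps the failed payload and the reason for later inspection; a minimal sketch (table name and columns are illustrative, sqlite3 again just to keep it self-contained):

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE batch_dead_letter (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    payload TEXT,   -- the failed chunk, serialized for manual review
    reason TEXT,
    failed_at TEXT DEFAULT CURRENT_TIMESTAMP)""")

def save_to_dead_letter(chunk, reason):
    db.execute(
        "INSERT INTO batch_dead_letter (payload, reason) VALUES (?, ?)",
        (json.dumps(chunk), reason))
    db.commit()
```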

Review the dead letter weekly. If it’s growing, something in the pipeline is broken.

5. Observability: progress and ETA

The single most asked question during a long-running batch: “When will it be done?” Without progress tracking and ETA prediction, stakeholder communication falls apart.

Log after each chunk:

print(f"Processed {processed}/{total} ({processed/total*100:.1f}%) - ETA: {eta}")

ETA math is naive: elapsed time / processed = seconds per item. Remaining items times seconds per item is your ETA.
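That arithmetic in code — a naive linear projection, using time.monotonic() so wall-clock adjustments don’t skew it:

```python
import time

def eta_seconds(started_at, processed, total):
    # elapsed / processed = seconds per item; remaining * that = ETA.
    if processed == 0:
        return None  # no data yet, no estimate
    elapsed = time.monotonic() - started_at
    per_item = elapsed / processed
    return (total - processed) * per_item
```

It’s wrong whenever throughput varies (cold caches, slow tail chunks), but for stakeholder communication a rough number beats silence.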

In production, push metrics to Prometheus or Datadog:
batch_items_processed_total (counter)
batch_items_total (gauge)
batch_chunk_duration_seconds (histogram)

A progress bar on a Grafana dashboard is a stress reducer for the whole team.
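With the official prometheus_client package, those three metrics are a few lines. A private registry is used here only to keep the sketch self-contained; a real job would use the default registry and expose it via /metrics or a push gateway:

```python
from prometheus_client import CollectorRegistry, Counter, Gauge, Histogram

registry = CollectorRegistry()

items_processed = Counter(
    "batch_items_processed", "Items processed so far", registry=registry)
items_total = Gauge(
    "batch_items_total", "Total items in this run", registry=registry)
chunk_duration = Histogram(
    "batch_chunk_duration_seconds", "Per-chunk wall time", registry=registry)

# After each chunk:
items_total.set(500_000)
items_processed.inc(1000)     # exported as batch_items_processed_total
chunk_duration.observe(1.2)   # seconds the chunk took
```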

6. Graceful shutdown

What happens when a Kubernetes pod restarts mid-pipeline? The naive answer: it gets cut off mid-chunk, the checkpoint is stale, you get duplicates.

Graceful shutdown:

  1. On SIGTERM, finish the current chunk but don’t start a new one
  2. Save the checkpoint
  3. Exit cleanly

import signal

shutdown_requested = False

def handle_sigterm(*args):
    global shutdown_requested
    shutdown_requested = True

signal.signal(signal.SIGTERM, handle_sigterm)

while True:
    if shutdown_requested:
        break  # current chunk finished and checkpointed; stop here
    chunk = fetch_chunk()
    if not chunk:
        break
    process_chunk(chunk)
    save_checkpoint()

Kubernetes gives you a 30 second default grace period. Size your chunks so they finish in well under 30 seconds.

Parallel processing: handle with care

To speed a batch up, process chunks in parallel: 4 workers, each on a different chunk.

Parallel pitfalls:
– Checkpoint management gets harder (which chunks finished?)
– DB connection pool can exhaust
– Downstream rate limits get hit faster
– Deadlock risk rises (different chunks writing to the same row)

Two of the three projects went parallel; one stayed serial. Parallel was roughly 40% faster but took three times the debugging effort. Don’t go parallel without benchmarking first.
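One hedged sketch of the parallel shape, assuming chunks are fully independent: replace the single linear checkpoint with per-chunk completion tracking, so on restart only the chunks missing from the done-map need to run again (ThreadPoolExecutor and the worker count are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def process_chunk(chunk):
    # Placeholder for the real per-chunk work.
    return sum(chunk)

def run_parallel(chunks, workers=4):
    # Map of chunk index -> result; persist this instead of a single
    # last_processed_id, because chunks can finish out of order.
    done = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(process_chunk, c): i
                   for i, c in enumerate(chunks)}
        for fut in as_completed(futures):
            done[futures[fut]] = fut.result()
    return done
```

Note that out-of-order completion is exactly why checkpoint management gets harder: “everything up to id X is done” stops being true.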

Real project numbers

SaaS user reports (500K users):
– Chunk: 1000 users, 4 workers in parallel
– Duration: 45 minutes (was 6 hours)
– On crash, resumes from checkpoint within 90 seconds

E-commerce invoice batch (100K orders):
– Chunk: 500 orders, serial
– Duration: 22 minutes
– Restarted three times over the year thanks to idempotency, never needed manual cleanup

Fintech DB migration (12M rows):
– Chunk: 10,000 rows, 8 workers in parallel
– Duration: 14 hours (target was under 24)
– Dead letter table: 47 rows (corrupted legacy data), reviewed manually

Final lesson

Batch processing is less about writing code and more about writing discipline. Without chunking, checkpointing, idempotency, retry, observability, and graceful shutdown all working together, batch jobs do damage in production.

Build those six principles in from day one and your batch pipeline won’t wake you at night. Skip them and you’ll be patching it six months later.
