The “test pyramid” advice has been around since 2009: 70% unit, 20% integration, 10% E2E. In practice, teams that apply that ratio rigidly catch fewer bugs.
I’ve set the test strategy on three very different projects. The ratio came out differently on each. Here’s when the classic pyramid holds, and when it doesn’t.
Why the classic pyramid made sense
Mike Cohn proposed this in 2009:
- Unit tests (base, most numerous): a single function or class in isolation, millisecond speed
- Integration tests (middle): a handful of components working together, seconds
- E2E tests (top, fewest): the whole system as the user sees it, minutes
The logic: unit tests are cheap, E2E tests are expensive. So most of them should be unit.
Where the pyramid works
Logic-heavy code. Algorithms, business rule calculations, utility functions. Unit tests at 80 to 90% coverage catch real bugs.
```python
def calculate_discount(price, coupon, user_tier):
    # 20 different logic branches
    ...

# 40 unit tests cover the branches
```

Framework-independent logic. Domain layer, model validation, formatting. Fast to iterate, deterministic.
Heavy refactoring phase. During a refactor, unit tests catch regressions with instant feedback.
Where the pyramid breaks
Integration-heavy systems. Microservices, multiple external APIs, complex DB operations. Unit tests don’t catch bugs because the bug is in how the parts talk. Mocks lie.
Example: a system of 20 microservices. Each passes its unit tests, but the integration crashes. The integration test catches it.
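A miniature of that failure mode, with hypothetical services: each side is correct against its own spec (so mocked unit tests pass), but they disagree on units, and only a test that wires the real objects together sees it.

```python
# Two services that each pass unit tests in isolation, but disagree on
# units (cents vs dollars) when wired together. Names are illustrative.

class PricingService:
    def quote(self, sku: str) -> int:
        # Returns the price in *cents* -- documented only in this service.
        return 1999

class BillingService:
    def charge(self, amount: float) -> str:
        # Assumes the amount is in *dollars*.
        if amount > 1000:
            raise ValueError(f"charge of {amount} exceeds sanity limit")
        return f"charged ${amount:.2f}"

def test_quote_then_charge():
    # Integration test: real objects, no mocks.
    pricing, billing = PricingService(), BillingService()
    amount = pricing.quote("sku-1")   # 1999 cents
    try:
        billing.charge(amount)        # interpreted as $1999 -> blows up
        caught = False
    except ValueError:
        caught = True
    assert caught  # the bug lives *between* the services
```

Mock either side in a unit test and the mismatch is invisible; the mock returns whatever the test author believed the contract was.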
UI-heavy frontends. React or Vue components. Unit tests mock the DOM, so real browser behaviour isn’t exercised. An event handler passes as a unit test but breaks on real clicks.
Data transformation pipelines. ETL, data migration. The individual step’s unit test barely matters; what matters is that the whole pipeline produces the right output.
Real project 1: fintech backend
Backend, 15 microservices, complex integration. Unit/integration/E2E ratio:
- Unit: 45%
- Integration: 40%
- E2E: 15%
Integration tests got heavy. A “trapezoid” rather than a pyramid. Each service’s unit tests are fine, but service-to-service integration is the critical layer.
Contract tests (Pact) filled a specific role as a subset of integration.
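Pact automates the recording and verification, but the core idea can be sketched without the library: the consumer pins down the response shape it actually reads, and the provider's real output is checked against that pin. Everything below is an illustrative hand-rolled stand-in, not the Pact API.

```python
# Hand-rolled contract check (Pact automates this end to end).

# Consumer-side expectation: only the fields and types the consumer reads.
ORDER_CONTRACT = {"id": str, "status": str, "total_cents": int}

def provider_get_order(order_id: str) -> dict:
    # Stand-in for the provider's real handler (hypothetical service).
    return {"id": order_id, "status": "paid", "total_cents": 1999}

def satisfies_contract(payload: dict, contract: dict) -> bool:
    return all(
        field in payload and isinstance(payload[field], expected_type)
        for field, expected_type in contract.items()
    )

def test_provider_honours_consumer_contract():
    assert satisfies_contract(provider_get_order("o-42"), ORDER_CONTRACT)
```

The point of the technique: the provider can be verified against every consumer's pinned expectations without spinning up the consumers themselves.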
Real project 2: React SPA
React frontend, component development in Storybook. Ratio:
- Unit (component logic, hooks): 30%
- Component tests (React Testing Library): 45%
- E2E (Playwright): 25%
Component tests are the frontend version of integration tests. Render plus event plus assertion. Real user interactions are simulated, and the results are far more reliable.
E2E isn’t “heavy” anymore thanks to Playwright. 100 tests run in parallel in five minutes.
Real project 3: data pipeline
Python ETL orchestrated by Airflow. Ratio:
- Unit: 50% (transformation functions)
- Integration: 30% (source and sink interactions)
- E2E/pipeline tests: 20% (full DAG end-to-end)
Pipeline tests are critical: does the input data become the correct output data? Mocks don’t help much; you need a small real dataset.
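A pipeline test in that spirit, with made-up step names: feed a small real dataset through the whole chain and assert on the final output, not on each step in isolation.

```python
# Illustrative three-step pipeline; the steps themselves are trivial on
# purpose -- the test's value is in running them *together* on real rows.

def clean(rows):
    return [r for r in rows if r.get("amount") is not None]

def normalise(rows):
    return [{**r, "currency": r.get("currency", "EUR").upper()} for r in rows]

def aggregate(rows):
    totals = {}
    for r in rows:
        totals[r["currency"]] = totals.get(r["currency"], 0) + r["amount"]
    return totals

def run_pipeline(rows):
    return aggregate(normalise(clean(rows)))

def test_pipeline_end_to_end():
    sample = [                      # tiny but *real* input, no mocks
        {"amount": 10, "currency": "eur"},
        {"amount": 5},              # missing currency -> defaults to EUR
        {"amount": None},           # bad row -> dropped by clean()
        {"amount": 7, "currency": "USD"},
    ]
    assert run_pipeline(sample) == {"EUR": 15, "USD": 7}
```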
The honeycomb alternative
Google and Spotify pushed back on the classic pyramid with the “testing honeycomb”:
- Few unit tests (brittle, tied to implementation details)
- Many integration tests (where the business logic actually lives)
- Few E2E tests
It was aimed specifically at mid-size teams on service-oriented architectures, and in microservice teams the honeycomb does tend to hold up.
Picking a strategy
Questions I ask when looking at a new project:
- Business logic complexity: high → unit-heavy.
- Integration complexity: high → integration-heavy.
- UI complexity: high → component/E2E-heavy.
- Refactoring frequency: high → unit-heavy (fast feedback).
- Team maturity: junior team → E2E is more forgiving (tests user flows, not implementation details).
- Deployment risk: high → E2E-heavy (release gate).
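One entirely made-up way to turn those answers into a starting ratio: each "high" answer shifts weight toward the level it favours. The weights are illustrative, not calibrated.

```python
# Hypothetical heuristic: boolean answers to the six questions above,
# mapped to a suggested unit/integration/E2E starting split.

def suggest_ratio(logic=False, integration=False, ui=False,
                  refactoring=False, junior_team=False, deploy_risk=False):
    weights = {"unit": 1.0, "integration": 1.0, "e2e": 1.0}
    if logic:        weights["unit"] += 1
    if refactoring:  weights["unit"] += 1
    if integration:  weights["integration"] += 1
    if ui:           weights["e2e"] += 1
    if junior_team:  weights["e2e"] += 1
    if deploy_risk:  weights["e2e"] += 1
    total = sum(weights.values())
    return {k: round(v / total, 2) for k, v in weights.items()}

# e.g. an integration-heavy backend with real deploy risk:
# suggest_ratio(integration=True, deploy_risk=True)
```

Treat the output as a first guess to argue about, then let the "which level caught bugs" review below correct it over time.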
Mock vs real
What blurs the unit/integration line: how much you mock.
With aggressive mocking, you write “unit tests” that don’t catch production bugs because the mocks lie.
Minimal mocking: only mock external systems (DBs, third-party APIs, the filesystem). Internal dependencies stay real.
```python
# Aggressive mocking (bugs slip through)
def test_process_order():
    mock_db = Mock()
    mock_payment = Mock()
    mock_email = Mock()
    service = OrderService(mock_db, mock_payment, mock_email)  # everything mocked
    service.process_order(order)
    mock_db.save.assert_called()

# Minimal mocking (more trustworthy)
def test_process_order(real_db):  # real test DB, e.g. a fixture
    mock_payment_api = Mock(return_value={'status': 'success'})
    service = OrderService(real_db, mock_payment_api)
    service.process_order(order)
    saved_order = real_db.query(Order).first()
    assert saved_order.status == 'paid'
```

Minimal mocks feel more like integration tests, and the value follows.
Test speed vs value
Speed-vs-value trade-off:
- Unit: fast, low value (isolated)
- Integration: medium speed, high value (real interaction)
- E2E: slow, highest value (full scenario)
Modern tools close the speed gap. Playwright runs 100 E2E tests in five minutes; Testcontainers runs 500 integration tests in 10.
Speed isn’t the single criterion anymore. Value is.
Flakiness: test trust
A flaky test passes sometimes and fails sometimes. It’s the enemy of team trust.
Common causes of flakiness:
- Race conditions
- Test-order dependencies
- External service flakiness
- Time-sensitive assertions
- Shared state
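One of those causes, time-sensitive assertions, in miniature: the flaky version samples the wall clock inside the code under test, so whether it passes depends on scheduler luck; the fixed version injects the clock and becomes deterministic. Function names are illustrative.

```python
import time

# Flaky: validity depends on how fast the test runs between these two calls.
def make_token_flaky(ttl_seconds=0.01):
    return {"expires_at": time.time() + ttl_seconds}

def is_valid_flaky(token):
    return time.time() < token["expires_at"]

# Fixed: the clock is a parameter, so the test controls it completely.
def make_token(now, ttl_seconds=60):
    return {"expires_at": now + ttl_seconds}

def is_valid(token, now):
    return now < token["expires_at"]

def test_token_expiry_deterministic():
    token = make_token(now=1_000.0, ttl_seconds=60)
    assert is_valid(token, now=1_059.9)       # just before expiry
    assert not is_valid(token, now=1_060.1)   # just after expiry
```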
Don’t skip a flaky test; fix it. The classic trajectory: flaky → disabled → the bug it would have caught ships.
CI flaky test detection: run three times, fail once = flagged, investigate.
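The run-three-times rule as a sketch: a helper reruns a test callable and flags it as flaky when the runs disagree. CI systems and rerun plugins offer this natively; this only shows the classification logic.

```python
# Rerun a test callable and classify the outcome.
def classify(test_fn, runs=3):
    results = []
    for _ in range(runs):
        try:
            test_fn()
            results.append(True)
        except AssertionError:
            results.append(False)
    if all(results):
        return "pass"
    if not any(results):
        return "fail"
    return "flaky"  # passed sometimes, failed sometimes -> investigate

# A deliberately nondeterministic test to demonstrate:
_calls = {"n": 0}
def sometimes_fails():
    _calls["n"] += 1
    assert _calls["n"] % 2 == 1  # passes on odd runs only

# classify(sometimes_fails) -> "flaky"
```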
The coverage trap
An “80% coverage” target is misleading. You can hit 80% coverage with tests that don’t catch bugs.
- “Smoke tests” without assertions boost coverage but catch nothing
- Happy-path-only tests
- Mock-heavy tests that don’t exercise production behaviour
Coverage is a metric, not a goal. The real goal is “bugs escaping to production”.
Mutation testing measures the quality of your coverage: high coverage with low mutation scores means the tests are shallow.
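Mutation testing in miniature: flip one operator in the code under test and check whether the suite notices. The mutation is simulated with a flag here; real tools (mutmut, PIT and the like) rewrite the source automatically. A shallow, assertion-free test survives the mutant; a real assertion kills it.

```python
# Illustrative code under test, with a simulated mutant (+ flipped to -).
def total(prices, mutated=False):
    result = 0
    for p in prices:
        result = result - p if mutated else result + p
    return result

def shallow_test(mutated):
    total([1, 2, 3], mutated)   # executes the code, asserts nothing
    return True                 # "passes" either way -> coverage lies

def real_test(mutated):
    return total([1, 2, 3], mutated) == 6   # kills the mutant

# The shallow test survives the mutant; the real test does not:
assert shallow_test(mutated=True) is True
assert real_test(mutated=False) and not real_test(mutated=True)
```

Both tests produce identical line coverage; only the mutation score tells them apart.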
Keeping the pyramid current
As a project matures, the distribution shifts:
- Early stage: E2E-heavy (feature validation)
- Maturing: unit and integration grow
- Refactoring phase: unit-heavy
- Maintenance: integration/E2E preservation
Every six months, review the strategy: which level caught bugs, which level missed, where should we invest.
Closing thought
The classic test pyramid is a guide, not a rule. Depending on the project, a honeycomb, a trapezoid, or even a reverse pyramid can fit.
My advice:
- Decide the strategy based on the project’s architecture, not a default ratio
- Accept that each level catches a different kind of bug
- Use “bugs escaping to production” as the metric, not coverage
- Fix flaky tests quickly, or delete them
- Review the strategy every six months
That discipline sends test investment to the right places. Production stability comes from investing where tests pay off, not from hitting a ratio.