The “test pyramid” advice has been around since 2009: 70% unit, 20% integration, 10% E2E. In practice, teams that apply that ratio rigidly catch fewer bugs.
I’ve set the test strategy on three very different projects. The ratio came out differently on each. Here’s when the classic pyramid holds, and when it doesn’t.
Why the classic pyramid made sense
Mike Cohn proposed this in 2009:
- Unit tests (base, most numerous): a single function or class in isolation, millisecond speed
- Integration tests (middle): a handful of components working together, seconds
- E2E tests (top, fewest): the whole system as the user sees it, minutes
The logic: unit tests are cheap, E2E tests are expensive. So most of them should be unit.
Where the pyramid works
Logic-heavy code. Algorithms, business rule calculations, utility functions. Unit tests at 80 to 90% coverage catch real bugs.
```python
def calculate_discount(price, coupon, user_tier):
    # 20 different logic branches
    ...

# 40 unit tests cover the branches
```

Framework-independent logic. Domain layer, model validation, formatting. Fast to iterate, deterministic.
Heavy refactoring phase. During a refactor, unit tests catch regressions with instant feedback.
Where the pyramid breaks
Integration-heavy systems. Microservices, multiple external APIs, complex DB operations. Unit tests don’t catch bugs because the bug is in how the parts talk. Mocks lie.
Example: a system of 20 microservices. Each passes its unit tests, but the integration crashes. The integration test catches it.
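A miniature of that failure mode, with hypothetical services: each side is correct against its own spec (so mocked unit tests pass), but they disagree on units, and only a test that wires the real objects together sees it.

```python
# Two services that each pass unit tests in isolation, but disagree on
# units (cents vs dollars) when wired together. Names are illustrative.

class PricingService:
    def quote(self, sku: str) -> int:
        # Returns the price in *cents* -- documented only in this service.
        return 1999

class BillingService:
    def charge(self, amount: float) -> str:
        # Assumes the amount is in *dollars*.
        if amount > 1000:
            raise ValueError(f"charge of {amount} exceeds sanity limit")
        return f"charged ${amount:.2f}"

def test_quote_then_charge():
    # Integration test: real objects, no mocks.
    pricing, billing = PricingService(), BillingService()
    amount = pricing.quote("sku-1")   # 1999 cents
    try:
        billing.charge(amount)        # interpreted as $1999 -> blows up
        caught = False
    except ValueError:
        caught = True
    assert caught  # the bug lives *between* the services
```

Mock either side in a unit test and the mismatch is invisible; the mock returns whatever the test author believed the contract was.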
UI-heavy frontends. React or Vue components. Unit tests mock the DOM, so real browser behaviour isn’t exercised. An event handler passes as a unit test but breaks on real clicks.
Data transformation pipelines. ETL, data migration. The individual step’s unit test barely matters; what matters is that the whole pipeline produces the right output.
Real project 1: fintech backend
Backend, 15 microservices, complex integration. Unit/integration/E2E ratio:
- Unit: 45%
- Integration: 40%
- E2E: 15%
Integration tests got heavy. A “trapezoid” rather than a pyramid. Each service’s unit tests are fine, but service-to-service integration is the critical layer.
Contract tests (Pact) filled a specific role as a subset of integration.
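Pact automates the recording and verification, but the core idea can be sketched without the library: the consumer pins down the response shape it actually reads, and the provider's real output is checked against that pin. Everything below is an illustrative hand-rolled stand-in, not the Pact API.

```python
# Hand-rolled contract check (Pact automates this end to end).

# Consumer-side expectation: only the fields and types the consumer reads.
ORDER_CONTRACT = {"id": str, "status": str, "total_cents": int}

def provider_get_order(order_id: str) -> dict:
    # Stand-in for the provider's real handler (hypothetical service).
    return {"id": order_id, "status": "paid", "total_cents": 1999}

def satisfies_contract(payload: dict, contract: dict) -> bool:
    return all(
        field in payload and isinstance(payload[field], expected_type)
        for field, expected_type in contract.items()
    )

def test_provider_honours_consumer_contract():
    assert satisfies_contract(provider_get_order("o-42"), ORDER_CONTRACT)
```

The point of the technique: the provider can be verified against every consumer's pinned expectations without spinning up the consumers themselves.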
Real project 2: React SPA
React frontend, component development in Storybook. Ratio:
- Unit (component logic, hooks): 30%
- Component tests (React Testing Library): 45%
- E2E (Playwright): 25%
Component tests are the frontend version of integration tests. Render plus event plus assertion. Real user interactions are simulated, and the results are far more reliable.
E2E isn’t “heavy” anymore thanks to Playwright. 100 tests run in parallel in five minutes.
Real project 3: data pipeline
Python ETL orchestrated by Airflow. Ratio:
- Unit: 50% (transformation functions)
- Integration: 30% (source and sink interactions)
- E2E/pipeline tests: 20% (full DAG end-to-end)
Pipeline tests are critical: does the input data become the correct output data? Mocks don’t help much; you need a small real dataset.
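A pipeline test in that spirit, with made-up step names: feed a small real dataset through the whole chain and assert on the final output, not on each step in isolation.

```python
# Illustrative three-step pipeline; the steps themselves are trivial on
# purpose -- the test's value is in running them *together* on real rows.

def clean(rows):
    return [r for r in rows if r.get("amount") is not None]

def normalise(rows):
    return [{**r, "currency": r.get("currency", "EUR").upper()} for r in rows]

def aggregate(rows):
    totals = {}
    for r in rows:
        totals[r["currency"]] = totals.get(r["currency"], 0) + r["amount"]
    return totals

def run_pipeline(rows):
    return aggregate(normalise(clean(rows)))

def test_pipeline_end_to_end():
    sample = [                      # tiny but *real* input, no mocks
        {"amount": 10, "currency": "eur"},
        {"amount": 5},              # missing currency -> defaults to EUR
        {"amount": None},           # bad row -> dropped by clean()
        {"amount": 7, "currency": "USD"},
    ]
    assert run_pipeline(sample) == {"EUR": 15, "USD": 7}
```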
The honeycomb alternative
Google and Spotify pushed back on the classic pyramid with the “testing honeycomb”:
- Few unit tests (brittle, tied to implementation details)
- Many integration tests (where the business logic actually lives)
- Few E2E tests
It was aimed specifically at mid-size teams on service-oriented architectures, and in microservice teams the honeycomb does tend to hold up.
Picking a strategy
Questions I ask when looking at a new project:
- Business logic complexity: high → unit-heavy.
- Integration complexity: high → integration-heavy.
- UI complexity: high → component/E2E-heavy.
- Refactoring frequency: high → unit-heavy (fast feedback).
- Team maturity: junior team → E2E is more forgiving (tests user flows, not implementation details).
- Deployment risk: high → E2E-heavy (release gate).
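One entirely made-up way to turn those answers into a starting ratio: each "high" answer shifts weight toward the level it favours. The weights are illustrative, not calibrated.

```python
# Hypothetical heuristic: boolean answers to the six questions above,
# mapped to a suggested unit/integration/E2E starting split.

def suggest_ratio(logic=False, integration=False, ui=False,
                  refactoring=False, junior_team=False, deploy_risk=False):
    weights = {"unit": 1.0, "integration": 1.0, "e2e": 1.0}
    if logic:        weights["unit"] += 1
    if refactoring:  weights["unit"] += 1
    if integration:  weights["integration"] += 1
    if ui:           weights["e2e"] += 1
    if junior_team:  weights["e2e"] += 1
    if deploy_risk:  weights["e2e"] += 1
    total = sum(weights.values())
    return {k: round(v / total, 2) for k, v in weights.items()}

# e.g. an integration-heavy backend with real deploy risk:
# suggest_ratio(integration=True, deploy_risk=True)
```

Treat the output as a first guess to argue about, then let the "which level caught bugs" review below correct it over time.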
Mock vs real
What blurs the unit/integration line: how much you mock.
With aggressive mocking, you write “unit tests” that don’t catch production bugs because the mocks lie.
Minimal mocking: only mock external systems (DBs, third-party APIs, the filesystem). Internal dependencies stay real.
```python
# Aggressive mocking (bugs slip through)
def test_process_order():
    mock_db = Mock()
    mock_payment = Mock()
    mock_email = Mock()
    service = OrderService(mock_db, mock_payment, mock_email)  # everything mocked
    service.process_order(order)
    mock_db.save.assert_called()

# Minimal mocking (more trustworthy)
def test_process_order(real_db):  # real test DB, e.g. a fixture
    mock_payment_api = Mock(return_value={'status': 'success'})
    service = OrderService(real_db, mock_payment_api)
    service.process_order(order)
    saved_order = real_db.query(Order).first()
    assert saved_order.status == 'paid'
```

Minimal mocks feel more like integration tests, and the value follows.
Test speed vs value
Speed-vs-value trade-off:
- Unit: fast, low value (isolated)
- Integration: medium speed, high value (real interaction)
- E2E: slow, highest value (full scenario)
Modern tools close the speed gap. Playwright runs 100 E2E tests in five minutes; Testcontainers runs 500 integration tests in 10.
Speed isn’t the single criterion anymore. Value is.
Flakiness: test trust
A flaky test passes sometimes and fails sometimes. It’s the enemy of team trust.
Common causes of flakiness:
- Race conditions
- Test-order dependencies
- External service flakiness
- Time-sensitive assertions
- Shared state
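One of those causes, time-sensitive assertions, in miniature: the flaky version samples the wall clock inside the code under test, so whether it passes depends on scheduler luck; the fixed version injects the clock and becomes deterministic. Function names are illustrative.

```python
import time

# Flaky: validity depends on how fast the test runs between these two calls.
def make_token_flaky(ttl_seconds=0.01):
    return {"expires_at": time.time() + ttl_seconds}

def is_valid_flaky(token):
    return time.time() < token["expires_at"]

# Fixed: the clock is a parameter, so the test controls it completely.
def make_token(now, ttl_seconds=60):
    return {"expires_at": now + ttl_seconds}

def is_valid(token, now):
    return now < token["expires_at"]

def test_token_expiry_deterministic():
    token = make_token(now=1_000.0, ttl_seconds=60)
    assert is_valid(token, now=1_059.9)       # just before expiry
    assert not is_valid(token, now=1_060.1)   # just after expiry
```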
Don’t skip a flaky test; fix it. The classic trajectory: flaky → disabled → the bug it would have caught ships.
CI flaky test detection: run three times, fail once = flagged, investigate.
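The run-three-times rule as a sketch: a helper reruns a test callable and flags it as flaky when the runs disagree. CI systems and rerun plugins offer this natively; this only shows the classification logic.

```python
# Rerun a test callable and classify the outcome.
def classify(test_fn, runs=3):
    results = []
    for _ in range(runs):
        try:
            test_fn()
            results.append(True)
        except AssertionError:
            results.append(False)
    if all(results):
        return "pass"
    if not any(results):
        return "fail"
    return "flaky"  # passed sometimes, failed sometimes -> investigate

# A deliberately nondeterministic test to demonstrate:
_calls = {"n": 0}
def sometimes_fails():
    _calls["n"] += 1
    assert _calls["n"] % 2 == 1  # passes on odd runs only

# classify(sometimes_fails) -> "flaky"
```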
The coverage trap
An “80% coverage” target is misleading. You can hit 80% coverage with tests that don’t catch bugs.
- “Smoke tests” without assertions boost coverage but catch nothing
- Happy-path-only tests
- Mock-heavy tests that don’t exercise production behaviour
Coverage is a metric, not a goal. The real goal is “bugs escaping to production”.
Mutation testing measures the quality of your coverage: high coverage with low mutation scores means the tests are shallow.
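Mutation testing in miniature: flip one operator in the code under test and check whether the suite notices. The mutation is simulated with a flag here; real tools (mutmut, PIT and the like) rewrite the source automatically. A shallow, assertion-free test survives the mutant; a real assertion kills it.

```python
# Illustrative code under test, with a simulated mutant (+ flipped to -).
def total(prices, mutated=False):
    result = 0
    for p in prices:
        result = result - p if mutated else result + p
    return result

def shallow_test(mutated):
    total([1, 2, 3], mutated)   # executes the code, asserts nothing
    return True                 # "passes" either way -> coverage lies

def real_test(mutated):
    return total([1, 2, 3], mutated) == 6   # kills the mutant

# The shallow test survives the mutant; the real test does not:
assert shallow_test(mutated=True) is True
assert real_test(mutated=False) and not real_test(mutated=True)
```

Both tests produce identical line coverage; only the mutation score tells them apart.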
Keeping the pyramid current
As a project matures, the distribution shifts:
- Early stage: E2E-heavy (feature validation)
- Maturing: unit and integration grow
- Refactoring phase: unit-heavy
- Maintenance: integration/E2E preservation
Every six months, review the strategy: which level caught bugs, which level missed, where should we invest.
Closing thought
The classic test pyramid is a guide, not a rule. Depending on the project, a honeycomb, a trapezoid, or even a reverse pyramid can fit.
My advice:
- Decide the strategy based on the project’s architecture, not a default ratio
- Accept that each level catches a different kind of bug
- Use “bugs escaping to production” as the metric, not coverage
- Fix flaky tests quickly, or delete them
- Review the strategy every six months
That discipline sends test investment to the right places. Production stability comes from investing where tests pay off, not from hitting a ratio.