Feature flags (feature toggles) are one of the foundational tools of modern software development. They decouple deploy from feature release. Your code is live in production, but hidden from users. Flip the switch, and the feature is live.
A/B tests, canary deploys, gradual rollouts, kill switches: all of it is possible with feature flags. In this post I’ll walk through how to build your own flag system in four stages.
Stage 1: simple config flags
Starting point: plain boolean flags in a config file.
# config/features.yml
features:
  new_checkout: false
  dark_mode: true
  experimental_search: false

Usage in code:
if config.features.new_checkout:
    render_new_checkout()
else:
    render_old_checkout()
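For completeness, a sketch of how that config object might be populated at startup, assuming PyYAML and a small attribute wrapper (both the path and the wrapper are illustrative):

import yaml
from types import SimpleNamespace

def load_features(path: str = "config/features.yml") -> SimpleNamespace:
    # Read the YAML once at startup; changing a flag means editing the file and redeploying
    with open(path) as f:
        data = yaml.safe_load(f)
    return SimpleNamespace(features=SimpleNamespace(**data["features"]))

config = load_features()
config.features.new_checkout  # False until the next deploy flips it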
Pros:
– Super simple setup
– No external dependency
– Version controlled (config lives in Git)
Cons:
– Changing a flag requires a deploy
– No user-level targeting
– A/B testing impossible
– No real-time toggle
Fine for a small project. Once you grow, it’s not enough.
Stage 2: database-backed flags
Store flags in the database. Toggle them from an admin panel.
Schema:
CREATE TABLE feature_flags (
    key VARCHAR(100) PRIMARY KEY,
    enabled BOOLEAN DEFAULT FALSE,
    description TEXT,
    updated_at TIMESTAMP
);

Code:
def is_enabled(flag_key: str) -> bool:
    flag = db.query("SELECT enabled FROM feature_flags WHERE key = ?", flag_key)
    return bool(flag and flag.enabled)  # bool() so a missing flag reads as False, not None

Admin panel: simple UI, one toggle per flag.
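A sketch of what the toggle endpoint behind that panel could look like, assuming Flask, the db helper from the snippet above extended to run updates, and a hypothetical audit_log helper:

from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/admin/flags/<flag_key>", methods=["POST"])
def toggle_flag(flag_key):
    enabled = bool(request.json.get("enabled"))
    # Assumes the db helper also runs parameterized updates
    db.query(
        "UPDATE feature_flags SET enabled = ?, updated_at = NOW() WHERE key = ?",
        enabled, flag_key,
    )
    # Hypothetical audit helper: who toggled what, when
    audit_log.record(user=request.headers.get("X-Admin-User"), flag=flag_key, enabled=enabled)
    return jsonify({"key": flag_key, "enabled": enabled})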
Pros:
– Real-time toggle (no deploy required)
– Admin panel control
– Audit log (who toggled what, when)
Cons:
– Each flag check is a database call (performance hit)
– Still no user-level targeting
– Still no A/B testing
Fix: add a caching layer. Cache flag state in Redis with a 30 second TTL.
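A rough sketch of that caching layer, assuming the redis-py client and the db helper from the snippet above (key prefix and TTL are illustrative):

import json
import redis

cache = redis.Redis()  # assumes a Redis instance reachable on localhost
CACHE_TTL = 30  # seconds, as suggested above

def is_enabled(flag_key: str) -> bool:
    cached = cache.get(f"flag:{flag_key}")
    if cached is not None:
        return json.loads(cached)
    flag = db.query("SELECT enabled FROM feature_flags WHERE key = ?", flag_key)
    enabled = bool(flag and flag.enabled)
    cache.set(f"flag:{flag_key}", json.dumps(enabled), ex=CACHE_TTL)
    return enabled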
Stage 3: user-targeted flags
Show a feature to some users, hide it from others. Beta testers, internal team, premium users.
Schema:
CREATE TABLE feature_flags (
    key VARCHAR(100) PRIMARY KEY,
    enabled BOOLEAN,
    rollout_percentage INT, -- 0-100
    user_groups JSON, -- ["beta", "internal"]
    user_whitelist JSON -- specific user_ids
);

Code:
import hashlib

def is_enabled(flag_key: str, user_id: str, user_groups: list) -> bool:
    flag = get_flag(flag_key)
    if not flag.enabled:
        return False
    # Whitelist check
    if user_id in flag.user_whitelist:
        return True
    # Group check
    if any(g in flag.user_groups for g in user_groups):
        return True
    # Rollout percentage (deterministic hash, so a user always gets the same answer)
    user_hash = int(hashlib.md5((flag_key + user_id).encode()).hexdigest(), 16)
    if (user_hash % 100) < flag.rollout_percentage:
        return True
    return False

Gradual rollout:
– Start: 1% rollout
– If no issues: 5%, 10%, 25%, 50%, 100%
– Issue detected: rollback instantly
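Operationally, each step is just an update to the flag row; a sketch, assuming the db helper also accepts parameterized updates:

ROLLOUT_STEPS = [1, 5, 10, 25, 50, 100]  # the schedule from above

def ramp_rollout(flag_key: str, percentage: int) -> None:
    # Bump the percentage; cached checks pick it up once the TTL expires
    db.query("UPDATE feature_flags SET rollout_percentage = ? WHERE key = ?", percentage, flag_key)

def kill_switch(flag_key: str) -> None:
    # Issue detected: flip enabled off, which short-circuits every check, whitelist included
    db.query("UPDATE feature_flags SET enabled = FALSE WHERE key = ?", flag_key)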
Pros:
– User targeting
– Gradual rollout
– Kill switch
– Foundations for A/B testing
Cons:
– Higher complexity
– A/B testing requires analytics integration
Stage 4: full A/B testing
A/B testing means feature flags plus analytics plus statistical significance.
Extended schema:
CREATE TABLE experiments (
    key VARCHAR(100) PRIMARY KEY,
    variants JSON, -- [{"name": "control", "weight": 50}, {"name": "new_ui", "weight": 50}]
    active BOOLEAN,
    started_at TIMESTAMP
);

CREATE TABLE experiment_exposures (
    experiment_key VARCHAR(100),
    user_id VARCHAR(100),
    variant VARCHAR(50),
    exposed_at TIMESTAMP,
    INDEX (experiment_key, user_id)
);

Code:
import hashlib

def get_variant(experiment_key: str, user_id: str) -> str:
    experiment = get_experiment(experiment_key)
    # Existing assignment? A user must stay in the same variant across requests.
    existing = get_exposure(experiment_key, user_id)
    if existing:
        return existing.variant
    # New assignment (hash-based); variant weights are assumed to sum to 100
    user_hash = int(hashlib.md5((experiment_key + user_id).encode()).hexdigest(), 16) % 100
    cumulative = 0
    for variant in experiment.variants:
        cumulative += variant.weight
        if user_hash < cumulative:
            record_exposure(experiment_key, user_id, variant.name)
            return variant.name

Analytics integration:
# Track experiment metric
analytics.track("purchase", {
    "user_id": user_id,
    "amount": 100,
    "experiment_new_checkout": get_variant("new_checkout", user_id)
})

Later, analyze:
– Variant A conversion rate: 5.2%
– Variant B conversion rate: 6.8%
– Statistical significance: p < 0.01
– Variant B wins, roll out fully
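How that p-value might be computed, as a sketch: a two-proportion z-test with only the standard library (the conversion counts below are made up to match the rates above):

from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Return the two-sided p-value for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)                  # pooled conversion rate
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))    # standard error of the difference
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Illustrative numbers, not real data: 5.2% vs 6.8% conversion on 10,000 users each
p_value = two_proportion_z_test(520, 10_000, 680, 10_000)
print(p_value)  # well below 0.01, so variant B's lift is statistically significant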
Third-party solutions
If you don’t want to build your own:
LaunchDarkly: enterprise-grade feature flag management. Expensive.
Unleash: open source, self-hosted.
Split.io: feature flags plus A/B testing. Mid-market.
Flagsmith: open source with managed options.
Firebase Remote Config: Google’s free option.
Trade-offs to weigh: vendor dependency, pricing, and how much you can customize.
Small project: Firebase Remote Config is enough. Mid-size: LaunchDarkly. Enterprise: self-hosted Unleash.
Flag hygiene
Feature flags have a lifecycle. Ignore it, and you accumulate flag debt.
Types:
– Release flags: for feature deployment. Short-lived (weeks).
– Experiment flags: for A/B testing. Medium-lived (months).
– Ops flags: kill switches, circuit breakers. Long-lived.
– Permission flags: premium features. Permanent.
Cleanup discipline:
– Release flag successful: merge the code, remove the flag within 2 weeks
– Experiment concluded: make the winning variant the default, remove the flag
– Ops flags: review quarterly
Flag debt is what you get when flags added six months ago are still sitting in the code. Every flag is cognitive overhead.
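One way to keep that discipline honest, as a sketch: a periodic report of flags whose row has not been touched in a while, using the updated_at column from the Stage 2 schema (the 90-day window and Postgres-style interval syntax are assumptions):

# Hypothetical cleanup report: flags whose row has not changed in 90 days
# (updated_at only tracks toggles, so treat this as a prompt to investigate, not proof of debt)
def stale_flags():
    return db.query(
        "SELECT key, description, updated_at FROM feature_flags "
        "WHERE updated_at < NOW() - INTERVAL '90 days'"  # Postgres-style interval
    )

for flag in stale_flags():
    print(f"Stale flag: {flag.key}, last touched {flag.updated_at}: {flag.description}")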
Performance considerations
Flag checks happen on every request. Minimize overhead:
– In-memory cache. Keep flag state in app memory, refresh periodically (sketched below).
– Batch fetching. Fetch all flags at user login, use them for the session.
– Edge-deployed flags. Flag evaluation at the CDN level.
Cache TTL is a balance: long TTL means stale flags, short TTL means DB pressure.
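A minimal sketch of the in-memory approach, assuming a daemon thread refreshing every 30 seconds (swap in whatever scheduler your app already runs):

import threading
import time

_flags = {}  # in-process cache: flag key -> enabled

def _refresh_flags(interval: int = 30) -> None:
    # One query reloads every flag, instead of one query per check
    while True:
        rows = db.query("SELECT key, enabled FROM feature_flags")
        _flags.update({row.key: row.enabled for row in rows})
        time.sleep(interval)

threading.Thread(target=_refresh_flags, daemon=True).start()

def is_enabled(flag_key: str) -> bool:
    return _flags.get(flag_key, False)  # unknown or not-yet-loaded flags read as off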
Testing with flags
Testing flag-gated code:
import pytest
from unittest import mock

@pytest.fixture
def enable_flag():
    # Force every flag check to return True for the duration of the test
    with mock.patch('features.is_enabled', return_value=True):
        yield

def test_new_checkout_flow(enable_flag):
    # Test with flag ON
    ...

For every major flag, test both paths: "flag on" and "flag off" scenarios.
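One way to cover both paths without duplicating the test body, as a sketch (render_checkout_page is a hypothetical entry point that branches on the flag):

@pytest.mark.parametrize("flag_on", [True, False])
def test_checkout_flow(flag_on):
    with mock.patch('features.is_enabled', return_value=flag_on):
        page = render_checkout_page()  # hypothetical entry point
        if flag_on:
            assert "new-checkout" in page  # illustrative assertions
        else:
            assert "old-checkout" in page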
Wrap-up
Feature flags reduce deploy risk. Gradual rollout, kill switch, A/B testing: they enable all of it.
Four-stage adoption: config, DB, user-targeted, full A/B. Each stage builds on the last. Start where your project’s size warrants.
Don’t forget flag hygiene. When flags aren’t removed, code complexity climbs. Cleanup discipline is mandatory.