
Log aggregation from zero: building a production ELK stack

How do you collect logs across ten microservices? Here's how the Elasticsearch + Logstash + Kibana stack actually fits together on real projects.

Reading logs on a single server is easy: SSH in, run tail -f, find the line. But with 10 servers, 20 microservices, 100 containers, that approach falls apart. Which service triggered this error? SSHing into each machine one by one doesn't scale.

That’s the problem log aggregation solves. You collect every log in one place, make it searchable, make it correlatable. The ELK Stack (Elasticsearch + Logstash + Kibana) is the most common solution. This post walks through building a production-ready ELK setup.

Stack components

Elasticsearch: search and analytics engine. Stores and indexes logs. Full-text search, aggregations.

Logstash: log pipeline. Takes input from sources, parses, transforms, writes to Elasticsearch.

Kibana: web UI. Visualizes Elasticsearch data, runs searches, hosts dashboards.

Filebeat / Fluent Bit: log collector agents. Run on each server, read log files, forward to Logstash.

These four components together are the ELK (+Beats) stack. Alternatives: OpenSearch (the AWS-led fork of Elasticsearch), Splunk (commercial), Loki (Grafana ecosystem).

Architecture diagram

[Server 1]        [Server 2]        [Server 3]
Filebeat          Filebeat          Filebeat
    |                |                |
    +--------+       |        +-------+
             |       |        |
             v       v        v
          [Logstash cluster]
                   |
                   v
          [Elasticsearch cluster]
                   |
                   v
              [Kibana]
                   |
              (web UI)

Filebeat runs on every server and forwards to a central Logstash cluster, which parses and enriches the logs and writes them to Elasticsearch, which Kibana then queries.

Filebeat setup (server side)

Every application server needs Filebeat installed. It watches log files and forwards new lines incrementally.

Filebeat config (filebeat.yml):

filebeat.inputs:
  - type: log                # tail plain log files (newer Filebeat versions prefer the filestream input)
    paths:
      - /var/log/app/*.log   # files to watch
    fields:
      app: my-service        # metadata attached to every entry
      environment: production

output.logstash:
  hosts: ["logstash.internal:5044"]   # forward events to Logstash

paths lists the log files. fields attaches metadata to every entry. output.logstash says where to send them.

Filebeat runs as a systemd service (systemctl start filebeat). It's restart-resilient: it records its read offset in a registry file and picks up where it left off after a restart.

Logstash pipeline

A Logstash pipeline has three sections:

Input: where logs come from.
Filter: how they’re parsed and transformed.
Output: where they go.

Example pipeline (logstash.conf):

input {
  beats {
    port => 5044
  }
}

filter {
  # Parse JSON logs
  if [message] =~ /^{/ {
    json {
      source => "message"
    }
  }
  
  # Normalize timestamp
  date {
    match => ["timestamp", "ISO8601"]
    target => "@timestamp"
  }
  
  # Geo IP lookup (location from user IP)
  geoip {
    source => "client_ip"
    target => "geoip"
  }
  
  # Convert request duration to numeric
  mutate {
    convert => { "request_duration_ms" => "integer" }
  }
}

output {
  elasticsearch {
    hosts => ["es.internal:9200"]
    index => "logs-%{+YYYY.MM.dd}"
    user => "logstash_user"
    password => "${LOGSTASH_PASSWORD}"
  }
}

Logstash runs every log entry through this pipeline: parse JSON, normalize timestamp, enrich with GeoIP, write to Elasticsearch.

Structured logging: use JSON

The single biggest decision that simplifies the Logstash pipeline: log JSON from your applications.

Unstructured log:

2024-11-15 10:30:45 INFO Request received: /api/users duration=125ms user=abc123

Hard to parse: you need a regex (a grok pattern) to extract the duration and user fields, and that's error-prone.

Structured log (JSON):

{"timestamp":"2024-11-15T10:30:45Z","level":"info","event":"request","path":"/api/users","duration_ms":125,"user_id":"abc123"}

Logstash parses it directly, and every field becomes searchable. This is the best practice.

Use a structured logging library: zerolog in Go, structlog in Python, pino in Node.js, Monolog’s JSON formatter in PHP.
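
A minimal sketch with pino in Node.js (the field names are illustrative, mirroring the JSON example above):

const pino = require("pino")
const logger = pino({ timestamp: pino.stdTimeFunctions.isoTime })

// Emits one JSON line: level, ISO timestamp, plus every field passed in
logger.info(
  { event: "request", path: "/api/users", duration_ms: 125, user_id: "abc123" },
  "request received"
)

pino writes one JSON object per line to stdout; point Filebeat at the log file (or capture stdout) and the Logstash json filter above handles the rest.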

Elasticsearch index strategy

Log data grows fast. Even 1GB/day becomes 365GB a year. Index strategy matters.

Time-based indices:

logs-2024.11.15
logs-2024.11.16
logs-2024.11.17

A new index per day. Upsides:
– Deleting old indices is trivial (retention)
– Searches are scoped by time range (a query over the last 7 days never touches older indices)
– Different settings possible per time window
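
Dropping old data then becomes a single API call. A sketch against the daily indices above (destructive; note that Elasticsearch 8+ rejects wildcard deletes unless action.destructive_requires_name is disabled):

# Delete every daily index from August 2024
curl -X DELETE "http://es.internal:9200/logs-2024.08.*"

In practice you rarely run this by hand: ILM (next) automates it.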

Index lifecycle management (ILM):
– Hot phase: last 7 days, the active index. Writes flowing, searches fast.
– Warm phase: 7 to 30 days. Writes stopped, slower searches are OK.
– Cold phase: 30 to 90 days. Cheaper storage, rare searches.
– Delete phase: indices older than 90 days are removed.

These phases are automatic. Define an Elasticsearch ILM policy and every index follows it.
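
A sketch of such a policy via the ILM API (the policy name is illustrative; the thresholds mirror the phases above):

PUT _ilm/policy/logs-policy
{
  "policy": {
    "phases": {
      "hot":    { "min_age": "0ms", "actions": { "set_priority": { "priority": 100 } } },
      "warm":   { "min_age": "7d",  "actions": { "set_priority": { "priority": 50 } } },
      "cold":   { "min_age": "30d", "actions": { "set_priority": { "priority": 0 } } },
      "delete": { "min_age": "90d", "actions": { "delete": {} } }
    }
  }
}

Attach it to new indices through an index template (the index.lifecycle.name setting) and Elasticsearch moves each daily index through the phases on its own.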

Retention policy

How many days do you keep logs?

30 days: typical minimum for debugging. Most issues surface within 30 days.

90 days: common for audit trails. GDPR doesn't prescribe a specific number, but logs containing personal data need a defined, documented retention limit; 90 days is a defensible default.

1 year+: required by compliance (financial, healthcare). Expensive but necessary.

Watch the data volume. One year of retention at 10GB/day is roughly 3.6TB of primary storage, before replicas. That's expensive on hot Elasticsearch nodes. A cold tier (S3 via searchable snapshots) is cheaper.

Kibana dashboards

I build a dashboard per service in Kibana:

API service dashboard:
– Request rate (per second)
– Error rate (4xx, 5xx)
– p95 latency
– Top endpoints
– Recent errors (last 100)

Payment service dashboard:
– Transaction count
– Failed transaction rate
– Processing time distribution
– Daily revenue

System health dashboard:
– Error rate across all services
– Infrastructure metrics
– Cron job success/failure
– Security events (failed logins, suspicious IPs)

Every developer looks at their service’s dashboard. The on-call engineer watches the system health board.
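
In the Kibana query bar, most of these panels come down to short KQL filters. Two illustrative examples (field names follow the JSON log format from earlier):

level: error and app: my-service
duration_ms >= 2000

The first drives a "recent errors" view for one service; the second feeds the latency panel by finding slow requests.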

Correlation IDs

In a microservice system a user request crosses 3 or 4 services. When an error fires, which request produced it, and what path did it take?

Correlation ID: generate a unique ID (a UUID) at the start of every request, propagate it through every downstream call, and include it in every log line.

// API gateway: generate an ID (or reuse one sent by the client)
const requestId = req.get("X-Request-ID") || crypto.randomUUID()
log.info({event: "request_start", requestId, path: req.path})
res.set("X-Request-ID", requestId)

// Internal service call: propagate the ID downstream
fetch(url, {headers: {"X-Request-ID": requestId}})

// Downstream service: log with the received ID
log.info({event: "received", requestId: req.get("X-Request-ID")})

In Kibana, search by requestId and you can stitch together that request's journey across every service. A 10-hour debugging session becomes 10 minutes.

Alerting

Kibana alerting (or ElastAlert, OpenSearch alerting) fires notifications on error patterns:

Alert 1: 50+ errors in 5 minutes -> email + Slack.
Alert 2: p95 latency over 2 seconds -> PagerDuty.
Alert 3: payment failure rate over 5% -> Slack + PagerDuty.
Alert 4: security event pattern -> security team email.
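
As a sketch, Alert 1 written as an ElastAlert rule (the rule name and webhook URL are illustrative; the level field assumes the JSON log format from earlier):

name: high-error-rate
type: frequency           # fire when num_events match within the timeframe
index: logs-*
num_events: 50
timeframe:
  minutes: 5
filter:
  - term:
      level: "error"
alert:
  - slack
slack_webhook_url: "https://hooks.slack.com/services/XXX"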

Avoid alert fatigue: only actionable alerts. “Every warning log” is not an alert.

Cost control

Unchecked log volume makes an ELK cluster expensive.

Strategies:

1. Sampling: send only 10% of debug-level logs; everything info and above goes in full (see the Logstash sketch after this list).

2. Filter noise: don’t log health check endpoints or static asset requests.

3. Field filtering: strip fields you don’t need. A 500-byte user-agent string is rarely worth storing.

4. Cold tier: old data to S3. Searchable snapshots from the Elasticsearch index.

5. Dedup: deduplicate repeating error messages. One entry with a count instead of 1000 copies of the same error.
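
A sketch of strategies 1 and 2 as a Logstash filter (field names assume the JSON log format from earlier):

filter {
  # Strategy 2: drop health-check noise entirely
  if [path] == "/healthz" {
    drop {}
  }
  # Strategy 1: keep roughly 10% of debug-level logs
  if [level] == "debug" {
    drop { percentage => 90 }
  }
}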

A typical startup produces 5 to 10GB/day. Mid-size company: 100GB to 1TB. Large: 10TB+.

Alternative: Loki + Grafana

ELK alternative: Loki + Grafana, from the Grafana ecosystem.

Loki takes a "logs as metrics" approach: it indexes only labels, not the full log text, which makes storage much cheaper. Visualization and alerting happen in Grafana.

Upside: cost, simpler operations. Downside: limited full-text search, a smaller feature set than ELK.

Worth considering for small to mid scale. Enterprise still leans ELK.

Managed options

If self-hosting is too much:

AWS OpenSearch Service: managed. 30-minute setup. Pricing $50 to $500/month at typical scale.

Elastic Cloud: Elastic’s managed offering. Full ELK. Premium features.

Datadog Logs: SaaS log management. Expensive but no-ops. Mature alerting.

Grafana Cloud Logs: Loki-based managed. Cheaper.

For a startup, a managed service removes most of the operational burden, saving roughly 10 to 20 hours of ops time per month.

Bottom line

Log aggregation is foundational for any production system. In a microservice architecture, debugging is impossible without it.

The ELK stack is mature and battle-tested. Self-hosted or managed are both valid. Structured logging, correlation IDs, proper retention, and alerting are the four pillars.

Initial setup is a one to two week project. After that, operational costs scale with usage. A system whose logs aren’t observable isn’t production ready.
