Running ELK at 600GB/day — What We Learned

Context

We operated a cloud gaming and video streaming platform with 20+ services generating high-volume structured logs. The ask was simple: give every engineer full-text search across all logs with sub-second latency, and keep the system running without constant babysitting.

The answer was a Kafka-backed ELK stack. Here's what that looked like in production.

Architecture

Services → Filebeat → Kafka → Logstash → Elasticsearch → Kibana

Kafka sits between Filebeat and Logstash as a buffer. This decouples ingestion rate from processing rate: if Logstash falls behind, the backlog accumulates in Kafka (within its retention limits) instead of stalling the shippers or losing events at the source.
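
To get a feel for how much buffer that implies, here is a back-of-the-envelope sketch. Only the 600GB/day figure comes from our setup. The 6-hour outage window, the replication factor of 3, and the 1.5x headroom are illustrative assumptions, and the calculation uses the daily average rather than peak rate.

```python
# Back-of-the-envelope sizing for the Kafka buffer.
# Only the 600 GB/day figure comes from the text; the outage window,
# replication factor, and headroom are illustrative assumptions.

GB_PER_DAY = 600


def ingest_rate_mb_per_s(gb_per_day: float) -> float:
    """Average ingest rate in MB/s (decimal units, 1 GB = 1000 MB)."""
    return gb_per_day * 1000 / 86_400


def buffer_disk_gb(gb_per_day: float, outage_hours: float,
                   replication_factor: int = 3, headroom: float = 1.5) -> float:
    """Kafka disk needed to hold `outage_hours` of unconsumed logs."""
    backlog_gb = gb_per_day * outage_hours / 24
    return backlog_gb * replication_factor * headroom


if __name__ == "__main__":
    print(f"average ingest: {ingest_rate_mb_per_s(GB_PER_DAY):.1f} MB/s")
    print(f"disk for a 6 h Logstash outage: {buffer_disk_gb(GB_PER_DAY, 6):.0f} GB")
```

Peak traffic on a gaming platform can sit well above the daily average, so a real sizing exercise would plug in the peak rate instead.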

We ran Elasticsearch with dedicated master, data, and coordinating nodes. The hot-tier data nodes used io1 EBS volumes, with an automated ILM policy that moved indices to gp2-backed nodes after 3 days and deleted them after 30.
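
A minimal sketch of what such a policy body could look like, written as a Python dict so it can be serialized and sent to the ILM API. The 3-day and 30-day ages come from our setup. The node attribute name "box_type" and its value "warm" are assumptions for illustration, not our actual configuration.

```python
import json

# Hypothetical names: this would be sent as the body of
# PUT _ilm/policy/<policy-name>, and "box_type" is a custom node
# attribute that we assume tags the gp2-backed nodes as "warm".
ilm_policy = {
    "policy": {
        "phases": {
            "warm": {
                "min_age": "3d",
                "actions": {
                    # Relocate shards onto nodes carrying the assumed attribute.
                    "allocate": {"require": {"box_type": "warm"}},
                },
            },
            "delete": {
                "min_age": "30d",
                "actions": {"delete": {}},
            },
        }
    }
}

print(json.dumps(ilm_policy, indent=2))
```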

What We Got Right

Structured logging from day one. We enforced a JSON logging standard at the application level — timestamp, service name, log level, trace ID, and message as required fields. This made Logstash pipelines trivial and kept Kibana dashboards maintainable.
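
As an illustration of the standard, here is a minimal sketch using Python's standard logging module. The JSON key names (timestamp, service, level, trace_id, message) and the service name "matchmaker" are hypothetical stand-ins for the required fields, not our exact schema.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "service": self.service,
            "level": record.levelname,
            # trace_id is expected via extra={"trace_id": ...}; None if absent.
            "trace_id": getattr(record, "trace_id", None),
            "message": record.getMessage(),
        }
        return json.dumps(payload)


def make_logger(service: str) -> logging.Logger:
    logger = logging.getLogger(service)
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter(service))
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


if __name__ == "__main__":
    log = make_logger("matchmaker")
    log.info("session started", extra={"trace_id": "abc123"})
```

With every service emitting the same keys, the Logstash side can get away with little more than a JSON parse step.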

Kafka as the backbone. This was the right call. Several times Logstash needed a restart for config changes. With Kafka as the buffer, no events were lost during those windows.
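
The reason a restart loses nothing is that the consumer resumes from its last committed offset. The toy model below illustrates the idea in plain Python. It is not Kafka client code, and the class and function names are made up for the example.

```python
class ToyPartition:
    """Append-only log plus one committed offset, loosely modelling a Kafka partition."""

    def __init__(self):
        self.events = []
        self.committed = 0  # next offset the consumer group should read

    def produce(self, event):
        self.events.append(event)

    def poll(self, max_events):
        """Return unread events from the committed offset without committing."""
        return self.events[self.committed:self.committed + max_events]

    def commit(self, count):
        self.committed += count


def drain(partition, sink, batch_size=2):
    """Consume until caught up, committing only after the sink has the batch."""
    while True:
        batch = partition.poll(batch_size)
        if not batch:
            return
        sink.extend(batch)
        partition.commit(len(batch))


if __name__ == "__main__":
    p, indexed = ToyPartition(), []
    for i in range(4):
        p.produce(f"event-{i}")
    drain(p, indexed)            # consumer is caught up
    for i in range(4, 9):        # consumer is down; producers keep writing
        p.produce(f"event-{i}")
    drain(p, indexed)            # restarted consumer resumes at the committed offset
    print(indexed)
```

Committing after the write gives at-least-once delivery: a crash between the write and the commit replays that batch, so duplicates are possible but gaps are not.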

ILM policies. Without index lifecycle management, Elasticsearch quietly fills its disks until it crosses the disk watermarks, writes get blocked, and cluster health turns red. ILM made retention self-managing.

What Bit Us

Shard count. We started with the Elasticsearch default of 5 primary shards per index (the default before version 7.0). At our data volume, daily indices had shards that were too small, creating excessive overhead. We retuned to 1-2 shards per daily index per data node, which cut heap pressure by 40%.
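
The arithmetic behind that can be sketched as follows. The numbers are assumptions chosen for illustration: 20 per-service daily indices sharing 600GB evenly, 30-day retention, and a 30GB target shard size, which sits inside the commonly cited 10 to 50GB range. Our real index layout was less uniform than this.

```python
import math


def primary_shards(index_gb: float, target_shard_gb: float = 30.0) -> int:
    """Smallest primary shard count that keeps shards at or under the target size."""
    return max(1, math.ceil(index_gb / target_shard_gb))


def live_primaries(indices_per_day: int, shards_per_index: int,
                   retention_days: int) -> int:
    """Primary shards the cluster holds once retention is full."""
    return indices_per_day * shards_per_index * retention_days


if __name__ == "__main__":
    # Assumptions: 20 per-service daily indices sharing 600 GB evenly,
    # 30-day retention.
    per_index_gb = 600 / 20
    print("GB per shard at 5 shards:", per_index_gb / 5)
    print("suggested primaries per index:", primary_shards(per_index_gb))
    print("live primaries at 5 shards:", live_primaries(20, 5, 30))
    print("live primaries at 1 shard:", live_primaries(20, 1, 30))
```

Under these assumptions the default layout carries five times as many shards as needed, and every shard costs heap whether or not it holds much data.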

Logstash memory. Logstash is a JVM process with a default heap of 1GB. At 600GB/day we needed 4GB+ per instance and careful GC tuning to avoid stop-the-world pauses that stalled the pipeline.

No query/ingest separation. We ran Kibana queries against the same nodes handling ingestion. Heavy dashboard queries from the ops team occasionally impacted ingest latency. The fix, routing Kibana through its own dedicated coordinating node, was simple once we identified the bottleneck.

The MTBF Number

An MTBF of ~300 hours means roughly one significant incident every 12 to 13 days. That incident rate sounds high, but for a system ingesting at this volume, most incidents were self-inflicted: shard misconfigurations, Logstash OOM from a bad regex, a Kafka consumer group offset reset. The underlying infrastructure was stable.
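
For reference, the conversion is simple arithmetic. The sketch below also shows how MTBF would translate into availability, but that needs a mean time to repair, which we did not track precisely. The 1-hour MTTR used in the example call is purely hypothetical.

```python
HOURS_PER_DAY = 24


def days_between_incidents(mtbf_hours: float) -> float:
    """Average gap between incidents, in days."""
    return mtbf_hours / HOURS_PER_DAY


def incidents_per_period(mtbf_hours: float, period_days: float) -> float:
    """Expected number of incidents over a period of the given length."""
    return period_days * HOURS_PER_DAY / mtbf_hours


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    print("days between incidents:", days_between_incidents(300))
    print("incidents per 30 days:", incidents_per_period(300, 30))
    # mttr_hours=1 is a hypothetical value, not a measured one.
    print(f"availability at 1 h MTTR: {availability(300, 1):.4f}")
```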

What I'd Do Today

For a new build at this scale, I'd evaluate OpenSearch (the open-source Elasticsearch fork started by AWS, also available as a managed AWS service) or Grafana Loki for logs specifically. Loki indexes only labels rather than the full log text, which makes it significantly cheaper to operate than full-text Elasticsearch indexing, at the cost of query flexibility.

The Kafka layer I'd keep — it's the most resilient part of the stack.