
Kafka at Scale: Lessons from Production

February 28, 2024
10 min read

Apache Kafka has become the de facto standard for building event-driven architectures. However, running Kafka at scale in production comes with unique challenges. Here are some real-world insights from managing clusters that process billions of events daily.

Cluster Sizing and Planning

Broker Configuration

When planning your Kafka cluster:

  • Disk I/O: Kafka is disk-intensive; use SSDs for lower, more predictable latency, though its mostly sequential writes also perform reasonably on spinning disks
  • Network bandwidth: Ensure sufficient network capacity for replication
  • Memory: Allocate enough heap memory for the JVM, but leave room for OS page cache
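The heap-versus-page-cache split can be made concrete with a small heuristic. The figures below (cap the heap around 6 GB, or a quarter of RAM on small boxes, and leave the rest to the OS page cache that Kafka's sequential I/O depends on) are common rules of thumb, not hard requirements; benchmark against your own workload.

```python
def broker_memory_plan(total_ram_gb: int, heap_cap_gb: int = 6) -> dict:
    """Split a broker's RAM between JVM heap and OS page cache.

    Heuristic only: Kafka brokers rarely benefit from heaps much beyond
    ~6 GB; remaining RAM is best left to the OS page cache, which Kafka's
    sequential reads and writes rely on heavily.
    """
    heap_gb = min(heap_cap_gb, max(1, total_ram_gb // 4))
    return {"jvm_heap_gb": heap_gb, "page_cache_gb": total_ram_gb - heap_gb}
```

On a 64 GB machine this yields a 6 GB heap and leaves 58 GB to the page cache.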

Topic Design

Design your topics carefully:

  • Partition count should match your consumer parallelism needs
  • Replication factor of 3 is standard for production
  • Consider retention policies based on your use case
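A simple way to pick a partition count is to take the larger of two numbers: enough partitions to hit your throughput target, and at least one per consumer in the group so nobody sits idle. This is a sketch; the per-partition throughput figure must come from your own benchmarks.

```python
import math

def partition_count(target_mb_per_s: float,
                    per_partition_mb_per_s: float,
                    consumer_count: int) -> int:
    """Partitions needed for throughput, floored at one per consumer.

    per_partition_mb_per_s should be measured on your hardware; it is
    not a constant.
    """
    for_throughput = math.ceil(target_mb_per_s / per_partition_mb_per_s)
    return max(for_throughput, consumer_count)
```

For example, a 100 MB/s target with partitions benchmarked at 10 MB/s and a 12-consumer group needs 12 partitions (consumer parallelism dominates), while a 4-consumer group needs 10 (throughput dominates).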

Performance Optimization

Producer Configuration

Key producer settings:

  • Batch size: Larger batches (batch.size, together with a small linger.ms) improve throughput but increase latency
  • Compression: Use compression (snappy, lz4, or gzip) to reduce network traffic and disk usage
  • Acks: acks=all maximizes durability; acks=1 trades some safety for lower latency
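The producer trade-offs above can be captured in one place. The property names below are the confluent-kafka (librdkafka) spellings and the values are illustrative, not recommendations; tune them against your own traffic.

```python
def producer_config(bootstrap_servers: str, durable: bool = True) -> dict:
    """Illustrative producer settings for the throughput/latency and
    durability/performance trade-offs (librdkafka property names)."""
    return {
        "bootstrap.servers": bootstrap_servers,
        "batch.size": 131072,        # larger batches -> higher throughput
        "linger.ms": 10,             # wait up to 10 ms to fill a batch
        "compression.type": "lz4",   # modest CPU cost, big network savings
        "acks": "all" if durable else "1",  # durability vs. latency
    }
```

A latency-sensitive service might call `producer_config(servers, durable=False)` and shrink linger.ms toward 0.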

Consumer Configuration

Optimize consumers:

  • Fetch size: Tune fetch.min.bytes and fetch.max.wait.ms
  • Consumer groups: Design groups to match your parallelism needs
  • Offset management: Choose between automatic and manual offset commits
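A consumer-side sketch tying these knobs together, again using librdkafka property names (note librdkafka calls the fetch wait setting fetch.wait.max.ms, whereas the Java client calls it fetch.max.wait.ms); values are illustrative:

```python
def consumer_config(bootstrap_servers: str, group_id: str) -> dict:
    """Illustrative consumer settings: batched fetches plus manual
    offset commits for at-least-once processing."""
    return {
        "bootstrap.servers": bootstrap_servers,
        "group.id": group_id,
        "fetch.min.bytes": 65536,     # wait for 64 KiB per fetch...
        "fetch.wait.max.ms": 500,     # ...or 500 ms, whichever comes first
        "enable.auto.commit": False,  # commit manually after processing
    }

# Typical poll loop (needs a running broker; sketch only, `handle` is
# a hypothetical processing function):
#
# from confluent_kafka import Consumer
# c = Consumer(consumer_config("localhost:9092", "orders"))
# c.subscribe(["orders"])
# while True:
#     msg = c.poll(1.0)
#     if msg is None or msg.error():
#         continue
#     handle(msg)     # process first...
#     c.commit(msg)   # ...then commit, for at-least-once delivery
```

Committing only after processing means a crash can replay messages but never silently drop them, which is usually the right default.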

Monitoring and Operations

Key Metrics to Monitor

  1. Throughput: Messages per second
  2. Latency: End-to-end latency from producer to consumer
  3. Replication lag: Monitor follower lag
  4. Disk usage: Track partition sizes and retention
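Of these, consumer/replication lag is the one most teams compute themselves: it is simply the log end offset minus the committed (or follower) offset, summed over partitions. A minimal sketch:

```python
def consumer_lag(end_offsets: dict, committed: dict) -> int:
    """Total consumer lag across partitions.

    end_offsets and committed map (topic, partition) -> offset; a
    partition with no committed offset counts as lagging from 0.
    """
    return sum(end - committed.get(tp, 0) for tp, end in end_offsets.items())
```

In production you would feed this from the admin/consumer APIs rather than hand-built dicts, and alert on the trend rather than the absolute value.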

Common Pitfalls

  1. Over-partitioning: Too many partitions increase leader-election time, open file handles, and end-to-end latency
  2. Under-replication: Alert whenever the under-replicated partitions count is non-zero; it usually signals a struggling broker
  3. Consumer lag: Monitor and alert on consumer lag spikes
  4. Disk space: Set up alerts for disk usage
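The disk-space pitfall is easy to guard against with tiered alerting. The 75%/90% thresholds below are illustrative; pick values that give you enough runway to expand storage or shorten retention before a broker fills up.

```python
def disk_alerts(usage: dict, warn: float = 0.75, crit: float = 0.90) -> dict:
    """Classify per-broker disk usage (fraction of capacity used)
    into alert levels. Thresholds are illustrative defaults."""
    def level(u: float) -> str:
        return "critical" if u >= crit else "warning" if u >= warn else "ok"
    return {broker: level(u) for broker, u in usage.items()}
```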

Conclusion

Running Kafka at scale requires careful planning, continuous monitoring, and iterative optimization. By following these practices, you can build robust event streaming systems that handle billions of events with minimal latency.