Kafka at Scale: Lessons from Production
February 28, 2024
Apache Kafka has become the de facto standard for building event-driven architectures. However, running Kafka at scale in production comes with unique challenges. Here are some real-world insights from managing clusters that process billions of events daily.
Cluster Sizing and Planning
Broker Configuration
When planning your Kafka cluster:
- Disk I/O: Kafka's performance depends on fast sequential disk writes. Use SSDs or comparably fast local disks for consistent performance
- Network bandwidth: Replication multiplies every write across brokers, so ensure sufficient inter-broker network capacity
- Memory: Keep the JVM heap modest and leave the bulk of RAM to the OS page cache, which Kafka relies on for fast reads
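As a concrete starting point, here is a minimal sketch of a few broker settings that correspond to the checklist above. The values are illustrative assumptions, not recommendations for your workload:

```properties
# server.properties (illustrative values; tune for your hardware)

# Put log directories on SSDs; Kafka depends on fast sequential disk I/O
log.dirs=/var/kafka/data

# Thread counts scale with NIC and disk capacity
num.network.threads=8
num.io.threads=16

# Default retention; individual topics can override this
log.retention.hours=168
```

On the memory side, a common practice is a fixed, modest heap (for example, `KAFKA_HEAP_OPTS="-Xms6g -Xmx6g"`), leaving the remaining RAM to the page cache.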
Topic Design
Design your topics carefully:
- Partition count sets the ceiling on consumer parallelism: a consumer group can run at most one active consumer per partition
- A replication factor of 3 is standard for production
- Choose retention policies (time-based or size-based) to match your use case
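One way to reason about partition count is to size for both target throughput and consumer parallelism. A back-of-the-envelope sketch, where the throughput figures are assumptions you would replace with measurements from your own cluster:

```python
import math

def suggest_partitions(target_mb_per_s: float,
                       per_partition_mb_per_s: float,
                       max_consumers: int) -> int:
    """Rough partition-count estimate: enough partitions to carry the
    target throughput, and at least one per consumer in the group."""
    for_throughput = math.ceil(target_mb_per_s / per_partition_mb_per_s)
    return max(for_throughput, max_consumers)

# e.g. 100 MB/s target, ~10 MB/s per partition, 12 consumers -> 12 partitions
print(suggest_partitions(100, 10, 12))
```

Note that this only gives a floor; repartitioning later is disruptive, so many teams add headroom for growth.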
Performance Optimization
Producer Configuration
Key producer settings:
- Batch size: Larger batches (batch.size, together with linger.ms) improve throughput but increase latency
- Compression: Use compression (snappy, lz4, or gzip) to reduce network traffic
- Acks: acks=all maximizes durability; acks=1 trades durability for throughput
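Putting those three knobs together, here is a sketch of a throughput-oriented producer configuration using Kafka's standard property names. The specific values are illustrative assumptions, not universal recommendations:

```python
# Standard Kafka producer property names; values are illustrative only.
producer_config = {
    "batch.size": 65536,        # bytes per batch; larger -> better throughput
    "linger.ms": 20,            # wait up to 20 ms to fill a batch (adds latency)
    "compression.type": "lz4",  # snappy, lz4, or gzip all cut network traffic
    "acks": "all",              # strongest durability; "1" favors throughput
}
```

A latency-sensitive service would flip these the other way: small batch.size, linger.ms of 0, and acks=1.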
Consumer Configuration
Optimize consumers:
- Fetch size: Tune fetch.min.bytes and fetch.max.wait.ms
- Consumer groups: Design groups to match your parallelism needs
- Offset management: Choose between automatic and manual offset commits
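The consumer side looks similar. A sketch of a throughput-oriented consumer configuration with manual offset commits, again using standard property names with illustrative values (the group name is a made-up example):

```python
# Standard Kafka consumer property names; values are illustrative only.
consumer_config = {
    "group.id": "orders-processors",  # consumers in a group share partitions
    "fetch.min.bytes": 1048576,       # broker waits for 1 MB of data...
    "fetch.max.wait.ms": 500,         # ...or 500 ms, whichever comes first
    "enable.auto.commit": False,      # commit offsets manually after processing
}
```

Raising fetch.min.bytes trades latency for fewer, larger fetches; disabling auto-commit avoids marking records as consumed before they are actually processed.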
Monitoring and Operations
Key Metrics to Monitor
- Throughput: Messages per second
- Latency: End-to-end latency from producer to consumer
- Replication health: Track under-replicated partitions and follower lag
- Disk usage: Track partition sizes and retention
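Consumer lag and follower lag both reduce to the same arithmetic: the distance between where the log ends and where a consumer (or follower) has reached. A minimal sketch, with made-up offset numbers for illustration:

```python
def consumer_lag(log_end_offsets: dict, committed_offsets: dict) -> int:
    """Total lag across partitions: log-end offset minus committed offset."""
    return sum(
        log_end_offsets[p] - committed_offsets.get(p, 0)
        for p in log_end_offsets
    )

# Partition -> offset maps, e.g. scraped from kafka-consumer-groups output
end = {0: 1_000, 1: 2_000, 2: 1_500}
committed = {0: 990, 1: 1_980, 2: 1_500}
print(consumer_lag(end, committed))  # 10 + 20 + 0 = 30
```

In practice you would alert not just on absolute lag but on whether it is growing, which signals that consumers cannot keep up.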
Common Pitfalls
- Over-partitioning: Too many partitions can cause performance issues
- Under-replication: Ensure all partitions are properly replicated
- Consumer lag: Monitor and alert on consumer lag spikes
- Disk space: Set up alerts for disk usage
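The disk-space pitfall is easy to sanity-check up front: retained bytes grow with write throughput, retention window, and replication factor. A back-of-the-envelope sketch, where all the input figures are assumptions:

```python
def retained_bytes_mb(mb_per_s: float, retention_days: float,
                      replication_factor: int) -> float:
    """Approximate cluster-wide disk usage for one topic, in MB
    (ignores compression and index overhead)."""
    return mb_per_s * 86_400 * retention_days * replication_factor

# e.g. 10 MB/s for 7 days at replication factor 3
total_mb = retained_bytes_mb(10, 7, 3)
print(f"{total_mb / 1_048_576:.1f} TB")  # roughly 17 TB across the cluster
```

Running this math per topic before setting retention makes disk alerts a safety net rather than the first warning.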
Conclusion
Running Kafka at scale requires careful planning, continuous monitoring, and iterative tuning. By following these practices, you can build robust event streaming systems that handle billions of events a day while keeping latency predictable.