
Apache Kafka Best Practices

Apache Kafka is a distributed streaming platform that is used to build real-time streaming data pipelines and applications that adapt to data streams. Streaming data is data that is continuously generated by thousands of data sources, which typically send the data records simultaneously.

Apache Kafka is a widely popular distributed streaming platform that thousands of companies like New Relic, Uber, and Square use to build scalable, high-throughput, and reliable real-time streaming systems. For example, the production Kafka cluster at New Relic processes more than 15 million messages per second for an aggregate data rate approaching 1 Tbps.

Below are some of the best practices:

ZooKeeper:

  • Do not co-locate ZooKeeper nodes on the same machines as the Kafka brokers.
  • Isolate the ZooKeeper ensemble and use it only for Kafka; no other systems should depend on this ZooKeeper cluster.
  • Make sure you allocate a sufficient JVM heap; a good starting point is 4 GB.
  • Monitor: use JMX metrics to monitor the ZooKeeper instances (see the sketch below).
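ZooKeeper exposes its health metrics over JMX. As a rough illustration only, the Java sketch below connects to a remote JMX endpoint and lists the registered ZooKeeper MBeans as a starting point for monitoring; the host, port, and MBean naming pattern are assumptions for the example (they vary by ZooKeeper version and deployment mode), so verify them against your own installation.

    import java.util.Set;
    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class ZooKeeperJmxCheck {
        public static void main(String[] args) throws Exception {
            // Hypothetical endpoint; assumes ZooKeeper was started with remote JMX enabled on this port.
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://zk-host:9998/jmxrmi");
            try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
                MBeanServerConnection mbs = connector.getMBeanServerConnection();
                // List the ZooKeeper MBeans; exact names depend on version and standalone vs. quorum mode.
                Set<ObjectName> names = mbs.queryNames(
                        new ObjectName("org.apache.ZooKeeperService:*"), null);
                for (ObjectName name : names) {
                    System.out.println(name);
                }
            }
        }
    }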

Brokers:

Compacted topics require memory and CPU resources on your brokers:

Log compaction needs both heap (memory) and CPU cycles on the brokers to complete successfully, and failed log compaction puts brokers at risk from a partition that grows unbounded. You can tune log.cleaner.dedupe.buffer.size and log.cleaner.threads on your brokers, but keep in mind that these values affect heap usage on the brokers.

If a broker throws an OutOfMemoryError exception, it will shut down and potentially lose data. The buffer size and thread count will depend on both the number of topic partitions to be cleaned and the data rate and key size of the messages in those partitions. As of Kafka version 0.10.2.1, monitoring the log-cleaner log file for ERROR entries is the surest way to detect issues with log cleaner threads.
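In recent Kafka versions, log.cleaner.threads and log.cleaner.dedupe.buffer.size are among the broker configs that can be updated dynamically cluster-wide. The sketch below is a minimal illustration using the AdminClient, assuming a client version that supports incrementalAlterConfigs (roughly 2.3 or later); the bootstrap address and the value chosen are placeholders, not recommendations.

    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.AlterConfigOp;
    import org.apache.kafka.clients.admin.ConfigEntry;
    import org.apache.kafka.common.config.ConfigResource;

    public class TuneLogCleaner {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder address
            try (Admin admin = Admin.create(props)) {
                // An empty broker name targets the cluster-wide dynamic default config.
                ConfigResource cluster = new ConfigResource(ConfigResource.Type.BROKER, "");
                AlterConfigOp setThreads = new AlterConfigOp(
                        new ConfigEntry("log.cleaner.threads", "2"), AlterConfigOp.OpType.SET);
                admin.incrementalAlterConfigs(Map.of(cluster, List.of(setThreads))).all().get();
            }
        }
    }
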
Monitor your brokers for network throughput:

Make sure to do this with both transmit (TX) and receive (RX), as well as disk I/O, disk space, and CPU usage. Capacity planning is a key part of maintaining cluster performance.

Distribute partition leadership among brokers in the cluster:

Leadership requires a lot of network I/O resources. For example, when running with replication factor 3, a leader must receive the partition data, transmit two copies to replicas, plus transmit to however many consumers want to consume that data. So, in this example, being a leader is at least four times as expensive as being a follower in terms of the network I/O used. Leaders may also have to read from disk; followers only write.
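To see how leadership is actually distributed today, you can count partition leaders per broker with the AdminClient. This is a rough sketch rather than anything from the original article: the bootstrap address is a placeholder, and allTopicNames() assumes a reasonably recent client (older clients use all()).

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.TopicDescription;

    public class LeaderBalanceReport {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder
            try (Admin admin = Admin.create(props)) {
                var topicNames = admin.listTopics().names().get();
                Map<String, TopicDescription> topics =
                        admin.describeTopics(topicNames).allTopicNames().get();
                Map<Integer, Integer> leadersPerBroker = new HashMap<>();
                topics.values().forEach(td -> td.partitions().forEach(p -> {
                    if (p.leader() != null) {
                        leadersPerBroker.merge(p.leader().id(), 1, Integer::sum);
                    }
                }));
                leadersPerBroker.forEach((broker, count) ->
                        System.out.println("broker " + broker + " leads " + count + " partitions"));
            }
        }
    }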

Don’t neglect to monitor your brokers for in-sync replica (ISR) shrinks, underreplicated partitions, and unpreferred leaders:

These are signs of potential problems in your cluster. For example, frequent ISR shrinks for a single partition can indicate that the data rate for that partition exceeds the leader’s ability to service the consumer and replica threads.
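These conditions surface through broker JMX metrics such as kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions and kafka.server:type=ReplicaManager,name=IsrShrinksPerSec. You can also spot under-replicated partitions directly from topic metadata; the sketch below is a minimal illustration with a placeholder bootstrap address.

    import java.util.Properties;
    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.AdminClientConfig;

    public class UnderReplicatedCheck {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder
            try (Admin admin = Admin.create(props)) {
                var names = admin.listTopics().names().get();
                admin.describeTopics(names).allTopicNames().get().forEach((topic, desc) ->
                    desc.partitions().forEach(p -> {
                        // A partition is under-replicated when its ISR is smaller than its replica set.
                        if (p.isr().size() < p.replicas().size()) {
                            System.out.printf("under-replicated: %s-%d (isr=%d, replicas=%d)%n",
                                    topic, p.partition(), p.isr().size(), p.replicas().size());
                        }
                    }));
            }
        }
    }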

Modify the Apache Log4j properties as needed:

Kafka broker logging can use an excessive amount of disk space. However, don’t forgo logging completely — broker logs can be the best, and sometimes only, way to reconstruct the sequence of events after an incident.

Either disable automatic topic creation or establish a clear policy regarding the cleanup of unused topics:

For example, if no messages are seen for x days, consider the topic defunct and remove it from the cluster. This will
avoid the creation of additional metadata within the cluster that you’ll have to manage.
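As one way to automate such a policy, the sketch below checks the newest record timestamp in a topic and deletes the topic if nothing has been produced within the chosen window. It assumes a Kafka 3.0+ client and broker (for OffsetSpec.maxTimestamp); the topic name, bootstrap address, and 30-day threshold are placeholders, and in practice you would add more safeguards before deleting anything.

    import java.time.Duration;
    import java.time.Instant;
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import java.util.stream.Collectors;
    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.OffsetSpec;
    import org.apache.kafka.common.TopicPartition;

    public class StaleTopicCheck {
        public static void main(String[] args) throws Exception {
            String topic = "example-topic";          // placeholder topic name
            Duration maxIdle = Duration.ofDays(30);  // the "x days" policy threshold
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder
            try (Admin admin = Admin.create(props)) {
                var description = admin.describeTopics(List.of(topic))
                        .allTopicNames().get().get(topic);
                Map<TopicPartition, OffsetSpec> request = description.partitions().stream()
                        .collect(Collectors.toMap(
                                p -> new TopicPartition(topic, p.partition()),
                                p -> OffsetSpec.maxTimestamp()));
                long newest = admin.listOffsets(request).all().get().values().stream()
                        .mapToLong(info -> info.timestamp())  // -1 if a partition has no records
                        .max()
                        .orElse(-1L);
                if (newest < Instant.now().minus(maxIdle).toEpochMilli()) {
                    System.out.println("Topic looks defunct; deleting: " + topic);
                    admin.deleteTopics(List.of(topic)).all().get();
                }
            }
        }
    }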

For sustained, high-throughput brokers, provision sufficient memory to avoid reading from the disk subsystem:

Partition data should be served directly from the operating system’s file system cache whenever possible. However,
this means you’ll have to ensure your consumers can keep up; a lagging consumer will force the broker to read from disk.
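Since a lagging consumer is what forces those disk reads, it helps to check lag explicitly. The sketch below compares a consumer group's committed offsets with the current log end offsets; the group ID and bootstrap address are placeholders.

    import java.util.Map;
    import java.util.Properties;
    import java.util.stream.Collectors;
    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.OffsetSpec;
    import org.apache.kafka.clients.consumer.OffsetAndMetadata;
    import org.apache.kafka.common.TopicPartition;

    public class ConsumerLagCheck {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder
            String groupId = "example-consumer-group";                             // placeholder
            try (Admin admin = Admin.create(props)) {
                Map<TopicPartition, OffsetAndMetadata> committed =
                        admin.listConsumerGroupOffsets(groupId)
                             .partitionsToOffsetAndMetadata().get();
                Map<TopicPartition, OffsetSpec> latestSpec = committed.keySet().stream()
                        .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest()));
                var endOffsets = admin.listOffsets(latestSpec).all().get();
                committed.forEach((tp, offset) -> {
                    long lag = endOffsets.get(tp).offset() - offset.offset();
                    System.out.println(tp + " lag = " + lag);
                });
            }
        }
    }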

For a large cluster with high-throughput service level objectives (SLOs), consider isolating topics to a subset of brokers:

How you determine which topics to isolate will depend on the needs of your business. For example, if you have multiple online transaction processing (OLTP) systems using the same cluster, isolating the topics for each system to distinct subsets of brokers can help to limit the potential blast radius of an incident.
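One way to implement this kind of isolation is to create the topic with an explicit replica assignment so that all of its replicas live on a chosen subset of brokers. The sketch below is a minimal illustration; the topic name, broker IDs, and partition layout are all placeholders.

    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;

    public class IsolatedTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder
            // Keep every replica of this topic on brokers 10, 11, and 12 (hypothetical IDs).
            Map<Integer, List<Integer>> assignment = Map.of(
                    0, List.of(10, 11, 12),
                    1, List.of(11, 12, 10),
                    2, List.of(12, 10, 11));
            try (Admin admin = Admin.create(props)) {
                admin.createTopics(List.of(new NewTopic("oltp-system-a-events", assignment)))
                     .all().get();
            }
        }
    }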

Using older clients with newer topic message formats, and vice versa, places extra load on the brokers as they convert the formats on behalf of the client. Avoid this whenever possible.
Don’t assume that testing a broker on a local desktop machine is representative of the performance you’ll see in production. Testing over a
loopback interface to a partition using replication factor 1 is a very different topology from most production environments.

Network latency is negligible over the loopback interface, and the time required to receive leader acknowledgments is very different when no replication is involved.

Producers:

Configure retries on your producers. The default value is 3, which is often too low. The right value will depend on your application; for applications where data loss cannot be tolerated, consider Integer.MAX_VALUE (effectively, infinity). This guards against situations where the broker leading the partition isn’t able to respond to a produce request right away.
For high-throughput producers, tune buffer sizes, particularly buffer.memory and batch.size (which is counted in bytes).
Because batch.size is a per-partition setting, producer performance and memory usage can be correlated with the number of partitions in the topic.
The values here depend on several factors: producer data rate (both the size and number of messages), the number of partitions you are producing to, and the amount of memory you have available. Keep in mind that larger buffers are not always better because if the producer stalls for some reason (say, one leader is slower to respond with acknowledgments), having more data buffered on the heap could result in more garbage collection.
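Putting the retry and buffer guidance together, a producer configuration might look like the sketch below. The specific values are placeholders to tune against your own data rate and partition count, and the acks setting is included only as a common companion to high retries, not something prescribed above.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class TunedProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");  // placeholder
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);  // for applications that cannot lose data
            props.put(ProducerConfig.ACKS_CONFIG, "all");                 // wait for in-sync replica acknowledgment
            props.put(ProducerConfig.BATCH_SIZE_CONFIG, 65536);           // bytes, per partition; placeholder value
            props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, 67108864L);    // total buffer in bytes; placeholder value
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("example-topic", "key", "value"));
            }
        }
    }
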
Instrument your application to track metrics such as the number of produced messages, average produced message size, and the number of consumed messages.
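The Java producer already exposes such counters through its metrics() map, which you can forward to your monitoring system. This fragment assumes the producer instance from the previous sketch; metric names like record-send-total and record-size-avg are standard producer metrics, but verify them against your client version.

    // Reading the producer's built-in metrics; "producer" is the KafkaProducer from the sketch above.
    producer.metrics().forEach((name, metric) -> {
        if (name.name().equals("record-send-total") || name.name().equals("record-size-avg")) {
            System.out.println(name.name() + " = " + metric.metricValue());
        }
    });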
