10 Apache Kafka best practices for data management pros

How can data management teams most effectively deploy and use Apache Kafka in data pipelines and streaming applications? Here are some key guidelines to follow.

Apache Kafka, an increasingly popular distributed streaming platform, can help enterprises create and manage data pipelines and scalable, real-time data streaming applications. At scale, however, these applications involve a lot of moving pieces for ingesting and processing data streams -- and for configuring the client applications that make use of them.

That complexity can make enterprise data teams' jobs even more difficult. Experts and consultants agree that data teams can avoid common Kafka roadblocks by following a set of key strategic guidelines.

That includes the use of automation, which reduces the overhead in launching new instances of the application or fixing problems when they come up. Canaries, full replication and storing Kafka offsets in HBase instead of ZooKeeper can also be game-changers if done right, according to experts.

The following Kafka best practices can help data teams overcome key deployment and management challenges.

Automate deployment

One of the most important and overarching Kafka best practices for IT teams to follow is to "automate, automate, automate," said Gwen Shapira, product manager at Confluent, which offers an event streaming platform built on Kafka. She said she has seen that companies with a strong DevOps culture that efficiently automate Kafka maintenance tasks have fewer incidents and can manage larger-scale deployments with smaller teams.

Every year there are several talks at Kafka Summit on how a small team manages a huge production deployment with 99.95% availability. "People are happy to share their practices, so there is no excuse not to learn from the best," Shapira said.

The open source automation platform Ansible can provide an efficient and simple way to deploy, manage and configure Kafka services, said Veera Budhi, assistant vice president of technology and services at Saggezza, an IT consultancy. Ansible playbooks offer a collection of scripts for automating the deployment of Kafka and associated applications across multiple servers, and Ansible templates allow developers to specify the conditions for a new deployment as variables that are filled in at runtime.

Enterprises should also consider automating parts of the application performance management and monitoring processes, according to Kunal Agarwal, CEO of Unravel Data, a performance monitoring platform. When something goes wrong with an application spanning multiple tools, it can be difficult to determine whether the failure was a Kafka issue, a Spark issue or one of many other potential problems. Manually isolating and diagnosing these types of problems can take days or even weeks of trial and error, digging through raw log data or writing test scripts, Agarwal said.

This kind of strategy can use machine learning and automation to look at runtime parameters and settings for input/output (I/O) consumption, data partitioning, Spark parallelism, Kafka transport metrics, batch time intervals and consumer group details.

Plan for statefulness

Kafka is a stateful service, like a database, meaning it persists data and keeps track of the state of each interaction. Site reliability engineering instincts along the lines of "let's just restart the service" are usually wrong because they were formed on stateless services like web servers, according to Confluent's Shapira.

"If you just randomly restart Kafka machines the way you do with web servers, you can end up with lost data," Shapira said, adding that it's important to read the documentation and learn the best way to restart and upgrade. For Confluent Cloud, Shapira's team wrote a Kubernetes Operator that encodes this specialized knowledge, so the company can hire site reliability engineers without deep Kafka expertise and have them apply their Kubernetes and DevOps skills to run Kafka safely.

Use a canary

One good way to keep an eye on a Kafka cluster is by using a canary, a client that produces and consumes artificial events in order to monitor and test systems. It simulates real user activity to identify problems from the user's perspective, even when the cluster appears to be operating correctly, Shapira said.
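
A minimal canary might look something like the following Java sketch, which produces a timestamped synthetic event and checks that it can be consumed back within a short window. The broker address, topic name and timeout are placeholders, and a real canary would run on a schedule and feed its result into a monitoring system rather than printing it.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class KafkaCanary {
    private static final String TOPIC = "kafka-canary";    // dedicated canary topic (assumed to exist)
    private static final String BROKERS = "broker1:9092";  // hypothetical broker address

    public static void main(String[] args) throws Exception {
        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", BROKERS);
        producerProps.put("key.serializer", StringSerializer.class.getName());
        producerProps.put("value.serializer", StringSerializer.class.getName());

        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", BROKERS);
        consumerProps.put("group.id", "canary-checker");
        consumerProps.put("auto.offset.reset", "latest");
        consumerProps.put("key.deserializer", StringDeserializer.class.getName());
        consumerProps.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps);
             KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
            consumer.subscribe(List.of(TOPIC));
            consumer.poll(Duration.ofSeconds(2));  // join the consumer group before producing

            // Produce a synthetic, timestamped event and wait for the broker to acknowledge it.
            String payload = "canary-" + System.currentTimeMillis();
            producer.send(new ProducerRecord<>(TOPIC, payload)).get();

            // Expect to consume the same event back within a short window; alert if it never arrives.
            boolean seen = false;
            long deadline = System.currentTimeMillis() + 10_000;
            while (!seen && System.currentTimeMillis() < deadline) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    seen = seen || payload.equals(record.value());
                }
            }
            System.out.println(seen ? "canary OK" : "canary FAILED: produce/consume round trip broke");
        }
    }
}
```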

For example, Shapira's team recently ran into a problem when they expanded one of their Kafka clusters, but the canary failed a health check. The brokers -- the term for each node in a Kafka cluster -- were healthy, but it turned out they had run out of available public endpoints from their cloud provider. They immediately worked with the cloud provider to allow provisioning of additional public endpoints.

"It's also important to remember that while the cloud sometimes looks infinitely scalable, cloud providers do impose limits and you need to take them into account," Shapira said.

Proactively filter logs

Your Kafka best practices plan should include keeping only required logs by configuring log parameters, according to Saggezza's Budhi. "Customizing log behavior to match particular requirements will ensure that they don't grow into a management challenge over the long term," Budhi said.

A good practice for log management is to set up your log retention policy, cleanups, compaction and compression activities, Budhi added.
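
As one illustration, retention, cleanup and compression behavior can be configured per topic when the topic is created through Kafka's AdminClient. In the sketch below, the topic name, partition count and specific values are placeholders rather than recommendations.

```java
import java.util.Map;
import java.util.Properties;
import java.util.Set;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicWithRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");  // hypothetical broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // Retention, cleanup and compression policies are attached to the topic at creation time.
            NewTopic topic = new NewTopic("clickstream-events", 12, (short) 3)
                    .configs(Map.of(
                            "retention.ms", "604800000",      // keep messages for 7 days...
                            "retention.bytes", "1073741824",  // ...or until a partition holds 1 GiB
                            "cleanup.policy", "delete",       // delete old segments rather than compact them
                            "compression.type", "lz4"));      // store data compressed on the broker
            admin.createTopics(Set.of(topic)).all().get();    // block until the topic exists
        }
    }
}
```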

Plan the data rate

Ensuring the correct retention space by understanding the data rate of your partitions is another Kafka best practice. The data rate dictates how much retention space -- in bytes -- enterprises need to guarantee retention for a given amount of time.

"It's essential to know the data rate to correctly calculate the retention space needed to meet a time-based retention goal," Budhi said. The data rate also specifies the minimum performance a single consumer needs to support without lagging.

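As a purely illustrative calculation: if a partition receives about 5 MB of data per second and the goal is to retain three days of messages, that partition needs roughly 5 MB/s x 86,400 seconds per day x 3 days, or about 1.3 TB, of retention space -- and any single consumer reading that partition must be able to sustain at least 5 MB/s to avoid falling behind. (The figures are invented to show the arithmetic.)
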
Store Kafka offsets in HBase instead of ZooKeeper

By default, Kafka uses the Apache ZooKeeper coordination service to manage various aspects of the cluster, including the offsets that record a consumer's position in a data stream. Budhi argued that using Apache HBase to store offsets can increase performance, because HBase stores data sorted by row key, and it avoids adding load to the ZooKeeper process so that other services can make use of ZooKeeper's availability.

For large deployments with a lot of consumers, ZooKeeper can become a bottleneck, while Kafka itself can handle that load easily. Moving Kafka offsets to HBase leaves ZooKeeper fully available for all of the other services running in the cluster ecosystem. And by storing offsets in HBase rather than in ZooKeeper or in Kafka itself, Spark Streaming applications can restart and replay messages from any point in time, as long as those messages are still retained in Kafka.
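
A rough sketch of that pattern, using the standard HBase Java client, could keep the last processed offset for each topic-partition in a small HBase table. The table name, column family and row-key scheme below are invented for illustration, and error handling is omitted.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseOffsetStore {
    private static final TableName TABLE = TableName.valueOf("kafka_offsets");  // hypothetical table
    private static final byte[] CF = Bytes.toBytes("o");                        // hypothetical column family

    private final Connection connection;

    public HBaseOffsetStore() throws Exception {
        Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml from the classpath
        this.connection = ConnectionFactory.createConnection(conf);
    }

    /** Persist the last processed offset for a topic-partition after a batch completes. */
    public void saveOffset(String topic, int partition, long offset) throws Exception {
        try (Table table = connection.getTable(TABLE)) {
            Put put = new Put(Bytes.toBytes(topic + ":" + partition));  // row key = topic:partition
            put.addColumn(CF, Bytes.toBytes("offset"), Bytes.toBytes(offset));
            table.put(put);
        }
    }

    /** Read the stored offset so a restarted streaming job can resume (or replay) from it. */
    public long loadOffset(String topic, int partition) throws Exception {
        try (Table table = connection.getTable(TABLE)) {
            Result result = table.get(new Get(Bytes.toBytes(topic + ":" + partition)));
            byte[] value = result.getValue(CF, Bytes.toBytes("offset"));
            return value == null ? 0L : Bytes.toLong(value);  // start from 0 if nothing is stored yet
        }
    }

    public void close() throws Exception {
        connection.close();
    }
}
```

A Spark Streaming job would call loadOffset for each partition at startup to decide where to begin reading and saveOffset at the end of each successfully processed batch.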

Distribute partition leadership among brokers in the cluster

Kafka distributes data and processing across multiple servers in a cluster for fault tolerance and performance. A partition leader requires at least four times as much I/O as the followers, Budhi said.

A leader may also have to read from disk. However, IT teams can configure different servers as leaders for different partitions, which can spread the burden of leadership across different physical servers.
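
Kafka moves leadership back to each partition's preferred replica automatically when the broker setting auto.leader.rebalance.enable is turned on; a preferred-leader election can also be triggered explicitly through the AdminClient, as in the sketch below, where the topic name and partition numbers are placeholders.

```java
import java.util.Properties;
import java.util.Set;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.common.ElectionType;
import org.apache.kafka.common.TopicPartition;

public class RebalanceLeadership {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");  // hypothetical broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // Move leadership back to each partition's preferred replica so the
            // I/O burden of leadership is spread across the brokers in the cluster.
            Set<TopicPartition> partitions = Set.of(
                    new TopicPartition("clickstream-events", 0),
                    new TopicPartition("clickstream-events", 1),
                    new TopicPartition("clickstream-events", 2));
            admin.electLeaders(ElectionType.PREFERRED, partitions).all().get();
        }
    }
}
```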

Retain a low network latency

When forming their Kafka best practices, data teams should ensure that brokers are geographically located in the regions nearest to clients to help with latency concerns, Budhi said. Also, they should consider network performance when selecting instance types offered by cloud providers.

"If bandwidth is holding you back, a greater and more powerful server is the right option for performance," Budhi said.

Try full replication

An acknowledgement is a signal sent back to indicate that data has been received successfully. It's tempting to reduce the number of acknowledgements (acks) required from the replicas in a cluster to boost performance. But Ben Stopford, lead technologist of the office of the CTO at Confluent, recommends that enterprises replicate data across all nodes by setting the producer configuration acks=all, which means a write is acknowledged only after all of the in-sync replicas have received it.

Setting acks=all introduces a little extra latency on each batch of messages, but it doesn't typically affect throughput, Stopford said. One company Stopford worked with made the change, and its applications didn't even notice.
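
A minimal sketch of a producer configured this way might look like the following; the broker address, topic and record contents are placeholders.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

public class FullyReplicatedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");  // hypothetical broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");

        // Wait for every in-sync replica to acknowledge each write before
        // the send is considered successful.
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("orders", "order-123", "created"));  // example record
        }
    }
}
```

Pairing acks=all with the min.insync.replicas topic or broker setting controls how many replicas must actually be in sync for a write to be accepted, which is what gives the acknowledgement its durability guarantee.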

Combine Kafka with other tools

Data managers may want to look at how other data processing platforms and tools can complement Kafka as a kind of connective tissue for modern data pipelines.

"[Kafka] is often employed in conjunction with Spark, MapReduce or Flink for near-real-time or interactive data applications that require reliable streaming data," Unravel Data's Agarwal said. For example, Kafka and Spark Streaming are becoming a common pairing, in which a producer application writes to a Kafka topic, a Spark application subscribes to the topic and consumes the records, and the records are then processed further downstream or saved to a data store.

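A minimal version of that pairing, sketched in Java with Spark Structured Streaming's built-in Kafka source, might look like the following; the broker address, topic and output paths are placeholders.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class KafkaToSpark {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("kafka-consumer-example")
                .getOrCreate();

        // Subscribe to the topic the producer application writes to.
        Dataset<Row> records = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "broker1:9092")  // hypothetical brokers
                .option("subscribe", "clickstream-events")          // hypothetical topic
                .load()
                .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)");

        // Process downstream or persist; here the records are written out as Parquet files.
        StreamingQuery query = records.writeStream()
                .format("parquet")
                .option("path", "/data/clickstream")                // hypothetical output location
                .option("checkpointLocation", "/data/checkpoints/clickstream")
                .start();

        query.awaitTermination();
    }
}
```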