Kafka Producer Tuning for High Latency Networks

11 February 2025
  • News
  • Software Engineering

By Austin, Analytics Services Principal Engineer

Our Kafka clusters handle 90 billion events and messages every day; they’re a vital component of our data infrastructure, enabling us to test models and hypotheses rapidly.

One of our application engineering teams recently asked an interesting question: could our Kafka clusters meet the throughput requirements for a new use case when the producer client is located in a remote data centre with a high-latency network link?

A typical Kafka architecture for such a scenario involves deploying a Kafka cluster in each data centre, having clients interact only with their local cluster, and mirroring data between clusters using Mirror Maker.

Baseline Testing

To answer this question, we needed a Kafka producer that could generate messages as quickly as possible. We would use this to establish the baseline throughput for our producer using a local Kafka cluster and then see what impact a high-latency network link has by running the producer in the remote data centre.

The application engineering team intended to use a librdkafka-based client, so it made sense for us to do the same. Like many Platform Engineering teams at G-Research, we use Python as our language of choice, and Python bindings are available for librdkafka.

The message throughput achieved with the first iteration of the producer was underwhelming, just a few megabytes per second. This was due to the cost of generating a fixed-length random string in Python. While it would have been interesting to benchmark different approaches and optimise them, it would have distracted us from our goal. To generate load, it was sufficient for every message to have an identical payload. With this change implemented, message throughput was much better at tens of megabytes per second.
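
As a rough sketch of what the load generator looked like at this point (using the confluent-kafka Python bindings for librdkafka; the broker address, topic name and message count are illustrative placeholders):

    # Minimal load generator: every message carries the same 100-byte payload.
    from confluent_kafka import Producer

    PAYLOAD = b"x" * 100  # identical payload for every message

    producer = Producer({
        "bootstrap.servers": "kafka-local:9092",  # placeholder broker address
        "acks": "all",
    })

    for _ in range(10_000_000):
        while True:
            try:
                producer.produce("latency-test", value=PAYLOAD)
                break
            except BufferError:
                # Local queue is full: wait for deliveries to free up space.
                producer.poll(1)
        producer.poll(0)  # serve delivery callbacks

    producer.flush()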

Additionally, the first iteration used ThreadPoolExecutor to produce multiple messages in parallel. However, the producer was CPU-bound, and Python threads cannot fully utilise multiple CPU cores: the Global Interpreter Lock (GIL) allows only one thread to execute Python code at a time. Using multiprocessing instead of ThreadPoolExecutor overcame this limitation, as each process has its own Python interpreter and memory space, bypassing the GIL entirely.
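
A sketch of the multiprocessing approach, assuming the produce loop above is wrapped in a hypothetical produce_worker(process_id) function: each worker runs in its own OS process with its own interpreter, so the GIL no longer serialises the work.

    # Run N independent producer processes in parallel.
    import multiprocessing

    def produce_worker(process_id: int) -> None:
        # ... the fixed-payload produce loop from the previous sketch ...
        pass

    if __name__ == "__main__":
        num_processes = 32  # matched to the number of topic partitions in our tests
        with multiprocessing.Pool(processes=num_processes) as pool:
            pool.map(produce_worker, range(num_processes))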

Having made these changes, a single process (with acks=all) could write 100-byte messages at 57 megabytes per second and continue to scale as additional processes and topic partitions were added. For example, with 32 processes and partitions, message throughput was 1.28 gigabytes per second.

Remember that these results do not measure the limits of what can be achieved with Apache Kafka. The goal was to generate sufficient load to meaningfully evaluate the impact of network latency on Kafka producer throughput by establishing a baseline for when the producer and Kafka cluster are located in the same data centre.

After establishing this baseline, we ran the producer in the remote data centre. When we did this, throughput plummeted to just 4 megabytes per second.

Buffer Tuning

The size of socket buffers must be increased to maintain high throughput over a high-latency connection. This allows more data to remain in-flight on the network, avoiding bottlenecks caused by waiting for acknowledgements. This principle is tied to the bandwidth-delay product, which calculates the maximum amount of data on the network at any moment (bandwidth x round-trip delay).
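
As a worked example with purely illustrative numbers (not our actual link characteristics), a 10 Gbit/s link with a 30 ms round-trip time needs roughly 37.5 MB in flight to keep the link full; a socket buffer smaller than that caps throughput at roughly the buffer size divided by the round-trip time, regardless of link bandwidth.

    # Bandwidth-delay product for an illustrative 10 Gbit/s link with a 30 ms RTT.
    bandwidth_bits_per_s = 10 * 10**9
    rtt_s = 0.030
    bdp_bytes = bandwidth_bits_per_s * rtt_s / 8
    print(f"{bdp_bytes / 1e6:.1f} MB must be in flight to saturate the link")  # 37.5 MB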

There are two main approaches for tuning socket buffer sizes:

Manual Buffer Sizing

  • Increase the net.core.rmem_max and net.core.wmem_max Linux kernel parameters to set higher limits for receive and send buffers.
  • Align the producer’s send buffer (send.buffer.bytes) and the broker’s receive buffer (socket.receive.buffer.bytes) with these values to ensure both ends take advantage of the increased buffer capacity (see the sketch below).
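
Purely as an illustration of the manual approach (the exact values depend on your own bandwidth-delay product; 96 MB matches the maximum buffer size used in our tests):

    # /etc/sysctl.d/99-kafka-buffers.conf — raise the kernel's per-socket limits
    net.core.rmem_max = 100663296    # 96 MB ceiling for receive buffers
    net.core.wmem_max = 100663296    # 96 MB ceiling for send buffers

    # Producer side (librdkafka names this socket.send.buffer.bytes;
    # the Java client's equivalent is send.buffer.bytes)
    socket.send.buffer.bytes=100663296

    # Broker side (server.properties)
    socket.receive.buffer.bytes=100663296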

Auto-Tuning (Recommended)

  • Leave Kafka’s socket buffer settings at their defaults and let the operating system dynamically adjust buffer sizes based on network conditions.
  • To enable this, ensure that the net.ipv4.tcp_rmem and net.ipv4.tcp_wmem kernel parameters are configured with appropriately high maximum values. These parameters take three values: minimum, default and maximum (e.g. 4096 87380 16777216). An example is shown below.
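
Again as an illustration only, letting the kernel auto-tune buffers up to a 96 MB maximum might look like this (the minimum and default values shown are typical Linux defaults):

    # /etc/sysctl.d/99-kafka-buffers.conf — min, default and max in bytes;
    # the kernel grows each connection's buffers towards the maximum as needed.
    net.ipv4.tcp_rmem = 4096 131072 100663296
    net.ipv4.tcp_wmem = 4096 16384 100663296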

Both methods were tested using a single producer process with a maximum buffer size of 96 MB. Results showed that auto-tuning performed just as well as manual sizing while using memory more efficiently.

Memory usage statistics were collected by running the “ss -mnp” command in a loop on the broker that was the leader for the partition being written to and then grepping for the IP address of the producer.

The skmem_r and skmem_rb fields in the output show the actual amount of memory allocated and the maximum amount that can be allocated, respectively. Note this includes both the user payload and the additional memory Linux needs to process the packet (metadata). Also note that when using manually sized buffers, the maximum amount of memory that can be allocated is double the requested maximum size. This is because the Linux kernel doubles the specified value to allow space for packet metadata.
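
For reference, a minimal way to sample this on the broker (the producer IP address is a placeholder; the skmem_r and skmem_rb values appear in the skmem:(...) portion of the output):

    watch -n 1 "ss -mnp | grep -A1 192.0.2.10"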

In either case, increasing the socket buffers’ size provided a 10x increase in message throughput.

Further testing showed that throughput could be scaled up by increasing the number of producer processes and topic partitions.

To reiterate, socket buffer changes must be made on both the broker and the producer.

Whilst all of our tests so far had used the strongest message durability guarantee that Kafka offers (acks=all), we hadn’t yet tested whether message ordering and exactly-once delivery guarantees would affect the throughput achievable over a high-latency network link.


Delivery Guarantees

Message ordering is guaranteed by setting max.in.flight.requests.per.connection=1 on the producer. Similarly, idempotence is achieved by setting enable.idempotence=true, which caps max.in.flight.requests.per.connection at 5 and so limits the number of in-flight batches the broker needs to track in order to detect duplicates.
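
A sketch of how these two configurations might be expressed with the confluent-kafka Python client (librdkafka property names; the broker address is a placeholder):

    from confluent_kafka import Producer

    # Strict message ordering: only one request in flight at a time.
    ordered_producer = Producer({
        "bootstrap.servers": "kafka-remote:9092",
        "acks": "all",
        "max.in.flight.requests.per.connection": 1,
    })

    # Idempotent producer: duplicates are suppressed, at most five in-flight requests.
    idempotent_producer = Producer({
        "bootstrap.servers": "kafka-remote:9092",
        "acks": "all",
        "enable.idempotence": True,
    })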

In either case, message ordering and exactly-once delivery guarantees are achieved by limiting the number of in-flight requests. Given that throughput over a high-latency link depends on being able to send large amounts of data without stopping to wait for acknowledgements, it stands to reason that throughput will fall when message ordering or exactly-once delivery is required, regardless of any socket buffer tuning. Testing confirmed this.

Message Ordering

Exactly-once delivery

Key Lessons:

High-latency networks require larger buffers

  • Increasing TCP socket buffer sizes is critical for achieving high throughput over high-latency connections. The bandwidth-delay product highlights the need to allow more data to remain in flight at any time.

Auto-tuning is efficient

  • Using the operating system’s auto-tuning mechanism for TCP buffers provides comparable throughput to manual tuning while using memory more efficiently, making it the recommended approach.

Throughput scales with parallelism

  • Using multiple Python processes and partitions can significantly increase Kafka throughput. Scaling horizontally is an effective way to overcome network limitations.

Delivery guarantees have trade-offs

  • Enforcing message ordering and exactly-once delivery dramatically reduces throughput over high-latency networks. When throughput is a priority, carefully consider whether these guarantees are necessary for your use case. If so, consider deploying a local Kafka cluster and using geo-replication to mirror the data.

Baseline testing is crucial

  • Establishing a baseline throughput for local Kafka clusters is an essential first step to identifying and quantifying the impact of network latency and tuning adjustments.
