Iterative Performance Benchmarking of Apache Kafka – Part 2

In Part 1 of this series of posts on the subject of performance benchmarking Apache Kafka for the purpose of hotspot analysis the profiling of the codebase was confined solely to the server side element. In this follow-up, a similar benchmark script and approach is used but instead of adaptively instrumenting and profiling the broker we now turn attention to the producer of the benchmark load – the client-side Java library. Both the server (broker) and producer scripts look up the KAFKA_OPTS environment variable so to change from server side to client side profiling all that is needed is for the environment variable to be set differently – in another order. Everything else stays pretty much the same. The script still loops over the benchmark a number of times and with each iteration, it redefines the coverage of the instrumentation.


After 10 iterations of the benchmark with hotspot thresholds of 5 and 1 microseconds, for inclusive and exclusive times respectively, the instrumentation set is reduced from 1000s of methods to less than 10.


Increasing (10x) the benchmark parameters -num-records from 10000000 to 100000000 and --throughput from 100000 to 1000000 results in a slightly different performance model following a complete re-execution of the benchmark iterations. Strangely enough, we’ve lost both the poll methods in this run. They were eliminated during the second iteration. This can happen when probes, methods, are on the borderline of a threshold or the execution cost has multi-phase characteristics. There are sophisticated ways to deal with such cases but that’s for another post.


The above hotspot list is a bit on the small side. A more typical list size falls in the range of 25-50 probes. Please note that the smaller the list the greater the overstatement of the exclusive (inherent/self) time is as both direct and indirect callee method invocations are removed from the instrumentation set and measurement collection with such timing moving upwards back through the caller chain. The reality is very simple you can’t have all (coverage) the cake (instrument) and eat (measure) it all (collection). You, more specifically the machine never the man, need to be very selective. In doing so you benefit as the final list of hotspots revealed are those points at which it is most suitable to inject some form flow control or an actual code change. Conversely the smaller the list the more accurate the inclusive time is.

In the metering table columns “(I)” stands for inherent, not inclusive. Inherent (self) is the time that’s exclusive to the probe, at least in terms of what was instrumented and measured.

Let’s rerun the benchmark again this time with an inclusive hotspot threshold set to 2. The exclusive threshold tends to be much bigger than the inclusive threshold in order to prune away leafs nodes in the call tree so as to expose the most suitable points for code changes which are not necessarily the actual bottleneck points. An example would be a loop making a relative inexpensive call N number of times. This can be seen in the table below with the KafkaProducer.doSend now appearing as a hotspot.


The list of hotspots can be increased by reducing further the exclusive hotspot threshold from 2 microseconds to 1, making both exclusive and inclusive thresholds the same. Normally this is not recommended when doing an ad hoc once-off profile of a system as it takes a much longer time for the adaptive metering strategy to filter out instrumentation and measurement noise within the Java runtime. But with the iterative refinement of the code coverage across multiple benchmarks there is no need to be overly concerned for any possible overhead residue caused by the instrumentation and measurement of non-hotspots though that is not to say some of the timings will not be overstated slightly which is why it is far better to focus on relative rankings and not actual numbers. Start at the top and work down through the list of hotspot probes by inspecting the source code for each in turn.


Looking over, the source code for the hotspots identified by Satoris there are a number of common cost concerns: synchronization, allocation, and logging.


org.apache.kafka.clients.producer.internals.RecordAccumulator.append #synchronized #logging #allocation #cas


org.apache.kafka.clients.producer.KafkaProducer.doSend #waiting #logging #allocation


org.apache.kafka.clients.producer.internals.RecordBatch.done #logging #looping #allocation #readrepeat


org.apache.kafka.clients.producer.internals.RecordAccumulator.abortExpiredBatches #allocation #looping #synchronized #logging


org.apache.kafka.clients.producer.internals.RecordAccumulator.ready #allocation #looping #synchronized


org.apache.kafka.clients.producer.internals.Sender.completeBatch #logging


org.apache.kafka.common.requests.RequestSend.serialize #allocation #nio



The following visualizations are useful for those curious to how the performance model is refined over the 10 iterations in the last benchmark run via the online and offline adaptive mechanism and at what point in the process does the instrumentation and measurement remain relatively constant.