Performance Benchmarking and Hotspot Analysis of Linkerd – Part 1

This is the first in a series of articles examining the software performance of Linkerd, a recent addition to the Cloud Native Computing Foundation that offers a transparent proxy adding service discovery, routing, failure handling, and visibility to modern software applications.

The purpose of this initial post, and the benchmark it describes, is to identify the unavoidable overhead cost and code hotspots of pass-through HTTP traffic. As this project is largely a repackaging (in terms of this initial benchmark) of two battle-tested and heavily tuned projects, Finagle and Netty 3.x, profiling the codebase calls for an iterative and adaptive approach to instrumentation and measurement.

The following benchmark script repeatedly executes a profiled test run, with each iteration refining the instrumentation set based on the hotspot execution analysis of the previous test run. In the first section of the script, after setting environment variables, exporting LOCAL_JVM_OPTIONS, and reading optional workload command arguments, an nginx server is started.
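The opening section of such a script might look like the following sketch. All variable names, file paths, and the agent jar name here are assumptions based on the description, not the actual benchmark script.

```shell
# Hypothetical sketch of the benchmark script's opening section.
QPS=${1:-1000}         # -q: requests per second per stress worker (assumed default)
CONCURRENCY=${2:-1}    # -c: number of concurrent stress workers (assumed default)

# Point the linkerd launcher at the Satoris instrumentation agent
# (jar name and path are assumptions).
export LOCAL_JVM_OPTIONS="-javaagent:$PWD/satoris-javaagent.jar"

# Start the nginx backend; config path is an assumption.
# nginx -c "$PWD/nginx.conf"

echo "workload: qps=$QPS concurrency=$CONCURRENCY"
```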

What the Satoris profiling agent learns about performance hotspots in one test iteration is carried over into the next, with the goal of eliminating unnecessary costs and improving both measurement accuracy and model relevance. Next, the JVM instrumentation agent library is copied into the working directory referenced in the exported LOCAL_JVM_OPTIONS environment variable and picked up by the linkerd execution script. The linkerd server is started twice in each of the benchmark's test iterations: the first start incurs some bytecode instrumentation startup costs, which are eliminated in the second execution once the instrumented code cache, shared across process executions, has been populated. After the linkerd server is started with the agent installed and the code instrumented, a stress test tool, slow_cooker, sends request traffic to the nginx server via the linkerd proxy server. The benchmark script executes a total of seven test iterations, each refining the instrumentation.
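The stress step of each iteration might be invoked roughly as follows. The -qps, -concurrency, and -totalRequests flag names follow slow_cooker's documented options; the proxy port and the 500,000 request total mentioned later in this article are assumptions in this sketch, which only prints the command rather than running it.

```shell
# Hypothetical slow_cooker invocation for one test iteration; printed
# rather than executed so the sketch is self-contained.
run_stress() {
  echo "slow_cooker -qps $1 -concurrency $2 -totalRequests 500000 http://localhost:8080"
}
run_stress 1000 1
```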

After the stress tool has completed the workload, the linkerd server is stopped. This exports a performance metering model, a snapshot file, to the filesystem, which the Satoris console library then uses to generate a new version of the agent library with a revised instrumentation set based on the methods labeled within the model as hotspot and !managed. This step is the same for the first five test iterations. In the sixth test iteration, the agent's online adaptation of the instrumentation is disabled, further reducing the already minimal measurement costs. In the last iteration, the tracking metering extension is enabled to collect the typical call tree profiling view used (and somewhat abused) by developers in finding performance bottlenecks. It is safe to say performance measurement is taken very seriously.

Below is the yaml file referenced in the above script when starting the linkerd server, with the nginx addition made to route inbound requests on port 8080 to port 9999. The nginx server was configured to disable all access logging, reducing static page request processing times as well as conserving disk space. In a follow-up benchmark article exploring features of linkerd, a Java service will be used as the backend.
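A minimal nginx server block matching that description might look like the following; the port and the disabled access logging come from the text, while the document root is an assumption.

```nginx
# Sketch of the backend nginx configuration described above;
# the document root is an assumption.
server {
    listen 9999;
    access_log off;                  # disable all access logging

    location / {
        root /usr/share/nginx/html;  # static page being served
    }
}
```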

The following package and class level instrumentation filters were added to the jxinsight.aspectj.filters.config file in the $CONF directory referenced above. Following the first test iteration, the agent generates a method level configuration file and packages it with the revised copy of itself maintained in the benchmark's working directory. Packaging agent code and configuration together eases deployment when targeting multiple JVMs that all run and perform identically, as is the case for a standard large-scale benchmark.

# the following packages are included in instrumentation set
# the following classes are excluded from the above instrumentation set

In the jxinsight.override.config file, the hotspot metering threshold for an averaged method execution was set to 5 microseconds for inclusive time and 1 microsecond for exclusive (self/inherent) time. In future benchmarks, these hotspot thresholds will be further refined and tuned with additional options.
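To make the thresholds concrete, the following sketch classifies probes from a made-up metering table. The column layout, the sample data, and the either-threshold rule are all assumptions for illustration, not Satoris output or semantics.

```shell
# Hypothetical illustration of the hotspot thresholds: keep a probe when
# its average inclusive time is >= 5us or its average exclusive (self)
# time is >= 1us. Data and the either-or rule are assumptions.
awk -F, 'NR>1 { if ($2/$4 >= 5 || $3/$4 >= 1) print $1 }' <<'EOF'
probe,total_us,self_us,count
Foo.bar,1200,300,100
Baz.qux,900,50,1000
EOF
```

Here Foo.bar averages 12us inclusive and 3us exclusive, so it stays instrumented, while Baz.qux averages 0.9us inclusive and 0.05us exclusive and would be disabled.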

# the following override the defaults for hotspot provider thresholds
# always generate the same named probes snapshot file on shutdown of the jvm
# include disabled labeled probes in snapshots exported to the filesystem

Before getting caught up in hotspot identification, the -print-probes-metering command can be used to see the instrumentation refinement process in action by reporting the number of probes listed in each of the snapshot files exported and copied by the benchmark script. Archives of all snapshots presented below are available for download (zip, tar) and can be viewed using the Satoris console.

java -jar ../../console/satoris.jar -print-probes-metering --probe qps-1000-con-1-itr-1.ocs | wc -l
java -jar ../../console/satoris.jar -print-probes-metering --probe qps-1000-con-1-itr-2.ocs | wc -l
java -jar ../../console/satoris.jar -print-probes-metering --probe qps-1000-con-1-itr-3.ocs | wc -l
java -jar ../../console/satoris.jar -print-probes-metering --probe qps-1000-con-1-itr-4.ocs | wc -l
java -jar ../../console/satoris.jar -print-probes-metering --probe qps-1000-con-1-itr-5.ocs | wc -l

Here is the screenshot of the first snapshot exported when the script was executed with -q 1000 -c 1. The agent has already identified performance hotspots and disabled many other probes, but because this happened online during the execution of the benchmark, it affects the collected measurements to some degree.

The fifth snapshot exported, following the refinement of the instrumentation based on the analysis of all previously exported snapshots, shows a significantly reduced listing of hotspot methods in the metering table. Some probes previously marked as hotspots have disappeared, and others have changed ranking in terms of their overall inherent (exclusive/self) wall clock time total.

In the sixth snapshot for this particular benchmark run, the hotspot metering extension was disabled, resulting in no hotspot classification of probes. Disabling this strategy-based metering extension offers the smallest measurement overhead possible for the Satoris agent, except perhaps when an alternative performance resource meter is used. There is always some possible variance in totals and averages across runs, but in some of the cases below the total times have indeed dropped a little with this extension disabled. Presented below are the screenshots of the sixth test iteration for each stress workload parameter set, including the concurrency.

Increasing the concurrency parameter of the stress testing tool from 1 to 10, and decreasing the number of requests per second (per worker) from 1000 to 100, results in the following performance model. The ServiceFactoryProxy.status method has jumped up the table, as has the Context$class.let method. Across the board, the total and inherent timings increased; for example, HttpMessageDecoder.readHeaders rose from 7s to 18s for both total and inherent total. Note that when a probe has the same value for both total and inherent total, none of the methods it calls (if any), directly or indirectly, are instrumented or measured. The total number of hotspot probes increased from 74 to 166.

Doubling the concurrency parameter brings the Balancer$class.apply method near the top of the rankings. The HttpMessageEncoder.encode method has also jumped, with a significant increase in its inherent total. Previously the inherent total represented 50% of the total for the encode method; in this benchmark run it is 100%. This can be attributed to methods called by the encode method being disabled early on in the benchmark's test iterations. There is always a chance of some variance due to the adaptive and strategy-based nature of the profiling agent, as well as the chance of methods (probes) having an "average" performance timing bordering either of the hotspot thresholds. The ServerStatsFilter.apply method is another probe that has moved upwards. In this run of the benchmark script, the total number of hotspot methods was 164, slightly lower than the previous reference point.

Further increasing the concurrency parameter does not significantly change the hotspot table. Many probes (methods) have kept their positions and only marginally increased their timings. The ServerStatsFilter.apply method did move up, but the biggest mover is the appearance of the FailureAccrualFactory.apply method. In previous runs, this method was eliminated midway through the benchmark run. The total number of hotspots remaining at the end of the benchmark run was 155.

In the final benchmark run, with concurrency set to 50 and the number of requests per second per worker set to 20, the biggest movers relate to HTTP header access. The methods HttpMessageDecoder.readHeaders and DefaultHttpHeaders.get both rose to some degree in the table rankings, but again this can be attributed to adaptive disablement of called methods. The total number of hotspots now stands at 158. Ideally, this number would drop to around 50 following experimentation with the hotspot thresholds and some targeted filtering out of particular classes such as Filter$$anon$1 and possibly Service$$anon$1.
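Taken together, the workload progression across the runs described so far can be summarized as the following sweep. The script name is hypothetical, the third step infers that the per-worker rate stayed at 100 when the concurrency was doubled, and the run between concurrency 20 and 50 had no stated parameters so it is omitted.

```shell
# Assumed sweep of the stated workload parameters: -q is the per-worker
# request rate, -c the worker concurrency; the script name is hypothetical.
for params in "1000 1" "100 10" "100 20" "20 50"; do
  set -- $params
  echo "./benchmark.sh -q $1 -c $2"
done
```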

Satoris is a blend of intelligent exploration, strategy guidance, and adaptive behavior for application performance management, coordinating and cooperating between human and software machine. One concern with the above performance models is the large number of probes with relatively high counts and average inherent timings bordering on the threshold. The agent may well be metering more probes than required to get a good understanding of the execution flow and sequencing of request processing through the linkerd proxy server.

The stress tool was configured to dispatch 500,000 requests, so any probe with a count in the millions is being called within some form of loop (including recursion). It is good to identify such repeaters, but to actually fix an issue it is far better to have the caller of such probes (methods) classified as the hotspot instead. This might be the case for some callers but not necessarily all, so in a follow-up post the hotspot thresholds used by the underlying strategy and balanced scorecard will be tweaked to exclude as many of these high-frequency methods as possible. The information is still incredibly useful, but when it comes to looking at call trees, via the tracking metering extension, it is far better to have such leaf nodes already filtered out. Below is a screenshot of a snapshot sorted by probe call frequency in descending order.

Reversing the above sorting gives the following table view. Currently, this listing is the closest thing we have to a list of the major operations involved in serving a request within the linkerd server, IO operations aside. In functionally oriented systems this can be hard to obtain, with generic stream (pipeline) operations replacing such labeled demarcation and structuring of execution flow.

Below are some code snippets of the above methods pulled from GitHub. In some cases, it was not certain which version (commit) was packaged within the linkerd distribution.