Microbenchmarking Big Data Solutions on the JVM – Part 1

Most developers view a microbenchmark as a targeted performance measurement of a very small block of code, with few if any outbound calls to other codebases. This article instead focuses on the far more productive, meaningful, and challenging task of analyzing the execution behavior of a very specific and narrow system interaction, one that results in the execution of many underlying activities (prescribed), actions (planned), and operations (provisional), many of which have a typical inherent (self) execution cost measured in microseconds. The micro classification is attributed to the many short-timed method invocations that need to be identified (or eliminated) to truly understand what is being benchmarked – the essence of the execution behavior.

The microbenchmark used in this article comes from the Apache Spark distribution. Its main method takes a single argument specifying the number of slices used by the SparkContext to parallelize execution; if no argument is passed, the slices variable defaults to 2.

[image: microbench.bigdata.spark.code]
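
For readers who cannot see the listing above, the SparkTC example computes the transitive closure of a small random graph. The sketch below is paraphrased from the version shipped with Spark 1.x era distributions; consult the examples module of your installation for the exact source.

package org.apache.spark.examples

import scala.collection.mutable
import scala.util.Random

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._   // pair RDD implicits (required on older Spark versions)

// Transitive closure on a random graph.
object SparkTC {
  val numEdges = 200
  val numVertices = 100
  val rand = new Random(42)

  // Generate a random directed graph with no self-loops.
  def generateGraph: Seq[(Int, Int)] = {
    val edges = mutable.Set.empty[(Int, Int)]
    while (edges.size < numEdges) {
      val from = rand.nextInt(numVertices)
      val to = rand.nextInt(numVertices)
      if (from != to) edges += ((from, to))
    }
    edges.toSeq
  }

  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("SparkTC")
    val spark = new SparkContext(sparkConf)
    val slices = if (args.length > 0) args(0).toInt else 2
    var tc = spark.parallelize(generateGraph, slices).cache()

    // The edges in reversed order, so that join() matches the head of an
    // existing path (y, z) against the tail of an edge (x, y), yielding (x, z).
    val edges = tc.map(x => (x._2, x._1))

    // Grow paths by one edge per round until a fixed point is reached.
    var oldCount = 0L
    var nextCount = tc.count()
    do {
      oldCount = nextCount
      tc = tc.union(tc.join(edges).map(x => (x._2._2, x._2._1))).distinct().cache()
      nextCount = tc.count()
    } while (nextCount != oldCount)

    println("TC has " + tc.count() + " edges.")
    spark.stop()
  }
}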

To execute the above code, the unqualified name of the Scala object (Java class) is passed to the run-example script from within the root directory of the Apache Spark installation.

./bin/run-example SparkTC

The challenge in measuring not just the microbenchmark code but also the underlying method invocations within the software under test (performance analysis) is ensuring that the required instrumentation, measurement, and collection does not perturb the system to such an extent that it becomes an entirely different system (the measurement is no longer relevant) and the data collected is neither accurate nor actionable. This challenge can be overcome with adaptive profiling of the codebase, both online and offline. In online mode, the adaptive profiling agent automatically disables measured methods based on one or more hotspot thresholds. In offline mode, the adaptive profiling redefines the instrumentation set (code coverage) based on a performance model (snapshot) exported from a previous run, so each consecutive run of the benchmark starts with a smaller subset of the codebase to instrument and measure, without needing to repeat the same disablement of non-hotspot methods as before.

Whilst the Apache Spark software has practically no memory of its past execution, which is what you want in a “good” benchmark, the adaptive agent is very different: it learns and self-reflects, continuously and iteratively, on what needs to be instrumented and measured. With each benchmark execution, the data collected by the profiling agent becomes more relevant, accurate, and attributable. This is especially so with the offline mode removing any residual measurement overhead of non-hotspot methods disabled online in the previous execution.

[image: microbench.bigdata.spark.timeline.bw]

Below is the script used to execute the benchmark 20 times, with each iteration generating a new profiling agent bundle that limits instrumentation in the next benchmark run to those methods labeled hotspot, as determined from the snapshot exported on shutdown of the benchmarked JVM. The script initially copies the standard agent bundle into the current directory; then, after each run of the benchmark, it generates a revised agent bundle, copied into the current directory, based on analysis of the automatically exported snapshot. The snapshot, probes.ocs, is then moved into the current directory and renamed to reflect the iteration it represents. This entire process is then repeated.

[image: microbench.bigdata.spark.script.bw]
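
The script is only shown as an image, but the loop it describes might look roughly like the sketch below. The file locations and the bundle-generation command are hypothetical stand-ins; the profiling tooling supplies its own command for deriving a hotspot-only bundle from an exported snapshot.

#!/bin/sh
# Sketch only: paths and the bundle-generation command are hypothetical stand-ins.
cp $AGENT_HOME/bundles/standard/agent.jar .                      # start from the standard (full coverage) agent bundle
for i in $(seq 1 20); do
  (cd $SPARK_HOME && ./bin/run-example SparkTC 1)                # run the benchmark (the initial runs used a slice count of 1)
  $AGENT_HOME/bin/gen-bundle $SPARK_HOME/probes.ocs agent.jar    # hypothetical: build a revised bundle from the exported snapshot
  mv $SPARK_HOME/probes.ocs ./probes-$i.ocs                      # keep the snapshot, renamed to the iteration it represents
done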

The instrumentation agent is loaded into the JVM by editing the run-example script as follows. Note that this agent library is updated after each iteration of the SparkTC benchmark.

[image: microbench.bigdata.spark.example.script.bw]
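
The edit itself is only shown as an image, but the standard JVM mechanism for loading such an agent is a -javaagent flag appended to the JVM options that run-example assembles. One plausible form, with a hypothetical agent path and an options variable that varies by Spark version, would be:

# hypothetical agent path; the exact variable to extend differs across Spark versions
export SPARK_JAVA_OPTS="$SPARK_JAVA_OPTS -javaagent:/opt/agent/agent.jar"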

Initially, the instrumentation was limited using the following package/class level filter config. The task is to identify hotspots within the Apache Spark codebase, some of which may call into other non-instrumented libraries in a very inefficient manner. Measurement is confined to what can realistically be changed by the Apache Spark team.

org.
!org.jboss.

The metering runtime employed by the adaptive agent had the following hotspot thresholds defined (in microseconds).

j.s.p.hotspot.threshold=5
j.s.p.hotspot.threshold.inherent=1
j.s.p.console.dump.file.name=probes.ocs

After executing the script, a software performance engineer need only look at the final exported performance snapshot, though all previous snapshots are retained for traceability of the adaptive process across benchmark iterations. Below is a screenshot of the metering table from the 20th snapshot, sorted on self-time. It is important to reiterate: this is not just the 20th run of the same benchmark but the 20th time the profiling agent has learned to adapt its instrumentation and measurement strategy. The profiling agent is being trained with each benchmark and in doing so is increasing the value of the information extracted from the measurements collected. Because the profiling agent adapts during and after each iteration, the training data being fed into the agent is, in fact, changing all the time even though the benchmark code and its execution are not.

[image: microbench.bigdata.spark.1.t5.it1.r1.bw]

Ideally, the same script should be executed again after archiving the existing snapshots. Below is a screenshot of the 20th snapshot taken from a second execution of the script, after effectively resetting the agent’s memory by copying over the standard agent bundle (at the very start of the script). Admittedly, every microbenchmark execution of this type will vary to some degree, but overall the similarity of both snapshots gives confidence in the benchmark, tooling, and configuration used. The metering reported will not be analyzed in this article, but at first glance it seems that a significant amount of the execution time is taken up with housekeeping (sorting/serializing/storing) of internally generated data, particular to the execution context, as well as the tracking of processing.

[image: microbench.bigdata.spark.1.t5.it1.r2.bw]

Up to now, with the slices variable set to 1, the benchmarking has focused more on software execution hotspots than on system execution hotspots. With a higher value for slices there should be more parallelism, less determinism, and (possibly) greater coordination and contention overhead.
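
In the revised script below, the slice count is taken as a script parameter and passed on to the example; through run-example this presumably amounts to the following invocation.

./bin/run-example SparkTC 4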

[image: microbench.bigdata.spark.script.revised.bw]

Here is a screenshot of the 20th performance snapshot exported after running the revised script with the slices parameter set to 4. The count has changed for a number of probes (methods) and, though there has been some reordering and change in the list, on the whole it appears similar to previous results – at least in terms of what is indicated by the functional names. The profiling agent has adapted its instrumentation and measurement strategy in light of the changes in the workload. This online learning capability of an agent is something often overlooked when developers take on the task of hardwiring instrumentation into their code and, worse, turning such static measurement into simple non-contextual (and non-behavioral) metric measures, eliminating all essence of flow (of life).

[image: microbench.bigdata.spark.4.t5.it1.r1.bw]

Below is the 20th performance snapshot taken from a second execution of the benchmarking script. The result is strikingly similar to the snapshot above, bearing in mind the parallelism introduced and the measurement resolution. Across both script executions, each consisting of 20 individual benchmark runs, the adaptive profiling agent converged on the same subset of hotspot probes.

[image: microbench.bigdata.spark.4.t5.it1.r2.bw]

In Part 2 attention shifts to cleaning up some aspects of the performance model such as restricting the measurement window to that of the execution of the main method and in doing so removing startup costs and timings. In Part 3 new data collection extensions are introduced in the playback of an episodic software memory. In Part 4 the many ways of visualizing the benchmark execution are given center stage.