Iterative Application Performance Benchmark Analysis

The ability of software to automatically (self-)adapt its execution behavior based on what has happened (past), what is happening (present) and what might happen (predicted) is extremely powerful, as it generally forms the basis, if not the foundation, for improved resilience, performance, scalability and versatility in (higher-order) functioning. Unfortunately there are limits, some intrinsically imposed, to how far online runtime adaptation can succeed in its fitness for purpose. When faced with such limits, one course of action is destruction and rebirth (rebuild). A rebirth might still be considered an adaptation of sorts, but on an entirely different scale. For example, (some) humans are known to adapt their behavior in an instant to a detected change in context, whereas nature adapts genetically over the course of multiple generations of a species, sometimes kick-started (or bootstrapped) by major events and subsequent environmental changes. So what has all this got to do with Java performance analysis? Let's first look at the typical steps involved in the performance analysis of software.

  • Instrument
  • Measure
  • Collect
  • Analyze

Each of the above steps comes at a cost and, if not managed intelligently, will result in a final performance model that offers very little insight or, worse, misleads the analysis and subsequent recommendations. To find the unknown hotspots, a large amount of the code must first be instrumented. But in doing so a performance drag (in terms of higher response time and/or lower throughput) is created that causes the system to drift or divert from its normal behavior, leaving the resulting performance model inaccurate or irrelevant to the purpose of the investigation. Self-adaptive technology can help here by using the actual data collected to alter (adapt) the behavior of the measurement step and in some cases even the instrumentation step, such as when the Oracle HotSpot compiler triggers a recompilation of class bytecode after it has detected changes (driven by the self-adaptation) in the low-level profiling of the code execution (branches, loops, call sites).
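The idea of a measurement step that adapts itself based on the data it collects can be sketched in a few lines. The following is a minimal, hypothetical illustration (the class and threshold names are my own, not the Satoris API): a probe meters its own firings and, once enough samples have been seen, disables its metering if the average cost falls below an assumed hotspot cut-off.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: a probe that stops metering itself once enough
// samples show its average cost falls below an assumed hotspot threshold.
final class AdaptiveProbe {
  private static final long HOTSPOT_THRESHOLD_NANOS = 10_000; // assumed cut-off
  private static final long MIN_SAMPLES = 1_000;              // warm-up firings

  private final AtomicLong firings = new AtomicLong();
  private final AtomicLong totalNanos = new AtomicLong();
  private final AtomicBoolean enabled = new AtomicBoolean(true);

  long begin() {
    // a zero reading signals that metering has been disabled on the fly
    return enabled.get() ? System.nanoTime() : 0L;
  }

  void end(long beginNanos) {
    if (beginNanos == 0L) return;                // metering already disabled
    long cost = System.nanoTime() - beginNanos;
    long n = firings.incrementAndGet();
    long total = totalNanos.addAndGet(cost);
    // adapt: once the sample is large enough, disable non-hotspots on the fly
    if (n >= MIN_SAMPLES && total / n < HOTSPOT_THRESHOLD_NANOS) {
      enabled.set(false);
    }
  }
}
```

A cheap method wrapped by such a probe will pay the metering cost only until it is classified, after which begin/end collapse to a single flag check.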

Adaptation for the most part relies on a memory of past behavior, and it is this, along with the residue of past adaptations, that challenges how much unwarranted instrumentation and measurement overhead can truly be eliminated. Without the ability to do a full reset at each level and layer of adaptation, the inefficiency remains and so too does the perturbing (overhead) cost. Even with “on the fly” disablement of instrumentation and its measurement by an intelligent performance measurement agent, it is still possible for inaccuracy to sneak into the performance model, especially if the final performance model includes data pertaining to both before and after adaptation phases. Ideally we only want to instrument and measure performance hotspots, but we can only know of these after the fact, and then we must accept that some of what we know from the model may be wrong. In data science (and mining) this is sometimes referred to as a leak – information that is historically captured but is not available at the time a decision has to be made (that requires such information). Solving this problem, especially when management of (overhead) cost is paramount, requires an approach that is incremental, isolating (selective memory) and iterative.

  • Instrument
  • Measure
  • Refine
  • Repeat

A benchmark run is performed to determine the initial performance hotspots. After completion a new instrumentation plan is created that includes only these hotspots. The benchmark is then performed again, resulting in a possible refinement of the hotspot candidates. This iterative process continues until eventually the set of instrumented probes is the set of probes classified as hotspots.
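The refinement loop described above terminates at a fixed point: the run whose instrumented set equals its hotspot set. A minimal sketch of that control flow, with an invented interface standing in for a full benchmark execution and hotspot classification:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the iterative refinement loop: each run instruments
// only the previous run's hotspots, repeating until the instrumented set and
// the classified hotspot set converge (a fixed point).
final class IterativeRefinement {

  interface BenchmarkRun {
    // runs the benchmark with the given probes instrumented and
    // returns the subset that was classified as hotspots
    Set<String> classifyHotspots(Set<String> instrumented);
  }

  static Set<String> refine(Set<String> allProbes, BenchmarkRun run) {
    Set<String> instrumented = new HashSet<>(allProbes);
    while (true) {
      Set<String> hotspots = run.classifyHotspots(instrumented);
      if (hotspots.equals(instrumented)) {
        return instrumented; // every instrumented probe is a hotspot
      }
      instrumented = hotspots; // next run instruments only the hotspots
    }
  }
}
```

In practice each iteration here corresponds to a full benchmark rerun with a regenerated agent bundle, so convergence in a handful of iterations matters.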


In the video titled “Aggregated Performance Analysis of OpenJDK JMH Benchmarks with Satoris & Simz” I briefly demonstrated this process after tackling another problem: aggregating performance measurement across multiple process executions of a benchmark. The following diagram shows the flow of data, configuration and (adaptation) change across runtimes and tooling. From the metering model aggregated by Simz and snapshotted within the Satoris console, a new instrumentation agent bundle is generated and then used in the subsequent rerun of the benchmark. Initially the degree of instrumentation is controlled by a package- and class-level filters configuration file, but from then on the agent is rebuilt to be aware only of those methods that were previously classified as hotspots.


Below is the hotspot listing following completion of the initial Pivotal Reactor benchmark run executed by the OpenJDK JMH tool. In this initial run I did not limit the instrumentation to any particular package, so management of the overhead is problematic even for the most efficient agent, especially when instrumentation is applied to a third-party library with extremely high-frequency call behavior, as is the case for the package.

Note: By default the Satoris agent does not instrument any of the JDK classes. You don’t measure what you cannot change or control.
Here are some of the probes (instrumented methods) disabled on the fly during the benchmark run, based on hotspot thresholds I had defined. The count and totals represent only those firings (executions) of a probe (method) that were metered (measured) before classification and disablement.

To generate a revised instrumentation plan embedded with the Satoris agent, based on the above hotspot classifications, I simply executed:

java -jar satoris.jar -generate-aj-bundle <snapshot-ocs-file> <agent-bundle-jar-file>

After rerunning the benchmark with the newly created agent bundle, a revised list of disabled probes (instrumented methods) is determined. Previously these probes had been classified as hotspots.

Don’t be too concerned about the high averages for some of the probes listed as disabled: the hotspot metering extension uses a scorecard that is not impacted by large outliers resulting from object allocation and the follow-up garbage collection.
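One common way to build such an outlier-tolerant scorecard is to classify on the median of a window of recent readings rather than the raw average, so a single GC-induced spike cannot keep a cheap probe classified as a hotspot. This is a minimal sketch of that idea, not the actual Satoris scorecard implementation:

```java
import java.util.Arrays;

// Hypothetical outlier-tolerant scorecard: hotspot classification is based
// on the median of a ring buffer of recent readings, so one large spike
// (e.g. a GC pause) does not skew the classification the way a mean would.
final class MedianScorecard {
  private final long[] window; // ring buffer of recent cost readings (nanos)
  private int index;
  private boolean filled;

  MedianScorecard(int size) { window = new long[size]; }

  void record(long nanos) {
    window[index] = nanos;
    index = (index + 1) % window.length;
    if (index == 0) filled = true;
  }

  long median() {
    int n = filled ? window.length : index;
    if (n == 0) return 0L;
    long[] copy = Arrays.copyOf(window, n);
    Arrays.sort(copy);
    return copy[n / 2];
  }

  boolean isHotspot(long thresholdNanos) {
    return median() >= thresholdNanos;
  }
}
```

With a mean-based scorecard a single million-nanosecond outlier would dominate hundreds of cheap firings; the median simply ignores it.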

After refining the agent instrumentation plan once again with the new metering model and then re-executing the benchmark, the list of probes disabled during the run is now down to a very small number.

After another refinement iteration the list of probes disabled during the course of the benchmark has shrunk to just one.

Finally, a benchmark run with no dynamic disablement of probes. All that is instrumented remains measured.

I performed a similar process to that outlined above, this time with an alternative meter, and the final results show a very strong similarity, though the hotspot thresholds would not exactly match across such different measures.

Those familiar with the Reactor codebase and the third-party libraries it uses might be surprised by the limited number of probes listed for these packages in the metering tables above. The reason for this is the enablement of the exitpoint metering extension and its dynamic disablement of firing probes (executing code) below (in the scope of) a probe (method) marked as an exit.


By default the exitpoint metering extension will not meter any exit-labeled probe, but I changed this in the configuration so that the metering model would include the surface-level probes into such packages. This has an additional knock-on effect on the reported inherent (self) metering totals as well as the hotspot classification for non-exit probes.
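The scoping behavior can be illustrated with a small depth counter per thread of execution. This is my own sketch of the concept, with invented names, not the extension's actual implementation: probes fired below an exit-labeled probe are not metered, while the exit probe itself (the surface-level entry into the package) is metered when the non-default configuration described above is applied.

```java
import java.util.Set;

// Hypothetical sketch of exitpoint-style scoping: once an exit-labeled
// probe is entered, nested probes beneath it go unmetered; the exit probe
// itself ("surface level") is metered only when configured to be.
final class ExitpointScope {
  private final Set<String> exits;    // probe names labeled as exits
  private final boolean meterSurface; // meter the exit probe itself?
  private int exitDepth;              // > 0 while inside an exit scope

  ExitpointScope(Set<String> exits, boolean meterSurface) {
    this.exits = exits;
    this.meterSurface = meterSurface;
  }

  // called on probe entry; returns whether this firing should be metered
  boolean enter(String probe) {
    boolean isExit = exits.contains(probe);
    boolean metered = exitDepth == 0 && (!isExit || meterSurface);
    if (isExit || exitDepth > 0) exitDepth++; // extend the exit scope
    return metered;
  }

  // called on probe exit; assumes balanced enter/exit (stack discipline)
  void exit(String probe) {
    if (exitDepth > 0) exitDepth--;
  }
}
```

With `meterSurface` false (the default behavior described above), even the exit probe itself would return unmetered, which is why only the configuration change surfaces these packages in the metering tables.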


Because non-surface-level probes never got metered (measured), they never got mirrored (simulated) and so never became part of the metering model within the Simz runtime, which is another extremely useful property of metered, mirrored and simulated application runtimes.