Iterative Application Performance Benchmark Analysis

The ability of software to automatically (self-)adapt its execution behavior based on what has happened (past), what is happening (present) and what might happen (predicted) is incredibly powerful. Adaptation forms the basis, if not the foundation, for improved resilience, performance, scalability, and versatility in (higher order) functioning. Unfortunately, there are limits, some intrinsically imposed, to how far online runtime adaptation can succeed in its fitness for purpose. When faced with such limits, one course of action is destruction and rebuild. A rebirth might still be considered an adaptation of sorts, but on an entirely different scale. For example, (some) humans are known to adapt their behavior in an instant to a detected change in context, whereas nature adapts genetically over the course of multiple generations of a species, sometimes kick-started (or bootstrapped) by major events and subsequent environmental changes. So what has all this got to do with Java performance analysis? Let’s first look at the typical steps involved in the performance analysis of software.

  • Instrument
  • Measure
  • Collect
  • Analyze

With each of the above steps comes a cost that, if not managed intelligently, will result in a final performance model that offers little insight or, worse, misleads the analysis and subsequent recommendations. To find the unknown hotspots, a significant amount of the code must first be instrumented. But doing so creates a performance drag (higher response time and lower throughput) that causes the system to drift or divert from its normal behavior, so the resulting performance model is neither accurate nor relevant to the purpose of the investigation. Self-adaptive technology can help here by using collected data to alter (adapt) the behavior of the measurement and instrumentation steps.

Adaptation, for the most part, relies on a memory of past behavior, and it is this memory, together with the residue of previous adaptations, that limits how much instrumentation and measurement overhead can be eliminated. Without the ability to do a full reset on each level and layer of adaptation, the inefficiency remains and so too does the perturbing (overhead) cost. Even with “on the fly” disablement of instrumentation and its measurement by an intelligent performance measurement agent, it is still possible for an inaccuracy to sneak into a model, especially if the final model includes data pertaining to both the before and after adaptation phases. Ideally we only want to instrument and measure performance hotspots, but we can only know of these after the fact, and then we must accept that some of what the model tells us may be wrong. In data science (and mining) this is sometimes referred to as a leak: information that is historically captured but is not available at the time a decision (that requires such information) has to be made. Solving this problem, especially when management of (overhead) cost is paramount, requires an approach that is incremental, isolating (selective memory) and iterative.

  • Instrument
  • Measure
  • Refine
  • Repeat

A benchmark run is performed to determine the initial performance hotspots. After completion a new instrumentation plan is created that only includes these hotspots. The benchmark is then performed again resulting in a possible refinement of the hotspot candidates. This iterative process continues on until eventually the set of instrumented probes is the set of probes classified as hotspots.
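The loop described above can be sketched as a search for a fixed point. This is only an illustrative sketch and not part of Satoris or Simz: the class name, the `refine` method, and the idea of modelling a benchmark run as a function from the instrumented set to the hotspot set are all assumptions made for the example.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Function;

/** Hypothetical sketch of the instrument/measure/refine/repeat loop (not a Satoris API). */
final class IterativeRefinement {

    /**
     * Re-runs a benchmark until the set of instrumented probes equals the set
     * of probes classified as hotspots, or until maxIterations is reached.
     *
     * @param initial   probes instrumented in the first run
     * @param benchmark assumed stand-in for one benchmark run: takes the
     *                  instrumented probes, returns those classified as hotspots
     */
    static Set<String> refine(Set<String> initial,
                              Function<Set<String>, Set<String>> benchmark,
                              int maxIterations) {
        Set<String> instrumented = new TreeSet<>(initial);
        for (int i = 0; i < maxIterations; i++) {
            Set<String> hotspots = new TreeSet<>(benchmark.apply(instrumented));
            // Only a probe that was instrumented can be classified in this run.
            hotspots.retainAll(instrumented);
            if (hotspots.equals(instrumented)) {
                return instrumented; // fixed point: every instrumented probe is a hotspot
            }
            // The next run starts from a fresh plan holding only the hotspots.
            instrumented = hotspots;
        }
        return instrumented;
    }
}
```

Each iteration starts from a new plan rather than from an adapted runtime, which is what keeps the residue of earlier adaptations out of the next model.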


In the video titled “Aggregated Performance Analysis of OpenJDK JMH Benchmarks with Satoris & Simz” I briefly demonstrated this process after tackling another problem, that of aggregating performance measurement across multiple process executions of a benchmark. The following diagram shows the flow of data, configuration and (adaptation) change across runtimes and tooling. From the metering model aggregated by Simz and snapshotted within the Satoris console, a new instrumentation agent bundle is generated and then used in the subsequent rerun of the benchmark. Initially, the degree of instrumentation is controlled by both package and class level filters, but from then on the agent is rebuilt to only be aware of those methods previously classified as a hotspot.


Below is the hotspot listing following completion of the initial Pivotal Reactor benchmark run executed by the OpenJDK JMH tool. In this initial run I did not limit the instrumentation to any particular package, so management of the overhead is problematic even for the most efficient agent, especially when instrumentation is applied to a third-party library with extremely high-frequency call behavior, as is the case for the package.

Note: By default the Satoris agent does not instrument any of the JDK classes. You don’t measure what you cannot change or control.

Here are some of the probes (instrumented methods) disabled on the fly during the benchmark run based on hotspot thresholds I had defined. The count and totals represent only those firings (executions) of a probe (method) that were metered (measured) before classification and disablement.
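To make it concrete why the counts and totals cover only the firings metered before disablement, here is a minimal sketch of a self-disabling probe. It is not the Satoris implementation: the class, the warm-up count, and the simple average-based rule are assumptions chosen for illustration, and the sketch is single-threaded.

```java
/** Hypothetical self-disabling probe; the average-based rule is an assumption. */
final class AdaptiveProbe {
    private final long thresholdNanos; // minimum average cost to stay enabled
    private final int warmupFirings;   // firings metered before classification
    private boolean enabled = true;
    private long count;
    private long totalNanos;

    AdaptiveProbe(long thresholdNanos, int warmupFirings) {
        this.thresholdNanos = thresholdNanos;
        this.warmupFirings = warmupFirings;
    }

    /**
     * Records one firing of the probe. Returns false when the probe is already
     * disabled, in which case nothing is added to the count or total.
     */
    boolean fire(long elapsedNanos) {
        if (!enabled) {
            return false; // disabled probes are no longer metered
        }
        count++;
        totalNanos += elapsedNanos;
        // Classify once enough firings were metered: too cheap means not a hotspot.
        if (count >= warmupFirings && totalNanos / count < thresholdNanos) {
            enabled = false;
        }
        return true;
    }

    boolean isEnabled() { return enabled; }
    long count() { return count; }
    long totalNanos() { return totalNanos; }
}
```

With a threshold of 1,000 ns and a warm-up of 3 firings, a method that always takes 100 ns is metered three times and then disabled, so its reported count stays at 3 no matter how often it executes afterwards.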

To generate a revised instrumentation plan embedded with the Satoris agent, based on the above hotspot classifications, I simply executed:

java -jar satoris.jar -generate-aj-bundle <snapshot-ocs-file> <agent-bundle-jar-file>

After rerunning the benchmark with the newly created agent bundle, a revised list of disabled probes (instrumented methods) is determined. These probes had been classified as hotspots in the previous run.

Don’t be too concerned about the high averages for some of the probes listed as disabled: the hotspot metering extension uses a scorecard that is not skewed by large outliers caused by object allocation and the follow-up garbage collection.
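One way a scorecard can ignore outliers is to count how often a firing crosses the threshold rather than how much time it took. The sketch below is an assumed form of such a scorecard, not the actual Satoris hotspot extension: the class, the bounds, and the one-point increments are hypothetical.

```java
/** Hypothetical scorecard classifier; bounds and step sizes are assumptions. */
final class Scorecard {
    enum State { UNDECIDED, HOTSPOT, DISABLED }

    private final long thresholdNanos;
    private final int lower; // score at which the probe is disabled
    private final int upper; // score at which the probe is classified a hotspot
    private int score;
    private State state = State.UNDECIDED;

    Scorecard(long thresholdNanos, int lower, int upper) {
        this.thresholdNanos = thresholdNanos;
        this.lower = lower;
        this.upper = upper;
    }

    /** Scores one firing and returns the (possibly updated) classification. */
    State record(long elapsedNanos) {
        if (state != State.UNDECIDED) {
            return state; // in this sketch a classification is final
        }
        // Each firing moves the score by one point, however large the timing is,
        // so a single GC-inflated firing cannot dominate the outcome.
        score += (elapsedNanos >= thresholdNanos) ? 1 : -1;
        if (score >= upper) {
            state = State.HOTSPOT;
        } else if (score <= lower) {
            state = State.DISABLED;
        }
        return state;
    }

    int score() { return score; }
}
```

With a threshold of 1,000 ns and bounds of -3 and 3, a probe that fires once at 1,000,000 ns (say, during a collection pause) and then four times at 10 ns ends up disabled, even though its average over those five firings is about 200,000 ns.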

After refining the agent instrumentation plan once again with the new metering model and then re-executing the benchmark the list of probes disabled during the run is now down to a minuscule number.

After another refinement iteration, the list of probes disabled during the benchmark has shrunk down to just one.

Finally, a benchmark execution with no dynamic disablement of probes: everything instrumented remains measured.

I performed a similar process to the one outlined above, this time with an alternative meter, and the final results show a high similarity, though the hotspot thresholds would not exactly match across such different measures.

Those familiar with the Reactor codebase and the third-party libraries it uses might be surprised by the limited number of probes listed for these packages in the metering tables above. The reason for this is the enablement of the exitpoint metering extension, and its dynamic disablement of firing probes (executing code) below (in the scope of) a probe (method) marked as an exit.


By default the exitpoint metering extension will not meter any exit-labeled probe, but I changed this in the configuration so that the metering model would include the surface level probes into such packages. This has an additional knock-on effect on the reported inherent (self) metering totals as well as on the hotspot classification for non-exit probes.
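The behavior of an exit point can be sketched with a depth counter: once execution enters a probe labeled as an exit, nothing below it is metered until that probe returns. The sketch is an assumption-laden illustration and not the Satoris extension itself. The class, the prefix-based labeling, the `meterExitProbe` flag, and the `app.`/`lib.` names used in the usage note are hypothetical, and a real agent would track the depth per thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical single-threaded sketch of exit point suppression. */
final class ExitPointMeter {
    private final Set<String> exitPrefixes; // probes with these prefixes are exits
    private final boolean meterExitProbe;   // whether the surface (exit) probe itself is metered
    private final List<String> metered = new ArrayList<>();
    private int exitDepth;                  // > 0 while inside an exit-labeled probe

    ExitPointMeter(Set<String> exitPrefixes, boolean meterExitProbe) {
        this.exitPrefixes = exitPrefixes;
        this.meterExitProbe = meterExitProbe;
    }

    /** Called on method entry; returns true if this firing is metered. */
    boolean begin(String probe) {
        if (exitDepth > 0) {
            exitDepth++;  // below an exit: track nesting only, do not meter
            return false;
        }
        if (exitPrefixes.stream().anyMatch(probe::startsWith)) {
            exitDepth = 1; // everything called from here on is suppressed
            if (meterExitProbe) {
                metered.add(probe);
            }
            return meterExitProbe;
        }
        metered.add(probe);
        return true;
    }

    /** Called on method exit; must be paired with the matching begin call. */
    void end(String probe) {
        if (exitDepth > 0) {
            exitDepth--;
        }
    }

    List<String> metered() { return new ArrayList<>(metered); }
}
```

With `lib.` as the exit prefix and `meterExitProbe` set to true, a call chain of `app.Service.handle` into `lib.Queue.publish` into `lib.Queue.next` meters the first two and skips the third. Setting the flag to false drops `lib.Queue.publish` as well, which corresponds to the default described above.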


Because non-surface level probes never got metered (measured), they never got mirrored (simulated) and so never became part of the metering model within the Simz runtime, which is another extremely useful feature of metered, mirrored and simulated application runtimes.