Tag: Spring Reactor

Iterative Application Performance Benchmark Analysis

The ability of software to automatically (self-)adapt its execution behavior based on what has happened (past), what is happening (present) and what might happen (predicted) is incredibly powerful. Adaptation forms the basis, if not the foundation, for improved resilience, performance, scalability, and versatility in (higher order) functioning. Unfortunately, there are limits, some intrinsically imposed, to how far online runtime adaptation can succeed in its fitness for purpose. When faced with such limits, one course of action is destruction and rebuild. A rebirth might still be considered an adaptation of sorts, but on an entirely different scale. For example, (some) humans are known to adapt their behavior in an instant to a detected change in context, whereas nature adapts genetically over the course of multiple generations of a species, sometimes kick-started (or bootstrapped) by major events and subsequent environmental changes. So what has all this got to do with Java performance analysis? Let's first look at the typical steps involved in the performance analysis of software.

  • Instrument
  • Measure
  • Collect
  • Analyze

Each of the above steps comes at a cost, and if that cost is not managed intelligently the final performance model will offer little insight or, worse, mislead the analysis and subsequent recommendations. To find the unknown hotspots, a significant amount of the code must first be instrumented. But in doing so, a performance drag (higher response time and lower throughput) is created that causes the system to drift or divert from its normal behavior, leaving the resulting performance model neither accurate nor relevant to the purpose of the investigation. Self-adaptive technology can help here by using collected data to alter (adapt) the behavior of the measurement and instrumentation steps.
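
The feedback loop can be sketched in plain Java. The names below are hypothetical, not the Satoris API: a probe meters its own firings and, once classified as cheap against a threshold, stops measuring itself, removing most of its own overhead from the model.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of a self-disabling probe: once the measured
// average cost of a method falls below a threshold, the probe stops
// metering itself and thereafter only delegates to the method body.
public final class AdaptiveProbe {
    private final long thresholdNanos;
    private final AtomicBoolean enabled = new AtomicBoolean(true);
    private long totalNanos;
    private long count;

    public AdaptiveProbe(long thresholdNanos) {
        this.thresholdNanos = thresholdNanos;
    }

    public void meter(Runnable body) {
        if (!enabled.get()) {      // adapted: no measurement overhead
            body.run();
            return;
        }
        long start = System.nanoTime();
        try {
            body.run();
        } finally {
            record(System.nanoTime() - start);
        }
    }

    private synchronized void record(long elapsed) {
        totalNanos += elapsed;
        count++;
        // classify after a warm-up window; disable if not a hotspot
        if (count >= 1000 && totalNanos / count < thresholdNanos) {
            enabled.set(false);
        }
    }

    public boolean isEnabled() { return enabled.get(); }

    public synchronized long meteredCount() { return count; }
}
```

The count and total retained by such a probe only reflect firings metered up to the point of disablement, which is exactly the caveat raised for the metering tables later in this article.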


Adaptation, for the most part, relies on a memory of past behavior, and it is this memory, along with the residue of previous adaptations, that limits how much instrumentation and measurement overhead can be eliminated. Without the ability to do a full reset at each level and layer of adaptation, the inefficiency remains and so too does the perturbing (overhead) cost. Even with “on the fly” disablement of instrumentation and its measurement by an intelligent performance measurement agent, it is still possible for an inaccuracy to sneak into a model, especially if the final model includes data pertaining to both before and after adaptation phases. Ideally we only want to instrument and measure performance hotspots, but we can only know of these after the fact, and then we must accept that some of what the model tells us may be wrong. In data science (and mining) this is sometimes referred to as a leak – information that is historically captured but is not available at the time a decision has to be made (that requires such information). Solving this problem, especially when management of (overhead) cost is paramount, requires an approach that is incremental, isolating (selective memory) and iterative.

  • Instrument
  • Measure
  • Refine
  • Repeat

A benchmark run is performed to determine the initial performance hotspots. After completion, a new instrumentation plan is created that only includes these hotspots. The benchmark is then performed again, resulting in a possible refinement of the hotspot candidates. This iterative process continues until eventually the set of instrumented probes is exactly the set of probes classified as hotspots.
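
The loop above amounts to a fixed-point computation. In the sketch below the benchmark run is abstracted as a function from the instrumented set to the hotspot subset it reports; all names are illustrative.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Function;

public final class IterativeRefinement {
    // One refinement cycle per loop iteration: run the benchmark with
    // the current instrumentation plan, keep only the probes classified
    // as hotspots, and stop once the instrumented set no longer shrinks.
    public static Set<String> refine(Set<String> instrumented,
                                     Function<Set<String>, Set<String>> runBenchmark) {
        Set<String> current = new HashSet<>(instrumented);
        while (true) {
            Set<String> hotspots = runBenchmark.apply(current);
            if (hotspots.equals(current)) {
                return current; // fixed point: instrumented set == hotspot set
            }
            current = hotspots; // next run only instruments these
        }
    }
}
```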


In the video titled “Aggregated Performance Analysis of OpenJDK JMH Benchmarks with Satoris & Simz” I briefly demonstrated this process after tackling another problem of aggregating performance measurement across multiple process executions of a benchmark. The following diagram shows the flow of data, configuration and (adaptation) change across runtimes and tooling. From the metering model aggregated by Simz and snapshotted within the Satoris console, a new instrumentation agent bundle is generated and then used in the subsequent rerun of the benchmark. Initially, the degree of instrumentation is controlled by both package and class level filters, but from then on the agent is rebuilt to only be aware of those methods previously classified as hotspots.


Below is the hotspot listing following completion of the initial Pivotal Reactor benchmark run executed by the OpenJDK JMH tool. In this initial run I did not limit the instrumentation to any particular package, so management of the overhead is problematic even for the most efficient of agents, especially when instrumentation is applied to a third-party library with extremely high-frequency call behavior, as is the case for the com.gs.collections package.

Note: By default the Satoris agent does not instrument any of the JDK classes. You don’t measure what you cannot change or control.

Here are some of the probes (instrumented methods) disabled on the fly during the benchmark run based on hotspot thresholds I had defined. The count and totals represent only those firings (executions) of a probe (method) that were metered (measured) before classification and disablement.


To generate a revised instrumentation plan embedded with the Satoris agent, based on the above hotspot classifications, I simply executed:

java -jar satoris.jar -generate-aj-bundle <snapshot-ocs-file> <agent-bundle-jar-file>

After rerunning the benchmark with the newly created agent bundle, a revised list of disabled probes (instrumented methods) is determined. Previously these probes had been classified as hotspots.


Don’t be too concerned about the high averages for some of the probes listed as disabled; the hotspot metering extension uses a scorecard that is not impacted by large outliers resulting from object allocation and the follow-up garbage collection.

After refining the agent instrumentation plan once again with the new metering model and then re-executing the benchmark, the list of probes disabled during the run is down to a minuscule number.


After another refinement iteration, the list of probes disabled during the benchmark has shrunk down to just one.


Finally, a benchmark execution with no dynamic disablement of probes – all instrumented remains measured.


I performed a similar process to that outlined above, this time with an alternative meter, and the final results show a high degree of similarity, though the hotspot thresholds would not exactly match across such different measures.


Those familiar with the Reactor codebase and the third party libraries it uses might be surprised by the limited number of probes listed for these packages in the metering tables above. The reason for this is the enablement of the exitpoint metering extension, and its dynamic disablement of firing probes (executing code) below (in the scope of) a probe (method) marked as an exit.


By default the exitpoint metering extension will not meter any exit labeled probe, but I changed this in the configuration so that the metering model would include the surface level probes into such packages. This has an additional knock-on effect on the reported inherent (self) metering totals as well as the hotspot classification for non-exit probes.


Because non-surface level probes never got metered (measured), they never got mirrored (simulated) and so never became part of the metering model within the Simz runtime – another extremely useful feature of metered, mirrored and simulated application runtimes.

An Allocation (Re)Action to Reactor

Reactor is a foundational library for building reactive fast data applications on the JVM. It provides abstractions for Java to make building event- and data-driven applications easier. I am currently working with an experienced software engineering team building an exciting product prototype that I hope to talk more about at some point later in the year when the project starts to take shape. My initial assignment with the team involves reviewing some of the early technology choices made in a proof of concept implementation, in terms of performance and scalability, as the deployed system needs to sense, reason and respond to vast amounts of data in real-time. One of the first project dependencies I looked at was Reactor, a project that originated at SpringSource and is now maintained by engineers working at Pivotal.


The Reactor project has a benchmark sub-project based on the OpenJDK JMH tool. JMH is a Java harness for building, running, and analyzing nano/micro/milli/macro benchmarks written in Java and other languages targeting the JVM. After building the sub-project package the benchmark is run from the command line as follows:

java -jar target/microbenchmarks.jar ".*ReactorBenchmarks.*"

Here is a sample of the summary output produced by the JMH tool after it has run a relatively large number of benchmark configuration iterations within separate processes. Reading left to right, each row lists the benchmark, the parameter value, the mode, the sample count, the mean score, the mean error, and the units.

o.p.b.r.ReactorBenchmarks.reactorThroughput ringBuffer 10000 thrpt 15 1445.069 241.602 ops/s
o.p.b.r.ReactorBenchmarks.reactorThroughput ringBuffer 1000 thrpt 15 11955.832 2847.127 ops/s
o.p.b.r.ReactorBenchmarks.reactorThroughput ringBuffer 100 thrpt 15 100754.020 19946.595 ops/s
o.p.b.r.ReactorBenchmarks.reactorThroughput sync 10000 thrpt 15 1748.662 81.455 ops/s
o.p.b.r.ReactorBenchmarks.reactorThroughput sync 1000 thrpt 15 22749.599 1028.661 ops/s
o.p.b.r.ReactorBenchmarks.reactorThroughput sync 100 thrpt 15 265829.334 3000.288 ops/s
o.p.b.r.ReactorBenchmarks.reactorThroughput threadPoolExecutor 10000 thrpt 15 552.242 56.231 ops/s
o.p.b.r.ReactorBenchmarks.reactorThroughput threadPoolExecutor 1000 thrpt 15 6101.796 685.604 ops/s
o.p.b.r.ReactorBenchmarks.reactorThroughput threadPoolExecutor 100 thrpt 15 60206.842 1353.195 ops/s
o.p.b.r.ReactorBenchmarks.reactorThroughput workQueue 10000 thrpt 15 226.803 6.641 ops/s
o.p.b.r.ReactorBenchmarks.reactorThroughput workQueue 1000 thrpt 15 2021.130 270.393 ops/s
o.p.b.r.ReactorBenchmarks.reactorThroughput workQueue 100 thrpt 15 21433.804 1140.880 ops/s

This is not terribly useful from a software performance analysis perspective, beyond the benchmark scores themselves, but fortunately JMH does come with some built-in profiling capabilities. One of these is a thread stack sampling profiler, activated as follows:

java -jar target/microbenchmarks.jar -prof stack ".*ReactorBenchmarks.*"

The JMH tool doesn’t appear to aggregate the sampled stack results the way it does the actual benchmark measurements. After each benchmark iteration, a stack sampling report is printed. Here is an example of the output produced by this particular profiler.

Iteration   1: 1223.517 ops/s
       Stack |  90.0%    WAITING sun.misc.Unsafe.park
             |   3.7%   RUNNABLE org.projectreactor.bench.reactor.ReactorBenchmarks.reactorThroughput
             |   1.9%   RUNNABLE com.gs.collections.impl.list.mutable.UnmodifiableMutableList.<init>
             |   1.1%   RUNNABLE reactor.event.routing.ConsumerFilteringEventRouter.route
             |   1.0% TIMED_WAITING sun.misc.Unsafe.park
             |   0.8%   RUNNABLE com.lmax.disruptor.BatchEventProcessor.run
             |   0.6%   RUNNABLE com.gs.collections.impl.map.mutable.ConcurrentHashMap.get
             |   0.4%   RUNNABLE reactor.event.dispatch.AbstractLifecycleDispatcher.route
             |   0.2%   RUNNABLE sun.misc.Unsafe.park
             |   0.1%   RUNNABLE com.lmax.disruptor.MultiProducerSequencer.getHighestPublishedSequence
             |   0.0%            (other)
Iteration   2: 1235.400 ops/s
       Stack |  90.0%    WAITING sun.misc.Unsafe.park
             |   3.6%   RUNNABLE org.projectreactor.bench.reactor.ReactorBenchmarks.reactorThroughput
             |   2.0%   RUNNABLE com.gs.collections.impl.list.mutable.UnmodifiableMutableList.<init>
             |   1.1% TIMED_WAITING sun.misc.Unsafe.park
             |   0.9%   RUNNABLE reactor.event.routing.ConsumerFilteringEventRouter.route
             |   0.9%   RUNNABLE com.lmax.disruptor.BatchEventProcessor.run
             |   0.6%   RUNNABLE com.gs.collections.impl.map.mutable.ConcurrentHashMap.get
             |   0.5%   RUNNABLE reactor.event.dispatch.AbstractLifecycleDispatcher.route
             |   0.2%   RUNNABLE sun.misc.Unsafe.park
             |   0.1%   RUNNABLE org.projectreactor.bench.reactor.generated.ReactorBenchmarks.reactorThroughput_Throughput_measurementLoop
             |   0.1%            (other)
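
For the curious, the principle behind such a profiler can be shown with nothing but the JDK. This toy sampler (my own illustration, not JMH's actual implementation) periodically snapshots all thread stacks and counts the top frame of each RUNNABLE thread:

```java
import java.util.HashMap;
import java.util.Map;

// A toy stack sampler in the spirit of JMH's "-prof stack" option:
// snapshot all thread stacks on an interval and tally the top frame
// of every RUNNABLE thread. Real profilers do much more (thread state
// breakdowns, safepoint-bias mitigation), but the principle is the same.
public final class ToyStackSampler {
    public static Map<String, Integer> sample(int samples, long intervalMillis) {
        Map<String, Integer> topFrames = new HashMap<>();
        for (int i = 0; i < samples; i++) {
            for (Map.Entry<Thread, StackTraceElement[]> e
                    : Thread.getAllStackTraces().entrySet()) {
                if (e.getKey().getState() != Thread.State.RUNNABLE) continue;
                StackTraceElement[] stack = e.getValue();
                if (stack.length == 0) continue;
                String frame = stack[0].getClassName() + "." + stack[0].getMethodName();
                topFrames.merge(frame, 1, Integer::sum);
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return topFrames;
    }
}
```

Note that such a sampler only sees what threads appear to be doing at each snapshot, which is precisely the limitation discussed further below.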

Of the other built-in profiling capabilities of JMH, I would strongly recommend you check out the perfasm option on Linux. Very impressive!


Let’s now see what an event based profiler can collect while the benchmark is running. Low latency micro-benchmarks are in general extremely challenging for code instrumentation profilers, but fortunately Satoris is both adaptive and extremely efficient, with many built-in meters offering an alternative to wall clock time profiling when needed. Looking at the above benchmark report results, I decided it would be best to switch out the clock.time microsecond resolution meter for the clock.tick nanosecond resolution meter, though admittedly this is problematic on Mac OS X with the current Java 8u5 runtime. I’ll come back to addressing this particular issue later in the article.

Here is the configuration I used for the Satoris adaptive metering engine, which is called into by the instrumentation hooks added to class bytecode on loading into the runtime.


I’m not going to go into great detail explaining the actual adaptive mechanism used by the metering engine. The above configuration will, online, disable the instrumentation and metering of method calls that fall below an average (not in the statistical sense) latency of 250 nanoseconds, 50 of which must be inherent to the method. The result is a list of hotspot probes representing the most suitable points in the code to investigate possible tuning options – which, incidentally, is not always the location of an actual high-frequency hotspot (hint: caller-to-callee loops).
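
Without going into the actual Satoris internals, a credit-based scorecard of the kind hinted at could be sketched as follows. The initial balance, debit and credit rules here are my own illustrative assumptions; note that because each firing casts a single vote, one huge outlier cannot skew the classification.

```java
// Hypothetical credit-based hotspot scorecard: each firing below the
// threshold debits the balance, each firing at or above it credits it.
// When the balance is exhausted the probe is classified as a non-hotspot
// and disabled; probes that keep scoring remain metered.
public final class HotspotScorecard {
    private final long thresholdNanos;
    private long balance;
    private boolean disabled;

    public HotspotScorecard(long initialBalance, long thresholdNanos) {
        this.balance = initialBalance;
        this.thresholdNanos = thresholdNanos;
    }

    // Returns true while the probe should remain instrumented.
    public boolean fire(long elapsedNanos) {
        if (disabled) return false;
        if (elapsedNanos >= thresholdNanos) {
            balance++;               // evidence of a hotspot
        } else {
            balance--;               // evidence of a cheap method
            if (balance <= 0) disabled = true;
        }
        return !disabled;
    }

    public boolean isDisabled() { return disabled; }
}
```

A larger initial balance delays the classification decision, which is exactly the knob turned in the updates at the end of this article.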

Initially I restricted the instrumentation to the reactor.* package. In the early stages of any software performance analysis, the focus should be on the code that can be changed. Understanding of the software execution behavior and resource consumption profile should always start from the perspective of the application codebase. In this case the package, reactor.*, is the application, and all other library dependencies are black boxes (from an instrumentation coverage perspective). Again it is best to start with only measuring what can be changed (directly).

In the metering table below, which I’ve color inverted in a photo editing package, I’ve listed both hotspot and disabled labeled probes. Because probes can be disabled during the benchmark run, the counts and totals for such methods are only accurate with respect to the method invocations that were metered up to the point of disablement. Only one of the probes still being metered at the end of the benchmark matches a method picked out by the stack sampling profiler; then again, the code coverage and focus are very different. Thread stack sampling is focused on what the system appears to be busy doing, but in a modern multithreaded world appearances are very deceiving when there is no sense of activities and the transitions between them. Adaptive metering is far more compelling and relevant in its reporting when it comes to identifying the appropriate point for algorithmic change.

In the configuration below I’ve switched out the clock.tick meter for the spin.tick meter. Both meters report time intervals in nanoseconds. The difference is that the spin.tick meter eliminates the call to the underlying OS clock counter from the metered thread execution call path at the expense of a spinning core.
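
The spin.tick technique can be illustrated in a few lines of plain Java. The names and details below are illustrative, not the Satoris implementation: a dedicated thread burns a core publishing the clock into a volatile field, so metered threads pay only a volatile read instead of an OS clock call.

```java
// Sketch of the spin.tick idea: a spinning thread continuously publishes
// the current nanosecond clock value into a volatile field; metered
// threads read the field rather than calling the clock themselves.
public final class SpinClock {
    private volatile long nowNanos = System.nanoTime();
    private volatile boolean running = true;

    public SpinClock() {
        Thread spinner = new Thread(() -> {
            while (running) {
                nowNanos = System.nanoTime();  // burns a core to keep this fresh
            }
        }, "spin-clock");
        spinner.setDaemon(true);
        spinner.start();
    }

    public long nanos() {
        return nowNanos;                       // cheap volatile read
    }

    public void shutdown() {
        running = false;
    }
}
```

The trade-off is clear: one core is sacrificed entirely, and the resolution is bounded by how fast the spinner can loop, but the metered call path no longer touches the OS clock.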


Below is the metering table exported to the file system following completion of a benchmark iteration of the RingBuffer configuration. The results appear consistent, though slightly better in terms of throughput, when compared with the previous metering results. Due to the nature of the adaptive mechanism underlying the metering engine, the counts for disabled probes vary across tables, even with the same configuration. This is very much like how the JVM HotSpot compiler differs in its bytecode compilation across process executions.

To address issues with the precision of the underlying clock counter on my test platform, I’ve now switched to another built-in meter, spin.count, which increments a volatile field in a tight loop. It is important to note that with all of these changes I’ve not had to make any instrumentation code changes – a good example of why you should never attempt to create your own ad hoc profiling solution just because you know a little bit of Aspect Oriented Programming (AOP). Choosing the appropriate threshold values to drive the adaptive mechanism is a little more complicated. I went with 25 and 10, probably on the low side.
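
Likewise, the spin.count idea, a counter standing in for a clock, can be sketched as follows (again illustrative, not the actual meter): elapsed "time" is measured in ticks of the spinner, sidestepping OS clock precision issues entirely.

```java
// Sketch of the spin.count idea: a spinning thread increments a counter
// in a tight loop and metered threads sample the counter at entry and
// exit. The tick delta is a relative cost measure, not wall clock time.
public final class SpinCounter {
    private volatile long ticks;
    private volatile boolean running = true;

    public SpinCounter() {
        Thread spinner = new Thread(() -> {
            while (running) {
                ticks++;   // single writer, so ++ on a volatile is safe here
            }
        }, "spin-counter");
        spinner.setDaemon(true);
        spinner.start();
    }

    public long ticks() {
        return ticks;
    }

    public void shutdown() {
        running = false;
    }
}
```

Because the tick rate is machine- and load-dependent, thresholds expressed in ticks (such as the 25 and 10 above) have to be chosen empirically rather than derived from time units.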


In the metering table below, produced with the above metering configuration, I’ve filtered out the disabled probes. The usual suspects (from above) are there along with a few new ones that were previously disabled during metered execution.

In the following configuration I’ve dropped the inherent threshold down from 10 to 5.


Here is the new metering model following re-execution of the benchmark with the revised threshold setting. I’ve also included in the listing those probes labeled disabled.


It can be useful to get some idea of how much invocation work is, directly or indirectly, performed by a probe. In the configuration I switched over to the probe.count meter, which counts the number of metered method invocations that occur within the call window of a method. This depends on the amount of code coverage and the adaptive mechanism employed. I set both adaptive thresholds to 0 to ensure there was no dynamic disablement of metering during a benchmark iteration.
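
A probe.count style call-window counter can be sketched with a thread local. The names below are hypothetical, not the Satoris metering SPI: each probe records the global firing count at entry, and the difference at exit is the number of metered invocations nested inside its call window.

```java
// Sketch of a probe.count style meter: a per-thread firing counter is
// snapshotted at probe entry; at exit, the delta gives the number of
// metered invocations that fired within the probe's call window.
public final class InvocationCountMeter {
    private static final ThreadLocal<long[]> FIRED =
            ThreadLocal.withInitial(() -> new long[1]);

    // Returns the firing count at entry; pass it back at exit.
    public static long enter() {
        long[] fired = FIRED.get();
        fired[0]++;                      // this invocation itself fires
        return fired[0];
    }

    // Number of metered invocations nested within the call window.
    public static long exit(long atEntry) {
        return FIRED.get()[0] - atEntry;
    }
}
```

A probe with a large delta does a lot of (metered) work indirectly; a delta of zero means all of its cost is inherent.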


The overall throughput has dropped, but it does give us an idea of how much of the code in the reactor.* package calls other code in the reactor.* package. Looking at the inherent total for the Reactor.notify probe, I now suspect this is an overloaded method that is counted twice. I could enable the nonrecursive metering extension to eliminate one of these invocations, which might well have eliminated the probe from other results reported above.

A major concern with any event processing platform is garbage collection. In the following configuration I’ve switched over to the jmgmt.alloc.bytes meter, which uses a method in the ThreadMXBean interface to track heap allocations. Unlike what has been said by some Google performance engineers, it does come with a performance hit, especially when called inline of the thread execution context. In this configuration a probe is disabled if it does not “on average” allocate an object, directly or indirectly, within its call window.
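
The underlying JDK capability, the com.sun.management.ThreadMXBean extension available on HotSpot, can be exercised directly. Here is a minimal sketch (my own helper names) of measuring bytes allocated by the current thread across a call window:

```java
import java.lang.management.ManagementFactory;

// Measures bytes allocated by the current thread across a call window,
// using the HotSpot-specific com.sun.management.ThreadMXBean extension
// that allocation meters of this kind build on.
public final class AllocMeter {
    private static final com.sun.management.ThreadMXBean THREADS =
            (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();

    public static long allocatedInWindow(Runnable body) {
        long id = Thread.currentThread().getId();
        long before = THREADS.getThreadAllocatedBytes(id);
        body.run();
        return THREADS.getThreadAllocatedBytes(id) - before;
    }
}
```

The two getThreadAllocatedBytes calls themselves are what make this meter costly when fired inline at high frequency, which is the performance hit referred to above.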


In the metering table below the biggest allocation culprit is CachingRegistry.select. Probably not something you would expect from a class with a "Caching" prefix.


Here is the source code, pulled from GitHub, for the CachingRegistry.select method.


Time to widen the code coverage and include the com.gs.collections.impl.* package. The metering table pretty much shouts out the allocation call sequence, ending with a call to UnmodifiableMutableList.of().


Here is the new object allocation site. The actual constructor, UnmodifiableMutableList.<init>(), did come up in the stack sampling results, but the metering engine was able to link a somewhat innocent-looking line much higher in the call stack to the most direct enclosing calling site.


Update #1

I was asked to follow up the above article with a metering model that includes some of the other library dependencies. I took the opportunity to address some other concerns with the above metering models as well, in particular method overloading and the possible premature disablement of probes before sufficient load or conditional state had arisen.

# adds an unique class method id to each probe name
# increases the initial hotspot scorecard balance from 1000 to 10000 

As is to be expected, in the RingBuffer benchmark configuration the LMAX Disruptor library hits the top spot. Note that the Reactor.notify probe is now listed twice in the metering table.


Update #2

In the last configuration I should have increased the lower hotspot bound after increasing the initial balance, as follows:


Here is a revised metering model with the change applied, and a little higher throughput achieved.


Again, at this level you are looking for clues to guide further code inspection within an editor. There must be some acceptance of the fact that red herrings will creep into the model due to the nature of execution and its measurement. But still and all, it sends thread call stack sampling back into the stone age where it came from and surely belongs – and, if we are lucky, it carries with it those that hold up such an approach as state-of-the-art Java performance profiling.