An Allocation (Re)Action to Reactor

PROJECT
I am currently working with an experienced software engineering team building an exciting product prototype that I hope to talk more about later in the year when the project starts to take shape. My initial assignment with the team involves reviewing, in terms of performance and scalability, some of the early technology choices made in a proof of concept implementation, as the deployed system needs to sense, reason and respond to vast amounts of data in real time. One of the first project dependencies I looked at was Reactor, a project that originated at SpringSource and is now maintained by engineers working at Pivotal.

Here is the summary description of the project pulled straight from the GitHub repository.

“Reactor is a foundational library for building reactive fast data applications on the JVM.
It provides abstractions for Java, Groovy and other JVM languages to make building event and
data-driven applications easier. It’s also really fast.”

BENCHMARK
The Reactor project has a benchmark sub-project based on the OpenJDK JMH tool.  JMH is a Java harness for building, running, and analyzing nano/micro/milli/macro benchmarks written in Java and other languages targeting the JVM. After building the sub-project package the benchmark is run from the command line as follows:

java -jar target/microbenchmarks.jar ".*ReactorBenchmarks.*"
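
For readers unfamiliar with JMH, a parameterized throughput benchmark of this general shape looks something like the sketch below. This is purely illustrative and not the actual ReactorBenchmarks source; the class name, parameter names and annotation set (current JMH annotations, which differ slightly from the early versions used at the time) are my own assumptions.

import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

// Illustrative sketch only; not the Reactor project's benchmark code.
@State(Scope.Benchmark)
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
public class ThroughputBenchmarkSketch {

    // The two parameter axes visible in the report below: dispatcher type and a size parameter.
    @Param({"ringBuffer", "sync", "threadPoolExecutor", "workQueue"})
    public String dispatcher;

    @Param({"100", "1000", "10000"})
    public int size;

    @Setup
    public void setup() {
        // create the reactor with the chosen dispatcher here
    }

    @Benchmark
    public void reactorThroughput() {
        // publish events through the reactor and await their consumption
    }
}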

Here is a sample of the summary output produced by the JMH tool after it has run a relatively large number of benchmark configuration iterations within separate processes. Each row lists the benchmark, its parameter values, the mode, the number of measured samples, the mean throughput, the mean error and the units.

o.p.b.r.ReactorBenchmarks.reactorThroughput ringBuffer 10000 thrpt 15 1445.069 241.602 ops/s
o.p.b.r.ReactorBenchmarks.reactorThroughput ringBuffer 1000 thrpt 15 11955.832 2847.127 ops/s
o.p.b.r.ReactorBenchmarks.reactorThroughput ringBuffer 100 thrpt 15 100754.020 19946.595 ops/s
o.p.b.r.ReactorBenchmarks.reactorThroughput sync 10000 thrpt 15 1748.662 81.455 ops/s
o.p.b.r.ReactorBenchmarks.reactorThroughput sync 1000 thrpt 15 22749.599 1028.661 ops/s
o.p.b.r.ReactorBenchmarks.reactorThroughput sync 100 thrpt 15 265829.334 3000.288 ops/s
o.p.b.r.ReactorBenchmarks.reactorThroughput threadPoolExecutor 10000 thrpt 15 552.242 56.231 ops/s
o.p.b.r.ReactorBenchmarks.reactorThroughput threadPoolExecutor 1000 thrpt 15 6101.796 685.604 ops/s
o.p.b.r.ReactorBenchmarks.reactorThroughput threadPoolExecutor 100 thrpt 15 60206.842 1353.195 ops/s
o.p.b.r.ReactorBenchmarks.reactorThroughput workQueue 10000 thrpt 15 226.803 6.641 ops/s
o.p.b.r.ReactorBenchmarks.reactorThroughput workQueue 1000 thrpt 15 2021.130 270.393 ops/s
o.p.b.r.ReactorBenchmarks.reactorThroughput workQueue 100 thrpt 15 21433.804 1140.880 ops/s

This is not terribly useful from a software performance analysis perspective beyond the benchmark scores themselves, but fortunately JMH does come with some built-in profiling capabilities. One of those is a thread stack sampling profiler that is activated as follows:

java -jar target/microbenchmarks.jar -prof stack ".*ReactorBenchmarks.*"

The JMH tool doesn’t appear to aggregate the sampled stack results the way it does the actual benchmark measurements; after each benchmark iteration a stack sampling report is printed. Here is an example of the output produced by this particular profiler.

Iteration   1: 1223.517 ops/s
       Stack |  90.0%    WAITING sun.misc.Unsafe.park
             |   3.7%   RUNNABLE org.projectreactor.bench.reactor.ReactorBenchmarks.reactorThroughput
             |   1.9%   RUNNABLE com.gs.collections.impl.list.mutable.UnmodifiableMutableList.<init>
             |   1.1%   RUNNABLE reactor.event.routing.ConsumerFilteringEventRouter.route
             |   1.0% TIMED_WAITING sun.misc.Unsafe.park
             |   0.8%   RUNNABLE com.lmax.disruptor.BatchEventProcessor.run
             |   0.6%   RUNNABLE com.gs.collections.impl.map.mutable.ConcurrentHashMap.get
             |   0.4%   RUNNABLE reactor.event.dispatch.AbstractLifecycleDispatcher.route
             |   0.2%   RUNNABLE sun.misc.Unsafe.park
             |   0.1%   RUNNABLE com.lmax.disruptor.MultiProducerSequencer.getHighestPublishedSequence
             |   0.0%            (other)
             |
Iteration   2: 1235.400 ops/s
       Stack |  90.0%    WAITING sun.misc.Unsafe.park
             |   3.6%   RUNNABLE org.projectreactor.bench.reactor.ReactorBenchmarks.reactorThroughput
             |   2.0%   RUNNABLE com.gs.collections.impl.list.mutable.UnmodifiableMutableList.<init>
             |   1.1% TIMED_WAITING sun.misc.Unsafe.park
             |   0.9%   RUNNABLE reactor.event.routing.ConsumerFilteringEventRouter.route
             |   0.9%   RUNNABLE com.lmax.disruptor.BatchEventProcessor.run
             |   0.6%   RUNNABLE com.gs.collections.impl.map.mutable.ConcurrentHashMap.get
             |   0.5%   RUNNABLE reactor.event.dispatch.AbstractLifecycleDispatcher.route
             |   0.2%   RUNNABLE sun.misc.Unsafe.park
             |   0.1%   RUNNABLE org.projectreactor.bench.reactor.generated.ReactorBenchmarks.reactorThroughput_Throughput_measurementLoop
             |   0.1%            (other)

Of the other built-in profiling capabilities of JMH, I would strongly recommend checking out the perfasm option on Linux. Very impressive!

SATORIS
Let’s now see what an event based profiler can collect whilst the benchmark is running. Low latency micro-benchmarks are in general extremely challenging for code instrumentation profilers, but fortunately Satoris is both adaptive and extremely efficient, and it ships with many built-in meters that offer alternatives to wall clock time profiling when needed. Looking at the above benchmark results I decided it would be best to switch out the clock.time microsecond resolution meter for the clock.tick nanosecond resolution meter, though admittedly this is problematic on Mac OS X with the current Java 8u5 runtime. I’ll come back to addressing this particular issue later in the article.

Here is the configuration for the Satoris adaptive metering engine, which is called into by the instrumentation hooks added to class bytecode as it is loaded into the runtime.

j.s.p.hotspot.threshold=250
j.s.p.hotspot.threshold.inherent=50
j.s.p.hotspot.meter=clock.tick
j.s.p.console.snapshot.probes.disabled.include=true
j.s.p.meter.clock.time.include=false
j.s.p.meter.clock.tick.include=true

I’m not going to go into great detail explaining the actual adaptive mechanism used by the metering engine. Basically, the above configuration will disable the instrumentation and metering of method calls that, on average (not in the strict statistical sense), do not exceed a total latency of 250 nanoseconds and an inherent (self) latency of 50 nanoseconds. This is done online, with the end result being a list of hotspot probes representing the most suitable points in the code to investigate possible tuning options, which incidentally are not always the locations of the actual high frequency hotspots (hint: caller-to-callee loops).
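
To make the idea a little more concrete, here is a deliberately naive sketch of threshold-based disablement. It is not Satoris code and the real engine is certainly more sophisticated (it keeps a scorecard balance, touched on in the updates below); the sketch merely illustrates the principle of switching a probe off once its average cost proves too cheap to be worth metering.

// Hypothetical illustration only; names and logic are mine, not Satoris internals.
final class ProbeScorecard {

    private final long thresholdTotal;    // e.g. 250 ns
    private final long thresholdInherent; // e.g. 50 ns

    private long count;
    private long total;    // metered cost including metered callees
    private long inherent; // metered cost excluding metered callees
    private boolean enabled = true;

    ProbeScorecard(long thresholdTotal, long thresholdInherent) {
        this.thresholdTotal = thresholdTotal;
        this.thresholdInherent = thresholdInherent;
    }

    // Called by the instrumentation hook on each metered return.
    void record(long totalCost, long inherentCost) {
        if (!enabled) {
            return;
        }
        count++;
        total += totalCost;
        inherent += inherentCost;
        // Disable the probe online once the method proves, on average, too cheap
        // to justify the overhead of continued metering.
        if (total / count <= thresholdTotal && inherent / count <= thresholdInherent) {
            enabled = false;
        }
    }

    boolean isEnabled() {
        return enabled;
    }
}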

Initially I restricted the instrumentation to the reactor.* package. In the early stages of any software performance analysis the focus should be on the code that can be changed. Understanding of the software execution behavior and resource consumption profile should always start from the perspective of the application codebase. In this case the reactor.* package is the application and all other library dependencies are black boxes (from an instrumentation coverage perspective). Again it is best to start with only measuring what can be changed (directly).

In the metering table below, which I’ve color inverted in a photo editing package, I’ve listed both the hotspot and the disabled labeled probes. Because probes can be disabled during the benchmark run, the counts and totals for such methods are only accurate in the sense of the method invocations that were metered up to the point of disablement. Only one of the probes still being metered at the end of the benchmark matches a method picked out by the stack sampling profiler, but then again the code coverage and focus are very different. Thread stack sampling is focused on what the system appears to be busy doing, but in a modern multithreaded runtime appearances are very deceiving, especially when there is no sense of the activities and the transitions between them. Adaptive metering is far more powerful and relevant in its reporting when it comes to identifying the appropriate point for algorithmic change.

[Metering table image: autoletics.blog.reactor.benchmark.clock.tick]

In the configuration below I’ve switched out the clock.tick meter for the spin.tick meter. Both meters report time intervals in nanoseconds. The difference is that the spin.tick meter eliminates the call to the underlying OS clock counter from the metered thread execution call path at the expense of a spinning core.

j.s.p.hotspot.threshold=100
j.s.p.hotspot.threshold.inherent=25
j.s.p.hotspot.meter=spin.tick
j.s.p.console.snapshot.probes.disabled.include=true
j.s.p.meter.clock.time.include=false
j.s.p.meter.spin.tick.include=true
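
The spinning clock technique itself is simple enough to sketch. The code below is my own illustration of the idea, not the Satoris implementation: a dedicated thread keeps a volatile timestamp fresh so that metered threads pay only a volatile read rather than a call to the OS clock.

// Illustrative sketch of a spinning clock source; not Satoris source code.
final class SpinningClock implements Runnable {

    private volatile long nowNanos = System.nanoTime();

    @Override
    public void run() {
        // Burns a core; in practice the spinner would be pinned to a spare CPU.
        while (!Thread.currentThread().isInterrupted()) {
            nowNanos = System.nanoTime();
        }
    }

    // Metered threads read the latest published timestamp instead of calling the clock.
    long tick() {
        return nowNanos;
    }

    static SpinningClock start() {
        SpinningClock clock = new SpinningClock();
        Thread spinner = new Thread(clock, "spinning-clock");
        spinner.setDaemon(true);
        spinner.start();
        return clock;
    }
}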

Below is the metering table exported to the file system following completion of a benchmark iteration of the RingBuffer configuration. The results appear consistent with, though slightly better in throughput than, the previous metering results. Due to the nature of the adaptive mechanism underlying the metering engine, the counts for the disabled probes vary across tables, even with the same configuration. This is very much like how the JVM HotSpot compiler differs in its bytecode compilation across process executions.

[Metering table image: autoletics.blog.reactor.benchmark.spin.tick]

To address issues with the precision of the underlying clock counter on my test platform I’ve now switched to another built-in meter, spin.count, which increments a volatile field in a tight loop. It is important to note that none of these changes has required any change to the instrumentation itself; a good example of why you should never attempt to create your own ad hoc profiling solution just because you know a little bit of Aspect Oriented Programming (AOP). Choosing the appropriate threshold values to drive the adaptive mechanism is a little more complicated. I went with 25 and 10, probably on the low side.

j.s.p.hotspot.threshold=25
j.s.p.hotspot.threshold.inherent=10
j.s.p.hotspot.meter=spin.count
j.s.p.console.snapshot.probes.disabled.include=true
j.s.p.meter.clock.time.include=false
j.s.p.meter.spin.count.include=true
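
A counting spinner sidesteps the clock entirely, which is my reading of what a meter like spin.count amounts to. The sketch below is illustrative only, under that assumption, and not the actual meter: the spinner increments a volatile counter in a tight loop and elapsed "time" is expressed in observed spins.

// Illustrative only: a relative time source with no clock call at all.
final class SpinningCounter implements Runnable {

    private volatile long spins;

    @Override
    public void run() {
        // Only the spinner thread writes, so the non-atomic increment is safe here.
        while (!Thread.currentThread().isInterrupted()) {
            spins++;
        }
    }

    // The unit is "spins": platform dependent, but monotonic while the spinner runs.
    long count() {
        return spins;
    }
}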

In the metering table below, produced with the above metering configuration, I’ve filtered out the disabled probes. The usual suspects (from above) are there along with a few new ones that were previously disabled during metered execution.

[Metering table image: autoletics.blog.reactor.benchmark.spin.count]

In the following configuration I’ve dropped the inherent threshold down from 10 to 5.

j.s.p.hotspot.threshold=25
j.s.p.hotspot.threshold.inherent=5
j.s.p.hotspot.meter=spin.count
j.s.p.console.snapshot.probes.disabled.include=true
j.s.p.meter.clock.time.include=false
j.s.p.meter.spin.count.include=true

Here is the new metering model following re-execution of the benchmark with the revised threshold setting. I’ve also included in the listing those probes labeled disabled.

[Metering table image: autoletics.blog.reactor.benchmark.spin.count.ti5]

It can be useful to get some idea of how much invocation work is, directly or indirectly, performed by a probe. In the configuration below I switched over to the probe.count meter, which counts the number of metered method invocations that occur within the call window of a method. This depends on the amount of code coverage and the adaptive mechanism employed. I set both adaptive thresholds to 0 to ensure there was no dynamic disablement of metering during a benchmark iteration.

j.s.p.hotspot.threshold=0
j.s.p.hotspot.threshold.inherent=0
j.s.p.hotspot.meter=probe.count
j.s.p.console.snapshot.probes.disabled.include=true
j.s.p.meter.clock.time.include=false
j.s.p.meter.probe.count.include=true
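
My understanding of a call-window counting meter like probe.count can be sketched as follows, with hypothetical names rather than the actual Satoris meter: every metered invocation bumps a thread-local counter, and a probe's total is the number of metered invocations that occurred within its own call window.

// Hypothetical sketch of call-window counting; not the Satoris meter itself.
final class ProbeCountMeter {

    private static final ThreadLocal<long[]> INVOCATIONS =
            ThreadLocal.withInitial(() -> new long[1]);

    // Called on entry to every metered method; returns a snapshot of the counter.
    static long begin() {
        long[] cell = INVOCATIONS.get();
        cell[0]++; // count this invocation itself
        return cell[0];
    }

    // Called on exit; the delta is the number of metered invocations performed,
    // directly or indirectly, within the call window.
    static long end(long snapshotAtBegin) {
        return INVOCATIONS.get()[0] - snapshotAtBegin;
    }
}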

The overall throughput has dropped, but it does give us an idea of how much of the code in the reactor.* package calls other code in the reactor.* package. Looking at the inherent total for the Reactor.notify probe I now suspect this is an overloaded method that is counted twice. I could enable the nonrecursive metering extension to eliminate one of these invocations, which might well have eliminated the probe from the other results reported above.

[Metering table image: autoletics.blog.reactor.benchmark.probe.count]

A major concern with any event processing platform is garbage collection. In the following configuration I’ve switched over to the jmgmt.alloc.bytes meter. The meter uses a method on the ThreadMXBean interface to track heap allocations. Contrary to what has been said by some Google performance engineers, this does come with a performance hit, especially when called inline within the thread execution context. With this configuration a probe is disabled if it does not, on average, allocate an object, directly or indirectly, within its call window.

j.s.p.hotspot.threshold=1
j.s.p.hotspot.threshold.inherent=0
j.s.p.hotspot.meter=jmgmt.alloc.bytes
j.s.p.console.snapshot.probes.disabled.include=true
j.s.p.meter.clock.time.include=false
j.s.p.meter.jmgmt.alloc.bytes.include=true
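
On HotSpot the measurement in question is, as far as I know, exposed through the com.sun.management extension of ThreadMXBean. A minimal sketch of wrapping it as an allocation meter, under that assumption and with names of my own choosing, might look like this:

import java.lang.management.ManagementFactory;

// Illustrative allocation probe; the cast assumes a HotSpot style JVM that
// exposes the com.sun.management.ThreadMXBean extension. Not Satoris code.
final class AllocationMeter {

    private static final com.sun.management.ThreadMXBean THREADS =
            (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();

    static long begin() {
        return THREADS.getThreadAllocatedBytes(Thread.currentThread().getId());
    }

    // Bytes allocated by the current thread, directly or indirectly,
    // since the matching begin() call.
    static long end(long bytesAtBegin) {
        return THREADS.getThreadAllocatedBytes(Thread.currentThread().getId()) - bytesAtBegin;
    }
}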

In the metering table below the biggest allocation culprit is CachingRegistry.select. Probably not something you would expect from a class with a "Caching" prefix.

[Metering table image: autoletics.blog.reactor.benchmark.alloc.bytes]

Here is the source code, pulled from GitHub, for the CachingRegistry.select method.

[Source code image: autoletics.blog.reactor.benchmark.code.1]

Time to widen the code coverage and include the com.gs.collections.impl.* package. The metering table pretty much shouts out the allocation call sequence ending with a call to UnmodifiableMutableList.of().

[Metering table image: autoletics.blog.reactor.benchmark.gc.alloc.bytes]

Here is the new object allocation site. The actual constructor, UnmodifiableMutableList.<init>(), did come up in the stack sampling results. The metering engine was able to link a somewhat innocent-looking line much higher in the call stack with the most direct enclosing calling site.

[Source code image: autoletics.blog.reactor.benchmark.code.2]

Stack sampling is a yesteryear approach largely promoted by those who have not moved with the times. Adaptive activity metering is most certainly a step up the evolutionary ladder for both tool and tool user. But I will readily admit that there are limits. For extreme low latency execution analysis you need a measurement technology that is able to pinpoint when behavior takes a different course (flow/branch) at a much finer granularity. Developers need to embed signals into code that label and track branches representing fast and slow paths within an algorithm. I created a technology that does just this in 2011 and published an article describing it in 2012 on the JINSPIRED.com site titled “Introducing Signals – The Next Big Thing in Application Management”. Good luck with the Googling.

UPDATE #1
I was asked to follow up the above article with a metering model that included some of the other library dependencies. I thought I might as well also address some other concerns with the above metering models, in particular method overloading and the possible premature disablement of probes before sufficient load or conditional state had arisen.

# adds a unique class method id to each probe name
j.s.p.ext.aspectj.probe.name.joinpoint.id.include=true
# increases the initial hotspot scorecard balance from 1000 to 10000 
j.s.p.hotspot.initial=10000
j.s.p.hotspot.threshold=15
j.s.p.hotspot.threshold.inherent=5
j.s.p.hotspot.meter=spin.count
j.s.p.console.snapshot.probes.disabled.include=true
j.s.p.meter.clock.time.include=false
j.s.p.meter.spin.count.include=true

As is to be expected in the RingBuffer benchmark configuration, the LMAX Disruptor library hits the top spot. Note that the Reactor.notify probe is now listed twice in the metering table.

[Metering table image: autoletics.blog.reactor.spin.count.all]

UPDATE #2
In the last configuration I should really have increased the lower hotspot bound after increasing the initial balance as follows:

j.s.p.hotspot.lower=20000

Here is a revised metering model with the change applied; a slightly higher throughput was achieved.

[Metering table image: autoletics.blog.reactor.spin.count.all.lower]

Again, at this level you are looking for clues to guide further code inspection within an editor. There must be some acceptance of the fact that red herrings will creep into the model due to the nature of execution and its measurement. But still and all, it sends thread call stack sampling back into the stone age where it came from and surely belongs, and if we are lucky it carries with it those who hold up such an approach as the state of the art in Java performance profiling.