To achieve a real understanding of software for the purpose of managing and changing it, we need a deeper level of observation, one as close as possible to what the software actually does. Software does not log; developers write log calls to create an echo chamber.
This article outlines the reasoning behind, and usage of, the gap metering extension included in Satoris, which, unlike possibly all other application performance monitoring solutions, focuses on what transpires between consecutive service requests processed by a particular thread for the same entry point method.
The second part in a series of articles looking at the microbenchmarking of big data solutions running on the JVM. In this part the performance model is further refined over a number of configuration steps, each building on the last, with the purpose of deriving a smaller, simpler and more relevant model of the microbenchmark, facilitating more targeted code inspection as well as a better understanding of the nature of execution flow across method, class and package boundaries.
The challenge in measuring not just the microbenchmark code but the underlying method invocations within the software under test (performance analysis) is ensuring that the required instrumentation, measurement and collection does not perturb the system to such an extent that it becomes an entirely different system (the measurement is not relevant) and the data collected is neither accurate nor actionable. This challenge can be overcome with adaptive profiling of the codebase, both online and offline. Here this advanced application performance monitoring approach is applied to Apache Spark.
There are a few interpretations of “static tracing” but at this point, I can only assume that the following tweet is referring to the more common case of traces (probes) being explicitly coded or compiled into software at build time. The reason for the different possible interpretations of static, dynamic and adaptive is that in the tracing, monitoring or (preferably) metering of software execution behavior there are three aspects that can be classified in this manner and somewhat independently. They are instrumentation, measurement, and collection.
Static instrumentation is added into the software at code, compilation or post build time. Dynamic instrumentation is added into the code at runtime, as in the case of a JVM instrumentation agent. Some tracing systems claim to be dynamic though the instrumentation is applied statically by way of support for custom callbacks (or interceptors) at the execution of the static instrumentation. Adaptive instrumentation is added into and also removed from code during runtime execution, based on value and cost as well as some other behavioral analysis. Adaptive instrumentation is very effective at dealing with pass through methods and leaf nodes in a call tree that have very little inherent cost (clock time). This approach is very much supported and complemented by the underlying dynamic and adaptive compilation nature of the Hotspot JVM. Adaptive instrumentation can be done in online or offline mode. In the case of Satoris, both adaptive modes are fully supported including a hybrid mode which allows for instrumentation to be continually redefined based on the past (offline) and present (online) measured software execution behavior – IMRR (Instrument, Measure, Refine and Repeat).
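The bookkeeping behind adaptive instrumentation can be sketched roughly as follows. This is not the Satoris implementation, only a minimal illustration under assumed names (`AdaptiveRegistry`, `record`) of the idea that probes whose observed average cost stays below a threshold are disabled and, conceptually, scheduled for de-instrumentation.

```java
// Hypothetical sketch: probes with a low average inherent cost are
// disabled after a warm-up sample count, so their instrumentation can
// be removed at runtime. Names are illustrative, not the Satoris API.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class AdaptiveRegistry {
    static final long HOTSPOT_THRESHOLD_NANOS = 5_000; // e.g. 5 microseconds
    static final int MIN_SAMPLES = 100;                // warm-up before deciding

    static final Map<String, long[]> stats = new HashMap<>(); // name -> {count, totalNanos}
    static final List<String> disabled = new ArrayList<>();   // probes to de-instrument

    // Called on each probe firing with its measured inherent cost.
    static void record(String probe, long inherentNanos) {
        long[] s = stats.computeIfAbsent(probe, k -> new long[2]);
        s[0]++;
        s[1] += inherentNanos;
        // After enough samples, decide: keep measuring, or strip the probe
        // because its value does not justify the measurement cost.
        if (s[0] >= MIN_SAMPLES && s[1] / s[0] < HOTSPOT_THRESHOLD_NANOS) {
            disabled.add(probe);
            stats.remove(probe);
        }
    }
}
```

A cheap getter would be struck off after the warm-up period, while a genuinely costly method would continue to be measured.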
Profiling Low Latency JVM Applications with Instrument-Measure-Refine-Repeat (IMRR)
When trace instrumentation is always measured it is considered static (in nature). What is instrumented, is measured, is collected. This is irrespective of how the instrumentation is classified. Dynamic measurement is when the measurement can be turned on and off based on some setting (possibly accessible to remote tools) or the presence of an observer (man or machine). The instrumentation fires (executes) but the measurement, such as reading a clock time counter, is conditional. Adaptive measurement is when the tracing system internally decides at runtime whether the instrumentation that is firing is actually measured. It uses one or more adaptive strategies to decide this, factoring in, at a minimum, the cost and value of the trace being fired and in what trace context. The decision is far more complicated than any sort of dynamic measurement and changes based on numerous factors that rely heavily on the prior history of measurement. In Satoris this is referred to as Intelligent Activity Metering (IAM) and it encompasses more than 35 extensions to the metering runtime.
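The distinction between firing and measuring can be made concrete with a small sketch. The names (`ConditionalMeasurement`, `Strategy`, `onEnter`) are illustrative assumptions, not any real agent API: the injected probe always executes, but the expensive clock read only happens when the strategy says so.

```java
// Illustrative only: the probe always fires, but whether the (relatively
// expensive) clock read happens is decided at runtime by a pluggable
// strategy - dynamic if it is a simple setting, adaptive if the strategy
// reasons over cost, value and prior measurement history.
public class ConditionalMeasurement {
    interface Strategy { boolean shouldMeasure(String probe); }

    static long measured;                  // number of actual clock reads
    static Strategy strategy = p -> true;  // static behavior: always measure

    // What the injected instrumentation executes at method entry.
    static long onEnter(String probe) {
        if (!strategy.shouldMeasure(probe)) return -1L; // fired, not measured
        measured++;
        return System.nanoTime();
    }
}
```

Swapping the strategy at runtime turns the same instrumentation from static to dynamic to adaptive without retransforming any bytecode.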
A JVM Profiling Agent that does the work of a Software Performance Analyst
Static collection of trace data is when the tracing system always collects (stores) what is measured. Again it does not reason about the value of what has been measured and the additional cost to be incurred before executing the collection routine, such as stack traces, call graphs, aggregations, distributions, etc. Dynamic collection can sometimes refer to a tracing system that enables (or disables) certain data collectors via configuration. At runtime it is static. This might seem a strange classification if it were not for the fact that many tracing/profiling tools do not offer the ability to turn off baked-in collectors – an all or nothing proposition. Dynamic collection can also refer to a tracing system that allows various collection routines to be turned on and off at runtime during the course of tracing a software execution. This can be done either locally or remotely, from within the code or from some remote management interface. Adaptive collection is when the tracing system defers various collection routines until some activation point such as a particular trace context, a “novel” sequence/temporal pattern or an associated labeling of a trace flow performed by some other adaptive learning routine. Dynamic collection is conditional but largely instructional. Adaptive collection is conditioned on the past and present in terms of execution behavior and system environment.
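The deferral idea can be sketched in a few lines. This is a hypothetical illustration, not a real collector: the costly collection routine (standing in for a stack or call graph capture) is only activated when a firing deviates from the probe's learned latency baseline, i.e. collection conditioned on past behavior.

```java
// Hedged sketch of adaptive collection: an expensive collection routine
// is deferred until a probe firing looks "novel" relative to a learned
// baseline. Names and the 2x novelty rule are illustrative assumptions.
import java.util.HashMap;
import java.util.Map;

public class AdaptiveCollection {
    static final Map<String, Long> baseline = new HashMap<>(); // learned averages
    static int collections; // stand-in for stack/call-graph captures performed

    static void onExit(String probe, long elapsedNanos) {
        Long avg = baseline.get(probe);
        // Activate collection only when this firing is well outside the baseline.
        if (avg != null && elapsedNanos > 2 * avg) {
            collections++;
        }
        // Fold the observation into the running baseline (simple smoothing).
        baseline.put(probe, avg == null ? elapsedNanos : (avg + elapsedNanos) / 2);
    }
}
```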
Beyond Metrics and Logging with Metered Software Memories
All tracing systems, except for the Autoletics technology, are live tracing systems. Instrumentation is fired in a live environment. Measurement is performed in a live environment. Collection is done in a live environment. With the mirroring and simulation capabilities of Satoris, Simz and Stenos the same trace (metering) solution can be embedded within a real or simulated machine environment and still have the adaptive aspects of instrumentation, measurement and collection perform without a change in the playing back of a software episodic memory (a memory trace of sorts). The ability to trace a simulated environment with the same capabilities as a live system is a profound change to how we approach the measurement and understanding of software execution behavior. It opens up so many new opportunities that can never be realistically matched with static or dynamic tracing. Imagine the execution of C/C++ code being simulated and adaptively traced within an entirely different machine runtime and architecture – now and in the future. Simulated and adaptive represent the next stage in the evolution of software, and not just in terms of tracing. When these two are combined and integrated we will finally move from intelligently writing code to engineering intelligent systems.
The Marking of Adaptive Software Memories
This week Simz 2.3 broke all previous benchmark records in simulating 270 million metered network streamed calls a second on a Google Cloud n1-highcpu-32 machine type instance. That is 540 million call events a second - 32 billion events a minute. The software execution calls originated in 28 client JVMs also running on an n1-highcpu-32 machine type instance. On average each client was able to invoke 9.6 million instrumented method calls a second from a single thread per process with an average client call latency of 100 nanoseconds. The CPU utilization on the Simz machine was pegged at just over 90% with the incoming network data transmission at 985MB a second.
I've long been fascinated by how best to perceive the behavior of software machines that for the most part appear as black boxes; consuming input we feed them and producing output we consume, directly or indirectly. I cannot help feeling there is a lost beauty in the motion of action that needs to be rediscovered in order to acquire a far greater understanding of what it is that software and hardware machines do, and how, in the course of action(s) and state changes (encompassing the environment), such a system changes behavior in ways not observed, accurately predicted, nor fully understood.
A proposal for a different approach to application performance monitoring that is far more efficient, effective, extensible and eventual than traditional legacy approaches based on metrics and event logging. Instead of seeing logging and metrics as primary data sources for monitoring solutions, we should see them as a form of human inquiry over some software execution behavior that is happening or has happened. With this in mind it becomes clear that logging and metrics do not serve as a complete, contextual and comprehensive representation of software execution behavior.
The following is a graphic I've used in the past to frame various software performance optimization techniques. It is not a comprehensive inventory of all software performance optimization techniques (or concerns) but I've found it serves a purpose in managing the amount of effort that, in general, should be spent on each technique outside of extreme cases such as trading platforms (or profilers). The left side is your typical localized bottom up approach to speeding up code execution steps.
How does a software performance engineer jump into a large codebase and immediately start to identify the essence of the software execution as well as the potential hotspots or bottlenecks within the code base? The answer is profiling of a representative workload – a benchmark. But measuring the performance of a large codebase is fraught with problems: the selection of which methods to instrument, how such methods are measured, what measurement data is collected, and how the measurement data should be stored, analyzed and visualized. Too much instrumentation (and its measurement) is likely to perturb the results to such an extent that they are simply wrong, at least in the context of the initial benchmark. The challenge is even more pronounced when it is a highly multi-threaded system that is being benchmarked and observed, with implicit internal and external coordination between components and client drivers. Any significant performance impact and the system under observation becomes an entirely different system. If that is not bad enough, an engineer is likely to end up with a list of 1,000 to 10,000 measurement points (methods). How does a performance engineer manage overhead and refine the instrumentation space when the hotspots are not known beforehand?
A performance engineer is ideally looking for a relatively small list of code locations worthy of further investigation. The investigation can be online, done from within a live monitoring console, or offline, viewing the source code in an editor. This is not to say that all hotspots are bottlenecks and need to be tuned. For the most part, they give the performance engineer a digest of what it is this system does, when it does it, how frequently it does it and how much time or resource is consumed in doing it. Rarely does a performance engineer jump into the code of a hotspot and, voilà, identify how it should be changed and tuned. Unless the code is of poor quality, in which case throwing darts at a board holding source code printouts might well be more efficient (this is your typical call stack profiling technique). With software of some reasonable level of quality, a much harder task awaits the performance engineer once he/she peers beneath the hotspot covers, as invariably making code faster, or more scalable, requires some redirection and short circuiting of execution flow that is best done by those who developed the sources. A performance engineer needs to shine a light in the darkness, give focus to tuning efforts, track progress towards some objective and along the way offer expert guidance in optimization, with both a local and global perspective of the software execution and the system dynamics.
Having spent 10+ years looking for better ways to tackle the enormity of this task, the best that can be achieved today is through some form of adaptive instrumentation and measurement, driven by one or more hotspot seeking measurement strategies. Call stack sampling does not cut it, and neither does transaction tracing, when a codebase has 10,000 possible measurement points all firing simultaneously across a large number of executing threads. A performance engineer can very easily become lost in the woods of the resulting call trees that are formed as execution flows across such a large number of measurement points (nodes). New ways of visualizing are not the sole answer. There needs to be a significant reduction in measurement data collected well before it is presented, but this decrease cannot be done blindly, or stupidly, but instead intelligently and in context. Clearly, this is best done by the executing machine itself, and it should happen with little effort other than setting one or two threshold values that inform the measurement runtime of the typical latency band of the entire application processing patterns and the acceptable level of overhead that can be budgeted. This needs to be automated so that more time is spent productively on the much harder task of determining how best to restructure the execution flow for the purpose of improved efficiency and scale. As soon as a change is committed, feedback on the movement of performance hotspots needs to be immediate, relevant and relatively accurate.
So how does this look in practice? Let’s take the latest release of Apache Cassandra (2.1) and instrument all +5,000 methods found in the
io.netty.* packages using an adaptive measurement agent that has some hotspot thresholds set extremely low to eliminate much of the getter/setter and call delegation noise. Then using the new and much-improved cassandra-stress tool a series of write and read workloads are executed. At the end of any one of the benchmarks, the result should be a relatively small list of hotspot methods. Below is such a list from the read stress benchmark run. This is 50 out of 5000, and it was automatically whittled down by the agent. It is important to note that this is not just some top 50 method listing. The listing holds those methods still measured at the end of the benchmark, after the agent disabled and removed the instrumentation from all of the other non-hotspot methods. In doing this continuously at runtime, the agent not only decreases overhead and improves accuracy, it also causes the list to shrink automatically as methods labeled as hotspots become disabled as the perturbation reduces.
Let's now have a look at some of the methods labeled as a
hotspot by the agent. Below is the source code listing for
CollationController.collectTimeOrderedData. Note the numerous
Tracing.trace() calls indicating that the developer who wrote the code also considered it an important observation point.
Here is the listing for
SSTableReader.getPosition(). Again the agent has automatically picked out a method that is also littered with
Finally the source code listing for
SelectStatement.getSliceCommands. There are no traces present here, but the inline comments are interesting. The ordering of the
ArrayList allocation should be corrected and placed after the
if statement. When a loop (and object cloning) is driven by the size of a data structure, then the performance of such methods can sometimes be classified as a hotspot and other times not. It is wonderful when a particular technology can find needles in a haystack, saving time in the scanning of a huge codebase that might not even be understood by the engineer who originally produced it.
We now move onto the write stress benchmark. This benchmark was executed first to populate the database for the read. A similar number of methods were instrumented and measured, and again an unmanageable number was reduced to a handful of hotspot methods still being measured at the end of the benchmark run. A quick scan down the table listing picking out the class and method names reveals the nature of the execution.
Here is the code listing for the
MergeIterator$ManyToOne class. You might be surprised to see that the agent has picked out the
computeNext() method. What has happened is that the two methods called by the method were disabled as in themselves they don’t exceed the required hotspot threshold until both cost profiles are combined at the caller point.
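The arithmetic behind this promotion is worth making explicit. The sketch below is purely illustrative (the class and method names are assumptions): while callees remain instrumented, a caller's inherent time is its total minus the callee totals; once the callees are de-instrumented, their time folds into the caller's inherent time, which can push the caller over the hotspot threshold.

```java
// Illustrative rollup of cost attribution: de-instrumenting callees
// moves their clock time into the caller's inherent (self) time.
public class CostRollup {
    // Caller inherent time = total time minus the totals of any callees
    // that are still instrumented and measured.
    static long inherent(long callerTotalNanos, long... calleeTotalNanos) {
        long callees = 0;
        for (long t : calleeTotalNanos) callees += t;
        return callerTotalNanos - callees;
    }
}
```

With two measured callees of 4 units each inside a 10 unit caller, the caller's inherent time is 2; disable both callees and the caller's inherent time becomes the full 10, exceeding a threshold neither callee crossed alone.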
CommitLog.add() method has some object allocation and IO calls. Then there is the time spent waiting for async work to be completed by an executor, to be explored in a follow-up article.
50% of the time spent in executing the method,
ColumnIndex$Builder.buildForCompaction, is inherent to the method so again we have the adaptive mechanism at play, in that after a period of measuring the called methods and attributing cost to them it then disables these methods and subsequent costs are assigned to the caller method. You might ask yourself how one tunes such a method. It might not be possible unless one can reduce the number of iterations performed. The point here is that this method represents an important point in the production of invocation work that branches outwards and flows downwards to other methods that individually have relatively low-cost profiles.
Only 15% of the time spent executing
StorageProxy.performWrite is inherently attributed to the method itself but like in the previous case the method represents an important execution point for what flows onwards in the processing.
99% of the time executing
ModificationStatement.getMutations is attributed to the method itself because the instrumentation was not applied to the collection classes allocated and accessed within the scope of execution.
Here is the code listing for
Memtable.put. Again it would be hard for a developer to readily assess the cost profile of such a method without first understanding what drives loop iterations as well as the size of data structures cloned.
Another loop and another conditional exit which might still fail after some execution costs have already been incurred. There is a common behavioral pattern to all of these write related methods.
It is wonderful when out pops a short list of hotspots but what is even more wonderful is when the call tree that is created by the profiling agent only ever includes hotspot methods.
Because of the hotspot adaptive mechanism, it can be the case that a callee method can be promoted to a hotspot before its caller and hence appear as a distinct branch of the tree without a caller.
You might have noticed a scattering of
awaitUntil calls in the call tree. These leaf nodes will be treated differently in a follow-up article, questioning the hotspot analysis of asynchronous scheduling call sites.
The ability for software to automatically (self-)adapt its execution behavior based on what has happened (past), what is happening (present) and what might happen (predicted) is incredibly powerful. Adaptation forms the basis, if not the foundation, for improved resilience, performance, scalability, and versatility in (higher order) functioning. Unfortunately, there are limits, some intrinsically imposed, to how far online runtime adaptation can succeed in its fitness for purpose. When faced with such limits one course of action is destruction and rebuild. A rebirth might still be considered an adaptation of sorts but on an entirely different scale. For example (some) humans are known to adapt their behavior in an instant to a detected change in context, whereas nature adapts genetically over the course of multiple generations of a species, sometimes kick-started (or bootstrapped) by major events and subsequent environmental changes. So what has all this got to do with Java performance analysis? Let's first look at the typical steps involved in the performance analysis of software.
With each of the above steps comes a cost that, if not managed intelligently, will result in a final performance model that offers little insight, or worse, misleads the analysis and subsequent recommendations. To be able to find the unknown hotspots a significant amount of the code must first be instrumented. But in doing so, a performance drag (higher response time and lower throughput) is created that causes the system to drift or divert from its normal behavior, and the resulting performance model will not be accurate or relevant to the purpose of the investigation. Self-adaptive technology can help here in using collected data to alter (adapt) the behavior of the measurement and instrumentation steps.
Adaptation, for the most part, relies on a memory of past behavior, and it is this, and the residue of previous adaptations, that challenges how much instrumentation and measurement overhead can be eliminated. Without the ability to do a full reset on each level and layer of adaptation the inefficiency remains, and so too does the perturbing (overhead) cost. Even with “on the fly” disablement of instrumentation and its measurement by an intelligent performance measurement agent, it is still possible for an inaccuracy to sneak into a model, especially if the final model includes data pertaining to both before and after adaptation phases. Ideally we only want to instrument and measure performance hotspots, but we can only know of these after the fact, and then we must accept that some of what is in the model may be wrong. In data science (and mining) this is sometimes referred to as a leak – information that is historically captured but is not available at the time a decision that requires it has to be made. To solve this problem, especially when management of (overhead) cost is paramount, requires an approach that is incremental, isolating (selective memory) and iterative.
A benchmark run is performed to determine the initial performance hotspots. After completion a new instrumentation plan is created that includes only these hotspots. The benchmark is then performed again, resulting in a possible refinement of the hotspot candidates. This iterative process continues until eventually the set of instrumented probes is the set of probes classified as hotspots.
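The iteration above terminates at a fixpoint, which can be sketched as a small driver loop. This is a conceptual sketch, not the Satoris tooling: the `benchmark` function stands in for a full benchmark run that returns the subset of instrumented probes classified as hotspots.

```java
// Hedged sketch of the Instrument-Measure-Refine-Repeat loop: re-run the
// benchmark with only the previously classified hotspots instrumented,
// until the instrumented set stops shrinking (a fixpoint).
import java.util.Set;
import java.util.function.Function;

public class ImrrLoop {
    // `benchmark` maps an instrumented probe set to the subset still
    // classified as hotspot at the end of the run.
    static Set<String> refine(Set<String> instrumented,
                              Function<Set<String>, Set<String>> benchmark) {
        while (true) {
            Set<String> hotspots = benchmark.apply(instrumented);
            if (hotspots.equals(instrumented)) return hotspots; // converged
            instrumented = hotspots; // next run instruments only the hotspots
        }
    }
}
```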
In the video titled “Aggregated Performance Analysis of OpenJDK JMH Benchmarks with Satoris & Simz” I briefly demonstrated this process after tackling another problem of aggregating performance measurement across multiple process executions of a benchmark. The following diagram shows the flow of data, configuration and (adaptation) change across runtimes and tooling. From the metering model aggregated by Simz and snapshotted within the Satoris console a new instrumentation agent bundle is generated and then used in the subsequent rerun of the benchmark. Initially, the degree of instrumentation is controlled by both package and class level filters, but following on from then the agent is rebuilt to only be aware of those methods previously classified as a hotspot.
Below is the hotspot listing following completion of the initial Pivotal Reactor benchmark run executed by the OpenJDK JMH tool. In this initial run I did not limit the instrumentation to any particular package so management of the overhead is problematic even for the most efficient agent, especially when instrumentation is applied to a third-party library with extreme high frequency call behavior, as is the case for the
Note: By default the Satoris agent does not instrument any of the JDK classes. You don’t measure what you cannot change or control.
Here are some of the probes (instrumented methods) disabled on the fly during the benchmark run based on hotspot thresholds I had defined. The count and totals represent only those firings (executions) of a probe (method) that were metered (measured) before classification and disablement.
To generate a revised instrumentation plan embedded with the Satoris agent, based on the above hotspot classifications, I simply executed:
java -jar satoris.jar -generate-aj-bundle <snapshot-ocs-file> <agent-bundle-jar-file>
After rerunning the benchmark again with the newly created agent bundle a revised list of disabled probes (instrumented methods) is determined. Previously these probes had been classified as hotspots.
Don’t be too concerned about the high averages for some of the probes listed as disabled, as the hotspot metering extension uses a scorecard not impacted by large outliers resulting from object allocation and follow-up garbage collection.
After refining the agent instrumentation plan once again with the new metering model and then re-executing the benchmark the list of probes disabled during the run is now down to a minuscule number.
After another refinement iteration, the list of probes disabled during the benchmark has shrunk down to just one.
Finally, a benchmark execution with no dynamic disablement of probes – all instrumented remains measured.
I performed a similar process as outlined above, this time with an alternative meter, and the final results show a high similarity, though the hotspot thresholds would not exactly match across such different measures.
Those familiar with the Reactor codebase and the third party libraries it uses might be surprised with the limited number of probes listed for these packages in the metering tables above. The reason for this is the enablement of the
exitpoint metering extension, and its dynamic disablement of firing probes (executing code) below (in the scope of) a probe (method) marked as an exit.
By default the
exitpoint metering extension will not meter any exit labeled probe, but I changed this in the configuration so that the metering model would include the surface level probes into such packages. This has an additional knock-on effect on the reported inherent (self) metering totals as well as the hotspot classification for non-exit probes.
j.s.p.exitpoint.enabled=true
j.s.p.exitpoint.name.groups=com.lmax,com.gs.collections,net.openhft,ch.qos.logback,jsr166e,org.openjdk
j.s.p.exitpoint.inclusive=false
j.s.p.exitpoint.filter.enabled=true
j.s.p.exitpoint.filter.exclude.name.groups=reactor,org.projectreactor
Because non-surface level probes never got metered (measured), they never got mirrored (simulated), and so never became part of the metering model within the Simz runtime, which is another extremely useful feature of metered, mirrored and simulated application runtimes.
In this article I show how to measure and analyze the performance of client drivers to backend storage services, such as MongoDB, through the use of the Satoris
egress metering extension that creates a derived meter from a base meter, such as
clock.time, representing the outbound service (latency) time. This particular measurement technique can be applied to any other type of outbound service call from an application or client driver library.
MongoDB is a cross-platform document-oriented database. Classified as a NoSQL database, MongoDB eschews the traditional table-based relational database structure in favor of JSON-like documents with dynamic schemas. When doing a performance analysis of a software application that communicates with an external service or storage system via some client interfacing (driver) library, I initially start with a simple conceptual model having three essential layers: the application, the client driver and the networking (or file system). The application represents the entry point (ingress) for interactions that eventually pass through the driver (transit) and out to the external system and service (egress). When I measure latency and cost (consumption) at the application entry, it includes both the transit as well as the exit costs. But invariably I like to get a breakdown of the cost for each of these call flow classifications. One way to do this is to aggregate latency and execution cost at the package level and then group each package into a particular call flow class. But this has its problems, because you need to track the ingress, transit and egress timing and costing within the ingress, transit and egress layers themselves. It becomes recursive, because when software makes a call into a package, it typically makes various internal calls within its package before calling into another that has a different call flow classification grouping. Another limitation with package grouping is that you want to have the breakdown on a per service entry point basis. For example, how much of the request processing time for transaction X can be attributed to remote calls to services
B. If you view this as a call tree, you are looking to compare the cost at a root node with the summed time spent in leaf call nodes. Frequently you don’t care about the breakdown of latency, or cost, across packages for a particular application entry point call. You just want to know how much of the transaction time is spent in the JVM.
There are two measurement concerns to bear in mind during such performance analysis: black holes and blind spots. A black hole is a term I use when referring to outbound calls to a remote service. I am aware that there is this black box the application communicates with, but I am not sure how I can determine the surface area (leaf call nodes) and the time spent (lost) there in what I refer to as the “outer space”. This task is further complicated by a client driver library that creates an “inner space” around a portal to the remote destination. A blind spot occurs in the context of black holes when there is some form of coordination (and coupling) between two inner space call interaction points that communicate with the outer space. A request is dispatched to the server, and then there is a wait within the inner space, the client driver library, for a paired response. If we only measure the writing of the request and then the reading of the server response, the time (and possibly cost) between these two events is not visible. If the calling thread is in some way suspended from doing other work while waiting for a response, we should attribute this time in the inner space to the outer space. There is a slightly different form of this in the typical web request processing pattern observed in JEE applications. A web request comes in and makes some SQL query statement executions. The time to execute the SQL queries is measured, but not the time spent iterating over the resulting recordset, which can have a much greater latency than what was measured in retrieving the initial set of records when executing the SQL query. As an application iterates over a recordset, the client driver library communicates back and forth with the remote server to retrieve more of the records matching the query.
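The blind spot can be made concrete with a small timing sketch. This is an illustration of the attribution problem only, with assumed names (`EgressTiming`, `outboundCall`), not the egress metering extension itself: timing just the request write and the response read misses the wait between them, whereas a span over the whole outbound interaction attributes that wait to the outer space.

```java
// Hypothetical sketch of the blind spot: the wait between writing a
// request and reading its paired response is invisible if only the
// write and read are measured; an egress-style span captures it.
public class EgressTiming {
    static long writeNanos, readNanos, egressNanos;

    static void outboundCall(Runnable write, Runnable await, Runnable read) {
        long t0 = System.nanoTime();
        long w0 = System.nanoTime();
        write.run();                               // measured: request write
        writeNanos += System.nanoTime() - w0;
        await.run();                               // the blind spot: thread waits
        long r0 = System.nanoTime();
        read.run();                                // measured: response read
        readNanos += System.nanoTime() - r0;
        egressNanos += System.nanoTime() - t0;     // wait attributed to outer space
    }
}
```

With any non-trivial wait, the egress span exceeds the sum of the write and read measurements, which is exactly the time a write/read-only measurement would lose.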
Hopefully, the following example based on the MongoDB Java Driver API will make this far more understandable. Here is the mini application that will be used to execute a series of common remote database storage interactions: insert, find and remove. The code was inspired by an example within the MongoDB Github repository.
Let's start with a configuration of the
hotspot metering extension enabled by default within the Satoris instrumentation agent. Methods on average (not in the statistical sense) having a clock time of less than 5 microseconds will be disabled.
Here is the resulting
hotspot metering model collected on completion of the application execution. The majority of the inherent (self) clock time is spent in the
The focus of the article is on the measurement technique employed, what questions it answers and the challenges in doing so. The numbers are not to be taken as representative considering the benchmark test executed for only 7 minutes and was run local to the database server. I will do a more lengthy benchmark run and append the results to the article later this week.
In the following configuration I enabled the
nonreentrant metering extension. This metering extension prevents the metering engine from measuring (metering) a method within the execution scope of itself. An alternative and more efficient metering extension,
nonrecursive, is available for filtering on direct reentrant method calls. I generally see the turning on of this particular extension as a data cleaning exercise, as the averaging of reentrant same-named method calls always seems somewhat confusing and misleading to me.
j.s.p.hotspot.threshold=5
j.s.p.hotspot.threshold.inherent=1
j.s.p.nonreentrant.enabled=true
The biggest change in the metering model with the
nonreentrant metering extension enabled is the reduction in the measurement count for
Bits.readFully, dropping down from 9M to 3M. The reason for this is that the
Bits.readFully method is overloaded and chained. This is also the case for various methods in the
DBCollectionImpl classes dropping down from 2M to 1M.
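The count inflation from overloaded-and-chained methods is easy to reproduce. Below is a hypothetical metering shim (not Satoris code) where one overload delegates to another, in the style of the driver's Bits.readFully; with a nonreentrant-style filter, the nested firing of the same probe is suppressed.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: why an overloaded-and-chained method inflates measurement
// counts when every overload is metered under one probe name, and how
// a nonreentrant filter drops the nested firing.
public class Reentrant {
    static final AtomicInteger metered = new AtomicInteger();
    // call depth for this single probe name, per thread
    static final ThreadLocal<Integer> depth = ThreadLocal.withInitial(() -> 0);
    static boolean nonreentrant;   // stands in for j.s.p.nonreentrant.enabled

    static void enter() {
        if (!nonreentrant || depth.get() == 0) metered.incrementAndGet();
        depth.set(depth.get() + 1);
    }

    static void exit() { depth.set(depth.get() - 1); }

    // overloaded and chained, as in the MongoDB driver's Bits class
    static void readFully(byte[] b) {
        enter();
        readFully(b, 0, b.length);
        exit();
    }

    static void readFully(byte[] b, int off, int len) {
        enter();
        // actual read elided
        exit();
    }

    /** Meter one top-level readFully call; return how many firings were counted. */
    public static int count(boolean filter) {
        nonreentrant = filter;
        metered.set(0);
        readFully(new byte[8]);
        return metered.get();
    }
}
```

One external call fires the probe twice without the filter and once with it, which is exactly the 9M-to-3M style drop seen in the metering model.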
Let’s try going a bit lower with the threshold, hopefully collecting measurements for methods previously disabled by the
hotspot metering extension.
j.s.p.hotspot.threshold=1
j.s.p.hotspot.threshold.inherent=1
j.s.p.nonreentrant.enabled=true
The metering table filtered on
hotspot labeled probes is now much larger. There is a slight increase in the response times, but even that is probably overstated due to the relatively short run duration and the fact that it takes longer for the metering engine to eliminate extraneous probes and their measurement.
Now begins the more interesting part of the analysis with the enablement of the
egress metering extension and the creation of a new meter,
io.time, derived from the base
clock.time meter. What the
egress metering extension does is track the
clock.time metering for all leaf call nodes. A leaf is a metered probe that did not have any nested metered probes. The keyword here is metered, as all metering extensions operate only on the methods that are instrumented and continually metered.
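The leaf-time attribution just described could be sketched as follows. This is a toy single-threaded model, not the Satoris implementation: every probe frame records whether a nested probe fired, a frame with no children is a leaf, and leaf clock time bubbles up into an io.time-style meter for every enclosing probe.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;

// Sketch: deriving an io.time meter from clock.time by attributing the
// time of leaf probes (no nested metered probes) to every enclosing probe.
public class Egress {
    static final class Frame {
        final String name;
        final long start = System.nanoTime();
        boolean hasChild;       // did any nested metered probe fire?
        long leafTime;          // leaf clock.time accumulated beneath this frame
        Frame(String name) { this.name = name; }
    }

    final ArrayDeque<Frame> stack = new ArrayDeque<>();
    public final Map<String, Long> ioTime = new HashMap<>();  // the derived meter

    public void begin(String probe) {
        if (!stack.isEmpty()) stack.peek().hasChild = true;
        stack.push(new Frame(probe));
    }

    public void end() {
        Frame f = stack.pop();
        long clock = System.nanoTime() - f.start;
        // a leaf contributes its whole clock.time; an interior probe
        // contributes only the leaf time collected beneath it
        long leaf = f.hasChild ? f.leafTime : clock;
        ioTime.merge(f.name, leaf, Long::sum);
        if (!stack.isEmpty()) stack.peek().leafTime += leaf;
    }
}
```

With a findOne probe enclosing a readFully leaf, the io.time reported for findOne is exactly the clock time of the leaf beneath it, which is the attribution the table below shows.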
j.s.p.hotspot.threshold=5
j.s.p.hotspot.threshold.inherent=1
j.s.p.nonreentrant.enabled=true
j.s.p.egress.enabled=true
j.s.p.egress.meter.clock.time.name=io.time
With the above configuration we now have two metering records for each probe (named method):
io.time is the actual
clock.time for leaf probes called directly or indirectly by a probe. Looking at the metering records for the
DBCollection.findOne probe we can now see that out of an average
clock.time of 117 microseconds we can attribute 103 microseconds to the
io.time, or the
clock.time spent in leaf probes. Incidentally, this is a good example of where average reporting is useful. The average makes for a far easier comparison value than a significantly large total. What I particularly like about this is that I don’t need to walk through a typical profiler call tree only to get lost in expanding a large number of nodes, watching a large % of attributed time disappear in front of me without any useful insight offered up in the meantime.
The above table only showed those probes labeled with
hotspot. In the following chart, sorted on the inherent total, many of the probes observed as call leafs by the
egress metering extension are shown. Some of these probes would probably not be deemed IO related, especially the object constructors.
Any metering extension activated by the metering engine can have its visibility reduced using filters. In the following configuration the
egress metering extension now only sees those probes within the
org.bson.io package and sub packages.
j.s.p.hotspot.threshold=5
j.s.p.hotspot.threshold.inherent=1
j.s.p.nonreentrant.enabled=true
j.s.p.egress.enabled=true
j.s.p.egress.meter.clock.time.name=io.time
j.s.p.egress.filter.enabled=true
j.s.p.egress.filter.include.name.groups=org.bson.io
Here is the adjusted metering table now with the
io.time metering only pertaining to leaf calls within the
org.bson.io package, called directly or indirectly by a probe. The
DBCollection.findOne() probe has a total
clock.time of 118 microseconds and an
io.time metering of 103 microseconds. Only 15 microseconds is actually spent executing non-IO related methods within the scope of the
findOne() method. For
DBCollection.remove() the delta is 38 microseconds (127 – 96) and 27 microseconds (154 – 127) respectively.
Eventually the calls made in the
Stress class on the
DBCollection end up invoking a
doOperation() method in class
DBTCPConnector. Below is the code snippet from the
DBPort class. What do you see? At least two separate IO operations. The piping (or dispatch) of a client request and then the reading of the result, done strangely enough in the
Response object constructor. A possible blind spot depending on how much runtime execution cost follows the last IO write and how much precedes the first IO read. It may be better to consider the
doOperation() as the exit (egress) point in the remote procedure call (RPC) plumbing.
In the following configuration I’ve changed the filtering on the
egress metering extension to only include the two
doOperation() methods mentioned above. I’ve also renamed the
io.time meter to rpc.time.
j.s.p.hotspot.threshold=5
j.s.p.hotspot.threshold.inherent=1
j.s.p.nonreentrant.enabled=true
j.s.p.egress.enabled=true
j.s.p.egress.meter.clock.time.name=rpc.time
j.s.p.egress.filter.enabled=true
j.s.p.egress.filter.include.name.groups=com.mongodb.DBPort.doOperation,com.mongodb.DBTCPConnector.doOperation
The delta for the
DBCollection.findOne() method is now 9 microseconds (115 – 106). For
DBCollection.remove() the delta is 20 microseconds (125 – 105) and 17 microseconds (152 – 135) respectively. All
rpc.time metering averages have increased over the
io.time meterings and the delta, with
clock.time, has decreased. It is now relatively easy to look at any method at the surface of the Java driver and very quickly determine whether a possible code optimization investigation is worth considering, knowing how much of its time is actually spent calling outbound.
You might be tempted to change the
egress filtering from the
DBPort.doOperation() method to the
DBPort.call() methods but then the
rpc.time metering would include the time spent within the runtime performing concurrency control as the
doOperation() method is
synchronized. This may well be acceptable with an appropriate renaming of the meter.
I changed the
Stress class to run 10 times longer and also made some configuration changes to measure more, more accurately and more cheaply.
j.s.p.hotspot.initial=10000
j.s.p.hotspot.lower=20000
j.s.p.hotspot.threshold=1
j.s.p.hotspot.threshold.inherent=1
j.s.p.nonrecursive.enabled=true
j.s.p.egress.enabled=true
j.s.p.egress.meter.clock.time.name=rpc.time
j.s.p.egress.filter.enabled=true
j.s.p.egress.filter.include.name.groups=com.mongodb.DBPort.doOperation,com.mongodb.DBTCPConnector.doOperation
Here is the resulting metering model filtered on
Reactor is a foundational library for building reactive fast data applications on the JVM. It provides abstractions for Java to make building event- and data-driven applications easier. I am currently working with an experienced software engineering team building an exciting product prototype that I hope to talk more about at some point later in the year when the project starts to take shape. My initial assignment with the team involves reviewing some of the early technology choices made in a proof of concept implementation, in terms of performance and scalability, as the deployed system needs to sense, reason and respond to vast amounts of data in real-time. One of the first project dependencies I looked at was Reactor, a project that originated at SpringSource and is now maintained by engineers working at Pivotal.
The Reactor project has a benchmark sub-project based on the OpenJDK JMH tool. JMH is a Java harness for building, running, and analyzing nano/micro/milli/macro benchmarks written in Java and other languages targeting the JVM. After building the sub-project package the benchmark is run from the command line as follows:
java -jar target/microbenchmarks.jar ".*ReactorBenchmarks.*"
Here is a sample of the summary output produced by the JMH tool after it has run a relatively large number of benchmark configuration iterations within separate processes.
o.p.b.r.ReactorBenchmarks.reactorThroughput ringBuffer          10000  thrpt  15    1445.069    241.602  ops/s
o.p.b.r.ReactorBenchmarks.reactorThroughput ringBuffer           1000  thrpt  15   11955.832   2847.127  ops/s
o.p.b.r.ReactorBenchmarks.reactorThroughput ringBuffer            100  thrpt  15  100754.020  19946.595  ops/s
o.p.b.r.ReactorBenchmarks.reactorThroughput sync                10000  thrpt  15    1748.662     81.455  ops/s
o.p.b.r.ReactorBenchmarks.reactorThroughput sync                 1000  thrpt  15   22749.599   1028.661  ops/s
o.p.b.r.ReactorBenchmarks.reactorThroughput sync                  100  thrpt  15  265829.334   3000.288  ops/s
o.p.b.r.ReactorBenchmarks.reactorThroughput threadPoolExecutor  10000  thrpt  15     552.242     56.231  ops/s
o.p.b.r.ReactorBenchmarks.reactorThroughput threadPoolExecutor   1000  thrpt  15    6101.796    685.604  ops/s
o.p.b.r.ReactorBenchmarks.reactorThroughput threadPoolExecutor    100  thrpt  15   60206.842   1353.195  ops/s
o.p.b.r.ReactorBenchmarks.reactorThroughput workQueue           10000  thrpt  15     226.803      6.641  ops/s
o.p.b.r.ReactorBenchmarks.reactorThroughput workQueue            1000  thrpt  15    2021.130    270.393  ops/s
o.p.b.r.ReactorBenchmarks.reactorThroughput workQueue             100  thrpt  15   21433.804   1140.880  ops/s
Not terribly useful from a software performance analysis perspective, beyond the benchmark itself, but fortunately JMH does come with some built-in profiling capabilities. One of those is a thread stack sampling profiler that is activated as follows:
java -jar target/microbenchmarks.jar -prof stack ".*ReactorBenchmarks.*"
The JMH tool doesn’t appear to aggregate the sampled stack results the way it does actual benchmark measurements. After each benchmark iteration, a stack sampling report is printed. Here is an example of the output produced by this particular profiler.
Iteration 1: 1223.517 ops/s

Stack
| 90.0% WAITING       sun.misc.Unsafe.park
|  3.7% RUNNABLE      org.projectreactor.bench.reactor.ReactorBenchmarks.reactorThroughput
|  1.9% RUNNABLE      com.gs.collections.impl.list.mutable.UnmodifiableMutableList.<init>
|  1.1% RUNNABLE      reactor.event.routing.ConsumerFilteringEventRouter.route
|  1.0% TIMED_WAITING sun.misc.Unsafe.park
|  0.8% RUNNABLE      com.lmax.disruptor.BatchEventProcessor.run
|  0.6% RUNNABLE      com.gs.collections.impl.map.mutable.ConcurrentHashMap.get
|  0.4% RUNNABLE      reactor.event.dispatch.AbstractLifecycleDispatcher.route
|  0.2% RUNNABLE      sun.misc.Unsafe.park
|  0.1% RUNNABLE      com.lmax.disruptor.MultiProducerSequencer.getHighestPublishedSequence
|  0.0% (other)

Iteration 2: 1235.400 ops/s

Stack
| 90.0% WAITING       sun.misc.Unsafe.park
|  3.6% RUNNABLE      org.projectreactor.bench.reactor.ReactorBenchmarks.reactorThroughput
|  2.0% RUNNABLE      com.gs.collections.impl.list.mutable.UnmodifiableMutableList.<init>
|  1.1% TIMED_WAITING sun.misc.Unsafe.park
|  0.9% RUNNABLE      reactor.event.routing.ConsumerFilteringEventRouter.route
|  0.9% RUNNABLE      com.lmax.disruptor.BatchEventProcessor.run
|  0.6% RUNNABLE      com.gs.collections.impl.map.mutable.ConcurrentHashMap.get
|  0.5% RUNNABLE      reactor.event.dispatch.AbstractLifecycleDispatcher.route
|  0.2% RUNNABLE      sun.misc.Unsafe.park
|  0.1% RUNNABLE      org.projectreactor.bench.reactor.generated.ReactorBenchmarks.reactorThroughput_Throughput_measurementLoop
|  0.1% (other)
Let’s now see what an event based profiler can collect while the benchmark is running. Low latency micro-benchmarks are in general extremely challenging for code instrumentation profiling, but fortunately Satoris is both adaptive and extremely efficient, with many built-in meters offering an alternative to wall clock time profiling when needed. Looking at the above benchmark report results, I decided it’d be best to switch out the
clock.time microsecond resolution meter for the
clock.tick nanosecond resolution meter, though admittedly this is problematic on Mac OS X with the current Java 8u5 runtime. I’ll come back to addressing this particular issue later in the article.
Here is the configuration I used to configure the Satoris adaptive metering engine called into by the instrumentation hooks added to class bytecode on loading into the runtime.
j.s.p.hotspot.threshold=250
j.s.p.hotspot.threshold.inherent=50
j.s.p.hotspot.meter=clock.tick
j.s.p.console.snapshot.probes.disabled.include=true
j.s.p.meter.clock.time.include=false
j.s.p.meter.clock.tick.include=true
I’m not going to go into great detail explaining the actual adaptive mechanism used by the metering engine. The above configuration disables the instrumentation and metering of method calls that have, on average (not in the statistical sense), a latency below 250 nanoseconds, 50 of which is inherent to the method itself. This is done online, and the result is a list of hotspot probes representing the most suitable points in the code to investigate possible tuning options, which incidentally is not always the location of an actual high-frequency hotspot (hint: caller-to-callee loops).
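For illustration, here is a toy scorecard model of such threshold-driven disablement. The names and the exact accounting are my assumptions, not the Satoris mechanism: firings below the thresholds drain a probe's balance, firings above replenish it, and a drained balance disables the probe online.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: scorecard-style adaptive disablement. Each probe starts with
// an initial balance; cheap firings drain it, expensive firings top it
// up, and a drained balance removes the probe from the metered set.
public class Hotspot {
    static final class Card { long balance; boolean disabled; }

    final long threshold;           // cf. j.s.p.hotspot.threshold
    final long inherentThreshold;   // cf. j.s.p.hotspot.threshold.inherent
    final long initial;             // cf. j.s.p.hotspot.initial
    final Map<String, Card> cards = new HashMap<>();

    public Hotspot(long threshold, long inherentThreshold, long initial) {
        this.threshold = threshold;
        this.inherentThreshold = inherentThreshold;
        this.initial = initial;
    }

    /** Record one metered firing; returns false once the probe is disabled. */
    public boolean record(String probe, long clock, long inherent) {
        Card c = cards.computeIfAbsent(probe, k -> {
            Card n = new Card();
            n.balance = initial;
            return n;
        });
        if (c.disabled) return false;
        if (clock < threshold && inherent < inherentThreshold) c.balance--;
        else c.balance++;
        if (c.balance <= 0) c.disabled = true;
        return !c.disabled;
    }
}
```

A probe that consistently fires below both thresholds is dropped after its balance drains, while a costly probe is never disabled; raising the initial balance, as done later in the article, simply delays the disablement decision until more evidence has accumulated.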
Initially I restricted the instrumentation to the
reactor.* package. In the early stages of any software performance analysis, the focus should be on the code that can be changed. Understanding of the software execution behavior and resource consumption profile should always start from the perspective of the application codebase. In this case the package,
reactor.*, is the application, and all other library dependencies are black boxes (from an instrumentation coverage perspective). Again it is best to start with only measuring what can be changed (directly).
In the metering table below, which I’ve color inverted within a photo editing software package, I’ve listed both hotspot and
disabled labeled probes. Because probes can be disabled during the benchmark run, the counts and totals for such methods are only accurate in the sense of the method invocations that were metered up to the point of disablement. Only one of the probes still being metered at the end of the benchmark matches a method picked out by the stack sampling profiler. Then again, the code coverage and focus are very different. Thread stack sampling is focused on what the system appears to be busy doing, but in a modern multithreaded world appearances are very deceiving when there is no sense of activity and of the transitions between activities. Adaptive metering is far more compelling and relevant in its reporting when it comes to identifying the appropriate point for algorithmic change.
In the configuration below I’ve switched out the
clock.tick meter for the
spin.tick meter. Both meters report time intervals in nanoseconds. The difference is that the
spin.tick meter eliminates the call to the underlying OS clock counter from the metered thread execution call path at the expense of a spinning core.
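The trade can be sketched in a few lines. The code below is a minimal stand-in of my own, closer in spirit to a raw tick counter than the calibrated nanosecond intervals the real meter reports: one daemon thread burns a core advancing a volatile field, and metered threads read that field instead of calling into the OS clock.

```java
// Sketch: a spin-style meter. A dedicated spinning thread keeps a
// volatile counter moving; metered threads read the field, which is
// cheaper on the metered call path than System.nanoTime(), at the
// expense of one fully occupied core.
public class SpinTick {
    public static volatile long tick;
    public static volatile boolean running = true;

    public static Thread start() {
        Thread t = new Thread(() -> {
            while (running) tick++;   // one increment per spin of the burnt core
        }, "spin-ticker");
        t.setDaemon(true);
        t.start();
        return t;
    }

    public static void main(String[] args) throws InterruptedException {
        start();
        long before = tick;
        Thread.sleep(10);
        long after = tick;
        running = false;
        System.out.println("ticks elapsed: " + (after - before));
    }
}
```

Reading a volatile long is a plain load with ordering constraints, whereas a clock read is a call into the runtime and possibly the kernel; that difference is the whole point of the swap.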
j.s.p.hotspot.threshold=100
j.s.p.hotspot.threshold.inherent=25
j.s.p.hotspot.meter=spin.tick
j.s.p.console.snapshot.probes.disabled.include=true
j.s.p.meter.clock.time.include=false
j.s.p.meter.spin.tick.include=true
Below is the metering table exported to the file system following completion of a benchmark iteration of the RingBuffer configuration. The results appear consistent, though slightly better in terms of throughput, when compared with the previous metering results. Due to the nature of the adaptive mechanism underlying the metering engine, the counts for those probes disabled vary across tables, even with the same configuration. This is very much like how the JVM HotSpot compiler differs in its bytecode compilation across process executions.
To address issues with the precision of the underlying clock counter on my test platform I’ve now switched to another built-in meter,
spin.count, which increments a volatile field in a tight loop. It is important to note that with all of these changes I’ve not had to make any instrumentation code changes. This is a good example of why you should never attempt to create your own ad hoc profiling solution just because you know a little Aspect Oriented Programming (AOP). Choosing the appropriate threshold values that drive the adaptive mechanism is a little more complicated. I went with 25 and 10, probably on the low side.
j.s.p.hotspot.threshold=25
j.s.p.hotspot.threshold.inherent=10
j.s.p.hotspot.meter=spin.count
j.s.p.console.snapshot.probes.disabled.include=true
j.s.p.meter.clock.time.include=false
j.s.p.meter.spin.count.include=true
In the metering table below, produced with the above metering configuration, I’ve filtered out the disabled probes. The usual suspects (from above) are there along with a few new ones that were previously disabled during metered execution.
j.s.p.hotspot.threshold=25
j.s.p.hotspot.threshold.inherent=5
j.s.p.hotspot.meter=spin.count
j.s.p.console.snapshot.probes.disabled.include=true
j.s.p.meter.clock.time.include=false
j.s.p.meter.spin.count.include=true
Here is the new metering model following re-execution of the benchmark with the revised threshold setting. I’ve also included in the listing those probes labeled disabled.
It can be useful to get some idea of how much invocation work is performed, directly or indirectly, by a probe. In the following configuration I switched over to the probe.count meter, which counts the number of metered method invocations that occur within the call window of a method. This depends on the amount of code coverage and the adaptive mechanism employed. I set both adaptive thresholds to 0 to ensure there was no dynamic disablement of metering during a benchmark iteration.
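A minimal model of such a call-window invocation counter might look like this (hypothetical, single-threaded): each probe firing increments the counter of every window currently open on the stack, so a probe's count reflects everything metered directly or indirectly beneath it.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;

// Sketch: a probe.count style meter, counting how many metered
// invocations fire within each probe's call window.
public class ProbeCount {
    // one single-slot counter per currently open call window
    final ArrayDeque<long[]> windows = new ArrayDeque<>();
    public final Map<String, Long> counts = new HashMap<>();

    public void begin(String probe) {
        for (long[] c : windows) c[0]++;   // this firing is inside every open window
        windows.push(new long[1]);
    }

    public void end(String probe) {
        counts.merge(probe, windows.pop()[0], Long::sum);
    }
}
```

For a call chain A → B → C, A's window sees two nested metered invocations, B's sees one, and the leaf C sees none, which is how a high probe.count inherent total flags methods that fan out into a lot of metered work.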
j.s.p.hotspot.threshold=0
j.s.p.hotspot.threshold.inherent=0
j.s.p.hotspot.meter=probe.count
j.s.p.console.snapshot.probes.disabled.include=true
j.s.p.meter.clock.time.include=false
j.s.p.meter.probe.count.include=true
The overall throughput has dropped but it does give us an idea of how much of the code in the
reactor.* package calls other code in the
reactor.* package. Looking at the inherent total for the
Reactor.notify probe, I now suspect this is an overloaded method that is counted twice. I could enable the
nonrecursive metering extension to eliminate one of these invocations, which might well have eliminated the probe from other results reported above.
A major concern with any event processing platform is garbage collection. In the following configuration I’ve switched over to the
jmgmt.alloc.bytes meter. The meter uses a method in the
ThreadMXBean interface to track heap allocations. Contrary to what has been said by some Google performance engineers, it does come with a performance hit, especially when called inline within the thread execution context. In this configuration a probe is disabled if it does not, “on average”, allocate an object, directly or indirectly, within its call window.
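For reference, the underlying JDK hook is the HotSpot-specific com.sun.management.ThreadMXBean. The sketch below (my own wrapper, not the Satoris meter) reads per-thread allocated bytes around a unit of work; it assumes a HotSpot-derived JVM with thread allocation measurement enabled, which is the default.

```java
import java.lang.management.ManagementFactory;

// Sketch: reading per-thread heap allocation via the HotSpot-specific
// com.sun.management.ThreadMXBean. The inline call on the metered
// thread's own execution path is not free, as noted above.
public class AllocMeter {
    static final com.sun.management.ThreadMXBean THREADS =
        (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();

    // sink to keep the allocation observable (not eliminated by the JIT)
    public static volatile Object sink;

    /** Bytes allocated by the current thread while running r. */
    public static long allocated(Runnable r) {
        long id = Thread.currentThread().getId();
        long before = THREADS.getThreadAllocatedBytes(id);
        r.run();
        return THREADS.getThreadAllocatedBytes(id) - before;
    }

    public static void main(String[] args) {
        long bytes = allocated(() -> sink = new byte[1_000_000]);
        System.out.println("allocated ~" + bytes + " bytes");
    }
}
```

The returned figure includes object headers and any incidental allocation within the window, which is exactly why a meter built on it can attribute allocation, directly or indirectly, to a probe's call window.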
j.s.p.hotspot.threshold=1
j.s.p.hotspot.threshold.inherent=0
j.s.p.hotspot.meter=jmgmt.alloc.bytes
j.s.p.console.snapshot.probes.disabled.include=true
j.s.p.meter.clock.time.include=false
j.s.p.meter.jmgmt.alloc.bytes.include=true
In the metering table below the biggest allocation culprit is
CachingRegistry.select. Probably not something you would expect from a class with a
Here is the source code pulled from Github for the
Time to widen the code coverage and include the
com.gs.collections.impl.* package. The metering table pretty much shouts out the allocation call sequence ending with a call to
Here is the new object allocation site. The actual constructor,
UnmodifiableMutableList.<init>(), did come up in the stack sampling results. The metering engine was able to link to a somewhat innocent line much higher in the call stack with the most direct enclosing calling site.
I was asked to follow up the above article with a metering model that included some of the other library dependencies. I thought I might as well address some other concerns with the above metering models, in particular method overloading and the possible premature disablement of probes before sufficient load or conditional state had arisen.
# adds a unique class method id to each probe name
j.s.p.ext.aspectj.probe.name.joinpoint.id.include=true
# increases the initial hotspot scorecard balance from 1000 to 10000
j.s.p.hotspot.initial=10000
j.s.p.hotspot.threshold=15
j.s.p.hotspot.threshold.inherent=5
j.s.p.hotspot.meter=spin.count
j.s.p.console.snapshot.probes.disabled.include=true
j.s.p.meter.clock.time.include=false
j.s.p.meter.spin.count.include=true
As is to be expected, in the
RingBuffer benchmark configuration the LMAX Disruptor library hits the top spot. Note that the
Reactor.notify probe is now listed twice in the metering table.
In the last configuration I should have increased the lower hotspot bound after increasing the initial balance as follows:
A revised metering model with the change applied and a little higher throughput achieved.
Again at this kind of level you are looking for clues to guide further code inspection within an editor. There must be some acceptance of the fact that red herrings will creep into the model due to the nature of execution and its measurement. But still and all, it sends thread call stack sampling back into the stone age where it came from and surely belongs, and, if we are lucky, it carries with it those that hold up such an approach as state of the art Java performance profiling.