Managing Performance Analysis Complexity with Adaptive Hotspot Measurement

How does a software performance engineer jump into a large codebase and immediately start to identify the essence of the software execution as well as the potential hotspots or bottlenecks within the code base? The answer is profiling of a representative workload – a benchmark. But measuring the performance of a large codebase is fraught with problems in the selection of which methods to instrument, how such methods are measured, what measurement data is collected, and how should the measurement data be stored, analyzed and visualized. Too much instrumentation (and its measurement) is likely to perturb the results to such an extent that they are simply wrong, at least in the context of the initial benchmark. The challenge is even more pronounced when it is a highly multi-threaded system that is being benchmarked and observed, with implicit internal and external co-ordination between components and client drivers. Any significant performance impact and the system under observation becomes an entirely different system. If that is not bad enough, an engineer is likely to end up with a list of 1,000 to 10,000 measurement points (methods). How does a performance engineer manage overhead when the hotspots are not known beforehand to refine the instrumentation space?

A performance engineer is ideally looking for a relatively small list of code locations worthy of further investigation. The investigation can be online, done from within a live monitoring console, or offline, in the viewing of source code editor. This is not to say that all hotspots are bottlenecks and need to be tuned. For the most part, they give the performance engineer a digest of what it is this system does, when it does it, how frequently it does it and how much time or resource is consumed in doing it. Rarely does a performance engineer jump into the code of a hotspot and voilà identifies how it should be changed and tuned. Unless the code is of poor quality, in which case throwing darts at a board, holding source code printouts, might well be more efficient (this is your typical call stack profiling technique). With software of some reasonable level of quality, a much harder task awaits the performance engineer once he/she peers beneath the hotspot covers, as invariably making code faster, or more scalable requires some redirection and short circuiting of execution flow that is best done by those that have developed the sources. A performance engineer needs to shine a light in the darkness, give focus to tuning efforts, track progress towards some objective and along the way offer expert guidance in optimization, with both a local and global perspective of the software execution and the system dynamics.

Having spent +10 years looking for better ways to tackle the enormity of this task the best that can be achieved today is through some form of adaptive instrumentation and measurement, driven by one or more hotspot seeking measurement strategies. Call stack sampling does not cut it and neither does transaction tracing when a codebase has 10,000 possible measurement points all firing simultaneous across a large number of executing threads. A performance engineer can very easily become lost in the woods of the resulting call trees that are formed as execution flows across such a large number of measurement points (nodes). New ways of visualizing are not the sole answer. There needs to be a significant reduction in measurement data collected well before it is presented but this decrease cannot be done blindly, or stupidly, but instead intelligently and in context. Clearly, this is best done by the executing machine itself and happen with little effort other than by setting one or two threshold values that inform the measurement runtime of the typical latency band of the entire application processing patterns and the acceptable level of overhead that can be budgeted. This needs to be automated so that more time is spent productively on the much harder task of determining how best to restructure the execution flow for the purpose of improved efficiency and scale. As soon as a change is committed, feedback on the movement of performance hotspots needs to be immediate, relevant and relatively accurate.

So how does this look in practice? Let’s take the latest release of Apache Cassandra (2.1) and instrument all +5,000 methods found in the org.apache.cassadra.* and io.netty.* packages using an adaptive measurement agent that has some hotspot thresholds set extremely low to eliminate much of the getter/setter and call delegation noise. Then using the new and much-improved cassandra-stress tool a series of write and read workloads are executed. At the end of any one of the benchmarks, the result should be a relatively small list of hotspot methods. Below is such a list from the read stress benchmark run. This is 50 out of 5000, and it was automatically whittled down by the agent. It is important to note that this is not just some top 50 method listing. The listing holds those methods still measured at the end of the benchmark after the agent disabled and removed the instrumentation from all of the other non-hotspot methods. In doing this continuously at runtime, the agent not only decreases overhead and improves accuracy it also causes the list to shrink automatically as methods labeled as hotspot become disabled as the perturbation reduces.
Lets now have a look at some of the methods labeled as a hotspot by the agent. Below is the source code listing for CollationController.collectTimeOrderedData. Note the numerous Tracing.trace() calls indicating that the developer who wrote the code also considered it an important observation point.


Here is the listing for SSTableReader.getPosition(). Again the agent has automatically picked out a method that is also littered with Tracing.trace() calls.


Finally the source code listing for SelectStatement.getSlideCommands. There are no traces present here, but the inline comments are interesting. The ordering of the ArrayList allocation should be corrected and placed after the if statement. When a loop (and object cloning) is driven by the size of a data structure, then the performance of such methods can sometimes be classified as a hotspot and other times not. It is wonderful when a particular technology can find needles in a haystack, saving time in the scanning of a huge codebase that might not even be understood by the engineer who originally produced it.


We now move onto the write stress benchmark. This benchmark was executed first to populate the database for the read. A similar number of methods were instrumented and measured, and again an unmanageable number was reduced to a handful of hotspot methods still being measured at the end of the benchmark run. A quick scan down the table listing picking out the class and method names reveals the nature of the execution.


Here is the code listing for the MergeIterator$ManyToOne class. You might be surprised to see that the agent has picked out the computeNext() method. What has happened is that the two methods called by the method were disabled as in themselves they don’t exceed the required hotspot threshold until both cost profiles are combined at the caller point.


The CommitLog.add() method has some object allocation and IO calls. Then there is the time spent waiting for async work to be completed by an executor, to be explored in a follow-up article.


50% of the time spent in executing the method, ColumnIndex$Builder.buildForCompaction, is inherent to the method so again we have the adaptive mechanism at play in that after a period of measuring the called methods and attributing cost to them it then disables these methods and subsequent costs are assigned to the caller method. You might ask yourself how does one tune such a method. It might not be possible unless one can reduce the number of iterations performed. The point here is that this method represents an important point in the production of invocation work that branches outwards and flows downwards to other methods that individually have relatively low-cost profiles.


Only 15% of the time spent executing StorageProxy.performWrite is inherently attributed to the method itself but like in the previous case the method represents an important execution point for what flows onwards in the processing.


99% of the time executing ModificationStatement.getMutations is attributed to the method itself because the instrumentation was not applied to the collection classes allocated and accessed within the scope of execution.


Here is the code listing for Memtable.put. Again it would be hard for a developer to readily assess the cost profile of such a method without first understanding what drives loop iterations as well as the size of data structures cloned.


Another loop and another conditional exit which might still fail after some execution costs have occurred. There is a common behavioral pattern to all of these write related methods.


It is wonderful when out pops a short list of hotspots but what is even more wonderful is when the call tree that is created by the profiling agent only ever includes hotspot methods.


Because of the hotspot adaptive mechanism, it can be the case that a callee method can be promoted to a hotspot before its caller and hence appear as a distinct branch of the tree without a caller.

You might have noticed a scattering of awaitUntil calls in the call tree. These leaf nodes will be treated differently in a follow-up article, questioning the hotspot analysis of asynchronous scheduling call sites.