Managing Performance Analysis Complexity with Adaptive Hotspot Measurement

How does a software performance engineer jump into a large codebase and immediately start to identify the essence of the software's execution, along with the potential hotspots or bottlenecks within it? The answer is profiling a representative workload – a benchmark. But measuring the performance of a large codebase is fraught with problems: which methods to instrument, how those methods are measured, what measurement data is collected, and how that data should be stored, analyzed and visualized. Too much instrumentation (and its measurement) is likely to perturb the results to such an extent that they are simply wrong, at least in the context of the original benchmark. This is even more pronounced when the system being benchmarked and observed is highly multi-threaded, with implicit internal and external coordination between components and client drivers. Add any significant performance impact and the system under observation becomes an entirely different system. And if that is not bad enough, the performance engineer is then left with a list of 1,000 to 10,000 measurement points (methods). How does a performance engineer manage measurement overhead, and refine the instrumentation space, when the hotspots are not known beforehand?

A performance engineer is ideally looking for a relatively small list of source code locations worthy of further investigation – online, from within a live monitoring console, and offline, within a source code editor with the developer at hand. This is not to say that all hotspots are bottlenecks that need to be tuned. For the most part they give the performance engineer a digest of what the system does, when it does it, how frequently it does it, and how much time or resource is consumed in doing it. Rarely does a performance engineer jump into the code of a hotspot and, voilà, identify how it should be changed and tuned – unless the code is of extremely poor quality, in which case throwing darts at a board holding source code printouts might be equally effective (this is your typical call stack profiling technique). With software of reasonable quality a much harder task awaits the performance engineer once he/she peers beneath the hotspot covers, as making code faster, or more scalable, invariably requires some redirection and short-circuiting of execution flow that is best done by those who developed the code. A performance engineer needs to shine a light in the darkness, give focus to tuning efforts, track progress towards some objective, and along the way offer expert guidance in optimization, with both a local and global perspective of the software execution and the system dynamics – a perspective that is by and large absent from the typical developer toolset, as well as mindset.

Having spent 10+ years looking for better ways to tackle the enormity of this task – which only grows as more layers of indirection are added and disconnects in execution flow are introduced for some greater good (or because async event loops are the thing to do) – the best that can be achieved today is some form of adaptive instrumentation and measurement driven by one or more hotspot-seeking measurement strategies. Call stack sampling does not cut it, and neither does transaction tracing, when a codebase has 10,000 possible measurement points all firing simultaneously across a large number of executing threads. A performance engineer can very easily become lost in the woods of the resulting call trees formed as execution flows across such a large number of measurement points (nodes). New forms of visualization are not the sole answer. There needs to be a significant reduction in the measurement data collected, well before it is presented, but this reduction cannot be done blindly, or stupidly; it must be done intelligently and in context. Clearly, this is best done by the executing machine itself, with very little effort beyond setting one or two threshold values that inform the measurement runtime of the typical latency band of the application's processing patterns and the acceptable level of overhead that can be budgeted. This needs to be automated so that more time is productively spent on the much harder task of determining how best to restructure the execution flow for improved efficiency and scale. As soon as a change is committed, feedback on the movement of performance hotspots needs to be immediate, relevant and reasonably accurate.
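To make the idea concrete, here is a minimal sketch of such a threshold-driven probe. The class, field and threshold names are illustrative assumptions, not the API of any real agent: each instrumented method owns a probe that observes its own latency for a warm-up period and then disables itself if it falls below the configured hotspot threshold, shedding measurement overhead for everything but the genuinely expensive methods.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical adaptive probe: self-disables when the method it measures
// proves to be below the configured hotspot latency band.
public class AdaptiveProbe {
    static final long HOTSPOT_THRESHOLD_NANOS = 100_000; // latency band hint
    static final int WARMUP_SAMPLES = 50;                // observe before judging

    final String method;
    final AtomicBoolean enabled = new AtomicBoolean(true);
    long samples, totalNanos;

    AdaptiveProbe(String method) { this.method = method; }

    // Called on method exit with the measured elapsed time.
    void record(long elapsedNanos) {
        if (!enabled.get()) return;
        samples++;
        totalNanos += elapsedNanos;
        if (samples >= WARMUP_SAMPLES
                && totalNanos / samples < HOTSPOT_THRESHOLD_NANOS) {
            enabled.set(false); // not a hotspot: stop measuring, shed overhead
        }
    }

    public static void main(String[] args) {
        AdaptiveProbe getter = new AdaptiveProbe("Column.name");
        AdaptiveProbe scan = new AdaptiveProbe("SSTableReader.getPosition");
        for (int i = 0; i < 100; i++) {
            getter.record(1_000);   // cheap delegation noise
            scan.record(500_000);   // genuinely expensive work
        }
        System.out.println(getter.method + " enabled=" + getter.enabled.get());
        System.out.println(scan.method + " enabled=" + scan.enabled.get());
    }
}
```

A real agent would of course remove the bytecode instrumentation entirely rather than test a flag, but the decision logic is of this shape.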

So how does this look in practice? Let's take the latest release of Apache Cassandra (2.1) and instrument all 5,000+ methods found in the org.apache.cassandra.* and io.netty.* packages using an adaptive measurement agent, with hotspot thresholds set extremely low to eliminate much of the getter/setter and call delegation noise. Then, using the new and much improved cassandra-stress tool, a series of write and read workloads are executed. At the end of any one of the benchmarks the result should be a relatively small list of hotspot methods. Below is such a list from the read stress benchmark run. This is 50 out of 5,000, automatically whittled down by the agent. It is important to note that this is not just some top-50 method listing. The list represents those methods still being measured at the end of the benchmark, after the agent has disabled and removed the instrumentation from all of the other non-hotspot methods. By doing this continuously at runtime, the agent not only decreases overhead and improves accuracy; it also causes the list to shrink automatically, as methods currently labeled as hotspots are perturbed to a lesser extent by the measurement of methods they call, directly or indirectly, and in turn become disabled.
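The whittling process itself can be sketched as a registry sweep. Everything here – the registry class, the threshold, the sample costs – is a made-up illustration of the principle, not the agent's actual implementation: thousands of cheap methods are retired each sweep, and only the methods above the threshold survive to the end of the run.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Random;

// Hypothetical registry of instrumented methods, periodically swept so that
// only methods above the hotspot threshold remain measured.
public class HotspotRegistry {
    static final long THRESHOLD_NANOS = 50_000;
    final Map<String, Long> avgCost = new LinkedHashMap<>();

    void observe(String method, long avgNanos) { avgCost.put(method, avgNanos); }

    // Periodic sweep: retire (uninstrument) everything below threshold.
    void sweep() { avgCost.values().removeIf(c -> c < THRESHOLD_NANOS); }

    public static void main(String[] args) {
        HotspotRegistry reg = new HotspotRegistry();
        Random rnd = new Random(42);
        for (int i = 0; i < 5000; i++)              // mostly cheap methods
            reg.observe("method" + i, (long) rnd.nextInt(40_000));
        // Two genuinely expensive methods (illustrative costs):
        reg.observe("CollationController.collectTimeOrderedData", 800_000L);
        reg.observe("SSTableReader.getPosition", 300_000L);
        reg.sweep();
        System.out.println("surviving=" + reg.avgCost.size());
        System.out.println(reg.avgCost.keySet());
    }
}
```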
Let's now look at some of the methods labeled as hotspots by the agent. Below is the source code listing for CollationController.collectTimeOrderedData. Note the numerous Tracing.trace() calls, indicating that the developer who wrote the code also considered it an important observation point.


Here is the listing for SSTableReader.getPosition(). Again, the agent has automatically picked out a method that is also littered with Tracing.trace() calls.


Finally, the source code listing for SelectStatement.getSliceCommands. There are no traces present here, but the inline comments are interesting. The ordering of the ArrayList allocation should be corrected so that it is placed after the if statement. When a loop (and object cloning) is driven by the size of a data structure, the performance of such a method can sometimes classify it as a hotspot and other times not. It is wonderful when a technology can find such needles in a haystack, saving time scanning a huge codebase that might not even be understood by the very engineer who originally produced it.
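The suggested allocation reorder, sketched in isolation with illustrative names rather than the real SelectStatement code: moving the list allocation below the early-exit guard means the failing path pays for no allocation at all.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

public class AllocationOrder {
    // Before: the list is allocated even when the guard fails.
    static List<String> before(Collection<String> keys, boolean valid) {
        List<String> commands = new ArrayList<>(keys.size()); // wasted if !valid
        if (!valid) return Collections.emptyList();
        commands.addAll(keys);
        return commands;
    }

    // After: allocation moved below the guard, so the early exit is free.
    static List<String> after(Collection<String> keys, boolean valid) {
        if (!valid) return Collections.emptyList();
        List<String> commands = new ArrayList<>(keys.size());
        commands.addAll(keys);
        return commands;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("k1", "k2");
        System.out.println(after(keys, true));
        System.out.println(after(keys, false));
    }
}
```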


We now move on to the write stress benchmark. This benchmark was actually executed first, to populate the database for the read benchmark. A similar number of methods were instrumented and measured, and again a large and unmanageable set was reduced to a handful of hotspot methods still being measured at the end of the benchmark run. A quick scan down the table listing, picking out the class and method names, reveals the nature of the execution.


Here is the code listing for the MergeIterator$ManyToOne class. You might be surprised to see that the agent has picked out the computeNext() method, but what has happened is that the two methods it calls were disabled because, in themselves, they do not exceed the required hotspot threshold – the threshold is only crossed when both cost profiles are combined at the caller.
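The arithmetic behind that promotion is worth spelling out. The threshold and per-call costs below are invented for illustration: each callee falls short of the hotspot threshold on its own, so its probe is retired and its cost is thereafter absorbed by the caller, which then crosses the threshold.

```java
public class CombinedCost {
    static final long THRESHOLD_NANOS = 60_000; // illustrative threshold

    public static void main(String[] args) {
        long calleeA = 40_000; // below threshold alone -> probe retired
        long calleeB = 35_000; // below threshold alone -> probe retired
        System.out.println("calleeA hotspot=" + (calleeA >= THRESHOLD_NANOS));
        System.out.println("calleeB hotspot=" + (calleeB >= THRESHOLD_NANOS));
        // Once both callee probes are disabled, the caller absorbs their cost:
        long caller = calleeA + calleeB + 5_000; // plus its own small work
        System.out.println("caller hotspot=" + (caller >= THRESHOLD_NANOS));
    }
}
```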


The CommitLog.add() method has some object allocation and IO calls. Then there is the time spent waiting for async work to be completed by an executor, which will be explored in a follow-up article.


50% of the time spent executing the ColumnIndex$Builder.buildForCompaction method is inherent to the method itself. Again we have the adaptive mechanism at play: after a period of measuring the called methods and attributing cost to them, the agent disables those methods, and subsequent costs are assigned to the caller. You might ask yourself how one tunes such a method. It might not be possible, unless the number of iterations performed can be reduced. The point here is that this method represents an important point in the production of invocation work that branches outwards and flows downwards to other methods that individually have relatively low cost profiles.
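The attribution rule at work here is simple: inherent (self) time is total time minus the time attributed to still-instrumented callees, so retiring a callee's probe folds its share back into the caller. The numbers below are made up purely to show the mechanism.

```java
public class SelfTimeAttribution {
    // Inherent time = total time minus time attributed to measured callees.
    static long inherentPercent(long totalMs, long measuredCalleeMs) {
        return 100 * (totalMs - measuredCalleeMs) / totalMs;
    }

    public static void main(String[] args) {
        long total = 200;           // ms spent in the caller (illustrative)
        long callees = 100;         // ms attributed to instrumented callees
        System.out.println("inherent%=" + inherentPercent(total, callees));
        // After the agent retires the callee probes, their cost is
        // re-attributed to the caller:
        System.out.println("inherent%=" + inherentPercent(total, 0));
    }
}
```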


Only 15% of the time spent executing StorageProxy.performWrite is inherently attributed to the method itself, but, as in the previous case, the method represents an important execution point for what flows onwards in the processing.


99% of the time executing ModificationStatement.getMutations is spent within the method itself, because instrumentation was not applied to the collection classes allocated and accessed within the method scope.


Here is the code listing for Memtable.put. Again, it would be hard for a developer to readily assess the cost profile of such a method without first understanding what drives the loop iterations, as well as the size of the data structures cloned.


Another loop, and another conditional exit that might still fail after some execution cost has already been incurred. There is a common behavioral pattern to all of these write-related methods.


It is wonderful when out pops a short list of hotspots, but what is even more wonderful is when the call tree created by the profiling agent only ever includes hotspot methods.


Note: Because of the adaptive hotspot mechanism, it can be the case that a callee method is promoted to a hotspot before its caller, and hence appears as a distinct branch in the tree, without a caller.
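That caveat can be sketched with a toy tree builder. The hotspot set and call paths below are invented for illustration: when a callee's caller is not itself a hotspot, the tree builder has no measured parent frame to hang the callee on, so the callee surfaces as its own root branch.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class HotspotTreeCaveat {
    public static void main(String[] args) {
        // Methods currently promoted to hotspot (illustrative set).
        Set<String> hotspots = new HashSet<>(Arrays.asList(
            "StorageProxy.performWrite", "CommitLog.add", "awaitUntil"));
        // Observed caller -> callee pairs from the runtime.
        String[][] paths = {
            {"StorageProxy.performWrite", "CommitLog.add"},
            {"Memtable.put", "awaitUntil"}  // caller not (yet) a hotspot
        };
        for (String[] p : paths) {
            // Without a measured caller frame, the callee roots its own branch.
            String parent = hotspots.contains(p[0]) ? p[0] : "(root)";
            System.out.println(parent + " -> " + p[1]);
        }
    }
}
```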

You might have noticed a scattering of awaitUntil calls in the call tree. These leaf nodes will be treated differently in a follow-up article questioning hotspot analysis of asynchronous scheduling call sites.