Advanced Adaptive Profiling of Apache Cassandra 3.0

There is never just one performance model, especially when the tooling, Satoris, is configurable in terms of the set points used for online adaptation.

With Apache Cassandra 3.0 just released, it is time to check whether the previous performance hotspot analysis of the codebase, done with Satoris, still stands today. But before jumping into the hotspot analysis it is important to first understand how Satoris uniquely approaches the profiling and monitoring of relatively low latency systems such as Apache Cassandra. As stated previously, the challenge with profiling and monitoring is collecting a performance model that is both reasonably accurate and relevant, whilst offering sufficient code coverage and data collection to derive useful insights (hotspots in this case). Both instrumentation and measurement come with a cost, so to get the “biggest bang for our buck” the same benchmark is executed repeatedly, but with each iteration the instrumentation set is refined using the previous performance model. Eventually, the performance model remains unchanged across iterations.

Below is a sample benchmark script used to profile a single node Apache Cassandra instance servicing write requests from 8 threads driven by the standard stress tooling. After setting up some environment variables, the javaagent within the Satoris distribution is copied over into the Apache Cassandra directory. Then in each iteration, after cleaning up some caches and directories, the Apache Cassandra server is started and the stress test performed. On completion, the Satoris snapshot, exported on JVM shutdown, is used to generate a new javaagent with a refined instrumentation set based on the hotspots listed in the snapshot. This new javaagent is written over the initially copied javaagent for use in the next iteration.

adaptive.benchmarking.cassandra.3.script
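
To make the shape of that loop concrete, here is a minimal sketch of it in Java rather than shell; the five step scripts it invokes are hypothetical placeholders for the steps described above, so treat it as an illustration of the iteration structure rather than the actual benchmark script.

public class AdaptiveBenchmark {
    // Runs one of the (hypothetical) step scripts and fails fast on a non-zero exit code.
    static void step(String script) throws Exception {
        int exit = new ProcessBuilder("bash", script).inheritIO().start().waitFor();
        if (exit != 0) throw new IllegalStateException(script + " exited with " + exit);
    }

    public static void main(String[] args) throws Exception {
        for (int iteration = 1; iteration <= 20; iteration++) {
            step("clean-caches-and-dirs.sh"); // hypothetical: wipe caches and data directories
            step("start-cassandra.sh");       // hypothetical: start the node with the current javaagent
            step("run-stress.sh");            // hypothetical: 8 thread write workload via the stress tooling
            step("stop-cassandra.sh");        // hypothetical: shutdown triggers the Satoris snapshot export
            step("rebuild-agent.sh");         // hypothetical: generate a refined javaagent from the snapshot's hotspots
        }
    }
}
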
In all of the metering tables listed below the method AbstractWriteResponseHandler.get has been filtered out, even though it had the biggest inherent total (5 mins) and inherent average (~87 us). The reason for this exclusion from the display is that the method is largely asynchronous-wait based.

Satoris is not just refining the performance model across iterations of the stress test. It is adaptively profiling the code during each individual process execution, refining the instrumentation on the fly as it measures, learns and adapts. But because there is always some pre-adaptation residue left over, both in the performance model (aggregated measurements) and in the bytecode optimization, a subsequent rerun of the stress test is performed following a restart of the server and a rebuild of the agent and its instrumentation set. It takes much longer, but with each iteration the accuracy and relevance of the models are vastly enhanced, and with that comes greater confidence. Even though the Satoris agent is able, within a single process execution, to whittle down the instrumentation coverage from more than 10,000 methods to a couple of hundred or so hotspot probes (methods), it needs time to measure, learn and adapt. That comes at a cost: overhead, and with it less accuracy.

Here is the first performance model generated following the first stress test iteration. Note that the average wall clock time of the Memtable.put operation is 42 microseconds.

adaptive.benchmarking.cassandra.3.t5t1.1

The (I) in the table columns stands for inherent clock time, sometimes referred to as self time. This is the clock time exclusive to the method itself, excluding time attributed to the instrumented methods it calls.
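
As a toy illustration of the difference (the class, method names and timings below are made up, not taken from the Cassandra model): a method that spends roughly 4 microseconds in its own body and another 6 microseconds in an instrumented call it makes has an inclusive time of about 10 microseconds but an inherent time of only about 4 microseconds.

public class SelfTimeExample {
    // Spins for roughly the given number of nanoseconds.
    static void busyWait(long nanos) {
        long end = System.nanoTime() + nanos;
        while (System.nanoTime() < end) { }
    }

    static void inner() {
        busyWait(6_000); // ~6 us inherent to inner()
    }

    static void outer() {
        busyWait(4_000); // ~4 us inherent to outer()
        inner();         // counted in outer()'s inclusive time, not its inherent time
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        outer();
        // inherent(outer) = inclusive(outer) - inclusive(inner), roughly 4 us here
        System.out.println("outer() inclusive ns: " + (System.nanoTime() - start));
    }
}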

But after the 20th iteration, shown below, the average clock time for the Memtable.put operation has dropped to 21 microseconds and the list of hotspot probes has been significantly shortened. The drop is not entirely the result of eliminating instrumentation and measurement nested within the execution scope of Memtable.put. With many threads servicing multiple write (insert/update) requests at the same time, any reduction in instrumentation and measurement elsewhere is going to create less contention and congestion around precious resources such as processors and locks.
adaptive.benchmarking.cassandra.3.t5t1.20

The above performance model was generated with hotspot thresholds of 5 microseconds and 1 microsecond for inclusive and exclusive clock timings respectively. These set points are generally suitable for production monitoring on modern hardware; they might even be a little too aggressive, as the default set points are 10 microseconds and 2 microseconds. For profiling purposes, however, the inclusive clock time threshold can be dropped down to 2 microseconds, accepting a marginal hit on throughput and response times.
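
In terms of the agent configuration listed in Appendix C, that lower-threshold profiling run would amount to something along the following lines (the exact values are inferred from the description above rather than copied from that run's configuration):

j.s.p.hotspot.threshold=2
j.s.p.hotspot.threshold.inherent=1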

Counts might be understated as the Satoris 2.6.4 agent does not by default meter a direct and exact (in signature) recursive method invocation.
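
For illustration only (this class is not from the Cassandra codebase), a direct and exact recursive invocation looks like the following; by default only the outermost call would be metered, so the reported count understates the true number of calls.

public class DirectRecursion {
    static long factorial(long n) {
        // a direct recursive call with the exact same signature
        return n <= 1 ? 1 : n * factorial(n - 1);
    }

    public static void main(String[] args) {
        System.out.println(factorial(10)); // one metered entry, nine unmetered self-calls
    }
}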

Below is a revised performance model after the 20th iteration of a new benchmark run with the hotspot threshold lowered. The model contains a far greater number of hotspot probes, many of which have an average just above the 2 microsecond threshold, though the average is not used by the hotspot metering extension in driving its on-the-fly adaptations. Strangely enough, one of the top new hotspot entries is, in fact, a measurement library used to time and collect the latency of server operations, which is why in many cases instrumentation should be left to efficient and adaptive monitoring agents. You are more than likely going to need an agent anyway, so why pay the price again and again?
adaptive.benchmarking.cassandra.3.t2t1.d
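
The pattern behind such an entry is, in broad strokes, the familiar hand-rolled timing wrapper sketched below; this is a hypothetical stand-in rather than the actual LatencyMetrics code. Every request pays for two clock reads plus the recording work whether or not anyone is looking at the numbers, whereas an adaptive agent can add and remove that cost selectively.

import java.util.concurrent.atomic.LongAdder;

public class HandRolledTiming {
    static final LongAdder totalNanos = new LongAdder();
    static final LongAdder count = new LongAdder();

    static void handleWrite() {
        long start = System.nanoTime();            // first clock read on every request
        doWork();
        totalNanos.add(System.nanoTime() - start); // second clock read plus recording cost
        count.increment();
    }

    static void doWork() {
        // stands in for the actual request processing
    }

    public static void main(String[] args) {
        for (int i = 0; i < 1_000_000; i++) handleWrite();
        System.out.println("average ns per write: " + totalNanos.sum() / count.sum());
    }
}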

One might expect that with both online and offline adaptability of code profiling there would be less predictability in the performance models. But that is not necessarily the case when the adaptation is driven by very similar execution behaviors, resource consumption and workload patterns across runs of the benchmark script. Whilst some of the early iteration performance models might differ slightly across runs in the number of hotspots remaining, by the time the stress test iteration within a benchmark run hits the double digit range there is an immediate and very conclusive convergence. The screenshot below was taken following completion of a rerun of the entire benchmark script. It is remarkably similar, not just in the probe listing but in the actual total and inherent total values. Two completely different adaptation cycles with the same end result.

adaptive.benchmarking.cassandra.3.t2t1.e

Such consistency in the hotspot analysis also helps when updating a profiling agent. The above results were obtained with Satoris 2.6.4. Running the same benchmark with Satoris 2.6.5 gives near identical results. It looks like this is the definitive performance hotspot list for Apache Cassandra, at least for single node write usage. An analysis of a cluster is in the works.

adaptive.benchmarking.cassandra.3.t2ti1.satoris.2.6.5

Appendix A – Hotspot Code Listings

org.apache.cassandra.db.Memtable.put

adaptive.benchmarking.cassandra.3.memtable.put

org.apache.cassandra.metrics.LatencyMetrics.addNano

adaptive.benchmarking.cassandra.3.latencymetrics.addnano

org.apache.cassandra.transport.Message$Dispatcher$Flusher.start

adaptive.benchmarking.cassandra.3.message.flusher

org.apache.cassandra.cql3.statements.ModificationStatement.checkAccess

adaptive.benchmarking.cassandra.3.modificationstatement.checkaccess

org.apache.cassandra.cql3.statements.ModificationStatement.addUpdates

adaptive.benchmarking.cassandra.3.modificationstatement.addupdates

org.apache.cassandra.db.commitlog.CommitLog.add

adaptive.benchmarking.cassandra.3.commitlog.add

Appendix B – Google Cloud Hardware

Machine type: n1-highcpu-32 (32 vCPUs, 28.8 GB memory)
Disk type: Standard persistent disk, IOPS (read=3, write=15), throughput (read=1.2, write=0.9)

Appendix C – Agent Configuration

j.s.p.hotspot.threshold=5
j.s.p.hotspot.threshold.inherent=1
j.s.p.hotspot.initial=5000
j.s.p.hotspot.lower=6000
j.s.p.console.dump.file.name=probes.ocs