Author: William Louth

A renowned software engineer with particular expertise in software engineering, adaptive systems, adaptive control, self-regulation, resilience engineering, information visualization, software simulation and mirroring, as well as performance measurement and optimization.

Microbenchmarking Big Data Solutions on the JVM – Part 2

The second part in a series of articles looking at the microbenchmarking of big data solutions running on the JVM. In this part the performance model is further refined over a number of configuration steps, each building on the previous one, with the purpose of deriving a smaller, simpler and more relevant model of the microbenchmark. This facilitates more targeted code inspection as well as a better understanding of how execution flows across method, class and package boundaries.

Microbenchmarking Big Data Solutions on the JVM – Part 1

The challenge in measuring not just the microbenchmark code but the underlying method invocations within the software under test (performance analysis) is ensuring that the required instrumentation, measurement and collection do not perturb the system to such an extent that it becomes an entirely different system, rendering the measurement irrelevant and the data collected neither accurate nor actionable. This challenge can be overcome with adaptive profiling of the codebase, both online and offline. Here this advanced application performance monitoring approach is applied to Apache Spark.

Software Tracing – Static, Dynamic, Adaptive and Simulated

[Embedded tweet on static tracing (static.tracing.tweet)]

There are a few interpretations of “static tracing” but at this point I can only assume that the above tweet is referring to the more common case of traces (probes) being explicitly coded or compiled into software at build time. The reason for the different possible interpretations of static, dynamic and adaptive is that in the tracing, monitoring or (preferably) metering of software execution behavior there are three aspects that can each be classified in this manner, and somewhat independently: instrumentation, measurement and collection.

INSTRUMENTATION

Static instrumentation is added into the software at coding, compilation or post-build time. Dynamic instrumentation is added into the code at runtime, as in the case of a JVM instrumentation agent. Some tracing systems claim to be dynamic even though the instrumentation is applied statically, by way of support for custom callbacks (or interceptors) at the execution of the static instrumentation. Adaptive instrumentation is added into, and also removed from, code during runtime execution, based on value and cost as well as other behavioral analysis. Adaptive instrumentation is very effective at dealing with pass-through methods and leaf nodes in a call tree that have very little inherent cost (clock time). This approach is very much supported and complemented by the underlying dynamic and adaptive compilation nature of the HotSpot JVM. Adaptive instrumentation can be done in online or offline mode. In the case of Satoris both adaptive modes are fully supported, including a hybrid mode which allows instrumentation to be continually redefined based on past (offline) and present (online) measured software execution behavior: IMRR (Instrument, Measure, Refine and Repeat).
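
For the dynamic case above, the standard hook on the JVM is the java.lang.instrument API. The following is a minimal sketch of such an agent; the TracingAgent class name is hypothetical and this is purely illustrative, not Satoris code.

    import java.lang.instrument.ClassFileTransformer;
    import java.lang.instrument.Instrumentation;
    import java.security.ProtectionDomain;

    public final class TracingAgent {

        // The JVM calls premain before the application's main method when the
        // agent jar is specified with -javaagent and carries a Premain-Class
        // manifest entry.
        public static void premain(String agentArgs, Instrumentation inst) {
            inst.addTransformer(new ClassFileTransformer() {
                @Override
                public byte[] transform(ClassLoader loader, String className,
                                        Class<?> classBeingRedefined,
                                        ProtectionDomain protectionDomain,
                                        byte[] classfileBuffer) {
                    // A real agent would rewrite the bytecode here (for example
                    // with ASM) to weave probe enter/exit calls into selected
                    // methods. Returning null leaves the class unchanged.
                    return null;
                }
            });
        }
    }

Adaptive instrumentation builds on the same mechanism: with Instrumentation.retransformClasses a transformer can later be asked to re-weave, or strip, probes from classes as their measured value and cost change over time.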

Profiling Low Latency JVM Applications with Instrument-Measure-Refine-Repeat (IMRR)

MEASUREMENT

When trace instrumentation is always measured it is considered static in nature: what is instrumented is measured and collected, irrespective of how the instrumentation itself is classified. Dynamic measurement is when the measurement can be turned on and off based on some setting (possibly accessible to remote tools) or the presence of an observer (man or machine). The instrumentation fires (executes) but the measurement, such as reading a clock time counter, is conditional. Adaptive measurement is when the tracing system internally decides at runtime whether the instrumentation that is firing is actually measured. It uses one or more adaptive strategies to decide this, factoring in at minimum the cost and value of the trace being fired and the trace context in which it fires. The decision is far more involved than any sort of dynamic measurement and changes based on numerous factors that rely heavily on the prior history of measurement. In Satoris this is referred to as Intelligent Activity Metering (IAM) and it encompasses more than 35 extensions to the metering runtime.
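
A minimal sketch may help make the distinction concrete. The names below (MeteringStrategy, Probe) are hypothetical and not Satoris APIs: the probe always fires, but the clock is only read when a pluggable strategy decides it is worth the cost. A dynamic strategy would simply consult a flag; an adaptive one would weigh the probe's accumulated cost, its value and the enclosing trace context.

    // Illustrative only: a probe that always fires but only measures when a
    // pluggable strategy decides the measurement is worth its cost.
    interface MeteringStrategy {
        boolean shouldMeasure(String probeName);
    }

    final class Probe {
        private final String name;
        private final MeteringStrategy strategy;
        // Aggregation kept deliberately simple; a real metering runtime would
        // use per-thread or striped counters.
        private long totalNanos;
        private long count;

        Probe(String name, MeteringStrategy strategy) {
            this.name = name;
            this.strategy = strategy;
        }

        // Called at method entry: returns a start timestamp, or -1 when the
        // strategy elects to skip measurement for this firing.
        long begin() {
            return strategy.shouldMeasure(name) ? System.nanoTime() : -1L;
        }

        // Called at method exit with the value returned by begin().
        void end(long start) {
            if (start != -1L) {
                totalNanos += System.nanoTime() - start;
                count++;
            }
        }
    }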

A JVM Profiling Agent that does the work of a Software Performance Analyst

COLLECTION

Static collection of trace data is when the tracing system always collects (stores) what is measured. Again it does not reason about the value of what has been measured, or the additional cost to be incurred before executing collection routines such as stack traces, call graphs, aggregations, distributions and so on. Dynamic collection can sometimes refer to a tracing system that enables (or disables) certain data collectors via configuration; at runtime it is static. This might seem a strange classification if it were not for the fact that many tracing/profiling tools do not offer the ability to turn off baked-in collectors, an all or nothing proposition. Dynamic collection can also refer to a tracing system that allows various collection routines to be turned on and off at runtime during the course of tracing a software execution. This can be done locally or remotely, from within the code or from some remote management interface. Adaptive collection is when the tracing system defers various collection routines until some activation point such as a particular trace context, a “novel” sequence/temporal pattern, or an associated labeling of a trace flow performed by some other adaptive learning routine. Dynamic collection is conditional but largely instructional. Adaptive collection is conditioned on the past and the present, both in terms of execution behavior and system environment.
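
The same separation can be sketched in code: measurement always produces a reading, but the (possibly expensive) collection routine only runs when an activation condition derived from history holds. The names below (Collector, AdaptiveCollection) are hypothetical, an illustrative sketch rather than the Satoris implementation.

    // Illustrative only: collection is deferred until an activation condition
    // derived from measurement history is met.
    interface Collector {
        // Expensive work such as capturing a stack trace, updating a call
        // graph or feeding a latency distribution.
        void collect(String probeName, long elapsedNanos);
    }

    final class AdaptiveCollection {
        private final Collector collector;
        private volatile long activationThresholdNanos; // adjusted over time from history

        AdaptiveCollection(Collector collector, long initialThresholdNanos) {
            this.collector = collector;
            this.activationThresholdNanos = initialThresholdNanos;
        }

        // Called after each measurement; the collection routine only executes
        // when the reading is deemed "novel" relative to what has been seen.
        void onMeasured(String probeName, long elapsedNanos) {
            if (elapsedNanos > activationThresholdNanos) {
                collector.collect(probeName, elapsedNanos);
            }
        }
    }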

Beyond Metrics and Logging with Metered Software Memories

SIMULATED

All tracing systems, except for the Autoletics technology, are live tracing systems. Instrumentation is fired in a live environment. Measurement is performed in a live environment. Collection is done in a live environment. With the mirroring and simulation capabilities of Satoris, Simz and Stenos the same trace (metering) solution can be embedded within a real or simulated machine environment and still have the adaptive aspects of instrumentation, measurement and collection perform without change during the playback of a software episodic memory (a memory trace of sorts). The ability to trace a simulated environment with the same capabilities as a live system is a profound change to how we approach the measurement and understanding of software execution behavior. It opens up many new opportunities that can never realistically be matched with static or dynamic tracing. Imagine the execution of C/C++ code being simulated and adaptively traced within an entirely different machine runtime and architecture, now and in the future. Simulated and adaptive represent the next stage in the evolution of software, and not just in terms of tracing. When the two are combined and integrated we will finally move from intelligently writing code to engineering intelligent systems.
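
To make the playback idea a little more tangible, here is a small illustrative sketch that reuses the hypothetical AdaptiveCollection type from the collection sketch above. It is not how Simz or Stenos are implemented; it only shows that a recorded stream of probe enter/exit events can be driven through the same metering pipeline a live agent would use.

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // One recorded entry of a software episodic memory: which probe fired, on
    // which (simulated) thread, whether it is an enter or an exit, and the
    // recorded clock value.
    record MeteringEvent(String probeName, long threadId, boolean enter, long clockNanos) {}

    final class Playback {

        // Feed a recorded episode back through the same collection pipeline a
        // live agent would use, with the recorded clock standing in for
        // System.nanoTime().
        static void replay(List<MeteringEvent> episode, AdaptiveCollection collection) {
            Map<String, Deque<Long>> openCalls = new HashMap<>();
            for (MeteringEvent e : episode) {
                String key = e.threadId() + ":" + e.probeName();
                if (e.enter()) {
                    openCalls.computeIfAbsent(key, k -> new ArrayDeque<>()).push(e.clockNanos());
                } else {
                    Deque<Long> starts = openCalls.get(key);
                    if (starts != null && !starts.isEmpty()) {
                        collection.onMeasured(e.probeName(), e.clockNanos() - starts.pop());
                    }
                }
            }
        }
    }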

The Marking of Adaptive Software Memories

The “New Possible” in Application Monitoring and Management

This week Simz 2.3 broke all previous benchmark records in simulating 270 million metered network-streamed calls a second on a Google Cloud n1-highcpu-32 machine type instance. That is 540 million call events a second, or over 32 billion events a minute. The software execution calls originated in 28 client JVMs, also running on an n1-highcpu-32 machine type instance. On average each client was able to invoke 9.6 million instrumented method calls a second from a single thread per process, with an average client call latency of 100 nanoseconds. The CPU utilization on the Simz machine was pegged at just over 90%, with the incoming network data transmission at 985 MB a second.
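
A quick sanity check of these figures (assuming each metered call contributes an enter and an exit event):

    28 clients × 9.6 million calls/s            ≈ 270 million metered calls/s
    270 million calls/s × 2 events per call     = 540 million call events/s
    540 million events/s × 60                   ≈ 32 billion events/min
    1 s / 9.6 million calls (single thread)     ≈ 104 ns per call, consistent with the ~100 ns average client latency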

The Power of Now in the Performance Analysis of JVM Applications

I've long been fascinated by how best to perceive the behavior of software machines that for the most part appear as black boxes, consuming input we feed them and producing output we consume, directly or indirectly. I cannot help feeling there is a lost beauty in the motion of action that needs to be rediscovered in order to acquire a far greater understanding of what it is that software and hardware machines do, and how, in the course of actions and state changes (encompassing the environment), such a system changes behavior in ways not observed, accurately predicted or fully understood.

Beyond Metrics and Logging with Metered Software Memories

A proposal for a different approach to application performance monitoring that is far more efficient, effective, extensible and eventual than traditional legacy approaches based on metrics and event logging. Instead of seeing logging and metrics as primary data sources for monitoring solutions, we should see them as a form of human inquiry over some software execution behavior that is happening or has happened. With this in mind it becomes clear that logging and metrics do not serve as a complete, contextual and comprehensive representation of software execution behavior.

A Quick Performance Hotspot Analysis of Apache Kafka

I was asked to produce a quick preliminary performance analysis of Apache Kafka. Here is the short description of the project taken from the community site:

“Kafka is a distributed, partitioned, replicated commit log service. It provides the functionality of a messaging system, but with a unique design…at a high level, producers send messages over the network to the Kafka cluster which in turn serves them up to consumers…Communication between the clients and the servers is done with a simple, high-performance, language agnostic TCP protocol. We provide a Java client for Kafka, but clients are available in many languages.”

I downloaded the Kafka 0.8.2-beta distribution from the Apache site and then, before launching a single Kafka server on my workstation, I exported the following environment variable:

export KAFKA_OPTS="-server -noverify -javaagent:/Applications/satoris-1.9/bundle/satoris-javaagent.jar -Djxinsight.home=/Applications/satoris-1.9/conf/apache-kafka"

I used the following jxinsight.override.config file, located in the jxinsight.home directory, to lower the hotspot metering extension thresholds.

# clock.time (μs) thresholds
j.s.p.hotspot.threshold=5
j.s.p.hotspot.threshold.inherent=1

The instrumentation performed dynamically at runtime by the Satoris agent was limited to just the kafka.* codebase by adding the following to the jxinsight.aspectj.filters.config file.

kafka.

To generate workload I used the following script and command line arguments. The command was executed 5 times.

./bin/kafka-producer-perf-test.sh --broker-list=localhost:9092 --messages 10000000 --topic test --threads 10 --message-size 1000 --batch-size 100 --compression-codec 1

Here are the results reported by the performance script.

[Image: performance script terminal output, run 1 (kafka.metering.terminal.1)]

Below are the results of the metering hotspot analysis. The Satoris agent automatically and intelligently whittled down the list of methods from nearly 3,000 to just 40! The table is sorted by inherent (self) total, with kafka.log.Log.append being the biggest contributor to the execution time at an average of 9 ms per method invocation. Ignoring the pollExpired probe, the next closest probe has an inherent total of 6 minutes compared to 1 hour for Log.append. The cost profile of every other probe is dwarfed by Log.append.

[Image: metering hotspot table, run 1 (kafka.metering.hotspots.1)]

Here is the Scala source code for the Log.append method. Note the lock synchronized block and the extent of the code execution it covers, including a sub-optimal trace() call that involves a relatively expensive String.format call.

[Image: Log.append source code (kafka.metering.hotspots.code.1)]
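
To illustrate the shape of the issue, here is a simplified Java reconstruction of the pattern described above, not the actual Kafka Scala source: a single shared monitor guards both the log write and the formatting work performed for the trace() call.

    final class LogSketch {

        private final Object lock = new Object();

        void append(byte[] messages) {
            // Every producer thread is serialized through this single shared monitor.
            synchronized (lock) {
                // The relatively expensive String.format executes inside the
                // critical section as part of the trace() call.
                trace(String.format("Appending %d bytes to the log", messages.length));
                writeToSegment(messages);
            }
        }

        private void trace(String message) { /* trace-level logging sink */ }

        private void writeToSegment(byte[] messages) { /* segment file write */ }
    }

Guarding the trace call behind an is-enabled check, or at least performing the formatting outside the lock, would take that cost off the serialized path.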

To determine whether there might indeed be a build-up of threads at this point in the execution, I turned on the queue metering extension by adding the following to the jxinsight.override.config file within the directory pointed to by the jxinsight.home system property.

j.s.p.queue.enabled=true
j.s.p.queue.guard.name.labels=!managed

With the queue metering extension enabled it is relatively easy to inspect the concurrent thread execution flowing through the Log.append method. In the timeline below we can see it pegged at 8.

[Image: queue metering timeline for Log.append (kafka.metering.queue.1)]

The Log.liftedTree1$1 method called from the Log.append method only ever has a single thread executing, pretty much confirming that the monitor locked within the Log.append method is a shared instance. The Log.liftedTree1$1 method is not actually listed in the Scala source code; instead it is a synthetic method generated by the Scala compiler for the lock synchronized block (at least that is what I suspect here).

[Image: queue metering timeline for Log.liftedTree1$1 (kafka.metering.queue.2)]

Looking for a possible workaround to a rather obvious performance bottleneck, I came across this line in the server.properties file used to start the Kafka server instance.

# The default number of log partitions per topic. More partitions allow greater
# parallelism for consumption, but this will also result in more files across the brokers.
num.partitions=1

I then changed this property to 10, reset all data files and reran the benchmark. Here are the results reported by the performance script. They look much better but not amazing.

[Image: performance script terminal output, run 2 (kafka.metering.terminal.2)]

In this run the number of hotspot-classified probes has increased. The response time of the Log.append method has dropped from milliseconds to microseconds, and we start to see some other probes that might be worth investigating as possible areas in need of code (and resource consumption) tuning. Many would appear to be related to ByteBuffer management and manipulation, which may well be outside the scope (of influence), but there are some that seem relatively heavyweight for what is indicated by the method naming, such as RequestChannel$Request.updateRequestMetrics. A whole 11 microseconds per call spent on metric updates seems inefficient and wasteful.

[Image: metering hotspot table, run 2 (kafka.metering.hotspots.2)]

Below are some of the methods classified as hotspots by the Satoris agent, starting with ByteBufferMessageSet.decompress.

[Image: ByteBufferMessageSet.decompress source code (kafka.metering.hotspots.code.2)]

The execution cost in the CompressionFactory looks to be largely object allocation related.

[Image: CompressionFactory source code (kafka.metering.hotspots.code.3)]

More object allocation and IO copying work in the ByteBufferMessageSet.create method.

[Image: ByteBufferMessageSet.create source code (kafka.metering.hotspots.code.4)]

Software Performance Optimization Heuristics: Fast, Frugal and Factual

The following is a graphic I've used in the past to frame various software performance optimization techniques. It is not a comprehensive inventory of all such techniques (or concerns), but I've found it serves a purpose in managing the amount of effort that, in general, should be spent on each technique outside of extreme cases such as trading platforms (or profilers). The left side is your typical localized, bottom-up approach to speeding up code execution steps.

Software Regulators! Mirror Outwards, Simulate Inwards.

The Good Regulator Theorem states that "every good regulator of a system must be a model of that system". But what exactly would such a model look like? What elements should the model contain, and how might they be related and reasoned about? The theorem itself does not address this, so in this article I present my own research findings covering dramatism, observational learning, experiential learning, activity theory, simulation theory and mirror neurons, as well as software activity metering and software performance measurement.