To achieve a real understanding of software for the purpose of managing and changing we need a deeper level of observation that is as close as possible to what it is software actually does. Software does not log. Developers write logs calls to create an echo chamber.
This article outlines the reason behind and usage of the gap metering extension included in Satoris which unlike possibly all other application performance monitoring solutions focuses on what transpires between consecutive service request processing by a particular thread for the same entry point method.
The second part in a series of articles looking at the microbenchmarking of big data solutions running on the JVM. In this part the performance model is further refined over a number of configuration steps, each step building on the previous steps with the purpose of deriving a smaller, simpler and more relevant model of the microbenchmark to facilitate more targeted code inspection as well as a better understanding of the nature of execution flow across method, class and package boundaries.
The challenge in measuring not just the microbenchmark code but the underlying method invocations within the software under test (performance analysis) is ensuring that the required instrumentation, measurement and collection does not perturb the system to such an extent that the system is an entirely different system (measurement is not relevant) and the data collected is neither accurate or actionable. This challenge can be overcome with adaptive profiling of the codebase, both online and offline. Here this advanced application performance monitoring approach is applied to Apache Spark.
There are a few interpretations of “static tracing” but at this point I can only assume that the above tweet is referring to the more common case of traces (probes) being explicitly coded or compiled into software at build time. The reason for the different possible interpretations of static, dynamic and adaptive is that in the tracing, monitoring or (preferably) metering of software execution behavior there are three aspects that can be classified in this manner and somewhat independently. They are instrumentation, measurement and collection.
Static instrumentation is added into the software at code, compilation or post build time. Dynamic instrumentation is added into the code at runtime, as in the case of a JVM instrumentation agent. Some tracing systems claim to be dynamic though the instrumentation is applied statically by way of support for custom callbacks (or interceptors) at the execution of the static instrumentation. Adaptive instrumentation is added into, and also removed, from code during runtime execution, based on value and cost as well as some other behavioral analysis. Adaptive instrumentation is very effective at dealing with pass through methods and leaf nodes in a call tree that have very little inherent cost (clock time). This approach is very much supported and complimented by the underlying dynamic and adaptive compilation nature of the Hotspot JVM. Adaptive instrumentation can be done in online or offline mode. In the case of Satoris both adaptive modes are fully supported including a hybrid mode which allows for instrumentation to be continually redefined based on past (offline) and present (online) measured software execution behavior – IMRR (Instrument, Measure, Refine and Repeat).
Profiling Low Latency JVM Applications with Instrument-Measure-Refine-Repeat (IMRR)
When trace instrumentation is always measured it is considered as being static (in the nature). What is instrumented, is measured, is collected. This is irrespective of how the instrumentation is classified. Dynamic measurement is when the measurement can be turned on and off based on some setting (possibly accessible to remote tools) or the presence of an observer (man or machine). The instrumentation fires (executes) but the measurement, such as reading a clock time counter, is conditional. Adaptive measurement is when the tracing system internally decides at runtime whether the instrumentation that is firing is actually measured. It uses one or more adaptive strategies to decide this, factoring in at minimum the cost and value of the trace being fired and in what trace context. The decision is far more complicated than any sort of dynamic measurement and changes based on numerous factors that rely heavily on the prior history of measurement. In Satoris this is referred to as Intelligent Activity Metering (IAM) and it encompasses more than 35 extensions to the metering runtime.
A JVM Profiling Agent that does the work of a Software Performance Analyst
Static collection of trace data is when the tracing system always collects (stores) what is measured. Again it does not reason about the value of what has been measured and the additional cost to be incurred before executing the collection routine such as stack traces, call graphs, aggregations, distributions, etc,… Dynamic collection can sometimes refer to a tracing system that enables (or disables) certain data collectors via configuration. At runtime it is static. This might seem a strange classification if it were not for the fact that many tracing/profiling tools do not offer the ability to turn off baked-in collectors – an all of nothing proposition. Dynamic tracing can also refer to a tracing system that allows various collection routines to be turned on and off at runtime during the course of tracing a software execution. This can be done both locally or remotely, from within the code or from some remote management interface. Adaptive collection is when tracing system defers various different collection routines until some activation point such as a particular trace context, a “novel” sequence/temporal pattern or an associated labeling of a trace flow performed by some other adaptive learning routine. Dynamic collection is conditional but largely instructional. Adaptive collection is conditioned on the past and present both in terms in execution behavior and system environment.
Beyond Metrics and Logging with Metered Software Memories
All tracing systems, except for the Autoletics technology, are live tracing systems. Instrumentation is fired in a live environment. Measurement is performed in a live environment. Collection is done in a live environment. With the mirroring and simulation capabilities of Satoris, Simz and Stenos the same trace (metering) solution can be embedded within a real or simulated machine environment and still have adaptive aspects of instrumentation, measurement and collection perform without change in the playing back of a software episodic memory (a memory trace of sorts). The ability to trace a simulated environment with the same capabilities of a live system is a profound change to how we approach the measurement and understanding of software execution behavior. It opens up so many new opportunities that can never be realistically matched with static or dynamic tracing. Imagine the execution of C/C++ code being simulated and adaptively traced within an entirely different machine runtime and architecture – now and in the future. Simulated and Adaptive represents the next stage in the evolution of software and not just in terms of tracing. When these two are combined and integrated we will finally move from intelligently writing code to engineering intelligent systems.
The Marking of Adaptive Software Memories
This week Simz 2.3 broke all previous benchmark records in simulating 270 million metered network streamed calls a second on a Google Cloud n1-highcpu-32 machine type instance. That is 540 million call events a second - 32 billion events a minute. The software execution calls originated in 28 client JVMs also running on a n1-highcpu-32 machine type instance. On average each client was able to invoke 9.6 million instrumented method calls a second from a single thread per process with an average client call latency of 100 nanoseconds. The CPU utilization on the Simz machine was pegged at just over 90% with the incoming network data transmission at 985MB a second.
I've long fascinated over how best to perceive the behavior of software machines that for the most part appear as black boxes; consuming input we feed them and producing output we consume, directly or indirectly. I cannot help feeling there is a lost beauty in the motion of action that needs to be rediscovered in order to acquire a far greater understanding of what it is that software and hardware machines do and how in the course of action(s) and state changes (encompassing the environment) such a system changes behavior in ways not observed, or accurately predicted nor fully understood.
A proposal for a different approach to application performance monitoring that is far more efficient, effective, extensible and eventual than traditional legacy approaches based on metrics and event logging. Instead of seeing logging and metrics as primary datasources for monitoring solutions we should instead see them as a form of human inquiry over some software execution behavior that is happening or has happened. With this is mind it becomes clear that logging and metrics do not serve as a complete, contextual and comprehensive representation of software execution behavior.
I was asked to produce a quick preliminary performance analysis of Apache Kafka. Here is the short description of the project taken from the community site:
“Kafka is a distributed, partitioned, replicated commit log service. It provides the functionality of a messaging system, but with a unique design…at a high level, producers send messages over the network to the Kafka cluster which in turn serves them up to consumers…Communication between the clients and the servers is done with a simple, high-performance, language agnostic TCP protocol. We provide a Java client for Kafka, but clients are available in many languages.”
I downloaded the Kafka 0.8.2-beta distribution from the Apache site and then before launching a single kafka server on my workstation I exported the following environment variable:
export KAFKA_OPTS="-server -noverify -javaagent:/Applications/satoris-1.9/bundle/satoris-javaagent.jar -Djxinsight.home=/Applications/satoris-1.9/conf/apache-kafka"
I used the following
jxinsight.override.config file located in the
jxinsight.home directory to drop down the
hotspot metering extension thresholds.
# clock.time (μs) thresholds
The instrumentation performed dynamically at runtime by the Satoris agent was limited to just the
kafka.* codebase by adding the following to the
To generate workload I used the following script and command line arguments. The command was executed 5 times.
./bin/kafka-producer-perf-test.sh --broker-list=localhost:9092 --messages 10000000 --topic test --threads 10 --message-size 1000 --batch-size 100 --compression-codec 1
Here are the results reported by the performance script.
Below is the results of the metering hotspot analysis. The Satoris agent automatically and intelligently whittled down the list of methods from nearly 3,000 to just 40!! The table is sorted by inherent (self) total with
kafka.log.Log.append being the biggest contributor to the execution time with an average of 9 ms per method invocation. Ignoring the
pollExpired probe the next probe closest has a 6 minutes inherent total compared to 1 hour for
Log.append. The cost profile for all other probes is dwarfed by
Here is the Scala source code for the
Log.append method. Note the
lock synchronized block and the extent to which it covers in the code execution including a sub-optimal
trace() call that involves a relatively expensive
To determine whether indeed there might be a build up of threads at this point in the execution I turned on the
queue metering extension by adding the following to the
jxinsight.override.config file within the directory pointed by the
jxinsight.home system property.
queue metering extension enabled it is relatively easy to inspect the concurrent thread execution flowing through the
Log.append method. In the timeline below we can see it pegged at 8.
Log.liftedTree1$1 method called from the
Log.append method has only ever 1 thread executing pretty much confirming the monitor locked on within the
Log.append method is a shared instance. The
Log.liftedTree1$1 method is not actually listed in the Scala source code but instead is a synthetic method generated by the Scala compiler for the
lock synchronized block (at least that is what I suspect here).
Looking for a possible workaround to a rather obvious performance bottleneck I came across this line in the
server.properties file used in starting the Kafka server instance.
# The default number of log partitions per topic. More partitions allow greater
# parallelism for consumption, but this will also result in more files across the brokers.
I then changed this property to 10, reset all data files and reran the benchmarks again. Here are the results reported by the performance script. They look much better but not amazing.
In this run the number of hotspot classified probes has increased. The response of the
Log.append method has dropped down from ms to microseconds and we start to get to see some other probes that might be worth investigating as possible areas that need code (and resource consumption) tuning. Many would appear related to
ByteBuffer management and manipulation, which may well be outside of the scope (of influence), but there are some that seem relatively heavyweight for what is indicated by the method naming such as
RequestChannel$Request.updateRequestMetrics. A whole 11 microseconds per call wasted on metric updates seems inefficient and wasteful.
Below some of the methods classified as hotspots by the Satoris agent starting with
The execution cost in the
CompressionFactory looks to be largely object allocation related.
More object allocation and IO copying work in the
The following is a graphic I've used in the past to frame various software performance optimization techniques. It is not a comprehensive inventory of all software performance optimization techniques (or concerns) but I've found it serves a purpose in managing the amount of effort that, in general, should be spent on each technique outside of extreme cases such as trading platforms (or profilers). The left side is your typical localized bottom up approach to speeding up code execution steps.
The Good Regulator Theorem states "every good regulator of a system must be a model of that system". But what exactly would such a model look like? What elements should the model contain and how might they be related and reasoned about? The theorem itself does not address this so in this article I present my own research findings covering dramatism, observational learning, experiential learning, activity theory, simulation theory and mirror neurons as well as software activity metering and software performance measurement.