This is a first in a series of articles looking at the software performance of Linkerd, a recent addition to the Cloud Native Computing Foundation, offering a transparent proxy that adds service discovery, routing, failure handling, and visibility to modern software applications.
This article outlines the reason behind and usage of the gap metering extension included in Satoris which unlike possibly all other application performance monitoring solutions focuses on what transpires between consecutive service request processing by a particular thread for the same entry point method.
The second part in a series of articles looking at the microbenchmarking of big data solutions running on the JVM. In this part the performance model is further refined over a number of configuration steps, each step building on the previous steps with the purpose of deriving a smaller, simpler and more relevant model of the microbenchmark to facilitate more targeted code inspection as well as a better understanding of the nature of execution flow across method, class and package boundaries.
The challenge in measuring not just the microbenchmark code but the underlying method invocations within the software under test (performance analysis) is ensuring that the required instrumentation, measurement and collection does not perturb the system to such an extent that the system is an entirely different system (measurement is not relevant) and the data collected is neither accurate or actionable. This challenge can be overcome with adaptive profiling of the codebase, both online and offline. Here this advanced application performance monitoring approach is applied to Apache Spark.
There are a few interpretations of “static tracing” but at this point, I can only assume that the following tweet is referring to the more common case of traces (probes) being explicitly coded or compiled into software at build time. The reason for the different possible interpretations of static, dynamic and adaptive is that in the tracing, monitoring or (preferably) metering of software execution behavior there are three aspects that can be classified in this manner and somewhat independently. They are instrumentation, measurement, and collection.
Static instrumentation is added into the software at code, compilation or post build time. Dynamic instrumentation is added into the code at runtime, as in the case of a JVM instrumentation agent. Some tracing systems claim to be dynamic though the instrumentation is applied statically by way of support for custom callbacks (or interceptors) at the execution of the static instrumentation. Adaptive instrumentation is added into and also removed from code during runtime execution, based on value and cost as well as some other behavioral analysis. Adaptive instrumentation is very effective at dealing with pass through methods and leaf nodes in a call tree that have very little inherent cost (clock time). This approach is very much supported and complimented by the underlying dynamic and adaptive compilation nature of the Hotspot JVM. Adaptive instrumentation can be done in online or offline mode. In the case of Satoris, both adaptive modes are fully supported including a hybrid mode which allows for instrumentation to be continually redefined based on the past (offline) and present (online) measured software execution behavior – IMRR (Instrument, Measure, Refine and Repeat).
Profiling Low Latency JVM Applications with Instrument-Measure-Refine-Repeat (IMRR)
When trace instrumentation is always measured it is considered as being static (in nature). What is instrumented, is measured, is collected. This is irrespective of how the instrumentation is classified. Dynamic measurement is when the measurement can be turned on and off based on some setting (possibly accessible to remote tools) or the presence of an observer (man or machine). The instrumentation fires (executes) but the measurement, such as reading a clock time counter, is conditional. Adaptive measurement is when the tracing system internally decides at runtime whether the instrumentation that is firing is actually measured. It uses one or more adaptive strategies to decide this, factoring in at the minimum the cost and value of the trace being fired and in what trace context. The decision is far more complicated than any sort of dynamic measurement and changes based on numerous factors that rely heavily on the prior history of measurement. In Satoris this is referred to as Intelligent Activity Metering (IAM) and it encompasses more than 35 extensions to the metering runtime.
A JVM Profiling Agent that does the work of a Software Performance Analyst
Static collection of trace data is when the tracing system always collects (stores) what is measured. Again it does not reason about the value of what has been measured and the additional cost to be incurred before executing the collection routine such as stack traces, call graphs, aggregations, distributions, etc,… Dynamic collection can sometimes refer to a tracing system that enables (or disables) certain data collectors via configuration. At runtime it is static. This might seem a strange classification if it were not for the fact that many tracing/profiling tools do not offer the ability to turn off baked-in collectors – an all of nothing proposition. Dynamic tracing can also refer to a tracing system that allows various collection routines to be turned on and off at runtime during the course of tracing a software execution. This can be done both locally or remotely, from within the code or from some remote management interface. Adaptive collection is when tracing system defers various different collection routines until some activation point such as a particular trace context, a “novel” sequence/temporal pattern or an associated labeling of a trace flow performed by some other adaptive learning routine. Dynamic collection is conditional but largely instructional. Adaptive collection is conditioned on the past and present in terms of execution behavior and system environment.
Beyond Metrics and Logging with Metered Software Memories
All tracing systems, except for the Autoletics technology, are live tracing systems. Instrumentation is fired in a live environment. Measurement is performed in a live environment. Collection is done in a live environment. With the mirroring and simulation capabilities of Satoris, Simz and Stenos the same trace (metering) solution can be embedded within a real or simulated machine environment and still have adaptive aspects of instrumentation, measurement and collection perform without a change in the playing back of a software episodic memory (a memory trace of sorts). The ability to trace a simulated environment with the same capabilities of a live system is a profound change to how we approach the measurement and understanding of software execution behavior. It opens up so many new opportunities that can never be realistically matched with static or dynamic tracing. Imagine the execution of C/C++ code being simulated and adaptively traced within an entirely different machine runtime and architecture – now and in the future. Simulated and Adaptive represents the next stage in the evolution of software and not just in terms of tracing. When these two are combined and integrated we will finally move from intelligently writing code to engineering intelligent systems.
The Marking of Adaptive Software Memories
This week Simz 2.3 broke all previous benchmark records in simulating 270 million metered network streamed calls a second on a Google Cloud n1-highcpu-32 machine type instance. That is 540 million call events a second - 32 billion events a minute. The software execution calls originated in 28 client JVMs also running on an n1-highcpu-32 machine type instance. On average each client was able to invoke 9.6 million instrumented method calls a second from a single thread per process with an average client call latency of 100 nanoseconds. The CPU utilization on the Simz machine was pegged at just over 90% with the incoming network data transmission at 985MB a second.
I've long fascinated over how best to perceive the behavior of software machines that for the most part appear as black boxes; consuming input we feed them and producing output we consume, directly or indirectly. I cannot help feeling there is a lost beauty in the motion of action that needs to be rediscovered in order to acquire a far greater understanding of what it is that software and hardware machines do and how in the course of action(s) and state changes (encompassing the environment) such a system changes behavior in ways not observed, or accurately predicted nor fully understood.
A proposal for a different approach to application performance monitoring that is far more efficient, effective, extensible and eventual than traditional legacy approaches based on metrics and event logging. Instead of seeing logging and metrics as primary data sources for monitoring solutions we should instead see them as a form of human inquiry over some software execution behavior that is happening or has happened. With this is mind it becomes clear that logging and metrics do not serve as a complete, contextual and comprehensive representation of software execution behavior.
The following is a graphic I've used in the past to frame various software performance optimization techniques. It is not a comprehensive inventory of all software performance optimization techniques (or concerns) but I've found it serves a purpose in managing the amount of effort that, in general, should be spent on each technique outside of extreme cases such as trading platforms (or profilers). The left side is your typical localized bottom up approach to speeding up code execution steps.