In this article, I propose a different approach to application performance monitoring, one that is far more efficient, effective, extensible and eventual than legacy approaches based on metrics and event logging. Instead of seeing logging and metrics as primary data sources for monitoring solutions, we should see them as a form of human inquiry over some software execution behavior that is happening or has happened. With this in mind, it becomes clear that logging and metrics do not serve as a complete, contextual and comprehensive representation of software execution behavior.
Following on from this comes the question of the instrumentation that both measurement techniques require. Can we realistically know at development time what should be logged, counted and tallied? If logs and metrics are indeed questions, how can we know which questions will need to be answered in the future, bearing in mind that developers rarely have any performance profile information in front of them when such code is written? An even greater concern is that we may be answering questions that will never be asked, incurring unnecessary resource consumption (CPU, I/O) and further delaying completion of actual useful work. This is especially true considering how poorly designed most logging and metrics libraries are in terms of performance as well as data payload encoding and sizing.
The cost overhead of logging and metrics does not end there, as we invariably need special custom collectors to transform and transmit data from one source and form to another, eventually ending up in some big system monitoring solution that then slices and dices the data to fit into an entirely different table schema. Finally, software analytics is plonked right on top of all of this, far removed from reality in so many ways, which helps explain why it has largely failed to deliver much value in practice except for those companies that have managed to peddle what is effectively monitoring “dashboarding” as business intelligence or insight.
I am not saying that we should completely abandon event logging and metric monitoring. I just don’t see them as the most suitable sources of intelligence gathering in the field in which machines need to supervise other machines. Metrics do indeed have a place in understanding the environment (situation) in which an actor performs an activity when they give context to possible constraints on motion and consumption. The problem is we’ve pushed metrics, as well as logging, to the foreground in nearly all cases and not to the background. Because our tools and techniques view everything as a black box system, occasionally emitting some sort of status blip, we’ve lost sight (perception) of action that is fundamental to understanding both man and machine behavior.
My long-held position is that man does not scale in managing systems of scale and complexity and that this task must be largely delegated to the machines themselves. We need to be engineers of self-regulating and self-adaptive systems as well as governors of control and influence at a much higher level of abstraction (and manageability). Logging and metrics are very primitive tools for understanding, and even more so for observing, especially when they must be applied without prior information and intelligence about the system that needs to be managed. Software engineering effort would be more wisely spent in figuring out how to create a suitable representation of machine behavior that can be recorded and then played back within a simulation that is sufficiently close to reality, without unwanted side effects going beyond the boundaries of the simulation.
Simulated software memories, whether online or offline, allow us to lazily defer the cost of inquisition to another time and space (machine) and offer the chance to repeatedly refine our questioning and understanding of what transpired within the software machines we monitor and manage (indirectly). Software memories allow us to employ multiple techniques of discovery and they are not limited to what we know today and what tools and techniques are available to us at this time. If software machine (behavioral) memories can always be simulated with the ability to augment each playback of a simulation then there are no limits to what questions can be asked of the past in the present and future. This is an area of research that is worth pursuing because the benefits are potentially unlimited and with an application that goes far beyond monitoring, analytics, security, auditing, capacity management,… We just need a model representation of machine behavior that encompasses motion and state. I recently gave a talk on my proposal for such a representation and how intelligent adaptive agent technology allows this to scale without much loss of information.
What follows is a description of a video recording I’ve published here that demonstrates how metrics, automatically registered as JMX MBean instances, can be generated from the execution of instrumentation dynamically woven into Java class bytecode or invoked within a simulated playback of an application memory pertaining to the past or present.
In the first segment of the video I instrumented a single Apache Cassandra server with the Satoris metering agent that has been configured to install the management metering extension. Here is the jxinsight.override.config file I used to enable this.
    j.s.p.hotspot.threshold=5
    j.s.p.hotspot.threshold.inherent=1
    j.s.p.management.enabled=true
In the video I used the jconsole tool, distributed with the Java runtime, to inspect the MBean metrics within the Apache Cassandra server.
With the management extension enabled, every probe + metering tuple will have an associated MeteringMBean that maintains counts and totals for each metered consumption (method invocation). This can generate a huge number of MBean metrics, more than 10,000, so it is best to limit the registration to only those probes that are classified as hotspots. Below is a revised jxinsight.override.config file that does this.
    j.s.p.hotspot.threshold=5
    j.s.p.hotspot.threshold.inherent=1
    j.s.p.management.enabled=true
    j.s.p.management.guard.name.labels=!managed
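
To make the idea of a per-probe MBean that keeps a count and a total more concrete, here is a minimal sketch using only the standard JMX API. It is not the Satoris MeteringMBean; the CallStatsMBean interface, the sketch.metering domain and the probe and meter names are all hypothetical and chosen purely for illustration.

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.AtomicLong;
import javax.management.MBeanServer;
import javax.management.ObjectName;

class MeteringMBeanSketch {

    // Hypothetical management interface: one count and one total per metered probe.
    // Standard MBean interfaces must be public and named <ClassName>MBean.
    public interface CallStatsMBean {
        long getCount();
        long getTotal();
    }

    public static class CallStats implements CallStatsMBean {
        private final AtomicLong count = new AtomicLong();
        private final AtomicLong total = new AtomicLong();

        // Called once per metered invocation with the consumed units (for example, microseconds).
        void record(long units) {
            count.incrementAndGet();
            total.addAndGet(units);
        }

        @Override public long getCount() { return count.get(); }
        @Override public long getTotal() { return total.get(); }
    }

    // Registers one MBean, records two invocations, then reads the attributes back over JMX.
    static long[] demo() throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        // Assumed naming scheme: one MBean per probe + meter pair.
        ObjectName name = new ObjectName("sketch.metering:probe=org.example.Service.write,meter=clock.time");
        if (server.isRegistered(name)) {
            server.unregisterMBean(name);
        }
        CallStats stats = new CallStats();
        server.registerMBean(stats, name);
        stats.record(120);
        stats.record(80);
        long count = (Long) server.getAttribute(name, "Count");
        long total = (Long) server.getAttribute(name, "Total");
        server.unregisterMBean(name);
        return new long[] {count, total};
    }

    public static void main(String[] args) throws Exception {
        long[] result = demo();
        System.out.println("count=" + result[0] + " total=" + result[1]);
    }
}
```

A tool such as jconsole would see the same Count and Total attributes while the MBean is registered, which is why the number of registered names grows so quickly when every probe gets its own instance.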
In the next segment of the video I moved the management extension out of the Apache Cassandra server and enabled it within a Simz server that simulates, in near real-time, the mirrored metered execution of the Apache Cassandra server. Here is the jxinsight.override.config used by the Satoris agent loaded into the Apache Cassandra server runtime.
    j.s.p.hotspot.threshold=5
    j.s.p.hotspot.threshold.inherent=1
    j.s.p.simz.enabled=true
    j.s.p.simz.guard.name.labels=!managed
For every call into the Probes Open API made within the Apache Cassandra server, there will be a corresponding call performed by a paired thread within the simulated environment. The management extension is completely unaware that it is now operating within a simulated runtime, and it runs unchanged. In the video I connected the jconsole tool to the Simz process and observed the same MBeans seen previously within the real application.
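
The pairing of a real thread with a simulated one can be pictured with a small sketch. This is not how Simz is implemented; it only assumes that begin and end events are streamed in order over some channel (here an in-process queue standing in for the network) and re-enacted by a dedicated paired thread. The Event class and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class MirrorSketch {

    // Hypothetical event: which probe fired and whether the call is beginning or ending.
    static final class Event {
        final String probe;
        final boolean begin;

        Event(String probe, boolean begin) {
            this.probe = probe;
            this.begin = begin;
        }
    }

    // Sentinel that tells the paired thread the real thread has finished.
    static final Event STOP = new Event("", false);

    // Streams the calls of one "real" thread to a paired thread that re-enacts them in order.
    static List<String> mirror(List<Event> calls) throws InterruptedException {
        BlockingQueue<Event> channel = new LinkedBlockingQueue<>();
        List<String> replayed = new ArrayList<>();
        Thread paired = new Thread(() -> {
            try {
                for (Event e = channel.take(); e != STOP; e = channel.take()) {
                    // An extension running in the simulation would observe the call here.
                    replayed.add((e.begin ? "begin " : "end ") + e.probe);
                }
            } catch (InterruptedException ex) {
                Thread.currentThread().interrupt();
            }
        });
        paired.start();
        for (Event e : calls) {
            channel.put(e); // emitted by the real application thread
        }
        channel.put(STOP);
        paired.join(); // join makes the paired thread's writes visible to the caller
        return replayed;
    }

    static List<String> demo() throws InterruptedException {
        return mirror(Arrays.asList(
                new Event("write", true),
                new Event("commit", true),
                new Event("commit", false),
                new Event("write", false)));
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(demo());
    }
}
```

Because the paired thread sees the same ordered sequence of calls, anything that listens to those calls, such as a metering extension, cannot tell whether it is attached to the real runtime or to the mirror.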
Extremely useful is the ability to move a metering extension such as management over into a near real-time simulation that can simulate the behavior of multiple real application processes. In the video I showed how the single MBeanServer within the simulated runtime can hold MBean metrics for two distinct and separate application processes. We don’t need to set up JMX in each and every application process and, more importantly, we don’t need tools and other monitoring solutions to pull from each process individually. The simulation is a single pane of observation. The simulation, which is adaptive, can now form an understanding of software behavior within a global frame of reference. The design of JMX, as well as of many MBean measurements, does not lend itself well to any sort of consolidation of values, so the ability to simulate global behavior within a single space and from there generate MBean metrics is a gift. In the video I hooked up both Apache Cassandra and Apache Kafka to Simz.
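
As a rough illustration of one MBeanServer acting as a single pane of observation for several processes, the sketch below registers metrics for two applications in one server and distinguishes them by a key in the ObjectName. The process key, the sketch.metering domain and the CounterMBean type are assumptions made for this example and do not reflect the names Simz actually generates.

```java
import java.util.Set;
import javax.management.MBeanServer;
import javax.management.MBeanServerFactory;
import javax.management.ObjectName;

class SharedServerSketch {

    // Hypothetical, deliberately tiny management interface.
    public interface CounterMBean {
        long getCount();
    }

    public static class Counter implements CounterMBean {
        private final long count;

        Counter(long count) {
            this.count = count;
        }

        @Override public long getCount() { return count; }
    }

    // Returns {MBeans for cassandra, MBeans for all processes, sum of all counts}.
    static long[] demo() throws Exception {
        MBeanServer server = MBeanServerFactory.newMBeanServer();
        // Assumed convention: the originating process is a key property of the name.
        server.registerMBean(new Counter(3), new ObjectName("sketch.metering:process=cassandra,probe=write"));
        server.registerMBean(new Counter(5), new ObjectName("sketch.metering:process=cassandra,probe=read"));
        server.registerMBean(new Counter(7), new ObjectName("sketch.metering:process=kafka,probe=append"));

        // Pattern queries select one process or the global view from the same server.
        Set<ObjectName> cassandra = server.queryNames(new ObjectName("sketch.metering:process=cassandra,*"), null);
        Set<ObjectName> all = server.queryNames(new ObjectName("sketch.metering:*"), null);

        long sum = 0;
        for (ObjectName name : all) {
            sum += (Long) server.getAttribute(name, "Count");
        }
        return new long[] {cassandra.size(), all.size(), sum};
    }

    public static void main(String[] args) throws Exception {
        long[] r = demo();
        System.out.println("cassandra=" + r[0] + " all=" + r[1] + " sum=" + r[2]);
    }
}
```

A monitoring tool then needs only one connection to read values for every mirrored process, and any consolidation across processes happens in one place.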
In the final section of the video, I focus on the offline playback of software memories. First, I recorded an execution of the Apache Cassandra server performing a write benchmark.
Here is the jxinsight.override.config file used.
    j.s.p.hotspot.threshold=5
    j.s.p.hotspot.threshold.inherent=1
    j.s.p.stenos.enabled=true
I could have limited the recording to only those probes deemed hotspots by adding the following to the above.
I then played back the recorded file using stenos.jar with the management extension enabled. Voilà, MBean metrics from software memories of the past.
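
The general record-then-replay idea can be sketched in a few lines. This is not the Stenos file format or API; it assumes a trivial recording of (probe, duration) entries and a hypothetical Observer interface, and it shows the same recording being played back twice to answer two different questions.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;

class PlaybackSketch {

    // Hypothetical observer: any question we want to ask of a recorded call.
    interface Observer {
        void onCall(String probe, long nanos);
    }

    // Writes one (probe, duration) entry per metered call to the recording.
    static void record(Path file, String[] probes, long[] nanos) throws IOException {
        try (DataOutputStream out = new DataOutputStream(Files.newOutputStream(file))) {
            for (int i = 0; i < probes.length; i++) {
                out.writeUTF(probes[i]);
                out.writeLong(nanos[i]);
            }
        }
    }

    // Replays the recording, in order, to whichever observer is plugged in.
    static void playback(Path file, Observer observer) throws IOException {
        try (DataInputStream in = new DataInputStream(Files.newInputStream(file))) {
            while (true) {
                String probe;
                try {
                    probe = in.readUTF();
                } catch (EOFException endOfRecording) {
                    return;
                }
                observer.onCall(probe, in.readLong());
            }
        }
    }

    static Map<String, Long> demo() throws IOException {
        Path file = Files.createTempFile("memory", ".rec");
        try {
            record(file, new String[] {"write", "read", "write"}, new long[] {100, 40, 200});
            Map<String, Long> answers = new TreeMap<>();
            // First playback asks "how often?", the second asks "how much in total?".
            playback(file, (probe, nanos) -> answers.merge(probe + ".count", 1L, Long::sum));
            playback(file, (probe, nanos) -> answers.merge(probe + ".total", nanos, Long::sum));
            return answers;
        } finally {
            Files.deleteIfExists(file);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(demo());
    }
}
```

The recording step knows nothing about counts or totals. Those questions are only introduced at playback time, which is the property that lets new questions be asked of old executions.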
Don’t forget to check out the video if you have not already done so.