I’ve long been fascinated by how best to perceive the behavior of software machines that, for the most part, appear as black boxes: consuming input we feed them and producing output we consume, directly or indirectly. I cannot help feeling there is a lost beauty in the motion of action that needs to be rediscovered if we are to acquire a far greater understanding of what software and hardware machines do, and how, over the course of actions and state changes (encompassing the environment), such a system changes behavior in ways not observed, not accurately predicted, and not fully understood. Have we forgotten how to observe in real time the execution behavior of our applications, in favor of tabulated benchmark results published at the end of some test? Could it be that the present moment, when consciously observed, brings a sense of fear to many system engineers, because such mindful watching only makes us more aware of the enormous complexity of our software creations and of how little control and influence we have over them, beyond start and stop actions and code changes?
“These neural networks we use are currently considered to be black boxes…They’re kind of doing the right thing, but you don’t really know how those representations work…We have to make sure we understand these systems better than we currently do…In 10, 20, 30 years time, we’ll maybe have these machines that are behaving in smart ways. We will have the chance for them to be able to actually introspect better – or we to be able to introspect their brains better than we can, more easily than we can than with human minds.”
DEMIS HASSABIS, DEEPMIND, GOOGLE – JAN 2015, FINANCIAL TIMES
In the performance workshops and consultancy engagements I’ve given, I’ve noticed how overwhelmed many developers become when I demonstrate the live interactive performance analysis capabilities of the Satoris monitoring client. Because of this I’ve designed the main content view within the monitoring client to have both static and dynamic parts. The static part renders the software performance model information within the snapshot, which imbues a sense of control: the information cannot change until the engineer explicitly performs a refresh command to retrieve a new and updated version of reality. The dynamic section of the content pane, which is not completely divorced from the static section (a common problem with many performance analysis tools on the market), includes many live visualizations of what is happening at the process level, at the currently selected probe in the snapshot table, and across the top probes (rows) displayed in the ordered snapshot table. The dynamic visualizations are updated every second and constitute the now here. This now is not the last second but the period of active (engaged) perception of the machine – our presence within the present.
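The static/dynamic split described above can be sketched in code. This is a minimal, hypothetical model – the class and method names below are my own illustration, not the Satoris API: an immutable snapshot that changes only on an explicit refresh, alongside live counters that the agent updates continuously.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Static part: an immutable view that cannot change under the reader.
final class Snapshot {
    private final Map<String, Long> totals; // probe name -> total time (ns)
    Snapshot(Map<String, Long> totals) {
        this.totals = Collections.unmodifiableMap(new HashMap<>(totals));
    }
    Long total(String probe) { return totals.get(probe); }
}

final class Console {
    private volatile Snapshot current;                       // refresh-on-demand
    private final Map<String, Long> live = new HashMap<>();  // dynamic part

    Console() { this.current = new Snapshot(new HashMap<>()); }

    // Called continuously as the agent reports metering data.
    void record(String probe, long ns) {
        live.merge(probe, ns, Long::sum);
    }

    // The static view never moves until refresh() is explicitly invoked.
    Snapshot snapshot() { return current; }
    void refresh() { current = new Snapshot(live); }
}
```

The design choice this illustrates is the one made in the essay: the observer controls when the static picture of reality advances, while the dynamic counters keep moving underneath.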
A common concern for many newcomers to the tooling revolves around the meaning of each color point in a chart and the data value that underlies it and dictates its placement. This need to dig immediately into the details of each charted point is best resisted, at least initially, especially as most confuse “different” with “complicated” and “complex”. Instead I recommend observers focus on discerning visual patterns through mindful watching of changes in one or more visualizations. I’ve very deliberately designed the visualizations to make recognition less taxing on the observer’s perception and cognition capacities.
Below I’ve cropped away all numbers and labels from a screenshot, and yet I can still extract useful information from the visualizations even though the picture lacks the most important aspect of action – motion. For example, I know I am watching a process that is executing mostly hotspot methods. The agent has had sufficient time to classify behavior and eliminate noise. The process is currently busy executing methods across a large number of threads, though this has not always been the case since I started observing the process. I can also see that the execution behavior of the selected method (probe) in a snapshot table, not shown here, is relatively consistent across the last few seconds since I started watching it exclusively. This consistency extends to both the total and inherent (self) total timing as well as the count (throughput), and in terms of thread execution this method (probe) has a work rate (throughput) and cost (latency) that is evenly distributed across threads except for one. Finally, though there is far more information I could gather, I can also ascertain that the cumulative performance model within the snapshot has a similar shape and profile to what has happened in the last second across multiple probes and each of their performance statistics. I am more than likely watching a stress test consisting of the same set of transactions. This near-immediate information extraction has come about through practiced observation of software behavior, with an emphasis on movement, motion and patterns over measured data values.
Interactivity and selectivity are incredibly important in exploring the unknown during any software performance analysis. I use the snapshot table to guide which methods I choose to observe in greater detail across time and threads. Any reasonably sized Java application will have thousands of possible methods to measure, but fortunately the intelligent and adaptive nature of the metering agent will whittle this down to a more manageable number. Here the instrumentation agent is truly acting as an agent on my behalf.
Even if we reduced the number of instrumented and measured methods from 10,000 down to 100, it would still be impractical to visualize the live execution behavior of that many when there are more than ten features per method. We cannot reasonably watch over 1,000 data items changing every second, so I pick and choose a much smaller selection by ordering the table and walking down the list. I give each method a few seconds of my attention as the charts update to reflect the new selection context, and then move on to the next one. In doing so I am able to recognize similar behavioral patterns across different clusters of methods. It is incredibly easy to see which methods are part of the same activity by noting which threads (identified not by name but by position in the visualization) light up as I select different methods (probes) in the list. The motion and movement in the visualizations act as a surrogate for the perception of action in the method, the actors (threads) involved, and the resulting resource consumption. But I am not restricted to the table ordering, as the monitoring client console offers the ability to observe the last second of performance-measured behavior across multiple methods (see the last two visualization rows above). I can very quickly detect a change in one or more probes in the last second and use this to drive my selection and attention.
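The whittling the agent performs can be approximated with a simple threshold scheme. The sketch below is hypothetical – the actual adaptive metering strategies in Satoris are more sophisticated and the names are my own – but it shows the core idea: after a warm-up number of calls, a method whose average cost falls below a hotspot threshold is classified as noise and no longer measured.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical adaptive filter: disables measurement of methods whose
// average elapsed time stays below a cutoff once enough calls have been seen.
final class AdaptiveFilter {
    private static final long THRESHOLD_NS = 1_000; // illustrative hotspot cutoff
    private static final int  WARMUP_CALLS = 100;   // calls before judging

    private static final class Stats { long calls, totalNs; boolean disabled; }
    private final Map<String, Stats> methods = new HashMap<>();

    // Records one measurement; returns true while the method is still
    // considered worth measuring, false once it is classified as noise.
    boolean measure(String method, long elapsedNs) {
        Stats s = methods.computeIfAbsent(method, k -> new Stats());
        if (s.disabled) return false;
        s.calls++;
        s.totalNs += elapsedNs;
        if (s.calls >= WARMUP_CALLS && s.totalNs / s.calls < THRESHOLD_NS) {
            s.disabled = true; // cheap method: stop paying the measurement cost
        }
        return !s.disabled;
    }
}
```

Under this scheme the set of measured methods shrinks toward the hotspots over time, which is what makes walking the ordered snapshot table tractable in the first place.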
In active observation I try not to form premature judgements or expectations of the data and behavior. Instead I allow the changes in the visualizations, and the patterns they form over the course of the now, to stimulate interest and sway interaction.
There is not much I miss after practicing this mindful watching for so long, but I am always asked what happens if I do miss something. Should I not anxiously click around the set of hotspot probes in a futile attempt to avoid missing that all-important moment of insight? No. The now pertains mostly to the observer, so I manipulate time by creating a metering recording of the application and then simulating its playback using Stenos. The Satoris monitoring client is completely oblivious to the fact that it is connected to a JVM that is simulating the past software execution behavior of another application, down to the particular thread and method level. Interactivity is now extended (and repeated) across time with each execution of the playback. Unlike a video recording, I can choose to observe the same execution behavior differently via my choice of method selection, the time at which I observe, and the data collection extensions I enable within the simulation.
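The record-and-replay idea can be sketched minimally. This is illustrative only – Stenos simulates playback at the JVM, thread and method level far more faithfully than a simple event log, and the names here are invented – but it captures the key property: the same recording can be replayed repeatedly, with a different observer attached each time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiConsumer;

// Hypothetical metering recording: events are captured with their thread,
// method and time offset, then replayed in order for any observer.
final class MeteringRecording {
    static final class Event {
        final String thread; final String method; final long atNanos;
        Event(String thread, String method, long atNanos) {
            this.thread = thread; this.method = method; this.atNanos = atNanos;
        }
    }

    private final List<Event> events = new ArrayList<>();

    void record(String thread, String method, long atNanos) {
        events.add(new Event(thread, method, atNanos));
    }

    // Replays the recorded past to an observer; unlike a video, the
    // observer (selection, extensions) can differ on every playback.
    void replay(BiConsumer<String, String> observer) {
        for (Event e : events) observer.accept(e.thread, e.method);
    }
}
```

Each playback traverses the same recorded past, so the choice of what to attend to can change from one replay to the next without the underlying behavior changing at all.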
The big question for application performance monitoring, as machine learning and artificial intelligence see greater adoption and application, will be the cooperative roles played by man and machine. Machines will more than likely be used to mine past performance data for behavioral patterns that can predict possible near-future outcomes. Man, on the other hand, will oversee such training and guide supervisory machines and their internal mechanisms of observation (perception) and control (action). In addition, man will offer a presence in the now, helping to identify and reconcile differences between what the machine perceives in terms of health (status) and the reality contradicted by other external sources, in particular users and systems. There needs to be a tight feedback loop between man and machine, but for this to scale man needs to focus less on the past and future and more on the now. To the machine, an operative becomes an actor or sensor within its environment.
Check out the following video recording to see some aspects of the above in action.