“When we’ve discounted all other possibilities,
whatever remains, however impossible,
will be made possible when needed.”
In 2010 the record and playback capability included now in Satoris and Stenos was first released. Strangely enough the technology was originally engineered to help with testing the very same products built on the technology. Obtaining a repeatable performance model from a benchmark is near impossible when you are trying to compare each and every single aggregated measurement for two distinct runs, especially when latency is measured in microseconds across thousands of instrumented methods. But if you create a recording of an original benchmark run that includes the meter readings for every call event and then play this recording back, recreating the threads and stack frames associated with call events as well as timings, you can change the underlying metering software, optimizing the calculation of statistics, but still reproduce the exact same performance model. With the ability to simulate a memory of software execution it became obvious that many of the optional data collection extensions need not be activated at the time of actual execution of a metered application but instead could be moved to another space (process) and time (offline), when the application recording was played back. Each time a software memory of execution behavior was played back the observer, the metering agent, could be configured differently in what was to be collected. The relatively high cost of some data collection extensions could be deferred without ever impacting the online application or benchmark.
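The record-then-replay idea above can be sketched in a few lines of Java. This is only an illustration under stated assumptions: `CallEvent`, `Observer`, and `playback` are hypothetical names invented here, not the Satoris or Stenos API, and the recording is assumed to be well formed (every end event has a matching begin). It shows one recording being played back twice with differently configured observers, each rebuilding the per-thread call stacks from the events.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class RecordPlaybackSketch {

    /** Hypothetical recorded call event; not the Satoris or Stenos format. */
    static final class CallEvent {
        final long threadId;
        final String method;
        final boolean begin;    // true = call begin, false = call end
        final long clockNanos;  // meter reading captured when recorded

        CallEvent(long threadId, String method, boolean begin, long clockNanos) {
            this.threadId = threadId;
            this.method = method;
            this.begin = begin;
            this.clockNanos = clockNanos;
        }
    }

    /** The observer is chosen at playback time, not at recording time. */
    interface Observer {
        void onCallEnd(long threadId, String method, int depth, long elapsedNanos);
    }

    /** Replays a recording, rebuilding one call stack per recorded thread. */
    static void playback(List<CallEvent> recording, Observer observer) {
        Map<Long, Deque<CallEvent>> stacks = new HashMap<>();
        for (CallEvent e : recording) {
            Deque<CallEvent> stack =
                    stacks.computeIfAbsent(e.threadId, id -> new ArrayDeque<>());
            if (e.begin) {
                stack.push(e);
            } else {
                CallEvent b = stack.pop(); // matching begin (well-formed input assumed)
                observer.onCallEnd(e.threadId, b.method, stack.size(),
                        e.clockNanos - b.clockNanos);
            }
        }
    }

    /** Observer configuration 1: total inclusive time per method. */
    static Map<String, Long> totalNanos(List<CallEvent> recording) {
        Map<String, Long> totals = new TreeMap<>();
        playback(recording, (thread, method, depth, nanos) ->
                totals.merge(method, nanos, Long::sum));
        return totals;
    }

    /** Observer configuration 2: call counts per method. */
    static Map<String, Integer> callCounts(List<CallEvent> recording) {
        Map<String, Integer> counts = new TreeMap<>();
        playback(recording, (thread, method, depth, nanos) ->
                counts.merge(method, 1, Integer::sum));
        return counts;
    }

    /** A tiny made-up recording: two threads, interleaved in time order. */
    static List<CallEvent> sampleRecording() {
        List<CallEvent> r = new ArrayList<>();
        r.add(new CallEvent(1, "Service.handle", true, 0));
        r.add(new CallEvent(2, "Dao.query", true, 50));
        r.add(new CallEvent(1, "Dao.query", true, 100));
        r.add(new CallEvent(2, "Dao.query", false, 150));
        r.add(new CallEvent(1, "Dao.query", false, 350));
        r.add(new CallEvent(1, "Service.handle", false, 1000));
        return r;
    }

    public static void main(String[] args) {
        List<CallEvent> recording = sampleRecording();
        System.out.println(totalNanos(recording));
        System.out.println(callCounts(recording));
    }
}
```

Because the timings come from the recording rather than from the clock, replaying the same list always yields the same model, whichever observer is attached.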
Around the same time that scalable recording and playback of metered execution behavior was introduced, software developers and architects had started experimenting with a more micro-like service-oriented architecture (SOA) as a way to decouple and isolate pieces of their software stack. With the number of processes, not necessarily coupled by a distributed path of execution, growing beyond the scalability limits inherent in traditional performance transaction monitoring solutions, a new approach to performance monitoring was desperately needed. A relatively easy option would be to strip away all the richness inherent in the software execution behavioral model, boil it down to a few hundred or so metrics, and then combine this with sampling of call stacks. The top two application performance monitoring vendors in the market today opted for this approach. Though from a technical point of view it is a step back to the stone age for the application performance management industry, it has proven immensely successful, as it plays to the fact that many in IT operations still don’t truly understand what constitutes application performance monitoring beyond the occasional creation and inspection of “management” dashboards, which is a misnomer as dashboards rarely offer any sort of influence over execution behavior.
There has been very little interest in observing and reasoning about the execution behavior of software and the ever-present emerging dynamics of complex software systems at the level of activity (and its action) required for deep understanding and effective system influence. But change would eventually come in the form of continuous change itself, the undeniable truth of life.
Simz on the other hand took a very different and far more challenging approach to the problem of monitoring distributed applications and services. Encouraged by what had been achieved with the record and playback and inspired by science fiction the plan was to build a Matrix for the Software Machine. Software execution memories would be streamed from hundreds of machines into a single simulation process that would constitute the observable universe of computing machines.
In 2013 the hype surrounding PaaS had peaked and disillusion was rapidly sinking in. One problematic area for PaaS was the lack of observability of application code execution deployed to an opaque platform service in the cloud. Whilst many of the cloud platforms aimed to simplify management of the many moving parts within an application, through automated processes driven in part by policies, when things went wrong those responsible for the management of the application were forced to tear away the veil to see the moving parts in a way that was not possible or entirely transparent.
Mirrored simulation offered a way forward for PaaS but unfortunately many of the vendors were fighting too many other technical battles with limited finances and resources to undertake such a bold move of streaming the execution behavior occurring within opaque containers over into a machine that would simulate the playback of such execution in a manner that portrayed the system as one whole but at a level of detail that offered unprecedented behavioral insight and diagnosis to developers, testers and operations staff. Simz offered a “high definition” observation point into production that was secure and safe but at the same time it appeared real to both man and tool.
Mirrored simulation not only offered a solution to the lack of observability in the cloud, it also provided a means of integrating existing customer systems not deployed to the cloud. Vendors would not have to extend their own platform to support hundreds of possible integrations. Instead, customers could plug extension code into the simulation, particular to their needs and software. Using an Open API consistent across real and simulated machine worlds makes the development, testing and maintenance of integration code relatively straightforward.
With the ability to easily move interception extensions from a real environment into the simulated environment without any change to the code, it was apparent that a far greater potential for mirrored simulation lay beyond performance monitoring. Instead of forking asynchronous integration tasks at particular points in the execution of application code, the same thread call stack execution behavior, but not the actual class byte code, would be replicated to multiple simulating machines in which the integration task is executed as if it were within the same thread execution context of the simulation. This allowed developers to program in a more natural synchronous style but still execute in a distributed asynchronous manner. Developers could choose to execute the augmentation of the behavior locally or remotely, online or offline, without changing a single line of code in the extension. The ability to alter both the space (location) and time (online, delayed, offline) aspects of a cross-cutting concern whilst maintaining the calling context (thread stack) as well as execution path histories elevated aspect-oriented programming (AOP) to a whole new level.
Simulated software memories, whether online or offline, allow us to lazily defer the cost of inquisition, or augmentation, to another time and space, offering the chance to repeatedly refine our questioning and understanding of what transpired within the software machines we monitor and manage. Software memories allow us to employ multiple techniques of discovery and they are not limited to what we know today and what tools and techniques are available to us at this time. If software machine (behavioral) memories can always be simulated with the ability to augment each playback of a simulation then there are no limits to what questions can be asked of the past and what can be actioned in response to such in the present and future. Operations can use the simulated playback to check whether new alerting rules created as a result of an incident recorded will indeed fire at the appropriate time in the past, when it did occur, and hopefully in the future, when it might reoccur. New recruits in Operations can be placed in a “flight simulator” fed data by the simulated playback of execution, then during the simulated playback they’re questioned on what they can perceive and predict as well as what questions should be asked of the situation.
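The alert-rule check described above might look something like the following. It is a minimal sketch, assuming the recorded incident has already been reduced to a list of timestamped latency samples; `Sample`, `replay`, and the "breach for N consecutive samples" rule shape are hypothetical and chosen only for illustration, not taken from Simz.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongPredicate;

public class AlertReplaySketch {

    /** Hypothetical latency sample taken from a recorded incident. */
    static final class Sample {
        final long timeMillis;
        final long latencyMicros;

        Sample(long timeMillis, long latencyMicros) {
            this.timeMillis = timeMillis;
            this.latencyMicros = latencyMicros;
        }
    }

    /**
     * Replays the recording against a candidate rule and returns the
     * timestamps at which the rule would have fired, i.e. the moment the
     * breach condition has held for `consecutive` samples in a row.
     */
    static List<Long> replay(List<Sample> recording, LongPredicate breach, int consecutive) {
        List<Long> fired = new ArrayList<>();
        int run = 0;
        for (Sample s : recording) {
            run = breach.test(s.latencyMicros) ? run + 1 : 0;
            if (run == consecutive) {
                fired.add(s.timeMillis);
            }
        }
        return fired;
    }

    /** Made-up incident: latency spikes for three samples, then recovers. */
    static List<Sample> sampleIncident() {
        long[] latencies = {100, 120, 900, 950, 980, 110};
        List<Sample> r = new ArrayList<>();
        for (int i = 0; i < latencies.length; i++) {
            r.add(new Sample(i * 1000L, latencies[i]));
        }
        return r;
    }

    public static void main(String[] args) {
        // Would "latency above 500 microseconds for 3 samples" have caught it?
        System.out.println(replay(sampleIncident(), micros -> micros > 500, 3));
        // A stricter rule can be tried against the very same memory.
        System.out.println(replay(sampleIncident(), micros -> micros > 500, 4));
    }
}
```

The point of the design is that the rule is just another observer: it can be rewritten and replayed against the same recorded incident as many times as needed before it is trusted in production.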
Today, monitoring human activity and interaction requires monitoring the machines that actually deliver the services utilized in such interactions. Devices are not just proxies for human action; they’re active agents. Personal agents don’t yet have a general artificial intelligence that could operate across different tasks and contexts, so it is likely their numbers will keep growing for some time.
To improve the resilience and changeability of services supporting personal agents, architects and developers are forced to increase the degree of partitioning, replication and isolation of components within the service supply chain. This represents a significant increase in complexity for operations, and it is driven along two dimensions: space and time. Time contributes to the complexity and risk in managing systems, services and applications when there are significant differences (time delays) between when something is observed, when it is judged and when it is reacted upon. Operations are being tasked with managing behavior (cause) and resulting events (effects) that have a time resolution as low as a few microseconds, whilst they collect, process, sort, filter, and analyse measurement data in seconds, even minutes, and then react in minutes, even hours. By the time an engineer acts, the basis for the action has probably already elapsed and been invalidated. Not only is the management space increasing in scope and coverage, its rate of change is accelerating, along with increasing diversity and wider dispersion and distribution of data and execution.
How we monitor and manage machines and services needs to change and very quickly. The big question for application performance monitoring going forward with greater adoption and application of machine learning and artificial intelligence will be the co-operative roles played by both man and machine in this regard. Machines will more than likely be used to mine past performance data for behavioral patterns that can predict possible near future outcomes. Man on the other hand will oversee such training and the guidance of supervisory machines and their internal mechanisms of observation (perception) and control (action).
Man does not scale in actively managing systems of such scale and complexity. This task must be largely delegated to the machines themselves. We need to be engineers of self-regulating and self-adaptive systems as well as governors of control and influence at a much higher level of abstraction and manageability. There is a proliferation of movable body parts in the systems being built, yet no model of a cognitive and nervous system that would self-regulate and align action to one or more goals. A mirrored simulation could very well be the foundational basis for what is needed, with every managed process having a “mini-me” mirror instance that survived any crash of the real (and mirrored) process. These mirror instances, locally deployed, could themselves project outwards to a more remote mirror instance that acted as a collective “many-me” mirrored world.
It would seem a tall order for any technology to address so many concerns as depicted above and in a unified way. How could one mirror the software execution behavior of hundreds, maybe thousands, of service processes within a single simulation process, leaving aside any sort of redundancy for now? What of the resource constraints and limits that would be imposed on an implementation? CPU? Memory? Disk? Network? Code simulated within a mirrored world would consume very little if any CPU in the interval between the processing of begin and end events of an activity (method call) executed in the real world. Each event would translate into a push or pop operation on the mirrored thread stack. The chunking of streamed events, in near real time, makes it possible to apply numerous predictive pipelined optimizations that would be near impractical in the actual application. A cost-aware instrumentation agent like Satoris, embedded within the real application, would eliminate much of the execution transmission noise, thereby reducing the overall processing capacity needs. Even with the best software engineering in the world there are limits, but these are being pushed higher and higher with the ever-increasing capacity of multi-core machines. Memory, one might think, would be a huge challenge to overcome, but the simulation is largely mimicking the behavior of code execution and not the state referenced and passed around. Most of the data a simulation needs is once-off transmitted metadata describing threads and methods, though a simulation like Simz would support transmission of tagged contextual state scoped to a process or thread. Whilst Simz has been able to bring the average transmission size of a single call event down to 2-3 bytes, the biggest challenge for huge-scale mirroring of a machine world is the network bandwidth.
Fortunately many companies are rolling out 10G networks within their data centers, and many of the cloud vendors, such as Google, have already done so and are planning even larger data transmission pipes.
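To make the push and pop mechanics concrete, here is a minimal sketch of a mirror that receives method metadata once and thereafter only small begin and end events keyed by integer ids. The class and method names (`MirrorSketch`, `describeMethod`, `begin`, `end`, `stackOf`) are assumptions made for this illustration and say nothing about how Simz is actually implemented or how it encodes events on the wire.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

public class MirrorSketch {

    // Once-off metadata: method id -> name, sent a single time per method.
    private final Map<Integer, String> methodNames = new HashMap<>();

    // One mirrored call stack per remote thread; holds method ids only.
    private final Map<Integer, Deque<Integer>> stacks = new HashMap<>();

    /** Registers the metadata for a method id (hypothetical protocol step). */
    void describeMethod(int methodId, String name) {
        methodNames.put(methodId, name);
    }

    /** A begin event becomes a push on the mirrored thread stack. */
    void begin(int threadId, int methodId) {
        stacks.computeIfAbsent(threadId, id -> new ArrayDeque<>()).push(methodId);
    }

    /** An end event becomes a pop; stray end events are ignored. */
    void end(int threadId) {
        Deque<Integer> stack = stacks.get(threadId);
        if (stack != null && !stack.isEmpty()) {
            stack.pop();
        }
    }

    /** Current mirrored stack of a thread, outermost frame first. */
    List<String> stackOf(int threadId) {
        List<String> frames = new ArrayList<>();
        Deque<Integer> stack = stacks.get(threadId);
        if (stack != null) {
            for (Iterator<Integer> it = stack.descendingIterator(); it.hasNext(); ) {
                frames.add(methodNames.getOrDefault(it.next(), "<unknown>"));
            }
        }
        return frames;
    }

    public static void main(String[] args) {
        MirrorSketch mirror = new MirrorSketch();
        mirror.describeMethod(1, "Service.handle");
        mirror.describeMethod(2, "Dao.query");
        mirror.begin(7, 1);
        mirror.begin(7, 2);
        System.out.println(mirror.stackOf(7)); // both frames are live
        mirror.end(7);
        System.out.println(mirror.stackOf(7)); // only the outer frame remains
    }
}
```

Nothing in the mirror executes application code or holds application state; between a begin and its end the mirrored thread simply sits idle, which is why the CPU and memory costs discussed above stay small.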
This week Simz 2.3 broke all previous benchmark records in simulating 270 million metered network streamed calls a second on a Google Cloud n1-highcpu-32 machine type instance. That is 540 million call events a second, or roughly 32 billion events a minute. The software execution calls originated in 28 client JVMs also running on an n1-highcpu-32 machine type instance. On average each client was able to invoke 9.6 million instrumented method calls a second from a single thread per process with an average client call latency of 100 nanoseconds. The CPU utilization on the Simz machine was pegged at just over 90% with the incoming network data transmission at 985MB a second.
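The benchmark figures can be cross-checked with a little arithmetic. The sketch below only restates the numbers quoted above; the one assumption it adds is that 1 MB means 10^6 bytes, which affects the derived bytes-per-event figure slightly. The helper names are made up for this example.

```java
public class ThroughputSketch {

    /** Each metered call produces a begin event and an end event. */
    static long eventsPerSecond(long callsPerSecond) {
        return callsPerSecond * 2;
    }

    static long eventsPerMinute(long callsPerSecond) {
        return eventsPerSecond(callsPerSecond) * 60;
    }

    /** Combined call rate of all client JVMs. */
    static long aggregateClientCalls(int clients, long callsPerClientPerSecond) {
        return clients * callsPerClientPerSecond;
    }

    /** Average network bytes per call event (derived, not a quoted figure). */
    static double bytesPerEvent(double bytesPerSecond, long eventsPerSecond) {
        return bytesPerSecond / eventsPerSecond;
    }

    public static void main(String[] args) {
        long calls = 270_000_000L;                    // quoted: 270 million calls/s
        long events = eventsPerSecond(calls);         // 540 million events/s
        long perMinute = eventsPerMinute(calls);      // 32.4 billion events/min
        long fromClients = aggregateClientCalls(28, 9_600_000L); // 268.8 million calls/s
        double bytes = bytesPerEvent(985e6, events);  // assumes 1 MB = 10^6 bytes

        System.out.println("events/s:         " + events);
        System.out.println("events/min:       " + perMinute);
        System.out.println("client calls/s:   " + fromClients);
        System.out.println("bytes per event:  " + bytes);
    }
}
```

The 28 clients at 9.6 million calls a second account for 268.8 million calls, consistent with the rounded 270 million, and 985MB a second spread over 540 million events works out to a little under 2 bytes per event under the stated assumption.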
Mirrored simulation is not only ready for prime time, it could well be an industry game changer in the design and development of large-scale systems. And it need not be specific to one language or platform. It should be possible to switch in and switch out different body parts, microservices, without requiring a new mind and model to be developed for each one in turn. At the end of the day both code and language serve to execute and communicate action and intent. Action itself can very easily transcend the borders of a runtime imposed by developers.