I firmly believe that the mirroring (replication) of software execution behavior as performed by Simz (online) and Stenos (offline) has the potential to be one of the most significant advances in the engineering of software systems. Its impact will be as great as that of distributed computing.
Today it is common to replicate data across machine boundaries, but what of execution behavior? While remote procedure call (RPC) middleware has allowed us to move the performance of work across process and machine boundaries, at a very coarse granularity, such calls do not necessarily represent the replication of software behavior; they are merely a form of service delegation. The mirroring I am referring to here is the simulated playback, online or offline, of software’s execution behavior, in which a thread performs a local function or procedure call that is near simultaneously mirrored in one or more “paired” runtimes. Within such runtimes a paired thread is created, if one does not already exist, to mirror the real application thread and, in tandem when online, perform the pushing and popping of stack frames as they occur within the application. This does not mean that the mirroring runtime needs to be implemented in the same language as the application. It should be possible for a C# application to be mirrored within a Java runtime, as long as the mapping maintains the same representation of the execution flow: events marking entry into and exit from methods within a particular thread. As you will read later, I see no reason why the flow cannot be a representation of organizational and human activity.
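To make the paired-runtime idea concrete, here is a minimal sketch, in Python, of what such an event stream and mirror might look like. Every name here (Event, MirrorRuntime, the method signatures) is hypothetical and not taken from Simz or Stenos; the point is only that a language-neutral stream of per-thread entry/exit events is enough to reproduce call flow in another runtime.

```python
from dataclasses import dataclass
from collections import defaultdict

# Hypothetical event model: each instrumentation probe emits an ENTER or
# EXIT event tagged with the originating thread and the method (frame).
@dataclass
class Event:
    thread_id: int
    kind: str      # "ENTER" or "EXIT"
    method: str    # language-neutral method signature

class MirrorRuntime:
    """Paired runtime that replays frame operations per mirrored thread."""
    def __init__(self):
        # A shadow call stack is created lazily per application thread,
        # mirroring "a paired thread is created, if one does not already exist".
        self.stacks = defaultdict(list)

    def apply(self, event: Event):
        stack = self.stacks[event.thread_id]
        if event.kind == "ENTER":
            stack.append(event.method)   # push mirrored stack frame
        else:
            stack.pop()                  # pop mirrored stack frame

# Replaying a short stream reproduces the application's call flow.
mirror = MirrorRuntime()
for ev in [Event(1, "ENTER", "Service.handle"),
           Event(1, "ENTER", "Repo.load"),
           Event(1, "EXIT",  "Repo.load")]:
    mirror.apply(ev)
print(mirror.stacks[1])  # ['Service.handle']
```

Nothing in the stream is specific to the source language, which is why a C# application could in principle be mirrored into a JVM-hosted simulation: only the flow representation must be shared.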
While some of the contextual data driving execution behavior can be mirrored, replaying the software execution behavior does not have the same side effects or outputs as it did within the real application runtime. The playback is immutable, in that a mirrored action cannot change the course of what has already occurred, but mutable in that it can augment what has already occurred with new side effects that potentially extend beyond the application’s own reach.
In the mirrored runtime, software engineers can inject new code routines into interception points, typically the entry and exit of stack frames, and perform additional processing of the behavior as if the injected code had been within the application at the time the execution behavior occurred (the illusion of time). This is not to say that the simulated environment fully constitutes the state of the real application runtime. In fact, the application code, itself a form of state, does not exist here. The simulation merely plays back thread stack frame operations, push and pop; within the frames nothing happens other than more nested stack frame operations, and these are driven by further streamed events, not by the content of a frame or code block (which does not exist in the simulation).
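A small sketch of what injection at interception points could look like, assuming a hypothetical playback engine that exposes hooks at frame entry and exit (none of these names come from the actual products). The injected routine below computes a measurement, maximum observed stack depth, that did not exist when the recording was made:

```python
from collections import defaultdict

class Playback:
    """Replays recorded frame operations and fires injected routines
    at the interception points: frame entry and frame exit."""
    def __init__(self):
        self.stacks = defaultdict(list)
        self.enter_hooks = []   # injected routines run at frame entry
        self.exit_hooks = []    # injected routines run at frame exit

    def replay(self, events):
        for thread_id, kind, method in events:
            stack = self.stacks[thread_id]
            if kind == "ENTER":
                stack.append(method)
                for hook in self.enter_hooks:
                    hook(thread_id, method, stack)  # sees thread + call stack
            else:
                for hook in self.exit_hooks:
                    hook(thread_id, method, stack)
                stack.pop()

# Injected routine: track the deepest call stack seen per thread.
depths = {}
pb = Playback()
pb.enter_hooks.append(
    lambda tid, m, s: depths.__setitem__(tid, max(depths.get(tid, 0), len(s))))
pb.replay([(1, "ENTER", "a"), (1, "ENTER", "b"),
           (1, "EXIT", "b"), (1, "EXIT", "a")])
print(depths)  # {1: 2}
```

The hook runs as if it had been present in the application at the time, yet it only observes and augments; it cannot alter the recorded frame operations themselves.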
What an engineer sees at the point of interception is the thread, its call stack including the current call frame, and some contextual data accessible from the thread’s environment interface. The playback is like a video memory of recorded execution behavior: we can’t directly touch, feel, or change what is captured in a frame, but we can augment that frame, much like what is done in the post-production of Hollywood movies that require special effects. Augmentation is made far simpler when the mirroring and playback are based not on the capturing and rendering of pixels but on event recording, taken from actual motion detectors attached to actors (threads are the software equivalent). The instrumentation probes, added into applications at runtime, are in fact motion sensors of software behavior. Likewise, we don’t need to be concerned with every minor and possibly unobservable change in the state or behavior of an actor (thread) between such recorded motion points; much as with video, the equipment used to record the original scene and the equipment used for playback and augmentation need not be of the same make and model. The augmentation that can be performed within the mirrored runtime allows us to mash up behaviors across space (different runtimes) and time (different histories).
Before going further, it is worthwhile to compare software execution recording and simulated playback with metric monitoring consisting of measurement collection and reporting.
Metrics are questions decided on before an actual event or behavior occurs. In fact, they can be formulated well before the software is even developed or deployed. Metrics typically sample a counter that tracks a series of events (or measures) over some time window, e.g. transactions per second. Metrics don’t record the execution behavior that is counted. Many metrics, and users of such metrics, don’t understand the what and how of such counting, and rarely are metrics tested to the same degree as the functionality within an application or service. Metrics are far removed from the underlying software execution behavior. A metric could count the number of scenes in which an actor appeared in a movie, but not the behavior of the actor and of the others present in each scene: how each interacted, as well as continuations across scenes. There is no playback of action with metric monitoring; there is only the rendering of metric measurement samples. Even if metrics were simulated, the playback would only be a replay of the metric collection itself, not of the actual software execution behavior that is indirectly represented by, and hidden behind, the metric. Simulated playback of a recording, on the other hand, allows us to reconstruct the entire software execution behavior and create new measurements (or metrics) on the fly by experiencing the behavior repeatedly. Questions can be formulated after the event or execution and continuously refined across multiple playbacks. What, when, and how such actions should be counted is deferred until needed or known. When a recording is played back we experience, through observation, the behavior itself, not the collection of counters and gauges as is the case with metrics.
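The deferral of the question is the crucial difference, and it is easy to illustrate. Below is a toy sketch (hypothetical recording format: timestamped entry events) in which a “calls per second” metric is derived from a recording long after the fact; nothing about the question had to be decided before the behavior occurred:

```python
from collections import Counter

# A metric formulated after the fact: given a recording of timestamped
# entry events, derive "calls per second" for any method of interest.
recording = [
    (0.1, "ENTER", "Checkout.pay"),
    (0.4, "ENTER", "Checkout.pay"),
    (1.2, "ENTER", "Checkout.pay"),
    (1.9, "ENTER", "Cart.add"),
]

def calls_per_second(events, method):
    buckets = Counter()
    for ts, kind, m in events:
        if kind == "ENTER" and m == method:
            buckets[int(ts)] += 1    # bucket into one-second windows
    return dict(buckets)

print(calls_per_second(recording, "Checkout.pay"))  # {0: 2, 1: 1}
```

With a conventional metric pipeline only the pre-chosen counter would have survived; with the recording, any such question can be asked, and refined, across repeated playbacks.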
Let’s now briefly explore how simulation can change the Past, Present, and Future of how we engineer and manage software systems.
Assume we have previously mirrored the software execution behavior of an application to a file, or to a socket that streams to a file or some other persistent store. Here are a few possible usage scenarios:
– New recruits in Operations are placed in a “flight simulator”: a dashboard fed by the simulated playback of execution, in which they are tasked with observing an application perform. During the simulated playback, they are questioned on what they can perceive and predict, as well as on what questions should be asked.
– Operations, after encountering a problem in production, use the simulated playback to check whether new alerting rules created as a result of the incident would indeed have fired at the appropriate time in the past when it occurred, and hopefully will in the future should it reoccur. A similar use case exists for new metrics or software analytical insights.
– After a failed IT audit the development team uses the simulation to go back in time and recreate the necessary audit traces that were omitted in the source. Here the new output is generated from past input — recorded behavior.
– The performance engineering team uses hooks into the simulated playback to schedule the execution of load test scripts at more realistic volumes and velocity.
– The test team creates a recording, with some contextual data capture of results or behaviors not directly visible from unit test code, and uses the simulated playback to perform delta analysis, not just on the returned values or state changes but on the resulting behavior carried out by the software in executing the tests. Instead of creating assertions on the values exposed at function call boundaries, the team creates deep inspection rules on the expected call behavior, both direct and indirect.
– The development team tasked with modularizing a monolithic system into multiple micro-services uses the simulated playback of past execution behaviors to identify candidate services, using captured runtime call dependencies across components and packages as well as a cost impact (latency, network) assessment based on the frequency of interaction across proposed service boundaries.
– The performance engineering team uses simulated playback to assess a proposed external service integration that would require the development team to add integration calls directly into their application code. The integration is first done in the simulation so that performance engineering can determine the impact on request latency and the additional resource consumption involved. As the simulation, in playing back a recording, uses minimal system resources, this assessment can be very accurate. If need be, the team can play back the simulation with and without the integration and then compare resource consumption across simulated playbacks. The team can also use the playback to test the performance and reliability of the integration endpoint before the code is deployed to production.
– After serious availability issues with a SaaS APM solution, the business decides to move to another service provider. Operations use the simulated playback runtime to feed the proposed new SaaS APM with past measured (metered) data and then compare the reporting of both vendors on the same underlying software execution behavior. For a period, they feed both services the same daily recorded software execution behavior, allowing operations staff to transition over to the new visualization and reporting capabilities gradually. This is made easy because of extensions to the simulation that make the necessary API calls to backends at the point of a stack frame operation.
– To help resolve an intermittent problem in production with a particular third party library (or platform) Operations decide to create a filtered recording from a simulated playback of a recording made in production which only includes those calls to the library itself. This limited recording is then sent to the third party vendor for analysis by the support team using the same simulated playback engine and observation tools. A similar use case exists for internal component/platform teams.
– Finding it impossible to get a handle on what is happening in their distributed systems, the Operations team decide to mirror (project) the software execution behavior of all execution units (application processes) into a single simulated runtime that is augmented with simple but powerful sensors and alerts, offering near real-time automated diagnosis across the entire space of the distributed system.
– After firefighting many performance and reliability issues with a non-critical or non-functional service integration, the development team decides to move the integration code out of the application and into a simulated playback environment, deferring the playback to a less busy time window. This is achieved with minimal change to the original integration code.
– The engineering team decides on a partitioning of a system into two domains to increase agility in the development of enhancements and integrations while still ensuring reliability. The first domain is far more stable (subject to less change) and reliable; the second, running as a real-time mirrored simulation, allows for greater ad hoc and online experimentation via the integration of dynamic business rules into the interception points of the simulation.
– After significant delays in the resolution of production problems, Operations is put under great pressure to allow developers access to production, including the installation of developer tools within the environment. Reluctant to allow such unfettered access, Operations decide to create a near real-time mirrored simulated environment from which developers can inspect the behavior of their code in production at any moment, without being given direct access to the applications and the machines running them.
– The system engineering team, frustrated by sub-optimal network load balancing in the routing of service requests to different nodes, the load balancer being blind to the state of internal processing queues within the applications as well as to the chain of service interactions for each particular service, decides to develop a new load balancer. This load balancer uses the mirrored simulation environment as the primary source in determining the presently outstanding queued work items and the estimated time at which such items will be scheduled and completed on a per-node basis. The simulated environment drives the workload to the real applications, and the applications project their execution behavior back to the simulated environment, which in turn drives the routing of more or fewer service requests: a feedback loop between the real and simulated worlds.
– Following the repeated crashing of applications without sufficient capture of diagnostic data, Operations creates a universal launch script that starts a mini-me simulated mirrored process before starting the actual application process. When the application process does crash, the engineering team only needs to inspect the mini-me simulated process to determine what was happening on all thread stacks within the application before it crashed. Eventually, the engineering team extends the mini-me process to take on a supervisory role in assessing the likelihood that an incident will occur, alerting Operations as well as an application management solution that preemptively readies a new service instantiation.
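Several of these scenarios amount to replaying a recording in order to answer a question decided after the fact. As a minimal sketch of the modularization scenario, here is how cross-component call dependencies might be counted from the recorded frame events during playback; the component-naming convention (method prefix) and all identifiers are hypothetical, standing in for package or namespace prefixes in a real system:

```python
from collections import defaultdict

def component(method):
    # Hypothetical convention: "orders.place" belongs to component "orders".
    return method.split(".")[0]

def interaction_counts(events):
    """Replay (thread, ENTER/EXIT, method) events and count calls that
    cross a proposed component boundary, i.e. candidate service seams."""
    stacks = defaultdict(list)
    edges = defaultdict(int)
    for tid, kind, method in events:
        stack = stacks[tid]
        if kind == "ENTER":
            if stack and component(stack[-1]) != component(method):
                edges[(component(stack[-1]), component(method))] += 1
            stack.append(method)
        else:
            stack.pop()
    return dict(edges)

events = [(1, "ENTER", "orders.place"),
          (1, "ENTER", "billing.charge"), (1, "EXIT", "billing.charge"),
          (1, "ENTER", "billing.charge"), (1, "EXIT", "billing.charge"),
          (1, "EXIT", "orders.place")]
print(interaction_counts(events))  # {('orders', 'billing'): 2}
```

High interaction counts across a proposed boundary flag the latency and network cost that splitting there would introduce, which is exactly the assessment the modularization team needs before committing to service boundaries.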
I’ve left out some items here that I am currently working on, which tie the simulation together with self-adaptive signaling.
– When an application starts up, it will connect to the simulated world and download a digest of past software execution behaviors, which it will then use to train its internal self-adaptive and self-regulating systems. Self-awareness will extend across life cycles of an application process.
– All devices, and users, connected to a software service will be mirrored in a simulated world. The software execution is a reflection of the user’s actions; the simulated playback is a mirroring of the software; the simulated reproduction is thus a mirror of the user and their device. There will be many simulated parallel worlds, each mined in a more natural and immediate way to assess the effectiveness of various behavioral influences dynamically injected and signaled back to the real applications. These worlds will serve as a proxy to the physical world, though the time and space dimensions may be changed to make each appear whole and current, and to circumvent abuse and unwanted alteration. Companies will offer paid access to such parallel worlds, both online (current) and offline (past). Active agent technology will be deployed into the simulated worlds.
– The push for faster “real-time” feedback loops between (software) machines and man will result in the projection of both their behaviors into the same simulated universe. Within this simulation, a typical behavioral model consisting of actor (thread or man), activity (call or action), and resource (meter) will unify both worlds sufficiently to allow a business to monitor and manage operations oblivious to the actual nature of an actor.
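The unifying model above is simple enough to sketch. In this toy illustration (all names and units are my own, not drawn from any product), a monitoring function aggregates metered resource usage per actor without caring whether the actor is a thread or a person:

```python
from dataclasses import dataclass

# Minimal actor/activity/resource model: an actor (thread or human)
# performs an activity (call or action) that consumes a metered resource.
@dataclass
class Activity:
    actor: str      # e.g. "thread-7" or "operator:alice"
    action: str     # e.g. "Checkout.pay" or "approve-refund"
    meter: str      # resource metered, e.g. "cpu.time" or "wall.time"
    amount: int     # consumption in milliseconds

def usage_by_actor(activities):
    """Aggregate metered consumption per actor, machine or human alike."""
    totals = {}
    for a in activities:
        totals[a.actor] = totals.get(a.actor, 0) + a.amount
    return totals

log = [Activity("thread-7", "Checkout.pay", "cpu.time", 30),
       Activity("operator:alice", "approve-refund", "wall.time", 4000),
       Activity("thread-7", "Cart.add", "cpu.time", 10)]
print(usage_by_actor(log))  # {'thread-7': 40, 'operator:alice': 4000}
```

The monitoring side never branches on the actor’s nature, which is the point: once both worlds project into the same model, one set of sensors, alerts, and reports serves both.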