A Gap in Time – Measuring What Came Before

This article outlines the reason behind, and usage of, the gap metering extension included in Satoris which unlike possibly all other application performance monitoring solutions focuses on what transpires between consecutive service request processing by a particular thread for the same entry point method. Measuring, or metering, the gap between consecutive root method invocations is important in trading environments as it offers an insight into possible performance bottlenecks further upstream (before) in the order submission. If throughput for each thread within a JVM process is tracked then observing and analyzing the gap between events, that constitute the actual throughput work, is incredibly important. The larger the gap the less throughput achieved. In some low latency environments that gap can be far greater than the service time but yet it is not readily observed or analyzed.

a.gap.in.time.timeline

But how can an activity based metering engine track and present what is effectively a blindspot outside of any measured activity? It’s a strange sort of performance blindspot because nothing is actually occurring in terms of what is observed – call invocations. The gap metering extension addresses this by assigning the time, or metering, differences between the end of the last activity and the beginning of a new activity, by the same thread, to the subsequent activity. For every meter enabled within the metering runtime the gap metering extension creates an additional meter that is prefixed with "gap". The changes in the underlying meter measure are effectively carried forward into the new activity and assigned to it. This only happens if both the previous and next (current) activities are of the same probe name.

Here is a simple snippet of code that will be used to demonstrate the gap metering extension in action. The outer() method, called repeatedly in a while loop, will initially represent the actual service entry point. The call spin(50_000L) will be the gap that occurs between each consecutive outer() call. The outer method takes approximately 125 microseconds to execute with 100 microseconds spent within outer and 25 microsecond within inner, called by outer. The spin method is not instrumented.

a.gap.in.time.code

Below is a sample metering snapshot of the above code executed within the Satoris agent loaded. The gap metering has not been enabled yet so we only have the basic clock.time meter included in the model.

a.gap.in.time.table.1

Here is the software execution behavior visualized by the blink metering extension. The white bar represents the outer method and the green the inner.

a.gap.in.time.blink

The following jxinsight.override.config file enables the gap metering extension and excludes main method from being augmented by the gap metering extension.

j.s.p.gap.enabled=true
j.s.p.gap.filter.enabled=true
j.s.p.gap.filter.exclude.name.groups=samples.Gaps.main

Below is a sample snapshot of the resulting performance metering model. There are now two metering rows per probe name – clock.time and gap.clock.time. The only probe to have an actual value assigned to the gap.clock.time total column is samples.Gap.outer. The average gap.clock.time for outer is 51 microseconds. Close to the explicit delay placed in the code.

a.gap.in.time.table.2

In the following revised configuration the gap metering extension augmentation is restricted to the inner method.

j.s.p.gap.enabled=true
j.s.p.gap.filter.enabled=true
j.s.p.gap.filter.include.name.groups=samples.Gaps.inner

The following table is sample of a metering model generated with the above configuration applied. Note that both outer and inner have gap.clock.time metering assignments. That is because inner method is enclosed, executed, within the scope of outer method so any metering changes applied to inner method (probe) will propagated upwards the caller (probe) chain. Only the inner method has the inherent (self) gap.clock.time attributed to it. The average has also changed from 51 microseconds to 152 microseconds because the gap now relates to time interval between consecutive calls of inner and not outer.

a.gap.in.time.table.3