This is the third post in a series looking at the performance analysis of Apache Cassandra. In the first post, Advance Adaptive Profiling of Apache Cassandra 3.0, adaptive profiling over many benchmark iterations was introduced. The post offered a comprehensive performance model for an entire metered software execution of a “write” benchmark. For the follow-up post, Spot Checking Apache Cassandra 3.0 Performance Hotspots, the measurements within the performance model were verified using method spot checks and sampled traces at both thread and method levels.
Up to this point the measurement of instrumented methods has been controlled indirectly by enabling additional “intelligent” (contextual) metering extensions and setting thresholds within the default
hotspot metering extension. In this post, performance measurement budgetary control is applied alongside trace sampling to further reduce the impact of nested call (method) measurements on the accuracy of the measure caller chain. With the
budget metering extension the management of measurement overhead becomes more explicitly and direct.
By default the
budget metering extension assumes a probe (method) measurement cost of 1 unit of the (wall)
clock.time meter. Effectively metering an instrumented method invocation costs 1 microsecond though in reality the cost is ~120 nanoseconds (on modern bare-metal hardware) with most of that time attributed to the underlying clock counter access performed twice. For a method to be metered its “typical” inclusive metered execution time is multiplied by a budget % (default 10%). If the result of the budget calculation is greater than the unit cost, default 1, then the probe is metered but only if the budget for the caller, if any, allows for it. So for a method to be metered its typical execution time would need to be at least 10 microseconds (10% of 10 is 1). With a budget allowance of 1 there will be no metering of instrumented methods calls (callees) as the budget is used up immediately with the measurement of the caller method. The budget is constantly refined and updated as the thread moves up and down the call stack – both depth and breadth.
The goal of the
budget metering extension is to limit the measurement that occurs within the execution of a method. So if a service entry point method has a typical execution time of 1,000 microseconds then the maximum number of method measurements will be 100 (10% of 1,000). In most enterprise Java applications, especially those using the Spring Framework and/or Servlet filters, the methods close to the entry point are pass through methods with very little self (inherent) time. This would normally present a problem in applying the
budget extension as such pass through methods would consume the budget allowance before the call path (or trace) got further and deeper down in the call stack but with the
hotspot metering extension eliminating the measurement of methods not exceeding the inherent hotspot threshold this issue is less likely to happen once the process is past the warmup phase or the instrumentation has been refined as was the case in Advance Adaptive Profiling of Apache Cassandra 3.0.
Below is the performance model collected following a single Apache Cassandra write stress benchmark iteration with both the
budget metering extensions enabled. In all benchmark runs the
hotspot metering extension is always enabled as per the default configuration.
Click on the images below to get the full technicolor table display.
Whilst the purpose in budgeting performance measurement is to improve the measurement accuracy of those methods metered the initial comparison of each tested metering configuration will be based on the impact to the frequency, the count, an instrumented method is measured within a single write stress run. Here are the results for the
budget extension with a default percentage of 10%. The top data row is the total number of instrumented method executions measured (metered).
With just the
tracepoint metering extension enabled the total number of measurements was 4.1 million – a significant reduction from the baseline performance model of 305 million measurements (not shown here). With both
budget enabled the measurement drops to 0.8 million. Out of the methods selected for comparison purposes the
StorageProxy.mutate method experienced the greatest reduction with the
budget extension enabled. This could be attributed to the overhead budget being consumed by calls further back in the caller chain (and stack). It could also happen if the method after a period of metered execution becomes too cheap to have any actual budget units itself though that is probably more like to be the case for the
addUpdates method. Once the budget hits 0 the method is never measured again – effectively it remains disabled for the lifetime of the process.
The budget allowance can be increased by setting the
j.s.p.budget.percent value to 50 from the default of 10. With such an increase those methods lower in the trace depth or later in the trace timeline will have a much greater chance of being metered. Below is the revised performance model after re-running the write stress test with the new configuration.
The updated comparison table shows the budget allowance increase has upped the frequency for the
StorageProxy.mutate probe. There is now greater alignment in the frequencies across sets of probes (methods) indicating sufficient overhead allowance most of the time within a sampled trace.
What if the budget allowance is instead dropped down below the default of 10%. Say 5%. With a unit cost of 1 only methods with a “typical” execution time of 20 microseconds would continue to be metered, assuming that the budget is not already consumed further up in the caller chain (stack).
With just a 5% budget allowance most of the selected methods suffer a significant drop in the actual metered measurement of method executions. The overall measurement count has dropped down from 0.8 million (with a 10% allowance) to 0.4 million.
budget extension is extremely useful at limiting measurement overhead in production when tracing service request calls it is less effective at spot checking a previous performance hotspot analysis especially when combined with the
budget extension as it limits measurements to those probes, methods, closer to the trace entry points. It would be wonderful if a trace was randomly selected from different points in the call stack (or caller chain). Not just at the entry point into any service request processing. This is exactly what the
flowcheck metering extension offers – it’s
spotcheck for traces (or call paths). Here is the performance model exported following a stress test run with both
budget metering extensions enabled.
flowcheck samples by default every 100 trace (flow) on a per thread basis, unlike
tracepoint which by default uses a global shared counter by probe name, the more threads the more measurements, as can be seen in the following comparison table. What is far more interesting here is that the counts for
flowcheck seem as correlated, or grouped, as was the case in the other comparison tables above. This is because
flowcheck is randomly picking (different) locations at which to start a trace, which then enables metering of all direct and indirect executed methods that are instrumented.
flowcheck metering extension provides control over the random trace sampling using the
j.s.p.flowcheck.upper configuration properties. The default values are
upper=100 but in the performance model below these values were set to 250 and 500 respectively, ensuring the gap between traces was 250 and the randomly selected point was somewhere between 0 and 250 (500 – 250).
With lower set to 250 and upper 500 the frequency of measurement has dropped across the board, going from 11.7 million measurements to 1.7 million.
In the following performance model the budget allowance percentage was dropped down from 10% to 5%. Budgeting can be far more conservative now that the
flowcheck extension is starting traces at various different points in the trace (and call stack).
With the budget allowance halved the overall measurement count for the
flowcheck run has dropped from 1.7 million to 1.1 million.
The table below shows the net effect of reducing nested measurement when dealing with extremely low latency method executions. The baseline is the last run with just the hotspot metering extension enabled. The number of measurements for this run was 305 million. It represents far more of the measurement population and so incurs possibly more overhead. At hotspot thresholds of 2 and 1 microseconds this was always going to the case without changing the laws of physics, which some other tools inadvertently claim to do with “zero overhead” statements when actually what they do is “nothing” – not instrumenting, not measuring just call stack sampling every 10,000-50,000 microseconds (an eternity for a machine).
flowcheck combined with
budget offers microbenchmark like performance analysis of code within an actual real execution environment. Maybe it is times to reconsider your microbenchmark approach and tooling. You’re doing it wrong if it is not real.
APPENDIX A – GOOGLE CLOUD HARDWARE
Machine type: n1-highcpu-32 (32 vCPUs, 28.8 GB memory)
Disk type: Standard persistent disk, IOPS (read=3, write=15), throughput (read=1.2, write=0.9)