RChain RNode Metrics - Developer
Introduction
The RChain RNode software is a complex distributed system. It is desirable that node operators manage, maintain, and tune their individual nodes for best reliability and efficiency. To do this, they require data on how various aspects of the node are functioning. Only by analyzing and visualizing this data can a node operator make educated decisions on how to deploy their software and hardware resources.
Metrics Requirements
The data required from the RNode software includes, but is not limited to, the following:
- Number of comm events per second
- Number of deploys per second
- Number of purse transfers per second (this is used only during development testing, and is measured under a known load)
- The average latency to consensus for a proposed block at the 0, 1/3, and 1 levels of consensus. Of these, the 1/3 level is deemed the most important.
- Ping times to neighbors
- The average latency of disk, network, etc.
- Overall system statistics such as memory usage and paging
- Statistics obtainable from /proc/self/*, the JVM, etc.
Who is Responsible?
Each development team knows the operation of their assigned subsystem. It isn’t reasonable for any single developer to understand the intricate details of the entire system. Therefore, it is incumbent on individual developers and teams to instrument their software in such a way as to provide the required data. If done correctly, this instrumentation should not adversely affect the performance, reliability, or operation of any specific subsystem.
Collection Pitfalls
Many of the desired metrics and statistics are rapidly varying. Many come in bursts of activity with idle times between. Others may be cyclical in nature.
Periodic Sampling
These types of data, when gathered by periodic sampling, are subject to the aliasing errors described by the Nyquist sampling theorem: a metric's peaks and valleys are missed when the samples happen to fall between them. Because these peaks and valleys can be very brief, merely increasing the sampling rate will not solve the problem. What is required is to collect the data in counters that are read at a known interval. The difference between two successive readings, divided by the interval, gives an accurate average event rate for that interval, and from these rates a moving average of the number of events over time can be determined.
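As a minimal sketch of the counter approach described above (all names here are hypothetical and are not part of RNode's diagnostics module), a monotonic counter can be read at two known times and the difference converted into a rate:

```scala
import java.util.concurrent.atomic.LongAdder

// Hypothetical names throughout; this is not RNode's diagnostics API.
final class EventCounter {
  private val total = new LongAdder

  // Called at the point in the code where the event happens.
  def increment(): Unit = total.increment()

  // Monotonic running total; it is never reset, so no event is lost between reads.
  def read(): Long = total.sum()
}

// One reading of the counter: the running total and the time it was taken.
final case class Reading(total: Long, atNanos: Long)

object Rate {
  // Average events per second between two readings of the same counter.
  def perSecond(earlier: Reading, later: Reading): Double = {
    val elapsedSeconds = (later.atNanos - earlier.atNanos) / 1e9
    require(elapsedSeconds > 0, "readings must be taken at increasing times")
    (later.total - earlier.total) / elapsedSeconds
  }
}
```

With this arrangement, a burst of 500 events followed by idle time still reads as 250 events per second over a 2-second window, whereas an instantaneous sample taken during the idle period would report no activity at all.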
Operation Duration
Another common type of metric involves the average length of time required to perform specific operations. This type of data can be obtained by time-stamping the beginning and end of the operation, taking the difference between timestamps, and calculating a moving average.
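The timestamp-difference technique can be sketched as follows. This is an illustration only: the class and method names are hypothetical, and an exponentially weighted moving average is assumed here as one possible kind of moving average, since the text does not specify which kind is intended.

```scala
// Exponentially weighted moving average of operation durations.
// `alpha` (0 < alpha <= 1) is the weight given to the newest measurement.
final class MovingAverage(alpha: Double) {
  require(alpha > 0.0 && alpha <= 1.0, "alpha must be in (0, 1]")
  private var average: Option[Double] = None

  def record(value: Double): Unit = synchronized {
    // The first value seeds the average; later values are blended in.
    average = Some(average.fold(value)(prev => alpha * value + (1 - alpha) * prev))
  }

  def current: Option[Double] = synchronized(average)
}

object Timing {
  // Runs `operation`, records its elapsed time in milliseconds, and returns its result.
  def timed[A](into: MovingAverage)(operation: => A): A = {
    val start = System.nanoTime() // monotonic clock, suitable for taking differences
    try operation
    finally into.record((System.nanoTime() - start) / 1e6)
  }
}
```

System.nanoTime is used rather than wall-clock time because it is not affected by clock adjustments, which matters when the quantity of interest is a difference between two timestamps.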
Care should be taken when recording time for asynchronous operations to associate the timer with the specific item to be timed. For example, when timing how long it takes to process a block, it makes no sense to record the start time for one block and relate it to the stop time for a different block that is being processed concurrently.
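One way to keep start and stop times paired with the right item is to key the timer by an identifier of the item itself. The helper below is a hypothetical sketch, not existing RNode code; the key could be a block hash or any other value that uniquely identifies the unit of work.

```scala
import scala.collection.concurrent.TrieMap

// Hypothetical helper: tracks in-flight operations by key so that a stop is
// always matched with the start of the same item, even under concurrency.
final class KeyedTimer[K] {
  private val startedAt = TrieMap.empty[K, Long]

  // Record the start time for this specific item.
  def start(key: K, nowNanos: Long = System.nanoTime()): Unit =
    startedAt.update(key, nowNanos)

  // Elapsed nanoseconds for this item, or None if it was never started
  // or has already been stopped.
  def stop(key: K, nowNanos: Long = System.nanoTime()): Option[Long] =
    startedAt.remove(key).map(begun => nowNanos - begun)
}
```

Because stop removes the entry, a duplicate or mismatched stop yields None instead of a misleading duration.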
Metrics Tools in RNode
RNode uses three third-party products for collecting, processing, and displaying metrics: Kamon, Prometheus, and Grafana, respectively.
Kamon
The interface to the Kamon library resides in node/diagnostics/package.scala. This module exposes methods for interacting with the Kamon metric types: counters, range samplers, gauges, histograms, and timers. NOTE: Timers are not yet implemented.
Individual metrics in Kamon are kept by name. When a call is made to record or modify a metric (e.g. increment), the metric is looked up by name, created if it does not yet exist, and then updated. Developers should take care to use names consistently in order to avoid “split” metrics, where the same events are recorded under more than one name.
- Counters -- The workhorse of the metrics system. Counters gather information on how many times some event occurs, or some piece of code is executed, during a period of time. They can show where most of the work is happening in the system.
- Range Samplers -- Not currently used in RNode
- Gauges -- Currently used to record numbers of peers (see comm/rp/Connect.scala and comm/discovery/KademliaNodeDiscovery.scala). In addition, a variety of information derived from the JVM is recorded using Gauges. These are collected in node/diagnostics/JvmMetrics.scala and use a wrapper called simply “g”.
- Histograms -- Currently used to record the amount of time to connect to a peer. See comm/rp/Connect.scala.
- Timers -- Not currently exposed in node/diagnostics/package.scala. Timers can record a start and stop time for an individual event.
- Spans -- Not currently exposed in node/diagnostics/package.scala. Spans are used to trace composite operations and the time it takes to perform them. Spans have the concept of parent and child spans.
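Because metrics are located by name, one way to avoid split metrics is to define each name once and refer to it everywhere. The sketch below uses a small in-memory registry as a stand-in for a name-keyed metrics library; the metric names and classes are hypothetical and are not taken from the RNode source.

```scala
// Hypothetical metric names, for illustration only.
object MetricNames {
  val PeerConnects = "peer-connects"
  val BlocksStored = "blocks-stored"
}

// Stand-in for a name-keyed metrics registry: a counter is created on first use.
final class CounterRegistry {
  private val counts =
    scala.collection.mutable.Map.empty[String, Long].withDefaultValue(0L)

  def increment(name: String): Unit = synchronized { counts(name) += 1 }

  // Immutable copy of all counters recorded so far.
  def snapshot: Map[String, Long] = synchronized(counts.toMap)
}
```

With string literals scattered through the code, a typo such as "peer_connects" silently creates a second counter. Referring to MetricNames.PeerConnects turns the same mistake into a compile error.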
Future Metrics Migration
As noted, Gauges are currently used in several places, notably for the JVM metrics. Because a Gauge is an instantaneous snapshot and suffers from the Nyquist sampling problems mentioned above, we intend to migrate these to other Kamon metric types. Some will migrate to Counters. Metrics that cannot be expressed as a counter will migrate to either Histograms or Timers. Support for Timers will be added to node/diagnostics/package.scala.
Adding Metrics
Counters
As mentioned above, most metrics should use Counters. Counters are used where we are interested in the number of events over time. If you are looking for a metric that shows how many times Event A happens during time T, then a Counter is what you should use. All that is required to add a counter is to call the diagnostics.metrics.incrementCounter method with an appropriate name at a strategic spot in your code. See examples in:
- casper/util/comm/ApproveBlockProtocol.scala
- comm/rp/Connect.scala
- blockstorage/*BlockStore.scala
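The following sketch illustrates where such a call might be placed. The Metrics trait is an assumed, simplified stand-in for the interface in node/diagnostics/package.scala (the real method may differ, for example in its parameters or return type), and the block store class and metric name are hypothetical.

```scala
// Assumed, simplified stand-in for the interface in node/diagnostics/package.scala.
trait Metrics {
  def incrementCounter(name: String): Unit
}

// In-memory implementation, useful for checking instrumentation in tests.
final class InMemoryMetrics extends Metrics {
  private var counts = Map.empty[String, Long]

  def incrementCounter(name: String): Unit = synchronized {
    counts = counts.updated(name, counts.getOrElse(name, 0L) + 1)
  }

  def count(name: String): Long = synchronized(counts.getOrElse(name, 0L))
}

// Hypothetical component that counts every block it stores.
final class BlockStoreSketch(metrics: Metrics) {
  private var blocks = Map.empty[String, Array[Byte]]

  def put(hash: String, bytes: Array[Byte]): Unit = synchronized {
    blocks = blocks.updated(hash, bytes)
    // Increment only after the work succeeded, so the counter reflects stored blocks.
    metrics.incrementCounter("block-store-put") // hypothetical metric name
  }
}
```

Passing the metrics interface in as a constructor argument keeps the instrumentation testable without a running metrics backend.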
Range Samplers
Not currently used in RNode.
Gauges
The use of Gauges in RNode is deprecated.
Histograms
Metrics that show quantities or current consumption of resources (e.g. memory or disk space) should use a Histogram. To add a metric that uses a Histogram, call the diagnostics.metrics.record method defined in node/diagnostics/package.scala. The recorded values are polled periodically and added to Prometheus. Most of the current uses of Gauges in node/diagnostics/JvmMetrics.scala will migrate to Histograms. See an example in comm/rp/Connect.scala.
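To illustrate why a histogram retains information that a gauge snapshot loses, here is a minimal fixed-bucket histogram. It is a sketch under stated assumptions: the class and bucket bounds are hypothetical, and Kamon's own histogram implementation is different and considerably more sophisticated.

```scala
// Minimal fixed-bucket histogram. The bounds would be chosen to suit the
// quantity being measured (bytes, milliseconds, and so on).
final class Histogram(upperBounds: Vector[Long]) {
  require(
    upperBounds.nonEmpty && upperBounds == upperBounds.sorted,
    "bounds must be non-empty and sorted"
  )

  // One slot per bound, plus a final overflow slot for values above the last bound.
  private val counts = new Array[Long](upperBounds.length + 1)

  def record(value: Long): Unit = synchronized {
    val i = upperBounds.indexWhere(value <= _)
    counts(if (i >= 0) i else upperBounds.length) += 1
  }

  // Count of recorded values per bucket, in bound order, with overflow last.
  def snapshot: Vector[Long] = synchronized(counts.toVector)
}
```

Every recorded value lands in some bucket, so a brief spike remains visible in the distribution even if no poll happened to coincide with it.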
Timers
Metrics that show how long some class of task takes to perform should use Timers. Support for them will be added to node/diagnostics/package.scala soon; see PR#1607. Examples of uses for Timers include the time to process a transaction or to achieve consensus on a block, ping times, and resource latency.
Spans
Not currently exposed in node/diagnostics/package.scala.
Prometheus
The Prometheus interface in RNode is created in node/diagnostics/NewPrometheusReporter.scala. Here, a singleton object is created, and a configuration file is read.
The default configuration file is docker/node/prometheus/prometheus.yml.
Grafana
Grafana is configured by some combination of the following files:
- docker/node/grafana/grafana.conf
- docker/node/grafana/provisioning/dashboards/dashboard.yml
- docker/node/grafana/provisioning/dashboards/genesis-metrics.json
- docker/node/grafana/provisioning/dashboards/rnode-metric-counters.json
- docker/node/grafana/provisioning/dashboards/rnode-metrics.json -- controls what metrics are displayed on the Grafana dashboard
- scripts/rnode-metric-counters-to-grafana-dash.sh