The RChain RNode software is a complex distributed system. It is desirable that node operators manage, maintain, and tune their individual nodes for best reliability and efficiency. To do this, they require data on how various aspects of the node are functioning. Only by analyzing and visualizing this data can a node operator make educated decisions on how to deploy their software and hardware resources.
The data required from the RNode software includes, but is not limited to, the following:
Each development team knows the operation of their assigned subsystem. It isn’t reasonable for any single developer to understand the intricate details of the entire system. Therefore, it is incumbent on individual developers and teams to instrument their software in such a way as to provide the required data. If done correctly, this instrumentation should not adversely affect the performance, reliability, or operation of any specific subsystem.
Many of the desired metrics and statistics are rapidly varying. Many come in bursts of activity with idle times between. Others may be cyclical in nature.
These types of data, when gathered by periodic sampling, are subject to aliasing, the sampling error described by the Nyquist criterion. A given metric may have peaks and troughs that are missed because the samples happen to fall between them. Because these peaks and troughs may occur very rapidly, merely increasing the sampling rate will not solve the problem. What is required is to record every event in a counter and to read that counter at known intervals. The difference between two successive readings is the exact number of events in that interval, so a moving average of the number of events over time can be accurately determined.
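The counter-based approach described above can be sketched as follows. This is a minimal illustration using only the Java standard library, not RNode's or Kamon's actual implementation; the names EventCounter and ratePerSecond are hypothetical.

```scala
import java.util.concurrent.atomic.LongAdder

// Hypothetical helper, not part of RNode or Kamon: a monotonically
// increasing event counter that is only ever incremented, never reset.
final class EventCounter {
  private val total = new LongAdder

  // Called once per event; LongAdder keeps this cheap under contention.
  def increment(): Unit = total.increment()

  // Cumulative number of events recorded so far.
  def read(): Long = total.sum()
}

object EventCounter {
  // Events per second between two readings taken intervalMillis apart.
  def ratePerSecond(previous: Long, current: Long, intervalMillis: Long): Double = {
    require(intervalMillis > 0, "interval must be positive")
    (current - previous).toDouble * 1000.0 / intervalMillis
  }
}

object CounterDemo {
  def main(args: Array[String]): Unit = {
    val blocksReceived = new EventCounter
    val before = blocksReceived.read()
    // A burst of 50 events that a point-in-time sample could miss entirely.
    (1 to 50).foreach(_ => blocksReceived.increment())
    val after = blocksReceived.read()
    // Assume, for illustration, that the two readings were taken 10 seconds apart.
    println(EventCounter.ratePerSecond(before, after, 10000L))
  }
}
```

Because the counter accumulates every event, the burst is fully reflected in the difference between the two readings regardless of when within the interval it occurred.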
Another common type of metric involves the average length of time required to perform specific operations. This type of data can be obtained by time-stamping the beginning and end of the operation, taking the difference between timestamps, and calculating a moving average.
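A minimal sketch of this timestamp-and-average technique, assuming a simple fixed-size window for the moving average. MovingAverage and Timing.timed are hypothetical names introduced for illustration and are not part of RNode or Kamon.

```scala
import scala.collection.mutable

// Hypothetical helper: keeps the most recent `window` durations
// and reports their arithmetic mean.
final class MovingAverage(window: Int) {
  require(window > 0, "window must be positive")
  private val samples = mutable.Queue.empty[Long]
  private var sum = 0L

  // Adds a duration and evicts the oldest one once the window is full.
  def record(durationNanos: Long): Unit = synchronized {
    samples.enqueue(durationNanos)
    sum += durationNanos
    if (samples.size > window) sum -= samples.dequeue()
  }

  // Mean of the durations currently in the window; 0.0 if none recorded.
  def average: Double = synchronized {
    if (samples.isEmpty) 0.0 else sum.toDouble / samples.size
  }
}

object Timing {
  // Runs `op`, records its elapsed time, and returns its result.
  // System.nanoTime is used because it is monotonic, unlike wall-clock time.
  def timed[A](avg: MovingAverage)(op: => A): A = {
    val start = System.nanoTime()
    try op
    finally avg.record(System.nanoTime() - start)
  }
}
```

A caller would wrap the operation of interest, for example `Timing.timed(avg) { doWork() }` where `doWork` is whatever is being measured, and read `avg.average` whenever the metric is reported.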
Care should be taken when recording time for asynchronous operations to associate the timer with the specific item to be timed. For example, when timing how long it takes to process a block, it makes no sense to record the start time for one block and relate it to the stop time for a different block that is being processed concurrently.
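One way to keep start and stop times paired correctly is to key each running timer by the item being timed. The sketch below assumes each block can be identified by some key such as its hash; KeyedTimer is a hypothetical helper, not an RNode or Kamon class.

```scala
import java.util.concurrent.ConcurrentHashMap

// Hypothetical helper: start times are stored per key, so concurrent
// operations on different items can never be confused with one another.
final class KeyedTimer[K <: AnyRef] {
  private val startTimes = new ConcurrentHashMap[K, java.lang.Long]()

  // Records the start time for this specific item.
  def start(key: K, nowNanos: Long = System.nanoTime()): Unit = {
    startTimes.put(key, Long.box(nowNanos))
    ()
  }

  // Returns the elapsed nanoseconds for this item and forgets it,
  // or None if start was never called for this key.
  def stop(key: K, nowNanos: Long = System.nanoTime()): Option[Long] =
    Option(startTimes.remove(key)).map(started => nowNanos - started.longValue())
}

object KeyedTimerDemo {
  def main(args: Array[String]): Unit = {
    val timer = new KeyedTimer[String]
    // Two blocks processed concurrently; explicit times make the demo deterministic.
    timer.start("block-a", 0L)
    timer.start("block-b", 5L)
    println(timer.stop("block-a", 100L)) // elapsed time for block-a only
    println(timer.stop("block-b", 105L)) // elapsed time for block-b only
    println(timer.stop("block-c", 200L)) // never started
  }
}
```

Removing the entry on stop also keeps the map from growing without bound, although entries for operations that never finish would still need separate cleanup.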
RNode uses three third-party products for collecting, processing, and displaying metrics: Kamon, Prometheus, and Grafana, respectively.
The interface to the Kamon library resides in node/diagnostics/package.scala. This module exposes methods for interacting with the Kamon metric types: Counters, Range Samplers, Gauges, Histograms, and Timers. NOTE: Timers are not yet implemented.
Individual metrics in Kamon are kept by name. When a call is made to record or modify a metric (e.g. increment), the metric is located by name, or created if it does not yet exist, and then updated. Developers should take care to use names consistently in order to avoid “split” metrics, where the same events are recorded under more than one name.
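One common way to avoid split metrics is to define each name once as a constant and refer to the constant everywhere. The sketch below uses a stand-in registry to show the effect; MetricNames, CounterRegistry, and the metric name itself are hypothetical and do not reflect RNode's actual names or the Kamon API.

```scala
import scala.collection.mutable

// Hypothetical name, for illustration only.
object MetricNames {
  val BlocksReceived = "blocks-received"
}

// Stand-in for a name-keyed metric store: a counter is created
// on first use and looked up by name on every later call.
final class CounterRegistry {
  private val counters = mutable.Map.empty[String, Long].withDefaultValue(0L)

  def increment(name: String): Unit = synchronized { counters(name) += 1 }
  def value(name: String): Long = synchronized { counters(name) }
  def names: Set[String] = synchronized { counters.keySet.toSet }
}

object SplitMetricDemo {
  def main(args: Array[String]): Unit = {
    // Inconsistent literals silently create two separate metrics.
    val split = new CounterRegistry
    split.increment("blocks-received")
    split.increment("blocksReceived")
    println(split.names.size) // two names for one kind of event

    // A shared constant keeps every event under a single name.
    val consistent = new CounterRegistry
    consistent.increment(MetricNames.BlocksReceived)
    consistent.increment(MetricNames.BlocksReceived)
    println(consistent.value(MetricNames.BlocksReceived))
  }
}
```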
As noted, Gauges are currently used in several places, notably the JVM metrics. Because a Gauge is an instantaneous snapshot and suffers from the Nyquist sampling problems mentioned above, we intend to migrate these to other Kamon metric types. Some will migrate to Counters. Metrics that cannot be expressed as a Counter will migrate to either Histograms or Timers. Support for Timers will be added to node/diagnostics/package.scala.
As mentioned above, most metrics should use Counters. Counters are used in cases where we are interested in the number of events over time. If you are looking for a metric that shows how many times Event A happens during time T, then a Counter is what you should use. All that is required to add a Counter is to call the diagnostics.metrics.incrementCounter method with an appropriate name at a strategic spot in your code. See examples in
Not currently used in RNode.
The use of Gauges in RNode is deprecated.
Metrics that show quantities or current consumption of resources, e.g. memory or disk space, should use a Histogram. To add a metric that uses a Histogram, call the diagnostics.metrics.record method in node/diagnostics/package.scala. These metrics are polled periodically and added to Prometheus. Most of the current uses of Gauges in node/diagnostics/JvmMetrics.scala will migrate to Histograms. See an example in comm/rp/Connect.scala.
Metrics that show how long some class of task takes to perform should use Timers. Support for them will be added in node/diagnostics/package.scala soon. See PR#1607. Examples of the use of Timers are time to process a transaction, achieve consensus on a block, ping times, or resource latency.
The Prometheus interface in RNode is created in node/diagnostics/NewPrometheusReporter.scala. Here, a singleton object is created, and a configuration file is read.
The default configuration file is docker/node/prometheus/prometheus.yml.
Configured by some combination of
controls what metrics are displayed on the Grafana dashboard