Node metrics specification

Target release

Mercury

Epic

Monitoring and metrics for the node

Document status

in progress

Document owner

@Medha Parlikar (Unlicensed)

Designer

Developers

QA

Goals

  • The node can emit metrics

  • Node operators are able to pull metrics from the node through the metrics API or Prometheus integration

  • Demonstration of a node monitoring system at launch of test net

Background and strategic fit

Node operators are fairly similar to site reliability engineers.  They will want to know how their nodes are operating and performing.  To that end, the node can emit metrics and gives node operators a way to pull them.  The node will not generate metrics. It will emit raw numbers from the system. Node operators will be able to access metrics through the metrics API or through Prometheus integration. Node operators can use these interfaces to create relevant visualizations of metrics related to node performance.

Assumptions

  • Node operators will be responsible for setting up a node monitoring system.

  • Node operators will be able to use available RChain documentation to interface with the node metrics API to pull data needed for their monitoring system.

Requirements

Title

User Story

Importance

Notes

Status 

1

Node emits metrics through the metrics API

Node operators want access to any and all metrics related to node operation and performance.

Must have

  • The node emits metrics



  • Expose extensive metrics on the system and JVM

  • At minimum the following metrics should be exposed:

    • CPU

    • RAM

    • Disk

    • Network metrics at the core level

    • JVM performance

    • Garbage collection

    • Size of memory pools

    • Consumption of memory pools

  • Monitoring systems can pull metrics from the node

  • The node does not push metrics

  • The node does not report on itself

Metrics are emitted on HTTP port 40403.  Complete (metrics emitted via API)
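As an illustration of the pull model in requirement 1, the following is a minimal sketch of a client that scrapes the node over HTTP and parses the response. It assumes the node serves the Prometheus text exposition format; the `/metrics` path, the `localhost` host, and the metric names in the usage note are hypothetical and not confirmed by this document. Only port 40403 comes from the requirement.

```python
from typing import Dict
from urllib.request import urlopen

# Assumption: host and path are hypothetical; only port 40403 is from the spec.
METRICS_URL = "http://localhost:40403/metrics"


def parse_metrics(text: str) -> Dict[str, float]:
    """Parse Prometheus-style sample lines into {sample name: value}.

    Comment lines (# HELP, # TYPE) and blank lines are skipped. An optional
    trailing timestamp after the value is ignored.
    """
    samples: Dict[str, float] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if "}" in line:
            # Labelled sample: keep name and labels together as the key.
            end = line.rfind("}")
            name, rest = line[: end + 1], line[end + 1 :].split()
        else:
            parts = line.split()
            name, rest = parts[0], parts[1:]
        if rest:
            samples[name] = float(rest[0])
    return samples


def pull_metrics(url: str = METRICS_URL, timeout: float = 5.0) -> Dict[str, float]:
    """Pull one snapshot from the node; the node never pushes."""
    with urlopen(url, timeout=timeout) as response:
        return parse_metrics(response.read().decode("utf-8"))
```

A monitoring system would call `pull_metrics()` on its own schedule, which keeps the node free of any push logic. `parse_metrics` can be exercised without a running node, for example on the text `process_cpu_usage 0.25` (a made-up metric name).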

2

Node measures COMM Events

COMM events are a measure of an RChain transaction.  The node must report how many raw COMM events are being processed, so we can demonstrate 40K COMM events/second.

Must Have

  • Create a counter of COMM events as a total count of events in the last hour.  @Rita Allen to confirm with SRE whether the metric should reset every hour.

  • COMM events should be measured for Propose (block creation) and Replay RSpace.

@Rita Allen to validate the requirement, see what has been implemented.
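One possible reading of the "last hour" counter in requirement 2 is a rolling window, sketched below. This is an assumption, since the reset behaviour is still to be confirmed; the class and method names are hypothetical and do not describe what has been implemented in the node. Events are recorded in batches with a caller-supplied timestamp, because storing 40K individual events per second would be wasteful.

```python
from collections import deque
from typing import Deque, Tuple


class RollingCounter:
    """Counts events seen within the last `window_seconds` (default: 1 hour)."""

    def __init__(self, window_seconds: float = 3600.0) -> None:
        self.window = window_seconds
        self._batches: Deque[Tuple[float, int]] = deque()  # (timestamp, count)
        self._total = 0

    def record(self, now: float, count: int = 1) -> None:
        """Record `count` events observed at time `now` (seconds)."""
        self._batches.append((now, count))
        self._total += count
        self._evict(now)

    def total(self, now: float) -> int:
        """Total events in the window ending at `now`."""
        self._evict(now)
        return self._total

    def rate_per_second(self, now: float) -> float:
        """Average events per second over the whole window."""
        return self.total(now) / self.window

    def _evict(self, now: float) -> None:
        # Drop batches that are a full window old or older.
        cutoff = now - self.window
        while self._batches and self._batches[0][0] <= cutoff:
            _, count = self._batches.popleft()
            self._total -= count
```

Separate instances could be kept for Propose and for Replay RSpace so the two sources are reported independently. If SRE prefers a counter that resets on the hour, a single integer cleared by a timer would replace the deque.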

3

Node reports on CPU Utilization

The Node reports on the percentage of CPU that is being consumed. 

Must Have

  • Report on the percentage of CPU being consumed at the time the request for metrics is being made.

@Rita Allen Please confirm the status

4

Node reports on total RAM consumed

The node reports on the amount of RAM being consumed

Must Have

  • Report on the total amount of RAM being consumed at the time the request for metrics is being made

@Rita Allen Please confirm the status

5

Deploy Count

The node reports on the total number of deployments received

Must Have

  • Report on the total number of times the deploy API receives a request.

RHOL-924

6

Deploys since last propose

The node reports the total number of deploys received since the last propose

Nice to have

  • Report on the total number of deploys received via the deploy API since the last successful propose

RHOL-913
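Requirements 5 and 6 describe two counters driven by the same event, which can be sketched as follows. The class, method, and attribute names are hypothetical and are not taken from the node's code; the sketch only shows that the lifetime total never resets, while the second counter resets on a successful propose.

```python
import threading


class DeployCounters:
    """Tracks total deploy requests and deploys since the last successful propose."""

    def __init__(self) -> None:
        self._lock = threading.Lock()  # deploys may arrive concurrently
        self.total_deploys = 0
        self.deploys_since_last_propose = 0

    def on_deploy_request(self) -> None:
        """Call once each time the deploy API receives a request."""
        with self._lock:
            self.total_deploys += 1
            self.deploys_since_last_propose += 1

    def on_propose(self, succeeded: bool) -> None:
        """Reset the per-propose counter only when the propose succeeded."""
        with self._lock:
            if succeeded:
                self.deploys_since_last_propose = 0
```

A failed propose leaves both counters untouched, which matches the wording "since the last successful propose".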

7

Blocks proposed

A validating node proposes blocks

Must Have

  • Total blocks proposed by the node in the past hour.  The metric should reset every hour.

CORE-1298

8

Fork Choice Tip

For something like Ethstats.net, it would be good to show the current fork choice tip for each node

Nice to have

  • Show the hash of the block that is the current fork choice tip.



9

Blocks being processed?







CORE-1300

10

Demonstration of a node monitoring system

Node operators will have various needs and expectations for their node monitoring system. The RChain node will not dictate the system they use. However, at launch of test net there will be an example node monitoring system to share, along with documentation in the event a node operator wants to reproduce it.

Must have

  • Create a Pyrofex node monitoring system

  • Document the system so others can create something similar.

  • This system will be created using Prometheus and Grafana

  • Needed at time of launch of test net



11

Documentation on how to export metrics using Prometheus

Node operators need to know how to access the Prometheus integration in the node.

Must have





12

Documentation on how to export metrics from metrics API using Scala

Node operators need to know how to interface with the metrics API. 

Developers need to know how to interface with the metrics API when integrating features with the node.

Must have

  • Pawel to help create a template

    • Jeremy to use template to create export for required metrics listed above

In progress. First draft at RChain RNode Metrics - Node Operator

13

Metrics acceptability testing

Validate exposure and pull of metrics works using both Prometheus and the metrics API

Must have 

  • Acceptable when:

    • Node comes up

    • Node emits metrics

    • Metrics are scrapable through metric API 

    • Metrics are scrapable with Prometheus

    • Metrics pulled through metric API and Prometheus match
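The final acceptance criterion, that both pull paths return matching values, could be checked with a comparison along these lines. This is a sketch under assumptions: both snapshots have already been reduced to a mapping of metric name to number, and the function name and the optional tolerance are hypothetical additions. The tolerance is there because two scrapes taken at slightly different moments may legitimately differ.

```python
import math
from typing import Dict, Optional, Tuple

Mismatch = Tuple[Optional[float], Optional[float]]


def diff_snapshots(
    api: Dict[str, float],
    prometheus: Dict[str, float],
    rel_tol: float = 0.0,
) -> Dict[str, Mismatch]:
    """Return metrics that are missing on one side or differ beyond rel_tol.

    The result maps metric name to (api value, prometheus value); None marks
    a metric absent from that side. An empty result means the snapshots match.
    """
    mismatches: Dict[str, Mismatch] = {}
    for name in sorted(set(api) | set(prometheus)):
        a, p = api.get(name), prometheus.get(name)
        if a is None or p is None or not math.isclose(a, p, rel_tol=rel_tol):
            mismatches[name] = (a, p)
    return mismatches
```

With the default `rel_tol=0.0` the check demands exact equality, which suits counters read from a quiescent node; gauges such as CPU or RAM would likely need a non-zero tolerance.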



14

Integration with Docker compose

Some node operators may want the option to monitor the node using Docker compose

Optional

  • Integrate node with Docker compose

  • Maintain the integration over time

  • Document the integration - this is already available.

Done 

Traceability matrix


User interaction and design

Questions

Below is a list of questions to be addressed as a result of this requirements document:

Question

Outcome

What if node operators want a specific type of metrics interface?

Initially we will offer Prometheus integration. It is a widely accepted monitoring interface that will be successful at scale. We do not want to be in the business of customizing the RChain node for the needs of all users. Requests, however, will be considered on a case-by-case basis, for example, if a large or significant end user wants integration with a monitoring tool other than Prometheus and the metrics API does not suffice.

Why don't we want the node to push metrics on performance?

This adds complexity and will potentially reduce node performance. The best practice is to expose raw data from the node and allow users to pull or scrape the data.

Why don't we primarily offer metrics through Docker compose?

Running a node monitoring system using Docker containers is one way. For some node operators who have limited programming experience, this may be the best way. For more experienced end-users, or developers building on the RChain platform, this is less likely to be the preferred way. 

There is also a risk related to maintaining successful integration with Docker over time. 

For these reasons, the best solution is to natively expose metrics through the metrics API for users to pull raw data. They can then choose the method that is best for them.

RNode-0.2 offers a Docker integration for metrics. Because of the maintenance issue, it may not always be a supported option for monitoring. 

Notes from Meeting:

  • Metrics are expensive - Not expensive per call.  But eventually it adds up.  Then you start pushing blocks of data - and it gets expensive.  Recommend that you start and stop metrics.  Over time these can get expensive.

  • Chris loves Prometheus for stuff.  Much simpler and great local visualization tools.   No histograms in Prometheus.  Allow Prometheus to scrape after we turn on metrics

  • Also loves the tools built around StatsD.  It is the right choice if you want to analyze your metrics offsite.

Not Doing