Page Properties


Target release: Mercury
Epic: Monitoring and metrics for the node
Document status: In progress
Document owner
Designer
Developers
QA


...

  • Node operators will be responsible for setting up a node monitoring system.
  • Node operators will be able to use the available RChain documentation to interface with the node metrics API and pull the data needed for their monitoring system.

Requirements

# | Title | User Story | Importance | Notes | Status
1. Node emits metrics through the metrics API
User story: Node operators want access to any and all metrics related to node operation and performance.
Importance: Must have
Notes:
  • The node emits metrics
  • Expose extensive metrics on the system and JVM
  • At minimum the following metrics should be exposed:
    • CPU
    • RAM
    • Disk
    • Network core metrics at the core level
    • JVM performance
    • Garbage collection
    • Size of memory pools
    • Consumption of memory pools
  • Monitoring systems can pull metrics from the node
  • The node does not push metrics
  • The node does not report on itself
Status: Metrics emitted on HTTP port 40403. Complete (metrics emitted via API).
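Requirement 1 describes a pull model: the monitoring system asks the node for metrics, and the node never pushes. A minimal sketch of such a pull is shown here. Only port 40403 comes from this page; the host, the `/metrics` path, the Prometheus-style text format, and the function names are assumptions made for illustration.

```python
import urllib.request

# Assumption: host and path are placeholders. Only the port (40403) is stated on this page.
METRICS_URL = "http://localhost:40403/metrics"


def parse_metrics(text):
    """Parse Prometheus-style text exposition lines into {sample_name: value}.

    Assumes one sample per line: a name (optionally with {labels}),
    a value, and an optional trailing timestamp, which is ignored.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "}" in line:
            # Labels may contain spaces, so split after the closing brace.
            name, _, rest = line.rpartition("}")
            name += "}"
        else:
            name, _, rest = line.partition(" ")
        fields = rest.split()
        if not fields:
            continue
        try:
            samples[name] = float(fields[0])
        except ValueError:
            continue  # ignore lines that do not carry a numeric value
    return samples


def pull_metrics(url=METRICS_URL, timeout=5.0):
    """Fetch the metrics page once and return the parsed samples."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_metrics(response.read().decode("utf-8"))
```

A monitoring job would call `pull_metrics()` on its own schedule, which keeps the node free of any knowledge about who is watching it.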
2. Node measures COMM events
User story: COMM events are a measure of an RChain transaction. The node must report how many raw COMM events are being processed, so that we can demonstrate 40K COMM events per second.
Importance: Must have
Notes:
  • Create a counter of COMM events as a total count of events in the last hour. Rita Allen to confirm with SRE whether the metric should reset every hour.
  • COMM events should be measured for Propose (block creation) and Replay RSpace.
Status: Rita Allen to validate the requirement and see what has been implemented.
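Requirement 2 asks for a total count of COMM events in the last hour. One way to read that is a rolling window, sketched here with per-second buckets so that memory stays bounded even at tens of thousands of events per second. The class name, the one-second resolution, and the injectable clock are assumptions for illustration; this is not the node's implementation, and a counter that resets on the hour would look different.

```python
import time
from collections import deque


class SlidingWindowCounter:
    """Counts events over a rolling window using one bucket per second."""

    def __init__(self, window_seconds=3600.0, clock=time.monotonic):
        self._window = window_seconds
        self._clock = clock          # injectable so the counter can be tested
        self._buckets = deque()      # [second, count] pairs, oldest first

    def record(self, count=1):
        """Add `count` events to the bucket for the current second."""
        now = self._clock()
        second = int(now)
        if self._buckets and self._buckets[-1][0] == second:
            self._buckets[-1][1] += count
        else:
            self._buckets.append([second, count])
        self._evict(now)

    def _evict(self, now):
        # Drop buckets that fall outside the window (one-second resolution).
        cutoff = now - self._window
        while self._buckets and self._buckets[0][0] <= cutoff:
            self._buckets.popleft()

    def total(self):
        """Return the number of events recorded within the window."""
        self._evict(self._clock())
        return sum(count for _, count in self._buckets)

    def rate_per_second(self):
        """Average events per second over the full window."""
        return self.total() / self._window
```

Calling `record(n)` once per processed batch rather than once per event keeps the overhead low; `rate_per_second()` is the figure to compare against the 40K target.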
3. Node reports on CPU utilization
User story: The node reports on the percentage of CPU that is being consumed.
Importance: Must have
Notes:
  • Report on the percentage of CPU being consumed at the time the request for metrics is being made.
Status: Rita Allen, please confirm the status.
4. Node reports on total RAM consumed
User story: The node reports on the amount of RAM being consumed.
Importance: Must have
Notes:
  • Report on the total amount of RAM being consumed at the time the request for metrics is being made
Status: Rita Allen, please confirm the status.
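Requirements 3 and 4 ask for CPU percentage and RAM consumption at the time of the metrics request. A CPU percentage is always computed over an interval, so a reporter needs two samples. The sketch below shows the arithmetic on Linux, using the text of `/proc/stat` and `/proc/meminfo`. It is an illustration only: the function names are hypothetical, it is Linux-specific, and it does not show how the node itself collects these values.

```python
def parse_cpu_times(stat_text):
    """Return (idle, total) jiffies from the aggregate 'cpu' line of /proc/stat."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            values = [int(v) for v in fields[1:]]
            # Fields: user nice system idle iowait irq softirq steal [guest ...]
            # Guest time is already included in user/nice, so only sum the first 8.
            idle = values[3] + (values[4] if len(values) > 4 else 0)
            return idle, sum(values[:8])
    raise ValueError("no aggregate cpu line found")


def cpu_percent(before, after):
    """CPU utilisation in percent between two (idle, total) samples."""
    idle_delta = after[0] - before[0]
    total_delta = after[1] - before[1]
    if total_delta <= 0:
        return 0.0
    return 100.0 * (1.0 - idle_delta / total_delta)


def ram_used_bytes(meminfo_text):
    """RAM in use, computed as MemTotal - MemAvailable from /proc/meminfo."""
    wanted = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if key in ("MemTotal", "MemAvailable"):
            wanted[key] = int(rest.split()[0]) * 1024  # values are reported in kB
    return wanted["MemTotal"] - wanted["MemAvailable"]
```

In practice the reporter would keep the previous CPU sample and compute the percentage against it whenever a metrics request arrives.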
5. Deploy count
User story: The node reports on the total number of deployments received.
Importance: Must have
Notes:
  • Report on the total number of times the deploy API receives a request.
Status: Needs to be implemented. Should have a ticket. Rita Allen to add the ticket number here.
6. Blocks proposed
User story: A validating node proposes blocks.
Importance: Must have
Notes:
  • Total blocks proposed by the node in the past hour. The metric should reset every hour. Rita Allen to confirm.
Status: Needs to be implemented. Should have a ticket. Rita Allen to add the ticket number here.
7. Fork choice tip
User story: For something like Ethstats.net, it would be good to show the current fork choice tip for each node.
Importance: Nice to have
Notes:
  • Show the hash of the block that is the current fork choice tip.

8. Blocks being processed?



9. Time since last block?



10. Demonstration of a node monitoring system
User story: Node operators will have various needs and expectations for their node monitoring system. The RChain node will not dictate the system they use. However, at launch of the test net there will be an example node monitoring system to share, along with documentation in case a node operator wants to reproduce it.
Importance: Must have
Notes:
  • Create a Pyrofex node monitoring system
  • Document the system so others can create something similar.
  • This system will be created using Prometheus and Grafana
  • Needed at time of launch of test net
3

11. Documentation on how to export metrics using Prometheus
User story: Node operators need to know how to access the Prometheus integration in the node.
Importance: Must have
4
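For requirement 11, the Prometheus side of the integration usually comes down to a scrape job that points at the node. A minimal `prometheus.yml` fragment is sketched here. Only port 40403 comes from this page; the job name, interval, host, and metrics path are assumptions and would need to be checked against the node's actual Prometheus endpoint.

```yaml
scrape_configs:
  - job_name: "rnode"              # hypothetical job name
    scrape_interval: 15s           # illustrative value
    metrics_path: /metrics         # Prometheus default; the node's actual path is an assumption
    static_configs:
      - targets: ["localhost:40403"]   # port from requirement 1; host is a placeholder
```

Because Prometheus pulls on its own schedule, this matches the requirement that the node does not push metrics.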


12. Documentation on how to export metrics from the metrics API using Scala
User story: Node operators need to know how to interface with the metrics API. Developers need to know how to interface with the metrics API when integrating features with the node.
Importance: Must have
Notes:
  • Pawel to help create a template
    • Jeremy to use template to create export for required metrics listed above
5

13. Metrics acceptability testing
User story: Validate that exposure and pull of metrics work using both Prometheus and the metrics API.
Importance: Must have
Notes:
  • Acceptable when:
    • Node comes up
    • Node emits metrics
    • Metrics are scrapable through the metrics API
    • Metrics are scrapable with Prometheus
    • Metrics pulled through the metrics API and Prometheus match
6
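The last acceptance criterion in requirement 13 (values pulled through the metrics API and through Prometheus match) can be checked mechanically once both sources are reduced to name/value mappings. A minimal sketch follows; the function name and the optional tolerance are assumptions, and how each source is turned into a dictionary is left out.

```python
import math


def diff_metrics(api_samples, prometheus_samples, rel_tol=0.0):
    """Return {name: (api_value, prometheus_value)} for every mismatch.

    A metric missing from one side is reported with None on that side.
    With rel_tol=0.0 the values must be exactly equal.
    """
    mismatches = {}
    for name in sorted(set(api_samples) | set(prometheus_samples)):
        a = api_samples.get(name)
        b = prometheus_samples.get(name)
        if a is None or b is None or not math.isclose(a, b, rel_tol=rel_tol, abs_tol=0.0):
            mismatches[name] = (a, b)
    return mismatches
```

Gauges such as CPU or memory can legitimately differ when the two pulls happen at slightly different moments, so a small `rel_tol` may be needed for those, while counters read from a quiescent node should match exactly. The test passes when the returned dictionary is empty.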

14. Integration with Docker Compose
User story: Some node operators may want the option to monitor the node using Docker Compose.
Importance: Optional
Notes:
  • Integrate the node with Docker Compose
  • Maintain the integration over time
  • Document the integration (this is already available)

Traceability matrix

Jira issues (System JIRA): project = CORE AND "Epic Link" = CORE-195 ORDER BY created DESC

...