Centralized P2P Node Metrics - Out of scope

This is a work in progress, just an initial sketch to give a general idea of the direction.


While this is initially a centralized collection, it could fairly easily be pushed out and decentralized, to some degree, onto the RChain network; you could have as many metrics nodes as you want. We want collection to be simple, easy, performant, secure, and easily scalable. SQL is extremely nice for reports, something that ORMs or NoSQL data stores can convolute.

Peer Identification and Collection

All peer-to-peer (P2P) node instances are uniquely identified using asymmetric cryptography by creating a Curve25519 private/public key pair. A node's IP address can and will change, so asymmetric cryptography lets us track and validate the node over time as long as it uses the same private key. See http://pynacl.readthedocs.io/en/stable/public/ for an example of how. When a node registers, the public key is stored in a node registration data store, and all metrics communications from the node are validated against and associated with that public key. You could use a separate node key or a dedicated node-metrics key; you could also use a wallet or seed key. Whatever key is used, it needs to be sticky to that node. Historical data on a node is important because it could carry greater weight in the network: proof of time, proof of behavior over time, etc.
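
A minimal sketch of generating and persisting such a key pair with PyNaCl (the library linked above). The file name and the choice to base64-encode are assumptions for illustration, matching the base64-encoded public_key column sketched later:

# Sketch: generate a Curve25519 key pair for node identification using PyNaCl.
# The storage path and base64 encoding are assumptions, not a defined convention.
from nacl.public import PrivateKey
from nacl.encoding import Base64Encoder

# Generate the node's long-lived private key; it must stay "sticky" to the node.
private_key = PrivateKey.generate()
public_key = private_key.public_key

# Base64-encode both keys for storage/registration (the public_key column is a varchar).
private_b64 = private_key.encode(encoder=Base64Encoder).decode()
public_b64 = public_key.encode(encoder=Base64Encoder).decode()

# Persist the private key locally so the same identity is reused across restarts.
with open("node_metrics_key", "w") as f:
    f.write(private_b64)

print("public key to register:", public_b64)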

Registration

All peers will register with the metrics service. This info could be taken off a node if all public keys are available for view, or nodes could register separately with a unique node or node-metrics key. NGINX+LUA and Redis will be used for caching session info and protecting against denial-of-service attacks.
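
A hedged sketch of what registration could look like from the node side using the requests library; the /register path, host, and JSON field names are assumptions, not a defined API:

# Sketch: register a node's public key with the metrics service.
# The endpoint and JSON fields are hypothetical.
import requests

public_b64 = "BASE64_CURVE25519_PUBLIC_KEY"  # from the key-generation sketch above
registration = {"public_key": public_b64, "name": "my-node"}

resp = requests.post("https://node-stats.rchain.coop/register",
                     json=registration, timeout=10)
resp.raise_for_status()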

Pushing Data

Using HTTPS and a REST API that consumes and produces JSON is probably our best option. Some metrics and information could be pulled from an existing node in the network and displayed through an API or website, but most metrics will probably be pushed from nodes via HTTPS to a DNS-resolved host like node-stats.rchain.coop. There could even be a fallback hard-coded IP, as with the bootstrap node, or even the same IP. We are using HTTPS because we get its advantages: security, almost universal outbound allowance through firewalls, REST APIs, and the ability to do application load balancing and custom HTTP manipulation. While JSON isn't perfect, it is a good option: it is what many developers are familiar with, it's easier to read than XML, and it's fairly easy to modify in NGINX using LUA if needed.

Metrics could easily be enabled or disabled, and different levels of reporting could be set by the node operator. The default would be enabled, and even if we forced reporting on, users could still block it with a firewall. The metrics are for the operators' benefit, so they would be encouraged to keep them enabled. Right now I would just send them on a best-effort basis; you could also queue them locally and send them later (a rough sketch follows). This is where having a time_measured field in UTC is convenient, though it can't be depended on due to incorrect time configuration on the user's host or clock drift.
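
A sketch of a best-effort push with a UTC time_measured field and a simple local queue; the endpoint and payload fields (metric_id, value, time_measured) are assumptions based on the curl example and table sketches later in this page:

# Sketch: best-effort metric push with a UTC time_measured field and a local queue.
# The endpoint and field names are hypothetical.
from collections import deque
from datetime import datetime, timezone

import requests

STATS_URL = "https://node-stats.rchain.coop/node"   # hypothetical push endpoint
pending = deque()                                    # local queue for failed sends

def push_metric(metric_id, value):
    payload = {
        "metric_id": metric_id,
        "value": value,
        # Measured time in UTC; convenient but not dependable (host clock drift/misconfiguration).
        "time_measured": datetime.now(timezone.utc).isoformat(),
    }
    pending.append(payload)
    # Best effort: try to flush everything queued; keep whatever still fails for next time.
    still_pending = deque()
    while pending:
        item = pending.popleft()
        try:
            requests.post(STATS_URL, json=item, timeout=5).raise_for_status()
        except requests.RequestException:
            still_pending.append(item)
    pending.extend(still_pending)

push_metric(1, 40333)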

Access to Data

There will be command-line as well as web access to the data; HTTP with JSON makes this easy and nice. Node operators can get access to data if they have access to the node's private key. You could also make provisions so that anyone on the same IP address as the node has access to all of its data. More functionality and complexity can be added to the application later. RChain can make as much or as little data available on the web as it likes; it could be locked down to members, node operators, or the general public. Postgres allows for speed and flexibility in reporting using SQL. Reports could be expressed as views (https://www.postgresql.org/docs/current/static/sql-createview.html) and the data made accessible as JSON through the HTTP REST API for easy consumption by anything.
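
For example, if a reporting view were exposed through PostgREST, consuming it as JSON could look like the sketch below; the view name and host are assumptions:

# Sketch: read a hypothetical reporting view exposed by PostgREST as JSON.
import requests

# "node_uptime_report" is an assumed view name; PostgREST exposes views as REST resources.
resp = requests.get("https://node-stats.rchain.coop/node_uptime_report",
                    params={"limit": 10}, timeout=10)
resp.raise_for_status()
for row in resp.json():
    print(row)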

Use of Data

The data can be used in a wide variety of ways; network operation and optimization is one of the biggest use cases. Historical data over time would show which nodes are the "best" nodes. RChain, as well as node operators, can use this information to improve the node and the overall performance of the network. Security can play into this as well. Postgres triggers are a great way of generating notifications for events and then acting on them (a rough listener sketch follows). Nodes could also pull updated information on bad nodes. These functions could be included in the P2P network or broken out separately for a more modular approach. Most of these functions will probably/hopefully live in the P2P blockchain network at some point in the future, but that may be a while.
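
As a sketch of that trigger-driven idea: a trigger could NOTIFY on a channel and a small worker could LISTEN and act on it. The channel name and connection string here are assumptions, and psycopg2 is used even though it is not in the library list below:

# Sketch: listen for Postgres NOTIFY events raised by triggers and act on them.
# The channel name and connection string are hypothetical.
import select
import psycopg2
import psycopg2.extensions

conn = psycopg2.connect("dbname=node_metrics")
conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)
cur = conn.cursor()
cur.execute("LISTEN bad_node_events;")

while True:
    # Wait up to 60 seconds for a notification from a trigger.
    if select.select([conn], [], [], 60) == ([], [], []):
        continue
    conn.poll()
    while conn.notifies:
        notify = conn.notifies.pop(0)
        print("event:", notify.channel, notify.payload)  # e.g. publish an updated bad-node list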

Gathering Metrics on Node

Gathering metrics on the node can easily be done with any script in any language, as long as it runs alongside the node in Docker. You could use Scala to run checks from within rchain, or use an external script that runs a command against the REPL or collects info off the node host; you then just collect the values and push them up (a small sketch follows). I would make it easy for people to request a new metric_item so we can add it to the list and allow them to collect it. We'll have internal as well as external contributions; you want to create a space that gets others involved in the building.
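
A tiny sketch of the collect-and-push idea, using only the standard library for collection; the metric_id mapping and integer scaling are assumptions, and the push itself would reuse something like the push sketch earlier on this page:

# Sketch: collect a simple host metric (1-minute load average) and shape it as a metric payload.
import os
from datetime import datetime, timezone

load_1min, _, _ = os.getloadavg()
payload = {
    "metric_id": 1,                         # assumed id of a "load_1min" metric_item
    "value": int(load_1min * 100),          # stored as an integer, matching the value column
    "time_measured": datetime.now(timezone.utc).isoformat(),
}
print(payload)  # hand off to the push sketch above, or to curl as shown later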


Technologies

NGINX+LUA (OpenResty) - https://github.com/nginx/nginx, https://github.com/openresty/ - Used as an application delivery controller for load balancing, security, and payload prepping.

Redis - https://github.com/antirez/redis - Fast queries for access control and session information with NGINX and LUA to query the datastore. 

PostgREST - https://github.com/begriffs/postgrest, https://postgrest.com - Adds a REST interface onto Postgres, giving the benefits of both SQL and REST. Should perform better than just using Python Pyramid or Flask; use those when you need them, and use PostgREST until you do.

Postgres - https://www.postgresql.org/ - I really did try hard to look at and use other options than Postgres, but it really is always the best option. NoSQL can be nice, but you lose the benefits of SQL, which are especially good for reporting.

TimeScale - https://github.com/timescale/timescaledb, https://www.timescale.com/ - Time-series data storage via auto-sharding for Postgres. We can always use Cassandra or http://opentsdb.net/ if needed in the future. We won't even need this for a while, but it might be nice to explore its use.

JSON - https://en.wikipedia.org/wiki/JSON - While not perfect it is a really good choice compared to the other options.


Lua crypto - https://github.com/philanc/luatweetnacl, https://luarocks.org/modules/philanc/luatweetnacl

  • Python Libraries if or when needed
    • flask
    • pynacl
    • requests

Web Display Options

HighChartsJS - https://www.highcharts.com/ - Web Data Display

http://c3js.org/ - I like HighCharts better.

https://mdbootstrap.com/ - Get the advantages of Bootstrap 4 while getting the look of Material Design - https://material.io/guidelines/

https://bootswatch.com/cerulean/


A Redis table handles front-door access control and session information. This just saves PostgreSQL from load. You can cache instead, and that helps, but I personally like separating it; you can always change it in the future. Redis is dumb simple.

HSET your-node-public-key ipaddr "192.168.0.1" someothersessionvariable "blahlbah"
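
The same hash could be read or written from Python with redis-py if needed; a sketch, with the key and fields mirroring the HSET example above and a local Redis assumed:

# Sketch: read/write the front-door session hash with redis-py.
import redis

r = redis.Redis(host="localhost", port=6379)
r.hset("your-node-public-key", "ipaddr", "192.168.0.1")
r.hset("your-node-public-key", "someothersessionvariable", "blahlbah")
print(r.hgetall("your-node-public-key"))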

Table Structure

This is a quick sketch of table structure to give you a simplified idea of how data could be stored.


Note: On using id vs. table-name_id, there is a good discussion at https://softwareengineering.stackexchange.com/questions/114728/why-is-naming-a-tables-primary-key-column-id-considered-bad-practice. I've done both, mostly table-name_id, but I don't think it is that big of a deal; both have some advantages.

Also, over-normalization of tables can really make queries over-complicated and expensive. I have a tendency to lump information together if it is going to stay together. Start simple and normalize or break out into different tables as needed. Be clever, but not overly clever.

node

name | type | desc
id | sequenced big int | primary key
name | varchar(100) | Custom readable name for the node. Autogenerated at first; it may be customized by the node operator.
public_key | varchar(100) | base64-encoded Curve25519 public key - could use bytea but it can be annoying sometimes
ipaddr | inet | IP address of the current node. This could be IPv4 or IPv6. Might still split into two different fields.
status | varchar(1) or int | could be different statuses of the node: active, disabled. Could be an enum
type | varchar(1) or int | type of node - heavy, lite. Could be an enum
note | text | This field has always served me well for stashing information. A series of notes would be in a separate table.

metric_item

name | type | desc
id | serial |
name | | name of the metric
desc | |
note | |

This would use the Postgres TimeScale extension to implement a "hypertable", which is a table with auto-sharding.

metric_item_history

name | type | desc
id | sequenced integer (bigserial) |
node_id | bigint | foreign key to node
metric_item_id | | foreign key to metric_item
value | integer | value of the metric
received_timestamp | timestamp | time when the value was received
received_nanoseconds | integer | nanoseconds when the value was received
measured_timestamp | timestamp | timestamp at which the metric was measured on the host
measured_nanoseconds | integer | nanoseconds at which the metric was measured on the host

ipaddr_history

The IP history of a node can be useful, so we will store it. It can help in blocking out "bad-actor" IP addresses or ranges and protecting against attacks.

name | type | desc
id | bigserial |
node_id | bigint |
ipaddr | inet |
timestamp | timestamp |
We add the IP address and public key into the JSON via NGINX+LUA along with session info. You could embed them directly in the payload instead, and could also add the time measured in UTC.

data='{"metric_id": "1", "value": "40333"}'
cmd="curl https://stats.rchain.coop:xx443/node -X POST \
     -H \"Content-Type: application/json\" \
     -d '${data}'"
eval $cmd


   




Some Technologies Explored

python-eve with mongodb.

docker prometheus grafana

http://opentsdb.net/

rabbitmq

SQLAlchemy with Flask/Pyramid

influxdb

lots of lua/openresty related packages. 

Cassandra.

ZeroMQ



https://msgpack.org



Notes from meeting with Nash on 4/2

  • Fundamental requirements
    • People who operate a node need to be able to gather metrics on their node's operation. (easy problem)
      • This will include some access to metrics on the network
      • Prometheus is the tool the node currently integrates with
        • We may need more based on public need
          • We will make these decisions for additional connectors on a case-by-case basis
    • Get aggregated network metrics (hard problem)
  • Requirement - for test net have a node operating monitoring system that we deploy and document
    • Decision - This should be Prometheus
    • This will support others deploying and using something similar
    • This will not monitor the activity of the entire network
  • Requirement - monitoring system should pull data from the node, node does not push
    • Monitoring system should know where the nodes are 
  • Requirement - expose certain data in the node
    • 10-20 system metrics
      • CPU, RAM, system, network
    • 20 metrics related to JVM
  • Requirement - deliver a monad to developers so they can publish metrics
    • This is the monitoring interface - Kyle and Pawel need to discuss (can we do for April? or is it for May?)
    • Kamon may already do this
    • Needs to be functional and not side-effecting
    • Pawel, is Kamon monadic?
  • Requirement - write metrics code in Scala in the node
  • In April
    • Demo how Prometheus is doing the right thing, and document it so others can do the same
    • Need
  • Caution - Prometheus should not monitor the web server. It should monitor the node
    • Open question - Does Prometheus natively support protobuf?