2018-11-13 meeting notes: metrics implementation

Date

11 Oct 2018

Attendees

Goals

By the end of this meeting we need a decision on how to expose required metrics without causing out of memory errors.

Discussion items

Node metrics specification
Kamon provides several key metrics on JVM memory use and garbage collection: https://kamon.io/documentation/kamon-system-metrics/0.6.6/overview/

These are worth pushing into Grafana:

heap-memory
heap-used: a gauge tracking the amount of heap memory currently being used in bytes.
heap-max: a gauge tracking the maximum amount of heap memory that can be used in bytes.
heap-committed: a gauge tracking the amount of memory that is committed for the JVM to use in bytes.

non-heap-memory
non-heap-used: a gauge tracking the amount of non-heap memory currently being used in bytes.
non-heap-max: a gauge tracking the maximum amount of non-heap memory that can be used in bytes.
non-heap-committed: a gauge tracking the amount of non-heap memory that is committed for the JVM to use in bytes.

garbage-collection-count: a gauge tracking the number of garbage collections that have occurred.
garbage-collection-time: a gauge tracking the time spent in garbage collections, measured in milliseconds.
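
For reference, the same values can be read directly from the JVM's standard management beans, independently of Kamon. The following is a minimal sketch, not how Kamon or RNode collect them: the class name JvmMemorySnapshot is hypothetical, and summing the count and time across all collectors is an assumption made here for brevity.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: reads the gauges listed above once, using only the JDK.
final class JvmMemorySnapshot {

    static Map<String, Long> read() {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = memory.getHeapMemoryUsage();
        MemoryUsage nonHeap = memory.getNonHeapMemoryUsage();

        long gcCount = 0;
        long gcTimeMillis = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Both getters return -1 when the collector does not report the value.
            gcCount += Math.max(0, gc.getCollectionCount());
            gcTimeMillis += Math.max(0, gc.getCollectionTime());
        }

        Map<String, Long> metrics = new LinkedHashMap<>();
        metrics.put("heap-used", heap.getUsed());
        metrics.put("heap-max", heap.getMax());               // -1 if undefined
        metrics.put("heap-committed", heap.getCommitted());
        metrics.put("non-heap-used", nonHeap.getUsed());
        metrics.put("non-heap-max", nonHeap.getMax());         // -1 if undefined
        metrics.put("non-heap-committed", nonHeap.getCommitted());
        metrics.put("garbage-collection-count", gcCount);
        metrics.put("garbage-collection-time", gcTimeMillis);  // milliseconds
        return metrics;
    }

    public static void main(String[] args) {
        read().forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

Each call produces eight numbers, so the memory cost of these particular gauges is small and fixed.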

Pretty good article on tools other than Kamon: https://dzone.com/articles/java-memory-and-cpu-monitoring-tools-and-technique

Potential problem with the amount of data in metrics

Concern about out of memory errors

Questions

Is the node computing metrics itself (e.g. running calculations) or just emitting raw data via the metrics port?

Idea: HdrHistogram

From Artur via Discord: "re metrics memory usage: I'm not sure what exactly causes that, but if it's histograms size, using a better reservoir for the metrics, especially one based on HdrHistogram (http://hdrhistogram.github.io/HdrHistogram/), could help"

Discussion about calculating histograms

Adam agrees that histograms are expensive and may not be a priority for the most-needed metrics (e.g. CPU usage)
Histograms
- Recommendation: scrape metrics via Kamon and send them elsewhere (outside of RNode) for analysis
- Are we really using histograms?
- What we need are time series, not histograms (Adam, Dom)
Time series
- Time and value
- Supports creation of reasonable Grafana graphs
- Kamon does not support time series
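
To make the "time and value" idea concrete, here is a minimal sketch of a time series whose memory use is fixed, which is the property the out-of-memory concern calls for. It illustrates the data structure only and is not a proposal from the meeting: the class BoundedTimeSeries, its capacity parameter, and the drop-oldest policy are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical fixed-capacity series of (timestamp, value) points.
// When full, the oldest point is overwritten, so memory never grows.
final class BoundedTimeSeries {

    static final class Point {
        final long timestampMillis;
        final double value;

        Point(long timestampMillis, double value) {
            this.timestampMillis = timestampMillis;
            this.value = value;
        }
    }

    private final long[] timestamps;
    private final double[] values;
    private int next; // index the next point will be written to
    private int size; // number of valid points, at most the capacity

    BoundedTimeSeries(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        timestamps = new long[capacity];
        values = new double[capacity];
    }

    void record(long timestampMillis, double value) {
        timestamps[next] = timestampMillis;
        values[next] = value;
        next = (next + 1) % timestamps.length;
        if (size < timestamps.length) {
            size++;
        }
    }

    // Returns the retained points, oldest first.
    List<Point> snapshot() {
        List<Point> out = new ArrayList<>(size);
        int start = (next - size + timestamps.length) % timestamps.length;
        for (int i = 0; i < size; i++) {
            int idx = (start + i) % timestamps.length;
            out.add(new Point(timestamps[idx], values[idx]));
        }
        return out;
    }
}
```

With one sample every 30 seconds, a capacity of 120 would retain the last hour of a gauge in two small arrays.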

Push / pull discussion

Prometheus requires a pull
- Currently implemented
StatsD requires a push
- Not implemented
- Would it integrate with Kamon?
- Would it work with all supported OSs?
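
The difference between the two models shows up in what goes over the wire. The sketch below formats one gauge both ways: as a StatsD line (name:value|g) that the node would push over UDP, and as a Prometheus text-format sample that a scraper would pull over HTTP. The class name, the metric name rnode.heap-used, and the host are placeholders, not RNode's actual configuration; 8125 is the conventional StatsD port.

```java
import java.io.IOException;
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;

// Hypothetical helper contrasting the push (StatsD) and pull (Prometheus) wire formats.
final class MetricWireFormats {

    // StatsD gauge line. Values are assumed non-negative here, because
    // StatsD treats a signed gauge value as a delta rather than an absolute value.
    static String statsdGauge(String name, long value) {
        return name + ":" + value + "|g";
    }

    // Prometheus text-format sample. Prometheus metric names may not contain
    // '-' or '.', so both are mapped to '_'.
    static String prometheusSample(String name, long value) {
        return name.replace('-', '_').replace('.', '_') + " " + value;
    }

    // Push: the node sends a datagram and does not wait for a reply.
    static void pushUdp(String host, int port, String line) throws IOException {
        byte[] payload = line.getBytes(StandardCharsets.UTF_8);
        try (DatagramSocket socket = new DatagramSocket()) {
            socket.send(new DatagramPacket(payload, payload.length,
                    InetAddress.getByName(host), port));
        }
    }

    public static void main(String[] args) {
        System.out.println(statsdGauge("rnode.heap-used", 1024));
        System.out.println(prometheusSample("rnode.heap-used", 1024));
        // To actually push, assuming a StatsD server is listening locally:
        // pushUdp("localhost", 8125, statsdGauge("rnode.heap-used", 1024));
    }
}
```

With push, the node only needs the server's address and keeps no history. With pull, the node must hold the current values until the next scrape arrives.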

Metrics for who?

SRE team for testnet monitoring
Validators for RNode monitoring

Concern about Prometheus

If you limit Prometheus to only report data every few minutes, you lose the ability to generate meaningful metrics
Current Prometheus requirement is to sample every 30 seconds
- Question about how often Kamon would push data to the Prometheus reporter (e.g. push every 60 seconds, scrape every 30 seconds)
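
The interval question can be checked with simple arithmetic. The sketch below counts how many scrapes would return a snapshot that the previous scrape had already seen, assuming the reporter refreshes on a fixed period and both timers start at the same instant. The class and method names are hypothetical, and this models only the timing, not Kamon's or Prometheus's actual behavior.

```java
// Hypothetical timing model: a reporter refreshes its snapshot every
// refreshSeconds, and a scraper reads it every scrapeSeconds.
final class ScrapeStaleness {

    // Counts scrapes in [0, windowSeconds) that see the same snapshot
    // as the scrape immediately before them.
    static int staleScrapes(int refreshSeconds, int scrapeSeconds, int windowSeconds) {
        int stale = 0;
        long lastSeenSnapshot = -1;
        for (int t = 0; t < windowSeconds; t += scrapeSeconds) {
            long currentSnapshot = t / refreshSeconds; // index of the latest refresh at time t
            if (currentSnapshot == lastSeenSnapshot) {
                stale++;
            }
            lastSeenSnapshot = currentSnapshot;
        }
        return stale;
    }

    public static void main(String[] args) {
        // Refresh every 60 s, scrape every 30 s, over 10 minutes.
        System.out.println(staleScrapes(60, 30, 600));
        // Refresh every 15 s, scrape every 30 s, over 10 minutes.
        System.out.println(staleScrapes(15, 30, 600));
    }
}
```

Under these assumptions, with a 60-second refresh and a 30-second scrape, every second scrape repeats the previous value, so the effective resolution is 60 seconds regardless of the scrape interval.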

Proposal A

Get rid of Prometheus reporter
Implement StatsD
Configure Kamon to push to StatsD
- Point RNode to the StatsD server
  - Idea: periodically ping the StatsD server
- Decide what data to push
- Or decide not to use Kamon if the above does not work
Do not remove Prometheus reporter. Disable it. Allow use as a configuration option.
RNode uses StatsD by default
Requires end user to analyze pushed data
Next steps
- Proof of concept
- Design document showing the architecture of monitoring system
  - Adam blocked by limited Kamon understanding
- Educate node operators on how to view metrics (e.g. Grafana or Kibana)
- Goal: deliver for the community test on Nov. 27

Proposal B

Investigate the cause of the OOM errors with Prometheus
- E.g. understand why we store history the way we do
We don't yet have a good understanding of what is causing the problems.

Proposal C

Extend Prometheus (multiple reporters) with a database (InfluxDB reporter)
- Idea: push data every 15 seconds

Proposal D

Is there another solution other than Kamon?
- Dropwizard has a library that abstracts StatsD and produces useful time series data
- https://opencensus.io/tracing
