2018-11-13 meeting notes: metrics implementation

Date

Oct 11, 2018

Attendees

  • @Pawel Szulc (Unlicensed)

  • @Sebastian Bach

  • @Deanna Duke (optional)

  • @Ned Robinson

  • @Lucius Meredith

Goals

  • By the end of this meeting we need a decision on how to expose the required metrics without causing out-of-memory errors.

Discussion items

Item

Notes

Resources

These are worth pushing into Grafana

  • heap-memory

    • heap-used: a gauge tracking the amount of heap memory currently in use, in bytes.

    • heap-max: a gauge tracking the maximum amount of heap memory that can be used, in bytes.

    • heap-committed: a gauge tracking the amount of heap memory committed for the JVM to use, in bytes.

  • non-heap-memory

    • non-heap-used: a gauge tracking the amount of non-heap memory currently in use, in bytes.

    • non-heap-max: a gauge tracking the maximum amount of non-heap memory that can be used, in bytes.

    • non-heap-committed: a gauge tracking the amount of non-heap memory committed for the JVM to use, in bytes.

  • garbage-collection-count: a gauge tracking the number of garbage collections that have occurred.

  • garbage-collection-time: a gauge tracking the time spent in garbage collections, in milliseconds.
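
For reference, the gauges listed above correspond to values the JVM already exposes through the standard java.lang.management beans. The following is a minimal sketch of reading them directly; the class name JvmGauges is hypothetical, and this is not how RNode or Kamon collects them, only an illustration of where the numbers come from.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of RNode or Kamon.
class JvmGauges {

    // Takes one snapshot of the heap, non-heap, and GC gauges.
    static Map<String, Long> snapshot() {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = memory.getHeapMemoryUsage();
        MemoryUsage nonHeap = memory.getNonHeapMemoryUsage();

        Map<String, Long> gauges = new LinkedHashMap<>();
        gauges.put("heap-used", heap.getUsed());
        gauges.put("heap-max", heap.getMax());              // -1 if undefined
        gauges.put("heap-committed", heap.getCommitted());
        gauges.put("non-heap-used", nonHeap.getUsed());
        gauges.put("non-heap-max", nonHeap.getMax());       // -1 if undefined
        gauges.put("non-heap-committed", nonHeap.getCommitted());

        long gcCount = 0;
        long gcTimeMillis = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Each getter returns -1 when the collector does not report it.
            gcCount += Math.max(0, gc.getCollectionCount());
            gcTimeMillis += Math.max(0, gc.getCollectionTime());
        }
        gauges.put("garbage-collection-count", gcCount);
        gauges.put("garbage-collection-time", gcTimeMillis);
        return gauges;
    }

    public static void main(String[] args) {
        snapshot().forEach((name, value) -> System.out.println(name + " " + value));
    }
}
```

Each snapshot is eight numbers, so sampling these gauges at a fixed interval costs a constant amount of memory per sample.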

Potential problem with the amount of data in metrics

  • Reference issues

  • Concern about out of memory errors

Questions

  • Is the node computing metrics (e.g., running calculations) or just emitting raw data via the metrics port?

Idea: HdrHistogram

  • From Artur via Discord: "re metrics memory usage: I'm not sure what exactly causes that, but if it's histograms size, using a better reservoir for the metrics, especially one based on HdrHistogram (http://hdrhistogram.github.io/HdrHistogram/), could help"
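
To illustrate why a bucketed reservoir could help with memory, here is a coarse sketch of a histogram whose footprint stays fixed no matter how many values are recorded. This is NOT HdrHistogram and does not use its API; the class Log2Histogram is hypothetical and uses one bucket per power of two, which is far less precise than the real library.

```java
// Hypothetical illustration of a fixed-footprint histogram.
// It is NOT HdrHistogram; it only shows the bounded-memory idea.
class Log2Histogram {
    // One counter per power of two covers every non-negative long.
    private final long[] counts = new long[63];
    private long total = 0;

    void record(long value) {
        if (value < 0) {
            throw new IllegalArgumentException("negative value: " + value);
        }
        counts[bucketOf(value)]++;
        total++;
    }

    long count() {
        return total;
    }

    // Returns the upper bound of the bucket that contains the requested
    // percentile (0 < percentile <= 100), or 0 if nothing was recorded.
    long valueAtPercentile(double percentile) {
        if (total == 0) {
            return 0;
        }
        long target = (long) Math.ceil(total * percentile / 100.0);
        long seen = 0;
        for (int b = 0; b < counts.length; b++) {
            seen += counts[b];
            if (seen >= target) {
                return upperBound(b);
            }
        }
        return Long.MAX_VALUE;
    }

    // Bucket 0 holds 0 and 1; bucket b holds [2^b, 2^(b+1) - 1].
    private static int bucketOf(long value) {
        return value <= 1 ? 0 : 63 - Long.numberOfLeadingZeros(value);
    }

    private static long upperBound(int bucket) {
        return bucket >= 62 ? Long.MAX_VALUE : (1L << (bucket + 1)) - 1;
    }
}
```

Recording a million values uses the same 63 counters as recording ten. The trade-off is that a percentile is only known to within its bucket.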

Discussion about calculating histograms

  • Adam agrees that histograms are expensive and may not be a priority for the metrics most needed (e.g., CPU usage)

  • Histograms

    • Recommendation: scrape metrics via Kamon and send them elsewhere (outside of RNode) for analysis

    • Are we really using histograms? 

    • What we need are time series, not histograms (Adam, Dom)

  • Time series

    • Time and value 

    • Supports creation of reasonable Grafana graphs

    • Kamon does not support time series
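
A time series in the sense discussed above is just a sequence of (time, value) samples. The sketch below shows one way to keep such a series in bounded memory by retaining only the most recent samples. The class BoundedTimeSeries and its capacity parameter are hypothetical and are not taken from RNode or Kamon.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not part of RNode or Kamon.
class BoundedTimeSeries {

    // One point on a graph: when it was measured and what was measured.
    static final class Sample {
        final long epochMillis;
        final double value;

        Sample(long epochMillis, double value) {
            this.epochMillis = epochMillis;
            this.value = value;
        }
    }

    private final ArrayDeque<Sample> samples = new ArrayDeque<>();
    private final int capacity;

    BoundedTimeSeries(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
    }

    // Appends a sample, evicting the oldest one once the series is full.
    void add(long epochMillis, double value) {
        if (samples.size() == capacity) {
            samples.removeFirst();
        }
        samples.addLast(new Sample(epochMillis, value));
    }

    int size() {
        return samples.size();
    }

    // Copies the retained samples, oldest first.
    List<Sample> snapshot() {
        return new ArrayList<>(samples);
    }
}
```

With a 30-second sampling interval, a capacity of 120 would retain one hour of history per metric.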

Push / pull discussion

  • Prometheus requires a pull

    • Currently implemented

  • StatsD requires a push

    • Not implemented

    • Would it integrate with Kamon?

    • Would it work with all supported OSs?
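
For context on the push model: StatsD accepts plain-text lines over UDP, and a gauge is written as name:value|g. The sketch below sends one gauge that way using only the JDK. The class name, the metric name rnode.heap-used, and the localhost:8125 target are assumptions for illustration (8125 is the conventional StatsD port), and this bypasses Kamon entirely.

```java
import java.io.IOException;
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of a StatsD push; not RNode or Kamon code.
class StatsdGaugePusher {

    // Formats one gauge in the StatsD line protocol: <name>:<value>|g
    static String gaugeLine(String name, long value) {
        if (value < 0) {
            // StatsD reads a signed gauge value as a delta, so reject it here.
            throw new IllegalArgumentException("gauge value must be non-negative");
        }
        return name + ":" + value + "|g";
    }

    // Sends the gauge as a single UDP datagram (fire and forget).
    static void push(String host, int port, String name, long value) throws IOException {
        byte[] payload = gaugeLine(name, value).getBytes(StandardCharsets.UTF_8);
        try (DatagramSocket socket = new DatagramSocket()) {
            InetAddress address = InetAddress.getByName(host);
            socket.send(new DatagramPacket(payload, payload.length, address, port));
        }
    }

    public static void main(String[] args) throws IOException {
        Runtime runtime = Runtime.getRuntime();
        long heapUsed = runtime.totalMemory() - runtime.freeMemory();
        // Assumed target: a StatsD server listening on localhost:8125.
        push("localhost", 8125, "rnode.heap-used", heapUsed);
    }
}
```

Because UDP is connectionless, the node keeps no history and does not block if the StatsD server is down. The cost is that lost datagrams go unnoticed.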

Metrics for who?

  • SRE team for testnet monitoring

  • Validators for RNode monitoring 

Concern about Prometheus

  • If you limit Prometheus to only report data every few minutes, you lose the ability to generate meaningful metrics

  • Current Prometheus requirement is to sample every 30 seconds 

    • Question: how often would Kamon push data to the Prometheus reporter (e.g., Kamon pushes every 60 seconds while Prometheus scrapes every 30 seconds)?

Proposal A

  • Get rid of Prometheus reporter

  • Implement StatsD

  • Configure Kamon to push to StatsD

    • Point RNode to the StatsD server

      • Idea: periodically ping the StatsD server

    • Decide what data to push

    • Or decide not to use Kamon if the above bullet doesn't work

  • Do not remove Prometheus reporter. Disable it. Allow use as a configuration option.

  • RNode uses StatsD by default

  • Requires end user to analyze pushed data 

  • Next steps

    • Proof of concept

    • Design document showing the architecture of monitoring system

      • Adam blocked by limited Kamon understanding

    • Educate node operators on how to view metrics (e.g., Grafana or Kibana)

    • Goal: deliver for the community test on Nov. 27

Proposal B

  • Investigate the cause of the OOM errors with Prometheus

    • E.g., understand why we store history the way we do

  • We don't really have a good understanding of what is causing problems.

Proposal C

  • Extend Prometheus (multiple reporters) with a database (InfluxDB reporter)

    • Idea: push data every 15 seconds

Proposal D

Action items