2018-11-13 meeting notes: metrics implementation

Date

Attendees

Goals

  • By the end of this meeting we need a decision on how to expose required metrics without causing out of memory errors.

Discussion items

Resources

These are worth pushing into Grafana

heap-memory
heap-used: a gauge tracking the amount of heap memory currently being used in bytes.
heap-max: a gauge tracking the maximum amount of heap memory that can be used in bytes.
heap-committed: a gauge tracking the amount of memory that is committed for the JVM to use in bytes.
non-heap-memory
non-heap-used: a gauge tracking the amount of non-heap memory currently being used in bytes.
non-heap-max: a gauge tracking the maximum amount of non-heap memory that can be used in bytes.
non-heap-committed: a gauge tracking the amount of non-heap memory that is committed for the JVM to use in bytes.
garbage-collection-count: a gauge tracking the number of garbage collections that have occurred.
garbage-collection-time: a gauge tracking the time spent in garbage collections, measured in milliseconds.
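All of the gauges above map directly onto the JVM's standard management beans, so they can be read without any third-party dependency. A minimal sketch (the val names here are illustrative; the actual metric names come from whatever reporter publishes them):

```scala
import java.lang.management.ManagementFactory

val memory = ManagementFactory.getMemoryMXBean

// heap-used / heap-max / heap-committed, in bytes (max is -1 if undefined)
val heapUsed      = memory.getHeapMemoryUsage.getUsed
val heapMax       = memory.getHeapMemoryUsage.getMax
val heapCommitted = memory.getHeapMemoryUsage.getCommitted

// non-heap equivalents
val nonHeapUsed      = memory.getNonHeapMemoryUsage.getUsed
val nonHeapCommitted = memory.getNonHeapMemoryUsage.getCommitted

// garbage-collection-count / garbage-collection-time, summed over all collectors
var gcCount  = 0L
var gcTimeMs = 0L
ManagementFactory.getGarbageCollectorMXBeans.forEach { gc =>
  gcCount += gc.getCollectionCount
  gcTimeMs += gc.getCollectionTime
}
```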
Potential problem with the amount of data in metrics
  • Reference issues

  • Concern about out of memory errors
Questions
  • Is the node providing computed metrics (e.g., running calculations) or just emitting raw data via the metrics port?
Idea: HdrHistogram
  • From Artur via Discord: "re metrics memory usage: I'm not sure what exactly causes that, but if it's histograms size, using a better reservoir for the metrics, especially one based on HdrHistogram (http://hdrhistogram.github.io/HdrHistogram/), could help"
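The appeal of HdrHistogram is that it keeps memory bounded regardless of how many samples are recorded, by counting values into buckets at a fixed relative precision instead of retaining individual samples. This is a toy illustration of that bucketing idea (not the HdrHistogram library itself), using power-of-two buckets:

```scala
// Toy sketch, not the real library: record values into logarithmically
// sized buckets so memory stays O(number of buckets), never O(samples).
final class BoundedHistogram {
  private val counts = scala.collection.mutable.Map.empty[Int, Long]

  // Bucket index = floor(log2(value)); all values in [2^k, 2^(k+1)) share a bucket.
  private def bucketOf(value: Long): Int =
    63 - java.lang.Long.numberOfLeadingZeros(math.max(value, 1L))

  def record(value: Long): Unit = {
    val b = bucketOf(value)
    counts(b) = counts.getOrElse(b, 0L) + 1L
  }

  def totalCount: Long = counts.values.sum
  def bucketCount: Int = counts.size
}
```

The real HdrHistogram refines this with a configurable number of significant digits, so percentile queries stay accurate while memory remains fixed.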
Discussion about calculating histograms
  • Adam agrees that computing histograms is expensive and may not be a priority for the metrics most needed (e.g., CPU usage)
  • Histograms
    • Recommendation to scrape metrics via Kamon and send them elsewhere (outside of RNode) for analysis
    • Are we really using histograms? 
    • What we need are time series, not histograms (Adam, Dom)
  • Time series
    • Time and value 
    • Supports creation of reasonable Grafana graphs
    • Kamon does not support time series
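For concreteness, "time series" here means nothing more than ordered (timestamp, value) samples per metric, which is exactly the shape Grafana graphs. A minimal sketch (names are illustrative, not from the RNode codebase):

```scala
// A single observation: when it was taken and what was measured.
final case class Sample(epochMillis: Long, value: Double)

// An append-only series of samples for one named metric.
final case class TimeSeries(metric: String, samples: Vector[Sample]) {
  def append(s: Sample): TimeSeries = copy(samples = samples :+ s)
  def latest: Option[Sample]        = samples.lastOption
}
```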
Push / pull discussion
  • Prometheus requires a pull
    • Currently implemented
  • StatsD requires a push
    • Not implemented
    • Would it integrate with Kamon?
    • Would it work with all supported OSs?
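To make the push model concrete: a StatsD client formats one plain-text line per metric and fires it at the server over UDP; nothing polls the node. A sketch using only the JDK (host and port are placeholders, and the gauge line format is from the StatsD line protocol):

```scala
import java.net.{DatagramPacket, DatagramSocket, InetAddress}

// StatsD line protocol for a gauge: "<name>:<value>|g"
def formatGauge(name: String, value: Long): String = s"$name:$value|g"

// Fire-and-forget push over UDP; no response is expected from the server.
def pushGauge(host: String, port: Int, name: String, value: Long): Unit = {
  val payload = formatGauge(name, value).getBytes("UTF-8")
  val socket  = new DatagramSocket()
  try socket.send(new DatagramPacket(payload, payload.length, InetAddress.getByName(host), port))
  finally socket.close()
}
```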
Metrics for who?
  • SRE team for testnet monitoring
  • Validators for RNode monitoring 
Concern about Prometheus
  • If you limit Prometheus to only report data every few minutes, you lose the ability to generate meaningful metrics
  • Current Prometheus requirement is to sample every 30 seconds 
    • Question about how often Kamon would push data to the Prometheus reporter (e.g., push every 60 seconds, scrape every 30 seconds)
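The 30-second requirement would live in Prometheus's scrape configuration, along the lines of the fragment below (the job name and target port are placeholders, not taken from the current deployment):

```yaml
scrape_configs:
  - job_name: "rnode"
    scrape_interval: 30s              # the 30-second sampling requirement above
    static_configs:
      - targets: ["localhost:40403"]  # placeholder; use the node's actual metrics port
```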
Proposal A
  • Get rid of Prometheus reporter
  • Implement StatsD
  • Configure Kamon to push to StatsD
    • Point RNode to the StatsD server
      • Idea periodically ping the StatsD server
    • Decide what data to push
    • Or decide not to use Kamon if the above doesn't work
  • Revised: do not remove the Prometheus reporter. Disable it by default and allow it as a configuration option.
  • RNode uses StatsD by default
  • Requires end user to analyze pushed data 
  • Next steps
    • Proof of concept
    • Design document showing the architecture of monitoring system
      • Adam blocked by limited Kamon understanding
    • Educate node operators on how to view metrics (e.g., Grafana or Kibana)
    • Goal: deliver for the community test on Nov. 27
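Configuring Kamon to push to StatsD would be a small HOCON fragment along these lines (keys based on the kamon-statsd module's reference configuration; exact keys may differ by Kamon version, and the hostname is a placeholder):

```hocon
kamon.statsd {
  hostname = "127.0.0.1"       # placeholder StatsD server
  port     = 8125              # StatsD's conventional UDP port
  flush-interval = 10 seconds  # how often buffered metrics are pushed
}
```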
Proposal B
  • Investigate the cause of the OOM errors with Prometheus
    • E.g., understand why we store history the way we do
  • We don't really have a good understanding of what is causing problems.
Proposal C
  • Extend Prometheus (multiple reporters) with a database (Influx DB reporter)
    • Idea push data every 15 seconds
Proposal D

Action items
