2019-02-11 Meeting notes: testnet planning with SRE team

Date

Feb 11, 2019

Participants

  • @Kelly Foster

  • @Adam Szkoda

  • @Tomáš Virtus

  • @Chris Boscolo

Goals

We will use this time to draft requirements for public testnet support, monitoring, and maintenance.

Discussion topics

Testnet infrastructure

  • 1 bootstrap

  • 10 validators

  • Review of the Juniper setup for each validator

    • 16 GB RAM

    • 2 CPUs

  • Plan to use Terraform for provisioning (planned topology sketched below)
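
As a reference for the Terraform work, a minimal sketch of the planned topology as data, in Python; the hostnames are placeholders, and only the node counts and machine specs come from the notes above.

```python
# Planned testnet topology from the notes above. Hostnames are placeholders;
# the actual machines would be provisioned with Terraform.
BOOTSTRAP = "bootstrap.testnet.example"

VALIDATOR_COUNT = 10
VALIDATOR_SPEC = {"ram_gb": 16, "cpus": 2}

VALIDATORS = [f"validator{i}.testnet.example" for i in range(VALIDATOR_COUNT)]

if __name__ == "__main__":
    print(f"bootstrap: {BOOTSTRAP}")
    print(f"{VALIDATOR_COUNT} validators, each with {VALIDATOR_SPEC}")
```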

Testnet validators

  • All nodes need to auto-propose

  • At least one needs ports open so that public nodes can deploy to it

Monitoring

  • Health check

    • Rudimentary health check in place

      • Requires updates for dev team (TICKET)

  • Status page based on the Coop-operated nodes' view of the network

    • For showing testnet status

    • Definition of availability (sketched in the health-check example below)

      • A bootstrap is available

      • There is evidence of a growing blockchain (aggregated view of Coop nodes)

  • Monitoring service

    • Nagios described in Ops monitoring (hosted infrastructure)

    • Discussion about using cloud services for monitoring rather than something we run in-house

  • Test

    • Can we intercept a message from the monitoring system and use it to kill and restart a node?
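
A minimal sketch of the availability definition above, assuming each Coop node exposes an HTTP status endpoint that reports its latest block number; the URLs, port, and JSON field name are assumptions for illustration, not the actual RNode API.

```python
import json
import time
import urllib.request

# Hypothetical status endpoints; the port, path, and JSON field are assumptions.
BOOTSTRAP_STATUS_URL = "http://bootstrap.testnet.example:40403/status"
COOP_NODE_STATUS_URLS = [
    f"http://validator{i}.testnet.example:40403/status" for i in range(10)
]

def fetch_status(url):
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.loads(resp.read())

def bootstrap_available():
    """Availability condition 1: the bootstrap node responds to a status request."""
    try:
        fetch_status(BOOTSTRAP_STATUS_URL)
        return True
    except Exception:
        return False

def chain_is_growing(interval_seconds=60):
    """Availability condition 2: the max block number reported across Coop nodes
    increases over the observation interval (evidence of a growing blockchain)."""
    def max_height():
        heights = []
        for url in COOP_NODE_STATUS_URLS:
            try:
                heights.append(fetch_status(url)["blockNumber"])
            except Exception:
                pass
        return max(heights, default=-1)

    before = max_height()
    time.sleep(interval_seconds)
    return max_height() > before

if __name__ == "__main__":
    ok = bootstrap_available() and chain_is_growing()
    print("testnet available" if ok else "testnet degraded")
```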

Notification service

  • e.g., PagerDuty
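
For example, paging from the monitoring side could go through the PagerDuty Events API, roughly as sketched below; the routing key is a placeholder that would come from a real service integration.

```python
import json
import urllib.request

# PagerDuty Events API v2; the routing key is a placeholder for a real
# service integration key.
PAGERDUTY_URL = "https://events.pagerduty.com/v2/enqueue"
ROUTING_KEY = "REPLACE_WITH_INTEGRATION_KEY"

def page(summary, severity="critical", source="testnet-monitor"):
    """Trigger an incident for the on-call rotation."""
    payload = {
        "routing_key": ROUTING_KEY,
        "event_action": "trigger",
        "payload": {"summary": summary, "severity": severity, "source": source},
    }
    req = urllib.request.Request(
        PAGERDUTY_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.loads(resp.read())

# Example: page("Testnet bootstrap unreachable") when the health check reports degraded.
```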

Log capturing

  • Provided by RDoctor on RunDeck testnet

  • e.g., Loggly

  • Idea to have it integrate with monitoring and paging services

  • Requirements

    • Must be able to collect logs from our machines

    • Nice to have: ability to collect logs from other validators as well
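
A rough sketch of the "collect logs from our machines" requirement, assuming an HTTP ingestion endpoint in the style of Loggly's HTTP input; the exact URL format, token handling, and log path are assumptions and should be taken from the chosen service's documentation.

```python
import json
import sys
import urllib.request

# Hypothetical HTTP ingestion endpoint in the style of Loggly's HTTP input;
# the URL format and customer token should come from the chosen service's docs.
LOG_ENDPOINT = "https://logs-01.loggly.com/inputs/REPLACE_WITH_TOKEN/tag/testnet/"

def ship_log_line(host, line):
    """Forward one RNode log line, tagged with its source host, to the aggregator."""
    event = {"host": host, "message": line}
    req = urllib.request.Request(
        LOG_ENDPOINT,
        data=json.dumps(event).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=5)

if __name__ == "__main__":
    # Example usage: tail -F /var/log/rnode/rnode.log | python ship_logs.py validator0
    host = sys.argv[1] if len(sys.argv) > 1 else "unknown"
    for line in sys.stdin:
        ship_log_line(host, line.rstrip())
```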

Criteria for killing a node

  • OOM error

    • Requires more than just a reboot to clear memory

  • Lost connection to peers
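
A hedged sketch of how these criteria could be checked remotely before deciding to kill and restart a node; the log path, log messages, and use of dmesg for OOM detection are assumptions for illustration.

```python
import subprocess

RNODE_LOG = "/var/log/rnode/rnode.log"   # assumed log location

def _remote_count(host, command):
    """Run a counting command on the node over SSH and return the integer result."""
    out = subprocess.run(["ssh", host, command],
                         capture_output=True, text=True).stdout.strip()
    return int(out) if out.isdigit() else 0

def oom_killed_recently(host):
    """Criterion 1: the kernel log shows OOM-killer activity on the node."""
    return _remote_count(host, "dmesg | grep -ci 'out of memory' || true") > 0

def lost_peers(host):
    """Criterion 2: the RNode log shows lost-peer-connection entries."""
    return _remote_count(host, f"grep -ci 'lost connection' {RNODE_LOG} || true") > 0

def should_kill_and_restart(host):
    return oom_killed_recently(host) or lost_peers(host)
```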

Block propagation evaluation

  • One option

    • Pick a random node and make it deploy and propose via SSH

    • Capture the block hash

    • Scan the network to see when the other nodes receive the block

      • Requires SSH to all nodes to check receipt of block hash

    • If the condition is satisfied within the specified timeframe, propagation is considered good (see the sketch after this list)

  • Another option

    • Use log parsing

    • Pick a random node and have it propose

    • Parse the logs to observe block receipt by other nodes

    • RISK: clock synchronization across machines

      • Could use the log aggregator's clock instead
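
A sketch that blends the two options: trigger a propose over SSH on one randomly chosen node, then check receipt on the others by grepping their logs over SSH. The rnode CLI invocations, deploy payload, log path, and propose output format are assumptions for illustration.

```python
import random
import subprocess
import time

# Coop-operated validator hosts; hostnames are placeholders.
VALIDATORS = [f"validator{i}.testnet.example" for i in range(10)]
RNODE_LOG = "/var/log/rnode/rnode.log"   # assumed log location
TIMEOUT_SECONDS = 300

def ssh(host, command):
    """Run a command on a remote host over SSH and return its stdout."""
    result = subprocess.run(["ssh", host, command],
                            capture_output=True, text=True, check=True)
    return result.stdout

# 1. Pick a random node and have it deploy and propose.
proposer = random.choice(VALIDATORS)
ssh(proposer, "rnode deploy --phlo-limit 100000 --phlo-price 1 /tmp/test.rho")
propose_output = ssh(proposer, "rnode propose")

# 2. Capture the block hash from the propose output (output format is an assumption).
block_hash = propose_output.strip().split()[-1]

# 3. Poll the remaining nodes until the hash shows up in their logs, or time out.
deadline = time.time() + TIMEOUT_SECONDS
pending = [v for v in VALIDATORS if v != proposer]
while pending and time.time() < deadline:
    still_missing = []
    for v in pending:
        count = ssh(v, f"grep -c {block_hash} {RNODE_LOG} || true").strip()
        if count in ("", "0"):
            still_missing.append(v)
    pending = still_missing
    time.sleep(10)

print("propagation OK" if not pending else f"block not seen on: {pending}")
```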

Management of user expectations

  • Need place to communicate this

  • It may be the case that we do not run a genesis ceremony every time when/if we need to restart the network

What happens if…

  • An outside validator doesn't know the network is down and keeps eagerly trying to maintain/regain its connection to the bootstrap node after the bootstrap comes back up with a new genesis block

  • We provide a different

Testnet goals

  • Claim test_REV based on RHOC balances included in Genesis block

    • To reduce risk of someone using their ETH pub key incorrectly, we could have an opt-in feature to provide a fake ETH pub key to associate to RHOC

      • Needs communication to the RHOC holders

      • Needs solution to collect alternative ETH pub keys

  • Join an observer (non-validating) RNode to the network and observe the state

  • dApp developers can deploy to a known validator

  • Testnet users can request test_REV via a faucet solution (RCHAIN-2973: Develop and implement a faucet and process for testnet users to receive tokens)

Action items

@Tomáš Virtus Document the testnet plan: RNode specifications, monitoring solution, notification solution, log capturing, and SLOs (@Kelly Foster to create ticket)

Decisions