2019-02-11 Meeting notes: testnet planning with SRE team

Date

Feb 11, 2019

Participants

  • @Kelly Foster

  • @Adam Szkoda

  • @Tomáš Virtus

  • @Chris Boscolo

Goals

We will use this time to draft requirements for public testnet support, monitoring, and maintenance.

Discussion topics

Testnet infrastructure

  • 1 bootstrap

  • 10 validators

  • Review of the Juniper setup for each validator

    • 16 GB RAM

    • 2 CPUs

  • Plan to use Terraform for provisioning (planned topology sketched below)
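
As a reference for the Terraform work, a minimal sketch of the planned topology as data, in Python; the hostnames are placeholders, and only the node counts and machine specs come from the notes above.

```python
# Planned testnet topology from the notes above. Hostnames are placeholders;
# the actual machines would be provisioned with Terraform.
BOOTSTRAP = "bootstrap.testnet.example"

VALIDATOR_COUNT = 10
VALIDATOR_SPEC = {"ram_gb": 16, "cpus": 2}

VALIDATORS = [f"validator{i}.testnet.example" for i in range(VALIDATOR_COUNT)]

if __name__ == "__main__":
    print(f"bootstrap: {BOOTSTRAP}")
    print(f"{VALIDATOR_COUNT} validators, each with {VALIDATOR_SPEC}")
```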

Testnet validators

  • All nodes need to auto-propose

  • At least one needs ports open so that public nodes can deploy to it

Monitoring

  • Health check

    • Rudimentary health check in place

      • Requires updates for dev team (TICKET)

  • Status page based on the Coop-operated nodes' view of the network

    • For showing testnet status

    • Definition of availability (sketched in the health-check example below)

      • A bootstrap is available

      • There is evidence of a growing blockchain (aggregated view of Coop nodes)

  • Monitoring service

    • Nagios described in Ops monitoring (hosted infrastructure)

    • Discussion about using cloud services for monitoring rather than something we run in-house

  • Test

    • Can we intercept a message from the monitoring system and use it to kill and restart a node?
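
A minimal sketch of the availability definition above, assuming each Coop node exposes an HTTP status endpoint that reports its latest block number; the URLs, port, and JSON field name are assumptions for illustration, not the actual RNode API.

```python
import json
import time
import urllib.request

# Hypothetical status endpoints; the port, path, and JSON field are assumptions.
BOOTSTRAP_STATUS_URL = "http://bootstrap.testnet.example:40403/status"
COOP_NODE_STATUS_URLS = [
    f"http://validator{i}.testnet.example:40403/status" for i in range(10)
]

def fetch_status(url):
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.loads(resp.read())

def bootstrap_available():
    """Availability condition 1: the bootstrap node responds to a status request."""
    try:
        fetch_status(BOOTSTRAP_STATUS_URL)
        return True
    except Exception:
        return False

def chain_is_growing(interval_seconds=60):
    """Availability condition 2: the max block number reported across Coop nodes
    increases over the observation interval (evidence of a growing blockchain)."""
    def max_height():
        heights = []
        for url in COOP_NODE_STATUS_URLS:
            try:
                heights.append(fetch_status(url)["blockNumber"])
            except Exception:
                pass
        return max(heights, default=-1)

    before = max_height()
    time.sleep(interval_seconds)
    return max_height() > before

if __name__ == "__main__":
    ok = bootstrap_available() and chain_is_growing()
    print("testnet available" if ok else "testnet degraded")
```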

Notification service

  • e.g., PagerDuty
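
For example, paging from the monitoring side could go through the PagerDuty Events API, roughly as sketched below; the routing key is a placeholder that would come from a real service integration.

```python
import json
import urllib.request

# PagerDuty Events API v2; the routing key is a placeholder for a real
# service integration key.
PAGERDUTY_URL = "https://events.pagerduty.com/v2/enqueue"
ROUTING_KEY = "REPLACE_WITH_INTEGRATION_KEY"

def page(summary, severity="critical", source="testnet-monitor"):
    """Trigger an incident for the on-call rotation."""
    payload = {
        "routing_key": ROUTING_KEY,
        "event_action": "trigger",
        "payload": {"summary": summary, "severity": severity, "source": source},
    }
    req = urllib.request.Request(
        PAGERDUTY_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.loads(resp.read())

# Example: page("Testnet bootstrap unreachable") when the health check reports degraded.
```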

Log capturing

  • Provided by RDoctor on RunDeck testnet

  • e.g., Loggly

  • Idea to have it integrate with monitoring and paging services

  • Requirements

    • Must be able to collect logs from our machines

    • Nice to have: ability to collect logs from other validators as well
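
A rough sketch of the "collect logs from our machines" requirement, assuming an HTTP ingestion endpoint in the style of Loggly's HTTP input; the exact URL format, token handling, and log path are assumptions and should be taken from the chosen service's documentation.

```python
import json
import sys
import urllib.request

# Hypothetical HTTP ingestion endpoint in the style of Loggly's HTTP input;
# the URL format and customer token should come from the chosen service's docs.
LOG_ENDPOINT = "https://logs-01.loggly.com/inputs/REPLACE_WITH_TOKEN/tag/testnet/"

def ship_log_line(host, line):
    """Forward one RNode log line, tagged with its source host, to the aggregator."""
    event = {"host": host, "message": line}
    req = urllib.request.Request(
        LOG_ENDPOINT,
        data=json.dumps(event).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=5)

if __name__ == "__main__":
    # Example usage: tail -F /var/log/rnode/rnode.log | python ship_logs.py validator0
    host = sys.argv[1] if len(sys.argv) > 1 else "unknown"
    for line in sys.stdin:
        ship_log_line(host, line.rstrip())
```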

Criteria for killing a node

  • OOM error

    • Requires more than just a reboot to clear memory

  • Lost connection to peers
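
A hedged sketch of how these criteria could be checked remotely before deciding to kill and restart a node; the log path, log messages, and use of dmesg for OOM detection are assumptions for illustration.

```python
import subprocess

RNODE_LOG = "/var/log/rnode/rnode.log"   # assumed log location

def _remote_count(host, command):
    """Run a counting command on the node over SSH and return the integer result."""
    out = subprocess.run(["ssh", host, command],
                         capture_output=True, text=True).stdout.strip()
    return int(out) if out.isdigit() else 0

def oom_killed_recently(host):
    """Criterion 1: the kernel log shows OOM-killer activity on the node."""
    return _remote_count(host, "dmesg | grep -ci 'out of memory' || true") > 0

def lost_peers(host):
    """Criterion 2: the RNode log shows lost-peer-connection entries."""
    return _remote_count(host, f"grep -ci 'lost connection' {RNODE_LOG} || true") > 0

def should_kill_and_restart(host):
    return oom_killed_recently(host) or lost_peers(host)
```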

Block propagation evaluation

  • One option

    • Pick a random node and make it deploy and propose via SSH

    • Capture the block hash

    • Scan the network to see when the other nodes receive the block

      • Requires SSH to all nodes to check receipt of block hash

    • If the condition is satisfied within the specified timeframe, propagation is considered good (see the sketch after this list)

  • Another option

    • Use log parsing

    • Pick a random node and have it propose

    • Parse the logs to observe block receipt by other nodes

    • RISK: clock synchronization across machines

      • Could use the log aggregator's clock instead
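
A sketch that blends the two options: trigger a propose over SSH on one randomly chosen node, then check receipt on the others by grepping their logs over SSH. The rnode CLI invocations, deploy payload, log path, and propose output format are assumptions for illustration.

```python
import random
import subprocess
import time

# Coop-operated validator hosts; hostnames are placeholders.
VALIDATORS = [f"validator{i}.testnet.example" for i in range(10)]
RNODE_LOG = "/var/log/rnode/rnode.log"   # assumed log location
TIMEOUT_SECONDS = 300

def ssh(host, command):
    """Run a command on a remote host over SSH and return its stdout."""
    result = subprocess.run(["ssh", host, command],
                            capture_output=True, text=True, check=True)
    return result.stdout

# 1. Pick a random node and have it deploy and propose.
proposer = random.choice(VALIDATORS)
ssh(proposer, "rnode deploy --phlo-limit 100000 --phlo-price 1 /tmp/test.rho")
propose_output = ssh(proposer, "rnode propose")

# 2. Capture the block hash from the propose output (output format is an assumption).
block_hash = propose_output.strip().split()[-1]

# 3. Poll the remaining nodes until the hash shows up in their logs, or time out.
deadline = time.time() + TIMEOUT_SECONDS
pending = [v for v in VALIDATORS if v != proposer]
while pending and time.time() < deadline:
    still_missing = []
    for v in pending:
        count = ssh(v, f"grep -c {block_hash} {RNODE_LOG} || true").strip()
        if count in ("", "0"):
            still_missing.append(v)
    pending = still_missing
    time.sleep(10)

print("propagation OK" if not pending else f"block not seen on: {pending}")
```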

Management of user expectations

  • Need place to communicate this

  • It may be the case that we do not run a genesis ceremony every time when/if we need to restart the network

What happens if…

  • An outside validator doesn't know the network is down and keeps eagerly trying to maintain/regain its connection to the bootstrap node after the bootstrap comes back up with a new genesis block

  • We provide a different

Testnet goals

  • Claim test_REV based on RHOC balances included in Genesis block

    • To reduce risk of someone using their ETH pub key incorrectly, we could have an opt-in feature to provide a fake ETH pub key to associate to RHOC

      • Needs communication to the RHOC holders

      • Needs solution to collect alternative ETH pub keys

  • Join an observer (non-validating) RNode to the network and observe the state

  • dApp developers can deploy to a known validator

  • Testnet users can request test_REV via a faucet solution (RCHAIN-2973: Develop and implement a faucet and process for testnet users to receive tokens)

Action items

@Tomáš Virtus Document the testnet plan: RNode specifications, monitoring solution, notification solution, log capturing, and SLOs (@Kelly Foster to create ticket)

Decisions