2018-10-03 Meeting notes: next steps for production engineering

Date

Oct 3, 2018

Attendees

@Kelly Foster
@Tom Vasile
@Tomáš Virtus
@Adam Szkoda
@Medha Parlikar

Goals

Following our conversation with Nash earlier this week, we will use this time to discuss next steps for bringing production engineering to the RChain project.

Discussion items

Item	Notes
Resources	Notes from meeting with Nash: https://rchain.atlassian.net/wiki/spaces/OP/pages/562036959/2018-10-01+Meeting+notes+production+engineering+tutorial
Backward compatibility	Establishing a mindset for backward compatibility. Plan to start testing protobuf schema after Mercury.
Plan for upgrading testnet	Deliverable from the SRE team. Test the plan. Gather metrics on the health of the cluster.
Predicting failure	Prerequisite: metrics for what good node performance looks like. Health of clique.
Baselines	Memory consumption. How do we establish the baselines? What processes do we put in place to monitor change in baselines? Resilience to errors on the network.
How do we monitor the network?	Discussion about how we monitor. Do we offer a centralized system? Becomes difficult as the network grows. Do we offer a monitoring module with node software? Achieving buy-in from node operators may be difficult or at best inconsistent. Do we use the Casper protocol to help bubble up metrics on the network? Do we need to support monitoring beyond the clique? Measure things on a single node. Aggregate metrics for nodes on the clique.
SLO Speed	What does it mean that the network supports 40K transactions/second? The business requirement is to measure 40K COMM events/second. This means COMM events. There is a misunderstanding in the community between COMM events and transactions. COMM events are a join. Deploys aren’t a good metric because the platform is computational and not token transfer. What tools do we need to monitor a network to understand its transaction speed? What does it mean to be secure? What tools do we need to monitor a network to understand its security? What does it mean to be scalable? What
SLO Security	DoS attack is the priority for Mercury. Ability to detect the number of deploys a node receives. Ability to detect the number of proposes a node makes. Trust infrastructure. Proof of stake. What is the trust between two nodes? Does proof of stake
SLO Reliability	What does it mean to be reliable? Idea to test with a testnet crash.
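
The Baselines questions above (how to establish a baseline and how to monitor change in it) could be approached along the lines of the sketch below. This is only an illustration under assumptions, not something decided in the meeting: the function names, the choice of the median as the baseline, the 10% tolerance, and the sample readings are all hypothetical.

```python
from statistics import median


def establish_baseline(samples_mb):
    """Return a baseline as the median of memory readings (MB).

    Assumption: the readings come from a run considered healthy.
    """
    if not samples_mb:
        raise ValueError("need at least one sample")
    return median(samples_mb)


def baseline_drift(baseline_mb, recent_mb, tolerance=0.10):
    """Compare recent readings with the baseline.

    Returns (relative_change, drifted), where drifted is True when the
    median of the recent readings differs from the baseline by more than
    the tolerance (hypothetical default: 10%).
    """
    change = (median(recent_mb) - baseline_mb) / baseline_mb
    return change, abs(change) > tolerance


# Made-up readings, for illustration only.
known_good = [512, 520, 508, 515, 530]
recent = [590, 610, 600]

baseline = establish_baseline(known_good)
change, drifted = baseline_drift(baseline, recent)
print(f"baseline={baseline} MB change={change:.1%} drifted={drifted}")
```

The median is used here because it is less sensitive to a single outlier reading than the mean; any other summary statistic could be substituted.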

Action items

@Kelly Foster to stub out tickets related to SLO for speed, security, and reliability

@Tomáš Virtus to write up a plan to achieve SLOs by mid-day Thurs

@Adam Szkoda and @Tom Vasile to review the SLO doc by EOD Thurs

@Kelly Foster to circulate SLO plan with team and create tickets
