2018-10-03 Meeting notes: next steps for production engineering


Oct 3, 2018


  • @Kelly Foster

  • @Tom Vasile

  • @Tomáš Virtus

  • @Adam Szkoda

  • @Medha Parlikar (Unlicensed)


  • Following our conversation with Nash earlier this week, we will use this time to discuss next steps for bringing production engineering to the RChain project.

Discussion items






Backward compatibility

  • Establishing a mindset for backward compatibility

  • Plan to start testing protobuf schema after Mercury


Plan for upgrading testnet

  • Deliverable from the SRE team

  • Test the plan

  • Gather metrics on the health of the cluster

Predicting failure

  • Prerequisite

    • Metrics for what good node performance looks like

      • Health of clique


  • Memory consumption

    • How do we establish the baselines?

    • What processes do we put in place to monitor change in baselines?

  • Resilience to errors on the network
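
One way to approach the baseline questions above is to summarize a window of readings taken while a node is known to be healthy, then flag later readings that fall far outside that window. The sketch below is illustrative only: the function names, the sample readings, and the three-sigma threshold are assumptions, not anything decided in the meeting.

```python
from statistics import mean, stdev


def build_baseline(samples_mb):
    """Summarize memory readings (in MB) from a healthy period as mean and spread."""
    if len(samples_mb) < 2:
        raise ValueError("need at least two samples to estimate spread")
    return {"mean": mean(samples_mb), "stdev": stdev(samples_mb)}


def deviates_from_baseline(baseline, reading_mb, max_sigmas=3.0):
    """Return True when a reading is more than max_sigmas deviations from the mean."""
    spread = baseline["stdev"]
    if spread == 0:
        # A perfectly flat baseline: any different value counts as a change.
        return reading_mb != baseline["mean"]
    return abs(reading_mb - baseline["mean"]) / spread > max_sigmas


# Hypothetical readings, in MB, collected while the node was behaving well.
healthy_window = [1010, 1025, 990, 1005, 1018, 1002]
baseline = build_baseline(healthy_window)

print(deviates_from_baseline(baseline, 1012))  # a reading close to the baseline
print(deviates_from_baseline(baseline, 1600))  # a reading far above the baseline
```

Rebuilding the baseline on a schedule and comparing it with the previous one would be one possible process for monitoring change in baselines over time.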

How do we monitor the network?

  • Discussion about how we monitor

    • Do we offer a centralized system?

      • Becomes difficult as the network grows

    • Do we offer a monitoring module with node software?

      • Achieving buy-in from node operators may be difficult or at best inconsistent

    • Do we use the Casper protocol to help bubble up metrics on the network?

  • Do we need to support monitoring beyond the clique?

    • Measure things on a single node

    • Aggregate metrics for nodes on the clique
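
The two levels mentioned above (measuring on a single node, then aggregating across the clique) could be sketched as follows. The node names, metric names, and data shape are hypothetical; how metrics would actually be collected from nodes was left open in the discussion.

```python
from statistics import median


def aggregate_clique(node_metrics, metric):
    """Summarize one metric across every node that reported it.

    node_metrics maps a node name to a dict of metric name -> value.
    Nodes that did not report the metric are skipped rather than counted as zero.
    """
    values = [m[metric] for m in node_metrics.values() if metric in m]
    if not values:
        return None
    return {
        "nodes": len(values),
        "min": min(values),
        "median": median(values),
        "max": max(values),
    }


# Hypothetical per-node readings for a three-node clique.
clique = {
    "node-a": {"memory_mb": 900, "peers": 2},
    "node-b": {"memory_mb": 1000, "peers": 2},
    "node-c": {"memory_mb": 1400},  # did not report its peer count
}

print(aggregate_clique(clique, "memory_mb"))
print(aggregate_clique(clique, "peers"))
```

Reporting the number of contributing nodes alongside the summary makes it visible when a node has stopped reporting, which is itself a health signal.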

SLO Speed

  • What does it mean that the network supports 40K transactions/second?

    • The business requirement is to measure 40K COMM events/second

    • The 40K figure refers to COMM events, not transactions

      • There is a misunderstanding in the community about the difference between COMM events and TRANSACTIONS

      • COMM events are a join

      • Deploys aren’t a good metric because the platform is for general computation, not just token transfer

    • What tools do we need to monitor a network to understand its transaction speed?

  • What does it mean to be secure?

    • What tools do we need to monitor a network to understand its security?

  • What does it mean to be scalable?

    • What

SLO Security

  • Defending against DOS attacks is the priority for Mercury

    • Ability to detect the number of deploys a node receives

    • Ability to detect the number of proposes a node makes

  • Trust infrastructure

    • Proof of stake

    • What is the trust between two nodes? Does proof of stake
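
For the DOS items above, detecting the number of deploys a node receives could start with a sliding-window counter like the sketch below. The class name, window length, and threshold are assumptions chosen for illustration; the same counter could be reused for proposes.

```python
from collections import deque


class DeployRateCounter:
    """Count events seen within the last window_seconds (hypothetical helper)."""

    def __init__(self, window_seconds=60.0, threshold=1000):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._events = deque()  # timestamps, oldest first

    def _evict(self, now):
        # Drop timestamps that have aged out of the window.
        while self._events and self._events[0] <= now - self.window_seconds:
            self._events.popleft()

    def record(self, timestamp):
        """Register one received deploy at the given time (seconds)."""
        self._events.append(timestamp)
        self._evict(timestamp)

    def count(self, now):
        """Number of deploys received in the window ending at now."""
        self._evict(now)
        return len(self._events)

    def is_suspicious(self, now):
        """True when the windowed count exceeds the configured threshold."""
        return self.count(now) > self.threshold


# Small illustrative values: a 10-second window and a threshold of 3 deploys.
counter = DeployRateCounter(window_seconds=10.0, threshold=3)
for t in (0.0, 1.0, 2.0, 3.0):
    counter.record(t)

print(counter.count(3.0), counter.is_suspicious(3.0))
print(counter.count(12.0), counter.is_suspicious(12.0))
```

Timestamps are passed in explicitly rather than read from the clock, which keeps the counter easy to test; a real node would need to choose thresholds from observed healthy traffic.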

SLO Reliability

  • What does it mean to be reliable?

  • Idea to test with a testnet crash

Action items

@Kelly Foster to stub out tickets related to SLO for speed, security, and reliability
@Tomáš Virtus to write up a plan to achieve SLOs by mid-day Thurs
@Adam Szkoda and @Tom Vasile to review the SLO doc by EOD Thurs
@Kelly Foster to circulate SLO plan with team and create tickets