2018-10-03 Meeting notes: next steps for production engineering

Date

Oct 3, 2018

Attendees

  • @Kelly Foster

  • @Tom Vasile

  • @Tomáš Virtus

  • @Adam Szkoda

  • @Medha Parlikar (Unlicensed)

Goals

  • Following our conversation with Nash earlier this week, we will use this time to discuss next steps for bringing production engineering to the RChain project.

Discussion items

Item

Notes

Resources

Backward compatibility

  • Establishing a mindset for backward compatibility

  • Plan to start testing protobuf schema after Mercury

 

Plan for upgrading testnet

  • Deliverable from the SRE team

  • Test the plan

  • Gather metrics on the health of the cluster

Predicting failure

  • Prerequisite

    • Metrics for what good node performance looks like

      • Health of clique

Baselines

  • Memory consumption

    • How do we establish the baselines?

    • What processes do we put in place to monitor change in baselines?

  • Resilience to errors on the network

How do we monitor the network?

  • Discussion about how we monitor

    • Do we offer a centralized system?

      • Becomes difficult as the network grows

    • Do we offer a monitoring module with node software?

      • Achieving buy-in from node operators may be difficult or at best inconsistent

    • Do we use the Casper protocol to help bubble up metrics on the network?

  • Do we need to support monitoring beyond the clique?

    • Measure things on a single node

    • Aggregate metrics for nodes on the clique
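
A minimal sketch of the "measure on a single node, then aggregate across the clique" idea above. The metric names (`memory_mb`, `peers`), node IDs, and the function `aggregate_clique_metrics` are hypothetical, and the sample values are illustrative, not measurements. How readings would actually be collected from nodes is left open, as it was in the discussion.

```python
from statistics import mean


def aggregate_clique_metrics(node_samples):
    """Combine per-node readings into clique-wide min/mean/max per metric.

    node_samples maps a node ID to a dict of {metric name: numeric value}.
    """
    by_metric = {}
    for metrics in node_samples.values():
        for name, value in metrics.items():
            by_metric.setdefault(name, []).append(value)
    return {
        name: {
            "min": min(values),
            "mean": mean(values),
            "max": max(values),
            "nodes": len(values),  # how many nodes reported this metric
        }
        for name, values in by_metric.items()
    }


# Illustrative readings only (hypothetical nodes and values).
samples = {
    "node-a": {"memory_mb": 512.0, "peers": 3},
    "node-b": {"memory_mb": 768.0, "peers": 2},
}
summary = aggregate_clique_metrics(samples)
```

Reporting the node count alongside each aggregate makes it visible when some nodes in the clique are not reporting a metric.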

SLO Speed

  • What does it mean that the network supports 40K transactions/second?

    • The business requirement is to measure 40K COMM events/second

    • This means COMM events

      • There is a misunderstanding in the community between COMM events and TRANSACTIONS

      • COMM events are a join

      • Deploys aren’t a good metric because the platform is computational rather than based on token transfer

    • What tools do we need to monitor a network to understand its transaction speed?

  • What does it mean to be secure?

    • What tools do we need to monitor a network to understand its security?

  • What does it mean to be scalable?

    • What

SLO Security

  • DOS attack is the priority for Mercury

    • Ability to detect the number of deploys a node receives

    • Ability to detect the number of proposes a node makes

  • Trust infrastructure

    • Proof of stake

    • What is the trust between two nodes? Does proof of stake

SLO Reliability

  • What does it mean to be reliable?

  • Idea to test with a testnet crash

Action items

@Kelly Foster to stub out tickets related to SLO for speed, security, and reliability
@Tomáš Virtus to write up a plan to achieve SLOs by mid-day Thurs
@Adam Szkoda and @Tom Vasile to review the SLO doc by EOD Thurs
@Kelly Foster to circulate the SLO plan with the team and create tickets

Decisions