2018-06-27 Meeting notes: debrief from 20180626 RChain community RNode testing

Date

Jun 27, 2018

Attendees

  • @Kelly Foster

  • @Medha Parlikar (Unlicensed)

  • @Pawel Szulc (Unlicensed)

  • @Michael Birch (Unlicensed)

Goals

  • We will use this time to review today's community testing session and ensure that findings that need addressing are captured in JIRA.

Discussion items

Item

Notes

Dropped blocks

  • Here's what happened in a network of 6 peers

We have 1 dropped block - seems we need to retry

Note about block issue: Paolo didn't get the right block. Log excerpt:

14:51:26.492 [grpc-default-executor-15] WARN coop.rchain.casper.Validate$ - CASPER: Ignoring block 71adf07ed5... because block creator 0968e49f98... has 0 weight.
14:51:26.506 [grpc-default-executor-15] INFO c.rchain.casper.MultiParentCasper$ - CASPER: New fork-choice tip is block 7fafe8e720....
14:53:28.576 [grpc-default-executor-7] WARN coop.rchain.casper.Validate$ - CASPER: Ignoring block 80be0898af... because block creator 0968e49f98... has 0 weight.
14:53:28.578 [grpc-default-executor-7] INFO c.rchain.casper.MultiParentCasper$ - CASPER: New fork-choice tip is block 7fafe8e720.

  • Michael is reasonably confident this is something in the transport layer and not Casper

  • Pawel thinks timeout may have been a factor (broadcast and send)

  • Timeout

    • is currently fixed

    • IDEA - extend the time with each subsequent resend

  • IDEA - Transport layer to support a new method that reports which peers received a message and which did not. This method will send in parallel and wait for receipt confirmation from each peer.

    • This may not be needed if the system currently in place works (a block is proposed, at least one peer receives it, and that peer shares it around the network).

  • QUESTION

    • What if, in a network of 10 peers, only 1 peer receives a block and then promptly dies? The network would then lose all information about the block.

    • Is it possible that the gRPC implementation sets the timeout based on the internet connection?

  • DECISIONS - see action items below
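Sketch of the "extend the time with each subsequent resend" idea above, in Python for illustration only (RNode itself is not written in Python). All names here are hypothetical: `resend_timeouts`, `send_with_growing_timeout`, and the `send` callable are assumptions, not part of the RNode transport layer. The growth factor and cap are placeholder values, not agreed settings.

```python
from typing import Callable, List, Optional


def resend_timeouts(base_seconds: float, factor: float,
                    max_attempts: int, cap_seconds: float) -> List[float]:
    """Timeout for each attempt: grows by `factor` per resend, limited by the cap."""
    return [min(base_seconds * factor ** attempt, cap_seconds)
            for attempt in range(max_attempts)]


def send_with_growing_timeout(send: Callable[[float], bool],
                              timeouts: List[float]) -> Optional[int]:
    """Call the hypothetical `send(timeout)` until it reports success.

    `send` is assumed to return True when the peer confirmed receipt
    within the given timeout. Returns the 1-based attempt number that
    succeeded, or None if every attempt timed out.
    """
    for attempt, timeout in enumerate(timeouts, start=1):
        if send(timeout):
            return attempt
    return None
```

With a base of 1 second, a factor of 2, and a 5 second cap, four attempts would wait 1, 2, 4, and 5 seconds. The cap keeps a persistently unreachable peer from stalling the sender for arbitrarily long.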

User set up

  • Goal to have bootstrap node running 24 hours prior to testing 

  • Need to ensure a merge to master every Monday

Tickets

  • Throw an error in the event that a validator key has 0 weight (Michael)

Action items

@Pawel Szulc (Unlicensed) and @Michael Birch (Unlicensed) collaborate to attempt to recreate the issue using the mini-testnet and other machines 27 June at 8am EDT/3pm CET (@Kelly Foster to schedule)
@Pawel Szulc (Unlicensed) update so that a node can connect only if it can act as both a server and a client
@Pawel Szulc (Unlicensed) implement safe broadcast - send messages in parallel and maintain a tuple for receipt of distribution
@Pawel Szulc (Unlicensed) to talk with Nash about the issue of understanding reliably that messages are successfully sent across the network.
@Pawel Szulc (Unlicensed) to think about what can be done to improve timeout experience (expand timeout capacity and/or smart timeout learning)
@Kelly Foster ensure the bootstrap node for testing is working 24 hours in advance
@Kelly Foster shepherd Monday merge to master each week
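
Sketch of the safe broadcast action item, in Python for illustration only. It assumes a hypothetical `send(peer, timeout_seconds)` callable that returns True when the peer confirms receipt. `safe_broadcast` and its parameters are made-up names, not the RNode API. It sends to all peers in parallel and returns a tuple of (peers that confirmed, peers that did not).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple


def safe_broadcast(peers: List[str],
                   send: Callable[[str, float], bool],
                   timeout_seconds: float) -> Tuple[List[str], List[str]]:
    """Send to every peer in parallel and report who confirmed receipt.

    `send` is a hypothetical transport call. An exception raised by
    `send` is treated the same as a missing confirmation.
    """
    def try_send(peer: str) -> bool:
        try:
            return bool(send(peer, timeout_seconds))
        except Exception:
            return False

    if not peers:
        return [], []

    # One worker per peer so a slow peer does not delay the others.
    with ThreadPoolExecutor(max_workers=len(peers)) as pool:
        results = list(pool.map(try_send, peers))  # results keep peer order

    received = [p for p, ok in zip(peers, results) if ok]
    not_received = [p for p, ok in zip(peers, results) if not ok]
    return received, not_received
```

The caller can then resend only to the peers in the second list, or raise an alarm when the first list is empty.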