Block Store Meeting notes

Date

Attendees

Goals

  • Re think the back end of the block store. Is LMDB the right technology?
  • Talk about recovery after a node goes down, and how do we want to rework the block dag data structure to facilitate that.

Discussion items

TimeItemWhoNotes
3 minBackgroundMichael Birch (Unlicensed)
  • During the conference, while working on the RSong demo, it was pretty easy to hit the map store limit.  We could just increase the limit.  That isn't a good long term solution.
  • We need to look at whether we want to memory map the entire block store.  Not all blocks will be accessed equally.
  • Talked with Nash while at the conference, this 
  • Should persist the block dag as well.  This speaks to the recovery story.  If you shut down a node and bring it back up, you should be able to pick up where you left off.

Algorithm
  • What are you looking for? Given some blocks, you want the blocks that are not in the all the ancestry.  We have to traverse the DAG sometimes, and traversing the dag is slow, we don't have to compute the greatest common ancestor, we can get the same information by identifying the non-common ancestor by using a breadth first search.  Kyle's search relies upon a topological sort of DAG.  You want a total order on the DAG, that is consistent with the partial order, and this doesn't re-arrange things that were there before.  

Storing the DAG on disk
  •  You don't want to have to load the entire DAG from all time into memory to answer the questions that casper wants to answer.  there has been discussions around checkpointing, where everyone agrees on the state up to that point.  Michael Birch (Unlicensed) - lots of the ground work for the feature is there.  
  • The files that contain your dag are going to grow unbounded  - and performance will degrade, until checkpointing is in place.  You will get to a place where you need to ignore a huge amount of past.  You can't expect to hold the whole thing in memory.  Simplest thing is to lean on the checkpointing feature.  If you checkpoint, then you can close the file and open another file.  You only deal with the data up to the checkpoint.  We can hold all the headers in memory, not the blocks themselves.  It's frequently the case that your working set is 2X of what you are actually working on.  Therefore, 2 checkpoints back from the current data.
  • We need to figure out is a topological sort of the dag that is stable under appends.  You need just 1 database for the DAG going in one direction, and another going in the other direction.
  • If we can flatten the dag to an array, then we can memory map.
  • for blocks, it's an indexed log.  You only ever need the data when you query it by hash. Blocks are large anyways.  
  • once the block data is older, it isn't accessed or read very often.  it's only used to compute the tuplespace, or to send to other nodes.  And we only do this 1X.
  • your index can be fixed length, because all your lookups are by block hash.  You need file offsets for where the files are in the store.  A binary search for this will work quite well.

Latest messages
  • Gets updated everytime we receive a block.  Assuming the block is valid.
  • This is a flat table, all of your keys, presently all in memory. This needs to be pushed to disk either intermittently, so that we can recover. 
  • How often are you doing lookups on this?  Michael says this happens a lot.  The safety oracle has to look at the latest messages, as well as the latest of latest.  It has to calculate agreement across the validators.  You are hitting this table n squared, where n is the number of validators.  And we hit this every time we get a new blocks.
  • We are trying to compute edges on a graph, which is an N squared thing.  

Clique Safety oracleFormer user (Deleted)
  • Kent describes how the Clique Safety Oracle builds an agreement graph.
  • The agreement graph is being constructed each time, for each block.
    • We could an order and update, instead of an order N squared.  Since we know which validator sent the last block.
  • Nash - consider that RAM and Disk as both I/O with different speeds.  Try to avoid N squared algorithms, 
  • Michael - we will need to probably persist the agreement graph.  Nash - there may not be a convenient flattening of it that you may like. 
  • Nash - what works the best for I/O - if you want to flatten your data structure, to remove extra references, if you can avoid fragmenting memory, Linux will memory map it and push it to disk for you.  It does a very good job of this.  This doesn't work for all data structures.  Sometimes the most important thing are the pointers.  In the case of an arbitrary graph, the pointers are the most important thing.  The next best thing you can do is copy it, checkpoint it to disk, and do this in the background.  



  • Suspect that most of the data structures can be flattened.  Not sure how it works in Java.  Once flattened, it should be easy to make this all work.

Work items
  • Indexed file & hash lookup - which will replace the existing block store
  • storing just the DAG (headers) that include the hash pointers as something that is memory mapped - replaces the other half of the existing block store
    • sort that is stable
  • re-write the safety oracle such that it is doing updates 
  • persist the data structure from the oracle in such a way that it can be recovered.
  • Process
    • Nash proposes that we spend some time to build design documents that very explictly lays out what we are going to do.  In the mean time you work on building the next 2 releases.  
    • Once we have the design documents written, we will review them, and we will have a solid understanding of what each of these items will take.
    • Then we can add the work to release 9.
  • Will have to be to create a gRPC API to request the state from a node
  • Periodically send out the latest block




Action items

  •