1. Introduction

1.1 Purpose of the system

The block DAG in the RChain architecture constantly increases in size as new blocks are produced. Because blocks are effectively unlimited in size, plainly keeping the whole DAG of full blocks in memory will become impossible once the DAG reaches a certain size. In order to handle huge DAGs, the node software should keep only the relevant parts of the DAG in memory while remaining able to access the other parts on request. This document describes how the DAG is stored on disk, the new API for working with the DAG, which parts are kept in memory, and how the other parts can be temporarily loaded into it. The document also keeps in mind the mechanism of checkpointing, a future improvement which falls outside the scope of this document. This storage does not, however, store blocks in their full form; block storage is covered in the Block Storage Specification.

The node writes complete block data to disk in order to minimize network synchronization across node restarts. If a node is down for extended periods, it will be forced to synchronize with other nodes in order to update its RBlocks data store, the BlockDAG, and the LatestMessages table for validators. The algorithm by which this update occurs and the network protocols enabling it are beyond the scope of this document. (See <TBD>)

1.2 Design goals

The design goals of the DAG storage include:

  1. Look up children, metadata and topological sorting in the DAG
  2. Insert new blocks into the DAG
  3. Provide quick access to the recent parts of the DAG
  4. Provide (perhaps slower) access to the older parts of the DAG
  5. Avoid keeping the whole DAG loaded in memory when it is too big
  6. Conceal from the user the fact that possibly not the whole DAG is present in memory
  7. Allow restoring the DAG on node startup

2. Current software architecture

As of now, the DAG is stored in memory only and is represented by the BlockDag class:


Code Block
languagescala
titleBlockDag
final case class BlockDag(
    childMap: Map[BlockHash, Set[BlockHash]],
    latestMessages: Map[Validator, BlockMessage],
    latestMessagesOfLatestMessages: Map[Validator, Map[Validator, BlockHash]],
    currentSeqNum: Map[Validator, Int],
    dataLookup: Map[BlockHash, BlockMetadata],
    topoSort: Vector[Vector[BlockHash]],
    sortOffset: Long
)


3. Proposed software architecture

3.1 Overview

It should be possible to restore every field of the BlockDag class on node startup. Some of them are already persisted, as described in the Latest Message Storage Specification:

  • latestMessages is exactly what is stored in the latest message storage
  • latestMessagesOfLatestMessages for a given validator can be derived from the justifications of the block returned by latestMessages for that validator (TODO: check)
  • currentSeqNum for a given validator can be derived from the seqNum of the block returned by latestMessages for that validator (TODO: check); a sketch of both derivations follows this list
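
The two derivations can be sketched as follows, assuming BlockMessage carries justifications (each holding a validator and a latestBlockHash) and a seqNum field; the helper names are illustrative:

Code Block
languagescala
titleDerivingFieldsFromLatestMessages
// Rebuild latestMessagesOfLatestMessages from the justifications of each
// validator's latest message.
def latestMessagesOfLatestMessages(
    latestMessages: Map[Validator, BlockMessage]
): Map[Validator, Map[Validator, BlockHash]] =
  latestMessages.map {
    case (validator, block) =>
      validator -> block.justifications.map(j => j.validator -> j.latestBlockHash).toMap
  }

// Rebuild currentSeqNum from the seqNum of each validator's latest message.
def currentSeqNum(latestMessages: Map[Validator, BlockMessage]): Map[Validator, Int] =
  latestMessages.map { case (validator, block) => validator -> block.seqNum }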

All other fields are persisted by the proposed API for DAG storage:

Code Block
languagescala
titleDAGStorage
trait DagStorage[F[_]] {
  def insert(block: BlockMessage): F[Unit]
  def children(blockHash: BlockHash): F[Option[Set[BlockHash]]]
  def lookup(blockHash: BlockHash): F[Option[BlockMetadata]]
  def topoSort(startBlockNumber: Int): F[Vector[Vector[BlockHash]]]
  def checkpoint(blockHash: BlockHash): F[Unit]
  def close(): F[Unit]
}

The only way to modify the DAG is by inserting a new block. The childMap field is represented by the children method, the dataLookup field by the lookup method, and the topoSort field by the topoSort method. sortOffset is an internal detail which the API does not expose to the user. Additionally, a new method checkpoint is introduced to let the user tell the DAG storage that the network has agreed that the provided block is a checkpoint.
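
A hypothetical usage sketch of the API (processNewBlock and the Monad constraint are illustrative, not part of the proposal):

Code Block
languagescala
titleDagStorageUsage
import cats.Monad
import cats.implicits._

// Insert a newly received block, then query the DAG through the API.
def processNewBlock[F[_]: Monad](dag: DagStorage[F], block: BlockMessage): F[Unit] =
  for {
    _        <- dag.insert(block)
    children <- dag.children(block.blockHash) // children known so far, if any
    metadata <- dag.lookup(block.blockHash)   // the block's BlockMetadata
    sorted   <- dag.topoSort(0)               // topological sort from block number 0
  } yield ()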

The proposed implementation of this API, DagFileStorage, stores the parts of the DAG between checkpoints on disk and keeps only the part of the DAG after the last checkpoint in memory.

Code Block
languagescala
titleDagFileStorage
final class DagFileStorage[F[_]: Sync: BlockStore] private (
    lock: Semaphore[F],
    checkpointsRef: Ref[F, List[Checkpoint]],
    currentFile: Resource[F, OutputStream],
    currentDagInfoRef: Ref[F, DagInfo]
) extends DagStorage[F] {
  private def loadDagInfo(path: Path): DagInfo
  private def loadCheckpoint(offset: Long): F[Option[DagInfo]] =
    for {
      checkpoints <- checkpointsRef.get
      neededCheckpoint = checkpoints.zipWithIndex.find {
        case (c, _) => c.start <= offset && offset < c.end
      }
      result <- neededCheckpoint match {
                 case None =>
                   None.pure[F]
                 case Some((checkpoint, i)) =>
                   checkpoint.dagInfo.flatMap(_.get) match {
                     case Some(dagInfo) =>
                       Some(dagInfo).pure[F]
                     case None =>
                       val loadedDagInfo = loadDagInfo(checkpoint.path)
                       val newCheckpoint =
                         checkpoint.copy(dagInfo = Some(WeakReference(loadedDagInfo)))
                       for {
                         _ <- checkpointsRef.set(checkpoints.patch(i, List(newCheckpoint), 1))
                       } yield loadedDagInfo
                   }
               }
    } yield result

  private def updateCurrentDagInfoRef(b: BlockMessage): F[Unit]

  def insert(block: BlockMessage): F[Unit] =
    for {
      _ <- lock.acquire
      result <- currentFile.use { out =>
                 val parents     = parentHashes(block)
                 val parentsSize = parents.foldLeft(0) { case (a, b) => a + b.size }
                 val buffer      = ByteBuffer.allocate(8 + 32 + 4 + parentsSize) // number + hash + parents count + parent hashes
                 buffer.putLong(blockNumber(block))
                 buffer.put(block.blockHash.toByteArray)
                 buffer.putInt(parents.length)
                 parents.foreach(parent => buffer.put(parent.toByteArray))
                 out.write(buffer.array()).pure[F]
               }
      _ <- updateCurrentDagInfoRef(block)
      _ <- lock.release
    } yield result
  def children(blockHash: BlockHash): F[Option[Set[BlockHash]]] =
    for {
      dagInfo <- currentDagInfoRef.get
      result <- dagInfo.childMap.get(blockHash) match {
                 case Some(children) =>
                   Option(children).pure[F]
                 case None =>
                   for {
                     blockOpt <- BlockStore[F].get(blockHash)
                     result <- blockOpt match {
                                case Some(block) =>
                                  val number = blockNumber(block)
                                  if (number >= dagInfo.sortOffset) {
                                    None.pure[F]
                                  } else {
                                    for {
                                      _          <- lock.acquire
                                      oldDagInfo <- loadCheckpoint(number)
                                      _          <- lock.release
                                    } yield oldDagInfo.flatMap(_.childMap.get(blockHash))
                                  }
                                case None => None.pure[F]
                              }
                   } yield result
               }
    } yield result
  def lookup(blockHash: BlockHash): F[Option[BlockMetadata]]
  def topoSort(startBlockNumber: Int): F[Vector[Vector[BlockHash]]]
  def checkpoint(blockHash: BlockHash): F[Unit]
  def close(): F[Unit]
}

object DagFileStorage {
  final case class DagInfo(
      childMap: Map[BlockHash, Set[BlockHash]],
      dataLookup: Map[BlockHash, BlockMetadata],
      topoSort: Vector[Vector[BlockHash]],
      sortOffset: Long
  )

  final case class Checkpoint(
      start: Long,
      end: Long,
      path: Path,
      dagInfo: Option[WeakReference[DagInfo]]
  )

  final case class Config(
      dagDirectoryPath: Path
  )

  def apply[F[_]](config: Config): DagFileStorage[F]
}

Some auxiliary classes are introduced in the DagFileStorage object:

  • DagInfo is a new version of the old BlockDag data structure with everything regarding latest messages removed from it
  • Checkpoint represents a single checkpoint stored on disk. start and end represent the range of block numbers the checkpoint stores. path points to the place on disk where the checkpoint data is stored. dagInfo is an optional weak reference to a loaded DagInfo representing this checkpoint. Initially, all checkpoints have this value set to None, but if the user requests data from this part of the DAG, it is loaded into memory. Afterwards, it can be reclaimed by the GC, since dagInfo holds the information only through a weak reference.
  • Config is the configuration necessary for instantiating DagFileStorage. It contains the parent directory for checkpoints and for the active dag storage file. This parent directory will be called the storage directory in this document.

DagFileStorage has the following constructor parameters:

  • lock is used for mutually exclusive access to the active dag storage file
  • checkpointsRef holds the list of known checkpoints located under the storage directory supplied by the configuration
  • currentFile is an output stream to the active dag storage file
  • currentDagInfoRef holds the part of the DAG after the last checkpoint (i.e. the same part which has been written to currentFile)

Draft implementations of insert and children can be seen in the code block above. insert always takes the lock since it writes the new block to the active dag storage file. On the other hand, children does not require taking the lock, since currentDagInfo can be pulled out of the atomic reference currentDagInfoRef and then safely used for looking up the provided block hash in currentDagInfo.childMap. If currentDagInfo does not know about the requested block, DagFileStorage looks the block up in the provided BlockStore and then looks up the block's children in the relevant checkpoint, determined by the block's number. The checkpoint might need to be loaded from disk in case it was not loaded previously or was reclaimed by the GC.
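
Note that the draft insert acquires and releases the lock in two separate steps, so the lock would not be released if the write fails. In the final implementation this should be bracketed; a sketch assuming cats-effect's Semaphore#withPermit (writeBlockToCurrentFile is a hypothetical helper standing in for the write shown in insert above):

Code Block
languagescala
titleWithLock
// withPermit acquires the single permit, runs the action, and releases the
// permit even if the action fails.
private def withLock[A](action: F[A]): F[A] =
  lock.withPermit(action)

// insert then becomes:
// def insert(block: BlockMessage): F[Unit] =
//   withLock(writeBlockToCurrentFile(block) *> updateCurrentDagInfoRef(block))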

3.2 Software/hardware mapping

3.2.1 Memory allocation

The details of checkpointing are not established yet, so some approximations will be given in this section. The number of children per block is not a fixed parameter either, and hence will also be approximated. One block hash is 32 bytes and, assuming a single block has about 5 children, the set of children for a block takes 5 * 32 bytes. Taking the overhead of Java object headers into account, a single entry in DagInfo.childMap takes about 312 bytes. BlockMetadata is composed of the block hash, sender, seqNum, parents and justifications. It requires (32 + 12) + (32 + 12) + 4 + ((32 + 12) * 5 + 8) + ((32 + 32 + 12 + 12) * 5 + 8) = 768 bytes to store one block's metadata. Hence, a single entry in DagInfo.dataLookup takes 812 bytes. topoSort adds an additional 44 bytes per block. In total, inserting a single block requires 312 + 812 + 44 = 1168 bytes.

Assuming the speed of the network is 10 blocks/second and checkpoints are made once a day, there will be 10 * 60 * 60 * 24 = 864000 blocks in one checkpoint. Since every block requires 1168 bytes to be stored, the total weight of a checkpoint is 864000 * 1168 bytes, or about 960 MiB. The same is true for every additional checkpoint loaded into memory.
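
The arithmetic above, reproduced as a sketch:

Code Block
languagescala
titleMemoryEstimate
// Per-block in-memory cost, using the approximations above.
val childMapEntry   = 312  // ~5 children * 32 bytes, plus headers and key overhead
val dataLookupEntry = 812  // 768-byte BlockMetadata plus key overhead
val topoSortEntry   = 44
val perBlock        = childMapEntry + dataLookupEntry + topoSortEntry // 1168 bytes

// One checkpoint at 10 blocks/s, checkpointing once a day.
val blocksPerCheckpoint = 10 * 60 * 60 * 24 // 864000
val checkpointBytes     = perBlock.toLong * blocksPerCheckpoint
val checkpointMiB       = checkpointBytes / (1024L * 1024) // ≈ 962 MiB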

3.2.2 Disk Memory

The dag storage layout is shown in detail in section 3.3; per that layout, it takes 12568 bytes to store a single block with 5 parents, 5 justifications and 300 validators. Storing a DAG consisting of 10M blocks would therefore require about 117 GiB of disk space.

Note: that is quite a lot and the main space consumer here is weightMap. Should we really have it in BlockMetadata?
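
The per-block figure follows directly from the layout in section 3.3 and shows why weightMap dominates:

Code Block
languagescala
titleDiskEstimate
// On-disk size per block for 5 parents, 5 justifications and 300 validators,
// following the layout in section 3.3.
val fixed          = 32 + 8 + 32 + 4     // blockHash + blockNumber + sender + seqNum
val parents        = 4 + 5 * 32          // length prefix + 5 hashes
val justifications = 4 + 5 * (32 + 32)   // length prefix + 5 (validator, hash) pairs
val weightMap      = 4 + 300 * (32 + 8)  // length prefix + 300 (validator, stake) pairs = 12004
val perBlock       = fixed + parents + justifications + weightMap // 12568 bytes

val totalBytes = perBlock.toLong * 10L * 1000 * 1000 // 10M blocks ≈ 125.7 GB ≈ 117 GiB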

3.3 Persistent data management

Currently BlockMetadata's implementation looks like this:

Code Block
languagescala
titleBlockMetadata
final case class BlockMetadata(
    blockHash: BlockHash,
    parents: List[BlockHash],
    sender: ByteString,
    justifications: List[Justification],
    weightMap: Map[ByteString, Long],
    seqNum: Int
)

This whole structure is persisted in the dag storage file using the following scheme:

  • blockHash - 32 bytes
  • blockNumber - 8 bytes
  • sender - 32 bytes
  • seqNum - 4 bytes
  • parentsLength - 4 bytes for the length of parents
  • parents - a sequence of 32 byte hashes and the length of the sequence is parentsLength
  • justificationsLength - 4 bytes for the length of justifications
  • justifications - a sequence of pairs of 32 byte validator public keys and 32 byte hashes, and the length of the sequence is justificationsLength
  • weightMapLength - 4 bytes for the length of weightMap
  • weightMap - a sequence of pairs of 32 byte validator public keys and 8 byte stakes, and the length of the sequence is weightMapLength

The structure is variable in size and hence requires some additional values for deterministic deserialization (parentsLength, justificationsLength and weightMapLength). blockNumber is not a part of BlockMetadata, but it is required for successfully rebuilding the topological sorting.
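
A sketch of a serializer for this layout, assuming hashes and public keys are 32-byte protobuf ByteStrings and that Justification carries validator and latestBlockHash fields; blockNumber is passed in separately since it is not a field of BlockMetadata:

Code Block
languagescala
titleBlockMetadataSerialization
import java.nio.ByteBuffer

def serialize(meta: BlockMetadata, blockNumber: Long): Array[Byte] = {
  val size = (32 + 8 + 32 + 4) +
    (4 + meta.parents.size * 32) +
    (4 + meta.justifications.size * (32 + 32)) +
    (4 + meta.weightMap.size * (32 + 8))
  val buffer = ByteBuffer.allocate(size)
  buffer.put(meta.blockHash.toByteArray)  // blockHash, 32 bytes
  buffer.putLong(blockNumber)             // blockNumber, 8 bytes
  buffer.put(meta.sender.toByteArray)     // sender, 32 bytes
  buffer.putInt(meta.seqNum)              // seqNum, 4 bytes
  buffer.putInt(meta.parents.size)        // parentsLength
  meta.parents.foreach(parent => buffer.put(parent.toByteArray))
  buffer.putInt(meta.justifications.size) // justificationsLength
  meta.justifications.foreach { j =>
    buffer.put(j.validator.toByteArray)
    buffer.put(j.latestBlockHash.toByteArray)
  }
  buffer.putInt(meta.weightMap.size)      // weightMapLength
  meta.weightMap.foreach {
    case (validator, stake) =>
      buffer.put(validator.toByteArray)
      buffer.putLong(stake)
  }
  buffer.array()
}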

3.5 Operational aspects

TODO

3.6 Global software control

Inserts are mutually exclusive, as they append data to the DAG storage file, and hence they acquire the lock. Other methods such as children, lookup and topoSort do not require acquiring the lock in general, but might do so in case they need to access an off-memory checkpoint.

3.7 Boundary conditions

  • On node startup, the metainformation of known checkpoints (start, end and path) is read from the provided directory, and the active DAG storage file is read into memory as a DagInfo; a sketch of checkpoint discovery follows this list
  • On node shutdown, currentFile is closed
  • On node error TODO: add CRC stuff
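
A sketch of checkpoint discovery on startup, assuming a hypothetical <start>-<end> file name scheme for checkpoint files (the actual naming is not fixed by this document):

Code Block
languagescala
titleCheckpointDiscovery
import java.nio.file.{Files, Path}
import scala.collection.JavaConverters._

// List the files under the storage directory and parse "<start>-<end>" names
// into Checkpoint entries; the dagInfo weak reference starts out empty.
def loadCheckpoints(storageDir: Path): List[Checkpoint] =
  Files
    .list(storageDir)
    .iterator
    .asScala
    .toList
    .flatMap { path =>
      path.getFileName.toString.split('-') match {
        case Array(start, end) if start.forall(_.isDigit) && end.forall(_.isDigit) =>
          List(Checkpoint(start.toLong, end.toLong, path, dagInfo = None))
        case _ => Nil
      }
    }
    .sortBy(_.start)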