1. Introduction
1.1 Purpose of the system
The block DAG in the RChain architecture is constantly increases in size due to production of new blocks. As blocks are effectively unlimited in size, plainly storing the whole DAG consisting out of full blocks in memory will not be possible once DAG reaches a certain size. In order to handle huge DAGs, node software should be able to keep only relevant parts of the DAG in memory while maintaining an ability to access the other parts of the DAG if requested. This document describes how the DAG is stored on disk, new API for working with the DAG, which parts are kept in memory and how the other parts can temporarily loaded into it. The document also keeps in mind the mechanism of checkpointing - a future improvement which falls out of the scope of this document. This storage however does not store blocks in their full form, block storage is covered in Block Storage Specification.
The node writes complete block data to disk, in order to minimize network synchronization across node restarts. If a node is down for extended periods, it will be forced to perform synchronization with others nodes in order to update both its RBlocks data store, the BlockDAG, and the LatestMessages table for validators. The algorithm by which this update occurs and the network protocols enabling it are beyond the scope of this document. (See <TBD>)
1.2 Design goals
The design goals of the DAG storage include:
- Do children, metadata and topological sorting lookups in the DAG
- Do inserts of new blocks into the DAG
- Provide quick access to the recent parts of the DAG
- Provide (perhaps slower) access to the older parts of the DAG
- Do not keep the whole DAG loaded into memory when it is too big
- Conceal the fact that possibly not whole DAG is presented in the memory from the user
- An ability to restore the DAG on node startup
2. Current software architecture
As for now DAG is stored in-memory only and represented asBlockDag
class:Code Block | ||||
---|---|---|---|---|
| ||||
final case class BlockDag( childMap: Map[BlockHash, Set[BlockHash]], latestMessages: Map[Validator, BlockMessage], latestMessagesOfLatestMessages: Map[Validator, Map[Validator, BlockHash]], currentSeqNum: Map[Validator, Int], dataLookup: Map[BlockHash, BlockMetadata], topoSort: Vector[Vector[BlockHash]], sortOffset: Long ) |
3. Proposed software architecture
3.1Overview
It should be possible to restore every field of the BlockDag
class on node startup. Some of them are already persisted by Latest Message Storage Specification:
latestMessages
is exactly what is stored in latest message storagelatestMessagesOfLatestMessages
for a given validator can be derived fromjustifications
of the block returned bylatestMessages
for the given validator (TODO: check)currentSeqNum
for a given validator can be derived fromseqNum
of the block returned bylatestMessages
for the given validator (TODO: check)
All other fields are persisted by the proposed API for DAG storage:
Code Block | ||||
---|---|---|---|---|
| ||||
trait DagStorage[F[_]] { def insert(block: BlockMessage): F[Unit] def children(blockHash: BlockHash): F[Option[Set[BlockHash]]] def lookup(blockHash: BlockHash): F[Option[BlockMetadata]] def topoSort(startBlockNumber: Int): F[Vector[Vector[BlockHash]]] def checkpoint(blockHash: BlockHash): F[Unit] def close(): F[Unit] } |
The only way to modify the DAG is by inserting a new block. The childMap
field is represented by children
method, the dataLookup
field is represented by lookup
method and toposort
field is represented by toposort
method. sortOffset
is an internal detail and API does not expose it to the user. Additionally, a new method checkpoint
is introduced to provide the user an ability to tell the DAG that the network agreed that the provided block is a checkpoint.
The proposed implementation DagFileStorage
of this API stores parts of the DAG between checkpoints on disk and only keeps a part of the DAG after the last checkpoint in memory.
Code Block | ||||
---|---|---|---|---|
| ||||
final class DagFileStorage[F[_]: Sync: BlockStore] private ( lock: Semaphore[F], checkpointsRef: Ref[F, List[Checkpoint]], currentFile: Resource[F, OutputStream], currentDagInfoRef: Ref[F, DagInfo] ) extends DagStorage[F] { private def loadDagInfo(path: Path): DagInfo private def loadCheckpoint(offset: Long): F[Option[DagInfo]] = for { checkpoints <- checkpointsRef.get neededCheckpoint = checkpoints.zipWithIndex.find { case (c, _) => c.start <= offset && offset < c.end } result <- neededCheckpoint match { case None => None.pure[F] case Some((checkpoint, i)) => checkpoint.dagInfo.flatMap(_.get) match { case Some(dagInfo) => Some(dagInfo).pure[F] case None => val loadedDagInfo = loadDagInfo(checkpoint.path) val newCheckpoint = checkpoint.copy(dagInfo = Some(WeakReference(loadedDagInfo))) for { _ <- checkpointsRef.set(checkpoints.patch(i, List(newCheckpoint), 1)) } yield loadedDagInfo } } } yield result private def updateCurrentDagInfoRef(b: BlockMessage): F[Unit] def insert(block: BlockMessage): F[Unit] = for { _ <- lock.acquire result <- currentFile.use { out => val parents = parentHashes(block) val parentsSize = parents.foldLeft(0) { case (a, b) => a + b.size } val buffer = ByteBuffer.allocate(8 + 32 + parentsSize) buffer.putLong(blockNumber(block)) buffer.put(block.blockHash.toByteArray) buffer.putInt(parents.length) parents.foreach(parent => buffer.put(parent.toByteArray)) out.write(buffer.array()).pure[F] } _ <- updateCurrentDagInfoRef(block) _ <- lock.release } yield result def children(blockHash: BlockHash): F[Option[Set[BlockHash]]] = for { dagInfo <- currentDagInfoRef.get result <- dagInfo.childMap.get(blockHash) match { case Some(children) => Option(children).pure[F] case None => for { blockOpt <- BlockStore[F].get(blockHash) result <- blockOpt match { case Some(block) => val number = blockNumber(block) if (number >= dagInfo.sortOffset) { None.pure[F] } else { for { _ <- lock.acquire oldDagInfo <- loadCheckpoint(number) _ <- lock.release } yield oldDagInfo.flatMap(_.childMap.get(blockHash)) } case None => None.pure[F] } } yield result } } yield result def lookup(blockHash: BlockHash): F[Option[BlockMetadata]] def topoSort(startBlockNumber: Int): F[Vector[Vector[BlockHash]]] def checkpoint(blockHash: BlockHash): F[Unit] def close(): F[Unit] } object DagFileStorage { final case class DagInfo( childMap: Map[BlockHash, Set[BlockHash]], dataLookup: Map[BlockHash, BlockMetadata], topoSort: Vector[Vector[BlockHash]], sortOffset: Long ) final case class Checkpoint( start: Long, end: Long, path: Path, dagInfo: Option[WeakReference[DagInfo]] ) final case class Config( dagDirectoryPath: Path ) def apply[F[_]](config: Config): DagFileStorage[F] } |
Some auxiliary classes are introduced in DagFileStorage
object:
DagInfo
is a new version of the oldBlockDag
data structure with everything regarding latest messages removed from itCheckpoint
represents a single checkpoint stored on disk.start
andend
represent the range of block numbers the checkpoint stores.path
points to the place where on disk the checkpoint data is stored.dagInfo
is an optional weak reference to a loadedDagInfo
representing this checkpoint. Initially, all checkpoint have this value set toNone
, but, if user will require some data from this part of the DAG, it will be loaded into memory. Afterwards, it can be reclaimed by the GC sincedagInfo
stores information in a weak reference.Config
is a configuration necessary for instantiatingDagFileStorage
. It contains the parent directory for checkpoints and for the active dag storage file. This parent directory will be called storage directory in this document.
DagFileStorage
has the following constructor parameters:
lock
is used for mutually exclusive access to the active dag storage filecheckpointsRef
is a list of known checkpoints located under storage directory supplied by configurationcurrentFile
is an output stream to the active dag storage filecurrentDagInfoRef
is a part of the DAG after the last checkpoint (i.e. the same part which has been written tocurrentFile
)
Draft implementations of insert
and children
can be seen in the code block above. insert
always takes the lock since it will write the new block to the active dag storage file. On the other hand, children
does not require taking the lock since currentDagInfo
can be pulled out of atomic reference currentDagInfoRef
and then safely used for looking up the provided block hash in currentDagInfo.childrenMap
. If currentDagInfo
does not know about the requested block, DagFileStorage
looks the block up in the provided BlockStore
and then looks up the block's children in the relevant checkpoint determined by block's number. Checkpoint might require loading from disk in case it was not loaded previously or was reclaimed by GC.
3.2 Software/hardware mapping
3.2.1 Memory allocation
The details of checkpointing are not established yet, so some approximations will be given in this section. Amount of children blocks for a block is also not a fixed parameter and hence will be approximated. One block hash is 32 bytes and, assuming a single block has about 5 children, set of children for a block is 5 * 32 bytes. Taking Java headers' overhead into account, a single entry in DagInfo.childrenMap
takes about 312 bytes. BlockMetadata
is composed out of block hash, sender, seqNum, parents and justifications. It will require (32 + 12) + (32 + 12) + 4 + ((32 + 12) * 5 + 8) + ((32 + 32 + 12 + 12) * 5 + 8) = 768 bytes to store one block metadata. Hence, a single entry in DagInfo.dataLookup
takes 812 bytes. topSort
adds additional 44 bytes per block to that amount. In total, inserting a single block would require 312 + 812 + 44 = 1168 bytes.
Assuming the speed of the network is 10 blocks/second and checkpoints are done once in a day, there will be 10 * 60 * 60 * 24 = 864000 blocks in one checkpoint. Since every block requires 1168 bytes to be stored, total weight of a checkpoint is about 940 MiB. The same is true for every additional checkpoint loaded into memory.
3.2.2 Disk Memory
Dag storage layout is shown in detail in section 3.3 and it takes 12408 bytes to store a single block with 5 parent, 5 justifications and 300 validators. Storing a DAG consisting out of 10M blocks would require 115 GiB of disk space.
Note: that is quite a lot and the main space consumer here is weightMap
. Should we really have it in BlockMetadata
?
3.3Persistent data management
Currently BlockMetadata
's implementation looks like this:
Code Block | ||||
---|---|---|---|---|
| ||||
final case class BlockMetadata( blockHash: BlockHash, parents: List[BlockHash], sender: ByteString, justifications: List[Justification], weightMap: Map[ByteString, Long], seqNum: Int ) |
This whole structure is persisted in the dag storage file using the following scheme:
blockHash
- 32 bytesblockNumber
- 8 bytessender
- 32 bytesseqNum
- 4 bytesparentsLength
- 4 bytes for the length ofparents
parents
- a sequence of 32 byte hashes and the length of the sequence isparentsLength
justificationsLength
- 4 bytes for the length of justificationsjustifications
- a sequence of pairs of 32 validator public keys and 32 byte hashes, and the length of the sequence isjustificationsLength
weightMapLength
- 4 bytes for the length ofweightMap
weightMap
- a sequence of pairs of 32 byte validator public keys and 8 byte stakes, and the length of the sequence isweightMapLength
The structure is variable in size and hence requires some additional values for deterministic deserialization (parentsLength
, justificationsLength
and weightMapLength). blockNumber
was not a part of BlockMetadata
, but it is required for successful rebuilding of topological sorting.
3.5 Operational aspects
TODO3.6 Global software control
Inserts are mutually exclusive as they require appending data to the DAG storage file and hence they are acquring the lock. Other methods such as children
, lookup
and topoSort
do not require acquiring the lock in general, but might do so in case they need to access some off-memory checkpoint.
3.7 Boundary conditions
- On node startup known checkpoint metainformation (`start`,
end
andpath
) are read from the provided directory and the active DAG storage file is read into memory asDagInfo
- On node shutdown closes
currentFile
- On node error TODO: add CRC stuff