Ethereum P2P Node Discovery and Routing
Ethereum has defined a comprehensive-looking protocol for node discovery and routing. It is described in the RLPx specification, and it is in some sense presented as an extension of RLP, Ethereum's serialization format. For node discovery and for routing, RLPx uses a subset of the Kademlia protocol (see Maymounkov & Mazières, described in Wikipedia), a distributed hash table developed for peer-to-peer file sharing networks. In particular, Ethereum's discovery protocol omits the STORE
and FIND_VALUE
RPCs, and simply uses the rest of the DHT to accumulate information about peers.
Nodes
Each node represents the other nodes it knows about by some chunk of useful information. This can include data used for connecting to the node as well as any metric information not otherwise represented by the table structure (such as average latency to that node). A node is identified by its cryptographic public key, and for reasons1 the 256-bit hash (sha3/Keccak-256) of the public key is used for the Kademlia distance metric.
Distance Metric
Let N0 and N1 be distinct nodes with node keys K0 and K1. The distance between N0 and N1 is the length of the common (binary) prefix of K0 and K1. This can be as simple as finding the first bit set in K0 ^ K1, where ^ is the bitwise XOR operation. As keys are meant to be randomly assigned, "distance" has nothing to do with geography: a node in China may be the nearest neighbor of one in Cuba.
Node Table
On a particular node N, peer nodes are stored in a table T. There 256 (ish) rows in the table, with each row's entries being of a distance from N under the above XOR metric. Row size is capped by a global parameter k, called redundancy. There is some implementation wiggle room in policies for node inclusion and eviction, which govern how the redundancy is maintained.
Who's stored in the routing table
Throughout the life of N, qualitative data about peers or N's link to its peers in T may be updated. For instance, if we're interested in latency-based routing, that information can be accumulated in the table. It's possible that T becomes useful beyond node routing, in which case even reputation data could be stored there.
Routing
Kademlia was developed for finding specific nodes (or nodes holding specific data), but there is not much call for that in Ethereum. However, it seems like a useful concept we would get for free for use with RChain, whose channel-based networking may eventually require that individual nodes be picked out. So, while the querying portion of the Kademlia protocol is useful to guarantee that N has a nice, full view of the network with plenty of nodes, but it can also find specific nodes in log2 n queries, where n is the bit-width of a key.
In order to decrease the number of hops between nodes, the table can be chunked in b-bit chunks, so that each row represents a difference in the ith group of b bits (instead of a difference in each successsive single bit). The RLPx protocol specifies b = 8, but it sure looks like the Go, Java, and Rust ethereum clients use b = 1 anyway. Clients using differing values of b can interoperate with no problems.
Discovery Protocol
While the RLPx protocol follows Kademlia pretty closely for discovery and maintaining lists of known nodes, Kademlia does not include much support for secure communication. RLPx augments this with a robust two-phase handshake protocol on first connection. Public keys are exchanged, and all further communication is encrypted. This document is going to skip most of the hard bits, as your author is no expert.
Bootstrapping
As in most peer-to-peer systems, at least one node H must be known up front. Our node N handshakes with H (which gets N added to H's peer table, if possible). If N queries H for N's own key, it will receive H's peers which are closest to N. After that, N can query those newly added nodes for other nodes in empty rows of its peer table to get a fuller picture of the network.
Maintaining Peers
The peer table is potentially updated every time a communication is received from a remote node, and this can be forced by periodic PING
/PONG
exchanges. Typically strategies for maintaining the peer lists make use of the observation from file-sharing statistics that nodes that have stayed active for a while will tend to stay active even longer, so retaining nodes that do respond is generally preferable to replacing them.
RLPx does not attempt to prescribe which nodes should be used as directly connect peers for the work of maintaining the blockchain. Those are typically fewer and chosen based on measured latency or other characteristics.
Summary
Using the RLPx system in EthereumJ, the Java Ethereum client, is likely the most straightforward way to get a network built. The RLP encoding scheme could be removed in favor of Protocol Buffers with some effort, but the rest of the network maintenance protocol would be more easily . . . borrowed . . . than developed. The Kademlia subset, along with of RLPx, along with the handshake protocol, provide all the required mechanisms of an RChain network. If peers for direct communication are chosen from the discovery peer list, the entirety of the peer-to-peer layer could be hidden from the RChain node code proper, with no further authentication machinery necessary.
1 I'm being facetious, but I cannot find an affirmative justification of this choice. I assume it is as simple as the fact that a key length of 256 provides plenty of space for a reasonable distance, whereas public keys are of unwieldy length. It could also allow the public key format to change independent of node discovery. It might be related to the EVM's native support for 256-bit integers. It could also have something to do with security, to avoid broadcasting public keys, for instance.