/
Road to stability of communications module

Road to stability of communications module

Communication module is facing some challenges, some of them were known in the past, majority of them were recently discovered. Issues can not be simply put into single category, they relate to infrastructure challenges (going through NAT), coupling issues that trigger conceptual/business issues, implementation errors & lack of fear by having untested / not documented code.  They also come with different level of priority. The idea behind this document is to gather all issues that comm is facing, building up a roadmap to bring stability to the module.


Key Summary T Due Assignee P Status Description
RCHAIN-2077 Investigate and test - check how Kademlia reponds when given two nodes with same ID but different IP addresses Sub-task Pawel Szulc Medium Done

Done when:

  • tests are in place to test the Kademlia code base
    *
RCHAIN-2076 Investiage and test how kademlia behaves when node is not responding to heart beats. Sub-task Pawel Szulc Medium Done
RCHAIN-1959 Write initial set of tests for kademlia Story Pawel Szulc Medium Done
RCHAIN-1952 Simulate transport layer: Kademlia test infrastructure Story Pawel Szulc Medium Done

Create the test infrastructure for Kademlia.

  • Create a simulation of a transport layer for use in kademlia tests, so that we can model transport layer changes.
RCHAIN-1906 Remove Disconnet from NodeDiscovery.Protocol Story Pawel Szulc Medium Done

Node Discovery should not rely on external sources to learn that given node disconnected from the network. Disconnect messages seems as an artifical abstraction that does not bring any value, neither were it is defined (NodeDiscovery) nor where it is being used (RChain protocol)

RCHAIN-1905 Any upstream should be removed from NodeDiscovery.Protocol Story Pawel Szulc Medium Done

TBD

RCHAIN-1891 Remove Messages Hello and Disconnect are not used in the RChain protocol Story Pawel Szulc Medium Done
RCHAIN-1883 Remove method disconnect from TransportLayer Story Pawel Szulc Medium Done

Method disconnect should never be part of TransportLayer abstraction. It is TcpTransportLayer implementation detail. It must be removed from the abstraction. It should be called externally when TcpTransportLayer realises that given tcp connection is no longer valid. TcpTransportLayer should not rely on external algorithms to inform it, that given tcp connection should be removed. It should be able to handled that on its own.

RCHAIN-1882 Type of messages received and send by TransportLayer should be generic Story Pawel Szulc Medium Done

More details to be provided!

Signature change from

```
trait TransportLayer[F[_]] {
def roundTrip(peer: PeerNode, msg: Protocol, timeout: FiniteDuration): F[CommErr[Protocol]]
```

to

```
trait TransportLayer[F[_], P] {
def roundTrip(peer: PeerNode, msg: P, timeout: FiniteDuration): F[CommErr[P]]
```

RCHAIN-1815 Remove mutable List from Kademlia Story Pawel Szulc Medium Done
RCHAIN-1807 Introduce ApplicativeAsk for NodeConfig Story Pawel Szulc Medium Done
RCHAIN-1804 Introduce NodeConfig and provide it via ApplicativeAsk Story Pawel Szulc Medium Done
RCHAIN-1794 Move whoami from conf to node or comm Story Sebastian Bach Low Done

This improves the readability of the code. Low priority.

RCHAIN-1737 Integration tests for TcpTransportLayer Story Sebastian Bach Medium Done

Minimum set integration tests needed for TcpTransportLayer:

***** roundTrip is send and response is received if message is handled
***** roundTrip is send and error is received if message is not handled
***** roundTrip is send and error is received if message is not handled
***** send is received by peer
***** broadcast is received by all peers
***** roundTrip, send, broadcast return error if TransportLayer.shutdown was called
***** receive should ignore incomming messages once TransportLayer.shutdown was called

RCHAIN-1706 Investiage / modify NodeDiscovery to forget nodes that don't respond Story Sebastian Bach High Done

What needs to be done:

  • Investigate what is going on
  • Determine how we can remove old nodes from the list.
  • test the functionality.
RCHAIN-863 Reading and writing files takes ages, not deterministic behaviour, seems to disappear after reboot Bug Sebastian Bach High Done
RCHAIN-398 Redefine boundary between TransportLayer, NodeDiscovery and RChain protocol Story Pawel Szulc High Done

This ticket is based on some discussion made in https://rchain.atlassian.net/wiki/spaces/CORE/pages/500367432/Communications+Module+-+current+state+and+challenges - please read it through before reading rest of this ticket description.

Currently NodeDiscovery is toupled with Rchain protocol which. RChain adds elements to table defined in NodeDiscovery (via addNode) and NodeDiscovery adds and removes those elements according to its own internal algorithms.

This leads to following errors:
1. nodes can send messages to each other even if they did NOT exchanged messages according to RCHain Protocol handshake
2. Node Discovery (Kademlia) will add nodes to its own tabel of peers without going through the RCHain Protocol handshake (will bypass)

The main root of evil here is the fact that NodeDiscovery *shares* same table of peers with RChain Protocol.

Right now when RChain Protocol:

1. can be called from the out side world via `connectToBootstrap` (goes through `connect` procedure for given bootstrap
2. asks periodically `findMorePeers` from Node Discovery (goes through `connect` procedure for each node returned

The way `connect` works is described in `ConnectSpec.scala`. In essence it sends message ProtocolHandshake and once received ProtocolhandshakeResponse it adds the given node to kademlia peer table. On the other side, upon receiving ProtocolHandshake message, ProtocolHandshakeResponse is being send back, and incoming node is added to Kademlia peer table.

Node Discovery (kademlia) abstracts concept of node attaching itself to existing P2P network and continuously learning about new nodes. The way that Kademlia works right now is following: it waits for outside calls like `addNode` and `findMorePeers` and triggers communication with the outside world if needed (sending Loopkup and Ping messages)

Node Discovery should be totally independent protocol, with only single method AddMe available as an entry point (sent once to connect given node to P2P network). Once that method is sent, NodeDiscovery should continuously explor the P2P trying to learn about neighbour peers and at the same time exposing single method listPeers. That method should allow listing all nodes that Node Discovery learned while runnings. Node Discovery should be totally abstracted away from anything build on top of it (like RChain protocol).
RChain protocol should hold list of its own peers (not tight to kademlia list of peers) and use Node Discovery to learn about new nodes in the network.

What should be done described in subtasks of this issue.


Related content