Node Connectivity Issues

The Problem

TL;DR: With the new TCP-based transport layer, users who are behind "complicated" NATs have no easy way of running rnode.

In a P2P network, each rnode acts as both a client and a server.

As a client, a node has no issues communicating with other nodes, regardless of how complicated the topology of NATs between the node and the outside world is. This is possible because NATs let the server's SYN-ACK through while the TCP connection is being established: the outgoing SYN creates a mapping in each NAT along the path, and the reply matches that mapping. Once a TCP connection is established, the NATs allow packets to flow on that connection.

The problem occurs when the node acts as a server, because it has to be "visible" to the outside world. Currently we support UPnP to open the ports on routers for us. This approach is limiting, however. It requires the router to support UPnP (which most modern routers do), but if the user's router is behind yet another router (which is a fairly common topology), then UPnP will open the port on the internal router but not on the external one.

RNode will work correctly only if:

the node has a public static IP and is not behind any NAT
the node is behind a single NAT and that NAT allows UPnP
the node is behind a single NAT and the ports are opened manually on that NAT (a.k.a. port forwarding)

In any other scenario, the user will not be able to connect to the P2P network, as their node's server will not be visible to the outside world.

Requirement:

The RChain communications layer has to work 'anywhere' Ethereum does, or more specifically, anywhere the Mist implementation (which uses the geth node) works.

Reference documentation: http://ethdocs.org/en/latest/network/connecting-to-the-network.html

RLPx Protocol: https://github.com/ethereum/devp2p/blob/master/rlpx.md

Please note that we were implementing something similar to the RLPx Protocol, but we moved away from that approach with the TCP-based transport layer.

Why didn't the problem occur with the UDP based Transport Layer?

The UDP Transport Layer used a single socket for both incoming and outgoing communication, and this is what made it work. With UDP, once a packet has been sent from the client to the server, most NATs deliver everything the server sends back to the client through the exact same public port of the NAT (the one from which the first packet was sent). No port forwarding is needed.

To illustrate this with an example:

Given two nodes A and B, each running behind its own NATs (possibly several of them)
Node A sends a packet to B on port 30304
Once that happens, that port is opened on A's NAT
Node B sends a packet back to A on port 30304; since that is exactly the same port that was used for the outgoing communication, the packet is delivered to A
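
The exchange above can be sketched on a single machine. This is only an illustration, not rnode code: both "nodes" run on loopback with no NAT in between, the ports are picked by the OS instead of being 30304, and the names udp_round_trip, node_a and node_b are made up. What it does show is that each node uses one UDP socket for both sending and receiving, and that B replies to the source address it observed, which is what lets the reply travel back through the mapping opened on A's NAT.

```python
import socket


def udp_round_trip(payload: bytes) -> bytes:
    """Send payload from node A to node B and return B's reply as seen by A."""
    node_a = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    node_b = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # One socket per node, used for both directions.
        node_a.bind(("127.0.0.1", 0))
        node_b.bind(("127.0.0.1", 0))
        node_a.settimeout(2.0)
        node_b.settimeout(2.0)

        # A sends first; behind a NAT this is what would open the mapping.
        node_a.sendto(payload, node_b.getsockname())

        # B answers to the address the packet appeared to come from.
        data, observed_source = node_b.recvfrom(2048)
        node_b.sendto(b"ack:" + data, observed_source)

        # A receives the reply on the same socket it sent from.
        reply, _ = node_a.recvfrom(2048)
        return reply
    finally:
        node_a.close()
        node_b.close()
```

Calling udp_round_trip(b"ping") returns the reply that B sent back over the same pair of sockets.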

Interestingly, that approach had a potential bug: "Since the UDP transport protocol provides NATs with no reliable, application-independent way to determine the lifetime of a session crossing the NAT, most NATs simply associate an idle timer with UDP translations, closing the hole if no traffic has used it for some time period."
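
The idle timer from the quote can be modelled in a few lines. The class below is a toy model, not a description of any real NAT: the name NatUdpMapping, the 30 second timeout and the explicit "now" argument are all assumptions chosen for illustration. It shows why a node that stays silent for too long would stop receiving packets, and why some form of periodic keepalive traffic would have been needed.

```python
class NatUdpMapping:
    """Toy model of a NAT that closes idle UDP holes (all values hypothetical)."""

    def __init__(self, idle_timeout_s: float = 30.0):
        self.idle_timeout_s = idle_timeout_s
        self.last_activity = {}  # port -> time of the last packet seen

    def outbound(self, port: int, now: float) -> None:
        # An outgoing packet opens the hole or refreshes its timer.
        self.last_activity[port] = now

    def inbound(self, port: int, now: float) -> bool:
        # An incoming packet passes only if the hole is still open.
        last = self.last_activity.get(port)
        if last is None or now - last > self.idle_timeout_s:
            self.last_activity.pop(port, None)  # hole is closed
            return False
        self.last_activity[port] = now  # traffic keeps the hole open
        return True
```

In this model, a reply arriving 10 seconds after the last packet gets through, while one arriving 35 seconds after it is dropped.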

Symptoms of the Problem:

Without a public IP address, users are unable to establish two-way communication with the bootstrap node. These cases have been filed as bugs, but they are not really bugs in the technical sense: the node behaves correctly by refusing to establish communication when the two nodes cannot reach each other.

CORE-785
CORE-784
CORE-786
CORE-782
CORE-783

What can be done?

We could abandon TCP and go back to UDP, but we have already explored that path. The problems that would arise include (among many others):

running our own encryption protocol
synchronizing (splitting and reassembling) messages that are bigger than the maximum allowed size of a UDP packet
simulating synchronous communication for the roundTrip method of the TransportLayer
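
To give a feel for the second item, here is a minimal sketch of splitting a message into datagram-sized chunks and putting it back together. It is not the code rnode used: the 8-byte header layout, the function names and the 1200-byte default chunk size are assumptions, and reassemble assumes that all datagrams passed to it belong to one message. A real implementation would also need retransmission, timeouts and duplicate handling, which is a large part of why this path is costly.

```python
import struct
from typing import Iterable, List, Optional

# Hypothetical header: message id, chunk index, total number of chunks.
HEADER = struct.Struct("!IHH")


def split_message(message_id: int, payload: bytes, max_chunk: int = 1200) -> List[bytes]:
    """Split payload into datagrams that each carry at most max_chunk payload bytes."""
    chunks = [payload[i:i + max_chunk] for i in range(0, len(payload), max_chunk)] or [b""]
    if len(chunks) > 0xFFFF:
        raise ValueError("payload needs more chunks than the header can count")
    return [
        HEADER.pack(message_id, index, len(chunks)) + chunk
        for index, chunk in enumerate(chunks)
    ]


def reassemble(datagrams: Iterable[bytes]) -> Optional[bytes]:
    """Rebuild one message from its datagrams, in any order; None if chunks are missing."""
    parts = {}
    expected = None
    for datagram in datagrams:
        _, index, count = HEADER.unpack_from(datagram)
        expected = count
        parts[index] = datagram[HEADER.size:]
    if expected is None or len(parts) != expected:
        return None  # at least one chunk was lost; the caller must recover somehow
    return b"".join(parts[i] for i in range(expected))
```

Because UDP may reorder or drop datagrams, reassemble accepts chunks in any order and reports a missing chunk by returning None instead of a partial message.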

The alternative is to find a way to "open" ports on the NATs (regardless of how many there are along the way) for the communication. There is a great paper called "Peer-to-Peer Communication Across Network Address Translators" that explains the problem in a nutshell and also provides possible solutions.

It seems that there are basically two alternatives that we could possibly use for TCP:

Relaying
TCP Hole Punching

A) Relaying

Relaying is an approach that has many cons and a single pro: it will work on almost any NAT configuration. In a nutshell, the technique requires an external server that both node A and node B establish a connection with. All communication then goes through that server.

If node A wants to send a message to node B, that message goes through the relaying server. This obviously has many disadvantages; the most obvious ones that I see are the single point of failure and the enormous traffic on the server.
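
A relay can be sketched with plain sockets. The code below is a minimal illustration under several assumptions: it runs on loopback, it pairs the first two peers that connect, and the name start_relay is made up. A real relay would need peer identification, authentication and proper cleanup, none of which is shown here.

```python
import socket
import threading


def start_relay():
    """Start a loopback relay that pipes bytes between the first two peers.

    Returns the (host, port) address the relay listens on.
    """
    listener = socket.create_server(("127.0.0.1", 0))

    def pipe(src, dst):
        # Copy bytes one way until src closes, then signal end-of-stream to dst.
        try:
            while True:
                data = src.recv(4096)
                if not data:
                    break
                dst.sendall(data)
            dst.shutdown(socket.SHUT_WR)
        except OSError:
            pass  # one side went away; nothing more to forward

    def serve():
        # Both peers dial out to the relay, so neither needs an open inbound port.
        peer_a, _ = listener.accept()
        peer_b, _ = listener.accept()
        threading.Thread(target=pipe, args=(peer_a, peer_b), daemon=True).start()
        threading.Thread(target=pipe, args=(peer_b, peer_a), daemon=True).start()

    threading.Thread(target=serve, daemon=True).start()
    return listener.getsockname()
```

Both nodes only ever make outgoing connections, which is why this works behind almost any NAT. Every byte passes through the relay twice (in and out), which is where the traffic cost comes from.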

B) TCP Hole Punching

The "Peer-to-Peer Communication Across Network Address Translators" paper does a pretty good job of describing how hole punching works. For newcomers it is worth reading through the UDP version of hole punching first, as it is much easier to understand. In general the idea also requires an external server, but it is only used to establish the initial communication; the rest of the conversation between the nodes happens without the "proxy".

The main difference from UDP hole punching lies in the fact that TCP sockets usually have a one-to-one correspondence to TCP port numbers on the local host: after the application binds one socket to a particular local TCP port, attempts to bind a second socket to the same TCP port fail. For TCP hole punching to work, however, a single local TCP port has to listen for incoming TCP connections and initiate multiple outgoing TCP connections concurrently. The magic is in a special TCP socket option, commonly named SO_REUSEADDR, which allows the application to bind multiple sockets to the same local endpoint, as long as the option is set on all of the sockets involved.
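
The port sharing described above can be tried out locally. The sketch below is not a hole punching implementation; it only demonstrates one local TCP port being used for listening and for an outgoing connection at the same time. The function names are made up. It also sets SO_REUSEPORT when the platform defines it, on the assumption that Linux and BSD-derived systems need it in addition to SO_REUSEADDR for a listening and a connecting socket to share a port; behaviour differs between operating systems, so treat this as something to verify on each target platform.

```python
import socket


def reusable_tcp_socket() -> socket.socket:
    """Create a TCP socket that is allowed to share its local port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    # Assumption: needed on Linux/BSD for a listener and a dialer to share a port.
    if hasattr(socket, "SO_REUSEPORT"):
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    return sock


def listen_and_dial_from_same_port(remote_addr):
    """Listen on a local port and open an outgoing connection from that same port."""
    listener = reusable_tcp_socket()
    listener.bind(("127.0.0.1", 0))  # let the OS pick the port
    listener.listen()
    local_port = listener.getsockname()[1]

    dialer = reusable_tcp_socket()
    dialer.bind(("127.0.0.1", local_port))  # same local port as the listener
    dialer.settimeout(2.0)
    dialer.connect(remote_addr)
    return listener, dialer
```

In a hole punching setup the outgoing connection would go to the rendezvous server and to the peer's public endpoint, so that the NAT mapping created by the outgoing SYN belongs to the same port the node is listening on.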

Limitations:

This approach is mostly reliable, however not all routers support it. Section 5 of the paper mentioned above lists the characteristics that routers need to have in order to be considered "P2P-Friendly".
This approach requires that the operating system understands SO_REUSEADDR. The option is implemented in Linux, most BSD Unix systems and every Windows version since Windows 2000 (https://docs.microsoft.com/en-us/windows/desktop/winsock/using-so-reuseaddr-and-so-exclusiveaddruse).

TCP Hole Punching vs gRPC

If we decide to use the TCP Hole Punching technique, the question is whether we can use it with the way the TCP transport layer is currently implemented. To be more precise: we are using grpc-java to create the TCP transport. That library covers a lot of low-level details (socket creation, error handling) and provides an established API that we have very little control over (that's why I HATE the object-oriented paradigm, by the way). It seems to me, right now at least, that in order to implement Hole Punching we would have to go a bit lower level and possibly drop grpc-java. I'm not totally convinced that this is actually true; I will have to research that tomorrow.