2 of 10 validators thrown InvalidTransaction while replaying a block.

Description

Testnet fallen into a deadlock, each node is printing `Error while creating block: Must wait for more blocks from other validators`. But, all validators are on the same height. Show-blocks on all nodes prints block 90573.

Call for `show-blocks` for last 11 blocks shows that it was turn to propose for node3 when the network stopped. Looking into node3 logs (here's part we need https://transfer.sh/14Zffz/node3_stuck_-30000lines.log), the last successfully proposed block is 90564 (at 2019-10-10 14:59:03).

After node3 received 90565, 90566, 90567.
After a lot of deploys hit one of the nodes during Thursday testing session (script was generating 20deploys/sec for some time, then stopped then active again, and so several times).

Block 90568 (cc155cf3b6) arrived after 7 mins, its processing ended up with `InvalidTransaction`.
```
2019-10-10 15:10:45,259 [tl-dispatcher-server-queue-52] WARN coop.rchain.casper.MultiParentCasperImpl - Recording invalid block cc155cf3b6... for InvalidTransaction.
```
Next block (26b19afd7f) and each subsequent blocks received are dropped with 

`InvalidParents`.
```
2019-10-10 15:10:48,955 [grpc-io-3357] INFO coop.rchain.casper.MultiParentCasperImpl - Attempting to add Block 26b19afd7f... to DAG.
2019-10-10 15:10:51,317 [tl-dispatcher-server-queue-52] WARN coop.rchain.casper.Validate$ - Ignoring block 26b19afd7f... because block parents cc155cf3b6... did not match estimate 89eb01f9ee... based on justification 89eb01f9ee...,201843db6b...,17d0c04b43...,86bc48d626...,cc155cf3b6...,814e401f4b...,0eb3365196...,80ec15e71d...,f19f375240...,a83c02de6d....
2019-10-10 15:10:51,317 [tl-dispatcher-server-queue-52] WARN coop.rchain.casper.MultiParentCasperImpl - Recording invalid block 26b19afd7f... for InvalidParents.
```
Block 26b19afd7f arrived from node2, which means node2 successfully replayed block cc155cf3b6. Logs confirm this - no `invalid block cc155cf3b6... for InvalidTransaction` records there, as opposed to node3.
Looking into every other node - only node6, and node 3 has this record.

So the issue#1 is that node3 and node6 thrown `InvalidTransaction` when replaying block `cc155cf3b6` while other peers do not. UPD: some convo with Dom happened on this topic, looks like replay was somehow was interrupted (he figured it out with reporting tool wget --timeout 5000000 http://localhost:38480/reporting/trace?blockHash=cc155cf3b63671da464ab44bd2897a2eb883e983f648f2756632d36fd8ec77a1. But the node has to be started with --reporting flag). But investigation is not finished yet.

But, there is more. After receiving bunch of blocks which were ignored, node3 succeed to pass stake threshold 0.99 and made an attempt to propose a block. It supposed to be block 90574 (as the last block received has height 90573 and it is returned but `show-blocks`), but in reality block height of the block proposed was 90568 (last good block received +1). And it thrown an error.
```
2019-10-10 15:13:11,638 [node-runner-3402] INFO coop.rchain.casper.util.rholang.InterpreterUtil$ - Computed parents post state for Block #90568 (4327a78554...) – Sender ID 04abd47e88... – M Parent Hash 89eb01f9ee... – Contents 4bda07882a...-- Shard ID rchain.
2019-10-10 15:13:12,858 [node-runner-3391] WARN coop.rchain.casper.MultiParentCasperImpl - Recording invalid block 4327a78554... for NeglectedInvalidBlock.
2019-10-10 15:13:12,865 [node-runner-3391] INFO coop.rchain.casper.MultiParentCasperImpl - New fork-choice tip is block 89eb01f9ee....
2019-10-10 15:13:12,866 [node-runner-3391] WARN coop.rchain.casper.api.BlockAPI$ - Failure! Invalid block: Left(NeglectedInvalidBlock)
```

So the issue#2 is that node3 adds ignored blocks to its DAG and uses them as a justification when proposing a new block. UPD: For this ptal at the comments, might be the same issue caused another failure.

Environment

None

Status

Assignee

Chris Boscolo

Reporter

nutzipper

Priority

Medium

Affects versions

Components

None

Sprint

None

Epic Link

None

Labels

Configure