What's happening when pubnet validator nodes die without logging an error?

Description

Version
RChain RNode v0.9.8 (63499e3)

Observation
Reported in https://discordapp.com/channels/391435113415573504/502935251413106723/594485407656181763 and pasted below.

Node6 suddenly rebooted. Again, there's no information in the logs. Used memory reached 99% before it rebooted. @kellyfoster I didn't anticipate manual catch-up when writing the setup code. My attempt with node0 and node9 failed, as they got a justification error during the night. It's also a lengthy, error-prone process. The most straightforward way would be a) leave it running without node0, node6 and node9, b) start testnet1.6.

https://collectd.rchain-dev.tk/host.php?h=node6.testnet.rchain-dev.tk

FYI I think it is not a GCE fault, but rather either the kernel or the GCE VM machinery forcibly rebooting when no more memory could be allocated. In either case nothing gets logged into the machine logs. In the latter case I'd expect something in the GCE logs (there's nothing).

I haven't investigated it yet, but I've just looked at the java (node) usage: https://collectd.rchain-dev.tk/host.php?h=node0.testnet.rchain-dev.tk&p=processes&s=86400. Perhaps it's not memory exhaustion after all; there's lots of it free until just before the reboot: https://collectd.rchain-dev.tk/graph.php?p=memory&t=memory&h=node0.testnet.rchain-dev.tk&s=86400

This is a bad example. This machine hasn't rebooted; the node just died...
The ones after node6 and node9 haven't rebooted. They just filled memory to the max, and most probably the OOM killer killed the node:
https://collectd.rchain-dev.tk/host.php?h=node4.testnet.rchain-dev.tk
https://collectd.rchain-dev.tk/host.php?h=node4.testnet.rchain-dev.tk&p=processes&s=86400
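One way to check the OOM-killer hypothesis on the hosts that did not reboot is to scan their kernel log for the kernel's "Killed process" messages. The following is a minimal Python sketch, not something taken from the testnet setup: the function name `find_oom_kills`, the `OomKill` record and the sample lines are all made up for illustration, and it assumes the kernel log (for example the output of `dmesg`) is still available on the machine.

```python
import re
from typing import Iterable, List, NamedTuple

# Matches both the older "Killed process 1234 (java) ..." message and the
# newer "Out of memory: Killed process 1234 (java) ..." form.
OOM_KILL_RE = re.compile(r"Killed process (\d+) \(([^)]+)\)")


class OomKill(NamedTuple):
    pid: int
    command: str
    line: str


def find_oom_kills(log_lines: Iterable[str]) -> List[OomKill]:
    """Return one record per OOM-killer victim found in kernel log lines."""
    kills = []
    for line in log_lines:
        match = OOM_KILL_RE.search(line)
        if match:
            kills.append(
                OomKill(int(match.group(1)), match.group(2), line.rstrip("\n"))
            )
    return kills


# Illustrative sample only; these lines are NOT from the affected machines.
SAMPLE = [
    "[12345.678] Out of memory: Kill process 4321 (java) score 905 or sacrifice child",
    "[12345.679] Killed process 4321 (java) total-vm:8123456kB, anon-rss:7000000kB, file-rss:0kB",
    "[12346.000] eth0: link up",
]

kills = find_oom_kills(SAMPLE)
```

On a real host the input would be the lines of the kernel log rather than `SAMPLE`. A hit whose command is `java` would support the OOM-killer explanation; no hit would point back toward the node dying for another reason.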
So to debug what's happening, there's a script to dump various Java VM info. It could be run periodically, but there's not much I can conclude from its output. It'd require someone (a dev) to investigate node memory usage from this information.
It'd be a good idea to enable it after the next reboot though, say every 15 minutes (it has a performance impact: it traverses the heap and has to pause the JVM for a while).
This is what gets dumped: https://files.rchain-dev.tk/heapdumps/devnet.rchain-dev.tk/2019-05-15-21-19-45/node0/info.2019-05-17T21-23-28/
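The actual dump script is not shown in this ticket. As a rough illustration of what a periodic dump could look like, here is a Python sketch built on the JDK's `jcmd` tool. Everything in it is an assumption: the function names, the choice of `jcmd` subcommands and the output file names are hypothetical, and only the `info.<timestamp>` directory naming follows the dump URL above.

```python
import subprocess
from datetime import datetime
from pathlib import Path
from typing import Callable, Dict, List


def build_commands(pid: int) -> Dict[str, List[str]]:
    """Map an output file name to the jcmd invocation that produces it."""
    p = str(pid)
    return {
        # Class histogram: walks the heap and pauses the JVM while it runs.
        "class_histogram.txt": ["jcmd", p, "GC.class_histogram"],
        "threads.txt": ["jcmd", p, "Thread.print"],
        "vm_flags.txt": ["jcmd", p, "VM.flags"],
    }


def run_command(argv: List[str]) -> str:
    """Run one command and return its combined stdout and stderr."""
    result = subprocess.run(argv, capture_output=True, text=True, timeout=300)
    return result.stdout + result.stderr


def dump_jvm_info(
    pid: int,
    base_dir: str,
    now: datetime,
    runner: Callable[[List[str]], str] = run_command,
) -> Path:
    """Write one file per command into a timestamped directory under base_dir."""
    out_dir = Path(base_dir) / now.strftime("info.%Y-%m-%dT%H-%M-%S")
    out_dir.mkdir(parents=True, exist_ok=True)
    for file_name, argv in build_commands(pid).items():
        (out_dir / file_name).write_text(runner(argv))
    return out_dir
```

Calling `dump_jvm_info(node_pid, "/some/dir", datetime.now())` from a scheduler such as a cron entry with the schedule `*/15 * * * *` would give the 15-minute cadence suggested above. The `runner` parameter exists only so the function can be exercised without a live JVM.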

Environment

None

Status

Assignee

Sebastian Bach

Reporter

Kelly Foster

Priority

Highest

Affects versions

None

Components

Sprint

None

Epic Link

None

Fix versions