Error Handling for Rholang


Status: IN PROGRESS
Impact: MEDIUM
Driver: ~557058:7d4e2dcc-68c9-4b88-8d80-8f294d6be11c
Approver:
Contributors: ~5afa0eac4b719848347ec6ba ~557058:8e536576-9c11-434b-acbc-a1b034bb964d ~5b3a5fbe69812b2ef3f78b59 ~557058:8818dec5-34ed-4abe-854a-9b4e7d4963db
Informed:
Due date:
Outcome: Do nothing (remove ErrorLog and be vigilant for similar bugs in the future)

Background

This is a problem that I originally posed in standup on May 30th after witnessing test failures upon removing the ErrorLog from the Rholang interpreter. At the moment, the only errors that cause global failure in a Rholang program are Out of Phlogiston Errors. All other errors cause a branch to fail locally and are simply logged in ErrorLog. The goal was to change this framework so that every error immediately causes global program failure. After removing ErrorLog, I noticed some unexpected behavior.

Given a par P | Q, we want evaluation to be such that if P throws an error, then Q halts immediately and P | Q throws an error and returns to the main thread. That is, we want the evaluation of P | Q to complete entirely before the interpreter logic continues. The behavior we see currently is that, when P throws an error, the evaluation of P | Q returns immediately to the main thread without cancelling Q. Afterwards, the interpreter logic on the main thread attempts to reset the store, but doesn't yet see Q's effects, so the reset is a no-op. It is only after the reset is attempted that Q affects the store. As a result, Q's effects leak into the store and appear in the next evaluation.
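To make the ordering concrete, here is a minimal sketch of the leak using plain Python threads. It is not the interpreter's code and does not use Task or RSpace; `evaluate_par`, `interpret`, and the list standing in for the store are all hypothetical names chosen for illustration. An event is used to force Q's write to land after the reset, so the outcome is deterministic.

```python
import threading


def evaluate_par(store, join_before_return):
    """Toy evaluation of `P | Q` in which P fails and Q produces into `store`.

    Returns a callable that lets Q finish, so the caller controls when
    Q's write lands. Everything here is a stand-in, not interpreter code.
    """
    release_q = threading.Event()

    def q_branch():
        release_q.wait()               # Q is still in flight when P fails
        store.append("Q's produce")    # Q's effect on the store

    q_thread = threading.Thread(target=q_branch, daemon=True)
    q_thread.start()

    def finish_q():
        release_q.set()
        q_thread.join()

    # P "throws" at this point. The two variants differ only in whether
    # the evaluation waits for Q before handing control back.
    if join_before_return:
        finish_q()
    return finish_q


def interpret(join_before_return):
    """Toy main-thread logic: evaluate, reset the store on error, move on."""
    store = []
    finish_q = evaluate_par(store, join_before_return)
    store.clear()   # reset after P's error; a no-op if Q has not written yet
    finish_q()      # any late write from Q lands after the reset
    return store


leaky = interpret(join_before_return=False)   # ["Q's produce"]: the effect leaked
clean = interpret(join_before_return=True)    # []: the reset saw Q's effect
```

In the first call the reset runs against an empty store and Q's produce survives into the next evaluation, which is the behavior described above. In the second call the evaluation completes entirely before the reset, so nothing leaks. The sketch lets Q finish rather than halting it; the point is only that the reset must not run while Q is still in flight.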

After the new RSpace was plugged in, I stopped witnessing this behavior because the test was changed. Since we never determined the cause of the failure, it's possible that the new RSpace unintentionally solved the problem I was seeing. On the other hand, it's also possible that the problem still exists and simply no longer manifests in the changed test.

Artur's idea is to implement a locking mechanism around produces and consumes that would test for errors before each produce/consume is executed, and only execute the produce/consume if no error has been reported. He wants to test this with the old RSpace to see if it fixes the previously failing test. If it does, he wants to switch back to the new RSpace and keep the locking mechanism around.
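A rough sketch of what such a lock could look like, written in Python rather than against the real RSpace API. `GuardedStore`, `report_error`, and the `produce`/`consume` signatures below are hypothetical and are assumptions made for illustration only. The idea shown is that the error check and the store operation happen under the same lock, so an error reported by one branch cannot slip in between another branch's check and its write.

```python
import threading


class GuardedStore:
    """Toy tuplespace whose produce/consume are skipped once an error is reported.

    Hypothetical stand-in for the proposed lock; not the RSpace interface.
    """

    def __init__(self):
        self._lock = threading.Lock()
        self._error = None
        self.data = []                 # list of (channel, value) pairs

    def report_error(self, error):
        """Record the first error; later produces and consumes become no-ops."""
        with self._lock:
            if self._error is None:
                self._error = error

    def produce(self, channel, value):
        """Write to the store unless an error was reported. Returns True if written."""
        with self._lock:
            if self._error is not None:
                return False
            self.data.append((channel, value))
            return True

    def consume(self, channel):
        """Remove and return the first value on `channel`.

        Returns None if an error was reported or nothing matches.
        """
        with self._lock:
            if self._error is not None:
                return None
            for index, (ch, value) in enumerate(self.data):
                if ch == channel:
                    del self.data[index]
                    return value
            return None


store = GuardedStore()
store.produce("x", 1)                         # executed: no error yet
store.report_error(RuntimeError("P failed"))
store.produce("y", 2)                         # skipped: error already reported
```

Because a single lock covers every operation, this version serializes all produces and consumes. A finer-grained lock (for example per channel) would reduce that cost, at the price of a more complicated error check.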

My idea is, frankly, to do nothing. I don't think that we should implement a lock built for a previous RSpace with the hope that it fixes problems with the current RSpace. I think we should address the bug in the future, with more data, if it shows up again.

Relevant data

  • One possible source of the bug is that Task (the data type we use for parallelization) doesn't cancel all threads when one thread is cancelled. However, this is contradicted by the tests I have performed locally, which have shown that Task cancels as expected when it doesn't interact with the tuplespace.
  • Another possible source of the bug is that cancellation and LMDB don't play nicely together. It's possible that Task thinks it has cancelled all threads when LMDB has actually been triggered to perform some operation that only completes at a later time. That is, if we cancel before the produce hits the store, then the store is unaffected. If we cancel after the produce hits the store, then the store is affected.

Options considered


Option 1: Locking mechanism

  • Pros: none listed
  • Cons: not necessarily a solution; linearizes produces and consumes (though this could be mitigated with a more sophisticated lock)
  • Estimated cost: Low

Option 2: Do nothing

  • Pros: none listed
  • Cons: not necessarily a solution
  • Estimated cost: None

Action items

  •  

Outcome