Posted to dev@reef.apache.org by "Markus Weimer (JIRA)" <ji...@apache.org> on 2014/12/03 19:42:12 UTC

[jira] [Created] (REEF-61) Possible race condition in EvaluatorManager

Markus Weimer created REEF-61:
---------------------------------

             Summary: Possible race condition in EvaluatorManager
                 Key: REEF-61
                 URL: https://issues.apache.org/jira/browse/REEF-61
             Project: REEF
          Issue Type: Bug
          Components: REEF-Common
            Reporter: Markus Weimer
            Priority: Minor


There is a theoretical issue in {{EvaluatorManager}} which we should elevate to critical if it starts hitting us:

{{EvaluatorManager}} joins two streams of information: it receives heartbeats from the Evaluator itself as well as container status events from the resource manager. Usually, the relevant events arrive at the {{EvaluatorManager}} in their canonical order:

# The resource manager indicates container launch.
# The first heartbeat is received from the Evaluator.
# The Evaluator sends heartbeats with status messages etc.
# The Evaluator sends its last heartbeat.
# The resource manager indicates container exit.

The fourth event might never be sent or received in catastrophic failure scenarios. That is why {{EvaluatorManager}} declares a {{FailedEvaluator}} when it receives the fifth event for an Evaluator whose last heartbeat still indicated a RUNNING state.

This is where the race condition occurs: if the last heartbeat from an Evaluator arrives after the container exit event from the resource manager, the application sees a {{FailedEvaluator}} where a {{CompletedEvaluator}} would have been correct.
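
To make the ordering problem concrete, here is a minimal sketch of the join described above. It is not the actual {{EvaluatorManager}} code: the class {{EvaluatorManagerSketch}}, its methods, and the two-value {{State}} enum are hypothetical, and it assumes that a final heartbeat reporting DONE is what produces a {{CompletedEvaluator}}.

```java
/** Hypothetical sketch of the event join; NOT the actual REEF EvaluatorManager. */
final class EvaluatorManagerSketch {
  enum State { RUNNING, DONE }
  enum Outcome { COMPLETED_EVALUATOR, FAILED_EVALUATOR }

  // Assumption: an Evaluator is treated as RUNNING until a heartbeat says otherwise.
  private State lastHeartbeatState = State.RUNNING;
  private Outcome outcome; // null until decided; the first decision wins

  /** Heartbeat stream from the Evaluator. */
  synchronized void onHeartbeat(State reported) {
    if (outcome != null) {
      return; // a late heartbeat cannot undo a decision that was already made
    }
    lastHeartbeatState = reported;
    if (reported == State.DONE) {
      outcome = Outcome.COMPLETED_EVALUATOR;
    }
  }

  /** Container status stream from the resource manager. */
  synchronized void onContainerExit() {
    if (outcome == null && lastHeartbeatState == State.RUNNING) {
      outcome = Outcome.FAILED_EVALUATOR;
    }
  }

  synchronized Outcome outcome() {
    return outcome;
  }

  /** Canonical order: last heartbeat first, then container exit. */
  static Outcome canonicalOrder() {
    EvaluatorManagerSketch m = new EvaluatorManagerSketch();
    m.onHeartbeat(State.RUNNING);
    m.onHeartbeat(State.DONE);
    m.onContainerExit();
    return m.outcome();
  }

  /** Racy order: container exit overtakes the last heartbeat. */
  static Outcome lateHeartbeatOrder() {
    EvaluatorManagerSketch m = new EvaluatorManagerSketch();
    m.onHeartbeat(State.RUNNING);
    m.onContainerExit();
    m.onHeartbeat(State.DONE);
    return m.outcome();
  }

  public static void main(String[] args) {
    System.out.println(canonicalOrder());     // prints COMPLETED_EVALUATOR
    System.out.println(lateHeartbeatOrder()); // prints FAILED_EVALUATOR
  }
}
```

The same Evaluator run yields two different outcomes depending only on which stream delivers its final event first.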

A first idea to fix this would be to wait for a small time window after receiving the container exit before deciding whether this is indeed a failure, say 100 ms. That gives the last heartbeat some slack to arrive. The obvious downside of such an approach is that we introduce latency in the cases where the Evaluator really failed. Even worse, we add a magic constant to the code.
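
A rough sketch of that time-window idea follows, again with hypothetical names ({{GraceWindowSketch}}, {{onFinalHeartbeat}}, {{graceMillis}}) rather than anything from the REEF code base. It assumes the decision can be modelled as a single future that is completed either by the final heartbeat or by a timer started on container exit, whichever comes first.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of the grace-window idea; NOT the actual REEF code. */
final class GraceWindowSketch {
  enum Outcome { COMPLETED_EVALUATOR, FAILED_EVALUATOR }

  // Daemon timer thread so the sketch never keeps the JVM alive.
  private static final ScheduledExecutorService TIMER =
      Executors.newSingleThreadScheduledExecutor(r -> {
        Thread t = new Thread(r, "grace-window-timer");
        t.setDaemon(true);
        return t;
      });

  private final long graceMillis; // the "magic constant" the issue worries about
  private final CompletableFuture<Outcome> outcome = new CompletableFuture<>();

  GraceWindowSketch(long graceMillis) {
    this.graceMillis = graceMillis;
  }

  /** Final heartbeat: decides COMPLETED unless a decision was already made. */
  void onFinalHeartbeat() {
    outcome.complete(Outcome.COMPLETED_EVALUATOR);
  }

  /** Container exit: only decides FAILED once the grace window has elapsed. */
  void onContainerExit() {
    TIMER.schedule(() -> {
      // complete() is a no-op if the final heartbeat already decided the outcome.
      outcome.complete(Outcome.FAILED_EVALUATOR);
    }, graceMillis, TimeUnit.MILLISECONDS);
  }

  CompletableFuture<Outcome> outcome() {
    return outcome;
  }
}
```

With this shape, a final heartbeat that arrives inside the window still produces a {{CompletedEvaluator}}, while a real failure is only reported after {{graceMillis}} has passed, which is exactly the added latency mentioned above.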

This used to be [#964|https://github.com/Microsoft-CISL/REEF/issues/964]


