Posted to dev@reef.apache.org by "Brian Cho (JIRA)" <ji...@apache.org> on 2015/04/30 17:27:06 UTC

[jira] [Created] (REEF-294) Timeout on Fail_AllocatedEvaluator

Brian Cho created REEF-294:
------------------------------

             Summary: Timeout on Fail_AllocatedEvaluator
                 Key: REEF-294
                 URL: https://issues.apache.org/jira/browse/REEF-294
             Project: REEF
          Issue Type: Sub-task
          Components: REEF Runtime Local, REEF-Tests
            Reporter: Brian Cho


I've created a Sub-task because it's not clear that this is the only timed-out job we are looking for. The timeout happens while running {code}Fail_AllocatedEvaluator{code}. It doesn't reproduce on every run, but something about my current setup/machine produces it more often than not when building from scratch.

I've isolated a deadlock, and I will attach the stack trace from driver.stdout. [1] 

Basically: RuntimeClock.run (which holds a lock on schedule) is triggering an idle check, but the idle check can't progress because the lock for DriverStatusManager is already held. That lock is held because an error was triggered. The error path wants to stop RuntimeClock, but it is waiting to get the lock on schedule. So each side holds the lock the other one needs.
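
To make the cycle concrete, here is a minimal sketch of the lock ordering described above. It is not REEF code: scheduleLock and statusLock are hypothetical stand-ins for the lock on schedule and the DriverStatusManager lock, and the class and method names are made up for illustration. I use tryLock with a timeout so the sketch reports the cycle instead of hanging the way the real job does.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Illustration only: hypothetical stand-ins, not the actual REEF classes.
class LockOrderSketch {

  // Takes `first`, waits until the other thread also holds its first lock,
  // then tries `second` with a timeout instead of blocking forever.
  static boolean holdThenTry(ReentrantLock first, ReentrantLock second,
                             CountDownLatch bothHoldFirst, CountDownLatch bothTried) {
    first.lock();
    try {
      bothHoldFirst.countDown();
      bothHoldFirst.await();
      boolean gotSecond = second.tryLock(200, TimeUnit.MILLISECONDS);
      if (gotSecond) {
        second.unlock();
      }
      // Keep holding `first` until the other thread's attempt has finished.
      bothTried.countDown();
      bothTried.await();
      return gotSecond;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    } finally {
      first.unlock();
    }
  }

  // Returns {clock thread got the status lock, error thread got the schedule lock}.
  static boolean[] run() {
    ReentrantLock scheduleLock = new ReentrantLock(); // stand-in for the lock on schedule
    ReentrantLock statusLock = new ReentrantLock();   // stand-in for the DriverStatusManager lock
    CountDownLatch bothHoldFirst = new CountDownLatch(2);
    CountDownLatch bothTried = new CountDownLatch(2);
    boolean[] gotSecond = new boolean[2];

    // "Clock" side: holds schedule, then needs the status lock for the idle check.
    Thread clock = new Thread(() ->
        gotSecond[0] = holdThenTry(scheduleLock, statusLock, bothHoldFirst, bothTried));
    // "Error" side: holds the status lock, then needs schedule to stop the clock.
    Thread error = new Thread(() ->
        gotSecond[1] = holdThenTry(statusLock, scheduleLock, bothHoldFirst, bothTried));

    clock.start();
    error.start();
    try {
      clock.join();
      error.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return gotSecond;
  }

  public static void main(String[] args) {
    boolean[] gotSecond = run();
    System.out.println("clock thread got status lock: " + gotSecond[0]);
    System.out.println("error thread got schedule lock: " + gotSecond[1]);
  }
}
```

Neither thread gets its second lock here, which is the same cycle as in the stack trace. With plain blocking acquisition in place of tryLock, both threads would wait forever.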

The lock for idle check originates from https://github.com/Microsoft-CISL/REEF/pull/1022/. We'll have to figure out what needs to be done.

[1] For those interested: I set the LocalTestEnvironment timeout to a high value, and when I noticed the job stalling I ran {code}kill -3 <PID>{code}, which writes the stack trace to driver.stdout.
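
If sending a signal is inconvenient, a similar dump can be produced from inside the JVM. The following is a rough sketch using the standard java.lang.management API. The class name ThreadDumpSketch is made up, nothing like this exists in REEF as far as I know, and the output is abbreviated compared to what kill -3 prints, since ThreadInfo.toString() only includes the top few stack frames.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.Arrays;

// Illustration only: a hypothetical helper, not part of REEF.
class ThreadDumpSketch {

  // Builds a text dump of all live threads plus any deadlock the JVM detects.
  static String dump() {
    ThreadMXBean threads = ManagementFactory.getThreadMXBean();
    boolean monitors = threads.isObjectMonitorUsageSupported();
    boolean synchronizers = threads.isSynchronizerUsageSupported();

    StringBuilder out = new StringBuilder();
    for (ThreadInfo info : threads.dumpAllThreads(monitors, synchronizers)) {
      out.append(info); // thread name, state, held/awaited locks, top frames
    }

    // findDeadlockedThreads also covers java.util.concurrent locks, but it
    // needs synchronizer support; otherwise fall back to monitors only.
    long[] deadlocked = synchronizers
        ? threads.findDeadlockedThreads()
        : threads.findMonitorDeadlockedThreads();
    out.append(deadlocked == null
        ? "No deadlocked threads found\n"
        : "Deadlocked thread ids: " + Arrays.toString(deadlocked) + "\n");
    return out.toString();
  }

  public static void main(String[] args) {
    System.out.print(dump());
  }
}
```

Calling dump() from a watchdog thread once the job has stalled would be one way to use it. The caveat is that the watchdog must not need any of the locks involved in the deadlock.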



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)