Posted to dev@reef.apache.org by "Brian Cho (JIRA)" <ji...@apache.org> on 2015/04/30 17:27:06 UTC
[jira] [Created] (REEF-294) Timeout on Fail_AllocatedEvaluator
Brian Cho created REEF-294:
------------------------------
Summary: Timeout on Fail_AllocatedEvaluator
Key: REEF-294
URL: https://issues.apache.org/jira/browse/REEF-294
Project: REEF
Issue Type: Sub-task
Components: REEF Runtime Local, REEF-Tests
Reporter: Brian Cho
I've created a sub-task because it's not clear that this is the only timed-out job we are looking for. The timeout happens while running {code}Fail_AllocatedEvaluator{code}. It doesn't reproduce on every run, but something about my current setup/machine produces it more often than not when building from scratch.
I've isolated a deadlock, and I will attach the stack trace from driver.stdout. [1]
Basically: RuntimeClock.run, which holds a lock on schedule, triggers an idle check, but the idle check can't progress because the lock for DriverStatusManager is already held. It is held because an error was triggered: the error handling wants to stop RuntimeClock, but is itself waiting to get the lock on schedule. Each side holds the lock the other one needs.
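For anyone who wants to see the shape of the cycle outside of REEF, here is a minimal sketch in plain Java. It is not the REEF code: the two lock objects are hypothetical stand-ins named after schedule and DriverStatusManager, and the class and thread names are made up for illustration. One thread takes the locks in the order I believe the clock path does, the other in the order the error path does, and the JVM's own deadlock detector is asked to confirm the cycle.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

public class LockOrderSketch {

  /**
   * Starts two daemon threads that take two locks in opposite order.
   * Returns 2 once the JVM reports both threads as deadlocked, or 0 if
   * it does not report them within about five seconds.
   */
  static int reproduce() {
    // Hypothetical stand-ins for the two monitors named in the report.
    final Object schedule = new Object();
    final Object driverStatusManager = new Object();
    // Forces the bad interleaving: each thread waits until both threads
    // hold their first lock before asking for the second one.
    final CountDownLatch bothHoldFirstLock = new CountDownLatch(2);

    // Clock path: holds schedule, then the idle check needs the status lock.
    final Thread clock = new Thread(() -> {
      synchronized (schedule) {
        rendezvous(bothHoldFirstLock);
        synchronized (driverStatusManager) {
          // never reached in this sketch
        }
      }
    }, "clock-sketch");

    // Error path: holds the status lock, then stopping the clock needs schedule.
    final Thread error = new Thread(() -> {
      synchronized (driverStatusManager) {
        rendezvous(bothHoldFirstLock);
        synchronized (schedule) {
          // never reached in this sketch
        }
      }
    }, "error-sketch");

    // Daemon threads, so the stuck pair does not keep the JVM alive.
    clock.setDaemon(true);
    error.setDaemon(true);
    clock.start();
    error.start();

    final ThreadMXBean threads = ManagementFactory.getThreadMXBean();
    for (int attempt = 0; attempt < 100; attempt++) {
      final long[] deadlocked = threads.findMonitorDeadlockedThreads();
      int count = 0;
      if (deadlocked != null) {
        for (final long id : deadlocked) {
          if (id == clock.getId() || id == error.getId()) {
            count++;
          }
        }
      }
      if (count == 2) {
        return count;
      }
      try {
        Thread.sleep(50);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return 0;
      }
    }
    return 0;
  }

  // Signals "I hold my first lock" and waits for the other thread to do the same.
  private static void rendezvous(final CountDownLatch latch) {
    latch.countDown();
    try {
      latch.await();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }

  public static void main(final String[] args) {
    System.out.println("deadlocked threads found: " + reproduce());
  }
}
```

The usual ways out of a cycle like this are to make both paths take the two locks in the same order, or to avoid calling into the other component while holding a lock. Which of those fits REEF is still open.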
The lock for idle check originates from https://github.com/Microsoft-CISL/REEF/pull/1022/. We'll have to figure out what needs to be done.
[1] For those interested, I set the LocalTestEnvironment timeout to a high value; then, when I noticed the job stalling, I ran {code}kill -3 <PID>{code}, which makes the JVM write a thread dump (the stack traces of all threads) to driver.stdout.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)