Posted to dev@reef.apache.org by "Markus Weimer (JIRA)" <ji...@apache.org> on 2015/05/01 18:06:07 UTC

[jira] [Commented] (REEF-294) Timeout on Fail_AllocatedEvaluator

    [ https://issues.apache.org/jira/browse/REEF-294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14523376#comment-14523376 ] 

Markus Weimer commented on REEF-294:
------------------------------------

Thanks for the refresher, [~yingdachen]. Maybe we should resolve the issue in Wake then and undo the locking introduced via #1022. [~bgchun], what do you think about changing Wake's clock in this way?

> Timeout on Fail_AllocatedEvaluator
> ----------------------------------
>
>                 Key: REEF-294
>                 URL: https://issues.apache.org/jira/browse/REEF-294
>             Project: REEF
>          Issue Type: Sub-task
>          Components: REEF Runtime Local, REEF-Tests
>            Reporter: Brian Cho
>         Attachments: stacktrace.txt
>
>
> I've created a Sub-task, as it's not clear this is the ONLY timed out job we are looking for. This happens while running {code}Fail_AllocatedEvaluator{code}. It is not reproducible on every run, but something about my current setup/machine produces it more often than not when building from scratch.
> I've isolated a deadlock, and I will attach the stack trace from driver.stdout. [1] 
> Basically: RuntimeClock.run (which holds a lock on schedule) is triggering an idle check, but the idle check can't progress because the lock for DriverStatusManager is held. It is held because an error was triggered. The error handling wants to stop RuntimeClock, but is waiting to get the lock on schedule, so neither side can proceed.
> The lock for idle check originates from https://github.com/Microsoft-CISL/REEF/pull/1022/. We'll have to figure out what needs to be done.
> [1] For those interested, I set the LocalTestEnvironment timeout to a high value; then when I noticed the job stalling, I did {code}kill -3 <PID>{code}, which writes a thread dump (stack traces) to driver.stdout.
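
For readers less familiar with this failure mode, the lock-order inversion described in the report can be reproduced in miniature. The sketch below is not REEF code: the class name, thread names, and the two lock objects are hypothetical stand-ins for the lock on schedule and the lock on DriverStatusManager. It assumes a standard JVM and uses ThreadMXBean.findMonitorDeadlockedThreads() as a programmatic counterpart to the kill -3 thread dump.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

public class LockOrderDeadlockSketch {

    /**
     * Starts two daemon threads that take two locks in opposite order and
     * returns how many of those two threads the JVM reports as deadlocked.
     */
    static int countDeadlockedThreads() {
        // Hypothetical stand-ins for the two locks named in the report.
        final Object scheduleLock = new Object();
        final Object statusManagerLock = new Object();
        // Makes sure each thread holds its first lock before trying the second.
        final CountDownLatch bothHoldFirstLock = new CountDownLatch(2);

        // Plays the role of the clock loop: schedule first, then status manager.
        Thread clock = new Thread(() -> {
            synchronized (scheduleLock) {
                arriveAndWait(bothHoldFirstLock);
                synchronized (statusManagerLock) {
                    // Never reached once the deadlock forms.
                }
            }
        }, "clock-sketch");

        // Plays the role of the error path: status manager first, then schedule.
        Thread error = new Thread(() -> {
            synchronized (statusManagerLock) {
                arriveAndWait(bothHoldFirstLock);
                synchronized (scheduleLock) {
                    // Never reached once the deadlock forms.
                }
            }
        }, "error-sketch");

        // Daemon threads let the JVM exit even though they stay blocked.
        clock.setDaemon(true);
        error.setDaemon(true);
        clock.start();
        error.start();

        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        for (int attempt = 0; attempt < 200; attempt++) {
            long[] ids = threads.findMonitorDeadlockedThreads();
            if (ids != null) {
                int ours = 0;
                for (long id : ids) {
                    if (id == clock.getId() || id == error.getId()) {
                        ours++;
                    }
                }
                if (ours == 2) {
                    return ours;
                }
            }
            sleepQuietly(25);
        }
        return 0; // No deadlock observed within the polling window.
    }

    private static void arriveAndWait(CountDownLatch latch) {
        latch.countDown();
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        System.out.println("deadlocked threads: " + countDeadlockedThreads());
    }
}
```

The usual remedies are to acquire the two locks in one consistent order everywhere, or to avoid calling out to other components while holding a lock. Which of these fits Wake's clock is a design question for the issue, not something this sketch settles.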


