Posted to dev@reef.apache.org by "Markus Weimer (JIRA)" <ji...@apache.org> on 2015/04/30 17:50:06 UTC

[jira] [Commented] (REEF-294) Timeout on Fail_AllocatedEvaluator

    [ https://issues.apache.org/jira/browse/REEF-294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14521692#comment-14521692 ] 

Markus Weimer commented on REEF-294:
------------------------------------

The discussion on #1022 mentions another solution in Wake for the same problem. However, I have no recollection of what that fix was. Regrettably, we weren't as good about documenting design decisions back then as we are today. [~yingdachen], do you remember?

Here's what I can piece together today: there was an issue with multiple shutdown requests for the Driver being in flight at the same time. The prime example is when something in the shutdown flow itself fails (e.g. sending the message to the client); that in turn created multiple calls to {{Clock.close()}} in Wake. The fix in #1022 added a set of locks to make sure that can't happen.
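
I don't have the #1022 diff in front of me, but the guard it introduced presumably boils down to funnelling every shutdown request through a single guarded entry point. A rough sketch (names are illustrative, not the actual REEF classes):

{code}
// Sketch only, not the actual #1022 diff: every shutdown path calls
// beginShutdown() first, so a failure inside the shutdown flow cannot
// trigger a second, concurrent shutdown.
public final class DriverShutdownGuard {
  private final Object lock = new Object();
  private boolean shutdownInProgress = false;

  /** Returns true only for the first caller; all later callers are no-ops. */
  public boolean beginShutdown() {
    synchronized (lock) {
      if (shutdownInProgress) {
        return false;
      }
      shutdownInProgress = true;
      return true;
    }
  }
}
{code}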

Which leads me to what may have been the alternative design back then: we could make Wake's {{Clock}} robust against multiple calls to {{.close()}}. And while we are at it, it would also be worthwhile to more clearly separate the calls that request an orderly shutdown from those that force one.
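
Something along these lines, as a sketch only (names and signatures are illustrative, not the current Wake API):

{code}
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch only, not the current Wake API: close() is idempotent, and the
// caller's intent (orderly vs. forced) is made explicit.
public final class IdempotentClock implements AutoCloseable {

  private final AtomicBoolean closed = new AtomicBoolean(false);

  /** Orderly shutdown: let already-scheduled events run, then stop the loop. */
  @Override
  public void close() {
    if (closed.compareAndSet(false, true)) {
      // drain the schedule, then stop the event loop
    }
    // every later call falls through harmlessly
  }

  /** Forced shutdown: drop pending events and stop as soon as possible. */
  public void stop(final Throwable reason) {
    if (closed.compareAndSet(false, true)) {
      // discard the schedule, record 'reason', stop the event loop
    }
  }
}
{code}

With {{compareAndSet}}, whichever shutdown path gets there first wins and every later call becomes a harmless no-op, so the callers wouldn't need extra locks at all.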

> Timeout on Fail_AllocatedEvaluator
> ----------------------------------
>
>                 Key: REEF-294
>                 URL: https://issues.apache.org/jira/browse/REEF-294
>             Project: REEF
>          Issue Type: Sub-task
>          Components: REEF Runtime Local, REEF-Tests
>            Reporter: Brian Cho
>         Attachments: stacktrace.txt
>
>
> I've created a Sub-task, as it's not clear this is the ONLY timed-out job we are looking for. This happens while running {code}Fail_AllocatedEvaluator{code}. It's not reproducible on every run, but something about my current setup/machine produces it more often than not when building from scratch.
> I've isolated a deadlock, and I will attach the stack trace from driver.stdout. [1] 
> Basically: RuntimeClock.run (which holds the lock on schedule) triggers an idle check, but the idle check can't progress because the lock on DriverStatusManager is already held. That lock is held because an error was triggered; the error handler wants to stop RuntimeClock, but is itself waiting for the lock on schedule.
> The lock taken for the idle check originates from https://github.com/Microsoft-CISL/REEF/pull/1022/. We'll have to figure out what needs to be done.
> [1] For those interested, I set the LocalTestEnvironment timeout to a high value; then, when I noticed the job stalling, I ran {code}kill -3 <PID>{code}, which dumps the stack trace to driver.stdout.
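
To spell out the cycle described above: two threads take the same two locks in opposite order. A minimal, stand-alone sketch of that shape (placeholder names, not the actual RuntimeClock/DriverStatusManager code):

{code}
// Placeholder names, not the actual REEF classes: this only shows the shape
// of the lock-ordering cycle visible in the attached stack trace.
public final class DeadlockShape {
  private final Object schedule = new Object();       // held by the clock's run loop
  private final Object statusManager = new Object();  // held by the error path

  /** Clock thread: holds 'schedule', then asks whether the Driver is idle. */
  void clockRunLoop() {
    synchronized (schedule) {
      synchronized (statusManager) {  // blocks: the error path owns this lock
        // idle check
      }
    }
  }

  /** Error-handling thread: holds 'statusManager', then tries to stop the clock. */
  void onError() {
    synchronized (statusManager) {
      synchronized (schedule) {       // blocks: the clock thread owns this lock
        // stop the clock
      }
    }
  }
}
{code}

Taking the two locks in a consistent order, or performing the idle check outside the schedule lock, would break a cycle of this shape.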



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)