Posted to issues@tez.apache.org by "Bikas Saha (JIRA)" <ji...@apache.org> on 2014/09/04 04:58:52 UTC

[jira] [Comment Edited] (TEZ-1494) DAG hangs waiting for ShuffleManager.getNextInput()

    [ https://issues.apache.org/jira/browse/TEZ-1494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14120889#comment-14120889 ] 

Bikas Saha edited comment on TEZ-1494 at 9/4/14 2:58 AM:
---------------------------------------------------------

Alternatively, we could wait for TEZ-1494. Then we can simply register for the vertex-started-running notification and schedule all vertices once that notification has been received. That would be much simpler than monitoring for essentially the same thing, and faster, since we don't have to wait for tasks to complete before we schedule tasks. However, for that to work, the vertex-started-running notification needs to arrive when the vertex actually starts running (schedules tasks) rather than when the vertex state machine enters the running state. Or maybe add a new notification saying the vertex started scheduling.
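
To make the proposal concrete, here is a minimal sketch of the idea, not Tez code. Every name in it (SchedulingListener, DeferredScheduler, onVertexStartedScheduling) is hypothetical and is not part of any Tez API. It assumes a notification that fires when a source vertex first schedules tasks, and holds back the dependent vertices until that notification arrives.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical callback: fired when a vertex schedules its first task. Not a Tez API. */
interface SchedulingListener {
    void onVertexStartedScheduling(String vertexName);
}

/** Hypothetical helper that defers scheduling until the source vertex starts scheduling. */
class DeferredScheduler implements SchedulingListener {
    private final String sourceVertex;
    private final Set<String> pendingVertices = new LinkedHashSet<>();
    private final List<String> scheduled = new ArrayList<>();
    private boolean sourceStarted = false;

    DeferredScheduler(String sourceVertex) {
        this.sourceVertex = sourceVertex;
    }

    /** Schedule immediately if the source has started; otherwise hold the vertex back. */
    synchronized void submit(String vertexName) {
        if (sourceStarted) {
            scheduled.add(vertexName);
        } else {
            pendingVertices.add(vertexName);
        }
    }

    /** Release all held-back vertices the first time the source vertex reports scheduling. */
    @Override
    public synchronized void onVertexStartedScheduling(String vertexName) {
        if (sourceStarted || !vertexName.equals(sourceVertex)) {
            return; // ignore other vertices and repeated notifications
        }
        sourceStarted = true;
        scheduled.addAll(pendingVertices);
        pendingVertices.clear();
    }

    /** Snapshot of the vertices scheduled so far, in scheduling order. */
    synchronized List<String> scheduledVertices() {
        return new ArrayList<>(scheduled);
    }
}
```

The point of the sketch is the trigger: nothing is released when the source vertex merely changes state, only when it reports that it has begun scheduling tasks.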


was (Author: bikassaha):
Overall, I feel we should wait for TEZ-1494. Then we can simply register for the vertex-started-running notification and schedule all vertices once that notification has been received. That would be much simpler than monitoring for essentially the same thing, and faster, since we don't have to wait for tasks to complete before we schedule tasks. However, for that to work, the vertex-started-running notification needs to arrive when the vertex actually starts running (schedules tasks) rather than when the vertex state machine enters the running state. Maybe add a new notification saying the vertex started scheduling.

> DAG hangs waiting for ShuffleManager.getNextInput()
> ---------------------------------------------------
>
>                 Key: TEZ-1494
>                 URL: https://issues.apache.org/jira/browse/TEZ-1494
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Rajesh Balamohan
>            Assignee: Rajesh Balamohan
>              Labels: performance
>         Attachments: TEZ-1494-DAG.dot, TEZ-1494.1.patch, TEZ-1494.2.patch
>
>
> Attaching the DAG and the stack trace of the hung process.  
> Thread 30071: (state = BLOCKED)
>  - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame)
>  - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, line=186 (Interpreted frame)
>  - java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await() @bci=42, line=2043 (Interpreted frame)
>  - java.util.concurrent.LinkedBlockingQueue.take() @bci=29, line=442 (Interpreted frame)
>  - org.apache.tez.runtime.library.shuffle.common.impl.ShuffleManager.getNextInput() @bci=67, line=610 (Interpreted frame)
>  - org.apache.tez.runtime.library.common.readers.UnorderedKVReader.moveToNextInput() @bci=26, line=176 (Interpreted frame)
>  - org.apache.tez.runtime.library.common.readers.UnorderedKVReader.next() @bci=30, line=117 (Interpreted frame)
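
The trace shows the reader thread parked inside LinkedBlockingQueue.take(), which does not return until an element is added to the queue. The following is a minimal sketch of that blocking pattern in isolation. It uses a hypothetical queue named completedInputs and does not reproduce ShuffleManager itself. Note that java.lang.Thread reports such a thread as WAITING; the BLOCKED label in the trace is assumed to come from the tool that produced the dump.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

class BlockingTakeDemo {
    /**
     * Starts a consumer that calls take() on an empty queue and returns the
     * consumer's thread state once it has parked (or after a 5 second limit).
     */
    static Thread.State stateWhileWaiting() throws InterruptedException {
        // Hypothetical stand-in for a queue of fetched inputs; nothing ever adds to it.
        BlockingQueue<String> completedInputs = new LinkedBlockingQueue<>();

        Thread consumer = new Thread(() -> {
            try {
                completedInputs.take(); // parks until an element arrives
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // woken up by the cleanup below
            }
        });
        consumer.setDaemon(true);
        consumer.start();

        // Poll the thread state until the consumer has parked inside take().
        long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(5);
        while (consumer.getState() != Thread.State.WAITING
                && System.nanoTime() < deadline) {
            Thread.sleep(10);
        }
        Thread.State observed = consumer.getState();

        consumer.interrupt(); // release the consumer so the demo can finish
        consumer.join(1000);
        return observed;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("Consumer state: " + stateWhileWaiting());
    }
}
```

If no producer ever puts an element on the queue, the consumer stays parked indefinitely, which matches the hang reported in this issue.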


