Posted to issues@tez.apache.org by "Siddharth Seth (JIRA)" <ji...@apache.org> on 2014/03/10 19:09:45 UTC

[jira] [Comment Edited] (TEZ-918) Shuffle can hang if there are intermittent fetch failures

    [ https://issues.apache.org/jira/browse/TEZ-918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13925937#comment-13925937 ] 

Siddharth Seth edited comment on TEZ-918 at 3/10/14 6:08 PM:
-------------------------------------------------------------

bq. The previous loop only removes unused/completed inputs until the max fetch limit is reached.
They're removed while constructing the dedupe list - which means the entire list is considered, and not only the limited list which will actually be fetched. The change simplifies the previous loops.
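
As a rough illustration of the point above (this is a minimal sketch, not the code from the attached patches): completed inputs are dropped while the dedupe list is built over the whole pending list, and the fetch limit is applied only afterwards. The class name PendingInputDedupe, the method selectForFetch, the String identifiers and the maxToFetch parameter are all hypothetical, chosen only for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; names and types are assumptions, not Tez APIs.
class PendingInputDedupe {

    // Walks the ENTIRE pending list once: completed inputs are removed from
    // it, and the remaining inputs are collected without duplicates. The
    // fetch limit is applied only after that full pass.
    static List<String> selectForFetch(List<String> pending,
                                       Set<String> completed,
                                       int maxToFetch) {
        Set<String> deduped = new LinkedHashSet<>(); // keeps first-seen order
        Iterator<String> it = pending.iterator();
        while (it.hasNext()) {
            String input = it.next();
            if (completed.contains(input)) {
                it.remove();        // dropped regardless of the fetch limit
            } else {
                deduped.add(input); // duplicates collapse here
            }
        }

        // Only now cut the deduplicated inputs down to the fetch limit.
        List<String> toFetch = new ArrayList<>();
        for (String input : deduped) {
            if (toFetch.size() >= maxToFetch) {
                break;
            }
            toFetch.add(input);
        }
        return toFetch;
    }

    public static void main(String[] args) {
        List<String> pending =
            new ArrayList<>(Arrays.asList("a", "b", "a", "c", "d", "b"));
        Set<String> completed = new HashSet<>(Arrays.asList("b", "d"));
        System.out.println(selectForFetch(pending, completed, 2));
        System.out.println(pending);
    }
}
```

With the inputs in main, the completed entries "b" and "d" are removed from the pending list even though they sit beyond the first two entries that a limit of 2 would cover, and the list selected for fetching is "a", "c".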

bq. Yes. It's complicated and that's why it's tracked by a separate jira. We probably need to revisit this area of shuffle error reporting more comprehensively.
Yes, it does. There are several issues there which need to be addressed. The Runtime component continues to function as it did in MapReduce, but the AM side has changed - and that has caused a number of issues, most of which need to be addressed. TEZ-814 (when to consider an Input as Failed) is one of them, TEZ-915 is another, and I'll be opening one more (TEZ-924) for Point #2 that I'd mentioned in TEZ-902.


was (Author: sseth):
bq. The previous loop only removes unused/completed inputs until the max fetch limit is reached.
They're removed while constructing the dedupe list - which means the entire list is considered, and not only the limited list which will actually be fetched. The change simplifies the previous loops.

bq. Yes. It's complicated and that's why it's tracked by a separate jira. We probably need to revisit this area of shuffle error reporting more comprehensively.
Yes, it does. There are several issues there which need to be addressed. The Runtime component continues to function as it did in MapReduce, but the AM side has changed - and that has caused a number of issues, most of which need to be addressed. TEZ-814 (when to consider an Input as Failed) is one of them, TEZ-915 is another, and I'll be opening one more for Point #2 that I'd mentioned in TEZ-902.

> Shuffle can hang if there are intermittent fetch failures
> ---------------------------------------------------------
>
>                 Key: TEZ-918
>                 URL: https://issues.apache.org/jira/browse/TEZ-918
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Siddharth Seth
>            Assignee: Siddharth Seth
>            Priority: Critical
>             Fix For: 0.4.0
>
>         Attachments: TEZ-918.1.txt, TEZ-918.2.txt, TEZ-918.3.txt, TEZ-918.4.txt
>
>
> Post TEZ-902 - if there's a fetch failure, the task could end up hanging - waiting for the specific Input.
> Had spoken to [~bikassaha] about this offline while looking at TEZ-902 - another similar issue already exists under the fault tolerance jira, but that occurs rarely and under specific circumstances.
> Will try fixing this tomorrow, otherwise may revert TEZ-902.


