Posted to issues@tez.apache.org by "Jiang Xin (JIRA)" <ji...@apache.org> on 2018/09/19 09:46:00 UTC

[jira] [Commented] (TEZ-3198) Shuffle failures for the trailing task in a vertex are often fatal to the entire DAG

    [ https://issues.apache.org/jira/browse/TEZ-3198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16620383#comment-16620383 ] 

Jiang Xin commented on TEZ-3198:
--------------------------------

I encountered this issue as well recently. In my scenario, the disk that stores the map output is damaged, so when the last several reduce attempts read the map output, a 500 Internal Server Error occurs. Is anyone looking into this issue now? The workaround I'm going to take is to set tez.task.max.allowed.output.failures to 3. Any advice?
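
For illustration, here is a minimal sketch of what that setting could look like as a Hadoop-style configuration entry. It assumes the property is supplied through an XML configuration file such as tez-site.xml (it may equally be passed per job); the helper name hadoop_property is hypothetical and only renders the entry as text.

```python
import xml.etree.ElementTree as ET


def hadoop_property(name: str, value: str) -> str:
    """Render one Hadoop-style <property> element as a string.

    Hypothetical helper for illustration; it only builds the XML text
    and does not talk to Tez or Hadoop.
    """
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value
    return ET.tostring(prop, encoding="unicode")


# The property name and the value 3 come from the comment above.
# Placing the entry in tez-site.xml is an assumption.
snippet = hadoop_property("tez.task.max.allowed.output.failures", "3")
print(snippet)
```

The printed element would go inside the <configuration> root of the file. Whether 3 is a good value depends on the cluster, so treat it as a starting point rather than a recommendation.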

> Shuffle failures for the trailing task in a vertex are often fatal to the entire DAG
> ------------------------------------------------------------------------------------
>
>                 Key: TEZ-3198
>                 URL: https://issues.apache.org/jira/browse/TEZ-3198
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.7.0, 0.8.2
>            Reporter: Jason Lowe
>            Priority: Critical
>
> I've seen an increasing number of cases where a single-node failure caused the whole Tez DAG to fail. What these scenarios have in common is that they involve the last task of a vertex attempting to complete a shuffle where all the peer tasks have already finished shuffling.  The last task's attempt encounters errors shuffling one of its inputs and keeps reporting it to the AM.  Eventually the attempt decides it must be the cause of the shuffle error and fails.  The subsequent attempts all do the same thing, and eventually we hit the task max attempts limit and fail the vertex and DAG.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)