Posted to issues@tez.apache.org by "Jason Lowe (JIRA)" <ji...@apache.org> on 2016/04/05 21:33:25 UTC

[jira] [Commented] (TEZ-3198) Shuffle failures for the trailing task in a vertex are often fatal to the entire DAG

    [ https://issues.apache.org/jira/browse/TEZ-3198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15226974#comment-15226974 ] 

Jason Lowe commented on TEZ-3198:
---------------------------------

The issue is that the AM never gets around to re-running the upstream task.  Part of the problem is that errors reported from a task are aggregated per reporting task attempt, so no matter how often an attempt reports an error against another task, it is recorded as only a single failure when deciding whether to re-run that task.  As I understand it, the algorithm for re-running a task due to fetch failures currently works like this:
- if it has been 5 minutes (default) since the first time the reporting attempt flagged an error then re-run the blamed task
- if the number of unique reporting attempts is more than 10% (default) of the total number of downstream tasks then re-run the blamed task
- if the total number of unique reporting attempts is >= 10 (default) then re-run the blamed task
- otherwise do not re-run the blamed task
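
The rules above can be sketched roughly as follows. This is only an illustration of my understanding as described in this comment, not the actual Tez code: the class name, method name, and constants are all hypothetical, and the thresholds are hard-coded to the defaults mentioned above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the re-run decision described above; not Tez source.
final class FetchFailureRerunSketch {
    // Defaults as stated in the list above (assumed values, hard-coded here).
    static final long MAX_WAIT_MS = 5 * 60 * 1000L;     // 5 minutes
    static final double MAX_REPORTER_FRACTION = 0.10;   // 10% of downstream tasks
    static final int MAX_UNIQUE_REPORTERS = 10;         // absolute reporter count

    // Reporting attempt id -> time of that attempt's first report.
    // Repeated reports from the same attempt do not add new entries.
    private final Map<String, Long> firstReportMs = new HashMap<>();
    private final int totalDownstreamTasks;

    FetchFailureRerunSketch(int totalDownstreamTasks) {
        this.totalDownstreamTasks = totalDownstreamTasks;
    }

    // Records a fetch failure report and returns true if the blamed
    // upstream task should be re-run.
    boolean reportFailure(String reportingAttemptId, long nowMs) {
        firstReportMs.putIfAbsent(reportingAttemptId, nowMs);
        long first = firstReportMs.get(reportingAttemptId);
        int uniqueReporters = firstReportMs.size();

        if (nowMs - first >= MAX_WAIT_MS) {
            return true;  // this attempt has been reporting for 5 minutes
        }
        if ((double) uniqueReporters / totalDownstreamTasks > MAX_REPORTER_FRACTION) {
            return true;  // more than 10% of downstream tasks are complaining
        }
        if (uniqueReporters >= MAX_UNIQUE_REPORTERS) {
            return true;  // at least 10 distinct attempts are complaining
        }
        return false;     // otherwise do not re-run the blamed task
    }
}
```

Note that in this sketch a single attempt reporting many times only ever counts once toward the two count-based checks, which is the aggregation behavior described above.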

Since there's only one task trying to shuffle, there are by default at most 4 attempts.  This means the task alone cannot reach the threshold of 10 unique reporting attempts, nor can it clear the 10% hurdle if there are a decent number of tasks in its vertex.  That leaves just the 5-minute period, but the task attempt often gives up on its own due to the number of failures before it gets past the 5-minute mark.  This is especially true if the failure is fast, such as connection refused, invalid shuffle secret, missing map output on the node, etc.
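
To make the arithmetic concrete, here is a small hypothetical check. It assumes the best case for the lone task, namely that each of its attempts counts as a distinct reporter, and it ignores the 5-minute rule entirely. The class and method names are made up for illustration.

```java
// Hypothetical sketch: can a lone task's attempts clear either count-based
// threshold by themselves? Assumes every attempt counts as a unique reporter.
final class LoneTaskThresholdSketch {
    static final double MAX_REPORTER_FRACTION = 0.10;  // assumed default
    static final int MAX_UNIQUE_REPORTERS = 10;        // assumed default

    static boolean canTriggerByCount(int maxAttempts, int tasksInVertex) {
        boolean clearsFraction =
            (double) maxAttempts / tasksInVertex > MAX_REPORTER_FRACTION;
        boolean clearsAbsolute = maxAttempts >= MAX_UNIQUE_REPORTERS;
        return clearsFraction || clearsAbsolute;
    }
}
```

With 4 attempts, `canTriggerByCount(4, 100)` is false, since 4 reporters are only 4% of 100 tasks and fewer than 10 in total. Under these assumptions the fraction check can only pass when the vertex has fewer than 40 tasks.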

A typical scenario that can lead to this is a lone task needing to be re-run due to shuffle errors.  That lone task needs to shuffle again, long after its peer tasks have completed, and in the interim a node has failed in some way (not necessarily recognized by YARN yet).  Now we're left with a lone task trying to complete a shuffle that cannot succeed, because its attempts always give up before the AM decides to re-run the upstream task.

> Shuffle failures for the trailing task in a vertex are often fatal to the entire DAG
> ------------------------------------------------------------------------------------
>
>                 Key: TEZ-3198
>                 URL: https://issues.apache.org/jira/browse/TEZ-3198
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.7.0, 0.8.2
>            Reporter: Jason Lowe
>            Priority: Critical
>             Fix For: 0.7.1, 0.8.3
>
>
> I've seen an increasing number of cases where a single-node failure caused the whole Tez DAG to fail. What these scenarios have in common is that they involve the last task of a vertex attempting to complete a shuffle after all the peer tasks have already finished shuffling.  The last task's attempt encounters errors shuffling one of its inputs and keeps reporting them to the AM.  Eventually the attempt decides it must itself be the cause of the shuffle errors and fails.  The subsequent attempts all do the same thing, and eventually we hit the task max attempts limit and fail the vertex and DAG.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)