Posted to issues@tez.apache.org by "Jason Lowe (JIRA)" <ji...@apache.org> on 2016/01/26 19:57:39 UTC

[jira] [Updated] (TEZ-3072) Node blacklisting always reruns completed non-leaf tasks

     [ https://issues.apache.org/jira/browse/TEZ-3072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated TEZ-3072:
----------------------------
    Attachment: TEZ-3072.001.patch

Here's a patch that reflects what we plan to run with, at least in the short term, to work around these blacklisting problems.  In short, it keeps blacklisting from interacting with shuffle.  It allows the Tez AM to be configured so that fetch failures do not factor into blacklisting calculations (the MR blacklisting logic does not count them either), and it also avoids re-running any completed tasks due to node failure notifications (e.g., those coming from the blacklisting logic).
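
To make the intended behavior concrete, here is a minimal sketch of the two decisions described above. It is not code from the attached patch: the class, the enum, and both boolean switches are hypothetical names made up for illustration, and whether the "do not re-run completed tasks" part is configurable is an assumption, not something stated here.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; these are NOT Tez classes or Tez config keys.
final class NodeBlacklistSketch {
    enum FailureKind { TASK_FAILURE, FETCH_FAILURE }

    // Assumed switch: should shuffle fetch failures count toward blacklisting?
    private final boolean countFetchFailures;
    // Assumed switch: should a node failure notification re-run completed tasks?
    private final boolean rerunCompletedOnNodeFailure;
    // Assumed threshold of counted failures before a node is blacklisted.
    private final int maxFailuresPerNode;
    private final Map<String, Integer> failuresByNode = new HashMap<>();

    NodeBlacklistSketch(boolean countFetchFailures,
                        boolean rerunCompletedOnNodeFailure,
                        int maxFailuresPerNode) {
        this.countFetchFailures = countFetchFailures;
        this.rerunCompletedOnNodeFailure = rerunCompletedOnNodeFailure;
        this.maxFailuresPerNode = maxFailuresPerNode;
    }

    // Records a failure and returns true if the node is now blacklisted.
    boolean recordFailure(String node, FailureKind kind) {
        if (kind == FailureKind.FETCH_FAILURE && !countFetchFailures) {
            // Fetch failures are ignored, so the node's count is unchanged.
            return isBlacklisted(node);
        }
        failuresByNode.merge(node, 1, Integer::sum);
        return isBlacklisted(node);
    }

    boolean isBlacklisted(String node) {
        return failuresByNode.getOrDefault(node, 0) >= maxFailuresPerNode;
    }

    // Decides whether a task that already completed on a failed or blacklisted
    // node should be re-executed. Leaf tasks are never re-run in this sketch.
    boolean shouldRerunCompletedTask(boolean isLeafTask) {
        return rerunCompletedOnNodeFailure && !isLeafTask;
    }
}
```

With both switches off, fetch failures never push a node over the threshold, and a node failure notification never causes completed non-leaf tasks to be re-run; with both on, the sketch mimics the behavior described in the issue below.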


> Node blacklisting always reruns completed non-leaf tasks
> --------------------------------------------------------
>
>                 Key: TEZ-3072
>                 URL: https://issues.apache.org/jira/browse/TEZ-3072
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Jason Lowe
>         Attachments: TEZ-3072.001.patch
>
>
> Recently a user ran a job with many vertices, and a bug in the user's code caused a problem in one of the trailing vertices.  On some nodes enough tasks failed that the AM decided it needed to blacklist those nodes.  That blacklisting then caused many completed vertices to re-run, because the AM thought it needed to re-execute the non-leaf tasks that had completed on those nodes.  This wasted a lot of cluster resources and job time for no benefit.


