Posted to issues@tez.apache.org by "Yingda Chen (JIRA)" <ji...@apache.org> on 2018/10/16 14:04:00 UTC

[jira] [Assigned] (TEZ-3075) Revamp bad node handling

     [ https://issues.apache.org/jira/browse/TEZ-3075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yingda Chen reassigned TEZ-3075:
--------------------------------

    Assignee: Ying Han

> Revamp bad node handling
> ------------------------
>
>                 Key: TEZ-3075
>                 URL: https://issues.apache.org/jira/browse/TEZ-3075
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Bikas Saha
>            Assignee: Ying Han
>            Priority: Major
>
> The current bad node handling logic is derived from MR and does not work in all cases.
> Things to consider:
> 1) Have a notion of probation where machines are put out of service for a period of time (say 5m, 15m and 30m) before being given up for good. This allows more graceful handling of temporary glitches.
> 2) Different handling for YARN marking a node as bad vs. internal heuristics.
> 3) Bad nodes should not immediately trigger re-execution of completed work. That should be based on the presence of downstream consumers (i.e., existing demand for that output) and a reasonable indication from other consumers of that node that it cannot serve results (e.g., multiple reports of read errors with that node as a source).
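
The probation idea in 1) and the re-execution rule in 3) can be sketched roughly as below. This is only an illustration of the proposal, not Tez code: NodeHealth, should_rerun_output, PROBATION_PERIODS and the min_reporters threshold are hypothetical names and values. The periods reuse the "5m, 15m and 30m" example from the description, and the sketch assumes a node is given up for good on the next bad report after the last period has been used.

```python
from dataclasses import dataclass

# Hypothetical escalation schedule in seconds (the 5m, 15m, 30m example).
PROBATION_PERIODS = (5 * 60, 15 * 60, 30 * 60)


@dataclass
class NodeHealth:
    """Probation state for a single node (illustrative, not a Tez class)."""
    strikes: int = 0               # probation periods already handed out
    probation_until: float = 0.0   # time at which the current probation ends
    blacklisted: bool = False      # True once the node is given up for good

    def report_bad(self, now: float) -> None:
        """Start the next, longer probation, or give up once all are used."""
        if self.blacklisted:
            return
        if self.strikes >= len(PROBATION_PERIODS):
            self.blacklisted = True
            return
        self.probation_until = now + PROBATION_PERIODS[self.strikes]
        self.strikes += 1

    def is_usable(self, now: float) -> bool:
        """A node is usable when it is neither blacklisted nor on probation."""
        return not self.blacklisted and now >= self.probation_until


def should_rerun_output(pending_consumers: int,
                        read_error_reporters: set,
                        min_reporters: int = 2) -> bool:
    """Re-execute completed work only if some consumer still needs the output
    and enough distinct consumers reported read errors from the node.
    The default of 2 reporters is an assumption, not a value from the issue."""
    return pending_consumers > 0 and len(read_error_reporters) >= min_reporters
```

In this sketch a node that misbehaves once is usable again after five minutes, while completed output on a suspect node is left alone unless there is both remaining demand and corroborating read-error reports.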



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)