Posted to issues@tez.apache.org by "Zhiyuan Yang (JIRA)" <ji...@apache.org> on 2017/07/05 18:25:00 UTC

[jira] [Commented] (TEZ-3718) Better handling of 'bad' nodes

    [ https://issues.apache.org/jira/browse/TEZ-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16075194#comment-16075194 ] 

Zhiyuan Yang commented on TEZ-3718:
-----------------------------------

Thanks [~sseth] for review! 
bq.  Instead of isUnhealthy in the event - this could be an enum (UNHEALTHY, BLACKLISTED)
Used an enum in the new patch.
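A minimal sketch of what the suggested enum-carrying event could look like. All names here are illustrative, not Tez's actual API:

```java
// Hypothetical sketch of the reviewer's suggestion: replace a boolean
// isUnhealthy flag on the node event with an enum that distinguishes
// why the node became unusable. Names are illustrative, not Tez's API.
public class AMNodeEventSketch {
    // Why the AM should stop using this node.
    enum NodeUnusableReason {
        UNHEALTHY,    // node reported unhealthy by the resource manager
        BLACKLISTED   // node blacklisted by the AM after repeated task failures
    }

    private final NodeUnusableReason reason;

    AMNodeEventSketch(NodeUnusableReason reason) {
        this.reason = reason;
    }

    NodeUnusableReason getReason() {
        return reason;
    }
}
```

An enum keeps the event extensible (a third reason can be added later without changing the event's shape), which a boolean flag cannot.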

bq. Don't change container state - allow a task action to change the state via a STOP_REQUEST depending on task level configs. OR If there are no running fragments on the container, change state to a COMPLETED state, so that new task allocations are not accepted.
In the third patch, I removed the 'do nothing' behavior previously introduced by TEZ-2972. The sole purpose of that behavior was to avoid unnecessary rescheduling of completed tasks; the third patch handles this more precisely, so there is no need to keep it. The semantics of the configuration have also changed accordingly. Given that, there is no issue with sending TA_CONTAINER_TERMINATING from the container to TaskAttemptImpl, since a running task should always be killed.

bq. Don't read from a Configuration instance within each AMContainer / TaskAttemptImpl - there's example code on how to avoid this in TaskImpl/TaskAttemptImpl
I didn't change this part. Are you suggesting this because of a locking issue? Even in TaskAttemptImpl, the configuration is read by each instance, just in a different place (the constructor).
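For context, the pattern being referenced is roughly the following: read the config value once in the constructor and cache it in a final field, so that state transitions never touch the Configuration object again. This is a hedged illustration; the class and map-based constructor are stand-ins, not Tez's real signatures (the property name `tez.am.node-unhealthy-reschedule-tasks` is the existing Tez config key):

```java
// Hypothetical illustration of the two patterns under discussion: reading a
// config value on every use versus caching it once in the constructor (the
// pattern TaskImpl/TaskAttemptImpl use). Class and constructor are stand-ins.
import java.util.Map;

public class ConfigCachingSketch {
    // Read once at construction, then reused on every state transition.
    private final boolean rescheduleTasksOnUnhealthyNode;

    ConfigCachingSketch(Map<String, String> conf) {
        // Caching here avoids repeated Configuration lookups (and any
        // associated parsing or locking) inside hot state-machine code.
        this.rescheduleTasksOnUnhealthyNode = Boolean.parseBoolean(
            conf.getOrDefault("tez.am.node-unhealthy-reschedule-tasks", "false"));
    }

    boolean shouldRescheduleTasks() {
        return rescheduleTasksOnUnhealthyNode;
    }
}
```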

> Better handling of 'bad' nodes
> ------------------------------
>
>                 Key: TEZ-3718
>                 URL: https://issues.apache.org/jira/browse/TEZ-3718
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Siddharth Seth
>            Assignee: Zhiyuan Yang
>         Attachments: TEZ-3718.1.patch, TEZ-3718.2.patch, TEZ-3718.3.patch
>
>
> At the moment, the default behaviour in case of a node being marked bad is to do nothing other than not schedule new tasks on this node.
> The alternate, via config, is to retroactively kill every task which ran on the node, which causes far too many unnecessary re-runs.
> Proposing the following changes.
> 1. KILL fragments which are currently in the RUNNING state (instead of relying on a timeout, which leads to the attempt being marked as FAILED after the timeout interval).
> 2. Keep track of these failed nodes, and use this as input to the failure heuristics. Normally source tasks require multiple consumers to report failure for them to be marked as bad. If a single consumer reports failure against a source which ran on a bad node, consider it bad and re-schedule immediately. (Otherwise failures can take a while to propagate, and jobs get a lot slower).
> [~jlowe] - I think you've looked at this in the past. Any thoughts/suggestions?
> What I'm seeing is retroactive failures taking a long time to apply and to restart sources which ran on a bad node. Also, running tasks are being counted as FAILURES instead of KILLS.
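The failure heuristic proposed in point 2 above can be sketched as follows. The threshold and all names are illustrative assumptions, not Tez's implementation; the idea is only that a source attempt which ran on a known-bad node is rescheduled on the first consumer report rather than waiting for the usual quorum:

```java
// Hypothetical sketch of the proposed heuristic: normally a source task is
// re-run only after several consumers report fetch failures against it, but
// if the source ran on a node already marked bad, a single report suffices.
// The threshold value and all names are illustrative, not Tez's code.
import java.util.HashSet;
import java.util.Set;

public class BadNodeHeuristicSketch {
    private static final int NORMAL_FAILURE_THRESHOLD = 3; // illustrative

    // Nodes the AM has marked bad (unhealthy or blacklisted).
    private final Set<String> badNodes = new HashSet<>();

    void markNodeBad(String node) {
        badNodes.add(node);
    }

    // Decide whether to reschedule the source attempt now, given the node it
    // ran on and how many consumers have reported failures against it.
    boolean shouldReschedule(String sourceNode, int consumerFailureReports) {
        if (badNodes.contains(sourceNode)) {
            return consumerFailureReports >= 1; // fast path: trust one report
        }
        return consumerFailureReports >= NORMAL_FAILURE_THRESHOLD;
    }
}
```

This would address the slow propagation the description mentions: failures against bad-node sources no longer wait for multiple consumers before triggering a re-run.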



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)