Posted to issues@tez.apache.org by "TezQA (JIRA)" <ji...@apache.org> on 2017/10/11 00:11:00 UTC

[jira] [Commented] (TEZ-3718) Better handling of 'bad' nodes

    [ https://issues.apache.org/jira/browse/TEZ-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16199602#comment-16199602 ] 

TezQA commented on TEZ-3718:
----------------------------

{color:red}-1 overall{color}.  Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12891370/TEZ-3718.4.patch
  against master revision c82b2ea.

    {color:green}+1 @author{color}.  The patch does not contain any @author tags.

    {color:green}+1 tests included{color}.  The patch appears to include 4 new or modified test files.

    {color:red}-1 javac{color}.  The applied patch generated 25 javac compiler warnings (more than the master's current 24 warnings).

    {color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

    {color:green}+1 findbugs{color}.  The patch does not introduce any new Findbugs (version 3.0.1) warnings.

    {color:green}+1 release audit{color}.  The applied patch does not increase the total number of release audit warnings.

    {color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/2658//testReport/
Javac warnings: https://builds.apache.org/job/PreCommit-TEZ-Build/2658//artifact/patchprocess/diffJavacWarnings.txt
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/2658//console

This message is automatically generated.

> Better handling of 'bad' nodes
> ------------------------------
>
>                 Key: TEZ-3718
>                 URL: https://issues.apache.org/jira/browse/TEZ-3718
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Siddharth Seth
>            Assignee: Zhiyuan Yang
>         Attachments: TEZ-3718.1.patch, TEZ-3718.2.patch, TEZ-3718.3.patch, TEZ-3718.4.patch
>
>
> At the moment, the default behaviour in case of a node being marked bad is to do nothing other than not schedule new tasks on this node.
> The alternative, available via config, is to retroactively kill every task which ran on the node, which causes far too many unnecessary re-runs.
> Proposing the following changes.
> 1. KILL fragments which are currently in the RUNNING state (instead of relying on a timeout which leads to the attempt being marked as FAILED after the timeout interval).
> 2. Keep track of these failed nodes, and use this as input to the failure heuristics. Normally source tasks require multiple consumers to report failure for them to be marked as bad. If a single consumer reports failure against a source which ran on a bad node, consider it bad and re-schedule immediately. (Otherwise failures can take a while to propagate, and jobs get a lot slower).
> [~jlowe] - I think you've looked at this in the past. Any thoughts/suggestions?
> What I'm seeing is retroactive failures taking a long time to apply and to restart sources which ran on a bad node. Also, running tasks are being counted as FAILURES instead of KILLS.
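
The two proposed changes in the description above could be sketched roughly as follows. This is only an illustration of the heuristic as described, not code from the attached patches: the class, enum and method names, and the default threshold of 3 consumer reports, are all hypothetical assumptions and do not correspond to Tez classes or configuration.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch only; none of these names come from the Tez code base.
class BadNodeHeuristic {
    enum AttemptState { RUNNING, SUCCEEDED }
    enum Action { KILL, NONE }

    // Assumed number of consumer failure reports needed for a source that
    // ran on a healthy node. The real value is not given in the issue.
    static final int DEFAULT_REPORTS_REQUIRED = 3;

    // Change 2: remember which nodes have been marked bad.
    private final Set<String> badNodes = new HashSet<>();

    void markNodeBad(String nodeId) {
        badNodes.add(nodeId);
    }

    // Change 1: an attempt still RUNNING on a node that was just marked bad
    // is killed right away instead of waiting for a timeout that would
    // record it as FAILED. Completed attempts are left alone.
    Action onNodeMarkedBad(AttemptState state) {
        return state == AttemptState.RUNNING ? Action.KILL : Action.NONE;
    }

    // Change 2: a source that ran on a known bad node is re-scheduled after
    // a single consumer failure report; other sources still need several.
    boolean shouldRescheduleSource(String sourceNodeId, int consumerFailureReports) {
        int required = badNodes.contains(sourceNodeId) ? 1 : DEFAULT_REPORTS_REQUIRED;
        return consumerFailureReports >= required;
    }
}
```

With this sketch, a single failure report against a source on a bad node triggers an immediate re-run, while sources on healthy nodes keep the usual multi-consumer requirement, which avoids the blanket re-runs of the existing config option.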



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)