Posted to issues@tez.apache.org by "Siddharth Seth (JIRA)" <ji...@apache.org> on 2016/03/14 22:08:33 UTC

[jira] [Commented] (TEZ-3164) Surface error histograms from the AM

    [ https://issues.apache.org/jira/browse/TEZ-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15194133#comment-15194133 ] 

Siddharth Seth commented on TEZ-3164:
-------------------------------------

Big +1 for doing this.
An external script could be used for such diagnostics, but Tez, MR, etc. will likely already have much of this information from running jobs.

> Surface error histograms from the AM
> ------------------------------------
>
>                 Key: TEZ-3164
>                 URL: https://issues.apache.org/jira/browse/TEZ-3164
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Bikas Saha
>
> Job tasks are constantly probing the cluster, so if the cluster has issues, jobs would be the first to notice them. If we can surface these observations to the user, we could quickly identify cluster issues.
> Let's say a set of bad machines got added to the cluster and tasks started seeing shuffle errors from those machines. This can slow down or hang the job. If the AM can surface increased error counts from source and destination machines, then that could pinpoint the bad machines, rather than having to arrive at those machines from first principles and log searching.
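
To make the idea above concrete, here is a minimal sketch of a per-host error histogram. It is not Tez code: the class name ShuffleErrorHistogram, its methods, and the host names in the usage note are all hypothetical. The assumption is only that the AM learns, for each failed shuffle fetch, which host served the data (source) and which host tried to read it (destination).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch (not a Tez API): counts shuffle fetch failures per host. */
class ShuffleErrorHistogram {
    // Failure counts keyed by the host that served the data.
    private final Map<String, Integer> bySource = new HashMap<>();
    // Failure counts keyed by the host that tried to fetch the data.
    private final Map<String, Integer> byDestination = new HashMap<>();

    /** Record one failed fetch of data served by sourceHost, read by destinationHost. */
    void recordFailure(String sourceHost, String destinationHost) {
        bySource.merge(sourceHost, 1, Integer::sum);
        byDestination.merge(destinationHost, 1, Integer::sum);
    }

    int sourceCount(String host) {
        return bySource.getOrDefault(host, 0);
    }

    int destinationCount(String host) {
        return byDestination.getOrDefault(host, 0);
    }

    /** Source hosts with at least threshold failures, highest count first, ties by name. */
    List<String> suspectSources(int threshold) {
        List<String> hosts = new ArrayList<>();
        for (Map.Entry<String, Integer> e : bySource.entrySet()) {
            if (e.getValue() >= threshold) {
                hosts.add(e.getKey());
            }
        }
        hosts.sort((a, b) -> {
            int byCount = Integer.compare(bySource.get(b), bySource.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return hosts;
    }
}
```

For example, if fetches from a made-up host "node-07" fail when read from both "node-01" and "node-02", then sourceCount("node-07") is 2 and suspectSources(2) contains only "node-07". A host that accumulates errors as a source across many different destinations is a likelier culprit than any single destination, which is the kind of signal the description suggests surfacing.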



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)