Posted to dev@hive.apache.org by "Gopal V (JIRA)" <ji...@apache.org> on 2014/03/26 03:01:17 UTC

[jira] [Resolved] (HIVE-6751) maxtaskfailures.per.node is set to too low a threshold

     [ https://issues.apache.org/jira/browse/HIVE-6751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gopal V resolved HIVE-6751.
---------------------------

    Resolution: Invalid

This should be a Tez fix.

> maxtaskfailures.per.node is set to too low a threshold
> ------------------------------------------------------
>
>                 Key: HIVE-6751
>                 URL: https://issues.apache.org/jira/browse/HIVE-6751
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Gopal V
>
> Node blacklisting results in a task-retry pattern that can consume cluster resources excessively on queries that will eventually fail anyway.
> For a query with large stages on a 20-node cluster, a few failures can cause earlier query stages to be re-run multiple times until the broken reducer has eventually failed 3 times.
> The same vertex failing 3 times on a node is no reason to throw away all the shuffle data already accumulated on that node.
> An alternative strategy is to kill a container after 3 tasks fail within it, because the failure is occasionally caused by bugs triggered by container re-use (static variables, incomplete task cleanup, etc.) and the task will succeed when run on a fresh container.
> The node-failure threshold should then be ~3x the number of containers, since the containers are respawned after every 3rd failure.
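The arithmetic behind the proposed threshold can be sketched as follows. This is an illustrative sketch only, not actual Hive or Tez code; the constant and function names are hypothetical, chosen to mirror the reasoning in the issue description:

```python
# Hypothetical sketch of the proposed policy (not real Tez/Hive code):
# each container is killed and respawned after 3 task failures, so a
# node should only be blacklisted after roughly 3 failures per container.

CONTAINER_FAILURE_LIMIT = 3  # respawn a container after its 3rd task failure


def node_blacklist_threshold(containers_per_node: int) -> int:
    """Per-node failure count before blacklisting: ~3x the number of
    containers running on that node, per the proposal."""
    return CONTAINER_FAILURE_LIMIT * containers_per_node


# A node running 8 containers would tolerate ~24 task failures before
# being blacklisted, instead of a fixed low default.
print(node_blacklist_threshold(8))
```

Under this scheme a bug caused by container re-use is cleared by the respawn, while shuffle data on the node survives until failures persist across fresh containers.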



--
This message was sent by Atlassian JIRA
(v6.2#6252)