Posted to dev@hive.apache.org by "Gopal V (JIRA)" <ji...@apache.org> on 2014/03/26 03:01:17 UTC
[jira] [Resolved] (HIVE-6751) maxtaskfailures.per.node is set to too low a threshold
[ https://issues.apache.org/jira/browse/HIVE-6751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Gopal V resolved HIVE-6751.
---------------------------
Resolution: Invalid
Should be a Tez fix.
> maxtaskfailures.per.node is set to too low a threshold
> ------------------------------------------------------
>
> Key: HIVE-6751
> URL: https://issues.apache.org/jira/browse/HIVE-6751
> Project: Hive
> Issue Type: Bug
> Reporter: Gopal V
>
> Node blacklisting drives a task retry system that can consume excessive cluster resources on queries that will eventually fail anyway.
> For a query with a large stage on a 20-node cluster, a few failures can make the query go back and re-run stages multiple times, until it eventually re-runs the broken reducer 3 times.
> The same vertex failing 3 times on a node is no reason to throw away all the shuffle data already accumulated on that node.
> An alternative strategy is to kill a container after 3 tasks fail within it, because the error is occasionally caused by bugs triggered by container re-use (static variables, incomplete task cleanup, etc.), and the task will succeed if run on a fresh container.
> The threshold for a node failure should be ~3x the number of containers, when containers are respawned after every 3rd failure.
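
To make the proposed policy concrete, here is a minimal sketch in Python. It is not Tez or Hive code: the class, function names, and return values are hypothetical, and it assumes "no-of-containers" means the number of containers running on the node. It kills a container once 3 tasks have failed in it, and only blacklists the node once the node's total failures reach 3x its container count.

```python
from collections import defaultdict

# Hypothetical constant taken from the description above; not a Tez setting.
FAILURES_PER_CONTAINER = 3


def node_failure_threshold(containers_per_node):
    """Node threshold of roughly 3x the number of containers on the node."""
    return FAILURES_PER_CONTAINER * containers_per_node


class FailureTracker:
    """Illustrative bookkeeping for the kill-container-first policy."""

    def __init__(self, containers_per_node):
        self.node_threshold = node_failure_threshold(containers_per_node)
        self.container_failures = defaultdict(int)
        self.node_failures = defaultdict(int)
        self.killed_containers = set()
        self.blacklisted_nodes = set()

    def record_failure(self, node, container):
        """Record one task failure and return the action to take."""
        self.container_failures[container] += 1
        self.node_failures[node] += 1

        # Blacklist the node only after enough failures across its containers.
        if self.node_failures[node] >= self.node_threshold:
            self.blacklisted_nodes.add(node)
            return "blacklist_node"

        # Otherwise replace a container once 3 tasks have failed inside it,
        # so the retry runs in a fresh container on the same node.
        if self.container_failures[container] >= FAILURES_PER_CONTAINER:
            self.killed_containers.add(container)
            del self.container_failures[container]
            return "kill_container"

        return "retry"
```

With 4 containers per node, the node threshold in this sketch is 12 failures, so a single bad container is replaced after its 3rd failure while the node and its shuffle data stay in service.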
--
This message was sent by Atlassian JIRA
(v6.2#6252)