Posted to dev@mesos.apache.org by "Matei Zaharia (Updated) (JIRA)" <ji...@apache.org> on 2011/12/29 14:13:30 UTC
[jira] [Updated] (MESOS-106) Failover timeout should default to 1
[ https://issues.apache.org/jira/browse/MESOS-106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Matei Zaharia updated MESOS-106:
--------------------------------
Description: Since the failover timeout was added, you get a lot of weird behavior in clusters running frameworks that don't support failover due to its long default value of 1 day. If a framework fails or just exits without calling driver.stop(), all its executors stay around and consume resources on the machines, causing subsequent runs to mysteriously fail to acquire resources. See http://groups.google.com/group/spark-users/msg/553af12424e4ed3d for an example. I know that the failover timeout is supposed to eventually become a per-framework parameter anyway, but in the meantime, the easiest way to prevent this is to set it to 1, because almost no users have failover-enabled frameworks.
(was: Since the failover timeout was added, you get a lot of weird behavior in clusters running frameworks that don't support failover due to its long default value of 1 day. If a framework fails or just exits without calling driver.stop(), all its executors stay around and consume resources on the machines, causing subsequent runs to mysteriously fail to acquire resources. See http://groups.google.com/group/spark-users/msg/553af12424e4ed3d for an example. I know that the failover timeout is supposed to eventually become a per-framework parameter anyway, but in the meantime, the easiest way to prevent this is to set it to 0, because almost no users have failover-enabled frameworks.)
Summary: Failover timeout should default to 1 (was: Failover timeout should default to 0)
> Failover timeout should default to 1
> ------------------------------------
>
> Key: MESOS-106
> URL: https://issues.apache.org/jira/browse/MESOS-106
> Project: Mesos
> Issue Type: Improvement
> Reporter: Matei Zaharia
> Attachments: MESOS-106-v2.patch, MESOS-106-v3.patch, MESOS-106.patch
>
>
> Since the failover timeout was added, you get a lot of weird behavior in clusters running frameworks that don't support failover due to its long default value of 1 day. If a framework fails or just exits without calling driver.stop(), all its executors stay around and consume resources on the machines, causing subsequent runs to mysteriously fail to acquire resources. See http://groups.google.com/group/spark-users/msg/553af12424e4ed3d for an example. I know that the failover timeout is supposed to eventually become a per-framework parameter anyway, but in the meantime, the easiest way to prevent this is to set it to 1, because almost no users have failover-enabled frameworks.
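
To make the resource problem described above concrete, here is a minimal sketch of the timing logic. It is not Mesos code: the Framework record, executors_reclaimed, and free_cpus are hypothetical names, and the sketch assumes the timeout is measured in seconds, which the issue text does not state. It only illustrates why a one-day timeout leaves a later run with no resources while a timeout of 1 frees them almost immediately.

```python
from dataclasses import dataclass
from typing import List, Optional

# The old default mentioned in the issue, expressed in seconds (assumed unit).
ONE_DAY_SECS = 24 * 60 * 60


@dataclass
class Framework:
    """Hypothetical record of a framework as seen by the master."""
    name: str
    cpus_held: int                    # CPUs held by the framework's executors
    disconnected_at: Optional[float]  # seconds; None while the scheduler is connected


def executors_reclaimed(fw: Framework, now: float, failover_timeout: float) -> bool:
    """True once a disconnected framework has been gone for the full timeout."""
    if fw.disconnected_at is None:
        return False
    return now - fw.disconnected_at >= failover_timeout


def free_cpus(total_cpus: int, frameworks: List[Framework], now: float,
              failover_timeout: float) -> int:
    """CPUs available to a new run after subtracting executors still kept alive."""
    held = sum(fw.cpus_held for fw in frameworks
               if not executors_reclaimed(fw, now, failover_timeout))
    return total_cpus - held


# A framework exited without calling driver.stop() at t=0; a new run starts at t=60.
crashed = Framework(name="crashed-run", cpus_held=8, disconnected_at=0.0)
print(free_cpus(8, [crashed], now=60.0, failover_timeout=ONE_DAY_SECS))  # 0
print(free_cpus(8, [crashed], now=60.0, failover_timeout=1))             # 8
```

With the one-day value, the crashed run's executors still count against the cluster a minute later, so the new run sees nothing to acquire. With a value of 1, the same executors are treated as reclaimable well before the new run asks for resources.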