Posted to dev@mesos.apache.org by "Yuval Pavel Zholkover (JIRA)" <ji...@apache.org> on 2014/06/05 18:16:02 UTC
[jira] [Commented] (MESOS-1219) Master should generate new id for frameworks that reconnect after failover timeout

    [ https://issues.apache.org/jira/browse/MESOS-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14018908#comment-14018908 ] 

Yuval Pavel Zholkover commented on MESOS-1219:
----------------------------------------------

Hi,

Attached is a log excerpt from a mesos-master 0.16.0 after a ZooKeeper cluster hiccup:
{noformat}
I0605 00:05:16.398090 24908 master.cpp:872] Re-registering framework background_0 at scheduler(1)@xxx.xxx.xxx.xxx:48561
I0605 00:05:16.398108 24908 master.cpp:910] Allowing the Framework background_0 to re-register with an already used id

W0605 00:05:16.774268 24908 master.cpp:1393] Slave at slave(1)@xxx.xxx.xxx.xxx:5044 (deb015) is being allowed to re-register with an already in use id (201406030648-1821603594-5043-25313-1)
W0605 00:05:16.774646 24908 master.cpp:2384] Slave 201406030648-1821603594-5043-25313-1 (deb015) re-registered with completed framework background_0. Shutting down the framework on the slave

I0605 00:05:17.522913 24915 master.cpp:1592] Executor executor-background_0 of framework background_0 on slave 201406030648-1821603594-5043-25313-1 (deb015) has terminated with signal Real-time signal 9
{noformat}

We are re-registering a scheduler with the same frameworkId as a previously failed one (background_0); this is a mistake on our part.
The master forces the re-registered slave to kill the background_0 executor because the background_0 frameworkId is already in the completedFrameworks circular_buffer. No TASK_LOST/TASK_KILLED status updates are sent to the re-registered scheduler.
Also, I'm not sure whether there's another issue for this, but executorLost callbacks never get called as things stand.
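
To make the sequence above easier to follow, here is a toy model of the behaviour as I understand it from the log. It is only a sketch, not Mesos code: MasterSketch, its methods and the max_completed parameter are hypothetical names, and a Python deque with maxlen stands in for the completedFrameworks circular_buffer.

```python
from collections import deque


class MasterSketch:
    """Toy model of the behaviour described above; not actual Mesos code."""

    def __init__(self, max_completed=2):
        # Bounded history of completed framework ids; once full, the oldest
        # entry is dropped (stand-in for the completedFrameworks buffer).
        self.completed_frameworks = deque(maxlen=max_completed)
        self.active_frameworks = set()

    def register_framework(self, framework_id):
        # The id is accepted even if it is already in the completed history,
        # mirroring "re-register with an already used id" in the log.
        self.active_frameworks.add(framework_id)

    def complete_framework(self, framework_id):
        # Called when a framework is torn down, e.g. after its failover timeout.
        self.active_frameworks.discard(framework_id)
        self.completed_frameworks.append(framework_id)

    def on_slave_reregistered(self, slave_framework_ids):
        # Frameworks the slave is told to shut down: every id it reports that
        # is still present in the completed history.
        return [fid for fid in slave_framework_ids
                if fid in self.completed_frameworks]


master = MasterSketch()
master.register_framework("background_0")
master.complete_framework("background_0")   # framework fails and is completed
master.register_framework("background_0")   # scheduler comes back with the same id
to_shut_down = master.on_slave_reregistered(["background_0"])
```

In this model to_shut_down contains background_0 even though a scheduler has re-registered under that id, which is the situation in the log excerpt.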

The workaround is to reset all the masters to clear their completedFrameworks state, and to stop re-using failed frameworkIds. Alternatively, do not set the failover_timeout (default 0.0). Thanks [~adam-mesos] #irc
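
On the scheduler side, "stop re-using failed frameworkIds" could look roughly like the sketch below. It assumes the scheduler records when it lost its connection and knows the failover_timeout it registered with; framework_id_to_register and its parameters are hypothetical names, and returning None stands for registering without an id so that the master assigns a fresh one. The scheduler's clock is only an approximation of what the master sees, so this is a heuristic rather than a guarantee.

```python
import time


def framework_id_to_register(stored_id, disconnected_at, failover_timeout, now=None):
    """Return stored_id if it still looks safe to reuse, else None.

    Hypothetical helper: None means "register as a new framework".
    All times are in seconds.
    """
    if now is None:
        now = time.time()
    if stored_id is None:
        # Nothing stored, so this is a first registration.
        return None
    if now - disconnected_at >= failover_timeout:
        # The master has probably completed the old framework already,
        # so reusing its id would hit the problem described above.
        return None
    return stored_id


# Disconnected at t=100 with a 60 second failover timeout.
within_timeout = framework_id_to_register("background_0", 100.0, 60.0, now=130.0)
after_timeout = framework_id_to_register("background_0", 100.0, 60.0, now=200.0)
```

Here within_timeout keeps the stored id, while after_timeout is None and the scheduler would register as a new framework.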

> Master should generate new id for frameworks that reconnect after failover timeout
> ----------------------------------------------------------------------------------
>
>                 Key: MESOS-1219
>                 URL: https://issues.apache.org/jira/browse/MESOS-1219
>             Project: Mesos
>          Issue Type: Bug
>          Components: master, webui
>            Reporter: Robert Lacroix
>
> When a scheduler reconnects after the failover timeout has elapsed, the framework id is usually reused because the scheduler doesn't know that the timeout elapsed, even though the master actually handles it as a new framework.
> The /framework/:framework_id route of the Web UI doesn't handle those cases very well because its key is reused. It only shows the terminated one.
> Would it make sense to ignore the provided framework id when a scheduler reconnects to a terminated framework and generate a new id to make sure it's unique?


