Posted to dev@mesos.apache.org by "Vinod Kone (JIRA)" <ji...@apache.org> on 2013/10/04 02:31:43 UTC

[jira] [Commented] (MESOS-420) Master crashes when re-registering framework

    [ https://issues.apache.org/jira/browse/MESOS-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13785737#comment-13785737 ] 

Vinod Kone commented on MESOS-420:
----------------------------------

OK. I think this check is wrong in the first place, for a couple of reasons.

1) For command executors' tasks, Task.executor_id() is set by the slave. So when the slave re-registers with a failed-over master, the master doesn't know about the executor but does know about the task. But the master uses Task.executor_id() as a proxy for determining whether the task belongs to a command executor.

2) It is possible for the executor-exited message to reach the master before the terminal status update (TASK_LOST) for the executor's non-terminal tasks reaches the master. In this case the master knows about the task but not the executor.

In both scenarios, if the framework re-registers with the failed-over master after 1) or 2) happens, this check will fail.
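
The two scenarios can be illustrated with a minimal sketch. This is not the actual Mesos code: the Task and Slave structs, the invariantHolds() helper, and the IDs used are hypothetical simplifications, assumed only to mirror the shape of the failing check slave->hasExecutor(framework->id, task->executor_id()).

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical, simplified stand-in for a task as seen by the master.
struct Task {
  std::string id;
  std::string executorId;  // For command executors, filled in by the slave.
};

// Hypothetical, simplified stand-in for the master's view of a slave.
struct Slave {
  std::set<std::string> executors;  // Executors the master knows about.

  bool hasExecutor(const std::string& executorId) const {
    return executors.count(executorId) > 0;
  }
};

// Mirrors the shape of the failing check: the master assumes that every
// task it knows about has an executor it also knows about.
bool invariantHolds(const Slave& slave, const Task& task) {
  return slave.hasExecutor(task.executorId);
}

// Scenario 1: the slave re-registers with a failed-over master. The task
// carries an executor ID set by the slave, but the master has no record
// of that (command) executor.
bool commandExecutorScenario() {
  Slave slave;  // The master's executor set for this slave is empty.
  Task task{"task-1", "command-executor-1"};
  return invariantHolds(slave, task);
}

// Scenario 2: the executor-exited message is processed before TASK_LOST
// arrives, so the executor is gone while the task is still tracked.
bool executorExitedFirstScenario() {
  Slave slave;
  slave.executors.insert("executor-1");
  Task task{"task-2", "executor-1"};

  slave.executors.erase("executor-1");  // Executor-exited message handled.
  // TASK_LOST has not arrived yet, so the task is still known here.
  return invariantHolds(slave, task);
}
```

In this sketch both scenario functions return false, which corresponds to the state in which a framework re-registration would hit the fatal check.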

> Master crashes when re-registering framework
> --------------------------------------------
>
>                 Key: MESOS-420
>                 URL: https://issues.apache.org/jira/browse/MESOS-420
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Vinod Kone
>            Assignee: Vinod Kone
>            Priority: Critical
>
> F0328 18:42:22.489728 48996 master.cpp:697] Check failed: slave->hasExecutor(framework->id, task->executor_id()) 
> *** Check failure stack trace: ***
>     @     0x7f2c45f4f69d  google::LogMessage::Fail()
>     @     0x7f2c45f55307  google::LogMessage::SendToLog()
>     @     0x7f2c45f50f4c  google::LogMessage::Flush()
>     @     0x7f2c45f511b6  google::LogMessageFatal::~LogMessageFatal()
>     @     0x7f2c45bb7857  mesos::internal::master::Master::reregisterFramework()
>     @     0x7f2c45bfa9f1  ProtobufProcess<>::handler2<>()
>     @     0x7f2c45bc6207  std::tr1::_Function_handler<>::_M_invoke()
>     @     0x7f2c45bfb03b  ProtobufProcess<>::visit()
>     @     0x7f2c45e18e74  process::MessageEvent::visit()
>     @     0x7f2c45e0d11d  process::ProcessManager::resume()
>     @     0x7f2c45e0d968  process::schedule()
>     @       0x358ca0673d  (unknown)
>     @       0x358c2d3f6d  (unknown)



--
This message was sent by Atlassian JIRA
(v6.1#6144)