Posted to dev@mesos.apache.org by "Vinod Kone (JIRA)" <ji...@apache.org> on 2013/10/04 02:31:43 UTC
[jira] [Commented] (MESOS-420) Master crashes when re-registering framework
[ https://issues.apache.org/jira/browse/MESOS-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13785737#comment-13785737 ]
Vinod Kone commented on MESOS-420:
----------------------------------
OK. I think this check is wrong in the first place, for a couple of reasons.
1) For command executors' tasks, Task.executor_id() is set by the slave. So when the slave re-registers with a failed-over master, the master knows about the task but not about the executor. But the master uses Task.executor_id() as a proxy for determining whether the task belongs to a command executor.
2) It is possible that the executor exited message reaches the master before the terminal status update (TASK_LOST) for its non-terminal tasks does. In this case, too, the master knows about the task but not the executor.
In both scenarios, if the framework re-registers with the failed-over master after 1) or 2) happens, this check will fail.
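To make the two scenarios concrete, here is a minimal, self-contained sketch. It is not the actual Mesos master code: the Task and Slave structs, the function names, and the IDs such as "cmd-exec-1" are hypothetical stand-ins, and the only thing it models is the master's bookkeeping of "tasks I know about" versus "executors I know about".

```cpp
// Hypothetical, simplified model of the master's per-slave bookkeeping.
// None of these types or functions are the real Mesos ones.
#include <cassert>
#include <set>
#include <string>
#include <vector>

struct Task {
  std::string id;
  std::string executorId;  // for command executors, filled in by the slave
};

struct Slave {
  std::set<std::string> executors;  // executors the master knows about
  std::vector<Task> tasks;          // tasks the master knows about

  bool hasExecutor(const std::string& executorId) const {
    return executors.count(executorId) > 0;
  }
};

// Scenario 1: the slave re-registers with a failed-over master and reports
// a command-executor task. The master records the task, but (in this
// model) never learns about the executor the slave created for it.
Slave reregisterWithCommandTask() {
  Slave slave;
  slave.tasks.push_back({"task-1", "cmd-exec-1"});
  return slave;
}

// Scenario 2: the executor exited message is processed before the terminal
// status update (TASK_LOST) for the executor's non-terminal task arrives.
Slave executorExitedBeforeTaskLost() {
  Slave slave;
  slave.executors.insert("exec-1");
  slave.tasks.push_back({"task-2", "exec-1"});
  assert(slave.hasExecutor("exec-1"));  // the invariant holds so far
  slave.executors.erase("exec-1");      // executor exited; task still listed
  return slave;
}

// Counts the tasks for which a check of the form
// slave->hasExecutor(..., task->executor_id()) would not hold.
int tasksViolatingCheck(const Slave& slave) {
  int violations = 0;
  for (const Task& task : slave.tasks) {
    if (!slave.hasExecutor(task.executorId)) {
      ++violations;
    }
  }
  return violations;
}
```

In this model, tasksViolatingCheck() returns 1 for the state produced by either scenario function, which is the condition under which a hard check during framework re-registration would abort the process.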
> Master crashes when re-registering framework
> --------------------------------------------
>
> Key: MESOS-420
> URL: https://issues.apache.org/jira/browse/MESOS-420
> Project: Mesos
> Issue Type: Bug
> Reporter: Vinod Kone
> Assignee: Vinod Kone
> Priority: Critical
>
> F0328 18:42:22.489728 48996 master.cpp:697] Check failed: slave->hasExecutor(framework->id, task->executor_id())
> *** Check failure stack trace: ***
> @ 0x7f2c45f4f69d google::LogMessage::Fail()
> @ 0x7f2c45f55307 google::LogMessage::SendToLog()
> @ 0x7f2c45f50f4c google::LogMessage::Flush()
> @ 0x7f2c45f511b6 google::LogMessageFatal::~LogMessageFatal()
> @ 0x7f2c45bb7857 mesos::internal::master::Master::reregisterFramework()
> @ 0x7f2c45bfa9f1 ProtobufProcess<>::handler2<>()
> @ 0x7f2c45bc6207 std::tr1::_Function_handler<>::_M_invoke()
> @ 0x7f2c45bfb03b ProtobufProcess<>::visit()
> @ 0x7f2c45e18e74 process::MessageEvent::visit()
> @ 0x7f2c45e0d11d process::ProcessManager::resume()
> @ 0x7f2c45e0d968 process::schedule()
> @ 0x358ca0673d (unknown)
> @ 0x358c2d3f6d (unknown)
--
This message was sent by Atlassian JIRA
(v6.1#6144)