Posted to dev@mesos.apache.org by "Vinod Kone (JIRA)" <ji...@apache.org> on 2012/06/23 04:18:42 UTC

[jira] [Created] (MESOS-218) Master throws exception on removeTask() if Framework is not connected

Vinod Kone created MESOS-218:
--------------------------------

             Summary: Master throws exception on removeTask() if Framework is not connected
                 Key: MESOS-218
                 URL: https://issues.apache.org/jira/browse/MESOS-218
             Project: Mesos
          Issue Type: Bug
            Reporter: Vinod Kone


When a slave is disconnected from the master, the master removes all tasks belonging to that slave.

If a framework is disconnected during this period, removeTask() fails the check "framework != NULL" and the master aborts (see the log below). As a result, LOST tasks may never be reported to the scheduler. This is bad because the framework still thinks the task is running, while the executor does not. The executor's TASK_KILLED messages do not help either: the (restarted) slave drops them because it has no record of the task.
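
One possible way to tolerate a disconnected framework is sketched below. This is a minimal, self-contained model and not the actual Mesos code: the Task, Framework, StatusUpdate and Master types, the "sent" and "pending" queues, and the choice to hold the update for later delivery are all assumptions made for illustration. It only shows removeTask() looking the framework up and taking a non-fatal path when the lookup fails, rather than asserting that the framework exists.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins for the master's bookkeeping.
// These are NOT the real Mesos types.
struct Task {
  std::string id;
  std::string frameworkId;
  std::string slaveId;
};

struct Framework {
  std::string id;
  std::map<std::string, Task> tasks;  // tasks keyed by task id
};

struct StatusUpdate {
  std::string taskId;
  std::string frameworkId;
  std::string state;
};

class Master {
public:
  std::map<std::string, Framework> frameworks;          // connected frameworks only
  std::map<std::string, std::vector<Task>> slaveTasks;  // tasks keyed by slave id
  std::vector<StatusUpdate> sent;     // updates delivered to connected frameworks
  std::vector<StatusUpdate> pending;  // updates held for disconnected frameworks (assumption)

  void removeTask(const Task& task) {
    StatusUpdate update{task.id, task.frameworkId, "TASK_LOST"};

    auto it = frameworks.find(task.frameworkId);
    if (it == frameworks.end()) {
      // Framework is not connected: keep the update instead of aborting,
      // so it is not silently lost.
      pending.push_back(update);
      return;
    }

    it->second.tasks.erase(task.id);
    sent.push_back(update);
  }

  void removeSlave(const std::string& slaveId) {
    auto it = slaveTasks.find(slaveId);
    if (it == slaveTasks.end()) {
      return;  // unknown slave, nothing to clean up
    }
    for (const Task& task : it->second) {
      assert(task.slaveId == slaveId);  // bookkeeping invariant of this sketch
      removeTask(task);
    }
    slaveTasks.erase(it);
  }
};
```

In this model, removing a slave that runs tasks for both a connected and a disconnected framework delivers one update and holds the other, and removeSlave() completes instead of terminating the process. Whether held updates should be replayed when the framework reconnects, or simply dropped, is a design decision this sketch leaves open.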


I0623 00:58:36.758640 28346 master.cpp:1694] Adding slave 201206230058-1937777162-5050-28332-0 at smf1-afg-23-sr3.prod.twitter.com with cpus=14; mem=22528; ports=[31000-32000]; disk=400000
I0623 00:58:36.758826 28346 simple_allocator.cpp:69] Added slave 201206230058-1937777162-5050-28332-0 with cpus=14; mem=22528; ports=[31000-32000]; disk=400000
I0623 00:58:36.761170 28344 master.cpp:941] Attempting to register slave on smf1-aff-31-sr4.prod.twitter.com at slave(1)@10.34.135.131:5051
I0623 00:58:36.761245 28344 master.cpp:1158] Master now considering a slave at smf1-aff-31-sr4.prod.twitter.com:5051 as active
I0623 00:58:36.761275 28344 master.cpp:1694] Adding slave 201206230058-1937777162-5050-28332-1 at smf1-aff-31-sr4.prod.twitter.com with cpus=14; mem=22528; ports=[31000-32000]; disk=400000
I0623 00:58:36.761489 28344 simple_allocator.cpp:69] Added slave 201206230058-1937777162-5050-28332-1 with cpus=14; mem=22528; ports=[31000-32000]; disk=400000
2012-06-23 00:58:39,871:28332(0x4955b940):ZOO_DEBUG@zookeeper_process@1983: Got ping response in 0 ms
I0623 00:58:39.910228 28342 master.cpp:70] Watching path file:///usr/local/mesos/conf/whitelist.txt
I0623 00:58:39.910339 28342 master.cpp:98] Whitelisting slave smf1-afg-23-sr3.prod.twitter.com
I0623 00:58:39.910395 28342 master.cpp:98] Whitelisting slave smf1-aff-31-sr4.prod.twitter.com
2012-06-23 00:58:43,208:28332(0x4955b940):ZOO_DEBUG@zookeeper_process@1983: Got ping response in 0 ms
I0623 00:58:44.911403 28346 master.cpp:70] Watching path file:///usr/local/mesos/conf/whitelist.txt
I0623 00:58:44.911511 28346 master.cpp:98] Whitelisting slave smf1-afg-23-sr3.prod.twitter.com
I0623 00:58:44.911541 28346 master.cpp:98] Whitelisting slave smf1-aff-31-sr4.prod.twitter.com
2012-06-23 00:58:46,545:28332(0x4955b940):ZOO_DEBUG@zookeeper_process@1983: Got ping response in 0 ms
I0623 00:58:49.738129 28345 master.cpp:548] Slave 201206160031-1937777162-5050-11967-3 disconnected
F0623 00:58:49.738231 28345 master.cpp:1880] Check failed: framework != NULL
*** Check failure stack trace: ***
    @     0x7f032d18e3fd  google::LogMessage::Fail()
    @     0x7f032d194067  google::LogMessage::SendToLog()
    @     0x7f032d18fcac  google::LogMessage::Flush()
    @     0x7f032d18ff16  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f032cedb462  mesos::internal::master::Master::removeTask()
    @     0x7f032cee58d6  mesos::internal::master::Master::removeSlave()
    @     0x7f032cee7b6e  mesos::internal::master::Master::exited()
    @     0x7f032d0ac3f2  process::ProcessBase::visit()
    @     0x7f032d0be4f6  process::ExitedEvent::visit()
    @     0x7f032d0b7054  process::ProcessManager::resume()
    @     0x7f032d0b78a7  process::schedule()
    @     0x7f032c5f573d  start_thread
    @     0x7f032bbdff6d  clone
Bottle server starting up (using WSGIRefServer())...
Listening on http://0.0.0.0:8080/
Use Ctrl-C to quit.


