Posted to dev@mesos.apache.org by "Benjamin Mahler (JIRA)" <ji...@apache.org> on 2013/09/03 20:23:53 UTC
[jira] [Commented] (MESOS-676) Slave::reregistered LOG(FATAL)s due to being in RECOVERING state.

    [ https://issues.apache.org/jira/browse/MESOS-676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13756865#comment-13756865 ] 

Benjamin Mahler commented on MESOS-676:
---------------------------------------

From a quick look at this slave's log, it appears that recovery took long enough for the slave to be notified of re-registration while it was still recovering.

{noformat}
I0903 02:01:21.469768 42408 main.cpp:119] Creating "cgroups" isolator
I0903 02:01:21.470412 42408 main.cpp:127] Build: 2013-08-26 23:07:30 by bmahler
I0903 02:01:21.470432 42408 main.cpp:128] Starting Mesos slave
I0903 02:01:21.471226 42417 slave.cpp:114] Slave started on 1)@10.34.110.125:5051
I0903 02:01:21.471454 42417 slave.cpp:214] Slave resources: cpus(*):14; mem(*):21913; ports(*):[31000-32000]; disk(*):400000
2013-09-03 02:01:21,472:42408(0x7f042482a940):ZOO_INFO@log_env@658: Client environment:zookeeper.version=zookeeper C client 3.3.4
2013-09-03 02:01:21,472:42408(0x7f042482a940):ZOO_INFO@log_env@662: Client environment:host.name=smf1-aeg-27-sr2.prod.twitter.com
2013-09-03 02:01:21,472:42408(0x7f042482a940):ZOO_INFO@log_env@669: Client environment:os.name=Linux
2013-09-03 02:01:21,472:42408(0x7f042482a940):ZOO_INFO@log_env@670: Client environment:os.arch=2.6.44-t8.el5
2013-09-03 02:01:21,472:42408(0x7f042482a940):ZOO_INFO@log_env@671: Client environment:os.version=#1 SMP Wed Jul 10 16:42:52 PDT 2013
2013-09-03 02:01:21,472:42408(0x7f042482a940):ZOO_INFO@log_env@679: Client environment:user.name=(null)
2013-09-03 02:01:21,472:42408(0x7f042482a940):ZOO_INFO@log_env@687: Client environment:user.home=/root
2013-09-03 02:01:21,472:42408(0x7f042482a940):ZOO_INFO@log_env@699: Client environment:user.dir=/
2013-09-03 02:01:21,472:42408(0x7f042482a940):ZOO_INFO@zookeeper_init@727: Initiating client connection, host=sdzktest.local.twitter.com:2181 sessionTimeout=10000 watcher=0x7f042c316150 sessionId=0 sessionPasswd=<null> context=0x167d310 flags=0
I0903 02:01:21.472421 42411 cgroups_isolator.cpp:224] Using /cgroup as cgroups hierarchy root
I0903 02:01:21.474203 42417 state.cpp:33] Recovering state from /var/lib/mesos/meta
2013-09-03 02:01:21,474:42408(0x7f042482a940):ZOO_DEBUG@start_threads@152: starting threads...
2013-09-03 02:01:21,474:42408(0x7f041f410940):ZOO_DEBUG@do_io@279: started IO thread
2013-09-03 02:01:21,474:42408(0x7f041ec0f940):ZOO_DEBUG@do_completion@326: started completion thread
2013-09-03 02:01:21,475:42408(0x7f041f410940):ZOO_INFO@check_events@1585: initiated connection to server [10.35.39.102:2181]
2013-09-03 02:01:21,476:42408(0x7f041f410940):ZOO_INFO@check_events@1632: session establishment complete on server [10.35.39.102:2181], sessionId=0x3f8b9d89df5a18, negotiated timeout=10000
2013-09-03 02:01:21,476:42408(0x7f041f410940):ZOO_DEBUG@check_events@1638: Calling a watcher for a ZOO_SESSION_EVENT and the state=ZOO_CONNECTED_STATE
2013-09-03 02:01:21,476:42408(0x7f041ec0f940):ZOO_DEBUG@process_completions@1765: Calling a watcher for node [], type = -1 event=ZOO_SESSION_EVENT
I0903 02:01:21.476490 42413 detector.cpp:234] Master detector (slave(1)@10.34.110.125:5051) connected to ZooKeeper ...
I0903 02:01:21.476523 42413 detector.cpp:239] Authenticating to ZooKeeper using scheme 'digest'
2013-09-03 02:01:21,476:42408(0x7f042602d940):ZOO_DEBUG@send_last_auth_info@1265: Sending auth info request to 10.35.39.102:2181
{noformat}

Then, 5 seconds later, while the slave was still recovering, the crash described in this issue occurred.
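
For illustration only, here is one way the RECOVERING case could be separated from the default case so that an early re-registration message is ignored rather than treated as fatal. This is a stand-alone sketch, not the change made for this issue: SlaveModel, Outcome and the string slave id are hypothetical stand-ins, and the enum order is an assumption chosen so that RECOVERING is 0, consistent with "Unexpected slave state 0" in the report.

```cpp
#include <string>

// Hypothetical stand-in for the slave's state enum. RECOVERING is assumed to
// be 0, consistent with the "Unexpected slave state 0" message in the report.
enum State { RECOVERING = 0, DISCONNECTED, RUNNING, TERMINATING };

// Hypothetical result type (not part of Mesos) so the behaviour can be checked
// without logging or exiting the process.
enum Outcome { REREGISTERED, ALREADY_REGISTERED, IGNORED, WRONG_ID };

// Hypothetical, heavily simplified model of the slave.
struct SlaveModel {
  State state;
  std::string id;

  Outcome reregistered(const std::string& slaveId) {
    switch (state) {
      case DISCONNECTED:
        // The quoted code exits the process on an id mismatch; the sketch
        // reports it to the caller instead.
        if (id != slaveId) {
          return WRONG_ID;
        }
        state = RUNNING;
        return REREGISTERED;
      case RUNNING:
        // Already re-registered; only the id is verified.
        return id == slaveId ? ALREADY_REGISTERED : WRONG_ID;
      case TERMINATING:
        return IGNORED;
      case RECOVERING:
        // Assumption: a re-registration that arrives before recovery has
        // finished is dropped and the state is left untouched.
        return IGNORED;
    }
    return IGNORED;  // Not reached for valid states.
  }
};
```

Whether ignoring the message is the right behaviour, as opposed to, say, deferring it until recovery completes, is a design question this sketch does not settle.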
                
> Slave::reregistered LOG(FATAL)s due to being in RECOVERING state.
> -----------------------------------------------------------------
>
>                 Key: MESOS-676
>                 URL: https://issues.apache.org/jira/browse/MESOS-676
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Benjamin Mahler
>            Assignee: Benjamin Mahler
>             Fix For: 0.14.0
>
>
> void Slave::reregistered(const SlaveID& slaveId)
> {
>   switch(state) {
>     case DISCONNECTED:
>       LOG(INFO) << "Re-registered with master " << master;
>       state = RUNNING;
>       if (!(info.id() == slaveId)) {
>         EXIT(1) << "Re-registered but got wrong id: " << slaveId
>                 << "(expected: " << info.id() << "). Committing suicide";
>       }
>       break;
>     case RUNNING:
>       // Already re-registered!
>       if (!(info.id() == slaveId)) {
>         EXIT(1) << "Re-registered but got wrong id: " << slaveId
>                 << "(expected: " << info.id() << "). Committing suicide";
>       }
>       LOG(WARNING) << "Already re-registered with master " << master;
>       break;
>     case TERMINATING:
>       LOG(WARNING) << "Ignoring re-registration because slave is terminating";
>       break;
>     case RECOVERING:
>     default:
>       LOG(FATAL) << "Unexpected slave state " << state;
>       break;
>   }
> }
> Saw a slave fail because of this last case statement:
> F0903 02:01:26.436521 42417 slave.cpp:672] Unexpected slave state 0
> *** Check failure stack trace: ***
>     @     0x7f042c579d8d  google::LogMessage::Fail()
>     @     0x7f042c57dd77  google::LogMessage::SendToLog()
>     @     0x7f042c57c674  google::LogMessage::Flush()
>     @     0x7f042c57c8a6  google::LogMessageFatal::~LogMessageFatal()
>     @     0x7f042c21db8a  mesos::internal::slave::Slave::reregistered()
>     @     0x7f042c276c1d  ProtobufProcess<>::handler1<>()
>     @     0x7f042c24560a  std::tr1::_Function_handler<>::_M_invoke()
>     @     0x7f042c27702b  ProtobufProcess<>::visit()
>     @     0x7f042c46baf4  process::ProcessManager::resume()
>     @     0x7f042c46c54f  process::schedule()
>     @     0x7f042bbd983d  start_thread
>     @     0x7f042a5bbf8d  clone
> /usr/local/bin/mesos-slave.sh: line 117: 42408 Aborted                 (core dumped) /usr/local/sbin/mesos-slave --port=5051 --resources="${MESOS_RESOURCES}" --attributes="${MESOS_ATTRIBUTES}" --master="${master_zoo_url}" --log_dir="${log_dir}" ${EXTRA_FLAGS} "$@"
> Slave Exit Status: 134
