You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@mesos.apache.org by "Benjamin Mahler (JIRA)" <ji...@apache.org> on 2013/09/03 21:11:57 UTC

[jira] [Comment Edited] (MESOS-675) CHECK failure in the Master.

    [ https://issues.apache.org/jira/browse/MESOS-675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13756868#comment-13756868 ] 

Benjamin Mahler edited comment on MESOS-675 at 9/3/13 7:11 PM:
---------------------------------------------------------------

For added context, all of the slaves in this testing cluster are being killed approximately once per minute to stress test slave recovery.

Observed another case. Both cases share the fact that there were numerous duplicate re-registration attempts received by the master which may point towards some networking issues causing an influx of messages, or the master being backed up.

Interestingly, this is the same slave as before!

{noformat}
$ grep 201308270033-1937777162-5050-50911-137 /var/log/mesos/mesos-master.log
I0903 02:00:55.511580 13323 master.cpp:553] Slave 201308270033-1937777162-5050-50911-137 (smf1-aeg-27-sr2.prod.twitter.com) disconnected
I0903 02:00:55.523751 13317 hierarchical_allocator_process.hpp:459] Removed slave 201308270033-1937777162-5050-50911-137
...
W0903 02:01:21.068780 13314 master.cpp:1317] Ignoring unknown exited executor thermos-1377831261464-mesos-slave-recovery-spinner-60-f0bcfda6-4f8d-4df4-bd74-0b15f32d0502 on slave 201308270033-1937777162-5050-50911-137 (smf1-aeg-27-sr2.prod.twitter.com)
...
W0903 02:01:21.094483 13314 master.cpp:1317] Ignoring unknown exited executor thermos-1377946710233-mesos-slave-recovery-spinner-97-dae51553-6b4c-44dc-99cf-dbd23294bd95 on slave 201308270033-1937777162-5050-50911-137 (smf1-aeg-27-sr2.prod.twitter.com)
W0903 02:01:21.168313 13314 master.cpp:1123] Slave at slave(1)@10.34.110.125:5051 (smf1-aeg-27-sr2.prod.twitter.com) is being allowed to re-register with an already in use id (201308270033-1937777162-5050-50911-137)
I0903 02:01:21.169204 13311 hierarchical_allocator_process.hpp:434] Added slave 201308270033-1937777162-5050-50911-137 (smf1-aeg-27-sr2.prod.twitter.com) with cpus(*):14; mem(*):21913; ports(*):[31000-32000]; disk(*):400000 (and cpus(*):10.96; mem(*):19866; ports(*):[31000-31080, 31082-31573, 31575-31618, 31620-31791, 31793-31801, 31803-32000]; disk(*):397809 available)
W0903 02:01:21.202672 13314 master.cpp:1123] Slave at slave(1)@10.34.110.125:5051 (smf1-aeg-27-sr2.prod.twitter.com) is being allowed to re-register with an already in use id (201308270033-1937777162-5050-50911-137)
W0903 02:01:21.316110 13314 master.cpp:1123] Slave at slave(1)@10.34.110.125:5051 (smf1-aeg-27-sr2.prod.twitter.com) is being allowed to re-register with an already in use id (201308270033-1937777162-5050-50911-137)
W0903 02:01:21.595093 13314 master.cpp:1123] Slave at slave(1)@10.34.110.125:5051 (smf1-aeg-27-sr2.prod.twitter.com) is being allowed to re-register with an already in use id (201308270033-1937777162-5050-50911-137)
W0903 02:01:21.634462 13314 master.cpp:1123] Slave at slave(1)@10.34.110.125:5051 (smf1-aeg-27-sr2.prod.twitter.com) is being allowed to re-register with an already in use id (201308270033-1937777162-5050-50911-137)
W0903 02:01:21.668175 13314 master.cpp:1123] Slave at slave(1)@10.34.110.125:5051 (smf1-aeg-27-sr2.prod.twitter.com) is being allowed to re-register with an already in use id (201308270033-1937777162-5050-50911-137)
W0903 02:01:21.896671 13314 master.cpp:1123] Slave at slave(1)@10.34.110.125:5051 (smf1-aeg-27-sr2.prod.twitter.com) is being allowed to re-register with an already in use id (201308270033-1937777162-5050-50911-137)
W0903 02:01:21.920936 13314 master.cpp:1123] Slave at slave(1)@10.34.110.125:5051 (smf1-aeg-27-sr2.prod.twitter.com) is being allowed to re-register with an already in use id (201308270033-1937777162-5050-50911-137)
W0903 02:01:21.957500 13314 master.cpp:1123] Slave at slave(1)@10.34.110.125:5051 (smf1-aeg-27-sr2.prod.twitter.com) is being allowed to re-register with an already in use id (201308270033-1937777162-5050-50911-137)
W0903 02:01:22.165949 13314 master.cpp:1123] Slave at slave(1)@10.34.110.125:5051 (smf1-aeg-27-sr2.prod.twitter.com) is being allowed to re-register with an already in use id (201308270033-1937777162-5050-50911-137)
W0903 02:01:22.261553 13314 master.cpp:1123] Slave at slave(1)@10.34.110.125:5051 (smf1-aeg-27-sr2.prod.twitter.com) is being allowed to re-register with an already in use id (201308270033-1937777162-5050-50911-137)
W0903 02:01:22.295368 13314 master.cpp:1123] Slave at slave(1)@10.34.110.125:5051 (smf1-aeg-27-sr2.prod.twitter.com) is being allowed to re-register with an already in use id (201308270033-1937777162-5050-50911-137)
I0903 02:01:22.640902 13314 master.cpp:553] Slave 201308270033-1937777162-5050-50911-137 (smf1-aeg-27-sr2.prod.twitter.com) disconnected
W0903 02:01:22.641563 13314 master.cpp:1404] Master returning resources offered because slave 201308270033-1937777162-5050-50911-137 is disconnected
I0903 02:01:22.641603 13318 hierarchical_allocator_process.hpp:459] Removed slave 201308270033-1937777162-5050-50911-137
I0903 02:01:22.646209 13314 master.cpp:553] Slave 201308270033-1937777162-5050-50911-137 (smf1-aeg-27-sr2.prod.twitter.com) disconnected
F0903 02:01:22.646548 13314 master.cpp:564] Check failed: !slave->disconnected Slave 201308270033-1937777162-5050-50911-137 (smf1-aeg-27-sr2.prod.twitter.com) already disconnected!
{noformat}

Slave log:
{noformat}
I0903 02:01:07.943450 42285 slave.cpp:1720] Cleaning up un-reregistered executors
I0903 02:01:07.944167 42285 slave.cpp:405] Finished recovery
I0903 02:01:07.948012 42285 slave.cpp:425] Garbage collecting old slave 201307130218-2280333834-5050-31521-81
I0903 02:01:07.959118 42285 slave.cpp:425] Garbage collecting old slave 201308222147-2206671370-5050-28630-131
I0903 02:01:07.960516 42282 gc.cpp:56] Scheduling '/var/lib/mesos/slaves/201307130218-2280333834-5050-31521-81' for removal
I0903 02:01:07.964192 42282 gc.cpp:56] Scheduling '/var/lib/mesos/meta/slaves/201307130218-2280333834-5050-31521-81' for removal
I0903 02:01:07.964614 42282 gc.cpp:56] Scheduling '/var/lib/mesos/slaves/201308222147-2206671370-5050-28630-131' for removal
I0903 02:01:07.964931 42282 gc.cpp:56] Scheduling '/var/lib/mesos/meta/slaves/201308222147-2206671370-5050-28630-131' for removal
I0903 02:01:10.842142 42281 http.cpp:278] HTTP request for '/slave(1)/stats.json'
2013-09-03 02:01:10,963:42271(0x7fd913fff940):ZOO_DEBUG@zookeeper_process@1983: Got ping response in 0 ms
2013-09-03 02:01:14,297:42271(0x7fd913fff940):ZOO_DEBUG@zookeeper_process@1983: Got ping response in 1 ms
2013-09-03 02:01:17,633:42271(0x7fd913fff940):ZOO_DEBUG@zookeeper_process@1983: Got ping response in 0 ms
/usr/local/bin/mesos-slave.sh: line 117: 42271 Killed                  /usr/local/sbin/mesos-slave --port=5051 --resources="${MESOS_RESOURCES}" --attributes="
${MESOS_ATTRIBUTES}" --master="${master_zoo_url}" --log_dir="${log_dir}" ${EXTRA_FLAGS} "$@"
Slave Exit Status: 137
I0903 02:01:21.469768 42408 main.cpp:119] Creating "cgroups" isolator
I0903 02:01:21.470412 42408 main.cpp:127] Build: 2013-08-26 23:07:30 by bmahler
I0903 02:01:21.470432 42408 main.cpp:128] Starting Mesos slave
I0903 02:01:21.471226 42417 slave.cpp:114] Slave started on 1)@10.34.110.125:5051
I0903 02:01:21.471454 42417 slave.cpp:214] Slave resources: cpus(*):14; mem(*):21913; ports(*):[31000-32000]; disk(*):400000
2013-09-03 02:01:21,472:42408(0x7f042482a940):ZOO_INFO@log_env@658: Client environment:zookeeper.version=zookeeper C client 3.3.4
2013-09-03 02:01:21,472:42408(0x7f042482a940):ZOO_INFO@log_env@662: Client environment:host.name=smf1-aeg-27-sr2.prod.twitter.com
2013-09-03 02:01:21,472:42408(0x7f042482a940):ZOO_INFO@log_env@669: Client environment:os.name=Linux
2013-09-03 02:01:21,472:42408(0x7f042482a940):ZOO_INFO@log_env@670: Client environment:os.arch=2.6.44-t8.el5
2013-09-03 02:01:21,472:42408(0x7f042482a940):ZOO_INFO@log_env@671: Client environment:os.version=#1 SMP Wed Jul 10 16:42:52 PDT 2013
2013-09-03 02:01:21,472:42408(0x7f042482a940):ZOO_INFO@log_env@679: Client environment:user.name=(null)
2013-09-03 02:01:21,472:42408(0x7f042482a940):ZOO_INFO@log_env@687: Client environment:user.home=/root
2013-09-03 02:01:21,472:42408(0x7f042482a940):ZOO_INFO@log_env@699: Client environment:user.dir=/
2013-09-03 02:01:21,472:42408(0x7f042482a940):ZOO_INFO@zookeeper_init@727: Initiating client connection, host=sdzktest.local.twitter.com:2181 sessionTimeout=10000 watcher=0x7f042c316150 sessionId=0 sessionPasswd=<null> context=0x167d310 flags=0
I0903 02:01:21.472421 42411 cgroups_isolator.cpp:224] Using /cgroup as cgroups hierarchy root
I0903 02:01:21.474203 42417 state.cpp:33] Recovering state from /var/lib/mesos/meta
2013-09-03 02:01:21,474:42408(0x7f042482a940):ZOO_DEBUG@start_threads@152: starting threads...
2013-09-03 02:01:21,474:42408(0x7f041f410940):ZOO_DEBUG@do_io@279: started IO thread
2013-09-03 02:01:21,474:42408(0x7f041ec0f940):ZOO_DEBUG@do_completion@326: started completion thread
2013-09-03 02:01:21,475:42408(0x7f041f410940):ZOO_INFO@check_events@1585: initiated connection to server [10.35.39.102:2181]
2013-09-03 02:01:21,476:42408(0x7f041f410940):ZOO_INFO@check_events@1632: session establishment complete on server [10.35.39.102:2181], sessionId=0x3f8b9d89df5a18, negotiated timeout=10000
2013-09-03 02:01:21,476:42408(0x7f041f410940):ZOO_DEBUG@check_events@1638: Calling a watcher for a ZOO_SESSION_EVENT and the state=ZOO_CONNECTED_STATE
2013-09-03 02:01:21,476:42408(0x7f041ec0f940):ZOO_DEBUG@process_completions@1765: Calling a watcher for node [], type = -1 event=ZOO_SESSION_EVENT
I0903 02:01:21.476490 42413 detector.cpp:234] Master detector (slave(1)@10.34.110.125:5051) connected to ZooKeeper ...
I0903 02:01:21.476523 42413 detector.cpp:239] Authenticating to ZooKeeper using scheme 'digest'
2013-09-03 02:01:21,476:42408(0x7f042602d940):ZOO_DEBUG@send_last_auth_info@1265: Sending auth info request to 10.35.39.102:2181
...
{noformat}

Then the slave crashed during recovery:
{noformat}
F0903 02:01:26.436521 42417 slave.cpp:672] Unexpected slave state 0
*** Check failure stack trace: ***
    @     0x7f042c579d8d  google::LogMessage::Fail()
    @     0x7f042c57dd77  google::LogMessage::SendToLog()
    @     0x7f042c57c674  google::LogMessage::Flush()
    @     0x7f042c57c8a6  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f042c21db8a  mesos::internal::slave::Slave::reregistered()
    @     0x7f042c276c1d  ProtobufProcess<>::handler1<>()
    @     0x7f042c24560a  std::tr1::_Function_handler<>::_M_invoke()
    @     0x7f042c27702b  ProtobufProcess<>::visit()
    @     0x7f042c46baf4  process::ProcessManager::resume()
    @     0x7f042c46c54f  process::schedule()
    @     0x7f042bbd983d  start_thread
    @     0x7f042a5bbf8d  clone
/usr/local/bin/mesos-slave.sh: line 117: 42408 Aborted                 (core dumped) /usr/local/sbin/mesos-slave --port=5051 --resources="${MESOS_RESOURCES}" --attributes="${MESOS_ATTRIBUTES}" --master="${master_zoo_url}" --log_dir="${log_dir}" ${EXTRA_FLAGS} "$@"
Slave Exit Status: 134
{noformat}

Filed MESOS-676 for the above LOG(FATAL).
                
      was (Author: bmahler):
    For added context, all of the slaves in this testing cluster are being killed approximately once per minute to stress test slave recovery.

Observed another case. Both cases share the fact that there were numerous duplicate re-registration attempts received by the master which may point towards some networking issues causing an influx of messages, or the master being backed up.

Interestingly, this is the same slave as before!

{noformat}
$ grep 201308270033-1937777162-5050-50911-137 /var/log/mesos/mesos-master.log
...
W0903 02:01:21.068780 13314 master.cpp:1317] Ignoring unknown exited executor thermos-1377831261464-mesos-slave-recovery-spinner-60-f0bcfda6-4f8d-4df4-bd74-0b15f32d0502 on slave 201308270033-1937777162-5050-50911-137 (smf1-aeg-27-sr2.prod.twitter.com)
...
W0903 02:01:21.094483 13314 master.cpp:1317] Ignoring unknown exited executor thermos-1377946710233-mesos-slave-recovery-spinner-97-dae51553-6b4c-44dc-99cf-dbd23294bd95 on slave 201308270033-1937777162-5050-50911-137 (smf1-aeg-27-sr2.prod.twitter.com)
W0903 02:01:21.168313 13314 master.cpp:1123] Slave at slave(1)@10.34.110.125:5051 (smf1-aeg-27-sr2.prod.twitter.com) is being allowed to re-register with an already in use id (201308270033-1937777162-5050-50911-137)
I0903 02:01:21.169204 13311 hierarchical_allocator_process.hpp:434] Added slave 201308270033-1937777162-5050-50911-137 (smf1-aeg-27-sr2.prod.twitter.com) with cpus(*):14; mem(*):21913; ports(*):[31000-32000]; disk(*):400000 (and cpus(*):10.96; mem(*):19866; ports(*):[31000-31080, 31082-31573, 31575-31618, 31620-31791, 31793-31801, 31803-32000]; disk(*):397809 available)
W0903 02:01:21.202672 13314 master.cpp:1123] Slave at slave(1)@10.34.110.125:5051 (smf1-aeg-27-sr2.prod.twitter.com) is being allowed to re-register with an already in use id (201308270033-1937777162-5050-50911-137)
W0903 02:01:21.316110 13314 master.cpp:1123] Slave at slave(1)@10.34.110.125:5051 (smf1-aeg-27-sr2.prod.twitter.com) is being allowed to re-register with an already in use id (201308270033-1937777162-5050-50911-137)
W0903 02:01:21.595093 13314 master.cpp:1123] Slave at slave(1)@10.34.110.125:5051 (smf1-aeg-27-sr2.prod.twitter.com) is being allowed to re-register with an already in use id (201308270033-1937777162-5050-50911-137)
W0903 02:01:21.634462 13314 master.cpp:1123] Slave at slave(1)@10.34.110.125:5051 (smf1-aeg-27-sr2.prod.twitter.com) is being allowed to re-register with an already in use id (201308270033-1937777162-5050-50911-137)
W0903 02:01:21.668175 13314 master.cpp:1123] Slave at slave(1)@10.34.110.125:5051 (smf1-aeg-27-sr2.prod.twitter.com) is being allowed to re-register with an already in use id (201308270033-1937777162-5050-50911-137)
W0903 02:01:21.896671 13314 master.cpp:1123] Slave at slave(1)@10.34.110.125:5051 (smf1-aeg-27-sr2.prod.twitter.com) is being allowed to re-register with an already in use id (201308270033-1937777162-5050-50911-137)
W0903 02:01:21.920936 13314 master.cpp:1123] Slave at slave(1)@10.34.110.125:5051 (smf1-aeg-27-sr2.prod.twitter.com) is being allowed to re-register with an already in use id (201308270033-1937777162-5050-50911-137)
W0903 02:01:21.957500 13314 master.cpp:1123] Slave at slave(1)@10.34.110.125:5051 (smf1-aeg-27-sr2.prod.twitter.com) is being allowed to re-register with an already in use id (201308270033-1937777162-5050-50911-137)
W0903 02:01:22.165949 13314 master.cpp:1123] Slave at slave(1)@10.34.110.125:5051 (smf1-aeg-27-sr2.prod.twitter.com) is being allowed to re-register with an already in use id (201308270033-1937777162-5050-50911-137)
W0903 02:01:22.261553 13314 master.cpp:1123] Slave at slave(1)@10.34.110.125:5051 (smf1-aeg-27-sr2.prod.twitter.com) is being allowed to re-register with an already in use id (201308270033-1937777162-5050-50911-137)
W0903 02:01:22.295368 13314 master.cpp:1123] Slave at slave(1)@10.34.110.125:5051 (smf1-aeg-27-sr2.prod.twitter.com) is being allowed to re-register with an already in use id (201308270033-1937777162-5050-50911-137)
I0903 02:01:22.640902 13314 master.cpp:553] Slave 201308270033-1937777162-5050-50911-137 (smf1-aeg-27-sr2.prod.twitter.com) disconnected
W0903 02:01:22.641563 13314 master.cpp:1404] Master returning resources offered because slave 201308270033-1937777162-5050-50911-137 is disconnected
I0903 02:01:22.641603 13318 hierarchical_allocator_process.hpp:459] Removed slave 201308270033-1937777162-5050-50911-137
I0903 02:01:22.646209 13314 master.cpp:553] Slave 201308270033-1937777162-5050-50911-137 (smf1-aeg-27-sr2.prod.twitter.com) disconnected
F0903 02:01:22.646548 13314 master.cpp:564] Check failed: !slave->disconnected Slave 201308270033-1937777162-5050-50911-137 (smf1-aeg-27-sr2.prod.twitter.com) already disconnected!
{noformat}

Slave log:
{noformat}
I0903 02:01:07.943450 42285 slave.cpp:1720] Cleaning up un-reregistered executors
I0903 02:01:07.944167 42285 slave.cpp:405] Finished recovery
I0903 02:01:07.948012 42285 slave.cpp:425] Garbage collecting old slave 201307130218-2280333834-5050-31521-81
I0903 02:01:07.959118 42285 slave.cpp:425] Garbage collecting old slave 201308222147-2206671370-5050-28630-131
I0903 02:01:07.960516 42282 gc.cpp:56] Scheduling '/var/lib/mesos/slaves/201307130218-2280333834-5050-31521-81' for removal
I0903 02:01:07.964192 42282 gc.cpp:56] Scheduling '/var/lib/mesos/meta/slaves/201307130218-2280333834-5050-31521-81' for removal
I0903 02:01:07.964614 42282 gc.cpp:56] Scheduling '/var/lib/mesos/slaves/201308222147-2206671370-5050-28630-131' for removal
I0903 02:01:07.964931 42282 gc.cpp:56] Scheduling '/var/lib/mesos/meta/slaves/201308222147-2206671370-5050-28630-131' for removal
I0903 02:01:10.842142 42281 http.cpp:278] HTTP request for '/slave(1)/stats.json'
2013-09-03 02:01:10,963:42271(0x7fd913fff940):ZOO_DEBUG@zookeeper_process@1983: Got ping response in 0 ms
2013-09-03 02:01:14,297:42271(0x7fd913fff940):ZOO_DEBUG@zookeeper_process@1983: Got ping response in 1 ms
2013-09-03 02:01:17,633:42271(0x7fd913fff940):ZOO_DEBUG@zookeeper_process@1983: Got ping response in 0 ms
/usr/local/bin/mesos-slave.sh: line 117: 42271 Killed                  /usr/local/sbin/mesos-slave --port=5051 --resources="${MESOS_RESOURCES}" --attributes="
${MESOS_ATTRIBUTES}" --master="${master_zoo_url}" --log_dir="${log_dir}" ${EXTRA_FLAGS} "$@"
Slave Exit Status: 137
I0903 02:01:21.469768 42408 main.cpp:119] Creating "cgroups" isolator
I0903 02:01:21.470412 42408 main.cpp:127] Build: 2013-08-26 23:07:30 by bmahler
I0903 02:01:21.470432 42408 main.cpp:128] Starting Mesos slave
I0903 02:01:21.471226 42417 slave.cpp:114] Slave started on 1)@10.34.110.125:5051
I0903 02:01:21.471454 42417 slave.cpp:214] Slave resources: cpus(*):14; mem(*):21913; ports(*):[31000-32000]; disk(*):400000
2013-09-03 02:01:21,472:42408(0x7f042482a940):ZOO_INFO@log_env@658: Client environment:zookeeper.version=zookeeper C client 3.3.4
2013-09-03 02:01:21,472:42408(0x7f042482a940):ZOO_INFO@log_env@662: Client environment:host.name=smf1-aeg-27-sr2.prod.twitter.com
2013-09-03 02:01:21,472:42408(0x7f042482a940):ZOO_INFO@log_env@669: Client environment:os.name=Linux
2013-09-03 02:01:21,472:42408(0x7f042482a940):ZOO_INFO@log_env@670: Client environment:os.arch=2.6.44-t8.el5
2013-09-03 02:01:21,472:42408(0x7f042482a940):ZOO_INFO@log_env@671: Client environment:os.version=#1 SMP Wed Jul 10 16:42:52 PDT 2013
2013-09-03 02:01:21,472:42408(0x7f042482a940):ZOO_INFO@log_env@679: Client environment:user.name=(null)
2013-09-03 02:01:21,472:42408(0x7f042482a940):ZOO_INFO@log_env@687: Client environment:user.home=/root
2013-09-03 02:01:21,472:42408(0x7f042482a940):ZOO_INFO@log_env@699: Client environment:user.dir=/
2013-09-03 02:01:21,472:42408(0x7f042482a940):ZOO_INFO@zookeeper_init@727: Initiating client connection, host=sdzktest.local.twitter.com:2181 sessionTimeout=10000 watcher=0x7f042c316150 sessionId=0 sessionPasswd=<null> context=0x167d310 flags=0
I0903 02:01:21.472421 42411 cgroups_isolator.cpp:224] Using /cgroup as cgroups hierarchy root
I0903 02:01:21.474203 42417 state.cpp:33] Recovering state from /var/lib/mesos/meta
2013-09-03 02:01:21,474:42408(0x7f042482a940):ZOO_DEBUG@start_threads@152: starting threads...
2013-09-03 02:01:21,474:42408(0x7f041f410940):ZOO_DEBUG@do_io@279: started IO thread
2013-09-03 02:01:21,474:42408(0x7f041ec0f940):ZOO_DEBUG@do_completion@326: started completion thread
2013-09-03 02:01:21,475:42408(0x7f041f410940):ZOO_INFO@check_events@1585: initiated connection to server [10.35.39.102:2181]
2013-09-03 02:01:21,476:42408(0x7f041f410940):ZOO_INFO@check_events@1632: session establishment complete on server [10.35.39.102:2181], sessionId=0x3f8b9d89df5a18, negotiated timeout=10000
2013-09-03 02:01:21,476:42408(0x7f041f410940):ZOO_DEBUG@check_events@1638: Calling a watcher for a ZOO_SESSION_EVENT and the state=ZOO_CONNECTED_STATE
2013-09-03 02:01:21,476:42408(0x7f041ec0f940):ZOO_DEBUG@process_completions@1765: Calling a watcher for node [], type = -1 event=ZOO_SESSION_EVENT
I0903 02:01:21.476490 42413 detector.cpp:234] Master detector (slave(1)@10.34.110.125:5051) connected to ZooKeeper ...
I0903 02:01:21.476523 42413 detector.cpp:239] Authenticating to ZooKeeper using scheme 'digest'
2013-09-03 02:01:21,476:42408(0x7f042602d940):ZOO_DEBUG@send_last_auth_info@1265: Sending auth info request to 10.35.39.102:2181
...
{noformat}

Then the slave crashed during recovery:
{noformat}
F0903 02:01:26.436521 42417 slave.cpp:672] Unexpected slave state 0
*** Check failure stack trace: ***
    @     0x7f042c579d8d  google::LogMessage::Fail()
    @     0x7f042c57dd77  google::LogMessage::SendToLog()
    @     0x7f042c57c674  google::LogMessage::Flush()
    @     0x7f042c57c8a6  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f042c21db8a  mesos::internal::slave::Slave::reregistered()
    @     0x7f042c276c1d  ProtobufProcess<>::handler1<>()
    @     0x7f042c24560a  std::tr1::_Function_handler<>::_M_invoke()
    @     0x7f042c27702b  ProtobufProcess<>::visit()
    @     0x7f042c46baf4  process::ProcessManager::resume()
    @     0x7f042c46c54f  process::schedule()
    @     0x7f042bbd983d  start_thread
    @     0x7f042a5bbf8d  clone
/usr/local/bin/mesos-slave.sh: line 117: 42408 Aborted                 (core dumped) /usr/local/sbin/mesos-slave --port=5051 --resources="${MESOS_RESOURCES}" --attributes="${MESOS_ATTRIBUTES}" --master="${master_zoo_url}" --log_dir="${log_dir}" ${EXTRA_FLAGS} "$@"
Slave Exit Status: 134
{noformat}

Filed MESOS-676 for the above LOG(FATAL).
                  
> CHECK failure in the Master.
> ----------------------------
>
>                 Key: MESOS-675
>                 URL: https://issues.apache.org/jira/browse/MESOS-675
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Benjamin Mahler
>            Assignee: Benjamin Mahler
>            Priority: Blocker
>             Fix For: 0.14.0
>
>
> Observed this failure in a staging cluster running 0.14.0-rc2.
> {noformat}
> F0902 06:01:11.105391 11876 master.cpp:564] Check failed: !slave->disconnected Slave 201308270033-1937777162-5050-50911-137 (<scrub>)
>  already disconnected!
> *** Check failure stack trace: ***
>     @     0x7fb470894d8d  google::LogMessage::Fail()
>     @     0x7fb470898d77  google::LogMessage::SendToLog()
>     @     0x7fb470897674  google::LogMessage::Flush()
>     @     0x7fb4708978a6  google::LogMessageFatal::~LogMessageFatal()
>     @     0x7fb4704aaea4  mesos::internal::master::Master::exited()
>     @     0x7fb470786af4  process::ProcessManager::resume()
>     @     0x7fb47078754f  process::schedule()
>     @     0x7fb46fef483d  start_thread
>     @     0x7fb46e8d6f8d  clone
> {noformat}
> Grepping for this slave in the logs:
> {noformat}
> $ grep 201308270033-1937777162-5050-50911-137 /var/log/mesos/mesos-master.log
> W0902 06:01:10.607168 11876 master.cpp:1317] Ignoring unknown exited executor thermos-1377831261464-mesos-slave-recovery-spinner-60-f0bcfda6-4f8d-4df4-bd74-0b15f32d0502 on slave 201308270033-1937777162-5050-50911-137 (<scrub>)
> ...
> W0902 06:01:10.646383 11876 master.cpp:1317] Ignoring unknown exited executor thermos-1377964938274-mesos-slave-recovery-spinner-184-3a25b824-5d73-4be0-984d-606230c5e8ac on slave 201308270033-1937777162-5050-50911-137 (<scrub>)
> W0902 06:01:10.699635 11876 master.cpp:1123] Slave at slave(1)@10.34.110.125:5051 (<scrub>) is being allowed to re-register with an already in use id (201308270033-1937777162-5050-50911-137)
> I0902 06:01:10.700628 11868 hierarchical_allocator_process.hpp:434] Added slave 201308270033-1937777162-5050-50911-137 (<scrub>) with cpus(*):14; mem(*):21913; ports(*):[31000-32000]; disk(*):400000 (and cpus(*):10.96; mem(*):19866; ports(*):[31000-31003, 31005-31449, 31451-31580, 31582-31801, 31803-31927, 31929-32000]; disk(*):397809 available)
> W0902 06:01:10.866525 11876 master.cpp:1123] Slave at slave(1)@10.34.110.125:5051 (<scrub>) is being allowed to re-register with an already in use id (201308270033-1937777162-5050-50911-137)
> W0902 06:01:10.919178 11876 master.cpp:1123] Slave at slave(1)@10.34.110.125:5051 (<scrub>) is being allowed to re-register with an already in use id (201308270033-1937777162-5050-50911-137)
> W0902 06:01:11.070862 11876 master.cpp:1123] Slave at slave(1)@10.34.110.125:5051 (<scrub>) is being allowed to re-register with an already in use id (201308270033-1937777162-5050-50911-137)
> I0902 06:01:11.085773 11876 master.cpp:553] Slave 201308270033-1937777162-5050-50911-137 (<scrub>) disconnected
> W0902 06:01:11.086096 11876 master.cpp:1404] Master returning resources offered because slave 201308270033-1937777162-5050-50911-137 is disconnected
> I0902 06:01:11.086145 11867 hierarchical_allocator_process.hpp:459] Removed slave 201308270033-1937777162-5050-50911-137
> I0902 06:01:11.104651 11876 master.cpp:553] Slave 201308270033-1937777162-5050-50911-137 (<scrub>) disconnected
> F0902 06:01:11.105391 11876 master.cpp:564] Check failed: !slave->disconnected Slave 201308270033-1937777162-5050-50911-137 (<scrub>) already disconnected!
> {noformat}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira