Posted to dev@mesos.apache.org by "Vinod Kone (JIRA)" <ji...@apache.org> on 2012/11/02 18:43:12 UTC

[jira] [Resolved] (MESOS-299) Master detector doesn't notify about leading master after network disconnection

     [ https://issues.apache.org/jira/browse/MESOS-299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kone resolved MESOS-299.
------------------------------

    Resolution: Fixed
    
> Master detector doesn't notify about leading master after network disconnection
> -------------------------------------------------------------------------------
>
>                 Key: MESOS-299
>                 URL: https://issues.apache.org/jira/browse/MESOS-299
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Vinod Kone
>            Assignee: Vinod Kone
>
> This occurred during a rack switch upgrade event at Twitter.
> The slave lost connectivity with the leading master. When the network switch came back up, however, the slave was never notified of the leading master, so it never registered with it.
> {code}
> I1025 17:05:23.435269 33693 detector.cpp:389] Master detector lost connection to ZooKeeper, attempting to reconnect ...
> 2012-10-25 17:05:30,105:33681(0x49807940):ZOO_ERROR@handle_socket_error_msg@1528: Socket [10.35.96.123:2181] zk retcode=-7, errno=110(Connection timed out): connection timed out (exceeded timeout by 5ms)
> I1025 17:05:33.011539 33698 http.cpp:156] HTTP request for '/slave(1)/stats.json'
> W1025 17:05:33.436805 33686 detector.cpp:450] Timed out waiting to reconnect to ZooKeeper (sessionId=13969feb5654992)
> I1025 17:05:33.436957 33686 slave.cpp:362] Lost master(s) ... waiting
> .......
> .......
> I1025 17:07:23.214442 33684 http.cpp:156] HTTP request for '/slave(1)/stats.json'
> 2012-10-25 17:07:23,249:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1983: Got ping response in 0 ms
> 2012-10-25 17:07:26,586:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1983: Got ping response in 0 ms
> 2012-10-25 17:07:29,920:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1983: Got ping response in 0 ms
> I1025 17:07:33.233223 33698 http.cpp:156] HTTP request for '/slave(1)/stats.json'
> 2012-10-25 17:07:33,255:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1983: Got ping response in 0 ms
> 2012-10-25 17:07:36,592:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1983: Got ping response in 0 ms
> 2012-10-25 17:07:39,929:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1983: Got ping response in 0 ms
> I1025 17:07:43.251569 33698 http.cpp:156] HTTP request for '/slave(1)/stats.json'
> 2012-10-25 17:07:43,265:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1983: Got ping response in 0 ms
> 2012-10-25 17:07:46,602:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1983: Got ping response in 0 ms
> 2012-10-25 17:07:49,939:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1983: Got ping response in 0 ms
> I1025 17:07:53.270818 33691 http.cpp:156] HTTP request for '/slave(1)/stats.json'
> 2012-10-25 17:07:53,275:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1983: Got ping response in 0 ms
> 2012-10-25 17:07:56,612:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1983: Got ping response in 0 ms
> 2012-10-25 17:07:59,949:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1983: Got ping response in 0 ms
> 2012-10-25 17:08:03,286:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1983: Got ping response in 0 ms
> I1025 17:08:03.293431 33687 http.cpp:156] HTTP request for '/slave(1)/stats.json'
> 2012-10-25 17:08:06,620:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1983: Got ping response in 0 ms
> 2012-10-25 17:08:09,956:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1983: Got ping response in 0 ms
> 2012-10-25 17:08:13,291:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1983: Got ping response in 0 ms
> I1025 17:08:13.316841 33684 http.cpp:156] HTTP request for '/slave(1)/stats.json'
> 2012-10-25 17:08:16,628:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1983: Got ping response in 0 ms
> 2012-10-25 17:08:19,964:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1983: Got ping response in 0 ms
> 2012-10-25 17:08:23,300:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1983: Got ping response in 0 ms
> I1025 17:08:23.337929 33695 http.cpp:156] HTTP request for '/slave(1)/stats.json'
> 2012-10-25 17:08:26,637:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1983: Got ping response in 0 ms
> 2012-10-25 17:08:29,974:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1983: Got ping response in 0 ms
> 2012-10-25 17:08:33,310:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1983: Got ping response in 0 ms
> I1025 17:08:33.358109 33696 http.cpp:156] HTTP request for '/slave(1)/stats.json'
> 2012-10-25 17:08:36,646:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1983: Got ping response in 0 ms
> 2012-10-25 17:08:39,981:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1983: Got ping response in 0 ms
> 2012-10-25 17:08:43,317:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1983: Got ping response in 0 ms
> I1025 17:08:43.377002 33685 http.cpp:156] HTTP request for '/slave(1)/stats.json'
> W1025 17:08:45.004808 33696 slave.cpp:336] Ignoring shutdown message from master@10.34.235.132:5050because it is not from the registered master (@0.0.0.0:0)
> ....
> ....
> 2012-10-25 18:35:04,005:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1916: Processing WATCHER_EVENT
> 2012-10-25 18:35:04,005:33681(0x49807940):ZOO_DEBUG@process_completions@1765: Calling a watcher for node [/home/mesos/prod/master], type = -1 event=ZOO_CHILD_EVENT
> 2012-10-25 18:35:04,005:33681(0x47002940):ZOO_DEBUG@zoo_awget_children_@2626: Sending request xid=0x50858793 for path [/home/mesos/prod/master] to 10.35.98.111:2181
> 2012-10-25 18:35:04,011:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1989: Queueing asynchronous response
> 2012-10-25 18:35:04,011:33681(0x49807940):ZOO_DEBUG@process_completions@1795: Calling COMPLETION_STRINGLIST for xid=0x50858793 rc=0
> I1025 18:35:04.012164 33696 detector.cpp:469] Master detector found 2 registered masters
> 2012-10-25 18:35:04,012:33681(0x47002940):ZOO_DEBUG@zoo_awget@2414: Sending request xid=0x50858794 for path [/home/mesos/prod/master/0000001188] to 10.35.98.111:2181
> 2012-10-25 18:35:04,017:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1989: Queueing asynchronous response
> 2012-10-25 18:35:04,017:33681(0x49807940):ZOO_DEBUG@process_completions@1772: Calling COMPLETION_DATA for xid=0x50858794 rc=0
> I1025 18:35:04.017673 33696 detector.cpp:504] Master detector got new master pid: master@10.34.91.117:5050
> I1025 18:35:04.017858 33696 slave.cpp:350] New master detected at master@10.34.91.117:5050
> I1025 18:35:04.330201 33697 slave.cpp:407] Re-registered with master
> I1025 18:35:04.330456 33697 slave.cpp:694] Updating framework 201104070004-0000002563-0000 pid to scheduler(1)@10.34.231.115:41277
> I1025 18:35:04.954231 33684 http.cpp:156] HTTP request for '/slave(1)/stats.json'
> 2012-10-25 18:35:05,715:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1916: Processing WATCHER_EVENT
> 2012-10-25 18:35:05,715:33681(0x49807940):ZOO_DEBUG@process_completions@1765: Calling a watcher for node [/home/mesos/prod/master], type = -1 event=ZOO_CHILD_EVENT
> 2012-10-25 18:35:05,716:33681(0x41ff8940):ZOO_DEBUG@zoo_awget_children_@2626: Sending request xid=0x50858795 for path [/home/mesos/prod/master] to 10.35.98.111:2181
> 2012-10-25 18:35:05,719:33681(0x4a008940):ZOO_DEBUG@zookeeper_process@1989: Queueing asynchronous response
> 2012-10-25 18:35:05,719:33681(0x49807940):ZOO_DEBUG@process_completions@1795: Calling COMPLETION_STRINGLIST for xid=0x50858795 rc=0
> I1025 18:35:05.720186 33685 detector.cpp:469] Master detector found 3 registered masters
> {code}
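
The log above shows the detector timing out, the slave reporting "Lost master(s)", and ZooKeeper pings resuming without any new master notification until a child event fires much later. The following is a minimal sketch of the behaviour the report expected, not the change actually committed for MESOS-299. The class name, method names, and callbacks are hypothetical, and it assumes the leader is the registered master with the lowest sequence number.

```python
from typing import Callable, Dict, Optional


class MasterDetectorSketch:
    """Toy model of a master detector. Hypothetical; not the Mesos code."""

    def __init__(
        self,
        list_masters: Callable[[], Dict[str, str]],
        on_new_master: Callable[[str], None],
        on_no_master: Callable[[], None],
    ) -> None:
        # list_masters returns {sequence number: master pid}, standing in
        # for reading the children of the ZooKeeper master znode.
        self._list_masters = list_masters
        self._on_new_master = on_new_master
        self._on_no_master = on_no_master
        self._leader_seq: Optional[str] = None

    def reconnect_timed_out(self) -> None:
        # Corresponds to "Timed out waiting to reconnect to ZooKeeper":
        # forget the leader and tell the slave that no master is known.
        self._leader_seq = None
        self._on_no_master()

    def reconnected(self) -> None:
        # Re-run detection as soon as the connection is back, instead of
        # waiting for the next child event on the master znode.
        self.detect()

    def children_changed(self) -> None:
        # Corresponds to a ZOO_CHILD_EVENT on the master znode.
        self.detect()

    def detect(self) -> None:
        masters = self._list_masters()
        # Assumption: the lowest zero-padded sequence number leads.
        leader_seq = min(masters) if masters else None
        if leader_seq == self._leader_seq:
            return  # Nothing changed, so nothing to announce.
        self._leader_seq = leader_seq
        if leader_seq is None:
            self._on_no_master()
        else:
            self._on_new_master(masters[leader_seq])


# Usage: replay the disconnect/reconnect sequence from the report.
events = []
masters = {"0000001188": "master@10.34.91.117:5050"}
detector = MasterDetectorSketch(
    list_masters=lambda: dict(masters),
    on_new_master=lambda pid: events.append(("new", pid)),
    on_no_master=lambda: events.append(("lost", None)),
)
detector.detect()               # initial detection
detector.reconnect_timed_out()  # network outage exceeds the timeout
detector.reconnected()          # connectivity restored
```

Because the cached leader is cleared when the reconnect times out, the detection run after reconnecting announces the leading master again even though the set of registered masters has not changed.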

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira