Posted to issues@mesos.apache.org by "Priyanka Gupta (JIRA)" <ji...@apache.org> on 2016/04/12 19:23:25 UTC
[jira] [Commented] (MESOS-5193) Recovery failed: Failed to recover registrar on reboot of mesos master
[ https://issues.apache.org/jira/browse/MESOS-5193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15237577#comment-15237577 ]
Priyanka Gupta commented on MESOS-5193:
---------------------------------------
Error stack from the mesos master log on each node:
Node 3
I0411 22:47:02.007249 1348 detector.cpp:479] A new leading master (UPID=master@10.221.28.61:5050) is detected
I0411 22:47:02.007380 1348 master.cpp:1710] The newly elected leader is master@10.221.28.61:5050 with id 725d1232-bea3-4df5-90c5-6479e5652ef4
I0411 22:47:02.007428 1348 master.cpp:1723] Elected as the leading master!
I0411 22:47:02.007457 1348 master.cpp:1468] Recovering from registrar
I0411 22:47:02.007551 1345 registrar.cpp:307] Recovering registrar
I0411 22:47:02.007649 1356 network.hpp:461] ZooKeeper group PIDs: { log-replica(1)@10.221.28.61:5050, log-replica(1)@10.221.28.249:5050 }
I0411 22:47:02.007841 1356 log.cpp:659] Attempting to start the writer
I0411 22:47:02.008477 1348 replica.cpp:493] Replica received implicit promise request from (30)@10.221.28.61:5050 with proposal 52
E0411 22:47:02.008903 1358 process.cpp:1966] Failed to shutdown socket with fd 23: Transport endpoint is not connected
I0411 22:47:02.009968 1348 leveldb.cpp:304] Persisting metadata (8 bytes) to leveldb took 1.44126ms
I0411 22:47:02.010022 1348 replica.cpp:342] Persisted promised to 52
F0411 22:48:02.008332 1357 master.cpp:1457] Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins
*** Check failure stack trace: ***
@ 0x7f4bd5bcedfd (unknown)
@ 0x7f4bd5bd0c3d (unknown)
@ 0x7f4bd5bce9ec (unknown)
@ 0x7f4bd5bd1539 (unknown)
@ 0x7f4bd54022dc (unknown)
@ 0x7f4bd5442ab0 (unknown)
@ 0x42807e (unknown)
@ 0x7f4bd54690a5 (unknown)
@ 0x7f4bd54bb976 (unknown)
@ 0x7f4bd54cc566 (unknown)
@ 0x7f4bd52fc4d6 (unknown)
@ 0x7f4bd54cc553 (unknown)
@ 0x7f4bd54b0614 (unknown)
@ 0x7f4bd5b7c971 (unknown)
@ 0x7f4bd5b7cc77 (unknown)
@ 0x3dc38b6470 (unknown)
@ 0x3dc18079d1 (unknown)
@ 0x3dc14e88fd (unknown)
@ (nil) (unknown)
/bin/bash: line 1: 1313 Aborted /usr/sbin/mesos-master --work_dir=/tmp/mesos_dir --zk=zk://scheduler1.rpt.cb.ne1.yahoo.com:2181,scheduler2.rpt.cb.ne1.yahoo.com:2181,scheduler3.rpt.cb.ne1.yahoo.com:2181/mesos --quorum=2
Node 2
I0411 22:48:10.006216 1466 log.cpp:659] Attempting to start the writer
E0411 22:48:10.006958 1478 process.cpp:1966] Failed to shutdown socket with fd 23: Transport endpoint is not connected
I0411 22:48:10.007202 1467 replica.cpp:493] Replica received implicit promise request from (13)@10.221.28.249:5050 with proposal 52
E0411 22:48:10.007491 1478 process.cpp:1966] Failed to shutdown socket with fd 23: Transport endpoint is not connected
I0411 22:48:10.008458 1467 leveldb.cpp:304] Persisting metadata (8 bytes) to leveldb took 1.227092ms
I0411 22:48:10.008491 1467 replica.cpp:342] Persisted promised to 52
F0411 22:49:10.006739 1476 master.cpp:1457] Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins
*** Check failure stack trace: ***
@ 0x7fec686f2dfd (unknown)
@ 0x7fec686f4c3d (unknown)
@ 0x7fec686f29ec (unknown)
@ 0x7fec686f5539 (unknown)
@ 0x7fec67f262dc (unknown)
@ 0x7fec67f66ab0 (unknown)
@ 0x42807e (unknown)
@ 0x7fec67f8d0a5 (unknown)
@ 0x7fec67fdf976 (unknown)
@ 0x7fec67ff0566 (unknown)
@ 0x7fec67e204d6 (unknown)
@ 0x7fec67ff0553 (unknown)
@ 0x7fec67fd4614 (unknown)
@ 0x7fec686a0971 (unknown)
@ 0x7fec686a0c77 (unknown)
@ 0x37f98b6470 (unknown)
@ 0x39ed207a51 (unknown)
@ 0x39ecae89ad (unknown)
@ (nil) (unknown)
/bin/bash: line 1: 1452 Aborted /usr/sbin/mesos-master --work_dir=/tmp/mesos_dir --zk=zk://scheduler1.rpt.cb.ne1.yahoo.com:2181,scheduler2.rpt.cb.ne1.yahoo.com:2181,scheduler3.rpt.cb.ne1.yahoo.com:2181/mesos --quorum=2
Node 1
I0411 22:45:52.017833 8338 detector.cpp:479] A new leading master (UPID=master@10.221.29.247:5050) is detected
I0411 22:45:52.017925 8338 master.cpp:1710] The newly elected leader is master@10.221.29.247:5050 with id 13df6437-fbe9-4390-9f6c-db9fd1d53a16
I0411 22:45:52.017956 8338 master.cpp:1723] Elected as the leading master!
I0411 22:45:52.017983 8338 master.cpp:1468] Recovering from registrar
I0411 22:45:52.018069 8339 registrar.cpp:307] Recovering registrar
I0411 22:45:52.018337 8333 log.cpp:659] Attempting to start the writer
I0411 22:45:52.018785 8336 network.hpp:461] ZooKeeper group PIDs: { log-replica(1)@10.221.28.61:5050, log-replica(1)@10.221.29.247:5050 }
I0411 22:45:52.019008 8336 replica.cpp:493] Replica received implicit promise request from (31)@10.221.29.247:5050 with proposal 50
E0411 22:45:52.019548 8341 process.cpp:1966] Failed to shutdown socket with fd 24: Transport endpoint is not connected
I0411 22:45:52.020465 8336 leveldb.cpp:304] Persisting metadata (8 bytes) to leveldb took 1.421142ms
I0411 22:45:52.020496 8336 replica.cpp:342] Persisted promised to 50
I0411 22:46:15.034744 8340 network.hpp:413] ZooKeeper group memberships changed
I0411 22:46:15.034867 8334 group.cpp:672] Trying to get '/mesos/log_replicas/0000000346' in ZooKeeper
I0411 22:46:15.035729 8334 group.cpp:672] Trying to get '/mesos/log_replicas/0000000347' in ZooKeeper
I0411 22:46:15.036533 8334 group.cpp:672] Trying to get '/mesos/log_replicas/0000000348' in ZooKeeper
I0411 22:46:15.037353 8335 network.hpp:461] ZooKeeper group PIDs: { log-replica(1)@10.221.28.61:5050, log-replica(1)@10.221.28.249:5050, log-replica(1)@10.221.29.247:5050 }
I0411 22:46:27.242632 8336 http.cpp:503] HTTP GET for /master/state.json from 216.145.54.15:54890 with User-Agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:45.0) Gecko/20100101 Firefox/45.0'
I0411 22:46:37.292083 8335 http.cpp:503] HTTP GET for /master/state.json from 216.145.54.15:54890 with User-Agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:45.0) Gecko/20100101 Firefox/45.0'
I0411 22:46:47.342876 8334 http.cpp:503] HTTP GET for /master/state.json from 216.145.54.15:54890 with User-Agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:45.0) Gecko/20100101 Firefox/45.0'
F0411 22:46:52.019045 8333 master.cpp:1457] Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins
*** Check failure stack trace: ***
@ 0x7f7ad44badfd (unknown)
@ 0x7f7ad44bcc3d (unknown)
@ 0x7f7ad44ba9ec (unknown)
@ 0x7f7ad44bd539 (unknown)
@ 0x7f7ad3cee2dc (unknown)
@ 0x7f7ad3d2eab0 (unknown)
@ 0x42807e (unknown)
@ 0x7f7ad3d550a5 (unknown)
@ 0x7f7ad3da7976 (unknown)
@ 0x7f7ad3db8566 (unknown)
@ 0x7f7ad3be84d6 (unknown)
@ 0x7f7ad3db8553 (unknown)
@ 0x7f7ad3d9c614 (unknown)
@ 0x7f7ad4468971 (unknown)
@ 0x7f7ad4468c77 (unknown)
@ 0x35282b6470 (unknown)
@ 0x35262079d1 (unknown)
@ 0x3525ee88fd (unknown)
@ (nil) (unknown)
/bin/bash: line 1: 8332 Aborted /usr/sbin/mesos-master --work_dir=/tmp/mesos_dir --zk=zk://scheduler1.rpt.cb.ne1.yahoo.com:2181,scheduler2.rpt.cb.ne1.yahoo.com:2181,scheduler3.rpt.cb.ne1.yahoo.com:2181/mesos --quorum=2
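In each of the three logs above, the fatal "Failed to perform fetch within 1mins" line appears one minute after "Attempting to start the writer", and the "ZooKeeper group PIDs" lines show how many log replicas the node knew about at that point. A minimal sketch for pulling those two facts out of a master log is given below. The function name `summarize` and both regular expressions are illustrative assumptions written against the line formats shown here; they are not part of Mesos.

```python
import re

# Matches the "ZooKeeper group PIDs" lines as they appear in the logs above.
PIDS_RE = re.compile(r"ZooKeeper group PIDs: \{ (.*) \}")
# Matches fatal (leading "F") log lines that report the recovery failure.
FATAL_RE = re.compile(r"^F\d{4} (\d{2}:\d{2}:\d{2})\.\d+ .*Recovery failed")


def summarize(log_text):
    """Return (replica count per membership line, fatal timestamps)."""
    replica_counts = []
    fatal_times = []
    for line in log_text.splitlines():
        m = PIDS_RE.search(line)
        if m:
            pids = [p.strip() for p in m.group(1).split(",") if p.strip()]
            replica_counts.append(len(pids))
        m = FATAL_RE.match(line)
        if m:
            fatal_times.append(m.group(1))
    return replica_counts, fatal_times


# Three lines copied from the Node 1 log above.
sample = """\
I0411 22:45:52.018785 8336 network.hpp:461] ZooKeeper group PIDs: { log-replica(1)@10.221.28.61:5050, log-replica(1)@10.221.29.247:5050 }
I0411 22:46:15.037353 8335 network.hpp:461] ZooKeeper group PIDs: { log-replica(1)@10.221.28.61:5050, log-replica(1)@10.221.28.249:5050, log-replica(1)@10.221.29.247:5050 }
F0411 22:46:52.019045 8333 master.cpp:1457] Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins
"""
counts, fatals = summarize(sample)
print(counts, fatals)  # [2, 3] ['22:46:52']
```

Run against a full master log, this gives a quick view of how the replica membership changed before each abort. It only reads the log and makes no claim about the root cause.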
> Recovery failed: Failed to recover registrar on reboot of mesos master
> ----------------------------------------------------------------------
>
> Key: MESOS-5193
> URL: https://issues.apache.org/jira/browse/MESOS-5193
> Project: Mesos
> Issue Type: Bug
> Components: master
> Affects Versions: 0.22.0, 0.27.0
> Reporter: Priyanka Gupta
> Labels: master, mesosphere
>
> Hi all,
> We are running a 3-node cluster with mesos master, mesos slave, and zookeeper on every node, and chronos on top of it. The problem is that when we reboot the leading mesos master, the other nodes try to get elected as leader but fail with a registrar recovery error:
> "Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins"
> The next node then tries to become the leader but fails again with the same error. I am not sure what is causing this. We are currently using mesos 0.22 and also tried upgrading to mesos 0.27, but the problem persists. We start each master with:
> /usr/sbin/mesos-master --work_dir=/tmp/mesos_dir --zk=zk://node1:2181,node2:2181,node3:2181/mesos --quorum=2
> Can you please help us resolve this issue, as it's a production system?
> Thanks,
> Priyanka