Posted to issues@mesos.apache.org by "Priyanka Gupta (JIRA)" <ji...@apache.org> on 2016/04/12 19:23:25 UTC

[jira] [Commented] (MESOS-5193) Recovery failed: Failed to recover registrar on reboot of mesos master

    [ https://issues.apache.org/jira/browse/MESOS-5193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15237577#comment-15237577 ] 

Priyanka Gupta commented on MESOS-5193:
---------------------------------------

Error stack in the Mesos master log on each node:

Node 3
I0411 22:47:02.007249  1348 detector.cpp:479] A new leading master (UPID=master@10.221.28.61:5050) is detected
I0411 22:47:02.007380  1348 master.cpp:1710] The newly elected leader is master@10.221.28.61:5050 with id 725d1232-bea3-4df5-90c5-6479e5652ef4
I0411 22:47:02.007428  1348 master.cpp:1723] Elected as the leading master!
I0411 22:47:02.007457  1348 master.cpp:1468] Recovering from registrar
I0411 22:47:02.007551  1345 registrar.cpp:307] Recovering registrar
I0411 22:47:02.007649  1356 network.hpp:461] ZooKeeper group PIDs: { log-replica(1)@10.221.28.61:5050, log-replica(1)@10.221.28.249:5050 }
I0411 22:47:02.007841  1356 log.cpp:659] Attempting to start the writer
I0411 22:47:02.008477  1348 replica.cpp:493] Replica received implicit promise request from (30)@10.221.28.61:5050 with proposal 52
E0411 22:47:02.008903  1358 process.cpp:1966] Failed to shutdown socket with fd 23: Transport endpoint is not connected
I0411 22:47:02.009968  1348 leveldb.cpp:304] Persisting metadata (8 bytes) to leveldb took 1.44126ms
I0411 22:47:02.010022  1348 replica.cpp:342] Persisted promised to 52
F0411 22:48:02.008332  1357 master.cpp:1457] Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins
*** Check failure stack trace: ***
    @     0x7f4bd5bcedfd  (unknown)
    @     0x7f4bd5bd0c3d  (unknown)
    @     0x7f4bd5bce9ec  (unknown)
    @     0x7f4bd5bd1539  (unknown)
    @     0x7f4bd54022dc  (unknown)
    @     0x7f4bd5442ab0  (unknown)
    @           0x42807e  (unknown)
    @     0x7f4bd54690a5  (unknown)
    @     0x7f4bd54bb976  (unknown)
    @     0x7f4bd54cc566  (unknown)
    @     0x7f4bd52fc4d6  (unknown)
    @     0x7f4bd54cc553  (unknown)
    @     0x7f4bd54b0614  (unknown)
    @     0x7f4bd5b7c971  (unknown)
    @     0x7f4bd5b7cc77  (unknown)
    @       0x3dc38b6470  (unknown)
    @       0x3dc18079d1  (unknown)
    @       0x3dc14e88fd  (unknown)
    @              (nil)  (unknown)
/bin/bash: line 1:  1313 Aborted                 /usr/sbin/mesos-master --work_dir=/tmp/mesos_dir --zk=zk://scheduler1.rpt.cb.ne1.yahoo.com:2181,scheduler2.rpt.cb.ne1.yahoo.com:2181,scheduler3.rpt.cb.ne1.yahoo.com:2181/mesos --quorum=2



Node 2

I0411 22:48:10.006216  1466 log.cpp:659] Attempting to start the writer
E0411 22:48:10.006958  1478 process.cpp:1966] Failed to shutdown socket with fd 23: Transport endpoint is not connected
I0411 22:48:10.007202  1467 replica.cpp:493] Replica received implicit promise request from (13)@10.221.28.249:5050 with proposal 52
E0411 22:48:10.007491  1478 process.cpp:1966] Failed to shutdown socket with fd 23: Transport endpoint is not connected
I0411 22:48:10.008458  1467 leveldb.cpp:304] Persisting metadata (8 bytes) to leveldb took 1.227092ms
I0411 22:48:10.008491  1467 replica.cpp:342] Persisted promised to 52
F0411 22:49:10.006739  1476 master.cpp:1457] Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins
*** Check failure stack trace: ***
    @     0x7fec686f2dfd  (unknown)
    @     0x7fec686f4c3d  (unknown)
    @     0x7fec686f29ec  (unknown)
    @     0x7fec686f5539  (unknown)
    @     0x7fec67f262dc  (unknown)
    @     0x7fec67f66ab0  (unknown)
    @           0x42807e  (unknown)
    @     0x7fec67f8d0a5  (unknown)
    @     0x7fec67fdf976  (unknown)
    @     0x7fec67ff0566  (unknown)
    @     0x7fec67e204d6  (unknown)
    @     0x7fec67ff0553  (unknown)
    @     0x7fec67fd4614  (unknown)
    @     0x7fec686a0971  (unknown)
    @     0x7fec686a0c77  (unknown)
    @       0x37f98b6470  (unknown)
    @       0x39ed207a51  (unknown)
    @       0x39ecae89ad  (unknown)
    @              (nil)  (unknown)
/bin/bash: line 1:  1452 Aborted                 /usr/sbin/mesos-master --work_dir=/tmp/mesos_dir --zk=zk://scheduler1.rpt.cb.ne1.yahoo.com:2181,scheduler2.rpt.cb.ne1.yahoo.com:2181,scheduler3.rpt.cb.ne1.yahoo.com:2181/mesos --quorum=2



Node 1
I0411 22:45:52.017833  8338 detector.cpp:479] A new leading master (UPID=master@10.221.29.247:5050) is detected
I0411 22:45:52.017925  8338 master.cpp:1710] The newly elected leader is master@10.221.29.247:5050 with id 13df6437-fbe9-4390-9f6c-db9fd1d53a16
I0411 22:45:52.017956  8338 master.cpp:1723] Elected as the leading master!
I0411 22:45:52.017983  8338 master.cpp:1468] Recovering from registrar
I0411 22:45:52.018069  8339 registrar.cpp:307] Recovering registrar
I0411 22:45:52.018337  8333 log.cpp:659] Attempting to start the writer
I0411 22:45:52.018785  8336 network.hpp:461] ZooKeeper group PIDs: { log-replica(1)@10.221.28.61:5050, log-replica(1)@10.221.29.247:5050 }
I0411 22:45:52.019008  8336 replica.cpp:493] Replica received implicit promise request from (31)@10.221.29.247:5050 with proposal 50
E0411 22:45:52.019548  8341 process.cpp:1966] Failed to shutdown socket with fd 24: Transport endpoint is not connected
I0411 22:45:52.020465  8336 leveldb.cpp:304] Persisting metadata (8 bytes) to leveldb took 1.421142ms
I0411 22:45:52.020496  8336 replica.cpp:342] Persisted promised to 50
I0411 22:46:15.034744  8340 network.hpp:413] ZooKeeper group memberships changed
I0411 22:46:15.034867  8334 group.cpp:672] Trying to get '/mesos/log_replicas/0000000346' in ZooKeeper
I0411 22:46:15.035729  8334 group.cpp:672] Trying to get '/mesos/log_replicas/0000000347' in ZooKeeper
I0411 22:46:15.036533  8334 group.cpp:672] Trying to get '/mesos/log_replicas/0000000348' in ZooKeeper
I0411 22:46:15.037353  8335 network.hpp:461] ZooKeeper group PIDs: { log-replica(1)@10.221.28.61:5050, log-replica(1)@10.221.28.249:5050, log-replica(1)@10.221.29.247:5050 }
I0411 22:46:27.242632  8336 http.cpp:503] HTTP GET for /master/state.json from 216.145.54.15:54890 with User-Agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:45.0) Gecko/20100101 Firefox/45.0'
I0411 22:46:37.292083  8335 http.cpp:503] HTTP GET for /master/state.json from 216.145.54.15:54890 with User-Agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:45.0) Gecko/20100101 Firefox/45.0'
I0411 22:46:47.342876  8334 http.cpp:503] HTTP GET for /master/state.json from 216.145.54.15:54890 with User-Agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:45.0) Gecko/20100101 Firefox/45.0'
F0411 22:46:52.019045  8333 master.cpp:1457] Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins
*** Check failure stack trace: ***
    @     0x7f7ad44badfd  (unknown)
    @     0x7f7ad44bcc3d  (unknown)
    @     0x7f7ad44ba9ec  (unknown)
    @     0x7f7ad44bd539  (unknown)
    @     0x7f7ad3cee2dc  (unknown)
    @     0x7f7ad3d2eab0  (unknown)
    @           0x42807e  (unknown)
    @     0x7f7ad3d550a5  (unknown)
    @     0x7f7ad3da7976  (unknown)
    @     0x7f7ad3db8566  (unknown)
    @     0x7f7ad3be84d6  (unknown)
    @     0x7f7ad3db8553  (unknown)
    @     0x7f7ad3d9c614  (unknown)
    @     0x7f7ad4468971  (unknown)
    @     0x7f7ad4468c77  (unknown)
    @       0x35282b6470  (unknown)
    @       0x35262079d1  (unknown)
    @       0x3525ee88fd  (unknown)
    @              (nil)  (unknown)
/bin/bash: line 1:  8332 Aborted                 /usr/sbin/mesos-master --work_dir=/tmp/mesos_dir --zk=zk://scheduler1.rpt.cb.ne1.yahoo.com:2181,scheduler2.rpt.cb.ne1.yahoo.com:2181,scheduler3.rpt.cb.ne1.yahoo.com:2181/mesos --quorum=2

> Recovery failed: Failed to recover registrar on reboot of mesos master
> ----------------------------------------------------------------------
>
>                 Key: MESOS-5193
>                 URL: https://issues.apache.org/jira/browse/MESOS-5193
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.22.0, 0.27.0
>            Reporter: Priyanka Gupta
>              Labels: master, mesosphere
>
> Hi all, 
> We are running a 3-node cluster with a Mesos master, a Mesos slave, and ZooKeeper on every node, and Chronos on top of it. When we reboot the leading Mesos master, the other nodes try to get elected as leader but fail with a registrar recovery error:
> "Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins"
> The next node then tries to become the leader but fails again with the same error. I am not sure what causes this. We are currently using Mesos 0.22 and also tried upgrading to Mesos 0.27, but the problem still occurs. The master is started with:
>  /usr/sbin/mesos-master --work_dir=/tmp/mesos_dir --zk=zk://node1:2181,node2:2181,node3:2181/mesos --quorum=2
> Can you please help us resolve this issue, as it is a production system?
> Thanks,
> Priyanka



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)