You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@aurora.apache.org by "Stephan Erb (JIRA)" <ji...@apache.org> on 2016/09/16 11:41:20 UTC

[jira] [Commented] (AURORA-1661) Scheduler leader failed to re-announce itself after ZK name was changed

    [ https://issues.apache.org/jira/browse/AURORA-1661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15496090#comment-15496090 ] 

Stephan Erb commented on AURORA-1661:
-------------------------------------

Given that curator has landed, I'd propose to close this issue as fixed.

> Scheduler leader failed to re-announce itself after ZK name was changed
> -----------------------------------------------------------------------
>
>                 Key: AURORA-1661
>                 URL: https://issues.apache.org/jira/browse/AURORA-1661
>             Project: Aurora
>          Issue Type: Story
>            Reporter: Dmitriy Shirchenko
>            Assignee: John Sirois
>
> After we renamed our ZK cluster we were in a situation where the Aurora leader did not write to the ZK endpoint. 
> Sequence that we performed:
> - change aurora cluster name in clusters.json
> - restart all schedulers (all good)
> - restart zk and move replication log (lose all state) during renaming: intentional 
> - current Aurora leader fails to re-announce itself but election isn't retriggered so we lose our leader
> - some time later: restart all schedulers, and a new leader is elected (good again)
> Partial logs from the master (sorry, I'm not sure which parts are super relevant).
> {code}
> Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: 2016-04-08 22:59:10,723:30181(0x7f7586244700):ZOO_ERROR@handle_socket_error_msg@1697: Socket [10.162.18.25:2181] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
> Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: 2016-04-08 22:59:10,723:30181(0x7f7586a45700):ZOO_ERROR@handle_socket_error_msg@1697: Socket [10.162.28.31:2181] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
> Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: 2016-04-08 22:59:10,723:30181(0x7f7586244700):ZOO_INFO@check_events@1703: initiated connection to server [10.162.22.24:2181]
> Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: 2016-04-08 22:59:10,723:30181(0x7f7586a45700):ZOO_INFO@check_events@1703: initiated connection to server [10.162.2.27:2181]
> Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: 2016-04-08 22:59:10,733:30181(0x7f7586a45700):ZOO_INFO@check_events@1750: session establishment complete on server [10.162.2.27:2181], sessionId=0x28004d50ccb80001, negotiated timeout=4000
> Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.734082 30370 group.cpp:349] Group process (group(1)@10.162.12.31:8083) connected to ZooKeeper
> Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.734127 30370 group.cpp:831] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
> Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.734136 30370 group.cpp:427] Trying to create path '/aurora/replicated-log' in ZooKeeper
> Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.755173 30373 network.hpp:413] ZooKeeper group memberships changed
> Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.755249 30367 group.cpp:700] Trying to get '/aurora/replicated-log/0000000000' in ZooKeeper
> Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.756778 30364 network.hpp:461] ZooKeeper group PIDs: { log-replica(1)@10.162.14.30:8083 }
> Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.763407 30373 network.hpp:413] ZooKeeper group memberships changed
> Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.763478 30362 group.cpp:700] Trying to get '/aurora/replicated-log/0000000000' in ZooKeeper
> Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.764437 30362 group.cpp:700] Trying to get '/aurora/replicated-log/0000000001' in ZooKeeper
> Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.765494 30376 network.hpp:461] ZooKeeper group PIDs: { log-replica(1)@10.160.41.62:8083, log-replica(1)@10.162.14.30:8083 }
> Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: 2016-04-08 22:59:10,782:30181(0x7f7586244700):ZOO_INFO@check_events@1750: session establishment complete on server [10.162.22.24:2181], sessionId=0x26004d8049520002, negotiated timeout=4000
> Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.782429 30376 group.cpp:349] Group process (group(2)@10.162.12.31:8083) connected to ZooKeeper
> Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.782444 30376 group.cpp:831] Syncing group operations: queue size (joins, cancels, datas) = (1, 0, 0)
> Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.782447 30376 group.cpp:427] Trying to create path '/aurora/replicated-log' in ZooKeeper
> Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.795754 30377 network.hpp:413] ZooKeeper group memberships changed
> Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.795805 30359 group.cpp:700] Trying to get '/aurora/replicated-log/0000000000' in ZooKeeper
> Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.796530 30359 group.cpp:700] Trying to get '/aurora/replicated-log/0000000001' in ZooKeeper
> Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.797600 30359 group.cpp:700] Trying to get '/aurora/replicated-log/0000000002' in ZooKeeper
> Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.798283 30359 group.cpp:700] Trying to get '/aurora/replicated-log/0000000003' in ZooKeeper
> Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.799046 30359 group.cpp:700] Trying to get '/aurora/replicated-log/0000000004' in ZooKeeper
> Apr 08 22:59:10 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:10.799967 30372 network.hpp:461] ZooKeeper group PIDs: { log-replica(1)@10.160.41.62:8083, log-replica(1)@10.162.9.54:8083, log-replica(1)@10.162.12.31:8083, log-replica(1)@10.162.14.30:8083, log-replica(1)@10.162.29.25:8083 }
> Apr 08 22:59:11 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:11.839 [RedirectMonitor STARTING-SendThread(compute39-sjc1.prod.uber.internal:2181), ClientCnxn$SendThread:1032] Opening socket connection to server compute39-sjc1.prod.uber.internal/10.162.1.28:2181. Will not attempt to authenticate using SA
> Apr 08 22:59:11 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:11.839 [RedirectMonitor STARTING-SendThread(compute39-sjc1.prod.uber.internal:2181), ClientCnxn$SendThread:876] Socket connection established to compute39-sjc1.prod.uber.internal/10.162.1.28:2181, initiating session
> Apr 08 22:59:11 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:11.842 [RedirectMonitor STARTING-SendThread(compute39-sjc1.prod.uber.internal:2181), ClientCnxn$SendThread:1158] Unable to read additional data from server sessionid 0x270049a1666d1483, likely server has closed socket, closing socket connecti
> Apr 08 22:59:12 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:12.810 [RedirectMonitor STARTING-SendThread(compute38-sjc1.prod.uber.internal:2181), ClientCnxn$SendThread:1032] Opening socket connection to server compute38-sjc1.prod.uber.internal/10.162.22.24:2181. Will not attempt to authenticate using S
> Apr 08 22:59:12 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:12.810 [RedirectMonitor STARTING-SendThread(compute38-sjc1.prod.uber.internal:2181), ClientCnxn$SendThread:876] Socket connection established to compute38-sjc1.prod.uber.internal/10.162.22.24:2181, initiating session
> Apr 08 22:59:12 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:12.811 [RedirectMonitor STARTING-SendThread(compute38-sjc1.prod.uber.internal:2181), ClientCnxn$SendThread:1158] Unable to read additional data from server sessionid 0x270049a1666d1483, likely server has closed socket, closing socket connecti
> Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: 2016-04-08 22:59:13,460:30181(0x7f7577fff700):ZOO_INFO@check_events@1703: initiated connection to server [10.162.2.27:2181]
> Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: 2016-04-08 22:59:13,462:30181(0x7f7577fff700):ZOO_INFO@check_events@1750: session establishment complete on server [10.162.2.27:2181], sessionId=0x28004d50ccb8001c, negotiated timeout=10000
> Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:13.463156 30378 group.cpp:349] Group process (group(3)@10.162.12.31:8083) connected to ZooKeeper
> Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:13.463179 30378 group.cpp:831] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
> Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:13.463186 30378 group.cpp:427] Trying to create path '/mesos' in ZooKeeper
> Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:13.466274 30373 detector.cpp:154] Detected a new leader: (id='1')
> Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:13.466698 30365 group.cpp:700] Trying to get '/mesos/json.info_0000000001' in ZooKeeper
> Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:13.467295 30360 detector.cpp:479] A new leading master (UPID=master@10.162.29.25:5050) is detected
> Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:13.467350 30370 sched.cpp:326] New master detected at master@10.162.29.25:5050
> Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:13.467427 30370 sched.cpp:382] Authenticating with master master@10.162.29.25:5050
> Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:13.467437 30370 sched.cpp:389] Using default CRAM-MD5 authenticatee
> Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:13.467504 30360 authenticatee.cpp:121] Creating new client SASL connection
> Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:13.753 [RedirectMonitor STARTING-SendThread(compute36-sjc1.prod.uber.internal:2181), ClientCnxn$SendThread:1032] Opening socket connection to server compute36-sjc1.prod.uber.internal/10.162.28.31:2181. Will not attempt to authenticate using S
> Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: W0408 22:59:13.753 [RedirectMonitor STARTING-SendThread(compute36-sjc1.prod.uber.internal:2181), ClientCnxn$SendThread:1162] Session 0x270049a1666d1483 for server null, unexpected error, closing socket connection and attempting reconnect java.net.Connect
> Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[na:1.8.0_72-internal]
> ...
> Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) ~[na:1.8.0_72-internal]
> Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361) ~[zookeeper-3.4.8.jar:3.4.8--1]
> Apr 08 22:59:13 compute33-sjc1 aurora-scheduler[30176]: at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141) ~[zookeeper-3.4.8.jar:3.4.8--1]
> Apr 08 22:59:14 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:14.851 [RedirectMonitor STARTING-SendThread(compute40-sjc1.prod.uber.internal:2181), ClientCnxn$SendThread:1032] Opening socket connection to server compute40-sjc1.prod.uber.internal/10.162.2.27:2181. Will not attempt to authenticate using SA
> Apr 08 22:59:14 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:14.851 [RedirectMonitor STARTING-SendThread(compute40-sjc1.prod.uber.internal:2181), ClientCnxn$SendThread:876] Socket connection established to compute40-sjc1.prod.uber.internal/10.162.2.27:2181, initiating session
> Apr 08 22:59:14 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:14.852 [RedirectMonitor STARTING-SendThread(compute40-sjc1.prod.uber.internal:2181), ClientCnxn$SendThread:1158] Unable to read additional data from server sessionid 0x270049a1666d1483, likely server has closed socket, closing socket connecti
> ...
> Apr 08 22:59:50 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:50.538 [RedirectMonitor STARTING-SendThread(compute37-sjc1.prod.uber.internal:2181), ClientCnxn$SendThread:1158] Unable to read additional data from server sessionid 0x270049a1666d1483, likely server has closed socket, closing socket connecti
> Apr 08 22:59:52 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:52.273 [RedirectMonitor STARTING-SendThread(compute39-sjc1.prod.uber.internal:2181), ClientCnxn$SendThread:1032] Opening socket connection to server compute39-sjc1.prod.uber.internal/10.162.1.28:2181. Will not attempt to authenticate using SA
> Apr 08 22:59:52 compute33-sjc1 aurora-scheduler[30176]: W0408 22:59:52.273 [RedirectMonitor STARTING-SendThread(compute39-sjc1.prod.uber.internal:2181), ClientCnxn$SendThread:1108] Client session timed out, have not heard from server in 1635ms for sessionid 0x270049a1666d1483
> Apr 08 22:59:52 compute33-sjc1 aurora-scheduler[30176]: I0408 22:59:52.273 [RedirectMonitor STARTING-SendThread(compute39-sjc1.prod.uber.internal:2181), ClientCnxn$SendThread:1156] Client session timed out, have not heard from server in 1635ms for sessionid 0x270049a1666d1483, closing socket connection and at
> Apr 08 22:59:52 compute33-sjc1 aurora-scheduler[30176]: W0408 22:59:52.374 [AsyncProcessor-7, Group$ActiveMembership:370] Temporary error cancelling membership:/aurora/scheduler/member_0000000360 org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /aurora/scheduler/member_0000000360
> ...
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)