You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@zookeeper.apache.org by Akmal Abbasov <ak...@icloud.com> on 2015/10/01 16:34:38 UTC

Re: Unstable work of zookeeper

I’ve checked the logs of the zookeeper quorum nodes, and it seems that problem is not here.

I found that client is closing connection due to timeout, only after this zookeeper server has EOF exception. Here is the logs from hadoop zkfc process
2015-10-01 07:12:17,820 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 3334ms for sessionid 0x25020ff36c40002, closing socket connection and attempting reconnect
2015-10-01 07:12:17,938 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session disconnected. Entering neutral mode…

As you can see client timeouted after 3334ms, but the timeout configured in hdfs-site.xml dfs.ha.fencing.ssh.connect-timeout is 10.000ms 
so why does the client timed out?

Thanks.


> On 24 Sep 2015, at 15:36, Flavio Junqueira <fp...@apache.org> wrote:
> 
> I can see that the client is disconnecting from the server, and there is also a new round of leader election for the zookeeper servers. If this is happening, then yeah, your ensemble is unstable. If the ensemble leader election is being triggered frequently, then I'd start by looking there. Try to determine why the ensemble is failing to continue with the same leader. If ensemble elections aren't happening frequently, then another possibility is that GC pauses are causing the session to expire.
> 
> -Flavio
> 
>> On 24 Sep 2015, at 05:24, Akmal Abbasov <ak...@icloud.com> wrote:
>> 
>> Hi,
>> I am using zookeeper 3.4.6
>> I have a spark cluster configured with HA. Once per 1-2 days, the active spark master is shutting down with a message
>> 15/09/23 18:58:18 INFO zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x34ffa68dbd10021, likely server has closed socket, closing socket connection and attempting reconnect
>> 15/09/23 18:58:18 INFO state.ConnectionStateManager: State change: SUSPENDED
>> 15/09/23 18:58:18 INFO master.ZooKeeperLeaderElectionAgent: We have lost leadership
>> 15/09/23 18:58:18 ERROR master.Master: Leadership has been revoked -- master shutting down.
>> 15/09/23 18:58:18 INFO util.Utils: Shutdown hook called
>> 
>> I don’t have the zookeeper logs from the same period, but the logs are full of the these messages 
>> 2015-09-24 05:07:42,228 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection from /10.0.8.4:34705
>> 2015-09-24 05:07:42,229 [myid:1] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@822] - Connection request from old client /10.0.8.4:34705; will be dropped if server is in r-o mode
>> 2015-09-24 05:07:42,229 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@868] - Client attempting to establish new session at /10.0.8.4:34705
>> 2015-09-24 05:07:42,292 [myid:1] - INFO  [CommitProcessor:1:ZooKeeperServer@617] - Established session 0x14ffd3670130030 with negotiated timeout 20001 for client /10.0.8.4:34705
>> 2015-09-24 05:07:42,302 [myid:1] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@357] - caught end of stream exception
>> EndOfStreamException: Unable to read additional data from client sessionid 0x14ffd3670130030, likely client has closed socket
>> 	at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228)
>> 	at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
>> 	at java.lang.Thread.run(Thread.java:745)
>> 2015-09-24 05:07:42,303 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection for client /10.0.8.4:34705 which had sessionid 0x14ffd3670130030
>> 2015-09-24 05:07:42,314 [myid:1] - ERROR [CommitProcessor:1:NIOServerCnxn@178] - Unexpected Exception:
>> java.nio.channels.CancelledKeyException
>> 	at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
>> 	at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
>> 	at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:151)
>> 	at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1081)
>> 	at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:404)
>> 	at org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:74)
>> 2015-09-24 05:07:42,334 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection from /10.0.8.4:34707
>> 2015-09-24 05:07:42,334 [myid:1] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@822] - Connection request from old client /10.0.8.4:34707; will be dropped if server is in r-o mode
>> 2015-09-24 05:07:42,335 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@868] - Client attempting to establish new session at /10.0.8.4:34707
>> 2015-09-24 05:07:42,357 [myid:1] - INFO  [CommitProcessor:1:ZooKeeperServer@617] - Established session 0x14ffd3670130031 with negotiated timeout 20001 for client /10.0.8.4:34707
>> 2015-09-24 05:07:42,364 [myid:1] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@357] - caught end of stream exception
>> EndOfStreamException: Unable to read additional data from client sessionid 0x14ffd3670130031, likely client has closed socket
>> 	at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228)
>> 	at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
>> 	at java.lang.Thread.run(Thread.java:745)
>> 2015-09-24 05:07:42,365 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection for client /10.0.8.4:34707 which had sessionid 0x14ffd3670130031
>> 2015-09-24 05:07:42,376 [myid:1] - ERROR [CommitProcessor:1:NIOServerCnxn@178] - Unexpected Exception:
>> java.nio.channels.CancelledKeyException
>> 	at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
>> 	at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
>> 	at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:151)
>> 	at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1081)
>> 	at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:404)
>> 	at org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:74)
>> 
>> Also there are 
>> 2015-09-24 06:29:54,459 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FollowerZooKeeperServer@139] - Shutting down
>> 2015-09-24 06:29:54,459 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@441] - shutting down
>> 2015-09-24 06:29:54,459 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FollowerRequestProcessor@105] - Shutting down
>> 2015-09-24 06:29:54,459 [myid:1] - INFO  [FollowerRequestProcessor:1:FollowerRequestProcessor@95] - FollowerRequestProcessor exited loop!
>> 2015-09-24 06:29:54,460 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:CommitProcessor@181] - Shutting down
>> 2015-09-24 06:29:54,464 [myid:1] - INFO  [CommitProcessor:1:CommitProcessor@150] - CommitProcessor exited loop!
>> 2015-09-24 06:29:54,465 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FinalRequestProcessor@415] - shutdown of request processor complete
>> 2015-09-24 06:29:54,466 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:SyncRequestProcessor@209] - Shutting down
>> 2015-09-24 06:29:54,466 [myid:1] - INFO  [SyncThread:1:SyncRequestProcessor@187] - SyncRequestProcessor exited!
>> 2015-09-24 06:29:54,466 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:QuorumPeer@714] - LOOKING
>> 2015-09-24 06:29:54,584 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection from /10.0.8.58:36137
>> 2015-09-24 06:29:54,584 [myid:1] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
>> 2015-09-24 06:29:54,584 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection for client /10.0.8.58:36137 (no session established for client)
>> 2015-09-24 06:29:54,679 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection from /10.0.8.57:57410
>> 2015-09-24 06:29:54,680 [myid:1] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
>> 2015-09-24 06:29:54,680 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection for client /10.0.8.57:57410 (no session established for client)
>> 
>> I also observed that hadoop-zkfc restarts very frequently.
>> Any ideas what could be wrong?
>> 
>> Thanks.
>