Posted to dev@zookeeper.apache.org by "Raul Gutierrez Segales (JIRA)" <ji...@apache.org> on 2015/03/25 00:12:52 UTC

[jira] [Commented] (ZOOKEEPER-2151) FollowerZookeeperServer has thousands of outstanding requests stuck in CommitProcessor

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-2151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14378883#comment-14378883 ] 

Raul Gutierrez Segales commented on ZOOKEEPER-2151:
---------------------------------------------------

[~jaredc]: we've seen this issue in production many times. Are you using 3.5.0rc1 or a recent build from master? Also, and more importantly, do you have local sessions enabled?

And, finally, do you have any special settings for the CommitProcessor? I.e., any of the ones introduced by ZOOKEEPER-1505?

> FollowerZookeeperServer has thousands of outstanding requests stuck in CommitProcessor
> --------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-2151
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2151
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.5.0
>         Environment: Ubuntu 12.04
>            Reporter: Jared Cantwell
>
> We are seeing one follower server in our quorum stuck with thousands of outstanding requests:
> ---------------------------------------------
> node04:~$ telnet 10.10.10.6 2181
> Trying 10.10.10.6...
> Connected to 10.10.10.6.
> Escape character is '^]'.
> *stat*
> Zookeeper version: 3.5.0-1547702, built on 05/15/2014 03:06 GMT
> Clients:
>  /10.10.10.6:60646[0](queued=0,recved=1,sent=0)
>  /10.10.10.6:60648[0](queued=0,recved=1,sent=0)
>  /10.10.10.6:41786[0](queued=1,recved=3,sent=1)
> Latency min/avg/max: 0/0/1887
> Received: 3064156900
> Sent: 3064134581
> Connections: 3
> *Outstanding: 24395*
> Zxid: 0x11050f7e4b
> Mode: follower
> Node count: 6969
> Connection closed by foreign host.
> ---------------------------------------------
> When this happens, our C client is able to establish an initial connection to the server, but every request then times out.  It re-establishes the connection, times out again, and repeats.  We noticed this because this particular client is set up to connect directly to only one server in the quorum, so any problem with that server shows up.  Our other clients simply connect to the next server in the list, which is why only this client sees a problem.
> We were able to capture a heap dump in one instance.  This is what we observed:
> - FollowerZookeeperServer.requestsInProcess has count ~24K
> - CommitProcessor.queuedRequests list has the 24K items in it, so the FinalRequestProcessor's processRequest function is never called to complete the requests.
> - CommitProcessor.isWaitingForCommit()==true
> - CommitProcessor.committedRequests.isEmpty()==true
> - CommitProcessor.nextPending is a create request
> - CommitProcessor.currentlyCommitting is null
> - CommitProcessor.numRequestsProcessing is 0
> - FollowerZookeeperServer, which should be calling commit() on the CommitProcessor, has no elements in its pendingTxns list, which indicates that it thinks it has already passed a COMMIT message to the CommitProcessor for every request that is stuck in the queuedRequests list and nextPending member of CommitProcessor.
> The CommitProcessor's run() is doing this:
> {quote}
> Thread 23510: (state = BLOCKED)
>    java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be imprecise)
>    org.apache.zookeeper.server.quorum.CommitProcessor.run() @bci=165, line=182 (Compiled frame)
> {quote}
> When we attached via gdb to get the dump, sockets closed, which triggered a new round of leader election.  When this happened, the issue corrected itself, since the whole FollowerZookeeperServer was restarted.
> I've confirmed that no clock changes occurred before things got stuck, which was 2 days before we noticed it.
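
For anyone who wants to watch for this condition without telnetting in by hand, the "Outstanding" figure in the stat output quoted above can be read programmatically. The following is only a rough sketch, not anything from the ZooKeeper code base: the class name StatOutstanding, its helper methods, the 5 second socket timeout, and the alert threshold of 1000 are all hypothetical choices made for illustration. It assumes the server answers the "stat" four-letter command on its client port, as in the session above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of ZooKeeper: reads the "Outstanding" count from "stat" output.
public class StatOutstanding {
    // Matches a line such as "Outstanding: 24395"; leading non-word characters
    // (for example quote or emphasis markers) are tolerated.
    private static final Pattern OUTSTANDING =
            Pattern.compile("^\\W*Outstanding:\\s*(\\d+)", Pattern.MULTILINE);

    // Returns the outstanding request count, or empty if the line is absent.
    public static OptionalLong parseOutstanding(String statOutput) {
        Matcher m = OUTSTANDING.matcher(statOutput);
        return m.find()
                ? OptionalLong.of(Long.parseLong(m.group(1)))
                : OptionalLong.empty();
    }

    // Sends the four-letter command "stat" and reads the reply until the server closes the socket.
    public static String fetchStat(String host, int port) throws IOException {
        try (Socket socket = new Socket(host, port)) {
            socket.setSoTimeout(5000); // assumed timeout, in milliseconds
            OutputStream out = socket.getOutputStream();
            out.write("stat".getBytes(StandardCharsets.US_ASCII));
            out.flush();
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(socket.getInputStream(), StandardCharsets.US_ASCII));
            StringBuilder reply = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                reply.append(line).append('\n');
            }
            return reply.toString();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: StatOutstanding <host> <port>");
            return;
        }
        long threshold = 1000; // hypothetical alert threshold
        OptionalLong outstanding = parseOutstanding(fetchStat(args[0], Integer.parseInt(args[1])));
        if (!outstanding.isPresent()) {
            System.out.println("No Outstanding line in stat output");
        } else if (outstanding.getAsLong() > threshold) {
            System.out.println("Possible stuck server, outstanding=" + outstanding.getAsLong());
        } else {
            System.out.println("outstanding=" + outstanding.getAsLong());
        }
    }
}
```

Polling each quorum member this way would flag a follower like the one above even when clients quietly fail over to another server. A large count by itself does not prove the CommitProcessor is stuck; it only marks the server as worth a thread dump.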
