You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@zookeeper.apache.org by "Artem Golotin (JIRA)" <ji...@apache.org> on 2016/08/01 21:07:21 UTC

[jira] [Commented] (ZOOKEEPER-1863) Race condition in commit processor leading to out of order request completion, xid mismatch on client.

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15402819#comment-15402819 ] 

Artem Golotin commented on ZOOKEEPER-1863:
------------------------------------------

Hi! I was able to reproduce the same issue as mentioned in [ZOOKEEPER-2151|https://issues.apache.org/jira/browse/ZOOKEEPER-2151]. Here is the output from the stat command: 
{noformat}
Zookeeper version: 3.5.1-alpha--1, built on 07/08/2016 17:08 GMT
Clients:
 /0:0:0:0:0:0:0:1:59312[0](queued=0,recved=1,sent=0)
 /10.2.29.7:34266[0](queued=0,recved=1,sent=0)
 /10.2.29.7:55996[0](queued=0,recved=1,sent=0)
 /10.2.81.3:48640[0](queued=0,recved=1,sent=0)

Latency min/avg/max: 0/1/826
Received: 2199007
Sent: 935511
Connections: 4
Outstanding: 48112
Zxid: 0x100011d4e
Mode: leader
Node count: 171
{noformat}

Notice the number of outstanding connections.... 
After examining the code I found that the patch that have been submitted with this issue was applied to the code. With my limited experience I do not believe that I will be able to reproduce the issue, but I am saving the full stdout zookeeper output and can provide it if needed. However, I cannot provide you with a jstack dump, because when I try to do so (even with -F option), it generates the following exception:
{noformat}
Error attaching to process: sun.jvm.hotspot.debugger.DebuggerException: Can't attach to the process
{noformat}

Unfortunately, the issue seems to have happened on Sunday when I was not able to check the health of my cluster or perform any immediate fixes. I do have several Kazoo clients + Kafka + Secor that are using zookeeper, and as all those are failing now with zookeeper outputting the following logs:

{noformat}
2016-08-01 20:45:51,828 - INFO  [NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296] - Accepted socket connection from /10.2.81.3:50656
2016-08-01 20:45:51,829 - INFO  [NIOWorkerThread-4:ZooKeeperServer@931] - Client attempting to establish new session at /10.2.81.3:50656
2016-08-01 20:45:53,723 - INFO  [NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296] - Accepted socket connection from /10.2.29.7:35862
2016-08-01 20:45:53,725 - INFO  [NIOWorkerThread-3:ZooKeeperServer@931] - Client attempting to establish new session at /10.2.29.7:35862
2016-08-01 20:45:59,171 - INFO  [NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296] - Accepted socket connection from /10.2.81.3:50658
2016-08-01 20:45:59,172 - INFO  [NIOWorkerThread-1:ZooKeeperServer@931] - Client attempting to establish new session at /10.2.81.3:50658
2016-08-01 20:46:00,223 - INFO  [ConnnectionExpirer:NIOServerCnxn@1007] - Closed socket connection for client /10.2.29.7:35828 which had sessionid 0x10044f682c45bdd
2016-08-01 20:46:00,223 - INFO  [ConnnectionExpirer:NIOServerCnxn@1007] - Closed socket connection for client /10.2.81.3:50656 which had sessionid 0x10044f682c45bdf
2016-08-01 20:46:05,013 - INFO  [NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296] - Accepted socket connection from /10.2.29.7:35898
2016-08-01 20:46:05,028 - INFO  [NIOWorkerThread-2:NIOServerCnxn@836] - Processing ruok command from /10.2.29.7:35898
2016-08-01 20:46:05,029 - INFO  [NIOWorkerThread-2:NIOServerCnxn@1007] - Closed socket connection for client /10.2.29.7:35898 (no session established for client)
2016-08-01 20:46:05,146 - INFO  [NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296] - Accepted socket connection from /10.2.29.7:35900
2016-08-01 20:46:05,153 - INFO  [NIOWorkerThread-4:NIOServerCnxn@836] - Processing srvr command from /10.2.29.7:35900
2016-08-01 20:46:05,153 - INFO  [NIOWorkerThread-4:NIOServerCnxn@1007] - Closed socket connection for client /10.2.29.7:35900 (no session established for client)
2016-08-01 20:46:05,617 - INFO  [NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296] - Accepted socket connection from /10.2.29.7:35904
2016-08-01 20:46:05,617 - INFO  [NIOWorkerThread-3:NIOServerCnxn@836] - Processing ruok command from /10.2.29.7:35904
2016-08-01 20:46:05,618 - INFO  [NIOWorkerThread-3:NIOServerCnxn@1007] - Closed socket connection for client /10.2.29.7:35904 (no session established for client)
2016-08-01 20:46:05,634 - INFO  [NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296] - Accepted socket connection from /10.2.29.7:35906
2016-08-01 20:46:05,634 - INFO  [NIOWorkerThread-1:NIOServerCnxn@836] - Processing srvr command from /10.2.29.7:35906
2016-08-01 20:46:05,636 - INFO  [NIOWorkerThread-1:NIOServerCnxn@1007] - Closed socket connection for client /10.2.29.7:35906 (no session established for client)
2016-08-01 20:46:05,644 - INFO  [NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296] - Accepted socket connection from /10.2.29.7:35908
2016-08-01 20:46:05,644 - INFO  [NIOWorkerThread-2:ZooKeeperServer@931] - Client attempting to establish new session at /10.2.29.7:35908
2016-08-01 20:46:06,718 - INFO  [NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296] - Accepted socket connection from /10.2.81.3:50660
2016-08-01 20:46:06,719 - INFO  [NIOWorkerThread-4:ZooKeeperServer@931] - Client attempting to establish new session at /10.2.81.3:50660
2016-08-01 20:46:10,224 - INFO  [ConnnectionExpirer:NIOServerCnxn@1007] - Closed socket connection for client /10.2.81.3:50658 which had sessionid 0x10044f682c45be1
2016-08-01 20:46:10,225 - INFO  [ConnnectionExpirer:NIOServerCnxn@1007] - Closed socket connection for client /10.2.29.7:35862 which had sessionid 0x10044f682c45be0
2016-08-01 20:46:13,933 - INFO  [NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296] - Accepted socket connection from /10.2.81.3:50662
2016-08-01 20:46:13,933 - INFO  [NIOWorkerThread-3:ZooKeeperServer@931] - Client attempting to establish new session at /10.2.81.3:50662
2016-08-01 20:46:16,030 - INFO  [NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296] - Accepted socket connection from /10.2.29.7:35942
2016-08-01 20:46:16,032 - INFO  [NIOWorkerThread-1:ZooKeeperServer@931] - Client attempting to establish new session at /10.2.29.7:35942
2016-08-01 20:46:18,646 - INFO  [NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296] - Accepted socket connection from /10.2.29.7:57696
2016-08-01 20:46:18,647 - INFO  [NIOWorkerThread-2:ZooKeeperServer@931] - Client attempting to establish new session at /10.2.29.7:57696
2016-08-01 20:46:20,223 - INFO  [ConnnectionExpirer:NIOServerCnxn@1007] - Closed socket connection for client /10.2.29.7:35908 which had sessionid 0x10044f682c45be2
2016-08-01 20:46:20,224 - INFO  [ConnnectionExpirer:NIOServerCnxn@1007] - Closed socket connection for client /10.2.81.3:50662 which had sessionid 0x10044f682c45be4
2016-08-01 20:46:20,225 - INFO  [ConnnectionExpirer:NIOServerCnxn@1007] - Closed socket connection for client /10.2.81.3:50660 which had sessionid 0x10044f682c45be3
2016-08-01 20:46:21,255 - INFO  [NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181:NIOServerCnxnFactory$AcceptThread@296] - Accepted socket connection from /10.2.81.3:50664
2016-08-01 20:46:21,256 - INFO  [NIOWorkerThread-4:ZooKeeperServer@931] - Client attempting to establish new session at /10.2.81.3:50664
{noformat}

Also, I would like to point out that I am running zookeeper as a part of kubernetes cluster, so the issue may be caused by kubernetes internal proxy, but I do not know why that would cause an issue...

> Race condition in commit processor leading to out of order request completion, xid mismatch on client.
> ------------------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1863
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1863
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.5.0
>            Reporter: Dutch T. Meyer
>            Assignee: Dutch T. Meyer
>            Priority: Blocker
>             Fix For: 3.5.0
>
>         Attachments: ZOOKEEPER-1863.patch, ZOOKEEPER-1863.patch, ZOOKEEPER-1863.patch, ZOOKEEPER-1863.patch, ZOOKEEPER-1863.patch, ZOOKEEPER-1863.patch, ZOOKEEPER-1863.patch, stack.17512
>
>
> In CommitProcessor.java processor, if we are at the primary request handler on line 167:
> {noformat}
>                 while (!stopped && !isWaitingForCommit() &&
>                        !isProcessingCommit() &&
>                        (request = queuedRequests.poll()) != null) {
>                     if (needCommit(request)) {
>                         nextPending.set(request);
>                     } else {
>                         sendToNextProcessor(request);
>                     }
>                 }
> {noformat}
> A request can be handled in this block and be quickly processed and completed on another thread. If queuedRequests is empty, we then exit the block. Next, before this thread makes any more progress, we can get 2 more requests, one get_children(say), and a sync placed on queuedRequests for the processor. Then, if we are very unlucky, the sync request can complete and this object's commit() routine is called (from FollowerZookeeperServer), which places the sync request on the previously empty committedRequests queue. At that point, this thread continues.
> We reach line 182, which is a check on sync requests.
> {noformat}
>                 if (!stopped && !isProcessingRequest() &&
>                     (request = committedRequests.poll()) != null) {
> {noformat}
> Here we are not processing any requests, because the original request has completed. We haven't dequeued either the read or the sync request in this processor. Next, the poll above will pull the sync request off the queue, and in the following block, the sync will get forwarded to the next processor.
> This is a problem because the read request hasn't been forwarded yet, so requests are now out of order.
> I've been able to reproduce this bug reliably by injecting a Thread.sleep(5000) between the two blocks above to make the race condition far more likely, then in a client program.
> {noformat}
>         zoo_aget_children(zh, "/", 0, getchildren_cb, NULL);
>         //Wait long enough for queuedRequests to drain
>         sleep(1);
>         zoo_aget_children(zh, "/", 0, getchildren_cb, &th_ctx[0]);
>         zoo_async(zh, "/", sync_cb, &th_ctx[0]);
> {noformat}
> When this bug is triggered, 3 things can happen:
> 1) Clients will see requests complete out of order and fail on xid mismatches.
> 2) Kazoo in particular doesn't handle this runtime exception well, and can orphan outstanding requests.
> 3) I've seen zookeeper servers deadlock, likely because the commit cannot be completed, which can wedge the commit processor.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)