You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@zookeeper.apache.org by "Andor Molnar (JIRA)" <ji...@apache.org> on 2018/10/25 13:42:00 UTC

[jira] [Created] (ZOOKEEPER-3182) Race condition when follower syncing with leader and starting to server requests

Andor Molnar created ZOOKEEPER-3182:
---------------------------------------

             Summary: Race condition when follower syncing with leader and starting to server requests
                 Key: ZOOKEEPER-3182
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3182
             Project: ZooKeeper
          Issue Type: Bug
          Components: server
    Affects Versions: 3.6.0
            Reporter: Andor Molnar


This issue is probably introduced by ZOOKEEPER-2024 where 2 seperate queues have been implemented in CommitProcessor to improve performance.
[~abrahamfine] 's analysis is accurate on GitHub: https://github.com/apache/zookeeper/pull/300

He was trying to introduce synchronization between Learner.syncWithLeader() and CommitProcessor to wait for in-flight requests to be committed before accepting client requests.

In the affected unit test (testNodeDataChanged) there's a race between reconnecting client's setWatches request and updates coming from the leader according to the following logs:

{noformat}
2018-10-25 13:59:58,556 [myid:] - DEBUG [FollowerRequestProcessor:1:CommitProcessor@424] - Processing request:: sessionid:0x10005d8fc4d0000 type:setWatches cxid:0x3 zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a
2018-10-25 13:59:58,556 [myid:] - DEBUG [CommitProcWorkThread-1:FinalRequestProcessor@91] - Processing request:: sessionid:0x10005d8fc4d0000 type:setWatches cxid:0x3 zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a
...
2018-10-25 13:59:58,557 [myid:] - DEBUG [CommitProcWorkThread-1:FinalRequestProcessor@91] - Processing request:: sessionid:0x20005d8f8a40000 type:delete cxid:0x1 zxid:0x100000004 txntype:2 reqpath:n/a
...
2018-10-25 13:59:58,561 [myid:] - DEBUG [CommitProcWorkThread-1:FinalRequestProcessor@91] - Processing request:: sessionid:0x20005d8f8a40000 type:create cxid:0x2 zxid:0x100000005 txntype:1 reqpath:n/a
2018-10-25 13:59:58,561 [myid:127.0.0.1:11231] - DEBUG [main-SendThread(127.0.0.1:11231):ClientCnxn$SendThread@864] - Got WatchedEvent state:SyncConnected type:NodeDeleted path:/test-changed for sessionid 0x10005d8fc4d0000
{noformat}

{{setWatches}} request is processed before {{delete}} and {{create}}, hence the client receives NodeDeleted event.

In the working scenario it looks like:

{noformat}
2018-10-25 14:04:55,247 [myid:] - DEBUG [CommitProcWorkThread-1:FinalRequestProcessor@91] - Processing request:: sessionid:0x20005dd88110000 type:delete cxid:
0x1 zxid:0x100000004 txntype:2 reqpath:n/a
2018-10-25 14:04:55,249 [myid:] - DEBUG [CommitProcWorkThread-1:FinalRequestProcessor@91] - Processing request:: sessionid:0x20005dd88110000 type:create cxid:
0x2 zxid:0x100000005 txntype:1 reqpath:n/a
...
2018-10-25 14:04:56,314 [myid:] - DEBUG [FollowerRequestProcessor:1:CommitProcessor@424] - Processing request:: sessionid:0x10005dd88110000 type:setWatches cxid:0x3 zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a
2018-10-25 14:04:56,315 [myid:] - DEBUG [CommitProcWorkThread-1:FinalRequestProcessor@91] - Processing request:: sessionid:0x10005dd88110000 type:setWatches cxid:0x3 zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a
...
2018-10-25 14:04:56,316 [myid:127.0.0.1:11231] - DEBUG [main-SendThread(127.0.0.1:11231):ClientCnxn$SendThread@842] - Got notification sessionid:0x10005dd88110000
2018-10-25 14:04:56,316 [myid:127.0.0.1:11231] - DEBUG [main-SendThread(127.0.0.1:11231):ClientCnxn$SendThread@864] - Got WatchedEvent state:SyncConnected type:NodeDataChanged path:/test-changed for sessionid 0x10005dd88110000
{noformat}

{{delete}} and {{create}} requests happen way before {{setWatches}} comes in (even before the client connection is established.

Abe's approach unfortunately raises the following concerns:
- modifies CommitProcessor's code which might affect performance and correctness ([~shralex] raised on ZOOKEEPER-2807),
- we experienced deadlocks while testing the patch: https://github.com/apache/zookeeper/pull/300

As a consequence I raised this Jira the capture the experiences and to put the unit test on Ignore list, because currenty I'm not sure about whether this is a real issue or a non-backward compatible change in 3.6 with the gain of a huge performance improvement.

Either way I don't want this flaky test to influence contributions, so I'll mark as Ignored on trunk until the issue is resolved.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)