You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@zookeeper.apache.org by "Jithin Girish (Jira)" <ji...@apache.org> on 2022/05/06 18:19:00 UTC

[jira] [Created] (ZOOKEEPER-4537) Race between SyncThread and CommitProcessor thread

Jithin Girish created ZOOKEEPER-4537:
----------------------------------------

             Summary: Race between SyncThread and CommitProcessor thread
                 Key: ZOOKEEPER-4537
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4537
             Project: ZooKeeper
          Issue Type: Bug
          Components: server
            Reporter: Jithin Girish


Zookeeper server can get stuck when it has just one client and the only way it recovers is due to a socket timeout or another client commit request.

Details:

We have a system where a system manager app (SM) launches all the other applications based on some config criteria.  When system boots up, SM connects to zookeeper and is the only client talking to zookeeper until all other apps are launched. SM does a bunch of sync create() calls to zookeeper, before it starts up all the other apps. So at this point zookeeper server has got just one connection, which is from SM. 1 out of 3 times during system startup, SM gets stuck at a random create calls and there are only two ways this gets unwedged.
1. The socket has to time out and we get ZOPERATIONTIMEOUT
2. If we start another connection to zookeeper (I used zkCli.sh to do this), this unwedges the exiting connection to SM.
 
From strace I can see that there is a race between SyncThread and CommitProcessor thread.
 
Sync thread
class CommitProcessor {
run() {
            int requestsToProcess = 0;
            boolean commitIsWaiting = false;
            do {
                commitIsWaiting = !committedRequests.isEmpty(); <<<<< we are checking if there are more messages to process here
                requestsToProcess = queuedRequests.size();
                // Avoid sync if we have something to do
                if (requestsToProcess == 0 && !commitIsWaiting) {
                    // Waiting for requests to process
                    synchronized (this) {
                        while (!stopped && requestsToProcess == 0 && !commitIsWaiting) { <<<<< and acting on the information read above inside the sync block
                            wait(); <<<<<< wait here
                            commitIsWaiting = !committedRequests.isEmpty();
                            requestsToProcess = queuedRequests.size();
                        }
                    }
                }
 
 
Commit thread
    public void commit(Request request) {
        if (stopped || request == null) {
            return;
        }
        LOG.debug("Committing request:: {}", request);
        request.commitRecvTime = Time.currentElapsedTime();
        ServerMetrics.getMetrics().COMMITS_QUEUED.add(1);
        committedRequests.add(request); <<<<<< enqueue messages here
        wakeup();
    }
 
    @SuppressFBWarnings("NN_NAKED_NOTIFY")
    private synchronized void wakeup() { <<<<< wakeup call is synchronized
        notifyAll();
    }
 
 
Now lets consider the following scenario
  1. Sync thread reads commitIsWaiting and there are no commits pending.
  2. This thread gets scheduled out
  3. We got a commit request – CommitProcessor thread adds the request to committedRequests and calls wakeup
  4. CommitProcessor goes ahead and does a notifyAll().
  5. Since the sync thread has not reached the wait() yet, there is no one to wake up.
  6. Sync thread gets scheduled back in.
  7. It goes ahead and does a wait() but since there are no other connections or new commit requests no one does a wakeup(). So this thread waits here until the socket is timed out.
     7a. Or if another commit is made where we endup calling notifyAll() which wakes up the waiting thread.
 
I modified the CommitProcessor::run() like this
            do {
                /*
                 * Since requests are placed in the queue before being sent to
                 * the leader, if commitIsWaiting = true, the commit belongs to
                 * the first update operation in the queuedRequests or to a
                 * request from a client on another server (i.e., the order of
                 * the following two lines is important!).
                 */
                    synchronized (this) { <<<<<
                            commitIsWaiting = !committedRequests.isEmpty(); <<<<< moved the queue checks inside the sync block to ensure we don’t have the race condition
                            requestsToProcess = queuedRequests.size();
                            // Avoid sync if we have something to do
                            if (requestsToProcess == 0 && !commitIsWaiting) {
                                    // Waiting for requests to process
                                    while (!stopped && requestsToProcess == 0 && !commitIsWaiting) {
                                            wait();
                                            commitIsWaiting = !committedRequests.isEmpty();
                                            requestsToProcess = queuedRequests.size();
                                    }
                            }
                    }
 
This seems to have fixed the issue. 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)