You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@curator.apache.org by "Jordan Zimmerman (Jira)" <ji...@apache.org> on 2020/02/22 13:56:00 UTC

[jira] [Commented] (CURATOR-561) Race condition preventing reconnection

    [ https://issues.apache.org/jira/browse/CURATOR-561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17042585#comment-17042585 ] 

Jordan Zimmerman commented on CURATOR-561:
------------------------------------------

It's very hard to write a test case for this (I'm not able to reproduce what you have above either). My only thought is to set a flag when {{injectSessionExpiration()}} is called so that if it's called again Curator will just reset the connection. This would work around the problem.  Thoughts?

> Race condition preventing reconnection
> --------------------------------------
>
>                 Key: CURATOR-561
>                 URL: https://issues.apache.org/jira/browse/CURATOR-561
>             Project: Apache Curator
>          Issue Type: Bug
>    Affects Versions: 4.2.0
>            Reporter: Grant Digby
>            Assignee: Jordan Zimmerman
>            Priority: Major
>
> Heavy GC caused several rounds of suspend, lost, reconnected - as you'd expect. But eventually it failed to reconnect, even though ZK was up the entire time. My thread waited on acquireLock until the retry period was up. I believe I've tracked this down to a race condition between the Curator thread which calls ZookKeeperTestable#injectSessionExpiration to post the expiry and death events and the Zookeeper SendThread which posts the same.
>  
> The EventThread is meant to receive the expiry event first so it can set up its replacement, then it receives the death event making it shut down.
> The race condition allows these two events to be received in the wrong order so the EventThread shuts down before a replacement can be set up.
>  
> The "Curator-ConnectionStateManager" thread calls injectSessionExpiration which calls EventThread#queueEvent. It goes as far as my comment in the next code snippit, sets the status and let's say it pauses here, for whatever reason.
>  
> {code:java}
> private void queueEvent(WatchedEvent event,
>         Set<Watcher> materializedWatchers) {
>     if (event.getType() == EventType.None
>             && sessionState == event.getState()) {
>         return;
>     }
>     sessionState = event.getState();   // The curator thread sets the status here
>   // It then goes onto send the expiry event, but for now it's here..the event has not yet been sent{code}
>  
> The SendThread then goes into the same method with its Expiry message, but the state has already been set to expired by the curator thread so it returns without sending the expiry message. It then sends the event of death:
> {code:java}
> // from ClientCnxn#onConnected
> eventThread.queueEvent(new WatchedEvent(
>         Watcher.Event.EventType.None,
>         Watcher.Event.KeeperState.Expired, null));
> eventThread.queueEventOfDeath(); {code}
>  
> The EventThread receives this event of death and shuts down. The curator thread then continues and sends the original expiry message but it's now too late. With no EventThread created it remains in suspended state with any operations just retrying until they give up.
>  
> I find it difficult to explan race conditions, but hopefully that made sense. I can recreate the issue in Intellij using Suspend-All break points to cause the session expiry and Suspend-Thread break points to order the threads in the correct way:
>  
> Run the following and put a suspend-all break point at the System.out
> {code:java}
> package com.gebatech.curator;
> import org.apache.curator.framework.CuratorFramework;
> import org.apache.curator.framework.CuratorFrameworkFactory;
> import org.apache.curator.framework.recipes.locks.InterProcessReadWriteLock;
> import org.apache.curator.retry.BoundedExponentialBackoffRetry;
> /**
>  *
>  */
> public class GoCurator {
>     public static void main(String[] args) throws Exception {
>         CuratorFramework cf = CuratorFrameworkFactory.builder()
>                 .connectString("localhost:2181")
>                 .retryPolicy(new BoundedExponentialBackoffRetry(200, 10000, 20))
>                 .sessionTimeoutMs(30 * 1000)
>                 .build();
>         cf.start();
>         InterProcessReadWriteLock lcok = new InterProcessReadWriteLock(cf, "/grant/myLock");
>         System.out.println("Set a Suspend-ALL break point here and leave it until the session expires, 30 seconds ish");
>         lcok.readLock().acquire();
>         System.out.println();
>     }
> } {code}
> Whlst suspended here, put a Suspend-Thread break poing in EventThread#queueEvent, on the if statement from the first code snippet.
>  
> When you see the following log line, you can let the program resume;
> {code:java}
> 2020-02-20 20:15:05,912 [myid:] - INFO [SessionTracker:ZooKeeperServer@398] - Expiring session 0x1006300f2390000, timeout of 30000ms exceeded {code}
> The SendThread will hit your break point first with a Disconnect event, you can let this run.
> Next you'll get the SendThread with the Expiry event, wait here until you get the popup telling you that the Curator thread has hit the same break point. Switch to this and step through until it has set the expiry status, then leave it held there and switch back to the SendThread. Let the sendThread run, then remove break points and let everything else run.
>  
> The program will then be stuck and you'll see this stacktrace for the acquire lock method
> {code:java}
> "main@1" prio=5 tid=0x1 nid=NA waiting"main@1" prio=5 tid=0x1 nid=NA waiting  java.lang.Thread.State: WAITING   at java.lang.Object.wait(Object.java:-1)   at java.lang.Object.wait(Object.java:502)   at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1499)   at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1487)   at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:1547)   at org.apache.curator.framework.imps.CreateBuilderImpl$17.call(CreateBuilderImpl.java:1180)   at org.apache.curator.framework.imps.CreateBuilderImpl$17.call(CreateBuilderImpl.java:1156)   at org.apache.curator.connection.StandardConnectionHandlingPolicy.callWithRetry(StandardConnectionHandlingPolicy.java:64)   at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:100)   at org.apache.curator.framework.imps.CreateBuilderImpl.pathInForeground(CreateBuilderImpl.java:1153)   at org.apache.curator.framework.imps.CreateBuilderImpl.protectedPathInForeground(CreateBuilderImpl.java:607)   at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:597)   at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:575)   at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:51)   at org.apache.curator.framework.recipes.locks.StandardLockInternalsDriver.createsTheLock(StandardLockInternalsDriver.java:54)   at org.apache.curator.framework.recipes.locks.LockInternals.attemptLock(LockInternals.java:225)   at org.apache.curator.framework.recipes.locks.InterProcessMutex.internalLock(InterProcessMutex.java:237)   at org.apache.curator.framework.recipes.locks.InterProcessMutex.acquire(InterProcessMutex.java:89)   at com.gebatech.curator.GoCurator.main(GoCurator.java:26) {code}
>  
> I know this issue falls between Zookeeper and Curator but thought I'd report it in case you can think of a solution. I think ideally they'd send a single expiry event with a death flag set or, performance impact allowing, synchronize that method.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)