Posted to dev@curator.apache.org by "Sky Tao (JIRA)" <ji...@apache.org> on 2015/12/02 12:03:10 UTC

[jira] [Commented] (CURATOR-209) Background retry falls into infinite loop of reconnection after connection loss

    [ https://issues.apache.org/jira/browse/CURATOR-209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15035637#comment-15035637 ] 

Sky Tao commented on CURATOR-209:
---------------------------------

Something similar happens when the network of my app cluster is broken.  
I think I can reproduce this bug. 

Context:
I created my ephemeral node with protection mode enabled.
The retry policy is ExponentialBackoffRetry(1000, 3).
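For context, this is roughly how the node is created (a minimal sketch: the connection string, path, and payload are placeholders, and only the ExponentialBackoffRetry(1000, 3) policy, the protection mode, and the background ephemeral create are the relevant parts):

{code}
// Minimal sketch of the setup -- connection string, path and payload are
// placeholders; the retry policy, protection mode and background create
// are the parts that matter for this bug.
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;
import org.apache.zookeeper.CreateMode;

public class ProtectedEphemeralSetup
{
    public static void main(String[] args) throws Exception
    {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk1:2181,zk2:2181,zk3:2181",            // placeholder ensemble
                new ExponentialBackoffRetry(1000, 3));   // retry policy used here
        client.start();

        // Protected ephemeral node created in the background; when the
        // background retries are exhausted after a connection loss, the
        // ErrorCallback registered by the create builder fires and
        // findAndDeleteProtectedNodeInBackground takes over.
        client.create()
              .withProtection()
              .withMode(CreateMode.EPHEMERAL)
              .inBackground()
              .forPath("/my/app/node", "payload".getBytes());

        Thread.sleep(Long.MAX_VALUE);   // keep the process alive for the example
    }
}
{code}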

When the connection between my app and the ZK cluster is broken, the retry policy kicks in.
However, since the connection does not recover, after 3 retries the OperationAndData.ErrorCallback<PathAndBytes> that was registered when the node was created is triggered.

Since it is a protected node, the findAndDeleteProtectedNodeInBackground method is then fired.

The terrible part is here: findAndDeleteProtectedNodeInBackground is a self-recursive method!
Since the connection is still not back, it always throws an exception, and the catch block simply calls the method again.
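To make the shape of the problem clearer, here is a deliberately simplified, self-contained sketch of that pattern. It is NOT the actual CreateBuilderImpl code; every name except findAndDeleteProtectedNodeInBackground is made up, and a counter is added only so the example terminates:

{code}
// Deliberately simplified illustration of the pattern -- NOT the actual
// CreateBuilderImpl source. All names except
// findAndDeleteProtectedNodeInBackground are made up. The point: the
// error path re-enters the same method unconditionally, so while the
// connection stays down the cleanup never terminates.
public class RecursiveCleanupSketch
{
    private static int attempts = 0;

    // Stand-in for "ZooKeeper is still unreachable".
    private static boolean connectionIsDown()
    {
        return true;
    }

    static void findAndDeleteProtectedNodeInBackground(String protectedPath)
    {
        attempts++;
        try
        {
            if ( connectionIsDown() )
            {
                // stand-in for KeeperException$ConnectionLossException
                throw new IllegalStateException("ConnectionLoss");
            }
            // ... find and delete the leftover protected node here ...
        }
        catch ( Exception e )
        {
            if ( attempts >= 10 )
            {
                // cap added only so this sketch terminates; the real loop has no such cap
                System.out.println("still failing after " + attempts + " attempts");
                return;
            }
            // No retry policy, no back-off, no termination condition:
            // the catch block just calls the method again.
            findAndDeleteProtectedNodeInBackground(protectedPath);
        }
    }

    public static void main(String[] args)
    {
        findAndDeleteProtectedNodeInBackground("/my/app/node");
    }
}
{code}

With a real connection loss there is nothing bounding the loop, which would match the ever-increasing error rate described in the report.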

I think that's the root cause. Can anyone fix this?

> Background retry falls into infinite loop of reconnection after connection loss
> -------------------------------------------------------------------------------
>
>                 Key: CURATOR-209
>                 URL: https://issues.apache.org/jira/browse/CURATOR-209
>             Project: Apache Curator
>          Issue Type: Bug
>          Components: Framework
>    Affects Versions: 2.6.0
>         Environment: sun java jdk 1.7.0_55, curator 2.6.0, zookeeper 3.3.6 on AWS EC2 in a 3 box ensemble
>            Reporter: Ryan Anderson
>            Priority: Critical
>              Labels: connectionloss, loop, reconnect
>
> We've been unable to replicate this in our test environments, but approximately once a week in production (~50 machine cluster using curator/zk for service discovery) we will get a machine falling into a loop and spewing tens of thousands of errors that look like:
> {code}
> Background operation retry gave up
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
> at org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:695) [curator-framework-2.6.0.jar:na]
> at org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:496) [curator-framework-2.6.0.jar:na]
> at org.apache.curator.framework.imps.CreateBuilderImpl.sendBackgroundResponse(CreateBuilderImpl.java:538) [curator-framework-2.6.0.jar:na]
> at org.apache.curator.framework.imps.CreateBuilderImpl.access$700(CreateBuilderImpl.java:44) [curator-framework-2.6.0.jar:na]
> at org.apache.curator.framework.imps.CreateBuilderImpl$6.processResult(CreateBuilderImpl.java:497) [curator-framework-2.6.0.jar:na]
> at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:605) [zookeeper-3.4.6.jar:3.4.6-1569965]
> at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) [zookeeper-3.4.6.jar:3.4.6-1569965]
> {code}
> The rate at which we get these errors seems to increase linearly until we stop the process (it starts at 10-20/sec; by the time we kill the box it is typically generating 1,000+/sec).
> When the error first occurs, there's a slightly different stack trace:
> {code}
> Background operation retry gave up
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
> at org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:695) [curator-framework-2.6.0.jar:na]
> at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:813) [curator-framework-2.6.0.jar:na]
> at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:779) [curator-framework-2.6.0.jar:na]
> at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$400(CuratorFrameworkImpl.java:58) [curator-framework-2.6.0.jar:na]
> at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:265) [curator-framework-2.6.0.jar:na]
> at java.util.concurrent.FutureTask.run(FutureTask.java:262) [na:1.7.0_55]
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_55]
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_55]
> at java.lang.Thread.run(Thread.java:745) [na:1.7.0_55]
> {code}
> followed very closely by:
> {code}
> Background retry gave up
> org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
> at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:796) [curator-framework-2.6.0.jar:na]
> at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:779) [curator-framework-2.6.0.jar:na]
> at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$400(CuratorFrameworkImpl.java:58) [curator-framework-2.6.0.jar:na]
> at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:265) [curator-framework-2.6.0.jar:na]
> at java.util.concurrent.FutureTask.run(FutureTask.java:262) [na:1.7.0_55]
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_55]
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_55]
> at java.lang.Thread.run(Thread.java:745) [na:1.7.0_55]
> {code}
> After that, it begins spewing the stack trace I first posted above. We're assuming that some sort of networking hiccup is occurring in EC2 that's causing the ConnectionLoss, which seems entirely momentary (none of our other boxes see it, and when we check the box it can connect to all the zk servers without any issues).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)