
Posted to dev@curator.apache.org by "Jordan Zimmerman (JIRA)" <ji...@apache.org> on 2016/04/30 21:50:12 UTC

[jira] [Commented] (CURATOR-320) Discovery reregister triggered even if retry policy succeeds. Connection looping condition.

    [ https://issues.apache.org/jira/browse/CURATOR-320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15265455#comment-15265455 ] 

Jordan Zimmerman commented on CURATOR-320:
------------------------------------------

A pull request with a fix would be appreciated.

> Discovery reregister triggered even if retry policy succeeds. Connection looping condition.
> -------------------------------------------------------------------------------------------
>
>                 Key: CURATOR-320
>                 URL: https://issues.apache.org/jira/browse/CURATOR-320
>             Project: Apache Curator
>          Issue Type: Bug
>          Components: Client, Framework
>    Affects Versions: TBD, 2.10.0
>         Environment: 3 server Quorum running on individual AWS boxes.
> Session timeout set to 1-2 min on most clients.
>            Reporter: Running Fly
>             Fix For: TBD
>
>
>     ServiceDiscoveryImpl.reRegisterServices() can be triggered by the ConnectionState events RECONNECTED and CONNECTED, which causes reRegisterServices() to run on the ConnectionStateManager thread. If the connection drops while reRegisterServices() is running, the retry policy recovers it. However, the resulting SUSPENDED and RECONNECTED events are queued and not fired until reRegisterServices() completes, because the ConnectionStateManager thread that fires them is busy. When it does complete, the queued RECONNECTED event fires and reRegisterServices() runs again.
>     When the ZooKeeper server connection is interrupted, all of the clients call reRegisterServices() simultaneously. This overloads the server with requests, causing connections to time out and reset, which queues up more RECONNECTED events. This state can persist indefinitely.
>     Because reRegisterServices() will most likely receive a NodeExistsException, it deletes and recreates the node. This effectively causes the services to thrash up and down, wreaking havoc on our service dependency chain.
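
The queuing behaviour described in the report can be reproduced in isolation with a small simulation. The sketch below does not use Curator. The class ReRegisterLoopDemo, its State enum, and its methods are hypothetical stand-ins: a single-threaded executor plays the ConnectionStateManager thread, and a connection drop during re-registration is simulated by posting SUSPENDED and RECONNECTED while the listener is still running. It only illustrates why each drop that the retry policy recovers still costs one extra re-registration; it is not a fix.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Illustrative simulation only; none of these names are Curator APIs.
class ReRegisterLoopDemo {
    enum State { CONNECTED, SUSPENDED, RECONNECTED }

    // Stand-in for the single thread that fires connection state events.
    private final ExecutorService dispatcher = Executors.newSingleThreadExecutor();
    // Written only by the dispatcher thread, read after it has terminated.
    private final List<String> log = new ArrayList<>();
    // Released once a re-registration finishes without a simulated drop.
    private final CountDownLatch settled = new CountDownLatch(1);
    private int dropsRemaining;
    private int runs;

    ReRegisterLoopDemo(int drops) {
        this.dropsRemaining = drops;
    }

    // Queue an event; it cannot fire while the dispatcher thread is busy.
    private void post(State state) {
        dispatcher.execute(() -> onStateChanged(state));
    }

    private void onStateChanged(State state) {
        log.add("event " + state);
        if (state == State.CONNECTED || state == State.RECONNECTED) {
            reRegisterServices();
        }
    }

    private void reRegisterServices() {
        int run = ++runs;
        log.add("reRegister #" + run + " start");
        boolean dropped = dropsRemaining > 0;
        if (dropped) {
            dropsRemaining--;
            // Simulated drop mid-run: the retry policy recovers the work,
            // but both events wait in the queue behind this listener call.
            post(State.SUSPENDED);
            post(State.RECONNECTED);
        }
        log.add("reRegister #" + run + " end");
        if (!dropped) {
            settled.countDown();
        }
    }

    // Runs the simulation with the given number of drops and returns the event log.
    static List<String> simulate(int drops) {
        ReRegisterLoopDemo demo = new ReRegisterLoopDemo(drops);
        demo.post(State.CONNECTED);
        try {
            demo.settled.await();
            demo.dispatcher.shutdown();
            demo.dispatcher.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return demo.log;
    }

    public static void main(String[] args) {
        simulate(1).forEach(System.out::println);
    }
}
```

With one simulated drop, the log shows the SUSPENDED and RECONNECTED events firing only after the first re-registration ends, followed by a second full re-registration. If every run suffers a drop, as in the overloaded-server case described above, the cycle never settles.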


