Posted to dev@curator.apache.org by "Wang XiaoTian (JIRA)" <ji...@apache.org> on 2016/01/26 07:49:39 UTC

[jira] [Commented] (CURATOR-293) Curator can NOT reconnect after connection lost and session expired when the connection come up while the DNS server is not ready yet.(zookeeper connection string using domain names)

    [ https://issues.apache.org/jira/browse/CURATOR-293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15116788#comment-15116788 ] 

Wang XiaoTian commented on CURATOR-293:
---------------------------------------

Exactly. We have 5 worker nodes online, and all 5 lost their ZK connections because of a network problem. After the network recovered (by which time all the sessions had expired), we found that 2 of the 5 nodes could not reconnect to the ZK cluster, and the "blockUntilConnectedOrTimedOut" method kept logging that the connection status was false.
In this situation, Curator cannot be notified by subsequent events, because the framework has already closed the previous ZooKeeper client, which held the expired session. After that, Curator fails to instantiate a new ZooKeeper client because of the name service fault, and both "sendThread" and "eventThread" are down as well. As a result, no event arrives to tell the framework what to do, and the framework cannot recover by itself.
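
To illustrate the gap described above, here is a minimal sketch of one possible mitigation: retrying the client construction when name resolution fails, instead of making a single attempt that nothing ever repeats. This is not Curator's code or a proposed patch. The class RetryingFactory, the method createWithRetry, and its parameters are hypothetical names, and the Callable stands in for whatever creates the ZooKeeper client.

```java
import java.net.UnknownHostException;
import java.util.concurrent.Callable;

// Hypothetical helper, not part of Curator or ZooKeeper.
public class RetryingFactory {

    /**
     * Calls the factory until it succeeds or maxAttempts is reached.
     * Only UnknownHostException (a name service fault) is retried;
     * any other exception propagates immediately.
     */
    public static <T> T createWithRetry(Callable<T> factory,
                                        int maxAttempts,
                                        long delayMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        UnknownHostException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                // Assumption: the factory throws UnknownHostException
                // while the DNS server is not ready yet.
                return factory.call();
            } catch (UnknownHostException e) {
                last = e;
                if (attempt < maxAttempts) {
                    // Wait before the next attempt so DNS has time to recover.
                    Thread.sleep(delayMillis);
                }
            }
        }
        // Every attempt failed with a name service fault.
        throw last;
    }
}
```

A caller would wrap the client construction in the Callable and pick an attempt count and delay that cover the expected DNS outage. The point of the sketch is only that something has to drive a further attempt once the old client, and with it the event thread, is gone.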

> Curator can NOT reconnect after connection lost and session expired when the connection come up while the DNS server is not ready yet.(zookeeper connection string using domain names)
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: CURATOR-293
>                 URL: https://issues.apache.org/jira/browse/CURATOR-293
>             Project: Apache Curator
>          Issue Type: Bug
>          Components: Client
>    Affects Versions: 2.9.1
>            Reporter: huanhuan li
>            Priority: Critical
>         Attachments: CuratorConnectionLostEventTest.java
>
>
> 1. Add the following lines to /etc/hosts:
> x.x.x.x zk1.test.com
> x.x.x.x zk2.test.com
> x.x.x.x zk3.test.com
> 2. Run the test program
> 3. Shut down the network connection to x.x.x.x
> 4. Wait until the session expires (for example 10 min)
> 5. Remove the 3 added lines from /etc/hosts
> 6. Restore the network connection to x.x.x.x
> 7. Observe that Curator cannot reconnect
> 8. Add the 3 lines back to /etc/hosts
> 9. Observe that Curator still cannot reconnect
> The log may look like the following:
> [main-SendThread(172.24.2.35:2181)][INFO ]2016-01-26 11:07:45.005 [ClientCnxn.logStartConnect] - Opening socket connection to server 172.24.2.35/172.24.2.35:2181. Will not attempt to authenticate using SASL (unknown error)
> [main-SendThread(172.24.2.35:2181)][INFO ]2016-01-26 11:07:45.050 [ClientCnxn.primeConnection] - Socket connection established to 172.24.2.35/172.24.2.35:2181, initiating session
> [main-EventThread][WARN ]2016-01-26 11:07:45.093 [ConnectionState.handleExpiredSession] - Session expired event received
> [main-EventThread][DEBUG]2016-01-26 11:07:45.093 [ConnectionState.reset] - reset
> [main-SendThread(172.24.2.35:2181)][INFO ]2016-01-26 11:07:45.093 [ClientCnxn.run] - Unable to reconnect to ZooKeeper service, session 0x1525d9593a537af has expired, closing socket connection
> [main-EventThread][INFO ]2016-01-26 11:07:45.095 [ZooKeeper.<init>] - Initiating client connection, connectString=zk1.test.com:2181,zk2.test.com:2181,zk3.test.com:2181 sessionTimeout=60000 watcher=org.apache.curator.ConnectionState@7e7d611f
> [main-EventThread][INFO ]2016-01-26 11:07:45.488 [ClientCnxn.run] - EventThread shut down
> [main-SendThread(111.206.227.147:2181)][INFO ]2016-01-26 11:07:45.615 [ClientCnxn.logStartConnect] - Opening socket connection to server 111.206.227.147/111.206.227.147:2181. Will not attempt to authenticate using SASL (unknown error)
> [Curator-ConnectionStateManager-0][DEBUG]2016-01-26 11:07:58.523 [CuratorZookeeperClient.blockUntilConnectedOrTimedOut] - blockUntilConnectedOrTimedOut() end. isConnected: false


