Posted to dev@zookeeper.apache.org by "Mahadev konar (JIRA)" <ji...@apache.org> on 2011/07/27 17:15:44 UTC

[jira] [Updated] (ZOOKEEPER-1057) zookeeper c-client, connection to offline server fails to successfully fallback to second zk host

     [ https://issues.apache.org/jira/browse/ZOOKEEPER-1057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mahadev konar updated ZOOKEEPER-1057:
-------------------------------------

    Fix Version/s:     (was: 3.3.4)
                       (was: 3.4.0)
                   3.5.0

Not a blocker.

> zookeeper c-client, connection to offline server fails to successfully fallback to second zk host
> -------------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1057
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1057
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: c client
>    Affects Versions: 3.3.1, 3.3.2, 3.3.3
>         Environment: snowdutyrise-lm ~/-> uname -a
> Darwin snowdutyrise-lm 9.8.0 Darwin Kernel Version 9.8.0: Wed Jul 15 16:55:01 PDT 2009; root:xnu-1228.15.4~1/RELEASE_I386 i386
> also observed on:
> 2.6.35-28-server 49-Ubuntu SMP Tue Mar 1 14:55:37 UTC 2011
>            Reporter: Woody Anderson
>             Fix For: 3.5.0
>
>
> Hello, I'm a contributor for the node.js zookeeper module: https://github.com/yfinkelstein/node-zookeeper
> I'm using ZooKeeper 3.3.3 for the purposes of this issue, but I have confirmed that it also fails on 3.3.1 and 3.3.2.
> I'm having an issue when trying to connect while one of my ZooKeeper servers is offline.
> If the first server attempted is online, all is good.
> If the offline server is attempted first, then the client is never able to connect to _any_ server.
> Inside zookeeper.c, a connection loss (-4) is received, the socket is closed, and the buffers are cleaned up. The client then attempts the next server in the list and creates a new socket (which gets the same fd as the previously closed socket), but connecting fails, and it continues to fail seemingly forever.
> The nature of this "fail" is not that it gets -4 connection loss errors, but that zookeeper_interest doesn't see any activity on the socket before the user-provided timeout expires. I don't want to have to wait 5 minutes, even if I could make myself.
> This is the message that follows the connection loss:
> 2011-04-27 23:18:28,355:13485:ZOO_ERROR@handle_socket_error_msg@1530: Socket [127.0.0.1:5020] zk retcode=-7, errno=60(Operation timed out): connection timed out (exceeded timeout by 3ms)
> 2011-04-27 23:18:28,355:13485:ZOO_ERROR@yield@213: yield:zookeeper_interest returned error: -7 - operation timeout
> While investigating, I decided to comment out close(zh->fd) in handle_error (zookeeper.c#1153).
> Now everything works (obviously I'm leaking an fd). Connecting to the second host works immediately.
> This is the behavior I'm looking for, though I clearly don't want to leak the fd, so I'm wondering why the fd reuse is causing this issue.
> close() is not returning an error (I checked, even though the current code assumes success).
> I'm on OS X 10.6.7.
> I tried adding a setsockopt() call with SO_LINGER (though I didn't want that to be the solution); it didn't work.
> Full debug traces are included in this issue: https://github.com/yfinkelstein/node-zookeeper/issues/6
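
For reference, the failover behavior the reporter expects (give up on an unreachable host, close its socket, and move on to the next host in the list) can be sketched with plain POSIX sockets. This is only an illustration under stated assumptions, not the code in zookeeper.c: the names try_connect and connect_with_fallback are hypothetical, only IPv4 is handled, and it does not reproduce or diagnose the bug described above.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: close fd without clobbering the errno that
 * describes why the connect attempt failed. Always returns -1. */
static int close_and_fail(int fd)
{
    int saved = errno;
    close(fd);
    errno = saved;
    return -1;
}

/* Hypothetical helper (not part of zookeeper.c): start a non-blocking
 * connect to one address and wait up to timeout_ms for it to complete.
 * Returns a connected fd, or -1 with errno set. */
static int try_connect(const struct sockaddr_in *addr, int timeout_ms)
{
    struct pollfd pfd;
    socklen_t len;
    int flags, rc, err = 0;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;

    flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0)
        return close_and_fail(fd);

    if (connect(fd, (const struct sockaddr *)addr, sizeof(*addr)) == 0)
        return fd;                       /* connected immediately */
    if (errno != EINPROGRESS)
        return close_and_fail(fd);       /* e.g. ECONNREFUSED */

    /* Wait until the socket becomes writable or the timeout expires. */
    pfd.fd = fd;
    pfd.events = POLLOUT;
    pfd.revents = 0;
    rc = poll(&pfd, 1, timeout_ms);
    if (rc == 0)
        errno = ETIMEDOUT;
    if (rc <= 0)
        return close_and_fail(fd);

    /* Writable does not mean connected: SO_ERROR holds the real result. */
    len = sizeof(err);
    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) < 0)
        return close_and_fail(fd);
    if (err != 0) {
        errno = err;
        return close_and_fail(fd);
    }
    return fd;
}

/* Try each address in order and return the first connected fd, or -1 if
 * none could be reached. If used is non-NULL, it receives the index of
 * the address that succeeded. */
int connect_with_fallback(const struct sockaddr_in *addrs, int count,
                          int timeout_ms, int *used)
{
    int i;

    for (i = 0; i < count; i++) {
        int fd = try_connect(&addrs[i], timeout_ms);
        if (fd >= 0) {
            if (used != NULL)
                *used = i;
            return fd;
        }
    }
    return -1;
}
```

With a host list such as 127.0.0.1:5020 (offline) followed by a second, online server, connect_with_fallback would be expected to return a descriptor for the second entry once the first attempt is refused or times out. Note that POSIX hands out the lowest available descriptor number, so the second socket() call getting the same fd as the socket that was just closed is normal on its own.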

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira