Posted to user@curator.apache.org by Jeremy Stribling <st...@vmware.com> on 2014/02/26 03:23:40 UTC

adding a "network timeout" to curator?

Hi all,

I started a thread on the ZK list a while back about timeouts in ZK. 
You can find it in the archives here:

http://mail-archives.apache.org/mod_mbox/zookeeper-user/201309.mbox/%3C522F7A9D.20800@nicira.com%3E

The basic idea is that when ZK is running on a node with slow disks 
(e.g., in a VM), you might want to set your session timeout to a long 
value (e.g., 30 seconds or 60 seconds), but still detect network 
timeouts quickly.  On that thread, Michi proposed using 'ruok' commands 
from the client to test network connectivity, along with the normal 
client pings happening in the background to detect server slowness.

I was wondering if this would make sense to provide as part of the 
Curator Framework or Client.  There could be some background thread 
sending 'ruok' commands to whatever server the client is connected to, 
and going into SUSPENDED (or LOST?) mode when it hits a timeout or gets 
a failure back.  We might be able to implement something like that here 
and contribute it back, if it sounds interesting to other people and we 
can agree on a design.  Any thoughts?

Jeremy
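
For reference, a minimal sketch of the kind of probe described above, assuming the server accepts the 'ruok' four-letter command on its client port and answers 'imok'. The class name RuokProbe and its method are hypothetical, not part of Curator or the ZK client. The timeout is applied to the connect and to each read separately, not to the exchange as a whole.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: one 'ruok' round trip against a single server.
public class RuokProbe {

    // Returns true only if the server answers "imok"; any connect failure,
    // timeout, or unexpected reply counts as "not reachable".
    public static boolean isReachable(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            socket.setSoTimeout(timeoutMs); // bounds each blocking read

            OutputStream out = socket.getOutputStream();
            out.write("ruok".getBytes(StandardCharsets.US_ASCII));
            out.flush();

            // Read up to four bytes; the server may close the socket after replying.
            InputStream in = socket.getInputStream();
            byte[] reply = new byte[4];
            int read = 0;
            while (read < reply.length) {
                int n = in.read(reply, read, reply.length - read);
                if (n < 0) {
                    break;
                }
                read += n;
            }
            return read == reply.length
                    && "imok".equals(new String(reply, StandardCharsets.US_ASCII));
        } catch (IOException e) {
            // Covers connection refused, connect timeout, and read timeout.
            return false;
        }
    }
}
```

A background thread could call this every second or so against whichever server the client is currently connected to. Newer ZooKeeper releases only answer four-letter commands that are whitelisted in the server configuration, so 'ruok' may need to be enabled there.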

Re: adding a "network timeout" to curator?

Posted by Jeremy Stribling <st...@vmware.com>.
Actually, in our case it's mostly about trying to detect a true network 
or process failure quickly, without having to wait the entire, long 
session timeout that's needed because of slow disks.

Now that I think about it a bit more though, I don't think we can get 
what we want entirely on the client side.  Really, what we want is fast 
leader failover among the clients when there is a real network/process 
failure, without risking false session expirations due to slow disks.  
However, in my proposal the server still won't expire the client's 
session until the full session timeout elapses, since it doesn't know 
about the client's 'ruok' protocol.  What I'm proposing would only allow 
a client to reconnect to a different server quickly; it wouldn't affect 
other clients' view of the session.

Hmm, maybe back to the drawing board.  Thanks for listening, anyway.

Jeremy
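
To make the limitation concrete, here is a rough sketch of the client-side piece, assuming some connectivity check (for example a 'ruok' round trip) is passed in as a BooleanSupplier. NetworkTimeoutMonitor, LocalState and tick() are made-up names for illustration, not Curator API. All it can do is flip a local state after a few consecutive failed checks; nothing in it tells the server to expire the session.

```java
import java.util.function.BooleanSupplier;

// Hypothetical client-side monitor: tracks consecutive failed connectivity
// checks and reports a purely local state. The server-side session is untouched.
public class NetworkTimeoutMonitor {

    public enum LocalState { CONNECTED, SUSPENDED }

    private final BooleanSupplier check;   // assumed: returns true if the server answered in time
    private final int maxFailures;         // consecutive failures tolerated before suspending
    private int consecutiveFailures = 0;
    private LocalState state = LocalState.CONNECTED;

    public NetworkTimeoutMonitor(BooleanSupplier check, int maxFailures) {
        if (maxFailures < 1) {
            throw new IllegalArgumentException("maxFailures must be >= 1");
        }
        this.check = check;
        this.maxFailures = maxFailures;
    }

    // Runs one check and returns the resulting local state.
    // Intended to be called periodically from a single background thread.
    public synchronized LocalState tick() {
        if (check.getAsBoolean()) {
            consecutiveFailures = 0;
            state = LocalState.CONNECTED;
        } else {
            consecutiveFailures++;
            if (consecutiveFailures >= maxFailures) {
                state = LocalState.SUSPENDED;
            }
        }
        return state;
    }
}
```

If tick() is scheduled with a fixed delay of intervalMs and the check gives up after timeoutMs, the local state would flip to SUSPENDED within roughly maxFailures * (intervalMs + timeoutMs) of the network going away. Other clients would still see the session, and any ephemeral nodes it owns, until the full session timeout elapses on the server.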

On 02/26/2014 10:16 PM, Jordan Zimmerman wrote:
> I see. So, this is a slow network. You’d like a heuristic that puts 
> Curator into SUSPENDED mode when the network performance drops. Sounds 
> interesting to me.
>
> -JZ


Re: adding a "network timeout" to curator?

Posted by Jordan Zimmerman <jo...@jordanzimmerman.com>.
I see. So, this is a slow network. You’d like a heuristic that puts Curator into SUSPENDED mode when the network performance drops. Sounds interesting to me. 

-JZ



Re: adding a "network timeout" to curator?

Posted by Jeremy Stribling <st...@vmware.com>.
Please correct me if I'm wrong, but I thought Curator went into 
SUSPENDED mode when it gets a Disconnected state event from its ZK 
client.  That is not necessarily the same as a network issue, because 
that ZK keepalive could be stuck in the ZK server processing queue, 
blocked on a slow disk.  What I'm proposing would be a true, 
network-only timeout that could be used to declare a client disconnected 
quickly if there's a network issue, without having to reduce the ZK 
session timeout so low that a slow disk would cause false expirations.  
Does that make sense?

Jeremy

On 02/26/2014 09:25 PM, Jordan Zimmerman wrote:
> Curator should already go into SUSPENDED when there is a connection 
> issue, right? How would this be different?
>
> -JZ


Re: adding a "network timeout" to curator?

Posted by Jordan Zimmerman <jo...@jordanzimmerman.com>.
Curator should already go into SUSPENDED when there is a connection issue, right? How would this be different?

-JZ
