Posted to user@cassandra.apache.org by Erik Onnen <eo...@gmail.com> on 2011/04/13 19:32:23 UTC

CL.ONE reads and SimpleSnitch unnecessary timeouts

Sorry for the complex setup; it took a while to identify the behavior
and I'm still not sure I'm reading the code correctly.

Scenario:

A six-node ring w/ SimpleSnitch and RF=3. For the sake of discussion,
assume the token space looks like:

node-0 1-10
node-1 11-20
node-2 21-30
node-3 31-40
node-4 41-50
node-5 51-60

In this scenario we want key 35, for which nodes 3, 4 and 5 are the
natural endpoints. The client is connected to node-0, node-1 or
node-2. node-3 goes into a full GC lasting 12 seconds.
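
To make the placement concrete, here's a toy sketch of how a key maps
to its RF natural endpoints under SimpleSnitch. It's a simplification
for discussion (contiguous toy ranges, made-up helper names), not
Cassandra's actual placement code:

    # Toy model of the six-node ring above, not the real partitioner.
    RING = [
        ("node-0", range(1, 11)),
        ("node-1", range(11, 21)),
        ("node-2", range(21, 31)),
        ("node-3", range(31, 41)),
        ("node-4", range(41, 51)),
        ("node-5", range(51, 61)),
    ]
    RF = 3

    def natural_endpoints(token):
        # The node owning the token's range is the primary replica; the
        # next RF-1 nodes clockwise hold the other copies.
        for i, (_, owned) in enumerate(RING):
            if token in owned:
                return [RING[(i + j) % len(RING)][0] for j in range(RF)]
        raise ValueError("token outside the ring")

    print(natural_endpoints(35))  # ['node-3', 'node-4', 'node-5']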

What I think we're seeing is that as long as we read with CL.ONE *and*
are connected to 0, 1 or 2, we'll never get a response for the
requested key until the failure detector kicks in and convicts 3, at
which point reads spill over to the other endpoints.
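
If I'm reading it right, the CL.ONE read path behaves roughly like the
sketch below. This is only my mental model (the names are illustrative,
not the real StorageProxy internals): the coordinator takes the
snitch-ordered endpoints, sends the data request to the first live one,
and blocks until rpc_timeout if that node never answers.

    RPC_TIMEOUT_SECS = 10.0

    def read_one(key, snitch_ordered_endpoints, send_data_read, is_alive):
        # Drop endpoints the failure detector has already convicted.
        live = [ep for ep in snitch_ordered_endpoints if is_alive(ep)]
        target = live[0]  # CL.ONE: only the first endpoint gets the data request
        reply = send_data_read(target, key, timeout=RPC_TIMEOUT_SECS)
        if reply is None:
            # node-3 is paused in GC but not yet convicted: every request
            # lands here and times out; node-4 and node-5 are never asked.
            raise TimeoutError("no response from %s within rpc_timeout" % target)
        return reply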

We've tested this by switching to CL.QUORUM and since then haven't
seen read timeouts during big GCs.

Assuming the above, is this behavior really correct? We have copies of
the data on two other nodes, but because this snitch config always
picks node-3, we always time out until conviction, which can sometimes
take up to 8 seconds. Shouldn't the read try a different endpoint
after the first timeout rather than repeatedly hitting a node that
isn't responding?

Thanks,
-erik

Re: CL.ONE reads and SimpleSnitch unnecessary timeouts

Posted by Jonathan Ellis <jb...@gmail.com>.
Yes, we've had the dynamic snitch on by default in all the 0.7
releases, so it's pretty well tested by this point.
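
For reference, the relevant cassandra.yaml settings look roughly like
this (exact option names can vary between versions, so check your own
config rather than trusting this verbatim):

    endpoint_snitch: org.apache.cassandra.locator.SimpleSnitch
    dynamic_snitch: true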

On Wed, Apr 13, 2011 at 1:17 PM, Erik Onnen <eo...@gmail.com> wrote:
> So we're not currently using a dynamic snitch, only the SimpleSnitch
> is at play (lots of history as to why, I won't go into it). If this
> would solve our problems I'm fine changing it.
>
> Understood re: client contract. I guess in this case my issue is that
> the server we're connected to never tries more than the one failing
> server until the failure detector has kicked in - it keeps flogging the
> bad server so subsequent requests never produce a different result
> until conviction.
>
> Regarding clients retrying, in this configuration the situation
> doesn't improve and it still times out because our client libraries
> don't try another host. They still have a valid connection to a
> working host, it's just that given our configuration that one node
> keeps proxying to a bad server and never routes around it. It sounds
> like switching to the dynamic snitch would adjust for the first
> timeout on subsequent attempts so maybe that's the most advisable
> thing in this case.



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com

Re: CL.ONE reads and SimpleSnitch unnecessary timeouts

Posted by Erik Onnen <eo...@gmail.com>.
So we're not currently using a dynamic snitch, only the SimpleSnitch
is at play (lots of history as to why, I won't go into it). If this
would solve our problems I'm fine changing it.

Understood re: client contract. I guess in this case my issue is that
the server we're connected to never tries more than the one failing
server until the failure detector has kicked in - it keeps flogging the
bad server so subsequent requests never produce a different result
until conviction.

Regarding clients retrying, in this configuration the situation
doesn't improve and it still times out because our client libraries
don't try another host. They still have a valid connection to a
working host, it's just that given our configuration that one node
keeps proxying to a bad server and never routes around it. It sounds
like switching to the dynamic snitch would adjust for the first
timeout on subsequent attempts so maybe that's the most advisable
thing in this case.
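
My rough understanding of what the dynamic snitch layer adds is
sketched below (an illustration of the idea, not the real
implementation): it keeps a latency score per replica and orders reads
by that score, so a replica that just blew through rpc_timeout sinks
to the back of the list on the next attempt.

    from collections import defaultdict

    class LatencyScores:
        # Idealized dynamic-snitch-style scoring, for illustration only.
        def __init__(self):
            self.samples = defaultdict(list)

        def record(self, endpoint, latency_ms):
            self.samples[endpoint].append(latency_ms)

        def score(self, endpoint):
            s = self.samples[endpoint]
            return sum(s) / len(s) if s else 0.0

        def order(self, endpoints):
            return sorted(endpoints, key=self.score)

    scores = LatencyScores()
    scores.record("node-3", 10000.0)  # the read that hit rpc_timeout during the GC
    scores.record("node-4", 3.0)
    scores.record("node-5", 4.0)
    print(scores.order(["node-3", "node-4", "node-5"]))
    # ['node-4', 'node-5', 'node-3']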

On Wed, Apr 13, 2011 at 10:58 AM, Jonathan Ellis <jb...@gmail.com> wrote:
> First, our contract with the client says "we'll give you the answer or
> a timeout after rpc_timeout." Once we start trying to cheat on that
> the client has no guarantee anymore when it should expect a response
> by. So that feels iffy to me.
>
> Second, retrying to a different node isn't expected to give
> substantially better results than the client issuing a retry itself if
> that's what it wants, since by the time we timeout once then FD and/or
> dynamic snitch should route the request to another node for the retry
> without adding additional complexity to StorageProxy.  (If that's not
> what you see in practice, then we probably have a dynamic snitch bug.)

Re: CL.ONE reads and SimpleSnitch unnecessary timeouts

Posted by Jonathan Ellis <jb...@gmail.com>.
First, our contract with the client says "we'll give you the answer
or a timeout after rpc_timeout." Once we start trying to cheat on
that, the client no longer has any guarantee about when to expect a
response. So that feels iffy to me.

Second, retrying against a different node isn't expected to give
substantially better results than the client issuing a retry itself
if that's what it wants: by the time we time out once, the FD and/or
dynamic snitch should already route the retry to another node,
without adding additional complexity to StorageProxy. (If that's not
what you see in practice, then we probably have a dynamic snitch bug.)
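
So the division of labor looks roughly like the sketch below: the
server promises an answer or a timeout within rpc_timeout, and a
client that wants more than that retries on its own, by which point
the FD or dynamic snitch should already prefer a different replica.
(The helper names here are hypothetical, not from any particular
client library.)

    def read_with_retry(coordinators, key, do_read, attempts=2):
        # do_read(coordinator, key) returns a value or raises TimeoutError
        # once the coordinator gives up after rpc_timeout.
        last_error = None
        for i in range(attempts):
            # Rotate coordinators so a retry can also land on a different
            # proxy node rather than relying only on server-side re-ranking.
            coordinator = coordinators[i % len(coordinators)]
            try:
                return do_read(coordinator, key)
            except TimeoutError as e:
                last_error = e
        raise last_error

    # e.g. read_with_retry(["node-0", "node-1", "node-2"], 35, do_read)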

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com