Posted to dev@cassandra.apache.org by Kirk True <ki...@mustardgrain.com> on 2012/10/04 20:25:34 UTC

Expected behavior of number of nodes contacted during CL=QUORUM read

Hi all,

Test scenario:

     4 nodes (.1, .2, .3, .4)
     RF=3
     CL=QUORUM
     Cassandra 1.1.2

I noticed that ReadCallback's constructor determines 'blockfor' to be 
2 for RF=3, CL=QUORUM.
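
For reference, quorum here is just floor(RF/2) + 1. A minimal sketch of 
that calculation (illustrative names only, not the actual 1.1.2 source):

    // Sketch of the CL=QUORUM blockfor calculation; the class and
    // method names are made up for illustration.
    final class QuorumSketch
    {
        static int quorumFor(int replicationFactor)
        {
            return (replicationFactor / 2) + 1;   // RF=3 -> 2, RF=5 -> 3
        }
    }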

According to the API page on the wiki[1] for reads at CL=QUORUM:

    Will query *all* replicas and return the record with the most recent
    timestamp once it has at least a majority of replicas (N / 2 + 1)
    reported.


However, ReadCallback's constructor determines blockfor to be 2 and 
then calls filterEndpoints. filterEndpoints is given a list of the 
three replicas, but at the very end of the method the endpoint list is 
trimmed to only two replicas. Those two replicas are then used in 
StorageProxy to execute the read/digest calls. So the read ends up 
going to two nodes, not all three as stated on the wiki.
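
Roughly, that trimming step behaves like the following (a sketch with 
assumed names; the real filterEndpoints also deals with read repair 
and snitch-sorted ordering):

    import java.net.InetAddress;
    import java.util.List;

    // Sketch only: given replicas already sorted by proximity, keep
    // just the first 'blockfor' of them for the read/digest requests.
    final class EndpointFilterSketch
    {
        static List<InetAddress> filterEndpoints(List<InetAddress> sortedReplicas,
                                                 int blockfor)
        {
            int keep = Math.min(blockfor, sortedReplicas.size());
            return sortedReplicas.subList(0, keep);
        }
    }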

In my test case, I kill a node and then immediately issue a query for 
a key that has a replica on the downed node. The live nodes in the 
cluster don't yet know that the other node is down. Rather than 
contacting *all* nodes as the wiki states, the coordinator contacts 
only two -- one of which is the downed node. Since it blocks for two 
responses and one of those nodes is down, the query times out. 
Attempting the read again produces the same effect, even when trying 
different nodes as coordinators. I end up retrying a few times until 
the failure detectors on the live nodes realize that the node is down.
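
Conceptually the coordinator waits on something like a latch that must 
be counted down blockfor times before the rpc timeout fires; a 
simplified model (not the actual ReadCallback code):

    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.TimeUnit;

    // Simplified model: each replica response counts the latch down.
    // With blockfor=2 and one of the two contacted replicas dead, only
    // one response ever arrives, await() expires, and the read times out.
    final class BlockForSketch
    {
        static boolean awaitResponses(CountDownLatch responses, long rpcTimeoutMs)
                throws InterruptedException
        {
            return responses.await(rpcTimeoutMs, TimeUnit.MILLISECONDS);
        }
    }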

So, the end result is that if a client attempts to read a row that has 
a replica on a newly downed node, the read will time out repeatedly 
until the ~30-second failure detector window has passed -- even though 
there are enough live replicas to satisfy the request. We basically 
have a scenario in which a value is not retrievable for upwards of 30 
seconds. The percentage of keys affected shrinks as the ring grows, 
but it's still non-zero.
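
(Rough arithmetic, assuming balanced ownership: with RF=3 on a 4-node 
ring, about RF/N = 3/4 of keys have a replica on any given node; on a 
100-node ring that fraction drops to roughly 3/100.)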

This doesn't seem right and I'm sure I'm missing something.

Thanks,
Kirk

[1] http://wiki.apache.org/cassandra/API

Re: Expected behavior of number of nodes contacted during CL=QUORUM read

Posted by Jonathan Ellis <jb...@gmail.com>.
The API page is incorrect.  Cassandra only contacts enough nodes to
satisfy the requested CL.
https://issues.apache.org/jira/browse/CASSANDRA-4705 and
https://issues.apache.org/jira/browse/CASSANDRA-2540 are relevant to
the fragility that can result as you say.  (Although, unless you are
doing zero read repairs I would expect the dynamic snitch to steer
requests away from the unresponsive node a lot faster than 30s.)

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com

Re: Expected behavior of number of nodes contacted during CL=QUORUM read

Posted by Brandon Williams <dr...@gmail.com>.
On Thu, Oct 4, 2012 at 1:25 PM, Kirk True <ki...@mustardgrain.com> wrote:
> I noticed that ReadCallback's constructor determines 'blockfor' to be
> 2 for RF=3, CL=QUORUM.

Which is correct.  floor(3/2) = 1, plus 1 equals 2.

-Brandon