Posted to user@cassandra.apache.org by Chris Burroughs <ch...@gmail.com> on 2011/04/06 22:00:55 UTC

CL.ONE reads / RR / badness_threshold interaction

My understanding for CL.ONE, for the node that receives the request:

(A) If RR is enabled and this node contains the needed row --> return
immediately and do RR to remaining replicas in background.
(B) If RR is off and this node contains the needed row --> return the
needed data immediately.
(C) If this node does not have the needed row --> regardless of RR ask
all replicas and return the first result.
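
In rough code form that mental model looks something like this (hypothetical
names just to make the question concrete, not the actual read path):

import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch of the above; none of these names correspond to
// actual Cassandra classes or methods.
class ClOneReadSketch {
    interface Replica {
        boolean hasRow(String key);
        String read(String key);     // pretend this is a (possibly remote) data read
    }

    static String read(final String key, boolean readRepair,
                       Replica coordinator, List<Replica> replicas) throws Exception {
        if (coordinator.hasRow(key)) {
            String row = coordinator.read(key);        // cases (A) and (B): answer locally
            if (readRepair)
                repairInBackground(key, replicas);     // case (A): RR against the other replicas
            return row;
        }
        // case (C): regardless of RR, ask all replicas and take the first response
        ExecutorService pool = Executors.newFixedThreadPool(replicas.size());
        try {
            ExecutorCompletionService<String> responses =
                    new ExecutorCompletionService<String>(pool);
            for (final Replica replica : replicas)
                responses.submit(new Callable<String>() {
                    public String call() { return replica.read(key); }
                });
            return responses.take().get();             // first replica to respond wins
        } finally {
            pool.shutdown();
        }
    }

    static void repairInBackground(String key, List<Replica> replicas) {
        // compare the replicas' copies and write back the newest (not shown)
    }
}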


However case (C) as I have described it does not allow for any notion of
'pinning' as mentioned for dynamic_snitch_badness_threshold:

# if set greater than zero and read_repair_chance is < 1.0, this will allow
# 'pinning' of replicas to hosts in order to increase cache capacity.
# The badness threshold will control how much worse the pinned host has to be
# before the dynamic snitch will prefer other replicas over it.  This is
# expressed as a double which represents a percentage.  Thus, a value of
# 0.2 means Cassandra would continue to prefer the static snitch values
# until the pinned host was 20% worse than the fastest.


The wiki states that CL.ONE "Will return the record returned by the first
replica to respond" [1], implying that the request goes to multiple
replicas, but DataStax's docs state that only one node will receive the
request ("Returns the response from *the* closest replica, as determined
by the snitch configured for the cluster" [2]).

Could someone clarify how CL.ONE reads with RR off work?


[1] http://wiki.apache.org/cassandra/API
[2] http://www.datastax.com/docs/0.7/consistency/index#choosing-consistency-levels
(emphasis added)

Re: CL.ONE reads / RR / badness_threshold interaction

Posted by aaron morton <aa...@thelastpickle.com>.
Nice explanation.

Wanted to add that the importance of being the first node in the ordered node list, even for CL ONE, is that this is the node that is sent the data request, and it has to return before the CL is considered satisfied. e.g. CL ONE with RR running, read sent to all 5 replicas: if 3 digest requests have returned, the coordinator will still be blocking, waiting for the one data request.
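
Something like this toy version of the wait shows the idea (made up for
illustration, not the actual ReadCallback code):

// Toy illustration only; not the actual ReadCallback implementation.
// The point: digest responses alone never satisfy the wait, even at CL ONE.
class DataBlockingWait {
    private final int blockFor;          // 1 for CL ONE
    private int responses = 0;
    private boolean dataReceived = false;

    DataBlockingWait(int blockFor) { this.blockFor = blockFor; }

    synchronized void onResponse(boolean isDataResponse) {
        responses++;
        if (isDataResponse)
            dataReceived = true;
        if (satisfied())
            notifyAll();
    }

    // With blockFor == 1 and three digests already in, this is still false
    // until the single data response arrives.
    private synchronized boolean satisfied() {
        return dataReceived && responses >= blockFor;
    }

    synchronized void await() throws InterruptedException {
        while (!satisfied())
            wait();
    }
}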

Thanks
Aaron


On 7 Apr 2011, at 08:13, Peter Schuller wrote:

> Ok, I took this opportunity to look a bit more at this part of
> the code. My reading of StorageProxy.fetchRows() and related is as
> follows, but please allow for others to say I'm wrong/missing
> something (and sorry, this is more a stream of consciousness that is
> probably more useful to me for learning the code than an answer to
> your question, but it's probably better to send it than write it and
> then just discard the e-mail - maybe someone is helped ;)):
> 
> The endpoints obtained are sorted by the snitch's sortByProximity,
> against the local address. If the closest (as determined by that
> sorting) is the local address, the request is added directly to the
> local READ stage.
> 
> In the case of the SimpleSnitch, sortByProximity is a no-op, so the
> "sorted by proximity" should be the ring order. As per the comments to
> the SimpleSnitch, the intent is to allow "non-read-repaired reads to
> prefer a single endpoint, which improves cache locality".
> 
> So my understanding is that in the case of the SimpleSnitch, ignoring any
> effect of the dynamic snitch, you will *not* always grab from the
> local node because the "closest" node (because ring order is used) is
> just whatever is the "first" node on the ring in the replica set.
> 
> In the case of the NetworkTopologyStrategy, it inherits the
> implementation in AbstractNetworkTopologySnitch which sorts by
> AbstractNetworkTopologySnitch.compareEndPoints(), which:
> 
> (1) Always prefers itself to any other node. So "myself" is always
> "closest", no matter what.
> (2) Else, always prefers a node in the same rack, to a node in a different rack.
> (3) Else, always prefers a node in the same dc, to a node in a different dc.
> 
> So in the NTS case, I believe, *disregarding the dynamic snitch*, you
> would in fact always read from the co-ordinator node if
> that node happens to be part of the replica set for the row.
> 
> (There is no tie-breaking if neither 1, 2 nor 3 above gives a
> precedence, and it is sorted with Collections.sort(), which guarantees
> that the sort is stable. So for nodes where rack/dc awareness does not
> come into play, it should result in the ring order as with the
> SimpleSnitch.)
> 
> Now; so far this only determines the order of endpoints after
> proximity sorting. fetchRows() will route to "itself" directly without
> messaging if the closest node is itself. This determines from which
> node we read the *data* (not digest).
> 
> Moving back to endpoint selection; after sorting by proximity it is
> actually filtered by getReadCallback. This is what determines how many
> endpoints will receive a request. If read repair doesn't happen, it'll be
> whatever is implied by the consistency level (so only one for CL.ONE).
> If read repair does happen, all endpoints are included and so none is
> filtered out.
> 
> Moving back out into fetchRows(), we're now past the sending or local
> scheduling of the data read. It then loops over the remainder (1
> through last of handler.endpoints) and submits digest read messages
> to each endpoint (either local or remote).
> 
> We're now so far as to have determined (1) which node to send the data
> request to, (2) which nodes, if any, to send digest reads to
> (regardless of whether it is due to read repair or consistency level
> requirements).
> 
> Now fetchRows() proceeds to iterate over all the ReadCallbacks,
> calling get() on each. This is where digest mismatch exceptions are raised if
> relevant. CL.ONE seems special-cased in the sense that if the number
> of responses to block/wait for is exactly 1, the data is returned
> without resolving to check for digest mismatches (once responses come
> back later on, the read repair is triggered by
> ReadCallback.maybeResolveForRepair).
> 
> In the case of CL > ONE, a digest mismatch can be raised immediately
> in which case fetchRows() triggers read repair.
> 
> Now:
> 
>> However case (C) as I have described it does not allow for any notion of
>> 'pinning' as mentioned for dynamic_snitch_badness_threshold:
>> 
>> # if set greater than zero and read_repair_chance is < 1.0, this will allow
>> # 'pinning' of replicas to hosts in order to increase cache capacity.
>> # The badness threshold will control how much worse the pinned host has to be
>> # before the dynamic snitch will prefer other replicas over it.  This is
>> # expressed as a double which represents a percentage.  Thus, a value of
>> # 0.2 means Cassandra would continue to prefer the static snitch values
>> # until the pinned host was 20% worse than the fastest.
> 
> If you look at DynamicEndpointSnitch.sortByProximity(), it branches
> into two main cases: If BADNESS_THRESHOLD is exactly 0 (it's not a
> constant despite the caps, it's taken from the conf) it uses
> sortByProximityWithScore(). Otherwise it uses
> sortByProximityWithBadness().
> 
> ...withBadness() first asks the subsnitch (meaning normally either
> SimpleSnitch or NTS) to sort by proximity. Then it iterates through
> the endpoints, and if any node is sufficiently good in comparison to
> the closest-as-determined-by-subsnitch endpoint, it falls back to
> sortByProximityWithScore(). "Sufficiently good" is where the actual
> value of BADNESS_THRESHOLD comes in (if the "would-be" closest node is
> sufficiently bad, that implies some other node is sufficiently good in
> comparison to it...)
> 
> So, my reading of it is then that the comment is correct. By setting
> it to >0, you're making the dynamic snitch essentially be a no-op for
> the purpose of proximity sorting *until* the badness threshold is
> reached, at which point it uses its scoring algorithm. Because the
> behavior of NTS and the SimpleSnitch is to always prefer the same node
> (for a given row key), that means pinning.
> 
> As far as I can tell, read-repair should not affect things either way
> since it doesn't have anything to do with which node gets asked for
> the data (as opposed to the digest).
> 
> One interesting aspect though: Say you specifically *don't* want
> pinning, and rather *want* round-robin type of behavior to keep caches
> hot. If you're at CL.ONE with read repair turned off or very low, this
> doesn't seem possible, except to whatever extent it happens accidentally
> through dynamic snitch balancing - depending on the performance
> characteristics of the nodes.
> 
> -- 
> / Peter Schuller


Re: CL.ONE reads / RR / badness_threshold interaction

Posted by Chris Burroughs <ch...@gmail.com>.
On 04/12/2011 06:27 PM, Peter Schuller wrote:
>> So to increase pinny-ness I'll further reduce RR chance and set a
>> badness threshold.  Thanks all.
> 
> Just be aware that, assuming I am not missing something, while this
> will indeed give you better cache locality under normal circumstances,
> once that "closest" node does go down, traffic will then go to a
> node which will have potentially zero cache hit rate on that data,
> since all reads up to that point were taken by the node that just went
> down.
> 
> So it's not an obvious win, depending on your situation.


Yeah, there is less than great behaviour when nodes are restarted or
otherwise go down with this configuration.  Probably still preferable
for my current situation.  Others' mileage may vary.


http://img27.imageshack.us/img27/85/cacherestart.png

Re: CL.ONE reads / RR / badness_threshold interaction

Posted by Peter Schuller <pe...@infidyne.com>.
> To now answer my own question, the critical points that are different
> from what I said earlier are: that CL.ONE does prefer *one* node (which
> one depends on the snitch) and that RR uses digests (which are not
> mentioned on the wiki page [1]) instead of comparing full data responses.

I updated it to mention digest queries with a link to another page to
explain what that is, and why they are used.

> I am assuming that RR digests save on bandwidth, but to generate the
> digest with a row cache miss the same number of disk seeks are required
> (my nemesis is disk io).

Yes. It's only a bandwidth optimization.
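
Conceptually something like this (a simplified illustration, not the actual
code path) - the replica still reads the whole row, it just ships back a hash
of it:

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Simplified illustration, not the actual code path: a digest response
// requires the same row read as a data response; only the reply shrinks.
class DigestResponseSketch {
    static byte[] digestResponse(byte[] rowReadFromDisk) throws NoSuchAlgorithmException {
        // The same disk work as a data read already happened to produce rowReadFromDisk...
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        // ...but only a small hash goes back over the wire.
        return md5.digest(rowReadFromDisk);
    }
}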

> So to increase pinny-ness I'll further reduce RR chance and set a
> badness threshold.  Thanks all.

Just be aware that, assuming I am not missing something, while this
will indeed give you better cache locality under normal circumstances,
once that "closest" node does go down, traffic will then go to a
node which will have potentially zero cache hit rate on that data,
since all reads up to that point were taken by the node that just went
down.

So it's not an obvious win, depending on your situation.

-- 
/ Peter Schuller

Re: CL.ONE reads / RR / badness_threshold interaction

Posted by Chris Burroughs <ch...@gmail.com>.
Peter, thank you for the extremely detailed reply.

To now answer my own question, the critical points that are different
from what I said earlier are: that CL.ONE does prefer *one* node (which
one depends on the snitch) and that RR uses digests (which are not
mentioned on the wiki page [1]) instead of comparing full data responses.
Totally tangential, but in the case of CL.ONE with narrow rows, sending
the request to every replica and taking the fastest response would
probably be better; having things work both ways depending on row size
sounds painfully complicated, though.  (As Aaron points out, this is not
how things work now.)

I am assuming that RR digests save on bandwidth, but to generate the
digest with a row cache miss the same number of disk seeks are required
(my nemesis is disk io).

So to increase pinny-ness I'll further reduce RR chance and set a
badness threshold.  Thanks all.


[1] http://wiki.apache.org/cassandra/ReadRepair

Re: CL.ONE reads / RR / badness_threshold interaction

Posted by Peter Schuller <pe...@infidyne.com>.
Ok, I took this opportunity to look a bit more at this part of
the code. My reading of StorageProxy.fetchRows() and related is as
follows, but please allow for others to say I'm wrong/missing
something (and sorry, this is more a stream of consciousness that is
probably more useful to me for learning the code than an answer to
your question, but it's probably better to send it than write it and
then just discard the e-mail - maybe someone is helped ;)):

The endpoints obtained are sorted by the snitch's sortByProximity,
against the local address. If the closest (as determined by that
sorting) is the local address, the request is added directly to the
local READ stage.

In the case of the SimpleSnitch, sortByProximity is a no-op, so the
"sorted by proximity" should be the ring order. As per the comments to
the SimpleSnitch, the intent is to allow "non-read-repaired reads to
prefer a single endpoint, which improves cache locality".

So my understanding is that in the case of the SimpleSnitch, ignoring any
effect of the dynamic snitch, you will *not* always grab from the
local node because the "closest" node (because ring order is used) is
just whatever is the "first" node on the ring in the replica set.

In the case of the NetworkTopologyStrategy, it inherits the
implementation in AbstractNetworkTopologySnitch which sorts by
AbstractNetworkTopologySnitch.compareEndPoints(), which:

(1) Always prefers itself to any other node. So "myself" is always
"closest", no matter what.
(2) Else, always prefers a node in the same rack, to a node in a different rack.
(3) Else, always prefers a node in the same dc, to a node in a different dc.

So in the NTS case, I believe, *disregarding the dynamic snitch*, you
would in fact always read from the co-ordinator node if
that node happens to be part of the replica set for the row.

(There is no tie-breaking if neither 1, 2 nor 3 above gives a
precedence, and it is sorted with Collections.sort(), which guarantees
that the sort is stable. So for nodes where rack/dc awareness does not
come into play, it should result in the ring order as with the
SimpleSnitch.)
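
In comparator form, the preference order reads roughly like this (a simplified
sketch of the logic as I read it, not the actual source):

import java.util.Comparator;

// Simplified sketch of the preference order described above; not the
// actual AbstractNetworkTopologySnitch source.
class ProximityOrderSketch {
    static class Endpoint {
        final String address, rack, dc;
        Endpoint(String address, String rack, String dc) {
            this.address = address; this.rack = rack; this.dc = dc;
        }
    }

    static Comparator<Endpoint> closestTo(final Endpoint self) {
        return new Comparator<Endpoint>() {
            public int compare(Endpoint a, Endpoint b) {
                boolean aSelf = a.address.equals(self.address);
                boolean bSelf = b.address.equals(self.address);
                if (aSelf != bSelf)
                    return aSelf ? -1 : 1;   // (1) myself before anyone else
                boolean aRack = a.rack.equals(self.rack);
                boolean bRack = b.rack.equals(self.rack);
                if (aRack != bRack)
                    return aRack ? -1 : 1;   // (2) same rack before other racks
                boolean aDc = a.dc.equals(self.dc);
                boolean bDc = b.dc.equals(self.dc);
                if (aDc != bDc)
                    return aDc ? -1 : 1;     // (3) same DC before other DCs
                return 0;                    // tie: a stable sort keeps the ring order
            }
        };
    }
}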

Now; so far this only determines the order of endpoints after
proximity sorting. fetchRows() will route to "itself" directly without
messaging if the closest node is itself. This determines from which
node we read the *data* (not digest).

Moving back to endpoint selection; after sorting by proximity it is
actually filtered by getReadCallback. This is what determines how many
endpoints will receive a request. If read repair doesn't happen, it'll be
whatever is implied by the consistency level (so only one for CL.ONE).
If read repair does happen, all endpoints are included and so none is
filtered out.
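
As a sketch of that filtering step (made-up method name, not the real
getReadCallback):

import java.util.List;

// Simplified sketch of the filtering described above; made-up method name,
// not the real getReadCallback.
class EndpointFilterSketch {
    static List<String> filterForQuery(List<String> sortedByProximity,
                                       int requiredByConsistencyLevel,
                                       boolean readRepairThisQuery) {
        if (readRepairThisQuery)
            return sortedByProximity;                  // every endpoint gets a request
        int keep = Math.min(requiredByConsistencyLevel, sortedByProximity.size());
        return sortedByProximity.subList(0, keep);     // e.g. just the closest one for CL.ONE
    }
}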

Moving back out into fetchRows(), we're now past the sending or local
scheduling of the data read. It then loops over the remainder (1
through last of handler.endpoints) and submits digest read messages
to each endpoint (either local or remote).

We're now so far as to have determined (1) which node to send the data
request to, (2) which nodes, if any, to send digest reads to
(regardless of whether it is due to read repair or consistency level
requirements).

Now fetchRows() proceeds to iterate over all the ReadCallbacks,
calling get() on each. This is where digest mismatch exceptions are raised if
relevant. CL.ONE seems special-cased in the sense that if the number
of responses to block/wait for is exactly 1, the data is returned
without resolving to check for digest mismatches (once responses come
back later on, the read repair is triggered by
ReadCallback.maybeResolveForRepair).

In the case of CL > ONE, a digest mismatch can be raised immediately
in which case fetchRows() triggers read repair.
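
Roughly, in sketch form (made-up names; the real logic is spread across
ReadCallback and fetchRows()):

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Rough sketch of the two paths described above; made-up names, not the
// actual ReadCallback/fetchRows() code.
class ResolveSketch {
    static class DigestMismatch extends Exception {}

    static byte[] resolve(int blockFor, byte[] dataResponse, byte[][] digestResponses)
            throws DigestMismatch, NoSuchAlgorithmException {
        if (blockFor == 1)
            // CL.ONE: hand the data back without comparing digests; a mismatch
            // noticed later, as the remaining responses arrive, only triggers
            // read repair in the background.
            return dataResponse;
        for (byte[] digest : digestResponses)
            if (!Arrays.equals(MessageDigest.getInstance("MD5").digest(dataResponse), digest))
                // CL > ONE: the mismatch surfaces immediately; fetchRows() then
                // triggers read repair.
                throw new DigestMismatch();
        return dataResponse;
    }
}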

Now:

> However case (C) as I have described it does not allow for any notion of
> 'pinning' as mentioned for dynamic_snitch_badness_threshold:
>
> # if set greater than zero and read_repair_chance is < 1.0, this will allow
> # 'pinning' of replicas to hosts in order to increase cache capacity.
> # The badness threshold will control how much worse the pinned host has to be
> # before the dynamic snitch will prefer other replicas over it.  This is
> # expressed as a double which represents a percentage.  Thus, a value of
> # 0.2 means Cassandra would continue to prefer the static snitch values
> # until the pinned host was 20% worse than the fastest.

If you look at DynamicEndpointSnitch.sortByProximity(), it branches
into two main cases: If BADNESS_THRESHOLD is exactly 0 (it's not a
constant despite the caps, it's taken from the conf) it uses
sortByProximityWithScore(). Otherwise it uses
sortByProximityWithBadness().

...withBadness() first asks the subsnitch (meaning normally either
SimpleSnitch or NTS) to sort by proximity. Then it iterates through
the endpoints, and if any node is sufficiently good in comparison to
the closest-as-determined-by-subsnitch endpoint, it falls back to
sortByProximityWithScore(). "Sufficiently good" is where the actual
value of BADNESS_THRESHOLD comes in (if the "would-be" closest node is
sufficiently bad, that implies some other node is sufficiently good in
comparison to it...)
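
Boiled down, that branch behaves something like this (a simplified sketch of
the behaviour as I read it, not the actual DynamicEndpointSnitch source;
scores here are latencies, so lower is better):

import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Simplified sketch of the branching described above; not the actual
// DynamicEndpointSnitch source. Assumes every endpoint has a latency score.
class BadnessSketch {
    static void sortByProximity(double badnessThreshold,
                                List<String> endpoints,                  // arrives in subsnitch order
                                final Map<String, Double> latencyScore) {
        Comparator<String> byScore = new Comparator<String>() {
            public int compare(String a, String b) {
                return Double.compare(latencyScore.get(a), latencyScore.get(b));
            }
        };
        if (badnessThreshold == 0) {
            Collections.sort(endpoints, byScore);       // always sort purely by score
            return;
        }
        double pinned = latencyScore.get(endpoints.get(0));
        for (String other : endpoints) {
            // If the subsnitch's first choice is more than badnessThreshold worse
            // than some other replica, give up on pinning and sort by score.
            if (pinned > latencyScore.get(other) * (1 + badnessThreshold)) {
                Collections.sort(endpoints, byScore);
                return;
            }
        }
        // Otherwise the subsnitch order stands, i.e. the pinning is preserved.
    }
}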

So, my reading of it is then that the comment is correct. By setting
it to >0, you're making the dynamic snitch essentially be a no-op for
the purpose of proximity sorting *until* the badness threshold is
reached, at which point it uses its scoring algorithm. Because the
behavior of NTS and the SimpleSnitch is to always prefer the same node
(for a given row key), that means pinning.

As far as I can tell, read-repair should not affect things either way
since it doesn't have anything to do with which node gets asked for
the data (as opposed to the digest).

One interesting aspect though: Say you specifically *don't* want
pinning, and rather *want* round-robin type of behavior to keep caches
hot. If you're at CL.ONE with read repair turned off or very low, this
doesn't seem possible, except to whatever extent it happens accidentally
through dynamic snitch balancing - depending on the performance
characteristics of the nodes.

-- 
/ Peter Schuller