Posted to user@cassandra.apache.org by Daniel Doubleday <da...@gmx.net> on 2010/12/12 17:49:59 UTC

Dynamic Snitch / Read Path Questions

Hi again.

It would be great if someone could comment whether the following is true 
or not.
I tried to understand the consequences of using 
-Dcassandra.dynamic_snitch=true for the read path and that's what I 
came up with:

1) If using CL > 1, then using the dynamic snitch will result in a data 
read from the node with the lowest latency (slightly simplified) even if 
the proxy node contains the data but has a higher latency than other 
possible nodes, which means that it is not necessary to do load-based 
balancing on the client side.

2) If using CL = 1, then the proxy node will always return the data itself 
even when there is another node with less load.

3) Digest requests will be sent to all other live peer nodes for that 
key and will result in a data read on all nodes to calculate the digest. 
The only difference is that the data is not sent back, but IO-wise it is 
just as expensive.


The next one goes a little further:

We read / write with quorum / rf = 3.

It seems to me that it wouldn't be hard to patch the StorageProxy to 
send only one read request and one digest request. Only if one of the 
requests fails would we have to query the remaining node. We don't need 
read repair because we have to run repair once a week anyway and quorum 
guarantees consistency. This way we could reduce read load significantly, 
which should compensate for the latency increase from failing reads. Am I 
missing something?
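
(In numbers: with rf = 3 a quorum read today costs three row reads on 
disk - one data read plus two digest reads, each reading the full row - 
while the patched version would cost two, i.e. roughly a third less read 
IO whenever the replicas agree.)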


Best,
Daniel




Re: Dynamic Snitch / Read Path Questions

Posted by Daniel Doubleday <da...@gmx.net>.
> the purpose of your thread: How far are you away from being I/O
> bound (say in terms of % utilization - last column of iostat -x 1 -
> assuming you don't have a massive RAID underneath the block device)

No, my cheap boss didn't want to buy me a stack of these: 
http://www.ocztechnology.com/products/solid-state-drives/pci-express/z-drive-r2/mlc-performance-series/ocz-z-drive-r2-p88-pci-express-ssd.html

But seriously: we don't know yet what the best way in terms of TCO is. Maybe it's worth investing 2k in SSDs if that machine could then handle the load of 3.


> when compaction/AES is *not* running? I.e., how much in relative terms,
> in terms of "time spent by disks servicing requests" is added by
> compaction/AES?
> 

Can't really say in terms of util% because we only monitor IO waits in Zabbix. Now with our cluster running smoothly I'd say compaction adds around 15-20%.
In terms of IO waits we saw our graphs jump during compactions

- from 20-30% to 50% with 'ok' load (reqs were handled at around 100ms max and no messages dropped) and
- from 50% to 80-90% during peak hours. Things got ugly then.

> Are your values generally largish (say a few KB or some such) or
> very small (5-50 bytes) or somewhere in between? I've been trying to
> collect information when people report compaction/repair killing their
> performance. My hypothesis is that most severe issues are for data sets
> where compaction becomes I/O bound rather than CPU bound (for those
> that have seen me say this a gazillion times I must sound like
> I'm a stuck LP record); and this would tend to be expected with larger
> and fewer values as opposed to smaller and more numerous values, as the
> latter is much more expensive in terms of CPU cycles per byte
> compacted. Further, I expect CPU-bound compaction to be a problem very
> infrequently in comparison. I'm trying to confirm or falsify the
> hypothesis.

Well, we have 4 CFs with different characteristics but it seems that what made things go wrong was a CF with ~2k cols. I have never seen CPU user time over 30% on any of the nodes. So I second your hypothesis.

> 
> -- 
> / Peter Schuller


Re: Dynamic Snitch / Read Path Questions

Posted by Peter Schuller <pe...@infidyne.com>.
> We are entirely IO-bound. What killed us last week was too many reads combined with flushes and compactions. Reducing compaction priority helped but it was not enough. The main reason why we could not add nodes, though, had to do with the quorum reads we are doing:

I'm going to respond to this separately, and somewhat unrelated to
the purpose of your thread: How far are you away from being I/O
bound (say in terms of % utilization - last column of iostat -x 1 -
assuming you don't have a massive RAID underneath the block device)
when compaction/AES is *not* running? I.e., how much in relative terms,
in terms of "time spent by disks servicing requests" is added by
compaction/AES?

Are your values generally largish (say a few KB or some such) or
very small (5-50 bytes) or somewhere in between? I've been trying to
collect information when people report compaction/repair killing their
performance. My hypothesis is that most severe issues are for data sets
where compaction becomes I/O bound rather than CPU bound (for those
that have seen me say this a gazillion times I must sound like
I'm a stuck LP record); and this would tend to be expected with larger
and fewer values as opposed to smaller and more numerous values, as the
latter is much more expensive in terms of CPU cycles per byte
compacted. Further, I expect CPU-bound compaction to be a problem very
infrequently in comparison. I'm trying to confirm or falsify the
hypothesis.

-- 
/ Peter Schuller

Re: Dynamic Snitch / Read Path Questions

Posted by Daniel Doubleday <da...@gmx.net>.
Hi Peter

I should have started with the why instead of what ...

Background info (I'll try to be brief ...)

We have a very small production cluster (started with 3 nodes, now we have 5). Most of our data is currently in MySQL but we want to slowly move the larger tables, which are killing our MySQL cache, to Cassandra. After migrating another table we faced under-capacity last week. Since we couldn't cope with it and the old system was still running in the background, we switched back and added another 2 nodes while still performing writes to the cluster (but no reads).

We are entirely IO-bound. What killed us last week was too many reads combined with flushes and compactions. Reducing compaction priority helped but it was not enough. The main reason why we could not add nodes, though, had to do with the quorum reads we are doing:

First we stopped compaction on all nodes. Everything was golden. The cluster handled the load easily. Then we bootstrapped a new node. That increased the IO pressure on the node which was selected as streaming source because it started anti-compaction. The increased pressure slowed the node down. Just a little, but enough to get flooded by digest requests from the other nodes. We have seen this before: http://comments.gmane.org/gmane.comp.db.cassandra.user/10687.

So the status at that point was: 2 nodes that were serving requests at 10-50ms and one that, errr... wouldn't serve requests (average StorageProxy response time was 10 secs). The problem here was that users would suffer from the slow server because it was not down and still being queried by clients. Also, because the streaming node was so overwhelmed, anti-compaction became *really* slow. It would have taken days to finish.

Then we took one node down to prevent digest flooding and it was much better. It almost worked out, but at peak hours it collapsed. At that point we rolled back.

From this my takeaway / guess is:

We could have survived this if the streaming node had not had to serve 

- read requests (because then users would not have been affected) and / or
- digest requests (because that would have reduced IO pressure)

To summarize: I want to prevent the slowest node from

A) affecting latency at the level of the StorageProxy
B) getting digest requests, because the only thing they are good for is killing it 


Ok sorry - that was not brief ...

Back to my original mail. 1-3) were supposed to describe the current situation in order to validate that my understanding is actually correct.

Point A) would be achieved with quorum if the slowest node were never selected to perform the actual data read. That way the ReadResponseResolver would be satisfied without waiting for the slowest node. I think that's what's supposed to happen when using the dynamic snitch without any code change, and that was the point of question 1) in my last mail. Question 2) is important to me because in the future we might want to read at CL 1 for other use cases, and Cassandra seems to shortcut the MessagingService in the case where the proxy node contains the row. Thus in that case the node would respond even if it's the slowest node, so there is no load balancing.
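
As far as I can tell the shortcut is roughly this (paraphrased from my 
reading of the weak read path; the names are my approximations, not the 
actual Cassandra identifiers):

    // Sketch of the weak (CL = ONE) read path as I read it. All helper
    // names here are approximations, not real Cassandra identifiers.
    Row weakRead(ReadCommand command) throws Exception {
        List<InetAddress> endpoints = liveReplicasFor(command.key);
        if (endpoints.contains(localAddress())) {
            // The proxy node holds a replica: it answers from local data,
            // bypassing the MessagingService and the snitch scores entirely,
            // even if it is currently the slowest replica.
            return readLocally(command);
        }
        // Otherwise the closest endpoint (per snitch) gets a remote read.
        return readRemotely(command, endpoints.get(0));
    }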

With question 3) I wanted to verify that my understanding of digest requests is correct. Before digging deeper I thought that Cassandra would have some magical way of calculating MD5 digests for conflict resolution without reading the column values. But (I think) obviously it cannot do that, because conflict resolution is not based on timestamps alone but will fall back to selecting the bigger value if timestamps are equal. This statement is important to me because it means that digest requests are just as expensive as regular read requests, and that I might be able to reduce IO pressure significantly if I change the way quorum reads are performed.
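
The tie-break I mean looks conceptually like this (a simplified sketch 
of the reconcile rule as I understand it, not the actual code):

    // Simplified sketch of column reconciliation: the higher timestamp
    // wins, and only a timestamp tie falls back to comparing the values -
    // which is why the digest has to cover the value bytes too.
    Column reconcile(Column left, Column right) {
        if (left.timestamp != right.timestamp)
            return left.timestamp > right.timestamp ? left : right;
        // Equal timestamps: the bigger value wins, deterministically.
        // compareBytes = lexical byte[] comparison (hypothetical helper).
        return compareBytes(left.value, right.value) >= 0 ? left : right;
    }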

And that's what question 4) was about:

Your question was:

> Am I interpreting you correctly that you want to switch the default
> read mode in the case of QUORUM to optimistically assume that data is
> consistent and read from one node and only perform digest on the rest?


Well, almost :-) In fact I believe you are describing what is happening now with a vanilla Cassandra. Quorum reads are performed by reading the data from the node which is selected by the endpoint snitch (that is self, same rack, same DC with the standard one, or score-based with the dynamic snitch). All other live nodes will receive digest requests.

Only when a digest mismatch occurs are full read requests sent to all nodes.

BTW: It seems (but that's probably only a misinterpretation of the code) that disabling read repair is bad when doing quorum reads. For instance, if the two digest requests return before the data request and one of the digest requests returns a mismatch, but the overall response would be valid, nothing would be returned even though the correct data was present. The same thing seems to be true for every mismatch. I guess that using the read repair config here might be wrong or at least not intuitive.

What I want to do is change the behavior so that on the first request run only the two 'best' nodes would be used: one for data, one for digest. To do that you'd only have to sort the live nodes by 'proximity' and use the first two. Only if that fails would I switch back to the default behavior and do a full data read on all nodes. So in a perfect world of 3 consistent nodes I would reduce IO reads by 1/3.
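
In (pseudo) code the change would look something like this - an untested 
sketch where the method names are mine and only approximate the real 
StorageProxy:

    // Untested sketch of the patched strong read at rf = 3 / quorum.
    // Vanilla: data request to the best endpoint, digest requests to ALL
    // other replicas. Patched: data to the best, ONE digest to the second.
    List<InetAddress> endpoints = sortByProximity(liveReplicasFor(key));
    sendDataRequest(endpoints.get(0), command);
    sendDigestRequest(endpoints.get(1), command);
    try {
        // at rf = 3 two matching responses already satisfy quorum
        return resolver.resolve();
    } catch (Exception mismatchOrTimeout) {
        // the optimistic pair failed: fall back to the default behavior
        // and do full data reads on all live replicas
        return readFromAllReplicas(endpoints, command);
    }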

Best, and thanks for your time (if you bothered to read all of this),
Daniel


On Dec 13, 2010, at 9:19 AM, Peter Schuller wrote:

>> 1) If using CL > 1, then using the dynamic snitch will result in a data read
>> from the node with the lowest latency (slightly simplified) even if the proxy node
>> contains the data but has a higher latency than other possible nodes, which
>> means that it is not necessary to do load-based balancing on the client
>> side.
>> 
>> 2) If using CL = 1, then the proxy node will always return the data itself
>> even when there is another node with less load.
>> 
>> 3) Digest requests will be sent to all other live peer nodes for that key
>> and will result in a data read on all nodes to calculate the digest. The
>> only difference is that the data is not sent back, but IO-wise it is just as
>> expensive.
> 
> I think I may just be completely misunderstanding something, but I'm
> not really sure to what extent you're trying to describe the current
> situation and to what extent you're suggesting changes? I'm not sure
> about (1) and (2) though my knee-jerk reaction is that I would expect
> it to be mostly agnostic w.r.t. which node happens to be taking the
> RPC call (e.g. the "latency" may be due to disk I/O and preferring the
> local node has lots of potential to be detrimental, while forwarding
> is only slightly more expensive).
> 
> (3) sounds like what's happening with read repair.
> 
>> The next one goes a little further:
>> 
>> We read / write with quorum / rf = 3.
>> 
>> It seems to me that it wouldn't be hard to patch the StorageProxy to send
>> only one read request and one digest request. Only if one of the requests
>> fails would we have to query the remaining node. We don't need read repair
>> because we have to run repair once a week anyway and quorum guarantees
>> consistency. This way we could reduce read load significantly, which should
>> compensate for the latency increase from failing reads. Am I missing something?
> 
> Am I interpreting you correctly that you want to switch the default
> read mode in the case of QUORUM to optimistically assume that data is
> consistent and read from one node and only perform digest on the rest?
> 
> What's the goal here? The only thing saved by digest reads at QUORUM
> seems to me to be the throughput saved by not sending the data. You're
> still taking the reads in terms of potential disk I/O, and you still
> have to wait for the response, and you're still taking almost all of
> the CPU hit (still reading and checksumming, just not sending back).
> For highly contended data the need to fall back to real reads would
> significantly increase average latency.
> 
> -- 
> / Peter Schuller


Re: Dynamic Snitch / Read Path Questions

Posted by Peter Schuller <pe...@infidyne.com>.
> 1) If using CL > 1, then using the dynamic snitch will result in a data read
> from the node with the lowest latency (slightly simplified) even if the proxy node
> contains the data but has a higher latency than other possible nodes, which
> means that it is not necessary to do load-based balancing on the client
> side.
>
> 2) If using CL = 1, then the proxy node will always return the data itself
> even when there is another node with less load.
>
> 3) Digest requests will be sent to all other live peer nodes for that key
> and will result in a data read on all nodes to calculate the digest. The
> only difference is that the data is not sent back, but IO-wise it is just as
> expensive.

I think I may just be completely misunderstanding something, but I'm
not really sure to what extent you're trying to describe the current
situation and to what extent you're suggesting changes? I'm not sure
about (1) and (2) though my knee-jerk reaction is that I would expect
it to be mostly agnostic w.r.t. which node happens to be taking the
RPC call (e.g. the "latency" may be due to disk I/O and preferring the
local node has lots of potential to be detrimental, while forwarding
is only slightly more expensive).

(3) sounds like what's happening with read repair.

> The next one goes a little further:
>
> We read / write with quorum / rf = 3.
>
> It seems to me that it wouldn't be hard to patch the StorageProxy to send
> only one read request and one digest request. Only if one of the requests
> fails would we have to query the remaining node. We don't need read repair
> because we have to run repair once a week anyway and quorum guarantees
> consistency. This way we could reduce read load significantly, which should
> compensate for the latency increase from failing reads. Am I missing something?

Am I interpreting you correctly that you want to switch the default
read mode in the case of QUORUM to optimistically assume that data is
consistent and read from one node and only perform digest on the rest?

What's the goal here? The only thing saved by digest reads at QUORUM
seems to me to be the throughput saved by not sending the data. You're
still taking the reads in terms of potential disk I/O, and you still
have to wait for the response, and you're still taking almost all of
the CPU hit (still reading and checksumming, just not sending back).
For highly contended data the need to fall back to real reads would
significantly increase average latency.

-- 
/ Peter Schuller

Re: Dynamic Snitch / Read Path Questions

Posted by Daniel Doubleday <da...@gmx.net>.
On Dec 14, 2010, at 2:29 AM, Brandon Williams wrote:

> On Mon, Dec 13, 2010 at 6:43 PM, Daniel Doubleday <da...@gmx.net> wrote:
> Oh - well but I see that the coordinator is actually using its own score for ordering. I was only concerned that dropped messages are ignored when calculating latencies, but that seems to be the case for local or remote responses. And even then I guess you can assume that enough slow messages arrive to destroy the score.
>  
> That's odd, since it should only be tracking READ_RESPONSE messages... I'm not sure how a node would send one to itself.

As far as I understand it, the MessagingService is always used in the strong read path. Local messages will be short-circuited transport-wise, but MessageDeliveryTask will still be used, which in turn calls the ResponseVerbHandler, which notifies the snitch about latencies.

It's only in the weak read path that the MessagingService is not used at all. That path will always use the local data, and (I think) latencies are not recorded.
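
So the path I see is MessageDeliveryTask -> ResponseVerbHandler -> 
dynamic snitch, roughly like this (identifiers are my approximations of 
the code, not exact signatures):

    // Rough sketch of where read latencies feed the dynamic snitch.
    void handleReadResponse(Message response) {
        long latencyMillis = now() - sendTimeOf(response.messageId());
        // Even a locally short-circuited response passes through here,
        // which is how the coordinator ends up with a score for itself.
        dynamicSnitch.receiveTiming(response.from(), latencyMillis);
        callbackFor(response.messageId()).response(response);
    }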

> 
> Maybe I misunderstand but that would not really lead to less load, right? I don't think that inconsistency / read repairs are the problem which leads to high IO load, but the digest requests. Turning off read repair would also lead to inconsistent reads, which invalidates the whole point of quorum reads (at least in 0.6; I think rr probability has no effect on strong reads in 0.7). Again assuming I am not misinterpreting the code.
> 
> Ah, I see what you want to do: take a chance that you pick the two replicas (at RF=3, at least) that should agree, and only send the last checksum request if you lose (at the price of latency).
> 

Yes, exactly. I want to use the two endpoints with the best score (according to the dynamic snitch). As soon as I have test results I'll post them here.

Thanks,
Daniel

Re: Dynamic Snitch / Read Path Questions

Posted by Brandon Williams <dr...@gmail.com>.
On Mon, Dec 13, 2010 at 6:43 PM, Daniel Doubleday
<da...@gmx.net> wrote:

> Oh - well but I see that the coordinator is actually using its own score
> for ordering. I was only concerned that dropped messages are ignored when
> calculating latencies, but that seems to be the case for local or remote
> responses. And even then I guess you can assume that enough slow messages
> arrive to destroy the score.
>

That's odd, since it should only be tracking READ_RESPONSE messages... I'm
not sure how a node would send one to itself.

> Maybe I misunderstand but that would not really lead to less load, right? I
> don't think that inconsistency / read repairs are the problem which leads to
> high IO load, but the digest requests. Turning off read repair would also
> lead to inconsistent reads, which invalidates the whole point of quorum reads
> (at least in 0.6; I think rr probability has no effect on strong reads in
> 0.7). Again assuming I am not misinterpreting the code.
>

Ah, I see what you want to do: take a chance that you pick the two replicas
(at RF=3, at least) that should agree, and only send the last checksum
request if you lose (at the price of latency).

-Brandon

Re: Dynamic Snitch / Read Path Questions

Posted by Daniel Doubleday <da...@gmx.net>.
On 13.12.10 21:15, Brandon Williams wrote:
> On Sun, Dec 12, 2010 at 10:49 AM, Daniel Doubleday
> <daniel.doubleday@gmx.net> wrote:
>
>     Hi again.
>
>     It would be great if someone could comment whether the following
>     is true or not.
>     I tried to understand the consequences of using
>     -Dcassandra.dynamic_snitch=true for the read path and that's
>     what I came up with:
>
>     1) If using CL > 1, then using the dynamic snitch will result in a
>     data read from the node with the lowest latency (slightly simplified)
>     even if the proxy node contains the data but has a higher latency
>     than other possible nodes, which means that it is not necessary to
>     do load-based balancing on the client side.
>
>
> No.  If the coordinator node is part of the replica set, the dynamic 
> snitch will fall back to the wrapped snitch for ordering, since it 
> does not track its own latency.  This likely means it will return 
> the data.
Oh - well but I see that the coordinator is actually using its own score 
for ordering. I was only concerned that dropped messages are ignored 
when calculating latencies, but that seems to be the case for local or 
remote responses. And even then I guess you can assume that enough slow 
messages arrive to destroy the score.
>
>     The next one goes a little further:
>
>     We read / write with quorum / rf = 3.
>
>     It seems to me that it wouldn't be hard to patch the StorageProxy
>     to send only one read request and one digest request. Only if one
>     of the requests fails would we have to query the remaining node. We
>     don't need read repair because we have to run repair once a week
>     anyway and quorum guarantees consistency. This way we could
>     reduce read load significantly, which should compensate for the
>     latency increase from failing reads. Am I missing something?
>
>
> Just turn off read repair in that case.
Maybe I misunderstand but that would not really lead to less load, right? 
I don't think that inconsistency / read repairs are the problem which 
leads to high IO load, but the digest requests. Turning off read repair 
would also lead to inconsistent reads, which invalidates the whole point 
of quorum reads (at least in 0.6; I think rr probability has no effect 
on strong reads in 0.7). Again assuming I am not misinterpreting the code.
>
> -Brandon


Re: Dynamic Snitch / Read Path Questions

Posted by Brandon Williams <dr...@gmail.com>.
On Sun, Dec 12, 2010 at 10:49 AM, Daniel Doubleday
<daniel.doubleday@gmx.net> wrote:

> Hi again.
>
> It would be great if someone could comment whether the following is true or
> not.
> I tried to understand the consequences of using
> -Dcassandra.dynamic_snitch=true for the read path and that's what I came
> up with:
>
> 1) If using CL > 1, then using the dynamic snitch will result in a data read
> from the node with the lowest latency (slightly simplified) even if the proxy node
> contains the data but has a higher latency than other possible nodes, which
> means that it is not necessary to do load-based balancing on the client
> side.
>

No.  If the coordinator node is part of the replica set, the dynamic snitch
will fall back to the wrapped snitch for ordering, since it does not track
its own latency.  This likely means it will return the data.


> 2) If using CL = 1, then the proxy node will always return the data itself
> even when there is another node with less load.
>

Yes, same as above.


> 3) Digest requests will be sent to all other live peer nodes for that key
> and will result in a data read on all nodes to calculate the digest. The
> only difference is that the data is not sent back, but IO-wise it is just as
> expensive.
>

Yes.


> The next one goes a little further:
>
> We read / write with quorum / rf = 3.
>
> It seems to me that it wouldn't be hard to patch the StorageProxy to send
> only one read request and one digest request. Only if one of the requests
> fails would we have to query the remaining node. We don't need read repair
> because we have to run repair once a week anyway and quorum guarantees
> consistency. This way we could reduce read load significantly, which should
> compensate for the latency increase from failing reads. Am I missing something?
>

Just turn off read repair in that case.

-Brandon