Posted to user@cassandra.apache.org by Philippe <wa...@gmail.com> on 2011/06/24 22:58:54 UTC

Questions about Cassandra reads

Hello,

I am trying to understand how Cassandra reads data. I've been reading a
lot, and here is what I understand.
Can I get some feedback on the following claims? Which are right and which
are wrong?

A) Upon opening an SSTable for reads, Cassandra samples one key in 100 to
speed up disk access.
Is the percentage configurable?
What is the relationship between this sampling and the key cache?

B) So assuming we have 200 keys in the table, the in-memory index will
contain the on-disk position of keys 0 and 100.

C) If I want to access a key at the 50th position in that table,
Cassandra will seek to position 0 and then do a sequential read of the file
from there until it finds the key, right?
While it does that, does C* deserialize the rows it is reading, or does it
just compare the keys' bytes and ignore the accompanying data?

D) Does the data for a key immediately follow the key in the file?
Ex: [key0][data0][key1][data1]... ?

Assuming a perfectly uniform random read pattern and no caches whatsoever
(neither C*, nor the OS, nor the disks... nothing)

E) Since the sampling is 1%, we'll have to scan 50 keys in the file *on
average* to get to the key we want.

G) Because we're scanning the file on disk, scanning those 50 keys in fact
means reading both the keys and the data rows, so on average retrieving one
row requires reading 50 rows from disk, increasing I/O fifty-fold.

The key cache stores the position in the file for the keys it contains, so
it's a great way to cut down on these inefficiencies.
H) Going back to my previous example: if my key cache has a capacity of
100 keys, then I'll only have to scan the file for half the requests

Real world now... I am running a proof of concept on SSD drives, and I have
too much data to hold in memory. I have some hotspots.
I) I wonder how best to allocate memory between the OS cache, key cache and
row cache.
Any suggestions?

My read pattern is very "chunky": I never read a single row but rather
ranges of rows with column slices. The sizes vary.
J) I've considered writing a partitioner that will chunk the rows together
so that queries for "close" rows go to the same replica on the ring. Since
the rows have close keys, they will be close together in the file and this
will increase OS cache efficiency.
What do you think?

Thanks for your insights

Re: Questions about Cassandra reads

Posted by Jonathan Ellis <jb...@gmail.com>.
Thanks for the update, that is very useful!

On Tue, Jul 12, 2011 at 3:16 PM, Philippe <wa...@gmail.com> wrote:
> Hi Jonathan,
> Thanks for the answer, I wanted to report on the improvements I got because
> someone else is bound to run into the same questions...
>
>>
>> > C) If I want to access a key at the 50th position in that table,
>> > Cassandra will seek to position 0 and then do a sequential read of the file
>> > from there until it finds the key, right?
>> Sequential read of the index file, not the data file.
>
> and then it will seek directly to the right position in the data file?
>
>>
>> > J) I've considered writing a partitioner that will chunk the rows
>> > together
>> > so that queries for "close" rows go to the same replica on the ring.
>> > Since
>> > the rows have close keys, they will be close together in the file and
>> > this
>> > will increase OS cache efficiency.
>> Sounds like ByteOrderedPartitioner to me.
>
> I indeed ended up using just that
>
>>
>> > What do you think?
>>  I think you should strongly consider denormalizing so that you can
>> read ranges from a single row instead.
>
> Yes, that's what I did: I took a hard look at the data and the access
> pattern and sliced away at everything I could.
> Given that I am storing data in a quad tree and that I have strong locality
> in my read pattern, I ended up using the Morton (z-order) code as the key
> and using super columns to fetch only the column groups I'm interested in.
> I gave some thought to how to balance the tree, because I have 10 different
> levels of data in the quadtree and I am doing tricks with shifts to reuse
> the same prefixes in the keys.
> What I think is worth noting for others on the mailing list is that doing
> this resulted in a 50x to 100x increase in read performance, and my I/O is
> now down to virtually nothing (I can basically see the OS load the pages
> into its cache).
> I also found out that one big multiget is more efficient than a couple of
> range queries in my case.
> So
>  - instead of a steady rate of 280-350 MB/s of disk reads, I get 100 MB/s
> every so often
>  - instead of seeing my cluster melt down at 3 concurrent clients, it's now
> speeding along just fine at 50 concurrent clients
> :)



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com

Re: Questions about Cassandra reads

Posted by Philippe <wa...@gmail.com>.
Hi Jonathan,
Thanks for the answer, I wanted to report on the improvements I got because
someone else is bound to run into the same questions...


> > C) If I want to access a key at the 50th position in that table,
> > Cassandra will seek to position 0 and then do a sequential read of the file
> > from there until it finds the key, right?
> Sequential read of the index file, not the data file.
>
and then it will seek directly to the right position in the data file?


> > J) I've considered writing a partitioner that will chunk the rows
> together
> > so that queries for "close" rows go to the same replica on the ring.
> Since
> > the rows have close keys, they will be close together in the file and
> this
> > will increase OS cache efficiency.
> Sounds like ByteOrderedPartitioner to me.
>
I indeed ended up using just that


> > What do you think?
>  I think you should strongly consider denormalizing so that you can
> read ranges from a single row instead.
>
Yes, that's what I did: I took a hard look at the data and the access
pattern and sliced away at everything I could.

Given that I am storing data in a quad tree and that I have strong locality
in my read pattern, I ended up using the Morton (z-order) code as the key
and using super columns to fetch only the column groups I'm interested in.
I gave some thought to how to balance the tree, because I have 10 different
levels of data in the quadtree and I am doing tricks with shifts to reuse
the same prefixes in the keys.
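
In case it helps, here is a minimal sketch in Java of the kind of Morton
encoding I mean (illustrative only; my real keys also fold in the quadtree
level via the shifts mentioned above):

// Interleave the bits of an (x, y) cell coordinate so that nearby
// cells produce nearby keys (z-order / Morton code).
public final class Morton {
    public static long encode(int x, int y) {
        return spread(x) | (spread(y) << 1);
    }

    // Spread the 32 bits of v out to the even bit positions of a long.
    private static long spread(int v) {
        long x = v & 0xFFFFFFFFL;
        x = (x | (x << 16)) & 0x0000FFFF0000FFFFL;
        x = (x | (x << 8))  & 0x00FF00FF00FF00FFL;
        x = (x | (x << 4))  & 0x0F0F0F0F0F0F0F0FL;
        x = (x | (x << 2))  & 0x3333333333333333L;
        x = (x | (x << 1))  & 0x5555555555555555L;
        return x;
    }
}

With ByteOrderedPartitioner, rows whose encoded keys share a prefix end up
next to each other on disk, which is what makes the OS page cache so
effective here.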

What I think is worth noting for others on the mailing list is that doing
this resulted in a 50x to 100x increase in read performance, and my I/O is
now down to virtually nothing (I can basically see the OS load the pages
into its cache).
I also found out that one big multiget is more efficient than a couple of
range queries in my case.
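
For illustration, a multiget slice via the Hector client looks roughly like
this (class and method names are from Hector as I remember them, and the
cluster, keyspace and column family names are made up, so treat it as a
sketch rather than a reference):

import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.beans.Rows;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.query.MultigetSliceQuery;
import me.prettyprint.hector.api.query.QueryResult;

public class MultigetExample {
    public static void main(String[] args) {
        StringSerializer ss = StringSerializer.get();
        Cluster cluster = HFactory.getOrCreateCluster("Test Cluster", "localhost:9160");
        Keyspace ksp = HFactory.createKeyspace("MyKeyspace", cluster);

        // One multiget covering all the keys we need, instead of several
        // narrower range queries.
        MultigetSliceQuery<String, String, String> q =
                HFactory.createMultigetSliceQuery(ksp, ss, ss, ss);
        q.setColumnFamily("QuadTree");
        q.setKeys("key1", "key2", "key3");
        q.setRange("", "", false, 100); // first 100 columns of each row

        QueryResult<Rows<String, String, String>> result = q.execute();
        System.out.println(result.get().getCount() + " rows fetched");
    }
}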

So
 - instead of a steady rate of 280-350 MB/s of disk reads, I get 100 MB/s
every so often
 - instead of seeing my cluster melt down at 3 concurrent clients, it's now
speeding along just fine at 50 concurrent clients

:)

Re: Questions about Cassandra reads

Posted by David Boxenhorn <da...@citypath.com>.
Ah, I get it. Your normal access pattern should be one row at a time.

On Sun, Jul 3, 2011 at 11:41 AM, David Boxenhorn <da...@citypath.com> wrote:
>>> What do you think?
>>
>> I think you should strongly consider denormalizing so that you can
>> read ranges from a single row instead.
>
> Why do you recommend denormalizing instead of secondary indexes?
>

Re: Questions about Cassandra reads

Posted by David Boxenhorn <da...@citypath.com>.
>> What do you think?
>
> I think you should strongly consider denormalizing so that you can
> read ranges from a single row instead.

Why do you recommend denormalizing instead of secondary indexes?

Re: Questions about Cassandra reads

Posted by Jonathan Ellis <jb...@gmail.com>.
On Fri, Jun 24, 2011 at 3:58 PM, Philippe <wa...@gmail.com> wrote:
> A) Upon opening an SSTable for reads, Cassandra samples one key in 100 to
> speed up disk access.

Close enough.

> Is the percentage configurable?

# The Index Interval determines how large the sampling of row keys
#  is for a given SSTable. The larger the sampling, the more effective
#  the index is at the cost of space.
index_interval: 128
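
The trade-off is simple arithmetic: sampling every index_interval-th key
costs memory proportional to keys/index_interval, while each lookup scans
about index_interval/2 index entries on average. A back-of-the-envelope
sketch (illustrative numbers, not Cassandra code):

public class IndexIntervalMath {
    public static void main(String[] args) {
        long keysInSSTable = 10000000L; // assumed SSTable size
        int indexInterval = 128;        // the setting above

        long sampledEntries = keysInSSTable / indexInterval; // held in memory
        double avgEntriesScanned = indexInterval / 2.0;      // per lookup

        System.out.println("sampled index entries: " + sampledEntries);
        System.out.println("avg index entries scanned per lookup: " + avgEntriesScanned);
    }
}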

> What is the relationship between this sampling and the key cache?

None.  The key cache remembers LRU key locations.

> C) If I want to access a key at the 50th position in that table,
> Cassandra will seek to position 0 and then do a sequential read of the file
> from there until it finds the key, right?

Sequential read of the index file, not the data file.

> D) Does the data for a key immediately follow the key in the file?

Yes.

> H) Going back to my previous example: if my key cache has a capacity of
> 100 keys, then I'll only have to scan the file for half the requests

Right.
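
To sanity-check that: under uniform random access the steady-state hit rate
of a cache is capacity / number of keys, so 100/200 = 1/2. A quick LRU
simulation in Java (illustrative only, not Cassandra's key cache
implementation) shows the same thing:

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Random;

public class KeyCacheHitRate {
    public static void main(String[] args) {
        final int capacity = 100, keys = 200, requests = 1000000;
        // An access-ordered LinkedHashMap that evicts its eldest entry
        // behaves as an LRU cache.
        Map<Integer, Boolean> lru =
                new LinkedHashMap<Integer, Boolean>(capacity, 0.75f, true) {
            protected boolean removeEldestEntry(Map.Entry<Integer, Boolean> e) {
                return size() > capacity;
            }
        };
        Random rnd = new Random(42);
        long hits = 0;
        for (int i = 0; i < requests; i++) {
            Integer key = rnd.nextInt(keys);
            if (lru.containsKey(key)) hits++;
            lru.put(key, Boolean.TRUE); // record the access
        }
        System.out.println("hit rate: " + (double) hits / requests); // ~0.5
    }
}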

> I never read a single row but rather ranges of
> rows with column slices. The sizes vary.

While key cache and row cache *can* speed up range slicing, usually
you don't have enough cache capacity for this to be useful in
practice.

> J) I've considered writing a partitioner that will chunk the rows together
> so that queries for "close" rows go to the same replica on the ring. Since
> the rows have close keys, they will be close together in the file and this
> will increase OS cache efficiency.

Sounds like ByteOrderedPartitioner to me.
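
For anyone following along, that is a cassandra.yaml setting, and one to
choose up front, since the partitioner cannot be changed once data has been
written:

partitioner: org.apache.cassandra.dht.ByteOrderedPartitioner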

> What do you think?

I think you should strongly consider denormalizing so that you can
read ranges from a single row instead.

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com