Posted to user@cassandra.apache.org by Sylvain Lebresne <sy...@yakaz.com> on 2010/03/09 14:15:13 UTC

Bad read performance: 'few rows of many columns' vs 'many rows of few columns'

Hello,

I've done some tests and it seems that, somehow, having many rows with few
columns each is better than having few rows with many columns each, at least
as far as read performance is concerned.
Using stress.py, on a quad-core 2.27GHz machine with 4GB of RAM and the
out-of-the-box Cassandra configuration, I inserted:

  1) 50000000 rows (that's 50 million) with 1 column each
(stress.py -n 50000000 -c 1)
  2) 500000 rows (that's 500 thousand) with 100 columns each
(stress.py -n 500000 -c 100)

That is, it ends up with 50 million columns in both cases (I use such big
numbers so that in case 2 the resulting data is big enough not to fit in
the system caches; when it does fit, the problem I describe below doesn't
show).
Those two 'tests' were done separately, with the data flushed completely
between them. Each time I let Cassandra compact everything, then shut the
server down and started it again (so that no data sits in a memtable).
Then I tried reading columns, one at a time, using the commands below (a
sketch of the equivalent Thrift-level read follows them):
  1) stress.py -t 10 -o read -n 50000000 -c 1 -r
  2) stress.py -t 10 -o read -n 500000 -c 1 -r
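
For reference, a minimal sketch of roughly what such a single-column read
amounts to at the Thrift level (a get_slice capped at one column), assuming
the 0.6-era Python bindings generated into a 'cassandra' package and the
default Keyspace1/Standard1 column family that stress.py targets; the host,
port and key here are placeholders:

    from thrift.transport import TSocket, TTransport
    from thrift.protocol import TBinaryProtocol
    from cassandra import Cassandra            # 0.6-era Thrift-generated bindings (assumed layout)
    from cassandra.ttypes import (ColumnParent, SlicePredicate,
                                  SliceRange, ConsistencyLevel)

    # Connect to a local node on the default Thrift port (use TFramedTransport
    # instead if the server is configured for framed transport).
    socket = TSocket.TSocket('localhost', 9160)
    transport = TTransport.TBufferedTransport(socket)
    transport.open()
    client = Cassandra.Client(TBinaryProtocol.TBinaryProtocol(transport))

    # A single-column read of one row: a get_slice capped at one column,
    # which is roughly what a '-c 1' read boils down to.
    predicate = SlicePredicate(slice_range=SliceRange(start='', finish='', count=1))
    columns = client.get_slice('Keyspace1', 'some_key',
                               ColumnParent(column_family='Standard1'),
                               predicate, ConsistencyLevel.ONE)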

In case 1) I get around 200 reads/second and that's pretty stable. The disk
is spinning like crazy (~25% io_wait), very little CPU or memory is used, and
performance is IO-bound, which is expected.
In case 2), however, it starts with reasonable performance (400+
reads/second), but it very quickly drops to an average of 80 reads/second
(after a minute and a half or so). And it doesn't go up significantly after
that. It turns out this seems to be a GC problem. Indeed, the info log (I'm
running trunk from today, but I first saw the problem on an older version of
trunk) shows lines like the following every few seconds:
  GC for ConcurrentMarkSweep: 4599 ms, 57247304 reclaimed leaving
1033481216 used; max is 1211498496
I'm not surprised that performance is bad with such GC pauses; I'm surprised
to be getting such GC pauses at all.

Note that in case 1) the resulting data 'weighs' ~14GB, while in case 2) it
'weighs' only ~2.4GB.

Let me add that I used stress.py to try to identify the problem, but I first
ran into it in an application I'm writing where I had rows with around 1000
columns of 30K each. With about 1000 rows, I had awful performance, like 5
reads/second on average. I tried switching to 1 million rows, each with 1
column of 30K, and ended up with more than 300 reads/second.

Any ideas or insights? Am I doing something utterly wrong?
Thanks in advance.

--
Sylvain

Re: Bad read performance: 'few rows of many columns' vs 'many rows of few columns'

Posted by Jesse McConnell <je...@gmail.com>.
In my experience #2 will work well up to the point where it triggers a
limitation of Cassandra (slated to be resolved in 0.7 \o/) whereby all of
the columns under a given key must be able to fit into memory. For things
like indexes of data I have opted to shard the keys for really large data
sets to get around this until it's fixed (see the sketch below).
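
A minimal sketch of that kind of key sharding, assuming you derive a shard
suffix from the column name so that one logical row is spread over a fixed
number of physical rows; the shard count, naming scheme and hash choice
below are arbitrary illustrations, not anything Cassandra provides:

    import hashlib

    NUM_SHARDS = 16  # illustrative: pick so each physical row stays comfortably small

    def shard_key(logical_key, column_name, num_shards=NUM_SHARDS):
        # The same column always maps to the same shard, so point reads and
        # writes only need to compute the shard; reading the whole logical
        # row means one slice per physical row ('bigindex:0' .. 'bigindex:15')
        # followed by a merge on the client side.
        digest = hashlib.md5(column_name.encode('utf-8')).hexdigest()
        return '%s:%d' % (logical_key, int(digest, 16) % num_shards)

    # e.g. a write for ('bigindex', 'user42') goes to the row shard_key('bigindex', 'user42')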

I suspect that if you doubled the test for #2 once or twice you'd start
seeing OOMs.

Also, #2 will end up with a lumpy distribution around the cluster, since all
the data under a given key needs to fit on one machine; #1 will spread out a
bit more evenly.

cheers,
jesse

--
jesse mcconnell
jesse.mcconnell@gmail.com




Re: Bad read performance: 'few rows of many columns' vs 'many rows of few columns'

Posted by Jonathan Ellis <jb...@gmail.com>.
For the record, I note that "no row cache" is the default on
user-defined CFs; we include it in the sample configuration file as an
example only.


Re: Bad read performance: 'few rows of many columns' vs 'many rows of few columns'

Posted by Sylvain Lebresne <sy...@yakaz.com>.
> So did you disable the row cache entirely?

Yes (and I'm getting back reasonable performance).


RE: Bad read performance: 'few rows of many columns' vs 'many rows of few columns'

Posted by David Dabbs <dm...@gmail.com>.
So did you disable the row cache entirely?


Re: Bad read performance: 'few rows of many columns' vs 'many rows of few columns'

Posted by Sylvain Lebresne <sy...@yakaz.com>.
Well, I've found the reason.
The default Cassandra configuration uses a 10% row cache, and the row cache
reads the whole row each time. So it was indeed reading the full row on every
request, even though the request was asking for only one column.

My bad (at least I learned something).

--
Sylvain


Re: Bad read performance: 'few rows of many columns' vs 'many rows of few columns'

Posted by Brandon Williams <dr...@gmail.com>.
On Tue, Mar 9, 2010 at 2:28 PM, Sylvain Lebresne <sy...@yakaz.com> wrote:

> > A row causes a disk seek while columns are contiguous.  So if the row
> isn't
> > in the cache, you're being impaired by the seeks.  In general, fatter
> rows
> > should be more performant than skinny ones.
>
> Sure, I understand that. Still, I get 400 columns per second (i.e., 400 seeks
> per second) when the rows only have one column each, while I get 10 columns
> per second when the rows have 100 columns, even though I read only the first
> column.
>

Doesn't that imply the disk is having to seek further for the rows with more
columns?

-Brandon

Re: Bad read performance: 'few rows of many columns' vs 'many rows of few columns'

Posted by Sylvain Lebresne <sy...@yakaz.com>.
> A row causes a disk seek while columns are contiguous.  So if the row isn't
> in the cache, you're being impaired by the seeks.  In general, fatter rows
> should be more performant than skinny ones.

Sure, I understand that. Still, I get 400 columns per second (i.e., 400 seeks
per second) when the rows only have one column each, while I get 10 columns
per second when the rows have 100 columns, even though I read only the first
column.

--
Sylvain

Re: Bad read performance: 'few rows of many columns' vs 'many rows of few columns'

Posted by Brandon Williams <dr...@gmail.com>.
On Tue, Mar 9, 2010 at 1:14 PM, Sylvain Lebresne <sy...@yakaz.com> wrote:

> I've inserted 1000 rows of 100 columns each (python stress.py -t 2 -n
> 1000 -c 100 -i 5).
> When I read, I get roughly the same number of rows per second whether I read
> the whole row (python stress.py -t 10 -n 1000 -o read -r -c 100) or only the
> first column (python stress.py -t 10 -n 1000 -o read -r -c 1), and that's
> less than 10 rows per second.
>
> So sure, when I read the whole row, that's almost 1000 columns per second,
> which is roughly 50MB/s of throughput, which is quite good. But when I read
> only the first column, I get 10 columns per second, that is 500KB/s, which is
> less good. Now, from what I've understood so far, Cassandra doesn't
> deserialize the whole row to read a single column (I'm not using
> supercolumns here), so I don't understand those numbers.
>

A row causes a disk seek while columns are contiguous.  So if the row isn't
in the cache, you're being impaired by the seeks.  In general, fatter rows
should be more performant than skinny ones.

-Brandon

Re: Bad read performance: 'few rows of many columns' vs 'many rows of few columns'

Posted by Sylvain Lebresne <sy...@yakaz.com>.
Alright,

What I'm observing shows up better with bigger columns, so I've slightly
modified the stress.py test so that it inserts columns of 50K bytes (I
attach the modified stress.py for reference, but it really just reads 50000
bytes from /dev/null and uses that as the column data; I also added a sleep
between inserts, otherwise Cassandra dies during the insertion :)).
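
A minimal sketch of that kind of loaded insert, assuming the same 0.6-era
Thrift bindings as the read sketch earlier in the thread and a 'client'
connection built the same way; the keyspace/column family names, the
os.urandom payload and the delay are illustrative choices, not what the
attached script does:

    import os
    import time
    from cassandra.ttypes import ColumnPath, ConsistencyLevel

    VALUE = os.urandom(50000)  # any ~50K blob will do; os.urandom is just convenient here

    def insert_fat_rows(client, keyspace='Keyspace1', cf='Standard1',
                        rows=1000, columns=100, delay=0.005):
        # 'rows' rows of 'columns' columns of ~50K each, with a small pause
        # between inserts so the node isn't overwhelmed during the load.
        for r in range(rows):
            key = 'row%d' % r
            for c in range(columns):
                path = ColumnPath(column_family=cf, column='col%09d' % c)
                client.insert(keyspace, key, path, VALUE,
                              int(time.time() * 1000000), ConsistencyLevel.ONE)
                time.sleep(delay)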

I'm also using 0.6-beta2 from the Cassandra website, and I've given 1.5GB
of RAM to Cassandra just in case.

I've inserted 1000 rows of 100 columns each (python stress.py -t 2 -n
1000 -c 100 -i 5).
When I read, I get roughly the same number of rows per second whether I read
the whole row (python stress.py -t 10 -n 1000 -o read -r -c 100) or only the
first column (python stress.py -t 10 -n 1000 -o read -r -c 1), and that's
less than 10 rows per second.

So sure, when I read the whole row, that's almost 1000 columns per second,
which is roughly 50MB/s of throughput, which is quite good. But when I read
only the first column, I get 10 columns per second, that is 500KB/s, which is
less good. Now, from what I've understood so far, Cassandra doesn't
deserialize the whole row to read a single column (I'm not using
supercolumns here), so I don't understand those numbers.

Plus, if I insert the same data but 'inlining' everything, that is 100000
rows of 1 column each, then I get read performance of around 400 columns per
second. Does that mean I should put columns in the same row only if every
request will read at least 40 columns at a time (40 being roughly the 400
columns/second of the one-column case divided by the 10 rows/second of the
fat-row case)?

Just to explain why I'm running these tests, let me quickly describe what
I'm trying to do. I need to store images that are geographically localized.
When I request them, I request 5 to 10 images that are geographically close
to each other. My idea is to have row keys that are the id of a delimited
region and column names that are the actual geographic positions of the
images (the column values are the image data). Each region (row) will have
from 10 to around 10000 images (columns) at most, and getting my 5-10
geographically close images then just amounts to a get_slice (see the
sketch below).
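
A rough sketch of that data model, assuming the same 0.6-era Thrift client
as in the earlier sketches; the 'Images' column family, the region id and
the position encoding are made up for illustration:

    from cassandra.ttypes import (ColumnParent, SlicePredicate,
                                  SliceRange, ConsistencyLevel)

    def nearby_images(client, region_id, start_position, count=10):
        # Columns are named by an encoded geographic position, so the column
        # family's comparator keeps them sorted by position; a slice starting
        # at the query position returns the next 'count' images in that region.
        predicate = SlicePredicate(slice_range=SliceRange(
            start=start_position,  # e.g. a zero-padded / geohash-like encoding
            finish='',             # no upper bound; rely on 'count' instead
            count=count))
        return client.get_slice('Keyspace1', region_id,
                                ColumnParent(column_family='Images'),  # hypothetical CF
                                predicate, ConsistencyLevel.ONE)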

But when I do that, I get bad read performance (4-5 rows/sec, that is 50
images max per second, and less than that on average). I get better
performance by putting one image per row, and that makes me really sad, as
it means using Cassandra as a basic key/value store without the free
sorting. And I want my free sorting :(

Thanks in advance for any explanation/help.

Cheers,
Sylvain


Re: Bad read performance: 'few rows of many columns' vs 'many rows of few columns'

Posted by Jonathan Ellis <jb...@gmail.com>.
On Tue, Mar 9, 2010 at 8:31 AM, Sylvain Lebresne <sy...@yakaz.com> wrote:
> Well, unless I'm mistaken, that's the same in my example, as in both cases I
> give stress.py the option '-c 1', which tells it to retrieve only one column
> each time, even in the case where I have 100 columns per row.

Oh.

Why would you do that? :)

Re: Bad read performance: 'few rows of many columns' vs 'many rows of few columns'

Posted by Sylvain Lebresne <sy...@yakaz.com>.
On Tue, Mar 9, 2010 at 2:52 PM, Jonathan Ellis <jb...@gmail.com> wrote:
> By "reads" do you mean what stress.py counts (rows) or rows * columns?
>  If it is rows, then you are still actually reading more columns/s in
> case 2.

Well, unless I'm mistaken, that's the same in my example, as in both cases I
give stress.py the option '-c 1', which tells it to retrieve only one column
each time, even in the case where I have 100 columns per row.

>> And it doesn't go up significantly after that. It turns out this seems to be
>> a GC problem. Indeed, the info log (I'm running trunk from today, but I first
>> saw the problem on an older version of trunk) shows lines like the following
>> every few seconds:
>>  GC for ConcurrentMarkSweep: 4599 ms, 57247304 reclaimed leaving
>> 1033481216 used; max is 1211498496
>
> First, use the 0.6 branch, not trunk.  We're breaking stuff over there.

Fair enough, I will do the test with 0.6. But again, I saw this behavior
with a trunk from about 3 weeks ago, so I don't believe it to be something
that broke recently. But I admit I should have tried with 0.6, and I will
do it.

> What happens if you give the jvm 50% more ram?

A quick test doesn't show the problem with 50% more RAM, at least not in a
short time frame. But I'm still not convinced there is no problem; I saw
pretty weird performance with bigger columns. Let me try to come up with a
more compelling test against 0.6. I'll keep you posted, even if I'm wrong :)

> Are you using a 64-bit JVM?

yep

--
Sylvain

Re: Bad read performance: 'few rows of many columns' vs 'many rows of few columns'

Posted by Jonathan Ellis <jb...@gmail.com>.
On Tue, Mar 9, 2010 at 7:15 AM, Sylvain Lebresne <sy...@yakaz.com> wrote:
>  1) stress.py -t 10 -o read -n 50000000 -c 1 -r
>  2) stress.py -t 10 -o read -n 500000 -c 1 -r
>
> In case 1) I get around 200 reads/second and that's pretty stable. The disk
> is spinning like crazy (~25% io_wait), very little CPU or memory is used, and
> performance is IO-bound, which is expected.
> In case 2), however, it starts with reasonable performance (400+
> reads/second), but it very quickly drops to an average of 80 reads/second

By "reads" do you mean what stress.py counts (rows) or rows * columns?
 If it is rows, then you are still actually reading more columns/s in
case 2.

> And it doesn't go up significantly after that. It turns out this seems to be
> a GC problem. Indeed, the info log (I'm running trunk from today, but I first
> saw the problem on an older version of trunk) shows lines like the following
> every few seconds:
>  GC for ConcurrentMarkSweep: 4599 ms, 57247304 reclaimed leaving
> 1033481216 used; max is 1211498496

First, use the 0.6 branch, not trunk.  We're breaking stuff over there.

What happens if you give the jvm 50% more ram?

Are you using a 64-bit JVM?

-Jonathan