Posted to user@cassandra.apache.org by James Golick <ja...@gmail.com> on 2010/04/13 20:00:30 UTC

Reading thousands of columns

Hi All,

I'm seeing about 35-50ms to read 1000 columns from a CF using
get_range_slices. The columns are TimeUUIDType with empty values.

The row cache is enabled and I'm running the query 500 times in a row, so I
can only assume the row is cached.

Is that about what's expected or am I doing something wrong? (It's from java
this time, so it's not ruby thrift being slow).
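
For reference, the query looks roughly like this (a minimal sketch
against the 0.6-era Thrift API from Java; the keyspace, column family,
and key names are placeholders, not my actual schema):

    import java.util.List;
    import org.apache.cassandra.thrift.*;
    import org.apache.thrift.protocol.TBinaryProtocol;
    import org.apache.thrift.transport.TSocket;
    import org.apache.thrift.transport.TTransport;

    public class SliceTimer {
        public static void main(String[] args) throws Exception {
            TTransport transport = new TSocket("localhost", 9160);
            transport.open();
            Cassandra.Client client =
                new Cassandra.Client(new TBinaryProtocol(transport));

            // ask for up to 1000 columns, in comparator order
            SliceRange range = new SliceRange();
            range.setStart(new byte[0]);   // empty start/finish = unbounded
            range.setFinish(new byte[0]);
            range.setReversed(false);
            range.setCount(1000);
            SlicePredicate predicate = new SlicePredicate();
            predicate.setSlice_range(range);

            // a key "range" that covers just the one row
            KeyRange keyRange = new KeyRange();
            keyRange.setStart_key("somekey");
            keyRange.setEnd_key("somekey");
            keyRange.setCount(1);

            long t0 = System.nanoTime();
            List<KeySlice> slices = client.get_range_slices(
                "Keyspace1", new ColumnParent("Events"),
                predicate, keyRange, ConsistencyLevel.ONE);
            System.out.println(slices.get(0).getColumnsSize()
                + " columns in " + (System.nanoTime() - t0) / 1000000 + "ms");
            transport.close();
        }
    }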

- James

Re: Reading thousands of columns

Posted by Gautam Singaraju <ga...@gmail.com>.
Yes, I find that get_range_slices takes an incredibly long time to
return the results.
---
Gautam



On Tue, Apr 13, 2010 at 2:00 PM, James Golick <ja...@gmail.com> wrote:
> Hi All,
> I'm seeing about 35-50ms to read 1000 columns from a CF using
> get_range_slices. The columns are TimeUUIDType with empty values.
> The row cache is enabled and I'm running the query 500 times in a row, so I
> can only assume the row is cached.
> Is that about what's expected or am I doing something wrong? (It's from java
> this time, so it's not ruby thrift being slow).
> - James

Re: Reading thousands of columns

Posted by James Golick <ja...@gmail.com>.
That helped a little, but it's still quite slow. Now it's around 20-35ms
on average, sometimes as high as 70ms.

On Wed, Apr 14, 2010 at 8:50 AM, James Golick <ja...@gmail.com> wrote:

> Right - that makes sense. I'm only fetching one row. I'll give it a try with
> get_slice().
>
> Thanks,
>
> -James
>
>
> On Wed, Apr 14, 2010 at 7:45 AM, Jonathan Ellis <jb...@gmail.com> wrote:
>
>> 35-50ms for how many rows of 1000 columns each?
>>
>> get_range_slices does not use the row cache, for the same reason that
>> Oracle doesn't cache tuples from sequential scans -- evicting 1000s of
>> recently used rows that were queried by key, to make room for a swath
>> of rows from one scan, is the wrong call more often than it is the
>> right one.
>>
>> On Tue, Apr 13, 2010 at 1:00 PM, James Golick <ja...@gmail.com>
>> wrote:
>> > Hi All,
>> > I'm seeing about 35-50ms to read 1000 columns from a CF using
>> > get_range_slices. The columns are TimeUUIDType with empty values.
>> > The row cache is enabled and I'm running the query 500 times in a row,
>> so I
>> > can only assume the row is cached.
>> > Is that about what's expected or am I doing something wrong? (It's from
>> java
>> > this time, so it's not ruby thrift being slow).
>> > - James
>>
>
>

Re: Reading thousands of columns

Posted by James Golick <ja...@gmail.com>.
Right - that makes sense. I'm only fetching one row. I'll give it a try with
get_slice().
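
Something like this, presumably (again just a sketch against the
0.6-era Thrift API; same placeholder names as before):

    // get_slice reads one row by key, and single-row reads do go
    // through the row cache, unlike get_range_slices
    SliceRange range = new SliceRange();
    range.setStart(new byte[0]);
    range.setFinish(new byte[0]);
    range.setReversed(false);
    range.setCount(1000);
    SlicePredicate predicate = new SlicePredicate();
    predicate.setSlice_range(range);

    List<ColumnOrSuperColumn> columns = client.get_slice(
        "Keyspace1", "somekey", new ColumnParent("Events"),
        predicate, ConsistencyLevel.ONE);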

Thanks,

-James

On Wed, Apr 14, 2010 at 7:45 AM, Jonathan Ellis <jb...@gmail.com> wrote:

> 35-50ms for how many rows of 1000 columns each?
>
> get_range_slices does not use the row cache, for the same reason that
> Oracle doesn't cache tuples from sequential scans -- evicting 1000s of
> recently used rows that were queried by key, to make room for a swath
> of rows from one scan, is the wrong call more often than it is the
> right one.
>
> On Tue, Apr 13, 2010 at 1:00 PM, James Golick <ja...@gmail.com>
> wrote:
> > Hi All,
> > I'm seeing about 35-50ms to read 1000 columns from a CF using
> > get_range_slices. The columns are TimeUUIDType with empty values.
> > The row cache is enabled and I'm running the query 500 times in a row, so
> I
> > can only assume the row is cached.
> > Is that about what's expected or am I doing something wrong? (It's from
> java
> > this time, so it's not ruby thrift being slow).
> > - James
>

Re: Reading thousands of columns

Posted by Jonathan Ellis <jb...@gmail.com>.
How long to read just 10 columns?
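
(That is, the same slice with the count turned down; a sketch reusing
the placeholder names from the earlier snippets:)

    // identical query, but only ask for 10 columns
    range.setCount(10);
    predicate.setSlice_range(range);

    long t0 = System.nanoTime();
    client.get_slice("Keyspace1", "somekey", new ColumnParent("Events"),
        predicate, ConsistencyLevel.ONE);
    System.out.println((System.nanoTime() - t0) / 1000000.0 + " ms");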

On Wed, Apr 14, 2010 at 3:19 PM, James Golick <ja...@gmail.com> wrote:
> The values are empty. It's 3000 UUIDs.
>
> On Wed, Apr 14, 2010 at 12:40 PM, Avinash Lakshman
> <av...@gmail.com> wrote:
>>
>> How large are the values? How much data on disk?
>>
>> On Wednesday, April 14, 2010, James Golick <ja...@gmail.com> wrote:
>> > Just for the record, I am able to repeat this locally.
>> > I'm seeing around 150ms to read 1000 columns from a row that has 3000 in
>> > it. If I enable the rowcache, that goes down to about 90ms. According to my
>> > profile, 90% of the time is being spent waiting for cassandra to respond, so
>> > it's not thrift.
>> >
>> > On Wed, Apr 14, 2010 at 11:01 AM, Paul Prescod <pr...@gmail.com>
>> > wrote:
>> >
>> > On Wed, Apr 14, 2010 at 10:31 AM, Mike Malone <mi...@simplegeo.com>
>> > wrote:
>> >> ...
>> >>
>> >> Couldn't you cache a list of keys that were returned for the key range,
>> >> then
>> >> cache individual rows separately or not at all?
>> >> By "blowing away rows queried by key" I'm guessing you mean "pushing
>> >> them
>> >> out of the LRU cache," not explicitly blowing them away? Either way I'm
>> >> not
>> >> entirely convinced. In my experience I've had pretty good success
>> >> caching
>> >> items that were pulled out via more complicated join / range type
>> >> queries.
>> >> If your system is doing lots of range queries, and not a lot of lookups
>> >> by
>> >> key, you'd obviously see a performance win from caching the range
>> >> queries.
>> >> Maybe range scan caching could be turned on separately?
>> >
>> > I agree with you that the caches should be separate, if you're going
>> > to cache ranges. You could imagine a single query (perhaps entered
>> > interactively) replacing the entire contents of the row cache, all of
>> > the data for the system's interactive users. For example, a summary
>> > page of who is most active over the last month could replace the
>> > profile information for the actual users who are using the system at
>> > that moment.
>> >
>> >  Paul Prescod
>> >
>> >
>> >
>
>

Re: Reading thousands of columns

Posted by James Golick <ja...@gmail.com>.
The values are empty. It's 3000 UUIDs.

On Wed, Apr 14, 2010 at 12:40 PM, Avinash Lakshman <
avinash.lakshman@gmail.com> wrote:

> How large are the values? How much data on disk?
>
> On Wednesday, April 14, 2010, James Golick <ja...@gmail.com> wrote:
> > Just for the record, I am able to repeat this locally.
> > I'm seeing around 150ms to read 1000 columns from a row that has 3000 in
> it. If I enable the rowcache, that goes down to about 90ms. According to my
> profile, 90% of the time is being spent waiting for cassandra to respond, so
> it's not thrift.
> >
> > On Wed, Apr 14, 2010 at 11:01 AM, Paul Prescod <pr...@gmail.com>
> wrote:
> >
> > On Wed, Apr 14, 2010 at 10:31 AM, Mike Malone <mi...@simplegeo.com>
> wrote:
> >> ...
> >>
> >> Couldn't you cache a list of keys that were returned for the key range,
> then
> >> cache individual rows separately or not at all?
> >> By "blowing away rows queried by key" I'm guessing you mean "pushing
> them
> >> out of the LRU cache," not explicitly blowing them away? Either way I'm
> not
> >> entirely convinced. In my experience I've had pretty good success
> caching
> >> items that were pulled out via more complicated join / range type
> queries.
> >> If your system is doing lots of range queries, and not a lot of lookups
> by
> >> key, you'd obviously see a performance win from caching the range
> queries.
> >> Maybe range scan caching could be turned on separately?
> >
> > I agree with you that the caches should be separate, if you're going
> > to cache ranges. You could imagine a single query (perhaps entered
> > interactively) replacing the entire contents of the row cache, all of
> > the data for the system's interactive users. For example, a summary
> > page of who is most active over the last month could replace the
> > profile information for the actual users who are using the system at
> > that moment.
> >
> >  Paul Prescod
> >
> >
> >
>

Re: Reading thousands of columns

Posted by Avinash Lakshman <av...@gmail.com>.
How large are the values? How much data on disk?

On Wednesday, April 14, 2010, James Golick <ja...@gmail.com> wrote:
> Just for the record, I am able to repeat this locally.
> I'm seeing around 150ms to read 1000 columns from a row that has 3000 in it. If I enable the rowcache, that goes down to about 90ms. According to my profile, 90% of the time is being spent waiting for cassandra to respond, so it's not thrift.
>
> On Wed, Apr 14, 2010 at 11:01 AM, Paul Prescod <pr...@gmail.com> wrote:
>
> On Wed, Apr 14, 2010 at 10:31 AM, Mike Malone <mi...@simplegeo.com> wrote:
>> ...
>>
>> Couldn't you cache a list of keys that were returned for the key range, then
>> cache individual rows separately or not at all?
>> By "blowing away rows queried by key" I'm guessing you mean "pushing them
>> out of the LRU cache," not explicitly blowing them away? Either way I'm not
>> entirely convinced. In my experience I've had pretty good success caching
>> items that were pulled out via more complicated join / range type queries.
>> If your system is doing lots of range queries, and not a lot of lookups by
>> key, you'd obviously see a performance win from caching the range queries.
>> Maybe range scan caching could be turned on separately?
>
> I agree with you that the caches should be separate, if you're going
> to cache ranges. You could imagine a single query (perhaps entered
> interactively) replacing the entire contents of the row cache, all of
> the data for the system's interactive users. For example, a summary
> page of who is most active over the last month could replace the
> profile information for the actual users who are using the system at
> that moment.
>
>  Paul Prescod
>
>
>

Re: Reading thousands of columns

Posted by James Golick <ja...@gmail.com>.
Just for the record, I am able to repeat this locally.

I'm seeing around 150ms to read 1000 columns from a row that has 3000 in it.
If I enable the rowcache, that goes down to about 90ms. According to my
profile, 90% of the time is being spent waiting for cassandra to respond, so
it's not thrift.

On Wed, Apr 14, 2010 at 11:01 AM, Paul Prescod <pr...@gmail.com> wrote:

> On Wed, Apr 14, 2010 at 10:31 AM, Mike Malone <mi...@simplegeo.com> wrote:
> > ...
> >
> > Couldn't you cache a list of keys that were returned for the key range,
> then
> > cache individual rows separately or not at all?
> > By "blowing away rows queried by key" I'm guessing you mean "pushing them
> > out of the LRU cache," not explicitly blowing them away? Either way I'm
> not
> > entirely convinced. In my experience I've had pretty good success caching
> > items that were pulled out via more complicated join / range type
> queries.
> > If your system is doing lots of range queries, and not a lot of lookups
> by
> > key, you'd obviously see a performance win from caching the range
> queries.
> > Maybe range scan caching could be turned on separately?
>
> I agree with you that the caches should be separate, if you're going
> to cache ranges. You could imagine a single query (perhaps entered
> interactively) replacing the entire contents of the row cache, all of
> the data for the system's interactive users. For example, a summary
> page of who is most active over the last month could replace the
> profile information for the actual users who are using the system at
> that moment.
>
>  Paul Prescod
>

Re: Reading thousands of columns

Posted by Paul Prescod <pr...@gmail.com>.
On Wed, Apr 14, 2010 at 10:31 AM, Mike Malone <mi...@simplegeo.com> wrote:
> ...
>
> Couldn't you cache a list of keys that were returned for the key range, then
> cache individual rows separately or not at all?
> By "blowing away rows queried by key" I'm guessing you mean "pushing them
> out of the LRU cache," not explicitly blowing them away? Either way I'm not
> entirely convinced. In my experience I've had pretty good success caching
> items that were pulled out via more complicated join / range type queries.
> If your system is doing lots of range queries, and not a lot of lookups by
> key, you'd obviously see a performance win from caching the range queries.
> Maybe range scan caching could be turned on separately?

I agree with you that the caches should be separate, if you're going
to cache ranges. You could imagine a single query (perhaps entered
interactively) replacing the entire contents of the row cache, all of
the data for the system's interactive users. For example, a summary
page of who is most active over the last month could replace the
profile information for the actual users who are using the system at
that moment.

 Paul Prescod

Re: Reading thousands of columns

Posted by Mike Malone <mi...@simplegeo.com>.
On Wed, Apr 14, 2010 at 7:45 AM, Jonathan Ellis <jb...@gmail.com> wrote:

> 35-50ms for how many rows of 1000 columns each?
>
> get_range_slices does not use the row cache, for the same reason that
> Oracle doesn't cache tuples from sequential scans -- evicting 1000s of
> recently used rows that were queried by key, to make room for a swath
> of rows from one scan, is the wrong call more often than it is the
> right one.


Couldn't you cache a list of keys that were returned for the key range, then
cache individual rows separately or not at all?

By "blowing away rows queried by key" I'm guessing you mean "pushing them
out of the LRU cache," not explicitly blowing them away? Either way I'm not
entirely convinced. In my experience I've had pretty good success caching
items that were pulled out via more complicated join / range type queries.
If your system is doing lots of range queries, and not a lot of lookups by
key, you'd obviously see a performance win from caching the range queries.
Maybe range scan caching could be turned on separately?
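
Something like this, roughly (a made-up sketch of the idea, not
anything Cassandra does today; all names here are invented):

    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    // Two-level scheme: remember which keys a range query returned,
    // and cache the rows themselves separately, so a big scan cannot
    // evict hot rows from the row cache.
    public class RangeCacheSketch<K, V> {

        // simple LRU via LinkedHashMap's access-order mode
        private static <A, B> Map<A, B> lru(final int capacity) {
            return new LinkedHashMap<A, B>(16, 0.75f, true) {
                protected boolean removeEldestEntry(Map.Entry<A, B> e) {
                    return size() > capacity;
                }
            };
        }

        private final Map<String, List<K>> rangeToKeys = lru(1000);
        private final Map<K, V> rowCache = lru(100000);

        // key list for a previously seen range, or null on a miss
        public List<K> cachedKeysFor(String rangeId) {
            return rangeToKeys.get(rangeId);
        }

        // a new scan evicts old ranges, never cached rows
        public void recordRange(String rangeId, List<K> keys) {
            rangeToKeys.put(rangeId, keys);
        }

        public V row(K key) { return rowCache.get(key); }

        public void putRow(K key, V row) { rowCache.put(key, row); }
    }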

Mike

Re: Reading thousands of columns

Posted by Jonathan Ellis <jb...@gmail.com>.
35-50ms for how many rows of 1000 columns each?

get_range_slices does not use the row cache, for the same reason that
Oracle doesn't cache tuples from sequential scans -- evicting 1000s of
recently used rows that were queried by key, to make room for a swath
of rows from one scan, is the wrong call more often than it is the
right one.

On Tue, Apr 13, 2010 at 1:00 PM, James Golick <ja...@gmail.com> wrote:
> Hi All,
> I'm seeing about 35-50ms to read 1000 columns from a CF using
> get_range_slices. The columns are TimeUUIDType with empty values.
> The row cache is enabled and I'm running the query 500 times in a row, so I
> can only assume the row is cached.
> Is that about what's expected or am I doing something wrong? (It's from java
> this time, so it's not ruby thrift being slow).
> - James