Posted to user@cassandra.apache.org by John Anderson <so...@gmail.com> on 2015/04/22 09:39:11 UTC

Reading hundreds of thousands of rows at once?

Hey, I'm looking at querying around 500,000 rows that I need to pull into a
Pandas data frame for processing.  Currently, testing this on a single
Cassandra node, it takes around 21 seconds:

https://gist.github.com/sontek/4ca95f5c5aa539663eaf
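
For reference, a minimal sketch of the kind of single-node read being timed,
assuming the DataStax cassandra-driver and pandas; the keyspace, table, and
column names below are placeholders, not the gist's actual code:

    # Hypothetical sketch, not the linked gist; names are placeholders.
    import time

    import pandas as pd
    from cassandra.cluster import Cluster

    cluster = Cluster(['127.0.0.1'])
    session = cluster.connect('survey_ks')   # placeholder keyspace

    start = time.time()
    stmt = session.prepare("SELECT * FROM survey_responses WHERE survey_id = ?")
    rows = session.execute(stmt, [1234])     # driver pages through all rows
    df = pd.DataFrame(list(rows))            # row namedtuples become DataFrame columns
    print('%d rows in %.1fs' % (len(df), time.time() - start))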

I tried introducing multiprocessing so I could use 4 processes at a time to
query this and I got it down to 14 seconds:

https://gist.github.com/sontek/542f13307ef9679c0094
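
The multiprocessing version isn't reproduced here either; the general shape of
a four-process split, with each worker opening its own connection and fetching
a slice of the partition keys (all names and key values are illustrative):

    # Illustrative sketch only, not the linked gist.
    from multiprocessing import Pool

    import pandas as pd
    from cassandra.cluster import Cluster

    def fetch_slice(survey_ids):
        # Each worker process gets its own cluster connection.
        cluster = Cluster(['127.0.0.1'])
        session = cluster.connect('survey_ks')   # placeholder keyspace
        stmt = session.prepare("SELECT * FROM survey_responses WHERE survey_id = ?")
        frames = [pd.DataFrame(list(session.execute(stmt, [sid]))) for sid in survey_ids]
        cluster.shutdown()
        return pd.concat(frames) if frames else pd.DataFrame()

    if __name__ == '__main__':
        all_ids = list(range(1, 41))                # placeholder partition keys
        chunks = [all_ids[i::4] for i in range(4)]  # one slice per process
        pool = Pool(4)
        df = pd.concat(pool.map(fetch_slice, chunks))
        pool.close()
        pool.join()
        print(len(df))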

Although shaving off 7 seconds is great, it still isn't really where I would
like to be in terms of performance; for this many rows I'd really like to
get down to a maximum of 1-2 seconds of query time.

What types of optimizations can I make to improve the read performance
when querying a large set of data?  Will this timing speed up linearly as I
add more nodes?

This is what the schema looks like currently:

https://gist.github.com/sontek/d6fa3fc1b6d085ad3fa4


I'm not tied to the current schema at all; it's mostly just a replication of
what we have in SQL Server. I'm more interested in what things I can change
to make querying it faster.

Thanks,
John

Re: Reading hundreds of thousands of rows at once?

Posted by Robert Wille <rw...@fold3.com>.
Add more nodes to your cluster.



Re: Reading hundreds of thousands of rows at once?

Posted by Anishek Agarwal <an...@gmail.com>.
You might also want to go through a thread here with the subject "High
latencies for simple queries".


Re: Reading hundreds of thousands of rows at once?

Posted by Anishek Agarwal <an...@gmail.com>.
I think these will help speed things up:

- Removing compression.
- You have a lot of independent columns mentioned. If you are always going to
query all of them together, one other thing that will help is to store a full
JSON (or some custom object representation) of the value data and change the
model to just have survey_id, hour_created, respondent_id, *json_value* (see
the sketch below).
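
A minimal sketch of that denormalized model and of turning off compression,
assuming Cassandra 2.1-era CQL; the keyspace, table name, and key types are
placeholders, not the poster's actual schema:

    # Hypothetical sketch; names and types are placeholders.
    from cassandra.cluster import Cluster

    cluster = Cluster(['127.0.0.1'])
    session = cluster.connect('survey_ks')    # placeholder keyspace

    # One JSON blob per row instead of many independent columns, so a
    # full-row read is a single cell fetch.
    session.execute("""
        CREATE TABLE IF NOT EXISTS survey_responses_blob (
            survey_id     int,
            hour_created  timestamp,
            respondent_id uuid,
            json_value    text,
            PRIMARY KEY (survey_id, hour_created, respondent_id)
        )
    """)

    # Disable SSTable compression (2.1-era syntax; newer versions use
    # compression = {'enabled': 'false'} instead).
    session.execute("""
        ALTER TABLE survey_responses_blob
        WITH compression = {'sstable_compression': ''}
    """)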
