Posted to user@cassandra.apache.org by Philo Yang <ud...@gmail.com> on 2014/07/31 19:44:36 UTC

select many rows one time or select many times?

Hi all,

I have a Cassandra 2.0.6 cluster, and one of my tables looks like this:
CREATE TABLE word (
  user text,
  word text,
  flag double,
  PRIMARY KEY (user, word)
)

Each "user" has about 10000 "word" rows per node. I need to select all
rows where user='someuser' and word is in a large set of about 1000
values.

The C* documentation recommends against using "select ... in" like this:

select * from word where user='someuser' and word in ('a','b','aa','ab',...)

So for now I select all rows where user='someuser' and filter them on the
client rather than in C*. I use the DataStax Java Driver and page the
result set with setFetchSize(1000). Is this the best way? I found that the
system load is high because of the large range query. Should I instead
select one row at a time and run 1000 selects?

Like this:
select * from word where user='someuser' and word = 'a';
select * from word where user='someuser' and word = 'b';
select * from word where user='someuser' and word = 'c';
.....

Which method will put less pressure on the Cassandra cluster?

Thanks,
Philo Yang

RE: select many rows one time or select many times?

Posted by Mohammed Guller <mo...@glassbeam.com>.

Did you benchmark these two options:

1) Select with IN

2) Select all words and filter in application

Mohammed
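
A possible third option to benchmark alongside those two is to split the
word set into smaller IN lists, so that no single query carries all 1000
values. The Python below is only a sketch of that idea: it builds the CQL
text and bind values and does not talk to a cluster. The function names
and the chunk size of 100 are assumptions made for illustration, not
values from this thread, and the right chunk size would have to come from
the benchmark itself.

```python
from typing import Iterator, List, Sequence, Tuple


def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items`, each with at most `size` entries."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_in_queries(
    user: str, words: Sequence[str], chunk_size: int = 100
) -> List[Tuple[str, Tuple[str, ...]]]:
    """Return (cql, bind_values) pairs, one per chunk of `words`.

    chunk_size=100 is an arbitrary illustrative default, not a tuned value.
    """
    queries = []
    for chunk in chunked(words, chunk_size):
        # One '?' bind marker per word in this chunk.
        markers = ", ".join("?" for _ in chunk)
        cql = (
            "SELECT user, word, flag FROM word "
            f"WHERE user = ? AND word IN ({markers})"
        )
        queries.append((cql, (user, *chunk)))
    return queries
```

Each returned pair could then be prepared and executed with whatever
driver is in use. With 1000 words and a chunk size of 100 this gives 10
queries, somewhere between one very large IN and 1000 single-row selects.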

Re: select many rows one time or select many times?

Posted by Jack Krupansky <ja...@basetechnology.com>.

This doesn’t seem like a reasonable use case for Cassandra. I mean, it’s not a typical “database” use case.

-- Jack Krupansky

Re: select many rows one time or select many times?

Posted by "Laing, Michael" <mi...@nytimes.com>.

I don't think there is an easy "answer" to this...

A possible approach, based upon the implied dimensions of the problem,
would be to maintain a bloom filter over "words" for each user as a
partition key with the user as clustering key. Then a single query would
efficiently yield the list of users that "may" match and other techniques
could be used to refine that list down to actual matches.

ml
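
For readers unfamiliar with the structure mentioned above, here is a
minimal Python sketch of a bloom filter over a set of words. It only
illustrates the "may match" behaviour: a lookup can return a false
positive but never a false negative. The class name, the bit-array size,
the number of hashes and the use of SHA-256 are all assumptions chosen
for illustration. How such a filter would be stored and keyed in
Cassandra is not specified in this thread and is left open here.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Tiny bloom filter; sizes are illustrative, not tuned."""

    def __init__(self, num_bits: int = 16384, num_hashes: int = 4) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One bit per slot, packed eight to a byte.
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str) -> Iterator[int]:
        # Derive num_hashes bit positions by salting the item with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False means definitely absent; True means possibly present.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )
```

A filter built from one user's words answers might_contain(word) without
reading the rows themselves. Any hit still has to be confirmed against the
real data, which is the refinement step described above.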

