You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@cassandra.apache.org by David Boxenhorn <da...@taotown.com> on 2011/04/13 18:23:53 UTC

Indexes on heterogeneous rows

Is it possible in 0.7.x to have indexes on heterogeneous rows, which have
different sets of columns?

For example, let's say you have three types of objects (1, 2, 3) which each
had three members. If your rows had the following pattern

type=1 a=? b=? c=?
type=2 d=? e=? f=?
type=3 g=? h=? i=?

could you index "type" as your primary index, and also index "a", "e", "h"
as secondary indexes, to get the objects of that type that you are looking
for?

Would it work if you had billions of rows of each type?

Re: Indexes on heterogeneous rows

Posted by "Wangpei (Peter)" <pe...@huawei.com>.

Does the get_indexed_slice in 0.7.4 version already do thing that way?
It seems always take the 1st indexed column with EQ.
Or is it a new feature of coming 0.7.5 or 0.8?

-----邮件原件-----
发件人: Jonathan Ellis [mailto:jbellis@gmail.com] 
发送时间: 2011年4月15日 0:21
收件人: user@cassandra.apache.org
抄送: David Boxenhorn; aaron morton
主题: Re: Indexes on heterogeneous rows

This should work reasonably well w/ 0.7 indexes. Cassandra tracks
statistics on index selectivity, so it would plan that query as "index
lookup on e=5, then iterate over those results and return only rows
that also have type=2."

On Thu, Apr 14, 2011 at 5:33 AM, David Boxenhorn <da...@taotown.com> wrote:
> Thank you for your answer, and sorry about the sloppy terminology.
>
> I'm thinking of the scenario where there are a small number of results in
> the result set, but there are billions of rows in the first of your
> secondary indexes.
>
> That is, I want to do something like (not sure of the CQL syntax):
>
> select * where type=2 and e=5
>
> where there are billions of rows of type 2, but some manageable number of
> those rows have e=5.
>
> As I understand it, secondary indexes are like column families, where each
> value is a column. So the billions of rows where type=2 would go into a
> single row of the secondary index. This sounds like a problem to me, is it?
>
> I'm assuming that the billions of rows that don't have column "e" at all
> (those rows of other types) are not a problem at all...
>
> On Thu, Apr 14, 2011 at 12:12 PM, aaron morton <aa...@thelastpickle.com>
> wrote:
>>
>> Need to clear up some terminology here.
>> Rows have a key and can be retrieved by key. This is *sort of* the primary
>> index, but not primary in the normal RDBMS sense.
>> Rows can have different columns and the column names are sorted and can be
>> efficiently selected.
>> There are "secondary indexes" in cassandra 0.7 based on column
>> values http://www.datastax.com/dev/blog/whats-new-cassandra-07-secondary-indexes
>> So you could create secondary indexes on the a,e, and h columns and get
>> rows that have specific values. There are some limitations to secondary
>> indexes, read the linked article.
>> Or you can make your own secondary indexes using row keys as the index
>> values.
>> If you have billions of rows, how many do you need to read back at once?
>> Hope that helps
>> Aaron
>>
>> On 14 Apr 2011, at 04:23, David Boxenhorn wrote:
>>
>> Is it possible in 0.7.x to have indexes on heterogeneous rows, which have
>> different sets of columns?
>>
>> For example, let's say you have three types of objects (1, 2, 3) which
>> each had three members. If your rows had the following pattern
>>
>> type=1 a=? b=? c=?
>> type=2 d=? e=? f=?
>> type=3 g=? h=? i=?
>>
>> could you index "type" as your primary index, and also index "a", "e", "h"
>> as secondary indexes, to get the objects of that type that you are looking
>> for?
>>
>> Would it work if you had billions of rows of each type?
>>
>
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com

Re: Indexes on heterogeneous rows

Posted by Jonathan Ellis <jb...@gmail.com>.

This should work reasonably well w/ 0.7 indexes. Cassandra tracks
statistics on index selectivity, so it would plan that query as "index
lookup on e=5, then iterate over those results and return only rows
that also have type=2."

On Thu, Apr 14, 2011 at 5:33 AM, David Boxenhorn <da...@taotown.com> wrote:
> Thank you for your answer, and sorry about the sloppy terminology.
>
> I'm thinking of the scenario where there are a small number of results in
> the result set, but there are billions of rows in the first of your
> secondary indexes.
>
> That is, I want to do something like (not sure of the CQL syntax):
>
> select * where type=2 and e=5
>
> where there are billions of rows of type 2, but some manageable number of
> those rows have e=5.
>
> As I understand it, secondary indexes are like column families, where each
> value is a column. So the billions of rows where type=2 would go into a
> single row of the secondary index. This sounds like a problem to me, is it?
>
> I'm assuming that the billions of rows that don't have column "e" at all
> (those rows of other types) are not a problem at all...
>
> On Thu, Apr 14, 2011 at 12:12 PM, aaron morton <aa...@thelastpickle.com>
> wrote:
>>
>> Need to clear up some terminology here.
>> Rows have a key and can be retrieved by key. This is *sort of* the primary
>> index, but not primary in the normal RDBMS sense.
>> Rows can have different columns and the column names are sorted and can be
>> efficiently selected.
>> There are "secondary indexes" in cassandra 0.7 based on column
>> values http://www.datastax.com/dev/blog/whats-new-cassandra-07-secondary-indexes
>> So you could create secondary indexes on the a,e, and h columns and get
>> rows that have specific values. There are some limitations to secondary
>> indexes, read the linked article.
>> Or you can make your own secondary indexes using row keys as the index
>> values.
>> If you have billions of rows, how many do you need to read back at once?
>> Hope that helps
>> Aaron
>>
>> On 14 Apr 2011, at 04:23, David Boxenhorn wrote:
>>
>> Is it possible in 0.7.x to have indexes on heterogeneous rows, which have
>> different sets of columns?
>>
>> For example, let's say you have three types of objects (1, 2, 3) which
>> each had three members. If your rows had the following pattern
>>
>> type=1 a=? b=? c=?
>> type=2 d=? e=? f=?
>> type=3 g=? h=? i=?
>>
>> could you index "type" as your primary index, and also index "a", "e", "h"
>> as secondary indexes, to get the objects of that type that you are looking
>> for?
>>
>> Would it work if you had billions of rows of each type?
>>
>
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com

Re: Indexes on heterogeneous rows

Posted by Jonathan Ellis <jb...@gmail.com>.

Right.

On Sun, Apr 17, 2011 at 4:23 AM, David Boxenhorn <da...@taotown.com> wrote:
> Thanks, Jonathan. I think I understand now.
>
> To sum up: Everything would work, but if your only equality is on "type"
> (all the rest inequalities), it could be very inefficient.
>
> Is that right?
>
> On Thu, Apr 14, 2011 at 7:22 PM, Jonathan Ellis <jb...@gmail.com> wrote:
>>
>> On Thu, Apr 14, 2011 at 6:48 AM, David Boxenhorn <da...@taotown.com>
>> wrote:
>> > The reason why I put "type" first is that queries on type will
>> > always be an exact match, whereas the other clauses might be
>> > inequalities.
>>
>> Expression order doesn't matter, but as you imply, non-equalities
>> can't be used in an index lookup and have to be checked in a nested
>> loop phase afterwards.
>>
>> --
>> Jonathan Ellis
>> Project Chair, Apache Cassandra
>> co-founder of DataStax, the source for professional Cassandra support
>> http://www.datastax.com
>
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com

Re: Indexes on heterogeneous rows

Posted by David Boxenhorn <da...@taotown.com>.

Thanks, Jonathan. I think I understand now.

To sum up: Everything would work, but if your only equality is on "type"
(all the rest inequalities), it could be very inefficient.

Is that right?

On Thu, Apr 14, 2011 at 7:22 PM, Jonathan Ellis <jb...@gmail.com> wrote:

> On Thu, Apr 14, 2011 at 6:48 AM, David Boxenhorn <da...@taotown.com>
> wrote:
> > The reason why I put "type" first is that queries on type will
> > always be an exact match, whereas the other clauses might be
> inequalities.
>
> Expression order doesn't matter, but as you imply, non-equalities
> can't be used in an index lookup and have to be checked in a nested
> loop phase afterwards.
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com
>

Re: Indexes on heterogeneous rows

Posted by Jonathan Ellis <jb...@gmail.com>.

On Thu, Apr 14, 2011 at 6:48 AM, David Boxenhorn <da...@taotown.com> wrote:
> The reason why I put "type" first is that queries on type will
> always be an exact match, whereas the other clauses might be inequalities.

Expression order doesn't matter, but as you imply, non-equalities
can't be used in an index lookup and have to be checked in a nested
loop phase afterwards.

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com

Re: Indexes on heterogeneous rows

Posted by aaron morton <aa...@thelastpickle.com>.

> (This is a case were 1/3 of the rows are of type 2, but, say only a few hundred rows of type 2 have e=5.)

How many rows would have e=5 without worrying about their type value?
 
Aaron

On 14 Apr 2011, at 23:48, David Boxenhorn wrote:

> Thanks. I'm aware that I can roll my own. I wanted to avoid that, for ease of use, but especially for atomicity concerns. 
> 
> I thought that the secondary index would bring into memory all keys where type=2, and then iterate over them to find keys where=5. (This is a case were 1/3 of the rows are of type 2, but, say only a few hundred rows of type 2 have e=5.) The reason why I put "type" first is that queries on type will always be an exact match, whereas the other clauses might be inequalities. 
> 
> On Thu, Apr 14, 2011 at 2:07 PM, aaron morton <aa...@thelastpickle.com> wrote:
> You could make your own inverted index by using keys like  "e=5-type=2" where the columns are either the keys for the object or the objects themselves. Then just grab the full row back. If you know you always want to run queries like that. 
> 
> This recent discussion and blog post from Ed is good background http://www.mail-archive.com/user@cassandra.apache.org/msg12136.html
> 
> I'm not sure how efficient the join from "e" to type would be. AFAIK it will iterate all keys where e=5 and lookup corresponding rows to find out if type = 2. 
> 
> If know how you want to read things back and need to deal with lots-o-data I would start testing with custom indexes. Then compare to the built in ones, it should be reasonably simple add them for a test.   
> 
> Hope that helps. 
> Aaron
>    
> On 14 Apr 2011, at 22:33, David Boxenhorn wrote:
> 
>> Thank you for your answer, and sorry about the sloppy terminology.
>> 
>> I'm thinking of the scenario where there are a small number of results in the result set, but there are billions of rows in the first of your secondary indexes.
>> 
>> That is, I want to do something like (not sure of the CQL syntax):
>> 
>> select * where type=2 and e=5
>> 
>> where there are billions of rows of type 2, but some manageable number of those rows have e=5.
>> 
>> As I understand it, secondary indexes are like column families, where each value is a column. So the billions of rows where type=2 would go into a single row of the secondary index. This sounds like a problem to me, is it?  
>> 
>> I'm assuming that the billions of rows that don't have column "e" at all (those rows of other types) are not a problem at all...
>> 
>> On Thu, Apr 14, 2011 at 12:12 PM, aaron morton <aa...@thelastpickle.com> wrote:
>> Need to clear up some terminology here. 
>> 
>> Rows have a key and can be retrieved by key. This is *sort of* the primary index, but not primary in the normal RDBMS sense. 
>> Rows can have different columns and the column names are sorted and can be efficiently selected.
>> There are "secondary indexes" in cassandra 0.7 based on column values http://www.datastax.com/dev/blog/whats-new-cassandra-07-secondary-indexes
>> 
>> So you could create secondary indexes on the a,e, and h columns and get rows that have specific values. There are some limitations to secondary indexes, read the linked article. 
>> 
>> Or you can make your own secondary indexes using row keys as the index values.
>> 
>> If you have billions of rows, how many do you need to read back at once?
>> 
>> Hope that helps
>> Aaron
>>     
>> On 14 Apr 2011, at 04:23, David Boxenhorn wrote:
>> 
>>> Is it possible in 0.7.x to have indexes on heterogeneous rows, which have different sets of columns?
>>> 
>>> For example, let's say you have three types of objects (1, 2, 3) which each had three members. If your rows had the following pattern
>>> 
>>> type=1 a=? b=? c=?
>>> type=2 d=? e=? f=?
>>> type=3 g=? h=? i=?
>>> 
>>> could you index "type" as your primary index, and also index "a", "e", "h" as secondary indexes, to get the objects of that type that you are looking for?
>>> 
>>> Would it work if you had billions of rows of each type?
>> 
>> 
> 
>

Re: Indexes on heterogeneous rows

Posted by David Boxenhorn <da...@taotown.com>.

Thanks. I'm aware that I can roll my own. I wanted to avoid that, for ease
of use, but especially for atomicity concerns.

I thought that the secondary index would bring into memory all keys where
type=2, and then iterate over them to find keys where=5. (This is a case
were 1/3 of the rows are of type 2, but, say only a few hundred rows of type
2 have e=5.) The reason why I put "type" first is that queries on type will
always be an exact match, whereas the other clauses might be inequalities.

On Thu, Apr 14, 2011 at 2:07 PM, aaron morton <aa...@thelastpickle.com>wrote:

> You could make your own inverted index by using keys like  "e=5-type=2"
> where the columns are either the keys for the object or the objects
> themselves. Then just grab the full row back. If you know you always want to
> run queries like that.
>
> This recent discussion and blog post from Ed is good background
> http://www.mail-archive.com/user@cassandra.apache.org/msg12136.html
>
> I'm not sure how efficient the join from "e" to type would be. AFAIK it
> will iterate all keys where e=5 and lookup corresponding rows to find out if
> type = 2.
>
> If know how you want to read things back and need to deal with lots-o-data
> I would start testing with custom indexes. Then compare to the built in
> ones, it should be reasonably simple add them for a test.
>
> <http://www.mail-archive.com/user@cassandra.apache.org/msg12136.html>Hope
> that helps.
> Aaron
>
> On 14 Apr 2011, at 22:33, David Boxenhorn wrote:
>
> Thank you for your answer, and sorry about the sloppy terminology.
>
> I'm thinking of the scenario where there are a small number of results in
> the result set, but there are billions of rows in the first of your
> secondary indexes.
>
> That is, I want to do something like (not sure of the CQL syntax):
>
> select * where type=2 and e=5
>
> where there are billions of rows of type 2, but some manageable number of
> those rows have e=5.
>
> As I understand it, secondary indexes are like column families, where each
> value is a column. So the billions of rows where type=2 would go into a
> single row of the secondary index. This sounds like a problem to me, is it?
>
>
> I'm assuming that the billions of rows that don't have column "e" at all
> (those rows of other types) are not a problem at all...
>
> On Thu, Apr 14, 2011 at 12:12 PM, aaron morton <aa...@thelastpickle.com>wrote:
>
>> Need to clear up some terminology here.
>>
>> Rows have a key and can be retrieved by key. This is *sort of* the primary
>> index, but not primary in the normal RDBMS sense.
>> Rows can have different columns and the column names are sorted and can be
>> efficiently selected.
>> There are "secondary indexes" in cassandra 0.7 based on column values
>> http://www.datastax.com/dev/blog/whats-new-cassandra-07-secondary-indexes
>>
>> So you could create secondary indexes on the a,e, and h columns and get
>> rows that have specific values. There are some limitations to secondary
>> indexes, read the linked article.
>>
>> Or you can make your own secondary indexes using row keys as the index
>> values.
>>
>> If you have billions of rows, how many do you need to read back at once?
>>
>> Hope that helps
>> Aaron
>>
>> On 14 Apr 2011, at 04:23, David Boxenhorn wrote:
>>
>> Is it possible in 0.7.x to have indexes on heterogeneous rows, which have
>> different sets of columns?
>>
>> For example, let's say you have three types of objects (1, 2, 3) which
>> each had three members. If your rows had the following pattern
>>
>> type=1 a=? b=? c=?
>> type=2 d=? e=? f=?
>> type=3 g=? h=? i=?
>>
>> could you index "type" as your primary index, and also index "a", "e", "h"
>> as secondary indexes, to get the objects of that type that you are looking
>> for?
>>
>> Would it work if you had billions of rows of each type?
>>
>>
>>
>
>

Re: Indexes on heterogeneous rows

Posted by aaron morton <aa...@thelastpickle.com>.

You could make your own inverted index by using keys like  "e=5-type=2" where the columns are either the keys for the object or the objects themselves. Then just grab the full row back. If you know you always want to run queries like that. 

This recent discussion and blog post from Ed is good background http://www.mail-archive.com/user@cassandra.apache.org/msg12136.html

I'm not sure how efficient the join from "e" to type would be. AFAIK it will iterate all keys where e=5 and lookup corresponding rows to find out if type = 2. 

If know how you want to read things back and need to deal with lots-o-data I would start testing with custom indexes. Then compare to the built in ones, it should be reasonably simple add them for a test.   

Hope that helps. 
Aaron
   
On 14 Apr 2011, at 22:33, David Boxenhorn wrote:

> Thank you for your answer, and sorry about the sloppy terminology.
> 
> I'm thinking of the scenario where there are a small number of results in the result set, but there are billions of rows in the first of your secondary indexes.
> 
> That is, I want to do something like (not sure of the CQL syntax):
> 
> select * where type=2 and e=5
> 
> where there are billions of rows of type 2, but some manageable number of those rows have e=5.
> 
> As I understand it, secondary indexes are like column families, where each value is a column. So the billions of rows where type=2 would go into a single row of the secondary index. This sounds like a problem to me, is it?  
> 
> I'm assuming that the billions of rows that don't have column "e" at all (those rows of other types) are not a problem at all...
> 
> On Thu, Apr 14, 2011 at 12:12 PM, aaron morton <aa...@thelastpickle.com> wrote:
> Need to clear up some terminology here. 
> 
> Rows have a key and can be retrieved by key. This is *sort of* the primary index, but not primary in the normal RDBMS sense. 
> Rows can have different columns and the column names are sorted and can be efficiently selected.
> There are "secondary indexes" in cassandra 0.7 based on column values http://www.datastax.com/dev/blog/whats-new-cassandra-07-secondary-indexes
> 
> So you could create secondary indexes on the a,e, and h columns and get rows that have specific values. There are some limitations to secondary indexes, read the linked article. 
> 
> Or you can make your own secondary indexes using row keys as the index values.
> 
> If you have billions of rows, how many do you need to read back at once?
> 
> Hope that helps
> Aaron
>     
> On 14 Apr 2011, at 04:23, David Boxenhorn wrote:
> 
>> Is it possible in 0.7.x to have indexes on heterogeneous rows, which have different sets of columns?
>> 
>> For example, let's say you have three types of objects (1, 2, 3) which each had three members. If your rows had the following pattern
>> 
>> type=1 a=? b=? c=?
>> type=2 d=? e=? f=?
>> type=3 g=? h=? i=?
>> 
>> could you index "type" as your primary index, and also index "a", "e", "h" as secondary indexes, to get the objects of that type that you are looking for?
>> 
>> Would it work if you had billions of rows of each type?
> 
>

Re: Indexes on heterogeneous rows

Posted by David Boxenhorn <da...@taotown.com>.

Thank you for your answer, and sorry about the sloppy terminology.

I'm thinking of the scenario where there are a small number of results in
the result set, but there are billions of rows in the first of your
secondary indexes.

That is, I want to do something like (not sure of the CQL syntax):

select * where type=2 and e=5

where there are billions of rows of type 2, but some manageable number of
those rows have e=5.

As I understand it, secondary indexes are like column families, where each
value is a column. So the billions of rows where type=2 would go into a
single row of the secondary index. This sounds like a problem to me, is it?

I'm assuming that the billions of rows that don't have column "e" at all
(those rows of other types) are not a problem at all...

On Thu, Apr 14, 2011 at 12:12 PM, aaron morton <aa...@thelastpickle.com>wrote:

> Need to clear up some terminology here.
>
> Rows have a key and can be retrieved by key. This is *sort of* the primary
> index, but not primary in the normal RDBMS sense.
> Rows can have different columns and the column names are sorted and can be
> efficiently selected.
> There are "secondary indexes" in cassandra 0.7 based on column values
> http://www.datastax.com/dev/blog/whats-new-cassandra-07-secondary-indexes
>
> So you could create secondary indexes on the a,e, and h columns and get
> rows that have specific values. There are some limitations to secondary
> indexes, read the linked article.
>
> Or you can make your own secondary indexes using row keys as the index
> values.
>
> If you have billions of rows, how many do you need to read back at once?
>
> Hope that helps
> Aaron
>
> On 14 Apr 2011, at 04:23, David Boxenhorn wrote:
>
> Is it possible in 0.7.x to have indexes on heterogeneous rows, which have
> different sets of columns?
>
> For example, let's say you have three types of objects (1, 2, 3) which each
> had three members. If your rows had the following pattern
>
> type=1 a=? b=? c=?
> type=2 d=? e=? f=?
> type=3 g=? h=? i=?
>
> could you index "type" as your primary index, and also index "a", "e", "h"
> as secondary indexes, to get the objects of that type that you are looking
> for?
>
> Would it work if you had billions of rows of each type?
>
>
>

Re: Indexes on heterogeneous rows

Posted by aaron morton <aa...@thelastpickle.com>.

Need to clear up some terminology here. 

Rows have a key and can be retrieved by key. This is *sort of* the primary index, but not primary in the normal RDBMS sense. 
Rows can have different columns and the column names are sorted and can be efficiently selected.
There are "secondary indexes" in cassandra 0.7 based on column values http://www.datastax.com/dev/blog/whats-new-cassandra-07-secondary-indexes

So you could create secondary indexes on the a,e, and h columns and get rows that have specific values. There are some limitations to secondary indexes, read the linked article. 

Or you can make your own secondary indexes using row keys as the index values.

If you have billions of rows, how many do you need to read back at once?

Hope that helps
Aaron

On 14 Apr 2011, at 04:23, David Boxenhorn wrote:

> Is it possible in 0.7.x to have indexes on heterogeneous rows, which have different sets of columns?
> 
> For example, let's say you have three types of objects (1, 2, 3) which each had three members. If your rows had the following pattern
> 
> type=1 a=? b=? c=?
> type=2 d=? e=? f=?
> type=3 g=? h=? i=?
> 
> could you index "type" as your primary index, and also index "a", "e", "h" as secondary indexes, to get the objects of that type that you are looking for?
> 
> Would it work if you had billions of rows of each type?