Posted to user@cassandra.apache.org by buddhasystem <po...@bnl.gov> on 2011/01/21 13:55:07 UTC

Multiple indexes - how does Cassandra handle these internally?

Greetings --

if I use multiple secondary indexes in the query, what will Cassandra do?
Some examples say it will index on first EQ and then loop on others. Does it
ever do a proper index product to avoid inner loops?

Thanks

Maxim
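
To make the question concrete, here is a minimal Python sketch of the two strategies being contrasted. It is a toy model, not Cassandra's implementation: the row data, `build_index`, `query_first_index_then_filter` and `query_index_intersection` are all made-up names for illustration. The first function uses one index for the first EQ clause and loops over the candidates for the rest; the second looks every clause up in its own index and intersects the key sets (the "index product").

```python
from collections import defaultdict

# Toy rows: row key -> {column: value}. Purely illustrative data.
ROWS = {
    "key1": {"col1": "value1", "col2": "value2"},
    "key2": {"col1": "value1", "col2": "foo"},
    "key3": {"col1": "foo", "col2": "value2"},
}


def build_index(rows, column):
    """Map each value of `column` to the set of row keys holding it."""
    index = defaultdict(set)
    for key, cols in rows.items():
        if column in cols:
            index[cols[column]].add(key)
    return index


INDEXES = {c: build_index(ROWS, c) for c in ("col1", "col2")}


def query_first_index_then_filter(clauses):
    """Use the index for the first EQ clause, then loop over the candidates."""
    (first_col, first_val), rest = clauses[0], clauses[1:]
    candidates = INDEXES[first_col].get(first_val, set())
    return sorted(
        key for key in candidates
        if all(ROWS[key].get(col) == val for col, val in rest)
    )


def query_index_intersection(clauses):
    """Look every clause up in its own index and intersect the key sets."""
    key_sets = [INDEXES[col].get(val, set()) for col, val in clauses]
    return sorted(set.intersection(*key_sets))


clauses = [("col1", "value1"), ("col2", "value2")]
print(query_first_index_then_filter(clauses))
print(query_first_index_then_filter(list(reversed(clauses))))
print(query_index_intersection(clauses))
```

In this model both strategies return the same rows regardless of clause order; they differ only in how much work is done (candidates scanned versus index lookups performed).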


Re: Multiple indexes - how does Cassandra handle these internally?

Posted by Aaron Morton <aa...@thelastpickle.com>.
Maxim, 

Off the top of my head I'm not aware of any limitations in the indexes (other than the operators). This will give me a reason to dig into the code further and do some more reading. Can you provide some more info on the system (in another thread)? The group may be able to help with the design. In the meantime this may help: http://wiki.apache.org/cassandra/LargeDataSetConsiderations

Aaron


On 24 Jan, 2011, at 03:13 PM, David McNelis <dm...@agentisenergy.com> wrote:

No worries, I am in the states too.
Sent from my Droid
On Jan 23, 2011 8:05 PM, "Maxim Potekhin" <po...@bnl.gov> wrote:
> Not silly at all, sorry for eschewing clarity in Latin usage:
> 
> millions, not thousands.
> 
> Here in the States, we typically use M for millions.
> Anyhow, my system generates 1 million (large) records every three days,
> 
> Cheers,
> Maxim
> 
> 
> 
> On 1/23/2011 8:35 PM, David McNelis wrote:
>>
>> Silly question, M us thousand or million? In print, thousand is M, fwiw
>>
>> Sent from my Droid
>>
>> On Jan 23, 2011 7:26 PM, "Maxim Potekhin" <potekhin@bnl.gov 
>> <ma...@bnl.gov>> wrote:
>> > Aaron -- thanks!
>> >
>> > I don't have examples like Timo.
>> >
>> > But,
>> >
>> > I'm keen to use multiple indices over a database
>> > of 300M rows.
>> >
>> >
>> > Maxim
>> >
>> >
>> > On 1/23/2011 3:28 PM, Aaron Morton wrote:
>> >> Timo / Maxim
>> >> Could you provide a more concrete example and I'll try to look into 
>> it tonight
>> >>
>> >> Cheers
>> >> Aaron
>> >>
>> >>
>> >> On 22/01/2011, at 5:01 AM, Maxim Potekhin<potekhin@bnl.gov 
>> <ma...@bnl.gov>> wrote:
>> >>
>> >>> Well it does sound like a bug in Cassandra. Indexes MUST commute.
>> >>>
>> >>> I really need this functionality, it's a show stopper for me...
>> >>>
>> >>> On 1/21/2011 10:56 AM, Timo Nentwig wrote:
>> >>>> On Jan 21, 2011, at 16:46, Maxim Potekhin wrote:
>> >>>>
>> >>>>> But Timo, this is even more mysterious! If both conditions are 
>> met, at least
>> >>>>> something must be returned in the second query. Have you tried 
>> this in CLI?
>> >>>>> That would allow you to at least alleviate client concerns.
>> >>>> I did this on the CLI only so far. So value comparison on the 
>> index seems to be done differently than in the nested loop...or 
>> something. Don't know, don't know the code base well enough to debug 
>> this down to the very bottom either. But it's actually only a CF with 
>> 2 cols (AsciiType and IntegerType) and a command in the CLI so not too 
>> time-consuming to reproduce.
>> >
> 

Re: Multiple indexes - how does Cassandra handle these internally?

Posted by David McNelis <dm...@agentisenergy.com>.
No worries, I am in the states too.

Sent from my Droid
On Jan 23, 2011 8:05 PM, "Maxim Potekhin" <po...@bnl.gov> wrote:
> Not silly at all, sorry for eschewing clarity in Latin usage:
>
> millions, not thousands.
>
> Here in the States, we typically use M for millions.
> Anyhow, my system generates 1 million (large) records every three days,
>
> Cheers,
> Maxim
>
>
>
> On 1/23/2011 8:35 PM, David McNelis wrote:
>>
>> Silly question, M us thousand or million? In print, thousand is M, fwiw
>>
>> Sent from my Droid
>>
>> On Jan 23, 2011 7:26 PM, "Maxim Potekhin" <potekhin@bnl.gov
>> <ma...@bnl.gov>> wrote:
>> > Aaron -- thanks!
>> >
>> > I don't have examples like Timo.
>> >
>> > But,
>> >
>> > I'm keen to use multiple indices over a database
>> > of 300M rows.
>> >
>> >
>> > Maxim
>> >
>> >
>> > On 1/23/2011 3:28 PM, Aaron Morton wrote:
>> >> Timo / Maxim
>> >> Could you provide a more concrete example and I'll try to look into
>> it tonight.
>> >>
>> >> Cheers
>> >> Aaron
>> >>
>> >>
>> >> On 22/01/2011, at 5:01 AM, Maxim Potekhin<potekhin@bnl.gov
>> <ma...@bnl.gov>> wrote:
>> >>
>> >>> Well it does sound like a bug in Cassandra. Indexes MUST commute.
>> >>>
>> >>> I really need this functionality, it's a show stopper for me...
>> >>>
>> >>> On 1/21/2011 10:56 AM, Timo Nentwig wrote:
>> >>>> On Jan 21, 2011, at 16:46, Maxim Potekhin wrote:
>> >>>>
>> >>>>> But Timo, this is even more mysterious! If both conditions are
>> met, at least
>> >>>>> something must be returned in the second query. Have you tried
>> this in CLI?
>> >>>>> That would allow you to at least alleviate client concerns.
>> >>>> I did this on the CLI only so far. So value comparison on the
>> index seems to be done differently than in the nested loop...or
>> something. Don't know, don't know the code base well enough to debug
>> this down to the very bottom either. But it's actually only a CF with
>> 2 cols (AsciiType and IntegerType) and a command in the CLI so not too
>> time-consuming to reproduce.
>> >
>

Re: Multiple indexes - how does Cassandra handle these internally?

Posted by Maxim Potekhin <po...@bnl.gov>.
Not silly at all, sorry for eschewing clarity in Latin usage:

millions, not thousands.

Here in the States, we typically use M for millions.
Anyhow, my system generates 1 million (large) records every three days,

Cheers,
Maxim



On 1/23/2011 8:35 PM, David McNelis wrote:
>
> Silly question, M us thousand or million?  In print, thousand is M, fwiw
>
> Sent from my Droid
>
> On Jan 23, 2011 7:26 PM, "Maxim Potekhin" <potekhin@bnl.gov 
> <ma...@bnl.gov>> wrote:
> > Aaron -- thanks!
> >
> > I don't have examples like Timo.
> >
> > But,
> >
> > I'm keen to use multiple indices over a database
> > of 300M rows.
> >
> >
> > Maxim
> >
> >
> > On 1/23/2011 3:28 PM, Aaron Morton wrote:
> >> Timo / Maxim
> >> Could you provide a more concrete example and I'll try to look into 
> it tonight.
> >>
> >> Cheers
> >> Aaron
> >>
> >>
> >> On 22/01/2011, at 5:01 AM, Maxim Potekhin<potekhin@bnl.gov 
> <ma...@bnl.gov>> wrote:
> >>
> >>> Well it does sound like a bug in Cassandra. Indexes MUST commute.
> >>>
> >>> I really need this functionality, it's a show stopper for me...
> >>>
> >>> On 1/21/2011 10:56 AM, Timo Nentwig wrote:
> >>>> On Jan 21, 2011, at 16:46, Maxim Potekhin wrote:
> >>>>
> >>>>> But Timo, this is even more mysterious! If both conditions are 
> met, at least
> >>>>> something must be returned in the second query. Have you tried 
> this in CLI?
> >>>>> That would allow you to at least alleviate client concerns.
> >>>> I did this on the CLI only so far. So value comparison on the 
> index seems to be done differently than in the nested loop...or 
> something. Don't know, don't know the code base well enough to debug 
> this down to the very bottom either. But it's actually only a CF with 
> 2 cols (AsciiType and IntegerType) and a command in the CLI so not too 
> time-consuming to reproduce.
> >


Re: Multiple indexes - how does Cassandra handle these internally?

Posted by David McNelis <dm...@agentisenergy.com>.
Silly question, is M thousand or million?  In print, thousand is M, fwiw

Sent from my Droid
On Jan 23, 2011 7:26 PM, "Maxim Potekhin" <po...@bnl.gov> wrote:
> Aaron -- thanks!
>
> I don't have examples like Timo.
>
> But,
>
> I'm keen to use multiple indices over a database
> of 300M rows.
>
>
> Maxim
>
>
> On 1/23/2011 3:28 PM, Aaron Morton wrote:
>> Timo / Maxim
>> Could you provide a more concrete example and I'll try to look into it
tonight.
>>
>> Cheers
>> Aaron
>>
>>
>> On 22/01/2011, at 5:01 AM, Maxim Potekhin<po...@bnl.gov> wrote:
>>
>>> Well it does sound like a bug in Cassandra. Indexes MUST commute.
>>>
>>> I really need this functionality, it's a show stopper for me...
>>>
>>> On 1/21/2011 10:56 AM, Timo Nentwig wrote:
>>>> On Jan 21, 2011, at 16:46, Maxim Potekhin wrote:
>>>>
>>>>> But Timo, this is even more mysterious! If both conditions are met, at
least
>>>>> something must be returned in the second query. Have you tried this in
CLI?
>>>>> That would allow you to at least alleviate client concerns.
>>>> I did this on the CLI only so far. So value comparison on the index
seems to be done differently than in the nested loop...or something. Don't
know, don't know the code base well enough to debug this down to the very
bottom either. But it's actually only a CF with 2 cols (AsciiType and
IntegerType) and a command in the CLI so not too time-consuming to
reproduce.
>

Re: Multiple indexes - how does Cassandra handle these internally?

Posted by Maxim Potekhin <po...@bnl.gov>.
Aaron -- thanks!

I don't have examples like Timo.

But,

I'm keen to use multiple indices over a database
of 300M rows.


Maxim


On 1/23/2011 3:28 PM, Aaron Morton wrote:
> Timo / Maxim
> Could you provide a more concrete example and I'll try to look into it tonight.
>
> Cheers
> Aaron
>
>
> On 22/01/2011, at 5:01 AM, Maxim Potekhin<po...@bnl.gov>  wrote:
>
>> Well it does sound like a bug in Cassandra. Indexes MUST commute.
>>
>> I really need this functionality, it's a show stopper for me...
>>
>> On 1/21/2011 10:56 AM, Timo Nentwig wrote:
>>> On Jan 21, 2011, at 16:46, Maxim Potekhin wrote:
>>>
>>>> But Timo, this is even more mysterious! If both conditions are met, at least
>>>> something must be returned in the second query. Have you tried this in CLI?
>>>> That would allow you to at least alleviate client concerns.
>>> I did this on the CLI only so far. So value comparison on the index seems to be done differently than in the nested loop...or something. Don't know, don't know the code base well enough to debug this down to the very bottom either. But it's actually only a CF with 2 cols (AsciiType and IntegerType) and a command in the CLI so not too time-consuming to reproduce.


Re: Multiple indexes - how does Cassandra handle these internally?

Posted by Aaron Morton <aa...@thelastpickle.com>.
Timo / Maxim
Could you provide a more concrete example and I'll try to look into it tonight.

Cheers
Aaron


On 22/01/2011, at 5:01 AM, Maxim Potekhin <po...@bnl.gov> wrote:

> Well it does sound like a bug in Cassandra. Indexes MUST commute.
> 
> I really need this functionality, it's a show stopper for me...
> 
> On 1/21/2011 10:56 AM, Timo Nentwig wrote:
>> On Jan 21, 2011, at 16:46, Maxim Potekhin wrote:
>> 
>>> But Timo, this is even more mysterious! If both conditions are met, at least
>>> something must be returned in the second query. Have you tried this in CLI?
>>> That would allow you to at least alleviate client concerns.
>> 
>> I did this on the CLI only so far. So value comparison on the index seems to be done differently than in the nested loop...or something. Don't know, don't know the code base well enough to debug this down to the very bottom either. But it's actually only a CF with 2 cols (AsciiType and IntegerType) and a command in the CLI so not too time-consuming to reproduce.
> 

Re: Multiple indexes - how does Cassandra handle these internally?

Posted by Maxim Potekhin <po...@bnl.gov>.
Well it does sound like a bug in Cassandra. Indexes MUST commute.

I really need this functionality, it's a show stopper for me...

On 1/21/2011 10:56 AM, Timo Nentwig wrote:
> On Jan 21, 2011, at 16:46, Maxim Potekhin wrote:
>
>> But Timo, this is even more mysterious! If both conditions are met, at least
>> something must be returned in the second query. Have you tried this in CLI?
>> That would allow you to at least alleviate client concerns.
>
> I did this on the CLI only so far. So value comparison on the index seems to be done differently than in the nested loop...or something. Don't know, don't know the code base well enough to debug this down to the very bottom either. But it's actually only a CF with 2 cols (AsciiType and IntegerType) and a command in the CLI so not too time-consuming to reproduce.


Re: Multiple indexes - how does Cassandra handle these internally?

Posted by Aaron Morton <aa...@thelastpickle.com>.
Timo, 
Below is a test I did via the CLI; it worked as expected for me. If you're still experiencing the problem, could you try to repro it like this?

Paste this into the CLI to set up the schema and data:


create keyspace index_test 
    with replication_factor = 1;

use index_test;

create column family Indexed
    with comparator = AsciiType
    and column_metadata = [
    {
        column_name : col1,
        validation_class : AsciiType,
        index_name : col1_idx,
        index_type : 0}, 
    {
        column_name : col2,
        validation_class : AsciiType,
        index_name : col2_idx,
        index_type : 0},
    {
        column_name : col3,
        validation_class : IntegerType,
        index_name : col3_idx,
        index_type : 0},
    ];

set Indexed['key1']['col1'] = 'value1';
set Indexed['key1']['col2'] = 'value2';
set Indexed['key1']['col3'] = '3';
set Indexed['key2']['col1'] = 'value1';
set Indexed['key2']['col2'] = 'value2';
set Indexed['key2']['col3'] = '3';
set Indexed['key3']['col1'] = 'value1';
set Indexed['key3']['col2'] = 'foo';
set Indexed['key3']['col3'] = '0';
set Indexed['key4']['col1'] = 'foo';
set Indexed['key4']['col2'] = 'value2';
set Indexed['key4']['col3'] = '0';
set Indexed['key5']['col1'] = 'foo';
set Indexed['key5']['col2'] = 'foo';
set Indexed['key5']['col3'] = '3';

================

Then I ran these...

[default@index_test] get Indexed where col1='value1' and col2='value2';
-------------------
RowKey: key1
=> (column=col1, value=value1, timestamp=1295901869013000)
=> (column=col2, value=value2, timestamp=1295901869017000)
=> (column=col3, value=3, timestamp=1295901869038000)
-------------------
RowKey: key2
=> (column=col1, value=value1, timestamp=1295901869044000)
=> (column=col2, value=value2, timestamp=1295901869047000)
=> (column=col3, value=3, timestamp=1295901870574000)

2 Rows Returned.

[default@index_test] get Indexed where col2='value2' and col1='value1';
-------------------
RowKey: key1
=> (column=col1, value=value1, timestamp=1295901869013000)
=> (column=col2, value=value2, timestamp=1295901869017000)
=> (column=col3, value=3, timestamp=1295901869038000)
-------------------
RowKey: key2
=> (column=col1, value=value1, timestamp=1295901869044000)
=> (column=col2, value=value2, timestamp=1295901869047000)
=> (column=col3, value=3, timestamp=1295901870574000)

2 Rows Returned.

[default@index_test] get Indexed where col1='value1' and col3='3';
-------------------
RowKey: key1
=> (column=col1, value=value1, timestamp=1295901869013000)
=> (column=col2, value=value2, timestamp=1295901869017000)
=> (column=col3, value=3, timestamp=1295901869038000)
-------------------
RowKey: key2
=> (column=col1, value=value1, timestamp=1295901869044000)
=> (column=col2, value=value2, timestamp=1295901869047000)
=> (column=col3, value=3, timestamp=1295901870574000)

2 Rows Returned.

[default@index_test] get Indexed where col3='3' and col1='value1';
-------------------
RowKey: key1
=> (column=col1, value=value1, timestamp=1295901869013000)
=> (column=col2, value=value2, timestamp=1295901869017000)
=> (column=col3, value=3, timestamp=1295901869038000)
-------------------
RowKey: key2
=> (column=col1, value=value1, timestamp=1295901869044000)
=> (column=col2, value=value2, timestamp=1295901869047000)
=> (column=col3, value=3, timestamp=1295901870574000)

2 Rows Returned.
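
For reference, the expected result of the four queries above can be cross-checked against a plain in-memory model of the same five rows. This is only a sanity check of what a correct conjunction of EQ clauses should return; `TEST_ROWS` and `get_where` are hypothetical names and nothing here touches Cassandra.

```python
# The five rows inserted above, with col3 held as an integer.
TEST_ROWS = {
    "key1": {"col1": "value1", "col2": "value2", "col3": 3},
    "key2": {"col1": "value1", "col2": "value2", "col3": 3},
    "key3": {"col1": "value1", "col2": "foo", "col3": 0},
    "key4": {"col1": "foo", "col2": "value2", "col3": 0},
    "key5": {"col1": "foo", "col2": "foo", "col3": 3},
}


def get_where(rows, clauses):
    """Return sorted row keys whose columns equal every (column, value) clause."""
    return sorted(
        key for key, cols in rows.items()
        if all(cols.get(col) == val for col, val in clauses)
    )


# Same four queries as in the CLI session, in both clause orders.
print(get_where(TEST_ROWS, [("col1", "value1"), ("col2", "value2")]))
print(get_where(TEST_ROWS, [("col2", "value2"), ("col1", "value1")]))
print(get_where(TEST_ROWS, [("col1", "value1"), ("col3", 3)]))
print(get_where(TEST_ROWS, [("col3", 3), ("col1", "value1")]))
```

Each call yields key1 and key2, matching the "2 Rows Returned." output of the CLI in every case.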


Cheers
Aaron

On 22 Jan, 2011, at 04:56 AM, Timo Nentwig <ti...@toptarif.de> wrote:

On Jan 21, 2011, at 16:46, Maxim Potekhin wrote:

> But Timo, this is even more mysterious! If both conditions are met, at least
> something must be returned in the second query. Have you tried this in CLI?
> That would allow you to at least alleviate client concerns.


I did this on the CLI only so far. So value comparison on the index seems to be done differently than in the nested loop...or something. Don't know, don't know the code base well enough to debug this down to the very bottom either. But it's actually only a CF with 2 cols (AsciiType and IntegerType) and a command in the CLI so not too time-consuming to reproduce.

Re: Multiple indexes - how does Cassandra handle these internally?

Posted by Timo Nentwig <ti...@toptarif.de>.
On Jan 21, 2011, at 16:46, Maxim Potekhin wrote:

> But Timo, this is even more mysterious! If both conditions are met, at least
> something must be returned in the second query. Have you tried this in CLI?
> That would allow you to at least alleviate client concerns.


I did this on the CLI only so far. So value comparison on the index seems to be done differently than in the nested loop...or something. Don't know, don't know the code base well enough to debug this down to the very bottom either. But it's actually only a CF with 2 cols (AsciiType and IntegerType) and a command in the CLI so not too time-consuming to reproduce.
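
The guess above (index lookup and nested-loop filter comparing values differently) can be illustrated with a purely hypothetical Python toy. It is not a diagnosis of Cassandra and does not reflect its code: the encoders, the single stored row and the deliberately buggy `buggy_filter` are all invented to show how such a mismatch would make clause order matter for a CF with one ASCII and one integer column.

```python
def encode_ascii(value: str) -> bytes:
    """Encode a query value the way an ASCII-typed column would store it."""
    return value.encode("ascii")


def encode_integer(value: str) -> bytes:
    """Encode a query value as a big-endian signed integer of minimal-ish size."""
    n = int(value)
    return n.to_bytes(max(1, (n.bit_length() + 8) // 8), "big", signed=True)


# Hypothetical column types for a two-column CF.
ENCODERS = {"name": encode_ascii, "count": encode_integer}

# One stored row, with each column value encoded by its own type.
STORED = {
    "row1": {"name": encode_ascii("abc"), "count": encode_integer("3")},
}


def index_lookup(column, raw_value):
    """Index path: encodes the query value with the column's own type."""
    wanted = ENCODERS[column](raw_value)
    return {k for k, cols in STORED.items() if cols.get(column) == wanted}


def buggy_filter(keys, column, raw_value):
    """Filter path with a deliberate bug: always treats the value as ASCII text."""
    wanted = raw_value.encode("ascii")
    return {k for k in keys if STORED[k].get(column) == wanted}


def query(clauses):
    """First clause goes through the index, the rest through the buggy filter."""
    (first_col, first_val), rest = clauses[0], clauses[1:]
    keys = index_lookup(first_col, first_val)
    for col, val in rest:
        keys = buggy_filter(keys, col, val)
    return sorted(keys)


print(query([("count", "3"), ("name", "abc")]))  # integer clause hits the index
print(query([("name", "abc"), ("count", "3")]))  # integer clause hits the filter
```

In this toy the first order finds row1 and the second finds nothing, because the integer value is only encoded correctly on the index path. Whether anything like this happens in a real cluster would have to be checked against the actual code and version in use.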

Re: Multiple indexes - how does Cassandra handle these internally?

Posted by Maxim Potekhin <po...@bnl.gov>.
But Timo, this is even more mysterious! If both conditions are met, at least
something must be returned in the second query. Have you tried this in CLI?
That would allow you to at least alleviate client concerns.

On 1/21/2011 10:38 AM, Timo Nentwig wrote:
> On Jan 21, 2011, at 13:55, buddhasystem wrote:
>
>> if I use multiple secondary indexes in the query, what will Cassandra do?
>> Some examples say it will index on first EQ and then loop on others. Does it
>> ever do a proper index product to avoid inner loops?
> Just asked the same question on the hector-dev group a few minutes ago. Seems indeed to be the case that cassandra only uses 1 index. At least this would make sense narrowing down issues I have that
>
>   get foo where col1=cond1 and col2=cond2
>
> works while flipping conditions
>
>   get foo where col2=cond2 and col1=cond1
>
> returns no results no more.
>
> Unfortunately nobody around here seems to care...


Re: Multiple indexes - how does Cassandra handle these internally?

Posted by Timo Nentwig <ti...@toptarif.de>.
On Jan 21, 2011, at 13:55, buddhasystem wrote:

> if I use multiple secondary indexes in the query, what will Cassandra do?
> Some examples say it will index on first EQ and then loop on others. Does it
> ever do a proper index product to avoid inner loops?

Just asked the same question on the hector-dev group a few minutes ago. It does indeed seem to be the case that Cassandra only uses one index. At least that would make sense of an issue I have, where

 get foo where col1=cond1 and col2=cond2

works, while flipping the conditions

 get foo where col2=cond2 and col1=cond1

no longer returns any results.

Unfortunately nobody around here seems to care...