Posted to user@cassandra.apache.org by Aaron Morton <aa...@thelastpickle.com> on 2011/02/08 21:16:58 UTC

Re: How do secondary indices work

Moving to the user group.



On 08 Feb, 2011, at 11:39 PM, altanis@ceid.upatras.gr wrote:

Hello,

I'd like some information about how secondary indices work under the hood.

1) Is data stored in some external data structure, or is it stored in an
actual Cassandra table, as columns within column families?
2) Is data stored sorted or not? How is it partitioned?
3) How can I access index data?

Thanks in advance,

Alexander Altanis

Re: How do secondary indices work

Posted by al...@ceid.upatras.gr.

Thank you for the reply, although I didn't quite understand it. All I got
was that index data is stored in some kind of external data structure.

Alexander

>
> On Feb 8, 2011, at 21:23, Aaron Morton wrote:
>
>>>> 1) Is data stored in some external data structure, or is it stored in
>>>> an
>>>> actual Cassandra table, as columns within column families?
>
> Yes. Own files next to the CF files and own node IndexColumnFamilies in
> JMX.
>
> And they are built asynchronously.
>
>

Re: How do secondary indices work

Posted by Timo Nentwig <ti...@toptarif.de>.

On Feb 8, 2011, at 21:23, Aaron Morton wrote:

>>> 1) Is data stored in some external data structure, or is it stored in an
>>> actual Cassandra table, as columns within column families?

Yes. Each index has its own files next to the CF files and its own
IndexColumnFamilies node in JMX.

And they are built asynchronously.

Re: How do secondary indices work

Posted by Jonathan Ellis <jb...@gmail.com>.

"Iterating through all of the rows matching an index clause on your
cluster is guaranteed to touch N/RF of the nodes in your cluster,
because each node only knows about data that is indexed locally."

On Wed, Feb 9, 2011 at 9:13 AM,  <al...@ceid.upatras.gr> wrote:
> One more question: does each node keep an index of their own values, or is
> the index global?
>
> Alexander
>

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com

Re: How do secondary indices work

Posted by al...@ceid.upatras.gr.

One more question: does each node keep an index of their own values, or is
the index global?

Alexander

> Thank you very much, this is the information I was looking for. I started
> adding secondary index functionality to Cassandra myself, and it turns out
> I am doing almost exactly the same thing. I will try to change my code to
> use your implementation as well to compare results.
>
> Alexander

Re: How do secondary indices work

Posted by al...@ceid.upatras.gr.

Thank you very much, this is the information I was looking for. I started
adding secondary index functionality to Cassandra myself, and it turns out
I am doing almost exactly the same thing. I will try to change my code to
use your implementation as well to compare results.

Alexander

Re: How do secondary indices work

Posted by Stu Hood <st...@gmail.com>.

Alexander:

The secondary indexes in 0.7.0 (type KEYS) are stored internally in a column
family, and are kept synchronized with the base data via locking on a local
node, meaning they are always consistent on the local node. Eventual
consistency still applies between nodes, but a returned result will always
match your query.
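
As a rough illustration of the local-consistency point above, here is a
minimal Python sketch in which one lock guards both the base rows and the
index. The class and method names (LocalNode, write, lookup) are invented
for this example and are not Cassandra's internals; it only shows why a
reader on the same node never sees the index and the base data disagree.

```python
import threading


class LocalNode:
    """Toy model of one node: base rows plus an index on a single column.

    Hypothetical names for illustration only, not Cassandra code.
    """

    def __init__(self, indexed_column):
        self.indexed_column = indexed_column
        self.rows = {}    # row key -> {column name: value}
        self.index = {}   # indexed value -> set of row keys
        self._lock = threading.Lock()

    def write(self, row_key, columns):
        # Base data and index entry change under the same lock, so a local
        # reader never observes one without the other.
        with self._lock:
            old = self.rows.get(row_key, {}).get(self.indexed_column)
            row = self.rows.setdefault(row_key, {})
            row.update(columns)
            new = row.get(self.indexed_column)
            if old != new:
                if old is not None:
                    # Drop the stale index entry for the previous value.
                    self.index[old].discard(row_key)
                    if not self.index[old]:
                        del self.index[old]
                if new is not None:
                    self.index.setdefault(new, set()).add(row_key)

    def lookup(self, value):
        # Returns the row keys currently indexed under `value`, sorted.
        with self._lock:
            return sorted(self.index.get(value, ()))
```

Replication to other nodes is deliberately left out, which is where eventual
consistency would come back in.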

This index column family stores a mapping from index values to a sorted list
of matching row keys. When you query for rows between x and y matching a
value z (via the get_indexed_slices call), Cassandra performs a lookup in
the index column family for the slice of columns in row z between x and y.
If any matches are found in the index, they are row keys that match the
index clause, and we query the base data to return those rows to you.
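
The lookup described above can be sketched in a few lines of Python. This is
a toy model under stated assumptions, not the real storage code: ToyIndexedCF
and its methods are made-up names, the index "row" is a plain sorted list,
and updates or deletes of an already indexed row are not handled.

```python
from bisect import bisect_left, bisect_right


class ToyIndexedCF:
    """Base column family plus one index column family (illustrative only)."""

    def __init__(self, indexed_column):
        self.indexed_column = indexed_column
        self.base_rows = {}   # row key -> {column: value}
        self.index_cf = {}    # indexed value z -> sorted list of base row keys

    def insert(self, row_key, columns):
        # Store the base row, then add its key to the index row for its value.
        self.base_rows[row_key] = dict(columns)
        value = columns.get(self.indexed_column)
        if value is not None:
            keys = self.index_cf.setdefault(value, [])
            i = bisect_left(keys, row_key)
            if i == len(keys) or keys[i] != row_key:
                keys.insert(i, row_key)   # keep the list sorted, no duplicates

    def indexed_slice(self, value, start_key, end_key):
        # Slice the index row for `value` between start_key and end_key
        # (inclusive), then read the matching rows from the base data.
        keys = self.index_cf.get(value, [])
        lo = bisect_left(keys, start_key)
        hi = bisect_right(keys, end_key)
        return [(k, self.base_rows[k]) for k in keys[lo:hi]]
```

Here `value` plays the role of z, and `start_key` and `end_key` play the
roles of x and y.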

Iterating through all of the rows matching an index clause on your cluster
is guaranteed to touch N/RF of the nodes in your cluster, because each node
only knows about data that is indexed locally.
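
One way to see where the N/RF figure comes from is the small Python sketch
below. It assumes a simplified ring in which node i owns token range i and
each range is also replicated on the next RF-1 nodes; that placement and the
function names are assumptions made for the example. Under that model, a scan
that must see every range has to contact roughly N/RF nodes (rounded up).

```python
def replicas(range_id, n_nodes, rf):
    """Nodes holding token range `range_id`: its owner plus the next rf-1 nodes."""
    return [(range_id + i) % n_nodes for i in range(rf)]


def nodes_to_query(n_nodes, rf):
    """Pick nodes until every token range is held by at least one picked node."""
    covered = set()
    picked = []
    for r in range(n_nodes):
        if r in covered:
            continue
        # The last replica of range r also holds the following rf-1 ranges,
        # so picking it covers as many still-uncovered ranges as possible.
        node = (r + rf - 1) % n_nodes
        picked.append(node)
        covered.update(q for q in range(n_nodes)
                       if node in replicas(q, n_nodes, rf))
    return picked
```

Each picked node answers only from its local index, which is why no single
node can serve the whole query.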

Some portions of the indexing implementation are not fully baked yet: for
instance, although the API allows you to specify multiple columns, only one
index will actually be used per query, and the rest of the clauses will be
brute-forced against the rows that index returns.
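
A hedged Python sketch of that behaviour: the first clause on an indexed
column drives the lookup, and every other clause is checked row by row. The
function name, the data layout, and the rule that at least one clause must
hit an indexed column are assumptions of this example, not the actual query
path.

```python
def query(rows, indexes, clauses):
    """Evaluate equality clauses using a single index.

    rows:    row key -> {column: value}
    indexes: column -> {value: sorted list of row keys}
    clauses: list of (column, value) equality clauses
    """
    # Use only the first clause that has an index, even if others have one too.
    indexed = next((c for c in clauses if c[0] in indexes), None)
    if indexed is None:
        raise ValueError("at least one clause must be on an indexed column")
    column, value = indexed
    candidates = indexes[column].get(value, [])
    # Every remaining clause is brute-forced against the candidate rows.
    rest = [c for c in clauses if c is not indexed]
    return [k for k in candidates
            if all(rows[k].get(col) == val for col, val in rest)]
```

The cost of the brute-force step grows with the number of candidates, so it
helps when the indexed clause is the most selective one.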

A second secondary index implementation has been on the back burner for a
while: it provides an identical API, but does not use a column family to
store the index, and should be more efficient for append-only data. See
https://issues.apache.org/jira/browse/CASSANDRA-1472

Thanks,
Stu

On Wed, Feb 9, 2011 at 2:35 AM, <al...@ceid.upatras.gr> wrote:

> Thank you for the links, I did read a bit in the comments of the ticket,
> but I couldn't get much out of it.
>
> I am mainly interested in how the index is stored and partitioned, not how
> it is used. I think the people in the dev list will probably be better
> qualified to answer that. My questions always seem to get moved to the
> user list, and usually with good cause, but I think this time it should be
> in the dev list :) Please move it back, if you can.
>
> Alexander

Re: How do secondary indices work

Posted by al...@ceid.upatras.gr.

Thank you for the links, I did read a bit in the comments of the ticket,
but I couldn't get much out of it.

I am mainly interested in how the index is stored and partitioned, not how
it is used. I think the people in the dev list will probably be better
qualified to answer that. My questions always seem to get moved to the
user list, and usually with good cause, but I think this time it should be
in the dev list :) Please move it back, if you can.

Alexander

> AFAIK this was the ticket the original work was done under:
> https://issues.apache.org/jira/browse/CASSANDRA-1415
>
> Also http://www.datastax.com/docs/0.7/data_model/secondary_indexes
> and http://pycassa.github.com/pycassa/tutorial.html#indexes may help.
>
> (Sorry, on reflection the email probably did not need to be moved from dev,
> my bad.)
> Aaron

Re: How do secondary indices work

Posted by Aaron Morton <aa...@thelastpickle.com>.

AFAIK this was the ticket the original work was done under:
https://issues.apache.org/jira/browse/CASSANDRA-1415

Also http://www.datastax.com/docs/0.7/data_model/secondary_indexes
and http://pycassa.github.com/pycassa/tutorial.html#indexes may help.

(Sorry, on reflection the email probably did not need to be moved from dev, my bad.)
Aaron
