
Posted to user@cassandra.apache.org by TuX RaceR <tu...@gmail.com> on 2010/04/25 18:54:55 UTC

newbie question on how columns names are indexed/lucene limitations?

Hello Cassandra Users,

When using the RandomPartitioner and a simple ColumnFamily/Columns (i.e.
no SuperColumns), my understanding is that one single Row can store
millions of columns.

If I look at http://wiki.apache.org/cassandra/API, I understand that
I can get a subset of those millions of columns using:
SlicePredicate->ColumnNames or SlicePredicate->SliceRange
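
As a rough illustration of these two selection modes, here is a minimal
Python sketch. It does not use the Cassandra Thrift API: the row is
modeled as an in-memory dict, and the names `row`, `slice_by_names` and
`slice_by_range` are hypothetical. The range is treated as inclusive at
both ends in this model.

```python
from bisect import bisect_left, bisect_right

# Hypothetical in-memory model of one row: column name -> value.
row = {"doc0001": b"a", "doc0005": b"b", "doc0009": b"c", "doc0042": b"d"}
sorted_names = sorted(row)  # columns are kept sorted by name


def slice_by_names(names):
    """Model of a predicate that lists explicit column names."""
    return [(n, row[n]) for n in names if n in row]


def slice_by_range(start, finish, count=100):
    """Model of a range predicate: start <= name <= finish, at most count columns."""
    lo = bisect_left(sorted_names, start)
    hi = bisect_right(sorted_names, finish)
    return [(n, row[n]) for n in sorted_names[lo:hi][:count]]


print(slice_by_names(["doc0005", "missing"]))
print(slice_by_range("doc0002", "doc0010"))
```

Both calls only ever look at the columns of one row, which is the case
the question is about.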

My question is about the implementation of this column 'selection'.
I vaguely remember reading somewhere (but I cannot find the link again)
that this was implemented using a Lucene index over the column names for
each row.
Is that true? Is there a small Lucene index per row?

Also, we know that Lucene has some limitations
(http://lucene.apache.org/java/3_0_1/fileformats.html#Limitations): you
cannot index more than 2.1 billion documents, as a document ID is mapped
to a 32-bit int.

As I plan to store the IDs of my Cassandra documents in column names (the
global number of documents can go well beyond 2.1 billion), will I be
hit by the Lucene limitation? I.e., can I store Cassandra document IDs
(i.e. keys) in column names, if each individual row holds no more
than a few million of those IDs? I guess the answer is "yes I can",
because Lucandra uses a similar schema, but it is not clear to me why.
Is that because the Lucene index is built per row, so what really
matters is the number of columns in one single row and not the number of
distinct column names (globally over all the rows)?


Thanks in advance
TuX

Re: newbie question on how columns names are indexed/lucene limitations?

Posted by Schubert Zhang <zs...@gmail.com>.

The column index within a row is a sorted, block-based index (similar to a
B-tree), just like in Bigtable.
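
A minimal Python sketch of what a sorted, block-based index over column
names can look like. This is an illustration under assumptions, not
Cassandra's actual on-disk format: the names `build_index`, `lookup` and
`BLOCK_SIZE` are hypothetical, and blocks are sized here by column count
purely to keep the example small.

```python
from bisect import bisect_left

BLOCK_SIZE = 3  # hypothetical: columns per block in this toy model


def build_index(sorted_columns, block_size=BLOCK_SIZE):
    """Split sorted (name, value) pairs into blocks; keep one index entry per block."""
    blocks = [sorted_columns[i:i + block_size]
              for i in range(0, len(sorted_columns), block_size)]
    last_names = [block[-1][0] for block in blocks]  # sparse index: last name of each block
    return blocks, last_names


def lookup(blocks, last_names, name):
    """Binary-search the sparse index, then scan only the one candidate block."""
    i = bisect_left(last_names, name)
    if i == len(blocks):
        return None  # name sorts after every column in the row
    for col_name, value in blocks[i]:
        if col_name == name:
            return value
    return None


# One row whose column names are document IDs, kept sorted by name.
columns = sorted((f"doc{n:04d}", n) for n in range(0, 20, 2))
blocks, last_names = build_index(columns)
print(lookup(blocks, last_names, "doc0008"))
print(lookup(blocks, last_names, "doc0007"))
```

In this model the index is built per row over that row's own column
names, so its size depends on the number of columns in the row and not
on how many distinct column names exist across all rows.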

On Mon, Apr 26, 2010 at 2:43 AM, Stu Hood <st...@rackspace.com> wrote:

> The indexes within rows are _not_ implemented with Lucene: there is a
> custom index structure that allows for random access within a row. But, you
> should probably read http://wiki.apache.org/cassandra/CassandraLimitations
> to understand the current limitations of the file format, some of which are
> scheduled to be fixed soon.

RE: newbie question on how columns names are indexed/lucene limitations?

Posted by Stu Hood <st...@rackspace.com>.

The indexes within rows are _not_ implemented with Lucene: there is a custom index structure that allows for random access within a row. But, you should probably read http://wiki.apache.org/cassandra/CassandraLimitations to understand the current limitations of the file format, some of which are scheduled to be fixed soon.
