Posted to user@cassandra.apache.org by Courtney Robinson <sa...@live.co.uk> on 2010/09/03 12:26:26 UTC

indexing methods

A few of us are working on a book for Cassandra, and we (well, I did anyway) got to the point of wanting to include an example of a non-trivial inverted index.

I've been playing around with different ideas on how I could store the data, and I've had a look at the previous threads that touched on the subject. But with the 2 or 3 ideas I've seen on the list, someone always points out something that punches a hole in the approach.

I've been playing around with the idea of using a ColumnFamily for the index, where I store each term as the row key; each column name is a 64-bit long and its value is the doc id. If the column name represents a ranking for the doc id it stores, and the CompareWith option is LongType, then once a term is retrieved the first x columns would represent the most related docs for that term.
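To make the layout concrete, here is a minimal sketch of it as an in-memory Python model. It is not Cassandra client code: the class and method names (InvertedIndex, insert, top) are made up for illustration, and it assumes that a lower rank number means a more relevant doc, since LongType orders column names ascending. It also shows one consequence of using the rank as the column name: two docs with the same rank under the same term would overwrite each other.

```python
import bisect
from collections import defaultdict


class InvertedIndex:
    """Hypothetical in-memory stand-in for one column family.

    Row key = term, column name = 64-bit rank, column value = doc id.
    Columns are kept sorted by name, as a LongType comparator would.
    """

    def __init__(self):
        self._names = defaultdict(list)   # term -> sorted list of ranks
        self._values = defaultdict(dict)  # term -> {rank: doc_id}

    def insert(self, term, rank, doc_id):
        if not -(2 ** 63) <= rank < 2 ** 63:
            raise ValueError("rank must fit in a signed 64-bit long")
        if rank not in self._values[term]:
            bisect.insort(self._names[term], rank)
        # A column name is unique within a row, so reusing a rank
        # replaces whatever doc id was stored under it before.
        self._values[term][rank] = doc_id

    def top(self, term, count):
        """Return the first `count` (rank, doc_id) columns of the row."""
        names = self._names.get(term, [])[:count]
        return [(rank, self._values[term][rank]) for rank in names]


index = InvertedIndex()
index.insert("cassandra", 30, "doc-c")
index.insert("cassandra", 10, "doc-a")
index.insert("cassandra", 20, "doc-b")
print(index.top("cassandra", 2))
```

Reading "the first x columns" here corresponds to slicing from the start of the row. If higher numbers should mean more relevant, the slice would have to be read in reverse instead.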

I'd go on in more detail but I'm using my phone to write this and I think that gets the idea across.
Of course, my first thought is: is it scalable? In a system where possibly millions of docs are related to one term, is it a good idea to have potentially that many columns in one row, all associated with the one row key, which is the term?

I just want to know what others think, whether you have any suggestions, or whether you have something similar implemented and are able to share it.

On a side note, there has been a bit of talk about secondary indexes in 0.7. Can anyone shed some light on that, or point me to a presentation or the like where it's mentioned, so I can get a better idea of what it's for?

Thanks,
Courtney

Re: indexing methods

Posted by Jake Luciani <ja...@gmail.com>.

Hi Courtney,

You can take a look at Lucandra (http://github.com/tjake/Lucandra), which uses
the Lucene API to maintain an inverted index in Cassandra. There are a couple of
articles and presentations in the README that give more info on how this is
done.

-Jake
