Posted to user@cassandra.apache.org by Utku Can Topçu <ut...@topcu.gen.tr> on 2010/05/11 20:39:29 UTC
Inverted Indexing a ColumnFamily

Hello All,

I guess the subject speaks for itself.
I'm currently developing a document analysis engine using Cassandra as the
scalable storage.

I'd like to give a brief overview of the data model I'm using for this
purpose.

The key has the form timestamp.random(), so rows are sorted in
chronological order and I get range queries on timestamps out of the box.
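
To make the key format concrete, here is a minimal sketch in Python. The
function name, the millisecond resolution and the field widths are my own
assumptions for illustration, and the chronological ordering only holds if
the cluster uses an order-preserving partitioner:

```python
import random
import time


def make_document_key(timestamp_ms=None, rng=random):
    """Build a key of the form '<timestamp>.<random suffix>'.

    Hypothetical helper: the widths (13 and 6 digits) are assumptions.
    """
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    # Fixed-width, zero-padded fields so that string order equals
    # numeric (chronological) order.
    return "%013d.%06d" % (timestamp_ms, rng.randrange(10 ** 6))
```

The random suffix only serves to keep two documents created in the same
millisecond from colliding on the same key.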

But I still need to index some values:

I started testing with three types of fields in the Document ColumnFamily:

- fields containing text (several words): every word is an index term
- fields containing positive integers: the zero-padded integer is the index
  term
- fields containing an enumeration: the value itself is the index term
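
The three rules above can be sketched roughly as follows. The function
names, the lowercasing, the word-splitting rule and the padding width of 10
are assumptions made for the example, not fixed parts of the model:

```python
import re


def text_terms(value):
    # Assumption: lowercase and split on anything that is not a letter
    # or digit; every resulting word becomes an index term.
    return [w for w in re.split(r"[^0-9a-z]+", value.lower()) if w]


def integer_term(value, width=10):
    # Zero padding makes lexical order match numeric order, which is
    # what allows range queries over the terms.
    if value < 0:
        raise ValueError("only non-negative integers are supported")
    return str(value).zfill(width)


def enum_term(value):
    # The enumeration value is used as the index term unchanged.
    return value
```

For example, integer_term(9) sorts before integer_term(10) as a string,
which would not be true for the unpadded "9" and "10".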

For indexing I use another ColumnFamily called IndexCF. Its keys have the
form "field_name||index_term", and the values are references to the keys in
the Document ColumnFamily.
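
As a rough illustration of the IndexCF layout, the sketch below uses a plain
in-memory dictionary as a stand-in for the ColumnFamily (it does not talk to
Cassandra). The class and method names are hypothetical, and storing each
document key as a column name with an empty value is an assumption about how
the references are laid out:

```python
from collections import defaultdict

SEPARATOR = "||"


def index_row_key(field_name, index_term):
    # Row key of the form "field_name||index_term".
    return field_name + SEPARATOR + index_term


class InMemoryIndexCF:
    """Dictionary-backed stand-in for the IndexCF ColumnFamily."""

    def __init__(self):
        # row key -> {column name (document key): column value}
        self.rows = defaultdict(dict)

    def add(self, field_name, index_term, document_key):
        # One column per referenced document (assumed layout).
        row = self.rows[index_row_key(field_name, index_term)]
        row[document_key] = b""

    def lookup(self, field_name, index_term):
        # Return the referenced document keys in sorted order.
        row = self.rows.get(index_row_key(field_name, index_term), {})
        return sorted(row)
```

A lookup for ("type", "book") then reads the single row "type||book" and
returns every document key stored in it.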

After searching for projects related to indexing in Cassandra, I came
across Lucandra.

Since then I've been running tests with Lucandra (
http://github.com/tjake/Lucandra) for indexing these types of columns; it
basically uses a similar approach.
Lucandra works fine for indexing the columns containing text values and
zero-padded integers, and range queries on integers also work fine.

However, enumeration indexing is a really big problem.
Say we have 1M documents with a type field that can take 4 values (book,
magazine, newspaper, other). Assuming the values are distributed equally,
each "field_name||index_term" pair would have 250K related documents. If we
index with respect to this distribution, we'll end up with only 4 index
keys, each of them containing 250K columns. This basically means it's not
reasonable to index and search on the enumeration fields.
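
The arithmetic can be checked with a few lines of Python. This only counts
how many columns each index row would hold under the equal-distribution
assumption; the function name is made up for the example:

```python
from collections import Counter

TYPE_VALUES = ["book", "magazine", "newspaper", "other"]


def columns_per_index_row(num_documents, values=TYPE_VALUES):
    # Assign the values round-robin to model an equal distribution,
    # then count the documents behind each "type||value" row key.
    counts = Counter(values[i % len(values)] for i in range(num_documents))
    return {"type||" + value: n for value, n in counts.items()}
```

With 1,000,000 documents this gives 4 row keys with 250,000 columns each.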

I wrote all this in a hurry; I hope I was able to express what I'm opening
for discussion. Can you think of a better implementation for indexing
enumerations in Cassandra?

Best Regards,
Utku