Posted to user@cassandra.apache.org by Adi <ad...@gmail.com> on 2011/04/08 16:53:35 UTC

ballpark low cardinality range for secondary indexes

I am trying to decide whether to use secondary indexes or an inverted
index column family for a use case. Is there a suggested ballpark range
of low cardinality for which secondary indexes are suitable?
In other words, at what range should a secondary index be ruled in or out:
cardinality of tens, hundreds, thousands, millions?
I am not looking for tested numbers; a general suggestion or best-practice
recommendation will suffice.

Thanks.

-Adi

Re: ballpark low cardinality range for secondary indexes

Posted by Ed Anuff <ed...@anuff.com>.

Well, the Amazon paper is good at describing the nature of the
problem, but to solve it you'll probably want to use ZooKeeper. The
paper is useful in understanding exactly what you need to lock on and
what you don't while updating the index, so you can avoid slowing
things down any more than is necessary.

Ed
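
The index update Ed refers to can be illustrated with a minimal sketch. It is not taken from the thread or from the paper. It assumes the index is maintained by reading a row's old value, removing the old index entry, and writing the new one, and that a lock per row key is what serializes those steps. The dictionaries stand in for column families, and threading.Lock stands in for a distributed lock such as one built on ZooKeeper. The names update, lookup, and _lock_for are hypothetical.

```python
import threading
from collections import defaultdict

# Assumption: plain dicts stand in for the data CF and the index CF.
data = {}                     # row key -> current value of the indexed column
index = defaultdict(set)      # indexed value -> set of row keys

# Assumption: one lock per row key, standing in for a distributed lock.
_locks = defaultdict(threading.Lock)
_locks_guard = threading.Lock()


def _lock_for(row_key):
    """Return the lock for one row key, creating it on first use."""
    with _locks_guard:
        return _locks[row_key]


def update(row_key, new_value):
    """Change a row's indexed value and keep the index consistent."""
    with _lock_for(row_key):
        old_value = data.get(row_key)
        if old_value == new_value:
            return
        if old_value is not None:
            # Remove the stale entry so the key is listed under one value only.
            index[old_value].discard(row_key)
        index[new_value].add(row_key)
        data[row_key] = new_value


def lookup(value):
    """Row keys currently indexed under value (shown without locking)."""
    return sorted(index.get(value, ()))


update("user:1", "NY")
update("user:1", "CA")
print(lookup("NY"), lookup("CA"))
```

Without the lock, two writers updating the same row at once could each read the same old value and leave the key listed under two index values. Locking per row key rather than on the whole index keeps unrelated updates from waiting on each other.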

On Fri, Apr 8, 2011 at 1:47 PM, Adi <ad...@gmail.com> wrote:
> Thanks for the suggestions, Ed. Your blog post is quite helpful in deciding
> on and implementing CF inverted indexes.
> Our data definitely leans towards an external CF: it has high cardinality
> (1000s for one column, millions for another), multiple columns need to be
> indexed, and results are needed in sorted order.
> I hope that Amazon paper has some good tips on solving the transactional
> gotcha :-)
>
> -Adi

Re: ballpark low cardinality range for secondary indexes

Posted by Adi <ad...@gmail.com>.

Thanks for the suggestions, Ed. Your blog post is quite helpful in deciding
on and implementing CF inverted indexes.
Our data definitely leans towards an external CF: it has high cardinality
(1000s for one column, millions for another), multiple columns need to be
indexed, and results are needed in sorted order.
I hope that Amazon paper has some good tips on solving the transactional
gotcha :-)

-Adi
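
The sorted-order requirement can be sketched as follows. This is an in-memory illustration under stated assumptions, not Cassandra or Hector code. It assumes each index entry is a composite of (indexed value, row key) kept in sorted order, which is roughly what a composite column name would give inside one index row, since Cassandra keeps columns within a row sorted by its comparator. Here a Python list and the bisect module stand in for that ordering, and SortedIndex, add, and range are hypothetical names.

```python
import bisect


class SortedIndex:
    """Sorted list of (value, row_key) tuples standing in for one index row."""

    def __init__(self):
        self.entries = []

    def add(self, value, row_key):
        """Insert an entry at its sorted position, skipping exact duplicates."""
        entry = (value, row_key)
        pos = bisect.bisect_left(self.entries, entry)
        if pos == len(self.entries) or self.entries[pos] != entry:
            self.entries.insert(pos, entry)

    def range(self, low, high):
        """Row keys whose value v satisfies low <= v < high, in value order."""
        # A one-element tuple sorts before any (value, row_key) pair that
        # starts with the same value, so it marks the start of that value.
        start = bisect.bisect_left(self.entries, (low,))
        end = bisect.bisect_left(self.entries, (high,))
        return [row_key for _, row_key in self.entries[start:end]]


idx = SortedIndex()
idx.add(30, "user:2")
idx.add(25, "user:9")
idx.add(41, "user:4")
idx.add(30, "user:1")
print(idx.range(25, 40))
```

Because the row key is part of each entry, many rows can share an indexed value without overwriting one another, and a range query becomes a single contiguous slice.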

On Fri, Apr 8, 2011 at 3:49 PM, Ed Anuff <ed...@anuff.com> wrote:

> If you're just indexing on a single column value and the values have
> low cardinality, say in the tens, I'd have a wide row for each
> distinct value that contains the set of keys for rows that contain
> that value. For higher levels of cardinality, or if you're indexing on
> multiple columns, there are tradeoffs between secondary indexes and CF
> inverted indexes based on atomicity of updates, complexity of
> queries, and whether you need to get results in sorted order.
> Secondary indexes are usually the best starting point since they're
> easy to set up and use, whereas with CF inverted indexes you'll need
> to manage all that yourself. Some of the client libraries make it
> easier to build CF inverted indexes; Hector will soon have some
> capabilities for JPA users that leverage the new composite column types
> to do this. I wrote up a blog post a while back about
> indexing approaches at
> http://www.anuff.com/2011/02/indexing-in-cassandra.html that you might
> find useful, although it sounds like you're already familiar with the
> concepts.
>
> Ed

Re: ballpark low cardinality range for secondary indexes

Posted by Ed Anuff <ed...@anuff.com>.

If you're just indexing on a single column value and the values have
low cardinality, say in the tens, I'd have a wide row for each
distinct value that contains the set of keys for rows that contain
that value. For higher levels of cardinality, or if you're indexing on
multiple columns, there are tradeoffs between secondary indexes and CF
inverted indexes based on atomicity of updates, complexity of
queries, and whether you need to get results in sorted order.
Secondary indexes are usually the best starting point since they're
easy to set up and use, whereas with CF inverted indexes you'll need
to manage all that yourself. Some of the client libraries make it
easier to build CF inverted indexes; Hector will soon have some
capabilities for JPA users that leverage the new composite column types
to do this. I wrote up a blog post a while back about
indexing approaches at
http://www.anuff.com/2011/02/indexing-in-cassandra.html that you might
find useful, although it sounds like you're already familiar with the
concepts.

Ed
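
As a rough illustration of the wide-row approach described above, here is an in-memory sketch. It is not Cassandra or Hector client code. It assumes one index row per distinct value, with the matching row keys stored as that row's column names. The class InvertedIndex and its methods are hypothetical names, and a dict of sets stands in for the index column family.

```python
from collections import defaultdict


class InvertedIndex:
    """In-memory stand-in for an inverted-index column family.

    Each indexed value plays the role of one wide row, and the set of
    row keys plays the role of that row's column names.
    """

    def __init__(self):
        self.rows = defaultdict(set)

    def add(self, value, row_key):
        """Record that the row identified by row_key holds this value."""
        self.rows[value].add(row_key)

    def remove(self, value, row_key):
        """Drop the entry, for example when the row's value changes."""
        self.rows[value].discard(row_key)

    def lookup(self, value):
        """Return the row keys that hold this value, in sorted order."""
        return sorted(self.rows.get(value, ()))


index = InvertedIndex()
index.add("NY", "user:1")
index.add("NY", "user:7")
index.add("CA", "user:3")
print(index.lookup("NY"))
```

With cardinality in the tens there are only tens of such rows, each potentially very wide. Keeping the entries current on every write to the data is the part the application has to manage itself.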

On Fri, Apr 8, 2011 at 7:53 AM, Adi <ad...@gmail.com> wrote:
> I am trying to decide whether to use secondary indexes or an inverted
> index column family for a use case. Is there a suggested ballpark range
> of low cardinality for which secondary indexes are suitable?
> In other words, at what range should a secondary index be ruled in or out:
> cardinality of tens, hundreds, thousands, millions?
> I am not looking for tested numbers; a general suggestion or best-practice
> recommendation will suffice.
>
> Thanks.
>
> -Adi