Posted to user@cassandra.apache.org by Edward Capriolo <ed...@gmail.com> on 2011/11/11 06:20:31 UTC

Efficient map reduce over ranges of Cassandra data

Hey all,

I know there are several tickets in the pipeline that should make it possible
to use secondary indexes to run map reduce jobs that do not have to ingest
the entire dataset, such as:

https://issues.apache.org/jira/browse/CASSANDRA-1600

I ended up creating a sharded secondary index in user space (I just
call it ordered buckets), described here:

http://www.slideshare.net/edwardcapriolo/casbase-presentation/27
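
If you do not want to click through the slides, here is a simplified sketch
of the layout in Java. The class name, key format, separator, and bucket
count below are just illustrative choices; the actual schema in casbase may
differ in the details.

// Minimal sketch of the "ordered buckets" idea: index entries are spread
// over a fixed number of bucket rows, and the column names inside each
// bucket sort by the indexed value, so a range can be read as the same
// slice from every bucket, or handed out bucket by bucket as splits.
public final class OrderedBuckets {

    private static final int NUM_BUCKETS = 16;   // assumed shard count
    private static final char SEP = ':';

    // Pick a bucket row for an entry, e.g. "idx:city:7". Hashing the entity
    // key keeps any single indexed value from piling into one hot row.
    public static String bucketRowKey(String indexName, String entityKey) {
        int bucket = (entityKey.hashCode() & Integer.MAX_VALUE) % NUM_BUCKETS;
        return "idx" + SEP + indexName + SEP + bucket;
    }

    // Column name inside the bucket: indexed value first so the columns
    // (which Cassandra keeps sorted) are ordered by that value, with the
    // entity key appended to keep the column name unique.
    public static String indexColumnName(String indexedValue, String entityKey) {
        return indexedValue + SEP + entityKey;
    }
}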

Looking at the ordered buckets implementation, I realized it is a perfect
candidate for "efficient map reduce" since it is easy to split.

A unit test of that implementation is here:

https://github.com/edwardcapriolo/casbase/blob/master/src/test/java/com/jointhegrid/casbase/hadoop/OrderedBucketInputFormatTest.java
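
Roughly, a driver looks like the outline below; the unit test is the real
reference. The connection and bucket-range properties come from casbase, so
I am leaving them out rather than guessing at names, and the mapper's
key/value types here are just placeholders.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

import com.jointhegrid.casbase.hadoop.OrderedBucketInputFormat;

public class OrderedBucketJob {

    // Placeholder mapper: the real key/value types are whatever the input
    // format emits (assumed Text/Text here); it simply echoes its input.
    public static class EchoMapper extends Mapper<Text, Text, Text, Text> {
        @Override
        protected void map(Text key, Text value, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(key, value);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Cassandra host/keyspace and the bucket/slice range to scan would
        // be set on conf here; the exact property names come from casbase
        // and are omitted rather than guessed at.

        Job job = new Job(conf, "ordered-bucket-scan");
        job.setJarByClass(OrderedBucketJob.class);

        // Each bucket (or slice of a bucket) becomes one input split, so
        // mappers only read the rows covered by the requested range instead
        // of the whole column family.
        job.setInputFormatClass(OrderedBucketInputFormat.class);

        job.setMapperClass(EchoMapper.class);
        job.setNumReduceTasks(0);                     // map-only example
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path(args[0]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}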

With this you can currently do efficient map reduce on Cassandra data while
waiting for other integrated solutions to come along.

Re: Efficient map reduce over ranges of Cassandra data

Posted by Jeremy Hanna <je...@gmail.com>.
Nice!  Thanks Ed.
