Posted to user@cassandra.apache.org by Matt Kennedy <st...@gmail.com> on 2011/02/23 23:31:10 UTC

map reduce job over indexed range of keys

Let me start out by saying that I think I'm going to have to write a patch
to get what I want, but I'm fine with that.  I just wanted to check here
first to make sure that I'm not missing something obvious.

I'd like to be able to run a MapReduce job that takes a value in an indexed
column as a parameter, and uses that value to select the data that the
MapReduce job operates on.  Right now, it looks like this isn't possible
because org.apache.cassandra.hadoop.ColumnFamilyRecordReader will only
fetch data with get_range_slices, not get_indexed_slices.
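
To make this concrete, here is a rough sketch, against the 0.7 Thrift API,
of the kind of server-side selection I'd want the record reader to issue
instead of a plain key-range scan. The column family name, column name and
value are just placeholders for illustration:

import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.List;
import org.apache.cassandra.thrift.*;

public class IndexedSliceSketch {
    // Fetch the rows whose indexed "country" column equals "UK".
    public static List<KeySlice> fetchByCountry(Cassandra.Client client)
            throws Exception {
        IndexExpression expr = new IndexExpression(
                ByteBuffer.wrap("country".getBytes("UTF-8")),
                IndexOperator.EQ,
                ByteBuffer.wrap("UK".getBytes("UTF-8")));

        // Empty start_key = begin at the first matching row; 100 = page size.
        IndexClause clause = new IndexClause(Arrays.asList(expr),
                ByteBuffer.wrap(new byte[0]), 100);

        // Open-ended slice range: return every column of each matching row.
        SlicePredicate predicate = new SlicePredicate().setSlice_range(
                new SliceRange(ByteBuffer.wrap(new byte[0]),
                               ByteBuffer.wrap(new byte[0]),
                               false, Integer.MAX_VALUE));

        return client.get_indexed_slices(new ColumnParent("mycolumnfamily"),
                clause, predicate, ConsistencyLevel.ONE);
    }
}

The record reader would of course have to page through the matches by
advancing start_key, much like it already pages through get_range_slices.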

An example might be useful.  Let's say I want to run a MapReduce job over
all the data for a particular country.  Right now I can do this in
MapReduce by simply discarding all the data that is not from the country I
want to process. I suspect it would be faster if I could reduce the size
of the MapReduce job by selecting only the data I want using secondary
indexes in Cassandra.
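
For the record, the client-side filtering I do today looks roughly like
the sketch below (using the 0.7 Hadoop integration types; the column name
and country value are placeholders). Every row in the column family still
gets shipped to a mapper, only to be dropped:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.SortedMap;

import org.apache.cassandra.db.IColumn;
import org.apache.cassandra.utils.ByteBufferUtil;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CountryFilterMapper
        extends Mapper<ByteBuffer, SortedMap<ByteBuffer, IColumn>, Text, LongWritable> {

    // Placeholder column name and value; the real ones would come from the
    // job configuration.
    private static final ByteBuffer COUNTRY_COLUMN = ByteBufferUtil.bytes("country");
    private static final String WANTED_COUNTRY = "UK";
    private static final LongWritable ONE = new LongWritable(1);

    @Override
    protected void map(ByteBuffer key, SortedMap<ByteBuffer, IColumn> columns,
                       Context context) throws IOException, InterruptedException {
        IColumn country = columns.get(COUNTRY_COLUMN);
        // This is the wasted work: non-matching rows reach the mapper and
        // are simply thrown away.
        if (country == null
                || !WANTED_COUNTRY.equals(ByteBufferUtil.string(country.value()))) {
            return;
        }
        // Stand-in for the real per-country processing: count matching rows.
        context.write(new Text(ByteBufferUtil.string(key)), ONE);
    }
}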

So, first question: Am I wrong?  Is there some clever way to enable the
behavior I'm looking for (without modifying the cassandra codebase)?

Second question: If I'm not wrong, should I open a JIRA issue for this and
start coding up this feature?

Finally, the real reason that I want to get this working is so that I can
enhance the CassandraStorage Pig loadfunc so that it can take query
parameters in the URL string that is used to specify the keyspace and
column family.  So for example, you might load data into Pig with this
syntax:

rows = LOAD 'cassandra://mykeyspace/mycolumnfamily?country=UK' using
CassandraStorage();

I'd like to get some feedback on that syntax.
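
On the implementation side, the loadfunc's setLocation() would just need
to peel the query string off that URI and stash it in the job
configuration for the record reader to pick up. Something like the sketch
below; the class and property names are hypothetical placeholders, not an
existing API:

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import org.apache.hadoop.conf.Configuration;

public final class IndexedLocationParser {

    // e.g. location = "cassandra://mykeyspace/mycolumnfamily?country=UK"
    public static void applyIndexFilter(String location, Configuration conf)
            throws IOException {
        try {
            URI uri = new URI(location);
            String query = uri.getQuery();    // "country=UK"
            if (query == null) {
                return;   // no filter requested, behave exactly as today
            }
            String[] kv = query.split("=", 2);
            if (kv.length != 2) {
                throw new IOException("Expected column=value, got: " + query);
            }
            // Hypothetical keys a patched ColumnFamilyRecordReader would
            // read back to build an IndexClause instead of a key range.
            conf.set("cassandra.input.index.column", kv[0]);
            conf.set("cassandra.input.index.value", kv[1]);
        } catch (URISyntaxException e) {
            throw new IOException("Bad location URI: " + location, e);
        }
    }

    private IndexedLocationParser() {}
}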

Thanks,
Matt Kennedy

Re: map reduce job over indexed range of keys

Posted by Mick Semb Wever <mc...@apache.org>.
On Thu, 2011-02-24 at 19:45 -0500, Matt Kennedy wrote:
> Right, so I'm interpreting silence as a confirmation on all points. I
> opened:
> https://issues.apache.org/jira/browse/CASSANDRA-2245
> https://issues.apache.org/jira/browse/CASSANDRA-2246

I think https://issues.apache.org/jira/browse/CASSANDRA-1125 is what you
were looking for. Sorry for the late reply.

~mck

-- 
"When there is no enemy within, the enemies outside can't hurt you."
African proverb 
| http://semb.wever.org | http://sesat.no
| http://finn.no       | Java XSS Filter


Re: map reduce job over indexed range of keys

Posted by Matt Kennedy <st...@gmail.com>.
Right, so I'm interpreting silence as a confirmation on all points. I
opened:
https://issues.apache.org/jira/browse/CASSANDRA-2245
https://issues.apache.org/jira/browse/CASSANDRA-2246

to work on these.

On Wed, Feb 23, 2011 at 5:31 PM, Matt Kennedy <st...@gmail.com> wrote:

> Let me start out by saying that I think I'm going to have to write a patch
> to get what I want, but I'm fine with that.  I just wanted to check here
> first to make sure that I'm not missing something obvious.
>
> I'd like to be able to run a MapReduce job that takes a value in an indexed
> column as a parameter, and uses that value to select the data that the
> MapReduce job operates on.  Right now, it looks like this isn't possible
> because org.apache.cassandra.hadoop.ColumnFamilyRecordReader will only
> fetch data with get_range_slices, not get_indexed_slices.
>
> An example might be useful.  Let's say I want to run a MapReduce job over
> all the data for a particular country.  Right now I can do this in
> MapReduce by simply discarding all the data that is not from the country I
> want to process. I suspect it would be faster if I could reduce the size
> of the MapReduce job by selecting only the data I want using secondary
> indexes in Cassandra.
>
> So, first question: Am I wrong?  Is there some clever way to enable the
> behavior I'm looking for (without modifying the cassandra codebase)?
>
> Second question: If I'm not wrong, should I open a JIRA issue for this and
> start coding up this feature?
>
> Finally, the real reason that I want to get this working is so that I can
> enhance the CassandraStorage pig loadfunc so that it can take query
> parameters on in the URL string that is used to specify the keyspace and
> column family.  So for example, you might load data into Pig with this
> sytax:
>
> rows = LOAD 'cassandra://mykeyspace/mycolumnfamily?country=UK' using
> CassandraStorage();
>
> I'd like to get some feedback on that syntax.
>
> Thanks,
> Matt Kennedy
>