Posted to user@cassandra.apache.org by Phillip Michalak <ph...@digitalreasoning.com> on 2010/01/25 14:43:47 UTC

map/reduce on Cassandra

Multiple people have expressed an interest in 'hadoop integration' and  
'map/reduce functionality' within Cassandra. I'd like to get a feel  
for what that means to different people.

As a starting point for discussion, Jeff Hodges undertook a prototype  
effort last summer which was the subject of this thread: http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E.

Jeff explicitly mentions data locality as one of the things that was  
out of scope for the prototype. What other features or characteristics  
would you expect to see in an implementation?

Thanks,
Phil

Re: map/reduce on Cassandra

Posted by Vijay <vi...@gmail.com>.
+1

Regards,
</VJ>




On Mon, Jan 25, 2010 at 10:47 AM, Jeff Hodges <jh...@twitter.com> wrote:

> 1) Works with RandomPartitioner. This is huge and the only way almost
> everyone would be able to use it.
> 2) Ability to divide up the keys of a single node to more than one
> mapper. The prototype just slurped up everything on the node. This
> would probably be easiest to not allow as a configurable thing and
> just let it be part of the InputSplit calculation.
> 3) Progress information should be calculated and displayed.
> --
> Jeff
>
> On Mon, Jan 25, 2010 at 5:43 AM, Phillip Michalak
> <ph...@digitalreasoning.com> wrote:
> > Multiple people have expressed an interest in 'hadoop integration' and
> > 'map/reduce functionality' within Cassandra. I'd like to get a feel for
> > what that means to different people.
> >
> > As a starting point for discussion, Jeff Hodges undertook a prototype
> > effort last summer which was the subject of this thread:
> > http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E.
> >
> > Jeff explicitly mentions data locality as one of the things that was out
> > of scope for the prototype. What other features or characteristics would
> > you expect to see in an implementation?
> >
> > Thanks,
> > Phil
> >
>

Re: map/reduce on Cassandra

Posted by Jonathan Ellis <jb...@gmail.com>.
sstablekeys is really the wrong place to support m/r anyway; it just
shows that the index can handle what m/r will need.

On Mon, Jan 25, 2010 at 1:28 PM, Ryan Daum <ry...@thimbleware.com> wrote:
> On Mon, Jan 25, 2010 at 2:18 PM, Brandon Williams <dr...@gmail.com> wrote:
>
>> bin/sstablekeys will dump just the keys from an sstable without row
>> deserialization overhead, but it can't introspect a commitlog.
>> -Brandon
>
> Yes, and will it not also return the keys that are replicas from
> ranges 'belonging' to other nodes? I.e. running it on all boxes across
> a cluster with an RF > 1 would return duplicates where the data
> was replicated. Needs a flag to indicate uniqueness.
>
> Ryan
>

Re: map/reduce on Cassandra

Posted by Ryan Daum <ry...@thimbleware.com>.
On Mon, Jan 25, 2010 at 2:18 PM, Brandon Williams <dr...@gmail.com> wrote:

> bin/sstablekeys will dump just the keys from an sstable without row
> deserialization overhead, but it can't introspect a commitlog.
> -Brandon

Yes, and will it not also return the keys that are replicas from
ranges 'belonging' to other nodes? I.e. running it on all boxes across
a cluster with an RF > 1 would return duplicates where the data
was replicated. Needs a flag to indicate uniqueness.

Ryan
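The dedup Ryan asks for can be sketched as a primary-range filter: with RF > 1 every key lives on several nodes, so a cluster-wide dump emits duplicates unless each node keeps only the keys it is the primary replica for. A minimal, hypothetical sketch (a single ring of long tokens; the names are illustrative, not Cassandra's actual API):

```java
// Emit a key only if this node is its *primary* replica, so that a
// dump run on every node in an RF > 1 cluster contains no duplicates.
// Simplified single ring of long tokens; illustrative names only.
import java.util.Collection;
import java.util.TreeSet;

public class PrimaryRangeFilter {
    private final TreeSet<Long> ringTokens; // every node's token, sorted
    private final long localToken;          // this node's token

    public PrimaryRangeFilter(Collection<Long> ringTokens, long localToken) {
        this.ringTokens = new TreeSet<>(ringTokens);
        this.localToken = localToken;
    }

    // The primary replica for a key is the first node token at or after
    // the key's token, wrapping around the ring.
    public boolean isPrimaryHere(long keyToken) {
        Long owner = ringTokens.ceiling(keyToken);
        if (owner == null) {
            owner = ringTokens.first(); // wrapped past the last token
        }
        return owner == localToken; // unboxes owner, compares values
    }
}
```

Each node would apply this filter while streaming keys, which avoids the post-hoc dedup pass a "uniqueness flag" would otherwise imply.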

Re: map/reduce on Cassandra

Posted by Brandon Williams <dr...@gmail.com>.
On Mon, Jan 25, 2010 at 1:13 PM, Ryan Daum <ry...@thimbleware.com> wrote:

> I agree with what Jeff says here about RandomPartitioner support being key.
>
>
+1


> For my purposes with map/reduce I'd personally be fine with some
> general all-keys dump utility that wrote contents of one node to a
> file, and then just write my own integration from that file into
> Hadoop, etc.
>
> I guess I'm thinking something similar to sstable2json except that
> unfortunately sstable2json will dump replica data not just the local
> node's data. Getting the contents of the commitlog into the file would
> be nice, too.


bin/sstablekeys will dump just the keys from an sstable without row
deserialization overhead, but it can't introspect a commitlog.

-Brandon

Re: map/reduce on Cassandra

Posted by Ryan Daum <ry...@thimbleware.com>.
I agree with what Jeff says here about RandomPartitioner support being key.

For my purposes with map/reduce I'd personally be fine with some
general all-keys dump utility that wrote contents of one node to a
file, and then just write my own integration from that file into
Hadoop, etc.

I guess I'm thinking something similar to sstable2json except that
unfortunately sstable2json will dump replica data not just the local
node's data. Getting the contents of the commitlog into the file would
be nice, too.

R

On Mon, Jan 25, 2010 at 1:47 PM, Jeff Hodges <jh...@twitter.com> wrote:
> 1) Works with RandomPartitioner. This is huge and the only way almost
> everyone would be able to use it.
> 2) Ability to divide up the keys of a single node to more than one
> mapper. The prototype just slurped up everything on the node. This
> would probably be easiest to not allow as a configurable thing and
> just let it be part of the InputSplit calculation.
> 3) Progress information should be calculated and displayed.
>  --
> Jeff
>
> On Mon, Jan 25, 2010 at 5:43 AM, Phillip Michalak
> <ph...@digitalreasoning.com> wrote:
>> Multiple people have expressed an interest in 'hadoop integration' and
>> 'map/reduce functionality' within Cassandra. I'd like to get a feel for what
>> that means to different people.
>>
>> As a starting point for discussion, Jeff Hodges undertook a prototype effort
>> last summer which was the subject of this thread:
>> http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E.
>>
>> Jeff explicitly mentions data locality as one of the things that was out of
>> scope for the prototype. What other features or characteristics would you
>> expect to see in an implementation?
>>
>> Thanks,
>> Phil
>>
>
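Jeff's second point, dividing one node's keys across more than one mapper, amounts to carving the node's token range into sub-ranges, one per InputSplit. A rough sketch under the assumption of contiguous long token ranges (the class is hypothetical, not Hadoop's or Cassandra's actual split machinery):

```java
// Sketch of splitting one node's token range into numSplits sub-ranges,
// one per mapper, instead of handing the whole node to a single mapper.
// Assumes contiguous long tokens; illustrative only.
import java.util.ArrayList;
import java.util.List;

public class NodeRangeSplitter {
    // Split [start, end) into numSplits roughly equal [lo, hi) ranges.
    public static List<long[]> split(long start, long end, int numSplits) {
        List<long[]> splits = new ArrayList<>();
        long span = end - start;
        for (int i = 0; i < numSplits; i++) {
            long lo = start + span * i / numSplits;
            long hi = start + span * (i + 1) / numSplits;
            if (lo < hi) { // drop empty ranges when numSplits > span
                splits.add(new long[] { lo, hi });
            }
        }
        return splits;
    }
}
```

Doing this inside the InputSplit calculation, as Jeff suggests, keeps it out of user configuration entirely.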

Re: map/reduce on Cassandra

Posted by Jeff Hodges <jh...@twitter.com>.
1) Works with RandomPartitioner. This is huge and the only way almost
everyone would be able to use it.
2) Ability to divide up the keys of a single node to more than one
mapper. The prototype just slurped up everything on the node. This
would probably be easiest to not allow as a configurable thing and
just let it be part of the InputSplit calculation.
3) Progress information should be calculated and displayed.
 --
Jeff

On Mon, Jan 25, 2010 at 5:43 AM, Phillip Michalak
<ph...@digitalreasoning.com> wrote:
> Multiple people have expressed an interest in 'hadoop integration' and
> 'map/reduce functionality' within Cassandra. I'd like to get a feel for what
> that means to different people.
>
> As a starting point for discussion, Jeff Hodges undertook a prototype effort
> last summer which was the subject of this thread:
> http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3Cf5f3a6290907240123y22f065edp1649f7c5c1add491@mail.gmail.com%3E.
>
> Jeff explicitly mentions data locality as one of the things that was out of
> scope for the prototype. What other features or characteristics would you
> expect to see in an implementation?
>
> Thanks,
> Phil
>
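Jeff's third point fits Hadoop's RecordReader.getProgress() contract (a float from 0.0 to 1.0): for a token-range split, progress can be reported as the fraction of the range consumed so far. A hedged sketch, with hypothetical class and field names:

```java
// Sketch of progress reporting for a token-range split: the fraction
// of [start, end) consumed so far, matching the 0.0f..1.0f contract
// of Hadoop's RecordReader.getProgress(). Names are hypothetical.
public class RangeProgress {
    private final long start;
    private final long end;   // split covers [start, end)
    private long current;     // last token handed to the mapper

    public RangeProgress(long start, long end) {
        this.start = start;
        this.end = end;
        this.current = start;
    }

    public void advanceTo(long token) {
        current = token;
    }

    public float getProgress() {
        if (end == start) {
            return 1.0f; // empty split: nothing left to do
        }
        return Math.min(1.0f, (current - start) / (float) (end - start));
    }
}
```

Token-position progress is approximate when keys are unevenly distributed within the range, but it is cheap and monotonic, which is all the framework needs for display.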