Posted to dev@cassandra.apache.org by Jeff Hodges <je...@somethingsimilar.com> on 2009/07/24 10:23:48 UTC

hadoop tasks reading from cassandra

Hey,

Getting Hadoop to play nice with Cassandra has been a desire for many
folks on this list and probably more on the other one. For the
purposes of this email, I'm going to restrict this goal of getting
Hadoop to read from Cassandra in a not-stupid way.

"Not-stupid" has a very specific meaning. "Not-stupid" means that:

1) Each Hadoop Mapper sees only a small subset of the entire desired
dataset from Cassandra, and the entire desired dataset is never
seen by any one phase of the Hadoop process.

2) Every portion of the desired dataset will be unique to the Mapper
it is delivered to. No two Mappers will ever overlap.

3) There will be no portion of the desired dataset that is not seen by a Mapper.

4) Partitioning of the dataset to the Mappers should try to use the
Cassandra nodes efficiently. This means attempting to keep each
partition confined to a single node.

Conspicuously not on this list is data locality. That is, keeping the
data passed from a node to a given Mapper at or near the same machine.
This requires further investigation outside the scope of this initial
project.

Also, please remember that "not-stupid" is not the same as "smart".

= How Hadoop Wants It And How It's Been Done By HBase

I've dug around the Hadoop and HBase codebases and while my
understanding is not yet perfect, this seems to be the general layout
of the problem.

First, a subclass of InputFormat needs to be written. This class's job
is to split up the dataset for a Hadoop job into InputSplits, or more
accurately, subclasses of InputSplit. These InputSplits are
serialized to disk in HDFS in files named so that each is picked up by
just one Mapper. Okay, actually the Mapper has no idea about
InputSplits.

An InputSplit is loaded up on a machine running a Mapper, and then
getRecordReader() is called on the subclass of InputFormat, with the
Mapper's InputSplit passed in along with various Hadoop job
information. getRecordReader() returns a subclass of RecordReader that
allows the Mapper to call next() on it over and over again to run
through the portion of the dataset represented by the InputSplit.
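
To make that dance concrete, here's a rough illustration of what the
framework does with those pieces, written against the old
org.apache.hadoop.mapred API. It's only a sketch of the calls just
described, not code from any existing patch:

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.InputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Illustration only: roughly what the framework does with the three pieces.
// "format" is whatever InputFormat subclass the job was configured with.
public class ContractSketch {
  static void runOneMapTask(InputFormat<Text, Text> format, InputSplit split,
      JobConf job, Reporter reporter, Mapper<Text, Text, Text, Text> mapper,
      OutputCollector<Text, Text> output) throws IOException {
    // Earlier, on the job client: InputSplit[] splits = format.getSplits(job, n);
    // Each map task is handed exactly one of those splits.
    RecordReader<Text, Text> reader = format.getRecordReader(split, job, reporter);
    Text key = reader.createKey();
    Text value = reader.createValue();
    while (reader.next(key, value)) {      // walks just this split's slice of data
      mapper.map(key, value, output, reporter);
    }
    reader.close();
  }
}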

In the 0.19.3 version of the HBase codebase, InputSplits are created
by gathering all the start keys for each "region" (which conceptually
maps, roughly, to a Cassandra node) in the database and divvying up
the keys approximately evenly across the number of Mappers desired.
Each TableSplit (the HBase subclass of InputSplit) created has
information about the start key of the region, the end key of the
region and the region location (the "name" of the node the dataset is
on).

This splitting on keys works because HBase keys are always stored in
an ordered fashion. Basically, HBase always partitions using something
akin to Cassandra's OrderPreservingPartitioner.

I've posted a gist with just the method that does this divvying up, by
the by[1]. (Interestingly, the region location seems to only be
encoded to allow for a nice toString() method on InputSplit to be used
during logging.)

HBase's subclass of RecordReader uses an instance of the ClientScanner
class to keep state about where it is in the Mapper's portion of
the dataset. This ClientScanner queries HBase's META server (which
contains information about all regions) with the start key and table
name to gather what HBase region to talk to and caches that
information. Note that this is done for each Mapper on each machine.
Note, too, that the region information encoded in the TableSplit
doesn't seem to be used.

These last few points are where the code gets really hairy. I could be
wrong about some of it and corrections would be appreciated.

The important parts, anyway, are the subclassing of InputFormat,
RecordReader and InputSplit.

= How To Do It With Cassandra (Maybe)

So, I spoke to Jonathan on #cassandra about this and we tried to see
how we could take the HBase method and turn it into something that
would work with Cassandra.

Our initial assumption is that the Cassandra database to be mapped
over has to use the OrderPreservingPartitioner to keep the input
splitting consistent.

Now, Cassandra nodes don't really have a concept of a "start" and
"end" key. We could, however, get a start key for a given node by
taking the first key returned from SSTableReader#getIndexedKeys(). We
would then gather up the start keys from each of the nodes, and sort
them.

We could then use each key in this gathered list as a start key for a
"region" and, given an addition to the slice API, slice from start key
to the next start key (the end key, in HBase terminology). We would
need to modify the slice API to provide slices where the end key given
is exclusive to the set returned, instead of inclusive.

In terms of actual code to be written, our subclass of InputFormat is
what would gather this list of start keys, and we would serialize the
start/finish key pairs with our own subclass of InputSplit. And, of
course, our subclass of RecordReader would make the
finish-key-exclusive slice call.
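
Here's a minimal sketch of what that could look like, assuming OPP and
assuming some way for the job client to learn each node's start key.
fetchNodeStartKeys() below is a stand-in for that call, and every class
name here is hypothetical:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.InputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class CassandraInputFormatSketch implements InputFormat<Text, Text> {

  // Runs once on the job client: one split per (start key, next start key) pair.
  public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
    List<String> startKeys = new ArrayList<String>(fetchNodeStartKeys(job));
    Collections.sort(startKeys);  // OPP keeps key order consistent with ring order

    List<InputSplit> splits = new ArrayList<InputSplit>();
    for (int i = 0; i < startKeys.size(); i++) {
      String start = startKeys.get(i);
      // The last region runs from the last start key to "the end"; the empty
      // string is just a stand-in for however that ends up being expressed.
      String end = (i + 1 < startKeys.size()) ? startKeys.get(i + 1) : "";
      splits.add(new CassandraSplitSketch(start, end));  // end key is exclusive
    }
    return splits.toArray(new InputSplit[splits.size()]);
  }

  // Runs on each task: the real reader would issue the finish-key-exclusive
  // slice [start, end) for its split and feed rows back one next() at a time.
  public RecordReader<Text, Text> getRecordReader(InputSplit split, JobConf job,
      Reporter reporter) throws IOException {
    throw new IOException("sketch only -- see discussion above");
  }

  // Stand-in for however the start keys actually get exposed to the client.
  private List<String> fetchNodeStartKeys(JobConf job) {
    return new ArrayList<String>();
  }

  // Just the [start, end) key pair, Writable so each map task can deserialize
  // the one split it was handed.
  public static class CassandraSplitSketch implements InputSplit {
    private Text start = new Text();
    private Text end = new Text();

    public CassandraSplitSketch() {}
    public CassandraSplitSketch(String s, String e) { start.set(s); end.set(e); }

    public long getLength() { return 0; }                      // unknown up front
    public String[] getLocations() { return new String[0]; }   // no locality yet

    public void write(DataOutput out) throws IOException {
      start.write(out);
      end.write(out);
    }

    public void readFields(DataInput in) throws IOException {
      start.readFields(in);
      end.readFields(in);
    }
  }
}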

This method satisfies property 1 of our not-stupid definition. At no
point in our Hadoop job are we accessing the entire dataset or even
all of the keys (if I'm remembering how getIndexedKeys works
correctly).

This satisfies properties 2 and 3 because we are clearly slicing
everything once and only once unless I'm misremembering how
replication works w.r.t. ordered partitioning. If I am misremembering
and start keys are duplicated, we can just return a sorted set instead
of a sorted array.

This satisfies property 4 because we are slicing along the seams given
to us by the nodes themselves.

= Back To The Game

Right, so that's the first pass. What sucks about this? What rules
about it? Questions?

[1] http://gist.github.com/150217
--
Jeff

Re: hadoop tasks reading from cassandra

Posted by Jeff Hodges <je...@somethingsimilar.com>.
For those of you playing at home, a "stupid" version of the hadoop
support has been attached to CASSANDRA-342. I mention it here for the
curious, but please keep discussion of it in the ticket. Thanks!

https://issues.apache.org/jira/browse/CASSANDRA-342?focusedCommentId=12744001&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12744001

--
Jeff


Re: hadoop tasks reading from cassandra

Posted by Jonathan Ellis <jb...@gmail.com>.
yes, get_string_property is our information-about-the-cluster gateway.

the rest of the thrift api is for manipulating columns and so forth.
we shouldn't mix the two.
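
Something like this on the client side, say. The property name "token
map" and the JSON shape are made up here; only get_string_property
itself exists today, so treat the whole thing as a sketch:

import org.apache.cassandra.service.Cassandra;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TSocket;

// Sketch only: "token map" is an invented property name and the JSON layout
// is a guess at what "suitably encoded" might mean; nothing here is in trunk.
public class TokenMapSketch {
  public static void main(String[] args) throws Exception {
    TSocket socket = new TSocket("localhost", 9160);
    socket.open();
    Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(socket));
    String json = client.get_string_property("token map");  // hypothetical key
    // e.g. {"<token1>": "10.0.0.1", "<token2>": "10.0.0.2", ...} -- the job
    // client would decode this and turn adjacent entries into split boundaries.
    System.out.println(json);
    socket.close();
  }
}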

On Fri, Jul 31, 2009 at 10:51 AM, Jun Rao<ju...@almaden.ibm.com> wrote:
>>
>> Jun mentioned #197 -- I'm still -1 on adding such a beast to the
>> thrift API, but I think it would be ok to expose it in
>> get_string_property, suitably (json?) encoded.
>>
>
> Is there any difference between get_string_property and other thrift APIs?
> They are both exposed to the client through thrift.
>
> Jun

Re: hadoop tasks reading from cassandra

Posted by Jun Rao <ju...@almaden.ibm.com>.
>
> Jun mentioned #197 -- I'm still -1 on adding such a beast to the
> thrift API, but I think it would be ok to expose it in
> get_string_property, suitably (json?) encoded.
>

Is there any difference between get_string_property and other thrift APIs?
They are both exposed to the client through thrift.

Jun

Re: hadoop tasks reading from cassandra

Posted by Jonathan Ellis <jb...@gmail.com>.
On Wed, Jul 29, 2009 at 1:37 AM, Jeff Hodges<je...@somethingsimilar.com> wrote:
> Comments inline.
>
> On Fri, Jul 24, 2009 at 10:00 AM, Jonathan Ellis<jb...@gmail.com> wrote:
>> On Fri, Jul 24, 2009 at 11:08 AM, Jun Rao<ju...@almaden.ibm.com> wrote:
>>> 1. In addition to OrderPreservingPartitioner, it would be useful to support
>>> MapReduce on RandomPartitioned Cassandra as well. We had a rough prototype
>>> that sort-of works at this moment. The difficulty with random partitioner
>>> is that it's a bit hard to generate the splits. In our prototype, we simply
>>> map each row to a split. This is ok for fat rows (e.g., a row includes all
>>> info for a user), but may be too fine-grained for other cases. Another
>>> possibility is to generate a split that corresponds to a set of rows in a
>>> hash-range (instead of key range). This requires some new apis in
>>> cassandra.
>>
>> -1 on adding new apis to pound a square peg into a round hole.
>>
>> like range queries, hadoop splits only really make sense on OPP.
>>
>
> Why would it only make sense on OPP? If it wasn't an externally
> exposed part of the api, what other concerns do you have about a hash
> range query? I can't think of any beyond the usual increased code
> complexity argument (i.e. development, testing and maintenance costs
> for it).

Because you have to violate encapsulation pretty badly and provide ops
acting on a hash instead of a key; you'd be providing a parallel,
public api that only applies to the hash partitioner.

It's a bad enough hack that I'd say "feel free to maintain that in
your own tree, but not in the public repo." :)

> There is something in Hadoop that attempts to solve some of the data
> locality problem called NetworkTopology. It's used to provide data
> locality for CombineFileInputFormat (among, I'm sure, other things).
>
> Combining this with the knowledge we would have of which Node each key
> range would be from, there is a chance Hadoop could do some of the
> locality work for us. Looking at the code for CombineFileInputFormat,
> it doesn't seem to be a particularly straightforward bit of work to
> translate to Cassandra, but I'm sure with a little time and maybe a
> little guidance from some Hadoop folks, we could make it happen.
>
> In any case, this seems to be evidence that locality can be added on
> later. It will not be a simple drop-in deal, but it wouldn't seem to
> require us to completely overhaul how we think about the input
> splitting.

Jun mentioned #197 -- I'm still -1 on adding such a beast to the
thrift API, but I think it would be ok to expose it in
get_string_property, suitably (json?) encoded.

> (Oh, and has anyone got a mnemonic or anything to remember which of
> org.apache.hadoop.mapred and org.apache.hadoop.mapreduce is the new
> one? I'll be jiggered if I can keep it straight.)

mapreduce is the new one.  they got lucky and left the full name open
for their second try. :)

-Jonathan

Re: hadoop tasks reading from cassandra

Posted by Jeff Hodges <je...@somethingsimilar.com>.
> I'm going to get this thing rolling. I'm still a little foggy on how
> data flows inside the cassandra codebase, so forgive me if the start
> is a little slow.
>

(Er, let me be more clear here. I'm not talking about the data model
or how and what IO is done. I meant that I'm not familiar enough to
know well what responsibilities the various objects have, nor how they
fit together.)
--
Jeff

Re: hadoop tasks reading from cassandra

Posted by Jeff Hodges <je...@somethingsimilar.com>.
Comments inline.

On Fri, Jul 24, 2009 at 10:00 AM, Jonathan Ellis<jb...@gmail.com> wrote:
> On Fri, Jul 24, 2009 at 11:08 AM, Jun Rao<ju...@almaden.ibm.com> wrote:
>> 1. In addition to OrderPreservingPartitioner, it would be useful to support
>> MapReduce on RandomPartitioned Cassandra as well. We had a rough prototype
>> that sort-of works at this moment. The difficulty with random partitioner
>> is that it's a bit hard to generate the splits. In our prototype, we simply
>> map each row to a split. This is ok for fat rows (e.g., a row includes all
>> info for a user), but may be too fine-grained for other cases. Another
>> possibility is to generate a split that corresponds to a set of rows in a
>> hash-range (instead of key range). This requires some new apis in
>> cassandra.
>
> -1 on adding new apis to pound a square peg into a round hole.
>
> like range queries, hadoop splits only really make sense on OPP.
>

Why would it only make sense on OPP? If it wasn't an externally
exposed part of the api, what other concerns do you have about a hash
range query? I can't think of any beyond the usual increased code
complexity argument (i.e. development, testing and maintenance costs
for it).

>> 2. For better performance, in the future, it would be useful to expose and
>> exploit data locality in cassandra so that a map task is executed on a
>> cassandra node that owns the data locally. A related issue is
>> https://issues.apache.org/jira/browse/CASSANDRA-197. It breaks
>> encapsulation, but it's worth thinking about. Google's DFS and Bigtable
>> both expose certain locality info for better performance.
>
> That's why I'd like to ship hadoop integration out of the box, instead
> of adding apis that should really be internal-use only for an external
> hadoop layer.
>

There is something in Hadoop that attempts to solve some of the data
locality problem called NetworkTopology. It's used to provide data
locality for CombineFileInputFormat (among, I'm sure, other things).

Combining this with the knowledge we would have of which Node each key
range would be from, there is a chance Hadoop could do some of the
locality work for us. Looking at the code for CombineFileInputFormat,
it doesn't seem to be a particularly straightforward bit of work to
translate to Cassandra, but I'm sure with a little time and maybe a
little guidance from some Hadoop folks, we could make it happen.

In any case, this seems to be evidence that locality can be added on
later. It will not be a simple drop-in deal, but it wouldn't seem to
require us to completely overhaul how we think about the input
splitting.
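
For what it's worth, the old-API InputSplit already has a getLocations()
hook that Hadoop's scheduler consults when placing tasks, so a crude
first cut might just be to stash the owning endpoints in the split. A
hypothetical sketch (the endpoint list would have to come from the same
cluster-info call that gives us the start keys; none of this exists):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.InputSplit;

// Hypothetical: a [start, end) split that also remembers which endpoints own
// that range. Hadoop tries to schedule the map task on (or near) one of the
// hosts returned by getLocations(), so this alone gets some locality for free.
public class LocatedSplitSketch implements InputSplit {
  private Text start = new Text();
  private Text end = new Text();
  private String[] endpoints = new String[0];  // hosts owning [start, end)

  public long getLength() { return 0; }
  public String[] getLocations() { return endpoints; }  // the locality hint

  public void write(DataOutput out) throws IOException {
    start.write(out);
    end.write(out);
    out.writeInt(endpoints.length);
    for (String e : endpoints) out.writeUTF(e);
  }

  public void readFields(DataInput in) throws IOException {
    start.readFields(in);
    end.readFields(in);
    endpoints = new String[in.readInt()];
    for (int i = 0; i < endpoints.length; i++) {
      endpoints[i] = in.readUTF();
    }
  }
}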

I'm going to get this thing rolling. I'm still a little foggy on how
data flows inside the cassandra codebase, so forgive me if the start
is a little slow.

(Oh, and has anyone got a mnemonic or anything to remember which of
org.apache.hadoop.mapred and org.apache.hadoop.mapreduce is the new
one? I'll be jiggered if I can keep it straight.)
--
Jeff

Re: hadoop tasks reading from cassandra

Posted by Jonathan Ellis <jb...@gmail.com>.
On Fri, Jul 24, 2009 at 11:08 AM, Jun Rao<ju...@almaden.ibm.com> wrote:
> 1. In addition to OrderPreservingPartitioner, it would be useful to support
> MapReduce on RandomPartitioned Cassandra as well. We had a rough prototype
> that sort-of works at this moment. The difficulty with random partitioner
> is that it's a bit hard to generate the splits. In our prototype, we simply
> map each row to a split. This is ok for fat rows (e.g., a row includes all
> info for a user), but may be too fine-grained for other cases. Another
> possibility is to generate a split that corresponds to a set of rows in a
> hash-range (instead of key range). This requires some new apis in
> cassandra.

-1 on adding new apis to pound a square peg into a round hole.

like range queries, hadoop splits only really make sense on OPP.

> 2. For better performance, in the future, it would be useful to expose and
> exploit data locality in cassandra so that a map task is executed on a
> cassandra node that owns the data locally. A related issue is
> https://issues.apache.org/jira/browse/CASSANDRA-197. It breaks
> encapsulation, but it's worth thinking about. Google's DFS and Bigtable
> both expose certain locality info for better performance.

That's why I'd like to ship hadoop integration out of the box, instead
of adding apis that should really be internal-use only for an external
hadoop layer.

-Jonathan

Re: hadoop tasks reading from cassandra

Posted by Jun Rao <ju...@almaden.ibm.com>.
Jeff,

This looks like a great start. A few comments.

1. In addition to OrderPreservingPartitioner, it would be useful to support
MapReduce on RandomPartitioned Cassandra as well. We had a rough prototype
that sort-of works at this moment. The difficulty with random partitioner
is that it's a bit hard to generate the splits. In our prototype, we simply
map each row to a split. This is ok for fat rows (e.g., a row includes all
info for a user), but may be too fine-grained for other cases. Another
possibility is to generate a split that corresponds to a set of rows in a
hash-range (instead of key range). This requires some new apis in
cassandra.

2. For better performance, in the future, it would be useful to expose and
exploit data locality in cassandra so that a map task is executed on a
cassandra node that owns the data locally. A related issue is
https://issues.apache.org/jira/browse/CASSANDRA-197. It breaks
encapsulation, but it's worth thinking about. Google's DFS and Bigtable
both expose certain locality info for better performance.

Jun
IBM Almaden Research Center
K55/B1, 650 Harry Road, San Jose, CA  95120-6099

junrao@almaden.ibm.com

