Posted to general@hadoop.apache.org by hu...@accenture.com on 2009/12/03 18:39:41 UTC

RE: what is the major difference between Hadoop and cloudMapReduce?

> -----Original Message-----
> From: Todd Lipcon [mailto:todd@cloudera.com]
> Sent: Monday, November 30, 2009 8:15 AM
> To: general@hadoop.apache.org
> Subject: Re: what is the major difference between Hadoop and
> cloudMapReduce?
> 
> On Mon, Nov 30, 2009 at 1:48 AM, <hu...@accenture.com> wrote:
> 
> > Todd,
> >
> > We do not keep all values for a key in memory. Instead, we only keep
> > the partial reduce result in memory, but throw away the value as soon
> > as it is used. The point you raised is still very valid if the reduce
> > state maintained per key is large, which I hope is a very rare case.
> > If you have some concrete workload examples, it will help us
> > prioritize the development effort. I can definitely see the benefits
> > of introducing a paging mechanism to spill partial reduce results to
> > the output SQS queue in the future. Thanks.
> >
> 
> Hi Huan,
> 
> I guess I misremembered or misread the paper.
> 
> Given this technique, doesn't it mean that reducers can only work when
> the reduce function is commutative and associative?
> 
> -Todd
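The partial-result scheme described in the quoted message can be sketched in a few lines of Python. This is only an illustration (the function and variable names are mine, not cloudMapReduce's): each value is folded into a running per-key state and discarded immediately, so only the partial results stay in memory. It also shows why Todd's question matters: the result is independent of arrival order only when the combine operation is commutative and associative.

```python
from collections import defaultdict

def partial_reduce(pairs, combine, init):
    """Fold each value into a running per-key state, discarding the
    value immediately afterward. Only partial results stay in memory."""
    state = defaultdict(lambda: init)
    for key, value in pairs:
        state[key] = combine(state[key], value)  # value is dropped here
    return dict(state)

# A sum works because + is commutative and associative: any arrival
# order of the values yields the same partial results.
pairs = [("a", 1), ("b", 2), ("a", 3)]
print(partial_reduce(pairs, lambda acc, v: acc + v, 0))  # {'a': 4, 'b': 2}
```

A non-associative combine (e.g. string concatenation in a required order) would give different answers for different arrival orders, which is the case the iterator discussion below addresses.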

Todd, 

I do not see how it is different from Hadoop's iterator interface, unless the reduce function relies on the values being sorted in a particular order when fed by the iterator one at a time. 

If there is no assumption on the value ordering, or the expected ordering differs from what the iterator presents, the reduce function has to read in all values from the iterator first (paging to disk if necessary), rearrange them as needed, then process them in that new ordering. This is the same as what we will do in our iterator interface. In the next() function, our reduce function can read in all values from the iterator (paging to disk if necessary); then in the finish() function, it rearranges the ordering and processes the values in the new order. 
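The next()/finish() pattern described above might look like the following minimal Python sketch. The next() and finish() names come from the discussion; the class name and the in-memory buffer are illustrative (a real implementation could spill the buffer to disk, as noted above).

```python
class OrderSensitiveReducer:
    """Illustrative reducer for a next()/finish() interface:
    next() is called once per value as it arrives; finish() runs
    after the last value has been seen."""

    def __init__(self):
        self.buffer = []

    def next(self, value):
        # Only buffer here; a real implementation could page the
        # buffer to disk if it grows large.
        self.buffer.append(value)

    def finish(self):
        # Impose the ordering the reduce logic actually needs,
        # then process the values in that order.
        ordered = sorted(self.buffer)
        return "".join(ordered)

r = OrderSensitiveReducer()
for v in ["c", "a", "b"]:
    r.next(v)
print(r.finish())  # abc
```

Values arriving in any order produce the same result, because finish() re-establishes the required ordering before processing.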

Apologies for the late reply. I am traveling in China with no reliable network connection this week. Thanks.

-Huan


This message is for the designated recipient only and may contain privileged, proprietary, or otherwise private information.  If you have received it in error, please notify the sender immediately and delete the original.  Any other use of the email by you is prohibited.

RE: what is the major difference between Hadoop and cloudMapReduce?

Posted by hu...@accenture.com.
Todd,

Yes, the difference is that, if special ordering is required, we hold a few keys' worth of values in memory vs. Hadoop holding one key's worth of values. 

SQS has no limit on the number of queues you can create or the number of messages you can put in a queue. In practice, we ran an experiment in which we created 4,000 queues with 100 threads, and it finished successfully in a few seconds. We have not stress-tested up to millions of queues, but even if such a test failed, that would be an SQS bug to be fixed, since Amazon advertises that there is no limit. Thanks.
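The concurrency pattern behind that experiment can be sketched as below. To keep the sketch self-contained, the create_queue function is a stand-in for the real SQS CreateQueue call (an actual client would issue an HTTP request there); the queue-naming scheme is also illustrative.

```python
import concurrent.futures
import threading

created = set()
lock = threading.Lock()

def create_queue(name):
    """Stand-in for an SQS CreateQueue call; a real client would
    issue an HTTP request here instead of updating a local set."""
    with lock:
        created.add(name)

# Create 4,000 queues using a pool of 100 worker threads, mirroring
# the experiment described above.
names = ["partition-%d" % i for i in range(4000)]
with concurrent.futures.ThreadPoolExecutor(max_workers=100) as pool:
    list(pool.map(create_queue, names))

print(len(created))  # 4000
```

Because queue creation is network-bound, the thread pool overlaps the request latencies, which is why 4,000 creations can complete in a few seconds.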

-Huan

> -----Original Message-----
> From: Todd Lipcon [mailto:todd@cloudera.com]
> Sent: Thursday, December 03, 2009 9:45 AM
> To: general@hadoop.apache.org
> Subject: Re: what is the major difference between Hadoop and
> cloudMapReduce?
> 
> On Thu, Dec 3, 2009 at 9:39 AM, <hu...@accenture.com> wrote:
> 
> > > -----Original Message-----
> > > From: Todd Lipcon [mailto:todd@cloudera.com]
> > > Sent: Monday, November 30, 2009 8:15 AM
> > > To: general@hadoop.apache.org
> > > Subject: Re: what is the major difference between Hadoop and
> > > cloudMapReduce?
> > >
> > > On Mon, Nov 30, 2009 at 1:48 AM, <hu...@accenture.com> wrote:
> > >
> > > > Todd,
> > > >
> > > > We do not keep all values for a key in memory. Instead, we only
> > > > keep the partial reduce result in memory, but throw away the
> > > > value as soon as it is used. The point you raised is still very
> > > > valid if the reduce state maintained per key is large, which I
> > > > hope is a very rare case. If you have some concrete workload
> > > > examples, it will help us prioritize the development effort. I
> > > > can definitely see the benefits of introducing a paging mechanism
> > > > to spill partial reduce results to the output SQS queue in the
> > > > future. Thanks.
> > > >
> > >
> > > Hi Huan,
> > >
> > > I guess I misremembered or misread the paper.
> > >
> > > Given this technique, doesn't it mean that reducers can only work
> > > when the reduce function is commutative and associative?
> > >
> > > -Todd
> >
> > Todd,
> >
> > I do not see how it is different from Hadoop's iterator interface,
> > unless the reduce function relies on the values being sorted in a
> > particular order when fed by the iterator one at a time.
> >
> > If there is no assumption on the value ordering, or the expected
> > ordering differs from what the iterator presents, the reduce function
> > has to read in all values from the iterator first (paging to disk if
> > necessary), rearrange them as needed, then process them in that new
> > ordering. This is the same as what we will do in our iterator
> > interface. In the next() function, our reduce function can read in
> > all values from the iterator (paging to disk if necessary); then in
> > the finish() function, it rearranges the ordering and processes the
> > values in the new order.
> >
> >
> If you want sorted values, you can get that in Hadoop, though it's not
> on by default.
>
> Also, the reducer only needs to keep all the values for a single key
> in RAM in this case. In your case, since the keys come in any order,
> the reducer would have to keep all the values for every key in that
> partition in RAM. I guess you're suggesting that you could have enough
> partitions that "every key in that partition" is only one or two, but
> for large datasets this just doesn't sound feasible (can you get
> millions of SQS queues?)
> 
> Sure, you can implement your own paging to disk, and then an external
> sort,
> and then read them back in the more convenient sorted order. But then
> you're
> just implementing what Hadoop already does for you :)
> 
> -Todd
> 
> 



Re: what is the major difference between Hadoop and cloudMapReduce?

Posted by Todd Lipcon <to...@cloudera.com>.
On Thu, Dec 3, 2009 at 9:39 AM, <hu...@accenture.com> wrote:

> > -----Original Message-----
> > From: Todd Lipcon [mailto:todd@cloudera.com]
> > Sent: Monday, November 30, 2009 8:15 AM
> > To: general@hadoop.apache.org
> > Subject: Re: what is the major difference between Hadoop and
> > cloudMapReduce?
> >
> > On Mon, Nov 30, 2009 at 1:48 AM, <hu...@accenture.com> wrote:
> >
> > > Todd,
> > >
> > > We do not keep all values for a key in memory. Instead, we only
> > > keep the partial reduce result in memory, but throw away the value
> > > as soon as it is used. The point you raised is still very valid if
> > > the reduce state maintained per key is large, which I hope is a
> > > very rare case. If you have some concrete workload examples, it
> > > will help us prioritize the development effort. I can definitely
> > > see the benefits of introducing a paging mechanism to spill partial
> > > reduce results to the output SQS queue in the future. Thanks.
> > >
> >
> > Hi Huan,
> >
> > I guess I misremembered or misread the paper.
> >
> > Given this technique, doesn't it mean that reducers can only work
> > when the reduce function is commutative and associative?
> >
> > -Todd
>
> Todd,
>
> I do not see how it is different from Hadoop's iterator interface,
> unless the reduce function relies on the values being sorted in a
> particular order when fed by the iterator one at a time.
>
> If there is no assumption on the value ordering, or the expected
> ordering differs from what the iterator presents, the reduce function
> has to read in all values from the iterator first (paging to disk if
> necessary), rearrange them as needed, then process them in that new
> ordering. This is the same as what we will do in our iterator
> interface. In the next() function, our reduce function can read in
> all values from the iterator (paging to disk if necessary); then in
> the finish() function, it rearranges the ordering and processes the
> values in the new order.
>
>
If you want sorted values, you can get that in Hadoop, though it's not on by
default.

Also, the reducer only needs to keep all the values for a single key in RAM
in this case. In your case, since the keys come in any order, the reducer
would have to keep all the values for every key in that partition in RAM. I
guess you're suggesting that you could have enough partitions that "every
key in that partition" is only one or two, but for large datasets this just
doesn't sound feasible (can you get millions of SQS queues?)
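The memory argument above can be made concrete with a small sketch (the variable names and toy data are mine). With keys arriving in arbitrary order, per-key state for every key in the partition must be resident at once; with a key-sorted stream (what Hadoop's shuffle delivers), each key's values are contiguous, so they can be reduced and discarded before the next key begins.

```python
from itertools import groupby
from operator import itemgetter

pairs = [("b", 1), ("a", 2), ("b", 3), ("a", 4)]  # keys in arbitrary order

# Arbitrary key order: values (or partial state) for every key in the
# partition must be held simultaneously.
by_key = {}
for k, v in pairs:
    by_key.setdefault(k, []).append(v)
resident_keys_unsorted = len(by_key)  # all keys resident at once

# Key-sorted input: values for one key are contiguous, so each key can
# be fully reduced and released before the next key's values arrive.
sums = {}
for k, group in groupby(sorted(pairs), key=itemgetter(0)):
    sums[k] = sum(v for _, v in group)  # only this key's values in RAM

print(resident_keys_unsorted, sums)  # 2 {'a': 6, 'b': 4}
```

The sort-then-group step in the sketch is exactly the external sort plus grouped iteration that Hadoop's shuffle performs on the framework's side.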

Sure, you can implement your own paging to disk, and then an external sort,
and then read them back in the more convenient sorted order. But then you're
just implementing what Hadoop already does for you :)

-Todd

