Posted to mapreduce-user@hadoop.apache.org by Felix Halim <fe...@gmail.com> on 2010/02/04 03:41:48 UTC

Barrier between reduce and map of the next round

Hi all,

As far as I know, a barrier exists between the map and reduce
functions in one round of MR, and there is another barrier where the
reducers end the job for that round. However, if we run several
rounds using the same map and reduce functions, the barrier between
the reduce of one round and the map of the next round is NOT
necessary, right? Since each reducer outputs only a single value per
key, it might as well run a map task for the next round immediately
rather than wait for all reducers to finish. This way, the
utilization of the machines between rounds can be improved.

Is there a setting in Hadoop to do that?

Felix Halim

Re: Barrier between reduce and map of the next round

Posted by Felix Halim <fe...@gmail.com>.
Hi Arun,

Ah yes.. the first comment by Owen O'Malley is exactly what I have in mind.

Thanks,

Felix Halim


Re: Barrier between reduce and map of the next round

Posted by Arun C Murthy <ac...@yahoo-inc.com>.
Felix, you might want to follow
https://issues.apache.org/jira/browse/MAPREDUCE-1434.
We are discussing ideas very similar to what you've just described
over there.

Arun



Re: Barrier between reduce and map of the next round

Posted by Felix Halim <fe...@gmail.com>.
Hi,

Currently the barrier between r(i) and m(i+1) is the Job barrier.
That is, m(i+1) is blocked until all r(i) finish (until Job i finishes).

I'm saying this blocking is not necessary if we can concatenate them
all into a single Job as an endless chain.
Then m(i+1) can start immediately even when r(i) has not finished.

The termination condition is when some counter, checked after r(i)
finishes, is zero. At that point the result of m(i+1) is discarded.

I don't know how to make it clearer than this...

Felix Halim
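
For concreteness, the counter in question could be maintained on the
reducer side along these lines. This is only a minimal sketch against
the old org.apache.hadoop.mapred API; the RoundReducer name, the
WORK_REMAINING counter, and the sum > 0 condition are hypothetical
placeholders, not anything from the thread.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class RoundReducer extends MapReduceBase
    implements Reducer<Text, LongWritable, Text, LongWritable> {

  // Hypothetical counter: nonzero after r(i) means another round is needed.
  public enum RoundCounters { WORK_REMAINING }

  public void reduce(Text key, Iterator<LongWritable> values,
                     OutputCollector<Text, LongWritable> out,
                     Reporter reporter) throws IOException {
    long sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    out.collect(key, new LongWritable(sum));

    // Illustrative condition: this key still has pending work, so the
    // driver should run another round m(i+1), r(i+1).
    if (sum > 0) {
      reporter.incrCounter(RoundCounters.WORK_REMAINING, 1);
    }
  }
}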


Re: Barrier between reduce and map of the next round

Posted by Amogh Vasekar <am...@yahoo-inc.com>.
Hi,
>> m1 | r1 m2 | r2 m3 | ... | r(K-1) mK | rK m(K+1)
My understanding is it would be something like:
m1 | (r1 m2) | m(identity) | r2, if you combine r(i) and m(i+1),
because of the hard distinction between Rs and Ms.

Amogh


Re: Barrier between reduce and map of the next round

Posted by Felix Halim <fe...@gmail.com>.
Talking about barriers: currently there is a barrier between every phase:

m1 | r1 | m2 | r2 | ... | mK | rK

where | is a barrier.

I'm saying that the barrier between ri and m(i+1) is not necessary.
So it should go like this:

m1 | r1 m2 | r2 m3 | ... | r(K-1) mK | rK m(K+1)

Here the result of m(K+1) is thrown away; we take the result of rK only.

Shuffling is needed only between mi and ri; no shuffling is needed
between ri and m(i+1). Thus, by removing the barrier between ri and
m(i+1), the overall job can be made faster.

Now the question is: can this be done using chaining?
AFAIK, the chain has to be defined before the job is started, right?
But because I don't know the value of K beforehand, I want the chain
to continue until some counter in the reduce task is zero.

Felix Halim


Re: Barrier between reduce and map of the next round

Posted by Amogh Vasekar <am...@yahoo-inc.com>.
>> However, from ri to m(i+1) there is an unnecessary barrier. m(i+1) should not need to wait for all reducers ri to finish, right?

Yes, but r(i+1) can't be in the same job, since that requires another sort and shuffle phase (a barrier). So you would end up doing job(i): m(i) r(i) m(i+1) and job(i+1): m(identity) r(i+1). Of course, this assumes you can't do r(i+1) in m(identity); if you can, then it doesn't need sort and shuffle, and hence your job would again be of the form m+rm* :)

Amogh
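
To make that job structure concrete, here is a rough sketch using the
old-API ChainMapper/ChainReducer classes that Ed Mazur points to
elsewhere in this thread. The RoundMapper pass-through stands in for
m(i), IdentityReducer stands in for r(i) and r(i+1), and the
Text/LongWritable types are assumptions made purely for illustration.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.ChainMapper;
import org.apache.hadoop.mapred.lib.ChainReducer;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class ChainedRounds {

  // Placeholder for the per-round map logic m(i); a pass-through here.
  public static class RoundMapper extends MapReduceBase
      implements Mapper<Text, LongWritable, Text, LongWritable> {
    public void map(Text key, LongWritable value,
                    OutputCollector<Text, LongWritable> out,
                    Reporter reporter) throws IOException {
      out.collect(key, value);
    }
  }

  // job(i): m(i) | r(i) m(i+1) -- the reduce output feeds the next map
  // inside the same reduce task, with no second sort-and-shuffle.
  static void configureJobI(JobConf job) {
    ChainMapper.addMapper(job, RoundMapper.class,
        Text.class, LongWritable.class, Text.class, LongWritable.class,
        true, new JobConf(false));
    ChainReducer.setReducer(job, IdentityReducer.class,  // stands in for r(i)
        Text.class, LongWritable.class, Text.class, LongWritable.class,
        true, new JobConf(false));
    ChainReducer.addMapper(job, RoundMapper.class,       // this is m(i+1)
        Text.class, LongWritable.class, Text.class, LongWritable.class,
        true, new JobConf(false));
  }

  // job(i+1): m(identity) | r(i+1) -- the identity map exists only because
  // a job must begin with a map phase ahead of the sort and shuffle.
  static void configureJobIPlusOne(JobConf job) {
    ChainMapper.addMapper(job, IdentityMapper.class,
        Text.class, LongWritable.class, Text.class, LongWritable.class,
        true, new JobConf(false));
    ChainReducer.setReducer(job, IdentityReducer.class,  // stands in for r(i+1)
        Text.class, LongWritable.class, Text.class, LongWritable.class,
        true, new JobConf(false));
  }
}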



Re: Barrier between reduce and map of the next round

Posted by Felix Halim <fe...@gmail.com>.
Hi Ed,

Currently my program is like this: m1, r1, m2, r2, ..., mK, rK. The
barrier between mi and ri is acceptable, since a reducer has to wait
for all map tasks to finish. However, from ri to m(i+1) there is an
unnecessary barrier: m(i+1) should not need to wait for all reducers
ri to finish, right?

Currently I create one Job for each mi, ri pair, so I have a total of
K jobs. Is there a way to chain them all together into a single Job?
However, I don't know the value of K in advance; it has to be checked
after each ri. So I'm thinking the job could speculatively continue
the chain over and over until it discovers that some counter in ri is
zero (at which point the result of m(K+1) is discarded, and the final
result of rK is taken).

Felix Halim
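
As a sketch of what that K-job driver looks like, and of exactly where
the barrier sits: this assumes the hypothetical RoundReducer (with its
WORK_REMAINING counter) and ChainedRounds.RoundMapper from the sketches
above, and uses SequenceFiles so that round i's reduce output can be
read back as round i+1's map input. Paths and job names are
illustrative only.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.Counters;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;

public class IterativeDriver {
  public static void main(String[] args) throws Exception {
    Path input = new Path(args[0]);
    long remaining = 1;
    for (int round = 1; remaining > 0; round++) {
      JobConf conf = new JobConf(IterativeDriver.class);
      conf.setJobName("round-" + round);
      conf.setMapperClass(ChainedRounds.RoundMapper.class);  // m(i)
      conf.setReducerClass(RoundReducer.class);              // r(i)
      conf.setOutputKeyClass(Text.class);
      conf.setOutputValueClass(LongWritable.class);
      conf.setInputFormat(SequenceFileInputFormat.class);
      conf.setOutputFormat(SequenceFileOutputFormat.class);

      Path output = new Path(args[1] + "-round" + round);
      FileInputFormat.setInputPaths(conf, input);
      FileOutputFormat.setOutputPath(conf, output);

      // runJob blocks until every reducer of this round has finished:
      // this is exactly the r(i) -> m(i+1) barrier under discussion.
      RunningJob job = JobClient.runJob(conf);

      Counters counters = job.getCounters();
      remaining =
          counters.getCounter(RoundReducer.RoundCounters.WORK_REMAINING);

      input = output;  // round i's reduce output is round i+1's map input
    }
  }
}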



Re: Barrier between reduce and map of the next round

Posted by Ed Mazur <ma...@cs.umass.edu>.
Felix,

You can use ChainMapper and ChainReducer to create jobs of the form
M+RM*. Is that what you're looking for? I'm not aware of anything that
allows you to have multiple reduce functions without the job
"barrier".

Ed
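
For reference, a minimal sketch of how such an M+RM* job is assembled:
mappers are added before the single reducer, then more mappers after
it, all running in one job with one sort and shuffle. The identity
classes and Text types below are placeholders, not a working pipeline.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.ChainMapper;
import org.apache.hadoop.mapred.lib.ChainReducer;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class MPlusRMStar {
  // Shape: M M | R M M -- two maps, the single shuffle barrier (|),
  // the reduce, and two more maps appended inside the reduce task.
  static void configure(JobConf job) {
    ChainMapper.addMapper(job, IdentityMapper.class, Text.class, Text.class,
        Text.class, Text.class, true, new JobConf(false));
    ChainMapper.addMapper(job, IdentityMapper.class, Text.class, Text.class,
        Text.class, Text.class, true, new JobConf(false));
    ChainReducer.setReducer(job, IdentityReducer.class, Text.class, Text.class,
        Text.class, Text.class, true, new JobConf(false));
    ChainReducer.addMapper(job, IdentityMapper.class, Text.class, Text.class,
        Text.class, Text.class, true, new JobConf(false));
    ChainReducer.addMapper(job, IdentityMapper.class, Text.class, Text.class,
        Text.class, Text.class, true, new JobConf(false));
  }
}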
