Posted to common-user@hadoop.apache.org by adeelmahmood <ad...@gmail.com> on 2010/01/26 23:27:18 UTC

do all mappers finish before reducer starts

I just have a conceptual question. My understanding is that all the mappers
have to complete their work before the reducers can start, because the
mappers don't know about each other: we need the values for a given key from
all the different mappers, so we have to wait until all the mappers have
collectively given the system every possible value for a key before it can
be passed on to the reducer.
But when I run these jobs, almost every time the reducers start working
before the mappers are all done, so it will say something like map 60%
reduce 30%. How does this work?
Does it find all possible values for a single key from all the mappers, pass
those on to the reducer, and then work on other keys?
Any help is appreciated.


Re: do all mappers finish before reducer starts

Posted by Ken Goodhope <ke...@gmail.com>.
The reduce function is always called after all map tasks are complete.  This
is not to be confused with the reduce "task".  A reduce task can be launched
and begin copying data as soon as the first mapper completes.  By default,
though, reduce tasks are not launched until 5% of the mappers have
completed.
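
A minimal sketch of how that threshold could be raised per job, assuming the
Hadoop 0.20-era property name mapred.reduce.slowstart.completed.maps (later
versions rename it to mapreduce.job.reduce.slowstart.completedmaps); the
class and job names here are just placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SlowstartExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Don't launch reduce tasks until half the map tasks have finished
    // (the default is 0.05, i.e. 5%).
    conf.set("mapred.reduce.slowstart.completed.maps", "0.50");
    Job job = new Job(conf, "slowstart-example");
    // ... set mapper, reducer, and input/output paths as usual ...
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}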


RE: do all mappers finish before reducer starts

Posted by Jeyendran Balakrishnan <jb...@docomolabs-usa.com>.
Correct me if I'm wrong, but this:

>> Yes, any reduce function call happens only after all the mappers have
>> finished their work.

is strictly true only if speculative execution is explicitly turned off.
Otherwise there is a chance that some reduce tasks can actually start before
all the maps are complete. If it turns out that a map output key used by one
of these speculative reduce tasks is emitted by some other map after the
reduce task has started, I think the JobTracker (JT) then kills that
speculative task.




Re: do all mappers finish before reducer starts

Posted by Gang Luo <lg...@yahoo.com.cn>.
It seems this is a hot issue.

When any mapper finishes (its sorted intermediate output is on local disk),
the shuffle starts to transfer that output to the corresponding reducers,
even while other mappers are still working.  Since the shuffle is part of
the reduce phase, the map phase and reduce phase can be seen as overlapping
to some extent. That is why you see such a progress report.

What you are actually asking about is the reduce function. Yes, any reduce
function call happens only after all the mappers have finished their work.

 -Gang
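
To see why the reduce function itself has to wait, here is a minimal
word-count-style reducer sketch (the class name is purely illustrative, not
something from this thread). The framework hands reduce() an iterator over
every value emitted for a key, so the call cannot be made until all map
output for that key has been collected:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    // 'values' must include every value any mapper emitted for 'key',
    // which is why this method can only run after all maps have finished.
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    context.write(key, new IntWritable(sum));
  }
}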



Re: do all mappers finish before reducer starts

Posted by Allen Wittenauer <aw...@linkedin.com>.
This is a tunable, btw.  You can set slowstart
(mapred.reduce.slowstart.completed.maps) to something higher than the
default 5%.  For shared grids, this should likely be 50% or more.  Otherwise
your reduce slots may get filled by jobs that aren't using them efficiently.
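
For instance, assuming the job's driver goes through ToolRunner /
GenericOptionsParser so that -D properties are honored (the jar, class, and
path names below are placeholders):

hadoop jar myjob.jar MyDriver -D mapred.reduce.slowstart.completed.maps=0.50 in out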





Re: do all mappers finish before reducer starts

Posted by "Eason.Lee" <le...@gmail.com>.
No, the reduce tasks will start as soon as the map tasks start, so the
reducers can begin transferring map outputs to local disk once some of the
maps have finished.


Re: do all mappers finish before reducer starts

Posted by prasenjit mukherjee <pm...@quattrowireless.com>.
For algebraic reduce functions, shouldn't it be possible to start the user
reduce function (step 3 in Ed Mazur's breakdown below) in parallel as well,
even before the mappers complete?
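
Hadoop's built-in hook for that kind of algebraic, partial aggregation is
the combiner rather than an early reduce: if the reduce function is
algebraic (a sum, say), the same reducer class can be registered as a
combiner and applied to partial map output before it ever reaches the
reduce function. A hedged sketch, continuing the illustrative job setup and
SumReducer from earlier in the thread:

// A combiner does not start the reduce function early; it applies the
// same algebra to partial map output, which captures most of the benefit.
job.setCombinerClass(SumReducer.class);
job.setReducerClass(SumReducer.class);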


Re: do all mappers finish before reducer starts

Posted by Ed Mazur <ma...@cs.umass.edu>.
You're right that the user reduce function cannot be applied until all
maps have completed. The values being reported about job completion
are a bit misleading in this sense. The reduce percentage you're
seeing actually encompasses three parts:

1. Fetching map output data
2. Merging map output data
3. Applying the user reduce function

Only the third part has the constraint of waiting for all maps; the
other two can be done in parallel, hence the reduce percentage
increasing before map completes. 0-33% reduce corresponds to part 1,
33-67% to part 2, and 67-100% to part 3. There is overlap between
parts 1 and 2 as the reduce memory buffer fills up, merges, and spills
to disk. There is also overlap between parts 2 and 3 because the final
merge is fed directly into the user reduce function to minimize the
amount of data written to disk.

Ed
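
To make those numbers concrete: a status line of map 60% reduce 30% means
the reduce side is still in part 1, roughly 0.30 / 0.33, or about 90% of
the way through fetching the map output produced so far; the user reduce
function (the 67-100% band) has not run at all yet.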
