Posted to mapreduce-user@hadoop.apache.org by Amit Mittal <am...@gmail.com> on 2014/01/27 13:17:26 UTC

Do all reducers take input from all NodeManagers/TaskTrackers of map tasks?

Hi,

Do all reducers take input from all the NodeManagers/TaskTrackers that ran map tasks?

*Reference:* "Hadoop: The Definitive Guide", 3rd ed., by Tom White,
page 210 (Ch 6: How MapReduce Works > Shuffle & Sort > The reducer side)

There is a note there; here is the text from the book:
How do reducers know which machines to fetch map output from?
...
Therefore, for a given job, the jobtracker (or application master) knows
the mapping between map outputs and hosts. A thread in the reducer
periodically asks the master for map output hosts
until it has retrieved them all.
...
*Question 1:* I believe the TaskTracker, and then the JobTracker/AppMaster,
receives updates through calls to Task.statusUpdate(TaskUmbilicalProtocol
obj). From these the JobTracker/AM knows the location of each map's output
file, the host details, etc. But how does it know which partitions or keys
that output contains? In other words, how does the JobTracker learn about
data partitions/keys from the heartbeat? That seems necessary to decide
whether a given mapper's output needs to be pulled or not.
*Question 2:* In short, not all reducers take output from all mappers; they
only connect to a mapper and take the output related to the keys partitioned
for that particular reducer. Is that correct?

Thanks
Amit Mittal

Re: Do all reducers take input from all NodeManagers/TaskTrackers of map tasks?

Posted by Amit Mittal <am...@gmail.com>.
Hi Vinod,

Thank you for the clarifications.
I have now reread the note, and it explains "How do reducers know which
**machines** to fetch map output from?". So it is about which nodes in the
entire cluster have map output ready for this reducer.

Thanks
Amit


On Mon, Jan 27, 2014 at 10:36 PM, Vinod Kumar Vavilapalli <
vinodkv@apache.org> wrote:

>
>
> On Jan 27, 2014, at 4:17 AM, Amit Mittal <am...@gmail.com> wrote:
>
> *Question 1:* I believe the TaskTracker, and then the JobTracker/AppMaster,
> receives updates through calls to
> Task.statusUpdate(TaskUmbilicalProtocol obj). From these the JobTracker/AM
> knows the location of each map's output file, the host details, etc. But how
> does it know which partitions or keys that output contains? In other words,
> how does the JobTracker learn about data partitions/keys from the
> heartbeat? That seems necessary to decide whether a given mapper's output
> needs to be pulled or not.
>
>
>
> Reducers pull map outputs from all the maps. So the JobTracker/AppMaster
> simply gives the completion events of *all* the maps to every reducer. There
> is no need for the JT/AM to track the distribution of keys.
>
>
> *Question 2:* In short, not all reducers take output from all mappers;
> they only connect to a mapper and take the output related to the keys
> partitioned for that particular reducer. Is that correct?
>
>
>
> That is, in a sense, correct. More precisely, every reducer gets a small
> chunk (its own partition) of the output of every mapper.
>
> +Vinod
>
> CONFIDENTIALITY NOTICE
> NOTICE: This message is intended for the use of the individual or entity
> to which it is addressed and may contain information that is confidential,
> privileged and exempt from disclosure under applicable law. If the reader
> of this message is not the intended recipient, you are hereby notified that
> any printing, copying, dissemination, distribution, disclosure or
> forwarding of this communication is strictly prohibited. If you have
> received this communication in error, please contact the sender immediately
> and delete it from your system. Thank You.

Re: Do all reducers take input from all NodeManagers/TaskTrackers of map tasks?

Posted by Vinod Kumar Vavilapalli <vi...@apache.org>.

On Jan 27, 2014, at 4:17 AM, Amit Mittal <am...@gmail.com> wrote:

> Question 1: I believe the TaskTracker, and then the JobTracker/AppMaster, receives updates through calls to Task.statusUpdate(TaskUmbilicalProtocol obj). From these the JobTracker/AM knows the location of each map's output file, the host details, etc. But how does it know which partitions or keys that output contains? In other words, how does the JobTracker learn about data partitions/keys from the heartbeat? That seems necessary to decide whether a given mapper's output needs to be pulled or not.


Reducers pull map outputs from all the maps. So the JobTracker/AppMaster simply gives the completion events of *all* the maps to every reducer. There is no need for the JT/AM to track the distribution of keys.
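
To make the point about completion events concrete, here is a minimal simulation in Python. It is only a sketch under stated assumptions: the class and method names (Master, ReducerEventFetcher, get_completion_events, poll) are made up for illustration and are not Hadoop's API. It shows a master that records only "map N finished on host H", and reducers that keep polling it until they know the host of every map output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MapCompletionEvent:
    """Hypothetical event: which map finished and on which host."""
    map_id: int
    host: str


class Master:
    """Toy stand-in for the JobTracker/AppMaster (not Hadoop code)."""

    def __init__(self, num_maps):
        self.num_maps = num_maps
        self._events = []  # append-only log of completion events

    def map_finished(self, map_id, host):
        # The master stores only the map id and host, never keys or partitions.
        self._events.append(MapCompletionEvent(map_id, host))

    def get_completion_events(self, from_index, max_events=10):
        # Every reducer is served from the same log, starting at its own offset.
        return self._events[from_index:from_index + max_events]


class ReducerEventFetcher:
    """Toy stand-in for the reducer thread that polls the master."""

    def __init__(self, master):
        self.master = master
        self.next_index = 0
        self.known_outputs = {}  # map_id -> host

    def poll(self):
        events = self.master.get_completion_events(self.next_index)
        self.next_index += len(events)
        for event in events:
            self.known_outputs[event.map_id] = event.host
        # Done once the location of every map's output is known.
        return len(self.known_outputs) == self.master.num_maps


master = Master(num_maps=3)
reducers = [ReducerEventFetcher(master) for _ in range(2)]

master.map_finished(0, "node-a")
master.map_finished(1, "node-b")
first_poll_done = [r.poll() for r in reducers]   # one map is still running

master.map_finished(2, "node-a")
second_poll_done = [r.poll() for r in reducers]  # all maps are now reported
```

Both toy reducers end up with the same map-to-host table, and the master never needed to look inside any map output.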


> Question 2: In short, not all reducers take output from all mappers; they only connect to a mapper and take the output related to the keys partitioned for that particular reducer. Is that correct?


That is, in a sense, correct. More precisely, every reducer gets a small chunk (its own partition) of the output of every mapper.
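
The same idea as a small Python sketch, again under stated assumptions: the function names are hypothetical, and zlib.crc32 is used only as a stable stand-in for whatever hash the job's partitioner applies. Each map task splits its own output into one partition per reducer, and reducer r fetches partition r from every map task, so all records for a given key end up at one reducer.

```python
import zlib


def partition_for(key, num_reducers):
    # Stand-in for a hash partitioner: stable hash of the key, modulo the
    # number of reducers. The choice of crc32 is an assumption for this sketch.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def run_map_task(records, num_reducers):
    # Each map task writes one partition per reducer (some may be empty).
    partitions = [[] for _ in range(num_reducers)]
    for key, value in records:
        partitions[partition_for(key, num_reducers)].append((key, value))
    return partitions


def shuffle_for_reducer(reducer_id, all_map_outputs):
    # A reducer visits every map output but takes only its own partition.
    fetched = []
    for map_output in all_map_outputs:
        fetched.extend(map_output[reducer_id])
    return fetched


num_reducers = 3
map_inputs = [
    [("apple", 1), ("pear", 1)],
    [("apple", 1), ("fig", 1)],
]
map_outputs = [run_map_task(records, num_reducers) for records in map_inputs]
per_reducer = [shuffle_for_reducer(r, map_outputs) for r in range(num_reducers)]
```

Here both ("apple", 1) records land in the same entry of per_reducer, even though they were produced by different map tasks, and no central component had to know which keys each map emitted.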

+Vinod

-- 
CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to 
which it is addressed and may contain information that is confidential, 
privileged and exempt from disclosure under applicable law. If the reader 
of this message is not the intended recipient, you are hereby notified that 
any printing, copying, dissemination, distribution, disclosure or 
forwarding of this communication is strictly prohibited. If you have 
received this communication in error, please contact the sender immediately 
and delete it from your system. Thank You.
