Posted to common-user@hadoop.apache.org by Matthew John <tm...@gmail.com> on 2010/10/13 14:12:08 UTC

doubts

Hi all,

I have a few questions:

1) What happens when a mapper running on node A needs data from a block that
node A does not have? (The block might be present on some other node in the
cluster.)

2) Is the Sort/Shuffle phase just a logical representation of all the map
outputs sorted together? And again, what happens when a reducer on node C
needs access to some map outputs that are not in its memory?

Matthew

Re: doubts

Posted by Harsh J <qw...@gmail.com>.

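1) The scheduler first tries to run the map task on a node that already holds
the block (a data-local task). If it cannot, the task reads the block from
another node on the same rack if one has it, or from a node on another rack
if not. I believe the block is streamed over the network as the map task
reads it, rather than copied in full beforehand. This preference is known as
rack locality. You can see counters associated with it in the jobs you run
(data-local tasks, rack-local tasks, etc.).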
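
To make the locality preference above concrete, here is a minimal sketch in
Python. It is not Hadoop's actual scheduler code; the names Locality,
classify, pick_replica and the rack_of mapping are all hypothetical and only
illustrate the ordering data-local, then rack-local, then off-rack.

```python
from enum import Enum


class Locality(Enum):
    # Lower value means a cheaper read for the map task.
    DATA_LOCAL = 0   # a replica of the block is on the task's own node
    RACK_LOCAL = 1   # a replica is on another node in the same rack
    OFF_RACK = 2     # every replica is on a different rack


def classify(task_host, replica_hosts, rack_of):
    """Classify a map task by how close it runs to its input block.

    rack_of is a hypothetical host -> rack mapping (an assumption standing
    in for the cluster's topology information).
    """
    if task_host in replica_hosts:
        return Locality.DATA_LOCAL
    task_rack = rack_of[task_host]
    if any(rack_of[host] == task_rack for host in replica_hosts):
        return Locality.RACK_LOCAL
    return Locality.OFF_RACK


def pick_replica(task_host, replica_hosts, rack_of):
    """Pick the closest replica to read from, using the same ordering."""
    return min(
        replica_hosts,
        key=lambda host: classify(task_host, [host], rack_of).value,
    )
```

For example, with nodes A and B on one rack and C and D on another, a task on
A whose block lives only on B and C would be classified as rack-local and
would read from B.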

2) The reducer has a phase called copy, which fetches _all_ the map outputs
it needs to act on (the first 33% of reduce progress). Only then is the sort
phase initiated (the next 33%). Only after copy and sort does the reduce
itself begin (up to 100%). So such an issue won't occur, as all map outputs
are fetched before any other logic runs.
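
The three reduce-side phases can be sketched roughly as follows. This is a
simplified in-memory illustration in Python, not Hadoop code: run_reduce,
map_outputs and reduce_fn are hypothetical names, and it assumes each map
output holds, per partition, a list of (key, value) pairs already sorted by
key.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def run_reduce(map_outputs, partition, reduce_fn):
    """Run copy, sort, and reduce for one reducer (one partition).

    map_outputs: one dict per map task, mapping partition number to a
    key-sorted list of (key, value) pairs (an assumption of this sketch).
    """
    # Copy phase: fetch this reducer's segment from every map output
    # before doing anything else.
    segments = [out[partition] for out in map_outputs if partition in out]

    # Sort phase: the segments are already sorted, so a merge by key
    # yields a single stream sorted by key.
    merged = heapq.merge(*segments, key=itemgetter(0))

    # Reduce phase: call reduce_fn once per key with all of its values.
    return [
        (key, reduce_fn(key, [value for _, value in group]))
        for key, group in groupby(merged, key=itemgetter(0))
    ]
```

Calling run_reduce with a summing reduce_fn over two map outputs that both
contain the key "a" in partition 0 produces a single ("a", total) entry,
since both segments are copied and merged before the reduce step sees them.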
