Posted to mapreduce-user@hadoop.apache.org by John Lilley <jo...@redpoint.net> on 2013/05/21 20:57:16 UTC

Shuffle phase replication factor

When MapReduce enters "shuffle" to partition the tuples, I am assuming that it writes intermediate data to HDFS.  What replication factor is used for those temporary files?
john

Re: Shuffle phase replication factor

Posted by Ian Wrigley <ia...@cloudera.com>.

Intermediate data is written to local disk, not to HDFS.

Ian.
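
To make the "local disk, not HDFS" point concrete, here is a minimal Python sketch of a map task partitioning its output by reducer and writing it to a local scratch directory. Everything in it is an assumption for illustration: the function names, the one-file-per-partition layout, and the use of crc32 as the hash are mine, not Hadoop's. Since nothing is written to HDFS, no replication factor comes into play.

```python
import os
import tempfile
import zlib
from collections import defaultdict


def partition_for(key: str, num_reducers: int) -> int:
    """Pick a reducer for a key, in the spirit of a hash partitioner.

    crc32 is used instead of Python's built-in hash() because hash() is
    randomized per process for strings.
    """
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def write_map_output(pairs, num_reducers: int, local_dir: str) -> dict:
    """Write (key, value) pairs to one local file per reducer partition.

    Returns {partition_number: file_path}. The files live on the local
    filesystem only (hypothetical layout, not Hadoop's on-disk format).
    """
    buckets = defaultdict(list)
    for key, value in pairs:
        buckets[partition_for(key, num_reducers)].append((key, value))

    paths = {}
    for part, items in buckets.items():
        path = os.path.join(local_dir, f"part-{part}.out")
        with open(path, "w", encoding="utf-8") as f:
            for key, value in sorted(items):
                f.write(f"{key}\t{value}\n")
        paths[part] = path
    return paths


if __name__ == "__main__":
    pairs = [("apple", 1), ("pear", 1), ("apple", 1), ("fig", 1)]
    with tempfile.TemporaryDirectory() as scratch:
        for part, path in sorted(write_map_output(pairs, 3, scratch).items()):
            print(part, os.path.basename(path))
```

All records with the same key land in the same partition, which is what lets a single reducer see every value for that key.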

On May 21, 2013, at 1:57 PM, John Lilley <jo...@redpoint.net> wrote:

> When MapReduce enters “shuffle” to partition the tuples, I am assuming that it writes intermediate data to HDFS.  What replication factor is used for those temporary files?
> john
>  

---
Ian Wrigley
Sr. Curriculum Manager
Cloudera, Inc
Cell: (323) 819 4075

Re: Shuffle phase replication factor

Posted by Sandy Ryza <sa...@cloudera.com>.

In MR1, the tasktracker serves the mapper files (so that tasks don't have
to stick around taking up resources).  In MR2, the shuffle service, which
lives inside the nodemanager, serves them.

-Sandy


On Thu, May 23, 2013 at 10:22 AM, John Lilley <jo...@redpoint.net> wrote:

> Ling,
>
> Thanks for the response!  I could use more clarification on item 1.
> Specifically:
>
> * mapred.reduce.parallel.copies limits the number of outbound
>   connections for a reducer, but not the inbound connections for a
>   mapper.  Does tasktracker.http.threads limit the number of
>   simultaneous inbound connections for a mapper, or only the size of
>   the thread pool servicing the connections?  (i.e. is it one thread
>   per inbound connection?)
>
> * Who actually creates the listen port for serving up the mapper
>   files?  The mapper task?  Or something more persistent in MapReduce?
>
> Thanks,
> John
>
> From: erlv5241@gmail.com [mailto:erlv5241@gmail.com] On Behalf Of Kun Ling
> Sent: Wednesday, May 22, 2013 7:50 PM
> To: user
> Subject: Re: Shuffle phase replication factor
>
> Hi John,
>
> 1. On limiting the number of simultaneous connections: you can
>    configure this using the mapred.reduce.parallel.copies setting.
>    The default is 5.
>
> 2. On the implication of aggressive disconnects: I think the impact
>    is small. Normally, each reducer connects to each mapper task and
>    asks for its partitions of the map output file. Each reducer keeps
>    only about 5 simultaneous connections open to fetch map output, so
>    even in a large MR cluster with 1000 nodes and a huge MR job with
>    1000 mappers and 1000 reducers, each node sees only about 5
>    connections. So the impact is small.
>
> 3. On what happens to pending/failing connections, the short answer
>    is: just try to reconnect. There is a List<> that holds all the
>    map outputs that still need to be copied, and an element is removed
>    iff its map output is successfully copied. An endless loop keeps
>    scanning the List and fetching the corresponding map output.
>
> All of the above is based on the Hadoop 1.0.4 source code, especially
> the ReduceTask.java file.
>
> yours,
> Ling Kun
>
> On Wed, May 22, 2013 at 10:57 PM, John Lilley <jo...@redpoint.net>
> wrote:
>
> Ummmm, is that also the limit for the number of simultaneous connections?
> In general, one does not need a 1:1 map between threads and connections.
>
> If this is the connection limit, does it imply that the client or server
> side aggressively disconnects after a transfer?
>
> What happens to the pending/failing connection attempts that exceed the
> limit?
>
> Thanks!
> john
>
> From: Rahul Bhattacharjee [mailto:rahul.rec.dgp@gmail.com]
> Sent: Wednesday, May 22, 2013 8:52 AM
> To: user@hadoop.apache.org
> Subject: Re: Shuffle phase replication factor
>
> There are configuration properties to control the number of copying
> threads:
> tasktracker.http.threads=40
> Thanks,
> Rahul
>
> On Wed, May 22, 2013 at 8:16 PM, John Lilley <jo...@redpoint.net>
> wrote:
>
> This brings up another nagging question I’ve had for some time.  Between
> HDFS and shuffle, there seems to be the potential for “every node
> connecting to every other node” via TCP.  Are there explicit mechanisms in
> place to manage or limit simultaneous connections?  Is the protocol simply
> robust enough to allow a server-side to disconnect at any time to free up
> slots and the client-side will retry the request?
>
> Thanks
> john
>
> From: Shahab Yunus [mailto:shahab.yunus@gmail.com]
> Sent: Wednesday, May 22, 2013 8:38 AM
> To: user@hadoop.apache.org
> Subject: Re: Shuffle phase replication factor
>
> As mentioned by Bertrand, Hadoop: The Definitive Guide is, well... a
> really definitive :) place to start. It is pretty thorough for a start,
> and once you have gone through it, the code will start making more sense
> too.
>
> Regards,
> Shahab
>
> On Wed, May 22, 2013 at 10:33 AM, John Lilley <jo...@redpoint.net>
> wrote:
>
> Oh I see.  Does this mean there is another service and TCP listen port for
> this purpose?
>
> Thanks for your indulgence… I would really like to read more about this
> without bothering the group but not sure where to start to learn these
> internals other than the code.
>
> john
>
> From: Kai Voigt [mailto:k@123.org]
> Sent: Tuesday, May 21, 2013 12:59 PM
> To: user@hadoop.apache.org
> Subject: Re: Shuffle phase replication factor
>
> The map output doesn't get written to HDFS. The map task writes its output
> to its local disk, and the reduce tasks pull the data over HTTP for
> further processing.
>
> On 21.05.2013 at 19:57, John Lilley <jo...@redpoint.net> wrote:
>
> When MapReduce enters “shuffle” to partition the tuples, I am assuming
> that it writes intermediate data to HDFS.  What replication factor is used
> for those temporary files?
>
> john
>
> --
> Kai Voigt
> k@123.org
>
> --
> http://www.lingcc.com

RE: Shuffle phase replication factor

Posted by John Lilley <jo...@redpoint.net>.

Ling,
Thanks for the response!  I could use more clarification on item 1.  Specifically:

* mapred.reduce.parallel.copies limits the number of outbound connections for a reducer, but not the inbound connections for a mapper.  Does tasktracker.http.threads limit the number of simultaneous inbound connections for a mapper, or only the size of the thread pool servicing the connections?  (i.e. is it one thread per inbound connection?)

* Who actually creates the listen port for serving up the mapper files?  The mapper task?  Or something more persistent in MapReduce?
Thanks,
John

From: erlv5241@gmail.com [mailto:erlv5241@gmail.com] On Behalf Of Kun Ling
Sent: Wednesday, May 22, 2013 7:50 PM
To: user
Subject: Re: Shuffle phase replication factor

Hi John,


   1. On limiting the number of simultaneous connections: you can configure this using the mapred.reduce.parallel.copies setting. The default is 5.

   2. On the implication of aggressive disconnects: I think the impact is small. Normally, each reducer connects to each mapper task and asks for its partitions of the map output file. Each reducer keeps only about 5 simultaneous connections open to fetch map output, so even in a large MR cluster with 1000 nodes and a huge MR job with 1000 mappers and 1000 reducers, each node sees only about 5 connections. So the impact is small.

   3. On what happens to pending/failing connections, the short answer is: just try to reconnect. There is a List<> that holds all the map outputs that still need to be copied, and an element is removed iff its map output is successfully copied. An endless loop keeps scanning the List and fetching the corresponding map output.
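
The retry behaviour in item 3 can be sketched roughly as follows. This is a simplified Python illustration, not the code in ReduceTask.java: the function name `fetch_all`, the caller-supplied `fetch` callback, and the `max_attempts_per_output` cap are all assumptions of mine (the description above speaks of an endless loop, so the cap is only there to keep the sketch from spinning forever).

```python
from collections import deque


def fetch_all(map_outputs, fetch, max_attempts_per_output=10):
    """Fetch every map output, re-queueing the ones that fail.

    `fetch(output_id)` is a caller-supplied function that returns the data
    or raises OSError. An entry leaves the pending queue for good only
    once its fetch succeeds.
    """
    pending = deque(map_outputs)
    attempts = {output_id: 0 for output_id in pending}
    fetched = {}

    while pending:
        output_id = pending.popleft()
        attempts[output_id] += 1
        try:
            fetched[output_id] = fetch(output_id)
        except OSError:
            if attempts[output_id] >= max_attempts_per_output:
                raise  # give up; this sketch has no smarter failure handling
            pending.append(output_id)  # failed: put it back and retry later
    return fetched, attempts


if __name__ == "__main__":
    calls = {}

    def flaky_fetch(output_id):
        # Hypothetical fetcher: "map_1" refuses the first two connections.
        calls[output_id] = calls.get(output_id, 0) + 1
        if output_id == "map_1" and calls[output_id] < 3:
            raise OSError("connection refused")
        return f"data from {output_id}"

    fetched, attempts = fetch_all(["map_0", "map_1", "map_2"], flaky_fetch)
    print(attempts)
```

A failed fetch costs nothing but a later retry, which is why a server dropping or refusing a connection is tolerable in this scheme.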


  All of the above is based on the Hadoop 1.0.4 source code, especially the ReduceTask.java file.
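
The "about 5 connections per node" estimate in item 2 is a back-of-envelope average, and it may help to see the arithmetic spelled out. The short Python sketch below assumes that all reducers run at once, that each keeps exactly `parallel_copies` fetches open, and that those fetches are spread evenly over the nodes holding map output; the function name is hypothetical.

```python
def average_inbound_fetches(num_reducers, parallel_copies, num_nodes):
    """Average number of concurrent shuffle fetches arriving at each node.

    Assumes every reducer runs at once, each keeps `parallel_copies`
    fetches open, and the fetches are spread evenly over `num_nodes`.
    """
    total_concurrent = num_reducers * parallel_copies
    return total_concurrent / num_nodes


if __name__ == "__main__":
    # The example from item 2: 1000 reducers, 5 parallel copies, 1000 nodes.
    print(average_inbound_fetches(1000, 5, 1000))  # 5.0
    # If only 100 nodes held map output, the same load would concentrate.
    print(average_inbound_fetches(1000, 5, 100))   # 50.0
```

The second call shows that the figure depends on how evenly map output is spread, so 5 is an average and not an upper bound.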

yours,
Ling Kun

On Wed, May 22, 2013 at 10:57 PM, John Lilley <jo...@redpoint.net> wrote:
Ummmm, is that also the limit for the number of simultaneous connections?  In general, one does not need a 1:1 map between threads and connections.
If this is the connection limit, does it imply  that the client or server side aggressively disconnects after a transfer?
What happens to the pending/failing connection attempts that exceed the limit?
Thanks!
john
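
The point about threads and connections not needing a 1:1 mapping can be illustrated with a generic worker pool. This Python sketch says nothing about how the TaskTracker's HTTP server actually treats tasktracker.http.threads; it only shows that a pool of N workers can work through more than N queued requests, with at most N in progress at any moment. The function name and the simulated transfer time are assumptions.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def serve_requests(num_requests, pool_size, work_seconds=0.01):
    """Run `num_requests` simulated transfers on `pool_size` worker threads.

    Returns (completed_count, peak_concurrency).
    """
    lock = threading.Lock()
    state = {"active": 0, "peak": 0}

    def handle(request_id):
        with lock:
            state["active"] += 1
            state["peak"] = max(state["peak"], state["active"])
        time.sleep(work_seconds)  # stand-in for streaming a map output
        with lock:
            state["active"] -= 1
        return request_id

    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        results = list(pool.map(handle, range(num_requests)))
    return len(results), state["peak"]


if __name__ == "__main__":
    done, peak = serve_requests(num_requests=20, pool_size=4)
    print(done, "requests completed, peak concurrency", peak)
```

Requests beyond the pool size simply wait in the executor's queue, so in this model the pool size bounds concurrency of service, not the number of requests that can be outstanding.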

From: Rahul Bhattacharjee [mailto:rahul.rec.dgp@gmail.com]
Sent: Wednesday, May 22, 2013 8:52 AM
To: user@hadoop.apache.org
Subject: Re: Shuffle phase replication factor

There are configuration properties to control the number of copying threads:
tasktracker.http.threads=40
Thanks,
Rahul

On Wed, May 22, 2013 at 8:16 PM, John Lilley <jo...@redpoint.net> wrote:
This brings up another nagging question I've had for some time.  Between HDFS and shuffle, there seems to be the potential for "every node connecting to every other node" via TCP.  Are there explicit mechanisms in place to manage or limit simultaneous connections?  Is the protocol simply robust enough to allow a server-side to disconnect at any time to free up slots and the client-side will retry the request?
Thanks
john

From: Shahab Yunus [mailto:shahab.yunus@gmail.com]
Sent: Wednesday, May 22, 2013 8:38 AM
To: user@hadoop.apache.org
Subject: Re: Shuffle phase replication factor

As mentioned by Bertrand, Hadoop: The Definitive Guide is, well... a really definitive :) place to start. It is pretty thorough for a start, and once you have gone through it, the code will start making more sense too.

Regards,
Shahab

On Wed, May 22, 2013 at 10:33 AM, John Lilley <jo...@redpoint.net> wrote:
Oh I see.  Does this mean there is another service and TCP listen port for this purpose?
Thanks for your indulgence... I would really like to read more about this without bothering the group but not sure where to start to learn these internals other than the code.
john

From: Kai Voigt [mailto:k@123.org]
Sent: Tuesday, May 21, 2013 12:59 PM
To: user@hadoop.apache.org
Subject: Re: Shuffle phase replication factor

The map output doesn't get written to HDFS. The map task writes its output to its local disk, and the reduce tasks pull the data over HTTP for further processing.

On 21.05.2013 at 19:57, John Lilley <jo...@redpoint.net> wrote:

When MapReduce enters "shuffle" to partition the tuples, I am assuming that it writes intermediate data to HDFS.  What replication factor is used for those temporary files?
john


--
Kai Voigt
k@123.org

--
http://www.lingcc.com

RE: Shuffle phase replication factor

Posted by John Lilley <jo...@redpoint.net>.

Ling,
Thanks for the response!  I could use more clarification on item 1.  Specifically

*         mapred.reduce.parallel.copies  limits the number of outbound connections for a reducer, but not the inbound connections for a mapper.  Does tasktracker.http.threads limit the number of simultaneous inbound connections for a mapper, or only the size of the thread pool servicing the connections?  (i.e. is it one thread per inbound connection?).

*         Who actually creates the listen port for serving up the mapper files?  The mapper task?  Or something more persistent in MapReduce?
Thanks,
John

From: erlv5241@gmail.com [mailto:erlv5241@gmail.com] On Behalf Of Kun Ling
Sent: Wednesday, May 22, 2013 7:50 PM
To: user
Subject: Re: Shuffle phase replication factor

Hi John,


   1. On the limit for the number of simultaneous connections: you can configure this with the mapred.reduce.parallel.copies property. The default is 5.

   2. As for the implication of aggressive disconnects, I am afraid it is only a small one. Normally, each reducer connects to each mapper task and asks for the partitions of the map output file.  Each reducer keeps only about 5 simultaneous connections open to fetch map output. So even on a large MR cluster with 1000 nodes, running a huge MR job with 1000 mappers and 1000 reducers, there are only about 5 such connections per node. The implication is therefore small.


  3.  As for what happens to pending or failing connections, the short answer is: the reducer just tries to reconnect.  There is a List<> that holds all the Mapper outputs that still need to be copied, and an element is removed only if that map output is copied successfully.  An endless loop keeps looking into the List and fetching the corresponding map output.


  All of the above is based on the Hadoop 1.0.4 source code, especially the ReduceTask.java file.

yours,
Ling Kun

On Wed, May 22, 2013 at 10:57 PM, John Lilley <jo...@redpoint.net>> wrote:
Ummmm, is that also the limit for the number of simultaneous connections?  In general, one does not need a 1:1 map between threads and connections.
If this is the connection limit, does it imply  that the client or server side aggressively disconnects after a transfer?
What happens to the pending/failing connection attempts that exceed the limit?
Thanks!
john

From: Rahul Bhattacharjee [mailto:rahul.rec.dgp@gmail.com<ma...@gmail.com>]
Sent: Wednesday, May 22, 2013 8:52 AM

To: user@hadoop.apache.org<ma...@hadoop.apache.org>
Subject: Re: Shuffle phase replication factor

There are properties/configuration to control the no. of copying threads for copy.
tasktracker.http.threads=40
Thanks,
Rahul

On Wed, May 22, 2013 at 8:16 PM, John Lilley <jo...@redpoint.net>> wrote:
This brings up another nagging question I've had for some time.  Between HDFS and shuffle, there seems to be the potential for "every node connecting to every other node" via TCP.  Are there explicit mechanisms in place to manage or limit simultaneous connections?  Is the protocol simply robust enough to allow a server-side to disconnect at any time to free up slots and the client-side will retry the request?
Thanks
john

From: Shahab Yunus [mailto:shahab.yunus@gmail.com<ma...@gmail.com>]
Sent: Wednesday, May 22, 2013 8:38 AM

To: user@hadoop.apache.org<ma...@hadoop.apache.org>
Subject: Re: Shuffle phase replication factor

As mentioned by Bertrand, Hadoop, The Definitive Guide, is well... really definitive :) place to start. It is pretty thorough for starts and once you are gone through it, the code will start making more sense too.

Regards,
Shahab

On Wed, May 22, 2013 at 10:33 AM, John Lilley <jo...@redpoint.net>> wrote:
Oh I see.  Does this mean there is another service and TCP listen port for this purpose?
Thanks for your indulgence... I would really like to read more about this without bothering the group but not sure where to start to learn these internals other than the code.
john

From: Kai Voigt [mailto:k@123.org<ma...@123.org>]
Sent: Tuesday, May 21, 2013 12:59 PM
To: user@hadoop.apache.org<ma...@hadoop.apache.org>
Subject: Re: Shuffle phase replication factor

The map output doesn't get written to HDFS. The map task writes its output to its local disk, the reduce tasks will pull the data through HTTP for further processing.

Am 21.05.2013 um 19:57 schrieb John Lilley <jo...@redpoint.net>>:

When MapReduce enters "shuffle" to partition the tuples, I am assuming that it writes intermediate data to HDFS.  What replication factor is used for those temporary files?
john


--
Kai Voigt
k@123.org<ma...@123.org>








--
http://www.lingcc.com
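
The "about 5 connections per node" estimate in item 2 of the quoted reply can be reproduced with a little arithmetic. The sketch below is only an illustration: the class and method names are made up for this example, and it assumes that fetches are spread evenly across nodes, which a real job does not guarantee.

```java
// Back-of-the-envelope estimate of shuffle fetch connections per node.
// Hypothetical helper, not part of Hadoop.
class ShuffleConnectionEstimate {

    // Average number of inbound fetch connections a node sees at once,
    // assuming every reducer keeps 'parallelCopies' fetches open and the
    // fetches are spread evenly over 'nodes' nodes (an assumption).
    static double averageInboundPerNode(int reducers, int parallelCopies, int nodes) {
        long totalConcurrentFetches = (long) reducers * parallelCopies;
        return (double) totalConcurrentFetches / nodes;
    }

    public static void main(String[] args) {
        int nodes = 1000;
        int reducers = 1000;
        int parallelCopies = 5; // default of mapred.reduce.parallel.copies cited in the thread
        // 1000 reducers * 5 copies = 5000 fetches, spread over 1000 nodes
        System.out.println(averageInboundPerNode(reducers, parallelCopies, nodes)); // prints 5.0
    }
}
```

This is an average only. If many reducers happen to fetch from the same node at the same moment, that node sees more than 5 connections for a while.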

Re: Shuffle phase replication factor

Posted by Kun Ling <lk...@gmail.com>.

Hi John,


   1. On the limit for the number of simultaneous connections: you can
configure this with the mapred.reduce.parallel.copies property. The default
is 5.

   2. As for the implication of aggressive disconnects, I am afraid it is
only a small one. Normally, each reducer connects to each mapper task and
asks for the partitions of the map output file.  Each reducer keeps only
about 5 simultaneous connections open to fetch map output. So even on a
large MR cluster with 1000 nodes, running a huge MR job with 1000 mappers
and 1000 reducers, there are only about 5 such connections per node. The
implication is therefore small.


  3.  As for what happens to pending or failing connections, the short
answer is: the reducer just tries to reconnect.  There is a List<> that
holds all the Mapper outputs that still need to be copied, and an element is
removed only if that map output is copied successfully.  An endless loop
keeps looking into the List and fetching the corresponding map output.


  All of the above is based on the Hadoop 1.0.4 source code, especially
the ReduceTask.java file.

yours,
Ling Kun


On Wed, May 22, 2013 at 10:57 PM, John Lilley <jo...@redpoint.net> wrote:

> Ummmm, is that also the limit for the number of simultaneous
> connections?  In general, one does not need a 1:1 map between threads and
> connections.
>
> If this is the connection limit, does it imply that the client or server
> side aggressively disconnects after a transfer?
>
> What happens to the pending/failing connection attempts that exceed the
> limit?
>
> Thanks!
>
> john
>
> From: Rahul Bhattacharjee [mailto:rahul.rec.dgp@gmail.com]
> Sent: Wednesday, May 22, 2013 8:52 AM
> To: user@hadoop.apache.org
> Subject: Re: Shuffle phase replication factor
>
> There are properties/configuration to control the no. of copying threads
> for copy.
> tasktracker.http.threads=40
> Thanks,
> Rahul
>
> On Wed, May 22, 2013 at 8:16 PM, John Lilley <jo...@redpoint.net>
> wrote:
>
> This brings up another nagging question I’ve had for some time.  Between
> HDFS and shuffle, there seems to be the potential for “every node
> connecting to every other node” via TCP.  Are there explicit mechanisms in
> place to manage or limit simultaneous connections?  Is the protocol simply
> robust enough to allow a server-side to disconnect at any time to free up
> slots and the client-side will retry the request?
>
> Thanks
>
> john
>
> From: Shahab Yunus [mailto:shahab.yunus@gmail.com]
> Sent: Wednesday, May 22, 2013 8:38 AM
> To: user@hadoop.apache.org
> Subject: Re: Shuffle phase replication factor
>
> As mentioned by Bertrand, Hadoop, The Definitive Guide, is well... really
> definitive :) place to start. It is pretty thorough for starts and once you
> are gone through it, the code will start making more sense too.
>
> Regards,
>
> Shahab
>
> On Wed, May 22, 2013 at 10:33 AM, John Lilley <jo...@redpoint.net>
> wrote:
>
> Oh I see.  Does this mean there is another service and TCP listen port for
> this purpose?
>
> Thanks for your indulgence… I would really like to read more about this
> without bothering the group but not sure where to start to learn these
> internals other than the code.
>
> john
>
> From: Kai Voigt [mailto:k@123.org]
> Sent: Tuesday, May 21, 2013 12:59 PM
> To: user@hadoop.apache.org
> Subject: Re: Shuffle phase replication factor
>
> The map output doesn't get written to HDFS. The map task writes its output
> to its local disk, the reduce tasks will pull the data through HTTP for
> further processing.
>
> Am 21.05.2013 um 19:57 schrieb John Lilley <jo...@redpoint.net>:
>
> When MapReduce enters “shuffle” to partition the tuples, I am assuming
> that it writes intermediate data to HDFS.  What replication factor is used
> for those temporary files?
>
> john
>
> --
>
> Kai Voigt
>
> k@123.org
>



-- 
http://www.lingcc.com

RE: Shuffle phase replication factor

Posted by John Lilley <jo...@redpoint.net>.

Ummmm, is that also the limit for the number of simultaneous connections?  In general, one does not need a 1:1 map between threads and connections.
If this is the connection limit, does it imply  that the client or server side aggressively disconnects after a transfer?
What happens to the pending/failing connection attempts that exceed the limit?
Thanks!
john

From: Rahul Bhattacharjee [mailto:rahul.rec.dgp@gmail.com]
Sent: Wednesday, May 22, 2013 8:52 AM
To: user@hadoop.apache.org
Subject: Re: Shuffle phase replication factor

There are properties/configuration to control the no. of copying threads for copy.
tasktracker.http.threads=40
Thanks,
Rahul

On Wed, May 22, 2013 at 8:16 PM, John Lilley <jo...@redpoint.net>> wrote:
This brings up another nagging question I’ve had for some time.  Between HDFS and shuffle, there seems to be the potential for “every node connecting to every other node” via TCP.  Are there explicit mechanisms in place to manage or limit simultaneous connections?  Is the protocol simply robust enough to allow a server-side to disconnect at any time to free up slots and the client-side will retry the request?
Thanks
john

From: Shahab Yunus [mailto:shahab.yunus@gmail.com<ma...@gmail.com>]
Sent: Wednesday, May 22, 2013 8:38 AM

To: user@hadoop.apache.org<ma...@hadoop.apache.org>
Subject: Re: Shuffle phase replication factor

As mentioned by Bertrand, Hadoop, The Definitive Guide, is well... really definitive :) place to start. It is pretty thorough for starts and once you are gone through it, the code will start making more sense too.

Regards,
Shahab

On Wed, May 22, 2013 at 10:33 AM, John Lilley <jo...@redpoint.net>> wrote:
Oh I see.  Does this mean there is another service and TCP listen port for this purpose?
Thanks for your indulgence… I would really like to read more about this without bothering the group but not sure where to start to learn these internals other than the code.
john

From: Kai Voigt [mailto:k@123.org<ma...@123.org>]
Sent: Tuesday, May 21, 2013 12:59 PM
To: user@hadoop.apache.org<ma...@hadoop.apache.org>
Subject: Re: Shuffle phase replication factor

The map output doesn't get written to HDFS. The map task writes its output to its local disk, the reduce tasks will pull the data through HTTP for further processing.

Am 21.05.2013 um 19:57 schrieb John Lilley <jo...@redpoint.net>>:

When MapReduce enters “shuffle” to partition the tuples, I am assuming that it writes intermediate data to HDFS.  What replication factor is used for those temporary files?
john

--
Kai Voigt
k@123.org<ma...@123.org>
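
The point above that threads and connections need not map 1:1 can be shown with a generic worker pool. The sketch below uses plain java.util.concurrent and is not the TaskTracker's HTTP server; the class and method names are invented for this example. Whether tasktracker.http.threads caps connections or only the worker pool is exactly the open question in this message, so the sketch takes no position on it and only shows what a bounded pool does with extra requests.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative only: more requests than worker threads, excess requests wait in a queue.
class BoundedWorkerPool {

    // Runs 'requests' short tasks on a pool of 'threads' workers.
    // Returns {tasks completed, highest number of tasks running at once}.
    static int[] serve(int threads, int requests) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicInteger active = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        AtomicInteger completed = new AtomicInteger();
        List<Future<?>> futures = new ArrayList<>();
        try {
            for (int i = 0; i < requests; i++) {
                futures.add(pool.submit(() -> {
                    int now = active.incrementAndGet();
                    peak.accumulateAndGet(now, Math::max); // remember the highest concurrency seen
                    try {
                        Thread.sleep(5);                   // pretend to stream a map output
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                    active.decrementAndGet();
                    completed.incrementAndGet();
                }));
            }
            for (Future<?> f : futures) {
                f.get();                                   // wait for every request to finish
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
        return new int[] { completed.get(), peak.get() };
    }

    public static void main(String[] args) {
        // 40 is the tasktracker.http.threads value quoted in the thread; 100 is arbitrary.
        int[] result = serve(40, 100);
        System.out.println("completed=" + result[0] + " peakConcurrent=" + result[1]);
    }
}
```

All 100 requests complete, but never more than 40 run at the same moment. The rest wait in the pool's queue rather than being refused.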

RE: Shuffle phase replication factor

Posted by John Lilley <jo...@redpoint.net>.

Ummmm, is that also the limit for the number of simultaneous connections?  In general, one does not need a 1:1 map between threads and connections.
If this is the connection limit, does it imply  that the client or server side aggressively disconnects after a transfer?
What happens to the pending/failing connection attempts that exceed the limit?
Thanks!
john

From: Rahul Bhattacharjee [mailto:rahul.rec.dgp@gmail.com]
Sent: Wednesday, May 22, 2013 8:52 AM
To: user@hadoop.apache.org
Subject: Re: Shuffle phase replication factor

There are properties/configuration to control the no. of copying threads for copy.
tasktracker.http.threads=40
Thanks,
Rahul

On Wed, May 22, 2013 at 8:16 PM, John Lilley <jo...@redpoint.net>> wrote:
This brings up another nagging question I’ve had for some time.  Between HDFS and shuffle, there seems to be the potential for “every node connecting to every other node” via TCP.  Are there explicit mechanisms in place to manage or limit simultaneous connections?  Is the protocol simply robust enough to allow a server-side to disconnect at any time to free up slots and the client-side will retry the request?
Thanks
john

From: Shahab Yunus [mailto:shahab.yunus@gmail.com<ma...@gmail.com>]
Sent: Wednesday, May 22, 2013 8:38 AM

To: user@hadoop.apache.org<ma...@hadoop.apache.org>
Subject: Re: Shuffle phase replication factor

As mentioned by Bertrand, Hadoop, The Definitive Guide, is well... really definitive :) place to start. It is pretty thorough for starts and once you are gone through it, the code will start making more sense too.

Regards,
Shahab

On Wed, May 22, 2013 at 10:33 AM, John Lilley <jo...@redpoint.net>> wrote:
Oh I see.  Does this mean there is another service and TCP listen port for this purpose?
Thanks for your indulgence… I would really like to read more about this without bothering the group but not sure where to start to learn these internals other than the code.
john

From: Kai Voigt [mailto:k@123.org]
Sent: Tuesday, May 21, 2013 12:59 PM
To: user@hadoop.apache.org
Subject: Re: Shuffle phase replication factor

The map output doesn't get written to HDFS. The map task writes its output to its local disk, and the reduce tasks then pull the data over HTTP for further processing.

On 21.05.2013 at 19:57, John Lilley <jo...@redpoint.net> wrote:

When MapReduce enters “shuffle” to partition the tuples, I am assuming that it writes intermediate data to HDFS.  What replication factor is used for those temporary files?
john

--
Kai Voigt
k@123.org

Re: Shuffle phase replication factor

Posted by Rahul Bhattacharjee <ra...@gmail.com>.

There are configuration properties to control the number of threads used
for the copy.
tasktracker.http.threads=40
Thanks,
Rahul


On Wed, May 22, 2013 at 8:16 PM, John Lilley <jo...@redpoint.net> wrote:

> This brings up another nagging question I’ve had for some time.  Between
> HDFS and shuffle, there seems to be the potential for “every node
> connecting to every other node” via TCP.  Are there explicit mechanisms in
> place to manage or limit simultaneous connections?  Is the protocol simply
> robust enough that the server side can disconnect at any time to free up
> slots, with the client side retrying the request?
>
> Thanks
> john
>
> From: Shahab Yunus [mailto:shahab.yunus@gmail.com]
> Sent: Wednesday, May 22, 2013 8:38 AM
> To: user@hadoop.apache.org
> Subject: Re: Shuffle phase replication factor
>
> As Bertrand mentioned, Hadoop: The Definitive Guide is, well... a really
> definitive :) place to start. It is pretty thorough for starters, and once
> you have gone through it, the code will start making more sense too.
>
> Regards,
> Shahab
>
> On Wed, May 22, 2013 at 10:33 AM, John Lilley <jo...@redpoint.net> wrote:
>
> Oh I see.  Does this mean there is another service and TCP listen port for
> this purpose?
>
> Thanks for your indulgence… I would really like to read more about this
> without bothering the group, but I am not sure where to start learning
> these internals other than the code.
>
> john
>
> From: Kai Voigt [mailto:k@123.org]
> Sent: Tuesday, May 21, 2013 12:59 PM
> To: user@hadoop.apache.org
> Subject: Re: Shuffle phase replication factor
>
> The map output doesn't get written to HDFS. The map task writes its output
> to its local disk, and the reduce tasks then pull the data over HTTP for
> further processing.
>
> On 21.05.2013 at 19:57, John Lilley <jo...@redpoint.net> wrote:
>
> When MapReduce enters “shuffle” to partition the tuples, I am assuming
> that it writes intermediate data to HDFS.  What replication factor is used
> for those temporary files?
>
> john
>
> --
> Kai Voigt
> k@123.org

RE: Shuffle phase replication factor

Posted by John Lilley <jo...@redpoint.net>.

This brings up another nagging question I've had for some time.  Between HDFS and shuffle, there seems to be the potential for "every node connecting to every other node" via TCP.  Are there explicit mechanisms in place to manage or limit simultaneous connections?  Is the protocol simply robust enough that the server side can disconnect at any time to free up slots, with the client side retrying the request?
Thanks
john
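
To put rough numbers on the "every node connecting to every other node" concern, here is a small back-of-the-envelope sketch. The function names and the sample cluster sizes are made up for illustration. The only assumption about MapReduce is the standard one that each reduce task fetches its partition from every map task's output.

```python
def directed_pairs(nodes):
    """Upper bound on distinct (client node, server node) pairs."""
    return nodes * (nodes - 1)


def shuffle_fetches(map_tasks, reduce_tasks):
    """Total map-output segments pulled during one job's shuffle."""
    return map_tasks * reduce_tasks


if __name__ == "__main__":
    # Hypothetical sizes: 100 nodes, 1000 map tasks, 50 reduce tasks.
    print(directed_pairs(100))        # 9900
    print(shuffle_fetches(1000, 50))  # 50000
```

These are totals over the life of a job, not simultaneous connections. How many are open at once depends on the per-reducer and per-server limits discussed elsewhere in this thread.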

From: Shahab Yunus [mailto:shahab.yunus@gmail.com]
Sent: Wednesday, May 22, 2013 8:38 AM
To: user@hadoop.apache.org
Subject: Re: Shuffle phase replication factor

As Bertrand mentioned, Hadoop: The Definitive Guide is, well... a really definitive :) place to start. It is pretty thorough for starters, and once you have gone through it, the code will start making more sense too.

Regards,
Shahab

On Wed, May 22, 2013 at 10:33 AM, John Lilley <jo...@redpoint.net> wrote:
Oh I see.  Does this mean there is another service and TCP listen port for this purpose?
Thanks for your indulgence... I would really like to read more about this without bothering the group, but I am not sure where to start learning these internals other than the code.
john

From: Kai Voigt [mailto:k@123.org]
Sent: Tuesday, May 21, 2013 12:59 PM
To: user@hadoop.apache.org
Subject: Re: Shuffle phase replication factor

The map output doesn't get written to HDFS. The map task writes its output to its local disk, and the reduce tasks then pull the data over HTTP for further processing.

On 21.05.2013 at 19:57, John Lilley <jo...@redpoint.net> wrote:

When MapReduce enters "shuffle" to partition the tuples, I am assuming that it writes intermediate data to HDFS.  What replication factor is used for those temporary files?
john

--
Kai Voigt
k@123.org

Re: Shuffle phase replication factor

Posted by Shahab Yunus <sh...@gmail.com>.

As Bertrand mentioned, Hadoop: The Definitive Guide is, well... a really
definitive :) place to start. It is pretty thorough for starters, and once
you have gone through it, the code will start making more sense too.

Regards,
Shahab


On Wed, May 22, 2013 at 10:33 AM, John Lilley <jo...@redpoint.net> wrote:

> Oh I see.  Does this mean there is another service and TCP listen port
> for this purpose?
>
> Thanks for your indulgence… I would really like to read more about this
> without bothering the group, but I am not sure where to start learning
> these internals other than the code.
>
> john
>
> From: Kai Voigt [mailto:k@123.org]
> Sent: Tuesday, May 21, 2013 12:59 PM
> To: user@hadoop.apache.org
> Subject: Re: Shuffle phase replication factor
>
> The map output doesn't get written to HDFS. The map task writes its output
> to its local disk, and the reduce tasks then pull the data over HTTP for
> further processing.
>
> On 21.05.2013 at 19:57, John Lilley <jo...@redpoint.net> wrote:
>
> When MapReduce enters “shuffle” to partition the tuples, I am assuming
> that it writes intermediate data to HDFS.  What replication factor is used
> for those temporary files?
>
> john
>
> --
> Kai Voigt
> k@123.org

RE: Shuffle phase replication factor

Posted by John Lilley <jo...@redpoint.net>.

Oh I see.  Does this mean there is another service and TCP listen port for this purpose?
Thanks for your indulgence... I would really like to read more about this without bothering the group, but I am not sure where to start learning these internals other than the code.
john

From: Kai Voigt [mailto:k@123.org]
Sent: Tuesday, May 21, 2013 12:59 PM
To: user@hadoop.apache.org
Subject: Re: Shuffle phase replication factor

The map output doesn't get written to HDFS. The map task writes its output to its local disk, and the reduce tasks then pull the data over HTTP for further processing.

On 21.05.2013 at 19:57, John Lilley <jo...@redpoint.net> wrote:

When MapReduce enters "shuffle" to partition the tuples, I am assuming that it writes intermediate data to HDFS.  What replication factor is used for those temporary files?
john

--
Kai Voigt
k@123.org

Re: Shuffle phase replication factor

Posted by Kai Voigt <k...@123.org>.

The map output doesn't get written to HDFS. The map task writes its output to its local disk; the reduce tasks then pull the data over HTTP for further processing.
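
A minimal sketch of that pull model, not Hadoop's actual code: the "map side" writes one partition to a local temporary directory, a small HTTP server exposes that directory, and the "reduce side" fetches the file over HTTP. All function names, the file naming scheme, and the tab-separated record format are hypothetical and chosen only for illustration. In a real cluster the files are served by the tasktracker (MR1) or by the shuffle service inside the nodemanager (MR2), not by the map task itself.

```python
import http.server
import os
import tempfile
import threading
import urllib.request
from functools import partial


class QuietHandler(http.server.SimpleHTTPRequestHandler):
    """Static file handler that does not log each request to stderr."""

    def log_message(self, format, *args):
        pass


def write_map_output(local_dir, partition, records):
    """Map side: write one partition of intermediate output to local disk."""
    name = f"part-{partition}.out"  # hypothetical naming scheme
    with open(os.path.join(local_dir, name), "w", encoding="utf-8") as f:
        for key, value in records:
            f.write(f"{key}\t{value}\n")
    return name


def serve_local_dir(local_dir):
    """Expose the local directory over HTTP on an ephemeral port."""
    handler = partial(QuietHandler, directory=local_dir)
    server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch_partition(port, name):
    """Reduce side: pull one partition from the map-side node over HTTP."""
    # An empty ProxyHandler keeps the request local even if a proxy is configured.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(f"http://127.0.0.1:{port}/{name}") as resp:
        lines = resp.read().decode("utf-8").splitlines()
    return [tuple(line.split("\t")) for line in lines]


def run_demo():
    """Write a partition locally, serve it, and fetch it back over HTTP."""
    with tempfile.TemporaryDirectory() as local_dir:
        name = write_map_output(local_dir, 0, [("apple", "1"), ("pear", "2")])
        server = serve_local_dir(local_dir)
        try:
            return fetch_partition(server.server_address[1], name)
        finally:
            server.shutdown()
            server.server_close()
```

Nothing here touches HDFS, so no replication factor applies: the only copy of the intermediate file is the one on the map-side node's local disk until a reducer fetches it.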

On 21.05.2013 at 19:57, John Lilley <jo...@redpoint.net> wrote:

> When MapReduce enters “shuffle” to partition the tuples, I am assuming that it writes intermediate data to HDFS.  What replication factor is used for those temporary files?
> john
>  

-- 
Kai Voigt
k@123.org
