Posted to mapreduce-user@hadoop.apache.org by Virajith Jalaparti <vi...@gmail.com> on 2011/06/28 20:31:22 UTC

what is mapred.reduce.parallel.copies?

Hi,

I have a question about the "mapred.reduce.parallel.copies" configuration
parameter in Hadoop. The mapred-default.xml file says it is "The default
number of parallel transfers run by reduce
  during the copy(shuffle) phase."
Is this the number of slave nodes from which a reduce task reads in
parallel? Or is it the number of intermediate map outputs that a reduce
task can read from in parallel?

For example, suppose I have 4 slave nodes and run a job with 800 maps and 4
reducers, with mapred.reduce.parallel.copies=5. Can each reduce task then
read from all 4 nodes in parallel, i.e. make only 4 concurrent connections,
one per node? Or can it read from 5 of the 800 map outputs at once, i.e.
make at least 2 concurrent connections to a single node?

In essence, I am trying to determine how many reducers would be accessing a
single disk, concurrently, in any given Hadoop cluster for any job
configuration as a function of the various parameters that can be specified
in the configuration files.
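For reference, the parameter in question is an ordinary job property; a
mapred-site.xml fragment overriding it would look like the following (the
value 5 shown here is the default that ships in mapred-default.xml):

```xml
<property>
  <name>mapred.reduce.parallel.copies</name>
  <value>5</value>
  <description>The default number of parallel transfers run by reduce
  during the copy(shuffle) phase.</description>
</property>
```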

Thanks,
Virajith

Re: what is mapred.reduce.parallel.copies?

Posted by Virajith Jalaparti <vi...@gmail.com>.
I am using 0.20.2. So you mean mapred.reduce.parallel.copies is the 
number of map outputs from which a reduce task can read data 
concurrently? I understand that it is the number of concurrent copier 
threads in the ReduceTask. But what does each of these threads fetch 
from? A single slave node, or a single partition of a particular map's 
output?

Thanks
Virajith

On 6/28/2011 9:59 PM, Ted Yu wrote:
> Which Hadoop version are you using?
> If it is 0.20.2, mapred.reduce.parallel.copies is the number of 
> copying threads in the ReduceTask.
>
> In the scenario you described, at least 2 concurrent connections to a 
> single node would be made.
>
> I am not familiar with newer versions of hadoop.

Re: what is mapred.reduce.parallel.copies?

Posted by Ted Yu <yu...@gmail.com>.
Which Hadoop version are you using?
If it is 0.20.2, mapred.reduce.parallel.copies is the number of copying
threads in the ReduceTask.

In the scenario you described, at least 2 concurrent connections to a single
node would be made.

I am not familiar with newer versions of hadoop.
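To make the "at least 2 connections" claim concrete, here is a minimal
pigeonhole sketch (plain Python, not Hadoop code): with 5 copier threads
fetching from map outputs spread over 4 nodes, at least ceil(5/4) = 2 of the
in-flight connections must target the same node. The round-robin placement
of map outputs on nodes is an illustrative assumption, not how the
JobTracker actually schedules maps.

```python
import math
from collections import Counter

def min_connections_to_one_node(parallel_copies, num_nodes):
    """Pigeonhole bound: the busiest node sees at least this many
    concurrent connections from a single reduce task."""
    return math.ceil(parallel_copies / num_nodes)

# Scenario from the thread: 800 map outputs over 4 nodes, with each
# reducer fetching parallel.copies = 5 outputs at a time.
num_nodes = 4
parallel_copies = 5

# Assume (for illustration only) map outputs are placed round-robin.
map_output_host = [i % num_nodes for i in range(800)]

# Snapshot of one instant: the reducer's 5 copier threads each have
# one fetch in flight, here against the first 5 pending outputs.
in_flight = map_output_host[:parallel_copies]
busiest = max(Counter(in_flight).values())

print(busiest)                                                  # 2
print(min_connections_to_one_node(parallel_copies, num_nodes))  # 2
```

Whatever the actual placement of the 800 outputs, the bound holds whenever
the number of copier threads exceeds the number of distinct hosts.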
