Posted to common-user@hadoop.apache.org by Patai Sangbutsarakum <si...@gmail.com> on 2013/02/28 07:57:26 UTC

where reduce is copying?

Good evening Hadoopers!

On the JobTracker page, when I click on a job and then on a running
reduce task, I see:

task_201302271736_0638_r_000000 reduce > copy (136 of 261 at 0.44 MB/s)

I am really curious where the data is being copied.
If I click on the task, it shows the host that is running the task attempt.

My question is: does "reduce > copy" refer to data being copied
outbound from the host that is running the task attempt, or to data
being copied inbound from other machines to this host (the one running
the task attempt)?

In either case, how do I know which machines that host is copying data
from or to?

Regards,
Patai

Re: where reduce is copying?

Posted by Harsh J <ha...@cloudera.com>.
The speed shown there is to be taken with a grain of salt. It is an
average measured from the start of the copy phase. So if the reduce
started early (the default is once 5% of maps have completed) and is
waiting for more map outputs to become available, the wait period is
also counted into this rate, which drags the value down even though
the real copy speed may be higher. A better way to check would be to
simply measure the network traffic.
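
To make the averaging effect concrete, here is a minimal Python sketch.
The function names and all the numbers are hypothetical and chosen only
for illustration; this is not how the JobTracker computes the figure
internally, just the arithmetic of "bytes copied divided by time since
the phase began".

```python
MB = 1024 * 1024


def displayed_rate_mb_s(bytes_copied, phase_elapsed_s):
    """Average rate since the copy phase began, idle waiting included."""
    return bytes_copied / MB / phase_elapsed_s


def active_rate_mb_s(bytes_copied, transfer_s):
    """Rate over only the seconds in which data was actually moving."""
    return bytes_copied / MB / transfer_s


if __name__ == "__main__":
    # Hypothetical reduce: 264 MB fetched so far, copy phase started
    # 600 s ago, but only 6 s were spent transferring. The rest of the
    # time the reduce was waiting for maps to finish.
    copied = 264 * MB
    print(displayed_rate_mb_s(copied, 600))  # averaged over the whole phase
    print(active_rate_mb_s(copied, 6))       # averaged over transfer time only
```

With these made-up numbers the averaged figure is 0.44 MB/s while the
link was moving 44 MB/s whenever there was something to fetch, so a low
displayed rate does not by itself point to a slow network.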

On Thu, Feb 28, 2013 at 12:53 PM, Patai Sangbutsarakum
<si...@gmail.com> wrote:
> Thanks Harsh, you are always the first.
>
> Yes, that really makes sense: the mappers' outputs are copied inbound
> to the running reduce task attempt.
>
> A speed of 0.44 MB/s seems pretty low to me. I am trying to decide
> whether that is because there is not much data to copy yet, since not
> all the mappers finish at the same time, or whether it is a problem
> with the network itself. (I already checked that bond0 is 1 Gb.)
>
> Thanks
> Patai
>
>
> On Wed, Feb 27, 2013 at 11:06 PM, Harsh J <ha...@cloudera.com> wrote:
>> The latter (from other machines, inbound to where the reduce is
>> running, onto the reduce's local disk, via mapred.local.dir). The
>> reduce will, obviously, copy outputs from all maps that may have
>> produced data for its assigned partition ID.
>>
>> On Thu, Feb 28, 2013 at 12:27 PM, Patai Sangbutsarakum
>> <si...@gmail.com> wrote:
>>> Good evening Hadoopers!
>>>
>>> On the JobTracker page, when I click on a job and then on a running
>>> reduce task, I see:
>>>
>>> task_201302271736_0638_r_000000 reduce > copy (136 of 261 at 0.44 MB/s)
>>>
>>> I am really curious where the data is being copied.
>>> If I click on the task, it shows the host that is running the task attempt.
>>>
>>> My question is: does "reduce > copy" refer to data being copied
>>> outbound from the host that is running the task attempt, or to data
>>> being copied inbound from other machines to this host (the one running
>>> the task attempt)?
>>>
>>> In either case, how do I know which machines that host is copying data
>>> from or to?
>>>
>>> Regards,
>>> Patai
>>
>>
>>
>> --
>> Harsh J



-- 
Harsh J

Re: where reduce is copying?

Posted by Patai Sangbutsarakum <si...@gmail.com>.
Thanks Harsh, you are always the first.

Yes, that really makes sense: the mappers' outputs are copied inbound
to the running reduce task attempt.

A speed of 0.44 MB/s seems pretty low to me. I am trying to decide
whether that is because there is not much data to copy yet, since not
all the mappers finish at the same time, or whether it is a problem
with the network itself. (I already checked that bond0 is 1 Gb.)
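
One way to tell the two cases apart is to sample the interface's
receive counter on the reduce host while the copy is running. The
sketch below is a rough illustration that assumes a Linux host exposing
/proc/net/dev; the function names and the 5-second interval are
arbitrary choices, and the result covers all inbound traffic on the
interface, not only the shuffle.

```python
import time

MB = 1024 * 1024


def parse_rx_bytes(proc_net_dev_text, iface):
    """Return the received-bytes counter for iface from /proc/net/dev text."""
    for line in proc_net_dev_text.splitlines():
        name, sep, rest = line.partition(":")
        # Header lines contain no colon; interface lines look like
        # "bond0: <rx bytes> <rx packets> ...".
        if sep and name.strip() == iface:
            return int(rest.split()[0])
    raise KeyError(iface)


def inbound_mb_per_s(iface="bond0", interval_s=5.0, path="/proc/net/dev"):
    """Sample the counter twice and return the inbound rate in MB/s."""
    with open(path) as f:
        before = parse_rx_bytes(f.read(), iface)
    time.sleep(interval_s)
    with open(path) as f:
        after = parse_rx_bytes(f.read(), iface)
    return (after - before) / MB / interval_s


if __name__ == "__main__":
    print("bond0 inbound: %.2f MB/s" % inbound_mb_per_s("bond0"))
```

If the sampled rate sits far below what the link can carry, the reduce
is more likely waiting on map outputs than being limited by the network.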

Thanks
Patai


On Wed, Feb 27, 2013 at 11:06 PM, Harsh J <ha...@cloudera.com> wrote:
> The latter (from other machines, inbound to where the reduce is
> running, onto the reduce's local disk, via mapred.local.dir). The
> reduce will, obviously, copy outputs from all maps that may have
> produced data for its assigned partition ID.
>
> On Thu, Feb 28, 2013 at 12:27 PM, Patai Sangbutsarakum
> <si...@gmail.com> wrote:
>> Good evening Hadoopers!
>>
>> On the JobTracker page, when I click on a job and then on a running
>> reduce task, I see:
>>
>> task_201302271736_0638_r_000000 reduce > copy (136 of 261 at 0.44 MB/s)
>>
>> I am really curious where the data is being copied.
>> If I click on the task, it shows the host that is running the task attempt.
>>
>> My question is: does "reduce > copy" refer to data being copied
>> outbound from the host that is running the task attempt, or to data
>> being copied inbound from other machines to this host (the one running
>> the task attempt)?
>>
>> In either case, how do I know which machines that host is copying data
>> from or to?
>>
>> Regards,
>> Patai
>
>
>
> --
> Harsh J

Re: where reduce is copying?

Posted by Ling Kun <lk...@gmail.com>.
Hi Harsh and Patai,
I also have some performance-related questions that build on Patai's.
Could anyone give me some hints?

1. When running a TeraSort on a cluster, I found that the shuffle phase
takes almost half of the total reduce runtime. Does copying the map
output to the reducer account for almost all of the shuffle phase time?

2. Does each reducer get a contiguous part of the map output file when
there is more than one reducer? From the source code, ReduceTask.java
starts a number of copy threads (usually 5), each of which issues an
HTTP GET to the TaskTracker that ran the map task. In the doGet method
in TaskTracker.java, the TaskTracker looks up an offset in the map
output's index file and then reads the map output file from that offset.

3. Has anyone done any performance analysis of the HTTP copy framework?
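
The index lookup described in question 2 can be modelled roughly as
below. This is a simplified Python sketch, not the actual Hadoop
on-disk format or code path: the three-field record layout, the
function names, and the sample payloads are all assumptions made for
illustration. It shows only the idea that, if the index stores one
(offset, length) record per reducer, each reducer's share is a single
contiguous byte range of the map output file.

```python
import struct

# Assumed index record: start offset, raw length, on-disk length,
# each a big-endian 64-bit integer (illustrative layout only).
RECORD = struct.Struct(">qqq")


def build_map_output(partitions):
    """Concatenate one payload per reducer; return (data, index) bytes."""
    data = bytearray()
    index = bytearray()
    for payload in partitions:
        index += RECORD.pack(len(data), len(payload), len(payload))
        data += payload
    return bytes(data), bytes(index)


def read_partition(data, index, reduce_id):
    """Serve one reducer: look up its record, then slice one byte range."""
    start, _raw_len, part_len = RECORD.unpack_from(index, reduce_id * RECORD.size)
    return data[start:start + part_len]


if __name__ == "__main__":
    data, index = build_map_output([b"aaa", b"bb", b"cccc"])
    # What a hypothetical reducer 1 would receive from this map output.
    print(read_partition(data, index, 1))
```

In this model a fetch costs one small index read plus one sequential
read of the data file, whatever the number of reducers.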


yours,
Ling Kun


On Thu, Feb 28, 2013 at 3:06 PM, Harsh J <ha...@cloudera.com> wrote:

> The latter (from other machines, inbound to where the reduce is
> running, onto the reduce's local disk, via mapred.local.dir). The
> reduce will, obviously, copy outputs from all maps that may have
> produced data for its assigned partition ID.
>
> On Thu, Feb 28, 2013 at 12:27 PM, Patai Sangbutsarakum
> <si...@gmail.com> wrote:
> > Good evening Hadoopers!
> >
> > On the JobTracker page, when I click on a job and then on a running
> > reduce task, I see:
> >
> > task_201302271736_0638_r_000000 reduce > copy (136 of 261 at 0.44 MB/s)
> >
> > I am really curious where the data is being copied.
> > If I click on the task, it shows the host that is running the task
> > attempt.
> >
> > My question is: does "reduce > copy" refer to data being copied
> > outbound from the host that is running the task attempt, or to data
> > being copied inbound from other machines to this host (the one running
> > the task attempt)?
> >
> > In either case, how do I know which machines that host is copying data
> > from or to?
> >
> > Regards,
> > Patai
>
>
>
> --
> Harsh J
>



-- 
http://www.lingcc.com

Re: where reduce is copying?

Posted by Harsh J <ha...@cloudera.com>.
The latter (from other machines, inbound to where the reduce is
running, onto the reduce's local disk, via mapred.local.dir). The
reduce will, obviously, copy outputs from all maps that may have
produced data for its assigned partition ID.
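
As an illustration of why one reduce pulls from many hosts, here is a
small Python sketch. The host names, the keys, and the stand-in hash
are hypothetical; Hadoop's default HashPartitioner uses the key's
hashCode() modulo the number of reduces, which is not reproduced here.

```python
def partition_for(key, num_reduces):
    """Stand-in partitioner: a stable toy hash modulo the reduce count."""
    return sum(key.encode("utf-8")) % num_reduces


def shuffle_plan(map_outputs, num_reduces):
    """Map each reduce ID to the hosts whose maps produced data for it.

    map_outputs: {host: [keys emitted by the maps that ran on that host]}
    """
    plan = {r: set() for r in range(num_reduces)}
    for host, keys in map_outputs.items():
        for key in keys:
            plan[partition_for(key, num_reduces)].add(host)
    return {r: sorted(hosts) for r, hosts in plan.items()}


if __name__ == "__main__":
    outputs = {"node-a": ["a", "b"], "node-b": ["b"], "node-c": ["c"]}
    for reduce_id, hosts in shuffle_plan(outputs, 2).items():
        print("reduce", reduce_id, "copies inbound from", hosts)
```

Each reduce ends up with its own list of source hosts, and all of that
traffic is inbound to the host running the reduce attempt.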

On Thu, Feb 28, 2013 at 12:27 PM, Patai Sangbutsarakum
<si...@gmail.com> wrote:
> Good evening Hadoopers!
>
> On the JobTracker page, when I click on a job and then on a running
> reduce task, I see:
>
> task_201302271736_0638_r_000000 reduce > copy (136 of 261 at 0.44 MB/s)
>
> I am really curious where the data is being copied.
> If I click on the task, it shows the host that is running the task attempt.
>
> My question is: does "reduce > copy" refer to data being copied
> outbound from the host that is running the task attempt, or to data
> being copied inbound from other machines to this host (the one running
> the task attempt)?
>
> In either case, how do I know which machines that host is copying data
> from or to?
>
> Regards,
> Patai



--
Harsh J
