Posted to hdfs-user@hadoop.apache.org by Abdul Navaz <na...@gmail.com> on 2014/09/26 02:36:46 UTC

Hadoop shuffling traffic

Hello,

I have a Hadoop cluster with 1 name node and 3 data nodes. I am running the
sample word count job on a 1 GB file that is distributed across HDFS.

When I run the MapReduce job, the reduce phase starts before the map phase
reaches 100%, for example map 40%, reduce 10%.

I would like to know when the shuffle traffic starts.

->  Is there any way to find out exactly when shuffling started? Does it
write anything to the task syslog?
-> How can I find the total amount of shuffle traffic?



Thanks & Regards,

Abdul Navaz
Research Assistant
University of Houston Main Campus, Houston TX
Ph: 281-685-0388

Re: Hadoop shuffling traffic

Posted by Abdul Navaz <na...@gmail.com>.

Hello Pramod,

This is great work! Thank you for sharing the report.

Thanks & Regards,

Abdul Navaz
Research Assistant
University of Houston Main Campus, Houston TX
Ph: 281-685-0388


From:  Pramod Biligiri <pr...@gmail.com>
Reply-To:  <us...@hadoop.apache.org>
Date:  Thursday, October 2, 2014 at 12:44 AM
To:  "zookeeper-user@hadoop.apache.org" <us...@hadoop.apache.org>
Subject:  Re: Hadoop shuffling traffic

Hi Abdul,
That is the right metric. You can take a look at this report we made on this
earlier:
http://www.slideshare.net/pramodbiligiri/shuffle-phase-as-the-bottleneck-in-hadoop-terasort

Pramod

On Wed, Oct 1, 2014 at 6:06 PM, Abdul Navaz <na...@gmail.com> wrote:
> Hello,
> 
> This is the portion of the output which is displayed on the console when I run
> sample word count job.
> 
> map 0% reduce 0%
> 14/10/01 18:37:52 INFO mapred.JobClient:  map 100% reduce 0%
> 14/10/01 18:38:10 INFO mapred.JobClient:  map 100% reduce 100%
> 14/10/01 18:38:12 INFO mapred.JobClient: Job complete: job_201409262002_0003
> 14/10/01 18:38:12 INFO mapred.JobClient: Counters: 29
> 14/10/01 18:38:12 INFO mapred.JobClient:   Job Counters
> 14/10/01 18:38:12 INFO mapred.JobClient:     Launched reduce tasks=1
> 14/10/01 18:38:12 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=23511
> 14/10/01 18:38:12 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
> 14/10/01 18:38:12 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
> 14/10/01 18:38:12 INFO mapred.JobClient:     Launched map tasks=1
> 14/10/01 18:38:12 INFO mapred.JobClient:     Data-local map tasks=1
> 14/10/01 18:38:12 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=14193
> 14/10/01 18:38:12 INFO mapred.JobClient:   File Output Format Counters
> 14/10/01 18:38:12 INFO mapred.JobClient:     Bytes Written=1106
> 14/10/01 18:38:12 INFO mapred.JobClient:   FileSystemCounters
> 14/10/01 18:38:12 INFO mapred.JobClient:     FILE_BYTES_READ=3059
> 14/10/01 18:38:12 INFO mapred.JobClient:     HDFS_BYTES_READ=1601
> 14/10/01 18:38:12 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=108400
> 14/10/01 18:38:12 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1106
> 14/10/01 18:38:12 INFO mapred.JobClient:   File Input Format Counters
> 14/10/01 18:38:12 INFO mapred.JobClient:     Bytes Read=1486
> 14/10/01 18:38:12 INFO mapred.JobClient:   Map-Reduce Framework
> 14/10/01 18:38:12 INFO mapred.JobClient:     Map output materialized bytes=3059
> 14/10/01 18:38:12 INFO mapred.JobClient:     Map input records=6
> 14/10/01 18:38:12 INFO mapred.JobClient:     Reduce shuffle bytes=3059
> 14/10/01 18:38:12 INFO mapred.JobClient:     Spilled Records=544
> 14/10/01 18:38:12 INFO mapred.JobClient:     Map output bytes=2509
> 14/10/01 18:38:12 INFO mapred.JobClient:     Total committed heap usage
> 
> 
> 
> I am trying to find the shuffle traffic, that is, the total traffic generated
> when mappers send their key/value pairs to the reducer. Does the highlighted
> counter (Reduce shuffle bytes) give the shuffle traffic?
> 
> 
> Thanks & Regards,
> 
> Abdul Navaz
> Research Assistant
> University of Houston Main Campus, Houston TX
> Ph: 281-685-0388 <tel:281-685-0388>
> 
> 
> 
> 
> On 9/26/14, 12:00 AM, "karthikeyan S" <ka...@gmail.com> wrote:
> 
>> The reducer starts as soon as it has data available from any one of the
>> mappers.
>> The reducer keeps polling the AM and asks if any mapper has completed
>> processing. If so it fetches data from that mapper.
>> So it's not necessary for all the mappers of a task to complete for
>> the reducer to start processing.
>> 
>> When the reducer starts fetching data from the mappers, it prints
>> that info in its syslog, from what I have seen.
>> 
>> Thanks,
>> Karthik
>> 
>> On Thu, Sep 25, 2014 at 8:27 PM, Bing Jiang <ji...@gmail.com> wrote:
>>>  See mapreduce.job.reduce.slowstart.completedmaps.
>>>  It gives a hint of when reduce tasks can kick off.
>>> 
>>>  2014-09-26 8:36 GMT+08:00 Abdul Navaz <na...@gmail.com>:
>>>> 
>>>>  Hello,
>>>> 
>>>>  I am having a Hadoop cluster with 1 name node and 3 data nodes. I running
>>>>  sample word count job on 1GB of file which is distributed among the HDFS.
>>>> 
>>>>  When I run the map reduce job, before even completing the mapping 100 %
>>>>  reduce starts.  Say for eg map 40% reduce 10% etc.
>>>> 
>>>>  I would like to know when the shuffling traffic starts ?
>>>> 
>>>>  ->  Is there any way to find out when exactly shuffling started ?  Does it
>>>>  generate any syslog in the logs .
>>>>  -> How to find the total amount of shuffling traffic?
>>>> 
>>>> 
>>>> 
>>>>  Thanks & Regards,
>>>> 
>>>>  Abdul Navaz
>>>>  Research Assistant
>>>>  University of Houston Main Campus, Houston TX
>>>>  Ph: 281-685-0388 <tel:281-685-0388>
>>>> 
>>> 
>>> 
>>> 
>>>  --
>>>  Bing Jiang
>>>  Tel：(86)134-2619-1361
>>>  weibo: http://weibo.com/jiangbinglover
>>>  BLOG: www.binospace.com <http://www.binospace.com>
>>>  BLOG: http://blog.sina.com.cn/jiangbinglover
>>>  Focus on distributed computing, HDFS/HBase
>>
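
Editor's sketch for the slowstart hint quoted above. This is a minimal
illustration, not Hadoop's own code. It assumes the scheduler starts
launching reducers once ceil(fraction * total maps) maps have finished,
where the fraction is mapreduce.job.reduce.slowstart.completedmaps (named
mapred.reduce.slowstart.completed.maps on older releases). The default of
0.05 and the 16-map example (1 GB input with 64 MB blocks) are assumptions;
check your own version and block size. The function name is hypothetical.

```python
import math


def maps_before_reduce_start(total_maps, slowstart_fraction=0.05):
    """Estimate how many map tasks must finish before reducers may be scheduled.

    slowstart_fraction mirrors mapreduce.job.reduce.slowstart.completedmaps.
    The 0.05 default is an assumption; verify it against your Hadoop version.
    """
    if total_maps < 0:
        raise ValueError("total_maps must be non-negative")
    if not 0.0 <= slowstart_fraction <= 1.0:
        raise ValueError("slowstart_fraction must be between 0.0 and 1.0")
    # Round up: any non-zero fraction requires at least one completed map.
    return math.ceil(slowstart_fraction * total_maps)


if __name__ == "__main__":
    # Hypothetical job: 16 map tasks (1 GB input, 64 MB blocks assumed).
    for fraction in (0.05, 0.5, 1.0):
        needed = maps_before_reduce_start(16, fraction)
        print(f"slowstart={fraction}: reducers may start after {needed} of 16 maps")
```

Setting the fraction to 1.0 would keep reducers from being scheduled until
every map has finished, which makes the start of the shuffle easier to see
but usually lengthens the job.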

Re: Hadoop shuffling traffic

Posted by Pramod Biligiri <pr...@gmail.com>.

I think it refers to the number of bytes the reducer fetches from the
mappers, so it is your second option (the shuffle traffic), not the output
written to HDFS.

Pramod
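
Editor's sketch: one way to check this from the console output is to pull
the counters out of the JobClient lines and compare them. The snippet below
is a rough illustration using lines copied from the output quoted in this
thread. The regular expression assumes the "INFO mapred.JobClient:" line
layout shown there, and the function name parse_counters is made up for
this example.

```python
import re

# Matches lines such as:
#   14/10/01 18:38:12 INFO mapred.JobClient:     Reduce shuffle bytes=3059
# (layout assumed from the console output quoted in this thread)
COUNTER_LINE = re.compile(r"mapred\.JobClient:\s+(?P<name>[^=]+)=(?P<value>\d+)\s*$")


def parse_counters(console_output):
    """Return a dict of counter name -> integer value from JobClient output."""
    counters = {}
    for line in console_output.splitlines():
        match = COUNTER_LINE.search(line)
        if match:
            counters[match.group("name").strip()] = int(match.group("value"))
    return counters


# Lines copied from the job output quoted in this thread.
SAMPLE = """\
14/10/01 18:38:12 INFO mapred.JobClient: Counters: 29
14/10/01 18:38:12 INFO mapred.JobClient:     Bytes Written=1106
14/10/01 18:38:12 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1106
14/10/01 18:38:12 INFO mapred.JobClient:     Map output materialized bytes=3059
14/10/01 18:38:12 INFO mapred.JobClient:     Reduce shuffle bytes=3059
14/10/01 18:38:12 INFO mapred.JobClient:     Map output bytes=2509
"""

counters = parse_counters(SAMPLE)
shuffle = counters["Reduce shuffle bytes"]
print("shuffle bytes:", shuffle)
print("equals materialized map output:", shuffle == counters["Map output materialized bytes"])
print("equals HDFS output:", shuffle == counters["HDFS_BYTES_WRITTEN"])
```

In this job the shuffle counter matches the materialized map output and
differs from the bytes written to HDFS, which is consistent with it
measuring what the reducer fetched rather than the final output.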

On Wed, Oct 8, 2014 at 10:17 PM, Abdul Navaz <na...@gmail.com> wrote:

> Hello,
>
> First of all, thank you very much for your help. :)
>
> I still have a doubt about this.
>
> Is the highlighted metric “*Reduce shuffle bytes=3059*”
>
> 1. the total bytes after the reduce phase (that is, the output file
>    written to HDFS),
>
> or
>
> 2. the actual shuffle traffic exchanged between mappers and reducers
>    before the reduce step?
>
> Please clarify!
>
> Thanks & Regards,
>
> Abdul Navaz
>
>
>
> From: Pramod Biligiri <pr...@gmail.com>
> Reply-To: <us...@hadoop.apache.org>
> Date: Thursday, October 2, 2014 at 12:44 AM
> To: "zookeeper-user@hadoop.apache.org" <us...@hadoop.apache.org>
> Subject: Re: Hadoop shuffling traffic
>
> Hi Abdul,
> That is the right metric. You can take a look at this report we made on
> this earlier:
> http://www.slideshare.net/pramodbiligiri/shuffle-phase-as-the-bottleneck-in-hadoop-terasort
>
> Pramod
>
> On Wed, Oct 1, 2014 at 6:06 PM, Abdul Navaz <na...@gmail.com> wrote:
>
>> Hello,
>>
>> This is the portion of the output which is displayed on the console when
>> I run sample word count job.
>>
>> map 0% reduce 0%
>>
>> 14/10/01 18:37:52 INFO mapred.JobClient:  map 100% reduce 0%
>>
>> 14/10/01 18:38:10 INFO mapred.JobClient:  map 100% reduce 100%
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient: Job complete:
>> job_201409262002_0003
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient: Counters: 29
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient:   Job Counters
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient:     Launched reduce tasks=1
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=23511
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient:     Total time spent by all
>> reduces waiting after reserving slots (ms)=0
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient:     Total time spent by all maps
>> waiting after reserving slots (ms)=0
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient:     Launched map tasks=1
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient:     Data-local map tasks=1
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=14193
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient:   File Output Format Counters
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient:     Bytes Written=1106
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient:   FileSystemCounters
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient:     FILE_BYTES_READ=3059
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient:     HDFS_BYTES_READ=1601
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=108400
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1106
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient:   File Input Format Counters
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient:     Bytes Read=1486
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient:   Map-Reduce Framework
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient:     Map output materialized
>> bytes=3059
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient:     Map input records=6
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient:     *Reduce shuffle bytes=3059*
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient:     Spilled Records=544
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient:     Map output bytes=2509
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient:     Total committed heap usage
>>
>>
>> I am trying to find the shuffling traffic that is total traffic
>> generated when mappers exchange their key values pair with the reducer. Is
>> the highlighted portion gives the shuffling traffic ?
>>
>>
>> Thanks & Regards,
>>
>> Abdul Navaz
>> Research Assistant
>> University of Houston Main Campus, Houston TX
>> Ph: 281-685-0388
>>
>>
>>
>>
>> On 9/26/14, 12:00 AM, "karthikeyan S" <ka...@gmail.com> wrote:
>>
>> The reducer starts as soon as it has data available from any one of the
>> mappers.
>> The reducer keeps polling the AM and asks if any mapper has completed
>> processing. If so it fetches data from that mapper.
>> So it's not necessary for all the mappers of a task to complete for
>> the reducer to start processing.
>>
>> When the reducers starts fetching the data from the mappers it prints
>> that info in its syslog, from what I have seen.
>>
>> Thanks,
>> Karthik
>>
>> On Thu, Sep 25, 2014 at 8:27 PM, Bing Jiang <ji...@gmail.com>
>> wrote:
>>
>> see mapreduce.job.reduce.slowstart.completedmaps
>> It gives hint of  when reduce tasks could kick off.
>>
>> 2014-09-26 8:36 GMT+08:00 Abdul Navaz <na...@gmail.com>:
>>
>>
>> Hello,
>>
>> I am having a Hadoop cluster with 1 name node and 3 data nodes. I running
>> sample word count job on 1GB of file which is distributed among the HDFS.
>>
>> When I run the map reduce job, before even completing the mapping 100 %
>> reduce starts.  Say for eg map 40% reduce 10% etc.
>>
>> I would like to know when the shuffling traffic starts ?
>>
>> ->  Is there any way to find out when exactly shuffling started ?  Does it
>> generate any syslog in the logs .
>> -> How to find the total amount of shuffling traffic?
>>
>>
>>
>> Thanks & Regards,
>>
>> Abdul Navaz
>> Research Assistant
>> University of Houston Main Campus, Houston TX
>> Ph: 281-685-0388
>>
>>
>>
>>
>> --
>> Bing Jiang
>> Tel：(86)134-2619-1361
>> weibo: http://weibo.com/jiangbinglover
>> BLOG: www.binospace.com
>> BLOG: http://blog.sina.com.cn/jiangbinglover
>> Focus on distributed computing, HDFS/HBase
>>
>>
>>
>

Re: Hadoop shuffling traffic

Posted by Abdul Navaz <na...@gmail.com>.

Hello,

First of all, thank you very much for your help. :)

I still have a doubt about this.

Is the highlighted metric “Reduce shuffle bytes=3059”

1. the total bytes after the reduce phase (that is, the output file
written to HDFS),

or

2. the actual shuffle traffic exchanged between mappers and reducers
before the reduce step?

Please clarify!

Thanks & Regards,

Abdul Navaz



From:  Pramod Biligiri <pr...@gmail.com>
Reply-To:  <us...@hadoop.apache.org>
Date:  Thursday, October 2, 2014 at 12:44 AM
To:  "zookeeper-user@hadoop.apache.org" <us...@hadoop.apache.org>
Subject:  Re: Hadoop shuffling traffic

Hi Abdul,
That is the right metric. You can take a look at this report we made on this
earlier:
http://www.slideshare.net/pramodbiligiri/shuffle-phase-as-the-bottleneck-in-hadoop-terasort

Pramod

On Wed, Oct 1, 2014 at 6:06 PM, Abdul Navaz <na...@gmail.com> wrote:
> Hello,
> 
> This is the portion of the output which is displayed on the console when I run
> sample word count job.
> 
> map 0% reduce 0%
> 
> 14/10/01 18:37:52 INFO mapred.JobClient:  map 100% reduce 0%
> 
> 14/10/01 18:38:10 INFO mapred.JobClient:  map 100% reduce 100%
> 
> 14/10/01 18:38:12 INFO mapred.JobClient: Job complete: job_201409262002_0003
> 
> 14/10/01 18:38:12 INFO mapred.JobClient: Counters: 29
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:   Job Counters
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Launched reduce tasks=1
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=23511
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Total time spent by all reduces
> waiting after reserving slots (ms)=0
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Total time spent by all maps
> waiting after reserving slots (ms)=0
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Launched map tasks=1
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Data-local map tasks=1
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=14193
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:   File Output Format Counters
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Bytes Written=1106
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:   FileSystemCounters
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     FILE_BYTES_READ=3059
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     HDFS_BYTES_READ=1601
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=108400
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1106
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:   File Input Format Counters
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Bytes Read=1486
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:   Map-Reduce Framework
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Map output materialized
> bytes=3059
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Map input records=6
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Reduce shuffle bytes=3059
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Spilled Records=544
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Map output bytes=2509
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Total committed heap usage
> 
> 
> 
> I am trying to find the shuffling traffic that is total traffic generated when
> mappers exchange their key values pair with the reducer. Is the highlighted
> portion gives the shuffling traffic ?
> 
> 
> Thanks & Regards,
> 
> Abdul Navaz
> Research Assistant
> University of Houston Main Campus, Houston TX
> Ph: 281-685-0388 <tel:281-685-0388>
> 
> 
> 
> 
> On 9/26/14, 12:00 AM, "karthikeyan S" <ka...@gmail.com> wrote:
> 
>> The reducer starts as soon as it has data available from any one of the
>> mappers.
>> The reducer keeps polling the AM and asks if any mapper has completed
>> processing. If so it fetches data from that mapper.
>> So it's not necessary for all the mappers of a task to complete for
>> the reducer to start processing.
>> 
>> When the reducers starts fetching the data from the mappers it prints
>> that info in its syslog, from what I have seen.
>> 
>> Thanks,
>> Karthik
>> 
>> On Thu, Sep 25, 2014 at 8:27 PM, Bing Jiang <ji...@gmail.com> wrote:
>>>  see mapreduce.job.reduce.slowstart.completedmaps
>>>  It gives hint of  when reduce tasks could kick off.
>>> 
>>>  2014-09-26 8:36 GMT+08:00 Abdul Navaz <na...@gmail.com>:
>>>> 
>>>>  Hello,
>>>> 
>>>>  I am having a Hadoop cluster with 1 name node and 3 data nodes. I running
>>>>  sample word count job on 1GB of file which is distributed among the HDFS.
>>>> 
>>>>  When I run the map reduce job, before even completing the mapping 100 %
>>>>  reduce starts.  Say for eg map 40% reduce 10% etc.
>>>> 
>>>>  I would like to know when the shuffling traffic starts ?
>>>> 
>>>>  ->  Is there any way to find out when exactly shuffling started ?  Does it
>>>>  generate any syslog in the logs .
>>>>  -> How to find the total amount of shuffling traffic?
>>>> 
>>>> 
>>>> 
>>>>  Thanks & Regards,
>>>> 
>>>>  Abdul Navaz
>>>>  Research Assistant
>>>>  University of Houston Main Campus, Houston TX
>>>>  Ph: 281-685-0388 <tel:281-685-0388>
>>>> 
>>> 
>>> 
>>> 
>>>  --
>>>  Bing Jiang
>>>  Tel：(86)134-2619-1361
>>>  weibo: http://weibo.com/jiangbinglover
>>>  BLOG: www.binospace.com <http://www.binospace.com>
>>>  BLOG: http://blog.sina.com.cn/jiangbinglover
>>>  Focus on distributed computing, HDFS/HBase
>>
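
Editor's sketch on the "when does shuffling start" question quoted above. A
coarse estimate is available from the client console alone: the first
progress line with a non-zero reduce percentage. This assumes the early
part of the reduce percentage reflects the copy (shuffle) phase, and it is
only as precise as the client's polling interval; the reduce task's own
syslog, as Karthik suggests, would be more exact. The line layout and
timestamp format are taken from the output quoted in this thread, and the
function name is hypothetical.

```python
import re
from datetime import datetime

# Matches progress lines such as:
#   14/10/01 18:38:10 INFO mapred.JobClient:  map 100% reduce 100%
PROGRESS_LINE = re.compile(
    r"^(?P<ts>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) INFO mapred\.JobClient:\s+"
    r"map (?P<map>\d+)% reduce (?P<reduce>\d+)%$"
)


def first_reduce_progress(console_output):
    """Return (timestamp, map_percent) of the first line with reduce > 0%.

    Returns None if no such progress line is present.
    """
    for line in console_output.splitlines():
        match = PROGRESS_LINE.match(line.strip())
        if match and int(match.group("reduce")) > 0:
            seen_at = datetime.strptime(match.group("ts"), "%y/%m/%d %H:%M:%S")
            return seen_at, int(match.group("map"))
    return None


# Progress lines copied from the job output quoted in this thread.
SAMPLE = """\
14/10/01 18:37:52 INFO mapred.JobClient:  map 100% reduce 0%
14/10/01 18:38:10 INFO mapred.JobClient:  map 100% reduce 100%
"""

result = first_reduce_progress(SAMPLE)
if result is not None:
    seen_at, map_percent = result
    print(f"reduce progress first seen at {seen_at} (map at {map_percent}%)")
```

For the small job in this thread the reduce percentage jumps from 0% to
100% between two consecutive lines, so the console only bounds the shuffle
to that 18-second window.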

Re: Hadoop shuffling traffic

Posted by Abdul Navaz <na...@gmail.com>.

Hello Pramod,

This is great work !. Thank you for sharing the report.

Thanks & Regards,

Abdul Navaz
Research Assistant
University of Houston Main Campus, Houston TX
Ph: 281-685-0388


From:  Pramod Biligiri <pr...@gmail.com>
Reply-To:  <us...@hadoop.apache.org>
Date:  Thursday, October 2, 2014 at 12:44 AM
To:  "zookeeper-user@hadoop.apache.org" <us...@hadoop.apache.org>
Subject:  Re: Hadoop shuffling traffic

Hi Abdul,
That is the right metric. You can take a look at this report we made on this
earlier: 
http://www.slideshare.net/pramodbiligiri/shuffle-phase-as-the-bottleneck-in-
hadoop-terasort

Pramod

On Wed, Oct 1, 2014 at 6:06 PM, Abdul Navaz <na...@gmail.com> wrote:
> Hello,
> 
> This is the portion of the output which is displayed on the console when I run
> sample word count job.
> 
> map 0% reduce 0%
> 
> 14/10/01 18:37:52 INFO mapred.JobClient:  map 100% reduce 0%
> 
> 14/10/01 18:38:10 INFO mapred.JobClient:  map 100% reduce 100%
> 
> 14/10/01 18:38:12 INFO mapred.JobClient: Job complete: job_201409262002_0003
> 
> 14/10/01 18:38:12 INFO mapred.JobClient: Counters: 29
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:   Job Counters
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Launched reduce tasks=1
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=23511
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Total time spent by all reduces
> waiting after reserving slots (ms)=0
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Total time spent by all maps
> waiting after reserving slots (ms)=0
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Launched map tasks=1
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Data-local map tasks=1
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=14193
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:   File Output Format Counters
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Bytes Written=1106
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:   FileSystemCounters
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     FILE_BYTES_READ=3059
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     HDFS_BYTES_READ=1601
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=108400
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1106
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:   File Input Format Counters
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Bytes Read=1486
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:   Map-Reduce Framework
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Map output materialized
> bytes=3059
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Map input records=6
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Reduce shuffle bytes=3059
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Spilled Records=544
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Map output bytes=2509
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Total committed heap usage
> 
> 
> 
> I am trying to find the shuffling traffic that is total traffic generated when
> mappers exchange their key values pair with the reducer. Is the highlighted
> portion gives the shuffling traffic ?
> 
> 
> Thanks & Regards,
> 
> Abdul Navaz
> Research Assistant
> University of Houston Main Campus, Houston TX
> Ph: 281-685-0388 <tel:281-685-0388>
> 
> 
> 
> 
> On 9/26/14, 12:00 AM, "karthikeyan S" <ka...@gmail.com> wrote:
> 
>> The reducer starts as soon as it has data available from any one of the
>> mappers.
>> The reducer keeps polling the AM and asks if any mapper has completed
>> processing. If so it fetches data from that mapper.
>> So it's not necessary for all the mappers of a task to complete for
>> the reducer to start processing.
>> 
>> When the reducers starts fetching the data from the mappers it prints
>> that info in its syslog, from what I have seen.
>> 
>> Thanks,
>> Karthik
>> 
>> On Thu, Sep 25, 2014 at 8:27 PM, Bing Jiang <ji...@gmail.com> wrote:
>>>  see mapreduce.job.reduce.slowstart.completedmaps
>>>  It gives hint of  when reduce tasks could kick off.
>>> 
>>>  2014-09-26 8:36 GMT+08:00 Abdul Navaz <na...@gmail.com>:
>>>> 
>>>>  Hello,
>>>> 
>>>>  I have a Hadoop cluster with 1 name node and 3 data nodes. I am running a
>>>>  sample word count job on a 1 GB file which is distributed across HDFS.
>>>> 
>>>>  When I run the MapReduce job, reduce starts even before the map phase
>>>>  reaches 100%, e.g. map 40% reduce 10%.
>>>> 
>>>>  I would like to know when the shuffling traffic starts.
>>>> 
>>>>  -> Is there any way to find out exactly when shuffling started? Does it
>>>>  generate any entry in the logs?
>>>>  -> How can I find the total amount of shuffling traffic?
>>>> 
>>>> 
>>>> 
>>>>  Thanks & Regards,
>>>> 
>>>>  Abdul Navaz
>>>>  Research Assistant
>>>>  University of Houston Main Campus, Houston TX
>>>>  Ph: 281-685-0388 <tel:281-685-0388>
>>>> 
>>> 
>>> 
>>> 
>>>  --
>>>  Bing Jiang
>>>  Tel：(86)134-2619-1361
>>>  weibo: http://weibo.com/jiangbinglover
>>>  BLOG: www.binospace.com <http://www.binospace.com>
>>>  BLOG: http://blog.sina.com.cn/jiangbinglover
>>>  Focus on distributed computing, HDFS/HBase
>>
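
On the mapreduce.job.reduce.slowstart.completedmaps setting quoted above: it
is a fraction between 0 and 1 (0.05 is the usual default), and reduce tasks
become eligible to start once that fraction of the maps has completed, which
would explain seeing reduce progress at around map 40%. The mapred.JobClient
output suggests Hadoop 1.x, where the older name of the same property is
believed to be mapred.reduce.slowstart.completed.maps. Below is a minimal
Python sketch of the threshold arithmetic, assuming the framework rounds the
product up. The function name and the 16-map figure (1 GB at an assumed 64 MB
block size) are illustrative only and do not come from the cluster in this
thread.

```python
import math

def maps_needed_before_reduce(total_maps, slowstart=0.05):
    """Number of completed maps after which reduce tasks may be scheduled.

    Assumes the threshold is ceil(slowstart * total_maps); slowstart is the
    fraction configured by the slow-start property (0.05 is the usual default).
    """
    if not 0.0 <= slowstart <= 1.0:
        raise ValueError("slowstart must be a fraction between 0 and 1")
    return math.ceil(slowstart * total_maps)

# 1 GB of input at an assumed 64 MB block size gives 16 map tasks.
print(maps_needed_before_reduce(16))        # default setting
print(maps_needed_before_reduce(16, 1.0))   # wait for every map to finish
```

Setting the value to 1.0 should delay the reducers until all maps are done, so
that all shuffle traffic follows the map phase, which may make it easier to
measure.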

Re: Hadoop shuffling traffic

Posted by Abdul Navaz <na...@gmail.com>.

Hello,

First of all, thank you very much for your help. :)

I still have a doubt about this.

Does the highlighted metric “Reduce shuffle bytes=3059” represent:

1. The total bytes after the reduce phase (that is, the output file
which is written to HDFS)?

Or

2. The actual shuffle traffic which is exchanged between mappers and
reducers before the reduce is performed?

Please clarify!

Thanks & Regards,

Abdul Navaz



From:  Pramod Biligiri <pr...@gmail.com>
Reply-To:  <us...@hadoop.apache.org>
Date:  Thursday, October 2, 2014 at 12:44 AM
To:  "zookeeper-user@hadoop.apache.org" <us...@hadoop.apache.org>
Subject:  Re: Hadoop shuffling traffic

Hi Abdul,
That is the right metric. You can take a look at this report we made on this
earlier:
http://www.slideshare.net/pramodbiligiri/shuffle-phase-as-the-bottleneck-in-hadoop-terasort

Pramod

On Wed, Oct 1, 2014 at 6:06 PM, Abdul Navaz <na...@gmail.com> wrote:
> Hello,
> 
> This is the portion of the output which is displayed on the console when I run
> sample word count job.
> 
> map 0% reduce 0%
> 
> 14/10/01 18:37:52 INFO mapred.JobClient:  map 100% reduce 0%
> 
> 14/10/01 18:38:10 INFO mapred.JobClient:  map 100% reduce 100%
> 
> 14/10/01 18:38:12 INFO mapred.JobClient: Job complete: job_201409262002_0003
> 
> 14/10/01 18:38:12 INFO mapred.JobClient: Counters: 29
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:   Job Counters
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Launched reduce tasks=1
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=23511
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Total time spent by all reduces
> waiting after reserving slots (ms)=0
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Total time spent by all maps
> waiting after reserving slots (ms)=0
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Launched map tasks=1
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Data-local map tasks=1
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=14193
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:   File Output Format Counters
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Bytes Written=1106
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:   FileSystemCounters
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     FILE_BYTES_READ=3059
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     HDFS_BYTES_READ=1601
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=108400
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1106
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:   File Input Format Counters
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Bytes Read=1486
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:   Map-Reduce Framework
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Map output materialized
> bytes=3059
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Map input records=6
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Reduce shuffle bytes=3059
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Spilled Records=544
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Map output bytes=2509
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Total committed heap usage
> 
> 
> 
> I am trying to find the shuffling traffic, that is, the total traffic generated when
> mappers exchange their key-value pairs with the reducer. Does the highlighted
> portion give the shuffling traffic?
> 
> 
> Thanks & Regards,
> 
> Abdul Navaz
> Research Assistant
> University of Houston Main Campus, Houston TX
> Ph: 281-685-0388 <tel:281-685-0388>
> 
> 
> 
> 
> On 9/26/14, 12:00 AM, "karthikeyan S" <ka...@gmail.com> wrote:
> 
>> The reducer starts as soon as it has data available from any one of the
>> mappers.
>> The reducer keeps polling the AM and asks if any mapper has completed
>> processing. If so it fetches data from that mapper.
>> So it's not necessary for all the mappers of a task to complete for
>> the reducer to start processing.
>> 
>> When the reducers start fetching the data from the mappers, they print
>> that info in their syslog, from what I have seen.
>> 
>> Thanks,
>> Karthik
>> 
>> On Thu, Sep 25, 2014 at 8:27 PM, Bing Jiang <ji...@gmail.com> wrote:
>>>  See mapreduce.job.reduce.slowstart.completedmaps.
>>>  It gives a hint of when reduce tasks can kick off.
>>> 
>>>  2014-09-26 8:36 GMT+08:00 Abdul Navaz <na...@gmail.com>:
>>>> 
>>>>  Hello,
>>>> 
>>>>  I have a Hadoop cluster with 1 name node and 3 data nodes. I am running a
>>>>  sample word count job on a 1 GB file which is distributed across HDFS.
>>>> 
>>>>  When I run the MapReduce job, reduce starts even before the map phase
>>>>  reaches 100%, e.g. map 40% reduce 10%.
>>>> 
>>>>  I would like to know when the shuffling traffic starts.
>>>> 
>>>>  -> Is there any way to find out exactly when shuffling started? Does it
>>>>  generate any entry in the logs?
>>>>  -> How can I find the total amount of shuffling traffic?
>>>> 
>>>> 
>>>> 
>>>>  Thanks & Regards,
>>>> 
>>>>  Abdul Navaz
>>>>  Research Assistant
>>>>  University of Houston Main Campus, Houston TX
>>>>  Ph: 281-685-0388 <tel:281-685-0388>
>>>> 
>>> 
>>> 
>>> 
>>>  --
>>>  Bing Jiang
>>>  Tel：(86)134-2619-1361
>>>  weibo: http://weibo.com/jiangbinglover
>>>  BLOG: www.binospace.com <http://www.binospace.com>
>>>  BLOG: http://blog.sina.com.cn/jiangbinglover
>>>  Focus on distributed computing, HDFS/HBase
>>

Re: Hadoop shuffling traffic

Posted by Pramod Biligiri <pr...@gmail.com>.

Hi Abdul,
That is the right metric. You can take a look at this report we made on
this earlier:
http://www.slideshare.net/pramodbiligiri/shuffle-phase-as-the-bottleneck-in-hadoop-terasort
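
To make the distinction concrete: "Reduce shuffle bytes" counts the map output
that the reducers copied during the shuffle, whereas the final job output
written to HDFS is reported separately as HDFS_BYTES_WRITTEN. Below is a
minimal Python sketch, not an official Hadoop tool, that pulls both values out
of the JobClient console output. The sample lines are copied from the output
quoted in this thread (with the wrapped line rejoined); the regular expression
and the function name are assumptions about that log format.

```python
import re

# Sample lines copied from the console output quoted in this thread.
LOG = """\
14/10/01 18:38:12 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1106
14/10/01 18:38:12 INFO mapred.JobClient:     Map output materialized bytes=3059
14/10/01 18:38:12 INFO mapred.JobClient:     Reduce shuffle bytes=3059
14/10/01 18:38:12 INFO mapred.JobClient:     Map output bytes=2509
"""

# Matches "<timestamp> INFO mapred.JobClient:   <name>=<integer>".
COUNTER_RE = re.compile(r"INFO mapred\.JobClient:\s+(.+?)=(\d+)\s*$")

def parse_counters(text):
    """Return {counter name: integer value} for every counter line found."""
    counters = {}
    for line in text.splitlines():
        match = COUNTER_RE.search(line)
        if match:
            counters[match.group(1).strip()] = int(match.group(2))
    return counters

counters = parse_counters(LOG)
print("shuffle bytes:", counters["Reduce shuffle bytes"])
print("HDFS bytes written:", counters["HDFS_BYTES_WRITTEN"])
```

In this run the shuffle counter (3059) equals "Map output materialized bytes",
while the reduce output in HDFS is the smaller 1106 bytes.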

Pramod

On Wed, Oct 1, 2014 at 6:06 PM, Abdul Navaz <na...@gmail.com> wrote:

> Hello,
>
> This is the portion of the output which is displayed on the console when I
> run sample word count job.
>
> map 0% reduce 0%
>
> 14/10/01 18:37:52 INFO mapred.JobClient:  map 100% reduce 0%
>
> 14/10/01 18:38:10 INFO mapred.JobClient:  map 100% reduce 100%
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Job complete:
> job_201409262002_0003
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Counters: 29
>
> 14/10/01 18:38:12 INFO mapred.JobClient:   Job Counters
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     Launched reduce tasks=1
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=23511
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     Total time spent by all
> reduces waiting after reserving slots (ms)=0
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     Total time spent by all maps
> waiting after reserving slots (ms)=0
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     Launched map tasks=1
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     Data-local map tasks=1
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=14193
>
> 14/10/01 18:38:12 INFO mapred.JobClient:   File Output Format Counters
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     Bytes Written=1106
>
> 14/10/01 18:38:12 INFO mapred.JobClient:   FileSystemCounters
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     FILE_BYTES_READ=3059
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     HDFS_BYTES_READ=1601
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=108400
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1106
>
> 14/10/01 18:38:12 INFO mapred.JobClient:   File Input Format Counters
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     Bytes Read=1486
>
> 14/10/01 18:38:12 INFO mapred.JobClient:   Map-Reduce Framework
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     Map output materialized
> bytes=3059
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     Map input records=6
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     *Reduce shuffle bytes=3059*
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     Spilled Records=544
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     Map output bytes=2509
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     Total committed heap usage
>
>
> I am trying to find the shuffling traffic, that is, the total traffic generated
> when mappers exchange their key-value pairs with the reducer. Does the
> highlighted portion give the shuffling traffic?
>
>
> Thanks & Regards,
>
> Abdul Navaz
> Research Assistant
> University of Houston Main Campus, Houston TX
> Ph: 281-685-0388
>
>
>
>
> On 9/26/14, 12:00 AM, "karthikeyan S" <ka...@gmail.com> wrote:
>
> The reducer starts as soon as it has data available from any one of the
> mappers.
> The reducer keeps polling the AM and asks if any mapper has completed
> processing. If so it fetches data from that mapper.
> So it's not necessary for all the mappers of a task to complete for
> the reducer to start processing.
>
> When the reducers start fetching the data from the mappers, they print
> that info in their syslog, from what I have seen.
>
> Thanks,
> Karthik
>
> On Thu, Sep 25, 2014 at 8:27 PM, Bing Jiang <ji...@gmail.com>
> wrote:
>
> See mapreduce.job.reduce.slowstart.completedmaps.
> It gives a hint of when reduce tasks can kick off.
>
> 2014-09-26 8:36 GMT+08:00 Abdul Navaz <na...@gmail.com>:
>
>
> Hello,
>
> I have a Hadoop cluster with 1 name node and 3 data nodes. I am running a
> sample word count job on a 1 GB file which is distributed across HDFS.
>
> When I run the MapReduce job, reduce starts even before the map phase
> reaches 100%, e.g. map 40% reduce 10%.
>
> I would like to know when the shuffling traffic starts.
>
> -> Is there any way to find out exactly when shuffling started? Does it
> generate any entry in the logs?
> -> How can I find the total amount of shuffling traffic?
>
>
>
> Thanks & Regards,
>
> Abdul Navaz
> Research Assistant
> University of Houston Main Campus, Houston TX
> Ph: 281-685-0388
>
>
>
>
> --
> Bing Jiang
> Tel：(86)134-2619-1361
> weibo: http://weibo.com/jiangbinglover
> BLOG: www.binospace.com
> BLOG: http://blog.sina.com.cn/jiangbinglover
> Focus on distributed computing, HDFS/HBase
>
>
>

Re: Hadoop shuffling traffic

Posted by Pramod Biligiri <pr...@gmail.com>.

Hi Abdul,
That is the right metric. You can take a look at this report we made on
this earlier:
http://www.slideshare.net/pramodbiligiri/shuffle-phase-as-the-bottleneck-in-hadoop-terasort

Pramod

On Wed, Oct 1, 2014 at 6:06 PM, Abdul Navaz <na...@gmail.com> wrote:

> Hello,
>
> This is the portion of the output which is displayed on the console when I
> run sample word count job.
>
> map 0% reduce 0%
>
> 14/10/01 18:37:52 INFO mapred.JobClient:  map 100% reduce 0%
>
> 14/10/01 18:38:10 INFO mapred.JobClient:  map 100% reduce 100%
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Job complete:
> job_201409262002_0003
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Counters: 29
>
> 14/10/01 18:38:12 INFO mapred.JobClient:   Job Counters
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     Launched reduce tasks=1
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=23511
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     Total time spent by all
> reduces waiting after reserving slots (ms)=0
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     Total time spent by all maps
> waiting after reserving slots (ms)=0
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     Launched map tasks=1
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     Data-local map tasks=1
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=14193
>
> 14/10/01 18:38:12 INFO mapred.JobClient:   File Output Format Counters
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     Bytes Written=1106
>
> 14/10/01 18:38:12 INFO mapred.JobClient:   FileSystemCounters
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     FILE_BYTES_READ=3059
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     HDFS_BYTES_READ=1601
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=108400
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1106
>
> 14/10/01 18:38:12 INFO mapred.JobClient:   File Input Format Counters
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     Bytes Read=1486
>
> 14/10/01 18:38:12 INFO mapred.JobClient:   Map-Reduce Framework
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     Map output materialized
> bytes=3059
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     Map input records=6
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     *Reduce shuffle bytes=3059*
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     Spilled Records=544
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     Map output bytes=2509
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     Total committed heap usage
>
>
> I am trying to find the shuffling traffic that is total traffic generated
> when mappers exchange their key values pair with the reducer. Is the
> highlighted portion gives the shuffling traffic ?
>
>
> Thanks & Regards,
>
> Abdul Navaz
> Research Assistant
> University of Houston Main Campus, Houston TX
> Ph: 281-685-0388
>
>
>
>
> On 9/26/14, 12:00 AM, "karthikeyan S" <ka...@gmail.com> wrote:
>
> The reducer starts as soon as it has data available from any one of the
> mappers.
> The reducer keeps polling the AM and asks if any mapper has completed
> processing. If so it fetches data from that mapper.
> So it's not necessary for all the mappers of a task to complete for
> the reducer to start processing.
>
> When the reducers starts fetching the data from the mappers it prints
> that info in its syslog, from what I have seen.
>
> Thanks,
> Karthik
>
> On Thu, Sep 25, 2014 at 8:27 PM, Bing Jiang <ji...@gmail.com>
> wrote:
>
> see mapreduce.job.reduce.slowstart.completedmaps
> It gives hint of  when reduce tasks could kick off.
>
> 2014-09-26 8:36 GMT+08:00 Abdul Navaz <na...@gmail.com>:
>
>
> Hello,
>
> I am having a Hadoop cluster with 1 name node and 3 data nodes. I running
> sample word count job on 1GB of file which is distributed among the HDFS.
>
> When I run the map reduce job, before even completing the mapping 100 %
> reduce starts.  Say for eg map 40% reduce 10% etc.
>
> I would like to know when the shuffling traffic starts ?
>
> ->  Is there any way to find out when exactly shuffling started ?  Does it
> generate any syslog in the logs .
> -> How to find the total amount of shuffling traffic?
>
>
>
> Thanks & Regards,
>
> Abdul Navaz
> Research Assistant
> University of Houston Main Campus, Houston TX
> Ph: 281-685-0388
>
>
>
>
> --
> Bing Jiang
> Tel：(86)134-2619-1361
> weibo: http://weibo.com/jiangbinglover
> BLOG: www.binospace.com
> BLOG: http://blog.sina.com.cn/jiangbinglover
> Focus on distributed computing, HDFS/HBase
>
>
>

Re: Hadoop shuffling traffic

Posted by Abdul Navaz <na...@gmail.com>.

Hello,

This is the portion of the output displayed on the console when I run the
sample word count job.

map 0% reduce 0%
14/10/01 18:37:52 INFO mapred.JobClient:  map 100% reduce 0%
14/10/01 18:38:10 INFO mapred.JobClient:  map 100% reduce 100%
14/10/01 18:38:12 INFO mapred.JobClient: Job complete: job_201409262002_0003
14/10/01 18:38:12 INFO mapred.JobClient: Counters: 29
14/10/01 18:38:12 INFO mapred.JobClient:   Job Counters
14/10/01 18:38:12 INFO mapred.JobClient:     Launched reduce tasks=1
14/10/01 18:38:12 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=23511
14/10/01 18:38:12 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
14/10/01 18:38:12 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
14/10/01 18:38:12 INFO mapred.JobClient:     Launched map tasks=1
14/10/01 18:38:12 INFO mapred.JobClient:     Data-local map tasks=1
14/10/01 18:38:12 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=14193
14/10/01 18:38:12 INFO mapred.JobClient:   File Output Format Counters
14/10/01 18:38:12 INFO mapred.JobClient:     Bytes Written=1106
14/10/01 18:38:12 INFO mapred.JobClient:   FileSystemCounters
14/10/01 18:38:12 INFO mapred.JobClient:     FILE_BYTES_READ=3059
14/10/01 18:38:12 INFO mapred.JobClient:     HDFS_BYTES_READ=1601
14/10/01 18:38:12 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=108400
14/10/01 18:38:12 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1106
14/10/01 18:38:12 INFO mapred.JobClient:   File Input Format Counters
14/10/01 18:38:12 INFO mapred.JobClient:     Bytes Read=1486
14/10/01 18:38:12 INFO mapred.JobClient:   Map-Reduce Framework
14/10/01 18:38:12 INFO mapred.JobClient:     Map output materialized bytes=3059
14/10/01 18:38:12 INFO mapred.JobClient:     Map input records=6
14/10/01 18:38:12 INFO mapred.JobClient:     Reduce shuffle bytes=3059
14/10/01 18:38:12 INFO mapred.JobClient:     Spilled Records=544
14/10/01 18:38:12 INFO mapred.JobClient:     Map output bytes=2509
14/10/01 18:38:12 INFO mapred.JobClient:     Total committed heap usage



I am trying to find the shuffling traffic, that is, the total traffic generated
when the mappers send their key-value pairs to the reducer. Does the
"Reduce shuffle bytes=3059" line give the shuffling traffic?
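
For anyone who wants to pull this counter out of saved console output rather than read it by eye, here is a minimal Python sketch. It assumes the JobClient lines have the "name=value" shape shown in this message and that each counter sits on one unwrapped line; the function name parse_counters and the SAMPLE text are only illustrative, not part of Hadoop.

```python
import re

# Matches "... INFO mapred.JobClient:     Name=123" and captures name and value.
COUNTER_RE = re.compile(r"mapred\.JobClient:\s+(?P<name>[^=]+?)=(?P<value>\d+)\s*$")


def parse_counters(console_output):
    """Return {counter name: integer value} for every 'name=value' JobClient line."""
    counters = {}
    for line in console_output.splitlines():
        match = COUNTER_RE.search(line)
        if match:
            counters[match.group("name").strip()] = int(match.group("value"))
    return counters


# Three lines copied from the console output quoted in this thread.
SAMPLE = """\
14/10/01 18:38:12 INFO mapred.JobClient:     Map output materialized bytes=3059
14/10/01 18:38:12 INFO mapred.JobClient:     Reduce shuffle bytes=3059
14/10/01 18:38:12 INFO mapred.JobClient:     Map output bytes=2509
"""

counters = parse_counters(SAMPLE)
print(counters["Reduce shuffle bytes"])
```

Lines without an "=" (such as "Counters: 29" or the group headings) are simply skipped, so the whole console dump can be passed in as-is.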


Thanks & Regards,

Abdul Navaz
Research Assistant
University of Houston Main Campus, Houston TX
Ph: 281-685-0388




On 9/26/14, 12:00 AM, "karthikeyan S" <ka...@gmail.com> wrote:

> The reducer starts as soon as it has data available from any one of the
> mappers.
> The reducer keeps polling the AM and asks if any mapper has completed
> processing. If so it fetches data from that mapper.
> So it's not necessary for all the mappers of a task to complete for
> the reducer to start processing.
> 
> When the reducers starts fetching the data from the mappers it prints
> that info in its syslog, from what I have seen.
> 
> Thanks,
> Karthik
> 
> On Thu, Sep 25, 2014 at 8:27 PM, Bing Jiang <ji...@gmail.com> wrote:
>>  see mapreduce.job.reduce.slowstart.completedmaps
>>  It gives hint of  when reduce tasks could kick off.
>> 
>>  2014-09-26 8:36 GMT+08:00 Abdul Navaz <na...@gmail.com>:
>>> 
>>>  Hello,
>>> 
>>>  I am having a Hadoop cluster with 1 name node and 3 data nodes. I running
>>>  sample word count job on 1GB of file which is distributed among the HDFS.
>>> 
>>>  When I run the map reduce job, before even completing the mapping 100 %
>>>  reduce starts.  Say for eg map 40% reduce 10% etc.
>>> 
>>>  I would like to know when the shuffling traffic starts ?
>>> 
>>>  ->  Is there any way to find out when exactly shuffling started ?  Does it
>>>  generate any syslog in the logs .
>>>  -> How to find the total amount of shuffling traffic?
>>> 
>>> 
>>> 
>>>  Thanks & Regards,
>>> 
>>>  Abdul Navaz
>>>  Research Assistant
>>>  University of Houston Main Campus, Houston TX
>>>  Ph: 281-685-0388
>>> 
>> 
>> 
>> 
>>  --
>>  Bing Jiang
>>  Tel：(86)134-2619-1361
>>  weibo: http://weibo.com/jiangbinglover
>>  BLOG: www.binospace.com
>>  BLOG: http://blog.sina.com.cn/jiangbinglover
>>  Focus on distributed computing, HDFS/HBase
>

Re: Hadoop shuffling traffic

Posted by karthikeyan S <ka...@gmail.com>.

A reducer starts as soon as map output is available from any one of the mappers.
The reducer keeps polling the AM (ApplicationMaster) to ask whether any mapper
has completed processing. If so, it fetches the output of that mapper.
So it is not necessary for all the mappers of a job to complete before
the reducer starts fetching. The reduce function itself runs only after
all the map output has been copied.

When a reducer starts fetching data from the mappers, it logs that
in its task syslog, from what I have seen.
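
As a rough way to find when the shuffle started from a reducer's syslog, one could scan for the earliest timestamped line that mentions shuffling. The sketch below is Python and rests on two assumptions that should be checked against a real log: that lines begin with a "YYYY-MM-DD HH:MM:SS,mmm" timestamp, and that the fetch messages contain the text "shuffl". The exact wording of those messages differs between Hadoop versions, so the sample lines (including the logger name ExampleReduceTask) are made up for illustration.

```python
import re
from datetime import datetime

# Assumed line prefix: "2014-10-01 18:37:55,123 INFO ..."
TIMESTAMP_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}\s")


def first_shuffle_time(syslog_text, keyword="shuffl"):
    """Return the timestamp of the earliest line mentioning the keyword, or None."""
    earliest = None
    for line in syslog_text.splitlines():
        match = TIMESTAMP_RE.match(line)
        if match and keyword in line.lower():
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            if earliest is None or stamp < earliest:
                earliest = stamp
    return earliest


# Made-up sample lines for illustration only; real messages differ by version.
SAMPLE = """\
2014-10-01 18:37:50,101 INFO ExampleReduceTask: task starting
2014-10-01 18:37:55,123 INFO ExampleReduceTask: Shuffling 3059 bytes from map 0
2014-10-01 18:38:05,900 INFO ExampleReduceTask: merge finished
"""

print(first_shuffle_time(SAMPLE))
```

With several reducers, running this over each reducer's syslog and taking the minimum would give the time the first fetch began for the whole job.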

Thanks,
Karthik

On Thu, Sep 25, 2014 at 8:27 PM, Bing Jiang <ji...@gmail.com> wrote:
> see mapreduce.job.reduce.slowstart.completedmaps
> It gives hint of  when reduce tasks could kick off.
>
> 2014-09-26 8:36 GMT+08:00 Abdul Navaz <na...@gmail.com>:
>>
>> Hello,
>>
>> I am having a Hadoop cluster with 1 name node and 3 data nodes. I running
>> sample word count job on 1GB of file which is distributed among the HDFS.
>>
>> When I run the map reduce job, before even completing the mapping 100 %
>> reduce starts.  Say for eg map 40% reduce 10% etc.
>>
>> I would like to know when the shuffling traffic starts ?
>>
>> ->  Is there any way to find out when exactly shuffling started ?  Does it
>> generate any syslog in the logs .
>> -> How to find the total amount of shuffling traffic?
>>
>>
>>
>> Thanks & Regards,
>>
>> Abdul Navaz
>> Research Assistant
>> University of Houston Main Campus, Houston TX
>> Ph: 281-685-0388
>>
>
>
>
> --
> Bing Jiang
> Tel：(86)134-2619-1361
> weibo: http://weibo.com/jiangbinglover
> BLOG: www.binospace.com
> BLOG: http://blog.sina.com.cn/jiangbinglover
> Focus on distributed computing, HDFS/HBase

Re: Hadoop shuffling traffic

Posted by Bing Jiang <ji...@gmail.com>.

See mapreduce.job.reduce.slowstart.completedmaps.
It sets the fraction of map tasks that must complete before reduce tasks can kick off.
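
To make the effect of that setting concrete, here is a small Python sketch of the arithmetic. It assumes the threshold is the slowstart fraction times the number of map tasks, rounded up, and that the default fraction is 0.05; it also assumes a 64 MB block size, so a 1 GB input gives 16 map tasks. The function name maps_before_reducers is only illustrative. On Hadoop 1.x the property is named mapred.reduce.slowstart.completed.maps.

```python
import math


def maps_before_reducers(total_maps, slowstart_fraction=0.05):
    """Number of map tasks that must complete before reducers may be scheduled."""
    if not 0.0 <= slowstart_fraction <= 1.0:
        raise ValueError("slowstart fraction must be between 0 and 1")
    return math.ceil(slowstart_fraction * total_maps)


# Assumption: a 1 GB (1024 MB) input split into 64 MB blocks gives 16 map tasks.
total_maps = (1024 + 64 - 1) // 64

for fraction in (0.05, 0.5, 1.0):
    print(fraction, maps_before_reducers(total_maps, fraction))
```

Under these assumptions a fraction of 1.0 holds the reducers back until every map has finished, which keeps the shuffle from overlapping the map phase and makes its start easier to spot in the logs.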

2014-09-26 8:36 GMT+08:00 Abdul Navaz <na...@gmail.com>:

> Hello,
>
> I am having a Hadoop cluster with 1 name node and 3 data nodes. I running
> sample word count job on 1GB of file which is distributed among the HDFS.
>
> When I run the map reduce job, before even completing the mapping 100 %
> reduce starts.  Say for eg map 40% reduce 10% etc.
>
> I would like to know when the shuffling traffic starts ?
>
> ->  Is there any way to find out when exactly shuffling started ?  Does it
> generate any syslog in the logs .
> -> How to find the total amount of shuffling traffic?
>
>
>
> Thanks & Regards,
>
> Abdul Navaz
> Research Assistant
> University of Houston Main Campus, Houston TX
> Ph: 281-685-0388
>
>


-- 
Bing Jiang
Tel：(86)134-2619-1361
weibo: http://weibo.com/jiangbinglover
BLOG: www.binospace.com
BLOG: http://blog.sina.com.cn/jiangbinglover
Focus on distributed computing, HDFS/HBase
