You are viewing a plain text version of this content. The canonical link for it is here.

Posted to mapreduce-user@hadoop.apache.org by Abdul Navaz <na...@gmail.com> on 2014/10/02 03:06:30 UTC

Re: Hadoop shuffling traffic

Hello,

This is the portion of the output which is displayed on the console when I
run sample word count job.

map 0% reduce 0%

14/10/01 18:37:52 INFO mapred.JobClient:  map 100% reduce 0%

14/10/01 18:38:10 INFO mapred.JobClient:  map 100% reduce 100%

14/10/01 18:38:12 INFO mapred.JobClient: Job complete: job_201409262002_0003

14/10/01 18:38:12 INFO mapred.JobClient: Counters: 29

14/10/01 18:38:12 INFO mapred.JobClient:   Job Counters

14/10/01 18:38:12 INFO mapred.JobClient:     Launched reduce tasks=1

14/10/01 18:38:12 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=23511

14/10/01 18:38:12 INFO mapred.JobClient:     Total time spent by all reduces
waiting after reserving slots (ms)=0

14/10/01 18:38:12 INFO mapred.JobClient:     Total time spent by all maps
waiting after reserving slots (ms)=0

14/10/01 18:38:12 INFO mapred.JobClient:     Launched map tasks=1

14/10/01 18:38:12 INFO mapred.JobClient:     Data-local map tasks=1

14/10/01 18:38:12 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=14193

14/10/01 18:38:12 INFO mapred.JobClient:   File Output Format Counters

14/10/01 18:38:12 INFO mapred.JobClient:     Bytes Written=1106

14/10/01 18:38:12 INFO mapred.JobClient:   FileSystemCounters

14/10/01 18:38:12 INFO mapred.JobClient:     FILE_BYTES_READ=3059

14/10/01 18:38:12 INFO mapred.JobClient:     HDFS_BYTES_READ=1601

14/10/01 18:38:12 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=108400

14/10/01 18:38:12 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1106

14/10/01 18:38:12 INFO mapred.JobClient:   File Input Format Counters

14/10/01 18:38:12 INFO mapred.JobClient:     Bytes Read=1486

14/10/01 18:38:12 INFO mapred.JobClient:   Map-Reduce Framework

14/10/01 18:38:12 INFO mapred.JobClient:     Map output materialized
bytes=3059

14/10/01 18:38:12 INFO mapred.JobClient:     Map input records=6

14/10/01 18:38:12 INFO mapred.JobClient:     Reduce shuffle bytes=3059

14/10/01 18:38:12 INFO mapred.JobClient:     Spilled Records=544

14/10/01 18:38:12 INFO mapred.JobClient:     Map output bytes=2509

14/10/01 18:38:12 INFO mapred.JobClient:     Total committed heap usage



I am trying to find the shuffling traffic that is total traffic generated
when mappers exchange their key values pair with the reducer. Is the
highlighted portion gives the shuffling traffic ?


Thanks & Regards,

Abdul Navaz
Research Assistant
University of Houston Main Campus, Houston TX
Ph: 281-685-0388




On 9/26/14, 12:00 AM, "karthikeyan S" <ka...@gmail.com> wrote:

> The reducer starts as soon as it has data available from any one of the
> mappers.
> The reducer keeps polling the AM and asks if any mapper has completed
> processing. If so it fetches data from that mapper.
> So it's not necessary for all the mappers of a task to complete for
> the reducer to start processing.
> 
> When the reducers starts fetching the data from the mappers it prints
> that info in its syslog, from what I have seen.
> 
> Thanks,
> Karthik
> 
> On Thu, Sep 25, 2014 at 8:27 PM, Bing Jiang <ji...@gmail.com> wrote:
>>  see mapreduce.job.reduce.slowstart.completedmaps
>>  It gives hint of  when reduce tasks could kick off.
>> 
>>  2014-09-26 8:36 GMT+08:00 Abdul Navaz <na...@gmail.com>:
>>> 
>>>  Hello,
>>> 
>>>  I am having a Hadoop cluster with 1 name node and 3 data nodes. I running
>>>  sample word count job on 1GB of file which is distributed among the HDFS.
>>> 
>>>  When I run the map reduce job, before even completing the mapping 100 %
>>>  reduce starts.  Say for eg map 40% reduce 10% etc.
>>> 
>>>  I would like to know when the shuffling traffic starts ?
>>> 
>>>  ->  Is there any way to find out when exactly shuffling started ?  Does it
>>>  generate any syslog in the logs .
>>>  -> How to find the total amount of shuffling traffic?
>>> 
>>> 
>>> 
>>>  Thanks & Regards,
>>> 
>>>  Abdul Navaz
>>>  Research Assistant
>>>  University of Houston Main Campus, Houston TX
>>>  Ph: 281-685-0388
>>> 
>> 
>> 
>> 
>>  --
>>  Bing Jiang
>>  Tel：(86)134-2619-1361
>>  weibo: http://weibo.com/jiangbinglover
>>  BLOG: www.binospace.com
>>  BLOG: http://blog.sina.com.cn/jiangbinglover
>>  Focus on distributed computing, HDFS/HBase
>

Re: Hadoop shuffling traffic

Posted by Abdul Navaz <na...@gmail.com>.

Hello Pramod,

This is great work !. Thank you for sharing the report.

Thanks & Regards,

Abdul Navaz
Research Assistant
University of Houston Main Campus, Houston TX
Ph: 281-685-0388


From:  Pramod Biligiri <pr...@gmail.com>
Reply-To:  <us...@hadoop.apache.org>
Date:  Thursday, October 2, 2014 at 12:44 AM
To:  "zookeeper-user@hadoop.apache.org" <us...@hadoop.apache.org>
Subject:  Re: Hadoop shuffling traffic

Hi Abdul,
That is the right metric. You can take a look at this report we made on this
earlier: 
http://www.slideshare.net/pramodbiligiri/shuffle-phase-as-the-bottleneck-in-
hadoop-terasort

Pramod

On Wed, Oct 1, 2014 at 6:06 PM, Abdul Navaz <na...@gmail.com> wrote:
> Hello,
> 
> This is the portion of the output which is displayed on the console when I run
> sample word count job.
> 
> map 0% reduce 0%
> 
> 14/10/01 18:37:52 INFO mapred.JobClient:  map 100% reduce 0%
> 
> 14/10/01 18:38:10 INFO mapred.JobClient:  map 100% reduce 100%
> 
> 14/10/01 18:38:12 INFO mapred.JobClient: Job complete: job_201409262002_0003
> 
> 14/10/01 18:38:12 INFO mapred.JobClient: Counters: 29
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:   Job Counters
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Launched reduce tasks=1
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=23511
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Total time spent by all reduces
> waiting after reserving slots (ms)=0
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Total time spent by all maps
> waiting after reserving slots (ms)=0
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Launched map tasks=1
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Data-local map tasks=1
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=14193
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:   File Output Format Counters
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Bytes Written=1106
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:   FileSystemCounters
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     FILE_BYTES_READ=3059
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     HDFS_BYTES_READ=1601
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=108400
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1106
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:   File Input Format Counters
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Bytes Read=1486
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:   Map-Reduce Framework
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Map output materialized
> bytes=3059
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Map input records=6
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Reduce shuffle bytes=3059
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Spilled Records=544
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Map output bytes=2509
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Total committed heap usage
> 
> 
> 
> I am trying to find the shuffling traffic that is total traffic generated when
> mappers exchange their key values pair with the reducer. Is the highlighted
> portion gives the shuffling traffic ?
> 
> 
> Thanks & Regards,
> 
> Abdul Navaz
> Research Assistant
> University of Houston Main Campus, Houston TX
> Ph: 281-685-0388 <tel:281-685-0388>
> 
> 
> 
> 
> On 9/26/14, 12:00 AM, "karthikeyan S" <ka...@gmail.com> wrote:
> 
>> The reducer starts as soon as it has data available from any one of the
>> mappers.
>> The reducer keeps polling the AM and asks if any mapper has completed
>> processing. If so it fetches data from that mapper.
>> So it's not necessary for all the mappers of a task to complete for
>> the reducer to start processing.
>> 
>> When the reducers starts fetching the data from the mappers it prints
>> that info in its syslog, from what I have seen.
>> 
>> Thanks,
>> Karthik
>> 
>> On Thu, Sep 25, 2014 at 8:27 PM, Bing Jiang <ji...@gmail.com> wrote:
>>>  see mapreduce.job.reduce.slowstart.completedmaps
>>>  It gives hint of  when reduce tasks could kick off.
>>> 
>>>  2014-09-26 8:36 GMT+08:00 Abdul Navaz <na...@gmail.com>:
>>>> 
>>>>  Hello,
>>>> 
>>>>  I am having a Hadoop cluster with 1 name node and 3 data nodes. I running
>>>>  sample word count job on 1GB of file which is distributed among the HDFS.
>>>> 
>>>>  When I run the map reduce job, before even completing the mapping 100 %
>>>>  reduce starts.  Say for eg map 40% reduce 10% etc.
>>>> 
>>>>  I would like to know when the shuffling traffic starts ?
>>>> 
>>>>  ->  Is there any way to find out when exactly shuffling started ?  Does it
>>>>  generate any syslog in the logs .
>>>>  -> How to find the total amount of shuffling traffic?
>>>> 
>>>> 
>>>> 
>>>>  Thanks & Regards,
>>>> 
>>>>  Abdul Navaz
>>>>  Research Assistant
>>>>  University of Houston Main Campus, Houston TX
>>>>  Ph: 281-685-0388 <tel:281-685-0388>
>>>> 
>>> 
>>> 
>>> 
>>>  --
>>>  Bing Jiang
>>>  Tel：(86)134-2619-1361
>>>  weibo: http://weibo.com/jiangbinglover
>>>  BLOG: www.binospace.com <http://www.binospace.com>
>>>  BLOG: http://blog.sina.com.cn/jiangbinglover
>>>  Focus on distributed computing, HDFS/HBase
>>

Re: Hadoop shuffling traffic

Posted by Pramod Biligiri <pr...@gmail.com>.

I think it refers to the no. of bytes the reducer fetches from the mapper.

Pramod

On Wed, Oct 8, 2014 at 10:17 PM, Abdul Navaz <na...@gmail.com> wrote:

> Hello,
>
> Fiesr of all thank you very much for your help. :)
>
> I still have some doubt with this .
>
> Is the highlighted metric “ *Reduce shuffle bytes=3059” *
>
>
>    1. Is the total bytes after the reduced phase. ( That is the output
>    file which is written into HDFS)
>
> Or
>
> 2.  Is this is the actual shuffled traffic which is exchanged between
> mappers and reducers before performing reducing ?
>
> Please clarify !
>
> Thanks & Regards,
>
> Abdul Navaz
>
>
>
> From: Pramod Biligiri <pr...@gmail.com>
> Reply-To: <us...@hadoop.apache.org>
> Date: Thursday, October 2, 2014 at 12:44 AM
> To: "zookeeper-user@hadoop.apache.org" <us...@hadoop.apache.org>
> Subject: Re: Hadoop shuffling traffic
>
> Hi Abdul,
> That is the right metric. You can take a look at this report we made on
> this earlier:
> http://www.slideshare.net/pramodbiligiri/shuffle-phase-as-the-bottleneck-in-hadoop-terasort
>
> Pramod
>
> On Wed, Oct 1, 2014 at 6:06 PM, Abdul Navaz <na...@gmail.com> wrote:
>
>> Hello,
>>
>> This is the portion of the output which is displayed on the console when
>> I run sample word count job.
>>
>> map 0% reduce 0%
>>
>> 14/10/01 18:37:52 INFO mapred.JobClient:  map 100% reduce 0%
>>
>> 14/10/01 18:38:10 INFO mapred.JobClient:  map 100% reduce 100%
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient: Job complete:
>> job_201409262002_0003
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient: Counters: 29
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient:   Job Counters
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient:     Launched reduce tasks=1
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=23511
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient:     Total time spent by all
>> reduces waiting after reserving slots (ms)=0
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient:     Total time spent by all maps
>> waiting after reserving slots (ms)=0
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient:     Launched map tasks=1
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient:     Data-local map tasks=1
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=14193
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient:   File Output Format Counters
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient:     Bytes Written=1106
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient:   FileSystemCounters
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient:     FILE_BYTES_READ=3059
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient:     HDFS_BYTES_READ=1601
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=108400
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1106
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient:   File Input Format Counters
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient:     Bytes Read=1486
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient:   Map-Reduce Framework
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient:     Map output materialized
>> bytes=3059
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient:     Map input records=6
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient:     *Reduce shuffle bytes=3059*
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient:     Spilled Records=544
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient:     Map output bytes=2509
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient:     Total committed heap usage
>>
>>
>> I am trying to find the shuffling traffic that is total traffic
>> generated when mappers exchange their key values pair with the reducer. Is
>> the highlighted portion gives the shuffling traffic ?
>>
>>
>> Thanks & Regards,
>>
>> Abdul Navaz
>> Research Assistant
>> University of Houston Main Campus, Houston TX
>> Ph: 281-685-0388
>>
>>
>>
>>
>> On 9/26/14, 12:00 AM, "karthikeyan S" <ka...@gmail.com> wrote:
>>
>> The reducer starts as soon as it has data available from any one of the
>> mappers.
>> The reducer keeps polling the AM and asks if any mapper has completed
>> processing. If so it fetches data from that mapper.
>> So it's not necessary for all the mappers of a task to complete for
>> the reducer to start processing.
>>
>> When the reducers starts fetching the data from the mappers it prints
>> that info in its syslog, from what I have seen.
>>
>> Thanks,
>> Karthik
>>
>> On Thu, Sep 25, 2014 at 8:27 PM, Bing Jiang <ji...@gmail.com>
>> wrote:
>>
>> see mapreduce.job.reduce.slowstart.completedmaps
>> It gives hint of  when reduce tasks could kick off.
>>
>> 2014-09-26 8:36 GMT+08:00 Abdul Navaz <na...@gmail.com>:
>>
>>
>> Hello,
>>
>> I am having a Hadoop cluster with 1 name node and 3 data nodes. I running
>> sample word count job on 1GB of file which is distributed among the HDFS.
>>
>> When I run the map reduce job, before even completing the mapping 100 %
>> reduce starts.  Say for eg map 40% reduce 10% etc.
>>
>> I would like to know when the shuffling traffic starts ?
>>
>> ->  Is there any way to find out when exactly shuffling started ?  Does it
>> generate any syslog in the logs .
>> -> How to find the total amount of shuffling traffic?
>>
>>
>>
>> Thanks & Regards,
>>
>> Abdul Navaz
>> Research Assistant
>> University of Houston Main Campus, Houston TX
>> Ph: 281-685-0388
>>
>>
>>
>>
>> --
>> Bing Jiang
>> Tel：(86)134-2619-1361
>> weibo: http://weibo.com/jiangbinglover
>> BLOG: www.binospace.com
>> BLOG: http://blog.sina.com.cn/jiangbinglover
>> Focus on distributed computing, HDFS/HBase
>>
>>
>>
>

Re: Hadoop shuffling traffic

Posted by Abdul Navaz <na...@gmail.com>.

Hello,

Fiesr of all thank you very much for your help. :)

I still have some doubt with this .

Is the highlighted metric “ Reduce shuffle bytes=3059”

1. Is the total bytes after the reduced phase. ( That is the output file
which is written into HDFS)
Or

2.  Is this is the actual shuffled traffic which is exchanged between
mappers and reducers before performing reducing ?

Please clarify !

Thanks & Regards,

Abdul Navaz



From:  Pramod Biligiri <pr...@gmail.com>
Reply-To:  <us...@hadoop.apache.org>
Date:  Thursday, October 2, 2014 at 12:44 AM
To:  "zookeeper-user@hadoop.apache.org" <us...@hadoop.apache.org>
Subject:  Re: Hadoop shuffling traffic

Hi Abdul,
That is the right metric. You can take a look at this report we made on this
earlier: 
http://www.slideshare.net/pramodbiligiri/shuffle-phase-as-the-bottleneck-in-
hadoop-terasort

Pramod

On Wed, Oct 1, 2014 at 6:06 PM, Abdul Navaz <na...@gmail.com> wrote:
> Hello,
> 
> This is the portion of the output which is displayed on the console when I run
> sample word count job.
> 
> map 0% reduce 0%
> 
> 14/10/01 18:37:52 INFO mapred.JobClient:  map 100% reduce 0%
> 
> 14/10/01 18:38:10 INFO mapred.JobClient:  map 100% reduce 100%
> 
> 14/10/01 18:38:12 INFO mapred.JobClient: Job complete: job_201409262002_0003
> 
> 14/10/01 18:38:12 INFO mapred.JobClient: Counters: 29
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:   Job Counters
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Launched reduce tasks=1
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=23511
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Total time spent by all reduces
> waiting after reserving slots (ms)=0
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Total time spent by all maps
> waiting after reserving slots (ms)=0
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Launched map tasks=1
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Data-local map tasks=1
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=14193
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:   File Output Format Counters
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Bytes Written=1106
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:   FileSystemCounters
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     FILE_BYTES_READ=3059
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     HDFS_BYTES_READ=1601
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=108400
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1106
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:   File Input Format Counters
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Bytes Read=1486
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:   Map-Reduce Framework
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Map output materialized
> bytes=3059
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Map input records=6
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Reduce shuffle bytes=3059
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Spilled Records=544
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Map output bytes=2509
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Total committed heap usage
> 
> 
> 
> I am trying to find the shuffling traffic that is total traffic generated when
> mappers exchange their key values pair with the reducer. Is the highlighted
> portion gives the shuffling traffic ?
> 
> 
> Thanks & Regards,
> 
> Abdul Navaz
> Research Assistant
> University of Houston Main Campus, Houston TX
> Ph: 281-685-0388 <tel:281-685-0388>
> 
> 
> 
> 
> On 9/26/14, 12:00 AM, "karthikeyan S" <ka...@gmail.com> wrote:
> 
>> The reducer starts as soon as it has data available from any one of the
>> mappers.
>> The reducer keeps polling the AM and asks if any mapper has completed
>> processing. If so it fetches data from that mapper.
>> So it's not necessary for all the mappers of a task to complete for
>> the reducer to start processing.
>> 
>> When the reducers starts fetching the data from the mappers it prints
>> that info in its syslog, from what I have seen.
>> 
>> Thanks,
>> Karthik
>> 
>> On Thu, Sep 25, 2014 at 8:27 PM, Bing Jiang <ji...@gmail.com> wrote:
>>>  see mapreduce.job.reduce.slowstart.completedmaps
>>>  It gives hint of  when reduce tasks could kick off.
>>> 
>>>  2014-09-26 8:36 GMT+08:00 Abdul Navaz <na...@gmail.com>:
>>>> 
>>>>  Hello,
>>>> 
>>>>  I am having a Hadoop cluster with 1 name node and 3 data nodes. I running
>>>>  sample word count job on 1GB of file which is distributed among the HDFS.
>>>> 
>>>>  When I run the map reduce job, before even completing the mapping 100 %
>>>>  reduce starts.  Say for eg map 40% reduce 10% etc.
>>>> 
>>>>  I would like to know when the shuffling traffic starts ?
>>>> 
>>>>  ->  Is there any way to find out when exactly shuffling started ?  Does it
>>>>  generate any syslog in the logs .
>>>>  -> How to find the total amount of shuffling traffic?
>>>> 
>>>> 
>>>> 
>>>>  Thanks & Regards,
>>>> 
>>>>  Abdul Navaz
>>>>  Research Assistant
>>>>  University of Houston Main Campus, Houston TX
>>>>  Ph: 281-685-0388 <tel:281-685-0388>
>>>> 
>>> 
>>> 
>>> 
>>>  --
>>>  Bing Jiang
>>>  Tel：(86)134-2619-1361
>>>  weibo: http://weibo.com/jiangbinglover
>>>  BLOG: www.binospace.com <http://www.binospace.com>
>>>  BLOG: http://blog.sina.com.cn/jiangbinglover
>>>  Focus on distributed computing, HDFS/HBase
>>

Re: Hadoop shuffling traffic

Posted by Abdul Navaz <na...@gmail.com>.

Hello Pramod,

This is great work !. Thank you for sharing the report.

Thanks & Regards,

Abdul Navaz
Research Assistant
University of Houston Main Campus, Houston TX
Ph: 281-685-0388


From:  Pramod Biligiri <pr...@gmail.com>
Reply-To:  <us...@hadoop.apache.org>
Date:  Thursday, October 2, 2014 at 12:44 AM
To:  "zookeeper-user@hadoop.apache.org" <us...@hadoop.apache.org>
Subject:  Re: Hadoop shuffling traffic

Hi Abdul,
That is the right metric. You can take a look at this report we made on this
earlier: 
http://www.slideshare.net/pramodbiligiri/shuffle-phase-as-the-bottleneck-in-
hadoop-terasort

Pramod

On Wed, Oct 1, 2014 at 6:06 PM, Abdul Navaz <na...@gmail.com> wrote:
> Hello,
> 
> This is the portion of the output which is displayed on the console when I run
> sample word count job.
> 
> map 0% reduce 0%
> 
> 14/10/01 18:37:52 INFO mapred.JobClient:  map 100% reduce 0%
> 
> 14/10/01 18:38:10 INFO mapred.JobClient:  map 100% reduce 100%
> 
> 14/10/01 18:38:12 INFO mapred.JobClient: Job complete: job_201409262002_0003
> 
> 14/10/01 18:38:12 INFO mapred.JobClient: Counters: 29
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:   Job Counters
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Launched reduce tasks=1
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=23511
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Total time spent by all reduces
> waiting after reserving slots (ms)=0
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Total time spent by all maps
> waiting after reserving slots (ms)=0
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Launched map tasks=1
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Data-local map tasks=1
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=14193
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:   File Output Format Counters
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Bytes Written=1106
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:   FileSystemCounters
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     FILE_BYTES_READ=3059
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     HDFS_BYTES_READ=1601
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=108400
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1106
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:   File Input Format Counters
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Bytes Read=1486
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:   Map-Reduce Framework
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Map output materialized
> bytes=3059
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Map input records=6
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Reduce shuffle bytes=3059
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Spilled Records=544
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Map output bytes=2509
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Total committed heap usage
> 
> 
> 
> I am trying to find the shuffling traffic that is total traffic generated when
> mappers exchange their key values pair with the reducer. Is the highlighted
> portion gives the shuffling traffic ?
> 
> 
> Thanks & Regards,
> 
> Abdul Navaz
> Research Assistant
> University of Houston Main Campus, Houston TX
> Ph: 281-685-0388 <tel:281-685-0388>
> 
> 
> 
> 
> On 9/26/14, 12:00 AM, "karthikeyan S" <ka...@gmail.com> wrote:
> 
>> The reducer starts as soon as it has data available from any one of the
>> mappers.
>> The reducer keeps polling the AM and asks if any mapper has completed
>> processing. If so it fetches data from that mapper.
>> So it's not necessary for all the mappers of a task to complete for
>> the reducer to start processing.
>> 
>> When the reducers starts fetching the data from the mappers it prints
>> that info in its syslog, from what I have seen.
>> 
>> Thanks,
>> Karthik
>> 
>> On Thu, Sep 25, 2014 at 8:27 PM, Bing Jiang <ji...@gmail.com> wrote:
>>>  see mapreduce.job.reduce.slowstart.completedmaps
>>>  It gives hint of  when reduce tasks could kick off.
>>> 
>>>  2014-09-26 8:36 GMT+08:00 Abdul Navaz <na...@gmail.com>:
>>>> 
>>>>  Hello,
>>>> 
>>>>  I am having a Hadoop cluster with 1 name node and 3 data nodes. I running
>>>>  sample word count job on 1GB of file which is distributed among the HDFS.
>>>> 
>>>>  When I run the map reduce job, before even completing the mapping 100 %
>>>>  reduce starts.  Say for eg map 40% reduce 10% etc.
>>>> 
>>>>  I would like to know when the shuffling traffic starts ?
>>>> 
>>>>  ->  Is there any way to find out when exactly shuffling started ?  Does it
>>>>  generate any syslog in the logs .
>>>>  -> How to find the total amount of shuffling traffic?
>>>> 
>>>> 
>>>> 
>>>>  Thanks & Regards,
>>>> 
>>>>  Abdul Navaz
>>>>  Research Assistant
>>>>  University of Houston Main Campus, Houston TX
>>>>  Ph: 281-685-0388 <tel:281-685-0388>
>>>> 
>>> 
>>> 
>>> 
>>>  --
>>>  Bing Jiang
>>>  Tel：(86)134-2619-1361
>>>  weibo: http://weibo.com/jiangbinglover
>>>  BLOG: www.binospace.com <http://www.binospace.com>
>>>  BLOG: http://blog.sina.com.cn/jiangbinglover
>>>  Focus on distributed computing, HDFS/HBase
>>

Re: Hadoop shuffling traffic

Posted by Abdul Navaz <na...@gmail.com>.

Hello Pramod,

This is great work !. Thank you for sharing the report.

Thanks & Regards,

Abdul Navaz
Research Assistant
University of Houston Main Campus, Houston TX
Ph: 281-685-0388


From:  Pramod Biligiri <pr...@gmail.com>
Reply-To:  <us...@hadoop.apache.org>
Date:  Thursday, October 2, 2014 at 12:44 AM
To:  "zookeeper-user@hadoop.apache.org" <us...@hadoop.apache.org>
Subject:  Re: Hadoop shuffling traffic

Hi Abdul,
That is the right metric. You can take a look at this report we made on this
earlier: 
http://www.slideshare.net/pramodbiligiri/shuffle-phase-as-the-bottleneck-in-
hadoop-terasort

Pramod

On Wed, Oct 1, 2014 at 6:06 PM, Abdul Navaz <na...@gmail.com> wrote:
> Hello,
> 
> This is the portion of the output which is displayed on the console when I run
> sample word count job.
> 
> map 0% reduce 0%
> 
> 14/10/01 18:37:52 INFO mapred.JobClient:  map 100% reduce 0%
> 
> 14/10/01 18:38:10 INFO mapred.JobClient:  map 100% reduce 100%
> 
> 14/10/01 18:38:12 INFO mapred.JobClient: Job complete: job_201409262002_0003
> 
> 14/10/01 18:38:12 INFO mapred.JobClient: Counters: 29
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:   Job Counters
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Launched reduce tasks=1
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=23511
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Total time spent by all reduces
> waiting after reserving slots (ms)=0
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Total time spent by all maps
> waiting after reserving slots (ms)=0
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Launched map tasks=1
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Data-local map tasks=1
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=14193
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:   File Output Format Counters
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Bytes Written=1106
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:   FileSystemCounters
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     FILE_BYTES_READ=3059
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     HDFS_BYTES_READ=1601
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=108400
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1106
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:   File Input Format Counters
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Bytes Read=1486
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:   Map-Reduce Framework
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Map output materialized
> bytes=3059
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Map input records=6
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Reduce shuffle bytes=3059
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Spilled Records=544
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Map output bytes=2509
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Total committed heap usage
> 
> 
> 
> I am trying to find the shuffling traffic that is total traffic generated when
> mappers exchange their key values pair with the reducer. Is the highlighted
> portion gives the shuffling traffic ?
> 
> 
> Thanks & Regards,
> 
> Abdul Navaz
> Research Assistant
> University of Houston Main Campus, Houston TX
> Ph: 281-685-0388 <tel:281-685-0388>
> 
> 
> 
> 
> On 9/26/14, 12:00 AM, "karthikeyan S" <ka...@gmail.com> wrote:
> 
>> The reducer starts as soon as it has data available from any one of the
>> mappers.
>> The reducer keeps polling the AM and asks if any mapper has completed
>> processing. If so it fetches data from that mapper.
>> So it's not necessary for all the mappers of a task to complete for
>> the reducer to start processing.
>> 
>> When the reducers starts fetching the data from the mappers it prints
>> that info in its syslog, from what I have seen.
>> 
>> Thanks,
>> Karthik
>> 
>> On Thu, Sep 25, 2014 at 8:27 PM, Bing Jiang <ji...@gmail.com> wrote:
>>>  see mapreduce.job.reduce.slowstart.completedmaps
>>>  It gives hint of  when reduce tasks could kick off.
>>> 
>>>  2014-09-26 8:36 GMT+08:00 Abdul Navaz <na...@gmail.com>:
>>>> 
>>>>  Hello,
>>>> 
>>>>  I am having a Hadoop cluster with 1 name node and 3 data nodes. I running
>>>>  sample word count job on 1GB of file which is distributed among the HDFS.
>>>> 
>>>>  When I run the map reduce job, before even completing the mapping 100 %
>>>>  reduce starts.  Say for eg map 40% reduce 10% etc.
>>>> 
>>>>  I would like to know when the shuffling traffic starts ?
>>>> 
>>>>  ->  Is there any way to find out when exactly shuffling started ?  Does it
>>>>  generate any syslog in the logs .
>>>>  -> How to find the total amount of shuffling traffic?
>>>> 
>>>> 
>>>> 
>>>>  Thanks & Regards,
>>>> 
>>>>  Abdul Navaz
>>>>  Research Assistant
>>>>  University of Houston Main Campus, Houston TX
>>>>  Ph: 281-685-0388 <tel:281-685-0388>
>>>> 
>>> 
>>> 
>>> 
>>>  --
>>>  Bing Jiang
>>>  Tel：(86)134-2619-1361
>>>  weibo: http://weibo.com/jiangbinglover
>>>  BLOG: www.binospace.com <http://www.binospace.com>
>>>  BLOG: http://blog.sina.com.cn/jiangbinglover
>>>  Focus on distributed computing, HDFS/HBase
>>

Re: Hadoop shuffling traffic

Posted by Abdul Navaz <na...@gmail.com>.

Hello Pramod,

This is great work !. Thank you for sharing the report.

Thanks & Regards,

Abdul Navaz
Research Assistant
University of Houston Main Campus, Houston TX
Ph: 281-685-0388


From:  Pramod Biligiri <pr...@gmail.com>
Reply-To:  <us...@hadoop.apache.org>
Date:  Thursday, October 2, 2014 at 12:44 AM
To:  "zookeeper-user@hadoop.apache.org" <us...@hadoop.apache.org>
Subject:  Re: Hadoop shuffling traffic

Hi Abdul,
That is the right metric. You can take a look at this report we made on this
earlier: 
http://www.slideshare.net/pramodbiligiri/shuffle-phase-as-the-bottleneck-in-
hadoop-terasort

Pramod

On Wed, Oct 1, 2014 at 6:06 PM, Abdul Navaz <na...@gmail.com> wrote:
> Hello,
> 
> This is the portion of the output which is displayed on the console when I run
> sample word count job.
> 
> map 0% reduce 0%
> 
> 14/10/01 18:37:52 INFO mapred.JobClient:  map 100% reduce 0%
> 
> 14/10/01 18:38:10 INFO mapred.JobClient:  map 100% reduce 100%
> 
> 14/10/01 18:38:12 INFO mapred.JobClient: Job complete: job_201409262002_0003
> 
> 14/10/01 18:38:12 INFO mapred.JobClient: Counters: 29
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:   Job Counters
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Launched reduce tasks=1
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=23511
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Total time spent by all reduces
> waiting after reserving slots (ms)=0
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Total time spent by all maps
> waiting after reserving slots (ms)=0
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Launched map tasks=1
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Data-local map tasks=1
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=14193
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:   File Output Format Counters
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Bytes Written=1106
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:   FileSystemCounters
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     FILE_BYTES_READ=3059
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     HDFS_BYTES_READ=1601
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=108400
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1106
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:   File Input Format Counters
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Bytes Read=1486
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:   Map-Reduce Framework
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Map output materialized
> bytes=3059
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Map input records=6
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Reduce shuffle bytes=3059
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Spilled Records=544
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Map output bytes=2509
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Total committed heap usage
> 
> 
> 
> I am trying to find the shuffling traffic that is total traffic generated when
> mappers exchange their key values pair with the reducer. Is the highlighted
> portion gives the shuffling traffic ?
> 
> 
> Thanks & Regards,
> 
> Abdul Navaz
> Research Assistant
> University of Houston Main Campus, Houston TX
> Ph: 281-685-0388 <tel:281-685-0388>
> 
> 
> 
> 
> On 9/26/14, 12:00 AM, "karthikeyan S" <ka...@gmail.com> wrote:
> 
>> The reducer starts as soon as it has data available from any one of the
>> mappers.
>> The reducer keeps polling the AM and asks if any mapper has completed
>> processing. If so it fetches data from that mapper.
>> So it's not necessary for all the mappers of a task to complete for
>> the reducer to start processing.
>> 
>> When the reducers starts fetching the data from the mappers it prints
>> that info in its syslog, from what I have seen.
>> 
>> Thanks,
>> Karthik
>> 
>> On Thu, Sep 25, 2014 at 8:27 PM, Bing Jiang <ji...@gmail.com> wrote:
>>>  see mapreduce.job.reduce.slowstart.completedmaps
>>>  It gives hint of  when reduce tasks could kick off.
>>> 
>>>  2014-09-26 8:36 GMT+08:00 Abdul Navaz <na...@gmail.com>:
>>>> 
>>>>  Hello,
>>>> 
>>>>  I am having a Hadoop cluster with 1 name node and 3 data nodes. I running
>>>>  sample word count job on 1GB of file which is distributed among the HDFS.
>>>> 
>>>>  When I run the map reduce job, before even completing the mapping 100 %
>>>>  reduce starts.  Say for eg map 40% reduce 10% etc.
>>>> 
>>>>  I would like to know when the shuffling traffic starts ?
>>>> 
>>>>  ->  Is there any way to find out when exactly shuffling started ?  Does it
>>>>  generate any syslog in the logs .
>>>>  -> How to find the total amount of shuffling traffic?
>>>> 
>>>> 
>>>> 
>>>>  Thanks & Regards,
>>>> 
>>>>  Abdul Navaz
>>>>  Research Assistant
>>>>  University of Houston Main Campus, Houston TX
>>>>  Ph: 281-685-0388 <tel:281-685-0388>
>>>> 
>>> 
>>> 
>>> 
>>>  --
>>>  Bing Jiang
>>>  Tel：(86)134-2619-1361
>>>  weibo: http://weibo.com/jiangbinglover
>>>  BLOG: www.binospace.com <http://www.binospace.com>
>>>  BLOG: http://blog.sina.com.cn/jiangbinglover
>>>  Focus on distributed computing, HDFS/HBase
>>

Re: Hadoop shuffling traffic

Posted by Abdul Navaz <na...@gmail.com>.

Hello,

Fiesr of all thank you very much for your help. :)

I still have some doubt with this .

Is the highlighted metric “ Reduce shuffle bytes=3059”

1. Is the total bytes after the reduced phase. ( That is the output file
which is written into HDFS)
Or

2.  Is this is the actual shuffled traffic which is exchanged between
mappers and reducers before performing reducing ?

Please clarify !

Thanks & Regards,

Abdul Navaz



From:  Pramod Biligiri <pr...@gmail.com>
Reply-To:  <us...@hadoop.apache.org>
Date:  Thursday, October 2, 2014 at 12:44 AM
To:  "zookeeper-user@hadoop.apache.org" <us...@hadoop.apache.org>
Subject:  Re: Hadoop shuffling traffic

Hi Abdul,
That is the right metric. You can take a look at this report we made on this
earlier: 
http://www.slideshare.net/pramodbiligiri/shuffle-phase-as-the-bottleneck-in-
hadoop-terasort

Pramod

On Wed, Oct 1, 2014 at 6:06 PM, Abdul Navaz <na...@gmail.com> wrote:
> Hello,
> 
> This is the portion of the output which is displayed on the console when I run
> sample word count job.
> 
> map 0% reduce 0%
> 
> 14/10/01 18:37:52 INFO mapred.JobClient:  map 100% reduce 0%
> 
> 14/10/01 18:38:10 INFO mapred.JobClient:  map 100% reduce 100%
> 
> 14/10/01 18:38:12 INFO mapred.JobClient: Job complete: job_201409262002_0003
> 
> 14/10/01 18:38:12 INFO mapred.JobClient: Counters: 29
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:   Job Counters
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Launched reduce tasks=1
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=23511
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Total time spent by all reduces
> waiting after reserving slots (ms)=0
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Total time spent by all maps
> waiting after reserving slots (ms)=0
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Launched map tasks=1
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Data-local map tasks=1
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=14193
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:   File Output Format Counters
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Bytes Written=1106
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:   FileSystemCounters
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     FILE_BYTES_READ=3059
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     HDFS_BYTES_READ=1601
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=108400
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1106
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:   File Input Format Counters
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Bytes Read=1486
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:   Map-Reduce Framework
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Map output materialized
> bytes=3059
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Map input records=6
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Reduce shuffle bytes=3059
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Spilled Records=544
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Map output bytes=2509
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Total committed heap usage
> 
> 
> 
> I am trying to find the shuffling traffic that is total traffic generated when
> mappers exchange their key values pair with the reducer. Is the highlighted
> portion gives the shuffling traffic ?
> 
> 
> Thanks & Regards,
> 
> Abdul Navaz
> Research Assistant
> University of Houston Main Campus, Houston TX
> Ph: 281-685-0388 <tel:281-685-0388>
> 
> 
> 
> 
> On 9/26/14, 12:00 AM, "karthikeyan S" <ka...@gmail.com> wrote:
> 
>> The reducer starts as soon as it has data available from any one of the
>> mappers.
>> The reducer keeps polling the AM and asks if any mapper has completed
>> processing. If so it fetches data from that mapper.
>> So it's not necessary for all the mappers of a task to complete for
>> the reducer to start processing.
>> 
>> When the reducers starts fetching the data from the mappers it prints
>> that info in its syslog, from what I have seen.
>> 
>> Thanks,
>> Karthik
>> 
>> On Thu, Sep 25, 2014 at 8:27 PM, Bing Jiang <ji...@gmail.com> wrote:
>>>  see mapreduce.job.reduce.slowstart.completedmaps
>>>  It gives hint of  when reduce tasks could kick off.
>>> 
>>>  2014-09-26 8:36 GMT+08:00 Abdul Navaz <na...@gmail.com>:
>>>> 
>>>>  Hello,
>>>> 
>>>>  I am having a Hadoop cluster with 1 name node and 3 data nodes. I running
>>>>  sample word count job on 1GB of file which is distributed among the HDFS.
>>>> 
>>>>  When I run the map reduce job, before even completing the mapping 100 %
>>>>  reduce starts.  Say for eg map 40% reduce 10% etc.
>>>> 
>>>>  I would like to know when the shuffling traffic starts ?
>>>> 
>>>>  ->  Is there any way to find out when exactly shuffling started ?  Does it
>>>>  generate any syslog in the logs .
>>>>  -> How to find the total amount of shuffling traffic?
>>>> 
>>>> 
>>>> 
>>>>  Thanks & Regards,
>>>> 
>>>>  Abdul Navaz
>>>>  Research Assistant
>>>>  University of Houston Main Campus, Houston TX
>>>>  Ph: 281-685-0388 <tel:281-685-0388>
>>>> 
>>> 
>>> 
>>> 
>>>  --
>>>  Bing Jiang
>>>  Tel：(86)134-2619-1361
>>>  weibo: http://weibo.com/jiangbinglover
>>>  BLOG: www.binospace.com <http://www.binospace.com>
>>>  BLOG: http://blog.sina.com.cn/jiangbinglover
>>>  Focus on distributed computing, HDFS/HBase
>>

Re: Hadoop shuffling traffic

Posted by Abdul Navaz <na...@gmail.com>.

Hello,

Fiesr of all thank you very much for your help. :)

I still have some doubt with this .

Is the highlighted metric “ Reduce shuffle bytes=3059”

1. Is the total bytes after the reduced phase. ( That is the output file
which is written into HDFS)
Or

2.  Is this is the actual shuffled traffic which is exchanged between
mappers and reducers before performing reducing ?

Please clarify !

Thanks & Regards,

Abdul Navaz



From:  Pramod Biligiri <pr...@gmail.com>
Reply-To:  <us...@hadoop.apache.org>
Date:  Thursday, October 2, 2014 at 12:44 AM
To:  "zookeeper-user@hadoop.apache.org" <us...@hadoop.apache.org>
Subject:  Re: Hadoop shuffling traffic

Hi Abdul,
That is the right metric. You can take a look at this report we made on this
earlier: 
http://www.slideshare.net/pramodbiligiri/shuffle-phase-as-the-bottleneck-in-
hadoop-terasort

Pramod

On Wed, Oct 1, 2014 at 6:06 PM, Abdul Navaz <na...@gmail.com> wrote:
> Hello,
> 
> This is the portion of the output which is displayed on the console when I run
> sample word count job.
> 
> map 0% reduce 0%
> 
> 14/10/01 18:37:52 INFO mapred.JobClient:  map 100% reduce 0%
> 
> 14/10/01 18:38:10 INFO mapred.JobClient:  map 100% reduce 100%
> 
> 14/10/01 18:38:12 INFO mapred.JobClient: Job complete: job_201409262002_0003
> 
> 14/10/01 18:38:12 INFO mapred.JobClient: Counters: 29
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:   Job Counters
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Launched reduce tasks=1
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=23511
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Total time spent by all reduces
> waiting after reserving slots (ms)=0
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Total time spent by all maps
> waiting after reserving slots (ms)=0
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Launched map tasks=1
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Data-local map tasks=1
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=14193
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:   File Output Format Counters
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Bytes Written=1106
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:   FileSystemCounters
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     FILE_BYTES_READ=3059
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     HDFS_BYTES_READ=1601
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=108400
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1106
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:   File Input Format Counters
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Bytes Read=1486
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:   Map-Reduce Framework
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Map output materialized
> bytes=3059
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Map input records=6
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Reduce shuffle bytes=3059
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Spilled Records=544
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Map output bytes=2509
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Total committed heap usage
> 
> 
> 
> I am trying to find the shuffling traffic that is total traffic generated when
> mappers exchange their key values pair with the reducer. Is the highlighted
> portion gives the shuffling traffic ?
> 
> 
> Thanks & Regards,
> 
> Abdul Navaz
> Research Assistant
> University of Houston Main Campus, Houston TX
> Ph: 281-685-0388 <tel:281-685-0388>
> 
> 
> 
> 
> On 9/26/14, 12:00 AM, "karthikeyan S" <ka...@gmail.com> wrote:
> 
>> The reducer starts as soon as it has data available from any one of the
>> mappers.
>> The reducer keeps polling the AM and asks if any mapper has completed
>> processing. If so it fetches data from that mapper.
>> So it's not necessary for all the mappers of a task to complete for
>> the reducer to start processing.
>> 
>> When the reducers starts fetching the data from the mappers it prints
>> that info in its syslog, from what I have seen.
>> 
>> Thanks,
>> Karthik
>> 
>> On Thu, Sep 25, 2014 at 8:27 PM, Bing Jiang <ji...@gmail.com> wrote:
>>>  see mapreduce.job.reduce.slowstart.completedmaps
>>>  It gives hint of  when reduce tasks could kick off.
>>> 
>>>  2014-09-26 8:36 GMT+08:00 Abdul Navaz <na...@gmail.com>:
>>>> 
>>>>  Hello,
>>>> 
>>>>  I am having a Hadoop cluster with 1 name node and 3 data nodes. I running
>>>>  sample word count job on 1GB of file which is distributed among the HDFS.
>>>> 
>>>>  When I run the map reduce job, before even completing the mapping 100 %
>>>>  reduce starts.  Say for eg map 40% reduce 10% etc.
>>>> 
>>>>  I would like to know when the shuffling traffic starts ?
>>>> 
>>>>  ->  Is there any way to find out when exactly shuffling started ?  Does it
>>>>  generate any syslog in the logs .
>>>>  -> How to find the total amount of shuffling traffic?
>>>> 
>>>> 
>>>> 
>>>>  Thanks & Regards,
>>>> 
>>>>  Abdul Navaz
>>>>  Research Assistant
>>>>  University of Houston Main Campus, Houston TX
>>>>  Ph: 281-685-0388 <tel:281-685-0388>
>>>> 
>>> 
>>> 
>>> 
>>>  --
>>>  Bing Jiang
>>>  Tel：(86)134-2619-1361
>>>  weibo: http://weibo.com/jiangbinglover
>>>  BLOG: www.binospace.com <http://www.binospace.com>
>>>  BLOG: http://blog.sina.com.cn/jiangbinglover
>>>  Focus on distributed computing, HDFS/HBase
>>

Re: Hadoop shuffling traffic

Posted by Abdul Navaz <na...@gmail.com>.

Hello,

Fiesr of all thank you very much for your help. :)

I still have some doubt with this .

Is the highlighted metric “ Reduce shuffle bytes=3059”

1. Is the total bytes after the reduced phase. ( That is the output file
which is written into HDFS)
Or

2.  Is this is the actual shuffled traffic which is exchanged between
mappers and reducers before performing reducing ?

Please clarify !

Thanks & Regards,

Abdul Navaz



From:  Pramod Biligiri <pr...@gmail.com>
Reply-To:  <us...@hadoop.apache.org>
Date:  Thursday, October 2, 2014 at 12:44 AM
To:  "zookeeper-user@hadoop.apache.org" <us...@hadoop.apache.org>
Subject:  Re: Hadoop shuffling traffic

Hi Abdul,
That is the right metric. You can take a look at this report we made on this
earlier: 
http://www.slideshare.net/pramodbiligiri/shuffle-phase-as-the-bottleneck-in-
hadoop-terasort

Pramod

On Wed, Oct 1, 2014 at 6:06 PM, Abdul Navaz <na...@gmail.com> wrote:
> Hello,
> 
> This is the portion of the output which is displayed on the console when I run
> sample word count job.
> 
> map 0% reduce 0%
> 
> 14/10/01 18:37:52 INFO mapred.JobClient:  map 100% reduce 0%
> 
> 14/10/01 18:38:10 INFO mapred.JobClient:  map 100% reduce 100%
> 
> 14/10/01 18:38:12 INFO mapred.JobClient: Job complete: job_201409262002_0003
> 
> 14/10/01 18:38:12 INFO mapred.JobClient: Counters: 29
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:   Job Counters
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Launched reduce tasks=1
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=23511
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Total time spent by all reduces
> waiting after reserving slots (ms)=0
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Total time spent by all maps
> waiting after reserving slots (ms)=0
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Launched map tasks=1
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Data-local map tasks=1
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=14193
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:   File Output Format Counters
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Bytes Written=1106
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:   FileSystemCounters
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     FILE_BYTES_READ=3059
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     HDFS_BYTES_READ=1601
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=108400
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1106
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:   File Input Format Counters
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Bytes Read=1486
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:   Map-Reduce Framework
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Map output materialized
> bytes=3059
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Map input records=6
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Reduce shuffle bytes=3059
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Spilled Records=544
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Map output bytes=2509
> 
> 14/10/01 18:38:12 INFO mapred.JobClient:     Total committed heap usage
> 
> 
> 
> I am trying to find the shuffling traffic that is total traffic generated when
> mappers exchange their key values pair with the reducer. Is the highlighted
> portion gives the shuffling traffic ?
> 
> 
> Thanks & Regards,
> 
> Abdul Navaz
> Research Assistant
> University of Houston Main Campus, Houston TX
> Ph: 281-685-0388 <tel:281-685-0388>
> 
> 
> 
> 
> On 9/26/14, 12:00 AM, "karthikeyan S" <ka...@gmail.com> wrote:
> 
>> The reducer starts as soon as it has data available from any one of the
>> mappers.
>> The reducer keeps polling the AM and asks if any mapper has completed
>> processing. If so it fetches data from that mapper.
>> So it's not necessary for all the mappers of a task to complete for
>> the reducer to start processing.
>> 
>> When the reducers starts fetching the data from the mappers it prints
>> that info in its syslog, from what I have seen.
>> 
>> Thanks,
>> Karthik
>> 
>> On Thu, Sep 25, 2014 at 8:27 PM, Bing Jiang <ji...@gmail.com> wrote:
>>>  see mapreduce.job.reduce.slowstart.completedmaps
>>>  It gives hint of  when reduce tasks could kick off.
>>> 
>>>  2014-09-26 8:36 GMT+08:00 Abdul Navaz <na...@gmail.com>:
>>>> 
>>>>  Hello,
>>>> 
>>>>  I am having a Hadoop cluster with 1 name node and 3 data nodes. I running
>>>>  sample word count job on 1GB of file which is distributed among the HDFS.
>>>> 
>>>>  When I run the map reduce job, before even completing the mapping 100 %
>>>>  reduce starts.  Say for eg map 40% reduce 10% etc.
>>>> 
>>>>  I would like to know when the shuffling traffic starts ?
>>>> 
>>>>  ->  Is there any way to find out when exactly shuffling started ?  Does it
>>>>  generate any syslog in the logs .
>>>>  -> How to find the total amount of shuffling traffic?
>>>> 
>>>> 
>>>> 
>>>>  Thanks & Regards,
>>>> 
>>>>  Abdul Navaz
>>>>  Research Assistant
>>>>  University of Houston Main Campus, Houston TX
>>>>  Ph: 281-685-0388 <tel:281-685-0388>
>>>> 
>>> 
>>> 
>>> 
>>>  --
>>>  Bing Jiang
>>>  Tel：(86)134-2619-1361
>>>  weibo: http://weibo.com/jiangbinglover
>>>  BLOG: www.binospace.com <http://www.binospace.com>
>>>  BLOG: http://blog.sina.com.cn/jiangbinglover
>>>  Focus on distributed computing, HDFS/HBase
>>

Re: Hadoop shuffling traffic

Posted by Pramod Biligiri <pr...@gmail.com>.

Hi Abdul,
That is the right metric. You can take a look at this report we made on
this earlier:
http://www.slideshare.net/pramodbiligiri/shuffle-phase-as-the-bottleneck-in-hadoop-terasort

Pramod

On Wed, Oct 1, 2014 at 6:06 PM, Abdul Navaz <na...@gmail.com> wrote:

> Hello,
>
> This is the portion of the output which is displayed on the console when I
> run sample word count job.
>
> map 0% reduce 0%
>
> 14/10/01 18:37:52 INFO mapred.JobClient:  map 100% reduce 0%
>
> 14/10/01 18:38:10 INFO mapred.JobClient:  map 100% reduce 100%
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Job complete:
> job_201409262002_0003
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Counters: 29
>
> 14/10/01 18:38:12 INFO mapred.JobClient:   Job Counters
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     Launched reduce tasks=1
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=23511
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     Total time spent by all
> reduces waiting after reserving slots (ms)=0
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     Total time spent by all maps
> waiting after reserving slots (ms)=0
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     Launched map tasks=1
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     Data-local map tasks=1
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=14193
>
> 14/10/01 18:38:12 INFO mapred.JobClient:   File Output Format Counters
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     Bytes Written=1106
>
> 14/10/01 18:38:12 INFO mapred.JobClient:   FileSystemCounters
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     FILE_BYTES_READ=3059
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     HDFS_BYTES_READ=1601
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=108400
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1106
>
> 14/10/01 18:38:12 INFO mapred.JobClient:   File Input Format Counters
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     Bytes Read=1486
>
> 14/10/01 18:38:12 INFO mapred.JobClient:   Map-Reduce Framework
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     Map output materialized
> bytes=3059
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     Map input records=6
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     *Reduce shuffle bytes=3059*
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     Spilled Records=544
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     Map output bytes=2509
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     Total committed heap usage
>
>
> I am trying to find the shuffling traffic that is total traffic generated
> when mappers exchange their key values pair with the reducer. Is the
> highlighted portion gives the shuffling traffic ?
>
>
> Thanks & Regards,
>
> Abdul Navaz
> Research Assistant
> University of Houston Main Campus, Houston TX
> Ph: 281-685-0388
>
>
>
>
> On 9/26/14, 12:00 AM, "karthikeyan S" <ka...@gmail.com> wrote:
>
> The reducer starts as soon as it has data available from any one of the
> mappers.
> The reducer keeps polling the AM and asks if any mapper has completed
> processing. If so it fetches data from that mapper.
> So it's not necessary for all the mappers of a task to complete for
> the reducer to start processing.
>
> When the reducers starts fetching the data from the mappers it prints
> that info in its syslog, from what I have seen.
>
> Thanks,
> Karthik
>
> On Thu, Sep 25, 2014 at 8:27 PM, Bing Jiang <ji...@gmail.com>
> wrote:
>
> see mapreduce.job.reduce.slowstart.completedmaps
> It gives hint of  when reduce tasks could kick off.
>
> 2014-09-26 8:36 GMT+08:00 Abdul Navaz <na...@gmail.com>:
>
>
> Hello,
>
> I am having a Hadoop cluster with 1 name node and 3 data nodes. I running
> sample word count job on 1GB of file which is distributed among the HDFS.
>
> When I run the map reduce job, before even completing the mapping 100 %
> reduce starts.  Say for eg map 40% reduce 10% etc.
>
> I would like to know when the shuffling traffic starts ?
>
> ->  Is there any way to find out when exactly shuffling started ?  Does it
> generate any syslog in the logs .
> -> How to find the total amount of shuffling traffic?
>
>
>
> Thanks & Regards,
>
> Abdul Navaz
> Research Assistant
> University of Houston Main Campus, Houston TX
> Ph: 281-685-0388
>
>
>
>
> --
> Bing Jiang
> Tel：(86)134-2619-1361
> weibo: http://weibo.com/jiangbinglover
> BLOG: www.binospace.com
> BLOG: http://blog.sina.com.cn/jiangbinglover
> Focus on distributed computing, HDFS/HBase
>
>
>

Re: Hadoop shuffling traffic

Posted by Pramod Biligiri <pr...@gmail.com>.

Hi Abdul,
That is the right metric. You can take a look at this report we made on
this earlier:
http://www.slideshare.net/pramodbiligiri/shuffle-phase-as-the-bottleneck-in-hadoop-terasort

Pramod

On Wed, Oct 1, 2014 at 6:06 PM, Abdul Navaz <na...@gmail.com> wrote:

> Hello,
>
> This is the portion of the output which is displayed on the console when I
> run sample word count job.
>
> map 0% reduce 0%
>
> 14/10/01 18:37:52 INFO mapred.JobClient:  map 100% reduce 0%
>
> 14/10/01 18:38:10 INFO mapred.JobClient:  map 100% reduce 100%
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Job complete:
> job_201409262002_0003
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Counters: 29
>
> 14/10/01 18:38:12 INFO mapred.JobClient:   Job Counters
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     Launched reduce tasks=1
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=23511
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     Total time spent by all
> reduces waiting after reserving slots (ms)=0
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     Total time spent by all maps
> waiting after reserving slots (ms)=0
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     Launched map tasks=1
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     Data-local map tasks=1
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=14193
>
> 14/10/01 18:38:12 INFO mapred.JobClient:   File Output Format Counters
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     Bytes Written=1106
>
> 14/10/01 18:38:12 INFO mapred.JobClient:   FileSystemCounters
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     FILE_BYTES_READ=3059
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     HDFS_BYTES_READ=1601
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=108400
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1106
>
> 14/10/01 18:38:12 INFO mapred.JobClient:   File Input Format Counters
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     Bytes Read=1486
>
> 14/10/01 18:38:12 INFO mapred.JobClient:   Map-Reduce Framework
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     Map output materialized
> bytes=3059
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     Map input records=6
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     *Reduce shuffle bytes=3059*
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     Spilled Records=544
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     Map output bytes=2509
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     Total committed heap usage
>
>
> I am trying to find the shuffling traffic that is total traffic generated
> when mappers exchange their key values pair with the reducer. Is the
> highlighted portion gives the shuffling traffic ?
>
>
> Thanks & Regards,
>
> Abdul Navaz
> Research Assistant
> University of Houston Main Campus, Houston TX
> Ph: 281-685-0388
>
>
>
>
> On 9/26/14, 12:00 AM, "karthikeyan S" <ka...@gmail.com> wrote:
>
> The reducer starts as soon as it has data available from any one of the
> mappers.
> The reducer keeps polling the AM and asks if any mapper has completed
> processing. If so it fetches data from that mapper.
> So it's not necessary for all the mappers of a task to complete for
> the reducer to start processing.
>
> When the reducers starts fetching the data from the mappers it prints
> that info in its syslog, from what I have seen.
>
> Thanks,
> Karthik
>
> On Thu, Sep 25, 2014 at 8:27 PM, Bing Jiang <ji...@gmail.com>
> wrote:
>
> see mapreduce.job.reduce.slowstart.completedmaps
> It gives hint of  when reduce tasks could kick off.
>
> 2014-09-26 8:36 GMT+08:00 Abdul Navaz <na...@gmail.com>:
>
>
> Hello,
>
> I am having a Hadoop cluster with 1 name node and 3 data nodes. I running
> sample word count job on 1GB of file which is distributed among the HDFS.
>
> When I run the map reduce job, before even completing the mapping 100 %
> reduce starts.  Say for eg map 40% reduce 10% etc.
>
> I would like to know when the shuffling traffic starts ?
>
> ->  Is there any way to find out when exactly shuffling started ?  Does it
> generate any syslog in the logs .
> -> How to find the total amount of shuffling traffic?
>
>
>
> Thanks & Regards,
>
> Abdul Navaz
> Research Assistant
> University of Houston Main Campus, Houston TX
> Ph: 281-685-0388
>
>
>
>
> --
> Bing Jiang
> Tel：(86)134-2619-1361
> weibo: http://weibo.com/jiangbinglover
> BLOG: www.binospace.com
> BLOG: http://blog.sina.com.cn/jiangbinglover
> Focus on distributed computing, HDFS/HBase
>
>
>

Re: Hadoop shuffling traffic

Posted by Pramod Biligiri <pr...@gmail.com>.

Hi Abdul,
That is the right metric. You can take a look at this report we made on
this earlier:
http://www.slideshare.net/pramodbiligiri/shuffle-phase-as-the-bottleneck-in-hadoop-terasort

Pramod

On Wed, Oct 1, 2014 at 6:06 PM, Abdul Navaz <na...@gmail.com> wrote:

> Hello,
>
> This is the portion of the output which is displayed on the console when I
> run sample word count job.
>
> map 0% reduce 0%
>
> 14/10/01 18:37:52 INFO mapred.JobClient:  map 100% reduce 0%
>
> 14/10/01 18:38:10 INFO mapred.JobClient:  map 100% reduce 100%
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Job complete:
> job_201409262002_0003
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Counters: 29
>
> 14/10/01 18:38:12 INFO mapred.JobClient:   Job Counters
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     Launched reduce tasks=1
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=23511
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     Total time spent by all
> reduces waiting after reserving slots (ms)=0
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     Total time spent by all maps
> waiting after reserving slots (ms)=0
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     Launched map tasks=1
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     Data-local map tasks=1
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=14193
>
> 14/10/01 18:38:12 INFO mapred.JobClient:   File Output Format Counters
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     Bytes Written=1106
>
> 14/10/01 18:38:12 INFO mapred.JobClient:   FileSystemCounters
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     FILE_BYTES_READ=3059
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     HDFS_BYTES_READ=1601
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=108400
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1106
>
> 14/10/01 18:38:12 INFO mapred.JobClient:   File Input Format Counters
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     Bytes Read=1486
>
> 14/10/01 18:38:12 INFO mapred.JobClient:   Map-Reduce Framework
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     Map output materialized
> bytes=3059
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     Map input records=6
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     *Reduce shuffle bytes=3059*
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     Spilled Records=544
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     Map output bytes=2509
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     Total committed heap usage
>
>
> I am trying to find the shuffling traffic that is total traffic generated
> when mappers exchange their key values pair with the reducer. Is the
> highlighted portion gives the shuffling traffic ?
>
>
> Thanks & Regards,
>
> Abdul Navaz
> Research Assistant
> University of Houston Main Campus, Houston TX
> Ph: 281-685-0388
>
>
>
>
> On 9/26/14, 12:00 AM, "karthikeyan S" <ka...@gmail.com> wrote:
>
> The reducer starts as soon as it has data available from any one of the
> mappers.
> The reducer keeps polling the AM and asks if any mapper has completed
> processing. If so it fetches data from that mapper.
> So it's not necessary for all the mappers of a task to complete for
> the reducer to start processing.
>
> When the reducers starts fetching the data from the mappers it prints
> that info in its syslog, from what I have seen.
>
> Thanks,
> Karthik
>
> On Thu, Sep 25, 2014 at 8:27 PM, Bing Jiang <ji...@gmail.com>
> wrote:
>
> see mapreduce.job.reduce.slowstart.completedmaps
> It gives hint of  when reduce tasks could kick off.
>
> 2014-09-26 8:36 GMT+08:00 Abdul Navaz <na...@gmail.com>:
>
>
> Hello,
>
> I am having a Hadoop cluster with 1 name node and 3 data nodes. I running
> sample word count job on 1GB of file which is distributed among the HDFS.
>
> When I run the map reduce job, before even completing the mapping 100 %
> reduce starts.  Say for eg map 40% reduce 10% etc.
>
> I would like to know when the shuffling traffic starts ?
>
> ->  Is there any way to find out when exactly shuffling started ?  Does it
> generate any syslog in the logs .
> -> How to find the total amount of shuffling traffic?
>
>
>
> Thanks & Regards,
>
> Abdul Navaz
> Research Assistant
> University of Houston Main Campus, Houston TX
> Ph: 281-685-0388
>
>
>
>
> --
> Bing Jiang
> Tel：(86)134-2619-1361
> weibo: http://weibo.com/jiangbinglover
> BLOG: www.binospace.com
> BLOG: http://blog.sina.com.cn/jiangbinglover
> Focus on distributed computing, HDFS/HBase
>
>
>

Re: Hadoop shuffling traffic

Posted by Pramod Biligiri <pr...@gmail.com>.

Hi Abdul,
That is the right metric. You can take a look at this report we made on
this earlier:
http://www.slideshare.net/pramodbiligiri/shuffle-phase-as-the-bottleneck-in-hadoop-terasort

Pramod

On Wed, Oct 1, 2014 at 6:06 PM, Abdul Navaz <na...@gmail.com> wrote:

> Hello,
>
> This is the portion of the output which is displayed on the console when I
> run sample word count job.
>
> map 0% reduce 0%
>
> 14/10/01 18:37:52 INFO mapred.JobClient:  map 100% reduce 0%
>
> 14/10/01 18:38:10 INFO mapred.JobClient:  map 100% reduce 100%
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Job complete:
> job_201409262002_0003
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Counters: 29
>
> 14/10/01 18:38:12 INFO mapred.JobClient:   Job Counters
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     Launched reduce tasks=1
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=23511
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     Total time spent by all
> reduces waiting after reserving slots (ms)=0
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     Total time spent by all maps
> waiting after reserving slots (ms)=0
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     Launched map tasks=1
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     Data-local map tasks=1
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=14193
>
> 14/10/01 18:38:12 INFO mapred.JobClient:   File Output Format Counters
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     Bytes Written=1106
>
> 14/10/01 18:38:12 INFO mapred.JobClient:   FileSystemCounters
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     FILE_BYTES_READ=3059
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     HDFS_BYTES_READ=1601
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=108400
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1106
>
> 14/10/01 18:38:12 INFO mapred.JobClient:   File Input Format Counters
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     Bytes Read=1486
>
> 14/10/01 18:38:12 INFO mapred.JobClient:   Map-Reduce Framework
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     Map output materialized
> bytes=3059
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     Map input records=6
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     *Reduce shuffle bytes=3059*
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     Spilled Records=544
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     Map output bytes=2509
>
> 14/10/01 18:38:12 INFO mapred.JobClient:     Total committed heap usage
>
>
> I am trying to find the shuffling traffic that is total traffic generated
> when mappers exchange their key values pair with the reducer. Is the
> highlighted portion gives the shuffling traffic ?
>
>
> Thanks & Regards,
>
> Abdul Navaz
> Research Assistant
> University of Houston Main Campus, Houston TX
> Ph: 281-685-0388
>
>
>
>
> On 9/26/14, 12:00 AM, "karthikeyan S" <ka...@gmail.com> wrote:
>
> The reducer starts as soon as it has data available from any one of the
> mappers.
> The reducer keeps polling the AM and asks if any mapper has completed
> processing. If so it fetches data from that mapper.
> So it's not necessary for all the mappers of a task to complete for
> the reducer to start processing.
>
> When the reducers starts fetching the data from the mappers it prints
> that info in its syslog, from what I have seen.
>
> Thanks,
> Karthik
>
> On Thu, Sep 25, 2014 at 8:27 PM, Bing Jiang <ji...@gmail.com>
> wrote:
>
> see mapreduce.job.reduce.slowstart.completedmaps
> It gives hint of  when reduce tasks could kick off.
>
> 2014-09-26 8:36 GMT+08:00 Abdul Navaz <na...@gmail.com>:
>
>
> Hello,
>
> I am having a Hadoop cluster with 1 name node and 3 data nodes. I running
> sample word count job on 1GB of file which is distributed among the HDFS.
>
> When I run the map reduce job, before even completing the mapping 100 %
> reduce starts.  Say for eg map 40% reduce 10% etc.
>
> I would like to know when the shuffling traffic starts ?
>
> ->  Is there any way to find out when exactly shuffling started ?  Does it
> generate any syslog in the logs .
> -> How to find the total amount of shuffling traffic?
>
>
>
> Thanks & Regards,
>
> Abdul Navaz
> Research Assistant
> University of Houston Main Campus, Houston TX
> Ph: 281-685-0388
>
>
>
>
> --
> Bing Jiang
> Tel：(86)134-2619-1361
> weibo: http://weibo.com/jiangbinglover
> BLOG: www.binospace.com
> BLOG: http://blog.sina.com.cn/jiangbinglover
> Focus on distributed computing, HDFS/HBase
>
>
>