You are viewing a plain text version of this content. The canonical link for it is here.
Posted to common-user@hadoop.apache.org by Abdul Navaz <na...@gmail.com> on 2014/10/02 03:06:30 UTC
Re: Hadoop shuffling traffic
Hello,
This is the portion of the output which is displayed on the console when I
run sample word count job.
map 0% reduce 0%
14/10/01 18:37:52 INFO mapred.JobClient: map 100% reduce 0%
14/10/01 18:38:10 INFO mapred.JobClient: map 100% reduce 100%
14/10/01 18:38:12 INFO mapred.JobClient: Job complete: job_201409262002_0003
14/10/01 18:38:12 INFO mapred.JobClient: Counters: 29
14/10/01 18:38:12 INFO mapred.JobClient: Job Counters
14/10/01 18:38:12 INFO mapred.JobClient: Launched reduce tasks=1
14/10/01 18:38:12 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=23511
14/10/01 18:38:12 INFO mapred.JobClient: Total time spent by all reduces
waiting after reserving slots (ms)=0
14/10/01 18:38:12 INFO mapred.JobClient: Total time spent by all maps
waiting after reserving slots (ms)=0
14/10/01 18:38:12 INFO mapred.JobClient: Launched map tasks=1
14/10/01 18:38:12 INFO mapred.JobClient: Data-local map tasks=1
14/10/01 18:38:12 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=14193
14/10/01 18:38:12 INFO mapred.JobClient: File Output Format Counters
14/10/01 18:38:12 INFO mapred.JobClient: Bytes Written=1106
14/10/01 18:38:12 INFO mapred.JobClient: FileSystemCounters
14/10/01 18:38:12 INFO mapred.JobClient: FILE_BYTES_READ=3059
14/10/01 18:38:12 INFO mapred.JobClient: HDFS_BYTES_READ=1601
14/10/01 18:38:12 INFO mapred.JobClient: FILE_BYTES_WRITTEN=108400
14/10/01 18:38:12 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=1106
14/10/01 18:38:12 INFO mapred.JobClient: File Input Format Counters
14/10/01 18:38:12 INFO mapred.JobClient: Bytes Read=1486
14/10/01 18:38:12 INFO mapred.JobClient: Map-Reduce Framework
14/10/01 18:38:12 INFO mapred.JobClient: Map output materialized
bytes=3059
14/10/01 18:38:12 INFO mapred.JobClient: Map input records=6
14/10/01 18:38:12 INFO mapred.JobClient: Reduce shuffle bytes=3059
14/10/01 18:38:12 INFO mapred.JobClient: Spilled Records=544
14/10/01 18:38:12 INFO mapred.JobClient: Map output bytes=2509
14/10/01 18:38:12 INFO mapred.JobClient: Total committed heap usage
I am trying to find the shuffling traffic that is total traffic generated
when mappers exchange their key values pair with the reducer. Is the
highlighted portion gives the shuffling traffic ?
Thanks & Regards,
Abdul Navaz
Research Assistant
University of Houston Main Campus, Houston TX
Ph: 281-685-0388
On 9/26/14, 12:00 AM, "karthikeyan S" <ka...@gmail.com> wrote:
> The reducer starts as soon as it has data available from any one of the
> mappers.
> The reducer keeps polling the AM and asks if any mapper has completed
> processing. If so it fetches data from that mapper.
> So it's not necessary for all the mappers of a task to complete for
> the reducer to start processing.
>
> When the reducers starts fetching the data from the mappers it prints
> that info in its syslog, from what I have seen.
>
> Thanks,
> Karthik
>
> On Thu, Sep 25, 2014 at 8:27 PM, Bing Jiang <ji...@gmail.com> wrote:
>> see mapreduce.job.reduce.slowstart.completedmaps
>> It gives hint of when reduce tasks could kick off.
>>
>> 2014-09-26 8:36 GMT+08:00 Abdul Navaz <na...@gmail.com>:
>>>
>>> Hello,
>>>
>>> I am having a Hadoop cluster with 1 name node and 3 data nodes. I running
>>> sample word count job on 1GB of file which is distributed among the HDFS.
>>>
>>> When I run the map reduce job, before even completing the mapping 100 %
>>> reduce starts. Say for eg map 40% reduce 10% etc.
>>>
>>> I would like to know when the shuffling traffic starts ?
>>>
>>> -> Is there any way to find out when exactly shuffling started ? Does it
>>> generate any syslog in the logs .
>>> -> How to find the total amount of shuffling traffic?
>>>
>>>
>>>
>>> Thanks & Regards,
>>>
>>> Abdul Navaz
>>> Research Assistant
>>> University of Houston Main Campus, Houston TX
>>> Ph: 281-685-0388
>>>
>>
>>
>>
>> --
>> Bing Jiang
>> Tel:(86)134-2619-1361
>> weibo: http://weibo.com/jiangbinglover
>> BLOG: www.binospace.com
>> BLOG: http://blog.sina.com.cn/jiangbinglover
>> Focus on distributed computing, HDFS/HBase
>
Re: Hadoop shuffling traffic
Posted by Abdul Navaz <na...@gmail.com>.
Hello Pramod,
This is great work !. Thank you for sharing the report.
Thanks & Regards,
Abdul Navaz
Research Assistant
University of Houston Main Campus, Houston TX
Ph: 281-685-0388
From: Pramod Biligiri <pr...@gmail.com>
Reply-To: <us...@hadoop.apache.org>
Date: Thursday, October 2, 2014 at 12:44 AM
To: "zookeeper-user@hadoop.apache.org" <us...@hadoop.apache.org>
Subject: Re: Hadoop shuffling traffic
Hi Abdul,
That is the right metric. You can take a look at this report we made on this
earlier:
http://www.slideshare.net/pramodbiligiri/shuffle-phase-as-the-bottleneck-in-
hadoop-terasort
Pramod
On Wed, Oct 1, 2014 at 6:06 PM, Abdul Navaz <na...@gmail.com> wrote:
> Hello,
>
> This is the portion of the output which is displayed on the console when I run
> sample word count job.
>
> map 0% reduce 0%
>
> 14/10/01 18:37:52 INFO mapred.JobClient: map 100% reduce 0%
>
> 14/10/01 18:38:10 INFO mapred.JobClient: map 100% reduce 100%
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Job complete: job_201409262002_0003
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Counters: 29
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Job Counters
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Launched reduce tasks=1
>
> 14/10/01 18:38:12 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=23511
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Total time spent by all reduces
> waiting after reserving slots (ms)=0
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Total time spent by all maps
> waiting after reserving slots (ms)=0
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Launched map tasks=1
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Data-local map tasks=1
>
> 14/10/01 18:38:12 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=14193
>
> 14/10/01 18:38:12 INFO mapred.JobClient: File Output Format Counters
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Bytes Written=1106
>
> 14/10/01 18:38:12 INFO mapred.JobClient: FileSystemCounters
>
> 14/10/01 18:38:12 INFO mapred.JobClient: FILE_BYTES_READ=3059
>
> 14/10/01 18:38:12 INFO mapred.JobClient: HDFS_BYTES_READ=1601
>
> 14/10/01 18:38:12 INFO mapred.JobClient: FILE_BYTES_WRITTEN=108400
>
> 14/10/01 18:38:12 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=1106
>
> 14/10/01 18:38:12 INFO mapred.JobClient: File Input Format Counters
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Bytes Read=1486
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Map-Reduce Framework
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Map output materialized
> bytes=3059
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Map input records=6
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Reduce shuffle bytes=3059
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Spilled Records=544
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Map output bytes=2509
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Total committed heap usage
>
>
>
> I am trying to find the shuffling traffic that is total traffic generated when
> mappers exchange their key values pair with the reducer. Is the highlighted
> portion gives the shuffling traffic ?
>
>
> Thanks & Regards,
>
> Abdul Navaz
> Research Assistant
> University of Houston Main Campus, Houston TX
> Ph: 281-685-0388 <tel:281-685-0388>
>
>
>
>
> On 9/26/14, 12:00 AM, "karthikeyan S" <ka...@gmail.com> wrote:
>
>> The reducer starts as soon as it has data available from any one of the
>> mappers.
>> The reducer keeps polling the AM and asks if any mapper has completed
>> processing. If so it fetches data from that mapper.
>> So it's not necessary for all the mappers of a task to complete for
>> the reducer to start processing.
>>
>> When the reducers starts fetching the data from the mappers it prints
>> that info in its syslog, from what I have seen.
>>
>> Thanks,
>> Karthik
>>
>> On Thu, Sep 25, 2014 at 8:27 PM, Bing Jiang <ji...@gmail.com> wrote:
>>> see mapreduce.job.reduce.slowstart.completedmaps
>>> It gives hint of when reduce tasks could kick off.
>>>
>>> 2014-09-26 8:36 GMT+08:00 Abdul Navaz <na...@gmail.com>:
>>>>
>>>> Hello,
>>>>
>>>> I am having a Hadoop cluster with 1 name node and 3 data nodes. I running
>>>> sample word count job on 1GB of file which is distributed among the HDFS.
>>>>
>>>> When I run the map reduce job, before even completing the mapping 100 %
>>>> reduce starts. Say for eg map 40% reduce 10% etc.
>>>>
>>>> I would like to know when the shuffling traffic starts ?
>>>>
>>>> -> Is there any way to find out when exactly shuffling started ? Does it
>>>> generate any syslog in the logs .
>>>> -> How to find the total amount of shuffling traffic?
>>>>
>>>>
>>>>
>>>> Thanks & Regards,
>>>>
>>>> Abdul Navaz
>>>> Research Assistant
>>>> University of Houston Main Campus, Houston TX
>>>> Ph: 281-685-0388 <tel:281-685-0388>
>>>>
>>>
>>>
>>>
>>> --
>>> Bing Jiang
>>> Tel:(86)134-2619-1361
>>> weibo: http://weibo.com/jiangbinglover
>>> BLOG: www.binospace.com <http://www.binospace.com>
>>> BLOG: http://blog.sina.com.cn/jiangbinglover
>>> Focus on distributed computing, HDFS/HBase
>>
Re: Hadoop shuffling traffic
Posted by Pramod Biligiri <pr...@gmail.com>.
I think it refers to the no. of bytes the reducer fetches from the mapper.
Pramod
On Wed, Oct 8, 2014 at 10:17 PM, Abdul Navaz <na...@gmail.com> wrote:
> Hello,
>
> Fiesr of all thank you very much for your help. :)
>
> I still have some doubt with this .
>
> Is the highlighted metric “ *Reduce shuffle bytes=3059” *
>
>
> 1. Is the total bytes after the reduced phase. ( That is the output
> file which is written into HDFS)
>
> Or
>
> 2. Is this is the actual shuffled traffic which is exchanged between
> mappers and reducers before performing reducing ?
>
> Please clarify !
>
> Thanks & Regards,
>
> Abdul Navaz
>
>
>
> From: Pramod Biligiri <pr...@gmail.com>
> Reply-To: <us...@hadoop.apache.org>
> Date: Thursday, October 2, 2014 at 12:44 AM
> To: "zookeeper-user@hadoop.apache.org" <us...@hadoop.apache.org>
> Subject: Re: Hadoop shuffling traffic
>
> Hi Abdul,
> That is the right metric. You can take a look at this report we made on
> this earlier:
> http://www.slideshare.net/pramodbiligiri/shuffle-phase-as-the-bottleneck-in-hadoop-terasort
>
> Pramod
>
> On Wed, Oct 1, 2014 at 6:06 PM, Abdul Navaz <na...@gmail.com> wrote:
>
>> Hello,
>>
>> This is the portion of the output which is displayed on the console when
>> I run sample word count job.
>>
>> map 0% reduce 0%
>>
>> 14/10/01 18:37:52 INFO mapred.JobClient: map 100% reduce 0%
>>
>> 14/10/01 18:38:10 INFO mapred.JobClient: map 100% reduce 100%
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient: Job complete:
>> job_201409262002_0003
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient: Counters: 29
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient: Job Counters
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient: Launched reduce tasks=1
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=23511
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient: Total time spent by all
>> reduces waiting after reserving slots (ms)=0
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient: Total time spent by all maps
>> waiting after reserving slots (ms)=0
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient: Launched map tasks=1
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient: Data-local map tasks=1
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=14193
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient: File Output Format Counters
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient: Bytes Written=1106
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient: FileSystemCounters
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient: FILE_BYTES_READ=3059
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient: HDFS_BYTES_READ=1601
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient: FILE_BYTES_WRITTEN=108400
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=1106
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient: File Input Format Counters
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient: Bytes Read=1486
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient: Map-Reduce Framework
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient: Map output materialized
>> bytes=3059
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient: Map input records=6
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient: *Reduce shuffle bytes=3059*
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient: Spilled Records=544
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient: Map output bytes=2509
>>
>> 14/10/01 18:38:12 INFO mapred.JobClient: Total committed heap usage
>>
>>
>> I am trying to find the shuffling traffic that is total traffic
>> generated when mappers exchange their key values pair with the reducer. Is
>> the highlighted portion gives the shuffling traffic ?
>>
>>
>> Thanks & Regards,
>>
>> Abdul Navaz
>> Research Assistant
>> University of Houston Main Campus, Houston TX
>> Ph: 281-685-0388
>>
>>
>>
>>
>> On 9/26/14, 12:00 AM, "karthikeyan S" <ka...@gmail.com> wrote:
>>
>> The reducer starts as soon as it has data available from any one of the
>> mappers.
>> The reducer keeps polling the AM and asks if any mapper has completed
>> processing. If so it fetches data from that mapper.
>> So it's not necessary for all the mappers of a task to complete for
>> the reducer to start processing.
>>
>> When the reducers starts fetching the data from the mappers it prints
>> that info in its syslog, from what I have seen.
>>
>> Thanks,
>> Karthik
>>
>> On Thu, Sep 25, 2014 at 8:27 PM, Bing Jiang <ji...@gmail.com>
>> wrote:
>>
>> see mapreduce.job.reduce.slowstart.completedmaps
>> It gives hint of when reduce tasks could kick off.
>>
>> 2014-09-26 8:36 GMT+08:00 Abdul Navaz <na...@gmail.com>:
>>
>>
>> Hello,
>>
>> I am having a Hadoop cluster with 1 name node and 3 data nodes. I running
>> sample word count job on 1GB of file which is distributed among the HDFS.
>>
>> When I run the map reduce job, before even completing the mapping 100 %
>> reduce starts. Say for eg map 40% reduce 10% etc.
>>
>> I would like to know when the shuffling traffic starts ?
>>
>> -> Is there any way to find out when exactly shuffling started ? Does it
>> generate any syslog in the logs .
>> -> How to find the total amount of shuffling traffic?
>>
>>
>>
>> Thanks & Regards,
>>
>> Abdul Navaz
>> Research Assistant
>> University of Houston Main Campus, Houston TX
>> Ph: 281-685-0388
>>
>>
>>
>>
>> --
>> Bing Jiang
>> Tel:(86)134-2619-1361
>> weibo: http://weibo.com/jiangbinglover
>> BLOG: www.binospace.com
>> BLOG: http://blog.sina.com.cn/jiangbinglover
>> Focus on distributed computing, HDFS/HBase
>>
>>
>>
>
Re: Hadoop shuffling traffic
Posted by Abdul Navaz <na...@gmail.com>.
Hello,
Fiesr of all thank you very much for your help. :)
I still have some doubt with this .
Is the highlighted metric “ Reduce shuffle bytes=3059”
1. Is the total bytes after the reduced phase. ( That is the output file
which is written into HDFS)
Or
2. Is this is the actual shuffled traffic which is exchanged between
mappers and reducers before performing reducing ?
Please clarify !
Thanks & Regards,
Abdul Navaz
From: Pramod Biligiri <pr...@gmail.com>
Reply-To: <us...@hadoop.apache.org>
Date: Thursday, October 2, 2014 at 12:44 AM
To: "zookeeper-user@hadoop.apache.org" <us...@hadoop.apache.org>
Subject: Re: Hadoop shuffling traffic
Hi Abdul,
That is the right metric. You can take a look at this report we made on this
earlier:
http://www.slideshare.net/pramodbiligiri/shuffle-phase-as-the-bottleneck-in-
hadoop-terasort
Pramod
On Wed, Oct 1, 2014 at 6:06 PM, Abdul Navaz <na...@gmail.com> wrote:
> Hello,
>
> This is the portion of the output which is displayed on the console when I run
> sample word count job.
>
> map 0% reduce 0%
>
> 14/10/01 18:37:52 INFO mapred.JobClient: map 100% reduce 0%
>
> 14/10/01 18:38:10 INFO mapred.JobClient: map 100% reduce 100%
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Job complete: job_201409262002_0003
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Counters: 29
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Job Counters
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Launched reduce tasks=1
>
> 14/10/01 18:38:12 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=23511
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Total time spent by all reduces
> waiting after reserving slots (ms)=0
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Total time spent by all maps
> waiting after reserving slots (ms)=0
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Launched map tasks=1
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Data-local map tasks=1
>
> 14/10/01 18:38:12 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=14193
>
> 14/10/01 18:38:12 INFO mapred.JobClient: File Output Format Counters
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Bytes Written=1106
>
> 14/10/01 18:38:12 INFO mapred.JobClient: FileSystemCounters
>
> 14/10/01 18:38:12 INFO mapred.JobClient: FILE_BYTES_READ=3059
>
> 14/10/01 18:38:12 INFO mapred.JobClient: HDFS_BYTES_READ=1601
>
> 14/10/01 18:38:12 INFO mapred.JobClient: FILE_BYTES_WRITTEN=108400
>
> 14/10/01 18:38:12 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=1106
>
> 14/10/01 18:38:12 INFO mapred.JobClient: File Input Format Counters
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Bytes Read=1486
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Map-Reduce Framework
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Map output materialized
> bytes=3059
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Map input records=6
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Reduce shuffle bytes=3059
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Spilled Records=544
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Map output bytes=2509
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Total committed heap usage
>
>
>
> I am trying to find the shuffling traffic that is total traffic generated when
> mappers exchange their key values pair with the reducer. Is the highlighted
> portion gives the shuffling traffic ?
>
>
> Thanks & Regards,
>
> Abdul Navaz
> Research Assistant
> University of Houston Main Campus, Houston TX
> Ph: 281-685-0388 <tel:281-685-0388>
>
>
>
>
> On 9/26/14, 12:00 AM, "karthikeyan S" <ka...@gmail.com> wrote:
>
>> The reducer starts as soon as it has data available from any one of the
>> mappers.
>> The reducer keeps polling the AM and asks if any mapper has completed
>> processing. If so it fetches data from that mapper.
>> So it's not necessary for all the mappers of a task to complete for
>> the reducer to start processing.
>>
>> When the reducers starts fetching the data from the mappers it prints
>> that info in its syslog, from what I have seen.
>>
>> Thanks,
>> Karthik
>>
>> On Thu, Sep 25, 2014 at 8:27 PM, Bing Jiang <ji...@gmail.com> wrote:
>>> see mapreduce.job.reduce.slowstart.completedmaps
>>> It gives hint of when reduce tasks could kick off.
>>>
>>> 2014-09-26 8:36 GMT+08:00 Abdul Navaz <na...@gmail.com>:
>>>>
>>>> Hello,
>>>>
>>>> I am having a Hadoop cluster with 1 name node and 3 data nodes. I running
>>>> sample word count job on 1GB of file which is distributed among the HDFS.
>>>>
>>>> When I run the map reduce job, before even completing the mapping 100 %
>>>> reduce starts. Say for eg map 40% reduce 10% etc.
>>>>
>>>> I would like to know when the shuffling traffic starts ?
>>>>
>>>> -> Is there any way to find out when exactly shuffling started ? Does it
>>>> generate any syslog in the logs .
>>>> -> How to find the total amount of shuffling traffic?
>>>>
>>>>
>>>>
>>>> Thanks & Regards,
>>>>
>>>> Abdul Navaz
>>>> Research Assistant
>>>> University of Houston Main Campus, Houston TX
>>>> Ph: 281-685-0388 <tel:281-685-0388>
>>>>
>>>
>>>
>>>
>>> --
>>> Bing Jiang
>>> Tel:(86)134-2619-1361
>>> weibo: http://weibo.com/jiangbinglover
>>> BLOG: www.binospace.com <http://www.binospace.com>
>>> BLOG: http://blog.sina.com.cn/jiangbinglover
>>> Focus on distributed computing, HDFS/HBase
>>
Re: Hadoop shuffling traffic
Posted by Abdul Navaz <na...@gmail.com>.
Hello Pramod,
This is great work !. Thank you for sharing the report.
Thanks & Regards,
Abdul Navaz
Research Assistant
University of Houston Main Campus, Houston TX
Ph: 281-685-0388
From: Pramod Biligiri <pr...@gmail.com>
Reply-To: <us...@hadoop.apache.org>
Date: Thursday, October 2, 2014 at 12:44 AM
To: "zookeeper-user@hadoop.apache.org" <us...@hadoop.apache.org>
Subject: Re: Hadoop shuffling traffic
Hi Abdul,
That is the right metric. You can take a look at this report we made on this
earlier:
http://www.slideshare.net/pramodbiligiri/shuffle-phase-as-the-bottleneck-in-
hadoop-terasort
Pramod
On Wed, Oct 1, 2014 at 6:06 PM, Abdul Navaz <na...@gmail.com> wrote:
> Hello,
>
> This is the portion of the output which is displayed on the console when I run
> sample word count job.
>
> map 0% reduce 0%
>
> 14/10/01 18:37:52 INFO mapred.JobClient: map 100% reduce 0%
>
> 14/10/01 18:38:10 INFO mapred.JobClient: map 100% reduce 100%
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Job complete: job_201409262002_0003
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Counters: 29
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Job Counters
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Launched reduce tasks=1
>
> 14/10/01 18:38:12 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=23511
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Total time spent by all reduces
> waiting after reserving slots (ms)=0
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Total time spent by all maps
> waiting after reserving slots (ms)=0
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Launched map tasks=1
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Data-local map tasks=1
>
> 14/10/01 18:38:12 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=14193
>
> 14/10/01 18:38:12 INFO mapred.JobClient: File Output Format Counters
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Bytes Written=1106
>
> 14/10/01 18:38:12 INFO mapred.JobClient: FileSystemCounters
>
> 14/10/01 18:38:12 INFO mapred.JobClient: FILE_BYTES_READ=3059
>
> 14/10/01 18:38:12 INFO mapred.JobClient: HDFS_BYTES_READ=1601
>
> 14/10/01 18:38:12 INFO mapred.JobClient: FILE_BYTES_WRITTEN=108400
>
> 14/10/01 18:38:12 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=1106
>
> 14/10/01 18:38:12 INFO mapred.JobClient: File Input Format Counters
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Bytes Read=1486
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Map-Reduce Framework
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Map output materialized
> bytes=3059
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Map input records=6
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Reduce shuffle bytes=3059
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Spilled Records=544
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Map output bytes=2509
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Total committed heap usage
>
>
>
> I am trying to find the shuffling traffic that is total traffic generated when
> mappers exchange their key values pair with the reducer. Is the highlighted
> portion gives the shuffling traffic ?
>
>
> Thanks & Regards,
>
> Abdul Navaz
> Research Assistant
> University of Houston Main Campus, Houston TX
> Ph: 281-685-0388 <tel:281-685-0388>
>
>
>
>
> On 9/26/14, 12:00 AM, "karthikeyan S" <ka...@gmail.com> wrote:
>
>> The reducer starts as soon as it has data available from any one of the
>> mappers.
>> The reducer keeps polling the AM and asks if any mapper has completed
>> processing. If so it fetches data from that mapper.
>> So it's not necessary for all the mappers of a task to complete for
>> the reducer to start processing.
>>
>> When the reducers starts fetching the data from the mappers it prints
>> that info in its syslog, from what I have seen.
>>
>> Thanks,
>> Karthik
>>
>> On Thu, Sep 25, 2014 at 8:27 PM, Bing Jiang <ji...@gmail.com> wrote:
>>> see mapreduce.job.reduce.slowstart.completedmaps
>>> It gives hint of when reduce tasks could kick off.
>>>
>>> 2014-09-26 8:36 GMT+08:00 Abdul Navaz <na...@gmail.com>:
>>>>
>>>> Hello,
>>>>
>>>> I am having a Hadoop cluster with 1 name node and 3 data nodes. I running
>>>> sample word count job on 1GB of file which is distributed among the HDFS.
>>>>
>>>> When I run the map reduce job, before even completing the mapping 100 %
>>>> reduce starts. Say for eg map 40% reduce 10% etc.
>>>>
>>>> I would like to know when the shuffling traffic starts ?
>>>>
>>>> -> Is there any way to find out when exactly shuffling started ? Does it
>>>> generate any syslog in the logs .
>>>> -> How to find the total amount of shuffling traffic?
>>>>
>>>>
>>>>
>>>> Thanks & Regards,
>>>>
>>>> Abdul Navaz
>>>> Research Assistant
>>>> University of Houston Main Campus, Houston TX
>>>> Ph: 281-685-0388 <tel:281-685-0388>
>>>>
>>>
>>>
>>>
>>> --
>>> Bing Jiang
>>> Tel:(86)134-2619-1361
>>> weibo: http://weibo.com/jiangbinglover
>>> BLOG: www.binospace.com <http://www.binospace.com>
>>> BLOG: http://blog.sina.com.cn/jiangbinglover
>>> Focus on distributed computing, HDFS/HBase
>>
Re: Hadoop shuffling traffic
Posted by Abdul Navaz <na...@gmail.com>.
Hello Pramod,
This is great work !. Thank you for sharing the report.
Thanks & Regards,
Abdul Navaz
Research Assistant
University of Houston Main Campus, Houston TX
Ph: 281-685-0388
From: Pramod Biligiri <pr...@gmail.com>
Reply-To: <us...@hadoop.apache.org>
Date: Thursday, October 2, 2014 at 12:44 AM
To: "zookeeper-user@hadoop.apache.org" <us...@hadoop.apache.org>
Subject: Re: Hadoop shuffling traffic
Hi Abdul,
That is the right metric. You can take a look at this report we made on this
earlier:
http://www.slideshare.net/pramodbiligiri/shuffle-phase-as-the-bottleneck-in-
hadoop-terasort
Pramod
On Wed, Oct 1, 2014 at 6:06 PM, Abdul Navaz <na...@gmail.com> wrote:
> Hello,
>
> This is the portion of the output which is displayed on the console when I run
> sample word count job.
>
> map 0% reduce 0%
>
> 14/10/01 18:37:52 INFO mapred.JobClient: map 100% reduce 0%
>
> 14/10/01 18:38:10 INFO mapred.JobClient: map 100% reduce 100%
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Job complete: job_201409262002_0003
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Counters: 29
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Job Counters
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Launched reduce tasks=1
>
> 14/10/01 18:38:12 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=23511
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Total time spent by all reduces
> waiting after reserving slots (ms)=0
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Total time spent by all maps
> waiting after reserving slots (ms)=0
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Launched map tasks=1
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Data-local map tasks=1
>
> 14/10/01 18:38:12 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=14193
>
> 14/10/01 18:38:12 INFO mapred.JobClient: File Output Format Counters
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Bytes Written=1106
>
> 14/10/01 18:38:12 INFO mapred.JobClient: FileSystemCounters
>
> 14/10/01 18:38:12 INFO mapred.JobClient: FILE_BYTES_READ=3059
>
> 14/10/01 18:38:12 INFO mapred.JobClient: HDFS_BYTES_READ=1601
>
> 14/10/01 18:38:12 INFO mapred.JobClient: FILE_BYTES_WRITTEN=108400
>
> 14/10/01 18:38:12 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=1106
>
> 14/10/01 18:38:12 INFO mapred.JobClient: File Input Format Counters
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Bytes Read=1486
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Map-Reduce Framework
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Map output materialized
> bytes=3059
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Map input records=6
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Reduce shuffle bytes=3059
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Spilled Records=544
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Map output bytes=2509
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Total committed heap usage
>
>
>
> I am trying to find the shuffling traffic that is total traffic generated when
> mappers exchange their key values pair with the reducer. Is the highlighted
> portion gives the shuffling traffic ?
>
>
> Thanks & Regards,
>
> Abdul Navaz
> Research Assistant
> University of Houston Main Campus, Houston TX
> Ph: 281-685-0388 <tel:281-685-0388>
>
>
>
>
> On 9/26/14, 12:00 AM, "karthikeyan S" <ka...@gmail.com> wrote:
>
>> The reducer starts as soon as it has data available from any one of the
>> mappers.
>> The reducer keeps polling the AM and asks if any mapper has completed
>> processing. If so it fetches data from that mapper.
>> So it's not necessary for all the mappers of a task to complete for
>> the reducer to start processing.
>>
>> When the reducers starts fetching the data from the mappers it prints
>> that info in its syslog, from what I have seen.
>>
>> Thanks,
>> Karthik
>>
>> On Thu, Sep 25, 2014 at 8:27 PM, Bing Jiang <ji...@gmail.com> wrote:
>>> see mapreduce.job.reduce.slowstart.completedmaps
>>> It gives hint of when reduce tasks could kick off.
>>>
>>> 2014-09-26 8:36 GMT+08:00 Abdul Navaz <na...@gmail.com>:
>>>>
>>>> Hello,
>>>>
>>>> I am having a Hadoop cluster with 1 name node and 3 data nodes. I running
>>>> sample word count job on 1GB of file which is distributed among the HDFS.
>>>>
>>>> When I run the map reduce job, before even completing the mapping 100 %
>>>> reduce starts. Say for eg map 40% reduce 10% etc.
>>>>
>>>> I would like to know when the shuffling traffic starts ?
>>>>
>>>> -> Is there any way to find out when exactly shuffling started ? Does it
>>>> generate any syslog in the logs .
>>>> -> How to find the total amount of shuffling traffic?
>>>>
>>>>
>>>>
>>>> Thanks & Regards,
>>>>
>>>> Abdul Navaz
>>>> Research Assistant
>>>> University of Houston Main Campus, Houston TX
>>>> Ph: 281-685-0388 <tel:281-685-0388>
>>>>
>>>
>>>
>>>
>>> --
>>> Bing Jiang
>>> Tel:(86)134-2619-1361
>>> weibo: http://weibo.com/jiangbinglover
>>> BLOG: www.binospace.com <http://www.binospace.com>
>>> BLOG: http://blog.sina.com.cn/jiangbinglover
>>> Focus on distributed computing, HDFS/HBase
>>
Re: Hadoop shuffling traffic
Posted by Abdul Navaz <na...@gmail.com>.
Hello Pramod,
This is great work !. Thank you for sharing the report.
Thanks & Regards,
Abdul Navaz
Research Assistant
University of Houston Main Campus, Houston TX
Ph: 281-685-0388
From: Pramod Biligiri <pr...@gmail.com>
Reply-To: <us...@hadoop.apache.org>
Date: Thursday, October 2, 2014 at 12:44 AM
To: "zookeeper-user@hadoop.apache.org" <us...@hadoop.apache.org>
Subject: Re: Hadoop shuffling traffic
Hi Abdul,
That is the right metric. You can take a look at this report we made on this
earlier:
http://www.slideshare.net/pramodbiligiri/shuffle-phase-as-the-bottleneck-in-
hadoop-terasort
Pramod
On Wed, Oct 1, 2014 at 6:06 PM, Abdul Navaz <na...@gmail.com> wrote:
> Hello,
>
> This is the portion of the output which is displayed on the console when I run
> sample word count job.
>
> map 0% reduce 0%
>
> 14/10/01 18:37:52 INFO mapred.JobClient: map 100% reduce 0%
>
> 14/10/01 18:38:10 INFO mapred.JobClient: map 100% reduce 100%
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Job complete: job_201409262002_0003
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Counters: 29
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Job Counters
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Launched reduce tasks=1
>
> 14/10/01 18:38:12 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=23511
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Total time spent by all reduces
> waiting after reserving slots (ms)=0
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Total time spent by all maps
> waiting after reserving slots (ms)=0
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Launched map tasks=1
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Data-local map tasks=1
>
> 14/10/01 18:38:12 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=14193
>
> 14/10/01 18:38:12 INFO mapred.JobClient: File Output Format Counters
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Bytes Written=1106
>
> 14/10/01 18:38:12 INFO mapred.JobClient: FileSystemCounters
>
> 14/10/01 18:38:12 INFO mapred.JobClient: FILE_BYTES_READ=3059
>
> 14/10/01 18:38:12 INFO mapred.JobClient: HDFS_BYTES_READ=1601
>
> 14/10/01 18:38:12 INFO mapred.JobClient: FILE_BYTES_WRITTEN=108400
>
> 14/10/01 18:38:12 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=1106
>
> 14/10/01 18:38:12 INFO mapred.JobClient: File Input Format Counters
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Bytes Read=1486
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Map-Reduce Framework
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Map output materialized
> bytes=3059
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Map input records=6
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Reduce shuffle bytes=3059
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Spilled Records=544
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Map output bytes=2509
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Total committed heap usage
>
>
>
> I am trying to find the shuffling traffic that is total traffic generated when
> mappers exchange their key values pair with the reducer. Is the highlighted
> portion gives the shuffling traffic ?
>
>
> Thanks & Regards,
>
> Abdul Navaz
> Research Assistant
> University of Houston Main Campus, Houston TX
> Ph: 281-685-0388 <tel:281-685-0388>
>
>
>
>
> On 9/26/14, 12:00 AM, "karthikeyan S" <ka...@gmail.com> wrote:
>
>> The reducer starts as soon as it has data available from any one of the
>> mappers.
>> The reducer keeps polling the AM and asks if any mapper has completed
>> processing. If so it fetches data from that mapper.
>> So it's not necessary for all the mappers of a task to complete for
>> the reducer to start processing.
>>
>> When the reducers starts fetching the data from the mappers it prints
>> that info in its syslog, from what I have seen.
>>
>> Thanks,
>> Karthik
>>
>> On Thu, Sep 25, 2014 at 8:27 PM, Bing Jiang <ji...@gmail.com> wrote:
>>> see mapreduce.job.reduce.slowstart.completedmaps
>>> It gives hint of when reduce tasks could kick off.
>>>
>>> 2014-09-26 8:36 GMT+08:00 Abdul Navaz <na...@gmail.com>:
>>>>
>>>> Hello,
>>>>
>>>> I am having a Hadoop cluster with 1 name node and 3 data nodes. I running
>>>> sample word count job on 1GB of file which is distributed among the HDFS.
>>>>
>>>> When I run the map reduce job, before even completing the mapping 100 %
>>>> reduce starts. Say for eg map 40% reduce 10% etc.
>>>>
>>>> I would like to know when the shuffling traffic starts ?
>>>>
>>>> -> Is there any way to find out when exactly shuffling started ? Does it
>>>> generate any syslog in the logs .
>>>> -> How to find the total amount of shuffling traffic?
>>>>
>>>>
>>>>
>>>> Thanks & Regards,
>>>>
>>>> Abdul Navaz
>>>> Research Assistant
>>>> University of Houston Main Campus, Houston TX
>>>> Ph: 281-685-0388 <tel:281-685-0388>
>>>>
>>>
>>>
>>>
>>> --
>>> Bing Jiang
>>> Tel:(86)134-2619-1361
>>> weibo: http://weibo.com/jiangbinglover
>>> BLOG: www.binospace.com <http://www.binospace.com>
>>> BLOG: http://blog.sina.com.cn/jiangbinglover
>>> Focus on distributed computing, HDFS/HBase
>>
Re: Hadoop shuffling traffic
Posted by Abdul Navaz <na...@gmail.com>.
Hello,
Fiesr of all thank you very much for your help. :)
I still have some doubt with this .
Is the highlighted metric “ Reduce shuffle bytes=3059”
1. Is the total bytes after the reduced phase. ( That is the output file
which is written into HDFS)
Or
2. Is this is the actual shuffled traffic which is exchanged between
mappers and reducers before performing reducing ?
Please clarify !
Thanks & Regards,
Abdul Navaz
From: Pramod Biligiri <pr...@gmail.com>
Reply-To: <us...@hadoop.apache.org>
Date: Thursday, October 2, 2014 at 12:44 AM
To: "zookeeper-user@hadoop.apache.org" <us...@hadoop.apache.org>
Subject: Re: Hadoop shuffling traffic
Hi Abdul,
That is the right metric. You can take a look at this report we made on this
earlier:
http://www.slideshare.net/pramodbiligiri/shuffle-phase-as-the-bottleneck-in-
hadoop-terasort
Pramod
On Wed, Oct 1, 2014 at 6:06 PM, Abdul Navaz <na...@gmail.com> wrote:
> Hello,
>
> This is the portion of the output which is displayed on the console when I run
> sample word count job.
>
> map 0% reduce 0%
>
> 14/10/01 18:37:52 INFO mapred.JobClient: map 100% reduce 0%
>
> 14/10/01 18:38:10 INFO mapred.JobClient: map 100% reduce 100%
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Job complete: job_201409262002_0003
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Counters: 29
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Job Counters
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Launched reduce tasks=1
>
> 14/10/01 18:38:12 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=23511
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Total time spent by all reduces
> waiting after reserving slots (ms)=0
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Total time spent by all maps
> waiting after reserving slots (ms)=0
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Launched map tasks=1
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Data-local map tasks=1
>
> 14/10/01 18:38:12 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=14193
>
> 14/10/01 18:38:12 INFO mapred.JobClient: File Output Format Counters
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Bytes Written=1106
>
> 14/10/01 18:38:12 INFO mapred.JobClient: FileSystemCounters
>
> 14/10/01 18:38:12 INFO mapred.JobClient: FILE_BYTES_READ=3059
>
> 14/10/01 18:38:12 INFO mapred.JobClient: HDFS_BYTES_READ=1601
>
> 14/10/01 18:38:12 INFO mapred.JobClient: FILE_BYTES_WRITTEN=108400
>
> 14/10/01 18:38:12 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=1106
>
> 14/10/01 18:38:12 INFO mapred.JobClient: File Input Format Counters
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Bytes Read=1486
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Map-Reduce Framework
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Map output materialized
> bytes=3059
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Map input records=6
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Reduce shuffle bytes=3059
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Spilled Records=544
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Map output bytes=2509
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Total committed heap usage
>
>
>
> I am trying to find the shuffling traffic that is total traffic generated when
> mappers exchange their key values pair with the reducer. Is the highlighted
> portion gives the shuffling traffic ?
>
>
> Thanks & Regards,
>
> Abdul Navaz
> Research Assistant
> University of Houston Main Campus, Houston TX
> Ph: 281-685-0388 <tel:281-685-0388>
>
>
>
>
> On 9/26/14, 12:00 AM, "karthikeyan S" <ka...@gmail.com> wrote:
>
>> The reducer starts as soon as it has data available from any one of the
>> mappers.
>> The reducer keeps polling the AM and asks if any mapper has completed
>> processing. If so it fetches data from that mapper.
>> So it's not necessary for all the mappers of a task to complete for
>> the reducer to start processing.
>>
>> When the reducers starts fetching the data from the mappers it prints
>> that info in its syslog, from what I have seen.
>>
>> Thanks,
>> Karthik
>>
>> On Thu, Sep 25, 2014 at 8:27 PM, Bing Jiang <ji...@gmail.com> wrote:
>>> see mapreduce.job.reduce.slowstart.completedmaps
>>> It gives hint of when reduce tasks could kick off.
>>>
>>> 2014-09-26 8:36 GMT+08:00 Abdul Navaz <na...@gmail.com>:
>>>>
>>>> Hello,
>>>>
>>>> I am having a Hadoop cluster with 1 name node and 3 data nodes. I running
>>>> sample word count job on 1GB of file which is distributed among the HDFS.
>>>>
>>>> When I run the map reduce job, before even completing the mapping 100 %
>>>> reduce starts. Say for eg map 40% reduce 10% etc.
>>>>
>>>> I would like to know when the shuffling traffic starts ?
>>>>
>>>> -> Is there any way to find out when exactly shuffling started ? Does it
>>>> generate any syslog in the logs .
>>>> -> How to find the total amount of shuffling traffic?
>>>>
>>>>
>>>>
>>>> Thanks & Regards,
>>>>
>>>> Abdul Navaz
>>>> Research Assistant
>>>> University of Houston Main Campus, Houston TX
>>>> Ph: 281-685-0388 <tel:281-685-0388>
>>>>
>>>
>>>
>>>
>>> --
>>> Bing Jiang
>>> Tel:(86)134-2619-1361
>>> weibo: http://weibo.com/jiangbinglover
>>> BLOG: www.binospace.com <http://www.binospace.com>
>>> BLOG: http://blog.sina.com.cn/jiangbinglover
>>> Focus on distributed computing, HDFS/HBase
>>
Re: Hadoop shuffling traffic
Posted by Abdul Navaz <na...@gmail.com>.
Hello,
Fiesr of all thank you very much for your help. :)
I still have some doubt with this .
Is the highlighted metric “ Reduce shuffle bytes=3059”
1. Is the total bytes after the reduced phase. ( That is the output file
which is written into HDFS)
Or
2. Is this is the actual shuffled traffic which is exchanged between
mappers and reducers before performing reducing ?
Please clarify !
Thanks & Regards,
Abdul Navaz
From: Pramod Biligiri <pr...@gmail.com>
Reply-To: <us...@hadoop.apache.org>
Date: Thursday, October 2, 2014 at 12:44 AM
To: "zookeeper-user@hadoop.apache.org" <us...@hadoop.apache.org>
Subject: Re: Hadoop shuffling traffic
Hi Abdul,
That is the right metric. You can take a look at this report we made on this
earlier:
http://www.slideshare.net/pramodbiligiri/shuffle-phase-as-the-bottleneck-in-
hadoop-terasort
Pramod
On Wed, Oct 1, 2014 at 6:06 PM, Abdul Navaz <na...@gmail.com> wrote:
> Hello,
>
> This is the portion of the output which is displayed on the console when I run
> sample word count job.
>
> map 0% reduce 0%
>
> 14/10/01 18:37:52 INFO mapred.JobClient: map 100% reduce 0%
>
> 14/10/01 18:38:10 INFO mapred.JobClient: map 100% reduce 100%
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Job complete: job_201409262002_0003
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Counters: 29
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Job Counters
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Launched reduce tasks=1
>
> 14/10/01 18:38:12 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=23511
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Total time spent by all reduces
> waiting after reserving slots (ms)=0
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Total time spent by all maps
> waiting after reserving slots (ms)=0
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Launched map tasks=1
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Data-local map tasks=1
>
> 14/10/01 18:38:12 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=14193
>
> 14/10/01 18:38:12 INFO mapred.JobClient: File Output Format Counters
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Bytes Written=1106
>
> 14/10/01 18:38:12 INFO mapred.JobClient: FileSystemCounters
>
> 14/10/01 18:38:12 INFO mapred.JobClient: FILE_BYTES_READ=3059
>
> 14/10/01 18:38:12 INFO mapred.JobClient: HDFS_BYTES_READ=1601
>
> 14/10/01 18:38:12 INFO mapred.JobClient: FILE_BYTES_WRITTEN=108400
>
> 14/10/01 18:38:12 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=1106
>
> 14/10/01 18:38:12 INFO mapred.JobClient: File Input Format Counters
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Bytes Read=1486
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Map-Reduce Framework
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Map output materialized
> bytes=3059
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Map input records=6
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Reduce shuffle bytes=3059
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Spilled Records=544
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Map output bytes=2509
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Total committed heap usage
>
>
>
> I am trying to find the shuffling traffic that is total traffic generated when
> mappers exchange their key values pair with the reducer. Is the highlighted
> portion gives the shuffling traffic ?
>
>
> Thanks & Regards,
>
> Abdul Navaz
> Research Assistant
> University of Houston Main Campus, Houston TX
> Ph: 281-685-0388 <tel:281-685-0388>
>
>
>
>
> On 9/26/14, 12:00 AM, "karthikeyan S" <ka...@gmail.com> wrote:
>
>> The reducer starts as soon as it has data available from any one of the
>> mappers.
>> The reducer keeps polling the AM and asks if any mapper has completed
>> processing. If so it fetches data from that mapper.
>> So it's not necessary for all the mappers of a task to complete for
>> the reducer to start processing.
>>
>> When the reducers starts fetching the data from the mappers it prints
>> that info in its syslog, from what I have seen.
>>
>> Thanks,
>> Karthik
>>
>> On Thu, Sep 25, 2014 at 8:27 PM, Bing Jiang <ji...@gmail.com> wrote:
>>> see mapreduce.job.reduce.slowstart.completedmaps
>>> It gives hint of when reduce tasks could kick off.
>>>
>>> 2014-09-26 8:36 GMT+08:00 Abdul Navaz <na...@gmail.com>:
>>>>
>>>> Hello,
>>>>
>>>> I am having a Hadoop cluster with 1 name node and 3 data nodes. I running
>>>> sample word count job on 1GB of file which is distributed among the HDFS.
>>>>
>>>> When I run the map reduce job, before even completing the mapping 100 %
>>>> reduce starts. Say for eg map 40% reduce 10% etc.
>>>>
>>>> I would like to know when the shuffling traffic starts ?
>>>>
>>>> -> Is there any way to find out when exactly shuffling started ? Does it
>>>> generate any syslog in the logs .
>>>> -> How to find the total amount of shuffling traffic?
>>>>
>>>>
>>>>
>>>> Thanks & Regards,
>>>>
>>>> Abdul Navaz
>>>> Research Assistant
>>>> University of Houston Main Campus, Houston TX
>>>> Ph: 281-685-0388 <tel:281-685-0388>
>>>>
>>>
>>>
>>>
>>> --
>>> Bing Jiang
>>> Tel:(86)134-2619-1361
>>> weibo: http://weibo.com/jiangbinglover
>>> BLOG: www.binospace.com <http://www.binospace.com>
>>> BLOG: http://blog.sina.com.cn/jiangbinglover
>>> Focus on distributed computing, HDFS/HBase
>>
Re: Hadoop shuffling traffic
Posted by Abdul Navaz <na...@gmail.com>.
Hello,
Fiesr of all thank you very much for your help. :)
I still have some doubt with this .
Is the highlighted metric “ Reduce shuffle bytes=3059”
1. Is the total bytes after the reduced phase. ( That is the output file
which is written into HDFS)
Or
2. Is this is the actual shuffled traffic which is exchanged between
mappers and reducers before performing reducing ?
Please clarify !
Thanks & Regards,
Abdul Navaz
From: Pramod Biligiri <pr...@gmail.com>
Reply-To: <us...@hadoop.apache.org>
Date: Thursday, October 2, 2014 at 12:44 AM
To: "zookeeper-user@hadoop.apache.org" <us...@hadoop.apache.org>
Subject: Re: Hadoop shuffling traffic
Hi Abdul,
That is the right metric. You can take a look at this report we made on this
earlier:
http://www.slideshare.net/pramodbiligiri/shuffle-phase-as-the-bottleneck-in-
hadoop-terasort
Pramod
On Wed, Oct 1, 2014 at 6:06 PM, Abdul Navaz <na...@gmail.com> wrote:
> Hello,
>
> This is the portion of the output which is displayed on the console when I run
> sample word count job.
>
> map 0% reduce 0%
>
> 14/10/01 18:37:52 INFO mapred.JobClient: map 100% reduce 0%
>
> 14/10/01 18:38:10 INFO mapred.JobClient: map 100% reduce 100%
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Job complete: job_201409262002_0003
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Counters: 29
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Job Counters
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Launched reduce tasks=1
>
> 14/10/01 18:38:12 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=23511
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Total time spent by all reduces
> waiting after reserving slots (ms)=0
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Total time spent by all maps
> waiting after reserving slots (ms)=0
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Launched map tasks=1
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Data-local map tasks=1
>
> 14/10/01 18:38:12 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=14193
>
> 14/10/01 18:38:12 INFO mapred.JobClient: File Output Format Counters
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Bytes Written=1106
>
> 14/10/01 18:38:12 INFO mapred.JobClient: FileSystemCounters
>
> 14/10/01 18:38:12 INFO mapred.JobClient: FILE_BYTES_READ=3059
>
> 14/10/01 18:38:12 INFO mapred.JobClient: HDFS_BYTES_READ=1601
>
> 14/10/01 18:38:12 INFO mapred.JobClient: FILE_BYTES_WRITTEN=108400
>
> 14/10/01 18:38:12 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=1106
>
> 14/10/01 18:38:12 INFO mapred.JobClient: File Input Format Counters
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Bytes Read=1486
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Map-Reduce Framework
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Map output materialized
> bytes=3059
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Map input records=6
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Reduce shuffle bytes=3059
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Spilled Records=544
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Map output bytes=2509
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Total committed heap usage
>
>
>
> I am trying to find the shuffling traffic that is total traffic generated when
> mappers exchange their key values pair with the reducer. Is the highlighted
> portion gives the shuffling traffic ?
>
>
> Thanks & Regards,
>
> Abdul Navaz
> Research Assistant
> University of Houston Main Campus, Houston TX
> Ph: 281-685-0388 <tel:281-685-0388>
>
>
>
>
> On 9/26/14, 12:00 AM, "karthikeyan S" <ka...@gmail.com> wrote:
>
>> The reducer starts as soon as it has data available from any one of the
>> mappers.
>> The reducer keeps polling the AM and asks if any mapper has completed
>> processing. If so it fetches data from that mapper.
>> So it's not necessary for all the mappers of a task to complete for
>> the reducer to start processing.
>>
>> When the reducers starts fetching the data from the mappers it prints
>> that info in its syslog, from what I have seen.
>>
>> Thanks,
>> Karthik
>>
>> On Thu, Sep 25, 2014 at 8:27 PM, Bing Jiang <ji...@gmail.com> wrote:
>>> see mapreduce.job.reduce.slowstart.completedmaps
>>> It gives hint of when reduce tasks could kick off.
>>>
>>> 2014-09-26 8:36 GMT+08:00 Abdul Navaz <na...@gmail.com>:
>>>>
>>>> Hello,
>>>>
>>>> I am having a Hadoop cluster with 1 name node and 3 data nodes. I running
>>>> sample word count job on 1GB of file which is distributed among the HDFS.
>>>>
>>>> When I run the map reduce job, before even completing the mapping 100 %
>>>> reduce starts. Say for eg map 40% reduce 10% etc.
>>>>
>>>> I would like to know when the shuffling traffic starts ?
>>>>
>>>> -> Is there any way to find out when exactly shuffling started ? Does it
>>>> generate any syslog in the logs .
>>>> -> How to find the total amount of shuffling traffic?
>>>>
>>>>
>>>>
>>>> Thanks & Regards,
>>>>
>>>> Abdul Navaz
>>>> Research Assistant
>>>> University of Houston Main Campus, Houston TX
>>>> Ph: 281-685-0388 <tel:281-685-0388>
>>>>
>>>
>>>
>>>
>>> --
>>> Bing Jiang
>>> Tel:(86)134-2619-1361
>>> weibo: http://weibo.com/jiangbinglover
>>> BLOG: www.binospace.com <http://www.binospace.com>
>>> BLOG: http://blog.sina.com.cn/jiangbinglover
>>> Focus on distributed computing, HDFS/HBase
>>
Re: Hadoop shuffling traffic
Posted by Pramod Biligiri <pr...@gmail.com>.
Hi Abdul,
That is the right metric. You can take a look at this report we made on
this earlier:
http://www.slideshare.net/pramodbiligiri/shuffle-phase-as-the-bottleneck-in-hadoop-terasort
Pramod
On Wed, Oct 1, 2014 at 6:06 PM, Abdul Navaz <na...@gmail.com> wrote:
> Hello,
>
> This is the portion of the output which is displayed on the console when I
> run sample word count job.
>
> map 0% reduce 0%
>
> 14/10/01 18:37:52 INFO mapred.JobClient: map 100% reduce 0%
>
> 14/10/01 18:38:10 INFO mapred.JobClient: map 100% reduce 100%
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Job complete:
> job_201409262002_0003
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Counters: 29
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Job Counters
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Launched reduce tasks=1
>
> 14/10/01 18:38:12 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=23511
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Total time spent by all
> reduces waiting after reserving slots (ms)=0
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Total time spent by all maps
> waiting after reserving slots (ms)=0
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Launched map tasks=1
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Data-local map tasks=1
>
> 14/10/01 18:38:12 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=14193
>
> 14/10/01 18:38:12 INFO mapred.JobClient: File Output Format Counters
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Bytes Written=1106
>
> 14/10/01 18:38:12 INFO mapred.JobClient: FileSystemCounters
>
> 14/10/01 18:38:12 INFO mapred.JobClient: FILE_BYTES_READ=3059
>
> 14/10/01 18:38:12 INFO mapred.JobClient: HDFS_BYTES_READ=1601
>
> 14/10/01 18:38:12 INFO mapred.JobClient: FILE_BYTES_WRITTEN=108400
>
> 14/10/01 18:38:12 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=1106
>
> 14/10/01 18:38:12 INFO mapred.JobClient: File Input Format Counters
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Bytes Read=1486
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Map-Reduce Framework
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Map output materialized
> bytes=3059
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Map input records=6
>
> 14/10/01 18:38:12 INFO mapred.JobClient: *Reduce shuffle bytes=3059*
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Spilled Records=544
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Map output bytes=2509
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Total committed heap usage
>
>
> I am trying to find the shuffling traffic that is total traffic generated
> when mappers exchange their key values pair with the reducer. Is the
> highlighted portion gives the shuffling traffic ?
>
>
> Thanks & Regards,
>
> Abdul Navaz
> Research Assistant
> University of Houston Main Campus, Houston TX
> Ph: 281-685-0388
>
>
>
>
> On 9/26/14, 12:00 AM, "karthikeyan S" <ka...@gmail.com> wrote:
>
> The reducer starts as soon as it has data available from any one of the
> mappers.
> The reducer keeps polling the AM and asks if any mapper has completed
> processing. If so it fetches data from that mapper.
> So it's not necessary for all the mappers of a task to complete for
> the reducer to start processing.
>
> When the reducers starts fetching the data from the mappers it prints
> that info in its syslog, from what I have seen.
>
> Thanks,
> Karthik
>
> On Thu, Sep 25, 2014 at 8:27 PM, Bing Jiang <ji...@gmail.com>
> wrote:
>
> see mapreduce.job.reduce.slowstart.completedmaps
> It gives hint of when reduce tasks could kick off.
>
> 2014-09-26 8:36 GMT+08:00 Abdul Navaz <na...@gmail.com>:
>
>
> Hello,
>
> I am having a Hadoop cluster with 1 name node and 3 data nodes. I running
> sample word count job on 1GB of file which is distributed among the HDFS.
>
> When I run the map reduce job, before even completing the mapping 100 %
> reduce starts. Say for eg map 40% reduce 10% etc.
>
> I would like to know when the shuffling traffic starts ?
>
> -> Is there any way to find out when exactly shuffling started ? Does it
> generate any syslog in the logs .
> -> How to find the total amount of shuffling traffic?
>
>
>
> Thanks & Regards,
>
> Abdul Navaz
> Research Assistant
> University of Houston Main Campus, Houston TX
> Ph: 281-685-0388
>
>
>
>
> --
> Bing Jiang
> Tel:(86)134-2619-1361
> weibo: http://weibo.com/jiangbinglover
> BLOG: www.binospace.com
> BLOG: http://blog.sina.com.cn/jiangbinglover
> Focus on distributed computing, HDFS/HBase
>
>
>
Re: Hadoop shuffling traffic
Posted by Pramod Biligiri <pr...@gmail.com>.
Hi Abdul,
That is the right metric. You can take a look at this report we made on
this earlier:
http://www.slideshare.net/pramodbiligiri/shuffle-phase-as-the-bottleneck-in-hadoop-terasort
Pramod
On Wed, Oct 1, 2014 at 6:06 PM, Abdul Navaz <na...@gmail.com> wrote:
> Hello,
>
> This is the portion of the output which is displayed on the console when I
> run sample word count job.
>
> map 0% reduce 0%
>
> 14/10/01 18:37:52 INFO mapred.JobClient: map 100% reduce 0%
>
> 14/10/01 18:38:10 INFO mapred.JobClient: map 100% reduce 100%
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Job complete:
> job_201409262002_0003
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Counters: 29
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Job Counters
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Launched reduce tasks=1
>
> 14/10/01 18:38:12 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=23511
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Total time spent by all
> reduces waiting after reserving slots (ms)=0
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Total time spent by all maps
> waiting after reserving slots (ms)=0
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Launched map tasks=1
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Data-local map tasks=1
>
> 14/10/01 18:38:12 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=14193
>
> 14/10/01 18:38:12 INFO mapred.JobClient: File Output Format Counters
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Bytes Written=1106
>
> 14/10/01 18:38:12 INFO mapred.JobClient: FileSystemCounters
>
> 14/10/01 18:38:12 INFO mapred.JobClient: FILE_BYTES_READ=3059
>
> 14/10/01 18:38:12 INFO mapred.JobClient: HDFS_BYTES_READ=1601
>
> 14/10/01 18:38:12 INFO mapred.JobClient: FILE_BYTES_WRITTEN=108400
>
> 14/10/01 18:38:12 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=1106
>
> 14/10/01 18:38:12 INFO mapred.JobClient: File Input Format Counters
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Bytes Read=1486
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Map-Reduce Framework
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Map output materialized
> bytes=3059
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Map input records=6
>
> 14/10/01 18:38:12 INFO mapred.JobClient: *Reduce shuffle bytes=3059*
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Spilled Records=544
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Map output bytes=2509
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Total committed heap usage
>
>
> I am trying to find the shuffling traffic that is total traffic generated
> when mappers exchange their key values pair with the reducer. Is the
> highlighted portion gives the shuffling traffic ?
>
>
> Thanks & Regards,
>
> Abdul Navaz
> Research Assistant
> University of Houston Main Campus, Houston TX
> Ph: 281-685-0388
>
>
>
>
> On 9/26/14, 12:00 AM, "karthikeyan S" <ka...@gmail.com> wrote:
>
> The reducer starts as soon as it has data available from any one of the
> mappers.
> The reducer keeps polling the AM and asks if any mapper has completed
> processing. If so it fetches data from that mapper.
> So it's not necessary for all the mappers of a task to complete for
> the reducer to start processing.
>
> When the reducers starts fetching the data from the mappers it prints
> that info in its syslog, from what I have seen.
>
> Thanks,
> Karthik
>
> On Thu, Sep 25, 2014 at 8:27 PM, Bing Jiang <ji...@gmail.com>
> wrote:
>
> see mapreduce.job.reduce.slowstart.completedmaps
> It gives hint of when reduce tasks could kick off.
>
> 2014-09-26 8:36 GMT+08:00 Abdul Navaz <na...@gmail.com>:
>
>
> Hello,
>
> I am having a Hadoop cluster with 1 name node and 3 data nodes. I running
> sample word count job on 1GB of file which is distributed among the HDFS.
>
> When I run the map reduce job, before even completing the mapping 100 %
> reduce starts. Say for eg map 40% reduce 10% etc.
>
> I would like to know when the shuffling traffic starts ?
>
> -> Is there any way to find out when exactly shuffling started ? Does it
> generate any syslog in the logs .
> -> How to find the total amount of shuffling traffic?
>
>
>
> Thanks & Regards,
>
> Abdul Navaz
> Research Assistant
> University of Houston Main Campus, Houston TX
> Ph: 281-685-0388
>
>
>
>
> --
> Bing Jiang
> Tel:(86)134-2619-1361
> weibo: http://weibo.com/jiangbinglover
> BLOG: www.binospace.com
> BLOG: http://blog.sina.com.cn/jiangbinglover
> Focus on distributed computing, HDFS/HBase
>
>
>
Re: Hadoop shuffling traffic
Posted by Pramod Biligiri <pr...@gmail.com>.
Hi Abdul,
That is the right metric. You can take a look at this report we made on
this earlier:
http://www.slideshare.net/pramodbiligiri/shuffle-phase-as-the-bottleneck-in-hadoop-terasort
Pramod
On Wed, Oct 1, 2014 at 6:06 PM, Abdul Navaz <na...@gmail.com> wrote:
> Hello,
>
> This is the portion of the output which is displayed on the console when I
> run sample word count job.
>
> map 0% reduce 0%
>
> 14/10/01 18:37:52 INFO mapred.JobClient: map 100% reduce 0%
>
> 14/10/01 18:38:10 INFO mapred.JobClient: map 100% reduce 100%
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Job complete:
> job_201409262002_0003
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Counters: 29
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Job Counters
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Launched reduce tasks=1
>
> 14/10/01 18:38:12 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=23511
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Total time spent by all
> reduces waiting after reserving slots (ms)=0
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Total time spent by all maps
> waiting after reserving slots (ms)=0
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Launched map tasks=1
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Data-local map tasks=1
>
> 14/10/01 18:38:12 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=14193
>
> 14/10/01 18:38:12 INFO mapred.JobClient: File Output Format Counters
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Bytes Written=1106
>
> 14/10/01 18:38:12 INFO mapred.JobClient: FileSystemCounters
>
> 14/10/01 18:38:12 INFO mapred.JobClient: FILE_BYTES_READ=3059
>
> 14/10/01 18:38:12 INFO mapred.JobClient: HDFS_BYTES_READ=1601
>
> 14/10/01 18:38:12 INFO mapred.JobClient: FILE_BYTES_WRITTEN=108400
>
> 14/10/01 18:38:12 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=1106
>
> 14/10/01 18:38:12 INFO mapred.JobClient: File Input Format Counters
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Bytes Read=1486
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Map-Reduce Framework
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Map output materialized
> bytes=3059
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Map input records=6
>
> 14/10/01 18:38:12 INFO mapred.JobClient: *Reduce shuffle bytes=3059*
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Spilled Records=544
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Map output bytes=2509
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Total committed heap usage
>
>
> I am trying to find the shuffling traffic that is total traffic generated
> when mappers exchange their key values pair with the reducer. Is the
> highlighted portion gives the shuffling traffic ?
>
>
> Thanks & Regards,
>
> Abdul Navaz
> Research Assistant
> University of Houston Main Campus, Houston TX
> Ph: 281-685-0388
>
>
>
>
> On 9/26/14, 12:00 AM, "karthikeyan S" <ka...@gmail.com> wrote:
>
> The reducer starts as soon as it has data available from any one of the
> mappers.
> The reducer keeps polling the AM and asks if any mapper has completed
> processing. If so it fetches data from that mapper.
> So it's not necessary for all the mappers of a task to complete for
> the reducer to start processing.
>
> When the reducers starts fetching the data from the mappers it prints
> that info in its syslog, from what I have seen.
>
> Thanks,
> Karthik
>
> On Thu, Sep 25, 2014 at 8:27 PM, Bing Jiang <ji...@gmail.com>
> wrote:
>
> see mapreduce.job.reduce.slowstart.completedmaps
> It gives hint of when reduce tasks could kick off.
>
> 2014-09-26 8:36 GMT+08:00 Abdul Navaz <na...@gmail.com>:
>
>
> Hello,
>
> I am having a Hadoop cluster with 1 name node and 3 data nodes. I running
> sample word count job on 1GB of file which is distributed among the HDFS.
>
> When I run the map reduce job, before even completing the mapping 100 %
> reduce starts. Say for eg map 40% reduce 10% etc.
>
> I would like to know when the shuffling traffic starts ?
>
> -> Is there any way to find out when exactly shuffling started ? Does it
> generate any syslog in the logs .
> -> How to find the total amount of shuffling traffic?
>
>
>
> Thanks & Regards,
>
> Abdul Navaz
> Research Assistant
> University of Houston Main Campus, Houston TX
> Ph: 281-685-0388
>
>
>
>
> --
> Bing Jiang
> Tel:(86)134-2619-1361
> weibo: http://weibo.com/jiangbinglover
> BLOG: www.binospace.com
> BLOG: http://blog.sina.com.cn/jiangbinglover
> Focus on distributed computing, HDFS/HBase
>
>
>
Re: Hadoop shuffling traffic
Posted by Pramod Biligiri <pr...@gmail.com>.
Hi Abdul,
That is the right metric. You can take a look at this report we made on
this earlier:
http://www.slideshare.net/pramodbiligiri/shuffle-phase-as-the-bottleneck-in-hadoop-terasort
Pramod
On Wed, Oct 1, 2014 at 6:06 PM, Abdul Navaz <na...@gmail.com> wrote:
> Hello,
>
> This is the portion of the output which is displayed on the console when I
> run sample word count job.
>
> map 0% reduce 0%
>
> 14/10/01 18:37:52 INFO mapred.JobClient: map 100% reduce 0%
>
> 14/10/01 18:38:10 INFO mapred.JobClient: map 100% reduce 100%
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Job complete:
> job_201409262002_0003
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Counters: 29
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Job Counters
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Launched reduce tasks=1
>
> 14/10/01 18:38:12 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=23511
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Total time spent by all
> reduces waiting after reserving slots (ms)=0
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Total time spent by all maps
> waiting after reserving slots (ms)=0
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Launched map tasks=1
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Data-local map tasks=1
>
> 14/10/01 18:38:12 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=14193
>
> 14/10/01 18:38:12 INFO mapred.JobClient: File Output Format Counters
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Bytes Written=1106
>
> 14/10/01 18:38:12 INFO mapred.JobClient: FileSystemCounters
>
> 14/10/01 18:38:12 INFO mapred.JobClient: FILE_BYTES_READ=3059
>
> 14/10/01 18:38:12 INFO mapred.JobClient: HDFS_BYTES_READ=1601
>
> 14/10/01 18:38:12 INFO mapred.JobClient: FILE_BYTES_WRITTEN=108400
>
> 14/10/01 18:38:12 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=1106
>
> 14/10/01 18:38:12 INFO mapred.JobClient: File Input Format Counters
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Bytes Read=1486
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Map-Reduce Framework
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Map output materialized
> bytes=3059
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Map input records=6
>
> 14/10/01 18:38:12 INFO mapred.JobClient: *Reduce shuffle bytes=3059*
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Spilled Records=544
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Map output bytes=2509
>
> 14/10/01 18:38:12 INFO mapred.JobClient: Total committed heap usage
>
>
> I am trying to find the shuffling traffic that is total traffic generated
> when mappers exchange their key values pair with the reducer. Is the
> highlighted portion gives the shuffling traffic ?
>
>
> Thanks & Regards,
>
> Abdul Navaz
> Research Assistant
> University of Houston Main Campus, Houston TX
> Ph: 281-685-0388
>
>
>
>
> On 9/26/14, 12:00 AM, "karthikeyan S" <ka...@gmail.com> wrote:
>
> The reducer starts as soon as it has data available from any one of the
> mappers.
> The reducer keeps polling the AM and asks if any mapper has completed
> processing. If so it fetches data from that mapper.
> So it's not necessary for all the mappers of a task to complete for
> the reducer to start processing.
>
> When the reducers starts fetching the data from the mappers it prints
> that info in its syslog, from what I have seen.
>
> Thanks,
> Karthik
>
> On Thu, Sep 25, 2014 at 8:27 PM, Bing Jiang <ji...@gmail.com>
> wrote:
>
> see mapreduce.job.reduce.slowstart.completedmaps
> It gives hint of when reduce tasks could kick off.
>
> 2014-09-26 8:36 GMT+08:00 Abdul Navaz <na...@gmail.com>:
>
>
> Hello,
>
> I am having a Hadoop cluster with 1 name node and 3 data nodes. I running
> sample word count job on 1GB of file which is distributed among the HDFS.
>
> When I run the map reduce job, before even completing the mapping 100 %
> reduce starts. Say for eg map 40% reduce 10% etc.
>
> I would like to know when the shuffling traffic starts ?
>
> -> Is there any way to find out when exactly shuffling started ? Does it
> generate any syslog in the logs .
> -> How to find the total amount of shuffling traffic?
>
>
>
> Thanks & Regards,
>
> Abdul Navaz
> Research Assistant
> University of Houston Main Campus, Houston TX
> Ph: 281-685-0388
>
>
>
>
> --
> Bing Jiang
> Tel:(86)134-2619-1361
> weibo: http://weibo.com/jiangbinglover
> BLOG: www.binospace.com
> BLOG: http://blog.sina.com.cn/jiangbinglover
> Focus on distributed computing, HDFS/HBase
>
>
>