Posted to common-user@hadoop.apache.org by ch huang <ju...@gmail.com> on 2013/12/11 06:25:36 UTC

issue about Shuffled Maps in MR job summary

hi, maillist:
           I ran terasort with 16 reducers and with 8 reducers. When I double the reducer count, the Shuffled Maps counter also doubles. My question is: the job only runs 20 map tasks (the total input is 10 files, each file is 100M, and my block size is 64M, so there are 20 splits), so why does it shuffle 160 maps in the 8-reducer run and 320 maps in the 16-reducer run? How is the Shuffled Maps number calculated?

16 reducer summary output:

    Shuffled Maps =320

8 reducer summary output:

    Shuffled Maps =160

RE: issue about Shuffled Maps in MR job summary

Posted by java8964 <ja...@hotmail.com>.
Or you can check your job history UI, which provides similar information to the JobTracker, since you are using MR2 and YARN.
The default port of the job history UI is 19888.
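For example (a sketch; replace the host with wherever your job history server runs):

    http://<jobhistoryserver-host>:19888/jobhistory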

From: java8964@hotmail.com
To: user@hadoop.apache.org
Subject: RE: issue about Shuffled Maps in MR job summary
Date: Thu, 12 Dec 2013 10:06:37 -0500




Then you can check your job's status from the YARN ResourceManager web UI to identify what stage your reducers are in.
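For example (a sketch; 8088 is the default ResourceManager web port, replace the host with yours):

    http://<resourcemanager-host>:8088/cluster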

Date: Thu, 12 Dec 2013 11:12:47 +0800
Subject: Re: issue about Shuffled Maps in MR job summary
From: justlooks@gmail.com
To: user@hadoop.apache.org

One important thing is that my input files are very small, each file less than 10M, and I have a huge number of files.


On Thu, Dec 12, 2013 at 9:58 AM, java8964 <ja...@hotmail.com> wrote:



Assume the block size is 128M and each of your mappers finishes within half a minute; then there is not much logic in your mappers, since each one processes 128M in around 30 seconds. If your reducers cannot finish within 1 week, then something is wrong.


So you may need to find out the following:

1) How many mappers were generated in your MR job?
2) Are they all finished? (Check them in the JobTracker through the web UI or the command line.)
3) How many reducers are in this job?
4) Have the reducers started? What stage are they in: copying, sorting, or reducing?
5) If they are in the reducing stage, check the userlogs of the reducers. Is your code running now?

All of this information you can find in the JobTracker web UI.
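If you prefer the command line, something like this should work on MR2 (a sketch; substitute your real job id for the placeholder):

    mapred job -status <job-id>
    mapred job -list-attempt-ids <job-id> REDUCE running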


Yong




Date: Thu, 12 Dec 2013 09:03:29 +0800 


Subject: Re: issue about Shuffled Maps in MR job summary
From: justlooks@gmail.com
To: user@hadoop.apache.org



hi,
    Suppose I have a 5-worknode cluster, and each worknode can allocate 40G of memory. I do not worry about the map tasks, because the map tasks in my job finish within half a minute; from what I observe, the really slow tasks are the reduces. I allocate 12G to each reduce task, so each worknode can run 3 reduces in parallel and the whole cluster can support 15 reducers, and I run the job with all 15 reducers. What I do not know is whether increasing the reducer count from 15 to 30, with 6G of memory allocated to each reduce, will speed up the job or not. The job runs in my production environment; it has been running for nearly 1 week and still has not finished.
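For reference, this is roughly how the two configurations look on the command line (a sketch; the property names are the standard MR2 ones, the values are mine):

    # current run: 15 reducers at 12G each
    -D mapreduce.job.reduces=15 -D mapreduce.reduce.memory.mb=12288
    # the variant I am considering: 30 reducers at 6G each
    -D mapreduce.job.reduces=30 -D mapreduce.reduce.memory.mb=6144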



On Wed, Dec 11, 2013 at 9:50 PM, java8964 <ja...@hotmail.com> wrote:



The whole job completion time depends on a lot of factors. Are you sure the reduce phase is the bottleneck?


It also depends on how many reducer input groups your MR job has. If you only have 20 reducer input groups, then even if you jump your reducer count to 40, the duration of the reduce phase won't change much, because the additional 20 reduce tasks won't get any data to process.



If you have a lot of reducer input groups, your cluster has capacity at this time, and you also have a lot of idle reducer slots, then increasing your reducer count should decrease your whole job completion time.
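One quick way to check (a sketch; "Reduce input groups" is the standard counter label in the job summary): look at the counters of a finished run, e.g.

    Reduce input groups=20

If that number is not much larger than your reducer count, adding reducers is unlikely to help.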



Make sense?


Yong




Date: Wed, 11 Dec 2013 14:20:24 +0800
Subject: Re: issue about Shuffled Maps in MR job summary
From: justlooks@gmail.com
To: user@hadoop.apache.org 




I read the doc and found that if I have 8 reducers, a map task will output 8 partitions, and each partition will be sent to a different reducer. So if I increase the reducer count, the partition count increases, but the volume of network traffic is the same. Why does increasing the reducer count sometimes not decrease the job completion time?
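As I understand it, the partitioning works roughly like Hadoop's default HashPartitioner; a minimal sketch of its logic (the class name here is hypothetical, not the real source):

    // Each map-output record is assigned to one of numReduceTasks
    // partitions by hashing its key; the sign bit is masked off so
    // the partition index is never negative.
    public class SketchPartitioner<K, V> extends org.apache.hadoop.mapreduce.Partitioner<K, V> {
      @Override
      public int getPartition(K key, V value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
      }
    }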

 
On Wed, Dec 11, 2013 at 1:48 PM, Vinayakumar B <vi...@huawei.com> wrote:



It looks simple :)

 
Shuffled Maps = Number of Map Tasks * Number of Reducers
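For this job that works out exactly: 20 map tasks * 8 reducers = 160 shuffled maps, and 20 * 16 = 320, which matches both summary outputs. Each reducer fetches one output partition from every map task, so the counter scales with both numbers.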

 
Thanks and Regards,

Vinayakumar B
 



Re: issue about Shuffled Maps in MR job summary

Posted by ch huang <ju...@gmail.com>.
1) How many mappers were generated in your MR job?
I have lots of input files, each below 64M in size. If I set the block size to 128M or 256M, will it help? My job's total map task count is 2717.
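One option I have read about (a sketch, assuming Hadoop 2.x and plain text input; the 256M cap and the Job variable are illustrative) is to pack the small files into larger splits instead of changing the block size, using org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat:

    // Combine many small files into fewer, larger input splits so the
    // job launches far fewer map tasks than one per file.
    job.setInputFormatClass(CombineTextInputFormat.class);
    CombineTextInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);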

2) Are they all finished? (Check them in the JobTracker through the web UI or the command line.)
Yes, all map tasks are done.
3) How many reducers are in this job?
All reducers are in the running state, and the number of reducers is 15.

4) Have the reducers started? What stage are they in: copying, sorting, or reducing?
I do not know how to judge whether a reducer is in the copying / sorting / reducing stage; the web UI does not tell me. I use the YARN framework, so I read the syslog of the container the reducer runs in (output below). From it I judge that the fetcher threads have fetched all the map outputs, and the map outputs have also been sorted and merged, so what the reducer is doing now is reducing.


2013-12-05 18:14:09,546 INFO [fetcher#3] org.apache.hadoop.mapreduce.task.reduce.Fetcher: fetcher#3 about to shuffle output of map attempt_1386139114497_0034_m_009498_0 decomp: 2 len: 6 to MEMORY
2013-12-05 18:14:09,546 INFO [fetcher#3] org.apache.hadoop.mapreduce.task.reduce.InMemoryMapOutput: Read 2 bytes from map-output for attempt_1386139114497_0034_m_009498_0
2013-12-05 18:14:09,546 INFO [fetcher#3] org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 2, inMemoryMapOutputs.size() -> 5641, commitMemory -> 656458998, usedMemory ->656459000
2013-12-05 18:14:09,547 INFO [fetcher#3] org.apache.hadoop.mapreduce.task.reduce.Fetcher: fetcher#3 about to shuffle output of map attempt_1386139114497_0034_m_009499_0 decomp: 2 len: 6 to MEMORY
2013-12-05 18:14:09,547 INFO [fetcher#3] org.apache.hadoop.mapreduce.task.reduce.InMemoryMapOutput: Read 2 bytes from map-output for attempt_1386139114497_0034_m_009499_0
2013-12-05 18:14:09,547 INFO [fetcher#3] org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 2, inMemoryMapOutputs.size() -> 5642, commitMemory -> 656459000, usedMemory ->656459002
2013-12-05 18:14:09,547 INFO [fetcher#3] org.apache.hadoop.mapreduce.task.reduce.ShuffleScheduler: CHBM224:8080 freed by fetcher#3 in 3s
2013-12-05 18:14:09,547 INFO [EventFetcher for fetching Map Completion Events] org.apache.hadoop.mapreduce.task.reduce.EventFetcher: EventFetcher is interrupted.. Returning
2013-12-05 18:14:09,555 INFO [main] org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl: finalMerge called with 5642 in-memory map-outputs and 6 on-disk map-outputs
2013-12-05 18:14:09,598 INFO [main] org.apache.hadoop.mapred.Merger: Merging 5642 sorted segments
2013-12-05 18:14:09,610 INFO [main] org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 5566 segments left of total size: 656260591 bytes
2013-12-05 18:14:13,614 INFO [main] org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl: Merged 5642 segments, 656459002 bytes to disk to satisfy reduce memory limit
2013-12-05 18:14:13,615 INFO [main] org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl: Merging 7 files, 24536123780 bytes from disk
2013-12-05 18:14:13,628 INFO [main] org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from memory into reduce
2013-12-05 18:14:13,628 INFO [main] org.apache.hadoop.mapred.Merger: Merging 7 sorted segments
2013-12-05 18:14:13,883 INFO [main] org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 7 segments left of total size: 24536123501 bytes
2013-12-05 18:14:14,021 WARN [main] org.apache.hadoop.conf.Configuration: mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords
2013-12-05 18:14:14,076 INFO [main] com.alibaba.dubbo.common.logger.LoggerFactory: using logger: com.alibaba.dubbo.common.logger.log4j.Log4jLoggerAdapter



5) If they are in the reducing stage, check the userlogs of the reducers. Is your code running now?
Where are the userlogs of the reducers located?
On Thu, Dec 12, 2013 at 9:58 AM, java8964 <ja...@hotmail.com> wrote:

>  Assume the block size is 128M, and your mapper each finishes within half
> minute, then there is not too much logic in your mapper, as it can finish
> processing 128M around 30 seconds. If your reducers cannot finish with 1
> week, then something is wrong.
>
> So you may need to find out following:
>
> 1) How many mappers generated in your MR job?
> 2) Are they all finished? (Check them in the jobtracker through web or
> command line)
> 3) How many reducers in this job?
> 4) Are reducers starting? What stage are they in? Copying/Sorting/Reducing?
> 5) If in the reducing stage, check the userlog of reducers. Is your code
> running now?
>
> All these information you can find out from the Job Tracker web UI.
>
> Yong
>
>  ------------------------------
> Date: Thu, 12 Dec 2013 09:03:29 +0800
>
> Subject: Re: issue about Shuffled Maps in MR job summary
> From: justlooks@gmail.com
> To: user@hadoop.apache.org
>
> hi,
>     suppose i have 5-worknode cluster,each worknode can allocate 40G mem
> ,and i do not care map task,be cause the map task in my job finished within
> half a minuter,as my observe the real slow task is reduce, i allocate 12G
> to each reduce task,so each worknode can support 3 reduce parallel,and the
> whole cluster can support 15 reducer,and i run the job with all 15 reducer,
> and i do not know if i increase reducer number from 15 to 30 ,each reduce
> allocate 6G MEM,that will speed the job or not ,the job run on my product
> env, it run nearly 1 week,it still not finished
>
> On Wed, Dec 11, 2013 at 9:50 PM, java8964 <ja...@hotmail.com> wrote:
>
>  The whole job complete time depends on a lot of factors. Are you sure
> the reducers part is the bottleneck?
>
> Also, it also depends on how many Reducer input groups it has in your MR
> job. If you only have 20 reducer groups, even you jump your reducer count
> to 40, then the epoch of reducers part won't have too much change, as the
> additional 20 reducer task won't get data to process.
>
> If you have a lot of reducer input groups, and your cluster does have
> capacity at this time, and your also have a lot idle reducer slot, then
> increase your reducer count should decrease your whole job complete time.
>
> Make sense?
>
> Yong
>
>  ------------------------------
> Date: Wed, 11 Dec 2013 14:20:24 +0800
> Subject: Re: issue about Shuffled Maps in MR job summary
> From: justlooks@gmail.com
> To: user@hadoop.apache.org
>
>
> i read the doc, and find if i have 8 reducer ,a map task will output 8
> partition ,each partition will be send to a different reducer, so if i
> increase reduce number ,the partition number increase ,but the volume on
> network traffic is same,why sometime ,increase reducer number will not
> decrease job complete time ?
>
> On Wed, Dec 11, 2013 at 1:48 PM, Vinayakumar B <vi...@huawei.com>wrote:
>
>  It looks simple, J
>
> Shuffled Maps= Number of Map Tasks * Number of Reducers
>
> Thanks and Regards,
> Vinayakumar B
>
> *From:* ch huang [mailto:justlooks@gmail.com]
> *Sent:* 11 December 2013 10:56
> *To:* user@hadoop.apache.org
> *Subject:* issue about Shuffled Maps in MR job summary
>
> hi,maillist:
>            i run terasort with 16 reducers and 8 reducers,when i double
> reducer number, the Shuffled maps is also double ,my question is the job
> only run 20 map tasks (total input file is 10,and each file is 100M,my
> block size is 64M,so split is 20) why i need shuffle 160 maps in 8 reducers
> run and 320 maps in 16 reducers run?how to caculate the shuffle maps number?
>
> 16 reducer summary output:
>
>
>  Shuffled Maps =320
>
>  8 reducer summary output:
>
> Shuffled Maps =160
>
>
>
>

Re: issue about Shuffled Maps in MR job summary

Posted by ch huang <ju...@gmail.com>.
1) How many mappers generated in your MR job?
i have lots of input file ,each size is below 64m ,if i set the block size
to 128m or 256m,it will help? my job total map tasks number is  2717

 2) Are they all finished? (Check them in the jobtracker through web or
command line)
yes, all map tasks done
3) How many reducers in this job?
all reducer in running state ,and the number of reducer is 15

 4) Are reducers starting? What stage are they in? Copying/Sorting/Reducing?
i do not know how to judge if the reducer is in copying / sorting /
reducing ,the web ui not tell me,i use yarn framework ,and i read the
syslog of the container which reducer running in ,as following output,so i
judge the fetch thread get all output from map task,and the stuff from map
task also sorted and merged, so what the reducer do now is reducing


2013-12-05 18:14:09,546 INFO [fetcher#3]
org.apache.hadoop.mapreduce.task.reduce.Fetcher: fetcher#3 about to
shuffle output of map attempt_1386139114497_0034_m_009498_0 decomp: 2
len: 6 to MEMORY
2013-12-05 18:14:09,546 INFO [fetcher#3]
org.apache.hadoop.mapreduce.task.reduce.InMemoryMapOutput: Read 2
bytes from map-output for attempt_1386139114497_0034_m_009498_0
2013-12-05 18:14:09,546 INFO [fetcher#3]
org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl:
closeInMemoryFile -> map-output of size: 2, inMemoryMapOutputs.size()
-> 5641, commitMemory -> 656458998, usedMemory ->656459000
2013-12-05 18:14:09,547 INFO [fetcher#3]
org.apache.hadoop.mapreduce.task.reduce.Fetcher: fetcher#3 about to
shuffle output of map attempt_1386139114497_0034_m_009499_0 decomp: 2
len: 6 to MEMORY
2013-12-05 18:14:09,547 INFO [fetcher#3]
org.apache.hadoop.mapreduce.task.reduce.InMemoryMapOutput: Read 2
bytes from map-output for attempt_1386139114497_0034_m_009499_0
2013-12-05 18:14:09,547 INFO [fetcher#3]
org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl:
closeInMemoryFile -> map-output of size: 2, inMemoryMapOutputs.size()
-> 5642, commitMemory -> 656459000, usedMemory ->656459002
2013-12-05 18:14:09,547 INFO [fetcher#3]
org.apache.hadoop.mapreduce.task.reduce.ShuffleScheduler: CHBM224:8080
freed by fetcher#3 in 3s
2013-12-05 18:14:09,547 INFO [EventFetcher for fetching Map Completion
Events] org.apache.hadoop.mapreduce.task.reduce.EventFetcher:
EventFetcher is interrupted.. Returning
2013-12-05 18:14:09,555 INFO [main]
org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl: finalMerge
called with 5642 in-memory map-outputs and 6 on-disk map-outputs
2013-12-05 18:14:09,598 INFO [main] org.apache.hadoop.mapred.Merger:
Merging 5642 sorted segments
2013-12-05 18:14:09,610 INFO [main] org.apache.hadoop.mapred.Merger:
Down to the last merge-pass, with 5566 segments left of total size:
656260591 bytes
2013-12-05 18:14:13,614 INFO [main]
org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl: Merged 5642
segments, 656459002 bytes to disk to satisfy reduce memory limit
2013-12-05 18:14:13,615 INFO [main]
org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl: Merging 7
files, 24536123780 bytes from disk
2013-12-05 18:14:13,628 INFO [main]
org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl: Merging 0
segments, 0 bytes from memory into reduce
2013-12-05 18:14:13,628 INFO [main] org.apache.hadoop.mapred.Merger:
Merging 7 sorted segments
2013-12-05 18:14:13,883 INFO [main] org.apache.hadoop.mapred.Merger:
Down to the last merge-pass, with 7 segments left of total size:
24536123501 bytes
2013-12-05 18:14:14,021 WARN [main]
org.apache.hadoop.conf.Configuration: mapred.skip.on is deprecated.
Instead, use mapreduce.job.skiprecords
2013-12-05 18:14:14,076 INFO [main]
com.alibaba.dubbo.common.logger.LoggerFactory: using logger:
com.alibaba.dubbo.common.logger.log4j.Log4jLoggerAdapter



 5) If in the reducing stage, check the userlog of reducers. Is your code
running now?
which the userlog of reducers located?
On Thu, Dec 12, 2013 at 9:58 AM, java8964 <ja...@hotmail.com> wrote:

>  Assume the block size is 128M, and your mapper each finishes within half
> minute, then there is not too much logic in your mapper, as it can finish
> processing 128M around 30 seconds. If your reducers cannot finish with 1
> week, then something is wrong.
>
> So you may need to find out following:
>
> 1) How many mappers generated in your MR job?
> 2) Are they all finished? (Check them in the jobtracker through web or
> command line)
> 3) How many reducers in this job?
> 4) Are reducers starting? What stage are they in? Copying/Sorting/Reducing?
> 5) If in the reducing stage, check the userlog of reducers. Is your code
> running now?
>
> All these information you can find out from the Job Tracker web UI.
>
> Yong
>
>  ------------------------------
> Date: Thu, 12 Dec 2013 09:03:29 +0800
>
> Subject: Re: issue about Shuffled Maps in MR job summary
> From: justlooks@gmail.com
> To: user@hadoop.apache.org
>
> hi,
>     suppose i have 5-worknode cluster,each worknode can allocate 40G mem
> ,and i do not care map task,be cause the map task in my job finished within
> half a minuter,as my observe the real slow task is reduce, i allocate 12G
> to each reduce task,so each worknode can support 3 reduce parallel,and the
> whole cluster can support 15 reducer,and i run the job with all 15 reducer,
> and i do not know if i increase reducer number from 15 to 30 ,each reduce
> allocate 6G MEM,that will speed the job or not ,the job run on my product
> env, it run nearly 1 week,it still not finished
>
> On Wed, Dec 11, 2013 at 9:50 PM, java8964 <ja...@hotmail.com> wrote:
>
>  The whole job complete time depends on a lot of factors. Are you sure
> the reducers part is the bottleneck?
>
> Also, it also depends on how many Reducer input groups it has in your MR
> job. If you only have 20 reducer groups, even you jump your reducer count
> to 40, then the epoch of reducers part won't have too much change, as the
> additional 20 reducer task won't get data to process.
>
> If you have a lot of reducer input groups, and your cluster does have
> capacity at this time, and your also have a lot idle reducer slot, then
> increase your reducer count should decrease your whole job complete time.
>
> Make sense?
>
> Yong
>
>  ------------------------------
> Date: Wed, 11 Dec 2013 14:20:24 +0800
> Subject: Re: issue about Shuffled Maps in MR job summary
> From: justlooks@gmail.com
> To: user@hadoop.apache.org
>
>
> i read the doc, and find if i have 8 reducer ,a map task will output 8
> partition ,each partition will be send to a different reducer, so if i
> increase reduce number ,the partition number increase ,but the volume on
> network traffic is same,why sometime ,increase reducer number will not
> decrease job complete time ?
>
> On Wed, Dec 11, 2013 at 1:48 PM, Vinayakumar B <vi...@huawei.com>wrote:
>
>  It looks simple, J
>
> Shuffled Maps= Number of Map Tasks * Number of Reducers
>
> Thanks and Regards,
> Vinayakumar B
>
> *From:* ch huang [mailto:justlooks@gmail.com]
> *Sent:* 11 December 2013 10:56
> *To:* user@hadoop.apache.org
> *Subject:* issue about Shuffled Maps in MR job summary
>
> hi,maillist:
>            i run terasort with 16 reducers and 8 reducers,when i double
> reducer number, the Shuffled maps is also double ,my question is the job
> only run 20 map tasks (total input file is 10,and each file is 100M,my
> block size is 64M,so split is 20) why i need shuffle 160 maps in 8 reducers
> run and 320 maps in 16 reducers run?how to caculate the shuffle maps number?
>
> 16 reducer summary output:
>
>
>  Shuffled Maps =320
>
>  8 reducer summary output:
>
> Shuffled Maps =160
>
>
>
>
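
For reference: on YARN, the reducer userlogs asked about above are the container logs, kept on the NodeManager hosts under yarn.nodemanager.log-dirs (or fetched with "yarn logs -applicationId <app id>" once log aggregation is enabled). The copying/sorting/reducing question can also be answered programmatically, because a reduce task reports its progress in thirds. A minimal sketch against the Hadoop 2.x client API; the job ID is taken from the log above, substitute your own:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Cluster;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobID;

public class ReducePhaseCheck {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / mapred-site.xml / yarn-site.xml from the classpath
        Cluster cluster = new Cluster(new Configuration());
        Job job = cluster.getJob(JobID.forName("job_1386139114497_0034"));
        if (job == null) {
            System.err.println("job not found (it may have been retired to the history server)");
            return;
        }
        // A reduce task reports progress in thirds:
        // ~0.00-0.33 copy (shuffle), ~0.33-0.66 sort (merge), ~0.66-1.00 reduce()
        System.out.printf("state=%s map=%.0f%% reduce=%.0f%%%n",
                job.getJobState(), job.mapProgress() * 100, job.reduceProgress() * 100);
    }
}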


Re: issue about Shuffled Maps in MR job summary

Posted by ch huang <ju...@gmail.com>.
One important thing is that my input files are very small, each less than
10M, and I have a huge number of files.

On Thu, Dec 12, 2013 at 9:58 AM, java8964 <ja...@hotmail.com> wrote:

>  Assume the block size is 128M, and your mapper each finishes within half
> minute, then there is not too much logic in your mapper, as it can finish
> processing 128M around 30 seconds. If your reducers cannot finish with 1
> week, then something is wrong.
>
> So you may need to find out following:
>
> 1) How many mappers generated in your MR job?
> 2) Are they all finished? (Check them in the jobtracker through web or
> command line)
> 3) How many reducers in this job?
> 4) Are reducers starting? What stage are they in? Copying/Sorting/Reducing?
> 5) If in the reducing stage, check the userlog of reducers. Is your code
> running now?
>
> All these information you can find out from the Job Tracker web UI.
>
> Yong
>
>  ------------------------------
> Date: Thu, 12 Dec 2013 09:03:29 +0800
>
> Subject: Re: issue about Shuffled Maps in MR job summary
> From: justlooks@gmail.com
> To: user@hadoop.apache.org
>
> hi,
>     suppose i have 5-worknode cluster,each worknode can allocate 40G mem
> ,and i do not care map task,be cause the map task in my job finished within
> half a minuter,as my observe the real slow task is reduce, i allocate 12G
> to each reduce task,so each worknode can support 3 reduce parallel,and the
> whole cluster can support 15 reducer,and i run the job with all 15 reducer,
> and i do not know if i increase reducer number from 15 to 30 ,each reduce
> allocate 6G MEM,that will speed the job or not ,the job run on my product
> env, it run nearly 1 week,it still not finished
>
> On Wed, Dec 11, 2013 at 9:50 PM, java8964 <ja...@hotmail.com> wrote:
>
>  The whole job complete time depends on a lot of factors. Are you sure
> the reducers part is the bottleneck?
>
> Also, it also depends on how many Reducer input groups it has in your MR
> job. If you only have 20 reducer groups, even you jump your reducer count
> to 40, then the epoch of reducers part won't have too much change, as the
> additional 20 reducer task won't get data to process.
>
> If you have a lot of reducer input groups, and your cluster does have
> capacity at this time, and your also have a lot idle reducer slot, then
> increase your reducer count should decrease your whole job complete time.
>
> Make sense?
>
> Yong
>
>  ------------------------------
> Date: Wed, 11 Dec 2013 14:20:24 +0800
> Subject: Re: issue about Shuffled Maps in MR job summary
> From: justlooks@gmail.com
> To: user@hadoop.apache.org
>
>
> i read the doc, and find if i have 8 reducer ,a map task will output 8
> partition ,each partition will be send to a different reducer, so if i
> increase reduce number ,the partition number increase ,but the volume on
> network traffic is same,why sometime ,increase reducer number will not
> decrease job complete time ?
>
> On Wed, Dec 11, 2013 at 1:48 PM, Vinayakumar B <vi...@huawei.com> wrote:
>
>  It looks simple. :)
>
> Shuffled Maps= Number of Map Tasks * Number of Reducers
>
> Thanks and Regards,
> Vinayakumar B
>
> *From:* ch huang [mailto:justlooks@gmail.com]
> *Sent:* 11 December 2013 10:56
> *To:* user@hadoop.apache.org
> *Subject:* issue about Shuffled Maps in MR job summary
>
> hi,maillist:
>            i run terasort with 16 reducers and 8 reducers,when i double
> reducer number, the Shuffled maps is also double ,my question is the job
> only run 20 map tasks (total input file is 10,and each file is 100M,my
> block size is 64M,so split is 20) why i need shuffle 160 maps in 8 reducers
> run and 320 maps in 16 reducers run?how to caculate the shuffle maps number?
>
> 16 reducer summary output:
>
>
>  Shuffled Maps =320
>
>  8 reducer summary output:
>
> Shuffled Maps =160
>
>
>
>
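
A huge number of sub-block files means one map task per file with the stock FileInputFormat. As Adam Kawa notes later in this thread, CombineFileInputFormat can pack many small files into each split. A minimal driver sketch, assuming your Hadoop version ships the ready-made CombineTextInputFormat (older versions require subclassing CombineFileInputFormat yourself); the 256MB cap is an illustrative tuning choice, not a default:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombineSmallFilesDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "combine-small-files");
        job.setJarByClass(CombineSmallFilesDriver.class);
        // One split now holds many small files instead of one mapper per file
        job.setInputFormatClass(CombineTextInputFormat.class);
        CombineTextInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);
        CombineTextInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // ... setMapperClass / setReducerClass and key/value types as usual ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}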


RE: issue about Shuffled Maps in MR job summary

Posted by java8964 <ja...@hotmail.com>.
Assuming the block size is 128M and each of your mappers finishes within half a minute, there is not much logic in your mappers, since each one can process 128M in about 30 seconds. If your reducers cannot finish within 1 week, then something is wrong.

So you may need to find out the following:

1) How many mappers were generated in your MR job?
2) Are they all finished? (Check them in the jobtracker through the web or the command line.)
3) How many reducers are in this job?
4) Are the reducers starting? What stage are they in? Copying/Sorting/Reducing?
5) If in the reducing stage, check the userlog of the reducers. Is your code running now?

All this information you can find in the Job Tracker web UI.
Yong

Date: Thu, 12 Dec 2013 09:03:29 +0800
Subject: Re: issue about Shuffled Maps in MR job summary
From: justlooks@gmail.com
To: user@hadoop.apache.org

hi,
    suppose i have 5-worknode cluster,each worknode can allocate 40G mem ,and i do not care map task,be cause the map task in my job finished within half a minuter,as my observe the real slow task is reduce, i allocate 12G to each reduce task,so each worknode can support 3 reduce parallel,and the whole cluster can support 15 reducer,and i run the job with all 15 reducer, and i do not know if i increase reducer number from 15 to 30 ,each reduce allocate 6G MEM,that will speed the job or not ,the job run on my product env, it run nearly 1 week,it still not finished

On Wed, Dec 11, 2013 at 9:50 PM, java8964 <ja...@hotmail.com> wrote:

The whole job complete time depends on a lot of factors. Are you sure the reducers part is the bottleneck?

Also, it also depends on how many Reducer input groups it has in your MR job. If you only have 20 reducer groups, even you jump your reducer count to 40, then the epoch of reducers part won't have too much change, as the additional 20 reducer task won't get data to process.

If you have a lot of reducer input groups, and your cluster does have capacity at this time, and your also have a lot idle reducer slot, then increase your reducer count should decrease your whole job complete time.

Make sense?

Yong

Date: Wed, 11 Dec 2013 14:20:24 +0800
Subject: Re: issue about Shuffled Maps in MR job summary
From: justlooks@gmail.com
To: user@hadoop.apache.org

i read the doc, and find if i have 8 reducer ,a map task will output 8 partition ,each partition will be send to a different reducer, so if i increase reduce number ,the partition number increase ,but the volume on network traffic is same,why sometime ,increase reducer number will not decrease job complete time ?

On Wed, Dec 11, 2013 at 1:48 PM, Vinayakumar B <vi...@huawei.com> wrote:

It looks simple. :)

Shuffled Maps= Number of Map Tasks * Number of Reducers

Thanks and Regards,
Vinayakumar B

From: ch huang [mailto:justlooks@gmail.com]
Sent: 11 December 2013 10:56
To: user@hadoop.apache.org
Subject: issue about Shuffled Maps in MR job summary

hi,maillist:
           i run terasort with 16 reducers and 8 reducers,when i double reducer number, the Shuffled maps is also double ,my question is the job only run 20 map tasks (total input file is 10,and each file is 100M,my block size is 64M,so split is 20) why i need shuffle 160 maps in 8 reducers run and 320 maps in 16 reducers run?how to caculate the shuffle maps number?

16 reducer summary output:

 Shuffled Maps =320

8 reducer summary output:

Shuffled Maps =160

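The counters printed in the job summary, including the Shuffled Maps counter this thread started with, can be read through the same client API instead of the web UI. A minimal sketch, assuming a Hadoop 2.x client and a job that is still known to the cluster or the history server:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Cluster;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobID;
import org.apache.hadoop.mapreduce.TaskCounter;

public class PrintShuffledMaps {
    public static void main(String[] args) throws Exception {
        Cluster cluster = new Cluster(new Configuration());
        Job job = cluster.getJob(JobID.forName(args[0])); // e.g. job_1386139114497_0034
        // SHUFFLED_MAPS counts one fetch per (map task, reduce task) pair,
        // which is why the summary shows #maps * #reducers
        long shuffledMaps = job.getCounters()
                .findCounter(TaskCounter.SHUFFLED_MAPS).getValue();
        System.out.println("Shuffled Maps = " + shuffledMaps);
    }
}
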

Re: issue about Shuffled Maps in MR job summary

Posted by ch huang <ju...@gmail.com>.
hi,
    Suppose I have a 5-worknode cluster and each worknode can allocate 40G
of memory. I do not care about the map tasks, because the map tasks in my
job finish within half a minute; from what I observe, the really slow tasks
are the reduces. I allocate 12G to each reduce task, so each worknode can
run 3 reduces in parallel and the whole cluster can support 15 reducers,
and I run the job with all 15 reducers. What I do not know is whether
increasing the reducer number from 15 to 30, with each reduce allocated 6G
of memory, will speed up the job or not. The job runs in my production
environment; it has been running for nearly 1 week and it still has not
finished.

On Wed, Dec 11, 2013 at 9:50 PM, java8964 <ja...@hotmail.com> wrote:

>  The whole job complete time depends on a lot of factors. Are you sure
> the reducers part is the bottleneck?
>
> Also, it also depends on how many Reducer input groups it has in your MR
> job. If you only have 20 reducer groups, even you jump your reducer count
> to 40, then the epoch of reducers part won't have too much change, as the
> additional 20 reducer task won't get data to process.
>
> If you have a lot of reducer input groups, and your cluster does have
> capacity at this time, and your also have a lot idle reducer slot, then
> increase your reducer count should decrease your whole job complete time.
>
> Make sense?
>
> Yong
>
>  ------------------------------
> Date: Wed, 11 Dec 2013 14:20:24 +0800
> Subject: Re: issue about Shuffled Maps in MR job summary
> From: justlooks@gmail.com
> To: user@hadoop.apache.org
>
>
> i read the doc, and find if i have 8 reducer ,a map task will output 8
> partition ,each partition will be send to a different reducer, so if i
> increase reduce number ,the partition number increase ,but the volume on
> network traffic is same,why sometime ,increase reducer number will not
> decrease job complete time ?
>
> On Wed, Dec 11, 2013 at 1:48 PM, Vinayakumar B <vi...@huawei.com> wrote:
>
>  It looks simple. :)
>
>
>
> Shuffled Maps= Number of Map Tasks * Number of Reducers
>
>
>
> Thanks and Regards,
>
> Vinayakumar B
>
>
>
> *From:* ch huang [mailto:justlooks@gmail.com]
> *Sent:* 11 December 2013 10:56
> *To:* user@hadoop.apache.org
> *Subject:* issue about Shuffled Maps in MR job summary
>
>
>
> hi,maillist:
>
>            i run terasort with 16 reducers and 8 reducers,when i double
> reducer number, the Shuffled maps is also double ,my question is the job
> only run 20 map tasks (total input file is 10,and each file is 100M,my
> block size is 64M,so split is 20) why i need shuffle 160 maps in 8 reducers
> run and 320 maps in 16 reducers run?how to caculate the shuffle maps number?
>
>
>
> 16 reducer summary output:
>
>
>
>
>
>  Shuffled Maps =320
>
>
>
> 8 reducer summary output:
>
>
>
> Shuffled Maps =160
>
>
>
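
For what it's worth, the 15-vs-30 trade-off described above comes down to two knobs in MR2: the per-reduce container size and the job's reducer count. A minimal sketch; the 30-reducers/6G values simply mirror the question above and are not recommendations:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReducerTuning {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("mapreduce.reduce.memory.mb", 6144);     // 6G container per reduce task
        conf.set("mapreduce.reduce.java.opts", "-Xmx5120m"); // JVM heap kept below the container size
        Job job = Job.getInstance(conf, "tuned-reducers");
        job.setNumReduceTasks(30);                           // was 15 reducers at 12G each
        // ... input/output formats, paths, mapper and reducer as usual ...
    }
}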

Re: issue about Shuffled Maps in MR job summary

Posted by Adam Kawa <ka...@gmail.com>.
> why sometime ,increase reducer number will not decrease job complete time ?

Apart from the valid information that Yong wrote in his previous reply,
please note that:

1) You do not want very short-lived (seconds) reduce tasks, because the
overhead of coordinating them, starting JVMs, and setting up the
connections to all map tasks becomes too costly. It depends on your use
case, but usually MapReduce jobs are for batch processing, and at my
company we set the number of reduce tasks to make sure that each task runs
for at least a couple of minutes (for production jobs that are scheduled in
the "background", we aim for ~10 minutes).

2) When you have more reduce tasks, you need more slots (or containers, if
you use YARN). Sometimes you cannot get slots/containers as quickly as you
want, so you can get stuck waiting for more resources. Then the job
completion time extends.

3) If you have thinner reducers, then they probably write smaller output
files to HDFS. Small files are problematic for HDFS (e.g. a higher memory
requirement on the NN, a bigger load on the NN, slower NN restarts, a more
random than streaming access pattern, and more). If the output of that job
is later processed by another job, then you will see thin mappers there
(this can be partially alleviated by CombineFileInputFormat, though).


2013/12/11 java8964 <ja...@hotmail.com>

> The whole job complete time depends on a lot of factors. Are you sure the
> reducers part is the bottleneck?
>
> Also, it also depends on how many Reducer input groups it has in your MR
> job. If you only have 20 reducer groups, even you jump your reducer count
> to 40, then the epoch of reducers part won't have too much change, as the
> additional 20 reducer task won't get data to process.
>
> If you have a lot of reducer input groups, and your cluster does have
> capacity at this time, and your also have a lot idle reducer slot, then
> increase your reducer count should decrease your whole job complete time.
>
> Make sense?
>
> Yong
>
> ------------------------------
> Date: Wed, 11 Dec 2013 14:20:24 +0800
> Subject: Re: issue about Shuffled Maps in MR job summary
> From: justlooks@gmail.com
> To: user@hadoop.apache.org
>
>
> i read the doc, and find if i have 8 reducer ,a map task will output 8
> partition ,each partition will be send to a different reducer, so if i
> increase reduce number ,the partition number increase ,but the volume on
> network traffic is same,why sometime ,increase reducer number will not
> decrease job complete time ?
>
> On Wed, Dec 11, 2013 at 1:48 PM, Vinayakumar B <vi...@huawei.com> wrote:
>
>  It looks simple. :)
>
>
>
> Shuffled Maps= Number of Map Tasks * Number of Reducers
>
>
>
> Thanks and Regards,
>
> Vinayakumar B
>
>
>
> *From:* ch huang [mailto:justlooks@gmail.com]
> *Sent:* 11 December 2013 10:56
> *To:* user@hadoop.apache.org
> *Subject:* issue about Shuffled Maps in MR job summary
>
>
>
> hi,maillist:
>
>            i run terasort with 16 reducers and 8 reducers,when i double
> reducer number, the Shuffled maps is also double ,my question is the job
> only run 20 map tasks (total input file is 10,and each file is 100M,my
> block size is 64M,so split is 20) why i need shuffle 160 maps in 8 reducers
> run and 320 maps in 16 reducers run?how to caculate the shuffle maps number?
>
>
>
> 16 reducer summary output:
>
>
>
>
>
>  Shuffled Maps =320
>
>
>
> 8 reducer summary output:
>
>
>
> Shuffled Maps =160
>
>
>
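
The sizing rule in point 1 (each reduce task should run for at least a couple of minutes) can be turned into a back-of-the-envelope calculation. A hedged sketch: the bytes-per-reducer target is an assumption to tune per cluster, not a Hadoop setting, and capping at the number of containers the cluster can run at once is one design choice among several:

public class ReducerCountEstimator {
    // totalShuffleBytes: total map output headed for the reducers
    // bytesPerReducer:   how much data one reduce task should chew on
    // maxParallel:       reduce slots/containers the cluster can run at once
    public static int estimate(long totalShuffleBytes, long bytesPerReducer, int maxParallel) {
        long wanted = (totalShuffleBytes + bytesPerReducer - 1) / bytesPerReducer; // ceiling
        return (int) Math.min(Math.max(1L, wanted), maxParallel);
    }

    public static void main(String[] args) {
        // ~24.5GB of merged map output (the size in the reducer log earlier in this
        // thread), aiming at ~2GB per reduce task, with 15 reduce containers
        System.out.println(estimate(24536123501L, 2L * 1024 * 1024 * 1024, 15)); // prints 12
    }
}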

Re: issue about Shuffled Maps in MR job summary

Posted by ch huang <ju...@gmail.com>.
hi,
    suppose i have 5-worknode cluster,each worknode can allocate 40G mem
,and i do not care map task,be cause the map task in my job finished within
half a minuter,as my observe the real slow task is reduce, i allocate 12G
to each reduce task,so each worknode can support 3 reduce parallel,and the
whole cluster can support 15 reducer,and i run the job with all 15 reducer,
and i do not know if i increase reducer number from 15 to 30 ,each reduce
allocate 6G MEM,that will speed the job or not ,the job run on my product
env, it run nearly 1 week,it still not finished

On Wed, Dec 11, 2013 at 9:50 PM, java8964 <ja...@hotmail.com> wrote:

>  The whole job complete time depends on a lot of factors. Are you sure
> the reducers part is the bottleneck?
>
> Also, it also depends on how many Reducer input groups it has in your MR
> job. If you only have 20 reducer groups, even you jump your reducer count
> to 40, then the epoch of reducers part won't have too much change, as the
> additional 20 reducer task won't get data to process.
>
> If you have a lot of reducer input groups, and your cluster does have
> capacity at this time, and your also have a lot idle reducer slot, then
> increase your reducer count should decrease your whole job complete time.
>
> Make sense?
>
> Yong
>
>  ------------------------------
> Date: Wed, 11 Dec 2013 14:20:24 +0800
> Subject: Re: issue about Shuffled Maps in MR job summary
> From: justlooks@gmail.com
> To: user@hadoop.apache.org
>
>
> i read the doc, and find if i have 8 reducer ,a map task will output 8
> partition ,each partition will be send to a different reducer, so if i
> increase reduce number ,the partition number increase ,but the volume on
> network traffic is same,why sometime ,increase reducer number will not
> decrease job complete time ?
>
> On Wed, Dec 11, 2013 at 1:48 PM, Vinayakumar B <vi...@huawei.com>wrote:
>
>  It looks simple, J
>
>
>
> Shuffled Maps= Number of Map Tasks * Number of Reducers
>
>
>
> Thanks and Regards,
>
> Vinayakumar B
>
>
>
> *From:* ch huang [mailto:justlooks@gmail.com]
> *Sent:* 11 December 2013 10:56
> *To:* user@hadoop.apache.org
> *Subject:* issue about Shuffled Maps in MR job summary
>
>
>
> hi,maillist:
>
>            i run terasort with 16 reducers and 8 reducers,when i double
> reducer number, the Shuffled maps is also double ,my question is the job
> only run 20 map tasks (total input file is 10,and each file is 100M,my
> block size is 64M,so split is 20) why i need shuffle 160 maps in 8 reducers
> run and 320 maps in 16 reducers run?how to caculate the shuffle maps number?
>
>
>
> 16 reducer summary output:
>
>
>
>
>
>  Shuffled Maps =320
>
>
>
> 8 reducer summary output:
>
>
>
> Shuffled Maps =160
>
>
>

Re: issue about Shuffled Maps in MR job summary

Posted by ch huang <ju...@gmail.com>.
hi,
    suppose i have 5-worknode cluster,each worknode can allocate 40G mem
,and i do not care map task,be cause the map task in my job finished within
half a minuter,as my observe the real slow task is reduce, i allocate 12G
to each reduce task,so each worknode can support 3 reduce parallel,and the
whole cluster can support 15 reducer,and i run the job with all 15 reducer,
and i do not know if i increase reducer number from 15 to 30 ,each reduce
allocate 6G MEM,that will speed the job or not ,the job run on my product
env, it run nearly 1 week,it still not finished

On Wed, Dec 11, 2013 at 9:50 PM, java8964 <ja...@hotmail.com> wrote:

>  The whole job complete time depends on a lot of factors. Are you sure
> the reducers part is the bottleneck?
>
> Also, it also depends on how many Reducer input groups it has in your MR
> job. If you only have 20 reducer groups, even you jump your reducer count
> to 40, then the epoch of reducers part won't have too much change, as the
> additional 20 reducer task won't get data to process.
>
> If you have a lot of reducer input groups, and your cluster does have
> capacity at this time, and your also have a lot idle reducer slot, then
> increase your reducer count should decrease your whole job complete time.
>
> Make sense?
>
> Yong
>
>  ------------------------------
> Date: Wed, 11 Dec 2013 14:20:24 +0800
> Subject: Re: issue about Shuffled Maps in MR job summary
> From: justlooks@gmail.com
> To: user@hadoop.apache.org
>
>
> i read the doc, and find if i have 8 reducer ,a map task will output 8
> partition ,each partition will be send to a different reducer, so if i
> increase reduce number ,the partition number increase ,but the volume on
> network traffic is same,why sometime ,increase reducer number will not
> decrease job complete time ?
>
> On Wed, Dec 11, 2013 at 1:48 PM, Vinayakumar B <vi...@huawei.com>wrote:
>
>  It looks simple, J
>
>
>
> Shuffled Maps= Number of Map Tasks * Number of Reducers
>
>
>
> Thanks and Regards,
>
> Vinayakumar B
>
>
>
> *From:* ch huang [mailto:justlooks@gmail.com]
> *Sent:* 11 December 2013 10:56
> *To:* user@hadoop.apache.org
> *Subject:* issue about Shuffled Maps in MR job summary
>
>
>
> hi,maillist:
>
>            i run terasort with 16 reducers and 8 reducers,when i double
> reducer number, the Shuffled maps is also double ,my question is the job
> only run 20 map tasks (total input file is 10,and each file is 100M,my
> block size is 64M,so split is 20) why i need shuffle 160 maps in 8 reducers
> run and 320 maps in 16 reducers run?how to caculate the shuffle maps number?
>
>
>
> 16 reducer summary output:
>
>
>
>
>
>  Shuffled Maps =320
>
>
>
> 8 reducer summary output:
>
>
>
> Shuffled Maps =160
>
>
>

Re: issue about Shuffled Maps in MR job summary

Posted by Adam Kawa <ka...@gmail.com>.
> why sometime ,increase reducer number will not decrease job complete
time ?

Apart from valid information that Yong wrote in the previous point, please
note that:

1) You do not want to have very shortly lived (seconds) reduce tasks,
because the overhead for coordinating them, starting JVMs, setting up the
connections to all map tasks becomes too costly. It depends on your use
case, but usually MapReduce jobs are for batch processing, and at my
company we set the number of reduce tasks to make sure that each task runs
at least a couple of minutes (for production jobs that are scheduled in
"background", we aim for ~10 minutes).

2) We you have more reduce tasks, then you need more slots (or containers,
if you use YARN). Sometimes, you can not get slots/containers as quick as
you want, so that you can get stuck waiting for more resources. Then job
completion time extends.

3) It you have thinner reducers, then they probably they write smaller
output files to HDFS. Small files are problematic for HDFS (e.g. higher
memory requirement on NN, bigger load on NN, slower NN restarts, more
random than streaming access pattern and more). If the output of that job
is later processed by another job, then you will see thin mappers (this can
be partially alleviated by CombineFileInputFormat, though).


2013/12/11 java8964 <ja...@hotmail.com>

> The whole job complete time depends on a lot of factors. Are you sure the
> reducers part is the bottleneck?
>
> Also, it also depends on how many Reducer input groups it has in your MR
> job. If you only have 20 reducer groups, even you jump your reducer count
> to 40, then the epoch of reducers part won't have too much change, as the
> additional 20 reducer task won't get data to process.
>
> If you have a lot of reducer input groups, and your cluster does have
> capacity at this time, and your also have a lot idle reducer slot, then
> increase your reducer count should decrease your whole job complete time.
>
> Make sense?
>
> Yong
>
> ------------------------------
> Date: Wed, 11 Dec 2013 14:20:24 +0800
> Subject: Re: issue about Shuffled Maps in MR job summary
> From: justlooks@gmail.com
> To: user@hadoop.apache.org
>
>
> i read the doc, and find if i have 8 reducer ,a map task will output 8
> partition ,each partition will be send to a different reducer, so if i
> increase reduce number ,the partition number increase ,but the volume on
> network traffic is same,why sometime ,increase reducer number will not
> decrease job complete time ?
>
> On Wed, Dec 11, 2013 at 1:48 PM, Vinayakumar B <vi...@huawei.com>wrote:
>
>  It looks simple, J
>
>
>
> Shuffled Maps= Number of Map Tasks * Number of Reducers
>
>
>
> Thanks and Regards,
>
> Vinayakumar B
>
>
>
> *From:* ch huang [mailto:justlooks@gmail.com]
> *Sent:* 11 December 2013 10:56
> *To:* user@hadoop.apache.org
> *Subject:* issue about Shuffled Maps in MR job summary
>
>
>
> hi,maillist:
>
>            i run terasort with 16 reducers and 8 reducers,when i double
> reducer number, the Shuffled maps is also double ,my question is the job
> only run 20 map tasks (total input file is 10,and each file is 100M,my
> block size is 64M,so split is 20) why i need shuffle 160 maps in 8 reducers
> run and 320 maps in 16 reducers run?how to caculate the shuffle maps number?
>
>
>
> 16 reducer summary output:
>
>
>
>
>
>  Shuffled Maps =320
>
>
>
> 8 reducer summary output:
>
>
>
> Shuffled Maps =160
>
>
>

Re: issue about Shuffled Maps in MR job summary

Posted by Adam Kawa <ka...@gmail.com>.
> why sometime ,increase reducer number will not decrease job complete
time ?

Apart from valid information that Yong wrote in the previous point, please
note that:

1) You do not want to have very shortly lived (seconds) reduce tasks,
because the overhead for coordinating them, starting JVMs, setting up the
connections to all map tasks becomes too costly. It depends on your use
case, but usually MapReduce jobs are for batch processing, and at my
company we set the number of reduce tasks to make sure that each task runs
at least a couple of minutes (for production jobs that are scheduled in
"background", we aim for ~10 minutes).

2) We you have more reduce tasks, then you need more slots (or containers,
if you use YARN). Sometimes, you can not get slots/containers as quick as
you want, so that you can get stuck waiting for more resources. Then job
completion time extends.

3) It you have thinner reducers, then they probably they write smaller
output files to HDFS. Small files are problematic for HDFS (e.g. higher
memory requirement on NN, bigger load on NN, slower NN restarts, more
random than streaming access pattern and more). If the output of that job
is later processed by another job, then you will see thin mappers (this can
be partially alleviated by CombineFileInputFormat, though).


2013/12/11 java8964 <ja...@hotmail.com>

> The whole job completion time depends on a lot of factors. Are you sure
> the reducers are the bottleneck?
>
> Also, it depends on how many reducer input groups your MR job has. If
> you only have 20 reducer groups, then even if you bump your reducer
> count to 40, the duration of the reduce phase won't change much, as the
> additional 20 reduce tasks won't get any data to process.
>
> If you have a lot of reducer input groups, your cluster has spare
> capacity at this time, and you also have a lot of idle reducer slots,
> then increasing your reducer count should decrease your whole job
> completion time.
>
> Make sense?
>
> Yong
>
> ------------------------------
> Date: Wed, 11 Dec 2013 14:20:24 +0800
> Subject: Re: issue about Shuffled Maps in MR job summary
> From: justlooks@gmail.com
> To: user@hadoop.apache.org
>
> I read the doc and found that if I have 8 reducers, a map task will
> output 8 partitions, and each partition will be sent to a different
> reducer. So if I increase the reducer number, the partition number
> increases, but the volume of network traffic is the same. Why does
> increasing the reducer number sometimes not decrease the job completion
> time?
>
> On Wed, Dec 11, 2013 at 1:48 PM, Vinayakumar B <vi...@huawei.com> wrote:
>
> It looks simple :)
>
> Shuffled Maps = Number of Map Tasks * Number of Reducers
>
> Thanks and Regards,
> Vinayakumar B
>
> *From:* ch huang [mailto:justlooks@gmail.com]
> *Sent:* 11 December 2013 10:56
> *To:* user@hadoop.apache.org
> *Subject:* issue about Shuffled Maps in MR job summary
>
> hi, maillist:
>            I run terasort with 16 reducers and with 8 reducers. When I
> double the reducer number, the Shuffled Maps counter also doubles. My
> question: the job only runs 20 map tasks (the total input is 10 files,
> each file is 100M, my block size is 64M, so there are 20 splits), so why
> do I need to shuffle 160 maps in the 8-reducer run and 320 maps in the
> 16-reducer run? How is the Shuffled Maps number calculated?
>
> 16 reducer summary output:
>
>  Shuffled Maps =320
>
> 8 reducer summary output:
>
> Shuffled Maps =160
>

Re: issue about Shuffled Maps in MR job summary

Posted by ch huang <ju...@gmail.com>.
hi,
    Suppose I have a 5-worknode cluster, and each worknode can allocate
40G of memory. I do not care about the map tasks, because the map tasks in
my job finish within half a minute; as I observe, the really slow tasks
are the reduces. I allocate 12G to each reduce task, so each worknode can
run 3 reduces in parallel and the whole cluster can support 15 reducers,
and I run the job with all 15 reducers. What I do not know is whether
increasing the reducer number from 15 to 30, with each reduce allocated 6G
of memory, will speed the job up or not. The job runs in my production
environment; it has been running for nearly 1 week and still has not
finished.
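
For reference, a minimal sketch of how this sizing could be expressed on
an MR2/YARN job: the property names are the standard MR2 keys, while the
values simply mirror the 15 x 12G scenario above and are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReducerSizingSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // 12 GB YARN container per reduce task.
    conf.setInt("mapreduce.reduce.memory.mb", 12288);
    // Leave headroom for the JVM heap inside the container.
    conf.set("mapreduce.reduce.java.opts", "-Xmx10240m");

    Job job = Job.getInstance(conf, "reducer-sizing-sketch");
    // 15 reducers: 5 worknodes x 3 concurrent 12G reduce containers each.
    job.setNumReduceTasks(15);
    // ... set mapper/reducer classes and input/output paths, then submit.
  }
}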

On Wed, Dec 11, 2013 at 9:50 PM, java8964 <ja...@hotmail.com> wrote:

>  The whole job completion time depends on a lot of factors. Are you
> sure the reducers are the bottleneck?
>
> Also, it depends on how many reducer input groups your MR job has. If
> you only have 20 reducer groups, then even if you bump your reducer
> count to 40, the duration of the reduce phase won't change much, as the
> additional 20 reduce tasks won't get any data to process.
>
> If you have a lot of reducer input groups, your cluster has spare
> capacity at this time, and you also have a lot of idle reducer slots,
> then increasing your reducer count should decrease your whole job
> completion time.
>
> Make sense?
>
> Yong
>
>  ------------------------------
> Date: Wed, 11 Dec 2013 14:20:24 +0800
> Subject: Re: issue about Shuffled Maps in MR job summary
> From: justlooks@gmail.com
> To: user@hadoop.apache.org
>
> I read the doc and found that if I have 8 reducers, a map task will
> output 8 partitions, and each partition will be sent to a different
> reducer. So if I increase the reducer number, the partition number
> increases, but the volume of network traffic is the same. Why does
> increasing the reducer number sometimes not decrease the job completion
> time?
>
> On Wed, Dec 11, 2013 at 1:48 PM, Vinayakumar B <vi...@huawei.com> wrote:
>
> It looks simple :)
>
> Shuffled Maps = Number of Map Tasks * Number of Reducers
>
> Thanks and Regards,
> Vinayakumar B
>
> *From:* ch huang [mailto:justlooks@gmail.com]
> *Sent:* 11 December 2013 10:56
> *To:* user@hadoop.apache.org
> *Subject:* issue about Shuffled Maps in MR job summary
>
> hi, maillist:
>            I run terasort with 16 reducers and with 8 reducers. When I
> double the reducer number, the Shuffled Maps counter also doubles. My
> question: the job only runs 20 map tasks (the total input is 10 files,
> each file is 100M, my block size is 64M, so there are 20 splits), so why
> do I need to shuffle 160 maps in the 8-reducer run and 320 maps in the
> 16-reducer run? How is the Shuffled Maps number calculated?
>
> 16 reducer summary output:
>
>  Shuffled Maps =320
>
> 8 reducer summary output:
>
> Shuffled Maps =160
>

RE: issue about Shuffled Maps in MR job summary

Posted by java8964 <ja...@hotmail.com>.
The whole job completion time depends on a lot of factors. Are you sure the reducers are the bottleneck?
Also, it depends on how many reducer input groups your MR job has. If you only have 20 reducer groups, then even if you bump your reducer count to 40, the duration of the reduce phase won't change much, as the additional 20 reduce tasks won't get any data to process.
If you have a lot of reducer input groups, your cluster has spare capacity at this time, and you also have a lot of idle reducer slots, then increasing your reducer count should decrease your whole job completion time.
Make sense?
Yong
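
A quick way to check the "reducer input groups" figure Yong mentions is
the REDUCE_INPUT_GROUPS counter of a finished job. A minimal sketch,
assuming the MR2 client API; the job ID below is a placeholder you would
take from the job summary or the history UI:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Cluster;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobID;
import org.apache.hadoop.mapreduce.TaskCounter;

public class ReduceGroupsCheck {
  public static void main(String[] args) throws Exception {
    Cluster cluster = new Cluster(new Configuration());
    // Placeholder job ID; substitute a real one.
    Job job = cluster.getJob(JobID.forName("job_1386700000000_0001"));
    long groups = job.getCounters()
        .findCounter(TaskCounter.REDUCE_INPUT_GROUPS).getValue();
    // If this is smaller than the configured reducer count, the extra
    // reducers receive no data and cannot shorten the reduce phase.
    System.out.println("Reduce input groups = " + groups);
  }
}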

Date: Wed, 11 Dec 2013 14:20:24 +0800
Subject: Re: issue about Shuffled Maps in MR job summary
From: justlooks@gmail.com
To: user@hadoop.apache.org

I read the doc and found that if I have 8 reducers, a map task will output
8 partitions, and each partition will be sent to a different reducer. So
if I increase the reducer number, the partition number increases, but the
volume of network traffic is the same. Why does increasing the reducer
number sometimes not decrease the job completion time?

On Wed, Dec 11, 2013 at 1:48 PM, Vinayakumar B <vi...@huawei.com> wrote:

It looks simple :)

Shuffled Maps = Number of Map Tasks * Number of Reducers

Thanks and Regards,
Vinayakumar B

From: ch huang [mailto:justlooks@gmail.com]
Sent: 11 December 2013 10:56
To: user@hadoop.apache.org
Subject: issue about Shuffled Maps in MR job summary

hi, maillist:
           I run terasort with 16 reducers and with 8 reducers. When I
double the reducer number, the Shuffled Maps counter also doubles. My
question: the job only runs 20 map tasks (the total input is 10 files,
each file is 100M, my block size is 64M, so there are 20 splits), so why
do I need to shuffle 160 maps in the 8-reducer run and 320 maps in the
16-reducer run? How is the Shuffled Maps number calculated?

16 reducer summary output:

 Shuffled Maps =320

8 reducer summary output:

Shuffled Maps =160

Re: issue about Shuffled Maps in MR job summary

Posted by ch huang <ju...@gmail.com>.
I read the doc and found that if I have 8 reducers, a map task will output
8 partitions, and each partition will be sent to a different reducer. So
if I increase the reducer number, the partition number increases, but the
volume of network traffic is the same. Why does increasing the reducer
number sometimes not decrease the job completion time?
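
For context, this per-reducer partitioning is what the default
HashPartitioner does. A minimal sketch of its logic (mirroring the stock
class; the sketch class name is made up):

import org.apache.hadoop.mapreduce.Partitioner;

// Every map-side record is assigned to one of numReduceTasks partitions,
// so each map task produces one output partition per reducer.
public class SketchHashPartitioner<K, V> extends Partitioner<K, V> {
  @Override
  public int getPartition(K key, V value, int numReduceTasks) {
    // Mask the sign bit so the modulo result is never negative.
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}

Doubling numReduceTasks doubles the number of partitions (and hence the
Shuffled Maps counter, which counts fetched map-output segments), but each
partition holds proportionally less data, so the total shuffle volume
stays roughly the same.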

On Wed, Dec 11, 2013 at 1:48 PM, Vinayakumar B <vi...@huawei.com> wrote:

>  It looks simple :)
>
> Shuffled Maps = Number of Map Tasks * Number of Reducers
>
> Thanks and Regards,
>
> Vinayakumar B
>
> *From:* ch huang [mailto:justlooks@gmail.com]
> *Sent:* 11 December 2013 10:56
> *To:* user@hadoop.apache.org
> *Subject:* issue about Shuffled Maps in MR job summary
>
> hi, maillist:
>            I run terasort with 16 reducers and with 8 reducers. When I
> double the reducer number, the Shuffled Maps counter also doubles. My
> question: the job only runs 20 map tasks (the total input is 10 files,
> each file is 100M, my block size is 64M, so there are 20 splits), so why
> do I need to shuffle 160 maps in the 8-reducer run and 320 maps in the
> 16-reducer run? How is the Shuffled Maps number calculated?
>
> 16 reducer summary output:
>
>  Shuffled Maps =320
>
> 8 reducer summary output:
>
> Shuffled Maps =160
>

RE: issue about Shuffled Maps in MR job summary

Posted by Vinayakumar B <vi...@huawei.com>.
It looks simple, :)

Shuffled Maps= Number of Map Tasks * Number of Reducers

Thanks and Regards,
Vinayakumar B

From: ch huang [mailto:justlooks@gmail.com]
Sent: 11 December 2013 10:56
To: user@hadoop.apache.org
Subject: issue about Shuffled Maps in MR job summary

hi, maillist:
           I run terasort with 16 reducers and with 8 reducers. When I double the reducer number, the Shuffled Maps counter also doubles. My question: the job only runs 20 map tasks (the total input is 10 files, each file is 100M, my block size is 64M, so there are 20 splits), so why do I need to shuffle 160 maps in the 8-reducer run and 320 maps in the 16-reducer run? How is the Shuffled Maps number calculated?

16 reducer summary output:


 Shuffled Maps =320

8 reducer summary output:

Shuffled Maps =160
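
Plugging the numbers from this thread into that formula, as a quick
sanity check (illustrative snippet only):

public class ShuffledMapsCheck {
  public static void main(String[] args) {
    int mapTasks = 20;                 // 10 files x 100M at a 64M block size => 20 splits
    System.out.println(mapTasks * 16); // 320, matches the 16-reducer summary
    System.out.println(mapTasks * 8);  // 160, matches the 8-reducer summary
  }
}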
