Posted to user@hadoop.apache.org by Gaurav Dasgupta <gd...@gmail.com> on 2012/08/28 09:16:35 UTC

How to reduce total shuffle time

Hi,

I have run some large and small jobs and calculated the Total Shuffle Time
for each. The Total Shuffle Time is almost half of the total time the job
takes to complete.

How can we decrease the Total Shuffle Time? And if we do, what effect will
that have on the job?

Thanks,
Gaurav Dasgupta

Re: How to reduce total shuffle time

Posted by Minh Duc Nguyen <md...@gmail.com>.

Without knowing your exact workload, using a Combiner (if possible) as
Tsuyoshi recommended should decrease your total shuffle time. You can also
try compressing the map output so that there's less disk and network IO.
Here's an example configuration using Snappy:

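// Compress the intermediate map output before it is spilled and shuffled.
conf.set("mapred.compress.map.output", "true");
// Codec used for the compressed map output.
conf.set("mapred.map.output.compression.codec", "org.apache.hadoop.io.compress.SnappyCodec");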
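
To get a rough feel for why compressed map output means less disk and network IO, here is a small standalone sketch. It is only an illustration under assumptions: it uses the JDK's built-in DEFLATE (java.util.zip) as a stand-in because Snappy is not part of the JDK, and the class name, method names and sample records are made up for the example. Real savings depend on your data and codec.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

public class MapOutputSizeDemo {

    // Compress a byte array with DEFLATE at its fastest level.
    // Stand-in for a fast codec such as Snappy (assumption: not the codec Hadoop would use here).
    public static byte[] compress(byte[] raw) {
        Deflater deflater = new Deflater(Deflater.BEST_SPEED);
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(chunk);
            buffer.write(chunk, 0, n);
        }
        deflater.end();
        return buffer.toByteArray();
    }

    // Build hypothetical map output: repetitive "word<TAB>1" records, one per line.
    public static byte[] sampleMapOutput(int records) {
        String[] words = {"hadoop", "shuffle", "reduce", "map"};
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < records; i++) {
            sb.append(words[i % words.length]).append('\t').append(1).append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] raw = sampleMapOutput(10_000);
        byte[] packed = compress(raw);
        System.out.println("raw bytes:        " + raw.length);
        System.out.println("compressed bytes: " + packed.length);
    }
}
```

The trade-off is extra CPU on the map side to compress and on the reduce side to decompress, so it tends to help most when the job is IO-bound rather than CPU-bound.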

HTH,
Minh

On Tue, Aug 28, 2012 at 4:37 AM, Tsuyoshi OZAWA <ozawa.tsuyoshi@lab.ntt.co.jp> wrote:

> It depends on the workload. Could you tell us more about your job? In
> the general case where reducers are the bottleneck, there are some
> tuning techniques:
> 1. Allocate more memory to reducers. This decreases the reducers' disk
> IO when merging and running reduce functions.
> 2. Use a combine function, which enables mapper-side aggregation, if
> your MR job consists of operations that satisfy both the commutative
> and the associative laws.
>
> See also, on combine functions:
> http://wiki.apache.org/hadoop/HadoopMapReduce
>
> Tsuyoshi
>
> On Tuesday, August 28, 2012, Gaurav Dasgupta wrote:
> >
> > Hi,
> >
> > I have run some large and small jobs and calculated the Total Shuffle
> > Time for each. The Total Shuffle Time is almost half of the total time
> > the job takes to complete.
> >
> > How can we decrease the Total Shuffle Time? And if we do, what effect
> > will that have on the job?
> >
> > Thanks,
> > Gaurav Dasgupta

Re: How to reduce total shuffle time

Posted by Tsuyoshi OZAWA <oz...@lab.ntt.co.jp>.

It depends on the workload. Could you tell us more about your job? In
the general case where reducers are the bottleneck, there are some
tuning techniques:
1. Allocate more memory to reducers. This decreases the reducers' disk
IO when merging and running reduce functions.
2. Use a combine function, which enables mapper-side aggregation, if
your MR job consists of operations that satisfy both the commutative
and the associative laws.

See also, on combine functions:
http://wiki.apache.org/hadoop/HadoopMapReduce
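
As a rough illustration of point 2, here is a standalone word-count simulation in plain Java. It does not use the Hadoop API; the class, the method names and the two input splits are all hypothetical. It only shows that when the reduce operation is a sum (commutative and associative), pre-aggregating each mapper's output sends fewer records through the shuffle while leaving the final result unchanged.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CombinerDemo {

    // Map phase: emit one (word, 1) record per token in the split.
    public static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : split.trim().split("\\s+")) {
            out.add(new AbstractMap.SimpleImmutableEntry<>(word, 1));
        }
        return out;
    }

    // Sum the values per key. Used both as the combiner and as the reducer,
    // which is only safe because addition is commutative and associative.
    public static Map<String, Integer> sumByKey(List<Map.Entry<String, Integer>> records) {
        Map<String, Integer> sums = new TreeMap<>();
        for (Map.Entry<String, Integer> r : records) {
            sums.merge(r.getKey(), r.getValue(), Integer::sum);
        }
        return sums;
    }

    // Records that cross the shuffle when every mapper's raw output is sent.
    public static List<Map.Entry<String, Integer>> shuffleWithoutCombiner(List<String> splits) {
        List<Map.Entry<String, Integer>> shuffled = new ArrayList<>();
        for (String split : splits) {
            shuffled.addAll(map(split));
        }
        return shuffled;
    }

    // Records that cross the shuffle when each mapper pre-aggregates its own output.
    public static List<Map.Entry<String, Integer>> shuffleWithCombiner(List<String> splits) {
        List<Map.Entry<String, Integer>> shuffled = new ArrayList<>();
        for (String split : splits) {
            shuffled.addAll(sumByKey(map(split)).entrySet());
        }
        return shuffled;
    }

    public static void main(String[] args) {
        // Two hypothetical input splits, one per mapper.
        List<String> splits = Arrays.asList("to be or not to be", "to do is to be");
        List<Map.Entry<String, Integer>> plain = shuffleWithoutCombiner(splits);
        List<Map.Entry<String, Integer>> combined = shuffleWithCombiner(splits);
        System.out.println("shuffled records without combiner: " + plain.size());
        System.out.println("shuffled records with combiner:    " + combined.size());
        System.out.println("same final result: " + sumByKey(plain).equals(sumByKey(combined)));
    }
}
```

An operation such as a plain average does not have this property, so it cannot be reused as a combiner directly; the usual workaround is to carry (sum, count) pairs and divide only in the reducer.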

Tsuyoshi

On Tuesday, August 28, 2012, Gaurav Dasgupta wrote:
>
> Hi,
>
> I have run some large and small jobs and calculated the Total Shuffle Time for each. The Total Shuffle Time is almost half of the total time the job takes to complete.
>
> How can we decrease the Total Shuffle Time? And if we do, what effect will that have on the job?
>
> Thanks,
> Gaurav Dasgupta

Re: How to reduce total shuffle time

Posted by Gaurav Dasgupta <gd...@gmail.com>.

Hi,

Thanks for your replies. I will try the recommended suggestions and
provide feedback.

Abhi,

1. In the JobTracker Web UI -> Job Tracker History, go to the specific
   job. Go to the Reduce Task List and open the first reduce task attempt.
   The start time shown there is when the shuffle (part of the reduce
   phase) actually starts.
2. Go back to JobTracker Main Page -> Job Tracker History -> the same job.
   Click on "Analyse This Job" and scroll down to the
   "Last Shuffle Finish Time".
3. Calculate the difference between the two times. That is your job's
   Total Shuffle Time.
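
If you want to do that subtraction in code rather than by hand, a minimal sketch could look like this. The timestamp pattern and the two sample values are assumptions made up for the example; adjust the pattern to whatever format your JobTracker pages actually display.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class ShuffleTime {

    // Assumed timestamp layout; change the pattern to match your JobTracker UI.
    static final DateTimeFormatter FORMAT = DateTimeFormatter.ofPattern("uuuu-MM-dd HH:mm:ss");

    // Total shuffle time = last shuffle finish time - start time of the first reduce task attempt.
    public static Duration totalShuffleTime(String firstReduceStart, String lastShuffleFinish) {
        LocalDateTime start = LocalDateTime.parse(firstReduceStart, FORMAT);
        LocalDateTime end = LocalDateTime.parse(lastShuffleFinish, FORMAT);
        return Duration.between(start, end);
    }

    public static void main(String[] args) {
        // Hypothetical values copied by hand from the two pages described above.
        Duration shuffle = totalShuffleTime("2012-08-28 10:02:10", "2012-08-28 10:36:40");
        System.out.println("Total shuffle time: " + shuffle.getSeconds() + " s");
    }
}
```

Note that this measures wall-clock time from the first reducer's start to the last shuffle finish, so it also includes time reducers spend waiting for map tasks to complete.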

Thanks,
Gaurav Dasgupta

On Wed, Aug 29, 2012 at 12:57 AM, abhiTowson cal <ab...@gmail.com> wrote:

> Hi Gaurav,
>
> Can you tell me how you calculated the total shuffle time? Apart from
> combiners and compression, you can also use some shuffle-sort
> parameters that might improve performance; I am not sure exactly
> which parameters to tweak. Please share if you come across any other
> techniques, I am very much interested.
>
> Regards
> Abhi
>
> On Tue, Aug 28, 2012 at 3:16 AM, Gaurav Dasgupta <gd...@gmail.com> wrote:
> > Hi,
> >
> > I have run some large and small jobs and calculated the Total Shuffle
> > Time for each. The Total Shuffle Time is almost half of the total time
> > the job takes to complete.
> >
> > How can we decrease the Total Shuffle Time? And if we do, what effect
> > will that have on the job?
> >
> > Thanks,
> > Gaurav Dasgupta

Re: How to reduce total shuffle time

Posted by abhiTowson cal <ab...@gmail.com>.

Hi Gaurav,

Can you tell me how you calculated the total shuffle time? Apart from
combiners and compression, you can also use some shuffle-sort
parameters that might improve performance; I am not sure exactly
which parameters to tweak. Please share if you come across any other
techniques, I am very much interested.

Regards
Abhi

On Tue, Aug 28, 2012 at 3:16 AM, Gaurav Dasgupta <gd...@gmail.com> wrote:
> Hi,
>
> I have run some large and small jobs and calculated the Total Shuffle Time
> for each. The Total Shuffle Time is almost half of the total time the job
> takes to complete.
>
> How can we decrease the Total Shuffle Time? And if we do, what effect will
> that have on the job?
>
> Thanks,
> Gaurav Dasgupta
