Posted to user@hadoop.apache.org by John Lilley <jo...@redpoint.net> on 2013/06/11 18:00:45 UTC

Shuffle design: optimization tradeoffs

I am curious about the tradeoffs that drove design of the partition/sort/shuffle (Elephant book p 208).  Doubtless this has been tuned and measured and retuned, but I'd like to know what observations came about during the iterative optimization process to drive the final design.  For example:

*         Why does the mapper output create a single ordered file containing all partitions, as opposed to a file per group of partitions (which would seem to lend itself better to multi-core scaling), or even a file per partition?  (See the partitioner sketch at the end of this message.)

*         Why does the max number of streams to merge at once (io.sort.factor) default to 10?  Is this obsolete?  In my experience, so long as you have memory to buffer each input at 1MB or so, the merger is more efficient as a single phase.  (See the merge sketch at the end of this message.)

*         Why does the mapper do a final merge of the spill files to disk, instead of having the auxiliary process (in YARN) merge and stream data on the fly?

*         Why do mappers sort the tuples, as opposed to only partitioning them and letting the reducers do the sorting?
Sorry if this is overly academic, but I'm sure a lot of people put a lot of time into the tuning effort, and I hope they left a record of their efforts.
Thanks
John
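
A minimal illustration for the first question above, on how map-output
records get partition numbers. The arithmetic mirrors Hadoop's default
HashPartitioner; the class itself is a stand-alone sketch, not Hadoop's
actual spill/index code. The design point it hints at: once every record
carries a partition number, one sorted file plus a small per-partition
index lets each reducer fetch a single contiguous byte range, instead of
every mapper creating one small file per reducer.

// Stand-alone sketch: assign each map-output key a reducer (partition)
// number. The formula matches Hadoop's default HashPartitioner; the rest
// is illustrative only.
public class PartitionSketch {

    // Mask the sign bit so the result is non-negative, then take the
    // remainder modulo the number of reduce tasks.
    static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        String[] keys = { "apple", "banana", "cherry", "date" };
        int reducers = 3;
        for (String k : keys) {
            System.out.println(k + " -> partition " + getPartition(k, reducers));
        }
    }
}

And for the io.sort.factor question, a minimal k-way merge over a priority
queue (plain Java, not Hadoop's Merger, which also handles raw comparators,
compression, and disk-backed segments). It shows why a single merge pass
over many inputs stays cheap as long as each input keeps a modest read
buffer: the per-record cost is only O(log k) in the heap.

import java.util.*;

// Each iterator stands in for one sorted spill segment read through a
// small buffer; the heap always yields the globally smallest head element.
public class KWayMergeSketch {

    public static List<Integer> merge(List<Iterator<Integer>> streams) {
        // Heap entry: { current head value, index of the stream it came from }
        PriorityQueue<int[]> heap =
            new PriorityQueue<int[]>((a, b) -> Integer.compare(a[0], b[0]));
        for (int i = 0; i < streams.size(); i++) {
            if (streams.get(i).hasNext()) {
                heap.add(new int[] { streams.get(i).next(), i });
            }
        }
        List<Integer> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] top = heap.poll();
            out.add(top[0]);
            Iterator<Integer> src = streams.get(top[1]);
            if (src.hasNext()) {
                heap.add(new int[] { src.next(), top[1] });
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Iterator<Integer>> spills = Arrays.asList(
            Arrays.asList(1, 4, 9).iterator(),
            Arrays.asList(2, 3, 8).iterator(),
            Arrays.asList(5, 6, 7).iterator());
        System.out.println(merge(spills)); // [1, 2, 3, 4, 5, 6, 7, 8, 9]
    }
}

The low io.sort.factor default presumably reflects the cost of giving each
open stream its own buffer and of seeking among many files on a single
spinning disk, rather than any inherent limit of the merge itself.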


Re: Shuffle design: optimization tradeoffs

Posted by Bertrand Dechoux <de...@gmail.com>.
On the academic side, you might be interested in reading about *resilient
distributed datasets (RDDs)*:
http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
Not exactly the same subject, but it has the merit of pointing out that a
solution depends on its context.

Bertrand


On Sat, Jun 15, 2013 at 3:39 PM, John Lilley <jo...@redpoint.net> wrote:

> Albert,
> Thanks for the link.  This is indeed what I am talking about.
> The authors have taken the idea even further, avoiding disk writes on
> either the mapper or reducer side.  It's not clear to me that this scales
> well to 1000s of nodes however, as the downside to not landing data on disk
> on the reducer side is that it would seem to impose at least one of the
> following requirements:
> -- a lot of memory on the reducer side
> -- reducers keep connections open to retrieve map file data
> -- reducer/map-file connections are juggled so as to avoid keeping too
> many open at once.
> John
>
> -----Original Message-----
> From: Albert Chu [mailto:chu11@llnl.gov]
> Sent: Wednesday, June 12, 2013 2:27 PM
> To: user@hadoop.apache.org
> Subject: RE: Shuffle design: optimization tradeoffs
>
> On Wed, 2013-06-12 at 18:08 +0000, John Lilley wrote:
> > In reading this link as well as the sailfish report, it strikes me
> > that Hadoop skipped a potentially significant optimization.  Namely,
> > why are multiple sorted spill files merged into a single output file?
> > Why not have the auxiliary service merge on the fly, thus avoiding
> > landing them to disk?
>
> I believe what you're talking about/suggesting is similar to what's
> discussed in this paper?
>
> http://pasl.eng.auburn.edu/pubs/sc11-netlev.pdf
>
> Al
>
> > Was this considered and rejected due to placing memory/CPU
> > requirements on the auxiliary service?  I am assuming that whether the
> > merge was done on disk or in a stream, it would require
> > decompression/recompression of the data.
> > John
> >
> >
> > -----Original Message-----
> > From: Albert Chu [mailto:chu11@llnl.gov]
> > Sent: Tuesday, June 11, 2013 3:32 PM
> > To: user@hadoop.apache.org
> > Subject: Re: Shuffle design: optimization tradeoffs
> >
> > On Tue, 2013-06-11 at 16:00 +0000, John Lilley wrote:
> > > I am curious about the tradeoffs that drove design of the
> > > partition/sort/shuffle (Elephant book p 208).  Doubtless this has
> > > been tuned and measured and retuned, but I’d like to know what
> > > observations came about during the iterative optimization process to
> > > drive the final design.  For example:
> > >
> > > ·        Why does the mapper output create a single ordered file
> > > containing all partitions, as opposed to a file per group of
> > > partitions (which would seem to lend itself better to multi-core
> > > scaling), or even a file per partition?
> >
> > I researched this awhile back wondering the same thing, and found this
> > JIRA
> >
> > https://issues.apache.org/jira/browse/HADOOP-331
> >
> > Al
> >
> > > ·        Why does the max number of streams to merge at once
> > > (io.sort.factor) default to 10?  Is this obsolete?  In my
> > > experience, so long as you have memory to buffer each input at 1MB
> > > or so, the merger is more efficient as a single phase.
> > >
> > > ·        Why does the mapper do a final merge of the spill files to
> > > disk, instead of having the auxiliary process (in YARN) merge and
> > > stream data on the fly?
> > >
> > > ·        Why do mappers sort the tuples, as opposed to only
> > > partitioning them and letting the reducers do the sorting?
> > >
> > > Sorry if this is overly academic, but I’m sure a lot of people put a
> > > lot of time into the tuning effort, and I hope they left a record of
> > > their efforts.
> > >
> > > Thanks
> > >
> > > John
> > >
> > >
> > >
> > >
> > --
> > Albert Chu
> > chu11@llnl.gov
> > Computer Scientist
> > High Performance Systems Division
> > Lawrence Livermore National Laboratory
> >
> >
> --
> Albert Chu
> chu11@llnl.gov
> Computer Scientist
> High Performance Systems Division
> Lawrence Livermore National Laboratory
>
>
>


-- 
Bertrand Dechoux

RE: Shuffle design: optimization tradeoffs

Posted by John Lilley <jo...@redpoint.net>.
Albert,
Thanks for the link.  This is indeed what I am talking about.
The authors have taken the idea even further, avoiding disk writes on either the mapper or reducer side.  It's not clear to me that this scales well to 1000s of nodes, however, as the downside to not landing data on disk on the reducer side is that it would seem to impose at least one of the following requirements:
-- a lot of memory on the reducer side
-- reducers keep connections open to retrieve map file data
-- reducer/map-file connections are juggled so as to avoid keeping too many open at once (see the sketch below).
John
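
A minimal sketch of that third option: bounding how many map-output
connections a reducer keeps open at once. The MapReduce shuffle does
something similar with a configurable number of parallel copier threads
(mapred.reduce.parallel.copies); the class and names below are made up
purely for illustration.

import java.util.concurrent.*;

// A fixed-size thread pool caps how many "fetches" run concurrently,
// standing in for a cap on open reducer/map-output connections.
public class BoundedFetchSketch {

    public static void main(String[] args) throws Exception {
        int mapOutputs = 20;         // pretend there are 20 map outputs to pull
        int maxOpenConnections = 5;  // at most 5 simultaneous fetches
        ExecutorService pool = Executors.newFixedThreadPool(maxOpenConnections);
        CountDownLatch done = new CountDownLatch(mapOutputs);

        for (int i = 0; i < mapOutputs; i++) {
            final int mapId = i;
            pool.submit(() -> {
                try {
                    // Stand-in for opening a connection and streaming one
                    // map output; the pool size bounds how many are in flight.
                    Thread.sleep(50);
                    System.out.println("fetched map output " + mapId);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    done.countDown();
                }
            });
        }
        done.await();
        pool.shutdown();
    }
}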

-----Original Message-----
From: Albert Chu [mailto:chu11@llnl.gov] 
Sent: Wednesday, June 12, 2013 2:27 PM
To: user@hadoop.apache.org
Subject: RE: Shuffle design: optimization tradeoffs

On Wed, 2013-06-12 at 18:08 +0000, John Lilley wrote:
> In reading this link as well as the sailfish report, it strikes me 
> that Hadoop skipped a potentially significant optimization.  Namely, 
> why are multiple sorted spill files merged into a single output file?
> Why not have the auxiliary service merge on the fly, thus avoiding 
> landing them to disk?

I believe what you're talking about/suggesting is similar to what's discussed in this paper?

http://pasl.eng.auburn.edu/pubs/sc11-netlev.pdf

Al

> Was this considered and rejected due to placing memory/CPU 
> requirements on the auxiliary service?  I am assuming that whether the 
> merge was done on disk or in a stream, it would require 
> decompression/recompression of the data.
> John
> 
> 
> -----Original Message-----
> From: Albert Chu [mailto:chu11@llnl.gov]
> Sent: Tuesday, June 11, 2013 3:32 PM
> To: user@hadoop.apache.org
> Subject: Re: Shuffle design: optimization tradeoffs
> 
> On Tue, 2013-06-11 at 16:00 +0000, John Lilley wrote:
> > I am curious about the tradeoffs that drove design of the 
> > partition/sort/shuffle (Elephant book p 208).  Doubtless this has 
> > been tuned and measured and retuned, but I’d like to know what 
> > observations came about during the iterative optimization process to 
> > drive the final design.  For example:
> > 
> > ·        Why does the mapper output create a single ordered file
> > containing all partitions, as opposed to a file per group of 
> > partitions (which would seem to lend itself better to multi-core 
> > scaling), or even a file per partition?
> 
> I researched this awhile back wondering the same thing, and found this 
> JIRA
> 
> https://issues.apache.org/jira/browse/HADOOP-331
> 
> Al
> 
> > ·        Why does the max number of streams to merge at once
> > (io.sort.factor) default to 10?  Is this obsolete?  In my 
> > experience, so long as you have memory to buffer each input at 1MB 
> > or so, the merger is more efficient as a single phase.
> > 
> > ·        Why does the mapper do a final merge of the spill files to
> > disk, instead of having the auxiliary process (in YARN) merge and 
> > stream data on the fly?
> > 
> > ·        Why do mappers sort the tuples, as opposed to only
> > partitioning them and letting the reducers do the sorting?
> > 
> > Sorry if this is overly academic, but I’m sure a lot of people put a 
> > lot of time into the tuning effort, and I hope they left a record of 
> > their efforts.
> > 
> > Thanks
> > 
> > John
> > 
> >  
> > 
> > 
> --
> Albert Chu
> chu11@llnl.gov
> Computer Scientist
> High Performance Systems Division
> Lawrence Livermore National Laboratory
> 
> 
--
Albert Chu
chu11@llnl.gov
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory



RE: Shuffle design: optimization tradeoffs

Posted by Albert Chu <ch...@llnl.gov>.
On Wed, 2013-06-12 at 18:08 +0000, John Lilley wrote:
> In reading this link as well as the sailfish report, it strikes me
> that Hadoop skipped a potentially significant optimization.  Namely,
> why are multiple sorted spill files merged into a single output file?
> Why not have the auxiliary service merge on the fly, thus avoiding
> landing them to disk?  

I believe what you're talking about/suggesting is similar to what's
discussed in this paper?

http://pasl.eng.auburn.edu/pubs/sc11-netlev.pdf

Al

> Was this considered and rejected due to placing memory/CPU
> requirements on the auxiliary service?  I am assuming that whether the
> merge was done on disk or in a stream, it would require
> decompression/recompression of the data.
> John
> 
> 
> -----Original Message-----
> From: Albert Chu [mailto:chu11@llnl.gov] 
> Sent: Tuesday, June 11, 2013 3:32 PM
> To: user@hadoop.apache.org
> Subject: Re: Shuffle design: optimization tradeoffs
> 
> On Tue, 2013-06-11 at 16:00 +0000, John Lilley wrote:
> > I am curious about the tradeoffs that drove design of the 
> > partition/sort/shuffle (Elephant book p 208).  Doubtless this has been 
> > tuned and measured and retuned, but I’d like to know what observations 
> > came about during the iterative optimization process to drive the 
> > final design.  For example:
> > 
> > ·        Why does the mapper output create a single ordered file
> > containing all partitions, as opposed to a file per group of 
> > partitions (which would seem to lend itself better to multi-core 
> > scaling), or even a file per partition?
> 
> I researched this awhile back wondering the same thing, and found this JIRA
> 
> https://issues.apache.org/jira/browse/HADOOP-331
> 
> Al
> 
> > ·        Why does the max number of streams to merge at once
> > (io.sort.factor) default to 10?  Is this obsolete?  In my experience, 
> > so long as you have memory to buffer each input at 1MB or so, the 
> > merger is more efficient as a single phase.
> > 
> > ·        Why does the mapper do a final merge of the spill files to
> > disk, instead of having the auxiliary process (in YARN) merge and 
> > stream data on the fly?
> > 
> > ·        Why do mappers sort the tuples, as opposed to only
> > partitioning them and letting the reducers do the sorting?
> > 
> > Sorry if this is overly academic, but I’m sure a lot of people put a 
> > lot of time into the tuning effort, and I hope they left a record of 
> > their efforts.
> > 
> > Thanks
> > 
> > John
> > 
> >  
> > 
> > 
> --
> Albert Chu
> chu11@llnl.gov
> Computer Scientist
> High Performance Systems Division
> Lawrence Livermore National Laboratory
> 
> 
-- 
Albert Chu
chu11@llnl.gov
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory



RE: Shuffle design: optimization tradeoffs

Posted by John Lilley <jo...@redpoint.net>.
In reading this link as well as the sailfish report, it strikes me that Hadoop skipped a potentially significant optimization.  Namely, why are multiple sorted spill files merged into a single output file?  Why not have the auxiliary service merge on the fly, thus avoiding landing them to disk?  Was this considered and rejected due to placing memory/CPU requirements on the auxiliary service?  I am assuming that whether the merge was done on disk or in a stream, it would require decompression/recompression of the data.
John
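
Short of changing the auxiliary service, the knobs that bear most directly
on how much spill/merge work hits disk are the map-side sort buffer and the
merge fan-in. A minimal sketch, assuming the classic property names in use
around 2013 (io.sort.mb and io.sort.factor; newer releases spell them
mapreduce.task.io.sort.mb and mapreduce.task.io.sort.factor):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Larger io.sort.mb means fewer spill files per map task; larger
// io.sort.factor lets the final merge consume more of them in one pass.
public class ShuffleTuningSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("io.sort.mb", 512);     // in-memory sort buffer, in MB
        conf.setInt("io.sort.factor", 64);  // merge up to 64 streams at once
        Job job = Job.getInstance(conf, "shuffle-tuning-sketch");
        // ... set mapper, reducer, input/output paths as usual ...
        System.out.println("io.sort.factor = " + conf.get("io.sort.factor"));
    }
}

Neither setting removes the decompress/recompress step for compressed map
output, which applies whether the merge happens on disk or in a stream.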


-----Original Message-----
From: Albert Chu [mailto:chu11@llnl.gov] 
Sent: Tuesday, June 11, 2013 3:32 PM
To: user@hadoop.apache.org
Subject: Re: Shuffle design: optimization tradeoffs

On Tue, 2013-06-11 at 16:00 +0000, John Lilley wrote:
> I am curious about the tradeoffs that drove design of the 
> partition/sort/shuffle (Elephant book p 208).  Doubtless this has been 
> tuned and measured and retuned, but I’d like to know what observations 
> came about during the iterative optimization process to drive the 
> final design.  For example:
> 
> ·        Why does the mapper output create a single ordered file
> containing all partitions, as opposed to a file per group of 
> partitions (which would seem to lend itself better to multi-core 
> scaling), or even a file per partition?

I researched this awhile back wondering the same thing, and found this JIRA

https://issues.apache.org/jira/browse/HADOOP-331

Al

> ·        Why does the max number of streams to merge at once
> (io.sort.factor) default to 10?  Is this obsolete?  In my experience, 
> so long as you have memory to buffer each input at 1MB or so, the 
> merger is more efficient as a single phase.
> 
> ·        Why does the mapper do a final merge of the spill files to
> disk, instead of having the auxiliary process (in YARN) merge and 
> stream data on the fly?
> 
> ·        Why do mappers sort the tuples, as opposed to only
> partitioning them and letting the reducers do the sorting?
> 
> Sorry if this is overly academic, but I’m sure a lot of people put a 
> lot of time into the tuning effort, and I hope they left a record of 
> their efforts.
> 
> Thanks
> 
> John
> 
>  
> 
> 
--
Albert Chu
chu11@llnl.gov
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory



Re: Shuffle design: optimization tradeoffs

Posted by Albert Chu <ch...@llnl.gov>.
On Tue, 2013-06-11 at 16:00 +0000, John Lilley wrote:
> I am curious about the tradeoffs that drove design of the
> partition/sort/shuffle (Elephant book p 208).  Doubtless this has been
> tuned and measured and retuned, but I’d like to know what observations
> came about during the iterative optimization process to drive the
> final design.  For example:
> 
> ·        Why does the mapper output create a single ordered file
> containing all partitions, as opposed to a file per group of
> partitions (which would seem to lend itself better to multi-core
> scaling), or even a file per partition?

I researched this awhile back wondering the same thing, and found this
JIRA

https://issues.apache.org/jira/browse/HADOOP-331

Al

> ·        Why does the max number of streams to merge at once
> (io.sort.factor) default to 10?  Is this obsolete?  In my experience,
> so long as you have memory to buffer each input at 1MB or so, the
> merger is more efficient as a single phase.
> 
> ·        Why does the mapper do a final merge of the spill files to
> disk, instead of having the auxiliary process (in YARN) merge and
> stream data on the fly?
> 
> ·        Why do mappers sort the tuples, as opposed to only
> partitioning them and letting the reducers do the sorting?
> 
> Sorry if this is overly academic, but I’m sure a lot of people put a
> lot of time into the tuning effort, and I hope they left a record of
> their efforts.
> 
> Thanks
> 
> John
> 
>  
> 
> 
-- 
Albert Chu
chu11@llnl.gov
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory


