You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@spark.apache.org by Muhammad Haseeb Javed <11...@seecs.edu.pk> on 2015/08/15 22:42:33 UTC

Difference between Sort based and Hash based shuffle

What are the major differences between how Sort based and Hash based
shuffle operate and what is it that cause Sort Shuffle to perform better
than Hash?
Any talks that discuss both shuffles in detail, how they are implemented
and the performance gains ?

Re: Difference between Sort based and Hash based shuffle

Posted by Andrew Or <an...@databricks.com>.
Yes, in other words, a "bucket" is a single file in hash-based shuffle (no
consolidation), but a segment of partitioned file in sort-based shuffle.

2015-08-19 5:52 GMT-07:00 Muhammad Haseeb Javed <11...@seecs.edu.pk>:

> Thanks Andrew for a detailed response,
>
> So the reason why key value pairs with same keys are always found in a
> single buckets in Hash based shuffle but not in Sort is because in
> sort-shuffle each mapper writes a single partitioned file, and it is up to
> the reducer to fetch correct partitions from the the files ?
>
> On Wed, Aug 19, 2015 at 2:13 AM, Andrew Or <an...@databricks.com> wrote:
>
>> Hi Muhammad,
>>
>> On a high level, in hash-based shuffle each mapper M writes R shuffle
>> files, one for each reducer where R is the number of reduce partitions.
>> This results in M * R shuffle files. Since it is not uncommon for M and R
>> to be O(1000), this quickly becomes expensive. An optimization with
>> hash-based shuffle is consolidation, where all mappers run in the same core
>> C write one file per reducer, resulting in C * R files. This is a strict
>> improvement, but it is still relatively expensive.
>>
>> Instead, in sort-based shuffle each mapper writes a single partitioned
>> file. This allows a particular reducer to request a specific portion of
>> each mapper's single output file. In more detail, the mapper first fills up
>> an internal buffer in memory and continually spills the contents of the
>> buffer to disk, then finally merges all the spilled files together to form
>> one final output file. This places much less stress on the file system and
>> requires much fewer I/O operations especially on the read side.
>>
>> -Andrew
>>
>>
>>
>> 2015-08-16 11:08 GMT-07:00 Muhammad Haseeb Javed <
>> 11besemjaved@seecs.edu.pk>:
>>
>>> I did check it out and although I did get a general understanding of the
>>> various classes used to implement Sort and Hash shuffles, however these
>>> slides lack details as to how they are implemented and why sort generally
>>> has better performance than hash
>>>
>>> On Sun, Aug 16, 2015 at 4:31 AM, Ravi Kiran <ra...@gmail.com>
>>> wrote:
>>>
>>>> Have a look at this presentation.
>>>> http://www.slideshare.net/colorant/spark-shuffle-introduction . Can be
>>>> of help to you.
>>>>
>>>> On Sat, Aug 15, 2015 at 1:42 PM, Muhammad Haseeb Javed <
>>>> 11besemjaved@seecs.edu.pk> wrote:
>>>>
>>>>> What are the major differences between how Sort based and Hash based
>>>>> shuffle operate and what is it that cause Sort Shuffle to perform better
>>>>> than Hash?
>>>>> Any talks that discuss both shuffles in detail, how they are
>>>>> implemented and the performance gains ?
>>>>>
>>>>
>>>>
>>>
>>
>

Re: Difference between Sort based and Hash based shuffle

Posted by Muhammad Haseeb Javed <11...@seecs.edu.pk>.
Thanks Andrew for a detailed response,

So the reason why key value pairs with same keys are always found in a
single buckets in Hash based shuffle but not in Sort is because in
sort-shuffle each mapper writes a single partitioned file, and it is up to
the reducer to fetch correct partitions from the the files ?

On Wed, Aug 19, 2015 at 2:13 AM, Andrew Or <an...@databricks.com> wrote:

> Hi Muhammad,
>
> On a high level, in hash-based shuffle each mapper M writes R shuffle
> files, one for each reducer where R is the number of reduce partitions.
> This results in M * R shuffle files. Since it is not uncommon for M and R
> to be O(1000), this quickly becomes expensive. An optimization with
> hash-based shuffle is consolidation, where all mappers run in the same core
> C write one file per reducer, resulting in C * R files. This is a strict
> improvement, but it is still relatively expensive.
>
> Instead, in sort-based shuffle each mapper writes a single partitioned
> file. This allows a particular reducer to request a specific portion of
> each mapper's single output file. In more detail, the mapper first fills up
> an internal buffer in memory and continually spills the contents of the
> buffer to disk, then finally merges all the spilled files together to form
> one final output file. This places much less stress on the file system and
> requires much fewer I/O operations especially on the read side.
>
> -Andrew
>
>
>
> 2015-08-16 11:08 GMT-07:00 Muhammad Haseeb Javed <
> 11besemjaved@seecs.edu.pk>:
>
>> I did check it out and although I did get a general understanding of the
>> various classes used to implement Sort and Hash shuffles, however these
>> slides lack details as to how they are implemented and why sort generally
>> has better performance than hash
>>
>> On Sun, Aug 16, 2015 at 4:31 AM, Ravi Kiran <ra...@gmail.com>
>> wrote:
>>
>>> Have a look at this presentation.
>>> http://www.slideshare.net/colorant/spark-shuffle-introduction . Can be
>>> of help to you.
>>>
>>> On Sat, Aug 15, 2015 at 1:42 PM, Muhammad Haseeb Javed <
>>> 11besemjaved@seecs.edu.pk> wrote:
>>>
>>>> What are the major differences between how Sort based and Hash based
>>>> shuffle operate and what is it that cause Sort Shuffle to perform better
>>>> than Hash?
>>>> Any talks that discuss both shuffles in detail, how they are
>>>> implemented and the performance gains ?
>>>>
>>>
>>>
>>
>

Re: Difference between Sort based and Hash based shuffle

Posted by Andrew Or <an...@databricks.com>.
Hi Muhammad,

On a high level, in hash-based shuffle each mapper M writes R shuffle
files, one for each reducer where R is the number of reduce partitions.
This results in M * R shuffle files. Since it is not uncommon for M and R
to be O(1000), this quickly becomes expensive. An optimization with
hash-based shuffle is consolidation, where all mappers run in the same core
C write one file per reducer, resulting in C * R files. This is a strict
improvement, but it is still relatively expensive.

Instead, in sort-based shuffle each mapper writes a single partitioned
file. This allows a particular reducer to request a specific portion of
each mapper's single output file. In more detail, the mapper first fills up
an internal buffer in memory and continually spills the contents of the
buffer to disk, then finally merges all the spilled files together to form
one final output file. This places much less stress on the file system and
requires much fewer I/O operations especially on the read side.

-Andrew



2015-08-16 11:08 GMT-07:00 Muhammad Haseeb Javed <11...@seecs.edu.pk>
:

> I did check it out and although I did get a general understanding of the
> various classes used to implement Sort and Hash shuffles, however these
> slides lack details as to how they are implemented and why sort generally
> has better performance than hash
>
> On Sun, Aug 16, 2015 at 4:31 AM, Ravi Kiran <ra...@gmail.com>
> wrote:
>
>> Have a look at this presentation.
>> http://www.slideshare.net/colorant/spark-shuffle-introduction . Can be
>> of help to you.
>>
>> On Sat, Aug 15, 2015 at 1:42 PM, Muhammad Haseeb Javed <
>> 11besemjaved@seecs.edu.pk> wrote:
>>
>>> What are the major differences between how Sort based and Hash based
>>> shuffle operate and what is it that cause Sort Shuffle to perform better
>>> than Hash?
>>> Any talks that discuss both shuffles in detail, how they are implemented
>>> and the performance gains ?
>>>
>>
>>
>

Re: Difference between Sort based and Hash based shuffle

Posted by Muhammad Haseeb Javed <11...@seecs.edu.pk>.
I did check it out and although I did get a general understanding of the
various classes used to implement Sort and Hash shuffles, however these
slides lack details as to how they are implemented and why sort generally
has better performance than hash

On Sun, Aug 16, 2015 at 4:31 AM, Ravi Kiran <ra...@gmail.com>
wrote:

> Have a look at this presentation.
> http://www.slideshare.net/colorant/spark-shuffle-introduction . Can be of
> help to you.
>
> On Sat, Aug 15, 2015 at 1:42 PM, Muhammad Haseeb Javed <
> 11besemjaved@seecs.edu.pk> wrote:
>
>> What are the major differences between how Sort based and Hash based
>> shuffle operate and what is it that cause Sort Shuffle to perform better
>> than Hash?
>> Any talks that discuss both shuffles in detail, how they are implemented
>> and the performance gains ?
>>
>
>

Re: Difference between Sort based and Hash based shuffle

Posted by Ravi Kiran <ra...@gmail.com>.
Have a look at this presentation.
http://www.slideshare.net/colorant/spark-shuffle-introduction . Can be of
help to you.

On Sat, Aug 15, 2015 at 1:42 PM, Muhammad Haseeb Javed <
11besemjaved@seecs.edu.pk> wrote:

> What are the major differences between how Sort based and Hash based
> shuffle operate and what is it that cause Sort Shuffle to perform better
> than Hash?
> Any talks that discuss both shuffles in detail, how they are implemented
> and the performance gains ?
>