Posted to dev@spark.apache.org by Jacek Laskowski <ja...@japila.pl> on 2016/12/28 09:21:58 UTC

Why is spark.shuffle.sort.bypassMergeThreshold 200?

Hi,

I'm wondering what's so special about 200 that it became the default value
of spark.shuffle.sort.bypassMergeThreshold.

Is this an arbitrary number? Is there any theory behind it?

Is the number of partitions in Spark SQL, i.e. 200, somehow related to
spark.shuffle.sort.bypassMergeThreshold?

scala> spark.range(5).groupByKey(_ % 5).count.rdd.getNumPartitions
res3: Int = 200
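
(For context, the 200 in res3 above is just the spark.sql.shuffle.partitions
default. A quick illustrative 2.x shell session, so treat the exact res
numbers as approximate:

scala> spark.conf.get("spark.sql.shuffle.partitions")
res4: String = 200

scala> spark.conf.set("spark.sql.shuffle.partitions", 8)

scala> spark.range(5).groupByKey(_ % 5).count.rdd.getNumPartitions
res6: Int = 8

What I can't tell is whether that SQL default and the threshold's 200 were
picked together or just happen to match.)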

I'd appreciate any guidance to get the gist of this seemingly magic
number. Thanks!

Pozdrawiam,
Jacek Laskowski
----
https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org


Re: Why is spark.shuffle.sort.bypassMergeThreshold 200?

Posted by Kay Ousterhout <ke...@eecs.berkeley.edu>.
I believe that these two were indeed originally related. In the old
hash-based shuffle, we wrote objects out to disk immediately as they were
generated by an RDD's iterator. With the original version of the new
sort-based shuffle, on the other hand, Spark buffered a bunch of objects
in memory before writing them out to disk. My vague memory is that this
caused issues for Spark SQL: I think SQL got a performance improvement
from re-using the same objects when generating data from the iterator, but
if it re-used objects, the sort-based shuffle didn't work, because all of
the buffered objects would incorrectly point to the same underlying
object. So the default was set to 200 so that SQL, with its 200 shuffle
partitions, wouldn't use the sort-based path. My memory is that the issues
around this have since been fixed, but Michael / Reynold / Andrew Or
probably have a better memory of this than I do.
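
To make that concrete, here is a rough sketch of the bypass decision as I
remember it (the helper name and signature are illustrative, not the actual
Spark source):

  // Bypass the buffering/sort path only when there is no map-side combine
  // and the reduce side is small enough.
  def shouldBypassMergeSort(numPartitions: Int,
                            mapSideCombine: Boolean,
                            threshold: Int = 200): Boolean =
    !mapSideCombine && numPartitions <= threshold

  shouldBypassMergeSort(200, mapSideCombine = false) // true: SQL's 200 partitions bypass the sort
  shouldBypassMergeSort(201, mapSideCombine = false) // false: buffer and sort

And the object-reuse problem in a nutshell: if the iterator keeps handing
back the same mutable object and the shuffle buffers references rather than
copies, every buffered entry ends up reflecting the last write:

  class ReusedRow(var value: Int)
  val row = new ReusedRow(0)
  val buffered = (1 to 3).map { i => row.value = i; row } // same object buffered 3 times
  buffered.map(_.value)                                   // Vector(3, 3, 3), not Vector(1, 2, 3)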

-Kay

On Wed, Dec 28, 2016 at 7:05 PM, Liang-Chi Hsieh <vi...@gmail.com> wrote:

>
> This https://github.com/apache/spark/pull/1799 seems to be the first PR to
> introduce this number, but there is no explanation of why 200 was chosen.
>
>
> Jacek Laskowski wrote
> > Hi,
> >
> > I'm wondering what's so special about 200 that it became the default value
> > of spark.shuffle.sort.bypassMergeThreshold.
> >
> > Is this an arbitrary number? Is there any theory behind it?
> >
> > Is the number of partitions in Spark SQL, i.e. 200, somehow related to
> > spark.shuffle.sort.bypassMergeThreshold?
> >
> > scala> spark.range(5).groupByKey(_ % 5).count.rdd.getNumPartitions
> > res3: Int = 200
> >
> > I'd appreciate any guidance to get the gist of this seemingly magic
> > number. Thanks!
> >
> > Pozdrawiam,
> > Jacek Laskowski
> > ----
> > https://medium.com/@jaceklaskowski/
> > Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark
> > Follow me at https://twitter.com/jaceklaskowski
> >
> -----
> Liang-Chi Hsieh | @viirya
> Spark Technology Center
> http://www.spark.tc/

Re: Why is spark.shuffle.sort.bypassMergeThreshold 200?

Posted by Liang-Chi Hsieh <vi...@gmail.com>.
This https://github.com/apache/spark/pull/1799 seems to be the first PR to
introduce this number, but there is no explanation of why 200 was chosen.


-----
Liang-Chi Hsieh | @viirya 
Spark Technology Center 
http://www.spark.tc/ 

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org