Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2016/11/17 09:01:16 UTC

[jira] [Commented] (SPARK-18479) spark.sql.shuffle.partitions defaults should be a prime number

    [ https://issues.apache.org/jira/browse/SPARK-18479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15673177#comment-15673177 ] 

Sean Owen commented on SPARK-18479:
-----------------------------------

For the reason you give, you would never want to partition on this key that way. You would hash the whole timestamp value, which is what HashPartitioner does. In theory the number of buckets doesn't matter as long as the hash function is good. The problem with an odd number of partitions is that it won't match the number of slots available for computation, and later stages run one task per partition: 199 tasks on 200 cores wastes one core, and 211 tasks push 11 of them into a second, mostly idle wave, which is a bit worse.
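A minimal sketch of that core-alignment point, assuming a locally built SparkSession (the object name, the local[*] master, and the use of defaultParallelism as a stand-in for total executor cores are illustrative assumptions, not anything from this thread): it sets spark.sql.shuffle.partitions so the number of shuffle tasks matches the number of slots.

    import org.apache.spark.sql.SparkSession

    object ShufflePartitionsExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("shuffle-partitions-demo")
          .master("local[*]")
          .getOrCreate()

        // defaultParallelism roughly tracks the total cores available;
        // matching the shuffle partition count to it avoids a trailing
        // wave of leftover tasks (or an idle core).
        val cores = spark.sparkContext.defaultParallelism
        spark.conf.set("spark.sql.shuffle.partitions", cores.toString)

        println(s"shuffle partitions = ${spark.conf.get("spark.sql.shuffle.partitions")}")
        spark.stop()
      }
    }

In practice defaultParallelism is only an approximation of the parallelism actually available to a stage, so this is a starting point rather than a rule.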

> spark.sql.shuffle.partitions defaults should be a prime number
> --------------------------------------------------------------
>
>                 Key: SPARK-18479
>                 URL: https://issues.apache.org/jira/browse/SPARK-18479
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.2
>            Reporter: Hamel Ajay Kothari
>
> For most hash bucketing use cases, it is my understanding that a prime value, such as 199, would be a safer choice than the existing default of 200. Using a non-prime value makes collisions much more likely when the hash function isn't great.
> Consider the case where you've got a Timestamp or Long column with millisecond times at midnight each day. With the default value for spark.sql.shuffle.partitions, you'll end up with 120/200 partitions being completely empty.
> Looking around, there doesn't seem to be a good reason why 200 was chosen, so I don't see a huge risk in changing it.
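A small, hedged way to check the skew claim above, in plain Scala rather than Spark SQL's actual Murmur3-based shuffle hashing (so the exact counts may differ from what Spark produces): it buckets ten years of midnight-in-milliseconds timestamps by their JVM hash modulo 200 and modulo 199 and reports how many buckets end up non-empty.

    object MidnightBucketSkew {
      def main(args: Array[String]): Unit = {
        val msPerDay = 86400000L                             // milliseconds in one day
        val midnights = (0L until 3650L).map(_ * msPerDay)   // ~10 years of daily midnights

        // Non-negative remainder of the JVM hash of the timestamp; Spark SQL's
        // shuffle actually uses a Murmur3 hash, so this is only an approximation.
        def bucket(ts: Long, n: Int): Int = ((ts.## % n) + n) % n

        def occupied(n: Int): Int = midnights.map(bucket(_, n)).distinct.size

        println(s"200 buckets: ${occupied(200)} non-empty")
        println(s"199 buckets: ${occupied(199)} non-empty")
      }
    }

Whatever the exact counts, the underlying arithmetic point is that 86,400,000 (one day in milliseconds) shares factors with 200, while 199, being prime, shares none.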


