Posted to user@spark.apache.org by Rahul Palamuttam <ra...@gmail.com> on 2015/07/28 09:42:29 UTC

Spark Number of Partitions Recommendations

Hi All,

I was wondering why the recommended level of parallelism is 2-3 times
the number of cores in your cluster.
Is this heuristic explained in any of the Spark papers? Or is it more of an
agreed-upon rule of thumb?

Thanks,

Rahul P





Re: Spark Number of Partitions Recommendations

Posted by Igor Berman <ig...@gmail.com>.
imho, you need to take the size of your data into account too.
If your cluster is relatively small, you may cause memory pressure on your
executors by repartitioning to a number of partitions derived only from your
#cores.

Better to take the max of the initial number of partitions (assuming your
data is on HDFS with a 64MB block size) and the number you get from your
formula.
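
Something like this sketch, assuming the data is read from HDFS so the
input's partition count reflects the block count (the 3x multiplier and the
input path are illustrative, not from any Spark documentation):

import org.apache.spark.{SparkConf, SparkContext}

object PartitionSizing {
  def main(args: Array[String]): Unit = {
    val sc  = new SparkContext(new SparkConf().setAppName("partition-sizing"))
    val rdd = sc.textFile("hdfs:///data/events")  // hypothetical input path

    val totalCores  = sc.defaultParallelism       // ~ num-executors * executor-cores
    val byCores     = 3 * totalCores              // the 2-3x rule of thumb
    val byInputSize = rdd.partitions.length       // ~ one partition per HDFS block

    // Igor's suggestion: never drop below the input's natural partitioning.
    val numPartitions = math.max(byCores, byInputSize)
    val repartitioned = rdd.repartition(numPartitions)

    println(s"using $numPartitions partitions")
    sc.stop()
  }
}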



On 29 July 2015 at 12:31, ponkin <al...@ya.ru> wrote:

> Hi Rahul,
>
> Where did you see such a recommendation?
> I personally define partitions with the following formula
>
> partitions = nextPrimeNumberAbove( K * (--num-executors * --executor-cores) )
>
> where
> nextPrimeNumberAbove(x) - the smallest prime number greater than x
> K - a multiplier; to calibrate it, start with 1 and increase until join
> performance starts to degrade
>
>

Re: Spark Number of Partitions Recommendations

Posted by Понькин Алексей <al...@ya.ru>.
Yes, I forgot to mention:
I chose a prime number as the modulus for the hash function because my keys
are usually strings, and Spark computes a key's partition from its hash (see
HashPartitioner.scala). So, to avoid a large number of collisions (many keys
landing in a few partitions), it is common to use a prime number as the
modulus. But of course this only makes sense for String keys, because of
their hash function. If you have a different hash function for a different
key type, you can use any other modulus instead of a prime.
I like this discussion of the topic: http://stackoverflow.com/questions/1145217/why-should-hash-functions-use-a-prime-number-modulus
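
For reference, partition assignment in Spark boils down to a non-negative
modulo of the key's hashCode. A simplified sketch of what
HashPartitioner.scala does (not the verbatim source):

class SimpleHashPartitioner(numPartitions: Int) {
  def getPartition(key: Any): Int = key match {
    case null => 0
    case _    => nonNegativeMod(key.hashCode, numPartitions)
  }

  // Java's % can be negative for negative hash codes, so the result
  // is mapped back into [0, numPartitions).
  private def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    if (raw < 0) raw + mod else raw
  }
}

The idea above is that with a prime numPartitions, regular patterns in
String hash codes are less likely to pile many keys onto a few partitions.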




02.08.2015, 00:14, "Ruslan Dautkhanov" <da...@gmail.com>:
> You should also take into account the amount of memory that you plan to use.
> It's advised not to give too much memory to each executor; otherwise GC overhead will go up.
>
> Btw, why prime numbers?
>
> --
> Ruslan Dautkhanov
>
> On Wed, Jul 29, 2015 at 3:31 AM, ponkin <al...@ya.ru> wrote:
>> Hi Rahul,
>>
>> Where did you see such a recommendation?
>> I personally define partitions with the following formula
>>
>> partitions = nextPrimeNumberAbove( K * (--num-executors * --executor-cores) )
>>
>> where
>> nextPrimeNumberAbove(x) - the smallest prime number greater than x
>> K - a multiplier; to calibrate it, start with 1 and increase until join
>> performance starts to degrade
>>


Re: Spark Number of Partitions Recommendations

Posted by Ruslan Dautkhanov <da...@gmail.com>.
You should also take into account the amount of memory that you plan to use.
It's advised not to give too much memory to each executor; otherwise GC
overhead will go up.
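
As a rough illustration (the numbers here are made up; tune them for your
cluster), splitting the same resources across more, smaller executors keeps
each heap modest:

import org.apache.spark.SparkConf

object SizingExample {
  // Illustrative sizing only: several moderate heaps instead of a few
  // very large ones, since large heaps tend to raise GC overhead.
  val conf = new SparkConf()
    .setAppName("sizing-example")
    .set("spark.executor.instances", "10")  // hypothetical
    .set("spark.executor.cores", "4")       // hypothetical
    .set("spark.executor.memory", "8g")     // moderate heap per executor
}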

Btw, why prime numbers?



-- 
Ruslan Dautkhanov

On Wed, Jul 29, 2015 at 3:31 AM, ponkin <al...@ya.ru> wrote:

> Hi Rahul,
>
> Where did you see such a recommendation?
> I personally define partitions with the following formula
>
> partitions = nextPrimeNumberAbove( K * (--num-executors * --executor-cores) )
>
> where
> nextPrimeNumberAbove(x) - the smallest prime number greater than x
> K - a multiplier; to calibrate it, start with 1 and increase until join
> performance starts to degrade
>

Re: Spark Number of Partitions Recommendations

Posted by ponkin <al...@ya.ru>.
Hi Rahul,

Where did you see such a recommendation?
I personally define partitions with the following formula

partitions = nextPrimeNumberAbove( K * (--num-executors * --executor-cores) )

where
nextPrimeNumberAbove(x) - the smallest prime number greater than x
K - a multiplier; to calibrate it, start with 1 and increase until join
performance starts to degrade
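
A minimal sketch of that formula in Scala (nextPrimeNumberAbove is defined
here for illustration; it is not a Spark API, and the executor numbers are
made up):

object PrimePartitions {
  // Smallest prime strictly greater than x.
  def nextPrimeNumberAbove(x: Int): Int = {
    def isPrime(n: Int): Boolean =
      n > 1 && (2 to math.sqrt(n).toInt).forall(n % _ != 0)
    Iterator.from(x + 1).find(isPrime).get
  }

  def main(args: Array[String]): Unit = {
    val numExecutors  = 10  // stands in for --num-executors
    val executorCores = 4   // stands in for --executor-cores
    val k             = 2   // start at 1; raise until joins start to degrade

    val partitions = nextPrimeNumberAbove(k * (numExecutors * executorCores))
    println(s"partitions = $partitions")  // 83 for k = 2 with 40 total cores
  }
}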



