Posted to user@spark.apache.org by Joe L <se...@yahoo.com> on 2014/04/17 01:50:50 UTC

choose the number of partitions according to the number of nodes

Is it true that it is better to choose the number of partitions according to
the number of nodes in the cluster?

partitionBy(numNodes)




Re: choose the number of partitions according to the number of nodes

Posted by Joe L <se...@yahoo.com>.
Thank you Nicholas




Re: choose the number of partitions according to the number of nodes

Posted by Nicholas Chammas <ni...@gmail.com>.
From the Spark tuning guide <http://spark.apache.org/docs/latest/tuning.html>:

In general, we recommend 2-3 tasks per CPU core in your cluster.


I think you can only get one task per partition to run concurrently for a
given RDD. So if your RDD has 10 partitions, then at most 10 tasks can
operate on it concurrently (assuming you have at least 10 cores available).
At the same time, you want each task to run quickly on a small slice of data.
The smaller the slice of data, the more likely the task working on it will
complete successfully. So it's good to have more, smaller partitions.

For example, if you have 10 cores and an RDD with 20 partitions, you should
see 2 waves of 10 tasks operate on the RDD concurrently.
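
To see how those numbers line up on your own job, you can compare the RDD's
partition count with the scheduler's parallelism. A rough sketch in Scala
(assuming an RDD named rdd and a SparkContext named sc; note that
sc.defaultParallelism is only an approximation of total cores and depends on
your cluster manager):

    // Number of partitions in the RDD; each becomes one task per stage.
    val numPartitions = rdd.partitions.length

    // Total slots Spark expects to schedule on (roughly "numCores" in this thread).
    val numCores = sc.defaultParallelism

    // How many "waves" of tasks a stage over this RDD will need.
    val waves = math.ceil(numPartitions.toDouble / numCores).toInt
    println(s"$numPartitions partitions / $numCores cores => about $waves wave(s) of tasks")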

Hence, I interpret that line from the tuning guide to mean it's best to
have your RDD partitioned into (numCores * 2) or (numCores * 3) partitions.
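
In code, that rule of thumb might look roughly like the sketch below (just an
illustration, assuming a SparkContext sc, a plain RDD rdd, and a key-value RDD
pairs; the names and the factor of 3 are placeholders, not a hard rule):

    import org.apache.spark.HashPartitioner

    // Aim for 2-3 partitions (and therefore tasks) per core.
    val targetPartitions = sc.defaultParallelism * 3

    // For a key-value RDD, partitionBy takes an explicit Partitioner:
    val repartitionedPairs = pairs.partitionBy(new HashPartitioner(targetPartitions))

    // For any RDD, repartition() works (or coalesce() to shrink without a full shuffle):
    val repartitioned = rdd.repartition(targetPartitions)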

Nick


On Wed, Apr 16, 2014 at 7:50 PM, Joe L <se...@yahoo.com> wrote:

> Is it true that it is better to choose the number of partitions according to
> the number of nodes in the cluster?
>
> partitionBy(numNodes)