Posted to user@spark.apache.org by Mazen <ma...@gmail.com> on 2016/07/10 12:58:16 UTC
location of a partition in the cluster/ how parallelize method
distribute the RDD partitions over the cluster.
Hi,
Is there any way to get the location of a particular RDD partition on the
cluster? A workaround would do.
The parallelize method partitions the RDD into splits as specified, or
as per the default parallelism configuration. Does parallelize actually
distribute the partitions across the cluster, or are the partitions kept on the
driver node? In the first case, is there a protocol for assigning/mapping
partitions (ParallelCollectionPartition) to workers, or is it just random?
Otherwise, when are the partitions distributed across the cluster? Is that when
tasks are launched on partitions?
Thanks.
--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/location-of-a-partition-in-the-cluster-how-parallelize-method-distribute-the-RDD-partitions-over-the-tp27316.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org
Re: location of a partition in the cluster/ how parallelize method
distribute the RDD partitions over the cluster.
Posted by "aka.fe2s" <ak...@gmail.com>.
The local collection is distributed to the cluster only when you run an
action (http://spark.apache.org/docs/latest/programming-guide.html#actions),
because RDDs are evaluated lazily.
If you want to control how the collection is split into partitions, you
can create your own RDD implementation and implement this logic
in the getPartitions/compute methods. See ParallelCollectionRDD as a
reference.
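
As a rough illustration of the splitting step only, here is a minimal
plain-Python sketch of even, contiguous slicing. It needs no Spark. The
function names slice_positions and slice_collection are made up for this
example, and the index arithmetic is an assumption modelled on how
ParallelCollectionRDD appears to slice a collection, so check the Spark
source before relying on it.

```python
def slice_positions(length, num_slices):
    """Yield (start, end) index pairs, one pair per partition.

    Hypothetical helper: the arithmetic is assumed to resemble the
    slicing done by Spark's ParallelCollectionRDD, not copied from it.
    """
    if num_slices < 1:
        raise ValueError("num_slices must be a positive integer")
    for i in range(num_slices):
        # Integer division keeps slices contiguous and nearly equal in size.
        start = (i * length) // num_slices
        end = ((i + 1) * length) // num_slices
        yield start, end


def slice_collection(items, num_slices):
    """Split a local collection into num_slices contiguous lists."""
    items = list(items)
    return [items[start:end]
            for start, end in slice_positions(len(items), num_slices)]


if __name__ == "__main__":
    # Show which elements land in which partition index.
    for index, part in enumerate(slice_collection(range(10), 3)):
        print(index, part)
```

A custom RDD would put its own version of this logic in getPartitions.
Note that the sketch says nothing about which worker receives each slice;
that is decided when tasks are scheduled.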
--
Oleksiy Dyagilev