You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@spark.apache.org by Sameer Tilak <ss...@live.com> on 2014/07/31 07:07:58 UTC

Spark partition

Hi All,
>From the documention RDDs are already partitioned distributed. However, there is a way to repartition a given RDD using the following function. Can someone please point out the best practices for using this. I have a 10 GB TSV file stored in HDFS and I have a 4 node cluster with 1 master and 3 workers. Each worker has 15 GB memory and 4 cores. My processing pipeline is not very deep as of now. Can someone please tell me when repartitioning is recommended? When the documentation says balance doe to refer to memory usage or compute load or I/O?
repartition(numPartitions)Reshuffle the data in the RDD randomly to create either more or fewer partitions and balance it across them. This always shuffles all data over the network.

Re: Spark partition

Posted by Haiyang Fu <ha...@gmail.com>.

Hi,
you may referer this
http://spark.apache.org/docs/latest/tuning.html#level-of-parallelism
and
http://spark.apache.org/docs/latest/programming-guide.html#parallelized-collections
,both of which are about the RDD partitions.As you are going to load data
from hdfs, so you maybe also need to know
http://spark.apache.org/docs/latest/programming-guide.html#external-datasets
.

On Thu, Jul 31, 2014 at 1:07 PM, Sameer Tilak <ss...@live.com> wrote:

> Hi All,
>
> From the documention RDDs are already partitioned distributed. However,
> there is a way to repartition a given RDD using the following function. Can
> someone please point out the best practices for using this. I have a 10 GB
> TSV file stored in HDFS and I have a 4 node cluster with 1 master and 3
> workers. Each worker has 15 GB memory and 4 cores. My processing pipeline
> is not very deep as of now. Can someone please tell me when repartitioning
> is recommended? When the documentation says balance doe to refer to memory
> usage or compute load or I/O?
>
> *repartition*(*numPartitions*)Reshuffle the data in the RDD randomly to
> create either more or fewer partitions and balance it across them. This
> always shuffles all data over the network.
>
>
>
>