Posted to dev@spark.apache.org by Disha Shrivastava <di...@gmail.com> on 2015/12/29 12:57:58 UTC

Partitioning of RDD across worker machines

Hi,

Suppose I have a file on my master machine, and the same file is also
present at the same path on all the worker machines, say
/home/user_name/Desktop. I wanted to know: when we partition the data
using sc.parallelize, does Spark broadcast parts of the RDD to all
the worker machines, or does each worker read its corresponding segment
locally from its own copy?

How do I avoid movement of this data? Will it help if I store the file in
HDFS?

Thanks and Regards,
Disha

Re: Partitioning of RDD across worker machines

Posted by Reynold Xin <rx...@databricks.com>.
If you use hadoopFile (or textFile) and have the same file on the same path
in every node, I suspect it might just work.
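A minimal sketch of the two approaches being compared (the file name data.txt and the variable names are hypothetical; this assumes an existing SparkContext `sc` and a Spark runtime):

```scala
// sc.textFile with a file:// URI reads from the local filesystem of
// whichever node each task runs on, so if the file exists at the same
// path on every worker, splits are read locally rather than shipped
// from the driver.
val fromLocalFs = sc.textFile("file:///home/user_name/Desktop/data.txt")

// By contrast, sc.parallelize requires the data to already be in the
// driver's memory; the resulting partitions are then serialized and
// sent to the executors, which is exactly the data movement the
// question is asking how to avoid.
val inDriver = scala.io.Source
  .fromFile("/home/user_name/Desktop/data.txt")
  .getLines()
  .toSeq
val shippedFromDriver = sc.parallelize(inDriver)
```

Note that with the file:// approach, Spark assumes the file really is present and identical at that path on every node; if a worker is missing the file, tasks scheduled there will fail.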

On Tue, Dec 29, 2015 at 3:57 AM, Disha Shrivastava <di...@gmail.com>
wrote:

> Hi,
>
> Suppose I have a file on my master machine, and the same file is also
> present at the same path on all the worker machines, say
> /home/user_name/Desktop. I wanted to know: when we partition the data
> using sc.parallelize, does Spark broadcast parts of the RDD to all
> the worker machines, or does each worker read its corresponding segment
> locally from its own copy?
>
> How do I avoid movement of this data? Will it help if I store the file in
> HDFS?
>
> Thanks and Regards,
> Disha
>