Posted to user@spark.apache.org by Debasish Das <de...@gmail.com> on 2014/01/03 20:34:23 UTC

Data locality during Spark RDD creation

Hi,

I have HDFS and MapReduce running on 20 nodes and an experimental Spark
cluster running on a subset of the HDFS nodes (say 8 of them).

If some ETL is done using MR, the data will most likely be replicated across
all 20 nodes (assuming I used all of them).

Is it a good idea to run the Spark cluster on all 20 nodes where HDFS is
running, so that the RDDs are data-local and bulk data transfer is
minimized?
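
As a rough estimate (assuming the default HDFS replication factor of 3 and
roughly uniform block placement, ignoring rack awareness), the fraction of
blocks with no replica on the 8 Spark nodes would be

    C(12,3) / C(20,3) = 220 / 1140 ≈ 19%

so about one block in five would have to be pulled over the network, whereas
with Spark on all 20 nodes every block is local to at least one worker.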

Thanks.
Deb

Re: Data locality during Spark RDD creation

Posted by Andrew Ash <an...@andrewash.com>.
I definitely think so.  Network transfer is often a bottleneck for
distributed jobs, especially if you're using groupBys or re-keying things
often.
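
For what it's worth, you can ask Spark where it thinks each partition's data
lives. A minimal sketch (the master URL and HDFS path are placeholders;
preferredLocations is the standard RDD API):

    import org.apache.spark.SparkContext

    // Placeholder master URL and input path -- substitute your own.
    val sc = new SparkContext("spark://master:7077", "locality-check")
    val rdd = sc.textFile("hdfs://namenode:8020/user/deb/etl-output")

    // preferredLocations lists the hosts holding each partition's HDFS
    // blocks; when those hosts also run Spark workers, the scheduler can
    // launch tasks node-local instead of reading blocks over the network.
    rdd.partitions.foreach { p =>
      println("partition " + p.index + " -> " +
        rdd.preferredLocations(p).mkString(", "))
    }

If the printed hosts are all members of your Spark cluster, the scan will run
node-local.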

What network speed do you have between the HDFS nodes?  1 Gbps?

