Posted to user@spark.apache.org by Debasish Das <de...@gmail.com> on 2014/01/03 20:34:23 UTC
Data locality during Spark RDD creation
Hi,
I have HDFS and MapReduce running on 20 nodes and an experimental Spark
cluster running on a subset of the HDFS nodes (say 8 of them).
If some ETL is done using MR, most likely the data will be replicated
across all 20 nodes (assuming I used all the nodes).
Is it a good idea to run the Spark cluster on all 20 nodes where HDFS is
running, so that all the RDDs are data-local and bulk data transfer is
minimized?
Thanks.
Deb
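To put a rough number on how much locality an 8-of-20 subset gives up, here is a back-of-the-envelope Python sketch. The numbers are my own illustration, not from the thread: it assumes HDFS's default replication factor of 3 and, for simplicity, uniformly random replica placement (real HDFS placement is rack-aware, so treat this as only a rough estimate).

```python
from math import comb

# Illustrative numbers from the question: 20 HDFS nodes, Spark on 8 of them.
# Assumes replication factor 3 and idealized uniformly random placement.
total_nodes = 20
spark_nodes = 8
replication = 3

# A block has NO local replica only if all of its replicas land on the
# 12 nodes that are not running Spark.
p_no_local = comb(total_nodes - spark_nodes, replication) / comb(total_nodes, replication)
p_at_least_one_local = 1 - p_no_local

print(f"P(block has a replica on a Spark node) ~= {p_at_least_one_local:.2f}")
```

Under these assumptions roughly 81% of blocks would still have at least one replica on a Spark node, while the remaining ~19% must be read over the network; running Spark on all 20 nodes pushes this to 100%.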
Re: Data locality during Spark RDD creation
Posted by Andrew Ash <an...@andrewash.com>.
I definitely think so. Network transfer is often a bottleneck for
distributed jobs, especially if you're using groupBys or re-keying things
often.
What network speed do you have between the HDFS nodes? 1 Gbps?
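On the shuffle point: a toy sketch in plain Python (not the Spark API) of why an aggregation with map-side combining, as Spark's reduceByKey does, ships far less data over the network than sending every record to its reducer in groupByKey style. The partition sizes below are made up purely for illustration.

```python
from collections import defaultdict

# Toy dataset: 4 "map partitions", each holding 2000 (key, value) records.
partitions = [[("a", 1)] * 1000 + [("b", 1)] * 1000 for _ in range(4)]

# groupByKey-style shuffle: every individual record crosses the network.
shuffled_records = sum(len(p) for p in partitions)

# reduceByKey-style shuffle: each partition pre-aggregates locally
# (map-side combine), so only one record per key per partition is shipped.
def local_combine(partition):
    acc = defaultdict(int)
    for k, v in partition:
        acc[k] += v
    return list(acc.items())

combined = [local_combine(p) for p in partitions]
shuffled_combined = sum(len(p) for p in combined)

print(shuffled_records)    # 8000 records shuffled without combining
print(shuffled_combined)   # 8 records shuffled with map-side combining
```

The same aggregate is computed either way; the difference is only how many records have to move between nodes, which is exactly where data locality and network speed bite.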