Posted to dev@spark.apache.org by "anishsneh@yahoo.co.in" <an...@yahoo.co.in> on 2014/06/14 08:06:57 UTC

Fw: How Spark Choose Worker Nodes for respective HDFS block

Hi All

Is there any communication between the Spark MASTER node and the Hadoop NameNode while distributing work to WORKER nodes, as we have in MapReduce?

Please suggest

TIA

-- 
Anish Sneh
"Experience is the best teacher."
http://in.linkedin.com/in/anishsneh


Re: Fw: How Spark Choose Worker Nodes for respective HDFS block

Posted by Chris Fregly <ch...@fregly.com>.
yes, spark attempts to achieve data locality (PROCESS_LOCAL or NODE_LOCAL)
where possible, just like MapReduce.  it's a best practice to co-locate your
Spark Workers on the same nodes as your HDFS DataNodes for just this
reason.

this is achieved through the RDD.preferredLocations() interface method:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD
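
to make that concrete, here's a rough sketch (not from the original thread;
the hdfs path and app name are made-up placeholders) that prints the hosts
spark would prefer for each partition of an hdfs-backed rdd:

  import org.apache.spark.{SparkConf, SparkContext}

  object LocalityInspect {
    def main(args: Array[String]): Unit = {
      val conf = new SparkConf().setAppName("locality-inspect")
      val sc   = new SparkContext(conf)

      // each hdfs block becomes (roughly) one partition; preferredLocations
      // surfaces the DataNode hostnames the NameNode reports for that block.
      // "hdfs:///data/events.log" is just an illustrative path.
      val rdd = sc.textFile("hdfs:///data/events.log")
      rdd.partitions.foreach { p =>
        println(s"partition ${p.index} -> " +
          rdd.preferredLocations(p).mkString(", "))
      }

      sc.stop()
    }
  }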

on a related note, you can configure spark.locality.wait as the number of
milliseconds the scheduler waits before falling back to a less-local node
(RACK_LOCAL):
  http://spark.apache.org/docs/latest/configuration.html
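
for example, a minimal sketch (the 10000 ms value and app name are arbitrary,
just for illustration) that raises the wait so tasks hold out longer for a
NODE_LOCAL slot:

  import org.apache.spark.{SparkConf, SparkContext}

  // assumption: tuning via SparkConf in the driver; the same key can also
  // be set in conf/spark-defaults.conf
  val conf = new SparkConf()
    .setAppName("locality-tuning")        // placeholder app name
    .set("spark.locality.wait", "10000")  // millis to wait before RACK_LOCAL
  val sc = new SparkContext(conf)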

-chris


On Fri, Jun 13, 2014 at 11:06 PM, anishsneh@yahoo.co.in <
anishsneh@yahoo.co.in> wrote:

> Hi All
>
> Is there any communication between the Spark MASTER node and the Hadoop
> NameNode while distributing work to WORKER nodes, as we have in MapReduce?
>
> Please suggest
>
> TIA
>
> --
> Anish Sneh
> "Experience is the best teacher."
> http://in.linkedin.com/in/anishsneh
>
>
>  ------------------------------
> * From: * anishsneh@yahoo.co.in <an...@yahoo.co.in>;
> * To: * user@spark.incubator.apache.org <us...@spark.incubator.apache.org>;
>
> * Subject: * How Spark Choose Worker Nodes for respective HDFS block
> * Sent: * Fri, Jun 13, 2014 9:17:50 PM
>
>   Hi All
>
> I am new to Spark, working on a 3-node test cluster. I am trying to explore
> Spark's scope in analytics; my Spark code mostly interacts with HDFS.
>
> I am confused about how Spark chooses the nodes on which it will distribute
> its work.
>
> We assume that it can be an alternative to Hadoop MapReduce. In MapReduce we
> know that internally the framework distributes code (or logic) to the nearest
> TaskTracker, which is co-located with a DataNode, in the same rack, or
> otherwise nearest to the data blocks.
>
> My confusion is: when I give an HDFS path inside a Spark program, how does
> it choose which node is nearest (if it does at all)?
>
> If it does not, then how will it work when I have TBs of data and high
> network latency is involved?
>
> My apologies for asking a basic question; please suggest.
>
> TIA
> --
> Anish Sneh
> "Experience is the best teacher."
> http://www.anishsneh.com
>