Posted to user@spark.apache.org by Sai Prasanna <an...@gmail.com> on 2014/01/24 10:15:21 UTC

Spark Scheduler

Hello everybody, please help me with this.

The preferredLocations(p) method of an RDD returns the nodes from which
partition p of that RDD can be accessed faster. How does Spark implement
this internally? Is any history of access times or network bandwidth for
the various partitions across nodes stored and used? Or, when there are
multiple copies of an RDD, do the jobs allocated to a node alone determine
the preferredLocations?
Or is the intelligence derived from the underlying framework, say HDFS?

-- 
*Sai Prasanna. AN*
*II M.Tech (CS), SSSIHL*


*Entire water in the ocean can never sink a ship, Unless it gets inside. All
the pressures of life can never hurt you, Unless you let them in.*

Re: Spark Scheduler

Posted by Sai Prasanna <an...@gmail.com>.

Tathagata Das, with respect to HDFS, I think the job seeker will return
which of the replica nodes are the preferred locations. On a standalone
Spark system using the native filesystem, if the partitions are cached,
it is straightforward to return the same. But if they are not cached and
are replicated across 3 nodes, how will Spark return preferredLocations(p)
in the absence of Hadoop/HDFS?
What is the logic in this case?


On Sat, Jan 25, 2014 at 12:11 AM, Tathagata Das <tathagata.das1565@gmail.com> wrote:

> The logic behind the preferred location of an RDD partition is pretty
> simple. For RDDs that are based on an HDFS file, the preferred location is
> set based on where the HDFS blocks corresponding to the RDD's
> partitions are located. This is done by querying the HDFS framework. For
> any RDD that may be cached, the preferred location is set based on where a
> partition is cached (it may be replicated as well). So the system does not
> maintain any history about block / partition access times, bandwidth, etc.



Re: Spark Scheduler

Posted by Tathagata Das <ta...@gmail.com>.

The logic behind the preferred location of an RDD partition is pretty
simple. For RDDs that are based on an HDFS file, the preferred location is
set based on where the HDFS blocks corresponding to the RDD's
partitions are located. This is done by querying the HDFS framework. For
any RDD that may be cached, the preferred location is set based on where a
partition is cached (it may be replicated as well). So the system does not
maintain any history about block / partition access times, bandwidth, etc.
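
The two rules above can be sketched as a small toy model. This is only an
illustration in plain Python, not Spark's actual code: the function name
preferred_locations, the dictionaries cache_locations and hdfs_block_hosts,
and the node names are all hypothetical. It also assumes that a cached copy
is checked before the HDFS block locations, and that an RDD with neither
simply reports no preference.

```python
from typing import Dict, List, Optional


def preferred_locations(
    partition: int,
    cache_locations: Dict[int, List[str]],
    hdfs_block_hosts: Optional[Dict[int, List[str]]] = None,
) -> List[str]:
    """Toy model of the lookup described above (hypothetical, not Spark code)."""
    # Rule 1: if the partition is cached, prefer the nodes holding a cached
    # copy (more than one node if the cached partition is replicated).
    cached = cache_locations.get(partition, [])
    if cached:
        return list(cached)

    # Rule 2: for an HDFS-backed RDD, prefer the hosts that store the block
    # replicas behind this partition (in Spark this comes from querying HDFS).
    if hdfs_block_hosts is not None:
        return list(hdfs_block_hosts.get(partition, []))

    # No locality information: no preference. Note that no access-time or
    # bandwidth history is consulted anywhere in this lookup.
    return []


# Hypothetical cluster state: two HDFS-backed partitions, one of them cached.
hdfs_hosts = {0: ["node1", "node2", "node3"], 1: ["node2", "node3", "node4"]}
cached = {1: ["node4"]}

print(preferred_locations(0, cached, hdfs_hosts))  # ['node1', 'node2', 'node3']
print(preferred_locations(1, cached, hdfs_hosts))  # ['node4']
print(preferred_locations(2, {}, None))            # []
```

The lookup is stateless: it only reads where the data currently lives, so
the answer changes only when a partition is cached, evicted, or its HDFS
blocks move.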
