Posted to user@spark.apache.org by Haopu Wang <HW...@qilinsoft.com> on 2014/07/18 09:35:07 UTC

data locality

I have a standalone Spark cluster and an HDFS cluster that share some nodes.

When reading an HDFS file, how does Spark assign tasks to nodes? Will it ask HDFS for the location of each file block in order to pick the right worker node?

How about a Spark cluster on YARN?

Thank you very much!

 


Re: data locality

Posted by Chris Fregly <ch...@fregly.com>.
You can view the Locality Level of each task within a stage by using the
Spark Web UI under the Stages tab.

The levels are as follows (in order of decreasing desirability):
1) PROCESS_LOCAL <- data was found directly in the executor JVM
2) NODE_LOCAL <- data was found on the same node as the executor JVM
3) RACK_LOCAL <- data was found on the same rack
4) ANY <- data was found outside the rack

Also, the Aggregated Metrics by Executor section of the Stage detail view
shows how much data is being shuffled across the network (Shuffle
Read/Write).  0 is where you want to be with that metric.
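
If you want to check it from code rather than the UI, here's a minimal sketch (the HDFS path is a placeholder, and it assumes your Spark build exposes SparkContext.getPreferredLocs, a developer API) that prints the hosts the scheduler would prefer for each input partition; comparing those hosts with the executors that actually ran the tasks shows whether locality was achieved:

import org.apache.spark.{SparkConf, SparkContext}

object LocalityCheck {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("locality-check"))

    // Placeholder path -- substitute a real file on your HDFS cluster.
    val rdd = sc.textFile("hdfs:///user/me/myfile.txt")

    // getPreferredLocs reports the hosts holding each partition's underlying
    // blocks, i.e. what the scheduler aims for when it tries NODE_LOCAL placement.
    rdd.partitions.indices.foreach { i =>
      println(s"partition $i -> " + sc.getPreferredLocs(rdd, i).mkString(", "))
    }

    sc.stop()
  }
}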

-chris


On Fri, Jul 25, 2014 at 4:13 AM, Tsai Li Ming <ma...@ltsai.com> wrote:

> In standalone mode, how can we check that data locality is working as
> expected when tasks are assigned?

Re: data locality

Posted by Tsai Li Ming <ma...@ltsai.com>.
Hi,

In standalone mode, how can we check that data locality is working as expected when tasks are assigned?

Thanks!


On 23 Jul, 2014, at 12:49 am, Sandy Ryza <sa...@cloudera.com> wrote:

> On standalone there is still special handling for assigning tasks within executors.  There just isn't special handling for where to place executors, because standalone generally places an executor on every node.


Re: data locality

Posted by Sandy Ryza <sa...@cloudera.com>.
On standalone there is still special handling for assigning tasks within
executors.  There just isn't special handling for where to place executors,
because standalone generally places an executor on every node.
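
The knob that governs that task-level handling is delay scheduling: the scheduler holds a task briefly in the hope that a slot at a better locality level frees up before falling back. A hedged sketch of the relevant settings (the values are purely illustrative, not recommendations):

import org.apache.spark.{SparkConf, SparkContext}

// spark.locality.wait (milliseconds) is the overall fallback delay; the
// .node and .rack variants control the NODE_LOCAL and RACK_LOCAL steps.
val conf = new SparkConf()
  .setAppName("locality-wait-demo")
  .set("spark.locality.wait", "10000")
  .set("spark.locality.wait.node", "10000")
  .set("spark.locality.wait.rack", "5000")

val sc = new SparkContext(conf)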


On Mon, Jul 21, 2014 at 7:42 PM, Haopu Wang <HW...@qilinsoft.com> wrote:

> So if I understand correctly, there is *no* special handling of task
> assignment according to the HDFS block's location when Spark is running as
> a *standalone* cluster.
>
> Please correct me if I'm wrong. Thank you for your patience!

RE: data locality

Posted by Haopu Wang <HW...@qilinsoft.com>.
Sandy,

 

I just tried the standalone cluster and haven't had a chance to try YARN yet.

So if I understand correctly, there is *no* special handling of task assignment according to the HDFS block's location when Spark is running as a *standalone* cluster.

Please correct me if I'm wrong. Thank you for your patience!

 

________________________________

From: Sandy Ryza [mailto:sandy.ryza@cloudera.com] 
Sent: July 22, 2014 9:47
To: user@spark.apache.org
Subject: Re: data locality

 

This currently only works for YARN.  The standalone default is to place an executor on every node for every job.

 

The total number of executors is specified by the user.

 

-Sandy

 


Re: data locality

Posted by Sandy Ryza <sa...@cloudera.com>.
This currently only works for YARN.  The standalone default is to place an
executor on every node for every job.

The total number of executors is specified by the user.

-Sandy
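
As a rough sketch of where that number comes from (the property names here are assumptions for illustration: spark.executor.instances on YARN, while standalone apps are usually bounded indirectly with spark.cores.max; older releases pass the YARN count via spark-submit's --num-executors flag instead):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("executor-count-demo")
  // YARN: request an explicit number of executors.
  .set("spark.executor.instances", "4")
  // Standalone: cap total cores; by default the app takes cores
  // (and hence an executor) on every worker node.
  .set("spark.cores.max", "8")
  .set("spark.executor.memory", "2g")

val sc = new SparkContext(conf)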


On Fri, Jul 18, 2014 at 2:00 AM, Haopu Wang <HW...@qilinsoft.com> wrote:

> Do you mean the "preferred location" is working for the standalone cluster
> also?
>
> BTW, even with the preferred hosts, how does Spark decide how many total
> executors to use for this application?

RE: data locality

Posted by Haopu Wang <HW...@qilinsoft.com>.
Sandy,

 

Do you mean the “preferred location” mechanism is working for the standalone cluster also? Because I checked the code of SparkContext and saw these comments:

 

  // This is used only by YARN for now, but should be relevant to other cluster types (Mesos,

  // etc) too. This is typically generated from InputFormatInfo.computePreferredLocations. It

  // contains a map from hostname to a list of input format splits on the host.

  private[spark] var preferredNodeLocationData: Map[String, Set[SplitInfo]] = Map()

 

BTW, even with the preferred hosts, how does Spark decide how many total executors to use for this application?

 

Thanks again!

 

________________________________

From: Sandy Ryza [mailto:sandy.ryza@cloudera.com]
Sent: Friday, July 18, 2014 3:44 PM
To: user@spark.apache.org
Subject: Re: data locality

Spark will ask HDFS for file block locations and try to assign tasks based on these.


Re: data locality

Posted by Sandy Ryza <sa...@cloudera.com>.
Hi Haopu,

Spark will ask HDFS for file block locations and try to assign tasks based
on these.
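
In the common case nothing extra is needed; the locations come back from the HDFS NameNode through the Hadoop input format when the RDD is built. Roughly, given an existing SparkContext sc (the path below is a placeholder):

// Block locations are fetched via the Hadoop InputFormat behind textFile;
// the scheduler then prefers to run each partition's task on a node that
// holds the corresponding block.
val lines = sc.textFile("hdfs:///data/myfile.txt")
println(lines.count())  // triggers a job; check the Stages tab for locality levels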

There is a snag.  Spark schedules its tasks inside of "executor" processes
that stick around for the lifetime of a Spark application.  Spark requests
executors before it runs any jobs, i.e. before it has any information about
where the input data for the jobs is located.  If the executors occupy
significantly fewer nodes than exist in the cluster, it can be difficult
for Spark to achieve data locality.  The workaround for this is an API that
allows passing in a set of preferred locations when instantiating a Spark
context.  This API is currently broken in Spark 1.0, and will likely be
changed to something a little simpler in a future release.

val locData = InputFormatInfo.computePreferredLocations(
  Seq(new InputFormatInfo(conf, classOf[TextInputFormat], new Path("myfile.txt"))))

val sc = new SparkContext(conf, locData)

-Sandy



On Fri, Jul 18, 2014 at 12:35 AM, Haopu Wang <HW...@qilinsoft.com> wrote:

>  I have a standalone Spark cluster and an HDFS cluster that share some nodes.
>
>  When reading an HDFS file, how does Spark assign tasks to nodes? Will it ask
> HDFS for the location of each file block in order to pick the right worker node?
>
>  How about a Spark cluster on YARN?