Posted to reviews@spark.apache.org by sryza <gi...@git.apache.org> on 2014/04/30 19:04:39 UTC

[GitHub] spark pull request: SPARK-1492. Update Spark YARN docs to use spar...

GitHub user sryza opened a pull request:

    https://github.com/apache/spark/pull/601

    SPARK-1492. Update Spark YARN docs to use spark-submit

    

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sryza/spark sandy-spark-1492

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/601.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #601
    
----
commit 867a3eaec67d91ec777a775029715b59071faf63
Author: Sandy Ryza <sa...@cloudera.com>
Date:   2014-04-30T17:02:03Z

    SPARK-1492. Update Spark YARN docs to use spark-submit

----



[GitHub] spark pull request: SPARK-1492. Update Spark YARN docs to use spar...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/601#discussion_r12173690
  
    --- Diff: docs/cluster-overview.md ---
    @@ -118,21 +118,25 @@ If you are ever unclear where configuration options are coming from. fine-graine
     information can be printed by adding the `--verbose` option to `./spark-submit`.
     
     ### Advanced Dependency Management
    -When using `./bin/spark-submit` jars will be automatically transferred to the cluster. For many
    -users this is sufficient. However, advanced users can add jars by calling `addFile` or `addJar`
    +When using `./bin/spark-submit` the app jar will be automatically transferred to the cluster. For
    +many users this is sufficient. However, advanced users can add jars by calling `addFile` or `addJar`
     on an existing SparkContext. This can be used to distribute JAR files (Java/Scala) or .egg and
     .zip libraries (Python) to executors. Spark uses the following URL scheme to allow different
     strategies for disseminating jars:
     
     - **file:** - Absolute paths and `file:/` URIs are served by the driver's HTTP file server, and
    -  every executor pulls the file from the driver HTTP server
    +  every executor pulls the file from the driver HTTP server. When running the driver in the cluster,
    +  the jars need a way of getting from the client to the driver so that it can host them. This is not
    +  currently supported with Spark standalone, and on YARN this requires passing additional jars on the
    +  command line with the `--jars` option.
     - **hdfs:**, **http:**, **https:**, **ftp:** - these pull down files and JARs from the URI as expected
     - **local:** - a URI starting with local:/ is expected to exist as a local file on each worker node.  This
       means that no network IO will be incurred, and works well for large files/JARs that are pushed to each worker,
       or shared via NFS, GlusterFS, etc.
     
     Note that JARs and files are copied to the working directory for each SparkContext on the executor nodes.
    -Over time this can use up a significant amount of space and will need to be cleaned up.
    +With Mesos and the Spark Standalone cluster manager, this can use up a significant amount of space over
    --- End diff --
    
    this is actually outdated now; you can set a flag to clean them up. Maybe just mention that and link to the config page.
    
    http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/configuration.html
    spark.worker.cleanup.appDataTtl
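
    A minimal sketch of how that might look on a standalone worker, assuming the companion `spark.worker.cleanup.enabled` property described on that config page (values here are purely illustrative):

        # conf/spark-env.sh on each standalone worker node (illustrative values)
        export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true \
          -Dspark.worker.cleanup.appDataTtl=86400"  # TTL for app work dirs, in seconds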



[GitHub] spark pull request: SPARK-1492. Update Spark YARN docs to use spar...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/601#discussion_r12173906
  
    --- Diff: docs/running-on-yarn.md ---
    @@ -12,12 +12,14 @@ was added to Spark in version 0.6.0, and improved in 0.7.0 and 0.8.0.
     We need a consolidated Spark JAR (which bundles all the required dependencies) to run Spark jobs on a YARN cluster.
    --- End diff --
    
    This section is actually redundant with the "building with maven" section about Hadoop versions. Maybe it would make sense to just have a short introduction here that explains that (i) you need a version of Spark that is specially compiled with YARN support, and (ii) if you don't already have one, go to the Maven build section to learn how to make one.
    
    I think right now if users come to this page, the first thing they'll think is that they have to go build Spark. But in fact, in almost all cases they can just download the pre-built YARN jar and be done with it. I think the first draft of this document was written back when we didn't package a binary with YARN.
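
    For illustration, the flow that comment implies for most users (the archive name below is a placeholder for whatever release is current):

        # Grab a pre-built Spark binary with YARN support and use it directly --
        # no local build step needed.
        tar xzf spark-1.0.0-bin-hadoop2.tgz    # placeholder archive name
        cd spark-1.0.0-bin-hadoop2
        ./bin/spark-submit --help              # ready to submit against YARN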



[GitHub] spark pull request: SPARK-1492. Update Spark YARN docs to use spar...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/601#discussion_r12173709
  
    --- Diff: docs/running-on-yarn.md ---
    @@ -12,12 +12,14 @@ was added to Spark in version 0.6.0, and improved in 0.7.0 and 0.8.0.
     We need a consolidated Spark JAR (which bundles all the required dependencies) to run Spark jobs on a YARN cluster.
     This can be built by setting the Hadoop version and `SPARK_YARN` environment variable, as follows:
     
    -    SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_YARN=true sbt/sbt assembly
    +    mvn package -Pyarn -Dyarn.version=2.2.0 -Dhadoop.version=2.2.0 -DskipTests
    --- End diff --
    
    You no longer need to set `-Dyarn.version` in this case, due to a recent improvement in the Maven build.
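
    If so, the doc's command could presumably shrink to something like this (a sketch based on the comment above, with `yarn.version` defaulting from `hadoop.version`):

        mvn package -Pyarn -Dhadoop.version=2.2.0 -DskipTests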



[GitHub] spark pull request: SPARK-1492. Update Spark YARN docs to use spar...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/601



[GitHub] spark pull request: SPARK-1492. Update Spark YARN docs to use spar...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/601#issuecomment-41888361
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14604/



[GitHub] spark pull request: SPARK-1492. Update Spark YARN docs to use spar...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/601#issuecomment-41886854
  
     Merged build triggered. 



[GitHub] spark pull request: SPARK-1492. Update Spark YARN docs to use spar...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/601#discussion_r12173726
  
    --- Diff: docs/running-on-yarn.md ---
    @@ -47,83 +49,42 @@ System Properties:
     # Launching Spark on YARN
     
     Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which contains the (client side) configuration files for the Hadoop cluster.
    -These configs are used to connect to the cluster, write to the dfs, and connect to the YARN ResourceManager.
    +These configs are used to write to the dfs, and connect to the YARN ResourceManager.
    --- End diff --
    
    Not sure a comma is necessary here anymore.



[GitHub] spark pull request: SPARK-1492. Update Spark YARN docs to use spar...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/601#issuecomment-41822356
  
    Merged build started. 



[GitHub] spark pull request: SPARK-1492. Update Spark YARN docs to use spar...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/601#issuecomment-42096177
  
    Thanks Sandy, I've merged this.



[GitHub] spark pull request: SPARK-1492. Update Spark YARN docs to use spar...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/601#issuecomment-41959988
  
    Merged build started. 



[GitHub] spark pull request: SPARK-1492. Update Spark YARN docs to use spar...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/601#issuecomment-41869512
  
    Hey @sryza thanks a bunch for this. Looking good. I built it locally and read through the doc.
    
    I noticed a few other issues with the doc that you can choose to address or not, depending on whether you have time. In general, I think the doc makes it seem like you need to build Spark yourself to submit a YARN job, but in fact most users should not have to do this.



[GitHub] spark pull request: SPARK-1492. Update Spark YARN docs to use spar...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/601#issuecomment-41886859
  
    Merged build started. 



[GitHub] spark pull request: SPARK-1492. Update Spark YARN docs to use spar...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/601#discussion_r12169776
  
    --- Diff: docs/cluster-overview.md ---
    @@ -118,21 +118,25 @@ If you are ever unclear where configuration options are coming from. fine-graine
     information can be printed by adding the `--verbose` option to `./spark-submit`.
     
     ### Advanced Dependency Management
    -When using `./bin/spark-submit` jars will be automatically transferred to the cluster. For many
    -users this is sufficient. However, advanced users can add jars by calling `addFile` or `addJar`
    +When using `./bin/spark-submit` the app jar will be automatically transferred to the cluster. For
    +many users this is sufficient. However, advanced users can add jars by calling `addFile` or `addJar`
    --- End diff --
    
    Maybe mention that spark-submit has a "--jars" option too?



[GitHub] spark pull request: SPARK-1492. Update Spark YARN docs to use spar...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/601#issuecomment-41826741
  
    Merged build finished. All automated tests passed.



[GitHub] spark pull request: SPARK-1492. Update Spark YARN docs to use spar...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/601#discussion_r12181272
  
    --- Diff: docs/cluster-overview.md ---
    @@ -118,21 +118,25 @@ If you are ever unclear where configuration options are coming from. fine-graine
     information can be printed by adding the `--verbose` option to `./spark-submit`.
     
     ### Advanced Dependency Management
    -When using `./bin/spark-submit` jars will be automatically transferred to the cluster. For many
    -users this is sufficient. However, advanced users can add jars by calling `addFile` or `addJar`
    +When using `./bin/spark-submit` the app jar will be automatically transferred to the cluster. For
    +many users this is sufficient. However, advanced users can add jars by calling `addFile` or `addJar`
     on an existing SparkContext. This can be used to distribute JAR files (Java/Scala) or .egg and
     .zip libraries (Python) to executors. Spark uses the following URL scheme to allow different
     strategies for disseminating jars:
     
     - **file:** - Absolute paths and `file:/` URIs are served by the driver's HTTP file server, and
    -  every executor pulls the file from the driver HTTP server
    +  every executor pulls the file from the driver HTTP server. When running the driver in the cluster,
    +  the jars need a way of getting from the client to the driver so that it can host them. This is not
    +  currently supported with Spark standalone, and on YARN this requires passing additional jars on the
    +  command line with the `--jars` option.
     - **hdfs:**, **http:**, **https:**, **ftp:** - these pull down files and JARs from the URI as expected
     - **local:** - a URI starting with local:/ is expected to exist as a local file on each worker node.  This
       means that no network IO will be incurred, and works well for large files/JARs that are pushed to each worker,
       or shared via NFS, GlusterFS, etc.
     
     Note that JARs and files are copied to the working directory for each SparkContext on the executor nodes.
    -Over time this can use up a significant amount of space and will need to be cleaned up.
    +With Mesos and the Spark Standalone cluster manager, this can use up a significant amount of space over
    --- End diff --
    
    Ah okay, fair enough. I guess this is still an issue for Mesos.



[GitHub] spark pull request: SPARK-1492. Update Spark YARN docs to use spar...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/601#issuecomment-41888359
  
    Merged build finished. 



[GitHub] spark pull request: SPARK-1492. Update Spark YARN docs to use spar...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/601#issuecomment-41822345
  
     Merged build triggered. 



[GitHub] spark pull request: SPARK-1492. Update Spark YARN docs to use spar...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/601#discussion_r12173921
  
    --- Diff: docs/running-on-yarn.md ---
    @@ -12,12 +12,14 @@ was added to Spark in version 0.6.0, and improved in 0.7.0 and 0.8.0.
     We need a consolidated Spark JAR (which bundles all the required dependencies) to run Spark jobs on a YARN cluster.
    --- End diff --
    
    Also above here - maybe update "improved in 0.7.0 and 0.8.0" to say "improved in subsequent releases".



[GitHub] spark pull request: SPARK-1492. Update Spark YARN docs to use spar...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/601#discussion_r12173623
  
    --- Diff: docs/cluster-overview.md ---
    @@ -118,21 +118,25 @@ If you are ever unclear where configuration options are coming from. fine-graine
     information can be printed by adding the `--verbose` option to `./spark-submit`.
     
     ### Advanced Dependency Management
    -When using `./bin/spark-submit` jars will be automatically transferred to the cluster. For many
    -users this is sufficient. However, advanced users can add jars by calling `addFile` or `addJar`
    +When using `./bin/spark-submit` the app jar will be automatically transferred to the cluster. For
    +many users this is sufficient. However, advanced users can add jars by calling `addFile` or `addJar`
    --- End diff --
    
    yeah I like that idea. Something like `When using ./bin/spark-submit, the application jar along with those included via the --jars flag will automatically be transferred to the cluster`
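
    For reference, a sketch of the invocation that wording describes (class, jar, and argument names are placeholders):

        ./bin/spark-submit --class com.example.MyApp \
          --master yarn-cluster \
          --jars extra-lib1.jar,extra-lib2.jar \
          my-app.jar arg1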



[GitHub] spark pull request: SPARK-1492. Update Spark YARN docs to use spar...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/601#discussion_r12169839
  
    --- Diff: docs/running-on-yarn.md ---
    @@ -47,83 +49,42 @@ System Properties:
     # Launching Spark on YARN
     
     Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which contains the (client side) configuration files for the Hadoop cluster.
    -These configs are used to connect to the cluster, write to the dfs, and connect to the YARN ResourceManager.
    +These configs are used to write to the dfs, and connect to the YARN ResourceManager.
     
     There are two deploy modes that can be used to launch Spark applications on YARN. In yarn-cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. In yarn-client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN.
     
     Unlike in Spark standalone and Mesos mode, in which the master's address is specified in the "master" parameter, in YARN mode the ResourceManager's address is picked up from the Hadoop configuration.  Thus, the master parameter is simply "yarn-client" or "yarn-cluster".
     
    -The spark-submit script described in the [cluster mode overview](cluster-overview.html) provides the most straightforward way to submit a compiled Spark application to YARN in either deploy mode. For info on the lower-level invocations it uses, read ahead. For running spark-shell against YARN, skip down to the yarn-client section. 
    -
    -## Launching a Spark application with yarn-cluster mode.
    -
    -The command to launch the Spark application on the cluster is as follows:
    -
    -    SPARK_JAR=<SPARK_ASSEMBLY_JAR_FILE> ./bin/spark-class org.apache.spark.deploy.yarn.Client \
    -      --jar <YOUR_APP_JAR_FILE> \
    -      --class <APP_MAIN_CLASS> \
    -      --arg <APP_MAIN_ARGUMENT> \
    -      --num-executors <NUMBER_OF_EXECUTOR_PROCESSES> \
    -      --driver-memory <MEMORY_FOR_ApplicationMaster> \
    -      --executor-memory <MEMORY_PER_EXECUTOR> \
    -      --executor-cores <CORES_PER_EXECUTOR> \
    -      --name <application_name> \
    -      --queue <queue_name> \
    -      --addJars <any_local_files_used_in_SparkContext.addJar> \
    -      --files <files_for_distributed_cache> \
    -      --archives <archives_for_distributed_cache>
    -
    -To pass multiple arguments the "arg" option can be specified multiple times. For example:
    -
    -    # Build the Spark assembly JAR and the Spark examples JAR
    -    $ SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_YARN=true sbt/sbt assembly
    -
    -    # Configure logging
    -    $ cp conf/log4j.properties.template conf/log4j.properties
    -
    -    # Submit Spark's ApplicationMaster to YARN's ResourceManager, and instruct Spark to run the SparkPi example
    -    $ SPARK_JAR=./assembly/target/scala-{{site.SCALA_BINARY_VERSION}}/spark-assembly-{{site.SPARK_VERSION}}-hadoop2.0.5-alpha.jar \
    -        ./bin/spark-class org.apache.spark.deploy.yarn.Client \
    -          --jar examples/target/scala-{{site.SCALA_BINARY_VERSION}}/spark-examples-assembly-{{site.SPARK_VERSION}}.jar \
    -          --class org.apache.spark.examples.SparkPi \
    -          --arg yarn-cluster \
    -          --arg 5 \
    -          --num-executors 3 \
    -          --driver-memory 4g \
    -          --executor-memory 2g \
    -          --executor-cores 1
    -
    -The above starts a YARN client program which starts the default Application Master. Then SparkPi will be run as a child thread of Application Master. The client will periodically poll the Application Master for status updates and display them in the console. The client will exit once your application has finished running.  Refer to the "Viewing Logs" section below for how to see driver and executor logs.
    -
    -Because the application is run on a remote machine where the Application Master is running, applications that involve local interaction, such as spark-shell, will not work.
    -
    -## Launching a Spark application with yarn-client mode.
    -
    -With yarn-client mode, the application will be launched locally, just like running an application or spark-shell on Local / Mesos / Standalone client mode. The launch method is also the same, just make sure to specify the master URL as "yarn-client". You also need to export the env value for SPARK_JAR.
    +To launch a Spark application in yarn-cluster mode:
     
    -Configuration in yarn-client mode:
    +    ./bin/spark-submit --class path.to.your.Class --master yarn-cluster [options] <app jar> [app options]
    --- End diff --
    
    This works, but I thought the preferred way was:
    
      --master yarn --deploy-mode [client|cluster]
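
    That is, the two spellings side by side would be (app jar and class are placeholders):

        # shorthand master value
        ./bin/spark-submit --class path.to.your.Class --master yarn-cluster app.jar
        # equivalent explicit form
        ./bin/spark-submit --class path.to.your.Class --master yarn --deploy-mode cluster app.jar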



[GitHub] spark pull request: SPARK-1492. Update Spark YARN docs to use spar...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/601#discussion_r12173720
  
    --- Diff: docs/running-on-yarn.md ---
    @@ -12,12 +12,14 @@ was added to Spark in version 0.6.0, and improved in 0.7.0 and 0.8.0.
     We need a consolidated Spark JAR (which bundles all the required dependencies) to run Spark jobs on a YARN cluster.
     This can be built by setting the Hadoop version and `SPARK_YARN` environment variable, as follows:
     
    -    SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_YARN=true sbt/sbt assembly
    +    mvn package -Pyarn -Dyarn.version=2.2.0 -Dhadoop.version=2.2.0 -DskipTests
     
     The assembled JAR will be something like this:
    -`./assembly/target/scala-{{site.SCALA_BINARY_VERSION}}/spark-assembly_{{site.SPARK_VERSION}}-hadoop2.0.5.jar`.
    +`./assembly/target/scala-{{site.SCALA_BINARY_VERSION}}/spark-assembly_{{site.SPARK_VERSION}}-hadoop2.2.0.jar`.
     
    -The build process now also supports new YARN versions (2.2.x). See below.
    +The build process also supports YARN versions older than 2.2.0 (e.g. 0.23.x).
    +
    +    mvn package -Pyarn-alpha -Dyarn.version=0.23.7 -Dhadoop.version=0.23.7 -DskipTests
    --- End diff --
    
    Same here.
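
    Presumably the alpha-profile command can then drop the explicit version too (a sketch, assuming `yarn.version` also defaults from `hadoop.version` under `-Pyarn-alpha`):

        mvn package -Pyarn-alpha -Dhadoop.version=0.23.7 -DskipTests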



[GitHub] spark pull request: SPARK-1492. Update Spark YARN docs to use spar...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/601#issuecomment-41959976
  
     Merged build triggered. 



[GitHub] spark pull request: SPARK-1492. Update Spark YARN docs to use spar...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/601#issuecomment-41826742
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14589/



[GitHub] spark pull request: SPARK-1492. Update Spark YARN docs to use spar...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/601#discussion_r12173948
  
    --- Diff: docs/running-on-yarn.md ---
    @@ -12,12 +12,14 @@ was added to Spark in version 0.6.0, and improved in 0.7.0 and 0.8.0.
     We need a consolidated Spark JAR (which bundles all the required dependencies) to run Spark jobs on a YARN cluster.
     This can be built by setting the Hadoop version and `SPARK_YARN` environment variable, as follows:
     
    -    SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_YARN=true sbt/sbt assembly
    +    mvn package -Pyarn -Dyarn.version=2.2.0 -Dhadoop.version=2.2.0 -DskipTests
     
     The assembled JAR will be something like this:
    -`./assembly/target/scala-{{site.SCALA_BINARY_VERSION}}/spark-assembly_{{site.SPARK_VERSION}}-hadoop2.0.5.jar`.
    +`./assembly/target/scala-{{site.SCALA_BINARY_VERSION}}/spark-assembly_{{site.SPARK_VERSION}}-hadoop2.2.0.jar`.
     
    -The build process now also supports new YARN versions (2.2.x). See below.
    +The build process also supports YARN versions older than 2.2.0 (e.g. 0.23.x).
    +
    +    mvn package -Pyarn-alpha -Dyarn.version=0.23.7 -Dhadoop.version=0.23.7 -DskipTests
     
     # Preparations
    --- End diff --
    
    It would also be nice to update this section if you have time. Right now it says you need to build the examples jar, but again, almost all users will just download a pre-compiled version of Spark that has examples in the `lib/` directory.
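
    For example, an updated section might show something like this against a pre-compiled download (a sketch; the jar name pattern under `lib/` is an assumption):

        # Run the bundled SparkPi example straight from the binary distribution.
        ./bin/spark-submit --class org.apache.spark.examples.SparkPi \
          --master yarn-cluster \
          lib/spark-examples-*.jar 10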



[GitHub] spark pull request: SPARK-1492. Update Spark YARN docs to use spar...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/601#issuecomment-41963937
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14610/



[GitHub] spark pull request: SPARK-1492. Update Spark YARN docs to use spar...

Posted by sryza <gi...@git.apache.org>.
Github user sryza commented on a diff in the pull request:

    https://github.com/apache/spark/pull/601#discussion_r12179554
  
    --- Diff: docs/cluster-overview.md ---
    @@ -118,21 +118,25 @@ If you are ever unclear where configuration options are coming from. fine-graine
     information can be printed by adding the `--verbose` option to `./spark-submit`.
     
     ### Advanced Dependency Management
    -When using `./bin/spark-submit` jars will be automatically transferred to the cluster. For many
    -users this is sufficient. However, advanced users can add jars by calling `addFile` or `addJar`
    +When using `./bin/spark-submit` the app jar will be automatically transferred to the cluster. For
    +many users this is sufficient. However, advanced users can add jars by calling `addFile` or `addJar`
     on an existing SparkContext. This can be used to distribute JAR files (Java/Scala) or .egg and
     .zip libraries (Python) to executors. Spark uses the following URL scheme to allow different
     strategies for disseminating jars:
     
     - **file:** - Absolute paths and `file:/` URIs are served by the driver's HTTP file server, and
    -  every executor pulls the file from the driver HTTP server
    +  every executor pulls the file from the driver HTTP server. When running the driver in the cluster,
    +  the jars need a way of getting from the client to the driver so that it can host them. This is not
    +  currently supported with Spark standalone, and on YARN this requires passing additional jars on the
    +  command line with the `--jars` option.
     - **hdfs:**, **http:**, **https:**, **ftp:** - these pull down files and JARs from the URI as expected
     - **local:** - a URI starting with local:/ is expected to exist as a local file on each worker node.  This
       means that no network IO will be incurred, and works well for large files/JARs that are pushed to each worker,
       or shared via NFS, GlusterFS, etc.
     
     Note that JARs and files are copied to the working directory for each SparkContext on the executor nodes.
    -Over time this can use up a significant amount of space and will need to be cleaned up.
    +With Mesos and the Spark Standalone cluster manager, this can use up a significant amount of space over
    --- End diff --
    
    Is this still an issue for Mesos?



[GitHub] spark pull request: SPARK-1492. Update Spark YARN docs to use spar...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/601#discussion_r12181301
  
    --- Diff: docs/running-on-yarn.md ---
    @@ -44,86 +28,47 @@ System Properties:
     * `spark.yarn.max.executor.failures`, the maximum number of executor failures before failing the application. Default is the number of executors requested times 2 with minimum of 3.
     * `spark.yarn.historyServer.address`, the address of the Spark history server (i.e. host.com:18080). The address should not contain a scheme (http://). Defaults to not being set since the history server is an optional service. This address is given to the Yarn ResourceManager when the Spark application finishes to link the application from the ResourceManager UI to the Spark history server UI. 
     
    +By default, Spark on YARN will use a Spark jar installed locally, but the Spark jar can also be in a world-readable location on HDFS. This allows YARN to cache it on nodes so that it doesn't need to be distributed each time an application runs. To point to a jar on HDFS, export SPARK_JAR=hdfs:/some/path.
    --- End diff --
    
    Just wondering - is it normal to do `hdfs:/some/path` and not `hdfs://some/path`? I think they are technically both valid URLs.
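
    For comparison, both spellings in the SPARK_JAR setting under discussion (paths and namenode host are illustrative):

        export SPARK_JAR=hdfs:/user/spark/spark-assembly.jar                  # no authority component
        export SPARK_JAR=hdfs://namenode:8020/user/spark/spark-assembly.jar   # explicit namenode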



[GitHub] spark pull request: SPARK-1492. Update Spark YARN docs to use spar...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/601#issuecomment-41963936
  
    Merged build finished. All automated tests passed.

