Posted to reviews@spark.apache.org by andrewor14 <gi...@git.apache.org> on 2014/07/18 01:10:38 UTC

[GitHub] spark pull request: [SPARK-2454] Do not assume drivers and executo...

GitHub user andrewor14 opened a pull request:

    https://github.com/apache/spark/pull/1472

    [SPARK-2454] Do not assume drivers and executors share the same Spark home

    **Problem.** When standalone Workers launch executors, they inherit the Spark home set by the driver. This means that if the worker machines do not share the same directory structure as the driver node, the Workers will attempt to run scripts (e.g. `bin/compute-classpath.sh`) that do not exist locally, and will fail. This is a common scenario if the driver is launched from outside the cluster.
    
    **Solution.** Simply do not pass the driver's Spark home to the Workers. Note that we should still send *some* Spark home to the Workers, in case there are multiple installations of Spark on the worker machines and the application wants to pick among them.
    
    **Spark config changes.**
    - `spark.home` - This is deprecated, and its existing usages are removed. The motivation is that it is currently used for 3+ different things and is often confused with `SPARK_HOME`.
    - `spark.executor.home` - The Spark home that the executors will use (see the sketch below). This is not set by default; if it is not set, the Worker will use its own current working directory.
    - `spark.driver.home` - Same as above, but for the driver. This is only relevant for standalone-cluster mode (not yet supported; see SPARK-2260).
    - `spark.test.home` - This is the Spark home used only for tests.
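
    For illustration, a minimal sketch of how an application might set the proposed `spark.executor.home` (the master URL and installation path below are hypothetical placeholders):

        import org.apache.spark.{SparkConf, SparkContext}

        // Point executors at a specific Spark installation on the worker
        // machines, independent of the driver's own installation path.
        val conf = new SparkConf()
          .setMaster("spark://master:7077")
          .setAppName("spark-home-example")
          .set("spark.executor.home", "/opt/spark")  // must exist on each worker
        val sc = new SparkContext(conf)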

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/andrewor14/spark spark-home

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1472.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1472
    
----
commit 75923697a08e035c8e46b53b67a9d98938212915
Author: Andrew Or <an...@gmail.com>
Date:   2014-07-17T18:42:18Z

    Allow applications to specify their executor/driver spark homes
    
    This allows the worker to launch a driver or an executor from a
    different installation of Spark on the same machine. To do so, the
    user needs to set "spark.executor.home" and/or "spark.driver.home".
    
    Note that this was already possible for the executors even before
    this commit. However, it used to rely on "spark.home", which was
    also used for 20 other things. The next step is to remove all usages
    of "spark.home", which was confusing to many users (myself included).

commit b90444d65744174ba6105da23459218e90788644
Author: Andrew Or <an...@gmail.com>
Date:   2014-07-17T21:33:44Z

    Remove / deprecate all occurrences of spark.home
    
    This involves replacing spark.home with spark.test.home in tests.
    It looks like Python still uses spark.home, however. The next
    commit will fix this.

commit 2a64cfcc63023a7ded58421f094421e9a1067e10
Author: Andrew Or <an...@gmail.com>
Date:   2014-07-17T21:52:48Z

    Remove usages of spark.home in python

commit 81710627925ee6cbd2099215efd17c3173b7bed8
Author: Andrew Or <an...@gmail.com>
Date:   2014-07-17T22:02:58Z

    Add back *SparkContext functionality to setSparkHome
    
    This is because we cannot deprecate these constructors easily...

commit 2333c0ecb8ccd16a2c9dbf1a97ae58d7c6e708eb
Author: Andrew Or <an...@gmail.com>
Date:   2014-07-17T22:04:46Z

    Minor deprecation message change

commit b94020e13917ae59b1f3d8954cdecc7089c77141
Author: Andrew Or <an...@gmail.com>
Date:   2014-07-17T22:46:58Z

    Document spark.executor.home (but not spark.driver.home)
    
    ... because the only mode that uses spark.driver.home right now is
    standalone-cluster, which is broken (SPARK-2260). It makes little
    sense to document a feature that exists only for a broken mode.

commit a50f0e74d3916a7fc7e178ba8391260d4127ba36
Author: Andrew Or <an...@gmail.com>
Date:   2014-07-17T22:47:59Z

    Merge branch 'master' of github.com:apache/spark into spark-home

----


[GitHub] spark pull request: [SPARK-2454] Do not assume drivers and executo...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/1472#issuecomment-50954417
  
    Closing this in favor of #1734. Please disregard this PR.


[GitHub] spark pull request: [SPARK-2454] Do not assume drivers and executo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1472#issuecomment-49383563
  
    QA results for PR 1472:
    - This patch FAILED unit tests.
    - This patch merges cleanly
    - This patch adds no public classes

    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16794/consoleFull


[GitHub] spark pull request: [SPARK-2454] Do not assume drivers and executo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1472#issuecomment-49688158
  
    QA tests have started for PR 1472. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16938/consoleFull


[GitHub] spark pull request: [SPARK-2454] Do not assume drivers and executo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1472#issuecomment-50952651
  
    QA results for PR 1472:
    - This patch FAILED unit tests.
    - This patch merges cleanly
    - This patch adds no public classes

    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17741/consoleFull


[GitHub] spark pull request: [SPARK-2454] Do not assume drivers and executo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1472#issuecomment-50675249
  
    QA results for PR 1472:
    - This patch PASSES unit tests.

    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17462/consoleFull


[GitHub] spark pull request: [SPARK-2454] Do not assume drivers and executo...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/1472#issuecomment-50661174
  
    Jenkins, retest this please.


[GitHub] spark pull request: [SPARK-2454] Do not assume drivers and executo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1472#issuecomment-49693537
  
    QA results for PR 1472:
    - This patch PASSES unit tests.
    - This patch merges cleanly
    - This patch adds no public classes

    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16938/consoleFull


[GitHub] spark pull request: [SPARK-2454] Do not assume drivers and executo...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1472#discussion_r15603816
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/Client.scala ---
    @@ -69,13 +69,16 @@ private class ClientActor(driverArgs: ClientArguments, conf: SparkConf) extends
             val javaOpts = sys.props.get(javaOptionsConf)
             val command = new Command(mainClass, Seq("{{WORKER_URL}}", driverArgs.mainClass) ++
               driverArgs.driverOptions, env, classPathEntries, libraryPathEntries, javaOpts)
    +        // TODO: document this once standalone-cluster mode is fixed (SPARK-2260)
    --- End diff --
    
    Does this get updated now?


[GitHub] spark pull request: [SPARK-2454] Do not assume drivers and executo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1472#issuecomment-50952210
  
    QA tests have started for PR 1472. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17741/consoleFull


[GitHub] spark pull request: [SPARK-2454] Do not assume drivers and executo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1472#issuecomment-50661797
  
    QA tests have started for PR 1472. This patch DID NOT merge cleanly!
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17462/consoleFull


[GitHub] spark pull request: [SPARK-2454] Do not assume drivers and executo...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1472#discussion_r15612362
  
    --- Diff: core/src/main/scala/org/apache/spark/SparkConf.scala ---
    @@ -121,7 +121,9 @@ class SparkConf(loadDefaults: Boolean) extends Cloneable with Logging {
        * Set the location where Spark is installed on worker nodes.
        */
       def setSparkHome(home: String): SparkConf = {
    --- End diff --
    
    Maybe should mark this as deprecated too?
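
    For reference, a sketch of what such a deprecation might look like (the message and version string are illustrative, not taken from this PR):

        /** Set the location where Spark is installed on worker nodes. */
        @deprecated("set spark.executor.home instead", "1.1.0")
        def setSparkHome(home: String): SparkConf = {
          set("spark.home", home)
        }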


[GitHub] spark pull request: [SPARK-2454] Do not assume drivers and executo...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 closed the pull request at:

    https://github.com/apache/spark/pull/1472


[GitHub] spark pull request: [SPARK-2454] Do not assume drivers and executo...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/1472#issuecomment-49689468
  
    I have tested this on a standalone cluster, purposefully making the directory structure of the driver different from that of the executors. I was able to confirm that the Workers now use their own local directory to launch the executors. I also tested setting `spark.executor.home` to both a valid path and a bogus path; as expected, the application runs successfully with the former, while the latter causes it to fail.


[GitHub] spark pull request: [SPARK-2454] Do not assume drivers and executo...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/1472#issuecomment-49689328
  
    Oops, accidentally closed. Please disregard.


[GitHub] spark pull request: [SPARK-2454] Do not assume drivers and executo...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/1472#issuecomment-50700741
  
    UPDATE: I had a conversation with @pwendell about this. We came to the conclusion that there is really no benefit in having a mechanism to specify an executor home, at least for standalone mode. Even if there are multiple installations of Spark on the worker machines, we can pick which one to connect to simply by specifying a different Master. In either case, we should just use the Worker's current working directory as the Spark home for the executor (or, in standalone-cluster mode, the driver).
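
    To illustrate that point (the host names below are hypothetical): each standalone Master is started from one installation, so an application selects an installation implicitly through its choice of master URL.

        import org.apache.spark.SparkConf

        // Each master URL corresponds to a separate standalone cluster,
        // started from a different Spark installation on the same machines.
        val confA = new SparkConf().setMaster("spark://cluster-a:7077")
        val confB = new SparkConf().setMaster("spark://cluster-b:7077")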
    
    I will make the relevant changes shortly. If I don't get to it by the 1.1 code freeze, we should just merge in #1392 instead.


[GitHub] spark pull request: [SPARK-2454] Do not assume drivers and executo...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1472#issuecomment-49378511
  
    QA tests have started for PR 1472. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16794/consoleFull


[GitHub] spark pull request: [WIP][SPARK-2454] Do not assume drivers and ex...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1472#issuecomment-49675550
  
    QA results for PR 1472:
    - This patch FAILED unit tests.
    - This patch merges cleanly
    - This patch adds no public classes

    For more information see test output:
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16923/consoleFull


[GitHub] spark pull request: [SPARK-2454] Do not assume drivers and executo...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 closed the pull request at:

    https://github.com/apache/spark/pull/1472


[GitHub] spark pull request: [WIP][SPARK-2454] Do not assume drivers and ex...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1472#issuecomment-49666445
  
    QA tests have started for PR 1472. This patch merges cleanly.
    View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16923/consoleFull


[GitHub] spark pull request: [SPARK-2454] Do not assume drivers and executo...

Posted by andrewor14 <gi...@git.apache.org>.
GitHub user andrewor14 reopened a pull request:

    https://github.com/apache/spark/pull/1472

    [SPARK-2454] Do not assume drivers and executors share the same Spark home

    **Problem.** When standalone Workers launch executors, they inherit the Spark home set by the driver. This means that if the worker machines do not share the same directory structure as the driver node, the Workers will attempt to run scripts (e.g. `bin/compute-classpath.sh`) that do not exist locally, and will fail. This is a common scenario if the driver is launched from outside the cluster.
    
    **Solution.** Simply do not pass the driver's Spark home to the Workers. Note that we should still send *some* Spark home to the Workers, in case there are multiple installations of Spark on the worker machines and the application wants to pick among them.
    
    **Spark config changes.**
    - `spark.home` - This is deprecated, and its existing usages are removed. The motivation is that it is currently used for 3+ different things and is often confused with `SPARK_HOME`.
    - `spark.executor.home` - The Spark home that the executors will use. This is not set by default; if it is not set, the Worker will use its own current working directory.
    - `spark.driver.home` - Same as above, but for the driver. This is only relevant for standalone-cluster mode (not yet supported; see SPARK-2260).
    - `spark.test.home` - This is the Spark home used only for tests.
    
    Note: #1392 proposes part of the solution described here.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/andrewor14/spark spark-home

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1472.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1472
    
----
commit 75923697a08e035c8e46b53b67a9d98938212915
Author: Andrew Or <an...@gmail.com>
Date:   2014-07-17T18:42:18Z

    Allow applications to specify their executor/driver spark homes
    
    This allows the worker to launch a driver or an executor from a
    different installation of Spark on the same machine. To do so, the
    user needs to set "spark.executor.home" and/or "spark.driver.home".
    
    Note that this was already possible for the executors even before
    this commit. However, it used to rely on "spark.home", which was
    also used for 20 other things. The next step is to remove all usages
    of "spark.home", which was confusing to many users (myself included).

commit b90444d65744174ba6105da23459218e90788644
Author: Andrew Or <an...@gmail.com>
Date:   2014-07-17T21:33:44Z

    Remove / deprecate all occurrences of spark.home
    
    This involves replacing spark.home with spark.test.home in tests.
    It looks like Python still uses spark.home, however. The next
    commit will fix this.

commit 2a64cfcc63023a7ded58421f094421e9a1067e10
Author: Andrew Or <an...@gmail.com>
Date:   2014-07-17T21:52:48Z

    Remove usages of spark.home in python

commit 81710627925ee6cbd2099215efd17c3173b7bed8
Author: Andrew Or <an...@gmail.com>
Date:   2014-07-17T22:02:58Z

    Add back *SparkContext functionality to setSparkHome
    
    This is because we cannot deprecate these constructors easily...

commit 2333c0ecb8ccd16a2c9dbf1a97ae58d7c6e708eb
Author: Andrew Or <an...@gmail.com>
Date:   2014-07-17T22:04:46Z

    Minor deprecation message change

commit b94020e13917ae59b1f3d8954cdecc7089c77141
Author: Andrew Or <an...@gmail.com>
Date:   2014-07-17T22:46:58Z

    Document spark.executor.home (but not spark.driver.home)
    
    ... because the only mode that uses spark.driver.home right now is
    standalone-cluster, which is broken (SPARK-2260). It makes little
    sense to document a feature that exists only for a broken mode.

commit a50f0e74d3916a7fc7e178ba8391260d4127ba36
Author: Andrew Or <an...@gmail.com>
Date:   2014-07-17T22:47:59Z

    Merge branch 'master' of github.com:apache/spark into spark-home

commit 953997a279f5cd4a7f47f07d5fd32ff65c59620d
Author: Andrew Or <an...@gmail.com>
Date:   2014-07-21T20:57:38Z

    Merge branch 'master' of github.com:apache/spark into spark-home

commit 00147646ec8594caa8915c9a3fb329fcbe0042a4
Author: Andrew Or <an...@gmail.com>
Date:   2014-07-22T01:28:18Z

    Fix tests that use local-cluster mode

commit ecdfa92fd33f19fc57e041e4269405c011a43261
Author: Andrew Or <an...@gmail.com>
Date:   2014-07-22T01:28:32Z

    Formatting changes (minor)

commit c81f506639a88d789fd6736c11a9098901b394cd
Author: Andrew Or <an...@gmail.com>
Date:   2014-07-22T01:28:50Z

    Merge branch 'master' of github.com:apache/spark into spark-home

----

