You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by andrewor14 <gi...@git.apache.org> on 2014/09/02 01:07:05 UTC

[GitHub] spark pull request: [SPARK-3319 / 3338] Resolve Spark submit confi...

GitHub user andrewor14 opened a pull request:

    https://github.com/apache/spark/pull/2232

    [SPARK-3319 / 3338] Resolve Spark submit config paths

    **SPARK-3319.** There is currently a divergence in behavior when the user passes in additional jars through `--jars` and through setting `spark.jars` in the default properties file. The former will happily resolve the paths (e.g. convert `my.jar` to `file:/absolute/path/to/my.jar`), while the latter does not. We should resolve paths consistently in both cases. This also applies to the following pairs of command line arguments and Spark configs:
    
    - `--jars` ~ `spark.jars`
    - `--files` ~ `spark.files` / `spark.yarn.dist.files`
    - `--archives` ~ `spark.yarn.dist.archives`
    - `--py-files` ~ `spark.submit.pyFiles`
    
    **SPARK-3338.** This PR also fixes the following bug: if the user sets `spark.submit.pyFiles` in his/her properties file, it does not actually get picked up, even if `--py-files` is not set. This is because it is overridden by an empty string.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/andrewor14/spark resolve-config-paths

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2232.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2232
    
----
commit fe039d35855b4af1635ff7046adccbe639884ed1
Author: Andrew Or <an...@gmail.com>
Date:   2014-08-30T01:58:07Z

    Beef up tests to test fixed-pointed-ness of Utils.resolveURI(s)

commit 460117e9744b173f4209f28a6c7dde2fcc519efd
Author: Andrew Or <an...@gmail.com>
Date:   2014-09-01T22:58:16Z

    Resolve config paths properly

commit 05e03d649980d5025a1ba53c4c9c27b51e5224cc
Author: Andrew Or <an...@gmail.com>
Date:   2014-09-01T22:58:51Z

    Add tests for resolving both command line and config paths

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3319] [SPARK-3338] Resolve Spark submit...

Posted by sryza <gi...@git.apache.org>.
Github user sryza commented on the pull request:

    https://github.com/apache/spark/pull/2232#issuecomment-55985896
  
    Hmm.  My feeling is that it's better to be consistent here and consider the old behavior a bug than to maintain compatibility than to support a cornerish case for a property that's rarely used.  Just wanted to bring it up.  @pwendell do you have an opinion on this?
    
    Either way, for each of these properties, we should document whether they can take HDFS paths and the default behavior when no scheme is given.  Also, are there any strange corner cases we can run into if a file exists at the same path on the local FS and HDFS?
    
    Last, do we need to also resolve URIs for spark.yarn.jar?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3319] [SPARK-3338] Resolve Spark submit...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/2232#issuecomment-61156206
  
    Alright, @sryza @tgravescs any other comments?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3319] [SPARK-3338] Resolve Spark submit...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/2232#issuecomment-56091918
  
    Hi @tgravescs, I believe every point of the behavior you listed is correct and preserved in this PR, since it only affects `spark.yarn.dist.*` and these are resolved to `file://` if relative paths are provided. Then it seems OK to keep the `spark.yarn.dist.*` in the list of things to resolve in `SparkSubmit`.
    
    In addition, here's a tangential clarification question: Isn't setting SPARK_YARN_DIST_* meaningless in cluster mode, because the driver is launched on one of the slave nodes and the resources specified here should already be visible to the executors, which are launched on the same nodes? If so, it will simplify things if we always treat the paths specified through these variables as `hdfs://` paths regardless of the deploy mode. I believe we currently already do this in [ClientArguments](https://github.com/apache/spark/blob/master/yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientArguments.scala), where we distinguish between client and cluster mode only in the comment.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3319] [SPARK-3338] Resolve Spark submit...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/2232#issuecomment-55837711
  
    @sryza can you look at this?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3319] [SPARK-3338] Resolve Spark submit...

Posted by sryza <gi...@git.apache.org>.
Github user sryza commented on the pull request:

    https://github.com/apache/spark/pull/2232#issuecomment-55858105
  
    Could this change behavior in cases where the spark.yarn.dist.files is configured with no scheme?  Without this change, it would interpret no scheme to mean that it's on HDFS, and with it, it would interpret it to mean that it's on the local FS?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3319] [SPARK-3338] Resolve Spark submit...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2232#issuecomment-61156911
  
      [Test build #22549 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22549/consoleFull) for   PR 2232 at commit [`fff2869`](https://github.com/apache/spark/commit/fff2869e9536f02ce3ddc81186b5594ca16dad5d).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3319] [SPARK-3338] Resolve Spark submit...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the pull request:

    https://github.com/apache/spark/pull/2232#issuecomment-56042913
  
    From the other pr we need to keep backwards compatibility for the env variables, the configs should be fine.  I believe this pr matches the behavior described below. Please correct me if I'm wrong as I haven't looked at the code in detail.
    
    The behavior we want is and should have been the behavior as of pr969:
    
    - --archives/--files via sparksubmit defaults to use file:// if not specified, for both yarn-client and yarn-cluster
    - spark.yarn.dist.archives/spark.yarn.dist.files defaults to use file:// if not specified, for both yarn-client and yarn-cluster
    - env variable SPARK_YARN_DIST_ARCHIVES/SPARK_YARN_DIST_FILES set in yarn-client then it should default to hdfs://
    - --files/--archives specified from spark-class in yarn-cluster mode then it should default to hdfs://


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3319] [SPARK-3338] Resolve Spark submit...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/2232#issuecomment-61180747
  
    Alright merging this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3319] [SPARK-3338] Resolve Spark submit...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/2232#issuecomment-55986797
  
    Yes, probably. This list is by no means complete. For `spark.yarn.dist.*` however, it seems that there is some prior discussion on keeping it unresolved (i.e. relative paths become HDFS paths) here: #969. @tgravescs and @vanzin may have objections on this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3319] [SPARK-3338] Resolve Spark submit...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/2232


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3319] [SPARK-3338] Resolve Spark submit...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2232#issuecomment-61167385
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22549/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3319] [SPARK-3338] Resolve Spark submit...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2232#issuecomment-61167381
  
      [Test build #22549 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22549/consoleFull) for   PR 2232 at commit [`fff2869`](https://github.com/apache/spark/commit/fff2869e9536f02ce3ddc81186b5594ca16dad5d).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3319 / 3338] Resolve Spark submit confi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2232#issuecomment-54096247
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19569/consoleFull) for   PR 2232 at commit [`05e03d6`](https://github.com/apache/spark/commit/05e03d649980d5025a1ba53c4c9c27b51e5224cc).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3319 / 3338] Resolve Spark submit confi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2232#issuecomment-54098222
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19569/consoleFull) for   PR 2232 at commit [`05e03d6`](https://github.com/apache/spark/commit/05e03d649980d5025a1ba53c4c9c27b51e5224cc).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3319] [SPARK-3338] Resolve Spark submit...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/2232#issuecomment-55966480
  
    Yes, this changes the behavior if the user sets `spark.yarn.dist.*`. Note that the whole point of this PR is to fix the divergence in path resolution behavior if the user specifies `--files` vs the corresponding config (which in the case of yarn-client is `spark.yarn.dist.files`).
    
    However I believe these configs are actually introduced before Spark submit. Do you think we should leave the yarn ones out to keep backward compatibility?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3319] [SPARK-3338] Resolve Spark submit...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2232#issuecomment-55842293
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20441/consoleFull) for   PR 2232 at commit [`f0fae64`](https://github.com/apache/spark/commit/f0fae6410aed83f5acb5ab9727840613e0b391b8).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3319] [SPARK-3338] Resolve Spark submit...

Posted by sryza <gi...@git.apache.org>.
Github user sryza commented on the pull request:

    https://github.com/apache/spark/pull/2232#issuecomment-61179179
  
    This looks good to me.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3319] [SPARK-3338] Resolve Spark submit...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2232#issuecomment-55838217
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20441/consoleFull) for   PR 2232 at commit [`f0fae64`](https://github.com/apache/spark/commit/f0fae6410aed83f5acb5ab9727840613e0b391b8).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org