Posted to reviews@spark.apache.org by adambalogh <gi...@git.apache.org> on 2018/08/30 19:14:49 UTC

[GitHub] spark pull request #22289: [SPARK-25200][YARN] Allow specifying HADOOP_CONF_...

GitHub user adambalogh opened a pull request:

    https://github.com/apache/spark/pull/22289

    [SPARK-25200][YARN] Allow specifying HADOOP_CONF_DIR as spark property

    ## What changes were proposed in this pull request?
    
    When submitting applications to Yarn in cluster mode using the InProcessLauncher, Spark finds the cluster's configuration files based on the HADOOP_CONF_DIR/YARN_CONF_DIR environment variables. This makes it impossible to submit to more than one Yarn cluster concurrently using the InProcessLauncher.
    
    This PR adds a new property `spark.yarn.conf.dir` that lets users select the location of the config files for each submission separately.
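
    For illustration, a minimal sketch of the intended usage with the InProcessLauncher (the jar path, main class, and conf directory below are hypothetical; `spark.yarn.conf.dir` is the property proposed here):

        import org.apache.spark.launcher.InProcessLauncher;
        import org.apache.spark.launcher.SparkAppHandle;

        // Submit to one Yarn cluster using its own config directory, without
        // relying on the process-wide HADOOP_CONF_DIR environment variable.
        SparkAppHandle handle = new InProcessLauncher()
            .setMaster("yarn")
            .setDeployMode("cluster")
            .setAppResource("/path/to/app.jar")
            .setMainClass("com.example.MyApp")
            .setConf("spark.yarn.conf.dir", "/etc/hadoop/cluster-a/conf")
            .startApplication();

    A second submission in the same JVM could point `spark.yarn.conf.dir` at a different directory to target another cluster.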
    
    ## How was this patch tested?
    
    Integration test


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/adambalogh/spark master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22289.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22289
    
----
commit f285081c8307e090638213b4e26cf5c35072b0e0
Author: Adam Balogh <ab...@...>
Date:   2018-08-30T13:05:43Z

    read hadoop config dir from spark property

commit ae41a6603d22e66367d9d25890cf888737ca5db7
Author: Adam Balogh <ab...@...>
Date:   2018-08-30T13:08:36Z

    Merge remote-tracking branch 'upstream/master'

commit 503c9859c37afbf753ee70459c122199d4942af2
Author: Adam Balogh <ab...@...>
Date:   2018-08-30T14:12:47Z

    renames

commit 5622511769a6422d6268163fa7e77981e904b732
Author: Adam Balogh <ab...@...>
Date:   2018-08-30T18:31:08Z

    test

commit d60736acc9b346c75669f924f34a29add855c53b
Author: Adam Balogh <ab...@...>
Date:   2018-08-30T18:39:26Z

    remove SPARK_TEST_HADOOP_CONF_DIR

commit d690b26956323144f8eeb76e79467920ca7dde23
Author: Adam Balogh <ab...@...>
Date:   2018-08-30T18:40:17Z

    move conf.dir to end of classpath

commit bed3f447886caa5419575d8cbc5f29e4ab8ab9cf
Author: Adam Balogh <ab...@...>
Date:   2018-08-30T18:45:06Z

    use right prop in test

----


---



[GitHub] spark pull request #22289: [SPARK-25200][YARN] Allow specifying HADOOP_CONF_...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22289#discussion_r214414143
  
    --- Diff: launcher/src/main/java/org/apache/spark/launcher/AbstractCommandBuilder.java ---
    @@ -200,6 +200,7 @@ void addOptionString(List<String> cmd, String options) {
     
         addToClassPath(cp, getenv("HADOOP_CONF_DIR"));
         addToClassPath(cp, getenv("YARN_CONF_DIR"));
    +    addToClassPath(cp, getEffectiveConfig().get("spark.yarn.conf.dir"));
    --- End diff --
    
    Saisai's question about the classpath configuration is actually the most complicated part of this feature. I haven't fully thought through how these configurations would play out, but I really don't think it's as simple as appending this new config to the classpath.
    
    e.g. what is the expectation if you run "spark-shell" with this option? Do you end up using the config from the env variable or from the config? If you have both, and you reference a file in `--files` that is on an HDFS namespace declared in the `hdfs-site.xml` from the config, what will happen? (Answer: it will be ignored, since that is being masked by the `hdfs-site.xml` from the env variable.)
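
    A minimal sketch of the masking (this is standard JVM resource lookup, not Spark-specific code):

        // The classloader returns the first matching resource on the classpath,
        // so if HADOOP_CONF_DIR precedes spark.yarn.conf.dir, its hdfs-site.xml
        // wins and the copy from spark.yarn.conf.dir is silently ignored.
        java.net.URL url = Thread.currentThread().getContextClassLoader()
            .getResource("hdfs-site.xml");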


---



[GitHub] spark pull request #22289: [SPARK-25200][YARN] Allow specifying HADOOP_CONF_...

Posted by adambalogh <gi...@git.apache.org>.
Github user adambalogh commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22289#discussion_r215947975
  
    --- Diff: launcher/src/main/java/org/apache/spark/launcher/AbstractCommandBuilder.java ---
    @@ -200,6 +200,7 @@ void addOptionString(List<String> cmd, String options) {
     
         addToClassPath(cp, getenv("HADOOP_CONF_DIR"));
         addToClassPath(cp, getenv("YARN_CONF_DIR"));
    +    addToClassPath(cp, getEffectiveConfig().get("spark.yarn.conf.dir"));
    --- End diff --
    
    @vanzin Did you have time to think about how this config should work?


---



[GitHub] spark issue #22289: [SPARK-25200][YARN] Allow specifying HADOOP_CONF_DIR as ...

Posted by adambalogh <gi...@git.apache.org>.
Github user adambalogh commented on the issue:

    https://github.com/apache/spark/pull/22289
  
    Thank you for the detailed explanation! @vanzin 
    
    I agree with what you are saying; however, I'm not sure about some of your points about configs, so I would like to find common ground on how Hadoop/Yarn configuration is supposed to work.
    
    Regarding your 3 points about how configs work, I agree with point 1; however, for point 2, I failed to find documentation about the RM adding its own Hadoop config files to the AM/executors' classpath. Is that documented somewhere, or is it configurable? I did some experimenting where I placed invalid configuration in `HADOOP_CONF_DIR`'s `hdfs-site.xml` (but _not_ in the Yarn `Client`'s configs on the classpath), and the AM failed to start up, indicating that it actually uses the configs from [`LOCALIZED_HADOOP_CONF_DIR`](https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L1204), which is based on the contents of `HADOOP_CONF_DIR`, and not the RM's `hdfs-site.xml`, which had the correct configuration.
    
    For point 3, the Yarn `Client` does distribute its own Hadoop configs to [`SPARK_HADOOP_CONF_FILE`](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L411), which should be overlaid on top of the AM/executors' configs as you said; however it seems like `ApplicationMaster` is actually not doing that, because it [doesn’t use](https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L83) the [newConfiguration instance method](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L112) from `SparkHadoopUtil`; instead it uses the [static newConfiguration method](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala#L434), which doesn’t do the overlaying. Is that intentional? It seems it was introduced [here](https://github.com/apache/spark/pull/19631/files#diff-f442537993cdfc7444783a606b3bd7a4L60).
    
    Sorry for the long comment, and please let me know if I got something wrong.



---



[GitHub] spark issue #22289: [SPARK-25200][YARN] Allow specifying HADOOP_CONF_DIR as ...

Posted by adambalogh <gi...@git.apache.org>.
Github user adambalogh commented on the issue:

    https://github.com/apache/spark/pull/22289
  
    @ifilonenko I don't think we need to extend this to other resource managers, since users of the Kubernetes/Mesos RMs probably don't want to use multiple Hadoop clusters at the same time that often. Or did I misunderstand your question?


---



[GitHub] spark issue #22289: [SPARK-25200][YARN] Allow specifying HADOOP_CONF_DIR as ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22289
  
    Can one of the admins verify this patch?


---



[GitHub] spark pull request #22289: [SPARK-25200][YARN] Allow specifying HADOOP_CONF_...

Posted by jerryshao <gi...@git.apache.org>.
Github user jerryshao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22289#discussion_r214233802
  
    --- Diff: launcher/src/main/java/org/apache/spark/launcher/AbstractCommandBuilder.java ---
    @@ -200,6 +200,7 @@ void addOptionString(List<String> cmd, String options) {
     
         addToClassPath(cp, getenv("HADOOP_CONF_DIR"));
         addToClassPath(cp, getenv("YARN_CONF_DIR"));
    +    addToClassPath(cp, getEffectiveConfig().get("spark.yarn.conf.dir"));
    --- End diff --
    
    I'm wondering how we would update the classpath to switch to another set of Hadoop confs with the InProcessLauncher. It seems the classpath here cannot be changed after the JVM is launched.


---





[GitHub] spark issue #22289: [SPARK-25200][YARN] Allow specifying HADOOP_CONF_DIR as ...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/22289
  
    `spark.hadoop.*` is not a good name. That's a special prefix in Spark that modifies any Hadoop `Configuration` object that Spark instantiates. That's the easy one.
    
    The hard one is that your change doesn't seem to achieve what your PR description says. What you're doing is just uploading the contents of `spark.hadoop.conf.dir` instead of `HADOOP_CONF_DIR` with your YARN app. That means a bunch of things:
    
    - the `Client` class is still using whatever Hadoop configuration is in the classpath to choose the YARN service that will actually run the app.
    - the uploaded config is actually added at the end of the classpath of the AM / executors; the RM places its own configuration before that in the classpath, so in the launched processes, you're still *not* going to be using the configuration you defined in `spark.hadoop.conf.dir`.
    - the configuration used by the `Client` class that I mention above is actually written to a separate file and also sent over to the AM / executors, and overlaid on top of the configuration (see `SparkHadoopUtil.newConfiguration`; a sketch of this overlay follows below).
    
    So to actually achieve what you want to do, you'd have to fix at least two things:
    
    - `SparkHadoopUtil.newConfiguration`
    - the way `Client` creates the YARN configuration (which is [here](https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L68))
    
    Otherwise, this change isn't actually doing much that I can see.
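
    For reference, a minimal sketch (not Spark's actual code) of the overlay idea behind `SparkHadoopUtil.newConfiguration`:

        import java.util.Map;
        import org.apache.hadoop.conf.Configuration;

        class ConfOverlay {
          // Every spark.hadoop.* entry from the Spark config is copied onto the
          // Hadoop Configuration, overriding values loaded from XML files on
          // the classpath.
          static Configuration overlay(Map<String, String> sparkProps, Configuration conf) {
            String prefix = "spark.hadoop.";
            for (Map.Entry<String, String> e : sparkProps.entrySet()) {
              if (e.getKey().startsWith(prefix)) {
                conf.set(e.getKey().substring(prefix.length()), e.getValue());
              }
            }
            return conf;
          }
        }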


---





[GitHub] spark issue #22289: [SPARK-25200][YARN] Allow specifying HADOOP_CONF_DIR as ...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/22289
  
    > I failed to find documentation about the RM adding its own Hadoop config files to the AM/executors' classpath
    
    See `Client.getYarnAppClasspath` and `Client.getMRAppClasspath`.
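
    For context, a hedged illustration: when `yarn.application.classpath` is unset, YARN falls back to `YarnConfiguration.DEFAULT_YARN_APPLICATION_CLASSPATH`, which puts the cluster's own `HADOOP_CONF_DIR` ahead of anything Spark uploads:

        import org.apache.hadoop.yarn.conf.YarnConfiguration;

        // Prints the default AM/executor classpath entries; the cluster's
        // HADOOP_CONF_DIR comes first, before the uploaded configs.
        for (String entry : YarnConfiguration.DEFAULT_YARN_APPLICATION_CLASSPATH) {
          System.out.println(entry);
        }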
    
    > however it seems like ApplicationMaster is actually not doing that, because it doesn’t use the newConfiguration instance method
    
    That may have been intentional. The AM-specific code (which is in the `ApplicationMaster` class) should use the configuration provided by the YARN cluster, since its purpose is to talk to YARN and that's it. User-specific changes can actually break things (e.g. point the AM at a different HDFS than the one where files were staged).
    
    The driver code, which also runs inside the AM process, overlays the user config in the `Configuration` used by the `SparkContext`.
    



---



[GitHub] spark pull request #22289: [SPARK-25200][YARN] Allow specifying HADOOP_CONF_...

Posted by ifilonenko <gi...@git.apache.org>.
Github user ifilonenko commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22289#discussion_r214381467
  
    --- Diff: launcher/src/main/java/org/apache/spark/launcher/AbstractCommandBuilder.java ---
    @@ -200,6 +200,7 @@ void addOptionString(List<String> cmd, String options) {
     
         addToClassPath(cp, getenv("HADOOP_CONF_DIR"));
         addToClassPath(cp, getenv("YARN_CONF_DIR"));
    +    addToClassPath(cp, getEffectiveConfig().get("spark.yarn.conf.dir"));
    --- End diff --
    
    On another note, is this meant to extend to other resource managers? Kubernetes assumes only the `HADOOP_CONF_DIR` env variable, so if such a change is desirable, it would require a slight rework of the current Hadoop conf files mounting logic.


---



[GitHub] spark pull request #22289: [SPARK-25200][YARN] Allow specifying HADOOP_CONF_...

Posted by adambalogh <gi...@git.apache.org>.
Github user adambalogh commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22289#discussion_r214401744
  
    --- Diff: launcher/src/main/java/org/apache/spark/launcher/AbstractCommandBuilder.java ---
    @@ -200,6 +200,7 @@ void addOptionString(List<String> cmd, String options) {
     
         addToClassPath(cp, getenv("HADOOP_CONF_DIR"));
         addToClassPath(cp, getenv("YARN_CONF_DIR"));
    +    addToClassPath(cp, getEffectiveConfig().get("spark.yarn.conf.dir"));
    --- End diff --
    
    @jerryshao My understanding is that this method is not used by the InProcessLauncher. Instead, the caller of the InProcessLauncher has to make sure the conf files are available to Hadoop's `Configuration` class in the `YarnClusterApplication`, for example by adding the config files to the calling thread's context class loader, as in the sketch below.
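
    A minimal sketch of that workaround (the conf directory path is hypothetical):

        import java.io.File;
        import java.net.URL;
        import java.net.URLClassLoader;

        // Make a per-submission Hadoop conf dir visible to Hadoop's Configuration
        // before calling InProcessLauncher.startApplication(). For an existing
        // directory, File.toURI() yields a trailing-slash URL, which
        // URLClassLoader treats as a directory of loose resources.
        URL confDir = new File("/etc/hadoop/cluster-a/conf").toURI().toURL();
        ClassLoader parent = Thread.currentThread().getContextClassLoader();
        Thread.currentThread().setContextClassLoader(
            new URLClassLoader(new URL[] { confDir }, parent));

    Note that this affects everything resolved through that thread's context class loader, so it only works cleanly if each submission runs on its own thread.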


---



[GitHub] spark pull request #22289: [SPARK-25200][YARN] Allow specifying HADOOP_CONF_...

Posted by adambalogh <gi...@git.apache.org>.
Github user adambalogh commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22289#discussion_r214428426
  
    --- Diff: launcher/src/main/java/org/apache/spark/launcher/AbstractCommandBuilder.java ---
    @@ -200,6 +200,7 @@ void addOptionString(List<String> cmd, String options) {
     
         addToClassPath(cp, getenv("HADOOP_CONF_DIR"));
         addToClassPath(cp, getenv("YARN_CONF_DIR"));
    +    addToClassPath(cp, getEffectiveConfig().get("spark.yarn.conf.dir"));
    --- End diff --
    
    Yes, it is quite tricky. My expectation is that it would behave the same way as if you pointed `HADOOP_CONF_DIR` and `YARN_CONF_DIR` to different directories that both contain `hdfs-site.xml`: files in `HADOOP_CONF_DIR` would take precedence (as far as I know, nothing prevents this from happening). So with this new config, the order of precedence would be `HADOOP_CONF_DIR`, then `YARN_CONF_DIR`, then `spark.yarn.conf.dir`.
    
    Perhaps this could be clarified in the docs, but let me know what you think; I'm happy to implement other resolutions.


---
