Posted to reviews@spark.apache.org by mikhaildubkov <gi...@git.apache.org> on 2016/04/26 03:25:49 UTC

[GitHub] spark pull request: [SPARK-14908] [YARN] Provide support HDFS-loca...

GitHub user mikhaildubkov opened a pull request:

    https://github.com/apache/spark/pull/12678

    [SPARK-14908] [YARN] Provide support HDFS-located resources for spark…

    The main goal of these changes is to provide support for HDFS-located resources in "spark.executor.extraClassPath" when Hadoop/YARN deployments are used.
    This can be helpful when you want to use a custom Spark serializer implementation (our project's case).
    
    How it works with these changes:
    1. The value of "spark.executor.extraClassPath" is split by comma
    2. Each path is inspected, keeping those that start with "hdfs://"
    3. A link is generated for each such path and a LocalResource is added to the executor launch context's local resources
    4. The generated links are added to the executor CLASSPATH
    5. The NodeManager downloads the specified local resources into the application cache
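
    The first two steps can be sketched roughly as follows (an illustrative sketch only, not the actual patch code; the function name and sample paths are made up):

```python
def hdfs_entries(extra_class_path: str) -> list[str]:
    """Steps 1-2: split the configured "spark.executor.extraClassPath"
    value by comma and keep only the HDFS-located entries, which would
    then be registered as YARN LocalResources (steps 3-5)."""
    parts = [p.strip() for p in extra_class_path.split(",") if p.strip()]
    return [p for p in parts if p.startswith("hdfs://")]

print(hdfs_entries("hdfs://nn:8020/libs/serializer.jar,/opt/local/extra.jar"))
# ['hdfs://nn:8020/libs/serializer.jar']
```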
    
    After that, you no longer need to deploy extra resources to each Hadoop node manually; it happens automatically.
    
    The changes are fully backward compatible and do not break any existing "spark.executor.extraClassPath" usage.
    
    This patch was tested manually on our 4-node Hadoop cluster.
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mikhaildubkov/spark master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/12678.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #12678
    
----
commit a4f1c10a3f0f10b9f18ca61e599f50a1e17ba8bd
Author: Mikhail Dubkov <mi...@macys.com>
Date:   2016-04-26T00:23:42Z

    [SPARK-14908] [YARN] Provide support HDFS-located resources for spark.executor.extraClasspath on YARN

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-14908] [YARN] Provide support HDFS-loca...

Posted by mikhaildubkov <gi...@git.apache.org>.
Github user mikhaildubkov commented on the pull request:

    https://github.com/apache/spark/pull/12678#issuecomment-215228729
  
    @tgravescs,
    Do you mean using "spark.yarn.cache.filenames" together with the other "**spark.yarn.cache.***" properties (timestamps, sizes, and so on)?
    If yes, _does it mean I have to update these properties each time_ I change an extra classpath jar, i.e. the corresponding timestamp and size properties?
    What I am looking for is a simple way to automate deployment, and with these changes that is possible: you don't need to change any deploy scripts, just deploy a new version of the desired library to HDFS.
    
    Could you please explain how to simplify usage of this option with the distributed cache? I suppose I missed something.
    
    A detailed answer would be really appreciated.
    Thanks in advance!




[GitHub] spark pull request: [SPARK-14908] [YARN] Provide support HDFS-loca...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the pull request:

    https://github.com/apache/spark/pull/12678#issuecomment-215249919
  
    Ah, that's right; I forgot about that.




[GitHub] spark pull request: [SPARK-14908] [YARN] Provide support HDFS-loca...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the pull request:

    https://github.com/apache/spark/pull/12678#issuecomment-215242079
  
    > If the jar/file is there that it will be in the classpath when spark is launched 
    
    For jars it's not as simple as that. But your previous instructions are correct; you need to do *both* of the following:
    
    - distribute the jar file using --files (or --jars although for this case --files is slightly better)
    - add its *name* (not the full path) to `spark.{driver,executor}.extraClassPath`
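
    The second step can be illustrated with a small sketch (hypothetical helper, not Spark code; the jar path is made up): once YARN links the distributed file into the container's working directory, only its base name belongs on the extra classpath.

```python
import os.path

def extra_classpath_entry(distributed_path: str) -> str:
    """Return the entry to put on spark.{driver,executor}.extraClassPath
    for a file distributed via --files: just the file name, since YARN
    links it into the container's working directory."""
    return os.path.basename(distributed_path)

print(extra_classpath_entry("hdfs:///apps/libs/custom-serializer.jar"))
# custom-serializer.jar
```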




[GitHub] spark pull request: [SPARK-14908] [YARN] Provide support HDFS-loca...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/12678#issuecomment-214580205
  
    Can one of the admins verify this patch?




[GitHub] spark pull request: [SPARK-14908] [YARN] Provide support HDFS-loca...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the pull request:

    https://github.com/apache/spark/pull/12678#issuecomment-215241450
  
    First, is com.exeample.CustomSparkSerializer spelled correctly, or did you typo the name?
    
    What kind of file is com.exeample.CustomSparkSerializer in (I assume a jar, but please confirm)? The --files/--jars artifacts get downloaded before any container is launched, and the classpath is set to include $PWD before launch. If the jar/file is there, it will be on the classpath when the executor is launched. It's possible there is something goofy going on, but can you verify your YARN env? Do you have access to the NodeManagers? If so, go to a NodeManager and check that the files were properly downloaded and that the launch script has the correct classpath set.




[GitHub] spark pull request: [SPARK-14908] [YARN] Provide support HDFS-loca...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the pull request:

    https://github.com/apache/spark/pull/12678#issuecomment-215232131
  
    No, those are internal Spark configs. I mean either use the --jars/--files/--archives options to spark-submit, or use the corresponding config options spark.yarn.dist.archives, spark.yarn.dist.files, and spark.jars. See http://spark.apache.org/docs/latest/running-on-yarn.html for further descriptions of the configs, or run spark-submit --help.
    
    On YARN, that will cause whatever files you specify to be downloaded onto each driver/AM/executor node and placed in ./. Since ./ is included in the classpath, if it's a file or jar you don't have to do anything else. If it's an archive, it will be extracted, and if the file you want on the classpath is under a subdirectory, you need to modify the extraClassPath accordingly. It properly handles things in hdfs:// or file://. If you specify something as file://, it is looked up locally on your launcher box, uploaded to the HDFS staging directory, and then downloaded onto the node. If it's already in HDFS, YARN simply downloads it to the executor before launching.
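
    The file:// vs hdfs:// distinction above can be sketched like this (a hypothetical helper for illustration, not Spark's actual implementation):

```python
from urllib.parse import urlparse

def needs_upload_to_staging(path: str) -> bool:
    """file:// (or schemeless local) paths are first uploaded to the HDFS
    staging directory; hdfs:// paths can be localized by YARN directly."""
    scheme = urlparse(path).scheme or "file"
    return scheme == "file"

print(needs_upload_to_staging("file:///tmp/custom-serializer.jar"))   # True
print(needs_upload_to_staging("hdfs:///apps/custom-serializer.jar"))  # False
```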
    
    Note the important note at the bottom of that page:
    The --files and --archives options support specifying file names with # similar to Hadoop. For example, you can specify --files localtest.txt#appSees.txt; this will upload the file you have locally named localtest.txt into HDFS, but it will be linked to by the name appSees.txt, and your application should use the name appSees.txt to reference it when running on YARN.
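
    The # convention described above amounts to a simple naming rule, sketched here (hypothetical helper, not Hadoop/Spark code):

```python
def link_name(files_arg: str) -> str:
    """Resolve the name a --files argument is linked under in the container:
    the alias after '#' if given, otherwise the file's base name."""
    path, sep, alias = files_arg.partition("#")
    return alias if sep else path.rsplit("/", 1)[-1]

print(link_name("localtest.txt#appSees.txt"))  # appSees.txt
print(link_name("hdfs:///libs/custom.jar"))    # custom.jar
```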





[GitHub] spark pull request: [SPARK-14908] [YARN] Provide support HDFS-loca...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the pull request:

    https://github.com/apache/spark/pull/12678#issuecomment-215249127
  
    > If --jars doesn't add it to the classpath then we broke it.
    
    No, that was intentional. --jars adds the jars to the application's class loader (which is different from the JVM's system class path). That's how standalone and Mesos have always behaved, and YARN has behaved like that for at least a few releases.
    
    If you want app jars to be in the system class path you need to use the trick described above, or use `spark.yarn.user.classpath.first`.




[GitHub] spark pull request: [SPARK-14908] [YARN] Provide support HDFS-loca...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the pull request:

    https://github.com/apache/spark/pull/12678#issuecomment-215246440
  
    @vanzin 
    
    If --jars doesn't add it to the classpath then we broke it.
    
     --jars JARS                 Comma-separated list of local jars to include on the driver
                                  and executor classpaths.
    
    Why is --files better?  Both go through distributed cache the same.




[GitHub] spark pull request: [SPARK-14908] [YARN] Provide support HDFS-loca...

Posted by tgravescs <gi...@git.apache.org>.
Github user tgravescs commented on the pull request:

    https://github.com/apache/spark/pull/12678#issuecomment-215139182
  
    Just distribute the file via the distributed cache using the files/jars/archives options (which can point to HDFS), and then include it in your classpath with ./... if it's somewhere down inside an archive. If it's just a jar or file, it's automatically included in the classpath since $PWD is put there.
    
    I don't see why this change is needed.




[GitHub] spark pull request: [SPARK-14908] [YARN] Provide support HDFS-loca...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the pull request:

    https://github.com/apache/spark/pull/12678#issuecomment-215250284
  
    Ah, as for the `--files` question, it's slightly better because it avoids adding the jar again to the application's class loader.




[GitHub] spark pull request: [SPARK-14908] [YARN] Provide support HDFS-loca...

Posted by mikhaildubkov <gi...@git.apache.org>.
Github user mikhaildubkov commented on the pull request:

    https://github.com/apache/spark/pull/12678#issuecomment-215258691
  
    @tgravescs, I just replaced the actual name with a typo'd one.
    @vanzin 
    
    Thank you guys, it works fine with:
    - distribute the jar file using --files (or --jars, although for this case --files is slightly better)
    - add its name (not the full path) to spark.{driver,executor}.extraClassPath
    
    I'm going to close the PR and resolve the corresponding JIRA as "Won't Fix" if you have no objections.
    
    Thank you!





[GitHub] spark pull request: [SPARK-14908] [YARN] Provide support HDFS-loca...

Posted by mikhaildubkov <gi...@git.apache.org>.
Github user mikhaildubkov closed the pull request at:

    https://github.com/apache/spark/pull/12678




[GitHub] spark pull request: [SPARK-14908] [YARN] Provide support HDFS-loca...

Posted by mikhaildubkov <gi...@git.apache.org>.
Github user mikhaildubkov commented on the pull request:

    https://github.com/apache/spark/pull/12678#issuecomment-215239721
  
    @tgravescs,
    
    I have tried --jars, --files and so on, but the main reason we have to use "spark.executor.extraClassPath" is that we need to add the jar to the **CoarseGrainedExecutorBackend** classpath.
    That is due to the usage of a **custom Spark serializer** in our project: Spark instantiates the serializer before loading any --jars/--files etc.
    Here is the stack trace we get without "spark.executor.extraClassPath":
    
    `Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
    	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1672)
    	at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:68)
    	at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:151)
    	at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:253)
    	at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
    Caused by: java.lang.ClassNotFoundException: com.exeample.CustomSparkSerializer
    	at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
    	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    	at java.lang.Class.forName0(Native Method)
    	at java.lang.Class.forName(Class.java:348)
    	at org.apache.spark.util.Utils$.classForName(Utils.scala:174)
    	at org.apache.spark.SparkEnv$.instantiateClass$1(SparkEnv.scala:286)
    	at org.apache.spark.SparkEnv$.instantiateClassFromConf$1(SparkEnv.scala:307)
    	at org.apache.spark.SparkEnv$.create(SparkEnv.scala:310)
    	at org.apache.spark.SparkEnv$.createExecutorEnv(SparkEnv.scala:217)
    	at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:186)
    	at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:69)
    	at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:68)
    	at java.security.AccessController.doPrivileged(Native Method)
    	at javax.security.auth.Subject.doAs(Subject.java:422)
    	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    	... 4 more`
    
    As you can see, the serializer is instantiated while the Spark executor env is being created. I have found no other option to put the jar in the right place; only "spark.executor.extraClassPath" helps.
    I'll try the options you mentioned one more time, but I believe I have already tried all of them.
    The root cause why they do not work for me is that they take effect too late: the serializer class is instantiated before that.
    
    What do you think about it?
    
    Thank you!


