Posted to reviews@spark.apache.org by andrewor14 <gi...@git.apache.org> on 2014/05/22 10:36:56 UTC

[GitHub] spark pull request: [SPARK-1900] Fix running PySpark files on YARN

GitHub user andrewor14 opened a pull request:

    https://github.com/apache/spark/pull/853

    [SPARK-1900] Fix running PySpark files on YARN

    If I run the following on a YARN cluster
    ```
    bin/spark-submit sheep.py --master yarn-client
    ```
    it fails because of a mismatch in paths. Spark submit thinks that `sheep.py` resides on HDFS, and balks when it can't find the file there. A natural workaround is to add the `file:` prefix to the file:
    ```
    bin/spark-submit file:/path/to/sheep.py --master yarn-client
    ```
    However, this also fails, this time because Python does not understand URI schemes.
    
    This PR fixes this by automatically resolving all paths passed as command-line arguments to spark-submit. This has the added benefit of keeping file and jar paths consistent across different cluster modes.
    
    Much of the code was written by @mengxr.
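
A minimal sketch of the idea in the description above, not the code from this PR: a path without a scheme is treated as local and turned into an absolute `file:` URI, while a local script handed to Python has its scheme stripped again. The object and helper names, and the `hdfs://namenode/...` address, are hypothetical and used only for illustration.

```scala
import java.io.File
import java.net.URI

object ResolvePathSketch {
  // Hypothetical helper (not the PR's code): keep URIs that already name a
  // scheme, and treat everything else as a path on the local filesystem.
  def toUri(path: String): URI = {
    val uri = new URI(path)
    if (uri.getScheme != null) uri
    else new File(path).getAbsoluteFile.toURI
  }

  // Hypothetical helper: the Python interpreter expects a plain path, so
  // drop the file: scheme before handing a local script to it.
  def toPythonPath(uri: URI): String =
    if (uri.getScheme == "file") uri.getPath else uri.toString

  def main(args: Array[String]): Unit = {
    println(toUri("sheep.py"))                      // absolute file: URI under the working directory
    println(toUri("hdfs://namenode/apps/sheep.py")) // scheme kept as given
    println(toPythonPath(new URI("file:/path/to/sheep.py")))
  }
}
```

Note that `new URI(...)` rejects strings containing characters such as spaces, so a real implementation would need to decide how to handle those inputs.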

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/andrewor14/spark submit-paths

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/853.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #853
    
----
commit 02f77f39c5f8a530c58f86dbe28e8e3507c3cfb0
Author: Andrew Or <an...@gmail.com>
Date:   2014-05-22T08:17:08Z

    Resolve command line arguments to spark-submit properly
    
    Jars and files provided to spark-submit are treated as HDFS paths
    on YARN clusters, even if they exist locally. This is inconsistent
    across different modes. Instead, we should always treat the command
    line argument paths passed to spark-submit as local paths, unless
    otherwise specified.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: [SPARK-1900] Fix running PySpark files on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/853#issuecomment-43958672
  
    Merged build started. 



[GitHub] spark pull request: [SPARK-1900] Fix running PySpark files on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/853#issuecomment-43984209
  
    Merged build finished. 



[GitHub] spark pull request: [SPARK-1900] Fix running PySpark files on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/853#issuecomment-44051068
  
    Merged build finished. All automated tests passed.



[GitHub] spark pull request: [SPARK-1900 / 1918] PySpark on YARN is broken

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/853#issuecomment-44084017
  
    Merged build finished. All automated tests passed.



[GitHub] spark pull request: [SPARK-1900] Fix running PySpark files on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/853#issuecomment-44051069
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15169/



[GitHub] spark pull request: [SPARK-1900] Fix running PySpark files on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/853#issuecomment-43946727
  
    Merged build finished. 



[GitHub] spark pull request: [SPARK-1900] Fix running PySpark files on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/853#issuecomment-43980095
  
    Merged build finished. 



[GitHub] spark pull request: [SPARK-1900] Fix running PySpark files on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/853#issuecomment-44069407
  
     Merged build triggered. 



[GitHub] spark pull request: [SPARK-1900] Fix running PySpark files on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/853#issuecomment-43988911
  
    Merged build finished. All automated tests passed.



[GitHub] spark pull request: [SPARK-1900] Fix running PySpark files on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/853#issuecomment-43946728
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15145/



[GitHub] spark pull request: [SPARK-1900] Fix running PySpark files on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/853#issuecomment-43980116
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15157/



[GitHub] spark pull request: [SPARK-1900] Fix running PySpark files on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/853#issuecomment-43873047
  
    Merged build finished. All automated tests passed.



[GitHub] spark pull request: [SPARK-1900] Fix running PySpark files on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/853#issuecomment-43961142
  
    Merged build finished. All automated tests passed.



[GitHub] spark pull request: [SPARK-1900] Fix running PySpark files on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/853#issuecomment-43864110
  
    Merged build finished. 



[GitHub] spark pull request: [SPARK-1900] Fix running PySpark files on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/853#issuecomment-43985497
  
    Merged build finished. All automated tests passed.



[GitHub] spark pull request: [SPARK-1900] Fix running PySpark files on YARN

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/853#discussion_r12987849
  
    --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
    @@ -1166,4 +1166,35 @@ private[spark] object Utils extends Logging {
             true
         }
       }
    +
    +  /** Return the URI of the input path. If a relative path is given, assume it is local. */
    +  def resolveURI(path: String, testWindows: Boolean = false): URI = {
    +
    +    // On Windows, file names cannot contain backslashes and colons,
    +    // and each drive contains only a single alphabet character
    +    val windows = isWindows || testWindows
    +    val sanitizedPath = if (windows) path.replace("\\", "/") else path
    --- End diff --
    
    Might be good to document this and just say `// Convert from Windows path syntax to URI syntax` or something. In my mind, "sanitized" would convey that you are, e.g., encoding special characters. But really this is just converting from one format to another.
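
The conversion discussed in this comment can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: the input is assumed to be a Windows-style path, the object and method names are hypothetical, and relative paths are returned unchanged rather than made absolute.

```scala
import java.net.URI

object WindowsPathSketch {
  // A Windows drive is a single letter, which java.net.URI parses as a scheme.
  private val driveLetter = "([a-zA-Z])".r

  // Hypothetical helper: convert Windows path syntax to URI syntax.
  def windowsPathToUri(path: String): URI = {
    // Backslashes are not valid URI separators, so switch to forward slashes.
    val slashed = path.replace("\\", "/")
    val uri = new URI(slashed)
    uri.getScheme match {
      case null           => uri                          // no drive, no scheme: left as is
      case driveLetter(_) => new URI("file:/" + slashed)  // "C:/x" becomes "file:/C:/x"
      case _              => uri                          // a real scheme such as hdfs
    }
  }

  def main(args: Array[String]): Unit = {
    println(windowsPathToUri("C:\\path\\to\\file.txt"))
  }
}
```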



[GitHub] spark pull request: [SPARK-1900] Fix running PySpark files on YARN

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/853#discussion_r12988570
  
    --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
    @@ -1166,4 +1166,35 @@ private[spark] object Utils extends Logging {
             true
         }
       }
    +
    +  /** Return the URI of the input path. If a relative path is given, assume it is local. */
    +  def resolveURI(path: String, testWindows: Boolean = false): URI = {
    +
    +    // On Windows, file names cannot contain backslashes and colons,
    +    // and each drive contains only a single alphabet character
    +    val windows = isWindows || testWindows
    +    val sanitizedPath = if (windows) path.replace("\\", "/") else path
    +    val windowsDrive = "([a-zA-Z])".r
    +
    +    val uri = new URI(sanitizedPath)
    +    uri.getScheme match {
    +      case windowsDrive(d) if windows =>
    +        new URI("file:/" + uri.toString.stripPrefix("/"))
    +      case null =>
    +        val fragment = uri.getFragment
    --- End diff --
    
    There is a small bullet point that explains this: http://people.apache.org/~pwendell/spark-1.0.0-rc7-docs/running-on-yarn.html#important-notes
    
    I'll add a comment here.
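
As a rough illustration of why the fragment is read separately in the quoted diff: `java.net.URI` splits a trailing `#name` off the path, so it has to be reattached after a scheme-less path is made absolute. The sketch below assumes the input has no scheme; the object name, method name, and file names are hypothetical.

```scala
import java.io.File
import java.net.URI

object FragmentSketch {
  // Hypothetical helper: make a scheme-less path absolute while keeping
  // any "#fragment" suffix the user supplied.
  def resolveLocal(path: String): URI = {
    val uri = new URI(path)
    val fragment = uri.getFragment // null when the input has no "#..."
    val absolute = new File(uri.getPath).getAbsoluteFile.toURI
    // URI(scheme, schemeSpecificPart, fragment) reattaches the fragment.
    new URI(absolute.getScheme, absolute.getPath, fragment)
  }

  def main(args: Array[String]): Unit = {
    println(resolveLocal("data.txt#alias.txt"))
    println(resolveLocal("data.txt"))
  }
}
```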



[GitHub] spark pull request: [SPARK-1900] Fix running PySpark files on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/853#issuecomment-43976505
  
    Merged build started. 



[GitHub] spark pull request: [SPARK-1900] Fix running PySpark files on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/853#issuecomment-43985463
  
    Merged build started. 



[GitHub] spark pull request: [SPARK-1900 / 1918] PySpark on YARN is broken

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/853



[GitHub] spark pull request: [SPARK-1900] Fix running PySpark files on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/853#issuecomment-43980011
  
     Merged build triggered. 



[GitHub] spark pull request: [SPARK-1900] Fix running PySpark files on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/853#issuecomment-44069414
  
    Merged build started. 



[GitHub] spark pull request: [SPARK-1900] Fix running PySpark files on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/853#issuecomment-43980689
  
     Merged build triggered. 



[GitHub] spark pull request: [SPARK-1900 / 1918] PySpark on YARN is broken

Posted by tdas <gi...@git.apache.org>.
Github user tdas commented on the pull request:

    https://github.com/apache/spark/pull/853#issuecomment-44108589
  
    I have tested them with Hadoop 2.4 and Spark Standalone as well and they work. This was a very tricky bug fix that required testing all combinations of deployment modes (local, Spark standalone, yarn-client, yarn-cluster, Windows) and execution modes (jars, spark shell, python shell, python scripts). Thanks @andrewor14 for doing this and thanks to @mengxr for helping us out. I am merging this.



[GitHub] spark pull request: [SPARK-1900 / 1918] PySpark on YARN is broken

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/853#issuecomment-44071593
  
    Merged build finished. All automated tests passed.



[GitHub] spark pull request: [SPARK-1900] Fix running PySpark files on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/853#issuecomment-43976497
  
     Merged build triggered. 



[GitHub] spark pull request: [SPARK-1900] Fix running PySpark files on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/853#issuecomment-43988914
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15161/



[GitHub] spark pull request: [SPARK-1900] Fix running PySpark files on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/853#issuecomment-43863207
  
     Merged build triggered. 



[GitHub] spark pull request: [SPARK-1900] Fix running PySpark files on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/853#issuecomment-43972255
  
    Merged build started. 



[GitHub] spark pull request: [SPARK-1900] Fix running PySpark files on YARN

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/853#discussion_r13012674
  
    --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
    @@ -1166,4 +1166,42 @@ private[spark] object Utils extends Logging {
             true
         }
       }
    +
    +  /**
    +   * Return a well-formed URI for the file described by a user input string.
    +   *
    +   * If the supplied path does not contain a scheme, or is a relative path, it will be
    +   * converted into an absolute path with a file:// scheme.
    +   */
    +  def resolveURI(path: String, testWindows: Boolean = false): URI = {
    +
    +    val windows = isWindows || testWindows
    +    // In Windows, the file separator is a backslash, but this is inconsistent with the URI format
    +    val formattedPath = if (windows) path.replace("\\", "/") else path
    +    // Each Windows drive contains only a single alphabet character
    +    val windowsDrive = "([a-zA-Z])".r
    +
    +    val uri = new URI(formattedPath)
    +    uri.getScheme match {
    --- End diff --
    
    I think we should delegate checking whether the file exists to somewhere else outside of this function. This is intended to be a general utils function, where the file may not need to exist locally even if it is a `file:/`.
    
    Yes, NPEs are bad error messages and I will add a guard against them.
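
One way the two points in this reply could look in practice, sketched under assumptions rather than taken from the PR: a null input is rejected with a readable message before it can reach `new URI(...)` (which throws a `NullPointerException` for null), and the existence check lives in its own function so the resolver stays general. All names here are hypothetical.

```scala
import java.io.File
import java.net.URI

object GuardSketch {
  // Hypothetical guard: fail with an IllegalArgumentException and a clear
  // message instead of letting a null path surface later as an NPE.
  def checkedPath(path: String): String = {
    require(path != null, "path must not be null")
    path
  }

  // Hypothetical separate check: callers that need a local file can ask
  // for it explicitly; the resolver itself never touches the filesystem.
  def existsLocally(uri: URI): Boolean =
    uri.getScheme == "file" && new File(uri.getPath).exists()

  def main(args: Array[String]): Unit = {
    val uri = new URI(checkedPath("file:/path/to/sheep.py"))
    println(existsLocally(uri))
  }
}
```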



[GitHub] spark pull request: [SPARK-1900] Fix running PySpark files on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/853#issuecomment-43984213
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15158/



[GitHub] spark pull request: [SPARK-1900] Fix running PySpark files on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/853#issuecomment-44044689
  
    Merged build started. 



[GitHub] spark pull request: [SPARK-1900 / 1918] PySpark on YARN is broken

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/853#issuecomment-44071596
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15174/



[GitHub] spark pull request: [SPARK-1900] Fix running PySpark files on YARN

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/853#issuecomment-44042038
  
    I see. #849 recently went in but clearly doesn't resolve paths (because it was in another PR). This clearly needs to be fixed.



[GitHub] spark pull request: [SPARK-1900] Fix running PySpark files on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/853#issuecomment-43942626
  
    Merged build started. 



[GitHub] spark pull request: [SPARK-1900] Fix running PySpark files on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/853#issuecomment-43980021
  
    Merged build started. 



[GitHub] spark pull request: [SPARK-1900 / 1918] PySpark on YARN is broken

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/853#issuecomment-44083218
  
     Merged build triggered. 



[GitHub] spark pull request: [SPARK-1900 / 1918] PySpark on YARN is broken

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/853#issuecomment-44071594
  
    Merged build finished. All automated tests passed.



[GitHub] spark pull request: [SPARK-1900] Fix running PySpark files on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/853#issuecomment-43977357
  
    Merged build started. 



[GitHub] spark pull request: [SPARK-1900] Fix running PySpark files on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/853#issuecomment-43985452
  
     Merged build triggered. 



[GitHub] spark pull request: [SPARK-1900] Fix running PySpark files on YARN

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/853#issuecomment-43958505
  
    Jenkins, test this please



[GitHub] spark pull request: [SPARK-1900] Fix running PySpark files on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/853#issuecomment-43980697
  
    Merged build started. 



[GitHub] spark pull request: [SPARK-1900] Fix running PySpark files on YARN

Posted by tdas <gi...@git.apache.org>.
Github user tdas commented on a diff in the pull request:

    https://github.com/apache/spark/pull/853#discussion_r13000943
  
    --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
    @@ -1166,4 +1166,42 @@ private[spark] object Utils extends Logging {
             true
         }
       }
    +
    +  /**
    +   * Return a well-formed URI for the file described by a user input string.
    +   *
    +   * If the supplied path does not contain a scheme, or is a relative path, it will be
    +   * converted into an absolute path with a file:// scheme.
    +   */
    +  def resolveURI(path: String, testWindows: Boolean = false): URI = {
    +
    +    val windows = isWindows || testWindows
    +    // In Windows, the file separator is a backslash, but this is inconsistent with the URI format
    +    val formattedPath = if (windows) path.replace("\\", "/") else path
    +    // Each Windows drive contains only a single alphabet character
    +    val windowsDrive = "([a-zA-Z])".r
    +
    +    val uri = new URI(formattedPath)
    +    uri.getScheme match {
    --- End diff --
    
    Also, please add test cases for some bad paths that you can think of. 



[GitHub] spark pull request: [SPARK-1900] Fix running PySpark files on YARN

Posted by tdas <gi...@git.apache.org>.
Github user tdas commented on a diff in the pull request:

    https://github.com/apache/spark/pull/853#discussion_r13001368
  
    --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
    @@ -1166,4 +1166,42 @@ private[spark] object Utils extends Logging {
             true
         }
       }
    +
    +  /**
    +   * Return a well-formed URI for the file described by a user input string.
    +   *
    +   * If the supplied path does not contain a scheme, or is a relative path, it will be
    +   * converted into an absolute path with a file:// scheme.
    +   */
    +  def resolveURI(path: String, testWindows: Boolean = false): URI = {
    +
    +    val windows = isWindows || testWindows
    +    // In Windows, the file separator is a backslash, but this is inconsistent with the URI format
    +    val formattedPath = if (windows) path.replace("\\", "/") else path
    +    // Each Windows drive contains only a single alphabet character
    +    val windowsDrive = "([a-zA-Z])".r
    +
    +    val uri = new URI(formattedPath)
    +    uri.getScheme match {
    --- End diff --
    
    Also, I think that, at least for local files, we can check early in spark-submit whether the file exists.
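
The early check suggested above could look roughly like the following. This is a minimal sketch, not code from the PR: the object `LocalPathCheck` and the method `missingLocalFiles` are hypothetical names, and the sketch assumes that only scheme-less and `file:` URIs can be checked on the submitting machine.

```scala
import java.io.File
import java.net.URI

object LocalPathCheck {
  /** Returns the URIs that claim to be local but do not exist on this machine. */
  def missingLocalFiles(uris: Seq[URI]): Seq[URI] =
    uris.filter { uri =>
      val scheme = uri.getScheme
      val isLocal = scheme == null || scheme == "file"
      // getPath is null for opaque URIs such as "file:hello"; count those as missing
      isLocal && (uri.getPath == null || !new File(uri.getPath).exists())
    }

  def main(args: Array[String]): Unit = {
    val existing = File.createTempFile("app", ".jar")
    existing.deleteOnExit()
    val uris = Seq(
      existing.toURI,
      new URI("file:/no/such/dir/app.jar"),
      new URI("hdfs://namenode:8020/app.jar")) // remote paths are not checked here
    missingLocalFiles(uris).foreach(u => println(s"Local file does not exist: $u"))
  }
}
```

Failing at submit time with a message that names the offending path would be easier to diagnose than an error raised later on the cluster.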



[GitHub] spark pull request: [SPARK-1900] Fix running PySpark files on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/853#issuecomment-44068890
  
    Merged build started. 



[GitHub] spark pull request: [SPARK-1900] Fix running PySpark files on YARN

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/853#discussion_r12987475
  
    --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
    @@ -1166,4 +1166,35 @@ private[spark] object Utils extends Logging {
             true
         }
       }
    +
    +  /** Return the URI of the input path. If a relative path is given, assume it is local. */
    +  def resolveURI(path: String, testWindows: Boolean = false): URI = {
    +
    +    // On Windows, file names cannot contain backslashes and colons,
    +    // and each drive contains only a single alphabet character
    +    val windows = isWindows || testWindows
    +    val sanitizedPath = if (windows) path.replace("\\", "/") else path
    +    val windowsDrive = "([a-zA-Z])".r
    +
    +    val uri = new URI(sanitizedPath)
    +    uri.getScheme match {
    +      case windowsDrive(d) if windows =>
    +        new URI("file:/" + uri.toString.stripPrefix("/"))
    +      case null =>
    +        val fragment = uri.getFragment
    --- End diff --
    
    It would be good to document that we preserve the fragment because it has a special meaning for YARN URIs.
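
To illustrate the point about fragments, here is a small sketch of resolving a scheme-less path while carrying the `#fragment` over. The names `FragmentSketch` and `toFileUriKeepingFragment` are hypothetical, and the expected strings in the usage lines assume a Unix-style file system.

```scala
import java.io.File
import java.net.URI

object FragmentSketch {
  /** Hypothetical helper: make a scheme-less path absolute while keeping any #fragment. */
  def toFileUriKeepingFragment(path: String): URI = {
    val uri = new URI(path)
    if (uri.getScheme != null) {
      uri
    } else {
      // File.toURI knows nothing about fragments, so only the path goes through it
      val absolute = new File(uri.getPath).getAbsoluteFile.toURI
      // Rebuild the URI and re-attach the fragment from the user input
      new URI(absolute.getScheme, absolute.getHost, absolute.getPath, uri.getFragment)
    }
  }

  def main(args: Array[String]): Unit = {
    println(toFileUriKeepingFragment("/no/such/dir/data.txt#alias"))
    println(toFileUriKeepingFragment("hdfs:/data.txt#alias"))
  }
}
```

Without the explicit rebuild, the part after `#` would be lost as soon as the path passes through `java.io.File`.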




[GitHub] spark pull request: [SPARK-1900] Fix running PySpark files on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/853#issuecomment-43942604
  
     Merged build triggered. 



[GitHub] spark pull request: [SPARK-1900] Fix running PySpark files on YARN

Posted by tdas <gi...@git.apache.org>.
Github user tdas commented on a diff in the pull request:

    https://github.com/apache/spark/pull/853#discussion_r12999133
  
    --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
    @@ -1166,4 +1166,42 @@ private[spark] object Utils extends Logging {
             true
         }
       }
    +
    +  /**
    +   * Return a well-formed URI for the file described by a user input string.
    +   *
    +   * If the supplied path does not contain a scheme, or is a relative path, it will be
    +   * converted into an absolute path with a file:// scheme.
    +   */
    +  def resolveURI(path: String, testWindows: Boolean = false): URI = {
    +
    +    val windows = isWindows || testWindows
    +    // In Windows, the file separator is a backslash, but this is inconsistent with the URI format
    +    val formattedPath = if (windows) path.replace("\\", "/") else path
    +    // Each Windows drive contains only a single alphabet character
    +    val windowsDrive = "([a-zA-Z])".r
    +
    +    val uri = new URI(formattedPath)
    +    uri.getScheme match {
    +      case windowsDrive(d) if windows =>
    +        new URI("file:/" + uri.toString.stripPrefix("/"))
    --- End diff --
    
    I am a little confused by this. This function will convert a Windows path like `c:\hello\world.txt` to `file:/c:/hello/world.txt`, so the backslashes get permanently replaced by forward slashes. How does resolving paths on Windows work after that?
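
The conversion described in this comment can be reproduced in isolation. The sketch below is a simplified stand-in for the Windows branch of the diff, not the PR's code: `WindowsPathSketch` and `windowsPathToUri` are hypothetical names, and the helper always treats its input as a Windows-style path instead of consulting `isWindows`.

```scala
import java.net.URI

object WindowsPathSketch {
  // A drive letter parses as a one-letter URI scheme, e.g. "c" in "c:/hello"
  private val windowsDrive = "([a-zA-Z])".r

  /** Simplified Windows branch: backslashes become slashes, drive paths become file: URIs. */
  def windowsPathToUri(path: String): URI = {
    val formatted = path.replace("\\", "/")
    val uri = new URI(formatted)
    uri.getScheme match {
      case null => uri // relative or drive-less path, left alone in this sketch
      case windowsDrive(_) => new URI("file:/" + uri.toString.stripPrefix("/"))
      case _ => uri // already has a real scheme such as hdfs or file
    }
  }

  def main(args: Array[String]): Unit = {
    val uri = windowsPathToUri("c:\\hello\\world.txt")
    println(uri)         // the resolved URI
    println(uri.getPath) // the path component, still containing the drive letter
  }
}
```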



[GitHub] spark pull request: [SPARK-1900] Fix running PySpark files on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/853#issuecomment-43973957
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15154/



[GitHub] spark pull request: [SPARK-1900] Fix running PySpark files on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/853#issuecomment-43873048
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15141/



[GitHub] spark pull request: [SPARK-1900] Fix running PySpark files on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/853#issuecomment-43870051
  
     Merged build triggered. 



[GitHub] spark pull request: [SPARK-1900] Fix running PySpark files on YARN

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/853#issuecomment-43985009
  
    Jenkins, retest this please.



[GitHub] spark pull request: [SPARK-1900] Fix running PySpark files on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/853#issuecomment-44051067
  
    Merged build finished. All automated tests passed.



[GitHub] spark pull request: [SPARK-1900] Fix running PySpark files on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/853#issuecomment-43980115
  
    Merged build finished. 



[GitHub] spark pull request: [SPARK-1900] Fix running PySpark files on YARN

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/853#discussion_r12987504
  
    --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
    @@ -1166,4 +1166,35 @@ private[spark] object Utils extends Logging {
             true
         }
       }
    +
    +  /** Return the URI of the input path. If a relative path is given, assume it is local. */
    +  def resolveURI(path: String, testWindows: Boolean = false): URI = {
    +
    +    // On Windows, file names cannot contain backslashes and colons,
    +    // and each drive contains only a single alphabet character
    +    val windows = isWindows || testWindows
    +    val sanitizedPath = if (windows) path.replace("\\", "/") else path
    --- End diff --
    
    Just so I understand - this conversion is necessary because we need to convert the Windows path syntax to the standard URI syntax used by Hadoop. Is that correct? 



[GitHub] spark pull request: [SPARK-1900] Fix running PySpark files on YARN

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/853#discussion_r12987643
  
    --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
    @@ -1166,4 +1166,35 @@ private[spark] object Utils extends Logging {
             true
         }
       }
    +
    +  /** Return the URI of the input path. If a relative path is given, assume it is local. */
    +  def resolveURI(path: String, testWindows: Boolean = false): URI = {
    +
    +    // On Windows, file names cannot contain backslashes and colons,
    +    // and each drive contains only a single alphabet character
    +    val windows = isWindows || testWindows
    +    val sanitizedPath = if (windows) path.replace("\\", "/") else path
    --- End diff --
    
    Or rather, new URI("path\\that\\contains\\backslashes") throws an exception otherwise
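
The behaviour mentioned here is easy to confirm on its own: `java.net.URI` rejects a backslash as an illegal character. A minimal sketch follows; the object and method names are illustrative only.

```scala
import java.net.{URI, URISyntaxException}
import scala.util.{Failure, Success, Try}

object BackslashSketch {
  /** Returns true if java.net.URI accepts the string as-is. */
  def parses(s: String): Boolean = Try(new URI(s)) match {
    case Success(_) => true
    case Failure(_: URISyntaxException) => false
    case Failure(other) => throw other
  }

  def main(args: Array[String]): Unit = {
    println(parses("path\\that\\contains\\backslashes"))                   // rejected by URI
    println(parses("path\\that\\contains\\backslashes".replace("\\", "/"))) // accepted
  }
}
```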



[GitHub] spark pull request: [SPARK-1900 / 1918] PySpark on YARN is broken

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/853#issuecomment-44083220
  
    Merged build started. 



[GitHub] spark pull request: [SPARK-1900] Fix running PySpark files on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/853#issuecomment-43863221
  
    Merged build started. 



[GitHub] spark pull request: [SPARK-1900 / 1918] PySpark on YARN is broken

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/853#issuecomment-44084121
  
    @tdas I have pushed a commit that corrects the way we set PYTHONPATH. In a nutshell, python does not understand URI schemes (e.g. `file:/`), but the paths we add to PYTHONPATH do contain these prefixes (e.g. `file:/path/to/hello.py`). Instead, we should strip the prefix and only add the actual path (e.g. `/path/to/hello.py`).
    
    Unfortunately, this involves a fairly non-trivial change, because we also have to make sure that the provided python files exist locally, such that adding them to the PYTHONPATH is actually meaningful.
    
    Also, we have been adding the python file itself to the PYTHONPATH. This is incorrect and does not work on YARN; instead, we should be adding the python file's containing directory. However, `--py-files` may also contain zip files, in which case we still have to add the file itself to the PYTHONPATH. This is reflected in my latest commit (in context.py).
    
    This is a slightly invasive change, but much of the new code is tests for formatting the paths properly. The good news is that I have tested this locally, on a CDH5 cluster, and on Windows, and everything behaves as expected. More specifically, on each of these deploy modes, I ran a combination of spark-shell, spark-submit, and pyspark, with jars / python files referencing each other. I can confirm that `--py-files` (which was broken for YARN before this commit) is now working.
    
    I have not had the time to test this on standalone mode or HDP cluster (especially with Hadoop 2.4). After these have been tested, I think this PR is ready for merge.
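
The PYTHONPATH rules described in this comment can be sketched as follows. This is an illustration under stated assumptions, not the code in the commit: `PythonPathSketch`, `pythonPathEntry` and `pythonPath` are hypothetical names, the `.py` suffix test is an assumed way to tell modules from archives, and the expected values in the comments assume a Unix-style file system.

```scala
import java.io.File
import java.net.URI

object PythonPathSketch {
  /** Hypothetical helper: turn one user-supplied python file into a PYTHONPATH entry. */
  def pythonPathEntry(fileUri: String): String = {
    val uri = new URI(fileUri)
    require(uri.getScheme == null || uri.getScheme == "file",
      s"Only local files can be placed on PYTHONPATH: $fileUri")
    require(uri.getPath != null, s"Malformed path: $fileUri")
    val path = uri.getPath // drops the "file:" prefix that python does not understand
    if (path.endsWith(".py")) {
      // A plain module is importable only if its containing directory is on the path
      Option(new File(path).getParent).getOrElse(".")
    } else {
      // Archives such as .zip files are added to the path as they are
      path
    }
  }

  /** Joins the entries with the platform separator, dropping duplicate directories. */
  def pythonPath(fileUris: Seq[String]): String =
    fileUris.map(pythonPathEntry).distinct.mkString(File.pathSeparator)

  def main(args: Array[String]): Unit = {
    println(pythonPathEntry("file:/path/to/hello.py")) // the containing directory
    println(pythonPathEntry("file:/path/to/libs.zip")) // the archive itself
  }
}
```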




[GitHub] spark pull request: [SPARK-1900] Fix running PySpark files on YARN

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/853#discussion_r12972009
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala ---
    @@ -377,4 +382,13 @@ object SparkSubmitArguments {
         }
         properties.stringPropertyNames().toSeq.map(k => (k, properties(k).trim))
       }
    +
    +  /** Resolves comma separated paths. */
    --- End diff --
    
    Note to self: also add tests for this
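
For context, resolving a comma-separated list of paths might look like the sketch below. The helper names are hypothetical, `resolve` is only a reduced stand-in for the `resolveURI` under review (no Windows or fragment handling), and the example output assumes a Unix-style file system.

```scala
import java.io.File
import java.net.URI

object ResolvePathsSketch {
  /** Stand-in resolver: scheme-less paths become absolute file: URIs. */
  def resolve(path: String): URI = {
    val uri = new URI(path)
    if (uri.getScheme != null) uri else new File(path).getAbsoluteFile.toURI
  }

  /** Resolves each entry of a comma-separated list, skipping empty entries. */
  def resolvePaths(paths: String): String =
    if (paths == null || paths.trim.isEmpty) {
      ""
    } else {
      paths.split(",").map(_.trim).filter(_.nonEmpty).map(p => resolve(p).toString).mkString(",")
    }

  def main(args: Array[String]): Unit = {
    println(resolvePaths("hdfs:/a.jar, /no/such/b.jar,,file:/c.py"))
  }
}
```

Cases worth covering in tests include empty input, stray commas, surrounding whitespace, and a mix of schemes.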



[GitHub] spark pull request: [SPARK-1900] Fix running PySpark files on YARN

Posted by tdas <gi...@git.apache.org>.
Github user tdas commented on the pull request:

    https://github.com/apache/spark/pull/853#issuecomment-44010184
  
    I tested this and this breaks spark-shell in standalone mode with this error (when the given relative path was `app.jar`)
    ```
    java.io.FileNotFoundException: /root/tdas/file:/root/tdas/app.jar (No such file or directory)
    	at java.io.FileInputStream.open(Native Method)
    	at java.io.FileInputStream.<init>(FileInputStream.java:146)
    	at com.google.common.io.Files$FileByteSource.openStream(Files.java:124)
    	at com.google.common.io.Files$FileByteSource.openStream(Files.java:114)
    	at com.google.common.io.ByteSource.copyTo(ByteSource.java:202)
    	at com.google.common.io.Files.copy(Files.java:436)
    	at org.apache.spark.HttpFileServer.addFileToDir(HttpFileServer.scala:62)
    	at org.apache.spark.HttpFileServer.addJar(HttpFileServer.scala:57)
    	at org.apache.spark.SparkContext.addJar(SparkContext.scala:944)
    	at org.apache.spark.SparkContext$$anonfun$11.apply(SparkContext.scala:265)
    	at org.apache.spark.SparkContext$$anonfun$11.apply(SparkContext.scala:265)
    	at scala.collection.immutable.List.foreach(List.scala:318)
    	at org.apache.spark.SparkContext.<init>(SparkContext.scala:265)
    	at org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:957)
    ```
    
    This is because when the Spark REPL adds jars to the Spark context, it does not treat the path in the `spark.jars` property as a URI. Rather, it treats the URI as a file system path and prepends the current working directory once again, leading to the above crazy path.
    
    The solution was to make SparkILoop.getAddedJars return non-URI paths, as it was designed to. This should be a minimally invasive change.
    ```
    --- a/repl/src/main/scala/org/apache/spark/repl/SparkILoop.scala
    +++ b/repl/src/main/scala/org/apache/spark/repl/SparkILoop.scala
    @@ -997,7 +997,8 @@ object SparkILoop {
         val propJars = sys.props.get("spark.jars").flatMap { p =>
           if (p == "") None else Some(p)
         }
    -    propJars.orElse(envJars).map(_.split(",")).getOrElse(Array.empty)
    +    val jars = propJars.orElse(envJars).map(_.split(",")).getOrElse(Array.empty)
    +    jars.map(Utils.resolveURI(_).getPath)
       }
    ```
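
The doubled path in the stack trace can be reproduced without Spark. The sketch below only illustrates the mechanism, with hypothetical helper names; it assumes a Unix-style file system, where a string starting with `file:` is an ordinary relative file name.

```scala
import java.io.File
import java.net.URI

object ReplJarPathSketch {
  /** Treats the string as a plain path, as the REPL did before the fix. */
  def asPlainPath(jar: String): String = new File(jar).getAbsolutePath

  /** Parses the string as a URI first and keeps only the path component. */
  def viaUri(jar: String): String = new URI(jar).getPath

  def main(args: Array[String]): Unit = {
    val jar = "file:/root/tdas/app.jar"
    // Relative as far as java.io.File is concerned, so the working directory is prepended
    println(asPlainPath(jar))
    // Only the path component remains
    println(viaUri(jar))
  }
}
```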
    
    




[GitHub] spark pull request: [SPARK-1900] Fix running PySpark files on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/853#issuecomment-44045962
  
    Merged build started. 



[GitHub] spark pull request: [SPARK-1900] Fix running PySpark files on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/853#issuecomment-43870080
  
    Merged build started. 



[GitHub] spark pull request: [SPARK-1900] Fix running PySpark files on YARN

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/853#discussion_r13008543
  
    --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
    @@ -1166,4 +1166,42 @@ private[spark] object Utils extends Logging {
             true
         }
       }
    +
    +  /**
    +   * Return a well-formed URI for the file described by a user input string.
    +   *
    +   * If the supplied path does not contain a scheme, or is a relative path, it will be
    +   * converted into an absolute path with a file:// scheme.
    +   */
    +  def resolveURI(path: String, testWindows: Boolean = false): URI = {
    +
    +    val windows = isWindows || testWindows
    +    // In Windows, the file separator is a backslash, but this is inconsistent with the URI format
    +    val formattedPath = if (windows) path.replace("\\", "/") else path
    +    // Each Windows drive contains only a single alphabet character
    +    val windowsDrive = "([a-zA-Z])".r
    +
    +    val uri = new URI(formattedPath)
    +    uri.getScheme match {
    +      case windowsDrive(d) if windows =>
    +        new URI("file:/" + uri.toString.stripPrefix("/"))
    --- End diff --
    
    In Windows, both backslashes and forward slashes are valid file separators



[GitHub] spark pull request: [SPARK-1900 / 1918] PySpark on YARN is broken

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/853#issuecomment-44071595
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15173/



[GitHub] spark pull request: [SPARK-1900 / 1918] PySpark on YARN is broken

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/853#issuecomment-44084018
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15178/



[GitHub] spark pull request: [SPARK-1900] Fix running PySpark files on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/853#issuecomment-43977351
  
     Merged build triggered. 



[GitHub] spark pull request: [SPARK-1900] Fix running PySpark files on YARN

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/853#discussion_r12987599
  
    --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
    @@ -1166,4 +1166,35 @@ private[spark] object Utils extends Logging {
             true
         }
       }
    +
    +  /** Return the URI of the input path. If a relative path is given, assume it is local. */
    --- End diff --
    
    This should maybe be expanded slightly:
    
    ```
    /** Return a well formed URI for the file described by a user input string. If the
      * supplied path does not contain a scheme, or is a relative path, it will be
      * converted into an absolute path with a file:// scheme. */
    ```



[GitHub] spark pull request: [SPARK-1900] Fix running PySpark files on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/853#issuecomment-43961144
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15148/



[GitHub] spark pull request: [SPARK-1900] Fix running PySpark files on YARN

Posted by tdas <gi...@git.apache.org>.
Github user tdas commented on a diff in the pull request:

    https://github.com/apache/spark/pull/853#discussion_r13000834
  
    --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala ---
    @@ -1166,4 +1166,42 @@ private[spark] object Utils extends Logging {
             true
         }
       }
    +
    +  /**
    +   * Return a well-formed URI for the file described by a user input string.
    +   *
    +   * If the supplied path does not contain a scheme, or is a relative path, it will be
    +   * converted into an absolute path with a file:// scheme.
    +   */
    +  def resolveURI(path: String, testWindows: Boolean = false): URI = {
    +
    +    val windows = isWindows || testWindows
    +    // In Windows, the file separator is a backslash, but this is inconsistent with the URI format
    +    val formattedPath = if (windows) path.replace("\\", "/") else path
    +    // Each Windows drive contains only a single alphabet character
    +    val windowsDrive = "([a-zA-Z])".r
    +
    +    val uri = new URI(formattedPath)
    +    uri.getScheme match {
    --- End diff --
    
    Can you also add a validation check here to ensure that the URI (and hence the original path given by the user) is valid and not malformed? Just for fun, I entered "file:hello", and it gave me an NPE at a random location.
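
One plausible explanation for the NPE, which the thread does not confirm, is that `java.net.URI` parses `file:hello` as an opaque URI whose `getPath` is `null`. A validation step along these lines would surface the problem immediately; `UriValidationSketch` and `validate` are hypothetical names, not part of the PR.

```scala
import java.net.URI
import scala.util.Try

object UriValidationSketch {
  /** Hypothetical check: reject inputs that parse but carry no hierarchical path. */
  def validate(input: String): URI = {
    val uri = new URI(input) // throws URISyntaxException for illegal characters
    // "file:hello" has a scheme but no leading slash, so it is opaque and getPath is null
    require(!uri.isOpaque && uri.getPath != null, s"Malformed path: $input")
    uri
  }

  def main(args: Array[String]): Unit = {
    println(Try(validate("file:hello")))  // a Failure wrapping IllegalArgumentException
    println(Try(validate("file:/hello"))) // a Success
  }
}
```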




[GitHub] spark pull request: [SPARK-1900] Fix running PySpark files on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/853#issuecomment-43861929
  
    Merged build started. 



[GitHub] spark pull request: [SPARK-1900] Fix running PySpark files on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/853#issuecomment-44051070
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15170/



[GitHub] spark pull request: [SPARK-1900] Fix running PySpark files on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/853#issuecomment-43972250
  
     Merged build triggered. 



[GitHub] spark pull request: [SPARK-1900] Fix running PySpark files on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/853#issuecomment-44045956
  
     Merged build triggered. 



[GitHub] spark pull request: [SPARK-1900] Fix running PySpark files on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/853#issuecomment-43958662
  
     Merged build triggered. 



[GitHub] spark pull request: [SPARK-1900] Fix running PySpark files on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/853#issuecomment-43867197
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15139/



[GitHub] spark pull request: [SPARK-1900] Fix running PySpark files on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/853#issuecomment-44044672
  
     Merged build triggered. 



[GitHub] spark pull request: [SPARK-1900] Fix running PySpark files on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/853#issuecomment-43864111
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15140/



[GitHub] spark pull request: [SPARK-1900] Fix running PySpark files on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/853#issuecomment-43861912
  
     Merged build triggered. 



[GitHub] spark pull request: [SPARK-1900] Fix running PySpark files on YARN

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/853#issuecomment-43976985
  
    LGTM - added some minor suggestions. 



[GitHub] spark pull request: [SPARK-1900] Fix running PySpark files on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/853#issuecomment-43980096
  
    
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15155/



[GitHub] spark pull request: [SPARK-1900] Fix running PySpark files on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/853#issuecomment-44068880
  
     Merged build triggered. 



[GitHub] spark pull request: [SPARK-1900] Fix running PySpark files on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/853#issuecomment-43985498
  
    All automated tests passed.
    Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15159/



[GitHub] spark pull request: [SPARK-1900] Fix running PySpark files on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/853#issuecomment-43973956
  
    Merged build finished. All automated tests passed.



[GitHub] spark pull request: [SPARK-1900] Fix running PySpark files on YARN

Posted by tdas <gi...@git.apache.org>.
Github user tdas commented on a diff in the pull request:

    https://github.com/apache/spark/pull/853#discussion_r12970247
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala ---
    @@ -377,4 +382,13 @@ object SparkSubmitArguments {
         }
         properties.stringPropertyNames().toSeq.map(k => (k, properties(k).trim))
       }
    +
    +  /** Resolves comma separated paths. */
    --- End diff --
    
    Move this to Utils.
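
For context, a rough sketch of what a shared helper for resolving comma-separated paths could look like once it lives in a utility object. This is an illustration under assumptions, not the code from the diff: the names `PathUtils`, `resolveURI` and `resolveURIs` are hypothetical, and the sketch ignores Windows drive letters and malformed input, both of which the real patch has to handle.

```scala
import java.io.File
import java.net.URI

// Hypothetical utility object for illustration; names are assumptions.
object PathUtils {

  /** Keeps a path that already has a scheme; otherwise treats it as a local file. */
  def resolveURI(path: String): URI = {
    val uri = new URI(path)
    if (uri.getScheme != null) uri
    else new File(path).getAbsoluteFile.toURI
  }

  /** Resolves each entry of a comma-separated list and joins the results again. */
  def resolveURIs(paths: String): String = {
    if (paths == null || paths.trim.isEmpty) {
      ""
    } else {
      paths.split(",")
        .map(_.trim)
        .filter(_.nonEmpty)
        .map(p => resolveURI(p).toString)
        .mkString(",")
    }
  }
}
```

With this shape, an entry such as `hdfs://nn/a.jar` passes through unchanged, while a bare `c.jar` becomes a `file:` URI rooted at the current working directory.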



[GitHub] spark pull request: [SPARK-1900] Fix running PySpark files on YARN

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/853#issuecomment-43867196
  
    Merged build finished. 

