You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by mridulm <gi...@git.apache.org> on 2017/10/14 06:23:12 UTC

[GitHub] spark pull request #19497: [SPARK-21549][CORE] Respect OutputFormats with no...

GitHub user mridulm opened a pull request:

    https://github.com/apache/spark/pull/19497

    [SPARK-21549][CORE] Respect OutputFormats with no/invalid output directory provided

    
    
    ## What changes were proposed in this pull request?
    
    PR #19294 added support for null's - but spark 2.1 handled other error cases where path argument can be invalid.
    Namely:
    
    * empty string
    * URI parse exception while creating Path
    
    This is resubmission of PR #19487, which I messed up while updating my repo.
    
    ## How was this patch tested?
    
    Enhanced test to cover new support added.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mridulm/spark master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19497.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19497
    
----
commit a319df36db5bd202a14b44a09e9d1887f1633aec
Author: Mridul Muralidharan <mr...@gmail.com>
Date:   2017-10-14T06:21:38Z

    Fix HadoopMapReduceCommitProtocol based on failures seen with Phoenix output format

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on the issue:

    https://github.com/apache/spark/pull/19497
  
    @HyukjinKwon My intention was to preserve earlier behavior.
    Particularly for non-path based committer's, the `path` variable and its use/processing is not relevant, it makes more sense to ignore that codepath entirely.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19497
  
    **[Test build #82753 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82753/testReport)** for PR 19497 at commit [`a319df3`](https://github.com/apache/spark/commit/a319df36db5bd202a14b44a09e9d1887f1633aec).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on the issue:

    https://github.com/apache/spark/pull/19497
  
    @HyukjinKwon Thanks for clarifying.
    
    The way I look at it is:
    `saveAsHadoopFile` is explicitly referring to `Output the RDD to any Hadoop-supported file system` in its description (and name) - and so valid `Path` is a reasonable requirement.
    
    Additionally, in createPathFromString for `path == null` we are explicitly throwing `IllegalArgumentException` (`new Path` will do the same now, but I think this changed in past where it used to result in NPE ?).
    The subsequent `val outputPath = new Path(path)` will do that for other invalid input paths as well.
    
    In contrast `saveAsHadoopDataset` is not related to file system but `Output the RDD to any Hadoop-supported storage system` : where output being a valid `Path` is not a requirement.
    
    Having said that, we can always iterate in a jira if you feel there is some confusion - it is always better to be explicitly clear about the interfaces we expose and support !
    Thanks.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19497
  
    **[Test build #82753 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82753/testReport)** for PR 19497 at commit [`a319df3`](https://github.com/apache/spark/commit/a319df36db5bd202a14b44a09e9d1887f1633aec).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19497
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #19497: [SPARK-21549][CORE] Respect OutputFormats with no...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/19497


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19497
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19497
  
    Currently, I meant `saveAsNewAPIHadoopFile` comparing to `saveAsHadoopFile`.
    
    ```
    saveAsNewAPIHadoopFile[...]("") // succeeds
    ```
    
    ```
    saveAsHadoopFile[...]("") // fails
    
    Can not create a Path from an empty string
    java.lang.IllegalArgumentException: Can not create a Path from an empty string
    	at org.apache.hadoop.fs.Path.checkPathArg(Path.java:127)
    	at org.apache.hadoop.fs.Path.<init>(Path.java:135)
    	at org.apache.spark.internal.io.SparkHadoopWriterUtils$.createPathFromString(SparkHadoopWriterUtils.scala:54)
    ```
    
    I wanted to talk about this. `saveAsHadoopFile` is being failed within Spark side. So, I suspect `saveAsNewAPIHadoopFile` should also fail fast in this way.
    
    `saveAsHadoopFile` validates the path so I thought `saveAsNewAPIHadoopFile` should also validate.
    
    https://github.com/apache/spark/blob/3f958a99921d149fb9fdf7ba7e78957afdad1405/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L1004-L1008
    
    https://github.com/apache/spark/blob/3f958a99921d149fb9fdf7ba7e78957afdad1405/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L983-L987



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on the issue:

    https://github.com/apache/spark/pull/19497
  
    Thanks for the reviews everyone !


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19497
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82753/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19497
  
    @mridulm, I just checked thought the related changes and checked the tests pass on branch-2.1.
    
    Seems this PR will actually also allow the cases below:
    
    ```scala
    .saveAsNewAPIHadoopFile[...]("")
    .saveAsNewAPIHadoopFile[...]("::invalid:::")
    ```
    
    Currently both are failed but seems this PR allows those cases:
    
    ```
    Can not create a Path from an empty string
    java.lang.IllegalArgumentException: Can not create a Path from an empty string
    	at org.apache.hadoop.fs.Path.checkPathArg(Path.java:127)
    	at org.apache.hadoop.fs.Path.<init>(Path.java:135)
    	at org.apache.hadoop.fs.Path.<init>(Path.java:89)
    	at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.absPathStagingDir(HadoopMapReduceCommitProtocol.scala:61)
    ...
    ```
    
    ```
    java.net.URISyntaxException: Relative path in absolute URI: ::invalid:::
    java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: ::invalid:::
    	at org.apache.hadoop.fs.Path.initialize(Path.java:206)
    	at org.apache.hadoop.fs.Path.<init>(Path.java:172)
    	at org.apache.hadoop.fs.Path.<init>(Path.java:89)
    	at org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.absPathStagingDir(HadoopMapReduceCommitProtocol.scala:61)
    ...
    ```
    
    I think we should protect these cases as this.
    
    For the cases for old one:
    
    ```scala
    .saveAsHadoopFile[...]("")
    .saveAsHadoopFile[...]("::invalid:::")
    ```
    
    these looks failed fast (whether it was initially intended or not) and I guess this PR does not affect these:
    
    ```
    Can not create a Path from an empty string
    java.lang.IllegalArgumentException: Can not create a Path from an empty string
    	at org.apache.hadoop.fs.Path.checkPathArg(Path.java:127)
    	at org.apache.hadoop.fs.Path.<init>(Path.java:135)
    	at org.apache.spark.internal.io.SparkHadoopWriterUtils$.createPathFromString(SparkHadoopWriterUtils.scala:54)
    ```
    
    ```
    java.net.URISyntaxException: Relative path in absolute URI: ::invalid:::
    java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: ::invalid:::
    	at org.apache.hadoop.fs.Path.initialize(Path.java:206)
    	at org.apache.hadoop.fs.Path.<init>(Path.java:172)
    	at org.apache.spark.internal.io.SparkHadoopWriterUtils$.createPathFromString(SparkHadoopWriterUtils.scala:54)
    ```


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19497
  
    I support this PR itself of course. I have no problem with this.
    
    I meant a separate (soft) question about `saveAsNewAPIHadoopFile` (not `saveAsNewAPIHadoopDataset`) to validate `path` parameter which we take in `saveAsNewAPIHadoopFile` explicitly.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19497
  
    Thank you @mridulm. I regret that I raised this here, causing confusion. Let's talk more in another place. I will cc you (and @jiangxb1987) when I happened to file up a JIRA or see similar issue related with this.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on the issue:

    https://github.com/apache/spark/pull/19497
  
    Thx for taking a deeper look @HyukjinKwon, much appreciated !
    I will wait for @jiangxb1987 to also opine before committing - I want to make sure we are not adding incorrect behavior; given that this is a followup to an earlier PR (some excellent work by @szhem btw)


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on the issue:

    https://github.com/apache/spark/pull/19497
  
    To clarify, we can look at changing the behavior (if required) in future - but that should be an explicit design choice informed by hadoop committer design. Until then, we should look to interoperate.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19497
  
    @mridulm, BTW, WDYT about disallowing:
    
    ```
    .saveAsNewAPIHadoopFile[...]("")
    .saveAsNewAPIHadoopFile[...]("::invalid:::")
    ```
    
    within the APIs? If i tested this correctly, this PR also allows both cases but I think we should disallow as it requires `path` and overrides it and `saveAsHadoopFile` disallows it. Seems this is also allowed in branch-2.1 as I recall correctly but looks disallowing it sounds more making sense in any event.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on the issue:

    https://github.com/apache/spark/pull/19497
  
    +CC @HyukjinKwon, @steveloughran 
    
    Sorry for messing up PR #19487
    The only change in this PR is to use `::invalid::` instead of `test:` in the test to address @steveloughran's comment.
    
    Thanks.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...

Posted by mridulm <gi...@git.apache.org>.
Github user mridulm commented on the issue:

    https://github.com/apache/spark/pull/19497
  
    `saveAsNewAPIHadoopFile ` simply delegates to `saveAsNewAPIHadoopDataset` (with some options set), right ? The behavior would be similar ?
    
    Do you mean `saveAsHadoopDataset` instead ?
    I did not change behavior there - since the exception was getting raised from within hadoop code and not from our code (when we pass invalid values), and it is preserving behavior from earlier code.
    I was focussed more on the regression introduced.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19497
  
    retest this please


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on the issue:

    https://github.com/apache/spark/pull/19497
  
    I guess one aspect of `saveAsNewAPIHadoopFile` is that it calls ` jobConfiguration.set("mapreduce.output.fileoutputformat.outputdir", path)`, and `Configuration.set(String key, String value)` has a check for null key or value.
    
    If handling of paths is to be done in the committer, `saveAsNewAPIHadoopFile` should really be looking @ path and calling jobConfiguration.unset("mapreduce.output.fileoutputformat.outputdir) if path==null. 
    
    Looking at how Hadoop's FileOutputFormat implementations work, they can handle a null/undefined output dir property, *but not an empty one*.
    
    ```java
    public static Path getOutputPath(JobContext job) {
       String name = job.getConfiguration().get(FileOutputFormat.OUTDIR);
        return name == null ? null: new Path(name);
    ```  
    
    Which implies that `saveAsNewHadoopFile("")` might want to unset the config option too, so offloading the problem of what happens on an empty path to the committer. Though I'd recommend checking to see what meaningful exceptions actually get raised in this situation when the committer is the normal FileOutputFormat/FileOutputCommitter setup


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19497
  
    Let me take a look with few tests and be back. Also I think I should cc @jiangxb1987 too.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19497
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/82754/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/19497
  
    I agree this was focussed more on the regression introduced and it should be good enough already, and I am talking about a different thing for behaviour change.
    
    Let me organise my idea and and try to file a JIRA later. I think strictly this is separate anyway if I haven't missed something or simply I was wrong.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19497
  
    **[Test build #82754 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82754/testReport)** for PR 19497 at commit [`a319df3`](https://github.com/apache/spark/commit/a319df36db5bd202a14b44a09e9d1887f1633aec).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #19497: [SPARK-21549][CORE] Respect OutputFormats with no/invali...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19497
  
    **[Test build #82754 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/82754/testReport)** for PR 19497 at commit [`a319df3`](https://github.com/apache/spark/commit/a319df36db5bd202a14b44a09e9d1887f1633aec).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org