Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/09/28 18:44:04 UTC

[GitHub] [spark] dongjoon-hyun opened a new pull request #29895: [SPARK-33019][CORE] Use spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by default

dongjoon-hyun opened a new pull request #29895:
URL: https://github.com/apache/spark/pull/29895


   ### What changes were proposed in this pull request?
   
   This PR aims to use a consistent and safe version of the Apache Hadoop file output committer algorithm.
   
   ### Why are the changes needed?
   
   Apache Spark provides multiple distributions with Hadoop 2.7 and Hadoop 3.2. Apache Hadoop 3.0 switched the default algorithm from v1 to v2, and there is now discussion of removing v2. We had better provide a consistent default behavior.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes. This changes the default behavior. Users can override this conf.
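
The override behavior described above can be sketched with a simplified model. This is only an illustration: plain `Map`s stand in for `SparkConf` and the Hadoop `Configuration`, and the names `CommitterDefault` and `resolve` are hypothetical, not part of Spark.

```scala
// Simplified model: plain Maps stand in for SparkConf and the Hadoop Configuration.
// CommitterDefault and resolve are hypothetical names used only for illustration.
object CommitterDefault {
  val HadoopPrefix = "spark.hadoop."
  val HadoopKey = "mapreduce.fileoutputcommitter.algorithm.version"
  val SparkKey = HadoopPrefix + HadoopKey

  /** Copies spark.hadoop.* entries over the Hadoop defaults, then falls back to v1
    * only when the user did not set the Spark-side key explicitly. */
  def resolve(
      sparkConf: Map[String, String],
      hadoopDefaults: Map[String, String]): Map[String, String] = {
    val copied = sparkConf.collect {
      case (k, v) if k.startsWith(HadoopPrefix) => k.substring(HadoopPrefix.length) -> v
    }
    val merged = hadoopDefaults ++ copied
    if (sparkConf.contains(SparkKey)) merged else merged + (HadoopKey -> "1")
  }
}
```

With a Hadoop 3.x style default of `2`, an empty user configuration resolves to `1`, while a user who sets the Spark-side key to `2` keeps `2`.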
   
   ### How was this patch tested?
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun edited a comment on pull request #29895: [SPARK-33019][CORE] Use spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by default

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun edited a comment on pull request #29895:
URL: https://github.com/apache/spark/pull/29895#issuecomment-700281750


   GitHub Action passed. The Jenkins failure is irrelevant.
   
   BTW, cc @waleedfateem , @srowen , @HyukjinKwon , @wangyum 




[GitHub] [spark] SparkQA commented on pull request #29895: [SPARK-33019][CORE] Use spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by default

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29895:
URL: https://github.com/apache/spark/pull/29895#issuecomment-700241263


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/33808/
   




[GitHub] [spark] yamyamyuo commented on a change in pull request #29895: [SPARK-33019][CORE] Use spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by default

Posted by GitBox <gi...@apache.org>.
yamyamyuo commented on a change in pull request #29895:
URL: https://github.com/apache/spark/pull/29895#discussion_r496799072



##########
File path: docs/configuration.md
##########
@@ -1761,16 +1761,10 @@ Apart from these, the following properties are also available, and may be useful
 </tr>
 <tr>
   <td><code>spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version</code></td>
-  <td>Dependent on environment</td>
+  <td>1</td>
   <td>
     The file output committer algorithm version, valid algorithm version number: 1 or 2.
-    Version 2 may have better performance, but version 1 may handle failures better in certain situations,
-    as per <a href="https://issues.apache.org/jira/browse/MAPREDUCE-4815">MAPREDUCE-4815</a>.
-    The default value depends on the Hadoop version used in an environment:
-    1 for Hadoop versions lower than 3.0
-    2 for Hadoop versions 3.0 and higher
-    It's important to note that this can change back to 1 again in the future once <a href="https://issues.apache.org/jira/browse/MAPREDUCE-7282">MAPREDUCE-7282</a>
-    is fixed and merged.

Review comment:
       Just curious, why was this deleted? It is a very comprehensive comment about the Hadoop version background. @dongjoon-hyun 






[GitHub] [spark] SparkQA removed a comment on pull request #29895: [SPARK-33019][CORE] Use spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by default

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29895:
URL: https://github.com/apache/spark/pull/29895#issuecomment-700220808


   **[Test build #129193 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129193/testReport)** for PR 29895 at commit [`1d4d3ea`](https://github.com/apache/spark/commit/1d4d3ea719a7bb21e3757ad6e163145140cbef61).




[GitHub] [spark] HyukjinKwon commented on a change in pull request #29895: [SPARK-33019][CORE] Use spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by default

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #29895:
URL: https://github.com/apache/spark/pull/29895#discussion_r496427310



##########
File path: core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala
##########
@@ -462,6 +462,9 @@ private[spark] object SparkHadoopUtil {
     for ((key, value) <- conf.getAll if key.startsWith("spark.hadoop.")) {
       hadoopConf.set(key.substring("spark.hadoop.".length), value)
     }
+    if (conf.getOption("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version").isEmpty) {
+      hadoopConf.set("mapreduce.fileoutputcommitter.algorithm.version", "1")

Review comment:
       Gotya






[GitHub] [spark] dbtsai commented on pull request #29895: [SPARK-33019][CORE] Use spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by default

Posted by GitBox <gi...@apache.org>.
dbtsai commented on pull request #29895:
URL: https://github.com/apache/spark/pull/29895#issuecomment-700324231


   also cc @steveloughran 




[GitHub] [spark] dongjoon-hyun edited a comment on pull request #29895: [SPARK-33019][CORE] Use spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by default

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun edited a comment on pull request #29895:
URL: https://github.com/apache/spark/pull/29895#issuecomment-700821012








[GitHub] [spark] dongjoon-hyun edited a comment on pull request #29895: [SPARK-33019][CORE] Use spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by default

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun edited a comment on pull request #29895:
URL: https://github.com/apache/spark/pull/29895#issuecomment-700821012


   Hi, @steveloughran and @tgravescs . 
   
   No matter what happens in the future, they cannot change the history (Apache Hadoop 3.2.0 and all existing Hadoop 3.x versions). And, for now, Apache Spark 3.1 will be stuck on Apache Hadoop 3.2.0 due to the Guava issue. That's the reason why we need to do this right now from the Spark side.
   
   For the following, @steveloughran , as I wrote in the PR description, this PR doesn't override the explicit user-given config. This is only setting `v1` when there is no explicit setting.
   > V2 is used in places where people have hit the scale limits with v1, and they are happy with the risk of failures. 
   
   Eventually, I believe we can use `hadoop-client-runtime` only in order to remove guava dependency (#29843) and consume @steveloughran 's new Hadoop release in the future. Until that time, Apache Spark 3.1 had better provide a no-known-correctness-regression migration. If Apache Spark 3.1 default distribution is unsafe due to the 3rd party (in this case Hadoop), how can we recommend this to the users?




[GitHub] [spark] SparkQA commented on pull request #29895: [SPARK-33019][CORE] Use spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by default

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29895:
URL: https://github.com/apache/spark/pull/29895#issuecomment-700220808


   **[Test build #129193 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129193/testReport)** for PR 29895 at commit [`1d4d3ea`](https://github.com/apache/spark/commit/1d4d3ea719a7bb21e3757ad6e163145140cbef61).




[GitHub] [spark] SparkQA commented on pull request #29895: [SPARK-33019][CORE] Use spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by default

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29895:
URL: https://github.com/apache/spark/pull/29895#issuecomment-700233436


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/33808/
   




[GitHub] [spark] SparkQA commented on pull request #29895: [SPARK-33019][CORE] Use spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by default

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29895:
URL: https://github.com/apache/spark/pull/29895#issuecomment-700275606


   **[Test build #129193 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129193/testReport)** for PR 29895 at commit [`1d4d3ea`](https://github.com/apache/spark/commit/1d4d3ea719a7bb21e3757ad6e163145140cbef61).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] AmplabJenkins removed a comment on pull request #29895: [SPARK-33019][CORE] Use spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by default

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29895:
URL: https://github.com/apache/spark/pull/29895#issuecomment-700276891


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/129193/
   Test FAILed.




[GitHub] [spark] dongjoon-hyun commented on pull request #29895: [SPARK-33019][CORE] Use spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by default

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #29895:
URL: https://github.com/apache/spark/pull/29895#issuecomment-700429110


   I labeled this issue as a correctness issue because a working Apache Spark 2.4 PySpark program can generate a wrong result in Apache Spark 3.0 with Hadoop 3.2 distribution or Apache Spark 3.1 by default. This is a release blocker for Apache Spark 3.1.




[GitHub] [spark] dongjoon-hyun commented on pull request #29895: [SPARK-33019][CORE] Use spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by default

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #29895:
URL: https://github.com/apache/spark/pull/29895#issuecomment-700915407


   Thank you, @steveloughran and @tgravescs . 




[GitHub] [spark] dongjoon-hyun commented on pull request #29895: [SPARK-33019][CORE] Use spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by default

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #29895:
URL: https://github.com/apache/spark/pull/29895#issuecomment-700281750


   GitHub Action passed. The Jenkins failure is irrelevant.
   
   BTW, cc @waleedfateem , @srowen , @HyukjinKwon , @wangyum since you are in 




[GitHub] [spark] steveloughran commented on pull request #29895: [SPARK-33019][CORE] Use spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by default

Posted by GitBox <gi...@apache.org>.
steveloughran commented on pull request #29895:
URL: https://github.com/apache/spark/pull/29895#issuecomment-700774332


   FWIW I'm going to change the default to be v1, and log @ WARN in job set up when you use v2 (unless you turn that specific log off). V2 is used in places where people have hit the scale limits with v1, and they are happy with the risk of failures. Note that if your job doesn't generate unique files with each task attempt, even without atomic task commit the output is correct. The danger is when you get one or more of
   
   * different task attempts generating files with different names
   * a requirement of all output files of a task to consist entirely and exclusively of a single task attempt.
   
   If your attempts are 100% deterministic, you are going to be safe.
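
The failure mode described above can be illustrated with a toy simulation. It is a sketch under stated assumptions, not the real `FileOutputCommitter`: the destination directory is a `Map` from file name to the attempt that wrote it, a v2-style task commit moves files one at a time and may die partway, and a v1-style task commit is modelled as all-or-nothing. `CommitSim` and its members are hypothetical names.

```scala
// Toy model of task commit, not the real Hadoop FileOutputCommitter.
object CommitSim {
  // Destination directory contents: file name -> id of the task attempt that wrote it.
  type Dest = Map[String, Int]
  type Commit = (Dest, Seq[String], Int, Option[Int]) => Dest

  /** v2-style: files are moved into the destination one by one.
    * If the attempt dies after n files, those n files stay visible. */
  def commitV2(dest: Dest, files: Seq[String], attempt: Int, diesAfter: Option[Int]): Dest = {
    val moved = diesAfter.fold(files)(n => files.take(n))
    dest ++ moved.map(_ -> attempt)
  }

  /** v1-style: modelled as atomic, so an interrupted commit publishes nothing. */
  def commitV1(dest: Dest, files: Seq[String], attempt: Int, diesAfter: Option[Int]): Dest =
    if (diesAfter.isDefined) dest else dest ++ files.map(_ -> attempt)

  /** Attempt 1 dies after moving one file; attempt 2 then commits fully. */
  def failThenRetry(commit: Commit, first: Seq[String], retry: Seq[String]): Dest = {
    val afterFailure = commit(Map.empty, first, 1, Some(1))
    commit(afterFailure, retry, 2, None)
  }
}
```

When the two attempts use different file names (say `a-1`, `b-1` and then `a-2`, `b-2`), the v2-style model ends with a leftover `a-1` next to the retry's files, while the v1-style model contains only the retry's files. When both attempts use the same names, the retry overwrites the leftover and both models end with the same set of files.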




[GitHub] [spark] tgravescs commented on pull request #29895: [SPARK-33019][CORE] Use spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by default

Posted by GitBox <gi...@apache.org>.
tgravescs commented on pull request #29895:
URL: https://github.com/apache/spark/pull/29895#issuecomment-700866136


   I'm fine with changing the default. I was trying to figure out cases when a user would really see this.  
   
   The MapReduce paradigm and Spark rely on the output of tasks being deterministic. If it is not, there are other issues with retries and the output has no guarantees. I thought Spark had deterministic output path naming, but I was just starting to check that I was remembering properly. 
   
   If those are true, I think that just leaves the _SUCCESS file thing, which I can see being a problem if people don't check for it.
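
A downstream reader can guard against consuming incomplete output by checking for the `_SUCCESS` marker first. The following is a minimal sketch that assumes the output sits on a local filesystem and that the job writes the marker on successful job commit; `SuccessMarker` and `isComplete` are hypothetical names, and output on HDFS or an object store would need the corresponding filesystem API instead of `java.nio.file`.

```scala
import java.nio.file.{Files, Path}

// Hypothetical helper for local filesystems only.
object SuccessMarker {
  /** True only if the output directory exists and contains a _SUCCESS marker file. */
  def isComplete(outputDir: Path): Boolean =
    Files.isDirectory(outputDir) && Files.exists(outputDir.resolve("_SUCCESS"))
}
```

A consumer would call `SuccessMarker.isComplete(dir)` before listing part files and skip or retry later when it returns `false`.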
   
   Are there cases I'm missing here?  Are there cases where cloud providers or other tools are changing the output paths or something? @steveloughran  did you see this in a particular situation?




[GitHub] [spark] dongjoon-hyun edited a comment on pull request #29895: [SPARK-33019][CORE] Use spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by default

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun edited a comment on pull request #29895:
URL: https://github.com/apache/spark/pull/29895#issuecomment-700429110


   I labeled SPARK-33019 as a correctness issue because a working Apache Spark 2.4 PySpark program can generate a wrong result with Apache Spark 3.0 with Hadoop 3.2 distribution or Apache Spark 3.1 default distribution. This is a release blocker for Apache Spark 3.1.




[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29895: [SPARK-33019][CORE] Use spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by default

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #29895:
URL: https://github.com/apache/spark/pull/29895#discussion_r496405283



##########
File path: core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala
##########
@@ -462,6 +462,9 @@ private[spark] object SparkHadoopUtil {
     for ((key, value) <- conf.getAll if key.startsWith("spark.hadoop.")) {
       hadoopConf.set(key.substring("spark.hadoop.".length), value)
     }
+    if (conf.getOption("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version").isEmpty) {
+      hadoopConf.set("mapreduce.fileoutputcommitter.algorithm.version", "1")

Review comment:
       Thanks for the review, but we cannot use `Configuration.setIfUnset` because Apache Hadoop already loads the default value of `mapreduce.fileoutputcommitter.algorithm.version`, @HyukjinKwon .
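
To make the point above concrete, here is a toy stand-in for a configuration object whose `setIfUnset` writes only when the key has no value at all. It is a sketch, not the real `org.apache.hadoop.conf.Configuration`; `ToyConf` and `SetIfUnsetDemo` are hypothetical names, and the preloaded default of `2` is an assumption that mirrors the Hadoop 3.x behavior discussed in this thread.

```scala
// Toy stand-in for a Hadoop-style configuration; not the real Configuration class.
final class ToyConf(defaults: Map[String, String]) {
  private var props: Map[String, String] = defaults

  def get(key: String): Option[String] = props.get(key)

  def set(key: String, value: String): Unit = {
    props = props.updated(key, value)
  }

  // Writes only when the key has no value, whether from a default or an explicit set.
  def setIfUnset(key: String, value: String): Unit = {
    if (!props.contains(key)) props = props.updated(key, value)
  }
}

object SetIfUnsetDemo {
  val Key = "mapreduce.fileoutputcommitter.algorithm.version"

  /** The default is already loaded, so setIfUnset leaves it untouched. */
  def viaSetIfUnset(): Option[String] = {
    val conf = new ToyConf(Map(Key -> "2"))
    conf.setIfUnset(Key, "1")
    conf.get(Key)
  }

  /** Decides from the user's Spark-side setting instead of the Hadoop-side value. */
  def viaSparkSideCheck(userSparkConf: Map[String, String]): Option[String] = {
    val conf = new ToyConf(Map(Key -> "2"))
    conf.set(Key, userSparkConf.getOrElse("spark.hadoop." + Key, "1"))
    conf.get(Key)
  }
}
```

In this model `viaSetIfUnset()` still yields `2`, which is why checking the Hadoop-side value cannot tell a loaded default apart from an explicit user choice. Checking the Spark-side key yields `1` for an empty user configuration and preserves an explicit `2`.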






[GitHub] [spark] dongjoon-hyun commented on pull request #29895: [SPARK-33019][CORE] Use spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by default

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #29895:
URL: https://github.com/apache/spark/pull/29895#issuecomment-700919027


   I'll merge this PR. Thanks!




[GitHub] [spark] dongjoon-hyun edited a comment on pull request #29895: [SPARK-33019][CORE] Use spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by default

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun edited a comment on pull request #29895:
URL: https://github.com/apache/spark/pull/29895#issuecomment-700281750


   GitHub Action passed. The Jenkins failure is irrelevant.
   
   BTW, cc @waleedfateem , @srowen , @HyukjinKwon , @wangyum for https://github.com/apache/spark/pull/29541




[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29895: [SPARK-33019][CORE] Use spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by default

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #29895:
URL: https://github.com/apache/spark/pull/29895#discussion_r496866857



##########
File path: docs/configuration.md
##########
@@ -1761,16 +1761,10 @@ Apart from these, the following properties are also available, and may be useful
 </tr>
 <tr>
   <td><code>spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version</code></td>
-  <td>Dependent on environment</td>
+  <td>1</td>
   <td>
     The file output committer algorithm version, valid algorithm version number: 1 or 2.
-    Version 2 may have better performance, but version 1 may handle failures better in certain situations,
-    as per <a href="https://issues.apache.org/jira/browse/MAPREDUCE-4815">MAPREDUCE-4815</a>.
-    The default value depends on the Hadoop version used in an environment:
-    1 for Hadoop versions lower than 3.0
-    2 for Hadoop versions 3.0 and higher
-    It's important to note that this can change back to 1 again in the future once <a href="https://issues.apache.org/jira/browse/MAPREDUCE-7282">MAPREDUCE-7282</a>
-    is fixed and merged.

Review comment:
       This PR aims to provide a consistent view for Apache Spark users. For example, ` The default value depends on the Hadoop version used in an environment` is not valid any more. After this PR, Apache Spark users will use `v1` consistently by default.






[GitHub] [spark] dongjoon-hyun edited a comment on pull request #29895: [SPARK-33019][CORE] Use spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by default

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun edited a comment on pull request #29895:
URL: https://github.com/apache/spark/pull/29895#issuecomment-700821012








[GitHub] [spark] AmplabJenkins commented on pull request #29895: [SPARK-33019][CORE] Use spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by default

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29895:
URL: https://github.com/apache/spark/pull/29895#issuecomment-700276884








[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29895: [SPARK-33019][CORE] Use spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by default

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #29895:
URL: https://github.com/apache/spark/pull/29895#discussion_r496405283



##########
File path: core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala
##########
@@ -462,6 +462,9 @@ private[spark] object SparkHadoopUtil {
     for ((key, value) <- conf.getAll if key.startsWith("spark.hadoop.")) {
       hadoopConf.set(key.substring("spark.hadoop.".length), value)
     }
+    if (conf.getOption("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version").isEmpty) {
+      hadoopConf.set("mapreduce.fileoutputcommitter.algorithm.version", "1")

Review comment:
       Thanks for the review, but we cannot use `org.apache.hadoop.conf.Configuration.setIfUnset` because Apache Hadoop already loads the default value of `mapreduce.fileoutputcommitter.algorithm.version`, @HyukjinKwon .






[GitHub] [spark] dongjoon-hyun edited a comment on pull request #29895: [SPARK-33019][CORE] Use spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by default

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun edited a comment on pull request #29895:
URL: https://github.com/apache/spark/pull/29895#issuecomment-700821012


   Hi, @steveloughran and @tgravescs . 
   
   No matter what happens in the future, they cannot change history (Apache Hadoop 3.2.0 and all existing Hadoop 3.x versions). And, for now, Apache Spark 3.1 is stuck on Apache Hadoop 3.2.0 due to the Guava issue. That's why we need to do this right now on the Spark side.
   
   For the following, @steveloughran , as I wrote in the PR description, this PR doesn't override an explicit user-given config. It only sets `v1` when there is no explicit setting.
   > V2 is used in places where people have hit the scale limits with v1, and they are happy with the risk of failures. 
   
   Eventually, I believe we can use only `hadoop-client-runtime` in order to remove the Guava dependency (#29843) and consume @steveloughran 's new Hadoop release.




[GitHub] [spark] HyukjinKwon commented on pull request #29895: [SPARK-33019][CORE] Use spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by default

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #29895:
URL: https://github.com/apache/spark/pull/29895#issuecomment-700352390


   Looks fine to me, but what do you think, @steveloughran? Your call looks important here.




[GitHub] [spark] tgravescs commented on pull request #29895: [SPARK-33019][CORE] Use spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by default

Posted by GitBox <gi...@apache.org>.
tgravescs commented on pull request #29895:
URL: https://github.com/apache/spark/pull/29895#issuecomment-700775716


   https://issues.apache.org/jira/browse/MAPREDUCE-7282 is not yet resolved, so I think we should wait for a resolution there. I don't remember the details off the top of my head, so I need to go look again.




[GitHub] [spark] AmplabJenkins commented on pull request #29895: [SPARK-33019][CORE] Use spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by default

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29895:
URL: https://github.com/apache/spark/pull/29895#issuecomment-700241280








[GitHub] [spark] dongjoon-hyun edited a comment on pull request #29895: [SPARK-33019][CORE] Use spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by default

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun edited a comment on pull request #29895:
URL: https://github.com/apache/spark/pull/29895#issuecomment-700821012


   Hi, @steveloughran and @tgravescs . 
   
   No matter what happens in the future, they cannot change history (Apache Hadoop 3.2.0 and all existing Hadoop 3.x versions). And, for now, Apache Spark 3.1 is stuck on Apache Hadoop 3.2.0 due to the Guava issue. That's why we need to do this right now on the Spark side.
   
   For the following, @steveloughran , as I wrote in the PR description, this PR doesn't override an explicit user-given config. It only sets `v1` when there is no explicit setting.
   > V2 is used in places where people have hit the scale limits with v1, and they are happy with the risk of failures. 
   
   Eventually, I believe we can use only `hadoop-client-runtime` in order to remove the Guava dependency (#29843) and consume @steveloughran 's new Hadoop release. Until then, Apache Spark 3.1 had better provide a migration path with no known correctness regression. If the default Apache Spark 3.1 distribution is unsafe because of a third party (in this case, Hadoop), how can we recommend it to users?




[GitHub] [spark] dongjoon-hyun edited a comment on pull request #29895: [SPARK-33019][CORE] Use spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by default

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun edited a comment on pull request #29895:
URL: https://github.com/apache/spark/pull/29895#issuecomment-700429110


   I labeled SPARK-33019 as a correctness issue because a working Apache Spark 2.4 PySpark program can generate a wrong result with the Apache Spark 3.0 Hadoop 3.2 distribution or the default Apache Spark 3.1 distribution. This is a release blocker for Apache Spark 3.1. Note that this is a no-op when the user provides the conf.




[GitHub] [spark] dongjoon-hyun commented on pull request #29895: [SPARK-33019][CORE] Use spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by default

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #29895:
URL: https://github.com/apache/spark/pull/29895#issuecomment-700821012


   Hi, @steveloughran and @tgravescs . 
   
   Whatever happens in the future, they cannot change history (Apache Hadoop 3.2.0). For now, Apache Spark 3.1 is stuck on Apache Hadoop 3.2.0 due to the Guava issue. That's why we need to do this right now on the Spark side.
   
   For the following, @steveloughran , as I wrote in the PR description, this PR doesn't override an explicit user-given config. It only sets `v1` when there is no explicit setting.
   > V2 is used in places where people have hit the scale limits with v1, and they are happy with the risk of failures. 
   
   Eventually, I believe we can use only `hadoop-client-runtime` in order to remove the Guava dependency (#29843) and consume @steveloughran 's new Hadoop release.




[GitHub] [spark] dongjoon-hyun commented on pull request #29895: [SPARK-33019][CORE] Use spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by default

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #29895:
URL: https://github.com/apache/spark/pull/29895#issuecomment-700917856


   @tgravescs . Apache Spark's official cloud integration document is here. We are already recommending `v1` for safety.
   - https://spark.apache.org/docs/latest/cloud-integration#recommended-settings-for-writing-to-object-stores
   
   > For object stores whose consistency model means that rename-based commits are safe use the FileOutputCommitter v2 algorithm for performance; v1 for safety.




[GitHub] [spark] dongjoon-hyun edited a comment on pull request #29895: [SPARK-33019][CORE] Use spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by default

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun edited a comment on pull request #29895:
URL: https://github.com/apache/spark/pull/29895#issuecomment-700917856


   @tgravescs . Apache Spark's official cloud integration document is here. We are already recommending `v1` for safety. With this PR, Apache Spark 3.1 (default Hadoop 3.2) can be as safe as Apache Spark 3.0 (default Hadoop 2.7).
   - https://spark.apache.org/docs/latest/cloud-integration#recommended-settings-for-writing-to-object-stores
   
   > For object stores whose consistency model means that rename-based commits are safe use the FileOutputCommitter v2 algorithm for performance; v1 for safety.




[GitHub] [spark] dongjoon-hyun edited a comment on pull request #29895: [SPARK-33019][CORE] Use spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by default

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun edited a comment on pull request #29895:
URL: https://github.com/apache/spark/pull/29895#issuecomment-700429110


   I labeled this issue as a correctness issue because a working Apache Spark 2.4 PySpark program can generate a wrong result with the Apache Spark 3.0 Hadoop 3.2 distribution or the default Apache Spark 3.1 distribution. This is a release blocker for Apache Spark 3.1.




[GitHub] [spark] dongjoon-hyun commented on pull request #29895: [SPARK-33019][CORE] Use spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by default

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #29895:
URL: https://github.com/apache/spark/pull/29895#issuecomment-700283893


   cc @dbtsai , @viirya , @sunchao 




[GitHub] [spark] AmplabJenkins removed a comment on pull request #29895: [SPARK-33019][CORE] Use spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by default

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29895:
URL: https://github.com/apache/spark/pull/29895#issuecomment-700276884


   Merged build finished. Test FAILed.




[GitHub] [spark] dongjoon-hyun commented on a change in pull request #29895: [SPARK-33019][CORE] Use spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by default

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #29895:
URL: https://github.com/apache/spark/pull/29895#discussion_r496405283



##########
File path: core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala
##########
@@ -462,6 +462,9 @@ private[spark] object SparkHadoopUtil {
     for ((key, value) <- conf.getAll if key.startsWith("spark.hadoop.")) {
       hadoopConf.set(key.substring("spark.hadoop.".length), value)
     }
+    if (conf.getOption("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version").isEmpty) {
+      hadoopConf.set("mapreduce.fileoutputcommitter.algorithm.version", "1")

Review comment:
       Thanks for the review, but we cannot use it because Apache Hadoop has already loaded the default value of `mapreduce.fileoutputcommitter.algorithm.version`, @HyukjinKwon .






[GitHub] [spark] steveloughran commented on pull request #29895: [SPARK-33019][CORE] Use spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by default

Posted by GitBox <gi...@apache.org>.
steveloughran commented on pull request #29895:
URL: https://github.com/apache/spark/pull/29895#issuecomment-700855328


   Patch LGTM: you are changing the default algorithm to v1 if the user doesn't say otherwise. 
   
   I'm sorry about "the Guava problem"; that is something to discuss. There were some security fixes we needed to get in, and we couldn't stay on older versions. FWIW, we are removing the Preconditions checks from hadoop-common entirely and moving to our own, just to avoid grief there, but its other bits (executors, cache, ...) will still be used. What a pain.




[GitHub] [spark] dongjoon-hyun closed pull request #29895: [SPARK-33019][CORE] Use spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by default

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun closed pull request #29895:
URL: https://github.com/apache/spark/pull/29895


   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #29895: [SPARK-33019][CORE] Use spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by default

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29895:
URL: https://github.com/apache/spark/pull/29895#issuecomment-700241280








[GitHub] [spark] dongjoon-hyun edited a comment on pull request #29895: [SPARK-33019][CORE] Use spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by default

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun edited a comment on pull request #29895:
URL: https://github.com/apache/spark/pull/29895#issuecomment-700821012


   Hi, @steveloughran and @tgravescs . 
   
   Whatever happens in the future, they cannot change history (Apache Hadoop 3.2.0 and all existing Hadoop 3.x versions). For now, Apache Spark 3.1 is stuck on Apache Hadoop 3.2.0 due to the Guava issue. That's why we need to do this right now on the Spark side.
   
   For the following, @steveloughran , as I wrote in the PR description, this PR doesn't override an explicit user-given config. It only sets `v1` when there is no explicit setting.
   > V2 is used in places where people have hit the scale limits with v1, and they are happy with the risk of failures. 
   
   Eventually, I believe we can use only `hadoop-client-runtime` in order to remove the Guava dependency (#29843) and consume @steveloughran 's new Hadoop release.




[GitHub] [spark] dongjoon-hyun edited a comment on pull request #29895: [SPARK-33019][CORE] Use spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by default

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun edited a comment on pull request #29895:
URL: https://github.com/apache/spark/pull/29895#issuecomment-700429110


   I labeled this issue as a correctness issue because a working Apache Spark 2.4 PySpark program can generate a wrong result in the Apache Spark 3.0 Hadoop 3.2 distribution or in Apache Spark 3.1 by default. This is a release blocker for Apache Spark 3.1.




[GitHub] [spark] dongjoon-hyun commented on pull request #29895: [SPARK-33019][CORE] Use spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by default

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #29895:
URL: https://github.com/apache/spark/pull/29895#issuecomment-700919766


   Merged to master/3.0.




[GitHub] [spark] HyukjinKwon commented on a change in pull request #29895: [SPARK-33019][CORE] Use spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by default

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #29895:
URL: https://github.com/apache/spark/pull/29895#discussion_r496306160



##########
File path: core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala
##########
@@ -462,6 +462,9 @@ private[spark] object SparkHadoopUtil {
     for ((key, value) <- conf.getAll if key.startsWith("spark.hadoop.")) {
       hadoopConf.set(key.substring("spark.hadoop.".length), value)
     }
+    if (conf.getOption("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version").isEmpty) {
+      hadoopConf.set("mapreduce.fileoutputcommitter.algorithm.version", "1")

Review comment:
       How about we just use `hadoopConf.setIfUnset`?






[GitHub] [spark] dongjoon-hyun edited a comment on pull request #29895: [SPARK-33019][CORE] Use spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by default

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun edited a comment on pull request #29895:
URL: https://github.com/apache/spark/pull/29895#issuecomment-700429110


   I labeled SPARK-33019 as a correctness issue because a working Apache Spark 2.4 PySpark program can generate a wrong result with the Apache Spark 3.0 Hadoop 3.2 distribution or the default Apache Spark 3.1 distribution. This is a release blocker for Apache Spark 3.1. Note that this is a no-op if the user provides the conf.

