Posted to commits@spark.apache.org by sr...@apache.org on 2020/08/27 14:10:52 UTC
[spark] branch branch-3.0 updated: [SPARK-32701][CORE][DOCS] mapreduce.fileoutputcommitter.algorithm.version default value

This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.0 by this push:
     new 60f4856  [SPARK-32701][CORE][DOCS] mapreduce.fileoutputcommitter.algorithm.version default value
60f4856 is described below

commit 60f485671a07a93ae8a8506ed2c0999cfe6ded7b
Author: waleedfateem <wa...@gmail.com>
AuthorDate: Thu Aug 27 09:05:50 2020 -0500

    [SPARK-32701][CORE][DOCS] mapreduce.fileoutputcommitter.algorithm.version default value
    
    The current documentation states that the default value of spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version is 1. This is not entirely accurate: Spark does not set this configuration anywhere, so the value is inherited from the Hadoop FileOutputCommitter class.
    
    ### What changes were proposed in this pull request?
    
    This change clarifies that the default value depends entirely on the Hadoop version of the runtime environment.
    
    ### Why are the changes needed?
    
    An application that uses algorithm version 1 in some environments will, without any changes, use version 2 in environments running Hadoop 3.0 and later. This can have serious consequences in certain scenarios; for example, two tasks can partially overwrite each other's output if speculation is enabled. See also the following JIRA:
    https://issues.apache.org/jira/browse/MAPREDUCE-7282
    
    ### Does this PR introduce _any_ user-facing change?
    
    Yes. The configuration page previously stated explicitly that the default version of the FileOutputCommitter algorithm was 1. The default column now reads "Dependent on environment", and the description column elaborates on how the value is determined.
    
    ### How was this patch tested?
    
    Checked the rendered changes locally in a browser.
    
    Closes #29541 from waleedfateem/SPARK-32701.
    
    Authored-by: waleedfateem <wa...@gmail.com>
    Signed-off-by: Sean Owen <sr...@gmail.com>
    (cherry picked from commit 8749b2b6fae5ee0ce7b48aae6d859ed71e98491d)
    Signed-off-by: Sean Owen <sr...@gmail.com>
---
 docs/configuration.md | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/docs/configuration.md b/docs/configuration.md
index 2701fdb..95ff282 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -1761,11 +1761,16 @@ Apart from these, the following properties are also available, and may be useful
 </tr>
 <tr>
   <td><code>spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version</code></td>
-  <td>1</td>
+  <td>Dependent on environment</td>
   <td>
     The file output committer algorithm version, valid algorithm version number: 1 or 2.
     Version 2 may have better performance, but version 1 may handle failures better in certain situations,
     as per <a href="https://issues.apache.org/jira/browse/MAPREDUCE-4815">MAPREDUCE-4815</a>.
+    The default value depends on the Hadoop version used in an environment:
+    1 for Hadoop versions lower than 3.0
+    2 for Hadoop versions 3.0 and higher
+    It's important to note that this can change back to 1 again in the future once <a href="https://issues.apache.org/jira/browse/MAPREDUCE-7282">MAPREDUCE-7282</a>
+    is fixed and merged.
   </td>
   <td>2.2.0</td>
 </tr>
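
The rule documented in the patch above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not code from Spark or Hadoop: the helper names `expected_default_version` and `pinned_conf_args` are hypothetical, and the version mapping simply mirrors the documentation text in this commit (1 below Hadoop 3.0, 2 from 3.0 onward), which may change again if MAPREDUCE-7282 is resolved. The second helper builds a `--conf` argument pair for `spark-submit` so that an application does not rely on the inherited default.

```python
import re
from typing import List

# Configuration key taken from the documentation row changed in this commit.
CONF_KEY = "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version"


def expected_default_version(hadoop_version: str) -> int:
    """Return the default the docs describe for a Hadoop version string.

    Hypothetical helper: it only encodes the documented rule
    (1 for Hadoop < 3.0, 2 for Hadoop >= 3.0) and does not query Hadoop.
    """
    match = re.match(r"(\d+)", hadoop_version.strip())
    if match is None:
        raise ValueError(f"unrecognized Hadoop version: {hadoop_version!r}")
    major = int(match.group(1))
    return 1 if major < 3 else 2


def pinned_conf_args(version: int) -> List[str]:
    """Build spark-submit arguments that set the algorithm version explicitly.

    Hypothetical helper: only 1 and 2 are valid, per the documentation.
    """
    if version not in (1, 2):
        raise ValueError("algorithm version must be 1 or 2")
    return ["--conf", f"{CONF_KEY}={version}"]


if __name__ == "__main__":
    for hadoop in ("2.7.4", "3.2.0"):
        print(hadoop, "->", expected_default_version(hadoop))
    # Pin version 1 so behavior does not vary across environments.
    print(" ".join(pinned_conf_args(1)))
```

Setting the value explicitly, as in the last line, is one way to keep the same application from silently switching algorithms when it moves between Hadoop 2.x and 3.x environments.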

