Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/09/29 15:14:02 UTC

[GitHub] [spark] steveloughran commented on pull request #29895: [SPARK-33019][CORE] Use spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by default

steveloughran commented on pull request #29895:
URL: https://github.com/apache/spark/pull/29895#issuecomment-700774332


   FWIW I'm going to change the default to be v1, and log at WARN in job setup when you use v2 (unless you turn that specific log off). V2 is used in places where people have hit the scale limits of v1 and are happy with the risk of failures. Note that if your job doesn't generate unique files with each task attempt, the output is correct even without atomic task commit. The danger is when you get one or more of:
   
   * different task attempts generating files with different names
   * a requirement that all output files of a task come entirely and exclusively from a single task attempt.
   
   If your attempts are 100% deterministic, you are going to be safe.
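
To make the failure mode concrete, here is a toy, in-memory sketch of the two task-commit styles. It is not Hadoop's FileOutputCommitter code: every function name is hypothetical, directories are modelled as dicts, and v1 task commit is assumed to be a single all-or-nothing rename, while v2 moves files into the destination one at a time.

```python
# Toy model only: dicts stand in for directories, names are hypothetical.

def commit_task_v1(job_attempt_dir, task_id, attempt_files):
    # v1 (assumed): the attempt's output is promoted in one atomic step.
    job_attempt_dir[task_id] = dict(attempt_files)


def commit_job_v1(job_attempt_dir, dest):
    # v1 job commit: move every committed task's files into the destination.
    for files in job_attempt_dir.values():
        dest.update(files)


def commit_task_v2(dest, attempt_files, fail_after=None):
    # v2 (assumed): files land in the destination one by one, so a crash
    # part-way through leaves the earlier files visible.
    for i, (name, data) in enumerate(sorted(attempt_files.items())):
        if fail_after is not None and i >= fail_after:
            raise RuntimeError("task attempt died during commit")
        dest[name] = data


def run_v1(attempts):
    # attempts: list of (files, fail_after); fail_after=None means success.
    job_attempt_dir = {}
    for files, fail_after in attempts:
        if fail_after is not None:
            continue  # attempt died before its single rename; nothing promoted
        commit_task_v1(job_attempt_dir, 0, files)
        break
    dest = {}
    commit_job_v1(job_attempt_dir, dest)
    return dest


def run_v2(attempts):
    dest = {}
    for files, fail_after in attempts:
        try:
            commit_task_v2(dest, files, fail_after)
            break
        except RuntimeError:
            continue  # retry with the next attempt; partial output stays behind
    return dest


# First attempt dies after committing one file; second attempt succeeds.
nondeterministic = [
    ({"part-0-aaa": "x", "part-0-bbb": "y"}, 1),
    ({"part-0-ccc": "x", "part-0-ddd": "y"}, None),
]
deterministic = [
    ({"part-0-a": "x", "part-0-b": "y"}, 1),
    ({"part-0-a": "x", "part-0-b": "y"}, None),
]

if __name__ == "__main__":
    print("v1, nondeterministic names:", sorted(run_v1(nondeterministic)))
    print("v2, nondeterministic names:", sorted(run_v2(nondeterministic)))
    print("v1, deterministic names:   ", sorted(run_v1(deterministic)))
    print("v2, deterministic names:   ", sorted(run_v2(deterministic)))
```

In this model, the v2 run with attempt-specific file names ends up with a stale file from the failed attempt alongside the successful attempt's files, whereas with identical names and contents the retry simply overwrites the partial output and both algorithms agree.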

