Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/03/18 18:14:08 UTC

[GitHub] [spark] steveloughran commented on issue #21066: [SPARK-23977][CLOUD][WIP] Add commit protocol binding to Hadoop 3.1 PathOutputCommitter mechanism

steveloughran commented on issue #21066: [SPARK-23977][CLOUD][WIP] Add commit protocol binding to Hadoop 3.1 PathOutputCommitter mechanism
URL: https://github.com/apache/spark/pull/21066#issuecomment-474039157
 
 
   > I am not clear why it should throw this exception. What would happen if this code were enabled?
   
   > I tried to decode the comment on the parameter, "dynamic partition overwrite is not supported, so that committers for stores which do not support rename will not get confused.", but I got a bit confused.
   
   oops :)
   
   The `dynamicPartitionOverwrite` flag came in with SPARK-20236, but it assumes the committer is always one which can support it by renaming staged output into place, as the filesystem committers can. The `PathOutputCommitter` interface added to hadoop-mapreduce is a bit more relaxed than that, and things break.
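   Roughly, the guard being asked about looks like this. This is only an illustrative Scala sketch, not the actual PR code: the class name and message are made up, and it assumes a binding that extends Spark's `HadoopMapReduceCommitProtocol`. The point is simply to fail fast when a committer can't do the rename-into-place step that dynamic partition overwrite relies on:
   
   ```scala
   import java.io.IOException
   import org.apache.hadoop.mapreduce.{OutputCommitter, TaskAttemptContext}
   import org.apache.spark.internal.io.HadoopMapReduceCommitProtocol
   
   // Illustrative only: a committer binding that cannot stage output and then
   // rename it over the existing partitions refuses dynamic partition
   // overwrite up front, rather than risk corrupting the destination later.
   class ExamplePathOutputCommitProtocol(
       jobId: String,
       dest: String,
       dynamicOverwrite: Boolean)
     extends HadoopMapReduceCommitProtocol(jobId, dest, dynamicOverwrite) {
   
     override protected def setupCommitter(context: TaskAttemptContext): OutputCommitter = {
       if (dynamicOverwrite) {
         // No staging directory whose contents can be moved into the final
         // partitions, so failing fast is safer than silent data loss.
         throw new IOException(
           "Dynamic partition overwrite is not supported by this committer")
       }
       super.setupCommitter(context)
     }
   }
   ```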
   
   bq. when DynamicOverwrite is set, the committer (Directory/Partitioned) will delete the existing contents (in REPLACE mode).
   
   For Hadoop HDFS and similar filesystems, that is what happens. For the [S3A committers](https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/committers.md), conflict handling is subtly different.
   
   In particular, Ryan Blue's *partitioned* committer has special handling for writing out to a directory tree, where the conflict policy (replace, append, fail) _is only applied to the final directories at the end of the partition tree_: [PartitionedStagingCommitter.java#L122](https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/commit/staging/PartitionedStagingCommitter.java#L122)
   
   This lets a job update a directory full of data in place, ignoring all directories which don't contain new data from the current job. It can instead be set to fail if the final destinations already exist, or, if configured to delete files, it will purge all existing data in those destination paths, but nowhere else.
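   As a rough Scala sketch of that idea (illustrative only; the real logic is the Java code in `PartitionedStagingCommitter` linked above, and these names are made up): the conflict mode is evaluated per final partition directory the job actually touches, never against the whole destination tree.
   
   ```scala
   import org.apache.hadoop.fs.{FileSystem, Path}
   
   sealed trait ConflictMode
   case object Fail extends ConflictMode
   case object Append extends ConflictMode
   case object Replace extends ConflictMode
   
   object PartitionConflictSketch {
     /** Apply the conflict mode only to the leaf partition directories the
      *  job is about to write into; untouched partitions are left alone. */
     def resolve(fs: FileSystem, newFiles: Seq[Path], mode: ConflictMode): Unit = {
       val touchedPartitions = newFiles.map(_.getParent).distinct
       touchedPartitions.foreach { partition =>
         if (fs.exists(partition)) {
           mode match {
             case Fail    => throw new IllegalStateException(s"Destination partition exists: $partition")
             case Replace => fs.delete(partition, true) // purge only this partition, nowhere else
             case Append  => ()                         // keep old files; new files get unique names
           }
         }
       }
     }
   }
   ```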
   
   Example: imagine a query which generates data under `s3a://dest/year=2019/month=03/day=18`.
   
   If the destination path already has data in `s3a://dest/year=2019/month=03/day=17`, there'll be no conflict. If there is something in `day=18`, then the new query can either add new files (remember, each file gets a UUID in its name by default), delete the old ones before adding the new files, or fail.
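   For reference, the committer and its conflict mode are picked with the standard S3A committer settings from the docs linked above; here's a minimal sketch (app name and paths are made up, and it assumes the Spark-side `PathOutputCommitter` binding this PR adds is already wired up):
   
   ```scala
   import org.apache.spark.sql.SparkSession
   
   // Select the partitioned staging committer and choose how it handles an
   // existing leaf partition: "fail", "append", or "replace".
   val spark = SparkSession.builder()
     .master("local[*]")
     .appName("partitioned-committer-example")
     .config("spark.hadoop.fs.s3a.committer.name", "partitioned")
     .config("spark.hadoop.fs.s3a.committer.staging.conflict-mode", "replace")
     .getOrCreate()
   
   // With "replace", a run that only produces day=18 rewrites that partition
   // and leaves day=17 untouched, e.g.:
   // df.write.partitionBy("year", "month", "day").parquet("s3a://dest/")
   ```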
   
   Is that clearer? It's designed for in-place updates of very large datasets without having to rename or move any output files afterwards.
   
