Posted to issues@spark.apache.org by "Steve Loughran (Jira)" <ji...@apache.org> on 2021/04/03 16:28:00 UTC

[jira] [Commented] (SPARK-23977) Add commit protocol binding to Hadoop 3.1 PathOutputCommitter mechanism

    [ https://issues.apache.org/jira/browse/SPARK-23977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17314296#comment-17314296 ] 

Steve Loughran commented on SPARK-23977:
----------------------------------------

The Spark settings don't make it down from SQL; you can do more at the RDD API level.

The problem is that the Spark partition-insert logic all relies on renaming, which carries the O(data) performance penalty as well as the other commit-correctness issues. Yes, something to do the pushdown could be written, or the multipart upload APIs of HADOOP-13186 could give Spark a standard API with which to implement a zero-rename committer in its own code.

However, the focus is now on things like Iceberg and Delta Lake, which offer more in terms of:
* atomic job commit
* avoiding the performance and cost issues of relying on directory tree scans as the way to identify source files, and the performance issues of doing all IO down a single shard of S3 storage.
I do not disagree with the direction of that work; we have to view the S3A committers (and IBM's Stocator + the AWS EMR Spark committers) as the final attempts to carry the "it's just a directory tree" model into a cloud world where directories don't always exist, and listing them is measured in hundreds of milliseconds.
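
For anyone wiring this up at the SQL/DataFrame level, a minimal configuration sketch, assuming the spark-hadoop-cloud module (with the PathOutputCommitProtocol and BindingParquetOutputCommitter binding classes) is on the classpath and the cluster runs a Hadoop 3.1+ build with the S3A committers; the bucket name is made up:

    import org.apache.spark.sql.SparkSession

    // Route DataFrame/Dataset writes through the PathOutputCommitter binding
    // instead of the default FileOutputCommitter-based commit protocol.
    val spark = SparkSession.builder()
      .appName("s3a-committer-demo")
      .config("spark.sql.sources.commitProtocolClass",
        "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
      .config("spark.sql.parquet.output.committer.class",
        "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
      // Let MRv2 FileOutputFormat pick the S3A committer factory for s3a:// destinations
      .config("spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a",
        "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory")
      // Choose one of the S3A committers: "directory", "partitioned" or "magic"
      .config("spark.hadoop.fs.s3a.committer.name", "directory")
      .getOrCreate()

    spark.range(1000).write.parquet("s3a://example-bucket/output/")

With something like this in place, the write above is committed by completing multipart uploads at job commit, with no rename step.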

> Add commit protocol binding to Hadoop 3.1 PathOutputCommitter mechanism
> -----------------------------------------------------------------------
>
>                 Key: SPARK-23977
>                 URL: https://issues.apache.org/jira/browse/SPARK-23977
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>            Priority: Minor
>             Fix For: 3.0.0
>
>
> Hadoop 3.1 adds a mechanism for job-specific and store-specific committers (MAPREDUCE-6823, MAPREDUCE-6956), and one key implementation: the S3A committers (HADOOP-13786).
> These committers deliver high-performance output of MR and Spark jobs to S3, and offer the key semantics which Spark depends on: no visible output until job commit, and the failure of a task at any stage, including partway through task commit, can be handled by executing and committing another task attempt. 
> In contrast, the FileOutputFormat commit algorithms on S3 have issues:
> * Awful performance because files are copied by rename
> * FileOutputFormat v1: weak task-commit failure-recovery semantics, because the v1 expectation that "directory renames are atomic" doesn't hold.
> * S3 metadata eventual consistency can cause rename to miss files or fail entirely (SPARK-15849)
> Note also that the FileOutputFormat "v2" commit algorithm doesn't offer any of the commit semantics w.r.t. observability of, or recovery from, task commit failure, on any filesystem.
> The S3A committers address these by way of uploading all data to the destination through multipart uploads, uploads which are only completed in job commit.
> The new {{PathOutputCommitter}} factory mechanism allows applications to work with the S3A committers and any others, by adding a plugin point into the MRv2 FileOutputFormat class, where job config and filesystem configuration options can dynamically choose the output committer.
> Spark can use these with some binding classes to 
> # Add a subclass of {{HadoopMapReduceCommitProtocol}} which uses the MRv2 classes and {{PathOutputCommitterFactory}} to create the committers.
> # Add a {{BindingParquetOutputCommitter extends ParquetOutputCommitter}}
> to wire up Parquet output even when code requires the committer to be a subclass of {{ParquetOutputCommitter}}.
> This patch builds on SPARK-23807 for setting up the dependencies.
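
To illustrate the shape of binding class #1 above, a rough sketch (not the actual class added by the patch; the class name here is illustrative) of a HadoopMapReduceCommitProtocol subclass that delegates committer creation to the Hadoop 3.1 factory:

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.mapreduce.{OutputCommitter, TaskAttemptContext}
    import org.apache.hadoop.mapreduce.lib.output.PathOutputCommitterFactory
    import org.apache.spark.internal.io.HadoopMapReduceCommitProtocol

    // Sketch only: the committer is chosen by whichever factory is configured for
    // the destination filesystem scheme (mapreduce.outputcommitter.factory.scheme.<scheme>),
    // so an s3a:// destination can get an S3A committer while everything else
    // falls back to the classic FileOutputCommitter.
    class PathOutputCommitProtocolSketch(jobId: String, dest: String)
      extends HadoopMapReduceCommitProtocol(jobId, dest) {

      override protected def setupCommitter(context: TaskAttemptContext): OutputCommitter = {
        val destPath = new Path(dest)
        PathOutputCommitterFactory
          .getCommitterFactory(destPath, context.getConfiguration)
          .createOutputCommitter(destPath, context)
      }
    }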


