Posted to common-issues@hadoop.apache.org by "Steve Loughran (Jira)" <ji...@apache.org> on 2021/08/06 10:51:00 UTC

[jira] [Commented] (HADOOP-17833) Improve Magic Committer Performance

    [ https://issues.apache.org/jira/browse/HADOOP-17833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17394697#comment-17394697 ] 

Steve Loughran commented on HADOOP-17833:
-----------------------------------------


Thinking about other improvements:


* Since all committers must be on the same (marker-aware) Hadoop release, we should enable marker retention on every magic path. This saves on DELETE requests.
* Skip the mkdirs() in task setup; this saves the scan up the tree and a PUT. We will need to make sure task commit is OK with an FNFE on list.
* Fix s3a openFile().with(FileStatus) to accept a file status which is not an instance of S3AFS (this is in the openFile() enhancements patch, but we only need this part), and have JsonSerDeser pass the status down when opening a file. This saves a HEAD request when going from a dir list to opening a file in task and job commit.
* Make sure job commit is optimised, as it is the critical path for compute.
* Maybe: collect task commit stats, as the manifest committer will do. This might be best done first, for measuring the optimisations.
* Include the stats of input and output streams, if we can enhance JsonSerDeser with the ability (new methods) to return them.

The first three are straightforward, with minimal production code changes; the tests are not that difficult.
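
The HEAD saving in the third item can be pictured with a small toy model. This is a sketch in plain Java, not Hadoop code: the class OpenWithStatusModel, its methods, the file name and the simulated request counter are all hypothetical names invented for illustration. It only shows the idea that an open call which trusts a status already obtained from a directory listing has no need to probe the store again.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model (not Hadoop code): opening a file with and without a known status. */
public class OpenWithStatusModel {

    /** Simulated object store: path -> object length. */
    private final Map<String, Long> store = new HashMap<>();

    /** Number of simulated HEAD requests issued so far. */
    private int headRequests;

    /** Adds an object to the simulated store. */
    public void put(String path, long length) {
        store.put(path, length);
    }

    /** Simulated HEAD request: looks up the object length and counts the call. */
    private long head(String path) {
        headRequests++;
        Long length = store.get(path);
        if (length == null) {
            throw new IllegalStateException("not found: " + path);
        }
        return length;
    }

    /** Open without a status: one HEAD is needed to learn the length. */
    public long open(String path) {
        return head(path);
    }

    /** Open with the length already known from a listing: no HEAD is issued. */
    public long open(String path, long knownLength) {
        return knownLength;
    }

    public int getHeadRequests() {
        return headRequests;
    }

    public static void main(String[] args) {
        OpenWithStatusModel fs = new OpenWithStatusModel();
        fs.put("task_0001.pendingset", 128L);   // hypothetical file name
        fs.open("task_0001.pendingset");        // probes the store: one HEAD
        fs.open("task_0001.pendingset", 128L);  // reuses the listed length: no HEAD
        System.out.println("HEAD requests: " + fs.getHeadRequests());
    }
}
```

In this model only the first open() is counted as a HEAD; the second reuses the length the caller already has, which is the saving described for the step from dir list to file open.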

> Improve Magic Committer Performance
> -----------------------------------
>
>                 Key: HADOOP-17833
>                 URL: https://issues.apache.org/jira/browse/HADOOP-17833
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs/s3
>    Affects Versions: 3.3.1
>            Reporter: Steve Loughran
>            Priority: Minor
>
> Magic committer tasks can be slow because every file created with overwrite=false triggers a HEAD (to verify there's no file) and a LIST (to verify there's no dir). And because manifestation of the files is delayed, the check may not behave as expected.
> ParquetOutputFormat is one example of a library which does this.
> We could fix Parquet to use overwrite=true, but (a) there may be surprises in other uses, (b) it'd still leave the LIST and (c) it would do nothing for other formats.
> Proposed: createFile() under a magic path skips all probes for a file/dir at the end of the path.
> Only a single task attempt will be writing to that directory, and it should know what it is doing. If there are conflicting file names and parts across tasks, that won't even get picked up at this point. Oh, and none of the committers ever check for this: you'll get the last file manifested (s3a) or renamed (file).
> If we skip the checks we will save 2 HTTP requests/file.
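
The arithmetic behind the "2 HTTP requests/file" figure can be sketched as a toy model. This is plain Java, not Hadoop code: the class CreateProbeModel, the Mode enum and the file count of 100 are hypothetical, chosen only for illustration. It assumes, as the description states, that overwrite=false costs a HEAD plus a LIST, that overwrite=true still leaves the LIST, and that the proposed magic-path behaviour issues no probes at all. Requests needed to write the data itself are deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model (not Hadoop code) of the probes issued before a file is created. */
public class CreateProbeModel {

    /** How the hypothetical create call decides which probes to issue. */
    public enum Mode { OVERWRITE_FALSE, OVERWRITE_TRUE, MAGIC_PATH_SKIP }

    /** Returns the probe requests issued before any data is written. */
    public static List<String> probesFor(Mode mode) {
        List<String> requests = new ArrayList<>();
        if (mode == Mode.OVERWRITE_FALSE) {
            requests.add("HEAD");   // verify there is no file at the path
        }
        if (mode != Mode.MAGIC_PATH_SKIP) {
            requests.add("LIST");   // verify there is no directory at the path
        }
        return requests;
    }

    /** Total probe requests for a task attempt creating the given number of files. */
    public static int totalProbes(Mode mode, int files) {
        return probesFor(mode).size() * files;
    }

    public static void main(String[] args) {
        int files = 100; // hypothetical number of files written by one task attempt
        for (Mode mode : Mode.values()) {
            System.out.println(mode + ": " + totalProbes(mode, files) + " probe requests");
        }
    }
}
```

Under these assumptions, the difference between OVERWRITE_FALSE and MAGIC_PATH_SKIP is two requests per file, whatever the number of files.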


