Posted to common-issues@hadoop.apache.org by "Steve Loughran (Jira)" <ji...@apache.org> on 2022/06/08 10:58:00 UTC

[jira] [Resolved] (HADOOP-18278) Do not perform a LIST call when creating a file

     [ https://issues.apache.org/jira/browse/HADOOP-18278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steve Loughran resolved HADOOP-18278.
-------------------------------------
    Target Version/s: 3.4.0
          Resolution: Duplicate

We do the check to make sure that apps don't create files over directories. If they do, your object store loses a lot of its "filesystemness": list, rename and delete all break.

HEAD doesn't do the validation, and if you create a file with overwrite=true we skip that call. Sadly, parquet likes creating files with overwrite=false, so it does both HEAD and LIST, even when writing to task attempt dirs which are exclusively for use by a single thread and will be completely deleted at the end of the job.

The magic committer performance issue HADOOP-17833 and its PR https://github.com/apache/hadoop/pull/3289 turn off all the safety checks when writing under __magic dirs, as we know they are short-lived. We don't even check whether directories have been created under files.

The same option is available when writing any file, as the PR also contains
HADOOP-15460, "S3A FS to add fs.s3a.create.performance to the builder file creation option set".

{code}
out = fs.createFile(new Path("s3a://bucket/subdir/output.txt"))
  .opt("fs.s3a.create.performance", true)
  .build();
{code}
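
To make the trade-off concrete, here is a rough model of which probes get issued before a create, as I read the discussion in this thread. It is a toy sketch and not the S3A code: the class and method names (CreateProbes, probesFor) are made up for illustration, and it assumes that overwrite=true skips the HEAD, that overwrite=false issues both HEAD and LIST, and that the performance option skips everything.

```java
import java.util.EnumSet;
import java.util.Set;

/** Toy model of the pre-create probes discussed here; NOT the actual S3A implementation. */
public class CreateProbes {

  /** The two S3 requests under discussion. */
  public enum Probe { HEAD, LIST }

  /**
   * Hypothetical helper: which probes are issued before the upload starts.
   *
   * @param overwrite   the overwrite flag passed to create()
   * @param performance the "fs.s3a.create.performance" builder option
   */
  public static Set<Probe> probesFor(boolean overwrite, boolean performance) {
    if (performance) {
      // Assumption: all safety checks are off, so no probes at all.
      return EnumSet.noneOf(Probe.class);
    }
    if (overwrite) {
      // Assumption: no need to look for an existing file, only for a directory.
      return EnumSet.of(Probe.LIST);
    }
    // overwrite=false: check for an existing file (HEAD) and a directory (LIST).
    return EnumSet.of(Probe.HEAD, Probe.LIST);
  }

  public static void main(String[] args) {
    boolean[] flags = {false, true};
    for (boolean performance : flags) {
      for (boolean overwrite : flags) {
        System.out.println("overwrite=" + overwrite
            + " performance=" + performance
            + " -> " + probesFor(overwrite, performance));
      }
    }
  }
}
```

Under those assumptions, the only way to avoid the LIST is the performance option, which is why it comes with the directory-overwrite risk.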

If you use this you will get the speedup you want anywhere, but you had better be confident you are not overwriting a directory. See
https://github.com/steveloughran/hadoop/blob/s3/HADOOP-17833-magic-committer-performance/hadoop-common-project/hadoop-common/src/site/markdown/filesystem/fsdataoutputstreambuilder.md#-s3a-specific-options

At the time of writing (June 8, 2022) this PR is in critical need of review. Please look at the patch, review it, and make sure it will work for you. This will be your opportunity to make sure it is correct before we ship it. You are clearly looking at the internals of what we're doing, so your insight will be valued. Thanks.

> Do not perform a LIST call when creating a file
> -----------------------------------------------
>
>                 Key: HADOOP-18278
>                 URL: https://issues.apache.org/jira/browse/HADOOP-18278
>             Project: Hadoop Common
>          Issue Type: Improvement
>            Reporter: Sam Kramer
>            Priority: Major
>
> Hello,
> We've noticed that when creating a file which does not exist in S3, an extra LIST call is issued to see if it's a directory (i.e. if key = "bar", it will issue an object list request for "bar/").
> Is this really necessary? Shouldn't a HEAD request be sufficient to determine whether it actually exists? As we're creating 1000s of files, this is quite expensive, as we're effectively doubling our costs for file creation. Curious if others have experienced similar or identical issues, or if there are any workarounds.
> [https://github.com/apache/hadoop/blob/516a2a8e440378c868ddb02cb3ad14d0d879037f/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java#L3359-L3369]
>  
> Thanks,
> Sam


