Posted to common-issues@hadoop.apache.org by "R.satish Srinivas (Jira)" <ji...@apache.org> on 2020/03/25 17:59:00 UTC

[jira] [Commented] (HADOOP-16090) S3A Client to add explicit support for versioned stores

    [ https://issues.apache.org/jira/browse/HADOOP-16090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17066935#comment-17066935 ] 

R.satish Srinivas commented on HADOOP-16090:
--------------------------------------------

[~stevel@apache.org] [~dchmelev] Does this issue occur only when dealing with a large number of file writes to S3? I use a Spark streaming application with Hadoop 2.8.3 that keeps adding files to S3 directories, and it is accumulating a lot of directory-level delete markers, which is causing XML parsing errors during S3 list operations. Also, is the fix for this available in any version of Hadoop yet?
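
A rough way to confirm that delete-marker accumulation is what the list operations are tripping over is to count the markers under an output prefix. The following is only an illustrative sketch using the AWS SDK for Java v1, independent of S3A; the bucket and prefix names are placeholders:

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;
    import com.amazonaws.services.s3.model.S3VersionSummary;
    import com.amazonaws.services.s3.model.VersionListing;

    public class CountDeleteMarkers {
      public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        String bucket = "my-versioned-bucket";   // placeholder bucket name
        String prefix = "streaming/output/";     // placeholder output prefix

        long markers = 0;
        VersionListing listing = s3.listVersions(bucket, prefix);
        while (true) {
          // Delete markers appear as version summaries with isDeleteMarker() == true.
          for (S3VersionSummary v : listing.getVersionSummaries()) {
            if (v.isDeleteMarker()) {
              markers++;
            }
          }
          if (!listing.isTruncated()) {
            break;
          }
          listing = s3.listNextBatchOfVersions(listing);
        }
        System.out.println("Delete markers under " + prefix + ": " + markers);
      }
    }

A marker count that keeps growing on shallow "directory" keys, far in excess of the number of live objects, is consistent with the behavior described in the issue below.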

> S3A Client to add explicit support for versioned stores
> -------------------------------------------------------
>
>                 Key: HADOOP-16090
>                 URL: https://issues.apache.org/jira/browse/HADOOP-16090
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 2.8.1
>            Reporter: Dmitri Chmelev
>            Assignee: Steve Loughran
>            Priority: Minor
>
> The fix to avoid calls to getFileStatus() for each path component in deleteUnnecessaryFakeDirectories() (HADOOP-13164) results in an accumulation of delete markers in versioned S3 buckets. That patch replaced the getFileStatus() checks with a single batch delete request built from all ancestor keys of a given path. Since the delete request does not check whether the fake directories exist, it will create a delete marker for every path component that did not exist (or was previously deleted). Note that issuing a DELETE request without specifying a version ID will always create a new delete marker, even if one already exists ([AWS S3 Developer Guide|https://docs.aws.amazon.com/AmazonS3/latest/dev/RemDelMarker.html]).
> Since deleteUnnecessaryFakeDirectories() is called as a callback on successful writes and on renames, delete markers accumulate rather quickly and their rate of accumulation is inversely proportional to the depth of the path. In other words, directories closer to the root will have more delete markers than the leaves.
> This behavior negatively impacts the performance of the getFileStatus() operation when it has to issue a listObjects() request (especially v1), as the delete markers have to be examined while the request searches for the first current, non-deleted version of an object following a given prefix.
> I did a quick comparison against 3.x and the issue is still present: https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java#L2947
>  
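
To make the mechanism described above concrete, here is a minimal sketch of the ancestor-key batch delete pattern, written directly against the AWS SDK for Java v1; the class, helper, and bucket names are invented for illustration, and this is not the actual S3AFileSystem code:

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;
    import com.amazonaws.services.s3.model.DeleteObjectsRequest;
    import com.amazonaws.services.s3.model.DeleteObjectsRequest.KeyVersion;
    import java.util.ArrayList;
    import java.util.List;

    public class FakeDirDeleteSketch {

      // Collect the fake-directory marker keys for every ancestor of objectKey,
      // e.g. "a/b/c/part-00000" -> ["a/b/c/", "a/b/", "a/"].
      static List<KeyVersion> ancestorDirKeys(String objectKey) {
        List<KeyVersion> keys = new ArrayList<>();
        int slash = objectKey.lastIndexOf('/');
        while (slash > 0) {
          keys.add(new KeyVersion(objectKey.substring(0, slash + 1)));
          slash = objectKey.lastIndexOf('/', slash - 1);
        }
        return keys;
      }

      public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        String bucket = "my-versioned-bucket";   // placeholder bucket name

        // One batch DELETE for all ancestor keys, with no existence check and no
        // version IDs: on a versioned bucket, every key in the batch receives a
        // new delete marker whether or not a fake directory object was present.
        s3.deleteObjects(new DeleteObjectsRequest(bucket)
            .withKeys(ancestorDirKeys("a/b/c/part-00000"))
            .withQuiet(true));
      }
    }

Because the keys carry no version IDs and no existence check precedes the request, a versioned bucket records a fresh delete marker for every ancestor on every such call, which is why the markers accumulate fastest on the keys closest to the root.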



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org