Posted to common-issues@hadoop.apache.org by "Steve Loughran (JIRA)" <ji...@apache.org> on 2016/08/07 09:14:20 UTC
[jira] [Commented] (HADOOP-13430) Optimize and fix getFileStatus in S3A

    [ https://issues.apache.org/jira/browse/HADOOP-13430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15410875#comment-15410875 ] 

Steve Loughran commented on HADOOP-13430:
-----------------------------------------

Looks good. We'll all need to test this, especially against Hive and Spark perf runs.

One thing we don't have for the s3/object store tests is a good real-world directory tree. Our local tests only create small and unrealistic directory trees, so there's a risk that code (mine especially) optimises for that test layout rather than real-world ones. That'll get even worse once we look at globStatus optimisation, where the glob patterns for queries need to be realistic too.

Given you are clearly using this in production, is there a way you could share some of the directory structure & query patterns with us? If we had a text file listing all the paths, we could have a (manually invoked) test case which would read it and generate the directory tree, which could then be used by all tests looking at metadata performance. We wouldn't need the contents of files or the real names, but knowing things like dates in the layout & file extensions, along with any globStatus calls, would help make for realistic operations.
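
As a rough illustration of the kind of generator meant here, the sketch below reads a listing of paths and creates an empty file or directory for each one. Everything in it is an assumption for illustration: the class name TreeFromListing, the listing format (one path per line, a trailing "/" marking a directory, "#" starting a comment), and the use of java.nio.file against a local directory so that the example runs on its own. A real test would presumably go through the Hadoop FileSystem API (FileSystem.mkdirs and FileSystem.create) against an S3A bucket instead.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper, not part of Hadoop: rebuilds a directory tree from a
// plain-text listing so metadata tests can run against a realistic layout.
class TreeFromListing {

    /**
     * Creates one empty file or directory under root for each listed path.
     * Assumed listing format: one path per line, a trailing "/" marks a
     * directory, blank lines and lines starting with "#" are skipped.
     * Returns the number of listing entries processed.
     */
    static int build(List<String> listing, Path root) throws IOException {
        Path base = root.toAbsolutePath().normalize();
        int processed = 0;
        for (String line : listing) {
            String entry = line.trim();
            if (entry.isEmpty() || entry.startsWith("#")) {
                continue;
            }
            boolean isDirectory = entry.endsWith("/");
            // Drop leading slashes so the path is resolved relative to base.
            String relative = entry.replaceAll("^/+", "");
            Path target = base.resolve(relative).normalize();
            if (target.equals(base)) {
                continue; // the entry was just "/"
            }
            if (!target.startsWith(base)) {
                throw new IllegalArgumentException("Path escapes the test root: " + entry);
            }
            if (isDirectory) {
                Files.createDirectories(target);
            } else {
                Files.createDirectories(target.getParent());
                if (!Files.exists(target)) {
                    Files.createFile(target); // contents are not needed, only the name
                }
            }
            processed++;
        }
        return processed;
    }
}
```

The listing itself could come from something like Files.readAllLines on the shared text file; only names and layout matter, so the generated files stay empty.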

> Optimize and fix getFileStatus in S3A
> -------------------------------------
>
>                 Key: HADOOP-13430
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13430
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 2.8.0
>            Reporter: Steven K. Wong
>            Priority: Minor
>         Attachments: HADOOP-13430.001.WIP.patch
>
>
> Currently, S3AFileSystem.getFileStatus(Path f) sends up to 3 requests to S3 when pathToKey(f) = key = "foo/bar" is a directory:
> 1. HEAD key=foo/bar [continue if not found]
> 2. HEAD key=foo/bar/ [continue if not found]
> 3. LIST prefix=foo/bar/ delimiter=/ max-keys=1
> My experience (and generally true, I reckon) is that almost all directories are nonempty directories without a "fake directory" file (e.g. "foo/bar/"). Under this condition, request #2 is mostly unhelpful; it only slows down getFileStatus. Therefore, I propose swapping the order of requests #2 and #3. The swapped HEAD request will be skipped in practically all cases.
> Furthermore, when key = "foo/bar" is a nonempty directory that contains a "fake directory" file (in addition to actual files), getFileStatus currently returns an S3AFileStatus with isEmptyDirectory=true, which is wrong. Swapping will fix this. The swapped LIST request will use max-keys=2 to determine isEmptyDirectory correctly. (Removing the delimiter from the LIST request should make the logic a little simpler than otherwise.)
> Note that key = "foo/bar/" has the same problem with isEmptyDirectory. To fix it, I propose skipping request #1 when key ends with "/". The price is this will, for an empty directory, replace a HEAD request with a LIST request that's generally more taxing on S3.
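
The probe order proposed in the quoted description can be sketched against a toy in-memory key store. This is not the S3AFileSystem code or the attached patch: the class ProbeOrderSketch, its Kind enum, and the head/list helpers are hypothetical stand-ins that only model a HEAD request, a delimiter-free LIST with max-keys, and a request counter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical model of the proposed request order; not the real S3A client.
class ProbeOrderSketch {
    enum Kind { FILE, EMPTY_DIR, NONEMPTY_DIR, NOT_FOUND }

    final TreeSet<String> keys = new TreeSet<>(); // stands in for the bucket
    int requests = 0;                             // simulated requests issued

    // Models HEAD key.
    boolean head(String key) {
        requests++;
        return keys.contains(key);
    }

    // Models LIST prefix=... max-keys=... with no delimiter.
    List<String> list(String prefix, int maxKeys) {
        requests++;
        List<String> found = new ArrayList<>();
        for (String k : keys.tailSet(prefix, true)) {
            if (!k.startsWith(prefix) || found.size() == maxKeys) {
                break;
            }
            found.add(k);
        }
        return found;
    }

    Kind getFileStatus(String key) {
        // Request 1: HEAD the key itself, skipped when the key ends with "/".
        if (!key.endsWith("/") && head(key)) {
            return Kind.FILE;
        }
        String dirKey = key.endsWith("/") ? key : key + "/";
        // Former request 3, now second: LIST with max-keys=2. The marker
        // "dirKey" sorts before every other key under the prefix, so two
        // results are enough to tell "marker only" from "marker plus children".
        List<String> listed = list(dirKey, 2);
        if (!listed.isEmpty()) {
            boolean onlyMarker = listed.size() == 1 && listed.get(0).equals(dirKey);
            return onlyMarker ? Kind.EMPTY_DIR : Kind.NONEMPTY_DIR;
        }
        // Former request 2, now last: HEAD the fake directory marker. In this
        // consistent in-memory model it never finds anything the LIST missed;
        // it is kept only to mirror the order in the description.
        return head(dirKey) ? Kind.EMPTY_DIR : Kind.NOT_FOUND;
    }
}
```

In this model a nonempty directory without a marker resolves in two simulated requests (HEAD, then LIST), and a directory holding both a marker and real files is reported as nonempty.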



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
