Posted to common-dev@hadoop.apache.org by "Steve Loughran (JIRA)" <ji...@apache.org> on 2019/07/29 15:33:00 UTC

[jira] [Created] (HADOOP-16465) S3AFileSystem.listLocatedStatus to LIST before HEAD

Steve Loughran created HADOOP-16465:
---------------------------------------

             Summary: S3AFileSystem.listLocatedStatus to LIST before HEAD
                 Key: HADOOP-16465
                 URL: https://issues.apache.org/jira/browse/HADOOP-16465
             Project: Hadoop Common
          Issue Type: Sub-task
          Components: fs/s3
    Affects Versions: 3.2.0
            Reporter: Steve Loughran


Looking at logs of LocatedFileStatus/FileInputFormat scans, there's a needless call to getFileStatus whenever an S3AFileSystem.listLocatedStatus() call is made.

# {{S3AFileSystem.listLocatedStatus()}} does a getFileStatus call, returns the file status first
# But if you look at all the uses in the MR code in FileInputFormat and LocatedFileStatusFetcher, they only call this method *knowing the destination is a directory*

This means that for every unguarded S3 path there are two needless HEAD requests and a single-entry LIST before the real LIST is initiated.

If the S3A FS can assume that the destination is a non-empty directory, it can go straight to the LIST operation, falling back to the two HEAD probes (HEAD on the path, then HEAD on the path + "/") only if that fails.

We could also think about doing the same for listStatus.
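
The LIST-first ordering proposed here could look roughly like the sketch below. It is not the S3A code: {{ListFirstSketch}}, {{FakeStore}}, {{head}}, {{list}} and {{listLocated}} are all hypothetical names, the store is an in-memory stand-in that only counts requests, and the listing is a flat prefix scan with no delimiter handling. It only illustrates the idea of trying LIST first and paying for the HEAD probes when the LIST comes back empty.

```java
import java.io.FileNotFoundException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical sketch of "LIST before HEAD"; not the real S3A API. */
class ListFirstSketch {

  /** In-memory stand-in for an object store that counts HEAD and LIST requests. */
  static class FakeStore {
    private final TreeSet<String> keys = new TreeSet<>();
    int headCalls = 0;
    int listCalls = 0;

    FakeStore(String... objectKeys) {
      for (String k : objectKeys) {
        keys.add(k);
      }
    }

    /** HEAD: is there an object at exactly this key? */
    boolean head(String key) {
      headCalls++;
      return keys.contains(key);
    }

    /** LIST: every key that starts with the prefix (flat, no delimiter). */
    List<String> list(String prefix) {
      listCalls++;
      List<String> out = new ArrayList<>();
      for (String k : keys.tailSet(prefix)) {
        if (!k.startsWith(prefix)) {
          break; // sorted keys: once the prefix stops matching, we are done
        }
        out.add(k);
      }
      return out;
    }
  }

  /**
   * Optimistically treat the path as a non-empty directory and LIST it.
   * Only when the LIST finds no children fall back to the HEAD probes.
   */
  static List<String> listLocated(FakeStore store, String path) {
    String prefix = path.endsWith("/") ? path : path + "/";

    // Fast path: a single LIST, no HEAD requests at all.
    List<String> children = store.list(prefix);
    children.remove(prefix); // drop a directory marker object, if present
    if (!children.isEmpty()) {
      return children;
    }

    // Fallback: HEAD on the path (is it a file?) ...
    if (store.head(path)) {
      List<String> single = new ArrayList<>();
      single.add(path);
      return single;
    }
    // ... then HEAD on path + "/" (is it an empty directory marker?).
    if (store.head(prefix)) {
      return new ArrayList<>();
    }
    throw new UncheckedIOException(new FileNotFoundException(path));
  }

  public static void main(String[] args) {
    FakeStore store = new FakeStore("data/a.csv", "data/b.csv");
    System.out.println(listLocated(store, "data"));
    System.out.println("HEAD=" + store.headCalls + " LIST=" + store.listCalls);
  }
}
```

In this sketch the common case (a non-empty directory) costs one LIST and no HEADs; a file, an empty directory or a missing path costs one LIST plus one or two HEADs, so the extra requests move to the cases the MR callers are not expected to hit.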



