Posted to common-issues@hadoop.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2022/02/10 15:22:00 UTC
[jira] [Work logged] (HADOOP-13704) S3A getContentSummary() to move to listFiles(recursive) to count children; instrument use

     [ https://issues.apache.org/jira/browse/HADOOP-13704?focusedWorklogId=724563&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-724563 ]

ASF GitHub Bot logged work on HADOOP-13704:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 10/Feb/22 15:21
            Start Date: 10/Feb/22 15:21
    Worklog Time Spent: 10m 
      Work Description: ahmarsuhail opened a new pull request #3978:
URL: https://github.com/apache/hadoop/pull/3978


   ### Description of PR
   JIRA: https://issues.apache.org/jira/browse/HADOOP-13704
   
   This PR implements an optimised version of getContentSummary which uses the result from the listFiles iterator.
   
   Explanation of new `buildDirectorySet` method added:
   
   Since the listFiles operation can return the directory `a/b/c` as a single object, we need to recurse over the path `a/b/c` to ensure we have counted all directories. We do this by keeping two sets, dirSet (Set of all directories under the base path) and pathTraversed (Set of paths we have recursed over so far).
   
   For example, when iterating over the directory structure `basePath/a/b/c`, `basePath/a/b/d`, we first find all the directories in `basePath/a/b/c`. Once this is complete, the pathTraversed set holds `{basePath/a/b}` and dirSet holds `{basePath/a, basePath/a/b, basePath/a/b/c}`.
   
   Then, for `basePath/a/b/d`, we just add `basePath/a/b/d` to the dirSet and do no additional work, as the path `basePath/a/b` has already been traversed.
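   
   A minimal sketch of the two-set idea described above, not the code in this PR. It assumes plain `String` paths separated by `/` rather than Hadoop `Path` objects, and the names `DirectorySetSketch`, `addDirectory` and `parentOf` are hypothetical, chosen only for illustration:
   
   ```java
   import java.util.Set;
   
   // Illustrative only: String paths stand in for Hadoop Path objects.
   final class DirectorySetSketch {
   
       // Returns the parent of a '/'-separated path, or null if it has none.
       static String parentOf(String path) {
           int idx = path.lastIndexOf('/');
           return idx <= 0 ? null : path.substring(0, idx);
       }
   
       // Records dir and any ancestors below basePath that are not yet known.
       // Assumes dir lies strictly under basePath.
       static void addDirectory(String basePath, String dir,
                                Set<String> dirSet, Set<String> pathTraversed) {
           dirSet.add(dir);
           String start = parentOf(dir);
           String current = start;
           // Walk upwards until we reach basePath or a path already traversed.
           while (current != null
                   && !current.equals(basePath)
                   && !pathTraversed.contains(current)) {
               dirSet.add(current);
               current = parentOf(current);
           }
           // Remember the starting parent so siblings of dir skip the walk.
           if (start != null && !start.equals(basePath)) {
               pathTraversed.add(start);
           }
       }
   }
   ```
   
   Calling `addDirectory` with `basePath/a/b/c` and then `basePath/a/b/d` against two empty `HashSet`s reproduces the walk-through above: the second call only adds `basePath/a/b/d`, because `basePath/a/b` is already in pathTraversed.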
   
   The Jira ticket mentions that we should add some instrumentation to measure usage. There's already code that does this [here](https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java#L3256), and usage is tested in an integration test [here](https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/performance/ITestS3AMiscOperationCost.java#L144).
   
   ### How was this patch tested?
   
   Tested in eu-west-1 by running
   
   `mvn -Dparallel-tests -DtestsThreadCount=16 clean verify`
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Issue Time Tracking
-------------------

            Worklog Id:     (was: 724563)
    Remaining Estimate: 0h
            Time Spent: 10m

> S3A getContentSummary() to move to listFiles(recursive) to count children; instrument use
> -----------------------------------------------------------------------------------------
>
>                 Key: HADOOP-13704
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13704
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 2.8.0
>            Reporter: Steve Loughran
>            Priority: Minor
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Hive and a bit of Spark use {{getContentSummary()}} to get some summary stats of a filesystem. This is very expensive on S3A (and any other object store), especially as the base implementation does the recursive tree walk.
> Because of HADOOP-13208, we have a full enumeration of files under a path without directory costs...S3A can/should switch to this to speed up those places where the operation is called.
> Also
> * API call needs FS spec and contract tests
> * S3A could instrument invocation, so as to enable real-world popularity to be measured



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
For additional commands, e-mail: common-issues-help@hadoop.apache.org