Posted to common-issues@hadoop.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2017/03/18 00:51:42 UTC

[jira] [Commented] (HADOOP-13371) S3A globber to use bulk listObject call over recursive directory scan

    [ https://issues.apache.org/jira/browse/HADOOP-13371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15930952#comment-15930952 ] 

ASF GitHub Bot commented on HADOOP-13371:
-----------------------------------------

GitHub user kazuyukitanimura opened a pull request:

    https://github.com/apache/hadoop/pull/203

    HADOOP-13371. S3A globber to use bulk listObject call over recursive directory scan

    Hi @steveloughran 
    
    This pull request mitigates the issue described in [HADOOP-13371](https://issues.apache.org/jira/browse/HADOOP-13371).
    
    With this patch, the filter is applied before globbing rather than after it.
    
    Globbing large S3 buckets used to fail for me with OOM, because the globber kept every candidate path and filtered only at the end. This patch prunes unnecessary paths with the filter first. I have applied it to our production pipelines, and they run without problems.
    The patch should be applicable to branch-2.8 as well.
    
    Thanks in advance for reviewing this.
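
The memory effect described above can be illustrated with a minimal sketch. This is not the code in the pull request; it only contrasts buffering every candidate path before filtering with testing each path as it arrives. The class and method names (`EarlyFilterSketch`, `filterLast`, `filterFirst`, `candidates`) and the sample paths are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch: when the filter runs decides how many paths are held in memory.
class EarlyFilterSketch {

    // Filter last: buffer every candidate, then filter.
    // Returns the number of paths that had to be buffered.
    static int filterLast(Iterable<String> candidates, Predicate<String> filter, List<String> out) {
        List<String> buffered = new ArrayList<>();
        for (String path : candidates) {
            buffered.add(path);
        }
        for (String path : buffered) {
            if (filter.test(path)) {
                out.add(path);
            }
        }
        return buffered.size();
    }

    // Filter first: test each candidate as it arrives and retain only accepted paths.
    // Returns the number of paths retained (assumes `out` starts empty).
    static int filterFirst(Iterable<String> candidates, Predicate<String> filter, List<String> out) {
        for (String path : candidates) {
            if (filter.test(path)) {
                out.add(path);
            }
        }
        return out.size();
    }

    // Made-up candidate set: 10 directories with 100 files each, one in ten ending in ".gz".
    static List<String> candidates() {
        List<String> paths = new ArrayList<>();
        for (int day = 1; day <= 10; day++) {
            for (int part = 0; part < 100; part++) {
                String suffix = (part % 10 == 0) ? ".gz" : ".tmp";
                paths.add("logs/day=" + day + "/part-" + part + suffix);
            }
        }
        return paths;
    }

    public static void main(String[] args) {
        Predicate<String> filter = path -> path.endsWith(".gz");
        List<String> late = new ArrayList<>();
        List<String> early = new ArrayList<>();
        System.out.println("buffered when filtering last:  " + filterLast(candidates(), filter, late));
        System.out.println("retained when filtering first: " + filterFirst(candidates(), filter, early));
        System.out.println("same result: " + late.equals(early));
    }
}
```

Both variants return the same paths; only the number of paths held at once differs, which is what matters when the candidate set is a large bucket.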

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/bloomreach/hadoop trunk

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/hadoop/pull/203.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #203
    
----
commit 5d6b3e1ebb97cc11479db6c30b0a1a04986c4967
Author: kazu <ka...@bloomreach.com>
Date:   2017-03-18T00:24:41Z

    HADOOP-13371. S3A globber to use bulk listObject call over recursive directory scan

----


> S3A globber to use bulk listObject call over recursive directory scan
> ---------------------------------------------------------------------
>
>                 Key: HADOOP-13371
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13371
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs, fs/s3
>    Affects Versions: 2.8.0
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>
> HADOOP-13208 produces O(1) listing of directory trees in {{FileSystem.listStatus}} calls, but does nothing for {{FileSystem.globStatus()}}, which uses a completely different codepath: a selective recursive scan that pattern-matches as it descends, filtering out the paths which don't match. The cost is O(matching-directories) plus the cost of examining the files.
> It should be possible to do the glob status listing in S3A not through the filtered treewalk, but through a list + filter operation. This would be an O(files) lookup *before any filtering took place*.
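
A minimal sketch of the list + filter idea described above, not the S3A implementation. It assumes a flat in-memory list of object keys standing in for the result of a bulk listing call, and a simplified glob that supports only `*` and `?`, neither of which crosses a `/`. The class and method names (`FlatListGlob`, `listKeys`, `literalPrefix`, `toRegex`, `glob`) and the sample keys are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch: glob over a flat key listing instead of a recursive tree walk.
class FlatListGlob {

    // Stand-in for a bulk listing call: returns every key under the prefix.
    static List<String> listKeys(List<String> bucket, String prefix) {
        List<String> keys = new ArrayList<>();
        for (String key : bucket) {
            if (key.startsWith(prefix)) {
                keys.add(key);
            }
        }
        return keys;
    }

    // Longest directory prefix of the pattern that contains no wildcard.
    static String literalPrefix(String globPattern) {
        int i = 0;
        while (i < globPattern.length()
                && globPattern.charAt(i) != '*'
                && globPattern.charAt(i) != '?') {
            i++;
        }
        int slash = globPattern.substring(0, i).lastIndexOf('/');
        return globPattern.substring(0, slash + 1);
    }

    // Simplified glob: '*' and '?' match within a single path segment only.
    static Pattern toRegex(String globPattern) {
        StringBuilder sb = new StringBuilder();
        for (char c : globPattern.toCharArray()) {
            if (c == '*') {
                sb.append("[^/]*");
            } else if (c == '?') {
                sb.append("[^/]");
            } else {
                sb.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(sb.toString());
    }

    // One listing under the literal prefix, then a filter over the flat result.
    static List<String> glob(List<String> bucket, String globPattern) {
        Pattern pattern = toRegex(globPattern);
        List<String> matches = new ArrayList<>();
        for (String key : listKeys(bucket, literalPrefix(globPattern))) {
            if (pattern.matcher(key).matches()) {
                matches.add(key);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> bucket = Arrays.asList(
                "data/2016/12/part-0000",
                "data/2017/03/_SUCCESS",
                "data/2017/03/part-0000",
                "data/2017/03/tmp/part-0002",
                "data/2017/04/part-0001",
                "logs/2017/03/part-0000");
        System.out.println(glob(bucket, "data/2017/*/part-*"));
    }
}
```

In this sketch the listing is issued once for the longest wildcard-free prefix, so the work scales with the number of keys under that prefix rather than with the number of directories visited.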



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
