Posted to common-issues@hadoop.apache.org by "Steve Loughran (Jira)" <ji...@apache.org> on 2019/10/29 17:43:00 UTC
[jira] [Commented] (HADOOP-16673) Add filter parameter to FileSystem>>listFiles
[ https://issues.apache.org/jira/browse/HADOOP-16673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16962267#comment-16962267 ]
Steve Loughran commented on HADOOP-16673:
-----------------------------------------
This wouldn't work. The path filtering you have described will only work efficiently if you are actually doing a tree walk. For S3 we issue LIST path/ commands and get pages of results back. There is no way we could do this filtering except by doing exactly what you are trying to do yourself: list everything, then discard the stuff that is not needed. It won't be any more efficient; it will only set unrealistic expectations on performance.
Have a look at {{org.apache.hadoop.fs.s3a.Listing.ProvidedFileStatusIterator}} to see what to do. You just need to write an iterator which wraps the current one and does the filtering there; you can then iterate over that wrapper.
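A minimal sketch of such a wrapping iterator, under stated assumptions: to stay self-contained it declares a small stand-in interface with the same shape as Hadoop's RemoteIterator (hasNext() and next(), both allowed to throw IOException), and it filters plain path strings rather than LocatedFileStatus objects. The names FilteringRemoteIterator, fromIterator, isVisible and visibleOnly are hypothetical and are not part of Hadoop. It still lists everything and discards client-side, so it is no cheaper than filtering the results yourself.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Predicate;

class FilteringListingSketch {

  /** Stand-in with the same shape as org.apache.hadoop.fs.RemoteIterator. */
  interface RemoteIterator<E> {
    boolean hasNext() throws IOException;
    E next() throws IOException;
  }

  /** Hypothetical wrapper: yields only the elements the filter accepts. */
  static final class FilteringRemoteIterator<E> implements RemoteIterator<E> {
    private final RemoteIterator<E> source;
    private final Predicate<? super E> filter;
    private E lookahead;          // next accepted element, once fetched
    private boolean hasLookahead; // true while lookahead is pending

    FilteringRemoteIterator(RemoteIterator<E> source, Predicate<? super E> filter) {
      this.source = source;
      this.filter = filter;
    }

    @Override
    public boolean hasNext() throws IOException {
      // Pull from the wrapped iterator until an element passes the filter.
      while (!hasLookahead && source.hasNext()) {
        E candidate = source.next();
        if (filter.test(candidate)) {
          lookahead = candidate;
          hasLookahead = true;
        }
      }
      return hasLookahead;
    }

    @Override
    public E next() throws IOException {
      if (!hasNext()) {
        throw new NoSuchElementException();
      }
      E result = lookahead;
      lookahead = null;
      hasLookahead = false;
      return result;
    }
  }

  /** Adapts an in-memory iterator so the sketch runs without a filesystem. */
  static <E> RemoteIterator<E> fromIterator(Iterator<E> it) {
    return new RemoteIterator<E>() {
      @Override public boolean hasNext() { return it.hasNext(); }
      @Override public E next() { return it.next(); }
    };
  }

  /** True if no element of a slash-separated path starts with '_' or '.'. */
  static boolean isVisible(String path) {
    for (String name : path.split("/")) {
      if (name.startsWith("_") || name.startsWith(".")) {
        return false;
      }
    }
    return true;
  }

  /** Drains the filtered listing into a list, rethrowing IOException unchecked. */
  static List<String> visibleOnly(List<String> paths) {
    RemoteIterator<String> filtered = new FilteringRemoteIterator<>(
        fromIterator(paths.iterator()), FilteringListingSketch::isVisible);
    List<String> kept = new ArrayList<>();
    try {
      while (filtered.hasNext()) {
        kept.add(filtered.next());
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return kept;
  }
}
```

With the real API, the same wrapper would presumably take the iterator returned by fs.listFiles(path, true) and a predicate over each LocatedFileStatus, for example one that checks every element of status.getPath().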
Closing as a WONTFIX. Sorry
I'll look at the Hive problem.
+[~gabor.bota]
> Add filter parameter to FileSystem>>listFiles
> ---------------------------------------------
>
> Key: HADOOP-16673
> URL: https://issues.apache.org/jira/browse/HADOOP-16673
> Project: Hadoop Common
> Issue Type: Improvement
> Components: fs
> Reporter: Attila Magyar
> Priority: Major
>
> Currently, getting a filtered list of files in a directory recursively is clumsy because the filtering has to happen afterwards, on the result list.
> Imagine we want to list all non-hidden files recursively.
> The non-hidden files filter is defined as:
> {code:java}
> !name.startsWith("_") && !name.startsWith(".") {code}
>
> Then we can do:
>
> {code:java}
> RemoteIterator<LocatedFileStatus> remoteIterator = fs.listFiles(path, /*recursive*/ true);
> while (remoteIterator.hasNext()) {
>   LocatedFileStatus each = remoteIterator.next();
>   if (filter applies to all of the path elements in each) {
>     result.add(each);
>   }
> }
>
> {code}
>
> For example each of these paths should be skipped:
> * /.a/b/c
> * /a/.b/c
> * /a/b/.c/
> It would be a lot better to have a filter parameter on listFiles. This is needed to solve HIVE-22411 effectively.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)