You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Vihang Karajgaonkar (Code Review)" <ge...@cloudera.org> on 2019/06/20 21:55:01 UTC

[Impala-ASF-CR] IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories

Vihang Karajgaonkar has posted comments on this change. ( http://gerrit.cloudera.org:8080/13665 )

Change subject: IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories
......................................................................


Patch Set 2:

(9 comments)

http://gerrit.cloudera.org:8080/#/c/13665/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/13665/3//COMMIT_MSG@7
PS3, Line 7: IMPALA-8663 : [WIP] FileMetadataLoader should skip listing files in hidden and tmp directories
> nit: wrap
Done


http://gerrit.cloudera.org:8080/#/c/13665/3/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java
File fe/src/main/java/org/apache/impala/common/FileSystemUtil.java:

http://gerrit.cloudera.org:8080/#/c/13665/3/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java@549
PS3, Line 549:    if (recursive) {
             :         // The Hadoop FileSystem API doesn't provide a recursive listStatus call that
             :         // doesn't also fetch block locations, and fetching block locations is expensive.
             :         // Here, our caller specifically doesn't need block locations, so we don't want to
             :         // call the expensive 'listFiles' call on HDFS. Instead, we need to "manually"
             :         // recursively call FileSystem.listStatusIterator().
             :         //
             :         // Note that this "manual" recursion is not actually any slower than the recursion
             :         // provided by the HDFS 'listFiles(recursive=true)' API, since the HDFS wire
             :         // protocol doesn't provide any such recursive support anyway. In other words,
             :         // the API that looks like a single recursive call is just as bad as what we're
             :         // doing here.
             :         //
             :         // However, S3 actually implements 'listFiles(recursive=true)' with a faster path
             :         // which natively recurses. In that case, it's quite preferable to use 'listFiles'
             :         // even though it returns LocatedFileStatus objects with "fake" blocks which we
             :         // will ignore.
             :         if (isS3AFileSystem(fs)) {
             :           return listFiles(fs, p, recursive);
             :         }
             : 
             :         return new RecursingIteratorWithFilter(fs, p, true);
             :       }
             : 
             :       return new RecursingIteratorWithFilter(fs, p, fal
> I think this is equivalent to...
yeah, makes sense. Changed.


http://gerrit.cloudera.org:8080/#/c/13665/3/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java@581
PS3, Line 581: erface
> nit: used ?
Done


http://gerrit.cloudera.org:8080/#/c/13665/3/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java@585
PS3, Line 585: 
> nit: FILTER? (predicate is a java construct)
Done


http://gerrit.cloudera.org:8080/#/c/13665/3/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java@642
PS3, Line 642: * certain files/directories. This is a implementation of
> This is always used, no? Hard code it?
I was hoping to make this class generic enough to take in any predicate. But for now, probably makes sense to hard code.


http://gerrit.cloudera.org:8080/#/c/13665/3/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java@643
PS3, Line 643:    * {@link org.apache.hadoop.fs.RemoteIterator}
> Document this?
Done


http://gerrit.cloudera.org:8080/#/c/13665/3/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java@658
PS3, Line 658:     private final boolean useListStatus_;
> line too long (94 > 90)
Done


http://gerrit.cloudera.org:8080/#/c/13665/2/fe/src/test/java/org/apache/impala/catalog/FileMetadataLoaderTest.java
File fe/src/test/java/org/apache/impala/catalog/FileMetadataLoaderTest.java:

http://gerrit.cloudera.org:8080/#/c/13665/2/fe/src/test/java/org/apache/impala/catalog/FileMetadataLoaderTest.java@100
PS2, Line 100: dstFs.deleteOnExit(tmpTestPath);
> Are you sure?
The FileSystem#close is called when the JVM exits. See this, https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileSystem.java#L1618

given this discussion, its probably clearer to use try with resources to make it explicit to the reader. Made the change accordingly


http://gerrit.cloudera.org:8080/#/c/13665/2/fe/src/test/java/org/apache/impala/catalog/FileMetadataLoaderTest.java@112
PS2, Line 112: 24
> not sure I follow. I think what were asserting here is that the number of l
Isn't this approach assuming the load() method itself is not buggy? Eg, in the suggested approach test will pass if load() method is hardcoded to return a fixed number of files (my point being, a unrelated bug in FileMetadataLoader will cause a false positive in the suggested approach)

In my opinion, comparing it to a known hard-coded number of files (or a trusted third party library for results) is a better source of truth to compare against.



-- 
To view, visit http://gerrit.cloudera.org:8080/13665
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c4a22908304fe9e377d77d6c18d401c3f3294aa
Gerrit-Change-Number: 13665
Gerrit-PatchSet: 2
Gerrit-Owner: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Comment-Date: Thu, 20 Jun 2019 21:55:01 +0000
Gerrit-HasComments: Yes