You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Vihang Karajgaonkar (Code Review)" <ge...@cloudera.org> on 2019/06/18 01:01:20 UTC

[Impala-ASF-CR] IMPALA-8663 : [WIP] FileMetadataLoader should skip listing files in hidden and tmp directories

Vihang Karajgaonkar has uploaded this change for review. ( http://gerrit.cloudera.org:8080/13665


Change subject: IMPALA-8663 : [WIP] FileMetadataLoader should skip listing files in hidden and tmp directories
......................................................................

IMPALA-8663 : [WIP] FileMetadataLoader should skip listing files in hidden and tmp directories

The FileMetadataLoader is used to load the file information in when the
table is loaded. By default, it lists all the files in the
table/partition directory. Currently, it only skips the filenames which
are invalid (hidden files and ones starting with "_" etc). However, it
does not skip the directories which are temporary or hidden. In case of
Hive when data is inserted into a table, it creates a temporary staging
directory which is a hidden directory under the table location. When the
insert in hive is completed, such staging directories are removed. But
if there is a refresh called during that time, FileMetadataLoader will
add the files in the staging directory as well. Not only this could
cause temporary invalid results but it causes table to go in a bad state
when these temporary directories are removed. The only work-around in
such a case to issue a refresh on the table again.

This patch adds logic in the filemetadataloader to ignore such temporary
staging directories. Unfortunately, hadoop does not provide a API which
can recursively list files in a directory and skip certain directories.
This patch addes this logic of filtering into existing RecursingIterator
in FileSystemUtil.

Testing:
1. Added a new test in FilemetadataloaderTest and modified existing ones
in AcidUtils
2. Ran concurrent inserts from Hive while issuing refresh in a loop on
Impala side. Earlier this would cause the table to go into a bad state.
Now, it works fine for the staging directories. However, still need to
figure out if we should skip listing _copy_ files which Hive creates
when the insert is in progress

Change-Id: I2c4a22908304fe9e377d77d6c18d401c3f3294aa
---
M fe/src/main/java/org/apache/impala/common/FileSystemUtil.java
M fe/src/test/java/org/apache/impala/catalog/FileMetadataLoaderTest.java
M fe/src/test/java/org/apache/impala/util/AcidUtilsTest.java
3 files changed, 103 insertions(+), 17 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/65/13665/2
-- 
To view, visit http://gerrit.cloudera.org:8080/13665
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I2c4a22908304fe9e377d77d6c18d401c3f3294aa
Gerrit-Change-Number: 13665
Gerrit-PatchSet: 2
Gerrit-Owner: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[Impala-ASF-CR] IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories

Posted by "Vihang Karajgaonkar (Code Review)" <ge...@cloudera.org>.
Vihang Karajgaonkar has uploaded a new patch set (#7). ( http://gerrit.cloudera.org:8080/13665 )

Change subject: IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories
......................................................................

IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories

The FileMetadataLoader is used to load the file information in when the
table is loaded. By default, it lists all the files in the
table/partition directory. Currently, it only skips the filenames which
are invalid (hidden files and ones starting with "_" etc). However, it
does not skip the directories which are temporary or hidden. In case of
Hive when data is inserted into a table, it creates a temporary staging
directory which is a hidden directory under the table location. When the
insert in hive is completed, such staging directories are removed. But
if there is a refresh called during that time, FileMetadataLoader will
add the files in the staging directory as well. Not only this could
cause temporary invalid results but it causes table to go in a bad state
when these temporary directories are removed. The only work-around in
such a case to issue a refresh on the table again.

This patch adds logic in the filemetadataloader to ignore such temporary
staging directories. Unfortunately, hadoop does not provide a API which
can recursively list files in a directory and skip certain directories.
This patch addes this logic of filtering into existing RecursingIterator
in FileSystemUtil.

Also, the existing code to recover partitions implements its own
recursion logic which includes path validation. This already skips such
hidden directories since they do not conform to the partition spec. The
patch does a minor modification to this method by directly calling the
listStatusIterator instead of going through FileSystemUtil which uses
the filtering remote iterator now.

Testing:
1. Added a new test in FilemetadataloaderTest and modified existing ones
in AcidUtils. Modified existing test_recursive_listing.py
2. Ran concurrent inserts from Hive while issuing refresh in a loop on
Impala side. Earlier this would cause the table to go into a bad state.
Now, it works fine for the staging directories. It still runs into a
FileNotFoundException from the impalad when there are insert overwrite
statements in Hive

Change-Id: I2c4a22908304fe9e377d77d6c18d401c3f3294aa
---
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/common/FileSystemUtil.java
M fe/src/test/java/org/apache/impala/catalog/FileMetadataLoaderTest.java
M fe/src/test/java/org/apache/impala/util/AcidUtilsTest.java
M tests/metadata/test_recursive_listing.py
5 files changed, 146 insertions(+), 49 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/65/13665/7
-- 
To view, visit http://gerrit.cloudera.org:8080/13665
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2c4a22908304fe9e377d77d6c18d401c3f3294aa
Gerrit-Change-Number: 13665
Gerrit-PatchSet: 7
Gerrit-Owner: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Vihang Karajgaonkar <vi...@cloudera.com>

[Impala-ASF-CR] IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13665 )

Change subject: IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories
......................................................................


Patch Set 8:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/3746/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/13665
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c4a22908304fe9e377d77d6c18d401c3f3294aa
Gerrit-Change-Number: 13665
Gerrit-PatchSet: 8
Gerrit-Owner: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Comment-Date: Tue, 25 Jun 2019 23:35:12 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories

Posted by "Vihang Karajgaonkar (Code Review)" <ge...@cloudera.org>.
Vihang Karajgaonkar has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/13665 )

Change subject: IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories
......................................................................

IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories

The FileMetadataLoader is used to load the file information in when the
table is loaded. By default, it lists all the files in the
table/partition directory. Currently, it only skips the filenames which
are invalid (hidden files and ones starting with "_" etc). However, it
does not skip the directories which are temporary or hidden. In case of
Hive when data is inserted into a table, it creates a temporary staging
directory which is a hidden directory under the table location. When the
insert in hive is completed, such staging directories are removed. But
if there is a refresh called during that time, FileMetadataLoader will
add the files in the staging directory as well. Not only this could
cause temporary invalid results but it causes table to go in a bad state
when these temporary directories are removed. The only work-around in
such a case to issue a refresh on the table again.

This patch adds logic in the filemetadataloader to ignore such temporary
staging directories. Unfortunately, hadoop does not provide a API which
can recursively list files in a directory and skip certain directories.
This patch adds a new FilterIterator which wraps around existing
listFiles, listStatus and RecursingIterator to skip the hidden
directories from the listing result.

Also, the existing code to recover partitions implements its own
recursion logic which includes path validation. This already skips such
hidden directories since they do not conform to the partition spec. The
patch does a minor modification to this method by directly calling the
listStatusIterator instead of going through FileSystemUtil#listStatus
whiche uses the filtering remote iterator now.

Testing:
1. Added a new tests as well as modified existing ones which were
related to cover interesting cases.
2. Ran concurrent inserts from Hive while issuing refresh in a loop on
Impala side. Earlier this would cause the table to go into a bad state.
Now, it works fine for the staging directories. It still runs into a
FileNotFoundException from the impalad when there are insert overwrite
statements in Hive

Change-Id: I2c4a22908304fe9e377d77d6c18d401c3f3294aa
Reviewed-on: http://gerrit.cloudera.org:8080/13665
Tested-by: Impala Public Jenkins <im...@cloudera.com>
Reviewed-by: Vihang Karajgaonkar <vi...@cloudera.com>
---
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/common/FileSystemUtil.java
M fe/src/test/java/org/apache/impala/catalog/FileMetadataLoaderTest.java
A fe/src/test/java/org/apache/impala/common/FileSystemUtilTest.java
M fe/src/test/java/org/apache/impala/util/AcidUtilsTest.java
M tests/metadata/test_recursive_listing.py
6 files changed, 236 insertions(+), 9 deletions(-)

Approvals:
  Impala Public Jenkins: Verified
  Vihang Karajgaonkar: Looks good to me, approved

-- 
To view, visit http://gerrit.cloudera.org:8080/13665
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I2c4a22908304fe9e377d77d6c18d401c3f3294aa
Gerrit-Change-Number: 13665
Gerrit-PatchSet: 14
Gerrit-Owner: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Vihang Karajgaonkar <vi...@cloudera.com>

[Impala-ASF-CR] IMPALA-8663 : FileMetadataLoader should skip listing files in hidden and tmp directories

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13665 )

Change subject: IMPALA-8663 : FileMetadataLoader should skip listing files in hidden and tmp directories
......................................................................


Patch Set 3:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/3686/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/13665
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c4a22908304fe9e377d77d6c18d401c3f3294aa
Gerrit-Change-Number: 13665
Gerrit-PatchSet: 3
Gerrit-Owner: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Comment-Date: Wed, 19 Jun 2019 22:23:11 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13665 )

Change subject: IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories
......................................................................


Patch Set 5:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/3709/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/13665
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c4a22908304fe9e377d77d6c18d401c3f3294aa
Gerrit-Change-Number: 13665
Gerrit-PatchSet: 5
Gerrit-Owner: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Comment-Date: Fri, 21 Jun 2019 07:08:35 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories

Posted by "Vihang Karajgaonkar (Code Review)" <ge...@cloudera.org>.
Vihang Karajgaonkar has posted comments on this change. ( http://gerrit.cloudera.org:8080/13665 )

Change subject: IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories
......................................................................


Patch Set 7:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/13665/6/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java
File fe/src/main/java/org/apache/impala/common/FileSystemUtil.java:

http://gerrit.cloudera.org:8080/#/c/13665/6/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java@545
PS6, Line 545:    * If 'recursive' is true, all underlying files (except which are
> this javadoc should be updated to indicate that tmp/hidden files and direct
Done


http://gerrit.cloudera.org:8080/#/c/13665/6/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java@566
PS6, Line 566:       // even though it returns Locate
> It looks like this will break the above-mentioned optimization. In the case
hmm.. yeah thats true. Thanks for catching that. I think this method is a bit confusing or rather the name of the method is confusing. In my opinion its cleaner if listStatus does just what it says and the caller invokes listStatus v/s listFiles dependending of if its S3 file system or not. However, I can see why it is done this way so that we don't inadvertently introduce a inefficient code path if we don't know about this S3 perf optimization.


http://gerrit.cloudera.org:8080/#/c/13665/6/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java@576
PS6, Line 576: 
> The naming and this comment are a bit opposite from each other. The predica
agreed. Changed this to a simple static method called isIgnoredDirectory


http://gerrit.cloudera.org:8080/#/c/13665/6/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java@580
PS6, Line 580:    * which tools like Hive create in the table/partition directories when a query is
> is this ever used as a Predicate? It seems like in this file it's only used
Done


http://gerrit.cloudera.org:8080/#/c/13665/6/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java@592
PS6, Line 592:       boolean recursive) throws IOException {
> nit: add an inline comment like this: /*useListStatus=*/false
Changed the input arguments to use a Enum for get the LISTING_TYPE


http://gerrit.cloudera.org:8080/#/c/13665/6/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java@633
PS6, Line 633: 
> nit: weird spacing that's inconsistent with most of the java code in impala
Done


http://gerrit.cloudera.org:8080/#/c/13665/6/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java@681
PS6, Line 681:   // state)
             :       while (curFile_ == null) {
             :         if (curIter_.hasNext()) {
             :           // Consume the next file or directory from the current iterator.
             :           handleFileStat(curIter_.next());
             :         } else if (!iters_.empty()) {
             :           // We ran out of entries in the current one, but we might still have
             :        
> IMO this javadoc is a bit redundant with the implementation and maybe not n
agreed. removed it


http://gerrit.cloudera.org:8080/#/c/13665/6/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java@694
PS6, Line 694: 
> this should say "Do not", right?
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/13665
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c4a22908304fe9e377d77d6c18d401c3f3294aa
Gerrit-Change-Number: 13665
Gerrit-PatchSet: 7
Gerrit-Owner: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Comment-Date: Tue, 25 Jun 2019 22:50:40 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8663 : [WIP] FileMetadataLoader should skip listing files in hidden and tmp directories

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13665 )

Change subject: IMPALA-8663 : [WIP] FileMetadataLoader should skip listing files in hidden and tmp directories
......................................................................


Patch Set 2:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/3650/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/13665
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c4a22908304fe9e377d77d6c18d401c3f3294aa
Gerrit-Change-Number: 13665
Gerrit-PatchSet: 2
Gerrit-Owner: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Tue, 18 Jun 2019 01:41:13 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13665 )

Change subject: IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories
......................................................................


Patch Set 8:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/4544/ DRY_RUN=true


-- 
To view, visit http://gerrit.cloudera.org:8080/13665
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c4a22908304fe9e377d77d6c18d401c3f3294aa
Gerrit-Change-Number: 13665
Gerrit-PatchSet: 8
Gerrit-Owner: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Comment-Date: Tue, 25 Jun 2019 22:58:11 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories

Posted by "Vihang Karajgaonkar (Code Review)" <ge...@cloudera.org>.
Vihang Karajgaonkar has uploaded a new patch set (#5). ( http://gerrit.cloudera.org:8080/13665 )

Change subject: IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories
......................................................................

IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories

The FileMetadataLoader is used to load the file information in when the
table is loaded. By default, it lists all the files in the
table/partition directory. Currently, it only skips the filenames which
are invalid (hidden files and ones starting with "_" etc). However, it
does not skip the directories which are temporary or hidden. In case of
Hive when data is inserted into a table, it creates a temporary staging
directory which is a hidden directory under the table location. When the
insert in hive is completed, such staging directories are removed. But
if there is a refresh called during that time, FileMetadataLoader will
add the files in the staging directory as well. Not only this could
cause temporary invalid results but it causes table to go in a bad state
when these temporary directories are removed. The only work-around in
such a case to issue a refresh on the table again.

This patch adds logic in the filemetadataloader to ignore such temporary
staging directories. Unfortunately, hadoop does not provide a API which
can recursively list files in a directory and skip certain directories.
This patch addes this logic of filtering into existing RecursingIterator
in FileSystemUtil.

Testing:
1. Added a new test in FilemetadataloaderTest and modified existing ones
in AcidUtils
2. Ran concurrent inserts from Hive while issuing refresh in a loop on
Impala side. Earlier this would cause the table to go into a bad state.
Now, it works fine for the staging directories. It still runs into a
FileNotFoundException from the impalad when there are insert overwrite
statements in Hive

Change-Id: I2c4a22908304fe9e377d77d6c18d401c3f3294aa
---
M fe/src/main/java/org/apache/impala/common/FileSystemUtil.java
M fe/src/test/java/org/apache/impala/catalog/FileMetadataLoaderTest.java
M fe/src/test/java/org/apache/impala/util/AcidUtilsTest.java
3 files changed, 96 insertions(+), 25 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/65/13665/5
-- 
To view, visit http://gerrit.cloudera.org:8080/13665
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2c4a22908304fe9e377d77d6c18d401c3f3294aa
Gerrit-Change-Number: 13665
Gerrit-PatchSet: 5
Gerrit-Owner: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Vihang Karajgaonkar <vi...@cloudera.com>

[Impala-ASF-CR] IMPALA-8663 : [WIP] FileMetadataLoader should skip listing files in hidden and tmp directories

Posted by "Bharath Vissapragada (Code Review)" <ge...@cloudera.org>.
Bharath Vissapragada has posted comments on this change. ( http://gerrit.cloudera.org:8080/13665 )

Change subject: IMPALA-8663 : [WIP] FileMetadataLoader should skip listing files in hidden and tmp directories
......................................................................


Patch Set 2:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/13665/2/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java
File fe/src/main/java/org/apache/impala/common/FileSystemUtil.java:

http://gerrit.cloudera.org:8080/#/c/13665/2/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java@580
PS2, Line 580: /**
             :    * Filter interface for a FileStatus. Useful for filtering out disinteresting
             :    * FileStatuses from a given list
             :    */
             :   public interface FileStatusFilter {
             : 
             :     /**
             :      * Given a FileStatus returns if the filter accepts it or not
             :      * @param fileStatus
             :      * @return true if the filter accepts it, else false
             :      */
             :     boolean accept(FileStatus fileStatus);
             :   }
             : 
             :   /**
             :    * FileStatusFilter implementation which is
             :    * useful to filter out hidden directories and temporary staging directories
             :    * which tools like Hive create in the table/partition directories when a query is
             :    * inserting data into them.
             :    */
             :   public static final FileStatusFilter HIDDEN_DIRECTORIES_FILTER = fileStatus -> {
             :     if (!fileStatus.isDirectory()) return false;
             :     String filename = fileStatus.getPath().getName();
             :     return filename.startsWith(".") || filename.startsWith("_tmp.");
             :   };
You could also implement a Predicate<FileStatus> overriding test().


http://gerrit.cloudera.org:8080/#/c/13665/2/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java@656
PS2, Line 656:     private final boolean isRecursive_;
I think the refactor is confusing. RecursiingIterator can also have isRecursive_ = false? Can we clean it up a bit?


http://gerrit.cloudera.org:8080/#/c/13665/2/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java@716
PS2, Line 716: { return; }
nit: no braces


http://gerrit.cloudera.org:8080/#/c/13665/2/fe/src/test/java/org/apache/impala/catalog/FileMetadataLoaderTest.java
File fe/src/test/java/org/apache/impala/catalog/FileMetadataLoaderTest.java:

http://gerrit.cloudera.org:8080/#/c/13665/2/fe/src/test/java/org/apache/impala/catalog/FileMetadataLoaderTest.java@93
PS2, Line 93: hdfs://localhost:20500/
Do we need these qualifiers? I think the Configuration should resolve it to an appropriate file system


http://gerrit.cloudera.org:8080/#/c/13665/2/fe/src/test/java/org/apache/impala/catalog/FileMetadataLoaderTest.java@100
PS2, Line 100: dstFs.deleteOnExit(tmpTestPath);
Do you need wrap the fs in try-with-resources for this to work?


http://gerrit.cloudera.org:8080/#/c/13665/2/fe/src/test/java/org/apache/impala/catalog/FileMetadataLoaderTest.java@112
PS2, Line 112: 24
Instead, we could try loading sourcePath and substitute the values?



-- 
To view, visit http://gerrit.cloudera.org:8080/13665
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c4a22908304fe9e377d77d6c18d401c3f3294aa
Gerrit-Change-Number: 13665
Gerrit-PatchSet: 2
Gerrit-Owner: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Wed, 19 Jun 2019 01:41:44 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13665 )

Change subject: IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories
......................................................................


Patch Set 11:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/4562/ DRY_RUN=true


-- 
To view, visit http://gerrit.cloudera.org:8080/13665
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c4a22908304fe9e377d77d6c18d401c3f3294aa
Gerrit-Change-Number: 13665
Gerrit-PatchSet: 11
Gerrit-Owner: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Comment-Date: Thu, 27 Jun 2019 07:52:42 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories

Posted by "Vihang Karajgaonkar (Code Review)" <ge...@cloudera.org>.
Vihang Karajgaonkar has posted comments on this change. ( http://gerrit.cloudera.org:8080/13665 )

Change subject: IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories
......................................................................


Patch Set 8:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/13665/8/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java
File fe/src/main/java/org/apache/impala/common/FileSystemUtil.java:

http://gerrit.cloudera.org:8080/#/c/13665/8/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java@541
PS8, Line 541: temporary directories
> the code now seems to also skip any files that match this pattern, not just
The latest patchset changed it so that it only skips temp directories now. I saw code in FileMetadataLoader which keeps stats related to hidden files. Didn't want to mess up those, since temp/hidden files in valid directories seem intentional and long-lived, while temp directories most are transient in most cases.


http://gerrit.cloudera.org:8080/#/c/13665/8/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java@545
PS8, Line 545: all underlying files (except which are
             :    * in the ignored directories) 
> does this mean we no longer yield directories, and only yield files? does t
Actually, I added this line in the doc since I noticed that the RecursingIterator yields files not directories even without the patch. I thought that was important and hence added it here. These semantics are not changed after the patch.


http://gerrit.cloudera.org:8080/#/c/13665/8/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java@568
PS8, Line 568: isS3AFileSystem(p)
> why'd you change from isS3AFileSystem(fs) to isS3AFileSystem(p)? In the cas
this was unintentional change. Didn't realize there are two isS3AFileSystem methods and I used the wrong one.


http://gerrit.cloudera.org:8080/#/c/13665/8/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java@568
PS8, Line 568: isS3AFileSystem
> perhaps we can add a @VisibleForTesting way we can make this path get used 
Sure. Do you have any suggestions on how can we do this? static methods are hard to mock. Are you thinking of using a test-only flag?


http://gerrit.cloudera.org:8080/#/c/13665/8/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java@626
PS8, Line 626: LISTING_TYPE
> style nit: enums should be named like classes (ListingType)
Done


http://gerrit.cloudera.org:8080/#/c/13665/8/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java@699
PS8, Line 699:       // if the current file is on a ignored path return early
             :       if (isIgnoredPath(fileStatus)) return;
> how does this prevent recursion into tmp dirs in the recursive listFiles ca
yeah, I realized that when one of my own tests failed on the last patch set. The latest patch set addresses this.



-- 
To view, visit http://gerrit.cloudera.org:8080/13665
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c4a22908304fe9e377d77d6c18d401c3f3294aa
Gerrit-Change-Number: 13665
Gerrit-PatchSet: 8
Gerrit-Owner: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Comment-Date: Thu, 27 Jun 2019 07:40:58 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories

Posted by "Vihang Karajgaonkar (Code Review)" <ge...@cloudera.org>.
Vihang Karajgaonkar has posted comments on this change. ( http://gerrit.cloudera.org:8080/13665 )

Change subject: IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories
......................................................................


Patch Set 8:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/13665/8/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java
File fe/src/main/java/org/apache/impala/common/FileSystemUtil.java:

http://gerrit.cloudera.org:8080/#/c/13665/8/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java@545
PS8, Line 545: all underlying files (except which are
             :    * in the ignored directories) 
> hrm, were the semantics already broken then? See the unit test AcidUtilsTes
yeah, it looks like its broken currently for this case. I will file a JIRA. The AcidUtils test works because it works off the List<FileStatus> not the ones which are actually returned by the FileMetadataLoader.

Specifically, if the table is transactional we cannot rely on using listFiles API since it will skip empty directories (Or we should implement a version of it for own purposes). Not sure how to handle the S3 recursive listFiles case though.

Created https://issues.apache.org/jira/browse/IMPALA-8739 as suggested



-- 
To view, visit http://gerrit.cloudera.org:8080/13665
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c4a22908304fe9e377d77d6c18d401c3f3294aa
Gerrit-Change-Number: 13665
Gerrit-PatchSet: 8
Gerrit-Owner: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Comment-Date: Wed, 03 Jul 2019 20:45:32 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories

Posted by "Vihang Karajgaonkar (Code Review)" <ge...@cloudera.org>.
Vihang Karajgaonkar has posted comments on this change. ( http://gerrit.cloudera.org:8080/13665 )

Change subject: IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories
......................................................................


Patch Set 2:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/13665/4/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java
File fe/src/main/java/org/apache/impala/common/FileSystemUtil.java:

http://gerrit.cloudera.org:8080/#/c/13665/4/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java@566
PS4, Line 566:         if (isS3AFileSystem(fs)) {
> nit: merge the branches?
Done


http://gerrit.cloudera.org:8080/#/c/13665/4/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java@627
PS4, Line 627: 
> s/filter? (FileStatusFilter isn't a thing anymore I guess.)
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/13665
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c4a22908304fe9e377d77d6c18d401c3f3294aa
Gerrit-Change-Number: 13665
Gerrit-PatchSet: 2
Gerrit-Owner: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Comment-Date: Fri, 21 Jun 2019 06:27:13 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8663 : FileMetadataLoader should skip listing files in hidden and tmp directories

Posted by "Vihang Karajgaonkar (Code Review)" <ge...@cloudera.org>.
Vihang Karajgaonkar has posted comments on this change. ( http://gerrit.cloudera.org:8080/13665 )

Change subject: IMPALA-8663 : FileMetadataLoader should skip listing files in hidden and tmp directories
......................................................................


Patch Set 3:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/13665/2/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java
File fe/src/main/java/org/apache/impala/common/FileSystemUtil.java:

http://gerrit.cloudera.org:8080/#/c/13665/2/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java@580
PS2, Line 580: /**
             :    * Predicate useful to filter out hidden directories and temporary staging directories
             :    * which tools like Hive create in the table/partition directories when a query is
             :    * inserting data into them.
             :    */
             :   public static final Predicate <FileStatus> HIDDEN_DIRECTORIES_PREDICATE = fileStatus -> {
             :     if (!fileStatus.isDirectory()) return false;
             :     String filename = fileStatus.getPath().getName();
             :     return filename.startsWith(".") || filename.startsWith("_tmp.");
             :   };
             : 
             :   /**
             :    * Wrapper around FileSystem.listFiles(), similar to the listStatus() wrapper above.
             :    */
             :   public static RemoteIterator<? extends FileStatus> listFiles(FileSystem fs, Path p,
             :       boolean recursive) throws IOException {
             :     try {
             :       return new RemoteIteratorWithFilter(fs, p, recursive, false);
             :     } catch (FileNotFoundException e) {
             :       if (LOG.isWarnEnabled()) LOG.warn("Path does not exist: " + p.toString(), e);
             :       return null;
             :     }
             :   }
             : 
             :   /*
> You could also implement a Predicate<FileStatus> overriding test().
Done


http://gerrit.cloudera.org:8080/#/c/13665/2/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java@656
PS2, Line 656:     private RemoteIteratorWithFilter(FileSystem fs, Path startPath,
> I think the refactor is confusing. RecursiingIterator can also have isRecur
Renamed it to RemoteIteratorWithFilter which hopefully is clearer


http://gerrit.cloudera.org:8080/#/c/13665/2/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java@716
PS2, Line 716: 
> nit: no braces
Done


http://gerrit.cloudera.org:8080/#/c/13665/2/fe/src/test/java/org/apache/impala/catalog/FileMetadataLoaderTest.java
File fe/src/test/java/org/apache/impala/catalog/FileMetadataLoaderTest.java:

http://gerrit.cloudera.org:8080/#/c/13665/2/fe/src/test/java/org/apache/impala/catalog/FileMetadataLoaderTest.java@93
PS2, Line 93: hdfs://localhost:20500/
> Do we need these qualifiers? I think the Configuration should resolve it to
Don't think we need them. I just followed what was being done in the other tests of this class. I personally think its clearer than letting Configuration resolve it to the default FileSystem


http://gerrit.cloudera.org:8080/#/c/13665/2/fe/src/test/java/org/apache/impala/catalog/FileMetadataLoaderTest.java@100
PS2, Line 100: dstFs.deleteOnExit(tmpTestPath);
> Do you need wrap the fs in try-with-resources for this to work?
don't think so. This will clean up the directory when the JVM exits. The other way to do this would be a finally block to explicitly remove the directory before exiting.


http://gerrit.cloudera.org:8080/#/c/13665/2/fe/src/test/java/org/apache/impala/catalog/FileMetadataLoaderTest.java@112
PS2, Line 112: 24
> Instead, we could try loading sourcePath and substitute the values?
I am vary of using load() to test load() method's behavior. Seems like a anti-pattern.



-- 
To view, visit http://gerrit.cloudera.org:8080/13665
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c4a22908304fe9e377d77d6c18d401c3f3294aa
Gerrit-Change-Number: 13665
Gerrit-PatchSet: 3
Gerrit-Owner: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Comment-Date: Wed, 19 Jun 2019 21:43:26 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13665 )

Change subject: IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories
......................................................................


Patch Set 4:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/3708/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/13665
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c4a22908304fe9e377d77d6c18d401c3f3294aa
Gerrit-Change-Number: 13665
Gerrit-PatchSet: 4
Gerrit-Owner: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Comment-Date: Fri, 21 Jun 2019 06:55:55 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13665 )

Change subject: IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories
......................................................................


Patch Set 8: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/4544/


-- 
To view, visit http://gerrit.cloudera.org:8080/13665
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c4a22908304fe9e377d77d6c18d401c3f3294aa
Gerrit-Change-Number: 13665
Gerrit-PatchSet: 8
Gerrit-Owner: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Comment-Date: Wed, 26 Jun 2019 04:38:15 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8663 : FileMetadataLoader should skip listing files in hidden and tmp directories

Posted by "Vihang Karajgaonkar (Code Review)" <ge...@cloudera.org>.
Vihang Karajgaonkar has uploaded a new patch set (#3). ( http://gerrit.cloudera.org:8080/13665 )

Change subject: IMPALA-8663 : FileMetadataLoader should skip listing files in hidden and tmp directories
......................................................................

IMPALA-8663 : FileMetadataLoader should skip listing files in hidden and tmp directories

The FileMetadataLoader is used to load the file information in when the
table is loaded. By default, it lists all the files in the
table/partition directory. Currently, it only skips the filenames which
are invalid (hidden files and ones starting with "_" etc). However, it
does not skip the directories which are temporary or hidden. In case of
Hive when data is inserted into a table, it creates a temporary staging
directory which is a hidden directory under the table location. When the
insert in hive is completed, such staging directories are removed. But
if there is a refresh called during that time, FileMetadataLoader will
add the files in the staging directory as well. Not only this could
cause temporary invalid results but it causes table to go in a bad state
when these temporary directories are removed. The only work-around in
such a case to issue a refresh on the table again.

This patch adds logic in the filemetadataloader to ignore such temporary
staging directories. Unfortunately, hadoop does not provide a API which
can recursively list files in a directory and skip certain directories.
This patch addes this logic of filtering into existing RecursingIterator
in FileSystemUtil.

Testing:
1. Added a new test in FilemetadataloaderTest and modified existing ones
in AcidUtils
2. Ran concurrent inserts from Hive while issuing refresh in a loop on
Impala side. Earlier this would cause the table to go into a bad state.
Now, it works fine for the staging directories. It still runs into a
FileNotFoundException from the impalad when there are insert overwrite
statements in Hive

Change-Id: I2c4a22908304fe9e377d77d6c18d401c3f3294aa
---
M fe/src/main/java/org/apache/impala/common/FileSystemUtil.java
M fe/src/test/java/org/apache/impala/catalog/FileMetadataLoaderTest.java
M fe/src/test/java/org/apache/impala/util/AcidUtilsTest.java
3 files changed, 89 insertions(+), 18 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/65/13665/3
-- 
To view, visit http://gerrit.cloudera.org:8080/13665
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2c4a22908304fe9e377d77d6c18d401c3f3294aa
Gerrit-Change-Number: 13665
Gerrit-PatchSet: 3
Gerrit-Owner: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[Impala-ASF-CR] IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories

Posted by "Vihang Karajgaonkar (Code Review)" <ge...@cloudera.org>.
Vihang Karajgaonkar has uploaded a new patch set (#8). ( http://gerrit.cloudera.org:8080/13665 )

Change subject: IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories
......................................................................

IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories

The FileMetadataLoader is used to load the file information in when the
table is loaded. By default, it lists all the files in the
table/partition directory. Currently, it only skips the filenames which
are invalid (hidden files and ones starting with "_" etc). However, it
does not skip the directories which are temporary or hidden. In case of
Hive when data is inserted into a table, it creates a temporary staging
directory which is a hidden directory under the table location. When the
insert in hive is completed, such staging directories are removed. But
if there is a refresh called during that time, FileMetadataLoader will
add the files in the staging directory as well. Not only this could
cause temporary invalid results but it causes table to go in a bad state
when these temporary directories are removed. The only work-around in
such a case to issue a refresh on the table again.

This patch adds logic in the filemetadataloader to ignore such temporary
staging directories. Unfortunately, hadoop does not provide a API which
can recursively list files in a directory and skip certain directories.
This patch addes this logic of filtering into existing RecursingIterator
in FileSystemUtil.

Also, the existing code to recover partitions implements its own
recursion logic which includes path validation. This already skips such
hidden directories since they do not conform to the partition spec. The
patch does a minor modification to this method by directly calling the
listStatusIterator instead of going through FileSystemUtil which uses
the filtering remote iterator now.

Testing:
1. Added a new test in FilemetadataloaderTest and modified existing ones
in AcidUtils. Modified existing test_recursive_listing.py
2. Ran concurrent inserts from Hive while issuing refresh in a loop on
Impala side. Earlier this would cause the table to go into a bad state.
Now, it works fine for the staging directories. It still runs into a
FileNotFoundException from the impalad when there are insert overwrite
statements in Hive

Change-Id: I2c4a22908304fe9e377d77d6c18d401c3f3294aa
---
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/common/FileSystemUtil.java
M fe/src/test/java/org/apache/impala/catalog/FileMetadataLoaderTest.java
M fe/src/test/java/org/apache/impala/util/AcidUtilsTest.java
M tests/metadata/test_recursive_listing.py
5 files changed, 148 insertions(+), 49 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/65/13665/8
-- 
To view, visit http://gerrit.cloudera.org:8080/13665
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2c4a22908304fe9e377d77d6c18d401c3f3294aa
Gerrit-Change-Number: 13665
Gerrit-PatchSet: 8
Gerrit-Owner: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Vihang Karajgaonkar <vi...@cloudera.com>

[Impala-ASF-CR] IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13665 )

Change subject: IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories
......................................................................


Patch Set 7:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/3745/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/13665
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c4a22908304fe9e377d77d6c18d401c3f3294aa
Gerrit-Change-Number: 13665
Gerrit-PatchSet: 7
Gerrit-Owner: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Comment-Date: Tue, 25 Jun 2019 23:30:22 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13665 )

Change subject: IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories
......................................................................


Patch Set 12:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/3810/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/13665
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c4a22908304fe9e377d77d6c18d401c3f3294aa
Gerrit-Change-Number: 13665
Gerrit-PatchSet: 12
Gerrit-Owner: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Comment-Date: Wed, 03 Jul 2019 03:30:52 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8663 : FileMetadataLoader should skip listing files in hidden and tmp directories

Posted by "Bharath Vissapragada (Code Review)" <ge...@cloudera.org>.
Bharath Vissapragada has posted comments on this change. ( http://gerrit.cloudera.org:8080/13665 )

Change subject: IMPALA-8663 : FileMetadataLoader should skip listing files in hidden and tmp directories
......................................................................


Patch Set 3:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/13665/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/13665/3//COMMIT_MSG@7
PS3, Line 7: IMPALA-8663 : FileMetadataLoader should skip listing files in hidden and tmp directories
nit: wrap


http://gerrit.cloudera.org:8080/#/c/13665/3/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java
File fe/src/main/java/org/apache/impala/common/FileSystemUtil.java:

http://gerrit.cloudera.org:8080/#/c/13665/3/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java@549
PS3, Line 549:    if (recursive) {
             :         // The Hadoop FileSystem API doesn't provide a recursive listStatus call that
             :         // doesn't also fetch block locations, and fetching block locations is expensive.
             :         // Here, our caller specifically doesn't need block locations, so we don't want to
             :         // call the expensive 'listFiles' call on HDFS. Instead, we need to "manually"
             :         // recursively call FileSystem.listStatusIterator().
             :         //
             :         // Note that this "manual" recursion is not actually any slower than the recursion
             :         // provided by the HDFS 'listFiles(recursive=true)' API, since the HDFS wire
             :         // protocol doesn't provide any such recursive support anyway. In other words,
             :         // the API that looks like a single recursive call is just as bad as what we're
             :         // doing here.
             :         //
             :         // However, S3 actually implements 'listFiles(recursive=true)' with a faster path
             :         // which natively recurses. In that case, it's quite preferable to use 'listFiles'
             :         // even though it returns LocatedFileStatus objects with "fake" blocks which we
             :         // will ignore.
             :         if (isS3AFileSystem(fs)) {
             :           return listFiles(fs, p, recursive);
             :         }
             : 
             :         return new RemoteIteratorWithFilter(fs, p, true);
             :       }
             : 
             :       return new RemoteIteratorWithFilter(fs, p, false)
I think this is equivalent to...

if (S3) return listFiles(fs, p, recursive);
return new RemoteIteratorWithFilter(fs, p, recursive);


http://gerrit.cloudera.org:8080/#/c/13665/3/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java@581
PS3, Line 581: useful
nit: used ?


http://gerrit.cloudera.org:8080/#/c/13665/3/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java@585
PS3, Line 585: PREDICATE
nit: FILTER? (predicate is a java construct)


http://gerrit.cloudera.org:8080/#/c/13665/3/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java@642
PS3, Line 642:  private final Predicate<FileStatus> fileStatusPredicate_;
This is always used, no? Hard code it?


http://gerrit.cloudera.org:8080/#/c/13665/3/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java@643
PS3, Line 643:     private final boolean useListStatus_;
Document this?


http://gerrit.cloudera.org:8080/#/c/13665/2/fe/src/test/java/org/apache/impala/catalog/FileMetadataLoaderTest.java
File fe/src/test/java/org/apache/impala/catalog/FileMetadataLoaderTest.java:

http://gerrit.cloudera.org:8080/#/c/13665/2/fe/src/test/java/org/apache/impala/catalog/FileMetadataLoaderTest.java@100
PS2, Line 100: dstFs.deleteOnExit(tmpTestPath);
> don't think so. This will clean up the directory when the JVM exits. The ot
Are you sure?

https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileSystem.java#L2505

Its done in close(), so you probably need to wrap it in try-with-resources or explicitly call in a finally {}


http://gerrit.cloudera.org:8080/#/c/13665/2/fe/src/test/java/org/apache/impala/catalog/FileMetadataLoaderTest.java@112
PS2, Line 112: 24
> I am vary of using load() to test load() method's behavior. Seems like a an
not sure I follow. I think what were asserting here is that the number of loaded files in the sourcePath would be the same with/without the filters. So why not load the sourcePath, get the the numFiles, add the junk folders, load again and compare the load numbers? I think thats better than hardcoding.



-- 
To view, visit http://gerrit.cloudera.org:8080/13665
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c4a22908304fe9e377d77d6c18d401c3f3294aa
Gerrit-Change-Number: 13665
Gerrit-PatchSet: 3
Gerrit-Owner: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Comment-Date: Thu, 20 Jun 2019 18:44:19 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13665 )

Change subject: IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories
......................................................................


Patch Set 7:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/13665/7/tests/metadata/test_recursive_listing.py
File tests/metadata/test_recursive_listing.py:

http://gerrit.cloudera.org:8080/#/c/13665/7/tests/metadata/test_recursive_listing.py@98
PS7, Line 98: ,
flake8: E501 line too long (91 > 90 characters)


http://gerrit.cloudera.org:8080/#/c/13665/7/tests/metadata/test_recursive_listing.py@101
PS7, Line 101: a
flake8: E501 line too long (104 > 90 characters)



-- 
To view, visit http://gerrit.cloudera.org:8080/13665
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c4a22908304fe9e377d77d6c18d401c3f3294aa
Gerrit-Change-Number: 13665
Gerrit-PatchSet: 7
Gerrit-Owner: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Comment-Date: Tue, 25 Jun 2019 22:51:03 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13665 )

Change subject: IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories
......................................................................


Patch Set 11: Verified+1


-- 
To view, visit http://gerrit.cloudera.org:8080/13665
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c4a22908304fe9e377d77d6c18d401c3f3294aa
Gerrit-Change-Number: 13665
Gerrit-PatchSet: 11
Gerrit-Owner: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Comment-Date: Thu, 27 Jun 2019 12:59:52 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories

Posted by "Vihang Karajgaonkar (Code Review)" <ge...@cloudera.org>.
Vihang Karajgaonkar has posted comments on this change. ( http://gerrit.cloudera.org:8080/13665 )

Change subject: IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories
......................................................................


Patch Set 2:

(9 comments)

http://gerrit.cloudera.org:8080/#/c/13665/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/13665/3//COMMIT_MSG@7
PS3, Line 7: IMPALA-8663 : [WIP] FileMetadataLoader should skip listing files in hidden and tmp directories
> nit: wrap
Done


http://gerrit.cloudera.org:8080/#/c/13665/3/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java
File fe/src/main/java/org/apache/impala/common/FileSystemUtil.java:

http://gerrit.cloudera.org:8080/#/c/13665/3/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java@549
PS3, Line 549:    if (recursive) {
             :         // The Hadoop FileSystem API doesn't provide a recursive listStatus call that
             :         // doesn't also fetch block locations, and fetching block locations is expensive.
             :         // Here, our caller specifically doesn't need block locations, so we don't want to
             :         // call the expensive 'listFiles' call on HDFS. Instead, we need to "manually"
             :         // recursively call FileSystem.listStatusIterator().
             :         //
             :         // Note that this "manual" recursion is not actually any slower than the recursion
             :         // provided by the HDFS 'listFiles(recursive=true)' API, since the HDFS wire
             :         // protocol doesn't provide any such recursive support anyway. In other words,
             :         // the API that looks like a single recursive call is just as bad as what we're
             :         // doing here.
             :         //
             :         // However, S3 actually implements 'listFiles(recursive=true)' with a faster path
             :         // which natively recurses. In that case, it's quite preferable to use 'listFiles'
             :         // even though it returns LocatedFileStatus objects with "fake" blocks which we
             :         // will ignore.
             :         if (isS3AFileSystem(fs)) {
             :           return listFiles(fs, p, recursive);
             :         }
             : 
             :         return new RecursingIteratorWithFilter(fs, p, true);
             :       }
             : 
             :       return new RecursingIteratorWithFilter(fs, p, fal
> I think this is equivalent to...
yeah, makes sense. Changed.


http://gerrit.cloudera.org:8080/#/c/13665/3/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java@581
PS3, Line 581: erface
> nit: used ?
Done


http://gerrit.cloudera.org:8080/#/c/13665/3/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java@585
PS3, Line 585: 
> nit: FILTER? (predicate is a java construct)
Done


http://gerrit.cloudera.org:8080/#/c/13665/3/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java@642
PS3, Line 642: * certain files/directories. This is a implementation of
> This is always used, no? Hard code it?
I was hoping to make this class generic enough to take in any predicate. But for now, probably makes sense to hard code.


http://gerrit.cloudera.org:8080/#/c/13665/3/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java@643
PS3, Line 643:    * {@link org.apache.hadoop.fs.RemoteIterator}
> Document this?
Done


http://gerrit.cloudera.org:8080/#/c/13665/3/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java@658
PS3, Line 658:     private final boolean useListStatus_;
> line too long (94 > 90)
Done


http://gerrit.cloudera.org:8080/#/c/13665/2/fe/src/test/java/org/apache/impala/catalog/FileMetadataLoaderTest.java
File fe/src/test/java/org/apache/impala/catalog/FileMetadataLoaderTest.java:

http://gerrit.cloudera.org:8080/#/c/13665/2/fe/src/test/java/org/apache/impala/catalog/FileMetadataLoaderTest.java@100
PS2, Line 100: dstFs.deleteOnExit(tmpTestPath);
> Are you sure?
The FileSystem#close is called when the JVM exits. See this, https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileSystem.java#L1618

given this discussion, its probably clearer to use try with resources to make it explicit to the reader. Made the change accordingly


http://gerrit.cloudera.org:8080/#/c/13665/2/fe/src/test/java/org/apache/impala/catalog/FileMetadataLoaderTest.java@112
PS2, Line 112: 24
> not sure I follow. I think what were asserting here is that the number of l
Isn't this approach assuming the load() method itself is not buggy? Eg, in the suggested approach test will pass if load() method is hardcoded to return a fixed number of files (my point being, a unrelated bug in FileMetadataLoader will cause a false positive in the suggested approach)

In my opinion, comparing it to a known hard-coded number of files (or a trusted third party library for results) is a better source of truth to compare against.



-- 
To view, visit http://gerrit.cloudera.org:8080/13665
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c4a22908304fe9e377d77d6c18d401c3f3294aa
Gerrit-Change-Number: 13665
Gerrit-PatchSet: 2
Gerrit-Owner: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Comment-Date: Thu, 20 Jun 2019 21:55:01 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/13665 )

Change subject: IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories
......................................................................


Patch Set 8:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/13665/8/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java
File fe/src/main/java/org/apache/impala/common/FileSystemUtil.java:

http://gerrit.cloudera.org:8080/#/c/13665/8/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java@541
PS8, Line 541: temporary directories
the code now seems to also skip any files that match this pattern, not just directories. I think that's probably OK but we should be clear on the semantics.


http://gerrit.cloudera.org:8080/#/c/13665/8/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java@545
PS8, Line 545: all underlying files (except which are
             :    * in the ignored directories) 
does this mean we no longer yield directories, and only yield files? does this not break the case of an empty base data dir, which actually has semantic value?


http://gerrit.cloudera.org:8080/#/c/13665/8/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java@568
PS8, Line 568: isS3AFileSystem(p)
why'd you change from isS3AFileSystem(fs) to isS3AFileSystem(p)? In the case that 'p' isn't a fully qualified path, this will check the wrong filesystem, and we don't have any preconditions checking that 'p' must be fully qualified.


http://gerrit.cloudera.org:8080/#/c/13665/8/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java@568
PS8, Line 568: isS3AFileSystem
perhaps we can add a @VisibleForTesting way we can make this path get used even for non-S3, so we can get test coverage here?


http://gerrit.cloudera.org:8080/#/c/13665/8/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java@626
PS8, Line 626: LISTING_TYPE
style nit: enums should be named like classes (ListingType)


http://gerrit.cloudera.org:8080/#/c/13665/8/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java@699
PS8, Line 699:       // if the current file is on a ignored path return early
             :       if (isIgnoredPath(fileStatus)) return;
how does this prevent recursion into tmp dirs in the recursive listFiles case? It seems like this is only checking the filename (last path component) and not intermediate path components, which we'd need to be doing to filter out contents of subdirectories for the "native recursion"



-- 
To view, visit http://gerrit.cloudera.org:8080/13665
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c4a22908304fe9e377d77d6c18d401c3f3294aa
Gerrit-Change-Number: 13665
Gerrit-PatchSet: 8
Gerrit-Owner: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Comment-Date: Thu, 27 Jun 2019 06:46:34 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories

Posted by "Vihang Karajgaonkar (Code Review)" <ge...@cloudera.org>.
Vihang Karajgaonkar has uploaded a new patch set (#4). ( http://gerrit.cloudera.org:8080/13665 )

Change subject: IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories
......................................................................

IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories

The FileMetadataLoader is used to load the file information in when the
table is loaded. By default, it lists all the files in the
table/partition directory. Currently, it only skips the filenames which
are invalid (hidden files and ones starting with "_" etc). However, it
does not skip the directories which are temporary or hidden. In case of
Hive when data is inserted into a table, it creates a temporary staging
directory which is a hidden directory under the table location. When the
insert in hive is completed, such staging directories are removed. But
if there is a refresh called during that time, FileMetadataLoader will
add the files in the staging directory as well. Not only this could
cause temporary invalid results but it causes table to go in a bad state
when these temporary directories are removed. The only work-around in
such a case to issue a refresh on the table again.

This patch adds logic in the filemetadataloader to ignore such temporary
staging directories. Unfortunately, hadoop does not provide a API which
can recursively list files in a directory and skip certain directories.
This patch addes this logic of filtering into existing RecursingIterator
in FileSystemUtil.

Testing:
1. Added a new test in FilemetadataloaderTest and modified existing ones
in AcidUtils
2. Ran concurrent inserts from Hive while issuing refresh in a loop on
Impala side. Earlier this would cause the table to go into a bad state.
Now, it works fine for the staging directories. It still runs into a
FileNotFoundException from the impalad when there are insert overwrite
statements in Hive

Change-Id: I2c4a22908304fe9e377d77d6c18d401c3f3294aa
---
M fe/src/main/java/org/apache/impala/common/FileSystemUtil.java
M fe/src/test/java/org/apache/impala/catalog/FileMetadataLoaderTest.java
M fe/src/test/java/org/apache/impala/util/AcidUtilsTest.java
3 files changed, 95 insertions(+), 22 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/65/13665/4
-- 
To view, visit http://gerrit.cloudera.org:8080/13665
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2c4a22908304fe9e377d77d6c18d401c3f3294aa
Gerrit-Change-Number: 13665
Gerrit-PatchSet: 4
Gerrit-Owner: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Vihang Karajgaonkar <vi...@cloudera.com>

[Impala-ASF-CR] IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13665 )

Change subject: IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories
......................................................................


Patch Set 13: Verified+1


-- 
To view, visit http://gerrit.cloudera.org:8080/13665
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c4a22908304fe9e377d77d6c18d401c3f3294aa
Gerrit-Change-Number: 13665
Gerrit-PatchSet: 13
Gerrit-Owner: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Comment-Date: Tue, 09 Jul 2019 05:04:30 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories

Posted by "Vihang Karajgaonkar (Code Review)" <ge...@cloudera.org>.
Vihang Karajgaonkar has uploaded a new patch set (#11). ( http://gerrit.cloudera.org:8080/13665 )

Change subject: IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories
......................................................................

IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories

The FileMetadataLoader is used to load the file information in when the
table is loaded. By default, it lists all the files in the
table/partition directory. Currently, it only skips the filenames which
are invalid (hidden files and ones starting with "_" etc). However, it
does not skip the directories which are temporary or hidden. In case of
Hive when data is inserted into a table, it creates a temporary staging
directory which is a hidden directory under the table location. When the
insert in hive is completed, such staging directories are removed. But
if there is a refresh called during that time, FileMetadataLoader will
add the files in the staging directory as well. Not only this could
cause temporary invalid results but it causes table to go in a bad state
when these temporary directories are removed. The only work-around in
such a case to issue a refresh on the table again.

This patch adds logic in the filemetadataloader to ignore such temporary
staging directories. Unfortunately, hadoop does not provide a API which
can recursively list files in a directory and skip certain directories.
This patch addes this logic of filtering into existing RecursingIterator
in FileSystemUtil. In case of S3 we use the listFiles API to natively
recurse using the S3AFileSystem. This case needs a special handling to
check if the files returned are contained within a ignored directory.
The patch adds this to the recursing iterator using a new util method in
FileSystemUtil.

Also, the existing code to recover partitions implements its own
recursion logic which includes path validation. This already skips such
hidden directories since they do not conform to the partition spec. The
patch does a minor modification to this method by directly calling the
listStatusIterator instead of going through FileSystemUtil#listStatus
whiche uses the filtering remote iterator now.

Testing:
1. Added a new tests as well as modified existing ones which were
related to cover interesting cases.
2. Ran concurrent inserts from Hive while issuing refresh in a loop on
Impala side. Earlier this would cause the table to go into a bad state.
Now, it works fine for the staging directories. It still runs into a
FileNotFoundException from the impalad when there are insert overwrite
statements in Hive

Change-Id: I2c4a22908304fe9e377d77d6c18d401c3f3294aa
---
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/common/FileSystemUtil.java
M fe/src/test/java/org/apache/impala/catalog/FileMetadataLoaderTest.java
A fe/src/test/java/org/apache/impala/common/FileSystemUtilTest.java
M fe/src/test/java/org/apache/impala/util/AcidUtilsTest.java
M tests/metadata/test_recursive_listing.py
6 files changed, 275 insertions(+), 51 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/65/13665/11
-- 
To view, visit http://gerrit.cloudera.org:8080/13665
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2c4a22908304fe9e377d77d6c18d401c3f3294aa
Gerrit-Change-Number: 13665
Gerrit-PatchSet: 11
Gerrit-Owner: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Vihang Karajgaonkar <vi...@cloudera.com>

[Impala-ASF-CR] IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories

Posted by "Vihang Karajgaonkar (Code Review)" <ge...@cloudera.org>.
Vihang Karajgaonkar has uploaded a new patch set (#12). ( http://gerrit.cloudera.org:8080/13665 )

Change subject: IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories
......................................................................

IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories

The FileMetadataLoader is used to load the file information in when the
table is loaded. By default, it lists all the files in the
table/partition directory. Currently, it only skips the filenames which
are invalid (hidden files and ones starting with "_" etc). However, it
does not skip the directories which are temporary or hidden. In case of
Hive when data is inserted into a table, it creates a temporary staging
directory which is a hidden directory under the table location. When the
insert in hive is completed, such staging directories are removed. But
if there is a refresh called during that time, FileMetadataLoader will
add the files in the staging directory as well. Not only this could
cause temporary invalid results but it causes table to go in a bad state
when these temporary directories are removed. The only work-around in
such a case to issue a refresh on the table again.

This patch adds logic in the filemetadataloader to ignore such temporary
staging directories. Unfortunately, hadoop does not provide a API which
can recursively list files in a directory and skip certain directories.
This patch adds a new FilterIterator which wraps around existing
listFiles, listStatus and RecursingIterator to skip the hidden
directories from the listing result.

Also, the existing code to recover partitions implements its own
recursion logic which includes path validation. This already skips such
hidden directories since they do not conform to the partition spec. The
patch does a minor modification to this method by directly calling the
listStatusIterator instead of going through FileSystemUtil#listStatus
whiche uses the filtering remote iterator now.

Testing:
1. Added a new tests as well as modified existing ones which were
related to cover interesting cases.
2. Ran concurrent inserts from Hive while issuing refresh in a loop on
Impala side. Earlier this would cause the table to go into a bad state.
Now, it works fine for the staging directories. It still runs into a
FileNotFoundException from the impalad when there are insert overwrite
statements in Hive

Change-Id: I2c4a22908304fe9e377d77d6c18d401c3f3294aa
---
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/common/FileSystemUtil.java
M fe/src/test/java/org/apache/impala/catalog/FileMetadataLoaderTest.java
A fe/src/test/java/org/apache/impala/common/FileSystemUtilTest.java
M fe/src/test/java/org/apache/impala/util/AcidUtilsTest.java
M tests/metadata/test_recursive_listing.py
6 files changed, 236 insertions(+), 9 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/65/13665/12
-- 
To view, visit http://gerrit.cloudera.org:8080/13665
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2c4a22908304fe9e377d77d6c18d401c3f3294aa
Gerrit-Change-Number: 13665
Gerrit-PatchSet: 12
Gerrit-Owner: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Vihang Karajgaonkar <vi...@cloudera.com>

[Impala-ASF-CR] IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/13665 )

Change subject: IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories
......................................................................


Patch Set 11:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/13665/8/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java
File fe/src/main/java/org/apache/impala/common/FileSystemUtil.java:

http://gerrit.cloudera.org:8080/#/c/13665/8/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java@545
PS8, Line 545: 
             :    * If 'recursive' is true, all 
> Actually, I added this line in the doc since I noticed that the RecursingIt
hrm, were the semantics already broken then? See the unit test AcidUtilsTest.testAcidStateNoBase:

    assertFiltering(new String[]{
            "base_01.txt",
            "post_upgrade.txt",
            "delta_000005_000005_0000/",
            "delta_000005_000005_0000/lmn.txt",
            "base_000010/",
            "delta_0000012_0000012_0000/",
            "delta_0000012_0000012_0000/0000_0",
            "delta_0000012_0000012_0000/0000_1"},
        "", // writeIdList that accepts all transactions as valid
        new String[]{
            // Post upgrade files are ignored if there is a valid base.
            "delta_0000012_0000012_0000/0000_0",
            "delta_0000012_0000012_0000/0000_1"});


ie here the empty base directory has important semantics. Maybe we don't have a valid e2e test case to reproduce this yet because of this Hive bug: https://issues.apache.org/jira/browse/HIVE-21750 but it certainly seems like this is an issue. Let's file a JIRA to make sure we validate this case before releasing this feature.


http://gerrit.cloudera.org:8080/#/c/13665/11/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java
File fe/src/main/java/org/apache/impala/common/FileSystemUtil.java:

http://gerrit.cloudera.org:8080/#/c/13665/11/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java@731
PS11, Line 731:       // we use listFiles API to natively recurse on S3 for performance reasons, so we
Are we really gaining much from using this RecursiveIteratorWithFilter class for the S3 case now? It seems like this is getting to be a bit of spaghetti, with this being named "Recursive" but having a flag that makes it non-recursive.

What about just having a separate class like 'FilteringIterator' which wraps RemoteIterator and does filtering, and keep this one as always-recursive, so the code path is easier to follow?


http://gerrit.cloudera.org:8080/#/c/13665/11/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java@736
PS11, Line 736:         LOG.info("Ignoring {} since its contained in a ignored directory",
This probably shouldn't be at INFO level



-- 
To view, visit http://gerrit.cloudera.org:8080/13665
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c4a22908304fe9e377d77d6c18d401c3f3294aa
Gerrit-Change-Number: 13665
Gerrit-PatchSet: 11
Gerrit-Owner: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Comment-Date: Mon, 01 Jul 2019 22:40:45 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8663 : FileMetadataLoader should skip listing files in hidden and tmp directories

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13665 )

Change subject: IMPALA-8663 : FileMetadataLoader should skip listing files in hidden and tmp directories
......................................................................


Patch Set 3:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/13665/3/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java
File fe/src/main/java/org/apache/impala/common/FileSystemUtil.java:

http://gerrit.cloudera.org:8080/#/c/13665/3/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java@585
PS3, Line 585:   public static final Predicate <FileStatus> HIDDEN_DIRECTORIES_PREDICATE = fileStatus -> {
line too long (91 > 90)


http://gerrit.cloudera.org:8080/#/c/13665/3/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java@658
PS3, Line 658:         Predicate<FileStatus> fileStatusPredicate, boolean useListStatus) throws IOException {
line too long (94 > 90)



-- 
To view, visit http://gerrit.cloudera.org:8080/13665
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c4a22908304fe9e377d77d6c18d401c3f3294aa
Gerrit-Change-Number: 13665
Gerrit-PatchSet: 3
Gerrit-Owner: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Comment-Date: Wed, 19 Jun 2019 21:43:53 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories

Posted by "Vihang Karajgaonkar (Code Review)" <ge...@cloudera.org>.
Vihang Karajgaonkar has posted comments on this change. ( http://gerrit.cloudera.org:8080/13665 )

Change subject: IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories
......................................................................


Patch Set 13: Code-Review+2

Carrying forward Todd's +2 from earlier


-- 
To view, visit http://gerrit.cloudera.org:8080/13665
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c4a22908304fe9e377d77d6c18d401c3f3294aa
Gerrit-Change-Number: 13665
Gerrit-PatchSet: 13
Gerrit-Owner: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Comment-Date: Tue, 09 Jul 2019 15:31:25 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13665 )

Change subject: IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories
......................................................................


Patch Set 4: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/4519/


-- 
To view, visit http://gerrit.cloudera.org:8080/13665
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c4a22908304fe9e377d77d6c18d401c3f3294aa
Gerrit-Change-Number: 13665
Gerrit-PatchSet: 4
Gerrit-Owner: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Comment-Date: Fri, 21 Jun 2019 12:14:27 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/13665 )

Change subject: IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories
......................................................................


Patch Set 12: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/13665
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c4a22908304fe9e377d77d6c18d401c3f3294aa
Gerrit-Change-Number: 13665
Gerrit-PatchSet: 12
Gerrit-Owner: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Comment-Date: Mon, 08 Jul 2019 22:52:02 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13665 )

Change subject: IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories
......................................................................


Patch Set 4:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/4519/ DRY_RUN=true


-- 
To view, visit http://gerrit.cloudera.org:8080/13665
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c4a22908304fe9e377d77d6c18d401c3f3294aa
Gerrit-Change-Number: 13665
Gerrit-PatchSet: 4
Gerrit-Owner: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Comment-Date: Fri, 21 Jun 2019 06:18:13 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13665 )

Change subject: IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories
......................................................................


Patch Set 11:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/3765/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/13665
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c4a22908304fe9e377d77d6c18d401c3f3294aa
Gerrit-Change-Number: 13665
Gerrit-PatchSet: 11
Gerrit-Owner: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Comment-Date: Thu, 27 Jun 2019 08:31:42 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13665 )

Change subject: IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories
......................................................................


Patch Set 13:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/4584/ DRY_RUN=false


-- 
To view, visit http://gerrit.cloudera.org:8080/13665
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c4a22908304fe9e377d77d6c18d401c3f3294aa
Gerrit-Change-Number: 13665
Gerrit-PatchSet: 13
Gerrit-Owner: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Comment-Date: Mon, 08 Jul 2019 23:25:34 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13665 )

Change subject: IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories
......................................................................


Patch Set 11:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/13665/11/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java
File fe/src/main/java/org/apache/impala/common/FileSystemUtil.java:

http://gerrit.cloudera.org:8080/#/c/13665/11/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java@734
PS11, Line 734:       if (isRecursive_ && listingType_.equals(ListingType.LIST_FILES) && isInIgnoredDirectory(
line too long (94 > 90)



-- 
To view, visit http://gerrit.cloudera.org:8080/13665
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2c4a22908304fe9e377d77d6c18d401c3f3294aa
Gerrit-Change-Number: 13665
Gerrit-PatchSet: 11
Gerrit-Owner: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Comment-Date: Thu, 27 Jun 2019 07:52:36 +0000
Gerrit-HasComments: Yes