You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Quanlong Huang (Code Review)" <ge...@cloudera.org> on 2021/03/11 02:25:03 UTC

[Impala-ASF-CR] IMPALA-10579: Fix usage of RemoteIterator in FileSystemUtil

Quanlong Huang has uploaded this change for review. ( http://gerrit.cloudera.org:8080/17171


Change subject: IMPALA-10579: Fix usage of RemoteIterator in FileSystemUtil
......................................................................

IMPALA-10579: Fix usage of RemoteIterator in FileSystemUtil

HDFS FileSystem provides a listStatusIterator() API for listing remote
storage using a RemoteIterator. We use it to list files when loading
table file metadata.

It's not guaranteed that a RemoteIterator can survive when its hasNext()
or next() throws IOExceptions. We should stop the loop in this case.
Otherwise, we may go into a dead loop.

Without HADOOP-16685, it's also not guaranteed that
FileSystem.listStatusIterator() will thrown a FileNotFoundException when
the path doesn't exist.

This patch refactors the file listing iterators so we don't need to
depend on these two assumptions.

Tests:
 - Run test_insert_stress.py. Verified the non-existing subdirs are
   skipped.

Change-Id: I859bd4f976c51a34eb6a03cefd2ddcdf11656cea
---
M fe/src/main/java/org/apache/impala/common/FileSystemUtil.java
1 file changed, 37 insertions(+), 18 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/71/17171/1
-- 
To view, visit http://gerrit.cloudera.org:8080/17171
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I859bd4f976c51a34eb6a03cefd2ddcdf11656cea
Gerrit-Change-Number: 17171
Gerrit-PatchSet: 1
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>

[Impala-ASF-CR] IMPALA-10579: Fix usage of RemoteIterator in FileSystemUtil

Posted by "Kurt Deschler (Code Review)" <ge...@cloudera.org>.
Kurt Deschler has posted comments on this change. ( http://gerrit.cloudera.org:8080/17171 )

Change subject: IMPALA-10579: Fix usage of RemoteIterator in FileSystemUtil
......................................................................


Patch Set 5: Code-Review+1

(1 comment)

http://gerrit.cloudera.org:8080/#/c/17171/5/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java
File fe/src/main/java/org/apache/impala/common/FileSystemUtil.java:

http://gerrit.cloudera.org:8080/#/c/17171/5/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java@959
PS5, Line 959:       curIter_ = subIter;
> No, it's not stale. When newIterFunc_ throws an exception, it means we can'
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/17171
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I859bd4f976c51a34eb6a03cefd2ddcdf11656cea
Gerrit-Change-Number: 17171
Gerrit-PatchSet: 5
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Comment-Date: Wed, 24 Mar 2021 17:28:11 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10579: Fix usage of RemoteIterator in FileSystemUtil

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17171 )

Change subject: IMPALA-10579: Fix usage of RemoteIterator in FileSystemUtil
......................................................................


Patch Set 4: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/6951/


-- 
To view, visit http://gerrit.cloudera.org:8080/17171
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I859bd4f976c51a34eb6a03cefd2ddcdf11656cea
Gerrit-Change-Number: 17171
Gerrit-PatchSet: 4
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Thu, 11 Mar 2021 20:17:17 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10579: Fix usage of RemoteIterator in FileSystemUtil

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17171 )

Change subject: IMPALA-10579: Fix usage of RemoteIterator in FileSystemUtil
......................................................................


Patch Set 2:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/8339/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17171
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I859bd4f976c51a34eb6a03cefd2ddcdf11656cea
Gerrit-Change-Number: 17171
Gerrit-PatchSet: 2
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Thu, 11 Mar 2021 03:49:37 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10579: Fix usage of RemoteIterator in FileSystemUtil

Posted by "Vihang Karajgaonkar (Code Review)" <ge...@cloudera.org>.
Vihang Karajgaonkar has posted comments on this change. ( http://gerrit.cloudera.org:8080/17171 )

Change subject: IMPALA-10579: Fix usage of RemoteIterator in FileSystemUtil
......................................................................


Patch Set 5: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/17171
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I859bd4f976c51a34eb6a03cefd2ddcdf11656cea
Gerrit-Change-Number: 17171
Gerrit-PatchSet: 5
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Comment-Date: Tue, 30 Mar 2021 23:28:54 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10579: Fix usage of RemoteIterator in FileSystemUtil

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Hello Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/17171

to look at the new patch set (#2).

Change subject: IMPALA-10579: Fix usage of RemoteIterator in FileSystemUtil
......................................................................

IMPALA-10579: Fix usage of RemoteIterator in FileSystemUtil

HDFS FileSystem provides a listStatusIterator() API for listing remote
storage using a RemoteIterator. We use it to list files when loading
table file metadata.

It's not guaranteed that a RemoteIterator can survive when its hasNext()
or next() throws IOExceptions. We should stop the loop in this case.
Otherwise, we may go into a dead loop.

Without HADOOP-16685, it's also not guaranteed that
FileSystem.listStatusIterator() will thrown a FileNotFoundException when
the path doesn't exist.

This patch refactors the file listing iterators so we don't need to
depend on these two assumptions.

Tests:
 - Run test_insert_stress.py. Verified the non-existing subdirs are
   skipped.

Change-Id: I859bd4f976c51a34eb6a03cefd2ddcdf11656cea
---
M fe/src/main/java/org/apache/impala/common/FileSystemUtil.java
1 file changed, 37 insertions(+), 18 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/71/17171/2
-- 
To view, visit http://gerrit.cloudera.org:8080/17171
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I859bd4f976c51a34eb6a03cefd2ddcdf11656cea
Gerrit-Change-Number: 17171
Gerrit-PatchSet: 2
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>

[Impala-ASF-CR] IMPALA-10579: Fix usage of RemoteIterator in FileSystemUtil

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17171 )

Change subject: IMPALA-10579: Fix usage of RemoteIterator in FileSystemUtil
......................................................................


Patch Set 3:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/8340/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17171
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I859bd4f976c51a34eb6a03cefd2ddcdf11656cea
Gerrit-Change-Number: 17171
Gerrit-PatchSet: 3
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Thu, 11 Mar 2021 07:57:33 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10579: Fix usage of RemoteIterator in FileSystemUtil

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/17171 )

Change subject: IMPALA-10579: Fix usage of RemoteIterator in FileSystemUtil
......................................................................


Patch Set 4:

> Patch Set 4: Verified-1
> 
> Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/6951/

It seems to be a flaky failure: IMPALA-10559
I'll rerun the test.


-- 
To view, visit http://gerrit.cloudera.org:8080/17171
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I859bd4f976c51a34eb6a03cefd2ddcdf11656cea
Gerrit-Change-Number: 17171
Gerrit-PatchSet: 4
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Fri, 12 Mar 2021 00:15:07 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10579: Fix usage of RemoteIterator in FileSystemUtil

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17171 )

Change subject: IMPALA-10579: Fix usage of RemoteIterator in FileSystemUtil
......................................................................


Patch Set 4:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/6954/ DRY_RUN=true


-- 
To view, visit http://gerrit.cloudera.org:8080/17171
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I859bd4f976c51a34eb6a03cefd2ddcdf11656cea
Gerrit-Change-Number: 17171
Gerrit-PatchSet: 4
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Fri, 12 Mar 2021 00:15:38 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10579: Fix usage of RemoteIterator in FileSystemUtil

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/17171 )

Change subject: IMPALA-10579: Fix usage of RemoteIterator in FileSystemUtil
......................................................................


Patch Set 3:

When I loop over tests/stress/test_insert_stress.py to run it several times, I found the inserts are more easy to fail by TableLoadingException. And then lead the table metadata to a bad state. The exception is

E0311 11:59:42.633015 20506 ParallelFileMetadataLoader.java:166] Refreshing file and block metadata for 1 paths for table test_inserts_ab08196b.test_concurrent_inserts encountered an error loading data for path hdfs://localhost:20500/test-warehouse/test_inserts_ab08196b.db/test_concurrent_inserts
Java exception follows:
java.util.concurrent.ExecutionException: java.io.FileNotFoundException: File hdfs://localhost:20500/test-warehouse/test_inserts_ab08196b.db/test_concurrent_inserts/_impala_insert_staging/5e4afdf5a6978311_9f9d0baa00000000 does not exist.
        at com.google.common.util.concurrent.AbstractFuture.getDoneValue(AbstractFuture.java:552)
        at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:513)
        at com.google.common.util.concurrent.FluentFuture$TrustedFuture.get(FluentFuture.java:86)
        at org.apache.impala.catalog.ParallelFileMetadataLoader.loadInternal(ParallelFileMetadataLoader.java:163)
        at org.apache.impala.catalog.ParallelFileMetadataLoader.load(ParallelFileMetadataLoader.java:115)
        at org.apache.impala.catalog.HdfsTable.loadFileMetadataForPartitions(HdfsTable.java:747)
        at org.apache.impala.catalog.HdfsTable.updateUnpartitionedTableFileMd(HdfsTable.java:1296)
        at org.apache.impala.catalog.HdfsTable.load(HdfsTable.java:1182)
        at org.apache.impala.service.CatalogOpExecutor.loadTableMetadata(CatalogOpExecutor.java:1015)
        at org.apache.impala.service.CatalogOpExecutor.updateCatalog(CatalogOpExecutor.java:4808)
        at org.apache.impala.service.JniCatalog.updateCatalog(JniCatalog.java:327)
Caused by: java.io.FileNotFoundException: File hdfs://localhost:20500/test-warehouse/test_inserts_ab08196b.db/test_concurrent_inserts/_impala_insert_staging/5e4afdf5a6978311_9f9d0baa00000000 does not exist. 
        at org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.<init>(DistributedFileSystem.java:1273)
        at org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.<init>(DistributedFileSystem.java:1247)
        at org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1192)
        at org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1188)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at org.apache.hadoop.hdfs.DistributedFileSystem.listLocatedStatus(DistributedFileSystem.java:1206)
        at org.apache.hadoop.fs.FileSystem.listLocatedStatus(FileSystem.java:2126)
        at org.apache.hadoop.fs.FileSystem$5.handleFileStat(FileSystem.java:2314)
        at org.apache.hadoop.fs.FileSystem$5.hasNext(FileSystem.java:2291)
        at org.apache.impala.common.FileSystemUtil$FilterIterator.hasNext(FileSystemUtil.java:813)
        at org.apache.impala.catalog.FileMetadataLoader.load(FileMetadataLoader.java:202)
        at org.apache.impala.catalog.ParallelFileMetadataLoader.lambda$loadInternal$1(ParallelFileMetadataLoader.java:157)
        at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
        at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:69)
        at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78)
        at com.google.common.util.concurrent.MoreExecutors$DirectExecutorService.execute(MoreExecutors.java:322)
        at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:134)
        at com.google.common.util.concurrent.AbstractListeningExecutorService.submit(AbstractListeningExecutorService.java:66)
        at com.google.common.util.concurrent.AbstractListeningExecutorService.submit(AbstractListeningExecutorService.java:36)
        at org.apache.impala.catalog.ParallelFileMetadataLoader.loadInternal(ParallelFileMetadataLoader.java:157)
        ... 7 more

The reason is that we have removed the catching of FileNotFoundException in FilterIterator. When listing files with locations, we use FileSystem#listFiles() which returns a RemoteIterator similar to our RecursingIterator except the handling of non-exisitng subdir in its hasNext(). To make the file listing with location more robust as well, PS3 use our RecursingIterator when fs is not S3.


-- 
To view, visit http://gerrit.cloudera.org:8080/17171
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I859bd4f976c51a34eb6a03cefd2ddcdf11656cea
Gerrit-Change-Number: 17171
Gerrit-PatchSet: 3
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Thu, 11 Mar 2021 07:38:48 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10579: Fix usage of RemoteIterator in FileSystemUtil

Posted by "Kurt Deschler (Code Review)" <ge...@cloudera.org>.
Kurt Deschler has posted comments on this change. ( http://gerrit.cloudera.org:8080/17171 )

Change subject: IMPALA-10579: Fix usage of RemoteIterator in FileSystemUtil
......................................................................


Patch Set 5:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/17171/5/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java
File fe/src/main/java/org/apache/impala/common/FileSystemUtil.java:

http://gerrit.cloudera.org:8080/#/c/17171/5/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java@959
PS5, Line 959:       curIter_ = subIter;
Should curIter_ be cleared first so it is not stale if there is an exception?



-- 
To view, visit http://gerrit.cloudera.org:8080/17171
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I859bd4f976c51a34eb6a03cefd2ddcdf11656cea
Gerrit-Change-Number: 17171
Gerrit-PatchSet: 5
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Comment-Date: Mon, 22 Mar 2021 18:07:24 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10579: Fix usage of RemoteIterator in FileSystemUtil

Posted by "Vihang Karajgaonkar (Code Review)" <ge...@cloudera.org>.
Vihang Karajgaonkar has posted comments on this change. ( http://gerrit.cloudera.org:8080/17171 )

Change subject: IMPALA-10579: Fix usage of RemoteIterator in FileSystemUtil
......................................................................


Patch Set 4:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/17171/4//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/17171/4//COMMIT_MSG@15
PS4, Line 15: dead
nit, I think using "infinite loop" is a more readable.


http://gerrit.cloudera.org:8080/#/c/17171/4//COMMIT_MSG@18
PS4, Line 18: thrown
nit, throw


http://gerrit.cloudera.org:8080/#/c/17171/4/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java
File fe/src/main/java/org/apache/impala/common/FileSystemUtil.java:

http://gerrit.cloudera.org:8080/#/c/17171/4/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java@847
PS4, Line 847: next = baseIterator_.next();
Does this mean that if the file which is present inside a directory and which is removed after the RemoteIterator was created, we will start seeing FileNotFoundException which can cause Table loading errors?



-- 
To view, visit http://gerrit.cloudera.org:8080/17171
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I859bd4f976c51a34eb6a03cefd2ddcdf11656cea
Gerrit-Change-Number: 17171
Gerrit-PatchSet: 4
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Comment-Date: Mon, 15 Mar 2021 17:17:40 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10579: Fix usage of RemoteIterator in FileSystemUtil

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Hello Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/17171

to look at the new patch set (#3).

Change subject: IMPALA-10579: Fix usage of RemoteIterator in FileSystemUtil
......................................................................

IMPALA-10579: Fix usage of RemoteIterator in FileSystemUtil

HDFS FileSystem provides a listStatusIterator() API for listing remote
storage using a RemoteIterator. We use it to list files when loading
table file metadata.

It's not guaranteed that a RemoteIterator can survive when its hasNext()
or next() throws IOExceptions. We should stop the loop in this case.
Otherwise, we may go into a dead loop.

Without HADOOP-16685, it's also not guaranteed that
FileSystem.listStatusIterator() will thrown a FileNotFoundException when
the path doesn't exist.

This patch refactors the file listing iterators so we don't need to
depend on these two assumptions.

Tests:
 - Loop test_insert_stress.py 100 times. Verified the non-existing
   subdirs are skipped and inserts are stable in a high concurrency.

Change-Id: I859bd4f976c51a34eb6a03cefd2ddcdf11656cea
---
M fe/src/main/java/org/apache/impala/common/FileSystemUtil.java
1 file changed, 100 insertions(+), 31 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/71/17171/3
-- 
To view, visit http://gerrit.cloudera.org:8080/17171
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I859bd4f976c51a34eb6a03cefd2ddcdf11656cea
Gerrit-Change-Number: 17171
Gerrit-PatchSet: 3
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>

[Impala-ASF-CR] IMPALA-10579: Fix usage of RemoteIterator in FileSystemUtil

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17171 )

Change subject: IMPALA-10579: Fix usage of RemoteIterator in FileSystemUtil
......................................................................


Patch Set 6: Verified+1


-- 
To view, visit http://gerrit.cloudera.org:8080/17171
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I859bd4f976c51a34eb6a03cefd2ddcdf11656cea
Gerrit-Change-Number: 17171
Gerrit-PatchSet: 6
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Comment-Date: Wed, 31 Mar 2021 08:45:04 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10579: Fix usage of RemoteIterator in FileSystemUtil

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17171 )

Change subject: IMPALA-10579: Fix usage of RemoteIterator in FileSystemUtil
......................................................................


Patch Set 6:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/7026/ DRY_RUN=false


-- 
To view, visit http://gerrit.cloudera.org:8080/17171
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I859bd4f976c51a34eb6a03cefd2ddcdf11656cea
Gerrit-Change-Number: 17171
Gerrit-PatchSet: 6
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Comment-Date: Wed, 31 Mar 2021 02:55:19 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10579: Fix usage of RemoteIterator in FileSystemUtil

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/17171 )

Change subject: IMPALA-10579: Fix usage of RemoteIterator in FileSystemUtil
......................................................................


Patch Set 5:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/17171/5/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java
File fe/src/main/java/org/apache/impala/common/FileSystemUtil.java:

http://gerrit.cloudera.org:8080/#/c/17171/5/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java@959
PS5, Line 959:       curIter_ = subIter;
> Should curIter_ be cleared first so it is not stale if there is an exceptio
No, it's not stale. When newIterFunc_ throws an exception, it means we can't go into the subdirectory. We should still use curIter_ to get the next subdirectory or file.



-- 
To view, visit http://gerrit.cloudera.org:8080/17171
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I859bd4f976c51a34eb6a03cefd2ddcdf11656cea
Gerrit-Change-Number: 17171
Gerrit-PatchSet: 5
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Comment-Date: Tue, 23 Mar 2021 01:18:26 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10579: Fix usage of RemoteIterator in FileSystemUtil

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/17171 )

Change subject: IMPALA-10579: Fix usage of RemoteIterator in FileSystemUtil
......................................................................

IMPALA-10579: Fix usage of RemoteIterator in FileSystemUtil

HDFS FileSystem provides a listStatusIterator() API for listing remote
storage using a RemoteIterator. We use it to list files when loading
table file metadata.

It's not guaranteed that a RemoteIterator can survive when its hasNext()
or next() throws IOExceptions. We should stop the loop in this case.
Otherwise, we may go into a infinite loop.

Without HADOOP-16685, it's also not guaranteed that
FileSystem.listStatusIterator() will throw a FileNotFoundException when
the path doesn't exist.

This patch refactors the file listing iterators so we don't need to
depend on these two assumptions. The basic idea is:
 - On one side, we should not depends on other RemoteIterator's behavior
   after exception.
 - On the other side, we try to make our own iterators more robust on
   transient sub-directories. So table loading won't be failed by them.

Tests:
 - Loop test_insert_stress.py 100 times. Verified the non-existing
   subdirs are skipped and inserts are stable in a high concurrency.

Change-Id: I859bd4f976c51a34eb6a03cefd2ddcdf11656cea
Reviewed-on: http://gerrit.cloudera.org:8080/17171
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
M fe/src/main/java/org/apache/impala/common/FileSystemUtil.java
1 file changed, 83 insertions(+), 32 deletions(-)

Approvals:
  Impala Public Jenkins: Looks good to me, approved; Verified

-- 
To view, visit http://gerrit.cloudera.org:8080/17171
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I859bd4f976c51a34eb6a03cefd2ddcdf11656cea
Gerrit-Change-Number: 17171
Gerrit-PatchSet: 7
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Vihang Karajgaonkar <vi...@cloudera.com>

[Impala-ASF-CR] IMPALA-10579: Fix usage of RemoteIterator in FileSystemUtil

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17171 )

Change subject: IMPALA-10579: Fix usage of RemoteIterator in FileSystemUtil
......................................................................


Patch Set 4:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/6951/ DRY_RUN=true


-- 
To view, visit http://gerrit.cloudera.org:8080/17171
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I859bd4f976c51a34eb6a03cefd2ddcdf11656cea
Gerrit-Change-Number: 17171
Gerrit-PatchSet: 4
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Thu, 11 Mar 2021 14:29:50 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10579: Fix usage of RemoteIterator in FileSystemUtil

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17171 )

Change subject: IMPALA-10579: Fix usage of RemoteIterator in FileSystemUtil
......................................................................


Patch Set 5:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/8374/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17171
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I859bd4f976c51a34eb6a03cefd2ddcdf11656cea
Gerrit-Change-Number: 17171
Gerrit-PatchSet: 5
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Comment-Date: Tue, 16 Mar 2021 07:58:31 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10579: Fix usage of RemoteIterator in FileSystemUtil

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17171 )

Change subject: IMPALA-10579: Fix usage of RemoteIterator in FileSystemUtil
......................................................................


Patch Set 4: Verified+1


-- 
To view, visit http://gerrit.cloudera.org:8080/17171
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I859bd4f976c51a34eb6a03cefd2ddcdf11656cea
Gerrit-Change-Number: 17171
Gerrit-PatchSet: 4
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Comment-Date: Fri, 12 Mar 2021 05:57:21 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10579: Fix usage of RemoteIterator in FileSystemUtil

Posted by "Vihang Karajgaonkar (Code Review)" <ge...@cloudera.org>.
Vihang Karajgaonkar has posted comments on this change. ( http://gerrit.cloudera.org:8080/17171 )

Change subject: IMPALA-10579: Fix usage of RemoteIterator in FileSystemUtil
......................................................................


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/17171/4/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java
File fe/src/main/java/org/apache/impala/common/FileSystemUtil.java:

http://gerrit.cloudera.org:8080/#/c/17171/4/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java@847
PS4, Line 847: next = baseIterator_.next();
> AFAIK, it depends on the fs implementation.
Thanks for the explanation. I think the primary motivation of catching the FileNotException was for the staging directory case as reported in IMPALA-9122. Since the staging directory iterator creation call will catch the exception in handleFileStat method and ignore it I think we should be fine.



-- 
To view, visit http://gerrit.cloudera.org:8080/17171
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I859bd4f976c51a34eb6a03cefd2ddcdf11656cea
Gerrit-Change-Number: 17171
Gerrit-PatchSet: 4
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Comment-Date: Mon, 29 Mar 2021 19:10:25 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10579: Fix usage of RemoteIterator in FileSystemUtil

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17171 )

Change subject: IMPALA-10579: Fix usage of RemoteIterator in FileSystemUtil
......................................................................


Patch Set 1:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/8338/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17171
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I859bd4f976c51a34eb6a03cefd2ddcdf11656cea
Gerrit-Change-Number: 17171
Gerrit-PatchSet: 1
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Thu, 11 Mar 2021 02:44:29 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10579: Fix usage of RemoteIterator in FileSystemUtil

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17171 )

Change subject: IMPALA-10579: Fix usage of RemoteIterator in FileSystemUtil
......................................................................


Patch Set 6: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/17171
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I859bd4f976c51a34eb6a03cefd2ddcdf11656cea
Gerrit-Change-Number: 17171
Gerrit-PatchSet: 6
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Comment-Date: Wed, 31 Mar 2021 02:55:18 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10579: Fix usage of RemoteIterator in FileSystemUtil

Posted by "Vihang Karajgaonkar (Code Review)" <ge...@cloudera.org>.
Vihang Karajgaonkar has posted comments on this change. ( http://gerrit.cloudera.org:8080/17171 )

Change subject: IMPALA-10579: Fix usage of RemoteIterator in FileSystemUtil
......................................................................


Patch Set 5: Code-Review+1

+1 from my side and bumping it up to +2 based on Kurt's vote from earlier.


-- 
To view, visit http://gerrit.cloudera.org:8080/17171
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I859bd4f976c51a34eb6a03cefd2ddcdf11656cea
Gerrit-Change-Number: 17171
Gerrit-PatchSet: 5
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Comment-Date: Mon, 29 Mar 2021 19:10:55 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10579: Fix usage of RemoteIterator in FileSystemUtil

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Hello Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/17171

to look at the new patch set (#4).

Change subject: IMPALA-10579: Fix usage of RemoteIterator in FileSystemUtil
......................................................................

IMPALA-10579: Fix usage of RemoteIterator in FileSystemUtil

HDFS FileSystem provides a listStatusIterator() API for listing remote
storage using a RemoteIterator. We use it to list files when loading
table file metadata.

It's not guaranteed that a RemoteIterator can survive when its hasNext()
or next() throws IOExceptions. We should stop the loop in this case.
Otherwise, we may go into a dead loop.

Without HADOOP-16685, it's also not guaranteed that
FileSystem.listStatusIterator() will thrown a FileNotFoundException when
the path doesn't exist.

This patch refactors the file listing iterators so we don't need to
depend on these two assumptions. The basic idea is:
 - On one side, we should not depends on other RemoteIterator's behavior
   after exception.
 - On the other side, we try to make our own iterators more robust on
   transient sub-directories. So table loading won't be failed by them.

Tests:
 - Loop test_insert_stress.py 100 times. Verified the non-existing
   subdirs are skipped and inserts are stable in a high concurrency.

Change-Id: I859bd4f976c51a34eb6a03cefd2ddcdf11656cea
---
M fe/src/main/java/org/apache/impala/common/FileSystemUtil.java
1 file changed, 100 insertions(+), 31 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/71/17171/4
-- 
To view, visit http://gerrit.cloudera.org:8080/17171
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I859bd4f976c51a34eb6a03cefd2ddcdf11656cea
Gerrit-Change-Number: 17171
Gerrit-PatchSet: 4
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>

[Impala-ASF-CR] IMPALA-10579: Fix usage of RemoteIterator in FileSystemUtil

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/17171 )

Change subject: IMPALA-10579: Fix usage of RemoteIterator in FileSystemUtil
......................................................................


Patch Set 5:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/17171/4//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/17171/4//COMMIT_MSG@15
PS4, Line 15: infi
> nit, I think using "infinite loop" is a more readable.
Done


http://gerrit.cloudera.org:8080/#/c/17171/4//COMMIT_MSG@18
PS4, Line 18: throw 
> nit, throw
Done


http://gerrit.cloudera.org:8080/#/c/17171/4/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java
File fe/src/main/java/org/apache/impala/common/FileSystemUtil.java:

http://gerrit.cloudera.org:8080/#/c/17171/4/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java@847
PS4, Line 847: erIterator(Path startPath, R
> Does this mean that if the file which is present inside a directory and whi
AFAIK, it depends on the fs implementation.

First of all, the exception is more possible to be thrown by hasNext() than next() in this pattern. Both the hasNext() and next() calls could throw IOException. AFAIK, most of the implementations cache some listing results when hasNext() is called. In their next() call, hasNext() will also be called again in case it's not called. So if hasNext() doesn't throw exception, the following next() call won't throw exception either, i.e. next() will return an item from the cache results. It could be a stale item that already removed after the last fetch.

Secondly, it's more likely to get FileNotFoundException in these two cases:
1. When this is a recursive RemoteIterator and when the absent item is a subdirectory (instead of a file), the iterator gets exception when entering(listing) the subdirectory.
2. When the current directory being listing is removed, the iterator gets exception when it wants to fetch more results.

If the absent directory is a partition folder, it's possible to hit case #2. E.g. when a high level partition is removed. I think it's ok to cause table loading errors. Otherwise, we may have loaded some stale files.

If the absent directory is a staging folder (created by Hive/Impala INSERTs), it's more likely to hit case #1, because staging folder usually won't have more than 1000 files. The iterator gets all of them in one fetch. However, it still depends on the fs implementation. Take listing on HDFS as an example. The fetch size defaults to 1000 (configured by dfs.ls.limit).

For case #1, our expectation is ignoring the absent subdirectory and continuely listing. So this patch catches the FileNotFoundException in RecursivingIterator#hasNext() when calling handleFileStat(), instead of catching it here. This patch also prefers using RecursivingIterator if the target FS doesn't have a real resursive listing like S3. It avoids hitting case #1.



-- 
To view, visit http://gerrit.cloudera.org:8080/17171
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I859bd4f976c51a34eb6a03cefd2ddcdf11656cea
Gerrit-Change-Number: 17171
Gerrit-PatchSet: 5
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Vihang Karajgaonkar <vi...@cloudera.com>
Gerrit-Comment-Date: Tue, 16 Mar 2021 07:38:46 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10579: Fix usage of RemoteIterator in FileSystemUtil

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Hello Vihang Karajgaonkar, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/17171

to look at the new patch set (#5).

Change subject: IMPALA-10579: Fix usage of RemoteIterator in FileSystemUtil
......................................................................

IMPALA-10579: Fix usage of RemoteIterator in FileSystemUtil

HDFS FileSystem provides a listStatusIterator() API for listing remote
storage using a RemoteIterator. We use it to list files when loading
table file metadata.

It's not guaranteed that a RemoteIterator can survive when its hasNext()
or next() throws IOExceptions. We should stop the loop in this case.
Otherwise, we may go into a infinite loop.

Without HADOOP-16685, it's also not guaranteed that
FileSystem.listStatusIterator() will throw a FileNotFoundException when
the path doesn't exist.

This patch refactors the file listing iterators so we don't need to
depend on these two assumptions. The basic idea is:
 - On one side, we should not depends on other RemoteIterator's behavior
   after exception.
 - On the other side, we try to make our own iterators more robust on
   transient sub-directories. So table loading won't be failed by them.

Tests:
 - Loop test_insert_stress.py 100 times. Verified the non-existing
   subdirs are skipped and inserts are stable in a high concurrency.

Change-Id: I859bd4f976c51a34eb6a03cefd2ddcdf11656cea
---
M fe/src/main/java/org/apache/impala/common/FileSystemUtil.java
1 file changed, 83 insertions(+), 32 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/71/17171/5
-- 
To view, visit http://gerrit.cloudera.org:8080/17171
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I859bd4f976c51a34eb6a03cefd2ddcdf11656cea
Gerrit-Change-Number: 17171
Gerrit-PatchSet: 5
Gerrit-Owner: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Vihang Karajgaonkar <vi...@cloudera.com>