You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Bharath Vissapragada (Code Review)" <ge...@cloudera.org> on 2017/06/06 16:50:05 UTC

[Impala-ASF-CR] IMPALA-5431: Remove redundant path exists check during table load

Bharath Vissapragada has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/7095

Change subject: IMPALA-5431: Remove redundant path exists check during table load
......................................................................

IMPALA-5431: Remove redundant path exists check during table load

It is already handled by refreshFileMetadata() and hence saves
one RPC to NN.

Testing: Enough tests already cover this code path. This patch
passed core and exhaustive tests.

Change-Id: Id10ecf64ea2eda2d0f9299c0aa371933eca22281
---
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
1 file changed, 0 insertions(+), 1 deletion(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/95/7095/1
-- 
To view, visit http://gerrit.cloudera.org:8080/7095
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: Id10ecf64ea2eda2d0f9299c0aa371933eca22281
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Bharath Vissapragada <bh...@cloudera.com>

[Impala-ASF-CR] IMPALA-5431: Remove redundant path exists checks during table load

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change.

Change subject: IMPALA-5431: Remove redundant path exists checks during table load
......................................................................


Patch Set 5: Code-Review+1

-- 
To view, visit http://gerrit.cloudera.org:8080/7095
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id10ecf64ea2eda2d0f9299c0aa371933eca22281
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5431: Remove redundant path exists check during table load

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change.

Change subject: IMPALA-5431: Remove redundant path exists check during table load
......................................................................


Patch Set 2:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/7095/2/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java:

Line 267:       if (partsByPath.size() == 0 || !fs.exists(dirPath)) return;
Is this exists() really needed? We are going to do a listFiles() which throws is the path does not exist. In that sense this separate check seems unnecessary.


Line 647:       if (fs.exists(tblLocation)) {
We check exists() inside getAvailableAccessLevel()


Line 707:       if (!fs.exists(partDir)) {
listStatus() throws if the path does not exist, in that sense this check seems redundant


Line 1108:       if (fs.exists(location)) {
exists() is checked inside getAvailableAccessLevel()


Line 1662:     if (!fs.exists(path)) return;
Is this really necessary? Whatever we do inside getAllPartitionsNotInHms() should be able to detect the non-existence as part of the operation


-- 
To view, visit http://gerrit.cloudera.org:8080/7095
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id10ecf64ea2eda2d0f9299c0aa371933eca22281
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5431: Remove redundant path exists check during table load

Posted by "Bharath Vissapragada (Code Review)" <ge...@cloudera.org>.
Bharath Vissapragada has posted comments on this change.

Change subject: IMPALA-5431: Remove redundant path exists check during table load
......................................................................


Patch Set 2:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/7095/2/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java:

Line 267:       if (partsByPath.size() == 0 || !fs.exists(dirPath)) return;
> Is this exists() really needed? We are going to do a listFiles() which thro
Okay. I felt this adds a lot of try{} catch{} to the code and hence didn't do it. But if it saves an RPC, why not :)


Line 647:       if (fs.exists(tblLocation)) {
> We check exists() inside getAvailableAccessLevel()
I thought about this, but didn't want to do it because of the following scenario. If the table directory doesn't exist, getAvailableAccessLevel() recurses up to the parent path, and if the parent had a READ only permission for example, that would make the table READ_ONLY. Seemed like a regression in  behavior if a user wants to insert data into it. Not totally sure if such a situation is possible.


Line 707:       if (!fs.exists(partDir)) {
> listStatus() throws if the path does not exist, in that sense this check se
Done


Line 1108:       if (fs.exists(location)) {
> exists() is checked inside getAvailableAccessLevel()
Same as above, please let me know if I'm wrong.


Line 1662:     if (!fs.exists(path)) return;
> Is this really necessary? Whatever we do inside getAllPartitionsNotInHms() 
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/7095
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id10ecf64ea2eda2d0f9299c0aa371933eca22281
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5431: Remove redundant path exists checks during table load

Posted by "Bharath Vissapragada (Code Review)" <ge...@cloudera.org>.
Bharath Vissapragada has posted comments on this change.

Change subject: IMPALA-5431: Remove redundant path exists checks during table load
......................................................................


Patch Set 7: Code-Review+2

(3 comments)

Carrying +2.

http://gerrit.cloudera.org:8080/#/c/7095/5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java:

PS5, Line 302: RemoteIterator<LocatedFileStatus> fileStatusIter =
> nit: move it below L305. If we're going to return there is no need to get a
Done


PS5, Line 727: DataFile(fileStatus)) continue;
> newFileDescs?
Done. Rewrote a little.


http://gerrit.cloudera.org:8080/#/c/7095/5/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java
File fe/src/main/java/org/apache/impala/common/FileSystemUtil.java:

Line 498:       return fs.listStatus(p);
> I am thinking maybe add a warn in the log that the path doesn't exist.
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/7095
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id10ecf64ea2eda2d0f9299c0aa371933eca22281
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5431: Remove redundant path exists checks during table load

Posted by "Bharath Vissapragada (Code Review)" <ge...@cloudera.org>.
Hello Dimitris Tsirogiannis,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7095

to look at the new patch set (#4).

Change subject: IMPALA-5431: Remove redundant path exists checks during table load
......................................................................

IMPALA-5431: Remove redundant path exists checks during table load

There are multiple places that do an exists() check on a path and then
perform some subsequent action on it. This pattern results in two
RPCs to the NN (one for the exists() check and one for the subsequent
action). We can avoid the exists() check in these cases since most HDFS
methods on paths throw a FileNotFoundException if the path does not
exist. This can save an RPC to NN and improve the metadata loading time.

Testing: Enough tests already cover this code path. This patch
passed core and exhaustive tests.

Change-Id: Id10ecf64ea2eda2d0f9299c0aa371933eca22281
---
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/common/FileSystemUtil.java
2 files changed, 59 insertions(+), 23 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/95/7095/4
-- 
To view, visit http://gerrit.cloudera.org:8080/7095
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Id10ecf64ea2eda2d0f9299c0aa371933eca22281
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>

[Impala-ASF-CR] IMPALA-5431: Remove redundant path exists check during table load

Posted by "Bharath Vissapragada (Code Review)" <ge...@cloudera.org>.
Hello Dimitris Tsirogiannis,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7095

to look at the new patch set (#2).

Change subject: IMPALA-5431: Remove redundant path exists check during table load
......................................................................

IMPALA-5431: Remove redundant path exists check during table load

It is already handled by refreshFileMetadata() and hence saves
one RPC to NN.

Testing: Enough tests already cover this code path. This patch
passed core and exhaustive tests.

Change-Id: Id10ecf64ea2eda2d0f9299c0aa371933eca22281
---
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
1 file changed, 0 insertions(+), 3 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/95/7095/2
-- 
To view, visit http://gerrit.cloudera.org:8080/7095
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Id10ecf64ea2eda2d0f9299c0aa371933eca22281
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>

[Impala-ASF-CR] IMPALA-5431: Remove redundant path exists checks during table load

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change.

Change subject: IMPALA-5431: Remove redundant path exists checks during table load
......................................................................


Patch Set 7: Verified+1

-- 
To view, visit http://gerrit.cloudera.org:8080/7095
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id10ecf64ea2eda2d0f9299c0aa371933eca22281
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5431: Remove redundant path exists check during table load

Posted by "Dimitris Tsirogiannis (Code Review)" <ge...@cloudera.org>.
Dimitris Tsirogiannis has posted comments on this change.

Change subject: IMPALA-5431: Remove redundant path exists check during table load
......................................................................


Patch Set 1: Code-Review+2

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7095/1/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java:

PS1, Line 1486: FileSystem fs = partDirPath.getFileSystem(CONF);
remove


-- 
To view, visit http://gerrit.cloudera.org:8080/7095
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id10ecf64ea2eda2d0f9299c0aa371933eca22281
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5431: Remove redundant path exists checks during table load

Posted by "Bharath Vissapragada (Code Review)" <ge...@cloudera.org>.
Hello Dimitris Tsirogiannis, Alex Behm,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7095

to look at the new patch set (#7).

Change subject: IMPALA-5431: Remove redundant path exists checks during table load
......................................................................

IMPALA-5431: Remove redundant path exists checks during table load

There are multiple places that do an exists() check on a path and then
perform some subsequent action on it. This pattern results in two
RPCs to the NN (one for the exists() check and one for the subsequent
action). We can avoid the exists() check in these cases since most HDFS
methods on paths throw a FileNotFoundException if the path does not
exist. This can save an RPC to NN and improve the metadata loading time.

Testing: Enough tests already cover this code path. This patch
passed core and exhaustive tests.

Metadata benchmark shows decent increase in perf numbers, for ex:

100K-PARTITIONS-1M-FILES-CUSTOM-05-QUERY-AFTER-INV -20.51%
80-PARTITIONS-250K-FILES-S3-03-RECOVER -20.58%
80-PARTITIONS-250K-FILES-11-DROP-PARTITION -22.13%
80-PARTITIONS-250K-FILES-S3-08-ADD-PARTITION -22.38%
80-PARTITIONS-250K-FILES-S3-12-DROP -23.69%
100K-PARTITIONS-1M-FILES-CUSTOM-11-REFRESH-PARTITION -23.91%
100K-PARTITIONS-1M-FILES-CUSTOM-10-REFRESH-AFTER-ADD-PARTITION -26.04%
100K-PARTITIONS-1M-FILES-CUSTOM-07-REFRESH -26.38%
80-PARTITIONS-250K-FILES-S3-02-CREATE -36.47%
100K-PARTITIONS-1M-FILES-CUSTOM-12-QUERY-PARTITIONS -58.72%
80-PARTITIONS-250K-FILES-S3-01-DROP -95.33%
80-PARTITIONS-250K-FILES-01-DROP -95.93%

Change-Id: Id10ecf64ea2eda2d0f9299c0aa371933eca22281
---
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/common/FileSystemUtil.java
2 files changed, 64 insertions(+), 39 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/95/7095/7
-- 
To view, visit http://gerrit.cloudera.org:8080/7095
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Id10ecf64ea2eda2d0f9299c0aa371933eca22281
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>

[Impala-ASF-CR] IMPALA-5431: Remove redundant path exists checks during table load

Posted by "Dimitris Tsirogiannis (Code Review)" <ge...@cloudera.org>.
Dimitris Tsirogiannis has posted comments on this change.

Change subject: IMPALA-5431: Remove redundant path exists checks during table load
......................................................................


Patch Set 6: Code-Review+2

(3 comments)

http://gerrit.cloudera.org:8080/#/c/7095/5/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java:

PS5, Line 302: Reference<Long> numUnknownDiskIds = new Reference<Long>(Long.valueOf(0));
nit: move it below L305. If we're going to return there is no need to get a new Reference.


PS5, Line 727: new ArrayList<FileDescriptor>()
newFileDescs?


http://gerrit.cloudera.org:8080/#/c/7095/5/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java
File fe/src/main/java/org/apache/impala/common/FileSystemUtil.java:

Line 498:     } catch (FileNotFoundException e) {
I am thinking maybe add a warn in the log that the path doesn't exist.


-- 
To view, visit http://gerrit.cloudera.org:8080/7095
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id10ecf64ea2eda2d0f9299c0aa371933eca22281
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5431: Remove redundant path exists checks during table load

Posted by "Bharath Vissapragada (Code Review)" <ge...@cloudera.org>.
Hello Dimitris Tsirogiannis, Alex Behm,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7095

to look at the new patch set (#6).

Change subject: IMPALA-5431: Remove redundant path exists checks during table load
......................................................................

IMPALA-5431: Remove redundant path exists checks during table load

There are multiple places that do an exists() check on a path and then
perform some subsequent action on it. This pattern results in two
RPCs to the NN (one for the exists() check and one for the subsequent
action). We can avoid the exists() check in these cases since most HDFS
methods on paths throw a FileNotFoundException if the path does not
exist. This can save an RPC to NN and improve the metadata loading time.

Testing: Enough tests already cover this code path. This patch
passed core and exhaustive tests.

Metadata benchmark shows decent increase in perf numbers, for ex:

100K-PARTITIONS-1M-FILES-CUSTOM-05-QUERY-AFTER-INV -20.51%
80-PARTITIONS-250K-FILES-S3-03-RECOVER -20.58%
80-PARTITIONS-250K-FILES-11-DROP-PARTITION -22.13%
80-PARTITIONS-250K-FILES-S3-08-ADD-PARTITION -22.38%
80-PARTITIONS-250K-FILES-S3-12-DROP -23.69%
100K-PARTITIONS-1M-FILES-CUSTOM-11-REFRESH-PARTITION -23.91%
100K-PARTITIONS-1M-FILES-CUSTOM-10-REFRESH-AFTER-ADD-PARTITION -26.04%
100K-PARTITIONS-1M-FILES-CUSTOM-07-REFRESH -26.38%
80-PARTITIONS-250K-FILES-S3-02-CREATE -36.47%
100K-PARTITIONS-1M-FILES-CUSTOM-12-QUERY-PARTITIONS -58.72%
80-PARTITIONS-250K-FILES-S3-01-DROP -95.33%
80-PARTITIONS-250K-FILES-01-DROP -95.93%

Change-Id: Id10ecf64ea2eda2d0f9299c0aa371933eca22281
---
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/common/FileSystemUtil.java
2 files changed, 51 insertions(+), 26 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/95/7095/6
-- 
To view, visit http://gerrit.cloudera.org:8080/7095
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Id10ecf64ea2eda2d0f9299c0aa371933eca22281
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>

[Impala-ASF-CR] IMPALA-5431: Remove redundant path exists checks during table load

Posted by "Bharath Vissapragada (Code Review)" <ge...@cloudera.org>.
Hello Dimitris Tsirogiannis, Alex Behm,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7095

to look at the new patch set (#5).

Change subject: IMPALA-5431: Remove redundant path exists checks during table load
......................................................................

IMPALA-5431: Remove redundant path exists checks during table load

There are multiple places that do an exists() check on a path and then
perform some subsequent action on it. This pattern results in two
RPCs to the NN (one for the exists() check and one for the subsequent
action). We can avoid the exists() check in these cases since most HDFS
methods on paths throw a FileNotFoundException if the path does not
exist. This can save an RPC to NN and improve the metadata loading time.

Testing: Enough tests already cover this code path. This patch
passed core and exhaustive tests.

Change-Id: Id10ecf64ea2eda2d0f9299c0aa371933eca22281
---
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/common/FileSystemUtil.java
2 files changed, 51 insertions(+), 26 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/95/7095/5
-- 
To view, visit http://gerrit.cloudera.org:8080/7095
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Id10ecf64ea2eda2d0f9299c0aa371933eca22281
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>

[Impala-ASF-CR] IMPALA-5431: Remove redundant path exists check during table load

Posted by "Bharath Vissapragada (Code Review)" <ge...@cloudera.org>.
Bharath Vissapragada has posted comments on this change.

Change subject: IMPALA-5431: Remove redundant path exists check during table load
......................................................................


Patch Set 1:

(1 comment)

Alex, I did a quick audit of the exists() checks in HdfsTable and none of them seem to be duplicates/redundant. Could you confirm which ones you refer to in your last comment? Ofcourse we can cache that check per table/partition but I don't think we should do that and instead have a runtime check.

http://gerrit.cloudera.org:8080/#/c/7095/1/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java:

PS1, Line 1486: FileSystem fs = partDirPath.getFileSystem(CONF);
> remove
Looking into it, actually StorageDescriptor is not at all needed, we are just passing it around without using it. I'll check its implications and remove it in a follow up patch. Raised IMPALA-5533


-- 
To view, visit http://gerrit.cloudera.org:8080/7095
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id10ecf64ea2eda2d0f9299c0aa371933eca22281
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5431: Remove redundant path exists check during table load

Posted by "Bharath Vissapragada (Code Review)" <ge...@cloudera.org>.
Hello Dimitris Tsirogiannis,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7095

to look at the new patch set (#3).

Change subject: IMPALA-5431: Remove redundant path exists check during table load
......................................................................

IMPALA-5431: Remove redundant path exists check during table load

It is already handled by refreshFileMetadata() and hence saves
one RPC to NN.

Testing: Enough tests already cover this code path. This patch
passed core and exhaustive tests.

Change-Id: Id10ecf64ea2eda2d0f9299c0aa371933eca22281
---
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
1 file changed, 37 insertions(+), 26 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/95/7095/3
-- 
To view, visit http://gerrit.cloudera.org:8080/7095
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Id10ecf64ea2eda2d0f9299c0aa371933eca22281
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>

[Impala-ASF-CR] IMPALA-5431: Remove redundant path exists checks during table load

Posted by "Bharath Vissapragada (Code Review)" <ge...@cloudera.org>.
Bharath Vissapragada has posted comments on this change.

Change subject: IMPALA-5431: Remove redundant path exists checks during table load
......................................................................


Patch Set 4:

(7 comments)

As discussed, I'm checking the following,

- Effect on CTAS when the path doesn't exist.
- Running metadata benchmarks to see if there is some perf gain.

Meanwhile, posting the rebased patch that addresses the comments.

http://gerrit.cloudera.org:8080/#/c/7095/2/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java:

Line 267:       // No need to load blocks for empty partitions list.
> Right, but saving an RPC seems worth it.
Done (in PS4).


Line 725:             }
> remove while you are here
Done


http://gerrit.cloudera.org:8080/#/c/7095/4/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java:

Line 266:       FileSystem fs = dirPath.getFileSystem(CONF);
> Move this closer to where it is used. getFileSystem() tends to be expensive
Done


Line 305:       if (fileStatusIter == null) {
> I think you can make this a single line and remove the comment.
Done


Line 371:     if (fileStatusIter == null) {
> I think you can make this a single line and remove the comment.
Done


Line 1709:     if (statuses == null) {
> I think you can make this a single line and remove the comment.
Done


http://gerrit.cloudera.org:8080/#/c/7095/4/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java
File fe/src/main/java/org/apache/impala/common/FileSystemUtil.java:

Line 491:    * the path does not exist. This helps simplifies the caller code in cases where
> This helps simplify the caller code...
Ouch, done.


-- 
To view, visit http://gerrit.cloudera.org:8080/7095
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id10ecf64ea2eda2d0f9299c0aa371933eca22281
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5431: Remove redundant path exists checks during table load

Posted by "Bharath Vissapragada (Code Review)" <ge...@cloudera.org>.
Bharath Vissapragada has posted comments on this change.

Change subject: IMPALA-5431: Remove redundant path exists checks during table load
......................................................................


Patch Set 6:

- Updated the commit message with benchmark numbers. 

- Also, looks like CTAS analysis is likely a non-issue (also discussed with Alex offline) since the "create" part of CTAS results in creation of base directory for table (by HMS) and we always inherit correct ACLs of parent and the case that the table directory doesn't exist is not possible. 

- Ran the following test to confirm that.

➜  Impala git:(i5431) ✗ hadoop fs -setfacl -m  user:bharath:rw- /tmp/foo
➜  Impala git:(i5431) ✗ hadoop fs -getfacl /tmp/foo
# file: /tmp/foo
# owner: bharath
# group: supergroup
user::rwx
user:bharath:rw-
group::r-x
mask::rwx
other::r-x

[localhost:21000] > create table bar location '/tmp/foo/bar' as select * from tpch_parquet.nation;
Query: create table bar location '/tmp/foo/bar' as select * from tpch_parquet.nation
Query submitted at: 2017-06-26 15:55:28 (Coordinator: http://optimus:25000)
Query progress can be monitored at: http://optimus:25000/query_plan?query_id=8d4f60928bebcc9c:f7e2c92f00000000
+--------------------+
| summary            |
+--------------------+
| Inserted 25 row(s) |
+--------------------+

[localhost:21000] > select count(*) from bar;
Query: select count(*) from bar
Query submitted at: 2017-06-26 15:55:44 (Coordinator: http://optimus:25000)
Query progress can be monitored at: http://optimus:25000/query_plan?query_id=8e44ede23419d848:afac511600000000
+----------+
| count(*) |
+----------+
| 25       |
+----------+

➜  Impala git:(i5431) ✗ hadoop fs -getfacl /tmp/foo/bar
# file: /tmp/foo/bar
# owner: bharath
# group: supergroup
user::rwx
user:bharath:rw-
group::rwx
mask::rwx
other::r-x

-- 
To view, visit http://gerrit.cloudera.org:8080/7095
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id10ecf64ea2eda2d0f9299c0aa371933eca22281
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5431: Remove redundant path exists check during table load

Posted by "Bharath Vissapragada (Code Review)" <ge...@cloudera.org>.
Bharath Vissapragada has posted comments on this change.

Change subject: IMPALA-5431: Remove redundant path exists check during table load
......................................................................


Patch Set 3:

Still need to update the commit message, I'll do once the patch is approved.

-- 
To view, visit http://gerrit.cloudera.org:8080/7095
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id10ecf64ea2eda2d0f9299c0aa371933eca22281
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5431: Remove redundant path exists checks during table load

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has submitted this change and it was merged.

Change subject: IMPALA-5431: Remove redundant path exists checks during table load
......................................................................


IMPALA-5431: Remove redundant path exists checks during table load

There are multiple places that do an exists() check on a path and then
perform some subsequent action on it. This pattern results in two
RPCs to the NN (one for the exists() check and one for the subsequent
action). We can avoid the exists() check in these cases since most HDFS
methods on paths throw a FileNotFoundException if the path does not
exist. This can save an RPC to NN and improve the metadata loading time.

Testing: Enough tests already cover this code path. This patch
passed core and exhaustive tests.

Metadata benchmark shows decent increase in perf numbers, for ex:

100K-PARTITIONS-1M-FILES-CUSTOM-05-QUERY-AFTER-INV -20.51%
80-PARTITIONS-250K-FILES-S3-03-RECOVER -20.58%
80-PARTITIONS-250K-FILES-11-DROP-PARTITION -22.13%
80-PARTITIONS-250K-FILES-S3-08-ADD-PARTITION -22.38%
80-PARTITIONS-250K-FILES-S3-12-DROP -23.69%
100K-PARTITIONS-1M-FILES-CUSTOM-11-REFRESH-PARTITION -23.91%
100K-PARTITIONS-1M-FILES-CUSTOM-10-REFRESH-AFTER-ADD-PARTITION -26.04%
100K-PARTITIONS-1M-FILES-CUSTOM-07-REFRESH -26.38%
80-PARTITIONS-250K-FILES-S3-02-CREATE -36.47%
100K-PARTITIONS-1M-FILES-CUSTOM-12-QUERY-PARTITIONS -58.72%
80-PARTITIONS-250K-FILES-S3-01-DROP -95.33%
80-PARTITIONS-250K-FILES-01-DROP -95.93%

Change-Id: Id10ecf64ea2eda2d0f9299c0aa371933eca22281
Reviewed-on: http://gerrit.cloudera.org:8080/7095
Reviewed-by: Bharath Vissapragada <bh...@cloudera.com>
Tested-by: Impala Public Jenkins
---
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/common/FileSystemUtil.java
2 files changed, 64 insertions(+), 39 deletions(-)

Approvals:
  Impala Public Jenkins: Verified
  Bharath Vissapragada: Looks good to me, approved



-- 
To view, visit http://gerrit.cloudera.org:8080/7095
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: Id10ecf64ea2eda2d0f9299c0aa371933eca22281
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins

[Impala-ASF-CR] IMPALA-5431: Remove redundant path exists check during table load

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change.

Change subject: IMPALA-5431: Remove redundant path exists check during table load
......................................................................


Patch Set 1:

Seems fine to me, but there are way more exists() places that can be removed imo

-- 
To view, visit http://gerrit.cloudera.org:8080/7095
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id10ecf64ea2eda2d0f9299c0aa371933eca22281
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5431: Remove redundant path exists checks during table load

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change.

Change subject: IMPALA-5431: Remove redundant path exists checks during table load
......................................................................


Patch Set 4: Code-Review+1

(7 comments)

http://gerrit.cloudera.org:8080/#/c/7095/2/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java:

Line 267:       // No need to load blocks for empty partitions list.
> Okay. I felt this adds a lot of try{} catch{} to the code and hence didn't 
Right, but saving an RPC seems worth it.


Line 725:             }
remove while you are here


http://gerrit.cloudera.org:8080/#/c/7095/4/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java:

Line 266:       FileSystem fs = dirPath.getFileSystem(CONF);
Move this closer to where it is used. getFileSystem() tends to be expensive, so great if we can avoid it!


Line 305:       if (fileStatusIter == null) {
I think you can make this a single line and remove the comment.


Line 371:     if (fileStatusIter == null) {
I think you can make this a single line and remove the comment.


Line 1709:     if (statuses == null) {
I think you can make this a single line and remove the comment.


http://gerrit.cloudera.org:8080/#/c/7095/4/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java
File fe/src/main/java/org/apache/impala/common/FileSystemUtil.java:

Line 491:    * the path does not exist. This helps simplifies the caller code in cases where
This helps simplify the caller code...


-- 
To view, visit http://gerrit.cloudera.org:8080/7095
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id10ecf64ea2eda2d0f9299c0aa371933eca22281
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5431: Remove redundant path exists checks during table load

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change.

Change subject: IMPALA-5431: Remove redundant path exists checks during table load
......................................................................


Patch Set 7:

Build started: http://jenkins.impala.io:8080/job/gerrit-verify-dryrun/791/

-- 
To view, visit http://gerrit.cloudera.org:8080/7095
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Id10ecf64ea2eda2d0f9299c0aa371933eca22281
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-HasComments: No