You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Alex Behm (Code Review)" <ge...@cloudera.org> on 2017/01/19 16:03:29 UTC

[Impala-ASF-CR] IMPALA-4789: Fix slow metadata loading due to inconsistent paths.

Alex Behm has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/5743

Change subject: IMPALA-4789: Fix slow metadata loading due to inconsistent paths.
......................................................................

IMPALA-4789: Fix slow metadata loading due to inconsistent paths.

The fix for IMPALA-4172/IMPALA-3653 introduced a performance
regression for loading tables that have many partitions with:
1. inconsistent HDFS path qualification or
2. a custom location (not under the table root dir)

For the first issue consider a table whose root path is at
'hdfs://localhost:8020/warehouse/tbl/'.
A partition with an unqualified location '/warehouse/tbl/p=1'
will not be recognized as being a descendant of the table root
dir by FileSystemUtil.isDescendentPath() because of how
Path.equals() behaves, even if 'hdfs://localhost:8020' is the
default filesystem.
Such partitions are incorrectly recognized as having a custom
location and are treated specially. The treatment of such
partitions is very inefficient.

This patch fixes the detection of partitions with custom
locations, and improves the speed of loading partitions
with custom locations.

Change-Id: I8c881b7cb155032b82fba0e29350ca31de388d55
---
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/common/FileSystemUtil.java
2 files changed, 35 insertions(+), 9 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/43/5743/1
-- 
To view, visit http://gerrit.cloudera.org:8080/5743
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I8c881b7cb155032b82fba0e29350ca31de388d55
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <al...@cloudera.com>

[Impala-ASF-CR] IMPALA-4789: Fix slow metadata loading due to inconsistent paths.

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change.

Change subject: IMPALA-4789: Fix slow metadata loading due to inconsistent paths.
......................................................................


Patch Set 6: Code-Review+2

rebase

-- 
To view, visit http://gerrit.cloudera.org:8080/5743
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8c881b7cb155032b82fba0e29350ca31de388d55
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-4789: Fix slow metadata loading due to inconsistent paths.

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change.

Change subject: IMPALA-4789: Fix slow metadata loading due to inconsistent paths.
......................................................................


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5743/4/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java
File fe/src/main/java/org/apache/impala/common/FileSystemUtil.java:

PS4, Line 454: private static boolean nullOrEqual(Object a, Object b) {
             :     if (a == null && b == null) return true;
             :     if (a != null && b != null) return a.equals(b);
             :     return false;
             :   }
> Use Guava's Objects.equal()? I think it has the same semantics as this one.
Even better. Done.


-- 
To view, visit http://gerrit.cloudera.org:8080/5743
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8c881b7cb155032b82fba0e29350ca31de388d55
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4789: Fix slow metadata loading due to inconsistent paths.

Posted by "Dimitris Tsirogiannis (Code Review)" <ge...@cloudera.org>.
Dimitris Tsirogiannis has posted comments on this change.

Change subject: IMPALA-4789: Fix slow metadata loading due to inconsistent paths.
......................................................................


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5743/4/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java
File fe/src/main/java/org/apache/impala/common/FileSystemUtil.java:

PS4, Line 454: private static boolean nullOrEqual(Object a, Object b) {
             :     if (a == null && b == null) return true;
             :     if (a != null && b != null) return a.equals(b);
             :     return false;
             :   }
Use Guava's Objects.equal()? I think it has the same semantics as this one.


-- 
To view, visit http://gerrit.cloudera.org:8080/5743
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8c881b7cb155032b82fba0e29350ca31de388d55
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4789: Fix slow metadata loading due to inconsistent paths.

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change.

Change subject: IMPALA-4789: Fix slow metadata loading due to inconsistent paths.
......................................................................


Patch Set 3:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/5743/3/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java:

Line 275:         // a descendant of dirPath.
> do we already have test cases that will take this path?
This is the 'default' path for loading the metadata of a partitioned table, so almost all tables load through this path. The above path is covered by things like INSERT, ALTER TABLE ADD PARTITION, etc. so in terms of code coverage we have everything.


Line 693:     tblLocation = tblLocation.makeQualified(fs.getUri(), tblLocation);
> here and elsewhere: should we use createFullyQualifiedPath() to avoid the b
Yes. Didn't know about that. Done.


http://gerrit.cloudera.org:8080/#/c/5743/3/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java
File fe/src/main/java/org/apache/impala/common/FileSystemUtil.java:

PS3, Line 440: !=
> nit: I feel an XOR makes the intention more clear than using a !=, especial
Rewrote this with a helper function to simplify.


PS3, Line 442: pUri.getScheme().equals(parentUri.getScheme())
> An example that could hit NPE. isDescendantPath(new Path("/foo/bar"), new P
You are right. Rewrote this with a helper function to simplify.


-- 
To view, visit http://gerrit.cloudera.org:8080/5743
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8c881b7cb155032b82fba0e29350ca31de388d55
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4789: Fix slow metadata loading due to inconsistent paths.

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change.

Change subject: IMPALA-4789: Fix slow metadata loading due to inconsistent paths.
......................................................................


Patch Set 5:

No

-- 
To view, visit http://gerrit.cloudera.org:8080/5743
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8c881b7cb155032b82fba0e29350ca31de388d55
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-4789: Fix slow metadata loading due to inconsistent paths.

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change.

Change subject: IMPALA-4789: Fix slow metadata loading due to inconsistent paths.
......................................................................


Patch Set 6: Verified+1

-- 
To view, visit http://gerrit.cloudera.org:8080/5743
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8c881b7cb155032b82fba0e29350ca31de388d55
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-4789: Fix slow metadata loading due to inconsistent paths.

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Hello Bharath Vissapragada,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5743

to look at the new patch set (#4).

Change subject: IMPALA-4789: Fix slow metadata loading due to inconsistent paths.
......................................................................

IMPALA-4789: Fix slow metadata loading due to inconsistent paths.

The fix for IMPALA-4172/IMPALA-3653 introduced a performance
regression for loading tables that have many partitions with:
1. Inconsistent HDFS path qualification or
2. A custom location (not under the table root dir)

For the first issue consider a table whose root path is at
'hdfs://localhost:8020/warehouse/tbl/'.
A partition with an unqualified location '/warehouse/tbl/p=1'
will not be recognized as being a descendant of the table root
dir by FileSystemUtil.isDescendentPath() because of how
Path.equals() behaves, even if 'hdfs://localhost:8020' is the
default filesystem. Such partitions are incorrectly recognized
as having a custom location and are loaded separately.

There were two performance issues:
1. The code for loading the files/blocks of partitions with
   truly custom locations was inefficient with an O(N^2)
   loop for determining the target partitions.
2. Each partition that is incorrectly identified as having
   a custom path (e.g. due to inconsistent qualification),
   is going to have its files/blocks loaded twice. Once
   when the table root path is processed, and once when the
   'custom' partition is processed.

This patch fixes the detection of partitions with custom
locations, and improves the speed of loading partitions
with custom locations.

Change-Id: I8c881b7cb155032b82fba0e29350ca31de388d55
---
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/common/FileSystemUtil.java
2 files changed, 67 insertions(+), 18 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/43/5743/4
-- 
To view, visit http://gerrit.cloudera.org:8080/5743
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8c881b7cb155032b82fba0e29350ca31de388d55
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>

[Impala-ASF-CR] IMPALA-4789: Fix slow metadata loading due to inconsistent paths.

Posted by "Bharath Vissapragada (Code Review)" <ge...@cloudera.org>.
Bharath Vissapragada has posted comments on this change.

Change subject: IMPALA-4789: Fix slow metadata loading due to inconsistent paths.
......................................................................


Patch Set 3:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/5743/1/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java
File fe/src/main/java/org/apache/impala/common/FileSystemUtil.java:

Line 433:     while (!p.isRoot() && p.depth() != parent.depth()) p = p.getParent();
> I'd prefer to leave that as follow-on work because it requires additional i
Sorry missed this. I Agree with you.


http://gerrit.cloudera.org:8080/#/c/5743/3/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java
File fe/src/main/java/org/apache/impala/common/FileSystemUtil.java:

PS3, Line 442: pUri.getScheme().equals(parentUri.getScheme())
> Wouldn't this still hit an NPE? For example, if both pUri.getScheme() and 
An example that could hit NPE. isDescendantPath(new Path("/foo/bar"), new Path("/blah/blah"));


-- 
To view, visit http://gerrit.cloudera.org:8080/5743
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8c881b7cb155032b82fba0e29350ca31de388d55
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4789: Fix slow metadata loading due to inconsistent paths.

Posted by "Dimitris Tsirogiannis (Code Review)" <ge...@cloudera.org>.
Dimitris Tsirogiannis has posted comments on this change.

Change subject: IMPALA-4789: Fix slow metadata loading due to inconsistent paths.
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5743/1/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java:

PS1, Line 698: // TODO: We can still do some advanced optimization by grouping all the partition
             :     // directories under the same ancestor path up the tree.
             :     List<Path> dirsToLoad = Lists.newArrayList(tblLocation);
We already have the ability to compress partition paths by removing the common prefix (see HdfsPartitionLocationCompressor). Can we utilize/expose some of the functions in this class to create a map of common prefixes to list of partition paths which is essentially the grouping described in this TODO? Then we can avoid the overhead in L274-280.


-- 
To view, visit http://gerrit.cloudera.org:8080/5743
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8c881b7cb155032b82fba0e29350ca31de388d55
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4789: Fix slow metadata loading due to inconsistent paths.

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change.

Change subject: IMPALA-4789: Fix slow metadata loading due to inconsistent paths.
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5743/1/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java:

Line 760:       partsByPath.put(partDirPath, Lists.newArrayList(partition));
do we know that on this path it will be a qualified path, or do we not care since we're always using partDirPath as the key?


-- 
To view, visit http://gerrit.cloudera.org:8080/5743
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8c881b7cb155032b82fba0e29350ca31de388d55
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4789: Fix slow metadata loading due to inconsistent paths.

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has submitted this change and it was merged.

Change subject: IMPALA-4789: Fix slow metadata loading due to inconsistent paths.
......................................................................


IMPALA-4789: Fix slow metadata loading due to inconsistent paths.

The fix for IMPALA-4172/IMPALA-3653 introduced a performance
regression for loading tables that have many partitions with:
1. Inconsistent HDFS path qualification or
2. A custom location (not under the table root dir)

For the first issue consider a table whose root path is at
'hdfs://localhost:8020/warehouse/tbl/'.
A partition with an unqualified location '/warehouse/tbl/p=1'
will not be recognized as being a descendant of the table root
dir by FileSystemUtil.isDescendentPath() because of how
Path.equals() behaves, even if 'hdfs://localhost:8020' is the
default filesystem. Such partitions are incorrectly recognized
as having a custom location and are loaded separately.

There were two performance issues:
1. The code for loading the files/blocks of partitions with
   truly custom locations was inefficient with an O(N^2)
   loop for determining the target partitions.
2. Each partition that is incorrectly identified as having
   a custom path (e.g. due to inconsistent qualification),
   is going to have its files/blocks loaded twice. Once
   when the table root path is processed, and once when the
   'custom' partition is processed.

This patch fixes the detection of partitions with custom
locations, and improves the speed of loading partitions
with custom locations.

Change-Id: I8c881b7cb155032b82fba0e29350ca31de388d55
Reviewed-on: http://gerrit.cloudera.org:8080/5743
Reviewed-by: Alex Behm <al...@cloudera.com>
Tested-by: Impala Public Jenkins
---
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/common/FileSystemUtil.java
2 files changed, 59 insertions(+), 18 deletions(-)

Approvals:
  Impala Public Jenkins: Verified
  Alex Behm: Looks good to me, approved



-- 
To view, visit http://gerrit.cloudera.org:8080/5743
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I8c881b7cb155032b82fba0e29350ca31de388d55
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>

[Impala-ASF-CR] IMPALA-4789: Fix slow metadata loading due to inconsistent paths.

Posted by "Dimitris Tsirogiannis (Code Review)" <ge...@cloudera.org>.
Dimitris Tsirogiannis has posted comments on this change.

Change subject: IMPALA-4789: Fix slow metadata loading due to inconsistent paths.
......................................................................


Patch Set 2:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/5743/2/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java:

PS2, Line 250: sDescendantPath
typo


http://gerrit.cloudera.org:8080/#/c/5743/2/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java
File fe/src/main/java/org/apache/impala/common/FileSystemUtil.java:

PS2, Line 434: URI pUri = p.toUri();
             :       URI parentUri = parent.toUri();
Based on the checks in L444, both p and parent could be null. If that is the case and trace is enabled, we'll get a NPE.


-- 
To view, visit http://gerrit.cloudera.org:8080/5743
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8c881b7cb155032b82fba0e29350ca31de388d55
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4789: Fix slow metadata loading due to inconsistent paths.

Posted by "Bharath Vissapragada (Code Review)" <ge...@cloudera.org>.
Bharath Vissapragada has posted comments on this change.

Change subject: IMPALA-4789: Fix slow metadata loading due to inconsistent paths.
......................................................................


Patch Set 5:

No comments. LGTM.

-- 
To view, visit http://gerrit.cloudera.org:8080/5743
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8c881b7cb155032b82fba0e29350ca31de388d55
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-4789: Fix slow metadata loading due to inconsistent paths.

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change.

Change subject: IMPALA-4789: Fix slow metadata loading due to inconsistent paths.
......................................................................


Patch Set 1:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/5743/1//COMMIT_MSG
Commit Message:

PS1, Line 22: The treatment of such
            : partitions is very inefficient.
> I could be wrong about it, but I don't think this is an accurate descriptio
You are exactly right. Modified comment.


http://gerrit.cloudera.org:8080/#/c/5743/1/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java:

Line 263
> Oh never mind, i didn't read the change description carefully enough.
Glad to hear we arrived at the same fix :)


Line 266:         // unpartitioned table, or the path of a partition with a custom location.
> how does the 'custom location' part follow?
You are right, it does not always follow. Rephrased.


PS1, Line 698: // TODO: We can still do some advanced optimization by grouping all the partition
             :     // directories under the same ancestor path up the tree.
             :     List<Path> dirsToLoad = Lists.newArrayList(tblLocation);
> We already have the ability to compress partition paths by removing the com
1. This TODO refers to partitions with custom locations specifically. Those will never go through L274-280, but always through L267-269.

2. I'm not even convinced this optimization is easy/practical. At least it's complicated enough to investigate in a different JIRA/patch. For example, consider something like this:

HDFS paths:
/warehouse/staging/2017/a
/warehouse/staging/2017/b
/warehouse/staging/2017/c
/warehouse/staging/2016/a
/warehouse/staging/2016/b
/warehouse/staging/2016/c

Now let's way we have two partitions with these custom locations:
/warehouse/staging/2017/a
/warehouse/staging/2016/a

Their common ancestor is:
/warehouse/staging/

We will load all files under the common ancestor from the NN. Might be good or might be a disaster.

It doesn't sound trivial to make this optimization robust.


Line 760:       partsByPath.put(partDirPath, Lists.newArrayList(partition));
> do we know that on this path it will be a qualified path, or do we not care
Good point. As long as the qualification is consistent, we're ok. I'm inclined to leave it as is because qualification
complicates the code due to additional exception handling in various places (qualification can throw IOException). What do you think?


http://gerrit.cloudera.org:8080/#/c/5743/1/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java
File fe/src/main/java/org/apache/impala/common/FileSystemUtil.java:

Line 429:    * be qualified or unqualified paths for this function to behave as expected.
> add a precondition that enforces this?
I'm a little worried about that for the following reasons:

1. If there is still a bug lurking somewhere this may lead to failed table loads (Preconditions check will throw), instead of table loads possibly being slow.

2. The Preconditions check in itself is pretty expensive (see the new code).

3. If something with the path qualification is going wrong, then the modified log message in HdfsTable.loadMetadataAndDiskIds() will give us a good indication of that.

So instead of a Preconditions check I added a LOG message. How does that sound?


Line 433:     while (!p.isRoot() && p.depth() != parent.depth()) p = p.getParent();
> Now that we are here, should we consider fixing this method to make it more
I'd prefer to leave that as follow-on work because it requires additional investigation. Based on Mostafa's
profiling, we don't have evidence to indicate this function is a bottleneck, so why fix it if it's not broken?


-- 
To view, visit http://gerrit.cloudera.org:8080/5743
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8c881b7cb155032b82fba0e29350ca31de388d55
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4789: Fix slow metadata loading due to inconsistent paths.

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change.

Change subject: IMPALA-4789: Fix slow metadata loading due to inconsistent paths.
......................................................................


Patch Set 6:

Build started: http://jenkins.impala.io:8080/job/gerrit-verify-dryrun/220/

-- 
To view, visit http://gerrit.cloudera.org:8080/5743
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8c881b7cb155032b82fba0e29350ca31de388d55
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-4789: Fix slow metadata loading due to inconsistent paths.

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change.

Change subject: IMPALA-4789: Fix slow metadata loading due to inconsistent paths.
......................................................................


Patch Set 5:

Dan, Bharath, any more comments?

-- 
To view, visit http://gerrit.cloudera.org:8080/5743
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8c881b7cb155032b82fba0e29350ca31de388d55
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-4789: Fix slow metadata loading due to inconsistent paths.

Posted by "Bharath Vissapragada (Code Review)" <ge...@cloudera.org>.
Bharath Vissapragada has posted comments on this change.

Change subject: IMPALA-4789: Fix slow metadata loading due to inconsistent paths.
......................................................................


Patch Set 1: Code-Review+1

(1 comment)

The fix makes sense to me. We are doing a lot of un-needed isDescendantPath() calls. Thanks for fixing it.

http://gerrit.cloudera.org:8080/#/c/5743/1/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java
File fe/src/main/java/org/apache/impala/common/FileSystemUtil.java:

Line 433:     while (!p.isRoot() && p.depth() != parent.depth()) p = p.getParent();
Now that we are here, should we consider fixing this method to make it more faster by avoiding p.getParent() ?


-- 
To view, visit http://gerrit.cloudera.org:8080/5743
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8c881b7cb155032b82fba0e29350ca31de388d55
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4789: Fix slow metadata loading due to inconsistent paths.

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change.

Change subject: IMPALA-4789: Fix slow metadata loading due to inconsistent paths.
......................................................................


Patch Set 1:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/5743/3/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java:

Line 275:           Path partDir = entry.getKey();
do we already have test cases that will take this path?


Line 693:     // loading to reduce the number of RPCs to the NN by separately loading partitions
here and elsewhere: should we use createFullyQualifiedPath() to avoid the bug described there?


-- 
To view, visit http://gerrit.cloudera.org:8080/5743
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8c881b7cb155032b82fba0e29350ca31de388d55
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4789: Fix slow metadata loading due to inconsistent paths.

Posted by "Marcel Kornacker (Code Review)" <ge...@cloudera.org>.
Marcel Kornacker has posted comments on this change.

Change subject: IMPALA-4789: Fix slow metadata loading due to inconsistent paths.
......................................................................


Patch Set 4:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/5743/7/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java:

Line 775:     for (Path location: locations) { loadBlockMetadata(location, partsByPath); }
remove {}


http://gerrit.cloudera.org:8080/#/c/5743/7/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java
File fe/src/main/java/org/apache/impala/common/FileSystemUtil.java:

Line 445:       }
what's the case for logging instead of a precondition? since we control who calls this function and what parameters get passed, this doesn't look like a runtime error to me.


-- 
To view, visit http://gerrit.cloudera.org:8080/5743
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8c881b7cb155032b82fba0e29350ca31de388d55
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4789: Fix slow metadata loading due to inconsistent paths.

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Hello Bharath Vissapragada,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5743

to look at the new patch set (#5).

Change subject: IMPALA-4789: Fix slow metadata loading due to inconsistent paths.
......................................................................

IMPALA-4789: Fix slow metadata loading due to inconsistent paths.

The fix for IMPALA-4172/IMPALA-3653 introduced a performance
regression for loading tables that have many partitions with:
1. Inconsistent HDFS path qualification or
2. A custom location (not under the table root dir)

For the first issue consider a table whose root path is at
'hdfs://localhost:8020/warehouse/tbl/'.
A partition with an unqualified location '/warehouse/tbl/p=1'
will not be recognized as being a descendant of the table root
dir by FileSystemUtil.isDescendentPath() because of how
Path.equals() behaves, even if 'hdfs://localhost:8020' is the
default filesystem. Such partitions are incorrectly recognized
as having a custom location and are loaded separately.

There were two performance issues:
1. The code for loading the files/blocks of partitions with
   truly custom locations was inefficient with an O(N^2)
   loop for determining the target partitions.
2. Each partition that is incorrectly identified as having
   a custom path (e.g. due to inconsistent qualification),
   is going to have its files/blocks loaded twice. Once
   when the table root path is processed, and once when the
   'custom' partition is processed.

This patch fixes the detection of partitions with custom
locations, and improves the speed of loading partitions
with custom locations.

Change-Id: I8c881b7cb155032b82fba0e29350ca31de388d55
---
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/common/FileSystemUtil.java
2 files changed, 59 insertions(+), 18 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/43/5743/5
-- 
To view, visit http://gerrit.cloudera.org:8080/5743
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8c881b7cb155032b82fba0e29350ca31de388d55
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>

[Impala-ASF-CR] IMPALA-4789: Fix slow metadata loading due to inconsistent paths.

Posted by "Bharath Vissapragada (Code Review)" <ge...@cloudera.org>.
Bharath Vissapragada has posted comments on this change.

Change subject: IMPALA-4789: Fix slow metadata loading due to inconsistent paths.
......................................................................


Patch Set 3:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/5743/3/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java
File fe/src/main/java/org/apache/impala/common/FileSystemUtil.java:

PS3, Line 440: !=
nit: I feel an XOR makes the intention more clear than using a !=, especially for boolean operands.


PS3, Line 442: pUri.getScheme().equals(parentUri.getScheme())
Wouldn't this still hit an NPE? For example, if both pUri.getScheme() and 
parentUri.getScheme() are null but pUri.getAuthority() and parentUri.getAuthority() are non-null but don't match (result = false).


-- 
To view, visit http://gerrit.cloudera.org:8080/5743
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8c881b7cb155032b82fba0e29350ca31de388d55
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4789: Fix slow metadata loading due to inconsistent paths.

Posted by "Dimitris Tsirogiannis (Code Review)" <ge...@cloudera.org>.
Dimitris Tsirogiannis has posted comments on this change.

Change subject: IMPALA-4789: Fix slow metadata loading due to inconsistent paths.
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5743/1//COMMIT_MSG
Commit Message:

PS1, Line 22: The treatment of such
            : partitions is very inefficient.
I could be wrong about it, but I don't think this is an accurate description of the problem. By not having qualified paths and consequently not recognizing that the majority of partitions were under the tbl location, we ended up listing and loading block metadata for the partition files twice, once when we listed the files under tbl location and once for each partition that we incorrectly marked as not being a descendant of tbl location. Is this correct?


-- 
To view, visit http://gerrit.cloudera.org:8080/5743
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8c881b7cb155032b82fba0e29350ca31de388d55
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4789: Fix slow metadata loading due to inconsistent paths.

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change.

Change subject: IMPALA-4789: Fix slow metadata loading due to inconsistent paths.
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5743/1/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java:

Line 263
> why don't we just make partDir and dirPath qualified paths before calling F
Oh never mind, i didn't read the change description carefully enough.


-- 
To view, visit http://gerrit.cloudera.org:8080/5743
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8c881b7cb155032b82fba0e29350ca31de388d55
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4789: Fix slow metadata loading due to inconsistent paths.

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Hello Bharath Vissapragada,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5743

to look at the new patch set (#2).

Change subject: IMPALA-4789: Fix slow metadata loading due to inconsistent paths.
......................................................................

IMPALA-4789: Fix slow metadata loading due to inconsistent paths.

The fix for IMPALA-4172/IMPALA-3653 introduced a performance
regression for loading tables that have many partitions with:
1. Inconsistent HDFS path qualification or
2. A custom location (not under the table root dir)

For the first issue consider a table whose root path is at
'hdfs://localhost:8020/warehouse/tbl/'.
A partition with an unqualified location '/warehouse/tbl/p=1'
will not be recognized as being a descendant of the table root
dir by FileSystemUtil.isDescendentPath() because of how
Path.equals() behaves, even if 'hdfs://localhost:8020' is the
default filesystem. Such partitions are incorrectly recognized
as having a custom location and are loaded separately.

There were two performance issues:
1. The code for loading the files/blocks of partitions with
   truly custom locations was inefficient with an O(N^2)
   loop for determining the target partition.
2. Each partition that is incorrectly identified as having
   a custom path (e.g. due to inconsistent qualification),
   is going to have its files/blocks loaded twice. Once
   when the table root path is processed, and once when the
   'custom' partition is processed.

This patch fixes the detection of partitions with custom
locations, and improves the speed of loading partitions
with custom locations.

Change-Id: I8c881b7cb155032b82fba0e29350ca31de388d55
---
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/common/FileSystemUtil.java
2 files changed, 55 insertions(+), 15 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/43/5743/2
-- 
To view, visit http://gerrit.cloudera.org:8080/5743
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8c881b7cb155032b82fba0e29350ca31de388d55
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>

[Impala-ASF-CR] IMPALA-4789: Fix slow metadata loading due to inconsistent paths.

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Hello Bharath Vissapragada,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5743

to look at the new patch set (#3).

Change subject: IMPALA-4789: Fix slow metadata loading due to inconsistent paths.
......................................................................

IMPALA-4789: Fix slow metadata loading due to inconsistent paths.

The fix for IMPALA-4172/IMPALA-3653 introduced a performance
regression for loading tables that have many partitions with:
1. Inconsistent HDFS path qualification or
2. A custom location (not under the table root dir)

For the first issue consider a table whose root path is at
'hdfs://localhost:8020/warehouse/tbl/'.
A partition with an unqualified location '/warehouse/tbl/p=1'
will not be recognized as being a descendant of the table root
dir by FileSystemUtil.isDescendentPath() because of how
Path.equals() behaves, even if 'hdfs://localhost:8020' is the
default filesystem. Such partitions are incorrectly recognized
as having a custom location and are loaded separately.

There were two performance issues:
1. The code for loading the files/blocks of partitions with
   truly custom locations was inefficient with an O(N^2)
   loop for determining the target partitions.
2. Each partition that is incorrectly identified as having
   a custom path (e.g. due to inconsistent qualification),
   is going to have its files/blocks loaded twice. Once
   when the table root path is processed, and once when the
   'custom' partition is processed.

This patch fixes the detection of partitions with custom
locations, and improves the speed of loading partitions
with custom locations.

Change-Id: I8c881b7cb155032b82fba0e29350ca31de388d55
---
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/main/java/org/apache/impala/common/FileSystemUtil.java
2 files changed, 57 insertions(+), 16 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/43/5743/3
-- 
To view, visit http://gerrit.cloudera.org:8080/5743
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8c881b7cb155032b82fba0e29350ca31de388d55
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>

[Impala-ASF-CR] IMPALA-4789: Fix slow metadata loading due to inconsistent paths.

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change.

Change subject: IMPALA-4789: Fix slow metadata loading due to inconsistent paths.
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5743/1/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java:

Line 263
why don't we just make partDir and dirPath qualified paths before calling FileSystemUtil.isDescendantPath()?


-- 
To view, visit http://gerrit.cloudera.org:8080/5743
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8c881b7cb155032b82fba0e29350ca31de388d55
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4789: Fix slow metadata loading due to inconsistent paths.

Posted by "Marcel Kornacker (Code Review)" <ge...@cloudera.org>.
Marcel Kornacker has posted comments on this change.

Change subject: IMPALA-4789: Fix slow metadata loading due to inconsistent paths.
......................................................................


Patch Set 1:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/5743/1/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java:

Line 266:         // unpartitioned table, or the path of a partition with a custom location.
how does the 'custom location' part follow?


http://gerrit.cloudera.org:8080/#/c/5743/1/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java
File fe/src/main/java/org/apache/impala/common/FileSystemUtil.java:

Line 429:    * be qualified or unqualified paths for this function to behave as expected.
add a precondition that enforces this?


-- 
To view, visit http://gerrit.cloudera.org:8080/5743
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8c881b7cb155032b82fba0e29350ca31de388d55
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4789: Fix slow metadata loading due to inconsistent paths.

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change.

Change subject: IMPALA-4789: Fix slow metadata loading due to inconsistent paths.
......................................................................


Patch Set 2:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/5743/2/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java:

PS2, Line 250: sDescendantPath
> typo
Done


http://gerrit.cloudera.org:8080/#/c/5743/2/fe/src/main/java/org/apache/impala/common/FileSystemUtil.java
File fe/src/main/java/org/apache/impala/common/FileSystemUtil.java:

PS2, Line 434: URI pUri = p.toUri();
             :       URI parentUri = parent.toUri();
> Based on the checks in L444, both p and parent could be null. If that is th
Good catch. Done. Also changed code to only log when necessary.


-- 
To view, visit http://gerrit.cloudera.org:8080/5743
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8c881b7cb155032b82fba0e29350ca31de388d55
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4789: Fix slow metadata loading due to inconsistent paths.

Posted by "Dimitris Tsirogiannis (Code Review)" <ge...@cloudera.org>.
Dimitris Tsirogiannis has posted comments on this change.

Change subject: IMPALA-4789: Fix slow metadata loading due to inconsistent paths.
......................................................................


Patch Set 5: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/5743
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8c881b7cb155032b82fba0e29350ca31de388d55
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dt...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-HasComments: No