You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Joe McDonnell (Code Review)" <ge...@cloudera.org> on 2019/06/07 00:31:11 UTC

[Impala-ASF-CR] IMPALA-8630: Include partition id when calculating consistent remote placement

Joe McDonnell has uploaded this change for review. ( http://gerrit.cloudera.org:8080/13545


Change subject: IMPALA-8630: Include partition id when calculating consistent remote placement
......................................................................

IMPALA-8630: Include partition id when calculating consistent remote placement

Consistent remote placement currently uses the relative filename within
a partition for the consistent hash. If the relative filenames for
different partitions have a simple naming scheme, then multiple
partitions may have files of the same name. This is true for some
tables written by Hive (e.g. in our minicluster the tpcds.store_sales
has this problem). This can lead to unbalanced placement of remote
ranges.

This adds the partition_id in the hash, so files with the same
name from different partitions will have different hashes.

Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
---
M be/src/scheduling/scheduler.cc
1 file changed, 5 insertions(+), 2 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/45/13545/1
-- 
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 1
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>

[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Joe McDonnell has posted comments on this change. ( http://gerrit.cloudera.org:8080/13545 )

Change subject: IMPALA-8630: Hash the full path when calculating consistent remote placement
......................................................................


Patch Set 7:

(9 comments)

http://gerrit.cloudera.org:8080/#/c/13545/7//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/13545/7//COMMIT_MSG@24
PS7, Line 24: see
> nit: sees?
Done


http://gerrit.cloudera.org:8080/#/c/13545/7/be/src/scheduling/scheduler-test-util.h
File be/src/scheduling/scheduler-test-util.h:

http://gerrit.cloudera.org:8080/#/c/13545/7/be/src/scheduling/scheduler-test-util.h@95
PS7, Line 95: PARTITIONED_SIMPLE_NAMES
> PARTITIONED_SINGLE_NAME? PARTITIONED_CONSTANT_NAMES?
Changed these to PARTITIONED_SINGLE_FILENAME and PARTITIONED_UNIQUE_FILENAMES.


http://gerrit.cloudera.org:8080/#/c/13545/7/be/src/scheduling/scheduler-test-util.cc
File be/src/scheduling/scheduler-test-util.cc:

http://gerrit.cloudera.org:8080/#/c/13545/7/be/src/scheduling/scheduler-test-util.cc@295
PS7, Line 295:     // Encoding the table name and index in the file helps debugging.
> Maybe include one or two examples of a path in the comment in each case? Th
Added examples for each section


http://gerrit.cloudera.org:8080/#/c/13545/7/be/src/scheduling/scheduler-test-util.cc@353
PS7, Line 353:   GetBlockPaths(table_name, true, spec_idx, naming_policy, &relative_path,
> One of the parameters could be returned by value, e.g. the partition_id
I think I prefer returning multiple things all via args rather than having one via return value, so I'm going to leave this.


http://gerrit.cloudera.org:8080/#/c/13545/7/be/src/scheduling/scheduler-test.cc
File be/src/scheduling/scheduler-test.cc:

http://gerrit.cloudera.org:8080/#/c/13545/7/be/src/scheduling/scheduler-test.cc@264
PS7, Line 264: void InitRemoteCluster(Cluster *cluster) {
> This could be a member of Cluster, e.g. static Cluster CreateRemoteCluster(
Moved this to a static function on Cluster. I added a CreateStandardRemoteCluster for the 50 impala node, 3 data node configuration.


http://gerrit.cloudera.org:8080/#/c/13545/7/be/src/scheduling/scheduler-test.cc@305
PS7, Line 305: 
> Can you add brief comments above each test describing what to do (see other
Added comments for the new tests


http://gerrit.cloudera.org:8080/#/c/13545/7/be/src/scheduling/scheduler-test.cc@314
PS7, Line 314:     Schema schema(cluster);
> These loops could benefit from adding a SCOPED_TRACE for the naming policy 
Added SCOPED_TRACE for the naming policy.


http://gerrit.cloudera.org:8080/#/c/13545/7/common/thrift/PlanNodes.thrift
File common/thrift/PlanNodes.thrift:

http://gerrit.cloudera.org:8080/#/c/13545/7/common/thrift/PlanNodes.thrift@195
PS7, Line 195: partition's path
> For S3, this could be a bucket? Does it matter whether it uses the whole pa
The partition_id is something that Impala assigns relatively arbitrarily and it is not really stable across time. For example, it is not stable across an "invalidate metadata" for a table. So, it does need to be the path.


http://gerrit.cloudera.org:8080/#/c/13545/7/common/thrift/PlanNodes.thrift@196
PS7, Line 196: consistent
> "consistent" here mean "given the same input, must return the same hash", r
Right



-- 
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 7
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Thu, 20 Jun 2019 00:55:58 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement

Posted by "Lars Volker (Code Review)" <ge...@cloudera.org>.
Lars Volker has posted comments on this change. ( http://gerrit.cloudera.org:8080/13545 )

Change subject: IMPALA-8630: Hash the full path when calculating consistent remote placement
......................................................................


Patch Set 9: Code-Review+1

(4 comments)

Thanks for the changes. I had few more nits but otherwise the change looks good to me.

http://gerrit.cloudera.org:8080/#/c/13545/9/be/src/scheduling/scheduler-test-util.cc
File be/src/scheduling/scheduler-test-util.cc:

http://gerrit.cloudera.org:8080/#/c/13545/9/be/src/scheduling/scheduler-test-util.cc@47
PS9, Line 47:   
nit: double space


http://gerrit.cloudera.org:8080/#/c/13545/9/be/src/scheduling/scheduler-test.cc
File be/src/scheduling/scheduler-test.cc:

http://gerrit.cloudera.org:8080/#/c/13545/9/be/src/scheduling/scheduler-test.cc@18
PS9, Line 18: #include <map>
Is this one needed?


http://gerrit.cloudera.org:8080/#/c/13545/9/be/src/scheduling/scheduler-test.cc@228
PS9, Line 228:   cluster.AddHosts(num_data_nodes, false, true);
nit: use AddRemoteCluster()?


http://gerrit.cloudera.org:8080/#/c/13545/9/be/src/scheduling/scheduler-test.cc@376
PS9, Line 376:   cluster.AddHosts(num_impala_nodes, true, false);
nit: AddRemoteCluster()?



-- 
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 9
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Thu, 20 Jun 2019 01:21:39 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13545 )

Change subject: IMPALA-8630: Hash the full path when calculating consistent remote placement
......................................................................


Patch Set 9:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/3690/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 9
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Thu, 20 Jun 2019 01:31:29 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] [WIP] IMPALA-8630: Hash the full path when calculating consistent remote placement

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/13545 )

Change subject: [WIP] IMPALA-8630: Hash the full path when calculating consistent remote placement
......................................................................


Patch Set 3: Code-Review+1


-- 
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 3
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Mon, 17 Jun 2019 17:59:56 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13545 )

Change subject: IMPALA-8630: Hash the full path when calculating consistent remote placement
......................................................................


Patch Set 5:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/4493/ DRY_RUN=true


-- 
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 5
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Tue, 18 Jun 2019 21:43:44 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] [WIP] IMPALA-8630: Hash the full path when calculating consistent remote placement

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/13545 )

Change subject: [WIP] IMPALA-8630: Hash the full path when calculating consistent remote placement
......................................................................


Patch Set 2:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/13545/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/13545/2//COMMIT_MSG@27
PS2, Line 27: The alternative is to construct the DescriptorTbl in the scheduler
the advantage of the alternative is saving some extra memory per file descriptor. The FileDescs are persistent per-file so this can add up to many MB of increased consumption on a big catalogd. That said, with the recent work to reduce consumption and add eviction on catalogd, maybe it's not such a big deal as it used to be (eg IMPALA-7406 saved 100 bytes per FileFbDesc and here we're only losing a little bit of that gain)


http://gerrit.cloudera.org:8080/#/c/13545/2/be/src/scheduling/scheduler.cc
File be/src/scheduling/scheduler.cc:

http://gerrit.cloudera.org:8080/#/c/13545/2/be/src/scheduling/scheduler.cc@801
PS2, Line 801:   uint32_t hash = HashUtil::Hash(&hdfs_file_split->full_path_hash,
             :       sizeof(hdfs_file_split->full_path_hash), 0);
why hash the hash instead of just making the hash here be a uint32_t directly?


http://gerrit.cloudera.org:8080/#/c/13545/2/common/fbs/CatalogObjects.fbs
File common/fbs/CatalogObjects.fbs:

http://gerrit.cloudera.org:8080/#/c/13545/2/common/fbs/CatalogObjects.fbs@82
PS2, Line 82: int
should be a uint?



-- 
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 2
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Fri, 14 Jun 2019 23:20:50 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Joe McDonnell has posted comments on this change. ( http://gerrit.cloudera.org:8080/13545 )

Change subject: IMPALA-8630: Hash the full path when calculating consistent remote placement
......................................................................


Patch Set 12: Code-Review+2

Carry +2


-- 
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 12
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Tue, 25 Jun 2019 22:05:36 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Joe McDonnell has posted comments on this change. ( http://gerrit.cloudera.org:8080/13545 )

Change subject: IMPALA-8630: Hash the full path when calculating consistent remote placement
......................................................................


Patch Set 9:

(4 comments)

Thanks for the review

http://gerrit.cloudera.org:8080/#/c/13545/9/be/src/scheduling/scheduler-test-util.cc
File be/src/scheduling/scheduler-test-util.cc:

http://gerrit.cloudera.org:8080/#/c/13545/9/be/src/scheduling/scheduler-test-util.cc@47
PS9, Line 47:   
> nit: double space
Done


http://gerrit.cloudera.org:8080/#/c/13545/9/be/src/scheduling/scheduler-test.cc
File be/src/scheduling/scheduler-test.cc:

http://gerrit.cloudera.org:8080/#/c/13545/9/be/src/scheduling/scheduler-test.cc@18
PS9, Line 18: #include <map>
> Is this one needed?
Oh, good point, this is leftover from an approach I ended up not using. Removed.


http://gerrit.cloudera.org:8080/#/c/13545/9/be/src/scheduling/scheduler-test.cc@228
PS9, Line 228:   cluster.AddHosts(num_data_nodes, false, true);
> nit: use AddRemoteCluster()?
Switched this over. Documented that CreateRemoteCluster() puts the Impala nodes first, so the indices for the AddSingleBlockTable() call needed to change.


http://gerrit.cloudera.org:8080/#/c/13545/9/be/src/scheduling/scheduler-test.cc@376
PS9, Line 376:   cluster.AddHosts(num_impala_nodes, true, false);
> nit: AddRemoteCluster()?
Good catch, these statements shouldn't be here, as this uses CreateRemoteCluster().



-- 
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 9
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Thu, 20 Jun 2019 17:02:23 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13545 )

Change subject: IMPALA-8630: Hash the full path when calculating consistent remote placement
......................................................................


Patch Set 5: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/4493/


-- 
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 5
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Wed, 19 Jun 2019 03:06:58 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8630: Include partition id when calculating consistent remote placement

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13545 )

Change subject: IMPALA-8630: Include partition id when calculating consistent remote placement
......................................................................


Patch Set 1:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/3530/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 1
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Fri, 07 Jun 2019 01:15:14 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] [WIP] IMPALA-8630: Hash the full path when calculating consistent remote placement

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/13545 )

Change subject: [WIP] IMPALA-8630: Hash the full path when calculating consistent remote placement
......................................................................


Patch Set 2:

(1 comment)

This did work out pretty cleanly.

http://gerrit.cloudera.org:8080/#/c/13545/2/common/fbs/CatalogObjects.fbs
File common/fbs/CatalogObjects.fbs:

http://gerrit.cloudera.org:8080/#/c/13545/2/common/fbs/CatalogObjects.fbs@81
PS2, Line 81:   // A hash of the absolute file path
I wonder if this even makes the struct bigger, given padding. Or if we can reorder things a bit like the above TODO suggests to reduce the padding.

If we can remove a TODO and avoid the space/time regression concern, could be a win.



-- 
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 2
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Fri, 14 Jun 2019 23:43:42 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] [WIP] IMPALA-8630: Hash the full path when calculating consistent remote placement

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/13545 )

Change subject: [WIP] IMPALA-8630: Hash the full path when calculating consistent remote placement
......................................................................


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/13545/2/common/fbs/CatalogObjects.fbs
File common/fbs/CatalogObjects.fbs:

http://gerrit.cloudera.org:8080/#/c/13545/2/common/fbs/CatalogObjects.fbs@81
PS2, Line 81:   // A hash of the absolute file path
> I wonder if this even makes the struct bigger, given padding. Or if we can 
FWIW we currently use a flatbuffer 'table' instead of a 'struct' which makes all the fields optional, and ends up taking extra space since there's some kind of vtable entry for each field or something. Given that I don't think there's padding involved. But we could probably get a win by moving all of the non-optional fields into a struct



-- 
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 2
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Mon, 17 Jun 2019 15:41:54 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13545 )

Change subject: IMPALA-8630: Hash the full path when calculating consistent remote placement
......................................................................


Patch Set 6:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/3674/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 6
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Wed, 19 Jun 2019 00:30:04 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Joe McDonnell has posted comments on this change. ( http://gerrit.cloudera.org:8080/13545 )

Change subject: IMPALA-8630: Hash the full path when calculating consistent remote placement
......................................................................


Patch Set 7:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/13545/7/be/src/scheduling/scheduler-test.cc
File be/src/scheduling/scheduler-test.cc:

http://gerrit.cloudera.org:8080/#/c/13545/7/be/src/scheduling/scheduler-test.cc@264
PS7, Line 264: void InitRemoteCluster(Cluster *cluster) {
> Thx. I think I'd prefer to spell out the size of the cluster by calling Cre
Removed the CreateStandardRemoteCluster and changed tests to call CreateRemoteCluster directly.



-- 
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 7
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Thu, 20 Jun 2019 01:14:28 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Hello Lars Volker, Tim Armstrong, Todd Lipcon, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/13545

to look at the new patch set (#7).

Change subject: IMPALA-8630: Hash the full path when calculating consistent remote placement
......................................................................

IMPALA-8630: Hash the full path when calculating consistent remote placement

Consistent remote placement currently uses the relative filename within
a partition for the consistent hash. If the relative filenames for
different partitions have a simple naming scheme, then multiple
partitions may have files of the same name. This is true for some
tables written by Hive (e.g. in our minicluster the tpcds.store_sales
has this problem). This can lead to unbalanced placement of remote
ranges.

This adds a partition_path_hash to the THdfsFileSplit and
THdfsFileSplitGeneratorSpec, calculated in the frontend (which has all of
the partition information). The scheduler hashes this in addition to
the relative path.

Testing:
 - Added several new scheduler tests that verify the consistent remote
   scheduling see blocks with different relative paths, partition paths,
   or offsets as distinct.
 - Ran core tests

Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
---
M be/src/scheduling/scheduler-test-util.cc
M be/src/scheduling/scheduler-test-util.h
M be/src/scheduling/scheduler-test.cc
M be/src/scheduling/scheduler.cc
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/test/java/org/apache/impala/planner/ExplainTest.java
7 files changed, 281 insertions(+), 42 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/45/13545/7
-- 
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 7
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[Impala-ASF-CR] [WIP] IMPALA-8630: Hash the full path when calculating consistent remote placement

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13545 )

Change subject: [WIP] IMPALA-8630: Hash the full path when calculating consistent remote placement
......................................................................


Patch Set 2:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/3633/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 2
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Fri, 14 Jun 2019 23:56:38 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8630: Include partition id when calculating consistent remote placement

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/13545 )

Change subject: IMPALA-8630: Include partition id when calculating consistent remote placement
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/13545/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/13545/1//COMMIT_MSG@17
PS1, Line 17: This adds the partition_id in the hash, so files with the same
How are the partition IDs generated? I'm wondering if there might be a less severe version of the same problem where there are collisions in partition IDs between different tables?

I think the general idea of adding something unique to the hash per-partition makes sense, just wanting to make sure I understand the solution.



-- 
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 1
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Fri, 07 Jun 2019 03:03:03 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13545 )

Change subject: IMPALA-8630: Hash the full path when calculating consistent remote placement
......................................................................


Patch Set 6: Verified+1


-- 
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 6
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Wed, 19 Jun 2019 05:17:22 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13545 )

Change subject: IMPALA-8630: Hash the full path when calculating consistent remote placement
......................................................................


Patch Set 12: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/4543/


-- 
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 12
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Wed, 26 Jun 2019 03:38:37 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13545 )

Change subject: IMPALA-8630: Hash the full path when calculating consistent remote placement
......................................................................


Patch Set 13:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/4554/ DRY_RUN=true


-- 
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 13
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Wed, 26 Jun 2019 22:51:40 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] [WIP] IMPALA-8630: Hash the full path when calculating consistent remote placement

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Hello Tim Armstrong, Todd Lipcon, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/13545

to look at the new patch set (#3).

Change subject: [WIP] IMPALA-8630: Hash the full path when calculating consistent remote placement
......................................................................

[WIP] IMPALA-8630: Hash the full path when calculating consistent remote placement

Consistent remote placement currently uses the relative filename within
a partition for the consistent hash. If the relative filenames for
different partitions have a simple naming scheme, then multiple
partitions may have files of the same name. This is true for some
tables written by Hive (e.g. in our minicluster the tpcds.store_sales
has this problem). This can lead to unbalanced placement of remote
ranges.

This adds a partition_path_hash to the THdfsFileSplit and
THdfsFileSplitGeneratorSpec, calculated in the frontend (which has all of
the partition information). The scheduler hashes this in addition to
the relative path.

This is a proof of concept and is not ready to go.

Work remaining (if this is the approach we take):
 - Add some scheduler tests to simulate this approach
 - Look into an end-to-end test based on the Docker tests (which do remote IO)

Alternatives:
 - Hash the full path when reading the file metadata in the catalog
   and store it in FbFileDesc.
 - Construct the DescriptorTbl in the scheduler and use that to
   reconstruct the full path or partition path for a file and hash it.

Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
---
M be/src/scheduling/scheduler-test-util.cc
M be/src/scheduling/scheduler-test-util.h
M be/src/scheduling/scheduler-test.cc
M be/src/scheduling/scheduler.cc
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
6 files changed, 87 insertions(+), 25 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/45/13545/3
-- 
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 3
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Hello Lars Volker, Tim Armstrong, Todd Lipcon, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/13545

to look at the new patch set (#9).

Change subject: IMPALA-8630: Hash the full path when calculating consistent remote placement
......................................................................

IMPALA-8630: Hash the full path when calculating consistent remote placement

Consistent remote placement currently uses the relative filename within
a partition for the consistent hash. If the relative filenames for
different partitions have a simple naming scheme, then multiple
partitions may have files of the same name. This is true for some
tables written by Hive (e.g. in our minicluster the tpcds.store_sales
has this problem). This can lead to unbalanced placement of remote
ranges.

This adds a partition_path_hash to the THdfsFileSplit and
THdfsFileSplitGeneratorSpec, calculated in the frontend (which has all of
the partition information). The scheduler hashes this in addition to
the relative path.

Testing:
 - Added several new scheduler tests that verify the consistent remote
   scheduling sees blocks with different relative paths, partition paths,
   or offsets as distinct.
 - Ran core tests

Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
---
M be/src/scheduling/scheduler-test-util.cc
M be/src/scheduling/scheduler-test-util.h
M be/src/scheduling/scheduler-test.cc
M be/src/scheduling/scheduler.cc
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/test/java/org/apache/impala/planner/ExplainTest.java
7 files changed, 329 insertions(+), 45 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/45/13545/9
-- 
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 9
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[Impala-ASF-CR] IMPALA-8630: Include partition id when calculating consistent remote placement

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Joe McDonnell has posted comments on this change. ( http://gerrit.cloudera.org:8080/13545 )

Change subject: IMPALA-8630: Include partition id when calculating consistent remote placement
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/13545/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/13545/1//COMMIT_MSG@17
PS1, Line 17: This adds the partition_id in the hash, so files with the same
> How are the partition IDs generated? I'm wondering if there might be a less
They are just sequential numbers that we make up, but it turns out that they are unique, even across tables:

https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java#L572

But that creates a different problem, which is that they are not stable over time or across different coordinators. If I invalidate metadata and rerun a query, I get different partition ids. I'll explore ways to make something stable to use.



-- 
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 1
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Mon, 10 Jun 2019 18:31:16 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Joe McDonnell has posted comments on this change. ( http://gerrit.cloudera.org:8080/13545 )

Change subject: IMPALA-8630: Hash the full path when calculating consistent remote placement
......................................................................


Patch Set 6:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/13545/6/be/src/scheduling/scheduler-test.cc
File be/src/scheduling/scheduler-test.cc:

http://gerrit.cloudera.org:8080/#/c/13545/6/be/src/scheduling/scheduler-test.cc@267
PS6, Line 267:     // Set of Impala hosts
> Extra space indentation
Done


http://gerrit.cloudera.org:8080/#/c/13545/6/be/src/scheduling/scheduler-test.cc@282
PS6, Line 282: /// overlapping by chance is extremely low, so this test should only fail if the
> Can we control how the RNG used for the random replica selection is seeded?
Good point. I was thinking the probability of overlap should be small enough not to matter (~10^-18 or so), but then I noticed that we do srand(0) in SchedulerTest constructor:
https://github.com/apache/impala/blob/master/be/src/scheduling/scheduler-test.cc#L31
The constructor runs for each test, so I guess these are deterministic.

Changed this comment to indicate this is deterministic (which is fine, because these tests don't benefit much from actual randomness).


http://gerrit.cloudera.org:8080/#/c/13545/6/common/thrift/PlanNodes.thrift
File common/thrift/PlanNodes.thrift:

http://gerrit.cloudera.org:8080/#/c/13545/6/common/thrift/PlanNodes.thrift@195
PS6, Line 195:   // hash of the partition's path
> I think it would be useful to document how the hash should be computed (sin
Added a comment about using Java String.hashCode() and that it needs to be consistent across processes/machines.


http://gerrit.cloudera.org:8080/#/c/13545/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/13545/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@951
PS6, Line 951:         partition.getLocation().hashCode());
> This was surprisingly subtle. I remembered that hashCode() was not guarante
Good point, I added a comment in the thrift file.



-- 
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 6
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Wed, 19 Jun 2019 18:03:11 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/13545 )

Change subject: IMPALA-8630: Hash the full path when calculating consistent remote placement
......................................................................


Patch Set 6:

(4 comments)

Tests look great, thanks for adding them. I had a few minor requests but can +2 after you've addressed those.

http://gerrit.cloudera.org:8080/#/c/13545/6/be/src/scheduling/scheduler-test.cc
File be/src/scheduling/scheduler-test.cc:

http://gerrit.cloudera.org:8080/#/c/13545/6/be/src/scheduling/scheduler-test.cc@267
PS6, Line 267:     // Set of Impala hosts
Extra space indentation


http://gerrit.cloudera.org:8080/#/c/13545/6/be/src/scheduling/scheduler-test.cc@282
PS6, Line 282: /// overlapping by chance is extremely low, so this test should only fail if the
Can we control how the RNG used for the random replica selection is seeded? E.g. call srand() with a seed that we control before each run. Otherwise it might be hard to reproduce if we have occasional collisions (e.g. RNG isn't as good as we think and has more collisions than we'd expect).

This isn't a blocker if it turns out to be tricky, but it's a nice-to-have.


http://gerrit.cloudera.org:8080/#/c/13545/6/common/thrift/PlanNodes.thrift
File common/thrift/PlanNodes.thrift:

http://gerrit.cloudera.org:8080/#/c/13545/6/common/thrift/PlanNodes.thrift@195
PS6, Line 195:   // hash of the partition's path
I think it would be useful to document how the hash should be computed (since we rely on it being computed the same way). I.e. that it's the standard Java String.hashCode()


http://gerrit.cloudera.org:8080/#/c/13545/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/13545/6/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@951
PS6, Line 951:         partition.getLocation().hashCode());
This was surprisingly subtle. I remembered that hashCode() was not guaranteed to be consistent across JVMs: https://martin.kleppmann.com/2012/06/18/java-hashcode-unsafe-for-distributed-systems.html. But it appears that the String hashCode is actually documented as being stable: https://docs.oracle.com/javase/8/docs/api/java/lang/String.html#hashCode--

I think it is worth documenting, but might be best to document in the thrift spec instead of here.



-- 
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 6
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Wed, 19 Jun 2019 16:48:41 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement

Posted by "Lars Volker (Code Review)" <ge...@cloudera.org>.
Lars Volker has posted comments on this change. ( http://gerrit.cloudera.org:8080/13545 )

Change subject: IMPALA-8630: Hash the full path when calculating consistent remote placement
......................................................................


Patch Set 8:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/13545/7/be/src/scheduling/scheduler-test.cc
File be/src/scheduling/scheduler-test.cc:

http://gerrit.cloudera.org:8080/#/c/13545/7/be/src/scheduling/scheduler-test.cc@264
PS7, Line 264: /// Helper function to verify that two things are treated as distinct for consistent
> Moved this to a static function on Cluster. I added a CreateStandardRemoteC
Thx. I think I'd prefer to spell out the size of the cluster by calling CreateRemoteCluster(50, 30), similar to how we have several calls to cluster.AddHosts(3, true, true) through out this file. This makes it easier for the reader to see the cluster size without having to check another file.



-- 
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 8
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Thu, 20 Jun 2019 01:02:14 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13545 )

Change subject: IMPALA-8630: Hash the full path when calculating consistent remote placement
......................................................................


Patch Set 9:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/4503/ DRY_RUN=true


-- 
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 9
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Thu, 20 Jun 2019 01:15:23 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Joe McDonnell has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/13545 )

Change subject: IMPALA-8630: Hash the full path when calculating consistent remote placement
......................................................................

IMPALA-8630: Hash the full path when calculating consistent remote placement

Consistent remote placement currently uses the relative filename within
a partition for the consistent hash. If the relative filenames for
different partitions have a simple naming scheme, then multiple
partitions may have files of the same name. This is true for some
tables written by Hive (e.g. in our minicluster the tpcds.store_sales
has this problem). This can lead to unbalanced placement of remote
ranges.

This adds a partition_path_hash to the THdfsFileSplit and
THdfsFileSplitGeneratorSpec, calculated in the frontend (which has all of
the partition information). The scheduler hashes this in addition to
the relative path.

Testing:
 - Added several new scheduler tests that verify the consistent remote
   scheduling sees blocks with different relative paths, partition paths,
   or offsets as distinct.
 - Ran core tests

Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Reviewed-on: http://gerrit.cloudera.org:8080/13545
Reviewed-by: Joe McDonnell <jo...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
M be/src/scheduling/scheduler-test-util.cc
M be/src/scheduling/scheduler-test-util.h
M be/src/scheduling/scheduler-test.cc
M be/src/scheduling/scheduler.cc
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/test/java/org/apache/impala/planner/ExplainTest.java
7 files changed, 332 insertions(+), 55 deletions(-)

Approvals:
  Joe McDonnell: Looks good to me, approved
  Impala Public Jenkins: Verified

-- 
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 14
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13545 )

Change subject: IMPALA-8630: Hash the full path when calculating consistent remote placement
......................................................................


Patch Set 10:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/3699/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 10
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Thu, 20 Jun 2019 17:42:33 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] [WIP] IMPALA-8630: Hash the full path when calculating consistent remote placement

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Joe McDonnell has posted comments on this change. ( http://gerrit.cloudera.org:8080/13545 )

Change subject: [WIP] IMPALA-8630: Hash the full path when calculating consistent remote placement
......................................................................


Patch Set 2:

(1 comment)

I'm going to go ahead and post the version that sidesteps the catalog.

http://gerrit.cloudera.org:8080/#/c/13545/2/be/src/scheduling/scheduler.cc
File be/src/scheduling/scheduler.cc:

http://gerrit.cloudera.org:8080/#/c/13545/2/be/src/scheduling/scheduler.cc@801
PS2, Line 801:   uint32_t hash = HashUtil::Hash(&hdfs_file_split->full_path_hash,
             :       sizeof(hdfs_file_split->full_path_hash), 0);
> why hash the hash instead of just making the hash here be a uint32_t direct
Fixed this. Java hashCode() returns a signed integer and backend uses unsigned. I was punting on deciding the right place to cast.



-- 
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 2
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Sat, 15 Jun 2019 00:42:11 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13545 )

Change subject: IMPALA-8630: Hash the full path when calculating consistent remote placement
......................................................................


Patch Set 7:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/4499/ DRY_RUN=true


-- 
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 7
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Wed, 19 Jun 2019 18:11:53 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/13545 )

Change subject: IMPALA-8630: Hash the full path when calculating consistent remote placement
......................................................................


Patch Set 11: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 11
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Tue, 25 Jun 2019 00:24:50 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Hello Lars Volker, Tim Armstrong, Todd Lipcon, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/13545

to look at the new patch set (#6).

Change subject: IMPALA-8630: Hash the full path when calculating consistent remote placement
......................................................................

IMPALA-8630: Hash the full path when calculating consistent remote placement

Consistent remote placement currently uses the relative filename within
a partition for the consistent hash. If the relative filenames for
different partitions have a simple naming scheme, then multiple
partitions may have files of the same name. This is true for some
tables written by Hive (e.g. in our minicluster the tpcds.store_sales
has this problem). This can lead to unbalanced placement of remote
ranges.

This adds a partition_path_hash to the THdfsFileSplit and
THdfsFileSplitGeneratorSpec, calculated in the frontend (which has all of
the partition information). The scheduler hashes this in addition to
the relative path.

Testing:
 - Added several new scheduler tests that verify the consistent remote
   scheduling see blocks with different relative paths, partition paths,
   or offsets as distinct.
 - Ran core tests

Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
---
M be/src/scheduling/scheduler-test-util.cc
M be/src/scheduling/scheduler-test-util.h
M be/src/scheduling/scheduler-test.cc
M be/src/scheduling/scheduler.cc
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/test/java/org/apache/impala/planner/ExplainTest.java
7 files changed, 277 insertions(+), 42 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/45/13545/6
-- 
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 6
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13545 )

Change subject: IMPALA-8630: Hash the full path when calculating consistent remote placement
......................................................................


Patch Set 8:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/3689/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 8
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Thu, 20 Jun 2019 01:36:35 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13545 )

Change subject: IMPALA-8630: Hash the full path when calculating consistent remote placement
......................................................................


Patch Set 11:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/4513/ DRY_RUN=true


-- 
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 11
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Thu, 20 Jun 2019 19:53:26 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Hello Tim Armstrong, Todd Lipcon, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/13545

to look at the new patch set (#4).

Change subject: IMPALA-8630: Hash the full path when calculating consistent remote placement
......................................................................

IMPALA-8630: Hash the full path when calculating consistent remote placement

Consistent remote placement currently uses the relative filename within
a partition for the consistent hash. If the relative filenames for
different partitions have a simple naming scheme, then multiple
partitions may have files of the same name. This is true for some
tables written by Hive (e.g. in our minicluster the tpcds.store_sales
has this problem). This can lead to unbalanced placement of remote
ranges.

This adds a partition_path_hash to the THdfsFileSplit and
THdfsFileSplitGeneratorSpec, calculated in the frontend (which has all of
the partition information). The scheduler hashes this in addition to
the relative path.

Testing:
 - Added several new scheduler tests that verify the consistent remote
   scheduling see blocks with different relative paths, partition paths,
   or offsets as distinct.
 - Ran core tests

Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
---
M be/src/scheduling/scheduler-test-util.cc
M be/src/scheduling/scheduler-test-util.h
M be/src/scheduling/scheduler-test.cc
M be/src/scheduling/scheduler.cc
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
6 files changed, 276 insertions(+), 42 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/45/13545/4
-- 
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 4
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[Impala-ASF-CR] [WIP] IMPALA-8630: Hash the full path when calculating consistent remote placement

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13545 )

Change subject: [WIP] IMPALA-8630: Hash the full path when calculating consistent remote placement
......................................................................


Patch Set 3:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/3635/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 3
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Sat, 15 Jun 2019 01:23:44 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13545 )

Change subject: IMPALA-8630: Hash the full path when calculating consistent remote placement
......................................................................


Patch Set 6:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/4496/ DRY_RUN=true


-- 
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 6
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Tue, 18 Jun 2019 23:57:35 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/13545 )

Change subject: IMPALA-8630: Hash the full path when calculating consistent remote placement
......................................................................


Patch Set 11: Code-Review+1


-- 
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 11
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Fri, 21 Jun 2019 00:36:16 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Hello Lars Volker, Tim Armstrong, Todd Lipcon, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/13545

to look at the new patch set (#8).

Change subject: IMPALA-8630: Hash the full path when calculating consistent remote placement
......................................................................

IMPALA-8630: Hash the full path when calculating consistent remote placement

Consistent remote placement currently uses the relative filename within
a partition for the consistent hash. If the relative filenames for
different partitions have a simple naming scheme, then multiple
partitions may have files of the same name. This is true for some
tables written by Hive (e.g. in our minicluster the tpcds.store_sales
has this problem). This can lead to unbalanced placement of remote
ranges.

This adds a partition_path_hash to the THdfsFileSplit and
THdfsFileSplitGeneratorSpec, calculated in the frontend (which has all of
the partition information). The scheduler hashes this in addition to
the relative path.

Testing:
 - Added several new scheduler tests that verify the consistent remote
   scheduling sees blocks with different relative paths, partition paths,
   or offsets as distinct.
 - Ran core tests

Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
---
M be/src/scheduling/scheduler-test-util.cc
M be/src/scheduling/scheduler-test-util.h
M be/src/scheduling/scheduler-test.cc
M be/src/scheduling/scheduler.cc
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/test/java/org/apache/impala/planner/ExplainTest.java
7 files changed, 330 insertions(+), 43 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/45/13545/8
-- 
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 8
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13545 )

Change subject: IMPALA-8630: Hash the full path when calculating consistent remote placement
......................................................................


Patch Set 9: Verified+1


-- 
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 9
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Thu, 20 Jun 2019 06:41:54 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Hello Lars Volker, Tim Armstrong, Todd Lipcon, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/13545

to look at the new patch set (#10).

Change subject: IMPALA-8630: Hash the full path when calculating consistent remote placement
......................................................................

IMPALA-8630: Hash the full path when calculating consistent remote placement

Consistent remote placement currently uses the relative filename within
a partition for the consistent hash. If the relative filenames for
different partitions have a simple naming scheme, then multiple
partitions may have files of the same name. This is true for some
tables written by Hive (e.g. in our minicluster the tpcds.store_sales
has this problem). This can lead to unbalanced placement of remote
ranges.

This adds a partition_path_hash to the THdfsFileSplit and
THdfsFileSplitGeneratorSpec, calculated in the frontend (which has all of
the partition information). The scheduler hashes this in addition to
the relative path.

Testing:
 - Added several new scheduler tests that verify the consistent remote
   scheduling sees blocks with different relative paths, partition paths,
   or offsets as distinct.
 - Ran core tests

Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
---
M be/src/scheduling/scheduler-test-util.cc
M be/src/scheduling/scheduler-test-util.h
M be/src/scheduling/scheduler-test.cc
M be/src/scheduling/scheduler.cc
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/test/java/org/apache/impala/planner/ExplainTest.java
7 files changed, 332 insertions(+), 55 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/45/13545/10
-- 
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 10
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[Impala-ASF-CR] [WIP] IMPALA-8630: Hash the full path when calculating consistent remote placement

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Joe McDonnell has posted comments on this change. ( http://gerrit.cloudera.org:8080/13545 )

Change subject: [WIP] IMPALA-8630: Hash the full path when calculating consistent remote placement
......................................................................


Patch Set 2:

I was missing an option: Don't store it in FbFileDesc, but hash the partition path it in the frontend. HdfsScanNode.java already has the partition information in order to pass in partition_id, so it can hash the partition path (FeFsPartition.getLocation()) and put it in the THdfsFileSplit. The backend doesn't need to rebuild the descriptor table, it just hashes one more thing. Nothing happens in the catalog.


-- 
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 2
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Fri, 14 Jun 2019 23:51:53 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13545 )

Change subject: IMPALA-8630: Hash the full path when calculating consistent remote placement
......................................................................


Patch Set 7: Verified+1


-- 
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 7
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Wed, 19 Jun 2019 23:42:27 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13545 )

Change subject: IMPALA-8630: Hash the full path when calculating consistent remote placement
......................................................................


Patch Set 7:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/3680/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 7
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Wed, 19 Jun 2019 18:43:18 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] [WIP] IMPALA-8630: Hash the full path when calculating consistent remote placement

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Hello Tim Armstrong, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/13545

to look at the new patch set (#2).

Change subject: [WIP] IMPALA-8630: Hash the full path when calculating consistent remote placement
......................................................................

[WIP] IMPALA-8630: Hash the full path when calculating consistent remote placement

Consistent remote placement currently uses the relative filename within
a partition for the consistent hash. If the relative filenames for
different partitions have a simple naming scheme, then multiple
partitions may have files of the same name. This is true for some
tables written by Hive (e.g. in our minicluster the tpcds.store_sales
has this problem). This can lead to unbalanced placement of remote
ranges.

This adds a full_path_hash to the FbFileDesc and computes it in the catalog.
The THdfsFileSplit is based on the FbFileDesc, and this gets passed through.
The scheduler hashes this rather than the relative path.

This is a proof of concept and is not ready to go.

Work remaining (if this is the approach we take):
 - Add some scheduler tests to simulate this approach
 - Look into an end-to-end test based on the Docker tests (which do remote IO)

The alternative is to construct the DescriptorTbl in the scheduler
and use that to reconstruct the full path for a file and hash it.

Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
---
M be/src/scheduling/scheduler-test-util.cc
M be/src/scheduling/scheduler-test-util.h
M be/src/scheduling/scheduler-test.cc
M be/src/scheduling/scheduler.cc
M common/fbs/CatalogObjects.fbs
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/catalog/HdfsPartition.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
8 files changed, 86 insertions(+), 24 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/45/13545/2
-- 
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 2
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement

Posted by "Lars Volker (Code Review)" <ge...@cloudera.org>.
Lars Volker has posted comments on this change. ( http://gerrit.cloudera.org:8080/13545 )

Change subject: IMPALA-8630: Hash the full path when calculating consistent remote placement
......................................................................


Patch Set 7:

(9 comments)

http://gerrit.cloudera.org:8080/#/c/13545/7//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/13545/7//COMMIT_MSG@24
PS7, Line 24: see
nit: sees?


http://gerrit.cloudera.org:8080/#/c/13545/7/be/src/scheduling/scheduler-test-util.h
File be/src/scheduling/scheduler-test-util.h:

http://gerrit.cloudera.org:8080/#/c/13545/7/be/src/scheduling/scheduler-test-util.h@95
PS7, Line 95: PARTITIONED_SIMPLE_NAMES
PARTITIONED_SINGLE_NAME? PARTITIONED_CONSTANT_NAMES?


http://gerrit.cloudera.org:8080/#/c/13545/7/be/src/scheduling/scheduler-test-util.cc
File be/src/scheduling/scheduler-test-util.cc:

http://gerrit.cloudera.org:8080/#/c/13545/7/be/src/scheduling/scheduler-test-util.cc@295
PS7, Line 295:     // Encoding the table name and index in the file helps debugging.
Maybe include one or two examples of a path in the comment in each case? That might make the intent easier to follow.


http://gerrit.cloudera.org:8080/#/c/13545/7/be/src/scheduling/scheduler-test-util.cc@353
PS7, Line 353:   GetBlockPaths(table_name, true, spec_idx, naming_policy, &relative_path,
One of the parameters could be returned by value, e.g. the partition_id


http://gerrit.cloudera.org:8080/#/c/13545/7/be/src/scheduling/scheduler-test.cc
File be/src/scheduling/scheduler-test.cc:

http://gerrit.cloudera.org:8080/#/c/13545/7/be/src/scheduling/scheduler-test.cc@264
PS7, Line 264: void InitRemoteCluster(Cluster *cluster) {
This could be a member of Cluster, e.g. static Cluster CreateRemoteCluster(num_impala_nodes, num_data_nodes).


http://gerrit.cloudera.org:8080/#/c/13545/7/be/src/scheduling/scheduler-test.cc@305
PS7, Line 305: 
Can you add brief comments above each test describing what to do (see other tests in this file for examples)?


http://gerrit.cloudera.org:8080/#/c/13545/7/be/src/scheduling/scheduler-test.cc@314
PS7, Line 314:     Schema schema(cluster);
These loops could benefit from adding a SCOPED_TRACE for the naming policy (https://github.com/google/googletest/blob/master/googletest/docs/advanced.md#adding-traces-to-assertions).


http://gerrit.cloudera.org:8080/#/c/13545/7/common/thrift/PlanNodes.thrift
File common/thrift/PlanNodes.thrift:

http://gerrit.cloudera.org:8080/#/c/13545/7/common/thrift/PlanNodes.thrift@195
PS7, Line 195: partition's path
For S3, this could be a bucket? Does it matter whether it uses the whole path or just the partition id? Should we just call it "partition_hash"?


http://gerrit.cloudera.org:8080/#/c/13545/7/common/thrift/PlanNodes.thrift@196
PS7, Line 196: consistent
"consistent" here mean "given the same input, must return the same hash", right?



-- 
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 7
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Wed, 19 Jun 2019 22:29:03 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13545 )

Change subject: IMPALA-8630: Hash the full path when calculating consistent remote placement
......................................................................


Patch Set 13: Verified+1


-- 
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 13
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Thu, 27 Jun 2019 04:24:19 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13545 )

Change subject: IMPALA-8630: Hash the full path when calculating consistent remote placement
......................................................................


Patch Set 12:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/4543/ DRY_RUN=true


-- 
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 12
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Tue, 25 Jun 2019 22:05:53 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Joe McDonnell has posted comments on this change. ( http://gerrit.cloudera.org:8080/13545 )

Change subject: IMPALA-8630: Hash the full path when calculating consistent remote placement
......................................................................


Patch Set 13: Code-Review+2

Rebased, carry +2


-- 
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 13
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Wed, 26 Jun 2019 22:51:16 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13545 )

Change subject: IMPALA-8630: Hash the full path when calculating consistent remote placement
......................................................................


Patch Set 4:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/3670/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 4
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Tue, 18 Jun 2019 22:16:20 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-8630: Hash the full path when calculating consistent remote placement

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/13545 )

Change subject: IMPALA-8630: Hash the full path when calculating consistent remote placement
......................................................................


Patch Set 11: Verified+1


-- 
To view, visit http://gerrit.cloudera.org:8080/13545
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46c739fc31af539af2b3509e2a161f4e29f44d7b
Gerrit-Change-Number: 13545
Gerrit-PatchSet: 11
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Fri, 21 Jun 2019 01:22:42 +0000
Gerrit-HasComments: No