You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@kudu.apache.org by "Mahesh Reddy (Code Review)" <ge...@cloudera.org> on 2020/10/14 01:06:16 UTC

[kudu-CR] [partitioning] KUDU-2671: Support for range specific HashSchemas.

Mahesh Reddy has uploaded this change for review. ( http://gerrit.cloudera.org:8080/16596


Change subject: [partitioning] KUDU-2671: Support for range specific HashSchemas.
......................................................................

[partitioning] KUDU-2671: Support for range specific HashSchemas.

This patch updates Partition::CreatePartitions() to support the ability to
add different hash schemas per each range. If no hash schema per range
is specified, the table wide hash schema is used. Currently, this only
works if no split_rows are specified. Open to discussion on whether or
not we should support this feature with split_rows.

Currently, RangeHashSchema within partition.h maps each range's vector
of hash schemas to an integer. Ideally, it would be mapped to the
range's upper and lower bounds but if we support split_rows the bounds
would not be finalized until within CreatePartitions().

Inspiration from Vlad: https://gerrit.cloudera.org/c/15758/

Change-Id: I8725f4bd072a81b05b36dfc7df0c074c172b4ce8
---
M src/kudu/common/partition-test.cc
M src/kudu/common/partition.cc
M src/kudu/common/partition.h
M src/kudu/common/partition_pruner-test.cc
M src/kudu/integration-tests/ts_tablet_manager-itest.cc
M src/kudu/master/catalog_manager.cc
M src/kudu/master/sys_catalog.cc
M src/kudu/tablet/tablet-harness.h
M src/kudu/tserver/tablet_server-test.cc
9 files changed, 258 insertions(+), 41 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/96/16596/1
-- 
To view, visit http://gerrit.cloudera.org:8080/16596
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I8725f4bd072a81b05b36dfc7df0c074c172b4ce8
Gerrit-Change-Number: 16596
Gerrit-PatchSet: 1
Gerrit-Owner: Mahesh Reddy <mr...@cloudera.com>

[kudu-CR] [partitioning] KUDU-2671: Support for range specific HashSchemas.

Posted by "Mahesh Reddy (Code Review)" <ge...@cloudera.org>.

Mahesh Reddy has posted comments on this change. ( http://gerrit.cloudera.org:8080/16596 )

Change subject: [partitioning] KUDU-2671: Support for range specific HashSchemas.
......................................................................


Patch Set 12:

> Uploaded patch set 12.

updated comments in 'partition.h', replaced 'std::random_shuffle' with 'std::shuffle' in test case.


-- 
To view, visit http://gerrit.cloudera.org:8080/16596
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8725f4bd072a81b05b36dfc7df0c074c172b4ce8
Gerrit-Change-Number: 16596
Gerrit-PatchSet: 12
Gerrit-Owner: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Mon, 30 Nov 2020 23:51:11 +0000
Gerrit-HasComments: No

[kudu-CR] [partitioning] KUDU-2671: Support for range specific HashSchemas.

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.

Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16596 )

Change subject: [partitioning] KUDU-2671: Support for range specific HashSchemas.
......................................................................


Patch Set 12: Verified+1 Code-Review+2

The test failures are unrelated to this patch.


-- 
To view, visit http://gerrit.cloudera.org:8080/16596
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8725f4bd072a81b05b36dfc7df0c074c172b4ce8
Gerrit-Change-Number: 16596
Gerrit-PatchSet: 12
Gerrit-Owner: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Tue, 01 Dec 2020 00:41:58 +0000
Gerrit-HasComments: No

[kudu-CR] [partitioning] KUDU-2671: Support for range specific HashSchemas.

Posted by "Mahesh Reddy (Code Review)" <ge...@cloudera.org>.

Hello Tidy Bot, Alexey Serbin, Kudu Jenkins, Andrew Wong, Grant Henke, Bankim Bhavsar, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/16596

to look at the new patch set (#9).

Change subject: [partitioning] KUDU-2671: Support for range specific HashSchemas.
......................................................................

[partitioning] KUDU-2671: Support for range specific HashSchemas.

This patch updates PartitionSchema::CreatePartitions() to support the
ability to add different hash schemas per each range. If no hash schema
per range is specified, the table wide hash schema is used. Currently,
this only works if no split_rows are specified.

Since split_rows only exists for backwards compatibility reasons, I'm
not supporting this feature with split_rows. Returning a message to the
user stating to specify both upper and lower bounds either at table
creation or alteration time should suffice. Split_rows is also more
syntactically ambiguous when specifying bounds.

Currently, range_hash_schemas holds the HashBucketSchemas for each range.
Its order corresponds to the bounds in range_bounds so that when the
bounds are sorted its corresponding hash schemas are sorted as well.

Inspiration from Vlad: https://gerrit.cloudera.org/c/15758/

Change-Id: I8725f4bd072a81b05b36dfc7df0c074c172b4ce8
---
M src/kudu/common/partition-test.cc
M src/kudu/common/partition.cc
M src/kudu/common/partition.h
M src/kudu/common/partition_pruner-test.cc
M src/kudu/common/scan_spec-test.cc
M src/kudu/integration-tests/ts_tablet_manager-itest.cc
M src/kudu/master/catalog_manager.cc
M src/kudu/master/sys_catalog.cc
M src/kudu/tablet/tablet-harness.h
M src/kudu/tserver/tablet_server-test.cc
10 files changed, 447 insertions(+), 122 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/96/16596/9
-- 
To view, visit http://gerrit.cloudera.org:8080/16596
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8725f4bd072a81b05b36dfc7df0c074c172b4ce8
Gerrit-Change-Number: 16596
Gerrit-PatchSet: 9
Gerrit-Owner: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Reviewer: Tidy Bot (241)

[kudu-CR] [partitioning] KUDU-2671: Support for range specific HashSchemas.

Posted by "Mahesh Reddy (Code Review)" <ge...@cloudera.org>.

Mahesh Reddy has posted comments on this change. ( http://gerrit.cloudera.org:8080/16596 )

Change subject: [partitioning] KUDU-2671: Support for range specific HashSchemas.
......................................................................


Patch Set 10:

> Uploaded patch set 10.

Just fixing what Tidy bot pointed out, debug build failed due to TxnParticipantElectionStormITest.TestFrequentElections failing, saw it was on the flaky test dashboard but will investigate further if it fails again.


-- 
To view, visit http://gerrit.cloudera.org:8080/16596
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8725f4bd072a81b05b36dfc7df0c074c172b4ce8
Gerrit-Change-Number: 16596
Gerrit-PatchSet: 10
Gerrit-Owner: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Thu, 26 Nov 2020 03:49:51 +0000
Gerrit-HasComments: No

[kudu-CR] [partitioning] KUDU-2671: Support for range specific HashSchemas.

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.

Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16596 )

Change subject: [partitioning] KUDU-2671: Support for range specific HashSchemas.
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/16596/1/src/kudu/common/partition.h
File src/kudu/common/partition.h:

http://gerrit.cloudera.org:8080/#/c/16596/1/src/kudu/common/partition.h@389
PS1, Line 389: std::vector<HashBucketSchema>
> nit: I wonder if it makes sense to typedef vector<HashBucketSchema> as some
Oops, s/HashBucketSpec/HashBucketSchemas

This would also be useful for hash_partition_schemas().



-- 
To view, visit http://gerrit.cloudera.org:8080/16596
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8725f4bd072a81b05b36dfc7df0c074c172b4ce8
Gerrit-Change-Number: 16596
Gerrit-PatchSet: 1
Gerrit-Owner: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Comment-Date: Wed, 14 Oct 2020 02:41:35 +0000
Gerrit-HasComments: Yes

[kudu-CR] WIP [partitioning] KUDU-2671: Support for range specific HashSchemas.

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.

Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16596 )

Change subject: WIP [partitioning] KUDU-2671: Support for range specific HashSchemas.
......................................................................


Patch Set 7:

(6 comments)

Overall looking better. Just some more cosmetic suggestions and a suggestion to reduce code duplication in test.

http://gerrit.cloudera.org:8080/#/c/16596/7//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/16596/7//COMMIT_MSG@14
PS7, Line 14: Since split_rows only exists for backwards compatibility reasons, I'm
            : leaning towards not supporting this feature with split_rows. Returning
            : a message to the user stating to specify both upper and lower bounds
            : either at table creation or alteration time should suffice. Split_rows
            : is also more syntactically ambigious when specifying bounds.
+1

It isn't clear to me whether this is the reason this is marked WIP. If so, could you make that clearer by saying something like, "WIP because ..."

Also, if you want to discuss this further, I'd recommend pulling people into a meeting or an email thread or Slack discussion to discuss further.


http://gerrit.cloudera.org:8080/#/c/16596/7/src/kudu/common/partition-test.cc
File src/kudu/common/partition-test.cc:

http://gerrit.cloudera.org:8080/#/c/16596/7/src/kudu/common/partition-test.cc@121
PS7, Line 121: CheckPartitions
nit: maybe move this closer to the test that uses it.

Protip: you can create anonymous namespaces anywhere in the file, not just at the top. The top is just convenient if we expect many tests to uses these methods, which isn't the case here.


http://gerrit.cloudera.org:8080/#/c/16596/7/src/kudu/common/partition-test.cc@1101
PS7, Line 1101: range_hash_schema
nit: make this plural


http://gerrit.cloudera.org:8080/#/c/16596/7/src/kudu/common/partition-test.cc@1042
PS7, Line 1042: PartitionTest, TestVaryingHashSchemasPerRangeOutOfOrder) {
              :   // CREATE TABLE t (a VARCHAR, b VARCHAR, c VARCHAR, PRIMARY KEY (a, b, c))
              :   // PARTITION BY [HASH BUCKET (a, c), HASH BUCKET (b), RANGE (a, b, c)];
              :   Schema schema({ ColumnSchema("a", STRING),
              :                   ColumnSchema("b", STRING),
              :                   ColumnSchema("c", STRING) },
              :                 { ColumnId(0), ColumnId(1), ColumnId(2) }, 3);
              : 
              :   PartitionSchemaPB schema_builder;
              :   // Table-wide HashSchema defined below, 3 by 2 buckets so 6 total.
              :   AddHashBucketComponent(&schema_builder, { "a", "c" }, 3, 0);
              :   AddHashBucketComponent(&schema_builder, { "b" }, 2, 0);
              :   PartitionSchema partition_schema;
              :   ASSERT_OK(PartitionSchema::FromPB(schema_builder, schema, &partition_schema));
              : 
              :   ASSERT_EQ("HASH (a, c) PARTITIONS 3, HASH (b) PARTITIONS 2, RANGE (a, b, c)",
              :   partition_schema.DebugString(schema));
              : 
              :   vector<pair<KuduPartialRow, KuduPartialRow>> bounds;
              : 
              :   // The bounds aren't inserted in sorted order unlike the previous test,
              :   // yet the resulting partitions will be the same.
              :   { // [(a3, b3, _), (a4, b4, _))
              :     KuduPartialRow lower(&schema);
              :     KuduPartialRow upper(&schema);
              :     ASSERT_OK(lower.SetStringCopy("a", "a3"));
              :     ASSERT_OK(lower.SetStringCopy("b", "b3"));
              :     ASSERT_OK(upper.SetStringCopy("a", "a4"));
              :     ASSERT_OK(upper.SetStringCopy("b", "b4"));
              :     bounds.emplace_back(std::move(lower), std::move(upper));
              :   }
              : 
              :   { // [(a1, _, c1), (a2, _, c2))
              :     KuduPartialRow lower(&schema);
              :     KuduPartialRow upper(&schema);
              :     ASSERT_OK(lower.SetStringCopy("a", "a1"));
              :     ASSERT_OK(lower.SetStringCopy("c", "c1"));
              :     ASSERT_OK(upper.SetStringCopy("a", "a2"));
              :     ASSERT_OK(upper.SetStringCopy("c", "c2"));
              :     bounds.emplace_back(std::move(lower), std::move(upper));
              :   }
              : 
              :   { // [(a5, b5, _), (a6, _, c6))
              :     KuduPartialRow lower(&schema);
              :     KuduPartialRow upper(&schema);
              :     ASSERT_OK(lower.SetStringCopy("a", "a5"));
              :     ASSERT_OK(lower.SetStringCopy("b", "b5"));
              :     ASSERT_OK(upper.SetStringCopy("a", "a6"));
              :     ASSERT_OK(upper.SetStringCopy("c", "c6"));
              :     bounds.emplace_back(std::move(lower), std::move(upper));
              :   }
              : 
              :   vector<PartitionSchema::HashBucketSchema> hash_schema_4_buckets = {{{ColumnId(0)}, 4, 0}};
              :   vector<PartitionSchema::HashBucketSchema> hash_schema_2_buckets_by_3 = {
              :       {{ColumnId(0)}, 2, 0},
              :       {{ColumnId(1)}, 3, 0}
              :   };
              : 
              :   // HashBucketSchemas have to be in the same order as the bounds it corresponds to
              :   PartitionSchema::RangeHashSchema range_hash_schema = {
              :       {vector<PartitionSchema::HashBucketSchema>()},
              :       {hash_schema_4_buckets},
              :       {hash_schema_2_buckets_by_3}
              :   };
              : 
              :   vector<Partition> partitions;
              :   ASSERT_OK(partition_schema.CreatePartitions({}, bounds, range_hash_schema, schema, &partitions));
              :   CheckPartitions(partitions);
              : }
How about instead of duplicating the above test, we combine 'bounds' and 'range_hash_schema' to 'bounds_and_hash_schemas' of type vector<pair<pair<KuduPartialRow, KuduPartialRow>, RangeHashSchema>>. Then, we could generate 'bounds' and 'range_hash_schemas' from the 'bounds_and_hash_schemas' that would correctly correspond with each other, even if we were to, say, shuffle 'bounds_and_hash_schemas'.


http://gerrit.cloudera.org:8080/#/c/16596/7/src/kudu/common/partition.cc
File src/kudu/common/partition.cc:

http://gerrit.cloudera.org:8080/#/c/16596/7/src/kudu/common/partition.cc@329
PS7, Line 329:                                              bounds_with_hash_schemas) const {
If 'splits' is empty, could we short circuit out of here and just return without changing 'bounds_with_hash_schemas'?


http://gerrit.cloudera.org:8080/#/c/16596/7/src/kudu/common/partition.cc@359
PS7, Line 359: bound.hash_schemas
If you go my suggested route of short circuiting, we should be able to just replace these with {}, right? Since hash schemas + split rows isn't supported?



-- 
To view, visit http://gerrit.cloudera.org:8080/16596
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8725f4bd072a81b05b36dfc7df0c074c172b4ce8
Gerrit-Change-Number: 16596
Gerrit-PatchSet: 7
Gerrit-Owner: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Tue, 24 Nov 2020 07:29:27 +0000
Gerrit-HasComments: Yes

[kudu-CR] [partitioning] KUDU-2671: Support for range specific HashSchemas.

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.

Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16596 )

Change subject: [partitioning] KUDU-2671: Support for range specific HashSchemas.
......................................................................


Patch Set 8:

The ASAN errors will be alleviated if you rebase on master.


-- 
To view, visit http://gerrit.cloudera.org:8080/16596
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8725f4bd072a81b05b36dfc7df0c074c172b4ce8
Gerrit-Change-Number: 16596
Gerrit-PatchSet: 8
Gerrit-Owner: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Wed, 25 Nov 2020 01:04:34 +0000
Gerrit-HasComments: No

[kudu-CR] WIP [partitioning] KUDU-2671: Support for range specific HashSchemas.

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.

Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16596 )

Change subject: WIP [partitioning] KUDU-2671: Support for range specific HashSchemas.
......................................................................


Patch Set 3:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/16596/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/16596/3//COMMIT_MSG@25
PS3, Line 25: deisgn
nit: design


http://gerrit.cloudera.org:8080/#/c/16596/3/src/kudu/common/partition.h
File src/kudu/common/partition.h:

http://gerrit.cloudera.org:8080/#/c/16596/3/src/kudu/common/partition.h@363
PS3, Line 363: c
nit: spacing


http://gerrit.cloudera.org:8080/#/c/16596/3/src/kudu/common/partition.cc
File src/kudu/common/partition.cc:

http://gerrit.cloudera.org:8080/#/c/16596/3/src/kudu/common/partition.cc@387
PS3, Line 387:   if (!split_rows.empty() && !range_hash_schema.empty()) {
Can we also enforce that 'range_hash_schema' and 'range_bounds' are the same length? Is that always going to be true if 'range_hash_schema' isn't empty?


http://gerrit.cloudera.org:8080/#/c/16596/3/src/kudu/common/partition.cc@409
PS3, Line 409:   RETURN_NOT_OK(EncodeRangeBounds(range_bounds, schema, &bounds));
             :   RETURN_NOT_OK(EncodeRangeSplits(split_rows, schema, &splits));
             :   RETURN_NOT_OK(SplitRangeBounds(schema, std::move(splits), &bounds));
You mentioned that we sort the bounds here, and in your commit message, you mention that being an issue since the 'range_hash_schema' doesn't get sorted in the same order. IMO we should address that in this patch.

Could you also add some tests for this?


http://gerrit.cloudera.org:8080/#/c/16596/3/src/kudu/common/partition.cc@414
PS3, Line 414: as per current design.
nit: remove this -- "current design" doesn't really mean anything in the broader context of the Kudu codebase.



-- 
To view, visit http://gerrit.cloudera.org:8080/16596
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8725f4bd072a81b05b36dfc7df0c074c172b4ce8
Gerrit-Change-Number: 16596
Gerrit-PatchSet: 3
Gerrit-Owner: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Thu, 29 Oct 2020 19:34:10 +0000
Gerrit-HasComments: Yes

[kudu-CR] [partitioning] KUDU-2671: Support for range specific HashSchemas.

Posted by "Mahesh Reddy (Code Review)" <ge...@cloudera.org>.

Mahesh Reddy has posted comments on this change. ( http://gerrit.cloudera.org:8080/16596 )

Change subject: [partitioning] KUDU-2671: Support for range specific HashSchemas.
......................................................................


Patch Set 11:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/16596/11/src/kudu/common/partition-test.cc
File src/kudu/common/partition-test.cc:

http://gerrit.cloudera.org:8080/#/c/16596/11/src/kudu/common/partition-test.cc@1073
PS11, Line 1073:   range_hash_schemas.emplace_back(PartitionSchema::HashBucketSchemas());
> Is this important? If not, maybe remove it? Seems like it shouldn't be sinc
I added it b/c there's a check for the size of 'range_hash_schemas' to match the size of the eventual bounds, but that check happens after the check below that will return an Invalid argument message. In short, we will never reach this size check if the below check is triggered properly so yes I can remove it.


http://gerrit.cloudera.org:8080/#/c/16596/11/src/kudu/common/partition.h
File src/kudu/common/partition.h:

http://gerrit.cloudera.org:8080/#/c/16596/11/src/kudu/common/partition.h@413
PS11, Line 413: encoded_bounds
> nit: can you update this?
Ack


http://gerrit.cloudera.org:8080/#/c/16596/11/src/kudu/common/partition.h@416
PS11, Line 416: range_hash_schemas
> nit: can you also document how this is used, what to expect if it's empty, 
Ack


http://gerrit.cloudera.org:8080/#/c/16596/11/src/kudu/common/partition.h@420
PS11, Line 420:   // Splits the encoded range bounds by the split points. The splits and bounds
              :   // must be sorted. If `bounds` is empty, then a single unbounded range is
              :   // assumed. If any of the splits falls outside of the bounds then an
              :   // InvalidArgument status is returned.
> nit: can you update this?
Ack



-- 
To view, visit http://gerrit.cloudera.org:8080/16596
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8725f4bd072a81b05b36dfc7df0c074c172b4ce8
Gerrit-Change-Number: 16596
Gerrit-PatchSet: 11
Gerrit-Owner: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Mon, 30 Nov 2020 23:49:27 +0000
Gerrit-HasComments: Yes

[kudu-CR] WIP [partitioning] KUDU-2671: Support for range specific HashSchemas.

Posted by "Mahesh Reddy (Code Review)" <ge...@cloudera.org>.

Hello Tidy Bot, Alexey Serbin, Kudu Jenkins, Andrew Wong, Grant Henke, Bankim Bhavsar, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/16596

to look at the new patch set (#3).

Change subject: WIP [partitioning] KUDU-2671: Support for range specific HashSchemas.
......................................................................

WIP [partitioning] KUDU-2671: Support for range specific HashSchemas.

This patch updates Partition::CreatePartitions() to support the ability to
add different hash schemas per each range. If no hash schema per range
is specified, the table wide hash schema is used. Currently, this only
works if no split_rows are specified.

Since split_rows only exists for backwards compatibility reasons, I'm
leaning towards not supporting this feature with split_rows. Returning
a message to the user stating to specify both upper and lower bounds
either at table creation or alteration time should suffice. Split_rows
is also more syntactically ambigious when specifying bounds.

Currently, RangeHashSchema holds the HashBucketSchemas for each range.
One of the future patches will update the client code to match a range
to its HashBucketSchemas. When the bounds get sorted, since the
HashBucketSchemas are matched to the proper bound, they will also get
sorted. This will guarantee us sorted order of the HashBucketSchemas in
relation to the bounds and validate the current deisgn of RangeHashSchema.

Inspiration from Vlad: https://gerrit.cloudera.org/c/15758/

Change-Id: I8725f4bd072a81b05b36dfc7df0c074c172b4ce8
---
M src/kudu/common/partition-test.cc
M src/kudu/common/partition.cc
M src/kudu/common/partition.h
M src/kudu/common/partition_pruner-test.cc
M src/kudu/integration-tests/ts_tablet_manager-itest.cc
M src/kudu/master/catalog_manager.cc
M src/kudu/master/sys_catalog.cc
M src/kudu/tablet/tablet-harness.h
M src/kudu/tserver/tablet_server-test.cc
9 files changed, 289 insertions(+), 55 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/96/16596/3
-- 
To view, visit http://gerrit.cloudera.org:8080/16596
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8725f4bd072a81b05b36dfc7df0c074c172b4ce8
Gerrit-Change-Number: 16596
Gerrit-PatchSet: 3
Gerrit-Owner: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Reviewer: Tidy Bot (241)

[kudu-CR] [partitioning] KUDU-2671: Support for range specific HashSchemas.

Posted by "Mahesh Reddy (Code Review)" <ge...@cloudera.org>.

Hello Alexey Serbin, Kudu Jenkins, Andrew Wong, Grant Henke, Bankim Bhavsar, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/16596

to look at the new patch set (#2).

Change subject: [partitioning] KUDU-2671: Support for range specific HashSchemas.
......................................................................

[partitioning] KUDU-2671: Support for range specific HashSchemas.

This patch updates Partition::CreatePartitions() to support the ability to
add different hash schemas per each range. If no hash schema per range
is specified, the table wide hash schema is used. Currently, this only
works if no split_rows are specified (more info in design doc).

Currently, RangeHashSchema within partition.h maps each range's vector
of hash schemas to a filler string. Ideally, it would be mapped to the
range's lower bound but if we support split_rows the bounds
would not be finalized until within CreatePartitions().

Inspiration from Vlad: https://gerrit.cloudera.org/c/15758/

Change-Id: I8725f4bd072a81b05b36dfc7df0c074c172b4ce8
---
M src/kudu/common/partition-test.cc
M src/kudu/common/partition.cc
M src/kudu/common/partition.h
M src/kudu/common/partition_pruner-test.cc
M src/kudu/integration-tests/ts_tablet_manager-itest.cc
M src/kudu/master/catalog_manager.cc
M src/kudu/master/sys_catalog.cc
M src/kudu/tablet/tablet-harness.h
M src/kudu/tserver/tablet_server-test.cc
9 files changed, 299 insertions(+), 50 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/96/16596/2
-- 
To view, visit http://gerrit.cloudera.org:8080/16596
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8725f4bd072a81b05b36dfc7df0c074c172b4ce8
Gerrit-Change-Number: 16596
Gerrit-PatchSet: 2
Gerrit-Owner: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy <mr...@cloudera.com>

[kudu-CR] WIP [partitioning] KUDU-2671: Support for range specific HashSchemas.

Posted by "Mahesh Reddy (Code Review)" <ge...@cloudera.org>.

Hello Tidy Bot, Alexey Serbin, Kudu Jenkins, Andrew Wong, Grant Henke, Bankim Bhavsar, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/16596

to look at the new patch set (#7).

Change subject: WIP [partitioning] KUDU-2671: Support for range specific HashSchemas.
......................................................................

WIP [partitioning] KUDU-2671: Support for range specific HashSchemas.

This patch updates PartitionSchema::CreatePartitions() to support the
ability to add different hash schemas per each range. If no hash schema
per range is specified, the table wide hash schema is used. Currently,
this only works if no split_rows are specified.

Since split_rows only exists for backwards compatibility reasons, I'm
leaning towards not supporting this feature with split_rows. Returning
a message to the user stating to specify both upper and lower bounds
either at table creation or alteration time should suffice. Split_rows
is also more syntactically ambigious when specifying bounds.

Currently, range_hash_schema holds the HashBucketSchemas for each range.
Its order corresponds to the bounds in range_bounds and an extra
parameter is supplied to EncodeRangeBounds to ensure that when the
bounds are sorted its corresponding schemas are sorted as well.

Inspiration from Vlad: https://gerrit.cloudera.org/c/15758/

Change-Id: I8725f4bd072a81b05b36dfc7df0c074c172b4ce8
---
M src/kudu/common/partition-test.cc
M src/kudu/common/partition.cc
M src/kudu/common/partition.h
M src/kudu/common/partition_pruner-test.cc
M src/kudu/integration-tests/ts_tablet_manager-itest.cc
M src/kudu/master/catalog_manager.cc
M src/kudu/master/sys_catalog.cc
M src/kudu/tablet/tablet-harness.h
M src/kudu/tserver/tablet_server-test.cc
9 files changed, 454 insertions(+), 118 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/96/16596/7
-- 
To view, visit http://gerrit.cloudera.org:8080/16596
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8725f4bd072a81b05b36dfc7df0c074c172b4ce8
Gerrit-Change-Number: 16596
Gerrit-PatchSet: 7
Gerrit-Owner: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Reviewer: Tidy Bot (241)

[kudu-CR] [partitioning] KUDU-2671: Support for range specific HashSchemas.

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.

Andrew Wong has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/16596 )

Change subject: [partitioning] KUDU-2671: Support for range specific HashSchemas.
......................................................................

[partitioning] KUDU-2671: Support for range specific HashSchemas.

This patch updates PartitionSchema::CreatePartitions() to support the
ability to add different hash schemas per each range. If no hash schema
per range is specified, the table wide hash schema is used. Currently,
this only works if no split_rows are specified.

Since split_rows only exists for backwards compatibility reasons, this
feature will not be supported with split_rows. Returning a message to
the user stating to specify both upper and lower bounds either at table
creation or alteration time should suffice. Split_rows is also more
syntactically ambiguous when specifying bounds.

Currently, range_hash_schemas holds the HashBucketSchemas for each range.
Its order corresponds to the bounds in range_bounds so that when the
bounds are sorted its corresponding hash schemas are sorted as well.

Inspiration from Vlad: https://gerrit.cloudera.org/c/15758/

Change-Id: I8725f4bd072a81b05b36dfc7df0c074c172b4ce8
Reviewed-on: http://gerrit.cloudera.org:8080/16596
Reviewed-by: Andrew Wong <aw...@cloudera.com>
Tested-by: Andrew Wong <aw...@cloudera.com>
---
M src/kudu/common/partition-test.cc
M src/kudu/common/partition.cc
M src/kudu/common/partition.h
M src/kudu/common/partition_pruner-test.cc
M src/kudu/common/scan_spec-test.cc
M src/kudu/integration-tests/ts_tablet_manager-itest.cc
M src/kudu/master/catalog_manager.cc
M src/kudu/master/sys_catalog.cc
M src/kudu/tablet/tablet-harness.h
M src/kudu/tserver/tablet_server-test.cc
10 files changed, 455 insertions(+), 126 deletions(-)

Approvals:
  Andrew Wong: Looks good to me, approved; Verified

-- 
To view, visit http://gerrit.cloudera.org:8080/16596
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I8725f4bd072a81b05b36dfc7df0c074c172b4ce8
Gerrit-Change-Number: 16596
Gerrit-PatchSet: 13
Gerrit-Owner: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Reviewer: Tidy Bot (241)

[kudu-CR] [partitioning] KUDU-2671: Support for range specific HashSchemas.

Posted by "Mahesh Reddy (Code Review)" <ge...@cloudera.org>.

Mahesh Reddy has posted comments on this change. ( http://gerrit.cloudera.org:8080/16596 )

Change subject: [partitioning] KUDU-2671: Support for range specific HashSchemas.
......................................................................


Patch Set 1:

(10 comments)

http://gerrit.cloudera.org:8080/#/c/16596/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/16596/1//COMMIT_MSG@12
PS1, Line 12: Open to discussion on whether or
            : not we should support this feature with split_rows.
> Done
will add a section for this in design doc


http://gerrit.cloudera.org:8080/#/c/16596/1/src/kudu/common/partition.h
File src/kudu/common/partition.h:

http://gerrit.cloudera.org:8080/#/c/16596/1/src/kudu/common/partition.h@149
PS1, Line 149: integer
> I guess it would be enough to have just the lower bound for a range given t
hm yeh just the lower bound should suffice. Right now, the 'int' is just a placeholder for the bound that should replace it. In hindsight, using an 'int' wasn't very clear and perhaps I should have used 'std::string' instead.


http://gerrit.cloudera.org:8080/#/c/16596/1/src/kudu/common/partition.cc
File src/kudu/common/partition.cc:

http://gerrit.cloudera.org:8080/#/c/16596/1/src/kudu/common/partition.cc@366
PS1, Line 366: partitions
> +1
Done


http://gerrit.cloudera.org:8080/#/c/16596/1/src/kudu/common/partition.cc@398
PS1, Line 398:       RETURN_NOT_OK(EncodeRangeBounds(range_bounds, schema, &bounds));
             :       RETURN_NOT_OK(EncodeRangeSplits(split_rows, schema, &splits));
             :       RETURN_NOT_OK(SplitRangeBounds(schema, std::move(splits), &bounds));
> style nit: it seems there three lines has shifted right; what might be the 
yeh I noticed that as soon as I uploaded the patch, fixed it for next one


http://gerrit.cloudera.org:8080/#/c/16596/1/src/kudu/common/partition.cc@410
PS1, Line 410: vector<Partition> current_bound_hash_partitions = vector<Partition>(1);
> nit: a shorter form might be
Done


http://gerrit.cloudera.org:8080/#/c/16596/1/src/kudu/common/partition.cc@413
PS1, Line 413: second
> answered in definition of 'range_hash_schema' in header file
Done


http://gerrit.cloudera.org:8080/#/c/16596/1/src/kudu/common/partition.cc@669
PS1, Line 669: string PartitionSchema::HashSchemaPerPartitionDebugString(const Partition& partition,
             :                                                           const vector<HashBucketSchema>& range_hash_schema,
             :                                              const Schema& schema) const {
> style nit: the parameters are not aligned properly
Done


http://gerrit.cloudera.org:8080/#/c/16596/1/src/kudu/common/partition.cc@675
PS1, Line 675: 
> nit: reserving the space for 'components' might be a good idea; I guess her
Done


http://gerrit.cloudera.org:8080/#/c/16596/1/src/kudu/common/partition.cc@686
PS1, Line 686: use default HashSchema when no range-specific hash partitions are provided
> style nit: in comments like this, in Kudu code we tend to use full sentence
Done


http://gerrit.cloudera.org:8080/#/c/16596/1/src/kudu/common/partition.cc@695
PS1, Line 695: hash_buckets_[i]
> How is it guaranteed that partition.hash_buckets_ has enough elements to ha
The for loop runs through the hash schemas that were applied to the current partition. The current partition will have as many elements in 'partition.hash_buckets_' as the hash schemas because the amount of necessary hash buckets has already been generated from the hash schemas per each range. For example, if two different hash schemas apply to a range, a partition with two different hash buckets will be created.



-- 
To view, visit http://gerrit.cloudera.org:8080/16596
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8725f4bd072a81b05b36dfc7df0c074c172b4ce8
Gerrit-Change-Number: 16596
Gerrit-PatchSet: 1
Gerrit-Owner: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Comment-Date: Thu, 22 Oct 2020 23:15:47 +0000
Gerrit-HasComments: Yes

[kudu-CR] [partitioning] KUDU-2671: Support for range specific HashSchemas.

Posted by "Bankim Bhavsar (Code Review)" <ge...@cloudera.org>.

Bankim Bhavsar has posted comments on this change. ( http://gerrit.cloudera.org:8080/16596 )

Change subject: [partitioning] KUDU-2671: Support for range specific HashSchemas.
......................................................................


Patch Set 8:

(10 comments)

http://gerrit.cloudera.org:8080/#/c/16596/8/src/kudu/common/partition.h
File src/kudu/common/partition.h:

http://gerrit.cloudera.org:8080/#/c/16596/8/src/kudu/common/partition.h@167
PS8, Line 167: const google::protobuf::RepeatedPtrField
             :                                  <PartitionSchemaPB_ColumnIdentifierPB>& identifiers,
Nit: Might be better to move the entire argument declaration on separate next line so that it fits on one line.


http://gerrit.cloudera.org:8080/#/c/16596/8/src/kudu/common/partition.cc
File src/kudu/common/partition.cc:

http://gerrit.cloudera.org:8080/#/c/16596/8/src/kudu/common/partition.cc@124
PS8, Line 124: const RepeatedPtrField
             :         <PartitionSchemaPB_ColumnIdentifierPB>& identifiers,
Nit: Same as above.


http://gerrit.cloudera.org:8080/#/c/16596/8/src/kudu/common/partition.cc@128
PS8, Line 128: column_ids
> Ack
+1.


http://gerrit.cloudera.org:8080/#/c/16596/8/src/kudu/common/partition.cc@136
PS8, Line 136: column_id
Nit: std::move(column_id)


http://gerrit.cloudera.org:8080/#/c/16596/8/src/kudu/common/partition.cc@144
PS8, Line 144: push_back
Nit: emplace_back instead.


http://gerrit.cloudera.org:8080/#/c/16596/8/src/kudu/common/partition.cc@375
PS8, Line 375: void
If no error status is expected then better to return hash_partitions instead of using it as out parameter.


http://gerrit.cloudera.org:8080/#/c/16596/8/src/kudu/common/partition.cc@378
PS8, Line 378: vector<Partition> hash_partitions_so_far = vector<Partition>(1)
Assignment/copy constructor is unnecessary.

vector<Partition> hash_partitions_so_far(1)


http://gerrit.cloudera.org:8080/#/c/16596/8/src/kudu/common/partition.cc@382
PS8, Line 382:     // For each of the partitions created so far, replicate it
Looks like the size of new_partitions can be predetermined.
hash_partitions_so_far.size() * bucket.schema.num_buckets so can use new_partitions.reserve() with that size.


http://gerrit.cloudera.org:8080/#/c/16596/8/src/kudu/common/partition.cc@404
PS8, Line 404: KeyEncoder<string>
Nit: auto


http://gerrit.cloudera.org:8080/#/c/16596/8/src/kudu/common/partition.cc@459
PS8, Line 459: current_bound_hash_partitions.begin(),
             :                                current_bound_hash_partitions.end()
Nit: Could use std::make_move_iterator() to save the copies.



-- 
To view, visit http://gerrit.cloudera.org:8080/16596
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8725f4bd072a81b05b36dfc7df0c074c172b4ce8
Gerrit-Change-Number: 16596
Gerrit-PatchSet: 8
Gerrit-Owner: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Wed, 25 Nov 2020 21:15:02 +0000
Gerrit-HasComments: Yes

[kudu-CR] WIP [partitioning] KUDU-2671: Support for range specific HashSchemas.

Posted by "Mahesh Reddy (Code Review)" <ge...@cloudera.org>.

Mahesh Reddy has posted comments on this change. ( http://gerrit.cloudera.org:8080/16596 )

Change subject: WIP [partitioning] KUDU-2671: Support for range specific HashSchemas.
......................................................................


Patch Set 4:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/16596/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/16596/3//COMMIT_MSG@25
PS3, Line 25: rg/c/1
> nit: design
Done


http://gerrit.cloudera.org:8080/#/c/16596/3/src/kudu/common/partition.h
File src/kudu/common/partition.h:

http://gerrit.cloudera.org:8080/#/c/16596/3/src/kudu/common/partition.h@363
PS3, Line 363: o
> nit: spacing
Done


http://gerrit.cloudera.org:8080/#/c/16596/3/src/kudu/common/partition.cc
File src/kudu/common/partition.cc:

http://gerrit.cloudera.org:8080/#/c/16596/3/src/kudu/common/partition.cc@387
PS3, Line 387:   }
> Can we also enforce that 'range_hash_schema' and 'range_bounds' are the sam
I check if they are the same length below underneath on line 416. If 'range_hash_schema' isnt empty, then yes its length should match 'range_bounds'. So either 'range_hash_schema' is empty or its length will match 'range_bounds'.


http://gerrit.cloudera.org:8080/#/c/16596/3/src/kudu/common/partition.cc@409
PS3, Line 409:     if (!InsertIfNotPresent(&range_column_idxs, column_idx)) {
             :       return Status::InvalidArgument("duplicate column in range partition",
             :                                      schema.column(column_idx).name())
> You mentioned that we sort the bounds here, and in your commit message, you
Sure, for now I just changed the existing test to insert them out of order and verify that the final partitions are still in sorted order. I can add a separate test that adds the bounds out of order than checks the partitions.


http://gerrit.cloudera.org:8080/#/c/16596/3/src/kudu/common/partition.cc@414
PS3, Line 414: 
> nit: remove this -- "current design" doesn't really mean anything in the br
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/16596
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8725f4bd072a81b05b36dfc7df0c074c172b4ce8
Gerrit-Change-Number: 16596
Gerrit-PatchSet: 4
Gerrit-Owner: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Thu, 19 Nov 2020 05:41:35 +0000
Gerrit-HasComments: Yes

[kudu-CR] [partitioning] KUDU-2671: Support for range specific HashSchemas.

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.

Andrew Wong has removed a vote on this change.

Change subject: [partitioning] KUDU-2671: Support for range specific HashSchemas.
......................................................................


Removed Verified-1 by Kudu Jenkins (120)
-- 
To view, visit http://gerrit.cloudera.org:8080/16596
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: deleteVote
Gerrit-Change-Id: I8725f4bd072a81b05b36dfc7df0c074c172b4ce8
Gerrit-Change-Number: 16596
Gerrit-PatchSet: 12
Gerrit-Owner: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Reviewer: Tidy Bot (241)

[kudu-CR] WIP [partitioning] KUDU-2671: Support for range specific HashSchemas.

Posted by "Mahesh Reddy (Code Review)" <ge...@cloudera.org>.

Hello Tidy Bot, Alexey Serbin, Kudu Jenkins, Andrew Wong, Grant Henke, Bankim Bhavsar, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/16596

to look at the new patch set (#6).

Change subject: WIP [partitioning] KUDU-2671: Support for range specific HashSchemas.
......................................................................

WIP [partitioning] KUDU-2671: Support for range specific HashSchemas.

This patch updates PartitionSchema::CreatePartitions() to support the
ability to add different hash schemas per each range. If no hash schema
per range is specified, the table wide hash schema is used. Currently,
this only works if no split_rows are specified.

Since split_rows only exists for backwards compatibility reasons, I'm
leaning towards not supporting this feature with split_rows. Returning
a message to the user stating to specify both upper and lower bounds
either at table creation or alteration time should suffice. Split_rows
is also more syntactically ambigious when specifying bounds.

Currently, range_hash_schema holds the HashBucketSchemas for each range.
Its order corresponds to the bounds in range_bounds and an extra
parameter is supplied to EncodeRangeBounds to ensure that when the
bounds are sorted its corresponding schemas are sorted as well.

Inspiration from Vlad: https://gerrit.cloudera.org/c/15758/

Change-Id: I8725f4bd072a81b05b36dfc7df0c074c172b4ce8
---
M src/kudu/common/partition-test.cc
M src/kudu/common/partition.cc
M src/kudu/common/partition.h
M src/kudu/common/partition_pruner-test.cc
M src/kudu/integration-tests/ts_tablet_manager-itest.cc
M src/kudu/master/catalog_manager.cc
M src/kudu/master/sys_catalog.cc
M src/kudu/tablet/tablet-harness.h
M src/kudu/tserver/tablet_server-test.cc
9 files changed, 359 insertions(+), 94 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/96/16596/6
-- 
To view, visit http://gerrit.cloudera.org:8080/16596
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8725f4bd072a81b05b36dfc7df0c074c172b4ce8
Gerrit-Change-Number: 16596
Gerrit-PatchSet: 6
Gerrit-Owner: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Reviewer: Tidy Bot (241)

[kudu-CR] WIP [partitioning] KUDU-2671: Support for range specific HashSchemas.

Posted by "Mahesh Reddy (Code Review)" <ge...@cloudera.org>.

Hello Tidy Bot, Alexey Serbin, Kudu Jenkins, Andrew Wong, Grant Henke, Bankim Bhavsar, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/16596

to look at the new patch set (#4).

Change subject: WIP [partitioning] KUDU-2671: Support for range specific HashSchemas.
......................................................................

WIP [partitioning] KUDU-2671: Support for range specific HashSchemas.

This patch updates PartitionSchema::CreatePartitions() to support the
ability to add different hash schemas per each range. If no hash schema
per range is specified, the table wide hash schema is used. Currently,
this only works if no split_rows are specified.

Since split_rows only exists for backwards compatibility reasons, I'm
leaning towards not supporting this feature with split_rows. Returning
a message to the user stating to specify both upper and lower bounds
either at table creation or alteration time should suffice. Split_rows
is also more syntactically ambigious when specifying bounds.

Currently, range_hash_schema holds the HashBucketSchemas for each range.
Its order corresponds to the bounds in range_bounds and an extra
parameter is supplied to EncodeRangeBounds to ensure that when the
bounds are sorted its corresponding schemas are sorted as well.

Inspiration from Vlad: https://gerrit.cloudera.org/c/15758/

Change-Id: I8725f4bd072a81b05b36dfc7df0c074c172b4ce8
---
M src/kudu/common/partition-test.cc
M src/kudu/common/partition.cc
M src/kudu/common/partition.h
M src/kudu/common/partition_pruner-test.cc
M src/kudu/integration-tests/ts_tablet_manager-itest.cc
M src/kudu/master/catalog_manager.cc
M src/kudu/master/sys_catalog.cc
M src/kudu/tablet/tablet-harness.h
M src/kudu/tserver/tablet_server-test.cc
9 files changed, 343 insertions(+), 94 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/96/16596/4
-- 
To view, visit http://gerrit.cloudera.org:8080/16596
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8725f4bd072a81b05b36dfc7df0c074c172b4ce8
Gerrit-Change-Number: 16596
Gerrit-PatchSet: 4
Gerrit-Owner: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Reviewer: Tidy Bot (241)

[kudu-CR] [partitioning] KUDU-2671: Support for range specific HashSchemas.

Posted by "Mahesh Reddy (Code Review)" <ge...@cloudera.org>.

Hello Tidy Bot, Alexey Serbin, Kudu Jenkins, Andrew Wong, Grant Henke, Bankim Bhavsar, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/16596

to look at the new patch set (#10).

Change subject: [partitioning] KUDU-2671: Support for range specific HashSchemas.
......................................................................

[partitioning] KUDU-2671: Support for range specific HashSchemas.

This patch updates PartitionSchema::CreatePartitions() to support the
ability to add different hash schemas per each range. If no hash schema
per range is specified, the table wide hash schema is used. Currently,
this only works if no split_rows are specified.

Since split_rows only exists for backwards compatibility reasons, I'm
not supporting this feature with split_rows. Returning a message to the
user stating to specify both upper and lower bounds either at table
creation or alteration time should suffice. Split_rows is also more
syntactically ambiguous when specifying bounds.

Currently, range_hash_schemas holds the HashBucketSchemas for each range.
Its order corresponds to the bounds in range_bounds so that when the
bounds are sorted its corresponding hash schemas are sorted as well.

Inspiration from Vlad: https://gerrit.cloudera.org/c/15758/

Change-Id: I8725f4bd072a81b05b36dfc7df0c074c172b4ce8
---
M src/kudu/common/partition-test.cc
M src/kudu/common/partition.cc
M src/kudu/common/partition.h
M src/kudu/common/partition_pruner-test.cc
M src/kudu/common/scan_spec-test.cc
M src/kudu/integration-tests/ts_tablet_manager-itest.cc
M src/kudu/master/catalog_manager.cc
M src/kudu/master/sys_catalog.cc
M src/kudu/tablet/tablet-harness.h
M src/kudu/tserver/tablet_server-test.cc
10 files changed, 447 insertions(+), 122 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/96/16596/10
-- 
To view, visit http://gerrit.cloudera.org:8080/16596
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8725f4bd072a81b05b36dfc7df0c074c172b4ce8
Gerrit-Change-Number: 16596
Gerrit-PatchSet: 10
Gerrit-Owner: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Reviewer: Tidy Bot (241)

[kudu-CR] [partitioning] KUDU-2671: Support for range specific HashSchemas.

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.

Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/16596 )

Change subject: [partitioning] KUDU-2671: Support for range specific HashSchemas.
......................................................................


Patch Set 1:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/16596/1/src/kudu/common/partition.h
File src/kudu/common/partition.h:

http://gerrit.cloudera.org:8080/#/c/16596/1/src/kudu/common/partition.h@149
PS1, Line 149: integer
> right now it's useless, but the idea was to replace the integer with the up
I guess it would be enough to have just the lower bound for a range given that we don't allow ranges to intersect, right?

Also, why is it 'int', not 'std::string' as the type of the Partition::partition_key_start_ member field would imply?


http://gerrit.cloudera.org:8080/#/c/16596/1/src/kudu/common/partition.cc
File src/kudu/common/partition.cc:

http://gerrit.cloudera.org:8080/#/c/16596/1/src/kudu/common/partition.cc@366
PS1, Line 366: partitions
> nit: not your fault, but IMO it makes it more difficult to read to use 'par
+1


http://gerrit.cloudera.org:8080/#/c/16596/1/src/kudu/common/partition.cc@398
PS1, Line 398:       RETURN_NOT_OK(EncodeRangeBounds(range_bounds, schema, &bounds));
             :       RETURN_NOT_OK(EncodeRangeSplits(split_rows, schema, &splits));
             :       RETURN_NOT_OK(SplitRangeBounds(schema, std::move(splits), &bounds));
> If we don't have any 'split_rows', then the number of bounds would be known
style nit: it seems there three lines has shifted right; what might be the reason?


http://gerrit.cloudera.org:8080/#/c/16596/1/src/kudu/common/partition.cc@410
PS1, Line 410: vector<Partition> current_bound_hash_partitions = vector<Partition>(1);
nit: a shorter form might be
  vector<Partition> current_bound_hash_partitions(1);


http://gerrit.cloudera.org:8080/#/c/16596/1/src/kudu/common/partition.cc@669
PS1, Line 669: string PartitionSchema::HashSchemaPerPartitionDebugString(const Partition& partition,
             :                                                           const vector<HashBucketSchema>& range_hash_schema,
             :                                              const Schema& schema) const {
style nit: the parameters are not aligned properly


http://gerrit.cloudera.org:8080/#/c/16596/1/src/kudu/common/partition.cc@675
PS1, Line 675: 
nit: reserving the space for 'components' might be a good idea; I guess here it'd be

  components.reserve(hash_bucket_schemas_.size() + 1);


http://gerrit.cloudera.org:8080/#/c/16596/1/src/kudu/common/partition.cc@686
PS1, Line 686: use default HashSchema when no range-specific hash partitions are provided
style nit: in comments like this, in Kudu code we tend to use full sentences, i.e. it should start with a capital letter and end with a period (dot).


http://gerrit.cloudera.org:8080/#/c/16596/1/src/kudu/common/partition.cc@695
PS1, Line 695: hash_buckets_[i]
How is it guaranteed that partition.hash_buckets_ has enough elements to have index 'i' to be in the bounds of the container?



-- 
To view, visit http://gerrit.cloudera.org:8080/16596
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8725f4bd072a81b05b36dfc7df0c074c172b4ce8
Gerrit-Change-Number: 16596
Gerrit-PatchSet: 1
Gerrit-Owner: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Comment-Date: Wed, 14 Oct 2020 23:09:37 +0000
Gerrit-HasComments: Yes

[kudu-CR] [partitioning] KUDU-2671: Support for range specific HashSchemas.

Posted by "Mahesh Reddy (Code Review)" <ge...@cloudera.org>.

Hello Tidy Bot, Alexey Serbin, Kudu Jenkins, Andrew Wong, Grant Henke, Bankim Bhavsar, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/16596

to look at the new patch set (#8).

Change subject: [partitioning] KUDU-2671: Support for range specific HashSchemas.
......................................................................

[partitioning] KUDU-2671: Support for range specific HashSchemas.

This patch updates PartitionSchema::CreatePartitions() to support the
ability to add different hash schemas per each range. If no hash schema
per range is specified, the table wide hash schema is used. Currently,
this only works if no split_rows are specified.

Since split_rows only exists for backwards compatibility reasons, I'm
not supporting this feature with split_rows. Returning a message to the
user stating to specify both upper and lower bounds either at table
creation or alteration time should suffice. Split_rows is also more
syntactically ambigious when specifying bounds.

Currently, range_hash_schemas holds the HashBucketSchemas for each range.
Its order corresponds to the bounds in range_bounds so that when the
bounds are sorted its corresponding hash schemas are sorted as well.

Inspiration from Vlad: https://gerrit.cloudera.org/c/15758/

Change-Id: I8725f4bd072a81b05b36dfc7df0c074c172b4ce8
---
M src/kudu/common/partition-test.cc
M src/kudu/common/partition.cc
M src/kudu/common/partition.h
M src/kudu/common/partition_pruner-test.cc
M src/kudu/integration-tests/ts_tablet_manager-itest.cc
M src/kudu/master/catalog_manager.cc
M src/kudu/master/sys_catalog.cc
M src/kudu/tablet/tablet-harness.h
M src/kudu/tserver/tablet_server-test.cc
9 files changed, 422 insertions(+), 118 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/96/16596/8
-- 
To view, visit http://gerrit.cloudera.org:8080/16596
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8725f4bd072a81b05b36dfc7df0c074c172b4ce8
Gerrit-Change-Number: 16596
Gerrit-PatchSet: 8
Gerrit-Owner: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Reviewer: Tidy Bot (241)

[kudu-CR] WIP [partitioning] KUDU-2671: Support for range specific HashSchemas.

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.

Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16596 )

Change subject: WIP [partitioning] KUDU-2671: Support for range specific HashSchemas.
......................................................................


Patch Set 4:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/16596/4/src/kudu/common/partition.cc
File src/kudu/common/partition.cc:

http://gerrit.cloudera.org:8080/#/c/16596/4/src/kudu/common/partition.cc@276
PS4, Line 276: range_partitions_with_hash_schemas
nit: add four spaces here since it's a continuation of the previous line


http://gerrit.cloudera.org:8080/#/c/16596/4/src/kudu/common/partition.cc@295
PS4, Line 295: range_hash_schema[j++]
What if this is empty?


http://gerrit.cloudera.org:8080/#/c/16596/4/src/kudu/common/partition.cc@302
PS4, Line 302:  &t1
nit: put the & with the rest of the type; same below


http://gerrit.cloudera.org:8080/#/c/16596/4/src/kudu/common/partition.cc@418
PS4, Line 418:   RETURN_NOT_OK(EncodeRangeBounds(range_bounds, range_hash_schema, schema, &bounds,
             :                                   &bounds_with_hash_schemas));
             :   RETURN_NOT_OK(EncodeRangeSplits(split_rows, schema, &splits));
             :   RETURN_NOT_OK(SplitRangeBounds(schema, std::move(splits), &bounds));
We chatted on Slack about maybe using a struct{string,string,optional<HashBucketSchemas>} (or just use empty to signify the default hash bucket schema) instead of a tuple, and passing a single 'bounds_with_hash_schemas' around instead of duplicating the 'bounds'.

Upon looking around at the usages here, I think it's worth moving forward with both of these changes: EncodeRangeBounds(), EncodeRangeSplits(), and SplitRangeBounds() are all used entirely in this file, so the changes to their API won't have a very wide footprint. I don't recall what reasons there were for the duplication, but it isn't clear why that wouldn't be doable.



-- 
To view, visit http://gerrit.cloudera.org:8080/16596
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8725f4bd072a81b05b36dfc7df0c074c172b4ce8
Gerrit-Change-Number: 16596
Gerrit-PatchSet: 4
Gerrit-Owner: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Thu, 19 Nov 2020 05:26:28 +0000
Gerrit-HasComments: Yes

[kudu-CR] [partitioning] KUDU-2671: Support for range specific HashSchemas.

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.

Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/16596 )

Change subject: [partitioning] KUDU-2671: Support for range specific HashSchemas.
......................................................................


Patch Set 8:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/16596/6//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/16596/6//COMMIT_MSG@18
PS6, Line 18: n specify
ambiguous


http://gerrit.cloudera.org:8080/#/c/16596/8/src/kudu/common/partition-test.cc
File src/kudu/common/partition-test.cc:

http://gerrit.cloudera.org:8080/#/c/16596/8/src/kudu/common/partition-test.cc@1000
PS8, Line 1000:   { // [(a1, _, c1), (a2, _, c2))
              :     KuduPartialRow lower(&schema);
              :     KuduPartialRow upper(&schema);
              :     ASSERT_OK(lower.SetStringCopy("a", "a1"));
              :     ASSERT_OK(lower.SetStringCopy("c", "c1"));
              :     ASSERT_OK(upper.SetStringCopy("a", "a2"));
              :     ASSERT_OK(upper.SetStringCopy("c", "c2"));
              :     PartitionSchema::HashBucketSchemas hash_schema_4_buckets = {{{ColumnId(0)}, 4, 0}};
              :     bounds.emplace_back(lower, upper);
              :     range_hash_schemas.emplace_back(hash_schema_4_buckets);
              :     bounds_with_hash_schemas.emplace_back(make_pair(std::move(lower), std::move(upper)),
              :                                           std::move(hash_schema_4_buckets));
              :   }
Would it make sense to check for the result of partition building after each of these smaller blocks?


http://gerrit.cloudera.org:8080/#/c/16596/8/src/kudu/common/partition.cc
File src/kudu/common/partition.cc:

http://gerrit.cloudera.org:8080/#/c/16596/8/src/kudu/common/partition.cc@128
PS8, Line 128: column_ids
nit: not exactly the essential part of this changelist, but since you are touching this code anyways, consider making few improvements here.  Maybe, it's more convenient to do so in a separate changelist.

Since this method can return non-OK status when an error is detected in the middle, consider not changing the output parameter in the middle, but create a local variable and do std::move() in the end before returning Status::OK():

  vector<ColumnId> col_ids;
  col_ids.reserve(identifiers.size());
  ...
  *column_ids = std::move(col_ids);


Also, move the 'continue' to the same scope as their corresponding 'case' label is.


http://gerrit.cloudera.org:8080/#/c/16596/8/src/kudu/common/partition.cc@409
PS8, Line 409:         return Status::InvalidArgument("Both 'split_rows' and 'range_hash_schemas' cannot be "
             :                                        "populated at the same time.");
Could you add a testcase to trigger this error to make sure this method behaves as expected?



-- 
To view, visit http://gerrit.cloudera.org:8080/16596
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8725f4bd072a81b05b36dfc7df0c074c172b4ce8
Gerrit-Change-Number: 16596
Gerrit-PatchSet: 8
Gerrit-Owner: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Wed, 25 Nov 2020 01:27:39 +0000
Gerrit-HasComments: Yes

[kudu-CR] [partitioning] KUDU-2671: Support for range specific HashSchemas.

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.

Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16596 )

Change subject: [partitioning] KUDU-2671: Support for range specific HashSchemas.
......................................................................


Patch Set 1:

(17 comments)

http://gerrit.cloudera.org:8080/#/c/16596/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/16596/1//COMMIT_MSG@12
PS1, Line 12: Open to discussion on whether or
            : not we should support this feature with split_rows.
nit: it'd be good to add some context as to what this would entail, what user-facing effects this has, etc. Or outline the problem in the design doc so it can be discussed further.


http://gerrit.cloudera.org:8080/#/c/16596/1/src/kudu/common/partition-test.cc
File src/kudu/common/partition-test.cc:

http://gerrit.cloudera.org:8080/#/c/16596/1/src/kudu/common/partition-test.cc@909
PS1, Line 909: hashSchema1
nit: snake_case, same elsewhere


http://gerrit.cloudera.org:8080/#/c/16596/1/src/kudu/common/partition-test.cc@912
PS1, Line 912: hashSchema2.emplace_back(PartitionSchema::HashBucketSchema{{ColumnId(0)}, 2, 0});
             :   hashSchema2.emplace_back(PartitionSchema::HashBucketSchema{{ColumnId(1)}, 3, 0});
nit: pretty sure you can initialize this via initializer list, e.g.

Vector<HashBucketSchema> hash_schema2 = {
  { { ColumnId(0) }, 2, 0 },
  { { ColumnId(1) }, 3, 0 }
};


http://gerrit.cloudera.org:8080/#/c/16596/1/src/kudu/common/partition-test.cc@919
PS1, Line 919: 16
nit: could you add a comment explaining this value? May also be able to adjust variable names to make this a bit clearer, e.g.

 default_hash_buckets_with_3_by_2 = { HashBucketSchema(3), HashBucketSchema(2) };
 hash_buckets_with_4 = { HashBucketSchema(4) };
 hash_buckets_with_2_by_3 = { HashBucketSchema(2), HashBucketSchema(3) };

At least right now, a reader trying to follow/verify this code would have read between the lines, parse out exactly what ranges have what, and then do the math.


http://gerrit.cloudera.org:8080/#/c/16596/1/src/kudu/common/partition-test.cc@921
PS1, Line 921: EXPECT_EQ("RANGE (a, b, c) PARTITION (0, -2147483648, 2) <= VALUES < (1, -2147483648, 3), "
             :                  "HASH (a) PARTITION 0",
             :             partition_schema.HashSchemaPerPartitionDebugString
             :                  (partitions[0], range_hash_schema[0].second, schema));
             :   EXPECT_EQ("RANGE (a, b, c) PARTITION (0, -2147483648, 2) <= VALUES < (1, -2147483648, 3), "
             :                  "HASH (a) PARTITION 1",
             :             partition_schema.HashSchemaPerPartitionDebugString
             :                  (partitions[1], range_hash_schema[0].second, schema));
             :   EXPECT_EQ("RANGE (a, b, c) PARTITION (0, -2147483648, 2) <= VALUES < (1, -2147483648, 3), "
             :                  "HASH (a) PARTITION 2",
             :             partition_schema.HashSchemaPerPartitionDebugString
             :                  (partitions[2], range_hash_schema[0].second, schema));
             :   EXPECT_EQ("RANGE (a, b, c) PARTITION (0, -2147483648, 2) <= VALUES < (1, -2147483648, 3), "
             :                  "HASH (a) PARTITION 3",
             :             partition_schema.HashSchemaPerPartitionDebugString
             :                  (partitions[3], range_hash_schema[0].second, schema));
             : 
             :   EXPECT_EQ("RANGE (a, b, c) PARTITION (1, 1, -2147483648) <= VALUES < (2, 3, -2147483648), "
             :                  "HASH (a, c) PARTITION 0, HASH (b) PARTITION 0",
             :             partition_schema.HashSchemaPerPartitionDebugString
             :                  (partitions[4], range_hash_schema[1].second, schema));
             :   EXPECT_EQ("RANGE (a, b, c) PARTITION (1, 1, -2147483648) <= VALUES < (2, 3, -2147483648), "
             :                  "HASH (a, c) PARTITION 0, HASH (b) PARTITION 1",
             :             partition_schema.HashSchemaPerPartitionDebugString
             :                  (partitions[5], range_hash_schema[1].second, schema));
             :   EXPECT_EQ("RANGE (a, b, c) PARTITION (1, 1, -2147483648) <= VALUES < (2, 3, -2147483648), "
             :                  "HASH (a, c) PARTITION 1, HASH (b) PARTITION 0",
             :             partition_schema.HashSchemaPerPartitionDebugString
             :                  (partitions[6], range_hash_schema[1].second, schema));
             :   EXPECT_EQ("RANGE (a, b, c) PARTITION (1, 1, -2147483648) <= VALUES < (2, 3, -2147483648), "
             :                  "HASH (a, c) PARTITION 1, HASH (b) PARTITION 1",
             :             partition_schema.HashSchemaPerPartitionDebugString
             :                  (partitions[7], range_hash_schema[1].second, schema));
             :   EXPECT_EQ("RANGE (a, b, c) PARTITION (1, 1, -2147483648) <= VALUES < (2, 3, -2147483648), "
             :                  "HASH (a, c) PARTITION 2, HASH (b) PARTITION 0",
             :             partition_schema.HashSchemaPerPartitionDebugString
             :                  (partitions[8], range_hash_schema[1].second, schema));
             :   EXPECT_EQ("RANGE (a, b, c) PARTITION (1, 1, -2147483648) <= VALUES < (2, 3, -2147483648), "
             :                  "HASH (a, c) PARTITION 2, HASH (b) PARTITION 1",
             :             partition_schema.HashSchemaPerPartitionDebugString
             :                  (partitions[9], range_hash_schema[1].second, schema));
             : 
             :   EXPECT_EQ("RANGE (a, b, c) PARTITION (3, -2147483648, -2147483648) <= "
             :                  "VALUES < (4, -2147483648, 1), HASH (a) PARTITION 0, HASH (b) PARTITION 0",
             :             partition_schema.HashSchemaPerPartitionDebugString
             :                  (partitions[10], range_hash_schema[2].second, schema));
             :   EXPECT_EQ("RANGE (a, b, c) PARTITION (3, -2147483648, -2147483648) <= "
             :                  "VALUES < (4, -2147483648, 1), HASH (a) PARTITION 0, HASH (b) PARTITION 1",
             :            partition_schema.HashSchemaPerPartitionDebugString
             :                 (partitions[11], range_hash_schema[2].second, schema));
             :   EXPECT_EQ("RANGE (a, b, c) PARTITION (3, -2147483648, -2147483648) <= "
             :                  "VALUES < (4, -2147483648, 1), HASH (a) PARTITION 0, HASH (b) PARTITION 2",
             :             partition_schema.HashSchemaPerPartitionDebugString
             :                  (partitions[12], range_hash_schema[2].second, schema));
             :   EXPECT_EQ("RANGE (a, b, c) PARTITION (3, -2147483648, -2147483648) <= "
             :                  "VALUES < (4, -2147483648, 1), HASH (a) PARTITION 1, HASH (b) PARTITION 0",
             :             partition_schema.HashSchemaPerPartitionDebugString
             :                  (partitions[13], range_hash_schema[2].second, schema));
             :   EXPECT_EQ("RANGE (a, b, c) PARTITION (3, -2147483648, -2147483648) <= "
             :                  "VALUES < (4, -2147483648, 1), HASH (a) PARTITION 1, HASH (b) PARTITION 1",
             :             partition_schema.HashSchemaPerPartitionDebugString
             :                  (partitions[14], range_hash_schema[2].second, schema));
             :   EXPECT_EQ("RANGE (a, b, c) PARTITION (3, -2147483648, -2147483648) <= "
             :                  "VALUES < (4, -2147483648, 1), HASH (a) PARTITION 1, HASH (b) PARTITION 2",
             :             partition_schema.HashSchemaPerPartitionDebugString
             :                  (partitions[15], range_hash_schema[2].second, schema));
While this is pretty exhaustive, it does seem a bit excessive. Could we instead just iterate through 'partitions' and check that the partitions of specific ranges have the expected hash bucket schemas? I'd prefer a test that validated Partition directly, so there's no chance that HashSchemaPerPartitionDebugString is hiding issues.

HashSchemaPerPartitionDebugString is also brand new code and if we want to keep it, it should have smaller, more targeted tests. I'm not convinced it's worth keeping given it's used in test only right now though.


http://gerrit.cloudera.org:8080/#/c/16596/1/src/kudu/common/partition.h
File src/kudu/common/partition.h:

http://gerrit.cloudera.org:8080/#/c/16596/1/src/kudu/common/partition.h@149
PS1, Line 149: Maps
Haven't looked around at how this is used, but should this actually be an unordered_map<int, vector<HashBucketSchema>> or somesuch?


http://gerrit.cloudera.org:8080/#/c/16596/1/src/kudu/common/partition.h@149
PS1, Line 149: integer
What is this integer? How do we use it?


http://gerrit.cloudera.org:8080/#/c/16596/1/src/kudu/common/partition.h@177
PS1, Line 177: range_hash_schema
nit: add a comment on how developers should use this, what invariants exist with respect to 'split_rows' and 'range_bounds', how it affects the creation of partitions, etc.


http://gerrit.cloudera.org:8080/#/c/16596/1/src/kudu/common/partition.h@389
PS1, Line 389: std::vector<HashBucketSchema>
nit: I wonder if it makes sense to typedef vector<HashBucketSchema> as some HashBucketSchemas (plural!) or somesuch, that would depict all hashes associated with a given range. Then there would be a default HashBucketSpec for a table, and we would pass around a vector<HashBucketSpec> when defining the hashes for ranges in CreatePartitions().


http://gerrit.cloudera.org:8080/#/c/16596/1/src/kudu/common/partition.cc
File src/kudu/common/partition.cc:

http://gerrit.cloudera.org:8080/#/c/16596/1/src/kudu/common/partition.cc@366
PS1, Line 366: partitions
nit: not your fault, but IMO it makes it more difficult to read to use 'partitions' for a bunch of different things as we are here. Could we instead make a local variable vector<Partition> along the lines of:

 vector<Partition> default_hash_partitions(1);
 for (...

In general, unless there's good reason to, I'm not a huge fan of using the output variable as a replacement local variable. How about instead, defining appropriate local variables (e.g. default_hash_partitions, hashed_range_partitions, etc.), and only once we're done and know we are going to return OK, set 'partitions'.


http://gerrit.cloudera.org:8080/#/c/16596/1/src/kudu/common/partition.cc@398
PS1, Line 398:       RETURN_NOT_OK(EncodeRangeBounds(range_bounds, schema, &bounds));
             :       RETURN_NOT_OK(EncodeRangeSplits(split_rows, schema, &splits));
             :       RETURN_NOT_OK(SplitRangeBounds(schema, std::move(splits), &bounds));
If 'bounds' gets populated here, how are callers expected to know how many elements 'range_hash_schema' should have? I think the answer is that it depends on whether we have 'split_rows' or not. If so, it'd be nice to comment here or add context somewhere to further discussion.


http://gerrit.cloudera.org:8080/#/c/16596/1/src/kudu/common/partition.cc@403
PS1, Line 403:     // Hash schema per range cannot be applied to split rows (doing bare minimum first)
nit: some of these sentences are fragments. Mind filling them out and adding punctuation?


http://gerrit.cloudera.org:8080/#/c/16596/1/src/kudu/common/partition.cc@405
PS1, Line 405:     DCHECK(range_hash_schema.size() == bounds.size());
nit: you can use DCHECK_EQ(expected, actual) for a more useful error message upon failure (it'll print what the values were)


http://gerrit.cloudera.org:8080/#/c/16596/1/src/kudu/common/partition.cc@413
PS1, Line 413: second
Is 'first' ever used? If not, why not just have this be vector<vector<HashBucketSchema>>?


http://gerrit.cloudera.org:8080/#/c/16596/1/src/kudu/common/partition.cc@417
PS1, Line 417: RangeHashSchema
nit: variable names should be snake_case unless they're const static


http://gerrit.cloudera.org:8080/#/c/16596/1/src/kudu/common/partition.cc@419
PS1, Line 419: : 
nit: not sure it's explicitly mentioned in the style guide, but most examples, and most of this codebase abides by having a space after the element's variable name, i.e.

 for (const auto& foo : foos) { ...

Same elsewhere.


http://gerrit.cloudera.org:8080/#/c/16596/1/src/kudu/common/partition.cc@419
PS1, Line 419:         for (const HashBucketSchema& bucket_schema: RangeHashSchema) {
             :           vector<Partition> new_partitions;
             :           // Goes over existing partitions for current bound to generate all combinations if
             :           // multiple schemas exist
             :           for (const Partition& base_partition: current_bound_hash_partitions) {
             :             for (int32_t bucket = 0; bucket < bucket_schema.num_buckets; bucket++) {
             :               Partition partition = base_partition;
             :               partition.hash_buckets_.push_back(bucket);
             :               hash_encoder.Encode(&bucket, &partition.partition_key_start_);
             :               hash_encoder.Encode(&bucket, &partition.partition_key_end_);
             :               new_partitions.push_back(partition);
             :             }
             :           }
             :           current_bound_hash_partitions.swap(new_partitions);
             :         }
nit: this seems to be a copy of the logic at L367, and it seems we define some reusable GenerateHashPartitions() function that takes in a vector<HashBucketSchema> and returns a vector<Partition>.



-- 
To view, visit http://gerrit.cloudera.org:8080/16596
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8725f4bd072a81b05b36dfc7df0c074c172b4ce8
Gerrit-Change-Number: 16596
Gerrit-PatchSet: 1
Gerrit-Owner: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Comment-Date: Wed, 14 Oct 2020 02:30:12 +0000
Gerrit-HasComments: Yes

[kudu-CR] [partitioning] KUDU-2671: Support for range specific HashSchemas.

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.

Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16596 )

Change subject: [partitioning] KUDU-2671: Support for range specific HashSchemas.
......................................................................


Patch Set 2:

(13 comments)

http://gerrit.cloudera.org:8080/#/c/16596/2/src/kudu/common/partition-test.cc
File src/kudu/common/partition-test.cc:

http://gerrit.cloudera.org:8080/#/c/16596/2/src/kudu/common/partition-test.cc@920
PS2, Line 920:       {"first bound", hash_schema_4_buckets},
             :       {"second bound", vector<PartitionSchema::HashBucketSchema>()},
             :       {"third bound", hash_schema_2_buckets_by_3}
If this patch is not going to use the string bounds, and it's unclear how they will be used in the future, I think we're better off removing them from this patch.


http://gerrit.cloudera.org:8080/#/c/16596/2/src/kudu/common/partition-test.cc@931
PS2, Line 931: EXPECT_EQ(0, partitions[0].hash_buckets()[0]);
             :   EXPECT_EQ(string("\0\0\0\0" "a1\0\0\0\0c1", 12),partitions[0].partition_key_start());
             :   EXPECT_EQ(string("\0\0\0\0" "a2\0\0\0\0c2", 12),partitions[0].partition_key_end());
             : 
             :   EXPECT_EQ(1, partitions[1].hash_buckets()[0]);
             :   EXPECT_EQ(string("\0\0\0\1" "a1\0\0\0\0c1", 12),partitions[1].partition_key_start());
             :   EXPECT_EQ(string("\0\0\0\1" "a2\0\0\0\0c2", 12),partitions[1].partition_key_end());
             : 
             :   EXPECT_EQ(2, partitions[2].hash_buckets()[0]);
             :   EXPECT_EQ(string("\0\0\0\2" "a1\0\0\0\0c1", 12),partitions[2].partition_key_start());
             :   EXPECT_EQ(string("\0\0\0\2" "a2\0\0\0\0c2", 12),partitions[2].partition_key_end());
             : 
             :   EXPECT_EQ(3, partitions[3].hash_buckets()[0]);
             :   EXPECT_EQ(string("\0\0\0\3" "a1\0\0\0\0c1", 12),partitions[3].partition_key_start());
             :   EXPECT_EQ(string("\0\0\0\3" "a2\0\0\0\0c2", 12),partitions[3].partition_key_end());
             : 
             :   EXPECT_EQ(0, partitions[4].hash_buckets()[0]);
             :   EXPECT_EQ(0, partitions[4].hash_buckets()[1]);
             :   EXPECT_EQ(string("\0\0\0\0" "\0\0\0\0" "a3\0\0b3\0\0", 16),partitions[4].partition_key_start());
             :   EXPECT_EQ(string("\0\0\0\0" "\0\0\0\0" "a4\0\0b4\0\0", 16),partitions[4].partition_key_end());
             : 
             :   EXPECT_EQ(0, partitions[5].hash_buckets()[0]);
             :   EXPECT_EQ(1, partitions[5].hash_buckets()[1]);
             :   EXPECT_EQ(string("\0\0\0\0" "\0\0\0\1" "a3\0\0b3\0\0", 16),partitions[5].partition_key_start());
             :   EXPECT_EQ(string("\0\0\0\0" "\0\0\0\1" "a4\0\0b4\0\0", 16),partitions[5].partition_key_end());
             : 
             :   EXPECT_EQ(1, partitions[6].hash_buckets()[0]);
             :   EXPECT_EQ(0, partitions[6].hash_buckets()[1]);
             :   EXPECT_EQ(string("\0\0\0\1" "\0\0\0\0" "a3\0\0b3\0\0", 16),partitions[6].partition_key_start());
             :   EXPECT_EQ(string("\0\0\0\1" "\0\0\0\0" "a4\0\0b4\0\0", 16),partitions[6].partition_key_end());
             : 
             :   EXPECT_EQ(1, partitions[7].hash_buckets()[0]);
             :   EXPECT_EQ(1, partitions[7].hash_buckets()[1]);
             :   EXPECT_EQ(string("\0\0\0\1" "\0\0\0\1" "a3\0\0b3\0\0", 16),partitions[7].partition_key_start());
             :   EXPECT_EQ(string("\0\0\0\1" "\0\0\0\1" "a4\0\0b4\0\0", 16),partitions[7].partition_key_end());
             : 
             :   EXPECT_EQ(2, partitions[8].hash_buckets()[0]);
             :   EXPECT_EQ(0, partitions[8].hash_buckets()[1]);
             :   EXPECT_EQ(string("\0\0\0\2" "\0\0\0\0" "a3\0\0b3\0\0", 16),partitions[8].partition_key_start());
             :   EXPECT_EQ(string("\0\0\0\2" "\0\0\0\0" "a4\0\0b4\0\0", 16),partitions[8].partition_key_end());
             : 
             :   EXPECT_EQ(2, partitions[9].hash_buckets()[0]);
             :   EXPECT_EQ(1, partitions[9].hash_buckets()[1]);
             :   EXPECT_EQ(string("\0\0\0\2" "\0\0\0\1" "a3\0\0b3\0\0", 16),partitions[9].partition_key_start());
             :   EXPECT_EQ(string("\0\0\0\2" "\0\0\0\1" "a4\0\0b4\0\0", 16),partitions[9].partition_key_end());
             : 
             :   EXPECT_EQ(0, partitions[10].hash_buckets()[0]);
             :   EXPECT_EQ(0, partitions[10].hash_buckets()[1]);
             :   EXPECT_EQ(string("\0\0\0\0" "\0\0\0\0" "a5\0\0b5\0\0", 16),partitions[10].partition_key_start());
             :   EXPECT_EQ(string("\0\0\0\0" "\0\0\0\0" "a6\0\0\0\0c6", 16),partitions[10].partition_key_end());
             : 
             :   EXPECT_EQ(0, partitions[11].hash_buckets()[0]);
             :   EXPECT_EQ(1, partitions[11].hash_buckets()[1]);
             :   EXPECT_EQ(string("\0\0\0\0" "\0\0\0\1" "a5\0\0b5\0\0", 16),partitions[11].partition_key_start());
             :   EXPECT_EQ(string("\0\0\0\0" "\0\0\0\1" "a6\0\0\0\0c6", 16),partitions[11].partition_key_end());
             : 
             :   EXPECT_EQ(0, partitions[12].hash_buckets()[0]);
             :   EXPECT_EQ(2, partitions[12].hash_buckets()[1]);
             :   EXPECT_EQ(string("\0\0\0\0" "\0\0\0\2" "a5\0\0b5\0\0", 16),partitions[12].partition_key_start());
             :   EXPECT_EQ(string("\0\0\0\0" "\0\0\0\2" "a6\0\0\0\0c6", 16),partitions[12].partition_key_end());
             : 
             :   EXPECT_EQ(1, partitions[13].hash_buckets()[0]);
             :   EXPECT_EQ(0, partitions[13].hash_buckets()[1]);
             :   EXPECT_EQ(string("\0\0\0\1" "\0\0\0\0" "a5\0\0b5\0\0", 16),partitions[13].partition_key_start());
             :   EXPECT_EQ(string("\0\0\0\1" "\0\0\0\0" "a6\0\0\0\0c6", 16),partitions[13].partition_key_end());
             : 
             :   EXPECT_EQ(1, partitions[14].hash_buckets()[0]);
             :   EXPECT_EQ(1, partitions[14].hash_buckets()[1]);
             :   EXPECT_EQ(string("\0\0\0\1" "\0\0\0\1" "a5\0\0b5\0\0", 16),partitions[14].partition_key_start());
             :   EXPECT_EQ(string("\0\0\0\1" "\0\0\0\1" "a6\0\0\0\0c6", 16),partitions[14].partition_key_end());
             : 
             :   EXPECT_EQ(1, partitions[15].hash_buckets()[0]);
             :   EXPECT_EQ(2, partitions[15].hash_buckets()[1]);
             :   EXPECT_EQ(string("\0\0\0\1" "\0\0\0\2" "a5\0\0b5\0\0", 16),partitions[15].partition_key_start());
             :   EXPECT_EQ(string("\0\0\0\1" "\0\0\0\2" "a6\0\0\0\0c6", 16),partitions[15].partition_key_end());
nit: could this be more succinct with some loops?


http://gerrit.cloudera.org:8080/#/c/16596/2/src/kudu/common/partition.h
File src/kudu/common/partition.h:

http://gerrit.cloudera.org:8080/#/c/16596/2/src/kudu/common/partition.h@149
PS2, Line 149: still working on best way to
             :   // include this bound.
nit: If this is the case, could you mention this patch is a WIP in the commit message, and mention what uncertainties there are with this approach? That'd set expectations a bit better for the reviewers and direct our attention better

Otherwise, remove this from the comment.


http://gerrit.cloudera.org:8080/#/c/16596/2/src/kudu/common/partition.cc
File src/kudu/common/partition.cc:

http://gerrit.cloudera.org:8080/#/c/16596/2/src/kudu/common/partition.cc@357
PS2, Line 357: Status PartitionSchema::GenerateHashPartitions(const std::vector<HashBucketSchema>& hash_schemas,
> warning: method 'GenerateHashPartitions' can be made static [readability-co
+1

This could be put in an anonymous namespace


http://gerrit.cloudera.org:8080/#/c/16596/2/src/kudu/common/partition.cc@359
PS2, Line 359: std::
nit: don't need "std::" on account of the "using" above.


http://gerrit.cloudera.org:8080/#/c/16596/2/src/kudu/common/partition.cc@361
PS2, Line 361:   DCHECK(hash_partitions->size() == 1);
How about creating the initial vector in this method? That would make it easier to use this function (callers could just pass an empty vector).


http://gerrit.cloudera.org:8080/#/c/16596/2/src/kudu/common/partition.cc@369
PS2, Line 369: *hash_partitions
nit: see my earlier feedback about using local variables vs output variables. This could probably be a local variable like 'hash_partitions_so_far' or something


http://gerrit.cloudera.org:8080/#/c/16596/2/src/kudu/common/partition.cc@378
PS2, Line 378:     expected_partitions *= bucket_schema.num_buckets;
nit: not sure how valuable this check is, given you can validate it fairly easily by reading the code. I'd feel a bit differently if the code were more complicated, but this seems reasonably easy to check?


http://gerrit.cloudera.org:8080/#/c/16596/2/src/kudu/common/partition.cc@379
PS2, Line 379: hash_partitions->swap(new_partitions);
nit: we've more recently begun using:

 *hash_partitions = std::move(new_partitions);

instead of swap()


http://gerrit.cloudera.org:8080/#/c/16596/2/src/kudu/common/partition.cc@383
PS2, Line 383:   return Status::OK();
nit: How about just return the vector<Partition>? or make this a void function, given this never returns an error


http://gerrit.cloudera.org:8080/#/c/16596/2/src/kudu/common/partition.cc@386
PS2, Line 386: const vector<KuduPartialRow>& split_rows,
Shouldn't we return an error if both this _and_ range_hash_schema are set?


http://gerrit.cloudera.org:8080/#/c/16596/2/src/kudu/common/partition.cc@424
PS2, Line 424: [i].second;
If range_hash_schema[i].first isn't used, why does it exist at all? Why not just pass in a vector<vector<HashBucketSchema>>?


http://gerrit.cloudera.org:8080/#/c/16596/2/src/kudu/common/partition.cc@668
PS2, Line 668: string PartitionSchema::HashSchemaPerPartitionDebugString(
nit: Since this is no longer used, could we get rid of it?



-- 
To view, visit http://gerrit.cloudera.org:8080/16596
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8725f4bd072a81b05b36dfc7df0c074c172b4ce8
Gerrit-Change-Number: 16596
Gerrit-PatchSet: 2
Gerrit-Owner: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Fri, 23 Oct 2020 18:03:32 +0000
Gerrit-HasComments: Yes

[kudu-CR] WIP [partitioning] KUDU-2671: Support for range specific HashSchemas.

Posted by "Mahesh Reddy (Code Review)" <ge...@cloudera.org>.

Mahesh Reddy has posted comments on this change. ( http://gerrit.cloudera.org:8080/16596 )

Change subject: WIP [partitioning] KUDU-2671: Support for range specific HashSchemas.
......................................................................


Patch Set 7:

> Uploaded patch set 7.

Added a test to ensure the bounds and their respective hash schemas are sorted correctly. Also used a struct to signify the bounds and their hash schemas within 'EncodeRangeBounds' and 'SplitRangeBounds'.


-- 
To view, visit http://gerrit.cloudera.org:8080/16596
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8725f4bd072a81b05b36dfc7df0c074c172b4ce8
Gerrit-Change-Number: 16596
Gerrit-PatchSet: 7
Gerrit-Owner: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Tue, 24 Nov 2020 03:35:10 +0000
Gerrit-HasComments: No

[kudu-CR] [partitioning] KUDU-2671: Support for range specific HashSchemas.

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.

Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16596 )

Change subject: [partitioning] KUDU-2671: Support for range specific HashSchemas.
......................................................................


Patch Set 8: Code-Review+1

(4 comments)

http://gerrit.cloudera.org:8080/#/c/16596/8/src/kudu/common/partition-test.cc
File src/kudu/common/partition-test.cc:

http://gerrit.cloudera.org:8080/#/c/16596/8/src/kudu/common/partition-test.cc@1052
PS8, Line 1052:   // Inserting bounds and their hash schemas out of sorted order,
              :   // yet resulting partitions will still be the same.
              :   bounds.emplace_back(std::move(bounds_with_hash_schemas[1].first));
              :   range_hash_schemas.emplace_back(std::move(bounds_with_hash_schemas[1].second));
              : 
              :   bounds.emplace_back(std::move(bounds_with_hash_schemas[2].first));
              :   range_hash_schemas.emplace_back(std::move(bounds_with_hash_schemas[2].second));
              : 
              :   bounds.emplace_back(std::move(bounds_with_hash_schemas[0].first));
              :   range_hash_schemas.emplace_back(std::move(bounds_with_hash_schemas[0].second));
nit: Could we shuffle 'bounds_with_hash_schemas' with std::shuffle, and build 'bounds' and 'range_hash_schemas' with the shuffled ordering? It'd bump the coverage of this test a bit at least.


http://gerrit.cloudera.org:8080/#/c/16596/8/src/kudu/common/partition.cc
File src/kudu/common/partition.cc:

http://gerrit.cloudera.org:8080/#/c/16596/8/src/kudu/common/partition.cc@333
PS8, Line 333: unsigned long
nit: auto probably works here too?


http://gerrit.cloudera.org:8080/#/c/16596/8/src/kudu/common/partition.cc@335
PS8, Line 335:   if (splits.empty()) {
             :     return Status::OK();
             :   }
nit: Could we do this before L333?


http://gerrit.cloudera.org:8080/#/c/16596/8/src/kudu/common/partition.cc@407
PS8, Line 407:     for (auto schema: range_hash_schemas) {
nit: add a space after

Also maybe call it 'hash_schemas' so it doesn't collide with 'schema'?



-- 
To view, visit http://gerrit.cloudera.org:8080/16596
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8725f4bd072a81b05b36dfc7df0c074c172b4ce8
Gerrit-Change-Number: 16596
Gerrit-PatchSet: 8
Gerrit-Owner: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Wed, 25 Nov 2020 00:25:16 +0000
Gerrit-HasComments: Yes

[kudu-CR] [partitioning] KUDU-2671: Support for range specific HashSchemas.

Posted by "Mahesh Reddy (Code Review)" <ge...@cloudera.org>.

Mahesh Reddy has posted comments on this change. ( http://gerrit.cloudera.org:8080/16596 )

Change subject: [partitioning] KUDU-2671: Support for range specific HashSchemas.
......................................................................


Patch Set 8:

(10 comments)

http://gerrit.cloudera.org:8080/#/c/16596/8/src/kudu/common/partition.h
File src/kudu/common/partition.h:

http://gerrit.cloudera.org:8080/#/c/16596/8/src/kudu/common/partition.h@167
PS8, Line 167: const google::protobuf::RepeatedPtrField
             :                                  <PartitionSchemaPB_ColumnIdentifierPB>& identifiers,
> Nit: Might be better to move the entire argument declaration on separate ne
Ack


http://gerrit.cloudera.org:8080/#/c/16596/8/src/kudu/common/partition.cc
File src/kudu/common/partition.cc:

http://gerrit.cloudera.org:8080/#/c/16596/8/src/kudu/common/partition.cc@124
PS8, Line 124: const RepeatedPtrField
             :         <PartitionSchemaPB_ColumnIdentifierPB>& identifiers,
> Nit: Same as above.
Ack


http://gerrit.cloudera.org:8080/#/c/16596/8/src/kudu/common/partition.cc@128
PS8, Line 128: column_ids
> +1.
Ack


http://gerrit.cloudera.org:8080/#/c/16596/8/src/kudu/common/partition.cc@136
PS8, Line 136: column_id
> Nit: std::move(column_id)
Ack


http://gerrit.cloudera.org:8080/#/c/16596/8/src/kudu/common/partition.cc@144
PS8, Line 144: push_back
> Nit: emplace_back instead.
Ack


http://gerrit.cloudera.org:8080/#/c/16596/8/src/kudu/common/partition.cc@375
PS8, Line 375: void
> If no error status is expected then better to return hash_partitions instea
Ack


http://gerrit.cloudera.org:8080/#/c/16596/8/src/kudu/common/partition.cc@378
PS8, Line 378: vector<Partition> hash_partitions_so_far = vector<Partition>(1)
> Assignment/copy constructor is unnecessary.
Ack


http://gerrit.cloudera.org:8080/#/c/16596/8/src/kudu/common/partition.cc@382
PS8, Line 382:     // For each of the partitions created so far, replicate it
> Looks like the size of new_partitions can be predetermined.
Ack


http://gerrit.cloudera.org:8080/#/c/16596/8/src/kudu/common/partition.cc@404
PS8, Line 404: KeyEncoder<string>
> Nit: auto
Ack


http://gerrit.cloudera.org:8080/#/c/16596/8/src/kudu/common/partition.cc@459
PS8, Line 459: current_bound_hash_partitions.begin(),
             :                                current_bound_hash_partitions.end()
> Nit: Could use std::make_move_iterator() to save the copies.
Ack



-- 
To view, visit http://gerrit.cloudera.org:8080/16596
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8725f4bd072a81b05b36dfc7df0c074c172b4ce8
Gerrit-Change-Number: 16596
Gerrit-PatchSet: 8
Gerrit-Owner: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Thu, 26 Nov 2020 01:52:37 +0000
Gerrit-HasComments: Yes

[kudu-CR] WIP [partitioning] KUDU-2671: Support for range specific HashSchemas.

Posted by "Mahesh Reddy (Code Review)" <ge...@cloudera.org>.

Hello Tidy Bot, Alexey Serbin, Kudu Jenkins, Andrew Wong, Grant Henke, Bankim Bhavsar, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/16596

to look at the new patch set (#5).

Change subject: WIP [partitioning] KUDU-2671: Support for range specific HashSchemas.
......................................................................

WIP [partitioning] KUDU-2671: Support for range specific HashSchemas.

This patch updates PartitionSchema::CreatePartitions() to support the
ability to add different hash schemas per each range. If no hash schema
per range is specified, the table wide hash schema is used. Currently,
this only works if no split_rows are specified.

Since split_rows only exists for backwards compatibility reasons, I'm
leaning towards not supporting this feature with split_rows. Returning
a message to the user stating to specify both upper and lower bounds
either at table creation or alteration time should suffice. Split_rows
is also more syntactically ambigious when specifying bounds.

Currently, range_hash_schema holds the HashBucketSchemas for each range.
Its order corresponds to the bounds in range_bounds and an extra
parameter is supplied to EncodeRangeBounds to ensure that when the
bounds are sorted its corresponding schemas are sorted as well.

Inspiration from Vlad: https://gerrit.cloudera.org/c/15758/

Change-Id: I8725f4bd072a81b05b36dfc7df0c074c172b4ce8
---
M src/kudu/common/partition-test.cc
M src/kudu/common/partition.cc
M src/kudu/common/partition.h
M src/kudu/common/partition_pruner-test.cc
M src/kudu/integration-tests/ts_tablet_manager-itest.cc
M src/kudu/master/catalog_manager.cc
M src/kudu/master/sys_catalog.cc
M src/kudu/tablet/tablet-harness.h
M src/kudu/tserver/tablet_server-test.cc
9 files changed, 358 insertions(+), 94 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/96/16596/5
-- 
To view, visit http://gerrit.cloudera.org:8080/16596
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8725f4bd072a81b05b36dfc7df0c074c172b4ce8
Gerrit-Change-Number: 16596
Gerrit-PatchSet: 5
Gerrit-Owner: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Reviewer: Tidy Bot (241)

[kudu-CR] WIP [partitioning] KUDU-2671: Support for range specific HashSchemas.

Posted by "Mahesh Reddy (Code Review)" <ge...@cloudera.org>.

Mahesh Reddy has posted comments on this change. ( http://gerrit.cloudera.org:8080/16596 )

Change subject: WIP [partitioning] KUDU-2671: Support for range specific HashSchemas.
......................................................................


Patch Set 6:

> Uploaded patch set 6.

just trying to fix build errors now, feel free to ignore for now.


-- 
To view, visit http://gerrit.cloudera.org:8080/16596
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8725f4bd072a81b05b36dfc7df0c074c172b4ce8
Gerrit-Change-Number: 16596
Gerrit-PatchSet: 6
Gerrit-Owner: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Thu, 19 Nov 2020 19:37:03 +0000
Gerrit-HasComments: No

[kudu-CR] [partitioning] KUDU-2671: Support for range specific HashSchemas.

Posted by "Mahesh Reddy (Code Review)" <ge...@cloudera.org>.

Mahesh Reddy has posted comments on this change. ( http://gerrit.cloudera.org:8080/16596 )

Change subject: [partitioning] KUDU-2671: Support for range specific HashSchemas.
......................................................................


Patch Set 8:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/16596/6//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/16596/6//COMMIT_MSG@18
PS6, Line 18: n specify
> ambiguous
Ack


http://gerrit.cloudera.org:8080/#/c/16596/8/src/kudu/common/partition-test.cc
File src/kudu/common/partition-test.cc:

http://gerrit.cloudera.org:8080/#/c/16596/8/src/kudu/common/partition-test.cc@1000
PS8, Line 1000:   { // [(a1, _, c1), (a2, _, c2))
              :     KuduPartialRow lower(&schema);
              :     KuduPartialRow upper(&schema);
              :     ASSERT_OK(lower.SetStringCopy("a", "a1"));
              :     ASSERT_OK(lower.SetStringCopy("c", "c1"));
              :     ASSERT_OK(upper.SetStringCopy("a", "a2"));
              :     ASSERT_OK(upper.SetStringCopy("c", "c2"));
              :     PartitionSchema::HashBucketSchemas hash_schema_4_buckets = {{{ColumnId(0)}, 4, 0}};
              :     bounds.emplace_back(lower, upper);
              :     range_hash_schemas.emplace_back(hash_schema_4_buckets);
              :     bounds_with_hash_schemas.emplace_back(make_pair(std::move(lower), std::move(upper)),
              :                                           std::move(hash_schema_4_buckets));
              :   }
> Would it make sense to check for the result of partition building after eac
The partitions don't get filled until CreatePartitions() is called later. These blocks are just populating the bounds and their respective hash schemas that will be passed as parameters to CreatePartitions(). So I'm not exactly sure what we would check after these blocks, maybe I misunderstood your question.


http://gerrit.cloudera.org:8080/#/c/16596/8/src/kudu/common/partition-test.cc@1052
PS8, Line 1052:   // Inserting bounds and their hash schemas out of sorted order,
              :   // yet resulting partitions will still be the same.
              :   bounds.emplace_back(std::move(bounds_with_hash_schemas[1].first));
              :   range_hash_schemas.emplace_back(std::move(bounds_with_hash_schemas[1].second));
              : 
              :   bounds.emplace_back(std::move(bounds_with_hash_schemas[2].first));
              :   range_hash_schemas.emplace_back(std::move(bounds_with_hash_schemas[2].second));
              : 
              :   bounds.emplace_back(std::move(bounds_with_hash_schemas[0].first));
              :   range_hash_schemas.emplace_back(std::move(bounds_with_hash_schemas[0].second));
> nit: Could we shuffle 'bounds_with_hash_schemas' with std::shuffle, and bui
Ack


http://gerrit.cloudera.org:8080/#/c/16596/8/src/kudu/common/partition.cc
File src/kudu/common/partition.cc:

http://gerrit.cloudera.org:8080/#/c/16596/8/src/kudu/common/partition.cc@128
PS8, Line 128: column_ids
> nit: not exactly the essential part of this changelist, but since you are t
Ack


http://gerrit.cloudera.org:8080/#/c/16596/8/src/kudu/common/partition.cc@333
PS8, Line 333: unsigned long
> nit: auto probably works here too?
Ack


http://gerrit.cloudera.org:8080/#/c/16596/8/src/kudu/common/partition.cc@335
PS8, Line 335:   if (splits.empty()) {
             :     return Status::OK();
             :   }
> nit: Could we do this before L333?
Ack


http://gerrit.cloudera.org:8080/#/c/16596/8/src/kudu/common/partition.cc@407
PS8, Line 407:     for (auto schema: range_hash_schemas) {
> nit: add a space after
Ack


http://gerrit.cloudera.org:8080/#/c/16596/8/src/kudu/common/partition.cc@409
PS8, Line 409:         return Status::InvalidArgument("Both 'split_rows' and 'range_hash_schemas' cannot be "
             :                                        "populated at the same time.");
> Could you add a testcase to trigger this error to make sure this method beh
Ack



-- 
To view, visit http://gerrit.cloudera.org:8080/16596
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8725f4bd072a81b05b36dfc7df0c074c172b4ce8
Gerrit-Change-Number: 16596
Gerrit-PatchSet: 8
Gerrit-Owner: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Wed, 25 Nov 2020 20:20:59 +0000
Gerrit-HasComments: Yes

[kudu-CR] WIP [partitioning] KUDU-2671: Support for range specific HashSchemas.

Posted by "Mahesh Reddy (Code Review)" <ge...@cloudera.org>.

Mahesh Reddy has posted comments on this change. ( http://gerrit.cloudera.org:8080/16596 )

Change subject: WIP [partitioning] KUDU-2671: Support for range specific HashSchemas.
......................................................................


Patch Set 7:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/16596/7//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/16596/7//COMMIT_MSG@14
PS7, Line 14: Since split_rows only exists for backwards compatibility reasons, I'm
            : leaning towards not supporting this feature with split_rows. Returning
            : a message to the user stating to specify both upper and lower bounds
            : either at table creation or alteration time should suffice. Split_rows
            : is also more syntactically ambigious when specifying bounds.
> +1
I initially marked it as WIP but at this point I don't see a reason for it to be marked WIP either. I've discussed the feasibility of this feature with 'split_rows' with you and Grant so I feel comfortable not making this compatible with 'split_rows'.


http://gerrit.cloudera.org:8080/#/c/16596/7/src/kudu/common/partition-test.cc
File src/kudu/common/partition-test.cc:

http://gerrit.cloudera.org:8080/#/c/16596/7/src/kudu/common/partition-test.cc@121
PS7, Line 121: CheckPartitions
> nit: maybe move this closer to the test that uses it.
Done


http://gerrit.cloudera.org:8080/#/c/16596/7/src/kudu/common/partition-test.cc@1101
PS7, Line 1101: range_hash_schema
> nit: make this plural
Done


http://gerrit.cloudera.org:8080/#/c/16596/7/src/kudu/common/partition-test.cc@1042
PS7, Line 1042: PartitionTest, TestVaryingHashSchemasPerRangeOutOfOrder) {
              :   // CREATE TABLE t (a VARCHAR, b VARCHAR, c VARCHAR, PRIMARY KEY (a, b, c))
              :   // PARTITION BY [HASH BUCKET (a, c), HASH BUCKET (b), RANGE (a, b, c)];
              :   Schema schema({ ColumnSchema("a", STRING),
              :                   ColumnSchema("b", STRING),
              :                   ColumnSchema("c", STRING) },
              :                 { ColumnId(0), ColumnId(1), ColumnId(2) }, 3);
              : 
              :   PartitionSchemaPB schema_builder;
              :   // Table-wide HashSchema defined below, 3 by 2 buckets so 6 total.
              :   AddHashBucketComponent(&schema_builder, { "a", "c" }, 3, 0);
              :   AddHashBucketComponent(&schema_builder, { "b" }, 2, 0);
              :   PartitionSchema partition_schema;
              :   ASSERT_OK(PartitionSchema::FromPB(schema_builder, schema, &partition_schema));
              : 
              :   ASSERT_EQ("HASH (a, c) PARTITIONS 3, HASH (b) PARTITIONS 2, RANGE (a, b, c)",
              :   partition_schema.DebugString(schema));
              : 
              :   vector<pair<KuduPartialRow, KuduPartialRow>> bounds;
              : 
              :   // The bounds aren't inserted in sorted order unlike the previous test,
              :   // yet the resulting partitions will be the same.
              :   { // [(a3, b3, _), (a4, b4, _))
              :     KuduPartialRow lower(&schema);
              :     KuduPartialRow upper(&schema);
              :     ASSERT_OK(lower.SetStringCopy("a", "a3"));
              :     ASSERT_OK(lower.SetStringCopy("b", "b3"));
              :     ASSERT_OK(upper.SetStringCopy("a", "a4"));
              :     ASSERT_OK(upper.SetStringCopy("b", "b4"));
              :     bounds.emplace_back(std::move(lower), std::move(upper));
              :   }
              : 
              :   { // [(a1, _, c1), (a2, _, c2))
              :     KuduPartialRow lower(&schema);
              :     KuduPartialRow upper(&schema);
              :     ASSERT_OK(lower.SetStringCopy("a", "a1"));
              :     ASSERT_OK(lower.SetStringCopy("c", "c1"));
              :     ASSERT_OK(upper.SetStringCopy("a", "a2"));
              :     ASSERT_OK(upper.SetStringCopy("c", "c2"));
              :     bounds.emplace_back(std::move(lower), std::move(upper));
              :   }
              : 
              :   { // [(a5, b5, _), (a6, _, c6))
              :     KuduPartialRow lower(&schema);
              :     KuduPartialRow upper(&schema);
              :     ASSERT_OK(lower.SetStringCopy("a", "a5"));
              :     ASSERT_OK(lower.SetStringCopy("b", "b5"));
              :     ASSERT_OK(upper.SetStringCopy("a", "a6"));
              :     ASSERT_OK(upper.SetStringCopy("c", "c6"));
              :     bounds.emplace_back(std::move(lower), std::move(upper));
              :   }
              : 
              :   vector<PartitionSchema::HashBucketSchema> hash_schema_4_buckets = {{{ColumnId(0)}, 4, 0}};
              :   vector<PartitionSchema::HashBucketSchema> hash_schema_2_buckets_by_3 = {
              :       {{ColumnId(0)}, 2, 0},
              :       {{ColumnId(1)}, 3, 0}
              :   };
              : 
              :   // HashBucketSchemas have to be in the same order as the bounds it corresponds to
              :   PartitionSchema::RangeHashSchema range_hash_schema = {
              :       {vector<PartitionSchema::HashBucketSchema>()},
              :       {hash_schema_4_buckets},
              :       {hash_schema_2_buckets_by_3}
              :   };
              : 
              :   vector<Partition> partitions;
              :   ASSERT_OK(partition_schema.CreatePartitions({}, bounds, range_hash_schema, schema, &partitions));
              :   CheckPartitions(partitions);
              : }
> How about instead of duplicating the above test, we combine 'bounds' and 'r
Done


http://gerrit.cloudera.org:8080/#/c/16596/7/src/kudu/common/partition.cc
File src/kudu/common/partition.cc:

http://gerrit.cloudera.org:8080/#/c/16596/7/src/kudu/common/partition.cc@329
PS7, Line 329:                                              bounds_with_hash_schemas) const {
> If 'splits' is empty, could we short circuit out of here and just return wi
I thought about this as well, I was curious why the previous implementation of this method didn't do that since 'EncodeRangeBounds()' does a similar check with 'range_bounds'. I don't see a reason why we couldn't short circuit here but I will test the changes and see.


http://gerrit.cloudera.org:8080/#/c/16596/7/src/kudu/common/partition.cc@359
PS7, Line 359: bound.hash_schemas
> If you go my suggested route of short circuiting, we should be able to just
If we get to this point, bound.hash_schemas will be {} so they would be the same thing. I guess it comes down to which is more clear, which is probably {}. Also, even if we don't short circuit, bound.hash_schemas at this point will still be {} since if both 'split_rows' and any vectors within 'range_hash_schema' are defined then an error would be returned before this point.



-- 
To view, visit http://gerrit.cloudera.org:8080/16596
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8725f4bd072a81b05b36dfc7df0c074c172b4ce8
Gerrit-Change-Number: 16596
Gerrit-PatchSet: 7
Gerrit-Owner: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Tue, 24 Nov 2020 23:54:20 +0000
Gerrit-HasComments: Yes

[kudu-CR] [partitioning] KUDU-2671: Support for range specific HashSchemas.

Posted by "Mahesh Reddy (Code Review)" <ge...@cloudera.org>.

Mahesh Reddy has posted comments on this change. ( http://gerrit.cloudera.org:8080/16596 )

Change subject: [partitioning] KUDU-2671: Support for range specific HashSchemas.
......................................................................


Patch Set 1:

(16 comments)

http://gerrit.cloudera.org:8080/#/c/16596/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/16596/1//COMMIT_MSG@12
PS1, Line 12: Open to discussion on whether or
            : not we should support this feature with split_rows.
> nit: it'd be good to add some context as to what this would entail, what us
Done


http://gerrit.cloudera.org:8080/#/c/16596/1/src/kudu/common/partition-test.cc
File src/kudu/common/partition-test.cc:

http://gerrit.cloudera.org:8080/#/c/16596/1/src/kudu/common/partition-test.cc@909
PS1, Line 909: hashSchema1
> nit: snake_case, same elsewhere
Done


http://gerrit.cloudera.org:8080/#/c/16596/1/src/kudu/common/partition-test.cc@912
PS1, Line 912: hashSchema2.emplace_back(PartitionSchema::HashBucketSchema{{ColumnId(0)}, 2, 0});
             :   hashSchema2.emplace_back(PartitionSchema::HashBucketSchema{{ColumnId(1)}, 3, 0});
> nit: pretty sure you can initialize this via initializer list, e.g.
Done


http://gerrit.cloudera.org:8080/#/c/16596/1/src/kudu/common/partition-test.cc@919
PS1, Line 919: 16
> nit: could you add a comment explaining this value? May also be able to adj
Done


http://gerrit.cloudera.org:8080/#/c/16596/1/src/kudu/common/partition-test.cc@921
PS1, Line 921: EXPECT_EQ("RANGE (a, b, c) PARTITION (0, -2147483648, 2) <= VALUES < (1, -2147483648, 3), "
             :                  "HASH (a) PARTITION 0",
             :             partition_schema.HashSchemaPerPartitionDebugString
             :                  (partitions[0], range_hash_schema[0].second, schema));
             :   EXPECT_EQ("RANGE (a, b, c) PARTITION (0, -2147483648, 2) <= VALUES < (1, -2147483648, 3), "
             :                  "HASH (a) PARTITION 1",
             :             partition_schema.HashSchemaPerPartitionDebugString
             :                  (partitions[1], range_hash_schema[0].second, schema));
             :   EXPECT_EQ("RANGE (a, b, c) PARTITION (0, -2147483648, 2) <= VALUES < (1, -2147483648, 3), "
             :                  "HASH (a) PARTITION 2",
             :             partition_schema.HashSchemaPerPartitionDebugString
             :                  (partitions[2], range_hash_schema[0].second, schema));
             :   EXPECT_EQ("RANGE (a, b, c) PARTITION (0, -2147483648, 2) <= VALUES < (1, -2147483648, 3), "
             :                  "HASH (a) PARTITION 3",
             :             partition_schema.HashSchemaPerPartitionDebugString
             :                  (partitions[3], range_hash_schema[0].second, schema));
             : 
             :   EXPECT_EQ("RANGE (a, b, c) PARTITION (1, 1, -2147483648) <= VALUES < (2, 3, -2147483648), "
             :                  "HASH (a, c) PARTITION 0, HASH (b) PARTITION 0",
             :             partition_schema.HashSchemaPerPartitionDebugString
             :                  (partitions[4], range_hash_schema[1].second, schema));
             :   EXPECT_EQ("RANGE (a, b, c) PARTITION (1, 1, -2147483648) <= VALUES < (2, 3, -2147483648), "
             :                  "HASH (a, c) PARTITION 0, HASH (b) PARTITION 1",
             :             partition_schema.HashSchemaPerPartitionDebugString
             :                  (partitions[5], range_hash_schema[1].second, schema));
             :   EXPECT_EQ("RANGE (a, b, c) PARTITION (1, 1, -2147483648) <= VALUES < (2, 3, -2147483648), "
             :                  "HASH (a, c) PARTITION 1, HASH (b) PARTITION 0",
             :             partition_schema.HashSchemaPerPartitionDebugString
             :                  (partitions[6], range_hash_schema[1].second, schema));
             :   EXPECT_EQ("RANGE (a, b, c) PARTITION (1, 1, -2147483648) <= VALUES < (2, 3, -2147483648), "
             :                  "HASH (a, c) PARTITION 1, HASH (b) PARTITION 1",
             :             partition_schema.HashSchemaPerPartitionDebugString
             :                  (partitions[7], range_hash_schema[1].second, schema));
             :   EXPECT_EQ("RANGE (a, b, c) PARTITION (1, 1, -2147483648) <= VALUES < (2, 3, -2147483648), "
             :                  "HASH (a, c) PARTITION 2, HASH (b) PARTITION 0",
             :             partition_schema.HashSchemaPerPartitionDebugString
             :                  (partitions[8], range_hash_schema[1].second, schema));
             :   EXPECT_EQ("RANGE (a, b, c) PARTITION (1, 1, -2147483648) <= VALUES < (2, 3, -2147483648), "
             :                  "HASH (a, c) PARTITION 2, HASH (b) PARTITION 1",
             :             partition_schema.HashSchemaPerPartitionDebugString
             :                  (partitions[9], range_hash_schema[1].second, schema));
             : 
             :   EXPECT_EQ("RANGE (a, b, c) PARTITION (3, -2147483648, -2147483648) <= "
             :                  "VALUES < (4, -2147483648, 1), HASH (a) PARTITION 0, HASH (b) PARTITION 0",
             :             partition_schema.HashSchemaPerPartitionDebugString
             :                  (partitions[10], range_hash_schema[2].second, schema));
             :   EXPECT_EQ("RANGE (a, b, c) PARTITION (3, -2147483648, -2147483648) <= "
             :                  "VALUES < (4, -2147483648, 1), HASH (a) PARTITION 0, HASH (b) PARTITION 1",
             :            partition_schema.HashSchemaPerPartitionDebugString
             :                 (partitions[11], range_hash_schema[2].second, schema));
             :   EXPECT_EQ("RANGE (a, b, c) PARTITION (3, -2147483648, -2147483648) <= "
             :                  "VALUES < (4, -2147483648, 1), HASH (a) PARTITION 0, HASH (b) PARTITION 2",
             :             partition_schema.HashSchemaPerPartitionDebugString
             :                  (partitions[12], range_hash_schema[2].second, schema));
             :   EXPECT_EQ("RANGE (a, b, c) PARTITION (3, -2147483648, -2147483648) <= "
             :                  "VALUES < (4, -2147483648, 1), HASH (a) PARTITION 1, HASH (b) PARTITION 0",
             :             partition_schema.HashSchemaPerPartitionDebugString
             :                  (partitions[13], range_hash_schema[2].second, schema));
             :   EXPECT_EQ("RANGE (a, b, c) PARTITION (3, -2147483648, -2147483648) <= "
             :                  "VALUES < (4, -2147483648, 1), HASH (a) PARTITION 1, HASH (b) PARTITION 1",
             :             partition_schema.HashSchemaPerPartitionDebugString
             :                  (partitions[14], range_hash_schema[2].second, schema));
             :   EXPECT_EQ("RANGE (a, b, c) PARTITION (3, -2147483648, -2147483648) <= "
             :                  "VALUES < (4, -2147483648, 1), HASH (a) PARTITION 1, HASH (b) PARTITION 2",
             :             partition_schema.HashSchemaPerPartitionDebugString
             :                  (partitions[15], range_hash_schema[2].second, schema));
> While this is pretty exhaustive, it does seem a bit excessive. Could we ins
Makes sense, I'll just iterate through 'partitions' and check each individual partition


http://gerrit.cloudera.org:8080/#/c/16596/1/src/kudu/common/partition.h
File src/kudu/common/partition.h:

http://gerrit.cloudera.org:8080/#/c/16596/1/src/kudu/common/partition.h@149
PS1, Line 149: Maps
> Haven't looked around at how this is used, but should this actually be an u
Maybe I shouldn't have said "maps", but I suppose if the vector<HashBucketSchema> is mapped to the range's upper and lower bounds then we could use an unordered_map.


http://gerrit.cloudera.org:8080/#/c/16596/1/src/kudu/common/partition.h@149
PS1, Line 149: integer
> What is this integer? How do we use it?
right now it's useless, but the idea was to replace the integer with the upper and lower bound for a specific range


http://gerrit.cloudera.org:8080/#/c/16596/1/src/kudu/common/partition.h@177
PS1, Line 177: range_hash_schema
> nit: add a comment on how developers should use this, what invariants exist
Done


http://gerrit.cloudera.org:8080/#/c/16596/1/src/kudu/common/partition.cc
File src/kudu/common/partition.cc:

http://gerrit.cloudera.org:8080/#/c/16596/1/src/kudu/common/partition.cc@366
PS1, Line 366: partitions
> nit: not your fault, but IMO it makes it more difficult to read to use 'par
makes sense, I can change that


http://gerrit.cloudera.org:8080/#/c/16596/1/src/kudu/common/partition.cc@398
PS1, Line 398:       RETURN_NOT_OK(EncodeRangeBounds(range_bounds, schema, &bounds));
             :       RETURN_NOT_OK(EncodeRangeSplits(split_rows, schema, &splits));
             :       RETURN_NOT_OK(SplitRangeBounds(schema, std::move(splits), &bounds));
> If 'bounds' gets populated here, how are callers expected to know how many 
If we don't have any 'split_rows', then the number of bounds would be known before this point, they just get encoded here. It gets tricky if we include both 'splits_rows' and 'bounds' for range specific hash partitioning because then it's more unclear to define how many elements 'range_hash_schema' should have and what bounds to map the schemas to.


http://gerrit.cloudera.org:8080/#/c/16596/1/src/kudu/common/partition.cc@403
PS1, Line 403:     // Hash schema per range cannot be applied to split rows (doing bare minimum first)
> nit: some of these sentences are fragments. Mind filling them out and addin
Done


http://gerrit.cloudera.org:8080/#/c/16596/1/src/kudu/common/partition.cc@405
PS1, Line 405:     DCHECK(range_hash_schema.size() == bounds.size());
> nit: you can use DCHECK_EQ(expected, actual) for a more useful error messag
Done


http://gerrit.cloudera.org:8080/#/c/16596/1/src/kudu/common/partition.cc@413
PS1, Line 413: second
> Is 'first' ever used? If not, why not just have this be vector<vector<HashB
answered in definition of 'range_hash_schema' in header file


http://gerrit.cloudera.org:8080/#/c/16596/1/src/kudu/common/partition.cc@417
PS1, Line 417: RangeHashSchema
> nit: variable names should be snake_case unless they're const static
Done


http://gerrit.cloudera.org:8080/#/c/16596/1/src/kudu/common/partition.cc@419
PS1, Line 419: : 
> nit: not sure it's explicitly mentioned in the style guide, but most exampl
Done


http://gerrit.cloudera.org:8080/#/c/16596/1/src/kudu/common/partition.cc@419
PS1, Line 419:         for (const HashBucketSchema& bucket_schema: RangeHashSchema) {
             :           vector<Partition> new_partitions;
             :           // Goes over existing partitions for current bound to generate all combinations if
             :           // multiple schemas exist
             :           for (const Partition& base_partition: current_bound_hash_partitions) {
             :             for (int32_t bucket = 0; bucket < bucket_schema.num_buckets; bucket++) {
             :               Partition partition = base_partition;
             :               partition.hash_buckets_.push_back(bucket);
             :               hash_encoder.Encode(&bucket, &partition.partition_key_start_);
             :               hash_encoder.Encode(&bucket, &partition.partition_key_end_);
             :               new_partitions.push_back(partition);
             :             }
             :           }
             :           current_bound_hash_partitions.swap(new_partitions);
             :         }
> nit: this seems to be a copy of the logic at L367, and it seems we define s
yep I noticed that as well, I can refactor this by creating a helper function.



-- 
To view, visit http://gerrit.cloudera.org:8080/16596
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8725f4bd072a81b05b36dfc7df0c074c172b4ce8
Gerrit-Change-Number: 16596
Gerrit-PatchSet: 1
Gerrit-Owner: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Comment-Date: Wed, 14 Oct 2020 06:02:52 +0000
Gerrit-HasComments: Yes

[kudu-CR] [partitioning] KUDU-2671: Support for range specific HashSchemas.

Posted by "Mahesh Reddy (Code Review)" <ge...@cloudera.org>.

Hello Tidy Bot, Alexey Serbin, Kudu Jenkins, Andrew Wong, Grant Henke, Bankim Bhavsar, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/16596

to look at the new patch set (#12).

Change subject: [partitioning] KUDU-2671: Support for range specific HashSchemas.
......................................................................

[partitioning] KUDU-2671: Support for range specific HashSchemas.

This patch updates PartitionSchema::CreatePartitions() to support the
ability to add different hash schemas per each range. If no hash schema
per range is specified, the table wide hash schema is used. Currently,
this only works if no split_rows are specified.

Since split_rows only exists for backwards compatibility reasons, this
feature will not be supported with split_rows. Returning a message to
the user stating to specify both upper and lower bounds either at table
creation or alteration time should suffice. Split_rows is also more
syntactically ambiguous when specifying bounds.

Currently, range_hash_schemas holds the HashBucketSchemas for each range.
Its order corresponds to the bounds in range_bounds so that when the
bounds are sorted its corresponding hash schemas are sorted as well.

Inspiration from Vlad: https://gerrit.cloudera.org/c/15758/

Change-Id: I8725f4bd072a81b05b36dfc7df0c074c172b4ce8
---
M src/kudu/common/partition-test.cc
M src/kudu/common/partition.cc
M src/kudu/common/partition.h
M src/kudu/common/partition_pruner-test.cc
M src/kudu/common/scan_spec-test.cc
M src/kudu/integration-tests/ts_tablet_manager-itest.cc
M src/kudu/master/catalog_manager.cc
M src/kudu/master/sys_catalog.cc
M src/kudu/tablet/tablet-harness.h
M src/kudu/tserver/tablet_server-test.cc
10 files changed, 455 insertions(+), 126 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/96/16596/12
-- 
To view, visit http://gerrit.cloudera.org:8080/16596
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8725f4bd072a81b05b36dfc7df0c074c172b4ce8
Gerrit-Change-Number: 16596
Gerrit-PatchSet: 12
Gerrit-Owner: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Reviewer: Tidy Bot (241)

[kudu-CR] WIP [partitioning] KUDU-2671: Support for range specific HashSchemas.

Posted by "Mahesh Reddy (Code Review)" <ge...@cloudera.org>.

Mahesh Reddy has posted comments on this change. ( http://gerrit.cloudera.org:8080/16596 )

Change subject: WIP [partitioning] KUDU-2671: Support for range specific HashSchemas.
......................................................................


Patch Set 4:

(4 comments)

> Patch Set 4:
> 
> (4 comments)

http://gerrit.cloudera.org:8080/#/c/16596/4/src/kudu/common/partition.cc
File src/kudu/common/partition.cc:

http://gerrit.cloudera.org:8080/#/c/16596/4/src/kudu/common/partition.cc@276
PS4, Line 276: range_partitions_with_hash_schemas
> nit: add four spaces here since it's a continuation of the previous line
Done


http://gerrit.cloudera.org:8080/#/c/16596/4/src/kudu/common/partition.cc@295
PS4, Line 295: range_hash_schema[j++]
> What if this is empty?
oversight on my part, will have insert an if check for now.


http://gerrit.cloudera.org:8080/#/c/16596/4/src/kudu/common/partition.cc@302
PS4, Line 302:  &t1
> nit: put the & with the rest of the type; same below
Done


http://gerrit.cloudera.org:8080/#/c/16596/4/src/kudu/common/partition.cc@418
PS4, Line 418:   RETURN_NOT_OK(EncodeRangeBounds(range_bounds, range_hash_schema, schema, &bounds,
             :                                   &bounds_with_hash_schemas));
             :   RETURN_NOT_OK(EncodeRangeSplits(split_rows, schema, &splits));
             :   RETURN_NOT_OK(SplitRangeBounds(schema, std::move(splits), &bounds));
> We chatted on Slack about maybe using a struct{string,string,optional<HashB
More code would have to be changed if I just use a single struct instead of duplicating 'bounds', but the changes would be minor and ultimately I agree avoiding this duplication is optimal.



-- 
To view, visit http://gerrit.cloudera.org:8080/16596
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8725f4bd072a81b05b36dfc7df0c074c172b4ce8
Gerrit-Change-Number: 16596
Gerrit-PatchSet: 4
Gerrit-Owner: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Thu, 19 Nov 2020 06:09:48 +0000
Gerrit-HasComments: Yes

[kudu-CR] [partitioning] KUDU-2671: Support for range specific HashSchemas.

Posted by "Mahesh Reddy (Code Review)" <ge...@cloudera.org>.

Mahesh Reddy has posted comments on this change. ( http://gerrit.cloudera.org:8080/16596 )

Change subject: [partitioning] KUDU-2671: Support for range specific HashSchemas.
......................................................................


Patch Set 2:

(13 comments)

http://gerrit.cloudera.org:8080/#/c/16596/2/src/kudu/common/partition-test.cc
File src/kudu/common/partition-test.cc:

http://gerrit.cloudera.org:8080/#/c/16596/2/src/kudu/common/partition-test.cc@920
PS2, Line 920:       {"first bound", hash_schema_4_buckets},
             :       {"second bound", vector<PartitionSchema::HashBucketSchema>()},
             :       {"third bound", hash_schema_2_buckets_by_3}
> If this patch is not going to use the string bounds, and it's unclear how t
removed for now


http://gerrit.cloudera.org:8080/#/c/16596/2/src/kudu/common/partition-test.cc@931
PS2, Line 931: EXPECT_EQ(0, partitions[0].hash_buckets()[0]);
             :   EXPECT_EQ(string("\0\0\0\0" "a1\0\0\0\0c1", 12),partitions[0].partition_key_start());
             :   EXPECT_EQ(string("\0\0\0\0" "a2\0\0\0\0c2", 12),partitions[0].partition_key_end());
             : 
             :   EXPECT_EQ(1, partitions[1].hash_buckets()[0]);
             :   EXPECT_EQ(string("\0\0\0\1" "a1\0\0\0\0c1", 12),partitions[1].partition_key_start());
             :   EXPECT_EQ(string("\0\0\0\1" "a2\0\0\0\0c2", 12),partitions[1].partition_key_end());
             : 
             :   EXPECT_EQ(2, partitions[2].hash_buckets()[0]);
             :   EXPECT_EQ(string("\0\0\0\2" "a1\0\0\0\0c1", 12),partitions[2].partition_key_start());
             :   EXPECT_EQ(string("\0\0\0\2" "a2\0\0\0\0c2", 12),partitions[2].partition_key_end());
             : 
             :   EXPECT_EQ(3, partitions[3].hash_buckets()[0]);
             :   EXPECT_EQ(string("\0\0\0\3" "a1\0\0\0\0c1", 12),partitions[3].partition_key_start());
             :   EXPECT_EQ(string("\0\0\0\3" "a2\0\0\0\0c2", 12),partitions[3].partition_key_end());
             : 
             :   EXPECT_EQ(0, partitions[4].hash_buckets()[0]);
             :   EXPECT_EQ(0, partitions[4].hash_buckets()[1]);
             :   EXPECT_EQ(string("\0\0\0\0" "\0\0\0\0" "a3\0\0b3\0\0", 16),partitions[4].partition_key_start());
             :   EXPECT_EQ(string("\0\0\0\0" "\0\0\0\0" "a4\0\0b4\0\0", 16),partitions[4].partition_key_end());
             : 
             :   EXPECT_EQ(0, partitions[5].hash_buckets()[0]);
             :   EXPECT_EQ(1, partitions[5].hash_buckets()[1]);
             :   EXPECT_EQ(string("\0\0\0\0" "\0\0\0\1" "a3\0\0b3\0\0", 16),partitions[5].partition_key_start());
             :   EXPECT_EQ(string("\0\0\0\0" "\0\0\0\1" "a4\0\0b4\0\0", 16),partitions[5].partition_key_end());
             : 
             :   EXPECT_EQ(1, partitions[6].hash_buckets()[0]);
             :   EXPECT_EQ(0, partitions[6].hash_buckets()[1]);
             :   EXPECT_EQ(string("\0\0\0\1" "\0\0\0\0" "a3\0\0b3\0\0", 16),partitions[6].partition_key_start());
             :   EXPECT_EQ(string("\0\0\0\1" "\0\0\0\0" "a4\0\0b4\0\0", 16),partitions[6].partition_key_end());
             : 
             :   EXPECT_EQ(1, partitions[7].hash_buckets()[0]);
             :   EXPECT_EQ(1, partitions[7].hash_buckets()[1]);
             :   EXPECT_EQ(string("\0\0\0\1" "\0\0\0\1" "a3\0\0b3\0\0", 16),partitions[7].partition_key_start());
             :   EXPECT_EQ(string("\0\0\0\1" "\0\0\0\1" "a4\0\0b4\0\0", 16),partitions[7].partition_key_end());
             : 
             :   EXPECT_EQ(2, partitions[8].hash_buckets()[0]);
             :   EXPECT_EQ(0, partitions[8].hash_buckets()[1]);
             :   EXPECT_EQ(string("\0\0\0\2" "\0\0\0\0" "a3\0\0b3\0\0", 16),partitions[8].partition_key_start());
             :   EXPECT_EQ(string("\0\0\0\2" "\0\0\0\0" "a4\0\0b4\0\0", 16),partitions[8].partition_key_end());
             : 
             :   EXPECT_EQ(2, partitions[9].hash_buckets()[0]);
             :   EXPECT_EQ(1, partitions[9].hash_buckets()[1]);
             :   EXPECT_EQ(string("\0\0\0\2" "\0\0\0\1" "a3\0\0b3\0\0", 16),partitions[9].partition_key_start());
             :   EXPECT_EQ(string("\0\0\0\2" "\0\0\0\1" "a4\0\0b4\0\0", 16),partitions[9].partition_key_end());
             : 
             :   EXPECT_EQ(0, partitions[10].hash_buckets()[0]);
             :   EXPECT_EQ(0, partitions[10].hash_buckets()[1]);
             :   EXPECT_EQ(string("\0\0\0\0" "\0\0\0\0" "a5\0\0b5\0\0", 16),partitions[10].partition_key_start());
             :   EXPECT_EQ(string("\0\0\0\0" "\0\0\0\0" "a6\0\0\0\0c6", 16),partitions[10].partition_key_end());
             : 
             :   EXPECT_EQ(0, partitions[11].hash_buckets()[0]);
             :   EXPECT_EQ(1, partitions[11].hash_buckets()[1]);
             :   EXPECT_EQ(string("\0\0\0\0" "\0\0\0\1" "a5\0\0b5\0\0", 16),partitions[11].partition_key_start());
             :   EXPECT_EQ(string("\0\0\0\0" "\0\0\0\1" "a6\0\0\0\0c6", 16),partitions[11].partition_key_end());
             : 
             :   EXPECT_EQ(0, partitions[12].hash_buckets()[0]);
             :   EXPECT_EQ(2, partitions[12].hash_buckets()[1]);
             :   EXPECT_EQ(string("\0\0\0\0" "\0\0\0\2" "a5\0\0b5\0\0", 16),partitions[12].partition_key_start());
             :   EXPECT_EQ(string("\0\0\0\0" "\0\0\0\2" "a6\0\0\0\0c6", 16),partitions[12].partition_key_end());
             : 
             :   EXPECT_EQ(1, partitions[13].hash_buckets()[0]);
             :   EXPECT_EQ(0, partitions[13].hash_buckets()[1]);
             :   EXPECT_EQ(string("\0\0\0\1" "\0\0\0\0" "a5\0\0b5\0\0", 16),partitions[13].partition_key_start());
             :   EXPECT_EQ(string("\0\0\0\1" "\0\0\0\0" "a6\0\0\0\0c6", 16),partitions[13].partition_key_end());
             : 
             :   EXPECT_EQ(1, partitions[14].hash_buckets()[0]);
             :   EXPECT_EQ(1, partitions[14].hash_buckets()[1]);
             :   EXPECT_EQ(string("\0\0\0\1" "\0\0\0\1" "a5\0\0b5\0\0", 16),partitions[14].partition_key_start());
             :   EXPECT_EQ(string("\0\0\0\1" "\0\0\0\1" "a6\0\0\0\0c6", 16),partitions[14].partition_key_end());
             : 
             :   EXPECT_EQ(1, partitions[15].hash_buckets()[0]);
             :   EXPECT_EQ(2, partitions[15].hash_buckets()[1]);
             :   EXPECT_EQ(string("\0\0\0\1" "\0\0\0\2" "a5\0\0b5\0\0", 16),partitions[15].partition_key_start());
             :   EXPECT_EQ(string("\0\0\0\1" "\0\0\0\2" "a6\0\0\0\0c6", 16),partitions[15].partition_key_end());
> nit: could this be more succinct with some loops?
all the partition keys are different are due to hash values, imo it's more clear this way without loops.


http://gerrit.cloudera.org:8080/#/c/16596/2/src/kudu/common/partition.h
File src/kudu/common/partition.h:

http://gerrit.cloudera.org:8080/#/c/16596/2/src/kudu/common/partition.h@149
PS2, Line 149: still working on best way to
             :   // include this bound.
> nit: If this is the case, could you mention this patch is a WIP in the comm
removed comment, will add WIP context in commit message


http://gerrit.cloudera.org:8080/#/c/16596/2/src/kudu/common/partition.cc
File src/kudu/common/partition.cc:

http://gerrit.cloudera.org:8080/#/c/16596/2/src/kudu/common/partition.cc@357
PS2, Line 357: Status PartitionSchema::GenerateHashPartitions(const std::vector<HashBucketSchema>& hash_schemas,
> +1
changed to static


http://gerrit.cloudera.org:8080/#/c/16596/2/src/kudu/common/partition.cc@359
PS2, Line 359: std::
> nit: don't need "std::" on account of the "using" above.
Done


http://gerrit.cloudera.org:8080/#/c/16596/2/src/kudu/common/partition.cc@361
PS2, Line 361:   DCHECK(hash_partitions->size() == 1);
> How about creating the initial vector in this method? That would make it ea
Done


http://gerrit.cloudera.org:8080/#/c/16596/2/src/kudu/common/partition.cc@369
PS2, Line 369: *hash_partitions
> nit: see my earlier feedback about using local variables vs output variable
Done


http://gerrit.cloudera.org:8080/#/c/16596/2/src/kudu/common/partition.cc@378
PS2, Line 378:     expected_partitions *= bucket_schema.num_buckets;
> nit: not sure how valuable this check is, given you can validate it fairly 
removed this check


http://gerrit.cloudera.org:8080/#/c/16596/2/src/kudu/common/partition.cc@379
PS2, Line 379: hash_partitions->swap(new_partitions);
> nit: we've more recently begun using:
Done


http://gerrit.cloudera.org:8080/#/c/16596/2/src/kudu/common/partition.cc@383
PS2, Line 383:   return Status::OK();
> nit: How about just return the vector<Partition>? or make this a void funct
Done


http://gerrit.cloudera.org:8080/#/c/16596/2/src/kudu/common/partition.cc@386
PS2, Line 386: const vector<KuduPartialRow>& split_rows,
> Shouldn't we return an error if both this _and_ range_hash_schema are set?
added Status::InvalidArg() if both are populated


http://gerrit.cloudera.org:8080/#/c/16596/2/src/kudu/common/partition.cc@424
PS2, Line 424: [i].second;
> If range_hash_schema[i].first isn't used, why does it exist at all? Why not
Done


http://gerrit.cloudera.org:8080/#/c/16596/2/src/kudu/common/partition.cc@668
PS2, Line 668: string PartitionSchema::HashSchemaPerPartitionDebugString(
> nit: Since this is no longer used, could we get rid of it?
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/16596
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8725f4bd072a81b05b36dfc7df0c074c172b4ce8
Gerrit-Change-Number: 16596
Gerrit-PatchSet: 2
Gerrit-Owner: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Sat, 24 Oct 2020 00:37:31 +0000
Gerrit-HasComments: Yes

[kudu-CR] [partitioning] KUDU-2671: Support for range specific HashSchemas.

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.

Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16596 )

Change subject: [partitioning] KUDU-2671: Support for range specific HashSchemas.
......................................................................


Patch Set 11: Code-Review+1

(4 comments)

http://gerrit.cloudera.org:8080/#/c/16596/11/src/kudu/common/partition-test.cc
File src/kudu/common/partition-test.cc:

http://gerrit.cloudera.org:8080/#/c/16596/11/src/kudu/common/partition-test.cc@1073
PS11, Line 1073:   range_hash_schemas.emplace_back(PartitionSchema::HashBucketSchemas());
Is this important? If not, maybe remove it? Seems like it shouldn't be since we didn't clear 'range_hash_schemas'.


http://gerrit.cloudera.org:8080/#/c/16596/11/src/kudu/common/partition.h
File src/kudu/common/partition.h:

http://gerrit.cloudera.org:8080/#/c/16596/11/src/kudu/common/partition.h@413
PS11, Line 413: encoded_bounds
nit: can you update this?


http://gerrit.cloudera.org:8080/#/c/16596/11/src/kudu/common/partition.h@416
PS11, Line 416: range_hash_schemas
nit: can you also document how this is used, what to expect if it's empty, etc?


http://gerrit.cloudera.org:8080/#/c/16596/11/src/kudu/common/partition.h@420
PS11, Line 420:   // Splits the encoded range bounds by the split points. The splits and bounds
              :   // must be sorted. If `bounds` is empty, then a single unbounded range is
              :   // assumed. If any of the splits falls outside of the bounds then an
              :   // InvalidArgument status is returned.
nit: can you update this?



-- 
To view, visit http://gerrit.cloudera.org:8080/16596
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8725f4bd072a81b05b36dfc7df0c074c172b4ce8
Gerrit-Change-Number: 16596
Gerrit-PatchSet: 11
Gerrit-Owner: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Mon, 30 Nov 2020 20:47:36 +0000
Gerrit-HasComments: Yes

[kudu-CR] [partitioning] KUDU-2671: Support for range specific HashSchemas.

Posted by "Mahesh Reddy (Code Review)" <ge...@cloudera.org>.

Mahesh Reddy has posted comments on this change. ( http://gerrit.cloudera.org:8080/16596 )

Change subject: [partitioning] KUDU-2671: Support for range specific HashSchemas.
......................................................................


Patch Set 8:

> Uploaded patch set 8.

Refactored some testing code, added check for 'split_rows' compatibility, other minor changes.


-- 
To view, visit http://gerrit.cloudera.org:8080/16596
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8725f4bd072a81b05b36dfc7df0c074c172b4ce8
Gerrit-Change-Number: 16596
Gerrit-PatchSet: 8
Gerrit-Owner: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Wed, 25 Nov 2020 00:11:12 +0000
Gerrit-HasComments: No

[kudu-CR] [partitioning] KUDU-2671: Support for range specific HashSchemas.

Posted by "Mahesh Reddy (Code Review)" <ge...@cloudera.org>.

Mahesh Reddy has posted comments on this change. ( http://gerrit.cloudera.org:8080/16596 )

Change subject: [partitioning] KUDU-2671: Support for range specific HashSchemas.
......................................................................


Patch Set 11:

> Uploaded patch set 11: Patch Set 10 was rebased.

starting build to see if unrelated test failures still exist


-- 
To view, visit http://gerrit.cloudera.org:8080/16596
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I8725f4bd072a81b05b36dfc7df0c074c172b4ce8
Gerrit-Change-Number: 16596
Gerrit-PatchSet: 11
Gerrit-Owner: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Mon, 30 Nov 2020 20:20:55 +0000
Gerrit-HasComments: No