You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@kudu.apache.org by "Alexey Serbin (Code Review)" <ge...@cloudera.org> on 2021/08/17 02:34:08 UTC

[kudu-CR] WIP [proto] KUDU-2671 update range partitioning with customs hash schema

Alexey Serbin has uploaded this change for review. ( http://gerrit.cloudera.org:8080/17779


Change subject: WIP [proto] KUDU-2671 update range partitioning with customs hash schema
......................................................................

WIP [proto] KUDU-2671 update range partitioning with customs hash schema

DONT_BUILD

This patch is a quick and dirty sketch on updating already existing,
but not yet released (so we should not be concerned with the backward
compatibility) structures in protobuf files to create Kudu tables
with custom hash partitioning.

Advantages:
  * no need to have two separate arrays of ranges and their hash
    schemas, requiring them to be of the same size
  * the naming is closer to what we have in the C++ code after
    the renamings introduces in https://gerrit.cloudera.org/#/c/17775/

WIP:
  * collect the feedback
  * BTW, why don't we put the hash schema info for a range into
    PartitionPB instead?

Change-Id: I37aae56a33170894f30d6cd73a5698d6cbb7a697
---
M src/kudu/common/common.proto
1 file changed, 28 insertions(+), 11 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/79/17779/1
-- 
To view, visit http://gerrit.cloudera.org:8080/17779
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I37aae56a33170894f30d6cd73a5698d6cbb7a697
Gerrit-Change-Number: 17779
Gerrit-PatchSet: 1
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>

[kudu-CR] [mega] KUDU-2671 update range partitioning with customs hash schema

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Hello Mahesh Reddy, Kudu Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/17779

to look at the new patch set (#2).

Change subject: [mega] KUDU-2671 update range partitioning with customs hash schema
......................................................................

[mega] KUDU-2671 update range partitioning with customs hash schema

This patch updates already existing, but not yet released (so we should
not be concerned about the backward compatibility) protobuf data
structures used to create Kudu tables with custom hash partitioning per
range.  With this patch, there is no need to have two separate arrays
of ranges and their hash schemas, requiring them to be of the same size.

I also renamed the 'hash_bucket_schemas' field into 'hash_schema' in
the PartitionSchemaPB data structure.

Change-Id: I37aae56a33170894f30d6cd73a5698d6cbb7a697
---
M java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java
M java/kudu-client/src/main/java/org/apache/kudu/client/CreateTableOptions.java
M java/kudu-client/src/main/java/org/apache/kudu/client/ProtobufHelper.java
M src/kudu/client/client.cc
M src/kudu/client/table_creator-internal.cc
M src/kudu/client/table_creator-internal.h
M src/kudu/common/common.proto
M src/kudu/common/partition-test.cc
M src/kudu/common/partition.cc
M src/kudu/common/partition.h
M src/kudu/common/partition_pruner-test.cc
M src/kudu/common/scan_spec-test.cc
M src/kudu/integration-tests/table_locations-itest.cc
M src/kudu/master/catalog_manager.cc
M src/kudu/master/master-test.cc
15 files changed, 244 insertions(+), 227 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/79/17779/2
-- 
To view, visit http://gerrit.cloudera.org:8080/17779
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I37aae56a33170894f30d6cd73a5698d6cbb7a697
Gerrit-Change-Number: 17779
Gerrit-PatchSet: 2
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy <mr...@cloudera.com>

[kudu-CR] KUDU-2671 update range partitioning with custom hash schema

Posted by "Mahesh Reddy (Code Review)" <ge...@cloudera.org>.
Mahesh Reddy has posted comments on this change. ( http://gerrit.cloudera.org:8080/17779 )

Change subject: KUDU-2671 update range partitioning with custom hash schema
......................................................................


Patch Set 3: Code-Review+1

(3 comments)

http://gerrit.cloudera.org:8080/#/c/17779/3/src/kudu/client/client.cc
File src/kudu/client/client.cc:

http://gerrit.cloudera.org:8080/#/c/17779/3/src/kudu/client/client.cc@1030
PS3, Line 1030:     }
Just to clarify, the only way to not enter this block of if/else if statements is if there are no custom hash schemas for any of the ranges defined. In that case, the field 'custom_hash_schemas_ranges' would not be populated correct?


http://gerrit.cloudera.org:8080/#/c/17779/3/src/kudu/common/partition.cc
File src/kudu/common/partition.cc:

http://gerrit.cloudera.org:8080/#/c/17779/3/src/kudu/common/partition.cc@195
PS3, Line 195: const auto
nit: const auto&


http://gerrit.cloudera.org:8080/#/c/17779/3/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

http://gerrit.cloudera.org:8080/#/c/17779/3/src/kudu/master/catalog_manager.cc@1868
PS3, Line 1868:   // TODO(aserbin): make sure range boundaries in
+1

Ideally, we would decode the range bounds and range hash schemas from the 'custom_hash_schemas_ranges' field at once rather than decoding the range bounds from a separate field.



-- 
To view, visit http://gerrit.cloudera.org:8080/17779
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I37aae56a33170894f30d6cd73a5698d6cbb7a697
Gerrit-Change-Number: 17779
Gerrit-PatchSet: 3
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Comment-Date: Sat, 28 Aug 2021 02:01:14 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2671 update range partitioning with custom hash schema

Posted by "Mahesh Reddy (Code Review)" <ge...@cloudera.org>.
Mahesh Reddy has posted comments on this change. ( http://gerrit.cloudera.org:8080/17779 )

Change subject: KUDU-2671 update range partitioning with custom hash schema
......................................................................


Patch Set 5: Code-Review+1


-- 
To view, visit http://gerrit.cloudera.org:8080/17779
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I37aae56a33170894f30d6cd73a5698d6cbb7a697
Gerrit-Change-Number: 17779
Gerrit-PatchSet: 5
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Comment-Date: Tue, 31 Aug 2021 01:13:54 +0000
Gerrit-HasComments: No

[kudu-CR] KUDU-2671 update range partitioning with custom hash schema

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/17779 )

Change subject: KUDU-2671 update range partitioning with custom hash schema
......................................................................


Patch Set 3:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/17779/1/src/kudu/common/common.proto
File src/kudu/common/common.proto:

http://gerrit.cloudera.org:8080/#/c/17779/1/src/kudu/common/common.proto@358
PS1, Line 358: HashBu
> If we're changing this to be unsigned, would have to make similar changes e
After discussing this offline, the decision was to remove this new data structure and use HashBucketSchemaPB everywhere for consistency.


http://gerrit.cloudera.org:8080/#/c/17779/1/src/kudu/common/common.proto@374
PS1, Line 374: // of the table. Otherwise, particular range
> Eventually this will be deprecated correct?
Yep, my initial idea was to eventually deprecate this field.  However, since we need to keep the API backward-compatible, this is to stay at least for a few releases.

An alternative approach is to keep this field and use the newly introduced 'custom_hash_schema_ranges' only for the overrides.  This alternative approach automatically lets us keep the new servers compatible with old clients and have somewhat shorter requests to create tables when there is a dominant hash schema for the majority of the ranges -- something similar what we have as of now, semantically.

Let me know if you think it's better to switch to using only 'custom_hash_schema_ranges' at the new code (at least in the client side as of now).


http://gerrit.cloudera.org:8080/#/c/17779/1/src/kudu/common/common.proto@377
PS1, Line 377: 
             : 
> Should we mark these as deprecated?
Done


http://gerrit.cloudera.org:8080/#/c/17779/3/src/kudu/common/common.proto
File src/kudu/common/common.proto:

http://gerrit.cloudera.org:8080/#/c/17779/3/src/kudu/common/common.proto@369
PS3, Line 369:   //repeated PerRangeHashBucketSchemasPB DEPRECATED_range_hash_schemas = 3;
             :   //repeated RowOperationsPB DEPRECATED_range_bounds = 4;
I guess we could add 

  reserved 3;
  reserved 4;

after removing these fields, but given we haven't released the version with those fields, we can safely re-use the fields is in future versions.  Not sure how to proceed here: to be really safe, we should add 'reserved' fields, I guess.



-- 
To view, visit http://gerrit.cloudera.org:8080/17779
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I37aae56a33170894f30d6cd73a5698d6cbb7a697
Gerrit-Change-Number: 17779
Gerrit-PatchSet: 3
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Comment-Date: Sat, 28 Aug 2021 01:47:46 +0000
Gerrit-HasComments: Yes

[kudu-CR] WIP [proto] KUDU-2671 update range partitioning with customs hash schema

Posted by "Mahesh Reddy (Code Review)" <ge...@cloudera.org>.
Mahesh Reddy has posted comments on this change. ( http://gerrit.cloudera.org:8080/17779 )

Change subject: WIP [proto] KUDU-2671 update range partitioning with customs hash schema
......................................................................


Patch Set 1:

(4 comments)

Sorry for the late review. In regards to storing the hash dimension info in PartitionPB, that would be superfluous. Imagine a range with a plethora of partitions, each partition would store the same hash dimension rather than just the range bounds and its hash dimension being stored once within PartitionSchemaPB.

http://gerrit.cloudera.org:8080/#/c/17779/1/src/kudu/common/common.proto
File src/kudu/common/common.proto:

http://gerrit.cloudera.org:8080/#/c/17779/1/src/kudu/common/common.proto@358
PS1, Line 358: uint32
If we're changing this to be unsigned, would have to make similar changes elsewhere in codebase.


http://gerrit.cloudera.org:8080/#/c/17779/1/src/kudu/common/common.proto@367
PS1, Line 367:  message RangeWithHashSchemaPB {
             :     // A set of row operations containing the lower and upper range bound
             :     // for the range.
             :     optional RowOperationsPB range_bounds = 1;
             :     repeated HashDimensionPB hash_schema = 2;
             :   }
             : 
+1

This a neater approach than having two separate fields and ensuring they have the same size.


http://gerrit.cloudera.org:8080/#/c/17779/1/src/kudu/common/common.proto@374
PS1, Line 374: repeated HashBucketSchemaPB hash_schema = 1;
Eventually this will be deprecated correct?


http://gerrit.cloudera.org:8080/#/c/17779/1/src/kudu/common/common.proto@377
PS1, Line 377: //repeated PerRangeHashBucketSchemasPB range_hash_schemas = 3;
             :   //repeated RowOperationsPB range_bounds = 4;
Should we mark these as deprecated?



-- 
To view, visit http://gerrit.cloudera.org:8080/17779
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I37aae56a33170894f30d6cd73a5698d6cbb7a697
Gerrit-Change-Number: 17779
Gerrit-PatchSet: 1
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Comment-Date: Wed, 18 Aug 2021 22:14:32 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2671 update range partitioning with custom hash schema

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/17779 )

Change subject: KUDU-2671 update range partitioning with custom hash schema
......................................................................


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/17779/4/src/kudu/common/common.proto
File src/kudu/common/common.proto:

http://gerrit.cloudera.org:8080/#/c/17779/4/src/kudu/common/common.proto@353
PS4, Line 353:   // This data structure represents a range partition with a custom hash schema.
             :   message RangeWithHashSchemaPB {
             :     // Row operations containing the lower and upper range bound for the range.
             :     optional RowOperationsPB range_bounds = 1;
             :     // Hash schema for the range.
             :     repeated HashBucketSchemaPB hash_schema = 2;
             :   }
> Tablets and ranges are already decoupled de facto -- a single range might h
Right,  they are already decoupled, though I meant that the range bounds we provide don't necessarily have to correspond to each individual range, but could refer to multiple ranges, eg when there are multiple ranges with the same hash schema, and only one with a custom schema. Especially if the entire keyspace is accounted for, we could save on the number of range keys specified.

The last sentence, I meant that right now it's nice that it's clear each range in this partition schema corresponds to a single range, and that it may be more confusing if we were to try coalescing ranges as I described.



-- 
To view, visit http://gerrit.cloudera.org:8080/17779
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I37aae56a33170894f30d6cd73a5698d6cbb7a697
Gerrit-Change-Number: 17779
Gerrit-PatchSet: 4
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Comment-Date: Sat, 28 Aug 2021 23:22:35 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2671 update range partitioning with custom hash schema

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Hello Mahesh Reddy, Kudu Jenkins, Andrew Wong, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/17779

to look at the new patch set (#3).

Change subject: KUDU-2671 update range partitioning with custom hash schema
......................................................................

KUDU-2671 update range partitioning with custom hash schema

This patch updates already existing, but not yet released (so we should
not be concerned about the backward compatibility) protobuf data
structures used to create Kudu tables with custom hash partitioning per
range.  With this patch, there is no need to have two separate arrays
of ranges and their hash schemas, requiring them to be of the same size.

I also renamed the 'hash_bucket_schemas' field into 'hash_schema' in
the PartitionSchemaPB data structure.

Change-Id: I37aae56a33170894f30d6cd73a5698d6cbb7a697
---
M java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java
M java/kudu-client/src/main/java/org/apache/kudu/client/CreateTableOptions.java
M java/kudu-client/src/main/java/org/apache/kudu/client/ProtobufHelper.java
M src/kudu/client/client.cc
M src/kudu/client/table_creator-internal.cc
M src/kudu/client/table_creator-internal.h
M src/kudu/common/common.proto
M src/kudu/common/partition-test.cc
M src/kudu/common/partition.cc
M src/kudu/common/partition.h
M src/kudu/common/partition_pruner-test.cc
M src/kudu/common/scan_spec-test.cc
M src/kudu/integration-tests/table_locations-itest.cc
M src/kudu/master/catalog_manager.cc
M src/kudu/master/master-test.cc
15 files changed, 244 insertions(+), 227 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/79/17779/3
-- 
To view, visit http://gerrit.cloudera.org:8080/17779
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I37aae56a33170894f30d6cd73a5698d6cbb7a697
Gerrit-Change-Number: 17779
Gerrit-PatchSet: 3
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy <mr...@cloudera.com>

[kudu-CR] KUDU-2671 update range partitioning with custom hash schema

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/17779 )

Change subject: KUDU-2671 update range partitioning with custom hash schema
......................................................................


Patch Set 4:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/17779/4/src/kudu/client/client.cc
File src/kudu/client/client.cc:

http://gerrit.cloudera.org:8080/#/c/17779/4/src/kudu/client/client.cc@1020
PS4, Line 1020: range->has_table_wide_hash_schema_
Just curious, maybe i'm missing something but is there a difference between using this and checking if hash_schema_ is empty?


http://gerrit.cloudera.org:8080/#/c/17779/4/src/kudu/common/common.proto
File src/kudu/common/common.proto:

http://gerrit.cloudera.org:8080/#/c/17779/4/src/kudu/common/common.proto@353
PS4, Line 353:   // This data structure represents a range partition with a custom hash schema.
             :   message RangeWithHashSchemaPB {
             :     // Row operations containing the lower and upper range bound for the range.
             :     optional RowOperationsPB range_bounds = 1;
             :     // Hash schema for the range.
             :     repeated HashBucketSchemaPB hash_schema = 2;
             :   }
I wonder if it makes sense to decouple the idea of ranges and tablets a bit. Namely, if we want 12 ranges, one per month, and we wanted to have 10 months with 1 hash bucket, and 2 months with 3 hash buckets, I wonder how feasible it is to store that as two ranges in the partition schema: one range spanning 10 months, the other spanning 2 months. We may end up saving some on storage of such sparsely modified partition definitions.

I suppose such optimizations/compressions could be performed by the master, if there's any concern about the size of the schema we send to clients.

That said, I don't love the idea of having range bounds refer to both the upper and lower bound of a given tablet, and the upper and lower bound of multiple tablets that share the same hash schema.



-- 
To view, visit http://gerrit.cloudera.org:8080/17779
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I37aae56a33170894f30d6cd73a5698d6cbb7a697
Gerrit-Change-Number: 17779
Gerrit-PatchSet: 4
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Comment-Date: Sat, 28 Aug 2021 07:00:49 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2671 update range partitioning with custom hash schema

Posted by "Mahesh Reddy (Code Review)" <ge...@cloudera.org>.
Mahesh Reddy has posted comments on this change. ( http://gerrit.cloudera.org:8080/17779 )

Change subject: KUDU-2671 update range partitioning with custom hash schema
......................................................................


Patch Set 3:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/17779/1/src/kudu/common/common.proto
File src/kudu/common/common.proto:

http://gerrit.cloudera.org:8080/#/c/17779/1/src/kudu/common/common.proto@358
PS1, Line 358: HashBu
> After discussing this offline, the decision was to remove this new data str
Ack


http://gerrit.cloudera.org:8080/#/c/17779/1/src/kudu/common/common.proto@374
PS1, Line 374: // of the table. Otherwise, particular range
> Yep, my initial idea was to eventually deprecate this field.  However, sinc
Yep I think the alternative approach is a better route at the moment. We can re-evaluate this later although I guess it's better to nail down an approach now so less client side changes will be needed in the future.


http://gerrit.cloudera.org:8080/#/c/17779/3/src/kudu/common/common.proto
File src/kudu/common/common.proto:

http://gerrit.cloudera.org:8080/#/c/17779/3/src/kudu/common/common.proto@369
PS3, Line 369:   //repeated PerRangeHashBucketSchemasPB DEPRECATED_range_hash_schemas = 3;
             :   //repeated RowOperationsPB DEPRECATED_range_bounds = 4;
> I guess we could add 
I lean towards being cautious here and adding reserved fields for 3 and 4 but open to more feedback.



-- 
To view, visit http://gerrit.cloudera.org:8080/17779
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I37aae56a33170894f30d6cd73a5698d6cbb7a697
Gerrit-Change-Number: 17779
Gerrit-PatchSet: 3
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Comment-Date: Sat, 28 Aug 2021 02:24:22 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2671 update range partitioning with custom hash schema

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/17779 )

Change subject: KUDU-2671 update range partitioning with custom hash schema
......................................................................


Patch Set 4: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/17779
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I37aae56a33170894f30d6cd73a5698d6cbb7a697
Gerrit-Change-Number: 17779
Gerrit-PatchSet: 4
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Comment-Date: Sat, 28 Aug 2021 07:01:06 +0000
Gerrit-HasComments: No

[kudu-CR] KUDU-2671 update range partitioning with custom hash schema

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Hello Mahesh Reddy, Kudu Jenkins, Andrew Wong, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/17779

to look at the new patch set (#4).

Change subject: KUDU-2671 update range partitioning with custom hash schema
......................................................................

KUDU-2671 update range partitioning with custom hash schema

This patch updates already existing, but not yet released (so we should
not be concerned about the backward compatibility) protobuf data
structures used to create Kudu tables with custom hash partitioning per
range.  With this patch, there is no need to have two separate arrays
of ranges and their hash schemas, requiring them to be of the same size.

I also renamed the 'hash_bucket_schemas' field into 'hash_schema' in
the PartitionSchemaPB data structure.

Change-Id: I37aae56a33170894f30d6cd73a5698d6cbb7a697
---
M java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java
M java/kudu-client/src/main/java/org/apache/kudu/client/CreateTableOptions.java
M java/kudu-client/src/main/java/org/apache/kudu/client/ProtobufHelper.java
M src/kudu/client/client.cc
M src/kudu/client/table_creator-internal.cc
M src/kudu/client/table_creator-internal.h
M src/kudu/common/common.proto
M src/kudu/common/partition-test.cc
M src/kudu/common/partition.cc
M src/kudu/common/partition.h
M src/kudu/common/partition_pruner-test.cc
M src/kudu/common/scan_spec-test.cc
M src/kudu/integration-tests/table_locations-itest.cc
M src/kudu/master/catalog_manager.cc
M src/kudu/master/master-test.cc
15 files changed, 249 insertions(+), 226 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/79/17779/4
-- 
To view, visit http://gerrit.cloudera.org:8080/17779
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I37aae56a33170894f30d6cd73a5698d6cbb7a697
Gerrit-Change-Number: 17779
Gerrit-PatchSet: 4
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy <mr...@cloudera.com>

[kudu-CR] KUDU-2671 update range partitioning with custom hash schema

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Hello Mahesh Reddy, Kudu Jenkins, Andrew Wong, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/17779

to look at the new patch set (#5).

Change subject: KUDU-2671 update range partitioning with custom hash schema
......................................................................

KUDU-2671 update range partitioning with custom hash schema

This patch updates already existing, but not yet released (so we should
not be concerned about the backward compatibility) protobuf data
structures used to create Kudu tables with custom hash partitioning per
range.  With this patch, there is no need to have two separate arrays
of ranges and their hash schemas, requiring them to be of the same size.

I also renamed the 'hash_bucket_schemas' field into 'hash_schema' in
the PartitionSchemaPB data structure.

Change-Id: I37aae56a33170894f30d6cd73a5698d6cbb7a697
---
M java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java
M java/kudu-client/src/main/java/org/apache/kudu/client/CreateTableOptions.java
M java/kudu-client/src/main/java/org/apache/kudu/client/ProtobufHelper.java
M src/kudu/client/client.cc
M src/kudu/client/flex_partitioning_client-test.cc
M src/kudu/client/table_creator-internal.h
M src/kudu/common/common.proto
M src/kudu/common/partition-test.cc
M src/kudu/common/partition.cc
M src/kudu/common/partition.h
M src/kudu/common/partition_pruner-test.cc
M src/kudu/common/scan_spec-test.cc
M src/kudu/integration-tests/table_locations-itest.cc
M src/kudu/master/catalog_manager.cc
M src/kudu/master/master-test.cc
15 files changed, 254 insertions(+), 228 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/79/17779/5
-- 
To view, visit http://gerrit.cloudera.org:8080/17779
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I37aae56a33170894f30d6cd73a5698d6cbb7a697
Gerrit-Change-Number: 17779
Gerrit-PatchSet: 5
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy <mr...@cloudera.com>

[kudu-CR] KUDU-2671 update range partitioning with custom hash schema

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/17779 )

Change subject: KUDU-2671 update range partitioning with custom hash schema
......................................................................


Patch Set 5: Code-Review+2

(1 comment)

http://gerrit.cloudera.org:8080/#/c/17779/4/src/kudu/client/client.cc
File src/kudu/client/client.cc:

http://gerrit.cloudera.org:8080/#/c/17779/4/src/kudu/client/client.cc@1020
PS4, Line 1020: ash_dimension_pb = range_pb->add_h
> It seems the change I was referring is actually in a follow-up patch.  But 
Ack



-- 
To view, visit http://gerrit.cloudera.org:8080/17779
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I37aae56a33170894f30d6cd73a5698d6cbb7a697
Gerrit-Change-Number: 17779
Gerrit-PatchSet: 5
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Comment-Date: Tue, 31 Aug 2021 00:32:40 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2671 update range partitioning with custom hash schema

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/17779 )

Change subject: KUDU-2671 update range partitioning with custom hash schema
......................................................................

KUDU-2671 update range partitioning with custom hash schema

This patch updates already existing, but not yet released (so we should
not be concerned about the backward compatibility) protobuf data
structures used to create Kudu tables with custom hash partitioning per
range.  With this patch, there is no need to have two separate arrays
of ranges and their hash schemas, requiring them to be of the same size.

I also renamed the 'hash_bucket_schemas' field into 'hash_schema' in
the PartitionSchemaPB data structure.

Change-Id: I37aae56a33170894f30d6cd73a5698d6cbb7a697
Reviewed-on: http://gerrit.cloudera.org:8080/17779
Reviewed-by: Andrew Wong <aw...@cloudera.com>
Tested-by: Kudu Jenkins
Reviewed-by: Mahesh Reddy <mr...@cloudera.com>
---
M java/kudu-client/src/main/java/org/apache/kudu/client/AsyncKuduClient.java
M java/kudu-client/src/main/java/org/apache/kudu/client/CreateTableOptions.java
M java/kudu-client/src/main/java/org/apache/kudu/client/ProtobufHelper.java
M src/kudu/client/client.cc
M src/kudu/client/flex_partitioning_client-test.cc
M src/kudu/client/table_creator-internal.h
M src/kudu/common/common.proto
M src/kudu/common/partition-test.cc
M src/kudu/common/partition.cc
M src/kudu/common/partition.h
M src/kudu/common/partition_pruner-test.cc
M src/kudu/common/scan_spec-test.cc
M src/kudu/integration-tests/table_locations-itest.cc
M src/kudu/master/catalog_manager.cc
M src/kudu/master/master-test.cc
15 files changed, 254 insertions(+), 228 deletions(-)

Approvals:
  Andrew Wong: Looks good to me, approved
  Kudu Jenkins: Verified
  Mahesh Reddy: Looks good to me, but someone else must approve

-- 
To view, visit http://gerrit.cloudera.org:8080/17779
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I37aae56a33170894f30d6cd73a5698d6cbb7a697
Gerrit-Change-Number: 17779
Gerrit-PatchSet: 6
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy <mr...@cloudera.com>

[kudu-CR] KUDU-2671 update range partitioning with custom hash schema

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/17779 )

Change subject: KUDU-2671 update range partitioning with custom hash schema
......................................................................


Patch Set 3:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/17779/3/src/kudu/client/client.cc
File src/kudu/client/client.cc:

http://gerrit.cloudera.org:8080/#/c/17779/3/src/kudu/client/client.cc@1030
PS3, Line 1030:     }
> Just to clarify, the only way to not enter this block of if/else if stateme
Yes, correct.  I updated the code to be easier for understanding.


http://gerrit.cloudera.org:8080/#/c/17779/1/src/kudu/common/common.proto
File src/kudu/common/common.proto:

http://gerrit.cloudera.org:8080/#/c/17779/1/src/kudu/common/common.proto@374
PS1, Line 374: // of the table. Otherwise, particular range
> Yep I think the alternative approach is a better route at the moment. We ca
Right -- it's better to decide now, while we still have some time before the next release.


http://gerrit.cloudera.org:8080/#/c/17779/3/src/kudu/common/common.proto
File src/kudu/common/common.proto:

http://gerrit.cloudera.org:8080/#/c/17779/3/src/kudu/common/common.proto@369
PS3, Line 369:   //repeated PerRangeHashBucketSchemasPB DEPRECATED_range_hash_schemas = 3;
             :   //repeated RowOperationsPB DEPRECATED_range_bounds = 4;
> I lean towards being cautious here and adding reserved fields for 3 and 4 b
OK, added 'reserved'.


http://gerrit.cloudera.org:8080/#/c/17779/3/src/kudu/common/partition.cc
File src/kudu/common/partition.cc:

http://gerrit.cloudera.org:8080/#/c/17779/3/src/kudu/common/partition.cc@195
PS3, Line 195: const auto
> nit: const auto&
I don't think a reference in this case (64-bit) doesn't make much sense compared with the value itself (32-bit), so I think I'll keep the value here.  Overall, for primitive types like these there isn't much difference between using reference or the value itself, IIRC.


http://gerrit.cloudera.org:8080/#/c/17779/3/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

http://gerrit.cloudera.org:8080/#/c/17779/3/src/kudu/master/catalog_manager.cc@1868
PS3, Line 1868:   // TODO(aserbin): make sure range boundaries in
> +1
Right -- I'm thinking to update the corresponding code in both the client and here in catalog manager to rely only on custom_hash_schema_ranges() for the information on ranges when creating a table with custom hash schema for a range.



-- 
To view, visit http://gerrit.cloudera.org:8080/17779
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I37aae56a33170894f30d6cd73a5698d6cbb7a697
Gerrit-Change-Number: 17779
Gerrit-PatchSet: 3
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Comment-Date: Sat, 28 Aug 2021 05:29:27 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2671 update range partitioning with custom hash schema

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/17779 )

Change subject: KUDU-2671 update range partitioning with custom hash schema
......................................................................


Patch Set 4:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/17779/4/src/kudu/client/client.cc
File src/kudu/client/client.cc:

http://gerrit.cloudera.org:8080/#/c/17779/4/src/kudu/client/client.cc@1020
PS4, Line 1020: range->has_table_wide_hash_schema_
> Just curious, maybe i'm missing something but is there a difference between
Yep, there is a difference: an empty hash_schema_ means no hash bucketing for this particular range, but table-wide it might be some hash bucketing (i.e. having non-zero hash dimensions for all other ranges of the table).

It seems I forgot to mention a semantically important point of this patch: now it's possible to create a table where a range might have no hash bucketing, even if table-wide there is hash bucketing.

I think I need to add a blurb about that in the commit description and add a test scenario for that as well (TODO).


http://gerrit.cloudera.org:8080/#/c/17779/4/src/kudu/common/common.proto
File src/kudu/common/common.proto:

http://gerrit.cloudera.org:8080/#/c/17779/4/src/kudu/common/common.proto@353
PS4, Line 353:   // This data structure represents a range partition with a custom hash schema.
             :   message RangeWithHashSchemaPB {
             :     // Row operations containing the lower and upper range bound for the range.
             :     optional RowOperationsPB range_bounds = 1;
             :     // Hash schema for the range.
             :     repeated HashBucketSchemaPB hash_schema = 2;
             :   }
> I wonder if it makes sense to decouple the idea of ranges and tablets a bit
Tablets and ranges are already decoupled de facto -- a single range might have hash bucketing, so a range turns into a number of tablets (the number of tablets corresponding to a range is the number of buckets in the range's hash schema).  This PartitionSchemaPB data structure provides information on a so-called schema of a table: the hashing rules used to be non-discriminant of ranges prior, and now the requirement is to have a custom hash schema per range.

I'm planning a follow-up change that would add a registry of different hash schemas, and here instead of hash_schema there will be hash_schema_index pointing to an element in the registry.  That will help to save of the size of the PB structure, so that will be less network traffic between masters and clients when fetching tables' schemas.  Same for the savings w.r.t. storing the information in the system catalog.

I couldn't parse the last sentence of your comment, though.  Maybe, you could rephrase it a bit?



-- 
To view, visit http://gerrit.cloudera.org:8080/17779
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I37aae56a33170894f30d6cd73a5698d6cbb7a697
Gerrit-Change-Number: 17779
Gerrit-PatchSet: 4
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Comment-Date: Sat, 28 Aug 2021 22:16:36 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2671 update range partitioning with custom hash schema

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/17779 )

Change subject: KUDU-2671 update range partitioning with custom hash schema
......................................................................


Patch Set 4:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/17779/4/src/kudu/client/client.cc
File src/kudu/client/client.cc:

http://gerrit.cloudera.org:8080/#/c/17779/4/src/kudu/client/client.cc@1020
PS4, Line 1020: range->has_table_wide_hash_schema_
> Yep, there is a difference: an empty hash_schema_ means no hash bucketing f
It seems the change I was referring is actually in a follow-up patch.  But this 'has_table_wide_hash_schema_' field somehow went into this patch.  Removed.

Thank you for pointing at this!


http://gerrit.cloudera.org:8080/#/c/17779/4/src/kudu/common/common.proto
File src/kudu/common/common.proto:

http://gerrit.cloudera.org:8080/#/c/17779/4/src/kudu/common/common.proto@353
PS4, Line 353:   // This data structure represents a range partition with a custom hash schema.
             :   message RangeWithHashSchemaPB {
             :     // Row operations containing the lower and upper range bound for the range.
             :     optional RowOperationsPB range_bounds = 1;
             :     // Hash schema for the range.
             :     repeated HashBucketSchemaPB hash_schema = 2;
             :   }
> Right,  they are already decoupled, though I meant that the range bounds we
Ah, I see -- thank you for the clarification!  I'll think about the ways to make the representation of the ranges to be more optimal, and I agree that without coalesced ranges it's clearer.



-- 
To view, visit http://gerrit.cloudera.org:8080/17779
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I37aae56a33170894f30d6cd73a5698d6cbb7a697
Gerrit-Change-Number: 17779
Gerrit-PatchSet: 4
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy <mr...@cloudera.com>
Gerrit-Comment-Date: Tue, 31 Aug 2021 00:27:10 +0000
Gerrit-HasComments: Yes