Posted to issues@kudu.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2021/02/24 00:09:00 UTC

[jira] [Commented] (KUDU-2671) Change hash number for range partitioning

    [ https://issues.apache.org/jira/browse/KUDU-2671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289441#comment-17289441 ] 

ASF subversion and git services commented on KUDU-2671:
-------------------------------------------------------

Commit d7b5abc027d492a60ebf5059b27541fc04cfaab3 in kudu's branch refs/heads/master from Mahesh Reddy
[ https://gitbox.apache.org/repos/asf?p=kudu.git;h=d7b5abc ]

KUDU-2671: Adds compatibility for per range hash schemas with unbounded ranges.

This patch updates the logic at the end of PartitionSchema::CreatePartitions()
to allow per range hash schemas to be compatible with unbounded ranges. Some
additional context about that block of code is given below.

For the start partition key, the code iterates in reverse order through the
partition's hash buckets and breaks out of the loop at the first bucket not
equal to 0. If a bucket is equal to 0, it erases that bucket's portion of the
partition key. Essentially, if all hash buckets are equal to 0, the entire key
is erased.
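
To make that concrete, here is a minimal C++ sketch of the start-key
truncation. The Partition struct, kEncodedBucketSize, and the fixed-width
bucket encoding are hypothetical simplifications for illustration, not Kudu's
actual types:

#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins for Kudu's partition types. The encoded
// key is assumed to be exactly one fixed-width component per hash bucket.
constexpr size_t kEncodedBucketSize = 4;

struct Partition {
  std::vector<int32_t> hash_buckets;
  std::string begin_key;  // encoded start key: one component per bucket
  std::string end_key;    // encoded end key, used in the next sketch
};

// Erase trailing zero-valued hash buckets from the encoded start key.
void TruncateStartKey(Partition* p) {
  for (auto it = p->hash_buckets.rbegin(); it != p->hash_buckets.rend(); ++it) {
    if (*it != 0) break;  // first non-zero bucket: stop trimming
    p->begin_key.erase(p->begin_key.size() - kEncodedBucketSize);
  }
  // If every bucket was 0, the whole encoded key has been erased.
}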

For the end partition key, the code also iterates in reverse order through the
partition's hash buckets. It first erases the current bucket's portion of the
partition key. It then checks whether the current hash bucket is the max
bucket of the current hash schema. If it is not the max, it encodes the
current hash bucket + 1 at that position of the key and breaks out of the
loop. If it is the max, it continues the loop. Essentially, if all the
hash buckets are at the max, the entire key is erased.
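
And a matching sketch of the end-key adjustment, reusing the hypothetical
Partition and kEncodedBucketSize from the previous sketch; AppendEncodedBucket
is likewise an illustrative stand-in for Kudu's key encoder:

// Illustrative fixed-width (big-endian) bucket encoder.
void AppendEncodedBucket(std::string* key, int32_t bucket) {
  for (int shift = 24; shift >= 0; shift -= 8) {
    key->push_back(static_cast<char>((bucket >> shift) & 0xff));
  }
}

// 'max_buckets[i]' holds the highest bucket index (bucket count - 1) of the
// i-th hash dimension in this partition's hash schema.
void AdjustEndKey(Partition* p, const std::vector<int32_t>& max_buckets) {
  for (int i = static_cast<int>(p->hash_buckets.size()) - 1; i >= 0; --i) {
    // First erase this bucket's portion of the encoded key.
    p->end_key.erase(p->end_key.size() - kEncodedBucketSize);
    if (p->hash_buckets[i] != max_buckets[i]) {
      // Not the max bucket: encode bucket + 1 here and stop.
      AppendEncodedBucket(&p->end_key, p->hash_buckets[i] + 1);
      break;
    }
    // Max bucket: keep looping; if all buckets are at the max,
    // the entire key ends up erased.
  }
}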

Prior to this change, this block of code assumed the same hash bucket
schema for every partition. With per range hash schemas, that is no longer
necessarily the case. The vector 'partition_idx_to_hash_schemas_idx'
maps each partition to an index into 'bounds_with_hash_schemas' so that
the correct hash bucket schema is used; '-1' signifies that the
table-wide hash schema applies.
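
A hedged sketch of the lookup this mapping enables (the vector names follow
the commit message; HashSchema and the function are simplified placeholders):

#include <vector>

struct HashSchema { /* columns, bucket counts, seeds, ... */ };

// Resolve the hash schema to use for a given partition. An entry of -1 in
// 'partition_idx_to_hash_schemas_idx' selects the table-wide hash schema;
// any other value indexes into 'bounds_with_hash_schemas'.
const HashSchema& SchemaForPartition(
    size_t partition_idx,
    const std::vector<int>& partition_idx_to_hash_schemas_idx,
    const std::vector<HashSchema>& bounds_with_hash_schemas,
    const HashSchema& table_wide_schema) {
  int idx = partition_idx_to_hash_schemas_idx[partition_idx];
  return idx == -1 ? table_wide_schema : bounds_with_hash_schemas[idx];
}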

Change-Id: I5f6c709e211359b04f7597af5f670c787bda7481
Reviewed-on: http://gerrit.cloudera.org:8080/17090
Reviewed-by: Andrew Wong <aw...@cloudera.com>
Tested-by: Andrew Wong <aw...@cloudera.com>


> Change hash number for range partitioning
> -----------------------------------------
>
>                 Key: KUDU-2671
>                 URL: https://issues.apache.org/jira/browse/KUDU-2671
>             Project: Kudu
>          Issue Type: Improvement
>          Components: client, java, master, server
>    Affects Versions: 1.8.0
>            Reporter: yangz
>            Assignee: Mahesh Reddy
>            Priority: Major
>              Labels: feature, roadmap-candidate, scalability
>         Attachments: 屏幕快照 2019-01-24 下午12.03.41.png
>
>
> For our usage, the Kudu schema design isn't flexible enough.
> We create our table with day-range partitions such as dt='20181112', like a Hive table.
> But our data size changes a lot from day to day: one day it may be 50G, another day 500G. This makes it hard to choose the hash schema. If the hash number is too big, it is wasteful in most cases; if it is too small, there is a performance problem for days with a large amount of data.
>  
> So we suggest a solution: change the hash number per range based on the table's historical data.
> For example:
>  # We create the schema with an estimated hash number.
>  # We collect the data size for each day range.
>  # We create each new day-range partition with a hash number derived from the collected sizes.
> We have used this feature for half a year, and it works well. We hope it will be useful to the community. Maybe the solution isn't complete; please help us make it better.
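
For context, a hedged sketch of what the requested capability looks like with
the per-range hash schema API that later shipped in the Kudu C++ client
(KuduRangePartition::add_hash_partitions and
KuduTableCreator::add_custom_range_partition); the signatures are recalled
from the client headers and may differ by version, the 'dt' and 'id' columns
are hypothetical, and error handling is elided:

#include <memory>
#include <string>
#include <vector>

#include "kudu/client/client.h"

using kudu::KuduPartialRow;
using kudu::client::KuduRangePartition;
using kudu::client::KuduSchema;
using kudu::client::KuduTableCreator;

// Add one day range with its own hash bucket count: more buckets for a
// 500G day, fewer for a 50G day.
void AddDayRange(KuduTableCreator* creator, const KuduSchema& schema,
                 const std::string& day_lower, const std::string& day_upper,
                 int num_hash_buckets) {
  std::unique_ptr<KuduPartialRow> lower(schema.NewRow());
  std::unique_ptr<KuduPartialRow> upper(schema.NewRow());
  lower->SetString("dt", day_lower);
  upper->SetString("dt", day_upper);
  // The range carries its own hash schema instead of the table-wide one.
  auto* range = new KuduRangePartition(lower.release(), upper.release());
  range->add_hash_partitions({ "id" }, num_hash_buckets);
  creator->add_custom_range_partition(range);  // creator takes ownership
}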



--
This message was sent by Atlassian Jira
(v8.3.4#803005)