
Posted to issues@kudu.apache.org by "Peter Ebert (JIRA)" <ji...@apache.org> on 2018/09/19 16:14:00 UTC

[jira] [Created] (KUDU-2585) Custom Partitioning Schemes

Peter Ebert created KUDU-2585:
---------------------------------

             Summary: Custom Partitioning Schemes
                 Key: KUDU-2585
                 URL: https://issues.apache.org/jira/browse/KUDU-2585
             Project: Kudu
          Issue Type: New Feature
            Reporter: Peter Ebert


In HBase or HDFS tables you can come up with a complex key design or partitioning scheme (respectively) and build that logic into your application. It would be nice to have more flexibility in Kudu beyond the range and hash partitioning options currently provided.

One example where this would help, borrowed from the docs:
CREATE TABLE metrics (
    host STRING NOT NULL,
    metric STRING NOT NULL,
    time INT64 NOT NULL,
    value DOUBLE NOT NULL,
    PRIMARY KEY (host, metric, time)
);
 
Now let's say the hosts to be stored in Kudu belong to two Hadoop clusters, which I happen to indicate in the hostname: c1dn1.domain.com for cluster1 and c2dn1.domain.com for cluster2. With a random hash and enough distinct datanode/host values, I might have to read all partitions to scan one cluster, because its hosts will be randomly distributed across them.
 
If instead I could provide a UDF of some sort (or, here, even a simple substring of the first two characters), I could group cluster1 into one or a few partition values and skip reading any tablets for cluster2 when I scan cluster1.
 
So instead of hash(host), the partition expression would be something like hash(substr(host, 1, 2)). Of course, a UDF could be more complex: for example, it could hash the remainder of the string and take it mod 10 to spread the c1 hosts across 10 tablets, and so on.
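
As a rough illustration of the idea (this is not an existing Kudu feature), the mapping could be sketched in Python as follows. The function names, the use of CRC32 as the hash, and the choice of 10 buckets per cluster are all assumptions made for the example.

```python
import zlib

# Assumed value: the "mod to 10 tablets" mentioned in the proposal.
BUCKETS_PER_CLUSTER = 10


def custom_partition(host, buckets=BUCKETS_PER_CLUSTER):
    """Hypothetical partition function: (cluster prefix, bucket).

    The prefix is the first two characters of the hostname, matching
    substr(host, 1, 2). The rest of the hostname is hashed and reduced
    mod `buckets` to spread one cluster's hosts over several tablets.
    """
    prefix, rest = host[:2], host[2:]
    # CRC32 is used only because it is deterministic across runs;
    # any stable hash would do for this sketch.
    bucket = zlib.crc32(rest.encode("utf-8")) % buckets
    return prefix, bucket


def partitions_for_cluster(prefix, buckets=BUCKETS_PER_CLUSTER):
    """All partitions a scan restricted to one cluster has to read."""
    return {(prefix, b) for b in range(buckets)}


# Example hostnames following the c1dn1.domain.com / c2dn1.domain.com pattern.
hosts = [f"c{c}dn{n}.domain.com" for c in (1, 2) for n in range(1, 51)]
placement = {h: custom_partition(h) for h in hosts}

# A scan over cluster1 only needs these partitions; no cluster2 host
# can land in any of them, so those tablets can be skipped.
c1_partitions = partitions_for_cluster("c1")
```

With this layout, a scan for cluster1 touches at most 10 partitions regardless of how many hosts cluster2 contains, whereas hash(host) alone gives the scanner no way to rule out any partition.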


