Posted to user@kudu.apache.org by 夏天松 <wu...@gmail.com> on 2020/03/25 03:29:10 UTC

hash and range partition uneven distribution for one tablet server

I have a hot-data insert problem when using Kudu. If I use both hash and
range partitioning, the buckets end up unevenly distributed across the
tablet servers.
Kudu cluster: 3 masters and 5 tablet servers

My CREATE TABLE SQL:
CREATE TABLE tmp.sales_by_year (
  device_id STRING NOT NULL,
  update_date STRING NOT NULL,
  update_time STRING NOT NULL,
  object_name STRING NOT NULL,
  attribute_name STRING NOT NULL,
  present_value STRING NULL,
  PRIMARY KEY (device_id, update_date, update_time, object_name,
attribute_name)
)
PARTITION BY HASH (device_id) PARTITIONS 5, RANGE (update_date) (
  PARTITION '2020-03-21' <= VALUES < '2020-03-22'
)
STORED AS KUDU;

I expected that for update_date = '2020-03-21', every tablet server would
host one partition, but the actual distribution is not like this: some
machines have no partitions, while others have 2 or 3. This leads to high
CPU usage on some machines when writing large amounts of time-series data.

Please help me: how can I solve this problem?

Re: hash and range partition uneven distribution for one tablet server

Posted by Adar Lieber-Dembo <ad...@cloudera.com>.
What you're seeing sort of makes sense given that partition assignment
uses a "power of 2" selection process: two servers are chosen at random,
and the one with fewer partitions is selected as the recipient of the new
partition. Given enough partitions, this algorithm should result in an
even distribution of partitions across servers. But since you're only
assigning 5 (or 15, if the replication factor is 3) partitions to 5
servers, there may be some skew.
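
To illustrate the skew, here is a toy power-of-two-choices simulation in
plain Python. It is not Kudu's actual placement code (for one thing, it
ignores the rule that replicas of the same tablet must land on different
servers), but it shows how only 15 replicas spread over 5 servers can
come out lopsided:

import random
from collections import Counter

def place_replicas(num_servers=5, num_replicas=15, seed=None):
    # Power-of-two-choices: for each replica, pick two servers at random
    # and assign the replica to the one that currently holds fewer.
    rng = random.Random(seed)
    load = Counter({s: 0 for s in range(num_servers)})
    for _ in range(num_replicas):
        a, b = rng.sample(range(num_servers), 2)
        target = a if load[a] <= load[b] else b
        load[target] += 1
    return sorted(load.values())

# A few runs; a perfectly even outcome would be [3, 3, 3, 3, 3].
for seed in range(3):
    print(place_replicas(seed=seed))

With thousands of replicas instead of 15, the same loop evens out
quickly, which is the "given enough partitions" point above.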

Have you tried running the Kudu rebalancer tool? That's "kudu cluster
rebalance". It'll redistribute your partitions to minimize skew across
tservers.
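
For reference, a minimal invocation looks like this (the master addresses
below are placeholders for your three masters; 7051 is the default master
RPC port):

kudu cluster rebalance master-1:7051,master-2:7051,master-3:7051

Run "kudu cluster rebalance --help" to see the full list of options.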

All that said, we currently don't have a mechanism to distribute
tablet leaders evenly across the cluster, so you may still see
hotspotting on writes if one server happens to host more leaders than
the others and if those leaders are servicing a high write load.
