Posted to user@kudu.apache.org by Mauricio Aristizabal <ma...@impact.com> on 2019/04/30 20:22:52 UTC

Best way to merge range partitions

I'm doing the delicate dance of maximizing ingest throughput by having
enough hash partitions on the currently-written range (say 25), minimizing
query runtime by having range partitions that roughly match most report
windows (say 2 weeks), keeping the tablet count not far above the
recommended 600, and supporting at least 18 months of data.
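
For concreteness, the current layout looks something like this (table and
column names are hypothetical, just to show the shape):

  CREATE TABLE metrics (
    id BIGINT,
    event_time TIMESTAMP,
    value DOUBLE,
    PRIMARY KEY (id, event_time)
  )
  PARTITION BY HASH (id) PARTITIONS 25,
  RANGE (event_time) (
    -- one partition per 2-week reporting window, e.g.:
    PARTITION CAST('2019-04-15' AS TIMESTAMP) <= VALUES <
              CAST('2019-04-29' AS TIMESTAMP),
    PARTITION CAST('2019-04-29' AS TIMESTAMP) <= VALUES <
              CAST('2019-05-13' AS TIMESTAMP)
  )
  STORED AS KUDU;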

I'm thinking of a strategy of routinely merging older, cold range
partitions into bigger ones (say 2 months instead of 2 weeks), and
leveraging the reduced overall tablet count to increase the number of hash
buckets.

It would be really nice if there were a Kudu CLI 'merge_range_partition'
command (the ranges would need to be contiguous).  It would greatly
simplify optimizing time-series table layouts.

So instead I'm planning on copying the range partitions' data to a Parquet
side table, dropping those partitions, creating a single one covering the
same span, and copying the data back in.
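
Roughly, in Impala it would look like this (same hypothetical names as
above, merging two 2-week ranges into one; the affected ranges are
unavailable for ingest while this runs):

  -- 1. Copy the cold ranges out to a Parquet side table.
  CREATE TABLE metrics_stage STORED AS PARQUET AS
  SELECT * FROM metrics
  WHERE event_time >= CAST('2019-01-01' AS TIMESTAMP)
    AND event_time <  CAST('2019-01-29' AS TIMESTAMP);

  -- 2. Drop the small range partitions being merged.
  ALTER TABLE metrics DROP RANGE PARTITION
    CAST('2019-01-01' AS TIMESTAMP) <= VALUES < CAST('2019-01-15' AS TIMESTAMP);
  ALTER TABLE metrics DROP RANGE PARTITION
    CAST('2019-01-15' AS TIMESTAMP) <= VALUES < CAST('2019-01-29' AS TIMESTAMP);

  -- 3. Re-create a single partition covering the same span.
  ALTER TABLE metrics ADD RANGE PARTITION
    CAST('2019-01-01' AS TIMESTAMP) <= VALUES < CAST('2019-01-29' AS TIMESTAMP);

  -- 4. Copy the data back in and drop the side table.
  INSERT INTO metrics SELECT * FROM metrics_stage;
  DROP TABLE metrics_stage;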

Any better approach I can use currently?

Using CDH 5.15, Impala 2.13, and Kudu 1.7.

Thanks in advance,

-m

-- 
Mauricio Aristizabal
Architect - Data Pipeline
mauricio@impact.com | 323 309 4260
https://impact.com

Re: Best way to merge range partitions

Posted by Mauricio Aristizabal <ma...@impact.com>.
Sorry William, yes, 600 is not exactly right; I've just adopted it as a
soft target to stay close to, given the error you get when creating with
more ("NonRecoverableException: The requested number of tablets is over the
maximum permitted at creation time (580). Additional tablets may be added
by adding range partitions to the table post-creation.").

But the question still stands, as I would like to stay comfortably under
the limits in https://kudu.apache.org/docs/known_issues.html and we'll be
adding ~100 big tables when all is said and done (on a 30-node cluster, at
least for now).
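
(Back-of-envelope, assuming the default 3x replication: 18 months of 2-week
ranges is ~39 ranges, times 25 hash buckets is ~975 tablets and ~2,925
replicas per table; over 30 tservers that's roughly 100 replicas per server
per table, so ~100 such tables would land near 10,000 replicas per server,
far past the 1,000-2,000 guidance.  Hence the interest in merging cold
ranges.)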

Thanks for validating that's the right approach at the moment.

This would be a very nice feature to add, IMHO.  Conceptually it seems
relatively simple, especially if the agreed limitation is that tablets for
the ranges being merged go read-only and ingest ops on them fail (we'd be
merging cold-data partitions, so that's fine for us).

-- 
Mauricio Aristizabal
Architect - Data Pipeline
mauricio@impact.com | 323 309 4260
https://impact.com

Re: Best way to merge range partitions

Posted by William Berkeley <wd...@cloudera.com>.
Where's the 600 tablet count recommendation sourced from? Is that
pre-replication and per-tserver, so there are 1,800 replicas per tablet
server? We recommend 1,000-2,000 replicas per server.

As for your strategy for merging range partitions, I think it's the best
available at this point.

-Will
