Posted to user@accumulo.apache.org by Rob Verkuylen <ro...@verkuylen.net> on 2020/11/24 20:27:59 UTC

Compactions with AgeOff and Combiner

We have a scenario where we use the SummingCombiner to aggregate stats on
high-cardinality properties of a streaming dataset. The use case is generating
histograms over a certain period, so we age off these stats after a certain
time.

We run into some unexpected behaviour where the age-off does not physically
happen unless we trigger a manual compaction using EverythingStrategy as
opposed to the DefaultStrategy. This is in combination with fairly large
split sizes (50-100G) to prevent tablets from splitting further.

The default strategy with a majc ratio of 3 and table.file.max=15 seems to
result in a scenario where the tablet servers over time will contain one
reasonably large file, e.g. 20G (A-*), and then several smaller files (C-*) of
1 to 5 and maybe 8GB. It will take a very long time before these C-* files
sum up to <ratio> x <largest file>, so the 20G file will almost never
be considered for compaction, and over time this will hurt query performance
because of all the aged-off data that needs to be skipped in a scan.
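
To make the arithmetic concrete, here is a minimal sketch of a ratio-based
selection rule. It is a simplified model, not the actual DefaultStrategy code,
and it assumes the rule "compact the set if largest x ratio <= total size of
the set, otherwise drop the largest file and retry". The class name, method
name, and file sizes are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class RatioSelectionSketch {

  // Returns the file sizes (in GB) that the simplified ratio rule would
  // select for compaction, or an empty list if no set of files qualifies.
  static List<Long> select(List<Long> fileSizesGb, double ratio) {
    List<Long> candidates = new ArrayList<>(fileSizesGb);
    // Largest file first, so index 0 is always the current largest.
    candidates.sort(Collections.reverseOrder());

    long total = 0;
    for (long size : candidates) {
      total += size;
    }

    // Assumption: at least two files are needed for a merge to make sense.
    while (candidates.size() > 1) {
      long largest = candidates.get(0);
      if (largest * ratio <= total) {
        return candidates;
      }
      // The largest file is too big relative to the rest: exclude it.
      total -= largest;
      candidates.remove(0);
    }
    return new ArrayList<>();
  }

  public static void main(String[] args) {
    // One 20G file plus a few smaller ones, ratio 3: nothing qualifies.
    System.out.println(select(List.of(20L, 8L, 5L, 3L, 1L), 3.0));
    // Only once the small files add up does the 20G file get included.
    System.out.println(select(List.of(20L, 8L, 8L, 8L, 8L, 8L), 3.0));
    // A lower ratio pulls the 20G file in much earlier.
    System.out.println(select(List.of(20L, 8L, 5L, 3L, 1L), 1.5));
  }
}
```

Under this model the 20G file is only rewritten once the whole set reaches
60G, which matches the behaviour described above: the large file sits
untouched while small files come and go around it.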

Manual compaction will correct this, but it is only a matter of time before we
run into the same problem again. What is the best approach to let Accumulo
handle this automatically? Is it a matter of lowering the ratio so the 20G file
is reached sooner, while guarding against continuously running compactions? Or
writing a custom CompactionStrategy?
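
As a rough illustration of what the decision logic in a custom strategy might
look like, here is a hypothetical, self-contained sketch. It does not use the
Accumulo CompactionStrategy API; TabletFile, shouldCompactAll, and the file
names are invented for this example. It assumes entry timestamps reflect
ingest time, so a file created longer ago than the age-off period can only
hold entries that the age-off filter would drop on a full rewrite.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

public class AgeAwareSelectionSketch {

  // Hypothetical stand-in for a tablet file: name, size, and creation time.
  static final class TabletFile {
    final String name;
    final long sizeBytes;
    final Instant created;

    TabletFile(String name, long sizeBytes, Instant created) {
      this.name = name;
      this.sizeBytes = sizeBytes;
      this.created = created;
    }
  }

  // Returns true when at least one file was created before the age-off
  // cutoff, i.e. a full compaction would let the age-off filter drop data.
  static boolean shouldCompactAll(List<TabletFile> files, Duration ageOff, Instant now) {
    Instant cutoff = now.minus(ageOff);
    return files.stream().anyMatch(f -> f.created.isBefore(cutoff));
  }

  public static void main(String[] args) {
    Instant now = Instant.parse("2020-11-24T20:00:00Z");
    List<TabletFile> files = List.of(
        new TabletFile("A-0001.rf", 20L << 30, now.minus(Duration.ofDays(40))),
        new TabletFile("C-0002.rf", 5L << 30, now.minus(Duration.ofDays(2))));

    // With a 30-day age-off the 40-day-old file triggers a full compaction.
    System.out.println(shouldCompactAll(files, Duration.ofDays(30), now));
  }
}
```

A check like this would run full compactions at most about once per age-off
period per tablet, which is one way to avoid both the stale 20G file and
continuously running compactions.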

Re: Compactions with AgeOff and Combiner

Posted by Rob Verkuylen <ro...@verkuylen.net>.
Thanks for the pointer, it seems that 2.x has some wonderful improvements.

I just looked into Timely in detail, and they seem to have encountered this
problem as well and went ahead with a custom MetricCompactionStrategy. We
will go the same route.


On Tue, Nov 24, 2020 at 10:00 PM Christopher <ct...@apache.org> wrote:

> A custom CompactionStrategy is probably your best bet, I would think,
> since you have very specific requirements.
> You may also be interested in the work done by Keith Turner for 2.1.0 (not
> yet released, as it is still under development) to add more control over
> compactions. A preview of the javadoc for the features can be found at
> https://github.com/apache/accumulo/blob/main/core/src/main/java/org/apache/accumulo/core/spi/compaction/package-info.java
> (there may be a better doc... I'm not sure; perhaps these pending website
> documentation updates:
> https://github.com/apache/accumulo-website/pull/232/files)

Re: Compactions with AgeOff and Combiner

Posted by Christopher <ct...@apache.org>.
A custom CompactionStrategy is probably your best bet, I would think, since
you have very specific requirements.
You may also be interested in the work done by Keith Turner for 2.1.0 (not
yet released, as it is still under development) to add more control over
compactions. A preview of the javadoc for the features can be found at
https://github.com/apache/accumulo/blob/main/core/src/main/java/org/apache/accumulo/core/spi/compaction/package-info.java
(there may be a better doc... I'm not sure; perhaps these pending website
documentation updates:
https://github.com/apache/accumulo-website/pull/232/files)
