Posted to user@hbase.apache.org by Bryan Beaudreault <bb...@gmail.com> on 2012/07/18 19:26:32 UTC

Smart Managed Major Compactions

Hello all, 

Before I start: I'm running CDH3u2, so HBase 0.90.4.

I am looking into managing major compactions ourselves, but there don't appear to be any hooks I can use to determine which tables need compacting.  Ideally, each time my cron job runs it would compact the table that has gone the longest since its last major compaction, but I can't find a way to access this metric.

The default major compaction algorithm seems to use the oldest modification time across all of a region's store files to determine when it was last major compacted.  I know this is not ideal, but it seems good enough.  Unfortunately, I don't see an easy way to get at this value.

Alternatively, I could keep my own compaction log, but I'd rather not do that if there is another way.  Is there some easy way to access this value that I am not seeing?  I know I could construct the paths to the store files myself, but that seems less than ideal too (it might break when we upgrade, etc.).

Thanks 

-- 
Bryan Beaudreault


Re: Smart Managed Major Compactions

Posted by Stack <st...@duboce.net>.
On Wed, Jul 18, 2012 at 7:26 PM, Bryan Beaudreault
<bb...@gmail.com> wrote:
> I am looking into managing major compactions ourselves, but there don't appear to be any hooks I can use to determine which tables need compacting.  Ideally, each time my cron job runs it would compact the table that has gone the longest since its last major compaction, but I can't find a way to access this metric.
>

I'd suggest you take a region view rather than a table view.

Internally, we look at the HDFS modification time when we check
whether to compact.  If the oldest store file's age is greater than
the major compaction interval set for the particular column family,
we'll do a major compaction.
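
(For reference, that interval is the "hbase.hregion.majorcompaction"
setting, 86400000 ms, i.e. one day, by default in 0.90; it can be
overridden per column family.)  A simplified sketch of the check,
given a FileSystem fs, a family directory Path familyDir, and a
Configuration conf; the names are illustrative, not the actual 0.90
source:

  // The oldest store file's modification time stands in for the time
  // of the last major compaction.
  long oldestModTime = Long.MAX_VALUE;
  for (FileStatus f : fs.listStatus(familyDir)) {
    oldestModTime = Math.min(oldestModTime, f.getModificationTime());
  }
  long interval = conf.getLong("hbase.hregion.majorcompaction", 86400000L);
  boolean due = System.currentTimeMillis() - oldestModTime > interval;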

Running an external script, you could look at each region in turn on
occasion.  Look at its files.  Check their modification times (and
perhaps how many files there are under the region's column family)
and, if the oldest is past whatever threshold you want, run a major
compaction on the region.

Try to balance how many you'd have running at a time.
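
Untested sketch against the 0.90 client APIs; it assumes
hbase-site.xml (with hbase.rootdir) is on the classpath, and the
table name, family name, age threshold, and per-run cap are all
placeholders to adjust:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.HRegionInfo;
  import org.apache.hadoop.hbase.client.HBaseAdmin;
  import org.apache.hadoop.hbase.client.HTable;

  public class CompactStaleRegions {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      String tableName = "mytable";        // placeholder
      String family = "f1";                // placeholder
      long maxAge = 24 * 60 * 60 * 1000L;  // compact if oldest file > 1 day old
      int cap = 2;                         // at most this many requests per run

      Path tableDir = new Path(conf.get("hbase.rootdir"), tableName);
      FileSystem fs = tableDir.getFileSystem(conf);
      HBaseAdmin admin = new HBaseAdmin(conf);
      HTable table = new HTable(conf, tableName);

      int requested = 0;
      for (HRegionInfo region : table.getRegionsInfo().keySet()) {
        // 0.90 layout: <rootdir>/<table>/<encoded-region-name>/<family>/
        Path familyDir =
          new Path(new Path(tableDir, region.getEncodedName()), family);
        FileStatus[] files = fs.listStatus(familyDir);
        if (files == null || files.length < 2) continue;
        long oldest = Long.MAX_VALUE;
        for (FileStatus f : files) {
          oldest = Math.min(oldest, f.getModificationTime());
        }
        // Two or more store files and the oldest is past the threshold:
        // ask for a major compaction (the request is asynchronous).
        if (System.currentTimeMillis() - oldest > maxAge) {
          admin.majorCompact(region.getRegionNameAsString());
          if (++requested >= cap) break;   // pick up the rest on the next run
        }
      }
    }
  }

Run from cron with a small cap and it spreads the compaction I/O out
over successive runs instead of hitting everything at once.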

> The default major compaction algorithm seems to use the oldest modification time across all of a region's store files to determine when it was last major compacted.  I know this is not ideal, but it seems good enough.  Unfortunately, I don't see an easy way to get at this value.
>

It's in the stats data structure for an HDFS file.  Scripting, you
could parse it from an hdfs listing.
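
In the Java API that's FileStatus#getModificationTime (epoch millis),
e.g., for some store file Path storeFilePath:

  FileStatus stat = fs.getFileStatus(storeFilePath);
  long lastModified = stat.getModificationTime();

From a plain "hadoop fs -ls" it's the date/time columns of the listing.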


St.Ack