Posted to user@accumulo.apache.org by "Dickson, Matt MR" <ma...@defence.gov.au> on 2014/05/06 00:43:03 UTC

How to speed up table compaction [SEC=UNOFFICIAL]

UNOFFICIAL

I'm trying to compact a table and have a queue of 60,000 compactions with only 800 running at a time.  This has now been running for 4 days and has only decreased the queue by 8,000.

I've increased tserver.compaction.major.concurrent.max=12 and stopped all ingest, but have not seen a change in progress.  Are there other Accumulo settings I can alter to improve this?  I also saw tserver.compaction.major.thread.files.open.max=10; should this be increased?

Thanks in advance,
Matt
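
For context, system-wide properties like these are set from the Accumulo shell with the config command. A minimal sketch, assuming a root shell session (the prompt and the value here are illustrative, not recommendations):

    root@instance> config -f compaction
    root@instance> config -s tserver.compaction.major.concurrent.max=12

The first command lists compaction-related properties with their current values; the second overrides one of them. As the rest of the thread notes, some tserver properties only take effect after the tservers are restarted.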

Re: How to speed up table compaction [SEC=UNOFFICIAL]

Posted by Keith Turner <ke...@deenlo.com>.
On Mon, May 5, 2014 at 6:43 PM, Dickson, Matt MR <
matt.dickson@defence.gov.au> wrote:

>  *UNOFFICIAL*
> I'm trying to compact a table and have had a queue of 60,000 compactions
> with only 800 running at a time.  This has now been running for 4 days and
> only decreased the queue by 8,000.
>
> I've increased tserver.compaction.major.concurrent.max=12 and stopped all
> ingest but not seen a change in progress.  Are there other accumulo
> settings I can alter to improve this?  I also saw
> tserver.compaction.major.thread.files.open.max=10 should this be increased?
>

This may speed things up if you have tablets w/ lots of files.  I think
this default should probably be higher based on anecdotal evidence, but I
have not experimented to find out what a good default would be.  I'll
open an issue.
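
A minimal sketch of bumping that property from the shell, assuming a root session; the value 20 is purely illustrative, not a tested recommendation:

    root@instance> config -s tserver.compaction.major.thread.files.open.max=20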


>
> Thanks in advance,
> Matt
>

Re: How to speed up table compaction [SEC=UNOFFICIAL]

Posted by Josh Elser <jo...@gmail.com>.
Yup, that property can be changed on the fly, and future compactions 
will use the new codec (all of the queued compactions will see it). It 
will not actively seek out and re-write files which are not compressed 
with the current codec.

And, yes, depending on the distro of Hadoop you're using, there might be 
some extra bits to install. I believe that your normal Hadoop-2.x.y 
release from Apache will have the native libs available for you, and you 
might just need to install libsnappy via your OS's package manager.
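
The table property Josh is referring to is table.file.compress.type (default 'gz'). A minimal sketch of switching a single table to snappy from the shell; the table name "mytable" is a placeholder:

    root@instance> config -t mytable -s table.file.compress.type=snappy

Consistent with Josh's answer above, files already written as gz stay gz; only compactions that run after the change produce snappy-compressed files.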

On 5/6/14, 11:52 AM, David Medinets wrote:
> Can that property be changed on the fly? And Snappy needs to be
> installed throughout the cluster before making the configuration change?
>
>
> On Tue, May 6, 2014 at 12:43 AM, Josh Elser <josh.elser@gmail.com> wrote:
>
>     Depending on the CPU/IO ratio for your system, switching to a
>     different compression codec might help. Snappy tends to be a bit
>     quicker writing out data as opposed to gzip at the cost of being
>     larger on disk. The increase in final size on disk might be
>     prohibitive depending on your requirements though.
>
>     I forget the table property off hand, but, if you haven't changed
>     this already, it will be the property with a default value of 'gz' :)
>
>     On May 5, 2014 6:43 PM, "Dickson, Matt MR"
>     <matt.dickson@defence.gov.au> wrote:
>
>
>         *UNOFFICIAL*
>
>         I'm trying to compact a table and have had a queue of 60,000
>         compactions with only 800 running at a time.  This has now been
>         running for 4 days and only decreased the queue by 8,000.
>         I've increased tserver.compaction.major.concurrent.max=12 and
>         stopped all ingest but not seen a change in progress.  Are there
>         other accumulo settings I can alter to improve this?  I also saw
>         tserver.compaction.major.thread.files.open.max=10 should this be
>         increased?
>         Thanks in advance,
>         Matt
>
>

RE: How to speed up table compaction [SEC=UNOFFICIAL]

Posted by "Dickson, Matt MR" <ma...@defence.gov.au>.
UNOFFICIAL

To summarise the outcome of this:

In the past 24 hours the compactions have reduced from 58K to 6K, primarily due to increasing tserver.compaction.major.concurrent.max to 18 (this requires all tservers to be restarted).  The default is 3, so initially I had only pushed it up to 10 to leave headroom for other processes, but that had not produced a major increase in performance.

Due to disk space constraints I left table.file.compress.type as gz.  The other settings altered were tserver.cache.data.size, tserver.memory.maps.max and tserver.walog.max.size.  We increased all of these to maximise memory usage, but aren't confident they had a big impact on the compaction progress.
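
For reference, a sketch of those changes from the shell; the memory values below are placeholders (the right numbers depend on available RAM), and depending on the Accumulo version some of these may need to go in accumulo-site.xml rather than being set via the shell:

    root@instance> config -s tserver.compaction.major.concurrent.max=18
    root@instance> config -s tserver.cache.data.size=1G
    root@instance> config -s tserver.memory.maps.max=2G
    root@instance> config -s tserver.walog.max.size=2G

followed by a rolling restart of the tservers so the thread count change takes effect.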

For completeness, the reason I had to run a table compact was that there were tablets in the table receiving no new data, so no compactions were being triggered on them and the ageoff filter was never applied.  I posted a question to the user group, "Identify tablets with no new data loaded", on 30/4/14, trying to find a clean way to identify these tablets via timestamps in the metadata table.  The goal was to compact only the necessary ranges rather than take this approach, which is quite heavy handed.  I'm still keen to pursue an idea of inspecting the files in HDFS to get the time of the last compaction, but that's a topic for another post.  Any suggestions are welcome.
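
On the ranged idea: the shell's compact command accepts begin and end rows, so once the stale tablets are identified the compaction can be limited to them. A hedged sketch, assuming Accumulo 1.6 (where the metadata table is accumulo.metadata; earlier releases use !METADATA) and a hypothetical table ID of "3":

    root@instance> scan -t accumulo.metadata -b "3;" -e "3<" -c file -st
    root@instance> compact -t mytable -b m -e t -w

The scan lists each tablet's file entries with their timestamps (-st), which gives a rough signal for tablets whose files haven't changed recently; the compact then covers only the row range of interest ("m" to "t" is a placeholder range), with -w blocking until it completes.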

Thanks again to everyone for all the feedback.

________________________________
From: David Medinets [mailto:david.medinets@gmail.com]
Sent: Wednesday, 7 May 2014 01:52
To: accumulo-user
Subject: Re: How to speed up table compaction [SEC=UNOFFICIAL]

Can that property be changed on the fly? And Snappy needs to be installed throughout the cluster before making the configuration change?

On Tue, May 6, 2014 at 12:43 AM, Josh Elser <jo...@gmail.com> wrote:

Depending on the CPU/IO ratio for your system, switching to a different compression codec might help. Snappy tends to be a bit quicker writing out data as opposed to gzip at the cost of being larger on disk. The increase in final size on disk might be prohibitive depending on your requirements though.

I forget the table property off hand, but, if you haven't changed this already, it will be the property with a default value of 'gz' :)

On May 5, 2014 6:43 PM, "Dickson, Matt MR" <ma...@defence.gov.au> wrote:

UNOFFICIAL

I'm trying to compact a table and have had a queue of 60,000 compactions with only 800 running at a time.  This has now been running for 4 days and only decreased the queue by 8,000.

I've increased tserver.compaction.major.concurrent.max=12 and stopped all ingest but not seen a change in progress.  Are there other accumulo settings I can alter to improve this?  I also saw tserver.compaction.major.thread.files.open.max=10 should this be increased?

Thanks in advance,
Matt


Re: How to speed up table compaction [SEC=UNOFFICIAL]

Posted by David Medinets <da...@gmail.com>.
Can that property be changed on the fly? And Snappy needs to be installed
throughout the cluster before making the configuration change?


On Tue, May 6, 2014 at 12:43 AM, Josh Elser <jo...@gmail.com> wrote:

> Depending on the CPU/IO ratio for your system, switching to a different
> compression codec might help. Snappy tends to be a bit quicker writing out
> data as opposed to gzip at the cost of being larger on disk. The increase
> in final size on disk might be prohibitive depending on your requirements
> though.
>
> I forget the table property off hand, but, if you haven't changed this
> already, it will be the property with a default value of 'gz' :)
> On May 5, 2014 6:43 PM, "Dickson, Matt MR" <ma...@defence.gov.au>
> wrote:
>
>>  *UNOFFICIAL*
>> I'm trying to compact a table and have had a queue of 60,000 compactions
>> with only 800 running at a time.  This has now been running for 4 days and
>> only decreased the queue by 8,000.
>>
>> I've increased tserver.compaction.major.concurrent.max=12 and stopped all
>> ingest but not seen a change in progress.  Are there other accumulo
>> settings I can alter to improve this?  I also saw
>> tserver.compaction.major.thread.files.open.max=10 should this be increased?
>>
>> Thanks in advance,
>> Matt
>>
>

Re: How to speed up table compaction [SEC=UNOFFICIAL]

Posted by Josh Elser <jo...@gmail.com>.
Depending on the CPU/IO ratio for your system, switching to a different
compression codec might help. Snappy tends to be a bit quicker writing out
data as opposed to gzip at the cost of being larger on disk. The increase
in final size on disk might be prohibitive depending on your requirements
though.

I forget the table property off hand, but, if you haven't changed this
already, it will be the property with a default value of 'gz' :)
On May 5, 2014 6:43 PM, "Dickson, Matt MR" <ma...@defence.gov.au>
wrote:

>  *UNOFFICIAL*
> I'm trying to compact a table and have had a queue of 60,000 compactions
> with only 800 running at a time.  This has now been running for 4 days and
> only decreased the queue by 8,000.
>
> I've increased tserver.compaction.major.concurrent.max=12 and stopped all
> ingest but not seen a change in progress.  Are there other accumulo
> settings I can alter to improve this?  I also saw
> tserver.compaction.major.thread.files.open.max=10 should this be increased?
>
> Thanks in advance,
> Matt
>