Posted to user@cassandra.apache.org by Dwight Smith <dw...@genesys.com> on 2014/01/11 00:21:49 UTC

Impact of running major compaction with Size Tiered Compaction - version 1.1.11

Hi



We have a 6 node cluster in two DCs, Cassandra version 1.1.11, RF=3 in each DC.



The DataStax Documentation says the following:



Initiate a major compaction through nodetool compact<http://www.datastax.com/docs/1.1/references/nodetool#nodetool-compact>. A major compaction merges all SSTables into one. Though major compaction can free disk space used by accumulated SSTables, during runtime it temporarily doubles disk space usage and is I/O and CPU intensive. After running a major compaction, automatic minor compactions are no longer triggered on a frequent basis. Consequently, you no longer have to manually run major compactions on a routine basis. Expect read performance to improve immediately following a major compaction, and then to continually degrade until you invoke the next major compaction. For this reason, DataStax does not recommend major compaction.



A maintenance procedure has been run (periodically) on the nodes in the cluster which performs repair -pr, flush, compact, then cleanup.
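For reference, the sequence above can be sketched as the nodetool invocations it expands to. This is only a sketch: "demo_ks" is a placeholder keyspace name, and the real procedure presumably loops over the keyspaces/CFs on each node.

```python
# Sketch of the periodic maintenance sequence described above.
# "demo_ks" and the CF name are placeholders, not from the original post.
def maintenance_commands(keyspace, cf):
    return [
        ["nodetool", "repair", "-pr", keyspace, cf],  # anti-entropy, primary range only
        ["nodetool", "flush", keyspace, cf],          # write memtables out as SSTables
        ["nodetool", "compact", keyspace, cf],        # major compaction (the step in question)
        ["nodetool", "cleanup", keyspace, cf],        # discard data this node no longer owns
    ]

for cmd in maintenance_commands("demo_ks", "xxxx"):
    print(" ".join(cmd))
```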



This runs fine for all CFs except one which is very large, with large rows. The entries all have TTLs specified which are less than gc_grace.



Currently the SSTables are as follows for the xxxx CF, the maintenance just completed after running for 9+ hours:



   19977911 Dec 27 06:38 xxxx-hf-57288-Data.db

       5817 Dec 27 06:52 xxxx-hf-57304-Data.db

2735747237 Dec 27 06:52 xxxx-hf-57291-Data.db

     718192 Dec 27 06:52 xxxx-hf-57305-Data.db

2581373226 Dec 29 16:48 xxxx-hf-57912-Data.db

  936062446 Jan  9 22:22 xxxx-hf-58875-Data.db

  235463043 Jan 10 05:23 xxxx-hf-58888-Data.db

   60851675 Jan 10 08:33 xxxx-hf-58893-Data.db

   60871570 Jan 10 11:44 xxxx-hf-58898-Data.db

   60537384 Jan 10 14:54 xxxx-hf-58903-Data.db



Min_compaction_threshold is set to 4.



Now for the questions:





1) Given that the DataStax recommendation was not followed - will minor compactions still be triggered if the major compactions are no longer performed?



2) Would the maintenance steps: repair -pr, flush, and cleanup still be useful?



Thanks

Re: Impact of running major compaction with Size Tiered Compaction - version 1.1.11

Posted by Robert Coli <rc...@eventbrite.com>.
On Fri, Jan 10, 2014 at 3:21 PM, Dwight Smith <dw...@genesys.com>wrote:

>  Initiate a major compaction through nodetool compact
> <http://www.datastax.com/docs/1.1/references/nodetool#nodetool-compact>. A major compaction
> merges all SSTables into one. Though major compaction can free disk space
> used by accumulated SSTables, during runtime it temporarily doubles disk
> space usage and is I/O and CPU intensive. After running a major compaction,
> automatic minor compactions are no longer triggered on a frequent basis.
> Consequently, you no longer have to manually run major compactions on a
> routine basis. Expect read performance to improve immediately following a
> major compaction, and then to continually degrade until you invoke the next
> major compaction. For this reason, DataStax does not recommend major
> compaction.
>
It's a little sad that the moment I read your subject line, I knew which
paragraph in the docs you would be asking about. Suffice it to say that
this particular doc snippet has come up repeatedly, and the summation of
various threads about it is that it is hopelessly confused/meaningless.
Maybe docs@datastax.com could remove it from the 1.1 era docs, bringing
them in line with later, less confusingly verbose versions of the "nodetool
compact" doc? (I've added them to the bcc: on this reply, FWIW!)

Many Cassandra experts recommend major compaction in certain cases; I know
of one who runs a major compaction every night. Major compaction is the
single most efficient way to merge row fragments that Cassandra has
available to it, so it is very very helpful for certain write patterns.
However, if major compaction helps a great deal (as in the case of an app I once
saw where it recovered 50% of data size when run every two days...) it is
probably an indication that you are Doing Something Wrong, such that you
have a large number of overwrites that are purged during the merge.

> 1) Given that the DataStax recommendation was not followed - will minor
> compactions still be triggered if the major compactions are no longer
> performed?
>
Yes. What they're trying to say is that because Size Tiered Compaction is..
Size Tiered.. your One Big SSTable is *less likely* to be compacted during
minor compaction than it otherwise would be. If this were for some reason to
become an actual problem, you could use sstablesplit (after upgrading to
1.2.x HEAD) to split it into multiple SSTables.
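Rob's point about size tiering can be illustrated with a rough sketch of STCS bucketing, using the SSTable sizes from the original post. Assumptions: the default bucket bounds (0.5x to 1.5x of the bucket average) and min_compaction_threshold=4; the real grouping logic in Cassandra is more involved, so this is only a model.

```python
# Rough model of Size Tiered bucketing: each SSTable joins the first
# bucket whose average size brackets it (0.5x..1.5x), else starts a new one.
def buckets(sizes, bucket_low=0.5, bucket_high=1.5):
    result = []
    for size in sorted(sizes):
        for b in result:
            avg = sum(b) / len(b)
            if bucket_low * avg <= size <= bucket_high * avg:
                b.append(size)
                break
        else:
            result.append([size])
    return result

# SSTable sizes (bytes) from the listing in the original post.
sizes = [19977911, 5817, 2735747237, 718192, 2581373226,
         936062446, 235463043, 60851675, 60871570, 60537384]

# With these sizes, no bucket reaches the threshold of 4: the three
# ~60 MB tables fall one short, and each multi-GB table has at most one
# similar-sized peer -- so the big post-major-compaction SSTables wait.
for b in buckets(sizes):
    print(len(b), "sstable(s):", b)
```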

> 2) Would the maintenance steps: repair -pr, flush, and cleanup still be
> useful?
>
Periodic repair is a best practice. I recommend setting gc_grace_seconds to
34 days and then repairing on the first of every month.
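The 34-day figure is just the longest month plus a little slack, so that a repair started on the first of the month can finish (or be retried) before any tombstones become purgeable:

```python
# 34 days = 31 (longest month) + 3 days of slack for the monthly
# repair to complete before gc_grace expires.
gc_grace_seconds = 34 * 24 * 60 * 60
print(gc_grace_seconds)  # 2937600
```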

Cleanup is only necessary if you add or remove nodes from a cluster. If you
haven't done that in between runs of cleanup, you are just wasting
resources recompacting SSTables as a NOOP.

>  This runs fine for all CFs except one which is very large, with large
> rows. The entries all have TTLs specified which are less than gc_grace.
>
What does it do on that CF?

If that CF has TTLs which are less than gc_grace, then a major compaction
should clean up all data?
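The usual purge rule can be sketched as follows. This is a simplification: an expired TTL cell behaves like a tombstone whose deletion time is its expiry, so it only becomes droppable during compaction once gc_grace has additionally elapsed.

```python
# Simplified model: a TTL'd cell is purgeable during compaction once
# now >= write_time + ttl + gc_grace_seconds.
def purgeable(write_time, ttl, gc_grace_seconds, now):
    return now >= write_time + ttl + gc_grace_seconds

# e.g. a 1-day TTL with a 10-day gc_grace: droppable from day 11 on.
print(purgeable(0, 86400, 10 * 86400, 11 * 86400))  # True
```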

=Rob