Posted to user@cassandra.apache.org by Frederick Ryckbosch <fr...@gmail.com> on 2012/05/10 18:27:17 UTC

Concurrent major compaction

Hi,

We have a single-node Cassandra instance that contains volatile data: about 2 GB of data is written every day, kept for 7 days, and then removed (using TTL). To keep the application from slowing down during a large compaction, we run a major compaction every night (fewer users, less performance impact).

The major compaction is CPU bound: it uses about 1 core and only consumes 4 MB/s of disk IO. We would like the compaction to scale with the resources available in the machine (cores, disks). Enabling multithreaded_compaction didn't help much: CPU usage goes up to 120% of one core, but does not scale with the number of cores.

To make the compaction scale with the number of cores in our machine, we tried to run a major compaction on multiple column families (in the same keyspace) at the same time using `nodetool -h localhost compact testSpace data1 data2`. However, the 2 compactions are executed serially instead of concurrently, even with concurrent_compactors set to 4 (the number of cores).

Is this normal behavior (for both the multithreaded and the concurrent compactions)? Is there any way to make major compactions scale with the number of cores in the machine?

Thanks!
Frederick

Re: Concurrent major compaction

Posted by Sylvain Lebresne <sy...@datastax.com>.

For the multithreaded compaction,
https://issues.apache.org/jira/browse/CASSANDRA-4182 is relevant.
Basically, because you do a major compaction every night, you are
in the case of 'one large sstable and a bunch of others', for which the
design of multithreaded compaction won't help much.
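
As a rough illustration of why that case parallelizes poorly, here is a back-of-the-envelope sketch in Python. It assumes (this is an assumption for illustration, not a description of Cassandra's actual implementation) that each sstable is processed by at most one thread, so the largest sstable bounds the achievable speedup. The sstable sizes are hypothetical: one 12 GB sstable left by earlier major compactions plus a day's 2 GB spread over four freshly flushed sstables.

```python
def per_sstable_speedup_bound(sstable_sizes_gb):
    """Upper bound on speedup when each sstable is handled by at most one thread.

    Assumption for illustration only: work is proportional to sstable size and
    cannot be split within a single sstable.
    """
    total = sum(sstable_sizes_gb)
    largest = max(sstable_sizes_gb)
    return total / largest


# Hypothetical layout: one large sstable plus four small, recently flushed ones.
sizes_gb = [12.0, 0.5, 0.5, 0.5, 0.5]
bound = per_sstable_speedup_bound(sizes_gb)
print(f"speedup bound: {bound:.2f}x")  # 14 / 12, i.e. about 1.17x
```

Under these assumed sizes, extra cores cannot push the compaction much beyond a single core's worth of work, which is in the same ballpark as the 120% CPU usage reported above.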

For the concurrent part, this is due to the fact that major compaction
grabs a global lock before running. We could (will) change that to be
one lock per-CF (https://issues.apache.org/jira/browse/CASSANDRA-3430)
but it's not done yet. If you feel adventurous and care enough, you
can always try to apply the patch on CASSANDRA-3430; it should be fine
if you don't use truncate.
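
The difference between a global lock and one lock per CF can be sketched with a small Python toy. Everything here is illustrative: `max_overlap`, the plain `threading.Lock` objects, and the `time.sleep` standing in for compaction work are assumptions, not Cassandra's code or lock type. The column family names are taken from the nodetool command in the original question.

```python
import threading
import time


def max_overlap(cf_names, lock_for):
    """Run one fake 'major compaction' per column family in its own thread and
    return the highest number of compactions that were running at once."""
    state = {"running": 0, "peak": 0}
    counter_guard = threading.Lock()  # protects the counters only

    def compact(cf):
        with lock_for(cf):  # the lock under discussion: global or per-CF
            with counter_guard:
                state["running"] += 1
                state["peak"] = max(state["peak"], state["running"])
            time.sleep(0.2)  # stand-in for the actual compaction work
            with counter_guard:
                state["running"] -= 1

    threads = [threading.Thread(target=compact, args=(cf,)) for cf in cf_names]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["peak"]


cfs = ["data1", "data2"]

# One global lock: every compaction waits for the previous one to finish.
global_lock = threading.Lock()
serial_peak = max_overlap(cfs, lambda cf: global_lock)

# One lock per column family: compactions on different CFs can overlap.
per_cf_locks = {cf: threading.Lock() for cf in cfs}
parallel_peak = max_overlap(cfs, lambda cf: per_cf_locks[cf])

print("global lock peak:", serial_peak)
print("per-CF lock peak:", parallel_peak)
```

With the global lock the peak can never exceed 1, whatever concurrent_compactors is set to. With per-CF locks the two fake compactions should overlap, so the peak is expected to be 2.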

--
Sylvain
