Posted to dev@cassandra.apache.org by Germán Kondolf <ge...@gmail.com> on 2010/12/17 01:04:16 UTC

Parallel Compaction

Hi everybody,

I've just finished the first implementation of a Parallel Compaction
patch for the trunk version. Tomorrow I'll test it with a high volume
of data to see if it works as I expected, but before that I want to
validate the approach with you.

I know it's kinda naive, but maybe it works as a starting point for a
future production implementation, or at least allows the compaction
strategy to be made configurable.
First of all, I don't know the C* code in depth, so maybe I took a few
shortcuts, and that's why I need a second look from an expert...

I've modified the doCompaction method of CompactionManager, added a
few static classes (I'm working to remove them, so V2 is coming), and
simply split the sstables to compact into balanced groups, firing
each group's compaction in parallel.
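
Roughly, the splitting and dispatch would look like the sketch below (just an
illustration, not the actual patch code; the SSTable placeholder, splitBalanced()
and compactGroup() are made-up names): sort the sstables by size, greedily assign
each one to the currently smallest bucket, then submit one ordinary compaction
task per bucket to a fixed-size executor and wait for all of them.

import java.util.*;
import java.util.concurrent.*;

public class ParallelCompactionSketch
{
    static class SSTable
    {
        final String name;
        final long sizeBytes;
        SSTable(String name, long sizeBytes) { this.name = name; this.sizeBytes = sizeBytes; }
    }

    // Greedy balancing: largest sstables first, each into the currently smallest bucket.
    static List<List<SSTable>> splitBalanced(Collection<SSTable> sstables, int groups)
    {
        List<SSTable> sorted = new ArrayList<>(sstables);
        sorted.sort((a, b) -> Long.compare(b.sizeBytes, a.sizeBytes));
        List<List<SSTable>> buckets = new ArrayList<>();
        long[] totals = new long[groups];
        for (int i = 0; i < groups; i++)
            buckets.add(new ArrayList<>());
        for (SSTable t : sorted)
        {
            int smallest = 0;
            for (int i = 1; i < groups; i++)
                if (totals[i] < totals[smallest])
                    smallest = i;
            buckets.get(smallest).add(t);
            totals[smallest] += t.sizeBytes;
        }
        return buckets;
    }

    // Placeholder for the usual single compaction over one group of sstables.
    static void compactGroup(List<SSTable> group)
    {
        System.out.println("compacting " + group.size() + " sstables");
    }

    public static void main(String[] args) throws Exception
    {
        List<SSTable> sstables = Arrays.asList(
            new SSTable("a-Data.db", 100), new SSTable("b-Data.db", 80),
            new SSTable("c-Data.db", 60), new SSTable("d-Data.db", 40));
        int parallelism = 2;
        ExecutorService executor = Executors.newFixedThreadPool(parallelism);
        List<Future<?>> futures = new ArrayList<>();
        for (List<SSTable> bucket : splitBalanced(sstables, parallelism))
            futures.add(executor.submit(() -> compactGroup(bucket)));
        for (Future<?> f : futures)
            f.get(); // wait until every group has been compacted
        executor.shutdown();
    }
}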

The revision I've based the patch on is 1050234.
The files are attached: the patch and CompactionManager.java.

Thanks in advance, I'd appreciate any feedback.

-- 
//GK
german.kondolf@gmail.com
// sites
http://twitter.com/germanklf
http://www.facebook.com/germanklf
http://ar.linkedin.com/in/germankondolf

Re: Parallel Compaction

Posted by Germán Kondolf <ge...@gmail.com>.
I've created the patch ticket:
https://issues.apache.org/jira/browse/CASSANDRA-1876

On Fri, Dec 17, 2010 at 12:30 PM, Germán Kondolf
<ge...@gmail.com> wrote:
> On Fri, Dec 17, 2010 at 11:15 AM, Jonathan Ellis <jb...@gmail.com> wrote:
>> On Fri, Dec 17, 2010 at 8:01 AM, Germán Kondolf <ge...@gmail.com> wrote:
>>
>>> Thanks Jonathan for the feedback.
>>>
>>> By flush/schema migration, do you mean the SSTables replace lock? I've put
>>> that lock in just to be sure; if it's fine with you, I'll remove it.
>>> I'll clean up the code according to the code-style article, add the
>>> parameter to the configuration with a default of "1", and send it
>>> again.
>>>
>>> Why do you think it's only worth it on SSDs?
>>>
>>
>> Because even a single compaction causes a ton of i/o contention.  99% of the
>> time your concern is how to make compaction use _less_ resources, not more.
>> :)
>
> We guess that, depending on the scenario, there is room for different
> strategies that use fewer resources.
>
> With short-lived keys, a fast parallel compaction together with
> CASSANDRA-1074 could mean the node is compacting for only a very
> short period of time, and while that is happening the other nodes could
> handle the load, provided the compaction takes just seconds.
>
> In another scenario, with long-lived keys, we're thinking that if the
> minor compaction compacted only the Bloom filters and indexes, leaving the
> SSTables the way they were, we would save the I/O bandwidth currently
> spent in the write phase, writing just the Bloom filters and indexes.
>
> The proposed structure of SSTables would change and look like this:
> LogicSSTable
>     Index
>     BloomFilter
>     Collection<SSTableOnDisk>
>
> The LogicSSTable contains the index & Bloom filter of the given compacted SSTables.
>
> Reading a column would imply checking the BF, reading the index
> (which would indicate not only an offset but also a file), and reading
> the corresponding file.
>
> In this way, the minor compaction is just a reading process and not a
> write-intensive process.
>
> Of course, it depends on the behaviour of the dataset. With
> short-lived keys, this latter strategy just makes the major compaction
> harder. On the other hand, with the current strategy and long-lived
> columns, after a while every column is read and written many
> times just to be left in its original state.
>
> We know that this isn't an easy change, but we will eventually try it
> at home, so your criticism, warnings, and advice are welcome.
>
> Regards.
> --
> //GK
> german.kondolf@gmail.com
> // sites
> http://twitter.com/germanklf
> http://www.facebook.com/germanklf
> http://ar.linkedin.com/in/germankondolf
>



-- 
//GK
german.kondolf@gmail.com
// sites
http://twitter.com/germanklf

Re: Parallel Compaction

Posted by Germán Kondolf <ge...@gmail.com>.
On Fri, Dec 17, 2010 at 11:15 AM, Jonathan Ellis <jb...@gmail.com> wrote:
> On Fri, Dec 17, 2010 at 8:01 AM, Germán Kondolf <ge...@gmail.com> wrote:
>
>> Thanks Jonathan for the feedback.
>>
>> By flush/schema migration, do you mean the SSTables replace lock? I've put
>> that lock in just to be sure; if it's fine with you, I'll remove it.
>> I'll clean up the code according to the code-style article, add the
>> parameter to the configuration with a default of "1", and send it
>> again.
>>
>> Why do you think it's only worth it on SSDs?
>>
>
> Because even a single compaction causes a ton of i/o contention.  99% of the
> time your concern is how to make compaction use _less_ resources, not more.
> :)

We guess that, depending on the scenario, there is room for different
strategies that use fewer resources.

With short-lived keys, a fast parallel compaction together with
CASSANDRA-1074 could mean the node is compacting for only a very
short period of time, and while that is happening the other nodes could
handle the load, provided the compaction takes just seconds.

In another scenario, with long-lived keys, we're thinking that if the
minor compaction compacted only the Bloom filters and indexes, leaving the
SSTables the way they were, we would save the I/O bandwidth currently
spent in the write phase, writing just the Bloom filters and indexes.

The proposed structure of SSTables would change and look like this:
LogicSSTable
     Index
     BloomFilter
     Collection<SSTableOnDisk>

The LogicSSTable contains the index & Bloom filter of the given compacted SSTables.

Reading a column would imply checking the BF, reading the index
(which would indicate not only an offset but also a file), and reading
the corresponding file.

In this way, the minor compaction is just a reading process and not a
write-intensive process.
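
To make the read path concrete, here is a rough Java sketch of the idea (all
the names, LogicSSTable, IndexEntry and SSTableOnDisk.readAt(), are just
illustrative, and a plain Set stands in for a real Bloom filter):

import java.util.List;
import java.util.Map;
import java.util.Set;

class IndexEntry
{
    final int fileId;   // which underlying data file holds the row
    final long offset;  // position of the row inside that file
    IndexEntry(int fileId, long offset) { this.fileId = fileId; this.offset = offset; }
}

interface SSTableOnDisk
{
    byte[] readAt(long offset); // one of the original, untouched data files
}

class LogicSSTable
{
    private final Set<String> bloomFilter;        // merged BF over all files (stand-in)
    private final Map<String, IndexEntry> index;  // merged index: key -> (file, offset)
    private final List<SSTableOnDisk> files;      // data files left exactly as they were

    LogicSSTable(Set<String> bloomFilter, Map<String, IndexEntry> index, List<SSTableOnDisk> files)
    {
        this.bloomFilter = bloomFilter;
        this.index = index;
        this.files = files;
    }

    byte[] read(String key)
    {
        if (!bloomFilter.contains(key))    // 1. merged Bloom filter says "definitely not here"
            return null;
        IndexEntry entry = index.get(key); // 2. merged index gives a file *and* an offset
        if (entry == null)
            return null;                   //    (Bloom filter false positive)
        return files.get(entry.fileId).readAt(entry.offset); // 3. read from the original file
    }
}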

Of course, it depends on the behaviour of the dataset. With
short-lived keys, this latter strategy just makes the major compaction
harder. On the other hand, with the current strategy and long-lived
columns, after a while every column is read and written many
times just to be left in its original state.

We know that this isn't an easy change, but we will eventually try it
at home, so your criticism, warnings, and advice are welcome.

Regards.
-- 
//GK
german.kondolf@gmail.com
// sites
http://twitter.com/germanklf
http://www.facebook.com/germanklf
http://ar.linkedin.com/in/germankondolf

Re: Parallel Compaction

Posted by Jonathan Ellis <jb...@gmail.com>.
On Fri, Dec 17, 2010 at 8:01 AM, Germán Kondolf <ge...@gmail.com> wrote:

> Thanks Jonathan for the feedback.
>
> By flush/schema migration, do you mean the SSTables replace lock? I've put
> that lock in just to be sure; if it's fine with you, I'll remove it.
> I'll clean up the code according to the code-style article, add the
> parameter to the configuration with a default of "1", and send it
> again.
>
> Why do you think it's only worth it on SSDs?
>

Because even a single compaction causes a ton of i/o contention.  99% of the
time your concern is how to make compaction use _less_ resources, not more.
:)

> How do I get this patch committed?
>

Attach it to a ticket on https://issues.apache.org/jira/browse/CASSANDRA.

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com

Re: Parallel Compaction

Posted by Germán Kondolf <ge...@gmail.com>.
Thanks Jonathan for the feedback.

By flush/schema migration, do you mean the SSTables replace lock? I've put
that lock in just to be sure; if it's fine with you, I'll remove it.
I'll clean up the code according to the code-style article, add the
parameter to the configuration with a default of "1", and send it
again.

Why do you think it's only worth it on SSDs?
How do I get this patch committed?

On Fri, Dec 17, 2010 at 2:41 AM, Jonathan Ellis <jb...@gmail.com> wrote:
> Hi Germán,
>
> Thanks for taking a stab at this!
>
> I don't actually think there are going to be any tricky race conditions with
> flush or schema migration; flush has been parallel for a long time itself,
> and we already have the lock in CompactionManager for schema migration.
>
> To clean this up for submission you'd want to follow the style guide at
> http://wiki.apache.org/cassandra/CodeStyle, remove the commented-out sections,
> and add a configuration parameter for how many compactions to allow
> simultaneously (IMO it only really makes sense to have > 1 when you are
> running on SSDs, and there's no good way for us to auto-detect that).
>
> On Thu, Dec 16, 2010 at 6:04 PM, Germán Kondolf <ge...@gmail.com> wrote:
>
>> Hi everybody,
>>
>> I've just finished the first implementation of a Parallel Compaction
>> patch for the trunk version. Tomorrow I'll test it with a high volume
>> of data to see if it works as I expected, but before that I want to
>> validate the approach with you.
>>
>> I know it's kinda naive, but maybe it works as a starting point for a
>> future production implementation, or at least allows the compaction
>> strategy to be made configurable.
>> First of all, I don't know the C* code in depth, so maybe I took a few
>> shortcuts, and that's why I need a second look from an expert...
>>
>> I've modified the doCompaction method of CompactionManager, added a
>> few static classes (I'm working to remove them, so V2 is coming), and
>> simply split the sstables to compact into balanced groups, firing
>> each group's compaction in parallel.
>>
>> The revision I've based the patch on is 1050234.
>> The files are attached: the patch and CompactionManager.java.
>>
>> Thanks in advance, I'd appreciate any feedback.
>>
>> --
>> //GK
>> german.kondolf@gmail.com
>> // sites
>> http://twitter.com/germanklf
>> http://www.facebook.com/germanklf
>> http://ar.linkedin.com/in/germankondolf
>>
>
>
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of Riptano, the source for professional Cassandra support
> http://riptano.com
>



-- 
//GK
german.kondolf@gmail.com
// sites
http://twitter.com/germanklf
http://www.facebook.com/germanklf
http://ar.linkedin.com/in/germankondolf

Re: Parallel Compaction

Posted by Jonathan Ellis <jb...@gmail.com>.
Hi Germán,

Thanks for taking a stab at this!

I don't actually think there are going to be any tricky race conditions with
flush or schema migration; flush has been parallel for a long time itself,
and we already have the lock in CompactionManager for schema migration.

To clean this up for submission you'd want to follow the style guide at
http://wiki.apache.org/cassandra/CodeStyle, remove the commented-out sections,
and add a configuration parameter for how many compactions to allow
simultaneously (IMO it only really makes sense to have > 1 when you are
running on SSDs, and there's no good way for us to auto-detect that).
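
For example, something along these lines (purely a sketch; the property name and
the getConcurrentCompactions() helper are made up, not an existing setting): read
the value once, default it to 1 so the current behaviour is unchanged, and size
the compaction executor from it.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class CompactionConcurrency
{
    // In practice this would come from the storage configuration file; a system
    // property is used here only to keep the sketch self-contained.
    static int getConcurrentCompactions()
    {
        int n = Integer.parseInt(System.getProperty("cassandra.concurrent_compactions", "1"));
        return Math.max(1, n); // default of 1 keeps today's single-compaction behaviour
    }

    public static void main(String[] args)
    {
        int concurrency = getConcurrentCompactions();
        ExecutorService compactionExecutor = Executors.newFixedThreadPool(concurrency);
        System.out.println("compaction executor sized to " + concurrency + " thread(s)");
        compactionExecutor.shutdown();
    }
}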

On Thu, Dec 16, 2010 at 6:04 PM, Germán Kondolf <ge...@gmail.com> wrote:

> Hi everybody,
>
> I've just finished the first implementation of a Parallel Compaction
> patch for the trunk version. Tomorrow I'll test it with a high volume
> of data to see if it works as I expected, but before that I want to
> validate the approach with you.
>
> I know it's kinda naive, but maybe it works as a starting point for a
> future production implementation, or at least allows the compaction
> strategy to be made configurable.
> First of all, I don't know the C* code in depth, so maybe I took a few
> shortcuts, and that's why I need a second look from an expert...
>
> I've modified the doCompaction method of CompactionManager, added a
> few static classes (I'm working to remove them, so V2 is coming), and
> simply split the sstables to compact into balanced groups, firing
> each group's compaction in parallel.
>
> The revision I've based the patch on is 1050234.
> The files are attached: the patch and CompactionManager.java.
>
> Thanks in advance, I'd appreciate any feedback.
>
> --
> //GK
> german.kondolf@gmail.com
> // sites
> http://twitter.com/germanklf
> http://www.facebook.com/germanklf
> http://ar.linkedin.com/in/germankondolf
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com