Posted to user@cassandra.apache.org by Schubert Zhang <zs...@gmail.com> on 2010/07/18 09:34:33 UTC

Re: GC Storm

Benjamin,

It is not difficult to stack thousands of SSTables.
In a heavy insert load (many client threads), the memtable flush (generating a
new sstable) is frequent (e.g. one every 30s).


On Mon, Jun 14, 2010 at 2:03 AM, Benjamin Black <b...@b3k.us> wrote:

> On Sat, Jun 12, 2010 at 7:46 PM, Anty <an...@gmail.com> wrote:
> > Hi all,
> > I have a 10-node cluster. After inserting many records into the cluster,
> > I compact each node by nodetool compact.
> > During the compaction process, something went wrong with one of the 10
> > nodes: when the size of the compacted temp file reached nearly 100GB
> > (before compaction, the size was ~240GB)
>
> Compaction is not compression, it is merging of SSTables and tombstone
> elimination.  If you are not doing many deletes or overwrites of
> existing data, the compacted SSTable will be about the same size as
> the total size of all the smaller SSTables that went into it.  It is
> not clear to me how you ended up with 5000 SSTables (the *-data.db
> files) of such small size if you have not disabled minor compactions.
>
> Can you post your storage-conf.xml someplace (pastie or
> gist.github.com, for example)?
>
>
> b
>

Re: GC Storm

Posted by Peter Schuller <pe...@infidyne.com>.
(adding dev@)

> (2) Can we implement multi-thread compaction?

I think this is the only way to scale. Or at least to implement
concurrent compaction (whether it is by division into threads or not)
of multiple size classes. As long as the worst-case compactions are
significantly slower than best-case compactions, you will presumably
have the problem of lots of sstables accumulating during long
compactions. Since having few sstables is part of the design goal (or
so I have assumed; otherwise you will seek too much on disk when doing
e.g. a range query), triggering situations where this is not the case
is a performance problem for readers.

I've been thinking about this for a bit, and maybe there could be one
tweakable configuration setting for the desired machine concurrency,
which the user tweaks to make compaction fast enough relative to
incoming writes. Regardless of database size, this is necessary
whenever Cassandra can take writes faster than a CPU-bound compaction
thread can process them.
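
No such setting exists today; purely as a straw-man (the element name
below is invented, not a real option), it could be a single
storage-conf.xml knob:

```xml
<!-- hypothetical: upper bound on simultaneously runnable compaction threads -->
<CompactionConcurrency>2</CompactionConcurrency>
```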

The other thing would be to have an intelligent compaction scheduler
that does something along the lines of scheduling a compaction thread
for every "level" of compaction (i.e., one for log_m(n) = 1, one for
log_m(n) = 2, etc.). To avoid inefficiency and huge spikes in CPU
usage, these compaction threads could stop every now and then
(something reasonable; say every 100 MB compacted) and yield to other
compaction threads.

This way:

(a) a limited number of threads will be actively runnable at any given
moment, allowing the user to limit the effect background compaction
can have on CPU usage
(b) on the other hand, it also means that more than one CPU can be
used; whatever is appropriate for the cluster
(c) it should be reasonably easy to implement, because each compaction
is just a regular thread doing what it does now already
(d) the synchronization overhead between compaction threads should be
completely irrelevant as long as one selects a high enough
synchronization threshold (100 MB was just a suggestion; it might be
1 GB)
(e) log_m(n) will never be large enough for one thread per "level" to
be a scaling problem
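
To make the idea concrete, here is a rough sketch (class and method
names are invented, not from any Cassandra patch): one worker per
compaction level, a fair semaphore capping how many are runnable at
once, and a yield point every ~100 MB of compacted data.

```java
import java.util.concurrent.Semaphore;

// Rough sketch only.  One worker per compaction "level"; a fair
// semaphore caps how many are runnable at once, and each worker
// gives up its CPU slot after every YIELD_BYTES of compacted data.
public class LevelCompactionSketch {
    static final long YIELD_BYTES = 100L * 1024 * 1024; // ~100 MB between yields

    private final Semaphore cpuSlots; // limits concurrently running compactions

    public LevelCompactionSketch(int desiredConcurrency) {
        // fair = waiting levels get slots roughly in arrival order
        this.cpuSlots = new Semaphore(desiredConcurrency, true);
    }

    // Simulates compacting totalBytes for one level, releasing the CPU
    // slot at each yield point.  Returns the number of yield points hit.
    public int compactLevel(long totalBytes) {
        int yields = 0;
        long done = 0;
        while (done < totalBytes) {
            cpuSlots.acquireUninterruptibly();
            try {
                // a real worker would merge sstable rows here;
                // we just account for the bytes of one work slice
                done += Math.min(YIELD_BYTES, totalBytes - done);
            } finally {
                cpuSlots.release(); // yield so other levels can run
                yields++;
            }
        }
        return yields;
    }

    public static void main(String[] args) {
        LevelCompactionSketch sketch = new LevelCompactionSketch(2);
        // a 1000 MB compaction with a 100 MB threshold hits 10 yield points
        int yields = sketch.compactLevel(10 * YIELD_BYTES);
        System.out.println("yield points: " + yields); // prints 10
    }
}
```

With a fair semaphore, a short small-sstable compaction queued behind a
multi-hour one still gets a slot at the next yield point, which is
exactly the starvation this proposal is trying to avoid.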

Thoughts?

-- 
/ Peter Schuller


Re: GC Storm

Posted by Schubert Zhang <zs...@gmail.com>.
Agreed with Peter Schuller.

On Sun, Jul 18, 2010 at 8:40 PM, Jonathan Ellis <jb...@gmail.com> wrote:

> On Sun, Jul 18, 2010 at 2:45 AM, Schubert Zhang <zs...@gmail.com> wrote:
> > In a heavy insert load (many client threads), the memtable flush
> > (generating a new sstable) is frequent (e.g. one every 30s).
>
> This is a sign you should increase your memtable thresholds, btw.  If
> you wrote out larger sstables, there would be less duplicate i/o
> reading them back in to compact them back out.
>
>
Yes, I usually set a much bigger memtable size, such as 512MB or 1GB, but I
think this should just be a temporary solution.


> > Question:
> > (1) Can we modify the compaction policy to compact the smaller sstables
> > with higher priority, even when a larger compaction is running?
> > (2) Can we implement multi-thread compaction?
>
> Isn't (1) a subset of (2)?
>

Maybe yes; it depends on the implementation.


>
> There is a ticket open for (2) at
> http://issues.apache.org/jira/browse/CASSANDRA-1187
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of Riptano, the source for professional Cassandra support
> http://riptano.com
>

Re: GC Storm

Posted by Jonathan Ellis <jb...@gmail.com>.
On Sun, Jul 18, 2010 at 2:45 AM, Schubert Zhang <zs...@gmail.com> wrote:
> In a heavy insert load (many client threads), the memtable flush (generating
> a new sstable) is frequent (e.g. one every 30s).

This is a sign you should increase your memtable thresholds, btw.  If
you wrote out larger sstables, there would be less duplicate i/o
reading them back in to compact them back out.
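
For reference, if I recall the 0.6-era storage-conf.xml element names
correctly, the relevant thresholds look like the following (the values
are only illustrative, not recommendations):

```xml
<!-- larger values = bigger, less frequent sstable flushes -->
<MemtableThroughputInMB>512</MemtableThroughputInMB>
<MemtableOperationsInMillions>1.0</MemtableOperationsInMillions>
```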

> Question:
> (1) Can we modify the compaction policy to compact the smaller sstables
> with higher priority, even when a larger compaction is running?
> (2) Can we implement multi-thread compaction?

Isn't (1) a subset of (2)?

There is a ticket open for (2) at
http://issues.apache.org/jira/browse/CASSANDRA-1187

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com

Re: GC Storm

Posted by Schubert Zhang <zs...@gmail.com>.
Benjamin and Jonathan,

It is not difficult to stack thousands of small SSTables.

In a heavy insert load (many client threads), the memtable flush (generating a
new sstable) is frequent (e.g. one every 30s).

Compaction runs in only a single thread and is CPU bound. Consider that the
compactionManager is compacting 10 sstables (600GB in total), and this
compaction takes 10 hours. Then, during this compaction, 1200 new small
sstables are generated.
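
A quick sanity check of those numbers (nothing Cassandra-specific, just
the arithmetic):

```java
public class FlushBacklog {
    // sstables generated while one long compaction runs, assuming a
    // steady memtable flush rate
    static long sstablesDuringCompaction(long compactionSeconds, long flushIntervalSeconds) {
        return compactionSeconds / flushIntervalSeconds;
    }

    public static void main(String[] args) {
        long compactionSeconds = 10 * 3600; // a 10-hour compaction
        long flushInterval = 30;            // one memtable flush every 30 s
        // 36000 s / 30 s = 1200 new small sstables stacked up
        System.out.println(sstablesDuringCompaction(compactionSeconds, flushInterval)); // prints 1200
    }
}
```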

Question:
(1) Can we modify the compaction policy to compact the smaller sstables
with higher priority, even when a larger compaction is running?
(2) Can we implement multi-thread compaction?

Schubert

On Sun, Jul 18, 2010 at 3:34 PM, Schubert Zhang <zs...@gmail.com> wrote:

> Benjamin,
>
> It is not difficult to stack thousands of SSTables.
> In a heavy insert load (many client threads), the memtable flush (generating
> a new sstable) is frequent (e.g. one every 30s).
>
>
> On Mon, Jun 14, 2010 at 2:03 AM, Benjamin Black <b...@b3k.us> wrote:
>
>> On Sat, Jun 12, 2010 at 7:46 PM, Anty <an...@gmail.com> wrote:
>> > Hi all,
>> > I have a 10-node cluster. After inserting many records into the cluster,
>> > I compact each node by nodetool compact.
>> > During the compaction process, something went wrong with one of the 10
>> > nodes: when the size of the compacted temp file reached nearly 100GB
>> > (before compaction, the size was ~240GB)
>>
>> Compaction is not compression, it is merging of SSTables and tombstone
>> elimination.  If you are not doing many deletes or overwrites of
>> existing data, the compacted SSTable will be about the same size as
>> the total size of all the smaller SSTables that went into it.  It is
>> not clear to me how you ended up with 5000 SSTables (the *-data.db
>> files) of such small size if you have not disabled minor compactions.
>>
>> Can you post your storage-conf.xml someplace (pastie or
>> gist.github.com, for example)?
>>
>>
>> b
>>
>
>