Posted to user@cassandra.apache.org by Carlos Alvarez <cb...@gmail.com> on 2010/11/25 19:32:44 UTC

Capacity problem with a lot of writes?

Hello All.


I am facing a (capacity?) problem in my eight-node cluster running 0.6.2
patched with CASSANDRA-1014 and CASSANDRA-699.


I have 200 writes/second at peak (on each node, taking into account
replication), with a row size of 35 KB. I configured the memtable size to 1 GB,
the biggest size my heap seems to tolerate. With this configuration, the
cluster is writing an sstable every 5 minutes.

When I decrease the memtable size I run into a minor compaction storm.
However, the 1 GB memtable forces me to have a huge heap and leaves me living
on the edge of memory.

The question is:

- Do I need more power in order to write less than 1 GB every five minutes?
- Does anyone have experience with heaps of more than 8 GB?
- Are the standard minor compaction thresholds usually enough for high
loads? Or are they something I am supposed to tune? (Rough sketch of my
understanding of the thresholds below.)
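
(For reference, my understanding of what those thresholds control, sketched
in simplified Java with invented names -- definitely not the actual 0.6 code,
so please correct me if I have it wrong: sstables of similar size are grouped
into buckets, a bucket becomes a minor compaction candidate once it holds at
least 4 files, and at most 32 are merged in one go.)

import java.util.ArrayList;
import java.util.List;

public class CompactionBucketSketch {
    static final int MIN_THRESHOLD = 4;   // default minimum sstables per minor compaction
    static final int MAX_THRESHOLD = 32;  // default maximum sstables merged at once

    // Group sstables of similar size into buckets.
    static List<List<Long>> buckets(List<Long> sstableSizes) {
        List<List<Long>> buckets = new ArrayList<>();
        List<Long> averages = new ArrayList<>();
        for (long size : sstableSizes) {
            boolean placed = false;
            for (int i = 0; i < buckets.size(); i++) {
                long avg = averages.get(i);
                // "similar size" taken here as within 0.5x..1.5x of the bucket average
                if (size > avg / 2 && size < avg * 3 / 2) {
                    buckets.get(i).add(size);
                    averages.set(i, (avg + size) / 2);
                    placed = true;
                    break;
                }
            }
            if (!placed) {
                List<Long> fresh = new ArrayList<>();
                fresh.add(size);
                buckets.add(fresh);
                averages.add(size);
            }
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<Long> sizesInMb = List.of(200L, 210L, 190L, 205L, 5000L);
        for (List<Long> bucket : buckets(sizesInMb)) {
            if (bucket.size() >= MIN_THRESHOLD) {
                int toCompact = Math.min(bucket.size(), MAX_THRESHOLD);
                System.out.println("would minor-compact " + toCompact + " sstables: " + bucket);
            }
        }
    }
}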


Thanks!
Carlos.



-- 
Perhaps there was an error in the spelling. Or in the articulation of the Sacred Name.

Re: Capacity problem with a lot of writes?

Posted by Carlos Alvarez <cb...@gmail.com>.
On Thu, Nov 25, 2010 at 9:09 PM, Peter Schuller
<pe...@infidyne.com> wrote:
>> My total data size is 2TB
>
> For the entire cluster? (I realized I was being ambiguous, I meant per node.)
Yes, for the entire cluster.

Regarding multithreaded compaction, I'll look into the code and do
my best. I think that compaction is a limiting factor in my cluster;
with faster compaction I guess I could support more operations with
the same hardware (having fewer sstables and using less memory to hold
huge memtables).



Carlos.

Re: Capacity problem with a lot of writes?

Posted by Peter Schuller <pe...@infidyne.com>.
>> When you say that it grows constantly, does that mean up to 30 or even
>> farther?
>
> My total data size is 2TB

For the entire cluster? (I realized I was being ambiguous, I meant per node.)

> Actually, I never see the count stable. When it reached 30 I thought
> "I am reaching the default upper limit for a compaction, something
> went wrong" and I went back to 1 GB memtables (also, I saw bigger read
> latencies).

Assuming 2 TB is for the entire cluster and you have 8 nodes, that's
250 GB per node. So yeah, that'll take a fair amount of time to compact
when larger or major compactions take place.

> Well, I think you are right: I am CPU bound on compaction, because
> during compactions I see a single JVM thread that is in the running state
> almost all the time, and the disk is not used beyond 50%.

Sounds like it.

> I think that a partial solution would help: if the compaction
> compacted to 'n' different new sstables (not one), the implementation
> would be easier. I mean, the compaction would compact, for instance,
> 10 sstables down to 2 (2 being the level of parallelism). In this way, the
> sstable count would eventually remain stable (although higher). What
> do you think?

Yes, that's been my thinking. Even for a single huge column family, it
doesn't really matter that individual large compactions take longer as
long as smaller compactions keep happening concurrently.

It's worth noting though that merely supporting concurrent compactions
in the sense of spawning more than one thread to do it is only a
partial solution; with sufficiently large column families you end up
having to keep several (not just a couple) compactions going in order
to ensure that large and medium compactions are run while at the same
time allowing the smallest compactions to take place. For one thing
you probably don't want to commit too many cores to compactions, and
in addition you can run into issues with I/O becoming seek bound if
you have too many processes trying to stream data concurrently.

I think the optimal solution for this would involve something along
the lines of having a fixed, configurable compaction concurrency
(number of threads), and then having those threads interleave an
arbitrary number of compactions (do 1 GB, switch to the other
compaction, do 1 GB, etc.). That way you have direct control over
actual CPU and I/O concurrency, and can independently make sure that
compactions happen in proper prioritized order, constantly keeping
sstable counts within appropriate bounds.
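
Something like the bare-bones sketch below (made-up names, not a patch against
Cassandra; prioritization, throttling and error handling are all left out): a
fixed pool of worker threads round-robins over however many pending
compactions exist, doing a bounded chunk of work on each before moving on, so
small compactions are never starved by huge ones.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class InterleavedCompactorSketch {
    static final long CHUNK_BYTES = 1L << 30; // do ~1 GB of one compaction, then switch

    /** Stands in for a partially-completed compaction that can do bounded work. */
    interface CompactionTask {
        /** Compact up to maxBytes more; return true when the task is finished. */
        boolean compactSome(long maxBytes);
    }

    private final BlockingQueue<CompactionTask> pending = new LinkedBlockingQueue<>();

    public void submit(CompactionTask task) {
        pending.add(task);
    }

    /** Start a fixed number of compaction threads: the configurable concurrency. */
    public void start(int threads) {
        for (int i = 0; i < threads; i++) {
            Thread worker = new Thread(() -> {
                while (true) {
                    try {
                        CompactionTask task = pending.take();
                        boolean finished = task.compactSome(CHUNK_BYTES);
                        if (!finished) {
                            // Not done yet: requeue so other compactions get a turn.
                            pending.add(task);
                        }
                    } catch (InterruptedException e) {
                        return; // shut down this worker
                    }
                }
            }, "compaction-worker-" + i);
            worker.setDaemon(true);
            worker.start();
        }
    }
}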

(For comparison, the problem is very similar to PostgreSQL and
auto-vacuuming of databases containing tables with extreme differences
in size. Allowing multiple vacuum processes helps solve this, but it
has the equivalent scalability issues if you have a particularly
problematic distribution of table sizes and workload.)

-- 
/ Peter Schuller

Re: Capacity problem with a lot of writes?

Posted by Carlos Alvarez <cb...@gmail.com>.
On Fri, Nov 26, 2010 at 1:34 PM, Edward Capriolo <ed...@gmail.com> wrote:
> I believe there is room for other compaction models. I am interested
> in systems that can optimize the case with multiple data directories,
> for example. It seems from my experiment that a major compaction cannot
> fully utilize the hardware in specific conditions. That said, knowing
> which models to use where, and how to automatically select the optimal
> strategy, are interesting concerns.

Thank you for sharing your technique.

I also think that a different model of compaction could be useful, especially
in situations where the normal, nice compaction (the one that yields so that
reads/writes can occur) is not enough to recover, or when you have small
windows of low activity with a lot of unused resources in the cluster.


Carlos.

Re: Capacity problem with a lot of writes?

Posted by Edward Capriolo <ed...@gmail.com>.
On Fri, Nov 26, 2010 at 10:49 AM, Peter Schuller
<pe...@infidyne.com> wrote:
>> Making compaction parallel isn't a priority because the problem is
>> almost always the opposite: how do we spread it out over a longer
>> period of time instead of sharp spikes of activity that hurt
>> read/write latency.  I'd be very surprised if latency would be
>> acceptable if you did have parallel compaction.  In other words, your
>> real problem is you need more capacity for your workload.
>
> Do you expect this to be true even with the I/O situation improved
> (i.e., under conditions where the additional I/O is not a problem)? It
> seems counter-intuitive to me that single-core compaction would make a
> huge impact on latency when compaction is CPU bound on an 8+ core
> system under moderate load (even taking into account cache
> coherency/NUMA etc).
>
> --
> / Peter Schuller
>

Carlos,

I wanted to mention a specific technique I used to solve a situation I
ran into. We had a large influx of data that pushed the limits of our
current hardware; as stated above, the true answer was more hardware.
However, we ran into a situation where a single node failed several large
compactions. After failing 2 or 3 big compactions we ended up with ~1000
SSTables for a column family.

This turned into a chicken-and-egg situation: reads were slow because
there were many sstables and extra data like tombstones, but compaction
was brutally slow because of the read/write traffic.

My solution was to create a side-by-side install on the same box: I
used different data directories and different ports
(/var/lib/cassandra/compact, 9168, etc.), moved the data to the new install,
and started it up. Then I ran nodetool compact on the new instance.
This node was seeing no read or write traffic.

I was surprised to see the machine at 400% (out of 1600%) CPU and not
much io-wait. Compacting 600 GB of small SSTables took about 4 days.
(However, when the sstables are larger, I have compacted 400 GB in 4 hours
on the same hardware.)
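
(To put rough numbers on it: 600 GB in 4 days is roughly 1.7 MB/s of
compaction throughput, while 400 GB in 4 hours is roughly 28 MB/s -- better
than an order of magnitude difference on the same box.)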

After that I moved the data files back in place and brought the node
back into the cluster. I have lived on both sides of the fence, where I
want either long, slow compactions or breakneck-fast ones.

I believe there is room for other compaction models. I am interested
in systems that can optimize the case with multiple data directories,
for example. It seems from my experiment that a major compaction cannot
fully utilize the hardware in specific conditions. That said, knowing
which models to use where, and how to automatically select the optimal
strategy, are interesting concerns.

Re: Capacity problem with a lot of writes?

Posted by Peter Schuller <pe...@infidyne.com>.
> Making compaction parallel isn't a priority because the problem is
> almost always the opposite: how do we spread it out over a longer
> period of time instead of sharp spikes of activity that hurt
> read/write latency.  I'd be very surprised if latency would be
> acceptable if you did have parallel compaction.  In other words, your
> real problem is you need more capacity for your workload.

Do you expect this to be true even with the I/O situation improved
(i.e., under conditions where the additional I/O is not a problem)? It
seems counter-intuitive to me that single-core compaction would make a
huge impact on latency when compaction is CPU bound on an 8+ core
system under moderate load (even taking into account cache
coherency/NUMA etc).

-- 
/ Peter Schuller

Re: Capacity problem with a lot of writes?

Posted by Jonathan Ellis <jb...@gmail.com>.
Making compaction parallel isn't a priority because the problem is
almost always the opposite: how do we spread it out over a longer
period of time instead of sharp spikes of activity that hurt
read/write latency.  I'd be very surprised if latency would be
acceptable if you did have parallel compaction.  In other words, your
real problem is you need more capacity for your workload.

On Thu, Nov 25, 2010 at 5:18 PM, Carlos Alvarez <cb...@gmail.com> wrote:
>> When you say that it grows constantly, does that mean up to 30 or even
>> farther?
>
> My total data size is 2TB
>
> Actually, I never see the count stable. When it reached 30 I thought
> "I am reaching the default upper limit for a compaction, something
> went wrong" and I went back to 1 GB memtables (also, I saw bigger read
> latencies).
>
> Well, I think you are right: I am CPU bound on compaction, because
> during compactions I see a single JVM thread that is in the running state
> almost all the time, and the disk is not used beyond 50%.
>
>
>> (A nice future improvement would be to allow for concurrent compaction
>> so that Cassandra would be able to utilize multiple CPU cores which
>> may mitigate this if you have left-over CPU. However, this is not
>> currently supported.)
>
> Yes, sure. I'd be happy to test, but I don't dare to alter the code :-)
>
> I think that a partial solution would help: if the compaction
> compacted to 'n' different new sstables (not one), the implementation
> would be easier. I mean, the compaction would compact, for instance,
> 10 sstables down to 2 (2 being the level of parallelism). In this way, the
> sstable count would eventually remain stable (although higher). What
> do you think?
>
>
> Carlos.
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com

Re: Capacity problem with a lot of writes?

Posted by Carlos Alvarez <cb...@gmail.com>.
> When you say that it grows constantly, does that mean up to 30 or even
> farther?

My total data size is 2TB

Actually, I never see the count stable. When it reached 30 I thought
"I am reaching the default upper limit for a compaction, something
went wrong" and I went back to 1 GB memtables (also, I saw bigger read
latencies).

Well, I think you are right: I am CPU bound on compaction, because
during compactions I see a single JVM thread that is in the running state
almost all the time, and the disk is not used beyond 50%.


> (A nice future improvement would be to allow for concurrent compaction
> so that Cassandra would be able to utilize multiple CPU cores which
> may mitigate this if you have left-over CPU. However, this is not
> currently supported.)

Yes, sure. I'd be happy to test, but I don't dare to alter the code :-)

I think that a partial solution would help: if the compaction
compacted to 'n' different new sstables (not one), the implementation
would be easier. I mean, the compaction would compact, for instance,
10 sstables down to 2 (2 being the level of parallelism). In this way, the
sstable count would eventually remain stable (although higher). What
do you think?
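
To make the idea concrete, a toy sketch in Java (invented names, nothing to do
with the real SSTable code): instead of merging all inputs into a single
output, each merged row is routed to one of k outputs by its key. A real
version would split by token range so that the k merges could actually run in
parallel, and the steady-state sstable count would stay bounded at roughly k
per round.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SplitCompactionSketch {

    /** Merge the input "sstables" (sorted key->row maps) into k partitioned outputs. */
    static List<TreeMap<String, String>> compact(List<TreeMap<String, String>> inputs, int k) {
        List<TreeMap<String, String>> outputs = new ArrayList<>();
        for (int i = 0; i < k; i++) {
            outputs.add(new TreeMap<>());
        }

        // Merge step: later inputs win, standing in for "newest column wins".
        TreeMap<String, String> merged = new TreeMap<>();
        for (TreeMap<String, String> sstable : inputs) {
            merged.putAll(sstable);
        }

        // Routing step: send each row to one of the k outputs deterministically.
        for (Map.Entry<String, String> row : merged.entrySet()) {
            int bucket = Math.floorMod(row.getKey().hashCode(), k);
            outputs.get(bucket).put(row.getKey(), row.getValue());
        }
        return outputs;
    }

    public static void main(String[] args) {
        TreeMap<String, String> a = new TreeMap<>(Map.of("row1", "v1", "row2", "v2"));
        TreeMap<String, String> b = new TreeMap<>(Map.of("row2", "v2-newer", "row3", "v3"));
        // Ten-to-two in the text above; two-to-two here just to show the shape.
        System.out.println(compact(List.of(a, b), 2));
    }
}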


Carlos.

Re: Capacity problem with a lot of writes?

Posted by Peter Schuller <pe...@infidyne.com>.
> However, the point has to do with the fact Peter mentions. With smaller
> memtables I see that minor compaction is unable to keep up with the writes.
> The number of sstables grows constantly during my peak hours. With 400 MB
> memtables the cluster is always compacting and the number of sstables grows
> constantly.
> I don't see that the cluster is I/O bound even with compaction (disk
> utilization is below 60% during compactions), but I think that a large
> number of sstables affects my read latency. Now I have 5-7 sstables during
> the peak hours, and when I tried smaller memtables I saw 30 sstables
> (and then I got scared and rolled back the change).

When you say that it grows constantly, does that mean up to 30 or even
farther? Because it is expected that smaller sstables will give you
higher sstable count spikes. Only one compaction runs at a time, and
as larger compactions run, they will take some amount of time. Given
that amount of time, with smaller memtable sizes the number of
sstables that have time to be flushed in the mean time is higher.
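
(To put rough numbers on it: if a big compaction takes, say, an hour, and a
400 MB memtable fills roughly every couple of minutes at your peak write
rate, a few dozen new sstables can pile up before that compaction finishes,
even though compaction is not actually falling behind over the long run.)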

So a higher sstable count spike is not necessarily indicative that
you're not keeping up, unless it just grows and grows indefinitely.
But you're right that sstable count will affect the seek overhead of
reads.

What is your total data size? (Affects the maximum work necessary for
the biggest compaction jobs.)

With respect to your disk utilization: I assume your ~35 KB rows are
made up of several smaller columns? (If not, I would expect compaction
to be disk bound rather than CPU bound, at least assuming you're not
running with a very fast I/O device.)

In any case: if indeed you are in a position where the sstable counts
are not just the result of large compactions allowing several memtable
flushes to happen in the meantime, and you are in fact not keeping up
with writes due to being CPU bound, then yeah - basically that means you
need more capacity to handle the load (unless you can re-model the data
to be less CPU heavy in Cassandra, but that seems like the wrong way to
go in most cases).

Given your 200 writes/second, assuming they are full rows of 35 KB, that
implies about 7 MB/second of writes. Given small enough column values,
it seems plausible that you'd be CPU bound on compaction (hand-wavy gut
feeling on my part).

(A nice future improvement would be to allow for concurrent compaction
so that Cassandra would be able to utilize multiple CPU cores which
may mitigate this if you have left-over CPU. However, this is not
currently supported.)

-- 
/ Peter Schuller

Re: Capacity problem with a lot of writes?

Posted by Carlos Alvarez <cb...@gmail.com>.
Thank you very much to both of you, Jonathan and Peter.

I will upgrade.

However, the point has to do with the fact Peter mentions. With smaller
memtables I see that minor compaction is unable to keep up with the writes.
The number of sstables grows constantly during my peak hours. With 400 MB
memtables the cluster is always compacting and the number of sstables grows
constantly.

I don't see that the cluster is I/O bound even with compaction (disk
utilization is below 60% during compactions), but I think that a large
number of sstables affects my read latency. Now I have 5-7 sstables during
the peak hours, and when I tried smaller memtables I saw 30 sstables
(and then I got scared and rolled back the change).



Carlos.

Re: Capacity problem with a lot of writes?

Posted by Peter Schuller <pe...@infidyne.com>.
> When I decrease the memtable size I run into a minor compaction storm.

By minor compaction storm, do you mean that compaction (whenever it
runs) degrades your node's performance too much and smaller sstables mean
this happens more often, or do you mean that compaction is not
keeping up with writes at all over time?

-- 
/ Peter Schuller

Re: Capacity problem with a lot of writes?

Posted by Jonathan Ellis <jb...@gmail.com>.
You want your memtables "as large as is reasonable, but not too
large."  Sounds like yours are too large.  As a first step, I would
strongly recommend upgrading to 0.6.8 and reducing the compaction
priority: http://www.riptano.com/blog/cassandra-annotated-changelog-063

On Thu, Nov 25, 2010 at 12:32 PM, Carlos Alvarez <cb...@gmail.com> wrote:
> Hello All.
>
> I am facing a (capacity?) problem in my eight-node cluster running 0.6.2
> patched with CASSANDRA-1014 and CASSANDRA-699.
>
> I have 200 writes/second at peak (on each node, taking into account
> replication), with a row size of 35 KB. I configured the memtable size to 1 GB,
> the biggest size my heap seems to tolerate. With this configuration, the
> cluster is writing an sstable every 5 minutes.
> When I decrease the memtable size I run into a minor compaction storm.
> However, the 1 GB memtable forces me to have a huge heap and leaves me living
> on the edge of memory.
> The question is:
> - Do I need more power in order to write less than 1 GB every five minutes?
> - Does anyone have experience with heaps of more than 8 GB?
> - Are the standard minor compaction thresholds usually enough for high
> loads? Or are they something I am supposed to tune?
>
> Thanks!
> Carlos.
>
>
> --
> Perhaps there was an error in the spelling. Or in the articulation of the Sacred Name.
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com