Posted to user@cassandra.apache.org by tsuraan <ts...@gmail.com> on 2010/07/26 18:23:52 UTC

Cassandra behaviour

I have a system where we're currently using Postgres for all our data
storage needs, but on a large table the index checks for primary keys
are really slowing us down on insert.  Cassandra sounds like a good
alternative (not saying postgres and cassandra are equivalent; just
that I think they are both reasonable fits for our particular
product), so I tried running the py_stress tool on a recent repos
checkout.  I'm using code that's recent enough that it doesn't pay
attention to the keyspace definitions in cassandra.yaml, so the cache
settings are just whatever py_stress defined when it made the keyspace
it uses.  I didn't change anything in
cassandra.yaml, but I did change cassandra.in.sh to use 2G of RAM
rather than 1G.  I then ran "python stress.py -o insert -n 1000000000"
(that's one billion).  I left for a day, and when I came back
cassandra had run out of RAM, and stress.py had crashed at somewhere
around 120,000,000 inserts.  This brings up a few questions:

- is Cassandra's RAM use proportional to the number of values that
it's storing?  I know that it uses bloom filters for preventing
lookups of non-existent keys, but since bloom filters are designed to
give an accuracy/space tradeoff, Cassandra should sacrifice accuracy
in order to prevent crashes, if it's just bloom filters that are using
all the RAM

- When I start Cassandra again, it appears to go into an eternal
read/write loop, using between 45% and 90% of my CPU.  It says it's
compacting tables, but it's been doing that for hours, and it only has
70GB of data stored.  How can cassandra be run on huge datasets, when
70GB appears to take forever to compact?

I assume I'm doing something wrong, but I don't see a ton of tunables
to play with.  Can anybody give me advice on how to make cassandra
keep running under a high insert load?

Re: Cassandra behaviour

Posted by Peter Schuller <pe...@infidyne.com>.
> So userspace throttling is probably the answer?

I believe so.

>  Is the normal way of
> doing this to go through the JMX interface from a userspace program,
> and hold off on inserts until the values fall below a given threshold?
>  If so, that's going to be a pain, since most of my system is
> currently using python :)

I don't know what the normal way is or what people have done with
cassandra in production.

What I have tended to do personally and in general (not cassandra
specific) is to do domain-specific rate limiting as required whenever
I do batch jobs / bulk reads/writes. Regardless of whether your
database is cassandra, postgresql or anything else - throwing writes
(or reads for that matter) at the database at maximum possible speed
tends to have adverse effects on latency on other normal traffic. Only
during offline batch operations where latency of other traffic is
irrelevant, do I ever go "all in" and throw traffic at a database at
full speed.

That said, often a simple measure like "write with a single sequential
writer, subject to the RTT of RPC requests" is sufficient to rate limit
pretty well in practice. But of course that depends on the nature of
the writes and how expensive they are relative to RTT and/or RPC.
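
To illustrate, here is a tiny (untested) Python sketch of why one
synchronous writer is self-throttling; client.insert() is just a
stand-in for whatever Thrift client call you actually use:

import time

def write_sequentially(client, rows):
    # One synchronous writer: each insert waits out a full network round
    # trip, so the write rate is bounded by roughly 1/RTT no matter how
    # fast the client can generate data.
    count = 0
    start = time.time()
    for key, columns in rows:
        client.insert(key, columns)   # blocks until the RPC returns
        count += 1
    elapsed = max(time.time() - start, 1e-9)
    print("%d rows in %.1fs (~%.0f rows/s)" % (count, elapsed, count / elapsed))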

FWIW, whenever I have needed a hard "maximum of X per second" rate
limit I have implemented or re-used a rate limiter (e.g. a token
bucket) for the language in question and used it in my client code.
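
For example, a minimal token bucket in Python might look like the
following (untested sketch; the rate, capacity and client.insert() call
are made up, and it assumes a single-threaded client -- add locking if
you write from several threads):

import time

class TokenBucket:
    def __init__(self, rate, capacity):
        self.rate = float(rate)          # tokens added per second
        self.capacity = float(capacity)  # maximum burst size
        self.tokens = float(capacity)
        self.last = time.time()

    def acquire(self, tokens=1):
        # Block until enough tokens are available, then consume them.
        while True:
            now = time.time()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= tokens:
                self.tokens -= tokens
                return
            # Sleep just long enough for the deficit to refill.
            time.sleep((tokens - self.tokens) / self.rate)

# Cap inserts at ~5000/s no matter how fast the client generates them:
# bucket = TokenBucket(rate=5000, capacity=5000)
# for key, columns in rows:
#     bucket.acquire()
#     client.insert(key, columns)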

-- 
/ Peter Schuller

Re: Cassandra behaviour

Posted by tsuraan <ts...@gmail.com>.
> It's reading through keys in the index and adding offset information
> about roughly every 128th entry in RAM, in order to speed up reads.
> Performing a binary search in an sstable from scratch would be
> expensive. Because of the high cost of disk seeks, most storage
> systems use btrees with a high branching factor to keep the number of
> seeks low. In cassandra there is instead binary searching (owing to
> the fact that sstables are sorted on disk), but pre-seeded with the
> information gained from index sampling to keep the number of seeks
> bounded even in the face of very large sstables.

That makes sense.  Lucene does the same thing, although it has a
parameter on the IndexReader "open" function that lets you specify how
many terms to skip.  For huge indices on limited machines, that has
been an occasional lifesaver :)

> Those settings only directly affect, as far as I know, the interaction
> with the commit log. Now, if your system is truly disk bound rather
> than CPU bound on compaction, writes to the commit log will indeed
> have the capability to effectively throttle the write speed. In such a
> case I would expect more frequent fsync() calls to the commit log to
> throttle writes to a higher degree than they would if the commit log
> was just periodically fsync()ed in the background once per minute;
> however I would not use this as the means to throttle writes.
>
> The other thing which may happen is that memtables aren't flushed fast
> enough to keep up with writes. I don't remember whether or not there
> was already a fix for this; I think there is, at least in trunk.
> Previously you could trigger an out-of-memory condition by writing
> faster than memtable flushing was happening.
>
> However even if that is fixed (again, I'm not sure), I'm pretty sure
> there is still no mechanism to throttle based on background
> compaction. It's not entirely trivial to do in a sensible fashion
> given how extremely asynchronous compaction is with respect to writes.

So userspace throttling is probably the answer?  Is the normal way of
doing this to go through the JMX interface from a userspace program,
and hold off on inserts until the values fall below a given threshold?
 If so, that's going to be a pain, since most of my system is
currently using python :)

Re: Cassandra behaviour

Posted by Peter Schuller <pe...@infidyne.com>.
> be the most portable thing to do.  I had been thinking that the bloom
> filters were created on startup, but further reading of the docs
> indicates that they are in the SSTable Index.  What is cassandra
> doing, then, when it's printing out that it's sampling indices while
> it starts?

It's reading through keys in the index and adding offset information
about roughly every 128th entry in RAM, in order to speed up reads.
Performing a binary search in an sstable from scratch would be
expensive. Because of the high cost of disk seeks, most storage
systems use btrees with a high branching factor to keep the number of
seeks low. In cassandra there is instead binary searching (owing to
the fact that sstables are sorted on disk), but pre-seeded with the
information gained from index sampling to keep the number of seeks
bounded even in the face of very large sstables.
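
As a toy illustration in Python (not Cassandra's actual code): keep
every 128th key of the index in memory together with its position,
binary search that sample, and then scan only the ~128-entry slice it
points into. In the sketch the full index is a list for simplicity; in
reality only the sample lives in RAM and the slice is read from the
index file on disk.

import bisect

SAMPLE_EVERY = 128   # roughly the sampling interval described above

def build_sample(index_keys):
    # index_keys: the keys of the sstable index, in sorted order.
    # Keep every 128th key together with its position in the index.
    return [(k, i) for i, k in enumerate(index_keys) if i % SAMPLE_EVERY == 0]

def lookup(key, sample, index_entries):
    # index_entries: the full index as (key, data_file_offset) pairs,
    # sorted by key. Binary search the small in-memory sample first,
    # then scan only the slice of the index it points into.
    sample_keys = [k for k, _ in sample]
    i = bisect.bisect_right(sample_keys, key) - 1
    if i < 0:
        return None                    # key sorts before the first sample
    lo = sample[i][1]
    for k, offset in index_entries[lo:lo + SAMPLE_EVERY]:
        if k == key:
            return offset              # where the row lives in the data file
    return None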

This should translate to memory use that scales linearly with the
number of keys too, though I don't have a good feel for the overhead
you can expect.

At least this is my current understanding. I have not looked much at
the read path in cassandra yet. This is based on
SSTableReader.loadIndexFile() and callees of that method.

> I think that is what happened; in the INFO printouts, it was saying
> CompactionManager had 50+ pending operations.  If I set commitlog_sync
> to batch and commitlog_sync_period_in_ms to 0, then cassandra can't
> write data any faster than my drives can keep up, right?  Would that
> have any effect in preventing a huge compaction backlog, or would it
> just thrash my drives a ton?

Those settings only directly affect, as far as I know, the interaction
with the commit log. Now, if your system is truly disk bound rather
than CPU bound on compaction, writes to the commit log will indeed
have the capability to effectively throttle the write speed. In such a
case I would expect more frequent fsync() calls to the commit log to
throttle writes to a higher degree than they would if the commit log
was just periodically fsync()ed in the background once per minute;
however I would not use this as the means to throttle writes.

The other thing which may happen is that memtables aren't flushed fast
enough to keep up with writes. I don't remember whether or not there
was already a fix for this; I think there is, at least in trunk.
Previously you could trigger an out-of-memory condition by writing
faster than memtable flushing was happening.

However even if that is fixed (again, I'm not sure), I'm pretty sure
there is still no mechanism to throttle based on background
compaction. It's not entirely trivial to do in a sensible fashion
given how extremely asynchronous compaction is with respect to writes.

Hopefully one of the cassandra developers will chime in if I'm
misrepresenting something.

>> * Increasing the memtable size. Increasing memtable size directly
>> affects the number of times a given entry will end up having to be
>> compacted on average; i.e., it decreases the total compaction work
>> that must be done for a given insertion workload. The default is
>> something like 64 MB; on a large system you probably want this
>> significantly larger, even up to several gigs (depending on heap sizes
>> and other concerns of course).
>
> Is that {binary_,}memtable_throughput_in_mb?  It definitely sounds
> like fewer compactions would help me, so I will give that a shot.

That (minus the binary one for normal operations) and
MemtableOperationsInMillions, depending on your workload.

For large mutations MemtableOperationsInMillions may be irrelevant; it
becomes more relevant the smaller your average piece of data is, because
the per-item overhead of keeping it in memory is then proportionally
higher. In most cases you probably want to change both at the same time,
unless you are specifically looking to tweak them in relation to each
other.
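
A back-of-envelope way to see which cap trips first (Python, with
invented numbers; the real defaults depend on your version and schema):

# Invented example numbers, not measurements.
throughput_mb = 256        # MemtableThroughputInMB
operations_millions = 0.3  # MemtableOperationsInMillions
avg_item_bytes = 100       # average size of one inserted column

ops_at_size_cap = throughput_mb * 1024 * 1024 / float(avg_item_bytes)
ops_at_count_cap = operations_millions * 1e6

if ops_at_count_cap < ops_at_size_cap:
    print("small items: the operations cap flushes first; raise both caps together")
else:
    print("large items: the size cap flushes first; the operations cap hardly matters")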

> Is this anything other than ensuring that -Xmx in JVM_OPTS is
> something reasonably large?

Not as far as I know. Though I failed to list the index summary
information; I believe that goes into the same category (i.e., a need
to increase the heap size rather than to adjust cassandra settings).

-- 
/ Peter Schuller

Re: Cassandra behaviour

Posted by tsuraan <ts...@gmail.com>.
> Bloom filters are indeed linear in size with respect to the number of
> items (assuming a constant target false positive rate). While I have
> not looked at how Cassandra calculates the bloom filter sizes, I feel
> pretty confident in saying that it won't dynamically replace bloom
> filters with filters of smaller sizes in response to memory pressure.
> The implementation issues involved in doing that are many, just
> counting the ones I can think of off the top of my head ;)
>
> Also keep in mind that, at least as far as I can tell, silently
> sacrificing false positive rates would not necessarily be an optimal
> thing to do anyway.

Well, it would probably kill the lookup speeds.  I'm not testing that
right now, so I wouldn't notice until I do :)  I suppose that since
bloom filters are stored alongside their associated data on-disk,
having their size dependent on the RAM in the current machine wouldn't
be the most portable thing to do.  I had been thinking that the bloom
filters were created on startup, but further reading of the docs
indicates that they are in the SSTable Index.  What is cassandra
doing, then, when it's printing out that it's sampling indices while
it starts?

> One aspect of start-up is reading through indexes, which will take
> some time linear in the size of the indexes. Given 120M entries this
> should not take terribly long.
>
> With respect to compaction: In general, compactions may take up to
> whatever time it takes to read and write the entire data set (in the
> case of a full compaction). In addition, if your test threw write
> traffic at the node faster than it was able to do compaction, you may
> have a built-up backlog of compaction activity. I'm pretty sure there
> is no active feedback mechanism (yet anyway) to prevent this from
> happening (though IIRC it's been discussed on the lists).

I think that is what happened; in the INFO printouts, it was saying
CompactionManager had 50+ pending operations.  If I set commitlog_sync
to batch and commitlog_sync_period_in_ms to 0, then cassandra can't
write data any faster than my drives can keep up, right?  Would that
have any effect in preventing a huge compaction backlog, or would it
just thrash my drives a ton?

> For a database with many items I would start by:
>
> * Increasing the memtable size. Increasing memtable size directly
> affects the number of times a given entry will end up having to be
> compacted on average; i.e., it decreases the total compaction work
> that must be done for a given insertion workload. The default is
> something like 64 MB; on a large system you probably want this
> significantly larger, even up to several gigs (depending on heap sizes
> and other concerns of course).

Is that {binary_,}memtable_throughput_in_mb?  It definitely sounds
like fewer compactions would help me, so I will give that a shot.

> * Making sure enough memory is reserved for the bloom filters.

Is this anything other than ensuring that -Xmx in JVM_OPTS is
something reasonably large?

> For sustaining high read traffic you may then want to tweak cache
> sizes, but that should not affect insertion.

Yeah, I haven't done any reads yet, so I'll play with that later.

Re: Cassandra behaviour

Posted by Peter Schuller <pe...@infidyne.com>.
[ 1 billion inserts, failed after 120m with out-of-mem ]

> - is Cassandra's RAM use proportional to the number of values that
> it's storing?  I know that it uses bloom filters for preventing
> lookups of non-existent keys, but since bloom filters are designed to
> give an accuracy/space tradeoff, Cassandra should sacrifice accuracy
> in order to prevent crashes, if it's just bloom filters that are using
> all the RAM

Bloom filters are indeed linear in size with respect to the number of
items (assuming a constant target false positive rate). While I have
not looked at how Cassandra calculates the bloom filter sizes, I feel
pretty confident in saying that it won't dynamically replace bloom
filters with filters of smaller sizes in response to memory pressure.
The implementation issues involved in doing that are many, just
counting the ones I can think of off the top of my head ;)
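
The linearity falls straight out of the usual sizing formula,
m = -n*ln(p)/(ln 2)^2 bits for n keys at a target false positive rate
p. A quick Python sketch (generic formula, not necessarily what
Cassandra computes internally):

import math

def bloom_bits(n_keys, false_positive_rate):
    # Optimal bit count for a standard bloom filter: about 9.6 bits per
    # key at a 1% target rate, growing linearly with the number of keys.
    return -n_keys * math.log(false_positive_rate) / (math.log(2) ** 2)

for n in (10 * 1000 * 1000, 120 * 1000 * 1000):
    mb = bloom_bits(n, 0.01) / 8 / 1024 / 1024
    print("%d keys -> ~%.0f MB of bloom filter" % (n, mb))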

Also keep in mind that, at least as far as I can tell, silently
sacrificing false positive rates would not necessarily be an optimal
thing to do anyway.

If your application is such that you can accept a significantly higher
false positive rate on the bloom filters, the best bet is probably to
tweak it to use different target rates. I don't know if this is
currently possible without code changes (I don't think so, but I'm not
sure).

> - When I start Cassandra again, it appears to go into an eternal
> read/write loop, using between 45% and 90% of my CPU.  It says it's
> compacting tables, but it's been doing that for hours, and it only has
> 70GB of data stored.  How can cassandra be run on huge datasets, when
> 70GB appears to take forever to compact?

One aspect of start-up is reading through indexes, which will take
some time linear in the size of the indexes. Given 120M entries this
should not take terribly long.

With respect to compaction: In general, compactions may take up to
whatever time it takes to read and write the entire data set (in the
case of a full compaction). In addition, if your test threw write
traffic at the node faster than it was able to do compaction, you may
have a built-up backlog of compaction activity. I'm pretty sure there
is no active feedback mechanism (yet anyway) to prevent this from
happening (though IIRC it's been discussed on the lists).

> I assume I'm doing something wrong, but I don't see a ton of tunables
> to play with.  Can anybody give me advice on how to make cassandra
> keep running under a high insert load?

For a database with many items I would start by:

* Increasing the memtable size. Increasing memtable size directly
affects the number of times a given entry will end up having to be
compacted on average; i.e., it decreases the total compaction work
that must be done for a given insertion workload. The default is
something like 64 MB; on a large system you probably want this
significantly larger, even up to several gigs (depending on heap sizes
and other concerns of course).

* Making sure enough memory is reserved for the bloom filters.

For sustaining high read traffic you may then want to tweak cache
sizes, but that should not affect insertion.

-- 
/ Peter Schuller

Re: Cassandra behaviour

Posted by Peter Schuller <pe...@infidyne.com>.
> to play with.  Can anybody give me advice on how to make cassandra
> keep running under a high insert load?

I forgot to mention that if your insertion speed is simply
legitimately faster than compaction, but you have left-over idle CPU
on the system, then currently as far as I know you're out of luck. If
cassandra gets concurrent compaction support in the future it would be
able to sustain a higher write rate in cases where it is currently
bottlenecking on the sequential nature of compaction.

-- 
/ Peter Schuller

Re: Cassandra behaviour

Posted by tsuraan <ts...@gmail.com>.
> My guess:
> Your test is beating up your system. The system may need more memory
> or disk throughput or CPU in order to keep up with that particular
> test.

Yeah, I am testing on a pretty wimpy machine; I just wanted to get
some practice getting cassandra up and running, and I ran into this
problem.

> Check some of the posts on the list with "deferred processing" in the
> body to see why.

I'll look for that.  Thanks for the tip.

> Also, can you post the error log?

All I have is a system.log, which only seems to have GC info messages.
 Very strange.

Re: Cassandra behaviour

Posted by Jonathan Shook <js...@gmail.com>.
My guess:
Your test is beating up your system. The system may need more memory
or disk throughput or CPU in order to keep up with that particular
test.
Check some of the posts on the list with "deferred processing" in the
body to see why.

Also, can you post the error log?
