Posted to user@cassandra.apache.org by Wayne <wa...@gmail.com> on 2010/12/16 23:35:03 UTC

Read Latency Degradation

We are running 0.6.8 and are reaching 1 TB/node in a 10-node cluster with
rf=3. Our read times seem to be getting worse as we load data into the
cluster, and I am worried there is a scale problem with large column
families. All benchmarks/times come from cfstats reporting, so no client
code or client-side timing is involved. Our initial tests always hovered
around <=5ms read latency. We then went through a lot of work to load
large amounts of data into the system, and now our read latency is
~20-25ms. We can find no reason for this; we have checked under load,
with no load, with manually compacted CFs, etc., and the numbers are
consistent in every case, just 3-4x slower than what we saw before. We
then compared against another, larger CF that we have loaded and are
seeing ~50-60ms reads from it. The scare for us is that the data file for
the slower CF is 3x the size of the smaller one, with 3x the latency to
read. We were expecting < 5ms read latency (with key caching) when we had
a lot less data in the cluster, but are worried it will only get worse as
the table gets bigger. The 20ms table file is ~30GB and the bigger one is
~100GB. I keep thinking we are missing something obvious and this is just
a coincidence, because it does not make sense. We also upgraded from
0.6.6 between tests, so we will check whether 0.6.6 was faster, but that
upgrade is the only real change that has occurred in our cluster.
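
For reference, the latency numbers above come straight from cfstats on
each node, roughly like this (the host and JMX port are placeholders,
and the exact nodetool flags differ between 0.6 and later versions):

  nodetool -host 127.0.0.1 -port 8080 cfstats | egrep 'Column Family:|Read Latency'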

I have read that read latency goes up with the total data size, but to what
degree should we expect a degradation in performance? What is the "normal"
read latency range if there is such a thing for a small slice of scol/cols?
Can we really put 2TB of data on a node and get good read latency querying
data off of a handful of CFs? Any experience or explanations would be
greatly appreciated.

Thanks in advance for any help!

Re: Read Latency Degradation

Posted by Edward Capriolo <ed...@gmail.com>.
On Thu, Dec 16, 2010 at 7:15 PM, Robert Coli <rc...@digg.com> wrote:
> On Thu, Dec 16, 2010 at 2:35 PM, Wayne <wa...@gmail.com> wrote:
>
>> I have read that read latency goes up with the total data size, but to what
>> degree should we expect a degradation in performance?
>
> I'm not sure this is generally answerable because of data modelling
> and workload variability, but there are some known
> performance-impacting issues with very large data files.
>
> For one example, this error :
>
> "
> WARN [COMPACTION-POOL:1] 2010-09-28 12:17:11,932 BloomFilter.java
> (line 82) Cannot provide an optimal BloomFilter for 245256960 elements
> (8/15 buckets per element).
> "
>
> Which I saw on a SSTable which was 90gb, around the size of one of your files.
>
> https://issues.apache.org/jira/browse/CASSANDRA-1555
>
> Is open with some great work from the Twitter guys to deal with this
> particular problem.
>
> Generally, I'm sure that there are other similar issues, because the
> simple fact is that the set of people running very large datasets with
> Apache Cassandra in production is still relatively small, and
> non-squeaking wheels usually get less grease.. ;D
>
> =Rob
>

What you are seeing is expected. Your latency and overall performance
will get worse as your data gets larger. You mention the size of your
column families but not how much physical RAM is on each node.
Regardless of your key cache or row cache settings, if you have more RAM
than data, all of your sstables, bloom filters, and index files live in
the VFS cache and everything is going to be fast. At the point where
your data gets larger than main memory, things can take a step down. It
may be a small step, depending on the dynamics of your traffic. As your
data gets larger relative to main memory, less of it will be
key/row/VFS cached, and your disks will start becoming more active. If
you get to the point where even your bloom filters and indexes are much
larger than main memory, you can expect another big step down.

In a nutshell, if you cannot keep some proportion of disk data to RAM,
you can expect latency to go up. Fast disks let that proportion be
larger, as does a small active set.

Commonly the Cassandra model calls for de-normalization. "Put
everything in one big column family," some might say. This is true in
some cases and not in others. For example, you may run into a situation
where some columns for a key need to be read 100 times a minute, while
other columns only need to be read 10 times a minute. If you store all
of these columns together in a 300 GB column family, each read searches
that 300 GB of data. But if the column sizes differ drastically and you
can point the 100 reads per minute at a 30 GB CF while the other 10
reads sort through the 270 GB CF, everything will perform better. In a
situation like this you can also tune the caching on the column
families independently: mix and match key and row cache, or size them
differently. This also depends heavily on how you store your data. Is
it big wide rows and small indexes, or tiny rows and large indexes?
There are many variables to take into account.

To get to 1 TB per node you need fast disks and/or lots of RAM. Then
again, you might be able to accomplish the same thing with four smaller
nodes instead of one beefy one.

Re: Read Latency Degradation

Posted by Robert Coli <rc...@digg.com>.
On Thu, Dec 16, 2010 at 2:35 PM, Wayne <wa...@gmail.com> wrote:

> I have read that read latency goes up with the total data size, but to what
> degree should we expect a degradation in performance?

I'm not sure this is generally answerable because of data modelling
and workload variability, but there are some known
performance-impacting issues with very large data files.

For one example, this error :

"
WARN [COMPACTION-POOL:1] 2010-09-28 12:17:11,932 BloomFilter.java
(line 82) Cannot provide an optimal BloomFilter for 245256960 elements
(8/15 buckets per element).
"

I saw this on an SSTable that was ~90 GB, around the size of one of your files.

https://issues.apache.org/jira/browse/CASSANDRA-1555

Is open with some great work from the Twitter guys to deal with this
particular problem.

Generally, I'm sure that there are other similar issues, because the
simple fact is that the set of people running very large datasets with
Apache Cassandra in production is still relatively small, and
non-squeaking wheels usually get less grease.. ;D

=Rob

Re: Read Latency Degradation

Posted by Edward Capriolo <ed...@gmail.com>.
On Sat, Dec 18, 2010 at 11:31 AM, Peter Schuller
<pe...@infidyne.com> wrote:
> I started a page on the wiki that still needs improvement,
> specifically for concerns relating to running large nodes:
>
>   http://wiki.apache.org/cassandra/LargeDataSetConsiderations
>
> I haven't linked to it from anywhere yet, pending adding various JIRA
> ticket references + give people a chance to object if any information
> is misleading.
>
> --
> / Peter Schuller
>

I will definitely contribute to this page. There are two far extremes
in use cases: on one end, some use Cassandra like a "persistent
memcache" on small datasets and get very high throughput; on the other
end, people use it for much larger datasets as a supplement to or
replacement for an RDBMS.

Re: Read Latency Degradation

Posted by Peter Schuller <pe...@infidyne.com>.
I started a page on the wiki that still needs improvement,
specifically for concerns relating to running large nodes:

   http://wiki.apache.org/cassandra/LargeDataSetConsiderations

I haven't linked to it from anywhere yet, pending adding various JIRA
ticket references + give people a chance to object if any information
is misleading.

-- 
/ Peter Schuller

Re: Read Latency Degradation

Posted by Peter Schuller <pe...@infidyne.com>.
> +1 on each of Peter's points except one.
>
> For example, if the hot set is very small and slowly changing, you may
> be able to have 100 TB per node and take the traffic without any
> difficulties.

So that statement was probably not the best. I should have been more
careful. I meant it purely in terms of dealing with the traffic
patterns (hot set and the implied IOPS relative to request rate etc),
rather than as a claim that there are no issues with having 100 TB of
data on a single node.

Sorry if that was unclear.

> Also this page
> http://wiki.apache.org/cassandra/CassandraHardware
>
> On ext2/ext3 the maximum file size is 2TB, even on a 64 bit kernel. On
> ext4 that goes up to 16TB. Since Cassandra can use almost half your
> disk space on a single file, if you are raiding large disks together
> you may want to use XFS instead, particularly if you are using a
> 32-bit kernel. XFS file size limits are 16TB max on a 32 bit kernel,
> and basically unlimited on 64 bit.

Another problem is that file removals (unlink()) are *DOG* slow on
ext3/ext2, extremely seek bound, and they put the process in
uninterruptible sleep mode.

I would strongly urge the use of XFS for large data sets. The main
argument against it is probably that more people test with ext3
because it tends to be the default.
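
For what it is worth, moving the data directory over is not much work;
a minimal sketch (the device and mount point are made-up placeholders,
adjust to your layout):

  # assumes a raid device dedicated to the Cassandra data directory
  mkfs.xfs -f /dev/md0
  mount -o noatime /dev/md0 /var/lib/cassandra/data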

> 2) If you have small columns the start up time for a node (0.6.0)
> would be mind boggling to sample those indexes

This is true with 0.7 as well. It's improved, but sampling still takes
place and will take time. At minimum you have to read through the
indexes on disk, so whatever time it takes to do that is time to wait.
In addition if the sampling is CPU bound, it will take even longer.

(I think the parallel index sampling has gone in so that multiple cores
are used, but the issue remains.)

> We should be careful not to mislead people. Talking about 16TB XFS
> setup, or 100TB/node without any difficulties , seems very very far
> from the common use case.

I completely agree. I didn't mean to imply that, and I hope no one was misled.

-- 
/ Peter Schuller

Re: Read Latency Degradation

Posted by Edward Capriolo <ed...@gmail.com>.
On Sat, Dec 18, 2010 at 5:27 AM, Peter Schuller
<pe...@infidyne.com> wrote:
> And I forgot:
>
> (6) It is fully expected that sstable counts spike during large
> compactions that take a lot of time simply because smaller compactions
> never get a chance to run. (There was just recently JIRA traffic that
> added support for parallel compaction, but I'm not sure whether it
> fully addresses this particular issue or not.) If you have a lot rows
> that are written incrementally and thus span multiple sstables, and
> your data size is truly large and written to fairly quickly, that
> means you will have a lot of data in sstables spread out over smaller
> ones that won't get compacted for extended periods once larger
> multi-hundreds-of-gig sstables are being compacted. However, that
> said, if you are just continually increasing your sstable count
> (rather than there just being spikes) that indicates compaction is not
> keeping up with write traffic.
>
> --
> / Peter Schuller
>

+1 on each of Peter's points except one.

For example, if the hot set is very small and slowly changing, you may
be able to have 100 TB per node and take the traffic without any
difficulties.

Also this page
http://wiki.apache.org/cassandra/CassandraHardware

On ext2/ext3 the maximum file size is 2TB, even on a 64 bit kernel. On
ext4 that goes up to 16TB. Since Cassandra can use almost half your
disk space on a single file, if you are raiding large disks together
you may want to use XFS instead, particularly if you are using a
32-bit kernel. XFS file size limits are 16TB max on a 32 bit kernel,
and basically unlimited on 64 bit.

Both of these statements imply there should be no challenges to using
disks this large. But there are challenges, namely the ones mentioned
in this thread:
1) Bloom filters currently stop being effective
2) If you have small columns, the start-up time for a node (0.6.0) to
sample those indexes would be mind boggling
3) Compaction scenarios that take a long time cause sstable build-up,
thus lowering read performance
4) node joins/moves/repairs take a long time (due to compaction taking
a long time)

We should be careful not to mislead people. Talking about a 16TB XFS
setup, or 100TB/node without any difficulties, seems very, very far
from the common use case.

Re: Read Latency Degradation

Posted by Peter Schuller <pe...@infidyne.com>.
And I forgot:

(6) It is fully expected that sstable counts spike during large
compactions that take a lot of time simply because smaller compactions
never get a chance to run. (There was just recently JIRA traffic that
added support for parallel compaction, but I'm not sure whether it
fully addresses this particular issue or not.) If you have a lot of rows
that are written incrementally and thus span multiple sstables, and
your data size is truly large and written to fairly quickly, that
means you will have a lot of data in sstables spread out over smaller
ones that won't get compacted for extended periods once larger
multi-hundreds-of-gig sstables are being compacted. However, that
said, if you are just continually increasing your sstable count
(rather than there just being spikes) that indicates compaction is not
keeping up with write traffic.

-- 
/ Peter Schuller

Re: Read Latency Degradation

Posted by Daniel Doubleday <da...@gmx.net>.
On 19.12.10 03:05, Wayne wrote:
> Rereading through everything again I am starting to wonder if the page 
> cache is being affected by compaction. 
Oh yes ...

http://chbits.blogspot.com/2010/06/lucene-and-fadvisemadvise.html
https://issues.apache.org/jira/browse/CASSANDRA-1470

> We have been heavily loading data for weeks and compaction is 
> basically running non-stop. The manual compaction should be done some 
> time tomorrow, so when totally caught up I will try again. What 
> changes can be hoped for in 1470 or 1882 in terms of isolating 
> compactions (or writes) affects on read requests?
>
> Thanks
>
>
>
> On Sat, Dec 18, 2010 at 2:36 PM, Peter Schuller 
<peter.schuller@infidyne.com> wrote:
>
>     > You are absolutely back to my main concern. Initially we were
>     consistently
>     > seeing < 10ms read latency and now we see 25ms (30GB sstable
>     file), 50ms
>     > (100GB sstable file) and 65ms (330GB table file) read times for
>     a single
>     > read with nothing else going on in the cluster. Concurrency is
>     not our
>     > problem/concern (at this point), our problem is slow reads in total
>     > isolation. Frankly the concern is that a 2TB node with a 1TB
>     sstable (worst
>     > case scenario) will result in > 100ms read latency in total
>     isolation.
>
>     So if you have a single non-concurrent client, along, submitting these
>     reads that take 65 ms - are you disk bound (according to the last
>     column of iostat -x 1), and how many reads per second (rps column) are
>     you seeing relative to client reads? Is the number of disk reads per
>     client read consistent with the actual number of sstables at the time?
>
>     The behavior you're describing really does seem indicative of a
>     problem, unless the the bottleneck is legitimately reads from disk
>     from multiple sstables resulting from rows being spread over said
>     sstables.
>
>     --
>     / Peter Schuller
>
>


Re: Read Latency Degradation

Posted by Chris Goffinet <cg...@chrisgoffinet.com>.
You can disable compaction and re-enable it later: use nodetool setcompactionthreshold with 0 0.
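
Something like this (the host is a placeholder; the JMX port and the
default thresholds are from memory, so double-check against your install):

  nodetool -host <node> -port 8080 setcompactionthreshold 0 0   # disable minor compaction
  nodetool -host <node> -port 8080 setcompactionthreshold 4 32  # restore the defaults later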

-Chris

On Dec 18, 2010, at 6:05 PM, Wayne wrote:

> Rereading through everything again I am starting to wonder if the page cache is being affected by compaction. We have been heavily loading data for weeks and compaction is basically running non-stop. The manual compaction should be done some time tomorrow, so when totally caught up I will try again. What changes can be hoped for in 1470 or 1882 in terms of isolating compactions (or writes) affects on read requests?
> 
> Thanks
> 
> 
> 
> On Sat, Dec 18, 2010 at 2:36 PM, Peter Schuller <pe...@infidyne.com> wrote:
> > You are absolutely back to my main concern. Initially we were consistently
> > seeing < 10ms read latency and now we see 25ms (30GB sstable file), 50ms
> > (100GB sstable file) and 65ms (330GB table file) read times for a single
> > read with nothing else going on in the cluster. Concurrency is not our
> > problem/concern (at this point), our problem is slow reads in total
> > isolation. Frankly the concern is that a 2TB node with a 1TB sstable (worst
> > case scenario) will result in > 100ms read latency in total isolation.
> 
> So if you have a single non-concurrent client, along, submitting these
> reads that take 65 ms - are you disk bound (according to the last
> column of iostat -x 1), and how many reads per second (rps column) are
> you seeing relative to client reads? Is the number of disk reads per
> client read consistent with the actual number of sstables at the time?
> 
> The behavior you're describing really does seem indicative of a
> problem, unless the the bottleneck is legitimately reads from disk
> from multiple sstables resulting from rows being spread over said
> sstables.
> 
> --
> / Peter Schuller
> 


Re: Read Latency Degradation

Posted by Peter Schuller <pe...@infidyne.com>.
> Rereading through everything again I am starting to wonder if the page cache
> is being affected by compaction. We have been heavily loading data for weeks
> and compaction is basically running non-stop. The manual compaction should
> be done some time tomorrow, so when totally caught up I will try again.

If your 65 ms measurements were taken as an average while
compaction/repair was running, that would most definitely be a very
very likely candidate for a root cause. Especially if your compaction
is disk bound or close to it (rather than CPU bound).

What made me concerned was that it sounded like you were getting the
65ms latencies to reads with *no* other activity going on. But was
compaction/repair still running at that point?

And yes - definitely make sure to time it again when there's no active
compaction/repair going on.

> What
> changes can be hoped for in 1470 or 1882 in terms of isolating compactions
> (or writes) affects on read requests?

Speaking only for myself now and my expectations (not making any
statements officially for cassandra):

Under the assumption of large data sets with disk I/O and cache
effectiveness being the primary concerns, the negative impact of
background bulk I/O falls into two categories:

(1) Direct impact on latency resulting from the I/O being done at any
given moment.
(2) Indirect impact resulting from eviction of hot data from page cache.

1470 is part of mitigating (2). It sounds like 1470 itself will be
closed once fadvise is working, but there is more to be done to achieve
the final goal of mitigating (2). Various options are discussed in 1470
itself; I guess the latest is the fadvise+mincore plan, provided that
it pans out. It is worth noting though that, barring a user-level page
cache, the effect of (2) will likely never be completely eliminated.
Even given fadvise+mincore, there are other concerns such as blowing
away recency information and defeating the LRU behavior (or similar)
of the OS page cache.
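
As a crude way to see whether (2) is what is actually hitting you
(nothing Cassandra-specific here, just generic Linux tooling), watch the
page cache while a compaction runs:

  # watch the "cache" column (in MB); a steep drop during compaction means
  # previously cached data is being evicted
  vmstat -S M 5
  grep -i '^cached' /proc/meminfo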

1882 is about controlling (1), and it is considerably easier to get
something "good enough" working for 1882 than for 1470. Although the
general problem of I/O scheduling is certainly a difficult one, given
the specific use-case in Cassandra and the low hanging fruit to be
picked, I expect 1882 even in its simplest form to significantly help
with (1) (but this only matters if (1) is your problem; if you are
sufficiently CPU bound already so that I/O is sufficiently rate limited
in practice anyway, 1882 will make no difference at all).

-- 
/ Peter Schuller

Re: Read Latency Degradation

Posted by Wayne <wa...@gmail.com>.
Rereading through everything again I am starting to wonder if the page cache
is being affected by compaction. We have been heavily loading data for weeks
and compaction is basically running non-stop. The manual compaction should
be done some time tomorrow, so when totally caught up I will try again. What
changes can be hoped for in 1470 or 1882 in terms of isolating the effects
of compactions (or writes) on read requests?

Thanks



On Sat, Dec 18, 2010 at 2:36 PM, Peter Schuller
<peter.schuller@infidyne.com> wrote:

> > You are absolutely back to my main concern. Initially we were
> consistently
> > seeing < 10ms read latency and now we see 25ms (30GB sstable file), 50ms
> > (100GB sstable file) and 65ms (330GB table file) read times for a single
> > read with nothing else going on in the cluster. Concurrency is not our
> > problem/concern (at this point), our problem is slow reads in total
> > isolation. Frankly the concern is that a 2TB node with a 1TB sstable
> (worst
> > case scenario) will result in > 100ms read latency in total isolation.
>
> So if you have a single non-concurrent client, along, submitting these
> reads that take 65 ms - are you disk bound (according to the last
> column of iostat -x 1), and how many reads per second (rps column) are
> you seeing relative to client reads? Is the number of disk reads per
> client read consistent with the actual number of sstables at the time?
>
> The behavior you're describing really does seem indicative of a
> problem, unless the the bottleneck is legitimately reads from disk
> from multiple sstables resulting from rows being spread over said
> sstables.
>
> --
> / Peter Schuller
>

Re: Read Latency Degradation

Posted by Peter Schuller <pe...@infidyne.com>.
> You are absolutely back to my main concern. Initially we were consistently
> seeing < 10ms read latency and now we see 25ms (30GB sstable file), 50ms
> (100GB sstable file) and 65ms (330GB table file) read times for a single
> read with nothing else going on in the cluster. Concurrency is not our
> problem/concern (at this point), our problem is slow reads in total
> isolation. Frankly the concern is that a 2TB node with a 1TB sstable (worst
> case scenario) will result in > 100ms read latency in total isolation.

So if you have a single non-concurrent client, alone, submitting these
reads that take 65 ms - are you disk bound (according to the last
column of iostat -x 1), and how many reads per second (rps column) are
you seeing relative to client reads? Is the number of disk reads per
client read consistent with the actual number of sstables at the time?
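
Concretely, something like this while the single-client test is running
(using the sysstat iostat; your version may label the columns slightly
differently):

  # %util near 100 means the disks are saturated; compare r/s (disk reads
  # per second) against the client read rate to estimate how many sstables
  # each read is touching
  iostat -x 1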

The behavior you're describing really does seem indicative of a
problem, unless the bottleneck is legitimately reads from disk
from multiple sstables resulting from rows being spread over said
sstables.

-- 
/ Peter Schuller

Re: Read Latency Degradation

Posted by Wayne <wa...@gmail.com>.
You are absolutely back to my main concern. Initially we were consistently
seeing < 10ms read latency and now we see 25ms (30GB sstable file), 50ms
(100GB sstable file) and 65ms (330GB table file) read times for a single
read with nothing else going on in the cluster. Concurrency is not our
problem/concern (at this point), our problem is slow reads in total
isolation. Frankly the concern is that a 2TB node with a 1TB sstable (worst
case scenario) will result in > 100ms read latency in total isolation.

I got a lot of help from Jonathan Ellis to get very good heap settings,
some of which are now standard in 0.6.8. CMF (concurrent mode failure)
problems have yet to rear their heads again. He also helped fix some
major inefficiencies with quorum reads (we read/write with quorum).

I do think that we could optimize our caching for sure, but our assumption
for worst case scenario is disk (non ssd) based reads.

Thanks.

On Sat, Dec 18, 2010 at 12:58 PM, Peter Schuller <
peter.schuller@infidyne.com> wrote:

> > Smaller nodes just seem to fit the Cassandra architecture a lot better.
> We
> > can not use cloud instances, so the cost for us to go to <500gb nodes is
> > prohibitive. Cassandra lumps all processes on the node together into one
> > bucket, and that almost then requires a smaller node data set. There are
> no
> > regions, tablets, or partitions created to throttle compaction and
> prevent
> > huge data files.
>
> There are definitely some things to improve. I think what you have
> mentioned is covered, but if you feel you're hitting something which
> is not covered by the wiki page I mentioned in my previous post
> (http://wiki.apache.org/cassandra/LargeDataSetConsiderations), please
> do augment or say so.
>
> In your original post you said you went from 5 ms to 50 ms. Is this
> average latencies under load, or the latency of a single request
> absent other traffic and absent background compaction etc?
>
> If a single read is taking 50 ms for reasons that have nothing to do
> with other concurrent activity, that smells of something being wrong
> to me.
>
> Otherwise, is your primary concern worse latency/throughput during
> compactions/repairs, or just the overall throughput/latency during
> normal operation?
>
> > I have considered dropping the heap down to 8gb, but having pained
> through
> > many cmf in the past I thought the larger heap should help prevent the
> stop
> > the world gc.
>
> I'm not sure what got merged to 0.6.8, but you may way want to grab
> the JVM options from the 0.7 branch. In particular, the initial
> occuprancy triggering of CMS mark-sweep phases. Concurrent mode
> failures could just be because the CMS heuristics failed, rather than
> due to the heap legitimately being too small. If the heuristics are
> failing, maybe you do have the ability to lower the heap size if you
> change the CMS trigger. I recommend monitoring heap usage for that;
> look for the heap usage as it appears right after a CMS collection has
> completed to judge the "real" live set size.
>
> > Row cache is not an option for us. We expect going to disk, and key cache
> is
> > the only cache that can help speed things up a little. We have wide rows
> so
> > key cache is an un-expensive boost.
>
> Ok, makes sense.
>
> > This is why we schedule weekly major compaction. We update ALL rows every
> > day, often over-writing previous values.
>
> Ok - so you're definitely in a position to suffer more than most use
> cases from data being spread over multiple sstables.
>
> >> (5) In general the way I/O works, latency will skyrocket once you
> >> start saturating your disks. As long as you're significantly below
> >> full utilization of your disks, you'll see pretty stable and low
> >> latencies. As you approach full saturation, the latencies will tend to
> >> increase super-linearly. Once you're *above* saturation, your
> >> latencies skyrocket and reads are dropped because the rate cannot be
> >> sustained. This means that while latency is a great indicator to look
> >> at to judge what the current user perceived behavior is, it is *not* a
> >> good thing to look at to extrapolate resource demands or figure out
> >> how far you are from saturation / need for more hardware.
> >>
> > This we can see with munin. We throttle the read load to avoid that
> "wall".
>
> Do you have a sense of how many reads on disk you're taking per read
> request to the node? Do you have a sense of the size of the active
> set? A big question is going to be whether caching is effective at
> all, and how much additional caching would help.
>
> In any case, it would be interesting to know whether you are seeing
> more disk seeks per read than you "should".
>
> --
> / Peter Schuller
>

Re: Read Latency Degradation

Posted by Peter Schuller <pe...@infidyne.com>.
> Smaller nodes just seem to fit the Cassandra architecture a lot better. We
> can not use cloud instances, so the cost for us to go to <500gb nodes is
> prohibitive. Cassandra lumps all processes on the node together into one
> bucket, and that almost then requires a smaller node data set. There are no
> regions, tablets, or partitions created to throttle compaction and prevent
> huge data files.

There are definitely some things to improve. I think what you have
mentioned is covered, but if you feel you're hitting something which
is not covered by the wiki page I mentioned in my previous post
(http://wiki.apache.org/cassandra/LargeDataSetConsiderations), please
do augment or say so.

In your original post you said you went from 5 ms to 50 ms. Are these
average latencies under load, or the latency of a single request absent
other traffic and absent background compaction etc?

If a single read is taking 50 ms for reasons that have nothing to do
with other concurrent activity, that smells of something being wrong
to me.

Otherwise, is your primary concern worse latency/throughput during
compactions/repairs, or just the overall throughput/latency during
normal operation?

> I have considered dropping the heap down to 8gb, but having pained through
> many cmf in the past I thought the larger heap should help prevent the stop
> the world gc.

I'm not sure what got merged into 0.6.8, but you may well want to grab
the JVM options from the 0.7 branch. In particular, the initial
occupancy triggering of CMS mark-sweep phases. Concurrent mode
failures could just be because the CMS heuristics failed, rather than
because the heap is legitimately too small. If the heuristics are
failing, maybe you do have the ability to lower the heap size if you
change the CMS trigger. I recommend monitoring heap usage for that;
look at the heap usage as it appears right after a CMS collection has
completed to judge the "real" live set size.
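
Concretely, the flags in question look something like this (the values
are illustrative rather than a recommendation; in 0.6 they would go
into JVM_OPTS in the startup script):

  # trigger CMS at a fixed old-gen occupancy instead of relying on the
  # ergonomics, which can mis-trigger and lead to concurrent mode failures
  JVM_OPTS="$JVM_OPTS \
    -XX:+UseConcMarkSweepGC \
    -XX:+CMSParallelRemarkEnabled \
    -XX:CMSInitiatingOccupancyFraction=75 \
    -XX:+UseCMSInitiatingOccupancyOnly"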

> Row cache is not an option for us. We expect going to disk, and key cache is
> the only cache that can help speed things up a little. We have wide rows so
> key cache is an un-expensive boost.

Ok, makes sense.

> This is why we schedule weekly major compaction. We update ALL rows every
> day, often over-writing previous values.

Ok - so you're definitely in a position to suffer more than most use
cases from data being spread over multiple sstables.

>> (5) In general the way I/O works, latency will skyrocket once you
>> start saturating your disks. As long as you're significantly below
>> full utilization of your disks, you'll see pretty stable and low
>> latencies. As you approach full saturation, the latencies will tend to
>> increase super-linearly. Once you're *above* saturation, your
>> latencies skyrocket and reads are dropped because the rate cannot be
>> sustained. This means that while latency is a great indicator to look
>> at to judge what the current user perceived behavior is, it is *not* a
>> good thing to look at to extrapolate resource demands or figure out
>> how far you are from saturation / need for more hardware.
>>
> This we can see with munin. We throttle the read load to avoid that "wall".

Do you have a sense of how many reads on disk you're taking per read
request to the node? Do you have a sense of the size of the active
set? A big question is going to be whether caching is effective at
all, and how much additional caching would help.

In any case, it would be interesting to know whether you are seeing
more disk seeks per read than you "should".

-- 
/ Peter Schuller

Re: Read Latency Degradation

Posted by Wayne <wa...@gmail.com>.
We are using XFS for the data volume. We are load testing now, and
compaction is way behind but weekly manual compaction should help catch
things up.

Smaller nodes just seem to fit the Cassandra architecture a lot better.
We cannot use cloud instances, so the cost for us to go to <500GB nodes
is prohibitive. Cassandra lumps all processes on the node together into
one bucket, which almost requires a smaller per-node data set. There are
no regions, tablets, or partitions created to throttle compaction and
prevent huge data files.

On Sat, Dec 18, 2010 at 5:24 AM, Peter Schuller
<peter.schuller@infidyne.com> wrote:

> > How many nodes? 10 - 16 cores each (2 x quad ht cpus)
> > How much ram per node? 24gb
> > What disks and how many? SATA 7200rpm 1x1tb for commit log, 4x1tb (raid0)
> > for data
> > Is your ring balanced? yes, random partitioned very evenly
> > How many column families? 4 CFs x 3 Keyspaces
> > How much ram is dedicated to cassandra? 12gb heap (probably too high?)
> > What type of caching are you using? Key caching
> > What are the sizes of caches? 500k-1m values for 2 of the CFs
> > What is the hit rate of caches? high, 90%+
> > What does your disk utiliztion|CPU|Memory look like at peak times? Disk
> goes
> > to 90%+ under heavy read load. CPU load high as well. Latency does not
> > change that much for single reads vs. under load (30 threads). We can
> keep
> > current read latency up to 25-30 read threads if no writes or compaction
> is
> > going on. We are worried about what we see in terms of latency for a
> single
> > read.
> > What are your average mean|max row size from cfstats? 30k avg/5meg max
> for
> > one CF and 311k avg/855k max for the other.
> > On average for a given sstable how large is the data bloom and index
> files?
> > 30gig data, 189k filter, 5.7meg index for one CF, 98gig data, 587k
> filter,
> > 18meg index for the other.
>
> I do want to make one point very clear: Regardless of whether or not
> you're running a perfectly configured Cassandra, or any other data
> base, whenever your total data set is beyond RAM you *will* take I/O
> load as a function of the locality of data access. There is just no
> way around that. If 10% of your read requests go to data that has not
> recently been read (the long tail), it won't be cached, and no storage
> system ever will magically make that read not go down to disk.
>
> There are definitely things to tweak. For example, making sure that
> the memory you do have for caching purposes is used as efficiently as
> possible to minimize the amount of reads that go down to disk. But in
> order to figure out what is going on you are going to have to gain an
> understanding of what the data access pattern is like. For example, if
> the hot set is very small and slowly changing, you may be able to have
> 100 TB per node and take the traffic without any difficulties. On the
> other hand if your data is accessed completely randomly, you may have
> trouble with a data set just twice the size of RAM (and most requests
> and up going down to disk).
>
> In the worst case access patterns where caching is completely
> ineffective (such as random access to a data set several times the
> size of RAM), it will be solely about disk IOPS and SSD:s are probably
> the way to go unless lots and lots of servers become cheaper
> (depending on data sizes mostly).
>
> Now, that said, the practical reality with Cassandra is not the
> theoretical optimum w.r.t. caching. Some things to consider:
>
> (1) Whatever memory you give the JVM is going to be a direct trade-off
> in the sense that said memory is no longer available for the operating
> system page cache. If you "waste" memory there for no good reason,
> you're just loosing caching potential.
>

I have considered dropping the heap down to 8gb, but having been pained
by many CMFs in the past I thought the larger heap should help prevent
stop-the-world GCs.

>
> (2) That said, the row cache in Cassandra is not affected by
> compaction/repair. Anything cached by the operating system page cache
> (anything not in row cache) will be affected, in terms of cache
> warmness, by compactions and repair operations. This means you have to
> expect a variation in cache locality whenever compaction/repair runs,
> in addition to the I/O load caused directly by same. (There is
> on-going work to improve this, but nothing usable right now.)
>

Row cache is not an option for us. We expect to go to disk, and key
cache is the only cache that can help speed things up a little. We have
wide rows, so key cache is an inexpensive boost.

>
> (3) Reads that do go down to disk will go down to all sstables that
> contain data for that *row* (the bloom filters are at the row key
> level). If you have reads to rows that span multiple sstables (because
> they are continually written for example) this means that sstable
> counts is very important. If you only write rows once and never touch
> them again, it is much much less of an issue.
>
This is why we schedule weekly major compaction. We update ALL rows every
day, often over-writing previous values.


> (4) There is a limit to bloom filter size that means that if you go
> above 143 M keys in a single sstable (I believe that is the number
> based on IRC conversations, I haven't checked) the bloom filter false
> positive rate will increase. Very significantly if you are much much
> bigger than 143 M keys. This leads to unnecessary reads from sstables
> that do not contain data for the row being requested. Given that you
> say you have lots of data, consider this. But note that only the row
> *count* matters, not the size (in terms of megabytes) of the sstables.
> Remember that on major compactions, all data in a column family is
> going to end up in a single sstable, so do consider the total row
> count (per node) here, not just what the sstable layout happens to be
> be now. (There is on-going work to remove this limitation w.r.t. row
> counts.)
>
Not an issue for us with wide rows.


> (5) In general the way I/O works, latency will skyrocket once you
> start saturating your disks. As long as you're significantly below
> full utilization of your disks, you'll see pretty stable and low
> latencies. As you approach full saturation, the latencies will tend to
> increase super-linearly. Once you're *above* saturation, your
> latencies skyrocket and reads are dropped because the rate cannot be
> sustained. This means that while latency is a great indicator to look
> at to judge what the current user perceived behavior is, it is *not* a
> good thing to look at to extrapolate resource demands or figure out
> how far you are from saturation / need for more hardware.
>
This we can see with munin. We throttle the read load to avoid that "wall".




> --
> / Peter Schuller
>

Re: Read Latency Degradation

Posted by Peter Schuller <pe...@infidyne.com>.
> How many nodes? 10 - 16 cores each (2 x quad ht cpus)
> How much ram per node? 24gb
> What disks and how many? SATA 7200rpm 1x1tb for commit log, 4x1tb (raid0)
> for data
> Is your ring balanced? yes, random partitioned very evenly
> How many column families? 4 CFs x 3 Keyspaces
> How much ram is dedicated to cassandra? 12gb heap (probably too high?)
> What type of caching are you using? Key caching
> What are the sizes of caches? 500k-1m values for 2 of the CFs
> What is the hit rate of caches? high, 90%+
> What does your disk utiliztion|CPU|Memory look like at peak times? Disk goes
> to 90%+ under heavy read load. CPU load high as well. Latency does not
> change that much for single reads vs. under load (30 threads). We can keep
> current read latency up to 25-30 read threads if no writes or compaction is
> going on. We are worried about what we see in terms of latency for a single
> read.
> What are your average mean|max row size from cfstats? 30k avg/5meg max for
> one CF and 311k avg/855k max for the other.
> On average for a given sstable how large is the data bloom and index files?
> 30gig data, 189k filter, 5.7meg index for one CF, 98gig data, 587k filter,
> 18meg index for the other.

I do want to make one point very clear: regardless of whether or not
you're running a perfectly configured Cassandra, or any other database,
whenever your total data set is beyond RAM you *will* take I/O load as
a function of the locality of data access. There is just no way around
that. If 10% of your read requests go to data that has not recently
been read (the long tail), it won't be cached, and no storage system
ever will magically make that read not go down to disk.

There are definitely things to tweak. For example, making sure that
the memory you do have for caching purposes is used as efficiently as
possible to minimize the amount of reads that go down to disk. But in
order to figure out what is going on you are going to have to gain an
understanding of what the data access pattern is like. For example, if
the hot set is very small and slowly changing, you may be able to have
100 TB per node and take the traffic without any difficulties. On the
other hand if your data is accessed completely randomly, you may have
trouble with a data set just twice the size of RAM (and most requests
end up going down to disk).

In the worst case access patterns, where caching is completely
ineffective (such as random access to a data set several times the
size of RAM), it will be solely about disk IOPS, and SSDs are probably
the way to go unless lots and lots of servers become cheaper
(depending mostly on data sizes).

Now, that said, the practical reality with Cassandra is not the
theoretical optimum w.r.t. caching. Some things to consider:

(1) Whatever memory you give the JVM is going to be a direct trade-off
in the sense that said memory is no longer available for the operating
system page cache. If you "waste" memory there for no good reason,
you're just losing caching potential.

(2) That said, the row cache in Cassandra is not affected by
compaction/repair. Anything cached by the operating system page cache
(anything not in row cache) will be affected, in terms of cache
warmness, by compactions and repair operations. This means you have to
expect a variation in cache locality whenever compaction/repair runs,
in addition to the I/O load caused directly by same. (There is
on-going work to improve this, but nothing usable right now.)

(3) Reads that do go down to disk will go down to all sstables that
contain data for that *row* (the bloom filters are at the row key
level). If you have reads to rows that span multiple sstables (because
they are continually written for example) this means that sstable
count is very important. If you only write rows once and never touch
them again, it is much much less of an issue.

(4) There is a limit to bloom filter size that means that if you go
above 143 M keys in a single sstable (I believe that is the number
based on IRC conversations, I haven't checked) the bloom filter false
positive rate will increase. Very significantly if you are much much
bigger than 143 M keys. This leads to unnecessary reads from sstables
that do not contain data for the row being requested. Given that you
say you have lots of data, consider this. But note that only the row
*count* matters, not the size (in terms of megabytes) of the sstables.
Remember that on major compactions, all data in a column family is
going to end up in a single sstable, so do consider the total row
count (per node) here, not just what the sstable layout happens to be
be now. (There is on-going work to remove this limitation w.r.t. row
counts.)
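
To put a rough number on it (the 15-buckets / ~10-hashes sizing below is
my recollection of the target, so treat the exact figures as
illustrative): with the usual approximation p = (1 - e^(-k/b))^k for k
hash functions and b buckets per element, dropping from 15 to 8 buckets
per element takes the false positive rate from roughly 0.07% to roughly
3.4% - and every false positive is a wasted disk read.

  # quick check of the numbers above (python used only as a calculator)
  python -c "import math; k=10.0; print([round((1-math.exp(-k/b))**k, 5) for b in (15.0, 8.0)])"
  # -> [0.00074, 0.03419]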

(5) In general the way I/O works, latency will skyrocket once you
start saturating your disks. As long as you're significantly below
full utilization of your disks, you'll see pretty stable and low
latencies. As you approach full saturation, the latencies will tend to
increase super-linearly. Once you're *above* saturation, your
latencies skyrocket and reads are dropped because the rate cannot be
sustained. This means that while latency is a great indicator to look
at to judge what the current user perceived behavior is, it is *not* a
good thing to look at to extrapolate resource demands or figure out
how far you are from saturation / need for more hardware.

-- 
/ Peter Schuller

Re: Read Latency Degradation

Posted by Edward Capriolo <ed...@gmail.com>.
On Fri, Dec 17, 2010 at 12:26 PM, Daniel Doubleday
<da...@gmx.net> wrote:
> How much ram is dedicated to cassandra? 12gb heap (probably too high?)
> What is the hit rate of caches? high, 90%+
>
> If your heap allows it I would definitely try to give more ram for fs cache.
> Your not using row cache so I don't see what cassandra would gain from so
> much memory.
> A question about your tests:
> I assume that they run isolated (you load test one cf at a time) and the
> results are the same byte-wise?
> So the only difference is that one time you are reading from a larger file?
> Do you see the same IO load in both tests? Do you use mem-mapped io? And if
> so are the number of page faults the same in both tests?
> In the end it could just be more physical movements of the disc heads with
> larger files ...
>
> On Dec 17, 2010, at 5:46 PM, Wayne wrote:
>
> Below are some answers to your questions. We have wide rows (what we like
> about Cassandra) and I wonder if that plays into this? We have been loading
> 1 keyspace in our cluster heavily in the last week so it is behind in
> compaction for that keyspace. I am not even looking at those read latency
> times as there are as many as 100+ sstables. Compaction will run tomorrow
> for all nodes (weekend is our slow time) and I will test the read latency
> there. For the keyspace/CFs that are already well compacted we are seeing a
> steady increase in read latency as the total sstable size grows and a linear
> relationship between our different keyspaces cfs sizes and the read latency
> for reads.
>
> How many nodes? 10 - 16 cores each (2 x quad ht cpus)
> How much ram per node? 24gb
> What disks and how many? SATA 7200rpm 1x1tb for commit log, 4x1tb (raid0)
> for data
> Is your ring balanced? yes, random partitioned very evenly
> How many column families? 4 CFs x 3 Keyspaces
> How much ram is dedicated to cassandra? 12gb heap (probably too high?)
> What type of caching are you using? Key caching
> What are the sizes of caches? 500k-1m values for 2 of the CFs
> What is the hit rate of caches? high, 90%+
> What does your disk utiliztion|CPU|Memory look like at peak times? Disk goes
> to 90%+ under heavy read load. CPU load high as well. Latency does not
> change that much for single reads vs. under load (30 threads). We can keep
> current read latency up to 25-30 read threads if no writes or compaction is
> going on. We are worried about what we see in terms of latency for a single
> read.
> What are your average mean|max row size from cfstats? 30k avg/5meg max for
> one CF and 311k avg/855k max for the other.
> On average for a given sstable how large is the data bloom and index files?
> 30gig data, 189k filter, 5.7meg index for one CF, 98gig data, 587k filter,
> 18meg index for the other.
>
> Thanks.
>
>
>
> On Fri, Dec 17, 2010 at 10:58 AM, Edward Capriolo <ed...@gmail.com>
> wrote:
>>
>> On Fri, Dec 17, 2010 at 8:21 AM, Wayne <wa...@gmail.com> wrote:
>> > We have been testing Cassandra for 6+ months and now have 10TB in 10
>> > nodes
>> > with rf=3. It is 100% real data generated by real code in an almost
>> > production level mode. We have gotten past all our stability issues,
>> > java/cmf issues, etc. etc. now to find the one thing we "assumed" may
>> > not be
>> > true. Our current production environment is mysql with extensive
>> > partitioning. We have mysql tables with 3-4 billion records and our
>> > query
>> > performance is the same as with 1 million records (< 100ms).
>> >
>> > For those of us really trying to manage large volumes of data memory is
>> > not
>> > an option in any stretch of the imagination. Our current data volume
>> > once
>> > placed within Cassandra ignoring growth should be around 50 TB. We run
>> > manual compaction once a week (absolutely required to keep ss table
>> > counts
>> > down) and it is taking a very long amount of time. Now that our nodes
>> > are
>> > past 1TB I am worried it will take more than a day. I was hoping
>> > everyone
>> > would respond to my posting with something must be wrong, but instead I
>> > am
>> > hearing you are off the charts good luck and be patient. Scary to say
>> > the
>> > least given our current investment in Cassandra. Is it true/expected
>> > that
>> > read latency will get worse in a linear fashion as the ss table size
>> > grows?
>> >
>> > Can anyone talk me off the fence here? We have 9 MySQL servers that now
>> > serve up 15+TB of data. Based on what we have seen we need 100 Cassandra
>> > nodes with rf=3 to give us good read latency (by keeping the node data
>> > sizes
>> > down). The cost/value equation just does not add up.
>> >
>> > Thanks in advance for any advice/experience you can provide.
>> >
>> >
>> > On Fri, Dec 17, 2010 at 5:07 AM, Daniel Doubleday
>> > <da...@gmx.net>
>> > wrote:
>> >>
>> >> On Dec 16, 2010, at 11:35 PM, Wayne wrote:
>> >>
>> >> > I have read that read latency goes up with the total data size, but
>> >> > to
>> >> > what degree should we expect a degradation in performance? What is
>> >> > the
>> >> > "normal" read latency range if there is such a thing for a small
>> >> > slice of
>> >> > scol/cols? Can we really put 2TB of data on a node and get good read
>> >> > latency
>> >> > querying data off of a handful of CFs? Any experience or explanations
>> >> > would
>> >> > be greatly appreciated.
>> >>
>> >> If you really mean 2TB per node I strongly advise you to perform
>> >> thorough
>> >> testing with real world column sizes and the read write load you
>> >> expect. Try
>> >> to load test at least with a test cluster / data that represents one
>> >> replication group. I.e. RF=3 -> 3 nodes. And test with the consistency
>> >> level
>> >> you want to use. Also test ring operations (repair, adding nodes,
>> >> moving
>> >> nodes) while under expected load/
>> >>
>> >> Combined with 'a handful of CFs' I would assume that you are expecting
>> >> a
>> >> considerable write load. You will get massive compaction load and with
>> >> that
>> >> data size the file system cache will suffer big time. You'll need loads
>> >> of
>> >> RAM and still ...
>> >>
>> >> I can only speak about 0.6 but ring management operations will become a
>> >> nightmare and you will have very long running repairs.
>> >>
>> >> The cluster behavior changes massively with different access patterns
>> >> (cold vs warm data) and data sizes. So you have to understand yours and
>> >> test
>> >> it. I think most generic load tests are mainly marketing instruments
>> >> and I
>> >> believe this is especially true for cassandra.
>> >>
>> >> Don't want to sound negative (I am a believer and don't regret our
>> >> investment) but cassandra is no silver bullet. You really need to know
>> >> what
>> >> you are doing.
>> >>
>> >> Cheers,
>> >> Daniel
>> >
>>
>> Yes major compactions for large sets of data do take a long time
>> (360GB takes me about 6 hours).
>>
>> You said "needing to compact to keep the sstable count low". This is
>> not a good sign. My sstable counts sawtooth between 8-15 per CF
>> through the day. If you are in a scenario where the SSTables are
>> growing all day and only catch up at night, and you have tuned
>> memtables, then your need more nodes likely. This means that your
>> cluster can not really keep up with your write traffic. You know
>> cassandra can take bursts of writes well, but if you are at the case
>> where your sstables count is getting higher you are essentially
>> failing behind. (You may not need 100 nodes like you are suggesting
>> but possibly a few to get you over the fence.)
>>
>> I do run major compactions at night, but not on every night on every
>> node. I do one a node a night to make sure these are splayed out over
>> the week, With deletes on non-major compactions you may not need to do
>> this, but we add and remove a lot of data per day so I find I have
>> to/should. Since the nights are quite for us anyway.
>>
>> As for how many nodes you need...What works out better ?
>> Big Iron: 1x (2 TB 64 GB RAM ) cost ? power ? Rack size ?
>> Small factor: 4x (500GB  16GB RAM) cost ? power ? Rack Size ?
>> Generally I think most are running the "small factor" type deployment,
>> and generally this works better by avoiding 2GB compactions!
>>
>> Is it true that read latency grows linearly with sstable size? No (but
>> it could be true in your case).
>>
>> As for your specific problems. More info is needed.
>>
>> How many nodes?
>> How much ram per node?
>> What disks and how many?
>> Is your ring balanced?
>> How many column families?
>> How much ram is dedicated to cassandra?
>> What type of caching are you using?
>> What are the sizes of caches?
>> What is the hit rate of caches?
>> What does your disk utiliztion|CPU|Memory look like at peak times?
>> What are your average mean|max row size from cfstats
>> On average for a given sstable how large is the data bloom and index
>> files?
>
>
>
I +1 many of Daniel's points.

Your setup is pretty good. I like having 24GB of RAM with 12GB used for
the JVM; that is in line with the general suggestion. You are getting a
good hit rate, but maybe you could get about the same hit rate with a
smaller cache, leaving more memory for the VFS cache.

If possible, try to get the bulk loading done in off hours. That
SSTable build-up is not a good sign; it likely means that you are in
compaction mode most of the time. You could try disabling compaction
during your bulk loads. That is a commonly suggested tuning, but if
your bulk load takes a long time and compaction is off, you get that
same SSTable build-up, so it is a wash.

I see a couple of things:
Your storage to RAM ratio is high. I know RAM is not cheap, but if you
have some lying around, try bumping a single machine up to 32GB or
higher. Do not change the cache/JVM settings, just add more RAM for the
VFS cache and see what type of improvement you get on that node. Every
use case is different, but I recently saw another NoSQL presentation
(not Cassandra) that was using 128GB of RAM to manage 1TB of data per
node! It would be interesting to hear what other people are doing with
respect to memory/disk size ratio!

I know your drive configuration is another thing that is hard to
change, but 7200 RPM drives leave a lot to be desired in terms of seek
time. Even with a great cache hit rate you are going to have to seek
around that RAID0. Have you tested the RAID card and your RAID setup
with iozone, bonnie++, etc.?
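
If not, a quick sanity check of the raw volume is worth doing before
blaming Cassandra (the path is a placeholder and the exact bonnie++
flag syntax may differ between versions; its working set should be
roughly 2x RAM):

  # sequential throughput on the RAID0 data volume
  dd if=/dev/zero of=/data/ddtest bs=1M count=8192 oflag=direct
  # fuller picture including seeks and rewrites
  bonnie++ -d /data -s 48g -n 0 -u nobody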

Try iostat -kd 5 and take the second or third sample. We know your disk
utilization is high, but what is that disk actually capable of?
For reference, this is my RAID setup at ~81% utilization:
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             550.60     21497.60      2240.80     107488      11204

Re: Read Latency Degradation

Posted by Daniel Doubleday <da...@gmx.net>.
> How much ram is dedicated to cassandra? 12gb heap (probably too high?)
> What is the hit rate of caches? high, 90%+

If your heap allows it I would definitely try to give more RAM to the fs cache. You're not using the row cache, so I don't see what Cassandra would gain from so much memory.

A question about your tests:

I assume that they run isolated (you load test one cf at a time) and the results are the same byte-wise?
So the only difference is that one time you are reading from a larger file?

Do you see the same IO load in both tests? Do you use memory-mapped IO? And if so, is the number of page faults the same in both tests?

In the end it could just be more physical movements of the disc heads with larger files ...


On Dec 17, 2010, at 5:46 PM, Wayne wrote:

> Below are some answers to your questions. We have wide rows (what we like about Cassandra) and I wonder if that plays into this? We have been loading 1 keyspace in our cluster heavily in the last week so it is behind in compaction for that keyspace. I am not even looking at those read latency times as there are as many as 100+ sstables. Compaction will run tomorrow for all nodes (weekend is our slow time) and I will test the read latency there. For the keyspace/CFs that are already well compacted we are seeing a steady increase in read latency as the total sstable size grows and a linear relationship between our different keyspaces cfs sizes and the read latency for reads.
> 
> How many nodes? 10 - 16 cores each (2 x quad ht cpus)
> How much ram per node? 24gb
> What disks and how many? SATA 7200rpm 1x1tb for commit log, 4x1tb (raid0) for data 
> Is your ring balanced? yes, random partitioned very evenly
> How many column families? 4 CFs x 3 Keyspaces
> How much ram is dedicated to cassandra? 12gb heap (probably too high?)
> What type of caching are you using? Key caching
> What are the sizes of caches? 500k-1m values for 2 of the CFs
> What is the hit rate of caches? high, 90%+
> What does your disk utilization|CPU|Memory look like at peak times? Disk goes to 90%+ under heavy read load. CPU load high as well. Latency does not change that much for single reads vs. under load (30 threads). We can keep current read latency up to 25-30 read threads if no writes or compaction is going on. We are worried about what we see in terms of latency for a single read.
> What are your average mean|max row sizes from cfstats? 30k avg/5meg max for one CF and 311k avg/855k max for the other.
> On average, for a given sstable, how large are the data, bloom filter, and index files? 30gig data, 189k filter, 5.7meg index for one CF, 98gig data, 587k filter, 18meg index for the other.
> 
> Thanks.
> 
> 
> 
> On Fri, Dec 17, 2010 at 10:58 AM, Edward Capriolo <ed...@gmail.com> wrote:
> On Fri, Dec 17, 2010 at 8:21 AM, Wayne <wa...@gmail.com> wrote:
> > We have been testing Cassandra for 6+ months and now have 10TB in 10 nodes
> > with rf=3. It is 100% real data generated by real code in an almost
> > production level mode. We have gotten past all our stability issues,
> > java/cmf issues, etc. etc. now to find the one thing we "assumed" may not be
> > true. Our current production environment is mysql with extensive
> > partitioning. We have mysql tables with 3-4 billion records and our query
> > performance is the same as with 1 million records (< 100ms).
> >
> > For those of us really trying to manage large volumes of data memory is not
> > an option in any stretch of the imagination. Our current data volume once
> > placed within Cassandra ignoring growth should be around 50 TB. We run
> > manual compaction once a week (absolutely required to keep ss table counts
> > down) and it is taking a very long amount of time. Now that our nodes are
> > past 1TB I am worried it will take more than a day. I was hoping everyone
> > would respond to my posting with something must be wrong, but instead I am
> > hearing you are off the charts good luck and be patient. Scary to say the
> > least given our current investment in Cassandra. Is it true/expected that
> > read latency will get worse in a linear fashion as the ss table size grows?
> >
> > Can anyone talk me off the fence here? We have 9 MySQL servers that now
> > serve up 15+TB of data. Based on what we have seen we need 100 Cassandra
> > nodes with rf=3 to give us good read latency (by keeping the node data sizes
> > down). The cost/value equation just does not add up.
> >
> > Thanks in advance for any advice/experience you can provide.
> >
> >
> > On Fri, Dec 17, 2010 at 5:07 AM, Daniel Doubleday <da...@gmx.net>
> > wrote:
> >>
> >> On Dec 16, 2010, at 11:35 PM, Wayne wrote:
> >>
> >> > I have read that read latency goes up with the total data size, but to
> >> > what degree should we expect a degradation in performance? What is the
> >> > "normal" read latency range if there is such a thing for a small slice of
> >> > scol/cols? Can we really put 2TB of data on a node and get good read latency
> >> > querying data off of a handful of CFs? Any experience or explanations would
> >> > be greatly appreciated.
> >>
> >> If you really mean 2TB per node I strongly advise you to perform thorough
> >> testing with real world column sizes and the read write load you expect. Try
> >> to load test at least with a test cluster / data that represents one
> >> replication group. I.e. RF=3 -> 3 nodes. And test with the consistency level
> >> you want to use. Also test ring operations (repair, adding nodes, moving
> >> nodes) while under expected load.
> >>
> >> Combined with 'a handful of CFs' I would assume that you are expecting a
> >> considerable write load. You will get massive compaction load and with that
> >> data size the file system cache will suffer big time. You'll need loads of
> >> RAM and still ...
> >>
> >> I can only speak about 0.6 but ring management operations will become a
> >> nightmare and you will have very long running repairs.
> >>
> >> The cluster behavior changes massively with different access patterns
> >> (cold vs warm data) and data sizes. So you have to understand yours and test
> >> it. I think most generic load tests are mainly marketing instruments and I
> >> believe this is especially true for cassandra.
> >>
> >> Don't want to sound negative (I am a believer and don't regret our
> >> investment) but cassandra is no silver bullet. You really need to know what
> >> you are doing.
> >>
> >> Cheers,
> >> Daniel
> >
> 
> Yes major compactions for large sets of data do take a long time
> (360GB takes me about 6 hours).
> 
> You said "needing to compact to keep the sstable count low". This is
> not a good sign. My sstable counts sawtooth between 8-15 per CF
> through the day. If you are in a scenario where the SSTables are
> growing all day and only catch up at night, and you have tuned
> memtables, then you likely need more nodes. This means that your
> cluster cannot really keep up with your write traffic. Cassandra can
> take bursts of writes well, but if your sstable count keeps climbing
> you are essentially falling behind. (You may not need 100 nodes like
> you are suggesting, but possibly a few more to get you over the
> fence.)
> 
> I do run major compactions at night, but not every night on every
> node. I do one node a night to make sure these are splayed out over
> the week. With deletes handled on non-major compactions you may not
> need to do this, but we add and remove a lot of data per day so I
> find I have to/should, since the nights are quiet for us anyway.
> 
> As for how many nodes you need... what works out better?
> Big Iron: 1x (2TB, 64GB RAM) - cost? power? rack size?
> Small form factor: 4x (500GB, 16GB RAM) - cost? power? rack size?
> Generally I think most are running the "small form factor" type of
> deployment, and generally this works better by avoiding 2GB compactions!
> 
> Is it true that read latency grows linearly with sstable size? No (but
> it could be true in your case).
> 
> As for your specific problems. More info is needed.
> 
> How many nodes?
> How much ram per node?
> What disks and how many?
> Is your ring balanced?
> How many column families?
> How much ram is dedicated to cassandra?
> What type of caching are you using?
> What are the sizes of caches?
> What is the hit rate of caches?
> What does your disk utilization|CPU|Memory look like at peak times?
> What are your average mean|max row sizes from cfstats?
> On average, for a given sstable, how large are the data, bloom filter, and index files?
> 


Re: Read Latency Degradation

Posted by Wayne <wa...@gmail.com>.
Below are some answers to your questions. We have wide rows (what we like
about Cassandra) and I wonder if that plays into this? We have been loading
1 keyspace in our cluster heavily in the last week so it is behind in
compaction for that keyspace. I am not even looking at those read latency
times as there are as many as 100+ sstables. Compaction will run tomorrow
for all nodes (weekend is our slow time) and I will test the read latency
there. For the keyspace/CFs that are already well compacted we are seeing a
steady increase in read latency as the total sstable size grows and a linear
relationship between our different keyspaces' CF sizes and their read
latencies.

How many nodes? 10 - 16 cores each (2 x quad ht cpus)
How much ram per node? 24gb
What disks and how many? SATA 7200rpm 1x1tb for commit log, 4x1tb (raid0)
for data
Is your ring balanced? yes, random partitioned very evenly
How many column families? 4 CFs x 3 Keyspaces
How much ram is dedicated to cassandra? 12gb heap (probably too high?)
What type of caching are you using? Key caching
What are the sizes of caches? 500k-1m values for 2 of the CFs
What is the hit rate of caches? high, 90%+
What does your disk utilization|CPU|Memory look like at peak times? Disk goes
to 90%+ under heavy read load. CPU load high as well. Latency does not
change *that* much for single reads vs. under load (30 threads). We can keep
current read latency up to 25-30 read threads if no writes or compaction is
going on. We are worried about what we see in terms of latency for a single
read.
What are your average mean|max row sizes from cfstats? 30k avg/5meg max for
one CF and 311k avg/855k max for the other.
On average, for a given sstable, how large are the data, bloom filter, and index files?
30gig data, 189k filter, 5.7meg index for one CF, 98gig data, 587k filter,
18meg index for the other.
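
(For tracking those component sizes over time, a rough sketch assuming
the default 0.6 data directory and an example keyspace name:

  ls -lh /var/lib/cassandra/data/Keyspace1/ | egrep 'Data|Index|Filter'

the per-sstable files show up as <CF>-<gen>-Data.db, -Index.db and
-Filter.db.)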

Thanks.



On Fri, Dec 17, 2010 at 10:58 AM, Edward Capriolo <ed...@gmail.com> wrote:

> On Fri, Dec 17, 2010 at 8:21 AM, Wayne <wa...@gmail.com> wrote:
> > We have been testing Cassandra for 6+ months and now have 10TB in 10
> nodes
> > with rf=3. It is 100% real data generated by real code in an almost
> > production level mode. We have gotten past all our stability issues,
> > java/cmf issues, etc. etc. now to find the one thing we "assumed" may not
> be
> > true. Our current production environment is mysql with extensive
> > partitioning. We have mysql tables with 3-4 billion records and our query
> > performance is the same as with 1 million records (< 100ms).
> >
> > For those of us really trying to manage large volumes of data memory is
> not
> > an option in any stretch of the imagination. Our current data volume once
> > placed within Cassandra ignoring growth should be around 50 TB. We run
> > manual compaction once a week (absolutely required to keep ss table
> counts
> > down) and it is taking a very long amount of time. Now that our nodes are
> > past 1TB I am worried it will take more than a day. I was hoping everyone
> > would respond to my posting with something must be wrong, but instead I
> am
> > hearing you are off the charts good luck and be patient. Scary to say the
> > least given our current investment in Cassandra. Is it true/expected that
> > read latency will get worse in a linear fashion as the ss table size
> grows?
> >
> > Can anyone talk me off the fence here? We have 9 MySQL servers that now
> > serve up 15+TB of data. Based on what we have seen we need 100 Cassandra
> > nodes with rf=3 to give us good read latency (by keeping the node data
> sizes
> > down). The cost/value equation just does not add up.
> >
> > Thanks in advance for any advice/experience you can provide.
> >
> >
> > On Fri, Dec 17, 2010 at 5:07 AM, Daniel Doubleday <
> daniel.doubleday@gmx.net>
> > wrote:
> >>
> >> On Dec 16, 2010, at 11:35 PM, Wayne wrote:
> >>
> >> > I have read that read latency goes up with the total data size, but to
> >> > what degree should we expect a degradation in performance? What is the
> >> > "normal" read latency range if there is such a thing for a small slice
> of
> >> > scol/cols? Can we really put 2TB of data on a node and get good read
> latency
> >> > querying data off of a handful of CFs? Any experience or explanations
> would
> >> > be greatly appreciated.
> >>
> >> If you really mean 2TB per node I strongly advise you to perform
> thorough
> >> testing with real world column sizes and the read write load you expect.
> Try
> >> to load test at least with a test cluster / data that represents one
> >> replication group. I.e. RF=3 -> 3 nodes. And test with the consistency
> level
> >> you want to use. Also test ring operations (repair, adding nodes, moving
> >> nodes) while under expected load.
> >>
> >> Combined with 'a handful of CFs' I would assume that you are expecting a
> >> considerable write load. You will get massive compaction load and with
> that
> >> data size the file system cache will suffer big time. You'll need loads
> of
> >> RAM and still ...
> >>
> >> I can only speak about 0.6 but ring management operations will become a
> >> nightmare and you will have very long running repairs.
> >>
> >> The cluster behavior changes massively with different access patterns
> >> (cold vs warm data) and data sizes. So you have to understand yours and
> test
> >> it. I think most generic load tests are mainly marketing instruments and
> I
> >> believe this is especially true for cassandra.
> >>
> >> Don't want to sound negative (I am a believer and don't regret our
> >> investment) but cassandra is no silver bullet. You really need to know
> what
> >> you are doing.
> >>
> >> Cheers,
> >> Daniel
> >
>
> Yes major compactions for large sets of data do take a long time
> (360GB takes me about 6 hours).
>
> You said "needing to compact to keep the sstable count low". This is
> not a good sign. My sstable counts sawtooth between 8-15 per CF
> through the day. If you are in a scenario where the SSTables are
> growing all day and only catch up at night, and you have tuned
> memtables, then you likely need more nodes. This means that your
> cluster cannot really keep up with your write traffic. Cassandra can
> take bursts of writes well, but if your sstable count keeps climbing
> you are essentially falling behind. (You may not need 100 nodes like
> you are suggesting, but possibly a few more to get you over the
> fence.)
>
> I do run major compactions at night, but not every night on every
> node. I do one node a night to make sure these are splayed out over
> the week. With deletes handled on non-major compactions you may not
> need to do this, but we add and remove a lot of data per day so I
> find I have to/should, since the nights are quiet for us anyway.
>
> As for how many nodes you need... what works out better?
> Big Iron: 1x (2TB, 64GB RAM) - cost? power? rack size?
> Small form factor: 4x (500GB, 16GB RAM) - cost? power? rack size?
> Generally I think most are running the "small form factor" type of
> deployment, and generally this works better by avoiding 2GB compactions!
>
> Is it true that read latency grows linearly with sstable size? No (but
> it could be true in your case).
>
> As for your specific problems. More info is needed.
>
> How many nodes?
> How much ram per node?
> What disks and how many?
> Is your ring balanced?
> How many column families?
> How much ram is dedicated to cassandra?
> What type of caching are you using?
> What are the sizes of caches?
> What is the hit rate of caches?
> What does your disk utilization|CPU|Memory look like at peak times?
> What are your average mean|max row sizes from cfstats?
> On average, for a given sstable, how large are the data, bloom filter, and index files?
>

Re: Read Latency Degradation

Posted by Edward Capriolo <ed...@gmail.com>.
On Fri, Dec 17, 2010 at 8:21 AM, Wayne <wa...@gmail.com> wrote:
> We have been testing Cassandra for 6+ months and now have 10TB in 10 nodes
> with rf=3. It is 100% real data generated by real code in an almost
> production level mode. We have gotten past all our stability issues,
> java/cmf issues, etc. etc. now to find the one thing we "assumed" may not be
> true. Our current production environment is mysql with extensive
> partitioning. We have mysql tables with 3-4 billion records and our query
> performance is the same as with 1 million records (< 100ms).
>
> For those of us really trying to manage large volumes of data memory is not
> an option in any stretch of the imagination. Our current data volume once
> placed within Cassandra ignoring growth should be around 50 TB. We run
> manual compaction once a week (absolutely required to keep ss table counts
> down) and it is taking a very long amount of time. Now that our nodes are
> past 1TB I am worried it will take more than a day. I was hoping everyone
> would respond to my posting with something must be wrong, but instead I am
> hearing you are off the charts good luck and be patient. Scary to say the
> least given our current investment in Cassandra. Is it true/expected that
> read latency will get worse in a linear fashion as the ss table size grows?
>
> Can anyone talk me off the fence here? We have 9 MySQL servers that now
> serve up 15+TB of data. Based on what we have seen we need 100 Cassandra
> nodes with rf=3 to give us good read latency (by keeping the node data sizes
> down). The cost/value equation just does not add up.
>
> Thanks in advance for any advice/experience you can provide.
>
>
> On Fri, Dec 17, 2010 at 5:07 AM, Daniel Doubleday <da...@gmx.net>
> wrote:
>>
>> On Dec 16, 2010, at 11:35 PM, Wayne wrote:
>>
>> > I have read that read latency goes up with the total data size, but to
>> > what degree should we expect a degradation in performance? What is the
>> > "normal" read latency range if there is such a thing for a small slice of
>> > scol/cols? Can we really put 2TB of data on a node and get good read latency
>> > querying data off of a handful of CFs? Any experience or explanations would
>> > be greatly appreciated.
>>
>> If you really mean 2TB per node I strongly advise you to perform thorough
>> testing with real world column sizes and the read write load you expect. Try
>> to load test at least with a test cluster / data that represents one
>> replication group. I.e. RF=3 -> 3 nodes. And test with the consistency level
>> you want to use. Also test ring operations (repair, adding nodes, moving
>> nodes) while under expected load.
>>
>> Combined with 'a handful of CFs' I would assume that you are expecting a
>> considerable write load. You will get massive compaction load and with that
>> data size the file system cache will suffer big time. You'll need loads of
>> RAM and still ...
>>
>> I can only speak about 0.6 but ring management operations will become a
>> nightmare and you will have very long running repairs.
>>
>> The cluster behavior changes massively with different access patterns
>> (cold vs warm data) and data sizes. So you have to understand yours and test
>> it. I think most generic load tests are mainly marketing instruments and I
>> believe this is especially true for cassandra.
>>
>> Don't want to sound negative (I am a believer and don't regret our
>> investment) but cassandra is no silver bullet. You really need to know what
>> you are doing.
>>
>> Cheers,
>> Daniel
>

Yes major compactions for large sets of data do take a long time
(360GB takes me about 6 hours).

You said "needing to compact to keep the sstable count low". This is
not a good sign. My sstable counts sawtooth between 8-15 per CF
through the day. If you are in a scenario where the SSTables are
growing all day and only catch up at night, and you have tuned
memtables, then your need more nodes likely. This means that your
cluster can not really keep up with your write traffic. You know
cassandra can take bursts of writes well, but if you are at the case
where your sstables count is getting higher you are essentially
failing behind. (You may not need 100 nodes like you are suggesting
but possibly a few to get you over the fence.)

I do run major compactions at night, but not every night on every
node. I do one node a night to make sure these are splayed out over
the week. With deletes handled on non-major compactions you may not
need to do this, but we add and remove a lot of data per day so I
find I have to/should, since the nights are quiet for us anyway.
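
The staggering itself is easy to do from cron, roughly like this (a
sketch only: the schedule, paths and JMX port are examples, and
nodetool's option syntax differs between 0.6 and 0.7, so check yours):

  # on node 3 of the ring: major compaction every Wednesday at 01:00
  0 1 * * 3  /opt/cassandra/bin/nodetool -h localhost -p 8080 compact >> /var/log/cassandra/compact.log 2>&1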

As for how many nodes you need... what works out better?
Big Iron: 1x (2TB, 64GB RAM) - cost? power? rack size?
Small form factor: 4x (500GB, 16GB RAM) - cost? power? rack size?
Generally I think most are running the "small form factor" type of
deployment, and generally this works better by avoiding 2GB compactions!

Is it true that read latency grows linearly with sstable size? No (but
it could be true in your case).

As for your specific problems. More info is needed.

How many nodes?
How much ram per node?
What disks and how many?
Is your ring balanced?
How many column families?
How much ram is dedicated to cassandra?
What type of caching are you using?
What are the sizes of caches?
What is the hit rate of caches?
What does your disk utilization|CPU|Memory look like at peak times?
What are your average mean|max row sizes from cfstats?
On average, for a given sstable, how large are the data, bloom filter, and index files?

Re: Single Column vs Multiple Columns

Posted by Jonathan Ellis <jb...@gmail.com>.
The performance difference is negligible and the other drawbacks are
significant, e.g., losing the ability to create indexes on a column.
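
To make the two options concrete in cassandra-cli terms (0.6-style
syntax; the keyspace, CF and column names below are made up):

One opaque column holding the whole serialized record:

  set Keyspace1.Users['user123']['blob'] = 'all fields serialized into one value'

versus one column per field:

  set Keyspace1.Users['user123']['first'] = 'John'
  set Keyspace1.Users['user123']['last'] = 'Smith'
  get Keyspace1.Users['user123']

The single-column layout reads and writes one value per key, but you
give up per-column operations such as the column indexes just mentioned.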

On Fri, Dec 17, 2010 at 11:49 AM, Ericson, Doug
<do...@zoominfo.com> wrote:

>  I have a data model where each read and each write will read and write
> all data for a given key.  Thus, does it make more sense to separate the
> data into multiple columns, or does it make more sense to store all data in
> a single column?
>
>
>
> My understanding is that breaking data into multiple columns is an
> optimization for reading and writing a single column. If all data is updated
> for every write, and read for every read, then having all data in a single
> column offers better performance, is this a correct assumption?
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com

Single Column vs Multiple Columns

Posted by "Ericson, Doug" <do...@zoominfo.com>.
I have a data model where each read and each write will read and write all data for a given key.  Thus, does it make more sense to separate the data into multiple columns, or does it make more sense to store all data in a single column?

My understanding is that breaking data into multiple columns is an optimization for reading and writing a single column. If all data is updated for every write, and read for every read, then having all data in a single column offers better performance, is this a correct assumption?

Re: Read Latency Degradation

Posted by Wayne <wa...@gmail.com>.
We have been testing Cassandra for 6+ months and now have 10TB in 10 nodes
with rf=3. It is 100% real data generated by real code in an almost
production level mode. We have gotten past all our stability issues,
java/cmf issues, etc. etc. now to find the one thing we "assumed" may not be
true. Our current production environment is mysql with extensive
partitioning. We have mysql tables with 3-4 billion records and our query
performance is the same as with 1 million records (< 100ms).

For those of us really trying to manage large volumes of data memory is not
an option in any stretch of the imagination. Our current data volume once
placed within Cassandra ignoring growth should be around 50 TB. We run
manual compaction once a week (absolutely required to keep sstable counts
down) and it is taking a very long time. Now that our nodes are past 1TB
I am worried it will take more than a day. I was hoping everyone would
respond to my posting with "something must be wrong", but instead I am
hearing "you are off the charts, good luck and be patient". Scary to say
the least given our current investment in Cassandra. Is it true/expected
that read latency will get worse in a linear fashion as the sstable size
grows?

Can anyone talk me off the fence here? We have 9 MySQL servers that now
serve up 15+TB of data. Based on what we have seen we need 100 Cassandra
nodes with rf=3 to give us good read latency (by keeping the node data sizes
down). The cost/value equation just does not add up.

Thanks in advance for any advice/experience you can provide.


On Fri, Dec 17, 2010 at 5:07 AM, Daniel Doubleday
<da...@gmx.net> wrote:

>
> On Dec 16, 2010, at 11:35 PM, Wayne wrote:
>
> > I have read that read latency goes up with the total data size, but to
> what degree should we expect a degradation in performance? What is the
> "normal" read latency range if there is such a thing for a small slice of
> scol/cols? Can we really put 2TB of data on a node and get good read latency
> querying data off of a handful of CFs? Any experience or explanations would
> be greatly appreciated.
>
> If you really mean 2TB per node I strongly advise you to perform thorough
> testing with real world column sizes and the read write load you expect. Try
> to load test at least with a test cluster / data that represents one
> replication group. I.e. RF=3 -> 3 nodes. And test with the consistency level
> you want to use. Also test ring operations (repair, adding nodes, moving
> nodes) while under expected load.
>
> Combined with 'a handful of CFs' I would assume that you are expecting a
> considerable write load. You will get massive compaction load and with that
> data size the file system cache will suffer big time. You'll need loads of
> RAM and still ...
>
> I can only speak about 0.6 but ring management operations will become a
> nightmare and you will have very long running repairs.
>
> The cluster behavior changes massively with different access patterns (cold
> vs warm data) and data sizes. So you have to understand yours and test it. I
> think most generic load tests are mainly marketing instruments and I believe
> this is especially true for cassandra.
>
> Don't want to sound negative (I am a believer and don't regret our
> investment) but cassandra is no silver bullet. You really need to know what
> you are doing.
>
> Cheers,
> Daniel

Re: Read Latency Degradation

Posted by Daniel Doubleday <da...@gmx.net>.
On Dec 16, 2010, at 11:35 PM, Wayne wrote:

> I have read that read latency goes up with the total data size, but to what degree should we expect a degradation in performance? What is the "normal" read latency range if there is such a thing for a small slice of scol/cols? Can we really put 2TB of data on a node and get good read latency querying data off of a handful of CFs? Any experience or explanations would be greatly appreciated. 

If you really mean 2TB per node I strongly advise you to perform thorough testing with real-world column sizes and the read/write load you expect. Try to load test at least with a test cluster / data that represents one replication group, i.e. RF=3 -> 3 nodes. And test with the consistency level you want to use. Also test ring operations (repair, adding nodes, moving nodes) while under expected load.
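
Concretely, that means running things like the following against the
test ring while your load generator is going (just a sketch: host
names and the token are placeholders, and nodetool commands/flags vary
by version, so check what your build supports):

  nodetool -h testnode1 -p 8080 repair
  nodetool -h testnode1 -p 8080 cleanup
  nodetool -h testnode1 -p 8080 move <new-token>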

Combined with 'a handful of CFs' I would assume that you are expecting a considerable write load. You will get massive compaction load and with that data size the file system cache will suffer big time. You'll need loads of RAM and still ...

I can only speak about 0.6 but ring management operations will become a nightmare and you will have very long running repairs. 

The cluster behavior changes massively with different access patterns (cold vs warm data) and data sizes. So you have to understand yours and test it. I think most generic load tests are mainly marketing instruments and I believe this is especially true for cassandra. 

Don't want to sound negative (I am a believer and don't regret our investment) but cassandra is no silver bullet. You really need to know what you are doing.

Cheers,
Daniel