Posted to user@cassandra.apache.org by "Freeman, Tim" <ti...@hp.com> on 2009/12/01 19:11:28 UTC

Persistently increasing read latency

In an 8 hour test run, I've seen the read latency for Cassandra drift fairly linearly from ~460ms to ~900ms.  Eventually my application gets starved for reads and starts misbehaving.  I have attached graphs -- horizontal scales are seconds, vertical scales are operations per minute and average milliseconds per operation.  The clearest feature is the light blue line in the left graph drifting consistently upward during the run.

I have a Cassandra 0.4.1 database, one node, records are 100kbytes each, 350K records, 8 threads reading, around 700 reads per minute.  There are also 8 threads writing.  This is all happening on a 4 core processor that's supporting both the Cassandra node and the code that's generating load for it.  I'm reasonably sure that there are no page faults.

I have attached my storage-conf.xml.  Briefly, it has default values, except RpcTimeoutInMillis is 30000 and the partitioner is OrderPreservingPartitioner.  Cassandra's garbage collection parameters are:

   -Xms128m -Xmx1G -XX:SurvivorRatio=8 -XX:+AggressiveOpts -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled
 
Is this normal behavior?  Is there some change to the configuration I should make to get it to stop getting slower?  If it's not normal, what debugging information should I gather?  Should I give up on Cassandra 0.4.1 and move to a newer version?

I'll leave it running for the time being in case there's something useful to extract from it.

Tim Freeman
Email: tim.freeman@hp.com
Desk in Palo Alto: (650) 857-2581
Home: (408) 774-1298
Cell: (408) 348-7536 (No reception business hours Monday, Tuesday, and Thursday; call my desk instead.)


Re: Persistently increasing read latency

Posted by Jonathan Ellis <jb...@gmail.com>.
I'm only reporting what trunk is like right now, not what it will be
in the future.  Trunk has been buggy before and will be again, don't
worry. :)

On Wed, Dec 2, 2009 at 2:57 PM, Ian Holsman <ia...@holsman.net> wrote:
> hmm.
> doesn't that leave the trunk in a bad position in terms of new development?
> you may go through times when a major feature lands and trunk is broken/buggy.
> or are you planning on building new features on a branch and then merging into trunk when it's stable?
>
> On Dec 3, 2009, at 5:32 AM, Jonathan Ellis wrote:
>
>> We are using trunk.  0.5 beta / trunk is better than 0.4 at the 0.4
>> functionality and IMO is production ready (although you should always
>> test first), but I would not yet rely on the new stuff (bootstrap,
>> loadbalance, and moving nodes around in general).
>>
>> -Jonathan
>>
>> On Wed, Dec 2, 2009 at 12:26 PM, Adam Fisk <a...@littleshoot.org> wrote:
>>> Helpful thread guys. In general, Jonathan, would you recommend
>>> building from trunk for new deployments at our current snapshot in
>>> time? Are you using trunk at Rackspace?
>>>
>>> Thanks.
>>>
>>> -Adam
>>>
>>>
>>> On Tue, Dec 1, 2009 at 6:18 PM, Jonathan Ellis <jb...@gmail.com> wrote:
>>>> On Tue, Dec 1, 2009 at 7:31 PM, Freeman, Tim <ti...@hp.com> wrote:
>>>>> Looking at the Cassandra mbean's, the attributes of ROW-MUTATION-STAGE and ROW-READ-STAGE and RESPONSE-STAGE are all  less than 10.  MINOR-COMPACTION-POOL reports 1218 pending tasks.
>>>>
>>>> That's probably the culprit right there.  Something is wrong if you
>>>> have 1200 pending compactions.
>>>>
>>>> This is something that upgrading to trunk will help with right away
>>>> since we parallelize compactions there.
>>>>
>>>> Another thing you can do is increase the memtable limits so you are
>>>> not flushing + compacting so often with your insert traffic.
>>>>
>>>> -Jonathan
>>>>
>>>
>>>
>>>
>>> --
>>> Adam Fisk
>>> http://www.littleshoot.org | http://adamfisk.wordpress.com |
>>> http://twitter.com/adamfisk
>>>
>
> --
> Ian Holsman
> Ian@Holsman.net
>
>
>
>

Re: Persistently increasing read latency

Posted by Ian Holsman <ia...@holsman.net>.
hmm.
doesn't that leave the trunk in a bad position in terms of new development?
you may go through times when a major feature lands and trunk is broken/buggy.
or are you planning on building new features on a branch and then merging into trunk when it's stable?

On Dec 3, 2009, at 5:32 AM, Jonathan Ellis wrote:

> We are using trunk.  0.5 beta / trunk is better than 0.4 at the 0.4
> functionality and IMO is production ready (although you should always
> test first), but I would not yet rely on the new stuff (bootstrap,
> loadbalance, and moving nodes around in general).
> 
> -Jonathan
> 
> On Wed, Dec 2, 2009 at 12:26 PM, Adam Fisk <a...@littleshoot.org> wrote:
>> Helpful thread guys. In general, Jonathan, would you recommend
>> building from trunk for new deployments at our current snapshot in
>> time? Are you using trunk at Rackspace?
>> 
>> Thanks.
>> 
>> -Adam
>> 
>> 
>> On Tue, Dec 1, 2009 at 6:18 PM, Jonathan Ellis <jb...@gmail.com> wrote:
>>> On Tue, Dec 1, 2009 at 7:31 PM, Freeman, Tim <ti...@hp.com> wrote:
>>>> Looking at the Cassandra mbean's, the attributes of ROW-MUTATION-STAGE and ROW-READ-STAGE and RESPONSE-STAGE are all  less than 10.  MINOR-COMPACTION-POOL reports 1218 pending tasks.
>>> 
>>> That's probably the culprit right there.  Something is wrong if you
>>> have 1200 pending compactions.
>>> 
>>> This is something that upgrading to trunk will help with right away
>>> since we parallelize compactions there.
>>> 
>>> Another thing you can do is increase the memtable limits so you are
>>> not flushing + compacting so often with your insert traffic.
>>> 
>>> -Jonathan
>>> 
>> 
>> 
>> 
>> --
>> Adam Fisk
>> http://www.littleshoot.org | http://adamfisk.wordpress.com |
>> http://twitter.com/adamfisk
>> 

--
Ian Holsman
Ian@Holsman.net




Re: Persistently increasing read latency

Posted by Jonathan Ellis <jb...@gmail.com>.
We are using trunk.  0.5 beta / trunk is better than 0.4 at the 0.4
functionality and IMO is production ready (although you should always
test first), but I would not yet rely on the new stuff (bootstrap,
loadbalance, and moving nodes around in general).

-Jonathan

On Wed, Dec 2, 2009 at 12:26 PM, Adam Fisk <a...@littleshoot.org> wrote:
> Helpful thread guys. In general, Jonathan, would you recommend
> building from trunk for new deployments at our current snapshot in
> time? Are you using trunk at Rackspace?
>
> Thanks.
>
> -Adam
>
>
> On Tue, Dec 1, 2009 at 6:18 PM, Jonathan Ellis <jb...@gmail.com> wrote:
>> On Tue, Dec 1, 2009 at 7:31 PM, Freeman, Tim <ti...@hp.com> wrote:
>>> Looking at the Cassandra mbean's, the attributes of ROW-MUTATION-STAGE and ROW-READ-STAGE and RESPONSE-STAGE are all  less than 10.  MINOR-COMPACTION-POOL reports 1218 pending tasks.
>>
>> That's probably the culprit right there.  Something is wrong if you
>> have 1200 pending compactions.
>>
>> This is something that upgrading to trunk will help with right away
>> since we parallelize compactions there.
>>
>> Another thing you can do is increase the memtable limits so you are
>> not flushing + compacting so often with your insert traffic.
>>
>> -Jonathan
>>
>
>
>
> --
> Adam Fisk
> http://www.littleshoot.org | http://adamfisk.wordpress.com |
> http://twitter.com/adamfisk
>

Re: Persistently increasing read latency

Posted by Adam Fisk <a...@littleshoot.org>.
Helpful thread guys. In general, Jonathan, would you recommend
building from trunk for new deployments at our current snapshot in
time? Are you using trunk at Rackspace?

Thanks.

-Adam


On Tue, Dec 1, 2009 at 6:18 PM, Jonathan Ellis <jb...@gmail.com> wrote:
> On Tue, Dec 1, 2009 at 7:31 PM, Freeman, Tim <ti...@hp.com> wrote:
>> Looking at the Cassandra mbean's, the attributes of ROW-MUTATION-STAGE and ROW-READ-STAGE and RESPONSE-STAGE are all  less than 10.  MINOR-COMPACTION-POOL reports 1218 pending tasks.
>
> That's probably the culprit right there.  Something is wrong if you
> have 1200 pending compactions.
>
> This is something that upgrading to trunk will help with right away
> since we parallelize compactions there.
>
> Another thing you can do is increase the memtable limits so you are
> not flushing + compacting so often with your insert traffic.
>
> -Jonathan
>



-- 
Adam Fisk
http://www.littleshoot.org | http://adamfisk.wordpress.com |
http://twitter.com/adamfisk

Re: Persistently increasing read latency

Posted by Jonathan Ellis <jb...@gmail.com>.
if you have a backlog of several hundred files it will take a while to
get back down to normal.

On Thu, Dec 3, 2009 at 4:00 PM, B. Todd Burruss <bb...@real.com> wrote:
> i saw those too, but the number of files don't decrease, should they?
> and the read performance still isn't good.
>
> i'll run the test again tonight with the latest from trunk
>
>
> On Thu, 2009-12-03 at 15:56 -0600, Jonathan Ellis wrote:
>> the lines like
>>
>> INFO [COMPACTION-POOL:1] [ColumnFamilyStore.java:932] Compacted to
>> /home/bburruss/cassandra.884232-0.5.0-beta1/btoddb/data/uds/bucket-443-Data.db.
>>
>> would seem to indicate that it's working.
>>
>> On Thu, Dec 3, 2009 at 3:00 PM, B. Todd Burruss <bb...@real.com> wrote:
>> > correct.  i just bounced again, ran:
>> >
>> >   bin/nodeprobe -host gen-app02.dev.real.com -port 8080 compact
>> >
>> > this time i waited long enough to let it work.  a lot of logging went by
>> > so i've attached it.  this log covers starting the server until after
>> > nodeprobe was run.
>> >
>> > lemme know you got it.
>> >
>> >
>> >
>> > On Thu, 2009-12-03 at 14:51 -0600, Jonathan Ellis wrote:
>> >> On Thu, Dec 3, 2009 at 2:33 PM, B. Todd Burruss <bb...@real.com> wrote:
>> >> > nothing - i have DEBUG level set and bounced the server.  i'll restart
>> >> > again.
>> >>
>> >> just to verify: you bounced, then did nodeprobe compact, and nothing was logged?
>> >>
>> >> > cassandra spends a lot of time loading INDEXes
>> >>
>> >> that's normal, especially w/ a lot of uncompacted sstables
>> >
>> >
>
>
>

Re: Persistently increasing read latency

Posted by "B. Todd Burruss" <bb...@real.com>.
i saw those too, but the number of files doesn't decrease -- should it?
and the read performance still isn't good.

i'll run the test again tonight with the latest from trunk


On Thu, 2009-12-03 at 15:56 -0600, Jonathan Ellis wrote:
> the lines like
> 
> INFO [COMPACTION-POOL:1] [ColumnFamilyStore.java:932] Compacted to
> /home/bburruss/cassandra.884232-0.5.0-beta1/btoddb/data/uds/bucket-443-Data.db.
> 
> would seem to indicate that it's working.
> 
> On Thu, Dec 3, 2009 at 3:00 PM, B. Todd Burruss <bb...@real.com> wrote:
> > correct.  i just bounced again, ran:
> >
> >   bin/nodeprobe -host gen-app02.dev.real.com -port 8080 compact
> >
> > this time i waited long enough to let it work.  a lot of logging went by
> > so i've attached it.  this log covers starting the server until after
> > nodeprobe was run.
> >
> > lemme know you got it.
> >
> >
> >
> > On Thu, 2009-12-03 at 14:51 -0600, Jonathan Ellis wrote:
> >> On Thu, Dec 3, 2009 at 2:33 PM, B. Todd Burruss <bb...@real.com> wrote:
> >> > nothing - i have DEBUG level set and bounced the server.  i'll restart
> >> > again.
> >>
> >> just to verify: you bounced, then did nodeprobe compact, and nothing was logged?
> >>
> >> > cassandra spends a lot of time loading INDEXes
> >>
> >> that's normal, especially w/ a lot of uncompacted sstables
> >
> >



Re: Persistently increasing read latency

Posted by Jonathan Ellis <jb...@gmail.com>.
the lines like

INFO [COMPACTION-POOL:1] [ColumnFamilyStore.java:932] Compacted to
/home/bburruss/cassandra.884232-0.5.0-beta1/btoddb/data/uds/bucket-443-Data.db.

would seem to indicate that it's working.

On Thu, Dec 3, 2009 at 3:00 PM, B. Todd Burruss <bb...@real.com> wrote:
> correct.  i just bounced again, ran:
>
>   bin/nodeprobe -host gen-app02.dev.real.com -port 8080 compact
>
> this time i waited long enough to let it work.  a lot of logging went by
> so i've attached it.  this log covers starting the server until after
> nodeprobe was run.
>
> lemme know you got it.
>
>
>
> On Thu, 2009-12-03 at 14:51 -0600, Jonathan Ellis wrote:
>> On Thu, Dec 3, 2009 at 2:33 PM, B. Todd Burruss <bb...@real.com> wrote:
>> > nothing - i have DEBUG level set and bounced the server.  i'll restart
>> > again.
>>
>> just to verify: you bounced, then did nodeprobe compact, and nothing was logged?
>>
>> > cassandra spends a lot of time loading INDEXes
>>
>> that's normal, especially w/ a lot of uncompacted sstables
>
>

Re: Persistently increasing read latency

Posted by "B. Todd Burruss" <bb...@real.com>.
correct.  i just bounced again, ran:

   bin/nodeprobe -host gen-app02.dev.real.com -port 8080 compact

this time i waited long enough to let it work.  a lot of logging went by
so i've attached it.  this log covers starting the server until after
nodeprobe was run.

lemme know you got it.



On Thu, 2009-12-03 at 14:51 -0600, Jonathan Ellis wrote:
> On Thu, Dec 3, 2009 at 2:33 PM, B. Todd Burruss <bb...@real.com> wrote:
> > nothing - i have DEBUG level set and bounced the server.  i'll restart
> > again.
> 
> just to verify: you bounced, then did nodeprobe compact, and nothing was logged?
> 
> > cassandra spends a lot of time loading INDEXes
> 
> that's normal, especially w/ a lot of uncompacted sstables


Re: Persistently increasing read latency

Posted by Jonathan Ellis <jb...@gmail.com>.
On Thu, Dec 3, 2009 at 2:33 PM, B. Todd Burruss <bb...@real.com> wrote:
> nothing - i have DEBUG level set and bounced the server.  i'll restart
> again.

just to verify: you bounced, then did nodeprobe compact, and nothing was logged?

> cassandra spends a lot of time loading INDEXes

that's normal, especially w/ a lot of uncompacted sstables

Re: Persistently increasing read latency

Posted by "B. Todd Burruss" <bb...@real.com>.
nothing - i have DEBUG level set and bounced the server.  i'll restart
again.

cassandra spends a lot of time loading INDEXes



On Thu, 2009-12-03 at 14:08 -0600, Jonathan Ellis wrote:
> what does it log when you nodeprobe compact, with debug logging on?
> 
> On Thu, Dec 3, 2009 at 1:59 PM, B. Todd Burruss <bb...@real.com> wrote:
> > files with "Data" in their name = 384
> > files with "Compacted" in their name = 11
> >
> > one CF, my keyspace is like this:
> >
> >  <Keyspaces>
> >     <Keyspace Name="uds">
> >       <KeysCachedFraction>0.01</KeysCachedFraction>
> >       <ColumnFamily CompareWith="BytesType" Name="bucket" />
> >     </Keyspace>
> >   </Keyspaces>
> >
> >
> > i've also used nodeprobe to manually compact, but didn't really do
> > anything noticable.
> >
> > thx!
> >
> >
> > On Thu, 2009-12-03 at 13:43 -0600, Jonathan Ellis wrote:
> >> On Thu, Dec 3, 2009 at 1:34 PM, Jonathan Ellis <jb...@gmail.com> wrote:
> >> > On Thu, Dec 3, 2009 at 1:32 PM, B. Todd Burruss <bb...@real.com> wrote:
> >> >> i do not have any pending tasks in the compaction pool but i have 1164
> >> >> files in my data directory.
> >> >
> >> > how many CFs are those spread across?
> >>
> >> ... and how many are not marked -Compacted in the filename?
> >
> >
> >



Re: Persistently increasing read latency

Posted by Jonathan Ellis <jb...@gmail.com>.
what does it log when you nodeprobe compact, with debug logging on?

On Thu, Dec 3, 2009 at 1:59 PM, B. Todd Burruss <bb...@real.com> wrote:
> files with "Data" in their name = 384
> files with "Compacted" in their name = 11
>
> one CF, my keyspace is like this:
>
>  <Keyspaces>
>     <Keyspace Name="uds">
>       <KeysCachedFraction>0.01</KeysCachedFraction>
>       <ColumnFamily CompareWith="BytesType" Name="bucket" />
>     </Keyspace>
>   </Keyspaces>
>
>
> i've also used nodeprobe to manually compact, but didn't really do
> anything noticable.
>
> thx!
>
>
> On Thu, 2009-12-03 at 13:43 -0600, Jonathan Ellis wrote:
>> On Thu, Dec 3, 2009 at 1:34 PM, Jonathan Ellis <jb...@gmail.com> wrote:
>> > On Thu, Dec 3, 2009 at 1:32 PM, B. Todd Burruss <bb...@real.com> wrote:
>> >> i do not have any pending tasks in the compaction pool but i have 1164
>> >> files in my data directory.
>> >
>> > how many CFs are those spread across?
>>
>> ... and how many are not marked -Compacted in the filename?
>
>
>

Re: Persistently increasing read latency

Posted by "B. Todd Burruss" <bb...@real.com>.
files with "Data" in their name = 384
files with "Compacted" in their name = 11

one CF, my keyspace is like this:

  <Keyspaces>
     <Keyspace Name="uds">
       <KeysCachedFraction>0.01</KeysCachedFraction>
       <ColumnFamily CompareWith="BytesType" Name="bucket" />
     </Keyspace>
   </Keyspaces>


i've also used nodeprobe to manually compact, but it didn't really do
anything noticeable.

thx!
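
(For reference, a rough sketch of how counts like the ones above can be pulled from a data directory.  The directory argument is just whatever your <DataFileDirectory> in storage-conf.xml points at for the keyspace; adjust to your layout.)

import java.io.File;

// Counts SSTable data files and -Compacted markers in one data directory.
public class SSTableCounts {
    public static void main(String[] args) {
        File dataDir = new File(args[0]);   // e.g. <DataFileDirectory>/uds
        File[] files = dataDir.listFiles();
        if (files == null) {
            System.err.println("not a directory: " + dataDir);
            return;
        }
        int data = 0, compacted = 0;
        for (File f : files) {
            String name = f.getName();
            if (name.contains("-Data.db")) data++;
            if (name.contains("Compacted")) compacted++;
        }
        System.out.println("Data files: " + data);
        System.out.println("Compacted markers: " + compacted);
    }
}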


On Thu, 2009-12-03 at 13:43 -0600, Jonathan Ellis wrote:
> On Thu, Dec 3, 2009 at 1:34 PM, Jonathan Ellis <jb...@gmail.com> wrote:
> > On Thu, Dec 3, 2009 at 1:32 PM, B. Todd Burruss <bb...@real.com> wrote:
> >> i do not have any pending tasks in the compaction pool but i have 1164
> >> files in my data directory.
> >
> > how many CFs are those spread across?
> 
> ... and how many are not marked -Compacted in the filename?



Re: Persistently increasing read latency

Posted by Jonathan Ellis <jb...@gmail.com>.
On Thu, Dec 3, 2009 at 1:34 PM, Jonathan Ellis <jb...@gmail.com> wrote:
> On Thu, Dec 3, 2009 at 1:32 PM, B. Todd Burruss <bb...@real.com> wrote:
>> i do not have any pending tasks in the compaction pool but i have 1164
>> files in my data directory.
>
> how many CFs are those spread across?

... and how many are not marked -Compacted in the filename?

Re: Persistently increasing read latency

Posted by Jonathan Ellis <jb...@gmail.com>.
On Thu, Dec 3, 2009 at 1:32 PM, B. Todd Burruss <bb...@real.com> wrote:
> i do not have any pending tasks in the compaction pool but i have 1164
> files in my data directory.

how many CFs are those spread across?

> does compaction only happen when idle?

no, but it will happen faster then :)

-Jonathan

Re: Persistently increasing read latency

Posted by "B. Todd Burruss" <bb...@real.com>.
i do not have any pending tasks in the compaction pool but i have 1164
files in my data directory.  one thing to note about my situation is
that i did run out of disk space during my test.  cassandra _seemed_ to
recover nicely.

tim, is yours recovering?  i plan to rerun the test tonight with a
slightly smaller data set; however, the 'get' performance was dwindling
before the node ran out of disk space.

does compaction only happen when idle?

and yes, i'm running with 0.5-beta1, but not trunk.



On Thu, 2009-12-03 at 11:03 -0800, B. Todd Burruss wrote:
> i am seeing this as well.  i did a test with just 1 cassandra node,
> ReplicationFactor=1, 'get' ConsistencyLevel.ONE, 'put'
> ConsistencyLevel.QUORUM.  The first test was writing and reading random
> values starting from a fresh database.  The put performance is staying
> reasonabe, but the read performance falls off dramatically as the data
> grows.  The get performance fell from approx 6500 get/sec to 150 get/sec
> (as reported by my client stats.)  The database has grown to approx
> 500gig.  i have the stats recorded on 5 second intervals and i see a
> very linear drop off as the data grows.
> 
> i stopped the server and restarted it, let it do its thing during
> restart and then rerun a read-only test using the exact same data.  i am
> still at about 150 get/sec.  via JMX i can see the read latency at about
> 60, but this varies as the app runs.
> 
> my keyspace is simple:
> 
>  <Keyspaces>
>     <Keyspace Name="uds">
>       <KeysCachedFraction>0.01</KeysCachedFraction>
>       <ColumnFamily CompareWith="BytesType" Name="bucket" />
>     </Keyspace>
>   </Keyspaces>
> 
> all values are exactly the same and are 2k in length.
> 
> i've tried to do some tuning to make things faster but don't necessarily
> understand the options.  here are some of the params i've changed in the
> config file:
> 
> <CommitLogRotationThresholdInMB>256</CommitLogRotationThresholdInMB>
> <MemtableSizeInMB>1024</MemtableSizeInMB>
> <MemtableObjectCountInMillions>0.6</MemtableObjectCountInMillions>
> <CommitLogSyncPeriodInMS>1000</CommitLogSyncPeriodInMS>
> <MemtableFlushAfterMinutes>1440</MemtableFlushAfterMinutes>
> 
> hope this data helps, and any help you can provide is much appreciated.
> 
> 
> On Tue, 2009-12-01 at 20:18 -0600, Jonathan Ellis wrote:
> > On Tue, Dec 1, 2009 at 7:31 PM, Freeman, Tim <ti...@hp.com> wrote:
> > > Looking at the Cassandra mbean's, the attributes of ROW-MUTATION-STAGE and ROW-READ-STAGE and RESPONSE-STAGE are all  less than 10.  MINOR-COMPACTION-POOL reports 1218 pending tasks.
> > 
> > That's probably the culprit right there.  Something is wrong if you
> > have 1200 pending compactions.
> > 
> > This is something that upgrading to trunk will help with right away
> > since we parallelize compactions there.
> > 
> > Another thing you can do is increase the memtable limits so you are
> > not flushing + compacting so often with your insert traffic.
> > 
> > -Jonathan
> 
> 



Re: Persistently increasing read latency

Posted by "B. Todd Burruss" <bb...@real.com>.
i am seeing this as well.  i did a test with just 1 cassandra node,
ReplicationFactor=1, 'get' ConsistencyLevel.ONE, 'put'
ConsistencyLevel.QUORUM.  The first test was writing and reading random
values starting from a fresh database.  The put performance is staying
reasonable, but the read performance falls off dramatically as the data
grows.  The get performance fell from approx 6500 get/sec to 150 get/sec
(as reported by my client stats.)  The database has grown to approx
500gig.  i have the stats recorded on 5 second intervals and i see a
very linear drop off as the data grows.

i stopped the server and restarted it, let it do its thing during
restart, and then reran a read-only test using the exact same data.  i am
still at about 150 get/sec.  via JMX i can see the read latency at about
60, but this varies as the app runs.

my keyspace is simple:

 <Keyspaces>
    <Keyspace Name="uds">
      <KeysCachedFraction>0.01</KeysCachedFraction>
      <ColumnFamily CompareWith="BytesType" Name="bucket" />
    </Keyspace>
  </Keyspaces>

all values are exactly the same and are 2k in length.

i've tried to do some tuning to make things faster but don't necessarily
understand the options.  here are some of the params i've changed in the
config file:

<CommitLogRotationThresholdInMB>256</CommitLogRotationThresholdInMB>
<MemtableSizeInMB>1024</MemtableSizeInMB>
<MemtableObjectCountInMillions>0.6</MemtableObjectCountInMillions>
<CommitLogSyncPeriodInMS>1000</CommitLogSyncPeriodInMS>
<MemtableFlushAfterMinutes>1440</MemtableFlushAfterMinutes>

hope this data helps, and any help you can provide is much appreciated.


On Tue, 2009-12-01 at 20:18 -0600, Jonathan Ellis wrote:
> On Tue, Dec 1, 2009 at 7:31 PM, Freeman, Tim <ti...@hp.com> wrote:
> > Looking at the Cassandra mbean's, the attributes of ROW-MUTATION-STAGE and ROW-READ-STAGE and RESPONSE-STAGE are all  less than 10.  MINOR-COMPACTION-POOL reports 1218 pending tasks.
> 
> That's probably the culprit right there.  Something is wrong if you
> have 1200 pending compactions.
> 
> This is something that upgrading to trunk will help with right away
> since we parallelize compactions there.
> 
> Another thing you can do is increase the memtable limits so you are
> not flushing + compacting so often with your insert traffic.
> 
> -Jonathan



Re: Persistently increasing read latency

Posted by Jonathan Ellis <jb...@gmail.com>.
On Tue, Dec 1, 2009 at 7:31 PM, Freeman, Tim <ti...@hp.com> wrote:
> Looking at the Cassandra mbean's, the attributes of ROW-MUTATION-STAGE and ROW-READ-STAGE and RESPONSE-STAGE are all  less than 10.  MINOR-COMPACTION-POOL reports 1218 pending tasks.

That's probably the culprit right there.  Something is wrong if you
have 1200 pending compactions.

This is something that upgrading to trunk will help with right away
since we parallelize compactions there.

Another thing you can do is increase the memtable limits so you are
not flushing + compacting so often with your insert traffic.

-Jonathan

RE: Persistently increasing read latency

Posted by "Freeman, Tim" <ti...@hp.com>.
From: Jonathan Ellis [mailto:jbellis@gmail.com] 
>1) use jconsole to see what is happening to jvm / cassandra internals.
> possibly you are slowly exceeding cassandra's ability to keep up with
>writes, causing the jvm to spend more and more effort GCing to find
>enough memory to keep going

After looking at jconsole, I don't think so.  Cassandra has been up 18 hours.  It says it has spent 2 hours 15 minutes on ParNew GCs and 1 second on ConcurrentMarkSweep, so we aren't in the middle of GC death.  jconsole's view of the memory used is a sawtooth, climbing from around 0.05GB to 0.95GB over about 7 minutes, then dropping suddenly back to 0.05GB.  CPU usage is 2-4% of one of four cores.

Looking at the Cassandra mbeans, the attributes of ROW-MUTATION-STAGE, ROW-READ-STAGE, and RESPONSE-STAGE are all less than 10.  MINOR-COMPACTION-POOL reports 1218 pending tasks.  (I really like the idea of using mbeans for this.  Thanks for the example.)
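
A small JMX client can poll that pending-task count without keeping jconsole open.  A sketch, assuming the pool is registered under org.apache.cassandra.concurrent with a PendingTasks attribute (that is what jconsole suggests for these stages, but check the exact names your build registers) and using the same 8080 JMX port that nodeprobe uses elsewhere in this thread:

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class CompactionBacklog {
    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "localhost";
        // 8080 matches the JMX port used with nodeprobe in this thread.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + host + ":8080/jmxrmi");
        JMXConnector jmxc = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection conn = jmxc.getMBeanServerConnection();
            // Assumed ObjectName/attribute; confirm in jconsole what your
            // Cassandra build registers for MINOR-COMPACTION-POOL.
            ObjectName pool = new ObjectName(
                    "org.apache.cassandra.concurrent:type=MINOR-COMPACTION-POOL");
            Object pending = conn.getAttribute(pool, "PendingTasks");
            System.out.println("MINOR-COMPACTION-POOL pending tasks: " + pending);
        } finally {
            jmxc.close();
        }
    }
}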

>2) you should be at least on 0.4.2 and preferably trunk if you are
>stress testing

Fair enough. 

I left it running since my previous email, and the read latency seems to have maxed out at an average of 1 second after 50K seconds (13 hours) of run time.  Perhaps the problem is just that I didn't wait long enough for the system to come into equilibrium.

Tim Freeman
Email: tim.freeman@hp.com
Desk in Palo Alto: (650) 857-2581
Home: (408) 774-1298
Cell: (408) 348-7536 (No reception business hours Monday, Tuesday, and Thursday; call my desk instead.)


-----Original Message-----
From: Jonathan Ellis [mailto:jbellis@gmail.com] 
Sent: Tuesday, December 01, 2009 11:10 AM
To: cassandra-user@incubator.apache.org
Subject: Re: Persistently increasing read latency

1) use jconsole to see what is happening to jvm / cassandra internals.
 possibly you are slowly exceeding cassandra's ability to keep up with
writes, causing the jvm to spend more and more effort GCing to find
enough memory to keep going

2) you should be at least on 0.4.2 and preferably trunk if you are
stress testing

-Jonathan

On Tue, Dec 1, 2009 at 12:11 PM, Freeman, Tim <ti...@hp.com> wrote:
> In an 8 hour test run, I've seen the read latency for Cassandra drift fairly linearly from ~460ms to ~900ms.  Eventually my application gets starved for reads and starts misbehaving.  I have attached graphs -- horizontal scales are seconds, vertical scales are operations per minute and average milliseconds per operation.  The clearest feature is the light blue line in the left graph drifting consistently upward during the run.
>
> I have a Cassandra 0.4.1 database, one node, records are 100kbytes each, 350K records, 8 threads reading, around 700 reads per minute.  There are also 8 threads writing.  This is all happening on a 4 core processor that's supporting both the Cassandra node and the code that's generating load for it.  I'm reasonably sure that there are no page faults.
>
> I have attached my storage-conf.xml.  Briefly, it has default values, except RpcTimeoutInMillis is 30000 and the partitioner is OrderPreservingPartitioner.  Cassandra's garbage collection parameters are:
>
>   -Xms128m -Xmx1G -XX:SurvivorRatio=8 -XX:+AggressiveOpts -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled
>
> Is this normal behavior?  Is there some change to the configuration I should make to get it to stop getting slower?  If it's not normal, what debugging information should I gather?  Should I give up on Cassandra 0.4.1 and move to a newer version?
>
> I'll leave it running for the time being in case there's something useful to extract from it.
>
> Tim Freeman
> Email: tim.freeman@hp.com
> Desk in Palo Alto: (650) 857-2581
> Home: (408) 774-1298
> Cell: (408) 348-7536 (No reception business hours Monday, Tuesday, and Thursday; call my desk instead.)
>
>

Re: Persistently increasing read latency

Posted by Thorsten von Eicken <tv...@rightscale.com>.
It sounds like the system is limited by bytes/sec rather than ops/sec, since
the bottleneck seems to be disk bandwidth. It might be helpful to
compute bytes written per second at the API level, and then estimate how
often each "API byte" is written to disk and read from disk due to
compactions. That would support some quick back-of-the-envelope
calculations as a performance smell test...
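
As a sketch of that arithmetic, using figures reported elsewhere in this thread (~100 KB records, roughly 650 reads and 780 writes per minute) and a purely assumed rewrite factor for flush plus compaction:

// Back-of-the-envelope estimate along the lines suggested above.  The record
// size and request rates come from this thread; the rewrite factor is an
// assumption to illustrate the arithmetic, not a measured value.
public class EnvelopeMath {
    public static void main(String[] args) {
        double recordBytes = 100 * 1024;     // ~100 KB per record
        double writesPerSec = 780.0 / 60;    // ~13 writes/sec at the API
        double readsPerSec = 650.0 / 60;     // ~11 reads/sec at the API

        double apiWriteBps = writesPerSec * recordBytes;
        double apiReadBps = readsPerSec * recordBytes;

        // Assume each API byte is rewritten N times by flush and successive
        // compactions before it settles; N depends on data size and thresholds.
        double assumedRewriteFactor = 5.0;
        double diskWriteBps = apiWriteBps * assumedRewriteFactor;

        // For comparison, the iostat output later in the thread shows roughly
        // 13 MB/s of reads on the data disk.
        System.out.printf("API writes: %.1f MB/s, API reads: %.1f MB/s%n",
                apiWriteBps / 1e6, apiReadBps / 1e6);
        System.out.printf("Disk writes with rewrite factor %.0f: %.1f MB/s%n",
                assumedRewriteFactor, diskWriteBps / 1e6);
    }
}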
Thorsten

Jonathan Ellis wrote:
> Thanks for looking into this.  Doesn't seem like there's much
> low-hanging fruit to make compaction faster but I'll keep that in the
> back of my mind.
>
> -Jonathan
>
>   

Re: Persistently increasing read latency

Posted by Jonathan Ellis <jb...@gmail.com>.
Sure.

On Fri, Dec 4, 2009 at 2:20 PM, Joe Stump <jo...@joestump.net> wrote:
>
> On Dec 4, 2009, at 1:09 PM, Jonathan Ellis wrote:
>
>> If you don't have enough room for both, it doesn't matter how you prioritize.
>
> I'm assuming another alternative is to add more boxes to your cluster.
>
> --Joe
>
>

Re: Persistently increasing read latency

Posted by Joe Stump <jo...@joestump.net>.
On Dec 4, 2009, at 1:09 PM, Jonathan Ellis wrote:

> If you don't have enough room for both, it doesn't matter how you prioritize.

I'm assuming another alternative is to add more boxes to your cluster.

--Joe


MySQL has same read latency issue; short benchmarks; slow writes are better than uncompacted data (was RE: Persistently increasing read latency)

Posted by "Freeman, Tim" <ti...@hp.com>.
I gave MySQL 5.1.38 a try (that's not the clustered version, so never mind what happens if a disk is lost) and I saw the same persistently increasing latency that I saw with Cassandra.

I also tried storing into files on the local filesystem (not clustered or transactional, so never mind what happens if a node fails permanently or temporarily).  It apparently came to an equilibrium after 47000 seconds, or around 13 hours.  It's still running, so I'll watch it some more before really believing that the latency won't increase further.

It looks like I'll have to let benchmarks run for a day to determine whether I have true long-term performance numbers.  Yuck.

See the attached chart comparing them.  The vertical scale is milliseconds of latency per read, and the horizontal scale is seconds.  We're reading and writing 350K records, 100 KB each, around 650 reads per minute and 780 writes per minute on one 4-core Z600 with one disk drive.  The "write 2 files" part reminds me of a performance bug when writing to the filesystem.
 
Here's where I come into disagreement with the following:

From: Jonathan Ellis [mailto:jbellis@gmail.com] 
>I agree that [bounding the compaction backlog is] much much nicer 
>in the sense that it makes it more
>obvious what the problem is (not enough capacity) but it only helps
>diagnosis, not mitigation.

The next thing I would try is to use the local filesystem as a cache in front of some distributed database, perhaps Cassandra.  This is a use case where we could win if we had slow writes but will lose if there's an unbounded commit backlog with slow reads.  This is an attempt to solve a real problem, not something that's contrived to win a debate.

Suppose I use the local filesystem as a cache in front of Cassandra.  The application would write to the local filesystem and read from the local filesystem.  We have a background task that persists records from the local filesystem to Cassandra.  When reading, if the data isn't on the local filesystem, we get it from Cassandra.  The read load to Cassandra is eliminated, except when a replacement node is starting up.  The write load to Cassandra can be as low as we want because we can update the file on the local filesystem several times while persisting it to Cassandra only once.  There are details omitted here; the description here has the same performance (I hope) as the real idea but more bugs.  Nevermind the bugs for now.
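
A minimal single-node sketch of that shape, with BackingStore as a made-up stand-in for Cassandra, an in-memory map standing in for the local filesystem, and the omitted details and failure handling above still omitted:

import java.util.ArrayList;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Made-up stand-in for the distributed store (e.g. Cassandra).
interface BackingStore {
    void put(String key, byte[] value);
    byte[] get(String key);
}

// Write-behind cache sketch: the application reads and writes locally, and a
// background task persists dirty records to the backing store.
class WriteBehindCache {
    private final Map<String, byte[]> local = new ConcurrentHashMap<>();
    private final Set<String> dirty = ConcurrentHashMap.newKeySet();
    private final BackingStore store;
    private final ScheduledExecutorService flusher =
            Executors.newSingleThreadScheduledExecutor();

    WriteBehindCache(BackingStore store, long flushIntervalMs) {
        this.store = store;
        // Several local updates can collapse into one write to the backing store.
        flusher.scheduleWithFixedDelay(this::flushDirty,
                flushIntervalMs, flushIntervalMs, TimeUnit.MILLISECONDS);
    }

    void put(String key, byte[] value) {
        local.put(key, value);   // the application only ever writes locally
        dirty.add(key);
    }

    byte[] get(String key) {
        byte[] v = local.get(key);             // local copy first...
        return v != null ? v : store.get(key); // ...backing store on a miss
    }

    private void flushDirty() {
        for (String key : new ArrayList<>(dirty)) {
            dirty.remove(key);   // clear the flag before reading the value
            byte[] v = local.get(key);
            if (v != null) store.put(key, v);
        }
    }
}

Writes return as soon as the local copy is updated, which is where the "slow writes are fine" property comes from; the trade-off is exactly the staleness discussed next.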

If Cassandra writes things quickly, then we have fresh data in Cassandra, so we have to redo relatively little work when a node fails because less data was lost.

If Cassandra is slow to write but has bounded read time, then we have more stale data in Cassandra.  When a node is replaced, it can read a stale version of the data on the original node, and we have to redo more work when a node fails.  No big deal.

If Cassandra allows fast writes but is accumulating an unbounded backlog of uncompacted files, then I'm in trouble.  The unbounded backlog either fills up disks, or it leads to reads taking forever and when it comes time to recover from a node failing we can't really read the data that was persisted when the node was up.  Or perhaps I throttle writes to Cassandra based on guesses about how fast it can safely go.  The problem with that is that different nodes can probably go different speeds, and any throttling code I write would be outside of Cassandra so it would have to throttle based on the slowest node and discovering that would be awkward.  I don't know yet what ratio in speed to expect between nodes.

Tim Freeman
Email: tim.freeman@hp.com
Desk in Palo Alto: (650) 857-2581
Home: (408) 774-1298
Cell: (408) 348-7536 (No reception business hours Monday, Tuesday, and Thursday; call my desk instead.)


-----Original Message-----
From: Jonathan Ellis [mailto:jbellis@gmail.com] 
Sent: Friday, December 04, 2009 9:14 PM
To: cassandra-user@incubator.apache.org
Subject: Re: Persistently increasing read latency

On Fri, Dec 4, 2009 at 10:40 PM, Thorsten von Eicken <tv...@rightscale.com> wrote:
>>> For the first few hours of my load test, I have enough I/O.  The problem
>>> is that Cassandra is spending too much I/O on reads and writes and too
>>> little on compactions to function well in the long term.
>>>
>>
>> If you don't have enough room for both, it doesn't matter how you
>> prioritize.
>>
>
> Mhhh, maybe... You're technically correct. The question here is whether
> cassandra degrades gracefully or not. If I understand correctly, there are
> two ways to look at it:
>
> 1) it's accepting a higher request load than it can actually process and
> builds up an increasing backlog that eventually brings performance down far
> below the level of performance that it could sustain, thus it fails to do
> the type of early admission control or back-pressure that keeps the request
> load close to the sustainable maximum performance.
>
> 2) the compaction backlog size is a primary variable that has to be exposed
> and monitored in any cassandra installation because it's a direct indicator
> for an overload situation, just like hitting 100% cpu or similar would be.
>
> I can buy that (2) is ok, but (1) is certainly nicer.

I agree that it's much much nicer in the sense that it makes it more
obvious what the problem is (not enough capacity) but it only helps
diagnosis, not mitigation.

Re: Persistently increasing read latency

Posted by Jonathan Ellis <jb...@gmail.com>.
On Fri, Dec 4, 2009 at 10:40 PM, Thorsten von Eicken <tv...@rightscale.com> wrote:
>>> For the first few hours of my load test, I have enough I/O.  The problem
>>> is that Cassandra is spending too much I/O on reads and writes and too
>>> little on compactions to function well in the long term.
>>>
>>
>> If you don't have enough room for both, it doesn't matter how you
>> prioritize.
>>
>
> Mhhh, maybe... You're technically correct. The question here is whether
> cassandra degrades gracefully or not. If I understand correctly, there are
> two ways to look at it:
>
> 1) it's accepting a higher request load than it can actually process and
> builds up an increasing backlog that eventually brings performance down far
> below the level of performance that it could sustain, thus it fails to do
> the type of early admission control or back-pressure that keeps the request
> load close to the sustainable maximum performance.
>
> 2) the compaction backlog size is a primary variable that has to be exposed
> and monitored in any cassandra installation because it's a direct indicator
> for an overload situation, just like hitting 100% cpu or similar would be.
>
> I can buy that (2) is ok, but (1) is certainly nicer.

I agree that it's much much nicer in the sense that it makes it more
obvious what the problem is (not enough capacity) but it only helps
diagnosis, not mitigation.

Re: Persistently increasing read latency

Posted by Thorsten von Eicken <tv...@rightscale.com>.
Jonathan Ellis wrote:
> On Fri, Dec 4, 2009 at 1:31 PM, Freeman, Tim <ti...@hp.com> wrote:
>   
>>> Fundamentally there's only so much I/O you can do at a time.  If you
>>> don't have enough, you need to upgrade to servers with better i/o
>>> (i.e. not EC2: http://pl.atyp.us/wordpress/?p=2240&cpage=1) and/or
>>> more ram to cache the reads against.
>>>       
>> I agree that any given configuration will support a limited amount of I/O.
>>
>> For the first few hours of my load test, I have enough I/O.  The problem is that Cassandra is spending too much I/O on reads and writes and too little on compactions to function well in the long term.
>>     
>
> If you don't have enough room for both, it doesn't matter how you prioritize.
>   
Mhhh, maybe... You're technically correct. The question here is whether 
cassandra degrades gracefully or not. If I understand correctly, there 
are two ways to look at it:

1) it's accepting a higher request load than it can actually process and 
builds up an increasing backlog that eventually brings performance down 
far below the level of performance that it could sustain, thus it fails 
to do the type of early admission control or back-pressure that keeps 
the request load close to the sustainable maximum performance.

2) the compaction backlog size is a primary variable that has to be 
exposed and monitored in any cassandra installation because it's a 
direct indicator for an overload situation, just like hitting 100% cpu 
or similar would be.

I can buy that (2) is ok, but (1) is certainly nicer.
Thorsten

Re: Persistently increasing read latency

Posted by Jonathan Ellis <jb...@gmail.com>.
On Fri, Dec 4, 2009 at 1:31 PM, Freeman, Tim <ti...@hp.com> wrote:
>>Fundamentally there's only so much I/O you can do at a time.  If you
>>don't have enough, you need to upgrade to servers with better i/o
>>(i.e. not EC2: http://pl.atyp.us/wordpress/?p=2240&cpage=1) and/or
>>more ram to cache the reads against.
>
> I agree that any given configuration will support a limited amount of I/O.
>
> For the first few hours of my load test, I have enough I/O.  The problem is that Cassandra is spending too much I/O on reads and writes and too little on compactions to function well in the long term.

If you don't have enough room for both, it doesn't matter how you prioritize.

Re: Persistently increasing read latency

Posted by "B. Todd Burruss" <bb...@real.com>.
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 5  1  29576  88400  27656 13797784    1    0    46   134    0    1  8  1 91  1  0

performance is definitely better now that caching is working, thx!

however, the pending tasks in the compaction pool are still not decreasing,
even though i've reduced the load on the server a bunch.  i'll be doing another
test tonight with some modifications to my test client to simulate real
world a bit better.



On Fri, 2009-12-04 at 16:47 -0600, Jonathan Ellis wrote:
> how much does vmstat say is being used for cache?
> 
> if it's not most of the other 14, then read I/O doesn't sound like the culprit
> 
> On Fri, Dec 4, 2009 at 4:17 PM, B. Todd Burruss <bb...@real.com> wrote:
> > thx for the tip.  i've allocated -Xmx2G for cassandra out of 16G.
> >
> >
> > On Fri, 2009-12-04 at 14:22 -0600, Brandon Williams wrote:
> >> On Fri, Dec 4, 2009 at 1:40 PM, B. Todd Burruss <bb...@real.com>
> >> wrote:
> >>         i've actually allocatted plenty of RAM, but it isn't fully
> >>         used.  the
> >>         read performance dwindles well before reaching a RAM limit.
> >>
> >>
> >> You don't want to over-allocate memory to Cassandra, you want to leave
> >> as much free as you can so the OS can cache the reads for you.
> >>
> >>
> >> -Brandon
> >>
> >
> >
> >





Re: Persistently increasing read latency

Posted by Jonathan Ellis <jb...@gmail.com>.
how much does vmstat say is being used for cache?

if it's not most of the other 14G, then read I/O doesn't sound like the culprit

On Fri, Dec 4, 2009 at 4:17 PM, B. Todd Burruss <bb...@real.com> wrote:
> thx for the tip.  i've allocated -Xmx2G for cassandra out of 16G.
>
>
> On Fri, 2009-12-04 at 14:22 -0600, Brandon Williams wrote:
>> On Fri, Dec 4, 2009 at 1:40 PM, B. Todd Burruss <bb...@real.com>
>> wrote:
>>         i've actually allocatted plenty of RAM, but it isn't fully
>>         used.  the
>>         read performance dwindles well before reaching a RAM limit.
>>
>>
>> You don't want to over-allocate memory to Cassandra, you want to leave
>> as much free as you can so the OS can cache the reads for you.
>>
>>
>> -Brandon
>>
>
>
>

Re: Persistently increasing read latency

Posted by "B. Todd Burruss" <bb...@real.com>.
thx for the tip.  i've allocated -Xmx2G for cassandra out of 16G.


On Fri, 2009-12-04 at 14:22 -0600, Brandon Williams wrote:
> On Fri, Dec 4, 2009 at 1:40 PM, B. Todd Burruss <bb...@real.com>
> wrote:
>         i've actually allocatted plenty of RAM, but it isn't fully
>         used.  the
>         read performance dwindles well before reaching a RAM limit.
> 
> 
> You don't want to over-allocate memory to Cassandra, you want to leave
> as much free as you can so the OS can cache the reads for you.
> 
> 
> -Brandon
> 



Re: Persistently increasing read latency

Posted by Brandon Williams <dr...@gmail.com>.
On Fri, Dec 4, 2009 at 1:40 PM, B. Todd Burruss <bb...@real.com> wrote:

> i've actually allocatted plenty of RAM, but it isn't fully used.  the
> read performance dwindles well before reaching a RAM limit.


You don't want to over-allocate memory to Cassandra, you want to leave as
much free as you can so the OS can cache the reads for you.

-Brandon

Re: Persistently increasing read latency

Posted by "B. Todd Burruss" <bb...@real.com>.
i've actually allocated plenty of RAM, but it isn't fully used.  the
read performance dwindles well before reaching a RAM limit.

thx

On Fri, 2009-12-04 at 13:18 -0600, Jonathan Ellis wrote:
> Fundamentally there's only so much I/O you can do at a time.  If you
> don't have enough, you need to upgrade to servers with better i/o
> (i.e. not EC2: http://pl.atyp.us/wordpress/?p=2240&cpage=1) and/or
> more ram to cache the reads against.
> 
> On Fri, Dec 4, 2009 at 1:07 PM, B. Todd Burruss <bb...@real.com> wrote:
> > this is very concerning to me.  it doesn't seem to take much to bring
> > the read performance to an unacceptable level.  are there any
> > suggestions about how to improve performance.
> >
> > here are the params from my config file that are not defaults.  i
> > adjusted these to get real good performance, but not over the long haul.
> > has anyone had any luck adjusting these to help the problem tim and I
> > are having?
> >
> > <CommitLogRotationThresholdInMB>256</CommitLogRotationThresholdInMB>
> > <MemtableSizeInMB>1024</MemtableSizeInMB>
> > <MemtableObjectCountInMillions>0.6</MemtableObjectCountInMillions>
> > <CommitLogSyncPeriodInMS>1000</CommitLogSyncPeriodInMS>
> > <MemtableFlushAfterMinutes>1440</MemtableFlushAfterMinutes>
> >
> >
> > thx!
> >
> > On Fri, 2009-12-04 at 18:49 +0000, Freeman, Tim wrote:
> >> The speed of compaction isn't the problem.  The problem is that lots of reads and writes cause compaction to fall behind.
> >>
> >> You could solve the problem by throttling reads and writes so compaction isn't starved.  (Maybe just the writes.  I'm not sure.)
> >>
> >> Different nodes will have different compaction backlogs, so you'd want to do this on a per node basis after Cassandra has made decisions about whatever replication it's going to do.  For example, Cassandra could observe the number of pending compaction tasks and sleep that many milliseconds before every read and write.
> >>
> >> The status quo is that I have to count a load test as passing only if the amount of backlogged compaction work stays less than some bound.  I'd rather not have to peer into Cassandra internals to determine whether it's really working or not.  It's a problem if 16 hour load tests get different results than 1 hour load tests because in my tests I'm renting a cluster by the hour.
> >>
> >> Tim Freeman
> >> Email: tim.freeman@hp.com
> >> Desk in Palo Alto: (650) 857-2581
> >> Home: (408) 774-1298
> >> Cell: (408) 348-7536 (No reception business hours Monday, Tuesday, and Thursday; call my desk instead.)
> >>
> >> -----Original Message-----
> >> From: Jonathan Ellis [mailto:jbellis@gmail.com]
> >> Sent: Thursday, December 03, 2009 3:06 PM
> >> To: cassandra-user@incubator.apache.org
> >> Subject: Re: Persistently increasing read latency
> >>
> >> Thanks for looking into this.  Doesn't seem like there's much
> >> low-hanging fruit to make compaction faster but I'll keep that in the
> >> back of my mind.
> >>
> >> -Jonathan
> >>
> >> On Thu, Dec 3, 2009 at 4:58 PM, Freeman, Tim <ti...@hp.com> wrote:
> >> >>So this is working as designed, but the design is poor because it
> >> >>causes confusion.  If you can open a ticket for this that would be
> >> >>great.
> >> >
> >> > Done, see:
> >> >
> >> >   https://issues.apache.org/jira/browse/CASSANDRA-599
> >> >
> >> >>What does iostat -x 10 (for instance) say about the disk activity?
> >> >
> >> > rkB/s is consistently high, and wkB/s varies.  This is a typical entry with wkB/s at the high end of its range:
> >> >
> >> >>avg-cpu:  %user   %nice    %sys %iowait   %idle
> >> >>           1.52    0.00    1.70   27.49   69.28
> >> >>
> >> >>Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
> >> >>sda          3.10 3249.25 124.08 29.67 26299.30 26288.11 13149.65 13144.06   342.04    17.75   92.25   5.98  91.92
> >> >>sda1         0.00   0.00  0.00  0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> >> >>sda2         3.10 3249.25 124.08 29.67 26299.30 26288.11 13149.65 13144.06   342.04    17.75   92.25   5.98  91.92
> >> >>sda3         0.00   0.00  0.00  0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> >> >
> >> > and at the low end:
> >> >
> >> >>avg-cpu:  %user   %nice    %sys %iowait   %idle
> >> >>           1.50    0.00    1.77   25.80   70.93
> >> >>
> >> >>Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
> >> >>sda          3.40 817.10 128.60 17.70 27828.80 6600.00 13914.40  3300.00   235.33     6.13   56.63   6.21  90.81
> >> >>sda1         0.00   0.00  0.00  0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> >> >>sda2         3.40 817.10 128.60 17.70 27828.80 6600.00 13914.40  3300.00   235.33     6.13   56.63   6.21  90.81
> >> >>sda3         0.00   0.00  0.00  0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> >> >
> >> > Tim Freeman
> >> > Email: tim.freeman@hp.com
> >> > Desk in Palo Alto: (650) 857-2581
> >> > Home: (408) 774-1298
> >> > Cell: (408) 348-7536 (No reception business hours Monday, Tuesday, and Thursday; call my desk instead.)
> >> >
> >> >
> >> > -----Original Message-----
> >> > From: Jonathan Ellis [mailto:jbellis@gmail.com]
> >> > Sent: Thursday, December 03, 2009 2:45 PM
> >> > To: cassandra-user@incubator.apache.org
> >> > Subject: Re: Persistently increasing read latency
> >> >
> >> > On Thu, Dec 3, 2009 at 4:34 PM, Freeman, Tim <ti...@hp.com> wrote:
> >> >>>Can you tell if the system is i/o or cpu bound during compaction?
> >> >>
> >> >> It's I/O bound.  It's using ~9% of 1 of 4 cores as I watch it, and all it's doing right now is compactions.
> >> >
> >> > What does iostat -x 10 (for instance) say about the disk activity?
> >> >
> >
> >
> >



RE: Persistently increasing read latency

Posted by "Freeman, Tim" <ti...@hp.com>.
>Fundamentally there's only so much I/O you can do at a time.  If you
>don't have enough, you need to upgrade to servers with better i/o
>(i.e. not EC2: http://pl.atyp.us/wordpress/?p=2240&cpage=1) and/or
>more ram to cache the reads against.

I agree that any given configuration will support a limited amount of I/O.  

For the first few hours of my load test, I have enough I/O.  The problem is that Cassandra is spending too much I/O on reads and writes and too little on compactions to function well in the long term.
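
For what the per-node throttling suggested further down in the quoted thread might look like as a client-side pacer (the original suggestion is for Cassandra to do this internally; the one-millisecond-per-pending-compaction factor is just the example figure from that message, and the pending-task count would come from whatever JMX polling is already in place):

import java.util.function.IntSupplier;

// Sketch of the throttling idea: sleep roughly one millisecond per pending
// compaction before each read or write.  The supplier would be backed by the
// MINOR-COMPACTION-POOL pending-task count polled over JMX.
class CompactionAwareThrottle {
    private final IntSupplier pendingCompactions;

    CompactionAwareThrottle(IntSupplier pendingCompactions) {
        this.pendingCompactions = pendingCompactions;
    }

    // Call before each read/write; backs off harder as the backlog grows.
    void pace() throws InterruptedException {
        int pending = pendingCompactions.getAsInt();
        if (pending > 0) {
            Thread.sleep(pending);
        }
    }
}

A writer would call pace() before each insert, so the client backs off in proportion to the backlog instead of letting it grow without bound.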

Tim Freeman
Email: tim.freeman@hp.com
Desk in Palo Alto: (650) 857-2581
Home: (408) 774-1298
Cell: (408) 348-7536 (No reception business hours Monday, Tuesday, and Thursday; call my desk instead.)


-----Original Message-----
From: Jonathan Ellis [mailto:jbellis@gmail.com] 
Sent: Friday, December 04, 2009 11:19 AM
To: cassandra-user@incubator.apache.org
Subject: Re: Persistently increasing read latency

Fundamentally there's only so much I/O you can do at a time.  If you
don't have enough, you need to upgrade to servers with better i/o
(i.e. not EC2: http://pl.atyp.us/wordpress/?p=2240&cpage=1) and/or
more ram to cache the reads against.

On Fri, Dec 4, 2009 at 1:07 PM, B. Todd Burruss <bb...@real.com> wrote:
> this is very concerning to me.  it doesn't seem to take much to bring
> the read performance to an unacceptable level.  are there any
> suggestions about how to improve performance.
>
> here are the params from my config file that are not defaults.  i
> adjusted these to get real good performance, but not over the long haul.
> has anyone had any luck adjusting these to help the problem tim and I
> are having?
>
> <CommitLogRotationThresholdInMB>256</CommitLogRotationThresholdInMB>
> <MemtableSizeInMB>1024</MemtableSizeInMB>
> <MemtableObjectCountInMillions>0.6</MemtableObjectCountInMillions>
> <CommitLogSyncPeriodInMS>1000</CommitLogSyncPeriodInMS>
> <MemtableFlushAfterMinutes>1440</MemtableFlushAfterMinutes>
>
>
> thx!
>
> On Fri, 2009-12-04 at 18:49 +0000, Freeman, Tim wrote:
>> The speed of compaction isn't the problem.  The problem is that lots of reads and writes cause compaction to fall behind.
>>
>> You could solve the problem by throttling reads and writes so compaction isn't starved.  (Maybe just the writes.  I'm not sure.)
>>
>> Different nodes will have different compaction backlogs, so you'd want to do this on a per node basis after Cassandra has made decisions about whatever replication it's going to do.  For example, Cassandra could observe the number of pending compaction tasks and sleep that many milliseconds before every read and write.
>>
>> The status quo is that I have to count a load test as passing only if the amount of backlogged compaction work stays less than some bound.  I'd rather not have to peer into Cassandra internals to determine whether it's really working or not.  It's a problem if 16 hour load tests get different results than 1 hour load tests because in my tests I'm renting a cluster by the hour.
>>
>> Tim Freeman
>> Email: tim.freeman@hp.com
>> Desk in Palo Alto: (650) 857-2581
>> Home: (408) 774-1298
>> Cell: (408) 348-7536 (No reception business hours Monday, Tuesday, and Thursday; call my desk instead.)
>>
>> -----Original Message-----
>> From: Jonathan Ellis [mailto:jbellis@gmail.com]
>> Sent: Thursday, December 03, 2009 3:06 PM
>> To: cassandra-user@incubator.apache.org
>> Subject: Re: Persistently increasing read latency
>>
>> Thanks for looking into this.  Doesn't seem like there's much
>> low-hanging fruit to make compaction faster but I'll keep that in the
>> back of my mind.
>>
>> -Jonathan
>>
>> On Thu, Dec 3, 2009 at 4:58 PM, Freeman, Tim <ti...@hp.com> wrote:
>> >>So this is working as designed, but the design is poor because it
>> >>causes confusion.  If you can open a ticket for this that would be
>> >>great.
>> >
>> > Done, see:
>> >
>> >   https://issues.apache.org/jira/browse/CASSANDRA-599
>> >
>> >>What does iostat -x 10 (for instance) say about the disk activity?
>> >
>> > rkB/s is consistently high, and wkB/s varies.  This is a typical entry with wkB/s at the high end of its range:
>> >
>> >>avg-cpu:  %user   %nice    %sys %iowait   %idle
>> >>           1.52    0.00    1.70   27.49   69.28
>> >>
>> >>Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
>> >>sda          3.10 3249.25 124.08 29.67 26299.30 26288.11 13149.65 13144.06   342.04    17.75   92.25   5.98  91.92
>> >>sda1         0.00   0.00  0.00  0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
>> >>sda2         3.10 3249.25 124.08 29.67 26299.30 26288.11 13149.65 13144.06   342.04    17.75   92.25   5.98  91.92
>> >>sda3         0.00   0.00  0.00  0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
>> >
>> > and at the low end:
>> >
>> >>avg-cpu:  %user   %nice    %sys %iowait   %idle
>> >>           1.50    0.00    1.77   25.80   70.93
>> >>
>> >>Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
>> >>sda          3.40 817.10 128.60 17.70 27828.80 6600.00 13914.40  3300.00   235.33     6.13   56.63   6.21  90.81
>> >>sda1         0.00   0.00  0.00  0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
>> >>sda2         3.40 817.10 128.60 17.70 27828.80 6600.00 13914.40  3300.00   235.33     6.13   56.63   6.21  90.81
>> >>sda3         0.00   0.00  0.00  0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
>> >
>> > Tim Freeman
>> > Email: tim.freeman@hp.com
>> > Desk in Palo Alto: (650) 857-2581
>> > Home: (408) 774-1298
>> > Cell: (408) 348-7536 (No reception business hours Monday, Tuesday, and Thursday; call my desk instead.)
>> >
>> >
>> > -----Original Message-----
>> > From: Jonathan Ellis [mailto:jbellis@gmail.com]
>> > Sent: Thursday, December 03, 2009 2:45 PM
>> > To: cassandra-user@incubator.apache.org
>> > Subject: Re: Persistently increasing read latency
>> >
>> > On Thu, Dec 3, 2009 at 4:34 PM, Freeman, Tim <ti...@hp.com> wrote:
>> >>>Can you tell if the system is i/o or cpu bound during compaction?
>> >>
>> >> It's I/O bound.  It's using ~9% of 1 of 4 cores as I watch it, and all it's doing right now is compactions.
>> >
>> > What does iostat -x 10 (for instance) say about the disk activity?
>> >
>
>
>

Re: Persistently increasing read latency

Posted by Jonathan Ellis <jb...@gmail.com>.
Fundamentally there's only so much I/O you can do at a time.  If you
don't have enough, you need to upgrade to servers with better i/o
(i.e. not EC2: http://pl.atyp.us/wordpress/?p=2240&cpage=1) and/or
more ram to cache the reads against.

On Fri, Dec 4, 2009 at 1:07 PM, B. Todd Burruss <bb...@real.com> wrote:
> this is very concerning to me.  it doesn't seem to take much to bring
> the read performance to an unacceptable level.  are there any
> suggestions about how to improve performance.
>
> here are the params from my config file that are not defaults.  i
> adjusted these to get real good performance, but not over the long haul.
> has anyone had any luck adjusting these to help the problem tim and I
> are having?
>
> <CommitLogRotationThresholdInMB>256</CommitLogRotationThresholdInMB>
> <MemtableSizeInMB>1024</MemtableSizeInMB>
> <MemtableObjectCountInMillions>0.6</MemtableObjectCountInMillions>
> <CommitLogSyncPeriodInMS>1000</CommitLogSyncPeriodInMS>
> <MemtableFlushAfterMinutes>1440</MemtableFlushAfterMinutes>
>
>
> thx!
>
> On Fri, 2009-12-04 at 18:49 +0000, Freeman, Tim wrote:
>> The speed of compaction isn't the problem.  The problem is that lots of reads and writes cause compaction to fall behind.
>>
>> You could solve the problem by throttling reads and writes so compaction isn't starved.  (Maybe just the writes.  I'm not sure.)
>>
>> Different nodes will have different compaction backlogs, so you'd want to do this on a per node basis after Cassandra has made decisions about whatever replication it's going to do.  For example, Cassandra could observe the number of pending compaction tasks and sleep that many milliseconds before every read and write.
>>
>> The status quo is that I have to count a load test as passing only if the amount of backlogged compaction work stays less than some bound.  I'd rather not have to peer into Cassandra internals to determine whether it's really working or not.  It's a problem if 16 hour load tests get different results than 1 hour load tests because in my tests I'm renting a cluster by the hour.
>>
>> Tim Freeman
>> Email: tim.freeman@hp.com
>> Desk in Palo Alto: (650) 857-2581
>> Home: (408) 774-1298
>> Cell: (408) 348-7536 (No reception business hours Monday, Tuesday, and Thursday; call my desk instead.)
>>
>> -----Original Message-----
>> From: Jonathan Ellis [mailto:jbellis@gmail.com]
>> Sent: Thursday, December 03, 2009 3:06 PM
>> To: cassandra-user@incubator.apache.org
>> Subject: Re: Persistently increasing read latency
>>
>> Thanks for looking into this.  Doesn't seem like there's much
>> low-hanging fruit to make compaction faster but I'll keep that in the
>> back of my mind.
>>
>> -Jonathan
>>
>> On Thu, Dec 3, 2009 at 4:58 PM, Freeman, Tim <ti...@hp.com> wrote:
>> >>So this is working as designed, but the design is poor because it
>> >>causes confusion.  If you can open a ticket for this that would be
>> >>great.
>> >
>> > Done, see:
>> >
>> >   https://issues.apache.org/jira/browse/CASSANDRA-599
>> >
>> >>What does iostat -x 10 (for instance) say about the disk activity?
>> >
>> > rkB/s is consistently high, and wkB/s varies.  This is a typical entry with wkB/s at the high end of its range:
>> >
>> >>avg-cpu:  %user   %nice    %sys %iowait   %idle
>> >>           1.52    0.00    1.70   27.49   69.28
>> >>
>> >>Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
>> >>sda          3.10 3249.25 124.08 29.67 26299.30 26288.11 13149.65 13144.06   342.04    17.75   92.25   5.98  91.92
>> >>sda1         0.00   0.00  0.00  0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
>> >>sda2         3.10 3249.25 124.08 29.67 26299.30 26288.11 13149.65 13144.06   342.04    17.75   92.25   5.98  91.92
>> >>sda3         0.00   0.00  0.00  0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
>> >
>> > and at the low end:
>> >
>> >>avg-cpu:  %user   %nice    %sys %iowait   %idle
>> >>           1.50    0.00    1.77   25.80   70.93
>> >>
>> >>Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
>> >>sda          3.40 817.10 128.60 17.70 27828.80 6600.00 13914.40  3300.00   235.33     6.13   56.63   6.21  90.81
>> >>sda1         0.00   0.00  0.00  0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
>> >>sda2         3.40 817.10 128.60 17.70 27828.80 6600.00 13914.40  3300.00   235.33     6.13   56.63   6.21  90.81
>> >>sda3         0.00   0.00  0.00  0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
>> >
>> > Tim Freeman
>> > Email: tim.freeman@hp.com
>> > Desk in Palo Alto: (650) 857-2581
>> > Home: (408) 774-1298
>> > Cell: (408) 348-7536 (No reception business hours Monday, Tuesday, and Thursday; call my desk instead.)
>> >
>> >
>> > -----Original Message-----
>> > From: Jonathan Ellis [mailto:jbellis@gmail.com]
>> > Sent: Thursday, December 03, 2009 2:45 PM
>> > To: cassandra-user@incubator.apache.org
>> > Subject: Re: Persistently increasing read latency
>> >
>> > On Thu, Dec 3, 2009 at 4:34 PM, Freeman, Tim <ti...@hp.com> wrote:
>> >>>Can you tell if the system is i/o or cpu bound during compaction?
>> >>
>> >> It's I/O bound.  It's using ~9% of 1 of 4 cores as I watch it, and all it's doing right now is compactions.
>> >
>> > What does iostat -x 10 (for instance) say about the disk activity?
>> >
>
>
>

RE: Persistently increasing read latency

Posted by "B. Todd Burruss" <bb...@real.com>.
this is very concerning to me.  it doesn't seem to take much to bring
the read performance to an unacceptable level.  are there any
suggestions about how to improve performance?

here are the params from my config file that are not defaults.  i
adjusted these to get real good performance, but not over the long haul.
has anyone had any luck adjusting these to help the problem Tim and I
are having?

<CommitLogRotationThresholdInMB>256</CommitLogRotationThresholdInMB>
<MemtableSizeInMB>1024</MemtableSizeInMB>
<MemtableObjectCountInMillions>0.6</MemtableObjectCountInMillions>
<CommitLogSyncPeriodInMS>1000</CommitLogSyncPeriodInMS>
<MemtableFlushAfterMinutes>1440</MemtableFlushAfterMinutes>


thx!

On Fri, 2009-12-04 at 18:49 +0000, Freeman, Tim wrote:
> The speed of compaction isn't the problem.  The problem is that lots of reads and writes cause compaction to fall behind.
> 
> You could solve the problem by throttling reads and writes so compaction isn't starved.  (Maybe just the writes.  I'm not sure.)
> 
> Different nodes will have different compaction backlogs, so you'd want to do this on a per node basis after Cassandra has made decisions about whatever replication it's going to do.  For example, Cassandra could observe the number of pending compaction tasks and sleep that many milliseconds before every read and write.
> 
> The status quo is that I have to count a load test as passing only if the amount of backlogged compaction work stays less than some bound.  I'd rather not have to peer into Cassandra internals to determine whether it's really working or not.  It's a problem if 16 hour load tests get different results than 1 hour load tests because in my tests I'm renting a cluster by the hour.
> 
> Tim Freeman
> Email: tim.freeman@hp.com
> Desk in Palo Alto: (650) 857-2581
> Home: (408) 774-1298
> Cell: (408) 348-7536 (No reception business hours Monday, Tuesday, and Thursday; call my desk instead.)
> 
> -----Original Message-----
> From: Jonathan Ellis [mailto:jbellis@gmail.com] 
> Sent: Thursday, December 03, 2009 3:06 PM
> To: cassandra-user@incubator.apache.org
> Subject: Re: Persistently increasing read latency
> 
> Thanks for looking into this.  Doesn't seem like there's much
> low-hanging fruit to make compaction faster but I'll keep that in the
> back of my mind.
> 
> -Jonathan
> 
> On Thu, Dec 3, 2009 at 4:58 PM, Freeman, Tim <ti...@hp.com> wrote:
> >>So this is working as designed, but the design is poor because it
> >>causes confusion.  If you can open a ticket for this that would be
> >>great.
> >
> > Done, see:
> >
> >   https://issues.apache.org/jira/browse/CASSANDRA-599
> >
> >>What does iostat -x 10 (for instance) say about the disk activity?
> >
> > rkB/s is consistently high, and wkB/s varies.  This is a typical entry with wkB/s at the high end of its range:
> >
> >>avg-cpu:  %user   %nice    %sys %iowait   %idle
> >>           1.52    0.00    1.70   27.49   69.28
> >>
> >>Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
> >>sda          3.10 3249.25 124.08 29.67 26299.30 26288.11 13149.65 13144.06   342.04    17.75   92.25   5.98  91.92
> >>sda1         0.00   0.00  0.00  0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> >>sda2         3.10 3249.25 124.08 29.67 26299.30 26288.11 13149.65 13144.06   342.04    17.75   92.25   5.98  91.92
> >>sda3         0.00   0.00  0.00  0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> >
> > and at the low end:
> >
> >>avg-cpu:  %user   %nice    %sys %iowait   %idle
> >>           1.50    0.00    1.77   25.80   70.93
> >>
> >>Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
> >>sda          3.40 817.10 128.60 17.70 27828.80 6600.00 13914.40  3300.00   235.33     6.13   56.63   6.21  90.81
> >>sda1         0.00   0.00  0.00  0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> >>sda2         3.40 817.10 128.60 17.70 27828.80 6600.00 13914.40  3300.00   235.33     6.13   56.63   6.21  90.81
> >>sda3         0.00   0.00  0.00  0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> >
> > Tim Freeman
> > Email: tim.freeman@hp.com
> > Desk in Palo Alto: (650) 857-2581
> > Home: (408) 774-1298
> > Cell: (408) 348-7536 (No reception business hours Monday, Tuesday, and Thursday; call my desk instead.)
> >
> >
> > -----Original Message-----
> > From: Jonathan Ellis [mailto:jbellis@gmail.com]
> > Sent: Thursday, December 03, 2009 2:45 PM
> > To: cassandra-user@incubator.apache.org
> > Subject: Re: Persistently increasing read latency
> >
> > On Thu, Dec 3, 2009 at 4:34 PM, Freeman, Tim <ti...@hp.com> wrote:
> >>>Can you tell if the system is i/o or cpu bound during compaction?
> >>
> >> It's I/O bound.  It's using ~9% of 1 of 4 cores as I watch it, and all it's doing right now is compactions.
> >
> > What does iostat -x 10 (for instance) say about the disk activity?
> >



RE: Persistently increasing read latency

Posted by "Freeman, Tim" <ti...@hp.com>.
The speed of compaction isn't the problem.  The problem is that lots of reads and writes cause compaction to fall behind.

You could solve the problem by throttling reads and writes so compaction isn't starved.  (Maybe just the writes.  I'm not sure.)

Different nodes will have different compaction backlogs, so you'd want to do this on a per node basis after Cassandra has made decisions about whatever replication it's going to do.  For example, Cassandra could observe the number of pending compaction tasks and sleep that many milliseconds before every read and write.
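
To make that concrete, here's a rough sketch of the sort of throttle I have in mind, done client side over JMX since that's the interface I can see the pending-task numbers through today.  The MBean name and attribute are guesses based on what jconsole and "nodeprobe tpstats" display -- treat them as assumptions to verify against your build -- and the one-millisecond-per-pending-task policy is just the simplest thing that could work:

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Sketch of a client-side throttle: before each read or write, poll the
// node's COMPACTION-POOL pending task count over JMX and sleep that many
// milliseconds.  The MBean name and attribute are assumptions -- check
// them with jconsole on your build.
public class CompactionThrottle {
    private final MBeanServerConnection mbs;
    private final ObjectName compactionPool;

    public CompactionThrottle(String host, int jmxPort) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
            "service:jmx:rmi:///jndi/rmi://" + host + ":" + jmxPort + "/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        mbs = connector.getMBeanServerConnection();
        // Name guessed from the pool names "nodeprobe tpstats" prints.
        compactionPool = new ObjectName(
            "org.apache.cassandra.concurrent:type=COMPACTION-POOL");
    }

    // Sleep one millisecond per pending compaction before an operation.
    public void throttle() throws Exception {
        int pending =
            ((Number) mbs.getAttribute(compactionPool, "PendingTasks")).intValue();
        if (pending > 0)
            Thread.sleep(pending);
    }
}

Obviously it would be better for Cassandra to do this internally, where it already knows its own backlog; the sketch is just to show how little logic is involved.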

The status quo is that I have to count a load test as passing only if the amount of backlogged compaction work stays less than some bound.  I'd rather not have to peer into Cassandra internals to determine whether it's really working or not.  It's a problem if 16 hour load tests get different results than 1 hour load tests because in my tests I'm renting a cluster by the hour.

Tim Freeman
Email: tim.freeman@hp.com
Desk in Palo Alto: (650) 857-2581
Home: (408) 774-1298
Cell: (408) 348-7536 (No reception business hours Monday, Tuesday, and Thursday; call my desk instead.)

-----Original Message-----
From: Jonathan Ellis [mailto:jbellis@gmail.com] 
Sent: Thursday, December 03, 2009 3:06 PM
To: cassandra-user@incubator.apache.org
Subject: Re: Persistently increasing read latency

Thanks for looking into this.  Doesn't seem like there's much
low-hanging fruit to make compaction faster but I'll keep that in the
back of my mind.

-Jonathan

On Thu, Dec 3, 2009 at 4:58 PM, Freeman, Tim <ti...@hp.com> wrote:
>>So this is working as designed, but the design is poor because it
>>causes confusion.  If you can open a ticket for this that would be
>>great.
>
> Done, see:
>
>   https://issues.apache.org/jira/browse/CASSANDRA-599
>
>>What does iostat -x 10 (for instance) say about the disk activity?
>
> rkB/s is consistently high, and wkB/s varies.  This is a typical entry with wkB/s at the high end of its range:
>
>>avg-cpu:  %user   %nice    %sys %iowait   %idle
>>           1.52    0.00    1.70   27.49   69.28
>>
>>Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
>>sda          3.10 3249.25 124.08 29.67 26299.30 26288.11 13149.65 13144.06   342.04    17.75   92.25   5.98  91.92
>>sda1         0.00   0.00  0.00  0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
>>sda2         3.10 3249.25 124.08 29.67 26299.30 26288.11 13149.65 13144.06   342.04    17.75   92.25   5.98  91.92
>>sda3         0.00   0.00  0.00  0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
>
> and at the low end:
>
>>avg-cpu:  %user   %nice    %sys %iowait   %idle
>>           1.50    0.00    1.77   25.80   70.93
>>
>>Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
>>sda          3.40 817.10 128.60 17.70 27828.80 6600.00 13914.40  3300.00   235.33     6.13   56.63   6.21  90.81
>>sda1         0.00   0.00  0.00  0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
>>sda2         3.40 817.10 128.60 17.70 27828.80 6600.00 13914.40  3300.00   235.33     6.13   56.63   6.21  90.81
>>sda3         0.00   0.00  0.00  0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
>
> Tim Freeman
> Email: tim.freeman@hp.com
> Desk in Palo Alto: (650) 857-2581
> Home: (408) 774-1298
> Cell: (408) 348-7536 (No reception business hours Monday, Tuesday, and Thursday; call my desk instead.)
>
>
> -----Original Message-----
> From: Jonathan Ellis [mailto:jbellis@gmail.com]
> Sent: Thursday, December 03, 2009 2:45 PM
> To: cassandra-user@incubator.apache.org
> Subject: Re: Persistently increasing read latency
>
> On Thu, Dec 3, 2009 at 4:34 PM, Freeman, Tim <ti...@hp.com> wrote:
>>>Can you tell if the system is i/o or cpu bound during compaction?
>>
>> It's I/O bound.  It's using ~9% of 1 of 4 cores as I watch it, and all it's doing right now is compactions.
>
> What does iostat -x 10 (for instance) say about the disk activity?
>

Re: Persistently increasing read latency

Posted by Jonathan Ellis <jb...@gmail.com>.
those will be removed on GC or restart, and are harmless

On Thu, Dec 3, 2009 at 6:20 PM, B. Todd Burruss <bb...@real.com> wrote:
> another note on this, i stopped my client and after about 35 minutes the
> compaction did complete, no more pending in compaction-pool.  however
> the Index, Data, and Filter files still exist with lots of data in them.
> "Compact" files exist for all but 4 of the Data files - these "compact"
> files are zero length.
>
> thx!
>
>
> On Thu, 2009-12-03 at 15:40 -0800, B. Todd Burruss wrote:
>> in my situation it seems like the compaction process is being starved.
>> i'm hitting the server hard for the last 45 minutes and the compaction
>> pool is sitting at 1 active, 25 pending, and 7 completed.  it has been
>> at 1 active and 7 completed for about 20 minutes.  the pending have been
>> growing steadily since then.  and as i was typing it finally finished
>> another compaction, so they must be just taking forever.
>>
>> snapshots of nodeprobe and iostats follow:
>>
>> Pool Name                    Active   Pending      Completed
>> FILEUTILS-DELETE-POOL             0         0            116
>> MESSAGING-SERVICE-POOL            0         0              0
>> STREAM-STAGE                      0         0              0
>> RESPONSE-STAGE                    0         0              0
>> ROW-READ-STAGE                    1         4        8652560
>> LB-OPERATIONS                     0         0              0
>> COMMITLOG                         1         0       14695623
>> MESSAGE-DESERIALIZER-POOL         0         0              0
>> GMFD                              0         0              0
>> LB-TARGET                         0         0              0
>> CONSISTENCY-MANAGER               0         0              0
>> ROW-MUTATION-STAGE                1         1       14692604
>> MESSAGE-STREAMING-POOL            0         0              0
>> LOAD-BALANCER-STAGE               0         0              0
>> FLUSH-SORTER-POOL                 0         0             28
>> MEMTABLE-POST-FLUSHER             0         0             28
>> COMPACTION-POOL                   1        25              7
>> FLUSH-WRITER-POOL                 0         0             28
>> HINTED-HANDOFF-POOL               0         0              0
>>
>>
>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>           61.85    0.00   26.68    7.73    0.00    3.74
>>
>> Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
>> sda             246.00     11456.00     18528.00      11456      18528
>> sda2          23074.00        20.50      1854.00         20       1854
>> sda1              0.00         0.00         0.00          0          0
>>
>>
>>
>>
>> On Thu, 2009-12-03 at 17:05 -0600, Jonathan Ellis wrote:
>> > Thanks for looking into this.  Doesn't seem like there's much
>> > low-hanging fruit to make compaction faster but I'll keep that in the
>> > back of my mind.
>> >
>> > -Jonathan
>> >
>> > On Thu, Dec 3, 2009 at 4:58 PM, Freeman, Tim <ti...@hp.com> wrote:
>> > >>So this is working as designed, but the design is poor because it
>> > >>causes confusion.  If you can open a ticket for this that would be
>> > >>great.
>> > >
>> > > Done, see:
>> > >
>> > >   https://issues.apache.org/jira/browse/CASSANDRA-599
>> > >
>> > >>What does iostat -x 10 (for instance) say about the disk activity?
>> > >
>> > > rkB/s is consistently high, and wkB/s varies.  This is a typical entry with wkB/s at the high end of its range:
>> > >
>> > >>avg-cpu:  %user   %nice    %sys %iowait   %idle
>> > >>           1.52    0.00    1.70   27.49   69.28
>> > >>
>> > >>Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
>> > >>sda          3.10 3249.25 124.08 29.67 26299.30 26288.11 13149.65 13144.06   342.04    17.75   92.25   5.98  91.92
>> > >>sda1         0.00   0.00  0.00  0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
>> > >>sda2         3.10 3249.25 124.08 29.67 26299.30 26288.11 13149.65 13144.06   342.04    17.75   92.25   5.98  91.92
>> > >>sda3         0.00   0.00  0.00  0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
>> > >
>> > > and at the low end:
>> > >
>> > >>avg-cpu:  %user   %nice    %sys %iowait   %idle
>> > >>           1.50    0.00    1.77   25.80   70.93
>> > >>
>> > >>Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
>> > >>sda          3.40 817.10 128.60 17.70 27828.80 6600.00 13914.40  3300.00   235.33     6.13   56.63   6.21  90.81
>> > >>sda1         0.00   0.00  0.00  0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
>> > >>sda2         3.40 817.10 128.60 17.70 27828.80 6600.00 13914.40  3300.00   235.33     6.13   56.63   6.21  90.81
>> > >>sda3         0.00   0.00  0.00  0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
>> > >
>> > > Tim Freeman
>> > > Email: tim.freeman@hp.com
>> > > Desk in Palo Alto: (650) 857-2581
>> > > Home: (408) 774-1298
>> > > Cell: (408) 348-7536 (No reception business hours Monday, Tuesday, and Thursday; call my desk instead.)
>> > >
>> > >
>> > > -----Original Message-----
>> > > From: Jonathan Ellis [mailto:jbellis@gmail.com]
>> > > Sent: Thursday, December 03, 2009 2:45 PM
>> > > To: cassandra-user@incubator.apache.org
>> > > Subject: Re: Persistently increasing read latency
>> > >
>> > > On Thu, Dec 3, 2009 at 4:34 PM, Freeman, Tim <ti...@hp.com> wrote:
>> > >>>Can you tell if the system is i/o or cpu bound during compaction?
>> > >>
>> > >> It's I/O bound.  It's using ~9% of 1 of 4 cores as I watch it, and all it's doing right now is compactions.
>> > >
>> > > What does iostat -x 10 (for instance) say about the disk activity?
>> > >
>>
>>
>
>
>

Re: Persistently increasing read latency

Posted by "B. Todd Burruss" <bb...@real.com>.
another note on this, i stopped my client and after about 35 minutes the
compaction did complete, no more pending in compaction-pool.  however
the Index, Data, and Filter files still exist with lots of data in them.
"Compact" files exist for all but 4 of the Data files - these "compact"
files are zero length.

thx!


On Thu, 2009-12-03 at 15:40 -0800, B. Todd Burruss wrote:
> in my situation it seems like the compaction process is being starved.
> i'm hitting the server hard for the last 45 minutes and the compaction
> pool is sitting at 1 active, 25 pending, and 7 completed.  it has been
> at 1 active and 7 completed for about 20 minutes.  the pending have been
> growing steadily since then.  and as i was typing it finally finished
> another compaction, so they must be just taking forever.
> 
> snapshots of nodeprobe and iostats follow:
> 
> Pool Name                    Active   Pending      Completed
> FILEUTILS-DELETE-POOL             0         0            116
> MESSAGING-SERVICE-POOL            0         0              0
> STREAM-STAGE                      0         0              0
> RESPONSE-STAGE                    0         0              0
> ROW-READ-STAGE                    1         4        8652560
> LB-OPERATIONS                     0         0              0
> COMMITLOG                         1         0       14695623
> MESSAGE-DESERIALIZER-POOL         0         0              0
> GMFD                              0         0              0
> LB-TARGET                         0         0              0
> CONSISTENCY-MANAGER               0         0              0
> ROW-MUTATION-STAGE                1         1       14692604
> MESSAGE-STREAMING-POOL            0         0              0
> LOAD-BALANCER-STAGE               0         0              0
> FLUSH-SORTER-POOL                 0         0             28
> MEMTABLE-POST-FLUSHER             0         0             28
> COMPACTION-POOL                   1        25              7
> FLUSH-WRITER-POOL                 0         0             28
> HINTED-HANDOFF-POOL               0         0              0
> 
> 
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>           61.85    0.00   26.68    7.73    0.00    3.74
> 
> Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
> sda             246.00     11456.00     18528.00      11456      18528
> sda2          23074.00        20.50      1854.00         20       1854
> sda1              0.00         0.00         0.00          0          0
> 
> 
> 
> 
> On Thu, 2009-12-03 at 17:05 -0600, Jonathan Ellis wrote:
> > Thanks for looking into this.  Doesn't seem like there's much
> > low-hanging fruit to make compaction faster but I'll keep that in the
> > back of my mind.
> > 
> > -Jonathan
> > 
> > On Thu, Dec 3, 2009 at 4:58 PM, Freeman, Tim <ti...@hp.com> wrote:
> > >>So this is working as designed, but the design is poor because it
> > >>causes confusion.  If you can open a ticket for this that would be
> > >>great.
> > >
> > > Done, see:
> > >
> > >   https://issues.apache.org/jira/browse/CASSANDRA-599
> > >
> > >>What does iostat -x 10 (for instance) say about the disk activity?
> > >
> > > rkB/s is consistently high, and wkB/s varies.  This is a typical entry with wkB/s at the high end of its range:
> > >
> > >>avg-cpu:  %user   %nice    %sys %iowait   %idle
> > >>           1.52    0.00    1.70   27.49   69.28
> > >>
> > >>Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
> > >>sda          3.10 3249.25 124.08 29.67 26299.30 26288.11 13149.65 13144.06   342.04    17.75   92.25   5.98  91.92
> > >>sda1         0.00   0.00  0.00  0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> > >>sda2         3.10 3249.25 124.08 29.67 26299.30 26288.11 13149.65 13144.06   342.04    17.75   92.25   5.98  91.92
> > >>sda3         0.00   0.00  0.00  0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> > >
> > > and at the low end:
> > >
> > >>avg-cpu:  %user   %nice    %sys %iowait   %idle
> > >>           1.50    0.00    1.77   25.80   70.93
> > >>
> > >>Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
> > >>sda          3.40 817.10 128.60 17.70 27828.80 6600.00 13914.40  3300.00   235.33     6.13   56.63   6.21  90.81
> > >>sda1         0.00   0.00  0.00  0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> > >>sda2         3.40 817.10 128.60 17.70 27828.80 6600.00 13914.40  3300.00   235.33     6.13   56.63   6.21  90.81
> > >>sda3         0.00   0.00  0.00  0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> > >
> > > Tim Freeman
> > > Email: tim.freeman@hp.com
> > > Desk in Palo Alto: (650) 857-2581
> > > Home: (408) 774-1298
> > > Cell: (408) 348-7536 (No reception business hours Monday, Tuesday, and Thursday; call my desk instead.)
> > >
> > >
> > > -----Original Message-----
> > > From: Jonathan Ellis [mailto:jbellis@gmail.com]
> > > Sent: Thursday, December 03, 2009 2:45 PM
> > > To: cassandra-user@incubator.apache.org
> > > Subject: Re: Persistently increasing read latency
> > >
> > > On Thu, Dec 3, 2009 at 4:34 PM, Freeman, Tim <ti...@hp.com> wrote:
> > >>>Can you tell if the system is i/o or cpu bound during compaction?
> > >>
> > >> It's I/O bound.  It's using ~9% of 1 of 4 cores as I watch it, and all it's doing right now is compactions.
> > >
> > > What does iostat -x 10 (for instance) say about the disk activity?
> > >
> 
> 



Re: Persistently increasing read latency

Posted by "B. Todd Burruss" <bb...@real.com>.
in my situation it seems like the compaction process is being starved.
i've been hitting the server hard for the last 45 minutes and the compaction
pool is sitting at 1 active, 25 pending, and 7 completed.  it has been
at 1 active and 7 completed for about 20 minutes.  the pending have been
growing steadily since then.  and as i was typing it finally finished
another compaction, so they must be just taking forever.

snapshots of nodeprobe and iostats follow:

Pool Name                    Active   Pending      Completed
FILEUTILS-DELETE-POOL             0         0            116
MESSAGING-SERVICE-POOL            0         0              0
STREAM-STAGE                      0         0              0
RESPONSE-STAGE                    0         0              0
ROW-READ-STAGE                    1         4        8652560
LB-OPERATIONS                     0         0              0
COMMITLOG                         1         0       14695623
MESSAGE-DESERIALIZER-POOL         0         0              0
GMFD                              0         0              0
LB-TARGET                         0         0              0
CONSISTENCY-MANAGER               0         0              0
ROW-MUTATION-STAGE                1         1       14692604
MESSAGE-STREAMING-POOL            0         0              0
LOAD-BALANCER-STAGE               0         0              0
FLUSH-SORTER-POOL                 0         0             28
MEMTABLE-POST-FLUSHER             0         0             28
COMPACTION-POOL                   1        25              7
FLUSH-WRITER-POOL                 0         0             28
HINTED-HANDOFF-POOL               0         0              0


avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          61.85    0.00   26.68    7.73    0.00    3.74

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             246.00     11456.00     18528.00      11456      18528
sda2          23074.00        20.50      1854.00         20       1854
sda1              0.00         0.00         0.00          0          0




On Thu, 2009-12-03 at 17:05 -0600, Jonathan Ellis wrote:
> Thanks for looking into this.  Doesn't seem like there's much
> low-hanging fruit to make compaction faster but I'll keep that in the
> back of my mind.
> 
> -Jonathan
> 
> On Thu, Dec 3, 2009 at 4:58 PM, Freeman, Tim <ti...@hp.com> wrote:
> >>So this is working as designed, but the design is poor because it
> >>causes confusion.  If you can open a ticket for this that would be
> >>great.
> >
> > Done, see:
> >
> >   https://issues.apache.org/jira/browse/CASSANDRA-599
> >
> >>What does iostat -x 10 (for instance) say about the disk activity?
> >
> > rkB/s is consistently high, and wkB/s varies.  This is a typical entry with wkB/s at the high end of its range:
> >
> >>avg-cpu:  %user   %nice    %sys %iowait   %idle
> >>           1.52    0.00    1.70   27.49   69.28
> >>
> >>Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
> >>sda          3.10 3249.25 124.08 29.67 26299.30 26288.11 13149.65 13144.06   342.04    17.75   92.25   5.98  91.92
> >>sda1         0.00   0.00  0.00  0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> >>sda2         3.10 3249.25 124.08 29.67 26299.30 26288.11 13149.65 13144.06   342.04    17.75   92.25   5.98  91.92
> >>sda3         0.00   0.00  0.00  0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> >
> > and at the low end:
> >
> >>avg-cpu:  %user   %nice    %sys %iowait   %idle
> >>           1.50    0.00    1.77   25.80   70.93
> >>
> >>Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
> >>sda          3.40 817.10 128.60 17.70 27828.80 6600.00 13914.40  3300.00   235.33     6.13   56.63   6.21  90.81
> >>sda1         0.00   0.00  0.00  0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> >>sda2         3.40 817.10 128.60 17.70 27828.80 6600.00 13914.40  3300.00   235.33     6.13   56.63   6.21  90.81
> >>sda3         0.00   0.00  0.00  0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> >
> > Tim Freeman
> > Email: tim.freeman@hp.com
> > Desk in Palo Alto: (650) 857-2581
> > Home: (408) 774-1298
> > Cell: (408) 348-7536 (No reception business hours Monday, Tuesday, and Thursday; call my desk instead.)
> >
> >
> > -----Original Message-----
> > From: Jonathan Ellis [mailto:jbellis@gmail.com]
> > Sent: Thursday, December 03, 2009 2:45 PM
> > To: cassandra-user@incubator.apache.org
> > Subject: Re: Persistently increasing read latency
> >
> > On Thu, Dec 3, 2009 at 4:34 PM, Freeman, Tim <ti...@hp.com> wrote:
> >>>Can you tell if the system is i/o or cpu bound during compaction?
> >>
> >> It's I/O bound.  It's using ~9% of 1 of 4 cores as I watch it, and all it's doing right now is compactions.
> >
> > What does iostat -x 10 (for instance) say about the disk activity?
> >



Re: Persistently increasing read latency

Posted by Jonathan Ellis <jb...@gmail.com>.
Thanks for looking into this.  Doesn't seem like there's much
low-hanging fruit to make compaction faster but I'll keep that in the
back of my mind.

-Jonathan

On Thu, Dec 3, 2009 at 4:58 PM, Freeman, Tim <ti...@hp.com> wrote:
>>So this is working as designed, but the design is poor because it
>>causes confusion.  If you can open a ticket for this that would be
>>great.
>
> Done, see:
>
>   https://issues.apache.org/jira/browse/CASSANDRA-599
>
>>What does iostat -x 10 (for instance) say about the disk activity?
>
> rkB/s is consistently high, and wkB/s varies.  This is a typical entry with wkB/s at the high end of its range:
>
>>avg-cpu:  %user   %nice    %sys %iowait   %idle
>>           1.52    0.00    1.70   27.49   69.28
>>
>>Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
>>sda          3.10 3249.25 124.08 29.67 26299.30 26288.11 13149.65 13144.06   342.04    17.75   92.25   5.98  91.92
>>sda1         0.00   0.00  0.00  0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
>>sda2         3.10 3249.25 124.08 29.67 26299.30 26288.11 13149.65 13144.06   342.04    17.75   92.25   5.98  91.92
>>sda3         0.00   0.00  0.00  0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
>
> and at the low end:
>
>>avg-cpu:  %user   %nice    %sys %iowait   %idle
>>           1.50    0.00    1.77   25.80   70.93
>>
>>Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
>>sda          3.40 817.10 128.60 17.70 27828.80 6600.00 13914.40  3300.00   235.33     6.13   56.63   6.21  90.81
>>sda1         0.00   0.00  0.00  0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
>>sda2         3.40 817.10 128.60 17.70 27828.80 6600.00 13914.40  3300.00   235.33     6.13   56.63   6.21  90.81
>>sda3         0.00   0.00  0.00  0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
>
> Tim Freeman
> Email: tim.freeman@hp.com
> Desk in Palo Alto: (650) 857-2581
> Home: (408) 774-1298
> Cell: (408) 348-7536 (No reception business hours Monday, Tuesday, and Thursday; call my desk instead.)
>
>
> -----Original Message-----
> From: Jonathan Ellis [mailto:jbellis@gmail.com]
> Sent: Thursday, December 03, 2009 2:45 PM
> To: cassandra-user@incubator.apache.org
> Subject: Re: Persistently increasing read latency
>
> On Thu, Dec 3, 2009 at 4:34 PM, Freeman, Tim <ti...@hp.com> wrote:
>>>Can you tell if the system is i/o or cpu bound during compaction?
>>
>> It's I/O bound.  It's using ~9% of 1 of 4 cores as I watch it, and all it's doing right now is compactions.
>
> What does iostat -x 10 (for instance) say about the disk activity?
>

RE: Persistently increasing read latency

Posted by "Freeman, Tim" <ti...@hp.com>.
>So this is working as designed, but the design is poor because it
>causes confusion.  If you can open a ticket for this that would be
>great.

Done, see:

   https://issues.apache.org/jira/browse/CASSANDRA-599

>What does iostat -x 10 (for instance) say about the disk activity?

rkB/s is consistently high, and wkB/s varies.  This is a typical entry with wkB/s at the high end of its range:

>avg-cpu:  %user   %nice    %sys %iowait   %idle
>           1.52    0.00    1.70   27.49   69.28
>
>Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
>sda          3.10 3249.25 124.08 29.67 26299.30 26288.11 13149.65 13144.06   342.04    17.75   92.25   5.98  91.92
>sda1         0.00   0.00  0.00  0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
>sda2         3.10 3249.25 124.08 29.67 26299.30 26288.11 13149.65 13144.06   342.04    17.75   92.25   5.98  91.92
>sda3         0.00   0.00  0.00  0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

and at the low end:

>avg-cpu:  %user   %nice    %sys %iowait   %idle
>           1.50    0.00    1.77   25.80   70.93
>
>Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
>sda          3.40 817.10 128.60 17.70 27828.80 6600.00 13914.40  3300.00   235.33     6.13   56.63   6.21  90.81
>sda1         0.00   0.00  0.00  0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
>sda2         3.40 817.10 128.60 17.70 27828.80 6600.00 13914.40  3300.00   235.33     6.13   56.63   6.21  90.81
>sda3         0.00   0.00  0.00  0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

Tim Freeman
Email: tim.freeman@hp.com
Desk in Palo Alto: (650) 857-2581
Home: (408) 774-1298
Cell: (408) 348-7536 (No reception business hours Monday, Tuesday, and Thursday; call my desk instead.)


-----Original Message-----
From: Jonathan Ellis [mailto:jbellis@gmail.com] 
Sent: Thursday, December 03, 2009 2:45 PM
To: cassandra-user@incubator.apache.org
Subject: Re: Persistently increasing read latency

On Thu, Dec 3, 2009 at 4:34 PM, Freeman, Tim <ti...@hp.com> wrote:
>>Can you tell if the system is i/o or cpu bound during compaction?
>
> It's I/O bound.  It's using ~9% of 1 of 4 cores as I watch it, and all it's doing right now is compactions.

What does iostat -x 10 (for instance) say about the disk activity?

Re: Persistently increasing read latency

Posted by Jonathan Ellis <jb...@gmail.com>.
On Thu, Dec 3, 2009 at 4:34 PM, Freeman, Tim <ti...@hp.com> wrote:
>>Can you tell if the system is i/o or cpu bound during compaction?
>
> It's I/O bound.  It's using ~9% of 1 of 4 cores as I watch it, and all it's doing right now is compactions.

What does iostat -x 10 (for instance) say about the disk activity?

RE: Persistently increasing read latency

Posted by "Freeman, Tim" <ti...@hp.com>.
>Can you tell if the system is i/o or cpu bound during compaction?

It's I/O bound.  It's using ~9% of 1 of 4 cores as I watch it, and all it's doing right now is compactions.

Tim Freeman
Email: tim.freeman@hp.com
Desk in Palo Alto: (650) 857-2581
Home: (408) 774-1298
Cell: (408) 348-7536 (No reception business hours Monday, Tuesday, and Thursday; call my desk instead.)


-----Original Message-----
From: Jonathan Ellis [mailto:jbellis@gmail.com] 
Sent: Thursday, December 03, 2009 2:19 PM
To: cassandra-user@incubator.apache.org
Subject: Re: Persistently increasing read latency

On Thu, Dec 3, 2009 at 3:59 PM, Freeman, Tim <ti...@hp.com> wrote:
> I stopped the client at 11:28.  There were 2306 files in data/Keyspace1.  It's now 12:44, and there are 1826 files in data/Keyspace1.  As I wrote this email, the number increased to 1903, then to 1938 and 2015, even though the server has no clients.  I used jconsole to invoke a few explicit garbage collections and the number went down to 811.

Sounds normal.

> jconsole reports that the compaction pool has 1670 pending tasks.  As I wrote this email, the number gradually increased to 1673.  The server has no clients, so this is odd.  The number of completed tasks in the compaction pool has consistently been going up while the number of pending tasks stays the same.  The number of completed tasks increased from 130 to 136.

This is because whenever compaction finishes, it adds another
compaction task to see if the newly compacted table is itself large
enough to compact with others.  In a system where compaction has kept
up with demand, these are quickly cleaned out of the queue, but in
your case they are stuck behind all the compactions that are merging
sstables.

So this is working as designed, but the design is poor because it
causes confusion.  If you can open a ticket for this that would be
great.

> log.2009-12-02-19: WARN [Timer-0] 2009-12-02 19:55:23,305 LoadDisseminator.java (line 44) Exception was generated at : 12/02/2009 19:55:22 on thread Timer-0

These have been fixed and are unrelated to compaction.

So, it sounds like things are working, and if you leave it alone for a
while it will finish compacting everything and the queue of compaction
jobs will clear out, and reads should be fast(er) again.

Like I said originally, increasing memtable size / object count will
reduce the number of compactions required.  That's about all you can do
in 0.5...  Can you tell if the system is i/o or cpu bound during
compaction?

-Jonathan

Re: Persistently increasing read latency

Posted by Jonathan Ellis <jb...@gmail.com>.
On Thu, Dec 3, 2009 at 3:59 PM, Freeman, Tim <ti...@hp.com> wrote:
> I stopped the client at 11:28.  There were 2306 files in data/Keyspace1.  It's now 12:44, and there are 1826 files in data/Keyspace1.  As I wrote this email, the number increased to 1903, then to 1938 and 2015, even though the server has no clients.  I used jconsole to invoke a few explicit garbage collections and the number went down to 811.

Sounds normal.

> jconsole reports that the compaction pool has 1670 pending tasks.  As I wrote this email, the number gradually increased to 1673.  The server has no clients, so this is odd.  The number of completed tasks in the compaction pool has consistently been going up while the number of pending tasks stays the same.  The number of completed tasks increased from 130 to 136.

This is because whenever compaction finishes, it adds another
compaction task to see if the newly compacted table is itself large
enough to compact with others.  In a system where compaction has kept
up with demand, these are quickly cleaned out of the queue, but in
your case they are stuck behind all the compactions that are merging
sstables.
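
If it helps to see the shape of it, here is a stripped-down illustration of the pattern (not the actual Cassandra code, just a sketch of the behavior described above):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Illustration only: each finished compaction queues a follow-up task to
// see whether the table it just wrote should itself be compacted.  On a
// node that has kept up, the follow-up runs right away and finds nothing
// to do; on a backlogged node it sits in the queue behind the real merges,
// so the pending count stays inflated even after clients stop.
public class CompactionFollowUpDemo {
    static final ExecutorService COMPACTION_POOL = Executors.newSingleThreadExecutor();

    static String merge(String sstable) {
        return sstable + "-compacted";     // stand-in for the real merge
    }

    static boolean bigEnoughToCompactAgain(String sstable) {
        return false;                      // usually false once caught up
    }

    static void submitCompaction(final String sstable) {
        COMPACTION_POOL.submit(new Runnable() {
            public void run() {
                final String merged = merge(sstable);
                // the follow-up check that shows up as an extra pending task
                COMPACTION_POOL.submit(new Runnable() {
                    public void run() {
                        if (bigEnoughToCompactAgain(merged))
                            submitCompaction(merged);
                    }
                });
            }
        });
    }

    public static void main(String[] args) throws InterruptedException {
        submitCompaction("Standard1-1-Data");
        Thread.sleep(200);                 // let the demo tasks drain
        COMPACTION_POOL.shutdown();
        COMPACTION_POOL.awaitTermination(5, TimeUnit.SECONDS);
    }
}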

So this is working as designed, but the design is poor because it
causes confusion.  If you can open a ticket for this that would be
great.

> log.2009-12-02-19: WARN [Timer-0] 2009-12-02 19:55:23,305 LoadDisseminator.java (line 44) Exception was generated at : 12/02/2009 19:55:22 on thread Timer-0

These have been fixed and are unrelated to compaction.

So, it sounds like things are working, and if you leave it alone for a
while it will finish compacting everything and the queue of compaction
jobs will clear out, and reads should be fast(er) again.

Like I said originally, increasing memtable size / object count will
reduce the number of compactions required.  That's about all you can do
in 0.5...  Can you tell if the system is i/o or cpu bound during
compaction?

-Jonathan

RE: Persistently increasing read latency

Posted by "Freeman, Tim" <ti...@hp.com>.
I'm trying to get all the questions answered at once here so people aren't seeing a ton of emails from me.

>If you stop doing client requests does that number start to go down?

I stopped the client at 11:28.  There were 2306 files in data/Keyspace1.  It's now 12:44, and there are 1826 files in data/Keyspace1.  As I wrote this email, the number increased to 1903, then to 1938 and 2015, even though the server has no clients.  I used jconsole to invoke a few explicit garbage collections and the number went down to 811.

jconsole reports that the compaction pool has 1670 pending tasks.  As I wrote this email, the number gradually increased to 1673.  The server has no clients, so this is odd.  The number of completed tasks in the compaction pool has consistently been going up while the number of pending tasks stays the same.  The number of completed tasks increased from 130 to 136.

> If not there is a bug.  If there are exceptions in the logs then it is
> an obvious one otherwise something more subtle.

The output from "grep -v INFO log*" is reasonably concise:
 
log.2009-12-02-19: WARN [Timer-0] 2009-12-02 19:55:23,305 LoadDisseminator.java (line 44) Exception was generated at : 12/02/2009 19:55:22 on thread Timer-0
log.2009-12-02-19:
log.2009-12-02-19:java.lang.NullPointerException
log.2009-12-02-19:      at org.apache.cassandra.utils.FileUtils.getUsedDiskSpaceForPath(FileUtils.java:198)
log.2009-12-02-19:      at org.apache.cassandra.utils.FileUtils.getUsedDiskSpaceForPath(FileUtils.java:200)
log.2009-12-02-19:      at org.apache.cassandra.utils.FileUtils.getUsedDiskSpaceForPath(FileUtils.java:200)
log.2009-12-02-19:      at org.apache.cassandra.utils.FileUtils.getUsedDiskSpace(FileUtils.java:211)
log.2009-12-02-19:      at org.apache.cassandra.service.StorageService.getLoad(StorageService.java:510)
log.2009-12-02-19:      at org.apache.cassandra.service.LoadDisseminator.run(LoadDisseminator.java:37)
log.2009-12-02-19:      at java.util.TimerThread.mainLoop(Timer.java:512)
log.2009-12-02-19:      at java.util.TimerThread.run(Timer.java:462)
log.2009-12-02-19:
log.2009-12-02-23: WARN [Timer-0] 2009-12-02 23:10:27,128 LoadDisseminator.java (line 44) Exception was generated at : 12/02/2009 23:10:27 on thread Timer-0
log.2009-12-02-23:
log.2009-12-02-23:java.lang.NullPointerException
log.2009-12-02-23:      at org.apache.cassandra.utils.FileUtils.getUsedDiskSpaceForPath(FileUtils.java:198)
log.2009-12-02-23:      at org.apache.cassandra.utils.FileUtils.getUsedDiskSpaceForPath(FileUtils.java:200)
log.2009-12-02-23:      at org.apache.cassandra.utils.FileUtils.getUsedDiskSpaceForPath(FileUtils.java:200)
log.2009-12-02-23:      at org.apache.cassandra.utils.FileUtils.getUsedDiskSpace(FileUtils.java:211)
log.2009-12-02-23:      at org.apache.cassandra.service.StorageService.getLoad(StorageService.java:510)
log.2009-12-02-23:      at org.apache.cassandra.service.LoadDisseminator.run(LoadDisseminator.java:37)
log.2009-12-02-23:      at java.util.TimerThread.mainLoop(Timer.java:512)
log.2009-12-02-23:      at java.util.TimerThread.run(Timer.java:462)
log.2009-12-02-23:

Recall that these line numbers should match the 29 Nov build.

Chris Goffinet [goffinet@digg.com] said:
>After you stop the test. Do you see the pending tasks for compaction  
>drop?

I see chatter in the log files saying that compactions are happening, but the number of reported pending tasks has increased while I was watching.

Chris Goffinet [goffinet@digg.com] said:
>[Were] there any other numbers in tpstats from nodeprobe that are growing?

I didn't ask this question before stopping my client.  Now tpstats reports that there are no pending tasks other than 1672 COMPACTION-POOL tasks, and the only active tasks are one COMMITLOG and one COMPACTION-POOL.

>Can you plot the number of SSTables? 

I suppose so.  Do you mean the number of files, or one of the stats output from "nodeprobe tpstats"?  Keep in mind that the number of files is influenced by the garbage collector, so I doubt that counting the files will tell us much.
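
If it's the files you mean, here's roughly what I'd run against data/Keyspace1 to track them over time.  The name matching is an assumption based on the Data/Index/Filter and zero-length "Compact" marker files mentioned elsewhere in this thread, so adjust it to whatever your build actually writes:

import java.io.File;

// Rough count of sstable-related files by name.  The matching below is an
// assumption -- adjust it to what your data directory actually contains.
public class SSTableCount {
    public static void main(String[] args) {
        File dir = new File(args.length > 0 ? args[0] : "data/Keyspace1");
        File[] files = dir.listFiles();
        if (files == null) {
            System.err.println("no such directory: " + dir);
            return;
        }
        int dataFiles = 0, compactMarkers = 0;
        for (File f : files) {
            String name = f.getName();
            if (name.contains("Compact"))
                compactMarkers++;          // compacted-but-not-yet-GCed marker
            else if (name.contains("-Data"))
                dataFiles++;               // sstable data file
        }
        System.out.println(dataFiles + " data files, "
            + compactMarkers + " compaction markers");
    }
}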

>Are you using the standard storage-conf.xml defaults?

Pretty much.  I attached the file to this email.  My summary of the changes from 1 Dec 09 is still valid even though I changed Cassandra versions:

>Briefly, it has default values, except RpcTimeoutInMillis is 30000 and the partitioner is OrderPreservingPartitioner.

Tim Freeman
Email: tim.freeman@hp.com
Desk in Palo Alto: (650) 857-2581
Home: (408) 774-1298
Cell: (408) 348-7536 (No reception business hours Monday, Tuesday, and Thursday; call my desk instead.)


-----Original Message-----
From: Jonathan Ellis [mailto:jbellis@gmail.com] 
Sent: Thursday, December 03, 2009 11:14 AM
To: cassandra-user@incubator.apache.org
Subject: Re: Persistently increasing read latency

On Thu, Dec 3, 2009 at 1:11 PM, Freeman, Tim <ti...@hp.com> wrote:
>>how many are in your data directories?  is your compaction
>>lagging 1000s of tables behind again?
>
> Yes, there are 2348 files in data/Keyspace1, and jconsole says the compaction pool has >1600 pending tasks.

If you stop doing client requests does that number start to go down?

If not there is a bug.  If there are exceptions in the logs then it is
an obvious one otherwise something more subtle. :)

-Jonathan

Re: Persistently increasing read latency

Posted by Jonathan Ellis <jb...@gmail.com>.
On Thu, Dec 3, 2009 at 1:11 PM, Freeman, Tim <ti...@hp.com> wrote:
>>how many are in your data directories?  is your compaction
>>lagging 1000s of tables behind again?
>
> Yes, there are 2348 files in data/Keyspace1, and jconsole says the compaction pool has >1600 pending tasks.

If you stop doing client requests does that number start to go down?

If not there is a bug.  If there are exceptions in the logs then it is
an obvious one otherwise something more subtle. :)

-Jonathan

Re: Persistently increasing read latency

Posted by Chris Goffinet <go...@digg.com>.
Tim,

After you stop the test, do you see the pending tasks for compaction
drop? Need to verify you didn't run into a new bug. If the number
starts to slowly drop, that just indicates that compactions are not
keeping up with your write levels.

On Dec 3, 2009, at 11:11 AM, Freeman, Tim wrote:

>> how many are in your data directories?  is your compaction
>> lagging 1000s of tables behind again?
>
> Yes, there are 2348 files in data/Keyspace1, and jconsole says the  
> compaction pool has >1600 pending tasks.
>
> Chris Goffinet's questions were good too but it will take me a  
> little while to get answers.
>
> Tim Freeman
> Email: tim.freeman@hp.com
> Desk in Palo Alto: (650) 857-2581
> Home: (408) 774-1298
> Cell: (408) 348-7536 (No reception business hours Monday, Tuesday,  
> and Thursday; call my desk instead.)
>
>
> -----Original Message-----
> From: Jonathan Ellis [mailto:jbellis@gmail.com]
> Sent: Thursday, December 03, 2009 11:02 AM
> To: cassandra-user@incubator.apache.org
> Subject: Re: Persistently increasing read latency
>
> i would expect read latency to increase linearly w/ the number of
> sstables you have around.  how many are in your data directories?  is
> your compaction lagging 1000s of tables behind again?
>
> On Thu, Dec 3, 2009 at 12:58 PM, Freeman, Tim <ti...@hp.com>  
> wrote:
>> I ran another test last night with the build dated 29 Nov 2009.   
>> Other than the Cassandra version, the setup was the same as  
>> before.  I got qualitatively similar results as before, too -- the  
>> read latency increased fairly smoothly from 250ms to 1s, the GC  
>> times reported by jconsole are low, the pending tasks for row- 
>> mutation-stage and row-read-stage are less than 10, the pending  
>> tasks for the compaction pool are 1615.  Last time around the read  
>> latency maxed out at one second.  This time, it just got to one  
>> second as I'm writing this so I don't know yet if it will continue  
>> to increase.
>>
>> I have attached a fresh graph describing the present run.  It's  
>> qualitatively similar to the previous one.  The vertical units are  
>> milliseconds (for latency) and operations per minute (for reads or  
>> writes).  The horizontal scale is seconds.  The feature that's  
>> bothering me is the red line for the read latency going diagonally  
>> from lower left to the lower-middle right.  The scale doesn't make  
>> it look dramatic, but Cassandra slowed down by a factor of 4.
>>
>> The read and write rates were stable for 45,000 seconds or so, and  
>> then the read latency got big enough that the application was  
>> starved for reads and it started writing less.
>>
>> If this is worth pursuing, I suppose the next step would be for me  
>> to make a small program that reproduces the problem.  It should be  
>> easy -- we're just reading and writing random records.  Let me know  
>> if there's interest in that.  I could  also decide to live with a  
>> 1000 ms latency here.  I'm thinking of putting a cache in the local  
>> filesystem in front of Cassandra (or whichever distributed DB we  
>> decide to go with), so living with it is definitely possible.
>>
>> Tim Freeman
>> Email: tim.freeman@hp.com
>> Desk in Palo Alto: (650) 857-2581
>> Home: (408) 774-1298
>> Cell: (408) 348-7536 (No reception business hours Monday, Tuesday,  
>> and Thursday; call my desk instead.)
>>
>> -----Original Message-----
>> From: Jonathan Ellis [mailto:jbellis@gmail.com]
>> Sent: Tuesday, December 01, 2009 11:10 AM
>> To: cassandra-user@incubator.apache.org
>> Subject: Re: Persistently increasing read latency
>>
>> 1) use jconsole to see what is happening to jvm / cassandra  
>> internals.
>>  possibly you are slowly exceeding cassandra's ability to keep up  
>> with
>> writes, causing the jvm to spend more and more effort GCing to find
>> enough memory to keep going
>>
>> 2) you should be at least on 0.4.2 and preferably trunk if you are
>> stress testing
>>
>> -Jonathan
>>
>> On Tue, Dec 1, 2009 at 12:11 PM, Freeman, Tim <ti...@hp.com>  
>> wrote:
>>> In an 8 hour test run, I've seen the read latency for Cassandra  
>>> drift fairly linearly from ~460ms to ~900ms.  Eventually my  
>>> application gets starved for reads and starts misbehaving.  I have  
>>> attached graphs -- horizontal scales are seconds, vertical scales  
>>> are operations per minute and average milliseconds per operation.   
>>> The clearest feature is the light blue line in the left graph  
>>> drifting consistently upward during the run.
>>>
>>> I have a Cassandra 0.4.1 database, one node, records are 100kbytes  
>>> each, 350K records, 8 threads reading, around 700 reads per  
>>> minute.  There are also 8 threads writing.  This is all happening  
>>> on a 4 core processor that's supporting both the Cassandra node  
>>> and the code that's generating load for it.  I'm reasonably sure  
>>> that there are no page faults.
>>>
>>> I have attached my storage-conf.xml.  Briefly, it has default  
>>> values, except RpcTimeoutInMillis is 30000 and the partitioner is  
>>> OrderPreservingPartitioner.  Cassandra's garbage collection  
>>> parameters are:
>>>
>>>   -Xms128m -Xmx1G -XX:SurvivorRatio=8 -XX:+AggressiveOpts -XX: 
>>> +UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled
>>>
>>> Is this normal behavior?  Is there some change to the  
>>> configuration I should make to get it to stop getting slower?  If  
>>> it's not normal, what debugging information should I gather?   
>>> Should I give up on Cassandra 0.4.1 and move to a newer version?
>>>
>>> I'll leave it running for the time being in case there's something  
>>> useful to extract from it.
>>>
>>> Tim Freeman
>>> Email: tim.freeman@hp.com
>>> Desk in Palo Alto: (650) 857-2581
>>> Home: (408) 774-1298
>>> Cell: (408) 348-7536 (No reception business hours Monday, Tuesday,  
>>> and Thursday; call my desk instead.)
>>>
>>>
>>


RE: Persistently increasing read latency

Posted by "Freeman, Tim" <ti...@hp.com>.
>how many are in your data directories?  is your compaction 
>lagging 1000s of tables behind again?

Yes, there are 2348 files in data/Keyspace1, and jconsole says the compaction pool has >1600 pending tasks.

Chris Goffinet's questions were good too but it will take me a little while to get answers.

Tim Freeman
Email: tim.freeman@hp.com
Desk in Palo Alto: (650) 857-2581
Home: (408) 774-1298
Cell: (408) 348-7536 (No reception business hours Monday, Tuesday, and Thursday; call my desk instead.)


-----Original Message-----
From: Jonathan Ellis [mailto:jbellis@gmail.com] 
Sent: Thursday, December 03, 2009 11:02 AM
To: cassandra-user@incubator.apache.org
Subject: Re: Persistently increasing read latency

i would expect read latency to increase linearly w/ the number of
sstables you have around.  how many are in your data directories?  is
your compaction lagging 1000s of tables behind again?

On Thu, Dec 3, 2009 at 12:58 PM, Freeman, Tim <ti...@hp.com> wrote:
> I ran another test last night with the build dated 29 Nov 2009.  Other than the Cassandra version, the setup was the same as before.  I got qualitatively similar results as before, too -- the read latency increased fairly smoothly from 250ms to 1s, the GC times reported by jconsole are low, the pending tasks for row-mutation-stage and row-read-stage are less than 10, the pending tasks for the compaction pool are 1615.  Last time around the read latency maxed out at one second.  This time, it just got to one second as I'm writing this so I don't know yet if it will continue to increase.
>
> I have attached a fresh graph describing the present run.  It's qualitatively similar to the previous one.  The vertical units are milliseconds (for latency) and operations per minute (for reads or writes).  The horizontal scale is seconds.  The feature that's bothering me is the red line for the read latency going diagonally from lower left to the lower-middle right.  The scale doesn't make it look dramatic, but Cassandra slowed down by a factor of 4.
>
> The read and write rates were stable for 45,000 seconds or so, and then the read latency got big enough that the application was starved for reads and it started writing less.
>
> If this is worth pursuing, I suppose the next step would be for me to make a small program that reproduces the problem.  It should be easy -- we're just reading and writing random records.  Let me know if there's interest in that.  I could  also decide to live with a 1000 ms latency here.  I'm thinking of putting a cache in the local filesystem in front of Cassandra (or whichever distributed DB we decide to go with), so living with it is definitely possible.
>
> Tim Freeman
> Email: tim.freeman@hp.com
> Desk in Palo Alto: (650) 857-2581
> Home: (408) 774-1298
> Cell: (408) 348-7536 (No reception business hours Monday, Tuesday, and Thursday; call my desk instead.)
>
> -----Original Message-----
> From: Jonathan Ellis [mailto:jbellis@gmail.com]
> Sent: Tuesday, December 01, 2009 11:10 AM
> To: cassandra-user@incubator.apache.org
> Subject: Re: Persistently increasing read latency
>
> 1) use jconsole to see what is happening to jvm / cassandra internals.
>  possibly you are slowly exceeding cassandra's ability to keep up with
> writes, causing the jvm to spend more and more effort GCing to find
> enough memory to keep going
>
> 2) you should be at least on 0.4.2 and preferably trunk if you are
> stress testing
>
> -Jonathan
>

Re: Persistently increasing read latency

Posted by Jonathan Ellis <jb...@gmail.com>.
i would expect read latency to increase linearly w/ the number of
sstables you have around.  how many are in your data directories?  is
your compaction lagging 1000s of tables behind again?
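
A quick way to count them is to look for the data files in the data directory. A minimal sketch (the default path and the "-Data.db" suffix are assumptions based on the usual storage-conf.xml layout, not something confirmed in this thread):

    import java.io.File;

    public class CountSSTables {
        public static void main(String[] args) {
            // Directory to scan; pass your DataFileDirectory if it differs from this assumed default.
            File dir = new File(args.length > 0 ? args[0] : "/var/lib/cassandra/data/Keyspace1");
            File[] files = dir.listFiles();
            int count = 0;
            if (files != null) {
                for (File f : files) {
                    // SSTable data files are assumed to end in "-Data.db".
                    if (f.getName().endsWith("-Data.db")) {
                        count++;
                    }
                }
            }
            System.out.println(count + " SSTable data files in " + dir.getPath());
        }
    }

Running this once a minute during a test run would show whether the on-disk SSTable count climbs in step with the read latency.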

On Thu, Dec 3, 2009 at 12:58 PM, Freeman, Tim <ti...@hp.com> wrote:
> I ran another test last night with the build dated 29 Nov 2009.  Other than the Cassandra version, the setup was the same as before.  I got qualitatively similar results as before, too -- the read latency increased fairly smoothly from 250ms to 1s, the GC times reported by jconsole are low, the pending tasks for row-mutation-stage and row-read-stage are less than 10, the pending tasks for the compaction pool are 1615.  Last time around the read latency maxed out at one second.  This time, it just got to one second as I'm writing this so I don't know yet if it will continue to increase.
>
> I have attached a fresh graph describing the present run.  It's qualitatively similar to the previous one.  The vertical units are milliseconds (for latency) and operations per minute (for reads or writes).  The horizontal scale is seconds.  The feature that's bothering me is the red line for the read latency going diagonally from lower left to the lower-middle right.  The scale doesn't make it look dramatic, but Cassandra slowed down by a factor of 4.
>
> The read and write rates were stable for 45,000 seconds or so, and then the read latency got big enough that the application was starved for reads and it started writing less.
>
> If this is worth pursuing, I suppose the next step would be for me to make a small program that reproduces the problem.  It should be easy -- we're just reading and writing random records.  Let me know if there's interest in that.  I could  also decide to live with a 1000 ms latency here.  I'm thinking of putting a cache in the local filesystem in front of Cassandra (or whichever distributed DB we decide to go with), so living with it is definitely possible.
>
> Tim Freeman
> Email: tim.freeman@hp.com
> Desk in Palo Alto: (650) 857-2581
> Home: (408) 774-1298
> Cell: (408) 348-7536 (No reception business hours Monday, Tuesday, and Thursday; call my desk instead.)
>
> -----Original Message-----
> From: Jonathan Ellis [mailto:jbellis@gmail.com]
> Sent: Tuesday, December 01, 2009 11:10 AM
> To: cassandra-user@incubator.apache.org
> Subject: Re: Persistently increasing read latency
>
> 1) use jconsole to see what is happening to jvm / cassandra internals.
>  possibly you are slowly exceeding cassandra's ability to keep up with
> writes, causing the jvm to spend more and more effort GCing to find
> enough memory to keep going
>
> 2) you should be at least on 0.4.2 and preferably trunk if you are
> stress testing
>
> -Jonathan
>

Re: Persistently increasing read latency

Posted by Jonathan Ellis <jb...@gmail.com>.
On Thu, Dec 3, 2009 at 1:04 PM, Chris Goffinet <go...@digg.com> wrote:
> We've seen reads spike like this with a large number of SSTables.

(Which I emphasize is a natural consequence of the SSTable design: until compaction merges them, a read may have to consult every SSTable that could contain the requested row, so read latency grows roughly with the number of SSTables on disk.)

Re: Persistently increasing read latency

Posted by Chris Goffinet <go...@digg.com>.
Tim,

Very interesting information. Are there any other numbers in tpstats from nodeprobe that are growing?

Can you plot the number of SSTables? Are you using the standard  
storage-conf.xml defaults?

We've seen reads spike like this with a large number of SSTables.
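
One way to get such a plot is to poll the pending-task counts over JMX and log them with timestamps. A rough sketch follows; the JMX port (8080) and the compaction-pool MBean name below are assumptions about this era's defaults rather than anything confirmed here, so adjust them to whatever jconsole shows on your node:

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class CompactionBacklogLogger {
        public static void main(String[] args) throws Exception {
            // Assumed defaults: JMX on localhost:8080, compaction pool MBean named as below.
            JMXServiceURL url =
                new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:8080/jmxrmi");
            JMXConnector connector = JMXConnectorFactory.connect(url);
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            ObjectName pool =
                new ObjectName("org.apache.cassandra.concurrent:type=MINOR-COMPACTION-POOL");
            while (true) {
                // "PendingTasks" is assumed to be the attribute tpstats reports as pending.
                Object pending = mbs.getAttribute(pool, "PendingTasks");
                System.out.println(System.currentTimeMillis() + "\t" + pending);
                Thread.sleep(60000L);  // sample once a minute; plot the resulting column
            }
        }
    }

The same loop could also log the SSTable count (for example by counting the data files as in the earlier sketch), making it easy to plot both against the latency curve.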

-Chris

On Dec 3, 2009, at 10:58 AM, Freeman, Tim wrote:

> I ran another test last night with the build dated 29 Nov 2009.   
> Other than the Cassandra version, the setup was the same as before.   
> I got qualitatively similar results as before, too -- the read  
> latency increased fairly smoothly from 250ms to 1s, the GC times  
> reported by jconsole are low, the pending tasks for row-mutation- 
> stage and row-read-stage are less than 10, the pending tasks for the  
> compaction pool are 1615.  Last time around the read latency maxed  
> out at one second.  This time, it just got to one second as I'm  
> writing this so I don't know yet if it will continue to increase.
>
> I have attached a fresh graph describing the present run.  It's  
> qualitatively similar to the previous one.  The vertical units are  
> milliseconds (for latency) and operations per minute (for reads or  
> writes).  The horizontal scale is seconds.  The feature that's  
> bothering me is the red line for the read latency going diagonally  
> from lower left to the lower-middle right.  The scale doesn't make  
> it look dramatic, but Cassandra slowed down by a factor of 4.
>
> The read and write rates were stable for 45,000 seconds or so, and  
> then the read latency got big enough that the application was  
> starved for reads and it started writing less.
>
> If this is worth pursuing, I suppose the next step would be for me  
> to make a small program that reproduces the problem.  It should be  
> easy -- we're just reading and writing random records.  Let me know  
> if there's interest in that.  I could  also decide to live with a  
> 1000 ms latency here.  I'm thinking of putting a cache in the local  
> filesystem in front of Cassandra (or whichever distributed DB we  
> decide to go with), so living with it is definitely possible.
>
> Tim Freeman
> Email: tim.freeman@hp.com
> Desk in Palo Alto: (650) 857-2581
> Home: (408) 774-1298
> Cell: (408) 348-7536 (No reception business hours Monday, Tuesday,  
> and Thursday; call my desk instead.)
>
> -----Original Message-----
> From: Jonathan Ellis [mailto:jbellis@gmail.com]
> Sent: Tuesday, December 01, 2009 11:10 AM
> To: cassandra-user@incubator.apache.org
> Subject: Re: Persistently increasing read latency
>
> 1) use jconsole to see what is happening to jvm / cassandra internals.
> possibly you are slowly exceeding cassandra's ability to keep up with
> writes, causing the jvm to spend more and more effort GCing to find
> enough memory to keep going
>
> 2) you should be at least on 0.4.2 and preferably trunk if you are
> stress testing
>
> -Jonathan
>
> <latency-trend.png>


RE: Persistently increasing read latency

Posted by "Freeman, Tim" <ti...@hp.com>.
I ran another test last night with the build dated 29 Nov 2009.  Other than the Cassandra version, the setup was the same as before, and the results were qualitatively similar too -- the read latency increased fairly smoothly from 250ms to 1s, the GC times reported by jconsole are low, the pending tasks for row-mutation-stage and row-read-stage are less than 10, and the pending tasks for the compaction pool are 1615.  Last time around the read latency maxed out at one second; this time it has only just reached one second as I'm writing this, so I don't know yet whether it will continue to increase.

I have attached a fresh graph describing the present run.  It's qualitatively similar to the previous one.  The vertical units are milliseconds (for latency) and operations per minute (for reads or writes).  The horizontal scale is seconds.  The feature that's bothering me is the red line for the read latency going diagonally from lower left to the lower-middle right.  The scale doesn't make it look dramatic, but Cassandra slowed down by a factor of 4.

The read and write rates were stable for 45,000 seconds or so, and then the read latency got big enough that the application was starved for reads and it started writing less.

If this is worth pursuing, I suppose the next step would be for me to make a small program that reproduces the problem.  It should be easy -- we're just reading and writing random records.  Let me know if there's interest in that.  I could also decide to live with a 1000 ms latency here.  I'm thinking of putting a cache in the local filesystem in front of Cassandra (or whichever distributed DB we decide to go with), so living with it is definitely possible.
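
For reference, a read-through cache along those lines can be quite small.  The sketch below is only an illustration: FileCache, fetchFromStore and writeToStore are hypothetical names standing in for the application's real client calls, and keys are assumed to be filesystem-safe.

    import java.io.File;
    import java.io.IOException;
    import java.nio.file.Files;

    /** Minimal read-through cache in the local filesystem, fronting a slower store. */
    public class FileCache {
        private final File cacheDir;

        public FileCache(File cacheDir) {
            this.cacheDir = cacheDir;
            cacheDir.mkdirs();
        }

        /** Keys are assumed to be filesystem-safe (e.g. hex record ids). */
        public byte[] get(String key) throws IOException {
            File cached = new File(cacheDir, key);
            if (cached.exists()) {
                return Files.readAllBytes(cached.toPath());   // hit: no round trip to the store
            }
            byte[] value = fetchFromStore(key);               // miss: read from the backing store
            Files.write(cached.toPath(), value);              // populate the cache for next time
            return value;
        }

        /** Writes go through here so the cached copy never goes stale. */
        public void put(String key, byte[] value) throws IOException {
            Files.write(new File(cacheDir, key).toPath(), value);
            writeToStore(key, value);
        }

        // Placeholders: substitute the application's actual read/write calls.
        private byte[] fetchFromStore(String key) throws IOException {
            throw new UnsupportedOperationException("hook up the real read path here");
        }

        private void writeToStore(String key, byte[] value) throws IOException {
            throw new UnsupportedOperationException("hook up the real write path here");
        }
    }

A real version would also want some bound on the cache directory's size, but even this simple shape hides the backing store's latency for repeat reads.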

Tim Freeman
Email: tim.freeman@hp.com
Desk in Palo Alto: (650) 857-2581
Home: (408) 774-1298
Cell: (408) 348-7536 (No reception business hours Monday, Tuesday, and Thursday; call my desk instead.)

-----Original Message-----
From: Jonathan Ellis [mailto:jbellis@gmail.com] 
Sent: Tuesday, December 01, 2009 11:10 AM
To: cassandra-user@incubator.apache.org
Subject: Re: Persistently increasing read latency

1) use jconsole to see what is happening to jvm / cassandra internals.
 possibly you are slowly exceeding cassandra's ability to keep up with
writes, causing the jvm to spend more and more effort GCing to find
enough memory to keep going

2) you should be at least on 0.4.2 and preferably trunk if you are
stress testing

-Jonathan


Re: Persistently increasing read latency

Posted by Jonathan Ellis <jb...@gmail.com>.
1) use jconsole to see what is happening to jvm / cassandra internals.
 possibly you are slowly exceeding cassandra's ability to keep up with
writes, causing the jvm to spend more and more effort GCing to find
enough memory to keep going

2) you should be at least on 0.4.2 and preferably trunk if you are
stress testing

-Jonathan
