Posted to user@cassandra.apache.org by Curt Bererton <cu...@zipzapplay.com> on 2010/05/18 00:39:01 UTC

Problems running Cassandra 0.6.1 on large EC2 instances.

Hello Cassandra users+experts,

Hopefully someone will be able to point me in the correct direction. We have
Cassandra 0.6.1 working on our test servers and we *thought* everything was
great and ready to move to production. We are currently running a production
ring of 4 large EC2 instances (http://aws.amazon.com/ec2/instance-types/)
with a replication factor of 3 and a QUORUM consistency level. We ran a test
on 1% of our users, and reads and writes to Cassandra went great for the
first 3 hours. After that point CPU usage spiked to 100% and stayed there,
on basically all 4 machines at once. This smells to me like a GC issue, and
I'm looking into it with jconsole right now. If anyone can help me debug
this and get Cassandra all the way up and running without the CPU spiking, I
would be forever in their debt.

I suspect that anyone else running Cassandra on large EC2 instances might
just be able to tell me what JVM args they are successfully using in a
production environment, whether they upgraded from 0.6.1 to 0.6.2, and
whether they went to batched writes due to bug 1014
(https://issues.apache.org/jira/browse/CASSANDRA-1014). That might answer
all my questions.

Is there anyone on the list who is using large EC2 instances in production?
Would you be kind enough to share your JVM arguments and any other tips?

Thanks for any help,
Curt
--
Curt, ZipZapPlay Inc., www.PlayCrafter.com,
http://apps.facebook.com/happyhabitat

Re: Problems running Cassandra 0.6.1 on large EC2 instances.

Posted by Curt Bererton <cu...@zipzapplay.com>.
Thanks for the help guys:

First, to answer the first question: both cores are pegged:

Cpu0  : 43.8%us, 34.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si, 22.1%st
Cpu1  : 40.5%us, 36.2%sy,  0.0%ni,  0.4%id,  0.0%wa,  0.0%hi,  0.2%si, 22.6%st
Mem:   7872040k total,  3620180k used,  4251860k free,   388052k buffers
Swap:        0k total,        0k used,        0k free,  1655920k cached

Here's our current set up (mostly default) for storage-conf.xml:

<Storage>
  <ClusterName>zzpproduction</ClusterName>
  <AutoBootstrap>true</AutoBootstrap>

  <Keyspaces>
    <Keyspace Name="ks1">
      <ColumnFamily Name="A" CompareWith="BytesType"/>
      <ColumnFamily Name="B" CompareWith="BytesType"/>
      <ColumnFamily Name="C" CompareWith="BytesType"/>
      <ColumnFamily Name="D" CompareWith="BytesType"/>
      <ColumnFamily Name="E" CompareWith="BytesType"/>
      <ColumnFamily Name="F" CompareWith="BytesType"/>

<ReplicaPlacementStrategy>org.apache.cassandra.locator.RackUnawareStrategy</ReplicaPlacementStrategy>
      <ReplicationFactor>3</ReplicationFactor>

<EndPointSnitch>org.apache.cassandra.locator.EndPointSnitch</EndPointSnitch>
    </Keyspace>
  </Keyspaces>

<Authenticator>org.apache.cassandra.auth.AllowAllAuthenticator</Authenticator>
  <Partitioner>org.apache.cassandra.dht.RandomPartitioner</Partitioner>
  <!-- this gets set at server boot time -->
  <InitialToken>@CASSANDRA_TOKEN@</InitialToken>
  <CommitLogDirectory>/mnt/cassandra/commitlog</CommitLogDirectory>
  <DataFileDirectories>
      <DataFileDirectory>/mnt/cassandra/data</DataFileDirectory>
  </DataFileDirectories>
  <!-- gets set at server boot -->
  <Seeds>
   @DEPLOY_SEEDS@
  </Seeds>

  <RpcTimeoutInMillis>10000</RpcTimeoutInMillis>
  <CommitLogRotationThresholdInMB>128</CommitLogRotationThresholdInMB>
  <ListenAddress>@CASSANDRA_ADDRESS@</ListenAddress>
  <StoragePort>7000</StoragePort>
  <ThriftAddress>@CASSANDRA_ADDRESS@</ThriftAddress>
  <ThriftPort>9160</ThriftPort>
  <ThriftFramedTransport>false</ThriftFramedTransport>
  <DiskAccessMode>auto</DiskAccessMode>
  <RowWarningThresholdInMB>512</RowWarningThresholdInMB>
  <SlicedBufferSizeInKB>64</SlicedBufferSizeInKB>
  <FlushDataBufferSizeInMB>32</FlushDataBufferSizeInMB>
  <FlushIndexBufferSizeInMB>8</FlushIndexBufferSizeInMB>
  <ColumnIndexSizeInKB>64</ColumnIndexSizeInKB>
  <MemtableThroughputInMB>64</MemtableThroughputInMB>
  <BinaryMemtableThroughputInMB>256</BinaryMemtableThroughputInMB>
  <MemtableOperationsInMillions>0.3</MemtableOperationsInMillions>
  <MemtableFlushAfterMinutes>60</MemtableFlushAfterMinutes>
  <ConcurrentReads>8</ConcurrentReads>
  <ConcurrentWrites>32</ConcurrentWrites>
  <CommitLogSync>periodic</CommitLogSync>
  <CommitLogSyncPeriodInMS>10000</CommitLogSyncPeriodInMS>
  <!-- <CommitLogSyncBatchWindowInMS>1</CommitLogSyncBatchWindowInMS> -->
  <GCGraceSeconds>864000</GCGraceSeconds>
</Storage>

We use QUORUM for all writes and reads. All instances are in the same region
and availability zone (us-east-1b).

We haven't set up RAID on the machines yet (though I want to), so for now
the commit log and the data files share the same disk. Mostly I just want
to get the sucker up and running, and then I'll optimize the crap out of it.
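
For reference, a minimal sketch of the RAID 0 setup I have in mind, assuming
the two ephemeral disks show up as /dev/sdb and /dev/sdc (device names and
filesystem choice are illustrative, not what we actually have configured):

  # stripe the two ephemeral disks into one volume
  mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sdb /dev/sdc
  mkfs.ext3 /dev/md0
  # mount where DataFileDirectory points
  mount /dev/md0 /mnt/cassandra

With that in place the commit log could eventually move to its own volume.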


I'm currently thinking that it might be some outright stupidity in our
client code. We use PHP with Pandra, and in looking through our client code
I noticed we never call "disconnect" anywhere. We've got around 8 middle app
servers talking to 4 Cassandra nodes. Might that explain why there are 178
threads showing up in jconsole? I don't know how many threads are typical.
Looking at a given thread in jconsole, it typically says something like:
State: Waiting on
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@6c7db013
Total blocked: 34, Total waited: 3,631

Is that indicating that a thread is just sitting there waiting with an open
connection?
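
One quick way to test the leaked-connection theory on a Cassandra node
(9160 is the ThriftPort from our storage-conf.xml above):

  # count established Thrift connections to this node
  netstat -tn | grep ':9160' | grep ESTABLISHED | wc -l

If that count keeps climbing while application traffic stays flat, the
clients aren't releasing their connections.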

I'm looking into the above as well as trying out jmap right now.

Thanks for the suggestions, keep 'em coming. I'm hoping it's just the
stupidity of not closing the connection from the client side.

Best,
Curt

--
Curt, ZipZapPlay Inc., www.PlayCrafter.com,
http://apps.facebook.com/happyhabitat


On Mon, May 17, 2010 at 5:00 PM, Brandon Williams <dr...@gmail.com> wrote:

> Since your heap isn't anywhere near exhausted, I don't think you have a GC
> storm happening.  Is it one core or both that are pegged?  One way to tell
> which thread is using all the CPU is to run top -H so you can see the
> threads, get the pid of the one using the CPU, convert that to hex, then run
> jmap <main java pid> and grep for the hex.
>
> -Brandon
>

Re: Problems running Cassandra 0.6.1 on large EC2 instances.

Posted by Brandon Williams <dr...@gmail.com>.
On Mon, May 17, 2010 at 6:02 PM, Curt Bererton <cu...@zipzapplay.com> wrote:

> So pretty much the defaults aside from the 7Gig max heap. CPU is totally
> hammered right now, and it is receiving 0 ops/sec from me since I
> disconnected it from our application right now until I can figure out what's
> going on.
>
> running top on the machine I get:
> top - 18:56:32 up 2 days, 20:57,  2 users,  load average: 14.97, 15.24,
> 15.13
> Tasks:  87 total,   5 running,  82 sleeping,   0 stopped,   0 zombie
> Cpu(s): 40.1%us, 33.9%sy,  0.0%ni,  0.1%id,  0.0%wa,  0.0%hi,  1.3%si,
> 24.6%st
> Mem:   7872040k total,  3618764k used,  4253276k free,   387536k buffers
> Swap:        0k total,        0k used,        0k free,  1655556k cached
>
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+
> COMMAND
>  2566 cassandr  25   0 7906m 639m  10m S  150  8.3   5846:35 java
>

Since your heap isn't anywhere near exhausted, I don't think you have a GC
storm happening.  Is it one core or both that are pegged?  One way to tell
which thread is using all the CPU is to run top -H so you can see the
threads, get the pid of the one using the CPU, convert that to hex, then run
jmap <main java pid> and grep for the hex.
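
A rough sketch of that workflow (the pid 2566 is from the top output above;
the thread pid 2571 and its hex value are made up for illustration, and a
jstack thread dump is another place the hex shows up, as nid=0x...):

  top -H -p 2566                 # per-thread view; note the busiest thread, say 2571
  printf '%x\n' 2571             # convert to hex -> a0b
  jstack 2566 | grep nid=0xa0b   # find the matching thread in the dump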

-Brandon

Re: Problems running Cassandra 0.6.1 on large EC2 instances.

Posted by Curt Bererton <cu...@zipzapplay.com>.
We can get Cassandra to run great for a few hours now. Writing to and
reading from Cassandra works well and the read/write times are good. We
also changed our config to enable row caching (we're hoping to ditch our
memcache server layer entirely).
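
For reference, row caching in 0.6 is a per-CF attribute in storage-conf.xml;
a sketch of the kind of change we made (the cache size here is made up, and
RowsCached also accepts a percentage):

  <ColumnFamily Name="A" CompareWith="BytesType" RowsCached="10000"/>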

Unfortunately, running on an EC2 High-Memory Extra Large instance with the
commit log in batch mode led to huge iowait on the CPU with only 20% of our
traffic. We don't have the commit log on a separate disk yet, but the iowait
still seemed much higher than it should have been. On Jonathan's
recommendation we changed to periodic mode in storage-conf.xml. This fixed
the iowait problem, but the machines went down hard after a few million
writes. Unfortunately I don't have any JMX or JVM-level debugging on these
boxes (other than command-line tools), so I don't have a ton of insight yet
as to why it choked.
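
For reference, the two storage-conf.xml variants in question (the 1 ms batch
window is the value shown commented out in our config earlier in the thread):

  <!-- batch mode: what we ran first -->
  <CommitLogSync>batch</CommitLogSync>
  <CommitLogSyncBatchWindowInMS>1</CommitLogSyncBatchWindowInMS>

  <!-- periodic mode: what we switched to -->
  <CommitLogSync>periodic</CommitLogSync>
  <CommitLogSyncPeriodInMS>10000</CommitLogSyncPeriodInMS>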

The main symptoms are free memory dropping to zero and CPU shooting up to
100% very suddenly, typically at roughly the same time on all machines.

We have two hypotheses:

   - our PHP client is leaking connections somehow
   - the GC kicks in and has so much memory to clean up (the heap is at 12
   gigs) that it takes forever, and while the GC is running and eating CPU
   something else goes wrong (see the GC-logging sketch below)
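
If the second hypothesis is right, a GC log should show it directly. A
minimal sketch of flags to append to JVM_OPTS (these are standard HotSpot
1.6 options; the log path is just an example):

        -verbose:gc \
        -XX:+PrintGCDetails \
        -XX:+PrintGCTimeStamps \
        -Xloggc:/var/log/cassandra/gc.log \

Back-to-back full GCs or "concurrent mode failure" entries in that log would
point at the 12 gig heap; a quiet log would point back at the client.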


I'm hooking up jcollectd to cassandra to see if we can find out more.

If anyone has any other suggestions please let me know.

C

--
Curt, ZipZapPlay Inc., www.PlayCrafter.com,
http://apps.facebook.com/bakinglife http://apps.facebook.com/happyhabitat


On Fri, May 21, 2010 at 12:53 PM, S Ahmed <sa...@gmail.com> wrote:

> curious how did things turn out?
>

Re: Problems running Cassandra 0.6.1 on large EC2 instances.

Posted by S Ahmed <sa...@gmail.com>.
curious how did things turn out?

On Tue, May 18, 2010 at 1:38 PM, Curt Bererton <cu...@zipzapplay.com> wrote:

> We only have a few CFs (6 or 7).  I've increased the MemtableThroughputInMB
> and MemtableOperationsInMillions as per your suggestions. Do we really
> need a swap file though? I suppose it can't hurt, but with my problem in
> particular we weren't maxing out main memory.
>
> We'll be running another test today and see if the settings changes
> proposed so far fix our problem ( I hope so ).
>
> Best,
> Curt
>

Re: Problems running Cassandra 0.6.1 on large EC2 instances.

Posted by Curt Bererton <cu...@zipzapplay.com>.
We only have a few CFs (6 or 7).  I've increased the MemtableThroughputInMB
and MemtableOperationsInMillions as per your suggestions. Do we really need
a swap file though? I suppose it can't hurt, but with my problem in
particular we weren't maxing out main memory.

We'll run another test today and see if the settings changes proposed so far
fix our problem (I hope so).

Best,
Curt


On Tue, May 18, 2010 at 5:59 AM, Lee Parker <le...@socialagency.com> wrote:

> How many different CFs do you have?  If you only have a few, I would highly
> recommend increasing the MemtableThroughputInMB and MemtableOperationsInMillions.
>  We only have two CFs and I have them set at 256MB and 2.5m. Since most of our
> columns are relatively small, these values are practically equivalent to
> each other.  I would also recommend dropping your heap space to 6G and
> adding a swap file.  In our case, the large EC2 instances didn't have any
> swap setup by default.
>
> Lee Parker
>
>
>

Re: Problems running Cassandra 0.6.1 on large EC2 instances.

Posted by Lee Parker <le...@socialagency.com>.
How many different CFs do you have?  If you only have a few, I would highly
recommend increasing the MemtableThroughputInMB and
MemtableOperationsInMillions.
 We only have two CFs and I have them set at 256MB and 2.5m. Since most of
our columns are relatively small, these values are practically equivalent to
each other.  I would also recommend dropping your heap space to 6G and
adding a swap file.  In our case, the large EC2 instances didn't have any
swap set up by default.
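
For reference, those two knobs live in storage-conf.xml; with the values
above, the change would look like:

  <MemtableThroughputInMB>256</MemtableThroughputInMB>
  <MemtableOperationsInMillions>2.5</MemtableOperationsInMillions>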

Lee Parker
On Mon, May 17, 2010 at 7:31 PM, Curt Bererton <cu...@zipzapplay.com> wrote:

> Agreed, and I just saw in storage-conf that a higher value for
> MemtableFlushAfterMinutes is suggested, otherwise you might get a "flush
> storm" of all your memtables flushing at once. I've changed that as well.
>
>
> --
> Curt, ZipZapPlay Inc., www.PlayCrafter.com,
> http://apps.facebook.com/happyhabitat
>

Re: Problems running Cassandra 0.6.1 on large EC2 instances.

Posted by Curt Bererton <cu...@zipzapplay.com>.
Agreed, and I just saw in storage-conf that a higher value for
MemtableFlushAfterMinutes is suggested, otherwise you might get a "flush
storm" of all your memtables flushing at once. I've changed that as well.
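
For reference, the element in question, with an illustrative value (anything
well above the 60-minute default in our config should keep idle memtables
from all hitting the timer at once):

  <MemtableFlushAfterMinutes>1440</MemtableFlushAfterMinutes>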

--
Curt, ZipZapPlay Inc., www.PlayCrafter.com,
http://apps.facebook.com/happyhabitat


On Mon, May 17, 2010 at 5:27 PM, Mark Greene <gr...@gmail.com> wrote:

> Since you only have 7.5GB of memory, it's a really bad idea to set your
> heap space to a max of 7GB. Remember, the java process heap will be larger
> than what Xmx is allowed to grow to. If you reach this level, you can
> start swapping, which is very very bad. As Brandon pointed out, you haven't
> exhausted your physical memory yet, but you still want to lower Xmx to
> something like 5 or 6 GB.
>

Re: Problems running Cassandra 0.6.1 on large EC2 instances.

Posted by Mark Greene <gr...@gmail.com>.
Since you only have 7.5GB of memory, it's a really bad idea to set your heap
space to a max of 7GB. Remember, the java process heap will be larger than
what Xmx is allowed to grow to. If you reach this level, you can start
swapping, which is very very bad. As Brandon pointed out, you haven't
exhausted your physical memory yet, but you still want to lower Xmx to
something like 5 or 6 GB.
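
In cassandra.in.sh terms that would be something like the sketch below. The
6G comes from the range above; pinning Xms to the same value is my own
assumption (it avoids heap-resize pauses), not something required:

        -Xms6G \
        -Xmx6G \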

On Mon, May 17, 2010 at 7:02 PM, Curt Bererton <cu...@zipzapplay.com> wrote:

> Here are the current jvm args  and java version:
>
> # Arguments to pass to the JVM
> JVM_OPTS=" \
>         -ea \
>         -Xms128M \
>         -Xmx7G \
>         -XX:TargetSurvivorRatio=90 \
>         -XX:+AggressiveOpts \
>         -XX:+UseParNewGC \
>         -XX:+UseConcMarkSweepGC \
>         -XX:+CMSParallelRemarkEnabled \
>         -XX:+HeapDumpOnOutOfMemoryError \
>         -XX:SurvivorRatio=128 \
>         -XX:MaxTenuringThreshold=0 \
>         -Dcom.sun.management.jmxremote.port=8080 \
>         -Dcom.sun.management.jmxremote.ssl=false \
>         -Dcom.sun.management.jmxremote.authenticate=false"
>
> java -version outputs:
> java version "1.6.0_20"
> Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
> Java HotSpot(TM) 64-Bit Server VM (build 16.3-b01, mixed mode)
>
> So pretty much the defaults aside from the 7Gig max heap. CPU is totally
> hammered right now, and it is receiving 0 ops/sec from me since I
> disconnected it from our application right now until I can figure out what's
> going on.
>
> running top on the machine I get:
> top - 18:56:32 up 2 days, 20:57,  2 users,  load average: 14.97, 15.24,
> 15.13
> Tasks:  87 total,   5 running,  82 sleeping,   0 stopped,   0 zombie
> Cpu(s): 40.1%us, 33.9%sy,  0.0%ni,  0.1%id,  0.0%wa,  0.0%hi,  1.3%si,
> 24.6%st
> Mem:   7872040k total,  3618764k used,  4253276k free,   387536k buffers
> Swap:        0k total,        0k used,        0k free,  1655556k cached
>
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+
> COMMAND
>  2566 cassandr  25   0 7906m 639m  10m S  150  8.3   5846:35 java
>
>
> I have jconsole up and running, and jconsole vm Summary tab says:
>  - total physical memory: 7,872,040 K
>  - Free physical memory: 4,253,036 K
>  - Total swap space: 0K
>  - Free swap space: 0K
>  - Committed virtual memory: 8,096,648 K
>
> Is there a specific thread I can look at in jconsole that might give me a
> clue?  It's weird that it's still at 100% cpu even though it's getting no
> traffic from outside right now.  I suppose it might still be talking across
> the machines though.
>
> Also, stopping cassandra and starting cassandra on one of the 4 machines
> caused the CPU to go back down to almost normal levels.
>
> Here's the ring:
>
> Address        Status  Load     Range                                      Ring
>                                 170141183460469231731687303715884105728
> 10.251.XX.XX   Up      2.15 MB  42535295865117307932921825928971026432     |<--|
> 10.250.XX.XX   Up      2.42 MB  85070591730234615865843651857942052864     |   |
> 10.250.XX.XX   Up      2.47 MB  127605887595351923798765477786913079296    |   |
> 10.250.XX.XX   Up      2.46 MB  170141183460469231731687303715884105728    |-->|
>
> Any thoughts?
>
> Best,
>
> Curt
> --
> Curt, ZipZapPlay Inc., www.PlayCrafter.com,
> http://apps.facebook.com/happyhabitat
>
>
> On Mon, May 17, 2010 at 3:51 PM, Mark Greene <gr...@gmail.com> wrote:
>
>> Can you provide us with the current JVM args? Also, what type of work load
>> you are giving the ring (op/s)?
>>

Re: Problems running Cassandra 0.6.1 on large EC2 instances.

Posted by Lee Parker <le...@socialagency.com>.
Also, I am using batch_mutate for all of my writes.

Lee Parker
On Mon, May 17, 2010 at 7:11 PM, Lee Parker <le...@socialagency.com> wrote:

> What are your storage-conf settings for Memtable thresholds?  One thing
> that could cause lots of CPU usage is dumping the memtables too frequently
> and then having to do lots of compaction.  With that much available heap
> space you could definitely go larger than the default thresholds.  Also, do
> you not have any swap space setup on the machine?  It is a good idea to at
> least setup a swap file so that the system can use it when it needs to.
>
> We are running a two node cluster using Amazon large EC2 instances as well.
>  The cluster is using a replication factor of 2 and most of my writes and
> reads are at a consistency level of ONE except for a few QUORUM calls.  The
> only difference in my JVM opts is that my max is set at 6G.  I have the two
> ephemeral disks setup as a raid 0 array and that is where I'm storing the
> data.  The commit logs are going to the default location so they are using
> the local disk.  We currently have more than 90G of data running on these
> and have only had issues with CPU utilization when our code was accidentally
> duplicating content to one of the servers.  This duplication of content
> started causing the server to be in a state of constant major compaction and
> it couldn't keep up with new writes.  In the end, I completely dropped that
> server and spun up another one to take its place since the one good server
> had all the data anyway.  So, it might have also been an issue with that
> box.
>
> One more question, are all of the instances in the same region?
>
> Lee Parker

Re: Problems running Cassandra 0.6.1 on large EC2 instances.

Posted by Lee Parker <le...@socialagency.com>.
What are your storage-conf settings for Memtable thresholds?  One thing that
could cause lots of CPU usage is dumping the memtables too frequently and
then having to do lots of compaction.  With that much available heap space
you could definitely go larger than the default thresholds.  Also, do you
not have any swap space setup on the machine?  It is a good idea to at least
set up a swap file so that the system can use it when it needs to.
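
A minimal sketch of that swap-file setup (the 2G size is just an example):

  dd if=/dev/zero of=/swapfile bs=1M count=2048
  chmod 600 /swapfile
  mkswap /swapfile
  swapon /swapfile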

We are running a two node cluster using Amazon large EC2 instances as well.
 The cluster is using a replication factor of 2 and most of my writes and
reads are at a consistency level of ONE except for a few QUORUM calls.  The
only difference in my JVM opts is that my max is set at 6G.  I have the two
ephemeral disks setup as a raid 0 array and that is where I'm storing the
data.  The commit logs are going to the default location so they are using
the local disk.  We currently have more than 90G of data running on these
and have only had issues with CPU utilization when our code was accidentally
duplicating content to one of the servers.  This duplication of content
started causing the server to be in a state of constant major compaction and
it couldn't keep up with new writes.  In the end, I completely dropped that
server and spun up another one to take its place since the one good server
had all the data anyway.  So, it might have also been an issue with that
box.

One more question, are all of the instances in the same region?

Lee Parker
On Mon, May 17, 2010 at 6:02 PM, Curt Bererton <cu...@zipzapplay.com> wrote:

> Here are the current jvm args  and java version:
>
> # Arguments to pass to the JVM
> JVM_OPTS=" \
>         -ea \
>         -Xms128M \
>         -Xmx7G \
>         -XX:TargetSurvivorRatio=90 \
>         -XX:+AggressiveOpts \
>         -XX:+UseParNewGC \
>         -XX:+UseConcMarkSweepGC \
>         -XX:+CMSParallelRemarkEnabled \
>         -XX:+HeapDumpOnOutOfMemoryError \
>         -XX:SurvivorRatio=128 \
>         -XX:MaxTenuringThreshold=0 \
>         -Dcom.sun.management.jmxremote.port=8080 \
>         -Dcom.sun.management.jmxremote.ssl=false \
>         -Dcom.sun.management.jmxremote.authenticate=false"
>
> java -version outputs:
> java version "1.6.0_20"
> Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
> Java HotSpot(TM) 64-Bit Server VM (build 16.3-b01, mixed mode)
>
> So pretty much the defaults aside from the 7Gig max heap. CPU is totally
> hammered right now, and it is receiving 0 ops/sec from me since I
> disconnected it from our application right now until I can figure out what's
> going on.
>
> running top on the machine I get:
> top - 18:56:32 up 2 days, 20:57,  2 users,  load average: 14.97, 15.24,
> 15.13
> Tasks:  87 total,   5 running,  82 sleeping,   0 stopped,   0 zombie
> Cpu(s): 40.1%us, 33.9%sy,  0.0%ni,  0.1%id,  0.0%wa,  0.0%hi,  1.3%si,
> 24.6%st
> Mem:   7872040k total,  3618764k used,  4253276k free,   387536k buffers
> Swap:        0k total,        0k used,        0k free,  1655556k cached
>
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+
> COMMAND
>  2566 cassandr  25   0 7906m 639m  10m S  150  8.3   5846:35 java
>
>
> I have jconsole up and running, and jconsole vm Summary tab says:
>  - total physical memory: 7,872,040 K
>  - Free physical memory: 4,253,036 K
>  - Total swap space: 0K
>  - Free swap space: 0K
>  - Committed virtual memory: 8,096,648 K
>
> Is there a specific thread I can look at in jconsole that might give me a
> clue?  It's weird that it's still at 100% cpu even though it's getting no
> traffic from outside right now.  I suppose it might still be talking across
> the machines though.
>
> Also, stopping cassandra and starting cassandra on one of the 4 machines
> caused the CPU to go back down to almost normal levels.
>
> Here's the ring:
>
> Address        Status  Load     Range                                      Ring
>                                 170141183460469231731687303715884105728
> 10.251.XX.XX   Up      2.15 MB  42535295865117307932921825928971026432     |<--|
> 10.250.XX.XX   Up      2.42 MB  85070591730234615865843651857942052864     |   |
> 10.250.XX.XX   Up      2.47 MB  127605887595351923798765477786913079296    |   |
> 10.250.XX.XX   Up      2.46 MB  170141183460469231731687303715884105728    |-->|
>
> Any thoughts?
>
> Best,
>
> Curt
> --
> Curt, ZipZapPlay Inc., www.PlayCrafter.com,
> http://apps.facebook.com/happyhabitat
>
>
> On Mon, May 17, 2010 at 3:51 PM, Mark Greene <gr...@gmail.com> wrote:
>
>> Can you provide us with the current JVM args? Also, what type of work load
>> you are giving the ring (op/s)?
>>

Re: Problems running Cassandra 0.6.1 on large EC2 instances.

Posted by Curt Bererton <cu...@zipzapplay.com>.
Here are the current jvm args  and java version:

# Arguments to pass to the JVM
JVM_OPTS=" \
        -ea \
        -Xms128M \
        -Xmx7G \
        -XX:TargetSurvivorRatio=90 \
        -XX:+AggressiveOpts \
        -XX:+UseParNewGC \
        -XX:+UseConcMarkSweepGC \
        -XX:+CMSParallelRemarkEnabled \
        -XX:+HeapDumpOnOutOfMemoryError \
        -XX:SurvivorRatio=128 \
        -XX:MaxTenuringThreshold=0 \
        -Dcom.sun.management.jmxremote.port=8080 \
        -Dcom.sun.management.jmxremote.ssl=false \
        -Dcom.sun.management.jmxremote.authenticate=false"

java -version outputs:
java version "1.6.0_20"
Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
Java HotSpot(TM) 64-Bit Server VM (build 16.3-b01, mixed mode)

So pretty much the defaults aside from the 7 gig max heap. CPU is totally
hammered right now, and it is receiving 0 ops/sec from me since I
disconnected it from our application until I can figure out what's going
on.

running top on the machine I get:
top - 18:56:32 up 2 days, 20:57,  2 users,  load average: 14.97, 15.24, 15.13
Tasks:  87 total,   5 running,  82 sleeping,   0 stopped,   0 zombie
Cpu(s): 40.1%us, 33.9%sy,  0.0%ni,  0.1%id,  0.0%wa,  0.0%hi,  1.3%si, 24.6%st
Mem:   7872040k total,  3618764k used,  4253276k free,   387536k buffers
Swap:        0k total,        0k used,        0k free,  1655556k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 2566 cassandr  25   0 7906m 639m  10m S  150  8.3   5846:35 java


I have jconsole up and running, and jconsole vm Summary tab says:
 - total physical memory: 7,872,040 K
 - Free physical memory: 4,253,036 K
 - Total swap space: 0K
 - Free swap space: 0K
 - Committed virtual memory: 8,096,648 K

Is there a specific thread I can look at in jconsole that might give me a
clue?  It's weird that it's still at 100% cpu even though it's getting no
traffic from outside right now.  I suppose it might still be talking across
the machines though.

Also, stopping cassandra and starting cassandra on one of the 4 machines
caused the CPU to go back down to almost normal levels.

Here's the ring:

Address        Status  Load     Range                                      Ring
                                170141183460469231731687303715884105728
10.251.XX.XX   Up      2.15 MB  42535295865117307932921825928971026432     |<--|
10.250.XX.XX   Up      2.42 MB  85070591730234615865843651857942052864     |   |
10.250.XX.XX   Up      2.47 MB  127605887595351923798765477786913079296    |   |
10.250.XX.XX   Up      2.46 MB  170141183460469231731687303715884105728    |-->|
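
For reference, that output comes from nodetool; roughly the following
invocation, where 8080 matches the JMX port set in our JVM args (exact flag
spelling can vary a bit by 0.6 build):

  nodetool -host localhost -port 8080 ring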

Any thoughts?

Best,
Curt
--
Curt, ZipZapPlay Inc., www.PlayCrafter.com,
http://apps.facebook.com/happyhabitat


On Mon, May 17, 2010 at 3:51 PM, Mark Greene <gr...@gmail.com> wrote:

> Can you provide us with the current JVM args? Also, what type of work load
> you are giving the ring (op/s)?
>

Re: Problems running Cassandra 0.6.1 on large EC2 instances.

Posted by Mark Greene <gr...@gmail.com>.
Can you provide us with the current JVM args? Also, what kind of workload
are you giving the ring (ops/sec)?

On Mon, May 17, 2010 at 6:39 PM, Curt Bererton <cu...@zipzapplay.com> wrote:

> Hello Cassandra users+experts,
>
> Hopefully someone will be able to point me in the correct direction. We
> have cassandra 0.6.1 working on our test servers and we *thought* everything
> was great and ready to move to production. We are currently running a ring
> of 4 large instance EC2 (http://aws.amazon.com/ec2/instance-types/)
> servers on production with a replication factor of 3 and a QUORUM
> consistency level. We ran a test on 1% of our users, and everything was
> writing to and reading from cassandra great for the first 3 hours. After
> that point CPU usage spiked to 100% and stayed there, basically on all 4
> machines at once. This smells to me like a GC issue, and I'm looking into it
> with jconsole right now. If anyone can help me debug this and get cassandra
> all the way up and running without CPU spiking I would be forever in their
> debt.
>
> I suspect that anyone else running cassandra on large EC2 instances might
> just be able to tell me what JVM args they are successfully using in a
> production environment and if they upgraded to Cassandra 0.6.2 from 0.6.1,
> and did they go to batched writes due to bug 1014? (
> https://issues.apache.org/jira/browse/CASSANDRA-1014) That might answer
> all my questions.
>
> Is there anyone on the list who is using large EC2 instances in production?
> Would you be kind enough to share your JVM arguments and any other tips?
>
> Thanks for any help,
> Curt
> --
> Curt, ZipZapPlay Inc., www.PlayCrafter.com,
> http://apps.facebook.com/happyhabitat
>