Posted to user@cassandra.apache.org by Aoi Kadoya <ca...@gmail.com> on 2016/08/05 20:21:44 UTC

Re: CPU high load

Thank you, Alain.

There was no frequent GC nor compaction, so it had been a mystery.
However, once I stopped chef-client (we're managing the cluster through
a chef cookbook), the load eased on almost all of the servers.
So we're now refactoring our cookbook; in the meantime, we also
decided to rebuild the cluster with DSE 5.0.1.

Thank you very much for your advice on the debugging process,
Aoi


2016-07-20 4:03 GMT-07:00 Alain RODRIGUEZ <ar...@gmail.com>:
> Hi Aoi,
>
>>
>> since a few weeks
>> ago, all of the cluster nodes are hitting avg. 15-20 CPU load.
>> These nodes are running on VMs (VMware vSphere) that have 8 vCPUs
>> (1 core/socket) and 16 GB vRAM. (JVM options: -Xms8G -Xmx8G -Xmn800M)
>
>
> I'll take my chance; a few ideas / questions below:
>
> What Cassandra version are you running?
> How is your GC doing?
>
> Run something like: grep "GC" /var/log/cassandra/system.log
> If you have a lot of long CMS pauses, you might not be keeping things in the
> new gen long enough. Xmn800M looks too small to me: it has been a default,
> but I never saw a case where this setting worked better than a higher value
> (let's say 2G). Also, the tenuring threshold gives better results if set a bit
> higher than the default (let's say 16). Those options are in cassandra-env.sh.
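>
> For what it's worth, a minimal sketch of that change in cassandra-env.sh
> (the variable names below are the ones from the stock file, but please check
> them against your own copy before touching a node):
>
>     MAX_HEAP_SIZE="8G"
>     HEAP_NEWSIZE="2G"    # becomes -Xmn2G, instead of the current 800M
>     # Stock files usually set this to 1; edit the existing line rather than
>     # appending a duplicate flag:
>     JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=16"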
>
> Do you have other warnings or errors? Anything about tombstones or
> compacting wide rows incrementally?
> What compaction strategy are you using?
> How many concurrent compactors do you use (if you have 8 cores, this value
> should probably be between 2 and 6; 4 is a good starting point)?
> If your compaction is not fast enough and the disks are doing fine, consider
> increasing the compaction throughput from the default of 16 to 32 or 64 MB/s
> to mitigate the impact of the point above (a few nodetool commands for this
> are sketched after these questions).
> Do you use compression? What kind?
> Did the request count increase recently? Are you considering adding capacity,
> or do you think you're hitting a new bug / issue that is worth investigating
> / solving?
> Are you using default configuration? What did you change?
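>
> For the compaction points above, a few standard nodetool sub-commands can be
> used to check and adjust things on a live node (the value 32 is only an
> example):
>
>     nodetool compactionstats            # pending compactions and what is running now
>     nodetool getcompactionthroughput    # current throttle, in MB/s
>     nodetool setcompactionthroughput 32 # raise the throttle (0 = unthrottled)
>
> Note that concurrent_compactors lives in cassandra.yaml and needs a restart,
> while setcompactionthroughput applies immediately but does not survive a
> restart unless cassandra.yaml is changed as well.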
>
> No matter what you try, do it as much as possible on one canary node first,
> and incrementally (one change at a time; using NEWHEAP = 2 GB +
> tenuringThreshold = 16 would count as one change, as it makes sense to move
> those 2 values together).
>
>>
>> I have enabled an auto repair service on OpsCenter and it's running behind
>
>
> Also, when did you do that, i.e. start the repairs? Repair is an expensive
> operation that consumes a lot of resources; it is often needed, but it is
> hard to tune correctly. Are you sure you have enough CPU power to handle the
> load + repairs?
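>
> If repairs turn out to be part of the problem, one common way to soften them
> is to repair primary ranges only, one node at a time (the keyspace name below
> is a placeholder):
>
>     nodetool repair -pr my_keyspace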
>
> Some other comments probably not directly related:
>
>>
>> I also realized that my cluster isn't well balanced
>
>
> Well, your cluster looks balanced to me; 7 GB isn't that far from 11 GB. For
> more accurate information, use 'nodetool status mykeyspace'. This way
> ownership will be displayed, replacing (?) with ownership (xx %). Total
> ownership = 300 % in your case (RF=3).
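>
> For illustration, the output would look roughly like this once the keyspace
> is given (the numbers below are made up):
>
>     nodetool status mykeyspace
>     --  Address   Load      Tokens  Owns (effective)  Host ID  Rack
>     UN  10.0.0.1  7.02 GB   256     49.8%             ...      rack1
>     UN  10.0.0.2  11.54 GB  256     50.9%             ...      rack1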
>
>>
>> I am running a 6-node vnode cluster with DSE 4.8.1, and since a few weeks
>> ago, all of the cluster nodes are hitting avg. 15-20 CPU load.
>
>
> By the way, from
> https://docs.datastax.com/en/datastax_enterprise/4.8/datastax_enterprise/RNdse.html:
>
> "Warning: DataStax does not recommend 4.8.1 or 4.8.2 versions for
> production, see warning. Use 4.8.3 instead.".
>
> I am not sure what happened there, but I would move to 4.8.3+ ASAP; DataStax
> people know their products and I don't like this kind of orange and bold
> warning :-).
>
> C*heers,
> -----------------------
> Alain Rodriguez - alain@thelastpickle.com
> France
>
> The Last Pickle - Apache Cassandra Consulting
> http://www.thelastpickle.com
>
> 2016-07-14 4:36 GMT+02:00 Aoi Kadoya <ca...@gmail.com>:
>>
>> Hi Romain,
>>
>> No, I don't think we upgraded the Cassandra version or changed any of
>> those schema elements. After I noticed this high load issue, I found
>> that some of the tables had a shorter gc_grace_seconds (1 day) than the
>> rest, and because it seemed to be causing constant compaction cycles, I
>> changed them to 10 days. But again, that was after the load hit this
>> high number.
>> Some of the nodes eased a little bit after changing the gc_grace_seconds
>> values and repairing nodes, but since a few days ago, all of the nodes
>> have been constantly reporting a load of 15-20.
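>>
>> For reference, that kind of change is a plain ALTER TABLE, e.g. via cqlsh
>> (keyspace and table names below are placeholders; 864000 seconds is the
>> 10 days mentioned above):
>>
>>     cqlsh -e "ALTER TABLE my_keyspace.my_table WITH gc_grace_seconds = 864000;"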
>>
>> Thank you for the suggestion about logging, let me try to change the
>> log level to see what I can get from it.
>>
>> Thanks,
>> Aoi
>>
>>
>> 2016-07-13 13:28 GMT-07:00 Romain Hardouin <ro...@yahoo.fr>:
>> > Did you upgrade from a previous version? Did you make some schema changes
>> > like compaction strategy, compression, bloom filter, etc.?
>> > What about the R/W requests?
>> > SharedPool Workers are... shared ;-) Set the logs to debug to see some
>> > examples of which services are using this pool (many, actually); one way
>> > to do that is sketched below.
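>> >
>> > If you don't want to flip the whole node to DEBUG, nodetool can raise the
>> > level for just the relevant package at runtime, something along the lines
>> > of:
>> >
>> >     nodetool setlogginglevel org.apache.cassandra.concurrent DEBUG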
>> >
>> > Best,
>> >
>> > Romain
>> >
>> >
>> > On Wednesday, July 13, 2016 at 6:15 PM, Patrick McFadin <pm...@gmail.com>
>> > wrote:
>> >
>> >
>> > It might be clearer to look at nodetool tpstats.
>> >
>> > From there you can see all the thread pools and whether there are any
>> > blocks. It could be something subtle, like the network.
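>> >
>> > For example, sampling it over time makes blocks easier to spot (watch is
>> > just a convenience here):
>> >
>> >     watch -n 10 nodetool tpstats
>> >
>> > The columns to look at are Pending, Blocked and "All time blocked", plus
>> > the dropped message counts at the end of the output.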
>> >
>> > On Tue, Jul 12, 2016 at 3:23 PM, Aoi Kadoya <ca...@gmail.com>
>> > wrote:
>> >
>> > Hi,
>> >
>> > I am running a 6-node vnode cluster with DSE 4.8.1, and since a few weeks
>> > ago, all of the cluster nodes are hitting avg. 15-20 CPU load.
>> > These nodes are running on VMs (VMware vSphere) that have 8 vCPUs
>> > (1 core/socket) and 16 GB vRAM. (JVM options: -Xms8G -Xmx8G -Xmn800M)
>> >
>> > At first I thought this was because of CPU iowait; however, iowait is
>> > constantly low (in fact it's 0 almost all the time), and CPU steal time
>> > is also 0%.
>> >
>> > When I took a thread dump, I found that some of the "SharedPool-Worker"
>> > threads are consuming CPU, and those threads seem to be waiting for
>> > something, so I assume this is the cause of the CPU load.
>> >
>> > "SharedPool-Worker-1" #240 daemon prio=5 os_prio=0
>> > tid=0x00007fabf459e000 nid=0x39b3 waiting on condition
>> > [0x00007faad7f02000]
>> >    java.lang.Thread.State: WAITING (parking)
>> > at sun.misc.Unsafe.park(Native Method)
>> > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:304)
>> > at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:85)
>> > at java.lang.Thread.run(Thread.java:745)
>> >
>> > The thread dump looks like this, but I am not sure what this
>> > SharedPool-Worker is waiting for.
>> > Would you please help me with further troubleshooting?
>> > I am also reading the thread posted by Yuan, as the situation is very
>> > similar to mine, but I didn't get any blocked, dropped, or pending counts
>> > in my tpstats result.
>> >
>> > Thanks,
>> > Aoi
>> >
>> >
>> >
>> >
>
>