Posted to user@cassandra.apache.org by Marcin Pietraszek <mp...@opera.com> on 2015/07/20 09:15:44 UTC

High CPU load

Hello!

I've noticed strange CPU utilisation patterns on the machines in our
cluster. After a C* daemon restart a node behaves normally, but a few
weeks after the restart CPU usage starts to rise. Currently on one of
the nodes (screenshots attached) the CPU load is ~4; shortly before a
restart the load rises to ~15 (our Cassandra machines have 16 CPUs).

In that cluster we're bulkloading from a Hadoop cluster with 1400
reducers (200 parallel bulkloading tasks). After such a heavy
bulkloading session the number of pending compactions is quite high,
but the cluster manages to clear them before the next "bulkloading
session". We're also tracking the number of pending compactions, and
most of the time it's 0.
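
(For context, this is roughly how that number can be polled over JMX;
a minimal sketch, assuming the stock metric name
org.apache.cassandra.metrics:type=Compaction,name=PendingTasks and the
default JMX port 7199 -- nodetool compactionstats reports the same
value. The class name is just illustrative.)

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class PendingCompactions {
    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "localhost";
        // Default Cassandra JMX endpoint; adjust if cassandra-env.sh changes it.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + host + ":7199/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            ObjectName pendingTasks = new ObjectName(
                    "org.apache.cassandra.metrics:type=Compaction,name=PendingTasks");
            // The metric is a gauge; its current value is the "Value" attribute.
            System.out.println("Pending compactions: "
                    + mbs.getAttribute(pendingTasks, "Value"));
        } finally {
            connector.close();
        }
    }
}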

On our machines we still have a few gigabytes of free memory, ~7GB
(17GB used), and it also seems we aren't IO bound.

Screenshots from our zabbix with CPU utilisation graphs:

http://i60.tinypic.com/xas8q8.jpg
http://i58.tinypic.com/24pifcy.jpg

Do you guys know what could be causing such high load?

-- 
mp

Re: High CPU load

Posted by Marcin Pietraszek <mp...@opera.com>.
I'm still struggling to find the root cause of this CPU utilisation
pattern.

http://i58.tinypic.com/24pifcy.jpg

About 3 weeks after a C* restart CPU utilisation goes through the
roof; this doesn't happen shortly after a restart (which is visible
in the graph).

C* is running on machines with 16 cores; HEAP_NEWSIZE is set to
1600MB. Looking at the gc logs I found:

2015-07-27T03:27:41.891+0000: 3354764.688: [ParNew:
1362069K->51348K(1474560K), 0.0942700 secs]
3268635K->1957914K(8224768K)After GC:
2015-07-27T03:27:41.999+0000: 3354764.797: [ParNew:
55327K->48503K(1474560K), 0.0935170 secs]
1961894K->1955069K(8224768K)After GC:
2015-07-27T03:27:43.464+0000: 3354766.261: [ParNew:
1359224K->54214K(1474560K), 0.0922640 secs]
3265790K->1960780K(8224768K)After GC:
2015-07-27T03:27:43.570+0000: 3354766.368: [ParNew:
56165K->45672K(1474560K), 0.0948220 secs]
1962732K->1952239K(8224768K)After GC:
2015-07-27T03:27:45.016+0000: 3354767.814: [ParNew:
1356393K->54245K(1474560K), 0.0922290 secs]
3262959K->1960811K(8224768K)After GC:
2015-07-27T03:27:45.121+0000: 3354767.919: [ParNew:
57866K->48553K(1474560K), 0.0928670 secs]
1964433K->1955119K(8224768K)After GC:
2015-07-27T03:27:46.590+0000: 3354769.387: [ParNew:
1359420K->48560K(1474560K), 0.0913750 secs]
3265986K->1955126K(8224768K)After GC:

It's clear to me that ParNew needs to run when 1359420K out of
1474560K is taken. But why does it run when only 57866K/1474560K
(~4% of young) is occupied? SurvivorRatio is set to 8 in our case.
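
As a sanity check on those capacities, here's the sizing re-derived
from HEAP_NEWSIZE=1600MB and SurvivorRatio=8 (just arithmetic, nothing
measured):

public class YoungGenSizing {
    public static void main(String[] args) {
        long newSizeK = 1600L * 1024;   // HEAP_NEWSIZE=1600MB = 1638400K
        int survivorRatio = 8;          // eden : each survivor = 8 : 1

        // Young gen = eden + two survivor spaces, i.e. survivorRatio + 2
        // equally sized parts.
        long survivorK = newSizeK / (survivorRatio + 2);  // 163840K
        long edenK = survivorK * survivorRatio;           // 1310720K

        // ParNew reports eden + one survivor as the usable young capacity,
        // since only one survivor space holds objects at any time.
        long reportedK = edenK + survivorK;               // 1474560K

        System.out.printf("eden=%dK survivor=%dK parnew capacity=%dK%n",
                edenK, survivorK, reportedK);
    }
}

So the 1474560K reported by ParNew is eden plus one survivor, and the
collections that start at ~55-57MB are ones that begin with eden
essentially empty, which is exactly what I don't understand.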

What's also bothering me is the CMS concurrent sweep time:
gc.log.9:2015-07-27T07:54:36.561+0000: 3370772.285:
[CMS-concurrent-sweep: 19.710/20.600 secs] [Times: user=0.00
sys=317.60, real=20.60 secs]
Isn't it quite high?

Do you guys have any other thoughts? Which part of the gc logs
deserves more attention?


On Tue, Jul 21, 2015 at 4:22 PM, Marcin Pietraszek
<mp...@opera.com> wrote:
> Yup... it seems like it's the gc's fault
>
> gc logs
>
> 2015-07-21T14:19:54.336+0000: 2876133.270: Total time for which
> application threads were stopped: 0.0832030 seconds
> 2015-07-21T14:19:55.739+0000: 2876134.673: Total time for which
> application threads were stopped: 0.0806960 seconds
> 2015-07-21T14:19:57.149+0000: 2876136.083: Total time for which
> application threads were stopped: 0.0806890 seconds
> 2015-07-21T14:19:58.550+0000: 2876137.484: Total time for which
> application threads were stopped: 0.0821070 seconds
> 2015-07-21T14:19:59.941+0000: 2876138.875: Total time for which
> application threads were stopped: 0.0802640 seconds
> 2015-07-21T14:20:01.340+0000: 2876140.274: Total time for which
> application threads were stopped: 0.0835670 seconds
> 2015-07-21T14:20:02.744+0000: 2876141.678: Total time for which
> application threads were stopped: 0.0842440 seconds
> 2015-07-21T14:20:04.143+0000: 2876143.077: Total time for which
> application threads were stopped: 0.0841630 seconds
> 2015-07-21T14:20:05.541+0000: 2876144.475: Total time for which
> application threads were stopped: 0.0839850 seconds
>
> Heap after GC invocations=2273737 (full 101):
>  par new generation   total 1474560K, used 106131K
> [0x00000005fae00000, 0x000000065ee00000, 0x000000065ee00000)
>   eden space 1310720K,   0% used [0x00000005fae00000,
> 0x00000005fae00000, 0x000000064ae00000)
>   from space 163840K,  64% used [0x000000064ae00000,
> 0x00000006515a4ee0, 0x0000000654e00000)
>   to   space 163840K,   0% used [0x0000000654e00000,
> 0x0000000654e00000, 0x000000065ee00000)
>  concurrent mark-sweep generation total 6750208K, used 1316691K
> [0x000000065ee00000, 0x00000007fae00000, 0x00000007fae00000)
>  concurrent-mark-sweep perm gen total 49336K, used 29520K
> [0x00000007fae00000, 0x00000007fde2e000, 0x0000000800000000)
> }
> 2015-07-21T14:12:05.683+0000: 2875664.617: Total time for which
> application threads were stopped: 0.0830280 seconds
> {Heap before GC invocations=2273737 (full 101):
>  par new generation   total 1474560K, used 1416851K
> [0x00000005fae00000, 0x000000065ee00000, 0x000000065ee00000)
>   eden space 1310720K, 100% used [0x00000005fae00000,
> 0x000000064ae00000, 0x000000064ae00000)
>   from space 163840K,  64% used [0x000000064ae00000,
> 0x00000006515a4ee0, 0x0000000654e00000)
>   to   space 163840K,   0% used [0x0000000654e00000,
> 0x0000000654e00000, 0x000000065ee00000)
>  concurrent mark-sweep generation total 6750208K, used 1316691K
> [0x000000065ee00000, 0x00000007fae00000, 0x00000007fae00000)
>  concurrent-mark-sweep perm gen total 49336K, used 29520K
> [0x00000007fae00000, 0x00000007fde2e000, 0x0000000800000000)
>
> It seems like eden space is constantly being filled by something
> which gc later removes...
>
>
> On Mon, Jul 20, 2015 at 9:18 AM, Jason Wee <pe...@gmail.com> wrote:
>> just a guess, gc?
>>
>> On Mon, Jul 20, 2015 at 3:15 PM, Marcin Pietraszek <mp...@opera.com>
>> wrote:
>>>
>>> Hello!
>>>
>>> I've noticed strange CPU utilisation patterns on the machines in our
>>> cluster. After a C* daemon restart a node behaves normally, but a few
>>> weeks after the restart CPU usage starts to rise. Currently on one of
>>> the nodes (screenshots attached) the CPU load is ~4; shortly before a
>>> restart the load rises to ~15 (our Cassandra machines have 16 CPUs).
>>>
>>> In that cluster we're bulkloading from a Hadoop cluster with 1400
>>> reducers (200 parallel bulkloading tasks). After such a heavy
>>> bulkloading session the number of pending compactions is quite high,
>>> but the cluster manages to clear them before the next "bulkloading
>>> session". We're also tracking the number of pending compactions, and
>>> most of the time it's 0.
>>>
>>> On our machines we still have a few gigabytes of free memory, ~7GB
>>> (17GB used), and it also seems we aren't IO bound.
>>>
>>> Screenshots from our zabbix with CPU utilisation graphs:
>>>
>>> http://i60.tinypic.com/xas8q8.jpg
>>> http://i58.tinypic.com/24pifcy.jpg
>>>
>>> Do you guys know what could be causing such high load?
>>>
>>> --
>>> mp
>>
>>

Re: High CPU load

Posted by Marcin Pietraszek <mp...@opera.com>.
Yup... it seems like it's the gc's fault

gc logs

2015-07-21T14:19:54.336+0000: 2876133.270: Total time for which
application threads were stopped: 0.0832030 seconds
2015-07-21T14:19:55.739+0000: 2876134.673: Total time for which
application threads were stopped: 0.0806960 seconds
2015-07-21T14:19:57.149+0000: 2876136.083: Total time for which
application threads were stopped: 0.0806890 seconds
2015-07-21T14:19:58.550+0000: 2876137.484: Total time for which
application threads were stopped: 0.0821070 seconds
2015-07-21T14:19:59.941+0000: 2876138.875: Total time for which
application threads were stopped: 0.0802640 seconds
2015-07-21T14:20:01.340+0000: 2876140.274: Total time for which
application threads were stopped: 0.0835670 seconds
2015-07-21T14:20:02.744+0000: 2876141.678: Total time for which
application threads were stopped: 0.0842440 seconds
2015-07-21T14:20:04.143+0000: 2876143.077: Total time for which
application threads were stopped: 0.0841630 seconds
2015-07-21T14:20:05.541+0000: 2876144.475: Total time for which
application threads were stopped: 0.0839850 seconds

Heap after GC invocations=2273737 (full 101):
 par new generation   total 1474560K, used 106131K
[0x00000005fae00000, 0x000000065ee00000, 0x000000065ee00000)
  eden space 1310720K,   0% used [0x00000005fae00000,
0x00000005fae00000, 0x000000064ae00000)
  from space 163840K,  64% used [0x000000064ae00000,
0x00000006515a4ee0, 0x0000000654e00000)
  to   space 163840K,   0% used [0x0000000654e00000,
0x0000000654e00000, 0x000000065ee00000)
 concurrent mark-sweep generation total 6750208K, used 1316691K
[0x000000065ee00000, 0x00000007fae00000, 0x00000007fae00000)
 concurrent-mark-sweep perm gen total 49336K, used 29520K
[0x00000007fae00000, 0x00000007fde2e000, 0x0000000800000000)
}
2015-07-21T14:12:05.683+0000: 2875664.617: Total time for which
application threads were stopped: 0.0830280 seconds
{Heap before GC invocations=2273737 (full 101):
 par new generation   total 1474560K, used 1416851K
[0x00000005fae00000, 0x000000065ee00000, 0x000000065ee00000)
  eden space 1310720K, 100% used [0x00000005fae00000,
0x000000064ae00000, 0x000000064ae00000)
  from space 163840K,  64% used [0x000000064ae00000,
0x00000006515a4ee0, 0x0000000654e00000)
  to   space 163840K,   0% used [0x0000000654e00000,
0x0000000654e00000, 0x000000065ee00000)
 concurrent mark-sweep generation total 6750208K, used 1316691K
[0x000000065ee00000, 0x00000007fae00000, 0x00000007fae00000)
 concurrent-mark-sweep perm gen total 49336K, used 29520K
[0x00000007fae00000, 0x00000007fde2e000, 0x0000000800000000)

It seems like eden space is constantly being filled by something
which gc later removes...
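
To put a rough number on it, here's a quick sketch that sums up the
"Total time for which application threads were stopped" lines (the
regex assumes each entry sits on a single line in the real log file;
the class name is just illustrative):

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SafepointSummary {
    // Matches e.g.
    // 2015-07-21T14:19:54.336+0000: 2876133.270: Total time for which
    //   application threads were stopped: 0.0832030 seconds
    private static final Pattern STOP = Pattern.compile(
            "^\\S+: (\\d+\\.\\d+): Total time for which application"
            + " threads were stopped: (\\d+\\.\\d+) seconds");

    public static void main(String[] args) throws Exception {
        double first = -1, last = -1, stopped = 0;
        int events = 0;
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = in.readLine()) != null) {
                Matcher m = STOP.matcher(line);
                if (!m.find()) continue;
                double uptime = Double.parseDouble(m.group(1));  // JVM uptime (s)
                if (first < 0) first = uptime;
                last = uptime;
                stopped += Double.parseDouble(m.group(2));       // pause length (s)
                events++;
            }
        }
        if (events < 2) {
            System.out.println("not enough safepoint entries found");
            return;
        }
        double wall = last - first;
        System.out.printf(
                "%d safepoints, %.1fs stopped over %.1fs (%.1f%%), one every %.2fs%n",
                events, stopped, wall, 100 * stopped / wall, wall / events);
    }
}

On the excerpt above that works out to roughly 0.08s stopped every
~1.4s, i.e. around 6% of wall time in young-GC pauses, and with a
1280MB eden filling in ~1.4s the allocation rate is somewhere near
900MB/s, which would fit the "constantly being filled" pattern.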


On Mon, Jul 20, 2015 at 9:18 AM, Jason Wee <pe...@gmail.com> wrote:
> just a guess, gc?
>
> On Mon, Jul 20, 2015 at 3:15 PM, Marcin Pietraszek <mp...@opera.com>
> wrote:
>>
>> Hello!
>>
>> I've noticed strange CPU utilisation patterns on the machines in our
>> cluster. After a C* daemon restart a node behaves normally, but a few
>> weeks after the restart CPU usage starts to rise. Currently on one of
>> the nodes (screenshots attached) the CPU load is ~4; shortly before a
>> restart the load rises to ~15 (our Cassandra machines have 16 CPUs).
>>
>> In that cluster we're bulkloading from a Hadoop cluster with 1400
>> reducers (200 parallel bulkloading tasks). After such a heavy
>> bulkloading session the number of pending compactions is quite high,
>> but the cluster manages to clear them before the next "bulkloading
>> session". We're also tracking the number of pending compactions, and
>> most of the time it's 0.
>>
>> On our machines we still have a few gigabytes of free memory, ~7GB
>> (17GB used), and it also seems we aren't IO bound.
>>
>> Screenshots from our zabbix with CPU utilisation graphs:
>>
>> http://i60.tinypic.com/xas8q8.jpg
>> http://i58.tinypic.com/24pifcy.jpg
>>
>> Do you guys know what could be causing such high load?
>>
>> --
>> mp
>
>

Re: High CPU load

Posted by Jason Wee <pe...@gmail.com>.
just a guess, gc?

On Mon, Jul 20, 2015 at 3:15 PM, Marcin Pietraszek <mp...@opera.com>
wrote:

> Hello!
>
> I've noticed strange CPU utilisation patterns on the machines in our
> cluster. After a C* daemon restart a node behaves normally, but a few
> weeks after the restart CPU usage starts to rise. Currently on one of
> the nodes (screenshots attached) the CPU load is ~4; shortly before a
> restart the load rises to ~15 (our Cassandra machines have 16 CPUs).
>
> In that cluster we're bulkloading from a Hadoop cluster with 1400
> reducers (200 parallel bulkloading tasks). After such a heavy
> bulkloading session the number of pending compactions is quite high,
> but the cluster manages to clear them before the next "bulkloading
> session". We're also tracking the number of pending compactions, and
> most of the time it's 0.
>
> On our machines we still have a few gigabytes of free memory, ~7GB
> (17GB used), and it also seems we aren't IO bound.
>
> Screenshots from our zabbix with CPU utilisation graphs:
>
> http://i60.tinypic.com/xas8q8.jpg
> http://i58.tinypic.com/24pifcy.jpg
>
> Do you guys know what could be causing such high load?
>
> --
> mp
>