Posted to user@cassandra.apache.org by Michał Łowicki <ml...@gmail.com> on 2015/02/18 19:28:19 UTC

C* 2.1.2 invokes oom-killer

Hi,

A couple of times a day, 2 out of the 4 nodes in our cluster are killed:

root@db4:~# dmesg | grep -i oom
[4811135.792657] [ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
[6559049.307293] java invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0

Nodes are using an 8GB heap (confirmed with *nodetool info*) and aren't using
the row cache.

I've noticed that a couple of times a day the used RSS grows really fast
within a couple of minutes, and I see CPU spikes at the same time -
https://www.dropbox.com/s/khco2kdp4qdzjit/Screenshot%202015-02-18%2015.10.54.png?dl=0

It could be related to compaction, but after compaction finishes the used RSS
doesn't shrink. Output from pmap, taken while the C* process was using 50GB of
RAM (out of 64GB), is available at http://paste.ofcode.org/ZjLUA2dYVuKvJHAk9T3Hjb.
At the time the dump was made, heap usage was far below 8GB (~3GB), but total
RSS was ~50GB.
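
For anyone who wants to check the same numbers on their own nodes, a rough
sketch (it assumes the Cassandra PID can be found by grepping for the
CassandraDaemon class on the command line):

CASSANDRA_PID=$(pgrep -f CassandraDaemon | head -n 1)
pmap -x $CASSANDRA_PID | tail -n 1        # summary line: total mapped size and RSS in KB
grep VmRSS /proc/$CASSANDRA_PID/status    # resident set size as reported by the kernel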

Any help will be appreciated.

-- 
BR,
Michał Łowicki

Re: C* 2.1.2 invokes oom-killer

Posted by Jacob Rhoden <ja...@me.com>.
I neglected to mention: I also adjust the OOM score of Cassandra to tell the kernel to kill something else rather than Cassandra (for example, if one of your devs runs a script that uses a lot of memory, the kernel kills the dev's script instead).

http://lwn.net/Articles/317814/ <http://lwn.net/Articles/317814/>
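
For example, something along these lines; a rough sketch that assumes root and
a running Cassandra process, and -500 is just an illustrative value (the valid
oom_score_adj range is -1000 to 1000):

CASSANDRA_PID=$(pgrep -f CassandraDaemon | head -n 1)
echo -500 > /proc/$CASSANDRA_PID/oom_score_adj   # the kernel prefers to kill processes with higher scores
cat /proc/$CASSANDRA_PID/oom_score_adj           # verify the new value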

> On 19 Feb 2015, at 5:28 am, Michał Łowicki <ml...@gmail.com> wrote:
> 
> Hi,
> 
> Couple of times a day 2 out of 4 members cluster nodes are killed
> 
> root@db4:~# dmesg | grep -i oom
> [4811135.792657] [ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
> [6559049.307293] java invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
> 
> Nodes are using 8GB heap (confirmed with *nodetool info*) and aren't using row cache. 
> 
> Noticed that couple of times a day used RSS is growing really fast within couple of minutes and I see CPU spikes at the same time - https://www.dropbox.com/s/khco2kdp4qdzjit/Screenshot%202015-02-18%2015.10.54.png?dl=0 <https://www.dropbox.com/s/khco2kdp4qdzjit/Screenshot%202015-02-18%2015.10.54.png?dl=0>.
> 
> Could be related to compaction but after compaction is finished used RSS doesn't shrink. Output from pmap when C* process uses 50GB RAM (out of 64GB) is available on http://paste.ofcode.org/ZjLUA2dYVuKvJHAk9T3Hjb <http://paste.ofcode.org/ZjLUA2dYVuKvJHAk9T3Hjb>. At the time dump was made heap usage is far below 8GB (~3GB) but total RSS is ~50GB.
> 
> Any help will be appreciated.
> 
> -- 
> BR,
> Michał Łowicki


Re: C* 2.1.2 invokes oom-killer

Posted by Robert Coli <rc...@eventbrite.com>.
On Wed, Feb 18, 2015 at 10:28 AM, Michał Łowicki <ml...@gmail.com> wrote:

> Couple of times a day 2 out of 4 members cluster nodes are killed
>

This sort of issue is usually best handled/debugged interactively on IRC.

But briefly:

- 2.1.2 is IMO broken for production. Downgrade (officially unsupported but
fine between these versions) to 2.1.1, or upgrade to 2.1.3.
- Beyond that, look at the steady-state heap consumption (see the sketch
below). With 2.1.2, it would likely take at least 1TB of data to fill the
heap in steady state to near-failure.
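
A quick way to watch that over time, as a rough sketch (the exact label
printed by nodetool info may vary slightly between versions):

watch -n 60 'nodetool info | grep -i "heap memory"'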

=Rob

Re: C* 2.1.2 invokes oom-killer

Posted by Michał Łowicki <ml...@gmail.com>.
After a couple of days it's still behaving fine. Case closed.

On Thu, Feb 19, 2015 at 11:15 PM, Michał Łowicki <ml...@gmail.com> wrote:

> Upgrade to 2.1.3 seems to help so far. After ~12 hours total memory
> consumption grew from 10GB to 10.5GB.
>
> On Thu, Feb 19, 2015 at 2:02 PM, Carlos Rolo <ro...@pythian.com> wrote:
>
>> Then you are probably hitting a bug... Trying to find out in Jira. The
>> bad news is the fix is only to be released on 2.1.4. Once I find it out I
>> will post it here.
>>
>> Regards,
>>
>> Carlos Juzarte Rolo
>> Cassandra Consultant
>>
>> Pythian - Love your data
>>
>> rolo@pythian | Twitter: cjrolo | Linkedin: *linkedin.com/in/carlosjuzarterolo
>> <http://linkedin.com/in/carlosjuzarterolo>*
>> Tel: 1649
>> www.pythian.com
>>
>> On Thu, Feb 19, 2015 at 12:16 PM, Michał Łowicki <ml...@gmail.com>
>> wrote:
>>
>>> |trickle_fsync| has been enabled for long time in our settings (just
>>> noticed):
>>>
>>> trickle_fsync: true
>>>
>>> trickle_fsync_interval_in_kb: 10240
>>>
>>> On Thu, Feb 19, 2015 at 12:12 PM, Michał Łowicki <ml...@gmail.com>
>>> wrote:
>>>
>>>>
>>>>
>>>> On Thu, Feb 19, 2015 at 11:02 AM, Carlos Rolo <ro...@pythian.com> wrote:
>>>>
>>>>> Do you have trickle_fsync enabled? Try to enable that and see if it
>>>>> solves your problem, since you are getting out of non-heap memory.
>>>>>
>>>>> Another question, is always the same nodes that die? Or is 2 out of 4
>>>>> that die?
>>>>>
>>>>
>>>> Always the same nodes. Upgraded to 2.1.3 two hours ago so we'll monitor
>>>> if maybe issue has been fixed there. If not will try to enable
>>>> |tricke_fsync|
>>>>
>>>>
>>>>>
>>>>> Regards,
>>>>>
>>>>> Carlos Juzarte Rolo
>>>>> Cassandra Consultant
>>>>>
>>>>> Pythian - Love your data
>>>>>
>>>>> rolo@pythian | Twitter: cjrolo | Linkedin: *linkedin.com/in/carlosjuzarterolo
>>>>> <http://linkedin.com/in/carlosjuzarterolo>*
>>>>> Tel: 1649
>>>>> www.pythian.com
>>>>>
>>>>> On Thu, Feb 19, 2015 at 10:49 AM, Michał Łowicki <ml...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Feb 19, 2015 at 10:41 AM, Carlos Rolo <ro...@pythian.com>
>>>>>> wrote:
>>>>>>
>>>>>>> So compaction doesn't seem to be your problem (You can check with
>>>>>>> nodetool compactionstats just to be sure).
>>>>>>>
>>>>>>
>>>>>> pending tasks: 0
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> How much is your write latency on your column families? I had OOM
>>>>>>> related to this before, and there was a tipping point around 70ms.
>>>>>>>
>>>>>>
>>>>>> Write request latency is below 0.05 ms/op (avg). Checked with
>>>>>> OpsCenter.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> BR,
>>>>>> Michał Łowicki
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> BR,
>>>> Michał Łowicki
>>>>
>>>
>>>
>>>
>>> --
>>> BR,
>>> Michał Łowicki
>>>
>>
>>
>> --
>>
>>
>>
>>
>
>
> --
> BR,
> Michał Łowicki
>



-- 
BR,
Michał Łowicki

Re: C* 2.1.2 invokes oom-killer

Posted by Michał Łowicki <ml...@gmail.com>.
The upgrade to 2.1.3 seems to help so far. After ~12 hours, total memory
consumption grew from 10GB to 10.5GB.

On Thu, Feb 19, 2015 at 2:02 PM, Carlos Rolo <ro...@pythian.com> wrote:

> Then you are probably hitting a bug... Trying to find out in Jira. The bad
> news is the fix is only to be released on 2.1.4. Once I find it out I will
> post it here.
>
> Regards,
>
> Carlos Juzarte Rolo
> Cassandra Consultant
>
> Pythian - Love your data
>
> rolo@pythian | Twitter: cjrolo | Linkedin: *linkedin.com/in/carlosjuzarterolo
> <http://linkedin.com/in/carlosjuzarterolo>*
> Tel: 1649
> www.pythian.com
>
> On Thu, Feb 19, 2015 at 12:16 PM, Michał Łowicki <ml...@gmail.com>
> wrote:
>
>> |trickle_fsync| has been enabled for long time in our settings (just
>> noticed):
>>
>> trickle_fsync: true
>>
>> trickle_fsync_interval_in_kb: 10240
>>
>> On Thu, Feb 19, 2015 at 12:12 PM, Michał Łowicki <ml...@gmail.com>
>> wrote:
>>
>>>
>>>
>>> On Thu, Feb 19, 2015 at 11:02 AM, Carlos Rolo <ro...@pythian.com> wrote:
>>>
>>>> Do you have trickle_fsync enabled? Try to enable that and see if it
>>>> solves your problem, since you are getting out of non-heap memory.
>>>>
>>>> Another question, is always the same nodes that die? Or is 2 out of 4
>>>> that die?
>>>>
>>>
>>> Always the same nodes. Upgraded to 2.1.3 two hours ago so we'll monitor
>>> if maybe issue has been fixed there. If not will try to enable
>>> |tricke_fsync|
>>>
>>>
>>>>
>>>> Regards,
>>>>
>>>> Carlos Juzarte Rolo
>>>> Cassandra Consultant
>>>>
>>>> Pythian - Love your data
>>>>
>>>> rolo@pythian | Twitter: cjrolo | Linkedin: *linkedin.com/in/carlosjuzarterolo
>>>> <http://linkedin.com/in/carlosjuzarterolo>*
>>>> Tel: 1649
>>>> www.pythian.com
>>>>
>>>> On Thu, Feb 19, 2015 at 10:49 AM, Michał Łowicki <ml...@gmail.com>
>>>> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Thu, Feb 19, 2015 at 10:41 AM, Carlos Rolo <ro...@pythian.com>
>>>>> wrote:
>>>>>
>>>>>> So compaction doesn't seem to be your problem (You can check with
>>>>>> nodetool compactionstats just to be sure).
>>>>>>
>>>>>
>>>>> pending tasks: 0
>>>>>
>>>>>
>>>>>>
>>>>>> How much is your write latency on your column families? I had OOM
>>>>>> related to this before, and there was a tipping point around 70ms.
>>>>>>
>>>>>
>>>>> Write request latency is below 0.05 ms/op (avg). Checked with
>>>>> OpsCenter.
>>>>>
>>>>>
>>>>>>
>>>>>> --
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> BR,
>>>>> Michał Łowicki
>>>>>
>>>>
>>>>
>>>> --
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> BR,
>>> Michał Łowicki
>>>
>>
>>
>>
>> --
>> BR,
>> Michał Łowicki
>>
>
>
> --
>
>
>
>


-- 
BR,
Michał Łowicki

Re: C* 2.1.2 invokes oom-killer

Posted by Carlos Rolo <ro...@pythian.com>.
Then you are probably hitting a bug... I'm trying to find it in Jira. The bad
news is that the fix will only be released in 2.1.4. Once I find it, I will
post it here.

Regards,

Carlos Juzarte Rolo
Cassandra Consultant

Pythian - Love your data

rolo@pythian | Twitter: cjrolo | Linkedin: *linkedin.com/in/carlosjuzarterolo
<http://linkedin.com/in/carlosjuzarterolo>*
Tel: 1649
www.pythian.com

On Thu, Feb 19, 2015 at 12:16 PM, Michał Łowicki <ml...@gmail.com> wrote:

> |trickle_fsync| has been enabled for long time in our settings (just
> noticed):
>
> trickle_fsync: true
>
> trickle_fsync_interval_in_kb: 10240
>
> On Thu, Feb 19, 2015 at 12:12 PM, Michał Łowicki <ml...@gmail.com>
> wrote:
>
>>
>>
>> On Thu, Feb 19, 2015 at 11:02 AM, Carlos Rolo <ro...@pythian.com> wrote:
>>
>>> Do you have trickle_fsync enabled? Try to enable that and see if it
>>> solves your problem, since you are getting out of non-heap memory.
>>>
>>> Another question, is always the same nodes that die? Or is 2 out of 4
>>> that die?
>>>
>>
>> Always the same nodes. Upgraded to 2.1.3 two hours ago so we'll monitor
>> if maybe issue has been fixed there. If not will try to enable
>> |tricke_fsync|
>>
>>
>>>
>>> Regards,
>>>
>>> Carlos Juzarte Rolo
>>> Cassandra Consultant
>>>
>>> Pythian - Love your data
>>>
>>> rolo@pythian | Twitter: cjrolo | Linkedin: *linkedin.com/in/carlosjuzarterolo
>>> <http://linkedin.com/in/carlosjuzarterolo>*
>>> Tel: 1649
>>> www.pythian.com
>>>
>>> On Thu, Feb 19, 2015 at 10:49 AM, Michał Łowicki <ml...@gmail.com>
>>> wrote:
>>>
>>>>
>>>>
>>>> On Thu, Feb 19, 2015 at 10:41 AM, Carlos Rolo <ro...@pythian.com> wrote:
>>>>
>>>>> So compaction doesn't seem to be your problem (You can check with
>>>>> nodetool compactionstats just to be sure).
>>>>>
>>>>
>>>> pending tasks: 0
>>>>
>>>>
>>>>>
>>>>> How much is your write latency on your column families? I had OOM
>>>>> related to this before, and there was a tipping point around 70ms.
>>>>>
>>>>
>>>> Write request latency is below 0.05 ms/op (avg). Checked with OpsCenter.
>>>>
>>>>
>>>>>
>>>>> --
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> BR,
>>>> Michał Łowicki
>>>>
>>>
>>>
>>> --
>>>
>>>
>>>
>>>
>>
>>
>> --
>> BR,
>> Michał Łowicki
>>
>
>
>
> --
> BR,
> Michał Łowicki
>

Re: C* 2.1.2 invokes oom-killer

Posted by Michał Łowicki <ml...@gmail.com>.
trickle_fsync has been enabled for a long time in our settings (just
noticed):

trickle_fsync: true

trickle_fsync_interval_in_kb: 10240

On Thu, Feb 19, 2015 at 12:12 PM, Michał Łowicki <ml...@gmail.com> wrote:

>
>
> On Thu, Feb 19, 2015 at 11:02 AM, Carlos Rolo <ro...@pythian.com> wrote:
>
>> Do you have trickle_fsync enabled? Try to enable that and see if it
>> solves your problem, since you are getting out of non-heap memory.
>>
>> Another question, is always the same nodes that die? Or is 2 out of 4
>> that die?
>>
>
> Always the same nodes. Upgraded to 2.1.3 two hours ago so we'll monitor if
> maybe issue has been fixed there. If not will try to enable |tricke_fsync|
>
>
>>
>> Regards,
>>
>> Carlos Juzarte Rolo
>> Cassandra Consultant
>>
>> Pythian - Love your data
>>
>> rolo@pythian | Twitter: cjrolo | Linkedin: *linkedin.com/in/carlosjuzarterolo
>> <http://linkedin.com/in/carlosjuzarterolo>*
>> Tel: 1649
>> www.pythian.com
>>
>> On Thu, Feb 19, 2015 at 10:49 AM, Michał Łowicki <ml...@gmail.com>
>> wrote:
>>
>>>
>>>
>>> On Thu, Feb 19, 2015 at 10:41 AM, Carlos Rolo <ro...@pythian.com> wrote:
>>>
>>>> So compaction doesn't seem to be your problem (You can check with
>>>> nodetool compactionstats just to be sure).
>>>>
>>>
>>> pending tasks: 0
>>>
>>>
>>>>
>>>> How much is your write latency on your column families? I had OOM
>>>> related to this before, and there was a tipping point around 70ms.
>>>>
>>>
>>> Write request latency is below 0.05 ms/op (avg). Checked with OpsCenter.
>>>
>>>
>>>>
>>>> --
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> BR,
>>> Michał Łowicki
>>>
>>
>>
>> --
>>
>>
>>
>>
>
>
> --
> BR,
> Michał Łowicki
>



-- 
BR,
Michał Łowicki

Re: C* 2.1.2 invokes oom-killer

Posted by Michał Łowicki <ml...@gmail.com>.
On Thu, Feb 19, 2015 at 11:02 AM, Carlos Rolo <ro...@pythian.com> wrote:

> Do you have trickle_fsync enabled? Try to enable that and see if it solves
> your problem, since you are getting out of non-heap memory.
>
> Another question, is always the same nodes that die? Or is 2 out of 4 that
> die?
>

Always the same nodes. We upgraded to 2.1.3 two hours ago, so we'll monitor
whether the issue has been fixed there. If not, we will try to enable trickle_fsync.


>
> Regards,
>
> Carlos Juzarte Rolo
> Cassandra Consultant
>
> Pythian - Love your data
>
> rolo@pythian | Twitter: cjrolo | Linkedin: *linkedin.com/in/carlosjuzarterolo
> <http://linkedin.com/in/carlosjuzarterolo>*
> Tel: 1649
> www.pythian.com
>
> On Thu, Feb 19, 2015 at 10:49 AM, Michał Łowicki <ml...@gmail.com>
> wrote:
>
>>
>>
>> On Thu, Feb 19, 2015 at 10:41 AM, Carlos Rolo <ro...@pythian.com> wrote:
>>
>>> So compaction doesn't seem to be your problem (You can check with
>>> nodetool compactionstats just to be sure).
>>>
>>
>> pending tasks: 0
>>
>>
>>>
>>> How much is your write latency on your column families? I had OOM
>>> related to this before, and there was a tipping point around 70ms.
>>>
>>
>> Write request latency is below 0.05 ms/op (avg). Checked with OpsCenter.
>>
>>
>>>
>>> --
>>>
>>>
>>>
>>>
>>
>>
>> --
>> BR,
>> Michał Łowicki
>>
>
>
> --
>
>
>
>


-- 
BR,
Michał Łowicki

Re: C* 2.1.2 invokes oom-killer

Posted by Carlos Rolo <ro...@pythian.com>.
Do you have trickle_fsync enabled? Try enabling it and see if it solves
your problem, since you are running out of non-heap memory.

Another question: is it always the same nodes that die, or just any 2 out of
the 4?
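
A quick way to check, as a rough sketch (the cassandra.yaml path depends on
your install; /etc/cassandra/cassandra.yaml is just the common package
location):

grep -E '^trickle_fsync' /etc/cassandra/cassandra.yaml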

Regards,

Carlos Juzarte Rolo
Cassandra Consultant

Pythian - Love your data

rolo@pythian | Twitter: cjrolo | Linkedin: *linkedin.com/in/carlosjuzarterolo
<http://linkedin.com/in/carlosjuzarterolo>*
Tel: 1649
www.pythian.com

On Thu, Feb 19, 2015 at 10:49 AM, Michał Łowicki <ml...@gmail.com> wrote:

>
>
> On Thu, Feb 19, 2015 at 10:41 AM, Carlos Rolo <ro...@pythian.com> wrote:
>
>> So compaction doesn't seem to be your problem (You can check with
>> nodetool compactionstats just to be sure).
>>
>
> pending tasks: 0
>
>
>>
>> How much is your write latency on your column families? I had OOM related
>> to this before, and there was a tipping point around 70ms.
>>
>
> Write request latency is below 0.05 ms/op (avg). Checked with OpsCenter.
>
>
>>
>> --
>>
>>
>>
>>
>
>
> --
> BR,
> Michał Łowicki
>

Re: C* 2.1.2 invokes oom-killer

Posted by Michał Łowicki <ml...@gmail.com>.
On Thu, Feb 19, 2015 at 10:41 AM, Carlos Rolo <ro...@pythian.com> wrote:

> So compaction doesn't seem to be your problem (You can check with nodetool
> compactionstats just to be sure).
>

pending tasks: 0


>
> How much is your write latency on your column families? I had OOM related
> to this before, and there was a tipping point around 70ms.
>

Write request latency is below 0.05 ms/op (avg). Checked with OpsCenter.


>
> --
>
>
>
>


-- 
BR,
Michał Łowicki

Re: C* 2.1.2 invokes oom-killer

Posted by Carlos Rolo <ro...@pythian.com>.
So compaction doesn't seem to be your problem (you can check with nodetool
compactionstats just to be sure).

How high is the write latency on your column families? I've had OOMs related
to this before, and there was a tipping point around 70ms.
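
For example, as a rough sketch (my_keyspace and my_table are placeholders for
your own keyspace and column family):

nodetool compactionstats                     # pending compaction tasks
nodetool cfstats my_keyspace.my_table        # per-table stats, including local read and write latency
nodetool cfhistograms my_keyspace my_table   # latency percentiles for the table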

Re: C* 2.1.2 invokes oom-killer

Posted by Michał Łowicki <ml...@gmail.com>.
In all tables the SSTable count is below 30.

On Thu, Feb 19, 2015 at 9:43 AM, Carlos Rolo <ro...@pythian.com> wrote:

> Can you check how many SSTables you have? It is more or less a know fact
> that 2.1.2 has lots of problems with compaction so a upgrade can solve it.
> But a high number of SSTables can confirm that indeed compaction is your
> problem not something else.
>
> Regards,
>
> Carlos Juzarte Rolo
> Cassandra Consultant
>
> Pythian - Love your data
>
> rolo@pythian | Twitter: cjrolo | Linkedin: *linkedin.com/in/carlosjuzarterolo
> <http://linkedin.com/in/carlosjuzarterolo>*
> Tel: 1649
> www.pythian.com
>
> On Thu, Feb 19, 2015 at 9:16 AM, Michał Łowicki <ml...@gmail.com>
> wrote:
>
>> We don't have other things running on these boxes and C* is consuming all
>> the memory.
>>
>> Will try to upgrade to 2.1.3 and if won't help downgrade to 2.1.2.
>>
>> —
>> Michał
>>
>>
>> On Thu, Feb 19, 2015 at 2:39 AM, Jacob Rhoden <ja...@me.com>
>> wrote:
>>
>>> Are you tweaking the "nice" priority on Cassandra? (Type: man nice) if
>>> you don't know much about it. Certainly improving cassandra's nice score
>>> becomes important when you have other things running on the server like
>>> scheduled jobs of people logging in to the server and doing things.
>>>
>>> ______________________________
>>> Sent from iPhone
>>>
>>> On 19 Feb 2015, at 5:28 am, Michał Łowicki <ml...@gmail.com> wrote:
>>>
>>>  Hi,
>>>
>>> Couple of times a day 2 out of 4 members cluster nodes are killed
>>>
>>> root@db4:~# dmesg | grep -i oom
>>> [4811135.792657] [ pid ]   uid  tgid total_vm      rss cpu oom_adj
>>> oom_score_adj name
>>> [6559049.307293] java invoked oom-killer: gfp_mask=0x201da, order=0,
>>> oom_adj=0, oom_score_adj=0
>>>
>>> Nodes are using 8GB heap (confirmed with *nodetool info*) and aren't
>>> using row cache.
>>>
>>> Noticed that couple of times a day used RSS is growing really fast
>>> within couple of minutes and I see CPU spikes at the same time -
>>> https://www.dropbox.com/s/khco2kdp4qdzjit/Screenshot%202015-02-18%2015.10.54.png?dl=0
>>> .
>>>
>>> Could be related to compaction but after compaction is finished used RSS
>>> doesn't shrink. Output from pmap when C* process uses 50GB RAM (out of
>>> 64GB) is available on http://paste.ofcode.org/ZjLUA2dYVuKvJHAk9T3Hjb.
>>> At the time dump was made heap usage is far below 8GB (~3GB) but total RSS
>>> is ~50GB.
>>>
>>> Any help will be appreciated.
>>>
>>> --
>>> BR,
>>> Michał Łowicki
>>>
>>>
>>
>
> --
>
>
>
>


-- 
BR,
Michał Łowicki

Re: C* 2.1.2 invokes oom-killer

Posted by Carlos Rolo <ro...@pythian.com>.
Can you check how many SSTables you have? It is more or less a known fact
that 2.1.2 has lots of problems with compaction, so an upgrade can solve it.
But a high number of SSTables would confirm that compaction is indeed your
problem and not something else.
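
One quick way to check, as a rough sketch:

nodetool cfstats | grep -i "sstable count"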

Regards,

Carlos Juzarte Rolo
Cassandra Consultant

Pythian - Love your data

rolo@pythian | Twitter: cjrolo | Linkedin: *linkedin.com/in/carlosjuzarterolo
<http://linkedin.com/in/carlosjuzarterolo>*
Tel: 1649
www.pythian.com

On Thu, Feb 19, 2015 at 9:16 AM, Michał Łowicki <ml...@gmail.com> wrote:

> We don't have other things running on these boxes and C* is consuming all
> the memory.
>
> Will try to upgrade to 2.1.3 and if won't help downgrade to 2.1.2.
>
> —
> Michał
>
>
> On Thu, Feb 19, 2015 at 2:39 AM, Jacob Rhoden <ja...@me.com> wrote:
>
>> Are you tweaking the "nice" priority on Cassandra? (Type: man nice) if
>> you don't know much about it. Certainly improving cassandra's nice score
>> becomes important when you have other things running on the server like
>> scheduled jobs of people logging in to the server and doing things.
>>
>> ______________________________
>> Sent from iPhone
>>
>> On 19 Feb 2015, at 5:28 am, Michał Łowicki <ml...@gmail.com> wrote:
>>
>>  Hi,
>>
>> Couple of times a day 2 out of 4 members cluster nodes are killed
>>
>> root@db4:~# dmesg | grep -i oom
>> [4811135.792657] [ pid ]   uid  tgid total_vm      rss cpu oom_adj
>> oom_score_adj name
>> [6559049.307293] java invoked oom-killer: gfp_mask=0x201da, order=0,
>> oom_adj=0, oom_score_adj=0
>>
>> Nodes are using 8GB heap (confirmed with *nodetool info*) and aren't
>> using row cache.
>>
>> Noticed that couple of times a day used RSS is growing really fast within
>> couple of minutes and I see CPU spikes at the same time -
>> https://www.dropbox.com/s/khco2kdp4qdzjit/Screenshot%202015-02-18%2015.10.54.png?dl=0
>> .
>>
>> Could be related to compaction but after compaction is finished used RSS
>> doesn't shrink. Output from pmap when C* process uses 50GB RAM (out of
>> 64GB) is available on http://paste.ofcode.org/ZjLUA2dYVuKvJHAk9T3Hjb. At
>> the time dump was made heap usage is far below 8GB (~3GB) but total RSS is
>> ~50GB.
>>
>> Any help will be appreciated.
>>
>> --
>> BR,
>> Michał Łowicki
>>
>>
>

Re: C* 2.1.2 invokes oom-killer

Posted by Michał Łowicki <ml...@gmail.com>.
We don't have other things running on these boxes and C* is consuming all the memory.




We will try to upgrade to 2.1.3 and, if that doesn't help, downgrade to 2.1.1.



—
Michał

On Thu, Feb 19, 2015 at 2:39 AM, Jacob Rhoden <ja...@me.com> wrote:

> Are you tweaking the "nice" priority on Cassandra? (Type: man nice) if you don't know much about it. Certainly improving cassandra's nice score becomes important when you have other things running on the server like scheduled jobs of people logging in to the server and doing things.
> ______________________________
> Sent from iPhone
>> On 19 Feb 2015, at 5:28 am, Michał Łowicki <ml...@gmail.com> wrote:
>> 
>> Hi,
>> 
>> Couple of times a day 2 out of 4 members cluster nodes are killed
>> 
>> root@db4:~# dmesg | grep -i oom
>> [4811135.792657] [ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
>> [6559049.307293] java invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
>> 
>> Nodes are using 8GB heap (confirmed with *nodetool info*) and aren't using row cache. 
>> 
>> Noticed that couple of times a day used RSS is growing really fast within couple of minutes and I see CPU spikes at the same time - https://www.dropbox.com/s/khco2kdp4qdzjit/Screenshot%202015-02-18%2015.10.54.png?dl=0.
>> 
>> Could be related to compaction but after compaction is finished used RSS doesn't shrink. Output from pmap when C* process uses 50GB RAM (out of 64GB) is available on http://paste.ofcode.org/ZjLUA2dYVuKvJHAk9T3Hjb. At the time dump was made heap usage is far below 8GB (~3GB) but total RSS is ~50GB.
>> 
>> Any help will be appreciated.
>> 
>> -- 
>> BR,
>> Michał Łowicki

Re: C* 2.1.2 invokes oom-killer

Posted by Jacob Rhoden <ja...@me.com>.
Are you tweaking the "nice" priority on Cassandra? (Type "man nice" if you don't know much about it.) Improving Cassandra's nice score certainly becomes important when you have other things running on the server, like scheduled jobs or people logging in to the server and doing things.
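
For example, a rough sketch (assumes root and a running Cassandra process;
-5 is just an illustrative value):

CASSANDRA_PID=$(pgrep -f CassandraDaemon | head -n 1)
renice -n -5 -p $CASSANDRA_PID   # a lower (more negative) nice value means higher CPU scheduling priority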

______________________________
Sent from iPhone

> On 19 Feb 2015, at 5:28 am, Michał Łowicki <ml...@gmail.com> wrote:
> 
> Hi,
> 
> Couple of times a day 2 out of 4 members cluster nodes are killed
> 
> root@db4:~# dmesg | grep -i oom
> [4811135.792657] [ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
> [6559049.307293] java invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
> 
> Nodes are using 8GB heap (confirmed with *nodetool info*) and aren't using row cache. 
> 
> Noticed that couple of times a day used RSS is growing really fast within couple of minutes and I see CPU spikes at the same time - https://www.dropbox.com/s/khco2kdp4qdzjit/Screenshot%202015-02-18%2015.10.54.png?dl=0.
> 
> Could be related to compaction but after compaction is finished used RSS doesn't shrink. Output from pmap when C* process uses 50GB RAM (out of 64GB) is available on http://paste.ofcode.org/ZjLUA2dYVuKvJHAk9T3Hjb. At the time dump was made heap usage is far below 8GB (~3GB) but total RSS is ~50GB.
> 
> Any help will be appreciated.
> 
> -- 
> BR,
> Michał Łowicki