Posted to commits@cassandra.apache.org by "Torsten Curdt (JIRA)" <ji...@apache.org> on 2010/06/09 17:04:13 UTC

[jira] Created: (CASSANDRA-1177) OutOfMemory on heavy inserts

OutOfMemory on heavy inserts
----------------------------

                 Key: CASSANDRA-1177
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1177
             Project: Cassandra
          Issue Type: Bug
          Components: Core
    Affects Versions: 0.6.2
         Environment: SunOS 5.10, x86 32bit, Java HotSpot Server VM 11.2-b01 mixed mode
            Reporter: Torsten Curdt
            Priority: Critical


We have a cluster of 6 Cassandra 0.6.2 nodes running under SunOS (see environment).

On initial import (using the Thrift API) we see some weird behavior from half the cluster. While cas04-06 look fine, as you can see from the attached munin graphs, the other 3 nodes kept on GCing (see log file) until they became unreachable and went OOM. (This is also why the stats are so spotty: munin could no longer reach the boxes.) We have seen the same behavior on 0.6.2 and 0.6.1. It started after around 100 million inserts.

Looking at the hprof (which is of course too big to attach) we see lots of ConcurrentSkipListMap$Node instances and quite a few Column objects. Please see the stats attached.

This looks similar to https://issues.apache.org/jira/browse/CASSANDRA-1014 but we are not sure it is really the same issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (CASSANDRA-1177) OutOfMemory on heavy inserts

Posted by "Eric Evans (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879992#action_12879992 ] 

Eric Evans commented on CASSANDRA-1177:
---------------------------------------

bq. After getting even further with our import (because we lowered the thresholds for Memtable flushing), we are seeing one node stop answering over JMX and start printing more and more GC messages to the log. We looked into the data and commitlog directories and figured a listing of their contents might be helpful in solving our problem.

I don't think there is anything particularly telling here, other than that the behavior you're seeing still falls within the range of what is expected (based on what we know).

I would suggest we move this discussion to user@cassandra.apache.org; the mailing list is a better forum for this sort of discussion. If it becomes apparent that there is indeed a bug, we can move it back here along with a summary or a pointer to the thread.

Be sure to include the rate of write operations, the size of the writes, the consistency level being used, how many nodes are involved, along with your most recent configuration.



[jira] Commented: (CASSANDRA-1177) OutOfMemory on heavy inserts

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12877099#action_12877099 ] 

Jonathan Ellis commented on CASSANDRA-1177:
-------------------------------------------

CSLM (ConcurrentSkipListMap) sucking up the memory sounds like you just have too much unflushed data in your memtables.

Do you have balanced tokens/"load" across the machines?  ("nodetool ring")

I would
 - balance nodes (with move) if necessary as described at the top of http://wiki.apache.org/cassandra/Operations
 - increase the heap size and/or decrease the memtable size and operation-count flush thresholds; or, if some memtables are much more active than others, leave the flush thresholds high but reduce MemtableFlushAfterMinutes to flush out the less frequently used ones instead (the relevant settings are sketched below).
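
For reference, in Cassandra 0.6 these thresholds live in storage-conf.xml. A minimal sketch, assuming the 0.6-era element names; the values are illustrative only, not tuned recommendations:

    <!-- Sketch: global memtable flush thresholds in storage-conf.xml (0.6).
         Lower throughput/operation thresholds mean smaller, more frequent
         flushes, so less data sits in ConcurrentSkipListMap-backed memtables. -->
    <MemtableThroughputInMB>32</MemtableThroughputInMB>
    <MemtableOperationsInMillions>0.1</MemtableOperationsInMillions>
    <MemtableFlushAfterMinutes>30</MemtableFlushAfterMinutes>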



[jira] Updated: (CASSANDRA-1177) OutOfMemory on heavy inserts

Posted by "Alexander Simmerl (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexander Simmerl updated CASSANDRA-1177:
-----------------------------------------

    Attachment: data.txt
                commitlog.txt

After getting even further with our import (because we lowered the thresholds for Memtable flushing), we are seeing one node stop answering over JMX and start printing more and more GC messages to the log. We looked into the data and commitlog directories and figured a listing of their contents might be helpful in solving our problem.



[jira] Commented: (CASSANDRA-1177) OutOfMemory on heavy inserts

Posted by "Alexander Simmerl (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12877445#action_12877445 ] 

Alexander Simmerl commented on CASSANDRA-1177:
----------------------------------------------

We tried reducing MemtableOperationsInMillions from 1 to 0.1 and setting MemtableFlushAfterMinutes to 1. I also increased and decreased the heap size. As you can see in the attachment, all the nodes are loaded roughly evenly. Only 10.12.22.117 shows a huge difference, but that happened after the crashes; before them it was equal to the other nodes.

None of these actions helped. We also experienced flapping in the Gossiper:


 INFO [GC inspection] 2010-06-10 16:10:38,790 GCInspector.java (line 110) GC for ConcurrentMarkSweep: 23943 ms, 8915640 reclaimed leaving 2151863720 used; max is 2263941120
 INFO [GMFD:1] 2010-06-10 16:10:38,790 Gossiper.java (line 568) InetAddress /10.12.22.116 is now UP
 INFO [Timer-1] 2010-06-10 16:10:55,846 Gossiper.java (line 179) InetAddress /10.12.22.116 is now dead.
 INFO [GC inspection] 2010-06-10 16:10:55,846 GCInspector.java (line 110) GC for ConcurrentMarkSweep: 16730 ms, 8592904 reclaimed leaving 2152186664 used; max is 2263941120
 INFO [GMFD:1] 2010-06-10 16:10:55,846 Gossiper.java (line 568) InetAddress /10.12.22.116 is now UP
 INFO [Timer-1] 2010-06-10 16:11:20,004 Gossiper.java (line 179) InetAddress /10.12.22.116 is now dead.
 INFO [GC inspection] 2010-06-10 16:11:20,004 GCInspector.java (line 110) GC for ConcurrentMarkSweep: 24118 ms, 8148936 reclaimed leaving 2152641776 used; max is 2263941120
 INFO [Timer-1] 2010-06-10 16:11:20,004 Gossiper.java (line 179) InetAddress /10.12.22.115 is now dead.
 INFO [GMFD:1] 2010-06-10 16:11:20,004 Gossiper.java (line 568) InetAddress /10.12.22.116 is now UP
 INFO [GMFD:1] 2010-06-10 16:11:20,004 Gossiper.java (line 568) InetAddress /10.12.22.115 is now UP
 INFO [Timer-1] 2010-06-10 16:11:36,610 Gossiper.java (line 179) InetAddress /10.12.22.116 is now dead.
 INFO [GC inspection] 2010-06-10 16:11:36,910 GCInspector.java (line 110) GC for ConcurrentMarkSweep: 16591 ms, 7905120 reclaimed leaving 2152871040 used; max is 2263941120
 INFO [GMFD:1] 2010-06-10 16:11:36,910 Gossiper.java (line 568) InetAddress /10.12.22.116 is now UP
 INFO [Timer-1] 2010-06-10 16:12:01,268 Gossiper.java (line 179) InetAddress /10.12.22.116 is now dead.
 INFO [Timer-1] 2010-06-10 16:12:01,268 Gossiper.java (line 179) InetAddress /10.12.22.115 is now dead.



[jira] Commented: (CASSANDRA-1177) OutOfMemory on heavy inserts

Posted by "Torsten Curdt (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884741#action_12884741 ] 

Torsten Curdt commented on CASSANDRA-1177:
------------------------------------------

Some more information:

We had an oldish JDK in place; upgrading to the latest helped but did not fix it.
We are still seeing these problems with un-throttled inserts through Thrift.

 - replication factor 3
 - consistency level ONE
 - 6 nodes involved
 - 4 workers at 200-300 writes/s
 - writes are about 300 bytes in size

We switched to the low-level StorageProxy API for the bulk import.
Using that, the cluster behaved just beautifully. No problems there.
Much, much faster!
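
For context, a minimal sketch of what a StorageProxy-based writer can look like against 0.6. This is an illustration, not the actual import code from this thread: it assumes a JVM with the Cassandra 0.6 classes and storage-conf.xml on the classpath, and "Keyspace1", "Standard1", and "data" are placeholder names:

    import java.util.Arrays;
    import org.apache.cassandra.db.RowMutation;
    import org.apache.cassandra.db.filter.QueryPath;
    import org.apache.cassandra.service.StorageProxy;

    public class BulkWriter {
        // Sketch only: writes through the server-internal path, skipping
        // Thrift serialization and per-request socket overhead.
        static void write(String key, byte[] value) {
            RowMutation rm = new RowMutation("Keyspace1", key);
            rm.add(new QueryPath("Standard1", null, "data".getBytes()),
                   value, System.currentTimeMillis());
            // Asynchronous mutate: it does not wait for a consistency-level
            // acknowledgement, which is part of why it is so much faster --
            // and why the caller gets even less backpressure than via Thrift.
            StorageProxy.mutate(Arrays.asList(rm));
        }
    }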

So I assume that as long as we don't insert too fast we should be OK.
But it's quite a scary situation if the ring does not recover properly.
...and 1200 writes/s for a 6-node cluster is not really that much.



[jira] Resolved: (CASSANDRA-1177) OutOfMemory on heavy inserts

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis resolved CASSANDRA-1177.
---------------------------------------

    Resolution: Not A Problem

If you're balancing by "disk size" then you're basically creating hot spots on the ring deliberately. That's not a good idea unless you are disk-space bound and you're sure your disk-heavy machines can handle the extra load, which doesn't appear to be the case here. :)

Cassandra doesn't do backpressure yet (see CASSANDRA-685), so when you are OOMing it under load you can mitigate it by (a) giving the JVM more heap (or adding machines) and (b) sleeping 100ms before retrying whenever you get a TimedOutException on the client (sketched below).
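
A minimal sketch of (b) against the 0.6 Thrift API; "Keyspace1", "Standard1", and "data" are placeholder names, and the flat 100ms sleep could just as well grow exponentially per retry:

    import org.apache.cassandra.thrift.Cassandra;
    import org.apache.cassandra.thrift.ColumnPath;
    import org.apache.cassandra.thrift.ConsistencyLevel;
    import org.apache.cassandra.thrift.TimedOutException;

    public class ThrottledClient {
        // Sketch only: retry an insert after a short sleep whenever the
        // server signals overload with a TimedOutException.
        static void insertWithBackoff(Cassandra.Client client, String key,
                                      byte[] value) throws Exception {
            ColumnPath path = new ColumnPath("Standard1");
            path.setColumn("data".getBytes("UTF-8"));
            while (true) {
                try {
                    client.insert("Keyspace1", key, path, value,
                                  System.currentTimeMillis(), ConsistencyLevel.ONE);
                    return;              // write accepted
                } catch (TimedOutException e) {
                    Thread.sleep(100);   // overloaded: back off, then retry
                }
            }
        }
    }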

You may well also be consuming a lot of heap in (a) compaction of Activities or (b) compaction or scanning of hinted handoff rows (once one node starts going down, say, the 12GB one to start with, that will start generating hints on the other ones that can add to the memory pressure they see).

We can continue troubleshooting here or on the list / IRC, but I'm resolving this as Not A Problem because it's almost certainly not a bug per se.



[jira] Commented: (CASSANDRA-1177) OutOfMemory on heavy inserts

Posted by "Eric Evans (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879059#action_12879059 ] 

Eric Evans commented on CASSANDRA-1177:
---------------------------------------

Alexander, Torsten, are you still having problems with this?

bq. The most interesting fact is that the problematic nodes end up in a GC storm without any load on the ring. Since the problems started we have stopped writing to it. So no external interaction is happening, but the nodes still end up in endless cycles.

To be clear, are you saying that a cluster brought up cold with no traffic has nodes that GC storm, or that even after removing all load they continue to storm? The former would be a memory leak; the latter would not be surprising (once it starts thrashing like that, it's not likely to recover on its own).



[jira] Reopened: (CASSANDRA-1177) OutOfMemory on heavy inserts

Posted by "Torsten Curdt (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Torsten Curdt reopened CASSANDRA-1177:
--------------------------------------


Please see the comment from Alexander.



[jira] Commented: (CASSANDRA-1177) OutOfMemory on heavy inserts

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879071#action_12879071 ] 

Jonathan Ellis commented on CASSANDRA-1177:
-------------------------------------------

It's much more likely to be compaction related than a memory leak.



[jira] Commented: (CASSANDRA-1177) OutOfMemory on heavy inserts

Posted by "Alexander Simmerl (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12877461#action_12877461 ] 

Alexander Simmerl commented on CASSANDRA-1177:
----------------------------------------------

To clarify the situation even more: we tried it with an equally balanced ring, meaning all nodes have the same range assigned. The same problem occurred. The most interesting fact is that the problematic nodes end up in a GC storm without any load on the ring. Since the problems started we have stopped writing to it. So no external interaction is happening, but the nodes still end up in endless cycles.



[jira] Commented: (CASSANDRA-1177) OutOfMemory on heavy inserts

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12877472#action_12877472 ] 

Jonathan Ellis commented on CASSANDRA-1177:
-------------------------------------------

The problem is that your cluster state now is not the same as the cluster state before you started seeing this.  Your rows are larger (compaction) and you have hinted handoff complicating the picture.



[jira] Updated: (CASSANDRA-1177) OutOfMemory on heavy inserts

Posted by "Torsten Curdt (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Torsten Curdt updated CASSANDRA-1177:
-------------------------------------

    Attachment: bug report.zip



[jira] Updated: (CASSANDRA-1177) OutOfMemory on heavy inserts

Posted by "Torsten Curdt (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Torsten Curdt updated CASSANDRA-1177:
-------------------------------------

    Environment: 
SunOS 5.10, x86 32bit, Java HotSpot Server VM 11.2-b01 mixed mode
Sun SDK 1.6.0_12-b04

  was:SunOS 5.10, x86 32bit, Java HotSpot Server VM 11.2-b01 mixed mode




[jira] Commented: (CASSANDRA-1177) OutOfMemory on heavy inserts

Posted by "Alexander Simmerl (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879353#action_12879353 ] 

Alexander Simmerl commented on CASSANDRA-1177:
----------------------------------------------

A cold cluster performs well, even under heavy load, up to ~100 million inserts with a replication factor of 3. But then half or more of the nodes (out of 6 total) go into the GC storm. After removing the load the nodes are still hanging in GC, and even restarting the nodes didn't help.



[jira] Commented: (CASSANDRA-1177) OutOfMemory on heavy inserts

Posted by "Torsten Curdt (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12877434#action_12877434 ] 

Torsten Curdt commented on CASSANDRA-1177:
------------------------------------------

The output of "nodetool ring" is attached. It should be balanced, alright: it should match the disk sizes.

We also already played with the thresholds.

I can ask the engineer who worked on this what exactly he tried.



[jira] Resolved: (CASSANDRA-1177) OutOfMemory on heavy inserts

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis resolved CASSANDRA-1177.
---------------------------------------

    Resolution: Duplicate

CASSANDRA-685 and CASSANDRA-981 will address this in 0.7.  For 0.6 the best solution is to throttle writes if you start seeing TimedOutExceptions (as in the back-off sketch above).
