Posted to commits@cassandra.apache.org by "David King (JIRA)" <ji...@apache.org> on 2011/01/26 02:20:43 UTC

[jira] Created: (CASSANDRA-2058) Nodes periodically spike in load

Nodes periodically spike in load
--------------------------------

                 Key: CASSANDRA-2058
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2058
             Project: Cassandra
          Issue Type: Bug
          Components: Core
    Affects Versions: 0.6.10
            Reporter: David King


(Filing as a placeholder bug as I gather information.)

At ~10pm on 24 Jan, I upgraded our 20-node cluster from 0.6.8 to 0.6.10, turned on the DES, and moved some CFs from one KS into another (drain the whole cluster, take it down, move files, change schema, bring it back up). Since then, I've had four storms in which a node's load shoots to 700+ (400% CPU on a 4-CPU machine) and the node becomes totally unresponsive. After a moment or two of that, its neighbour dies too, and the failure cascades around the ring. Unfortunately, because of the high load, I'm not able to get into the machine to pull a thread dump and see what it's doing as it happens.

I've also had an issue where a single node spikes up to high load but recovers. This may or may not be the same issue as the non-recovering case above, but both are new behaviour.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (CASSANDRA-2058) Nodes periodically spike in load

Posted by "David King (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12986865#action_12986865 ] 

David King commented on CASSANDRA-2058:
---------------------------------------

bq. You were running 0.6.8 + DS before? Or is "it" not DynamicSnitch?

I was running 0.6.8 with no DES. Then I upgraded to 0.6.10 and turned it on. I had the aforementioned problems.

Now I'm running 0.6.10 with the DES turned off. (As of this writing, I'm still seeing the momentary spikes but thus far no sustained ones.)

If I continue to have the momentary or sustained spikes (I'll probably know by the morning), then I'll revert to 0.6.8, and turn *on* the DES.

If I continue to have problems after that, I'll revert to 0.6.8 with no DES, which is at least a configuration in which I didn't have any of these problems.

> Nodes periodically spike in load
> --------------------------------
>
>                 Key: CASSANDRA-2058
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2058
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.6.10
>            Reporter: David King
>         Attachments: cassandra.pmc01.log.bz2, cassandra.pmc14.log.bz2, graph a.png, graph b.png
>
>
> (Filing as a placeholder bug as I gather information.)
> At ~10p 24 Jan, I upgraded our 20-node cluster from 0.6.8->0.6.10, turned on the DES, and moved some CFs from one KS into another (drain whole cluster, take it down, move files, change schema, put it back up). Since then, I've had four storms whereby a node's load will shoot to 700+ (400% CPU on a 4-cpu machine) and become totally unresponsive. After a moment or two like that, its neighbour dies too, and the failure cascades around the ring. Unfortunately because of the high load I'm not able to get into the machine to pull a thread dump to see wtf it's doing as it happens.
> I've also had an issue where a single node spikes up to high load, but recovers. This may or may not be the same issue from which the nodes don't recover as above, but both are new behaviour

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (CASSANDRA-2058) Nodes periodically spike in load

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12986849#action_12986849 ] 

Jonathan Ellis commented on CASSANDRA-2058:
-------------------------------------------

bq. If it doesn't, then I'll turn it back on and revert to 0.6.8 to see if that does it

You were running 0.6.8 + DS before?  Or is "it" not DynamicSnitch?

> Nodes periodically spike in load
> --------------------------------
>
>                 Key: CASSANDRA-2058
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2058
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.6.10
>            Reporter: David King
>         Attachments: cassandra.pmc01.log.bz2, cassandra.pmc14.log.bz2, graph a.png, graph b.png
>
>
> (Filing as a placeholder bug as I gather information.)
> At ~10p 24 Jan, I upgraded our 20-node cluster from 0.6.8->0.6.10, turned on the DES, and moved some CFs from one KS into another (drain whole cluster, take it down, move files, change schema, put it back up). Since then, I've had four storms whereby a node's load will shoot to 700+ (400% CPU on a 4-cpu machine) and become totally unresponsive. After a moment or two like that, its neighbour dies too, and the failure cascades around the ring. Unfortunately because of the high load I'm not able to get into the machine to pull a thread dump to see wtf it's doing as it happens.
> I've also had an issue where a single node spikes up to high load, but recovers. This may or may not be the same issue from which the nodes don't recover as above, but both are new behaviour

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (CASSANDRA-2058) Nodes periodically spike in load

Posted by "David King (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12988968#comment-12988968 ] 

David King commented on CASSANDRA-2058:
---------------------------------------

It's hard to say. I lost 5 nodes in about an hour, but I don't know how many I lost last time.

> Nodes periodically spike in load
> --------------------------------
>
>                 Key: CASSANDRA-2058
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2058
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.6.10, 0.7.1
>         Environment: OpenJDK 64-Bit Server VM (build 1.6.0_0-b12, mixed mode)
> Ubuntu 8.10
> Linux pmc01 2.6.27-22-xen #1 SMP Fri Feb 20 23:58:13 UTC 2009 x86_64 GNU/Linux
>            Reporter: David King
>            Assignee: Jonathan Ellis
>             Fix For: 0.6.11, 0.7.1
>
>         Attachments: 2058-0.7-v2.txt, 2058-0.7-v3.txt, 2058-0.7.txt, 2058.txt, cassandra.pmc01.log.bz2, cassandra.pmc14.log.bz2, graph a.png, graph b.png
>
>
> (Filing as a placeholder bug as I gather information.)
> At ~10p 24 Jan, I upgraded our 20-node cluster from 0.6.8->0.6.10, turned on the DES, and moved some CFs from one KS into another (drain whole cluster, take it down, move files, change schema, put it back up). Since then, I've had four storms whereby a node's load will shoot to 700+ (400% CPU on a 4-cpu machine) and become totally unresponsive. After a moment or two like that, its neighbour dies too, and the failure cascades around the ring. Unfortunately because of the high load I'm not able to get into the machine to pull a thread dump to see wtf it's doing as it happens.
> I've also had an issue where a single node spikes up to high load, but recovers. This may or may not be the same issue from which the nodes don't recover as above, but both are new behaviour

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Commented: (CASSANDRA-2058) Nodes periodically spike in load

Posted by "David King (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990656#comment-12990656 ] 

David King commented on CASSANDRA-2058:
---------------------------------------

I don't have JNA on these hosts, so at least in my case it's not JNA-related.

> Nodes periodically spike in load
> --------------------------------
>
>                 Key: CASSANDRA-2058
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2058
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.6.10, 0.7.1
>         Environment: OpenJDK 64-Bit Server VM (build 1.6.0_0-b12, mixed mode)
> Ubuntu 8.10
> Linux pmc01 2.6.27-22-xen #1 SMP Fri Feb 20 23:58:13 UTC 2009 x86_64 GNU/Linux
>            Reporter: David King
>            Assignee: Jonathan Ellis
>             Fix For: 0.6.11, 0.7.1
>
>         Attachments: 2058-0.7-v2.txt, 2058-0.7-v3.txt, 2058-0.7.txt, 2058.txt, cassandra.pmc01.log.bz2, cassandra.pmc14.log.bz2, graph a.png, graph b.png
>
>
> (Filing as a placeholder bug as I gather information.)
> At ~10p 24 Jan, I upgraded our 20-node cluster from 0.6.8->0.6.10, turned on the DES, and moved some CFs from one KS into another (drain whole cluster, take it down, move files, change schema, put it back up). Since then, I've had four storms whereby a node's load will shoot to 700+ (400% CPU on a 4-cpu machine) and become totally unresponsive. After a moment or two like that, its neighbour dies too, and the failure cascades around the ring. Unfortunately because of the high load I'm not able to get into the machine to pull a thread dump to see wtf it's doing as it happens.
> I've also had an issue where a single node spikes up to high load, but recovers. This may or may not be the same issue from which the nodes don't recover as above, but both are new behaviour

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Updated: (CASSANDRA-2058) Load spikes due to MessagingService-generated garbage collection

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-2058:
--------------------------------------

    Remaining Estimate: 0.4h
     Original Estimate: 0.4h

> Load spikes due to MessagingService-generated garbage collection
> ----------------------------------------------------------------
>
>                 Key: CASSANDRA-2058
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2058
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.6.10, 0.7.0
>         Environment: OpenJDK 64-Bit Server VM (build 1.6.0_0-b12, mixed mode)
> Ubuntu 8.10
> Linux pmc01 2.6.27-22-xen #1 SMP Fri Feb 20 23:58:13 UTC 2009 x86_64 GNU/Linux
>            Reporter: David King
>            Assignee: Jonathan Ellis
>             Fix For: 0.6.11, 0.7.1
>
>         Attachments: 2058-0.7-v2.txt, 2058-0.7-v3.txt, 2058-0.7.txt, 2058.txt, cassandra.pmc01.log.bz2, cassandra.pmc14.log.bz2, graph a.png, graph b.png
>
>   Original Estimate: 0.4h
>  Remaining Estimate: 0.4h
>
> (Filing as a placeholder bug as I gather information.)
> At ~10p 24 Jan, I upgraded our 20-node cluster from 0.6.8->0.6.10, turned on the DES, and moved some CFs from one KS into another (drain whole cluster, take it down, move files, change schema, put it back up). Since then, I've had four storms whereby a node's load will shoot to 700+ (400% CPU on a 4-cpu machine) and become totally unresponsive. After a moment or two like that, its neighbour dies too, and the failure cascades around the ring. Unfortunately because of the high load I'm not able to get into the machine to pull a thread dump to see wtf it's doing as it happens.
> I've also had an issue where a single node spikes up to high load, but recovers. This may or may not be the same issue from which the nodes don't recover as above, but both are new behaviour

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Updated: (CASSANDRA-2058) Nodes periodically spike in load

Posted by "David King (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David King updated CASSANDRA-2058:
----------------------------------

    Attachment: cassandra.pmc14.log.bz2

> Nodes periodically spike in load
> --------------------------------
>
>                 Key: CASSANDRA-2058
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2058
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.6.10
>            Reporter: David King
>         Attachments: cassandra.pmc01.log.bz2, cassandra.pmc14.log.bz2, graph a.png, graph b.png
>
>
> (Filing as a placeholder bug as I gather information.)
> At ~10p 24 Jan, I upgraded our 20-node cluster from 0.6.8->0.6.10, turned on the DES, and moved some CFs from one KS into another (drain whole cluster, take it down, move files, change schema, put it back up). Since then, I've had four storms whereby a node's load will shoot to 700+ (400% CPU on a 4-cpu machine) and become totally unresponsive. After a moment or two like that, its neighbour dies too, and the failure cascades around the ring. Unfortunately because of the high load I'm not able to get into the machine to pull a thread dump to see wtf it's doing as it happens.
> I've also had an issue where a single node spikes up to high load, but recovers. This may or may not be the same issue from which the nodes don't recover as above, but both are new behaviour

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (CASSANDRA-2058) Load spikes due to MessagingService-generated garbage collection

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-2058:
--------------------------------------

    Summary: Load spikes due to MessagingService-generated garbage collection  (was: Nodes periodically spike in load)

> Load spikes due to MessagingService-generated garbage collection
> ----------------------------------------------------------------
>
>                 Key: CASSANDRA-2058
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2058
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.6.10, 0.7.0
>         Environment: OpenJDK 64-Bit Server VM (build 1.6.0_0-b12, mixed mode)
> Ubuntu 8.10
> Linux pmc01 2.6.27-22-xen #1 SMP Fri Feb 20 23:58:13 UTC 2009 x86_64 GNU/Linux
>            Reporter: David King
>            Assignee: Jonathan Ellis
>             Fix For: 0.6.11, 0.7.1
>
>         Attachments: 2058-0.7-v2.txt, 2058-0.7-v3.txt, 2058-0.7.txt, 2058.txt, cassandra.pmc01.log.bz2, cassandra.pmc14.log.bz2, graph a.png, graph b.png
>
>
> (Filing as a placeholder bug as I gather information.)
> At ~10p 24 Jan, I upgraded our 20-node cluster from 0.6.8->0.6.10, turned on the DES, and moved some CFs from one KS into another (drain whole cluster, take it down, move files, change schema, put it back up). Since then, I've had four storms whereby a node's load will shoot to 700+ (400% CPU on a 4-cpu machine) and become totally unresponsive. After a moment or two like that, its neighbour dies too, and the failure cascades around the ring. Unfortunately because of the high load I'm not able to get into the machine to pull a thread dump to see wtf it's doing as it happens.
> I've also had an issue where a single node spikes up to high load, but recovers. This may or may not be the same issue from which the nodes don't recover as above, but both are new behaviour

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Updated: (CASSANDRA-2058) Nodes periodically spike in load

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-2058:
--------------------------------------

    Attachment: 2058-0.7-v3.txt

v3 adds latency tracking to LocalReadRunnable.
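
For readers skimming the patch names, a rough and purely illustrative sketch of the idea (class and method names below are placeholders, not the contents of 2058-0.7-v3.txt): time the read that is executed on the local replica and report the elapsed time for the local endpoint, so local reads contribute latency samples just like remote ones.

import java.net.InetAddress;
import java.util.concurrent.TimeUnit;

// Illustrative sketch only -- not the attached patch. The point of v3 is that a read
// served by the local replica should also produce a latency sample for the snitch.
final class TimedLocalRead implements Runnable
{
    interface LatencySink { void receiveTiming(InetAddress endpoint, long micros); }

    private final Runnable read;          // the actual local read (placeholder)
    private final InetAddress localhost;  // this node's address
    private final LatencySink sink;       // e.g. the dynamic snitch

    TimedLocalRead(Runnable read, InetAddress localhost, LatencySink sink)
    {
        this.read = read;
        this.localhost = localhost;
        this.sink = sink;
    }

    public void run()
    {
        long start = System.nanoTime();
        read.run();
        sink.receiveTiming(localhost, TimeUnit.NANOSECONDS.toMicros(System.nanoTime() - start));
    }
}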

> Nodes periodically spike in load
> --------------------------------
>
>                 Key: CASSANDRA-2058
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2058
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.6.10, 0.7.1
>            Reporter: David King
>            Assignee: Jonathan Ellis
>             Fix For: 0.6.11, 0.7.1
>
>         Attachments: 2058-0.7-v2.txt, 2058-0.7-v3.txt, 2058-0.7.txt, 2058.txt, cassandra.pmc01.log.bz2, cassandra.pmc14.log.bz2, graph a.png, graph b.png
>
>
> (Filing as a placeholder bug as I gather information.)
> At ~10p 24 Jan, I upgraded our 20-node cluster from 0.6.8->0.6.10, turned on the DES, and moved some CFs from one KS into another (drain whole cluster, take it down, move files, change schema, put it back up). Since then, I've had four storms whereby a node's load will shoot to 700+ (400% CPU on a 4-cpu machine) and become totally unresponsive. After a moment or two like that, its neighbour dies too, and the failure cascades around the ring. Unfortunately because of the high load I'm not able to get into the machine to pull a thread dump to see wtf it's doing as it happens.
> I've also had an issue where a single node spikes up to high load, but recovers. This may or may not be the same issue from which the nodes don't recover as above, but both are new behaviour

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (CASSANDRA-2058) Nodes periodically spike in load

Posted by "Brandon Williams (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Brandon Williams updated CASSANDRA-2058:
----------------------------------------

    Attachment: 2058-0.7-v2.txt

0.7 v2 fixes the DES by incorporating the approach from CASSANDRA-2004: it has the snitch register with MessagingService (MS) directly and removes ILatencyPublisher (ILP). However, it does not receive timings for the local node.
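
For readers outside the DES discussion, a minimal sketch of the registration pattern being described; the interface and class names here are hypothetical, not the actual Cassandra 0.7 classes. The snitch subscribes to the messaging layer, which reports a latency each time a reply arrives from a replica.

import java.net.InetAddress;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical sketch of the CASSANDRA-2004 approach; names are illustrative.
// Instead of publishers pushing timings (the removed ILatencyPublisher path),
// the snitch registers with the messaging layer and receives a sample per reply.
interface LatencySubscriber
{
    void receiveTiming(InetAddress replica, long latencyMicros);
}

final class MessagingLayer
{
    private final List<LatencySubscriber> subscribers = new CopyOnWriteArrayList<>();

    void register(LatencySubscriber subscriber)
    {
        subscribers.add(subscriber);
    }

    // Called when a response arrives; sendTimeNanos was recorded when the request went out.
    void responseReceived(InetAddress replica, long sendTimeNanos)
    {
        long micros = (System.nanoTime() - sendTimeNanos) / 1000;
        for (LatencySubscriber s : subscribers)
            s.receiveTiming(replica, micros);
    }
}

final class DynamicSnitchSketch implements LatencySubscriber
{
    public void receiveTiming(InetAddress replica, long latencyMicros)
    {
        // fold the sample into a per-replica score used to order read requests
    }
}

Note that a scheme like this only produces samples when a network reply comes back, which is consistent with the limitation mentioned above: the local node gets no timings (the gap the later v3 patch addresses).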

> Nodes periodically spike in load
> --------------------------------
>
>                 Key: CASSANDRA-2058
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2058
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.6.10, 0.7.1
>            Reporter: David King
>            Assignee: Jonathan Ellis
>             Fix For: 0.6.11, 0.7.1
>
>         Attachments: 2058-0.7-v2.txt, 2058-0.7.txt, 2058.txt, cassandra.pmc01.log.bz2, cassandra.pmc14.log.bz2, graph a.png, graph b.png
>
>
> (Filing as a placeholder bug as I gather information.)
> At ~10p 24 Jan, I upgraded our 20-node cluster from 0.6.8->0.6.10, turned on the DES, and moved some CFs from one KS into another (drain whole cluster, take it down, move files, change schema, put it back up). Since then, I've had four storms whereby a node's load will shoot to 700+ (400% CPU on a 4-cpu machine) and become totally unresponsive. After a moment or two like that, its neighbour dies too, and the failure cascades around the ring. Unfortunately because of the high load I'm not able to get into the machine to pull a thread dump to see wtf it's doing as it happens.
> I've also had an issue where a single node spikes up to high load, but recovers. This may or may not be the same issue from which the nodes don't recover as above, but both are new behaviour

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (CASSANDRA-2058) Nodes periodically spike in load

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-2058:
--------------------------------------

    Attachment: 2058.txt

Brandon's testing has narrowed the culprit down to CASSANDRA-1959. As discussed on CASSANDRA-2054, the main problem there is the NonBlockingHashMap introduced to track timed-out latencies.

This patch reverts that and takes a different approach: tracking the latency in the callback map. That means we need a unique messageId for each target we send a message to. The Right Way to do this would be to have Message objects contain only the data to send, not the From address and not the messageId. Refactoring Message is outside our scope here, though, so instead we create a new Message for each target.

This does let us clean up the callback map in ResponseVerbHandler instead of in each Callback. (That is what is going on in the changes to QuorumResponseHandler (QRH), WriteResponseHandler (WRH), and AsyncResult (AR).)
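
As a rough illustration of the shape of that approach (hypothetical, simplified names; this is not the attached 2058.txt): each outgoing message gets its own id per target, the send time is stored alongside the callback, and the latency is computed and the entry removed in one central place when the response is handled.

import java.net.InetAddress;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical, simplified sketch of tracking latency in the callback map.
final class CallbackMapSketch
{
    interface Callback { void response(Object reply); }
    interface LatencySink { void receiveTiming(InetAddress target, long micros); }

    private static final class Entry
    {
        final Callback callback;
        final InetAddress target;
        final long sentAtNanos = System.nanoTime();
        Entry(Callback callback, InetAddress target) { this.callback = callback; this.target = target; }
    }

    private final Map<Long, Entry> callbacks = new ConcurrentHashMap<>();
    private final AtomicLong nextId = new AtomicLong();

    // One message id per target, so every reply can be matched to its own send time.
    long addCallback(Callback cb, InetAddress target)
    {
        long id = nextId.incrementAndGet();
        callbacks.put(id, new Entry(cb, target));
        return id;
    }

    // The "handle responses centrally" role: remove the entry in one place, report the
    // latency, then hand the reply to the callback -- individual callbacks no longer clean up.
    void handleResponse(long id, Object reply, LatencySink sink)
    {
        Entry e = callbacks.remove(id);
        if (e == null)
            return; // already cleaned up, e.g. the reply arrived after a timeout
        sink.receiveTiming(e.target, (System.nanoTime() - e.sentAtNanos) / 1000);
        e.callback.response(reply);
    }
}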

> Nodes periodically spike in load
> --------------------------------
>
>                 Key: CASSANDRA-2058
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2058
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.6.10
>            Reporter: David King
>         Attachments: 2058.txt, cassandra.pmc01.log.bz2, cassandra.pmc14.log.bz2, graph a.png, graph b.png
>
>
> (Filing as a placeholder bug as I gather information.)
> At ~10p 24 Jan, I upgraded our 20-node cluster from 0.6.8->0.6.10, turned on the DES, and moved some CFs from one KS into another (drain whole cluster, take it down, move files, change schema, put it back up). Since then, I've had four storms whereby a node's load will shoot to 700+ (400% CPU on a 4-cpu machine) and become totally unresponsive. After a moment or two like that, its neighbour dies too, and the failure cascades around the ring. Unfortunately because of the high load I'm not able to get into the machine to pull a thread dump to see wtf it's doing as it happens.
> I've also had an issue where a single node spikes up to high load, but recovers. This may or may not be the same issue from which the nodes don't recover as above, but both are new behaviour

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (CASSANDRA-2058) Nodes periodically spike in load

Posted by "Brandon Williams (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12987799#action_12987799 ] 

Brandon Williams commented on CASSANDRA-2058:
---------------------------------------------

+1

> Nodes periodically spike in load
> --------------------------------
>
>                 Key: CASSANDRA-2058
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2058
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.6.10, 0.7.1
>            Reporter: David King
>            Assignee: Jonathan Ellis
>             Fix For: 0.6.11, 0.7.1
>
>         Attachments: 2058-0.7-v2.txt, 2058-0.7-v3.txt, 2058-0.7.txt, 2058.txt, cassandra.pmc01.log.bz2, cassandra.pmc14.log.bz2, graph a.png, graph b.png
>
>
> (Filing as a placeholder bug as I gather information.)
> At ~10p 24 Jan, I upgraded our 20-node cluster from 0.6.8->0.6.10, turned on the DES, and moved some CFs from one KS into another (drain whole cluster, take it down, move files, change schema, put it back up). Since then, I've had four storms whereby a node's load will shoot to 700+ (400% CPU on a 4-cpu machine) and become totally unresponsive. After a moment or two like that, its neighbour dies too, and the failure cascades around the ring. Unfortunately because of the high load I'm not able to get into the machine to pull a thread dump to see wtf it's doing as it happens.
> I've also had an issue where a single node spikes up to high load, but recovers. This may or may not be the same issue from which the nodes don't recover as above, but both are new behaviour

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (CASSANDRA-2058) Nodes periodically spike in load

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-2058:
--------------------------------------

    Attachment: 2058-0.7.txt

port to 0.7 attached.

> Nodes periodically spike in load
> --------------------------------
>
>                 Key: CASSANDRA-2058
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2058
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.6.10
>            Reporter: David King
>            Assignee: Jonathan Ellis
>         Attachments: 2058-0.7.txt, 2058.txt, cassandra.pmc01.log.bz2, cassandra.pmc14.log.bz2, graph a.png, graph b.png
>
>
> (Filing as a placeholder bug as I gather information.)
> At ~10p 24 Jan, I upgraded our 20-node cluster from 0.6.8->0.6.10, turned on the DES, and moved some CFs from one KS into another (drain whole cluster, take it down, move files, change schema, put it back up). Since then, I've had four storms whereby a node's load will shoot to 700+ (400% CPU on a 4-cpu machine) and become totally unresponsive. After a moment or two like that, its neighbour dies too, and the failure cascades around the ring. Unfortunately because of the high load I'm not able to get into the machine to pull a thread dump to see wtf it's doing as it happens.
> I've also had an issue where a single node spikes up to high load, but recovers. This may or may not be the same issue from which the nodes don't recover as above, but both are new behaviour

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (CASSANDRA-2058) Nodes periodically spike in load

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12988912#comment-12988912 ] 

Jonathan Ellis commented on CASSANDRA-2058:
-------------------------------------------

Please tell me you're at least seeing this less often than with .10 :)

> Nodes periodically spike in load
> --------------------------------
>
>                 Key: CASSANDRA-2058
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2058
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.6.10, 0.7.1
>            Reporter: David King
>            Assignee: Jonathan Ellis
>             Fix For: 0.6.11, 0.7.1
>
>         Attachments: 2058-0.7-v2.txt, 2058-0.7-v3.txt, 2058-0.7.txt, 2058.txt, cassandra.pmc01.log.bz2, cassandra.pmc14.log.bz2, graph a.png, graph b.png
>
>
> (Filing as a placeholder bug as I gather information.)
> At ~10p 24 Jan, I upgraded our 20-node cluster from 0.6.8->0.6.10, turned on the DES, and moved some CFs from one KS into another (drain whole cluster, take it down, move files, change schema, put it back up). Since then, I've had four storms whereby a node's load will shoot to 700+ (400% CPU on a 4-cpu machine) and become totally unresponsive. After a moment or two like that, its neighbour dies too, and the failure cascades around the ring. Unfortunately because of the high load I'm not able to get into the machine to pull a thread dump to see wtf it's doing as it happens.
> I've also had an issue where a single node spikes up to high load, but recovers. This may or may not be the same issue from which the nodes don't recover as above, but both are new behaviour

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Commented: (CASSANDRA-2058) Nodes periodically spike in load

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12987838#action_12987838 ] 

Jonathan Ellis commented on CASSANDRA-2058:
-------------------------------------------

bq. I think using MapMaker directly and getting rid of ExpiringMap would probably be best

Agreed; opened CASSANDRA-2070 for that.
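
For context, a minimal sketch of the direction being agreed to here, written against Guava's CacheBuilder (the later successor to the MapMaker expiration and eviction-listener support that Jake's comment links to); the wrapper class and key/value types are placeholders, not anything proposed on this ticket.

import java.util.concurrent.TimeUnit;
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.RemovalListener;

// Minimal sketch: let the library expire entries and fire a listener on removal,
// instead of maintaining a hand-rolled ExpiringMap.
final class ExpiringCallbacks<K, V>
{
    private final Cache<K, V> callbacks;

    ExpiringCallbacks(long timeoutMillis, RemovalListener<K, V> onRemoval)
    {
        this.callbacks = CacheBuilder.newBuilder()
                .expireAfterWrite(timeoutMillis, TimeUnit.MILLISECONDS)
                .removalListener(onRemoval) // fired for expiry as well as explicit removal
                .build();
    }

    void put(K id, V callback)
    {
        callbacks.put(id, callback);
    }

    V remove(K id)
    {
        V v = callbacks.getIfPresent(id);
        callbacks.invalidate(id);
        return v;
    }
}

One caveat worth keeping in mind with this style: Guava performs expiry lazily during cache accesses rather than on a background timer, so removal notifications can be delayed while the cache is idle.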

> Nodes periodically spike in load
> --------------------------------
>
>                 Key: CASSANDRA-2058
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2058
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.6.10, 0.7.1
>            Reporter: David King
>            Assignee: Jonathan Ellis
>             Fix For: 0.6.11, 0.7.1
>
>         Attachments: 2058-0.7-v2.txt, 2058-0.7-v3.txt, 2058-0.7.txt, 2058.txt, cassandra.pmc01.log.bz2, cassandra.pmc14.log.bz2, graph a.png, graph b.png
>
>
> (Filing as a placeholder bug as I gather information.)
> At ~10p 24 Jan, I upgraded our 20-node cluster from 0.6.8->0.6.10, turned on the DES, and moved some CFs from one KS into another (drain whole cluster, take it down, move files, change schema, put it back up). Since then, I've had four storms whereby a node's load will shoot to 700+ (400% CPU on a 4-cpu machine) and become totally unresponsive. After a moment or two like that, its neighbour dies too, and the failure cascades around the ring. Unfortunately because of the high load I'm not able to get into the machine to pull a thread dump to see wtf it's doing as it happens.
> I've also had an issue where a single node spikes up to high load, but recovers. This may or may not be the same issue from which the nodes don't recover as above, but both are new behaviour

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Assigned: (CASSANDRA-2058) Nodes periodically spike in load

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis reassigned CASSANDRA-2058:
-----------------------------------------

    Assignee: Jonathan Ellis

> Nodes periodically spike in load
> --------------------------------
>
>                 Key: CASSANDRA-2058
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2058
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.6.10
>            Reporter: David King
>            Assignee: Jonathan Ellis
>         Attachments: 2058.txt, cassandra.pmc01.log.bz2, cassandra.pmc14.log.bz2, graph a.png, graph b.png
>
>
> (Filing as a placeholder bug as I gather information.)
> At ~10p 24 Jan, I upgraded our 20-node cluster from 0.6.8->0.6.10, turned on the DES, and moved some CFs from one KS into another (drain whole cluster, take it down, move files, change schema, put it back up). Since then, I've had four storms whereby a node's load will shoot to 700+ (400% CPU on a 4-cpu machine) and become totally unresponsive. After a moment or two like that, its neighbour dies too, and the failure cascades around the ring. Unfortunately because of the high load I'm not able to get into the machine to pull a thread dump to see wtf it's doing as it happens.
> I've also had an issue where a single node spikes up to high load, but recovers. This may or may not be the same issue from which the nodes don't recover as above, but both are new behaviour

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (CASSANDRA-2058) Nodes periodically spike in load

Posted by "David King (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12986807#action_12986807 ] 

David King commented on CASSANDRA-2058:
---------------------------------------

Also, since then I've had notably worse performance; reads are maybe 30% slower than before.

My next step will be to hope that the jstacks in that log are the same as the ones causing the largest outages, and to disable the dynamic snitch (as much as I'd like to get 100% reproduction, I'd also rather not take my site down) to see if that resolves the problem. If it doesn't, then I'll turn it back on and revert to 0.6.8 to see if that does it.

> Nodes periodically spike in load
> --------------------------------
>
>                 Key: CASSANDRA-2058
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2058
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.6.10
>            Reporter: David King
>         Attachments: cassandra.pmc01.log.bz2
>
>
> (Filing as a placeholder bug as I gather information.)
> At ~10p 24 Jan, I upgraded our 20-node cluster from 0.6.8->0.6.10, turned on the DES, and moved some CFs from one KS into another (drain whole cluster, take it down, move files, change schema, put it back up). Since then, I've had four storms whereby a node's load will shoot to 700+ (400% CPU on a 4-cpu machine) and become totally unresponsive. After a moment or two like that, its neighbour dies too, and the failure cascades around the ring. Unfortunately because of the high load I'm not able to get into the machine to pull a thread dump to see wtf it's doing as it happens.
> I've also had an issue where a single node spikes up to high load, but recovers. This may or may not be the same issue from which the nodes don't recover as above, but both are new behaviour

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (CASSANDRA-2058) Nodes periodically spike in load

Posted by "Thibaut (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990493#comment-12990493 ] 

Thibaut commented on CASSANDRA-2058:
------------------------------------

I'm also seeing something similar on yesterday's svn version (the one with the Consistency level fix).

It only occurs if I enable JNA.

Nodes experience enormously high kernel load (htop, red bar). SSH sessions on those servers lag extremely. The nodes don't hit 100% CPU, but the cluster is unusable.




> Nodes periodically spike in load
> --------------------------------
>
>                 Key: CASSANDRA-2058
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2058
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.6.10, 0.7.1
>         Environment: OpenJDK 64-Bit Server VM (build 1.6.0_0-b12, mixed mode)
> Ubuntu 8.10
> Linux pmc01 2.6.27-22-xen #1 SMP Fri Feb 20 23:58:13 UTC 2009 x86_64 GNU/Linux
>            Reporter: David King
>            Assignee: Jonathan Ellis
>             Fix For: 0.6.11, 0.7.1
>
>         Attachments: 2058-0.7-v2.txt, 2058-0.7-v3.txt, 2058-0.7.txt, 2058.txt, cassandra.pmc01.log.bz2, cassandra.pmc14.log.bz2, graph a.png, graph b.png
>
>
> (Filing as a placeholder bug as I gather information.)
> At ~10p 24 Jan, I upgraded our 20-node cluster from 0.6.8->0.6.10, turned on the DES, and moved some CFs from one KS into another (drain whole cluster, take it down, move files, change schema, put it back up). Since then, I've had four storms whereby a node's load will shoot to 700+ (400% CPU on a 4-cpu machine) and become totally unresponsive. After a moment or two like that, its neighbour dies too, and the failure cascades around the ring. Unfortunately because of the high load I'm not able to get into the machine to pull a thread dump to see wtf it's doing as it happens.
> I've also had an issue where a single node spikes up to high load, but recovers. This may or may not be the same issue from which the nodes don't recover as above, but both are new behaviour

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Updated: (CASSANDRA-2058) Nodes periodically spike in load

Posted by "David King (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David King updated CASSANDRA-2058:
----------------------------------

    Environment: 
OpenJDK 64-Bit Server VM (build 1.6.0_0-b12, mixed mode)
Ubuntu 8.10
Linux pmc01 2.6.27-22-xen #1 SMP Fri Feb 20 23:58:13 UTC 2009 x86_64 GNU/Linux

> Nodes periodically spike in load
> --------------------------------
>
>                 Key: CASSANDRA-2058
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2058
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.6.10, 0.7.1
>         Environment: OpenJDK 64-Bit Server VM (build 1.6.0_0-b12, mixed mode)
> Ubuntu 8.10
> Linux pmc01 2.6.27-22-xen #1 SMP Fri Feb 20 23:58:13 UTC 2009 x86_64 GNU/Linux
>            Reporter: David King
>            Assignee: Jonathan Ellis
>             Fix For: 0.6.11, 0.7.1
>
>         Attachments: 2058-0.7-v2.txt, 2058-0.7-v3.txt, 2058-0.7.txt, 2058.txt, cassandra.pmc01.log.bz2, cassandra.pmc14.log.bz2, graph a.png, graph b.png
>
>
> (Filing as a placeholder bug as I gather information.)
> At ~10p 24 Jan, I upgraded our 20-node cluster from 0.6.8->0.6.10, turned on the DES, and moved some CFs from one KS into another (drain whole cluster, take it down, move files, change schema, put it back up). Since then, I've had four storms whereby a node's load will shoot to 700+ (400% CPU on a 4-cpu machine) and become totally unresponsive. After a moment or two like that, its neighbour dies too, and the failure cascades around the ring. Unfortunately because of the high load I'm not able to get into the machine to pull a thread dump to see wtf it's doing as it happens.
> I've also had an issue where a single node spikes up to high load, but recovers. This may or may not be the same issue from which the nodes don't recover as above, but both are new behaviour

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Commented: (CASSANDRA-2058) Nodes periodically spike in load

Posted by "Brandon Williams (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12987642#action_12987642 ] 

Brandon Williams commented on CASSANDRA-2058:
---------------------------------------------

The 0.6 version looks good: read repair (RR), hinted handoff (HH), and the DES all work, and there are no more CPU spikes under heavy load.

> Nodes periodically spike in load
> --------------------------------
>
>                 Key: CASSANDRA-2058
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2058
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.6.10
>            Reporter: David King
>            Assignee: Jonathan Ellis
>         Attachments: 2058-0.7.txt, 2058.txt, cassandra.pmc01.log.bz2, cassandra.pmc14.log.bz2, graph a.png, graph b.png
>
>
> (Filing as a placeholder bug as I gather information.)
> At ~10p 24 Jan, I upgraded our 20-node cluster from 0.6.8->0.6.10, turned on the DES, and moved some CFs from one KS into another (drain whole cluster, take it down, move files, change schema, put it back up). Since then, I've had four storms whereby a node's load will shoot to 700+ (400% CPU on a 4-cpu machine) and become totally unresponsive. After a moment or two like that, its neighbour dies too, and the failure cascades around the ring. Unfortunately because of the high load I'm not able to get into the machine to pull a thread dump to see wtf it's doing as it happens.
> I've also had an issue where a single node spikes up to high load, but recovers. This may or may not be the same issue from which the nodes don't recover as above, but both are new behaviour

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (CASSANDRA-2058) Nodes periodically spike in load

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12987019#action_12987019 ] 

Jonathan Ellis commented on CASSANDRA-2058:
-------------------------------------------

DES in 0.6.8 is a no-op unless you're doing quorum reads.

> Nodes periodically spike in load
> --------------------------------
>
>                 Key: CASSANDRA-2058
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2058
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.6.10
>            Reporter: David King
>         Attachments: cassandra.pmc01.log.bz2, cassandra.pmc14.log.bz2, graph a.png, graph b.png
>
>
> (Filing as a placeholder bug as I gather information.)
> At ~10p 24 Jan, I upgraded our 20-node cluster from 0.6.8->0.6.10, turned on the DES, and moved some CFs from one KS into another (drain whole cluster, take it down, move files, change schema, put it back up). Since then, I've had four storms whereby a node's load will shoot to 700+ (400% CPU on a 4-cpu machine) and become totally unresponsive. After a moment or two like that, its neighbour dies too, and the failure cascades around the ring. Unfortunately because of the high load I'm not able to get into the machine to pull a thread dump to see wtf it's doing as it happens.
> I've also had an issue where a single node spikes up to high load, but recovers. This may or may not be the same issue from which the nodes don't recover as above, but both are new behaviour

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Issue Comment Edited: (CASSANDRA-2058) Nodes periodically spike in load

Posted by "T Jake Luciani (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12987366#action_12987366 ] 

T Jake Luciani edited comment on CASSANDRA-2058 at 1/26/11 10:50 PM:
---------------------------------------------------------------------

This looks good overall, nothing major I can see.

The only niggles are:
 
1. ExpiringMap: we could do the same with MapMaker, and it may be more bulletproof; see EvictionListener, http://guava-libraries.googlecode.com/svn/trunk/javadoc/com/google/common/collect/MapMaker.html

2. I also wonder what impact (if any) there will be from generating a message per endpoint rather than re-using the same one as was previously done.

But as-is it's still +1.

      was (Author: tjake):
    This looks good overall, nothing major I can see.

The only niggles are:
 
1. the ExpiringMap we could do the same with MapMaker and may be more bulletproof. see EvictionListener http://guava-libraries.googlecode.com/svn/trunk/javadoc/com/google/common/collect/MapMaker.html

2. I also wonder what impact (if any) there will be for generating a message per endpoint rather than re-using the same one as was perviously done.

But as-is it's still +1
  
> Nodes periodically spike in load
> --------------------------------
>
>                 Key: CASSANDRA-2058
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2058
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.6.10
>            Reporter: David King
>            Assignee: Jonathan Ellis
>         Attachments: 2058.txt, cassandra.pmc01.log.bz2, cassandra.pmc14.log.bz2, graph a.png, graph b.png
>
>
> (Filing as a placeholder bug as I gather information.)
> At ~10p 24 Jan, I upgraded our 20-node cluster from 0.6.8->0.6.10, turned on the DES, and moved some CFs from one KS into another (drain whole cluster, take it down, move files, change schema, put it back up). Since then, I've had four storms whereby a node's load will shoot to 700+ (400% CPU on a 4-cpu machine) and become totally unresponsive. After a moment or two like that, its neighbour dies too, and the failure cascades around the ring. Unfortunately because of the high load I'm not able to get into the machine to pull a thread dump to see wtf it's doing as it happens.
> I've also had an issue where a single node spikes up to high load, but recovers. This may or may not be the same issue from which the nodes don't recover as above, but both are new behaviour

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (CASSANDRA-2058) Load spikes due to MessagingService-generated garbage collection

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-2058:
--------------------------------------

    Comment: was deleted

(was: AFAIK nobody has seen this on 0.7.1.)

> Load spikes due to MessagingService-generated garbage collection
> ----------------------------------------------------------------
>
>                 Key: CASSANDRA-2058
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2058
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.6.10, 0.7.0
>         Environment: OpenJDK 64-Bit Server VM (build 1.6.0_0-b12, mixed mode)
> Ubuntu 8.10
> Linux pmc01 2.6.27-22-xen #1 SMP Fri Feb 20 23:58:13 UTC 2009 x86_64 GNU/Linux
>            Reporter: David King
>            Assignee: Jonathan Ellis
>             Fix For: 0.6.11, 0.7.1
>
>         Attachments: 2058-0.7-v2.txt, 2058-0.7-v3.txt, 2058-0.7.txt, 2058.txt, cassandra.pmc01.log.bz2, cassandra.pmc14.log.bz2, graph a.png, graph b.png
>
>
> (Filing as a placeholder bug as I gather information.)
> At ~10p 24 Jan, I upgraded our 20-node cluster from 0.6.8->0.6.10, turned on the DES, and moved some CFs from one KS into another (drain whole cluster, take it down, move files, change schema, put it back up). Since then, I've had four storms whereby a node's load will shoot to 700+ (400% CPU on a 4-cpu machine) and become totally unresponsive. After a moment or two like that, its neighbour dies too, and the failure cascades around the ring. Unfortunately because of the high load I'm not able to get into the machine to pull a thread dump to see wtf it's doing as it happens.
> I've also had an issue where a single node spikes up to high load, but recovers. This may or may not be the same issue from which the nodes don't recover as above, but both are new behaviour

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Commented: (CASSANDRA-2058) Nodes periodically spike in load

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12987369#action_12987369 ] 

Jonathan Ellis commented on CASSANDRA-2058:
-------------------------------------------

Thanks, Jake.

1. Agreed, I'd like to upgrade at some point, but changing stuff I don't have to scares me at this point in 0.6.

2. We definitely saw a small speedup when I made that optimization the first time, but I'd rather have a working dynamic snitch. (We can optimize later in 0.7 -- see The Right Way above.) Combined with the improved TCP performance in 0.6.10, we should still be ahead of 0.6.8, a.k.a. the last version that didn't have MessagingService bugs.

> Nodes periodically spike in load
> --------------------------------
>
>                 Key: CASSANDRA-2058
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2058
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.6.10
>            Reporter: David King
>            Assignee: Jonathan Ellis
>         Attachments: 2058.txt, cassandra.pmc01.log.bz2, cassandra.pmc14.log.bz2, graph a.png, graph b.png
>
>
> (Filing as a placeholder bug as I gather information.)
> At ~10p 24 Jan, I upgraded our 20-node cluster from 0.6.8->0.6.10, turned on the DES, and moved some CFs from one KS into another (drain whole cluster, take it down, move files, change schema, put it back up). Since then, I've had four storms whereby a node's load will shoot to 700+ (400% CPU on a 4-cpu machine) and become totally unresponsive. After a moment or two like that, its neighbour dies too, and the failure cascades around the ring. Unfortunately because of the high load I'm not able to get into the machine to pull a thread dump to see wtf it's doing as it happens.
> I've also had an issue where a single node spikes up to high load, but recovers. This may or may not be the same issue from which the nodes don't recover as above, but both are new behaviour

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (CASSANDRA-2058) Nodes periodically spike in load

Posted by "Paul Querna (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12989261#comment-12989261 ] 

Paul Querna commented on CASSANDRA-2058:
----------------------------------------

FYI, since upgrading to .10 we are also seeing this problem :( I tried getting a jstack, but it didn't work; tpstats etc. all timed out.

> Nodes periodically spike in load
> --------------------------------
>
>                 Key: CASSANDRA-2058
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2058
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.6.10, 0.7.1
>         Environment: OpenJDK 64-Bit Server VM (build 1.6.0_0-b12, mixed mode)
> Ubuntu 8.10
> Linux pmc01 2.6.27-22-xen #1 SMP Fri Feb 20 23:58:13 UTC 2009 x86_64 GNU/Linux
>            Reporter: David King
>            Assignee: Jonathan Ellis
>             Fix For: 0.6.11, 0.7.1
>
>         Attachments: 2058-0.7-v2.txt, 2058-0.7-v3.txt, 2058-0.7.txt, 2058.txt, cassandra.pmc01.log.bz2, cassandra.pmc14.log.bz2, graph a.png, graph b.png
>
>
> (Filing as a placeholder bug as I gather information.)
> At ~10p 24 Jan, I upgraded our 20-node cluster from 0.6.8->0.6.10, turned on the DES, and moved some CFs from one KS into another (drain whole cluster, take it down, move files, change schema, put it back up). Since then, I've had four storms whereby a node's load will shoot to 700+ (400% CPU on a 4-cpu machine) and become totally unresponsive. After a moment or two like that, its neighbour dies too, and the failure cascades around the ring. Unfortunately because of the high load I'm not able to get into the machine to pull a thread dump to see wtf it's doing as it happens.
> I've also had an issue where a single node spikes up to high load, but recovers. This may or may not be the same issue from which the nodes don't recover as above, but both are new behaviour

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Resolved: (CASSANDRA-2058) Load spikes due to MessagingService-generated garbage collection

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis resolved CASSANDRA-2058.
---------------------------------------

    Resolution: Fixed

closing this so it's clear that the excessive object creation problem introduced in CASSANDRA-1905 is fixed in 0.6.11 / 0.7.1.

opened CASSANDRA-2170 for other load spikes.

> Load spikes due to MessagingService-generated garbage collection
> ----------------------------------------------------------------
>
>                 Key: CASSANDRA-2058
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2058
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.6.10, 0.7.0
>         Environment: OpenJDK 64-Bit Server VM (build 1.6.0_0-b12, mixed mode)
> Ubuntu 8.10
> Linux pmc01 2.6.27-22-xen #1 SMP Fri Feb 20 23:58:13 UTC 2009 x86_64 GNU/Linux
>            Reporter: David King
>            Assignee: Jonathan Ellis
>             Fix For: 0.7.1, 0.6.11
>
>         Attachments: 2058-0.7-v2.txt, 2058-0.7-v3.txt, 2058-0.7.txt, 2058.txt, cassandra.pmc01.log.bz2, cassandra.pmc14.log.bz2, graph a.png, graph b.png
>
>
> (Filing as a placeholder bug as I gather information.)
> At ~10p 24 Jan, I upgraded our 20-node cluster from 0.6.8->0.6.10, turned on the DES, and moved some CFs from one KS into another (drain whole cluster, take it down, move files, change schema, put it back up). Since then, I've had four storms whereby a node's load will shoot to 700+ (400% CPU on a 4-cpu machine) and become totally unresponsive. After a moment or two like that, its neighbour dies too, and the failure cascades around the ring. Unfortunately because of the high load I'm not able to get into the machine to pull a thread dump to see wtf it's doing as it happens.
> I've also had an issue where a single node spikes up to high load, but recovers. This may or may not be the same issue from which the nodes don't recover as above, but both are new behaviour

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Resolved: (CASSANDRA-2058) Nodes periodically spike in load

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis resolved CASSANDRA-2058.
---------------------------------------

    Resolution: Fixed
      Reviewer: brandon.williams

committed to 0.7 and trunk

> Nodes periodically spike in load
> --------------------------------
>
>                 Key: CASSANDRA-2058
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2058
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.6.10, 0.7.1
>            Reporter: David King
>            Assignee: Jonathan Ellis
>             Fix For: 0.6.11, 0.7.1
>
>         Attachments: 2058-0.7-v2.txt, 2058-0.7-v3.txt, 2058-0.7.txt, 2058.txt, cassandra.pmc01.log.bz2, cassandra.pmc14.log.bz2, graph a.png, graph b.png
>
>
> (Filing as a placeholder bug as I gather information.)
> At ~10p 24 Jan, I upgraded our 20-node cluster from 0.6.8->0.6.10, turned on the DES, and moved some CFs from one KS into another (drain whole cluster, take it down, move files, change schema, put it back up). Since then, I've had four storms whereby a node's load will shoot to 700+ (400% CPU on a 4-cpu machine) and become totally unresponsive. After a moment or two like that, its neighbour dies too, and the failure cascades around the ring. Unfortunately because of the high load I'm not able to get into the machine to pull a thread dump to see wtf it's doing as it happens.
> I've also had an issue where a single node spikes up to high load, but recovers. This may or may not be the same issue from which the nodes don't recover as above, but both are new behaviour

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (CASSANDRA-2058) Nodes periodically spike in load

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-2058:
--------------------------------------

    Affects Version/s:     (was: 0.7.1)
                       0.7.0
        Fix Version/s:     (was: 0.7.2)
                       0.7.1

> Nodes periodically spike in load
> --------------------------------
>
>                 Key: CASSANDRA-2058
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2058
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.6.10, 0.7.0
>         Environment: OpenJDK 64-Bit Server VM (build 1.6.0_0-b12, mixed mode)
> Ubuntu 8.10
> Linux pmc01 2.6.27-22-xen #1 SMP Fri Feb 20 23:58:13 UTC 2009 x86_64 GNU/Linux
>            Reporter: David King
>            Assignee: Jonathan Ellis
>             Fix For: 0.6.11, 0.7.1
>
>         Attachments: 2058-0.7-v2.txt, 2058-0.7-v3.txt, 2058-0.7.txt, 2058.txt, cassandra.pmc01.log.bz2, cassandra.pmc14.log.bz2, graph a.png, graph b.png
>
>
> (Filing as a placeholder bug as I gather information.)
> At ~10p 24 Jan, I upgraded our 20-node cluster from 0.6.8->0.6.10, turned on the DES, and moved some CFs from one KS into another (drain whole cluster, take it down, move files, change schema, put it back up). Since then, I've had four storms whereby a node's load will shoot to 700+ (400% CPU on a 4-cpu machine) and become totally unresponsive. After a moment or two like that, its neighbour dies too, and the failure cascades around the ring. Unfortunately because of the high load I'm not able to get into the machine to pull a thread dump to see wtf it's doing as it happens.
> I've also had an issue where a single node spikes up to high load, but recovers. This may or may not be the same issue from which the nodes don't recover as above, but both are new behaviour

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Commented: (CASSANDRA-2058) Nodes periodically spike in load

Posted by "Brandon Williams (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12989297#comment-12989297 ] 

Brandon Williams commented on CASSANDRA-2058:
---------------------------------------------

Paul, I would expect to see it on .10 (I can repro there) but that is what this ticket was supposed to address.  Can you repro with .11?

> Nodes periodically spike in load
> --------------------------------
>
>                 Key: CASSANDRA-2058
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2058
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.6.10, 0.7.1
>         Environment: OpenJDK 64-Bit Server VM (build 1.6.0_0-b12, mixed mode)
> Ubuntu 8.10
> Linux pmc01 2.6.27-22-xen #1 SMP Fri Feb 20 23:58:13 UTC 2009 x86_64 GNU/Linux
>            Reporter: David King
>            Assignee: Jonathan Ellis
>             Fix For: 0.6.11, 0.7.1
>
>         Attachments: 2058-0.7-v2.txt, 2058-0.7-v3.txt, 2058-0.7.txt, 2058.txt, cassandra.pmc01.log.bz2, cassandra.pmc14.log.bz2, graph a.png, graph b.png
>
>
> (Filing as a placeholder bug as I gather information.)
> At ~10p 24 Jan, I upgraded our 20-node cluster from 0.6.8->0.6.10, turned on the DES, and moved some CFs from one KS into another (drain whole cluster, take it down, move files, change schema, put it back up). Since then, I've had four storms whereby a node's load will shoot to 700+ (400% CPU on a 4-cpu machine) and become totally unresponsive. After a moment or two like that, its neighbour dies too, and the failure cascades around the ring. Unfortunately because of the high load I'm not able to get into the machine to pull a thread dump to see wtf it's doing as it happens.
> I've also had an issue where a single node spikes up to high load, but recovers. This may or may not be the same issue from which the nodes don't recover as above, but both are new behaviour

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Commented: (CASSANDRA-2058) Nodes periodically spike in load

Posted by "Eric Evans (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12989343#comment-12989343 ] 

Eric Evans commented on CASSANDRA-2058:
---------------------------------------

They're seeing it on r1064246 (one rev newer than 0.6.11).

> Nodes periodically spike in load
> --------------------------------
>
>                 Key: CASSANDRA-2058
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2058
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.6.10, 0.7.1
>         Environment: OpenJDK 64-Bit Server VM (build 1.6.0_0-b12, mixed mode)
> Ubuntu 8.10
> Linux pmc01 2.6.27-22-xen #1 SMP Fri Feb 20 23:58:13 UTC 2009 x86_64 GNU/Linux
>            Reporter: David King
>            Assignee: Jonathan Ellis
>             Fix For: 0.6.11, 0.7.1
>
>         Attachments: 2058-0.7-v2.txt, 2058-0.7-v3.txt, 2058-0.7.txt, 2058.txt, cassandra.pmc01.log.bz2, cassandra.pmc14.log.bz2, graph a.png, graph b.png
>
>
> (Filing as a placeholder bug as I gather information.)
> At ~10p 24 Jan, I upgraded our 20-node cluster from 0.6.8->0.6.10, turned on the DES, and moved some CFs from one KS into another (drain whole cluster, take it down, move files, change schema, put it back up). Since then, I've had four storms whereby a node's load will shoot to 700+ (400% CPU on a 4-cpu machine) and become totally unresponsive. After a moment or two like that, its neighbour dies too, and the failure cascades around the ring. Unfortunately because of the high load I'm not able to get into the machine to pull a thread dump to see wtf it's doing as it happens.
> I've also had an issue where a single node spikes up to high load, but recovers. This may or may not be the same issue from which the nodes don't recover as above, but both are new behaviour

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Commented: (CASSANDRA-2058) Nodes periodically spike in load

Posted by "David King (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12988890#comment-12988890 ] 

David King commented on CASSANDRA-2058:
---------------------------------------

I have upgraded to 0.6.11 and am definitely still seeing this problem (although I'm no longer seeing the 30% performance hit while the nodes are up)

> Nodes periodically spike in load
> --------------------------------
>
>                 Key: CASSANDRA-2058
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2058
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.6.10, 0.7.1
>            Reporter: David King
>            Assignee: Jonathan Ellis
>             Fix For: 0.6.11, 0.7.1
>
>         Attachments: 2058-0.7-v2.txt, 2058-0.7-v3.txt, 2058-0.7.txt, 2058.txt, cassandra.pmc01.log.bz2, cassandra.pmc14.log.bz2, graph a.png, graph b.png
>
>
> (Filing as a placeholder bug as I gather information.)
> At ~10p 24 Jan, I upgraded our 20-node cluster from 0.6.8->0.6.10, turned on the DES, and moved some CFs from one KS into another (drain whole cluster, take it down, move files, change schema, put it back up). Since then, I've had four storms whereby a node's load will shoot to 700+ (400% CPU on a 4-cpu machine) and become totally unresponsive. After a moment or two like that, its neighbour dies too, and the failure cascades around the ring. Unfortunately because of the high load I'm not able to get into the machine to pull a thread dump to see wtf it's doing as it happens.
> I've also had an issue where a single node spikes up to high load, but recovers. This may or may not be the same issue from which the nodes don't recover as above, but both are new behaviour

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Commented: (CASSANDRA-2058) Nodes periodically spike in load

Posted by "T Jake Luciani (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12987366#action_12987366 ] 

T Jake Luciani commented on CASSANDRA-2058:
-------------------------------------------

This looks good overall, nothing major I can see.

The only niggles are:
 
1. For the ExpiringMap, we could do the same thing with MapMaker, which may be more bulletproof; see its EvictionListener support (a rough sketch follows this comment): http://guava-libraries.googlecode.com/svn/trunk/javadoc/com/google/common/collect/MapMaker.html

2. I also wonder what impact (if any) there will be from generating a message per endpoint rather than re-using the same one as was previously done.

But as-is it's still +1
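
For context on point 1, here is a minimal sketch (not the committed patch) of what an ExpiringMap-style callback map built directly on MapMaker could look like. The Callback interface, the 30-second timeout, and the method names expiration()/evictionListener() (later renamed expireAfterWrite()/removalListener() in newer Guava releases) are assumptions about the Guava API of that era, not anything taken from the attached patches.

import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.TimeUnit;

import com.google.common.collect.MapEvictionListener;
import com.google.common.collect.MapMaker;

public class CallbackMapSketch
{
    // hypothetical callback type; the real map would hold whatever MessagingService registers
    public interface Callback
    {
        void onExpired(String messageId);
    }

    private final ConcurrentMap<String, Callback> callbacks = new MapMaker()
            .expiration(30, TimeUnit.SECONDS) // drop entries that never see a reply
            .evictionListener(new MapEvictionListener<String, Callback>()
            {
                public void onEviction(String messageId, Callback callback)
                {
                    // invoked when an entry expires or is otherwise evicted
                    if (callback != null)
                        callback.onExpired(messageId);
                }
            })
            .makeMap();

    public void register(String messageId, Callback callback)
    {
        callbacks.put(messageId, callback);
    }

    public Callback deregister(String messageId)
    {
        return callbacks.remove(messageId);
    }
}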

> Nodes periodically spike in load
> --------------------------------
>
>                 Key: CASSANDRA-2058
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2058
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.6.10
>            Reporter: David King
>            Assignee: Jonathan Ellis
>         Attachments: 2058.txt, cassandra.pmc01.log.bz2, cassandra.pmc14.log.bz2, graph a.png, graph b.png
>
>
> (Filing as a placeholder bug as I gather information.)
> At ~10p 24 Jan, I upgraded our 20-node cluster from 0.6.8->0.6.10, turned on the DES, and moved some CFs from one KS into another (drain whole cluster, take it down, move files, change schema, put it back up). Since then, I've had four storms whereby a node's load will shoot to 700+ (400% CPU on a 4-cpu machine) and become totally unresponsive. After a moment or two like that, its neighbour dies too, and the failure cascades around the ring. Unfortunately because of the high load I'm not able to get into the machine to pull a thread dump to see wtf it's doing as it happens.
> I've also had an issue where a single node spikes up to high load, but recovers. This may or may not be the same issue from which the nodes don't recover as above, but both are new behaviour

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (CASSANDRA-2058) Nodes periodically spike in load

Posted by "Brandon Williams (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Brandon Williams updated CASSANDRA-2058:
----------------------------------------

    Fix Version/s:     (was: 0.7.1)
                   0.7.2

> Nodes periodically spike in load
> --------------------------------
>
>                 Key: CASSANDRA-2058
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2058
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.6.10, 0.7.1
>         Environment: OpenJDK 64-Bit Server VM (build 1.6.0_0-b12, mixed mode)
> Ubuntu 8.10
> Linux pmc01 2.6.27-22-xen #1 SMP Fri Feb 20 23:58:13 UTC 2009 x86_64 GNU/Linux
>            Reporter: David King
>            Assignee: Jonathan Ellis
>             Fix For: 0.6.11, 0.7.2
>
>         Attachments: 2058-0.7-v2.txt, 2058-0.7-v3.txt, 2058-0.7.txt, 2058.txt, cassandra.pmc01.log.bz2, cassandra.pmc14.log.bz2, graph a.png, graph b.png
>
>
> (Filing as a placeholder bug as I gather information.)
> At ~10p 24 Jan, I upgraded our 20-node cluster from 0.6.8->0.6.10, turned on the DES, and moved some CFs from one KS into another (drain whole cluster, take it down, move files, change schema, put it back up). Since then, I've had four storms whereby a node's load will shoot to 700+ (400% CPU on a 4-cpu machine) and become totally unresponsive. After a moment or two like that, its neighbour dies too, and the failure cascades around the ring. Unfortunately because of the high load I'm not able to get into the machine to pull a thread dump to see wtf it's doing as it happens.
> I've also had an issue where a single node spikes up to high load, but recovers. This may or may not be the same issue from which the nodes don't recover as above, but both are new behaviour

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Issue Comment Edited: (CASSANDRA-2058) Nodes periodically spike in load

Posted by "Thibaut (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990493#comment-12990493 ] 

Thibaut edited comment on CASSANDRA-2058 at 2/4/11 9:28 AM:
------------------------------------------------------------

I'm also seeing something similar on yesterday's svn version (the one with the Consistency level fix).

It only occurs if I enable JNA.

Nodes will experience enormously high kernel load (htop, red bar). SSH sessions on these servers will lag extremely. Nodes won't hit 100% CPU, but the cluster is unusable.

(Just to note: it's a completely different pattern from the 100% CPU spike which occurred before, and I can't reproduce it without JNA enabled.)


      was (Author: tbritz):
    I'm also seeing something similar on yesterday's svn version (the one with the Consistency level fix).

It only occurs if I enable JNA.

Nodes will experience enormously high kernel load (htop, red bar). SSH sessions on these servers will lag extremely. Nodes won't hit 100% CPU, but the cluster is unusable.



  
> Nodes periodically spike in load
> --------------------------------
>
>                 Key: CASSANDRA-2058
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2058
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.6.10, 0.7.1
>         Environment: OpenJDK 64-Bit Server VM (build 1.6.0_0-b12, mixed mode)
> Ubuntu 8.10
> Linux pmc01 2.6.27-22-xen #1 SMP Fri Feb 20 23:58:13 UTC 2009 x86_64 GNU/Linux
>            Reporter: David King
>            Assignee: Jonathan Ellis
>             Fix For: 0.6.11, 0.7.1
>
>         Attachments: 2058-0.7-v2.txt, 2058-0.7-v3.txt, 2058-0.7.txt, 2058.txt, cassandra.pmc01.log.bz2, cassandra.pmc14.log.bz2, graph a.png, graph b.png
>
>
> (Filing as a placeholder bug as I gather information.)
> At ~10p 24 Jan, I upgraded our 20-node cluster from 0.6.8->0.6.10, turned on the DES, and moved some CFs from one KS into another (drain whole cluster, take it down, move files, change schema, put it back up). Since then, I've had four storms whereby a node's load will shoot to 700+ (400% CPU on a 4-cpu machine) and become totally unresponsive. After a moment or two like that, its neighbour dies too, and the failure cascades around the ring. Unfortunately because of the high load I'm not able to get into the machine to pull a thread dump to see wtf it's doing as it happens.
> I've also had an issue where a single node spikes up to high load, but recovers. This may or may not be the same issue from which the nodes don't recover as above, but both are new behaviour

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Commented: (CASSANDRA-2058) Nodes periodically spike in load

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12987655#action_12987655 ] 

Hudson commented on CASSANDRA-2058:
-----------------------------------

Integrated in Cassandra-0.6 #52 (See [https://hudson.apache.org/hudson/job/Cassandra-0.6/52/])
    reduce garbage generated by MessagingService to prevent load spikes
patch by jbellis; reviewed by brandonwilliams and tjake for CASSANDRA-2058


> Nodes periodically spike in load
> --------------------------------
>
>                 Key: CASSANDRA-2058
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2058
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.6.10, 0.7.1
>            Reporter: David King
>            Assignee: Jonathan Ellis
>             Fix For: 0.6.11, 0.7.1
>
>         Attachments: 2058-0.7.txt, 2058.txt, cassandra.pmc01.log.bz2, cassandra.pmc14.log.bz2, graph a.png, graph b.png
>
>
> (Filing as a placeholder bug as I gather information.)
> At ~10p 24 Jan, I upgraded our 20-node cluster from 0.6.8->0.6.10, turned on the DES, and moved some CFs from one KS into another (drain whole cluster, take it down, move files, change schema, put it back up). Since then, I've had four storms whereby a node's load will shoot to 700+ (400% CPU on a 4-cpu machine) and become totally unresponsive. After a moment or two like that, its neighbour dies too, and the failure cascades around the ring. Unfortunately because of the high load I'm not able to get into the machine to pull a thread dump to see wtf it's doing as it happens.
> I've also had an issue where a single node spikes up to high load, but recovers. This may or may not be the same issue from which the nodes don't recover as above, but both are new behaviour

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (CASSANDRA-2058) Nodes periodically spike in load

Posted by "David King (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David King updated CASSANDRA-2058:
----------------------------------

    Attachment: cassandra.pmc01.log.bz2

This is one of the affected nodes' logs from 7a..6p (uncompresses to ~33mb). Note that around 4p I added a job to pull a jstack every 120s. On this node, around 5:46p, I saw the variant of the load spike from which the node recovers (at around 17:53).

> Nodes periodically spike in load
> --------------------------------
>
>                 Key: CASSANDRA-2058
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2058
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.6.10
>            Reporter: David King
>         Attachments: cassandra.pmc01.log.bz2
>
>
> (Filing as a placeholder bug as I gather information.)
> At ~10p 24 Jan, I upgraded our 20-node cluster from 0.6.8->0.6.10, turned on the DES, and moved some CFs from one KS into another (drain whole cluster, take it down, move files, change schema, put it back up). Since then, I've had four storms whereby a node's load will shoot to 700+ (400% CPU on a 4-cpu machine) and become totally unresponsive. After a moment or two like that, its neighbour dies too, and the failure cascades around the ring. Unfortunately because of the high load I'm not able to get into the machine to pull a thread dump to see wtf it's doing as it happens.
> I've also had an issue where a single node spikes up to high load, but recovers. This may or may not be the same issue from which the nodes don't recover as above, but both are new behaviour

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (CASSANDRA-2058) Nodes periodically spike in load

Posted by "David King (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12987173#action_12987173 ] 

David King commented on CASSANDRA-2058:
---------------------------------------

I am in fact still having both the momentary and the sustained failures and am rolling back to 0.6.8 with no DES (since you describe it as a no-op anyway)

> Nodes periodically spike in load
> --------------------------------
>
>                 Key: CASSANDRA-2058
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2058
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.6.10
>            Reporter: David King
>         Attachments: cassandra.pmc01.log.bz2, cassandra.pmc14.log.bz2, graph a.png, graph b.png
>
>
> (Filing as a placeholder bug as I gather information.)
> At ~10p 24 Jan, I upgraded our 20-node cluster from 0.6.8->0.6.10, turned on the DES, and moved some CFs from one KS into another (drain whole cluster, take it down, move files, change schema, put it back up). Since then, I've had four storms whereby a node's load will shoot to 700+ (400% CPU on a 4-cpu machine) and become totally unresponsive. After a moment or two like that, its neighbour dies too, and the failure cascades around the ring. Unfortunately because of the high load I'm not able to get into the machine to pull a thread dump to see wtf it's doing as it happens.
> I've also had an issue where a single node spikes up to high load, but recovers. This may or may not be the same issue from which the nodes don't recover as above, but both are new behaviour

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (CASSANDRA-2058) Nodes periodically spike in load

Posted by "Mike Malone (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12987835#action_12987835 ] 

Mike Malone commented on CASSANDRA-2058:
----------------------------------------

Jake/Jonathan,

FWIW, I re-implemented ExpiringMap with MapMaker using an eviction listener (but mostly maintaining the ExpiringMap API) a little while back while investigating some messaging service issues we were seeing. The patch is against 0.6.8, but here's the code if you wanna try it out: https://gist.github.com/a2f645c69ca8f44ccff3

It could definitely be simplified more by someone willing to make more widespread code changes. Actually, I think using MapMaker directly and getting rid of ExpiringMap would probably be best. *shrug*
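
As an illustration of the shape described above (this is not the contents of the linked gist), a MapMaker-backed class that keeps a small ExpiringMap-like put/get/remove surface might look roughly like the following; the constructor argument and method names are assumptions, not the actual 0.6 ExpiringMap signature.

import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.TimeUnit;

import com.google.common.collect.MapMaker;

public class MapMakerBackedExpiringMap<K, V>
{
    private final ConcurrentMap<K, V> map;

    public MapMakerBackedExpiringMap(long expirationMillis)
    {
        // expiry is handled by MapMaker itself, so no separate reaper task is needed in this sketch
        this.map = new MapMaker()
                .expiration(expirationMillis, TimeUnit.MILLISECONDS)
                .makeMap();
    }

    public V put(K key, V value)
    {
        return map.put(key, value);
    }

    public V get(K key)
    {
        return map.get(key);
    }

    public V remove(K key)
    {
        return map.remove(key);
    }
}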

> Nodes periodically spike in load
> --------------------------------
>
>                 Key: CASSANDRA-2058
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2058
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.6.10, 0.7.1
>            Reporter: David King
>            Assignee: Jonathan Ellis
>             Fix For: 0.6.11, 0.7.1
>
>         Attachments: 2058-0.7-v2.txt, 2058-0.7-v3.txt, 2058-0.7.txt, 2058.txt, cassandra.pmc01.log.bz2, cassandra.pmc14.log.bz2, graph a.png, graph b.png
>
>
> (Filing as a placeholder bug as I gather information.)
> At ~10p 24 Jan, I upgraded our 20-node cluster from 0.6.8->0.6.10, turned on the DES, and moved some CFs from one KS into another (drain whole cluster, take it down, move files, change schema, put it back up). Since then, I've had four storms whereby a node's load will shoot to 700+ (400% CPU on a 4-cpu machine) and become totally unresponsive. After a moment or two like that, its neighbour dies too, and the failure cascades around the ring. Unfortunately because of the high load I'm not able to get into the machine to pull a thread dump to see wtf it's doing as it happens.
> I've also had an issue where a single node spikes up to high load, but recovers. This may or may not be the same issue from which the nodes don't recover as above, but both are new behaviour

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (CASSANDRA-2058) Nodes periodically spike in load

Posted by "David King (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12987228#action_12987228 ] 

David King commented on CASSANDRA-2058:
---------------------------------------

I've rolled back to 0.6.8 with the DES disabled and not only has the load problem stopped, performance has also gone back up to previous levels

> Nodes periodically spike in load
> --------------------------------
>
>                 Key: CASSANDRA-2058
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2058
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.6.10
>            Reporter: David King
>         Attachments: cassandra.pmc01.log.bz2, cassandra.pmc14.log.bz2, graph a.png, graph b.png
>
>
> (Filing as a placeholder bug as I gather information.)
> At ~10p 24 Jan, I upgraded our 20-node cluster from 0.6.8->0.6.10, turned on the DES, and moved some CFs from one KS into another (drain whole cluster, take it down, move files, change schema, put it back up). Since then, I've had four storms whereby a node's load will shoot to 700+ (400% CPU on a 4-cpu machine) and become totally unresponsive. After a moment or two like that, its neighbour dies too, and the failure cascades around the ring. Unfortunately because of the high load I'm not able to get into the machine to pull a thread dump to see wtf it's doing as it happens.
> I've also had an issue where a single node spikes up to high load, but recovers. This may or may not be the same issue from which the nodes don't recover as above, but both are new behaviour

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (CASSANDRA-2058) Nodes periodically spike in load

Posted by "David King (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David King updated CASSANDRA-2058:
----------------------------------

    Attachment: graph b.png
                graph a.png

Just had this happen again, attaching load/CPU graphs. Will have logs shortly.

I was in the middle of pushing out the change to turn off the DES. This is pmc14. As of when this happened, the nodes {pmc01 pmc04 pmc07 pmc10 pmc13 pmc16} had it turned off, but the others had not yet been restarted.

> Nodes periodically spike in load
> --------------------------------
>
>                 Key: CASSANDRA-2058
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2058
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.6.10
>            Reporter: David King
>         Attachments: cassandra.pmc01.log.bz2, graph a.png, graph b.png
>
>
> (Filing as a placeholder bug as I gather information.)
> At ~10p 24 Jan, I upgraded our 20-node cluster from 0.6.8->0.6.10, turned on the DES, and moved some CFs from one KS into another (drain whole cluster, take it down, move files, change schema, put it back up). Since then, I've had four storms whereby a node's load will shoot to 700+ (400% CPU on a 4-cpu machine) and become totally unresponsive. After a moment or two like that, its neighbour dies too, and the failure cascades around the ring. Unfortunately because of the high load I'm not able to get into the machine to pull a thread dump to see wtf it's doing as it happens.
> I've also had an issue where a single node spikes up to high load, but recovers. This may or may not be the same issue from which the nodes don't recover as above, but both are new behaviour

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (CASSANDRA-2058) Nodes periodically spike in load

Posted by "David King (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12986806#action_12986806 ] 

David King commented on CASSANDRA-2058:
---------------------------------------

It occurs to me that my timestamps may be in a different time zone than the logs themselves

> Nodes periodically spike in load
> --------------------------------
>
>                 Key: CASSANDRA-2058
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2058
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.6.10
>            Reporter: David King
>         Attachments: cassandra.pmc01.log.bz2
>
>
> (Filing as a placeholder bug as I gather information.)
> At ~10p 24 Jan, I upgraded our 20-node cluster from 0.6.8->0.6.10, turned on the DES, and moved some CFs from one KS into another (drain whole cluster, take it down, move files, change schema, put it back up). Since then, I've had four storms whereby a node's load will shoot to 700+ (400% CPU on a 4-cpu machine) and become totally unresponsive. After a moment or two like that, its neighbour dies too, and the failure cascades around the ring. Unfortunately because of the high load I'm not able to get into the machine to pull a thread dump to see wtf it's doing as it happens.
> I've also had an issue where a single node spikes up to high load, but recovers. This may or may not be the same issue from which the nodes don't recover as above, but both are new behaviour

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (CASSANDRA-2058) Nodes periodically spike in load

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-2058:
--------------------------------------

    Affects Version/s: 0.7.1
        Fix Version/s: 0.7.1
                       0.6.11

committed 0.6 version

> Nodes periodically spike in load
> --------------------------------
>
>                 Key: CASSANDRA-2058
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2058
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.6.10, 0.7.1
>            Reporter: David King
>            Assignee: Jonathan Ellis
>             Fix For: 0.6.11, 0.7.1
>
>         Attachments: 2058-0.7.txt, 2058.txt, cassandra.pmc01.log.bz2, cassandra.pmc14.log.bz2, graph a.png, graph b.png
>
>
> (Filing as a placeholder bug as I gather information.)
> At ~10p 24 Jan, I upgraded our 20-node cluster from 0.6.8->0.6.10, turned on the DES, and moved some CFs from one KS into another (drain whole cluster, take it down, move files, change schema, put it back up). Since then, I've had four storms whereby a node's load will shoot to 700+ (400% CPU on a 4-cpu machine) and become totally unresponsive. After a moment or two like that, its neighbour dies too, and the failure cascades around the ring. Unfortunately because of the high load I'm not able to get into the machine to pull a thread dump to see wtf it's doing as it happens.
> I've also had an issue where a single node spikes up to high load, but recovers. This may or may not be the same issue from which the nodes don't recover as above, but both are new behaviour

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (CASSANDRA-2058) Nodes periodically spike in load

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12986810#action_12986810 ] 

Jonathan Ellis commented on CASSANDRA-2058:
-------------------------------------------

I believe this is the same as CASSANDRA-2054 but will leave both open for now.

> Nodes periodically spike in load
> --------------------------------
>
>                 Key: CASSANDRA-2058
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2058
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.6.10
>            Reporter: David King
>         Attachments: cassandra.pmc01.log.bz2
>
>
> (Filing as a placeholder bug as I gather information.)
> At ~10p 24 Jan, I upgraded our 20-node cluster from 0.6.8->0.6.10, turned on the DES, and moved some CFs from one KS into another (drain whole cluster, take it down, move files, change schema, put it back up). Since then, I've had four storms whereby a node's load will shoot to 700+ (400% CPU on a 4-cpu machine) and become totally unresponsive. After a moment or two like that, its neighbour dies too, and the failure cascades around the ring. Unfortunately because of the high load I'm not able to get into the machine to pull a thread dump to see wtf it's doing as it happens.
> I've also had an issue where a single node spikes up to high load, but recovers. This may or may not be the same issue from which the nodes don't recover as above, but both are new behaviour

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (CASSANDRA-2058) Load spikes due to MessagingService-generated garbage collection

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12994962#comment-12994962 ] 

Jonathan Ellis commented on CASSANDRA-2058:
-------------------------------------------

AFAIK nobody has seen this on 0.7.1.

> Load spikes due to MessagingService-generated garbage collection
> ----------------------------------------------------------------
>
>                 Key: CASSANDRA-2058
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2058
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.6.10, 0.7.0
>         Environment: OpenJDK 64-Bit Server VM (build 1.6.0_0-b12, mixed mode)
> Ubuntu 8.10
> Linux pmc01 2.6.27-22-xen #1 SMP Fri Feb 20 23:58:13 UTC 2009 x86_64 GNU/Linux
>            Reporter: David King
>            Assignee: Jonathan Ellis
>             Fix For: 0.6.11, 0.7.1
>
>         Attachments: 2058-0.7-v2.txt, 2058-0.7-v3.txt, 2058-0.7.txt, 2058.txt, cassandra.pmc01.log.bz2, cassandra.pmc14.log.bz2, graph a.png, graph b.png
>
>
> (Filing as a placeholder bug as I gather information.)
> At ~10p 24 Jan, I upgraded our 20-node cluster from 0.6.8->0.6.10, turned on the DES, and moved some CFs from one KS into another (drain whole cluster, take it down, move files, change schema, put it back up). Since then, I've had four storms whereby a node's load will shoot to 700+ (400% CPU on a 4-cpu machine) and become totally unresponsive. After a moment or two like that, its neighbour dies too, and the failure cascades around the ring. Unfortunately because of the high load I'm not able to get into the machine to pull a thread dump to see wtf it's doing as it happens.
> I've also had an issue where a single node spikes up to high load, but recovers. This may or may not be the same issue from which the nodes don't recover as above, but both are new behaviour

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Reopened: (CASSANDRA-2058) Nodes periodically spike in load

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis reopened CASSANDRA-2058:
---------------------------------------


> Nodes periodically spike in load
> --------------------------------
>
>                 Key: CASSANDRA-2058
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2058
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.6.10, 0.7.1
>         Environment: OpenJDK 64-Bit Server VM (build 1.6.0_0-b12, mixed mode)
> Ubuntu 8.10
> Linux pmc01 2.6.27-22-xen #1 SMP Fri Feb 20 23:58:13 UTC 2009 x86_64 GNU/Linux
>            Reporter: David King
>            Assignee: Jonathan Ellis
>             Fix For: 0.6.11, 0.7.1
>
>         Attachments: 2058-0.7-v2.txt, 2058-0.7-v3.txt, 2058-0.7.txt, 2058.txt, cassandra.pmc01.log.bz2, cassandra.pmc14.log.bz2, graph a.png, graph b.png
>
>
> (Filing as a placeholder bug as I gather information.)
> At ~10p 24 Jan, I upgraded our 20-node cluster from 0.6.8->0.6.10, turned on the DES, and moved some CFs from one KS into another (drain whole cluster, take it down, move files, change schema, put it back up). Since then, I've had four storms whereby a node's load will shoot to 700+ (400% CPU on a 4-cpu machine) and become totally unresponsive. After a moment or two like that, its neighbour dies too, and the failure cascades around the ring. Unfortunately because of the high load I'm not able to get into the machine to pull a thread dump to see wtf it's doing as it happens.
> I've also had an issue where a single node spikes up to high load, but recovers. This may or may not be the same issue from which the nodes don't recover as above, but both are new behaviour

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Commented: (CASSANDRA-2058) Nodes periodically spike in load

Posted by "David King (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990741#comment-12990741 ] 

David King commented on CASSANDRA-2058:
---------------------------------------

JVM_OPTS=" \
        -ea \
        -Xms6656m \
        -Xmx6656m \
        -XX:+UseParNewGC \
        -XX:+UseConcMarkSweepGC \
        -XX:+CMSParallelRemarkEnabled \
        -XX:SurvivorRatio=8 \
        -XX:MaxTenuringThreshold=1 \
        -XX:CMSInitiatingOccupancyFraction=75 \
        -XX:+UseCMSInitiatingOccupancyOnly \
        -XX:+HeapDumpOnOutOfMemoryError \
        -XX:+UseThreadPriorities \
        -XX:ThreadPriorityPolicy=42 \
        -Dcassandra.compaction.priority=1 \
        -Dcom.sun.management.jmxremote.port=8080 \
        -Dcom.sun.management.jmxremote.ssl=false \
        -Dcom.sun.management.jmxremote.authenticate=false"


/usr/bin/java -ea -Xms6656m -Xmx6656m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1 -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Dcassandra.compaction.priority=1 -Dcom.sun.management.jmxremote.port=8080 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dstorage-config=bin/../conf -Dcassandra-foreground=yes -cp bin/../conf:bin/../build/classes:bin/../lib/antlr-3.1.3.jar:bin/../lib/clhm-production.jar:bin/../lib/commons-cli-1.1.jar:bin/../lib/commons-codec-1.2.jar:bin/../lib/commons-collections-3.2.1.jar:bin/../lib/commons-lang-2.4.jar:bin/../lib/google-collections-1.0.jar:bin/../lib/hadoop-core-0.20.1.jar:bin/../lib/high-scale-lib.jar:bin/../lib/jackson-core-asl-1.4.0.jar:bin/../lib/jackson-mapper-asl-1.4.0.jar:bin/../lib/jline-0.9.94.jar:bin/../lib/json-simple-1.1.jar:bin/../lib/libthrift-r917130.jar:bin/../lib/log4j-1.2.14.jar:bin/../lib/slf4j-api-1.5.8.jar:bin/../lib/slf4j-log4j12-1.5.8.jar org.apache.cassandra.thrift.CassandraDaemon


java version "1.6.0_0"
IcedTea6 1.3.1 (6b12-0ubuntu6.7) Runtime Environment (build 1.6.0_0-b12)
OpenJDK 64-Bit Server VM (build 1.6.0_0-b12, mixed mode)


Linux pmc01 2.6.27-22-xen #1 SMP Fri Feb 20 23:58:13 UTC 2009 x86_64 GNU/Linux

> Nodes periodically spike in load
> --------------------------------
>
>                 Key: CASSANDRA-2058
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2058
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.6.10, 0.7.1
>         Environment: OpenJDK 64-Bit Server VM (build 1.6.0_0-b12, mixed mode)
> Ubuntu 8.10
> Linux pmc01 2.6.27-22-xen #1 SMP Fri Feb 20 23:58:13 UTC 2009 x86_64 GNU/Linux
>            Reporter: David King
>            Assignee: Jonathan Ellis
>             Fix For: 0.6.11, 0.7.1
>
>         Attachments: 2058-0.7-v2.txt, 2058-0.7-v3.txt, 2058-0.7.txt, 2058.txt, cassandra.pmc01.log.bz2, cassandra.pmc14.log.bz2, graph a.png, graph b.png
>
>
> (Filing as a placeholder bug as I gather information.)
> At ~10p 24 Jan, I upgraded our 20-node cluster from 0.6.8->0.6.10, turned on the DES, and moved some CFs from one KS into another (drain whole cluster, take it down, move files, change schema, put it back up). Since then, I've had four storms whereby a node's load will shoot to 700+ (400% CPU on a 4-cpu machine) and become totally unresponsive. After a moment or two like that, its neighbour dies too, and the failure cascades around the ring. Unfortunately because of the high load I'm not able to get into the machine to pull a thread dump to see wtf it's doing as it happens.
> I've also had an issue where a single node spikes up to high load, but recovers. This may or may not be the same issue from which the nodes don't recover as above, but both are new behaviour

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

[jira] Commented: (CASSANDRA-2058) Nodes periodically spike in load

Posted by "Brandon Williams (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990728#comment-12990728 ] 

Brandon Williams commented on CASSANDRA-2058:
---------------------------------------------

Could those who are seeing this issue please post the JVM flags they're using?

> Nodes periodically spike in load
> --------------------------------
>
>                 Key: CASSANDRA-2058
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2058
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.6.10, 0.7.1
>         Environment: OpenJDK 64-Bit Server VM (build 1.6.0_0-b12, mixed mode)
> Ubuntu 8.10
> Linux pmc01 2.6.27-22-xen #1 SMP Fri Feb 20 23:58:13 UTC 2009 x86_64 GNU/Linux
>            Reporter: David King
>            Assignee: Jonathan Ellis
>             Fix For: 0.6.11, 0.7.1
>
>         Attachments: 2058-0.7-v2.txt, 2058-0.7-v3.txt, 2058-0.7.txt, 2058.txt, cassandra.pmc01.log.bz2, cassandra.pmc14.log.bz2, graph a.png, graph b.png
>
>
> (Filing as a placeholder bug as I gather information.)
> At ~10p 24 Jan, I upgraded our 20-node cluster from 0.6.8->0.6.10, turned on the DES, and moved some CFs from one KS into another (drain whole cluster, take it down, move files, change schema, put it back up). Since then, I've had four storms whereby a node's load will shoot to 700+ (400% CPU on a 4-cpu machine) and become totally unresponsive. After a moment or two like that, its neighbour dies too, and the failure cascades around the ring. Unfortunately because of the high load I'm not able to get into the machine to pull a thread dump to see wtf it's doing as it happens.
> I've also had an issue where a single node spikes up to high load, but recovers. This may or may not be the same issue from which the nodes don't recover as above, but both are new behaviour

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira