Posted to commits@cassandra.apache.org by "Jason Harvey (JIRA)" <ji...@apache.org> on 2011/03/27 01:12:05 UTC

[jira] [Issue Comment Edited] (CASSANDRA-2357) Load spikes on coordinators since upgrade from 0.6.8 to 0.7

    [ https://issues.apache.org/jira/browse/CASSANDRA-2357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011705#comment-13011705 ] 

Jason Harvey edited comment on CASSANDRA-2357 at 3/27/11 12:10 AM:
-------------------------------------------------------------------

Adding some more diagnostics to track this down. One curious thing I have noticed so far: whenever the spikes happen, the JRE open file handle count and JRE thread count plummet. For example, on one node the thread count was 600 and the file handle count was around 500. Both of those numbers dropped by 50% immediately after the spike.

      was (Author: alienth):
    Adding some more diagnostics to track this down. One curious thing I have noticed so far: Whenever the spikes happen, the open file handles and thread count plummets. Thread count was 600 and file handle count was around 500. Both of those numbers dropped 50% immediately after the spike.
  
> Load spikes on coordinators since upgrade from 0.6.8 to 0.7
> -----------------------------------------------------------
>
>                 Key: CASSANDRA-2357
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2357
>             Project: Cassandra
>          Issue Type: Bug
>    Affects Versions: 0.7.4
>            Reporter: Jason Harvey
>         Attachments: thread_dump.txt
>
>
> Since our move from 0.6.8 to 0.7, all of the nodes which speak with clients have been having periodic, abrupt load spikes going into the hundreds. We have been seeing these load spikes 1 to 2 times per hour on every node which clients are speaking with. The load graph for a typical spike: http://i.imgur.com/jY8AV.png
> I have verified via TCP statistics that client connections are not spiking at the same time. I have also verified that we aren't seeing any spikes in reads/mutations/etc.
> We were using the DynamicSnitch, but I turned that off as a troubleshooting step. The issue was unchanged.
> When the spikes occur, the box's responsiveness slows to a crawl so I am unable to do much in terms of real-time diagnostics. I was able to get a thread dump a few seconds after a spike, which I have attached to this ticket. Not sure if it will show anything since I couldn't capture it immediately during the spike.
> I should note that David King noticed a similar problem (#2058) when he tried moving us from 0.6.8 to 0.6.10. The main issue at the time was a long-lasting load spike, but he also saw occasional abrupt load spikes like we are seeing now. When we moved back to 0.6.8, we didn't see the problem again, until the move to 0.7.
> I realize this information is somewhat nebulous. If there is any further info I can provide, please let me know. The spikes are causing quite a bit of instability, so we are considering reverting to 0.6.8. I'd like to investigate every possible solution before we resort to that.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira