Posted to user@cassandra.apache.org by Dan Hendry <da...@gmail.com> on 2010/10/22 23:42:29 UTC

Hung Repair

I am currently running a 4 node cluster on Cassandra beta 2. Yesterday I
ran into a number of problems and one of my nodes went down for a few
hours. I ran a nodetool repair, and at least at the data level everything
seems to be consistent and fine. The problem is that the node is still
chewing up 100% of its available CPU, 20 hours after I started the
repair. Load averages are 8-9, which is crazy given it is a single-core
EC2 m1.small.

Besides sitting at 100% CPU, the node on which I ran the repair seems to
be fine. The Cassandra logs appear normal. Based on bandwidth patterns
between nodes, it does not seem like they are transferring any
repair-related data (as they did initially). No pending tasks show up in
any of the services when I inspect them via JMX (my check is sketched at
the end of this message). I have a reasonable amount of data in the
cluster (~6 GB x replication factor 2) but nothing crazy. The last
repair-related entry in the logs is as follows:

INFO [Thread-145] 2010-10-22 00:24:10,561 AntiEntropyService.java (line 828) #<TreeRequest manual-repair-23dacf4b-4076-4460-abd5-a713bfd090e2, /10.192.227.6, (kikmetrics,PacketEventsByPacket)> completed successfully: 14 outstanding.

Any idea what is going on? Could the CPU usage STILL be related to the
repair? Is there any way to check? I hesitate to simply kill the node,
given the "14 outstanding" log message and because doing so has caused
me problems in the past when using beta versions.
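
For what it is worth, here is roughly what that JMX check looks like in
code. Treat it as a sketch: the org.apache.cassandra.concurrent MBean
domain and the PendingTasks attribute name are assumptions on my part,
and the JMX port (8080 here) is whatever your node actually exposes.

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class PendingTasks {
        public static void main(String[] args) throws Exception {
            // Connect to the node's JMX endpoint (host and port are placeholders).
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://localhost:8080/jmxrmi");
            JMXConnector jmx = JMXConnectorFactory.connect(url);
            try {
                MBeanServerConnection mbs = jmx.getMBeanServerConnection();
                // Wildcard query over the (assumed) thread-pool stage MBeans.
                for (ObjectName name : mbs.queryNames(
                        new ObjectName("org.apache.cassandra.concurrent:*"), null)) {
                    System.out.println(name + " PendingTasks="
                            + mbs.getAttribute(name, "PendingTasks"));
                }
            } finally {
                jmx.close();
            }
        }
    }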

Dan Hendry


Re: Hung Repair

Posted by Gary Dusbabek <gd...@gmail.com>.
Can you produce a thread dump on the machine? kill -3 <pid> ought to do
it; the JVM writes the dump to its stdout.

JConsole can be your friend at a time like this too.  It might be
painstaking, but you can check the CPU time used by each thread via
the java.lang:type=Threading MBean (ThreadMXBean).  There's an
interesting JConsole plugin that is supposed to make this easier:
http://lsd.luminis.nl/top-threads-plugin-for-jconsole/
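
If clicking through JConsole gets painful, the same Threading MBean
can be read with a few lines of code.  Rough sketch only: the host and
JMX port below are placeholders, and getThreadCpuTime() returns -1
when the VM does not support per-thread CPU timing.

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class TopThreads {
        public static void main(String[] args) throws Exception {
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://localhost:8080/jmxrmi");
            JMXConnector jmx = JMXConnectorFactory.connect(url);
            try {
                // Remote proxy for the java.lang:type=Threading MBean.
                ThreadMXBean threads = ManagementFactory.newPlatformMXBeanProxy(
                        jmx.getMBeanServerConnection(),
                        ManagementFactory.THREAD_MXBEAN_NAME, ThreadMXBean.class);
                for (long id : threads.getAllThreadIds()) {
                    long cpuNanos = threads.getThreadCpuTime(id);
                    ThreadInfo info = threads.getThreadInfo(id);
                    if (info != null && cpuNanos > 0) {
                        System.out.printf("%10d ms  %s%n",
                                cpuNanos / 1000000L, info.getThreadName());
                    }
                }
            } finally {
                jmx.close();
            }
        }
    }

Sort that output and the hot thread should jump out; matching its name
against the kill -3 stack traces tells you what it is actually doing.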

Gary.


On Fri, Oct 22, 2010 at 16:42, Dan Hendry <da...@gmail.com> wrote:
> [snip]