Posted to commits@cassandra.apache.org by "J.B. Langston (JIRA)" <ji...@apache.org> on 2013/09/25 19:56:04 UTC
[jira] [Updated] (CASSANDRA-6097) nodetool repair randomly hangs.

     [ https://issues.apache.org/jira/browse/CASSANDRA-6097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

J.B. Langston updated CASSANDRA-6097:
-------------------------------------

    Description: 
nodetool repair randomly hangs. This is not the same issue as the one where repair hangs if a stream is disrupted. It can be reproduced on a single-node cluster where no streaming takes place, so I think it may be a JMX connection or timeout issue. Thread dumps show that nodetool is waiting on a JMX response and that there are no repair-related threads running in Cassandra. The nodetool main thread waiting for the JMX response:

{code}
"main" prio=5 tid=7ffa4b001800 nid=0x10aedf000 in Object.wait() [10aede000]
   java.lang.Thread.State: WAITING (on object monitor)
	at java.lang.Object.wait(Native Method)
	- waiting on <7f90d62e8> (a org.apache.cassandra.utils.SimpleCondition)
	at java.lang.Object.wait(Object.java:485)
	at org.apache.cassandra.utils.SimpleCondition.await(SimpleCondition.java:34)
	- locked <7f90d62e8> (a org.apache.cassandra.utils.SimpleCondition)
	at org.apache.cassandra.tools.RepairRunner.repairAndWait(NodeProbe.java:976)
	at org.apache.cassandra.tools.NodeProbe.forceRepairAsync(NodeProbe.java:221)
	at org.apache.cassandra.tools.NodeCmd.optionalKSandCFs(NodeCmd.java:1444)
	at org.apache.cassandra.tools.NodeCmd.main(NodeCmd.java:1213)
{code}

When nodetool hangs, it does not print out the following message:

"Starting repair command #XX, repairing 1 ranges for keyspace XXX"

However, Cassandra logs that repair in system.log:

1380033480.95  INFO [Thread-154] 10:38:00,882 Starting repair command #X, repairing X ranges for keyspace XXX

This suggests that the repair command was received by Cassandra but the connection then failed and nodetool didn't receive a response.
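
One way to check this on the server side is to count how many repair starts Cassandra logged and compare that with the number of nodetool invocations. The following is only a sketch: the helper name count_repair_starts is hypothetical, and the log path in the usage comment is an assumption that depends on the installation.

```shell
# Count "Starting repair command" entries in a Cassandra log file.
# $1: path to system.log (the location varies by installation).
count_repair_starts() {
  # grep -c prints the number of matching lines. It exits non-zero when
  # that number is 0, so the status is ignored here.
  grep -c 'Starting repair command #' "$1" || true
}

# Usage (assumed path): count_repair_starts /var/log/cassandra/system.log
```

If the count in the log is one higher than the number of "Starting repair command" messages printed by nodetool, the last command reached Cassandra without nodetool seeing the response.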

Obviously, running repair on a single-node cluster is pointless but it's the easiest way to demonstrate this problem. The customer who reported this has also seen the issue on his real multi-node cluster.

Steps to reproduce:

Note: I reproduced this once on the official DataStax AMI with DSE 3.1.3 (Cassandra 1.2.6+patches). I was unable to reproduce it on my Mac using the same version, and subsequent attempts to reproduce it on the AMI were unsuccessful. The customer says he is able to reliably reproduce it on his Mac using DSE 3.1.3 and occasionally on his real cluster.

1) Launch an instance from the DataStax AMI at https://aws.amazon.com/amis/datastax-auto-clustering-ami-2-2

2) Create a test keyspace
{code}
create keyspace test WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
{code}
3) Run an endless loop that runs nodetool repair repeatedly:

{code}
while true; do nodetool repair -pr test; done
{code}

4) Wait until repair hangs. It may take many tries; the behavior is random.
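
Because the loop in step 3 simply stops making progress when the hang occurs, a variant with a time limit can make the failing attempt easier to spot. This is a sketch under stated assumptions: the function name run_until_hang and the limits in the usage comment are hypothetical, and it relies on the GNU coreutils timeout command, which is not installed by default on a Mac.

```shell
# Run a command repeatedly until one attempt exceeds the time limit.
# Usage: run_until_hang <limit-seconds> <max-attempts> <command> [args...]
run_until_hang() {
  local limit="$1" max="$2"
  shift 2
  local attempt=0 status
  while [ "$attempt" -lt "$max" ]; do
    attempt=$((attempt + 1))
    # GNU timeout exits with status 124 when it had to stop the command.
    if timeout "$limit" "$@"; then
      status=0
    else
      status=$?
    fi
    if [ "$status" -eq 124 ]; then
      echo "attempt $attempt: no completion within ${limit}s (possible hang)"
      return 1
    fi
  done
  echo "no hang in $attempt attempts"
  return 0
}

# Usage (assumed limits): run_until_hang 600 1000 nodetool repair -pr test
```

Note that timeout terminates the stuck nodetool process, so this variant only reports which attempt hung. To capture a thread dump like the one above, the plain loop from step 3 has to be left running.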

  was:
nodetool repair randomly hangs. This is not the same issue where repair hangs if a stream is disrupted. This can be reproduced on a single-node cluster where no streaming takes place, so I think this may be a JMX connection or timeout issue. Thread dumps show that nodetool is waiting on a JMX response and there are no repair-related threads running in Cassandra. Nodetool main thread waiting for JMX response:

{code}
"main" prio=5 tid=7ffa4b001800 nid=0x10aedf000 in Object.wait() [10aede000]
   java.lang.Thread.State: WAITING (on object monitor)
	at java.lang.Object.wait(Native Method)
	- waiting on <7f90d62e8> (a org.apache.cassandra.utils.SimpleCondition)
	at java.lang.Object.wait(Object.java:485)
	at org.apache.cassandra.utils.SimpleCondition.await(SimpleCondition.java:34)
	- locked <7f90d62e8> (a org.apache.cassandra.utils.SimpleCondition)
	at org.apache.cassandra.tools.RepairRunner.repairAndWait(NodeProbe.java:976)
	at org.apache.cassandra.tools.NodeProbe.forceRepairAsync(NodeProbe.java:221)
	at org.apache.cassandra.tools.NodeCmd.optionalKSandCFs(NodeCmd.java:1444)
	at org.apache.cassandra.tools.NodeCmd.main(NodeCmd.java:1213)
{code}

When nodetool hangs, it does not print out the following message:

"Starting repair command #XX, repairing 1 ranges for keyspace XXX"

However, Cassandra logs that repair in system.log:

1380033480.95  INFO [Thread-154] 10:38:00,882 Starting repair command #X, repairing X ranges for keyspace XXX

This suggests that the repair command was received by Cassandra but the connection then failed and nodetool didn't receive a response.

Obviously, running repair on a single-node cluster is pointless but it's the easiest way to demonstrate this problem. The customer who reported this has also seen the issue on his real multi-node cluster.

Steps to reproduce:

Note: I reproduced this once on the official DataStax AMI with DSE 3.1.3 (Cassandra 1.2.6+patches). I was unable to reproduce it on my Mac using the same version, and subsequent attempts to reproduce it on the AMI were unsuccessful. The customer says he is able to reliably reproduce it on his Mac using DSE 3.1.3 and occasionally on his real cluster.

1) Deploy an AMI using the DataStax AMI at https://aws.amazon.com/amis/datastax-auto-clustering-ami-2-2

2) Create a test keyspace

create keyspace test WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

3) Run an endless loop that runs nodetool repair repeatedly:

while true; do nodetool repair -pr test; done

4) Wait until repair hangs. It may take hundreds or thousands of tries; the behavior is random.

    
> nodetool repair randomly hangs.
> -------------------------------
>
>                 Key: CASSANDRA-6097
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6097
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: DataStax AMI
>            Reporter: J.B. Langston
>
> nodetool repair randomly hangs. This is not the same issue as the one where repair hangs if a stream is disrupted. It can be reproduced on a single-node cluster where no streaming takes place, so I think it may be a JMX connection or timeout issue. Thread dumps show that nodetool is waiting on a JMX response and that there are no repair-related threads running in Cassandra. The nodetool main thread waiting for the JMX response:
> {code}
> "main" prio=5 tid=7ffa4b001800 nid=0x10aedf000 in Object.wait() [10aede000]
>    java.lang.Thread.State: WAITING (on object monitor)
> 	at java.lang.Object.wait(Native Method)
> 	- waiting on <7f90d62e8> (a org.apache.cassandra.utils.SimpleCondition)
> 	at java.lang.Object.wait(Object.java:485)
> 	at org.apache.cassandra.utils.SimpleCondition.await(SimpleCondition.java:34)
> 	- locked <7f90d62e8> (a org.apache.cassandra.utils.SimpleCondition)
> 	at org.apache.cassandra.tools.RepairRunner.repairAndWait(NodeProbe.java:976)
> 	at org.apache.cassandra.tools.NodeProbe.forceRepairAsync(NodeProbe.java:221)
> 	at org.apache.cassandra.tools.NodeCmd.optionalKSandCFs(NodeCmd.java:1444)
> 	at org.apache.cassandra.tools.NodeCmd.main(NodeCmd.java:1213)
> {code}
> When nodetool hangs, it does not print out the following message:
> "Starting repair command #XX, repairing 1 ranges for keyspace XXX"
> However, Cassandra logs that repair in system.log:
> 1380033480.95  INFO [Thread-154] 10:38:00,882 Starting repair command #X, repairing X ranges for keyspace XXX
> This suggests that the repair command was received by Cassandra but the connection then failed and nodetool didn't receive a response.
> Obviously, running repair on a single-node cluster is pointless but it's the easiest way to demonstrate this problem. The customer who reported this has also seen the issue on his real multi-node cluster.
> Steps to reproduce:
> Note: I reproduced this once on the official DataStax AMI with DSE 3.1.3 (Cassandra 1.2.6+patches). I was unable to reproduce it on my Mac using the same version, and subsequent attempts to reproduce it on the AMI were unsuccessful. The customer says he is able to reliably reproduce it on his Mac using DSE 3.1.3 and occasionally on his real cluster.
> 1) Launch an instance from the DataStax AMI at https://aws.amazon.com/amis/datastax-auto-clustering-ami-2-2
> 2) Create a test keyspace
> {code}
> create keyspace test WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
> {code}
> 3) Run an endless loop that runs nodetool repair repeatedly:
> {code}
> while true; do nodetool repair -pr test; done
> {code}
> 4) Wait until repair hangs. It may take many tries; the behavior is random.
