Posted to commits@cassandra.apache.org by "Russell Alexander Spitzer (JIRA)" <ji...@apache.org> on 2014/01/17 18:52:25 UTC

[jira] [Comment Edited] (CASSANDRA-6210) Repair hangs when a new datacenter is added to a cluster

    [ https://issues.apache.org/jira/browse/CASSANDRA-6210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13875006#comment-13875006 ] 

Russell Alexander Spitzer edited comment on CASSANDRA-6210 at 1/17/14 5:51 PM:
-------------------------------------------------------------------------------

Ran the test again last night, and repair reported the following exceptions. I'll have the logs up in a moment.

Setup:
4 Nodes, 2 per DC

{code}
ERROR [AntiEntropySessions:2] 2014-01-17 06:59:13,320 RepairSession.java (line 278) [repair #def293b0-7f44-11e3-b180-d1c68624042f] session completed with the following error
org.apache.cassandra.exceptions.RepairException: [repair #def293b0-7f44-11e3-b180-d1c68624042f on Keyspace1/Standard1, (-4559856749309798061,-4559456353371206248]] Sync failed between /10.171.121.18 and /10.196.16.123
        at org.apache.cassandra.repair.RepairSession.syncComplete(RepairSession.java:200)
        at org.apache.cassandra.service.ActiveRepairService.handleMessage(ActiveRepairService.java:204)
        at org.apache.cassandra.repair.RepairMessageVerbHandler.doVerb(RepairMessageVerbHandler.java:59)
        at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:60)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:724)
ERROR [AntiEntropySessions:2] 2014-01-17 06:59:13,325 CassandraDaemon.java (line 192) Exception in thread Thread[AntiEntropySessions:2,5,RMI Runtime]
java.lang.RuntimeException: org.apache.cassandra.exceptions.RepairException: [repair #def293b0-7f44-11e3-b180-d1c68624042f on Keyspace1/Standard1, (-4559856749309798061,-4559456353371206248]] Sync failed between /10.171.121.18 and /10.196.16.123
        at com.google.common.base.Throwables.propagate(Throwables.java:160)
        at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:32)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
        at java.util.concurrent.FutureTask.run(FutureTask.java:166)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:724)
Caused by: org.apache.cassandra.exceptions.RepairException: [repair #def293b0-7f44-11e3-b180-d1c68624042f on Keyspace1/Standard1, (-4559856749309798061,-4559456353371206248]] Sync failed between /10.171.121.18 and /10.196.16.123
        at org.apache.cassandra.repair.RepairSession.syncComplete(RepairSession.java:200)
        at org.apache.cassandra.service.ActiveRepairService.handleMessage(ActiveRepairService.java:204)
        at org.apache.cassandra.repair.RepairMessageVerbHandler.doVerb(RepairMessageVerbHandler.java:59)
        at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:60)
        ... 3 more
 INFO [AntiEntropySessions:4] 2014-01-17 06:59:13,328 RepairSession.java (line 236) [repair #df5d6370-7f44-11e3-b180-d1c68624042f] new session: will sync /10.171.121.18, /10.198.2.16 on range (-5516517151222322415,-5504449186624942606] for Keyspace1.[SuperCounter1, Super1, Counter3, Standard1, Counter1]
{code}
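
For anyone sifting through the full logs, here is a minimal sketch of pulling the failed session, column family, token range, and endpoints out of the "Sync failed" lines. The regular expression is written against the excerpt above only, and the names SYNC_FAILED and find_sync_failures are hypothetical helpers, not part of Cassandra.

```python
import re

# Matches the "Sync failed" message exactly as it appears in the excerpt above.
# Assumption: other Cassandra versions may format this message differently.
SYNC_FAILED = re.compile(
    r"\[repair #(?P<session>[0-9a-f-]+) on (?P<keyspace>\w+)/(?P<table>\w+), "
    r"\((?P<start>-?\d+),(?P<end>-?\d+)\]\] "
    r"Sync failed between /(?P<first>[\d.]+) and /(?P<second>[\d.]+)"
)


def find_sync_failures(lines):
    """Return one dict per distinct sync failure found in the given log lines."""
    seen = set()
    failures = []
    for line in lines:
        match = SYNC_FAILED.search(line)
        if match is None:
            continue
        key = match.groups()
        # The same failure is logged again in the wrapping RuntimeException
        # and in its "Caused by" line, so skip repeats.
        if key in seen:
            continue
        seen.add(key)
        failures.append(match.groupdict())
    return failures
```

Feeding it the lines of system.log should give one entry per failed range, which makes it easier to see which endpoint pairs are involved.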


> Repair hangs when a new datacenter is added to a cluster
> --------------------------------------------------------
>
>                 Key: CASSANDRA-6210
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6210
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: Amazon Ec2
> 2 M1.large nodes
>            Reporter: Russell Alexander Spitzer
>            Assignee: Yuki Morishita
>             Fix For: 2.0.5
>
>         Attachments: 6210-2.0.txt, RepairLogs.tar.gz
>
>
> Attempting to add a new datacenter to a cluster seems to cause repair operations to break. I've been reproducing this with ~20-node clusters, but I can get it to occur reliably on 2-node setups.
> {code}
> ##Basic Steps to reproduce
> #Node 1 is started using GossipingPropertyFileSnitch as dc1
> #Cassandra-stress is used to insert a minimal amount of data
> $CASSANDRA_STRESS -t 100 -R org.apache.cassandra.locator.NetworkTopologyStrategy  --num-keys=1000 --columns=10 --consistency-level=LOCAL_QUORUM --average-size-values --compaction-strategy='LeveledCompactionStrategy' -O dc1:1 --operation=COUNTER_ADD
> #Alter "Keyspace1"
> ALTER KEYSPACE "Keyspace1" WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 1 , 'dc2': 1 };
> #Add node 2 using GossipingPropertyFileSnitch as dc2
> #Run repair on node 1
> #Run repair on node 2
> {code}
> The repair task on node 1 never completes, and while there are no exceptions in the logs of node 1, netstats reports the following repair tasks:
> {code}
> Mode: NORMAL
> Repair 4e71a250-36b4-11e3-bedc-1d1bb5c9abab
> Repair 6c64ded0-36b4-11e3-bedc-1d1bb5c9abab
> Read Repair Statistics:
> Attempted: 0
> Mismatch (Blocking): 0
> Mismatch (Background): 0
> Pool Name                    Active   Pending      Completed
> Commands                        n/a         0          10239
> Responses                       n/a         0           3839
> {code}
> Checking on node 2, we see the following exceptions:
> {code}
> ERROR [STREAM-IN-/10.171.122.130] 2013-10-16 22:42:58,961 StreamSession.java (line 410) [Stream #4e71a250-36b4-11e3-bedc-1d1bb5c9abab] Streaming error occurred
> java.lang.NullPointerException
>         at org.apache.cassandra.streaming.ConnectionHandler.sendMessage(ConnectionHandler.java:174)
>         at org.apache.cassandra.streaming.StreamSession.prepare(StreamSession.java:436)
>         at org.apache.cassandra.streaming.StreamSession.messageReceived(StreamSession.java:358)
>         at org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:293)
>         at java.lang.Thread.run(Thread.java:724)
> ...
> ERROR [STREAM-IN-/10.171.122.130] 2013-10-16 22:43:49,214 StreamSession.java (line 410) [Stream #6c64ded0-36b4-11e3-bedc-1d1bb5c9abab] Streaming error occurred
> java.lang.NullPointerException
>         at org.apache.cassandra.streaming.ConnectionHandler.sendMessage(ConnectionHandler.java:174)
>         at org.apache.cassandra.streaming.StreamSession.prepare(StreamSession.java:436)
>         at org.apache.cassandra.streaming.StreamSession.messageReceived(StreamSession.java:358)
>         at org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:293)
>         at java.lang.Thread.run(Thread.java:724)
> {code}
> Netstats on node 2 reports:
> {code}
> automaton@ip-10-171-15-234:~$ nodetool netstats
> Mode: NORMAL
> Repair 4e71a250-36b4-11e3-bedc-1d1bb5c9abab
> Read Repair Statistics:
> Attempted: 0
> Mismatch (Blocking): 0
> Mismatch (Background): 0
> Pool Name                    Active   Pending      Completed
> Commands                        n/a         0           2562
> Responses                       n/a         0           4284
> {code}
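
To compare the two netstats outputs quoted above without eyeballing them, a small sketch like the following may help. It assumes the raw `nodetool netstats` text has been captured into a string for each node, and that repair sessions appear as "Repair <uuid>" lines as in the quoted output. The function names active_repairs and only_on_first are hypothetical.

```python
import re

# A "Repair <uuid>" line as shown in the netstats output quoted above.
REPAIR_LINE = re.compile(
    r"^Repair ([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})$"
)


def active_repairs(netstats_output):
    """Return the repair session ids listed in one netstats output, in order."""
    ids = []
    for line in netstats_output.splitlines():
        match = REPAIR_LINE.match(line.strip())
        if match:
            ids.append(match.group(1))
    return ids


def only_on_first(first_output, second_output):
    """Return session ids listed in the first output but not in the second."""
    second_ids = set(active_repairs(second_output))
    return [rid for rid in active_repairs(first_output) if rid not in second_ids]
```

For the outputs in the description, node 1 lists two repair sessions and node 2 lists one, so only_on_first(node1_output, node2_output) would single out the session that node 2 no longer reports.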


