Posted to commits@cassandra.apache.org by "Russell Alexander Spitzer (JIRA)" <ji...@apache.org> on 2013/12/19 18:00:11 UTC

[jira] [Updated] (CASSANDRA-6210) Repair hangs when a new datacenter is added to a cluster

     [ https://issues.apache.org/jira/browse/CASSANDRA-6210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Russell Alexander Spitzer updated CASSANDRA-6210:
-------------------------------------------------

    Attachment: RepairLogs.tar.gz

Attached RepairLogs.tar.gz, which contains logs from all the nodes involved.

They contain data from two trial runs: one in which counter data is repaired, and one in which standard inserts are repaired. The Shutdown marks the boundary between the two tests.

NullPointerExceptions are seen on 10.171.49.137 and 10.171.81.22.
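
For anyone going through the attachment, here is a minimal sketch for locating the exceptions in the extracted logs. It assumes the archive unpacks into a directory of `*.log` files (the layout of the tarball is not described here, so the glob pattern is a guess), and the function names are made up for illustration.

```python
from pathlib import Path

# First line of the stack traces quoted in the issue description.
NPE_MARKER = "java.lang.NullPointerException"


def count_npes(lines):
    """Count lines that start a NullPointerException stack trace."""
    return sum(1 for line in lines if line.strip().startswith(NPE_MARKER))


def scan_directory(root):
    """Map each *.log file under root to its NPE count (files with none are skipped).

    The "*.log" pattern is an assumption about how the archive is laid out.
    """
    results = {}
    for path in sorted(Path(root).rglob("*.log")):
        with path.open(errors="replace") as handle:
            found = count_npes(handle)
        if found:
            results[str(path)] = found
    return results
```

Calling `scan_directory` on the extraction directory returns a dict of file path to count, which should make it easy to confirm which nodes logged the exception.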

> Repair hangs when a new datacenter is added to a cluster
> --------------------------------------------------------
>
>                 Key: CASSANDRA-6210
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6210
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: Amazon EC2
> 2 M1.large nodes
>            Reporter: Russell Alexander Spitzer
>            Assignee: Yuki Morishita
>         Attachments: RepairLogs.tar.gz
>
>
> Attempting to add a new datacenter to a cluster seems to cause repair operations to break. I've been reproducing this with ~20-node clusters, but it also occurs reliably on 2-node setups.
> {code}
> ##Basic Steps to reproduce
> #Node 1 is started using GossipingPropertyFileSnitch as dc1
> #Cassandra-stress is used to insert a minimal amount of data
> $CASSANDRA_STRESS -t 100 -R org.apache.cassandra.locator.NetworkTopologyStrategy --num-keys=1000 --columns=10 --consistency-level=LOCAL_QUORUM --average-size-values --compaction-strategy='LeveledCompactionStrategy' -O dc1:1 --operation=COUNTER_ADD
> #Alter "Keyspace1"
> ALTER KEYSPACE "Keyspace1" WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 1 , 'dc2': 1 };
> #Add node 2 using GossipingPropertyFileSnitch as dc2
> #Run repair on node 1
> #Run repair on node 2
> {code}
> The repair task on node 1 never completes. While there are no exceptions in the logs of node 1, nodetool netstats reports the following repair tasks:
> {code}
> Mode: NORMAL
> Repair 4e71a250-36b4-11e3-bedc-1d1bb5c9abab
> Repair 6c64ded0-36b4-11e3-bedc-1d1bb5c9abab
> Read Repair Statistics:
> Attempted: 0
> Mismatch (Blocking): 0
> Mismatch (Background): 0
> Pool Name                    Active   Pending      Completed
> Commands                        n/a         0          10239
> Responses                       n/a         0           3839
> {code}
> Checking on node 2, we see the following exceptions:
> {code}
> ERROR [STREAM-IN-/10.171.122.130] 2013-10-16 22:42:58,961 StreamSession.java (line 410) [Stream #4e71a250-36b4-11e3-bedc-1d1bb5c9abab] Streaming error occurred
> java.lang.NullPointerException
>         at org.apache.cassandra.streaming.ConnectionHandler.sendMessage(ConnectionHandler.java:174)
>         at org.apache.cassandra.streaming.StreamSession.prepare(StreamSession.java:436)
>         at org.apache.cassandra.streaming.StreamSession.messageReceived(StreamSession.java:358)
>         at org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:293)
>         at java.lang.Thread.run(Thread.java:724)
> ...
> ERROR [STREAM-IN-/10.171.122.130] 2013-10-16 22:43:49,214 StreamSession.java (line 410) [Stream #6c64ded0-36b4-11e3-bedc-1d1bb5c9abab] Streaming error occurred
> java.lang.NullPointerException
>         at org.apache.cassandra.streaming.ConnectionHandler.sendMessage(ConnectionHandler.java:174)
>         at org.apache.cassandra.streaming.StreamSession.prepare(StreamSession.java:436)
>         at org.apache.cassandra.streaming.StreamSession.messageReceived(StreamSession.java:358)
>         at org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:293)
>         at java.lang.Thread.run(Thread.java:724)
> {code}
> nodetool netstats on node 2 reports:
> {code}
> automaton@ip-10-171-15-234:~$ nodetool netstats
> Mode: NORMAL
> Repair 4e71a250-36b4-11e3-bedc-1d1bb5c9abab
> Read Repair Statistics:
> Attempted: 0
> Mismatch (Blocking): 0
> Mismatch (Background): 0
> Pool Name                    Active   Pending      Completed
> Commands                        n/a         0           2562
> Responses                       n/a         0           4284
> {code}
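
The stream IDs in the node 2 errors match the repair IDs that netstats still lists, which suggests those sessions failed on one side and were never cleaned up on the other. A rough sketch for checking this automatically, assuming the netstats output and the system log have been captured as plain text, follows. The regular expressions are modeled only on the lines quoted above, and the function names are hypothetical.

```python
import re

UUID = r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
# "Repair <id>" lines as printed in the netstats output quoted above.
REPAIR_LINE = re.compile(rf"^\s*Repair ({UUID})\s*$")
# "[Stream #<id>] Streaming error occurred" as it appears in the quoted log lines.
STREAM_ERROR = re.compile(rf"\[Stream #({UUID})\] Streaming error occurred")


def pending_repairs(netstats_text):
    """Return the repair session IDs listed in netstats output, in order."""
    return [
        match.group(1)
        for line in netstats_text.splitlines()
        if (match := REPAIR_LINE.match(line))
    ]


def failed_streams(log_text):
    """Return the set of stream IDs that logged a streaming error."""
    return set(STREAM_ERROR.findall(log_text))


def stuck_sessions(netstats_text, log_text):
    """Return repair IDs still listed by netstats whose stream also logged an error."""
    failed = failed_streams(log_text)
    return [sid for sid in pending_repairs(netstats_text) if sid in failed]
```

Feeding it node 1's netstats output together with node 2's log should return both session IDs shown above. This only correlates IDs and says nothing about why the NullPointerException is thrown.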



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)