Posted to commits@cassandra.apache.org by "Peter Schuller (JIRA)" <ji...@apache.org> on 2011/01/20 15:00:50 UTC

[jira] Commented: (CASSANDRA-2015) Propagation of schema changes got out of sync with node's notion of ring

    [ https://issues.apache.org/jira/browse/CASSANDRA-2015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12984194#action_12984194 ] 

Peter Schuller commented on CASSANDRA-2015:
-------------------------------------------

I now believe this was triggered by concurrent migrations: a test was doing drop/create in quick succession through a Hector client pointed at the whole cluster instead of a single machine.

I have not confirmed this, but I can definitely trigger a similar symptom by submitting 'create keyspace conflict;' more or less concurrently on two nodes using cassandra-cli. One of the nodes then permanently stops receiving schema migrations, even after a restart.

So, until I know otherwise, I'll consider this operator error; since concurrent migrations are known to be unsupported, it is not a bug.
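
For illustration, a minimal sketch of how a test client might avoid issuing concurrent migrations: send every schema change to one fixed node, one at a time. Everything here is hypothetical (the class name, the `submit` callback, the host names); it is not part of Cassandra or Hector. It only serializes changes within a single process and does not wait for a migration to propagate before sending the next one.

```python
import threading


class SchemaChangeSerializer:
    """Funnels all schema changes through one node, one at a time.

    Hypothetical helper: `submit(host, statement)` stands in for whatever
    function the test suite already uses to send a statement to a host.
    """

    def __init__(self, hosts, submit):
        if not hosts:
            raise ValueError("need at least one host")
        # Always pick the same node for migrations, regardless of how
        # regular reads and writes are load-balanced across the cluster.
        self.schema_host = sorted(hosts)[0]
        self._submit = submit
        self._lock = threading.Lock()

    def apply(self, statement):
        # The lock ensures this process has only one migration in flight.
        with self._lock:
            return self._submit(self.schema_host, statement)


# Usage with a stand-in submit function that just records the calls.
calls = []


def fake_submit(host, statement):
    calls.append((host, statement))
    return host


serializer = SchemaChangeSerializer(["node2", "node1", "node3"], fake_submit)
serializer.apply("drop keyspace conflict;")
serializer.apply("create keyspace conflict;")
```

With this arrangement both statements go to the same node in order, which is the single-machine setup the failing test was missing.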

(Diagnosis mechanisms may be improved, and recovery instructions written, but that is a separate concern.)


> Propagation of schema changes got out of sync with node's notion of ring
> ------------------------------------------------------------------------
>
>                 Key: CASSANDRA-2015
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2015
>             Project: Cassandra
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Peter Schuller
>
> I have a three-node 0.7.0 test cluster with nodes 1, 2, and 3. Nodes 1 and 2 are seeds; node 3 is not.
> I had a situation where the following was observed:
> * Schema changes submitted to node 1 would not propagate to any other node (observational method: tail syslog and don't see any flushing of system memtables/etc except on node 1).
> * Schema changes submitted to node 2 or 3 would propagate between them, or to all (not sure which).
> * Mutations submitted on node 1 *would* get propagated to node 3.
> * All nodes knew of each other and considered themselves up according to 'nodetool ring'.
> * Because node 3 never got schema migrations, writes submitted to node 1 that got sent to node 3 blocked for extended periods of time on node 1, while triggering an exception on node 3 because of an invalid cfid in the row mutation.
> * I cannot be entirely sure whether just a regular restart would have fixed the problem.
> Unfortunately, I was not aware of the problem until running some unit tests against the cluster and I cannot say for sure which order the machines were bootstrapped in.
> After initial discovery I switched to manually submitting 'create keyspace x;' via cassandra-cli on each node (for different ks:es or interleaving create/drop), and observing results in syslog.
> The observations w.r.t. row mutations did not come from the manual test, but rather from the unit test that failed so there is some chance that there was a different mode of failure than during my cassandra-cli tests.
> Stopping all nodes, wiping the data directories, and restarting fixed the problem, and so far I have not been able to trigger it again. I am not sure whether just restarting the nodes would have fixed it.
> It definitely seems like a problem to me that schema changes did not propagate even though node 1 was apparently sufficiently aware of the other node (3) to send mutations to it, even if the original problem may have been due to some kind of operational error.
> I'd be interested in hearing speculation about what the likely triggers may be.
