You are viewing a plain text version of this content. The canonical link for it is here.
Posted to pr@cassandra.apache.org by GitBox <gi...@apache.org> on 2022/01/20 11:53:24 UTC

[GitHub] [cassandra] adelapena commented on a change in pull request #1168: CASSANDRA-16878 trunk: Race in commit log replay can cause rejected mutations

adelapena commented on a change in pull request #1168:
URL: https://github.com/apache/cassandra/pull/1168#discussion_r788690244



##########
File path: src/java/org/apache/cassandra/db/commitlog/CommitLogReplayer.java
##########
@@ -436,16 +455,23 @@ public void handleMutation(Mutation m, int size, int entryLocation, CommitLogDes
             sawCDCMutation = true;
 
         pendingMutationBytes += size;
+
+        boolean isSchemaMutation = SchemaConstants.isSchemaKeyspace(m.getKeyspaceName());
+
+        if (isSchemaMutation)
+            writeOrder.awaitNewBarrier();

Review comment:
       I don't understand how counting schema mutations would help us to verify that we are reproducing the problematic case, since we are not sending more mutations of this kind and the problem is due to the order in which mutations are replayed. Also, regular node starts include multiple schema mutations (~20) besides the ones that we introduce with the test, and it's not easy to identify which ones are the relevant ones with byteman.
   
   I think that the induced 2 seconds delay is a really long time for replaying mutations, and it seems enough to guarantee that not-schema mutations are replayed before schema mutations. Indeed, the test passes [1000 iterations against the patched branch](https://app.circleci.com/pipelines/github/adelapena/cassandra/1246/workflows/9ab6f30b-8e87-486f-a5a7-2d1e74873ea5), while these other [1000 runs against unpatched trunk](https://app.circleci.com/pipelines/github/adelapena/cassandra/1248/workflows/03e75af8-985a-42ea-84f5-cf5f7480d6a7) fail every single time. 
   
   However, running the test against trunk fails after the first replay, which doesn't even need the byteman rule. I we slightly modify the test to skip the first replay and go straight to the second replay with the byteman rule, [this way](https://github.com/adelapena/cassandra-dtest/commit/d54cc4878d674d3736bd71c88ae192f08aa9a622), the modified tests fails [1000 runs](https://app.circleci.com/pipelines/github/adelapena/cassandra/1250/workflows/75ee8eba-c7a7-4e7c-96bb-30e82d2f4520) against trunk, while it succeeds another [1000 runs](https://app.circleci.com/pipelines/github/adelapena/cassandra/1249/workflows/9182133c-aee0-4b58-bf1b-8082ff3a3cbf) against the patched branch. 
   
   I thinks these runs show the stability of the test. We could try to add some locking with a couple of byteman rules to be able to somehow produce the desired ordering of replayed mutations, more or less reverting the action of the introduced `writeOrder`, but I think that would introduce unneeded complexity, making the rules more sensitive to future changes in `CommitLogReplayer`. wdyt?
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: pr-unsubscribe@cassandra.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: pr-unsubscribe@cassandra.apache.org
For additional commands, e-mail: pr-help@cassandra.apache.org