You are viewing a plain text version of this content. The canonical link for it is here.
Posted to jira@kafka.apache.org by GitBox <gi...@apache.org> on 2022/06/08 17:46:04 UTC

[GitHub] [kafka] mumrah opened a new pull request, #12269: KAFKA-13966 Set curClaimEpoch after we enqueue bootstrap write

mumrah opened a new pull request, #12269:
URL: https://github.com/apache/kafka/pull/12269

   This patch does two things to tighten up races during bootstrap. One is to simply move the volatile write until after the "bootstrapMetadata" event has been enqueued. The other thing is to prepend "bootstrapMetadata" to the queue.
   
   From the JIRA, this test was flaky when we see the "registerBroker" event precede the "boostrapMetadata" event
   
   ```
   handleLeaderChange() start
   appendWriteEvent(registerBroker)
   appendWriteEvent(bootstrapMetadata)
   handleLeaderChange() finish
   registerBroker() -> writes broker registration to log
   bootstrapMetadata() -> writes bootstrap metadata to log
   ```
   
   The test harness in QuorumControllerTest is polling the controllers to see when their current leader epoch matches the expected leader epoch
   
   ```java
           QuorumController activeController() throws InterruptedException {
           AtomicReference<QuorumController> value = new AtomicReference<>(null);
           TestUtils.retryOnExceptionWithTimeout(20000, 3, () -> {
               LeaderAndEpoch leader = logEnv.leaderAndEpoch();
               for (QuorumController controller : controllers) {
                   if (OptionalInt.of(controller.nodeId()).equals(leader.leaderId()) &&
                 --->  controller.curClaimEpoch() == leader.epoch()) { <---
                       value.set(controller);
                       break;
                   }
               }
   
               if (value.get() == null) {
                   throw new RuntimeException(String.format("Expected to see %s as leader", leader));
               }
           });
   
           return value.get();
       }
   ```
   
   Prior to this patch, we were setting the `curClaimEpoch` volatile before scheduling the "bootstrapMetadata" event. This creates a race where the test code can see the controller as active before the bootstrap metadata has been enqueued. By moving the volatile write after the "bootstrapMetadata" event enqueueing, we can eliminate this race.
   
   In production, there is nothing stopping clients from making requests to a controller that is in the process of starting up. Once the socket server begins processing requests, we will begin seeing events on the QuorumController queue. The check for active controller is done when events are processed, so we could have events enqueued at any time. By prepending the "bootstrapMetadata" event, we can ensure any events that see the controller as active will come after the bootstrap metadata has been processed.
   
   If we handle a "registerBroker" event before "bootstrapMetadata" is enqueued, then we are sure to see `curClaimEpoch = -1`. If a "registerBroker" is waiting in the queue while we finish bootstrapping, then we are sure to enqueue "bootstrapMetadata" before it. This means "registerBroker" (and any other event) is guaranteed to see the controller as inactive _or_ that it is active and the bootstrap metadata has been written.
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [kafka] hachikuji commented on a diff in pull request #12269: KAFKA-13966 Set curClaimEpoch after we enqueue bootstrap write

Posted by GitBox <gi...@apache.org>.
hachikuji commented on code in PR #12269:
URL: https://github.com/apache/kafka/pull/12269#discussion_r892732326


##########
metadata/src/main/java/org/apache/kafka/controller/QuorumController.java:
##########
@@ -942,7 +943,7 @@ public void handleLeaderChange(LeaderAndEpoch newLeader) {
                                     "at least 1. Got " + bootstrapMetadata.metadataVersion().featureLevel()));
                         } else {
                             metadataVersion = bootstrapMetadata.metadataVersion();
-                            future = appendWriteEvent("bootstrapMetadata", OptionalLong.empty(), () -> {
+                            future = prependWriteEvent("bootstrapMetadata", () -> {

Review Comment:
   Can we add a comment about this? 



##########
metadata/src/main/java/org/apache/kafka/controller/QuorumController.java:
##########
@@ -967,6 +968,10 @@ public void handleLeaderChange(LeaderAndEpoch newLeader) {
                         metadataVersion = featureControl.metadataVersion();
                     }
 
+                    curClaimEpoch = newEpoch;

Review Comment:
   Hmm, why is it necessary to reorder these fields?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [kafka] mumrah commented on a diff in pull request #12269: KAFKA-13966 Set curClaimEpoch after we enqueue bootstrap write

Posted by GitBox <gi...@apache.org>.
mumrah commented on code in PR #12269:
URL: https://github.com/apache/kafka/pull/12269#discussion_r892741833


##########
metadata/src/main/java/org/apache/kafka/controller/QuorumController.java:
##########
@@ -967,6 +968,10 @@ public void handleLeaderChange(LeaderAndEpoch newLeader) {
                         metadataVersion = featureControl.metadataVersion();
                     }
 
+                    curClaimEpoch = newEpoch;

Review Comment:
   Yea actually I think we can do without this change since we are prepending the event



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [kafka] mumrah merged pull request #12269: KAFKA-13966 Set curClaimEpoch after we enqueue bootstrap write

Posted by GitBox <gi...@apache.org>.
mumrah merged PR #12269:
URL: https://github.com/apache/kafka/pull/12269


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [kafka] mumrah commented on a diff in pull request #12269: KAFKA-13966 Set curClaimEpoch after we enqueue bootstrap write

Posted by GitBox <gi...@apache.org>.
mumrah commented on code in PR #12269:
URL: https://github.com/apache/kafka/pull/12269#discussion_r892741833


##########
metadata/src/main/java/org/apache/kafka/controller/QuorumController.java:
##########
@@ -967,6 +968,10 @@ public void handleLeaderChange(LeaderAndEpoch newLeader) {
                         metadataVersion = featureControl.metadataVersion();
                     }
 
+                    curClaimEpoch = newEpoch;

Review Comment:
   This is primarily to fix the race in the test code. If we update the volatile sooner, there is a possibility the test code will see the controller as active, enqueue the brokerRegister event, and process the event before the bootstrap metadata gets enqueued.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org