Posted to jira@kafka.apache.org by "dengziming (via GitHub)" <gi...@apache.org> on 2023/05/30 02:42:12 UTC

[GitHub] [kafka] dengziming opened a new pull request, #13777: KAFKA-15036: UnknownServerError on any leader failover

dengziming opened a new pull request, #13777:
URL: https://github.com/apache/kafka/pull/13777

   *More detailed description of your change*
   This bug was introduced in #13679. We invoke `finalizedFeatures` on receiving an `ApiVersionRequest`; however, a standby controller can receive an `ApiVersionRequest` during leader failover, and it throws UnknownServerError because we don't invoke `snapshotRegistry.getOrCreateSnapshot(lastCommittedOffset)` in a standby controller.
   
   I reproduced this in #13761, in which we stop the active controller and restart it, and the error is thrown.
   We can fix it temporarily by invoking `snapshotRegistry.getOrCreateSnapshot(lastCommittedOffset)` in a standby controller; a better way is to use `Long.MAX_VALUE` as the epoch in a standby controller.
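
   To illustrate the failure mode and the proposed fix, here is a minimal, self-contained sketch. It is a toy model, not Kafka's actual `SnapshotRegistry` or timeline classes: `StandbyEpochSketch`, `ToyTimelineValue`, `LATEST`, and `chooseEpoch` are hypothetical names invented for this example. It assumes only what is described above: reading at an epoch that has no snapshot fails, while reading at `Long.MAX_VALUE` returns the current value.

   ```java
   import java.util.HashMap;
   import java.util.Map;

   public class StandbyEpochSketch {
       // Assumed sentinel meaning "read the current value", mirroring the Long.MAX_VALUE idea above.
       static final long LATEST = Long.MAX_VALUE;

       /** Toy stand-in for a snapshot-backed value; hypothetical, not a Kafka class. */
       static final class ToyTimelineValue {
           private final Map<Long, Short> snapshots = new HashMap<>();
           private short current;

           void set(short value) {
               current = value;
           }

           // An active controller would take a snapshot at each committed offset.
           void snapshot(long epoch) {
               snapshots.put(epoch, current);
           }

           short get(long epoch) {
               if (epoch == LATEST) {
                   return current; // no snapshot lookup needed
               }
               Short value = snapshots.get(epoch);
               if (value == null) {
                   throw new IllegalStateException("No snapshot for epoch " + epoch);
               }
               return value;
           }
       }

       // Same selection rule as the one-line change in this PR, on toy types.
       static long chooseEpoch(boolean isActive, long lastCommittedOffset) {
           return isActive ? lastCommittedOffset : LATEST;
       }

       public static void main(String[] args) {
           ToyTimelineValue featureLevel = new ToyTimelineValue();
           featureLevel.set((short) 7);
           long lastCommittedOffset = 42L;

           // Standby case: no snapshot was ever taken at lastCommittedOffset.
           boolean failed = false;
           try {
               featureLevel.get(lastCommittedOffset);
           } catch (IllegalStateException e) {
               failed = true;
           }
           System.out.println("standby read at committed offset failed: " + failed);
           System.out.println("standby read at LATEST: "
               + featureLevel.get(chooseEpoch(false, lastCommittedOffset)));
       }
   }
   ```

   In this sketch the standby never calls `snapshot`, so only the `LATEST` read succeeds; an active controller that snapshots at each committed offset can keep reading at `lastCommittedOffset`.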
   
   We should also be careful about any RPCs sent to a standby controller; currently, that is only `ApiVersionRequest`.
   
   *Summary of testing strategy (including rationale)*
   I tested it in #13761, but there are some code changes we need to make so that the code looks better.
   
   
   ### Committer Checklist (excluded from commit message)
   - [ ] Verify design and implementation 
   - [ ] Verify test coverage and CI build status
   - [ ] Verify documentation (including upgrade notes)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [kafka] KarboniteKream commented on a diff in pull request #13777: KAFKA-15036: UnknownServerError on any leader failover

Posted by "KarboniteKream (via GitHub)" <gi...@apache.org>.
KarboniteKream commented on code in PR #13777:
URL: https://github.com/apache/kafka/pull/13777#discussion_r1209671539


##########
metadata/src/main/java/org/apache/kafka/controller/QuorumController.java:
##########
@@ -2028,8 +2028,11 @@ public CompletableFuture<FinalizedControllerFeatures> finalizedFeatures(
         if (lastCommittedOffset == -1) {
             return CompletableFuture.completedFuture(new FinalizedControllerFeatures(Collections.emptyMap(), -1));
         }
+        // It's possible for a standby controller to receiving ApiVersionRequest and we do not have any timeline snapshot
+        // in a standby controller, in this case we use Long.MAX_VALUE.
+        long epoch = isActive() ? lastCommittedOffset : Long.MAX_VALUE;

Review Comment:
   I wonder, would using `SnapshotRegistry.LATEST_EPOCH` make more sense semantically? We're also specifically checking for that value in `TimelineObject.get()`.





[GitHub] [kafka] dengziming commented on pull request #13777: KAFKA-15036: UnknownServerError on any leader failover

Posted by "dengziming (via GitHub)" <gi...@apache.org>.
dengziming commented on PR #13777:
URL: https://github.com/apache/kafka/pull/13777#issuecomment-1573073315

   This will be fixed in #13799.




[GitHub] [kafka] KarboniteKream commented on a diff in pull request #13777: KAFKA-15036: UnknownServerError on any leader failover

Posted by "KarboniteKream (via GitHub)" <gi...@apache.org>.
KarboniteKream commented on code in PR #13777:
URL: https://github.com/apache/kafka/pull/13777#discussion_r1209671539


##########
metadata/src/main/java/org/apache/kafka/controller/QuorumController.java:
##########
@@ -2028,8 +2028,11 @@ public CompletableFuture<FinalizedControllerFeatures> finalizedFeatures(
         if (lastCommittedOffset == -1) {
             return CompletableFuture.completedFuture(new FinalizedControllerFeatures(Collections.emptyMap(), -1));
         }
+        // It's possible for a standby controller to receiving ApiVersionRequest and we do not have any timeline snapshot
+        // in a standby controller, in this case we use Long.MAX_VALUE.
+        long epoch = isActive() ? lastCommittedOffset : Long.MAX_VALUE;

Review Comment:
   I wonder, would using `SnapshotRegistry.LATEST_EPOCH` make more sense semantically? We're also specifically checking for that value in `TimelineObject.get(long)`.





[GitHub] [kafka] dengziming closed pull request #13777: KAFKA-15036: UnknownServerError on any leader failover

Posted by "dengziming (via GitHub)" <gi...@apache.org>.
dengziming closed pull request #13777: KAFKA-15036: UnknownServerError on any leader failover
URL: https://github.com/apache/kafka/pull/13777




[GitHub] [kafka] dengziming commented on a diff in pull request #13777: KAFKA-15036: UnknownServerError on any leader failover

Posted by "dengziming (via GitHub)" <gi...@apache.org>.
dengziming commented on code in PR #13777:
URL: https://github.com/apache/kafka/pull/13777#discussion_r1211294473


##########
core/src/test/scala/integration/kafka/server/KRaftClusterTest.scala:
##########
@@ -96,6 +96,32 @@ class KRaftClusterTest {
     }
   }
 
+  @Test
+  def testCreateClusterAndRestartControllerNode(): Unit = {

Review Comment:
   @soarez Yes, we shut down the active controller, then restart it so that it sends an `ApiVersionRequest` to a standby controller.



##########
metadata/src/main/java/org/apache/kafka/controller/QuorumController.java:
##########
@@ -2028,8 +2028,11 @@ public CompletableFuture<FinalizedControllerFeatures> finalizedFeatures(
         if (lastCommittedOffset == -1) {
             return CompletableFuture.completedFuture(new FinalizedControllerFeatures(Collections.emptyMap(), -1));
         }
+        // It's possible for a standby controller to receiving ApiVersionRequest and we do not have any timeline snapshot
+        // in a standby controller, in this case we use Long.MAX_VALUE.
+        long epoch = isActive() ? lastCommittedOffset : Long.MAX_VALUE;

Review Comment:
   Good suggestion.





[GitHub] [kafka] soarez commented on a diff in pull request #13777: KAFKA-15036: UnknownServerError on any leader failover

Posted by "soarez (via GitHub)" <gi...@apache.org>.
soarez commented on code in PR #13777:
URL: https://github.com/apache/kafka/pull/13777#discussion_r1209972559


##########
metadata/src/main/java/org/apache/kafka/controller/QuorumController.java:
##########
@@ -2028,8 +2028,11 @@ public CompletableFuture<FinalizedControllerFeatures> finalizedFeatures(
         if (lastCommittedOffset == -1) {
             return CompletableFuture.completedFuture(new FinalizedControllerFeatures(Collections.emptyMap(), -1));
         }
+        // It's possible for a standby controller to receiving ApiVersionRequest and we do not have any timeline snapshot

Review Comment:
   ```suggestion
           // It's possible for a standby controller to receive a ApiVersionRequest and we do not have any timeline snapshot
   ```





[GitHub] [kafka] KarboniteKream commented on a diff in pull request #13777: KAFKA-15036: UnknownServerError on any leader failover

Posted by "KarboniteKream (via GitHub)" <gi...@apache.org>.
KarboniteKream commented on code in PR #13777:
URL: https://github.com/apache/kafka/pull/13777#discussion_r1209671539


##########
metadata/src/main/java/org/apache/kafka/controller/QuorumController.java:
##########
@@ -2028,8 +2028,11 @@ public CompletableFuture<FinalizedControllerFeatures> finalizedFeatures(
         if (lastCommittedOffset == -1) {
             return CompletableFuture.completedFuture(new FinalizedControllerFeatures(Collections.emptyMap(), -1));
         }
+        // It's possible for a standby controller to receiving ApiVersionRequest and we do not have any timeline snapshot
+        // in a standby controller, in this case we use Long.MAX_VALUE.
+        long epoch = isActive() ? lastCommittedOffset : Long.MAX_VALUE;

Review Comment:
   I wonder, would using `SnapshotRegistry.LATEST_EPOCH` make more sense semantically? We're also specifically checking for that value in `TimelineObject`.





[GitHub] [kafka] dengziming commented on pull request #13777: KAFKA-15036: UnknownServerError on any leader failover

Posted by "dengziming (via GitHub)" <gi...@apache.org>.
dengziming commented on PR #13777:
URL: https://github.com/apache/kafka/pull/13777#issuecomment-1567692572

   @showuon @KarboniteKream @hachikuji I think this bug is very serious since it will stop any leader failover from functioning normally, so please take a look as soon as possible. 

