You are viewing a plain text version of this content. The canonical link for it is here.
Posted to jira@kafka.apache.org by "cmccabe (via GitHub)" <gi...@apache.org> on 2023/06/02 00:00:33 UTC

[GitHub] [kafka] cmccabe opened a new pull request, #13799: KAFKA-15048: Improve handling of unexpected quorum controller errors

cmccabe opened a new pull request, #13799:
URL: https://github.com/apache/kafka/pull/13799

   When the active quorum controller encounters an "unexpected" error, such as a NullPointerException, it currently resigns its leadership. This PR fixes it so that in addition to doing that, it also increments the metadata error count metric. This will allow us to better track down these errors.
   
   This PR also fixes a minor bug where performing read operations on a standby controller would result in an unexpected RuntimeException. The bug happened because the standby controller does not take in-memory snapshots, and read operations were attempting to read from the epoch of the latest committed offset. The fix is for the standby controller to simply read the latest value of each data structure. This is always safe, because standby controllers don't contain uncommitted data.
   
   Also, fix a bug where listPartitionReassignments was reading the latest data, rather than data from the last committed offset.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [kafka] cmccabe commented on a diff in pull request #13799: KAFKA-15048: Improve handling of unexpected quorum controller errors

Posted by "cmccabe (via GitHub)" <gi...@apache.org>.
cmccabe commented on code in PR #13799:
URL: https://github.com/apache/kafka/pull/13799#discussion_r1214678452


##########
metadata/src/main/java/org/apache/kafka/controller/errors/ControllerExceptions.java:
##########
@@ -38,4 +47,104 @@ public static boolean isTimeoutException(Throwable exception) {
         if (!(exception instanceof TimeoutException)) return false;
         return true;
     }
+
+    /**
+     * Check if an exception is a NotController exception.
+     *
+     * @param exception     The exception to check.
+     * @return              True if the exception is a NotController exception.
+     */
+    public static boolean isNotControllerException(Throwable exception) {
+        if (exception == null) return false;
+        if (exception instanceof ExecutionException) {
+            exception = exception.getCause();
+            if (exception == null) return false;
+        }
+        if (!(exception instanceof NotControllerException)) return false;
+        return true;
+    }
+
+    /**
+     * Create a new exception indicating that the controller is in pre-migration mode, so the
+     * operation cannot be completed.
+     *
+     * @param latestController  The current controller.
+     * @return                  The new NotControllerException.
+     */
+    public static NotControllerException newPreMigrationException(OptionalInt latestController) {

Review Comment:
   done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [kafka] cmccabe commented on a diff in pull request #13799: KAFKA-15048: Improve handling of unexpected quorum controller errors

Posted by "cmccabe (via GitHub)" <gi...@apache.org>.
cmccabe commented on code in PR #13799:
URL: https://github.com/apache/kafka/pull/13799#discussion_r1214677896


##########
metadata/src/main/java/org/apache/kafka/controller/QuorumController.java:
##########
@@ -425,25 +430,24 @@ public void accept(ConfigResource configResource) {
 
     public static final String CONTROLLER_THREAD_SUFFIX = "QuorumControllerEventHandler";
 
-    private static final String ACTIVE_CONTROLLER_EXCEPTION_TEXT_PREFIX =
-        "The active controller appears to be node ";
-
-    private NotControllerException newNotControllerException() {
-        OptionalInt latestController = raftClient.leaderAndEpoch().leaderId();
-        if (latestController.isPresent()) {
-            return new NotControllerException(ACTIVE_CONTROLLER_EXCEPTION_TEXT_PREFIX +
-                latestController.getAsInt() + ".");
-        } else {
-            return new NotControllerException("No controller appears to be active.");
-        }
+    private OptionalInt latestController() {
+        return raftClient.leaderAndEpoch().leaderId();
     }
 
-    private NotControllerException newPreMigrationException() {
-        OptionalInt latestController = raftClient.leaderAndEpoch().leaderId();
-        if (latestController.isPresent()) {
-            return new NotControllerException("The controller is in pre-migration mode.");
+    /**
+     * @return          The offset that we should perform read operations at.
+     */
+    private long currentReadOffset() {
+        if (isActiveController()) {
+            // The active controller keeps an in-memory snapshot at the last committed offset,
+            // which we want to read from when performing read operations. This will avoid
+            // reading uncommitted data.
+            return lastCommittedOffset;
         } else {
-            return new NotControllerException("No controller appears to be active.");
+            // Standby controllers never have uncommitted data in memory. Therefore, we return
+            // Long.MAX_VALUE, a special value which means "always read the latest from every
+            // data structure."
+            return Long.MAX_VALUE;

Review Comment:
   Good point. Fixed.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [kafka] cmccabe commented on a diff in pull request #13799: KAFKA-15048: Improve handling of unexpected quorum controller errors

Posted by "cmccabe (via GitHub)" <gi...@apache.org>.
cmccabe commented on code in PR #13799:
URL: https://github.com/apache/kafka/pull/13799#discussion_r1214678682


##########
metadata/src/main/java/org/apache/kafka/controller/QuorumController.java:
##########
@@ -2024,13 +2029,8 @@ public CompletableFuture<ElectLeadersResponseData> electLeaders(
     public CompletableFuture<FinalizedControllerFeatures> finalizedFeatures(
         ControllerRequestContext context
     ) {
-        // It's possible that we call ApiVersionRequest before consuming the log since ApiVersionRequest is sent when
-        // initialize NetworkClient, we should not return an error since it would stop the NetworkClient from working correctly.
-        if (lastCommittedOffset == -1) {

Review Comment:
   yes, this was previously a bug.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [kafka] cmccabe merged pull request #13799: KAFKA-15048: Improve handling of unexpected quorum controller errors

Posted by "cmccabe (via GitHub)" <gi...@apache.org>.
cmccabe merged PR #13799:
URL: https://github.com/apache/kafka/pull/13799


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [kafka] dengziming commented on a diff in pull request #13799: KAFKA-15048: Improve handling of unexpected quorum controller errors

Posted by "dengziming (via GitHub)" <gi...@apache.org>.
dengziming commented on code in PR #13799:
URL: https://github.com/apache/kafka/pull/13799#discussion_r1213858197


##########
metadata/src/main/java/org/apache/kafka/controller/QuorumController.java:
##########
@@ -425,25 +430,24 @@ public void accept(ConfigResource configResource) {
 
     public static final String CONTROLLER_THREAD_SUFFIX = "QuorumControllerEventHandler";
 
-    private static final String ACTIVE_CONTROLLER_EXCEPTION_TEXT_PREFIX =
-        "The active controller appears to be node ";
-
-    private NotControllerException newNotControllerException() {
-        OptionalInt latestController = raftClient.leaderAndEpoch().leaderId();
-        if (latestController.isPresent()) {
-            return new NotControllerException(ACTIVE_CONTROLLER_EXCEPTION_TEXT_PREFIX +
-                latestController.getAsInt() + ".");
-        } else {
-            return new NotControllerException("No controller appears to be active.");
-        }
+    private OptionalInt latestController() {
+        return raftClient.leaderAndEpoch().leaderId();
     }
 
-    private NotControllerException newPreMigrationException() {
-        OptionalInt latestController = raftClient.leaderAndEpoch().leaderId();
-        if (latestController.isPresent()) {
-            return new NotControllerException("The controller is in pre-migration mode.");
+    /**
+     * @return          The offset that we should perform read operations at.
+     */
+    private long currentReadOffset() {
+        if (isActiveController()) {
+            // The active controller keeps an in-memory snapshot at the last committed offset,
+            // which we want to read from when performing read operations. This will avoid
+            // reading uncommitted data.
+            return lastCommittedOffset;
         } else {
-            return new NotControllerException("No controller appears to be active.");
+            // Standby controllers never have uncommitted data in memory. Therefore, we return
+            // Long.MAX_VALUE, a special value which means "always read the latest from every
+            // data structure."
+            return Long.MAX_VALUE;

Review Comment:
   How about using `SnapshotRegistry.LATEST_EPOCH`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [kafka] mumrah commented on a diff in pull request #13799: KAFKA-15048: Improve handling of unexpected quorum controller errors

Posted by "mumrah (via GitHub)" <gi...@apache.org>.
mumrah commented on code in PR #13799:
URL: https://github.com/apache/kafka/pull/13799#discussion_r1214563481


##########
metadata/src/main/java/org/apache/kafka/controller/errors/ControllerExceptions.java:
##########
@@ -38,4 +47,104 @@ public static boolean isTimeoutException(Throwable exception) {
         if (!(exception instanceof TimeoutException)) return false;
         return true;
     }
+
+    /**
+     * Check if an exception is a NotController exception.
+     *
+     * @param exception     The exception to check.
+     * @return              True if the exception is a NotController exception.
+     */
+    public static boolean isNotControllerException(Throwable exception) {
+        if (exception == null) return false;
+        if (exception instanceof ExecutionException) {
+            exception = exception.getCause();
+            if (exception == null) return false;
+        }
+        if (!(exception instanceof NotControllerException)) return false;
+        return true;
+    }
+
+    /**
+     * Create a new exception indicating that the controller is in pre-migration mode, so the
+     * operation cannot be completed.
+     *
+     * @param latestController  The current controller.
+     * @return                  The new NotControllerException.
+     */
+    public static NotControllerException newPreMigrationException(OptionalInt latestController) {

Review Comment:
   I'm guessing the argument here is the controller ID? Can you rename it to reflect that? (same for other methods in this class)



##########
metadata/src/main/java/org/apache/kafka/controller/QuorumController.java:
##########
@@ -2024,13 +2029,8 @@ public CompletableFuture<ElectLeadersResponseData> electLeaders(
     public CompletableFuture<FinalizedControllerFeatures> finalizedFeatures(
         ControllerRequestContext context
     ) {
-        // It's possible that we call ApiVersionRequest before consuming the log since ApiVersionRequest is sent when
-        // initialize NetworkClient, we should not return an error since it would stop the NetworkClient from working correctly.
-        if (lastCommittedOffset == -1) {

Review Comment:
   Was this a bug previously? If another node sent an ApiVersionRequest prior to loading the log, we would have not returned any MetadataVersion (whereas with this patch, we would). 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org