Posted to jira@kafka.apache.org by GitBox <gi...@apache.org> on 2020/11/09 10:51:03 UTC

[GitHub] [kafka] dengziming opened a new pull request #9577: KAFKA-9837: KIP-589 new RPC for notifying controller log dir failure

dengziming opened a new pull request #9577:
URL: https://github.com/apache/kafka/pull/9577


   This patch implements [KIP-589](https://cwiki.apache.org/confluence/display/KAFKA/KIP-589+Add+API+to+update+Replica+state+in+Controller), which introduces an asynchronous API for brokers to notify the controller of log dir failures.
   
   *Summary of testing strategy (including rationale)
   for the feature or bug fix. Unit and/or integration
   tests are expected for any behaviour change and
   system tests should be considered for larger changes.*
   1. Unit test for LogDirEventManagerImpl
   2. Integration test for new behavior
   
   ### Committer Checklist (excluded from commit message)
   - [ ] Verify design and implementation 
   - [ ] Verify test coverage and CI build status
   - [ ] Verify documentation (including upgrade notes)
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] soarez commented on a change in pull request #9577: KAFKA-9837: KIP-589 new RPC for notifying controller log dir failure

Posted by GitBox <gi...@apache.org>.
soarez commented on a change in pull request #9577:
URL: https://github.com/apache/kafka/pull/9577#discussion_r760091660



##########
File path: core/src/main/scala/kafka/controller/KafkaController.scala
##########
@@ -2443,6 +2443,117 @@ class KafkaController(val config: KafkaConfig,
     }
   }
 
+  def alterReplicaState(alterReplicaStateRequest: AlterReplicaStateRequest,
+                        callback: AlterReplicaStateResponseData => Unit): Unit = {
+    val alterReplicaStateRequestDataData = alterReplicaStateRequest.data()

Review comment:
       Should this be named `alterReplicaStateRequestData` instead?

##########
File path: core/src/main/scala/kafka/server/ReplicaManager.scala
##########
@@ -2281,4 +2288,31 @@ class ReplicaManager(val config: KafkaConfig,
       }
     }
   }
+
+  def sentLogDirEvent(dir: String, topicPartitionsInDir:Seq[TopicPartition], newState: Byte, reason: String): Unit = {
+    if (topicPartitionsInDir.nonEmpty) {
+      val logDirEventItem = AlterReplicaStateItem(topicPartitionsInDir.asJava, newState, reason, handleAlterReplicaStateResponse)
+
+      logDirEventManager.handleAlterReplicaStateChanges(logDirEventItem)
+    } else {
+      info(s"Log dir: $dir contains none partitions, ignore log dir event")
+    }
+  }
+
+  /**
+   * Visible for testing
+   * @return true if we need not to retry, which means the response contains no error or only UNKNOWN_TOPIC_OR_PARTITION error

Review comment:
       This is incorrect: there's no return value.

##########
File path: core/src/main/scala/kafka/server/KafkaServer.scala
##########
@@ -313,6 +315,21 @@ class KafkaServer(
         }
         alterIsrManager.start()
 
+        val alterReplicaStateChannelManager = new BrokerToControllerChannelManagerImpl(

Review comment:
       Same here, should we prefer `BrokerToControllerChannelManager`'s `apply`?

##########
File path: clients/src/main/java/org/apache/kafka/common/protocol/Errors.java
##########
@@ -364,7 +365,8 @@
     INCONSISTENT_TOPIC_ID(103, "The log's topic ID did not match the topic ID in the request", InconsistentTopicIdException::new),
     INCONSISTENT_CLUSTER_ID(104, "The clusterId in the request does not match that found on the server", InconsistentClusterIdException::new),
     TRANSACTIONAL_ID_NOT_FOUND(105, "The transactionalId could not be found", TransactionalIdNotFoundException::new),
-    FETCH_SESSION_TOPIC_ID_ERROR(106, "The fetch session encountered inconsistent topic ID usage", FetchSessionTopicIdException::new);
+    FETCH_SESSION_TOPIC_ID_ERROR(106, "The fetch session encountered inconsistent topic ID usage", FetchSessionTopicIdException::new),
+    UNKNOWN_REPLICA_STATE(107, "Replica state change only support OfflineState, see ReplicaState.state ", UnknownReplicaStateException::new);

Review comment:
       ```suggestion
       UNKNOWN_REPLICA_STATE(107, "Replica state change only supports OfflineState, see ReplicaState.state ", UnknownReplicaStateException::new);
       ```

##########
File path: core/src/main/scala/kafka/server/BrokerServer.scala
##########
@@ -255,6 +257,21 @@ class BrokerServer(
       )
       alterIsrManager.start()
 
+      val alterReplicaStateChannelManager = new BrokerToControllerChannelManagerImpl(

Review comment:
       Everywhere else we seem to prefer `BrokerToControllerChannelManager`'s `apply` over directly invoking `BrokerToControllerChannelManagerImpl`'s constructor. Should we stick to the same pattern?
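
For illustration, the pattern referred to here can be sketched in isolation. This is a minimal sketch with hypothetical names (`ChannelManager`, `ChannelManagerImpl`) and made-up parameters, not Kafka's actual classes or signatures. Callers go through the companion object's `apply`, so the concrete implementation can change without touching call sites.

```scala
// Hypothetical stand-ins for illustration only; not Kafka's real API.
trait ChannelManager {
  def channelName: String
}

// Concrete implementation that call sites should not construct directly.
class ChannelManagerImpl(val channelName: String, val retryTimeoutMs: Long) extends ChannelManager

object ChannelManager {
  // Factory method: the single place that knows about the concrete class.
  def apply(channelName: String, retryTimeoutMs: Long): ChannelManager =
    new ChannelManagerImpl(channelName, retryTimeoutMs)
}

object ApplyFactoryExample {
  def main(args: Array[String]): Unit = {
    // The call site depends only on the trait and its companion object.
    val manager = ChannelManager("alterReplicaStateChannel", Long.MaxValue)
    println(manager.channelName)
  }
}
```

With this shape, swapping `ChannelManagerImpl` for another implementation only requires a change inside `ChannelManager.apply`.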

##########
File path: core/src/main/scala/kafka/server/KafkaServer.scala
##########
@@ -313,6 +315,21 @@ class KafkaServer(
         }
         alterIsrManager.start()
 
+        val alterReplicaStateChannelManager = new BrokerToControllerChannelManagerImpl(
+          controllerNodeProvider = MetadataCacheControllerNodeProvider(config, metadataCache),
+          time = time,
+          metrics = metrics,
+          config = config,
+          channelName = "alterReplicaStateChannel",
+          threadNamePrefix = threadNamePrefix,
+          retryTimeoutMs = Long.MaxValue)
+        if (config.interBrokerProtocolVersion >= kafka.api.KAFKA_3_0_IV1) {

Review comment:
       Why not `config.interBrokerProtocolVersion.isAlterReplicaStateSupported`?
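
The idea behind such a predicate can be sketched as follows. This is a minimal sketch under stated assumptions: `ProtocolVersion` and its numeric ids are hypothetical and are not `kafka.api.ApiVersion`. It only shows how a named capability check keeps the "which version introduced this RPC" knowledge in one place instead of repeating a raw version comparison at every call site.

```scala
// Hypothetical version type for illustration; not kafka.api.ApiVersion.
// The ids are made up and only need to be ordered.
sealed abstract class ProtocolVersion(val id: Int) {
  // Capability predicate: true from the version that introduced the RPC onwards.
  def isAlterReplicaStateSupported: Boolean = id >= ProtocolVersion.V3_0_IV1.id
}

object ProtocolVersion {
  case object V2_8_IV1 extends ProtocolVersion(1)
  case object V3_0_IV0 extends ProtocolVersion(2)
  case object V3_0_IV1 extends ProtocolVersion(3)
}

object CapabilityCheckExample {
  def main(args: Array[String]): Unit = {
    val configured: ProtocolVersion = ProtocolVersion.V3_0_IV1
    // Call sites ask about the capability rather than comparing versions directly.
    if (configured.isAlterReplicaStateSupported) println("AlterReplicaState enabled")
    else println("AlterReplicaState disabled")
  }
}
```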

##########
File path: core/src/main/scala/kafka/server/KafkaApis.scala
##########
@@ -3276,6 +3277,23 @@ class KafkaApis(val requestChannel: RequestChannel,
     }
   }
 
+  def handleAlterReplicaStateRequest(request: RequestChannel.Request): Unit = {
+    val zkSupport = metadataSupport.requireZkOrThrow(KafkaApis.shouldNeverReceive(request))

Review comment:
       Very likely a silly question, but this seems to be the only handler for `AlterReplicaStateRequest`, and I'm struggling to understand how this will work in KRaft. Don't we need a `ControllerApis` counterpart?

##########
File path: core/src/main/scala/kafka/server/LogDirEventManagerImpl.scala
##########
@@ -0,0 +1,198 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package kafka.server
+
+import java.util
+import java.util.concurrent.atomic.{AtomicBoolean, AtomicLong}
+import java.util.concurrent.{LinkedBlockingQueue, TimeUnit}
+
+import kafka.metrics.KafkaMetricsGroup
+import kafka.utils.{Logging, Scheduler}
+import org.apache.kafka.clients.ClientResponse
+import org.apache.kafka.common.TopicPartition
+import org.apache.kafka.common.message.{AlterReplicaStateRequestData, AlterReplicaStateResponseData}
+import org.apache.kafka.common.protocol.Errors
+import org.apache.kafka.common.requests.{AlterReplicaStateRequest, AlterReplicaStateResponse}
+import org.apache.kafka.common.utils.Time
+
+import scala.collection.mutable
+import scala.jdk.CollectionConverters._
+
+/**
+ * Handles the sending of AlterReplicaState requests to the controller when LogDirFailure.
+ */
+abstract class LogDirEventManager {
+
+  def start(): Unit
+
+  def handleAlterReplicaStateChanges(logDirEventItem: AlterReplicaStateItem): Unit
+
+  def pendingAlterReplicaStateItemCount(): Int
+
+  def shutdown(): Unit
+}
+
+case class AlterReplicaStateItem(topicPartitions: util.List[TopicPartition],
+                                 newState: Byte,
+                                 reason: String,
+                                 callback: Either[Errors, TopicPartition] => Unit)
+
+class LogDirEventManagerImpl(val controllerChannelManager: BrokerToControllerChannelManager,
+                             val scheduler: Scheduler,
+                             val time: Time,
+                             val brokerId: Int,
+                             val brokerEpochSupplier: () => Long) extends LogDirEventManager with Logging with KafkaMetricsGroup {
+
+  // Used to allow only one in-flight request at a time
+  private val inflightRequest: AtomicBoolean = new AtomicBoolean(false)
+
+  private val pendingReplicaStateUpdates: LinkedBlockingQueue[AlterReplicaStateItem] = new LinkedBlockingQueue[AlterReplicaStateItem]()
+
+  private val lastSentMs = new AtomicLong(0)
+
+  def start(): Unit = {
+    scheduler.schedule("send-alter-replica-state", propagateReplicaStateChanges, 50, 50, TimeUnit.MILLISECONDS)
+  }
+
+  override def pendingAlterReplicaStateItemCount(): Int = pendingReplicaStateUpdates.size()
+
+  private def propagateReplicaStateChanges(): Unit = {
+    if (!pendingReplicaStateUpdates.isEmpty && inflightRequest.compareAndSet(false, true)) {
+      // Copy current unsent items and remove from the queue, will be inserted if failed
+      val inflightAlterIsrItem = pendingReplicaStateUpdates.poll()
+
+      lastSentMs.set(time.milliseconds())
+      sendRequest(inflightAlterIsrItem)
+    }
+  }
+
+  def handleAlterReplicaStateChanges(logDirEventItem: AlterReplicaStateItem): Unit = {
+    pendingReplicaStateUpdates.put(logDirEventItem)
+  }
+
+  def sendRequest(logDirEventItem: AlterReplicaStateItem): Unit = {
+
+    val message = buildRequest(logDirEventItem)
+
+    debug(s"Sending AlterReplicaState to controller $message")
+    controllerChannelManager.sendRequest(new AlterReplicaStateRequest.Builder(message),
+      new ControllerRequestCompletionHandler {
+        override def onComplete(response: ClientResponse): Unit = {
+          try {
+            val body = response.responseBody().asInstanceOf[AlterReplicaStateResponse]
+            handleAlterReplicaStateResponse(body, message.brokerEpoch, logDirEventItem)
+          } finally {
+            inflightRequest.set(false)
+          }
+        }
+
+        override def onTimeout(): Unit = {
+          throw new IllegalStateException("Encountered unexpected timeout when sending AlterIsr to the controller")
+        }
+      })
+  }
+
+  private def buildRequest(logDirEventItem: AlterReplicaStateItem): AlterReplicaStateRequestData = {
+    val message = new AlterReplicaStateRequestData()
+      .setBrokerId(brokerId)
+      .setBrokerEpoch(brokerEpochSupplier.apply())
+      .setNewState(logDirEventItem.newState)
+      .setReason(logDirEventItem.reason)
+      .setTopics(new java.util.ArrayList())
+
+    logDirEventItem.topicPartitions.asScala.groupBy(_.topic).foreach(entry => {
+      val topicPart = new AlterReplicaStateRequestData.TopicData()
+        .setName(entry._1)
+        .setPartitions(new java.util.ArrayList())
+      message.topics().add(topicPart)
+      entry._2.foreach(item => {
+        topicPart.partitions().add(new AlterReplicaStateRequestData.PartitionData()
+          .setPartitionIndex(item.partition)
+        )
+      })
+    })
+    message
+  }
+
+  private def handleAlterReplicaStateResponse(alterReplicaStateResponse: AlterReplicaStateResponse,
+                                              sentBrokerEpoch: Long,
+                                              logDirEventItem: AlterReplicaStateItem): Unit = {
+    val data: AlterReplicaStateResponseData = alterReplicaStateResponse.data
+
+    Errors.forCode(data.errorCode) match {
+      case Errors.STALE_BROKER_EPOCH =>
+        warn(s"Broker had a stale broker epoch ($sentBrokerEpoch), broker could have been repaired and restarted, ignore")
+        pendingReplicaStateUpdates.put(logDirEventItem)
+
+      case Errors.NOT_CONTROLLER =>
+        warn(s"Remote broker is not controller, ignore")
+        pendingReplicaStateUpdates.put(logDirEventItem)
+
+      case Errors.CLUSTER_AUTHORIZATION_FAILED =>
+        val exception = Errors.CLUSTER_AUTHORIZATION_FAILED.exception("Broker is not authorized to send AlterReplicaState to controller")
+        error(s"Broker is not authorized to send AlterReplicaState to controller", exception)
+        pendingReplicaStateUpdates.put(logDirEventItem)
+
+      case Errors.UNKNOWN_REPLICA_STATE =>
+        val exception = Errors.CLUSTER_AUTHORIZATION_FAILED.exception("ReplicaStateChange failed with an unknown replica state")
+        error(s"Broker is not authorized to send AlterReplicaState to controller", exception)
+        pendingReplicaStateUpdates.put(logDirEventItem)
+
+      case Errors.NONE =>
+        // success is a flag to indicate whether all the partitions had successfully alter state
+        val failedPartitions = new util.ArrayList[TopicPartition]()
+        // Collect partition-level responses to pass to the callbacks
+        val partitionResponses: mutable.Map[TopicPartition, Either[Errors, TopicPartition]] =
+          new mutable.HashMap[TopicPartition, Either[Errors, TopicPartition]]()
+        data.topics.forEach { topic =>
+          topic.partitions().forEach(partition => {
+            val tp = new TopicPartition(topic.name, partition.partitionIndex)
+            val error = Errors.forCode(partition.errorCode())
+            debug(s"Controller successfully handled AlterReplicaState request for $tp: $partition")
+            if (error == Errors.NONE) {
+              partitionResponses(tp) = Right(tp)
+            } else {
+              failedPartitions.add(new TopicPartition(topic.name(), partition.partitionIndex()))
+              partitionResponses(tp) = Left(error)
+            }
+          })
+        }
+
+        // Iterate across the items we sent rather than what we received to ensure we run the callback even if a
+        // partition was somehow erroneously excluded from the response.

Review comment:
       Is it worth also checking that there aren't partitions in the response that weren't part of the request?
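
One way such a check could look, as a minimal sketch rather than a proposal for the final code: `Tp` is a stand-in for `org.apache.kafka.common.TopicPartition` so the example is self-contained, and the helper names are hypothetical. Comparing the requested and returned partition sets in both directions covers partitions missing from the response as well as partitions that were never requested.

```scala
// Stand-in for org.apache.kafka.common.TopicPartition so the sketch is self-contained.
final case class Tp(topic: String, partition: Int)

object ResponseValidation {
  // Partitions present in the response but never requested.
  def unexpectedPartitions(requested: Set[Tp], responded: Set[Tp]): Set[Tp] =
    responded.diff(requested)

  // Partitions requested but missing from the response.
  def missingPartitions(requested: Set[Tp], responded: Set[Tp]): Set[Tp] =
    requested.diff(responded)

  def main(args: Array[String]): Unit = {
    val requested = Set(Tp("topic", 0), Tp("topic", 1))
    val responded = Set(Tp("topic", 1), Tp("topic", 2))
    println(s"unexpected: ${unexpectedPartitions(requested, responded)}")
    println(s"missing: ${missingPartitions(requested, responded)}")
  }
}
```

Whether an unexpected partition should be logged, ignored, or treated as an error is a separate decision; the sketch only shows how to detect it.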







[GitHub] [kafka] dengziming removed a comment on pull request #9577: KAFKA-9837: KIP-589 new RPC for notifying controller log dir failure

Posted by GitBox <gi...@apache.org>.
dengziming removed a comment on pull request #9577:
URL: https://github.com/apache/kafka/pull/9577#issuecomment-833187678


   ping @mumrah to have a look 😉.





[GitHub] [kafka] dengziming commented on a change in pull request #9577: KAFKA-9837: KIP-589 new RPC for notifying controller log dir failure

Posted by GitBox <gi...@apache.org>.
dengziming commented on a change in pull request #9577:
URL: https://github.com/apache/kafka/pull/9577#discussion_r762506030



##########
File path: core/src/main/scala/kafka/server/KafkaServer.scala
##########
@@ -313,6 +315,21 @@ class KafkaServer(
         }
         alterIsrManager.start()
 
+        val alterReplicaStateChannelManager = new BrokerToControllerChannelManagerImpl(
+          controllerNodeProvider = MetadataCacheControllerNodeProvider(config, metadataCache),
+          time = time,
+          metrics = metrics,
+          config = config,
+          channelName = "alterReplicaStateChannel",
+          threadNamePrefix = threadNamePrefix,
+          retryTimeoutMs = Long.MaxValue)
+        if (config.interBrokerProtocolVersion >= kafka.api.KAFKA_3_0_IV1) {

Review comment:
       Good catch!







[GitHub] [kafka] dengziming removed a comment on pull request #9577: KAFKA-9837: KIP-589 new RPC for notifying controller log dir failure

Posted by GitBox <gi...@apache.org>.
dengziming removed a comment on pull request #9577:
URL: https://github.com/apache/kafka/pull/9577#issuecomment-851110147


   ping @mumrah @cmccabe, this PR is similar to KIP-497. I hope it can be finished before Kafka 3.0.0.





[GitHub] [kafka] dengziming commented on a change in pull request #9577: KAFKA-9837: KIP-589 new RPC for notifying controller log dir failure

Posted by GitBox <gi...@apache.org>.
dengziming commented on a change in pull request #9577:
URL: https://github.com/apache/kafka/pull/9577#discussion_r762506268



##########
File path: core/src/main/scala/kafka/server/LogDirEventManagerImpl.scala
##########
@@ -0,0 +1,198 @@
+        // Iterate across the items we sent rather than what we received to ensure we run the callback even if a
+        // partition was somehow erroneously excluded from the response.

Review comment:
       I wasn't sure whether it's an error if the response contains partitions that weren't part of the request, so I just ignored it. I'll consider this case.







[GitHub] [kafka] dengziming commented on pull request #9577: KAFKA-9837: KIP-589 new RPC for notifying controller log dir failure

Posted by GitBox <gi...@apache.org>.
dengziming commented on pull request #9577:
URL: https://github.com/apache/kafka/pull/9577#issuecomment-881129535


   ping @mumrah 





[GitHub] [kafka] dengziming removed a comment on pull request #9577: KAFKA-9837: KIP-589 new RPC for notifying controller log dir failure

Posted by GitBox <gi...@apache.org>.
dengziming removed a comment on pull request #9577:
URL: https://github.com/apache/kafka/pull/9577#issuecomment-834218869


   The flaky test is so annoying:
   ```
   WARN [RequestSendThread controllerId=0] Controller 0 epoch 1 fails to send request (type=LeaderAndIsRequest, controllerId=0, controllerEpoch=1, brokerEpoch=26, partitionStates=[LeaderAndIsrPartitionState(topicName='topic', partitionIndex=9, controllerEpoch=1, leader=0, leaderEpoch=1, isr=[0], zkVersion=1, replicas=[0, 1], addingReplicas=[], removingReplicas=[], isNew=false), LeaderAndIsrPartitionState(topicName='topic', partitionIndex=3, controllerEpoch=1, leader=0, leaderEpoch=1, isr=[0], zkVersion=1, replicas=[0, 1], addingReplicas=[], removingReplicas=[], isNew=false), LeaderAndIsrPartitionState(topicName='topic', partitionIndex=0, controllerEpoch=1, leader=-1, leaderEpoch=1, isr=[0], zkVersion=1, replicas=[1, 0], addingReplicas=[], removingReplicas=[], isNew=false), LeaderAndIsrPartitionState(topicName='topic', partitionIndex=11, controllerEpoch=1, leader=0, leaderEpoch=1, isr=[0], zkVersion=1, replicas=[0, 1], addingReplicas=[], removingReplicas=[], isNew=false), LeaderAndIsrPartitionState(topicName='topic', partitionIndex=8, controllerEpoch=1, leader=-1, leaderEpoch=1, isr=[0], zkVersion=1, replicas=[1, 0], addingReplicas=[], removingReplicas=[], isNew=false), LeaderAndIsrPartitionState(topicName='topic', partitionIndex=5, controllerEpoch=1, leader=0, leaderEpoch=1, isr=[0], zkVersion=1, replicas=[0, 1], addingReplicas=[], removingReplicas=[], isNew=false), LeaderAndIsrPartitionState(topicName='topic', partitionIndex=7, controllerEpoch=1, leader=0, leaderEpoch=1, isr=[0], zkVersion=1, replicas=[0, 1], addingReplicas=[], removingReplicas=[], isNew=false), LeaderAndIsrPartitionState(topicName='topic', partitionIndex=1, controllerEpoch=1, leader=0, leaderEpoch=1, isr=[0], zkVersion=1, replicas=[0, 1], addingReplicas=[], removingReplicas=[], isNew=false)], topicIds={topic=UF4BQlpgQRG9A_hJugA_fw}, liveLeaders=(localhost:34815 (id: 0 rack: null))) to broker localhost:34815 (id: 0 rack: null). Reconnecting to broker. (kafka.controller.RequestSendThread:72)
   java.io.IOException: Connection to 0 was disconnected before the response was read
   ```
   
   This is similar to some other flaky tests; I'm investigating.





[GitHub] [kafka] soarez commented on pull request #9577: KAFKA-9837: KIP-589 new RPC for notifying controller log dir failure

Posted by GitBox <gi...@apache.org>.
soarez commented on pull request #9577:
URL: https://github.com/apache/kafka/pull/9577#issuecomment-1062889765


   @mumrah @hachikuji @bbejeck can anyone review this?





[GitHub] [kafka] soarez commented on pull request #9577: KAFKA-9837: KIP-589 new RPC for notifying controller log dir failure

Posted by GitBox <gi...@apache.org>.
soarez commented on pull request #9577:
URL: https://github.com/apache/kafka/pull/9577#issuecomment-1013132683


   @mumrah @hachikuji @bbejeck could someone have a look at this PR?





[GitHub] [kafka] dengziming commented on pull request #9577: KAFKA-9837: KIP-589 new RPC for notifying controller log dir failure

Posted by GitBox <gi...@apache.org>.
dengziming commented on pull request #9577:
URL: https://github.com/apache/kafka/pull/9577#issuecomment-833187678


   ping @mumrah to have a look 😉.





[GitHub] [kafka] dengziming commented on pull request #9577: KAFKA-9837: KIP-589 new RPC for notifying controller log dir failure

Posted by GitBox <gi...@apache.org>.
dengziming commented on pull request #9577:
URL: https://github.com/apache/kafka/pull/9577#issuecomment-834218869


   The flaky test is so annoying:
   ```
   WARN [RequestSendThread controllerId=0] Controller 0 epoch 1 fails to send request (type=LeaderAndIsRequest, controllerId=0, controllerEpoch=1, brokerEpoch=26, partitionStates=[LeaderAndIsrPartitionState(topicName='topic', partitionIndex=9, controllerEpoch=1, leader=0, leaderEpoch=1, isr=[0], zkVersion=1, replicas=[0, 1], addingReplicas=[], removingReplicas=[], isNew=false), LeaderAndIsrPartitionState(topicName='topic', partitionIndex=3, controllerEpoch=1, leader=0, leaderEpoch=1, isr=[0], zkVersion=1, replicas=[0, 1], addingReplicas=[], removingReplicas=[], isNew=false), LeaderAndIsrPartitionState(topicName='topic', partitionIndex=0, controllerEpoch=1, leader=-1, leaderEpoch=1, isr=[0], zkVersion=1, replicas=[1, 0], addingReplicas=[], removingReplicas=[], isNew=false), LeaderAndIsrPartitionState(topicName='topic', partitionIndex=11, controllerEpoch=1, leader=0, leaderEpoch=1, isr=[0], zkVersion=1, replicas=[0, 1], addingReplicas=[], removingReplicas=[], isNew=false), LeaderAndIsrPartitionState(topicName='topic', partitionIndex=8, controllerEpoch=1, leader=-1, leaderEpoch=1, isr=[0], zkVersion=1, replicas=[1, 0], addingReplicas=[], removingReplicas=[], isNew=false), LeaderAndIsrPartitionState(topicName='topic', partitionIndex=5, controllerEpoch=1, leader=0, leaderEpoch=1, isr=[0], zkVersion=1, replicas=[0, 1], addingReplicas=[], removingReplicas=[], isNew=false), LeaderAndIsrPartitionState(topicName='topic', partitionIndex=7, controllerEpoch=1, leader=0, leaderEpoch=1, isr=[0], zkVersion=1, replicas=[0, 1], addingReplicas=[], removingReplicas=[], isNew=false), LeaderAndIsrPartitionState(topicName='topic', partitionIndex=1, controllerEpoch=1, leader=0, leaderEpoch=1, isr=[0], zkVersion=1, replicas=[0, 1], addingReplicas=[], removingReplicas=[], isNew=false)], topicIds={topic=UF4BQlpgQRG9A_hJugA_fw}, liveLeaders=(localhost:34815 (id: 0 rack: null))) to broker localhost:34815 (id: 0 rack: null). Reconnecting to broker. (kafka.controller.RequestSendThread:72)
   java.io.IOException: Connection to 0 was disconnected before the response was read
   ```
   
   This looks similar to other flaky tests; I'm investigating.





[GitHub] [kafka] dengziming commented on pull request #9577: KAFKA-9837: KIP-589 new RPC for notifying controller log dir failure

Posted by GitBox <gi...@apache.org>.
dengziming commented on pull request #9577:
URL: https://github.com/apache/kafka/pull/9577#issuecomment-986161398


   Thank you for your comments, @soarez. PTAL again. Ping @mumrah.





[GitHub] [kafka] mumrah commented on pull request #9577: KAFKA-9837: KIP-589 new RPC for notifying controller log dir failure

Posted by GitBox <gi...@apache.org>.
mumrah commented on pull request #9577:
URL: https://github.com/apache/kafka/pull/9577#issuecomment-776805631


   @dengziming, thanks for the patch! We are rather busy at the moment merging all the KIP-500 related work into trunk for the 2.8 release. After things settle down with that I will take a look at this. It looks like the approach here is similar to what we did for KIP-497 so you're probably on the right path. I know that we have some changes incoming for BrokerToControllerManager, so you'll likely need to rebase with trunk soon.
   
   If I don't get back to this in the next few weeks, please feel free to ping me here :) 





[GitHub] [kafka] dengziming commented on pull request #9577: KAFKA-9837: KIP-589 new RPC for notifying controller log dir failure

Posted by GitBox <gi...@apache.org>.
dengziming commented on pull request #9577:
URL: https://github.com/apache/kafka/pull/9577#issuecomment-851110147


   Ping @mumrah @cmccabe: this PR is similar to KIP-497, and I hope it can be finished before Kafka 3.0.0.





[GitHub] [kafka] dengziming commented on pull request #9577: KAFKA-9837: KIP-589 new RPC for notifying controller log dir failure

Posted by GitBox <gi...@apache.org>.
dengziming commented on pull request #9577:
URL: https://github.com/apache/kafka/pull/9577#issuecomment-723936352


   @mumrah @hachikuji @bbejeck Hi, PTAL.





[GitHub] [kafka] dengziming commented on a change in pull request #9577: KAFKA-9837: KIP-589 new RPC for notifying controller log dir failure

Posted by GitBox <gi...@apache.org>.
dengziming commented on a change in pull request #9577:
URL: https://github.com/apache/kafka/pull/9577#discussion_r762506014



##########
File path: core/src/main/scala/kafka/server/KafkaApis.scala
##########
@@ -3276,6 +3277,23 @@ class KafkaApis(val requestChannel: RequestChannel,
     }
   }
 
+  def handleAlterReplicaStateRequest(request: RequestChannel.Request): Unit = {
+    val zkSupport = metadataSupport.requireZkOrThrow(KafkaApis.shouldNeverReceive(request))

Review comment:
       KRaft support is out of scope for KIP-589; we created KAFKA-13005 to track it, so it is not included in this PR.







[GitHub] [kafka] soarez commented on pull request #9577: KAFKA-9837: KIP-589 new RPC for notifying controller log dir failure

Posted by GitBox <gi...@apache.org>.
soarez commented on pull request #9577:
URL: https://github.com/apache/kafka/pull/9577#issuecomment-951893322


   @dengziming @mumrah now that 2.8 has been released, should this be rebased and reviewed again? 

