You are viewing a plain text version of this content. The canonical link for it is here.
Posted to jira@kafka.apache.org by GitBox <gi...@apache.org> on 2020/06/29 17:19:24 UTC

[GitHub] [kafka] andrewchoi5 opened a new pull request #8479: KAFKA-9769: Finish operations for leaderEpoch-updated partitions up to point ZK Exception

andrewchoi5 opened a new pull request #8479:
URL: https://github.com/apache/kafka/pull/8479


   KAFKA-9769: Finish operations for leaderEpoch-updated partitions up to point ZK Exception occurs.
   https://issues.apache.org/jira/browse/KAFKA-9769
   
   For example, in such case, we will have the following mechanism :
   1 - P1 and P2 succeeds. leaderEpoch for them are incremented because no ZkException occurs
   2 - while making follower for P3, ZkException occurs and the leaderEpoch is not updated and thus thepartitionsToMakeFollower += partition isn’t executed. We catch this ZkException in line 1498 and log it as an error. No Exception is thrown.
   3 - After catching the exception, makeFollower for P4 is then not executed.
   4 - so the partitionsToMakeFollower only contains P1, P2. And fetchers are added to these partitionsToMakeFollower
   
   Signed-off-by: Andrew Choi <li...@microsoft.com>
   
   ### Committer Checklist (excluded from commit message)
   - [ ] Verify design and implementation 
   - [ ] Verify test coverage and CI build status
   - [ ] Verify documentation (including upgrade notes)
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] junrao commented on pull request #8479: KAFKA-9769: Finish operations for leaderEpoch-updated partitions up to point ZK Exception

Posted by GitBox <gi...@apache.org>.
junrao commented on pull request #8479:
URL: https://github.com/apache/kafka/pull/8479#issuecomment-628258485


   @andrewchoi5 : Thanks for finding this issue. We fixed https://issues.apache.org/jira/browse/KAFKA-9932 recently. So, the `fetchLogConfig` from ZK method is called rarely during makeFollowers() now. If we do hit the ZK exception, there are a few options: (1) As Jason mentioned, we could keep retrying from ZK until successful. This probably needs to be done in a separate thread to avoid blocking the request handler thread. So, it can be a bit involved. (2) Don't update the leader epoch (we can choose to update the leader epoch after the local log is obtained successfully) and log an error. Since this issue should be rare now, maybe we can do (2) for now?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] andrewchoi5 commented on pull request #8479: KAFKA-9769: Finish operations for leaderEpoch-updated partitions up to point ZK Exception

Posted by GitBox <gi...@apache.org>.
andrewchoi5 commented on pull request #8479:
URL: https://github.com/apache/kafka/pull/8479#issuecomment-643010366


   Hi @junrao. Thank you for your review -- I have further synced up with @jjkoshy on this PR.
   
   The partition's new leader epoch update is actually happening after the point at which ZooKeeper Exception is thrown. Therefore, when the `createLogs` throws ZooKeeper Exception, the new leader epoch does not actually get updated.
   
   In the catch case for ZooKeeperClientException, I have populated the responseMap with the topic partition and the `Errors.NETWORK_EXCEPTION`. If you suggest any other Error to be populated in this responseMap, please let me know and I will change it accordingly. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] andrewchoi5 commented on pull request #8479: KAFKA-9769: Finish operations for leaderEpoch-updated partitions up to point ZK Exception

Posted by GitBox <gi...@apache.org>.
andrewchoi5 commented on pull request #8479:
URL: https://github.com/apache/kafka/pull/8479#issuecomment-649082464


   Hello @junrao @hachikuji  -- I have made some updates to address the comments. Thanks!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] andrewchoi5 closed pull request #8479: KAFKA-9769: Finish operations for leaderEpoch-updated partitions up to point ZK Exception

Posted by GitBox <gi...@apache.org>.
andrewchoi5 closed pull request #8479:
URL: https://github.com/apache/kafka/pull/8479


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] andrewchoi5 commented on a change in pull request #8479: KAFKA-9769: Finish operations for leaderEpoch-updated partitions up to point ZK Exception

Posted by GitBox <gi...@apache.org>.
andrewchoi5 commented on a change in pull request #8479:
URL: https://github.com/apache/kafka/pull/8479#discussion_r446309532



##########
File path: core/src/main/scala/kafka/cluster/Partition.scala
##########
@@ -499,7 +500,16 @@ class Partition(val topicPartition: TopicPartition,
         addingReplicas = addingReplicas,
         removingReplicas = removingReplicas
       )
-      createLogIfNotExists(partitionState.isNew, isFutureReplica = false, highWatermarkCheckpoints)
+      try {
+        this.createLogIfNotExists(partitionState.isNew, isFutureReplica = false, highWatermarkCheckpoints)
+      } catch {
+        case e: ZooKeeperClientException =>
+          stateChangeLogger.info(s"A ZooKeeper client exception has occurred and makeLeader will be skipping the " +
+            s"state change for the partition with leader epoch: $leaderEpoch ", e)
+          error(s"ZooKeeper client error occurred while this partition was becoming the leader for $topicPartition.", e)

Review comment:
       Thanks @junrao -- I will make the logging level to error. 
   Could you clarify what you mean by keep the state change logging?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] junrao commented on a change in pull request #8479: KAFKA-9769: Finish operations for leaderEpoch-updated partitions up to point ZK Exception

Posted by GitBox <gi...@apache.org>.
junrao commented on a change in pull request #8479:
URL: https://github.com/apache/kafka/pull/8479#discussion_r445234702



##########
File path: core/src/main/scala/kafka/server/ReplicaManager.scala
##########
@@ -1556,6 +1557,11 @@ class ReplicaManager(val config: KafkaConfig,
             error(s"Error while making broker the follower for partition $partition with leader " +
               s"$newLeaderBrokerId in dir $dirOpt", e)
             responseMap.put(partition.topicPartition, Errors.KAFKA_STORAGE_ERROR)
+          case e: ZooKeeperClientException =>

Review comment:
       It's probably better to do this in Partition.makeFollower() instead of here. That way, we only skip partitions that have incurred ZK error.
   
   Also, the same ZK exception can happen in Partition.makeLeader(). So, we want to do the same thing there as well.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] jjkoshy commented on pull request #8479: KAFKA-9769: Finish operations for leaderEpoch-updated partitions up to point ZK Exception

Posted by GitBox <gi...@apache.org>.
jjkoshy commented on pull request #8479:
URL: https://github.com/apache/kafka/pull/8479#issuecomment-643015314


   
   > In the catch case for ZooKeeperClientException, I have populated the responseMap with the topic partition and the `Errors.NETWORK_EXCEPTION`. If you suggest any other Error to be populated in this responseMap, please let me know and I will change it accordingly.
   
   I haven't looked at this code in a while so I may not have enough context at this point, but I don't think we should use the network exception error code - i.e., this isn't a network issue between the coordinator and broker but between the broker and zk. Also, there doesn't seem to be any active retry attempt from the controller to resend the request in this scenario.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] andrewchoi5 commented on pull request #8479: KAFKA-9769: Finish operations for leaderEpoch-updated partitions up to point ZK Exception

Posted by GitBox <gi...@apache.org>.
andrewchoi5 commented on pull request #8479:
URL: https://github.com/apache/kafka/pull/8479#issuecomment-651253282


   Closed by accident. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] junrao commented on pull request #8479: KAFKA-9769: Finish operations for leaderEpoch-updated partitions up to point ZK Exception

Posted by GitBox <gi...@apache.org>.
junrao commented on pull request #8479:
URL: https://github.com/apache/kafka/pull/8479#issuecomment-653954070


   ok to test


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] andrewchoi5 commented on pull request #8479: KAFKA-9769: Finish operations for leaderEpoch-updated partitions up to point ZK Exception

Posted by GitBox <gi...@apache.org>.
andrewchoi5 commented on pull request #8479:
URL: https://github.com/apache/kafka/pull/8479#issuecomment-624858200


   @mjsax Hi Matthias, would you happen to know if there were any other reviewers available? I don't mind waiting, but was curious what the ETA usually appears to be.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] andrewchoi5 commented on a change in pull request #8479: KAFKA-9769: Finish operations for leaderEpoch-updated partitions up to point ZK Exception

Posted by GitBox <gi...@apache.org>.
andrewchoi5 commented on a change in pull request #8479:
URL: https://github.com/apache/kafka/pull/8479#discussion_r445902326



##########
File path: core/src/main/scala/kafka/server/ReplicaManager.scala
##########
@@ -1556,6 +1557,11 @@ class ReplicaManager(val config: KafkaConfig,
             error(s"Error while making broker the follower for partition $partition with leader " +
               s"$newLeaderBrokerId in dir $dirOpt", e)
             responseMap.put(partition.topicPartition, Errors.KAFKA_STORAGE_ERROR)
+          case e: ZooKeeperClientException =>

Review comment:
       @junrao -- let me know if there's additional changes to be made. Thanks.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] mjsax commented on pull request #8479: KAFKA-9769: Finish operations for leaderEpoch-updated partitions up to point ZK Exception

Posted by GitBox <gi...@apache.org>.
mjsax commented on pull request #8479:
URL: https://github.com/apache/kafka/pull/8479#issuecomment-626047555


   Feel to reach out to the dev mailing list and call for review.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] hachikuji commented on pull request #8479: KAFKA-9769: Finish operations for leaderEpoch-updated partitions up to point ZK Exception

Posted by GitBox <gi...@apache.org>.
hachikuji commented on pull request #8479:
URL: https://github.com/apache/kafka/pull/8479#issuecomment-628204174


   @andrewchoi5 Just want to make sure I understand the problem. The scenario is that we lose the zk session while handling a LeaderAndIsr request. Current LeaderAndIsr handling works like this:
   
   1. Check epoch of each partition. If it is less than or equal to current epoch, ignore the update.
   2. Update the epoch for each partition.
   3. Make followers and leaders, which involves loading topic configs from zk.
   
   So the problem occurs when we hit an error loading the topic configs in step 3). This potentially causes us to miss the needed state changes from that request and also prevents us from being able to retry them because the epoch has already been updated. Is that right?
   
   I guess the fix here is only a partial fix. We would still be left with the one failed partition, right?
   
   Off the top of my head, I am wondering whether we should be retrying the request at a lower level. For example, maybe `getEntityConfigs` could catch the `ZooKeeperClientExpiredException` and retry. I assume there is a good reason we do not catch this exception in `retryRequestUntilConnected` already. Perhaps it is unsafe to assume that the requests are still valid after a session expiration. However, just for reading configurations, I do not see a problem retrying after a session expiration.
   
   cc @junrao 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] junrao commented on a change in pull request #8479: KAFKA-9769: Finish operations for leaderEpoch-updated partitions up to point ZK Exception

Posted by GitBox <gi...@apache.org>.
junrao commented on a change in pull request #8479:
URL: https://github.com/apache/kafka/pull/8479#discussion_r445904547



##########
File path: core/src/main/scala/kafka/cluster/Partition.scala
##########
@@ -499,7 +500,16 @@ class Partition(val topicPartition: TopicPartition,
         addingReplicas = addingReplicas,
         removingReplicas = removingReplicas
       )
-      createLogIfNotExists(partitionState.isNew, isFutureReplica = false, highWatermarkCheckpoints)
+      try {
+        this.createLogIfNotExists(partitionState.isNew, isFutureReplica = false, highWatermarkCheckpoints)
+      } catch {
+        case e: ZooKeeperClientException =>
+          stateChangeLogger.info(s"Because a ZooKeeper client exception has occurred, completed become leader " +
+            s"state change from epoch $leaderEpoch only for those updated partitions with before " +

Review comment:
       The message seem inaccurate since we only skip this partition now. Ditto in makeFollower().

##########
File path: core/src/main/scala/kafka/cluster/Partition.scala
##########
@@ -499,7 +500,16 @@ class Partition(val topicPartition: TopicPartition,
         addingReplicas = addingReplicas,
         removingReplicas = removingReplicas
       )
-      createLogIfNotExists(partitionState.isNew, isFutureReplica = false, highWatermarkCheckpoints)
+      try {
+        this.createLogIfNotExists(partitionState.isNew, isFutureReplica = false, highWatermarkCheckpoints)
+      } catch {
+        case e: ZooKeeperClientException =>
+          stateChangeLogger.info(s"Because a ZooKeeper client exception has occurred, completed become leader " +
+            s"state change from epoch $leaderEpoch only for those updated partitions with before " +
+            s"ZooKeeper disconnect occurred.", e)
+          error(s"ZooKeeper client occurred while rendering a $topicPartition's leader through zkClient.'", e)

Review comment:
       Do we need ' before "?
   
   Also, the text could probably be sth like "ZooKeeper client error occurred while becoming leader for $topicPartition." 
   
   Ditto in makeFollower().




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] junrao commented on a change in pull request #8479: KAFKA-9769: Finish operations for leaderEpoch-updated partitions up to point ZK Exception

Posted by GitBox <gi...@apache.org>.
junrao commented on a change in pull request #8479:
URL: https://github.com/apache/kafka/pull/8479#discussion_r446306280



##########
File path: core/src/main/scala/kafka/cluster/Partition.scala
##########
@@ -499,7 +500,16 @@ class Partition(val topicPartition: TopicPartition,
         addingReplicas = addingReplicas,
         removingReplicas = removingReplicas
       )
-      createLogIfNotExists(partitionState.isNew, isFutureReplica = false, highWatermarkCheckpoints)
+      try {
+        this.createLogIfNotExists(partitionState.isNew, isFutureReplica = false, highWatermarkCheckpoints)
+      } catch {
+        case e: ZooKeeperClientException =>
+          stateChangeLogger.info(s"A ZooKeeper client exception has occurred and makeLeader will be skipping the " +
+            s"state change for the partition with leader epoch: $leaderEpoch ", e)
+          error(s"ZooKeeper client error occurred while this partition was becoming the leader for $topicPartition.", e)

Review comment:
       We probably can just keep the state change logging. Also, the logging level probably should be error instead of info. Ditto below.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] andrewchoi5 commented on a change in pull request #8479: KAFKA-9769: Finish operations for leaderEpoch-updated partitions up to point ZK Exception

Posted by GitBox <gi...@apache.org>.
andrewchoi5 commented on a change in pull request #8479:
URL: https://github.com/apache/kafka/pull/8479#discussion_r445748544



##########
File path: core/src/main/scala/kafka/server/ReplicaManager.scala
##########
@@ -1556,6 +1557,11 @@ class ReplicaManager(val config: KafkaConfig,
             error(s"Error while making broker the follower for partition $partition with leader " +
               s"$newLeaderBrokerId in dir $dirOpt", e)
             responseMap.put(partition.topicPartition, Errors.KAFKA_STORAGE_ERROR)
+          case e: ZooKeeperClientException =>

Review comment:
       πŸ‘ 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] andrewchoi5 commented on pull request #8479: KAFKA-9769: Finish operations for leaderEpoch-updated partitions up to point ZK Exception

Posted by GitBox <gi...@apache.org>.
andrewchoi5 commented on pull request #8479:
URL: https://github.com/apache/kafka/pull/8479#issuecomment-644482079


   Thanks, @junrao .
   Removed the population of responseMap with the `ERROR` code.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] andrewchoi5 commented on pull request #8479: KAFKA-9769: Finish operations for leaderEpoch-updated partitions up to point ZK Exception

Posted by GitBox <gi...@apache.org>.
andrewchoi5 commented on pull request #8479:
URL: https://github.com/apache/kafka/pull/8479#issuecomment-653725987


   @junrao -- let me know if anything else needs your attention. Thanks!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] andrewchoi5 commented on a change in pull request #8479: KAFKA-9769: Finish operations for leaderEpoch-updated partitions up to point ZK Exception

Posted by GitBox <gi...@apache.org>.
andrewchoi5 commented on a change in pull request #8479:
URL: https://github.com/apache/kafka/pull/8479#discussion_r448505615



##########
File path: core/src/main/scala/kafka/cluster/Partition.scala
##########
@@ -499,7 +500,15 @@ class Partition(val topicPartition: TopicPartition,
         addingReplicas = addingReplicas,
         removingReplicas = removingReplicas
       )
-      createLogIfNotExists(partitionState.isNew, isFutureReplica = false, highWatermarkCheckpoints)
+      try {
+        this.createLogIfNotExists(partitionState.isNew, isFutureReplica = false, highWatermarkCheckpoints)

Review comment:
       @junrao -- My understanding was we wanted to catch ZooKeeperClientException when performing createLogIfNotExists within makeLeader and makeFollower methods. Did we want to place this somewhere else?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] andrewchoi5 commented on a change in pull request #8479: KAFKA-9769: Finish operations for leaderEpoch-updated partitions up to point ZK Exception

Posted by GitBox <gi...@apache.org>.
andrewchoi5 commented on a change in pull request #8479:
URL: https://github.com/apache/kafka/pull/8479#discussion_r448569922



##########
File path: core/src/main/scala/kafka/cluster/Partition.scala
##########
@@ -499,7 +500,15 @@ class Partition(val topicPartition: TopicPartition,
         addingReplicas = addingReplicas,
         removingReplicas = removingReplicas
       )
-      createLogIfNotExists(partitionState.isNew, isFutureReplica = false, highWatermarkCheckpoints)
+      try {
+        this.createLogIfNotExists(partitionState.isNew, isFutureReplica = false, highWatermarkCheckpoints)

Review comment:
       @junrao  -- Got it. I misunderstood it for something else. Updated!
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] andrewchoi5 commented on issue #8479: KAFKA-9769: Finish operations for leaderEpoch-updated partitions up to point ZK Exception

Posted by GitBox <gi...@apache.org>.
andrewchoi5 commented on issue #8479:
URL: https://github.com/apache/kafka/pull/8479#issuecomment-617479063


   Thanks for referring Matthias. Would appreciate your review @hachikuji @cmccabe 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] junrao commented on a change in pull request #8479: KAFKA-9769: Finish operations for leaderEpoch-updated partitions up to point ZK Exception

Posted by GitBox <gi...@apache.org>.
junrao commented on a change in pull request #8479:
URL: https://github.com/apache/kafka/pull/8479#discussion_r448527321



##########
File path: core/src/main/scala/kafka/cluster/Partition.scala
##########
@@ -499,7 +500,15 @@ class Partition(val topicPartition: TopicPartition,
         addingReplicas = addingReplicas,
         removingReplicas = removingReplicas
       )
-      createLogIfNotExists(partitionState.isNew, isFutureReplica = false, highWatermarkCheckpoints)
+      try {
+        this.createLogIfNotExists(partitionState.isNew, isFutureReplica = false, highWatermarkCheckpoints)

Review comment:
       @andrewchoi5 : I was asking if we could change 
   
   `this.createLogIfNotExists()`
   
   to
   
   `createLogIfNotExists()`
   
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] andrewchoi5 commented on a change in pull request #8479: KAFKA-9769: Finish operations for leaderEpoch-updated partitions up to point ZK Exception

Posted by GitBox <gi...@apache.org>.
andrewchoi5 commented on a change in pull request #8479:
URL: https://github.com/apache/kafka/pull/8479#discussion_r446318540



##########
File path: core/src/main/scala/kafka/cluster/Partition.scala
##########
@@ -499,7 +500,16 @@ class Partition(val topicPartition: TopicPartition,
         addingReplicas = addingReplicas,
         removingReplicas = removingReplicas
       )
-      createLogIfNotExists(partitionState.isNew, isFutureReplica = false, highWatermarkCheckpoints)
+      try {
+        this.createLogIfNotExists(partitionState.isNew, isFutureReplica = false, highWatermarkCheckpoints)
+      } catch {
+        case e: ZooKeeperClientException =>
+          stateChangeLogger.info(s"A ZooKeeper client exception has occurred and makeLeader will be skipping the " +
+            s"state change for the partition with leader epoch: $leaderEpoch ", e)
+          error(s"ZooKeeper client error occurred while this partition was becoming the leader for $topicPartition.", e)

Review comment:
       Makes sense. Would appreciate a review on the change -- @junrao .




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] junrao commented on pull request #8479: KAFKA-9769: Finish operations for leaderEpoch-updated partitions up to point ZK Exception

Posted by GitBox <gi...@apache.org>.
junrao commented on pull request #8479:
URL: https://github.com/apache/kafka/pull/8479#issuecomment-644476784


   @andrewchoi5 : Since the controller only checks KAFKA_STORAGE_ERROR in LeaderAndIsrResponse now, perhaps we can just log an error without sending an error code back for now.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] andrewchoi5 commented on pull request #8479: KAFKA-9769: Finish operations for leaderEpoch-updated partitions up to point ZK Exception

Posted by GitBox <gi...@apache.org>.
andrewchoi5 commented on pull request #8479:
URL: https://github.com/apache/kafka/pull/8479#issuecomment-651253198


   Thanks for the review @junrao -- let me know if there's anything else for revision.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] junrao commented on a change in pull request #8479: KAFKA-9769: Finish operations for leaderEpoch-updated partitions up to point ZK Exception

Posted by GitBox <gi...@apache.org>.
junrao commented on a change in pull request #8479:
URL: https://github.com/apache/kafka/pull/8479#discussion_r447927722



##########
File path: core/src/main/scala/kafka/cluster/Partition.scala
##########
@@ -499,7 +500,15 @@ class Partition(val topicPartition: TopicPartition,
         addingReplicas = addingReplicas,
         removingReplicas = removingReplicas
       )
-      createLogIfNotExists(partitionState.isNew, isFutureReplica = false, highWatermarkCheckpoints)
+      try {
+        this.createLogIfNotExists(partitionState.isNew, isFutureReplica = false, highWatermarkCheckpoints)

Review comment:
       Do we need this? Ditto below.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] andrewchoi5 commented on pull request #8479: KAFKA-9769: Finish operations for leaderEpoch-updated partitions up to point ZK Exception

Posted by GitBox <gi...@apache.org>.
andrewchoi5 commented on pull request #8479:
URL: https://github.com/apache/kafka/pull/8479#issuecomment-643017385


   > > In the catch case for ZooKeeperClientException, I have populated the responseMap with the topic partition and the `Errors.NETWORK_EXCEPTION`. If you suggest any other Error to be populated in this responseMap, please let me know and I will change it accordingly.
   > 
   > I haven't looked at this code in a while so I may not have enough context at this point, but I don't think we should use the network exception error code - i.e., this isn't a network issue between the coordinator and broker but between the broker and zk. Also, there doesn't seem to be any active retry attempt from the controller to resend the request in this scenario.
   
   Correct -- I wasn't able to find the best, close enough Errors exception to populate, especially since there was none related to ZooKeeper in that class. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] junrao merged pull request #8479: KAFKA-9769: Finish operations for leaderEpoch-updated partitions up to point ZK Exception

Posted by GitBox <gi...@apache.org>.
junrao merged pull request #8479:
URL: https://github.com/apache/kafka/pull/8479


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] junrao commented on a change in pull request #8479: KAFKA-9769: Finish operations for leaderEpoch-updated partitions up to point ZK Exception

Posted by GitBox <gi...@apache.org>.
junrao commented on a change in pull request #8479:
URL: https://github.com/apache/kafka/pull/8479#discussion_r446316668



##########
File path: core/src/main/scala/kafka/cluster/Partition.scala
##########
@@ -499,7 +500,16 @@ class Partition(val topicPartition: TopicPartition,
         addingReplicas = addingReplicas,
         removingReplicas = removingReplicas
       )
-      createLogIfNotExists(partitionState.isNew, isFutureReplica = false, highWatermarkCheckpoints)
+      try {
+        this.createLogIfNotExists(partitionState.isNew, isFutureReplica = false, highWatermarkCheckpoints)
+      } catch {
+        case e: ZooKeeperClientException =>
+          stateChangeLogger.info(s"A ZooKeeper client exception has occurred and makeLeader will be skipping the " +
+            s"state change for the partition with leader epoch: $leaderEpoch ", e)
+          error(s"ZooKeeper client error occurred while this partition was becoming the leader for $topicPartition.", e)

Review comment:
       I meant removing the line in 509 of error(s"ZooKeeper client .




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] andrewchoi5 commented on a change in pull request #8479: KAFKA-9769: Finish operations for leaderEpoch-updated partitions up to point ZK Exception

Posted by GitBox <gi...@apache.org>.
andrewchoi5 commented on a change in pull request #8479:
URL: https://github.com/apache/kafka/pull/8479#discussion_r448505615



##########
File path: core/src/main/scala/kafka/cluster/Partition.scala
##########
@@ -499,7 +500,15 @@ class Partition(val topicPartition: TopicPartition,
         addingReplicas = addingReplicas,
         removingReplicas = removingReplicas
       )
-      createLogIfNotExists(partitionState.isNew, isFutureReplica = false, highWatermarkCheckpoints)
+      try {
+        this.createLogIfNotExists(partitionState.isNew, isFutureReplica = false, highWatermarkCheckpoints)

Review comment:
       @junrao -- My understanding was we wanted to catch ZooKeeperClientException when performing createLogIfNotExists within makeLeader and makeFollower methods. Did we want to place this somewhere else?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] andrewchoi5 commented on a change in pull request #8479: KAFKA-9769: Finish operations for leaderEpoch-updated partitions up to point ZK Exception

Posted by GitBox <gi...@apache.org>.
andrewchoi5 commented on a change in pull request #8479:
URL: https://github.com/apache/kafka/pull/8479#discussion_r445953130



##########
File path: core/src/main/scala/kafka/cluster/Partition.scala
##########
@@ -499,7 +500,16 @@ class Partition(val topicPartition: TopicPartition,
         addingReplicas = addingReplicas,
         removingReplicas = removingReplicas
       )
-      createLogIfNotExists(partitionState.isNew, isFutureReplica = false, highWatermarkCheckpoints)
+      try {
+        this.createLogIfNotExists(partitionState.isNew, isFutureReplica = false, highWatermarkCheckpoints)
+      } catch {
+        case e: ZooKeeperClientException =>
+          stateChangeLogger.info(s"Because a ZooKeeper client exception has occurred, completed become leader " +
+            s"state change from epoch $leaderEpoch only for those updated partitions with before " +
+            s"ZooKeeper disconnect occurred.", e)
+          error(s"ZooKeeper client occurred while rendering a $topicPartition's leader through zkClient.'", e)

Review comment:
       πŸ‘ , have made that change. Thanks @junrao 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] andrewchoi5 commented on a change in pull request #8479: KAFKA-9769: Finish operations for leaderEpoch-updated partitions up to point ZK Exception

Posted by GitBox <gi...@apache.org>.
andrewchoi5 commented on a change in pull request #8479:
URL: https://github.com/apache/kafka/pull/8479#discussion_r445952240



##########
File path: core/src/main/scala/kafka/cluster/Partition.scala
##########
@@ -499,7 +500,16 @@ class Partition(val topicPartition: TopicPartition,
         addingReplicas = addingReplicas,
         removingReplicas = removingReplicas
       )
-      createLogIfNotExists(partitionState.isNew, isFutureReplica = false, highWatermarkCheckpoints)
+      try {
+        this.createLogIfNotExists(partitionState.isNew, isFutureReplica = false, highWatermarkCheckpoints)
+      } catch {
+        case e: ZooKeeperClientException =>
+          stateChangeLogger.info(s"Because a ZooKeeper client exception has occurred, completed become leader " +
+            s"state change from epoch $leaderEpoch only for those updated partitions with before " +

Review comment:
       That makes sense. That will be reflected. πŸ‘ 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org