You are viewing a plain text version of this content. The canonical link for it is here.
Posted to jira@kafka.apache.org by GitBox <gi...@apache.org> on 2021/03/23 17:20:48 UTC

[GitHub] [kafka] chia7712 opened a new pull request #10389: MINOR: stabilize ListOffsetsRequestTest#testResponseIncludesLeaderEpoch

chia7712 opened a new pull request #10389:
URL: https://github.com/apache/kafka/pull/10389


   There are 2 root causes.
   
   1. the mini in-sync is 1 so we could lose data when force-removing current leader
   2. we don't wait new leader to sync hw with follower so sending request to get offset could encounter `OFFSET_NOT_AVAILABLE` error.
   
   ### Committer Checklist (excluded from commit message)
   - [ ] Verify design and implementation 
   - [ ] Verify test coverage and CI build status
   - [ ] Verify documentation (including upgrade notes)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] chia7712 commented on pull request #10389: MINOR: stabilize ListOffsetsRequestTest#testResponseIncludesLeaderEpoch

Posted by GitBox <gi...@apache.org>.
chia7712 commented on pull request #10389:
URL: https://github.com/apache/kafka/pull/10389#issuecomment-806010778


   merge trunk to trigger QA again.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] chia7712 commented on pull request #10389: MINOR: stabilize ListOffsetsRequestTest#testResponseIncludesLeaderEpoch

Posted by GitBox <gi...@apache.org>.
chia7712 commented on pull request #10389:
URL: https://github.com/apache/kafka/pull/10389#issuecomment-805575730


   > Have you been able to reproduce the flaky test locally? I tried few times and I never could...
   
   yep. It requires a "slow" machine to make "slow" sync between leader and followers. I open 10 containers to loop that test on my local and it can reproduce the error effectively. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] chia7712 merged pull request #10389: KAFKA-12384: stabilize ListOffsetsRequestTest#testResponseIncludesLeaderEpoch

Posted by GitBox <gi...@apache.org>.
chia7712 merged pull request #10389:
URL: https://github.com/apache/kafka/pull/10389


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] chia7712 commented on pull request #10389: KAFKA-12384: stabilize ListOffsetsRequestTest#testResponseIncludesLeaderEpoch

Posted by GitBox <gi...@apache.org>.
chia7712 commented on pull request #10389:
URL: https://github.com/apache/kafka/pull/10389#issuecomment-814296573


   >  Is the test still flaky with the latest changes?
   
   yep. I loop this patch 100 times and all pass


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] dajac commented on pull request #10389: MINOR: stabilize ListOffsetsRequestTest#testResponseIncludesLeaderEpoch

Posted by GitBox <gi...@apache.org>.
dajac commented on pull request #10389:
URL: https://github.com/apache/kafka/pull/10389#issuecomment-805573515


   @chia7712 Have you been able to reproduce the flaky test locally? I tried few times and I never could...


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] chia7712 commented on pull request #10389: MINOR: stabilize ListOffsetsRequestTest#testResponseIncludesLeaderEpoch

Posted by GitBox <gi...@apache.org>.
chia7712 commented on pull request #10389:
URL: https://github.com/apache/kafka/pull/10389#issuecomment-805454465


   @dengziming thanks for your review!
   
   > should we trigger the jenkins build multiple times to verify that the flaky test is fixed.
   
   sure. I also loop `ListOffsetsRequestTest` 300 times on my local. all pass


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] chia7712 edited a comment on pull request #10389: MINOR: stabilize ListOffsetsRequestTest#testResponseIncludesLeaderEpoch

Posted by GitBox <gi...@apache.org>.
chia7712 edited a comment on pull request #10389:
URL: https://github.com/apache/kafka/pull/10389#issuecomment-805575730


   > Have you been able to reproduce the flaky test locally? I tried few times and I never could...
   
   yep. It requires a "slow" machine to make "slow" sync between leader and followers. I open 8 containers to loop that test on my local and it can reproduce the error effectively. 
   
   <img width="1440" alt="截圖 2021-03-24 下午3 41 16" src="https://user-images.githubusercontent.com/6234750/112272634-6c5efb80-8cb7-11eb-93fd-68c86b9d0360.png">
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] dengziming commented on a change in pull request #10389: MINOR: stabilize ListOffsetsRequestTest#testResponseIncludesLeaderEpoch

Posted by GitBox <gi...@apache.org>.
dengziming commented on a change in pull request #10389:
URL: https://github.com/apache/kafka/pull/10389#discussion_r600079948



##########
File path: core/src/test/scala/unit/kafka/server/ListOffsetsRequestTest.scala
##########
@@ -139,10 +139,17 @@ class ListOffsetsRequestTest extends BaseRequestTest {
 
     val request = if (version == -1) builder.build() else builder.build(version)
 
-    val response = sendRequest(serverId, request)
-    val partitionData = response.topics.asScala.find(_.name == topic).get
+    sendRequest(serverId, request).topics.asScala.find(_.name == topic).get
       .partitions.asScala.find(_.partitionIndex == partition.partition).get
+  }
 
+  // -1 indicate "latest"
+  private[this] def fetchOffsetAndEpoch(serverId: Int,
+                                        timestamp: Long,
+                                        version: Short): (Long, Int) = {
+    val partitionData = sendRequest(serverId, timestamp, version)
+
+    println(s"[CHIA] fetchOffsetAndEpoch version: $version partitionData: $partitionData")

Review comment:
       nit: this is checked in by accident?

##########
File path: core/src/test/scala/unit/kafka/server/ListOffsetsRequestTest.scala
##########
@@ -166,6 +176,9 @@ class ListOffsetsRequestTest extends BaseRequestTest {
     // Kill the first leader so that we can verify the epoch change when fetching the latest offset
     killBroker(firstLeaderId)
     val secondLeaderId = TestUtils.awaitLeaderChange(servers, partition, firstLeaderId)
+    // make sure high watermark of new leader has not caught up

Review comment:
       should this be "has caught up"




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] chia7712 commented on pull request #10389: MINOR: stabilize ListOffsetsRequestTest#testResponseIncludesLeaderEpoch

Posted by GitBox <gi...@apache.org>.
chia7712 commented on pull request #10389:
URL: https://github.com/apache/kafka/pull/10389#issuecomment-813901559


   unrelated error. merge trunk to trigger QA again


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] chia7712 commented on a change in pull request #10389: MINOR: stabilize ListOffsetsRequestTest#testResponseIncludesLeaderEpoch

Posted by GitBox <gi...@apache.org>.
chia7712 commented on a change in pull request #10389:
URL: https://github.com/apache/kafka/pull/10389#discussion_r602237567



##########
File path: core/src/test/scala/unit/kafka/server/ListOffsetsRequestTest.scala
##########
@@ -154,7 +160,10 @@ class ListOffsetsRequestTest extends BaseRequestTest {
 
   @Test
   def testResponseIncludesLeaderEpoch(): Unit = {
-    val partitionToLeader = TestUtils.createTopic(zkClient, topic, numPartitions = 1, replicationFactor = 3, servers)
+    val topicConfig = new Properties
+    // make sure we won't lose data when force-removing leader
+    topicConfig.setProperty(TopicConfig.MIN_IN_SYNC_REPLICAS_CONFIG, "2")

Review comment:
       IIRC, acks=all means “all in-sync replicas” (rather than “all replicas”) have to receive record. In other words, produce request can be completed after one replica has received the record if mini ISR is one. When we shutdown the server with mini replica=1, the successful record may be NOT synced with other broker (there are 3 brokers totally). In short , the data could get lost after we shutdown current leader.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] chia7712 commented on a change in pull request #10389: MINOR: stabilize ListOffsetsRequestTest#testResponseIncludesLeaderEpoch

Posted by GitBox <gi...@apache.org>.
chia7712 commented on a change in pull request #10389:
URL: https://github.com/apache/kafka/pull/10389#discussion_r602393980



##########
File path: core/src/test/scala/unit/kafka/server/ListOffsetsRequestTest.scala
##########
@@ -154,7 +160,10 @@ class ListOffsetsRequestTest extends BaseRequestTest {
 
   @Test
   def testResponseIncludesLeaderEpoch(): Unit = {
-    val partitionToLeader = TestUtils.createTopic(zkClient, topic, numPartitions = 1, replicationFactor = 3, servers)
+    val topicConfig = new Properties
+    // make sure we won't lose data when force-removing leader
+    topicConfig.setProperty(TopicConfig.MIN_IN_SYNC_REPLICAS_CONFIG, "2")

Review comment:
       > Are you saying that we are shrinking the ISR to 1 in this test? That would be unexpected.
   
   No, I did not observe such accident. Maybe I misunderstood the log before and my assumption is not happen (loop it 1000 times). will revert this change.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] chia7712 commented on pull request #10389: MINOR: stabilize ListOffsetsRequestTest#testResponseIncludesLeaderEpoch

Posted by GitBox <gi...@apache.org>.
chia7712 commented on pull request #10389:
URL: https://github.com/apache/kafka/pull/10389#issuecomment-808075755


   unrelated failure. will merge trunk to trigger QA again.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] chia7712 commented on pull request #10389: MINOR: stabilize ListOffsetsRequestTest#testResponseIncludesLeaderEpoch

Posted by GitBox <gi...@apache.org>.
chia7712 commented on pull request #10389:
URL: https://github.com/apache/kafka/pull/10389#issuecomment-810780261


   > Please also update the status in KAFKA-12384. Thank you.
   
   oh, sorry that I did not notice the existent ticket. will update the jira :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] ijuma commented on a change in pull request #10389: MINOR: stabilize ListOffsetsRequestTest#testResponseIncludesLeaderEpoch

Posted by GitBox <gi...@apache.org>.
ijuma commented on a change in pull request #10389:
URL: https://github.com/apache/kafka/pull/10389#discussion_r602257805



##########
File path: core/src/test/scala/unit/kafka/server/ListOffsetsRequestTest.scala
##########
@@ -154,7 +160,10 @@ class ListOffsetsRequestTest extends BaseRequestTest {
 
   @Test
   def testResponseIncludesLeaderEpoch(): Unit = {
-    val partitionToLeader = TestUtils.createTopic(zkClient, topic, numPartitions = 1, replicationFactor = 3, servers)
+    val topicConfig = new Properties
+    // make sure we won't lose data when force-removing leader
+    topicConfig.setProperty(TopicConfig.MIN_IN_SYNC_REPLICAS_CONFIG, "2")

Review comment:
       Yeah, but if you have 3 brokers and you shutdown one of them, you should still have 2 brokers in the ISR. Are you saying that we are shrinking the ISR to 1 in this test? That would be unexpected.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] dajac commented on pull request #10389: MINOR: stabilize ListOffsetsRequestTest#testResponseIncludesLeaderEpoch

Posted by GitBox <gi...@apache.org>.
dajac commented on pull request #10389:
URL: https://github.com/apache/kafka/pull/10389#issuecomment-805581946


   @chia7712 awesome!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] ijuma commented on pull request #10389: KAFKA-12384: stabilize ListOffsetsRequestTest#testResponseIncludesLeaderEpoch

Posted by GitBox <gi...@apache.org>.
ijuma commented on pull request #10389:
URL: https://github.com/apache/kafka/pull/10389#issuecomment-814107295


   @chia7712 Is the test still flaky with the latest changes?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] chia7712 commented on pull request #10389: MINOR: stabilize ListOffsetsRequestTest#testResponseIncludesLeaderEpoch

Posted by GitBox <gi...@apache.org>.
chia7712 commented on pull request #10389:
URL: https://github.com/apache/kafka/pull/10389#issuecomment-810731206


   @ijuma any suggestions? This test is still flaky :(


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] ijuma commented on a change in pull request #10389: MINOR: stabilize ListOffsetsRequestTest#testResponseIncludesLeaderEpoch

Posted by GitBox <gi...@apache.org>.
ijuma commented on a change in pull request #10389:
URL: https://github.com/apache/kafka/pull/10389#discussion_r602220116



##########
File path: core/src/test/scala/unit/kafka/server/ListOffsetsRequestTest.scala
##########
@@ -154,7 +160,10 @@ class ListOffsetsRequestTest extends BaseRequestTest {
 
   @Test
   def testResponseIncludesLeaderEpoch(): Unit = {
-    val partitionToLeader = TestUtils.createTopic(zkClient, topic, numPartitions = 1, replicationFactor = 3, servers)
+    val topicConfig = new Properties
+    // make sure we won't lose data when force-removing leader
+    topicConfig.setProperty(TopicConfig.MIN_IN_SYNC_REPLICAS_CONFIG, "2")

Review comment:
       Is this required? Even with min isr 1, we still require all in sync replicas to ack for acks=all. How many brokers do we have in the test?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org