You are viewing a plain text version of this content. The canonical link for it is here.

Posted to jira@kafka.apache.org by GitBox <gi...@apache.org> on 2021/03/26 02:54:00 UTC

[GitHub] [kafka] showuon opened a new pull request #10409: KAFKA-9295: increase heartbeat interval to avoid unneeded rebalance

showuon opened a new pull request #10409:
URL: https://github.com/apache/kafka/pull/10409


   After investigation, I found the test sometimes failed because the streams keep rebalancing before they start to read/process/write data, due to heartbeat timeout:
   ```
   Preparing to rebalance group KTable-FKJ-Multi in state PreparingRebalance with old generation 2 (__consumer_offsets-3) (reason: removing member KTable-FKJ-Multi-5d3d74ab-1475-4cef-a837-1205ba6c2bfe-StreamThread-1-consumer-4d4c9703-765d-4acb-a7f0-a180f09734b8 on heartbeat expiration)
   ```
   When the system is slow, it might not be able to send heart within default 3 seconds, and cause another rebalance triggering, and then, the consumer re-join again, and make the system even slower... To fix it, I increase the heartbeat interval, and corresponding session timeout, to avoid unnecessary rebalance, to make this test reliable.
   
   ### Committer Checklist (excluded from commit message)
   - [ ] Verify design and implementation 
   - [ ] Verify test coverage and CI build status
   - [ ] Verify documentation (including upgrade notes)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [kafka] showuon commented on pull request #10409: KAFKA-9295: improve KTableKTableForeignKeyInnerJoinMultiIntegrationTest

Posted by GitBox <gi...@apache.org>.

showuon commented on pull request #10409:
URL: https://github.com/apache/kafka/pull/10409#issuecomment-818309573


   Sure, I'll update the PR later. I removed it because I tried before and it doesn't work well. Let's try it again 🙂


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [kafka] showuon commented on pull request #10409: KAFKA-9295: improve KTableKTableForeignKeyInnerJoinMultiIntegrationTest

Posted by GitBox <gi...@apache.org>.

showuon commented on pull request #10409:
URL: https://github.com/apache/kafka/pull/10409#issuecomment-817457115


   @ableegoldman , I've updated this PR to revert the increasing heartbeat interval change. Thank you.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [kafka] showuon commented on a change in pull request #10409: KAFKA-9295: increase heartbeat interval to avoid unneeded rebalance

Posted by GitBox <gi...@apache.org>.

showuon commented on a change in pull request #10409:
URL: https://github.com/apache/kafka/pull/10409#discussion_r611308156



##########
File path: streams/src/test/java/org/apache/kafka/streams/integration/KTableKTableForeignKeyInnerJoinMultiIntegrationTest.java
##########
@@ -154,8 +154,6 @@ public void before() throws IOException {
         streamsConfig.put(StreamsConfig.STATE_DIR_CONFIG, stateDirBasePath + "-1");
         streamsConfigTwo.put(StreamsConfig.STATE_DIR_CONFIG, stateDirBasePath + "-2");
         streamsConfigThree.put(StreamsConfig.STATE_DIR_CONFIG, stateDirBasePath + "-3");
-
-        IntegrationTestUtils.purgeLocalStreamsState(asList(streamsConfig, streamsConfigTwo, streamsConfigThree));

Review comment:
       Agree with you. I also think it's not necessary to do purge before test. Remove it.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [kafka] ableegoldman commented on a change in pull request #10409: KAFKA-9295: increase heartbeat interval to avoid unneeded rebalance

Posted by GitBox <gi...@apache.org>.

ableegoldman commented on a change in pull request #10409:
URL: https://github.com/apache/kafka/pull/10409#discussion_r601991532



##########
File path: streams/src/test/java/org/apache/kafka/streams/integration/KTableKTableForeignKeyInnerJoinMultiIntegrationTest.java
##########
@@ -172,7 +174,7 @@ public void after() throws IOException {
             streamsThree.close();
             streamsThree = null;
         }
-        IntegrationTestUtils.purgeLocalStreamsState(streamsConfig);
+        IntegrationTestUtils.purgeLocalStreamsState(asList(streamsConfig, streamsConfigTwo, streamsConfigThree));

Review comment:
       This is done above in `before()` so it's probably not behind the flakiness, unfortunately. Not sure why we purge the local state twice but you're right, we should at least be consistent when we do and purge the state for all three Streams. Good catch




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [kafka] showuon commented on pull request #10409: KAFKA-9295: increase heartbeat interval to avoid unneeded rebalance

Posted by GitBox <gi...@apache.org>.

showuon commented on pull request #10409:
URL: https://github.com/apache/kafka/pull/10409#issuecomment-817043654


   @ableegoldman , thanks for your comment. I know what your concern, and it makes sense we merge other improvement first. I'll update the PR. Thank you.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [kafka] ableegoldman commented on pull request #10409: KAFKA-9295: increase heartbeat interval to avoid unneeded rebalance

Posted by GitBox <gi...@apache.org>.

ableegoldman commented on pull request #10409:
URL: https://github.com/apache/kafka/pull/10409#issuecomment-817035420


   Thanks @showuon . I guess my question is really this: why is it that this test, and no others, is frequently flaky due to dropping out on the session interval? There's nothing really "special" about it, ie it uses a single StreamThread and the default hb/session interval. It's hard to believe that Jenkins is so bad that a single consumer can't get a single heartbeat to the brokers in 10s. Maybe there is a bug or other slowdown in the networking layer, I don't know -- my point was really that dropping out on the heartbeat should not be expected, and if that's really causing tests to fail then it seems like a valid problem to investigate further, and not necessarily just expected flakiness (Even if it's not what this particular test was _supposed_ to be testing) Does that make sense?
   
   I do see that it appears to drop out on hb expiration, but that only occurs twice so I wonder if that's really causing the flakiness or whether it should be able to recover from this. What if we just merge the other improvements in this PR to start with, and see how much that helps?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [kafka] ableegoldman edited a comment on pull request #10409: KAFKA-9295: increase heartbeat interval to avoid unneeded rebalance

Posted by GitBox <gi...@apache.org>.

ableegoldman edited a comment on pull request #10409:
URL: https://github.com/apache/kafka/pull/10409#issuecomment-807906627


   > When the system is slow, it might not be able to send heartbeat within default 3 seconds
   
   Well, it would have to miss ~3 heartbeats in order to trigger a rebalance on the session timeout which is 10 seconds. It's a bit hard to believe that it really can't get a single heartbeat in within 10s, and if so that only this one test would be affected when most of them rely on the default heartbeat/session interval.
   
   Were you seeing that it's continuously rebalancing in the logs? Even if it did have one rebalance due to one missed session interval, it had a full minute to process just a single record. I'm skeptical that the default configs are the problem here, and they may actually mask a real issue if we just blindly increase them. If it really is dropping out on hb expiration then that sounds like a real problem -- Jenkins is flaky, sure, but there's only so much we can blame on this poor testing infrastructure 🙂 
   
   Have you had any luck in reproducing this locally? If we can get full logs, especially debug logs, that will help us confirm the root cause here.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [kafka] ableegoldman commented on pull request #10409: KAFKA-9295: improve KTableKTableForeignKeyInnerJoinMultiIntegrationTest

Posted by GitBox <gi...@apache.org>.

ableegoldman commented on pull request #10409:
URL: https://github.com/apache/kafka/pull/10409#issuecomment-818077023


   It looks like the test failed in this most recent run -- however that's not unexpected since I think you may have accidentally taken out the other fix which probably _will_ help: to use `startStreamsAndWaitForRunning` so we make sure to wait for the KafkaStreams to start up and get into RUNNING. Let's add that back and see if it helps (I suspect it will)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [kafka] ableegoldman edited a comment on pull request #10409: KAFKA-9295: increase heartbeat interval to avoid unneeded rebalance

Posted by GitBox <gi...@apache.org>.

ableegoldman edited a comment on pull request #10409:
URL: https://github.com/apache/kafka/pull/10409#issuecomment-807906627


   > When the system is slow, it might not be able to send heartbeat within default 3 seconds
   
   Well, it would have to miss ~3 heartbeats in order to trigger a rebalance on the session timeout which is 10 seconds. It's a bit hard to believe that it really can't get a single heartbeat in within 10s, and even if so that only this one test would be affected when most of them rely on the default heartbeat/session interval.
   
   Were you seeing that it's continuously rebalancing in the logs? Even if it did have one rebalance due to one missed session interval, it had a full minute to process just a single record. I'm skeptical that the default configs are the problem here, and they may actually mask a real issue if we just blindly increase them. 
   
   Have you had any luck in reproducing this locally? If we can get full logs, especially debug logs, that will help us confirm the root cause here


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [kafka] showuon edited a comment on pull request #10409: KAFKA-9295: increase heartbeat interval to avoid unneeded rebalance

Posted by GitBox <gi...@apache.org>.

showuon edited a comment on pull request #10409:
URL: https://github.com/apache/kafka/pull/10409#issuecomment-816544910


   @ableegoldman , I ran multiple times on jenkins and print some debug logs. Please check here:
   https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-9724/237/testReport/junit/org.apache.kafka.streams.integration/KTableKTableForeignKeyInnerJoinMultiIntegration10Test/Build___JDK_15_and_Scala_2_13___shouldInnerJoinMultiPartitionQueryable/
   
   The error message: `failed 1st time` is because I don't stop the test immediately when the normal test timed out (60 seconds) for waiting consuming 1 record. I keep waiting for another 60 seconds to see what will happen. If it passed in 2nd 60 seconds, it'll failed with `failed 1st time` indicating this test passed in 2nd run, otherwise, it'll failed with normal error message (i.e. `Did not receive all 1 record...`).
   
   OK, let's check the logs, you can scroll down to the **standard error** logs (where I put my debug logs).
   At first, we have 3 stream client (thread) created:
   ```
   Stream-client KTable-FKJ-Multi-fc16826a-c92f-4d47-91d7-7e17abf084b9CREATED,REBALANCING
   Stream-client KTable-FKJ-Multi-5a59900d-05ba-4208-ae6f-ff2b9d86ff93CREATED,REBALANCING
   stream-thread [KTable-FKJ-Multi-fc16826a-c92f-4d47-91d7-7e17abf084b9-StreamThread-1] , CREATED,STARTING
   stream-thread [KTable-FKJ-Multi-5a59900d-05ba-4208-ae6f-ff2b9d86ff93-StreamThread-1] , CREATED,STARTING
   Stream-client KTable-FKJ-Multi-c889e99c-7d74-4e3e-bb33-c0c03b6b2ac5CREATED,REBALANCING
   stream-thread [KTable-FKJ-Multi-c889e99c-7d74-4e3e-bb33-c0c03b6b2ac5-StreamThread-1] , CREATED,STARTING
   ```
   And then, rebalancing the tasks. You can see this logs: (reason: Adding new member KTable-FKJ-Multi....)
   ```
   
   0: Preparing to rebalance group KTable-FKJ-Multi in state PreparingRebalance with old generation 1 (__consumer_offsets-3) (reason: Adding new member KTable-FKJ-Multi-c889e99c-7d74-4e3e-bb33-c0c03b6b2ac5-StreamThread-1-consumer-7fff842d-2157-410a-a939-cca43c0d8e29 with group instance id None)
   responseStatusMap(__consumer_offsets-3 -> {error: NONE,offset: 0,logAppendTime: -1, logStartOffset: 0, recordErrors: [], errorMessage: null})
   Successfully synced group in generation:Generation{generationId=1, memberId='consumer-ktable-ktable-consumer-7-daa57628-31cb-498c-9d72-168874157cc1', protocol='range'}
   ```
   
   Later, the consumer left the group with heartbeat expiration: (reason: removing member KTable-FKJ-Multi... on heartbeat expiration)
   ```
   0: Preparing to rebalance group KTable-FKJ-Multi in state PreparingRebalance with old generation 2 (__consumer_offsets-3) (reason: removing member KTable-FKJ-Multi-c889e99c-7d74-4e3e-bb33-c0c03b6b2ac5-StreamThread-1-consumer-7fff842d-2157-410a-a939-cca43c0d8e29 on heartbeat expiration)
   Group coordinator:, cause:session timed out without receiving a heartbeat response,isDis:false
   ```
   And then, you can see this line, which means 60 seconds reached
   ```
   !!! failed 1st time
   ```
   In the meantime, you can see the stream state changed to `PARTITIONS_ASSIGNED`, `PARTITIONS_REVOKED`, `RUNNING`, just never reach a state with 3 stream client with `RUNNING` state
   Later, another consumer left group with heartbeat expiration again: (reason: removing member KTable-FKJ-Multi... on heartbeat expiration)
   ```
   0: Preparing to rebalance group KTable-FKJ-Multi in state PreparingRebalance with old generation 5 (__consumer_offsets-3) (reason: removing member KTable-FKJ-Multi-fc16826a-c92f-4d47-91d7-7e17abf084b9-StreamThread-1-consumer-080464ae-710f-478e-9195-73b55285e21d on heartbeat expiration)
   ```
   Finally, we can see all 3 stream client reached `RUNNING` state:
   ```
   stream-thread [KTable-FKJ-Multi-5a59900d-05ba-4208-ae6f-ff2b9d86ff93-StreamThread-1] , PARTITIONS_ASSIGNED,RUNNING
   Stream-client KTable-FKJ-Multi-5a59900d-05ba-4208-ae6f-ff2b9d86ff93REBALANCING,RUNNING
   stream-thread [KTable-FKJ-Multi-c889e99c-7d74-4e3e-bb33-c0c03b6b2ac5-StreamThread-1] , PARTITIONS_ASSIGNED,RUNNING
   stream-thread [KTable-FKJ-Multi-fc16826a-c92f-4d47-91d7-7e17abf084b9-StreamThread-1] , PARTITIONS_ASSIGNED,RUNNING
   Stream-client KTable-FKJ-Multi-c889e99c-7d74-4e3e-bb33-c0c03b6b2ac5REBALANCING,RUNNING
   Stream-client KTable-FKJ-Multi-fc16826a-c92f-4d47-91d7-7e17abf084b9REBALANCING,RUNNING
   ```
   And then, starting to process the data: (committing task offset)
   ```
   committing task off:{0_0={table1-0=OffsetAndMetadata{offset=1, leaderEpoch=null, metadata='AQAAAXil/kaK'}}...
   ```
   And complete the test, to close all the streams:
   ```
   closing
   Stream-client KTable-FKJ-Multi-fc16826a-c92f-4d47-91d7-7e17abf084b9RUNNING,PENDING_SHUTDOWN
   stream-thread [KTable-FKJ-Multi-fc16826a-c92f-4d47-91d7-7e17abf084b9-StreamThread-1] , RUNNING,PENDING_SHUTDOWN
   stream-thread [KTable-FKJ-Multi-fc16826a-c92f-4d47-91d7-7e17abf084b9-StreamThread-1] , PENDING_SHUTDOWN,DEAD
   Stream-client KTable-FKJ-Multi-fc16826a-c92f-4d47-91d7-7e17abf084b9PENDING_SHUTDOWN,NOT_RUNNING
   ```
   
   That's basically what I observed, and that's why I want to increase the heartbeat interval to fix this flaky test. I also tried to force all 3 streams reach `RUNNING` state before creating another consumer to consume the data, but that doesn't help. There are also other identical failed tests. You can go here to check other test logs.
   https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-9724/237/
   
   If you want to check other logs, you can let me know. Thank you.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [kafka] showuon commented on pull request #10409: KAFKA-9295: improve KTableKTableForeignKeyInnerJoinMultiIntegrationTest

Posted by GitBox <gi...@apache.org>.

showuon commented on pull request #10409:
URL: https://github.com/apache/kafka/pull/10409#issuecomment-818439327


   @ableegoldman , the failed tests are un-related:
   ```
   Build / ARM / org.apache.kafka.streams.processor.internals.InternalTopicManagerTest.shouldOnlyRetryNotSuccessfulFuturesDuringSetup
   Build / JDK 11 and Scala 2.13 / org.apache.kafka.connect.mirror.integration.MirrorConnectorsIntegrationTest.testReplication()
   Build / JDK 15 and Scala 2.13 / kafka.server.RaftClusterTest.testCreateClusterAndCreateListDeleteTopic()
   Build / JDK 8 and Scala 2.12 / kafka.server.RaftClusterTest.testCreateClusterAndCreateAndManyTopicsWithManyPartitions()
   ```
   Thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [kafka] showuon commented on a change in pull request #10409: KAFKA-9295: increase heartbeat interval to avoid unneeded rebalance

Posted by GitBox <gi...@apache.org>.

showuon commented on a change in pull request #10409:
URL: https://github.com/apache/kafka/pull/10409#discussion_r601987807



##########
File path: streams/src/test/java/org/apache/kafka/streams/integration/KTableKTableForeignKeyInnerJoinMultiIntegrationTest.java
##########
@@ -172,7 +174,7 @@ public void after() throws IOException {
             streamsThree.close();
             streamsThree = null;
         }
-        IntegrationTestUtils.purgeLocalStreamsState(streamsConfig);
+        IntegrationTestUtils.purgeLocalStreamsState(asList(streamsConfig, streamsConfigTwo, streamsConfigThree));

Review comment:
       another fix: we missed it in this fix: https://github.com/apache/kafka/pull/9978/files#diff-29da306477a4e2aaf1195f085dd77ae150f767e96277db52252d4093f802a9aeR154, to separate the streamConfig to persist UUID in state directory. 
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [kafka] ableegoldman commented on pull request #10409: KAFKA-9295: increase heartbeat interval to avoid unneeded rebalance

Posted by GitBox <gi...@apache.org>.

ableegoldman commented on pull request #10409:
URL: https://github.com/apache/kafka/pull/10409#issuecomment-807906627


   > When the system is slow, it might not be able to send heartbeat within default 3 seconds
   
   Well, it would have to miss ~3 heartbeats in order to trigger a rebalance on the session timeout which is 10 seconds. It's a bit hard to believe that it really can't get a single heartbeat in within 10s, and even if so that only this one test would be affected when most of them rely on the default heartbeat/session interval.
   
   Were you seeing that it's continuously rebalancing in the logs? Even if it did have one rebalance due to one missed session interval, it had a full minute to process just a single record. I'm skeptical that the default configs are the problem here, and they may actually mask a real issue if we just blindly increase them. 
   
   Have you had any luck in reproducing this locally If we can get full logs, especially debug logs, that will help us confirm the root cause here


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [kafka] showuon edited a comment on pull request #10409: KAFKA-9295: increase heartbeat interval to avoid unneeded rebalance

Posted by GitBox <gi...@apache.org>.

showuon edited a comment on pull request #10409:
URL: https://github.com/apache/kafka/pull/10409#issuecomment-816544910


   @ableegoldman , I ran multiple times on jenkins and print some debug logs. Please check here:
   https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-9724/237/testReport/junit/org.apache.kafka.streams.integration/KTableKTableForeignKeyInnerJoinMultiIntegration10Test/Build___JDK_15_and_Scala_2_13___shouldInnerJoinMultiPartitionQueryable/
   
   The error message: `failed 1st time` is because I don't stop the test immediately when the normal test timed out (60 seconds) for waiting consuming 1 record. I keep waiting for another 60 seconds to see what will happen. If it passed in 2nd 60 seconds, it'll failed with `failed 1st time` indicating this test passed in 2nd run, otherwise, it'll failed with normal error message (i.e. `Did not receive all 1 record...`).
   
   OK, let's check the logs, you can scroll down to the **standard error** logs (where I put my debug logs).
   At first, we have 3 stream client (thread) created:
   ```
   Stream-client KTable-FKJ-Multi-fc16826a-c92f-4d47-91d7-7e17abf084b9CREATED,REBALANCING
   Stream-client KTable-FKJ-Multi-5a59900d-05ba-4208-ae6f-ff2b9d86ff93CREATED,REBALANCING
   stream-thread [KTable-FKJ-Multi-fc16826a-c92f-4d47-91d7-7e17abf084b9-StreamThread-1] , CREATED,STARTING
   stream-thread [KTable-FKJ-Multi-5a59900d-05ba-4208-ae6f-ff2b9d86ff93-StreamThread-1] , CREATED,STARTING
   Stream-client KTable-FKJ-Multi-c889e99c-7d74-4e3e-bb33-c0c03b6b2ac5CREATED,REBALANCING
   stream-thread [KTable-FKJ-Multi-c889e99c-7d74-4e3e-bb33-c0c03b6b2ac5-StreamThread-1] , CREATED,STARTING
   ```
   And then, rebalancing the tasks. You can see this logs: (reason: Adding new member KTable-FKJ-Multi....)
   ```
   
   0: Preparing to rebalance group KTable-FKJ-Multi in state PreparingRebalance with old generation 1 (__consumer_offsets-3) (reason: Adding new member KTable-FKJ-Multi-c889e99c-7d74-4e3e-bb33-c0c03b6b2ac5-StreamThread-1-consumer-7fff842d-2157-410a-a939-cca43c0d8e29 with group instance id None)
   responseStatusMap(__consumer_offsets-3 -> {error: NONE,offset: 0,logAppendTime: -1, logStartOffset: 0, recordErrors: [], errorMessage: null})
   Successfully synced group in generation:Generation{generationId=1, memberId='consumer-ktable-ktable-consumer-7-daa57628-31cb-498c-9d72-168874157cc1', protocol='range'}
   ```
   
   Later, 1 consumer left the group with heartbeat expiration: (reason: removing member KTable-FKJ-Multi... on heartbeat expiration)
   ```
   0: Preparing to rebalance group KTable-FKJ-Multi in state PreparingRebalance with old generation 2 (__consumer_offsets-3) (reason: removing member KTable-FKJ-Multi-c889e99c-7d74-4e3e-bb33-c0c03b6b2ac5-StreamThread-1-consumer-7fff842d-2157-410a-a939-cca43c0d8e29 on heartbeat expiration)
   Group coordinator:, cause:session timed out without receiving a heartbeat response,isDis:false
   ```
   And then, you can see this line, which means 60 seconds reached
   ```
   !!! failed 1st time
   ```
   In the meantime, you can see the stream state changed to `PARTITIONS_ASSIGNED`, `PARTITIONS_REVOKED`, `RUNNING`, just never reach a state with 3 stream client with `RUNNING` state
   Later, another consumer left group with heartbeat expiration again: (reason: removing member KTable-FKJ-Multi... on heartbeat expiration)
   ```
   0: Preparing to rebalance group KTable-FKJ-Multi in state PreparingRebalance with old generation 5 (__consumer_offsets-3) (reason: removing member KTable-FKJ-Multi-fc16826a-c92f-4d47-91d7-7e17abf084b9-StreamThread-1-consumer-080464ae-710f-478e-9195-73b55285e21d on heartbeat expiration)
   ```
   Finally, we can see all 3 stream client reached `RUNNING` state:
   ```
   stream-thread [KTable-FKJ-Multi-5a59900d-05ba-4208-ae6f-ff2b9d86ff93-StreamThread-1] , PARTITIONS_ASSIGNED,RUNNING
   Stream-client KTable-FKJ-Multi-5a59900d-05ba-4208-ae6f-ff2b9d86ff93REBALANCING,RUNNING
   stream-thread [KTable-FKJ-Multi-c889e99c-7d74-4e3e-bb33-c0c03b6b2ac5-StreamThread-1] , PARTITIONS_ASSIGNED,RUNNING
   stream-thread [KTable-FKJ-Multi-fc16826a-c92f-4d47-91d7-7e17abf084b9-StreamThread-1] , PARTITIONS_ASSIGNED,RUNNING
   Stream-client KTable-FKJ-Multi-c889e99c-7d74-4e3e-bb33-c0c03b6b2ac5REBALANCING,RUNNING
   Stream-client KTable-FKJ-Multi-fc16826a-c92f-4d47-91d7-7e17abf084b9REBALANCING,RUNNING
   ```
   And then, starting to process the data: (committing task offset)
   ```
   committing task off:{0_0={table1-0=OffsetAndMetadata{offset=1, leaderEpoch=null, metadata='AQAAAXil/kaK'}}...
   ```
   And complete the test, to close all the streams:
   ```
   closing
   Stream-client KTable-FKJ-Multi-fc16826a-c92f-4d47-91d7-7e17abf084b9RUNNING,PENDING_SHUTDOWN
   stream-thread [KTable-FKJ-Multi-fc16826a-c92f-4d47-91d7-7e17abf084b9-StreamThread-1] , RUNNING,PENDING_SHUTDOWN
   stream-thread [KTable-FKJ-Multi-fc16826a-c92f-4d47-91d7-7e17abf084b9-StreamThread-1] , PENDING_SHUTDOWN,DEAD
   Stream-client KTable-FKJ-Multi-fc16826a-c92f-4d47-91d7-7e17abf084b9PENDING_SHUTDOWN,NOT_RUNNING
   ```
   
   That's basically what I observed, and that's why I want to increase the heartbeat interval to fix this flaky test. I also tried to force all 3 streams reach `RUNNING` state before creating another consumer to consume the data, but that doesn't help. (still failed sometimes while all 3 stream clients don't reach running within 60 secs). There are also other identical failed tests in this branch test. You can go here to check other test logs.
   https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-9724/237/
   
   If you want to check other logs, you can let me know. Thank you.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [kafka] ableegoldman commented on pull request #10409: KAFKA-9295: improve KTableKTableForeignKeyInnerJoinMultiIntegrationTest

Posted by GitBox <gi...@apache.org>.

ableegoldman commented on pull request #10409:
URL: https://github.com/apache/kafka/pull/10409#issuecomment-818970047


   Thanks @showuon , merged to trunk. Let's also close out the ticket for now so that if someone sees it fail again on a PR build they'll reopen it and we'll be notified


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [kafka] showuon commented on a change in pull request #10409: KAFKA-9295: increase heartbeat interval to avoid unneeded rebalance

Posted by GitBox <gi...@apache.org>.

showuon commented on a change in pull request #10409:
URL: https://github.com/apache/kafka/pull/10409#discussion_r601987807



##########
File path: streams/src/test/java/org/apache/kafka/streams/integration/KTableKTableForeignKeyInnerJoinMultiIntegrationTest.java
##########
@@ -172,7 +174,7 @@ public void after() throws IOException {
             streamsThree.close();
             streamsThree = null;
         }
-        IntegrationTestUtils.purgeLocalStreamsState(streamsConfig);
+        IntegrationTestUtils.purgeLocalStreamsState(asList(streamsConfig, streamsConfigTwo, streamsConfigThree));

Review comment:
       another fix: we missed it in this fix: https://github.com/apache/kafka/pull/9978/files#diff-29da306477a4e2aaf1195f085dd77ae150f767e96277db52252d4093f802a9aeR152, to separate the streamConfig to persist UUID in state directory. 
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [kafka] showuon commented on pull request #10409: KAFKA-9295: increase heartbeat interval to avoid unneeded rebalance

Posted by GitBox <gi...@apache.org>.

showuon commented on pull request #10409:
URL: https://github.com/apache/kafka/pull/10409#issuecomment-816544910


   @ableegoldman , I ran multiple times on jenkins and print some debug logs. Please check here:
   https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-9724/237/testReport/junit/org.apache.kafka.streams.integration/KTableKTableForeignKeyInnerJoinMultiIntegration10Test/Build___JDK_15_and_Scala_2_13___shouldInnerJoinMultiPartitionQueryable/
   
   The error message: `failed 1st time` is because I don't stop the test immediately when the normal test timed out (60 seconds) for waiting consuming 1 record. I keep waiting for another 60 seconds to see what will happen. If it passed in 2nd 60 seconds, it'll failed with `failed 1st time` indicating this test passed in 2nd run, otherwise, it'll failed with normal error message (i.e. `Did not receive all 1 record...`).
   
   OK, let's check the logs, you can scroll down to the standard error logs (where I put my debug logs).
   At first, we have 3 stream client (thread) created:
   ```
   Stream-client KTable-FKJ-Multi-fc16826a-c92f-4d47-91d7-7e17abf084b9CREATED,REBALANCING
   Stream-client KTable-FKJ-Multi-5a59900d-05ba-4208-ae6f-ff2b9d86ff93CREATED,REBALANCING
   stream-thread [KTable-FKJ-Multi-fc16826a-c92f-4d47-91d7-7e17abf084b9-StreamThread-1] , CREATED,STARTING
   stream-thread [KTable-FKJ-Multi-5a59900d-05ba-4208-ae6f-ff2b9d86ff93-StreamThread-1] , CREATED,STARTING
   Stream-client KTable-FKJ-Multi-c889e99c-7d74-4e3e-bb33-c0c03b6b2ac5CREATED,REBALANCING
   stream-thread [KTable-FKJ-Multi-c889e99c-7d74-4e3e-bb33-c0c03b6b2ac5-StreamThread-1] , CREATED,STARTING
   ```
   And then, rebalancing the tasks. You can see this logs:
   ```
   
   0: Preparing to rebalance group KTable-FKJ-Multi in state PreparingRebalance with old generation 1 (__consumer_offsets-3) (reason: Adding new member KTable-FKJ-Multi-c889e99c-7d74-4e3e-bb33-c0c03b6b2ac5-StreamThread-1-consumer-7fff842d-2157-410a-a939-cca43c0d8e29 with group instance id None)
   responseStatusMap(__consumer_offsets-3 -> {error: NONE,offset: 0,logAppendTime: -1, logStartOffset: 0, recordErrors: [], errorMessage: null})
   Successfully synced group in generation:Generation{generationId=1, memberId='consumer-ktable-ktable-consumer-7-daa57628-31cb-498c-9d72-168874157cc1', protocol='range'}
   ```
   
   Later, the consumer left the group with heartbeat expiration
   ```
   0: Preparing to rebalance group KTable-FKJ-Multi in state PreparingRebalance with old generation 2 (__consumer_offsets-3) (reason: removing member KTable-FKJ-Multi-c889e99c-7d74-4e3e-bb33-c0c03b6b2ac5-StreamThread-1-consumer-7fff842d-2157-410a-a939-cca43c0d8e29 on heartbeat expiration)
   Group coordinator:, cause:session timed out without receiving a heartbeat response,isDis:false
   ```
   And then, you can see this line, which means 60 seconds reached
   ```
   !!! failed 1st time
   ```
   In the meantime, you can see the stream state changed to `PARTITIONS_ASSIGNED`, `PARTITIONS_REVOKED`, `RUNNING`, just never reach a state with 3 stream client with `RUNNING` state
   Later, another consumer left group with heartbeat expiration again:
   ```
   0: Preparing to rebalance group KTable-FKJ-Multi in state PreparingRebalance with old generation 5 (__consumer_offsets-3) (reason: removing member KTable-FKJ-Multi-fc16826a-c92f-4d47-91d7-7e17abf084b9-StreamThread-1-consumer-080464ae-710f-478e-9195-73b55285e21d on heartbeat expiration)
   ```
   Finally, we can see all 3 stream client reached `RUNNING` state:
   ```
   stream-thread [KTable-FKJ-Multi-5a59900d-05ba-4208-ae6f-ff2b9d86ff93-StreamThread-1] , PARTITIONS_ASSIGNED,RUNNING
   Stream-client KTable-FKJ-Multi-5a59900d-05ba-4208-ae6f-ff2b9d86ff93REBALANCING,RUNNING
   stream-thread [KTable-FKJ-Multi-c889e99c-7d74-4e3e-bb33-c0c03b6b2ac5-StreamThread-1] , PARTITIONS_ASSIGNED,RUNNING
   stream-thread [KTable-FKJ-Multi-fc16826a-c92f-4d47-91d7-7e17abf084b9-StreamThread-1] , PARTITIONS_ASSIGNED,RUNNING
   Stream-client KTable-FKJ-Multi-c889e99c-7d74-4e3e-bb33-c0c03b6b2ac5REBALANCING,RUNNING
   Stream-client KTable-FKJ-Multi-fc16826a-c92f-4d47-91d7-7e17abf084b9REBALANCING,RUNNING
   ```
   And then, starting to process the data: (committing task offset)
   ```
   committing task off:{0_0={table1-0=OffsetAndMetadata{offset=1, leaderEpoch=null, metadata='AQAAAXil/kaK'}}...
   ```
   And complete the test, to close all the streams:
   ```
   closing
   Stream-client KTable-FKJ-Multi-fc16826a-c92f-4d47-91d7-7e17abf084b9RUNNING,PENDING_SHUTDOWN
   stream-thread [KTable-FKJ-Multi-fc16826a-c92f-4d47-91d7-7e17abf084b9-StreamThread-1] , RUNNING,PENDING_SHUTDOWN
   stream-thread [KTable-FKJ-Multi-fc16826a-c92f-4d47-91d7-7e17abf084b9-StreamThread-1] , PENDING_SHUTDOWN,DEAD
   Stream-client KTable-FKJ-Multi-fc16826a-c92f-4d47-91d7-7e17abf084b9PENDING_SHUTDOWN,NOT_RUNNING
   ```
   
   That's basically what I observed, and that's why I want to increase the heartbeat interval to fix this flaky test. I also tried to force all 3 streams reach `RUNNING` state before creating another consumer to consume the data, but that doesn't help. There are also other identical failed tests. You can go here to check other test logs.
   https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-9724/237/
   
   If you want to check other logs, you can let me know. Thank you.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [kafka] showuon commented on a change in pull request #10409: KAFKA-9295: increase heartbeat interval to avoid unneeded rebalance

Posted by GitBox <gi...@apache.org>.

showuon commented on a change in pull request #10409:
URL: https://github.com/apache/kafka/pull/10409#discussion_r611308540



##########
File path: streams/src/test/java/org/apache/kafka/streams/integration/KTableKTableForeignKeyInnerJoinMultiIntegrationTest.java
##########
@@ -172,38 +170,35 @@ public void after() throws IOException {
             streamsThree.close();
             streamsThree = null;
         }
-        IntegrationTestUtils.purgeLocalStreamsState(streamsConfig);
+        IntegrationTestUtils.purgeLocalStreamsState(asList(streamsConfig, streamsConfigTwo, streamsConfigThree));
     }
 
-    private enum JoinType {
-        INNER
-    }

Review comment:
       No need enum here since we only test `INNER` join type here.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [kafka] showuon edited a comment on pull request #10409: KAFKA-9295: increase heartbeat interval to avoid unneeded rebalance

Posted by GitBox <gi...@apache.org>.

showuon edited a comment on pull request #10409:
URL: https://github.com/apache/kafka/pull/10409#issuecomment-816544910


   @ableegoldman , I ran multiple times on jenkins and print some debug logs. Please check here:
   https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-9724/237/testReport/junit/org.apache.kafka.streams.integration/KTableKTableForeignKeyInnerJoinMultiIntegration10Test/Build___JDK_15_and_Scala_2_13___shouldInnerJoinMultiPartitionQueryable/
   
   The error message: `failed 1st time` is because I don't stop the test immediately when the normal test timed out (60 seconds) for waiting consuming 1 record. I keep waiting for another 60 seconds to see what will happen. If it passed in 2nd 60 seconds, it'll failed with `failed 1st time` indicating this test passed in 2nd run, otherwise, it'll failed with normal error message (i.e. `Did not receive all 1 record...`).
   
   OK, let's check the logs, you can scroll down to the **standard error** logs (where I put my debug logs).
   At first, we have 3 stream client (thread) created:
   ```
   Stream-client KTable-FKJ-Multi-fc16826a-c92f-4d47-91d7-7e17abf084b9CREATED,REBALANCING
   Stream-client KTable-FKJ-Multi-5a59900d-05ba-4208-ae6f-ff2b9d86ff93CREATED,REBALANCING
   stream-thread [KTable-FKJ-Multi-fc16826a-c92f-4d47-91d7-7e17abf084b9-StreamThread-1] , CREATED,STARTING
   stream-thread [KTable-FKJ-Multi-5a59900d-05ba-4208-ae6f-ff2b9d86ff93-StreamThread-1] , CREATED,STARTING
   Stream-client KTable-FKJ-Multi-c889e99c-7d74-4e3e-bb33-c0c03b6b2ac5CREATED,REBALANCING
   stream-thread [KTable-FKJ-Multi-c889e99c-7d74-4e3e-bb33-c0c03b6b2ac5-StreamThread-1] , CREATED,STARTING
   ```
   And then, rebalancing the tasks. You can see this logs: (reason: Adding new member KTable-FKJ-Multi....)
   ```
   
   0: Preparing to rebalance group KTable-FKJ-Multi in state PreparingRebalance with old generation 1 (__consumer_offsets-3) (reason: Adding new member KTable-FKJ-Multi-c889e99c-7d74-4e3e-bb33-c0c03b6b2ac5-StreamThread-1-consumer-7fff842d-2157-410a-a939-cca43c0d8e29 with group instance id None)
   responseStatusMap(__consumer_offsets-3 -> {error: NONE,offset: 0,logAppendTime: -1, logStartOffset: 0, recordErrors: [], errorMessage: null})
   Successfully synced group in generation:Generation{generationId=1, memberId='consumer-ktable-ktable-consumer-7-daa57628-31cb-498c-9d72-168874157cc1', protocol='range'}
   ```
   
   Later, 1 consumer left the group with heartbeat expiration: (reason: removing member KTable-FKJ-Multi... on heartbeat expiration)
   ```
   0: Preparing to rebalance group KTable-FKJ-Multi in state PreparingRebalance with old generation 2 (__consumer_offsets-3) (reason: removing member KTable-FKJ-Multi-c889e99c-7d74-4e3e-bb33-c0c03b6b2ac5-StreamThread-1-consumer-7fff842d-2157-410a-a939-cca43c0d8e29 on heartbeat expiration)
   Group coordinator:, cause:session timed out without receiving a heartbeat response,isDis:false
   ```
   And then, you can see this line, which means 60 seconds reached
   ```
   !!! failed 1st time
   ```
   In the meantime, you can see the stream state changed to `PARTITIONS_ASSIGNED`, `PARTITIONS_REVOKED`, `RUNNING`, just never reach a state with 3 stream client with `RUNNING` state
   Later, another consumer left group with heartbeat expiration again: (reason: removing member KTable-FKJ-Multi... on heartbeat expiration)
   ```
   0: Preparing to rebalance group KTable-FKJ-Multi in state PreparingRebalance with old generation 5 (__consumer_offsets-3) (reason: removing member KTable-FKJ-Multi-fc16826a-c92f-4d47-91d7-7e17abf084b9-StreamThread-1-consumer-080464ae-710f-478e-9195-73b55285e21d on heartbeat expiration)
   ```
   Finally, we can see all 3 stream client reached `RUNNING` state:
   ```
   stream-thread [KTable-FKJ-Multi-5a59900d-05ba-4208-ae6f-ff2b9d86ff93-StreamThread-1] , PARTITIONS_ASSIGNED,RUNNING
   Stream-client KTable-FKJ-Multi-5a59900d-05ba-4208-ae6f-ff2b9d86ff93REBALANCING,RUNNING
   stream-thread [KTable-FKJ-Multi-c889e99c-7d74-4e3e-bb33-c0c03b6b2ac5-StreamThread-1] , PARTITIONS_ASSIGNED,RUNNING
   stream-thread [KTable-FKJ-Multi-fc16826a-c92f-4d47-91d7-7e17abf084b9-StreamThread-1] , PARTITIONS_ASSIGNED,RUNNING
   Stream-client KTable-FKJ-Multi-c889e99c-7d74-4e3e-bb33-c0c03b6b2ac5REBALANCING,RUNNING
   Stream-client KTable-FKJ-Multi-fc16826a-c92f-4d47-91d7-7e17abf084b9REBALANCING,RUNNING
   ```
   And then, starting to process the data: (committing task offset)
   ```
   committing task off:{0_0={table1-0=OffsetAndMetadata{offset=1, leaderEpoch=null, metadata='AQAAAXil/kaK'}}...
   ```
   And complete the test, to close all the streams:
   ```
   closing
   Stream-client KTable-FKJ-Multi-fc16826a-c92f-4d47-91d7-7e17abf084b9RUNNING,PENDING_SHUTDOWN
   stream-thread [KTable-FKJ-Multi-fc16826a-c92f-4d47-91d7-7e17abf084b9-StreamThread-1] , RUNNING,PENDING_SHUTDOWN
   stream-thread [KTable-FKJ-Multi-fc16826a-c92f-4d47-91d7-7e17abf084b9-StreamThread-1] , PENDING_SHUTDOWN,DEAD
   Stream-client KTable-FKJ-Multi-fc16826a-c92f-4d47-91d7-7e17abf084b9PENDING_SHUTDOWN,NOT_RUNNING
   ```
   
   That's basically what I observed, and that's why I want to increase the heartbeat interval to fix this flaky test. I also tried to force all 3 streams reach `RUNNING` state before creating another consumer to consume the data, but that doesn't help. There are also other identical failed tests. You can go here to check other test logs.
   https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-9724/237/
   
   If you want to check other logs, you can let me know. Thank you.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [kafka] showuon commented on a change in pull request #10409: KAFKA-9295: increase heartbeat interval to avoid unneeded rebalance

Posted by GitBox <gi...@apache.org>.

showuon commented on a change in pull request #10409:
URL: https://github.com/apache/kafka/pull/10409#discussion_r601988722



##########
File path: streams/src/test/java/org/apache/kafka/streams/integration/KTableKTableForeignKeyInnerJoinMultiIntegrationTest.java
##########
@@ -184,26 +186,25 @@ public void shouldInnerJoinMultiPartitionQueryable() throws Exception {
         final Set<KeyValue<Integer, String>> expectedOne = new HashSet<>();
         expectedOne.add(new KeyValue<>(1, "value1=1.33,value2=10,value3=waffle"));
 
-        verifyKTableKTableJoin(JoinType.INNER, expectedOne, true);
+        verifyKTableKTableJoin(JoinType.INNER, expectedOne);

Review comment:
       In this test suite, we only have 1 test to verify QueryableState. So, remove unnecessary parameter and check.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [kafka] showuon commented on pull request #10409: KAFKA-9295: improve KTableKTableForeignKeyInnerJoinMultiIntegrationTest

Posted by GitBox <gi...@apache.org>.

showuon commented on pull request #10409:
URL: https://github.com/apache/kafka/pull/10409#issuecomment-818378362


   @ableegoldman , I updated the PR to wait until all streams reach "RUNNING" state before consuming records. Let's wait and see the test results. :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [kafka] ableegoldman merged pull request #10409: KAFKA-9295: improve KTableKTableForeignKeyInnerJoinMultiIntegrationTest

Posted by GitBox <gi...@apache.org>.

ableegoldman merged pull request #10409:
URL: https://github.com/apache/kafka/pull/10409


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [kafka] showuon edited a comment on pull request #10409: KAFKA-9295: increase heartbeat interval to avoid unneeded rebalance

Posted by GitBox <gi...@apache.org>.

showuon edited a comment on pull request #10409:
URL: https://github.com/apache/kafka/pull/10409#issuecomment-817043654


   @ableegoldman , thanks for your comment. I know what your concern is, and it makes sense we merge other improvement first. I'll update the PR. Thank you.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [kafka] showuon commented on pull request #10409: KAFKA-9295: improve KTableKTableForeignKeyInnerJoinMultiIntegrationTest

Posted by GitBox <gi...@apache.org>.

showuon commented on pull request #10409:
URL: https://github.com/apache/kafka/pull/10409#issuecomment-818386632


   @ableegoldman , thank you. I'll monitor the PR build, and update here. You can go take a rest, and check it tomorrow (your time). 🙂


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [kafka] showuon edited a comment on pull request #10409: KAFKA-9295: increase heartbeat interval to avoid unneeded rebalance

Posted by GitBox <gi...@apache.org>.

showuon edited a comment on pull request #10409:
URL: https://github.com/apache/kafka/pull/10409#issuecomment-816544910


   @ableegoldman , I ran multiple times on jenkins and print some debug logs. Please check here:
   https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-9724/237/testReport/junit/org.apache.kafka.streams.integration/KTableKTableForeignKeyInnerJoinMultiIntegration10Test/Build___JDK_15_and_Scala_2_13___shouldInnerJoinMultiPartitionQueryable/
   
   The error message: `failed 1st time` is because I don't stop the test immediately when the normal test timed out (60 seconds) for waiting consuming 1 record. I keep waiting for another 60 seconds to see what will happen. If it passed in 2nd 60 seconds, it'll failed with `failed 1st time` indicating this test passed in 2nd run, otherwise, it'll failed with normal error message (i.e. `Did not receive all 1 record...`).
   
   OK, let's check the logs, you can scroll down to the **standard error** logs (where I put my debug logs).
   At first, we have 3 stream client (thread) created:
   ```
   Stream-client KTable-FKJ-Multi-fc16826a-c92f-4d47-91d7-7e17abf084b9CREATED,REBALANCING
   Stream-client KTable-FKJ-Multi-5a59900d-05ba-4208-ae6f-ff2b9d86ff93CREATED,REBALANCING
   stream-thread [KTable-FKJ-Multi-fc16826a-c92f-4d47-91d7-7e17abf084b9-StreamThread-1] , CREATED,STARTING
   stream-thread [KTable-FKJ-Multi-5a59900d-05ba-4208-ae6f-ff2b9d86ff93-StreamThread-1] , CREATED,STARTING
   Stream-client KTable-FKJ-Multi-c889e99c-7d74-4e3e-bb33-c0c03b6b2ac5CREATED,REBALANCING
   stream-thread [KTable-FKJ-Multi-c889e99c-7d74-4e3e-bb33-c0c03b6b2ac5-StreamThread-1] , CREATED,STARTING
   ```
   And then, rebalancing the tasks. You can see this logs:
   ```
   
   0: Preparing to rebalance group KTable-FKJ-Multi in state PreparingRebalance with old generation 1 (__consumer_offsets-3) (reason: Adding new member KTable-FKJ-Multi-c889e99c-7d74-4e3e-bb33-c0c03b6b2ac5-StreamThread-1-consumer-7fff842d-2157-410a-a939-cca43c0d8e29 with group instance id None)
   responseStatusMap(__consumer_offsets-3 -> {error: NONE,offset: 0,logAppendTime: -1, logStartOffset: 0, recordErrors: [], errorMessage: null})
   Successfully synced group in generation:Generation{generationId=1, memberId='consumer-ktable-ktable-consumer-7-daa57628-31cb-498c-9d72-168874157cc1', protocol='range'}
   ```
   
   Later, the consumer left the group with heartbeat expiration
   ```
   0: Preparing to rebalance group KTable-FKJ-Multi in state PreparingRebalance with old generation 2 (__consumer_offsets-3) (reason: removing member KTable-FKJ-Multi-c889e99c-7d74-4e3e-bb33-c0c03b6b2ac5-StreamThread-1-consumer-7fff842d-2157-410a-a939-cca43c0d8e29 on heartbeat expiration)
   Group coordinator:, cause:session timed out without receiving a heartbeat response,isDis:false
   ```
   And then, you can see this line, which means 60 seconds reached
   ```
   !!! failed 1st time
   ```
   In the meantime, you can see the stream state changed to `PARTITIONS_ASSIGNED`, `PARTITIONS_REVOKED`, `RUNNING`, just never reach a state with 3 stream client with `RUNNING` state
   Later, another consumer left group with heartbeat expiration again:
   ```
   0: Preparing to rebalance group KTable-FKJ-Multi in state PreparingRebalance with old generation 5 (__consumer_offsets-3) (reason: removing member KTable-FKJ-Multi-fc16826a-c92f-4d47-91d7-7e17abf084b9-StreamThread-1-consumer-080464ae-710f-478e-9195-73b55285e21d on heartbeat expiration)
   ```
   Finally, we can see all 3 stream client reached `RUNNING` state:
   ```
   stream-thread [KTable-FKJ-Multi-5a59900d-05ba-4208-ae6f-ff2b9d86ff93-StreamThread-1] , PARTITIONS_ASSIGNED,RUNNING
   Stream-client KTable-FKJ-Multi-5a59900d-05ba-4208-ae6f-ff2b9d86ff93REBALANCING,RUNNING
   stream-thread [KTable-FKJ-Multi-c889e99c-7d74-4e3e-bb33-c0c03b6b2ac5-StreamThread-1] , PARTITIONS_ASSIGNED,RUNNING
   stream-thread [KTable-FKJ-Multi-fc16826a-c92f-4d47-91d7-7e17abf084b9-StreamThread-1] , PARTITIONS_ASSIGNED,RUNNING
   Stream-client KTable-FKJ-Multi-c889e99c-7d74-4e3e-bb33-c0c03b6b2ac5REBALANCING,RUNNING
   Stream-client KTable-FKJ-Multi-fc16826a-c92f-4d47-91d7-7e17abf084b9REBALANCING,RUNNING
   ```
   And then, starting to process the data: (committing task offset)
   ```
   committing task off:{0_0={table1-0=OffsetAndMetadata{offset=1, leaderEpoch=null, metadata='AQAAAXil/kaK'}}...
   ```
   And complete the test, to close all the streams:
   ```
   closing
   Stream-client KTable-FKJ-Multi-fc16826a-c92f-4d47-91d7-7e17abf084b9RUNNING,PENDING_SHUTDOWN
   stream-thread [KTable-FKJ-Multi-fc16826a-c92f-4d47-91d7-7e17abf084b9-StreamThread-1] , RUNNING,PENDING_SHUTDOWN
   stream-thread [KTable-FKJ-Multi-fc16826a-c92f-4d47-91d7-7e17abf084b9-StreamThread-1] , PENDING_SHUTDOWN,DEAD
   Stream-client KTable-FKJ-Multi-fc16826a-c92f-4d47-91d7-7e17abf084b9PENDING_SHUTDOWN,NOT_RUNNING
   ```
   
   That's basically what I observed, and that's why I want to increase the heartbeat interval to fix this flaky test. I also tried to force all 3 streams reach `RUNNING` state before creating another consumer to consume the data, but that doesn't help. There are also other identical failed tests. You can go here to check other test logs.
   https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-9724/237/
   
   If you want to check other logs, you can let me know. Thank you.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [kafka] showuon commented on a change in pull request #10409: KAFKA-9295: increase heartbeat interval to avoid unneeded rebalance

Posted by GitBox <gi...@apache.org>.

showuon commented on a change in pull request #10409:
URL: https://github.com/apache/kafka/pull/10409#discussion_r601988476



##########
File path: streams/src/test/java/org/apache/kafka/streams/integration/KTableKTableForeignKeyInnerJoinMultiIntegrationTest.java
##########
@@ -184,26 +186,25 @@ public void shouldInnerJoinMultiPartitionQueryable() throws Exception {
         final Set<KeyValue<Integer, String>> expectedOne = new HashSet<>();
         expectedOne.add(new KeyValue<>(1, "value1=1.33,value2=10,value3=waffle"));
 
-        verifyKTableKTableJoin(JoinType.INNER, expectedOne, true);
+        verifyKTableKTableJoin(JoinType.INNER, expectedOne);
     }
 
     private void verifyKTableKTableJoin(final JoinType joinType,
-                                        final Set<KeyValue<Integer, String>> expectedResult,
-                                        final boolean verifyQueryableState) throws Exception {
-        final String queryableName = verifyQueryableState ? joinType + "-store1" : null;

Review comment:
       In this test suite, we only have 1 test to verify QueryableState. So, remove unnecessary parameter and check.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [kafka] showuon commented on pull request #10409: KAFKA-9295: increase heartbeat interval to avoid unneeded rebalance

Posted by GitBox <gi...@apache.org>.

showuon commented on pull request #10409:
URL: https://github.com/apache/kafka/pull/10409#issuecomment-807915216


   @ableegoldman , I agree with you! Let me see if I can get some logs and see if you have any thoughts.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org