You are viewing a plain text version of this content. The canonical link for it is here.
Posted to jira@kafka.apache.org by GitBox <gi...@apache.org> on 2021/07/07 04:33:52 UTC

[GitHub] [kafka] ableegoldman opened a new pull request #10986: KAFKA-12983: reset needsJoinPrepare flag before rejoining the group

ableegoldman opened a new pull request #10986:
URL: https://github.com/apache/kafka/pull/10986


   The `#onJoinPrepare` callback is not always invoked before a member (re)joins the group, but only once when it first enters the rebalance. This means that any updates or events that occur during the join phase can be lost in the internal state: for example, clearing the SubscriptionState (and thus the "ownedPartitions" that are used for cooperative rebalancing) after losing its memberId during a rebalance.
   
   We should reset the `needsJoinPrepare` flag inside the resetStateAndRejoin() method.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] hachikuji commented on pull request #10986: KAFKA-12983: reset needsJoinPrepare flag before rejoining the group

Posted by GitBox <gi...@apache.org>.
hachikuji commented on pull request #10986:
URL: https://github.com/apache/kafka/pull/10986#issuecomment-878621043


   To clarify, from the perspective of the eager protocol, how would this case look? Would we get multiple calls to `onPartitionsRevoked` with the same set of partitions or something else?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] guozhangwang commented on pull request #10986: KAFKA-12983: reset needsJoinPrepare flag before rejoining the group

Posted by GitBox <gi...@apache.org>.
guozhangwang commented on pull request #10986:
URL: https://github.com/apache/kafka/pull/10986#issuecomment-878589565


   @hachikuji I think the key idea behind this fix is that, if a rebalance failed with e.g. memberId lost, then conceptually we would just started a new rebalance in which we would call `onJoinPrepare` and in which we may call `onRepartitionsRevoked` again. This behavior would be the same for eager or cooperative.
   
   Personally I think this fix is fine -- @ableegoldman if you could just add a unit test for the case of memberId lost during a first rebalance, and check that we would re-triggered `onJoinPrepare` again?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] ableegoldman commented on pull request #10986: KAFKA-12983: reset needsJoinPrepare flag before rejoining the group

Posted by GitBox <gi...@apache.org>.
ableegoldman commented on pull request #10986:
URL: https://github.com/apache/kafka/pull/10986#issuecomment-878706682


   @hachikuji in the EAGER case, after the first `onJoinPrepare` / `onPartitionsRevoked`, the subscription would have been cleared. So any subsequent invocations of `onPartitionsRevoked` would be with an empty set of partitions
   
   @everyone, I was having trouble getting a unit test that would actually verify this behavior but I wanted to kick off discussion on the fix ASAP (for obvious reasons) so I opened the PR without one. I do intended to add a test, I just haven't had time to pursue that yet. Suggestions welcome :P


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] dajac commented on pull request #10986: KAFKA-12983: reset needsJoinPrepare flag before rejoining the group

Posted by GitBox <gi...@apache.org>.
dajac commented on pull request #10986:
URL: https://github.com/apache/kafka/pull/10986#issuecomment-875329231


   @ableegoldman Thanks for the patch. The change makes sense to me. I wonder if we could add a unit test which would fail without it though. This would avoid regressing in the future. What do you think?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] ableegoldman commented on pull request #10986: KAFKA-12983: reset needsJoinPrepare flag before rejoining the group

Posted by GitBox <gi...@apache.org>.
ableegoldman commented on pull request #10986:
URL: https://github.com/apache/kafka/pull/10986#issuecomment-879333780


   Two test failures, both `ConsumerBounceTest.testCloseDuringRebalance()`. This test is already known to be flaky and failed with the same error that has been reported before, so I think we can conclude that this was unrelated.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] ableegoldman commented on pull request #10986: KAFKA-12983: reset needsJoinPrepare flag before rejoining the group

Posted by GitBox <gi...@apache.org>.
ableegoldman commented on pull request #10986:
URL: https://github.com/apache/kafka/pull/10986#issuecomment-879343735


   Merged to trunk and cherrypicked to 2.8 & 3.0 (cc @kkonstantine)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] hachikuji commented on pull request #10986: KAFKA-12983: reset needsJoinPrepare flag before rejoining the group

Posted by GitBox <gi...@apache.org>.
hachikuji commented on pull request #10986:
URL: https://github.com/apache/kafka/pull/10986#issuecomment-875956460


   @ableegoldman Thanks for the patch. I think the original idea behind the implementation was to ensure that each rebalance triggered only one call to `onPartitionsRevoked`. It sounds like this needs some refinement for the cooperative rebalance logic. I guess the main difference is that we could now have a call to `onPartitionsLost` if the memberId is lost after the initial call to `onJoinPrepare`?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] ableegoldman merged pull request #10986: KAFKA-12983: reset needsJoinPrepare flag before rejoining the group

Posted by GitBox <gi...@apache.org>.
ableegoldman merged pull request #10986:
URL: https://github.com/apache/kafka/pull/10986


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] ableegoldman commented on pull request #10986: KAFKA-12983: reset needsJoinPrepare flag before rejoining the group

Posted by GitBox <gi...@apache.org>.
ableegoldman commented on pull request #10986:
URL: https://github.com/apache/kafka/pull/10986#issuecomment-878749974


   Now ready for review @dajac @hachikuji @guozhangwang 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] ableegoldman edited a comment on pull request #10986: KAFKA-12983: reset needsJoinPrepare flag before rejoining the group

Posted by GitBox <gi...@apache.org>.
ableegoldman edited a comment on pull request #10986:
URL: https://github.com/apache/kafka/pull/10986#issuecomment-879333780


   Two test failures, both `ConsumerBounceTest.testCloseDuringRebalance()`. This test is already known to be flaky and failed with the same error that has been reported before ([KAFKA-8529](https://issues.apache.org/jira/browse/KAFKA-8529)), so I think we can conclude that this was unrelated.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] ableegoldman commented on pull request #10986: KAFKA-12983: reset needsJoinPrepare flag before rejoining the group

Posted by GitBox <gi...@apache.org>.
ableegoldman commented on pull request #10986:
URL: https://github.com/apache/kafka/pull/10986#issuecomment-878748345


   Ok I realize we actually do have a test that reproduces this already: `ConsumerCoordinatorTest.testRebalanceWithMetadataChange`. This test sets up a case where a change in topic metadata triggers a rebalance after a member had joined the group, after which the change is reverted so that the metadata is ultimately the same. Then a `NOT_COORDINATOR` response is sent to fail the initial JoinGroup, and the test just verifies that the member attempts to rejoin until successful. It also verifies things like the number of times each rebalance callback is invoked, and the set of partitions that the callbacks receive.
   This test actually only failed in the COOPERATIVE case, which confirms that the behavior remains correct for the EAGER case. When following the COOPERATIVE protocol, the test was formerly assuming that the member would retain all partitions despite actually having its generation and memberId cleared when the initial JoinGroup is failed. So it was technically asserting the wrong behavior beforehand; just fixing this gives us a unit test for this patch after all.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [kafka] hachikuji edited a comment on pull request #10986: KAFKA-12983: reset needsJoinPrepare flag before rejoining the group

Posted by GitBox <gi...@apache.org>.
hachikuji edited a comment on pull request #10986:
URL: https://github.com/apache/kafka/pull/10986#issuecomment-875956460


   @ableegoldman Thanks for the patch. I think the original idea behind the implementation was to ensure that each rebalance triggered only one call to `onPartitionsRevoked`. It sounds like this needs some refinement for the cooperative rebalance logic. I guess the main difference is that we could now have a call to `onPartitionsLost` if the memberId is lost after the initial call to `onJoinPrepare`? It might be nice to ensure that we can keep the same behavior for eager rebalancing.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: jira-unsubscribe@kafka.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org