Posted to dev@kafka.apache.org by Nikolay Izhikov <ni...@apache.org> on 2019/10/07 18:37:50 UTC

KAFKA-8104: Help with the fair reproducer and review

Hello.

We have an open issue, KAFKA-8104 "Consumer cannot rejoin to the group after rebalancing" [1].
It reproduces in many production environments.

I have prepared a reproducer and a fix [2] for this issue.
However, I need assistance with a "fair" reproducer.

Please help me with the review and the "fair" reproducer:

The PR contains a fix for a race condition between the "consumer thread" and the "consumer coordinator heartbeat thread". The race reproduces in many production environments.

Conditions for reproducing:

1. The consumer thread initiates a rejoin to the group because of a commit timeout. It calls `AbstractCoordinator#joinGroupIfNeeded`, which leads to `sendJoinGroupRequest`.
2. `JoinGroupResponseHandler` writes the new generation data to `AbstractCoordinator.this.generation` and leaves the `synchronized` section.
3. The heartbeat thread executes `maybeLeaveGroup` and clears the generation data via `resetGenerationOnLeaveGroup`.
4. The consumer thread executes `onJoinComplete(generation.generationId, generation.memberId, generation.protocol, memberAssignment);` with the cleared generation data. This leads to the corresponding exception.
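
The ordering in steps 2-4 can be replayed sequentially on a toy model. This is only a sketch: `GenerationRaceReplay`, its nested `Generation` type, the `NONE` value and the thrown `IllegalStateException` are hypothetical stand-ins, not Kafka's actual classes or the actual exception seen in production.

```java
// Simplified stand-in for the coordinator state; these are NOT Kafka's actual classes.
class GenerationRaceReplay {
    // Hypothetical snapshot type carrying the three fields named in step 4.
    static final class Generation {
        static final Generation NONE = new Generation(-1, "", null); // assumed "cleared" value
        final int generationId;
        final String memberId;
        final String protocol;

        Generation(int generationId, String memberId, String protocol) {
            this.generationId = generationId;
            this.memberId = memberId;
            this.protocol = protocol;
        }
    }

    private Generation generation = Generation.NONE;

    // Step 2: the response handler stores the new generation under the lock.
    synchronized void onJoinGroupResponse(Generation fresh) {
        generation = fresh;
    }

    // Step 3: the heartbeat thread clears the generation under the same lock.
    synchronized void resetGenerationOnLeaveGroup() {
        generation = Generation.NONE;
    }

    // Step 4: the consumer thread re-reads the field after the lock was released.
    String onJoinComplete() {
        Generation g = generation;
        if (g.protocol == null) {
            throw new IllegalStateException("generation was cleared before onJoinComplete");
        }
        return g.memberId + "@" + g.generationId;
    }

    // Replays steps 2-4 in the problematic order; returns true if step 4 failed.
    static boolean replayFails() {
        GenerationRaceReplay coordinator = new GenerationRaceReplay();
        coordinator.onJoinGroupResponse(new Generation(5, "member-1", "range"));
        coordinator.resetGenerationOnLeaveGroup();
        try {
            coordinator.onJoinComplete();
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("step 4 failed: " + replayFails());
    }
}
```

Each write is individually synchronized, yet step 4 still sees cleared data, because the lock is released between the write in step 2 and the read in step 4.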

The race is fixed with a condition in `maybeLeaveGroup`: if a rejoin is in progress in the consumer thread, there is no reason to reset the generation data and send a `LeaveGroupRequest` from the heartbeat thread.
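
The shape of such a guard can be sketched on a toy model. The class `GuardedLeaveSketch` and the flag `rejoinInProgress` are hypothetical names chosen for illustration; the actual condition in the PR may be expressed differently.

```java
// Toy model of a guarded leave; NOT the code from the PR.
class GuardedLeaveSketch {
    private int generationId = -1;            // -1 stands for "no generation" in this sketch
    private boolean rejoinInProgress = false; // hypothetical flag owned by the consumer thread

    // Consumer thread: a rejoin starts.
    synchronized void startRejoin() {
        rejoinInProgress = true;
    }

    // Response handler: new generation data arrives.
    synchronized void onJoinGroupResponse(int newGenerationId) {
        generationId = newGenerationId;
    }

    // Consumer thread: the rejoin finishes and the generation is consumed.
    synchronized int completeRejoin() {
        rejoinInProgress = false;
        return generationId;
    }

    // Heartbeat thread: returns true only if the generation was reset
    // (the point where a LeaveGroupRequest would be sent).
    synchronized boolean maybeLeaveGroup() {
        if (rejoinInProgress) {
            return false; // a rejoin is ongoing, so keep the fresh generation
        }
        generationId = -1;
        return true;
    }
}
```

With this guard, a `maybeLeaveGroup()` call that lands between `onJoinGroupResponse(...)` and `completeRejoin()` becomes a no-op, so the consumer thread still reads the generation it just received.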

This PR contains an "unfair" reproducer.
It is implemented with a `CountDownLatch` that imitates the described race in the `AbstractCoordinator` code.
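
For readers unfamiliar with the technique, here is a minimal sketch of forcing an interleaving with `CountDownLatch`. It is not the reproducer from the PR: the class `LatchForcedRace`, the latch names and the `AtomicReference` standing in for the generation data are all assumptions made for illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

// Two threads forced into the order: write (step 2), clear (step 3), read (step 4).
class LatchForcedRace {
    // Returns the member id the "consumer" thread observes at step 4.
    static String run() {
        AtomicReference<String> memberId = new AtomicReference<>(null); // stand-in for generation data
        AtomicReference<String> observed = new AtomicReference<>("not-run");
        CountDownLatch generationWritten = new CountDownLatch(1);
        CountDownLatch generationCleared = new CountDownLatch(1);

        Thread consumer = new Thread(() -> {
            memberId.set("member-1");          // step 2: new generation data is written
            generationWritten.countDown();
            try {
                if (!generationCleared.await(5, TimeUnit.SECONDS)) {
                    return;                    // heartbeat never ran; leave "not-run"
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            observed.set(memberId.get());      // step 4: read after the clear
        }, "consumer");

        Thread heartbeat = new Thread(() -> {
            try {
                if (!generationWritten.await(5, TimeUnit.SECONDS)) {
                    return;
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            memberId.set(null);                // step 3: generation data is cleared
            generationCleared.countDown();
        }, "heartbeat");

        consumer.start();
        heartbeat.start();
        try {
            consumer.join();
            heartbeat.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return observed.get();
    }

    public static void main(String[] args) {
        System.out.println("observed member id: " + run()); // prints "observed member id: null"
    }
}
```

The latches make the failing order deterministic, but only because the test dictates the schedule from outside; that is presumably what makes such a reproducer "unfair" compared with one that triggers the race through the normal consumer code path.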



[1] https://issues.apache.org/jira/browse/KAFKA-8104
[2] https://github.com/apache/kafka/pull/7460

Re: KAFKA-8104: Help with the review

Posted by Nikolay Izhikov <ni...@apache.org>.
Hello, Guozhang.

Got it, thanks for the help with the PR.
Will wait for your review.

On Mon, 14/10/2019 at 13:40 -0700, Guozhang Wang wrote:
> Hello Nikolay,
> 
> I'm still on your PR, but I have been swamped with other issues as the release
> code freeze date is approaching. I will try to make another pass on it ASAP.
> 
> 
> Guozhang

Re: KAFKA-8104: Help with the review

Posted by Guozhang Wang <wa...@gmail.com>.
Hello Nikolay,

I'm still on your PR, but I have been swamped with other issues as the release
code freeze date is approaching. I will try to make another pass on it ASAP.


Guozhang

On Mon, Oct 14, 2019 at 12:46 PM Nikolay Izhikov <ni...@apache.org>
wrote:

> Hello.
>
> I got very helpful advice from Guozhang.
> We now have a ready fix and reproducer.
>
> This PR fixes a long-standing Kafka Consumer bug.
> Please join the review.
>
> [1] https://issues.apache.org/jira/browse/KAFKA-8104
> [2] https://github.com/apache/kafka/pull/7460


-- 
-- Guozhang

Re: KAFKA-8104: Help with the review

Posted by Nikolay Izhikov <ni...@apache.org>.
Hello.

I got very helpful advice from Guozhang.
We now have a ready fix and reproducer.

This PR fixes a long-standing Kafka Consumer bug.
Please join the review.

[1] https://issues.apache.org/jira/browse/KAFKA-8104
[2] https://github.com/apache/kafka/pull/7460
