You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@kafka.apache.org by Boyang Chen <bc...@outlook.com> on 2018/11/27 01:47:46 UTC

[DISCUSS] KIP-394: Require member.id for initial join group request

Hey friends,


I would like to start a discussion thread for KIP-394 which is trying to mitigate broker cache bursting issue due to anonymous join group requests:

https://cwiki.apache.org/confluence/display/KAFKA/KIP-394%3A+Require+member.id+for+initial+join+group+request


Thanks!

Boyang

Re: [DISCUSS] KIP-394: Require member.id for initial join group request

Posted by Mayuresh Gharat <gh...@gmail.com>.

Thanks for the KIP Boyang and great to see the progress on solving the
rebalance issues with this and KIP-345.

Thanks,

Mayuresh

On Mon, Dec 3, 2018 at 4:57 AM Stanislav Kozlovski <st...@confluent.io>
wrote:

> Everything sounds good to me.
>
> On Sun, Dec 2, 2018 at 1:24 PM Boyang Chen <bc...@outlook.com> wrote:
>
> > In fact, it's probably better to move KIP-394<
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-394%3A+Require+member.id+for+initial+join+group+request
> >
> > to the vote stage first, so that it's easier to finalize the timeline and
> > smooth the rollout plan for KIP-345. Jason and Stanislav, since you two
> > involve most in this KIP, could you let me know if there is still any
> > unclarity we want to resolve before moving to vote?
> >
> > Best,
> > Boyang
> > ________________________________
> > From: Boyang Chen <bc...@outlook.com>
> > Sent: Saturday, December 1, 2018 10:53 AM
> > To: dev@kafka.apache.org
> > Subject: Re: [DISCUSS] KIP-394: Require member.id for initial join group
> > request
> >
> > Thanks Jason for the reply! Since the overall motivation and design is
> > pretty clear, I will go ahead to start implementation and we could
> discuss
> > the underlying details in the PR.
> >
> > Best,
> > Boyang
> > ________________________________
> > From: Matthias J. Sax <ma...@confluent.io>
> > Sent: Saturday, December 1, 2018 3:12 AM
> > To: dev@kafka.apache.org
> > Subject: Re: [DISCUSS] KIP-394: Require member.id for initial join group
> > request
> >
> > SGTM.
> >
> > On 11/30/18 10:17 AM, Jason Gustafson wrote:
> > > Using the session expiration logic we already have seems like the
> > simplest
> > > option (this is probably a one or two line change). The rejoin should
> be
> > > quick anyway, so I don't think it's worth optimizing for unjoined new
> > > members. Just my two cents. This is more of an implementation detail,
> so
> > > need not necessarily be resolved here.
> > >
> > > -Jason
> > >
> > > On Fri, Nov 30, 2018 at 12:56 AM Boyang Chen <bc...@outlook.com>
> > wrote:
> > >
> > >> Thanks Matthias for the question. I'm thinking of having a separate
> hash
> > >> set called `registeredMemberIds` which
> > >> will be cleared out every time a group finishes one round of
> rebalance.
> > >> Since storing one id is pretty trivial, using
> > >> purgatory to track the id removal is a bit wasteful in my opinion.
> > >> ________________________________
> > >> From: Matthias J. Sax <ma...@confluent.io>
> > >> Sent: Friday, November 30, 2018 10:26 AM
> > >> To: dev@kafka.apache.org
> > >> Subject: Re: [DISCUSS] KIP-394: Require member.id for initial join
> > group
> > >> request
> > >>
> > >> Thanks! Makes sense.
> > >>
> > >> I missed that fact, that the `member.id` is added on the second
> > >> joinGroup request that contains the `member.id`.
> > >>
> > >> However, it seems there is another race condition for this design:
> > >>
> > >> If two consumers join at the same time, it it possible that the broker
> > >> assigns the same `member.id` to both (because none of them have
> joined
> > >> the group yet--ie, second joinGroup request not sent yet--, the
> > >> `member.id` is not store broker side yes and broker cannot check for
> > >> duplicates when creating a new `member.id`.
> > >>
> > >> The probability might be fairly low thought. However, what Stanislav
> > >> proposed, to add the `member.id` directly, and remove it after
> > >> `session.timeout.ms` sound like a save option that avoids this issue.
> > >>
> > >> Thoughts?
> > >>
> > >>
> > >> -Matthias
> > >>
> > >> On 11/28/18 8:15 PM, Boyang Chen wrote:
> > >>> Thanks Matthias for the question, and Stanislav for the explanation!
> > >>>
> > >>> For the scenario described, we will never let a member join the
> > >> GroupMetadata map
> > >>> if it uses UNKNOWN_MEMBER_ID. So the workflow will be like this:
> > >>>
> > >>>   1.  Group is empty. Consumer c1 started. Join with
> UNKNOWN_MEMBER_ID;
> > >>>   2.  Broker rejects while allocating a member.id to c1 in response
> > (c1
> > >> protocol version is current);
> > >>>   3.  c1 handles the error and rejoins with assigned member.id;
> > >>>   4.  Broker stores c1 in its group metadata;
> > >>>   5.  Consumer c2 started. Join with UNKNOWN_MEMBER_ID;
> > >>>   6.  Broker rejects while allocating a member.id to c2 in response
> > (c2
> > >> protocol version is current);
> > >>>   7.  c2 fails to get the response/crashes in the middle;
> > >>>   8.  After certain time, c2 restarts a join request with
> > >> UNKNOWN_MEMBER_ID;
> > >>>
> > >>> As you could see, c2 will repeat step 6~8 until successfully send
> back
> > a
> > >> join group request with allocated id.
> > >>> By then broker will include c2 within the broker metadata map.
> > >>>
> > >>> Does this sound clear to you?
> > >>>
> > >>> Best,
> > >>> Boyang
> > >>> ________________________________
> > >>> From: Stanislav Kozlovski <st...@confluent.io>
> > >>> Sent: Wednesday, November 28, 2018 7:39 PM
> > >>> To: dev@kafka.apache.org
> > >>> Subject: Re: [DISCUSS] KIP-394: Require member.id for initial join
> > >> group request
> > >>>
> > >>> Hey Matthias,
> > >>>
> > >>> I think the notion is to have the `session.timeout.ms` to start
> > ticking
> > >>> when the broker responds with the member.id. Then, the broker would
> > >>> properly expire consumers and not hold too many stale ones.
> > >>> This isn't mentioned in the KIP though so it is worth to wait for
> > Boyang
> > >> to
> > >>> confirm
> > >>>
> > >>> On Wed, Nov 28, 2018 at 3:10 AM Matthias J. Sax <
> matthias@confluent.io
> > >
> > >>> wrote:
> > >>>
> > >>>> Thanks for the KIP Boyang.
> > >>>>
> > >>>> I guess I am missing something, but I am still learning more details
> > >>>> about the rebalance protocol, so maybe you can help me out?
> > >>>>
> > >>>> Assume a client sends UNKNOWN_MEMBER_ID in its first joinGroup
> > request.
> > >>>> The broker generates a `member.id` and sends it back via
> > >>>> `MEMBER_ID_REQUIRED` error response. This response might never reach
> > the
> > >>>> client or the client fails before it can send the second joinGroup
> > >>>> request. Thus, a client would need to start over with a new
> > >>>> UNKNOWN_MEMBER_ID in its joinGroup request. Thus, the broker needs
> to
> > >>>> generate a new `member.id` again.
> > >>>>
> > >>>> So it seems the problem is moved, but not resolved? The motivation
> of
> > >>>> the KIP is:
> > >>>>
> > >>>>> The edge case is that if initial join group request keeps failing
> due
> > >> to
> > >>>> connection timeout, or the consumer keeps restarting,
> > >>>>
> > >>>> From my understanding, this KIP move the issue from the first to the
> > >>>> second joinGroup request (or broker joinGroup response).
> > >>>>
> > >>>> But maybe I am missing something. Can you help me out?
> > >>>>
> > >>>>
> > >>>> -Matthias
> > >>>>
> > >>>>
> > >>>> On 11/27/18 6:00 PM, Boyang Chen wrote:
> > >>>>> Thanks Stanislav and Jason for the suggestions!
> > >>>>>
> > >>>>>
> > >>>>>> Thanks for the KIP. Looks good overall. I think we will need to
> bump
> > >> the
> > >>>>>> version of the JoinGroup protocol in order to indicate
> compatibility
> > >>>> with
> > >>>>>> the new behavior. The coordinator needs to know when it is safe to
> > >>>> assume
> > >>>>>> the client will handle the error code.
> > >>>>>>
> > >>>>>> Also, I was wondering if we could reuse the REBALANCE_IN_PROGRESS
> > >> error
> > >>>>>> code. When the client sees this error code, it will take the
> > memberId
> > >>>> from
> > >>>>>> the response and rejoin. We'd still need the protocol bump since
> > older
> > >>>>>> consumers do not have this logic.
> > >>>>>
> > >>>>> I will add the join group protocol version change to the KIP.
> > Meanwhile
> > >>>> I feel for
> > >>>>> understandability it's better to define a separate error code since
> > >>>> REBALANCE_IN_PROGRESS
> > >>>>> is not the actual cause of the returned error.
> > >>>>>
> > >>>>>> One small question I have is now that we have one and a half
> > >> round-trips
> > >>>>>> needed to join in a rebalance (1 full RT addition), is it worth it
> > to
> > >>>>>> consider increasing the default value of `
> > >>>> group.initial.rebalance.delay.ms`?
> > >>>>> I guess we could keep it for now. After KIP-345 and incremental
> > >>>> cooperative rebalancing
> > >>>>> work we should be safe to deprecate `
> > group.initial.rebalance.delay.ms
> > >> `.
> > >>>> Also one round trip
> > >>>>> shouldn't increase the latency too much IMO.
> > >>>>>
> > >>>>> Best,
> > >>>>> Boyang
> > >>>>> ________________________________
> > >>>>> From: Stanislav Kozlovski <st...@confluent.io>
> > >>>>> Sent: Wednesday, November 28, 2018 2:32 AM
> > >>>>> To: dev@kafka.apache.org
> > >>>>> Subject: Re: [DISCUSS] KIP-394: Require member.id for initial join
> > >>>> group request
> > >>>>>
> > >>>>> Hi Boyang,
> > >>>>>
> > >>>>> The KIP looks very good.
> > >>>>> One small question I have is now that we have one and a half
> > >> round-trips
> > >>>>> needed to join in a rebalance (1 full RT addition), is it worth it
> to
> > >>>>> consider increasing the default value of `
> > >>>> group.initial.rebalance.delay.ms`?
> > >>>>>
> > >>>>> Best,
> > >>>>> Stanislav
> > >>>>>
> > >>>>> On Tue, Nov 27, 2018 at 5:39 PM Jason Gustafson <
> jason@confluent.io>
> > >>>> wrote:
> > >>>>>
> > >>>>>> Hi Boyang,
> > >>>>>>
> > >>>>>> Thanks for the KIP. Looks good overall. I think we will need to
> bump
> > >> the
> > >>>>>> version of the JoinGroup protocol in order to indicate
> compatibility
> > >>>> with
> > >>>>>> the new behavior. The coordinator needs to know when it is safe to
> > >>>> assume
> > >>>>>> the client will handle the error code.
> > >>>>>>
> > >>>>>> Also, I was wondering if we could reuse the REBALANCE_IN_PROGRESS
> > >> error
> > >>>>>> code. When the client sees this error code, it will take the
> > memberId
> > >>>> from
> > >>>>>> the response and rejoin. We'd still need the protocol bump since
> > older
> > >>>>>> consumers do not have this logic.
> > >>>>>>
> > >>>>>> Thanks,
> > >>>>>> Jason
> > >>>>>>
> > >>>>>> On Mon, Nov 26, 2018 at 5:47 PM Boyang Chen <bc...@outlook.com>
> > >>>> wrote:
> > >>>>>>
> > >>>>>>> Hey friends,
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> I would like to start a discussion thread for KIP-394 which is
> > trying
> > >>>> to
> > >>>>>>> mitigate broker cache bursting issue due to anonymous join group
> > >>>>>> requests:
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>
> > >>>>
> > >>
> >
> https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcwiki.apache.org%2Fconfluence%2Fdisplay%2FKAFKA%2FKIP-394%253A%2BRequire%2Bmember.id%2Bfor%2Binitial%2Bjoin%2Bgroup%2Brequest&amp;data=02%7C01%7C%7C3ca95629be9e42b1f00108d657383bfd%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C636792296362447479&amp;sdata=3BuPVUH5v3hMYe%2FMgpSsNftTwb5DsHDlm2lN%2FVUR0T8%3D&amp;reserved=0
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> Thanks!
> > >>>>>>>
> > >>>>>>> Boyang
> > >>>>>>>
> > >>>>>>
> > >>>>>
> > >>>>>
> > >>>>> --
> > >>>>> Best,
> > >>>>> Stanislav
> > >>>>>
> > >>>>
> > >>>>
> > >>>
> > >>> --
> > >>> Best,
> > >>> Stanislav
> > >>>
> > >>
> > >>
> > >
> >
> >
>
> --
> Best,
> Stanislav
>


-- 
-Regards,
Mayuresh R. Gharat
(862) 250-7125

Re: [DISCUSS] KIP-394: Require member.id for initial join group request

Posted by Stanislav Kozlovski <st...@confluent.io>.

Everything sounds good to me.

On Sun, Dec 2, 2018 at 1:24 PM Boyang Chen <bc...@outlook.com> wrote:

> In fact, it's probably better to move KIP-394<
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-394%3A+Require+member.id+for+initial+join+group+request>
> to the vote stage first, so that it's easier to finalize the timeline and
> smooth the rollout plan for KIP-345. Jason and Stanislav, since you two
> involve most in this KIP, could you let me know if there is still any
> unclarity we want to resolve before moving to vote?
>
> Best,
> Boyang
> ________________________________
> From: Boyang Chen <bc...@outlook.com>
> Sent: Saturday, December 1, 2018 10:53 AM
> To: dev@kafka.apache.org
> Subject: Re: [DISCUSS] KIP-394: Require member.id for initial join group
> request
>
> Thanks Jason for the reply! Since the overall motivation and design is
> pretty clear, I will go ahead to start implementation and we could discuss
> the underlying details in the PR.
>
> Best,
> Boyang
> ________________________________
> From: Matthias J. Sax <ma...@confluent.io>
> Sent: Saturday, December 1, 2018 3:12 AM
> To: dev@kafka.apache.org
> Subject: Re: [DISCUSS] KIP-394: Require member.id for initial join group
> request
>
> SGTM.
>
> On 11/30/18 10:17 AM, Jason Gustafson wrote:
> > Using the session expiration logic we already have seems like the
> simplest
> > option (this is probably a one or two line change). The rejoin should be
> > quick anyway, so I don't think it's worth optimizing for unjoined new
> > members. Just my two cents. This is more of an implementation detail, so
> > need not necessarily be resolved here.
> >
> > -Jason
> >
> > On Fri, Nov 30, 2018 at 12:56 AM Boyang Chen <bc...@outlook.com>
> wrote:
> >
> >> Thanks Matthias for the question. I'm thinking of having a separate hash
> >> set called `registeredMemberIds` which
> >> will be cleared out every time a group finishes one round of rebalance.
> >> Since storing one id is pretty trivial, using
> >> purgatory to track the id removal is a bit wasteful in my opinion.
> >> ________________________________
> >> From: Matthias J. Sax <ma...@confluent.io>
> >> Sent: Friday, November 30, 2018 10:26 AM
> >> To: dev@kafka.apache.org
> >> Subject: Re: [DISCUSS] KIP-394: Require member.id for initial join
> group
> >> request
> >>
> >> Thanks! Makes sense.
> >>
> >> I missed that fact, that the `member.id` is added on the second
> >> joinGroup request that contains the `member.id`.
> >>
> >> However, it seems there is another race condition for this design:
> >>
> >> If two consumers join at the same time, it it possible that the broker
> >> assigns the same `member.id` to both (because none of them have joined
> >> the group yet--ie, second joinGroup request not sent yet--, the
> >> `member.id` is not store broker side yes and broker cannot check for
> >> duplicates when creating a new `member.id`.
> >>
> >> The probability might be fairly low thought. However, what Stanislav
> >> proposed, to add the `member.id` directly, and remove it after
> >> `session.timeout.ms` sound like a save option that avoids this issue.
> >>
> >> Thoughts?
> >>
> >>
> >> -Matthias
> >>
> >> On 11/28/18 8:15 PM, Boyang Chen wrote:
> >>> Thanks Matthias for the question, and Stanislav for the explanation!
> >>>
> >>> For the scenario described, we will never let a member join the
> >> GroupMetadata map
> >>> if it uses UNKNOWN_MEMBER_ID. So the workflow will be like this:
> >>>
> >>>   1.  Group is empty. Consumer c1 started. Join with UNKNOWN_MEMBER_ID;
> >>>   2.  Broker rejects while allocating a member.id to c1 in response
> (c1
> >> protocol version is current);
> >>>   3.  c1 handles the error and rejoins with assigned member.id;
> >>>   4.  Broker stores c1 in its group metadata;
> >>>   5.  Consumer c2 started. Join with UNKNOWN_MEMBER_ID;
> >>>   6.  Broker rejects while allocating a member.id to c2 in response
> (c2
> >> protocol version is current);
> >>>   7.  c2 fails to get the response/crashes in the middle;
> >>>   8.  After certain time, c2 restarts a join request with
> >> UNKNOWN_MEMBER_ID;
> >>>
> >>> As you could see, c2 will repeat step 6~8 until successfully send back
> a
> >> join group request with allocated id.
> >>> By then broker will include c2 within the broker metadata map.
> >>>
> >>> Does this sound clear to you?
> >>>
> >>> Best,
> >>> Boyang
> >>> ________________________________
> >>> From: Stanislav Kozlovski <st...@confluent.io>
> >>> Sent: Wednesday, November 28, 2018 7:39 PM
> >>> To: dev@kafka.apache.org
> >>> Subject: Re: [DISCUSS] KIP-394: Require member.id for initial join
> >> group request
> >>>
> >>> Hey Matthias,
> >>>
> >>> I think the notion is to have the `session.timeout.ms` to start
> ticking
> >>> when the broker responds with the member.id. Then, the broker would
> >>> properly expire consumers and not hold too many stale ones.
> >>> This isn't mentioned in the KIP though so it is worth to wait for
> Boyang
> >> to
> >>> confirm
> >>>
> >>> On Wed, Nov 28, 2018 at 3:10 AM Matthias J. Sax <matthias@confluent.io
> >
> >>> wrote:
> >>>
> >>>> Thanks for the KIP Boyang.
> >>>>
> >>>> I guess I am missing something, but I am still learning more details
> >>>> about the rebalance protocol, so maybe you can help me out?
> >>>>
> >>>> Assume a client sends UNKNOWN_MEMBER_ID in its first joinGroup
> request.
> >>>> The broker generates a `member.id` and sends it back via
> >>>> `MEMBER_ID_REQUIRED` error response. This response might never reach
> the
> >>>> client or the client fails before it can send the second joinGroup
> >>>> request. Thus, a client would need to start over with a new
> >>>> UNKNOWN_MEMBER_ID in its joinGroup request. Thus, the broker needs to
> >>>> generate a new `member.id` again.
> >>>>
> >>>> So it seems the problem is moved, but not resolved? The motivation of
> >>>> the KIP is:
> >>>>
> >>>>> The edge case is that if initial join group request keeps failing due
> >> to
> >>>> connection timeout, or the consumer keeps restarting,
> >>>>
> >>>> From my understanding, this KIP move the issue from the first to the
> >>>> second joinGroup request (or broker joinGroup response).
> >>>>
> >>>> But maybe I am missing something. Can you help me out?
> >>>>
> >>>>
> >>>> -Matthias
> >>>>
> >>>>
> >>>> On 11/27/18 6:00 PM, Boyang Chen wrote:
> >>>>> Thanks Stanislav and Jason for the suggestions!
> >>>>>
> >>>>>
> >>>>>> Thanks for the KIP. Looks good overall. I think we will need to bump
> >> the
> >>>>>> version of the JoinGroup protocol in order to indicate compatibility
> >>>> with
> >>>>>> the new behavior. The coordinator needs to know when it is safe to
> >>>> assume
> >>>>>> the client will handle the error code.
> >>>>>>
> >>>>>> Also, I was wondering if we could reuse the REBALANCE_IN_PROGRESS
> >> error
> >>>>>> code. When the client sees this error code, it will take the
> memberId
> >>>> from
> >>>>>> the response and rejoin. We'd still need the protocol bump since
> older
> >>>>>> consumers do not have this logic.
> >>>>>
> >>>>> I will add the join group protocol version change to the KIP.
> Meanwhile
> >>>> I feel for
> >>>>> understandability it's better to define a separate error code since
> >>>> REBALANCE_IN_PROGRESS
> >>>>> is not the actual cause of the returned error.
> >>>>>
> >>>>>> One small question I have is now that we have one and a half
> >> round-trips
> >>>>>> needed to join in a rebalance (1 full RT addition), is it worth it
> to
> >>>>>> consider increasing the default value of `
> >>>> group.initial.rebalance.delay.ms`?
> >>>>> I guess we could keep it for now. After KIP-345 and incremental
> >>>> cooperative rebalancing
> >>>>> work we should be safe to deprecate `
> group.initial.rebalance.delay.ms
> >> `.
> >>>> Also one round trip
> >>>>> shouldn't increase the latency too much IMO.
> >>>>>
> >>>>> Best,
> >>>>> Boyang
> >>>>> ________________________________
> >>>>> From: Stanislav Kozlovski <st...@confluent.io>
> >>>>> Sent: Wednesday, November 28, 2018 2:32 AM
> >>>>> To: dev@kafka.apache.org
> >>>>> Subject: Re: [DISCUSS] KIP-394: Require member.id for initial join
> >>>> group request
> >>>>>
> >>>>> Hi Boyang,
> >>>>>
> >>>>> The KIP looks very good.
> >>>>> One small question I have is now that we have one and a half
> >> round-trips
> >>>>> needed to join in a rebalance (1 full RT addition), is it worth it to
> >>>>> consider increasing the default value of `
> >>>> group.initial.rebalance.delay.ms`?
> >>>>>
> >>>>> Best,
> >>>>> Stanislav
> >>>>>
> >>>>> On Tue, Nov 27, 2018 at 5:39 PM Jason Gustafson <ja...@confluent.io>
> >>>> wrote:
> >>>>>
> >>>>>> Hi Boyang,
> >>>>>>
> >>>>>> Thanks for the KIP. Looks good overall. I think we will need to bump
> >> the
> >>>>>> version of the JoinGroup protocol in order to indicate compatibility
> >>>> with
> >>>>>> the new behavior. The coordinator needs to know when it is safe to
> >>>> assume
> >>>>>> the client will handle the error code.
> >>>>>>
> >>>>>> Also, I was wondering if we could reuse the REBALANCE_IN_PROGRESS
> >> error
> >>>>>> code. When the client sees this error code, it will take the
> memberId
> >>>> from
> >>>>>> the response and rejoin. We'd still need the protocol bump since
> older
> >>>>>> consumers do not have this logic.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Jason
> >>>>>>
> >>>>>> On Mon, Nov 26, 2018 at 5:47 PM Boyang Chen <bc...@outlook.com>
> >>>> wrote:
> >>>>>>
> >>>>>>> Hey friends,
> >>>>>>>
> >>>>>>>
> >>>>>>> I would like to start a discussion thread for KIP-394 which is
> trying
> >>>> to
> >>>>>>> mitigate broker cache bursting issue due to anonymous join group
> >>>>>> requests:
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>
> >>
> https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcwiki.apache.org%2Fconfluence%2Fdisplay%2FKAFKA%2FKIP-394%253A%2BRequire%2Bmember.id%2Bfor%2Binitial%2Bjoin%2Bgroup%2Brequest&amp;data=02%7C01%7C%7C3ca95629be9e42b1f00108d657383bfd%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C636792296362447479&amp;sdata=3BuPVUH5v3hMYe%2FMgpSsNftTwb5DsHDlm2lN%2FVUR0T8%3D&amp;reserved=0
> >>>>>>>
> >>>>>>>
> >>>>>>> Thanks!
> >>>>>>>
> >>>>>>> Boyang
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>> Best,
> >>>>> Stanislav
> >>>>>
> >>>>
> >>>>
> >>>
> >>> --
> >>> Best,
> >>> Stanislav
> >>>
> >>
> >>
> >
>
>

-- 
Best,
Stanislav

Re: [DISCUSS] KIP-394: Require member.id for initial join group request

Posted by Boyang Chen <bc...@outlook.com>.

In fact, it's probably better to move KIP-394<https://cwiki.apache.org/confluence/display/KAFKA/KIP-394%3A+Require+member.id+for+initial+join+group+request> to the vote stage first, so that it's easier to finalize the timeline and smooth the rollout plan for KIP-345. Jason and Stanislav, since you two involve most in this KIP, could you let me know if there is still any unclarity we want to resolve before moving to vote?

Best,
Boyang
________________________________
From: Boyang Chen <bc...@outlook.com>
Sent: Saturday, December 1, 2018 10:53 AM
To: dev@kafka.apache.org
Subject: Re: [DISCUSS] KIP-394: Require member.id for initial join group request

Thanks Jason for the reply! Since the overall motivation and design is pretty clear, I will go ahead to start implementation and we could discuss the underlying details in the PR.

Best,
Boyang
________________________________
From: Matthias J. Sax <ma...@confluent.io>
Sent: Saturday, December 1, 2018 3:12 AM
To: dev@kafka.apache.org
Subject: Re: [DISCUSS] KIP-394: Require member.id for initial join group request

SGTM.

On 11/30/18 10:17 AM, Jason Gustafson wrote:
> Using the session expiration logic we already have seems like the simplest
> option (this is probably a one or two line change). The rejoin should be
> quick anyway, so I don't think it's worth optimizing for unjoined new
> members. Just my two cents. This is more of an implementation detail, so
> need not necessarily be resolved here.
>
> -Jason
>
> On Fri, Nov 30, 2018 at 12:56 AM Boyang Chen <bc...@outlook.com> wrote:
>
>> Thanks Matthias for the question. I'm thinking of having a separate hash
>> set called `registeredMemberIds` which
>> will be cleared out every time a group finishes one round of rebalance.
>> Since storing one id is pretty trivial, using
>> purgatory to track the id removal is a bit wasteful in my opinion.
>> ________________________________
>> From: Matthias J. Sax <ma...@confluent.io>
>> Sent: Friday, November 30, 2018 10:26 AM
>> To: dev@kafka.apache.org
>> Subject: Re: [DISCUSS] KIP-394: Require member.id for initial join group
>> request
>>
>> Thanks! Makes sense.
>>
>> I missed that fact, that the `member.id` is added on the second
>> joinGroup request that contains the `member.id`.
>>
>> However, it seems there is another race condition for this design:
>>
>> If two consumers join at the same time, it it possible that the broker
>> assigns the same `member.id` to both (because none of them have joined
>> the group yet--ie, second joinGroup request not sent yet--, the
>> `member.id` is not store broker side yes and broker cannot check for
>> duplicates when creating a new `member.id`.
>>
>> The probability might be fairly low thought. However, what Stanislav
>> proposed, to add the `member.id` directly, and remove it after
>> `session.timeout.ms` sound like a save option that avoids this issue.
>>
>> Thoughts?
>>
>>
>> -Matthias
>>
>> On 11/28/18 8:15 PM, Boyang Chen wrote:
>>> Thanks Matthias for the question, and Stanislav for the explanation!
>>>
>>> For the scenario described, we will never let a member join the
>> GroupMetadata map
>>> if it uses UNKNOWN_MEMBER_ID. So the workflow will be like this:
>>>
>>>   1.  Group is empty. Consumer c1 started. Join with UNKNOWN_MEMBER_ID;
>>>   2.  Broker rejects while allocating a member.id to c1 in response (c1
>> protocol version is current);
>>>   3.  c1 handles the error and rejoins with assigned member.id;
>>>   4.  Broker stores c1 in its group metadata;
>>>   5.  Consumer c2 started. Join with UNKNOWN_MEMBER_ID;
>>>   6.  Broker rejects while allocating a member.id to c2 in response (c2
>> protocol version is current);
>>>   7.  c2 fails to get the response/crashes in the middle;
>>>   8.  After certain time, c2 restarts a join request with
>> UNKNOWN_MEMBER_ID;
>>>
>>> As you could see, c2 will repeat step 6~8 until successfully send back a
>> join group request with allocated id.
>>> By then broker will include c2 within the broker metadata map.
>>>
>>> Does this sound clear to you?
>>>
>>> Best,
>>> Boyang
>>> ________________________________
>>> From: Stanislav Kozlovski <st...@confluent.io>
>>> Sent: Wednesday, November 28, 2018 7:39 PM
>>> To: dev@kafka.apache.org
>>> Subject: Re: [DISCUSS] KIP-394: Require member.id for initial join
>> group request
>>>
>>> Hey Matthias,
>>>
>>> I think the notion is to have the `session.timeout.ms` to start ticking
>>> when the broker responds with the member.id. Then, the broker would
>>> properly expire consumers and not hold too many stale ones.
>>> This isn't mentioned in the KIP though so it is worth to wait for Boyang
>> to
>>> confirm
>>>
>>> On Wed, Nov 28, 2018 at 3:10 AM Matthias J. Sax <ma...@confluent.io>
>>> wrote:
>>>
>>>> Thanks for the KIP Boyang.
>>>>
>>>> I guess I am missing something, but I am still learning more details
>>>> about the rebalance protocol, so maybe you can help me out?
>>>>
>>>> Assume a client sends UNKNOWN_MEMBER_ID in its first joinGroup request.
>>>> The broker generates a `member.id` and sends it back via
>>>> `MEMBER_ID_REQUIRED` error response. This response might never reach the
>>>> client or the client fails before it can send the second joinGroup
>>>> request. Thus, a client would need to start over with a new
>>>> UNKNOWN_MEMBER_ID in its joinGroup request. Thus, the broker needs to
>>>> generate a new `member.id` again.
>>>>
>>>> So it seems the problem is moved, but not resolved? The motivation of
>>>> the KIP is:
>>>>
>>>>> The edge case is that if initial join group request keeps failing due
>> to
>>>> connection timeout, or the consumer keeps restarting,
>>>>
>>>> From my understanding, this KIP move the issue from the first to the
>>>> second joinGroup request (or broker joinGroup response).
>>>>
>>>> But maybe I am missing something. Can you help me out?
>>>>
>>>>
>>>> -Matthias
>>>>
>>>>
>>>> On 11/27/18 6:00 PM, Boyang Chen wrote:
>>>>> Thanks Stanislav and Jason for the suggestions!
>>>>>
>>>>>
>>>>>> Thanks for the KIP. Looks good overall. I think we will need to bump
>> the
>>>>>> version of the JoinGroup protocol in order to indicate compatibility
>>>> with
>>>>>> the new behavior. The coordinator needs to know when it is safe to
>>>> assume
>>>>>> the client will handle the error code.
>>>>>>
>>>>>> Also, I was wondering if we could reuse the REBALANCE_IN_PROGRESS
>> error
>>>>>> code. When the client sees this error code, it will take the memberId
>>>> from
>>>>>> the response and rejoin. We'd still need the protocol bump since older
>>>>>> consumers do not have this logic.
>>>>>
>>>>> I will add the join group protocol version change to the KIP. Meanwhile
>>>> I feel for
>>>>> understandability it's better to define a separate error code since
>>>> REBALANCE_IN_PROGRESS
>>>>> is not the actual cause of the returned error.
>>>>>
>>>>>> One small question I have is now that we have one and a half
>> round-trips
>>>>>> needed to join in a rebalance (1 full RT addition), is it worth it to
>>>>>> consider increasing the default value of `
>>>> group.initial.rebalance.delay.ms`?
>>>>> I guess we could keep it for now. After KIP-345 and incremental
>>>> cooperative rebalancing
>>>>> work we should be safe to deprecate `group.initial.rebalance.delay.ms
>> `.
>>>> Also one round trip
>>>>> shouldn't increase the latency too much IMO.
>>>>>
>>>>> Best,
>>>>> Boyang
>>>>> ________________________________
>>>>> From: Stanislav Kozlovski <st...@confluent.io>
>>>>> Sent: Wednesday, November 28, 2018 2:32 AM
>>>>> To: dev@kafka.apache.org
>>>>> Subject: Re: [DISCUSS] KIP-394: Require member.id for initial join
>>>> group request
>>>>>
>>>>> Hi Boyang,
>>>>>
>>>>> The KIP looks very good.
>>>>> One small question I have is now that we have one and a half
>> round-trips
>>>>> needed to join in a rebalance (1 full RT addition), is it worth it to
>>>>> consider increasing the default value of `
>>>> group.initial.rebalance.delay.ms`?
>>>>>
>>>>> Best,
>>>>> Stanislav
>>>>>
>>>>> On Tue, Nov 27, 2018 at 5:39 PM Jason Gustafson <ja...@confluent.io>
>>>> wrote:
>>>>>
>>>>>> Hi Boyang,
>>>>>>
>>>>>> Thanks for the KIP. Looks good overall. I think we will need to bump
>> the
>>>>>> version of the JoinGroup protocol in order to indicate compatibility
>>>> with
>>>>>> the new behavior. The coordinator needs to know when it is safe to
>>>> assume
>>>>>> the client will handle the error code.
>>>>>>
>>>>>> Also, I was wondering if we could reuse the REBALANCE_IN_PROGRESS
>> error
>>>>>> code. When the client sees this error code, it will take the memberId
>>>> from
>>>>>> the response and rejoin. We'd still need the protocol bump since older
>>>>>> consumers do not have this logic.
>>>>>>
>>>>>> Thanks,
>>>>>> Jason
>>>>>>
>>>>>> On Mon, Nov 26, 2018 at 5:47 PM Boyang Chen <bc...@outlook.com>
>>>> wrote:
>>>>>>
>>>>>>> Hey friends,
>>>>>>>
>>>>>>>
>>>>>>> I would like to start a discussion thread for KIP-394 which is trying
>>>> to
>>>>>>> mitigate broker cache bursting issue due to anonymous join group
>>>>>> requests:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>
>> https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcwiki.apache.org%2Fconfluence%2Fdisplay%2FKAFKA%2FKIP-394%253A%2BRequire%2Bmember.id%2Bfor%2Binitial%2Bjoin%2Bgroup%2Brequest&amp;data=02%7C01%7C%7C3ca95629be9e42b1f00108d657383bfd%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C636792296362447479&amp;sdata=3BuPVUH5v3hMYe%2FMgpSsNftTwb5DsHDlm2lN%2FVUR0T8%3D&amp;reserved=0
>>>>>>>
>>>>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>> Boyang
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Best,
>>>>> Stanislav
>>>>>
>>>>
>>>>
>>>
>>> --
>>> Best,
>>> Stanislav
>>>
>>
>>
>

Re: [DISCUSS] KIP-394: Require member.id for initial join group request

Posted by Boyang Chen <bc...@outlook.com>.

Thanks Jason for the reply! Since the overall motivation and design is pretty clear, I will go ahead to start implementation and we could discuss the underlying details in the PR.

Best,
Boyang
________________________________
From: Matthias J. Sax <ma...@confluent.io>
Sent: Saturday, December 1, 2018 3:12 AM
To: dev@kafka.apache.org
Subject: Re: [DISCUSS] KIP-394: Require member.id for initial join group request

SGTM.

On 11/30/18 10:17 AM, Jason Gustafson wrote:
> Using the session expiration logic we already have seems like the simplest
> option (this is probably a one or two line change). The rejoin should be
> quick anyway, so I don't think it's worth optimizing for unjoined new
> members. Just my two cents. This is more of an implementation detail, so
> need not necessarily be resolved here.
>
> -Jason
>
> On Fri, Nov 30, 2018 at 12:56 AM Boyang Chen <bc...@outlook.com> wrote:
>
>> Thanks Matthias for the question. I'm thinking of having a separate hash
>> set called `registeredMemberIds` which
>> will be cleared out every time a group finishes one round of rebalance.
>> Since storing one id is pretty trivial, using
>> purgatory to track the id removal is a bit wasteful in my opinion.
>> ________________________________
>> From: Matthias J. Sax <ma...@confluent.io>
>> Sent: Friday, November 30, 2018 10:26 AM
>> To: dev@kafka.apache.org
>> Subject: Re: [DISCUSS] KIP-394: Require member.id for initial join group
>> request
>>
>> Thanks! Makes sense.
>>
>> I missed that fact, that the `member.id` is added on the second
>> joinGroup request that contains the `member.id`.
>>
>> However, it seems there is another race condition for this design:
>>
>> If two consumers join at the same time, it it possible that the broker
>> assigns the same `member.id` to both (because none of them have joined
>> the group yet--ie, second joinGroup request not sent yet--, the
>> `member.id` is not store broker side yes and broker cannot check for
>> duplicates when creating a new `member.id`.
>>
>> The probability might be fairly low thought. However, what Stanislav
>> proposed, to add the `member.id` directly, and remove it after
>> `session.timeout.ms` sound like a save option that avoids this issue.
>>
>> Thoughts?
>>
>>
>> -Matthias
>>
>> On 11/28/18 8:15 PM, Boyang Chen wrote:
>>> Thanks Matthias for the question, and Stanislav for the explanation!
>>>
>>> For the scenario described, we will never let a member join the
>> GroupMetadata map
>>> if it uses UNKNOWN_MEMBER_ID. So the workflow will be like this:
>>>
>>>   1.  Group is empty. Consumer c1 started. Join with UNKNOWN_MEMBER_ID;
>>>   2.  Broker rejects while allocating a member.id to c1 in response (c1
>> protocol version is current);
>>>   3.  c1 handles the error and rejoins with assigned member.id;
>>>   4.  Broker stores c1 in its group metadata;
>>>   5.  Consumer c2 started. Join with UNKNOWN_MEMBER_ID;
>>>   6.  Broker rejects while allocating a member.id to c2 in response (c2
>> protocol version is current);
>>>   7.  c2 fails to get the response/crashes in the middle;
>>>   8.  After certain time, c2 restarts a join request with
>> UNKNOWN_MEMBER_ID;
>>>
>>> As you could see, c2 will repeat step 6~8 until successfully send back a
>> join group request with allocated id.
>>> By then broker will include c2 within the broker metadata map.
>>>
>>> Does this sound clear to you?
>>>
>>> Best,
>>> Boyang
>>> ________________________________
>>> From: Stanislav Kozlovski <st...@confluent.io>
>>> Sent: Wednesday, November 28, 2018 7:39 PM
>>> To: dev@kafka.apache.org
>>> Subject: Re: [DISCUSS] KIP-394: Require member.id for initial join
>> group request
>>>
>>> Hey Matthias,
>>>
>>> I think the notion is to have the `session.timeout.ms` to start ticking
>>> when the broker responds with the member.id. Then, the broker would
>>> properly expire consumers and not hold too many stale ones.
>>> This isn't mentioned in the KIP though so it is worth to wait for Boyang
>> to
>>> confirm
>>>
>>> On Wed, Nov 28, 2018 at 3:10 AM Matthias J. Sax <ma...@confluent.io>
>>> wrote:
>>>
>>>> Thanks for the KIP Boyang.
>>>>
>>>> I guess I am missing something, but I am still learning more details
>>>> about the rebalance protocol, so maybe you can help me out?
>>>>
>>>> Assume a client sends UNKNOWN_MEMBER_ID in its first joinGroup request.
>>>> The broker generates a `member.id` and sends it back via
>>>> `MEMBER_ID_REQUIRED` error response. This response might never reach the
>>>> client or the client fails before it can send the second joinGroup
>>>> request. Thus, a client would need to start over with a new
>>>> UNKNOWN_MEMBER_ID in its joinGroup request. Thus, the broker needs to
>>>> generate a new `member.id` again.
>>>>
>>>> So it seems the problem is moved, but not resolved? The motivation of
>>>> the KIP is:
>>>>
>>>>> The edge case is that if initial join group request keeps failing due
>> to
>>>> connection timeout, or the consumer keeps restarting,
>>>>
>>>> From my understanding, this KIP move the issue from the first to the
>>>> second joinGroup request (or broker joinGroup response).
>>>>
>>>> But maybe I am missing something. Can you help me out?
>>>>
>>>>
>>>> -Matthias
>>>>
>>>>
>>>> On 11/27/18 6:00 PM, Boyang Chen wrote:
>>>>> Thanks Stanislav and Jason for the suggestions!
>>>>>
>>>>>
>>>>>> Thanks for the KIP. Looks good overall. I think we will need to bump
>> the
>>>>>> version of the JoinGroup protocol in order to indicate compatibility
>>>> with
>>>>>> the new behavior. The coordinator needs to know when it is safe to
>>>> assume
>>>>>> the client will handle the error code.
>>>>>>
>>>>>> Also, I was wondering if we could reuse the REBALANCE_IN_PROGRESS
>> error
>>>>>> code. When the client sees this error code, it will take the memberId
>>>> from
>>>>>> the response and rejoin. We'd still need the protocol bump since older
>>>>>> consumers do not have this logic.
>>>>>
>>>>> I will add the join group protocol version change to the KIP. Meanwhile
>>>> I feel for
>>>>> understandability it's better to define a separate error code since
>>>> REBALANCE_IN_PROGRESS
>>>>> is not the actual cause of the returned error.
>>>>>
>>>>>> One small question I have is now that we have one and a half
>> round-trips
>>>>>> needed to join in a rebalance (1 full RT addition), is it worth it to
>>>>>> consider increasing the default value of `
>>>> group.initial.rebalance.delay.ms`?
>>>>> I guess we could keep it for now. After KIP-345 and incremental
>>>> cooperative rebalancing
>>>>> work we should be safe to deprecate `group.initial.rebalance.delay.ms
>> `.
>>>> Also one round trip
>>>>> shouldn't increase the latency too much IMO.
>>>>>
>>>>> Best,
>>>>> Boyang
>>>>> ________________________________
>>>>> From: Stanislav Kozlovski <st...@confluent.io>
>>>>> Sent: Wednesday, November 28, 2018 2:32 AM
>>>>> To: dev@kafka.apache.org
>>>>> Subject: Re: [DISCUSS] KIP-394: Require member.id for initial join
>>>> group request
>>>>>
>>>>> Hi Boyang,
>>>>>
>>>>> The KIP looks very good.
>>>>> One small question I have is now that we have one and a half
>> round-trips
>>>>> needed to join in a rebalance (1 full RT addition), is it worth it to
>>>>> consider increasing the default value of `
>>>> group.initial.rebalance.delay.ms`?
>>>>>
>>>>> Best,
>>>>> Stanislav
>>>>>
>>>>> On Tue, Nov 27, 2018 at 5:39 PM Jason Gustafson <ja...@confluent.io>
>>>> wrote:
>>>>>
>>>>>> Hi Boyang,
>>>>>>
>>>>>> Thanks for the KIP. Looks good overall. I think we will need to bump
>> the
>>>>>> version of the JoinGroup protocol in order to indicate compatibility
>>>> with
>>>>>> the new behavior. The coordinator needs to know when it is safe to
>>>> assume
>>>>>> the client will handle the error code.
>>>>>>
>>>>>> Also, I was wondering if we could reuse the REBALANCE_IN_PROGRESS
>> error
>>>>>> code. When the client sees this error code, it will take the memberId
>>>> from
>>>>>> the response and rejoin. We'd still need the protocol bump since older
>>>>>> consumers do not have this logic.
>>>>>>
>>>>>> Thanks,
>>>>>> Jason
>>>>>>
>>>>>> On Mon, Nov 26, 2018 at 5:47 PM Boyang Chen <bc...@outlook.com>
>>>> wrote:
>>>>>>
>>>>>>> Hey friends,
>>>>>>>
>>>>>>>
>>>>>>> I would like to start a discussion thread for KIP-394 which is trying
>>>> to
>>>>>>> mitigate broker cache bursting issue due to anonymous join group
>>>>>> requests:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>
>> https://nam05.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcwiki.apache.org%2Fconfluence%2Fdisplay%2FKAFKA%2FKIP-394%253A%2BRequire%2Bmember.id%2Bfor%2Binitial%2Bjoin%2Bgroup%2Brequest&amp;data=02%7C01%7C%7C046b858d134f4c1c576108d655277552%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C636790025286277394&amp;sdata=QyBzqnislL%2B9fK1mXaRuJ0xpi9Y2JDvHrM881hzjq3A%3D&amp;reserved=0
>>>>>>>
>>>>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>> Boyang
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Best,
>>>>> Stanislav
>>>>>
>>>>
>>>>
>>>
>>> --
>>> Best,
>>> Stanislav
>>>
>>
>>
>

Re: [DISCUSS] KIP-394: Require member.id for initial join group request

Posted by "Matthias J. Sax" <ma...@confluent.io>.

SGTM.

On 11/30/18 10:17 AM, Jason Gustafson wrote:
> Using the session expiration logic we already have seems like the simplest
> option (this is probably a one or two line change). The rejoin should be
> quick anyway, so I don't think it's worth optimizing for unjoined new
> members. Just my two cents. This is more of an implementation detail, so
> need not necessarily be resolved here.
> 
> -Jason
> 
> On Fri, Nov 30, 2018 at 12:56 AM Boyang Chen <bc...@outlook.com> wrote:
> 
>> Thanks Matthias for the question. I'm thinking of having a separate hash
>> set called `registeredMemberIds` which
>> will be cleared out every time a group finishes one round of rebalance.
>> Since storing one id is pretty trivial, using
>> purgatory to track the id removal is a bit wasteful in my opinion.
>> ________________________________
>> From: Matthias J. Sax <ma...@confluent.io>
>> Sent: Friday, November 30, 2018 10:26 AM
>> To: dev@kafka.apache.org
>> Subject: Re: [DISCUSS] KIP-394: Require member.id for initial join group
>> request
>>
>> Thanks! Makes sense.
>>
>> I missed that fact, that the `member.id` is added on the second
>> joinGroup request that contains the `member.id`.
>>
>> However, it seems there is another race condition for this design:
>>
>> If two consumers join at the same time, it it possible that the broker
>> assigns the same `member.id` to both (because none of them have joined
>> the group yet--ie, second joinGroup request not sent yet--, the
>> `member.id` is not store broker side yes and broker cannot check for
>> duplicates when creating a new `member.id`.
>>
>> The probability might be fairly low thought. However, what Stanislav
>> proposed, to add the `member.id` directly, and remove it after
>> `session.timeout.ms` sound like a save option that avoids this issue.
>>
>> Thoughts?
>>
>>
>> -Matthias
>>
>> On 11/28/18 8:15 PM, Boyang Chen wrote:
>>> Thanks Matthias for the question, and Stanislav for the explanation!
>>>
>>> For the scenario described, we will never let a member join the
>> GroupMetadata map
>>> if it uses UNKNOWN_MEMBER_ID. So the workflow will be like this:
>>>
>>>   1.  Group is empty. Consumer c1 started. Join with UNKNOWN_MEMBER_ID;
>>>   2.  Broker rejects while allocating a member.id to c1 in response (c1
>> protocol version is current);
>>>   3.  c1 handles the error and rejoins with assigned member.id;
>>>   4.  Broker stores c1 in its group metadata;
>>>   5.  Consumer c2 started. Join with UNKNOWN_MEMBER_ID;
>>>   6.  Broker rejects while allocating a member.id to c2 in response (c2
>> protocol version is current);
>>>   7.  c2 fails to get the response/crashes in the middle;
>>>   8.  After certain time, c2 restarts a join request with
>> UNKNOWN_MEMBER_ID;
>>>
>>> As you could see, c2 will repeat step 6~8 until successfully send back a
>> join group request with allocated id.
>>> By then broker will include c2 within the broker metadata map.
>>>
>>> Does this sound clear to you?
>>>
>>> Best,
>>> Boyang
>>> ________________________________
>>> From: Stanislav Kozlovski <st...@confluent.io>
>>> Sent: Wednesday, November 28, 2018 7:39 PM
>>> To: dev@kafka.apache.org
>>> Subject: Re: [DISCUSS] KIP-394: Require member.id for initial join
>> group request
>>>
>>> Hey Matthias,
>>>
>>> I think the notion is to have the `session.timeout.ms` to start ticking
>>> when the broker responds with the member.id. Then, the broker would
>>> properly expire consumers and not hold too many stale ones.
>>> This isn't mentioned in the KIP though so it is worth to wait for Boyang
>> to
>>> confirm
>>>
>>> On Wed, Nov 28, 2018 at 3:10 AM Matthias J. Sax <ma...@confluent.io>
>>> wrote:
>>>
>>>> Thanks for the KIP Boyang.
>>>>
>>>> I guess I am missing something, but I am still learning more details
>>>> about the rebalance protocol, so maybe you can help me out?
>>>>
>>>> Assume a client sends UNKNOWN_MEMBER_ID in its first joinGroup request.
>>>> The broker generates a `member.id` and sends it back via
>>>> `MEMBER_ID_REQUIRED` error response. This response might never reach the
>>>> client or the client fails before it can send the second joinGroup
>>>> request. Thus, a client would need to start over with a new
>>>> UNKNOWN_MEMBER_ID in its joinGroup request. Thus, the broker needs to
>>>> generate a new `member.id` again.
>>>>
>>>> So it seems the problem is moved, but not resolved? The motivation of
>>>> the KIP is:
>>>>
>>>>> The edge case is that if initial join group request keeps failing due
>> to
>>>> connection timeout, or the consumer keeps restarting,
>>>>
>>>> From my understanding, this KIP move the issue from the first to the
>>>> second joinGroup request (or broker joinGroup response).
>>>>
>>>> But maybe I am missing something. Can you help me out?
>>>>
>>>>
>>>> -Matthias
>>>>
>>>>
>>>> On 11/27/18 6:00 PM, Boyang Chen wrote:
>>>>> Thanks Stanislav and Jason for the suggestions!
>>>>>
>>>>>
>>>>>> Thanks for the KIP. Looks good overall. I think we will need to bump
>> the
>>>>>> version of the JoinGroup protocol in order to indicate compatibility
>>>> with
>>>>>> the new behavior. The coordinator needs to know when it is safe to
>>>> assume
>>>>>> the client will handle the error code.
>>>>>>
>>>>>> Also, I was wondering if we could reuse the REBALANCE_IN_PROGRESS
>> error
>>>>>> code. When the client sees this error code, it will take the memberId
>>>> from
>>>>>> the response and rejoin. We'd still need the protocol bump since older
>>>>>> consumers do not have this logic.
>>>>>
>>>>> I will add the join group protocol version change to the KIP. Meanwhile
>>>> I feel for
>>>>> understandability it's better to define a separate error code since
>>>> REBALANCE_IN_PROGRESS
>>>>> is not the actual cause of the returned error.
>>>>>
>>>>>> One small question I have is now that we have one and a half
>> round-trips
>>>>>> needed to join in a rebalance (1 full RT addition), is it worth it to
>>>>>> consider increasing the default value of `
>>>> group.initial.rebalance.delay.ms`?
>>>>> I guess we could keep it for now. After KIP-345 and incremental
>>>> cooperative rebalancing
>>>>> work we should be safe to deprecate `group.initial.rebalance.delay.ms
>> `.
>>>> Also one round trip
>>>>> shouldn't increase the latency too much IMO.
>>>>>
>>>>> Best,
>>>>> Boyang
>>>>> ________________________________
>>>>> From: Stanislav Kozlovski <st...@confluent.io>
>>>>> Sent: Wednesday, November 28, 2018 2:32 AM
>>>>> To: dev@kafka.apache.org
>>>>> Subject: Re: [DISCUSS] KIP-394: Require member.id for initial join
>>>> group request
>>>>>
>>>>> Hi Boyang,
>>>>>
>>>>> The KIP looks very good.
>>>>> One small question I have is now that we have one and a half
>> round-trips
>>>>> needed to join in a rebalance (1 full RT addition), is it worth it to
>>>>> consider increasing the default value of `
>>>> group.initial.rebalance.delay.ms`?
>>>>>
>>>>> Best,
>>>>> Stanislav
>>>>>
>>>>> On Tue, Nov 27, 2018 at 5:39 PM Jason Gustafson <ja...@confluent.io>
>>>> wrote:
>>>>>
>>>>>> Hi Boyang,
>>>>>>
>>>>>> Thanks for the KIP. Looks good overall. I think we will need to bump
>> the
>>>>>> version of the JoinGroup protocol in order to indicate compatibility
>>>> with
>>>>>> the new behavior. The coordinator needs to know when it is safe to
>>>> assume
>>>>>> the client will handle the error code.
>>>>>>
>>>>>> Also, I was wondering if we could reuse the REBALANCE_IN_PROGRESS
>> error
>>>>>> code. When the client sees this error code, it will take the memberId
>>>> from
>>>>>> the response and rejoin. We'd still need the protocol bump since older
>>>>>> consumers do not have this logic.
>>>>>>
>>>>>> Thanks,
>>>>>> Jason
>>>>>>
>>>>>> On Mon, Nov 26, 2018 at 5:47 PM Boyang Chen <bc...@outlook.com>
>>>> wrote:
>>>>>>
>>>>>>> Hey friends,
>>>>>>>
>>>>>>>
>>>>>>> I would like to start a discussion thread for KIP-394 which is trying
>>>> to
>>>>>>> mitigate broker cache bursting issue due to anonymous join group
>>>>>> requests:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>
>> https://nam05.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcwiki.apache.org%2Fconfluence%2Fdisplay%2FKAFKA%2FKIP-394%253A%2BRequire%2Bmember.id%2Bfor%2Binitial%2Bjoin%2Bgroup%2Brequest&amp;data=02%7C01%7C%7C046b858d134f4c1c576108d655277552%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C636790025286277394&amp;sdata=QyBzqnislL%2B9fK1mXaRuJ0xpi9Y2JDvHrM881hzjq3A%3D&amp;reserved=0
>>>>>>>
>>>>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>> Boyang
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Best,
>>>>> Stanislav
>>>>>
>>>>
>>>>
>>>
>>> --
>>> Best,
>>> Stanislav
>>>
>>
>>
>

Re: [DISCUSS] KIP-394: Require member.id for initial join group request

Posted by Jason Gustafson <ja...@confluent.io>.

Using the session expiration logic we already have seems like the simplest
option (this is probably a one or two line change). The rejoin should be
quick anyway, so I don't think it's worth optimizing for unjoined new
members. Just my two cents. This is more of an implementation detail, so
need not necessarily be resolved here.

-Jason

On Fri, Nov 30, 2018 at 12:56 AM Boyang Chen <bc...@outlook.com> wrote:

> Thanks Matthias for the question. I'm thinking of having a separate hash
> set called `registeredMemberIds` which
> will be cleared out every time a group finishes one round of rebalance.
> Since storing one id is pretty trivial, using
> purgatory to track the id removal is a bit wasteful in my opinion.
> ________________________________
> From: Matthias J. Sax <ma...@confluent.io>
> Sent: Friday, November 30, 2018 10:26 AM
> To: dev@kafka.apache.org
> Subject: Re: [DISCUSS] KIP-394: Require member.id for initial join group
> request
>
> Thanks! Makes sense.
>
> I missed that fact, that the `member.id` is added on the second
> joinGroup request that contains the `member.id`.
>
> However, it seems there is another race condition for this design:
>
> If two consumers join at the same time, it it possible that the broker
> assigns the same `member.id` to both (because none of them have joined
> the group yet--ie, second joinGroup request not sent yet--, the
> `member.id` is not store broker side yes and broker cannot check for
> duplicates when creating a new `member.id`.
>
> The probability might be fairly low thought. However, what Stanislav
> proposed, to add the `member.id` directly, and remove it after
> `session.timeout.ms` sound like a save option that avoids this issue.
>
> Thoughts?
>
>
> -Matthias
>
> On 11/28/18 8:15 PM, Boyang Chen wrote:
> > Thanks Matthias for the question, and Stanislav for the explanation!
> >
> > For the scenario described, we will never let a member join the
> GroupMetadata map
> > if it uses UNKNOWN_MEMBER_ID. So the workflow will be like this:
> >
> >   1.  Group is empty. Consumer c1 started. Join with UNKNOWN_MEMBER_ID;
> >   2.  Broker rejects while allocating a member.id to c1 in response (c1
> protocol version is current);
> >   3.  c1 handles the error and rejoins with assigned member.id;
> >   4.  Broker stores c1 in its group metadata;
> >   5.  Consumer c2 started. Join with UNKNOWN_MEMBER_ID;
> >   6.  Broker rejects while allocating a member.id to c2 in response (c2
> protocol version is current);
> >   7.  c2 fails to get the response/crashes in the middle;
> >   8.  After certain time, c2 restarts a join request with
> UNKNOWN_MEMBER_ID;
> >
> > As you could see, c2 will repeat step 6~8 until successfully send back a
> join group request with allocated id.
> > By then broker will include c2 within the broker metadata map.
> >
> > Does this sound clear to you?
> >
> > Best,
> > Boyang
> > ________________________________
> > From: Stanislav Kozlovski <st...@confluent.io>
> > Sent: Wednesday, November 28, 2018 7:39 PM
> > To: dev@kafka.apache.org
> > Subject: Re: [DISCUSS] KIP-394: Require member.id for initial join
> group request
> >
> > Hey Matthias,
> >
> > I think the notion is to have the `session.timeout.ms` to start ticking
> > when the broker responds with the member.id. Then, the broker would
> > properly expire consumers and not hold too many stale ones.
> > This isn't mentioned in the KIP though so it is worth to wait for Boyang
> to
> > confirm
> >
> > On Wed, Nov 28, 2018 at 3:10 AM Matthias J. Sax <ma...@confluent.io>
> > wrote:
> >
> >> Thanks for the KIP Boyang.
> >>
> >> I guess I am missing something, but I am still learning more details
> >> about the rebalance protocol, so maybe you can help me out?
> >>
> >> Assume a client sends UNKNOWN_MEMBER_ID in its first joinGroup request.
> >> The broker generates a `member.id` and sends it back via
> >> `MEMBER_ID_REQUIRED` error response. This response might never reach the
> >> client or the client fails before it can send the second joinGroup
> >> request. Thus, a client would need to start over with a new
> >> UNKNOWN_MEMBER_ID in its joinGroup request. Thus, the broker needs to
> >> generate a new `member.id` again.
> >>
> >> So it seems the problem is moved, but not resolved? The motivation of
> >> the KIP is:
> >>
> >>> The edge case is that if initial join group request keeps failing due
> to
> >> connection timeout, or the consumer keeps restarting,
> >>
> >> From my understanding, this KIP move the issue from the first to the
> >> second joinGroup request (or broker joinGroup response).
> >>
> >> But maybe I am missing something. Can you help me out?
> >>
> >>
> >> -Matthias
> >>
> >>
> >> On 11/27/18 6:00 PM, Boyang Chen wrote:
> >>> Thanks Stanislav and Jason for the suggestions!
> >>>
> >>>
> >>>> Thanks for the KIP. Looks good overall. I think we will need to bump
> the
> >>>> version of the JoinGroup protocol in order to indicate compatibility
> >> with
> >>>> the new behavior. The coordinator needs to know when it is safe to
> >> assume
> >>>> the client will handle the error code.
> >>>>
> >>>> Also, I was wondering if we could reuse the REBALANCE_IN_PROGRESS
> error
> >>>> code. When the client sees this error code, it will take the memberId
> >> from
> >>>> the response and rejoin. We'd still need the protocol bump since older
> >>>> consumers do not have this logic.
> >>>
> >>> I will add the join group protocol version change to the KIP. Meanwhile
> >> I feel for
> >>> understandability it's better to define a separate error code since
> >> REBALANCE_IN_PROGRESS
> >>> is not the actual cause of the returned error.
> >>>
> >>>> One small question I have is now that we have one and a half
> round-trips
> >>>> needed to join in a rebalance (1 full RT addition), is it worth it to
> >>>> consider increasing the default value of `
> >> group.initial.rebalance.delay.ms`?
> >>> I guess we could keep it for now. After KIP-345 and incremental
> >> cooperative rebalancing
> >>> work we should be safe to deprecate `group.initial.rebalance.delay.ms
> `.
> >> Also one round trip
> >>> shouldn't increase the latency too much IMO.
> >>>
> >>> Best,
> >>> Boyang
> >>> ________________________________
> >>> From: Stanislav Kozlovski <st...@confluent.io>
> >>> Sent: Wednesday, November 28, 2018 2:32 AM
> >>> To: dev@kafka.apache.org
> >>> Subject: Re: [DISCUSS] KIP-394: Require member.id for initial join
> >> group request
> >>>
> >>> Hi Boyang,
> >>>
> >>> The KIP looks very good.
> >>> One small question I have is now that we have one and a half
> round-trips
> >>> needed to join in a rebalance (1 full RT addition), is it worth it to
> >>> consider increasing the default value of `
> >> group.initial.rebalance.delay.ms`?
> >>>
> >>> Best,
> >>> Stanislav
> >>>
> >>> On Tue, Nov 27, 2018 at 5:39 PM Jason Gustafson <ja...@confluent.io>
> >> wrote:
> >>>
> >>>> Hi Boyang,
> >>>>
> >>>> Thanks for the KIP. Looks good overall. I think we will need to bump
> the
> >>>> version of the JoinGroup protocol in order to indicate compatibility
> >> with
> >>>> the new behavior. The coordinator needs to know when it is safe to
> >> assume
> >>>> the client will handle the error code.
> >>>>
> >>>> Also, I was wondering if we could reuse the REBALANCE_IN_PROGRESS
> error
> >>>> code. When the client sees this error code, it will take the memberId
> >> from
> >>>> the response and rejoin. We'd still need the protocol bump since older
> >>>> consumers do not have this logic.
> >>>>
> >>>> Thanks,
> >>>> Jason
> >>>>
> >>>> On Mon, Nov 26, 2018 at 5:47 PM Boyang Chen <bc...@outlook.com>
> >> wrote:
> >>>>
> >>>>> Hey friends,
> >>>>>
> >>>>>
> >>>>> I would like to start a discussion thread for KIP-394 which is trying
> >> to
> >>>>> mitigate broker cache bursting issue due to anonymous join group
> >>>> requests:
> >>>>>
> >>>>>
> >>>>>
> >>>>
> >>
> https://nam05.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcwiki.apache.org%2Fconfluence%2Fdisplay%2FKAFKA%2FKIP-394%253A%2BRequire%2Bmember.id%2Bfor%2Binitial%2Bjoin%2Bgroup%2Brequest&amp;data=02%7C01%7C%7C046b858d134f4c1c576108d655277552%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C636790025286277394&amp;sdata=QyBzqnislL%2B9fK1mXaRuJ0xpi9Y2JDvHrM881hzjq3A%3D&amp;reserved=0
> >>>>>
> >>>>>
> >>>>> Thanks!
> >>>>>
> >>>>> Boyang
> >>>>>
> >>>>
> >>>
> >>>
> >>> --
> >>> Best,
> >>> Stanislav
> >>>
> >>
> >>
> >
> > --
> > Best,
> > Stanislav
> >
>
>

Re: [DISCUSS] KIP-394: Require member.id for initial join group request

Posted by Boyang Chen <bc...@outlook.com>.

Thanks Matthias for the question. I'm thinking of having a separate hash set called `registeredMemberIds` which
will be cleared out every time a group finishes one round of rebalance. Since storing one id is pretty trivial, using
purgatory to track the id removal is a bit wasteful in my opinion.
________________________________
From: Matthias J. Sax <ma...@confluent.io>
Sent: Friday, November 30, 2018 10:26 AM
To: dev@kafka.apache.org
Subject: Re: [DISCUSS] KIP-394: Require member.id for initial join group request

Thanks! Makes sense.

I missed that fact, that the `member.id` is added on the second
joinGroup request that contains the `member.id`.

However, it seems there is another race condition for this design:

If two consumers join at the same time, it it possible that the broker
assigns the same `member.id` to both (because none of them have joined
the group yet--ie, second joinGroup request not sent yet--, the
`member.id` is not store broker side yes and broker cannot check for
duplicates when creating a new `member.id`.

The probability might be fairly low thought. However, what Stanislav
proposed, to add the `member.id` directly, and remove it after
`session.timeout.ms` sound like a save option that avoids this issue.

Thoughts?


-Matthias

On 11/28/18 8:15 PM, Boyang Chen wrote:
> Thanks Matthias for the question, and Stanislav for the explanation!
>
> For the scenario described, we will never let a member join the GroupMetadata map
> if it uses UNKNOWN_MEMBER_ID. So the workflow will be like this:
>
>   1.  Group is empty. Consumer c1 started. Join with UNKNOWN_MEMBER_ID;
>   2.  Broker rejects while allocating a member.id to c1 in response (c1 protocol version is current);
>   3.  c1 handles the error and rejoins with assigned member.id;
>   4.  Broker stores c1 in its group metadata;
>   5.  Consumer c2 started. Join with UNKNOWN_MEMBER_ID;
>   6.  Broker rejects while allocating a member.id to c2 in response (c2 protocol version is current);
>   7.  c2 fails to get the response/crashes in the middle;
>   8.  After certain time, c2 restarts a join request with UNKNOWN_MEMBER_ID;
>
> As you could see, c2 will repeat step 6~8 until successfully send back a join group request with allocated id.
> By then broker will include c2 within the broker metadata map.
>
> Does this sound clear to you?
>
> Best,
> Boyang
> ________________________________
> From: Stanislav Kozlovski <st...@confluent.io>
> Sent: Wednesday, November 28, 2018 7:39 PM
> To: dev@kafka.apache.org
> Subject: Re: [DISCUSS] KIP-394: Require member.id for initial join group request
>
> Hey Matthias,
>
> I think the notion is to have the `session.timeout.ms` to start ticking
> when the broker responds with the member.id. Then, the broker would
> properly expire consumers and not hold too many stale ones.
> This isn't mentioned in the KIP though so it is worth to wait for Boyang to
> confirm
>
> On Wed, Nov 28, 2018 at 3:10 AM Matthias J. Sax <ma...@confluent.io>
> wrote:
>
>> Thanks for the KIP Boyang.
>>
>> I guess I am missing something, but I am still learning more details
>> about the rebalance protocol, so maybe you can help me out?
>>
>> Assume a client sends UNKNOWN_MEMBER_ID in its first joinGroup request.
>> The broker generates a `member.id` and sends it back via
>> `MEMBER_ID_REQUIRED` error response. This response might never reach the
>> client or the client fails before it can send the second joinGroup
>> request. Thus, a client would need to start over with a new
>> UNKNOWN_MEMBER_ID in its joinGroup request. Thus, the broker needs to
>> generate a new `member.id` again.
>>
>> So it seems the problem is moved, but not resolved? The motivation of
>> the KIP is:
>>
>>> The edge case is that if initial join group request keeps failing due to
>> connection timeout, or the consumer keeps restarting,
>>
>> From my understanding, this KIP move the issue from the first to the
>> second joinGroup request (or broker joinGroup response).
>>
>> But maybe I am missing something. Can you help me out?
>>
>>
>> -Matthias
>>
>>
>> On 11/27/18 6:00 PM, Boyang Chen wrote:
>>> Thanks Stanislav and Jason for the suggestions!
>>>
>>>
>>>> Thanks for the KIP. Looks good overall. I think we will need to bump the
>>>> version of the JoinGroup protocol in order to indicate compatibility
>> with
>>>> the new behavior. The coordinator needs to know when it is safe to
>> assume
>>>> the client will handle the error code.
>>>>
>>>> Also, I was wondering if we could reuse the REBALANCE_IN_PROGRESS error
>>>> code. When the client sees this error code, it will take the memberId
>> from
>>>> the response and rejoin. We'd still need the protocol bump since older
>>>> consumers do not have this logic.
>>>
>>> I will add the join group protocol version change to the KIP. Meanwhile
>> I feel for
>>> understandability it's better to define a separate error code since
>> REBALANCE_IN_PROGRESS
>>> is not the actual cause of the returned error.
>>>
>>>> One small question I have is now that we have one and a half round-trips
>>>> needed to join in a rebalance (1 full RT addition), is it worth it to
>>>> consider increasing the default value of `
>> group.initial.rebalance.delay.ms`?
>>> I guess we could keep it for now. After KIP-345 and incremental
>> cooperative rebalancing
>>> work we should be safe to deprecate `group.initial.rebalance.delay.ms`.
>> Also one round trip
>>> shouldn't increase the latency too much IMO.
>>>
>>> Best,
>>> Boyang
>>> ________________________________
>>> From: Stanislav Kozlovski <st...@confluent.io>
>>> Sent: Wednesday, November 28, 2018 2:32 AM
>>> To: dev@kafka.apache.org
>>> Subject: Re: [DISCUSS] KIP-394: Require member.id for initial join
>> group request
>>>
>>> Hi Boyang,
>>>
>>> The KIP looks very good.
>>> One small question I have is now that we have one and a half round-trips
>>> needed to join in a rebalance (1 full RT addition), is it worth it to
>>> consider increasing the default value of `
>> group.initial.rebalance.delay.ms`?
>>>
>>> Best,
>>> Stanislav
>>>
>>> On Tue, Nov 27, 2018 at 5:39 PM Jason Gustafson <ja...@confluent.io>
>> wrote:
>>>
>>>> Hi Boyang,
>>>>
>>>> Thanks for the KIP. Looks good overall. I think we will need to bump the
>>>> version of the JoinGroup protocol in order to indicate compatibility
>> with
>>>> the new behavior. The coordinator needs to know when it is safe to
>> assume
>>>> the client will handle the error code.
>>>>
>>>> Also, I was wondering if we could reuse the REBALANCE_IN_PROGRESS error
>>>> code. When the client sees this error code, it will take the memberId
>> from
>>>> the response and rejoin. We'd still need the protocol bump since older
>>>> consumers do not have this logic.
>>>>
>>>> Thanks,
>>>> Jason
>>>>
>>>> On Mon, Nov 26, 2018 at 5:47 PM Boyang Chen <bc...@outlook.com>
>> wrote:
>>>>
>>>>> Hey friends,
>>>>>
>>>>>
>>>>> I would like to start a discussion thread for KIP-394 which is trying
>> to
>>>>> mitigate broker cache bursting issue due to anonymous join group
>>>> requests:
>>>>>
>>>>>
>>>>>
>>>>
>> https://nam05.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcwiki.apache.org%2Fconfluence%2Fdisplay%2FKAFKA%2FKIP-394%253A%2BRequire%2Bmember.id%2Bfor%2Binitial%2Bjoin%2Bgroup%2Brequest&amp;data=02%7C01%7C%7C046b858d134f4c1c576108d655277552%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C636790025286277394&amp;sdata=QyBzqnislL%2B9fK1mXaRuJ0xpi9Y2JDvHrM881hzjq3A%3D&amp;reserved=0
>>>>>
>>>>>
>>>>> Thanks!
>>>>>
>>>>> Boyang
>>>>>
>>>>
>>>
>>>
>>> --
>>> Best,
>>> Stanislav
>>>
>>
>>
>
> --
> Best,
> Stanislav
>

Re: [DISCUSS] KIP-394: Require member.id for initial join group request

Posted by "Matthias J. Sax" <ma...@confluent.io>.

Thanks! Makes sense.

I missed that fact, that the `member.id` is added on the second
joinGroup request that contains the `member.id`.

However, it seems there is another race condition for this design:

If two consumers join at the same time, it it possible that the broker
assigns the same `member.id` to both (because none of them have joined
the group yet--ie, second joinGroup request not sent yet--, the
`member.id` is not store broker side yes and broker cannot check for
duplicates when creating a new `member.id`.

The probability might be fairly low thought. However, what Stanislav
proposed, to add the `member.id` directly, and remove it after
`session.timeout.ms` sound like a save option that avoids this issue.

Thoughts?


-Matthias

On 11/28/18 8:15 PM, Boyang Chen wrote:
> Thanks Matthias for the question, and Stanislav for the explanation!
> 
> For the scenario described, we will never let a member join the GroupMetadata map
> if it uses UNKNOWN_MEMBER_ID. So the workflow will be like this:
> 
>   1.  Group is empty. Consumer c1 started. Join with UNKNOWN_MEMBER_ID;
>   2.  Broker rejects while allocating a member.id to c1 in response (c1 protocol version is current);
>   3.  c1 handles the error and rejoins with assigned member.id;
>   4.  Broker stores c1 in its group metadata;
>   5.  Consumer c2 started. Join with UNKNOWN_MEMBER_ID;
>   6.  Broker rejects while allocating a member.id to c2 in response (c2 protocol version is current);
>   7.  c2 fails to get the response/crashes in the middle;
>   8.  After certain time, c2 restarts a join request with UNKNOWN_MEMBER_ID;
> 
> As you could see, c2 will repeat step 6~8 until successfully send back a join group request with allocated id.
> By then broker will include c2 within the broker metadata map.
> 
> Does this sound clear to you?
> 
> Best,
> Boyang
> ________________________________
> From: Stanislav Kozlovski <st...@confluent.io>
> Sent: Wednesday, November 28, 2018 7:39 PM
> To: dev@kafka.apache.org
> Subject: Re: [DISCUSS] KIP-394: Require member.id for initial join group request
> 
> Hey Matthias,
> 
> I think the notion is to have the `session.timeout.ms` to start ticking
> when the broker responds with the member.id. Then, the broker would
> properly expire consumers and not hold too many stale ones.
> This isn't mentioned in the KIP though so it is worth to wait for Boyang to
> confirm
> 
> On Wed, Nov 28, 2018 at 3:10 AM Matthias J. Sax <ma...@confluent.io>
> wrote:
> 
>> Thanks for the KIP Boyang.
>>
>> I guess I am missing something, but I am still learning more details
>> about the rebalance protocol, so maybe you can help me out?
>>
>> Assume a client sends UNKNOWN_MEMBER_ID in its first joinGroup request.
>> The broker generates a `member.id` and sends it back via
>> `MEMBER_ID_REQUIRED` error response. This response might never reach the
>> client or the client fails before it can send the second joinGroup
>> request. Thus, a client would need to start over with a new
>> UNKNOWN_MEMBER_ID in its joinGroup request. Thus, the broker needs to
>> generate a new `member.id` again.
>>
>> So it seems the problem is moved, but not resolved? The motivation of
>> the KIP is:
>>
>>> The edge case is that if initial join group request keeps failing due to
>> connection timeout, or the consumer keeps restarting,
>>
>> From my understanding, this KIP move the issue from the first to the
>> second joinGroup request (or broker joinGroup response).
>>
>> But maybe I am missing something. Can you help me out?
>>
>>
>> -Matthias
>>
>>
>> On 11/27/18 6:00 PM, Boyang Chen wrote:
>>> Thanks Stanislav and Jason for the suggestions!
>>>
>>>
>>>> Thanks for the KIP. Looks good overall. I think we will need to bump the
>>>> version of the JoinGroup protocol in order to indicate compatibility
>> with
>>>> the new behavior. The coordinator needs to know when it is safe to
>> assume
>>>> the client will handle the error code.
>>>>
>>>> Also, I was wondering if we could reuse the REBALANCE_IN_PROGRESS error
>>>> code. When the client sees this error code, it will take the memberId
>> from
>>>> the response and rejoin. We'd still need the protocol bump since older
>>>> consumers do not have this logic.
>>>
>>> I will add the join group protocol version change to the KIP. Meanwhile
>> I feel for
>>> understandability it's better to define a separate error code since
>> REBALANCE_IN_PROGRESS
>>> is not the actual cause of the returned error.
>>>
>>>> One small question I have is now that we have one and a half round-trips
>>>> needed to join in a rebalance (1 full RT addition), is it worth it to
>>>> consider increasing the default value of `
>> group.initial.rebalance.delay.ms`?
>>> I guess we could keep it for now. After KIP-345 and incremental
>> cooperative rebalancing
>>> work we should be safe to deprecate `group.initial.rebalance.delay.ms`.
>> Also one round trip
>>> shouldn't increase the latency too much IMO.
>>>
>>> Best,
>>> Boyang
>>> ________________________________
>>> From: Stanislav Kozlovski <st...@confluent.io>
>>> Sent: Wednesday, November 28, 2018 2:32 AM
>>> To: dev@kafka.apache.org
>>> Subject: Re: [DISCUSS] KIP-394: Require member.id for initial join
>> group request
>>>
>>> Hi Boyang,
>>>
>>> The KIP looks very good.
>>> One small question I have is now that we have one and a half round-trips
>>> needed to join in a rebalance (1 full RT addition), is it worth it to
>>> consider increasing the default value of `
>> group.initial.rebalance.delay.ms`?
>>>
>>> Best,
>>> Stanislav
>>>
>>> On Tue, Nov 27, 2018 at 5:39 PM Jason Gustafson <ja...@confluent.io>
>> wrote:
>>>
>>>> Hi Boyang,
>>>>
>>>> Thanks for the KIP. Looks good overall. I think we will need to bump the
>>>> version of the JoinGroup protocol in order to indicate compatibility
>> with
>>>> the new behavior. The coordinator needs to know when it is safe to
>> assume
>>>> the client will handle the error code.
>>>>
>>>> Also, I was wondering if we could reuse the REBALANCE_IN_PROGRESS error
>>>> code. When the client sees this error code, it will take the memberId
>> from
>>>> the response and rejoin. We'd still need the protocol bump since older
>>>> consumers do not have this logic.
>>>>
>>>> Thanks,
>>>> Jason
>>>>
>>>> On Mon, Nov 26, 2018 at 5:47 PM Boyang Chen <bc...@outlook.com>
>> wrote:
>>>>
>>>>> Hey friends,
>>>>>
>>>>>
>>>>> I would like to start a discussion thread for KIP-394 which is trying
>> to
>>>>> mitigate broker cache bursting issue due to anonymous join group
>>>> requests:
>>>>>
>>>>>
>>>>>
>>>>
>> https://nam05.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcwiki.apache.org%2Fconfluence%2Fdisplay%2FKAFKA%2FKIP-394%253A%2BRequire%2Bmember.id%2Bfor%2Binitial%2Bjoin%2Bgroup%2Brequest&amp;data=02%7C01%7C%7C046b858d134f4c1c576108d655277552%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C636790025286277394&amp;sdata=QyBzqnislL%2B9fK1mXaRuJ0xpi9Y2JDvHrM881hzjq3A%3D&amp;reserved=0
>>>>>
>>>>>
>>>>> Thanks!
>>>>>
>>>>> Boyang
>>>>>
>>>>
>>>
>>>
>>> --
>>> Best,
>>> Stanislav
>>>
>>
>>
> 
> --
> Best,
> Stanislav
>

Re: [DISCUSS] KIP-394: Require member.id for initial join group request

Posted by Boyang Chen <bc...@outlook.com>.

Thanks Matthias for the question, and Stanislav for the explanation!

For the scenario described, we will never let a member join the GroupMetadata map
if it uses UNKNOWN_MEMBER_ID. So the workflow will be like this:

  1.  Group is empty. Consumer c1 started. Join with UNKNOWN_MEMBER_ID;
  2.  Broker rejects while allocating a member.id to c1 in response (c1 protocol version is current);
  3.  c1 handles the error and rejoins with assigned member.id;
  4.  Broker stores c1 in its group metadata;
  5.  Consumer c2 started. Join with UNKNOWN_MEMBER_ID;
  6.  Broker rejects while allocating a member.id to c2 in response (c2 protocol version is current);
  7.  c2 fails to get the response/crashes in the middle;
  8.  After certain time, c2 restarts a join request with UNKNOWN_MEMBER_ID;

As you could see, c2 will repeat step 6~8 until successfully send back a join group request with allocated id.
By then broker will include c2 within the broker metadata map.

Does this sound clear to you?

Best,
Boyang
________________________________
From: Stanislav Kozlovski <st...@confluent.io>
Sent: Wednesday, November 28, 2018 7:39 PM
To: dev@kafka.apache.org
Subject: Re: [DISCUSS] KIP-394: Require member.id for initial join group request

Hey Matthias,

I think the notion is to have the `session.timeout.ms` to start ticking
when the broker responds with the member.id. Then, the broker would
properly expire consumers and not hold too many stale ones.
This isn't mentioned in the KIP though so it is worth to wait for Boyang to
confirm

On Wed, Nov 28, 2018 at 3:10 AM Matthias J. Sax <ma...@confluent.io>
wrote:

> Thanks for the KIP Boyang.
>
> I guess I am missing something, but I am still learning more details
> about the rebalance protocol, so maybe you can help me out?
>
> Assume a client sends UNKNOWN_MEMBER_ID in its first joinGroup request.
> The broker generates a `member.id` and sends it back via
> `MEMBER_ID_REQUIRED` error response. This response might never reach the
> client or the client fails before it can send the second joinGroup
> request. Thus, a client would need to start over with a new
> UNKNOWN_MEMBER_ID in its joinGroup request. Thus, the broker needs to
> generate a new `member.id` again.
>
> So it seems the problem is moved, but not resolved? The motivation of
> the KIP is:
>
> > The edge case is that if initial join group request keeps failing due to
> connection timeout, or the consumer keeps restarting,
>
> From my understanding, this KIP move the issue from the first to the
> second joinGroup request (or broker joinGroup response).
>
> But maybe I am missing something. Can you help me out?
>
>
> -Matthias
>
>
> On 11/27/18 6:00 PM, Boyang Chen wrote:
> > Thanks Stanislav and Jason for the suggestions!
> >
> >
> >> Thanks for the KIP. Looks good overall. I think we will need to bump the
> >> version of the JoinGroup protocol in order to indicate compatibility
> with
> >> the new behavior. The coordinator needs to know when it is safe to
> assume
> >> the client will handle the error code.
> >>
> >> Also, I was wondering if we could reuse the REBALANCE_IN_PROGRESS error
> >> code. When the client sees this error code, it will take the memberId
> from
> >> the response and rejoin. We'd still need the protocol bump since older
> >> consumers do not have this logic.
> >
> > I will add the join group protocol version change to the KIP. Meanwhile
> I feel for
> > understandability it's better to define a separate error code since
> REBALANCE_IN_PROGRESS
> > is not the actual cause of the returned error.
> >
> >> One small question I have is now that we have one and a half round-trips
> >> needed to join in a rebalance (1 full RT addition), is it worth it to
> >> consider increasing the default value of `
> group.initial.rebalance.delay.ms`?
> > I guess we could keep it for now. After KIP-345 and incremental
> cooperative rebalancing
> > work we should be safe to deprecate `group.initial.rebalance.delay.ms`.
> Also one round trip
> > shouldn't increase the latency too much IMO.
> >
> > Best,
> > Boyang
> > ________________________________
> > From: Stanislav Kozlovski <st...@confluent.io>
> > Sent: Wednesday, November 28, 2018 2:32 AM
> > To: dev@kafka.apache.org
> > Subject: Re: [DISCUSS] KIP-394: Require member.id for initial join
> group request
> >
> > Hi Boyang,
> >
> > The KIP looks very good.
> > One small question I have is now that we have one and a half round-trips
> > needed to join in a rebalance (1 full RT addition), is it worth it to
> > consider increasing the default value of `
> group.initial.rebalance.delay.ms`?
> >
> > Best,
> > Stanislav
> >
> > On Tue, Nov 27, 2018 at 5:39 PM Jason Gustafson <ja...@confluent.io>
> wrote:
> >
> >> Hi Boyang,
> >>
> >> Thanks for the KIP. Looks good overall. I think we will need to bump the
> >> version of the JoinGroup protocol in order to indicate compatibility
> with
> >> the new behavior. The coordinator needs to know when it is safe to
> assume
> >> the client will handle the error code.
> >>
> >> Also, I was wondering if we could reuse the REBALANCE_IN_PROGRESS error
> >> code. When the client sees this error code, it will take the memberId
> from
> >> the response and rejoin. We'd still need the protocol bump since older
> >> consumers do not have this logic.
> >>
> >> Thanks,
> >> Jason
> >>
> >> On Mon, Nov 26, 2018 at 5:47 PM Boyang Chen <bc...@outlook.com>
> wrote:
> >>
> >>> Hey friends,
> >>>
> >>>
> >>> I would like to start a discussion thread for KIP-394 which is trying
> to
> >>> mitigate broker cache bursting issue due to anonymous join group
> >> requests:
> >>>
> >>>
> >>>
> >>
> https://nam05.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcwiki.apache.org%2Fconfluence%2Fdisplay%2FKAFKA%2FKIP-394%253A%2BRequire%2Bmember.id%2Bfor%2Binitial%2Bjoin%2Bgroup%2Brequest&amp;data=02%7C01%7C%7C046b858d134f4c1c576108d655277552%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C636790025286277394&amp;sdata=QyBzqnislL%2B9fK1mXaRuJ0xpi9Y2JDvHrM881hzjq3A%3D&amp;reserved=0
> >>>
> >>>
> >>> Thanks!
> >>>
> >>> Boyang
> >>>
> >>
> >
> >
> > --
> > Best,
> > Stanislav
> >
>
>

--
Best,
Stanislav

Re: [DISCUSS] KIP-394: Require member.id for initial join group request

Posted by Stanislav Kozlovski <st...@confluent.io>.

Hey Matthias,

I think the notion is to have the `session.timeout.ms` to start ticking
when the broker responds with the member.id. Then, the broker would
properly expire consumers and not hold too many stale ones.
This isn't mentioned in the KIP though so it is worth to wait for Boyang to
confirm

On Wed, Nov 28, 2018 at 3:10 AM Matthias J. Sax <ma...@confluent.io>
wrote:

> Thanks for the KIP Boyang.
>
> I guess I am missing something, but I am still learning more details
> about the rebalance protocol, so maybe you can help me out?
>
> Assume a client sends UNKNOWN_MEMBER_ID in its first joinGroup request.
> The broker generates a `member.id` and sends it back via
> `MEMBER_ID_REQUIRED` error response. This response might never reach the
> client or the client fails before it can send the second joinGroup
> request. Thus, a client would need to start over with a new
> UNKNOWN_MEMBER_ID in its joinGroup request. Thus, the broker needs to
> generate a new `member.id` again.
>
> So it seems the problem is moved, but not resolved? The motivation of
> the KIP is:
>
> > The edge case is that if initial join group request keeps failing due to
> connection timeout, or the consumer keeps restarting,
>
> From my understanding, this KIP move the issue from the first to the
> second joinGroup request (or broker joinGroup response).
>
> But maybe I am missing something. Can you help me out?
>
>
> -Matthias
>
>
> On 11/27/18 6:00 PM, Boyang Chen wrote:
> > Thanks Stanislav and Jason for the suggestions!
> >
> >
> >> Thanks for the KIP. Looks good overall. I think we will need to bump the
> >> version of the JoinGroup protocol in order to indicate compatibility
> with
> >> the new behavior. The coordinator needs to know when it is safe to
> assume
> >> the client will handle the error code.
> >>
> >> Also, I was wondering if we could reuse the REBALANCE_IN_PROGRESS error
> >> code. When the client sees this error code, it will take the memberId
> from
> >> the response and rejoin. We'd still need the protocol bump since older
> >> consumers do not have this logic.
> >
> > I will add the join group protocol version change to the KIP. Meanwhile
> I feel for
> > understandability it's better to define a separate error code since
> REBALANCE_IN_PROGRESS
> > is not the actual cause of the returned error.
> >
> >> One small question I have is now that we have one and a half round-trips
> >> needed to join in a rebalance (1 full RT addition), is it worth it to
> >> consider increasing the default value of `
> group.initial.rebalance.delay.ms`?
> > I guess we could keep it for now. After KIP-345 and incremental
> cooperative rebalancing
> > work we should be safe to deprecate `group.initial.rebalance.delay.ms`.
> Also one round trip
> > shouldn't increase the latency too much IMO.
> >
> > Best,
> > Boyang
> > ________________________________
> > From: Stanislav Kozlovski <st...@confluent.io>
> > Sent: Wednesday, November 28, 2018 2:32 AM
> > To: dev@kafka.apache.org
> > Subject: Re: [DISCUSS] KIP-394: Require member.id for initial join
> group request
> >
> > Hi Boyang,
> >
> > The KIP looks very good.
> > One small question I have is now that we have one and a half round-trips
> > needed to join in a rebalance (1 full RT addition), is it worth it to
> > consider increasing the default value of `
> group.initial.rebalance.delay.ms`?
> >
> > Best,
> > Stanislav
> >
> > On Tue, Nov 27, 2018 at 5:39 PM Jason Gustafson <ja...@confluent.io>
> wrote:
> >
> >> Hi Boyang,
> >>
> >> Thanks for the KIP. Looks good overall. I think we will need to bump the
> >> version of the JoinGroup protocol in order to indicate compatibility
> with
> >> the new behavior. The coordinator needs to know when it is safe to
> assume
> >> the client will handle the error code.
> >>
> >> Also, I was wondering if we could reuse the REBALANCE_IN_PROGRESS error
> >> code. When the client sees this error code, it will take the memberId
> from
> >> the response and rejoin. We'd still need the protocol bump since older
> >> consumers do not have this logic.
> >>
> >> Thanks,
> >> Jason
> >>
> >> On Mon, Nov 26, 2018 at 5:47 PM Boyang Chen <bc...@outlook.com>
> wrote:
> >>
> >>> Hey friends,
> >>>
> >>>
> >>> I would like to start a discussion thread for KIP-394 which is trying
> to
> >>> mitigate broker cache bursting issue due to anonymous join group
> >> requests:
> >>>
> >>>
> >>>
> >>
> https://nam03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcwiki.apache.org%2Fconfluence%2Fdisplay%2FKAFKA%2FKIP-394%253A%2BRequire%2Bmember.id%2Bfor%2Binitial%2Bjoin%2Bgroup%2Brequest&amp;data=02%7C01%7C%7C8c2c54e07967404f0fa808d65496c9c7%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C636789403931186848&amp;sdata=oRbPKzwyDx6SodAaVb3Vv%2FXpJoD09E3%2BdTc0p1qKDEo%3D&amp;reserved=0
> >>>
> >>>
> >>> Thanks!
> >>>
> >>> Boyang
> >>>
> >>
> >
> >
> > --
> > Best,
> > Stanislav
> >
>
>

-- 
Best,
Stanislav

Re: [DISCUSS] KIP-394: Require member.id for initial join group request

Posted by "Matthias J. Sax" <ma...@confluent.io>.

Thanks for the KIP Boyang.

I guess I am missing something, but I am still learning more details
about the rebalance protocol, so maybe you can help me out?

Assume a client sends UNKNOWN_MEMBER_ID in its first joinGroup request.
The broker generates a `member.id` and sends it back via
`MEMBER_ID_REQUIRED` error response. This response might never reach the
client or the client fails before it can send the second joinGroup
request. Thus, a client would need to start over with a new
UNKNOWN_MEMBER_ID in its joinGroup request. Thus, the broker needs to
generate a new `member.id` again.

So it seems the problem is moved, but not resolved? The motivation of
the KIP is:

> The edge case is that if initial join group request keeps failing due to connection timeout, or the consumer keeps restarting,

From my understanding, this KIP move the issue from the first to the
second joinGroup request (or broker joinGroup response).

But maybe I am missing something. Can you help me out?


-Matthias


On 11/27/18 6:00 PM, Boyang Chen wrote:
> Thanks Stanislav and Jason for the suggestions!
> 
> 
>> Thanks for the KIP. Looks good overall. I think we will need to bump the
>> version of the JoinGroup protocol in order to indicate compatibility with
>> the new behavior. The coordinator needs to know when it is safe to assume
>> the client will handle the error code.
>>
>> Also, I was wondering if we could reuse the REBALANCE_IN_PROGRESS error
>> code. When the client sees this error code, it will take the memberId from
>> the response and rejoin. We'd still need the protocol bump since older
>> consumers do not have this logic.
> 
> I will add the join group protocol version change to the KIP. Meanwhile I feel for
> understandability it's better to define a separate error code since REBALANCE_IN_PROGRESS
> is not the actual cause of the returned error.
> 
>> One small question I have is now that we have one and a half round-trips
>> needed to join in a rebalance (1 full RT addition), is it worth it to
>> consider increasing the default value of `group.initial.rebalance.delay.ms`?
> I guess we could keep it for now. After KIP-345 and incremental cooperative rebalancing
> work we should be safe to deprecate `group.initial.rebalance.delay.ms`. Also one round trip
> shouldn't increase the latency too much IMO.
> 
> Best,
> Boyang
> ________________________________
> From: Stanislav Kozlovski <st...@confluent.io>
> Sent: Wednesday, November 28, 2018 2:32 AM
> To: dev@kafka.apache.org
> Subject: Re: [DISCUSS] KIP-394: Require member.id for initial join group request
> 
> Hi Boyang,
> 
> The KIP looks very good.
> One small question I have is now that we have one and a half round-trips
> needed to join in a rebalance (1 full RT addition), is it worth it to
> consider increasing the default value of `group.initial.rebalance.delay.ms`?
> 
> Best,
> Stanislav
> 
> On Tue, Nov 27, 2018 at 5:39 PM Jason Gustafson <ja...@confluent.io> wrote:
> 
>> Hi Boyang,
>>
>> Thanks for the KIP. Looks good overall. I think we will need to bump the
>> version of the JoinGroup protocol in order to indicate compatibility with
>> the new behavior. The coordinator needs to know when it is safe to assume
>> the client will handle the error code.
>>
>> Also, I was wondering if we could reuse the REBALANCE_IN_PROGRESS error
>> code. When the client sees this error code, it will take the memberId from
>> the response and rejoin. We'd still need the protocol bump since older
>> consumers do not have this logic.
>>
>> Thanks,
>> Jason
>>
>> On Mon, Nov 26, 2018 at 5:47 PM Boyang Chen <bc...@outlook.com> wrote:
>>
>>> Hey friends,
>>>
>>>
>>> I would like to start a discussion thread for KIP-394 which is trying to
>>> mitigate broker cache bursting issue due to anonymous join group
>> requests:
>>>
>>>
>>>
>> https://nam03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcwiki.apache.org%2Fconfluence%2Fdisplay%2FKAFKA%2FKIP-394%253A%2BRequire%2Bmember.id%2Bfor%2Binitial%2Bjoin%2Bgroup%2Brequest&amp;data=02%7C01%7C%7C8c2c54e07967404f0fa808d65496c9c7%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C636789403931186848&amp;sdata=oRbPKzwyDx6SodAaVb3Vv%2FXpJoD09E3%2BdTc0p1qKDEo%3D&amp;reserved=0
>>>
>>>
>>> Thanks!
>>>
>>> Boyang
>>>
>>
> 
> 
> --
> Best,
> Stanislav
>

Re: [DISCUSS] KIP-394: Require member.id for initial join group request

Posted by Boyang Chen <bc...@outlook.com>.

Thanks Stanislav and Jason for the suggestions!


> Thanks for the KIP. Looks good overall. I think we will need to bump the
> version of the JoinGroup protocol in order to indicate compatibility with
> the new behavior. The coordinator needs to know when it is safe to assume
> the client will handle the error code.
>
> Also, I was wondering if we could reuse the REBALANCE_IN_PROGRESS error
> code. When the client sees this error code, it will take the memberId from
> the response and rejoin. We'd still need the protocol bump since older
> consumers do not have this logic.

I will add the join group protocol version change to the KIP. Meanwhile I feel for
understandability it's better to define a separate error code since REBALANCE_IN_PROGRESS
is not the actual cause of the returned error.

> One small question I have is now that we have one and a half round-trips
> needed to join in a rebalance (1 full RT addition), is it worth it to
> consider increasing the default value of `group.initial.rebalance.delay.ms`?
I guess we could keep it for now. After KIP-345 and incremental cooperative rebalancing
work we should be safe to deprecate `group.initial.rebalance.delay.ms`. Also one round trip
shouldn't increase the latency too much IMO.

Best,
Boyang
________________________________
From: Stanislav Kozlovski <st...@confluent.io>
Sent: Wednesday, November 28, 2018 2:32 AM
To: dev@kafka.apache.org
Subject: Re: [DISCUSS] KIP-394: Require member.id for initial join group request

Hi Boyang,

The KIP looks very good.
One small question I have is now that we have one and a half round-trips
needed to join in a rebalance (1 full RT addition), is it worth it to
consider increasing the default value of `group.initial.rebalance.delay.ms`?

Best,
Stanislav

On Tue, Nov 27, 2018 at 5:39 PM Jason Gustafson <ja...@confluent.io> wrote:

> Hi Boyang,
>
> Thanks for the KIP. Looks good overall. I think we will need to bump the
> version of the JoinGroup protocol in order to indicate compatibility with
> the new behavior. The coordinator needs to know when it is safe to assume
> the client will handle the error code.
>
> Also, I was wondering if we could reuse the REBALANCE_IN_PROGRESS error
> code. When the client sees this error code, it will take the memberId from
> the response and rejoin. We'd still need the protocol bump since older
> consumers do not have this logic.
>
> Thanks,
> Jason
>
> On Mon, Nov 26, 2018 at 5:47 PM Boyang Chen <bc...@outlook.com> wrote:
>
> > Hey friends,
> >
> >
> > I would like to start a discussion thread for KIP-394 which is trying to
> > mitigate broker cache bursting issue due to anonymous join group
> requests:
> >
> >
> >
> https://nam03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcwiki.apache.org%2Fconfluence%2Fdisplay%2FKAFKA%2FKIP-394%253A%2BRequire%2Bmember.id%2Bfor%2Binitial%2Bjoin%2Bgroup%2Brequest&amp;data=02%7C01%7C%7C8c2c54e07967404f0fa808d65496c9c7%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C636789403931186848&amp;sdata=oRbPKzwyDx6SodAaVb3Vv%2FXpJoD09E3%2BdTc0p1qKDEo%3D&amp;reserved=0
> >
> >
> > Thanks!
> >
> > Boyang
> >
>


--
Best,
Stanislav

Re: [DISCUSS] KIP-394: Require member.id for initial join group request

Posted by Stanislav Kozlovski <st...@confluent.io>.

Hi Boyang,

The KIP looks very good.
One small question I have is now that we have one and a half round-trips
needed to join in a rebalance (1 full RT addition), is it worth it to
consider increasing the default value of `group.initial.rebalance.delay.ms`?

Best,
Stanislav

On Tue, Nov 27, 2018 at 5:39 PM Jason Gustafson <ja...@confluent.io> wrote:

> Hi Boyang,
>
> Thanks for the KIP. Looks good overall. I think we will need to bump the
> version of the JoinGroup protocol in order to indicate compatibility with
> the new behavior. The coordinator needs to know when it is safe to assume
> the client will handle the error code.
>
> Also, I was wondering if we could reuse the REBALANCE_IN_PROGRESS error
> code. When the client sees this error code, it will take the memberId from
> the response and rejoin. We'd still need the protocol bump since older
> consumers do not have this logic.
>
> Thanks,
> Jason
>
> On Mon, Nov 26, 2018 at 5:47 PM Boyang Chen <bc...@outlook.com> wrote:
>
> > Hey friends,
> >
> >
> > I would like to start a discussion thread for KIP-394 which is trying to
> > mitigate broker cache bursting issue due to anonymous join group
> requests:
> >
> >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-394%3A+Require+member.id+for+initial+join+group+request
> >
> >
> > Thanks!
> >
> > Boyang
> >
>


-- 
Best,
Stanislav

Re: [DISCUSS] KIP-394: Require member.id for initial join group request

Posted by Jason Gustafson <ja...@confluent.io>.

Hi Boyang,

Thanks for the KIP. Looks good overall. I think we will need to bump the
version of the JoinGroup protocol in order to indicate compatibility with
the new behavior. The coordinator needs to know when it is safe to assume
the client will handle the error code.

Also, I was wondering if we could reuse the REBALANCE_IN_PROGRESS error
code. When the client sees this error code, it will take the memberId from
the response and rejoin. We'd still need the protocol bump since older
consumers do not have this logic.

Thanks,
Jason

On Mon, Nov 26, 2018 at 5:47 PM Boyang Chen <bc...@outlook.com> wrote:

> Hey friends,
>
>
> I would like to start a discussion thread for KIP-394 which is trying to
> mitigate broker cache bursting issue due to anonymous join group requests:
>
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-394%3A+Require+member.id+for+initial+join+group+request
>
>
> Thanks!
>
> Boyang
>

Re: [DISCUSS] KIP-394: Require member.id for initial join group request

Posted by Boyang Chen <bc...@outlook.com>.

Hey Ismael,

thanks for the suggestion here! I think the reason is because creating individual id on client is purely random (or at least I couldn't think of how to make sure it is "known to be unique"). Id collision will not be handled gracefully as we could perceive.

However let client generate unique id is a very good idea, which we proposed in KIP-345<https://cwiki.apache.org/confluence/display/KAFKA/KIP-345%3A+Introduce+static+membership+protocol+to+reduce+consumer+rebalances> to let end user supply instance identities. This could save us one round-trip time to solve member identity problem. Since enabling new feature requires user operation that is not guaranteed to happen, KIP-394 is a patch to alleviate the similar issue for existing consumer use cases.

Best,
Boyang

________________________________
From: Ismael Juma <is...@gmail.com>
Sent: Monday, January 21, 2019 6:07 AM
To: dev
Subject: Re: [DISCUSS] KIP-394: Require member.id for initial join group request

Hi,

I'm late to the discussion, so apologies. One question, did we consider
having the client generate a member id in the first join group? This could
be random or known to be unique and would avoid a second join group request
in the common case. If we considered and rejected this option, it would be
good to include why in the "Rejected Alternatives" section.

Ismael

On Mon, Nov 26, 2018, 5:48 PM Boyang Chen <bchen11@outlook.com wrote:

> Hey friends,
>
>
> I would like to start a discussion thread for KIP-394 which is trying to
> mitigate broker cache bursting issue due to anonymous join group requests:
>
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-394%3A+Require+member.id+for+initial+join+group+request
>
>
> Thanks!
>
> Boyang
>

Re: [DISCUSS] KIP-394: Require member.id for initial join group request

Posted by Ismael Juma <is...@gmail.com>.

Hi,

I'm late to the discussion, so apologies. One question, did we consider
having the client generate a member id in the first join group? This could
be random or known to be unique and would avoid a second join group request
in the common case. If we considered and rejected this option, it would be
good to include why in the "Rejected Alternatives" section.

Ismael

On Mon, Nov 26, 2018, 5:48 PM Boyang Chen <bchen11@outlook.com wrote:

> Hey friends,
>
>
> I would like to start a discussion thread for KIP-394 which is trying to
> mitigate broker cache bursting issue due to anonymous join group requests:
>
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-394%3A+Require+member.id+for+initial+join+group+request
>
>
> Thanks!
>
> Boyang
>