You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@kafka.apache.org by Sophie Blee-Goldman <so...@confluent.io.INVALID> on 2021/06/02 23:07:58 UTC

Re: [DISCUSS] KIP-726: Make the CooperativeStickyAssignor as the default assignor

Hey Luke,

It's been a while since the last update on this, which is mostly my fault
for picking up
other things in the meantime. I'm planning to get back to work
on KAFKA-12477 next
week but there are minimal changes to the current implementation given the
proposal
I put forth earlier in this KIP discussion, so I think we're good to go.

Although this KIP no longer requires a major release since it should be
fully compatible, I
still hope we can get it in to 3.0 since cooperative rebalancing is a major
improvement to
the life of a consumer group (and its operator). Can we make sure the KIP
reflects the latest
and then kick off a vote by next Monday at the latest so we can make KIP
freeze?

Thanks!
Sophie

On Fri, Apr 16, 2021 at 2:33 PM Guozhang Wang <wa...@gmail.com> wrote:

> 1) From user's perspective, it is always possible that a commit within
> onPartitionsRevoked throw in practice (e.g. if the member missed the
> previous rebalance where its assigned partitions are already re-assigned)
> -- and the onPartitionsLost was introduced for that exact reason, i.e. it
> is primarily for optimizations, but not for correctness guarantees -- on
> the other hand, it would be surprising to users to see the commit returns
> and then later found it not going through. Given that, I'd suggest we still
> throw the exception right away. Regarding the flag itself though, I agree
> that keeping it set until the next succeeded join group makes sense to be
> safer.
>
> 2) That's crystal, thank you for the clarification.
>
> On Wed, Apr 14, 2021 at 6:46 PM Sophie Blee-Goldman
> <so...@confluent.io.invalid> wrote:
>
> > 1) Once the short-circuit is triggered, the member will downgrade to the
> > EAGER protocol, but
> > won't necessarily try to rejoin the group right away.
> >
> > In the "happy path", the user has implemented #onPartitionsLost correctly
> > and will not attempt
> > to commit partitions that are lost. And since these partitions have
> indeed
> > been revoked, the user
> > application should not attempt to commit those partitions after this
> point.
> > In this case, there's no
> > reason for the consumer to immediately rejoin the group. Since a
> > non-cooperative assignor was
> > selected, we know that all partitions have been assigned. This member can
> > continue on as usual,
> > processing the remaining un-revoked partitions and will follow the EAGER
> > protocol in the next
> > rebalance. There's no user-facing impact or handling required; all that
> > happens is that the work
> > since the last commit on those revoked partitions has been lost.
> >
> > In the less-happy path, the user has implemented #onPartitionsLost
> > incorrectly or not implemented
> > it at all, falling back on the default which invokes #onPartitionsRevoked
> > which in turn will attempt to
> > commit those partitions during the rebalance callback. In this case we
> rely
> > on the flag to prevent
> > this commit request from being sent to the broker.
> >
> > Originally I was thinking we should throw a CommitFailedException up
> > through the #onPartitionsLost
> > callback, and eventually up through poll(), then rejoin the group. But
> now
> > I'm wondering if this is really
> > necessary -- the important point in all cases is just to prevent the
> > commit, but there's no reason the
> > consumer should not be allowed to continue processing its other
> partitions,
> > and it hasn't dropped out
> > of the group. What do you think about this slight amendment to my
> original
> > proposal: if a user does end up
> > calling commit for whatever reason when invoking #onPartitionsLost, we'll
> > just swallow the resulting
> > CommitFailedException. So the user application wouldn't see anything, and
> > the only impact would be
> > that these partitions were not able to commit those last set of offsets
> on
> > the revoked partitions.
> >
> > WDYT? My only concern there is that the user might have some implicit
> > assumption that unless a
> > CommitFailedException was thrown, the offsets of revoked partitions were
> > successfully committed
> > and they may have some downstream logic that should trigger only in this
> > case. If that's a concern,
> > then I would keep the original proposal which says a
> CommitFailedException
> > will be thrown up through
> > poll(), and leave it up to the user to decide if they want to trigger a
> new
> > rebalance/rejoin the group or not.
> >
> > Regarding the flag which prevents committing the revoked partitions, this
> > will need to continue
> > blocking such commit attempts until the next time the consumer rejoins
> the
> > group, ie until the end
> > of the next successful rebalance. Technically this shouldn't matter,
> since
> > the consumer no longer
> > owns those partitions this member shouldn't attempt to commit them
> anyways.
> > Usually we can
> > rely on the broker rejecting commit attempts on partitions that are not
> > owned, in which case the
> > consumer will throw a CommitFailedException. This is similar, except that
> > we can't rely on the
> > broker having been informed of the change in ownership before this
> consumer
> > might attempt to
> > commit. So to avoid this race condition, we'll keep the "blockCommit"
> flag
> > until the next rebalance
> > when we can be certain that the broker is clear on this
> > partition's ownership.
> >
> > 2) I guess maybe the wording here is unclear -- what I meant is that all
> > 3.0 applications will *eventually*
> > enable cooperative rebalancing in the stable state. This doesn't mean
> that
> > it will select COOPERATIVE
> > when it first starts up, and in order for this dynamic protocol upgrade
> to
> > be safe we do indeed need to
> > start off with EAGER and only upgrade once the selected assignor
> indicates
> > that it's safe to do so.
> > (This only applies if multiple assignors are used, if the assignors are
> > "cooperative-sticky" only then it
> > will just start out and forever remain on COOPERATIVE, like in Streams)
> >
> > Since it's just the first rebalance, the choice of COOPERATIVE vs EAGER
> > actually doesn't matter at
> > all since the consumer won't own any partitions until it's joined the
> > group. So we may as well continue
> > the initial protocol selection strategy of "highest commonly supported
> > protocol", but the point is that
> > 3.0 applications will upgrade to COOPERATIVE as soon as they have any
> > partitions. If you can think
> > of a better way to phrase "New applications on 3.0 will enable
> cooperative
> > rebalancing by default" then
> > please let me know.
> >
> >
> > Thanks for the response -- hope this makes sense so far, but I'm happy to
> > elaborate any aspects of the
> > proposal which aren't clear. I'll also update the ticket description
> > for KAFKA-12477 with the latest.
> >
> >
> > On Wed, Apr 14, 2021 at 12:03 PM Guozhang Wang <wa...@gmail.com>
> wrote:
> >
> > > Hello Sophie,
> > >
> > > Thanks for the detailed explanation, a few clarifying questions:
> > >
> > > 1) when the short-circuit is triggered, what would happen next? Would
> the
> > > consumers switch back to EAGER, and try to re-join the group, and then
> > upon
> > > succeeding the next rebalance reset the flag to allow committing? Or
> > would
> > > we just fail the consumer immediately.
> > >
> > > 2) at the overview you mentioned "New applications on 3.0 will enable
> > > cooperative rebalancing by default", but in the detailed description as
> > > "With ["cooperative-sticky", "range”], the initial protocol will be
> EAGER
> > > when the member first joins the group." which seems contradictory? If
> we
> > > want to have cooperative behavior be the default, then with the
> > > default ["cooperative-sticky", "range”] the member would start with
> > > COOPERATIVE protocol right away.
> > >
> > >
> > > Guozhang
> > >
> > >
> > >
> > > On Mon, Apr 12, 2021 at 5:19 AM Chris Egerton
> > <chrise@confluent.io.invalid
> > > >
> > > wrote:
> > >
> > > > Whoops, small correction--meant to say
> > > > ConsumerRebalanceListener::onPartitionsLost, not
> > > Consumer::onPartitionsLost
> > > >
> > > > On Mon, Apr 12, 2021 at 8:17 AM Chris Egerton <ch...@confluent.io>
> > > wrote:
> > > >
> > > > > Hi Sophie,
> > > > >
> > > > > This sounds fantastic. I've made a note on KAFKA-12487 about being
> > sure
> > > > to
> > > > > implement Consumer::onPartitionsLost to avoid unnecessary task
> > failures
> > > > on
> > > > > consumer protocol downgrade, but besides that, I don't think things
> > > could
> > > > > get any smoother for Connect users or developers. The automatic
> > > protocol
> > > > > upgrade/downgrade behavior appears safe, intuitive, and pain-free.
> > > > >
> > > > > Really excited for this development and hoping we can see it come
> to
> > > > > fruition in time for the 3.0 release!
> > > > >
> > > > > Cheers,
> > > > >
> > > > > Chris
> > > > >
> > > > > On Fri, Apr 9, 2021 at 2:43 PM Sophie Blee-Goldman
> > > > > <so...@confluent.io.invalid> wrote:
> > > > >
> > > > >> 1) Yes, all of the above will be part of KAFKA-12477 (not KIP-726)
> > > > >>
> > > > >> 2) No, KAFKA-12638 would be nice to have but I don't think it's
> > > > >> appropriate
> > > > >> to remove
> > > > >> the default implementation of #onPartitionsLost in 3.0 since we
> > never
> > > > gave
> > > > >> any indication
> > > > >> yet that we intend to remove it
> > > > >>
> > > > >> 3) Yes, this would be similar to when a Consumer drops out of the
> > > group.
> > > > >> It's always been
> > > > >> possible for a member to miss a rebalance and have its partition
> be
> > > > >> reassigned to another
> > > > >> member, during which time both members would claim to own said
> > > > partition.
> > > > >> But this is safe
> > > > >> because the member who dropped out is blocked from committing
> > offsets
> > > on
> > > > >> that partition.
> > > > >>
> > > > >> On Fri, Apr 9, 2021 at 2:46 AM Luke Chen <sh...@gmail.com>
> wrote:
> > > > >>
> > > > >> > Hi Sophie,
> > > > >> > That sounds great to take care of each case I can think of.
> > > > >> > Questions:
> > > > >> > 1. Do you mean the short-Circuit will also be implemented in
> > > > >> KAFKA-12477?
> > > > >> > 2. I don't think KAFKA-12638 is the blocker of this KIP-726, Am
> I
> > > > right?
> > > > >> > 3. So, does that mean we still have possibility to have multiple
> > > > >> consumer
> > > > >> > owned the same topic partition? And in this situation, we avoid
> > them
> > > > >> doing
> > > > >> > committing, and waiting for next rebalance (should be soon). Is
> my
> > > > >> > understanding correct?
> > > > >> >
> > > > >> > Thank you very much for finding this great solution.
> > > > >> >
> > > > >> > Luke
> > > > >> >
> > > > >> > On Fri, Apr 9, 2021 at 11:37 AM Sophie Blee-Goldman
> > > > >> > <so...@confluent.io.invalid> wrote:
> > > > >> >
> > > > >> > > Alright, here's the detailed proposal for KAFKA-12477. This
> > > assumes
> > > > we
> > > > >> > will
> > > > >> > > change the default assignor to ["cooperative-sticky", "range"]
> > in
> > > > >> > KIP-726.
> > > > >> > > It also acknowledges that users may attempt any kind of
> upgrade
> > > > >> without
> > > > >> > > reading the docs, and so we need to put in safeguards against
> > data
> > > > >> > > corruption rather than assume everyone will follow the safe
> > > upgrade
> > > > >> path.
> > > > >> > >
> > > > >> > > With this proposal,
> > > > >> > > 1) New applications on 3.0 will enable cooperative rebalancing
> > by
> > > > >> default
> > > > >> > > 2) Existing applications which don’t set an assignor can
> safely
> > > > >> upgrade
> > > > >> > to
> > > > >> > > 3.0 using a single rolling bounce with no extra steps, and
> will
> > > > >> > > automatically transition to cooperative rebalancing
> > > > >> > > 3) Existing applications which do set an assignor that uses
> > EAGER
> > > > can
> > > > >> > > likewise upgrade their applications to COOPERATIVE with a
> single
> > > > >> rolling
> > > > >> > > bounce
> > > > >> > > 4) Once on 3.0, applications can safely go back and forth
> > between
> > > > >> EAGER
> > > > >> > and
> > > > >> > > COOPERATIVE
> > > > >> > > 5) Applications can safely downgrade away from 3.0
> > > > >> > >
> > > > >> > > The high-level idea for dynamic protocol upgrades is that the
> > > group
> > > > >> will
> > > > >> > > leverage the assignor selected by the group coordinator to
> > > determine
> > > > >> when
> > > > >> > > it’s safe to upgrade to COOPERATIVE, and trigger a fail-safe
> to
> > > > >> protect
> > > > >> > the
> > > > >> > > group in case of rare events or user misconfiguration. The
> group
> > > > >> > > coordinator selects the most preferred assignor that’s
> supported
> > > by
> > > > >> all
> > > > >> > > members of the group, so we know that all members will support
> > > > >> > COOPERATIVE
> > > > >> > > once we receive the “cooperative-sticky” assignor after a
> > > rebalance.
> > > > >> At
> > > > >> > > this point, each member can upgrade their own protocol to
> > > > COOPERATIVE.
> > > > >> > > However, there may be situations in which an EAGER member may
> > join
> > > > the
> > > > >> > > group even after upgrading to COOPERATIVE. For example,
> during a
> > > > >> rolling
> > > > >> > > upgrade if the last remaining member on the old bytecode
> misses
> > a
> > > > >> > > rebalance, the other members will be allowed to upgrade to
> > > > >> COOPERATIVE.
> > > > >> > If
> > > > >> > > the old member rejoins and is chosen to be the group leader
> > before
> > > > >> it’s
> > > > >> > > upgraded to 3.0, it won’t be aware that the other members of
> the
> > > > group
> > > > >> > have
> > > > >> > > not yet revoked their partitions when computing the
> assignment.
> > > > >> > >
> > > > >> > > Short Circuit:
> > > > >> > > The risk of mixing the cooperative and eager rebalancing
> > protocols
> > > > is
> > > > >> > that
> > > > >> > > a partition may be assigned to one member while it has yet to
> be
> > > > >> revoked
> > > > >> > > from its previous owner. The danger is that the new owner may
> > > begin
> > > > >> > > processing and committing offsets for this partition while the
> > > > >> previous
> > > > >> > > owner is also committing offsets in its #onPartitionsRevoked
> > > > callback,
> > > > >> > > which is invoked at the end of the rebalance in the
> cooperative
> > > > >> protocol.
> > > > >> > > This can result in these consumers overwriting each other’s
> > > offsets
> > > > >> and
> > > > >> > > getting a corrupted view of the partition. Note that it’s not
> > > > >> possible to
> > > > >> > > commit during a rebalance, so we can protect against offset
> > > > >> corruption by
> > > > >> > > blocking further commits after we detect that the group leader
> > may
> > > > not
> > > > >> > > understand COOPERATIVE, but before we invoke
> > #onPartitionsRevoked.
> > > > >> This
> > > > >> > is
> > > > >> > > the “short-circuit” — if we detect that the group is in an
> > unsafe
> > > > >> state,
> > > > >> > we
> > > > >> > > invoke #onPartitionsLost instead of #onPartitionsRevoked and
> > > > >> explicitly
> > > > >> > > prevent offsets from being committed on those revoked
> > partitions.
> > > > >> > >
> > > > >> > > Consumer procedure:
> > > > >> > > Upon startup, the consumer will initially select the highest
> > > > >> > > commonly-supported protocol across its configured assignors.
> > With
> > > > >> > > ["cooperative-sticky", "range”], the initial protocol will be
> > > EAGER
> > > > >> when
> > > > >> > > the member first joins the group. Following a rebalance, each
> > > member
> > > > >> will
> > > > >> > > check the selected assignor. If the chosen assignor supports
> > > > >> COOPERATIVE,
> > > > >> > > the member can upgrade their used protocol to COOPERATIVE and
> no
> > > > >> further
> > > > >> > > action is required. If the member is already on COOPERATIVE
> but
> > > the
> > > > >> > > selected assignor does NOT support it, then we need to trigger
> > the
> > > > >> > > short-circuit. In this case we will invoke #onPartitionsLost
> > > instead
> > > > >> of
> > > > >> > > #onPartitionsRevoked, and set a flag to block any attempts at
> > > > >> committing
> > > > >> > > those partitions which have been revoked. If a commit is
> > > attempted,
> > > > as
> > > > >> > may
> > > > >> > > be the case if the user does not implement #onPartitionsLost
> > (see
> > > > >> > > KAFKA-12638), we will throw a CommitFailedException which will
> > be
> > > > >> bubbled
> > > > >> > > up through poll() after completing the rebalance. The member
> > will
> > > > then
> > > > >> > > downgrade its protocol to EAGER for the next rebalance.
> > > > >> > >
> > > > >> > > Let me know what you think,
> > > > >> > > Sophie
> > > > >> > >
> > > > >> > > On Fri, Apr 2, 2021 at 7:08 PM Luke Chen <sh...@gmail.com>
> > > wrote:
> > > > >> > >
> > > > >> > > > Hi Sophie,
> > > > >> > > > Making the default to "cooperative-sticky, range" is a smart
> > > idea,
> > > > >> to
> > > > >> > > > ensure we can at least fall back to rangeAssignor if
> consumers
> > > are
> > > > >> not
> > > > >> > > > following our recommended upgrade path. I updated the KIP
> > > > >> accordingly.
> > > > >> > > >
> > > > >> > > > Hi Chris,
> > > > >> > > > No problem, I updated the KIP to include the change in
> > Connect.
> > > > >> > > >
> > > > >> > > > Thank you very much.
> > > > >> > > >
> > > > >> > > > Luke
> > > > >> > > >
> > > > >> > > > On Thu, Apr 1, 2021 at 3:24 AM Chris Egerton
> > > > >> > <chrise@confluent.io.invalid
> > > > >> > > >
> > > > >> > > > wrote:
> > > > >> > > >
> > > > >> > > > > Hi all,
> > > > >> > > > >
> > > > >> > > > > @Sophie - I like the sound of the dual-protocol default.
> The
> > > > >> smooth
> > > > >> > > > upgrade
> > > > >> > > > > path it permits sounds fantastic!
> > > > >> > > > >
> > > > >> > > > > @Luke - Do you think we can also include Connect in this
> > KIP?
> > > > >> Right
> > > > >> > now
> > > > >> > > > we
> > > > >> > > > > don't set any custom partition assignment strategies for
> the
> > > > >> consumer
> > > > >> > > > > groups we bring up for sink tasks, and if we continue to
> > just
> > > > use
> > > > >> the
> > > > >> > > > > default, the assignment strategy for those consumer groups
> > > would
> > > > >> > change
> > > > >> > > > on
> > > > >> > > > > Connect clusters once people upgrade to 3.0. I think this
> is
> > > > fine
> > > > >> > > > (assuming
> > > > >> > > > > we can take care of
> > > > >> > https://issues.apache.org/jira/browse/KAFKA-12487
> > > > >> > > > > before then, which I'm fairly optimistic about), but it
> > might
> > > be
> > > > >> > worth
> > > > >> > > a
> > > > >> > > > > sentence or two in the KIP explaining that the change in
> > > default
> > > > >> will
> > > > >> > > > > intentionally propagate to Connect. And, if we think
> Connect
> > > > >> should
> > > > >> > be
> > > > >> > > > left
> > > > >> > > > > out of this change and stay on the range assignor instead,
> > we
> > > > >> should
> > > > >> > > > > probably call that fact out in the KIP as well and state
> > that
> > > > >> Connect
> > > > >> > > > will
> > > > >> > > > > now override the default partition assignment strategy to
> be
> > > the
> > > > >> > range
> > > > >> > > > > assignor (assuming the user hasn't specified a value for
> > > > >> > > > > consumer.partition.assignment.strategy in their worker
> > config
> > > or
> > > > >> for
> > > > >> > > > > consumer.override.partition.assignment.strategy in their
> > > > connector
> > > > >> > > > config).
> > > > >> > > > >
> > > > >> > > > > Cheers,
> > > > >> > > > >
> > > > >> > > > > Chris
> > > > >> > > > >
> > > > >> > > > > On Wed, Mar 31, 2021 at 12:18 AM Sophie Blee-Goldman
> > > > >> > > > > <so...@confluent.io.invalid> wrote:
> > > > >> > > > >
> > > > >> > > > > > Ok I'm still fleshing out all the details of KAFKA-12477
> > > but I
> > > > >> > think
> > > > >> > > we
> > > > >> > > > > can
> > > > >> > > > > > simplify some things a bit, and avoid
> > > > >> > > > > > any kind of "fail-fast" which will require user
> > > intervention.
> > > > In
> > > > >> > > fact I
> > > > >> > > > > > think we can avoid requiring the user to make
> > > > >> > > > > > any changes at all for KIP-726, so we don't have to
> worry
> > > > about
> > > > >> > > whether
> > > > >> > > > > > they actually read our documentation:
> > > > >> > > > > >
> > > > >> > > > > > Instead of making ["cooperative-sticky"] the default, we
> > > > change
> > > > >> the
> > > > >> > > > > default
> > > > >> > > > > > to ["cooperative-sticky", "range"].
> > > > >> > > > > > Since "range" is the old default, this is equivalent to
> > the
> > > > >> first
> > > > >> > > > rolling
> > > > >> > > > > > bounce of the safe upgrade path in KIP-429.
> > > > >> > > > > >
> > > > >> > > > > > Of course this also means that under the current
> protocol
> > > > >> selection
> > > > >> > > > > > mechanism we won't actually upgrade to
> > > > >> > > > > > cooperative rebalancing with the default assignor. But
> > > that's
> > > > >> where
> > > > >> > > > > > KAFKA-12477 will come in.
> > > > >> > > > > >
> > > > >> > > > > > @Guozhang Wang <gu...@confluent.io>  I'll get back
> to
> > > you
> > > > >> with
> > > > >> > a
> > > > >> > > > > > concrete proposal and answer your questions, I just want
> > to
> > > > >> point
> > > > >> > out
> > > > >> > > > > > that it's possible to side-step the risk of users
> shooting
> > > > >> > themselves
> > > > >> > > > in
> > > > >> > > > > > the foot (well, at least in this one specific case,
> > > > >> > > > > > obviously they always find a way)
> > > > >> > > > > >
> > > > >> > > > > > On Tue, Mar 30, 2021 at 10:37 AM Guozhang Wang <
> > > > >> wangguoz@gmail.com
> > > > >> > >
> > > > >> > > > > wrote:
> > > > >> > > > > >
> > > > >> > > > > > > Hi Sophie,
> > > > >> > > > > > >
> > > > >> > > > > > > My question is more related to KAFKA-12477, but since
> > your
> > > > >> latest
> > > > >> > > > > replies
> > > > >> > > > > > > are on this thread I figured we can follow-up on the
> > same
> > > > >> venue.
> > > > >> > > Just
> > > > >> > > > > so
> > > > >> > > > > > I
> > > > >> > > > > > > understand your latest comments above about the
> > approach:
> > > > >> > > > > > >
> > > > >> > > > > > > * I think, we would need to persist this decision so
> > that
> > > > the
> > > > >> > group
> > > > >> > > > > would
> > > > >> > > > > > > never go back to the eager protocol, this bit would be
> > > > >> written to
> > > > >> > > the
> > > > >> > > > > > > internal topic's assignment message. Is that correct?
> > > > >> > > > > > > * Maybe you can describe the steps, after the group
> has
> > > > >> decided
> > > > >> > to
> > > > >> > > > move
> > > > >> > > > > > > forward with cooperative protocols, when:
> > > > >> > > > > > > 1) a new member joined the group with the old version,
> > and
> > > > >> hence
> > > > >> > > only
> > > > >> > > > > > > recognized eager protocol and executing the eager
> > protocol
> > > > >> with
> > > > >> > its
> > > > >> > > > > first
> > > > >> > > > > > > rebalance, what would happen.
> > > > >> > > > > > > 2) in addition to 1), the new member joined the group
> > with
> > > > the
> > > > >> > old
> > > > >> > > > > > version
> > > > >> > > > > > > and only recognized the old subscription format, and
> was
> > > > >> selected
> > > > >> > > as
> > > > >> > > > > the
> > > > >> > > > > > > leader, what would happen.
> > > > >> > > > > > >
> > > > >> > > > > > > Guozhang
> > > > >> > > > > > >
> > > > >> > > > > > >
> > > > >> > > > > > >
> > > > >> > > > > > >
> > > > >> > > > > > > On Mon, Mar 29, 2021 at 10:30 PM Luke Chen <
> > > > showuon@gmail.com
> > > > >> >
> > > > >> > > > wrote:
> > > > >> > > > > > >
> > > > >> > > > > > > > Hi Sophie & Ismael,
> > > > >> > > > > > > > Thank you for your feedback.
> > > > >> > > > > > > > No problem, let's pause this KIP and wait for this
> > > > >> improvement:
> > > > >> > > > > > > KAFKA-12477
> > > > >> > > > > > > > <https://issues.apache.org/jira/browse/KAFKA-12477
> >.
> > > > >> > > > > > > >
> > > > >> > > > > > > > Stay tuned :)
> > > > >> > > > > > > >
> > > > >> > > > > > > > Thank you.
> > > > >> > > > > > > > Luke
> > > > >> > > > > > > >
> > > > >> > > > > > > > On Tue, Mar 30, 2021 at 3:14 AM Ismael Juma <
> > > > >> ismael@juma.me.uk
> > > > >> > >
> > > > >> > > > > wrote:
> > > > >> > > > > > > >
> > > > >> > > > > > > > > Hi Sophie,
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > I didn't analyze the KIP in detail, but the two
> > > > >> suggestions
> > > > >> > you
> > > > >> > > > > > > mentioned
> > > > >> > > > > > > > > sound like great improvements.
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > A bit more context: breaking changes for a widely
> > used
> > > > >> > product
> > > > >> > > > like
> > > > >> > > > > > > Kafka
> > > > >> > > > > > > > > are costly and hence why we try as hard as we can
> to
> > > > avoid
> > > > >> > > them.
> > > > >> > > > > When
> > > > >> > > > > > > it
> > > > >> > > > > > > > > comes to the brokers, they are often managed by a
> > > > central
> > > > >> > group
> > > > >> > > > (or
> > > > >> > > > > > > > they're
> > > > >> > > > > > > > > in the Cloud), so they're a bit easier to manage.
> > Even
> > > > so,
> > > > >> > it's
> > > > >> > > > > still
> > > > >> > > > > > > > > possible to upgrade from 0.8.x directly to 2.7
> since
> > > all
> > > > >> > > protocol
> > > > >> > > > > > > > versions
> > > > >> > > > > > > > > are still supported. When it comes to the basic
> > > clients
> > > > >> > > > (producer,
> > > > >> > > > > > > > > consumer, admin client), they're often embedded in
> > > > >> > applications
> > > > >> > > > so
> > > > >> > > > > we
> > > > >> > > > > > > > have
> > > > >> > > > > > > > > to be even more conservative.
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > Ismael
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > On Mon, Mar 29, 2021 at 10:50 AM Sophie
> Blee-Goldman
> > > > >> > > > > > > > > <so...@confluent.io.invalid> wrote:
> > > > >> > > > > > > > >
> > > > >> > > > > > > > > > Ismael,
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > It seems like given 3.0 is a breaking release,
> we
> > > have
> > > > >> to
> > > > >> > > rely
> > > > >> > > > on
> > > > >> > > > > > > users
> > > > >> > > > > > > > > > being aware of this and responsible
> > > > >> > > > > > > > > > enough to read the upgrade guide. Otherwise we
> > could
> > > > >> never
> > > > >> > > ever
> > > > >> > > > > > make
> > > > >> > > > > > > > any
> > > > >> > > > > > > > > > breaking changes beyond just
> > > > >> > > > > > > > > > removing deprecated APIs or other
> > > compilation-breaking
> > > > >> > errors
> > > > >> > > > > that
> > > > >> > > > > > > > would
> > > > >> > > > > > > > > be
> > > > >> > > > > > > > > > immediately visible, no?
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > That said, obviously it's better to have a
> > > > >> circuit-breaker
> > > > >> > > that
> > > > >> > > > > > will
> > > > >> > > > > > > > fail
> > > > >> > > > > > > > > > fast in case of a user misconfiguration
> > > > >> > > > > > > > > > rather than silently corrupting the consumer
> group
> > > > >> state --
> > > > >> > > eg
> > > > >> > > > > for
> > > > >> > > > > > > two
> > > > >> > > > > > > > > > consumers to overlap in their ownership
> > > > >> > > > > > > > > > of the same partition(s). We could definitely
> > > > implement
> > > > >> > this,
> > > > >> > > > and
> > > > >> > > > > > now
> > > > >> > > > > > > > > that
> > > > >> > > > > > > > > > I think about it this might solve a
> > > > >> > > > > > > > > > related problem in KAFKA-12477
> > > > >> > > > > > > > > > <
> > https://issues.apache.org/jira/browse/KAFKA-12477
> > > >.
> > > > We
> > > > >> > just
> > > > >> > > > > add a
> > > > >> > > > > > > new
> > > > >> > > > > > > > > > field to the Assignment in which the group
> leader
> > > > >> > > > > > > > > > indicates whether it's on a recent enough
> version
> > to
> > > > >> > > understand
> > > > >> > > > > > > > > cooperative
> > > > >> > > > > > > > > > rebalancing. If an upgraded member
> > > > >> > > > > > > > > > joins the group, it'll only be allowed to start
> > > > >> following
> > > > >> > the
> > > > >> > > > new
> > > > >> > > > > > > > > > rebalancing protocol after receiving the
> go-ahead
> > > > >> > > > > > > > > > from the group leader.
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > If we do go ahead and add this new field in the
> > > > >> Assignment
> > > > >> > > then
> > > > >> > > > > I'm
> > > > >> > > > > > > > > pretty
> > > > >> > > > > > > > > > confident we can reduce the number
> > > > >> > > > > > > > > > of required rolling bounces to just one with
> > > > KAFKA-12477
> > > > >> > > > > > > > > > <
> > https://issues.apache.org/jira/browse/KAFKA-12477
> > > >.
> > > > In
> > > > >> > that
> > > > >> > > > > case
> > > > >> > > > > > we
> > > > >> > > > > > > > > > should
> > > > >> > > > > > > > > > be in much better shape to
> > > > >> > > > > > > > > > feel good about changing the default to the
> > > > >> > > > > > > CooperativeStickyAssignor.
> > > > >> > > > > > > > > How
> > > > >> > > > > > > > > > does that sound?
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > To be clear, I'm not proposing we do this as
> part
> > of
> > > > >> > KIP-726.
> > > > >> > > > > > Here's
> > > > >> > > > > > > my
> > > > >> > > > > > > > > > take:
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > Let's pause this KIP while I work on making
> these
> > > two
> > > > >> > > > > improvements
> > > > >> > > > > > in
> > > > >> > > > > > > > > > KAFKA-12477 <
> > > > >> > > https://issues.apache.org/jira/browse/KAFKA-12477
> > > > >> > > > >.
> > > > >> > > > > > > Once
> > > > >> > > > > > > > I
> > > > >> > > > > > > > > > can
> > > > >> > > > > > > > > > confirm the
> > > > >> > > > > > > > > > short-circuit and single rolling bounce will be
> > > > >> available
> > > > >> > for
> > > > >> > > > > 3.0,
> > > > >> > > > > > > I'll
> > > > >> > > > > > > > > > report back on this thread. Then we can move
> > > > >> > > > > > > > > > forward with this KIP again.
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > Thoughts?
> > > > >> > > > > > > > > > Sophie
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > On Mon, Mar 29, 2021 at 12:01 AM Luke Chen <
> > > > >> > > showuon@gmail.com>
> > > > >> > > > > > > wrote:
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > > > > Hi Ismael,
> > > > >> > > > > > > > > > > Thanks for your good question. Answer them
> > below:
> > > > >> > > > > > > > > > > *1. Are we saying that every consumer upgraded
> > > would
> > > > >> have
> > > > >> > > to
> > > > >> > > > > > follow
> > > > >> > > > > > > > the
> > > > >> > > > > > > > > > > complex path described in the KIP? *
> > > > >> > > > > > > > > > > --> We suggest that every consumer did these 2
> > > steps
> > > > >> of
> > > > >> > > > rolling
> > > > >> > > > > > > > > upgrade.
> > > > >> > > > > > > > > > > And after KAFKA-12477 <
> > > > >> > > > > > > > > https://issues.apache.org/jira/browse/KAFKA-12477
> > > > >> > > > > > > > > > >
> > > > >> > > > > > > > > > > is completed, it can be reduced to 1 rolling
> > > > upgrade.
> > > > >> > > > > > > > > > >
> > > > >> > > > > > > > > > > *2. what happens if they don't read the
> > > instructions
> > > > >> and
> > > > >> > > > > upgrade
> > > > >> > > > > > as
> > > > >> > > > > > > > > they
> > > > >> > > > > > > > > > > have in the past?*
> > > > >> > > > > > > > > > > --> The reason we want 2 steps of rolling
> > upgrade
> > > is
> > > > >> that
> > > > >> > > we
> > > > >> > > > > want
> > > > >> > > > > > > to
> > > > >> > > > > > > > > > avoid
> > > > >> > > > > > > > > > > the situation where leader is on old byte-code
> > and
> > > > >> only
> > > > >> > > > > recognize
> > > > >> > > > > > > > > > "eager",
> > > > >> > > > > > > > > > > but due to compatibility would still be able
> to
> > > > >> > deserialize
> > > > >> > > > the
> > > > >> > > > > > new
> > > > >> > > > > > > > > > > protocol data from newer versioned members,
> and
> > > > hence
> > > > >> > just
> > > > >> > > go
> > > > >> > > > > > ahead
> > > > >> > > > > > > > and
> > > > >> > > > > > > > > > do
> > > > >> > > > > > > > > > > the assignment while new versioned members did
> > not
> > > > >> revoke
> > > > >> > > > their
> > > > >> > > > > > > > > > partitions
> > > > >> > > > > > > > > > > before joining the group.
> > > > >> > > > > > > > > > >
> > > > >> > > > > > > > > > > But I'd say, the new default assignor
> > > > >> > > > > "CooperativeStickyAssignor"
> > > > >> > > > > > > was
> > > > >> > > > > > > > > > > already introduced in V2.4.0, and it should be
> > > long
> > > > >> > enough
> > > > >> > > > for
> > > > >> > > > > > user
> > > > >> > > > > > > > to
> > > > >> > > > > > > > > > > upgrade to the new byte-code to recognize the
> > > > >> > "cooperative"
> > > > >> > > > > > > protocol.
> > > > >> > > > > > > > > > >
> > > > >> > > > > > > > > > > What do you think?
> > > > >> > > > > > > > > > >
> > > > >> > > > > > > > > > > Thank you.
> > > > >> > > > > > > > > > > Luke
> > > > >> > > > > > > > > > >
> > > > >> > > > > > > > > > > On Mon, Mar 29, 2021 at 12:14 PM Ismael Juma <
> > > > >> > > > > ismael@juma.me.uk>
> > > > >> > > > > > > > > wrote:
> > > > >> > > > > > > > > > >
> > > > >> > > > > > > > > > > > Thanks for the KIP. Are we saying that every
> > > > >> consumer
> > > > >> > > > > upgraded
> > > > >> > > > > > > > would
> > > > >> > > > > > > > > > have
> > > > >> > > > > > > > > > > > to follow the complex path described in the
> > KIP?
> > > > >> Also,
> > > > >> > > what
> > > > >> > > > > > > happens
> > > > >> > > > > > > > > if
> > > > >> > > > > > > > > > > they
> > > > >> > > > > > > > > > > > don't read the instructions and upgrade as
> > they
> > > > >> have in
> > > > >> > > the
> > > > >> > > > > > past?
> > > > >> > > > > > > > > > > >
> > > > >> > > > > > > > > > > > Ismael
> > > > >> > > > > > > > > > > >
> > > > >> > > > > > > > > > > > On Fri, Mar 26, 2021, 1:53 AM Luke Chen <
> > > > >> > > showuon@gmail.com
> > > > >> > > > >
> > > > >> > > > > > > wrote:
> > > > >> > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > Hi everyone,
> > > > >> > > > > > > > > > > > > <Update the subject>
> > > > >> > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > I'd like to discuss the following proposal
> > to
> > > > make
> > > > >> > the
> > > > >> > > > > > > > > > > > > CooperativeStickyAssignor as the default
> > > > assignor.
> > > > >> > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > >
> > > > >> > > > > > > > > > > >
> > > > >> > > > > > > > > > >
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > >
> > > > >> > > > > > > >
> > > > >> > > > > > >
> > > > >> > > > > >
> > > > >> > > > >
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >>
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-726%3A+Make+the+CooperativeStickyAssignor+as+the+default+assignor
> > > > >> > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > Any comments are welcomed.
> > > > >> > > > > > > > > > > > >
> > > > >> > > > > > > > > > > > > Thank you.
> > > > >> > > > > > > > > > > > > Luke
> > > > >> > > > > > > > > > > > >
> > > > >> > > > > > > > > > > >
> > > > >> > > > > > > > > > >
> > > > >> > > > > > > > > >
> > > > >> > > > > > > > >
> > > > >> > > > > > > >
> > > > >> > > > > > >
> > > > >> > > > > > >
> > > > >> > > > > > > --
> > > > >> > > > > > > -- Guozhang
> > > > >> > > > > > >
> > > > >> > > > > >
> > > > >> > > > >
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >>
> > > > >
> > > >
> > >
> > >
> > > --
> > > -- Guozhang
> > >
> >
>
>
> --
> -- Guozhang
>

Re: [DISCUSS] KIP-726: Make the CooperativeStickyAssignor as the default assignor

Posted by Sophie Blee-Goldman <so...@confluent.io.INVALID>.
Thanks Luke. We may as well get this KIP in to 3.0 so that we can fully
enable cooperative rebalancing
by default in 3.0 if we have KAFKA-12477 done in time, and if we don't then
there's no harm as it's
not going to change the behavior.

On Wed, Jun 2, 2021 at 7:28 PM Luke Chen <sh...@gmail.com> wrote:

> Hi Sophie,
> Thanks for the reminder. Yes, I was thinking this KIP doesn't have to be
> put into a major release since it will be fully backward compatible, so no
> need to push it. Currently, if we want to work on this KIP, we need
> KAFKA-12477 and KAFKA-12487. But you're right, we can at least try our best
> to see if we can make it into V3.0 since cooperative rebalancing is a major
> improvement. I'll kick off a vote later.
>
> Thank you.
> Luke
>
> On Thu, Jun 3, 2021 at 7:08 AM Sophie Blee-Goldman
> <so...@confluent.io.invalid> wrote:
>
> > Hey Luke,
> >
> > It's been a while since the last update on this, which is mostly my fault
> > for picking up
> > other things in the meantime. I'm planning to get back to work
> > on KAFKA-12477 next
> > week but there are minimal changes to the current implementation given
> the
> > proposal
> > I put forth earlier in this KIP discussion, so I think we're good to go.
> >
> > Although this KIP no longer requires a major release since it should be
> > fully compatible, I
> > still hope we can get it in to 3.0 since cooperative rebalancing is a
> major
> > improvement to
> > the life of a consumer group (and its operator). Can we make sure the KIP
> > reflects the latest
> > and then kick off a vote by next Monday at the latest so we can make KIP
> > freeze?
> >
> > Thanks!
> > Sophie
> >
> > On Fri, Apr 16, 2021 at 2:33 PM Guozhang Wang <wa...@gmail.com>
> wrote:
> >
> > > 1) From user's perspective, it is always possible that a commit within
> > > onPartitionsRevoked throw in practice (e.g. if the member missed the
> > > previous rebalance where its assigned partitions are already
> re-assigned)
> > > -- and the onPartitionsLost was introduced for that exact reason, i.e.
> it
> > > is primarily for optimizations, but not for correctness guarantees --
> on
> > > the other hand, it would be surprising to users to see the commit
> returns
> > > and then later found it not going through. Given that, I'd suggest we
> > still
> > > throw the exception right away. Regarding the flag itself though, I
> agree
> > > that keeping it set until the next succeeded join group makes sense to
> be
> > > safer.
> > >
> > > 2) That's crystal, thank you for the clarification.
> > >
> > > On Wed, Apr 14, 2021 at 6:46 PM Sophie Blee-Goldman
> > > <so...@confluent.io.invalid> wrote:
> > >
> > > > 1) Once the short-circuit is triggered, the member will downgrade to
> > the
> > > > EAGER protocol, but
> > > > won't necessarily try to rejoin the group right away.
> > > >
> > > > In the "happy path", the user has implemented #onPartitionsLost
> > correctly
> > > > and will not attempt
> > > > to commit partitions that are lost. And since these partitions have
> > > indeed
> > > > been revoked, the user
> > > > application should not attempt to commit those partitions after this
> > > point.
> > > > In this case, there's no
> > > > reason for the consumer to immediately rejoin the group. Since a
> > > > non-cooperative assignor was
> > > > selected, we know that all partitions have been assigned. This member
> > can
> > > > continue on as usual,
> > > > processing the remaining un-revoked partitions and will follow the
> > EAGER
> > > > protocol in the next
> > > > rebalance. There's no user-facing impact or handling required; all
> that
> > > > happens is that the work
> > > > since the last commit on those revoked partitions has been lost.
> > > >
> > > > In the less-happy path, the user has implemented #onPartitionsLost
> > > > incorrectly or not implemented
> > > > it at all, falling back on the default which invokes
> > #onPartitionsRevoked
> > > > which in turn will attempt to
> > > > commit those partitions during the rebalance callback. In this case
> we
> > > rely
> > > > on the flag to prevent
> > > > this commit request from being sent to the broker.
> > > >
> > > > Originally I was thinking we should throw a CommitFailedException up
> > > > through the #onPartitionsLost
> > > > callback, and eventually up through poll(), then rejoin the group.
> But
> > > now
> > > > I'm wondering if this is really
> > > > necessary -- the important point in all cases is just to prevent the
> > > > commit, but there's no reason the
> > > > consumer should not be allowed to continue processing its other
> > > partitions,
> > > > and it hasn't dropped out
> > > > of the group. What do you think about this slight amendment to my
> > > original
> > > > proposal: if a user does end up
> > > > calling commit for whatever reason when invoking #onPartitionsLost,
> > we'll
> > > > just swallow the resulting
> > > > CommitFailedException. So the user application wouldn't see anything,
> > and
> > > > the only impact would be
> > > > that these partitions were not able to commit those last set of
> offsets
> > > on
> > > > the revoked partitions.
> > > >
> > > > WDYT? My only concern there is that the user might have some implicit
> > > > assumption that unless a
> > > > CommitFailedException was thrown, the offsets of revoked partitions
> > were
> > > > successfully committed
> > > > and they may have some downstream logic that should trigger only in
> > this
> > > > case. If that's a concern,
> > > > then I would keep the original proposal which says a
> > > CommitFailedException
> > > > will be thrown up through
> > > > poll(), and leave it up to the user to decide if they want to
> trigger a
> > > new
> > > > rebalance/rejoin the group or not.
> > > >
> > > > Regarding the flag which prevents committing the revoked partitions,
> > this
> > > > will need to continue
> > > > blocking such commit attempts until the next time the consumer
> rejoins
> > > the
> > > > group, ie until the end
> > > > of the next successful rebalance. Technically this shouldn't matter,
> > > since
> > > > the consumer no longer
> > > > owns those partitions this member shouldn't attempt to commit them
> > > anyways.
> > > > Usually we can
> > > > rely on the broker rejecting commit attempts on partitions that are
> not
> > > > owned, in which case the
> > > > consumer will throw a CommitFailedException. This is similar, except
> > that
> > > > we can't rely on the
> > > > broker having been informed of the change in ownership before this
> > > consumer
> > > > might attempt to
> > > > commit. So to avoid this race condition, we'll keep the "blockCommit"
> > > flag
> > > > until the next rebalance
> > > > when we can be certain that the broker is clear on this
> > > > partition's ownership.
> > > >
> > > > 2) I guess maybe the wording here is unclear -- what I meant is that
> > all
> > > > 3.0 applications will *eventually*
> > > > enable cooperative rebalancing in the stable state. This doesn't mean
> > > that
> > > > it will select COOPERATIVE
> > > > when it first starts up, and in order for this dynamic protocol
> upgrade
> > > to
> > > > be safe we do indeed need to
> > > > start off with EAGER and only upgrade once the selected assignor
> > > indicates
> > > > that it's safe to do so.
> > > > (This only applies if multiple assignors are used, if the assignors
> are
> > > > "cooperative-sticky" only then it
> > > > will just start out and forever remain on COOPERATIVE, like in
> Streams)
> > > >
> > > > Since it's just the first rebalance, the choice of COOPERATIVE vs
> EAGER
> > > > actually doesn't matter at
> > > > all since the consumer won't own any partitions until it's joined the
> > > > group. So we may as well continue
> > > > the initial protocol selection strategy of "highest commonly
> supported
> > > > protocol", but the point is that
> > > > 3.0 applications will upgrade to COOPERATIVE as soon as they have any
> > > > partitions. If you can think
> > > > of a better way to phrase "New applications on 3.0 will enable
> > > cooperative
> > > > rebalancing by default" then
> > > > please let me know.
> > > >
> > > >
> > > > Thanks for the response -- hope this makes sense so far, but I'm
> happy
> > to
> > > > elaborate any aspects of the
> > > > proposal which aren't clear. I'll also update the ticket description
> > > > for KAFKA-12477 with the latest.
> > > >
> > > >
> > > > On Wed, Apr 14, 2021 at 12:03 PM Guozhang Wang <wa...@gmail.com>
> > > wrote:
> > > >
> > > > > Hello Sophie,
> > > > >
> > > > > Thanks for the detailed explanation, a few clarifying questions:
> > > > >
> > > > > 1) when the short-circuit is triggered, what would happen next?
> Would
> > > the
> > > > > consumers switch back to EAGER, and try to re-join the group, and
> > then
> > > > upon
> > > > > succeeding the next rebalance reset the flag to allow committing?
> Or
> > > > would
> > > > > we just fail the consumer immediately.
> > > > >
> > > > > 2) at the overview you mentioned "New applications on 3.0 will
> enable
> > > > > cooperative rebalancing by default", but in the detailed
> description
> > as
> > > > > "With ["cooperative-sticky", "range”], the initial protocol will be
> > > EAGER
> > > > > when the member first joins the group." which seems contradictory?
> If
> > > we
> > > > > want to have cooperative behavior be the default, then with the
> > > > > default ["cooperative-sticky", "range”] the member would start with
> > > > > COOPERATIVE protocol right away.
> > > > >
> > > > >
> > > > > Guozhang
> > > > >
> > > > >
> > > > >
> > > > > On Mon, Apr 12, 2021 at 5:19 AM Chris Egerton
> > > > <chrise@confluent.io.invalid
> > > > > >
> > > > > wrote:
> > > > >
> > > > > > Whoops, small correction--meant to say
> > > > > > ConsumerRebalanceListener::onPartitionsLost, not
> > > > > Consumer::onPartitionsLost
> > > > > >
> > > > > > On Mon, Apr 12, 2021 at 8:17 AM Chris Egerton <
> chrise@confluent.io
> > >
> > > > > wrote:
> > > > > >
> > > > > > > Hi Sophie,
> > > > > > >
> > > > > > > This sounds fantastic. I've made a note on KAFKA-12487 about
> > being
> > > > sure
> > > > > > to
> > > > > > > implement Consumer::onPartitionsLost to avoid unnecessary task
> > > > failures
> > > > > > on
> > > > > > > consumer protocol downgrade, but besides that, I don't think
> > things
> > > > > could
> > > > > > > get any smoother for Connect users or developers. The automatic
> > > > > protocol
> > > > > > > upgrade/downgrade behavior appears safe, intuitive, and
> > pain-free.
> > > > > > >
> > > > > > > Really excited for this development and hoping we can see it
> come
> > > to
> > > > > > > fruition in time for the 3.0 release!
> > > > > > >
> > > > > > > Cheers,
> > > > > > >
> > > > > > > Chris
> > > > > > >
> > > > > > > On Fri, Apr 9, 2021 at 2:43 PM Sophie Blee-Goldman
> > > > > > > <so...@confluent.io.invalid> wrote:
> > > > > > >
> > > > > > >> 1) Yes, all of the above will be part of KAFKA-12477 (not
> > KIP-726)
> > > > > > >>
> > > > > > >> 2) No, KAFKA-12638 would be nice to have but I don't think
> it's
> > > > > > >> appropriate
> > > > > > >> to remove
> > > > > > >> the default implementation of #onPartitionsLost in 3.0 since
> we
> > > > never
> > > > > > gave
> > > > > > >> any indication
> > > > > > >> yet that we intend to remove it
> > > > > > >>
> > > > > > >> 3) Yes, this would be similar to when a Consumer drops out of
> > the
> > > > > group.
> > > > > > >> It's always been
> > > > > > >> possible for a member to miss a rebalance and have its
> partition
> > > be
> > > > > > >> reassigned to another
> > > > > > >> member, during which time both members would claim to own said
> > > > > > partition.
> > > > > > >> But this is safe
> > > > > > >> because the member who dropped out is blocked from committing
> > > > offsets
> > > > > on
> > > > > > >> that partition.
> > > > > > >>
> > > > > > >> On Fri, Apr 9, 2021 at 2:46 AM Luke Chen <sh...@gmail.com>
> > > wrote:
> > > > > > >>
> > > > > > >> > Hi Sophie,
> > > > > > >> > That sounds great to take care of each case I can think of.
> > > > > > >> > Questions:
> > > > > > >> > 1. Do you mean the short-Circuit will also be implemented in
> > > > > > >> KAFKA-12477?
> > > > > > >> > 2. I don't think KAFKA-12638 is the blocker of this KIP-726,
> > Am
> > > I
> > > > > > right?
> > > > > > >> > 3. So, does that mean we still have possibility to have
> > multiple
> > > > > > >> consumer
> > > > > > >> > owned the same topic partition? And in this situation, we
> > avoid
> > > > them
> > > > > > >> doing
> > > > > > >> > committing, and waiting for next rebalance (should be soon).
> > Is
> > > my
> > > > > > >> > understanding correct?
> > > > > > >> >
> > > > > > >> > Thank you very much for finding this great solution.
> > > > > > >> >
> > > > > > >> > Luke
> > > > > > >> >
> > > > > > >> > On Fri, Apr 9, 2021 at 11:37 AM Sophie Blee-Goldman
> > > > > > >> > <so...@confluent.io.invalid> wrote:
> > > > > > >> >
> > > > > > >> > > Alright, here's the detailed proposal for KAFKA-12477.
> This
> > > > > assumes
> > > > > > we
> > > > > > >> > will
> > > > > > >> > > change the default assignor to ["cooperative-sticky",
> > "range"]
> > > > in
> > > > > > >> > KIP-726.
> > > > > > >> > > It also acknowledges that users may attempt any kind of
> > > upgrade
> > > > > > >> without
> > > > > > >> > > reading the docs, and so we need to put in safeguards
> > against
> > > > data
> > > > > > >> > > corruption rather than assume everyone will follow the
> safe
> > > > > upgrade
> > > > > > >> path.
> > > > > > >> > >
> > > > > > >> > > With this proposal,
> > > > > > >> > > 1) New applications on 3.0 will enable cooperative
> > rebalancing
> > > > by
> > > > > > >> default
> > > > > > >> > > 2) Existing applications which don’t set an assignor can
> > > safely
> > > > > > >> upgrade
> > > > > > >> > to
> > > > > > >> > > 3.0 using a single rolling bounce with no extra steps, and
> > > will
> > > > > > >> > > automatically transition to cooperative rebalancing
> > > > > > >> > > 3) Existing applications which do set an assignor that
> uses
> > > > EAGER
> > > > > > can
> > > > > > >> > > likewise upgrade their applications to COOPERATIVE with a
> > > single
> > > > > > >> rolling
> > > > > > >> > > bounce
> > > > > > >> > > 4) Once on 3.0, applications can safely go back and forth
> > > > between
> > > > > > >> EAGER
> > > > > > >> > and
> > > > > > >> > > COOPERATIVE
> > > > > > >> > > 5) Applications can safely downgrade away from 3.0
> > > > > > >> > >
> > > > > > >> > > The high-level idea for dynamic protocol upgrades is that
> > the
> > > > > group
> > > > > > >> will
> > > > > > >> > > leverage the assignor selected by the group coordinator to
> > > > > determine
> > > > > > >> when
> > > > > > >> > > it’s safe to upgrade to COOPERATIVE, and trigger a
> fail-safe
> > > to
> > > > > > >> protect
> > > > > > >> > the
> > > > > > >> > > group in case of rare events or user misconfiguration. The
> > > group
> > > > > > >> > > coordinator selects the most preferred assignor that’s
> > > supported
> > > > > by
> > > > > > >> all
> > > > > > >> > > members of the group, so we know that all members will
> > support
> > > > > > >> > COOPERATIVE
> > > > > > >> > > once we receive the “cooperative-sticky” assignor after a
> > > > > rebalance.
> > > > > > >> At
> > > > > > >> > > this point, each member can upgrade their own protocol to
> > > > > > COOPERATIVE.
> > > > > > >> > > However, there may be situations in which an EAGER member
> > may
> > > > join
> > > > > > the
> > > > > > >> > > group even after upgrading to COOPERATIVE. For example,
> > > during a
> > > > > > >> rolling
> > > > > > >> > > upgrade if the last remaining member on the old bytecode
> > > misses
> > > > a
> > > > > > >> > > rebalance, the other members will be allowed to upgrade to
> > > > > > >> COOPERATIVE.
> > > > > > >> > If
> > > > > > >> > > the old member rejoins and is chosen to be the group
> leader
> > > > before
> > > > > > >> it’s
> > > > > > >> > > upgraded to 3.0, it won’t be aware that the other members
> of
> > > the
> > > > > > group
> > > > > > >> > have
> > > > > > >> > > not yet revoked their partitions when computing the
> > > assignment.
> > > > > > >> > >
> > > > > > >> > > Short Circuit:
> > > > > > >> > > The risk of mixing the cooperative and eager rebalancing
> > > > protocols
> > > > > > is
> > > > > > >> > that
> > > > > > >> > > a partition may be assigned to one member while it has yet
> > to
> > > be
> > > > > > >> revoked
> > > > > > >> > > from its previous owner. The danger is that the new owner
> > may
> > > > > begin
> > > > > > >> > > processing and committing offsets for this partition while
> > the
> > > > > > >> previous
> > > > > > >> > > owner is also committing offsets in its
> #onPartitionsRevoked
> > > > > > callback,
> > > > > > >> > > which is invoked at the end of the rebalance in the
> > > cooperative
> > > > > > >> protocol.
> > > > > > >> > > This can result in these consumers overwriting each
> other’s
> > > > > offsets
> > > > > > >> and
> > > > > > >> > > getting a corrupted view of the partition. Note that it’s
> > not
> > > > > > >> possible to
> > > > > > >> > > commit during a rebalance, so we can protect against
> offset
> > > > > > >> corruption by
> > > > > > >> > > blocking further commits after we detect that the group
> > leader
> > > > may
> > > > > > not
> > > > > > >> > > understand COOPERATIVE, but before we invoke
> > > > #onPartitionsRevoked.
> > > > > > >> This
> > > > > > >> > is
> > > > > > >> > > the “short-circuit” — if we detect that the group is in an
> > > > unsafe
> > > > > > >> state,
> > > > > > >> > we
> > > > > > >> > > invoke #onPartitionsLost instead of #onPartitionsRevoked
> and
> > > > > > >> explicitly
> > > > > > >> > > prevent offsets from being committed on those revoked
> > > > partitions.
> > > > > > >> > >
> > > > > > >> > > Consumer procedure:
> > > > > > >> > > Upon startup, the consumer will initially select the
> highest
> > > > > > >> > > commonly-supported protocol across its configured
> assignors.
> > > > With
> > > > > > >> > > ["cooperative-sticky", "range”], the initial protocol will
> > be
> > > > > EAGER
> > > > > > >> when
> > > > > > >> > > the member first joins the group. Following a rebalance,
> > each
> > > > > member
> > > > > > >> will
> > > > > > >> > > check the selected assignor. If the chosen assignor
> supports
> > > > > > >> COOPERATIVE,
> > > > > > >> > > the member can upgrade their used protocol to COOPERATIVE
> > and
> > > no
> > > > > > >> further
> > > > > > >> > > action is required. If the member is already on
> COOPERATIVE
> > > but
> > > > > the
> > > > > > >> > > selected assignor does NOT support it, then we need to
> > trigger
> > > > the
> > > > > > >> > > short-circuit. In this case we will invoke
> #onPartitionsLost
> > > > > instead
> > > > > > >> of
> > > > > > >> > > #onPartitionsRevoked, and set a flag to block any attempts
> > at
> > > > > > >> committing
> > > > > > >> > > those partitions which have been revoked. If a commit is
> > > > > attempted,
> > > > > > as
> > > > > > >> > may
> > > > > > >> > > be the case if the user does not implement
> #onPartitionsLost
> > > > (see
> > > > > > >> > > KAFKA-12638), we will throw a CommitFailedException which
> > will
> > > > be
> > > > > > >> bubbled
> > > > > > >> > > up through poll() after completing the rebalance. The
> member
> > > > will
> > > > > > then
> > > > > > >> > > downgrade its protocol to EAGER for the next rebalance.
> > > > > > >> > >
> > > > > > >> > > Let me know what you think,
> > > > > > >> > > Sophie
> > > > > > >> > >
> > > > > > >> > > On Fri, Apr 2, 2021 at 7:08 PM Luke Chen <
> showuon@gmail.com
> > >
> > > > > wrote:
> > > > > > >> > >
> > > > > > >> > > > Hi Sophie,
> > > > > > >> > > > Making the default to "cooperative-sticky, range" is a
> > smart
> > > > > idea,
> > > > > > >> to
> > > > > > >> > > > ensure we can at least fall back to rangeAssignor if
> > > consumers
> > > > > are
> > > > > > >> not
> > > > > > >> > > > following our recommended upgrade path. I updated the
> KIP
> > > > > > >> accordingly.
> > > > > > >> > > >
> > > > > > >> > > > Hi Chris,
> > > > > > >> > > > No problem, I updated the KIP to include the change in
> > > > Connect.
> > > > > > >> > > >
> > > > > > >> > > > Thank you very much.
> > > > > > >> > > >
> > > > > > >> > > > Luke
> > > > > > >> > > >
> > > > > > >> > > > On Thu, Apr 1, 2021 at 3:24 AM Chris Egerton
> > > > > > >> > <chrise@confluent.io.invalid
> > > > > > >> > > >
> > > > > > >> > > > wrote:
> > > > > > >> > > >
> > > > > > >> > > > > Hi all,
> > > > > > >> > > > >
> > > > > > >> > > > > @Sophie - I like the sound of the dual-protocol
> default.
> > > The
> > > > > > >> smooth
> > > > > > >> > > > upgrade
> > > > > > >> > > > > path it permits sounds fantastic!
> > > > > > >> > > > >
> > > > > > >> > > > > @Luke - Do you think we can also include Connect in
> this
> > > > KIP?
> > > > > > >> Right
> > > > > > >> > now
> > > > > > >> > > > we
> > > > > > >> > > > > don't set any custom partition assignment strategies
> for
> > > the
> > > > > > >> consumer
> > > > > > >> > > > > groups we bring up for sink tasks, and if we continue
> to
> > > > just
> > > > > > use
> > > > > > >> the
> > > > > > >> > > > > default, the assignment strategy for those consumer
> > groups
> > > > > would
> > > > > > >> > change
> > > > > > >> > > > on
> > > > > > >> > > > > Connect clusters once people upgrade to 3.0. I think
> > this
> > > is
> > > > > > fine
> > > > > > >> > > > (assuming
> > > > > > >> > > > > we can take care of
> > > > > > >> > https://issues.apache.org/jira/browse/KAFKA-12487
> > > > > > >> > > > > before then, which I'm fairly optimistic about), but
> it
> > > > might
> > > > > be
> > > > > > >> > worth
> > > > > > >> > > a
> > > > > > >> > > > > sentence or two in the KIP explaining that the change
> in
> > > > > default
> > > > > > >> will
> > > > > > >> > > > > intentionally propagate to Connect. And, if we think
> > > Connect
> > > > > > >> should
> > > > > > >> > be
> > > > > > >> > > > left
> > > > > > >> > > > > out of this change and stay on the range assignor
> > instead,
> > > > we
> > > > > > >> should
> > > > > > >> > > > > probably call that fact out in the KIP as well and
> state
> > > > that
> > > > > > >> Connect
> > > > > > >> > > > will
> > > > > > >> > > > > now override the default partition assignment strategy
> > to
> > > be
> > > > > the
> > > > > > >> > range
> > > > > > >> > > > > assignor (assuming the user hasn't specified a value
> for
> > > > > > >> > > > > consumer.partition.assignment.strategy in their worker
> > > > config
> > > > > or
> > > > > > >> for
> > > > > > >> > > > > consumer.override.partition.assignment.strategy in
> their
> > > > > > connector
> > > > > > >> > > > config).
> > > > > > >> > > > >
> > > > > > >> > > > > Cheers,
> > > > > > >> > > > >
> > > > > > >> > > > > Chris
> > > > > > >> > > > >
> > > > > > >> > > > > On Wed, Mar 31, 2021 at 12:18 AM Sophie Blee-Goldman
> > > > > > >> > > > > <so...@confluent.io.invalid> wrote:
> > > > > > >> > > > >
> > > > > > >> > > > > > Ok I'm still fleshing out all the details of
> > KAFKA-12477
> > > > > but I
> > > > > > >> > think
> > > > > > >> > > we
> > > > > > >> > > > > can
> > > > > > >> > > > > > simplify some things a bit, and avoid
> > > > > > >> > > > > > any kind of "fail-fast" which will require user
> > > > > intervention.
> > > > > > In
> > > > > > >> > > fact I
> > > > > > >> > > > > > think we can avoid requiring the user to make
> > > > > > >> > > > > > any changes at all for KIP-726, so we don't have to
> > > worry
> > > > > > about
> > > > > > >> > > whether
> > > > > > >> > > > > > they actually read our documentation:
> > > > > > >> > > > > >
> > > > > > >> > > > > > Instead of making ["cooperative-sticky"] the
> default,
> > we
> > > > > > change
> > > > > > >> the
> > > > > > >> > > > > default
> > > > > > >> > > > > > to ["cooperative-sticky", "range"].
> > > > > > >> > > > > > Since "range" is the old default, this is equivalent
> > to
> > > > the
> > > > > > >> first
> > > > > > >> > > > rolling
> > > > > > >> > > > > > bounce of the safe upgrade path in KIP-429.
> > > > > > >> > > > > >
> > > > > > >> > > > > > Of course this also means that under the current
> > > protocol
> > > > > > >> selection
> > > > > > >> > > > > > mechanism we won't actually upgrade to
> > > > > > >> > > > > > cooperative rebalancing with the default assignor.
> But
> > > > > that's
> > > > > > >> where
> > > > > > >> > > > > > KAFKA-12477 will come in.
> > > > > > >> > > > > >
> > > > > > >> > > > > > @Guozhang Wang <gu...@confluent.io>  I'll get
> back
> > > to
> > > > > you
> > > > > > >> with
> > > > > > >> > a
> > > > > > >> > > > > > concrete proposal and answer your questions, I just
> > want
> > > > to
> > > > > > >> point
> > > > > > >> > out
> > > > > > >> > > > > > that it's possible to side-step the risk of users
> > > shooting
> > > > > > >> > themselves
> > > > > > >> > > > in
> > > > > > >> > > > > > the foot (well, at least in this one specific case,
> > > > > > >> > > > > > obviously they always find a way)
> > > > > > >> > > > > >
> > > > > > >> > > > > > On Tue, Mar 30, 2021 at 10:37 AM Guozhang Wang <
> > > > > > >> wangguoz@gmail.com
> > > > > > >> > >
> > > > > > >> > > > > wrote:
> > > > > > >> > > > > >
> > > > > > >> > > > > > > Hi Sophie,
> > > > > > >> > > > > > >
> > > > > > >> > > > > > > My question is more related to KAFKA-12477, but
> > since
> > > > your
> > > > > > >> latest
> > > > > > >> > > > > replies
> > > > > > >> > > > > > > are on this thread I figured we can follow-up on
> the
> > > > same
> > > > > > >> venue.
> > > > > > >> > > Just
> > > > > > >> > > > > so
> > > > > > >> > > > > > I
> > > > > > >> > > > > > > understand your latest comments above about the
> > > > approach:
> > > > > > >> > > > > > >
> > > > > > >> > > > > > > * I think, we would need to persist this decision
> so
> > > > that
> > > > > > the
> > > > > > >> > group
> > > > > > >> > > > > would
> > > > > > >> > > > > > > never go back to the eager protocol, this bit
> would
> > be
> > > > > > >> written to
> > > > > > >> > > the
> > > > > > >> > > > > > > internal topic's assignment message. Is that
> > correct?
> > > > > > >> > > > > > > * Maybe you can describe the steps, after the
> group
> > > has
> > > > > > >> decided
> > > > > > >> > to
> > > > > > >> > > > move
> > > > > > >> > > > > > > forward with cooperative protocols, when:
> > > > > > >> > > > > > > 1) a new member joined the group with the old
> > version,
> > > > and
> > > > > > >> hence
> > > > > > >> > > only
> > > > > > >> > > > > > > recognized eager protocol and executing the eager
> > > > protocol
> > > > > > >> with
> > > > > > >> > its
> > > > > > >> > > > > first
> > > > > > >> > > > > > > rebalance, what would happen.
> > > > > > >> > > > > > > 2) in addition to 1), the new member joined the
> > group
> > > > with
> > > > > > the
> > > > > > >> > old
> > > > > > >> > > > > > version
> > > > > > >> > > > > > > and only recognized the old subscription format,
> and
> > > was
> > > > > > >> selected
> > > > > > >> > > as
> > > > > > >> > > > > the
> > > > > > >> > > > > > > leader, what would happen.
> > > > > > >> > > > > > >
> > > > > > >> > > > > > > Guozhang
> > > > > > >> > > > > > >
> > > > > > >> > > > > > >
> > > > > > >> > > > > > >
> > > > > > >> > > > > > >
> > > > > > >> > > > > > > On Mon, Mar 29, 2021 at 10:30 PM Luke Chen <
> > > > > > showuon@gmail.com
> > > > > > >> >
> > > > > > >> > > > wrote:
> > > > > > >> > > > > > >
> > > > > > >> > > > > > > > Hi Sophie & Ismael,
> > > > > > >> > > > > > > > Thank you for your feedback.
> > > > > > >> > > > > > > > No problem, let's pause this KIP and wait for
> this
> > > > > > >> improvement:
> > > > > > >> > > > > > > KAFKA-12477
> > > > > > >> > > > > > > > <
> > https://issues.apache.org/jira/browse/KAFKA-12477
> > > >.
> > > > > > >> > > > > > > >
> > > > > > >> > > > > > > > Stay tuned :)
> > > > > > >> > > > > > > >
> > > > > > >> > > > > > > > Thank you.
> > > > > > >> > > > > > > > Luke
> > > > > > >> > > > > > > >
> > > > > > >> > > > > > > > On Tue, Mar 30, 2021 at 3:14 AM Ismael Juma <
> > > > > > >> ismael@juma.me.uk
> > > > > > >> > >
> > > > > > >> > > > > wrote:
> > > > > > >> > > > > > > >
> > > > > > >> > > > > > > > > Hi Sophie,
> > > > > > >> > > > > > > > >
> > > > > > >> > > > > > > > > I didn't analyze the KIP in detail, but the
> two
> > > > > > >> suggestions
> > > > > > >> > you
> > > > > > >> > > > > > > mentioned
> > > > > > >> > > > > > > > > sound like great improvements.
> > > > > > >> > > > > > > > >
> > > > > > >> > > > > > > > > A bit more context: breaking changes for a
> > widely
> > > > used
> > > > > > >> > product
> > > > > > >> > > > like
> > > > > > >> > > > > > > Kafka
> > > > > > >> > > > > > > > > are costly and hence why we try as hard as we
> > can
> > > to
> > > > > > avoid
> > > > > > >> > > them.
> > > > > > >> > > > > When
> > > > > > >> > > > > > > it
> > > > > > >> > > > > > > > > comes to the brokers, they are often managed
> by
> > a
> > > > > > central
> > > > > > >> > group
> > > > > > >> > > > (or
> > > > > > >> > > > > > > > they're
> > > > > > >> > > > > > > > > in the Cloud), so they're a bit easier to
> > manage.
> > > > Even
> > > > > > so,
> > > > > > >> > it's
> > > > > > >> > > > > still
> > > > > > >> > > > > > > > > possible to upgrade from 0.8.x directly to 2.7
> > > since
> > > > > all
> > > > > > >> > > protocol
> > > > > > >> > > > > > > > versions
> > > > > > >> > > > > > > > > are still supported. When it comes to the
> basic
> > > > > clients
> > > > > > >> > > > (producer,
> > > > > > >> > > > > > > > > consumer, admin client), they're often
> embedded
> > in
> > > > > > >> > applications
> > > > > > >> > > > so
> > > > > > >> > > > > we
> > > > > > >> > > > > > > > have
> > > > > > >> > > > > > > > > to be even more conservative.
> > > > > > >> > > > > > > > >
> > > > > > >> > > > > > > > > Ismael
> > > > > > >> > > > > > > > >
> > > > > > >> > > > > > > > > On Mon, Mar 29, 2021 at 10:50 AM Sophie
> > > Blee-Goldman
> > > > > > >> > > > > > > > > <so...@confluent.io.invalid> wrote:
> > > > > > >> > > > > > > > >
> > > > > > >> > > > > > > > > > Ismael,
> > > > > > >> > > > > > > > > >
> > > > > > >> > > > > > > > > > It seems like given 3.0 is a breaking
> release,
> > > we
> > > > > have
> > > > > > >> to
> > > > > > >> > > rely
> > > > > > >> > > > on
> > > > > > >> > > > > > > users
> > > > > > >> > > > > > > > > > being aware of this and responsible
> > > > > > >> > > > > > > > > > enough to read the upgrade guide. Otherwise
> we
> > > > could
> > > > > > >> never
> > > > > > >> > > ever
> > > > > > >> > > > > > make
> > > > > > >> > > > > > > > any
> > > > > > >> > > > > > > > > > breaking changes beyond just
> > > > > > >> > > > > > > > > > removing deprecated APIs or other
> > > > > compilation-breaking
> > > > > > >> > errors
> > > > > > >> > > > > that
> > > > > > >> > > > > > > > would
> > > > > > >> > > > > > > > > be
> > > > > > >> > > > > > > > > > immediately visible, no?
> > > > > > >> > > > > > > > > >
> > > > > > >> > > > > > > > > > That said, obviously it's better to have a
> > > > > > >> circuit-breaker
> > > > > > >> > > that
> > > > > > >> > > > > > will
> > > > > > >> > > > > > > > fail
> > > > > > >> > > > > > > > > > fast in case of a user misconfiguration
> > > > > > >> > > > > > > > > > rather than silently corrupting the consumer
> > > group
> > > > > > >> state --
> > > > > > >> > > eg
> > > > > > >> > > > > for
> > > > > > >> > > > > > > two
> > > > > > >> > > > > > > > > > consumers to overlap in their ownership
> > > > > > >> > > > > > > > > > of the same partition(s). We could
> definitely
> > > > > > implement
> > > > > > >> > this,
> > > > > > >> > > > and
> > > > > > >> > > > > > now
> > > > > > >> > > > > > > > > that
> > > > > > >> > > > > > > > > > I think about it this might solve a
> > > > > > >> > > > > > > > > > related problem in KAFKA-12477
> > > > > > >> > > > > > > > > > <
> > > > https://issues.apache.org/jira/browse/KAFKA-12477
> > > > > >.
> > > > > > We
> > > > > > >> > just
> > > > > > >> > > > > add a
> > > > > > >> > > > > > > new
> > > > > > >> > > > > > > > > > field to the Assignment in which the group
> > > leader
> > > > > > >> > > > > > > > > > indicates whether it's on a recent enough
> > > version
> > > > to
> > > > > > >> > > understand
> > > > > > >> > > > > > > > > cooperative
> > > > > > >> > > > > > > > > > rebalancing. If an upgraded member
> > > > > > >> > > > > > > > > > joins the group, it'll only be allowed to
> > start
> > > > > > >> following
> > > > > > >> > the
> > > > > > >> > > > new
> > > > > > >> > > > > > > > > > rebalancing protocol after receiving the
> > > go-ahead
> > > > > > >> > > > > > > > > > from the group leader.
> > > > > > >> > > > > > > > > >
> > > > > > >> > > > > > > > > > If we do go ahead and add this new field in
> > the
> > > > > > >> Assignment
> > > > > > >> > > then
> > > > > > >> > > > > I'm
> > > > > > >> > > > > > > > > pretty
> > > > > > >> > > > > > > > > > confident we can reduce the number
> > > > > > >> > > > > > > > > > of required rolling bounces to just one with
> > > > > > KAFKA-12477
> > > > > > >> > > > > > > > > > <
> > > > https://issues.apache.org/jira/browse/KAFKA-12477
> > > > > >.
> > > > > > In
> > > > > > >> > that
> > > > > > >> > > > > case
> > > > > > >> > > > > > we
> > > > > > >> > > > > > > > > > should
> > > > > > >> > > > > > > > > > be in much better shape to
> > > > > > >> > > > > > > > > > feel good about changing the default to the
> > > > > > >> > > > > > > CooperativeStickyAssignor.
> > > > > > >> > > > > > > > > How
> > > > > > >> > > > > > > > > > does that sound?
> > > > > > >> > > > > > > > > >
> > > > > > >> > > > > > > > > > To be clear, I'm not proposing we do this as
> > > part
> > > > of
> > > > > > >> > KIP-726.
> > > > > > >> > > > > > Here's
> > > > > > >> > > > > > > my
> > > > > > >> > > > > > > > > > take:
> > > > > > >> > > > > > > > > >
> > > > > > >> > > > > > > > > > Let's pause this KIP while I work on making
> > > these
> > > > > two
> > > > > > >> > > > > improvements
> > > > > > >> > > > > > in
> > > > > > >> > > > > > > > > > KAFKA-12477 <
> > > > > > >> > > https://issues.apache.org/jira/browse/KAFKA-12477
> > > > > > >> > > > >.
> > > > > > >> > > > > > > Once
> > > > > > >> > > > > > > > I
> > > > > > >> > > > > > > > > > can
> > > > > > >> > > > > > > > > > confirm the
> > > > > > >> > > > > > > > > > short-circuit and single rolling bounce will
> > be
> > > > > > >> available
> > > > > > >> > for
> > > > > > >> > > > > 3.0,
> > > > > > >> > > > > > > I'll
> > > > > > >> > > > > > > > > > report back on this thread. Then we can move
> > > > > > >> > > > > > > > > > forward with this KIP again.
> > > > > > >> > > > > > > > > >
> > > > > > >> > > > > > > > > > Thoughts?
> > > > > > >> > > > > > > > > > Sophie
> > > > > > >> > > > > > > > > >
> > > > > > >> > > > > > > > > > On Mon, Mar 29, 2021 at 12:01 AM Luke Chen <
> > > > > > >> > > showuon@gmail.com>
> > > > > > >> > > > > > > wrote:
> > > > > > >> > > > > > > > > >
> > > > > > >> > > > > > > > > > > Hi Ismael,
> > > > > > >> > > > > > > > > > > Thanks for your good question. Answer them
> > > > below:
> > > > > > >> > > > > > > > > > > *1. Are we saying that every consumer
> > upgraded
> > > > > would
> > > > > > >> have
> > > > > > >> > > to
> > > > > > >> > > > > > follow
> > > > > > >> > > > > > > > the
> > > > > > >> > > > > > > > > > > complex path described in the KIP? *
> > > > > > >> > > > > > > > > > > --> We suggest that every consumer did
> > these 2
> > > > > steps
> > > > > > >> of
> > > > > > >> > > > rolling
> > > > > > >> > > > > > > > > upgrade.
> > > > > > >> > > > > > > > > > > And after KAFKA-12477 <
> > > > > > >> > > > > > > > >
> > https://issues.apache.org/jira/browse/KAFKA-12477
> > > > > > >> > > > > > > > > > >
> > > > > > >> > > > > > > > > > > is completed, it can be reduced to 1
> rolling
> > > > > > upgrade.
> > > > > > >> > > > > > > > > > >
> > > > > > >> > > > > > > > > > > *2. what happens if they don't read the
> > > > > instructions
> > > > > > >> and
> > > > > > >> > > > > upgrade
> > > > > > >> > > > > > as
> > > > > > >> > > > > > > > > they
> > > > > > >> > > > > > > > > > > have in the past?*
> > > > > > >> > > > > > > > > > > --> The reason we want 2 steps of rolling
> > > > upgrade
> > > > > is
> > > > > > >> that
> > > > > > >> > > we
> > > > > > >> > > > > want
> > > > > > >> > > > > > > to
> > > > > > >> > > > > > > > > > avoid
> > > > > > >> > > > > > > > > > > the situation where leader is on old
> > byte-code
> > > > and
> > > > > > >> only
> > > > > > >> > > > > recognize
> > > > > > >> > > > > > > > > > "eager",
> > > > > > >> > > > > > > > > > > but due to compatibility would still be
> able
> > > to
> > > > > > >> > deserialize
> > > > > > >> > > > the
> > > > > > >> > > > > > new
> > > > > > >> > > > > > > > > > > protocol data from newer versioned
> members,
> > > and
> > > > > > hence
> > > > > > >> > just
> > > > > > >> > > go
> > > > > > >> > > > > > ahead
> > > > > > >> > > > > > > > and
> > > > > > >> > > > > > > > > > do
> > > > > > >> > > > > > > > > > > the assignment while new versioned members
> > did
> > > > not
> > > > > > >> revoke
> > > > > > >> > > > their
> > > > > > >> > > > > > > > > > partitions
> > > > > > >> > > > > > > > > > > before joining the group.
> > > > > > >> > > > > > > > > > >
> > > > > > >> > > > > > > > > > > But I'd say, the new default assignor
> > > > > > >> > > > > "CooperativeStickyAssignor"
> > > > > > >> > > > > > > was
> > > > > > >> > > > > > > > > > > already introduced in V2.4.0, and it
> should
> > be
> > > > > long
> > > > > > >> > enough
> > > > > > >> > > > for
> > > > > > >> > > > > > user
> > > > > > >> > > > > > > > to
> > > > > > >> > > > > > > > > > > upgrade to the new byte-code to recognize
> > the
> > > > > > >> > "cooperative"
> > > > > > >> > > > > > > protocol.
> > > > > > >> > > > > > > > > > >
> > > > > > >> > > > > > > > > > > What do you think?
> > > > > > >> > > > > > > > > > >
> > > > > > >> > > > > > > > > > > Thank you.
> > > > > > >> > > > > > > > > > > Luke
> > > > > > >> > > > > > > > > > >
> > > > > > >> > > > > > > > > > > On Mon, Mar 29, 2021 at 12:14 PM Ismael
> > Juma <
> > > > > > >> > > > > ismael@juma.me.uk>
> > > > > > >> > > > > > > > > wrote:
> > > > > > >> > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > Thanks for the KIP. Are we saying that
> > every
> > > > > > >> consumer
> > > > > > >> > > > > upgraded
> > > > > > >> > > > > > > > would
> > > > > > >> > > > > > > > > > have
> > > > > > >> > > > > > > > > > > > to follow the complex path described in
> > the
> > > > KIP?
> > > > > > >> Also,
> > > > > > >> > > what
> > > > > > >> > > > > > > happens
> > > > > > >> > > > > > > > > if
> > > > > > >> > > > > > > > > > > they
> > > > > > >> > > > > > > > > > > > don't read the instructions and upgrade
> as
> > > > they
> > > > > > >> have in
> > > > > > >> > > the
> > > > > > >> > > > > > past?
> > > > > > >> > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > Ismael
> > > > > > >> > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > On Fri, Mar 26, 2021, 1:53 AM Luke Chen
> <
> > > > > > >> > > showuon@gmail.com
> > > > > > >> > > > >
> > > > > > >> > > > > > > wrote:
> > > > > > >> > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > > Hi everyone,
> > > > > > >> > > > > > > > > > > > > <Update the subject>
> > > > > > >> > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > > I'd like to discuss the following
> > proposal
> > > > to
> > > > > > make
> > > > > > >> > the
> > > > > > >> > > > > > > > > > > > > CooperativeStickyAssignor as the
> default
> > > > > > assignor.
> > > > > > >> > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > >
> > > > > > >> > > > > > > > > > >
> > > > > > >> > > > > > > > > >
> > > > > > >> > > > > > > > >
> > > > > > >> > > > > > > >
> > > > > > >> > > > > > >
> > > > > > >> > > > > >
> > > > > > >> > > > >
> > > > > > >> > > >
> > > > > > >> > >
> > > > > > >> >
> > > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-726%3A+Make+the+CooperativeStickyAssignor+as+the+default+assignor
> > > > > > >> > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > > Any comments are welcomed.
> > > > > > >> > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > > > Thank you.
> > > > > > >> > > > > > > > > > > > > Luke
> > > > > > >> > > > > > > > > > > > >
> > > > > > >> > > > > > > > > > > >
> > > > > > >> > > > > > > > > > >
> > > > > > >> > > > > > > > > >
> > > > > > >> > > > > > > > >
> > > > > > >> > > > > > > >
> > > > > > >> > > > > > >
> > > > > > >> > > > > > >
> > > > > > >> > > > > > > --
> > > > > > >> > > > > > > -- Guozhang
> > > > > > >> > > > > > >
> > > > > > >> > > > > >
> > > > > > >> > > > >
> > > > > > >> > > >
> > > > > > >> > >
> > > > > > >> >
> > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > -- Guozhang
> > > > >
> > > >
> > >
> > >
> > > --
> > > -- Guozhang
> > >
> >
>

Re: [DISCUSS] KIP-726: Make the CooperativeStickyAssignor as the default assignor

Posted by Luke Chen <sh...@gmail.com>.
Hi Sophie,
Thanks for the reminder. Yes, I was thinking this KIP doesn't have to be
put into a major release since it will be fully backward compatible, so no
need to push it. Currently, if we want to work on this KIP, we need
KAFKA-12477 and KAFKA-12487. But you're right, we can at least try our best
to see if we can make it into V3.0 since cooperative rebalancing is a major
improvement. I'll kick off a vote later.

Thank you.
Luke

On Thu, Jun 3, 2021 at 7:08 AM Sophie Blee-Goldman
<so...@confluent.io.invalid> wrote:

> Hey Luke,
>
> It's been a while since the last update on this, which is mostly my fault
> for picking up
> other things in the meantime. I'm planning to get back to work
> on KAFKA-12477 next
> week but there are minimal changes to the current implementation given the
> proposal
> I put forth earlier in this KIP discussion, so I think we're good to go.
>
> Although this KIP no longer requires a major release since it should be
> fully compatible, I
> still hope we can get it in to 3.0 since cooperative rebalancing is a major
> improvement to
> the life of a consumer group (and its operator). Can we make sure the KIP
> reflects the latest
> and then kick off a vote by next Monday at the latest so we can make KIP
> freeze?
>
> Thanks!
> Sophie
>
> On Fri, Apr 16, 2021 at 2:33 PM Guozhang Wang <wa...@gmail.com> wrote:
>
> > 1) From user's perspective, it is always possible that a commit within
> > onPartitionsRevoked throw in practice (e.g. if the member missed the
> > previous rebalance where its assigned partitions are already re-assigned)
> > -- and the onPartitionsLost was introduced for that exact reason, i.e. it
> > is primarily for optimizations, but not for correctness guarantees -- on
> > the other hand, it would be surprising to users to see the commit returns
> > and then later found it not going through. Given that, I'd suggest we
> still
> > throw the exception right away. Regarding the flag itself though, I agree
> > that keeping it set until the next succeeded join group makes sense to be
> > safer.
> >
> > 2) That's crystal, thank you for the clarification.
> >
> > On Wed, Apr 14, 2021 at 6:46 PM Sophie Blee-Goldman
> > <so...@confluent.io.invalid> wrote:
> >
> > > 1) Once the short-circuit is triggered, the member will downgrade to
> the
> > > EAGER protocol, but
> > > won't necessarily try to rejoin the group right away.
> > >
> > > In the "happy path", the user has implemented #onPartitionsLost
> correctly
> > > and will not attempt
> > > to commit partitions that are lost. And since these partitions have
> > indeed
> > > been revoked, the user
> > > application should not attempt to commit those partitions after this
> > point.
> > > In this case, there's no
> > > reason for the consumer to immediately rejoin the group. Since a
> > > non-cooperative assignor was
> > > selected, we know that all partitions have been assigned. This member
> can
> > > continue on as usual,
> > > processing the remaining un-revoked partitions and will follow the
> EAGER
> > > protocol in the next
> > > rebalance. There's no user-facing impact or handling required; all that
> > > happens is that the work
> > > since the last commit on those revoked partitions has been lost.
> > >
> > > In the less-happy path, the user has implemented #onPartitionsLost
> > > incorrectly or not implemented
> > > it at all, falling back on the default which invokes
> #onPartitionsRevoked
> > > which in turn will attempt to
> > > commit those partitions during the rebalance callback. In this case we
> > rely
> > > on the flag to prevent
> > > this commit request from being sent to the broker.
> > >
> > > Originally I was thinking we should throw a CommitFailedException up
> > > through the #onPartitionsLost
> > > callback, and eventually up through poll(), then rejoin the group. But
> > now
> > > I'm wondering if this is really
> > > necessary -- the important point in all cases is just to prevent the
> > > commit, but there's no reason the
> > > consumer should not be allowed to continue processing its other
> > partitions,
> > > and it hasn't dropped out
> > > of the group. What do you think about this slight amendment to my
> > original
> > > proposal: if a user does end up
> > > calling commit for whatever reason when invoking #onPartitionsLost,
> we'll
> > > just swallow the resulting
> > > CommitFailedException. So the user application wouldn't see anything,
> and
> > > the only impact would be
> > > that these partitions were not able to commit those last set of offsets
> > on
> > > the revoked partitions.
> > >
> > > WDYT? My only concern there is that the user might have some implicit
> > > assumption that unless a
> > > CommitFailedException was thrown, the offsets of revoked partitions
> were
> > > successfully committed
> > > and they may have some downstream logic that should trigger only in
> this
> > > case. If that's a concern,
> > > then I would keep the original proposal which says a
> > CommitFailedException
> > > will be thrown up through
> > > poll(), and leave it up to the user to decide if they want to trigger a
> > new
> > > rebalance/rejoin the group or not.
> > >
> > > Regarding the flag which prevents committing the revoked partitions,
> this
> > > will need to continue
> > > blocking such commit attempts until the next time the consumer rejoins
> > the
> > > group, ie until the end
> > > of the next successful rebalance. Technically this shouldn't matter,
> > since
> > > the consumer no longer
> > > owns those partitions this member shouldn't attempt to commit them
> > anyways.
> > > Usually we can
> > > rely on the broker rejecting commit attempts on partitions that are not
> > > owned, in which case the
> > > consumer will throw a CommitFailedException. This is similar, except
> that
> > > we can't rely on the
> > > broker having been informed of the change in ownership before this
> > consumer
> > > might attempt to
> > > commit. So to avoid this race condition, we'll keep the "blockCommit"
> > flag
> > > until the next rebalance
> > > when we can be certain that the broker is clear on this
> > > partition's ownership.
> > >
> > > 2) I guess maybe the wording here is unclear -- what I meant is that
> all
> > > 3.0 applications will *eventually*
> > > enable cooperative rebalancing in the stable state. This doesn't mean
> > that
> > > it will select COOPERATIVE
> > > when it first starts up, and in order for this dynamic protocol upgrade
> > to
> > > be safe we do indeed need to
> > > start off with EAGER and only upgrade once the selected assignor
> > indicates
> > > that it's safe to do so.
> > > (This only applies if multiple assignors are used, if the assignors are
> > > "cooperative-sticky" only then it
> > > will just start out and forever remain on COOPERATIVE, like in Streams)
> > >
> > > Since it's just the first rebalance, the choice of COOPERATIVE vs EAGER
> > > actually doesn't matter at
> > > all since the consumer won't own any partitions until it's joined the
> > > group. So we may as well continue
> > > the initial protocol selection strategy of "highest commonly supported
> > > protocol", but the point is that
> > > 3.0 applications will upgrade to COOPERATIVE as soon as they have any
> > > partitions. If you can think
> > > of a better way to phrase "New applications on 3.0 will enable
> > cooperative
> > > rebalancing by default" then
> > > please let me know.
> > >
> > >
> > > Thanks for the response -- hope this makes sense so far, but I'm happy
> to
> > > elaborate any aspects of the
> > > proposal which aren't clear. I'll also update the ticket description
> > > for KAFKA-12477 with the latest.
> > >
> > >
> > > On Wed, Apr 14, 2021 at 12:03 PM Guozhang Wang <wa...@gmail.com>
> > wrote:
> > >
> > > > Hello Sophie,
> > > >
> > > > Thanks for the detailed explanation, a few clarifying questions:
> > > >
> > > > 1) when the short-circuit is triggered, what would happen next? Would
> > the
> > > > consumers switch back to EAGER, and try to re-join the group, and
> then
> > > upon
> > > > succeeding the next rebalance reset the flag to allow committing? Or
> > > would
> > > > we just fail the consumer immediately.
> > > >
> > > > 2) at the overview you mentioned "New applications on 3.0 will enable
> > > > cooperative rebalancing by default", but in the detailed description
> as
> > > > "With ["cooperative-sticky", "range”], the initial protocol will be
> > EAGER
> > > > when the member first joins the group." which seems contradictory? If
> > we
> > > > want to have cooperative behavior be the default, then with the
> > > > default ["cooperative-sticky", "range”] the member would start with
> > > > COOPERATIVE protocol right away.
> > > >
> > > >
> > > > Guozhang
> > > >
> > > >
> > > >
> > > > On Mon, Apr 12, 2021 at 5:19 AM Chris Egerton
> > > <chrise@confluent.io.invalid
> > > > >
> > > > wrote:
> > > >
> > > > > Whoops, small correction--meant to say
> > > > > ConsumerRebalanceListener::onPartitionsLost, not
> > > > Consumer::onPartitionsLost
> > > > >
> > > > > On Mon, Apr 12, 2021 at 8:17 AM Chris Egerton <chrise@confluent.io
> >
> > > > wrote:
> > > > >
> > > > > > Hi Sophie,
> > > > > >
> > > > > > This sounds fantastic. I've made a note on KAFKA-12487 about
> being
> > > sure
> > > > > to
> > > > > > implement Consumer::onPartitionsLost to avoid unnecessary task
> > > failures
> > > > > on
> > > > > > consumer protocol downgrade, but besides that, I don't think
> things
> > > > could
> > > > > > get any smoother for Connect users or developers. The automatic
> > > > protocol
> > > > > > upgrade/downgrade behavior appears safe, intuitive, and
> pain-free.
> > > > > >
> > > > > > Really excited for this development and hoping we can see it come
> > to
> > > > > > fruition in time for the 3.0 release!
> > > > > >
> > > > > > Cheers,
> > > > > >
> > > > > > Chris
> > > > > >
> > > > > > On Fri, Apr 9, 2021 at 2:43 PM Sophie Blee-Goldman
> > > > > > <so...@confluent.io.invalid> wrote:
> > > > > >
> > > > > >> 1) Yes, all of the above will be part of KAFKA-12477 (not
> KIP-726)
> > > > > >>
> > > > > >> 2) No, KAFKA-12638 would be nice to have but I don't think it's
> > > > > >> appropriate
> > > > > >> to remove
> > > > > >> the default implementation of #onPartitionsLost in 3.0 since we
> > > never
> > > > > gave
> > > > > >> any indication
> > > > > >> yet that we intend to remove it
> > > > > >>
> > > > > >> 3) Yes, this would be similar to when a Consumer drops out of
> the
> > > > group.
> > > > > >> It's always been
> > > > > >> possible for a member to miss a rebalance and have its partition
> > be
> > > > > >> reassigned to another
> > > > > >> member, during which time both members would claim to own said
> > > > > partition.
> > > > > >> But this is safe
> > > > > >> because the member who dropped out is blocked from committing
> > > offsets
> > > > on
> > > > > >> that partition.
> > > > > >>
> > > > > >> On Fri, Apr 9, 2021 at 2:46 AM Luke Chen <sh...@gmail.com>
> > wrote:
> > > > > >>
> > > > > >> > Hi Sophie,
> > > > > >> > That sounds great to take care of each case I can think of.
> > > > > >> > Questions:
> > > > > >> > 1. Do you mean the short-Circuit will also be implemented in
> > > > > >> KAFKA-12477?
> > > > > >> > 2. I don't think KAFKA-12638 is the blocker of this KIP-726,
> Am
> > I
> > > > > right?
> > > > > >> > 3. So, does that mean we still have possibility to have
> multiple
> > > > > >> consumer
> > > > > >> > owned the same topic partition? And in this situation, we
> avoid
> > > them
> > > > > >> doing
> > > > > >> > committing, and waiting for next rebalance (should be soon).
> Is
> > my
> > > > > >> > understanding correct?
> > > > > >> >
> > > > > >> > Thank you very much for finding this great solution.
> > > > > >> >
> > > > > >> > Luke
> > > > > >> >
> > > > > >> > On Fri, Apr 9, 2021 at 11:37 AM Sophie Blee-Goldman
> > > > > >> > <so...@confluent.io.invalid> wrote:
> > > > > >> >
> > > > > >> > > Alright, here's the detailed proposal for KAFKA-12477. This
> > > > assumes
> > > > > we
> > > > > >> > will
> > > > > >> > > change the default assignor to ["cooperative-sticky",
> "range"]
> > > in
> > > > > >> > KIP-726.
> > > > > >> > > It also acknowledges that users may attempt any kind of
> > upgrade
> > > > > >> without
> > > > > >> > > reading the docs, and so we need to put in safeguards
> against
> > > data
> > > > > >> > > corruption rather than assume everyone will follow the safe
> > > > upgrade
> > > > > >> path.
> > > > > >> > >
> > > > > >> > > With this proposal,
> > > > > >> > > 1) New applications on 3.0 will enable cooperative
> rebalancing
> > > by
> > > > > >> default
> > > > > >> > > 2) Existing applications which don’t set an assignor can
> > safely
> > > > > >> upgrade
> > > > > >> > to
> > > > > >> > > 3.0 using a single rolling bounce with no extra steps, and
> > will
> > > > > >> > > automatically transition to cooperative rebalancing
> > > > > >> > > 3) Existing applications which do set an assignor that uses
> > > EAGER
> > > > > can
> > > > > >> > > likewise upgrade their applications to COOPERATIVE with a
> > single
> > > > > >> rolling
> > > > > >> > > bounce
> > > > > >> > > 4) Once on 3.0, applications can safely go back and forth
> > > between
> > > > > >> EAGER
> > > > > >> > and
> > > > > >> > > COOPERATIVE
> > > > > >> > > 5) Applications can safely downgrade away from 3.0
> > > > > >> > >
> > > > > >> > > The high-level idea for dynamic protocol upgrades is that
> the
> > > > group
> > > > > >> will
> > > > > >> > > leverage the assignor selected by the group coordinator to
> > > > determine
> > > > > >> when
> > > > > >> > > it’s safe to upgrade to COOPERATIVE, and trigger a fail-safe
> > to
> > > > > >> protect
> > > > > >> > the
> > > > > >> > > group in case of rare events or user misconfiguration. The
> > group
> > > > > >> > > coordinator selects the most preferred assignor that’s
> > supported
> > > > by
> > > > > >> all
> > > > > >> > > members of the group, so we know that all members will
> support
> > > > > >> > COOPERATIVE
> > > > > >> > > once we receive the “cooperative-sticky” assignor after a
> > > > rebalance.
> > > > > >> At
> > > > > >> > > this point, each member can upgrade their own protocol to
> > > > > COOPERATIVE.
> > > > > >> > > However, there may be situations in which an EAGER member
> may
> > > join
> > > > > the
> > > > > >> > > group even after upgrading to COOPERATIVE. For example,
> > during a
> > > > > >> rolling
> > > > > >> > > upgrade if the last remaining member on the old bytecode
> > misses
> > > a
> > > > > >> > > rebalance, the other members will be allowed to upgrade to
> > > > > >> COOPERATIVE.
> > > > > >> > If
> > > > > >> > > the old member rejoins and is chosen to be the group leader
> > > before
> > > > > >> it’s
> > > > > >> > > upgraded to 3.0, it won’t be aware that the other members of
> > the
> > > > > group
> > > > > >> > have
> > > > > >> > > not yet revoked their partitions when computing the
> > assignment.
> > > > > >> > >
> > > > > >> > > Short Circuit:
> > > > > >> > > The risk of mixing the cooperative and eager rebalancing
> > > protocols
> > > > > is
> > > > > >> > that
> > > > > >> > > a partition may be assigned to one member while it has yet
> to
> > be
> > > > > >> revoked
> > > > > >> > > from its previous owner. The danger is that the new owner
> may
> > > > begin
> > > > > >> > > processing and committing offsets for this partition while
> the
> > > > > >> previous
> > > > > >> > > owner is also committing offsets in its #onPartitionsRevoked
> > > > > callback,
> > > > > >> > > which is invoked at the end of the rebalance in the
> > cooperative
> > > > > >> protocol.
> > > > > >> > > This can result in these consumers overwriting each other’s
> > > > offsets
> > > > > >> and
> > > > > >> > > getting a corrupted view of the partition. Note that it’s
> not
> > > > > >> possible to
> > > > > >> > > commit during a rebalance, so we can protect against offset
> > > > > >> corruption by
> > > > > >> > > blocking further commits after we detect that the group
> leader
> > > may
> > > > > not
> > > > > >> > > understand COOPERATIVE, but before we invoke
> > > #onPartitionsRevoked.
> > > > > >> This
> > > > > >> > is
> > > > > >> > > the “short-circuit” — if we detect that the group is in an
> > > unsafe
> > > > > >> state,
> > > > > >> > we
> > > > > >> > > invoke #onPartitionsLost instead of #onPartitionsRevoked and
> > > > > >> explicitly
> > > > > >> > > prevent offsets from being committed on those revoked
> > > partitions.
> > > > > >> > >
> > > > > >> > > Consumer procedure:
> > > > > >> > > Upon startup, the consumer will initially select the highest
> > > > > >> > > commonly-supported protocol across its configured assignors.
> > > With
> > > > > >> > > ["cooperative-sticky", "range”], the initial protocol will
> be
> > > > EAGER
> > > > > >> when
> > > > > >> > > the member first joins the group. Following a rebalance,
> each
> > > > member
> > > > > >> will
> > > > > >> > > check the selected assignor. If the chosen assignor supports
> > > > > >> COOPERATIVE,
> > > > > >> > > the member can upgrade their used protocol to COOPERATIVE
> and
> > no
> > > > > >> further
> > > > > >> > > action is required. If the member is already on COOPERATIVE
> > but
> > > > the
> > > > > >> > > selected assignor does NOT support it, then we need to
> trigger
> > > the
> > > > > >> > > short-circuit. In this case we will invoke #onPartitionsLost
> > > > instead
> > > > > >> of
> > > > > >> > > #onPartitionsRevoked, and set a flag to block any attempts
> at
> > > > > >> committing
> > > > > >> > > those partitions which have been revoked. If a commit is
> > > > attempted,
> > > > > as
> > > > > >> > may
> > > > > >> > > be the case if the user does not implement #onPartitionsLost
> > > (see
> > > > > >> > > KAFKA-12638), we will throw a CommitFailedException which
> will
> > > be
> > > > > >> bubbled
> > > > > >> > > up through poll() after completing the rebalance. The member
> > > will
> > > > > then
> > > > > >> > > downgrade its protocol to EAGER for the next rebalance.
> > > > > >> > >
> > > > > >> > > Let me know what you think,
> > > > > >> > > Sophie
> > > > > >> > >
> > > > > >> > > On Fri, Apr 2, 2021 at 7:08 PM Luke Chen <showuon@gmail.com
> >
> > > > wrote:
> > > > > >> > >
> > > > > >> > > > Hi Sophie,
> > > > > >> > > > Making the default to "cooperative-sticky, range" is a
> smart
> > > > idea,
> > > > > >> to
> > > > > >> > > > ensure we can at least fall back to rangeAssignor if
> > consumers
> > > > are
> > > > > >> not
> > > > > >> > > > following our recommended upgrade path. I updated the KIP
> > > > > >> accordingly.
> > > > > >> > > >
> > > > > >> > > > Hi Chris,
> > > > > >> > > > No problem, I updated the KIP to include the change in
> > > Connect.
> > > > > >> > > >
> > > > > >> > > > Thank you very much.
> > > > > >> > > >
> > > > > >> > > > Luke
> > > > > >> > > >
> > > > > >> > > > On Thu, Apr 1, 2021 at 3:24 AM Chris Egerton
> > > > > >> > <chrise@confluent.io.invalid
> > > > > >> > > >
> > > > > >> > > > wrote:
> > > > > >> > > >
> > > > > >> > > > > Hi all,
> > > > > >> > > > >
> > > > > >> > > > > @Sophie - I like the sound of the dual-protocol default.
> > The
> > > > > >> smooth
> > > > > >> > > > upgrade
> > > > > >> > > > > path it permits sounds fantastic!
> > > > > >> > > > >
> > > > > >> > > > > @Luke - Do you think we can also include Connect in this
> > > KIP?
> > > > > >> Right
> > > > > >> > now
> > > > > >> > > > we
> > > > > >> > > > > don't set any custom partition assignment strategies for
> > the
> > > > > >> consumer
> > > > > >> > > > > groups we bring up for sink tasks, and if we continue to
> > > just
> > > > > use
> > > > > >> the
> > > > > >> > > > > default, the assignment strategy for those consumer
> groups
> > > > would
> > > > > >> > change
> > > > > >> > > > on
> > > > > >> > > > > Connect clusters once people upgrade to 3.0. I think
> this
> > is
> > > > > fine
> > > > > >> > > > (assuming
> > > > > >> > > > > we can take care of
> > > > > >> > https://issues.apache.org/jira/browse/KAFKA-12487
> > > > > >> > > > > before then, which I'm fairly optimistic about), but it
> > > might
> > > > be
> > > > > >> > worth
> > > > > >> > > a
> > > > > >> > > > > sentence or two in the KIP explaining that the change in
> > > > default
> > > > > >> will
> > > > > >> > > > > intentionally propagate to Connect. And, if we think
> > Connect
> > > > > >> should
> > > > > >> > be
> > > > > >> > > > left
> > > > > >> > > > > out of this change and stay on the range assignor
> instead,
> > > we
> > > > > >> should
> > > > > >> > > > > probably call that fact out in the KIP as well and state
> > > that
> > > > > >> Connect
> > > > > >> > > > will
> > > > > >> > > > > now override the default partition assignment strategy
> to
> > be
> > > > the
> > > > > >> > range
> > > > > >> > > > > assignor (assuming the user hasn't specified a value for
> > > > > >> > > > > consumer.partition.assignment.strategy in their worker
> > > config
> > > > or
> > > > > >> for
> > > > > >> > > > > consumer.override.partition.assignment.strategy in their
> > > > > connector
> > > > > >> > > > config).
> > > > > >> > > > >
> > > > > >> > > > > Cheers,
> > > > > >> > > > >
> > > > > >> > > > > Chris
> > > > > >> > > > >
> > > > > >> > > > > On Wed, Mar 31, 2021 at 12:18 AM Sophie Blee-Goldman
> > > > > >> > > > > <so...@confluent.io.invalid> wrote:
> > > > > >> > > > >
> > > > > >> > > > > > Ok I'm still fleshing out all the details of
> KAFKA-12477
> > > > but I
> > > > > >> > think
> > > > > >> > > we
> > > > > >> > > > > can
> > > > > >> > > > > > simplify some things a bit, and avoid
> > > > > >> > > > > > any kind of "fail-fast" which will require user
> > > > intervention.
> > > > > In
> > > > > >> > > fact I
> > > > > >> > > > > > think we can avoid requiring the user to make
> > > > > >> > > > > > any changes at all for KIP-726, so we don't have to
> > worry
> > > > > about
> > > > > >> > > whether
> > > > > >> > > > > > they actually read our documentation:
> > > > > >> > > > > >
> > > > > >> > > > > > Instead of making ["cooperative-sticky"] the default,
> we
> > > > > change
> > > > > >> the
> > > > > >> > > > > default
> > > > > >> > > > > > to ["cooperative-sticky", "range"].
> > > > > >> > > > > > Since "range" is the old default, this is equivalent
> to
> > > the
> > > > > >> first
> > > > > >> > > > rolling
> > > > > >> > > > > > bounce of the safe upgrade path in KIP-429.
> > > > > >> > > > > >
> > > > > >> > > > > > Of course this also means that under the current
> > protocol
> > > > > >> selection
> > > > > >> > > > > > mechanism we won't actually upgrade to
> > > > > >> > > > > > cooperative rebalancing with the default assignor. But
> > > > that's
> > > > > >> where
> > > > > >> > > > > > KAFKA-12477 will come in.
> > > > > >> > > > > >
> > > > > >> > > > > > @Guozhang Wang <gu...@confluent.io>  I'll get back
> > to
> > > > you
> > > > > >> with
> > > > > >> > a
> > > > > >> > > > > > concrete proposal and answer your questions, I just
> want
> > > to
> > > > > >> point
> > > > > >> > out
> > > > > >> > > > > > that it's possible to side-step the risk of users
> > shooting
> > > > > >> > themselves
> > > > > >> > > > in
> > > > > >> > > > > > the foot (well, at least in this one specific case,
> > > > > >> > > > > > obviously they always find a way)
> > > > > >> > > > > >
> > > > > >> > > > > > On Tue, Mar 30, 2021 at 10:37 AM Guozhang Wang <
> > > > > >> wangguoz@gmail.com
> > > > > >> > >
> > > > > >> > > > > wrote:
> > > > > >> > > > > >
> > > > > >> > > > > > > Hi Sophie,
> > > > > >> > > > > > >
> > > > > >> > > > > > > My question is more related to KAFKA-12477, but
> since
> > > your
> > > > > >> latest
> > > > > >> > > > > replies
> > > > > >> > > > > > > are on this thread I figured we can follow-up on the
> > > same
> > > > > >> venue.
> > > > > >> > > Just
> > > > > >> > > > > so
> > > > > >> > > > > > I
> > > > > >> > > > > > > understand your latest comments above about the
> > > approach:
> > > > > >> > > > > > >
> > > > > >> > > > > > > * I think, we would need to persist this decision so
> > > that
> > > > > the
> > > > > >> > group
> > > > > >> > > > > would
> > > > > >> > > > > > > never go back to the eager protocol, this bit would
> be
> > > > > >> written to
> > > > > >> > > the
> > > > > >> > > > > > > internal topic's assignment message. Is that
> correct?
> > > > > >> > > > > > > * Maybe you can describe the steps, after the group
> > has
> > > > > >> decided
> > > > > >> > to
> > > > > >> > > > move
> > > > > >> > > > > > > forward with cooperative protocols, when:
> > > > > >> > > > > > > 1) a new member joined the group with the old
> version,
> > > and
> > > > > >> hence
> > > > > >> > > only
> > > > > >> > > > > > > recognized eager protocol and executing the eager
> > > protocol
> > > > > >> with
> > > > > >> > its
> > > > > >> > > > > first
> > > > > >> > > > > > > rebalance, what would happen.
> > > > > >> > > > > > > 2) in addition to 1), the new member joined the
> group
> > > with
> > > > > the
> > > > > >> > old
> > > > > >> > > > > > version
> > > > > >> > > > > > > and only recognized the old subscription format, and
> > was
> > > > > >> selected
> > > > > >> > > as
> > > > > >> > > > > the
> > > > > >> > > > > > > leader, what would happen.
> > > > > >> > > > > > >
> > > > > >> > > > > > > Guozhang
> > > > > >> > > > > > >
> > > > > >> > > > > > >
> > > > > >> > > > > > >
> > > > > >> > > > > > >
> > > > > >> > > > > > > On Mon, Mar 29, 2021 at 10:30 PM Luke Chen <
> > > > > showuon@gmail.com
> > > > > >> >
> > > > > >> > > > wrote:
> > > > > >> > > > > > >
> > > > > >> > > > > > > > Hi Sophie & Ismael,
> > > > > >> > > > > > > > Thank you for your feedback.
> > > > > >> > > > > > > > No problem, let's pause this KIP and wait for this
> > > > > >> improvement:
> > > > > >> > > > > > > KAFKA-12477
> > > > > >> > > > > > > > <
> https://issues.apache.org/jira/browse/KAFKA-12477
> > >.
> > > > > >> > > > > > > >
> > > > > >> > > > > > > > Stay tuned :)
> > > > > >> > > > > > > >
> > > > > >> > > > > > > > Thank you.
> > > > > >> > > > > > > > Luke
> > > > > >> > > > > > > >
> > > > > >> > > > > > > > On Tue, Mar 30, 2021 at 3:14 AM Ismael Juma <
> > > > > >> ismael@juma.me.uk
> > > > > >> > >
> > > > > >> > > > > wrote:
> > > > > >> > > > > > > >
> > > > > >> > > > > > > > > Hi Sophie,
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > I didn't analyze the KIP in detail, but the two
> > > > > >> suggestions
> > > > > >> > you
> > > > > >> > > > > > > mentioned
> > > > > >> > > > > > > > > sound like great improvements.
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > A bit more context: breaking changes for a
> widely
> > > used
> > > > > >> > product
> > > > > >> > > > like
> > > > > >> > > > > > > Kafka
> > > > > >> > > > > > > > > are costly and hence why we try as hard as we
> can
> > to
> > > > > avoid
> > > > > >> > > them.
> > > > > >> > > > > When
> > > > > >> > > > > > > it
> > > > > >> > > > > > > > > comes to the brokers, they are often managed by
> a
> > > > > central
> > > > > >> > group
> > > > > >> > > > (or
> > > > > >> > > > > > > > they're
> > > > > >> > > > > > > > > in the Cloud), so they're a bit easier to
> manage.
> > > Even
> > > > > so,
> > > > > >> > it's
> > > > > >> > > > > still
> > > > > >> > > > > > > > > possible to upgrade from 0.8.x directly to 2.7
> > since
> > > > all
> > > > > >> > > protocol
> > > > > >> > > > > > > > versions
> > > > > >> > > > > > > > > are still supported. When it comes to the basic
> > > > clients
> > > > > >> > > > (producer,
> > > > > >> > > > > > > > > consumer, admin client), they're often embedded
> in
> > > > > >> > applications
> > > > > >> > > > so
> > > > > >> > > > > we
> > > > > >> > > > > > > > have
> > > > > >> > > > > > > > > to be even more conservative.
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > Ismael
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > On Mon, Mar 29, 2021 at 10:50 AM Sophie
> > Blee-Goldman
> > > > > >> > > > > > > > > <so...@confluent.io.invalid> wrote:
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > > > > Ismael,
> > > > > >> > > > > > > > > >
> > > > > >> > > > > > > > > > It seems like given 3.0 is a breaking release,
> > we
> > > > have
> > > > > >> to
> > > > > >> > > rely
> > > > > >> > > > on
> > > > > >> > > > > > > users
> > > > > >> > > > > > > > > > being aware of this and responsible
> > > > > >> > > > > > > > > > enough to read the upgrade guide. Otherwise we
> > > could
> > > > > >> never
> > > > > >> > > ever
> > > > > >> > > > > > make
> > > > > >> > > > > > > > any
> > > > > >> > > > > > > > > > breaking changes beyond just
> > > > > >> > > > > > > > > > removing deprecated APIs or other
> > > > compilation-breaking
> > > > > >> > errors
> > > > > >> > > > > that
> > > > > >> > > > > > > > would
> > > > > >> > > > > > > > > be
> > > > > >> > > > > > > > > > immediately visible, no?
> > > > > >> > > > > > > > > >
> > > > > >> > > > > > > > > > That said, obviously it's better to have a
> > > > > >> circuit-breaker
> > > > > >> > > that
> > > > > >> > > > > > will
> > > > > >> > > > > > > > fail
> > > > > >> > > > > > > > > > fast in case of a user misconfiguration
> > > > > >> > > > > > > > > > rather than silently corrupting the consumer
> > group
> > > > > >> state --
> > > > > >> > > eg
> > > > > >> > > > > for
> > > > > >> > > > > > > two
> > > > > >> > > > > > > > > > consumers to overlap in their ownership
> > > > > >> > > > > > > > > > of the same partition(s). We could definitely
> > > > > implement
> > > > > >> > this,
> > > > > >> > > > and
> > > > > >> > > > > > now
> > > > > >> > > > > > > > > that
> > > > > >> > > > > > > > > > I think about it this might solve a
> > > > > >> > > > > > > > > > related problem in KAFKA-12477
> > > > > >> > > > > > > > > > <
> > > https://issues.apache.org/jira/browse/KAFKA-12477
> > > > >.
> > > > > We
> > > > > >> > just
> > > > > >> > > > > add a
> > > > > >> > > > > > > new
> > > > > >> > > > > > > > > > field to the Assignment in which the group
> > leader
> > > > > >> > > > > > > > > > indicates whether it's on a recent enough
> > version
> > > to
> > > > > >> > > understand
> > > > > >> > > > > > > > > cooperative
> > > > > >> > > > > > > > > > rebalancing. If an upgraded member
> > > > > >> > > > > > > > > > joins the group, it'll only be allowed to
> start
> > > > > >> following
> > > > > >> > the
> > > > > >> > > > new
> > > > > >> > > > > > > > > > rebalancing protocol after receiving the
> > go-ahead
> > > > > >> > > > > > > > > > from the group leader.
> > > > > >> > > > > > > > > >
> > > > > >> > > > > > > > > > If we do go ahead and add this new field in
> the
> > > > > >> Assignment
> > > > > >> > > then
> > > > > >> > > > > I'm
> > > > > >> > > > > > > > > pretty
> > > > > >> > > > > > > > > > confident we can reduce the number
> > > > > >> > > > > > > > > > of required rolling bounces to just one with
> > > > > KAFKA-12477
> > > > > >> > > > > > > > > > <
> > > https://issues.apache.org/jira/browse/KAFKA-12477
> > > > >.
> > > > > In
> > > > > >> > that
> > > > > >> > > > > case
> > > > > >> > > > > > we
> > > > > >> > > > > > > > > > should
> > > > > >> > > > > > > > > > be in much better shape to
> > > > > >> > > > > > > > > > feel good about changing the default to the
> > > > > >> > > > > > > CooperativeStickyAssignor.
> > > > > >> > > > > > > > > How
> > > > > >> > > > > > > > > > does that sound?
> > > > > >> > > > > > > > > >
> > > > > >> > > > > > > > > > To be clear, I'm not proposing we do this as
> > part
> > > of
> > > > > >> > KIP-726.
> > > > > >> > > > > > Here's
> > > > > >> > > > > > > my
> > > > > >> > > > > > > > > > take:
> > > > > >> > > > > > > > > >
> > > > > >> > > > > > > > > > Let's pause this KIP while I work on making
> > these
> > > > two
> > > > > >> > > > > improvements
> > > > > >> > > > > > in
> > > > > >> > > > > > > > > > KAFKA-12477 <
> > > > > >> > > https://issues.apache.org/jira/browse/KAFKA-12477
> > > > > >> > > > >.
> > > > > >> > > > > > > Once
> > > > > >> > > > > > > > I
> > > > > >> > > > > > > > > > can
> > > > > >> > > > > > > > > > confirm the
> > > > > >> > > > > > > > > > short-circuit and single rolling bounce will
> be
> > > > > >> available
> > > > > >> > for
> > > > > >> > > > > 3.0,
> > > > > >> > > > > > > I'll
> > > > > >> > > > > > > > > > report back on this thread. Then we can move
> > > > > >> > > > > > > > > > forward with this KIP again.
> > > > > >> > > > > > > > > >
> > > > > >> > > > > > > > > > Thoughts?
> > > > > >> > > > > > > > > > Sophie
> > > > > >> > > > > > > > > >
> > > > > >> > > > > > > > > > On Mon, Mar 29, 2021 at 12:01 AM Luke Chen <
> > > > > >> > > showuon@gmail.com>
> > > > > >> > > > > > > wrote:
> > > > > >> > > > > > > > > >
> > > > > >> > > > > > > > > > > Hi Ismael,
> > > > > >> > > > > > > > > > > Thanks for your good question. Answer them
> > > below:
> > > > > >> > > > > > > > > > > *1. Are we saying that every consumer
> upgraded
> > > > would
> > > > > >> have
> > > > > >> > > to
> > > > > >> > > > > > follow
> > > > > >> > > > > > > > the
> > > > > >> > > > > > > > > > > complex path described in the KIP? *
> > > > > >> > > > > > > > > > > --> We suggest that every consumer did
> these 2
> > > > steps
> > > > > >> of
> > > > > >> > > > rolling
> > > > > >> > > > > > > > > upgrade.
> > > > > >> > > > > > > > > > > And after KAFKA-12477 <
> > > > > >> > > > > > > > >
> https://issues.apache.org/jira/browse/KAFKA-12477
> > > > > >> > > > > > > > > > >
> > > > > >> > > > > > > > > > > is completed, it can be reduced to 1 rolling
> > > > > upgrade.
> > > > > >> > > > > > > > > > >
> > > > > >> > > > > > > > > > > *2. what happens if they don't read the
> > > > instructions
> > > > > >> and
> > > > > >> > > > > upgrade
> > > > > >> > > > > > as
> > > > > >> > > > > > > > > they
> > > > > >> > > > > > > > > > > have in the past?*
> > > > > >> > > > > > > > > > > --> The reason we want 2 steps of rolling
> > > upgrade
> > > > is
> > > > > >> that
> > > > > >> > > we
> > > > > >> > > > > want
> > > > > >> > > > > > > to
> > > > > >> > > > > > > > > > avoid
> > > > > >> > > > > > > > > > > the situation where leader is on old
> byte-code
> > > and
> > > > > >> only
> > > > > >> > > > > recognize
> > > > > >> > > > > > > > > > "eager",
> > > > > >> > > > > > > > > > > but due to compatibility would still be able
> > to
> > > > > >> > deserialize
> > > > > >> > > > the
> > > > > >> > > > > > new
> > > > > >> > > > > > > > > > > protocol data from newer versioned members,
> > and
> > > > > hence
> > > > > >> > just
> > > > > >> > > go
> > > > > >> > > > > > ahead
> > > > > >> > > > > > > > and
> > > > > >> > > > > > > > > > do
> > > > > >> > > > > > > > > > > the assignment while new versioned members
> did
> > > not
> > > > > >> revoke
> > > > > >> > > > their
> > > > > >> > > > > > > > > > partitions
> > > > > >> > > > > > > > > > > before joining the group.
> > > > > >> > > > > > > > > > >
> > > > > >> > > > > > > > > > > But I'd say, the new default assignor
> > > > > >> > > > > "CooperativeStickyAssignor"
> > > > > >> > > > > > > was
> > > > > >> > > > > > > > > > > already introduced in V2.4.0, and it should
> be
> > > > long
> > > > > >> > enough
> > > > > >> > > > for
> > > > > >> > > > > > user
> > > > > >> > > > > > > > to
> > > > > >> > > > > > > > > > > upgrade to the new byte-code to recognize
> the
> > > > > >> > "cooperative"
> > > > > >> > > > > > > protocol.
> > > > > >> > > > > > > > > > >
> > > > > >> > > > > > > > > > > What do you think?
> > > > > >> > > > > > > > > > >
> > > > > >> > > > > > > > > > > Thank you.
> > > > > >> > > > > > > > > > > Luke
> > > > > >> > > > > > > > > > >
> > > > > >> > > > > > > > > > > On Mon, Mar 29, 2021 at 12:14 PM Ismael
> Juma <
> > > > > >> > > > > ismael@juma.me.uk>
> > > > > >> > > > > > > > > wrote:
> > > > > >> > > > > > > > > > >
> > > > > >> > > > > > > > > > > > Thanks for the KIP. Are we saying that
> every
> > > > > >> consumer
> > > > > >> > > > > upgraded
> > > > > >> > > > > > > > would
> > > > > >> > > > > > > > > > have
> > > > > >> > > > > > > > > > > > to follow the complex path described in
> the
> > > KIP?
> > > > > >> Also,
> > > > > >> > > what
> > > > > >> > > > > > > happens
> > > > > >> > > > > > > > > if
> > > > > >> > > > > > > > > > > they
> > > > > >> > > > > > > > > > > > don't read the instructions and upgrade as
> > > they
> > > > > >> have in
> > > > > >> > > the
> > > > > >> > > > > > past?
> > > > > >> > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > Ismael
> > > > > >> > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > On Fri, Mar 26, 2021, 1:53 AM Luke Chen <
> > > > > >> > > showuon@gmail.com
> > > > > >> > > > >
> > > > > >> > > > > > > wrote:
> > > > > >> > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > > Hi everyone,
> > > > > >> > > > > > > > > > > > > <Update the subject>
> > > > > >> > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > > I'd like to discuss the following
> proposal
> > > to
> > > > > make
> > > > > >> > the
> > > > > >> > > > > > > > > > > > > CooperativeStickyAssignor as the default
> > > > > assignor.
> > > > > >> > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > >
> > > > > >> > > > > > > > > > >
> > > > > >> > > > > > > > > >
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > >
> > > > > >> > > > > > >
> > > > > >> > > > > >
> > > > > >> > > > >
> > > > > >> > > >
> > > > > >> > >
> > > > > >> >
> > > > > >>
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-726%3A+Make+the+CooperativeStickyAssignor+as+the+default+assignor
> > > > > >> > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > > Any comments are welcomed.
> > > > > >> > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > > > Thank you.
> > > > > >> > > > > > > > > > > > > Luke
> > > > > >> > > > > > > > > > > > >
> > > > > >> > > > > > > > > > > >
> > > > > >> > > > > > > > > > >
> > > > > >> > > > > > > > > >
> > > > > >> > > > > > > > >
> > > > > >> > > > > > > >
> > > > > >> > > > > > >
> > > > > >> > > > > > >
> > > > > >> > > > > > > --
> > > > > >> > > > > > > -- Guozhang
> > > > > >> > > > > > >
> > > > > >> > > > > >
> > > > > >> > > > >
> > > > > >> > > >
> > > > > >> > >
> > > > > >> >
> > > > > >>
> > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > -- Guozhang
> > > >
> > >
> >
> >
> > --
> > -- Guozhang
> >
>