Posted to dev@kafka.apache.org by Ismael Juma <is...@juma.me.uk> on 2022/09/07 14:07:54 UTC

Re: [DISCUSS] KIP-860: Add client-provided option to guard against unintentional replication factor change during partition reassignments

Thanks for the KIP. Can we explain a bit more why this is an important use
case to address? For example, do we have concrete examples of people
running into this? The way the KIP is written, it sounds like a potential
problem but no information is given on whether it's a real problem in
practice.

Ismael

On Thu, Jul 28, 2022 at 2:00 AM Stanislav Kozlovski
<st...@confluent.io.invalid> wrote:

> Hey all,
>
> I'd like to start a discussion on a proposal to help prevent API users from
> inadvertently increasing the replication factor of a topic through
> the alter partition reassignments API. The KIP describes two fairly
> easy-to-hit race conditions in which this can happen.
>
> The KIP itself is pretty simple, yet has a couple of alternatives that can
> help solve the same problem. I would appreciate thoughts from the community
> on how you think we should proceed, and whether the proposal makes sense in
> the first place.
>
> Thanks!
>
> KIP:
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-860%3A+Add+client-provided+option+to+guard+against+replication+factor+change+during+partition+reassignments
> JIRA: https://issues.apache.org/jira/browse/KAFKA-14121
>
> --
> Best,
> Stanislav
>

Re: [DISCUSS] KIP-860: Add client-provided option to guard against unintentional replication factor change during partition reassignments

Posted by Stanislav Kozlovski <st...@confluent.io.INVALID>.
Thanks Ismael,

I added an extra paragraph in the motivation. We have certainly hit this
within our internal Confluent reassignment software, and from a quick skim
of the popular Cruise Control repository, I notice that similar problems
have been hit there too. Hopefully the examples in the KIP are sufficient
to make the case.

On Wed, Sep 7, 2022 at 11:21 PM Ismael Juma <is...@juma.me.uk> wrote:

> Thanks for the details, Colin. I understand how this can happen. But this
> API has been out for a long time. Are we saying that we have seen Cruise
> Control cause this kind of problem? If so, it would be good to mention it
> in the KIP as evidence that the current approach is brittle.
>
> Ismael


-- 
Best,
Stanislav

Re: [DISCUSS] KIP-860: Add client-provided option to guard against unintentional replication factor change during partition reassignments

Posted by Ismael Juma <is...@juma.me.uk>.
Thanks for the details, Colin. I understand how this can happen. But this
API has been out for a long time. Are we saying that we have seen Cruise
Control cause this kind of problem? If so, it would be good to mention it
in the KIP as evidence that the current approach is brittle.

Ismael

On Wed, Sep 7, 2022 at 2:15 PM Colin McCabe <cm...@apache.org> wrote:

> Hi Ismael,
>
> I think this issue comes up when people write software that automatically
> creates partition reassignments to balance the cluster. Cruise Control is
> one example; Confluent also has some software that does this. If there is
> already a reassignment that is going on for some partition and the software
> tries to create a new reassignment for that partition, the software may
> inadvertently change the replication factor.
>
> In general, I think some people find it surprising that reassignment can
> change the replication factor of a partition. When we outlined the
> reassignment API in KIP-455 we maintained the ability to do this, since the
> old ZK-based API had always been able to do it. But this was a bit
> controversial. Maybe it would have been more intuitive to preserve
> replication factor by default unless the user explicitly stated that they
> wanted to change it. So in a sense, you could view this as a fix for
> KIP-455 :) (in my opinion, at least)
>
> best,
> Colin

Re: [DISCUSS] KIP-860: Add client-provided option to guard against unintentional replication factor change during partition reassignments

Posted by Colin McCabe <cm...@apache.org>.
Hi Ismael,

I think this issue comes up when people write software that automatically creates partition reassignments to balance the cluster. Cruise Control is one example; Confluent also has some software that does this. If there is already a reassignment that is going on for some partition and the software tries to create a new reassignment for that partition, the software may inadvertently change the replication factor.

In general, I think some people find it surprising that reassignment can change the replication factor of a partition. When we outlined the reassignment API in KIP-455 we maintained the ability to do this, since the old ZK-based API had always been able to do it. But this was a bit controversial. Maybe it would have been more intuitive to preserve replication factor by default unless the user explicitly stated that they wanted to change it. So in a sense, you could view this as a fix for KIP-455 :) (in my opinion, at least)

best,
Colin


On Wed, Sep 7, 2022, at 07:07, Ismael Juma wrote:
> Thanks for the KIP. Can we explain a bit more why this is an important use
> case to address? For example, do we have concrete examples of people
> running into this? The way the KIP is written, it sounds like a potential
> problem but no information is given on whether it's a real problem in
> practice.
>
> Ismael