Posted to dev@kafka.apache.org by Jun Rao <ju...@confluent.io> on 2019/04/04 23:48:45 UTC

Re: [DISCUSS] KIP-435: Incremental Partition Reassignment

Hi, Viktor,

Thanks for the KIP. A couple of comments below.

1. Another potential way to do reassignment incrementally is to move a
batch of partitions at a time, instead of all partitions. This may lead to
less data replication since by the time the first batch of partitions has
been completely moved, some data of the next batch may have been deleted
due to retention and doesn't need to be replicated.

2. "Update CR in Zookeeper with TR for the given partition". Which ZK path
is this for?

Jun
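
Point 1 above (moving a batch of partitions at a time) can be sketched
roughly as follows. This is only an illustration under assumptions: the
function name partition_batches, the partition names and the batch size
are hypothetical and do not come from the KIP or from Kafka itself.

```python
from typing import Dict, Iterator, List


def partition_batches(
    reassignment: Dict[str, List[int]], batch_size: int
) -> Iterator[Dict[str, List[int]]]:
    """Yield the reassignment in submission order, batch_size partitions at a time.

    Hypothetical helper: a real controller would wait for each batch to
    finish before starting the next one.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    items = list(reassignment.items())  # dicts preserve insertion order
    for start in range(0, len(items), batch_size):
        yield dict(items[start:start + batch_size])


# Hypothetical target assignment: partition name -> target replica list.
target = {
    "topic-0": [5, 6, 7],
    "topic-1": [6, 7, 8],
    "topic-2": [7, 8, 5],
}
batches = list(partition_batches(target, batch_size=2))
```

Because later batches start only after earlier ones complete, data in
the later partitions that expires in the meantime never has to be copied.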

On Sat, Feb 23, 2019 at 2:12 AM Viktor Somogyi-Vass <vi...@gmail.com>
wrote:

> Hi Harsha,
>
> As far as I understand, KIP-236 is about enabling reassignment
> cancellation and, as a future plan, providing a queue of replica reassignment
> steps to allow manual reassignment chains. While I agree that the
> reassignment chain has a specific use case that allows fine-grained control
> over the reassignment process, my proposal on the other hand doesn't talk about
> cancellation; it only provides an automatic way to incrementalize an
> arbitrary reassignment, which I think fits the general use case where users
> don't want that level of control but would still like a balanced way of
> doing reassignments. Therefore I think it's still relevant as an improvement of
> the current algorithm.
> Nevertheless I'm happy to add my ideas to KIP-236 as I think it would be a
> great improvement to Kafka.
>
> Cheers,
> Viktor
>
> On Fri, Feb 22, 2019 at 5:05 PM Harsha <ka...@harsha.io> wrote:
>
> > Hi Viktor,
> > There is already KIP-236 for the same feature and George made
> > a PR for this as well.
> > Let's consolidate these two discussions. If you have any cases that are not
> > being solved by KIP-236 can you please mention them in that thread. We can
> > address them as part of KIP-236.
> >
> > Thanks,
> > Harsha
> >
> > On Fri, Feb 22, 2019, at 5:44 AM, Viktor Somogyi-Vass wrote:
> > > Hi Folks,
> > >
> > > I've created a KIP about an improvement of the reassignment algorithm we
> > > have. It aims to enable partition-wise incremental reassignment. The
> > > motivation for this is to avoid the excess load that the current replication
> > > algorithm implicitly carries, as there are points in the
> > > algorithm where both the new and old replica set could be online and
> > > replicating, which puts double (or almost double) pressure on the brokers
> > > and could cause problems.
> > > Instead my proposal would slice this up into several steps where each step
> > > is calculated based on the final target replicas and the current replica
> > > assignment, taking into account scenarios where brokers could be offline and
> > > when there are not enough replicas to fulfil the min.insync.replicas
> > > requirement.
> > >
> > > The link to the KIP:
> > >
> > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-435%3A+Incremental+Partition+Reassignment
> > >
> > > I'd be happy to receive any feedback.
> > >
> > > An important note is that this KIP and another one, KIP-236 that is
> > > about interruptible reassignment
> > > (https://cwiki.apache.org/confluence/display/KAFKA/KIP-236%3A+Interruptible+Partition+Reassignment)
> > > should be compatible.
> > >
> > > Thanks,
> > > Viktor
> > >
> >
>

Re: [DISCUSS] KIP-435: Incremental Partition Reassignment

Posted by Viktor Somogyi-Vass <vi...@gmail.com>.

Hi Colin,

There will certainly be some interaction, and what you suggested is a good
idea; I've added it to my KIP.
I'll start a new discussion thread and link this one.

Viktor

On Wed, Jun 26, 2019 at 11:39 PM Colin McCabe <cm...@apache.org> wrote:

> Hi Viktor,
>
> Good point.  Sorry, I should have read the KIP more closely.
>
> It would be good to change the title of the mail thread to reflect the new
> title of the KIP, "Internal Partition Reassignment Batching."
>
> I do think there will be some interaction with KIP-455 here.  One example
> is that we'll want a way of knowing what target replicas are currently
> being worked on.  So maybe we'll have to add a field to the structures
> returned by listPartitionReassignments.
>
> best,
> Colin
>
>
> On Wed, Jun 26, 2019, at 06:20, Viktor Somogyi-Vass wrote:
> > Hey Colin,
> >
> > I think there's some confusion here so I might change the name of this. So
> > KIP-435 is about the internal batching of reassignments (so purely a
> > controller change) and not about client side APIs. As of this moment these
> > kinds of improvements are listed in KIP-455's future work section so in my
> > understanding KIP-455 won't touch that :).
> > Let me know if I'm missing any points here.
> >
> > Viktor
> >
> > On Tue, Jun 25, 2019 at 9:02 PM Colin McCabe <cm...@apache.org> wrote:
> >
> > > Hi Viktor,
> > >
> > > Now that the 2.3 release is over, we're going to be turning our attention
> > > back to working on KIP-455, which provides an API for partition
> > > reassignment, and also solves the incremental reassignment problem.  Sorry
> > > about the pause, but I had to focus on the stuff that was going into 2.3.
> > >
> > > I think last time we talked about this, the consensus was that KIP-455
> > > supersedes KIP-435, since KIP-455 supports incremental reassignment.  We
> > > also don't want to add more technical debt in the form of a new
> > > ZooKeeper-based API that we'll have to support for a while.  So let's focus
> > > on KIP-455 here.  We have more resources now so I think we'll be able to
> > > get it done soonish.
> > >
> > > best,
> > > Colin
> > >
> > >
> > > On Tue, Jun 25, 2019, at 08:09, Viktor Somogyi-Vass wrote:
> > > > Hi All,
> > > >
> > > > I have added another improvement to this, which is to limit the parallel
> > > > leader movements. I think I'll soon (maybe late this week or early next)
> > > > start a vote on this too if there is no additional feedback.
> > > >
> > > > Thanks,
> > > > Viktor
> > > >
> > > > On Mon, Apr 29, 2019 at 1:26 PM Viktor Somogyi-Vass <viktorsomogyi@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi Folks,
> > > > >
> > > > > I've updated the KIP with the batching which would work on both replica
> > > > > and partition level. To explain it briefly: for instance if the replica
> > > > > level is set to 2 and partition level is set to 3, then 2x3=6 replica
> > > > > reassignments would be in progress at the same time. In case of reassignment
> > > > > for a single partition from (0, 1, 2, 3, 4) to (5, 6, 7, 8, 9) we would
> > > > > form the batches (0, 1) → (5, 6); (2, 3) → (7, 8) and 4 → 9 and would
> > > > > execute the reassignment in this order.
> > > > >
> > > > > Let me know what you think.
> > > > >
> > > > > Best,
> > > > > Viktor
> > > > >
> > > > > On Mon, Apr 15, 2019 at 7:01 PM Viktor Somogyi-Vass <
> > > > > viktorsomogyi@gmail.com> wrote:
> > > > >
> > > > > >> A follow up on the batching topic to clarify my points above.
> > > > > >>
> > > > > >> Generally I think that batching should be a core feature since, as Colin
> > > > > >> said, the controller should possess all the related information.
> > > > > >> Also Cruise Control (or really any 3rd party admin system) might build
> > > > > >> upon this to give a more holistic approach to balancing brokers. We may
> > > > > >> cater to them with APIs that act like building blocks to make their life
> > > > > >> easier, like incrementalization, batching, cancellation and rollback, but I
> > > > > >> think the more advanced we go, the more advanced a control surface we'll
> > > > > >> need, and Kafka's basic tooling might not be suitable for that.
> > > > >>
> > > > >> Best,
> > > > >> Viktor
> > > > >>
> > > > >>
> > > > > >> On Mon, 15 Apr 2019, 18:22 Viktor Somogyi-Vass, <viktorsomogyi@gmail.com>
> > > > > >> wrote:
> > > > >>
> > > > >>> Hey Guys,
> > > > >>>
> > > > >>> I'll reply to you all in this email:
> > > > >>>
> > > > >>> @Jun:
> > > > > >>> 1. Yes, it'd be a good idea to add this feature, I'll write this into
> > > > > >>> the KIP. I was actually thinking about introducing two dynamic configs called
> > > > > >>> reassignment.parallel.partition.count and
> > > > > >>> reassignment.parallel.replica.count. The first property would control how
> > > > > >>> many partition reassignments we can do concurrently. The second would go one
> > > > > >>> level deeper in granularity and would control how many replicas we want to move
> > > > > >>> for a given partition. Also one more thing that'd be useful to fix is that
> > > > > >>> a given list of partition -> replica list would be executed in the same
> > > > > >>> order (from first to last) so it's overall predictable and the user would
> > > > > >>> have some control over the order of reassignments, since
> > > > > >>> the JSON is still assembled by the user.
> > > > > >>> 2. The /kafka/brokers/topics/{topic} znode to be specific. I'll update
> > > > > >>> the KIP to contain this.
> > > > > >>>
> > > > > >>> @Jason:
> > > > > >>> I think building this functionality into Kafka would definitely benefit
> > > > > >>> all the users, and CC as well, as it'd simplify their software as you
> > > > > >>> said. As I understand it, the main advantage of CC and other similar software
> > > > > >>> is to give high level features for automatic load balancing. Reliability,
> > > > > >>> stability and predictability of the reassignment should be a core feature
> > > > > >>> of Kafka. I think the incrementalization feature would make it more stable.
> > > > > >>> I would consider cancellation too as a core feature and we can leave the
> > > > > >>> gate open for external tools to feed in their reassignment JSON as they
> > > > > >>> want. I was also thinking about what set of features we can provide
> > > > > >>> for Kafka but I think the more advanced we go the more need there is for an
> > > > > >>> administrative UI component.
> > > > > >>> Regarding KIP-352: Thanks for pointing this out, I didn't see this
> > > > > >>> although lately I was also thinking about the throttling aspect of it.
> > > > > >>> It would be a nice add-on to Kafka since, though the above configs provide some
> > > > > >>> level of control, it'd be nice to put an upper cap on the bandwidth and
> > > > > >>> make it monitorable.
> > > > >>>
> > > > >>> Viktor
> > > > >>>
> > > > >>> On Wed, Apr 10, 2019 at 2:57 AM Jason Gustafson <
> jason@confluent.io>
> > > > >>> wrote:
> > > > >>>
> > > > >>>> Hi Colin,
> > > > >>>>
> > > > >>>> On a related note, what do you think about the idea of storing
> the
> > > > >>>> > reassigning replicas in
> > > > >>>> > /brokers/topics/[topic]/partitions/[partitionId]/state, rather
> > > than
> > > > >>>> in the
> > > > >>>> > reassignment znode?  I don't think this requires a major
> change
> > > to the
> > > > >>>> > proposal-- when the controller becomes aware that it should
> do a
> > > > >>>> > reassignment, the controller could make the changes.  This
> also
> > > helps
> > > > >>>> keep
> > > > >>>> > the reassignment znode from getting larger, which has been a
> > > problem.
> > > > >>>>
> > > > >>>>
> > > > >>>> Yeah, I think it's a good idea to store the reassignment state
> at a
> > > > >>>> finer
> > > > >>>> level. I'm not sure the LeaderAndIsr znode is the right one
> though.
> > > > >>>> Another
> > > > >>>> option is /brokers/topics/{topic}. That is where we currently
> store
> > > the
> > > > >>>> replica assignment. I think we basically want to represent both
> the
> > > > >>>> current
> > > > >>>> state and the desired state. This would also open the door to a
> > > cleaner
> > > > >>>> way
> > > > >>>> to update a reassignment while it is still in progress.
> > > > >>>>
> > > > >>>> -Jason
> > > > >>>>
> > > > >>>>
> > > > >>>>
> > > > >>>>
> > > > >>>> On Mon, Apr 8, 2019 at 11:14 PM George Li <
> sql_consulting@yahoo.com
> > > > >>>> .invalid>
> > > > >>>> wrote:
> > > > >>>>
> > > > >>>> >  Hi Colin / Jason,
> > > > >>>> >
> > > > >>>> > Reassignment should really be doing a batches.  I am not too
> > > worried
> > > > >>>> about
> > > > >>>> > reassignment znode getting larger.  In a real production
> > > > >>>> environment,  too
> > > > >>>> > many concurrent reassignment and too frequent submission of
> > > > >>>> reassignments
> > > > >>>> > seemed to cause latency spikes of kafka cluster.  So
> > > > >>>> > batching/staggering/throttling of submitting reassignments is
> > > > >>>> recommended.
> > > > >>>> >
> > > > >>>> > In KIP-236,  The "originalReplicas" are only kept for the
> current
> > > > >>>> > reassigning partitions (small #), and kept in memory of the
> > > controller
> > > > >>>> > context partitionsBeingReassigned as well as in the znode
> > > > >>>> > /admin/reassign_partitions,  I think below "setting in the RPC
> > > like
> > > > >>>> null =
> > > > >>>> > no replicas are reassigning" is a good idea.
> > > > >>>> >
> > > > >>>> > There seems to be some issues with the Mail archive server of
> this
> > > > >>>> mailing
> > > > >>>> > list?  I didn't receive email after April 7th, and the
> archive for
> > > > >>>> April
> > > > >>>> > 2019 has only 50 messages (
> > > > >>>> >
> > > http://mail-archives.apache.org/mod_mbox/kafka-dev/201904.mbox/thread)
> > > > >>>> ?
> > > > >>>> >
> > > > >>>> > Thanks,
> > > > >>>> > George
> > > > >>>> >
> > > > >>>> >    on, 08 Apr 2019 17:54:48 GMT  Colin McCabe wrote:
> > > > >>>> >
> > > > >>>> >   Yeah, I think adding this information to LeaderAndIsr makes
> > > sense.
> > > > >>>> It
> > > > >>>> > would be better to track
> > > > >>>> > "reassigningReplicas" than "originalReplicas", I think.
> Tracking
> > > > >>>> > "originalReplicas" is going
> > > > >>>> > to involve sending a lot more data, since most replicas in the
> > > system
> > > > >>>> are
> > > > >>>> > not reassigning
> > > > >>>> > at any given point.  Or we would need a hack in the RPC like
> null
> > > = no
> > > > >>>> > replicas are reassigning.
> > > > >>>> >
> > > > >>>> > On a related note, what do you think about the idea of
> storing the
> > > > >>>> > reassigning replicas in
> > > > >>>> >  /brokers/topics/[topic]/partitions/[partitionId]/state,
> rather
> > > than
> > > > >>>> in
> > > > >>>> > the reassignment znode?
> > > > >>>> >  I don't think this requires a major change to the proposal--
> > > when the
> > > > >>>> > controller becomes
> > > > >>>> > aware that it should do a reassignment, the controller could
> make
> > > the
> > > > >>>> > changes.  This also
> > > > >>>> > helps keep the reassignment znode from getting larger, which
> has
> > > been
> > > > >>>> a
> > > > >>>> > problem.
> > > > >>>> >
> > > > >>>> > best,
> > > > >>>> > Colin
> > > > >>>> >
> > > > >>>> >
> > > > >>>> > On Mon, Apr 8, 2019, at 09:29, Jason Gustafson wrote:
> > > > >>>> > > Hey George,
> > > > >>>> > >
> > > > >>>> > > For the URP during a reassignment,  if the
> "original_replicas"
> > > is
> > > > >>>> kept
> > > > >>>> > for
> > > > >>>> > > > the current pending reassignment. I think it will be very
> > > easy to
> > > > >>>> > compare
> > > > >>>> > > > that with the topic/partition's ISR.  If all
> > > "original_replicas"
> > > > >>>> are in
> > > > >>>> > > > ISR, then URP should be 0 for that topic/partition.
> > > > >>>> > >
> > > > >>>> > >
> > > > >>>> > > Yeah, that makes sense. But I guess we would need
> > > > >>>> "original_replicas" to
> > > > >>>> > be
> > > > >>>> > > propagated to partition leaders in the LeaderAndIsr request
> > > since
> > > > >>>> leaders
> > > > >>>> > > are the ones that are computing URPs. That is basically what
> > > > >>>> KIP-352 had
> > > > >>>> > > proposed, but we also need the changes to the reassignment
> path.
> > > > >>>> Perhaps
> > > > >>>> > it
> > > > >>>> > > makes more sense to address this problem in KIP-236 since
> that
> > > is
> > > > >>>> where
> > > > >>>> > you
> > > > >>>> > > have already introduced "original_replicas"? I'm also happy
> to
> > > do
> > > > >>>> KIP-352
> > > > >>>> > > as a follow-up to KIP-236.
> > > > >>>> > >
> > > > >>>> > > Best,
> > > > >>>> > > Jason
> > > > >>>> > >
> > > > >>>> > >
> > > > >>>> > > On Sun, Apr 7, 2019 at 5:09 PM Ismael Juma <
> ismaelj@gmail.com>
> > > > >>>> wrote:
> > > > >>>> > >
> > > > >>>> > > > Good discussion about where we should do batching. I
> think if
> > > > >>>> there is
> > > > >>>> > a
> > > > >>>> > > > clear great way to batch, then it makes a lot of sense to
> > > just do
> > > > >>>> it
> > > > >>>> > once.
> > > > >>>> > > > However, if we think there is scope for experimenting with
> > > > >>>> different
> > > > >>>> > > > approaches, then an API that tools can use makes a lot of
> > > sense.
> > > > >>>> They
> > > > >>>> > can
> > > > >>>> > > > experiment and innovate. Eventually, we can integrate
> > > something
> > > > >>>> into
> > > > >>>> > Kafka
> > > > >>>> > > > if it makes sense.
> > > > >>>> > > >
> > > > >>>> > > > Ismael
> > > > >>>> > > >
> > > > >>>> > > > On Sun, Apr 7, 2019, 11:03 PM Colin McCabe <
> > > cmccabe@apache.org>
> > > > >>>> wrote:
> > > > >>>> > > >
> > > > >>>> > > > > Hi George,
> > > > >>>> > > > >
> > > > >>>> > > > > As Jason was saying, it seems like there are two
> directions
> > > we
> > > > >>>> could
> > > > >>>> > go
> > > > >>>> > > > > here: an external system handling batching, and the
> > > controller
> > > > >>>> > handling
> > > > >>>> > > > > batching.  I think the controller handling batching
> would be
> > > > >>>> better,
> > > > >>>> > > > since
> > > > >>>> > > > > the controller has more information about the state of
> the
> > > > >>>> system.
> > > > >>>> > If
> > > > >>>> > > > the
> > > > >>>> > > > > controller handles batching, then the controller could
> also
> > > > >>>> handle
> > > > >>>> > things
> > > > >>>> > > > > like setting up replication quotas for individual
> > > partitions.
> > > > >>>> The
> > > > >>>> > > > > controller could do things like throttle replication
> down
> > > if the
> > > > >>>> > cluster
> > > > >>>> > > > > was having problems.
> > > > >>>> > > > >
> > > > >>>> > > > > We kind of need to figure out which way we're going to
> go on
> > > > >>>> this one
> > > > >>>> > > > > before we set up big new APIs, I think.  If we want an
> > > external
> > > > >>>> > system to
> > > > >>>> > > > > handle batching, then we can keep the idea that there is
> > > only
> > > > >>>> one
> > > > >>>> > > > > reassignment in progress at once.  If we want the
> > > controller to
> > > > >>>> > handle
> > > > >>>> > > > > batching, we will need to get away from that idea.
> > > Instead, we
> > > > >>>> > should
> > > > >>>> > > > just
> > > > >>>> > > > > have a bunch of "ideal assignments" that we tell the
> > > controller
> > > > >>>> > about,
> > > > >>>> > > > and
> > > > >>>> > > > > let it decide how to do the batching.  These ideal
> > > assignments
> > > > >>>> could
> > > > >>>> > > > change
> > > > >>>> > > > > continuously over time, so from the admin's point of
> view,
> > > there
> > > > >>>> > would be
> > > > >>>> > > > > no start/stop/cancel, but just individual partition
> > > > >>>> reassignments
> > > > >>>> > that we
> > > > >>>> > > > > submit, perhaps over a long period of time.  And then
> > > > >>>> cancellation
> > > > >>>> > might
> > > > >>>> > > > > just mean cancelling just that individual partition
> > > > >>>> reassignment,
> > > > >>>> > not all
> > > > >>>> > > > > partition reassignments.
> > > > >>>> > > > >
> > > > >>>> > > > > best,
> > > > >>>> > > > > Colin
> > > > >>>> > > > >
> > > > >>>> > > > > On Fri, Apr 5, 2019, at 19:34, George Li wrote:
> > > > >>>> > > > > >  Hi Jason / Viktor,
> > > > >>>> > > > > >
> > > > >>>> > > > > > For the URP during a reassignment,  if the
> > > > >>>> "original_replicas" is
> > > > >>>> > kept
> > > > >>>> > > > > > for the current pending reassignment. I think it will
> be
> > > very
> > > > >>>> easy
> > > > >>>> > to
> > > > >>>> > > > > > compare that with the topic/partition's ISR.  If all
> > > > >>>> > > > > > "original_replicas" are in ISR, then URP should be 0
> for
> > > that
> > > > >>>> > > > > > topic/partition.
> > > > >>>> > > > > >
> > > > >>>> > > > > > It would be also nice to separate the metrics
> > > MaxLag/TotalLag
> > > > >>>> for
> > > > >>>> > > > > > Reassignments. I think that will also require
> > > > >>>> "original_replicas"
> > > > >>>> > (the
> > > > >>>> > > > > > topic/partition's replicas just before reassignment
> when
> > > the
> > > > >>>> AR
> > > > >>>> > > > > > (Assigned Replicas) is set to Set(original_replicas) +
> > > > >>>> > > > > > Set(new_replicas_in_reassign_partitions) ).
> > > > >>>> > > > > >
> > > > >>>> > > > > > Thanks,
> > > > >>>> > > > > > George
> > > > >>>> > > > > >
> > > > >>>> > > > > >     On Friday, April 5, 2019, 6:29:55 PM PDT, Jason
> > > Gustafson
> > > > >>>> > > > > > <ja...@confluent.io> wrote:
> > > > >>>> > > > > >
> > > > >>>> > > > > >  Hi Viktor,
> > > > >>>> > > > > >
> > > > >>>> > > > > > Thanks for writing this up. As far as questions about
> > > overlap
> > > > >>>> with
> > > > >>>> > > > > KIP-236,
> > > > >>>> > > > > > I agree it seems mostly orthogonal. I think KIP-236
> may
> > > have
> > > > >>>> had a
> > > > >>>> > > > larger
> > > > >>>> > > > > > initial scope, but now it focuses on cancellation and
> > > > >>>> batching is
> > > > >>>> > left
> > > > >>>> > > > > for
> > > > >>>> > > > > > future work.
> > > > >>>> > > > > >
> > > > >>>> > > > > > With that said, I think we may not actually need a KIP
> > > for the
> > > > >>>> > current
> > > > >>>> > > > > > proposal since it doesn't change any APIs. To make it
> more
> > > > >>>> > generally
> > > > >>>> > > > > > useful, however, it would be nice to handle batching
> at
> > > the
> > > > >>>> > partition
> > > > >>>> > > > > level
> > > > >>>> > > > > > as well as Jun suggests. The basic question is at what
> > > level
> > > > >>>> > should the
> > > > >>>> > > > > > batching be determined. You could rely on external
> > > processes
> > > > >>>> (e.g.
> > > > >>>> > > > cruise
> > > > >>>> > > > > > control) or it could be built into the controller.
> There
> > > are
> > > > >>>> > tradeoffs
> > > > >>>> > > > > > either way, but I think it simplifies such tools if
> it is
> > > > >>>> handled
> > > > >>>> > > > > > internally. Then it would be much safer to submit a
> larger
> > > > >>>> > reassignment
> > > > >>>> > > > > > even just using the simple tools that come with Kafka.
> > > > >>>> > > > > >
> > > > >>>> > > > > > By the way, since you are looking into some of the
> > > > >>>> reassignment
> > > > >>>> > logic,
> > > > >>>> > > > > > another problem that we might want to address is the
> > > > >>>> misleading
> > > > >>>> > way we
> > > > >>>> > > > > > report URPs during a reassignment. I had a naive
> proposal
> > > for
> > > > >>>> this
> > > > >>>> > > > > > previously, but it didn't really work
> > > > >>>> > > > > >
> > > > >>>> > > > >
> > > > >>>> > > >
> > > > >>>> >
> > > > >>>>
> > >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-352%3A+Distinguish+URPs+caused+by+reassignment
> > > > >>>> > > > > .
> > > > >>>> > > > > > Potentially fixing that could fall under this work as
> > > well if
> > > > >>>> you
> > > > >>>> > think
> > > > >>>> > > > > > it
> > > > >>>> > > > > > makes sense.
> > > > >>>> > > > > >
> > > > >>>> > > > > > Best,
> > > > >>>> > > > > > Jason
> > > > >>>> > > > > >
> > > > >>>> > > > > > On Thu, Apr 4, 2019 at 4:49 PM Jun Rao <
> jun@confluent.io>
> > > > >>>> wrote:
> > > > >>>> > > > > >
> > > > >>>> > > > > > > Hi, Viktor,
> > > > >>>> > > > > > >
> > > > >>>> > > > > > > Thanks for the KIP. A couple of comments below.
> > > > >>>> > > > > > >
> > > > >>>> > > > > > > 1. Another potential thing to do reassignment
> > > incrementally
> > > > >>>> is to
> > > > >>>> > > > move
> > > > >>>> > > > > a
> > > > >>>> > > > > > > batch of partitions at a time, instead of all
> > > partitions.
> > > > >>>> This
> > > > >>>> > may
> > > > >>>> > > > > lead to
> > > > >>>> > > > > > > less data replication since by the time the first
> batch
> > > of
> > > > >>>> > partitions
> > > > >>>> > > > > have
> > > > >>>> > > > > > > been completely moved, some data of the next batch
> may
> > > have
> > > > >>>> been
> > > > >>>> > > > > deleted
> > > > >>>> > > > > > > due to retention and doesn't need to be replicated.
> > > > >>>> > > > > > >
> > > > >>>> > > > > > > 2. "Update CR in Zookeeper with TR for the given
> > > partition".
> > > > >>>> > Which
> > > > >>>> > ZK
> > > > >>>> > > > > path
> > > > >>>> > > > > > > is this for?
> > > > >>>> > > > > > >
> > > > >>>> > > > > > > Jun
> > > > >>>> > > > > > >
> > > > >>>> > > > > > > On Sat, Feb 23, 2019 at 2:12 AM Viktor Somogyi-Vass
> <
> > > > >>>> > > > > > > viktorsomogyi@gmail.com>
> > > > >>>> > > > > > > wrote:
> > > > >>>> > > > > > >
> > > > >>>> > > > > > > > Hi Harsha,
> > > > >>>> > > > > > > >
> > > > >>>> > > > > > > > As far as I understand KIP-236 it's about enabling
> > > > >>>> reassignment
> > > > >>>> > > > > > > > cancellation and as a future plan providing a
> queue of
> > > > >>>> replica
> > > > >>>> > > > > > > reassignment
> > > > >>>> > > > > > > > steps to allow manual reassignment chains. While I
> > > agree
> > > > >>>> that
> > > > >>>> > the
> > > > >>>> > > > > > > > reassignment chain has a specific use case that
> allows
> > > > >>>> fine
> > > > >>>> > grain
> > > > >>>> > > > > control
> > > > >>>> > > > > > > > over reassignment process, My proposal on the
> other
> > > hand
> > > > >>>> > doesn't
> > > > >>>> > > > talk
> > > > >>>> > > > > > > about
> > > > >>>> > > > > > > > cancellation but it only provides an automatic
> way to
> > > > >>>> > > > incrementalize
> > > > >>>> > > > > an
> > > > >>>> > > > > > > > arbitrary reassignment which I think fits the
> general
> > > use
> > > > >>>> case
> > > > >>>> > > > where
> > > > >>>> > > > > > > users
> > > > >>>> > > > > > > > don't want that level of control but still would
> like
> > > a
> > > > >>>> > balanced
> > > > >>>> > > > way
> > > > >>>> > > > > of
> > > > >>>> > > > > > > > reassignments. Therefore I think it's still
> relevant
> > > as an
> > > > >>>> > > > > improvement of
> > > > >>>> > > > > > > > the current algorithm.
> > > > >>>> > > > > > > > Nevertheless I'm happy to add my ideas to KIP-236
> as I
> > > > >>>> think
> > > > >>>> > it
> > > > >>>> > > > > would be
> > > > >>>> > > > > > > a
> > > > >>>> > > > > > > > great improvement to Kafka.
> > > > >>>> > > > > > > >
> > > > >>>> > > > > > > > Cheers,
> > > > >>>> > > > > > > > Viktor
> > > > >>>> > > > > > > >
> > > > >>>> > > > > > > > On Fri, Feb 22, 2019 at 5:05 PM Harsha <
> > > kafka@harsha.io>
> > > > >>>> > wrote:
> > > > >>>> > > > > > > >
> > > > >>>> > > > > > > > > Hi Viktor,
> > > > >>>> > > > > > > > >            There is already KIP-236 for the same
> > > feature
> > > > >>>> > and
> > > > >>>> > > > George
> > > > >>>> > > > > > > made
> > > > >>>> > > > > > > > > a PR for this as well.
> > > > >>>> > > > > > > > > Lets consolidate these two discussions. If you
> have
> > > any
> > > > >>>> > cases
> > > > >>>> > > > that
> > > > >>>> > > > > are
> > > > >>>> > > > > > > > not
> > > > >>>> > > > > > > > > being solved by KIP-236 can you please mention
> them
> > > in
> > > > >>>> > that
> > > > >>>> > > > > thread. We
> > > > >>>> > > > > > > > can
> > > > >>>> > > > > > > > > address as part of KIP-236.
> > > > >>>> > > > > > > > >
> > > > >>>> > > > > > > > > Thanks,
> > > > >>>> > > > > > > > > Harsha
> > > > >>>> > > > > > > > >
> > > > >>>> > > > > > > > > On Fri, Feb 22, 2019, at 5:44 AM, Viktor
> > > Somogyi-Vass
> > > > >>>> wrote:
> > > > >>>> > > > > > > > > > Hi Folks,
> > > > >>>> > > > > > > > > >
> > > > >>>> > > > > > > > > > I've created a KIP about an improvement of the
> > > > >>>> reassignment
> > > > >>>> > > > > algorithm
> > > > >>>> > > > > > > > we
> > > > >>>> > > > > > > > > > have. It aims to enable partition-wise
> incremental
> > > > >>>> > > > reassignment.
> > > > >>>> > > > > The
> > > > >>>> > > > > > > > > > motivation for this is to avoid excess load
> that
> > > the
> > > > >>>> > current
> > > > >>>> > > > > > > > replication
> > > > >>>> > > > > > > > > > algorithm implicitly carries as in that case
> there
> > > > >>>> > are points
> > > > >>>> > > > in
> > > > >>>> > > > > the
> > > > >>>> > > > > > > > > > algorithm where both the new and old replica
> set
> > > could
> > > > >>>> > be
> > > > >>>> > > > online
> > > > >>>> > > > > and
> > > > >>>> > > > > > > > > > replicating which puts double (or almost
> double)
> > > > >>>> pressure
> > > > >>>> > on
> > > > >>>> > > > the
> > > > >>>> > > > > > > > brokers
> > > > >>>> > > > > > > > > > which could cause problems.
> > > > >>>> > > > > > > > > > Instead my proposal would slice this up into
> > > several
> > > > >>>> > steps
> > > > >>>> > > > where
> > > > >>>> > > > > each
> > > > >>>> > > > > > > > > step
> > > > >>>> > > > > > > > > > is calculated based on the final target
> replicas
> > > and
> > > > >>>> > the
> > > > >>>> > > > current
> > > > >>>> > > > > > > > replica
> > > > >>>> > > > > > > > > > assignment taking into account scenarios where
> > > brokers
> > > > >>>> > could be
> > > > >>>> > > > > > > offline
> > > > >>>> > > > > > > > > and
> > > > >>>> > > > > > > > > > when there are not enough replicas to fulfil
> the
> > > > >>>> > > > > min.insync.replica
> > > > >>>> > > > > > > > > > requirement.
> > > > >>>> > > > > > > > > >
> > > > >>>> > > > > > > > > > The link to the KIP:
> > > > >>>> > > > > > > > > >
> > > > >>>> > > > > > > > >
> > > > >>>> > > > > > > >
> > > > >>>> > > > > > >
> > > > >>>> > > > >
> > > > >>>> > > >
> > > > >>>> >
> > > > >>>>
> > >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-435%3A+Incremental+Partition+Reassignment
> > > > >>>> > > > > > > > > >
> > > > >>>> > > > > > > > > > I'd be happy to receive any feedback.
> > > > >>>> > > > > > > > > >
> > > > >>>> > > > > > > > > > An important note is that this KIP and another
> > > one,
> > > > >>>> > KIP-236
> > > > >>>> > > > that
> > > > >>>> > > > > is
> > > > >>>> > > > > > > > > > about
> > > > >>>> > > > > > > > > > interruptible reassignment (
> > > > >>>> > > > > > > > > >
> > > > >>>> > > > > > > > >
> > > > >>>> > > > > > > >
> > > > >>>> > > > > > >
> > > > >>>> > > > >
> > > > >>>> > > >
> > > > >>>> >
> > > > >>>>
> > >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-236%3A+Interruptible+Partition+Reassignment
> > > > >>>> > > > > > > > > )
> > > > >>>> > > > > > > > > > should be compatible.
> > > > >>>> > > > > > > > > >
> > > > >>>> > > > > > > > > > Thanks,
> > > > >>>> > > > > > > > > > Viktor
> > > > >>>> > > > > > > > > >
> > > > >>>> > > > > > > > >
> > > > >>>> > > > > > > >
> > > > >>>> > > > > > >
> > > > >>>> > > > > >
> > > > >>>> > > > >
> > > > >>>> > > >
> > > > >>>> > >
> > > > >>>> >
> > > > >>>>
> > > > >>>
> > > >
> > >
> >
>
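
The replica-level batching example quoted in this thread (moving one
partition from (0, 1, 2, 3, 4) to (5, 6, 7, 8, 9) with a replica level of
2) can be sketched as below. This is a simplified illustration, not the
KIP's algorithm: the names replica_batches and max_in_flight are
hypothetical, and the sketch ignores offline brokers and the
min.insync.replicas check that the KIP says it takes into account.

```python
from typing import List, Tuple


def replica_batches(
    current: List[int], target: List[int], replica_batch_size: int
) -> List[Tuple[List[int], List[int]]]:
    """Pair replicas to drop with replicas to add, replica_batch_size at a time."""
    if replica_batch_size < 1:
        raise ValueError("replica_batch_size must be at least 1")
    # Brokers present in both lists do not need to move.
    to_drop = [b for b in current if b not in target]
    to_add = [b for b in target if b not in current]
    steps = []
    for start in range(0, max(len(to_drop), len(to_add)), replica_batch_size):
        end = start + replica_batch_size
        steps.append((to_drop[start:end], to_add[start:end]))
    return steps


def max_in_flight(replica_level: int, partition_level: int) -> int:
    """Upper bound on concurrent replica moves, as in the 2x3=6 example."""
    return replica_level * partition_level


steps = replica_batches([0, 1, 2, 3, 4], [5, 6, 7, 8, 9], replica_batch_size=2)
```

Here steps holds three entries, matching the order given in the thread:
(0, 1) → (5, 6), then (2, 3) → (7, 8), then 4 → 9.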

Re: [DISCUSS] KIP-435: Incremental Partition Reassignment

Posted by Colin McCabe <cm...@apache.org>.

Hi Viktor,

Good point.  Sorry, I should have read the KIP more closely.

It would be good to change the title of the mail thread to reflect the new title of the KIP, "Internal Partition Reassignment Batching."

I do think there will be some interaction with KIP-455 here.  One example is that we'll want a way of knowing what target replicas are currently being worked on.  So maybe we'll have to add a field to the structures returned by listPartitionReassignments.

best,
Colin


On Wed, Jun 26, 2019, at 06:20, Viktor Somogyi-Vass wrote:
> Hey Colin,
> 
> I think there's some confusion here so I might change the name of this. So
> KIP-435 is about the internal batching of reassignments (so purely a
> controller change) and not about client side APIs. As of this moment these
> kinds of improvements are listed in KIP-455's future work section so in my
> understanding KIP-455 won't touch that :).
> Let me know if I'm missing any points here.
> 
> Viktor
> 
> On Tue, Jun 25, 2019 at 9:02 PM Colin McCabe <cm...@apache.org> wrote:
> 
> > Hi Viktor,
> >
> > Now that the 2.3 release is over, we're going to be turning our attention
> > back to working on KIP-455, which provides an API for partition
> > reassignment, and also solves the incremental reassignment problem.  Sorry
> > about the pause, but I had to focus on the stuff that was going into 2.3.
> >
> > I think last time we talked about this, the consensus was that KIP-455
> > supersedes KIP-435, since KIP-455 supports incremental reassignment.  We
> > also don't want to add more technical debt in the form of a new
> > ZooKeeper-based API that we'll have to support for a while.  So let's focus
> > on KIP-455 here.  We have more resources now so I think we'll be able to
> > get it done soonish.
> >
> > best,
> > Colin
> >
> >
> > On Tue, Jun 25, 2019, at 08:09, Viktor Somogyi-Vass wrote:
> > > Hi All,
> > >
> > > I have added another improvement to this, which is to limit the number of
> > > parallel leader movements. I think I'll soon (maybe late this week or early
> > > next) start a vote on this too if there is no additional feedback.
> > >
> > > Thanks,
> > > Viktor
> > >
> > > On Mon, Apr 29, 2019 at 1:26 PM Viktor Somogyi-Vass <
> > viktorsomogyi@gmail.com>
> > > wrote:
> > >
> > > > Hi Folks,
> > > >
> > > > I've updated the KIP with batching that would work on both the replica
> > > > and the partition level. To explain it briefly: if, for instance, the
> > > > replica level is set to 2 and the partition level is set to 3, then 2x3=6
> > > > replica reassignments would be in progress at the same time. In the case
> > > > of a reassignment of a single partition from (0, 1, 2, 3, 4) to
> > > > (5, 6, 7, 8, 9), we would form the batches (0, 1) → (5, 6); (2, 3) → (7, 8)
> > > > and 4 → 9 and would execute the reassignment in this order.
> > > >
> > > > Let me know what you think.
> > > >
> > > > Best,
> > > > Viktor
> > > >
> > > > On Mon, Apr 15, 2019 at 7:01 PM Viktor Somogyi-Vass <
> > > > viktorsomogyi@gmail.com> wrote:
> > > >
> > > >> A follow up on the batching topic to clarify my points above.
> > > >>
> > > >> Generally I think that batching should be a core feature since, as Colin
> > > >> said, the controller should possess all the related information.
> > > >> Also, Cruise Control (or really any 3rd party admin system) might build
> > > >> upon this to give a more holistic approach to balancing brokers. We may
> > > >> provide them with APIs that act as building blocks to make their life
> > > >> easier, like incrementalization, batching, cancellation and rollback, but
> > > >> I think the more advanced we go, the more advanced a control surface we'll
> > > >> need, and Kafka's basic tooling might not be suitable for that.
> > > >>
> > > >> Best,
> > > >> Viktor
> > > >>
> > > >>
> > > >> On Mon, 15 Apr 2019, 18:22 Viktor Somogyi-Vass, <
> > viktorsomogyi@gmail.com>
> > > >> wrote:
> > > >>
> > > >>> Hey Guys,
> > > >>>
> > > >>> I'll reply to you all in this email:
> > > >>>
> > > >>> @Jun:
> > > >>> 1. Yes, it'd be a good idea to add this feature; I'll write this into
> > > >>> the KIP. I was actually thinking about introducing two dynamic configs
> > > >>> called reassignment.parallel.partition.count and
> > > >>> reassignment.parallel.replica.count. The first property would control
> > > >>> how many partition reassignments we can do concurrently. The second would
> > > >>> go one level finer in granularity and would control how many replicas we
> > > >>> want to move in parallel for a given partition. One more thing that'd be
> > > >>> useful to fix is that a given list of partition -> replica list entries
> > > >>> would be executed in the same order (from first to last), so it's
> > > >>> predictable overall and the user would have some control over the order
> > > >>> in which reassignments are executed, as the JSON is still assembled by
> > > >>> the user.
> > > >>> 2. The /kafka/brokers/topics/{topic} znode, to be specific. I'll update
> > > >>> the KIP to contain this.
> > > >>>
> > > >>> @Jason:
> > > >>> I think building this functionality into Kafka would definitely benefit
> > > >>> all users, and CC as well, since it'd simplify their software as you
> > > >>> said. As I understand it, the main advantage of CC and other similar
> > > >>> software is to provide high-level features for automatic load balancing.
> > > >>> Reliability, stability and predictability of the reassignment should be
> > > >>> core features of Kafka. I think the incrementalization feature would make
> > > >>> it more stable. I would consider cancellation a core feature too, and we
> > > >>> can leave the gate open for external tools to feed in their reassignment
> > > >>> JSON as they want. I was also thinking about what set of features we can
> > > >>> provide in Kafka, but I think the more advanced we go, the more need
> > > >>> there is for an administrative UI component.
> > > >>> Regarding KIP-352: thanks for pointing this out. I hadn't seen it,
> > > >>> although lately I was also thinking about the throttling aspect of it.
> > > >>> It would be a nice add-on to Kafka: though the above configs provide some
> > > >>> level of control, it'd be nice to put an upper cap on the bandwidth and
> > > >>> make it monitorable.
> > > >>>
> > > >>> Viktor
> > > >>>
> > > >>> On Wed, Apr 10, 2019 at 2:57 AM Jason Gustafson <ja...@confluent.io>
> > > >>> wrote:
> > > >>>
> > > >>>> Hi Colin,
> > > >>>>
> > > >>>> > On a related note, what do you think about the idea of storing the
> > > >>>> > reassigning replicas in
> > > >>>> > /brokers/topics/[topic]/partitions/[partitionId]/state, rather than in
> > > >>>> > the reassignment znode?  I don't think this requires a major change to
> > > >>>> > the proposal-- when the controller becomes aware that it should do a
> > > >>>> > reassignment, the controller could make the changes.  This also helps
> > > >>>> > keep the reassignment znode from getting larger, which has been a
> > > >>>> > problem.
> > > >>>>
> > > >>>>
> > > >>>> Yeah, I think it's a good idea to store the reassignment state at a
> > > >>>> finer level. I'm not sure the LeaderAndIsr znode is the right one though.
> > > >>>> Another option is /brokers/topics/{topic}. That is where we currently
> > > >>>> store the replica assignment. I think we basically want to represent both
> > > >>>> the current state and the desired state. This would also open the door to
> > > >>>> a cleaner way to update a reassignment while it is still in progress.
> > > >>>>
> > > >>>> -Jason
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>> On Mon, Apr 8, 2019 at 11:14 PM George Li <sql_consulting@yahoo.com
> > > >>>> .invalid>
> > > >>>> wrote:
> > > >>>>
> > > >>>> >  Hi Colin / Jason,
> > > >>>> >
> > > >>>> > Reassignment should really be done in batches.  I am not too worried
> > > >>>> > about the reassignment znode getting larger.  In a real production
> > > >>>> > environment, too many concurrent reassignments and too-frequent
> > > >>>> > submission of reassignments seemed to cause latency spikes in the Kafka
> > > >>>> > cluster.  So batching/staggering/throttling the submission of
> > > >>>> > reassignments is recommended.
> > > >>>> >
> > > >>>> > In KIP-236, the "originalReplicas" are only kept for the currently
> > > >>>> > reassigning partitions (a small number), both in memory in the
> > > >>>> > controller context's partitionsBeingReassigned and in the znode
> > > >>>> > /admin/reassign_partitions.  I think the suggestion below of a setting
> > > >>>> > in the RPC like "null = no replicas are reassigning" is a good idea.
> > > >>>> >
> > > >>>> > There seem to be some issues with the mail archive server of this
> > > >>>> > mailing list?  I didn't receive email after April 7th, and the archive
> > > >>>> > for April 2019 has only 50 messages
> > > >>>> > (http://mail-archives.apache.org/mod_mbox/kafka-dev/201904.mbox/thread)?
> > > >>>> >
> > > >>>> > Thanks,
> > > >>>> > George
> > > >>>> >
> > > >>>> > On Mon, 08 Apr 2019 17:54:48 GMT, Colin McCabe wrote:
> > > >>>> >
> > > >>>> > Yeah, I think adding this information to LeaderAndIsr makes sense.  It
> > > >>>> > would be better to track "reassigningReplicas" than "originalReplicas",
> > > >>>> > I think.  Tracking "originalReplicas" is going to involve sending a lot
> > > >>>> > more data, since most replicas in the system are not reassigning at any
> > > >>>> > given point.  Or we would need a hack in the RPC like null = no
> > > >>>> > replicas are reassigning.
> > > >>>> >
> > > >>>> > On a related note, what do you think about the idea of storing the
> > > >>>> > reassigning replicas in
> > > >>>> > /brokers/topics/[topic]/partitions/[partitionId]/state, rather than in
> > > >>>> > the reassignment znode?  I don't think this requires a major change to
> > > >>>> > the proposal-- when the controller becomes aware that it should do a
> > > >>>> > reassignment, the controller could make the changes.  This also helps
> > > >>>> > keep the reassignment znode from getting larger, which has been a
> > > >>>> > problem.
> > > >>>> >
> > > >>>> > best,
> > > >>>> > Colin
> > > >>>> >
> > > >>>> >
> > > >>>> > On Mon, Apr 8, 2019, at 09:29, Jason Gustafson wrote:
> > > >>>> > > Hey George,
> > > >>>> > >
> > > >>>> > > > For the URP during a reassignment, if the "original_replicas" are
> > > >>>> > > > kept for the current pending reassignment, I think it will be very
> > > >>>> > > > easy to compare that with the topic/partition's ISR.  If all
> > > >>>> > > > "original_replicas" are in ISR, then URP should be 0 for that
> > > >>>> > > > topic/partition.
> > > >>>> > >
> > > >>>> > >
> > > >>>> > > Yeah, that makes sense. But I guess we would need "original_replicas"
> > > >>>> > > to be propagated to partition leaders in the LeaderAndIsr request,
> > > >>>> > > since leaders are the ones that are computing URPs. That is basically
> > > >>>> > > what KIP-352 had proposed, but we also need the changes to the
> > > >>>> > > reassignment path. Perhaps it makes more sense to address this problem
> > > >>>> > > in KIP-236 since that is where you have already introduced
> > > >>>> > > "original_replicas"? I'm also happy to do KIP-352 as a follow-up to
> > > >>>> > > KIP-236.
> > > >>>> > >
> > > >>>> > > Best,
> > > >>>> > > Jason
> > > >>>> > >
> > > >>>> > >
> > > >>>> > > On Sun, Apr 7, 2019 at 5:09 PM Ismael Juma <is...@gmail.com>
> > > >>>> wrote:
> > > >>>> > >
> > > >>>> > > > Good discussion about where we should do batching. I think if there
> > > >>>> > > > is a clear great way to batch, then it makes a lot of sense to just
> > > >>>> > > > do it once. However, if we think there is scope for experimenting
> > > >>>> > > > with different approaches, then an API that tools can use makes a
> > > >>>> > > > lot of sense. They can experiment and innovate. Eventually, we can
> > > >>>> > > > integrate something into Kafka if it makes sense.
> > > >>>> > > >
> > > >>>> > > > Ismael
> > > >>>> > > >
> > > >>>> > > > On Sun, Apr 7, 2019, 11:03 PM Colin McCabe <
> > cmccabe@apache.org>
> > > >>>> wrote:
> > > >>>> > > >
> > > >>>> > > > > Hi George,
> > > >>>> > > > >
> > > >>>> > > > > As Jason was saying, it seems like there are two directions we
> > > >>>> > > > > could go here: an external system handling batching, and the
> > > >>>> > > > > controller handling batching.  I think the controller handling
> > > >>>> > > > > batching would be better, since the controller has more
> > > >>>> > > > > information about the state of the system.  If the controller
> > > >>>> > > > > handles batching, then the controller could also handle things
> > > >>>> > > > > like setting up replication quotas for individual partitions.
> > > >>>> > > > > The controller could do things like throttle replication down if
> > > >>>> > > > > the cluster was having problems.
> > > >>>> > > > >
> > > >>>> > > > > We kind of need to figure out which way we're going to go on
> > > >>>> > > > > this one before we set up big new APIs, I think.  If we want an
> > > >>>> > > > > external system to handle batching, then we can keep the idea
> > > >>>> > > > > that there is only one reassignment in progress at once.  If we
> > > >>>> > > > > want the controller to handle batching, we will need to get away
> > > >>>> > > > > from that idea.  Instead, we should just have a bunch of "ideal
> > > >>>> > > > > assignments" that we tell the controller about, and let it
> > > >>>> > > > > decide how to do the batching.  These ideal assignments could
> > > >>>> > > > > change continuously over time, so from the admin's point of
> > > >>>> > > > > view, there would be no start/stop/cancel, but just individual
> > > >>>> > > > > partition reassignments that we submit, perhaps over a long
> > > >>>> > > > > period of time.  And then cancellation might just mean
> > > >>>> > > > > cancelling just that individual partition reassignment, not all
> > > >>>> > > > > partition reassignments.
> > > >>>> > > > >
> > > >>>> > > > > best,
> > > >>>> > > > > Colin
> > > >>>> > > > >
> > > >>>> > > > > On Fri, Apr 5, 2019, at 19:34, George Li wrote:
> > > >>>> > > > > >  Hi Jason / Viktor,
> > > >>>> > > > > >
> > > >>>> > > > > > For the URP during a reassignment, if the "original_replicas"
> > > >>>> > > > > > are kept for the current pending reassignment, I think it will
> > > >>>> > > > > > be very easy to compare that with the topic/partition's ISR.
> > > >>>> > > > > > If all "original_replicas" are in ISR, then URP should be 0 for
> > > >>>> > > > > > that topic/partition.
> > > >>>> > > > > >
> > > >>>> > > > > > It would also be nice to separate the metrics MaxLag/TotalLag
> > > >>>> > > > > > for reassignments. I think that will also require
> > > >>>> > > > > > "original_replicas" (the topic/partition's replicas just before
> > > >>>> > > > > > reassignment, when the AR (Assigned Replicas) is set to
> > > >>>> > > > > > Set(original_replicas) +
> > > >>>> > > > > > Set(new_replicas_in_reassign_partitions)).
> > > >>>> > > > > >
> > > >>>> > > > > > Thanks,
> > > >>>> > > > > > George
> > > >>>> > > > > >
> > > >>>> > > > > >     On Friday, April 5, 2019, 6:29:55 PM PDT, Jason
> > Gustafson
> > > >>>> > > > > > <ja...@confluent.io> wrote:
> > > >>>> > > > > >
> > > >>>> > > > > >  Hi Viktor,
> > > >>>> > > > > >
> > > >>>> > > > > > Thanks for writing this up. As far as questions about overlap
> > > >>>> > > > > > with KIP-236, I agree it seems mostly orthogonal. I think
> > > >>>> > > > > > KIP-236 may have had a larger initial scope, but now it focuses
> > > >>>> > > > > > on cancellation and batching is left for future work.
> > > >>>> > > > > >
> > > >>>> > > > > > With that said, I think we may not actually need a KIP for the
> > > >>>> > > > > > current proposal since it doesn't change any APIs. To make it
> > > >>>> > > > > > more generally useful, however, it would be nice to handle
> > > >>>> > > > > > batching at the partition level as well as Jun suggests. The
> > > >>>> > > > > > basic question is at what level should the batching be
> > > >>>> > > > > > determined. You could rely on external processes (e.g. cruise
> > > >>>> > > > > > control) or it could be built into the controller. There are
> > > >>>> > > > > > tradeoffs either way, but I think it simplifies such tools if
> > > >>>> > > > > > it is handled internally. Then it would be much safer to submit
> > > >>>> > > > > > a larger reassignment even just using the simple tools that
> > > >>>> > > > > > come with Kafka.
> > > >>>> > > > > >
> > > >>>> > > > > > By the way, since you are looking into some of the reassignment
> > > >>>> > > > > > logic, another problem that we might want to address is the
> > > >>>> > > > > > misleading way we report URPs during a reassignment. I had a
> > > >>>> > > > > > naive proposal for this previously, but it didn't really work:
> > > >>>> > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-352%3A+Distinguish+URPs+caused+by+reassignment
> > > >>>> > > > > > Potentially fixing that could fall under this work as well if
> > > >>>> > > > > > you think it makes sense.
> > > >>>> > > > > >
> > > >>>> > > > > > Best,
> > > >>>> > > > > > Jason
> > > >>>> > > > > >
> > > >>>> > > > > > On Thu, Apr 4, 2019 at 4:49 PM Jun Rao <ju...@confluent.io>
> > > >>>> wrote:
> > > >>>> > > > > >
> > > >>>> > > > > > > Hi, Viktor,
> > > >>>> > > > > > >
> > > >>>> > > > > > > Thanks for the KIP. A couple of comments below.
> > > >>>> > > > > > >
> > > >>>> > > > > > > 1. Another potential thing to do reassignment incrementally
> > > >>>> > > > > > > is to move a batch of partitions at a time, instead of all
> > > >>>> > > > > > > partitions. This may lead to less data replication since by
> > > >>>> > > > > > > the time the first batch of partitions have been completely
> > > >>>> > > > > > > moved, some data of the next batch may have been deleted due
> > > >>>> > > > > > > to retention and doesn't need to be replicated.
> > > >>>> > > > > > >
> > > >>>> > > > > > > 2. "Update CR in Zookeeper with TR for the given partition".
> > > >>>> > > > > > > Which ZK path is this for?
> > > >>>> > > > > > >
> > > >>>> > > > > > > Jun
> > > >>>> > > > > > >
> > > >>>> > > > > > > On Sat, Feb 23, 2019 at 2:12 AM Viktor Somogyi-Vass <
> > > >>>> > > > > > > viktorsomogyi@gmail.com>
> > > >>>> > > > > > > wrote:
> > > >>>> > > > > > >
> > > >>>> > > > > > > > Hi Harsha,
> > > >>>> > > > > > > >
> > > >>>> > > > > > > > As far as I understand KIP-236 it's about enabling
> > > >>>> > > > > > > > reassignment cancellation and as a future plan providing a
> > > >>>> > > > > > > > queue of replica reassignment steps to allow manual
> > > >>>> > > > > > > > reassignment chains. While I agree that the reassignment
> > > >>>> > > > > > > > chain has a specific use case that allows fine grain
> > > >>>> > > > > > > > control over reassignment process, My proposal on the other
> > > >>>> > > > > > > > hand doesn't talk about cancellation but it only provides
> > > >>>> > > > > > > > an automatic way to incrementalize an arbitrary
> > > >>>> > > > > > > > reassignment which I think fits the general use case where
> > > >>>> > > > > > > > users don't want that level of control but still would like
> > > >>>> > > > > > > > a balanced way of reassignments. Therefore I think it's
> > > >>>> > > > > > > > still relevant as an improvement of the current algorithm.
> > > >>>> > > > > > > > Nevertheless I'm happy to add my ideas to KIP-236 as I
> > > >>>> > > > > > > > think it would be a great improvement to Kafka.
> > > >>>> > > > > > > >
> > > >>>> > > > > > > > Cheers,
> > > >>>> > > > > > > > Viktor
> > > >>>> > > > > > > >
> > > >>>> > > > > > > > On Fri, Feb 22, 2019 at 5:05 PM Harsha <
> > kafka@harsha.io>
> > > >>>> > wrote:
> > > >>>> > > > > > > >
> > > >>>> > > > > > > > > Hi Viktor,
> > > >>>> > > > > > > > > There is already KIP-236 for the same feature and George
> > > >>>> > > > > > > > > made a PR for this as well.
> > > >>>> > > > > > > > > Let's consolidate these two discussions. If you have any
> > > >>>> > > > > > > > > cases that are not being solved by KIP-236, can you
> > > >>>> > > > > > > > > please mention them in that thread? We can address them
> > > >>>> > > > > > > > > as part of KIP-236.
> > > >>>> > > > > > > > >
> > > >>>> > > > > > > > > Thanks,
> > > >>>> > > > > > > > > Harsha
> > > >>>> > > > > > > > >
> > > >>>> > > > > > > > > On Fri, Feb 22, 2019, at 5:44 AM, Viktor
> > Somogyi-Vass
> > > >>>> wrote:
> > > >>>> > > > > > > > > > Hi Folks,
> > > >>>> > > > > > > > > >
> > > >>>> > > > > > > > > > I've created a KIP about an improvement to the
> > > >>>> > > > > > > > > > reassignment algorithm we have. It aims to enable
> > > >>>> > > > > > > > > > partition-wise incremental reassignment. The motivation
> > > >>>> > > > > > > > > > is to avoid the excess load that the current
> > > >>>> > > > > > > > > > replication algorithm implicitly carries: there are
> > > >>>> > > > > > > > > > points in the algorithm where both the new and old
> > > >>>> > > > > > > > > > replica sets could be online and replicating, which
> > > >>>> > > > > > > > > > puts double (or almost double) pressure on the brokers
> > > >>>> > > > > > > > > > and could cause problems.
> > > >>>> > > > > > > > > > Instead, my proposal would slice this up into several
> > > >>>> > > > > > > > > > steps, where each step is calculated based on the final
> > > >>>> > > > > > > > > > target replicas and the current replica assignment,
> > > >>>> > > > > > > > > > taking into account scenarios where brokers could be
> > > >>>> > > > > > > > > > offline and where there are not enough replicas to
> > > >>>> > > > > > > > > > fulfil the min.insync.replicas requirement.
> > > >>>> > > > > > > > > >
> > > >>>> > > > > > > > > > The link to the KIP:
> > > >>>> > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-435%3A+Incremental+Partition+Reassignment
> > > >>>> > > > > > > > > >
> > > >>>> > > > > > > > > > I'd be happy to receive any feedback.
> > > >>>> > > > > > > > > >
> > > >>>> > > > > > > > > > An important note is that this KIP and another one, KIP-236 that is about
> > > >>>> > > > > > > > > > interruptible reassignment
> > > >>>> > > > > > > > > > (https://cwiki.apache.org/confluence/display/KAFKA/KIP-236%3A+Interruptible+Partition+Reassignment),
> > > >>>> > > > > > > > > > should be compatible.
> > > >>>> > > > > > > > > >
> > > >>>> > > > > > > > > > Thanks,
> > > >>>> > > > > > > > > > Viktor

Re: [DISCUSS] KIP-435: Incremental Partition Reassignment

Posted by Viktor Somogyi-Vass <vi...@gmail.com>.

Hey Colin,

I think there's some confusion here, so I might change the name of this KIP.
KIP-435 is about the internal batching of reassignments (so purely a
controller change) and not about client-side APIs. As of this moment, these
kinds of improvements are listed in KIP-455's future work section, so in my
understanding KIP-455 won't touch that :).
Let me know if I'm missing any points here.
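
To illustrate what "internal batching" means here, below is a minimal sketch in Python of the replica-level batching described in the Apr 29 message quoted further down. It is only an illustration under simplifying assumptions, not the controller implementation: the function names are made up, the replication factor is assumed to stay the same, and offline brokers and the min.insync.replicas check are ignored.

```python
from typing import List, Tuple

Batch = Tuple[List[int], List[int]]


def replica_batches(current: List[int], target: List[int],
                    replica_batch_size: int) -> List[Batch]:
    """Pair replicas to drop with replicas to add, in order, in fixed-size batches."""
    # Brokers present on both sides need no movement.
    to_drop = [b for b in current if b not in target]
    to_add = [b for b in target if b not in current]
    if len(to_drop) != len(to_add):
        raise ValueError("this sketch assumes the replication factor is unchanged")
    return [
        (to_drop[i:i + replica_batch_size], to_add[i:i + replica_batch_size])
        for i in range(0, len(to_add), replica_batch_size)
    ]


def max_concurrent_moves(partition_count: int, replica_count: int) -> int:
    """Replica moves in flight when both limits are fully used."""
    return partition_count * replica_count


# The example from the thread: (0, 1, 2, 3, 4) -> (5, 6, 7, 8, 9), two replicas at a time.
batches = replica_batches([0, 1, 2, 3, 4], [5, 6, 7, 8, 9], replica_batch_size=2)
for old, new in batches:
    print(old, "->", new)
```

The batches are produced in list order, which matches the idea that the steps are executed from first to last so the outcome stays predictable.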

Viktor

On Tue, Jun 25, 2019 at 9:02 PM Colin McCabe <cm...@apache.org> wrote:

> Hi Viktor,
>
> Now that the 2.3 release is over, we're going to be turning our attention
> back to working on KIP-455, which provides an API for partition
> reassignment, and also solves the incremental reassignment problem.  Sorry
> about the pause, but I had to focus on the stuff that was going into 2.3.
>
> I think last time we talked about this, the consensus was that KIP-455
> supersedes KIP-435, since KIP-455 supports incremental reassignment.  We
> also don't want to add more technical debt in the form of a new
> ZooKeeper-based API that we'll have to support for a while.  So let's focus
> on KIP-455 here.  We have more resources now so I think we'll be able to
> get it done soonish.
>
> best,
> Colin
>
>
> On Tue, Jun 25, 2019, at 08:09, Viktor Somogyi-Vass wrote:
> > Hi All,
> >
> > I have added another improvement to this, which is to limit the number of
> > parallel leader movements. I think I'll soon (maybe late this week or early
> > next) start a vote on this too if there is no additional feedback.
> >
> > Thanks,
> > Viktor
> >
> > On Mon, Apr 29, 2019 at 1:26 PM Viktor Somogyi-Vass <
> viktorsomogyi@gmail.com>
> > wrote:
> >
> > > Hi Folks,
> > >
> > > I've updated the KIP with batching that would work on both the replica
> > > and the partition level. To explain it briefly: if, for instance, the
> > > replica level is set to 2 and the partition level is set to 3, then 2x3=6
> > > replica reassignments would be in progress at the same time. In the case
> > > of a reassignment of a single partition from (0, 1, 2, 3, 4) to
> > > (5, 6, 7, 8, 9), we would form the batches (0, 1) → (5, 6); (2, 3) → (7, 8)
> > > and 4 → 9 and would execute the reassignment in this order.
> > >
> > > Let me know what you think.
> > >
> > > Best,
> > > Viktor
> > >
> > > On Mon, Apr 15, 2019 at 7:01 PM Viktor Somogyi-Vass <
> > > viktorsomogyi@gmail.com> wrote:
> > >
> > >> A follow up on the batching topic to clarify my points above.
> > >>
> > >> Generally I think that batching should be a core feature since, as Colin
> > >> said, the controller should possess all the related information.
> > >> Also, Cruise Control (or really any 3rd party admin system) might build
> > >> upon this to give a more holistic approach to balancing brokers. We may
> > >> provide them with APIs that act as building blocks to make their life
> > >> easier, like incrementalization, batching, cancellation and rollback, but
> > >> I think the more advanced we go, the more advanced a control surface we'll
> > >> need, and Kafka's basic tooling might not be suitable for that.
> > >>
> > >> Best,
> > >> Viktor
> > >>
> > >>
> > >> On Mon, 15 Apr 2019, 18:22 Viktor Somogyi-Vass, <
> viktorsomogyi@gmail.com>
> > >> wrote:
> > >>
> > >>> Hey Guys,
> > >>>
> > >>> I'll reply to you all in this email:
> > >>>
> > >>> @Jun:
> > >>> 1. Yes, it'd be a good idea to add this feature; I'll write this into
> > >>> the KIP. I was actually thinking about introducing two dynamic configs
> > >>> called reassignment.parallel.partition.count and
> > >>> reassignment.parallel.replica.count. The first property would control
> > >>> how many partition reassignments we can do concurrently. The second would
> > >>> go one level finer in granularity and would control how many replicas we
> > >>> want to move in parallel for a given partition. One more thing that'd be
> > >>> useful to fix is that a given list of partition -> replica list entries
> > >>> would be executed in the same order (from first to last), so it's
> > >>> predictable overall and the user would have some control over the order
> > >>> in which reassignments are executed, as the JSON is still assembled by
> > >>> the user.
> > >>> 2. The /kafka/brokers/topics/{topic} znode, to be specific. I'll update
> > >>> the KIP to contain this.
> > >>>
> > >>> @Jason:
> > >>> I think building this functionality into Kafka would definitely benefit
> > >>> all users, and CC as well, since it'd simplify their software as you
> > >>> said. As I understand it, the main advantage of CC and other similar
> > >>> software is to provide high-level features for automatic load balancing.
> > >>> Reliability, stability and predictability of the reassignment should be
> > >>> core features of Kafka. I think the incrementalization feature would make
> > >>> it more stable. I would consider cancellation a core feature too, and we
> > >>> can leave the gate open for external tools to feed in their reassignment
> > >>> JSON as they want. I was also thinking about what set of features we can
> > >>> provide in Kafka, but I think the more advanced we go, the more need
> > >>> there is for an administrative UI component.
> > >>> Regarding KIP-352: thanks for pointing this out. I hadn't seen it,
> > >>> although lately I was also thinking about the throttling aspect of it.
> > >>> It would be a nice add-on to Kafka: though the above configs provide some
> > >>> level of control, it'd be nice to put an upper cap on the bandwidth and
> > >>> make it monitorable.
> > >>>
> > >>> Viktor
> > >>>
> > >>> On Wed, Apr 10, 2019 at 2:57 AM Jason Gustafson <ja...@confluent.io>
> > >>> wrote:
> > >>>
> > >>>> Hi Colin,
> > >>>>
> > >>>> > On a related note, what do you think about the idea of storing the
> > >>>> > reassigning replicas in
> > >>>> > /brokers/topics/[topic]/partitions/[partitionId]/state, rather than in
> > >>>> > the reassignment znode?  I don't think this requires a major change to
> > >>>> > the proposal-- when the controller becomes aware that it should do a
> > >>>> > reassignment, the controller could make the changes.  This also helps
> > >>>> > keep the reassignment znode from getting larger, which has been a
> > >>>> > problem.
> > >>>>
> > >>>>
> > >>>> Yeah, I think it's a good idea to store the reassignment state at a
> > >>>> finer level. I'm not sure the LeaderAndIsr znode is the right one though.
> > >>>> Another option is /brokers/topics/{topic}. That is where we currently
> > >>>> store the replica assignment. I think we basically want to represent both
> > >>>> the current state and the desired state. This would also open the door to
> > >>>> a cleaner way to update a reassignment while it is still in progress.
> > >>>>
> > >>>> -Jason
> > >>>>
> > >>>>
> > >>>>
> > >>>>
> > >>>> On Mon, Apr 8, 2019 at 11:14 PM George Li <sql_consulting@yahoo.com
> > >>>> .invalid>
> > >>>> wrote:
> > >>>>
> > >>>> >  Hi Colin / Jason,
> > >>>> >
> > >>>> > Reassignment should really be done in batches.  I am not too worried
> > >>>> > about the reassignment znode getting larger.  In a real production
> > >>>> > environment, too many concurrent reassignments and too frequent
> > >>>> > submission of reassignments seemed to cause latency spikes in the
> > >>>> > Kafka cluster.  So batching/staggering/throttling of submitting
> > >>>> > reassignments is recommended.
> > >>>> >
> > >>>> > In KIP-236, the "originalReplicas" are only kept for the currently
> > >>>> > reassigning partitions (small #), and kept in memory of the
> > >>>> > controller context partitionsBeingReassigned as well as in the znode
> > >>>> > /admin/reassign_partitions.  I think the "setting in the RPC like
> > >>>> > null = no replicas are reassigning" suggestion below is a good idea.
> > >>>> >
> > >>>> > There seem to be some issues with the mail archive server of this
> > >>>> > mailing list?  I didn't receive email after April 7th, and the
> > >>>> > archive for April 2019 has only 50 messages (
> > >>>> > http://mail-archives.apache.org/mod_mbox/kafka-dev/201904.mbox/thread)?
> > >>>> > Thanks,
> > >>>> > George
> > >>>> >
> > >>>> >    Mon, 08 Apr 2019 17:54:48 GMT  Colin McCabe wrote:
> > >>>> >
> > >>>> >   Yeah, I think adding this information to LeaderAndIsr makes
> sense.
> > >>>> It
> > >>>> > would be better to track
> > >>>> > "reassigningReplicas" than "originalReplicas", I think.  Tracking
> > >>>> > "originalReplicas" is going
> > >>>> > to involve sending a lot more data, since most replicas in the
> system
> > >>>> are
> > >>>> > not reassigning
> > >>>> > at any given point.  Or we would need a hack in the RPC like null
> = no
> > >>>> > replicas are reassigning.
> > >>>> >
> > >>>> > On a related note, what do you think about the idea of storing the
> > >>>> > reassigning replicas in
> > >>>> >  /brokers/topics/[topic]/partitions/[partitionId]/state, rather
> than
> > >>>> in
> > >>>> > the reassignment znode?
> > >>>> >  I don't think this requires a major change to the proposal--
> when the
> > >>>> > controller becomes
> > >>>> > aware that it should do a reassignment, the controller could make
> the
> > >>>> > changes.  This also
> > >>>> > helps keep the reassignment znode from getting larger, which has
> been
> > >>>> a
> > >>>> > problem.
> > >>>> >
> > >>>> > best,
> > >>>> > Colin
> > >>>> >
> > >>>> >
> > >>>> > On Mon, Apr 8, 2019, at 09:29, Jason Gustafson wrote:
> > >>>> > > Hey George,
> > >>>> > >
> > >>>> > > For the URP during a reassignment,  if the "original_replicas"
> is
> > >>>> kept
> > >>>> > for
> > >>>> > > > the current pending reassignment. I think it will be very
> easy to
> > >>>> > compare
> > >>>> > > > that with the topic/partition's ISR.  If all
> "original_replicas"
> > >>>> are in
> > >>>> > > > ISR, then URP should be 0 for that topic/partition.
> > >>>> > >
> > >>>> > >
> > >>>> > > Yeah, that makes sense. But I guess we would need
> > >>>> "original_replicas" to
> > >>>> > be
> > >>>> > > propagated to partition leaders in the LeaderAndIsr request
> since
> > >>>> leaders
> > >>>> > > are the ones that are computing URPs. That is basically what
> > >>>> KIP-352 had
> > >>>> > > proposed, but we also need the changes to the reassignment path.
> > >>>> Perhaps
> > >>>> > it
> > >>>> > > makes more sense to address this problem in KIP-236 since that
> is
> > >>>> where
> > >>>> > you
> > >>>> > > have already introduced "original_replicas"? I'm also happy to
> do
> > >>>> KIP-352
> > >>>> > > as a follow-up to KIP-236.
> > >>>> > >
> > >>>> > > Best,
> > >>>> > > Jason
> > >>>> > >
> > >>>> > >
> > >>>> > > On Sun, Apr 7, 2019 at 5:09 PM Ismael Juma <is...@gmail.com>
> > >>>> wrote:
> > >>>> > >
> > >>>> > > > Good discussion about where we should do batching. I think if
> > >>>> there is
> > >>>> > a
> > >>>> > > > clear great way to batch, then it makes a lot of sense to
> just do
> > >>>> it
> > >>>> > once.
> > >>>> > > > However, if we think there is scope for experimenting with
> > >>>> different
> > >>>> > > > approaches, then an API that tools can use makes a lot of
> sense.
> > >>>> They
> > >>>> > can
> > >>>> > > > experiment and innovate. Eventually, we can integrate
> something
> > >>>> into
> > >>>> > Kafka
> > >>>> > > > if it makes sense.
> > >>>> > > >
> > >>>> > > > Ismael
> > >>>> > > >
> > >>>> > > > On Sun, Apr 7, 2019, 11:03 PM Colin McCabe <
> cmccabe@apache.org>
> > >>>> wrote:
> > >>>> > > >
> > >>>> > > > > Hi George,
> > >>>> > > > >
> > >>>> > > > > As Jason was saying, it seems like there are two directions
> we
> > >>>> could
> > >>>> > go
> > >>>> > > > > here: an external system handling batching, and the
> controller
> > >>>> > handling
> > >>>> > > > > batching.  I think the controller handling batching would be
> > >>>> better,
> > >>>> > > > since
> > >>>> > > > > the controller has more information about the state of the
> > >>>> system.
> > >>>> > If
> > >>>> > > > the
> > >>>> > > > > controller handles batching, then the controller could also
> > >>>> handle
> > >>>> > things
> > >>>> > > > > like setting up replication quotas for individual
> partitions.
> > >>>> The
> > >>>> > > > > controller could do things like throttle replication down
> if the
> > >>>> > cluster
> > >>>> > > > > was having problems.
> > >>>> > > > >
> > >>>> > > > > We kind of need to figure out which way we're going to go on
> > >>>> this one
> > >>>> > > > > before we set up big new APIs, I think.  If we want an
> external
> > >>>> > system to
> > >>>> > > > > handle batching, then we can keep the idea that there is
> only
> > >>>> one
> > >>>> > > > > reassignment in progress at once.  If we want the
> controller to
> > >>>> > handle
> > >>>> > > > > batching, we will need to get away from that idea.
> Instead, we
> > >>>> > should
> > >>>> > > > just
> > >>>> > > > > have a bunch of "ideal assignments" that we tell the
> controller
> > >>>> > about,
> > >>>> > > > and
> > >>>> > > > > let it decide how to do the batching.  These ideal
> assignments
> > >>>> could
> > >>>> > > > change
> > >>>> > > > > continuously over time, so from the admin's point of view,
> there
> > >>>> > would be
> > >>>> > > > > no start/stop/cancel, but just individual partition
> > >>>> reassignments
> > >>>> > that we
> > >>>> > > > > submit, perhaps over a long period of time.  And then
> > >>>> cancellation
> > >>>> > might
> > >>>> > > > > just mean cancelling just that individual partition
> > >>>> reassignment,
> > >>>> > not all
> > >>>> > > > > partition reassignments.
> > >>>> > > > >
> > >>>> > > > > best,
> > >>>> > > > > Colin
> > >>>> > > > >
> > >>>> > > > > On Fri, Apr 5, 2019, at 19:34, George Li wrote:
> > >>>> > > > > >  Hi Jason / Viktor,
> > >>>> > > > > >
> > >>>> > > > > > For the URP during a reassignment,  if the
> > >>>> "original_replicas" is
> > >>>> > kept
> > >>>> > > > > > for the current pending reassignment. I think it will be
> very
> > >>>> easy
> > >>>> > to
> > >>>> > > > > > compare that with the topic/partition's ISR.  If all
> > >>>> > > > > > "original_replicas" are in ISR, then URP should be 0 for
> that
> > >>>> > > > > > topic/partition.
> > >>>> > > > > >
> > >>>> > > > > > It would be also nice to separate the metrics
> MaxLag/TotalLag
> > >>>> for
> > >>>> > > > > > Reassignments. I think that will also require
> > >>>> "original_replicas"
> > >>>> > (the
> > >>>> > > > > > topic/partition's replicas just before reassignment when
> the
> > >>>> AR
> > >>>> > > > > > (Assigned Replicas) is set to Set(original_replicas) +
> > >>>> > > > > > Set(new_replicas_in_reassign_partitions) ).
> > >>>> > > > > >
> > >>>> > > > > > Thanks,
> > >>>> > > > > > George
> > >>>> > > > > >
> > >>>> > > > > >     On Friday, April 5, 2019, 6:29:55 PM PDT, Jason
> Gustafson
> > >>>> > > > > > <ja...@confluent.io> wrote:
> > >>>> > > > > >
> > >>>> > > > > >  Hi Viktor,
> > >>>> > > > > >
> > >>>> > > > > > Thanks for writing this up. As far as questions about
> overlap
> > >>>> with
> > >>>> > > > > KIP-236,
> > >>>> > > > > > I agree it seems mostly orthogonal. I think KIP-236 may
> have
> > >>>> had a
> > >>>> > > > larger
> > >>>> > > > > > initial scope, but now it focuses on cancellation and
> > >>>> batching is
> > >>>> > left
> > >>>> > > > > for
> > >>>> > > > > > future work.
> > >>>> > > > > >
> > >>>> > > > > > With that said, I think we may not actually need a KIP
> for the
> > >>>> > current
> > >>>> > > > > > proposal since it doesn't change any APIs. To make it more
> > >>>> > generally
> > >>>> > > > > > useful, however, it would be nice to handle batching at
> the
> > >>>> > partition
> > >>>> > > > > level
> > >>>> > > > > > as well as Jun suggests. The basic question is at what
> level
> > >>>> > should the
> > >>>> > > > > > batching be determined. You could rely on external
> processes
> > >>>> (e.g.
> > >>>> > > > cruise
> > >>>> > > > > > control) or it could be built into the controller. There
> are
> > >>>> > tradeoffs
> > >>>> > > > > > either way, but I think it simplifies such tools if it is
> > >>>> handled
> > >>>> > > > > > internally. Then it would be much safer to submit a larger
> > >>>> > reassignment
> > >>>> > > > > > even just using the simple tools that come with Kafka.
> > >>>> > > > > >
> > >>>> > > > > > By the way, since you are looking into some of the
> > >>>> reassignment
> > >>>> > logic,
> > >>>> > > > > > another problem that we might want to address is the
> > >>>> misleading
> > >>>> > way we
> > >>>> > > > > > report URPs during a reassignment. I had a naive proposal
> for
> > >>>> this
> > >>>> > > > > > previously, but it didn't really work
> > >>>> > > > > >
> > >>>> > > > >
> > >>>> > > >
> > >>>> >
> > >>>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-352%3A+Distinguish+URPs+caused+by+reassignment
> > >>>> > > > > .
> > >>>> > > > > > Potentially fixing that could fall under this work as
> well if
> > >>>> you
> > >>>> > think
> > >>>> > > > > > it
> > >>>> > > > > > makes sense.
> > >>>> > > > > >
> > >>>> > > > > > Best,
> > >>>> > > > > > Jason
> > >>>> > > > > >
> > >>>> > > > > > On Thu, Apr 4, 2019 at 4:49 PM Jun Rao <ju...@confluent.io>
> > >>>> wrote:
> > >>>> > > > > >
> > >>>> > > > > > > Hi, Viktor,
> > >>>> > > > > > >
> > >>>> > > > > > > Thanks for the KIP. A couple of comments below.
> > >>>> > > > > > >
> > >>>> > > > > > > 1. Another potential thing to do reassignment
> incrementally
> > >>>> is to
> > >>>> > > > move
> > >>>> > > > > a
> > >>>> > > > > > > batch of partitions at a time, instead of all
> partitions.
> > >>>> This
> > >>>> > may
> > >>>> > > > > lead to
> > >>>> > > > > > > less data replication since by the time the first batch
> of
> > >>>> > partitions
> > >>>> > > > > have
> > >>>> > > > > > > been completely moved, some data of the next batch may
> have
> > >>>> been
> > >>>> > > > > deleted
> > >>>> > > > > > > due to retention and doesn't need to be replicated.
> > >>>> > > > > > >
> > >>>> > > > > > > 2. "Update CR in Zookeeper with TR for the given
> partition".
> > >>>> > Which
> > >>>> > ZK
> > >>>> > > > > path
> > >>>> > > > > > > is this for?
> > >>>> > > > > > >
> > >>>> > > > > > > Jun
> > >>>> > > > > > >
> > >>>> > > > > > > On Sat, Feb 23, 2019 at 2:12 AM Viktor Somogyi-Vass <
> > >>>> > > > > > > viktorsomogyi@gmail.com>
> > >>>> > > > > > > wrote:
> > >>>> > > > > > >
> > >>>> > > > > > > > Hi Harsha,
> > >>>> > > > > > > >
> > >>>> > > > > > > > As far as I understand KIP-236 it's about enabling
> > >>>> reassignment
> > >>>> > > > > > > > cancellation and as a future plan providing a queue of
> > >>>> replica
> > >>>> > > > > > > reassignment
> > >>>> > > > > > > > steps to allow manual reassignment chains. While I
> agree
> > >>>> that
> > >>>> > the
> > >>>> > > > > > > > reassignment chain has a specific use case that allows
> > >>>> fine
> > >>>> > grain
> > >>>> > > > > control
> > >>>> > > > > > > > over reassignment process, My proposal on the other
> hand
> > >>>> > doesn't
> > >>>> > > > talk
> > >>>> > > > > > > about
> > >>>> > > > > > > > cancellation but it only provides an automatic way to
> > >>>> > > > incrementalize
> > >>>> > > > > an
> > >>>> > > > > > > > arbitrary reassignment which I think fits the general
> use
> > >>>> case
> > >>>> > > > where
> > >>>> > > > > > > users
> > >>>> > > > > > > > don't want that level of control but still would like
> a
> > >>>> > balanced
> > >>>> > > > way
> > >>>> > > > > of
> > >>>> > > > > > > > reassignments. Therefore I think it's still relevant
> as an
> > >>>> > > > > improvement of
> > >>>> > > > > > > > the current algorithm.
> > >>>> > > > > > > > Nevertheless I'm happy to add my ideas to KIP-236 as I
> > >>>> think
> > >>>> > it
> > >>>> > > > > would be
> > >>>> > > > > > > a
> > >>>> > > > > > > > great improvement to Kafka.
> > >>>> > > > > > > >
> > >>>> > > > > > > > Cheers,
> > >>>> > > > > > > > Viktor
> > >>>> > > > > > > >
> > >>>> > > > > > > > On Fri, Feb 22, 2019 at 5:05 PM Harsha <
> kafka@harsha.io>
> > >>>> > wrote:
> > >>>> > > > > > > >
> > >>>> > > > > > > > > Hi Viktor,
> > >>>> > > > > > > > >            There is already KIP-236 for the same
> feature
> > >>>> > and
> > >>>> > > > George
> > >>>> > > > > > > made
> > >>>> > > > > > > > > a PR for this as well.
> > >>>> > > > > > > > > Lets consolidate these two discussions. If you have
> any
> > >>>> > cases
> > >>>> > > > that
> > >>>> > > > > are
> > >>>> > > > > > > > not
> > >>>> > > > > > > > > being solved by KIP-236 can you please mention them
> in
> > >>>> > that
> > >>>> > > > > thread. We
> > >>>> > > > > > > > can
> > >>>> > > > > > > > > address as part of KIP-236.
> > >>>> > > > > > > > >
> > >>>> > > > > > > > > Thanks,
> > >>>> > > > > > > > > Harsha
> > >>>> > > > > > > > >
> > >>>> > > > > > > > > On Fri, Feb 22, 2019, at 5:44 AM, Viktor
> Somogyi-Vass
> > >>>> wrote:
> > >>>> > > > > > > > > > Hi Folks,
> > >>>> > > > > > > > > >
> > >>>> > > > > > > > > > I've created a KIP about an improvement of the
> > >>>> reassignment
> > >>>> > > > > algorithm
> > >>>> > > > > > > > we
> > >>>> > > > > > > > > > have. It aims to enable partition-wise incremental
> > >>>> > > > reassignment.
> > >>>> > > > > The
> > >>>> > > > > > > > > > motivation for this is to avoid excess load that
> the
> > >>>> > current
> > >>>> > > > > > > > replication
> > >>>> > > > > > > > > > algorithm implicitly carries as in that case there
> > >>>> > are points
> > >>>> > > > in
> > >>>> > > > > the
> > >>>> > > > > > > > > > algorithm where both the new and old replica set
> could
> > >>>> > be
> > >>>> > > > online
> > >>>> > > > > and
> > >>>> > > > > > > > > > replicating which puts double (or almost double)
> > >>>> pressure
> > >>>> > on
> > >>>> > > > the
> > >>>> > > > > > > > brokers
> > >>>> > > > > > > > > > which could cause problems.
> > >>>> > > > > > > > > > Instead my proposal would slice this up into
> several
> > >>>> > steps
> > >>>> > > > where
> > >>>> > > > > each
> > >>>> > > > > > > > > step
> > >>>> > > > > > > > > > is calculated based on the final target replicas
> and
> > >>>> > the
> > >>>> > > > current
> > >>>> > > > > > > > replica
> > >>>> > > > > > > > > > assignment taking into account scenarios where
> brokers
> > >>>> > could be
> > >>>> > > > > > > offline
> > >>>> > > > > > > > > and
> > >>>> > > > > > > > > > when there are not enough replicas to fulfil the
> > >>>> > > > > min.insync.replica
> > >>>> > > > > > > > > > requirement.
> > >>>> > > > > > > > > >
> > >>>> > > > > > > > > > The link to the KIP:
> > >>>> > > > > > > > > >
> > >>>> > > > > > > > >
> > >>>> > > > > > > >
> > >>>> > > > > > >
> > >>>> > > > >
> > >>>> > > >
> > >>>> >
> > >>>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-435%3A+Incremental+Partition+Reassignment
> > >>>> > > > > > > > > >
> > >>>> > > > > > > > > > I'd be happy to receive any feedback.
> > >>>> > > > > > > > > >
> > >>>> > > > > > > > > > An important note is that this KIP and another
> one,
> > >>>> > KIP-236
> > >>>> > > > that
> > >>>> > > > > is
> > >>>> > > > > > > > > > about
> > >>>> > > > > > > > > > interruptible reassignment (
> > >>>> > > > > > > > > >
> > >>>> > > > > > > > >
> > >>>> > > > > > > >
> > >>>> > > > > > >
> > >>>> > > > >
> > >>>> > > >
> > >>>> >
> > >>>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-236%3A+Interruptible+Partition+Reassignment
> > >>>> > > > > > > > > )
> > >>>> > > > > > > > > > should be compatible.
> > >>>> > > > > > > > > >
> > >>>> > > > > > > > > > Thanks,
> > >>>> > > > > > > > > > Viktor
> > >>>> > > > > > > > > >
> > >>>> > > > > > > > >
> > >>>> > > > > > > >
> > >>>> > > > > > >
> > >>>> > > > > >
> > >>>> > > > >
> > >>>> > > >
> > >>>> > >
> > >>>> >
> > >>>>
> > >>>
> >
>

Re: [DISCUSS] KIP-435: Incremental Partition Reassignment

Posted by Colin McCabe <cm...@apache.org>.

Hi Viktor,

Now that the 2.3 release is over, we're going to be turning our attention back to working on KIP-455, which provides an API for partition reassignment, and also solves the incremental reassignment problem.  Sorry about the pause, but I had to focus on the stuff that was going into 2.3.

I think last time we talked about this, the consensus was that KIP-455 supersedes KIP-435, since KIP-455 supports incremental reassignment.  We also don't want to add more technical debt in the form of a new ZooKeeper-based API that we'll have to support for a while.  So let's focus on KIP-455 here.  We have more resources now so I think we'll be able to get it done soonish.

best,
Colin


On Tue, Jun 25, 2019, at 08:09, Viktor Somogyi-Vass wrote:
> Hi All,
> 
> I have added another improvement to this, which is to limit the parallel
> leader movements. I think I'll soon (maybe late this week or early next)
> start a vote on this too if there is no additional feedback.
> 
> Thanks,
> Viktor
> 
> On Mon, Apr 29, 2019 at 1:26 PM Viktor Somogyi-Vass <vi...@gmail.com>
> wrote:
> 
> > Hi Folks,
> >
> > I've updated the KIP with the batching which would work on both replica
> > and partition level. To explain it briefly: for instance if the replica
> > level is set to 2 and partition level is set to 3, then 2x3=6 replica
> > reassignments would be in progress at the same time. In case of reassignment
> > for a single partition from (0, 1, 2, 3, 4) to (5, 6, 7, 8, 9) we would
> > form the batches (0, 1) → (5, 6); (2, 3) → (7, 8) and 4 → 9 and would
> > execute the reassignment in this order.
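
As a minimal sketch of the batching described above (not the KIP's actual
implementation): the function names replica_batches and partition_batches
and their parameters are hypothetical, and the sketch assumes replicas are
paired up in list order, as in the (0, 1, 2, 3, 4) to (5, 6, 7, 8, 9)
example.

```python
from typing import List, Tuple


def replica_batches(
    current: List[int], target: List[int], replica_batch_size: int
) -> List[Tuple[List[int], List[int]]]:
    """Split one partition's reassignment into ordered (drop, add) steps.

    Brokers that appear in both lists stay put; only the differing
    replicas are moved, replica_batch_size at a time.
    """
    to_drop = [b for b in current if b not in target]
    to_add = [b for b in target if b not in current]
    steps = []
    for i in range(0, max(len(to_drop), len(to_add)), replica_batch_size):
        steps.append(
            (to_drop[i:i + replica_batch_size], to_add[i:i + replica_batch_size])
        )
    return steps


def partition_batches(
    partitions: List[str], partition_batch_size: int
) -> List[List[str]]:
    """Group partitions so that only partition_batch_size move at once."""
    return [
        partitions[i:i + partition_batch_size]
        for i in range(0, len(partitions), partition_batch_size)
    ]


if __name__ == "__main__":
    # Replica level 2: (0, 1) -> (5, 6); (2, 3) -> (7, 8); 4 -> 9
    for drop, add in replica_batches([0, 1, 2, 3, 4], [5, 6, 7, 8, 9], 2):
        print(drop, "->", add)
```

With a replica level of 2 and a partition level of 3, at most 2x3=6
replica moves would be in flight under this scheme: three partitions
from the current partition batch, each executing one two-replica step.
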
> >
> > Let me know what you think.
> >
> > Best,
> > Viktor
> >
> > On Mon, Apr 15, 2019 at 7:01 PM Viktor Somogyi-Vass <
> > viktorsomogyi@gmail.com> wrote:
> >
> >> A follow up on the batching topic to clarify my points above.
> >>
> >> Generally I think that batching should be a core feature since, as Colin
> >> said, the controller should possess all the related information.
> >> Also Cruise Control (or really any 3rd party admin system) might build
> >> upon this to give a more holistic approach to balancing brokers. We may
> >> provide them with APIs that act like building blocks to make their life
> >> easier, like incrementalization, batching, cancellation and rollback, but
> >> I think the more advanced we go, the more advanced a control surface we'll
> >> need, and Kafka's basic tooling might not be suitable for that.
> >>
> >> Best,
> >> Viktor
> >>
> >>
> >> On Mon, 15 Apr 2019, 18:22 Viktor Somogyi-Vass, <vi...@gmail.com>
> >> wrote:
> >>
> >>> Hey Guys,
> >>>
> >>> I'll reply to you all in this email:
> >>>
> >>> @Jun:
> >>> 1. yes, it'd be a good idea to add this feature, I'll write this into
> >>> the KIP. I was actually thinking about introducing two dynamic configs
> >>> called reassignment.parallel.partition.count and
> >>> reassignment.parallel.replica.count. The first property would control how
> >>> many partition reassignments we can do concurrently. The second would go
> >>> one level deeper in granularity and would control how many replicas we
> >>> want to move at once for a given partition. Also one more thing that'd be
> >>> useful to fix is that a given list of partition -> replica list would be
> >>> executed in the same order (from first to last), so it's overall
> >>> predictable and the user would have some control over the order of
> >>> reassignments, as the JSON is still assembled by the user.
> >>> 2. the /kafka/brokers/topics/{topic} znode to be specific. I'll update
> >>> the KIP to contain this.
> >>>
> >>> @Jason:
> >>> I think building this functionality into Kafka would definitely benefit
> >>> all users, and CC as well, as it'd simplify their software as you said.
> >>> As I understand, the main advantage of CC and other similar software is
> >>> to give high level features for automatic load balancing. Reliability,
> >>> stability and predictability of the reassignment should be a core feature
> >>> of Kafka. I think the incrementalization feature would make it more stable.
> >>> I would consider cancellation too as a core feature and we can leave the
> >>> gate open for external tools to feed in their reassignment json as they
> >>> want. I was also thinking about what set of features we can provide
> >>> for Kafka but I think the more advanced we go the more need there is for an
> >>> administrative UI component.
> >>> Regarding KIP-352: Thanks for pointing this out, I didn't see this
> >>> although lately I was also thinking about the throttling aspect of it.
> >>> It would be a nice add-on to Kafka since, though the above configs provide
> >>> some level of control, it'd be nice to put an upper cap on the bandwidth
> >>> and make it monitorable.
> >>>
> >>> Viktor
> >>>
> >>> On Wed, Apr 10, 2019 at 2:57 AM Jason Gustafson <ja...@confluent.io>
> >>> wrote:
> >>>
> >>>> Hi Colin,
> >>>>
> >>>> > On a related note, what do you think about the idea of storing the
> >>>> > reassigning replicas in
> >>>> > /brokers/topics/[topic]/partitions/[partitionId]/state, rather than in
> >>>> > the reassignment znode?  I don't think this requires a major change to
> >>>> > the proposal-- when the controller becomes aware that it should do a
> >>>> > reassignment, the controller could make the changes.  This also helps
> >>>> > keep the reassignment znode from getting larger, which has been a
> >>>> > problem.
> >>>>
> >>>>
> >>>> Yeah, I think it's a good idea to store the reassignment state at a finer
> >>>> level. I'm not sure the LeaderAndIsr znode is the right one though. Another
> >>>> option is /brokers/topics/{topic}. That is where we currently store the
> >>>> replica assignment. I think we basically want to represent both the current
> >>>> state and the desired state. This would also open the door to a cleaner way
> >>>> to update a reassignment while it is still in progress.
> >>>>
> >>>> -Jason
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> On Mon, Apr 8, 2019 at 11:14 PM George Li <sql_consulting@yahoo.com
> >>>> .invalid>
> >>>> wrote:
> >>>>
> >>>> >  Hi Colin / Jason,
> >>>> >
> >>>> > Reassignment should really be done in batches.  I am not too worried
> >>>> > about the reassignment znode getting larger.  In a real production
> >>>> > environment, too many concurrent reassignments and too frequent
> >>>> > submission of reassignments seemed to cause latency spikes in the
> >>>> > Kafka cluster.  So batching/staggering/throttling of submitting
> >>>> > reassignments is recommended.
> >>>> >
> >>>> > In KIP-236, the "originalReplicas" are only kept for the currently
> >>>> > reassigning partitions (small #), and kept in memory of the controller
> >>>> > context partitionsBeingReassigned as well as in the znode
> >>>> > /admin/reassign_partitions.  I think the "setting in the RPC like
> >>>> > null = no replicas are reassigning" suggestion below is a good idea.
> >>>> >
> >>>> > There seem to be some issues with the mail archive server of this
> >>>> > mailing list?  I didn't receive email after April 7th, and the archive
> >>>> > for April 2019 has only 50 messages (
> >>>> > http://mail-archives.apache.org/mod_mbox/kafka-dev/201904.mbox/thread)?
> >>>> >
> >>>> > Thanks,
> >>>> > George
> >>>> >
> >>>> >    Mon, 08 Apr 2019 17:54:48 GMT  Colin McCabe wrote:
> >>>> >
> >>>> >   Yeah, I think adding this information to LeaderAndIsr makes sense.
> >>>> > It would be better to track "reassigningReplicas" than
> >>>> > "originalReplicas", I think.  Tracking "originalReplicas" is going to
> >>>> > involve sending a lot more data, since most replicas in the system are
> >>>> > not reassigning at any given point.  Or we would need a hack in the RPC
> >>>> > like null = no replicas are reassigning.
> >>>> >
> >>>> > On a related note, what do you think about the idea of storing the
> >>>> > reassigning replicas in
> >>>> > /brokers/topics/[topic]/partitions/[partitionId]/state, rather than in
> >>>> > the reassignment znode?  I don't think this requires a major change to
> >>>> > the proposal-- when the controller becomes aware that it should do a
> >>>> > reassignment, the controller could make the changes.  This also helps
> >>>> > keep the reassignment znode from getting larger, which has been a
> >>>> > problem.
> >>>> >
> >>>> > best,
> >>>> > Colin
> >>>> >
> >>>> >
> >>>> > On Mon, Apr 8, 2019, at 09:29, Jason Gustafson wrote:
> >>>> > > Hey George,
> >>>> > >
> >>>> > > > For the URP during a reassignment, if the "original_replicas" is
> >>>> > > > kept for the current pending reassignment, I think it will be very
> >>>> > > > easy to compare that with the topic/partition's ISR.  If all
> >>>> > > > "original_replicas" are in ISR, then URP should be 0 for that
> >>>> > > > topic/partition.
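
A minimal sketch of the check described in the quoted paragraph, assuming
the partition leader has the pre-reassignment replica set available. The
function names and the plain-list representation are hypothetical and are
not taken from Kafka's code; the first function only illustrates a
reassignment-unaware comparison against all assigned replicas.

```python
from typing import Iterable


def urp_by_assigned_replicas(assigned_replicas: Iterable[int], isr: Iterable[int]) -> bool:
    """Reassignment-unaware check: under-replicated if any assigned
    replica is missing from the ISR."""
    return not set(assigned_replicas) <= set(isr)


def urp_by_original_replicas(original_replicas: Iterable[int], isr: Iterable[int]) -> bool:
    """Check from the thread: under-replicated only if one of the
    replicas held before the reassignment is missing from the ISR."""
    return not set(original_replicas) <= set(isr)


if __name__ == "__main__":
    original = [1, 2, 3]
    new = [4, 5, 6]
    assigned = original + new  # AR while the reassignment is in progress
    isr = [1, 2, 3]            # new replicas are still catching up

    print(urp_by_assigned_replicas(assigned, isr))  # flagged during the move
    print(urp_by_original_replicas(original, isr))  # not flagged
```

The point of the second check is that replicas which are still catching
up because of the reassignment do not count against the partition, while
a lagging original replica still does.
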
> >>>> > >
> >>>> > >
> >>>> > > Yeah, that makes sense. But I guess we would need
> >>>> "original_replicas" to
> >>>> > be
> >>>> > > propagated to partition leaders in the LeaderAndIsr request since
> >>>> leaders
> >>>> > > are the ones that are computing URPs. That is basically what
> >>>> KIP-352 had
> >>>> > > proposed, but we also need the changes to the reassignment path.
> >>>> Perhaps
> >>>> > it
> >>>> > > makes more sense to address this problem in KIP-236 since that is
> >>>> where
> >>>> > you
> >>>> > > have already introduced "original_replicas"? I'm also happy to do
> >>>> KIP-352
> >>>> > > as a follow-up to KIP-236.
> >>>> > >
> >>>> > > Best,
> >>>> > > Jason
> >>>> > >
> >>>> > >
> >>>> > > On Sun, Apr 7, 2019 at 5:09 PM Ismael Juma <is...@gmail.com>
> >>>> wrote:
> >>>> > >
> >>>> > > > Good discussion about where we should do batching. I think if
> >>>> there is
> >>>> > a
> >>>> > > > clear great way to batch, then it makes a lot of sense to just do
> >>>> it
> >>>> > once.
> >>>> > > > However, if we think there is scope for experimenting with
> >>>> different
> >>>> > > > approaches, then an API that tools can use makes a lot of sense.
> >>>> They
> >>>> > can
> >>>> > > > experiment and innovate. Eventually, we can integrate something
> >>>> into
> >>>> > Kafka
> >>>> > > > if it makes sense.
> >>>> > > >
> >>>> > > > Ismael
> >>>> > > >
> >>>> > > > On Sun, Apr 7, 2019, 11:03 PM Colin McCabe <cm...@apache.org>
> >>>> wrote:
> >>>> > > >
> >>>> > > > > Hi George,
> >>>> > > > >
> >>>> > > > > As Jason was saying, it seems like there are two directions we
> >>>> could
> >>>> > go
> >>>> > > > > here: an external system handling batching, and the controller
> >>>> > handling
> >>>> > > > > batching.  I think the controller handling batching would be
> >>>> better,
> >>>> > > > since
> >>>> > > > > the controller has more information about the state of the
> >>>> system.
> >>>> > If
> >>>> > > > the
> >>>> > > > > controller handles batching, then the controller could also
> >>>> handle
> >>>> > things
> >>>> > > > > like setting up replication quotas for individual partitions.
> >>>> The
> >>>> > > > > controller could do things like throttle replication down if the
> >>>> > cluster
> >>>> > > > > was having problems.
> >>>> > > > >
> >>>> > > > > We kind of need to figure out which way we're going to go on
> >>>> this one
> >>>> > > > > before we set up big new APIs, I think.  If we want an external
> >>>> > system to
> >>>> > > > > handle batching, then we can keep the idea that there is only
> >>>> one
> >>>> > > > > reassignment in progress at once.  If we want the controller to
> >>>> > handle
> >>>> > > > > batching, we will need to get away from that idea.  Instead, we
> >>>> > should
> >>>> > > > just
> >>>> > > > > have a bunch of "ideal assignments" that we tell the controller
> >>>> > about,
> >>>> > > > and
> >>>> > > > > let it decide how to do the batching.  These ideal assignments
> >>>> could
> >>>> > > > change
> >>>> > > > > continuously over time, so from the admin's point of view, there
> >>>> > would be
> >>>> > > > > no start/stop/cancel, but just individual partition
> >>>> reassignments
> >>>> > that we
> >>>> > > > > submit, perhaps over a long period of time.  And then
> >>>> cancellation
> >>>> > might
> >>>> > > > > just mean cancelling just that individual partition
> >>>> reassignment,
> >>>> > not all
> >>>> > > > > partition reassignments.
> >>>> > > > >
> >>>> > > > > best,
> >>>> > > > > Colin
> >>>> > > > >
> >>>> > > > > On Fri, Apr 5, 2019, at 19:34, George Li wrote:
> >>>> > > > > >  Hi Jason / Viktor,
> >>>> > > > > >
> >>>> > > > > > For the URP during a reassignment,  if the
> >>>> "original_replicas" is
> >>>> > kept
> >>>> > > > > > for the current pending reassignment. I think it will be very
> >>>> easy
> >>>> > to
> >>>> > > > > > compare that with the topic/partition's ISR.  If all
> >>>> > > > > > "original_replicas" are in ISR, then URP should be 0 for that
> >>>> > > > > > topic/partition.
> >>>> > > > > >
> >>>> > > > > > It would be also nice to separate the metrics MaxLag/TotalLag
> >>>> for
> >>>> > > > > > Reassignments. I think that will also require
> >>>> "original_replicas"
> >>>> > (the
> >>>> > > > > > topic/partition's replicas just before reassignment when the
> >>>> AR
> >>>> > > > > > (Assigned Replicas) is set to Set(original_replicas) +
> >>>> > > > > > Set(new_replicas_in_reassign_partitions) ).
> >>>> > > > > >
> >>>> > > > > > Thanks,
> >>>> > > > > > George
> >>>> > > > > >
>

Re: [DISCUSS] KIP-435: Incremental Partition Reassignment

Posted by Viktor Somogyi-Vass <vi...@gmail.com>.

Hi All,

I have added another improvement to this, which is to limit the number of
parallel leader movements. I think I'll soon (maybe late this week or early
next) start a vote on this too if there is no additional feedback.

Thanks,
Viktor

On Mon, Apr 29, 2019 at 1:26 PM Viktor Somogyi-Vass <vi...@gmail.com>
wrote:

> Hi Folks,
>
> I've updated the KIP with batching, which would work at both the replica
> and the partition level. To explain it briefly: if, for instance, the
> replica level is set to 2 and the partition level is set to 3, then 2x3=6
> replica reassignments would be in progress at the same time. In the case of
> a reassignment of a single partition from (0, 1, 2, 3, 4) to (5, 6, 7, 8, 9),
> we would form the batches (0, 1) → (5, 6); (2, 3) → (7, 8) and 4 → 9, and
> would execute the reassignment in this order.
>
> Let me know what you think.
>
> Best,
> Viktor

Re: [DISCUSS] KIP-435: Incremental Partition Reassignment

Posted by Viktor Somogyi-Vass <vi...@gmail.com>.

Hi Folks,

I've updated the KIP with batching, which would work at both the replica and
the partition level. To explain it briefly: if, for instance, the replica
level is set to 2 and the partition level is set to 3, then 2x3=6 replica
reassignments would be in progress at the same time. In the case of a
reassignment of a single partition from (0, 1, 2, 3, 4) to (5, 6, 7, 8, 9),
we would form the batches (0, 1) → (5, 6); (2, 3) → (7, 8) and 4 → 9, and
would execute the reassignment in this order.
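
As a rough illustration of the batching described above, here is a minimal
Python sketch. It is not the KIP's implementation: the function names
(replica_batches, max_in_flight) are hypothetical, and it assumes that current
and target replicas are simply paired up by position, which may differ from
the algorithm in the KIP.

```python
from typing import List, Sequence, Tuple

Batch = Tuple[Tuple[int, ...], Tuple[int, ...]]


def replica_batches(current: Sequence[int],
                    target: Sequence[int],
                    replica_level: int) -> List[Batch]:
    """Split one partition's reassignment into ordered batches.

    Assumption (not taken from the KIP): the i-th current replica is
    replaced by the i-th target replica, and at most `replica_level`
    replicas are moved per batch.
    """
    if len(current) != len(target):
        raise ValueError("current and target must have the same length")
    if replica_level < 1:
        raise ValueError("replica_level must be at least 1")

    pairs = list(zip(current, target))
    batches: List[Batch] = []
    for start in range(0, len(pairs), replica_level):
        chunk = pairs[start:start + replica_level]
        old = tuple(c for c, _ in chunk)
        new = tuple(t for _, t in chunk)
        batches.append((old, new))
    return batches


def max_in_flight(replica_level: int, partition_level: int) -> int:
    """Upper bound on replica moves in progress at the same time."""
    return replica_level * partition_level


if __name__ == "__main__":
    # The single-partition example from the email, with replica level 2.
    for old, new in replica_batches([0, 1, 2, 3, 4], [5, 6, 7, 8, 9], 2):
        print(old, "->", new)
    print("max concurrent replica moves:", max_in_flight(2, 3))
```

With replica level 2 this yields three batches for the example partition, the
last of which moves only the single remaining replica.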

Let me know what you think.

Best,
Viktor

On Mon, Apr 15, 2019 at 7:01 PM Viktor Somogyi-Vass <vi...@gmail.com>
wrote:

> A follow-up on the batching topic to clarify my points above.
>
> Generally I think that batching should be a core feature; as Colin said,
> the controller should possess all the related information.
> Also, Cruise Control (or really any third-party admin system) might build
> upon this to give a more holistic approach to balancing brokers. We may
> provide them with APIs that act as building blocks to make their lives
> easier, such as incrementalization, batching, cancellation and rollback,
> but I think the more advanced we go, the more advanced a control surface
> we'll need, and Kafka's basic tooling might not be suitable for that.
>
> Best,
> Viktor
>
>
> On Mon, 15 Apr 2019, 18:22 Viktor Somogyi-Vass, <vi...@gmail.com>
> wrote:
>
>> Hey Guys,
>>
>> I'll reply to you all in this email:
>>
>> @Jun:
>> 1. Yes, it'd be a good idea to add this feature; I'll write this into the
>> KIP. I was actually thinking about introducing two dynamic configs called
>> reassignment.parallel.partition.count and
>> reassignment.parallel.replica.count. The first property would control how
>> many partition reassignments we can do concurrently. The second would go
>> one level finer in granularity and would control how many replicas we want
>> to move at once for a given partition. One more thing that'd be useful to
>> fix is that a given list of partition -> replica list mappings would be
>> executed in the same order (from first to last), so it's predictable
>> overall and the user would have some control over the order of
>> reassignments, since the JSON is still assembled by the user.
>> 2. The /kafka/brokers/topics/{topic} znode, to be specific. I'll update
>> the KIP to contain this.
>>
>> @Jason:
>> I think building this functionality into Kafka would definitely benefit
>> all users, and CC as well, since it'd simplify their software as you
>> said. As I understand it, the main advantage of CC and other similar
>> software is to provide high-level features for automatic load balancing.
>> Reliability, stability and predictability of the reassignment should be a
>> core feature of Kafka. I think the incrementalization feature would make it
>> more stable. I would consider cancellation a core feature too, and we can
>> leave the gate open for external tools to feed in their reassignment JSON
>> as they want. I was also thinking about what set of features we can provide
>> for Kafka, but I think the more advanced we go, the more need there is for
>> an administrative UI component.
>> Regarding KIP-352: thanks for pointing this out. I hadn't seen it,
>> although lately I was also thinking about the throttling aspect of it.
>> It would be a nice add-on to Kafka: though the above configs provide some
>> level of control, it'd be nice to put an upper cap on the bandwidth and
>> make it monitorable.
>>
>> Viktor
>>
>> On Wed, Apr 10, 2019 at 2:57 AM Jason Gustafson <ja...@confluent.io>
>> wrote:
>>
>>> Hi Colin,
>>>
>>> > On a related note, what do you think about the idea of storing the
>>> > reassigning replicas in
>>> > /brokers/topics/[topic]/partitions/[partitionId]/state, rather than in
>>> > the reassignment znode?  I don't think this requires a major change to the
>>> > proposal-- when the controller becomes aware that it should do a
>>> > reassignment, the controller could make the changes.  This also helps
>>> > keep the reassignment znode from getting larger, which has been a problem.
>>>
>>>
>>> Yeah, I think it's a good idea to store the reassignment state at a finer
>>> level. I'm not sure the LeaderAndIsr znode is the right one though. Another
>>> option is /brokers/topics/{topic}. That is where we currently store the
>>> replica assignment. I think we basically want to represent both the current
>>> state and the desired state. This would also open the door to a cleaner way
>>> to update a reassignment while it is still in progress.
>>>
>>> -Jason
>>>
>>>
>>>
>>>
>>> On Mon, Apr 8, 2019 at 11:14 PM George Li <sql_consulting@yahoo.com.invalid>
>>> wrote:
>>>
>>> >  Hi Colin / Jason,
>>> >
>>> > Reassignment should really be done in batches.  I am not too worried
>>> > about the reassignment znode getting larger.  In a real production
>>> > environment, too many concurrent reassignments and too frequent
>>> > submission of reassignments seemed to cause latency spikes in the Kafka
>>> > cluster.  So batching/staggering/throttling the submission of
>>> > reassignments is recommended.
>>> >
>>> > In KIP-236, the "originalReplicas" are only kept for the partitions
>>> > currently being reassigned (a small number), both in memory in the
>>> > controller context (partitionsBeingReassigned) and in the znode
>>> > /admin/reassign_partitions.  I think the "setting in the RPC like null =
>>> > no replicas are reassigning" mentioned below is a good idea.
>>> >
>>> > There seem to be some issues with the mail archive server of this mailing
>>> > list?  I didn't receive email after April 7th, and the archive for April
>>> > 2019 has only 50 messages
>>> > (http://mail-archives.apache.org/mod_mbox/kafka-dev/201904.mbox/thread)?
>>> >
>>> > Thanks,
>>> > George
>>> >
>>> > On Mon, 08 Apr 2019 17:54:48 GMT, Colin McCabe wrote:
>>> >
>>> > Yeah, I think adding this information to LeaderAndIsr makes sense.  It
>>> > would be better to track "reassigningReplicas" than "originalReplicas", I
>>> > think.  Tracking "originalReplicas" is going to involve sending a lot
>>> > more data, since most replicas in the system are not reassigning at any
>>> > given point.  Or we would need a hack in the RPC like null = no replicas
>>> > are reassigning.
>>> >
>>> > On a related note, what do you think about the idea of storing the
>>> > reassigning replicas in
>>> > /brokers/topics/[topic]/partitions/[partitionId]/state, rather than in
>>> > the reassignment znode?  I don't think this requires a major change to
>>> > the proposal-- when the controller becomes aware that it should do a
>>> > reassignment, the controller could make the changes.  This also helps
>>> > keep the reassignment znode from getting larger, which has been a
>>> > problem.
>>> >
>>> > best,
>>> > Colin
>>> >
>>> >
>>> > On Mon, Apr 8, 2019, at 09:29, Jason Gustafson wrote:
>>> > > Hey George,
>>> > >
>>> > > > For the URP during a reassignment,  if the "original_replicas" is kept
>>> > > > for the current pending reassignment. I think it will be very easy to
>>> > > > compare that with the topic/partition's ISR.  If all
>>> > > > "original_replicas" are in ISR, then URP should be 0 for that
>>> > > > topic/partition.
>>> > >
>>> > > Yeah, that makes sense. But I guess we would need "original_replicas" to
>>> > > be propagated to partition leaders in the LeaderAndIsr request since
>>> > > leaders are the ones that are computing URPs. That is basically what
>>> > > KIP-352 had proposed, but we also need the changes to the reassignment
>>> > > path. Perhaps it makes more sense to address this problem in KIP-236
>>> > > since that is where you have already introduced "original_replicas"? I'm
>>> > > also happy to do KIP-352 as a follow-up to KIP-236.
>>> > >
>>> > > Best,
>>> > > Jason
>>> > >
>>> > >
>>> > > On Sun, Apr 7, 2019 at 5:09 PM Ismael Juma <is...@gmail.com> wrote:
>>> > >
>>> > > > Good discussion about where we should do batching. I think if there is
>>> > > > a clear great way to batch, then it makes a lot of sense to just do it
>>> > > > once. However, if we think there is scope for experimenting with
>>> > > > different approaches, then an API that tools can use makes a lot of
>>> > > > sense. They can experiment and innovate. Eventually, we can integrate
>>> > > > something into Kafka if it makes sense.
>>> > > >
>>> > > > Ismael
>>> > > >
>>> > > > On Sun, Apr 7, 2019, 11:03 PM Colin McCabe <cm...@apache.org> wrote:
>>> > > >
>>> > > > > Hi George,
>>> > > > >
>>> > > > > As Jason was saying, it seems like there are two directions we could
>>> > > > > go here: an external system handling batching, and the controller
>>> > > > > handling batching.  I think the controller handling batching would be
>>> > > > > better, since the controller has more information about the state of
>>> > > > > the system.  If the controller handles batching, then the controller
>>> > > > > could also handle things like setting up replication quotas for
>>> > > > > individual partitions.  The controller could do things like throttle
>>> > > > > replication down if the cluster was having problems.
>>> > > > >
>>> > > > > We kind of need to figure out which way we're going to go on this one
>>> > > > > before we set up big new APIs, I think.  If we want an external system
>>> > > > > to handle batching, then we can keep the idea that there is only one
>>> > > > > reassignment in progress at once.  If we want the controller to handle
>>> > > > > batching, we will need to get away from that idea.  Instead, we should
>>> > > > > just have a bunch of "ideal assignments" that we tell the controller
>>> > > > > about, and let it decide how to do the batching.  These ideal
>>> > > > > assignments could change continuously over time, so from the admin's
>>> > > > > point of view, there would be no start/stop/cancel, but just
>>> > > > > individual partition reassignments that we submit, perhaps over a long
>>> > > > > period of time.  And then cancellation might just mean cancelling just
>>> > > > > that individual partition reassignment, not all partition
>>> > > > > reassignments.
>>> > > > >
>>> > > > > best,
>>> > > > > Colin
>>> > > > >
>>> > > > > On Fri, Apr 5, 2019, at 19:34, George Li wrote:
>>> > > > > >  Hi Jason / Viktor,
>>> > > > > >
>>> > > > > > For the URP during a reassignment,  if the "original_replicas" is
>>> > > > > > kept for the current pending reassignment. I think it will be very
>>> > > > > > easy to compare that with the topic/partition's ISR.  If all
>>> > > > > > "original_replicas" are in ISR, then URP should be 0 for that
>>> > > > > > topic/partition.
>>> > > > > >
>>> > > > > > It would be also nice to separate the metrics MaxLag/TotalLag for
>>> > > > > > Reassignments. I think that will also require "original_replicas"
>>> > > > > > (the topic/partition's replicas just before reassignment when the AR
>>> > > > > > (Assigned Replicas) is set to Set(original_replicas) +
>>> > > > > > Set(new_replicas_in_reassign_partitions) ).
>>> > > > > >
>>> > > > > > Thanks,
>>> > > > > > George
>>> > > > > >
>>> > > > > >     On Friday, April 5, 2019, 6:29:55 PM PDT, Jason Gustafson
>>> > > > > > <ja...@confluent.io> wrote:
>>> > > > > >
>>> > > > > >  Hi Viktor,
>>> > > > > >
>>> > > > > > Thanks for writing this up. As far as questions about overlap with
>>> > > > > > KIP-236, I agree it seems mostly orthogonal. I think KIP-236 may have
>>> > > > > > had a larger initial scope, but now it focuses on cancellation and
>>> > > > > > batching is left for future work.
>>> > > > > >
>>> > > > > > With that said, I think we may not actually need a KIP for the
>>> > > > > > current proposal since it doesn't change any APIs. To make it more
>>> > > > > > generally useful, however, it would be nice to handle batching at the
>>> > > > > > partition level as well as Jun suggests. The basic question is at
>>> > > > > > what level should the batching be determined. You could rely on
>>> > > > > > external processes (e.g. cruise control) or it could be built into
>>> > > > > > the controller. There are tradeoffs either way, but I think it
>>> > > > > > simplifies such tools if it is handled internally. Then it would be
>>> > > > > > much safer to submit a larger reassignment even just using the simple
>>> > > > > > tools that come with Kafka.
>>> > > > > >
>>> > > > > > By the way, since you are looking into some of the reassignment
>>> > > > > > logic, another problem that we might want to address is the
>>> > > > > > misleading way we report URPs during a reassignment. I had a naive
>>> > > > > > proposal for this previously, but it didn't really work:
>>> > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-352%3A+Distinguish+URPs+caused+by+reassignment
>>> > > > > > Potentially fixing that could fall under this work as well if you
>>> > > > > > think it makes sense.
>>> > > > > >
>>> > > > > > Best,
>>> > > > > > Jason
>>> > > > > >
>>> > > > > > On Thu, Apr 4, 2019 at 4:49 PM Jun Rao <ju...@confluent.io> wrote:
>>> > > > > >
>>> > > > > > > Hi, Viktor,
>>> > > > > > >
>>> > > > > > > Thanks for the KIP. A couple of comments below.
>>> > > > > > >
>>> > > > > > > 1. Another potential thing to do reassignment incrementally is to
>>> > > > > > > move a batch of partitions at a time, instead of all partitions.
>>> > > > > > > This may lead to less data replication since by the time the first
>>> > > > > > > batch of partitions have been completely moved, some data of the
>>> > > > > > > next batch may have been deleted due to retention and doesn't need
>>> > > > > > > to be replicated.
>>> > > > > > >
>>> > > > > > > 2. "Update CR in Zookeeper with TR for the given partition". Which
>>> > > > > > > ZK path is this for?
>>> > > > > > >
>>> > > > > > > Jun
>>> > > > > > >
>>> > > > > > > On Sat, Feb 23, 2019 at 2:12 AM Viktor Somogyi-Vass <
>>> > > > > > > viktorsomogyi@gmail.com>
>>> > > > > > > wrote:
>>> > > > > > >
>>> > > > > > > > Hi Harsha,
>>> > > > > > > >
>>> > > > > > > > As far as I understand KIP-236 it's about enabling reassignment
>>> > > > > > > > cancellation and as a future plan providing a queue of replica
>>> > > > > > > > reassignment steps to allow manual reassignment chains. While I
>>> > > > > > > > agree that the reassignment chain has a specific use case that
>>> > > > > > > > allows fine grain control over reassignment process, My proposal on
>>> > > > > > > > the other hand doesn't talk about cancellation but it only provides
>>> > > > > > > > an automatic way to incrementalize an arbitrary reassignment which I
>>> > > > > > > > think fits the general use case where users don't want that level of
>>> > > > > > > > control but still would like a balanced way of reassignments.
>>> > > > > > > > Therefore I think it's still relevant as an improvement of the
>>> > > > > > > > current algorithm.
>>> > > > > > > > Nevertheless I'm happy to add my ideas to KIP-236 as I think it
>>> > > > > > > > would be a great improvement to Kafka.
>>> > > > > > > >
>>> > > > > > > > Cheers,
>>> > > > > > > > Viktor
>>> > > > > > > >
>>> > > > > > > > On Fri, Feb 22, 2019 at 5:05 PM Harsha <ka...@harsha.io> wrote:
>>> > > > > > > >
>>> > > > > > > > > Hi Viktor,
>>> > > > > > > > > There is already KIP-236 for the same feature and George made a PR
>>> > > > > > > > > for this as well.
>>> > > > > > > > > Lets consolidate these two discussions. If you have any cases that
>>> > > > > > > > > are not being solved by KIP-236 can you please mention them in that
>>> > > > > > > > > thread. We can address as part of KIP-236.
>>> > > > > > > > >
>>> > > > > > > > > Thanks,
>>> > > > > > > > > Harsha
>>> > > > > > > > >
>>> > > > > > > > > On Fri, Feb 22, 2019, at 5:44 AM, Viktor Somogyi-Vass wrote:
>>> > > > > > > > > > Hi Folks,
>>> > > > > > > > > >
>>> > > > > > > > > > I've created a KIP about an improvement of the reassignment
>>> > > > > > > > > > algorithm we have. It aims to enable partition-wise incremental
>>> > > > > > > > > > reassignment. The motivation for this is to avoid excess load that
>>> > > > > > > > > > the current replication algorithm implicitly carries as in that
>>> > > > > > > > > > case there are points in the algorithm where both the new and old
>>> > > > > > > > > > replica set could be online and replicating which puts double (or
>>> > > > > > > > > > almost double) pressure on the brokers which could cause problems.
>>> > > > > > > > > > Instead my proposal would slice this up into several steps where
>>> > > > > > > > > > each step is calculated based on the final target replicas and the
>>> > > > > > > > > > current replica assignment taking into account scenarios where
>>> > > > > > > > > > brokers could be offline and when there are not enough replicas to
>>> > > > > > > > > > fulfil the min.insync.replica requirement.
>>> > > > > > > > > >
>>> > > > > > > > > > The link to the KIP:
>>> > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-435%3A+Incremental+Partition+Reassignment
>>> > > > > > > > > >
>>> > > > > > > > > > I'd be happy to receive any feedback.
>>> > > > > > > > > >
>>> > > > > > > > > > An important note is that this KIP and another one, KIP-236 that is
>>> > > > > > > > > > about interruptible reassignment
>>> > > > > > > > > > (https://cwiki.apache.org/confluence/display/KAFKA/KIP-236%3A+Interruptible+Partition+Reassignment)
>>> > > > > > > > > > should be compatible.
>>> > > > > > > > > >
>>> > > > > > > > > > Thanks,
>>> > > > > > > > > > Viktor
>>> > > > > > > > > >
>>> > > > > > > > >
>>> > > > > > > >
>>> > > > > > >
>>> > > > > >
>>> > > > >
>>> > > >
>>> > >
>>> >
>>>
>>

Re: [DISCUSS] KIP-435: Incremental Partition Reassignment

Posted by Viktor Somogyi-Vass <vi...@gmail.com>.

A follow-up on the batching topic to clarify my points above.

Generally I think batching should be a core feature because, as Colin said,
the controller has all the relevant information about the state of the
cluster.
Cruise Control (or really any third-party admin system) could also build on
this to offer a more holistic approach to balancing brokers. We can provide
such tools with APIs that act as building blocks (incrementalization,
batching, cancellation and rollback) to make their life easier, but I think
the more advanced these features get, the more advanced a control surface
we'll need, and Kafka's basic tooling might not be suitable for that.

Best,
Viktor
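
As an illustration of the "incrementalization" building block mentioned
above, here is a minimal Python sketch, not the actual KIP-435 algorithm. It
assumes a partition's replicas are given as plain lists of broker ids and
ignores offline brokers and the min.insync.replicas check that the KIP takes
into account. The function name plan_incremental_steps and the parameter
replicas_per_step are hypothetical.

```python
def plan_incremental_steps(current, target, replicas_per_step=1):
    """Split one partition's reassignment into smaller intermediate steps.

    Returns the list of replica assignments to pass through, ending at
    `target`. Each step swaps at most `replicas_per_step` old replicas for
    the same number of new ones, so old and new replica sets never fully
    overlap in time.
    """
    if replicas_per_step < 1:
        raise ValueError("replicas_per_step must be at least 1")

    to_add = [b for b in target if b not in current]    # brokers still to join
    to_drop = [b for b in current if b not in target]   # brokers still to leave
    assignment = list(current)
    steps = []

    while to_add or to_drop:
        adding, to_add = to_add[:replicas_per_step], to_add[replicas_per_step:]
        dropping, to_drop = to_drop[:replicas_per_step], to_drop[replicas_per_step:]
        # Assignment once the new replicas have caught up and the old ones
        # in this step have been removed.
        assignment = [b for b in assignment if b not in dropping] + adding
        if not to_add and not to_drop:
            # Last step: adopt the target order (it decides the preferred leader).
            assignment = list(target)
        steps.append(list(assignment))

    if assignment != list(target):
        # Same brokers, different order: a single reordering step.
        steps.append(list(target))
    return steps
```

For example, moving a partition from brokers [1, 2, 3] to [4, 5, 6] with
replicas_per_step=1 yields [2, 3, 4], then [3, 4, 5], then [4, 5, 6], so only
one extra replica is being fetched at any time.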


On Mon, 15 Apr 2019, 18:22 Viktor Somogyi-Vass, <vi...@gmail.com>
wrote:

> Hey Guys,
>
> I'll reply to you all in this email:
>
> @Jun:
> 1. yes, it'd be a good idea to add this feature, I'll write this into the
> KIP. I was actually thinking about introducing a dynamic config called
> reassignment.parallel.partition.count and
> reassignment.parallel.replica.count. The first property would control how
> many partition reassignment can we do concurrently. The second would go one
> level in granularity and would control how many replicas do we want to move
> for a given partition. Also one more thing that'd be useful to fix is that
> a given list of partition -> replica list would be executed in the same
> order (from first to last) so it's overall predictable and the user would
> have some control over the order of reassignments should be specified as
> the JSON is still assembled by the user.
> 2. the /kafka/brokers/topics/{topic} znode to be specific. I'll update the
> KIP to contain this.
>
> @Jason:
> I think building this functionality into Kafka would definitely benefit
> all the users and that CC as well as it'd simplify their software as you
> said. As I understand the main advantage of CC and other similar softwares
> are to give high level features for automatic load balancing. Reliability,
> stability and predictability of the reassignment should be a core feature
> of Kafka. I think the incrementalization feature would make it more stable.
> I would consider cancellation too as a core feature and we can leave the
> gate open for external tools to feed in their reassignment json as they
> want. I was also thinking about what are the set of features we can provide
> for Kafka but I think the more advanced we go the more need there is for an
> administrative UI component.
> Regarding KIP-352: Thanks for pointing this out, I didn't see this
> although lately I was also thinking about the throttling aspect of it.
> Would be a nice add-on to Kafka since though the above configs provide some
> level of control, it'd be nice to put an upper cap on the bandwidth and
> make it monitorable.
>
> Viktor
>
>

Re: [DISCUSS] KIP-435: Incremental Partition Reassignment

Posted by Viktor Somogyi-Vass <vi...@gmail.com>.

Hey Guys,

I'll reply to you all in this email:

@Jun:
1. Yes, it'd be a good idea to add this feature; I'll write it into the
KIP. I was actually thinking about introducing two dynamic configs,
reassignment.parallel.partition.count and
reassignment.parallel.replica.count. The first would control how many
partition reassignments we can do concurrently. The second would go one
level finer in granularity and control how many replicas we move at a time
for a given partition. One more thing that'd be useful to fix is that a
given list of partition -> replica list mappings would be executed in the
order given (from first to last), so it's predictable overall and the user
has some control over the order of reassignments, since the JSON is still
assembled by the user.
2. The /kafka/brokers/topics/{topic} znode, to be specific. I'll update the
KIP to contain this.
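
For illustration only, the two proposed limits
(reassignment.parallel.partition.count and
reassignment.parallel.replica.count) could interact roughly as in the Python
sketch below. This is not the algorithm from the KIP: the function names are
hypothetical, the "swap N in, N out" stepping is a simplification that
ignores offline brokers and min.insync.replicas, and a real controller would
wait for each step to catch up before starting the next one.

```python
from typing import Dict, Iterator, List


def partition_batches(
    reassignment: Dict[str, List[int]], parallel_partition_count: int
) -> Iterator[Dict[str, List[int]]]:
    """Split the user-supplied reassignment into batches, preserving its order."""
    if parallel_partition_count < 1:
        raise ValueError("parallel_partition_count must be >= 1")
    items = list(reassignment.items())  # dicts keep insertion order
    for start in range(0, len(items), parallel_partition_count):
        yield dict(items[start:start + parallel_partition_count])


def replica_steps(
    current: List[int], target: List[int], parallel_replica_count: int
) -> List[List[int]]:
    """Return intermediate replica lists, moving at most N replicas per step."""
    if parallel_replica_count < 1:
        raise ValueError("parallel_replica_count must be >= 1")
    n = parallel_replica_count
    to_add = [b for b in target if b not in current]
    to_drop = [b for b in current if b not in target]
    replicas = list(current)
    steps: List[List[int]] = []
    while to_add or to_drop:
        adding, to_add = to_add[:n], to_add[n:]
        dropping, to_drop = to_drop[:n], to_drop[n:]
        replicas = [b for b in replicas if b not in dropping] + adding
        steps.append(list(replicas))
    # The final step must match the target ordering exactly, since the
    # first replica in the list is the preferred leader.
    if replicas != list(target):
        if steps:
            steps[-1] = list(target)
        else:
            steps.append(list(target))
    return steps
```

With current replicas [1, 2, 3], target [4, 5, 6] and a replica limit of 1,
replica_steps produces [2, 3, 4], then [3, 4, 5], then [4, 5, 6]; with a
limit of 3 it collapses to the single step [4, 5, 6].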

@Jason:
I think building this functionality into Kafka would definitely benefit all
users, and Cruise Control (CC) as well, since it'd simplify their software,
as you said. As I understand it, the main advantage of CC and similar tools
is that they provide high-level features for automatic load balancing.
Reliability, stability and predictability of reassignment should be core
features of Kafka itself, and I think incrementalization would make it more
stable. I would consider cancellation a core feature too, and we can leave
the door open for external tools to feed in their reassignment JSON as they
want. I was also thinking about what set of features we could provide in
Kafka, but I think the more advanced we go, the more need there is for an
administrative UI component.
Regarding KIP-352: thanks for pointing this out. I hadn't seen it, although
lately I was also thinking about the throttling aspect. It would be a nice
add-on to Kafka: although the above configs provide some level of control,
it'd be nice to put an upper cap on the bandwidth and make it monitorable.

Viktor

On Wed, Apr 10, 2019 at 2:57 AM Jason Gustafson <ja...@confluent.io> wrote:

> Hi Colin,
>
> On a related note, what do you think about the idea of storing the
> > reassigning replicas in
> > /brokers/topics/[topic]/partitions/[partitionId]/state, rather than in
> the
> > reassignment znode?  I don't think this requires a major change to the
> > proposal-- when the controller becomes aware that it should do a
> > reassignment, the controller could make the changes.  This also helps
> keep
> > the reassignment znode from getting larger, which has been a problem.
>
>
> Yeah, I think it's a good idea to store the reassignment state at a finer
> level. I'm not sure the LeaderAndIsr znode is the right one though. Another
> option is /brokers/topics/{topic}. That is where we currently store the
> replica assignment. I think we basically want to represent both the current
> state and the desired state. This would also open the door to a cleaner way
> to update a reassignment while it is still in progress.
>
> -Jason
>
>
>
>
> On Mon, Apr 8, 2019 at 11:14 PM George Li <sql_consulting@yahoo.com
> .invalid>
> wrote:
>
> >  Hi Colin / Jason,
> >
> > Reassignment should really be doing a batches.  I am not too worried
> about
> > reassignment znode getting larger.  In a real production environment,
> too
> > many concurrent reassignment and too frequent submission of reassignments
> > seemed to cause latency spikes of kafka cluster.  So
> > batching/staggering/throttling of submitting reassignments is
> recommended.
> >
> > In KIP-236,  The "originalReplicas" are only kept for the current
> > reassigning partitions (small #), and kept in memory of the controller
> > context partitionsBeingReassigned as well as in the znode
> > /admin/reassign_partitions,  I think below "setting in the RPC like null
> =
> > no replicas are reassigning" is a good idea.
> >
> > There seems to be some issues with the Mail archive server of this
> mailing
> > list?  I didn't receive email after April 7th, and the archive for April
> > 2019 has only 50 messages (
> > http://mail-archives.apache.org/mod_mbox/kafka-dev/201904.mbox/thread) ?
> >
> > Thanks,
> > George
> >
> >    on, 08 Apr 2019 17:54:48 GMT  Colin McCabe wrote:
> >
> >   Yeah, I think adding this information to LeaderAndIsr makes sense.  It
> > would be better to track
> > "reassigningReplicas" than "originalReplicas", I think.  Tracking
> > "originalReplicas" is going
> > to involve sending a lot more data, since most replicas in the system are
> > not reassigning
> > at any given point.  Or we would need a hack in the RPC like null = no
> > replicas are reassigning.
> >
> > On a related note, what do you think about the idea of storing the
> > reassigning replicas in
> >  /brokers/topics/[topic]/partitions/[partitionId]/state, rather than in
> > the reassignment znode?
> >  I don't think this requires a major change to the proposal-- when the
> > controller becomes
> > aware that it should do a reassignment, the controller could make the
> > changes.  This also
> > helps keep the reassignment znode from getting larger, which has been a
> > problem.
> >
> > best,
> > Colin
> >
> >
> > On Mon, Apr 8, 2019, at 09:29, Jason Gustafson wrote:
> > > Hey George,
> > >
> > > For the URP during a reassignment,  if the "original_replicas" is kept
> > for
> > > > the current pending reassignment. I think it will be very easy to
> > compare
> > > > that with the topic/partition's ISR.  If all "original_replicas" are
> in
> > > > ISR, then URP should be 0 for that topic/partition.
> > >
> > >
> > > Yeah, that makes sense. But I guess we would need "original_replicas"
> to
> > be
> > > propagated to partition leaders in the LeaderAndIsr request since
> leaders
> > > are the ones that are computing URPs. That is basically what KIP-352
> had
> > > proposed, but we also need the changes to the reassignment path.
> Perhaps
> > it
> > > makes more sense to address this problem in KIP-236 since that is where
> > you
> > > have already introduced "original_replicas"? I'm also happy to do
> KIP-352
> > > as a follow-up to KIP-236.
> > >
> > > Best,
> > > Jason
> > >
> > >
> > > On Sun, Apr 7, 2019 at 5:09 PM Ismael Juma <is...@gmail.com> wrote:
> > >
> > > > Good discussion about where we should do batching. I think if there
> is
> > a
> > > > clear great way to batch, then it makes a lot of sense to just do it
> > once.
> > > > However, if we think there is scope for experimenting with different
> > > > approaches, then an API that tools can use makes a lot of sense. They
> > can
> > > > experiment and innovate. Eventually, we can integrate something into
> > Kafka
> > > > if it makes sense.
> > > >
> > > > Ismael
> > > >
> > > > On Sun, Apr 7, 2019, 11:03 PM Colin McCabe <cm...@apache.org>
> wrote:
> > > >
> > > > > Hi George,
> > > > >
> > > > > As Jason was saying, it seems like there are two directions we
> could
> > go
> > > > > here: an external system handling batching, and the controller
> > handling
> > > > > batching.  I think the controller handling batching would be
> better,
> > > > since
> > > > > the controller has more information about the state of the system.
> > If
> > > > the
> > > > > controller handles batching, then the controller could also handle
> > things
> > > > > like setting up replication quotas for individual partitions.  The
> > > > > controller could do things like throttle replication down if the
> > cluster
> > > > > was having problems.
> > > > >
> > > > > We kind of need to figure out which way we're going to go on this
> one
> > > > > before we set up big new APIs, I think.  If we want an external
> > system to
> > > > > handle batching, then we can keep the idea that there is only one
> > > > > reassignment in progress at once.  If we want the controller to
> > handle
> > > > > batching, we will need to get away from that idea.  Instead, we
> > should
> > > > just
> > > > > have a bunch of "ideal assignments" that we tell the controller
> > about,
> > > > and
> > > > > let it decide how to do the batching.  These ideal assignments
> could
> > > > change
> > > > > continuously over time, so from the admin's point of view, there
> > would be
> > > > > no start/stop/cancel, but just individual partition reassignments
> > that we
> > > > > submit, perhaps over a long period of time.  And then cancellation
> > might
> > > > > just mean cancelling just that individual partition reassignment,
> > not all
> > > > > partition reassignments.
> > > > >
> > > > > best,
> > > > > Colin
> > > > >
> > > > > On Fri, Apr 5, 2019, at 19:34, George Li wrote:
> > > > > >  Hi Jason / Viktor,
> > > > > >
> > > > > > For the URP during a reassignment,  if the "original_replicas" is
> > kept
> > > > > > for the current pending reassignment. I think it will be very
> easy
> > to
> > > > > > compare that with the topic/partition's ISR.  If all
> > > > > > "original_replicas" are in ISR, then URP should be 0 for that
> > > > > > topic/partition.
> > > > > >
> > > > > > It would be also nice to separate the metrics MaxLag/TotalLag for
> > > > > > Reassignments. I think that will also require "original_replicas"
> > (the
> > > > > > topic/partition's replicas just before reassignment when the AR
> > > > > > (Assigned Replicas) is set to Set(original_replicas) +
> > > > > > Set(new_replicas_in_reassign_partitions) ).
> > > > > >
> > > > > > Thanks,
> > > > > > George
> > > > > >
> > > > > >     On Friday, April 5, 2019, 6:29:55 PM PDT, Jason Gustafson
> > > > > > <ja...@confluent.io> wrote:
> > > > > >
> > > > > >  Hi Viktor,
> > > > > >
> > > > > > Thanks for writing this up. As far as questions about overlap
> with
> > > > > KIP-236,
> > > > > > I agree it seems mostly orthogonal. I think KIP-236 may have had
> a
> > > > larger
> > > > > > initial scope, but now it focuses on cancellation and batching is
> > left
> > > > > for
> > > > > > future work.
> > > > > >
> > > > > > With that said, I think we may not actually need a KIP for the
> > current
> > > > > > proposal since it doesn't change any APIs. To make it more
> > generally
> > > > > > useful, however, it would be nice to handle batching at the
> > partition
> > > > > level
> > > > > > as well as Jun suggests. The basic question is at what level
> > should the
> > > > > > batching be determined. You could rely on external processes
> (e.g.
> > > > cruise
> > > > > > control) or it could be built into the controller. There are
> > tradeoffs
> > > > > > either way, but I think it simplifies such tools if it is handled
> > > > > > internally. Then it would be much safer to submit a larger
> > reassignment
> > > > > > even just using the simple tools that come with Kafka.
> > > > > >
> > > > > > By the way, since you are looking into some of the reassignment
> > logic,
> > > > > > another problem that we might want to address is the misleading
> > way we
> > > > > > report URPs during a reassignment. I had a naive proposal for
> this
> > > > > > previously, but it didn't really work
> > > > > >
> > > > >
> > > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-352%3A+Distinguish+URPs+caused+by+reassignment
> > > > > .
> > > > > > Potentially fixing that could fall under this work as well if you
> > think
> > > > > > it
> > > > > > makes sense.
> > > > > >
> > > > > > Best,
> > > > > > Jason
> > > > > >
> > > > > > On Thu, Apr 4, 2019 at 4:49 PM Jun Rao <ju...@confluent.io> wrote:
> > > > > >
> > > > > > > Hi, Viktor,
> > > > > > >
> > > > > > > Thanks for the KIP. A couple of comments below.
> > > > > > >
> > > > > > > 1. Another potential thing to do reassignment incrementally is
> to
> > > > move
> > > > > a
> > > > > > > batch of partitions at a time, instead of all partitions. This
> > may
> > > > > lead to
> > > > > > > less data replication since by the time the first batch of
> > partitions
> > > > > have
> > > > > > > been completely moved, some data of the next batch may have
> been
> > > > > deleted
> > > > > > > due to retention and doesn't need to be replicated.
> > > > > > >
> > > > > > > 2. "Update CR in Zookeeper with TR for the given partition".
> > Which
> > ZK
> > > > > path
> > > > > > > is this for?
> > > > > > >
> > > > > > > Jun
> > > > > > >
> > > > > > > On Sat, Feb 23, 2019 at 2:12 AM Viktor Somogyi-Vass <
> > > > > > > viktorsomogyi@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi Harsha,
> > > > > > > >
> > > > > > > > As far as I understand KIP-236 it's about enabling
> reassignment
> > > > > > > > cancellation and as a future plan providing a queue of
> replica
> > > > > > > reassignment
> > > > > > > > steps to allow manual reassignment chains. While I agree that
> > the
> > > > > > > > reassignment chain has a specific use case that allows fine
> > grain
> > > > > control
> > > > > > > > over reassignment process, My proposal on the other hand
> > doesn't
> > > > talk
> > > > > > > about
> > > > > > > > cancellation but it only provides an automatic way to
> > > > incrementalize
> > > > > an
> > > > > > > > arbitrary reassignment which I think fits the general use
> case
> > > > where
> > > > > > > users
> > > > > > > > don't want that level of control but still would like a
> > balanced
> > > > way
> > > > > of
> > > > > > > > reassignments. Therefore I think it's still relevant as an
> > > > > improvement of
> > > > > > > > the current algorithm.
> > > > > > > > Nevertheless I'm happy to add my ideas to KIP-236 as I think
> > it
> > > > > would be
> > > > > > > a
> > > > > > > > great improvement to Kafka.
> > > > > > > >
> > > > > > > > Cheers,
> > > > > > > > Viktor
> > > > > > > >
> > > > > > > > On Fri, Feb 22, 2019 at 5:05 PM Harsha <ka...@harsha.io>
> > wrote:
> > > > > > > >
> > > > > > > > > Hi Viktor,
> > > > > > > > >            There is already KIP-236 for the same feature
> > and
> > > > George
> > > > > > > made
> > > > > > > > > a PR for this as well.
> > > > > > > > > Lets consolidate these two discussions. If you have any
> > cases
> > > > that
> > > > > are
> > > > > > > > not
> > > > > > > > > being solved by KIP-236 can you please mention them in
> > that
> > > > > thread. We
> > > > > > > > can
> > > > > > > > > address as part of KIP-236.
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Harsha
> > > > > > > > >
> > > > > > > > > On Fri, Feb 22, 2019, at 5:44 AM, Viktor Somogyi-Vass
> wrote:
> > > > > > > > > > Hi Folks,
> > > > > > > > > >
> > > > > > > > > > I've created a KIP about an improvement of the
> reassignment
> > > > > algorithm
> > > > > > > > we
> > > > > > > > > > have. It aims to enable partition-wise incremental
> > > > reassignment.
> > > > > The
> > > > > > > > > > motivation for this is to avoid excess load that the
> > current
> > > > > > > > replication
> > > > > > > > > > algorithm implicitly carries as in that case there
> > are points
> > > > in
> > > > > the
> > > > > > > > > > algorithm where both the new and old replica set could
> > be
> > > > online
> > > > > and
> > > > > > > > > > replicating which puts double (or almost double) pressure
> > on
> > > > the
> > > > > > > > brokers
> > > > > > > > > > which could cause problems.
> > > > > > > > > > Instead my proposal would slice this up into several
> > steps
> > > > where
> > > > > each
> > > > > > > > > step
> > > > > > > > > > is calculated based on the final target replicas and
> > the
> > > > current
> > > > > > > > replica
> > > > > > > > > > assignment taking into account scenarios where brokers
> > could be
> > > > > > > offline
> > > > > > > > > and
> > > > > > > > > > when there are not enough replicas to fulfil the
> > > > > min.insync.replica
> > > > > > > > > > requirement.
> > > > > > > > > >
> > > > > > > > > > The link to the KIP:
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > >
> > > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-435%3A+Incremental+Partition+Reassignment
> > > > > > > > > >
> > > > > > > > > > I'd be happy to receive any feedback.
> > > > > > > > > >
> > > > > > > > > > An important note is that this KIP and another one,
> > KIP-236
> > > > that
> > > > > is
> > > > > > > > > > about
> > > > > > > > > > interruptible reassignment (
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > >
> > > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-236%3A+Interruptible+Partition+Reassignment
> > > > > > > > > )
> > > > > > > > > > should be compatible.
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Viktor
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-435: Incremental Partition Reassignment

Posted by Jason Gustafson <ja...@confluent.io>.

Hi Colin,

> On a related note, what do you think about the idea of storing the
> reassigning replicas in
> /brokers/topics/[topic]/partitions/[partitionId]/state, rather than in the
> reassignment znode?  I don't think this requires a major change to the
> proposal-- when the controller becomes aware that it should do a
> reassignment, the controller could make the changes.  This also helps keep
> the reassignment znode from getting larger, which has been a problem.


Yeah, I think it's a good idea to store the reassignment state at a finer
level. I'm not sure the LeaderAndIsr znode is the right one though. Another
option is /brokers/topics/{topic}. That is where we currently store the
replica assignment. I think we basically want to represent both the current
state and the desired state. This would also open the door to a cleaner way
to update a reassignment while it is still in progress.
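
To make the "current state plus desired state" idea a bit more concrete,
here is a rough Python sketch of a per-topic record that carries both. The
field names ("partitions", "target_partitions") and the helper functions are
made up for illustration; they are not an existing or proposed znode schema.

```python
import json
from typing import Dict, List

Assignment = Dict[int, List[int]]  # partition id -> ordered broker id list


def encode_topic_state(current: Assignment, target: Assignment) -> str:
    """Serialize current and desired assignments into one JSON document."""
    return json.dumps(
        {
            "partitions": {str(p): r for p, r in current.items()},
            "target_partitions": {str(p): r for p, r in target.items()},
        },
        sort_keys=True,
    )


def pending_reassignments(encoded: str) -> Assignment:
    """Return partitions whose desired replicas differ from the current ones."""
    state = json.loads(encoded)
    current = state["partitions"]
    return {
        int(p): replicas
        for p, replicas in state.get("target_partitions", {}).items()
        if current.get(p) != replicas
    }


def update_target(encoded: str, partition: int, replicas: List[int]) -> str:
    """Change the desired replicas of one partition while others are in flight."""
    state = json.loads(encoded)
    state.setdefault("target_partitions", {})[str(partition)] = list(replicas)
    return json.dumps(state, sort_keys=True)
```

Because each partition carries its own target, updating or cancelling one
reassignment (by setting its target back to the current replicas) does not
have to touch the others.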

-Jason




On Mon, Apr 8, 2019 at 11:14 PM George Li <sq...@yahoo.com.invalid>
wrote:

>  Hi Colin / Jason,
>
> Reassignment should really be doing a batches.  I am not too worried about
> reassignment znode getting larger.  In a real production environment,  too
> many concurrent reassignment and too frequent submission of reassignments
> seemed to cause latency spikes of kafka cluster.  So
> batching/staggering/throttling of submitting reassignments is recommended.
>
> In KIP-236,  The "originalReplicas" are only kept for the current
> reassigning partitions (small #), and kept in memory of the controller
> context partitionsBeingReassigned as well as in the znode
> /admin/reassign_partitions,  I think below "setting in the RPC like null =
> no replicas are reassigning" is a good idea.
>
> There seems to be some issues with the Mail archive server of this mailing
> list?  I didn't receive email after April 7th, and the archive for April
> 2019 has only 50 messages (
> http://mail-archives.apache.org/mod_mbox/kafka-dev/201904.mbox/thread) ?
>
> Thanks,
> George
>
>    on, 08 Apr 2019 17:54:48 GMT  Colin McCabe wrote:
>
>   Yeah, I think adding this information to LeaderAndIsr makes sense.  It
> would be better to track
> "reassigningReplicas" than "originalReplicas", I think.  Tracking
> "originalReplicas" is going
> to involve sending a lot more data, since most replicas in the system are
> not reassigning
> at any given point.  Or we would need a hack in the RPC like null = no
> replicas are reassigning.
>
> On a related note, what do you think about the idea of storing the
> reassigning replicas in
>  /brokers/topics/[topic]/partitions/[partitionId]/state, rather than in
> the reassignment znode?
>  I don't think this requires a major change to the proposal-- when the
> controller becomes
> aware that it should do a reassignment, the controller could make the
> changes.  This also
> helps keep the reassignment znode from getting larger, which has been a
> problem.
>
> best,
> Colin
>
>
> On Mon, Apr 8, 2019, at 09:29, Jason Gustafson wrote:
> > Hey George,
> >
> > For the URP during a reassignment,  if the "original_replicas" is kept
> for
> > > the current pending reassignment. I think it will be very easy to
> compare
> > > that with the topic/partition's ISR.  If all "original_replicas" are in
> > > ISR, then URP should be 0 for that topic/partition.
> >
> >
> > Yeah, that makes sense. But I guess we would need "original_replicas" to
> be
> > propagated to partition leaders in the LeaderAndIsr request since leaders
> > are the ones that are computing URPs. That is basically what KIP-352 had
> > proposed, but we also need the changes to the reassignment path. Perhaps
> it
> > makes more sense to address this problem in KIP-236 since that is where
> you
> > have already introduced "original_replicas"? I'm also happy to do KIP-352
> > as a follow-up to KIP-236.
> >
> > Best,
> > Jason
> >
> >
> > On Sun, Apr 7, 2019 at 5:09 PM Ismael Juma <is...@gmail.com> wrote:
> >
> > > Good discussion about where we should do batching. I think if there is
> a
> > > clear great way to batch, then it makes a lot of sense to just do it
> once.
> > > However, if we think there is scope for experimenting with different
> > > approaches, then an API that tools can use makes a lot of sense. They
> can
> > > experiment and innovate. Eventually, we can integrate something into
> Kafka
> > > if it makes sense.
> > >
> > > Ismael
> > >
> > > On Sun, Apr 7, 2019, 11:03 PM Colin McCabe <cm...@apache.org> wrote:
> > >
> > > > Hi George,
> > > >
> > > > As Jason was saying, it seems like there are two directions we could
> go
> > > > here: an external system handling batching, and the controller
> handling
> > > > batching.  I think the controller handling batching would be better,
> > > since
> > > > the controller has more information about the state of the system.
> If
> > > the
> > > > controller handles batching, then the controller could also handle
> things
> > > > like setting up replication quotas for individual partitions.  The
> > > > controller could do things like throttle replication down if the
> cluster
> > > > was having problems.
> > > >
> > > > We kind of need to figure out which way we're going to go on this one
> > > > before we set up big new APIs, I think.  If we want an external
> system to
> > > > handle batching, then we can keep the idea that there is only one
> > > > reassignment in progress at once.  If we want the controller to
> handle
> > > > batching, we will need to get away from that idea.  Instead, we
> should
> > > just
> > > > have a bunch of "ideal assignments" that we tell the controller
> about,
> > > and
> > > > let it decide how to do the batching.  These ideal assignments could
> > > change
> > > > continuously over time, so from the admin's point of view, there
> would be
> > > > no start/stop/cancel, but just individual partition reassignments
> that we
> > > > submit, perhaps over a long period of time.  And then cancellation
> might
> > > > just mean cancelling just that individual partition reassignment,
> not all
> > > > partition reassignments.
> > > >
> > > > best,
> > > > Colin
> > > >
> > > > On Fri, Apr 5, 2019, at 19:34, George Li wrote:
> > > > >  Hi Jason / Viktor,
> > > > >
> > > > > For the URP during a reassignment,  if the "original_replicas" is
> kept
> > > > > for the current pending reassignment. I think it will be very easy
> to
> > > > > compare that with the topic/partition's ISR.  If all
> > > > > "original_replicas" are in ISR, then URP should be 0 for that
> > > > > topic/partition.
> > > > >
> > > > > It would be also nice to separate the metrics MaxLag/TotalLag for
> > > > > Reassignments. I think that will also require "original_replicas"
> (the
> > > > > topic/partition's replicas just before reassignment when the AR
> > > > > (Assigned Replicas) is set to Set(original_replicas) +
> > > > > Set(new_replicas_in_reassign_partitions) ).
> > > > >
> > > > > Thanks,
> > > > > George
> > > > >
> > > > >     On Friday, April 5, 2019, 6:29:55 PM PDT, Jason Gustafson
> > > > > <ja...@confluent.io> wrote:
> > > > >
> > > > >  Hi Viktor,
> > > > >
> > > > > Thanks for writing this up. As far as questions about overlap with
> > > > KIP-236,
> > > > > I agree it seems mostly orthogonal. I think KIP-236 may have had a
> > > larger
> > > > > initial scope, but now it focuses on cancellation and batching is
> left
> > > > for
> > > > > future work.
> > > > >
> > > > > With that said, I think we may not actually need a KIP for the
> current
> > > > > proposal since it doesn't change any APIs. To make it more
> generally
> > > > > useful, however, it would be nice to handle batching at the
> partition
> > > > level
> > > > > as well as Jun suggests. The basic question is at what level
> should the
> > > > > batching be determined. You could rely on external processes (e.g.
> > > cruise
> > > > > control) or it could be built into the controller. There are
> tradeoffs
> > > > > either way, but I think it simplifies such tools if it is handled
> > > > > internally. Then it would be much safer to submit a larger
> reassignment
> > > > > even just using the simple tools that come with Kafka.
> > > > >
> > > > > By the way, since you are looking into some of the reassignment
> logic,
> > > > > another problem that we might want to address is the misleading
> way we
> > > > > report URPs during a reassignment. I had a naive proposal for this
> > > > > previously, but it didn't really work
> > > > >
> > > >
> > >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-352%3A+Distinguish+URPs+caused+by+reassignment
> > > > .
> > > > > Potentially fixing that could fall under this work as well if you
> think
> > > > > it
> > > > > makes sense.
> > > > >
> > > > > Best,
> > > > > Jason
> > > > >
> > > > > On Thu, Apr 4, 2019 at 4:49 PM Jun Rao <ju...@confluent.io> wrote:
> > > > >
> > > > > > Hi, Viktor,
> > > > > >
> > > > > > Thanks for the KIP. A couple of comments below.
> > > > > >
> > > > > > 1. Another potential thing to do reassignment incrementally is to
> > > move
> > > > a
> > > > > > batch of partitions at a time, instead of all partitions. This
> may
> > > > lead to
> > > > > > less data replication since by the time the first batch of
> partitions
> > > > have
> > > > > > been completely moved, some data of the next batch may have been
> > > > deleted
> > > > > > due to retention and doesn't need to be replicated.
> > > > > >
> > > > > > 2. "Update CR in Zookeeper with TR for the given partition".
> Which
> ZK
> > > > path
> > > > > > is this for?
> > > > > >
> > > > > > Jun
> > > > > >
> > > > > > On Sat, Feb 23, 2019 at 2:12 AM Viktor Somogyi-Vass <
> > > > > > viktorsomogyi@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Harsha,
> > > > > > >
> > > > > > > As far as I understand KIP-236 it's about enabling reassignment
> > > > > > > cancellation and as a future plan providing a queue of replica
> > > > > > reassignment
> > > > > > > steps to allow manual reassignment chains. While I agree that
> the
> > > > > > > reassignment chain has a specific use case that allows fine
> grain
> > > > control
> > > > > > > over reassignment process, My proposal on the other hand
> doesn't
> > > talk
> > > > > > about
> > > > > > > cancellation but it only provides an automatic way to
> > > incrementalize
> > > > an
> > > > > > > arbitrary reassignment which I think fits the general use case
> > > where
> > > > > > users
> > > > > > > don't want that level of control but still would like a
> balanced
> > > way
> > > > of
> > > > > > > reassignments. Therefore I think it's still relevant as an
> > > > improvement of
> > > > > > > the current algorithm.
> > > > > > > Nevertheless I'm happy to add my ideas to KIP-236 as I think
> it
> > > > would be
> > > > > > a
> > > > > > > great improvement to Kafka.
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Viktor
> > > > > > >
> > > > > > > On Fri, Feb 22, 2019 at 5:05 PM Harsha <ka...@harsha.io>
> wrote:
> > > > > > >
> > > > > > > > Hi Viktor,
> > > > > > > >            There is already KIP-236 for the same feature
> and
> > > George
> > > > > > made
> > > > > > > > a PR for this as well.
> > > > > > > > Lets consolidate these two discussions. If you have any
> cases
> > > that
> > > > are
> > > > > > > not
> > > > > > > > being solved by KIP-236 can you please mention them in
> that
> > > > thread. We
> > > > > > > can
> > > > > > > > address as part of KIP-236.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Harsha
> > > > > > > >
> > > > > > > > On Fri, Feb 22, 2019, at 5:44 AM, Viktor Somogyi-Vass wrote:
> > > > > > > > > Hi Folks,
> > > > > > > > >
> > > > > > > > > I've created a KIP about an improvement of the reassignment
> > > > algorithm
> > > > > > > we
> > > > > > > > > have. It aims to enable partition-wise incremental
> > > reassignment.
> > > > The
> > > > > > > > > motivation for this is to avoid excess load that the
> current
> > > > > > > replication
> > > > > > > > > algorithm implicitly carries as in that case there
> are points
> > > in
> > > > the
> > > > > > > > > algorithm where both the new and old replica set could
> be
> > > online
> > > > and
> > > > > > > > > replicating which puts double (or almost double) pressure
> on
> > > the
> > > > > > > brokers
> > > > > > > > > which could cause problems.
> > > > > > > > > Instead my proposal would slice this up into several
> steps
> > > where
> > > > each
> > > > > > > > step
> > > > > > > > > is calculated based on the final target replicas and
> the
> > > current
> > > > > > > replica
> > > > > > > > > assignment taking into account scenarios where brokers
> could be
> > > > > > offline
> > > > > > > > and
> > > > > > > > > when there are not enough replicas to fulfil the
> > > > min.insync.replica
> > > > > > > > > requirement.
> > > > > > > > >
> > > > > > > > > The link to the KIP:
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > >
> > >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-435%3A+Incremental+Partition+Reassignment
> > > > > > > > >
> > > > > > > > > I'd be happy to receive any feedback.
> > > > > > > > >
> > > > > > > > > An important note is that this KIP and another one,
> KIP-236
> > > that
> > > > is
> > > > > > > > > about
> > > > > > > > > interruptible reassignment (
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > >
> > >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-236%3A+Interruptible+Partition+Reassignment
> > > > > > > > )
> > > > > > > > > should be compatible.
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Viktor
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-435: Incremental Partition Reassignment

Posted by George Li <sq...@yahoo.com.INVALID>.

 Hi Colin / Jason, 

Reassignment should really be done in batches.  I am not too worried about the reassignment znode getting larger.  In a real production environment, too many concurrent reassignments and too-frequent submission of reassignments seemed to cause latency spikes in the Kafka cluster, so batching/staggering/throttling the submission of reassignments is recommended.

In KIP-236, the "originalReplicas" are only kept for the partitions currently being reassigned (a small number), both in memory in the controller context's partitionsBeingReassigned and in the znode /admin/reassign_partitions.  I think the suggestion below of "setting in the RPC like null = no replicas are reassigning" is a good idea.
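
As a rough illustration of how "originalReplicas" could feed the URP check discussed earlier in this thread, the Python sketch below treats a missing (None) value as "no reassignment in progress". The function and parameter names are hypothetical, and this is a sketch of the idea rather than existing Kafka code.

```python
from typing import Iterable, Optional


def is_under_replicated(
    assigned_replicas: Iterable[int],
    isr: Iterable[int],
    original_replicas: Optional[Iterable[int]] = None,
) -> bool:
    """Decide whether a partition should count as under-replicated.

    During a reassignment the assigned replicas are the union of the old and
    new replicas, so comparing them with the ISR reports a URP even when every
    original replica is in sync. If original_replicas is given, only those
    replicas are checked; None means no reassignment is in progress.
    """
    in_sync = set(isr)
    expected = assigned_replicas if original_replicas is None else original_replicas
    return any(broker not in in_sync for broker in expected)
```

For a partition moving from brokers 1, 2, 3 to 4, 5, 6 with ISR {1, 2, 3}, the check against the full assigned set reports a URP, while the check against the original replicas does not.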

Are there some issues with the mail archive server of this mailing list?  I haven't received any email since April 7th, and the archive for April 2019 has only 50 messages (http://mail-archives.apache.org/mod_mbox/kafka-dev/201904.mbox/thread).

Thanks,
George

On Mon, 08 Apr 2019 17:54:48 GMT, Colin McCabe wrote:
 
  Yeah, I think adding this information to LeaderAndIsr makes sense.  It would be better to track
"reassigningReplicas" than "originalReplicas", I think.  Tracking "originalReplicas" is going
to involve sending a lot more data, since most replicas in the system are not reassigning
at any given point.  Or we would need a hack in the RPC like null = no replicas are reassigning.

On a related note, what do you think about the idea of storing the reassigning replicas in
 /brokers/topics/[topic]/partitions/[partitionId]/state, rather than in the reassignment znode?
 I don't think this requires a major change to the proposal-- when the controller becomes
aware that it should do a reassignment, the controller could make the changes.  This also
helps keep the reassignment znode from getting larger, which has been a problem.

best,
Colin


On Mon, Apr 8, 2019, at 09:29, Jason Gustafson wrote:
> Hey George,
> 
> For the URP during a reassignment,  if the "original_replicas" is kept for
> > the current pending reassignment. I think it will be very easy to compare
> > that with the topic/partition's ISR.  If all "original_replicas" are in
> > ISR, then URP should be 0 for that topic/partition.
> 
> 
> Yeah, that makes sense. But I guess we would need "original_replicas" to be
> propagated to partition leaders in the LeaderAndIsr request since leaders
> are the ones that are computing URPs. That is basically what KIP-352 had
> proposed, but we also need the changes to the reassignment path. Perhaps it
> makes more sense to address this problem in KIP-236 since that is where you
> have already introduced "original_replicas"? I'm also happy to do KIP-352
> as a follow-up to KIP-236.
> 
> Best,
> Jason
> 
> 
> On Sun, Apr 7, 2019 at 5:09 PM Ismael Juma <is...@gmail.com> wrote:
> 
> > Good discussion about where we should do batching. I think if there is a
> > clear great way to batch, then it makes a lot of sense to just do it once.
> > However, if we think there is scope for experimenting with different
> > approaches, then an API that tools can use makes a lot of sense. They can
> > experiment and innovate. Eventually, we can integrate something into Kafka
> > if it makes sense.
> >
> > Ismael
> >
> > On Sun, Apr 7, 2019, 11:03 PM Colin McCabe <cm...@apache.org> wrote:
> >
> > > Hi George,
> > >
> > > As Jason was saying, it seems like there are two directions we could go
> > > here: an external system handling batching, and the controller handling
> > > batching.  I think the controller handling batching would be better,
> > since
> > > the controller has more information about the state of the system.  If
> > the
> > > controller handles batching, then the controller could also handle things
> > > like setting up replication quotas for individual partitions.  The
> > > controller could do things like throttle replication down if the cluster
> > > was having problems.
> > >
> > > We kind of need to figure out which way we're going to go on this one
> > > before we set up big new APIs, I think.  If we want an external system to
> > > handle batching, then we can keep the idea that there is only one
> > > reassignment in progress at once.  If we want the controller to handle
> > > batching, we will need to get away from that idea.  Instead, we should
> > just
> > > have a bunch of "ideal assignments" that we tell the controller about,
> > and
> > > let it decide how to do the batching.  These ideal assignments could
> > change
> > > continuously over time, so from the admin's point of view, there would be
> > > no start/stop/cancel, but just individual partition reassignments that we
> > > submit, perhaps over a long period of time.  And then cancellation might
> > > just mean cancelling just that individual partition reassignment, not all
> > > partition reassignments.
> > >
> > > best,
> > > Colin
> > >
> > > On Fri, Apr 5, 2019, at 19:34, George Li wrote:
> > > >  Hi Jason / Viktor,
> > > >
> > > > For the URP during a reassignment,  if the "original_replicas" is kept
> > > > for the current pending reassignment. I think it will be very easy to
> > > > compare that with the topic/partition's ISR.  If all
> > > > "original_replicas" are in ISR, then URP should be 0 for that
> > > > topic/partition.
> > > >
> > > > It would be also nice to separate the metrics MaxLag/TotalLag for
> > > > Reassignments. I think that will also require "original_replicas" (the
> > > > topic/partition's replicas just before reassignment when the AR
> > > > (Assigned Replicas) is set to Set(original_replicas) +
> > > > Set(new_replicas_in_reassign_partitions) ).
> > > >
> > > > Thanks,
> > > > George
> > > >
> > > >     On Friday, April 5, 2019, 6:29:55 PM PDT, Jason Gustafson
> > > > <ja...@confluent.io> wrote:
> > > >
> > > >  Hi Viktor,
> > > >
> > > > Thanks for writing this up. As far as questions about overlap with
> > > KIP-236,
> > > > I agree it seems mostly orthogonal. I think KIP-236 may have had a
> > larger
> > > > initial scope, but now it focuses on cancellation and batching is left
> > > for
> > > > future work.
> > > >
> > > > With that said, I think we may not actually need a KIP for the current
> > > > proposal since it doesn't change any APIs. To make it more generally
> > > > useful, however, it would be nice to handle batching at the partition
> > > level
> > > > as well as Jun suggests. The basic question is at what level should the
> > > > batching be determined. You could rely on external processes (e.g.
> > cruise
> > > > control) or it could be built into the controller. There are tradeoffs
> > > > either way, but I think it simplifies such tools if it is handled
> > > > internally. Then it would be much safer to submit a larger reassignment
> > > > even just using the simple tools that come with Kafka.
> > > >
> > > > By the way, since you are looking into some of the reassignment logic,
> > > > another problem that we might want to address is the misleading way we
> > > > report URPs during a reassignment. I had a naive proposal for this
> > > > previously, but it didn't really work
> > > >
> > >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-352%3A+Distinguish+URPs+caused+by+reassignment
> > > .
> > > > Potentially fixing that could fall under this work as well if you think
> > > > it
> > > > makes sense.
> > > >
> > > > Best,
> > > > Jason
> > > >
> > > > On Thu, Apr 4, 2019 at 4:49 PM Jun Rao <ju...@confluent.io> wrote:
> > > >
> > > > > Hi, Viktor,
> > > > >
> > > > > Thanks for the KIP. A couple of comments below.
> > > > >
> > > > > 1. Another potential thing to do reassignment incrementally is to
> > move
> > > a
> > > > > batch of partitions at a time, instead of all partitions. This may
> > > lead to
> > > > > less data replication since by the time the first batch of partitions
> > > have
> > > > > been completely moved, some data of the next batch may have been
> > > deleted
> > > > > due to retention and doesn't need to be replicated.
> > > > >
> > > > > 2. "Update CR in Zookeeper with TR for the given partition". Which
ZK
> > > path
> > > > > is this for?
> > > > >
> > > > > Jun
> > > > >
> > > > > On Sat, Feb 23, 2019 at 2:12 AM Viktor Somogyi-Vass <
> > > > > viktorsomogyi@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hi Harsha,
> > > > > >
> > > > > > As far as I understand KIP-236 it's about enabling reassignment
> > > > > > cancellation and as a future plan providing a queue of replica
> > > > > reassignment
> > > > > > steps to allow manual reassignment chains. While I agree that
the
> > > > > > reassignment chain has a specific use case that allows fine
grain
> > > control
> > > > > > over reassignment process, My proposal on the other hand doesn't
> > talk
> > > > > about
> > > > > > cancellation but it only provides an automatic way to
> > incrementalize
> > > an
> > > > > > arbitrary reassignment which I think fits the general use case
> > where
> > > > > users
> > > > > > don't want that level of control but still would like a balanced
> > way
> > > of
> > > > > > reassignments. Therefore I think it's still relevant as an
> > > improvement of
> > > > > > the current algorithm.
> > > > > > Nevertheless I'm happy to add my ideas to KIP-236 as I think
it
> > > would be
> > > > > a
> > > > > > great improvement to Kafka.
> > > > > >
> > > > > > Cheers,
> > > > > > Viktor
> > > > > >
> > > > > > On Fri, Feb 22, 2019 at 5:05 PM Harsha <ka...@harsha.io>
wrote:
> > > > > >
> > > > > > > Hi Viktor,
> > > > > > >            There is already KIP-236 for the same feature
and
> > George
> > > > > made
> > > > > > > a PR for this as well.
> > > > > > > Lets consolidate these two discussions. If you have any
cases
> > that
> > > are
> > > > > > not
> > > > > > > being solved by KIP-236 can you please mention them in
that
> > > thread. We
> > > > > > can
> > > > > > > address as part of KIP-236.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Harsha
> > > > > > >
> > > > > > > On Fri, Feb 22, 2019, at 5:44 AM, Viktor Somogyi-Vass wrote:
> > > > > > > > Hi Folks,
> > > > > > > >
> > > > > > > > I've created a KIP about an improvement of the reassignment
> > > algorithm
> > > > > > we
> > > > > > > > have. It aims to enable partition-wise incremental
> > reassignment.
> > > The
> > > > > > > > motivation for this is to avoid excess load that the
current
> > > > > > replication
> > > > > > > > algorithm implicitly carries as in that case there
are points
> > in
> > > the
> > > > > > > > algorithm where both the new and old replica set could
be
> > online
> > > and
> > > > > > > > replicating which puts double (or almost double) pressure
on
> > the
> > > > > > brokers
> > > > > > > > which could cause problems.
> > > > > > > > Instead my proposal would slice this up into several
steps
> > where
> > > each
> > > > > > > step
> > > > > > > > is calculated based on the final target replicas and
the
> > current
> > > > > > replica
> > > > > > > > assignment taking into account scenarios where brokers
could be
> > > > > offline
> > > > > > > and
> > > > > > > > when there are not enough replicas to fulfil the
> > > min.insync.replica
> > > > > > > > requirement.
> > > > > > > >
> > > > > > > > The link to the KIP:
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-435%3A+Incremental+Partition+Reassignment
> > > > > > > >
> > > > > > > > I'd be happy to receive any feedback.
> > > > > > > >
> > > > > > > > An important note is that this KIP and another one,
KIP-236
> > that
> > > is
> > > > > > > > about
> > > > > > > > interruptible reassignment (
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-236%3A+Interruptible+Partition+Reassignment
> > > > > > > )
> > > > > > > > should be compatible.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Viktor
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-435: Incremental Partition Reassignment

Posted by Colin McCabe <cm...@apache.org>.

Yeah, I think adding this information to LeaderAndIsr makes sense.  It would be better to track "reassigningReplicas" than "originalReplicas", I think.  Tracking "originalReplicas" is going to involve sending a lot more data, since most replicas in the system are not reassigning at any given point.  Or we would need a hack in the RPC like null = no replicas are reassigning.
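
To make the size argument concrete, here is a small sketch under simplifying assumptions. `PartitionState` is a hypothetical stand-in for a per-partition entry in the request, not the real LeaderAndIsr schema, and the function only counts the extra broker ids each approach would carry. It assumes that "originalReplicas" is sent for every partition, while "reassigningReplicas" uses the null = no replicas are reassigning convention.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PartitionState:
    # Hypothetical, simplified per-partition entry (not the real schema).
    topic: str
    partition: int
    replicas: List[int]
    # None means "no replicas are reassigning" for this partition.
    reassigning_replicas: Optional[List[int]] = None


def extra_ids_sent(states: List[PartitionState], track_original: bool) -> int:
    """Count broker ids carried in addition to the regular replica list."""
    total = 0
    for state in states:
        reassigning = state.reassigning_replicas or []
        if track_original:
            # Original replicas are repeated for every partition.
            total += len([r for r in state.replicas if r not in reassigning])
        else:
            # Only partitions under reassignment carry anything extra.
            total += len(reassigning)
    return total


# 998 stable partitions and 2 partitions that are each gaining broker 4.
states = [PartitionState("t", p, [1, 2, 3]) for p in range(998)]
states += [PartitionState("t", p, [1, 2, 3, 4], [4]) for p in (998, 999)]

print(extra_ids_sent(states, track_original=True))
print(extra_ids_sent(states, track_original=False))
```

With these made-up numbers the first approach carries 3000 extra broker ids and the second carries 2, which is the gap the null convention would also close for "originalReplicas".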

On a related note, what do you think about the idea of storing the reassigning replicas in  /brokers/topics/[topic]/partitions/[partitionId]/state, rather than in the reassignment znode?  I don't think this requires a major change to the proposal-- when the controller becomes aware that it should do a reassignment, the controller could make the changes.  This also helps keep the reassignment znode from getting larger, which has been a problem.

best,
Colin


On Mon, Apr 8, 2019, at 09:29, Jason Gustafson wrote:
> Hey George,
> 
> For the URP during a reassignment,  if the "original_replicas" is kept for
> > the current pending reassignment. I think it will be very easy to compare
> > that with the topic/partition's ISR.  If all "original_replicas" are in
> > ISR, then URP should be 0 for that topic/partition.
> 
> 
> Yeah, that makes sense. But I guess we would need "original_replicas" to be
> propagated to partition leaders in the LeaderAndIsr request since leaders
> are the ones that are computing URPs. That is basically what KIP-352 had
> proposed, but we also need the changes to the reassignment path. Perhaps it
> makes more sense to address this problem in KIP-236 since that is where you
> have already introduced "original_replicas"? I'm also happy to do KIP-352
> as a follow-up to KIP-236.
> 
> Best,
> Jason
> 
> 
> On Sun, Apr 7, 2019 at 5:09 PM Ismael Juma <is...@gmail.com> wrote:
> 
> > Good discussion about where we should do batching. I think if there is a
> > clear great way to batch, then it makes a lot of sense to just do it once.
> > However, if we think there is scope for experimenting with different
> > approaches, then an API that tools can use makes a lot of sense. They can
> > experiment and innovate. Eventually, we can integrate something into Kafka
> > if it makes sense.
> >
> > Ismael
> >
> > On Sun, Apr 7, 2019, 11:03 PM Colin McCabe <cm...@apache.org> wrote:
> >
> > > Hi George,
> > >
> > > As Jason was saying, it seems like there are two directions we could go
> > > here: an external system handling batching, and the controller handling
> > > batching.  I think the controller handling batching would be better,
> > since
> > > the controller has more information about the state of the system.  If
> > the
> > > controller handles batching, then the controller could also handle things
> > > like setting up replication quotas for individual partitions.  The
> > > controller could do things like throttle replication down if the cluster
> > > was having problems.
> > >
> > > We kind of need to figure out which way we're going to go on this one
> > > before we set up big new APIs, I think.  If we want an external system to
> > > handle batching, then we can keep the idea that there is only one
> > > reassignment in progress at once.  If we want the controller to handle
> > > batching, we will need to get away from that idea.  Instead, we should
> > just
> > > have a bunch of "ideal assignments" that we tell the controller about,
> > and
> > > let it decide how to do the batching.  These ideal assignments could
> > change
> > > continuously over time, so from the admin's point of view, there would be
> > > no start/stop/cancel, but just individual partition reassignments that we
> > > submit, perhaps over a long period of time.  And then cancellation might
> > > just mean cancelling just that individual partition reassignment, not all
> > > partition reassignments.
> > >
> > > best,
> > > Colin
> > >
> > > On Fri, Apr 5, 2019, at 19:34, George Li wrote:
> > > >  Hi Jason / Viktor,
> > > >
> > > > For the URP during a reassignment,  if the "original_replicas" is kept
> > > > for the current pending reassignment. I think it will be very easy to
> > > > compare that with the topic/partition's ISR.  If all
> > > > "original_replicas" are in ISR, then URP should be 0 for that
> > > > topic/partition.
> > > >
> > > > It would be also nice to separate the metrics MaxLag/TotalLag for
> > > > Reassignments. I think that will also require "original_replicas" (the
> > > > topic/partition's replicas just before reassignment when the AR
> > > > (Assigned Replicas) is set to Set(original_replicas) +
> > > > Set(new_replicas_in_reassign_partitions) ).
> > > >
> > > > Thanks,
> > > > George
> > > >
> > > >     On Friday, April 5, 2019, 6:29:55 PM PDT, Jason Gustafson
> > > > <ja...@confluent.io> wrote:
> > > >
> > > >  Hi Viktor,
> > > >
> > > > Thanks for writing this up. As far as questions about overlap with
> > > KIP-236,
> > > > I agree it seems mostly orthogonal. I think KIP-236 may have had a
> > larger
> > > > initial scope, but now it focuses on cancellation and batching is left
> > > for
> > > > future work.
> > > >
> > > > With that said, I think we may not actually need a KIP for the current
> > > > proposal since it doesn't change any APIs. To make it more generally
> > > > useful, however, it would be nice to handle batching at the partition
> > > level
> > > > as well as Jun suggests. The basic question is at what level should the
> > > > batching be determined. You could rely on external processes (e.g.
> > cruise
> > > > control) or it could be built into the controller. There are tradeoffs
> > > > either way, but I think it simplifies such tools if it is handled
> > > > internally. Then it would be much safer to submit a larger reassignment
> > > > even just using the simple tools that come with Kafka.
> > > >
> > > > By the way, since you are looking into some of the reassignment logic,
> > > > another problem that we might want to address is the misleading way we
> > > > report URPs during a reassignment. I had a naive proposal for this
> > > > previously, but it didn't really work
> > > >
> > >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-352%3A+Distinguish+URPs+caused+by+reassignment
> > > .
> > > > Potentially fixing that could fall under this work as well if you think
> > > > it
> > > > makes sense.
> > > >
> > > > Best,
> > > > Jason
> > > >
> > > > On Thu, Apr 4, 2019 at 4:49 PM Jun Rao <ju...@confluent.io> wrote:
> > > >
> > > > > Hi, Viktor,
> > > > >
> > > > > Thanks for the KIP. A couple of comments below.
> > > > >
> > > > > 1. Another potential thing to do reassignment incrementally is to
> > move
> > > a
> > > > > batch of partitions at a time, instead of all partitions. This may
> > > lead to
> > > > > less data replication since by the time the first batch of partitions
> > > have
> > > > > been completely moved, some data of the next batch may have been
> > > deleted
> > > > > due to retention and doesn't need to be replicated.
> > > > >
> > > > > 2. "Update CR in Zookeeper with TR for the given partition". Which ZK
> > > path
> > > > > is this for?
> > > > >
> > > > > Jun
> > > > >
> > > > > On Sat, Feb 23, 2019 at 2:12 AM Viktor Somogyi-Vass <
> > > > > viktorsomogyi@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hi Harsha,
> > > > > >
> > > > > > As far as I understand KIP-236 it's about enabling reassignment
> > > > > > cancellation and as a future plan providing a queue of replica
> > > > > reassignment
> > > > > > steps to allow manual reassignment chains. While I agree that the
> > > > > > reassignment chain has a specific use case that allows fine grain
> > > control
> > > > > > over reassignment process, My proposal on the other hand doesn't
> > talk
> > > > > about
> > > > > > cancellation but it only provides an automatic way to
> > incrementalize
> > > an
> > > > > > arbitrary reassignment which I think fits the general use case
> > where
> > > > > users
> > > > > > don't want that level of control but still would like a balanced
> > way
> > > of
> > > > > > reassignments. Therefore I think it's still relevant as an
> > > improvement of
> > > > > > the current algorithm.
> > > > > > Nevertheless I'm happy to add my ideas to KIP-236 as I think it
> > > would be
> > > > > a
> > > > > > great improvement to Kafka.
> > > > > >
> > > > > > Cheers,
> > > > > > Viktor
> > > > > >
> > > > > > On Fri, Feb 22, 2019 at 5:05 PM Harsha <ka...@harsha.io> wrote:
> > > > > >
> > > > > > > Hi Viktor,
> > > > > > >            There is already KIP-236 for the same feature and
> > George
> > > > > made
> > > > > > > a PR for this as well.
> > > > > > > Lets consolidate these two discussions. If you have any cases
> > that
> > > are
> > > > > > not
> > > > > > > being solved by KIP-236 can you please mention them in that
> > > thread. We
> > > > > > can
> > > > > > > address as part of KIP-236.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Harsha
> > > > > > >
> > > > > > > On Fri, Feb 22, 2019, at 5:44 AM, Viktor Somogyi-Vass wrote:
> > > > > > > > Hi Folks,
> > > > > > > >
> > > > > > > > I've created a KIP about an improvement of the reassignment
> > > algorithm
> > > > > > we
> > > > > > > > have. It aims to enable partition-wise incremental
> > reassignment.
> > > The
> > > > > > > > motivation for this is to avoid excess load that the current
> > > > > > replication
> > > > > > > > algorithm implicitly carries as in that case there are points
> > in
> > > the
> > > > > > > > algorithm where both the new and old replica set could be
> > online
> > > and
> > > > > > > > replicating which puts double (or almost double) pressure on
> > the
> > > > > > brokers
> > > > > > > > which could cause problems.
> > > > > > > > Instead my proposal would slice this up into several steps
> > where
> > > each
> > > > > > > step
> > > > > > > > is calculated based on the final target replicas and the
> > current
> > > > > > replica
> > > > > > > > assignment taking into account scenarios where brokers could be
> > > > > offline
> > > > > > > and
> > > > > > > > when there are not enough replicas to fulfil the
> > > min.insync.replica
> > > > > > > > requirement.
> > > > > > > >
> > > > > > > > The link to the KIP:
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-435%3A+Incremental+Partition+Reassignment
> > > > > > > >
> > > > > > > > I'd be happy to receive any feedback.
> > > > > > > >
> > > > > > > > An important note is that this KIP and another one, KIP-236
> > that
> > > is
> > > > > > > > about
> > > > > > > > interruptible reassignment (
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-236%3A+Interruptible+Partition+Reassignment
> > > > > > > )
> > > > > > > > should be compatible.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Viktor
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-435: Incremental Partition Reassignment

Posted by Jason Gustafson <ja...@confluent.io>.

Hey George,

> For the URP during a reassignment,  if the "original_replicas" is kept for
> the current pending reassignment. I think it will be very easy to compare
> that with the topic/partition's ISR.  If all "original_replicas" are in
> ISR, then URP should be 0 for that topic/partition.


Yeah, that makes sense. But I guess we would need "original_replicas" to be
propagated to partition leaders in the LeaderAndIsr request since leaders
are the ones that are computing URPs. That is basically what KIP-352 had
proposed, but we also need the changes to the reassignment path. Perhaps it
makes more sense to address this problem in KIP-236 since that is where you
have already introduced "original_replicas"? I'm also happy to do KIP-352
as a follow-up to KIP-236.
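
A minimal sketch of the check a leader could run once "original_replicas" reaches it, next to a naive comparison against the full assigned replica set. The function name and arguments are hypothetical and this is not Kafka's metric code; it only encodes the rule quoted above, with None standing for "no reassignment pending".

```python
from typing import Iterable, Optional


def is_under_replicated(
    isr: Iterable[int],
    assigned_replicas: Iterable[int],
    original_replicas: Optional[Iterable[int]] = None,
) -> bool:
    """Return True if some expected replica is missing from the ISR.

    original_replicas is None when no reassignment is pending; in that
    case every assigned replica is expected to be in sync.
    """
    expected = assigned_replicas if original_replicas is None else original_replicas
    return not set(expected).issubset(set(isr))


original = [1, 2, 3]
new = [4, 5, 6]
# During reassignment the assigned replicas are the union of both sets.
assigned = original + new
isr = [1, 2, 3]  # the new replicas are still catching up

print(is_under_replicated(isr, assigned))            # naive view
print(is_under_replicated(isr, assigned, original))  # reassignment-aware view
```

In this example the naive view reports the partition as under-replicated while the reassignment-aware view does not; if broker 3 dropped out of the ISR, both views would report it.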

Best,
Jason


On Sun, Apr 7, 2019 at 5:09 PM Ismael Juma <is...@gmail.com> wrote:

> Good discussion about where we should do batching. I think if there is a
> clear great way to batch, then it makes a lot of sense to just do it once.
> However, if we think there is scope for experimenting with different
> approaches, then an API that tools can use makes a lot of sense. They can
> experiment and innovate. Eventually, we can integrate something into Kafka
> if it makes sense.
>
> Ismael
>
> On Sun, Apr 7, 2019, 11:03 PM Colin McCabe <cm...@apache.org> wrote:
>
> > Hi George,
> >
> > As Jason was saying, it seems like there are two directions we could go
> > here: an external system handling batching, and the controller handling
> > batching.  I think the controller handling batching would be better,
> since
> > the controller has more information about the state of the system.  If
> the
> > controller handles batching, then the controller could also handle things
> > like setting up replication quotas for individual partitions.  The
> > controller could do things like throttle replication down if the cluster
> > was having problems.
> >
> > We kind of need to figure out which way we're going to go on this one
> > before we set up big new APIs, I think.  If we want an external system to
> > handle batching, then we can keep the idea that there is only one
> > reassignment in progress at once.  If we want the controller to handle
> > batching, we will need to get away from that idea.  Instead, we should
> just
> > have a bunch of "ideal assignments" that we tell the controller about,
> and
> > let it decide how to do the batching.  These ideal assignments could
> change
> > continuously over time, so from the admin's point of view, there would be
> > no start/stop/cancel, but just individual partition reassignments that we
> > submit, perhaps over a long period of time.  And then cancellation might
> > just mean cancelling just that individual partition reassignment, not all
> > partition reassignments.
> >
> > best,
> > Colin
> >
> > On Fri, Apr 5, 2019, at 19:34, George Li wrote:
> > >  Hi Jason / Viktor,
> > >
> > > For the URP during a reassignment,  if the "original_replicas" is kept
> > > for the current pending reassignment. I think it will be very easy to
> > > compare that with the topic/partition's ISR.  If all
> > > "original_replicas" are in ISR, then URP should be 0 for that
> > > topic/partition.
> > >
> > > It would be also nice to separate the metrics MaxLag/TotalLag for
> > > Reassignments. I think that will also require "original_replicas" (the
> > > topic/partition's replicas just before reassignment when the AR
> > > (Assigned Replicas) is set to Set(original_replicas) +
> > > Set(new_replicas_in_reassign_partitions) ).
> > >
> > > Thanks,
> > > George
> > >
> > >     On Friday, April 5, 2019, 6:29:55 PM PDT, Jason Gustafson
> > > <ja...@confluent.io> wrote:
> > >
> > >  Hi Viktor,
> > >
> > > Thanks for writing this up. As far as questions about overlap with
> > KIP-236,
> > > I agree it seems mostly orthogonal. I think KIP-236 may have had a
> larger
> > > initial scope, but now it focuses on cancellation and batching is left
> > for
> > > future work.
> > >
> > > With that said, I think we may not actually need a KIP for the current
> > > proposal since it doesn't change any APIs. To make it more generally
> > > useful, however, it would be nice to handle batching at the partition
> > level
> > > as well as Jun suggests. The basic question is at what level should the
> > > batching be determined. You could rely on external processes (e.g.
> cruise
> > > control) or it could be built into the controller. There are tradeoffs
> > > either way, but I think it simplifies such tools if it is handled
> > > internally. Then it would be much safer to submit a larger reassignment
> > > even just using the simple tools that come with Kafka.
> > >
> > > By the way, since you are looking into some of the reassignment logic,
> > > another problem that we might want to address is the misleading way we
> > > report URPs during a reassignment. I had a naive proposal for this
> > > previously, but it didn't really work
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-352%3A+Distinguish+URPs+caused+by+reassignment
> > .
> > > Potentially fixing that could fall under this work as well if you think
> > > it
> > > makes sense.
> > >
> > > Best,
> > > Jason
> > >
> > > On Thu, Apr 4, 2019 at 4:49 PM Jun Rao <ju...@confluent.io> wrote:
> > >
> > > > Hi, Viktor,
> > > >
> > > > Thanks for the KIP. A couple of comments below.
> > > >
> > > > 1. Another potential thing to do reassignment incrementally is to
> move
> > a
> > > > batch of partitions at a time, instead of all partitions. This may
> > lead to
> > > > less data replication since by the time the first batch of partitions
> > have
> > > > been completely moved, some data of the next batch may have been
> > deleted
> > > > due to retention and doesn't need to be replicated.
> > > >
> > > > 2. "Update CR in Zookeeper with TR for the given partition". Which ZK
> > path
> > > > is this for?
> > > >
> > > > Jun
> > > >
> > > > On Sat, Feb 23, 2019 at 2:12 AM Viktor Somogyi-Vass <
> > > > viktorsomogyi@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi Harsha,
> > > > >
> > > > > As far as I understand KIP-236 it's about enabling reassignment
> > > > > cancellation and as a future plan providing a queue of replica
> > > > reassignment
> > > > > steps to allow manual reassignment chains. While I agree that the
> > > > > reassignment chain has a specific use case that allows fine grain
> > control
> > > > > over reassignment process, My proposal on the other hand doesn't
> talk
> > > > about
> > > > > cancellation but it only provides an automatic way to
> incrementalize
> > an
> > > > > arbitrary reassignment which I think fits the general use case
> where
> > > > users
> > > > > don't want that level of control but still would like a balanced
> way
> > of
> > > > > reassignments. Therefore I think it's still relevant as an
> > improvement of
> > > > > the current algorithm.
> > > > > Nevertheless I'm happy to add my ideas to KIP-236 as I think it
> > would be
> > > > a
> > > > > great improvement to Kafka.
> > > > >
> > > > > Cheers,
> > > > > Viktor
> > > > >
> > > > > On Fri, Feb 22, 2019 at 5:05 PM Harsha <ka...@harsha.io> wrote:
> > > > >
> > > > > > Hi Viktor,
> > > > > >            There is already KIP-236 for the same feature and
> George
> > > > made
> > > > > > a PR for this as well.
> > > > > > Lets consolidate these two discussions. If you have any cases
> that
> > are
> > > > > not
> > > > > > being solved by KIP-236 can you please mention them in that
> > thread. We
> > > > > can
> > > > > > address as part of KIP-236.
> > > > > >
> > > > > > Thanks,
> > > > > > Harsha
> > > > > >
> > > > > > On Fri, Feb 22, 2019, at 5:44 AM, Viktor Somogyi-Vass wrote:
> > > > > > > Hi Folks,
> > > > > > >
> > > > > > > I've created a KIP about an improvement of the reassignment
> > algorithm
> > > > > we
> > > > > > > have. It aims to enable partition-wise incremental
> reassignment.
> > The
> > > > > > > motivation for this is to avoid excess load that the current
> > > > > replication
> > > > > > > algorithm implicitly carries as in that case there are points
> in
> > the
> > > > > > > algorithm where both the new and old replica set could be
> online
> > and
> > > > > > > replicating which puts double (or almost double) pressure on
> the
> > > > > brokers
> > > > > > > which could cause problems.
> > > > > > > Instead my proposal would slice this up into several steps
> where
> > each
> > > > > > step
> > > > > > > is calculated based on the final target replicas and the
> current
> > > > > replica
> > > > > > > assignment taking into account scenarios where brokers could be
> > > > offline
> > > > > > and
> > > > > > > when there are not enough replicas to fulfil the
> > min.insync.replica
> > > > > > > requirement.
> > > > > > >
> > > > > > > The link to the KIP:
> > > > > > >
> > > > > >
> > > > >
> > > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-435%3A+Incremental+Partition+Reassignment
> > > > > > >
> > > > > > > I'd be happy to receive any feedback.
> > > > > > >
> > > > > > > An important note is that this KIP and another one, KIP-236
> that
> > is
> > > > > > > about
> > > > > > > interruptible reassignment (
> > > > > > >
> > > > > >
> > > > >
> > > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-236%3A+Interruptible+Partition+Reassignment
> > > > > > )
> > > > > > > should be compatible.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Viktor
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-435: Incremental Partition Reassignment

Posted by Ismael Juma <is...@gmail.com>.

Good discussion about where we should do batching. I think if there is a
clear great way to batch, then it makes a lot of sense to just do it once.
However, if we think there is scope for experimenting with different
approaches, then an API that tools can use makes a lot of sense. They can
experiment and innovate. Eventually, we can integrate something into Kafka
if it makes sense.

Ismael

On Sun, Apr 7, 2019, 11:03 PM Colin McCabe <cm...@apache.org> wrote:

> Hi George,
>
> As Jason was saying, it seems like there are two directions we could go
> here: an external system handling batching, and the controller handling
> batching.  I think the controller handling batching would be better, since
> the controller has more information about the state of the system.  If the
> controller handles batching, then the controller could also handle things
> like setting up replication quotas for individual partitions.  The
> controller could do things like throttle replication down if the cluster
> was having problems.
>
> We kind of need to figure out which way we're going to go on this one
> before we set up big new APIs, I think.  If we want an external system to
> handle batching, then we can keep the idea that there is only one
> reassignment in progress at once.  If we want the controller to handle
> batching, we will need to get away from that idea.  Instead, we should just
> have a bunch of "ideal assignments" that we tell the controller about, and
> let it decide how to do the batching.  These ideal assignments could change
> continuously over time, so from the admin's point of view, there would be
> no start/stop/cancel, but just individual partition reassignments that we
> submit, perhaps over a long period of time.  And then cancellation might
> just mean cancelling just that individual partition reassignment, not all
> partition reassignments.
>
> best,
> Colin

Re: [DISCUSS] KIP-435: Incremental Partition Reassignment

Posted by Colin McCabe <cm...@apache.org>.

Hi George,

As Jason was saying, it seems like there are two directions we could go here: an external system handling batching, and the controller handling batching.  I think the controller handling batching would be better, since the controller has more information about the state of the system.  If the controller handles batching, then the controller could also handle things like setting up replication quotas for individual partitions.  The controller could do things like throttle replication down if the cluster was having problems.

We need to figure out which way we're going to go on this one before we set up big new APIs, I think.  If we want an external system to handle batching, then we can keep the idea that there is only one reassignment in progress at once.  If we want the controller to handle batching, we will need to get away from that idea.  Instead, we should just have a bunch of "ideal assignments" that we tell the controller about, and let it decide how to do the batching.  These ideal assignments could change continuously over time, so from the admin's point of view, there would be no start/stop/cancel, but just individual partition reassignments that we submit, perhaps over a long period of time.  Cancellation might then mean cancelling only that individual partition reassignment, not all partition reassignments.
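
To make the controller-side option a bit more concrete, here is a minimal sketch of the bookkeeping it might imply. This is only an illustration under stated assumptions: the class name ReassignmentQueue, its methods, and the max_in_flight limit are all hypothetical and are not part of Kafka or of any KIP. Per-partition "ideal assignments" are submitted one at a time, the controller decides how many to run at once, and cancellation applies to a single partition.

```python
class ReassignmentQueue:
    """Hypothetical controller-side bookkeeping; not Kafka's implementation."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight  # assumed batching limit
        self.pending = {}    # (topic, partition) -> target replica list
        self.in_flight = {}  # (topic, partition) -> target replica list

    def submit(self, tp, target_replicas):
        # A newer "ideal assignment" for a partition replaces a queued one.
        self.pending[tp] = list(target_replicas)

    def cancel(self, tp):
        # Cancels only this partition's reassignment, queued or running.
        was_pending = self.pending.pop(tp, None) is not None
        was_running = self.in_flight.pop(tp, None) is not None
        return was_pending or was_running

    def complete(self, tp):
        # Called when the partition has reached its target replicas.
        self.in_flight.pop(tp, None)

    def next_batch(self):
        # Start queued reassignments until the in-flight limit is reached.
        started = []
        for tp in list(self.pending):
            if len(self.in_flight) >= self.max_in_flight:
                break
            if tp in self.in_flight:
                continue  # wait for this partition's running move to finish
            self.in_flight[tp] = self.pending.pop(tp)
            started.append(tp)
        return started
```

In this sketch there is no global start/stop: an admin keeps calling submit() over time, and the limit could in principle be lowered if the cluster is having problems.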

best,
Colin

On Fri, Apr 5, 2019, at 19:34, George Li wrote:
>  Hi Jason / Viktor,
> 
> For the URP during a reassignment, if the "original_replicas" is kept
> for the current pending reassignment, I think it will be very easy to
> compare that with the topic/partition's ISR.  If all
> "original_replicas" are in ISR, then URP should be 0 for that
> topic/partition.
>
> It would also be nice to separate the MaxLag/TotalLag metrics for
> reassignments. I think that will also require "original_replicas" (the
> topic/partition's replicas just before reassignment, when the AR
> (Assigned Replicas) is set to Set(original_replicas) +
> Set(new_replicas_in_reassign_partitions)).
> 
> Thanks,
> George

Re: [DISCUSS] KIP-435: Incremental Partition Reassignment

Posted by George Li <sq...@yahoo.com.INVALID>.

 Hi Jason / Viktor,

For the URP during a reassignment, if the "original_replicas" is kept for the current pending reassignment, I think it will be very easy to compare that with the topic/partition's ISR.  If all "original_replicas" are in ISR, then URP should be 0 for that topic/partition.

It would also be nice to separate the MaxLag/TotalLag metrics for reassignments. I think that will also require "original_replicas" (the topic/partition's replicas just before reassignment, when the AR (Assigned Replicas) is set to Set(original_replicas) + Set(new_replicas_in_reassign_partitions)).
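
The idea can be sketched roughly as follows. This is an illustrative sketch only, not the broker's actual metric code: the function names and the lag_by_broker mapping are assumptions made for the example, and treating "reassignment lag" as the lag of the replicas being added is one possible reading of the suggestion.

```python
def assigned_replicas_during_reassignment(original_replicas, new_replicas):
    # AR = original replicas plus any new replicas not already present,
    # keeping the original order first.
    ar = list(original_replicas)
    for broker in new_replicas:
        if broker not in ar:
            ar.append(broker)
    return ar


def is_under_replicated(original_replicas, isr):
    # During a reassignment, count the partition as under-replicated only
    # if one of the original replicas is missing from the ISR.
    return not set(original_replicas).issubset(isr)


def reassignment_lag(lag_by_broker, original_replicas, ar):
    # Assumption: lag of replicas that are being added, reported separately
    # from the regular MaxLag/TotalLag. Returns (max_lag, total_lag).
    adding = [b for b in ar if b not in original_replicas]
    lags = [lag_by_broker.get(b, 0) for b in adding]
    return max(lags, default=0), sum(lags)
```

For example, with original replicas [1, 2, 3] moving to [3, 4, 5], the AR becomes [1, 2, 3, 4, 5]; while brokers 4 and 5 are still catching up and the ISR is {1, 2, 3}, the partition would not be reported as under-replicated.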

Thanks,
George

    On Friday, April 5, 2019, 6:29:55 PM PDT, Jason Gustafson <ja...@confluent.io> wrote:  

 Hi Viktor,

Thanks for writing this up. As far as questions about overlap with KIP-236,
I agree it seems mostly orthogonal. I think KIP-236 may have had a larger
initial scope, but now it focuses on cancellation and batching is left for
future work.

With that said, I think we may not actually need a KIP for the current
proposal since it doesn't change any APIs. To make it more generally
useful, however, it would be nice to handle batching at the partition level
as well, as Jun suggests. The basic question is at what level the
batching should be determined. You could rely on external processes (e.g. Cruise
Control) or it could be built into the controller. There are tradeoffs
either way, but I think it simplifies such tools if it is handled
internally. Then it would be much safer to submit a larger reassignment
even just using the simple tools that come with Kafka.

By the way, since you are looking into some of the reassignment logic,
another problem that we might want to address is the misleading way we
report URPs during a reassignment. I had a naive proposal for this
previously, but it didn't really work
https://cwiki.apache.org/confluence/display/KAFKA/KIP-352%3A+Distinguish+URPs+caused+by+reassignment.
Potentially fixing that could fall under this work as well if you think it
makes sense.

Best,
Jason


Re: [DISCUSS] KIP-435: Incremental Partition Reassignment

Posted by Jason Gustafson <ja...@confluent.io>.

Hi Viktor,

Thanks for writing this up. As far as questions about overlap with KIP-236,
I agree it seems mostly orthogonal. I think KIP-236 may have had a larger
initial scope, but now it focuses on cancellation and batching is left for
future work.

With that said, I think we may not actually need a KIP for the current
proposal since it doesn't change any APIs. To make it more generally
useful, however, it would be nice to handle batching at the partition level
as well, as Jun suggests. The basic question is at what level the
batching should be determined. You could rely on external processes (e.g. Cruise
Control) or it could be built into the controller. There are tradeoffs
either way, but I think it simplifies such tools if it is handled
internally. Then it would be much safer to submit a larger reassignment
even just using the simple tools that come with Kafka.
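
For comparison, the external-process option could look roughly like the sketch below. It is a minimal illustration under assumptions: the reassignment is a plain mapping from (topic, partition) to target replicas, and submit and wait_until_done are hypothetical callables supplied by the tool, not Kafka APIs.

```python
def batches(reassignment, batch_size):
    # Split a {(topic, partition): target_replicas} mapping into
    # consecutive batches of at most batch_size partitions.
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    items = sorted(reassignment.items())
    for start in range(0, len(items), batch_size):
        yield dict(items[start:start + batch_size])


def run_in_batches(reassignment, batch_size, submit, wait_until_done):
    # Move one batch at a time; later batches start only after earlier
    # ones finish, so retention may already have removed some of their data.
    for batch in batches(reassignment, batch_size):
        submit(batch)            # hypothetical: hand the batch to the cluster
        wait_until_done(batch)   # hypothetical: block until it completes
```

The drawback discussed in this thread is that a tool driving such a loop has less information about the state of the cluster than the controller does.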

By the way, since you are looking into some of the reassignment logic,
another problem that we might want to address is the misleading way we
report URPs during a reassignment. I had a naive proposal for this
previously, but it didn't really work
https://cwiki.apache.org/confluence/display/KAFKA/KIP-352%3A+Distinguish+URPs+caused+by+reassignment.
Potentially fixing that could fall under this work as well if you think it
makes sense.

Best,
Jason

On Thu, Apr 4, 2019 at 4:49 PM Jun Rao <ju...@confluent.io> wrote:

> Hi, Viktor,
>
> Thanks for the KIP. A couple of comments below.
>
> 1. Another potential thing to do reassignment incrementally is to move a
> batch of partitions at a time, instead of all partitions. This may lead to
> less data replication since by the time the first batch of partitions have
> been completely moved, some data of the next batch may have been deleted
> due to retention and doesn't need to be replicated.
>
> 2. "Update CR in Zookeeper with TR for the given partition". Which ZK path
> is this for?
>
> Jun
>
> On Sat, Feb 23, 2019 at 2:12 AM Viktor Somogyi-Vass <viktorsomogyi@gmail.com> wrote:
>
> > Hi Harsha,
> >
> > As far as I understand KIP-236 it's about enabling reassignment
> > cancellation and as a future plan providing a queue of replica reassignment
> > steps to allow manual reassignment chains. While I agree that the
> > reassignment chain has a specific use case that allows fine grain control
> > over reassignment process, my proposal on the other hand doesn't talk about
> > cancellation but it only provides an automatic way to incrementalize an
> > arbitrary reassignment which I think fits the general use case where users
> > don't want that level of control but still would like a balanced way of
> > reassignments. Therefore I think it's still relevant as an improvement of
> > the current algorithm.
> > Nevertheless I'm happy to add my ideas to KIP-236 as I think it would be a
> > great improvement to Kafka.
> >
> > Cheers,
> > Viktor
> >
> > On Fri, Feb 22, 2019 at 5:05 PM Harsha <ka...@harsha.io> wrote:
> >
> > > Hi Viktor,
> > > There is already KIP-236 for the same feature and George made
> > > a PR for this as well.
> > > Let's consolidate these two discussions. If you have any cases that are not
> > > being solved by KIP-236 can you please mention them in that thread. We can
> > > address as part of KIP-236.
> > >
> > > Thanks,
> > > Harsha
> > >
> > > On Fri, Feb 22, 2019, at 5:44 AM, Viktor Somogyi-Vass wrote:
> > > > Hi Folks,
> > > >
> > > > I've created a KIP about an improvement of the reassignment algorithm we
> > > > have. It aims to enable partition-wise incremental reassignment. The
> > > > motivation for this is to avoid excess load that the current replication
> > > > algorithm implicitly carries as in that case there are points in the
> > > > algorithm where both the new and old replica set could be online and
> > > > replicating which puts double (or almost double) pressure on the brokers
> > > > which could cause problems.
> > > > Instead my proposal would slice this up into several steps where each step
> > > > is calculated based on the final target replicas and the current replica
> > > > assignment taking into account scenarios where brokers could be offline and
> > > > when there are not enough replicas to fulfil the min.insync.replicas
> > > > requirement.
> > > >
> > > > The link to the KIP:
> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-435%3A+Incremental+Partition+Reassignment
> > > >
> > > > I'd be happy to receive any feedback.
> > > >
> > > > An important note is that this KIP and another one, KIP-236 that is about
> > > > interruptible reassignment
> > > > (https://cwiki.apache.org/confluence/display/KAFKA/KIP-236%3A+Interruptible+Partition+Reassignment)
> > > > should be compatible.
> > > >
> > > > Thanks,
> > > > Viktor
> > > >
> > >
> >
>