You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@kafka.apache.org by Tom Bentley <t....@gmail.com> on 2017/12/06 21:19:08 UTC

[DISCUSS] KIP-236 Interruptible Partition Reassignment

Hi,

This is still very new, but I wanted some quick feedback on a preliminary
KIP which could, I think, help with providing an AdminClient API for
partition reassignment.

https://cwiki.apache.org/confluence/display/KAFKA/KIP-236%3A+Interruptible+Partition+Reassignment

I wasn't sure whether to start fleshing out a whole AdminClient API in this
KIP (which would make it very big, and difficult to read), or whether to
break it down into smaller KIPs (which makes it easier to read and
implement in pieces, but harder to get a high-level picture of the ultimate
destination). For now I've gone for a very small initial KIP, but I'm happy
to sketch the bigger picture here if people are interested.

Cheers,

Tom

Re: [DISCUSS] KIP-236 Interruptible Partition Reassignment

Posted by Jun Rao <ju...@confluent.io>.
Hi, Tom,

Thanks for the reply.

10. That's a good thought. Perhaps it's better to get rid of
/admin/reassignment_requests
too. The window when a controller is not available is small. So, we can
just failed the admin client if the controller is not reachable after the
timeout.

13. With the changes in 10, the old approach is handled through ZK callback
and the new approach is through Kafka RPC. The ordering between the two is
kind of arbitrary. Perhaps the ordering can just be based on the order that
the reassignment is added to the controller request queue. From there, we
can either do the overriding or the prevention.

Jun


On Fri, Dec 22, 2017 at 7:31 AM, Tom Bentley <t....@gmail.com> wrote:

> Hi Jun,
>
> Thanks for responding, my replies are inline:
>
> 10. You explanation makes sense. My remaining concern is the additional ZK
> > writes in the proposal. With the proposal, we will need to do following
> > writes in ZK.
> >
> > a. write new assignment in /admin/reassignment_requests
> >
> > b. write new assignment and additional metadata in
> > /admin/reassignments/$topic/$partition
> >
> > c. write old + new assignment  in /brokers/topics/[topic]
> >
> > d. write new assignment in /brokers/topics/[topic]
> >
> > e. delete /admin/reassignments/$topic/$partition
> >
> > So, there are quite a few ZK writes. I am wondering if it's better to
> > consolidate the info in /admin/reassignments/$topic/$partition into
> > /brokers/topics/[topic].
> > For example, we can just add some new JSON fields in
> > /brokers/topics/[topic]
> > to remember the new assignment and potentially the original replica count
> > when doing step c. Those fields with then be removed in step d. That way,
> > we can get rid of step b and e, saving 2 ZK writes per partition.
> >
>
> This seems like a great idea to me.
>
> It might also be possible to get rid of the /admin/reassignment_requests
> subtree too. I've not yet published the ideas I have for the AdminClient
> API for reassigning partitions, but given the existence of such an API, the
> route to starting a reassignment would be the AdminClient, and not
> zookeeper. In that case there is no need for /admin/reassignment_requests
> at all. The only drawback that I can see is that while it's currently
> possible to trigger a reassignment even during a controller
> election/failover that would no longer be the case if all requests had to
> go via the controller.
>
>
> > 11. What you described sounds good. We could potentially optimize the
> > dropped replicas a bit more. Suppose that assignment [0,1,2] is first
> > changed to [1,2,3] and then to [2,3,4]. When initiating the second
> > assignment, we may end up dropping replica 3 and only to restart it
> again.
> > In this case, we could only drop a replica if it's not going to be added
> > back again.
> >
>
> I had missed that, thank you! I will update the proposed algorithm to
> prevent this.
>
>
> > 13. Since this is a corner case, we can either prevent or allow
> overriding
> > with old/new mechanisms. To me, it seems that allowing is simpler to
> > implement, the order in /admin/reassignment_requests determines the
> > ordering the of override, whether that's initiated by the new way or the
> > old way.
> >
>
> That makes sense except for the corner case where:
>
> * There is no current controller and
> * Writes to both the new and old znodes happen
>
> On election of the new controller, for those partitions with both a
> reassignment_request and in /admin/reassign_partitions, we have to decide
> which should win. You could use the modification time, though there are
> some very unlikely scenarios where that doesn't work properly, for example
> if both znodes have the same mtime, or the /admin/reassign_partitions was
> updated, but the assignment of the partition wasn't changed, like this:
>
> 0. /admin/reassign_partitions has my-topic/42 = [1,2,3]
> 1. Controller stops watching.
> 2. Create /admin/reassignment_requests/request_1234 to change the
> reassignment of partition my-topic/42 = [4,5,6]
> 3. Update /admin/reassign_partitions to add your-topic/12=[7,8,9]
> 4. New controller resumes
>
>
>
> > Thanks,
> >
> > Jun
> >
> > On Tue, Dec 19, 2017 at 2:43 AM, Tom Bentley <t....@gmail.com>
> > wrote:
> >
> > > Hi Jun,
> > >
> > > 10. Another concern of mine is on consistency with the current pattern.
> > The
> > > > current pattern for change notification based on ZK is (1) we first
> > write
> > > > the actual value in the entity path and then write the change
> > > notification
> > > > path, and (2)  the change notification path only includes what entity
> > has
> > > > changed but not the actual changes. If we want to follow this pattern
> > for
> > > > consistency, /admin/reassignment_requests/request_xxx will only have
> > the
> > > > partitions whose reassignment have changed, but not the actual
> > > > reassignment.
> > > >
> > >
> > > Ah, I hadn't understood part (2). That means my concern about
> efficiency
> > > with the current pattern is misplaced. There are still some interesting
> > > differences in semantics, however:
> > >
> > > a) The mechanism currently proposed in KIP-236 means that the
> controller
> > is
> > > the only writer to /admin/reassignments. This means it can include
> > > information in these znodes that requesters might not know, or
> > information
> > > that's necessary to perform the reassignment but not necessary to
> > describe
> > > the request. While this could be handled using the current pattern it
> > would
> > > rely on all  writers to preserve any information added by the
> controller,
> > > which seems complicated and hence fragile.
> > >
> > > b) The current pattern for change notification doesn't cope with
> > competing
> > > writers to the entity path: If two processes write to the entity path
> > > before the controller can read it (due to notification) then one set of
> > > updates will be lost.
> > >
> > > c) If a single writing process crashes after writing to the entity
> path,
> > > but before writing to the notification path then the write will be
> lost.
> > >
> > > I'm actually using point a) in my WIP (see below). Points b) and c) are
> > > obviously edge cases.
> > >
> > >
> > > > 11. Ok. I am not sure that I fully understand the description of that
> > > part.
> > > > Does "assigned" refer to the current assignment? Could you also
> > describe
> > > > where the length of the original assignment is stored in ZK?
> > > >
> > >
> > > Sorry if the description is not clear. Yes, "assigned" referrs to the
> > > currently assigned replicas (taken from the
> > > ControllerContext.partitionReplicaAssignment). I would store the
> length
> > of
> > > the original assignment in the /admin/reassignments/$topic/$partition
> > > znode
> > > (this is where the point (a) above is useful -- the requester shouldn't
> > > know that this information is used by the controller).
> > >
> > > I've updated the KIP to make these points clearer.
> > >
> > >
> > > > 13. Hmm, I am not sure that the cancellation needs to be done for the
> > > whole
> > > > batch. The reason that I brought this up is for consistency. The KIP
> > > allows
> > > > override when using the new approach. It just seems that it's simpler
> > to
> > > > extend this model when resolving multiple changes between the old and
> > the
> > > > new approach.
> > >
> > >
> > > Ah, I think I've been unclear on this point too. Currently the
> > > ReassignPartitionsCommand enforces that you can't change reassignments,
> > but
> > > this doesn't stop other ZK clients making changes to
> > > /admin/reassign_partitions directly and I believe some Kafka users do
> > > indeed change reassignments in-flight by writing to
> > > /admin/reassign_partitions. What I'm proposing doesn't break that at
> all.
> > > The semantic I've implemented is only that the controller only refuses
> a
> > > reassignment change if there is already one in-flight (i.e. in
> > > /admin/reassignments/$topic/$partition) **via the other mechansim**.
> So
> > if
> > > you're using /admin/reassign_partitions and you change or cancel part
> of
> > it
> > > via /admin/reassign_partitions, that's OK. Likewise if you're using
> > > /admin/reassignment_request/request_xxx and you change or cancel part
> of
> > > it
> > > via another /admin/reassignment_request/request_xxx, that's OK.What
> you
> > > can't do is change a request that was started via
> > > /admin/reassign_partitions via /admin/reassignment_request/
> request_xxx,
> > or
> > > vice versa.
> > >
> > > What I was thinking of when I replied is the case where, on controller
> > > failover, /admin/reassign_partitions has been changed and
> > > /admin/reassignment_request/request_xxx created (in the period when
> the
> > > new
> > > controller was being elected, for example) with a common partition. In
> > this
> > > case we should apply a consistent rule to that used when the
> notification
> > > happen in real time. Your suggestion to use the modification time of
> the
> > > znode would work here too (except in the edge case where ZK writes to
> > both
> > > znodes happen within the same clock tick on the ZK server, so the
> mtimes
> > > are the same).
> > >
> > > Let me know if you think this is the right semantic and I'll try to
> > clarify
> > > the KIP.
> > >
> > > Many thanks,
> > >
> > > Tom
> > >
> > > On 18 December 2017 at 18:12, Jun Rao <ju...@confluent.io> wrote:
> > >
> > > > Hi, Tom,
> > > >
> > > > Thanks for the reply. A few more followup comments below.
> > > >
> > > > 10. Another concern of mine is on consistency with the current
> pattern.
> > > The
> > > > current pattern for change notification based on ZK is (1) we first
> > write
> > > > the actual value in the entity path and then write the change
> > > notification
> > > > path, and (2)  the change notification path only includes what entity
> > has
> > > > changed but not the actual changes. If we want to follow this pattern
> > for
> > > > consistency, /admin/reassignment_requests/request_xxx will only have
> > the
> > > > partitions whose reassignment have changed, but not the actual
> > > > reassignment.
> > > >
> > > > 11. Ok. I am not sure that I fully understand the description of that
> > > part.
> > > > Does "assigned" refer to the current assignment? Could you also
> > describe
> > > > where the length of the original assignment is stored in ZK?
> > > >
> > > > 13. Hmm, I am not sure that the cancellation needs to be done for the
> > > whole
> > > > batch. The reason that I brought this up is for consistency. The KIP
> > > allows
> > > > override when using the new approach. It just seems that it's simpler
> > to
> > > > extend this model when resolving multiple changes between the old and
> > the
> > > > new approach.
> > > >
> > > > Jun
> > > >
> > > >
> > > >
> > > > On Mon, Dec 18, 2017 at 2:45 AM, Tom Bentley <t....@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi Jun,
> > > > >
> > > > > Thanks for replying, some answers below:
> > > > >
> > > > >
> > > > > > 10. The proposal now stores the reassignment for all partitions
> in
> > > > > > /admin/reassignment_requests/request_xxx. If the number of
> > > reassigned
> > > > > > partitions is larger, the ZK write may hit the default 1MB limit
> > and
> > > > > fail.
> > > > > > An alternative approach is to have the reassignment requester
> first
> > > > write
> > > > > > the new assignment for each partition under
> > > > > > /admin/reassignments/$topic/$partition and then write
> > > > > > /admin/reassignment_requests/request_xxx with an empty value.
> The
> > > > > > controller can then read all values under /admin/reassignments.
> > > > > >
> > > > >
> > > > > You're right that reassigning enough partitions would hit the 1MB
> > > limit,
> > > > > but I don't think this would be a problem in practice because it
> > would
> > > be
> > > > > trivial to split the partitions into several requests (i.e.
> mutleiple
> > > > > request_xxx).
> > > > > I don't think the non-atomicity this would imply is a problem. By
> > > writing
> > > > > the partitions whose /admin/reassignments/$topic/$partition has
> been
> > > > > created or changed it makes it much more efficient to know which of
> > > those
> > > > > znodes we need to read. If I understand your suggestion, you would
> > have
> > > > to
> > > > > read every node under /admin/reassignments to figure out which had
> > > > changed.
> > > > >
> > > > >
> > > > > > 11. The improvement you suggested in onPartitionReassignment()
> > sounds
> > > > > good
> > > > > > at the high level. The computation of those dropped partitions
> > seems
> > > a
> > > > > bit
> > > > > > complicated. Perhaps a simple approach is to drop the replicas
> not
> > in
> > > > the
> > > > > > original assignment and newest reassignment?
> > > > > >
> > > > >
> > > > > This was what I came up with originally too, but when I looked into
> > > > > implementing it I found a couple of things which made me reconsider
> > it.
> > > > > Consider the reassignments [0,1] -> [2,3] -> [3,4]. In words: we
> > start
> > > > > reassigning to [2,3], but then change our minds about 2 and switch
> it
> > > to
> > > > 4
> > > > > (maybe we've figured out a better overall balance). At that point
> it
> > is
> > > > > perfectly possible that broker 2 is in-sync and broker 1 is not
> > > in-sync.
> > > > It
> > > > > seems silly to drop broker 2 in favour of broker 1: We're
> needlessly
> > > > giving
> > > > > the cluster more work to do.
> > > > >
> > > > > The second thing that made me reconsider was in that same scenario
> > it's
> > > > > even possible that broker 2 is the leader of the partition.
> Obviously
> > > we
> > > > > can elect a new leader before dropping it, but not without causing
> > > > > disruption to producers and consumers.
> > > > >
> > > > > By accepting a little more complexity in choosing which brokers to
> > drop
> > > > we
> > > > > make the dropping simpler (no need for leader election) and ensure
> > the
> > > > > cluster has less work to do.
> > > > >
> > > > >
> > > > > > 12. You brought up the need of remembering the original
> assignment.
> > > > This
> > > > > > will be lost if the assignment is changed multiple times if we
> > follow
> > > > the
> > > > > > approach described in 10. One way is to store the original
> > assignment
> > > > in
> > > > > > /brokers/topics/[topic] as the following. When the final
> > reassignment
> > > > > > completes, we can remove the original field.
> > > > > > {
> > > > > >   "version": 1,
> > > > > >   "partitions": {"0": [0, 1, 3] },
> > > > > >   "originals": {"0": [0, 1, 2] }
> > > > > > }
> > > > > >
> > > > >
> > > > > While I was implementing my first version of
> > onPartitionReassignment(),
> > > > > where I preferred the originals, I was storing the originals in the
> > > > > /admin/reassignments/$topic/$partition znodes. Since we will
> remove
> > > that
> > > > > znode at the end of reassignment anyway, I would suggest this is a
> > > better
> > > > > place to store that data (if it's necessary to do so), so that we
> can
> > > > avoid
> > > > > another ZK round trip.
> > > > >
> > > > >
> > > > > > 13. For resolving the conflict between /admin/reassign_partitions
> > and
> > > > > > /admin/reassignments/$topic/$partition, perhaps it's more
> natural
> > to
> > > > > just
> > > > > > let the assignment with a newer timestamp to override the older
> > one?
> > > > > >
> > > > >
> > > > > That would work but with slightly different semantics to what I
> have:
> > > > Since
> > > > > /admin/reassign_partitions contains multiple partitions, using the
> > > > > timestamp means the whole batch wins or losses. By tracking how
> each
> > > > > request was made we can be more fine-grained. I'm to use the
> > > modification
> > > > > time if such granularity is not required.
> > > > >
> > > > >
> > > > > > 14. Implementation wise, currently, we register a watcher of the
> > isr
> > > > path
> > > > > > of each partition being reassigned. This has the potential issue
> of
> > > > > > registering many listeners. An improvement could be just
> > piggybacking
> > > > on
> > > > > > the existing IsrChangeNotificationHandler, which only watches a
> > > single
> > > > ZK
> > > > > > path and is triggered on a batch of isr changes. This is kind of
> > > > > orthogonal
> > > > > > to the KIP. However, if we are touching the reassignment logic,
> it
> > > may
> > > > be
> > > > > > worth considering.
> > > > >
> > > > >
> > > > > Let me look into that.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Tom
> > > > >
> > > > > On 16 December 2017 at 02:19, Jun Rao <ju...@confluent.io> wrote:
> > > > >
> > > > > > Hi, Tom,
> > > > > >
> > > > > > Thanks for the updated KIP. A few more comments below.
> > > > > >
> > > > > > 10. The proposal now stores the reassignment for all partitions
> in
> > > > > > /admin/reassignment_requests/request_xxx. If the number of
> > > reassigned
> > > > > > partitions is larger, the ZK write may hit the default 1MB limit
> > and
> > > > > fail.
> > > > > > An alternative approach is to have the reassignment requester
> first
> > > > write
> > > > > > the new assignment for each partition under
> > > > > > /admin/reassignments/$topic/$partition and then write
> > > > > > /admin/reassignment_requests/request_xxx with an empty value.
> The
> > > > > > controller can then read all values under /admin/reassignments.
> > > > > >
> > > > > > 11. The improvement you suggested in onPartitionReassignment()
> > sounds
> > > > > good
> > > > > > at the high level. The computation of those dropped partitions
> > seems
> > > a
> > > > > bit
> > > > > > complicated. Perhaps a simple approach is to drop the replicas
> not
> > in
> > > > the
> > > > > > original assignment and newest reassignment?
> > > > > >
> > > > > > 12. You brought up the need of remembering the original
> assignment.
> > > > This
> > > > > > will be lost if the assignment is changed multiple times if we
> > follow
> > > > the
> > > > > > approach described in 10. One way is to store the original
> > assignment
> > > > in
> > > > > > /brokers/topics/[topic] as the following. When the final
> > reassignment
> > > > > > completes, we can remove the original field.
> > > > > > {
> > > > > >   "version": 1,
> > > > > >   "partitions": {"0": [0, 1, 3] },
> > > > > >   "originals": {"0": [0, 1, 2] }
> > > > > > }
> > > > > >
> > > > > > 13. For resolving the conflict between /admin/reassign_partitions
> > and
> > > > > > /admin/reassignments/$topic/$partition, perhaps it's more
> natural
> > to
> > > > > just
> > > > > > let the assignment with a newer timestamp to override the older
> > one?
> > > > > >
> > > > > > 14. Implementation wise, currently, we register a watcher of the
> > isr
> > > > path
> > > > > > of each partition being reassigned. This has the potential issue
> of
> > > > > > registering many listeners. An improvement could be just
> > piggybacking
> > > > on
> > > > > > the existing IsrChangeNotificationHandler, which only watches a
> > > single
> > > > ZK
> > > > > > path and is triggered on a batch of isr changes. This is kind of
> > > > > orthogonal
> > > > > > to the KIP. However, if we are touching the reassignment logic,
> it
> > > may
> > > > be
> > > > > > worth considering.
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Jun
> > > > > >
> > > > > > On Fri, Dec 15, 2017 at 10:17 AM, Tom Bentley <
> > t.j.bentley@gmail.com
> > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Just wanted to mention that I've started KIP-240, which builds
> on
> > > top
> > > > > of
> > > > > > > this one to provide an AdminClient API for listing and
> describing
> > > > > > > reassignments.
> > > > > > >
> > > > > > > On 15 December 2017 at 14:34, Tom Bentley <
> t.j.bentley@gmail.com
> > >
> > > > > wrote:
> > > > > > >
> > > > > > > > > Should we seek to improve this algorithm in this KIP, or
> > leave
> > > > that
> > > > > > as
> > > > > > > > a later optimisation?
> > > > > > > >
> > > > > > > > I've updated the KIP with a proposed algorithm.
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On 14 December 2017 at 09:57, Tom Bentley <
> > t.j.bentley@gmail.com
> > > >
> > > > > > wrote:
> > > > > > > >
> > > > > > > >> Thanks Ted, now fixed.
> > > > > > > >>
> > > > > > > >> On 13 December 2017 at 18:38, Ted Yu <yu...@gmail.com>
> > > wrote:
> > > > > > > >>
> > > > > > > >>> Tom:
> > > > > > > >>> bq. create a znode /admin/reassignments/$topic-$partition
> > > > > > > >>>
> > > > > > > >>> Looks like the tree structure above should be:
> > > > > > > >>>
> > > > > > > >>> /admin/reassignments/$topic/$partition
> > > > > > > >>>
> > > > > > > >>> bq. The controller removes /admin/reassignment/$topic/$
> > > partition
> > > > > > > >>>
> > > > > > > >>> Note the lack of 's' for reassignment. It would be good to
> > make
> > > > > > > zookeeper
> > > > > > > >>> paths consistent.
> > > > > > > >>>
> > > > > > > >>> Thanks
> > > > > > > >>>
> > > > > > > >>> On Wed, Dec 13, 2017 at 9:49 AM, Tom Bentley <
> > > > > t.j.bentley@gmail.com>
> > > > > > > >>> wrote:
> > > > > > > >>>
> > > > > > > >>> > Hi Jun and Ted,
> > > > > > > >>> >
> > > > > > > >>> > Jun, you're right that needing one watcher per reassigned
> > > > > partition
> > > > > > > >>> > presents a scalability problem, and using a separate
> > > > notification
> > > > > > > path
> > > > > > > >>> > solves that. I also agree that it makes sense to prevent
> > > users
> > > > > from
> > > > > > > >>> using
> > > > > > > >>> > both methods on the same reassignment.
> > > > > > > >>> >
> > > > > > > >>> > Ted, naming the reassignments like mytopic-42 was simpler
> > > > while I
> > > > > > was
> > > > > > > >>> > proposing a watcher-per-reassignment (I'd have needed a
> > child
> > > > > > watcher
> > > > > > > >>> on
> > > > > > > >>> > /admin/reassignments and also on
> > > /admin/reassignments/mytopic).
> > > > > > Using
> > > > > > > >>> the
> > > > > > > >>> > separate notification path means I don't need any
> watchers
> > in
> > > > the
> > > > > > > >>> > /admin/reassignments subtree, so switching to
> > > > > > > >>> /admin/reassignments/mytopic/
> > > > > > > >>> > 42
> > > > > > > >>> > would work, and avoid /admin/reassignments having a very
> > > large
> > > > > > number
> > > > > > > >>> of
> > > > > > > >>> > child nodes. On the other hand it also means I have to
> > create
> > > > and
> > > > > > > >>> delete
> > > > > > > >>> > the topic nodes (e.g. /admin/reassignments/mytopic),
> which
> > > > incurs
> > > > > > the
> > > > > > > >>> cost
> > > > > > > >>> > of extra round trips to zookeeper. I suppose that since
> > > > > > reassignment
> > > > > > > is
> > > > > > > >>> > generally a slow process it makes little difference if we
> > > > > increase
> > > > > > > the
> > > > > > > >>> > latency of the interactions with zookeeper.
> > > > > > > >>> >
> > > > > > > >>> > I have updated the KIP with these improvements, and a
> more
> > > > > detailed
> > > > > > > >>> > description of exactly how we would manage these znodes.
> > > > > > > >>> >
> > > > > > > >>> > Reading the algorithm in KafkaController.
> > > > > > onPartitionReassignment(),
> > > > > > > it
> > > > > > > >>> > seems that it would be suboptimal for changing
> > reassignments
> > > > > > > in-flight.
> > > > > > > >>> > Consider an initial assignment of [1,2], reassigned to
> > [2,3]
> > > > and
> > > > > > then
> > > > > > > >>> > changed to [2,4]. Broker 3 will remain in the assigned
> > > replicas
> > > > > > until
> > > > > > > >>> > broker 4 is in sync, even though 3 wasn't actually one of
> > the
> > > > > > > original
> > > > > > > >>> > assigned replicas and is no longer a new assigned
> replica.
> > I
> > > > > think
> > > > > > > this
> > > > > > > >>> > also affects the case where the reassignment is cancelled
> > > > > > > >>> > ([1,2]->[2,3]->[1,2]): We again have to wait for 3 to
> catch
> > > up,
> > > > > > even
> > > > > > > >>> though
> > > > > > > >>> > its replica will then be deleted.
> > > > > > > >>> >
> > > > > > > >>> > Should we seek to improve this algorithm in this KIP, or
> > > leave
> > > > > that
> > > > > > > as
> > > > > > > >>> a
> > > > > > > >>> > later optimisation?
> > > > > > > >>> >
> > > > > > > >>> > Cheers,
> > > > > > > >>> >
> > > > > > > >>> > Tom
> > > > > > > >>> >
> > > > > > > >>> > On 11 December 2017 at 21:31, Jun Rao <ju...@confluent.io>
> > > > wrote:
> > > > > > > >>> >
> > > > > > > >>> > > Another question is on the compatibility. Since now
> there
> > > > are 2
> > > > > > > ways
> > > > > > > >>> of
> > > > > > > >>> > > specifying a partition reassignment, one under
> > > > > > > >>> /admin/reassign_partitions
> > > > > > > >>> > > and the other under /admin/reassignments, we probably
> > want
> > > to
> > > > > > > >>> prevent the
> > > > > > > >>> > > same topic being reassigned under both paths at the
> same
> > > > time?
> > > > > > > >>> > > Thanks,
> > > > > > > >>> > >
> > > > > > > >>> > > Jun
> > > > > > > >>> > >
> > > > > > > >>> > >
> > > > > > > >>> > >
> > > > > > > >>> > > On Fri, Dec 8, 2017 at 5:41 PM, Jun Rao <
> > jun@confluent.io>
> > > > > > wrote:
> > > > > > > >>> > >
> > > > > > > >>> > > > Hi, Tom,
> > > > > > > >>> > > >
> > > > > > > >>> > > > Thanks for the KIP. It definitely addresses one of
> the
> > > pain
> > > > > > > points
> > > > > > > >>> in
> > > > > > > >>> > > > partition reassignment. Another issue that it also
> > > > addresses
> > > > > is
> > > > > > > >>> the ZK
> > > > > > > >>> > > node
> > > > > > > >>> > > > size limit when writing the reassignment JSON.
> > > > > > > >>> > > >
> > > > > > > >>> > > > My only concern is that the KIP needs to create one
> > > watcher
> > > > > per
> > > > > > > >>> > > reassigned
> > > > > > > >>> > > > partition. This could add overhead in ZK and
> complexity
> > > for
> > > > > > > >>> debugging
> > > > > > > >>> > > when
> > > > > > > >>> > > > lots of partitions are being reassigned
> simultaneously.
> > > We
> > > > > > could
> > > > > > > >>> > > > potentially improve this by introducing a separate ZK
> > > path
> > > > > for
> > > > > > > >>> change
> > > > > > > >>> > > > notification as we do for configs. For example, every
> > > time
> > > > we
> > > > > > > >>> change
> > > > > > > >>> > the
> > > > > > > >>> > > > assignment for a set of partitions, we could further
> > > write
> > > > a
> > > > > > > >>> sequential
> > > > > > > >>> > > > node /admin/reassignment_changes/[change_x]. That
> way,
> > > the
> > > > > > > >>> controller
> > > > > > > >>> > > > only needs to watch the change path. Once a change is
> > > > > > triggered,
> > > > > > > >>> the
> > > > > > > >>> > > > controller can read everything under
> > > /admin/reassignments/.
> > > > > > > >>> > > >
> > > > > > > >>> > > > Jun
> > > > > > > >>> > > >
> > > > > > > >>> > > >
> > > > > > > >>> > > > On Wed, Dec 6, 2017 at 1:19 PM, Tom Bentley <
> > > > > > > t.j.bentley@gmail.com
> > > > > > > >>> >
> > > > > > > >>> > > wrote:
> > > > > > > >>> > > >
> > > > > > > >>> > > >> Hi,
> > > > > > > >>> > > >>
> > > > > > > >>> > > >> This is still very new, but I wanted some quick
> > feedback
> > > > on
> > > > > a
> > > > > > > >>> > > preliminary
> > > > > > > >>> > > >> KIP which could, I think, help with providing an
> > > > AdminClient
> > > > > > API
> > > > > > > >>> for
> > > > > > > >>> > > >> partition reassignment.
> > > > > > > >>> > > >>
> > > > > > > >>> > > >> https://cwiki.apache.org/
> > confluence/display/KAFKA/KIP-
> > > > 236%
> > > > > > > >>> > > >> 3A+Interruptible+Partition+Reassignment
> > > > > > > >>> > > >>
> > > > > > > >>> > > >> I wasn't sure whether to start fleshing out a whole
> > > > > > AdminClient
> > > > > > > >>> API in
> > > > > > > >>> > > >> this
> > > > > > > >>> > > >> KIP (which would make it very big, and difficult to
> > > read),
> > > > > or
> > > > > > > >>> whether
> > > > > > > >>> > to
> > > > > > > >>> > > >> break it down into smaller KIPs (which makes it
> easier
> > > to
> > > > > read
> > > > > > > and
> > > > > > > >>> > > >> implement in pieces, but harder to get a high-level
> > > > picture
> > > > > of
> > > > > > > the
> > > > > > > >>> > > >> ultimate
> > > > > > > >>> > > >> destination). For now I've gone for a very small
> > initial
> > > > > KIP,
> > > > > > > but
> > > > > > > >>> I'm
> > > > > > > >>> > > >> happy
> > > > > > > >>> > > >> to sketch the bigger picture here if people are
> > > > interested.
> > > > > > > >>> > > >>
> > > > > > > >>> > > >> Cheers,
> > > > > > > >>> > > >>
> > > > > > > >>> > > >> Tom
> > > > > > > >>> > > >>
> > > > > > > >>> > > >
> > > > > > > >>> > > >
> > > > > > > >>> > >
> > > > > > > >>> >
> > > > > > > >>> >
> > > > > > > >>> > On 11 December 2017 at 21:31, Jun Rao <ju...@confluent.io>
> > > > wrote:
> > > > > > > >>> >
> > > > > > > >>> > > Another question is on the compatibility. Since now
> there
> > > > are 2
> > > > > > > ways
> > > > > > > >>> of
> > > > > > > >>> > > specifying a partition reassignment, one under
> > > > > > > >>> /admin/reassign_partitions
> > > > > > > >>> > > and the other under /admin/reassignments, we probably
> > want
> > > to
> > > > > > > >>> prevent the
> > > > > > > >>> > > same topic being reassigned under both paths at the
> same
> > > > time?
> > > > > > > >>> > > Thanks,
> > > > > > > >>> > >
> > > > > > > >>> > > Jun
> > > > > > > >>> > >
> > > > > > > >>> > >
> > > > > > > >>> > >
> > > > > > > >>> > > On Fri, Dec 8, 2017 at 5:41 PM, Jun Rao <
> > jun@confluent.io>
> > > > > > wrote:
> > > > > > > >>> > >
> > > > > > > >>> > > > Hi, Tom,
> > > > > > > >>> > > >
> > > > > > > >>> > > > Thanks for the KIP. It definitely addresses one of
> the
> > > pain
> > > > > > > points
> > > > > > > >>> in
> > > > > > > >>> > > > partition reassignment. Another issue that it also
> > > > addresses
> > > > > is
> > > > > > > >>> the ZK
> > > > > > > >>> > > node
> > > > > > > >>> > > > size limit when writing the reassignment JSON.
> > > > > > > >>> > > >
> > > > > > > >>> > > > My only concern is that the KIP needs to create one
> > > watcher
> > > > > per
> > > > > > > >>> > > reassigned
> > > > > > > >>> > > > partition. This could add overhead in ZK and
> complexity
> > > for
> > > > > > > >>> debugging
> > > > > > > >>> > > when
> > > > > > > >>> > > > lots of partitions are being reassigned
> simultaneously.
> > > We
> > > > > > could
> > > > > > > >>> > > > potentially improve this by introducing a separate ZK
> > > path
> > > > > for
> > > > > > > >>> change
> > > > > > > >>> > > > notification as we do for configs. For example, every
> > > time
> > > > we
> > > > > > > >>> change
> > > > > > > >>> > the
> > > > > > > >>> > > > assignment for a set of partitions, we could further
> > > write
> > > > a
> > > > > > > >>> sequential
> > > > > > > >>> > > > node /admin/reassignment_changes/[change_x]. That
> way,
> > > the
> > > > > > > >>> controller
> > > > > > > >>> > > > only needs to watch the change path. Once a change is
> > > > > > triggered,
> > > > > > > >>> the
> > > > > > > >>> > > > controller can read everything under
> > > /admin/reassignments/.
> > > > > > > >>> > > >
> > > > > > > >>> > > > Jun
> > > > > > > >>> > > >
> > > > > > > >>> > > >
> > > > > > > >>> > > > On Wed, Dec 6, 2017 at 1:19 PM, Tom Bentley <
> > > > > > > t.j.bentley@gmail.com
> > > > > > > >>> >
> > > > > > > >>> > > wrote:
> > > > > > > >>> > > >
> > > > > > > >>> > > >> Hi,
> > > > > > > >>> > > >>
> > > > > > > >>> > > >> This is still very new, but I wanted some quick
> > feedback
> > > > on
> > > > > a
> > > > > > > >>> > > preliminary
> > > > > > > >>> > > >> KIP which could, I think, help with providing an
> > > > AdminClient
> > > > > > API
> > > > > > > >>> for
> > > > > > > >>> > > >> partition reassignment.
> > > > > > > >>> > > >>
> > > > > > > >>> > > >> https://cwiki.apache.org/
> > confluence/display/KAFKA/KIP-
> > > > 236%
> > > > > > > >>> > > >> 3A+Interruptible+Partition+Reassignment
> > > > > > > >>> > > >>
> > > > > > > >>> > > >> I wasn't sure whether to start fleshing out a whole
> > > > > > AdminClient
> > > > > > > >>> API in
> > > > > > > >>> > > >> this
> > > > > > > >>> > > >> KIP (which would make it very big, and difficult to
> > > read),
> > > > > or
> > > > > > > >>> whether
> > > > > > > >>> > to
> > > > > > > >>> > > >> break it down into smaller KIPs (which makes it
> easier
> > > to
> > > > > read
> > > > > > > and
> > > > > > > >>> > > >> implement in pieces, but harder to get a high-level
> > > > picture
> > > > > of
> > > > > > > the
> > > > > > > >>> > > >> ultimate
> > > > > > > >>> > > >> destination). For now I've gone for a very small
> > initial
> > > > > KIP,
> > > > > > > but
> > > > > > > >>> I'm
> > > > > > > >>> > > >> happy
> > > > > > > >>> > > >> to sketch the bigger picture here if people are
> > > > interested.
> > > > > > > >>> > > >>
> > > > > > > >>> > > >> Cheers,
> > > > > > > >>> > > >>
> > > > > > > >>> > > >> Tom
> > > > > > > >>> > > >>
> > > > > > > >>> > > >
> > > > > > > >>> > > >
> > > > > > > >>> > >
> > > > > > > >>> >
> > > > > > > >>>
> > > > > > > >>
> > > > > > > >>
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-236 Interruptible Partition Reassignment

Posted by Tom Bentley <t....@gmail.com>.
Hi Jun,

Thanks for responding, my replies are inline:

10. You explanation makes sense. My remaining concern is the additional ZK
> writes in the proposal. With the proposal, we will need to do following
> writes in ZK.
>
> a. write new assignment in /admin/reassignment_requests
>
> b. write new assignment and additional metadata in
> /admin/reassignments/$topic/$partition
>
> c. write old + new assignment  in /brokers/topics/[topic]
>
> d. write new assignment in /brokers/topics/[topic]
>
> e. delete /admin/reassignments/$topic/$partition
>
> So, there are quite a few ZK writes. I am wondering if it's better to
> consolidate the info in /admin/reassignments/$topic/$partition into
> /brokers/topics/[topic].
> For example, we can just add some new JSON fields in
> /brokers/topics/[topic]
> to remember the new assignment and potentially the original replica count
> when doing step c. Those fields with then be removed in step d. That way,
> we can get rid of step b and e, saving 2 ZK writes per partition.
>

This seems like a great idea to me.

It might also be possible to get rid of the /admin/reassignment_requests
subtree too. I've not yet published the ideas I have for the AdminClient
API for reassigning partitions, but given the existence of such an API, the
route to starting a reassignment would be the AdminClient, and not
zookeeper. In that case there is no need for /admin/reassignment_requests
at all. The only drawback that I can see is that while it's currently
possible to trigger a reassignment even during a controller
election/failover that would no longer be the case if all requests had to
go via the controller.


> 11. What you described sounds good. We could potentially optimize the
> dropped replicas a bit more. Suppose that assignment [0,1,2] is first
> changed to [1,2,3] and then to [2,3,4]. When initiating the second
> assignment, we may end up dropping replica 3 and only to restart it again.
> In this case, we could only drop a replica if it's not going to be added
> back again.
>

I had missed that, thank you! I will update the proposed algorithm to
prevent this.


> 13. Since this is a corner case, we can either prevent or allow overriding
> with old/new mechanisms. To me, it seems that allowing is simpler to
> implement, the order in /admin/reassignment_requests determines the
> ordering the of override, whether that's initiated by the new way or the
> old way.
>

That makes sense except for the corner case where:

* There is no current controller and
* Writes to both the new and old znodes happen

On election of the new controller, for those partitions with both a
reassignment_request and in /admin/reassign_partitions, we have to decide
which should win. You could use the modification time, though there are
some very unlikely scenarios where that doesn't work properly, for example
if both znodes have the same mtime, or the /admin/reassign_partitions was
updated, but the assignment of the partition wasn't changed, like this:

0. /admin/reassign_partitions has my-topic/42 = [1,2,3]
1. Controller stops watching.
2. Create /admin/reassignment_requests/request_1234 to change the
reassignment of partition my-topic/42 = [4,5,6]
3. Update /admin/reassign_partitions to add your-topic/12=[7,8,9]
4. New controller resumes



> Thanks,
>
> Jun
>
> On Tue, Dec 19, 2017 at 2:43 AM, Tom Bentley <t....@gmail.com>
> wrote:
>
> > Hi Jun,
> >
> > 10. Another concern of mine is on consistency with the current pattern.
> The
> > > current pattern for change notification based on ZK is (1) we first
> write
> > > the actual value in the entity path and then write the change
> > notification
> > > path, and (2)  the change notification path only includes what entity
> has
> > > changed but not the actual changes. If we want to follow this pattern
> for
> > > consistency, /admin/reassignment_requests/request_xxx will only have
> the
> > > partitions whose reassignment have changed, but not the actual
> > > reassignment.
> > >
> >
> > Ah, I hadn't understood part (2). That means my concern about efficiency
> > with the current pattern is misplaced. There are still some interesting
> > differences in semantics, however:
> >
> > a) The mechanism currently proposed in KIP-236 means that the controller
> is
> > the only writer to /admin/reassignments. This means it can include
> > information in these znodes that requesters might not know, or
> information
> > that's necessary to perform the reassignment but not necessary to
> describe
> > the request. While this could be handled using the current pattern it
> would
> > rely on all  writers to preserve any information added by the controller,
> > which seems complicated and hence fragile.
> >
> > b) The current pattern for change notification doesn't cope with
> competing
> > writers to the entity path: If two processes write to the entity path
> > before the controller can read it (due to notification) then one set of
> > updates will be lost.
> >
> > c) If a single writing process crashes after writing to the entity path,
> > but before writing to the notification path then the write will be lost.
> >
> > I'm actually using point a) in my WIP (see below). Points b) and c) are
> > obviously edge cases.
> >
> >
> > > 11. Ok. I am not sure that I fully understand the description of that
> > part.
> > > Does "assigned" refer to the current assignment? Could you also
> describe
> > > where the length of the original assignment is stored in ZK?
> > >
> >
> > Sorry if the description is not clear. Yes, "assigned" referrs to the
> > currently assigned replicas (taken from the
> > ControllerContext.partitionReplicaAssignment). I would store the length
> of
> > the original assignment in the /admin/reassignments/$topic/$partition
> > znode
> > (this is where the point (a) above is useful -- the requester shouldn't
> > know that this information is used by the controller).
> >
> > I've updated the KIP to make these points clearer.
> >
> >
> > > 13. Hmm, I am not sure that the cancellation needs to be done for the
> > whole
> > > batch. The reason that I brought this up is for consistency. The KIP
> > allows
> > > override when using the new approach. It just seems that it's simpler
> to
> > > extend this model when resolving multiple changes between the old and
> the
> > > new approach.
> >
> >
> > Ah, I think I've been unclear on this point too. Currently the
> > ReassignPartitionsCommand enforces that you can't change reassignments,
> but
> > this doesn't stop other ZK clients making changes to
> > /admin/reassign_partitions directly and I believe some Kafka users do
> > indeed change reassignments in-flight by writing to
> > /admin/reassign_partitions. What I'm proposing doesn't break that at all.
> > The semantic I've implemented is only that the controller only refuses a
> > reassignment change if there is already one in-flight (i.e. in
> > /admin/reassignments/$topic/$partition) **via the other mechansim**. So
> if
> > you're using /admin/reassign_partitions and you change or cancel part of
> it
> > via /admin/reassign_partitions, that's OK. Likewise if you're using
> > /admin/reassignment_request/request_xxx and you change or cancel part of
> > it
> > via another /admin/reassignment_request/request_xxx, that's OK.What you
> > can't do is change a request that was started via
> > /admin/reassign_partitions via /admin/reassignment_request/request_xxx,
> or
> > vice versa.
> >
> > What I was thinking of when I replied is the case where, on controller
> > failover, /admin/reassign_partitions has been changed and
> > /admin/reassignment_request/request_xxx created (in the period when the
> > new
> > controller was being elected, for example) with a common partition. In
> this
> > case we should apply a consistent rule to that used when the notification
> > happen in real time. Your suggestion to use the modification time of the
> > znode would work here too (except in the edge case where ZK writes to
> both
> > znodes happen within the same clock tick on the ZK server, so the mtimes
> > are the same).
> >
> > Let me know if you think this is the right semantic and I'll try to
> clarify
> > the KIP.
> >
> > Many thanks,
> >
> > Tom
> >
> > On 18 December 2017 at 18:12, Jun Rao <ju...@confluent.io> wrote:
> >
> > > Hi, Tom,
> > >
> > > Thanks for the reply. A few more followup comments below.
> > >
> > > 10. Another concern of mine is on consistency with the current pattern.
> > The
> > > current pattern for change notification based on ZK is (1) we first
> write
> > > the actual value in the entity path and then write the change
> > notification
> > > path, and (2)  the change notification path only includes what entity
> has
> > > changed but not the actual changes. If we want to follow this pattern
> for
> > > consistency, /admin/reassignment_requests/request_xxx will only have
> the
> > > partitions whose reassignment have changed, but not the actual
> > > reassignment.
> > >
> > > 11. Ok. I am not sure that I fully understand the description of that
> > part.
> > > Does "assigned" refer to the current assignment? Could you also
> describe
> > > where the length of the original assignment is stored in ZK?
> > >
> > > 13. Hmm, I am not sure that the cancellation needs to be done for the
> > whole
> > > batch. The reason that I brought this up is for consistency. The KIP
> > allows
> > > override when using the new approach. It just seems that it's simpler
> to
> > > extend this model when resolving multiple changes between the old and
> the
> > > new approach.
> > >
> > > Jun
> > >
> > >
> > >
> > > On Mon, Dec 18, 2017 at 2:45 AM, Tom Bentley <t....@gmail.com>
> > > wrote:
> > >
> > > > Hi Jun,
> > > >
> > > > Thanks for replying, some answers below:
> > > >
> > > >
> > > > > 10. The proposal now stores the reassignment for all partitions in
> > > > > /admin/reassignment_requests/request_xxx. If the number of
> > reassigned
> > > > > partitions is larger, the ZK write may hit the default 1MB limit
> and
> > > > fail.
> > > > > An alternative approach is to have the reassignment requester first
> > > write
> > > > > the new assignment for each partition under
> > > > > /admin/reassignments/$topic/$partition and then write
> > > > > /admin/reassignment_requests/request_xxx with an empty value. The
> > > > > controller can then read all values under /admin/reassignments.
> > > > >
> > > >
> > > > You're right that reassigning enough partitions would hit the 1MB
> > limit,
> > > > but I don't think this would be a problem in practice because it
> would
> > be
> > > > trivial to split the partitions into several requests (i.e. mutleiple
> > > > request_xxx).
> > > > I don't think the non-atomicity this would imply is a problem. By
> > writing
> > > > the partitions whose /admin/reassignments/$topic/$partition has been
> > > > created or changed it makes it much more efficient to know which of
> > those
> > > > znodes we need to read. If I understand your suggestion, you would
> have
> > > to
> > > > read every node under /admin/reassignments to figure out which had
> > > changed.
> > > >
> > > >
> > > > > 11. The improvement you suggested in onPartitionReassignment()
> sounds
> > > > good
> > > > > at the high level. The computation of those dropped partitions
> seems
> > a
> > > > bit
> > > > > complicated. Perhaps a simple approach is to drop the replicas not
> in
> > > the
> > > > > original assignment and newest reassignment?
> > > > >
> > > >
> > > > This was what I came up with originally too, but when I looked into
> > > > implementing it I found a couple of things which made me reconsider
> it.
> > > > Consider the reassignments [0,1] -> [2,3] -> [3,4]. In words: we
> start
> > > > reassigning to [2,3], but then change our minds about 2 and switch it
> > to
> > > 4
> > > > (maybe we've figured out a better overall balance). At that point it
> is
> > > > perfectly possible that broker 2 is in-sync and broker 1 is not
> > in-sync.
> > > It
> > > > seems silly to drop broker 2 in favour of broker 1: We're needlessly
> > > giving
> > > > the cluster more work to do.
> > > >
> > > > The second thing that made me reconsider was in that same scenario
> it's
> > > > even possible that broker 2 is the leader of the partition. Obviously
> > we
> > > > can elect a new leader before dropping it, but not without causing
> > > > disruption to producers and consumers.
> > > >
> > > > By accepting a little more complexity in choosing which brokers to
> drop
> > > we
> > > > make the dropping simpler (no need for leader election) and ensure
> the
> > > > cluster has less work to do.
> > > >
> > > >
> > > > > 12. You brought up the need of remembering the original assignment.
> > > This
> > > > > will be lost if the assignment is changed multiple times if we
> follow
> > > the
> > > > > approach described in 10. One way is to store the original
> assignment
> > > in
> > > > > /brokers/topics/[topic] as the following. When the final
> reassignment
> > > > > completes, we can remove the original field.
> > > > > {
> > > > >   "version": 1,
> > > > >   "partitions": {"0": [0, 1, 3] },
> > > > >   "originals": {"0": [0, 1, 2] }
> > > > > }
> > > > >
> > > >
> > > > While I was implementing my first version of
> onPartitionReassignment(),
> > > > where I preferred the originals, I was storing the originals in the
> > > > /admin/reassignments/$topic/$partition znodes. Since we will remove
> > that
> > > > znode at the end of reassignment anyway, I would suggest this is a
> > better
> > > > place to store that data (if it's necessary to do so), so that we can
> > > avoid
> > > > another ZK round trip.
> > > >
> > > >
> > > > > 13. For resolving the conflict between /admin/reassign_partitions
> and
> > > > > /admin/reassignments/$topic/$partition, perhaps it's more natural
> to
> > > > just
> > > > > let the assignment with a newer timestamp to override the older
> one?
> > > > >
> > > >
> > > > That would work but with slightly different semantics to what I have:
> > > Since
> > > > /admin/reassign_partitions contains multiple partitions, using the
> > > > timestamp means the whole batch wins or losses. By tracking how each
> > > > request was made we can be more fine-grained. I'm to use the
> > modification
> > > > time if such granularity is not required.
> > > >
> > > >
> > > > > 14. Implementation wise, currently, we register a watcher of the
> isr
> > > path
> > > > > of each partition being reassigned. This has the potential issue of
> > > > > registering many listeners. An improvement could be just
> piggybacking
> > > on
> > > > > the existing IsrChangeNotificationHandler, which only watches a
> > single
> > > ZK
> > > > > path and is triggered on a batch of isr changes. This is kind of
> > > > orthogonal
> > > > > to the KIP. However, if we are touching the reassignment logic, it
> > may
> > > be
> > > > > worth considering.
> > > >
> > > >
> > > > Let me look into that.
> > > >
> > > > Thanks,
> > > >
> > > > Tom
> > > >
> > > > On 16 December 2017 at 02:19, Jun Rao <ju...@confluent.io> wrote:
> > > >
> > > > > Hi, Tom,
> > > > >
> > > > > Thanks for the updated KIP. A few more comments below.
> > > > >
> > > > > 10. The proposal now stores the reassignment for all partitions in
> > > > > /admin/reassignment_requests/request_xxx. If the number of
> > reassigned
> > > > > partitions is larger, the ZK write may hit the default 1MB limit
> and
> > > > fail.
> > > > > An alternative approach is to have the reassignment requester first
> > > write
> > > > > the new assignment for each partition under
> > > > > /admin/reassignments/$topic/$partition and then write
> > > > > /admin/reassignment_requests/request_xxx with an empty value. The
> > > > > controller can then read all values under /admin/reassignments.
> > > > >
> > > > > 11. The improvement you suggested in onPartitionReassignment()
> sounds
> > > > good
> > > > > at the high level. The computation of those dropped partitions
> seems
> > a
> > > > bit
> > > > > complicated. Perhaps a simple approach is to drop the replicas not
> in
> > > the
> > > > > original assignment and newest reassignment?
> > > > >
> > > > > 12. You brought up the need of remembering the original assignment.
> > > This
> > > > > will be lost if the assignment is changed multiple times if we
> follow
> > > the
> > > > > approach described in 10. One way is to store the original
> assignment
> > > in
> > > > > /brokers/topics/[topic] as the following. When the final
> reassignment
> > > > > completes, we can remove the original field.
> > > > > {
> > > > >   "version": 1,
> > > > >   "partitions": {"0": [0, 1, 3] },
> > > > >   "originals": {"0": [0, 1, 2] }
> > > > > }
> > > > >
> > > > > 13. For resolving the conflict between /admin/reassign_partitions
> and
> > > > > /admin/reassignments/$topic/$partition, perhaps it's more natural
> to
> > > > just
> > > > > let the assignment with a newer timestamp to override the older
> one?
> > > > >
> > > > > 14. Implementation wise, currently, we register a watcher of the
> isr
> > > path
> > > > > of each partition being reassigned. This has the potential issue of
> > > > > registering many listeners. An improvement could be just
> piggybacking
> > > on
> > > > > the existing IsrChangeNotificationHandler, which only watches a
> > single
> > > ZK
> > > > > path and is triggered on a batch of isr changes. This is kind of
> > > > orthogonal
> > > > > to the KIP. However, if we are touching the reassignment logic, it
> > may
> > > be
> > > > > worth considering.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jun
> > > > >
> > > > > On Fri, Dec 15, 2017 at 10:17 AM, Tom Bentley <
> t.j.bentley@gmail.com
> > >
> > > > > wrote:
> > > > >
> > > > > > Just wanted to mention that I've started KIP-240, which builds on
> > top
> > > > of
> > > > > > this one to provide an AdminClient API for listing and describing
> > > > > > reassignments.
> > > > > >
> > > > > > On 15 December 2017 at 14:34, Tom Bentley <t.j.bentley@gmail.com
> >
> > > > wrote:
> > > > > >
> > > > > > > > Should we seek to improve this algorithm in this KIP, or
> leave
> > > that
> > > > > as
> > > > > > > a later optimisation?
> > > > > > >
> > > > > > > I've updated the KIP with a proposed algorithm.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On 14 December 2017 at 09:57, Tom Bentley <
> t.j.bentley@gmail.com
> > >
> > > > > wrote:
> > > > > > >
> > > > > > >> Thanks Ted, now fixed.
> > > > > > >>
> > > > > > >> On 13 December 2017 at 18:38, Ted Yu <yu...@gmail.com>
> > wrote:
> > > > > > >>
> > > > > > >>> Tom:
> > > > > > >>> bq. create a znode /admin/reassignments/$topic-$partition
> > > > > > >>>
> > > > > > >>> Looks like the tree structure above should be:
> > > > > > >>>
> > > > > > >>> /admin/reassignments/$topic/$partition
> > > > > > >>>
> > > > > > >>> bq. The controller removes /admin/reassignment/$topic/$
> > partition
> > > > > > >>>
> > > > > > >>> Note the lack of 's' for reassignment. It would be good to
> make
> > > > > > zookeeper
> > > > > > >>> paths consistent.
> > > > > > >>>
> > > > > > >>> Thanks
> > > > > > >>>
> > > > > > >>> On Wed, Dec 13, 2017 at 9:49 AM, Tom Bentley <
> > > > t.j.bentley@gmail.com>
> > > > > > >>> wrote:
> > > > > > >>>
> > > > > > >>> > Hi Jun and Ted,
> > > > > > >>> >
> > > > > > >>> > Jun, you're right that needing one watcher per reassigned
> > > > partition
> > > > > > >>> > presents a scalability problem, and using a separate
> > > notification
> > > > > > path
> > > > > > >>> > solves that. I also agree that it makes sense to prevent
> > users
> > > > from
> > > > > > >>> using
> > > > > > >>> > both methods on the same reassignment.
> > > > > > >>> >
> > > > > > >>> > Ted, naming the reassignments like mytopic-42 was simpler
> > > while I
> > > > > was
> > > > > > >>> > proposing a watcher-per-reassignment (I'd have needed a
> child
> > > > > watcher
> > > > > > >>> on
> > > > > > >>> > /admin/reassignments and also on
> > /admin/reassignments/mytopic).
> > > > > Using
> > > > > > >>> the
> > > > > > >>> > separate notification path means I don't need any watchers
> in
> > > the
> > > > > > >>> > /admin/reassignments subtree, so switching to
> > > > > > >>> /admin/reassignments/mytopic/
> > > > > > >>> > 42
> > > > > > >>> > would work, and avoid /admin/reassignments having a very
> > large
> > > > > number
> > > > > > >>> of
> > > > > > >>> > child nodes. On the other hand it also means I have to
> create
> > > and
> > > > > > >>> delete
> > > > > > >>> > the topic nodes (e.g. /admin/reassignments/mytopic), which
> > > incurs
> > > > > the
> > > > > > >>> cost
> > > > > > >>> > of extra round trips to zookeeper. I suppose that since
> > > > > reassignment
> > > > > > is
> > > > > > >>> > generally a slow process it makes little difference if we
> > > > increase
> > > > > > the
> > > > > > >>> > latency of the interactions with zookeeper.
> > > > > > >>> >
> > > > > > >>> > I have updated the KIP with these improvements, and a more
> > > > detailed
> > > > > > >>> > description of exactly how we would manage these znodes.
> > > > > > >>> >
> > > > > > >>> > Reading the algorithm in KafkaController.
> > > > > onPartitionReassignment(),
> > > > > > it
> > > > > > >>> > seems that it would be suboptimal for changing
> reassignments
> > > > > > in-flight.
> > > > > > >>> > Consider an initial assignment of [1,2], reassigned to
> [2,3]
> > > and
> > > > > then
> > > > > > >>> > changed to [2,4]. Broker 3 will remain in the assigned
> > replicas
> > > > > until
> > > > > > >>> > broker 4 is in sync, even though 3 wasn't actually one of
> the
> > > > > > original
> > > > > > >>> > assigned replicas and is no longer a new assigned replica.
> I
> > > > think
> > > > > > this
> > > > > > >>> > also affects the case where the reassignment is cancelled
> > > > > > >>> > ([1,2]->[2,3]->[1,2]): We again have to wait for 3 to catch
> > up,
> > > > > even
> > > > > > >>> though
> > > > > > >>> > its replica will then be deleted.
> > > > > > >>> >
> > > > > > >>> > Should we seek to improve this algorithm in this KIP, or
> > leave
> > > > that
> > > > > > as
> > > > > > >>> a
> > > > > > >>> > later optimisation?
> > > > > > >>> >
> > > > > > >>> > Cheers,
> > > > > > >>> >
> > > > > > >>> > Tom
> > > > > > >>> >
> > > > > > >>> > On 11 December 2017 at 21:31, Jun Rao <ju...@confluent.io>
> > > wrote:
> > > > > > >>> >
> > > > > > >>> > > Another question is on the compatibility. Since now there
> > > are 2
> > > > > > ways
> > > > > > >>> of
> > > > > > >>> > > specifying a partition reassignment, one under
> > > > > > >>> /admin/reassign_partitions
> > > > > > >>> > > and the other under /admin/reassignments, we probably
> want
> > to
> > > > > > >>> prevent the
> > > > > > >>> > > same topic being reassigned under both paths at the same
> > > time?
> > > > > > >>> > > Thanks,
> > > > > > >>> > >
> > > > > > >>> > > Jun
> > > > > > >>> > >
> > > > > > >>> > >
> > > > > > >>> > >
> > > > > > >>> > > On Fri, Dec 8, 2017 at 5:41 PM, Jun Rao <
> jun@confluent.io>
> > > > > wrote:
> > > > > > >>> > >
> > > > > > >>> > > > Hi, Tom,
> > > > > > >>> > > >
> > > > > > >>> > > > Thanks for the KIP. It definitely addresses one of the
> > pain
> > > > > > points
> > > > > > >>> in
> > > > > > >>> > > > partition reassignment. Another issue that it also
> > > addresses
> > > > is
> > > > > > >>> the ZK
> > > > > > >>> > > node
> > > > > > >>> > > > size limit when writing the reassignment JSON.
> > > > > > >>> > > >
> > > > > > >>> > > > My only concern is that the KIP needs to create one
> > watcher
> > > > per
> > > > > > >>> > > reassigned
> > > > > > >>> > > > partition. This could add overhead in ZK and complexity
> > for
> > > > > > >>> debugging
> > > > > > >>> > > when
> > > > > > >>> > > > lots of partitions are being reassigned simultaneously.
> > We
> > > > > could
> > > > > > >>> > > > potentially improve this by introducing a separate ZK
> > path
> > > > for
> > > > > > >>> change
> > > > > > >>> > > > notification as we do for configs. For example, every
> > time
> > > we
> > > > > > >>> change
> > > > > > >>> > the
> > > > > > >>> > > > assignment for a set of partitions, we could further
> > write
> > > a
> > > > > > >>> sequential
> > > > > > >>> > > > node /admin/reassignment_changes/[change_x]. That way,
> > the
> > > > > > >>> controller
> > > > > > >>> > > > only needs to watch the change path. Once a change is
> > > > > triggered,
> > > > > > >>> the
> > > > > > >>> > > > controller can read everything under
> > /admin/reassignments/.
> > > > > > >>> > > >
> > > > > > >>> > > > Jun
> > > > > > >>> > > >
> > > > > > >>> > > >
> > > > > > >>> > > > On Wed, Dec 6, 2017 at 1:19 PM, Tom Bentley <
> > > > > > t.j.bentley@gmail.com
> > > > > > >>> >
> > > > > > >>> > > wrote:
> > > > > > >>> > > >
> > > > > > >>> > > >> Hi,
> > > > > > >>> > > >>
> > > > > > >>> > > >> This is still very new, but I wanted some quick
> feedback
> > > on
> > > > a
> > > > > > >>> > > preliminary
> > > > > > >>> > > >> KIP which could, I think, help with providing an
> > > AdminClient
> > > > > API
> > > > > > >>> for
> > > > > > >>> > > >> partition reassignment.
> > > > > > >>> > > >>
> > > > > > >>> > > >> https://cwiki.apache.org/
> confluence/display/KAFKA/KIP-
> > > 236%
> > > > > > >>> > > >> 3A+Interruptible+Partition+Reassignment
> > > > > > >>> > > >>
> > > > > > >>> > > >> I wasn't sure whether to start fleshing out a whole
> > > > > AdminClient
> > > > > > >>> API in
> > > > > > >>> > > >> this
> > > > > > >>> > > >> KIP (which would make it very big, and difficult to
> > read),
> > > > or
> > > > > > >>> whether
> > > > > > >>> > to
> > > > > > >>> > > >> break it down into smaller KIPs (which makes it easier
> > to
> > > > read
> > > > > > and
> > > > > > >>> > > >> implement in pieces, but harder to get a high-level
> > > picture
> > > > of
> > > > > > the
> > > > > > >>> > > >> ultimate
> > > > > > >>> > > >> destination). For now I've gone for a very small
> initial
> > > > KIP,
> > > > > > but
> > > > > > >>> I'm
> > > > > > >>> > > >> happy
> > > > > > >>> > > >> to sketch the bigger picture here if people are
> > > interested.
> > > > > > >>> > > >>
> > > > > > >>> > > >> Cheers,
> > > > > > >>> > > >>
> > > > > > >>> > > >> Tom
> > > > > > >>> > > >>
> > > > > > >>> > > >
> > > > > > >>> > > >
> > > > > > >>> > >
> > > > > > >>> >
> > > > > > >>> >
> > > > > > >>> > On 11 December 2017 at 21:31, Jun Rao <ju...@confluent.io>
> > > wrote:
> > > > > > >>> >
> > > > > > >>> > > Another question is on the compatibility. Since now there
> > > are 2
> > > > > > ways
> > > > > > >>> of
> > > > > > >>> > > specifying a partition reassignment, one under
> > > > > > >>> /admin/reassign_partitions
> > > > > > >>> > > and the other under /admin/reassignments, we probably
> want
> > to
> > > > > > >>> prevent the
> > > > > > >>> > > same topic being reassigned under both paths at the same
> > > time?
> > > > > > >>> > > Thanks,
> > > > > > >>> > >
> > > > > > >>> > > Jun
> > > > > > >>> > >
> > > > > > >>> > >
> > > > > > >>> > >
> > > > > > >>> > > On Fri, Dec 8, 2017 at 5:41 PM, Jun Rao <
> jun@confluent.io>
> > > > > wrote:
> > > > > > >>> > >
> > > > > > >>> > > > Hi, Tom,
> > > > > > >>> > > >
> > > > > > >>> > > > Thanks for the KIP. It definitely addresses one of the
> > pain
> > > > > > points
> > > > > > >>> in
> > > > > > >>> > > > partition reassignment. Another issue that it also
> > > addresses
> > > > is
> > > > > > >>> the ZK
> > > > > > >>> > > node
> > > > > > >>> > > > size limit when writing the reassignment JSON.
> > > > > > >>> > > >
> > > > > > >>> > > > My only concern is that the KIP needs to create one
> > watcher
> > > > per
> > > > > > >>> > > reassigned
> > > > > > >>> > > > partition. This could add overhead in ZK and complexity
> > for
> > > > > > >>> debugging
> > > > > > >>> > > when
> > > > > > >>> > > > lots of partitions are being reassigned simultaneously.
> > We
> > > > > could
> > > > > > >>> > > > potentially improve this by introducing a separate ZK
> > path
> > > > for
> > > > > > >>> change
> > > > > > >>> > > > notification as we do for configs. For example, every
> > time
> > > we
> > > > > > >>> change
> > > > > > >>> > the
> > > > > > >>> > > > assignment for a set of partitions, we could further
> > write
> > > a
> > > > > > >>> sequential
> > > > > > >>> > > > node /admin/reassignment_changes/[change_x]. That way,
> > the
> > > > > > >>> controller
> > > > > > >>> > > > only needs to watch the change path. Once a change is
> > > > > triggered,
> > > > > > >>> the
> > > > > > >>> > > > controller can read everything under
> > /admin/reassignments/.
> > > > > > >>> > > >
> > > > > > >>> > > > Jun
> > > > > > >>> > > >
> > > > > > >>> > > >
> > > > > > >>> > > > On Wed, Dec 6, 2017 at 1:19 PM, Tom Bentley <
> > > > > > t.j.bentley@gmail.com
> > > > > > >>> >
> > > > > > >>> > > wrote:
> > > > > > >>> > > >
> > > > > > >>> > > >> Hi,
> > > > > > >>> > > >>
> > > > > > >>> > > >> This is still very new, but I wanted some quick
> feedback
> > > on
> > > > a
> > > > > > >>> > > preliminary
> > > > > > >>> > > >> KIP which could, I think, help with providing an
> > > AdminClient
> > > > > API
> > > > > > >>> for
> > > > > > >>> > > >> partition reassignment.
> > > > > > >>> > > >>
> > > > > > >>> > > >> https://cwiki.apache.org/
> confluence/display/KAFKA/KIP-
> > > 236%
> > > > > > >>> > > >> 3A+Interruptible+Partition+Reassignment
> > > > > > >>> > > >>
> > > > > > >>> > > >> I wasn't sure whether to start fleshing out a whole
> > > > > AdminClient
> > > > > > >>> API in
> > > > > > >>> > > >> this
> > > > > > >>> > > >> KIP (which would make it very big, and difficult to
> > read),
> > > > or
> > > > > > >>> whether
> > > > > > >>> > to
> > > > > > >>> > > >> break it down into smaller KIPs (which makes it easier
> > to
> > > > read
> > > > > > and
> > > > > > >>> > > >> implement in pieces, but harder to get a high-level
> > > picture
> > > > of
> > > > > > the
> > > > > > >>> > > >> ultimate
> > > > > > >>> > > >> destination). For now I've gone for a very small
> initial
> > > > KIP,
> > > > > > but
> > > > > > >>> I'm
> > > > > > >>> > > >> happy
> > > > > > >>> > > >> to sketch the bigger picture here if people are
> > > interested.
> > > > > > >>> > > >>
> > > > > > >>> > > >> Cheers,
> > > > > > >>> > > >>
> > > > > > >>> > > >> Tom
> > > > > > >>> > > >>
> > > > > > >>> > > >
> > > > > > >>> > > >
> > > > > > >>> > >
> > > > > > >>> >
> > > > > > >>>
> > > > > > >>
> > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-236 Interruptible Partition Reassignment

Posted by Jun Rao <ju...@confluent.io>.
Hi, Tom,

Thanks for the reply. A few more comments below.

10. You explanation makes sense. My remaining concern is the additional ZK
writes in the proposal. With the proposal, we will need to do following
writes in ZK.

a. write new assignment in /admin/reassignment_requests

b. write new assignment and additional metadata in
/admin/reassignments/$topic/$partition

c. write old + new assignment  in /brokers/topics/[topic]

d. write new assignment in /brokers/topics/[topic]

e. delete /admin/reassignments/$topic/$partition

So, there are quite a few ZK writes. I am wondering if it's better to
consolidate the info in /admin/reassignments/$topic/$partition into
/brokers/topics/[topic].
For example, we can just add some new JSON fields in /brokers/topics/[topic]
to remember the new assignment and potentially the original replica count
when doing step c. Those fields with then be removed in step d. That way,
we can get rid of step b and e, saving 2 ZK writes per partition.

11. What you described sounds good. We could potentially optimize the
dropped replicas a bit more. Suppose that assignment [0,1,2] is first
changed to [1,2,3] and then to [2,3,4]. When initiating the second
assignment, we may end up dropping replica 3 and only to restart it again.
In this case, we could only drop a replica if it's not going to be added
back again.

13. Since this is a corner case, we can either prevent or allow overriding
with old/new mechanisms. To me, it seems that allowing is simpler to
implement, the order in /admin/reassignment_requests determines the
ordering the of override, whether that's initiated by the new way or the
old way.

Thanks,

Jun

On Tue, Dec 19, 2017 at 2:43 AM, Tom Bentley <t....@gmail.com> wrote:

> Hi Jun,
>
> 10. Another concern of mine is on consistency with the current pattern. The
> > current pattern for change notification based on ZK is (1) we first write
> > the actual value in the entity path and then write the change
> notification
> > path, and (2)  the change notification path only includes what entity has
> > changed but not the actual changes. If we want to follow this pattern for
> > consistency, /admin/reassignment_requests/request_xxx will only have the
> > partitions whose reassignment have changed, but not the actual
> > reassignment.
> >
>
> Ah, I hadn't understood part (2). That means my concern about efficiency
> with the current pattern is misplaced. There are still some interesting
> differences in semantics, however:
>
> a) The mechanism currently proposed in KIP-236 means that the controller is
> the only writer to /admin/reassignments. This means it can include
> information in these znodes that requesters might not know, or information
> that's necessary to perform the reassignment but not necessary to describe
> the request. While this could be handled using the current pattern it would
> rely on all  writers to preserve any information added by the controller,
> which seems complicated and hence fragile.
>
> b) The current pattern for change notification doesn't cope with competing
> writers to the entity path: If two processes write to the entity path
> before the controller can read it (due to notification) then one set of
> updates will be lost.
>
> c) If a single writing process crashes after writing to the entity path,
> but before writing to the notification path then the write will be lost.
>
> I'm actually using point a) in my WIP (see below). Points b) and c) are
> obviously edge cases.
>
>
> > 11. Ok. I am not sure that I fully understand the description of that
> part.
> > Does "assigned" refer to the current assignment? Could you also describe
> > where the length of the original assignment is stored in ZK?
> >
>
> Sorry if the description is not clear. Yes, "assigned" referrs to the
> currently assigned replicas (taken from the
> ControllerContext.partitionReplicaAssignment). I would store the length of
> the original assignment in the /admin/reassignments/$topic/$partition
> znode
> (this is where the point (a) above is useful -- the requester shouldn't
> know that this information is used by the controller).
>
> I've updated the KIP to make these points clearer.
>
>
> > 13. Hmm, I am not sure that the cancellation needs to be done for the
> whole
> > batch. The reason that I brought this up is for consistency. The KIP
> allows
> > override when using the new approach. It just seems that it's simpler to
> > extend this model when resolving multiple changes between the old and the
> > new approach.
>
>
> Ah, I think I've been unclear on this point too. Currently the
> ReassignPartitionsCommand enforces that you can't change reassignments, but
> this doesn't stop other ZK clients making changes to
> /admin/reassign_partitions directly and I believe some Kafka users do
> indeed change reassignments in-flight by writing to
> /admin/reassign_partitions. What I'm proposing doesn't break that at all.
> The semantic I've implemented is only that the controller only refuses a
> reassignment change if there is already one in-flight (i.e. in
> /admin/reassignments/$topic/$partition) **via the other mechansim**. So if
> you're using /admin/reassign_partitions and you change or cancel part of it
> via /admin/reassign_partitions, that's OK. Likewise if you're using
> /admin/reassignment_request/request_xxx and you change or cancel part of
> it
> via another /admin/reassignment_request/request_xxx, that's OK.What you
> can't do is change a request that was started via
> /admin/reassign_partitions via /admin/reassignment_request/request_xxx, or
> vice versa.
>
> What I was thinking of when I replied is the case where, on controller
> failover, /admin/reassign_partitions has been changed and
> /admin/reassignment_request/request_xxx created (in the period when the
> new
> controller was being elected, for example) with a common partition. In this
> case we should apply a consistent rule to that used when the notification
> happen in real time. Your suggestion to use the modification time of the
> znode would work here too (except in the edge case where ZK writes to both
> znodes happen within the same clock tick on the ZK server, so the mtimes
> are the same).
>
> Let me know if you think this is the right semantic and I'll try to clarify
> the KIP.
>
> Many thanks,
>
> Tom
>
> On 18 December 2017 at 18:12, Jun Rao <ju...@confluent.io> wrote:
>
> > Hi, Tom,
> >
> > Thanks for the reply. A few more followup comments below.
> >
> > 10. Another concern of mine is on consistency with the current pattern.
> The
> > current pattern for change notification based on ZK is (1) we first write
> > the actual value in the entity path and then write the change
> notification
> > path, and (2)  the change notification path only includes what entity has
> > changed but not the actual changes. If we want to follow this pattern for
> > consistency, /admin/reassignment_requests/request_xxx will only have the
> > partitions whose reassignment have changed, but not the actual
> > reassignment.
> >
> > 11. Ok. I am not sure that I fully understand the description of that
> part.
> > Does "assigned" refer to the current assignment? Could you also describe
> > where the length of the original assignment is stored in ZK?
> >
> > 13. Hmm, I am not sure that the cancellation needs to be done for the
> whole
> > batch. The reason that I brought this up is for consistency. The KIP
> allows
> > override when using the new approach. It just seems that it's simpler to
> > extend this model when resolving multiple changes between the old and the
> > new approach.
> >
> > Jun
> >
> >
> >
> > On Mon, Dec 18, 2017 at 2:45 AM, Tom Bentley <t....@gmail.com>
> > wrote:
> >
> > > Hi Jun,
> > >
> > > Thanks for replying, some answers below:
> > >
> > >
> > > > 10. The proposal now stores the reassignment for all partitions in
> > > > /admin/reassignment_requests/request_xxx. If the number of
> reassigned
> > > > partitions is larger, the ZK write may hit the default 1MB limit and
> > > fail.
> > > > An alternative approach is to have the reassignment requester first
> > write
> > > > the new assignment for each partition under
> > > > /admin/reassignments/$topic/$partition and then write
> > > > /admin/reassignment_requests/request_xxx with an empty value. The
> > > > controller can then read all values under /admin/reassignments.
> > > >
> > >
> > > You're right that reassigning enough partitions would hit the 1MB
> limit,
> > > but I don't think this would be a problem in practice because it would
> be
> > > trivial to split the partitions into several requests (i.e. mutleiple
> > > request_xxx).
> > > I don't think the non-atomicity this would imply is a problem. By
> writing
> > > the partitions whose /admin/reassignments/$topic/$partition has been
> > > created or changed it makes it much more efficient to know which of
> those
> > > znodes we need to read. If I understand your suggestion, you would have
> > to
> > > read every node under /admin/reassignments to figure out which had
> > changed.
> > >
> > >
> > > > 11. The improvement you suggested in onPartitionReassignment() sounds
> > > good
> > > > at the high level. The computation of those dropped partitions seems
> a
> > > bit
> > > > complicated. Perhaps a simple approach is to drop the replicas not in
> > the
> > > > original assignment and newest reassignment?
> > > >
> > >
> > > This was what I came up with originally too, but when I looked into
> > > implementing it I found a couple of things which made me reconsider it.
> > > Consider the reassignments [0,1] -> [2,3] -> [3,4]. In words: we start
> > > reassigning to [2,3], but then change our minds about 2 and switch it
> to
> > 4
> > > (maybe we've figured out a better overall balance). At that point it is
> > > perfectly possible that broker 2 is in-sync and broker 1 is not
> in-sync.
> > It
> > > seems silly to drop broker 2 in favour of broker 1: We're needlessly
> > giving
> > > the cluster more work to do.
> > >
> > > The second thing that made me reconsider was in that same scenario it's
> > > even possible that broker 2 is the leader of the partition. Obviously
> we
> > > can elect a new leader before dropping it, but not without causing
> > > disruption to producers and consumers.
> > >
> > > By accepting a little more complexity in choosing which brokers to drop
> > we
> > > make the dropping simpler (no need for leader election) and ensure the
> > > cluster has less work to do.
> > >
> > >
> > > > 12. You brought up the need of remembering the original assignment.
> > This
> > > > will be lost if the assignment is changed multiple times if we follow
> > the
> > > > approach described in 10. One way is to store the original assignment
> > in
> > > > /brokers/topics/[topic] as the following. When the final reassignment
> > > > completes, we can remove the original field.
> > > > {
> > > >   "version": 1,
> > > >   "partitions": {"0": [0, 1, 3] },
> > > >   "originals": {"0": [0, 1, 2] }
> > > > }
> > > >
> > >
> > > While I was implementing my first version of onPartitionReassignment(),
> > > where I preferred the originals, I was storing the originals in the
> > > /admin/reassignments/$topic/$partition znodes. Since we will remove
> that
> > > znode at the end of reassignment anyway, I would suggest this is a
> better
> > > place to store that data (if it's necessary to do so), so that we can
> > avoid
> > > another ZK round trip.
> > >
> > >
> > > > 13. For resolving the conflict between /admin/reassign_partitions and
> > > > /admin/reassignments/$topic/$partition, perhaps it's more natural to
> > > just
> > > > let the assignment with a newer timestamp to override the older one?
> > > >
> > >
> > > That would work but with slightly different semantics to what I have:
> > Since
> > > /admin/reassign_partitions contains multiple partitions, using the
> > > timestamp means the whole batch wins or losses. By tracking how each
> > > request was made we can be more fine-grained. I'm to use the
> modification
> > > time if such granularity is not required.
> > >
> > >
> > > > 14. Implementation wise, currently, we register a watcher of the isr
> > path
> > > > of each partition being reassigned. This has the potential issue of
> > > > registering many listeners. An improvement could be just piggybacking
> > on
> > > > the existing IsrChangeNotificationHandler, which only watches a
> single
> > ZK
> > > > path and is triggered on a batch of isr changes. This is kind of
> > > orthogonal
> > > > to the KIP. However, if we are touching the reassignment logic, it
> may
> > be
> > > > worth considering.
> > >
> > >
> > > Let me look into that.
> > >
> > > Thanks,
> > >
> > > Tom
> > >
> > > On 16 December 2017 at 02:19, Jun Rao <ju...@confluent.io> wrote:
> > >
> > > > Hi, Tom,
> > > >
> > > > Thanks for the updated KIP. A few more comments below.
> > > >
> > > > 10. The proposal now stores the reassignment for all partitions in
> > > > /admin/reassignment_requests/request_xxx. If the number of
> reassigned
> > > > partitions is larger, the ZK write may hit the default 1MB limit and
> > > fail.
> > > > An alternative approach is to have the reassignment requester first
> > write
> > > > the new assignment for each partition under
> > > > /admin/reassignments/$topic/$partition and then write
> > > > /admin/reassignment_requests/request_xxx with an empty value. The
> > > > controller can then read all values under /admin/reassignments.
> > > >
> > > > 11. The improvement you suggested in onPartitionReassignment() sounds
> > > good
> > > > at the high level. The computation of those dropped partitions seems
> a
> > > bit
> > > > complicated. Perhaps a simple approach is to drop the replicas not in
> > the
> > > > original assignment and newest reassignment?
> > > >
> > > > 12. You brought up the need of remembering the original assignment.
> > This
> > > > will be lost if the assignment is changed multiple times if we follow
> > the
> > > > approach described in 10. One way is to store the original assignment
> > in
> > > > /brokers/topics/[topic] as the following. When the final reassignment
> > > > completes, we can remove the original field.
> > > > {
> > > >   "version": 1,
> > > >   "partitions": {"0": [0, 1, 3] },
> > > >   "originals": {"0": [0, 1, 2] }
> > > > }
> > > >
> > > > 13. For resolving the conflict between /admin/reassign_partitions and
> > > > /admin/reassignments/$topic/$partition, perhaps it's more natural to
> > > just
> > > > let the assignment with a newer timestamp to override the older one?
> > > >
> > > > 14. Implementation wise, currently, we register a watcher of the isr
> > path
> > > > of each partition being reassigned. This has the potential issue of
> > > > registering many listeners. An improvement could be just piggybacking
> > on
> > > > the existing IsrChangeNotificationHandler, which only watches a
> single
> > ZK
> > > > path and is triggered on a batch of isr changes. This is kind of
> > > orthogonal
> > > > to the KIP. However, if we are touching the reassignment logic, it
> may
> > be
> > > > worth considering.
> > > >
> > > > Thanks,
> > > >
> > > > Jun
> > > >
> > > > On Fri, Dec 15, 2017 at 10:17 AM, Tom Bentley <t.j.bentley@gmail.com
> >
> > > > wrote:
> > > >
> > > > > Just wanted to mention that I've started KIP-240, which builds on
> top
> > > of
> > > > > this one to provide an AdminClient API for listing and describing
> > > > > reassignments.
> > > > >
> > > > > On 15 December 2017 at 14:34, Tom Bentley <t....@gmail.com>
> > > wrote:
> > > > >
> > > > > > > Should we seek to improve this algorithm in this KIP, or leave
> > that
> > > > as
> > > > > > a later optimisation?
> > > > > >
> > > > > > I've updated the KIP with a proposed algorithm.
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > On 14 December 2017 at 09:57, Tom Bentley <t.j.bentley@gmail.com
> >
> > > > wrote:
> > > > > >
> > > > > >> Thanks Ted, now fixed.
> > > > > >>
> > > > > >> On 13 December 2017 at 18:38, Ted Yu <yu...@gmail.com>
> wrote:
> > > > > >>
> > > > > >>> Tom:
> > > > > >>> bq. create a znode /admin/reassignments/$topic-$partition
> > > > > >>>
> > > > > >>> Looks like the tree structure above should be:
> > > > > >>>
> > > > > >>> /admin/reassignments/$topic/$partition
> > > > > >>>
> > > > > >>> bq. The controller removes /admin/reassignment/$topic/$
> partition
> > > > > >>>
> > > > > >>> Note the lack of 's' for reassignment. It would be good to make
> > > > > zookeeper
> > > > > >>> paths consistent.
> > > > > >>>
> > > > > >>> Thanks
> > > > > >>>
> > > > > >>> On Wed, Dec 13, 2017 at 9:49 AM, Tom Bentley <
> > > t.j.bentley@gmail.com>
> > > > > >>> wrote:
> > > > > >>>
> > > > > >>> > Hi Jun and Ted,
> > > > > >>> >
> > > > > >>> > Jun, you're right that needing one watcher per reassigned
> > > partition
> > > > > >>> > presents a scalability problem, and using a separate
> > notification
> > > > > path
> > > > > >>> > solves that. I also agree that it makes sense to prevent
> users
> > > from
> > > > > >>> using
> > > > > >>> > both methods on the same reassignment.
> > > > > >>> >
> > > > > >>> > Ted, naming the reassignments like mytopic-42 was simpler
> > while I
> > > > was
> > > > > >>> > proposing a watcher-per-reassignment (I'd have needed a child
> > > > watcher
> > > > > >>> on
> > > > > >>> > /admin/reassignments and also on
> /admin/reassignments/mytopic).
> > > > Using
> > > > > >>> the
> > > > > >>> > separate notification path means I don't need any watchers in
> > the
> > > > > >>> > /admin/reassignments subtree, so switching to
> > > > > >>> /admin/reassignments/mytopic/
> > > > > >>> > 42
> > > > > >>> > would work, and avoid /admin/reassignments having a very
> large
> > > > number
> > > > > >>> of
> > > > > >>> > child nodes. On the other hand it also means I have to create
> > and
> > > > > >>> delete
> > > > > >>> > the topic nodes (e.g. /admin/reassignments/mytopic), which
> > incurs
> > > > the
> > > > > >>> cost
> > > > > >>> > of extra round trips to zookeeper. I suppose that since
> > > > reassignment
> > > > > is
> > > > > >>> > generally a slow process it makes little difference if we
> > > increase
> > > > > the
> > > > > >>> > latency of the interactions with zookeeper.
> > > > > >>> >
> > > > > >>> > I have updated the KIP with these improvements, and a more
> > > detailed
> > > > > >>> > description of exactly how we would manage these znodes.
> > > > > >>> >
> > > > > >>> > Reading the algorithm in KafkaController.
> > > > onPartitionReassignment(),
> > > > > it
> > > > > >>> > seems that it would be suboptimal for changing reassignments
> > > > > in-flight.
> > > > > >>> > Consider an initial assignment of [1,2], reassigned to [2,3]
> > and
> > > > then
> > > > > >>> > changed to [2,4]. Broker 3 will remain in the assigned
> replicas
> > > > until
> > > > > >>> > broker 4 is in sync, even though 3 wasn't actually one of the
> > > > > original
> > > > > >>> > assigned replicas and is no longer a new assigned replica. I
> > > think
> > > > > this
> > > > > >>> > also affects the case where the reassignment is cancelled
> > > > > >>> > ([1,2]->[2,3]->[1,2]): We again have to wait for 3 to catch
> up,
> > > > even
> > > > > >>> though
> > > > > >>> > its replica will then be deleted.
> > > > > >>> >
> > > > > >>> > Should we seek to improve this algorithm in this KIP, or
> leave
> > > that
> > > > > as
> > > > > >>> a
> > > > > >>> > later optimisation?
> > > > > >>> >
> > > > > >>> > Cheers,
> > > > > >>> >
> > > > > >>> > Tom
> > > > > >>> >
> > > > > >>> > On 11 December 2017 at 21:31, Jun Rao <ju...@confluent.io>
> > wrote:
> > > > > >>> >
> > > > > >>> > > Another question is on the compatibility. Since now there
> > are 2
> > > > > ways
> > > > > >>> of
> > > > > >>> > > specifying a partition reassignment, one under
> > > > > >>> /admin/reassign_partitions
> > > > > >>> > > and the other under /admin/reassignments, we probably want
> to
> > > > > >>> prevent the
> > > > > >>> > > same topic being reassigned under both paths at the same
> > time?
> > > > > >>> > > Thanks,
> > > > > >>> > >
> > > > > >>> > > Jun
> > > > > >>> > >
> > > > > >>> > >
> > > > > >>> > >
> > > > > >>> > > On Fri, Dec 8, 2017 at 5:41 PM, Jun Rao <ju...@confluent.io>
> > > > wrote:
> > > > > >>> > >
> > > > > >>> > > > Hi, Tom,
> > > > > >>> > > >
> > > > > >>> > > > Thanks for the KIP. It definitely addresses one of the
> pain
> > > > > points
> > > > > >>> in
> > > > > >>> > > > partition reassignment. Another issue that it also
> > addresses
> > > is
> > > > > >>> the ZK
> > > > > >>> > > node
> > > > > >>> > > > size limit when writing the reassignment JSON.
> > > > > >>> > > >
> > > > > >>> > > > My only concern is that the KIP needs to create one
> watcher
> > > per
> > > > > >>> > > reassigned
> > > > > >>> > > > partition. This could add overhead in ZK and complexity
> for
> > > > > >>> debugging
> > > > > >>> > > when
> > > > > >>> > > > lots of partitions are being reassigned simultaneously.
> We
> > > > could
> > > > > >>> > > > potentially improve this by introducing a separate ZK
> path
> > > for
> > > > > >>> change
> > > > > >>> > > > notification as we do for configs. For example, every
> time
> > we
> > > > > >>> change
> > > > > >>> > the
> > > > > >>> > > > assignment for a set of partitions, we could further
> write
> > a
> > > > > >>> sequential
> > > > > >>> > > > node /admin/reassignment_changes/[change_x]. That way,
> the
> > > > > >>> controller
> > > > > >>> > > > only needs to watch the change path. Once a change is
> > > > triggered,
> > > > > >>> the
> > > > > >>> > > > controller can read everything under
> /admin/reassignments/.
> > > > > >>> > > >
> > > > > >>> > > > Jun
> > > > > >>> > > >
> > > > > >>> > > >
> > > > > >>> > > > On Wed, Dec 6, 2017 at 1:19 PM, Tom Bentley <
> > > > > t.j.bentley@gmail.com
> > > > > >>> >
> > > > > >>> > > wrote:
> > > > > >>> > > >
> > > > > >>> > > >> Hi,
> > > > > >>> > > >>
> > > > > >>> > > >> This is still very new, but I wanted some quick feedback
> > on
> > > a
> > > > > >>> > > preliminary
> > > > > >>> > > >> KIP which could, I think, help with providing an
> > AdminClient
> > > > API
> > > > > >>> for
> > > > > >>> > > >> partition reassignment.
> > > > > >>> > > >>
> > > > > >>> > > >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > 236%
> > > > > >>> > > >> 3A+Interruptible+Partition+Reassignment
> > > > > >>> > > >>
> > > > > >>> > > >> I wasn't sure whether to start fleshing out a whole
> > > > AdminClient
> > > > > >>> API in
> > > > > >>> > > >> this
> > > > > >>> > > >> KIP (which would make it very big, and difficult to
> read),
> > > or
> > > > > >>> whether
> > > > > >>> > to
> > > > > >>> > > >> break it down into smaller KIPs (which makes it easier
> to
> > > read
> > > > > and
> > > > > >>> > > >> implement in pieces, but harder to get a high-level
> > picture
> > > of
> > > > > the
> > > > > >>> > > >> ultimate
> > > > > >>> > > >> destination). For now I've gone for a very small initial
> > > KIP,
> > > > > but
> > > > > >>> I'm
> > > > > >>> > > >> happy
> > > > > >>> > > >> to sketch the bigger picture here if people are
> > interested.
> > > > > >>> > > >>
> > > > > >>> > > >> Cheers,
> > > > > >>> > > >>
> > > > > >>> > > >> Tom
> > > > > >>> > > >>
> > > > > >>> > > >
> > > > > >>> > > >
> > > > > >>> > >
> > > > > >>> >
> > > > > >>> >
> > > > > >>> > On 11 December 2017 at 21:31, Jun Rao <ju...@confluent.io>
> > wrote:
> > > > > >>> >
> > > > > >>> > > Another question is on the compatibility. Since now there
> > are 2
> > > > > ways
> > > > > >>> of
> > > > > >>> > > specifying a partition reassignment, one under
> > > > > >>> /admin/reassign_partitions
> > > > > >>> > > and the other under /admin/reassignments, we probably want
> to
> > > > > >>> prevent the
> > > > > >>> > > same topic being reassigned under both paths at the same
> > time?
> > > > > >>> > > Thanks,
> > > > > >>> > >
> > > > > >>> > > Jun
> > > > > >>> > >
> > > > > >>> > >
> > > > > >>> > >
> > > > > >>> > > On Fri, Dec 8, 2017 at 5:41 PM, Jun Rao <ju...@confluent.io>
> > > > wrote:
> > > > > >>> > >
> > > > > >>> > > > Hi, Tom,
> > > > > >>> > > >
> > > > > >>> > > > Thanks for the KIP. It definitely addresses one of the
> pain
> > > > > points
> > > > > >>> in
> > > > > >>> > > > partition reassignment. Another issue that it also
> > addresses
> > > is
> > > > > >>> the ZK
> > > > > >>> > > node
> > > > > >>> > > > size limit when writing the reassignment JSON.
> > > > > >>> > > >
> > > > > >>> > > > My only concern is that the KIP needs to create one
> watcher
> > > per
> > > > > >>> > > reassigned
> > > > > >>> > > > partition. This could add overhead in ZK and complexity
> for
> > > > > >>> debugging
> > > > > >>> > > when
> > > > > >>> > > > lots of partitions are being reassigned simultaneously.
> We
> > > > could
> > > > > >>> > > > potentially improve this by introducing a separate ZK
> path
> > > for
> > > > > >>> change
> > > > > >>> > > > notification as we do for configs. For example, every
> time
> > we
> > > > > >>> change
> > > > > >>> > the
> > > > > >>> > > > assignment for a set of partitions, we could further
> write
> > a
> > > > > >>> sequential
> > > > > >>> > > > node /admin/reassignment_changes/[change_x]. That way,
> the
> > > > > >>> controller
> > > > > >>> > > > only needs to watch the change path. Once a change is
> > > > triggered,
> > > > > >>> the
> > > > > >>> > > > controller can read everything under
> /admin/reassignments/.
> > > > > >>> > > >
> > > > > >>> > > > Jun
> > > > > >>> > > >
> > > > > >>> > > >
> > > > > >>> > > > On Wed, Dec 6, 2017 at 1:19 PM, Tom Bentley <
> > > > > t.j.bentley@gmail.com
> > > > > >>> >
> > > > > >>> > > wrote:
> > > > > >>> > > >
> > > > > >>> > > >> Hi,
> > > > > >>> > > >>
> > > > > >>> > > >> This is still very new, but I wanted some quick feedback
> > on
> > > a
> > > > > >>> > > preliminary
> > > > > >>> > > >> KIP which could, I think, help with providing an
> > AdminClient
> > > > API
> > > > > >>> for
> > > > > >>> > > >> partition reassignment.
> > > > > >>> > > >>
> > > > > >>> > > >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > 236%
> > > > > >>> > > >> 3A+Interruptible+Partition+Reassignment
> > > > > >>> > > >>
> > > > > >>> > > >> I wasn't sure whether to start fleshing out a whole
> > > > AdminClient
> > > > > >>> API in
> > > > > >>> > > >> this
> > > > > >>> > > >> KIP (which would make it very big, and difficult to
> read),
> > > or
> > > > > >>> whether
> > > > > >>> > to
> > > > > >>> > > >> break it down into smaller KIPs (which makes it easier
> to
> > > read
> > > > > and
> > > > > >>> > > >> implement in pieces, but harder to get a high-level
> > picture
> > > of
> > > > > the
> > > > > >>> > > >> ultimate
> > > > > >>> > > >> destination). For now I've gone for a very small initial
> > > KIP,
> > > > > but
> > > > > >>> I'm
> > > > > >>> > > >> happy
> > > > > >>> > > >> to sketch the bigger picture here if people are
> > interested.
> > > > > >>> > > >>
> > > > > >>> > > >> Cheers,
> > > > > >>> > > >>
> > > > > >>> > > >> Tom
> > > > > >>> > > >>
> > > > > >>> > > >
> > > > > >>> > > >
> > > > > >>> > >
> > > > > >>> >
> > > > > >>>
> > > > > >>
> > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-236 Interruptible Partition Reassignment

Posted by Tom Bentley <t....@gmail.com>.
Hi Jun,

10. Another concern of mine is on consistency with the current pattern. The
> current pattern for change notification based on ZK is (1) we first write
> the actual value in the entity path and then write the change notification
> path, and (2)  the change notification path only includes what entity has
> changed but not the actual changes. If we want to follow this pattern for
> consistency, /admin/reassignment_requests/request_xxx will only have the
> partitions whose reassignment have changed, but not the actual
> reassignment.
>

Ah, I hadn't understood part (2). That means my concern about efficiency
with the current pattern is misplaced. There are still some interesting
differences in semantics, however:

a) The mechanism currently proposed in KIP-236 means that the controller is
the only writer to /admin/reassignments. This means it can include
information in these znodes that requesters might not know, or information
that's necessary to perform the reassignment but not necessary to describe
the request. While this could be handled using the current pattern it would
rely on all  writers to preserve any information added by the controller,
which seems complicated and hence fragile.

b) The current pattern for change notification doesn't cope with competing
writers to the entity path: If two processes write to the entity path
before the controller can read it (due to notification) then one set of
updates will be lost.

c) If a single writing process crashes after writing to the entity path,
but before writing to the notification path then the write will be lost.

I'm actually using point a) in my WIP (see below). Points b) and c) are
obviously edge cases.


> 11. Ok. I am not sure that I fully understand the description of that part.
> Does "assigned" refer to the current assignment? Could you also describe
> where the length of the original assignment is stored in ZK?
>

Sorry if the description is not clear. Yes, "assigned" referrs to the
currently assigned replicas (taken from the
ControllerContext.partitionReplicaAssignment). I would store the length of
the original assignment in the /admin/reassignments/$topic/$partition znode
(this is where the point (a) above is useful -- the requester shouldn't
know that this information is used by the controller).

I've updated the KIP to make these points clearer.


> 13. Hmm, I am not sure that the cancellation needs to be done for the whole
> batch. The reason that I brought this up is for consistency. The KIP allows
> override when using the new approach. It just seems that it's simpler to
> extend this model when resolving multiple changes between the old and the
> new approach.


Ah, I think I've been unclear on this point too. Currently the
ReassignPartitionsCommand enforces that you can't change reassignments, but
this doesn't stop other ZK clients making changes to
/admin/reassign_partitions directly and I believe some Kafka users do
indeed change reassignments in-flight by writing to
/admin/reassign_partitions. What I'm proposing doesn't break that at all.
The semantic I've implemented is only that the controller only refuses a
reassignment change if there is already one in-flight (i.e. in
/admin/reassignments/$topic/$partition) **via the other mechansim**. So if
you're using /admin/reassign_partitions and you change or cancel part of it
via /admin/reassign_partitions, that's OK. Likewise if you're using
/admin/reassignment_request/request_xxx and you change or cancel part of it
via another /admin/reassignment_request/request_xxx, that's OK.What you
can't do is change a request that was started via
/admin/reassign_partitions via /admin/reassignment_request/request_xxx, or
vice versa.

What I was thinking of when I replied is the case where, on controller
failover, /admin/reassign_partitions has been changed and
/admin/reassignment_request/request_xxx created (in the period when the new
controller was being elected, for example) with a common partition. In this
case we should apply a consistent rule to that used when the notification
happen in real time. Your suggestion to use the modification time of the
znode would work here too (except in the edge case where ZK writes to both
znodes happen within the same clock tick on the ZK server, so the mtimes
are the same).

Let me know if you think this is the right semantic and I'll try to clarify
the KIP.

Many thanks,

Tom

On 18 December 2017 at 18:12, Jun Rao <ju...@confluent.io> wrote:

> Hi, Tom,
>
> Thanks for the reply. A few more followup comments below.
>
> 10. Another concern of mine is on consistency with the current pattern. The
> current pattern for change notification based on ZK is (1) we first write
> the actual value in the entity path and then write the change notification
> path, and (2)  the change notification path only includes what entity has
> changed but not the actual changes. If we want to follow this pattern for
> consistency, /admin/reassignment_requests/request_xxx will only have the
> partitions whose reassignment have changed, but not the actual
> reassignment.
>
> 11. Ok. I am not sure that I fully understand the description of that part.
> Does "assigned" refer to the current assignment? Could you also describe
> where the length of the original assignment is stored in ZK?
>
> 13. Hmm, I am not sure that the cancellation needs to be done for the whole
> batch. The reason that I brought this up is for consistency. The KIP allows
> override when using the new approach. It just seems that it's simpler to
> extend this model when resolving multiple changes between the old and the
> new approach.
>
> Jun
>
>
>
> On Mon, Dec 18, 2017 at 2:45 AM, Tom Bentley <t....@gmail.com>
> wrote:
>
> > Hi Jun,
> >
> > Thanks for replying, some answers below:
> >
> >
> > > 10. The proposal now stores the reassignment for all partitions in
> > > /admin/reassignment_requests/request_xxx. If the number of reassigned
> > > partitions is larger, the ZK write may hit the default 1MB limit and
> > fail.
> > > An alternative approach is to have the reassignment requester first
> write
> > > the new assignment for each partition under
> > > /admin/reassignments/$topic/$partition and then write
> > > /admin/reassignment_requests/request_xxx with an empty value. The
> > > controller can then read all values under /admin/reassignments.
> > >
> >
> > You're right that reassigning enough partitions would hit the 1MB limit,
> > but I don't think this would be a problem in practice because it would be
> > trivial to split the partitions into several requests (i.e. mutleiple
> > request_xxx).
> > I don't think the non-atomicity this would imply is a problem. By writing
> > the partitions whose /admin/reassignments/$topic/$partition has been
> > created or changed it makes it much more efficient to know which of those
> > znodes we need to read. If I understand your suggestion, you would have
> to
> > read every node under /admin/reassignments to figure out which had
> changed.
> >
> >
> > > 11. The improvement you suggested in onPartitionReassignment() sounds
> > good
> > > at the high level. The computation of those dropped partitions seems a
> > bit
> > > complicated. Perhaps a simple approach is to drop the replicas not in
> the
> > > original assignment and newest reassignment?
> > >
> >
> > This was what I came up with originally too, but when I looked into
> > implementing it I found a couple of things which made me reconsider it.
> > Consider the reassignments [0,1] -> [2,3] -> [3,4]. In words: we start
> > reassigning to [2,3], but then change our minds about 2 and switch it to
> 4
> > (maybe we've figured out a better overall balance). At that point it is
> > perfectly possible that broker 2 is in-sync and broker 1 is not in-sync.
> It
> > seems silly to drop broker 2 in favour of broker 1: We're needlessly
> giving
> > the cluster more work to do.
> >
> > The second thing that made me reconsider was in that same scenario it's
> > even possible that broker 2 is the leader of the partition. Obviously we
> > can elect a new leader before dropping it, but not without causing
> > disruption to producers and consumers.
> >
> > By accepting a little more complexity in choosing which brokers to drop
> we
> > make the dropping simpler (no need for leader election) and ensure the
> > cluster has less work to do.
> >
> >
> > > 12. You brought up the need of remembering the original assignment.
> This
> > > will be lost if the assignment is changed multiple times if we follow
> the
> > > approach described in 10. One way is to store the original assignment
> in
> > > /brokers/topics/[topic] as the following. When the final reassignment
> > > completes, we can remove the original field.
> > > {
> > >   "version": 1,
> > >   "partitions": {"0": [0, 1, 3] },
> > >   "originals": {"0": [0, 1, 2] }
> > > }
> > >
> >
> > While I was implementing my first version of onPartitionReassignment(),
> > where I preferred the originals, I was storing the originals in the
> > /admin/reassignments/$topic/$partition znodes. Since we will remove that
> > znode at the end of reassignment anyway, I would suggest this is a better
> > place to store that data (if it's necessary to do so), so that we can
> avoid
> > another ZK round trip.
> >
> >
> > > 13. For resolving the conflict between /admin/reassign_partitions and
> > > /admin/reassignments/$topic/$partition, perhaps it's more natural to
> > just
> > > let the assignment with a newer timestamp to override the older one?
> > >
> >
> > That would work but with slightly different semantics to what I have:
> Since
> > /admin/reassign_partitions contains multiple partitions, using the
> > timestamp means the whole batch wins or losses. By tracking how each
> > request was made we can be more fine-grained. I'm to use the modification
> > time if such granularity is not required.
> >
> >
> > > 14. Implementation wise, currently, we register a watcher of the isr
> path
> > > of each partition being reassigned. This has the potential issue of
> > > registering many listeners. An improvement could be just piggybacking
> on
> > > the existing IsrChangeNotificationHandler, which only watches a single
> ZK
> > > path and is triggered on a batch of isr changes. This is kind of
> > orthogonal
> > > to the KIP. However, if we are touching the reassignment logic, it may
> be
> > > worth considering.
> >
> >
> > Let me look into that.
> >
> > Thanks,
> >
> > Tom
> >
> > On 16 December 2017 at 02:19, Jun Rao <ju...@confluent.io> wrote:
> >
> > > Hi, Tom,
> > >
> > > Thanks for the updated KIP. A few more comments below.
> > >
> > > 10. The proposal now stores the reassignment for all partitions in
> > > /admin/reassignment_requests/request_xxx. If the number of reassigned
> > > partitions is larger, the ZK write may hit the default 1MB limit and
> > fail.
> > > An alternative approach is to have the reassignment requester first
> write
> > > the new assignment for each partition under
> > > /admin/reassignments/$topic/$partition and then write
> > > /admin/reassignment_requests/request_xxx with an empty value. The
> > > controller can then read all values under /admin/reassignments.
> > >
> > > 11. The improvement you suggested in onPartitionReassignment() sounds
> > good
> > > at the high level. The computation of those dropped partitions seems a
> > bit
> > > complicated. Perhaps a simple approach is to drop the replicas not in
> the
> > > original assignment and newest reassignment?
> > >
> > > 12. You brought up the need of remembering the original assignment.
> This
> > > will be lost if the assignment is changed multiple times if we follow
> the
> > > approach described in 10. One way is to store the original assignment
> in
> > > /brokers/topics/[topic] as the following. When the final reassignment
> > > completes, we can remove the original field.
> > > {
> > >   "version": 1,
> > >   "partitions": {"0": [0, 1, 3] },
> > >   "originals": {"0": [0, 1, 2] }
> > > }
> > >
> > > 13. For resolving the conflict between /admin/reassign_partitions and
> > > /admin/reassignments/$topic/$partition, perhaps it's more natural to
> > just
> > > let the assignment with a newer timestamp to override the older one?
> > >
> > > 14. Implementation wise, currently, we register a watcher of the isr
> path
> > > of each partition being reassigned. This has the potential issue of
> > > registering many listeners. An improvement could be just piggybacking
> on
> > > the existing IsrChangeNotificationHandler, which only watches a single
> ZK
> > > path and is triggered on a batch of isr changes. This is kind of
> > orthogonal
> > > to the KIP. However, if we are touching the reassignment logic, it may
> be
> > > worth considering.
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Fri, Dec 15, 2017 at 10:17 AM, Tom Bentley <t....@gmail.com>
> > > wrote:
> > >
> > > > Just wanted to mention that I've started KIP-240, which builds on top
> > of
> > > > this one to provide an AdminClient API for listing and describing
> > > > reassignments.
> > > >
> > > > On 15 December 2017 at 14:34, Tom Bentley <t....@gmail.com>
> > wrote:
> > > >
> > > > > > Should we seek to improve this algorithm in this KIP, or leave
> that
> > > as
> > > > > a later optimisation?
> > > > >
> > > > > I've updated the KIP with a proposed algorithm.
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On 14 December 2017 at 09:57, Tom Bentley <t....@gmail.com>
> > > wrote:
> > > > >
> > > > >> Thanks Ted, now fixed.
> > > > >>
> > > > >> On 13 December 2017 at 18:38, Ted Yu <yu...@gmail.com> wrote:
> > > > >>
> > > > >>> Tom:
> > > > >>> bq. create a znode /admin/reassignments/$topic-$partition
> > > > >>>
> > > > >>> Looks like the tree structure above should be:
> > > > >>>
> > > > >>> /admin/reassignments/$topic/$partition
> > > > >>>
> > > > >>> bq. The controller removes /admin/reassignment/$topic/$partition
> > > > >>>
> > > > >>> Note the lack of 's' for reassignment. It would be good to make
> > > > zookeeper
> > > > >>> paths consistent.
> > > > >>>
> > > > >>> Thanks
> > > > >>>
> > > > >>> On Wed, Dec 13, 2017 at 9:49 AM, Tom Bentley <
> > t.j.bentley@gmail.com>
> > > > >>> wrote:
> > > > >>>
> > > > >>> > Hi Jun and Ted,
> > > > >>> >
> > > > >>> > Jun, you're right that needing one watcher per reassigned
> > partition
> > > > >>> > presents a scalability problem, and using a separate
> notification
> > > > path
> > > > >>> > solves that. I also agree that it makes sense to prevent users
> > from
> > > > >>> using
> > > > >>> > both methods on the same reassignment.
> > > > >>> >
> > > > >>> > Ted, naming the reassignments like mytopic-42 was simpler
> while I
> > > was
> > > > >>> > proposing a watcher-per-reassignment (I'd have needed a child
> > > watcher
> > > > >>> on
> > > > >>> > /admin/reassignments and also on /admin/reassignments/mytopic).
> > > Using
> > > > >>> the
> > > > >>> > separate notification path means I don't need any watchers in
> the
> > > > >>> > /admin/reassignments subtree, so switching to
> > > > >>> /admin/reassignments/mytopic/
> > > > >>> > 42
> > > > >>> > would work, and avoid /admin/reassignments having a very large
> > > number
> > > > >>> of
> > > > >>> > child nodes. On the other hand it also means I have to create
> and
> > > > >>> delete
> > > > >>> > the topic nodes (e.g. /admin/reassignments/mytopic), which
> incurs
> > > the
> > > > >>> cost
> > > > >>> > of extra round trips to zookeeper. I suppose that since
> > > reassignment
> > > > is
> > > > >>> > generally a slow process it makes little difference if we
> > increase
> > > > the
> > > > >>> > latency of the interactions with zookeeper.
> > > > >>> >
> > > > >>> > I have updated the KIP with these improvements, and a more
> > detailed
> > > > >>> > description of exactly how we would manage these znodes.
> > > > >>> >
> > > > >>> > Reading the algorithm in KafkaController.
> > > onPartitionReassignment(),
> > > > it
> > > > >>> > seems that it would be suboptimal for changing reassignments
> > > > in-flight.
> > > > >>> > Consider an initial assignment of [1,2], reassigned to [2,3]
> and
> > > then
> > > > >>> > changed to [2,4]. Broker 3 will remain in the assigned replicas
> > > until
> > > > >>> > broker 4 is in sync, even though 3 wasn't actually one of the
> > > > original
> > > > >>> > assigned replicas and is no longer a new assigned replica. I
> > think
> > > > this
> > > > >>> > also affects the case where the reassignment is cancelled
> > > > >>> > ([1,2]->[2,3]->[1,2]): We again have to wait for 3 to catch up,
> > > even
> > > > >>> though
> > > > >>> > its replica will then be deleted.
> > > > >>> >
> > > > >>> > Should we seek to improve this algorithm in this KIP, or leave
> > that
> > > > as
> > > > >>> a
> > > > >>> > later optimisation?
> > > > >>> >
> > > > >>> > Cheers,
> > > > >>> >
> > > > >>> > Tom
> > > > >>> >
> > > > >>> > On 11 December 2017 at 21:31, Jun Rao <ju...@confluent.io>
> wrote:
> > > > >>> >
> > > > >>> > > Another question is on the compatibility. Since now there
> are 2
> > > > ways
> > > > >>> of
> > > > >>> > > specifying a partition reassignment, one under
> > > > >>> /admin/reassign_partitions
> > > > >>> > > and the other under /admin/reassignments, we probably want to
> > > > >>> prevent the
> > > > >>> > > same topic being reassigned under both paths at the same
> time?
> > > > >>> > > Thanks,
> > > > >>> > >
> > > > >>> > > Jun
> > > > >>> > >
> > > > >>> > >
> > > > >>> > >
> > > > >>> > > On Fri, Dec 8, 2017 at 5:41 PM, Jun Rao <ju...@confluent.io>
> > > wrote:
> > > > >>> > >
> > > > >>> > > > Hi, Tom,
> > > > >>> > > >
> > > > >>> > > > Thanks for the KIP. It definitely addresses one of the pain
> > > > points
> > > > >>> in
> > > > >>> > > > partition reassignment. Another issue that it also
> addresses
> > is
> > > > >>> the ZK
> > > > >>> > > node
> > > > >>> > > > size limit when writing the reassignment JSON.
> > > > >>> > > >
> > > > >>> > > > My only concern is that the KIP needs to create one watcher
> > per
> > > > >>> > > reassigned
> > > > >>> > > > partition. This could add overhead in ZK and complexity for
> > > > >>> debugging
> > > > >>> > > when
> > > > >>> > > > lots of partitions are being reassigned simultaneously. We
> > > could
> > > > >>> > > > potentially improve this by introducing a separate ZK path
> > for
> > > > >>> change
> > > > >>> > > > notification as we do for configs. For example, every time
> we
> > > > >>> change
> > > > >>> > the
> > > > >>> > > > assignment for a set of partitions, we could further write
> a
> > > > >>> sequential
> > > > >>> > > > node /admin/reassignment_changes/[change_x]. That way, the
> > > > >>> controller
> > > > >>> > > > only needs to watch the change path. Once a change is
> > > triggered,
> > > > >>> the
> > > > >>> > > > controller can read everything under /admin/reassignments/.
> > > > >>> > > >
> > > > >>> > > > Jun
> > > > >>> > > >
> > > > >>> > > >
> > > > >>> > > > On Wed, Dec 6, 2017 at 1:19 PM, Tom Bentley <
> > > > t.j.bentley@gmail.com
> > > > >>> >
> > > > >>> > > wrote:
> > > > >>> > > >
> > > > >>> > > >> Hi,
> > > > >>> > > >>
> > > > >>> > > >> This is still very new, but I wanted some quick feedback
> on
> > a
> > > > >>> > > preliminary
> > > > >>> > > >> KIP which could, I think, help with providing an
> AdminClient
> > > API
> > > > >>> for
> > > > >>> > > >> partition reassignment.
> > > > >>> > > >>
> > > > >>> > > >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> 236%
> > > > >>> > > >> 3A+Interruptible+Partition+Reassignment
> > > > >>> > > >>
> > > > >>> > > >> I wasn't sure whether to start fleshing out a whole
> > > AdminClient
> > > > >>> API in
> > > > >>> > > >> this
> > > > >>> > > >> KIP (which would make it very big, and difficult to read),
> > or
> > > > >>> whether
> > > > >>> > to
> > > > >>> > > >> break it down into smaller KIPs (which makes it easier to
> > read
> > > > and
> > > > >>> > > >> implement in pieces, but harder to get a high-level
> picture
> > of
> > > > the
> > > > >>> > > >> ultimate
> > > > >>> > > >> destination). For now I've gone for a very small initial
> > KIP,
> > > > but
> > > > >>> I'm
> > > > >>> > > >> happy
> > > > >>> > > >> to sketch the bigger picture here if people are
> interested.
> > > > >>> > > >>
> > > > >>> > > >> Cheers,
> > > > >>> > > >>
> > > > >>> > > >> Tom
> > > > >>> > > >>
> > > > >>> > > >
> > > > >>> > > >
> > > > >>> > >
> > > > >>> >
> > > > >>> >
> > > > >>> > On 11 December 2017 at 21:31, Jun Rao <ju...@confluent.io>
> wrote:
> > > > >>> >
> > > > >>> > > Another question is on the compatibility. Since now there
> are 2
> > > > ways
> > > > >>> of
> > > > >>> > > specifying a partition reassignment, one under
> > > > >>> /admin/reassign_partitions
> > > > >>> > > and the other under /admin/reassignments, we probably want to
> > > > >>> prevent the
> > > > >>> > > same topic being reassigned under both paths at the same
> time?
> > > > >>> > > Thanks,
> > > > >>> > >
> > > > >>> > > Jun
> > > > >>> > >
> > > > >>> > >
> > > > >>> > >
> > > > >>> > > On Fri, Dec 8, 2017 at 5:41 PM, Jun Rao <ju...@confluent.io>
> > > wrote:
> > > > >>> > >
> > > > >>> > > > Hi, Tom,
> > > > >>> > > >
> > > > >>> > > > Thanks for the KIP. It definitely addresses one of the pain
> > > > points
> > > > >>> in
> > > > >>> > > > partition reassignment. Another issue that it also
> addresses
> > is
> > > > >>> the ZK
> > > > >>> > > node
> > > > >>> > > > size limit when writing the reassignment JSON.
> > > > >>> > > >
> > > > >>> > > > My only concern is that the KIP needs to create one watcher
> > per
> > > > >>> > > reassigned
> > > > >>> > > > partition. This could add overhead in ZK and complexity for
> > > > >>> debugging
> > > > >>> > > when
> > > > >>> > > > lots of partitions are being reassigned simultaneously. We
> > > could
> > > > >>> > > > potentially improve this by introducing a separate ZK path
> > for
> > > > >>> change
> > > > >>> > > > notification as we do for configs. For example, every time
> we
> > > > >>> change
> > > > >>> > the
> > > > >>> > > > assignment for a set of partitions, we could further write
> a
> > > > >>> sequential
> > > > >>> > > > node /admin/reassignment_changes/[change_x]. That way, the
> > > > >>> controller
> > > > >>> > > > only needs to watch the change path. Once a change is
> > > triggered,
> > > > >>> the
> > > > >>> > > > controller can read everything under /admin/reassignments/.
> > > > >>> > > >
> > > > >>> > > > Jun
> > > > >>> > > >
> > > > >>> > > >
> > > > >>> > > > On Wed, Dec 6, 2017 at 1:19 PM, Tom Bentley <
> > > > t.j.bentley@gmail.com
> > > > >>> >
> > > > >>> > > wrote:
> > > > >>> > > >
> > > > >>> > > >> Hi,
> > > > >>> > > >>
> > > > >>> > > >> This is still very new, but I wanted some quick feedback
> on
> > a
> > > > >>> > > preliminary
> > > > >>> > > >> KIP which could, I think, help with providing an
> AdminClient
> > > API
> > > > >>> for
> > > > >>> > > >> partition reassignment.
> > > > >>> > > >>
> > > > >>> > > >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> 236%
> > > > >>> > > >> 3A+Interruptible+Partition+Reassignment
> > > > >>> > > >>
> > > > >>> > > >> I wasn't sure whether to start fleshing out a whole
> > > AdminClient
> > > > >>> API in
> > > > >>> > > >> this
> > > > >>> > > >> KIP (which would make it very big, and difficult to read),
> > or
> > > > >>> whether
> > > > >>> > to
> > > > >>> > > >> break it down into smaller KIPs (which makes it easier to
> > read
> > > > and
> > > > >>> > > >> implement in pieces, but harder to get a high-level
> picture
> > of
> > > > the
> > > > >>> > > >> ultimate
> > > > >>> > > >> destination). For now I've gone for a very small initial
> > KIP,
> > > > but
> > > > >>> I'm
> > > > >>> > > >> happy
> > > > >>> > > >> to sketch the bigger picture here if people are
> interested.
> > > > >>> > > >>
> > > > >>> > > >> Cheers,
> > > > >>> > > >>
> > > > >>> > > >> Tom
> > > > >>> > > >>
> > > > >>> > > >
> > > > >>> > > >
> > > > >>> > >
> > > > >>> >
> > > > >>>
> > > > >>
> > > > >>
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-236 Interruptible Partition Reassignment

Posted by Jun Rao <ju...@confluent.io>.
Hi, Tom,

Thanks for the reply. A few more followup comments below.

10. Another concern of mine is on consistency with the current pattern. The
current pattern for change notification based on ZK is (1) we first write
the actual value in the entity path and then write the change notification
path, and (2)  the change notification path only includes what entity has
changed but not the actual changes. If we want to follow this pattern for
consistency, /admin/reassignment_requests/request_xxx will only have the
partitions whose reassignment have changed, but not the actual reassignment.

11. Ok. I am not sure that I fully understand the description of that part.
Does "assigned" refer to the current assignment? Could you also describe
where the length of the original assignment is stored in ZK?

13. Hmm, I am not sure that the cancellation needs to be done for the whole
batch. The reason that I brought this up is for consistency. The KIP allows
override when using the new approach. It just seems that it's simpler to
extend this model when resolving multiple changes between the old and the
new approach.

Jun



On Mon, Dec 18, 2017 at 2:45 AM, Tom Bentley <t....@gmail.com> wrote:

> Hi Jun,
>
> Thanks for replying, some answers below:
>
>
> > 10. The proposal now stores the reassignment for all partitions in
> > /admin/reassignment_requests/request_xxx. If the number of reassigned
> > partitions is larger, the ZK write may hit the default 1MB limit and
> fail.
> > An alternative approach is to have the reassignment requester first write
> > the new assignment for each partition under
> > /admin/reassignments/$topic/$partition and then write
> > /admin/reassignment_requests/request_xxx with an empty value. The
> > controller can then read all values under /admin/reassignments.
> >
>
> You're right that reassigning enough partitions would hit the 1MB limit,
> but I don't think this would be a problem in practice because it would be
> trivial to split the partitions into several requests (i.e. mutleiple
> request_xxx).
> I don't think the non-atomicity this would imply is a problem. By writing
> the partitions whose /admin/reassignments/$topic/$partition has been
> created or changed it makes it much more efficient to know which of those
> znodes we need to read. If I understand your suggestion, you would have to
> read every node under /admin/reassignments to figure out which had changed.
>
>
> > 11. The improvement you suggested in onPartitionReassignment() sounds
> good
> > at the high level. The computation of those dropped partitions seems a
> bit
> > complicated. Perhaps a simple approach is to drop the replicas not in the
> > original assignment and newest reassignment?
> >
>
> This was what I came up with originally too, but when I looked into
> implementing it I found a couple of things which made me reconsider it.
> Consider the reassignments [0,1] -> [2,3] -> [3,4]. In words: we start
> reassigning to [2,3], but then change our minds about 2 and switch it to 4
> (maybe we've figured out a better overall balance). At that point it is
> perfectly possible that broker 2 is in-sync and broker 1 is not in-sync. It
> seems silly to drop broker 2 in favour of broker 1: We're needlessly giving
> the cluster more work to do.
>
> The second thing that made me reconsider was in that same scenario it's
> even possible that broker 2 is the leader of the partition. Obviously we
> can elect a new leader before dropping it, but not without causing
> disruption to producers and consumers.
>
> By accepting a little more complexity in choosing which brokers to drop we
> make the dropping simpler (no need for leader election) and ensure the
> cluster has less work to do.
>
>
> > 12. You brought up the need of remembering the original assignment. This
> > will be lost if the assignment is changed multiple times if we follow the
> > approach described in 10. One way is to store the original assignment in
> > /brokers/topics/[topic] as the following. When the final reassignment
> > completes, we can remove the original field.
> > {
> >   "version": 1,
> >   "partitions": {"0": [0, 1, 3] },
> >   "originals": {"0": [0, 1, 2] }
> > }
> >
>
> While I was implementing my first version of onPartitionReassignment(),
> where I preferred the originals, I was storing the originals in the
> /admin/reassignments/$topic/$partition znodes. Since we will remove that
> znode at the end of reassignment anyway, I would suggest this is a better
> place to store that data (if it's necessary to do so), so that we can avoid
> another ZK round trip.
>
>
> > 13. For resolving the conflict between /admin/reassign_partitions and
> > /admin/reassignments/$topic/$partition, perhaps it's more natural to
> just
> > let the assignment with a newer timestamp to override the older one?
> >
>
> That would work but with slightly different semantics to what I have: Since
> /admin/reassign_partitions contains multiple partitions, using the
> timestamp means the whole batch wins or losses. By tracking how each
> request was made we can be more fine-grained. I'm to use the modification
> time if such granularity is not required.
>
>
> > 14. Implementation wise, currently, we register a watcher of the isr path
> > of each partition being reassigned. This has the potential issue of
> > registering many listeners. An improvement could be just piggybacking on
> > the existing IsrChangeNotificationHandler, which only watches a single ZK
> > path and is triggered on a batch of isr changes. This is kind of
> orthogonal
> > to the KIP. However, if we are touching the reassignment logic, it may be
> > worth considering.
>
>
> Let me look into that.
>
> Thanks,
>
> Tom
>
> On 16 December 2017 at 02:19, Jun Rao <ju...@confluent.io> wrote:
>
> > Hi, Tom,
> >
> > Thanks for the updated KIP. A few more comments below.
> >
> > 10. The proposal now stores the reassignment for all partitions in
> > /admin/reassignment_requests/request_xxx. If the number of reassigned
> > partitions is larger, the ZK write may hit the default 1MB limit and
> fail.
> > An alternative approach is to have the reassignment requester first write
> > the new assignment for each partition under
> > /admin/reassignments/$topic/$partition and then write
> > /admin/reassignment_requests/request_xxx with an empty value. The
> > controller can then read all values under /admin/reassignments.
> >
> > 11. The improvement you suggested in onPartitionReassignment() sounds
> good
> > at the high level. The computation of those dropped partitions seems a
> bit
> > complicated. Perhaps a simple approach is to drop the replicas not in the
> > original assignment and newest reassignment?
> >
> > 12. You brought up the need of remembering the original assignment. This
> > will be lost if the assignment is changed multiple times if we follow the
> > approach described in 10. One way is to store the original assignment in
> > /brokers/topics/[topic] as the following. When the final reassignment
> > completes, we can remove the original field.
> > {
> >   "version": 1,
> >   "partitions": {"0": [0, 1, 3] },
> >   "originals": {"0": [0, 1, 2] }
> > }
> >
> > 13. For resolving the conflict between /admin/reassign_partitions and
> > /admin/reassignments/$topic/$partition, perhaps it's more natural to
> just
> > let the assignment with a newer timestamp to override the older one?
> >
> > 14. Implementation wise, currently, we register a watcher of the isr path
> > of each partition being reassigned. This has the potential issue of
> > registering many listeners. An improvement could be just piggybacking on
> > the existing IsrChangeNotificationHandler, which only watches a single ZK
> > path and is triggered on a batch of isr changes. This is kind of
> orthogonal
> > to the KIP. However, if we are touching the reassignment logic, it may be
> > worth considering.
> >
> > Thanks,
> >
> > Jun
> >
> > On Fri, Dec 15, 2017 at 10:17 AM, Tom Bentley <t....@gmail.com>
> > wrote:
> >
> > > Just wanted to mention that I've started KIP-240, which builds on top
> of
> > > this one to provide an AdminClient API for listing and describing
> > > reassignments.
> > >
> > > On 15 December 2017 at 14:34, Tom Bentley <t....@gmail.com>
> wrote:
> > >
> > > > > Should we seek to improve this algorithm in this KIP, or leave that
> > as
> > > > a later optimisation?
> > > >
> > > > I've updated the KIP with a proposed algorithm.
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > On 14 December 2017 at 09:57, Tom Bentley <t....@gmail.com>
> > wrote:
> > > >
> > > >> Thanks Ted, now fixed.
> > > >>
> > > >> On 13 December 2017 at 18:38, Ted Yu <yu...@gmail.com> wrote:
> > > >>
> > > >>> Tom:
> > > >>> bq. create a znode /admin/reassignments/$topic-$partition
> > > >>>
> > > >>> Looks like the tree structure above should be:
> > > >>>
> > > >>> /admin/reassignments/$topic/$partition
> > > >>>
> > > >>> bq. The controller removes /admin/reassignment/$topic/$partition
> > > >>>
> > > >>> Note the lack of 's' for reassignment. It would be good to make
> > > zookeeper
> > > >>> paths consistent.
> > > >>>
> > > >>> Thanks
> > > >>>
> > > >>> On Wed, Dec 13, 2017 at 9:49 AM, Tom Bentley <
> t.j.bentley@gmail.com>
> > > >>> wrote:
> > > >>>
> > > >>> > Hi Jun and Ted,
> > > >>> >
> > > >>> > Jun, you're right that needing one watcher per reassigned
> partition
> > > >>> > presents a scalability problem, and using a separate notification
> > > path
> > > >>> > solves that. I also agree that it makes sense to prevent users
> from
> > > >>> using
> > > >>> > both methods on the same reassignment.
> > > >>> >
> > > >>> > Ted, naming the reassignments like mytopic-42 was simpler while I
> > was
> > > >>> > proposing a watcher-per-reassignment (I'd have needed a child
> > watcher
> > > >>> on
> > > >>> > /admin/reassignments and also on /admin/reassignments/mytopic).
> > Using
> > > >>> the
> > > >>> > separate notification path means I don't need any watchers in the
> > > >>> > /admin/reassignments subtree, so switching to
> > > >>> /admin/reassignments/mytopic/
> > > >>> > 42
> > > >>> > would work, and avoid /admin/reassignments having a very large
> > number
> > > >>> of
> > > >>> > child nodes. On the other hand it also means I have to create and
> > > >>> delete
> > > >>> > the topic nodes (e.g. /admin/reassignments/mytopic), which incurs
> > the
> > > >>> cost
> > > >>> > of extra round trips to zookeeper. I suppose that since
> > reassignment
> > > is
> > > >>> > generally a slow process it makes little difference if we
> increase
> > > the
> > > >>> > latency of the interactions with zookeeper.
> > > >>> >
> > > >>> > I have updated the KIP with these improvements, and a more
> detailed
> > > >>> > description of exactly how we would manage these znodes.
> > > >>> >
> > > >>> > Reading the algorithm in KafkaController.
> > onPartitionReassignment(),
> > > it
> > > >>> > seems that it would be suboptimal for changing reassignments
> > > in-flight.
> > > >>> > Consider an initial assignment of [1,2], reassigned to [2,3] and
> > then
> > > >>> > changed to [2,4]. Broker 3 will remain in the assigned replicas
> > until
> > > >>> > broker 4 is in sync, even though 3 wasn't actually one of the
> > > original
> > > >>> > assigned replicas and is no longer a new assigned replica. I
> think
> > > this
> > > >>> > also affects the case where the reassignment is cancelled
> > > >>> > ([1,2]->[2,3]->[1,2]): We again have to wait for 3 to catch up,
> > even
> > > >>> though
> > > >>> > its replica will then be deleted.
> > > >>> >
> > > >>> > Should we seek to improve this algorithm in this KIP, or leave
> that
> > > as
> > > >>> a
> > > >>> > later optimisation?
> > > >>> >
> > > >>> > Cheers,
> > > >>> >
> > > >>> > Tom
> > > >>> >
> > > >>> > On 11 December 2017 at 21:31, Jun Rao <ju...@confluent.io> wrote:
> > > >>> >
> > > >>> > > Another question is on the compatibility. Since now there are 2
> > > ways
> > > >>> of
> > > >>> > > specifying a partition reassignment, one under
> > > >>> /admin/reassign_partitions
> > > >>> > > and the other under /admin/reassignments, we probably want to
> > > >>> prevent the
> > > >>> > > same topic being reassigned under both paths at the same time?
> > > >>> > > Thanks,
> > > >>> > >
> > > >>> > > Jun
> > > >>> > >
> > > >>> > >
> > > >>> > >
> > > >>> > > On Fri, Dec 8, 2017 at 5:41 PM, Jun Rao <ju...@confluent.io>
> > wrote:
> > > >>> > >
> > > >>> > > > Hi, Tom,
> > > >>> > > >
> > > >>> > > > Thanks for the KIP. It definitely addresses one of the pain
> > > points
> > > >>> in
> > > >>> > > > partition reassignment. Another issue that it also addresses
> is
> > > >>> the ZK
> > > >>> > > node
> > > >>> > > > size limit when writing the reassignment JSON.
> > > >>> > > >
> > > >>> > > > My only concern is that the KIP needs to create one watcher
> per
> > > >>> > > reassigned
> > > >>> > > > partition. This could add overhead in ZK and complexity for
> > > >>> debugging
> > > >>> > > when
> > > >>> > > > lots of partitions are being reassigned simultaneously. We
> > could
> > > >>> > > > potentially improve this by introducing a separate ZK path
> for
> > > >>> change
> > > >>> > > > notification as we do for configs. For example, every time we
> > > >>> change
> > > >>> > the
> > > >>> > > > assignment for a set of partitions, we could further write a
> > > >>> sequential
> > > >>> > > > node /admin/reassignment_changes/[change_x]. That way, the
> > > >>> controller
> > > >>> > > > only needs to watch the change path. Once a change is
> > triggered,
> > > >>> the
> > > >>> > > > controller can read everything under /admin/reassignments/.
> > > >>> > > >
> > > >>> > > > Jun
> > > >>> > > >
> > > >>> > > >
> > > >>> > > > On Wed, Dec 6, 2017 at 1:19 PM, Tom Bentley <
> > > t.j.bentley@gmail.com
> > > >>> >
> > > >>> > > wrote:
> > > >>> > > >
> > > >>> > > >> Hi,
> > > >>> > > >>
> > > >>> > > >> This is still very new, but I wanted some quick feedback on
> a
> > > >>> > > preliminary
> > > >>> > > >> KIP which could, I think, help with providing an AdminClient
> > API
> > > >>> for
> > > >>> > > >> partition reassignment.
> > > >>> > > >>
> > > >>> > > >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-236%
> > > >>> > > >> 3A+Interruptible+Partition+Reassignment
> > > >>> > > >>
> > > >>> > > >> I wasn't sure whether to start fleshing out a whole
> > AdminClient
> > > >>> API in
> > > >>> > > >> this
> > > >>> > > >> KIP (which would make it very big, and difficult to read),
> or
> > > >>> whether
> > > >>> > to
> > > >>> > > >> break it down into smaller KIPs (which makes it easier to
> read
> > > and
> > > >>> > > >> implement in pieces, but harder to get a high-level picture
> of
> > > the
> > > >>> > > >> ultimate
> > > >>> > > >> destination). For now I've gone for a very small initial
> KIP,
> > > but
> > > >>> I'm
> > > >>> > > >> happy
> > > >>> > > >> to sketch the bigger picture here if people are interested.
> > > >>> > > >>
> > > >>> > > >> Cheers,
> > > >>> > > >>
> > > >>> > > >> Tom
> > > >>> > > >>
> > > >>> > > >
> > > >>> > > >
> > > >>> > >
> > > >>> >
> > > >>> >
> > > >>> > On 11 December 2017 at 21:31, Jun Rao <ju...@confluent.io> wrote:
> > > >>> >
> > > >>> > > Another question is on the compatibility. Since now there are 2
> > > ways
> > > >>> of
> > > >>> > > specifying a partition reassignment, one under
> > > >>> /admin/reassign_partitions
> > > >>> > > and the other under /admin/reassignments, we probably want to
> > > >>> prevent the
> > > >>> > > same topic being reassigned under both paths at the same time?
> > > >>> > > Thanks,
> > > >>> > >
> > > >>> > > Jun
> > > >>> > >
> > > >>> > >
> > > >>> > >
> > > >>> > > On Fri, Dec 8, 2017 at 5:41 PM, Jun Rao <ju...@confluent.io>
> > wrote:
> > > >>> > >
> > > >>> > > > Hi, Tom,
> > > >>> > > >
> > > >>> > > > Thanks for the KIP. It definitely addresses one of the pain
> > > points
> > > >>> in
> > > >>> > > > partition reassignment. Another issue that it also addresses
> is
> > > >>> the ZK
> > > >>> > > node
> > > >>> > > > size limit when writing the reassignment JSON.
> > > >>> > > >
> > > >>> > > > My only concern is that the KIP needs to create one watcher
> per
> > > >>> > > reassigned
> > > >>> > > > partition. This could add overhead in ZK and complexity for
> > > >>> debugging
> > > >>> > > when
> > > >>> > > > lots of partitions are being reassigned simultaneously. We
> > could
> > > >>> > > > potentially improve this by introducing a separate ZK path
> for
> > > >>> change
> > > >>> > > > notification as we do for configs. For example, every time we
> > > >>> change
> > > >>> > the
> > > >>> > > > assignment for a set of partitions, we could further write a
> > > >>> sequential
> > > >>> > > > node /admin/reassignment_changes/[change_x]. That way, the
> > > >>> controller
> > > >>> > > > only needs to watch the change path. Once a change is
> > triggered,
> > > >>> the
> > > >>> > > > controller can read everything under /admin/reassignments/.
> > > >>> > > >
> > > >>> > > > Jun
> > > >>> > > >
> > > >>> > > >
> > > >>> > > > On Wed, Dec 6, 2017 at 1:19 PM, Tom Bentley <
> > > t.j.bentley@gmail.com
> > > >>> >
> > > >>> > > wrote:
> > > >>> > > >
> > > >>> > > >> Hi,
> > > >>> > > >>
> > > >>> > > >> This is still very new, but I wanted some quick feedback on
> a
> > > >>> > > preliminary
> > > >>> > > >> KIP which could, I think, help with providing an AdminClient
> > API
> > > >>> for
> > > >>> > > >> partition reassignment.
> > > >>> > > >>
> > > >>> > > >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-236%
> > > >>> > > >> 3A+Interruptible+Partition+Reassignment
> > > >>> > > >>
> > > >>> > > >> I wasn't sure whether to start fleshing out a whole
> > AdminClient
> > > >>> API in
> > > >>> > > >> this
> > > >>> > > >> KIP (which would make it very big, and difficult to read),
> or
> > > >>> whether
> > > >>> > to
> > > >>> > > >> break it down into smaller KIPs (which makes it easier to
> read
> > > and
> > > >>> > > >> implement in pieces, but harder to get a high-level picture
> of
> > > the
> > > >>> > > >> ultimate
> > > >>> > > >> destination). For now I've gone for a very small initial
> KIP,
> > > but
> > > >>> I'm
> > > >>> > > >> happy
> > > >>> > > >> to sketch the bigger picture here if people are interested.
> > > >>> > > >>
> > > >>> > > >> Cheers,
> > > >>> > > >>
> > > >>> > > >> Tom
> > > >>> > > >>
> > > >>> > > >
> > > >>> > > >
> > > >>> > >
> > > >>> >
> > > >>>
> > > >>
> > > >>
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-236 Interruptible Partition Reassignment

Posted by Tom Bentley <t....@gmail.com>.
Hi Jun,

Thanks for replying, some answers below:


> 10. The proposal now stores the reassignment for all partitions in
> /admin/reassignment_requests/request_xxx. If the number of reassigned
> partitions is larger, the ZK write may hit the default 1MB limit and fail.
> An alternative approach is to have the reassignment requester first write
> the new assignment for each partition under
> /admin/reassignments/$topic/$partition and then write
> /admin/reassignment_requests/request_xxx with an empty value. The
> controller can then read all values under /admin/reassignments.
>

You're right that reassigning enough partitions would hit the 1MB limit,
but I don't think this would be a problem in practice because it would be
trivial to split the partitions into several requests (i.e. mutleiple
request_xxx).
I don't think the non-atomicity this would imply is a problem. By writing
the partitions whose /admin/reassignments/$topic/$partition has been
created or changed it makes it much more efficient to know which of those
znodes we need to read. If I understand your suggestion, you would have to
read every node under /admin/reassignments to figure out which had changed.


> 11. The improvement you suggested in onPartitionReassignment() sounds good
> at the high level. The computation of those dropped partitions seems a bit
> complicated. Perhaps a simple approach is to drop the replicas not in the
> original assignment and newest reassignment?
>

This was what I came up with originally too, but when I looked into
implementing it I found a couple of things which made me reconsider it.
Consider the reassignments [0,1] -> [2,3] -> [3,4]. In words: we start
reassigning to [2,3], but then change our minds about 2 and switch it to 4
(maybe we've figured out a better overall balance). At that point it is
perfectly possible that broker 2 is in-sync and broker 1 is not in-sync. It
seems silly to drop broker 2 in favour of broker 1: We're needlessly giving
the cluster more work to do.

The second thing that made me reconsider was in that same scenario it's
even possible that broker 2 is the leader of the partition. Obviously we
can elect a new leader before dropping it, but not without causing
disruption to producers and consumers.

By accepting a little more complexity in choosing which brokers to drop we
make the dropping simpler (no need for leader election) and ensure the
cluster has less work to do.


> 12. You brought up the need of remembering the original assignment. This
> will be lost if the assignment is changed multiple times if we follow the
> approach described in 10. One way is to store the original assignment in
> /brokers/topics/[topic] as the following. When the final reassignment
> completes, we can remove the original field.
> {
>   "version": 1,
>   "partitions": {"0": [0, 1, 3] },
>   "originals": {"0": [0, 1, 2] }
> }
>

While I was implementing my first version of onPartitionReassignment(),
where I preferred the originals, I was storing the originals in the
/admin/reassignments/$topic/$partition znodes. Since we will remove that
znode at the end of reassignment anyway, I would suggest this is a better
place to store that data (if it's necessary to do so), so that we can avoid
another ZK round trip.


> 13. For resolving the conflict between /admin/reassign_partitions and
> /admin/reassignments/$topic/$partition, perhaps it's more natural to just
> let the assignment with a newer timestamp to override the older one?
>

That would work but with slightly different semantics to what I have: Since
/admin/reassign_partitions contains multiple partitions, using the
timestamp means the whole batch wins or losses. By tracking how each
request was made we can be more fine-grained. I'm to use the modification
time if such granularity is not required.


> 14. Implementation wise, currently, we register a watcher of the isr path
> of each partition being reassigned. This has the potential issue of
> registering many listeners. An improvement could be just piggybacking on
> the existing IsrChangeNotificationHandler, which only watches a single ZK
> path and is triggered on a batch of isr changes. This is kind of orthogonal
> to the KIP. However, if we are touching the reassignment logic, it may be
> worth considering.


Let me look into that.

Thanks,

Tom

On 16 December 2017 at 02:19, Jun Rao <ju...@confluent.io> wrote:

> Hi, Tom,
>
> Thanks for the updated KIP. A few more comments below.
>
> 10. The proposal now stores the reassignment for all partitions in
> /admin/reassignment_requests/request_xxx. If the number of reassigned
> partitions is larger, the ZK write may hit the default 1MB limit and fail.
> An alternative approach is to have the reassignment requester first write
> the new assignment for each partition under
> /admin/reassignments/$topic/$partition and then write
> /admin/reassignment_requests/request_xxx with an empty value. The
> controller can then read all values under /admin/reassignments.
>
> 11. The improvement you suggested in onPartitionReassignment() sounds good
> at the high level. The computation of those dropped partitions seems a bit
> complicated. Perhaps a simple approach is to drop the replicas not in the
> original assignment and newest reassignment?
>
> 12. You brought up the need of remembering the original assignment. This
> will be lost if the assignment is changed multiple times if we follow the
> approach described in 10. One way is to store the original assignment in
> /brokers/topics/[topic] as the following. When the final reassignment
> completes, we can remove the original field.
> {
>   "version": 1,
>   "partitions": {"0": [0, 1, 3] },
>   "originals": {"0": [0, 1, 2] }
> }
>
> 13. For resolving the conflict between /admin/reassign_partitions and
> /admin/reassignments/$topic/$partition, perhaps it's more natural to just
> let the assignment with a newer timestamp to override the older one?
>
> 14. Implementation wise, currently, we register a watcher of the isr path
> of each partition being reassigned. This has the potential issue of
> registering many listeners. An improvement could be just piggybacking on
> the existing IsrChangeNotificationHandler, which only watches a single ZK
> path and is triggered on a batch of isr changes. This is kind of orthogonal
> to the KIP. However, if we are touching the reassignment logic, it may be
> worth considering.
>
> Thanks,
>
> Jun
>
> On Fri, Dec 15, 2017 at 10:17 AM, Tom Bentley <t....@gmail.com>
> wrote:
>
> > Just wanted to mention that I've started KIP-240, which builds on top of
> > this one to provide an AdminClient API for listing and describing
> > reassignments.
> >
> > On 15 December 2017 at 14:34, Tom Bentley <t....@gmail.com> wrote:
> >
> > > > Should we seek to improve this algorithm in this KIP, or leave that
> as
> > > a later optimisation?
> > >
> > > I've updated the KIP with a proposed algorithm.
> > >
> > >
> > >
> > >
> > >
> > >
> > > On 14 December 2017 at 09:57, Tom Bentley <t....@gmail.com>
> wrote:
> > >
> > >> Thanks Ted, now fixed.
> > >>
> > >> On 13 December 2017 at 18:38, Ted Yu <yu...@gmail.com> wrote:
> > >>
> > >>> Tom:
> > >>> bq. create a znode /admin/reassignments/$topic-$partition
> > >>>
> > >>> Looks like the tree structure above should be:
> > >>>
> > >>> /admin/reassignments/$topic/$partition
> > >>>
> > >>> bq. The controller removes /admin/reassignment/$topic/$partition
> > >>>
> > >>> Note the lack of 's' for reassignment. It would be good to make
> > zookeeper
> > >>> paths consistent.
> > >>>
> > >>> Thanks
> > >>>
> > >>> On Wed, Dec 13, 2017 at 9:49 AM, Tom Bentley <t....@gmail.com>
> > >>> wrote:
> > >>>
> > >>> > Hi Jun and Ted,
> > >>> >
> > >>> > Jun, you're right that needing one watcher per reassigned partition
> > >>> > presents a scalability problem, and using a separate notification
> > path
> > >>> > solves that. I also agree that it makes sense to prevent users from
> > >>> using
> > >>> > both methods on the same reassignment.
> > >>> >
> > >>> > Ted, naming the reassignments like mytopic-42 was simpler while I
> was
> > >>> > proposing a watcher-per-reassignment (I'd have needed a child
> watcher
> > >>> on
> > >>> > /admin/reassignments and also on /admin/reassignments/mytopic).
> Using
> > >>> the
> > >>> > separate notification path means I don't need any watchers in the
> > >>> > /admin/reassignments subtree, so switching to
> > >>> /admin/reassignments/mytopic/
> > >>> > 42
> > >>> > would work, and avoid /admin/reassignments having a very large
> number
> > >>> of
> > >>> > child nodes. On the other hand it also means I have to create and
> > >>> delete
> > >>> > the topic nodes (e.g. /admin/reassignments/mytopic), which incurs
> the
> > >>> cost
> > >>> > of extra round trips to zookeeper. I suppose that since
> reassignment
> > is
> > >>> > generally a slow process it makes little difference if we increase
> > the
> > >>> > latency of the interactions with zookeeper.
> > >>> >
> > >>> > I have updated the KIP with these improvements, and a more detailed
> > >>> > description of exactly how we would manage these znodes.
> > >>> >
> > >>> > Reading the algorithm in KafkaController.
> onPartitionReassignment(),
> > it
> > >>> > seems that it would be suboptimal for changing reassignments
> > in-flight.
> > >>> > Consider an initial assignment of [1,2], reassigned to [2,3] and
> then
> > >>> > changed to [2,4]. Broker 3 will remain in the assigned replicas
> until
> > >>> > broker 4 is in sync, even though 3 wasn't actually one of the
> > original
> > >>> > assigned replicas and is no longer a new assigned replica. I think
> > this
> > >>> > also affects the case where the reassignment is cancelled
> > >>> > ([1,2]->[2,3]->[1,2]): We again have to wait for 3 to catch up,
> even
> > >>> though
> > >>> > its replica will then be deleted.
> > >>> >
> > >>> > Should we seek to improve this algorithm in this KIP, or leave that
> > as
> > >>> a
> > >>> > later optimisation?
> > >>> >
> > >>> > Cheers,
> > >>> >
> > >>> > Tom
> > >>> >
> > >>> > On 11 December 2017 at 21:31, Jun Rao <ju...@confluent.io> wrote:
> > >>> >
> > >>> > > Another question is on the compatibility. Since now there are 2
> > ways
> > >>> of
> > >>> > > specifying a partition reassignment, one under
> > >>> /admin/reassign_partitions
> > >>> > > and the other under /admin/reassignments, we probably want to
> > >>> prevent the
> > >>> > > same topic being reassigned under both paths at the same time?
> > >>> > > Thanks,
> > >>> > >
> > >>> > > Jun
> > >>> > >
> > >>> > >
> > >>> > >
> > >>> > > On Fri, Dec 8, 2017 at 5:41 PM, Jun Rao <ju...@confluent.io>
> wrote:
> > >>> > >
> > >>> > > > Hi, Tom,
> > >>> > > >
> > >>> > > > Thanks for the KIP. It definitely addresses one of the pain
> > points
> > >>> in
> > >>> > > > partition reassignment. Another issue that it also addresses is
> > >>> the ZK
> > >>> > > node
> > >>> > > > size limit when writing the reassignment JSON.
> > >>> > > >
> > >>> > > > My only concern is that the KIP needs to create one watcher per
> > >>> > > reassigned
> > >>> > > > partition. This could add overhead in ZK and complexity for
> > >>> debugging
> > >>> > > when
> > >>> > > > lots of partitions are being reassigned simultaneously. We
> could
> > >>> > > > potentially improve this by introducing a separate ZK path for
> > >>> change
> > >>> > > > notification as we do for configs. For example, every time we
> > >>> change
> > >>> > the
> > >>> > > > assignment for a set of partitions, we could further write a
> > >>> sequential
> > >>> > > > node /admin/reassignment_changes/[change_x]. That way, the
> > >>> controller
> > >>> > > > only needs to watch the change path. Once a change is
> triggered,
> > >>> the
> > >>> > > > controller can read everything under /admin/reassignments/.
> > >>> > > >
> > >>> > > > Jun
> > >>> > > >
> > >>> > > >
> > >>> > > > On Wed, Dec 6, 2017 at 1:19 PM, Tom Bentley <
> > t.j.bentley@gmail.com
> > >>> >
> > >>> > > wrote:
> > >>> > > >
> > >>> > > >> Hi,
> > >>> > > >>
> > >>> > > >> This is still very new, but I wanted some quick feedback on a
> > >>> > > preliminary
> > >>> > > >> KIP which could, I think, help with providing an AdminClient
> API
> > >>> for
> > >>> > > >> partition reassignment.
> > >>> > > >>
> > >>> > > >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-236%
> > >>> > > >> 3A+Interruptible+Partition+Reassignment
> > >>> > > >>
> > >>> > > >> I wasn't sure whether to start fleshing out a whole
> AdminClient
> > >>> API in
> > >>> > > >> this
> > >>> > > >> KIP (which would make it very big, and difficult to read), or
> > >>> whether
> > >>> > to
> > >>> > > >> break it down into smaller KIPs (which makes it easier to read
> > and
> > >>> > > >> implement in pieces, but harder to get a high-level picture of
> > the
> > >>> > > >> ultimate
> > >>> > > >> destination). For now I've gone for a very small initial KIP,
> > but
> > >>> I'm
> > >>> > > >> happy
> > >>> > > >> to sketch the bigger picture here if people are interested.
> > >>> > > >>
> > >>> > > >> Cheers,
> > >>> > > >>
> > >>> > > >> Tom
> > >>> > > >>
> > >>> > > >
> > >>> > > >
> > >>> > >
> > >>> >
> > >>> >
> > >>> > On 11 December 2017 at 21:31, Jun Rao <ju...@confluent.io> wrote:
> > >>> >
> > >>> > > Another question is on the compatibility. Since now there are 2
> > ways
> > >>> of
> > >>> > > specifying a partition reassignment, one under
> > >>> /admin/reassign_partitions
> > >>> > > and the other under /admin/reassignments, we probably want to
> > >>> prevent the
> > >>> > > same topic being reassigned under both paths at the same time?
> > >>> > > Thanks,
> > >>> > >
> > >>> > > Jun
> > >>> > >
> > >>> > >
> > >>> > >
> > >>> > > On Fri, Dec 8, 2017 at 5:41 PM, Jun Rao <ju...@confluent.io>
> wrote:
> > >>> > >
> > >>> > > > Hi, Tom,
> > >>> > > >
> > >>> > > > Thanks for the KIP. It definitely addresses one of the pain
> > points
> > >>> in
> > >>> > > > partition reassignment. Another issue that it also addresses is
> > >>> the ZK
> > >>> > > node
> > >>> > > > size limit when writing the reassignment JSON.
> > >>> > > >
> > >>> > > > My only concern is that the KIP needs to create one watcher per
> > >>> > > reassigned
> > >>> > > > partition. This could add overhead in ZK and complexity for
> > >>> debugging
> > >>> > > when
> > >>> > > > lots of partitions are being reassigned simultaneously. We
> could
> > >>> > > > potentially improve this by introducing a separate ZK path for
> > >>> change
> > >>> > > > notification as we do for configs. For example, every time we
> > >>> change
> > >>> > the
> > >>> > > > assignment for a set of partitions, we could further write a
> > >>> sequential
> > >>> > > > node /admin/reassignment_changes/[change_x]. That way, the
> > >>> controller
> > >>> > > > only needs to watch the change path. Once a change is
> triggered,
> > >>> the
> > >>> > > > controller can read everything under /admin/reassignments/.
> > >>> > > >
> > >>> > > > Jun
> > >>> > > >
> > >>> > > >
> > >>> > > > On Wed, Dec 6, 2017 at 1:19 PM, Tom Bentley <
> > t.j.bentley@gmail.com
> > >>> >
> > >>> > > wrote:
> > >>> > > >
> > >>> > > >> Hi,
> > >>> > > >>
> > >>> > > >> This is still very new, but I wanted some quick feedback on a
> > >>> > > preliminary
> > >>> > > >> KIP which could, I think, help with providing an AdminClient
> API
> > >>> for
> > >>> > > >> partition reassignment.
> > >>> > > >>
> > >>> > > >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-236%
> > >>> > > >> 3A+Interruptible+Partition+Reassignment
> > >>> > > >>
> > >>> > > >> I wasn't sure whether to start fleshing out a whole
> AdminClient
> > >>> API in
> > >>> > > >> this
> > >>> > > >> KIP (which would make it very big, and difficult to read), or
> > >>> whether
> > >>> > to
> > >>> > > >> break it down into smaller KIPs (which makes it easier to read
> > and
> > >>> > > >> implement in pieces, but harder to get a high-level picture of
> > the
> > >>> > > >> ultimate
> > >>> > > >> destination). For now I've gone for a very small initial KIP,
> > but
> > >>> I'm
> > >>> > > >> happy
> > >>> > > >> to sketch the bigger picture here if people are interested.
> > >>> > > >>
> > >>> > > >> Cheers,
> > >>> > > >>
> > >>> > > >> Tom
> > >>> > > >>
> > >>> > > >
> > >>> > > >
> > >>> > >
> > >>> >
> > >>>
> > >>
> > >>
> > >
> >
>

Re: [DISCUSS] KIP-236 Interruptible Partition Reassignment

Posted by Jun Rao <ju...@confluent.io>.
Hi, Tom,

Thanks for the updated KIP. A few more comments below.

10. The proposal now stores the reassignment for all partitions in
/admin/reassignment_requests/request_xxx. If the number of reassigned
partitions is larger, the ZK write may hit the default 1MB limit and fail.
An alternative approach is to have the reassignment requester first write
the new assignment for each partition under
/admin/reassignments/$topic/$partition and then write
/admin/reassignment_requests/request_xxx with an empty value. The
controller can then read all values under /admin/reassignments.

11. The improvement you suggested in onPartitionReassignment() sounds good
at the high level. The computation of those dropped partitions seems a bit
complicated. Perhaps a simple approach is to drop the replicas not in the
original assignment and newest reassignment?

12. You brought up the need of remembering the original assignment. This
will be lost if the assignment is changed multiple times if we follow the
approach described in 10. One way is to store the original assignment in
/brokers/topics/[topic] as the following. When the final reassignment
completes, we can remove the original field.
{
  "version": 1,
  "partitions": {"0": [0, 1, 3] },
  "originals": {"0": [0, 1, 2] }
}

13. For resolving the conflict between /admin/reassign_partitions and
/admin/reassignments/$topic/$partition, perhaps it's more natural to just
let the assignment with a newer timestamp to override the older one?

14. Implementation wise, currently, we register a watcher of the isr path
of each partition being reassigned. This has the potential issue of
registering many listeners. An improvement could be just piggybacking on
the existing IsrChangeNotificationHandler, which only watches a single ZK
path and is triggered on a batch of isr changes. This is kind of orthogonal
to the KIP. However, if we are touching the reassignment logic, it may be
worth considering.

Thanks,

Jun

On Fri, Dec 15, 2017 at 10:17 AM, Tom Bentley <t....@gmail.com> wrote:

> Just wanted to mention that I've started KIP-240, which builds on top of
> this one to provide an AdminClient API for listing and describing
> reassignments.
>
> On 15 December 2017 at 14:34, Tom Bentley <t....@gmail.com> wrote:
>
> > > Should we seek to improve this algorithm in this KIP, or leave that as
> > a later optimisation?
> >
> > I've updated the KIP with a proposed algorithm.
> >
> >
> >
> >
> >
> >
> > On 14 December 2017 at 09:57, Tom Bentley <t....@gmail.com> wrote:
> >
> >> Thanks Ted, now fixed.
> >>
> >> On 13 December 2017 at 18:38, Ted Yu <yu...@gmail.com> wrote:
> >>
> >>> Tom:
> >>> bq. create a znode /admin/reassignments/$topic-$partition
> >>>
> >>> Looks like the tree structure above should be:
> >>>
> >>> /admin/reassignments/$topic/$partition
> >>>
> >>> bq. The controller removes /admin/reassignment/$topic/$partition
> >>>
> >>> Note the lack of 's' for reassignment. It would be good to make
> zookeeper
> >>> paths consistent.
> >>>
> >>> Thanks
> >>>
> >>> On Wed, Dec 13, 2017 at 9:49 AM, Tom Bentley <t....@gmail.com>
> >>> wrote:
> >>>
> >>> > Hi Jun and Ted,
> >>> >
> >>> > Jun, you're right that needing one watcher per reassigned partition
> >>> > presents a scalability problem, and using a separate notification
> path
> >>> > solves that. I also agree that it makes sense to prevent users from
> >>> using
> >>> > both methods on the same reassignment.
> >>> >
> >>> > Ted, naming the reassignments like mytopic-42 was simpler while I was
> >>> > proposing a watcher-per-reassignment (I'd have needed a child watcher
> >>> on
> >>> > /admin/reassignments and also on /admin/reassignments/mytopic). Using
> >>> the
> >>> > separate notification path means I don't need any watchers in the
> >>> > /admin/reassignments subtree, so switching to
> >>> /admin/reassignments/mytopic/
> >>> > 42
> >>> > would work, and avoid /admin/reassignments having a very large number
> >>> of
> >>> > child nodes. On the other hand it also means I have to create and
> >>> delete
> >>> > the topic nodes (e.g. /admin/reassignments/mytopic), which incurs the
> >>> cost
> >>> > of extra round trips to zookeeper. I suppose that since reassignment
> is
> >>> > generally a slow process it makes little difference if we increase
> the
> >>> > latency of the interactions with zookeeper.
> >>> >
> >>> > I have updated the KIP with these improvements, and a more detailed
> >>> > description of exactly how we would manage these znodes.
> >>> >
> >>> > Reading the algorithm in KafkaController.onPartitionReassignment(),
> it
> >>> > seems that it would be suboptimal for changing reassignments
> in-flight.
> >>> > Consider an initial assignment of [1,2], reassigned to [2,3] and then
> >>> > changed to [2,4]. Broker 3 will remain in the assigned replicas until
> >>> > broker 4 is in sync, even though 3 wasn't actually one of the
> original
> >>> > assigned replicas and is no longer a new assigned replica. I think
> this
> >>> > also affects the case where the reassignment is cancelled
> >>> > ([1,2]->[2,3]->[1,2]): We again have to wait for 3 to catch up, even
> >>> though
> >>> > its replica will then be deleted.
> >>> >
> >>> > Should we seek to improve this algorithm in this KIP, or leave that
> as
> >>> a
> >>> > later optimisation?
> >>> >
> >>> > Cheers,
> >>> >
> >>> > Tom
> >>> >
> >>> > On 11 December 2017 at 21:31, Jun Rao <ju...@confluent.io> wrote:
> >>> >
> >>> > > Another question is on the compatibility. Since now there are 2
> ways
> >>> of
> >>> > > specifying a partition reassignment, one under
> >>> /admin/reassign_partitions
> >>> > > and the other under /admin/reassignments, we probably want to
> >>> prevent the
> >>> > > same topic being reassigned under both paths at the same time?
> >>> > > Thanks,
> >>> > >
> >>> > > Jun
> >>> > >
> >>> > >
> >>> > >
> >>> > > On Fri, Dec 8, 2017 at 5:41 PM, Jun Rao <ju...@confluent.io> wrote:
> >>> > >
> >>> > > > Hi, Tom,
> >>> > > >
> >>> > > > Thanks for the KIP. It definitely addresses one of the pain
> points
> >>> in
> >>> > > > partition reassignment. Another issue that it also addresses is
> >>> the ZK
> >>> > > node
> >>> > > > size limit when writing the reassignment JSON.
> >>> > > >
> >>> > > > My only concern is that the KIP needs to create one watcher per
> >>> > > reassigned
> >>> > > > partition. This could add overhead in ZK and complexity for
> >>> debugging
> >>> > > when
> >>> > > > lots of partitions are being reassigned simultaneously. We could
> >>> > > > potentially improve this by introducing a separate ZK path for
> >>> change
> >>> > > > notification as we do for configs. For example, every time we
> >>> change
> >>> > the
> >>> > > > assignment for a set of partitions, we could further write a
> >>> sequential
> >>> > > > node /admin/reassignment_changes/[change_x]. That way, the
> >>> controller
> >>> > > > only needs to watch the change path. Once a change is triggered,
> >>> the
> >>> > > > controller can read everything under /admin/reassignments/.
> >>> > > >
> >>> > > > Jun
> >>> > > >
> >>> > > >
> >>> > > > On Wed, Dec 6, 2017 at 1:19 PM, Tom Bentley <
> t.j.bentley@gmail.com
> >>> >
> >>> > > wrote:
> >>> > > >
> >>> > > >> Hi,
> >>> > > >>
> >>> > > >> This is still very new, but I wanted some quick feedback on a
> >>> > > preliminary
> >>> > > >> KIP which could, I think, help with providing an AdminClient API
> >>> for
> >>> > > >> partition reassignment.
> >>> > > >>
> >>> > > >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-236%
> >>> > > >> 3A+Interruptible+Partition+Reassignment
> >>> > > >>
> >>> > > >> I wasn't sure whether to start fleshing out a whole AdminClient
> >>> API in
> >>> > > >> this
> >>> > > >> KIP (which would make it very big, and difficult to read), or
> >>> whether
> >>> > to
> >>> > > >> break it down into smaller KIPs (which makes it easier to read
> and
> >>> > > >> implement in pieces, but harder to get a high-level picture of
> the
> >>> > > >> ultimate
> >>> > > >> destination). For now I've gone for a very small initial KIP,
> but
> >>> I'm
> >>> > > >> happy
> >>> > > >> to sketch the bigger picture here if people are interested.
> >>> > > >>
> >>> > > >> Cheers,
> >>> > > >>
> >>> > > >> Tom
> >>> > > >>
> >>> > > >
> >>> > > >
> >>> > >
> >>> >
> >>> >
> >>> > On 11 December 2017 at 21:31, Jun Rao <ju...@confluent.io> wrote:
> >>> >
> >>> > > Another question is on the compatibility. Since now there are 2
> ways
> >>> of
> >>> > > specifying a partition reassignment, one under
> >>> /admin/reassign_partitions
> >>> > > and the other under /admin/reassignments, we probably want to
> >>> prevent the
> >>> > > same topic being reassigned under both paths at the same time?
> >>> > > Thanks,
> >>> > >
> >>> > > Jun
> >>> > >
> >>> > >
> >>> > >
> >>> > > On Fri, Dec 8, 2017 at 5:41 PM, Jun Rao <ju...@confluent.io> wrote:
> >>> > >
> >>> > > > Hi, Tom,
> >>> > > >
> >>> > > > Thanks for the KIP. It definitely addresses one of the pain
> points
> >>> in
> >>> > > > partition reassignment. Another issue that it also addresses is
> >>> the ZK
> >>> > > node
> >>> > > > size limit when writing the reassignment JSON.
> >>> > > >
> >>> > > > My only concern is that the KIP needs to create one watcher per
> >>> > > reassigned
> >>> > > > partition. This could add overhead in ZK and complexity for
> >>> debugging
> >>> > > when
> >>> > > > lots of partitions are being reassigned simultaneously. We could
> >>> > > > potentially improve this by introducing a separate ZK path for
> >>> change
> >>> > > > notification as we do for configs. For example, every time we
> >>> change
> >>> > the
> >>> > > > assignment for a set of partitions, we could further write a
> >>> sequential
> >>> > > > node /admin/reassignment_changes/[change_x]. That way, the
> >>> controller
> >>> > > > only needs to watch the change path. Once a change is triggered,
> >>> the
> >>> > > > controller can read everything under /admin/reassignments/.
> >>> > > >
> >>> > > > Jun
> >>> > > >
> >>> > > >
> >>> > > > On Wed, Dec 6, 2017 at 1:19 PM, Tom Bentley <
> t.j.bentley@gmail.com
> >>> >
> >>> > > wrote:
> >>> > > >
> >>> > > >> Hi,
> >>> > > >>
> >>> > > >> This is still very new, but I wanted some quick feedback on a
> >>> > > preliminary
> >>> > > >> KIP which could, I think, help with providing an AdminClient API
> >>> for
> >>> > > >> partition reassignment.
> >>> > > >>
> >>> > > >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-236%
> >>> > > >> 3A+Interruptible+Partition+Reassignment
> >>> > > >>
> >>> > > >> I wasn't sure whether to start fleshing out a whole AdminClient
> >>> API in
> >>> > > >> this
> >>> > > >> KIP (which would make it very big, and difficult to read), or
> >>> whether
> >>> > to
> >>> > > >> break it down into smaller KIPs (which makes it easier to read
> and
> >>> > > >> implement in pieces, but harder to get a high-level picture of
> the
> >>> > > >> ultimate
> >>> > > >> destination). For now I've gone for a very small initial KIP,
> but
> >>> I'm
> >>> > > >> happy
> >>> > > >> to sketch the bigger picture here if people are interested.
> >>> > > >>
> >>> > > >> Cheers,
> >>> > > >>
> >>> > > >> Tom
> >>> > > >>
> >>> > > >
> >>> > > >
> >>> > >
> >>> >
> >>>
> >>
> >>
> >
>

Re: [DISCUSS] KIP-236 Interruptible Partition Reassignment

Posted by Tom Bentley <t....@gmail.com>.
Just wanted to mention that I've started KIP-240, which builds on top of
this one to provide an AdminClient API for listing and describing
reassignments.

On 15 December 2017 at 14:34, Tom Bentley <t....@gmail.com> wrote:

> > Should we seek to improve this algorithm in this KIP, or leave that as
> a later optimisation?
>
> I've updated the KIP with a proposed algorithm.
>
>
>
>
>
>
> On 14 December 2017 at 09:57, Tom Bentley <t....@gmail.com> wrote:
>
>> Thanks Ted, now fixed.
>>
>> On 13 December 2017 at 18:38, Ted Yu <yu...@gmail.com> wrote:
>>
>>> Tom:
>>> bq. create a znode /admin/reassignments/$topic-$partition
>>>
>>> Looks like the tree structure above should be:
>>>
>>> /admin/reassignments/$topic/$partition
>>>
>>> bq. The controller removes /admin/reassignment/$topic/$partition
>>>
>>> Note the lack of 's' for reassignment. It would be good to make zookeeper
>>> paths consistent.
>>>
>>> Thanks
>>>
>>> On Wed, Dec 13, 2017 at 9:49 AM, Tom Bentley <t....@gmail.com>
>>> wrote:
>>>
>>> > Hi Jun and Ted,
>>> >
>>> > Jun, you're right that needing one watcher per reassigned partition
>>> > presents a scalability problem, and using a separate notification path
>>> > solves that. I also agree that it makes sense to prevent users from
>>> using
>>> > both methods on the same reassignment.
>>> >
>>> > Ted, naming the reassignments like mytopic-42 was simpler while I was
>>> > proposing a watcher-per-reassignment (I'd have needed a child watcher
>>> on
>>> > /admin/reassignments and also on /admin/reassignments/mytopic). Using
>>> the
>>> > separate notification path means I don't need any watchers in the
>>> > /admin/reassignments subtree, so switching to
>>> /admin/reassignments/mytopic/
>>> > 42
>>> > would work, and avoid /admin/reassignments having a very large number
>>> of
>>> > child nodes. On the other hand it also means I have to create and
>>> delete
>>> > the topic nodes (e.g. /admin/reassignments/mytopic), which incurs the
>>> cost
>>> > of extra round trips to zookeeper. I suppose that since reassignment is
>>> > generally a slow process it makes little difference if we increase the
>>> > latency of the interactions with zookeeper.
>>> >
>>> > I have updated the KIP with these improvements, and a more detailed
>>> > description of exactly how we would manage these znodes.
>>> >
>>> > Reading the algorithm in KafkaController.onPartitionReassignment(), it
>>> > seems that it would be suboptimal for changing reassignments in-flight.
>>> > Consider an initial assignment of [1,2], reassigned to [2,3] and then
>>> > changed to [2,4]. Broker 3 will remain in the assigned replicas until
>>> > broker 4 is in sync, even though 3 wasn't actually one of the original
>>> > assigned replicas and is no longer a new assigned replica. I think this
>>> > also affects the case where the reassignment is cancelled
>>> > ([1,2]->[2,3]->[1,2]): We again have to wait for 3 to catch up, even
>>> though
>>> > its replica will then be deleted.
>>> >
>>> > Should we seek to improve this algorithm in this KIP, or leave that as
>>> a
>>> > later optimisation?
>>> >
>>> > Cheers,
>>> >
>>> > Tom
>>> >
>>> > On 11 December 2017 at 21:31, Jun Rao <ju...@confluent.io> wrote:
>>> >
>>> > > Another question is on the compatibility. Since now there are 2 ways
>>> of
>>> > > specifying a partition reassignment, one under
>>> /admin/reassign_partitions
>>> > > and the other under /admin/reassignments, we probably want to
>>> prevent the
>>> > > same topic being reassigned under both paths at the same time?
>>> > > Thanks,
>>> > >
>>> > > Jun
>>> > >
>>> > >
>>> > >
>>> > > On Fri, Dec 8, 2017 at 5:41 PM, Jun Rao <ju...@confluent.io> wrote:
>>> > >
>>> > > > Hi, Tom,
>>> > > >
>>> > > > Thanks for the KIP. It definitely addresses one of the pain points
>>> in
>>> > > > partition reassignment. Another issue that it also addresses is
>>> the ZK
>>> > > node
>>> > > > size limit when writing the reassignment JSON.
>>> > > >
>>> > > > My only concern is that the KIP needs to create one watcher per
>>> > > reassigned
>>> > > > partition. This could add overhead in ZK and complexity for
>>> debugging
>>> > > when
>>> > > > lots of partitions are being reassigned simultaneously. We could
>>> > > > potentially improve this by introducing a separate ZK path for
>>> change
>>> > > > notification as we do for configs. For example, every time we
>>> change
>>> > the
>>> > > > assignment for a set of partitions, we could further write a
>>> sequential
>>> > > > node /admin/reassignment_changes/[change_x]. That way, the
>>> controller
>>> > > > only needs to watch the change path. Once a change is triggered,
>>> the
>>> > > > controller can read everything under /admin/reassignments/.
>>> > > >
>>> > > > Jun
>>> > > >
>>> > > >
>>> > > > On Wed, Dec 6, 2017 at 1:19 PM, Tom Bentley <t.j.bentley@gmail.com
>>> >
>>> > > wrote:
>>> > > >
>>> > > >> Hi,
>>> > > >>
>>> > > >> This is still very new, but I wanted some quick feedback on a
>>> > > preliminary
>>> > > >> KIP which could, I think, help with providing an AdminClient API
>>> for
>>> > > >> partition reassignment.
>>> > > >>
>>> > > >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-236%
>>> > > >> 3A+Interruptible+Partition+Reassignment
>>> > > >>
>>> > > >> I wasn't sure whether to start fleshing out a whole AdminClient
>>> API in
>>> > > >> this
>>> > > >> KIP (which would make it very big, and difficult to read), or
>>> whether
>>> > to
>>> > > >> break it down into smaller KIPs (which makes it easier to read and
>>> > > >> implement in pieces, but harder to get a high-level picture of the
>>> > > >> ultimate
>>> > > >> destination). For now I've gone for a very small initial KIP, but
>>> I'm
>>> > > >> happy
>>> > > >> to sketch the bigger picture here if people are interested.
>>> > > >>
>>> > > >> Cheers,
>>> > > >>
>>> > > >> Tom
>>> > > >>
>>> > > >
>>> > > >
>>> > >
>>> >
>>> >
>>> > On 11 December 2017 at 21:31, Jun Rao <ju...@confluent.io> wrote:
>>> >
>>> > > Another question is on the compatibility. Since now there are 2 ways
>>> of
>>> > > specifying a partition reassignment, one under
>>> /admin/reassign_partitions
>>> > > and the other under /admin/reassignments, we probably want to
>>> prevent the
>>> > > same topic being reassigned under both paths at the same time?
>>> > > Thanks,
>>> > >
>>> > > Jun
>>> > >
>>> > >
>>> > >
>>> > > On Fri, Dec 8, 2017 at 5:41 PM, Jun Rao <ju...@confluent.io> wrote:
>>> > >
>>> > > > Hi, Tom,
>>> > > >
>>> > > > Thanks for the KIP. It definitely addresses one of the pain points
>>> in
>>> > > > partition reassignment. Another issue that it also addresses is
>>> the ZK
>>> > > node
>>> > > > size limit when writing the reassignment JSON.
>>> > > >
>>> > > > My only concern is that the KIP needs to create one watcher per
>>> > > reassigned
>>> > > > partition. This could add overhead in ZK and complexity for
>>> debugging
>>> > > when
>>> > > > lots of partitions are being reassigned simultaneously. We could
>>> > > > potentially improve this by introducing a separate ZK path for
>>> change
>>> > > > notification as we do for configs. For example, every time we
>>> change
>>> > the
>>> > > > assignment for a set of partitions, we could further write a
>>> sequential
>>> > > > node /admin/reassignment_changes/[change_x]. That way, the
>>> controller
>>> > > > only needs to watch the change path. Once a change is triggered,
>>> the
>>> > > > controller can read everything under /admin/reassignments/.
>>> > > >
>>> > > > Jun
>>> > > >
>>> > > >
>>> > > > On Wed, Dec 6, 2017 at 1:19 PM, Tom Bentley <t.j.bentley@gmail.com
>>> >
>>> > > wrote:
>>> > > >
>>> > > >> Hi,
>>> > > >>
>>> > > >> This is still very new, but I wanted some quick feedback on a
>>> > > preliminary
>>> > > >> KIP which could, I think, help with providing an AdminClient API
>>> for
>>> > > >> partition reassignment.
>>> > > >>
>>> > > >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-236%
>>> > > >> 3A+Interruptible+Partition+Reassignment
>>> > > >>
>>> > > >> I wasn't sure whether to start fleshing out a whole AdminClient
>>> API in
>>> > > >> this
>>> > > >> KIP (which would make it very big, and difficult to read), or
>>> whether
>>> > to
>>> > > >> break it down into smaller KIPs (which makes it easier to read and
>>> > > >> implement in pieces, but harder to get a high-level picture of the
>>> > > >> ultimate
>>> > > >> destination). For now I've gone for a very small initial KIP, but
>>> I'm
>>> > > >> happy
>>> > > >> to sketch the bigger picture here if people are interested.
>>> > > >>
>>> > > >> Cheers,
>>> > > >>
>>> > > >> Tom
>>> > > >>
>>> > > >
>>> > > >
>>> > >
>>> >
>>>
>>
>>
>

Re: [DISCUSS] KIP-236 Interruptible Partition Reassignment

Posted by Tom Bentley <t....@gmail.com>.
> Should we seek to improve this algorithm in this KIP, or leave that as a
later optimisation?

I've updated the KIP with a proposed algorithm.






On 14 December 2017 at 09:57, Tom Bentley <t....@gmail.com> wrote:

> Thanks Ted, now fixed.
>
> On 13 December 2017 at 18:38, Ted Yu <yu...@gmail.com> wrote:
>
>> Tom:
>> bq. create a znode /admin/reassignments/$topic-$partition
>>
>> Looks like the tree structure above should be:
>>
>> /admin/reassignments/$topic/$partition
>>
>> bq. The controller removes /admin/reassignment/$topic/$partition
>>
>> Note the lack of 's' for reassignment. It would be good to make zookeeper
>> paths consistent.
>>
>> Thanks
>>
>> On Wed, Dec 13, 2017 at 9:49 AM, Tom Bentley <t....@gmail.com>
>> wrote:
>>
>> > Hi Jun and Ted,
>> >
>> > Jun, you're right that needing one watcher per reassigned partition
>> > presents a scalability problem, and using a separate notification path
>> > solves that. I also agree that it makes sense to prevent users from
>> using
>> > both methods on the same reassignment.
>> >
>> > Ted, naming the reassignments like mytopic-42 was simpler while I was
>> > proposing a watcher-per-reassignment (I'd have needed a child watcher on
>> > /admin/reassignments and also on /admin/reassignments/mytopic). Using
>> the
>> > separate notification path means I don't need any watchers in the
>> > /admin/reassignments subtree, so switching to
>> /admin/reassignments/mytopic/
>> > 42
>> > would work, and avoid /admin/reassignments having a very large number of
>> > child nodes. On the other hand it also means I have to create and delete
>> > the topic nodes (e.g. /admin/reassignments/mytopic), which incurs the
>> cost
>> > of extra round trips to zookeeper. I suppose that since reassignment is
>> > generally a slow process it makes little difference if we increase the
>> > latency of the interactions with zookeeper.
>> >
>> > I have updated the KIP with these improvements, and a more detailed
>> > description of exactly how we would manage these znodes.
>> >
>> > Reading the algorithm in KafkaController.onPartitionReassignment(), it
>> > seems that it would be suboptimal for changing reassignments in-flight.
>> > Consider an initial assignment of [1,2], reassigned to [2,3] and then
>> > changed to [2,4]. Broker 3 will remain in the assigned replicas until
>> > broker 4 is in sync, even though 3 wasn't actually one of the original
>> > assigned replicas and is no longer a new assigned replica. I think this
>> > also affects the case where the reassignment is cancelled
>> > ([1,2]->[2,3]->[1,2]): We again have to wait for 3 to catch up, even
>> though
>> > its replica will then be deleted.
>> >
>> > Should we seek to improve this algorithm in this KIP, or leave that as a
>> > later optimisation?
>> >
>> > Cheers,
>> >
>> > Tom
>> >
>> > On 11 December 2017 at 21:31, Jun Rao <ju...@confluent.io> wrote:
>> >
>> > > Another question is on the compatibility. Since now there are 2 ways
>> of
>> > > specifying a partition reassignment, one under
>> /admin/reassign_partitions
>> > > and the other under /admin/reassignments, we probably want to prevent
>> the
>> > > same topic being reassigned under both paths at the same time?
>> > > Thanks,
>> > >
>> > > Jun
>> > >
>> > >
>> > >
>> > > On Fri, Dec 8, 2017 at 5:41 PM, Jun Rao <ju...@confluent.io> wrote:
>> > >
>> > > > Hi, Tom,
>> > > >
>> > > > Thanks for the KIP. It definitely addresses one of the pain points
>> in
>> > > > partition reassignment. Another issue that it also addresses is the
>> ZK
>> > > node
>> > > > size limit when writing the reassignment JSON.
>> > > >
>> > > > My only concern is that the KIP needs to create one watcher per
>> > > reassigned
>> > > > partition. This could add overhead in ZK and complexity for
>> debugging
>> > > when
>> > > > lots of partitions are being reassigned simultaneously. We could
>> > > > potentially improve this by introducing a separate ZK path for
>> change
>> > > > notification as we do for configs. For example, every time we change
>> > the
>> > > > assignment for a set of partitions, we could further write a
>> sequential
>> > > > node /admin/reassignment_changes/[change_x]. That way, the
>> controller
>> > > > only needs to watch the change path. Once a change is triggered, the
>> > > > controller can read everything under /admin/reassignments/.
>> > > >
>> > > > Jun
>> > > >
>> > > >
>> > > > On Wed, Dec 6, 2017 at 1:19 PM, Tom Bentley <t....@gmail.com>
>> > > wrote:
>> > > >
>> > > >> Hi,
>> > > >>
>> > > >> This is still very new, but I wanted some quick feedback on a
>> > > preliminary
>> > > >> KIP which could, I think, help with providing an AdminClient API
>> for
>> > > >> partition reassignment.
>> > > >>
>> > > >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-236%
>> > > >> 3A+Interruptible+Partition+Reassignment
>> > > >>
>> > > >> I wasn't sure whether to start fleshing out a whole AdminClient
>> API in
>> > > >> this
>> > > >> KIP (which would make it very big, and difficult to read), or
>> whether
>> > to
>> > > >> break it down into smaller KIPs (which makes it easier to read and
>> > > >> implement in pieces, but harder to get a high-level picture of the
>> > > >> ultimate
>> > > >> destination). For now I've gone for a very small initial KIP, but
>> I'm
>> > > >> happy
>> > > >> to sketch the bigger picture here if people are interested.
>> > > >>
>> > > >> Cheers,
>> > > >>
>> > > >> Tom
>> > > >>
>> > > >
>> > > >
>> > >
>> >
>> >
>> > On 11 December 2017 at 21:31, Jun Rao <ju...@confluent.io> wrote:
>> >
>> > > Another question is on the compatibility. Since now there are 2 ways
>> of
>> > > specifying a partition reassignment, one under
>> /admin/reassign_partitions
>> > > and the other under /admin/reassignments, we probably want to prevent
>> the
>> > > same topic being reassigned under both paths at the same time?
>> > > Thanks,
>> > >
>> > > Jun
>> > >
>> > >
>> > >
>> > > On Fri, Dec 8, 2017 at 5:41 PM, Jun Rao <ju...@confluent.io> wrote:
>> > >
>> > > > Hi, Tom,
>> > > >
>> > > > Thanks for the KIP. It definitely addresses one of the pain points
>> in
>> > > > partition reassignment. Another issue that it also addresses is the
>> ZK
>> > > node
>> > > > size limit when writing the reassignment JSON.
>> > > >
>> > > > My only concern is that the KIP needs to create one watcher per
>> > > reassigned
>> > > > partition. This could add overhead in ZK and complexity for
>> debugging
>> > > when
>> > > > lots of partitions are being reassigned simultaneously. We could
>> > > > potentially improve this by introducing a separate ZK path for
>> change
>> > > > notification as we do for configs. For example, every time we change
>> > the
>> > > > assignment for a set of partitions, we could further write a
>> sequential
>> > > > node /admin/reassignment_changes/[change_x]. That way, the
>> controller
>> > > > only needs to watch the change path. Once a change is triggered, the
>> > > > controller can read everything under /admin/reassignments/.
>> > > >
>> > > > Jun
>> > > >
>> > > >
>> > > > On Wed, Dec 6, 2017 at 1:19 PM, Tom Bentley <t....@gmail.com>
>> > > wrote:
>> > > >
>> > > >> Hi,
>> > > >>
>> > > >> This is still very new, but I wanted some quick feedback on a
>> > > preliminary
>> > > >> KIP which could, I think, help with providing an AdminClient API
>> for
>> > > >> partition reassignment.
>> > > >>
>> > > >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-236%
>> > > >> 3A+Interruptible+Partition+Reassignment
>> > > >>
>> > > >> I wasn't sure whether to start fleshing out a whole AdminClient
>> API in
>> > > >> this
>> > > >> KIP (which would make it very big, and difficult to read), or
>> whether
>> > to
>> > > >> break it down into smaller KIPs (which makes it easier to read and
>> > > >> implement in pieces, but harder to get a high-level picture of the
>> > > >> ultimate
>> > > >> destination). For now I've gone for a very small initial KIP, but
>> I'm
>> > > >> happy
>> > > >> to sketch the bigger picture here if people are interested.
>> > > >>
>> > > >> Cheers,
>> > > >>
>> > > >> Tom
>> > > >>
>> > > >
>> > > >
>> > >
>> >
>>
>
>

Re: [DISCUSS] KIP-236 Interruptible Partition Reassignment

Posted by Tom Bentley <t....@gmail.com>.
Thanks Ted, now fixed.

On 13 December 2017 at 18:38, Ted Yu <yu...@gmail.com> wrote:

> Tom:
> bq. create a znode /admin/reassignments/$topic-$partition
>
> Looks like the tree structure above should be:
>
> /admin/reassignments/$topic/$partition
>
> bq. The controller removes /admin/reassignment/$topic/$partition
>
> Note the lack of 's' for reassignment. It would be good to make zookeeper
> paths consistent.
>
> Thanks
>
> On Wed, Dec 13, 2017 at 9:49 AM, Tom Bentley <t....@gmail.com>
> wrote:
>
> > Hi Jun and Ted,
> >
> > Jun, you're right that needing one watcher per reassigned partition
> > presents a scalability problem, and using a separate notification path
> > solves that. I also agree that it makes sense to prevent users from using
> > both methods on the same reassignment.
> >
> > Ted, naming the reassignments like mytopic-42 was simpler while I was
> > proposing a watcher-per-reassignment (I'd have needed a child watcher on
> > /admin/reassignments and also on /admin/reassignments/mytopic). Using the
> > separate notification path means I don't need any watchers in the
> > /admin/reassignments subtree, so switching to
> /admin/reassignments/mytopic/
> > 42
> > would work, and avoid /admin/reassignments having a very large number of
> > child nodes. On the other hand it also means I have to create and delete
> > the topic nodes (e.g. /admin/reassignments/mytopic), which incurs the
> cost
> > of extra round trips to zookeeper. I suppose that since reassignment is
> > generally a slow process it makes little difference if we increase the
> > latency of the interactions with zookeeper.
> >
> > I have updated the KIP with these improvements, and a more detailed
> > description of exactly how we would manage these znodes.
> >
> > Reading the algorithm in KafkaController.onPartitionReassignment(), it
> > seems that it would be suboptimal for changing reassignments in-flight.
> > Consider an initial assignment of [1,2], reassigned to [2,3] and then
> > changed to [2,4]. Broker 3 will remain in the assigned replicas until
> > broker 4 is in sync, even though 3 wasn't actually one of the original
> > assigned replicas and is no longer a new assigned replica. I think this
> > also affects the case where the reassignment is cancelled
> > ([1,2]->[2,3]->[1,2]): We again have to wait for 3 to catch up, even
> though
> > its replica will then be deleted.
> >
> > Should we seek to improve this algorithm in this KIP, or leave that as a
> > later optimisation?
> >
> > Cheers,
> >
> > Tom
> >
> > On 11 December 2017 at 21:31, Jun Rao <ju...@confluent.io> wrote:
> >
> > > Another question is on the compatibility. Since now there are 2 ways of
> > > specifying a partition reassignment, one under
> /admin/reassign_partitions
> > > and the other under /admin/reassignments, we probably want to prevent
> the
> > > same topic being reassigned under both paths at the same time?
> > > Thanks,
> > >
> > > Jun
> > >
> > >
> > >
> > > On Fri, Dec 8, 2017 at 5:41 PM, Jun Rao <ju...@confluent.io> wrote:
> > >
> > > > Hi, Tom,
> > > >
> > > > Thanks for the KIP. It definitely addresses one of the pain points in
> > > > partition reassignment. Another issue that it also addresses is the
> ZK
> > > node
> > > > size limit when writing the reassignment JSON.
> > > >
> > > > My only concern is that the KIP needs to create one watcher per
> > > reassigned
> > > > partition. This could add overhead in ZK and complexity for debugging
> > > when
> > > > lots of partitions are being reassigned simultaneously. We could
> > > > potentially improve this by introducing a separate ZK path for change
> > > > notification as we do for configs. For example, every time we change
> > the
> > > > assignment for a set of partitions, we could further write a
> sequential
> > > > node /admin/reassignment_changes/[change_x]. That way, the
> controller
> > > > only needs to watch the change path. Once a change is triggered, the
> > > > controller can read everything under /admin/reassignments/.
> > > >
> > > > Jun
> > > >
> > > >
> > > > On Wed, Dec 6, 2017 at 1:19 PM, Tom Bentley <t....@gmail.com>
> > > wrote:
> > > >
> > > >> Hi,
> > > >>
> > > >> This is still very new, but I wanted some quick feedback on a
> > > preliminary
> > > >> KIP which could, I think, help with providing an AdminClient API for
> > > >> partition reassignment.
> > > >>
> > > >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-236%
> > > >> 3A+Interruptible+Partition+Reassignment
> > > >>
> > > >> I wasn't sure whether to start fleshing out a whole AdminClient API
> in
> > > >> this
> > > >> KIP (which would make it very big, and difficult to read), or
> whether
> > to
> > > >> break it down into smaller KIPs (which makes it easier to read and
> > > >> implement in pieces, but harder to get a high-level picture of the
> > > >> ultimate
> > > >> destination). For now I've gone for a very small initial KIP, but
> I'm
> > > >> happy
> > > >> to sketch the bigger picture here if people are interested.
> > > >>
> > > >> Cheers,
> > > >>
> > > >> Tom
> > > >>
> > > >
> > > >
> > >
> >
> >
> > On 11 December 2017 at 21:31, Jun Rao <ju...@confluent.io> wrote:
> >
> > > Another question is on the compatibility. Since now there are 2 ways of
> > > specifying a partition reassignment, one under
> /admin/reassign_partitions
> > > and the other under /admin/reassignments, we probably want to prevent
> the
> > > same topic being reassigned under both paths at the same time?
> > > Thanks,
> > >
> > > Jun
> > >
> > >
> > >
> > > On Fri, Dec 8, 2017 at 5:41 PM, Jun Rao <ju...@confluent.io> wrote:
> > >
> > > > Hi, Tom,
> > > >
> > > > Thanks for the KIP. It definitely addresses one of the pain points in
> > > > partition reassignment. Another issue that it also addresses is the
> ZK
> > > node
> > > > size limit when writing the reassignment JSON.
> > > >
> > > > My only concern is that the KIP needs to create one watcher per
> > > reassigned
> > > > partition. This could add overhead in ZK and complexity for debugging
> > > when
> > > > lots of partitions are being reassigned simultaneously. We could
> > > > potentially improve this by introducing a separate ZK path for change
> > > > notification as we do for configs. For example, every time we change
> > the
> > > > assignment for a set of partitions, we could further write a
> sequential
> > > > node /admin/reassignment_changes/[change_x]. That way, the
> controller
> > > > only needs to watch the change path. Once a change is triggered, the
> > > > controller can read everything under /admin/reassignments/.
> > > >
> > > > Jun
> > > >
> > > >
> > > > On Wed, Dec 6, 2017 at 1:19 PM, Tom Bentley <t....@gmail.com>
> > > wrote:
> > > >
> > > >> Hi,
> > > >>
> > > >> This is still very new, but I wanted some quick feedback on a
> > > preliminary
> > > >> KIP which could, I think, help with providing an AdminClient API for
> > > >> partition reassignment.
> > > >>
> > > >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-236%
> > > >> 3A+Interruptible+Partition+Reassignment
> > > >>
> > > >> I wasn't sure whether to start fleshing out a whole AdminClient API
> in
> > > >> this
> > > >> KIP (which would make it very big, and difficult to read), or
> whether
> > to
> > > >> break it down into smaller KIPs (which makes it easier to read and
> > > >> implement in pieces, but harder to get a high-level picture of the
> > > >> ultimate
> > > >> destination). For now I've gone for a very small initial KIP, but
> I'm
> > > >> happy
> > > >> to sketch the bigger picture here if people are interested.
> > > >>
> > > >> Cheers,
> > > >>
> > > >> Tom
> > > >>
> > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-236 Interruptible Partition Reassignment

Posted by Ted Yu <yu...@gmail.com>.
Tom:
bq. create a znode /admin/reassignments/$topic-$partition

Looks like the tree structure above should be:

/admin/reassignments/$topic/$partition

bq. The controller removes /admin/reassignment/$topic/$partition

Note the lack of 's' for reassignment. It would be good to make zookeeper
paths consistent.

Thanks

On Wed, Dec 13, 2017 at 9:49 AM, Tom Bentley <t....@gmail.com> wrote:

> Hi Jun and Ted,
>
> Jun, you're right that needing one watcher per reassigned partition
> presents a scalability problem, and using a separate notification path
> solves that. I also agree that it makes sense to prevent users from using
> both methods on the same reassignment.
>
> Ted, naming the reassignments like mytopic-42 was simpler while I was
> proposing a watcher-per-reassignment (I'd have needed a child watcher on
> /admin/reassignments and also on /admin/reassignments/mytopic). Using the
> separate notification path means I don't need any watchers in the
> /admin/reassignments subtree, so switching to /admin/reassignments/mytopic/
> 42
> would work, and avoid /admin/reassignments having a very large number of
> child nodes. On the other hand it also means I have to create and delete
> the topic nodes (e.g. /admin/reassignments/mytopic), which incurs the cost
> of extra round trips to zookeeper. I suppose that since reassignment is
> generally a slow process it makes little difference if we increase the
> latency of the interactions with zookeeper.
>
> I have updated the KIP with these improvements, and a more detailed
> description of exactly how we would manage these znodes.
>
> Reading the algorithm in KafkaController.onPartitionReassignment(), it
> seems that it would be suboptimal for changing reassignments in-flight.
> Consider an initial assignment of [1,2], reassigned to [2,3] and then
> changed to [2,4]. Broker 3 will remain in the assigned replicas until
> broker 4 is in sync, even though 3 wasn't actually one of the original
> assigned replicas and is no longer a new assigned replica. I think this
> also affects the case where the reassignment is cancelled
> ([1,2]->[2,3]->[1,2]): We again have to wait for 3 to catch up, even though
> its replica will then be deleted.
>
> Should we seek to improve this algorithm in this KIP, or leave that as a
> later optimisation?
>
> Cheers,
>
> Tom
>
> On 11 December 2017 at 21:31, Jun Rao <ju...@confluent.io> wrote:
>
> > Another question is on the compatibility. Since now there are 2 ways of
> > specifying a partition reassignment, one under /admin/reassign_partitions
> > and the other under /admin/reassignments, we probably want to prevent the
> > same topic being reassigned under both paths at the same time?
> > Thanks,
> >
> > Jun
> >
> >
> >
> > On Fri, Dec 8, 2017 at 5:41 PM, Jun Rao <ju...@confluent.io> wrote:
> >
> > > Hi, Tom,
> > >
> > > Thanks for the KIP. It definitely addresses one of the pain points in
> > > partition reassignment. Another issue that it also addresses is the ZK
> > node
> > > size limit when writing the reassignment JSON.
> > >
> > > My only concern is that the KIP needs to create one watcher per
> > reassigned
> > > partition. This could add overhead in ZK and complexity for debugging
> > when
> > > lots of partitions are being reassigned simultaneously. We could
> > > potentially improve this by introducing a separate ZK path for change
> > > notification as we do for configs. For example, every time we change
> the
> > > assignment for a set of partitions, we could further write a sequential
> > > node /admin/reassignment_changes/[change_x]. That way, the controller
> > > only needs to watch the change path. Once a change is triggered, the
> > > controller can read everything under /admin/reassignments/.
> > >
> > > Jun
> > >
> > >
> > > On Wed, Dec 6, 2017 at 1:19 PM, Tom Bentley <t....@gmail.com>
> > wrote:
> > >
> > >> Hi,
> > >>
> > >> This is still very new, but I wanted some quick feedback on a
> > preliminary
> > >> KIP which could, I think, help with providing an AdminClient API for
> > >> partition reassignment.
> > >>
> > >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-236%
> > >> 3A+Interruptible+Partition+Reassignment
> > >>
> > >> I wasn't sure whether to start fleshing out a whole AdminClient API in
> > >> this
> > >> KIP (which would make it very big, and difficult to read), or whether
> to
> > >> break it down into smaller KIPs (which makes it easier to read and
> > >> implement in pieces, but harder to get a high-level picture of the
> > >> ultimate
> > >> destination). For now I've gone for a very small initial KIP, but I'm
> > >> happy
> > >> to sketch the bigger picture here if people are interested.
> > >>
> > >> Cheers,
> > >>
> > >> Tom
> > >>
> > >
> > >
> >
>
>
> On 11 December 2017 at 21:31, Jun Rao <ju...@confluent.io> wrote:
>
> > Another question is on the compatibility. Since now there are 2 ways of
> > specifying a partition reassignment, one under /admin/reassign_partitions
> > and the other under /admin/reassignments, we probably want to prevent the
> > same topic being reassigned under both paths at the same time?
> > Thanks,
> >
> > Jun
> >
> >
> >
> > On Fri, Dec 8, 2017 at 5:41 PM, Jun Rao <ju...@confluent.io> wrote:
> >
> > > Hi, Tom,
> > >
> > > Thanks for the KIP. It definitely addresses one of the pain points in
> > > partition reassignment. Another issue that it also addresses is the ZK
> > node
> > > size limit when writing the reassignment JSON.
> > >
> > > My only concern is that the KIP needs to create one watcher per
> > reassigned
> > > partition. This could add overhead in ZK and complexity for debugging
> > when
> > > lots of partitions are being reassigned simultaneously. We could
> > > potentially improve this by introducing a separate ZK path for change
> > > notification as we do for configs. For example, every time we change
> the
> > > assignment for a set of partitions, we could further write a sequential
> > > node /admin/reassignment_changes/[change_x]. That way, the controller
> > > only needs to watch the change path. Once a change is triggered, the
> > > controller can read everything under /admin/reassignments/.
> > >
> > > Jun
> > >
> > >
> > > On Wed, Dec 6, 2017 at 1:19 PM, Tom Bentley <t....@gmail.com>
> > wrote:
> > >
> > >> Hi,
> > >>
> > >> This is still very new, but I wanted some quick feedback on a
> > preliminary
> > >> KIP which could, I think, help with providing an AdminClient API for
> > >> partition reassignment.
> > >>
> > >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-236%
> > >> 3A+Interruptible+Partition+Reassignment
> > >>
> > >> I wasn't sure whether to start fleshing out a whole AdminClient API in
> > >> this
> > >> KIP (which would make it very big, and difficult to read), or whether
> to
> > >> break it down into smaller KIPs (which makes it easier to read and
> > >> implement in pieces, but harder to get a high-level picture of the
> > >> ultimate
> > >> destination). For now I've gone for a very small initial KIP, but I'm
> > >> happy
> > >> to sketch the bigger picture here if people are interested.
> > >>
> > >> Cheers,
> > >>
> > >> Tom
> > >>
> > >
> > >
> >
>

Re: [DISCUSS] KIP-236 Interruptible Partition Reassignment

Posted by Tom Bentley <t....@gmail.com>.
Hi Jun and Ted,

Jun, you're right that needing one watcher per reassigned partition
presents a scalability problem, and using a separate notification path
solves that. I also agree that it makes sense to prevent users from using
both methods on the same reassignment.

Ted, naming the reassignments like mytopic-42 was simpler while I was
proposing a watcher-per-reassignment (I'd have needed a child watcher on
/admin/reassignments and also on /admin/reassignments/mytopic). Using the
separate notification path means I don't need any watchers in the
/admin/reassignments subtree, so switching to /admin/reassignments/mytopic/42
would work, and avoid /admin/reassignments having a very large number of
child nodes. On the other hand it also means I have to create and delete
the topic nodes (e.g. /admin/reassignments/mytopic), which incurs the cost
of extra round trips to zookeeper. I suppose that since reassignment is
generally a slow process it makes little difference if we increase the
latency of the interactions with zookeeper.

I have updated the KIP with these improvements, and a more detailed
description of exactly how we would manage these znodes.

Reading the algorithm in KafkaController.onPartitionReassignment(), it
seems that it would be suboptimal for changing reassignments in-flight.
Consider an initial assignment of [1,2], reassigned to [2,3] and then
changed to [2,4]. Broker 3 will remain in the assigned replicas until
broker 4 is in sync, even though 3 wasn't actually one of the original
assigned replicas and is no longer a new assigned replica. I think this
also affects the case where the reassignment is cancelled
([1,2]->[2,3]->[1,2]): We again have to wait for 3 to catch up, even though
its replica will then be deleted.

Should we seek to improve this algorithm in this KIP, or leave that as a
later optimisation?

Cheers,

Tom

On 11 December 2017 at 21:31, Jun Rao <ju...@confluent.io> wrote:

> Another question is on the compatibility. Since now there are 2 ways of
> specifying a partition reassignment, one under /admin/reassign_partitions
> and the other under /admin/reassignments, we probably want to prevent the
> same topic being reassigned under both paths at the same time?
> Thanks,
>
> Jun
>
>
>
> On Fri, Dec 8, 2017 at 5:41 PM, Jun Rao <ju...@confluent.io> wrote:
>
> > Hi, Tom,
> >
> > Thanks for the KIP. It definitely addresses one of the pain points in
> > partition reassignment. Another issue that it also addresses is the ZK
> node
> > size limit when writing the reassignment JSON.
> >
> > My only concern is that the KIP needs to create one watcher per
> reassigned
> > partition. This could add overhead in ZK and complexity for debugging
> when
> > lots of partitions are being reassigned simultaneously. We could
> > potentially improve this by introducing a separate ZK path for change
> > notification as we do for configs. For example, every time we change the
> > assignment for a set of partitions, we could further write a sequential
> > node /admin/reassignment_changes/[change_x]. That way, the controller
> > only needs to watch the change path. Once a change is triggered, the
> > controller can read everything under /admin/reassignments/.
> >
> > Jun
> >
> >
> > On Wed, Dec 6, 2017 at 1:19 PM, Tom Bentley <t....@gmail.com>
> wrote:
> >
> >> Hi,
> >>
> >> This is still very new, but I wanted some quick feedback on a
> preliminary
> >> KIP which could, I think, help with providing an AdminClient API for
> >> partition reassignment.
> >>
> >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-236%
> >> 3A+Interruptible+Partition+Reassignment
> >>
> >> I wasn't sure whether to start fleshing out a whole AdminClient API in
> >> this
> >> KIP (which would make it very big, and difficult to read), or whether to
> >> break it down into smaller KIPs (which makes it easier to read and
> >> implement in pieces, but harder to get a high-level picture of the
> >> ultimate
> >> destination). For now I've gone for a very small initial KIP, but I'm
> >> happy
> >> to sketch the bigger picture here if people are interested.
> >>
> >> Cheers,
> >>
> >> Tom
> >>
> >
> >
>


On 11 December 2017 at 21:31, Jun Rao <ju...@confluent.io> wrote:

> Another question is on the compatibility. Since now there are 2 ways of
> specifying a partition reassignment, one under /admin/reassign_partitions
> and the other under /admin/reassignments, we probably want to prevent the
> same topic being reassigned under both paths at the same time?
> Thanks,
>
> Jun
>
>
>
> On Fri, Dec 8, 2017 at 5:41 PM, Jun Rao <ju...@confluent.io> wrote:
>
> > Hi, Tom,
> >
> > Thanks for the KIP. It definitely addresses one of the pain points in
> > partition reassignment. Another issue that it also addresses is the ZK
> node
> > size limit when writing the reassignment JSON.
> >
> > My only concern is that the KIP needs to create one watcher per
> reassigned
> > partition. This could add overhead in ZK and complexity for debugging
> when
> > lots of partitions are being reassigned simultaneously. We could
> > potentially improve this by introducing a separate ZK path for change
> > notification as we do for configs. For example, every time we change the
> > assignment for a set of partitions, we could further write a sequential
> > node /admin/reassignment_changes/[change_x]. That way, the controller
> > only needs to watch the change path. Once a change is triggered, the
> > controller can read everything under /admin/reassignments/.
> >
> > Jun
> >
> >
> > On Wed, Dec 6, 2017 at 1:19 PM, Tom Bentley <t....@gmail.com>
> wrote:
> >
> >> Hi,
> >>
> >> This is still very new, but I wanted some quick feedback on a
> preliminary
> >> KIP which could, I think, help with providing an AdminClient API for
> >> partition reassignment.
> >>
> >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-236%
> >> 3A+Interruptible+Partition+Reassignment
> >>
> >> I wasn't sure whether to start fleshing out a whole AdminClient API in
> >> this
> >> KIP (which would make it very big, and difficult to read), or whether to
> >> break it down into smaller KIPs (which makes it easier to read and
> >> implement in pieces, but harder to get a high-level picture of the
> >> ultimate
> >> destination). For now I've gone for a very small initial KIP, but I'm
> >> happy
> >> to sketch the bigger picture here if people are interested.
> >>
> >> Cheers,
> >>
> >> Tom
> >>
> >
> >
>

Re: [DISCUSS] KIP-236 Interruptible Partition Reassignment

Posted by Jun Rao <ju...@confluent.io>.
Another question is on the compatibility. Since now there are 2 ways of
specifying a partition reassignment, one under /admin/reassign_partitions
and the other under /admin/reassignments, we probably want to prevent the
same topic being reassigned under both paths at the same time?
Thanks,

Jun



On Fri, Dec 8, 2017 at 5:41 PM, Jun Rao <ju...@confluent.io> wrote:

> Hi, Tom,
>
> Thanks for the KIP. It definitely addresses one of the pain points in
> partition reassignment. Another issue that it also addresses is the ZK node
> size limit when writing the reassignment JSON.
>
> My only concern is that the KIP needs to create one watcher per reassigned
> partition. This could add overhead in ZK and complexity for debugging when
> lots of partitions are being reassigned simultaneously. We could
> potentially improve this by introducing a separate ZK path for change
> notification as we do for configs. For example, every time we change the
> assignment for a set of partitions, we could further write a sequential
> node /admin/reassignment_changes/[change_x]. That way, the controller
> only needs to watch the change path. Once a change is triggered, the
> controller can read everything under /admin/reassignments/.
>
> Jun
>
>
> On Wed, Dec 6, 2017 at 1:19 PM, Tom Bentley <t....@gmail.com> wrote:
>
>> Hi,
>>
>> This is still very new, but I wanted some quick feedback on a preliminary
>> KIP which could, I think, help with providing an AdminClient API for
>> partition reassignment.
>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-236%
>> 3A+Interruptible+Partition+Reassignment
>>
>> I wasn't sure whether to start fleshing out a whole AdminClient API in
>> this
>> KIP (which would make it very big, and difficult to read), or whether to
>> break it down into smaller KIPs (which makes it easier to read and
>> implement in pieces, but harder to get a high-level picture of the
>> ultimate
>> destination). For now I've gone for a very small initial KIP, but I'm
>> happy
>> to sketch the bigger picture here if people are interested.
>>
>> Cheers,
>>
>> Tom
>>
>
>

Re: [DISCUSS] KIP-236 Interruptible Partition Reassignment

Posted by Ted Yu <yu...@gmail.com>.
For the znodes created under /admin/reassignments/, would it be better if
the partition znodes for the same topic are put under subtree headed by
topic name ?
In Tom's example, instead of:

/admin/reassignments/mytopic-42

the znode would be:

/admin/reassignments/mytopic/42

Cheers

On Fri, Dec 8, 2017 at 5:41 PM, Jun Rao <ju...@confluent.io> wrote:

> Hi, Tom,
>
> Thanks for the KIP. It definitely addresses one of the pain points in
> partition reassignment. Another issue that it also addresses is the ZK node
> size limit when writing the reassignment JSON.
>
> My only concern is that the KIP needs to create one watcher per reassigned
> partition. This could add overhead in ZK and complexity for debugging when
> lots of partitions are being reassigned simultaneously. We could
> potentially improve this by introducing a separate ZK path for change
> notification as we do for configs. For example, every time we change the
> assignment for a set of partitions, we could further write a sequential
> node /admin/reassignment_changes/[change_x]. That way, the controller only
> needs to watch the change path. Once a change is triggered, the controller
> can read everything under /admin/reassignments/.
>
> Jun
>
>
> On Wed, Dec 6, 2017 at 1:19 PM, Tom Bentley <t....@gmail.com> wrote:
>
> > Hi,
> >
> > This is still very new, but I wanted some quick feedback on a preliminary
> > KIP which could, I think, help with providing an AdminClient API for
> > partition reassignment.
> >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > 236%3A+Interruptible+Partition+Reassignment
> >
> > I wasn't sure whether to start fleshing out a whole AdminClient API in
> this
> > KIP (which would make it very big, and difficult to read), or whether to
> > break it down into smaller KIPs (which makes it easier to read and
> > implement in pieces, but harder to get a high-level picture of the
> ultimate
> > destination). For now I've gone for a very small initial KIP, but I'm
> happy
> > to sketch the bigger picture here if people are interested.
> >
> > Cheers,
> >
> > Tom
> >
>

Re: [DISCUSS] KIP-236 Interruptible Partition Reassignment

Posted by Jun Rao <ju...@confluent.io>.
Hi, Tom,

Thanks for the KIP. It definitely addresses one of the pain points in
partition reassignment. Another issue that it also addresses is the ZK node
size limit when writing the reassignment JSON.

My only concern is that the KIP needs to create one watcher per reassigned
partition. This could add overhead in ZK and complexity for debugging when
lots of partitions are being reassigned simultaneously. We could
potentially improve this by introducing a separate ZK path for change
notification as we do for configs. For example, every time we change the
assignment for a set of partitions, we could further write a sequential
node /admin/reassignment_changes/[change_x]. That way, the controller only
needs to watch the change path. Once a change is triggered, the controller
can read everything under /admin/reassignments/.

Jun


On Wed, Dec 6, 2017 at 1:19 PM, Tom Bentley <t....@gmail.com> wrote:

> Hi,
>
> This is still very new, but I wanted some quick feedback on a preliminary
> KIP which could, I think, help with providing an AdminClient API for
> partition reassignment.
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> 236%3A+Interruptible+Partition+Reassignment
>
> I wasn't sure whether to start fleshing out a whole AdminClient API in this
> KIP (which would make it very big, and difficult to read), or whether to
> break it down into smaller KIPs (which makes it easier to read and
> implement in pieces, but harder to get a high-level picture of the ultimate
> destination). For now I've gone for a very small initial KIP, but I'm happy
> to sketch the bigger picture here if people are interested.
>
> Cheers,
>
> Tom
>