You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@kafka.apache.org by George Li <sq...@yahoo.com.INVALID> on 2019/08/03 03:01:49 UTC

Re: [DISCUSS] KIP-491: Preferred Leader Deprioritized List (Temporary Blacklist)

 Hi Colin,
Thanks for looking into this KIP.  Sorry for the late response. been busy. 

If a cluster has MAMY topic partitions, moving this "blacklist" broker to the end of replica list is still a rather "big" operation, involving submitting reassignments.  The KIP-491 way of blacklist is much simpler/easier and can undo easily without changing the replica assignment ordering. 
Major use case for me, a failed broker got swapped with new hardware, and starts up as empty (with latest offset of all partitions), the SLA of retention is 1 day, so before this broker is up to be in-sync for 1 day, we would like to blacklist this broker from serving traffic. after 1 day, the blacklist is removed and run preferred leader election.  This way, no need to run reassignments before/after.  This is the "temporary" use-case.

There are use-cases that this Preferred Leader "blacklist" can be somewhat permanent, as I explained in the AWS data center instances Vs. on-premises data center bare metal machines (heterogenous hardware), that the AWS broker_ids will be blacklisted.  So new topics created,  or existing topic expansion would not make them serve traffic even they could be the preferred leader. 

Please let me know there are more question. 


Thanks,
George

    On Thursday, July 25, 2019, 08:38:28 AM PDT, Colin McCabe <cm...@apache.org> wrote:  
 
 We still want to give the "blacklisted" broker the leadership if nobody else is available.  Therefore, isn't putting a broker on the blacklist pretty much the same as moving it to the last entry in the replicas list and then triggering a preferred leader election?

If we want this to be undone after a certain amount of time, or under certain conditions, that seems like something that would be more effectively done by an external system, rather than putting all these policies into Kafka.

best,
Colin


On Fri, Jul 19, 2019, at 18:23, George Li wrote:
>  Hi Satish,
> Thanks for the reviews and feedbacks.
> 
> > > The following is the requirements this KIP is trying to accomplish:
> > This can be moved to the"Proposed changes" section.
> 
> Updated the KIP-491. 
> 
> > >>The logic to determine the priority/order of which broker should be
> > preferred leader should be modified.  The broker in the preferred leader
> > blacklist should be moved to the end (lowest priority) when
> > determining leadership.
> >
> > I believe there is no change required in the ordering of the preferred
> > replica list. Brokers in the preferred leader blacklist are skipped
> > until other brokers int he list are unavailable.
> 
> Yes. partition assignment remained the same, replica & ordering. The 
> blacklist logic can be optimized during implementation. 
> 
> > >>The blacklist can be at the broker level. However, there might be use cases
> > where a specific topic should blacklist particular brokers, which
> > would be at the
> > Topic level Config. For this use cases of this KIP, it seems that broker level
> > blacklist would suffice.  Topic level preferred leader blacklist might
> > be future enhancement work.
> > 
> > I agree that the broker level preferred leader blacklist would be
> > sufficient. Do you have any use cases which require topic level
> > preferred blacklist?
> 
> 
> 
> I don't have any concrete use cases for Topic level preferred leader 
> blacklist.  One scenarios I can think of is when a broker has high CPU 
> usage, trying to identify the big topics (High MsgIn, High BytesIn, 
> etc), then try to move the leaders away from this broker,  before doing 
> an actual reassignment to change its preferred leader,  try to put this 
> preferred_leader_blacklist in the Topic Level config, and run preferred 
> leader election, and see whether CPU decreases for this broker,  if 
> yes, then do the reassignments to change the preferred leaders to be 
> "permanent" (the topic may have many partitions like 256 that has quite 
> a few of them having this broker as preferred leader).  So this Topic 
> Level config is an easy way of doing trial and check the result. 
> 
> 
> > You can add the below workaround as an item in the rejected alternatives section
> > "Reassigning all the topic/partitions which the intended broker is a
> > replica for."
> 
> Updated the KIP-491. 
> 
> 
> 
> Thanks, 
> George
> 
>    On Friday, July 19, 2019, 08:20:22 AM PDT, Satish Duggana 
> <sa...@gmail.com> wrote:  
>  
>  Thanks for the KIP. I have put my comments below.
> 
> This is a nice improvement to avoid cumbersome maintenance.
> 
> >> The following is the requirements this KIP is trying to accomplish:
>   The ability to add and remove the preferred leader deprioritized
> list/blacklist. e.g. new ZK path/node or new dynamic config.
> 
> This can be moved to the"Proposed changes" section.
> 
> >>The logic to determine the priority/order of which broker should be
> preferred leader should be modified.  The broker in the preferred leader
> blacklist should be moved to the end (lowest priority) when
> determining leadership.
> 
> I believe there is no change required in the ordering of the preferred
> replica list. Brokers in the preferred leader blacklist are skipped
> until other brokers int he list are unavailable.
> 
> >>The blacklist can be at the broker level. However, there might be use cases
> where a specific topic should blacklist particular brokers, which
> would be at the
> Topic level Config. For this use cases of this KIP, it seems that broker level
> blacklist would suffice.  Topic level preferred leader blacklist might
> be future enhancement work.
> 
> I agree that the broker level preferred leader blacklist would be
> sufficient. Do you have any use cases which require topic level
> preferred blacklist?
> 
> You can add the below workaround as an item in the rejected alternatives section
> "Reassigning all the topic/partitions which the intended broker is a
> replica for."
> 
> Thanks,
> Satish.
> 
> On Fri, Jul 19, 2019 at 7:33 AM Stanislav Kozlovski
> <st...@confluent.io> wrote:
> >
> > Hey George,
> >
> > Thanks for the KIP, it's an interesting idea.
> >
> > I was wondering whether we could achieve the same thing via the
> > kafka-reassign-partitions tool. As you had also said in the JIRA,  it is
> > true that this is currently very tedious with the tool. My thoughts are
> > that we could improve the tool and give it the notion of a "blacklisted
> > preferred leader".
> > This would have some benefits like:
> > - more fine-grained control over the blacklist. we may not want to
> > blacklist all the preferred leaders, as that would make the blacklisted
> > broker a follower of last resort which is not very useful. In the cases of
> > an underpowered AWS machine or a controller, you might overshoot and make
> > the broker very underutilized if you completely make it leaderless.
> > - is not permanent. If we are to have a blacklist leaders config,
> > rebalancing tools would also need to know about it and manipulate/respect
> > it to achieve a fair balance.
> > It seems like both problems are tied to balancing partitions, it's just
> > that KIP-491's use case wants to balance them against other factors in a
> > more nuanced way. It makes sense to have both be done from the same place
> >
> > To make note of the motivation section:
> > > Avoid bouncing broker in order to lose its leadership
> > The recommended way to make a broker lose its leadership is to run a
> > reassignment on its partitions
> > > The cross-data center cluster has AWS cloud instances which have less
> > computing power
> > We recommend running Kafka on homogeneous machines. It would be cool if the
> > system supported more flexibility in that regard but that is more nuanced
> > and a preferred leader blacklist may not be the best first approach to the
> > issue
> >
> > Adding a new config which can fundamentally change the way replication is
> > done is complex, both for the system (the replication code is complex
> > enough) and the user. Users would have another potential config that could
> > backfire on them - e.g if left forgotten.
> >
> > Could you think of any downsides to implementing this functionality (or a
> > variation of it) in the kafka-reassign-partitions.sh tool?
> > One downside I can see is that we would not have it handle new partitions
> > created after the "blacklist operation". As a first iteration I think that
> > may be acceptable
> >
> > Thanks,
> > Stanislav
> >
> > On Fri, Jul 19, 2019 at 3:20 AM George Li <sq...@yahoo.com.invalid>
> > wrote:
> >
> > >  Hi,
> > >
> > > Pinging the list for the feedbacks of this KIP-491  (
> > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120736982
> > > )
> > >
> > >
> > > Thanks,
> > > George
> > >
> > >    On Saturday, July 13, 2019, 08:43:25 PM PDT, George Li <
> > > sql_consulting@yahoo.com.INVALID> wrote:
> > >
> > >  Hi,
> > >
> > > I have created KIP-491 (
> > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120736982)
> > > for putting a broker to the preferred leader blacklist or deprioritized
> > > list so when determining leadership,  it's moved to the lowest priority for
> > > some of the listed use-cases.
> > >
> > > Please provide your comments/feedbacks.
> > >
> > > Thanks,
> > > George
> > >
> > >
> > >
> > >  ----- Forwarded Message ----- From: Jose Armando Garcia Sancio (JIRA) <
> > > jira@apache.org>To: "sql_consulting@yahoo.com" <sq...@yahoo.com>Sent:
> > > Tuesday, July 9, 2019, 01:06:05 PM PDTSubject: [jira] [Commented]
> > > (KAFKA-8638) Preferred Leader Blacklist (deprioritized list)
> > >
> > >    [
> > > https://issues.apache.org/jira/browse/KAFKA-8638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881511#comment-16881511
> > > ]
> > >
> > > Jose Armando Garcia Sancio commented on KAFKA-8638:
> > > ---------------------------------------------------
> > >
> > > Thanks for feedback and clear use cases [~sql_consulting].
> > >
> > > > Preferred Leader Blacklist (deprioritized list)
> > > > -----------------------------------------------
> > > >
> > > >                Key: KAFKA-8638
> > > >                URL: https://issues.apache.org/jira/browse/KAFKA-8638
> > > >            Project: Kafka
> > > >          Issue Type: Improvement
> > > >          Components: config, controller, core
> > > >    Affects Versions: 1.1.1, 2.3.0, 2.2.1
> > > >            Reporter: GEORGE LI
> > > >            Assignee: GEORGE LI
> > > >            Priority: Major
> > > >
> > > > Currently, the kafka preferred leader election will pick the broker_id
> > > in the topic/partition replica assignments in a priority order when the
> > > broker is in ISR. The preferred leader is the broker id in the first
> > > position of replica. There are use-cases that, even the first broker in the
> > > replica assignment is in ISR, there is a need for it to be moved to the end
> > > of ordering (lowest priority) when deciding leadership during  preferred
> > > leader election.
> > > > Let’s use topic/partition replica (1,2,3) as an example. 1 is the
> > > preferred leader.  When preferred leadership is run, it will pick 1 as the
> > > leader if it's ISR, if 1 is not online and in ISR, then pick 2, if 2 is not
> > > in ISR, then pick 3 as the leader. There are use cases that, even 1 is in
> > > ISR, we would like it to be moved to the end of ordering (lowest priority)
> > > when deciding leadership during preferred leader election.  Below is a list
> > > of use cases:
> > > > * (If broker_id 1 is a swapped failed host and brought up with last
> > > segments or latest offset without historical data (There is another effort
> > > on this), it's better for it to not serve leadership till it's caught-up.
> > > > * The cross-data center cluster has AWS instances which have less
> > > computing power than the on-prem bare metal machines.  We could put the AWS
> > > broker_ids in Preferred Leader Blacklist, so on-prem brokers can be elected
> > > leaders, without changing the reassignments ordering of the replicas.
> > > > * If the broker_id 1 is constantly losing leadership after some time:
> > > "Flapping". we would want to exclude 1 to be a leader unless all other
> > > brokers of this topic/partition are offline.  The “Flapping” effect was
> > > seen in the past when 2 or more brokers were bad, when they lost leadership
> > > constantly/quickly, the sets of partition replicas they belong to will see
> > > leadership constantly changing.  The ultimate solution is to swap these bad
> > > hosts.  But for quick mitigation, we can also put the bad hosts in the
> > > Preferred Leader Blacklist to move the priority of its being elected as
> > > leaders to the lowest.
> > > > *  If the controller is busy serving an extra load of metadata requests
> > > and other tasks. we would like to put the controller's leaders to other
> > > brokers to lower its CPU load. currently bouncing to lose leadership would
> > > not work for Controller, because after the bounce, the controller fails
> > > over to another broker.
> > > > * Avoid bouncing broker in order to lose its leadership: it would be
> > > good if we have a way to specify which broker should be excluded from
> > > serving traffic/leadership (without changing the replica assignment
> > > ordering by reassignments, even though that's quick), and run preferred
> > > leader election.  A bouncing broker will cause temporary URP, and sometimes
> > > other issues.  Also a bouncing of broker (e.g. broker_id 1) can temporarily
> > > lose all its leadership, but if another broker (e.g. broker_id 2) fails or
> > > gets bounced, some of its leaderships will likely failover to broker_id 1
> > > on a replica with 3 brokers.  If broker_id 1 is in the blacklist, then in
> > > such a scenario even broker_id 2 offline,  the 3rd broker can take
> > > leadership.
> > > > The current work-around of the above is to change the topic/partition's
> > > replica reassignments to move the broker_id 1 from the first position to
> > > the last position and run preferred leader election. e.g. (1, 2, 3) => (2,
> > > 3, 1). This changes the replica reassignments, and we need to keep track of
> > > the original one and restore if things change (e.g. controller fails over
> > > to another broker, the swapped empty broker caught up). That’s a rather
> > > tedious task.
> > > >
> > >
> > >
> > >
> > > --
> > > This message was sent by Atlassian JIRA
> > > (v7.6.3#76005)  

Re: [DISCUSS] KIP-491: Preferred Leader Deprioritized List (Temporary Blacklist)

Posted by Satish Duggana <sa...@gmail.com>.
Hi George,
Thanks for addressing the comments. I do not have any more questions.

On Wed, Aug 7, 2019 at 11:08 AM George Li
<sq...@yahoo.com.invalid> wrote:
>
>  Hi Colin, Satish, Stanislav,
>
> Did I answer all your comments/concerns for KIP-491 ?  Please let me know if you have more questions regarding this feature.  I would like to start coding soon. I hope this feature can get into the open source trunk so every time we upgrade Kafka in our environment, we don't need to cherry pick this.
>
> BTW, I have added below in KIP-491 for auto.leader.rebalance.enable behavior with the new Preferred Leader "Blacklist".
>
> "When auto.leader.rebalance.enable is enabled.  The broker(s) in the preferred leader "blacklist" should be excluded from being elected leaders. "
>
>
> Thanks,
> George
>
>     On Friday, August 2, 2019, 08:02:07 PM PDT, George Li <sq...@yahoo.com.INVALID> wrote:
>
>   Hi Colin,
> Thanks for looking into this KIP.  Sorry for the late response. been busy.
>
> If a cluster has MAMY topic partitions, moving this "blacklist" broker to the end of replica list is still a rather "big" operation, involving submitting reassignments.  The KIP-491 way of blacklist is much simpler/easier and can undo easily without changing the replica assignment ordering.
> Major use case for me, a failed broker got swapped with new hardware, and starts up as empty (with latest offset of all partitions), the SLA of retention is 1 day, so before this broker is up to be in-sync for 1 day, we would like to blacklist this broker from serving traffic. after 1 day, the blacklist is removed and run preferred leader election.  This way, no need to run reassignments before/after.  This is the "temporary" use-case.
>
> There are use-cases that this Preferred Leader "blacklist" can be somewhat permanent, as I explained in the AWS data center instances Vs. on-premises data center bare metal machines (heterogenous hardware), that the AWS broker_ids will be blacklisted.  So new topics created,  or existing topic expansion would not make them serve traffic even they could be the preferred leader.
>
> Please let me know there are more question.
>
>
> Thanks,
> George
>
>     On Thursday, July 25, 2019, 08:38:28 AM PDT, Colin McCabe <cm...@apache.org> wrote:
>
>  We still want to give the "blacklisted" broker the leadership if nobody else is available.  Therefore, isn't putting a broker on the blacklist pretty much the same as moving it to the last entry in the replicas list and then triggering a preferred leader election?
>
> If we want this to be undone after a certain amount of time, or under certain conditions, that seems like something that would be more effectively done by an external system, rather than putting all these policies into Kafka.
>
> best,
> Colin
>
>
> On Fri, Jul 19, 2019, at 18:23, George Li wrote:
> >  Hi Satish,
> > Thanks for the reviews and feedbacks.
> >
> > > > The following is the requirements this KIP is trying to accomplish:
> > > This can be moved to the"Proposed changes" section.
> >
> > Updated the KIP-491.
> >
> > > >>The logic to determine the priority/order of which broker should be
> > > preferred leader should be modified.  The broker in the preferred leader
> > > blacklist should be moved to the end (lowest priority) when
> > > determining leadership.
> > >
> > > I believe there is no change required in the ordering of the preferred
> > > replica list. Brokers in the preferred leader blacklist are skipped
> > > until other brokers int he list are unavailable.
> >
> > Yes. partition assignment remained the same, replica & ordering. The
> > blacklist logic can be optimized during implementation.
> >
> > > >>The blacklist can be at the broker level. However, there might be use cases
> > > where a specific topic should blacklist particular brokers, which
> > > would be at the
> > > Topic level Config. For this use cases of this KIP, it seems that broker level
> > > blacklist would suffice.  Topic level preferred leader blacklist might
> > > be future enhancement work.
> > >
> > > I agree that the broker level preferred leader blacklist would be
> > > sufficient. Do you have any use cases which require topic level
> > > preferred blacklist?
> >
> >
> >
> > I don't have any concrete use cases for Topic level preferred leader
> > blacklist.  One scenarios I can think of is when a broker has high CPU
> > usage, trying to identify the big topics (High MsgIn, High BytesIn,
> > etc), then try to move the leaders away from this broker,  before doing
> > an actual reassignment to change its preferred leader,  try to put this
> > preferred_leader_blacklist in the Topic Level config, and run preferred
> > leader election, and see whether CPU decreases for this broker,  if
> > yes, then do the reassignments to change the preferred leaders to be
> > "permanent" (the topic may have many partitions like 256 that has quite
> > a few of them having this broker as preferred leader).  So this Topic
> > Level config is an easy way of doing trial and check the result.
> >
> >
> > > You can add the below workaround as an item in the rejected alternatives section
> > > "Reassigning all the topic/partitions which the intended broker is a
> > > replica for."
> >
> > Updated the KIP-491.
> >
> >
> >
> > Thanks,
> > George
> >
> >    On Friday, July 19, 2019, 08:20:22 AM PDT, Satish Duggana
> > <sa...@gmail.com> wrote:
> >
> >  Thanks for the KIP. I have put my comments below.
> >
> > This is a nice improvement to avoid cumbersome maintenance.
> >
> > >> The following is the requirements this KIP is trying to accomplish:
> >   The ability to add and remove the preferred leader deprioritized
> > list/blacklist. e.g. new ZK path/node or new dynamic config.
> >
> > This can be moved to the"Proposed changes" section.
> >
> > >>The logic to determine the priority/order of which broker should be
> > preferred leader should be modified.  The broker in the preferred leader
> > blacklist should be moved to the end (lowest priority) when
> > determining leadership.
> >
> > I believe there is no change required in the ordering of the preferred
> > replica list. Brokers in the preferred leader blacklist are skipped
> > until other brokers int he list are unavailable.
> >
> > >>The blacklist can be at the broker level. However, there might be use cases
> > where a specific topic should blacklist particular brokers, which
> > would be at the
> > Topic level Config. For this use cases of this KIP, it seems that broker level
> > blacklist would suffice.  Topic level preferred leader blacklist might
> > be future enhancement work.
> >
> > I agree that the broker level preferred leader blacklist would be
> > sufficient. Do you have any use cases which require topic level
> > preferred blacklist?
> >
> > You can add the below workaround as an item in the rejected alternatives section
> > "Reassigning all the topic/partitions which the intended broker is a
> > replica for."
> >
> > Thanks,
> > Satish.
> >
> > On Fri, Jul 19, 2019 at 7:33 AM Stanislav Kozlovski
> > <st...@confluent.io> wrote:
> > >
> > > Hey George,
> > >
> > > Thanks for the KIP, it's an interesting idea.
> > >
> > > I was wondering whether we could achieve the same thing via the
> > > kafka-reassign-partitions tool. As you had also said in the JIRA,  it is
> > > true that this is currently very tedious with the tool. My thoughts are
> > > that we could improve the tool and give it the notion of a "blacklisted
> > > preferred leader".
> > > This would have some benefits like:
> > > - more fine-grained control over the blacklist. we may not want to
> > > blacklist all the preferred leaders, as that would make the blacklisted
> > > broker a follower of last resort which is not very useful. In the cases of
> > > an underpowered AWS machine or a controller, you might overshoot and make
> > > the broker very underutilized if you completely make it leaderless.
> > > - is not permanent. If we are to have a blacklist leaders config,
> > > rebalancing tools would also need to know about it and manipulate/respect
> > > it to achieve a fair balance.
> > > It seems like both problems are tied to balancing partitions, it's just
> > > that KIP-491's use case wants to balance them against other factors in a
> > > more nuanced way. It makes sense to have both be done from the same place
> > >
> > > To make note of the motivation section:
> > > > Avoid bouncing broker in order to lose its leadership
> > > The recommended way to make a broker lose its leadership is to run a
> > > reassignment on its partitions
> > > > The cross-data center cluster has AWS cloud instances which have less
> > > computing power
> > > We recommend running Kafka on homogeneous machines. It would be cool if the
> > > system supported more flexibility in that regard but that is more nuanced
> > > and a preferred leader blacklist may not be the best first approach to the
> > > issue
> > >
> > > Adding a new config which can fundamentally change the way replication is
> > > done is complex, both for the system (the replication code is complex
> > > enough) and the user. Users would have another potential config that could
> > > backfire on them - e.g if left forgotten.
> > >
> > > Could you think of any downsides to implementing this functionality (or a
> > > variation of it) in the kafka-reassign-partitions.sh tool?
> > > One downside I can see is that we would not have it handle new partitions
> > > created after the "blacklist operation". As a first iteration I think that
> > > may be acceptable
> > >
> > > Thanks,
> > > Stanislav
> > >
> > > On Fri, Jul 19, 2019 at 3:20 AM George Li <sq...@yahoo.com.invalid>
> > > wrote:
> > >
> > > >  Hi,
> > > >
> > > > Pinging the list for the feedbacks of this KIP-491  (
> > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120736982
> > > > )
> > > >
> > > >
> > > > Thanks,
> > > > George
> > > >
> > > >    On Saturday, July 13, 2019, 08:43:25 PM PDT, George Li <
> > > > sql_consulting@yahoo.com.INVALID> wrote:
> > > >
> > > >  Hi,
> > > >
> > > > I have created KIP-491 (
> > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120736982)
> > > > for putting a broker to the preferred leader blacklist or deprioritized
> > > > list so when determining leadership,  it's moved to the lowest priority for
> > > > some of the listed use-cases.
> > > >
> > > > Please provide your comments/feedbacks.
> > > >
> > > > Thanks,
> > > > George
> > > >
> > > >
> > > >
> > > >  ----- Forwarded Message ----- From: Jose Armando Garcia Sancio (JIRA) <
> > > > jira@apache.org>To: "sql_consulting@yahoo.com" <sq...@yahoo.com>Sent:
> > > > Tuesday, July 9, 2019, 01:06:05 PM PDTSubject: [jira] [Commented]
> > > > (KAFKA-8638) Preferred Leader Blacklist (deprioritized list)
> > > >
> > > >    [
> > > > https://issues.apache.org/jira/browse/KAFKA-8638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881511#comment-16881511
> > > > ]
> > > >
> > > > Jose Armando Garcia Sancio commented on KAFKA-8638:
> > > > ---------------------------------------------------
> > > >
> > > > Thanks for feedback and clear use cases [~sql_consulting].
> > > >
> > > > > Preferred Leader Blacklist (deprioritized list)
> > > > > -----------------------------------------------
> > > > >
> > > > >                Key: KAFKA-8638
> > > > >                URL: https://issues.apache.org/jira/browse/KAFKA-8638
> > > > >            Project: Kafka
> > > > >          Issue Type: Improvement
> > > > >          Components: config, controller, core
> > > > >    Affects Versions: 1.1.1, 2.3.0, 2.2.1
> > > > >            Reporter: GEORGE LI
> > > > >            Assignee: GEORGE LI
> > > > >            Priority: Major
> > > > >
> > > > > Currently, the kafka preferred leader election will pick the broker_id
> > > > in the topic/partition replica assignments in a priority order when the
> > > > broker is in ISR. The preferred leader is the broker id in the first
> > > > position of replica. There are use-cases that, even the first broker in the
> > > > replica assignment is in ISR, there is a need for it to be moved to the end
> > > > of ordering (lowest priority) when deciding leadership during  preferred
> > > > leader election.
> > > > > Let’s use topic/partition replica (1,2,3) as an example. 1 is the
> > > > preferred leader.  When preferred leadership is run, it will pick 1 as the
> > > > leader if it's ISR, if 1 is not online and in ISR, then pick 2, if 2 is not
> > > > in ISR, then pick 3 as the leader. There are use cases that, even 1 is in
> > > > ISR, we would like it to be moved to the end of ordering (lowest priority)
> > > > when deciding leadership during preferred leader election.  Below is a list
> > > > of use cases:
> > > > > * (If broker_id 1 is a swapped failed host and brought up with last
> > > > segments or latest offset without historical data (There is another effort
> > > > on this), it's better for it to not serve leadership till it's caught-up.
> > > > > * The cross-data center cluster has AWS instances which have less
> > > > computing power than the on-prem bare metal machines.  We could put the AWS
> > > > broker_ids in Preferred Leader Blacklist, so on-prem brokers can be elected
> > > > leaders, without changing the reassignments ordering of the replicas.
> > > > > * If the broker_id 1 is constantly losing leadership after some time:
> > > > "Flapping". we would want to exclude 1 to be a leader unless all other
> > > > brokers of this topic/partition are offline.  The “Flapping” effect was
> > > > seen in the past when 2 or more brokers were bad, when they lost leadership
> > > > constantly/quickly, the sets of partition replicas they belong to will see
> > > > leadership constantly changing.  The ultimate solution is to swap these bad
> > > > hosts.  But for quick mitigation, we can also put the bad hosts in the
> > > > Preferred Leader Blacklist to move the priority of its being elected as
> > > > leaders to the lowest.
> > > > > *  If the controller is busy serving an extra load of metadata requests
> > > > and other tasks. we would like to put the controller's leaders to other
> > > > brokers to lower its CPU load. currently bouncing to lose leadership would
> > > > not work for Controller, because after the bounce, the controller fails
> > > > over to another broker.
> > > > > * Avoid bouncing broker in order to lose its leadership: it would be
> > > > good if we have a way to specify which broker should be excluded from
> > > > serving traffic/leadership (without changing the replica assignment
> > > > ordering by reassignments, even though that's quick), and run preferred
> > > > leader election.  A bouncing broker will cause temporary URP, and sometimes
> > > > other issues.  Also a bouncing of broker (e.g. broker_id 1) can temporarily
> > > > lose all its leadership, but if another broker (e.g. broker_id 2) fails or
> > > > gets bounced, some of its leaderships will likely failover to broker_id 1
> > > > on a replica with 3 brokers.  If broker_id 1 is in the blacklist, then in
> > > > such a scenario even broker_id 2 offline,  the 3rd broker can take
> > > > leadership.
> > > > > The current work-around of the above is to change the topic/partition's
> > > > replica reassignments to move the broker_id 1 from the first position to
> > > > the last position and run preferred leader election. e.g. (1, 2, 3) => (2,
> > > > 3, 1). This changes the replica reassignments, and we need to keep track of
> > > > the original one and restore if things change (e.g. controller fails over
> > > > to another broker, the swapped empty broker caught up). That’s a rather
> > > > tedious task.
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > This message was sent by Atlassian JIRA
> > > > (v7.6.3#76005)

Re: [DISCUSS] KIP-491: Preferred Leader Deprioritized List (Temporary Blacklist)

Posted by George Li <sq...@yahoo.com.INVALID>.
 Hi Colin, Satish, Stanislav, 

Did I answer all your comments/concerns for KIP-491 ?  Please let me know if you have more questions regarding this feature.  I would like to start coding soon. I hope this feature can get into the open source trunk so every time we upgrade Kafka in our environment, we don't need to cherry pick this.

BTW, I have added below in KIP-491 for auto.leader.rebalance.enable behavior with the new Preferred Leader "Blacklist".  

"When auto.leader.rebalance.enable is enabled.  The broker(s) in the preferred leader "blacklist" should be excluded from being elected leaders. "


Thanks,
George

    On Friday, August 2, 2019, 08:02:07 PM PDT, George Li <sq...@yahoo.com.INVALID> wrote:  
 
  Hi Colin,
Thanks for looking into this KIP.  Sorry for the late response. been busy. 

If a cluster has MAMY topic partitions, moving this "blacklist" broker to the end of replica list is still a rather "big" operation, involving submitting reassignments.  The KIP-491 way of blacklist is much simpler/easier and can undo easily without changing the replica assignment ordering. 
Major use case for me, a failed broker got swapped with new hardware, and starts up as empty (with latest offset of all partitions), the SLA of retention is 1 day, so before this broker is up to be in-sync for 1 day, we would like to blacklist this broker from serving traffic. after 1 day, the blacklist is removed and run preferred leader election.  This way, no need to run reassignments before/after.  This is the "temporary" use-case.

There are use-cases that this Preferred Leader "blacklist" can be somewhat permanent, as I explained in the AWS data center instances Vs. on-premises data center bare metal machines (heterogenous hardware), that the AWS broker_ids will be blacklisted.  So new topics created,  or existing topic expansion would not make them serve traffic even they could be the preferred leader. 

Please let me know there are more question. 


Thanks,
George

    On Thursday, July 25, 2019, 08:38:28 AM PDT, Colin McCabe <cm...@apache.org> wrote:  
 
 We still want to give the "blacklisted" broker the leadership if nobody else is available.  Therefore, isn't putting a broker on the blacklist pretty much the same as moving it to the last entry in the replicas list and then triggering a preferred leader election?

If we want this to be undone after a certain amount of time, or under certain conditions, that seems like something that would be more effectively done by an external system, rather than putting all these policies into Kafka.

best,
Colin


On Fri, Jul 19, 2019, at 18:23, George Li wrote:
>  Hi Satish,
> Thanks for the reviews and feedbacks.
> 
> > > The following is the requirements this KIP is trying to accomplish:
> > This can be moved to the"Proposed changes" section.
> 
> Updated the KIP-491. 
> 
> > >>The logic to determine the priority/order of which broker should be
> > preferred leader should be modified.  The broker in the preferred leader
> > blacklist should be moved to the end (lowest priority) when
> > determining leadership.
> >
> > I believe there is no change required in the ordering of the preferred
> > replica list. Brokers in the preferred leader blacklist are skipped
> > until other brokers int he list are unavailable.
> 
> Yes. partition assignment remained the same, replica & ordering. The 
> blacklist logic can be optimized during implementation. 
> 
> > >>The blacklist can be at the broker level. However, there might be use cases
> > where a specific topic should blacklist particular brokers, which
> > would be at the
> > Topic level Config. For this use cases of this KIP, it seems that broker level
> > blacklist would suffice.  Topic level preferred leader blacklist might
> > be future enhancement work.
> > 
> > I agree that the broker level preferred leader blacklist would be
> > sufficient. Do you have any use cases which require topic level
> > preferred blacklist?
> 
> 
> 
> I don't have any concrete use cases for Topic level preferred leader 
> blacklist.  One scenarios I can think of is when a broker has high CPU 
> usage, trying to identify the big topics (High MsgIn, High BytesIn, 
> etc), then try to move the leaders away from this broker,  before doing 
> an actual reassignment to change its preferred leader,  try to put this 
> preferred_leader_blacklist in the Topic Level config, and run preferred 
> leader election, and see whether CPU decreases for this broker,  if 
> yes, then do the reassignments to change the preferred leaders to be 
> "permanent" (the topic may have many partitions like 256 that has quite 
> a few of them having this broker as preferred leader).  So this Topic 
> Level config is an easy way of doing trial and check the result. 
> 
> 
> > You can add the below workaround as an item in the rejected alternatives section
> > "Reassigning all the topic/partitions which the intended broker is a
> > replica for."
> 
> Updated the KIP-491. 
> 
> 
> 
> Thanks, 
> George
> 
>    On Friday, July 19, 2019, 08:20:22 AM PDT, Satish Duggana 
> <sa...@gmail.com> wrote:  
>  
>  Thanks for the KIP. I have put my comments below.
> 
> This is a nice improvement to avoid cumbersome maintenance.
> 
> >> The following is the requirements this KIP is trying to accomplish:
>   The ability to add and remove the preferred leader deprioritized
> list/blacklist. e.g. new ZK path/node or new dynamic config.
> 
> This can be moved to the"Proposed changes" section.
> 
> >>The logic to determine the priority/order of which broker should be
> preferred leader should be modified.  The broker in the preferred leader
> blacklist should be moved to the end (lowest priority) when
> determining leadership.
> 
> I believe there is no change required in the ordering of the preferred
> replica list. Brokers in the preferred leader blacklist are skipped
> until other brokers int he list are unavailable.
> 
> >>The blacklist can be at the broker level. However, there might be use cases
> where a specific topic should blacklist particular brokers, which
> would be at the
> Topic level Config. For this use cases of this KIP, it seems that broker level
> blacklist would suffice.  Topic level preferred leader blacklist might
> be future enhancement work.
> 
> I agree that the broker level preferred leader blacklist would be
> sufficient. Do you have any use cases which require topic level
> preferred blacklist?
> 
> You can add the below workaround as an item in the rejected alternatives section
> "Reassigning all the topic/partitions which the intended broker is a
> replica for."
> 
> Thanks,
> Satish.
> 
> On Fri, Jul 19, 2019 at 7:33 AM Stanislav Kozlovski
> <st...@confluent.io> wrote:
> >
> > Hey George,
> >
> > Thanks for the KIP, it's an interesting idea.
> >
> > I was wondering whether we could achieve the same thing via the
> > kafka-reassign-partitions tool. As you had also said in the JIRA,  it is
> > true that this is currently very tedious with the tool. My thoughts are
> > that we could improve the tool and give it the notion of a "blacklisted
> > preferred leader".
> > This would have some benefits like:
> > - more fine-grained control over the blacklist. we may not want to
> > blacklist all the preferred leaders, as that would make the blacklisted
> > broker a follower of last resort which is not very useful. In the cases of
> > an underpowered AWS machine or a controller, you might overshoot and make
> > the broker very underutilized if you completely make it leaderless.
> > - is not permanent. If we are to have a blacklist leaders config,
> > rebalancing tools would also need to know about it and manipulate/respect
> > it to achieve a fair balance.
> > It seems like both problems are tied to balancing partitions, it's just
> > that KIP-491's use case wants to balance them against other factors in a
> > more nuanced way. It makes sense to have both be done from the same place
> >
> > To make note of the motivation section:
> > > Avoid bouncing broker in order to lose its leadership
> > The recommended way to make a broker lose its leadership is to run a
> > reassignment on its partitions
> > > The cross-data center cluster has AWS cloud instances which have less
> > computing power
> > We recommend running Kafka on homogeneous machines. It would be cool if the
> > system supported more flexibility in that regard but that is more nuanced
> > and a preferred leader blacklist may not be the best first approach to the
> > issue
> >
> > Adding a new config which can fundamentally change the way replication is
> > done is complex, both for the system (the replication code is complex
> > enough) and the user. Users would have another potential config that could
> > backfire on them - e.g if left forgotten.
> >
> > Could you think of any downsides to implementing this functionality (or a
> > variation of it) in the kafka-reassign-partitions.sh tool?
> > One downside I can see is that we would not have it handle new partitions
> > created after the "blacklist operation". As a first iteration I think that
> > may be acceptable
> >
> > Thanks,
> > Stanislav
> >
> > On Fri, Jul 19, 2019 at 3:20 AM George Li <sq...@yahoo.com.invalid>
> > wrote:
> >
> > >  Hi,
> > >
> > > Pinging the list for the feedbacks of this KIP-491  (
> > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120736982
> > > )
> > >
> > >
> > > Thanks,
> > > George
> > >
> > >    On Saturday, July 13, 2019, 08:43:25 PM PDT, George Li <
> > > sql_consulting@yahoo.com.INVALID> wrote:
> > >
> > >  Hi,
> > >
> > > I have created KIP-491 (
> > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120736982)
> > > for putting a broker to the preferred leader blacklist or deprioritized
> > > list so when determining leadership,  it's moved to the lowest priority for
> > > some of the listed use-cases.
> > >
> > > Please provide your comments/feedbacks.
> > >
> > > Thanks,
> > > George
> > >
> > >
> > >
> > >  ----- Forwarded Message ----- From: Jose Armando Garcia Sancio (JIRA) <
> > > jira@apache.org>To: "sql_consulting@yahoo.com" <sq...@yahoo.com>Sent:
> > > Tuesday, July 9, 2019, 01:06:05 PM PDTSubject: [jira] [Commented]
> > > (KAFKA-8638) Preferred Leader Blacklist (deprioritized list)
> > >
> > >    [
> > > https://issues.apache.org/jira/browse/KAFKA-8638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881511#comment-16881511
> > > ]
> > >
> > > Jose Armando Garcia Sancio commented on KAFKA-8638:
> > > ---------------------------------------------------
> > >
> > > Thanks for feedback and clear use cases [~sql_consulting].
> > >
> > > > Preferred Leader Blacklist (deprioritized list)
> > > > -----------------------------------------------
> > > >
> > > >                Key: KAFKA-8638
> > > >                URL: https://issues.apache.org/jira/browse/KAFKA-8638
> > > >            Project: Kafka
> > > >          Issue Type: Improvement
> > > >          Components: config, controller, core
> > > >    Affects Versions: 1.1.1, 2.3.0, 2.2.1
> > > >            Reporter: GEORGE LI
> > > >            Assignee: GEORGE LI
> > > >            Priority: Major
> > > >
> > > > Currently, the kafka preferred leader election will pick the broker_id
> > > in the topic/partition replica assignments in a priority order when the
> > > broker is in ISR. The preferred leader is the broker id in the first
> > > position of replica. There are use-cases that, even the first broker in the
> > > replica assignment is in ISR, there is a need for it to be moved to the end
> > > of ordering (lowest priority) when deciding leadership during  preferred
> > > leader election.
> > > > Let’s use topic/partition replica (1,2,3) as an example. 1 is the
> > > preferred leader.  When preferred leadership is run, it will pick 1 as the
> > > leader if it's ISR, if 1 is not online and in ISR, then pick 2, if 2 is not
> > > in ISR, then pick 3 as the leader. There are use cases that, even 1 is in
> > > ISR, we would like it to be moved to the end of ordering (lowest priority)
> > > when deciding leadership during preferred leader election.  Below is a list
> > > of use cases:
> > > > * (If broker_id 1 is a swapped failed host and brought up with last
> > > segments or latest offset without historical data (There is another effort
> > > on this), it's better for it to not serve leadership till it's caught-up.
> > > > * The cross-data center cluster has AWS instances which have less
> > > computing power than the on-prem bare metal machines.  We could put the AWS
> > > broker_ids in Preferred Leader Blacklist, so on-prem brokers can be elected
> > > leaders, without changing the reassignments ordering of the replicas.
> > > > * If the broker_id 1 is constantly losing leadership after some time:
> > > "Flapping". we would want to exclude 1 to be a leader unless all other
> > > brokers of this topic/partition are offline.  The “Flapping” effect was
> > > seen in the past when 2 or more brokers were bad, when they lost leadership
> > > constantly/quickly, the sets of partition replicas they belong to will see
> > > leadership constantly changing.  The ultimate solution is to swap these bad
> > > hosts.  But for quick mitigation, we can also put the bad hosts in the
> > > Preferred Leader Blacklist to move the priority of its being elected as
> > > leaders to the lowest.
> > > > *  If the controller is busy serving an extra load of metadata requests
> > > and other tasks. we would like to put the controller's leaders to other
> > > brokers to lower its CPU load. currently bouncing to lose leadership would
> > > not work for Controller, because after the bounce, the controller fails
> > > over to another broker.
> > > > * Avoid bouncing broker in order to lose its leadership: it would be
> > > good if we have a way to specify which broker should be excluded from
> > > serving traffic/leadership (without changing the replica assignment
> > > ordering by reassignments, even though that's quick), and run preferred
> > > leader election.  A bouncing broker will cause temporary URP, and sometimes
> > > other issues.  Also a bouncing of broker (e.g. broker_id 1) can temporarily
> > > lose all its leadership, but if another broker (e.g. broker_id 2) fails or
> > > gets bounced, some of its leaderships will likely failover to broker_id 1
> > > on a replica with 3 brokers.  If broker_id 1 is in the blacklist, then in
> > > such a scenario even broker_id 2 offline,  the 3rd broker can take
> > > leadership.
> > > > The current work-around of the above is to change the topic/partition's
> > > replica reassignments to move the broker_id 1 from the first position to
> > > the last position and run preferred leader election. e.g. (1, 2, 3) => (2,
> > > 3, 1). This changes the replica reassignments, and we need to keep track of
> > > the original one and restore if things change (e.g. controller fails over
> > > to another broker, the swapped empty broker caught up). That’s a rather
> > > tedious task.
> > > >
> > >
> > >
> > >
> > > --
> > > This message was sent by Atlassian JIRA
> > > (v7.6.3#76005)    

Re: [DISCUSS] KIP-491: Preferred Leader Deprioritized List (Temporary Blacklist)

Posted by Harsha Ch <ha...@gmail.com>.
Hi Stanislav,

               Thanks for the comments. The proposal we are making is not about optimizing Big-O but instead provide a simpler way of stopping a broker becoming leader.  If we want to go with making this an option and providing a tool which abstracts moving the broker to end preferred leader list , it needs to do it for all the partitions that broker is leader for. As said in the above comment a broker i.e leader for 1000 partitions we have to this for all the partitions.  Instead of having a blacklist will help simplify this process and we can provide monitoring/alerts on such list. 

"This sounds like a bit of a hack. If that is the concern, why not propose a KIP that addresses the specific issue?"

Do you mind shedding some light what issue you are talking to propose a KIP for?

Replication is a challenge when we are bringing up a new node.  If you have retention period of 3 days there is honestly no way to do it via online replication without taking a hit on latency SLAs. 

Is your ask to find a way to fix the replication itself when we are bringing a new broker from  no data.

"Having a blacklist you control still seems like a workaround given that Kafka itself knows when the topic retention would allow you to switch that replica to a leader"

Not sure how its making it any complicated by having a single zk path to have a list of brokers.

Thanks,

Harsha

On Mon, Sep 09, 2019 at 3:55 PM, Stanislav Kozlovski < stanislav@confluent.io > wrote:

> 
> 
> 
> I agree with Colin that the same result should be achievable through
> proper abstraction in a tool. Even if that might be "4xO(N)" operations,
> that is still not a lot - it is still classified as O(N)
> 
> 
> 
> Let's say a healthy broker hosting 3000 partitions, and of which 1000 are
> 
> 
>> 
>> 
>> the preferred leaders (leader count is 1000). There is a hardware failure
>> (disk/memory, etc.), and kafka process crashed. We swap this host with
>> another host but keep the same broker. id ( http://broker.id/ ) , when this
>> new broker coming up, it has no historical data, and we manage to have the
>> current last offsets of all partitions set in the
>> replication-offset-checkpoint (if we don't set them, it could cause crazy
>> ReplicaFetcher pulling of historical data from other brokers and cause
>> cluster high latency and other instabilities), so when Kafka is brought
>> up, it is quickly catching up as followers in the ISR. Note, we have
>> auto.leader.rebalance.enable disabled, so it's not serving any traffic as
>> leaders (leader count = 0), even there are 1000 partitions that this
>> broker is the Preferred Leader. We need to make this broker not serving
>> traffic for a few hours or days depending on the SLA of the topic
>> retention requirement until after it's having enough historical data.
>> 
>> 
> 
> 
> 
> This sounds like a bit of a hack. If that is the concern, why not propose
> a KIP that addresses the specific issue? Having a blacklist you control
> still seems like a workaround given that Kafka itself knows when the topic
> retention would allow you to switch that replica to a leader
> 
> 
> 
> I really hope we can come up with a solution that avoids complicating the
> controller and state machine logic further.
> Could you please list out the main drawbacks of abstract this away in the
> reassignments tool (or a new tool)?
> 
> 
> 
> On Mon, Sep 9, 2019 at 7:53 AM Colin McCabe < cmccabe@ apache. org (
> cmccabe@apache.org ) > wrote:
> 
> 
>> 
>> 
>> On Sat, Sep 7, 2019, at 09:21, Harsha Chintalapani wrote:
>> 
>> 
>>> 
>>> 
>>> Hi Colin,
>>> Can you give us more details on why you don't want this to be part of the
>>> Kafka core. You are proposing KIP-500 which will take away zookeeper and
>>> writing this interim tools to change the zookeeper metadata doesn't make
>>> sense to me.
>>> 
>>> 
>> 
>> 
>> 
>> Hi Harsha,
>> 
>> 
>> 
>> The reassignment API described in KIP-455, which will be part of Kafka
>> 2.4, doesn't rely on ZooKeeper. This API will stay the same after KIP-500
>> is implemented.
>> 
>> 
>>> 
>>> 
>>> As George pointed out there are
>>> several benefits having it in the system itself instead of asking users to
>>> hack bunch of json files to deal with outage scenario.
>>> 
>>> 
>> 
>> 
>> 
>> In both cases, the user just has to run a shell command, right? In both
>> cases, the user has to remember to undo the command later when they want
>> the broker to be treated normally again. And in both cases, the user
>> should probably be running an external rebalancing tool to avoid having to
>> run these commands manually. :)
>> 
>> 
>> 
>> best,
>> Colin
>> 
>> 
>>> 
>>> 
>>> Thanks,
>>> Harsha
>>> 
>>> 
>>> 
>>> On Fri, Sep 6, 2019 at 4:36 PM George Li < sql_consulting@ yahoo. com (
>>> sql_consulting@yahoo.com )
>>> 
>>> 
>> 
>> 
>> 
>> .invalid>
>> 
>> 
>>> 
>>> 
>>> wrote:
>>> 
>>> 
>>>> 
>>>> 
>>>> Hi Colin,
>>>> 
>>>> 
>>>> 
>>>> Thanks for the feedback. The "separate set of metadata about
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> blacklists"
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> in KIP-491 is just the list of broker ids. Usually 1 or 2 or a couple
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> in
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> the cluster. Should be easier than keeping json files? e.g. what if
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> we
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> first blacklist broker_id_1, then another broker_id_2 has issues, and
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> we
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> need to write out another json file to restore later (and in which
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> order)?
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> Using blacklist, we can just add the broker_id_2 to the existing one.
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> and
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> remove whatever broker_id returning to good state without worrying
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> how(the
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> ordering of putting the broker to blacklist) to restore.
>>>> 
>>>> 
>>>> 
>>>> For topic level config, the blacklist will be tied to topic/partition(e.g.
>>>> Configs:
>>>> topic.preferred.leader.blacklist=0:101,102;1:103 where 0 & 1 is the
>>>> partition#, 101,102,103 are the blacklist broker_ids), and easier to
>>>> update/remove, no need for external json files?
>>>> 
>>>> 
>>>> 
>>>> Thanks,
>>>> George
>>>> 
>>>> 
>>>> 
>>>> On Friday, September 6, 2019, 02:20:33 PM PDT, Colin McCabe < cmccabe@ apache.
>>>> org ( cmccabe@apache.org ) > wrote:
>>>> 
>>>> 
>>>> 
>>>> One possibility would be writing a new command-line tool that would
>>>> deprioritize a given replica using the new KIP-455 API. Then it could
>>>> write out a JSON files containing the old priorities, which could be
>>>> restored when (or if) we needed to do so. This seems like it might be
>>>> simpler and easier to maintain than a separate set of metadata about
>>>> blacklists.
>>>> 
>>>> 
>>>> 
>>>> best,
>>>> Colin
>>>> 
>>>> 
>>>> 
>>>> On Fri, Sep 6, 2019, at 11:58, George Li wrote:
>>>> 
>>>> 
>>>>> 
>>>>> 
>>>>> Hi,
>>>>> 
>>>>> 
>>>>> 
>>>>> Just want to ping and bubble up the discussion of KIP-491.
>>>>> 
>>>>> 
>>>>> 
>>>>> On a large scale of Kafka clusters with thousands of brokers in many
>>>>> clusters. Frequent hardware failures are common, although the
>>>>> reassignments to change the preferred leaders is a workaround, it incurs
>>>>> unnecessary additional work than the proposed preferred leader blacklist
>>>>> in KIP-491, and hard to scale.
>>>>> 
>>>>> 
>>>>> 
>>>>> I am wondering whether others using Kafka in a big scale running into same
>>>>> problem.
>>>>> 
>>>>> 
>>>>> 
>>>>> Satish,
>>>>> 
>>>>> 
>>>>> 
>>>>> Regarding your previous question about whether there is use-case for
>>>>> TopicLevel preferred leader "blacklist", I thought about one use-case: to
>>>>> improve rebalance/reassignment, the large partition
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> will
>> 
>> 
>>> 
>>>> 
>>>>> 
>>>>> 
>>>>> usually cause performance/stability issues, planning to change the
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> say
>> 
>> 
>>> 
>>>> 
>>>>> 
>>>>> 
>>>>> the New Replica will start with Leader's latest offset(this way the
>>>>> replica is almost instantly in the ISR and reassignment completed),
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> and
>> 
>> 
>>> 
>>>> 
>>>>> 
>>>>> 
>>>>> put this partition's NewReplica into Preferred Leader "Blacklist" at the
>>>>> Topic Level config for that partition. After sometime(retention time),
>>>>> this new replica has caught up and ready to serve traffic, update/remove
>>>>> the TopicConfig for this partition's preferred leader blacklist.
>>>>> 
>>>>> 
>>>>> 
>>>>> I will update the KIP-491 later for this use case of Topic Level
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> config
>> 
>> 
>>> 
>>>> 
>>>>> 
>>>>> 
>>>>> for Preferred Leader Blacklist.
>>>>> 
>>>>> 
>>>>> 
>>>>> Thanks,
>>>>> George
>>>>> 
>>>>> 
>>>>> 
>>>>> On Wednesday, August 7, 2019, 07:43:55 PM PDT, George Li
>>>>> < sql_consulting@ yahoo. com ( sql_consulting@yahoo.com ) > wrote:
>>>>> 
>>>>> 
>>>>> 
>>>>> Hi Colin,
>>>>> 
>>>>> 
>>>>>> 
>>>>>> 
>>>>>> In your example, I think we're comparing apples and oranges. You
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> started by outlining a scenario where "an empty broker... comes up...
>>>> [without] any > leadership[s]." But then you criticize using
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> reassignment
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> to switch the order of preferred replicas because it "would not
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> actually
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> switch the leader > automatically." If the empty broker doesn't have
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> any
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> leaderships, there is nothing to be switched, right?
>>>> 
>>>> 
>>>>> 
>>>>> 
>>>>> Let me explained in details of this particular use case example for
>>>>> comparing apples to apples.
>>>>> 
>>>>> 
>>>>> 
>>>>> Let's say a healthy broker hosting 3000 partitions, and of which 1000 are
>>>>> the preferred leaders (leader count is 1000). There is a hardware failure
>>>>> (disk/memory, etc.), and kafka process crashed. We swap this host with
>>>>> another host but keep the same broker. id ( http://broker.id/ ) , when this
>>>>> new broker coming up, it has no historical data, and we manage to have
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> the
>> 
>> 
>>> 
>>>> 
>>>>> 
>>>>> 
>>>>> current last offsets of all partitions set in
>>>>> the replication-offset-checkpoint (if we don't set them, it could
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> cause
>> 
>> 
>>> 
>>>> 
>>>>> 
>>>>> 
>>>>> crazy ReplicaFetcher pulling of historical data from other brokers
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> and
>> 
>> 
>>> 
>>>> 
>>>>> 
>>>>> 
>>>>> cause cluster high latency and other instabilities), so when Kafka is
>>>>> brought up, it is quickly catching up as followers in the ISR. Note, we
>>>>> have auto.leader.rebalance.enable disabled, so it's not serving
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> any
>> 
>> 
>>> 
>>>> 
>>>>> 
>>>>> 
>>>>> traffic as leaders (leader count = 0), even there are 1000 partitions that
>>>>> this broker is the Preferred Leader.
>>>>> 
>>>>> 
>>>>> 
>>>>> We need to make this broker not serving traffic for a few hours or
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> days
>> 
>> 
>>> 
>>>> 
>>>>> 
>>>>> 
>>>>> depending on the SLA of the topic retention requirement until after it's
>>>>> having enough historical data.
>>>>> 
>>>>> 
>>>>> 
>>>>> * The traditional way using the reassignments to move this broker in that
>>>>> 1000 partitions where it's the preferred leader to the end of assignment,
>>>>> this is O(N) operation. and from my experience, we can't submit all 1000
>>>>> at the same time, otherwise cause higher latencies
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> even
>> 
>> 
>>> 
>>>> 
>>>>> 
>>>>> 
>>>>> the reassignment in this case can complete almost instantly. After
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> a
>> 
>> 
>>> 
>>>> 
>>>>> 
>>>>> 
>>>>> few hours/days whatever, this broker is ready to serve traffic, we have to
>>>>> run reassignments again to restore that 1000 partitions preferred leaders
>>>>> for this broker: O(N) operation. then run
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> preferred
>> 
>> 
>>> 
>>>> 
>>>>> 
>>>>> 
>>>>> leader election O(N) again. So total 3 x O(N) operations. The point is
>>>>> since the new empty broker is expected to be the same as the old
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> one
>> 
>> 
>>> 
>>>> 
>>>>> 
>>>>> 
>>>>> in terms of hosting partition/leaders, it would seem unnecessary to
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> do
>> 
>> 
>>> 
>>>> 
>>>>> 
>>>>> 
>>>>> reassignments (ordering of replica) during the broker catching up
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> time.
>> 
>> 
>>> 
>>>> 
>>>>> 
>>>>> 
>>>>> * The new feature Preferred Leader "Blacklist": just need to put a dynamic
>>>>> config to indicate that this broker should be considered
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> leader
>> 
>> 
>>> 
>>>> 
>>>>> 
>>>>> 
>>>>> (preferred leader election or broker failover or unclean leader election)
>>>>> to the lowest priority. NO need to run any reassignments. After a few
>>>>> hours/days, when this broker is ready, remove the dynamic config, and run
>>>>> preferred leader election and this broker will serve traffic for that 1000
>>>>> original partitions it was the preferred
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> leader.
>> 
>> 
>>> 
>>>> 
>>>>> 
>>>>> 
>>>>> So total 1 x O(N) operation.
>>>>> 
>>>>> 
>>>>> 
>>>>> If auto.leader.rebalance.enable is enabled, the Preferred Leader
>>>>> "Blacklist" can be put it before Kafka is started to prevent this broker
>>>>> serving traffic. In the traditional way of running reassignments, once the
>>>>> broker is up,
>>>>> with auto.leader.rebalance.enable , if leadership starts going to
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> this
>> 
>> 
>>> 
>>>> 
>>>>> 
>>>>> 
>>>>> new empty broker, it might have to do preferred leader election after
>>>>> reassignments to remove its leaderships. e.g. (1,2,3) => (2,3,1)
>>>>> reassignment only change the ordering, 1 remains as the current
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> leader,
>> 
>> 
>>> 
>>>> 
>>>>> 
>>>>> 
>>>>> and needs prefer leader election to change to 2 after reassignment.
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> so
>> 
>> 
>>> 
>>>> 
>>>>> 
>>>>> 
>>>>> potentially one more O(N) operation.
>>>>> 
>>>>> 
>>>>> 
>>>>> I hope the above example can show how easy to "blacklist" a broker serving
>>>>> leadership. For someone managing Production Kafka cluster, it's important
>>>>> to react fast to certain alerts and mitigate/resolve some issues. As I
>>>>> listed the other use cases in KIP-291, I think this feature can make the
>>>>> Kafka product more easier to manage/operate.
>>>>> 
>>>>> 
>>>>>> 
>>>>>> 
>>>>>> In general, using an external rebalancing tool like Cruise Control
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> is
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> a good idea to keep things balanced without having deal with manual
>>>> rebalancing. > We expect more and more people who have a complex or
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> large
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> cluster will start using tools like this.
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>> 
>>>>>> However, if you choose to do manual rebalancing, it shouldn't be
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> that
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> bad. You would save the existing partition ordering before making your
>>>> changes, then> make your changes (perhaps by running a simple command
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> line
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> tool that switches the order of the replicas). Then, once you felt
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> like
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> the broker was ready to> serve traffic, you could just re-apply the old
>>>> ordering which you had saved.
>>>> 
>>>> 
>>>>> 
>>>>> 
>>>>> We do have our own rebalancing tool which has its own criteria like Rack
>>>>> diversity, disk usage, spread partitions/leaders across all brokers in the
>>>>> cluster per topic, leadership Bytes/BytesIn served per broker, etc. We can
>>>>> run reassignments. The point is whether it's really necessary, and if
>>>>> there is more effective, easier, safer way
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> to
>> 
>> 
>>> 
>>>> 
>>>>> 
>>>>> 
>>>>> do it.
>>>>> 
>>>>> 
>>>>> 
>>>>> take another use case example of taking leadership out of busy Controller
>>>>> to give it more power to serve metadata requests and other work. The
>>>>> controller can failover, with the preferred leader
>>>>> "blacklist", it does not have to run reassignments again when controller
>>>>> failover, just change the blacklisted broker_id.
>>>>> 
>>>>> 
>>>>>> 
>>>>>> 
>>>>>> I was thinking about a PlacementPolicy filling the role of
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> preventing
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> people from creating single-replica partitions on a node that we didn't
>>>> want to > ever be the leader. I thought that it could also prevent
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> people
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> from designating those nodes as preferred leaders during topic
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> creation, or
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> Kafka from doing> itduring random topic creation. I was assuming that
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> the
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> PlacementPolicy would determine which nodes were which through static
>>>> configuration keys. I agree> static configuration keys are somewhat
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> less
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> flexible than dynamic configuration.
>>>> 
>>>> 
>>>>> 
>>>>> 
>>>>> I think single-replica partition might not be a good example. There should
>>>>> not be any single-replica partition at all. If yes. it's probably because
>>>>> of trying to save disk space with less replicas. I think at least minimum
>>>>> 2. The user purposely creating single-replica partition will take full
>>>>> responsibilities of data loss and unavailability when a broker fails or
>>>>> under maintenance.
>>>>> 
>>>>> 
>>>>> 
>>>>> I think it would be better to use dynamic instead of static config.
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> I
>> 
>> 
>>> 
>>>> 
>>>>> 
>>>>> 
>>>>> also think it would be better to have topic creation Policy enforced
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> in
>> 
>> 
>>> 
>>>> 
>>>>> 
>>>>> 
>>>>> Kafka server OR an external service. We have an external/central service
>>>>> managing topic creation/partition expansion which takes into account of
>>>>> rack-diversity, replication factor (2, 3 or 4 depending on cluster/topic
>>>>> type), Policy replicating the topic between kafka clusters, etc.
>>>>> 
>>>>> 
>>>>> 
>>>>> Thanks,
>>>>> George
>>>>> 
>>>>> 
>>>>> 
>>>>> On Wednesday, August 7, 2019, 05:41:28 PM PDT, Colin McCabe
>>>>> < cmccabe@ apache. org ( cmccabe@apache.org ) > wrote:
>>>>> 
>>>>> 
>>>>> 
>>>>> On Wed, Aug 7, 2019, at 12:48, George Li wrote:
>>>>> 
>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Hi Colin,
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Thanks for your feedbacks. Comments below:
>>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> Even if you have a way of blacklisting an entire broker all at
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> once,
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> you still would need to run a leader election > for each partition
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> where
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> you want to move the leader off of the blacklisted broker. So the
>>>> operation is still O(N) in > that sense-- you have to do something per
>>>> partition.
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>> 
>>>>>> For a failed broker and swapped with an empty broker, when it comes
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> up,
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>> 
>>>>>> it will not have any leadership, and we would like it to remain not having
>>>>>> leaderships for a couple of hours or days. So there is no preferred leader
>>>>>> election needed which incurs O(N) operation in
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> this
>> 
>> 
>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>> 
>>>>>> case. Putting the preferred leader blacklist would safe guard this broker
>>>>>> serving traffic during that time. otherwise, if another
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> broker
>> 
>> 
>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>> 
>>>>>> fails(if this broker is the 1st, 2nd in the assignment), or someone runs
>>>>>> preferred leader election, this new "empty" broker can still
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> get
>> 
>> 
>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>> 
>>>>>> leaderships.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Also running reassignment to change the ordering of preferred
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> leader
>> 
>> 
>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>> 
>>>>>> would not actually switch the leader automatically. e.g. (1,2,3)
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> =>
>> 
>> 
>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>> 
>>>>>> (2,3,1). unless preferred leader election is run to switch current leader
>>>>>> from 1 to 2. So the operation is at least 2 x O(N). and
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> then
>> 
>> 
>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>> 
>>>>>> after the broker is back to normal, another 2 x O(N) to rollback.
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> Hi George,
>>>>> 
>>>>> 
>>>>> 
>>>>> Hmm. I guess I'm still on the fence about this feature.
>>>>> 
>>>>> 
>>>>> 
>>>>> In your example, I think we're comparing apples and oranges. You started
>>>>> by outlining a scenario where "an empty broker... comes up...
>>>>> [without] any leadership[s]." But then you criticize using reassignment to
>>>>> switch the order of preferred replicas because it
>>>>> "would not actually switch the leader automatically." If the empty broker
>>>>> doesn't have any leaderships, there is nothing to be switched, right?
>>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> In general, reassignment will get a lot easier and quicker once
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> KIP-455 is implemented. > Reassignments that just change the order of
>>>> preferred replicas for a specific partition should complete pretty much
>>>> instantly.
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> I think it's simpler and easier just to have one source of truth
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> for what the preferred replica is for a partition, rather than two. So
>>>> for> me, the fact that the replica assignment ordering isn't changed is
>>>> actually a big disadvantage of this KIP. If you are a new user (or
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> just>
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> an existing user that didn't read all of the documentation) and you
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> just
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> look at the replica assignment, you might be confused by why> a
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> particular
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> broker wasn't getting any leaderships, even though it appeared like it
>>>> should. More mechanisms mean more complexity> for users and developers
>>>> most of the time.
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>> 
>>>>>> I would like stress the point that running reassignment to change
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> the
>> 
>> 
>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>> 
>>>>>> ordering of the replica (putting a broker to the end of partition
>>>>>> assignment) is unnecessary, because after some time the broker is caught
>>>>>> up, it can start serving traffic and then need to run reassignments again
>>>>>> to "rollback" to previous states. As I
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> mentioned
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> in
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>> 
>>>>>> KIP-491, this is just tedious work.
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> In general, using an external rebalancing tool like Cruise Control
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> is a
>> 
>> 
>>> 
>>>> 
>>>>> 
>>>>> 
>>>>> good idea to keep things balanced without having deal with manual
>>>>> rebalancing. We expect more and more people who have a complex or large
>>>>> cluster will start using tools like this.
>>>>> 
>>>>> 
>>>>> 
>>>>> However, if you choose to do manual rebalancing, it shouldn't be that bad.
>>>>> You would save the existing partition ordering before making
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> your
>> 
>> 
>>> 
>>>> 
>>>>> 
>>>>> 
>>>>> changes, then make your changes (perhaps by running a simple command line
>>>>> tool that switches the order of the replicas). Then, once you felt like
>>>>> the broker was ready to serve traffic, you could just re-apply the old
>>>>> ordering which you had saved.
>>>>> 
>>>>> 
>>>>>> 
>>>>>> 
>>>>>> I agree this might introduce some complexities for
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> users/developers.
>> 
>> 
>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>> 
>>>>>> But if this feature is good, and well documented, it is good for
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> the
>> 
>> 
>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>> 
>>>>>> kafka product/community. Just like KIP-460 enabling unclean leader
>>>>>> election to override TopicLevel/Broker Level config of
>>>>>> `unclean.leader.election.enable`
>>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> I agree that it would be nice if we could treat some brokers
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> differently for the purposes of placing replicas, selecting leaders,
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> etc. >
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> Right now, we don't have any way of implementing that without forking
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> the
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> broker. I would support a new PlacementPolicy class that> would close
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> this
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> gap. But I don't think this KIP is flexible enough to fill this
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> role. For
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> example, it can't prevent users from creating> new single-replica
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> topics
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> that get put on the "bad" replica. Perhaps we should reopen the
>>>> discussion> about
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> https:/ / cwiki. apache. org/ confluence/ display/ KAFKA/ KIP-201%3A+Rationalising+Policy+interfaces
>> (
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-201%3A+Rationalising+Policy+interfaces
>> )
>> 
>> 
>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Creating topic with single-replica is beyond what KIP-491 is
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> trying to
>> 
>> 
>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>> 
>>>>>> achieve. The user needs to take responsibility of doing that. I do
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> see
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>> 
>>>>>> some Samza clients notoriously creating single-replica topics and
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> that
>> 
>> 
>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>> 
>>>>>> got flagged by alerts, because a single broker down/maintenance
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> will
>> 
>> 
>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>> 
>>>>>> cause offline partitions. For KIP-491 preferred leader "blacklist", the
>>>>>> single-replica will still serve as leaders, because there is no other
>>>>>> alternative replica to be chosen as leader.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Even with a new PlacementPolicy for topic creation/partition
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> expansion,
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>> 
>>>>>> it still needs the blacklist info (e.g. a zk path node, or broker
>>>>>> level/topic level config) to "blacklist" the broker to be preferred
>>>>>> leader? Would it be the same as KIP-491 is introducing?
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> I was thinking about a PlacementPolicy filling the role of preventing
>>>>> people from creating single-replica partitions on a node that we
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> didn't
>> 
>> 
>>> 
>>>> 
>>>>> 
>>>>> 
>>>>> want to ever be the leader. I thought that it could also prevent people
>>>>> from designating those nodes as preferred leaders during topic creation,
>>>>> or Kafka from doing itduring random topic creation. I was assuming that
>>>>> the PlacementPolicy would determine which nodes were which through static
>>>>> configuration keys. I agree static
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> configuration
>> 
>> 
>>> 
>>>> 
>>>>> 
>>>>> 
>>>>> keys are somewhat less flexible than dynamic configuration.
>>>>> 
>>>>> 
>>>>> 
>>>>> best,
>>>>> Colin
>>>>> 
>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Thanks,
>>>>>> George
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Wednesday, August 7, 2019, 11:01:51 AM PDT, Colin McCabe
>>>>>> < cmccabe@ apache. org ( cmccabe@apache.org ) > wrote:
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Fri, Aug 2, 2019, at 20:02, George Li wrote:
>>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> Hi Colin,
>>>>>>> Thanks for looking into this KIP. Sorry for the late response.
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> been
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> busy.
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> If a cluster has MAMY topic partitions, moving this "blacklist"
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> broker
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> to the end of replica list is still a rather "big" operation,
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> involving
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> submitting reassignments. The KIP-491 way of blacklist is much
>>>>>>> simpler/easier and can undo easily without changing the replica assignment
>>>>>>> ordering.
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Hi George,
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Even if you have a way of blacklisting an entire broker all at
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> once,
>> 
>> 
>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>> 
>>>>>> you still would need to run a leader election for each partition
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> where
>> 
>> 
>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>> 
>>>>>> you want to move the leader off of the blacklisted broker. So the
>>>>>> operation is still O(N) in that sense-- you have to do something
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> per
>> 
>> 
>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>> 
>>>>>> partition.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> In general, reassignment will get a lot easier and quicker once
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> KIP-455
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>> 
>>>>>> is implemented. Reassignments that just change the order of
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> preferred
>> 
>> 
>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>> 
>>>>>> replicas for a specific partition should complete pretty much
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> instantly.
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>> 
>>>>>> I think it's simpler and easier just to have one source of truth
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> for
>> 
>> 
>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>> 
>>>>>> what the preferred replica is for a partition, rather than two. So
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> for
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>> 
>>>>>> me, the fact that the replica assignment ordering isn't changed is
>>>>>> actually a big disadvantage of this KIP. If you are a new user (or just an
>>>>>> existing user that didn't read all of the documentation)
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> and
>> 
>> 
>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>> 
>>>>>> you just look at the replica assignment, you might be confused by
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> why
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> a
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>> 
>>>>>> particular broker wasn't getting any leaderships, even though it appeared
>>>>>> like it should. More mechanisms mean more complexity for users and
>>>>>> developers most of the time.
>>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> Major use case for me, a failed broker got swapped with new
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> hardware,
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> and starts up as empty (with latest offset of all partitions),
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> the
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> SLA
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> of retention is 1 day, so before this broker is up to be in-sync
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> for
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> 1
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> day, we would like to blacklist this broker from serving traffic.
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> after
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 1 day, the blacklist is removed and run preferred leader
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> election.
>> 
>> 
>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> This way, no need to run reassignments before/after. This is the
>>>>>>> "temporary" use-case.
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> What if we just add an option to the reassignment tool to generate
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> a
>> 
>> 
>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>> 
>>>>>> plan to move all the leaders off of a specific broker? The tool
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> could
>> 
>> 
>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>> 
>>>>>> also run a leader election as well. That would be a simple way of doing
>>>>>> this without adding new mechanisms or broker-side
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> configurations,
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>> 
>>>>>> etc.
>>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> There are use-cases that this Preferred Leader "blacklist" can be somewhat
>>>>>>> permanent, as I explained in the AWS data center
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> instances
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> Vs.
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> on-premises data center bare metal machines (heterogenous
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> hardware),
>> 
>> 
>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> that the AWS broker_ids will be blacklisted. So new topics
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> created,
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> or existing topic expansion would not make them serve traffic
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> even
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> they
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> could be the preferred leader.
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> I agree that it would be nice if we could treat some brokers differently
>>>>>> for the purposes of placing replicas, selecting
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> leaders,
>> 
>> 
>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>> 
>>>>>> etc. Right now, we don't have any way of implementing that without forking
>>>>>> the broker. I would support a new PlacementPolicy class
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> that
>> 
>> 
>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>> 
>>>>>> would close this gap. But I don't think this KIP is flexible
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> enough
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> to
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>> 
>>>>>> fill this role. For example, it can't prevent users from creating
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> new
>> 
>> 
>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>> 
>>>>>> single-replica topics that get put on the "bad" replica. Perhaps
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> we
>> 
>> 
>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>> 
>>>>>> should reopen the discussion about
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> https:/ / cwiki. apache. org/ confluence/ display/ KAFKA/ KIP-201%3A+Rationalising+Policy+interfaces
>> (
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-201%3A+Rationalising+Policy+interfaces
>> )
>> 
>> 
>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>> 
>>>>>> regards,
>>>>>> Colin
>>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> Please let me know there are more question.
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> George
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Thursday, July 25, 2019, 08:38:28 AM PDT, Colin McCabe
>>>>>>> < cmccabe@ apache. org ( cmccabe@apache.org ) > wrote:
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> We still want to give the "blacklisted" broker the leadership if nobody
>>>>>>> else is available. Therefore, isn't putting a broker on
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> the
>> 
>> 
>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> blacklist pretty much the same as moving it to the last entry in
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> the
>> 
>> 
>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> replicas list and then triggering a preferred leader election?
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> If we want this to be undone after a certain amount of time, or
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> under
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> certain conditions, that seems like something that would be more
>>>>>>> effectively done by an external system, rather than putting all
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> these
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> policies into Kafka.
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> best,
>>>>>>> Colin
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Fri, Jul 19, 2019, at 18:23, George Li wrote:
>>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Hi Satish,
>>>>>>>> Thanks for the reviews and feedbacks.
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> The following is the requirements this KIP is trying to
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> accomplish:
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> This can be moved to the"Proposed changes" section.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Updated the KIP-491.
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> The logic to determine the priority/order of which broker
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> should be
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> preferred leader should be modified. The broker in the
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> preferred leader
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> blacklist should be moved to the end (lowest priority) when determining
>>>>>>>>> leadership.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> I believe there is no change required in the ordering of the
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> preferred
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> replica list. Brokers in the preferred leader blacklist are
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> skipped
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> until other brokers int he list are unavailable.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Yes. partition assignment remained the same, replica &
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> ordering.
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> The
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> blacklist logic can be optimized during implementation.
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> The blacklist can be at the broker level. However, there
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> might
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> be use cases
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> where a specific topic should blacklist particular brokers,
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> which
>> 
>> 
>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> would be at the
>>>>>>>>> Topic level Config. For this use cases of this KIP, it seems
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> that broker level
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> blacklist would suffice. Topic level preferred leader
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> blacklist
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> might
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> be future enhancement work.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> I agree that the broker level preferred leader blacklist
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> would be
>> 
>> 
>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> sufficient. Do you have any use cases which require topic
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> level
>> 
>> 
>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> preferred blacklist?
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> I don't have any concrete use cases for Topic level preferred
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> leader
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> blacklist. One scenarios I can think of is when a broker has
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> high
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> CPU
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> usage, trying to identify the big topics (High MsgIn, High
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> BytesIn,
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> etc), then try to move the leaders away from this broker,
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> before
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> doing
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> an actual reassignment to change its preferred leader, try to
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> put
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> this
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> preferred_leader_blacklist in the Topic Level config, and run
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> preferred
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> leader election, and see whether CPU decreases for this broker,
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> if
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> yes, then do the reassignments to change the preferred leaders
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> to
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> be
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> "permanent" (the topic may have many partitions like 256 that
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> has
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> quite
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> a few of them having this broker as preferred leader). So this
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> Topic
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Level config is an easy way of doing trial and check the
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> result.
>> 
>> 
>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> You can add the below workaround as an item in the rejected
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> alternatives section
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> "Reassigning all the topic/partitions which the intended
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> broker
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> is a
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> replica for."
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Updated the KIP-491.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> George
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Friday, July 19, 2019, 08:20:22 AM PDT, Satish Duggana
>>>>>>>> < satish. duggana@ gmail. com ( satish.duggana@gmail.com ) > wrote:
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Thanks for the KIP. I have put my comments below.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> This is a nice improvement to avoid cumbersome maintenance.
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> The following is the requirements this KIP is trying to
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> accomplish:
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> The ability to add and remove the preferred leader
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> deprioritized
>> 
>> 
>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> list/blacklist. e.g. new ZK path/node or new dynamic config.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> This can be moved to the"Proposed changes" section.
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> The logic to determine the priority/order of which broker
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> should
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> be
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> preferred leader should be modified. The broker in the
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> preferred
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> leader
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> blacklist should be moved to the end (lowest priority) when determining
>>>>>>>> leadership.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> I believe there is no change required in the ordering of the
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> preferred
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> replica list. Brokers in the preferred leader blacklist are
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> skipped
>> 
>> 
>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> until other brokers int he list are unavailable.
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> The blacklist can be at the broker level. However, there
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> might
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> be use cases
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> where a specific topic should blacklist particular brokers,
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> which
>> 
>> 
>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> would be at the
>>>>>>>> Topic level Config. For this use cases of this KIP, it seems
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> that
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> broker level
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> blacklist would suffice. Topic level preferred leader
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> blacklist
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> might
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> be future enhancement work.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> I agree that the broker level preferred leader blacklist would
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> be
>> 
>> 
>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> sufficient. Do you have any use cases which require topic level preferred
>>>>>>>> blacklist?
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> You can add the below workaround as an item in the rejected
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> alternatives section
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> "Reassigning all the topic/partitions which the intended
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> broker is
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> a
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> replica for."
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> Satish.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Fri, Jul 19, 2019 at 7:33 AM Stanislav Kozlovski
>>>>>>>> < stanislav@ confluent. io ( stanislav@confluent.io ) > wrote:
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Hey George,
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Thanks for the KIP, it's an interesting idea.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> I was wondering whether we could achieve the same thing via
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> the
>> 
>> 
>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> kafka-reassign-partitions tool. As you had also said in the
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> JIRA, it is
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> true that this is currently very tedious with the tool. My
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> thoughts are
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> that we could improve the tool and give it the notion of a
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> "blacklisted
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> preferred leader".
>>>>>>>>> This would have some benefits like:
>>>>>>>>> - more fine-grained control over the blacklist. we may not
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> want
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> to
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> blacklist all the preferred leaders, as that would make the
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> blacklisted
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> broker a follower of last resort which is not very useful. In
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> the cases of
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> an underpowered AWS machine or a controller, you might
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> overshoot
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> and make
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> the broker very underutilized if you completely make it
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> leaderless.
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> - is not permanent. If we are to have a blacklist leaders
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> config,
>> 
>> 
>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> rebalancing tools would also need to know about it and
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> manipulate/respect
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> it to achieve a fair balance.
>>>>>>>>> It seems like both problems are tied to balancing partitions,
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> it's just
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> that KIP-491's use case wants to balance them against other
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> factors in a
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> more nuanced way. It makes sense to have both be done from
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> the
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> same place
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> To make note of the motivation section:
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Avoid bouncing broker in order to lose its leadership
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> The recommended way to make a broker lose its leadership is
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> to
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> run a
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> reassignment on its partitions
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> The cross-data center cluster has AWS cloud instances which
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> have less
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> computing power
>>>>>>>>> We recommend running Kafka on homogeneous machines. It would
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> be
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> cool if the
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> system supported more flexibility in that regard but that is
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> more nuanced
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> and a preferred leader blacklist may not be the best first
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> approach to the
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> issue
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Adding a new config which can fundamentally change the way
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> replication is
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> done is complex, both for the system (the replication code is
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> complex
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> enough) and the user. Users would have another potential
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> config
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> that could
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> backfire on them - e.g if left forgotten.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Could you think of any downsides to implementing this
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> functionality (or a
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> variation of it) in the kafka-reassign-partitions. sh (
>>>>>>>>> http://kafka-reassign-partitions.sh/ ) tool? One downside I can see is that
>>>>>>>>> we would not have it handle
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> new
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> partitions
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> created after the "blacklist operation". As a first
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> iteration I
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> think that
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> may be acceptable
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Thanks,
>>>>>>>>> Stanislav
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Fri, Jul 19, 2019 at 3:20 AM George Li <
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> sql_consulting@ yahoo. com. invalid ( sql_consulting@yahoo.com.invalid ) >
>>>> 
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Hi,
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Pinging the list for the feedbacks of this KIP-491 (
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> https:/ / cwiki. apache. org/ confluence/ pages/ viewpage. action?pageId=120736982
>> (
>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120736982
>> )
>> 
>> 
>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> )
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Thanks,
>>>>>>>>>> George
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Saturday, July 13, 2019, 08:43:25 PM PDT, George Li < sql_consulting@ yahoo.
>>>>>>>>>> com. INVALID ( sql_consulting@yahoo.com.INVALID ) > wrote:
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Hi,
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> I have created KIP-491 (
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> https:/ / cwiki. apache. org/ confluence/ pages/ viewpage. action?pageId=120736982
>> (
>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120736982
>> )
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> )
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> for putting a broker to the preferred leader blacklist or
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> deprioritized
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> list so when determining leadership, it's moved to the
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> lowest
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> priority for
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> some of the listed use-cases.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Please provide your comments/feedbacks.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Thanks,
>>>>>>>>>> George
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> ----- Forwarded Message ----- From: Jose Armando Garcia
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> Sancio (JIRA) <
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> jira@ apache. org ( jira@apache.org ) >To: " sql_consulting@ yahoo. com (
>>>>>>>>>> sql_consulting@yahoo.com ) " <
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> sql_consulting@ yahoo. com ( sql_consulting@yahoo.com ) >Sent:
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Tuesday, July 9, 2019, 01:06:05 PM PDTSubject: [jira]
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> [Commented]
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> (KAFKA-8638) Preferred Leader Blacklist (deprioritized
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> list)
>> 
>> 
>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> [
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> https:/ / issues. apache. org/ jira/ browse/ KAFKA-8638?page=com. atlassian.
>> jira. plugin. system. issuetabpanels:comment-tabpanel&focusedCommentId=16881511#comment-16881511
>> (
>> https://issues.apache.org/jira/browse/KAFKA-8638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881511#comment-16881511
>> )
>> 
>> 
>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> ]
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Jose Armando Garcia Sancio commented on KAFKA-8638:
>>>>>>>>>> ---------------------------------------------------
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Thanks for feedback and clear use cases [~sql_consulting].
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Preferred Leader Blacklist (deprioritized list)
>>>>>>>>>>> -----------------------------------------------
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Key: KAFKA-8638
>>>>>>>>>>> URL:
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> https:/ / issues. apache. org/ jira/ browse/ KAFKA-8638 (
>>>> https://issues.apache.org/jira/browse/KAFKA-8638 )
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Project: Kafka
>>>>>>>>>>> Issue Type: Improvement
>>>>>>>>>>> Components: config, controller, core
>>>>>>>>>>> Affects Versions: 1.1.1, 2.3.0, 2.2.1
>>>>>>>>>>> Reporter: GEORGE LI
>>>>>>>>>>> Assignee: GEORGE LI
>>>>>>>>>>> Priority: Major
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Currently, the kafka preferred leader election will pick
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> the
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> broker_id
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> in the topic/partition replica assignments in a priority
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> order
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> when the
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> broker is in ISR. The preferred leader is the broker id in
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> the
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> first
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> position of replica. There are use-cases that, even the
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> first
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> broker in the
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> replica assignment is in ISR, there is a need for it to be
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> moved to the end
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> of ordering (lowest priority) when deciding leadership
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> during
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> preferred
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> leader election.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Let’s use topic/partition replica (1,2,3) as an example.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> 1
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> is the
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> preferred leader. When preferred leadership is run, it
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> will
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> pick 1 as the
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> leader if it's ISR, if 1 is not online and in ISR, then
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> pick
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> 2, if 2 is not
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> in ISR, then pick 3 as the leader. There are use cases
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> that,
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> even 1 is in
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> ISR, we would like it to be moved to the end of ordering
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> (lowest priority)
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> when deciding leadership during preferred leader election.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> Below is a list
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> of use cases:
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> * (If broker_id 1 is a swapped failed host and brought up
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> with last
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> segments or latest offset without historical data (There is
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> another effort
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> on this), it's better for it to not serve leadership till
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> it's
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> caught-up.
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> * The cross-data center cluster has AWS instances which
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> have
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> less
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> computing power than the on-prem bare metal machines. We
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> could put the AWS
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> broker_ids in Preferred Leader Blacklist, so on-prem
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> brokers
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> can be elected
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> leaders, without changing the reassignments ordering of the
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> replicas.
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> * If the broker_id 1 is constantly losing leadership
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> after
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> some time:
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> "Flapping". we would want to exclude 1 to be a leader
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> unless
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> all other
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> brokers of this topic/partition are offline. The
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> “Flapping”
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> effect was
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> seen in the past when 2 or more brokers were bad, when they
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> lost leadership
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> constantly/quickly, the sets of partition replicas they
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> belong
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> to will see
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> leadership constantly changing. The ultimate solution is
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> to
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> swap these bad
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> hosts. But for quick mitigation, we can also put the bad
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> hosts in the
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Preferred Leader Blacklist to move the priority of its
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> being
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> elected as
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> leaders to the lowest.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> * If the controller is busy serving an extra load of
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> metadata requests
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> and other tasks. we would like to put the controller's
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> leaders
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> to other
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> brokers to lower its CPU load. currently bouncing to lose
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> leadership would
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> not work for Controller, because after the bounce, the
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> controller fails
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> over to another broker.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> * Avoid bouncing broker in order to lose its leadership:
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> it
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> would be
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> good if we have a way to specify which broker should be
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> excluded from
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> serving traffic/leadership (without changing the replica
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> assignment
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> ordering by reassignments, even though that's quick), and
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> run
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> preferred
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> leader election. A bouncing broker will cause temporary
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> URP,
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> and sometimes
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> other issues. Also a bouncing of broker (e.g. broker_id 1)
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> can temporarily
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> lose all its leadership, but if another broker (e.g.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> broker_id
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> 2) fails or
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> gets bounced, some of its leaderships will likely failover
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> to
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> broker_id 1
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> on a replica with 3 brokers. If broker_id 1 is in the
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> blacklist, then in
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> such a scenario even broker_id 2 offline, the 3rd broker
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> can
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> take
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> leadership.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> The current work-around of the above is to change the
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> topic/partition's
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> replica reassignments to move the broker_id 1 from the
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> first
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> position to
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> the last position and run preferred leader election. e.g.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> (1,
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> 2, 3) => (2,
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 3, 1). This changes the replica reassignments, and we need
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> to
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> keep track of
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> the original one and restore if things change (e.g.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> controller
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> fails over
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> to another broker, the swapped empty broker caught up).
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> That’s
>> 
>> 
>>> 
>>>> 
>>>> 
>>>> a rather
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> tedious task.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> --
>>>>>>>>>> This message was sent by Atlassian JIRA
>>>>>>>>>> (v7.6.3#76005)
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
> 
> 
> 
> --
> Best,
> Stanislav
> 
> 
>

Re: [DISCUSS] KIP-491: Preferred Leader Deprioritized List (Temporary Blacklist)

Posted by Stanislav Kozlovski <st...@confluent.io>.
I agree with Colin that the same result should be achievable through proper
abstraction in a tool. Even if that might be "4xO(N)" operations, that is
still not a lot - it is still classified as O(N)

Let's say a healthy broker hosting 3000 partitions, and of which 1000 are
> the preferred leaders (leader count is 1000). There is a hardware failure
> (disk/memory, etc.), and kafka process crashed. We swap this host with
> another host but keep the same broker.id, when this new broker coming up,
> it has no historical data, and we manage to have the current last offsets
> of all partitions set in the replication-offset-checkpoint (if we don't set
> them, it could cause crazy ReplicaFetcher pulling of historical data from
> other brokers and cause cluster high latency and other instabilities), so
> when Kafka is brought up, it is quickly catching up as followers in the
> ISR.  Note, we have auto.leader.rebalance.enable  disabled, so it's not
> serving any traffic as leaders (leader count = 0), even there are 1000
> partitions that this broker is the Preferred Leader.
> We need to make this broker not serving traffic for a few hours or days
> depending on the SLA of the topic retention requirement until after it's
> having enough historical data.


This sounds like a bit of a hack. If that is the concern, why not propose a
KIP that addresses the specific issue? Having a blacklist you control still
seems like a workaround given that Kafka itself knows when the topic
retention would allow you to switch that replica to a leader

I really hope we can come up with a solution that avoids complicating the
controller and state machine logic further.
Could you please list out the main drawbacks of abstract this away in the
reassignments tool (or a new tool)?

On Mon, Sep 9, 2019 at 7:53 AM Colin McCabe <cm...@apache.org> wrote:

> On Sat, Sep 7, 2019, at 09:21, Harsha Chintalapani wrote:
> > Hi Colin,
> >           Can you give us more details on why you don't want this to be
> > part of the Kafka core. You are proposing KIP-500 which will take away
> > zookeeper and writing this interim tools to change the zookeeper
> > metadata doesn't make sense to me.
>
> Hi Harsha,
>
> The reassignment API described in KIP-455, which will be part of Kafka
> 2.4, doesn't rely on ZooKeeper.  This API will stay the same after KIP-500
> is implemented.
>
> > As George pointed out there are
> > several benefits having it in the system itself instead of asking users
> > to hack bunch of json files to deal with outage scenario.
>
> In both cases, the user just has to run a shell command, right?  In both
> cases, the user has to remember to undo the command later when they want
> the broker to be treated normally again.  And in both cases, the user
> should probably be running an external rebalancing tool to avoid having to
> run these commands manually. :)
>
> best,
> Colin
>
> >
> > Thanks,
> > Harsha
> >
> > On Fri, Sep 6, 2019 at 4:36 PM George Li <sql_consulting@yahoo.com
> .invalid>
> > wrote:
> >
> > >  Hi Colin,
> > >
> > > Thanks for the feedback.  The "separate set of metadata about
> blacklists"
> > > in KIP-491 is just the list of broker ids. Usually 1 or 2 or a couple
> in
> > > the cluster.  Should be easier than keeping json files?  e.g. what if
> we
> > > first blacklist broker_id_1, then another broker_id_2 has issues, and
> we
> > > need to write out another json file to restore later (and in which
> order)?
> > >  Using blacklist, we can just add the broker_id_2 to the existing one.
> and
> > > remove whatever broker_id returning to good state without worrying
> how(the
> > > ordering of putting the broker to blacklist) to restore.
> > >
> > > For topic level config,  the blacklist will be tied to
> > > topic/partition(e.g.  Configs:
> > > topic.preferred.leader.blacklist=0:101,102;1:103    where 0 & 1 is the
> > > partition#, 101,102,103 are the blacklist broker_ids), and easier to
> > > update/remove, no need for external json files?
> > >
> > >
> > > Thanks,
> > > George
> > >
> > >     On Friday, September 6, 2019, 02:20:33 PM PDT, Colin McCabe <
> > > cmccabe@apache.org> wrote:
> > >
> > >  One possibility would be writing a new command-line tool that would
> > > deprioritize a given replica using the new KIP-455 API.  Then it could
> > > write out a JSON files containing the old priorities, which could be
> > > restored when (or if) we needed to do so.  This seems like it might be
> > > simpler and easier to maintain than a separate set of metadata about
> > > blacklists.
> > >
> > > best,
> > > Colin
> > >
> > >
> > > On Fri, Sep 6, 2019, at 11:58, George Li wrote:
> > > >  Hi,
> > > >
> > > > Just want to ping and bubble up the discussion of KIP-491.
> > > >
> > > > On a large scale of Kafka clusters with thousands of brokers in many
> > > > clusters.  Frequent hardware failures are common, although the
> > > > reassignments to change the preferred leaders is a workaround, it
> > > > incurs unnecessary additional work than the proposed preferred leader
> > > > blacklist in KIP-491, and hard to scale.
> > > >
> > > > I am wondering whether others using Kafka in a big scale running into
> > > > same problem.
> > > >
> > > >
> > > > Satish,
> > > >
> > > > Regarding your previous question about whether there is use-case for
> > > > TopicLevel preferred leader "blacklist",  I thought about one
> > > > use-case:  to improve rebalance/reassignment, the large partition
> will
> > > > usually cause performance/stability issues, planning to change the
> say
> > > > the New Replica will start with Leader's latest offset(this way the
> > > > replica is almost instantly in the ISR and reassignment completed),
> and
> > > > put this partition's NewReplica into Preferred Leader "Blacklist" at
> > > > the Topic Level config for that partition. After sometime(retention
> > > > time), this new replica has caught up and ready to serve traffic,
> > > > update/remove the TopicConfig for this partition's preferred leader
> > > > blacklist.
> > > >
> > > > I will update the KIP-491 later for this use case of Topic Level
> config
> > > > for Preferred Leader Blacklist.
> > > >
> > > >
> > > > Thanks,
> > > > George
> > > >
> > > >    On Wednesday, August 7, 2019, 07:43:55 PM PDT, George Li
> > > > <sq...@yahoo.com> wrote:
> > > >
> > > >  Hi Colin,
> > > >
> > > > > In your example, I think we're comparing apples and oranges.  You
> > > started by outlining a scenario where "an empty broker... comes up...
> > > [without] any > leadership[s]."  But then you criticize using
> reassignment
> > > to switch the order of preferred replicas because it "would not
> actually
> > > switch the leader > automatically."  If the empty broker doesn't have
> any
> > > leaderships, there is nothing to be switched, right?
> > > >
> > > > Let me explained in details of this particular use case example for
> > > > comparing apples to apples.
> > > >
> > > > Let's say a healthy broker hosting 3000 partitions, and of which 1000
> > > > are the preferred leaders (leader count is 1000). There is a hardware
> > > > failure (disk/memory, etc.), and kafka process crashed. We swap this
> > > > host with another host but keep the same broker.id, when this new
> > > > broker coming up, it has no historical data, and we manage to have
> the
> > > > current last offsets of all partitions set in
> > > > the replication-offset-checkpoint (if we don't set them, it could
> cause
> > > > crazy ReplicaFetcher pulling of historical data from other brokers
> and
> > > > cause cluster high latency and other instabilities), so when Kafka is
> > > > brought up, it is quickly catching up as followers in the ISR.  Note,
> > > > we have auto.leader.rebalance.enable  disabled, so it's not serving
> any
> > > > traffic as leaders (leader count = 0), even there are 1000 partitions
> > > > that this broker is the Preferred Leader.
> > > >
> > > > We need to make this broker not serving traffic for a few hours or
> days
> > > > depending on the SLA of the topic retention requirement until after
> > > > it's having enough historical data.
> > > >
> > > >
> > > > * The traditional way using the reassignments to move this broker in
> > > > that 1000 partitions where it's the preferred leader to the end of
> > > > assignment, this is O(N) operation. and from my experience, we can't
> > > > submit all 1000 at the same time, otherwise cause higher latencies
> even
> > > > the reassignment in this case can complete almost instantly.  After
> a
> > > > few hours/days whatever, this broker is ready to serve traffic,  we
> > > > have to run reassignments again to restore that 1000 partitions
> > > > preferred leaders for this broker: O(N) operation.  then run
> preferred
> > > > leader election O(N) again.  So total 3 x O(N) operations.  The point
> > > > is since the new empty broker is expected to be the same as the old
> one
> > > > in terms of hosting partition/leaders, it would seem unnecessary to
> do
> > > > reassignments (ordering of replica) during the broker catching up
> time.
> > > >
> > > >
> > > >
> > > > * The new feature Preferred Leader "Blacklist":  just need to put a
> > > > dynamic config to indicate that this broker should be considered
> leader
> > > > (preferred leader election or broker failover or unclean leader
> > > > election) to the lowest priority. NO need to run any reassignments.
> > > > After a few hours/days, when this broker is ready, remove the dynamic
> > > > config, and run preferred leader election and this broker will serve
> > > > traffic for that 1000 original partitions it was the preferred
> leader.
> > > > So total  1 x O(N) operation.
> > > >
> > > >
> > > > If auto.leader.rebalance.enable  is enabled,  the Preferred Leader
> > > > "Blacklist" can be put it before Kafka is started to prevent this
> > > > broker serving traffic.  In the traditional way of running
> > > > reassignments, once the broker is up,
> > > > with auto.leader.rebalance.enable  , if leadership starts going to
> this
> > > > new empty broker, it might have to do preferred leader election after
> > > > reassignments to remove its leaderships. e.g. (1,2,3) => (2,3,1)
> > > > reassignment only change the ordering, 1 remains as the current
> leader,
> > > > and needs prefer leader election to change to 2 after reassignment.
> so
> > > > potentially one more O(N) operation.
> > > >
> > > > I hope the above example can show how easy to "blacklist" a broker
> > > > serving leadership.  For someone managing Production Kafka cluster,
> > > > it's important to react fast to certain alerts and mitigate/resolve
> > > > some issues. As I listed the other use cases in KIP-291, I think this
> > > > feature can make the Kafka product more easier to manage/operate.
> > > >
> > > > > In general, using an external rebalancing tool like Cruise Control
> is
> > > a good idea to keep things balanced without having deal with manual
> > > rebalancing.  > We expect more and more people who have a complex or
> large
> > > cluster will start using tools like this.
> > > > >
> > > > > However, if you choose to do manual rebalancing, it shouldn't be
> that
> > > bad.  You would save the existing partition ordering before making your
> > > changes, then> make your changes (perhaps by running a simple command
> line
> > > tool that switches the order of the replicas).  Then, once you felt
> like
> > > the broker was ready to> serve traffic, you could just re-apply the old
> > > ordering which you had saved.
> > > >
> > > >
> > > > We do have our own rebalancing tool which has its own criteria like
> > > > Rack diversity,  disk usage,  spread partitions/leaders across all
> > > > brokers in the cluster per topic, leadership Bytes/BytesIn served per
> > > > broker, etc.  We can run reassignments. The point is whether it's
> > > > really necessary, and if there is more effective, easier, safer way
> to
> > > > do it.
> > > >
> > > > take another use case example of taking leadership out of busy
> > > > Controller to give it more power to serve metadata requests and other
> > > > work. The controller can failover, with the preferred leader
> > > > "blacklist",  it does not have to run reassignments again when
> > > > controller failover, just change the blacklisted broker_id.
> > > >
> > > >
> > > > > I was thinking about a PlacementPolicy filling the role of
> preventing
> > > people from creating single-replica partitions on a node that we didn't
> > > want to > ever be the leader.  I thought that it could also prevent
> people
> > > from designating those nodes as preferred leaders during topic
> creation, or
> > > Kafka from doing> itduring random topic creation.  I was assuming that
> the
> > > PlacementPolicy would determine which nodes were which through static
> > > configuration keys.  I agree> static configuration keys are somewhat
> less
> > > flexible than dynamic configuration.
> > > >
> > > >
> > > > I think single-replica partition might not be a good example.  There
> > > > should not be any single-replica partition at all. If yes. it's
> > > > probably because of trying to save disk space with less replicas.  I
> > > > think at least minimum 2. The user purposely creating single-replica
> > > > partition will take full responsibilities of data loss and
> > > > unavailability when a broker fails or under maintenance.
> > > >
> > > >
> > > > I think it would be better to use dynamic instead of static config.
> I
> > > > also think it would be better to have topic creation Policy enforced
> in
> > > > Kafka server OR an external service. We have an external/central
> > > > service managing topic creation/partition expansion which takes into
> > > > account of rack-diversity, replication factor (2, 3 or 4 depending on
> > > > cluster/topic type), Policy replicating the topic between kafka
> > > > clusters, etc.
> > > >
> > > >
> > > >
> > > > Thanks,
> > > > George
> > > >
> > > >
> > > >    On Wednesday, August 7, 2019, 05:41:28 PM PDT, Colin McCabe
> > > > <cm...@apache.org> wrote:
> > > >
> > > >  On Wed, Aug 7, 2019, at 12:48, George Li wrote:
> > > > >  Hi Colin,
> > > > >
> > > > > Thanks for your feedbacks.  Comments below:
> > > > > > Even if you have a way of blacklisting an entire broker all at
> once,
> > > you still would need to run a leader election > for each partition
> where
> > > you want to move the leader off of the blacklisted broker.  So the
> > > operation is still O(N) in > that sense-- you have to do something per
> > > partition.
> > > > >
> > > > > For a failed broker and swapped with an empty broker, when it comes
> > > up,
> > > > > it will not have any leadership, and we would like it to remain not
> > > > > having leaderships for a couple of hours or days. So there is no
> > > > > preferred leader election needed which incurs O(N) operation in
> this
> > > > > case.  Putting the preferred leader blacklist would safe guard this
> > > > > broker serving traffic during that time. otherwise, if another
> broker
> > > > > fails(if this broker is the 1st, 2nd in the assignment), or someone
> > > > > runs preferred leader election, this new "empty" broker can still
> get
> > > > > leaderships.
> > > > >
> > > > > Also running reassignment to change the ordering of preferred
> leader
> > > > > would not actually switch the leader automatically.  e.g.  (1,2,3)
> =>
> > > > > (2,3,1). unless preferred leader election is run to switch current
> > > > > leader from 1 to 2.  So the operation is at least 2 x O(N).  and
> then
> > > > > after the broker is back to normal, another 2 x O(N) to rollback.
> > > >
> > > > Hi George,
> > > >
> > > > Hmm.  I guess I'm still on the fence about this feature.
> > > >
> > > > In your example, I think we're comparing apples and oranges.  You
> > > > started by outlining a scenario where "an empty broker... comes up...
> > > > [without] any leadership[s]."  But then you criticize using
> > > > reassignment to switch the order of preferred replicas because it
> > > > "would not actually switch the leader automatically."  If the empty
> > > > broker doesn't have any leaderships, there is nothing to be switched,
> > > > right?
> > > >
> > > > >
> > > > >
> > > > > > In general, reassignment will get a lot easier and quicker once
> > > KIP-455 is implemented.  > Reassignments that just change the order of
> > > preferred replicas for a specific partition should complete pretty much
> > > instantly.
> > > > > >> I think it's simpler and easier just to have one source of truth
> > > for what the preferred replica is for a partition, rather than two.  So
> > > for> me, the fact that the replica assignment ordering isn't changed is
> > > actually a big disadvantage of this KIP.  If you are a new user (or
> just>
> > > an existing user that didn't read all of the documentation) and you
> just
> > > look at the replica assignment, you might be confused by why> a
> particular
> > > broker wasn't getting any leaderships, even  though it appeared like it
> > > should.  More mechanisms mean more complexity> for users and developers
> > > most of the time.
> > > > >
> > > > >
> > > > > I would like stress the point that running reassignment to change
> the
> > > > > ordering of the replica (putting a broker to the end of partition
> > > > > assignment) is unnecessary, because after some time the broker is
> > > > > caught up, it can start serving traffic and then need to run
> > > > > reassignments again to "rollback" to previous states. As I
> mentioned
> > > in
> > > > > KIP-491, this is just tedious work.
> > > >
> > > > In general, using an external rebalancing tool like Cruise Control
> is a
> > > > good idea to keep things balanced without having deal with manual
> > > > rebalancing.  We expect more and more people who have a complex or
> > > > large cluster will start using tools like this.
> > > >
> > > > However, if you choose to do manual rebalancing, it shouldn't be that
> > > > bad.  You would save the existing partition ordering before making
> your
> > > > changes, then make your changes (perhaps by running a simple command
> > > > line tool that switches the order of the replicas).  Then, once you
> > > > felt like the broker was ready to serve traffic, you could just
> > > > re-apply the old ordering which you had saved.
> > > >
> > > > >
> > > > > I agree this might introduce some complexities for
> users/developers.
> > > > > But if this feature is good, and well documented, it is good for
> the
> > > > > kafka product/community.  Just like KIP-460 enabling unclean leader
> > > > > election to override TopicLevel/Broker Level config of
> > > > > `unclean.leader.election.enable`
> > > > >
> > > > > > I agree that it would be nice if we could treat some brokers
> > > differently for the purposes of placing replicas, selecting leaders,
> etc. >
> > > Right now, we don't have any way of implementing that without forking
> the
> > > broker.  I would support a new PlacementPolicy class that> would close
> this
> > > gap.  But I don't think this KIP is flexible enough to fill this
> role.  For
> > > example, it can't prevent users from creating> new single-replica
> topics
> > > that get put on the "bad" replica.  Perhaps we should reopen the
> > > discussion> about
> > >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-201%3A+Rationalising+Policy+interfaces
> > > > >
> > > > > Creating topic with single-replica is beyond what KIP-491 is
> trying to
> > > > > achieve.  The user needs to take responsibility of doing that. I do
> > > see
> > > > > some Samza clients notoriously creating single-replica topics and
> that
> > > > > got flagged by alerts, because a single broker down/maintenance
> will
> > > > > cause offline partitions. For KIP-491 preferred leader "blacklist",
> > > > > the single-replica will still serve as leaders, because there is no
> > > > > other alternative replica to be chosen as leader.
> > > > >
> > > > > Even with a new PlacementPolicy for topic creation/partition
> > > expansion,
> > > > > it still needs the blacklist info (e.g. a zk path node, or broker
> > > > > level/topic level config) to "blacklist" the broker to be preferred
> > > > > leader? Would it be the same as KIP-491 is introducing?
> > > >
> > > > I was thinking about a PlacementPolicy filling the role of preventing
> > > > people from creating single-replica partitions on a node that we
> didn't
> > > > want to ever be the leader.  I thought that it could also prevent
> > > > people from designating those nodes as preferred leaders during topic
> > > > creation, or Kafka from doing itduring random topic creation.  I was
> > > > assuming that the PlacementPolicy would determine which nodes were
> > > > which through static configuration keys.  I agree static
> configuration
> > > > keys are somewhat less flexible than dynamic configuration.
> > > >
> > > > best,
> > > > Colin
> > > >
> > > >
> > > > >
> > > > >
> > > > > Thanks,
> > > > > George
> > > > >
> > > > >    On Wednesday, August 7, 2019, 11:01:51 AM PDT, Colin McCabe
> > > > > <cm...@apache.org> wrote:
> > > > >
> > > > >  On Fri, Aug 2, 2019, at 20:02, George Li wrote:
> > > > > >  Hi Colin,
> > > > > > Thanks for looking into this KIP.  Sorry for the late response.
> been
> > > busy.
> > > > > >
> > > > > > If a cluster has MAMY topic partitions, moving this "blacklist"
> > > broker
> > > > > > to the end of replica list is still a rather "big" operation,
> > > involving
> > > > > > submitting reassignments.  The KIP-491 way of blacklist is much
> > > > > > simpler/easier and can undo easily without changing the replica
> > > > > > assignment ordering.
> > > > >
> > > > > Hi George,
> > > > >
> > > > > Even if you have a way of blacklisting an entire broker all at
> once,
> > > > > you still would need to run a leader election for each partition
> where
> > > > > you want to move the leader off of the blacklisted broker.  So the
> > > > > operation is still O(N) in that sense-- you have to do something
> per
> > > > > partition.
> > > > >
> > > > > In general, reassignment will get a lot easier and quicker once
> > > KIP-455
> > > > > is implemented.  Reassignments that just change the order of
> preferred
> > > > > replicas for a specific partition should complete pretty much
> > > instantly.
> > > > >
> > > > > I think it's simpler and easier just to have one source of truth
> for
> > > > > what the preferred replica is for a partition, rather than two.  So
> > > for
> > > > > me, the fact that the replica assignment ordering isn't changed is
> > > > > actually a big disadvantage of this KIP.  If you are a new user (or
> > > > > just an existing user that didn't read all of the documentation)
> and
> > > > > you just look at the replica assignment, you might be confused by
> why
> > > a
> > > > > particular broker wasn't getting any leaderships, even  though it
> > > > > appeared like it should.  More mechanisms mean more complexity for
> > > > > users and developers most of the time.
> > > > >
> > > > > > Major use case for me, a failed broker got swapped with new
> > > hardware,
> > > > > > and starts up as empty (with latest offset of all partitions),
> the
> > > SLA
> > > > > > of retention is 1 day, so before this broker is up to be in-sync
> for
> > > 1
> > > > > > day, we would like to blacklist this broker from serving traffic.
> > > after
> > > > > > 1 day, the blacklist is removed and run preferred leader
> election.
> > > > > > This way, no need to run reassignments before/after.  This is the
> > > > > > "temporary" use-case.
> > > > >
> > > > > What if we just add an option to the reassignment tool to generate
> a
> > > > > plan to move all the leaders off of a specific broker?  The tool
> could
> > > > > also run a leader election as well.  That would be a simple way of
> > > > > doing this without adding new mechanisms or broker-side
> > > configurations,
> > > > > etc.
> > > > >
> > > > > >
> > > > > > There are use-cases that this Preferred Leader "blacklist" can be
> > > > > > somewhat permanent, as I explained in the AWS data center
> instances
> > > Vs.
> > > > > > on-premises data center bare metal machines (heterogenous
> hardware),
> > > > > > that the AWS broker_ids will be blacklisted.  So new topics
> > > created,
> > > > > > or existing topic expansion would not make them serve traffic
> even
> > > they
> > > > > > could be the preferred leader.
> > > > >
> > > > > I agree that it would be nice if we could treat some brokers
> > > > > differently for the purposes of placing replicas, selecting
> leaders,
> > > > > etc.  Right now, we don't have any way of implementing that without
> > > > > forking the broker.  I would support a new PlacementPolicy class
> that
> > > > > would close this gap.  But I don't think this KIP is flexible
> enough
> > > to
> > > > > fill this role.  For example, it can't prevent users from creating
> new
> > > > > single-replica topics that get put on the "bad" replica.  Perhaps
> we
> > > > > should reopen the discussion about
> > > > >
> > >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-201%3A+Rationalising+Policy+interfaces
> > > > >
> > > > > regards,
> > > > > Colin
> > > > >
> > > > > >
> > > > > > Please let me know there are more question.
> > > > > >
> > > > > >
> > > > > > Thanks,
> > > > > > George
> > > > > >
> > > > > >    On Thursday, July 25, 2019, 08:38:28 AM PDT, Colin McCabe
> > > > > > <cm...@apache.org> wrote:
> > > > > >
> > > > > >  We still want to give the "blacklisted" broker the leadership if
> > > > > > nobody else is available.  Therefore, isn't putting a broker on
> the
> > > > > > blacklist pretty much the same as moving it to the last entry in
> the
> > > > > > replicas list and then triggering a preferred leader election?
> > > > > >
> > > > > > If we want this to be undone after a certain amount of time, or
> > > under
> > > > > > certain conditions, that seems like something that would be more
> > > > > > effectively done by an external system, rather than putting all
> > > these
> > > > > > policies into Kafka.
> > > > > >
> > > > > > best,
> > > > > > Colin
> > > > > >
> > > > > >
> > > > > > On Fri, Jul 19, 2019, at 18:23, George Li wrote:
> > > > > > >  Hi Satish,
> > > > > > > Thanks for the reviews and feedbacks.
> > > > > > >
> > > > > > > > > The following is the requirements this KIP is trying to
> > > accomplish:
> > > > > > > > This can be moved to the"Proposed changes" section.
> > > > > > >
> > > > > > > Updated the KIP-491.
> > > > > > >
> > > > > > > > >>The logic to determine the priority/order of which broker
> > > should be
> > > > > > > > preferred leader should be modified.  The broker in the
> > > preferred leader
> > > > > > > > blacklist should be moved to the end (lowest priority) when
> > > > > > > > determining leadership.
> > > > > > > >
> > > > > > > > I believe there is no change required in the ordering of the
> > > preferred
> > > > > > > > replica list. Brokers in the preferred leader blacklist are
> > > skipped
> > > > > > > > until other brokers int he list are unavailable.
> > > > > > >
> > > > > > > Yes. partition assignment remained the same, replica &
> ordering.
> > > The
> > > > > > > blacklist logic can be optimized during implementation.
> > > > > > >
> > > > > > > > >>The blacklist can be at the broker level. However, there
> might
> > > be use cases
> > > > > > > > where a specific topic should blacklist particular brokers,
> which
> > > > > > > > would be at the
> > > > > > > > Topic level Config. For this use cases of this KIP, it seems
> > > that broker level
> > > > > > > > blacklist would suffice.  Topic level preferred leader
> blacklist
> > > might
> > > > > > > > be future enhancement work.
> > > > > > > >
> > > > > > > > I agree that the broker level preferred leader blacklist
> would be
> > > > > > > > sufficient. Do you have any use cases which require topic
> level
> > > > > > > > preferred blacklist?
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > I don't have any concrete use cases for Topic level preferred
> > > leader
> > > > > > > blacklist.  One scenarios I can think of is when a broker has
> high
> > > CPU
> > > > > > > usage, trying to identify the big topics (High MsgIn, High
> > > BytesIn,
> > > > > > > etc), then try to move the leaders away from this broker,
> before
> > > doing
> > > > > > > an actual reassignment to change its preferred leader,  try to
> put
> > > this
> > > > > > > preferred_leader_blacklist in the Topic Level config, and run
> > > preferred
> > > > > > > leader election, and see whether CPU decreases for this broker,
> > > if
> > > > > > > yes, then do the reassignments to change the preferred leaders
> to
> > > be
> > > > > > > "permanent" (the topic may have many partitions like 256 that
> has
> > > quite
> > > > > > > a few of them having this broker as preferred leader).  So this
> > > Topic
> > > > > > > Level config is an easy way of doing trial and check the
> result.
> > > > > > >
> > > > > > >
> > > > > > > > You can add the below workaround as an item in the rejected
> > > alternatives section
> > > > > > > > "Reassigning all the topic/partitions which the intended
> broker
> > > is a
> > > > > > > > replica for."
> > > > > > >
> > > > > > > Updated the KIP-491.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Thanks,
> > > > > > > George
> > > > > > >
> > > > > > >    On Friday, July 19, 2019, 08:20:22 AM PDT, Satish Duggana
> > > > > > > <sa...@gmail.com> wrote:
> > > > > > >
> > > > > > >  Thanks for the KIP. I have put my comments below.
> > > > > > >
> > > > > > > This is a nice improvement to avoid cumbersome maintenance.
> > > > > > >
> > > > > > > >> The following is the requirements this KIP is trying to
> > > accomplish:
> > > > > > >   The ability to add and remove the preferred leader
> deprioritized
> > > > > > > list/blacklist. e.g. new ZK path/node or new dynamic config.
> > > > > > >
> > > > > > > This can be moved to the"Proposed changes" section.
> > > > > > >
> > > > > > > >>The logic to determine the priority/order of which broker
> should
> > > be
> > > > > > > preferred leader should be modified.  The broker in the
> preferred
> > > leader
> > > > > > > blacklist should be moved to the end (lowest priority) when
> > > > > > > determining leadership.
> > > > > > >
> > > > > > > I believe there is no change required in the ordering of the
> > > preferred
> > > > > > > replica list. Brokers in the preferred leader blacklist are
> skipped
> > > > > > > until other brokers int he list are unavailable.
> > > > > > >
> > > > > > > >>The blacklist can be at the broker level. However, there
> might
> > > be use cases
> > > > > > > where a specific topic should blacklist particular brokers,
> which
> > > > > > > would be at the
> > > > > > > Topic level Config. For this use cases of this KIP, it seems
> that
> > > broker level
> > > > > > > blacklist would suffice.  Topic level preferred leader
> blacklist
> > > might
> > > > > > > be future enhancement work.
> > > > > > >
> > > > > > > I agree that the broker level preferred leader blacklist would
> be
> > > > > > > sufficient. Do you have any use cases which require topic level
> > > > > > > preferred blacklist?
> > > > > > >
> > > > > > > You can add the below workaround as an item in the rejected
> > > alternatives section
> > > > > > > "Reassigning all the topic/partitions which the intended
> broker is
> > > a
> > > > > > > replica for."
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Satish.
> > > > > > >
> > > > > > > On Fri, Jul 19, 2019 at 7:33 AM Stanislav Kozlovski
> > > > > > > <st...@confluent.io> wrote:
> > > > > > > >
> > > > > > > > Hey George,
> > > > > > > >
> > > > > > > > Thanks for the KIP, it's an interesting idea.
> > > > > > > >
> > > > > > > > I was wondering whether we could achieve the same thing via
> the
> > > > > > > > kafka-reassign-partitions tool. As you had also said in the
> > > JIRA,  it is
> > > > > > > > true that this is currently very tedious with the tool. My
> > > thoughts are
> > > > > > > > that we could improve the tool and give it the notion of a
> > > "blacklisted
> > > > > > > > preferred leader".
> > > > > > > > This would have some benefits like:
> > > > > > > > - more fine-grained control over the blacklist. we may not
> want
> > > to
> > > > > > > > blacklist all the preferred leaders, as that would make the
> > > blacklisted
> > > > > > > > broker a follower of last resort which is not very useful. In
> > > the cases of
> > > > > > > > an underpowered AWS machine or a controller, you might
> overshoot
> > > and make
> > > > > > > > the broker very underutilized if you completely make it
> > > leaderless.
> > > > > > > > - is not permanent. If we are to have a blacklist leaders
> config,
> > > > > > > > rebalancing tools would also need to know about it and
> > > manipulate/respect
> > > > > > > > it to achieve a fair balance.
> > > > > > > > It seems like both problems are tied to balancing partitions,
> > > it's just
> > > > > > > > that KIP-491's use case wants to balance them against other
> > > factors in a
> > > > > > > > more nuanced way. It makes sense to have both be done from
> the
> > > same place
> > > > > > > >
> > > > > > > > To make note of the motivation section:
> > > > > > > > > Avoid bouncing broker in order to lose its leadership
> > > > > > > > The recommended way to make a broker lose its leadership is
> to
> > > run a
> > > > > > > > reassignment on its partitions
> > > > > > > > > The cross-data center cluster has AWS cloud instances which
> > > have less
> > > > > > > > computing power
> > > > > > > > We recommend running Kafka on homogeneous machines. It would
> be
> > > cool if the
> > > > > > > > system supported more flexibility in that regard but that is
> > > more nuanced
> > > > > > > > and a preferred leader blacklist may not be the best first
> > > approach to the
> > > > > > > > issue
> > > > > > > >
> > > > > > > > Adding a new config which can fundamentally change the way
> > > replication is
> > > > > > > > done is complex, both for the system (the replication code is
> > > complex
> > > > > > > > enough) and the user. Users would have another potential
> config
> > > that could
> > > > > > > > backfire on them - e.g if left forgotten.
> > > > > > > >
> > > > > > > > Could you think of any downsides to implementing this
> > > functionality (or a
> > > > > > > > variation of it) in the kafka-reassign-partitions.sh tool?
> > > > > > > > One downside I can see is that we would not have it handle
> new
> > > partitions
> > > > > > > > created after the "blacklist operation". As a first
> iteration I
> > > think that
> > > > > > > > may be acceptable
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Stanislav
> > > > > > > >
> > > > > > > > On Fri, Jul 19, 2019 at 3:20 AM George Li <
> > > sql_consulting@yahoo.com.invalid>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > >  Hi,
> > > > > > > > >
> > > > > > > > > Pinging the list for the feedbacks of this KIP-491  (
> > > > > > > > >
> > >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120736982
> > > > > > > > > )
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > George
> > > > > > > > >
> > > > > > > > >    On Saturday, July 13, 2019, 08:43:25 PM PDT, George Li <
> > > > > > > > > sql_consulting@yahoo.com.INVALID> wrote:
> > > > > > > > >
> > > > > > > > >  Hi,
> > > > > > > > >
> > > > > > > > > I have created KIP-491 (
> > > > > > > > >
> > >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120736982
> > > )
> > > > > > > > > for putting a broker to the preferred leader blacklist or
> > > deprioritized
> > > > > > > > > list so when determining leadership,  it's moved to the
> lowest
> > > priority for
> > > > > > > > > some of the listed use-cases.
> > > > > > > > >
> > > > > > > > > Please provide your comments/feedbacks.
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > George
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >  ----- Forwarded Message ----- From: Jose Armando Garcia
> > > Sancio (JIRA) <
> > > > > > > > > jira@apache.org>To: "sql_consulting@yahoo.com" <
> > > sql_consulting@yahoo.com>Sent:
> > > > > > > > > Tuesday, July 9, 2019, 01:06:05 PM PDTSubject: [jira]
> > > [Commented]
> > > > > > > > > (KAFKA-8638) Preferred Leader Blacklist (deprioritized
> list)
> > > > > > > > >
> > > > > > > > >    [
> > > > > > > > >
> > >
> https://issues.apache.org/jira/browse/KAFKA-8638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881511#comment-16881511
> > > > > > > > > ]
> > > > > > > > >
> > > > > > > > > Jose Armando Garcia Sancio commented on KAFKA-8638:
> > > > > > > > > ---------------------------------------------------
> > > > > > > > >
> > > > > > > > > Thanks for feedback and clear use cases [~sql_consulting].
> > > > > > > > >
> > > > > > > > > > Preferred Leader Blacklist (deprioritized list)
> > > > > > > > > > -----------------------------------------------
> > > > > > > > > >
> > > > > > > > > >                Key: KAFKA-8638
> > > > > > > > > >                URL:
> > > https://issues.apache.org/jira/browse/KAFKA-8638
> > > > > > > > > >            Project: Kafka
> > > > > > > > > >          Issue Type: Improvement
> > > > > > > > > >          Components: config, controller, core
> > > > > > > > > >    Affects Versions: 1.1.1, 2.3.0, 2.2.1
> > > > > > > > > >            Reporter: GEORGE LI
> > > > > > > > > >            Assignee: GEORGE LI
> > > > > > > > > >            Priority: Major
> > > > > > > > > >
> > > > > > > > > > Currently, the kafka preferred leader election will pick
> the
> > > broker_id
> > > > > > > > > in the topic/partition replica assignments in a priority
> order
> > > when the
> > > > > > > > > broker is in ISR. The preferred leader is the broker id in
> the
> > > first
> > > > > > > > > position of replica. There are use-cases that, even the
> first
> > > broker in the
> > > > > > > > > replica assignment is in ISR, there is a need for it to be
> > > moved to the end
> > > > > > > > > of ordering (lowest priority) when deciding leadership
> during
> > > preferred
> > > > > > > > > leader election.
> > > > > > > > > > Let’s use topic/partition replica (1,2,3) as an example.
> 1
> > > is the
> > > > > > > > > preferred leader.  When preferred leadership is run, it
> will
> > > pick 1 as the
> > > > > > > > > leader if it's ISR, if 1 is not online and in ISR, then
> pick
> > > 2, if 2 is not
> > > > > > > > > in ISR, then pick 3 as the leader. There are use cases
> that,
> > > even 1 is in
> > > > > > > > > ISR, we would like it to be moved to the end of ordering
> > > (lowest priority)
> > > > > > > > > when deciding leadership during preferred leader election.
> > > Below is a list
> > > > > > > > > of use cases:
> > > > > > > > > > * (If broker_id 1 is a swapped failed host and brought up
> > > with last
> > > > > > > > > segments or latest offset without historical data (There is
> > > another effort
> > > > > > > > > on this), it's better for it to not serve leadership till
> it's
> > > caught-up.
> > > > > > > > > > * The cross-data center cluster has AWS instances which
> have
> > > less
> > > > > > > > > computing power than the on-prem bare metal machines.  We
> > > could put the AWS
> > > > > > > > > broker_ids in Preferred Leader Blacklist, so on-prem
> brokers
> > > can be elected
> > > > > > > > > leaders, without changing the reassignments ordering of the
> > > replicas.
> > > > > > > > > > * If the broker_id 1 is constantly losing leadership
> after
> > > some time:
> > > > > > > > > "Flapping". we would want to exclude 1 to be a leader
> unless
> > > all other
> > > > > > > > > brokers of this topic/partition are offline.  The
> “Flapping”
> > > effect was
> > > > > > > > > seen in the past when 2 or more brokers were bad, when they
> > > lost leadership
> > > > > > > > > constantly/quickly, the sets of partition replicas they
> belong
> > > to will see
> > > > > > > > > leadership constantly changing.  The ultimate solution is
> to
> > > swap these bad
> > > > > > > > > hosts.  But for quick mitigation, we can also put the bad
> > > hosts in the
> > > > > > > > > Preferred Leader Blacklist to move the priority of its
> being
> > > elected as
> > > > > > > > > leaders to the lowest.
> > > > > > > > > > *  If the controller is busy serving an extra load of
> > > metadata requests
> > > > > > > > > and other tasks. we would like to put the controller's
> leaders
> > > to other
> > > > > > > > > brokers to lower its CPU load. currently bouncing to lose
> > > leadership would
> > > > > > > > > not work for Controller, because after the bounce, the
> > > controller fails
> > > > > > > > > over to another broker.
> > > > > > > > > > * Avoid bouncing broker in order to lose its leadership:
> it
> > > would be
> > > > > > > > > good if we have a way to specify which broker should be
> > > excluded from
> > > > > > > > > serving traffic/leadership (without changing the replica
> > > assignment
> > > > > > > > > ordering by reassignments, even though that's quick), and
> run
> > > preferred
> > > > > > > > > leader election.  A bouncing broker will cause temporary
> URP,
> > > and sometimes
> > > > > > > > > other issues.  Also a bouncing of broker (e.g. broker_id 1)
> > > can temporarily
> > > > > > > > > lose all its leadership, but if another broker (e.g.
> broker_id
> > > 2) fails or
> > > > > > > > > gets bounced, some of its leaderships will likely failover
> to
> > > broker_id 1
> > > > > > > > > on a replica with 3 brokers.  If broker_id 1 is in the
> > > blacklist, then in
> > > > > > > > > such a scenario even broker_id 2 offline,  the 3rd broker
> can
> > > take
> > > > > > > > > leadership.
> > > > > > > > > > The current work-around of the above is to change the
> > > topic/partition's
> > > > > > > > > replica reassignments to move the broker_id 1 from the
> first
> > > position to
> > > > > > > > > the last position and run preferred leader election. e.g.
> (1,
> > > 2, 3) => (2,
> > > > > > > > > 3, 1). This changes the replica reassignments, and we need
> to
> > > keep track of
> > > > > > > > > the original one and restore if things change (e.g.
> controller
> > > fails over
> > > > > > > > > to another broker, the swapped empty broker caught up).
> That’s
> > > a rather
> > > > > > > > > tedious task.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > This message was sent by Atlassian JIRA
> > > > > > > > > (v7.6.3#76005)
> >
>


-- 
Best,
Stanislav

Re: [DISCUSS] KIP-491: Preferred Leader Deprioritized List (Temporary Blacklist)

Posted by Colin McCabe <cm...@apache.org>.
On Sat, Sep 7, 2019, at 09:21, Harsha Chintalapani wrote:
> Hi Colin,
>           Can you give us more details on why you don't want this to be
> part of the Kafka core. You are proposing KIP-500 which will take away
> zookeeper and writing this interim tools to change the zookeeper 
> metadata doesn't make sense to me.

Hi Harsha,

The reassignment API described in KIP-455, which will be part of Kafka 2.4, doesn't rely on ZooKeeper.  This API will stay the same after KIP-500 is implemented.

> As George pointed out there are
> several benefits having it in the system itself instead of asking users
> to hack bunch of json files to deal with outage scenario.

In both cases, the user just has to run a shell command, right?  In both cases, the user has to remember to undo the command later when they want the broker to be treated normally again.  And in both cases, the user should probably be running an external rebalancing tool to avoid having to run these commands manually. :)

best,
Colin

> 
> Thanks,
> Harsha
> 
> On Fri, Sep 6, 2019 at 4:36 PM George Li <sq...@yahoo.com.invalid>
> wrote:
> 
> >  Hi Colin,
> >
> > Thanks for the feedback.  The "separate set of metadata about blacklists"
> > in KIP-491 is just the list of broker ids. Usually 1 or 2 or a couple in
> > the cluster.  Should be easier than keeping json files?  e.g. what if we
> > first blacklist broker_id_1, then another broker_id_2 has issues, and we
> > need to write out another json file to restore later (and in which order)?
> >  Using blacklist, we can just add the broker_id_2 to the existing one. and
> > remove whatever broker_id returning to good state without worrying how(the
> > ordering of putting the broker to blacklist) to restore.
> >
> > For topic level config,  the blacklist will be tied to
> > topic/partition(e.g.  Configs:
> > topic.preferred.leader.blacklist=0:101,102;1:103    where 0 & 1 is the
> > partition#, 101,102,103 are the blacklist broker_ids), and easier to
> > update/remove, no need for external json files?
> >
> >
> > Thanks,
> > George
> >
> >     On Friday, September 6, 2019, 02:20:33 PM PDT, Colin McCabe <
> > cmccabe@apache.org> wrote:
> >
> >  One possibility would be writing a new command-line tool that would
> > deprioritize a given replica using the new KIP-455 API.  Then it could
> > write out a JSON files containing the old priorities, which could be
> > restored when (or if) we needed to do so.  This seems like it might be
> > simpler and easier to maintain than a separate set of metadata about
> > blacklists.
> >
> > best,
> > Colin
> >
> >
> > On Fri, Sep 6, 2019, at 11:58, George Li wrote:
> > >  Hi,
> > >
> > > Just want to ping and bubble up the discussion of KIP-491.
> > >
> > > On a large scale of Kafka clusters with thousands of brokers in many
> > > clusters.  Frequent hardware failures are common, although the
> > > reassignments to change the preferred leaders is a workaround, it
> > > incurs unnecessary additional work than the proposed preferred leader
> > > blacklist in KIP-491, and hard to scale.
> > >
> > > I am wondering whether others using Kafka in a big scale running into
> > > same problem.
> > >
> > >
> > > Satish,
> > >
> > > Regarding your previous question about whether there is use-case for
> > > TopicLevel preferred leader "blacklist",  I thought about one
> > > use-case:  to improve rebalance/reassignment, the large partition will
> > > usually cause performance/stability issues, planning to change the say
> > > the New Replica will start with Leader's latest offset(this way the
> > > replica is almost instantly in the ISR and reassignment completed), and
> > > put this partition's NewReplica into Preferred Leader "Blacklist" at
> > > the Topic Level config for that partition. After sometime(retention
> > > time), this new replica has caught up and ready to serve traffic,
> > > update/remove the TopicConfig for this partition's preferred leader
> > > blacklist.
> > >
> > > I will update the KIP-491 later for this use case of Topic Level config
> > > for Preferred Leader Blacklist.
> > >
> > >
> > > Thanks,
> > > George
> > >
> > >    On Wednesday, August 7, 2019, 07:43:55 PM PDT, George Li
> > > <sq...@yahoo.com> wrote:
> > >
> > >  Hi Colin,
> > >
> > > > In your example, I think we're comparing apples and oranges.  You
> > started by outlining a scenario where "an empty broker... comes up...
> > [without] any > leadership[s]."  But then you criticize using reassignment
> > to switch the order of preferred replicas because it "would not actually
> > switch the leader > automatically."  If the empty broker doesn't have any
> > leaderships, there is nothing to be switched, right?
> > >
> > > Let me explained in details of this particular use case example for
> > > comparing apples to apples.
> > >
> > > Let's say a healthy broker hosting 3000 partitions, and of which 1000
> > > are the preferred leaders (leader count is 1000). There is a hardware
> > > failure (disk/memory, etc.), and kafka process crashed. We swap this
> > > host with another host but keep the same broker.id, when this new
> > > broker coming up, it has no historical data, and we manage to have the
> > > current last offsets of all partitions set in
> > > the replication-offset-checkpoint (if we don't set them, it could cause
> > > crazy ReplicaFetcher pulling of historical data from other brokers and
> > > cause cluster high latency and other instabilities), so when Kafka is
> > > brought up, it is quickly catching up as followers in the ISR.  Note,
> > > we have auto.leader.rebalance.enable  disabled, so it's not serving any
> > > traffic as leaders (leader count = 0), even there are 1000 partitions
> > > that this broker is the Preferred Leader.
> > >
> > > We need to make this broker not serving traffic for a few hours or days
> > > depending on the SLA of the topic retention requirement until after
> > > it's having enough historical data.
> > >
> > >
> > > * The traditional way using the reassignments to move this broker in
> > > that 1000 partitions where it's the preferred leader to the end of
> > > assignment, this is O(N) operation. and from my experience, we can't
> > > submit all 1000 at the same time, otherwise cause higher latencies even
> > > the reassignment in this case can complete almost instantly.  After  a
> > > few hours/days whatever, this broker is ready to serve traffic,  we
> > > have to run reassignments again to restore that 1000 partitions
> > > preferred leaders for this broker: O(N) operation.  then run preferred
> > > leader election O(N) again.  So total 3 x O(N) operations.  The point
> > > is since the new empty broker is expected to be the same as the old one
> > > in terms of hosting partition/leaders, it would seem unnecessary to do
> > > reassignments (ordering of replica) during the broker catching up time.
> > >
> > >
> > >
> > > * The new feature Preferred Leader "Blacklist":  just need to put a
> > > dynamic config to indicate that this broker should be considered leader
> > > (preferred leader election or broker failover or unclean leader
> > > election) to the lowest priority. NO need to run any reassignments.
> > > After a few hours/days, when this broker is ready, remove the dynamic
> > > config, and run preferred leader election and this broker will serve
> > > traffic for that 1000 original partitions it was the preferred leader.
> > > So total  1 x O(N) operation.
> > >
> > >
> > > If auto.leader.rebalance.enable  is enabled,  the Preferred Leader
> > > "Blacklist" can be put it before Kafka is started to prevent this
> > > broker serving traffic.  In the traditional way of running
> > > reassignments, once the broker is up,
> > > with auto.leader.rebalance.enable  , if leadership starts going to this
> > > new empty broker, it might have to do preferred leader election after
> > > reassignments to remove its leaderships. e.g. (1,2,3) => (2,3,1)
> > > reassignment only change the ordering, 1 remains as the current leader,
> > > and needs prefer leader election to change to 2 after reassignment. so
> > > potentially one more O(N) operation.
> > >
> > > I hope the above example can show how easy to "blacklist" a broker
> > > serving leadership.  For someone managing Production Kafka cluster,
> > > it's important to react fast to certain alerts and mitigate/resolve
> > > some issues. As I listed the other use cases in KIP-291, I think this
> > > feature can make the Kafka product more easier to manage/operate.
> > >
> > > > In general, using an external rebalancing tool like Cruise Control is
> > a good idea to keep things balanced without having deal with manual
> > rebalancing.  > We expect more and more people who have a complex or large
> > cluster will start using tools like this.
> > > >
> > > > However, if you choose to do manual rebalancing, it shouldn't be that
> > bad.  You would save the existing partition ordering before making your
> > changes, then> make your changes (perhaps by running a simple command line
> > tool that switches the order of the replicas).  Then, once you felt like
> > the broker was ready to> serve traffic, you could just re-apply the old
> > ordering which you had saved.
> > >
> > >
> > > We do have our own rebalancing tool which has its own criteria like
> > > Rack diversity,  disk usage,  spread partitions/leaders across all
> > > brokers in the cluster per topic, leadership Bytes/BytesIn served per
> > > broker, etc.  We can run reassignments. The point is whether it's
> > > really necessary, and if there is more effective, easier, safer way to
> > > do it.
> > >
> > > take another use case example of taking leadership out of busy
> > > Controller to give it more power to serve metadata requests and other
> > > work. The controller can failover, with the preferred leader
> > > "blacklist",  it does not have to run reassignments again when
> > > controller failover, just change the blacklisted broker_id.
> > >
> > >
> > > > I was thinking about a PlacementPolicy filling the role of preventing
> > people from creating single-replica partitions on a node that we didn't
> > want to > ever be the leader.  I thought that it could also prevent people
> > from designating those nodes as preferred leaders during topic creation, or
> > Kafka from doing> itduring random topic creation.  I was assuming that the
> > PlacementPolicy would determine which nodes were which through static
> > configuration keys.  I agree> static configuration keys are somewhat less
> > flexible than dynamic configuration.
> > >
> > >
> > > I think single-replica partition might not be a good example.  There
> > > should not be any single-replica partition at all. If yes. it's
> > > probably because of trying to save disk space with less replicas.  I
> > > think at least minimum 2. The user purposely creating single-replica
> > > partition will take full responsibilities of data loss and
> > > unavailability when a broker fails or under maintenance.
> > >
> > >
> > > I think it would be better to use dynamic instead of static config.  I
> > > also think it would be better to have topic creation Policy enforced in
> > > Kafka server OR an external service. We have an external/central
> > > service managing topic creation/partition expansion which takes into
> > > account of rack-diversity, replication factor (2, 3 or 4 depending on
> > > cluster/topic type), Policy replicating the topic between kafka
> > > clusters, etc.
> > >
> > >
> > >
> > > Thanks,
> > > George
> > >
> > >
> > >    On Wednesday, August 7, 2019, 05:41:28 PM PDT, Colin McCabe
> > > <cm...@apache.org> wrote:
> > >
> > >  On Wed, Aug 7, 2019, at 12:48, George Li wrote:
> > > >  Hi Colin,
> > > >
> > > > Thanks for your feedbacks.  Comments below:
> > > > > Even if you have a way of blacklisting an entire broker all at once,
> > you still would need to run a leader election > for each partition where
> > you want to move the leader off of the blacklisted broker.  So the
> > operation is still O(N) in > that sense-- you have to do something per
> > partition.
> > > >
> > > > For a failed broker and swapped with an empty broker, when it comes
> > up,
> > > > it will not have any leadership, and we would like it to remain not
> > > > having leaderships for a couple of hours or days. So there is no
> > > > preferred leader election needed which incurs O(N) operation in this
> > > > case.  Putting the preferred leader blacklist would safe guard this
> > > > broker serving traffic during that time. otherwise, if another broker
> > > > fails(if this broker is the 1st, 2nd in the assignment), or someone
> > > > runs preferred leader election, this new "empty" broker can still get
> > > > leaderships.
> > > >
> > > > Also running reassignment to change the ordering of preferred leader
> > > > would not actually switch the leader automatically.  e.g.  (1,2,3) =>
> > > > (2,3,1). unless preferred leader election is run to switch current
> > > > leader from 1 to 2.  So the operation is at least 2 x O(N).  and then
> > > > after the broker is back to normal, another 2 x O(N) to rollback.
> > >
> > > Hi George,
> > >
> > > Hmm.  I guess I'm still on the fence about this feature.
> > >
> > > In your example, I think we're comparing apples and oranges.  You
> > > started by outlining a scenario where "an empty broker... comes up...
> > > [without] any leadership[s]."  But then you criticize using
> > > reassignment to switch the order of preferred replicas because it
> > > "would not actually switch the leader automatically."  If the empty
> > > broker doesn't have any leaderships, there is nothing to be switched,
> > > right?
> > >
> > > >
> > > >
> > > > > In general, reassignment will get a lot easier and quicker once
> > KIP-455 is implemented.  > Reassignments that just change the order of
> > preferred replicas for a specific partition should complete pretty much
> > instantly.
> > > > >> I think it's simpler and easier just to have one source of truth
> > for what the preferred replica is for a partition, rather than two.  So
> > for> me, the fact that the replica assignment ordering isn't changed is
> > actually a big disadvantage of this KIP.  If you are a new user (or just>
> > an existing user that didn't read all of the documentation) and you just
> > look at the replica assignment, you might be confused by why> a particular
> > broker wasn't getting any leaderships, even  though it appeared like it
> > should.  More mechanisms mean more complexity> for users and developers
> > most of the time.
> > > >
> > > >
> > > > I would like stress the point that running reassignment to change the
> > > > ordering of the replica (putting a broker to the end of partition
> > > > assignment) is unnecessary, because after some time the broker is
> > > > caught up, it can start serving traffic and then need to run
> > > > reassignments again to "rollback" to previous states. As I mentioned
> > in
> > > > KIP-491, this is just tedious work.
> > >
> > > In general, using an external rebalancing tool like Cruise Control is a
> > > good idea to keep things balanced without having deal with manual
> > > rebalancing.  We expect more and more people who have a complex or
> > > large cluster will start using tools like this.
> > >
> > > However, if you choose to do manual rebalancing, it shouldn't be that
> > > bad.  You would save the existing partition ordering before making your
> > > changes, then make your changes (perhaps by running a simple command
> > > line tool that switches the order of the replicas).  Then, once you
> > > felt like the broker was ready to serve traffic, you could just
> > > re-apply the old ordering which you had saved.
> > >
> > > >
> > > > I agree this might introduce some complexities for users/developers.
> > > > But if this feature is good, and well documented, it is good for the
> > > > kafka product/community.  Just like KIP-460 enabling unclean leader
> > > > election to override TopicLevel/Broker Level config of
> > > > `unclean.leader.election.enable`
> > > >
> > > > > I agree that it would be nice if we could treat some brokers
> > differently for the purposes of placing replicas, selecting leaders, etc. >
> > Right now, we don't have any way of implementing that without forking the
> > broker.  I would support a new PlacementPolicy class that> would close this
> > gap.  But I don't think this KIP is flexible enough to fill this role.  For
> > example, it can't prevent users from creating> new single-replica topics
> > that get put on the "bad" replica.  Perhaps we should reopen the
> > discussion> about
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-201%3A+Rationalising+Policy+interfaces
> > > >
> > > > Creating topic with single-replica is beyond what KIP-491 is trying to
> > > > achieve.  The user needs to take responsibility of doing that. I do
> > see
> > > > some Samza clients notoriously creating single-replica topics and that
> > > > got flagged by alerts, because a single broker down/maintenance will
> > > > cause offline partitions. For KIP-491 preferred leader "blacklist",
> > > > the single-replica will still serve as leaders, because there is no
> > > > other alternative replica to be chosen as leader.
> > > >
> > > > Even with a new PlacementPolicy for topic creation/partition
> > expansion,
> > > > it still needs the blacklist info (e.g. a zk path node, or broker
> > > > level/topic level config) to "blacklist" the broker to be preferred
> > > > leader? Would it be the same as KIP-491 is introducing?
> > >
> > > I was thinking about a PlacementPolicy filling the role of preventing
> > > people from creating single-replica partitions on a node that we didn't
> > > want to ever be the leader.  I thought that it could also prevent
> > > people from designating those nodes as preferred leaders during topic
> > > creation, or Kafka from doing itduring random topic creation.  I was
> > > assuming that the PlacementPolicy would determine which nodes were
> > > which through static configuration keys.  I agree static configuration
> > > keys are somewhat less flexible than dynamic configuration.
> > >
> > > best,
> > > Colin
> > >
> > >
> > > >
> > > >
> > > > Thanks,
> > > > George
> > > >
> > > >    On Wednesday, August 7, 2019, 11:01:51 AM PDT, Colin McCabe
> > > > <cm...@apache.org> wrote:
> > > >
> > > >  On Fri, Aug 2, 2019, at 20:02, George Li wrote:
> > > > >  Hi Colin,
> > > > > Thanks for looking into this KIP.  Sorry for the late response. been
> > busy.
> > > > >
> > > > > If a cluster has MAMY topic partitions, moving this "blacklist"
> > broker
> > > > > to the end of replica list is still a rather "big" operation,
> > involving
> > > > > submitting reassignments.  The KIP-491 way of blacklist is much
> > > > > simpler/easier and can undo easily without changing the replica
> > > > > assignment ordering.
> > > >
> > > > Hi George,
> > > >
> > > > Even if you have a way of blacklisting an entire broker all at once,
> > > > you still would need to run a leader election for each partition where
> > > > you want to move the leader off of the blacklisted broker.  So the
> > > > operation is still O(N) in that sense-- you have to do something per
> > > > partition.
> > > >
> > > > In general, reassignment will get a lot easier and quicker once
> > KIP-455
> > > > is implemented.  Reassignments that just change the order of preferred
> > > > replicas for a specific partition should complete pretty much
> > instantly.
> > > >
> > > > I think it's simpler and easier just to have one source of truth for
> > > > what the preferred replica is for a partition, rather than two.  So
> > for
> > > > me, the fact that the replica assignment ordering isn't changed is
> > > > actually a big disadvantage of this KIP.  If you are a new user (or
> > > > just an existing user that didn't read all of the documentation) and
> > > > you just look at the replica assignment, you might be confused by why
> > a
> > > > particular broker wasn't getting any leaderships, even  though it
> > > > appeared like it should.  More mechanisms mean more complexity for
> > > > users and developers most of the time.
> > > >
> > > > > Major use case for me, a failed broker got swapped with new
> > hardware,
> > > > > and starts up as empty (with latest offset of all partitions), the
> > SLA
> > > > > of retention is 1 day, so before this broker is up to be in-sync for
> > 1
> > > > > day, we would like to blacklist this broker from serving traffic.
> > after
> > > > > 1 day, the blacklist is removed and run preferred leader election.
> > > > > This way, no need to run reassignments before/after.  This is the
> > > > > "temporary" use-case.
> > > >
> > > > What if we just add an option to the reassignment tool to generate a
> > > > plan to move all the leaders off of a specific broker?  The tool could
> > > > also run a leader election as well.  That would be a simple way of
> > > > doing this without adding new mechanisms or broker-side
> > configurations,
> > > > etc.
> > > >
> > > > >
> > > > > There are use-cases that this Preferred Leader "blacklist" can be
> > > > > somewhat permanent, as I explained in the AWS data center instances
> > Vs.
> > > > > on-premises data center bare metal machines (heterogenous hardware),
> > > > > that the AWS broker_ids will be blacklisted.  So new topics
> > created,
> > > > > or existing topic expansion would not make them serve traffic even
> > they
> > > > > could be the preferred leader.
> > > >
> > > > I agree that it would be nice if we could treat some brokers
> > > > differently for the purposes of placing replicas, selecting leaders,
> > > > etc.  Right now, we don't have any way of implementing that without
> > > > forking the broker.  I would support a new PlacementPolicy class that
> > > > would close this gap.  But I don't think this KIP is flexible enough
> > to
> > > > fill this role.  For example, it can't prevent users from creating new
> > > > single-replica topics that get put on the "bad" replica.  Perhaps we
> > > > should reopen the discussion about
> > > >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-201%3A+Rationalising+Policy+interfaces
> > > >
> > > > regards,
> > > > Colin
> > > >
> > > > >
> > > > > Please let me know there are more question.
> > > > >
> > > > >
> > > > > Thanks,
> > > > > George
> > > > >
> > > > >    On Thursday, July 25, 2019, 08:38:28 AM PDT, Colin McCabe
> > > > > <cm...@apache.org> wrote:
> > > > >
> > > > >  We still want to give the "blacklisted" broker the leadership if
> > > > > nobody else is available.  Therefore, isn't putting a broker on the
> > > > > blacklist pretty much the same as moving it to the last entry in the
> > > > > replicas list and then triggering a preferred leader election?
> > > > >
> > > > > If we want this to be undone after a certain amount of time, or
> > under
> > > > > certain conditions, that seems like something that would be more
> > > > > effectively done by an external system, rather than putting all
> > these
> > > > > policies into Kafka.
> > > > >
> > > > > best,
> > > > > Colin
> > > > >
> > > > >
> > > > > On Fri, Jul 19, 2019, at 18:23, George Li wrote:
> > > > > >  Hi Satish,
> > > > > > Thanks for the reviews and feedbacks.
> > > > > >
> > > > > > > > The following is the requirements this KIP is trying to
> > accomplish:
> > > > > > > This can be moved to the"Proposed changes" section.
> > > > > >
> > > > > > Updated the KIP-491.
> > > > > >
> > > > > > > >>The logic to determine the priority/order of which broker
> > should be
> > > > > > > preferred leader should be modified.  The broker in the
> > preferred leader
> > > > > > > blacklist should be moved to the end (lowest priority) when
> > > > > > > determining leadership.
> > > > > > >
> > > > > > > I believe there is no change required in the ordering of the
> > preferred
> > > > > > > replica list. Brokers in the preferred leader blacklist are
> > skipped
> > > > > > > until other brokers int he list are unavailable.
> > > > > >
> > > > > > Yes. partition assignment remained the same, replica & ordering.
> > The
> > > > > > blacklist logic can be optimized during implementation.
> > > > > >
> > > > > > > >>The blacklist can be at the broker level. However, there might
> > be use cases
> > > > > > > where a specific topic should blacklist particular brokers, which
> > > > > > > would be at the
> > > > > > > Topic level Config. For this use cases of this KIP, it seems
> > that broker level
> > > > > > > blacklist would suffice.  Topic level preferred leader blacklist
> > might
> > > > > > > be future enhancement work.
> > > > > > >
> > > > > > > I agree that the broker level preferred leader blacklist would be
> > > > > > > sufficient. Do you have any use cases which require topic level
> > > > > > > preferred blacklist?
> > > > > >
> > > > > >
> > > > > >
> > > > > > I don't have any concrete use cases for Topic level preferred
> > leader
> > > > > > blacklist.  One scenarios I can think of is when a broker has high
> > CPU
> > > > > > usage, trying to identify the big topics (High MsgIn, High
> > BytesIn,
> > > > > > etc), then try to move the leaders away from this broker,  before
> > doing
> > > > > > an actual reassignment to change its preferred leader,  try to put
> > this
> > > > > > preferred_leader_blacklist in the Topic Level config, and run
> > preferred
> > > > > > leader election, and see whether CPU decreases for this broker,
> > if
> > > > > > yes, then do the reassignments to change the preferred leaders to
> > be
> > > > > > "permanent" (the topic may have many partitions like 256 that has
> > quite
> > > > > > a few of them having this broker as preferred leader).  So this
> > Topic
> > > > > > Level config is an easy way of doing trial and check the result.
> > > > > >
> > > > > >
> > > > > > > You can add the below workaround as an item in the rejected
> > alternatives section
> > > > > > > "Reassigning all the topic/partitions which the intended broker
> > is a
> > > > > > > replica for."
> > > > > >
> > > > > > Updated the KIP-491.
> > > > > >
> > > > > >
> > > > > >
> > > > > > Thanks,
> > > > > > George
> > > > > >
> > > > > >    On Friday, July 19, 2019, 08:20:22 AM PDT, Satish Duggana
> > > > > > <sa...@gmail.com> wrote:
> > > > > >
> > > > > >  Thanks for the KIP. I have put my comments below.
> > > > > >
> > > > > > This is a nice improvement to avoid cumbersome maintenance.
> > > > > >
> > > > > > >> The following is the requirements this KIP is trying to
> > accomplish:
> > > > > >   The ability to add and remove the preferred leader deprioritized
> > > > > > list/blacklist. e.g. new ZK path/node or new dynamic config.
> > > > > >
> > > > > > This can be moved to the"Proposed changes" section.
> > > > > >
> > > > > > >>The logic to determine the priority/order of which broker should
> > be
> > > > > > preferred leader should be modified.  The broker in the preferred
> > leader
> > > > > > blacklist should be moved to the end (lowest priority) when
> > > > > > determining leadership.
> > > > > >
> > > > > > I believe there is no change required in the ordering of the
> > preferred
> > > > > > replica list. Brokers in the preferred leader blacklist are skipped
> > > > > > until other brokers int he list are unavailable.
> > > > > >
> > > > > > >>The blacklist can be at the broker level. However, there might
> > be use cases
> > > > > > where a specific topic should blacklist particular brokers, which
> > > > > > would be at the
> > > > > > Topic level Config. For this use cases of this KIP, it seems that
> > broker level
> > > > > > blacklist would suffice.  Topic level preferred leader blacklist
> > might
> > > > > > be future enhancement work.
> > > > > >
> > > > > > I agree that the broker level preferred leader blacklist would be
> > > > > > sufficient. Do you have any use cases which require topic level
> > > > > > preferred blacklist?
> > > > > >
> > > > > > You can add the below workaround as an item in the rejected
> > alternatives section
> > > > > > "Reassigning all the topic/partitions which the intended broker is
> > a
> > > > > > replica for."
> > > > > >
> > > > > > Thanks,
> > > > > > Satish.
> > > > > >
> > > > > > On Fri, Jul 19, 2019 at 7:33 AM Stanislav Kozlovski
> > > > > > <st...@confluent.io> wrote:
> > > > > > >
> > > > > > > Hey George,
> > > > > > >
> > > > > > > Thanks for the KIP, it's an interesting idea.
> > > > > > >
> > > > > > > I was wondering whether we could achieve the same thing via the
> > > > > > > kafka-reassign-partitions tool. As you had also said in the
> > JIRA,  it is
> > > > > > > true that this is currently very tedious with the tool. My
> > thoughts are
> > > > > > > that we could improve the tool and give it the notion of a
> > "blacklisted
> > > > > > > preferred leader".
> > > > > > > This would have some benefits like:
> > > > > > > - more fine-grained control over the blacklist. we may not want
> > to
> > > > > > > blacklist all the preferred leaders, as that would make the
> > blacklisted
> > > > > > > broker a follower of last resort which is not very useful. In
> > the cases of
> > > > > > > an underpowered AWS machine or a controller, you might overshoot
> > and make
> > > > > > > the broker very underutilized if you completely make it
> > leaderless.
> > > > > > > - is not permanent. If we are to have a blacklist leaders config,
> > > > > > > rebalancing tools would also need to know about it and
> > manipulate/respect
> > > > > > > it to achieve a fair balance.
> > > > > > > It seems like both problems are tied to balancing partitions,
> > it's just
> > > > > > > that KIP-491's use case wants to balance them against other
> > factors in a
> > > > > > > more nuanced way. It makes sense to have both be done from the
> > same place
> > > > > > >
> > > > > > > To make note of the motivation section:
> > > > > > > > Avoid bouncing broker in order to lose its leadership
> > > > > > > The recommended way to make a broker lose its leadership is to
> > run a
> > > > > > > reassignment on its partitions
> > > > > > > > The cross-data center cluster has AWS cloud instances which
> > have less
> > > > > > > computing power
> > > > > > > We recommend running Kafka on homogeneous machines. It would be
> > cool if the
> > > > > > > system supported more flexibility in that regard but that is
> > more nuanced
> > > > > > > and a preferred leader blacklist may not be the best first
> > approach to the
> > > > > > > issue
> > > > > > >
> > > > > > > Adding a new config which can fundamentally change the way
> > replication is
> > > > > > > done is complex, both for the system (the replication code is
> > complex
> > > > > > > enough) and the user. Users would have another potential config
> > that could
> > > > > > > backfire on them - e.g if left forgotten.
> > > > > > >
> > > > > > > Could you think of any downsides to implementing this
> > functionality (or a
> > > > > > > variation of it) in the kafka-reassign-partitions.sh tool?
> > > > > > > One downside I can see is that we would not have it handle new
> > partitions
> > > > > > > created after the "blacklist operation". As a first iteration I
> > think that
> > > > > > > may be acceptable
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Stanislav
> > > > > > >
> > > > > > > On Fri, Jul 19, 2019 at 3:20 AM George Li <
> > sql_consulting@yahoo.com.invalid>
> > > > > > > wrote:
> > > > > > >
> > > > > > > >  Hi,
> > > > > > > >
> > > > > > > > Pinging the list for the feedbacks of this KIP-491  (
> > > > > > > >
> > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120736982
> > > > > > > > )
> > > > > > > >
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > George
> > > > > > > >
> > > > > > > >    On Saturday, July 13, 2019, 08:43:25 PM PDT, George Li <
> > > > > > > > sql_consulting@yahoo.com.INVALID> wrote:
> > > > > > > >
> > > > > > > >  Hi,
> > > > > > > >
> > > > > > > > I have created KIP-491 (
> > > > > > > >
> > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120736982
> > )
> > > > > > > > for putting a broker to the preferred leader blacklist or
> > deprioritized
> > > > > > > > list so when determining leadership,  it's moved to the lowest
> > priority for
> > > > > > > > some of the listed use-cases.
> > > > > > > >
> > > > > > > > Please provide your comments/feedbacks.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > George
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >  ----- Forwarded Message ----- From: Jose Armando Garcia
> > Sancio (JIRA) <
> > > > > > > > jira@apache.org>To: "sql_consulting@yahoo.com" <
> > sql_consulting@yahoo.com>Sent:
> > > > > > > > Tuesday, July 9, 2019, 01:06:05 PM PDTSubject: [jira]
> > [Commented]
> > > > > > > > (KAFKA-8638) Preferred Leader Blacklist (deprioritized list)
> > > > > > > >
> > > > > > > >    [
> > > > > > > >
> > https://issues.apache.org/jira/browse/KAFKA-8638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881511#comment-16881511
> > > > > > > > ]
> > > > > > > >
> > > > > > > > Jose Armando Garcia Sancio commented on KAFKA-8638:
> > > > > > > > ---------------------------------------------------
> > > > > > > >
> > > > > > > > Thanks for feedback and clear use cases [~sql_consulting].
> > > > > > > >
> > > > > > > > > Preferred Leader Blacklist (deprioritized list)
> > > > > > > > > -----------------------------------------------
> > > > > > > > >
> > > > > > > > >                Key: KAFKA-8638
> > > > > > > > >                URL:
> > https://issues.apache.org/jira/browse/KAFKA-8638
> > > > > > > > >            Project: Kafka
> > > > > > > > >          Issue Type: Improvement
> > > > > > > > >          Components: config, controller, core
> > > > > > > > >    Affects Versions: 1.1.1, 2.3.0, 2.2.1
> > > > > > > > >            Reporter: GEORGE LI
> > > > > > > > >            Assignee: GEORGE LI
> > > > > > > > >            Priority: Major
> > > > > > > > >
> > > > > > > > > Currently, the kafka preferred leader election will pick the
> > broker_id
> > > > > > > > in the topic/partition replica assignments in a priority order
> > when the
> > > > > > > > broker is in ISR. The preferred leader is the broker id in the
> > first
> > > > > > > > position of replica. There are use-cases that, even the first
> > broker in the
> > > > > > > > replica assignment is in ISR, there is a need for it to be
> > moved to the end
> > > > > > > > of ordering (lowest priority) when deciding leadership during
> > preferred
> > > > > > > > leader election.
> > > > > > > > > Let’s use topic/partition replica (1,2,3) as an example. 1
> > is the
> > > > > > > > preferred leader.  When preferred leadership is run, it will
> > pick 1 as the
> > > > > > > > leader if it's ISR, if 1 is not online and in ISR, then pick
> > 2, if 2 is not
> > > > > > > > in ISR, then pick 3 as the leader. There are use cases that,
> > even 1 is in
> > > > > > > > ISR, we would like it to be moved to the end of ordering
> > (lowest priority)
> > > > > > > > when deciding leadership during preferred leader election.
> > Below is a list
> > > > > > > > of use cases:
> > > > > > > > > * (If broker_id 1 is a swapped failed host and brought up
> > with last
> > > > > > > > segments or latest offset without historical data (There is
> > another effort
> > > > > > > > on this), it's better for it to not serve leadership till it's
> > caught-up.
> > > > > > > > > * The cross-data center cluster has AWS instances which have
> > less
> > > > > > > > computing power than the on-prem bare metal machines.  We
> > could put the AWS
> > > > > > > > broker_ids in Preferred Leader Blacklist, so on-prem brokers
> > can be elected
> > > > > > > > leaders, without changing the reassignments ordering of the
> > replicas.
> > > > > > > > > * If the broker_id 1 is constantly losing leadership after
> > some time:
> > > > > > > > "Flapping". we would want to exclude 1 to be a leader unless
> > all other
> > > > > > > > brokers of this topic/partition are offline.  The “Flapping”
> > effect was
> > > > > > > > seen in the past when 2 or more brokers were bad, when they
> > lost leadership
> > > > > > > > constantly/quickly, the sets of partition replicas they belong
> > to will see
> > > > > > > > leadership constantly changing.  The ultimate solution is to
> > swap these bad
> > > > > > > > hosts.  But for quick mitigation, we can also put the bad
> > hosts in the
> > > > > > > > Preferred Leader Blacklist to move the priority of its being
> > elected as
> > > > > > > > leaders to the lowest.
> > > > > > > > > *  If the controller is busy serving an extra load of
> > metadata requests
> > > > > > > > and other tasks. we would like to put the controller's leaders
> > to other
> > > > > > > > brokers to lower its CPU load. currently bouncing to lose
> > leadership would
> > > > > > > > not work for Controller, because after the bounce, the
> > controller fails
> > > > > > > > over to another broker.
> > > > > > > > > * Avoid bouncing broker in order to lose its leadership: it
> > would be
> > > > > > > > good if we have a way to specify which broker should be
> > excluded from
> > > > > > > > serving traffic/leadership (without changing the replica
> > assignment
> > > > > > > > ordering by reassignments, even though that's quick), and run
> > preferred
> > > > > > > > leader election.  A bouncing broker will cause temporary URP,
> > and sometimes
> > > > > > > > other issues.  Also a bouncing of broker (e.g. broker_id 1)
> > can temporarily
> > > > > > > > lose all its leadership, but if another broker (e.g. broker_id
> > 2) fails or
> > > > > > > > gets bounced, some of its leaderships will likely failover to
> > broker_id 1
> > > > > > > > on a replica with 3 brokers.  If broker_id 1 is in the
> > blacklist, then in
> > > > > > > > such a scenario even broker_id 2 offline,  the 3rd broker can
> > take
> > > > > > > > leadership.
> > > > > > > > > The current work-around of the above is to change the
> > topic/partition's
> > > > > > > > replica reassignments to move the broker_id 1 from the first
> > position to
> > > > > > > > the last position and run preferred leader election. e.g. (1,
> > 2, 3) => (2,
> > > > > > > > 3, 1). This changes the replica reassignments, and we need to
> > keep track of
> > > > > > > > the original one and restore if things change (e.g. controller
> > fails over
> > > > > > > > to another broker, the swapped empty broker caught up). That’s
> > a rather
> > > > > > > > tedious task.
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > This message was sent by Atlassian JIRA
> > > > > > > > (v7.6.3#76005)
>

Re: [DISCUSS] KIP-491: Preferred Leader Deprioritized List (Temporary Blacklist)

Posted by Harsha Chintalapani <ka...@harsha.io>.
Hi Colin,
          Can you give us more details on why you don't want this to be
part of the Kafka core. You are proposing KIP-500 which will take away
zookeeper and
writing this interim tools to change the zookeeper metadata doesn't make
sense to me. As George pointed out there are several benefits having it in
the system itself
instead of asking users to hack bunch of json files to deal with outage
scenario.

Thanks,
Harsha

On Fri, Sep 6, 2019 at 4:36 PM George Li <sq...@yahoo.com.invalid>
wrote:

>  Hi Colin,
>
> Thanks for the feedback.  The "separate set of metadata about blacklists"
> in KIP-491 is just the list of broker ids. Usually 1 or 2 or a couple in
> the cluster.  Should be easier than keeping json files?  e.g. what if we
> first blacklist broker_id_1, then another broker_id_2 has issues, and we
> need to write out another json file to restore later (and in which order)?
>  Using blacklist, we can just add the broker_id_2 to the existing one. and
> remove whatever broker_id returning to good state without worrying how(the
> ordering of putting the broker to blacklist) to restore.
>
> For topic level config,  the blacklist will be tied to
> topic/partition(e.g.  Configs:
> topic.preferred.leader.blacklist=0:101,102;1:103    where 0 & 1 is the
> partition#, 101,102,103 are the blacklist broker_ids), and easier to
> update/remove, no need for external json files?
>
>
> Thanks,
> George
>
>     On Friday, September 6, 2019, 02:20:33 PM PDT, Colin McCabe <
> cmccabe@apache.org> wrote:
>
>  One possibility would be writing a new command-line tool that would
> deprioritize a given replica using the new KIP-455 API.  Then it could
> write out a JSON files containing the old priorities, which could be
> restored when (or if) we needed to do so.  This seems like it might be
> simpler and easier to maintain than a separate set of metadata about
> blacklists.
>
> best,
> Colin
>
>
> On Fri, Sep 6, 2019, at 11:58, George Li wrote:
> >  Hi,
> >
> > Just want to ping and bubble up the discussion of KIP-491.
> >
> > On a large scale of Kafka clusters with thousands of brokers in many
> > clusters.  Frequent hardware failures are common, although the
> > reassignments to change the preferred leaders is a workaround, it
> > incurs unnecessary additional work than the proposed preferred leader
> > blacklist in KIP-491, and hard to scale.
> >
> > I am wondering whether others using Kafka in a big scale running into
> > same problem.
> >
> >
> > Satish,
> >
> > Regarding your previous question about whether there is use-case for
> > TopicLevel preferred leader "blacklist",  I thought about one
> > use-case:  to improve rebalance/reassignment, the large partition will
> > usually cause performance/stability issues, planning to change the say
> > the New Replica will start with Leader's latest offset(this way the
> > replica is almost instantly in the ISR and reassignment completed), and
> > put this partition's NewReplica into Preferred Leader "Blacklist" at
> > the Topic Level config for that partition. After sometime(retention
> > time), this new replica has caught up and ready to serve traffic,
> > update/remove the TopicConfig for this partition's preferred leader
> > blacklist.
> >
> > I will update the KIP-491 later for this use case of Topic Level config
> > for Preferred Leader Blacklist.
> >
> >
> > Thanks,
> > George
> >
> >    On Wednesday, August 7, 2019, 07:43:55 PM PDT, George Li
> > <sq...@yahoo.com> wrote:
> >
> >  Hi Colin,
> >
> > > In your example, I think we're comparing apples and oranges.  You
> started by outlining a scenario where "an empty broker... comes up...
> [without] any > leadership[s]."  But then you criticize using reassignment
> to switch the order of preferred replicas because it "would not actually
> switch the leader > automatically."  If the empty broker doesn't have any
> leaderships, there is nothing to be switched, right?
> >
> > Let me explained in details of this particular use case example for
> > comparing apples to apples.
> >
> > Let's say a healthy broker hosting 3000 partitions, and of which 1000
> > are the preferred leaders (leader count is 1000). There is a hardware
> > failure (disk/memory, etc.), and kafka process crashed. We swap this
> > host with another host but keep the same broker.id, when this new
> > broker coming up, it has no historical data, and we manage to have the
> > current last offsets of all partitions set in
> > the replication-offset-checkpoint (if we don't set them, it could cause
> > crazy ReplicaFetcher pulling of historical data from other brokers and
> > cause cluster high latency and other instabilities), so when Kafka is
> > brought up, it is quickly catching up as followers in the ISR.  Note,
> > we have auto.leader.rebalance.enable  disabled, so it's not serving any
> > traffic as leaders (leader count = 0), even there are 1000 partitions
> > that this broker is the Preferred Leader.
> >
> > We need to make this broker not serving traffic for a few hours or days
> > depending on the SLA of the topic retention requirement until after
> > it's having enough historical data.
> >
> >
> > * The traditional way using the reassignments to move this broker in
> > that 1000 partitions where it's the preferred leader to the end of
> > assignment, this is O(N) operation. and from my experience, we can't
> > submit all 1000 at the same time, otherwise cause higher latencies even
> > the reassignment in this case can complete almost instantly.  After  a
> > few hours/days whatever, this broker is ready to serve traffic,  we
> > have to run reassignments again to restore that 1000 partitions
> > preferred leaders for this broker: O(N) operation.  then run preferred
> > leader election O(N) again.  So total 3 x O(N) operations.  The point
> > is since the new empty broker is expected to be the same as the old one
> > in terms of hosting partition/leaders, it would seem unnecessary to do
> > reassignments (ordering of replica) during the broker catching up time.
> >
> >
> >
> > * The new feature Preferred Leader "Blacklist":  just need to put a
> > dynamic config to indicate that this broker should be considered leader
> > (preferred leader election or broker failover or unclean leader
> > election) to the lowest priority. NO need to run any reassignments.
> > After a few hours/days, when this broker is ready, remove the dynamic
> > config, and run preferred leader election and this broker will serve
> > traffic for that 1000 original partitions it was the preferred leader.
> > So total  1 x O(N) operation.
> >
> >
> > If auto.leader.rebalance.enable  is enabled,  the Preferred Leader
> > "Blacklist" can be put it before Kafka is started to prevent this
> > broker serving traffic.  In the traditional way of running
> > reassignments, once the broker is up,
> > with auto.leader.rebalance.enable  , if leadership starts going to this
> > new empty broker, it might have to do preferred leader election after
> > reassignments to remove its leaderships. e.g. (1,2,3) => (2,3,1)
> > reassignment only change the ordering, 1 remains as the current leader,
> > and needs prefer leader election to change to 2 after reassignment. so
> > potentially one more O(N) operation.
> >
> > I hope the above example can show how easy to "blacklist" a broker
> > serving leadership.  For someone managing Production Kafka cluster,
> > it's important to react fast to certain alerts and mitigate/resolve
> > some issues. As I listed the other use cases in KIP-291, I think this
> > feature can make the Kafka product more easier to manage/operate.
> >
> > > In general, using an external rebalancing tool like Cruise Control is
> a good idea to keep things balanced without having deal with manual
> rebalancing.  > We expect more and more people who have a complex or large
> cluster will start using tools like this.
> > >
> > > However, if you choose to do manual rebalancing, it shouldn't be that
> bad.  You would save the existing partition ordering before making your
> changes, then> make your changes (perhaps by running a simple command line
> tool that switches the order of the replicas).  Then, once you felt like
> the broker was ready to> serve traffic, you could just re-apply the old
> ordering which you had saved.
> >
> >
> > We do have our own rebalancing tool which has its own criteria like
> > Rack diversity,  disk usage,  spread partitions/leaders across all
> > brokers in the cluster per topic, leadership Bytes/BytesIn served per
> > broker, etc.  We can run reassignments. The point is whether it's
> > really necessary, and if there is more effective, easier, safer way to
> > do it.
> >
> > take another use case example of taking leadership out of busy
> > Controller to give it more power to serve metadata requests and other
> > work. The controller can failover, with the preferred leader
> > "blacklist",  it does not have to run reassignments again when
> > controller failover, just change the blacklisted broker_id.
> >
> >
> > > I was thinking about a PlacementPolicy filling the role of preventing
> people from creating single-replica partitions on a node that we didn't
> want to > ever be the leader.  I thought that it could also prevent people
> from designating those nodes as preferred leaders during topic creation, or
> Kafka from doing> itduring random topic creation.  I was assuming that the
> PlacementPolicy would determine which nodes were which through static
> configuration keys.  I agree> static configuration keys are somewhat less
> flexible than dynamic configuration.
> >
> >
> > I think single-replica partition might not be a good example.  There
> > should not be any single-replica partition at all. If yes. it's
> > probably because of trying to save disk space with less replicas.  I
> > think at least minimum 2. The user purposely creating single-replica
> > partition will take full responsibilities of data loss and
> > unavailability when a broker fails or under maintenance.
> >
> >
> > I think it would be better to use dynamic instead of static config.  I
> > also think it would be better to have topic creation Policy enforced in
> > Kafka server OR an external service. We have an external/central
> > service managing topic creation/partition expansion which takes into
> > account of rack-diversity, replication factor (2, 3 or 4 depending on
> > cluster/topic type), Policy replicating the topic between kafka
> > clusters, etc.
> >
> >
> >
> > Thanks,
> > George
> >
> >
> >    On Wednesday, August 7, 2019, 05:41:28 PM PDT, Colin McCabe
> > <cm...@apache.org> wrote:
> >
> >  On Wed, Aug 7, 2019, at 12:48, George Li wrote:
> > >  Hi Colin,
> > >
> > > Thanks for your feedbacks.  Comments below:
> > > > Even if you have a way of blacklisting an entire broker all at once,
> you still would need to run a leader election > for each partition where
> you want to move the leader off of the blacklisted broker.  So the
> operation is still O(N) in > that sense-- you have to do something per
> partition.
> > >
> > > For a failed broker and swapped with an empty broker, when it comes
> up,
> > > it will not have any leadership, and we would like it to remain not
> > > having leaderships for a couple of hours or days. So there is no
> > > preferred leader election needed which incurs O(N) operation in this
> > > case.  Putting the preferred leader blacklist would safe guard this
> > > broker serving traffic during that time. otherwise, if another broker
> > > fails(if this broker is the 1st, 2nd in the assignment), or someone
> > > runs preferred leader election, this new "empty" broker can still get
> > > leaderships.
> > >
> > > Also running reassignment to change the ordering of preferred leader
> > > would not actually switch the leader automatically.  e.g.  (1,2,3) =>
> > > (2,3,1). unless preferred leader election is run to switch current
> > > leader from 1 to 2.  So the operation is at least 2 x O(N).  and then
> > > after the broker is back to normal, another 2 x O(N) to rollback.
> >
> > Hi George,
> >
> > Hmm.  I guess I'm still on the fence about this feature.
> >
> > In your example, I think we're comparing apples and oranges.  You
> > started by outlining a scenario where "an empty broker... comes up...
> > [without] any leadership[s]."  But then you criticize using
> > reassignment to switch the order of preferred replicas because it
> > "would not actually switch the leader automatically."  If the empty
> > broker doesn't have any leaderships, there is nothing to be switched,
> > right?
> >
> > >
> > >
> > > > In general, reassignment will get a lot easier and quicker once
> KIP-455 is implemented.  > Reassignments that just change the order of
> preferred replicas for a specific partition should complete pretty much
> instantly.
> > > >> I think it's simpler and easier just to have one source of truth
> for what the preferred replica is for a partition, rather than two.  So
> for> me, the fact that the replica assignment ordering isn't changed is
> actually a big disadvantage of this KIP.  If you are a new user (or just>
> an existing user that didn't read all of the documentation) and you just
> look at the replica assignment, you might be confused by why> a particular
> broker wasn't getting any leaderships, even  though it appeared like it
> should.  More mechanisms mean more complexity> for users and developers
> most of the time.
> > >
> > >
> > > I would like stress the point that running reassignment to change the
> > > ordering of the replica (putting a broker to the end of partition
> > > assignment) is unnecessary, because after some time the broker is
> > > caught up, it can start serving traffic and then need to run
> > > reassignments again to "rollback" to previous states. As I mentioned
> in
> > > KIP-491, this is just tedious work.
> >
> > In general, using an external rebalancing tool like Cruise Control is a
> > good idea to keep things balanced without having deal with manual
> > rebalancing.  We expect more and more people who have a complex or
> > large cluster will start using tools like this.
> >
> > However, if you choose to do manual rebalancing, it shouldn't be that
> > bad.  You would save the existing partition ordering before making your
> > changes, then make your changes (perhaps by running a simple command
> > line tool that switches the order of the replicas).  Then, once you
> > felt like the broker was ready to serve traffic, you could just
> > re-apply the old ordering which you had saved.
> >
> > >
> > > I agree this might introduce some complexities for users/developers.
> > > But if this feature is good, and well documented, it is good for the
> > > kafka product/community.  Just like KIP-460 enabling unclean leader
> > > election to override TopicLevel/Broker Level config of
> > > `unclean.leader.election.enable`
> > >
> > > > I agree that it would be nice if we could treat some brokers
> differently for the purposes of placing replicas, selecting leaders, etc. >
> Right now, we don't have any way of implementing that without forking the
> broker.  I would support a new PlacementPolicy class that> would close this
> gap.  But I don't think this KIP is flexible enough to fill this role.  For
> example, it can't prevent users from creating> new single-replica topics
> that get put on the "bad" replica.  Perhaps we should reopen the
> discussion> about
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-201%3A+Rationalising+Policy+interfaces
> > >
> > > Creating topic with single-replica is beyond what KIP-491 is trying to
> > > achieve.  The user needs to take responsibility of doing that. I do
> see
> > > some Samza clients notoriously creating single-replica topics and that
> > > got flagged by alerts, because a single broker down/maintenance will
> > > cause offline partitions. For KIP-491 preferred leader "blacklist",
> > > the single-replica will still serve as leaders, because there is no
> > > other alternative replica to be chosen as leader.
> > >
> > > Even with a new PlacementPolicy for topic creation/partition
> expansion,
> > > it still needs the blacklist info (e.g. a zk path node, or broker
> > > level/topic level config) to "blacklist" the broker to be preferred
> > > leader? Would it be the same as KIP-491 is introducing?
> >
> > I was thinking about a PlacementPolicy filling the role of preventing
> > people from creating single-replica partitions on a node that we didn't
> > want to ever be the leader.  I thought that it could also prevent
> > people from designating those nodes as preferred leaders during topic
> > creation, or Kafka from doing itduring random topic creation.  I was
> > assuming that the PlacementPolicy would determine which nodes were
> > which through static configuration keys.  I agree static configuration
> > keys are somewhat less flexible than dynamic configuration.
> >
> > best,
> > Colin
> >
> >
> > >
> > >
> > > Thanks,
> > > George
> > >
> > >    On Wednesday, August 7, 2019, 11:01:51 AM PDT, Colin McCabe
> > > <cm...@apache.org> wrote:
> > >
> > >  On Fri, Aug 2, 2019, at 20:02, George Li wrote:
> > > >  Hi Colin,
> > > > Thanks for looking into this KIP.  Sorry for the late response. been
> busy.
> > > >
> > > > If a cluster has MAMY topic partitions, moving this "blacklist"
> broker
> > > > to the end of replica list is still a rather "big" operation,
> involving
> > > > submitting reassignments.  The KIP-491 way of blacklist is much
> > > > simpler/easier and can undo easily without changing the replica
> > > > assignment ordering.
> > >
> > > Hi George,
> > >
> > > Even if you have a way of blacklisting an entire broker all at once,
> > > you still would need to run a leader election for each partition where
> > > you want to move the leader off of the blacklisted broker.  So the
> > > operation is still O(N) in that sense-- you have to do something per
> > > partition.
> > >
> > > In general, reassignment will get a lot easier and quicker once
> KIP-455
> > > is implemented.  Reassignments that just change the order of preferred
> > > replicas for a specific partition should complete pretty much
> instantly.
> > >
> > > I think it's simpler and easier just to have one source of truth for
> > > what the preferred replica is for a partition, rather than two.  So
> for
> > > me, the fact that the replica assignment ordering isn't changed is
> > > actually a big disadvantage of this KIP.  If you are a new user (or
> > > just an existing user that didn't read all of the documentation) and
> > > you just look at the replica assignment, you might be confused by why
> a
> > > particular broker wasn't getting any leaderships, even  though it
> > > appeared like it should.  More mechanisms mean more complexity for
> > > users and developers most of the time.
> > >
> > > > Major use case for me, a failed broker got swapped with new
> hardware,
> > > > and starts up as empty (with latest offset of all partitions), the
> SLA
> > > > of retention is 1 day, so before this broker is up to be in-sync for
> 1
> > > > day, we would like to blacklist this broker from serving traffic.
> after
> > > > 1 day, the blacklist is removed and run preferred leader election.
> > > > This way, no need to run reassignments before/after.  This is the
> > > > "temporary" use-case.
> > >
> > > What if we just add an option to the reassignment tool to generate a
> > > plan to move all the leaders off of a specific broker?  The tool could
> > > also run a leader election as well.  That would be a simple way of
> > > doing this without adding new mechanisms or broker-side
> configurations,
> > > etc.
> > >
> > > >
> > > > There are use-cases that this Preferred Leader "blacklist" can be
> > > > somewhat permanent, as I explained in the AWS data center instances
> Vs.
> > > > on-premises data center bare metal machines (heterogenous hardware),
> > > > that the AWS broker_ids will be blacklisted.  So new topics
> created,
> > > > or existing topic expansion would not make them serve traffic even
> they
> > > > could be the preferred leader.
> > >
> > > I agree that it would be nice if we could treat some brokers
> > > differently for the purposes of placing replicas, selecting leaders,
> > > etc.  Right now, we don't have any way of implementing that without
> > > forking the broker.  I would support a new PlacementPolicy class that
> > > would close this gap.  But I don't think this KIP is flexible enough
> to
> > > fill this role.  For example, it can't prevent users from creating new
> > > single-replica topics that get put on the "bad" replica.  Perhaps we
> > > should reopen the discussion about
> > >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-201%3A+Rationalising+Policy+interfaces
> > >
> > > regards,
> > > Colin
> > >
> > > >
> > > > Please let me know there are more question.
> > > >
> > > >
> > > > Thanks,
> > > > George
> > > >
> > > >    On Thursday, July 25, 2019, 08:38:28 AM PDT, Colin McCabe
> > > > <cm...@apache.org> wrote:
> > > >
> > > >  We still want to give the "blacklisted" broker the leadership if
> > > > nobody else is available.  Therefore, isn't putting a broker on the
> > > > blacklist pretty much the same as moving it to the last entry in the
> > > > replicas list and then triggering a preferred leader election?
> > > >
> > > > If we want this to be undone after a certain amount of time, or
> under
> > > > certain conditions, that seems like something that would be more
> > > > effectively done by an external system, rather than putting all
> these
> > > > policies into Kafka.
> > > >
> > > > best,
> > > > Colin
> > > >
> > > >
> > > > On Fri, Jul 19, 2019, at 18:23, George Li wrote:
> > > > >  Hi Satish,
> > > > > Thanks for the reviews and feedbacks.
> > > > >
> > > > > > > The following is the requirements this KIP is trying to
> accomplish:
> > > > > > This can be moved to the"Proposed changes" section.
> > > > >
> > > > > Updated the KIP-491.
> > > > >
> > > > > > >>The logic to determine the priority/order of which broker
> should be
> > > > > > preferred leader should be modified.  The broker in the
> preferred leader
> > > > > > blacklist should be moved to the end (lowest priority) when
> > > > > > determining leadership.
> > > > > >
> > > > > > I believe there is no change required in the ordering of the
> preferred
> > > > > > replica list. Brokers in the preferred leader blacklist are
> skipped
> > > > > > until other brokers int he list are unavailable.
> > > > >
> > > > > Yes. partition assignment remained the same, replica & ordering.
> The
> > > > > blacklist logic can be optimized during implementation.
> > > > >
> > > > > > >>The blacklist can be at the broker level. However, there might
> be use cases
> > > > > > where a specific topic should blacklist particular brokers, which
> > > > > > would be at the
> > > > > > Topic level Config. For this use cases of this KIP, it seems
> that broker level
> > > > > > blacklist would suffice.  Topic level preferred leader blacklist
> might
> > > > > > be future enhancement work.
> > > > > >
> > > > > > I agree that the broker level preferred leader blacklist would be
> > > > > > sufficient. Do you have any use cases which require topic level
> > > > > > preferred blacklist?
> > > > >
> > > > >
> > > > >
> > > > > I don't have any concrete use cases for Topic level preferred
> leader
> > > > > blacklist.  One scenarios I can think of is when a broker has high
> CPU
> > > > > usage, trying to identify the big topics (High MsgIn, High
> BytesIn,
> > > > > etc), then try to move the leaders away from this broker,  before
> doing
> > > > > an actual reassignment to change its preferred leader,  try to put
> this
> > > > > preferred_leader_blacklist in the Topic Level config, and run
> preferred
> > > > > leader election, and see whether CPU decreases for this broker,
> if
> > > > > yes, then do the reassignments to change the preferred leaders to
> be
> > > > > "permanent" (the topic may have many partitions like 256 that has
> quite
> > > > > a few of them having this broker as preferred leader).  So this
> Topic
> > > > > Level config is an easy way of doing trial and check the result.
> > > > >
> > > > >
> > > > > > You can add the below workaround as an item in the rejected
> alternatives section
> > > > > > "Reassigning all the topic/partitions which the intended broker
> is a
> > > > > > replica for."
> > > > >
> > > > > Updated the KIP-491.
> > > > >
> > > > >
> > > > >
> > > > > Thanks,
> > > > > George
> > > > >
> > > > >    On Friday, July 19, 2019, 08:20:22 AM PDT, Satish Duggana
> > > > > <sa...@gmail.com> wrote:
> > > > >
> > > > >  Thanks for the KIP. I have put my comments below.
> > > > >
> > > > > This is a nice improvement to avoid cumbersome maintenance.
> > > > >
> > > > > >> The following is the requirements this KIP is trying to
> accomplish:
> > > > >   The ability to add and remove the preferred leader deprioritized
> > > > > list/blacklist. e.g. new ZK path/node or new dynamic config.
> > > > >
> > > > > This can be moved to the"Proposed changes" section.
> > > > >
> > > > > >>The logic to determine the priority/order of which broker should
> be
> > > > > preferred leader should be modified.  The broker in the preferred
> leader
> > > > > blacklist should be moved to the end (lowest priority) when
> > > > > determining leadership.
> > > > >
> > > > > I believe there is no change required in the ordering of the
> preferred
> > > > > replica list. Brokers in the preferred leader blacklist are skipped
> > > > > until other brokers int he list are unavailable.
> > > > >
> > > > > >>The blacklist can be at the broker level. However, there might
> be use cases
> > > > > where a specific topic should blacklist particular brokers, which
> > > > > would be at the
> > > > > Topic level Config. For this use cases of this KIP, it seems that
> broker level
> > > > > blacklist would suffice.  Topic level preferred leader blacklist
> might
> > > > > be future enhancement work.
> > > > >
> > > > > I agree that the broker level preferred leader blacklist would be
> > > > > sufficient. Do you have any use cases which require topic level
> > > > > preferred blacklist?
> > > > >
> > > > > You can add the below workaround as an item in the rejected
> alternatives section
> > > > > "Reassigning all the topic/partitions which the intended broker is
> a
> > > > > replica for."
> > > > >
> > > > > Thanks,
> > > > > Satish.
> > > > >
> > > > > On Fri, Jul 19, 2019 at 7:33 AM Stanislav Kozlovski
> > > > > <st...@confluent.io> wrote:
> > > > > >
> > > > > > Hey George,
> > > > > >
> > > > > > Thanks for the KIP, it's an interesting idea.
> > > > > >
> > > > > > I was wondering whether we could achieve the same thing via the
> > > > > > kafka-reassign-partitions tool. As you had also said in the
> JIRA,  it is
> > > > > > true that this is currently very tedious with the tool. My
> thoughts are
> > > > > > that we could improve the tool and give it the notion of a
> "blacklisted
> > > > > > preferred leader".
> > > > > > This would have some benefits like:
> > > > > > - more fine-grained control over the blacklist. we may not want
> to
> > > > > > blacklist all the preferred leaders, as that would make the
> blacklisted
> > > > > > broker a follower of last resort which is not very useful. In
> the cases of
> > > > > > an underpowered AWS machine or a controller, you might overshoot
> and make
> > > > > > the broker very underutilized if you completely make it
> leaderless.
> > > > > > - is not permanent. If we are to have a blacklist leaders config,
> > > > > > rebalancing tools would also need to know about it and
> manipulate/respect
> > > > > > it to achieve a fair balance.
> > > > > > It seems like both problems are tied to balancing partitions,
> it's just
> > > > > > that KIP-491's use case wants to balance them against other
> factors in a
> > > > > > more nuanced way. It makes sense to have both be done from the
> same place
> > > > > >
> > > > > > To make note of the motivation section:
> > > > > > > Avoid bouncing broker in order to lose its leadership
> > > > > > The recommended way to make a broker lose its leadership is to
> run a
> > > > > > reassignment on its partitions
> > > > > > > The cross-data center cluster has AWS cloud instances which
> have less
> > > > > > computing power
> > > > > > We recommend running Kafka on homogeneous machines. It would be
> cool if the
> > > > > > system supported more flexibility in that regard but that is
> more nuanced
> > > > > > and a preferred leader blacklist may not be the best first
> approach to the
> > > > > > issue
> > > > > >
> > > > > > Adding a new config which can fundamentally change the way
> replication is
> > > > > > done is complex, both for the system (the replication code is
> complex
> > > > > > enough) and the user. Users would have another potential config
> that could
> > > > > > backfire on them - e.g if left forgotten.
> > > > > >
> > > > > > Could you think of any downsides to implementing this
> functionality (or a
> > > > > > variation of it) in the kafka-reassign-partitions.sh tool?
> > > > > > One downside I can see is that we would not have it handle new
> partitions
> > > > > > created after the "blacklist operation". As a first iteration I
> think that
> > > > > > may be acceptable
> > > > > >
> > > > > > Thanks,
> > > > > > Stanislav
> > > > > >
> > > > > > On Fri, Jul 19, 2019 at 3:20 AM George Li <
> sql_consulting@yahoo.com.invalid>
> > > > > > wrote:
> > > > > >
> > > > > > >  Hi,
> > > > > > >
> > > > > > > Pinging the list for the feedbacks of this KIP-491  (
> > > > > > >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120736982
> > > > > > > )
> > > > > > >
> > > > > > >
> > > > > > > Thanks,
> > > > > > > George
> > > > > > >
> > > > > > >    On Saturday, July 13, 2019, 08:43:25 PM PDT, George Li <
> > > > > > > sql_consulting@yahoo.com.INVALID> wrote:
> > > > > > >
> > > > > > >  Hi,
> > > > > > >
> > > > > > > I have created KIP-491 (
> > > > > > >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120736982
> )
> > > > > > > for putting a broker to the preferred leader blacklist or
> deprioritized
> > > > > > > list so when determining leadership,  it's moved to the lowest
> priority for
> > > > > > > some of the listed use-cases.
> > > > > > >
> > > > > > > Please provide your comments/feedbacks.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > George
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >  ----- Forwarded Message ----- From: Jose Armando Garcia
> Sancio (JIRA) <
> > > > > > > jira@apache.org>To: "sql_consulting@yahoo.com" <
> sql_consulting@yahoo.com>Sent:
> > > > > > > Tuesday, July 9, 2019, 01:06:05 PM PDTSubject: [jira]
> [Commented]
> > > > > > > (KAFKA-8638) Preferred Leader Blacklist (deprioritized list)
> > > > > > >
> > > > > > >    [
> > > > > > >
> https://issues.apache.org/jira/browse/KAFKA-8638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881511#comment-16881511
> > > > > > > ]
> > > > > > >
> > > > > > > Jose Armando Garcia Sancio commented on KAFKA-8638:
> > > > > > > ---------------------------------------------------
> > > > > > >
> > > > > > > Thanks for feedback and clear use cases [~sql_consulting].
> > > > > > >
> > > > > > > > Preferred Leader Blacklist (deprioritized list)
> > > > > > > > -----------------------------------------------
> > > > > > > >
> > > > > > > >                Key: KAFKA-8638
> > > > > > > >                URL:
> https://issues.apache.org/jira/browse/KAFKA-8638
> > > > > > > >            Project: Kafka
> > > > > > > >          Issue Type: Improvement
> > > > > > > >          Components: config, controller, core
> > > > > > > >    Affects Versions: 1.1.1, 2.3.0, 2.2.1
> > > > > > > >            Reporter: GEORGE LI
> > > > > > > >            Assignee: GEORGE LI
> > > > > > > >            Priority: Major
> > > > > > > >
> > > > > > > > Currently, the kafka preferred leader election will pick the
> broker_id
> > > > > > > in the topic/partition replica assignments in a priority order
> when the
> > > > > > > broker is in ISR. The preferred leader is the broker id in the
> first
> > > > > > > position of replica. There are use-cases that, even the first
> broker in the
> > > > > > > replica assignment is in ISR, there is a need for it to be
> moved to the end
> > > > > > > of ordering (lowest priority) when deciding leadership during
> preferred
> > > > > > > leader election.
> > > > > > > > Let’s use topic/partition replica (1,2,3) as an example. 1
> is the
> > > > > > > preferred leader.  When preferred leadership is run, it will
> pick 1 as the
> > > > > > > leader if it's ISR, if 1 is not online and in ISR, then pick
> 2, if 2 is not
> > > > > > > in ISR, then pick 3 as the leader. There are use cases that,
> even 1 is in
> > > > > > > ISR, we would like it to be moved to the end of ordering
> (lowest priority)
> > > > > > > when deciding leadership during preferred leader election.
> Below is a list
> > > > > > > of use cases:
> > > > > > > > * (If broker_id 1 is a swapped failed host and brought up
> with last
> > > > > > > segments or latest offset without historical data (There is
> another effort
> > > > > > > on this), it's better for it to not serve leadership till it's
> caught-up.
> > > > > > > > * The cross-data center cluster has AWS instances which have
> less
> > > > > > > computing power than the on-prem bare metal machines.  We
> could put the AWS
> > > > > > > broker_ids in Preferred Leader Blacklist, so on-prem brokers
> can be elected
> > > > > > > leaders, without changing the reassignments ordering of the
> replicas.
> > > > > > > > * If the broker_id 1 is constantly losing leadership after
> some time:
> > > > > > > "Flapping". we would want to exclude 1 to be a leader unless
> all other
> > > > > > > brokers of this topic/partition are offline.  The “Flapping”
> effect was
> > > > > > > seen in the past when 2 or more brokers were bad, when they
> lost leadership
> > > > > > > constantly/quickly, the sets of partition replicas they belong
> to will see
> > > > > > > leadership constantly changing.  The ultimate solution is to
> swap these bad
> > > > > > > hosts.  But for quick mitigation, we can also put the bad
> hosts in the
> > > > > > > Preferred Leader Blacklist to move the priority of its being
> elected as
> > > > > > > leaders to the lowest.
> > > > > > > > *  If the controller is busy serving an extra load of
> metadata requests
> > > > > > > and other tasks. we would like to put the controller's leaders
> to other
> > > > > > > brokers to lower its CPU load. currently bouncing to lose
> leadership would
> > > > > > > not work for Controller, because after the bounce, the
> controller fails
> > > > > > > over to another broker.
> > > > > > > > * Avoid bouncing broker in order to lose its leadership: it
> would be
> > > > > > > good if we have a way to specify which broker should be
> excluded from
> > > > > > > serving traffic/leadership (without changing the replica
> assignment
> > > > > > > ordering by reassignments, even though that's quick), and run
> preferred
> > > > > > > leader election.  A bouncing broker will cause temporary URP,
> and sometimes
> > > > > > > other issues.  Also a bouncing of broker (e.g. broker_id 1)
> can temporarily
> > > > > > > lose all its leadership, but if another broker (e.g. broker_id
> 2) fails or
> > > > > > > gets bounced, some of its leaderships will likely failover to
> broker_id 1
> > > > > > > on a replica with 3 brokers.  If broker_id 1 is in the
> blacklist, then in
> > > > > > > such a scenario even broker_id 2 offline,  the 3rd broker can
> take
> > > > > > > leadership.
> > > > > > > > The current work-around of the above is to change the
> topic/partition's
> > > > > > > replica reassignments to move the broker_id 1 from the first
> position to
> > > > > > > the last position and run preferred leader election. e.g. (1,
> 2, 3) => (2,
> > > > > > > 3, 1). This changes the replica reassignments, and we need to
> keep track of
> > > > > > > the original one and restore if things change (e.g. controller
> fails over
> > > > > > > to another broker, the swapped empty broker caught up). That’s
> a rather
> > > > > > > tedious task.
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > This message was sent by Atlassian JIRA
> > > > > > > (v7.6.3#76005)

Re: [DISCUSS] KIP-491: Preferred Leader Deprioritized List (Temporary Blacklist)

Posted by George Li <sq...@yahoo.com.INVALID>.
 Hi Colin,

Thanks for the feedback.  The "separate set of metadata about blacklists" in KIP-491 is just the list of broker ids. Usually 1 or 2 or a couple in the cluster.  Should be easier than keeping json files?  e.g. what if we first blacklist broker_id_1, then another broker_id_2 has issues, and we need to write out another json file to restore later (and in which order)?   Using blacklist, we can just add the broker_id_2 to the existing one. and remove whatever broker_id returning to good state without worrying how(the ordering of putting the broker to blacklist) to restore.

For topic level config,  the blacklist will be tied to topic/partition(e.g.  Configs: topic.preferred.leader.blacklist=0:101,102;1:103    where 0 & 1 is the partition#, 101,102,103 are the blacklist broker_ids), and easier to update/remove, no need for external json files? 


Thanks,
George

    On Friday, September 6, 2019, 02:20:33 PM PDT, Colin McCabe <cm...@apache.org> wrote:  
 
 One possibility would be writing a new command-line tool that would deprioritize a given replica using the new KIP-455 API.  Then it could write out a JSON files containing the old priorities, which could be restored when (or if) we needed to do so.  This seems like it might be simpler and easier to maintain than a separate set of metadata about blacklists.

best,
Colin


On Fri, Sep 6, 2019, at 11:58, George Li wrote:
>  Hi, 
> 
> Just want to ping and bubble up the discussion of KIP-491. 
> 
> On a large scale of Kafka clusters with thousands of brokers in many 
> clusters.  Frequent hardware failures are common, although the 
> reassignments to change the preferred leaders is a workaround, it 
> incurs unnecessary additional work than the proposed preferred leader 
> blacklist in KIP-491, and hard to scale. 
> 
> I am wondering whether others using Kafka in a big scale running into 
> same problem. 
> 
> 
> Satish,  
> 
> Regarding your previous question about whether there is use-case for 
> TopicLevel preferred leader "blacklist",  I thought about one 
> use-case:  to improve rebalance/reassignment, the large partition will 
> usually cause performance/stability issues, planning to change the say 
> the New Replica will start with Leader's latest offset(this way the 
> replica is almost instantly in the ISR and reassignment completed), and 
> put this partition's NewReplica into Preferred Leader "Blacklist" at 
> the Topic Level config for that partition. After sometime(retention 
> time), this new replica has caught up and ready to serve traffic, 
> update/remove the TopicConfig for this partition's preferred leader 
> blacklist. 
> 
> I will update the KIP-491 later for this use case of Topic Level config 
> for Preferred Leader Blacklist.
> 
> 
> Thanks,
> George
>  
>    On Wednesday, August 7, 2019, 07:43:55 PM PDT, George Li 
> <sq...@yahoo.com> wrote:  
>  
>  Hi Colin,
> 
> > In your example, I think we're comparing apples and oranges.  You started by outlining a scenario where "an empty broker... comes up... [without] any > leadership[s]."  But then you criticize using reassignment to switch the order of preferred replicas because it "would not actually switch the leader > automatically."  If the empty broker doesn't have any leaderships, there is nothing to be switched, right?
> 
> Let me explained in details of this particular use case example for 
> comparing apples to apples. 
> 
> Let's say a healthy broker hosting 3000 partitions, and of which 1000 
> are the preferred leaders (leader count is 1000). There is a hardware 
> failure (disk/memory, etc.), and kafka process crashed. We swap this 
> host with another host but keep the same broker.id, when this new 
> broker coming up, it has no historical data, and we manage to have the 
> current last offsets of all partitions set in 
> the replication-offset-checkpoint (if we don't set them, it could cause 
> crazy ReplicaFetcher pulling of historical data from other brokers and 
> cause cluster high latency and other instabilities), so when Kafka is 
> brought up, it is quickly catching up as followers in the ISR.  Note, 
> we have auto.leader.rebalance.enable  disabled, so it's not serving any 
> traffic as leaders (leader count = 0), even there are 1000 partitions 
> that this broker is the Preferred Leader. 
> 
> We need to make this broker not serving traffic for a few hours or days 
> depending on the SLA of the topic retention requirement until after 
> it's having enough historical data. 
> 
> 
> * The traditional way using the reassignments to move this broker in 
> that 1000 partitions where it's the preferred leader to the end of  
> assignment, this is O(N) operation. and from my experience, we can't 
> submit all 1000 at the same time, otherwise cause higher latencies even 
> the reassignment in this case can complete almost instantly.  After  a 
> few hours/days whatever, this broker is ready to serve traffic,  we 
> have to run reassignments again to restore that 1000 partitions 
> preferred leaders for this broker: O(N) operation.  then run preferred 
> leader election O(N) again.  So total 3 x O(N) operations.  The point 
> is since the new empty broker is expected to be the same as the old one 
> in terms of hosting partition/leaders, it would seem unnecessary to do 
> reassignments (ordering of replica) during the broker catching up time. 
> 
> 
> 
> * The new feature Preferred Leader "Blacklist":  just need to put a 
> dynamic config to indicate that this broker should be considered leader 
> (preferred leader election or broker failover or unclean leader 
> election) to the lowest priority. NO need to run any reassignments. 
> After a few hours/days, when this broker is ready, remove the dynamic 
> config, and run preferred leader election and this broker will serve 
> traffic for that 1000 original partitions it was the preferred leader. 
> So total  1 x O(N) operation. 
> 
> 
> If auto.leader.rebalance.enable  is enabled,  the Preferred Leader 
> "Blacklist" can be put it before Kafka is started to prevent this 
> broker serving traffic.  In the traditional way of running 
> reassignments, once the broker is up, 
> with auto.leader.rebalance.enable  , if leadership starts going to this 
> new empty broker, it might have to do preferred leader election after 
> reassignments to remove its leaderships. e.g. (1,2,3) => (2,3,1) 
> reassignment only change the ordering, 1 remains as the current leader, 
> and needs prefer leader election to change to 2 after reassignment. so 
> potentially one more O(N) operation. 
> 
> I hope the above example can show how easy to "blacklist" a broker 
> serving leadership.  For someone managing Production Kafka cluster, 
> it's important to react fast to certain alerts and mitigate/resolve 
> some issues. As I listed the other use cases in KIP-291, I think this 
> feature can make the Kafka product more easier to manage/operate. 
> 
> > In general, using an external rebalancing tool like Cruise Control is a good idea to keep things balanced without having deal with manual rebalancing.  > We expect more and more people who have a complex or large cluster will start using tools like this.
> > 
> > However, if you choose to do manual rebalancing, it shouldn't be that bad.  You would save the existing partition ordering before making your changes, then> make your changes (perhaps by running a simple command line tool that switches the order of the replicas).  Then, once you felt like the broker was ready to> serve traffic, you could just re-apply the old ordering which you had saved.
> 
> 
> We do have our own rebalancing tool which has its own criteria like 
> Rack diversity,  disk usage,  spread partitions/leaders across all 
> brokers in the cluster per topic, leadership Bytes/BytesIn served per 
> broker, etc.  We can run reassignments. The point is whether it's 
> really necessary, and if there is more effective, easier, safer way to 
> do it.    
> 
> take another use case example of taking leadership out of busy 
> Controller to give it more power to serve metadata requests and other 
> work. The controller can failover, with the preferred leader 
> "blacklist",  it does not have to run reassignments again when 
> controller failover, just change the blacklisted broker_id. 
> 
> 
> > I was thinking about a PlacementPolicy filling the role of preventing people from creating single-replica partitions on a node that we didn't want to > ever be the leader.  I thought that it could also prevent people from designating those nodes as preferred leaders during topic creation, or Kafka from doing> itduring random topic creation.  I was assuming that the PlacementPolicy would determine which nodes were which through static configuration keys.  I agree> static configuration keys are somewhat less flexible than dynamic configuration.
> 
> 
> I think single-replica partition might not be a good example.  There 
> should not be any single-replica partition at all. If yes. it's 
> probably because of trying to save disk space with less replicas.  I 
> think at least minimum 2. The user purposely creating single-replica 
> partition will take full responsibilities of data loss and 
> unavailability when a broker fails or under maintenance. 
> 
> 
> I think it would be better to use dynamic instead of static config.  I 
> also think it would be better to have topic creation Policy enforced in 
> Kafka server OR an external service. We have an external/central 
> service managing topic creation/partition expansion which takes into 
> account of rack-diversity, replication factor (2, 3 or 4 depending on 
> cluster/topic type), Policy replicating the topic between kafka 
> clusters, etc.  
> 
> 
> 
> Thanks,
> George
> 
> 
>    On Wednesday, August 7, 2019, 05:41:28 PM PDT, Colin McCabe 
> <cm...@apache.org> wrote:  
>  
>  On Wed, Aug 7, 2019, at 12:48, George Li wrote:
> >  Hi Colin,
> > 
> > Thanks for your feedbacks.  Comments below:
> > > Even if you have a way of blacklisting an entire broker all at once, you still would need to run a leader election > for each partition where you want to move the leader off of the blacklisted broker.  So the operation is still O(N) in > that sense-- you have to do something per partition.
> > 
> > For a failed broker and swapped with an empty broker, when it comes up, 
> > it will not have any leadership, and we would like it to remain not 
> > having leaderships for a couple of hours or days. So there is no 
> > preferred leader election needed which incurs O(N) operation in this 
> > case.  Putting the preferred leader blacklist would safe guard this 
> > broker serving traffic during that time. otherwise, if another broker 
> > fails(if this broker is the 1st, 2nd in the assignment), or someone 
> > runs preferred leader election, this new "empty" broker can still get 
> > leaderships. 
> > 
> > Also running reassignment to change the ordering of preferred leader 
> > would not actually switch the leader automatically.  e.g.  (1,2,3) => 
> > (2,3,1). unless preferred leader election is run to switch current 
> > leader from 1 to 2.  So the operation is at least 2 x O(N).  and then 
> > after the broker is back to normal, another 2 x O(N) to rollback. 
> 
> Hi George,
> 
> Hmm.  I guess I'm still on the fence about this feature.
> 
> In your example, I think we're comparing apples and oranges.  You 
> started by outlining a scenario where "an empty broker... comes up... 
> [without] any leadership[s]."  But then you criticize using 
> reassignment to switch the order of preferred replicas because it 
> "would not actually switch the leader automatically."  If the empty 
> broker doesn't have any leaderships, there is nothing to be switched, 
> right?
> 
> > 
> > 
> > > In general, reassignment will get a lot easier and quicker once KIP-455 is implemented.  > Reassignments that just change the order of preferred replicas for a specific partition should complete pretty much instantly.
> > >> I think it's simpler and easier just to have one source of truth for what the preferred replica is for a partition, rather than two.  So for> me, the fact that the replica assignment ordering isn't changed is actually a big disadvantage of this KIP.  If you are a new user (or just>  an existing user that didn't read all of the documentation) and you just look at the replica assignment, you might be confused by why> a particular broker wasn't getting any leaderships, even  though it appeared like it should.  More mechanisms mean more complexity> for users and developers most of the time.
> > 
> > 
> > I would like stress the point that running reassignment to change the 
> > ordering of the replica (putting a broker to the end of partition 
> > assignment) is unnecessary, because after some time the broker is 
> > caught up, it can start serving traffic and then need to run 
> > reassignments again to "rollback" to previous states. As I mentioned in 
> > KIP-491, this is just tedious work. 
> 
> In general, using an external rebalancing tool like Cruise Control is a 
> good idea to keep things balanced without having deal with manual 
> rebalancing.  We expect more and more people who have a complex or 
> large cluster will start using tools like this.
> 
> However, if you choose to do manual rebalancing, it shouldn't be that 
> bad.  You would save the existing partition ordering before making your 
> changes, then make your changes (perhaps by running a simple command 
> line tool that switches the order of the replicas).  Then, once you 
> felt like the broker was ready to serve traffic, you could just 
> re-apply the old ordering which you had saved.
> 
> > 
> > I agree this might introduce some complexities for users/developers. 
> > But if this feature is good, and well documented, it is good for the 
> > kafka product/community.  Just like KIP-460 enabling unclean leader 
> > election to override TopicLevel/Broker Level config of 
> > `unclean.leader.election.enable`
> > 
> > > I agree that it would be nice if we could treat some brokers differently for the purposes of placing replicas, selecting leaders, etc. > Right now, we don't have any way of implementing that without forking the broker.  I would support a new PlacementPolicy class that> would close this gap.  But I don't think this KIP is flexible enough to fill this role.  For example, it can't prevent users from creating> new single-replica topics that get put on the "bad" replica.  Perhaps we should reopen the discussion> about https://cwiki.apache.org/confluence/display/KAFKA/KIP-201%3A+Rationalising+Policy+interfaces
> > 
> > Creating topic with single-replica is beyond what KIP-491 is trying to 
> > achieve.  The user needs to take responsibility of doing that. I do see 
> > some Samza clients notoriously creating single-replica topics and that 
> > got flagged by alerts, because a single broker down/maintenance will 
> > cause offline partitions. For KIP-491 preferred leader "blacklist",  
> > the single-replica will still serve as leaders, because there is no 
> > other alternative replica to be chosen as leader. 
> > 
> > Even with a new PlacementPolicy for topic creation/partition expansion, 
> > it still needs the blacklist info (e.g. a zk path node, or broker 
> > level/topic level config) to "blacklist" the broker to be preferred 
> > leader? Would it be the same as KIP-491 is introducing? 
> 
> I was thinking about a PlacementPolicy filling the role of preventing 
> people from creating single-replica partitions on a node that we didn't 
> want to ever be the leader.  I thought that it could also prevent 
> people from designating those nodes as preferred leaders during topic 
> creation, or Kafka from doing itduring random topic creation.  I was 
> assuming that the PlacementPolicy would determine which nodes were 
> which through static configuration keys.  I agree static configuration 
> keys are somewhat less flexible than dynamic configuration.
> 
> best,
> Colin
> 
> 
> > 
> > 
> > Thanks,
> > George
> > 
> >    On Wednesday, August 7, 2019, 11:01:51 AM PDT, Colin McCabe 
> > <cm...@apache.org> wrote:  
> >  
> >  On Fri, Aug 2, 2019, at 20:02, George Li wrote:
> > >  Hi Colin,
> > > Thanks for looking into this KIP.  Sorry for the late response. been busy. 
> > > 
> > > If a cluster has MAMY topic partitions, moving this "blacklist" broker 
> > > to the end of replica list is still a rather "big" operation, involving 
> > > submitting reassignments.  The KIP-491 way of blacklist is much 
> > > simpler/easier and can undo easily without changing the replica 
> > > assignment ordering. 
> > 
> > Hi George,
> > 
> > Even if you have a way of blacklisting an entire broker all at once, 
> > you still would need to run a leader election for each partition where 
> > you want to move the leader off of the blacklisted broker.  So the 
> > operation is still O(N) in that sense-- you have to do something per 
> > partition.
> > 
> > In general, reassignment will get a lot easier and quicker once KIP-455 
> > is implemented.  Reassignments that just change the order of preferred 
> > replicas for a specific partition should complete pretty much instantly.
> > 
> > I think it's simpler and easier just to have one source of truth for 
> > what the preferred replica is for a partition, rather than two.  So for 
> > me, the fact that the replica assignment ordering isn't changed is 
> > actually a big disadvantage of this KIP.  If you are a new user (or 
> > just an existing user that didn't read all of the documentation) and 
> > you just look at the replica assignment, you might be confused by why a 
> > particular broker wasn't getting any leaderships, even  though it 
> > appeared like it should.  More mechanisms mean more complexity for 
> > users and developers most of the time.
> > 
> > > Major use case for me, a failed broker got swapped with new hardware, 
> > > and starts up as empty (with latest offset of all partitions), the SLA 
> > > of retention is 1 day, so before this broker is up to be in-sync for 1 
> > > day, we would like to blacklist this broker from serving traffic. after 
> > > 1 day, the blacklist is removed and run preferred leader election.  
> > > This way, no need to run reassignments before/after.  This is the 
> > > "temporary" use-case.
> > 
> > What if we just add an option to the reassignment tool to generate a 
> > plan to move all the leaders off of a specific broker?  The tool could 
> > also run a leader election as well.  That would be a simple way of 
> > doing this without adding new mechanisms or broker-side configurations, 
> > etc.
> > 
> > > 
> > > There are use-cases that this Preferred Leader "blacklist" can be 
> > > somewhat permanent, as I explained in the AWS data center instances Vs. 
> > > on-premises data center bare metal machines (heterogenous hardware), 
> > > that the AWS broker_ids will be blacklisted.  So new topics created,  
> > > or existing topic expansion would not make them serve traffic even they 
> > > could be the preferred leader. 
> > 
> > I agree that it would be nice if we could treat some brokers 
> > differently for the purposes of placing replicas, selecting leaders, 
> > etc.  Right now, we don't have any way of implementing that without 
> > forking the broker.  I would support a new PlacementPolicy class that 
> > would close this gap.  But I don't think this KIP is flexible enough to 
> > fill this role.  For example, it can't prevent users from creating new 
> > single-replica topics that get put on the "bad" replica.  Perhaps we 
> > should reopen the discussion about 
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-201%3A+Rationalising+Policy+interfaces
> > 
> > regards,
> > Colin
> > 
> > > 
> > > Please let me know there are more question. 
> > > 
> > > 
> > > Thanks,
> > > George
> > > 
> > >    On Thursday, July 25, 2019, 08:38:28 AM PDT, Colin McCabe 
> > > <cm...@apache.org> wrote:  
> > >  
> > >  We still want to give the "blacklisted" broker the leadership if 
> > > nobody else is available.  Therefore, isn't putting a broker on the 
> > > blacklist pretty much the same as moving it to the last entry in the 
> > > replicas list and then triggering a preferred leader election?
> > > 
> > > If we want this to be undone after a certain amount of time, or under 
> > > certain conditions, that seems like something that would be more 
> > > effectively done by an external system, rather than putting all these 
> > > policies into Kafka.
> > > 
> > > best,
> > > Colin
> > > 
> > > 
> > > On Fri, Jul 19, 2019, at 18:23, George Li wrote:
> > > >  Hi Satish,
> > > > Thanks for the reviews and feedbacks.
> > > > 
> > > > > > The following is the requirements this KIP is trying to accomplish:
> > > > > This can be moved to the"Proposed changes" section.
> > > > 
> > > > Updated the KIP-491. 
> > > > 
> > > > > >>The logic to determine the priority/order of which broker should be
> > > > > preferred leader should be modified.  The broker in the preferred leader
> > > > > blacklist should be moved to the end (lowest priority) when
> > > > > determining leadership.
> > > > >
> > > > > I believe there is no change required in the ordering of the preferred
> > > > > replica list. Brokers in the preferred leader blacklist are skipped
> > > > > until other brokers int he list are unavailable.
> > > > 
> > > > Yes. partition assignment remained the same, replica & ordering. The 
> > > > blacklist logic can be optimized during implementation. 
> > > > 
> > > > > >>The blacklist can be at the broker level. However, there might be use cases
> > > > > where a specific topic should blacklist particular brokers, which
> > > > > would be at the
> > > > > Topic level Config. For this use cases of this KIP, it seems that broker level
> > > > > blacklist would suffice.  Topic level preferred leader blacklist might
> > > > > be future enhancement work.
> > > > > 
> > > > > I agree that the broker level preferred leader blacklist would be
> > > > > sufficient. Do you have any use cases which require topic level
> > > > > preferred blacklist?
> > > > 
> > > > 
> > > > 
> > > > I don't have any concrete use cases for Topic level preferred leader 
> > > > blacklist.  One scenarios I can think of is when a broker has high CPU 
> > > > usage, trying to identify the big topics (High MsgIn, High BytesIn, 
> > > > etc), then try to move the leaders away from this broker,  before doing 
> > > > an actual reassignment to change its preferred leader,  try to put this 
> > > > preferred_leader_blacklist in the Topic Level config, and run preferred 
> > > > leader election, and see whether CPU decreases for this broker,  if 
> > > > yes, then do the reassignments to change the preferred leaders to be 
> > > > "permanent" (the topic may have many partitions like 256 that has quite 
> > > > a few of them having this broker as preferred leader).  So this Topic 
> > > > Level config is an easy way of doing trial and check the result. 
> > > > 
> > > > 
> > > > > You can add the below workaround as an item in the rejected alternatives section
> > > > > "Reassigning all the topic/partitions which the intended broker is a
> > > > > replica for."
> > > > 
> > > > Updated the KIP-491. 
> > > > 
> > > > 
> > > > 
> > > > Thanks, 
> > > > George
> > > > 
> > > >    On Friday, July 19, 2019, 08:20:22 AM PDT, Satish Duggana 
> > > > <sa...@gmail.com> wrote:  
> > > >  
> > > >  Thanks for the KIP. I have put my comments below.
> > > > 
> > > > This is a nice improvement to avoid cumbersome maintenance.
> > > > 
> > > > >> The following is the requirements this KIP is trying to accomplish:
> > > >   The ability to add and remove the preferred leader deprioritized
> > > > list/blacklist. e.g. new ZK path/node or new dynamic config.
> > > > 
> > > > This can be moved to the"Proposed changes" section.
> > > > 
> > > > >>The logic to determine the priority/order of which broker should be
> > > > preferred leader should be modified.  The broker in the preferred leader
> > > > blacklist should be moved to the end (lowest priority) when
> > > > determining leadership.
> > > > 
> > > > I believe there is no change required in the ordering of the preferred
> > > > replica list. Brokers in the preferred leader blacklist are skipped
> > > > until other brokers int he list are unavailable.
> > > > 
> > > > >>The blacklist can be at the broker level. However, there might be use cases
> > > > where a specific topic should blacklist particular brokers, which
> > > > would be at the
> > > > Topic level Config. For this use cases of this KIP, it seems that broker level
> > > > blacklist would suffice.  Topic level preferred leader blacklist might
> > > > be future enhancement work.
> > > > 
> > > > I agree that the broker level preferred leader blacklist would be
> > > > sufficient. Do you have any use cases which require topic level
> > > > preferred blacklist?
> > > > 
> > > > You can add the below workaround as an item in the rejected alternatives section
> > > > "Reassigning all the topic/partitions which the intended broker is a
> > > > replica for."
> > > > 
> > > > Thanks,
> > > > Satish.
> > > > 
> > > > On Fri, Jul 19, 2019 at 7:33 AM Stanislav Kozlovski
> > > > <st...@confluent.io> wrote:
> > > > >
> > > > > Hey George,
> > > > >
> > > > > Thanks for the KIP, it's an interesting idea.
> > > > >
> > > > > I was wondering whether we could achieve the same thing via the
> > > > > kafka-reassign-partitions tool. As you had also said in the JIRA,  it is
> > > > > true that this is currently very tedious with the tool. My thoughts are
> > > > > that we could improve the tool and give it the notion of a "blacklisted
> > > > > preferred leader".
> > > > > This would have some benefits like:
> > > > > - more fine-grained control over the blacklist. we may not want to
> > > > > blacklist all the preferred leaders, as that would make the blacklisted
> > > > > broker a follower of last resort which is not very useful. In the cases of
> > > > > an underpowered AWS machine or a controller, you might overshoot and make
> > > > > the broker very underutilized if you completely make it leaderless.
> > > > > - is not permanent. If we are to have a blacklist leaders config,
> > > > > rebalancing tools would also need to know about it and manipulate/respect
> > > > > it to achieve a fair balance.
> > > > > It seems like both problems are tied to balancing partitions, it's just
> > > > > that KIP-491's use case wants to balance them against other factors in a
> > > > > more nuanced way. It makes sense to have both be done from the same place
> > > > >
> > > > > To make note of the motivation section:
> > > > > > Avoid bouncing broker in order to lose its leadership
> > > > > The recommended way to make a broker lose its leadership is to run a
> > > > > reassignment on its partitions
> > > > > > The cross-data center cluster has AWS cloud instances which have less
> > > > > computing power
> > > > > We recommend running Kafka on homogeneous machines. It would be cool if the
> > > > > system supported more flexibility in that regard but that is more nuanced
> > > > > and a preferred leader blacklist may not be the best first approach to the
> > > > > issue
> > > > >
> > > > > Adding a new config which can fundamentally change the way replication is
> > > > > done is complex, both for the system (the replication code is complex
> > > > > enough) and the user. Users would have another potential config that could
> > > > > backfire on them - e.g if left forgotten.
> > > > >
> > > > > Could you think of any downsides to implementing this functionality (or a
> > > > > variation of it) in the kafka-reassign-partitions.sh tool?
> > > > > One downside I can see is that we would not have it handle new partitions
> > > > > created after the "blacklist operation". As a first iteration I think that
> > > > > may be acceptable
> > > > >
> > > > > Thanks,
> > > > > Stanislav
> > > > >
> > > > > On Fri, Jul 19, 2019 at 3:20 AM George Li <sq...@yahoo.com.invalid>
> > > > > wrote:
> > > > >
> > > > > >  Hi,
> > > > > >
> > > > > > Pinging the list for the feedbacks of this KIP-491  (
> > > > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120736982
> > > > > > )
> > > > > >
> > > > > >
> > > > > > Thanks,
> > > > > > George
> > > > > >
> > > > > >    On Saturday, July 13, 2019, 08:43:25 PM PDT, George Li <
> > > > > > sql_consulting@yahoo.com.INVALID> wrote:
> > > > > >
> > > > > >  Hi,
> > > > > >
> > > > > > I have created KIP-491 (
> > > > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120736982)
> > > > > > for putting a broker to the preferred leader blacklist or deprioritized
> > > > > > list so when determining leadership,  it's moved to the lowest priority for
> > > > > > some of the listed use-cases.
> > > > > >
> > > > > > Please provide your comments/feedbacks.
> > > > > >
> > > > > > Thanks,
> > > > > > George
> > > > > >
> > > > > >
> > > > > >
> > > > > >  ----- Forwarded Message ----- From: Jose Armando Garcia Sancio (JIRA) <
> > > > > > jira@apache.org>To: "sql_consulting@yahoo.com" <sq...@yahoo.com>Sent:
> > > > > > Tuesday, July 9, 2019, 01:06:05 PM PDTSubject: [jira] [Commented]
> > > > > > (KAFKA-8638) Preferred Leader Blacklist (deprioritized list)
> > > > > >
> > > > > >    [
> > > > > > https://issues.apache.org/jira/browse/KAFKA-8638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881511#comment-16881511
> > > > > > ]
> > > > > >
> > > > > > Jose Armando Garcia Sancio commented on KAFKA-8638:
> > > > > > ---------------------------------------------------
> > > > > >
> > > > > > Thanks for feedback and clear use cases [~sql_consulting].
> > > > > >
> > > > > > > Preferred Leader Blacklist (deprioritized list)
> > > > > > > -----------------------------------------------
> > > > > > >
> > > > > > >                Key: KAFKA-8638
> > > > > > >                URL: https://issues.apache.org/jira/browse/KAFKA-8638
> > > > > > >            Project: Kafka
> > > > > > >          Issue Type: Improvement
> > > > > > >          Components: config, controller, core
> > > > > > >    Affects Versions: 1.1.1, 2.3.0, 2.2.1
> > > > > > >            Reporter: GEORGE LI
> > > > > > >            Assignee: GEORGE LI
> > > > > > >            Priority: Major
> > > > > > >
> > > > > > > Currently, the kafka preferred leader election will pick the broker_id
> > > > > > in the topic/partition replica assignments in a priority order when the
> > > > > > broker is in ISR. The preferred leader is the broker id in the first
> > > > > > position of replica. There are use-cases that, even the first broker in the
> > > > > > replica assignment is in ISR, there is a need for it to be moved to the end
> > > > > > of ordering (lowest priority) when deciding leadership during  preferred
> > > > > > leader election.
> > > > > > > Let’s use topic/partition replica (1,2,3) as an example. 1 is the
> > > > > > preferred leader.  When preferred leadership is run, it will pick 1 as the
> > > > > > leader if it's ISR, if 1 is not online and in ISR, then pick 2, if 2 is not
> > > > > > in ISR, then pick 3 as the leader. There are use cases that, even 1 is in
> > > > > > ISR, we would like it to be moved to the end of ordering (lowest priority)
> > > > > > when deciding leadership during preferred leader election.  Below is a list
> > > > > > of use cases:
> > > > > > > * (If broker_id 1 is a swapped failed host and brought up with last
> > > > > > segments or latest offset without historical data (There is another effort
> > > > > > on this), it's better for it to not serve leadership till it's caught-up.
> > > > > > > * The cross-data center cluster has AWS instances which have less
> > > > > > computing power than the on-prem bare metal machines.  We could put the AWS
> > > > > > broker_ids in Preferred Leader Blacklist, so on-prem brokers can be elected
> > > > > > leaders, without changing the reassignments ordering of the replicas.
> > > > > > > * If the broker_id 1 is constantly losing leadership after some time:
> > > > > > "Flapping". we would want to exclude 1 to be a leader unless all other
> > > > > > brokers of this topic/partition are offline.  The “Flapping” effect was
> > > > > > seen in the past when 2 or more brokers were bad, when they lost leadership
> > > > > > constantly/quickly, the sets of partition replicas they belong to will see
> > > > > > leadership constantly changing.  The ultimate solution is to swap these bad
> > > > > > hosts.  But for quick mitigation, we can also put the bad hosts in the
> > > > > > Preferred Leader Blacklist to move the priority of its being elected as
> > > > > > leaders to the lowest.
> > > > > > > *  If the controller is busy serving an extra load of metadata requests
> > > > > > and other tasks. we would like to put the controller's leaders to other
> > > > > > brokers to lower its CPU load. currently bouncing to lose leadership would
> > > > > > not work for Controller, because after the bounce, the controller fails
> > > > > > over to another broker.
> > > > > > > * Avoid bouncing broker in order to lose its leadership: it would be
> > > > > > good if we have a way to specify which broker should be excluded from
> > > > > > serving traffic/leadership (without changing the replica assignment
> > > > > > ordering by reassignments, even though that's quick), and run preferred
> > > > > > leader election.  A bouncing broker will cause temporary URP, and sometimes
> > > > > > other issues.  Also a bouncing of broker (e.g. broker_id 1) can temporarily
> > > > > > lose all its leadership, but if another broker (e.g. broker_id 2) fails or
> > > > > > gets bounced, some of its leaderships will likely failover to broker_id 1
> > > > > > on a replica with 3 brokers.  If broker_id 1 is in the blacklist, then in
> > > > > > such a scenario even broker_id 2 offline,  the 3rd broker can take
> > > > > > leadership.
> > > > > > > The current work-around of the above is to change the topic/partition's
> > > > > > replica reassignments to move the broker_id 1 from the first position to
> > > > > > the last position and run preferred leader election. e.g. (1, 2, 3) => (2,
> > > > > > 3, 1). This changes the replica reassignments, and we need to keep track of
> > > > > > the original one and restore if things change (e.g. controller fails over
> > > > > > to another broker, the swapped empty broker caught up). That’s a rather
> > > > > > tedious task.
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > This message was sent by Atlassian JIRA
> > > > > > (v7.6.3#76005)  

Re: [DISCUSS] KIP-491: Preferred Leader Deprioritized List (Temporary Blacklist)

Posted by Colin McCabe <cm...@apache.org>.
One possibility would be writing a new command-line tool that would deprioritize a given replica using the new KIP-455 API.  Then it could write out a JSON files containing the old priorities, which could be restored when (or if) we needed to do so.  This seems like it might be simpler and easier to maintain than a separate set of metadata about blacklists.

best,
Colin


On Fri, Sep 6, 2019, at 11:58, George Li wrote:
>  Hi, 
> 
> Just want to ping and bubble up the discussion of KIP-491. 
> 
> On a large scale of Kafka clusters with thousands of brokers in many 
> clusters.  Frequent hardware failures are common, although the 
> reassignments to change the preferred leaders is a workaround, it 
> incurs unnecessary additional work than the proposed preferred leader 
> blacklist in KIP-491, and hard to scale. 
> 
> I am wondering whether others using Kafka in a big scale running into 
> same problem. 
> 
> 
> Satish,  
> 
> Regarding your previous question about whether there is use-case for 
> TopicLevel preferred leader "blacklist",  I thought about one 
> use-case:  to improve rebalance/reassignment, the large partition will 
> usually cause performance/stability issues, planning to change the say 
> the New Replica will start with Leader's latest offset(this way the 
> replica is almost instantly in the ISR and reassignment completed), and 
> put this partition's NewReplica into Preferred Leader "Blacklist" at 
> the Topic Level config for that partition. After sometime(retention 
> time), this new replica has caught up and ready to serve traffic, 
> update/remove the TopicConfig for this partition's preferred leader 
> blacklist. 
> 
> I will update the KIP-491 later for this use case of Topic Level config 
> for Preferred Leader Blacklist.
> 
> 
> Thanks,
> George
>  
>     On Wednesday, August 7, 2019, 07:43:55 PM PDT, George Li 
> <sq...@yahoo.com> wrote:  
>  
>   Hi Colin,
> 
> > In your example, I think we're comparing apples and oranges.  You started by outlining a scenario where "an empty broker... comes up... [without] any > leadership[s]."  But then you criticize using reassignment to switch the order of preferred replicas because it "would not actually switch the leader > automatically."  If the empty broker doesn't have any leaderships, there is nothing to be switched, right?
> 
> Let me explained in details of this particular use case example for 
> comparing apples to apples. 
> 
> Let's say a healthy broker hosting 3000 partitions, and of which 1000 
> are the preferred leaders (leader count is 1000). There is a hardware 
> failure (disk/memory, etc.), and kafka process crashed. We swap this 
> host with another host but keep the same broker.id, when this new 
> broker coming up, it has no historical data, and we manage to have the 
> current last offsets of all partitions set in 
> the replication-offset-checkpoint (if we don't set them, it could cause 
> crazy ReplicaFetcher pulling of historical data from other brokers and 
> cause cluster high latency and other instabilities), so when Kafka is 
> brought up, it is quickly catching up as followers in the ISR.  Note, 
> we have auto.leader.rebalance.enable  disabled, so it's not serving any 
> traffic as leaders (leader count = 0), even there are 1000 partitions 
> that this broker is the Preferred Leader. 
> 
> We need to make this broker not serving traffic for a few hours or days 
> depending on the SLA of the topic retention requirement until after 
> it's having enough historical data. 
> 
> 
> * The traditional way using the reassignments to move this broker in 
> that 1000 partitions where it's the preferred leader to the end of  
> assignment, this is O(N) operation. and from my experience, we can't 
> submit all 1000 at the same time, otherwise cause higher latencies even 
> the reassignment in this case can complete almost instantly.  After  a 
> few hours/days whatever, this broker is ready to serve traffic,  we 
> have to run reassignments again to restore that 1000 partitions 
> preferred leaders for this broker: O(N) operation.  then run preferred 
> leader election O(N) again.  So total 3 x O(N) operations.  The point 
> is since the new empty broker is expected to be the same as the old one 
> in terms of hosting partition/leaders, it would seem unnecessary to do 
> reassignments (ordering of replica) during the broker catching up time. 
> 
> 
> 
> * The new feature Preferred Leader "Blacklist":  just need to put a 
> dynamic config to indicate that this broker should be considered leader 
> (preferred leader election or broker failover or unclean leader 
> election) to the lowest priority. NO need to run any reassignments. 
> After a few hours/days, when this broker is ready, remove the dynamic 
> config, and run preferred leader election and this broker will serve 
> traffic for that 1000 original partitions it was the preferred leader. 
> So total  1 x O(N) operation. 
> 
> 
> If auto.leader.rebalance.enable  is enabled,  the Preferred Leader 
> "Blacklist" can be put it before Kafka is started to prevent this 
> broker serving traffic.  In the traditional way of running 
> reassignments, once the broker is up, 
> with auto.leader.rebalance.enable  , if leadership starts going to this 
> new empty broker, it might have to do preferred leader election after 
> reassignments to remove its leaderships. e.g. (1,2,3) => (2,3,1) 
> reassignment only change the ordering, 1 remains as the current leader, 
> and needs prefer leader election to change to 2 after reassignment. so 
> potentially one more O(N) operation. 
> 
> I hope the above example can show how easy to "blacklist" a broker 
> serving leadership.  For someone managing Production Kafka cluster, 
> it's important to react fast to certain alerts and mitigate/resolve 
> some issues. As I listed the other use cases in KIP-291, I think this 
> feature can make the Kafka product more easier to manage/operate. 
> 
> > In general, using an external rebalancing tool like Cruise Control is a good idea to keep things balanced without having deal with manual rebalancing.  > We expect more and more people who have a complex or large cluster will start using tools like this.
> > 
> > However, if you choose to do manual rebalancing, it shouldn't be that bad.  You would save the existing partition ordering before making your changes, then> make your changes (perhaps by running a simple command line tool that switches the order of the replicas).  Then, once you felt like the broker was ready to> serve traffic, you could just re-apply the old ordering which you had saved.
> 
> 
> We do have our own rebalancing tool which has its own criteria like 
> Rack diversity,  disk usage,  spread partitions/leaders across all 
> brokers in the cluster per topic, leadership Bytes/BytesIn served per 
> broker, etc.  We can run reassignments. The point is whether it's 
> really necessary, and if there is more effective, easier, safer way to 
> do it.    
> 
> take another use case example of taking leadership out of busy 
> Controller to give it more power to serve metadata requests and other 
> work. The controller can failover, with the preferred leader 
> "blacklist",  it does not have to run reassignments again when 
> controller failover, just change the blacklisted broker_id. 
> 
> 
> > I was thinking about a PlacementPolicy filling the role of preventing people from creating single-replica partitions on a node that we didn't want to > ever be the leader.  I thought that it could also prevent people from designating those nodes as preferred leaders during topic creation, or Kafka from doing> itduring random topic creation.  I was assuming that the PlacementPolicy would determine which nodes were which through static configuration keys.  I agree> static configuration keys are somewhat less flexible than dynamic configuration.
> 
> 
> I think single-replica partition might not be a good example.  There 
> should not be any single-replica partition at all. If yes. it's 
> probably because of trying to save disk space with less replicas.  I 
> think at least minimum 2. The user purposely creating single-replica 
> partition will take full responsibilities of data loss and 
> unavailability when a broker fails or under maintenance. 
> 
> 
> I think it would be better to use dynamic instead of static config.  I 
> also think it would be better to have topic creation Policy enforced in 
> Kafka server OR an external service. We have an external/central 
> service managing topic creation/partition expansion which takes into 
> account of rack-diversity, replication factor (2, 3 or 4 depending on 
> cluster/topic type), Policy replicating the topic between kafka 
> clusters, etc.  
> 
> 
> 
> Thanks,
> George
> 
> 
>     On Wednesday, August 7, 2019, 05:41:28 PM PDT, Colin McCabe 
> <cm...@apache.org> wrote:  
>  
>  On Wed, Aug 7, 2019, at 12:48, George Li wrote:
> >  Hi Colin,
> > 
> > Thanks for your feedbacks.  Comments below:
> > > Even if you have a way of blacklisting an entire broker all at once, you still would need to run a leader election > for each partition where you want to move the leader off of the blacklisted broker.  So the operation is still O(N) in > that sense-- you have to do something per partition.
> > 
> > For a failed broker and swapped with an empty broker, when it comes up, 
> > it will not have any leadership, and we would like it to remain not 
> > having leaderships for a couple of hours or days. So there is no 
> > preferred leader election needed which incurs O(N) operation in this 
> > case.  Putting the preferred leader blacklist would safe guard this 
> > broker serving traffic during that time. otherwise, if another broker 
> > fails(if this broker is the 1st, 2nd in the assignment), or someone 
> > runs preferred leader election, this new "empty" broker can still get 
> > leaderships. 
> > 
> > Also running reassignment to change the ordering of preferred leader 
> > would not actually switch the leader automatically.  e.g.  (1,2,3) => 
> > (2,3,1). unless preferred leader election is run to switch current 
> > leader from 1 to 2.  So the operation is at least 2 x O(N).  and then 
> > after the broker is back to normal, another 2 x O(N) to rollback. 
> 
> Hi George,
> 
> Hmm.  I guess I'm still on the fence about this feature.
> 
> In your example, I think we're comparing apples and oranges.  You 
> started by outlining a scenario where "an empty broker... comes up... 
> [without] any leadership[s]."  But then you criticize using 
> reassignment to switch the order of preferred replicas because it 
> "would not actually switch the leader automatically."  If the empty 
> broker doesn't have any leaderships, there is nothing to be switched, 
> right?
> 
> > 
> > 
> > > In general, reassignment will get a lot easier and quicker once KIP-455 is implemented.  > Reassignments that just change the order of preferred replicas for a specific partition should complete pretty much instantly.
> > >> I think it's simpler and easier just to have one source of truth for what the preferred replica is for a partition, rather than two.  So for> me, the fact that the replica assignment ordering isn't changed is actually a big disadvantage of this KIP.  If you are a new user (or just>  an existing user that didn't read all of the documentation) and you just look at the replica assignment, you might be confused by why> a particular broker wasn't getting any leaderships, even  though it appeared like it should.  More mechanisms mean more complexity> for users and developers most of the time.
> > 
> > 
> > I would like stress the point that running reassignment to change the 
> > ordering of the replica (putting a broker to the end of partition 
> > assignment) is unnecessary, because after some time the broker is 
> > caught up, it can start serving traffic and then need to run 
> > reassignments again to "rollback" to previous states. As I mentioned in 
> > KIP-491, this is just tedious work. 
> 
> In general, using an external rebalancing tool like Cruise Control is a 
> good idea to keep things balanced without having deal with manual 
> rebalancing.  We expect more and more people who have a complex or 
> large cluster will start using tools like this.
> 
> However, if you choose to do manual rebalancing, it shouldn't be that 
> bad.  You would save the existing partition ordering before making your 
> changes, then make your changes (perhaps by running a simple command 
> line tool that switches the order of the replicas).  Then, once you 
> felt like the broker was ready to serve traffic, you could just 
> re-apply the old ordering which you had saved.
> 
> > 
> > I agree this might introduce some complexities for users/developers. 
> > But if this feature is good, and well documented, it is good for the 
> > kafka product/community.  Just like KIP-460 enabling unclean leader 
> > election to override TopicLevel/Broker Level config of 
> > `unclean.leader.election.enable`
> > 
> > > I agree that it would be nice if we could treat some brokers differently for the purposes of placing replicas, selecting leaders, etc. > Right now, we don't have any way of implementing that without forking the broker.  I would support a new PlacementPolicy class that> would close this gap.  But I don't think this KIP is flexible enough to fill this role.  For example, it can't prevent users from creating> new single-replica topics that get put on the "bad" replica.  Perhaps we should reopen the discussion> about https://cwiki.apache.org/confluence/display/KAFKA/KIP-201%3A+Rationalising+Policy+interfaces
> > 
> > Creating topic with single-replica is beyond what KIP-491 is trying to 
> > achieve.  The user needs to take responsibility of doing that. I do see 
> > some Samza clients notoriously creating single-replica topics and that 
> > got flagged by alerts, because a single broker down/maintenance will 
> > cause offline partitions. For KIP-491 preferred leader "blacklist",  
> > the single-replica will still serve as leaders, because there is no 
> > other alternative replica to be chosen as leader. 
> > 
> > Even with a new PlacementPolicy for topic creation/partition expansion, 
> > it still needs the blacklist info (e.g. a zk path node, or broker 
> > level/topic level config) to "blacklist" the broker to be preferred 
> > leader? Would it be the same as KIP-491 is introducing? 
> 
> I was thinking about a PlacementPolicy filling the role of preventing 
> people from creating single-replica partitions on a node that we didn't 
> want to ever be the leader.  I thought that it could also prevent 
> people from designating those nodes as preferred leaders during topic 
> creation, or Kafka from doing itduring random topic creation.  I was 
> assuming that the PlacementPolicy would determine which nodes were 
> which through static configuration keys.  I agree static configuration 
> keys are somewhat less flexible than dynamic configuration.
> 
> best,
> Colin
> 
> 
> > 
> > 
> > Thanks,
> > George
> > 
> >    On Wednesday, August 7, 2019, 11:01:51 AM PDT, Colin McCabe 
> > <cm...@apache.org> wrote:  
> >  
> >  On Fri, Aug 2, 2019, at 20:02, George Li wrote:
> > >  Hi Colin,
> > > Thanks for looking into this KIP.  Sorry for the late response. been busy. 
> > > 
> > > If a cluster has MAMY topic partitions, moving this "blacklist" broker 
> > > to the end of replica list is still a rather "big" operation, involving 
> > > submitting reassignments.  The KIP-491 way of blacklist is much 
> > > simpler/easier and can undo easily without changing the replica 
> > > assignment ordering. 
> > 
> > Hi George,
> > 
> > Even if you have a way of blacklisting an entire broker all at once, 
> > you still would need to run a leader election for each partition where 
> > you want to move the leader off of the blacklisted broker.  So the 
> > operation is still O(N) in that sense-- you have to do something per 
> > partition.
> > 
> > In general, reassignment will get a lot easier and quicker once KIP-455 
> > is implemented.  Reassignments that just change the order of preferred 
> > replicas for a specific partition should complete pretty much instantly.
> > 
> > I think it's simpler and easier just to have one source of truth for 
> > what the preferred replica is for a partition, rather than two.  So for 
> > me, the fact that the replica assignment ordering isn't changed is 
> > actually a big disadvantage of this KIP.  If you are a new user (or 
> > just an existing user that didn't read all of the documentation) and 
> > you just look at the replica assignment, you might be confused by why a 
> > particular broker wasn't getting any leaderships, even  though it 
> > appeared like it should.  More mechanisms mean more complexity for 
> > users and developers most of the time.
> > 
> > > Major use case for me, a failed broker got swapped with new hardware, 
> > > and starts up as empty (with latest offset of all partitions), the SLA 
> > > of retention is 1 day, so before this broker is up to be in-sync for 1 
> > > day, we would like to blacklist this broker from serving traffic. after 
> > > 1 day, the blacklist is removed and run preferred leader election.  
> > > This way, no need to run reassignments before/after.  This is the 
> > > "temporary" use-case.
> > 
> > What if we just add an option to the reassignment tool to generate a 
> > plan to move all the leaders off of a specific broker?  The tool could 
> > also run a leader election as well.  That would be a simple way of 
> > doing this without adding new mechanisms or broker-side configurations, 
> > etc.
> > 
> > > 
> > > There are use-cases that this Preferred Leader "blacklist" can be 
> > > somewhat permanent, as I explained in the AWS data center instances Vs. 
> > > on-premises data center bare metal machines (heterogenous hardware), 
> > > that the AWS broker_ids will be blacklisted.  So new topics created,  
> > > or existing topic expansion would not make them serve traffic even they 
> > > could be the preferred leader. 
> > 
> > I agree that it would be nice if we could treat some brokers 
> > differently for the purposes of placing replicas, selecting leaders, 
> > etc.  Right now, we don't have any way of implementing that without 
> > forking the broker.  I would support a new PlacementPolicy class that 
> > would close this gap.  But I don't think this KIP is flexible enough to 
> > fill this role.  For example, it can't prevent users from creating new 
> > single-replica topics that get put on the "bad" replica.  Perhaps we 
> > should reopen the discussion about 
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-201%3A+Rationalising+Policy+interfaces
> > 
> > regards,
> > Colin
> > 
> > > 
> > > Please let me know there are more question. 
> > > 
> > > 
> > > Thanks,
> > > George
> > > 
> > >    On Thursday, July 25, 2019, 08:38:28 AM PDT, Colin McCabe 
> > > <cm...@apache.org> wrote:  
> > >  
> > >  We still want to give the "blacklisted" broker the leadership if 
> > > nobody else is available.  Therefore, isn't putting a broker on the 
> > > blacklist pretty much the same as moving it to the last entry in the 
> > > replicas list and then triggering a preferred leader election?
> > > 
> > > If we want this to be undone after a certain amount of time, or under 
> > > certain conditions, that seems like something that would be more 
> > > effectively done by an external system, rather than putting all these 
> > > policies into Kafka.
> > > 
> > > best,
> > > Colin
> > > 
> > > 
> > > On Fri, Jul 19, 2019, at 18:23, George Li wrote:
> > > >  Hi Satish,
> > > > Thanks for the reviews and feedbacks.
> > > > 
> > > > > > The following is the requirements this KIP is trying to accomplish:
> > > > > This can be moved to the"Proposed changes" section.
> > > > 
> > > > Updated the KIP-491. 
> > > > 
> > > > > >>The logic to determine the priority/order of which broker should be
> > > > > preferred leader should be modified.  The broker in the preferred leader
> > > > > blacklist should be moved to the end (lowest priority) when
> > > > > determining leadership.
> > > > >
> > > > > I believe there is no change required in the ordering of the preferred
> > > > > replica list. Brokers in the preferred leader blacklist are skipped
> > > > > until other brokers int he list are unavailable.
> > > > 
> > > > Yes. partition assignment remained the same, replica & ordering. The 
> > > > blacklist logic can be optimized during implementation. 
> > > > 
> > > > > >>The blacklist can be at the broker level. However, there might be use cases
> > > > > where a specific topic should blacklist particular brokers, which
> > > > > would be at the
> > > > > Topic level Config. For this use cases of this KIP, it seems that broker level
> > > > > blacklist would suffice.  Topic level preferred leader blacklist might
> > > > > be future enhancement work.
> > > > > 
> > > > > I agree that the broker level preferred leader blacklist would be
> > > > > sufficient. Do you have any use cases which require topic level
> > > > > preferred blacklist?
> > > > 
> > > > 
> > > > 
> > > > I don't have any concrete use cases for Topic level preferred leader 
> > > > blacklist.  One scenarios I can think of is when a broker has high CPU 
> > > > usage, trying to identify the big topics (High MsgIn, High BytesIn, 
> > > > etc), then try to move the leaders away from this broker,  before doing 
> > > > an actual reassignment to change its preferred leader,  try to put this 
> > > > preferred_leader_blacklist in the Topic Level config, and run preferred 
> > > > leader election, and see whether CPU decreases for this broker,  if 
> > > > yes, then do the reassignments to change the preferred leaders to be 
> > > > "permanent" (the topic may have many partitions like 256 that has quite 
> > > > a few of them having this broker as preferred leader).  So this Topic 
> > > > Level config is an easy way of doing trial and check the result. 
> > > > 
> > > > 
> > > > > You can add the below workaround as an item in the rejected alternatives section
> > > > > "Reassigning all the topic/partitions which the intended broker is a
> > > > > replica for."
> > > > 
> > > > Updated the KIP-491. 
> > > > 
> > > > 
> > > > 
> > > > Thanks, 
> > > > George
> > > > 
> > > >    On Friday, July 19, 2019, 08:20:22 AM PDT, Satish Duggana 
> > > > <sa...@gmail.com> wrote:  
> > > >  
> > > >  Thanks for the KIP. I have put my comments below.
> > > > 
> > > > This is a nice improvement to avoid cumbersome maintenance.
> > > > 
> > > > >> The following is the requirements this KIP is trying to accomplish:
> > > >   The ability to add and remove the preferred leader deprioritized
> > > > list/blacklist. e.g. new ZK path/node or new dynamic config.
> > > > 
> > > > This can be moved to the"Proposed changes" section.
> > > > 
> > > > >>The logic to determine the priority/order of which broker should be
> > > > preferred leader should be modified.  The broker in the preferred leader
> > > > blacklist should be moved to the end (lowest priority) when
> > > > determining leadership.
> > > > 
> > > > I believe there is no change required in the ordering of the preferred
> > > > replica list. Brokers in the preferred leader blacklist are skipped
> > > > until other brokers int he list are unavailable.
> > > > 
> > > > >>The blacklist can be at the broker level. However, there might be use cases
> > > > where a specific topic should blacklist particular brokers, which
> > > > would be at the
> > > > Topic level Config. For this use cases of this KIP, it seems that broker level
> > > > blacklist would suffice.  Topic level preferred leader blacklist might
> > > > be future enhancement work.
> > > > 
> > > > I agree that the broker level preferred leader blacklist would be
> > > > sufficient. Do you have any use cases which require topic level
> > > > preferred blacklist?
> > > > 
> > > > You can add the below workaround as an item in the rejected alternatives section
> > > > "Reassigning all the topic/partitions which the intended broker is a
> > > > replica for."
> > > > 
> > > > Thanks,
> > > > Satish.
> > > > 
> > > > On Fri, Jul 19, 2019 at 7:33 AM Stanislav Kozlovski
> > > > <st...@confluent.io> wrote:
> > > > >
> > > > > Hey George,
> > > > >
> > > > > Thanks for the KIP, it's an interesting idea.
> > > > >
> > > > > I was wondering whether we could achieve the same thing via the
> > > > > kafka-reassign-partitions tool. As you had also said in the JIRA,  it is
> > > > > true that this is currently very tedious with the tool. My thoughts are
> > > > > that we could improve the tool and give it the notion of a "blacklisted
> > > > > preferred leader".
> > > > > This would have some benefits like:
> > > > > - more fine-grained control over the blacklist. we may not want to
> > > > > blacklist all the preferred leaders, as that would make the blacklisted
> > > > > broker a follower of last resort which is not very useful. In the cases of
> > > > > an underpowered AWS machine or a controller, you might overshoot and make
> > > > > the broker very underutilized if you completely make it leaderless.
> > > > > - is not permanent. If we are to have a blacklist leaders config,
> > > > > rebalancing tools would also need to know about it and manipulate/respect
> > > > > it to achieve a fair balance.
> > > > > It seems like both problems are tied to balancing partitions, it's just
> > > > > that KIP-491's use case wants to balance them against other factors in a
> > > > > more nuanced way. It makes sense to have both be done from the same place
> > > > >
> > > > > To make note of the motivation section:
> > > > > > Avoid bouncing broker in order to lose its leadership
> > > > > The recommended way to make a broker lose its leadership is to run a
> > > > > reassignment on its partitions
> > > > > > The cross-data center cluster has AWS cloud instances which have less
> > > > > computing power
> > > > > We recommend running Kafka on homogeneous machines. It would be cool if the
> > > > > system supported more flexibility in that regard but that is more nuanced
> > > > > and a preferred leader blacklist may not be the best first approach to the
> > > > > issue
> > > > >
> > > > > Adding a new config which can fundamentally change the way replication is
> > > > > done is complex, both for the system (the replication code is complex
> > > > > enough) and the user. Users would have another potential config that could
> > > > > backfire on them - e.g if left forgotten.
> > > > >
> > > > > Could you think of any downsides to implementing this functionality (or a
> > > > > variation of it) in the kafka-reassign-partitions.sh tool?
> > > > > One downside I can see is that we would not have it handle new partitions
> > > > > created after the "blacklist operation". As a first iteration I think that
> > > > > may be acceptable
> > > > >
> > > > > Thanks,
> > > > > Stanislav
> > > > >
> > > > > On Fri, Jul 19, 2019 at 3:20 AM George Li <sq...@yahoo.com.invalid>
> > > > > wrote:
> > > > >
> > > > > >  Hi,
> > > > > >
> > > > > > Pinging the list for the feedbacks of this KIP-491  (
> > > > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120736982
> > > > > > )
> > > > > >
> > > > > >
> > > > > > Thanks,
> > > > > > George
> > > > > >
> > > > > >    On Saturday, July 13, 2019, 08:43:25 PM PDT, George Li <
> > > > > > sql_consulting@yahoo.com.INVALID> wrote:
> > > > > >
> > > > > >  Hi,
> > > > > >
> > > > > > I have created KIP-491 (
> > > > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120736982)
> > > > > > for putting a broker to the preferred leader blacklist or deprioritized
> > > > > > list so when determining leadership,  it's moved to the lowest priority for
> > > > > > some of the listed use-cases.
> > > > > >
> > > > > > Please provide your comments/feedbacks.
> > > > > >
> > > > > > Thanks,
> > > > > > George
> > > > > >
> > > > > >
> > > > > >
> > > > > >  ----- Forwarded Message ----- From: Jose Armando Garcia Sancio (JIRA) <
> > > > > > jira@apache.org>To: "sql_consulting@yahoo.com" <sq...@yahoo.com>Sent:
> > > > > > Tuesday, July 9, 2019, 01:06:05 PM PDTSubject: [jira] [Commented]
> > > > > > (KAFKA-8638) Preferred Leader Blacklist (deprioritized list)
> > > > > >
> > > > > >    [
> > > > > > https://issues.apache.org/jira/browse/KAFKA-8638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881511#comment-16881511
> > > > > > ]
> > > > > >
> > > > > > Jose Armando Garcia Sancio commented on KAFKA-8638:
> > > > > > ---------------------------------------------------
> > > > > >
> > > > > > Thanks for feedback and clear use cases [~sql_consulting].
> > > > > >
> > > > > > > Preferred Leader Blacklist (deprioritized list)
> > > > > > > -----------------------------------------------
> > > > > > >
> > > > > > >                Key: KAFKA-8638
> > > > > > >                URL: https://issues.apache.org/jira/browse/KAFKA-8638
> > > > > > >            Project: Kafka
> > > > > > >          Issue Type: Improvement
> > > > > > >          Components: config, controller, core
> > > > > > >    Affects Versions: 1.1.1, 2.3.0, 2.2.1
> > > > > > >            Reporter: GEORGE LI
> > > > > > >            Assignee: GEORGE LI
> > > > > > >            Priority: Major
> > > > > > >
> > > > > > > Currently, the kafka preferred leader election will pick the broker_id
> > > > > > in the topic/partition replica assignments in a priority order when the
> > > > > > broker is in ISR. The preferred leader is the broker id in the first
> > > > > > position of replica. There are use-cases that, even the first broker in the
> > > > > > replica assignment is in ISR, there is a need for it to be moved to the end
> > > > > > of ordering (lowest priority) when deciding leadership during  preferred
> > > > > > leader election.
> > > > > > > Let’s use topic/partition replica (1,2,3) as an example. 1 is the
> > > > > > preferred leader.  When preferred leadership is run, it will pick 1 as the
> > > > > > leader if it's ISR, if 1 is not online and in ISR, then pick 2, if 2 is not
> > > > > > in ISR, then pick 3 as the leader. There are use cases that, even 1 is in
> > > > > > ISR, we would like it to be moved to the end of ordering (lowest priority)
> > > > > > when deciding leadership during preferred leader election.  Below is a list
> > > > > > of use cases:
> > > > > > > * (If broker_id 1 is a swapped failed host and brought up with last
> > > > > > segments or latest offset without historical data (There is another effort
> > > > > > on this), it's better for it to not serve leadership till it's caught-up.
> > > > > > > * The cross-data center cluster has AWS instances which have less
> > > > > > computing power than the on-prem bare metal machines.  We could put the AWS
> > > > > > broker_ids in Preferred Leader Blacklist, so on-prem brokers can be elected
> > > > > > leaders, without changing the reassignments ordering of the replicas.
> > > > > > > * If the broker_id 1 is constantly losing leadership after some time:
> > > > > > "Flapping". we would want to exclude 1 to be a leader unless all other
> > > > > > brokers of this topic/partition are offline.  The “Flapping” effect was
> > > > > > seen in the past when 2 or more brokers were bad, when they lost leadership
> > > > > > constantly/quickly, the sets of partition replicas they belong to will see
> > > > > > leadership constantly changing.  The ultimate solution is to swap these bad
> > > > > > hosts.  But for quick mitigation, we can also put the bad hosts in the
> > > > > > Preferred Leader Blacklist to move the priority of its being elected as
> > > > > > leaders to the lowest.
> > > > > > > *  If the controller is busy serving an extra load of metadata requests
> > > > > > and other tasks. we would like to put the controller's leaders to other
> > > > > > brokers to lower its CPU load. currently bouncing to lose leadership would
> > > > > > not work for Controller, because after the bounce, the controller fails
> > > > > > over to another broker.
> > > > > > > * Avoid bouncing broker in order to lose its leadership: it would be
> > > > > > good if we have a way to specify which broker should be excluded from
> > > > > > serving traffic/leadership (without changing the replica assignment
> > > > > > ordering by reassignments, even though that's quick), and run preferred
> > > > > > leader election.  A bouncing broker will cause temporary URP, and sometimes
> > > > > > other issues.  Also a bouncing of broker (e.g. broker_id 1) can temporarily
> > > > > > lose all its leadership, but if another broker (e.g. broker_id 2) fails or
> > > > > > gets bounced, some of its leaderships will likely failover to broker_id 1
> > > > > > on a replica with 3 brokers.  If broker_id 1 is in the blacklist, then in
> > > > > > such a scenario even broker_id 2 offline,  the 3rd broker can take
> > > > > > leadership.
> > > > > > > The current work-around of the above is to change the topic/partition's
> > > > > > replica reassignments to move the broker_id 1 from the first position to
> > > > > > the last position and run preferred leader election. e.g. (1, 2, 3) => (2,
> > > > > > 3, 1). This changes the replica reassignments, and we need to keep track of
> > > > > > the original one and restore if things change (e.g. controller fails over
> > > > > > to another broker, the swapped empty broker caught up). That’s a rather
> > > > > > tedious task.
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > This message was sent by Atlassian JIRA
> > > > > > (v7.6.3#76005)

Re: [DISCUSS] KIP-491: Preferred Leader Deprioritized List (Temporary Blacklist)

Posted by George Li <sq...@yahoo.com.INVALID>.
 Hi Stanislav/Colin,

A couple people ran into issues with auto.leader.rebalance.enable=true in https://issues.apache.org/jira/browse/KAFKA-4084

And I think KIP-491 can help solve that issue.  We have implemented KIP-491 internally together with another feature called latest offset for quickly bringing up a failed empty node, and found it quite useful.   

Could you take a look at the comments in the ticket, re-evaluate and provide your feedbacks? 

Thanks,
George


    On Tuesday, September 17, 2019, 07:56:52 AM PDT, Stanislav Kozlovski <st...@confluent.io> wrote:  
 
 Hey Harsha,

> If we want to go with making this an option and providing a tool which
abstracts moving the broker to end preferred leader list , it needs to do
it for all the partitions that broker is leader for. As said in the above
comment a broker i.e leader for 1000 partitions we have to this for all the
partitions.  Instead of having a blacklist will help simplify this process
and we can provide monitoring/alerts on such list.

Sorry, I thought that part of the reasoning for not using reassignment was
to optimize the process.

> Do you mind shedding some light what issue you are talking to propose a
KIP for?


The issue I was talking about is the one I quoted in my previous reply. I
understand that you want to have a way of running a "shallow" replica of
sorts - one that is lacking the historical data but has (and continues to
replicate) the latest data. That is the goal of setting the last offsets
for all partitions in replication-offset-checkpoint, right?

Thanks,
Stanislav

On Mon, Sep 16, 2019 at 3:39 PM Satish Duggana <sa...@gmail.com>
wrote:

> Hi George,
> Thanks for explaining the usecase for topic level preferred leader
> blacklist. As I mentioned earlier, I am fine with broker level config
> for now.
>
> ~Satish.
>
>
> On Sat, Sep 7, 2019 at 12:29 AM George Li
> <sq...@yahoo.com.invalid> wrote:
> >
> >  Hi,
> >
> > Just want to ping and bubble up the discussion of KIP-491.
> >
> > On a large scale of Kafka clusters with thousands of brokers in many
> clusters.  Frequent hardware failures are common, although the
> reassignments to change the preferred leaders is a workaround, it incurs
> unnecessary additional work than the proposed preferred leader blacklist in
> KIP-491, and hard to scale.
> >
> > I am wondering whether others using Kafka in a big scale running into
> same problem.
> >
> >
> > Satish,
> >
> > Regarding your previous question about whether there is use-case for
> TopicLevel preferred leader "blacklist",  I thought about one use-case:  to
> improve rebalance/reassignment, the large partition will usually cause
> performance/stability issues, planning to change the say the New Replica
> will start with Leader's latest offset(this way the replica is almost
> instantly in the ISR and reassignment completed), and put this partition's
> NewReplica into Preferred Leader "Blacklist" at the Topic Level config for
> that partition. After sometime(retention time), this new replica has caught
> up and ready to serve traffic, update/remove the TopicConfig for this
> partition's preferred leader blacklist.
> >
> > I will update the KIP-491 later for this use case of Topic Level config
> for Preferred Leader Blacklist.
> >
> >
> > Thanks,
> > George
> >
> >    On Wednesday, August 7, 2019, 07:43:55 PM PDT, George Li <
> sql_consulting@yahoo.com> wrote:
> >
> >  Hi Colin,
> >
> > > In your example, I think we're comparing apples and oranges.  You
> started by outlining a scenario where "an empty broker... comes up...
> [without] any > leadership[s]."  But then you criticize using reassignment
> to switch the order of preferred replicas because it "would not actually
> switch the leader > automatically."  If the empty broker doesn't have any
> leaderships, there is nothing to be switched, right?
> >
> > Let me explained in details of this particular use case example for
> comparing apples to apples.
> >
> > Let's say a healthy broker hosting 3000 partitions, and of which 1000
> are the preferred leaders (leader count is 1000). There is a hardware
> failure (disk/memory, etc.), and kafka process crashed. We swap this host
> with another host but keep the same broker.id, when this new broker
> coming up, it has no historical data, and we manage to have the current
> last offsets of all partitions set in the replication-offset-checkpoint (if
> we don't set them, it could cause crazy ReplicaFetcher pulling of
> historical data from other brokers and cause cluster high latency and other
> instabilities), so when Kafka is brought up, it is quickly catching up as
> followers in the ISR.  Note, we have auto.leader.rebalance.enable
> disabled, so it's not serving any traffic as leaders (leader count = 0),
> even there are 1000 partitions that this broker is the Preferred Leader.
> >
> > We need to make this broker not serving traffic for a few hours or days
> depending on the SLA of the topic retention requirement until after it's
> having enough historical data.
> >
> >
> > * The traditional way using the reassignments to move this broker in
> that 1000 partitions where it's the preferred leader to the end of
> assignment, this is O(N) operation. and from my experience, we can't submit
> all 1000 at the same time, otherwise cause higher latencies even the
> reassignment in this case can complete almost instantly.  After  a few
> hours/days whatever, this broker is ready to serve traffic,  we have to run
> reassignments again to restore that 1000 partitions preferred leaders for
> this broker: O(N) operation.  then run preferred leader election O(N)
> again.  So total 3 x O(N) operations.  The point is since the new empty
> broker is expected to be the same as the old one in terms of hosting
> partition/leaders, it would seem unnecessary to do reassignments (ordering
> of replica) during the broker catching up time.
> >
> >
> >
> > * The new feature Preferred Leader "Blacklist":  just need to put a
> dynamic config to indicate that this broker should be considered leader
> (preferred leader election or broker failover or unclean leader election)
> to the lowest priority. NO need to run any reassignments. After a few
> hours/days, when this broker is ready, remove the dynamic config, and run
> preferred leader election and this broker will serve traffic for that 1000
> original partitions it was the preferred leader. So total  1 x O(N)
> operation.
> >
> >
> > If auto.leader.rebalance.enable  is enabled,  the Preferred Leader
> "Blacklist" can be put it before Kafka is started to prevent this broker
> serving traffic.  In the traditional way of running reassignments, once the
> broker is up, with auto.leader.rebalance.enable  , if leadership starts
> going to this new empty broker, it might have to do preferred leader
> election after reassignments to remove its leaderships. e.g. (1,2,3) =>
> (2,3,1) reassignment only change the ordering, 1 remains as the current
> leader, and needs prefer leader election to change to 2 after reassignment.
> so potentially one more O(N) operation.
> >
> > I hope the above example can show how easy to "blacklist" a broker
> serving leadership.  For someone managing Production Kafka cluster, it's
> important to react fast to certain alerts and mitigate/resolve some issues.
> As I listed the other use cases in KIP-291, I think this feature can make
> the Kafka product more easier to manage/operate.
> >
> > > In general, using an external rebalancing tool like Cruise Control is
> a good idea to keep things balanced without having deal with manual
> rebalancing.  > We expect more and more people who have a complex or large
> cluster will start using tools like this.
> > >
> > > However, if you choose to do manual rebalancing, it shouldn't be that
> bad.  You would save the existing partition ordering before making your
> changes, then> make your changes (perhaps by running a simple command line
> tool that switches the order of the replicas).  Then, once you felt like
> the broker was ready to> serve traffic, you could just re-apply the old
> ordering which you had saved.
> >
> >
> > We do have our own rebalancing tool which has its own criteria like Rack
> diversity,  disk usage,  spread partitions/leaders across all brokers in
> the cluster per topic, leadership Bytes/BytesIn served per broker, etc.  We
> can run reassignments. The point is whether it's really necessary, and if
> there is more effective, easier, safer way to do it.
> >
> > take another use case example of taking leadership out of busy
> Controller to give it more power to serve metadata requests and other work.
> The controller can failover, with the preferred leader "blacklist",  it
> does not have to run reassignments again when controller failover, just
> change the blacklisted broker_id.
> >
> >
> > > I was thinking about a PlacementPolicy filling the role of preventing
> people from creating single-replica partitions on a node that we didn't
> want to > ever be the leader.  I thought that it could also prevent people
> from designating those nodes as preferred leaders during topic creation, or
> Kafka from doing> itduring random topic creation.  I was assuming that the
> PlacementPolicy would determine which nodes were which through static
> configuration keys.  I agree> static configuration keys are somewhat less
> flexible than dynamic configuration.
> >
> >
> > I think single-replica partition might not be a good example.  There
> should not be any single-replica partition at all. If yes. it's probably
> because of trying to save disk space with less replicas.  I think at least
> minimum 2. The user purposely creating single-replica partition will take
> full responsibilities of data loss and unavailability when a broker fails
> or under maintenance.
> >
> >
> > I think it would be better to use dynamic instead of static config.  I
> also think it would be better to have topic creation Policy enforced in
> Kafka server OR an external service. We have an external/central service
> managing topic creation/partition expansion which takes into account of
> rack-diversity, replication factor (2, 3 or 4 depending on cluster/topic
> type), Policy replicating the topic between kafka clusters, etc.
> >
> >
> >
> > Thanks,
> > George
> >
> >
> >    On Wednesday, August 7, 2019, 05:41:28 PM PDT, Colin McCabe <
> cmccabe@apache.org> wrote:
> >
> >  On Wed, Aug 7, 2019, at 12:48, George Li wrote:
> > >  Hi Colin,
> > >
> > > Thanks for your feedbacks.  Comments below:
> > > > Even if you have a way of blacklisting an entire broker all at once,
> you still would need to run a leader election > for each partition where
> you want to move the leader off of the blacklisted broker.  So the
> operation is still O(N) in > that sense-- you have to do something per
> partition.
> > >
> > > For a failed broker and swapped with an empty broker, when it comes up,
> > > it will not have any leadership, and we would like it to remain not
> > > having leaderships for a couple of hours or days. So there is no
> > > preferred leader election needed which incurs O(N) operation in this
> > > case.  Putting the preferred leader blacklist would safe guard this
> > > broker serving traffic during that time. otherwise, if another broker
> > > fails(if this broker is the 1st, 2nd in the assignment), or someone
> > > runs preferred leader election, this new "empty" broker can still get
> > > leaderships.
> > >
> > > Also running reassignment to change the ordering of preferred leader
> > > would not actually switch the leader automatically.  e.g.  (1,2,3) =>
> > > (2,3,1). unless preferred leader election is run to switch current
> > > leader from 1 to 2.  So the operation is at least 2 x O(N).  and then
> > > after the broker is back to normal, another 2 x O(N) to rollback.
> >
> > Hi George,
> >
> > Hmm.  I guess I'm still on the fence about this feature.
> >
> > In your example, I think we're comparing apples and oranges.  You
> started by outlining a scenario where "an empty broker... comes up...
> [without] any leadership[s]."  But then you criticize using reassignment to
> switch the order of preferred replicas because it "would not actually
> switch the leader automatically."  If the empty broker doesn't have any
> leaderships, there is nothing to be switched, right?
> >
> > >
> > >
> > > > In general, reassignment will get a lot easier and quicker once
> KIP-455 is implemented.  > Reassignments that just change the order of
> preferred replicas for a specific partition should complete pretty much
> instantly.
> > > >> I think it's simpler and easier just to have one source of truth
> for what the preferred replica is for a partition, rather than two.  So
> for> me, the fact that the replica assignment ordering isn't changed is
> actually a big disadvantage of this KIP.  If you are a new user (or just>
> an existing user that didn't read all of the documentation) and you just
> look at the replica assignment, you might be confused by why> a particular
> broker wasn't getting any leaderships, even  though it appeared like it
> should.  More mechanisms mean more complexity> for users and developers
> most of the time.
> > >
> > >
> > > I would like stress the point that running reassignment to change the
> > > ordering of the replica (putting a broker to the end of partition
> > > assignment) is unnecessary, because after some time the broker is
> > > caught up, it can start serving traffic and then need to run
> > > reassignments again to "rollback" to previous states. As I mentioned in
> > > KIP-491, this is just tedious work.
> >
> > In general, using an external rebalancing tool like Cruise Control is a
> good idea to keep things balanced without having deal with manual
> rebalancing.  We expect more and more people who have a complex or large
> cluster will start using tools like this.
> >
> > However, if you choose to do manual rebalancing, it shouldn't be that
> bad.  You would save the existing partition ordering before making your
> changes, then make your changes (perhaps by running a simple command line
> tool that switches the order of the replicas).  Then, once you felt like
> the broker was ready to serve traffic, you could just re-apply the old
> ordering which you had saved.
> >
> > >
> > > I agree this might introduce some complexities for users/developers.
> > > But if this feature is good, and well documented, it is good for the
> > > kafka product/community.  Just like KIP-460 enabling unclean leader
> > > election to override TopicLevel/Broker Level config of
> > > `unclean.leader.election.enable`
> > >
> > > > I agree that it would be nice if we could treat some brokers
> differently for the purposes of placing replicas, selecting leaders, etc. >
> Right now, we don't have any way of implementing that without forking the
> broker.  I would support a new PlacementPolicy class that> would close this
> gap.  But I don't think this KIP is flexible enough to fill this role.  For
> example, it can't prevent users from creating> new single-replica topics
> that get put on the "bad" replica.  Perhaps we should reopen the
> discussion> about
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-201%3A+Rationalising+Policy+interfaces
> > >
> > > Creating topic with single-replica is beyond what KIP-491 is trying to
> > > achieve.  The user needs to take responsibility of doing that. I do see
> > > some Samza clients notoriously creating single-replica topics and that
> > > got flagged by alerts, because a single broker down/maintenance will
> > > cause offline partitions. For KIP-491 preferred leader "blacklist",
> > > the single-replica will still serve as leaders, because there is no
> > > other alternative replica to be chosen as leader.
> > >
> > > Even with a new PlacementPolicy for topic creation/partition expansion,
> > > it still needs the blacklist info (e.g. a zk path node, or broker
> > > level/topic level config) to "blacklist" the broker to be preferred
> > > leader? Would it be the same as KIP-491 is introducing?
> >
> > I was thinking about a PlacementPolicy filling the role of preventing
> people from creating single-replica partitions on a node that we didn't
> want to ever be the leader.  I thought that it could also prevent people
> from designating those nodes as preferred leaders during topic creation, or
> Kafka from doing itduring random topic creation.  I was assuming that the
> PlacementPolicy would determine which nodes were which through static
> configuration keys.  I agree static configuration keys are somewhat less
> flexible than dynamic configuration.
> >
> > best,
> > Colin
> >
> >
> > >
> > >
> > > Thanks,
> > > George
> > >
> > >    On Wednesday, August 7, 2019, 11:01:51 AM PDT, Colin McCabe
> > > <cm...@apache.org> wrote:
> > >
> > >  On Fri, Aug 2, 2019, at 20:02, George Li wrote:
> > > >  Hi Colin,
> > > > Thanks for looking into this KIP.  Sorry for the late response. been
> busy.
> > > >
> > > > If a cluster has MAMY topic partitions, moving this "blacklist"
> broker
> > > > to the end of replica list is still a rather "big" operation,
> involving
> > > > submitting reassignments.  The KIP-491 way of blacklist is much
> > > > simpler/easier and can undo easily without changing the replica
> > > > assignment ordering.
> > >
> > > Hi George,
> > >
> > > Even if you have a way of blacklisting an entire broker all at once,
> > > you still would need to run a leader election for each partition where
> > > you want to move the leader off of the blacklisted broker.  So the
> > > operation is still O(N) in that sense-- you have to do something per
> > > partition.
> > >
> > > In general, reassignment will get a lot easier and quicker once KIP-455
> > > is implemented.  Reassignments that just change the order of preferred
> > > replicas for a specific partition should complete pretty much
> instantly.
> > >
> > > I think it's simpler and easier just to have one source of truth for
> > > what the preferred replica is for a partition, rather than two.  So for
> > > me, the fact that the replica assignment ordering isn't changed is
> > > actually a big disadvantage of this KIP.  If you are a new user (or
> > > just an existing user that didn't read all of the documentation) and
> > > you just look at the replica assignment, you might be confused by why a
> > > particular broker wasn't getting any leaderships, even  though it
> > > appeared like it should.  More mechanisms mean more complexity for
> > > users and developers most of the time.
> > >
> > > > Major use case for me, a failed broker got swapped with new hardware,
> > > > and starts up as empty (with latest offset of all partitions), the
> SLA
> > > > of retention is 1 day, so before this broker is up to be in-sync for
> 1
> > > > day, we would like to blacklist this broker from serving traffic.
> after
> > > > 1 day, the blacklist is removed and run preferred leader election.
> > > > This way, no need to run reassignments before/after.  This is the
> > > > "temporary" use-case.
> > >
> > > What if we just add an option to the reassignment tool to generate a
> > > plan to move all the leaders off of a specific broker?  The tool could
> > > also run a leader election as well.  That would be a simple way of
> > > doing this without adding new mechanisms or broker-side configurations,
> > > etc.
> > >
> > > >
> > > > There are use-cases that this Preferred Leader "blacklist" can be
> > > > somewhat permanent, as I explained in the AWS data center instances
> Vs.
> > > > on-premises data center bare metal machines (heterogenous hardware),
> > > > that the AWS broker_ids will be blacklisted.  So new topics created,
> > > > or existing topic expansion would not make them serve traffic even
> they
> > > > could be the preferred leader.
> > >
> > > I agree that it would be nice if we could treat some brokers
> > > differently for the purposes of placing replicas, selecting leaders,
> > > etc.  Right now, we don't have any way of implementing that without
> > > forking the broker.  I would support a new PlacementPolicy class that
> > > would close this gap.  But I don't think this KIP is flexible enough to
> > > fill this role.  For example, it can't prevent users from creating new
> > > single-replica topics that get put on the "bad" replica.  Perhaps we
> > > should reopen the discussion about
> > >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-201%3A+Rationalising+Policy+interfaces
> > >
> > > regards,
> > > Colin
> > >
> > > >
> > > > Please let me know there are more question.
> > > >
> > > >
> > > > Thanks,
> > > > George
> > > >
> > > >    On Thursday, July 25, 2019, 08:38:28 AM PDT, Colin McCabe
> > > > <cm...@apache.org> wrote:
> > > >
> > > >  We still want to give the "blacklisted" broker the leadership if
> > > > nobody else is available.  Therefore, isn't putting a broker on the
> > > > blacklist pretty much the same as moving it to the last entry in the
> > > > replicas list and then triggering a preferred leader election?
> > > >
> > > > If we want this to be undone after a certain amount of time, or under
> > > > certain conditions, that seems like something that would be more
> > > > effectively done by an external system, rather than putting all these
> > > > policies into Kafka.
> > > >
> > > > best,
> > > > Colin
> > > >
> > > >
> > > > On Fri, Jul 19, 2019, at 18:23, George Li wrote:
> > > > >  Hi Satish,
> > > > > Thanks for the reviews and feedbacks.
> > > > >
> > > > > > > The following is the requirements this KIP is trying to
> accomplish:
> > > > > > This can be moved to the"Proposed changes" section.
> > > > >
> > > > > Updated the KIP-491.
> > > > >
> > > > > > >>The logic to determine the priority/order of which broker
> should be
> > > > > > preferred leader should be modified.  The broker in the
> preferred leader
> > > > > > blacklist should be moved to the end (lowest priority) when
> > > > > > determining leadership.
> > > > > >
> > > > > > I believe there is no change required in the ordering of the
> preferred
> > > > > > replica list. Brokers in the preferred leader blacklist are
> skipped
> > > > > > until other brokers int he list are unavailable.
> > > > >
> > > > > Yes. partition assignment remained the same, replica & ordering.
> The
> > > > > blacklist logic can be optimized during implementation.
> > > > >
> > > > > > >>The blacklist can be at the broker level. However, there might
> be use cases
> > > > > > where a specific topic should blacklist particular brokers, which
> > > > > > would be at the
> > > > > > Topic level Config. For this use cases of this KIP, it seems
> that broker level
> > > > > > blacklist would suffice.  Topic level preferred leader blacklist
> might
> > > > > > be future enhancement work.
> > > > > >
> > > > > > I agree that the broker level preferred leader blacklist would be
> > > > > > sufficient. Do you have any use cases which require topic level
> > > > > > preferred blacklist?
> > > > >
> > > > >
> > > > >
> > > > > I don't have any concrete use cases for Topic level preferred
> leader
> > > > > blacklist.  One scenarios I can think of is when a broker has high
> CPU
> > > > > usage, trying to identify the big topics (High MsgIn, High BytesIn,
> > > > > etc), then try to move the leaders away from this broker,  before
> doing
> > > > > an actual reassignment to change its preferred leader,  try to put
> this
> > > > > preferred_leader_blacklist in the Topic Level config, and run
> preferred
> > > > > leader election, and see whether CPU decreases for this broker,  if
> > > > > yes, then do the reassignments to change the preferred leaders to
> be
> > > > > "permanent" (the topic may have many partitions like 256 that has
> quite
> > > > > a few of them having this broker as preferred leader).  So this
> Topic
> > > > > Level config is an easy way of doing trial and check the result.
> > > > >
> > > > >
> > > > > > You can add the below workaround as an item in the rejected
> alternatives section
> > > > > > "Reassigning all the topic/partitions which the intended broker
> is a
> > > > > > replica for."
> > > > >
> > > > > Updated the KIP-491.
> > > > >
> > > > >
> > > > >
> > > > > Thanks,
> > > > > George
> > > > >
> > > > >    On Friday, July 19, 2019, 08:20:22 AM PDT, Satish Duggana
> > > > > <sa...@gmail.com> wrote:
> > > > >
> > > > >  Thanks for the KIP. I have put my comments below.
> > > > >
> > > > > This is a nice improvement to avoid cumbersome maintenance.
> > > > >
> > > > > >> The following is the requirements this KIP is trying to
> accomplish:
> > > > >  The ability to add and remove the preferred leader deprioritized
> > > > > list/blacklist. e.g. new ZK path/node or new dynamic config.
> > > > >
> > > > > This can be moved to the"Proposed changes" section.
> > > > >
> > > > > >>The logic to determine the priority/order of which broker should
> be
> > > > > preferred leader should be modified.  The broker in the preferred
> leader
> > > > > blacklist should be moved to the end (lowest priority) when
> > > > > determining leadership.
> > > > >
> > > > > I believe there is no change required in the ordering of the
> preferred
> > > > > replica list. Brokers in the preferred leader blacklist are skipped
> > > > > until other brokers int he list are unavailable.
> > > > >
> > > > > >>The blacklist can be at the broker level. However, there might
> be use cases
> > > > > where a specific topic should blacklist particular brokers, which
> > > > > would be at the
> > > > > Topic level Config. For this use cases of this KIP, it seems that
> broker level
> > > > > blacklist would suffice.  Topic level preferred leader blacklist
> might
> > > > > be future enhancement work.
> > > > >
> > > > > I agree that the broker level preferred leader blacklist would be
> > > > > sufficient. Do you have any use cases which require topic level
> > > > > preferred blacklist?
> > > > >
> > > > > You can add the below workaround as an item in the rejected
> alternatives section
> > > > > "Reassigning all the topic/partitions which the intended broker is
> a
> > > > > replica for."
> > > > >
> > > > > Thanks,
> > > > > Satish.
> > > > >
> > > > > On Fri, Jul 19, 2019 at 7:33 AM Stanislav Kozlovski
> > > > > <st...@confluent.io> wrote:
> > > > > >
> > > > > > Hey George,
> > > > > >
> > > > > > Thanks for the KIP, it's an interesting idea.
> > > > > >
> > > > > > I was wondering whether we could achieve the same thing via the
> > > > > > kafka-reassign-partitions tool. As you had also said in the
> JIRA,  it is
> > > > > > true that this is currently very tedious with the tool. My
> thoughts are
> > > > > > that we could improve the tool and give it the notion of a
> "blacklisted
> > > > > > preferred leader".
> > > > > > This would have some benefits like:
> > > > > > - more fine-grained control over the blacklist. we may not want
> to
> > > > > > blacklist all the preferred leaders, as that would make the
> blacklisted
> > > > > > broker a follower of last resort which is not very useful. In
> the cases of
> > > > > > an underpowered AWS machine or a controller, you might overshoot
> and make
> > > > > > the broker very underutilized if you completely make it
> leaderless.
> > > > > > - is not permanent. If we are to have a blacklist leaders config,
> > > > > > rebalancing tools would also need to know about it and
> manipulate/respect
> > > > > > it to achieve a fair balance.
> > > > > > It seems like both problems are tied to balancing partitions,
> it's just
> > > > > > that KIP-491's use case wants to balance them against other
> factors in a
> > > > > > more nuanced way. It makes sense to have both be done from the
> same place
> > > > > >
> > > > > > To make note of the motivation section:
> > > > > > > Avoid bouncing broker in order to lose its leadership
> > > > > > The recommended way to make a broker lose its leadership is to
> run a
> > > > > > reassignment on its partitions
> > > > > > > The cross-data center cluster has AWS cloud instances which
> have less
> > > > > > computing power
> > > > > > We recommend running Kafka on homogeneous machines. It would be
> cool if the
> > > > > > system supported more flexibility in that regard but that is
> more nuanced
> > > > > > and a preferred leader blacklist may not be the best first
> approach to the
> > > > > > issue
> > > > > >
> > > > > > Adding a new config which can fundamentally change the way
> replication is
> > > > > > done is complex, both for the system (the replication code is
> complex
> > > > > > enough) and the user. Users would have another potential config
> that could
> > > > > > backfire on them - e.g if left forgotten.
> > > > > >
> > > > > > Could you think of any downsides to implementing this
> functionality (or a
> > > > > > variation of it) in the kafka-reassign-partitions.sh tool?
> > > > > > One downside I can see is that we would not have it handle new
> partitions
> > > > > > created after the "blacklist operation". As a first iteration I
> think that
> > > > > > may be acceptable
> > > > > >
> > > > > > Thanks,
> > > > > > Stanislav
> > > > > >
> > > > > > On Fri, Jul 19, 2019 at 3:20 AM George Li <
> sql_consulting@yahoo.com.invalid>
> > > > > > wrote:
> > > > > >
> > > > > > >  Hi,
> > > > > > >
> > > > > > > Pinging the list for the feedbacks of this KIP-491  (
> > > > > > >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120736982
> > > > > > > )
> > > > > > >
> > > > > > >
> > > > > > > Thanks,
> > > > > > > George
> > > > > > >
> > > > > > >    On Saturday, July 13, 2019, 08:43:25 PM PDT, George Li <
> > > > > > > sql_consulting@yahoo.com.INVALID> wrote:
> > > > > > >
> > > > > > >  Hi,
> > > > > > >
> > > > > > > I have created KIP-491 (
> > > > > > >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120736982
> )
> > > > > > > for putting a broker to the preferred leader blacklist or
> deprioritized
> > > > > > > list so when determining leadership,  it's moved to the lowest
> priority for
> > > > > > > some of the listed use-cases.
> > > > > > >
> > > > > > > Please provide your comments/feedbacks.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > George
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >  ----- Forwarded Message ----- From: Jose Armando Garcia
> Sancio (JIRA) <
> > > > > > > jira@apache.org>To: "sql_consulting@yahoo.com" <
> sql_consulting@yahoo.com>Sent:
> > > > > > > Tuesday, July 9, 2019, 01:06:05 PM PDTSubject: [jira]
> [Commented]
> > > > > > > (KAFKA-8638) Preferred Leader Blacklist (deprioritized list)
> > > > > > >
> > > > > > >    [
> > > > > > >
> https://issues.apache.org/jira/browse/KAFKA-8638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881511#comment-16881511
> > > > > > > ]
> > > > > > >
> > > > > > > Jose Armando Garcia Sancio commented on KAFKA-8638:
> > > > > > > ---------------------------------------------------
> > > > > > >
> > > > > > > Thanks for feedback and clear use cases [~sql_consulting].
> > > > > > >
> > > > > > > > Preferred Leader Blacklist (deprioritized list)
> > > > > > > > -----------------------------------------------
> > > > > > > >
> > > > > > > >                Key: KAFKA-8638
> > > > > > > >                URL:
> https://issues.apache.org/jira/browse/KAFKA-8638
> > > > > > > >            Project: Kafka
> > > > > > > >          Issue Type: Improvement
> > > > > > > >          Components: config, controller, core
> > > > > > > >    Affects Versions: 1.1.1, 2.3.0, 2.2.1
> > > > > > > >            Reporter: GEORGE LI
> > > > > > > >            Assignee: GEORGE LI
> > > > > > > >            Priority: Major
> > > > > > > >
> > > > > > > > Currently, the kafka preferred leader election will pick the
> broker_id
> > > > > > > in the topic/partition replica assignments in a priority order
> when the
> > > > > > > broker is in ISR. The preferred leader is the broker id in the
> first
> > > > > > > position of replica. There are use-cases that, even the first
> broker in the
> > > > > > > replica assignment is in ISR, there is a need for it to be
> moved to the end
> > > > > > > of ordering (lowest priority) when deciding leadership during
> preferred
> > > > > > > leader election.
> > > > > > > > Let’s use topic/partition replica (1,2,3) as an example. 1
> is the
> > > > > > > preferred leader.  When preferred leadership is run, it will
> pick 1 as the
> > > > > > > leader if it's ISR, if 1 is not online and in ISR, then pick
> 2, if 2 is not
> > > > > > > in ISR, then pick 3 as the leader. There are use cases that,
> even 1 is in
> > > > > > > ISR, we would like it to be moved to the end of ordering
> (lowest priority)
> > > > > > > when deciding leadership during preferred leader election.
> Below is a list
> > > > > > > of use cases:
> > > > > > > > * (If broker_id 1 is a swapped failed host and brought up
> with last
> > > > > > > segments or latest offset without historical data (There is
> another effort
> > > > > > > on this), it's better for it to not serve leadership till it's
> caught-up.
> > > > > > > > * The cross-data center cluster has AWS instances which have
> less
> > > > > > > computing power than the on-prem bare metal machines.  We
> could put the AWS
> > > > > > > broker_ids in Preferred Leader Blacklist, so on-prem brokers
> can be elected
> > > > > > > leaders, without changing the reassignments ordering of the
> replicas.
> > > > > > > > * If the broker_id 1 is constantly losing leadership after
> some time:
> > > > > > > "Flapping". we would want to exclude 1 to be a leader unless
> all other
> > > > > > > brokers of this topic/partition are offline.  The “Flapping”
> effect was
> > > > > > > seen in the past when 2 or more brokers were bad, when they
> lost leadership
> > > > > > > constantly/quickly, the sets of partition replicas they belong
> to will see
> > > > > > > leadership constantly changing.  The ultimate solution is to
> swap these bad
> > > > > > > hosts.  But for quick mitigation, we can also put the bad
> hosts in the
> > > > > > > Preferred Leader Blacklist to move the priority of its being
> elected as
> > > > > > > leaders to the lowest.
> > > > > > > > *  If the controller is busy serving an extra load of
> metadata requests
> > > > > > > and other tasks. we would like to put the controller's leaders
> to other
> > > > > > > brokers to lower its CPU load. currently bouncing to lose
> leadership would
> > > > > > > not work for Controller, because after the bounce, the
> controller fails
> > > > > > > over to another broker.
> > > > > > > > * Avoid bouncing broker in order to lose its leadership: it
> would be
> > > > > > > good if we have a way to specify which broker should be
> excluded from
> > > > > > > serving traffic/leadership (without changing the replica
> assignment
> > > > > > > ordering by reassignments, even though that's quick), and run
> preferred
> > > > > > > leader election.  A bouncing broker will cause temporary URP,
> and sometimes
> > > > > > > other issues.  Also a bouncing of broker (e.g. broker_id 1)
> can temporarily
> > > > > > > lose all its leadership, but if another broker (e.g. broker_id
> 2) fails or
> > > > > > > gets bounced, some of its leaderships will likely failover to
> broker_id 1
> > > > > > > on a replica with 3 brokers.  If broker_id 1 is in the
> blacklist, then in
> > > > > > > such a scenario even broker_id 2 offline,  the 3rd broker can
> take
> > > > > > > leadership.
> > > > > > > > The current work-around of the above is to change the
> topic/partition's
> > > > > > > replica reassignments to move the broker_id 1 from the first
> position to
> > > > > > > the last position and run preferred leader election. e.g. (1,
> 2, 3) => (2,
> > > > > > > 3, 1). This changes the replica reassignments, and we need to
> keep track of
> > > > > > > the original one and restore if things change (e.g. controller
> fails over
> > > > > > > to another broker, the swapped empty broker caught up). That’s
> a rather
> > > > > > > tedious task.
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > This message was sent by Atlassian JIRA
> > > > > > > (v7.6.3#76005)
>


-- 
Best,
Stanislav  

Re: [DISCUSS] KIP-491: Preferred Leader Deprioritized List (Temporary Blacklist)

Posted by Stanislav Kozlovski <st...@confluent.io>.
Hey Harsha,

> If we want to go with making this an option and providing a tool which
abstracts moving the broker to end preferred leader list , it needs to do
it for all the partitions that broker is leader for. As said in the above
comment a broker i.e leader for 1000 partitions we have to this for all the
partitions.  Instead of having a blacklist will help simplify this process
and we can provide monitoring/alerts on such list.

Sorry, I thought that part of the reasoning for not using reassignment was
to optimize the process.

> Do you mind shedding some light what issue you are talking to propose a
KIP for?


The issue I was talking about is the one I quoted in my previous reply. I
understand that you want to have a way of running a "shallow" replica of
sorts - one that is lacking the historical data but has (and continues to
replicate) the latest data. That is the goal of setting the last offsets
for all partitions in replication-offset-checkpoint, right?

Thanks,
Stanislav

On Mon, Sep 16, 2019 at 3:39 PM Satish Duggana <sa...@gmail.com>
wrote:

> Hi George,
> Thanks for explaining the usecase for topic level preferred leader
> blacklist. As I mentioned earlier, I am fine with broker level config
> for now.
>
> ~Satish.
>
>
> On Sat, Sep 7, 2019 at 12:29 AM George Li
> <sq...@yahoo.com.invalid> wrote:
> >
> >  Hi,
> >
> > Just want to ping and bubble up the discussion of KIP-491.
> >
> > On a large scale of Kafka clusters with thousands of brokers in many
> clusters.  Frequent hardware failures are common, although the
> reassignments to change the preferred leaders is a workaround, it incurs
> unnecessary additional work than the proposed preferred leader blacklist in
> KIP-491, and hard to scale.
> >
> > I am wondering whether others using Kafka in a big scale running into
> same problem.
> >
> >
> > Satish,
> >
> > Regarding your previous question about whether there is use-case for
> TopicLevel preferred leader "blacklist",  I thought about one use-case:  to
> improve rebalance/reassignment, the large partition will usually cause
> performance/stability issues, planning to change the say the New Replica
> will start with Leader's latest offset(this way the replica is almost
> instantly in the ISR and reassignment completed), and put this partition's
> NewReplica into Preferred Leader "Blacklist" at the Topic Level config for
> that partition. After sometime(retention time), this new replica has caught
> up and ready to serve traffic, update/remove the TopicConfig for this
> partition's preferred leader blacklist.
> >
> > I will update the KIP-491 later for this use case of Topic Level config
> for Preferred Leader Blacklist.
> >
> >
> > Thanks,
> > George
> >
> >     On Wednesday, August 7, 2019, 07:43:55 PM PDT, George Li <
> sql_consulting@yahoo.com> wrote:
> >
> >   Hi Colin,
> >
> > > In your example, I think we're comparing apples and oranges.  You
> started by outlining a scenario where "an empty broker... comes up...
> [without] any > leadership[s]."  But then you criticize using reassignment
> to switch the order of preferred replicas because it "would not actually
> switch the leader > automatically."  If the empty broker doesn't have any
> leaderships, there is nothing to be switched, right?
> >
> > Let me explained in details of this particular use case example for
> comparing apples to apples.
> >
> > Let's say a healthy broker hosting 3000 partitions, and of which 1000
> are the preferred leaders (leader count is 1000). There is a hardware
> failure (disk/memory, etc.), and kafka process crashed. We swap this host
> with another host but keep the same broker.id, when this new broker
> coming up, it has no historical data, and we manage to have the current
> last offsets of all partitions set in the replication-offset-checkpoint (if
> we don't set them, it could cause crazy ReplicaFetcher pulling of
> historical data from other brokers and cause cluster high latency and other
> instabilities), so when Kafka is brought up, it is quickly catching up as
> followers in the ISR.  Note, we have auto.leader.rebalance.enable
> disabled, so it's not serving any traffic as leaders (leader count = 0),
> even there are 1000 partitions that this broker is the Preferred Leader.
> >
> > We need to make this broker not serving traffic for a few hours or days
> depending on the SLA of the topic retention requirement until after it's
> having enough historical data.
> >
> >
> > * The traditional way using the reassignments to move this broker in
> that 1000 partitions where it's the preferred leader to the end of
> assignment, this is O(N) operation. and from my experience, we can't submit
> all 1000 at the same time, otherwise cause higher latencies even the
> reassignment in this case can complete almost instantly.  After  a few
> hours/days whatever, this broker is ready to serve traffic,  we have to run
> reassignments again to restore that 1000 partitions preferred leaders for
> this broker: O(N) operation.  then run preferred leader election O(N)
> again.  So total 3 x O(N) operations.  The point is since the new empty
> broker is expected to be the same as the old one in terms of hosting
> partition/leaders, it would seem unnecessary to do reassignments (ordering
> of replica) during the broker catching up time.
> >
> >
> >
> > * The new feature Preferred Leader "Blacklist":  just need to put a
> dynamic config to indicate that this broker should be considered leader
> (preferred leader election or broker failover or unclean leader election)
> to the lowest priority. NO need to run any reassignments. After a few
> hours/days, when this broker is ready, remove the dynamic config, and run
> preferred leader election and this broker will serve traffic for that 1000
> original partitions it was the preferred leader. So total  1 x O(N)
> operation.
> >
> >
> > If auto.leader.rebalance.enable  is enabled,  the Preferred Leader
> "Blacklist" can be put it before Kafka is started to prevent this broker
> serving traffic.  In the traditional way of running reassignments, once the
> broker is up, with auto.leader.rebalance.enable  , if leadership starts
> going to this new empty broker, it might have to do preferred leader
> election after reassignments to remove its leaderships. e.g. (1,2,3) =>
> (2,3,1) reassignment only change the ordering, 1 remains as the current
> leader, and needs prefer leader election to change to 2 after reassignment.
> so potentially one more O(N) operation.
> >
> > I hope the above example can show how easy to "blacklist" a broker
> serving leadership.  For someone managing Production Kafka cluster, it's
> important to react fast to certain alerts and mitigate/resolve some issues.
> As I listed the other use cases in KIP-291, I think this feature can make
> the Kafka product more easier to manage/operate.
> >
> > > In general, using an external rebalancing tool like Cruise Control is
> a good idea to keep things balanced without having deal with manual
> rebalancing.  > We expect more and more people who have a complex or large
> cluster will start using tools like this.
> > >
> > > However, if you choose to do manual rebalancing, it shouldn't be that
> bad.  You would save the existing partition ordering before making your
> changes, then> make your changes (perhaps by running a simple command line
> tool that switches the order of the replicas).  Then, once you felt like
> the broker was ready to> serve traffic, you could just re-apply the old
> ordering which you had saved.
> >
> >
> > We do have our own rebalancing tool which has its own criteria like Rack
> diversity,  disk usage,  spread partitions/leaders across all brokers in
> the cluster per topic, leadership Bytes/BytesIn served per broker, etc.  We
> can run reassignments. The point is whether it's really necessary, and if
> there is more effective, easier, safer way to do it.
> >
> > take another use case example of taking leadership out of busy
> Controller to give it more power to serve metadata requests and other work.
> The controller can failover, with the preferred leader "blacklist",  it
> does not have to run reassignments again when controller failover, just
> change the blacklisted broker_id.
> >
> >
> > > I was thinking about a PlacementPolicy filling the role of preventing
> people from creating single-replica partitions on a node that we didn't
> want to > ever be the leader.  I thought that it could also prevent people
> from designating those nodes as preferred leaders during topic creation, or
> Kafka from doing> itduring random topic creation.  I was assuming that the
> PlacementPolicy would determine which nodes were which through static
> configuration keys.  I agree> static configuration keys are somewhat less
> flexible than dynamic configuration.
> >
> >
> > I think single-replica partition might not be a good example.  There
> should not be any single-replica partition at all. If yes. it's probably
> because of trying to save disk space with less replicas.  I think at least
> minimum 2. The user purposely creating single-replica partition will take
> full responsibilities of data loss and unavailability when a broker fails
> or under maintenance.
> >
> >
> > I think it would be better to use dynamic instead of static config.  I
> also think it would be better to have topic creation Policy enforced in
> Kafka server OR an external service. We have an external/central service
> managing topic creation/partition expansion which takes into account of
> rack-diversity, replication factor (2, 3 or 4 depending on cluster/topic
> type), Policy replicating the topic between kafka clusters, etc.
> >
> >
> >
> > Thanks,
> > George
> >
> >
> >     On Wednesday, August 7, 2019, 05:41:28 PM PDT, Colin McCabe <
> cmccabe@apache.org> wrote:
> >
> >  On Wed, Aug 7, 2019, at 12:48, George Li wrote:
> > >  Hi Colin,
> > >
> > > Thanks for your feedbacks.  Comments below:
> > > > Even if you have a way of blacklisting an entire broker all at once,
> you still would need to run a leader election > for each partition where
> you want to move the leader off of the blacklisted broker.  So the
> operation is still O(N) in > that sense-- you have to do something per
> partition.
> > >
> > > For a failed broker and swapped with an empty broker, when it comes up,
> > > it will not have any leadership, and we would like it to remain not
> > > having leaderships for a couple of hours or days. So there is no
> > > preferred leader election needed which incurs O(N) operation in this
> > > case.  Putting the preferred leader blacklist would safe guard this
> > > broker serving traffic during that time. otherwise, if another broker
> > > fails(if this broker is the 1st, 2nd in the assignment), or someone
> > > runs preferred leader election, this new "empty" broker can still get
> > > leaderships.
> > >
> > > Also running reassignment to change the ordering of preferred leader
> > > would not actually switch the leader automatically.  e.g.  (1,2,3) =>
> > > (2,3,1). unless preferred leader election is run to switch current
> > > leader from 1 to 2.  So the operation is at least 2 x O(N).  and then
> > > after the broker is back to normal, another 2 x O(N) to rollback.
> >
> > Hi George,
> >
> > Hmm.  I guess I'm still on the fence about this feature.
> >
> > In your example, I think we're comparing apples and oranges.  You
> started by outlining a scenario where "an empty broker... comes up...
> [without] any leadership[s]."  But then you criticize using reassignment to
> switch the order of preferred replicas because it "would not actually
> switch the leader automatically."  If the empty broker doesn't have any
> leaderships, there is nothing to be switched, right?
> >
> > >
> > >
> > > > In general, reassignment will get a lot easier and quicker once
> KIP-455 is implemented.  > Reassignments that just change the order of
> preferred replicas for a specific partition should complete pretty much
> instantly.
> > > >> I think it's simpler and easier just to have one source of truth
> for what the preferred replica is for a partition, rather than two.  So
> for> me, the fact that the replica assignment ordering isn't changed is
> actually a big disadvantage of this KIP.  If you are a new user (or just>
> an existing user that didn't read all of the documentation) and you just
> look at the replica assignment, you might be confused by why> a particular
> broker wasn't getting any leaderships, even  though it appeared like it
> should.  More mechanisms mean more complexity> for users and developers
> most of the time.
> > >
> > >
> > > I would like stress the point that running reassignment to change the
> > > ordering of the replica (putting a broker to the end of partition
> > > assignment) is unnecessary, because after some time the broker is
> > > caught up, it can start serving traffic and then need to run
> > > reassignments again to "rollback" to previous states. As I mentioned in
> > > KIP-491, this is just tedious work.
> >
> > In general, using an external rebalancing tool like Cruise Control is a
> good idea to keep things balanced without having deal with manual
> rebalancing.  We expect more and more people who have a complex or large
> cluster will start using tools like this.
> >
> > However, if you choose to do manual rebalancing, it shouldn't be that
> bad.  You would save the existing partition ordering before making your
> changes, then make your changes (perhaps by running a simple command line
> tool that switches the order of the replicas).  Then, once you felt like
> the broker was ready to serve traffic, you could just re-apply the old
> ordering which you had saved.
> >
> > >
> > > I agree this might introduce some complexities for users/developers.
> > > But if this feature is good, and well documented, it is good for the
> > > kafka product/community.  Just like KIP-460 enabling unclean leader
> > > election to override TopicLevel/Broker Level config of
> > > `unclean.leader.election.enable`
> > >
> > > > I agree that it would be nice if we could treat some brokers
> differently for the purposes of placing replicas, selecting leaders, etc. >
> Right now, we don't have any way of implementing that without forking the
> broker.  I would support a new PlacementPolicy class that> would close this
> gap.  But I don't think this KIP is flexible enough to fill this role.  For
> example, it can't prevent users from creating> new single-replica topics
> that get put on the "bad" replica.  Perhaps we should reopen the
> discussion> about
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-201%3A+Rationalising+Policy+interfaces
> > >
> > > Creating topic with single-replica is beyond what KIP-491 is trying to
> > > achieve.  The user needs to take responsibility of doing that. I do see
> > > some Samza clients notoriously creating single-replica topics and that
> > > got flagged by alerts, because a single broker down/maintenance will
> > > cause offline partitions. For KIP-491 preferred leader "blacklist",
> > > the single-replica will still serve as leaders, because there is no
> > > other alternative replica to be chosen as leader.
> > >
> > > Even with a new PlacementPolicy for topic creation/partition expansion,
> > > it still needs the blacklist info (e.g. a zk path node, or broker
> > > level/topic level config) to "blacklist" the broker to be preferred
> > > leader? Would it be the same as KIP-491 is introducing?
> >
> > I was thinking about a PlacementPolicy filling the role of preventing
> people from creating single-replica partitions on a node that we didn't
> want to ever be the leader.  I thought that it could also prevent people
> from designating those nodes as preferred leaders during topic creation, or
> Kafka from doing itduring random topic creation.  I was assuming that the
> PlacementPolicy would determine which nodes were which through static
> configuration keys.  I agree static configuration keys are somewhat less
> flexible than dynamic configuration.
> >
> > best,
> > Colin
> >
> >
> > >
> > >
> > > Thanks,
> > > George
> > >
> > >    On Wednesday, August 7, 2019, 11:01:51 AM PDT, Colin McCabe
> > > <cm...@apache.org> wrote:
> > >
> > >  On Fri, Aug 2, 2019, at 20:02, George Li wrote:
> > > >  Hi Colin,
> > > > Thanks for looking into this KIP.  Sorry for the late response. been
> busy.
> > > >
> > > > If a cluster has MAMY topic partitions, moving this "blacklist"
> broker
> > > > to the end of replica list is still a rather "big" operation,
> involving
> > > > submitting reassignments.  The KIP-491 way of blacklist is much
> > > > simpler/easier and can undo easily without changing the replica
> > > > assignment ordering.
> > >
> > > Hi George,
> > >
> > > Even if you have a way of blacklisting an entire broker all at once,
> > > you still would need to run a leader election for each partition where
> > > you want to move the leader off of the blacklisted broker.  So the
> > > operation is still O(N) in that sense-- you have to do something per
> > > partition.
> > >
> > > In general, reassignment will get a lot easier and quicker once KIP-455
> > > is implemented.  Reassignments that just change the order of preferred
> > > replicas for a specific partition should complete pretty much
> instantly.
> > >
> > > I think it's simpler and easier just to have one source of truth for
> > > what the preferred replica is for a partition, rather than two.  So for
> > > me, the fact that the replica assignment ordering isn't changed is
> > > actually a big disadvantage of this KIP.  If you are a new user (or
> > > just an existing user that didn't read all of the documentation) and
> > > you just look at the replica assignment, you might be confused by why a
> > > particular broker wasn't getting any leaderships, even  though it
> > > appeared like it should.  More mechanisms mean more complexity for
> > > users and developers most of the time.
> > >
> > > > Major use case for me, a failed broker got swapped with new hardware,
> > > > and starts up as empty (with latest offset of all partitions), the
> SLA
> > > > of retention is 1 day, so before this broker is up to be in-sync for
> 1
> > > > day, we would like to blacklist this broker from serving traffic.
> after
> > > > 1 day, the blacklist is removed and run preferred leader election.
> > > > This way, no need to run reassignments before/after.  This is the
> > > > "temporary" use-case.
> > >
> > > What if we just add an option to the reassignment tool to generate a
> > > plan to move all the leaders off of a specific broker?  The tool could
> > > also run a leader election as well.  That would be a simple way of
> > > doing this without adding new mechanisms or broker-side configurations,
> > > etc.
> > >
> > > >
> > > > There are use-cases that this Preferred Leader "blacklist" can be
> > > > somewhat permanent, as I explained in the AWS data center instances
> Vs.
> > > > on-premises data center bare metal machines (heterogenous hardware),
> > > > that the AWS broker_ids will be blacklisted.  So new topics created,
> > > > or existing topic expansion would not make them serve traffic even
> they
> > > > could be the preferred leader.
> > >
> > > I agree that it would be nice if we could treat some brokers
> > > differently for the purposes of placing replicas, selecting leaders,
> > > etc.  Right now, we don't have any way of implementing that without
> > > forking the broker.  I would support a new PlacementPolicy class that
> > > would close this gap.  But I don't think this KIP is flexible enough to
> > > fill this role.  For example, it can't prevent users from creating new
> > > single-replica topics that get put on the "bad" replica.  Perhaps we
> > > should reopen the discussion about
> > >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-201%3A+Rationalising+Policy+interfaces
> > >
> > > regards,
> > > Colin
> > >
> > > >
> > > > Please let me know there are more question.
> > > >
> > > >
> > > > Thanks,
> > > > George
> > > >
> > > >    On Thursday, July 25, 2019, 08:38:28 AM PDT, Colin McCabe
> > > > <cm...@apache.org> wrote:
> > > >
> > > >  We still want to give the "blacklisted" broker the leadership if
> > > > nobody else is available.  Therefore, isn't putting a broker on the
> > > > blacklist pretty much the same as moving it to the last entry in the
> > > > replicas list and then triggering a preferred leader election?
> > > >
> > > > If we want this to be undone after a certain amount of time, or under
> > > > certain conditions, that seems like something that would be more
> > > > effectively done by an external system, rather than putting all these
> > > > policies into Kafka.
> > > >
> > > > best,
> > > > Colin
> > > >
> > > >
> > > > On Fri, Jul 19, 2019, at 18:23, George Li wrote:
> > > > >  Hi Satish,
> > > > > Thanks for the reviews and feedbacks.
> > > > >
> > > > > > > The following is the requirements this KIP is trying to
> accomplish:
> > > > > > This can be moved to the"Proposed changes" section.
> > > > >
> > > > > Updated the KIP-491.
> > > > >
> > > > > > >>The logic to determine the priority/order of which broker
> should be
> > > > > > preferred leader should be modified.  The broker in the
> preferred leader
> > > > > > blacklist should be moved to the end (lowest priority) when
> > > > > > determining leadership.
> > > > > >
> > > > > > I believe there is no change required in the ordering of the
> preferred
> > > > > > replica list. Brokers in the preferred leader blacklist are
> skipped
> > > > > > until other brokers int he list are unavailable.
> > > > >
> > > > > Yes. partition assignment remained the same, replica & ordering.
> The
> > > > > blacklist logic can be optimized during implementation.
> > > > >
> > > > > > >>The blacklist can be at the broker level. However, there might
> be use cases
> > > > > > where a specific topic should blacklist particular brokers, which
> > > > > > would be at the
> > > > > > Topic level Config. For this use cases of this KIP, it seems
> that broker level
> > > > > > blacklist would suffice.  Topic level preferred leader blacklist
> might
> > > > > > be future enhancement work.
> > > > > >
> > > > > > I agree that the broker level preferred leader blacklist would be
> > > > > > sufficient. Do you have any use cases which require topic level
> > > > > > preferred blacklist?
> > > > >
> > > > >
> > > > >
> > > > > I don't have any concrete use cases for Topic level preferred
> leader
> > > > > blacklist.  One scenarios I can think of is when a broker has high
> CPU
> > > > > usage, trying to identify the big topics (High MsgIn, High BytesIn,
> > > > > etc), then try to move the leaders away from this broker,  before
> doing
> > > > > an actual reassignment to change its preferred leader,  try to put
> this
> > > > > preferred_leader_blacklist in the Topic Level config, and run
> preferred
> > > > > leader election, and see whether CPU decreases for this broker,  if
> > > > > yes, then do the reassignments to change the preferred leaders to
> be
> > > > > "permanent" (the topic may have many partitions like 256 that has
> quite
> > > > > a few of them having this broker as preferred leader).  So this
> Topic
> > > > > Level config is an easy way of doing trial and check the result.
> > > > >
> > > > >
> > > > > > You can add the below workaround as an item in the rejected
> alternatives section
> > > > > > "Reassigning all the topic/partitions which the intended broker
> is a
> > > > > > replica for."
> > > > >
> > > > > Updated the KIP-491.
> > > > >
> > > > >
> > > > >
> > > > > Thanks,
> > > > > George
> > > > >
> > > > >    On Friday, July 19, 2019, 08:20:22 AM PDT, Satish Duggana
> > > > > <sa...@gmail.com> wrote:
> > > > >
> > > > >  Thanks for the KIP. I have put my comments below.
> > > > >
> > > > > This is a nice improvement to avoid cumbersome maintenance.
> > > > >
> > > > > >> The following is the requirements this KIP is trying to
> accomplish:
> > > > >   The ability to add and remove the preferred leader deprioritized
> > > > > list/blacklist. e.g. new ZK path/node or new dynamic config.
> > > > >
> > > > > This can be moved to the"Proposed changes" section.
> > > > >
> > > > > >>The logic to determine the priority/order of which broker should
> be
> > > > > preferred leader should be modified.  The broker in the preferred
> leader
> > > > > blacklist should be moved to the end (lowest priority) when
> > > > > determining leadership.
> > > > >
> > > > > I believe there is no change required in the ordering of the
> preferred
> > > > > replica list. Brokers in the preferred leader blacklist are skipped
> > > > > until other brokers int he list are unavailable.
> > > > >
> > > > > >>The blacklist can be at the broker level. However, there might
> be use cases
> > > > > where a specific topic should blacklist particular brokers, which
> > > > > would be at the
> > > > > Topic level Config. For this use cases of this KIP, it seems that
> broker level
> > > > > blacklist would suffice.  Topic level preferred leader blacklist
> might
> > > > > be future enhancement work.
> > > > >
> > > > > I agree that the broker level preferred leader blacklist would be
> > > > > sufficient. Do you have any use cases which require topic level
> > > > > preferred blacklist?
> > > > >
> > > > > You can add the below workaround as an item in the rejected
> alternatives section
> > > > > "Reassigning all the topic/partitions which the intended broker is
> a
> > > > > replica for."
> > > > >
> > > > > Thanks,
> > > > > Satish.
> > > > >
> > > > > On Fri, Jul 19, 2019 at 7:33 AM Stanislav Kozlovski
> > > > > <st...@confluent.io> wrote:
> > > > > >
> > > > > > Hey George,
> > > > > >
> > > > > > Thanks for the KIP, it's an interesting idea.
> > > > > >
> > > > > > I was wondering whether we could achieve the same thing via the
> > > > > > kafka-reassign-partitions tool. As you had also said in the
> JIRA,  it is
> > > > > > true that this is currently very tedious with the tool. My
> thoughts are
> > > > > > that we could improve the tool and give it the notion of a
> "blacklisted
> > > > > > preferred leader".
> > > > > > This would have some benefits like:
> > > > > > - more fine-grained control over the blacklist. we may not want
> to
> > > > > > blacklist all the preferred leaders, as that would make the
> blacklisted
> > > > > > broker a follower of last resort which is not very useful. In
> the cases of
> > > > > > an underpowered AWS machine or a controller, you might overshoot
> and make
> > > > > > the broker very underutilized if you completely make it
> leaderless.
> > > > > > - is not permanent. If we are to have a blacklist leaders config,
> > > > > > rebalancing tools would also need to know about it and
> manipulate/respect
> > > > > > it to achieve a fair balance.
> > > > > > It seems like both problems are tied to balancing partitions,
> it's just
> > > > > > that KIP-491's use case wants to balance them against other
> factors in a
> > > > > > more nuanced way. It makes sense to have both be done from the
> same place
> > > > > >
> > > > > > To make note of the motivation section:
> > > > > > > Avoid bouncing broker in order to lose its leadership
> > > > > > The recommended way to make a broker lose its leadership is to
> run a
> > > > > > reassignment on its partitions
> > > > > > > The cross-data center cluster has AWS cloud instances which
> have less
> > > > > > computing power
> > > > > > We recommend running Kafka on homogeneous machines. It would be
> cool if the
> > > > > > system supported more flexibility in that regard but that is
> more nuanced
> > > > > > and a preferred leader blacklist may not be the best first
> approach to the
> > > > > > issue
> > > > > >
> > > > > > Adding a new config which can fundamentally change the way
> replication is
> > > > > > done is complex, both for the system (the replication code is
> complex
> > > > > > enough) and the user. Users would have another potential config
> that could
> > > > > > backfire on them - e.g if left forgotten.
> > > > > >
> > > > > > Could you think of any downsides to implementing this
> functionality (or a
> > > > > > variation of it) in the kafka-reassign-partitions.sh tool?
> > > > > > One downside I can see is that we would not have it handle new
> partitions
> > > > > > created after the "blacklist operation". As a first iteration I
> think that
> > > > > > may be acceptable
> > > > > >
> > > > > > Thanks,
> > > > > > Stanislav
> > > > > >
> > > > > > On Fri, Jul 19, 2019 at 3:20 AM George Li <
> sql_consulting@yahoo.com.invalid>
> > > > > > wrote:
> > > > > >
> > > > > > >  Hi,
> > > > > > >
> > > > > > > Pinging the list for the feedbacks of this KIP-491  (
> > > > > > >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120736982
> > > > > > > )
> > > > > > >
> > > > > > >
> > > > > > > Thanks,
> > > > > > > George
> > > > > > >
> > > > > > >    On Saturday, July 13, 2019, 08:43:25 PM PDT, George Li <
> > > > > > > sql_consulting@yahoo.com.INVALID> wrote:
> > > > > > >
> > > > > > >  Hi,
> > > > > > >
> > > > > > > I have created KIP-491 (
> > > > > > >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120736982
> )
> > > > > > > for putting a broker to the preferred leader blacklist or
> deprioritized
> > > > > > > list so when determining leadership,  it's moved to the lowest
> priority for
> > > > > > > some of the listed use-cases.
> > > > > > >
> > > > > > > Please provide your comments/feedbacks.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > George
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >  ----- Forwarded Message ----- From: Jose Armando Garcia
> Sancio (JIRA) <
> > > > > > > jira@apache.org>To: "sql_consulting@yahoo.com" <
> sql_consulting@yahoo.com>Sent:
> > > > > > > Tuesday, July 9, 2019, 01:06:05 PM PDTSubject: [jira]
> [Commented]
> > > > > > > (KAFKA-8638) Preferred Leader Blacklist (deprioritized list)
> > > > > > >
> > > > > > >    [
> > > > > > >
> https://issues.apache.org/jira/browse/KAFKA-8638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881511#comment-16881511
> > > > > > > ]
> > > > > > >
> > > > > > > Jose Armando Garcia Sancio commented on KAFKA-8638:
> > > > > > > ---------------------------------------------------
> > > > > > >
> > > > > > > Thanks for feedback and clear use cases [~sql_consulting].
> > > > > > >
> > > > > > > > Preferred Leader Blacklist (deprioritized list)
> > > > > > > > -----------------------------------------------
> > > > > > > >
> > > > > > > >                Key: KAFKA-8638
> > > > > > > >                URL:
> https://issues.apache.org/jira/browse/KAFKA-8638
> > > > > > > >            Project: Kafka
> > > > > > > >          Issue Type: Improvement
> > > > > > > >          Components: config, controller, core
> > > > > > > >    Affects Versions: 1.1.1, 2.3.0, 2.2.1
> > > > > > > >            Reporter: GEORGE LI
> > > > > > > >            Assignee: GEORGE LI
> > > > > > > >            Priority: Major
> > > > > > > >
> > > > > > > > Currently, the kafka preferred leader election will pick the
> broker_id
> > > > > > > in the topic/partition replica assignments in a priority order
> when the
> > > > > > > broker is in ISR. The preferred leader is the broker id in the
> first
> > > > > > > position of replica. There are use-cases that, even the first
> broker in the
> > > > > > > replica assignment is in ISR, there is a need for it to be
> moved to the end
> > > > > > > of ordering (lowest priority) when deciding leadership during
> preferred
> > > > > > > leader election.
> > > > > > > > Let’s use topic/partition replica (1,2,3) as an example. 1
> is the
> > > > > > > preferred leader.  When preferred leadership is run, it will
> pick 1 as the
> > > > > > > leader if it's ISR, if 1 is not online and in ISR, then pick
> 2, if 2 is not
> > > > > > > in ISR, then pick 3 as the leader. There are use cases that,
> even 1 is in
> > > > > > > ISR, we would like it to be moved to the end of ordering
> (lowest priority)
> > > > > > > when deciding leadership during preferred leader election.
> Below is a list
> > > > > > > of use cases:
> > > > > > > > * (If broker_id 1 is a swapped failed host and brought up
> with last
> > > > > > > segments or latest offset without historical data (There is
> another effort
> > > > > > > on this), it's better for it to not serve leadership till it's
> caught-up.
> > > > > > > > * The cross-data center cluster has AWS instances which have
> less
> > > > > > > computing power than the on-prem bare metal machines.  We
> could put the AWS
> > > > > > > broker_ids in Preferred Leader Blacklist, so on-prem brokers
> can be elected
> > > > > > > leaders, without changing the reassignments ordering of the
> replicas.
> > > > > > > > * If the broker_id 1 is constantly losing leadership after
> some time:
> > > > > > > "Flapping". we would want to exclude 1 to be a leader unless
> all other
> > > > > > > brokers of this topic/partition are offline.  The “Flapping”
> effect was
> > > > > > > seen in the past when 2 or more brokers were bad, when they
> lost leadership
> > > > > > > constantly/quickly, the sets of partition replicas they belong
> to will see
> > > > > > > leadership constantly changing.  The ultimate solution is to
> swap these bad
> > > > > > > hosts.  But for quick mitigation, we can also put the bad
> hosts in the
> > > > > > > Preferred Leader Blacklist to move the priority of its being
> elected as
> > > > > > > leaders to the lowest.
> > > > > > > > *  If the controller is busy serving an extra load of
> metadata requests
> > > > > > > and other tasks. we would like to put the controller's leaders
> to other
> > > > > > > brokers to lower its CPU load. currently bouncing to lose
> leadership would
> > > > > > > not work for Controller, because after the bounce, the
> controller fails
> > > > > > > over to another broker.
> > > > > > > > * Avoid bouncing broker in order to lose its leadership: it
> would be
> > > > > > > good if we have a way to specify which broker should be
> excluded from
> > > > > > > serving traffic/leadership (without changing the replica
> assignment
> > > > > > > ordering by reassignments, even though that's quick), and run
> preferred
> > > > > > > leader election.  A bouncing broker will cause temporary URP,
> and sometimes
> > > > > > > other issues.  Also a bouncing of broker (e.g. broker_id 1)
> can temporarily
> > > > > > > lose all its leadership, but if another broker (e.g. broker_id
> 2) fails or
> > > > > > > gets bounced, some of its leaderships will likely failover to
> broker_id 1
> > > > > > > on a replica with 3 brokers.  If broker_id 1 is in the
> blacklist, then in
> > > > > > > such a scenario even broker_id 2 offline,  the 3rd broker can
> take
> > > > > > > leadership.
> > > > > > > > The current work-around of the above is to change the
> topic/partition's
> > > > > > > replica reassignments to move the broker_id 1 from the first
> position to
> > > > > > > the last position and run preferred leader election. e.g. (1,
> 2, 3) => (2,
> > > > > > > 3, 1). This changes the replica reassignments, and we need to
> keep track of
> > > > > > > the original one and restore if things change (e.g. controller
> fails over
> > > > > > > to another broker, the swapped empty broker caught up). That’s
> a rather
> > > > > > > tedious task.
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > This message was sent by Atlassian JIRA
> > > > > > > (v7.6.3#76005)
>


-- 
Best,
Stanislav

Re: [DISCUSS] KIP-491: Preferred Leader Deprioritized List (Temporary Blacklist)

Posted by Satish Duggana <sa...@gmail.com>.
Hi George,
Thanks for explaining the usecase for topic level preferred leader
blacklist. As I mentioned earlier, I am fine with broker level config
for now.

~Satish.


On Sat, Sep 7, 2019 at 12:29 AM George Li
<sq...@yahoo.com.invalid> wrote:
>
>  Hi,
>
> Just want to ping and bubble up the discussion of KIP-491.
>
> On a large scale of Kafka clusters with thousands of brokers in many clusters.  Frequent hardware failures are common, although the reassignments to change the preferred leaders is a workaround, it incurs unnecessary additional work than the proposed preferred leader blacklist in KIP-491, and hard to scale.
>
> I am wondering whether others using Kafka in a big scale running into same problem.
>
>
> Satish,
>
> Regarding your previous question about whether there is use-case for TopicLevel preferred leader "blacklist",  I thought about one use-case:  to improve rebalance/reassignment, the large partition will usually cause performance/stability issues, planning to change the say the New Replica will start with Leader's latest offset(this way the replica is almost instantly in the ISR and reassignment completed), and put this partition's NewReplica into Preferred Leader "Blacklist" at the Topic Level config for that partition. After sometime(retention time), this new replica has caught up and ready to serve traffic, update/remove the TopicConfig for this partition's preferred leader blacklist.
>
> I will update the KIP-491 later for this use case of Topic Level config for Preferred Leader Blacklist.
>
>
> Thanks,
> George
>
>     On Wednesday, August 7, 2019, 07:43:55 PM PDT, George Li <sq...@yahoo.com> wrote:
>
>   Hi Colin,
>
> > In your example, I think we're comparing apples and oranges.  You started by outlining a scenario where "an empty broker... comes up... [without] any > leadership[s]."  But then you criticize using reassignment to switch the order of preferred replicas because it "would not actually switch the leader > automatically."  If the empty broker doesn't have any leaderships, there is nothing to be switched, right?
>
> Let me explained in details of this particular use case example for comparing apples to apples.
>
> Let's say a healthy broker hosting 3000 partitions, and of which 1000 are the preferred leaders (leader count is 1000). There is a hardware failure (disk/memory, etc.), and kafka process crashed. We swap this host with another host but keep the same broker.id, when this new broker coming up, it has no historical data, and we manage to have the current last offsets of all partitions set in the replication-offset-checkpoint (if we don't set them, it could cause crazy ReplicaFetcher pulling of historical data from other brokers and cause cluster high latency and other instabilities), so when Kafka is brought up, it is quickly catching up as followers in the ISR.  Note, we have auto.leader.rebalance.enable  disabled, so it's not serving any traffic as leaders (leader count = 0), even there are 1000 partitions that this broker is the Preferred Leader.
>
> We need to make this broker not serving traffic for a few hours or days depending on the SLA of the topic retention requirement until after it's having enough historical data.
>
>
> * The traditional way using the reassignments to move this broker in that 1000 partitions where it's the preferred leader to the end of  assignment, this is O(N) operation. and from my experience, we can't submit all 1000 at the same time, otherwise cause higher latencies even the reassignment in this case can complete almost instantly.  After  a few hours/days whatever, this broker is ready to serve traffic,  we have to run reassignments again to restore that 1000 partitions preferred leaders for this broker: O(N) operation.  then run preferred leader election O(N) again.  So total 3 x O(N) operations.  The point is since the new empty broker is expected to be the same as the old one in terms of hosting partition/leaders, it would seem unnecessary to do reassignments (ordering of replica) during the broker catching up time.
>
>
>
> * The new feature Preferred Leader "Blacklist":  just need to put a dynamic config to indicate that this broker should be considered leader (preferred leader election or broker failover or unclean leader election) to the lowest priority. NO need to run any reassignments. After a few hours/days, when this broker is ready, remove the dynamic config, and run preferred leader election and this broker will serve traffic for that 1000 original partitions it was the preferred leader. So total  1 x O(N) operation.
>
>
> If auto.leader.rebalance.enable  is enabled,  the Preferred Leader "Blacklist" can be put it before Kafka is started to prevent this broker serving traffic.  In the traditional way of running reassignments, once the broker is up, with auto.leader.rebalance.enable  , if leadership starts going to this new empty broker, it might have to do preferred leader election after reassignments to remove its leaderships. e.g. (1,2,3) => (2,3,1) reassignment only change the ordering, 1 remains as the current leader, and needs prefer leader election to change to 2 after reassignment. so potentially one more O(N) operation.
>
> I hope the above example can show how easy to "blacklist" a broker serving leadership.  For someone managing Production Kafka cluster, it's important to react fast to certain alerts and mitigate/resolve some issues. As I listed the other use cases in KIP-291, I think this feature can make the Kafka product more easier to manage/operate.
>
> > In general, using an external rebalancing tool like Cruise Control is a good idea to keep things balanced without having deal with manual rebalancing.  > We expect more and more people who have a complex or large cluster will start using tools like this.
> >
> > However, if you choose to do manual rebalancing, it shouldn't be that bad.  You would save the existing partition ordering before making your changes, then> make your changes (perhaps by running a simple command line tool that switches the order of the replicas).  Then, once you felt like the broker was ready to> serve traffic, you could just re-apply the old ordering which you had saved.
>
>
> We do have our own rebalancing tool which has its own criteria like Rack diversity,  disk usage,  spread partitions/leaders across all brokers in the cluster per topic, leadership Bytes/BytesIn served per broker, etc.  We can run reassignments. The point is whether it's really necessary, and if there is more effective, easier, safer way to do it.
>
> take another use case example of taking leadership out of busy Controller to give it more power to serve metadata requests and other work. The controller can failover, with the preferred leader "blacklist",  it does not have to run reassignments again when controller failover, just change the blacklisted broker_id.
>
>
> > I was thinking about a PlacementPolicy filling the role of preventing people from creating single-replica partitions on a node that we didn't want to > ever be the leader.  I thought that it could also prevent people from designating those nodes as preferred leaders during topic creation, or Kafka from doing> itduring random topic creation.  I was assuming that the PlacementPolicy would determine which nodes were which through static configuration keys.  I agree> static configuration keys are somewhat less flexible than dynamic configuration.
>
>
> I think single-replica partition might not be a good example.  There should not be any single-replica partition at all. If yes. it's probably because of trying to save disk space with less replicas.  I think at least minimum 2. The user purposely creating single-replica partition will take full responsibilities of data loss and unavailability when a broker fails or under maintenance.
>
>
> I think it would be better to use dynamic instead of static config.  I also think it would be better to have topic creation Policy enforced in Kafka server OR an external service. We have an external/central service managing topic creation/partition expansion which takes into account of rack-diversity, replication factor (2, 3 or 4 depending on cluster/topic type), Policy replicating the topic between kafka clusters, etc.
>
>
>
> Thanks,
> George
>
>
>     On Wednesday, August 7, 2019, 05:41:28 PM PDT, Colin McCabe <cm...@apache.org> wrote:
>
>  On Wed, Aug 7, 2019, at 12:48, George Li wrote:
> >  Hi Colin,
> >
> > Thanks for your feedbacks.  Comments below:
> > > Even if you have a way of blacklisting an entire broker all at once, you still would need to run a leader election > for each partition where you want to move the leader off of the blacklisted broker.  So the operation is still O(N) in > that sense-- you have to do something per partition.
> >
> > For a failed broker and swapped with an empty broker, when it comes up,
> > it will not have any leadership, and we would like it to remain not
> > having leaderships for a couple of hours or days. So there is no
> > preferred leader election needed which incurs O(N) operation in this
> > case.  Putting the preferred leader blacklist would safe guard this
> > broker serving traffic during that time. otherwise, if another broker
> > fails(if this broker is the 1st, 2nd in the assignment), or someone
> > runs preferred leader election, this new "empty" broker can still get
> > leaderships.
> >
> > Also running reassignment to change the ordering of preferred leader
> > would not actually switch the leader automatically.  e.g.  (1,2,3) =>
> > (2,3,1). unless preferred leader election is run to switch current
> > leader from 1 to 2.  So the operation is at least 2 x O(N).  and then
> > after the broker is back to normal, another 2 x O(N) to rollback.
>
> Hi George,
>
> Hmm.  I guess I'm still on the fence about this feature.
>
> In your example, I think we're comparing apples and oranges.  You started by outlining a scenario where "an empty broker... comes up... [without] any leadership[s]."  But then you criticize using reassignment to switch the order of preferred replicas because it "would not actually switch the leader automatically."  If the empty broker doesn't have any leaderships, there is nothing to be switched, right?
>
> >
> >
> > > In general, reassignment will get a lot easier and quicker once KIP-455 is implemented.  > Reassignments that just change the order of preferred replicas for a specific partition should complete pretty much instantly.
> > >> I think it's simpler and easier just to have one source of truth for what the preferred replica is for a partition, rather than two.  So for> me, the fact that the replica assignment ordering isn't changed is actually a big disadvantage of this KIP.  If you are a new user (or just>  an existing user that didn't read all of the documentation) and you just look at the replica assignment, you might be confused by why> a particular broker wasn't getting any leaderships, even  though it appeared like it should.  More mechanisms mean more complexity> for users and developers most of the time.
> >
> >
> > I would like stress the point that running reassignment to change the
> > ordering of the replica (putting a broker to the end of partition
> > assignment) is unnecessary, because after some time the broker is
> > caught up, it can start serving traffic and then need to run
> > reassignments again to "rollback" to previous states. As I mentioned in
> > KIP-491, this is just tedious work.
>
> In general, using an external rebalancing tool like Cruise Control is a good idea to keep things balanced without having deal with manual rebalancing.  We expect more and more people who have a complex or large cluster will start using tools like this.
>
> However, if you choose to do manual rebalancing, it shouldn't be that bad.  You would save the existing partition ordering before making your changes, then make your changes (perhaps by running a simple command line tool that switches the order of the replicas).  Then, once you felt like the broker was ready to serve traffic, you could just re-apply the old ordering which you had saved.
>
> >
> > I agree this might introduce some complexities for users/developers.
> > But if this feature is good, and well documented, it is good for the
> > kafka product/community.  Just like KIP-460 enabling unclean leader
> > election to override TopicLevel/Broker Level config of
> > `unclean.leader.election.enable`
> >
> > > I agree that it would be nice if we could treat some brokers differently for the purposes of placing replicas, selecting leaders, etc. > Right now, we don't have any way of implementing that without forking the broker.  I would support a new PlacementPolicy class that> would close this gap.  But I don't think this KIP is flexible enough to fill this role.  For example, it can't prevent users from creating> new single-replica topics that get put on the "bad" replica.  Perhaps we should reopen the discussion> about https://cwiki.apache.org/confluence/display/KAFKA/KIP-201%3A+Rationalising+Policy+interfaces
> >
> > Creating topic with single-replica is beyond what KIP-491 is trying to
> > achieve.  The user needs to take responsibility of doing that. I do see
> > some Samza clients notoriously creating single-replica topics and that
> > got flagged by alerts, because a single broker down/maintenance will
> > cause offline partitions. For KIP-491 preferred leader "blacklist",
> > the single-replica will still serve as leaders, because there is no
> > other alternative replica to be chosen as leader.
> >
> > Even with a new PlacementPolicy for topic creation/partition expansion,
> > it still needs the blacklist info (e.g. a zk path node, or broker
> > level/topic level config) to "blacklist" the broker to be preferred
> > leader? Would it be the same as KIP-491 is introducing?
>
> I was thinking about a PlacementPolicy filling the role of preventing people from creating single-replica partitions on a node that we didn't want to ever be the leader.  I thought that it could also prevent people from designating those nodes as preferred leaders during topic creation, or Kafka from doing itduring random topic creation.  I was assuming that the PlacementPolicy would determine which nodes were which through static configuration keys.  I agree static configuration keys are somewhat less flexible than dynamic configuration.
>
> best,
> Colin
>
>
> >
> >
> > Thanks,
> > George
> >
> >    On Wednesday, August 7, 2019, 11:01:51 AM PDT, Colin McCabe
> > <cm...@apache.org> wrote:
> >
> >  On Fri, Aug 2, 2019, at 20:02, George Li wrote:
> > >  Hi Colin,
> > > Thanks for looking into this KIP.  Sorry for the late response. been busy.
> > >
> > > If a cluster has MAMY topic partitions, moving this "blacklist" broker
> > > to the end of replica list is still a rather "big" operation, involving
> > > submitting reassignments.  The KIP-491 way of blacklist is much
> > > simpler/easier and can undo easily without changing the replica
> > > assignment ordering.
> >
> > Hi George,
> >
> > Even if you have a way of blacklisting an entire broker all at once,
> > you still would need to run a leader election for each partition where
> > you want to move the leader off of the blacklisted broker.  So the
> > operation is still O(N) in that sense-- you have to do something per
> > partition.
> >
> > In general, reassignment will get a lot easier and quicker once KIP-455
> > is implemented.  Reassignments that just change the order of preferred
> > replicas for a specific partition should complete pretty much instantly.
> >
> > I think it's simpler and easier just to have one source of truth for
> > what the preferred replica is for a partition, rather than two.  So for
> > me, the fact that the replica assignment ordering isn't changed is
> > actually a big disadvantage of this KIP.  If you are a new user (or
> > just an existing user that didn't read all of the documentation) and
> > you just look at the replica assignment, you might be confused by why a
> > particular broker wasn't getting any leaderships, even  though it
> > appeared like it should.  More mechanisms mean more complexity for
> > users and developers most of the time.
> >
> > > Major use case for me, a failed broker got swapped with new hardware,
> > > and starts up as empty (with latest offset of all partitions), the SLA
> > > of retention is 1 day, so before this broker is up to be in-sync for 1
> > > day, we would like to blacklist this broker from serving traffic. after
> > > 1 day, the blacklist is removed and run preferred leader election.
> > > This way, no need to run reassignments before/after.  This is the
> > > "temporary" use-case.
> >
> > What if we just add an option to the reassignment tool to generate a
> > plan to move all the leaders off of a specific broker?  The tool could
> > also run a leader election as well.  That would be a simple way of
> > doing this without adding new mechanisms or broker-side configurations,
> > etc.
> >
> > >
> > > There are use-cases that this Preferred Leader "blacklist" can be
> > > somewhat permanent, as I explained in the AWS data center instances Vs.
> > > on-premises data center bare metal machines (heterogenous hardware),
> > > that the AWS broker_ids will be blacklisted.  So new topics created,
> > > or existing topic expansion would not make them serve traffic even they
> > > could be the preferred leader.
> >
> > I agree that it would be nice if we could treat some brokers
> > differently for the purposes of placing replicas, selecting leaders,
> > etc.  Right now, we don't have any way of implementing that without
> > forking the broker.  I would support a new PlacementPolicy class that
> > would close this gap.  But I don't think this KIP is flexible enough to
> > fill this role.  For example, it can't prevent users from creating new
> > single-replica topics that get put on the "bad" replica.  Perhaps we
> > should reopen the discussion about
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-201%3A+Rationalising+Policy+interfaces
> >
> > regards,
> > Colin
> >
> > >
> > > Please let me know there are more question.
> > >
> > >
> > > Thanks,
> > > George
> > >
> > >    On Thursday, July 25, 2019, 08:38:28 AM PDT, Colin McCabe
> > > <cm...@apache.org> wrote:
> > >
> > >  We still want to give the "blacklisted" broker the leadership if
> > > nobody else is available.  Therefore, isn't putting a broker on the
> > > blacklist pretty much the same as moving it to the last entry in the
> > > replicas list and then triggering a preferred leader election?
> > >
> > > If we want this to be undone after a certain amount of time, or under
> > > certain conditions, that seems like something that would be more
> > > effectively done by an external system, rather than putting all these
> > > policies into Kafka.
> > >
> > > best,
> > > Colin
> > >
> > >
> > > On Fri, Jul 19, 2019, at 18:23, George Li wrote:
> > > >  Hi Satish,
> > > > Thanks for the reviews and feedbacks.
> > > >
> > > > > > The following is the requirements this KIP is trying to accomplish:
> > > > > This can be moved to the"Proposed changes" section.
> > > >
> > > > Updated the KIP-491.
> > > >
> > > > > >>The logic to determine the priority/order of which broker should be
> > > > > preferred leader should be modified.  The broker in the preferred leader
> > > > > blacklist should be moved to the end (lowest priority) when
> > > > > determining leadership.
> > > > >
> > > > > I believe there is no change required in the ordering of the preferred
> > > > > replica list. Brokers in the preferred leader blacklist are skipped
> > > > > until other brokers int he list are unavailable.
> > > >
> > > > Yes. partition assignment remained the same, replica & ordering. The
> > > > blacklist logic can be optimized during implementation.
> > > >
> > > > > >>The blacklist can be at the broker level. However, there might be use cases
> > > > > where a specific topic should blacklist particular brokers, which
> > > > > would be at the
> > > > > Topic level Config. For this use cases of this KIP, it seems that broker level
> > > > > blacklist would suffice.  Topic level preferred leader blacklist might
> > > > > be future enhancement work.
> > > > >
> > > > > I agree that the broker level preferred leader blacklist would be
> > > > > sufficient. Do you have any use cases which require topic level
> > > > > preferred blacklist?
> > > >
> > > >
> > > >
> > > > I don't have any concrete use cases for Topic level preferred leader
> > > > blacklist.  One scenarios I can think of is when a broker has high CPU
> > > > usage, trying to identify the big topics (High MsgIn, High BytesIn,
> > > > etc), then try to move the leaders away from this broker,  before doing
> > > > an actual reassignment to change its preferred leader,  try to put this
> > > > preferred_leader_blacklist in the Topic Level config, and run preferred
> > > > leader election, and see whether CPU decreases for this broker,  if
> > > > yes, then do the reassignments to change the preferred leaders to be
> > > > "permanent" (the topic may have many partitions like 256 that has quite
> > > > a few of them having this broker as preferred leader).  So this Topic
> > > > Level config is an easy way of doing trial and check the result.
> > > >
> > > >
> > > > > You can add the below workaround as an item in the rejected alternatives section
> > > > > "Reassigning all the topic/partitions which the intended broker is a
> > > > > replica for."
> > > >
> > > > Updated the KIP-491.
> > > >
> > > >
> > > >
> > > > Thanks,
> > > > George
> > > >
> > > >    On Friday, July 19, 2019, 08:20:22 AM PDT, Satish Duggana
> > > > <sa...@gmail.com> wrote:
> > > >
> > > >  Thanks for the KIP. I have put my comments below.
> > > >
> > > > This is a nice improvement to avoid cumbersome maintenance.
> > > >
> > > > >> The following is the requirements this KIP is trying to accomplish:
> > > >   The ability to add and remove the preferred leader deprioritized
> > > > list/blacklist. e.g. new ZK path/node or new dynamic config.
> > > >
> > > > This can be moved to the"Proposed changes" section.
> > > >
> > > > >>The logic to determine the priority/order of which broker should be
> > > > preferred leader should be modified.  The broker in the preferred leader
> > > > blacklist should be moved to the end (lowest priority) when
> > > > determining leadership.
> > > >
> > > > I believe there is no change required in the ordering of the preferred
> > > > replica list. Brokers in the preferred leader blacklist are skipped
> > > > until other brokers int he list are unavailable.
> > > >
> > > > >>The blacklist can be at the broker level. However, there might be use cases
> > > > where a specific topic should blacklist particular brokers, which
> > > > would be at the
> > > > Topic level Config. For this use cases of this KIP, it seems that broker level
> > > > blacklist would suffice.  Topic level preferred leader blacklist might
> > > > be future enhancement work.
> > > >
> > > > I agree that the broker level preferred leader blacklist would be
> > > > sufficient. Do you have any use cases which require topic level
> > > > preferred blacklist?
> > > >
> > > > You can add the below workaround as an item in the rejected alternatives section
> > > > "Reassigning all the topic/partitions which the intended broker is a
> > > > replica for."
> > > >
> > > > Thanks,
> > > > Satish.
> > > >
> > > > On Fri, Jul 19, 2019 at 7:33 AM Stanislav Kozlovski
> > > > <st...@confluent.io> wrote:
> > > > >
> > > > > Hey George,
> > > > >
> > > > > Thanks for the KIP, it's an interesting idea.
> > > > >
> > > > > I was wondering whether we could achieve the same thing via the
> > > > > kafka-reassign-partitions tool. As you had also said in the JIRA,  it is
> > > > > true that this is currently very tedious with the tool. My thoughts are
> > > > > that we could improve the tool and give it the notion of a "blacklisted
> > > > > preferred leader".
> > > > > This would have some benefits like:
> > > > > - more fine-grained control over the blacklist. we may not want to
> > > > > blacklist all the preferred leaders, as that would make the blacklisted
> > > > > broker a follower of last resort which is not very useful. In the cases of
> > > > > an underpowered AWS machine or a controller, you might overshoot and make
> > > > > the broker very underutilized if you completely make it leaderless.
> > > > > - is not permanent. If we are to have a blacklist leaders config,
> > > > > rebalancing tools would also need to know about it and manipulate/respect
> > > > > it to achieve a fair balance.
> > > > > It seems like both problems are tied to balancing partitions, it's just
> > > > > that KIP-491's use case wants to balance them against other factors in a
> > > > > more nuanced way. It makes sense to have both be done from the same place
> > > > >
> > > > > To make note of the motivation section:
> > > > > > Avoid bouncing broker in order to lose its leadership
> > > > > The recommended way to make a broker lose its leadership is to run a
> > > > > reassignment on its partitions
> > > > > > The cross-data center cluster has AWS cloud instances which have less
> > > > > computing power
> > > > > We recommend running Kafka on homogeneous machines. It would be cool if the
> > > > > system supported more flexibility in that regard but that is more nuanced
> > > > > and a preferred leader blacklist may not be the best first approach to the
> > > > > issue
> > > > >
> > > > > Adding a new config which can fundamentally change the way replication is
> > > > > done is complex, both for the system (the replication code is complex
> > > > > enough) and the user. Users would have another potential config that could
> > > > > backfire on them - e.g if left forgotten.
> > > > >
> > > > > Could you think of any downsides to implementing this functionality (or a
> > > > > variation of it) in the kafka-reassign-partitions.sh tool?
> > > > > One downside I can see is that we would not have it handle new partitions
> > > > > created after the "blacklist operation". As a first iteration I think that
> > > > > may be acceptable
> > > > >
> > > > > Thanks,
> > > > > Stanislav
> > > > >
> > > > > On Fri, Jul 19, 2019 at 3:20 AM George Li <sq...@yahoo.com.invalid>
> > > > > wrote:
> > > > >
> > > > > >  Hi,
> > > > > >
> > > > > > Pinging the list for the feedbacks of this KIP-491  (
> > > > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120736982
> > > > > > )
> > > > > >
> > > > > >
> > > > > > Thanks,
> > > > > > George
> > > > > >
> > > > > >    On Saturday, July 13, 2019, 08:43:25 PM PDT, George Li <
> > > > > > sql_consulting@yahoo.com.INVALID> wrote:
> > > > > >
> > > > > >  Hi,
> > > > > >
> > > > > > I have created KIP-491 (
> > > > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120736982)
> > > > > > for putting a broker to the preferred leader blacklist or deprioritized
> > > > > > list so when determining leadership,  it's moved to the lowest priority for
> > > > > > some of the listed use-cases.
> > > > > >
> > > > > > Please provide your comments/feedbacks.
> > > > > >
> > > > > > Thanks,
> > > > > > George
> > > > > >
> > > > > >
> > > > > >
> > > > > >  ----- Forwarded Message ----- From: Jose Armando Garcia Sancio (JIRA) <
> > > > > > jira@apache.org>To: "sql_consulting@yahoo.com" <sq...@yahoo.com>Sent:
> > > > > > Tuesday, July 9, 2019, 01:06:05 PM PDTSubject: [jira] [Commented]
> > > > > > (KAFKA-8638) Preferred Leader Blacklist (deprioritized list)
> > > > > >
> > > > > >    [
> > > > > > https://issues.apache.org/jira/browse/KAFKA-8638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881511#comment-16881511
> > > > > > ]
> > > > > >
> > > > > > Jose Armando Garcia Sancio commented on KAFKA-8638:
> > > > > > ---------------------------------------------------
> > > > > >
> > > > > > Thanks for feedback and clear use cases [~sql_consulting].
> > > > > >
> > > > > > > Preferred Leader Blacklist (deprioritized list)
> > > > > > > -----------------------------------------------
> > > > > > >
> > > > > > >                Key: KAFKA-8638
> > > > > > >                URL: https://issues.apache.org/jira/browse/KAFKA-8638
> > > > > > >            Project: Kafka
> > > > > > >          Issue Type: Improvement
> > > > > > >          Components: config, controller, core
> > > > > > >    Affects Versions: 1.1.1, 2.3.0, 2.2.1
> > > > > > >            Reporter: GEORGE LI
> > > > > > >            Assignee: GEORGE LI
> > > > > > >            Priority: Major
> > > > > > >
> > > > > > > Currently, the kafka preferred leader election will pick the broker_id
> > > > > > in the topic/partition replica assignments in a priority order when the
> > > > > > broker is in ISR. The preferred leader is the broker id in the first
> > > > > > position of replica. There are use-cases that, even the first broker in the
> > > > > > replica assignment is in ISR, there is a need for it to be moved to the end
> > > > > > of ordering (lowest priority) when deciding leadership during  preferred
> > > > > > leader election.
> > > > > > > Let’s use topic/partition replica (1,2,3) as an example. 1 is the
> > > > > > preferred leader.  When preferred leadership is run, it will pick 1 as the
> > > > > > leader if it's ISR, if 1 is not online and in ISR, then pick 2, if 2 is not
> > > > > > in ISR, then pick 3 as the leader. There are use cases that, even 1 is in
> > > > > > ISR, we would like it to be moved to the end of ordering (lowest priority)
> > > > > > when deciding leadership during preferred leader election.  Below is a list
> > > > > > of use cases:
> > > > > > > * (If broker_id 1 is a swapped failed host and brought up with last
> > > > > > segments or latest offset without historical data (There is another effort
> > > > > > on this), it's better for it to not serve leadership till it's caught-up.
> > > > > > > * The cross-data center cluster has AWS instances which have less
> > > > > > computing power than the on-prem bare metal machines.  We could put the AWS
> > > > > > broker_ids in Preferred Leader Blacklist, so on-prem brokers can be elected
> > > > > > leaders, without changing the reassignments ordering of the replicas.
> > > > > > > * If the broker_id 1 is constantly losing leadership after some time:
> > > > > > "Flapping". we would want to exclude 1 to be a leader unless all other
> > > > > > brokers of this topic/partition are offline.  The “Flapping” effect was
> > > > > > seen in the past when 2 or more brokers were bad, when they lost leadership
> > > > > > constantly/quickly, the sets of partition replicas they belong to will see
> > > > > > leadership constantly changing.  The ultimate solution is to swap these bad
> > > > > > hosts.  But for quick mitigation, we can also put the bad hosts in the
> > > > > > Preferred Leader Blacklist to move the priority of its being elected as
> > > > > > leaders to the lowest.
> > > > > > > *  If the controller is busy serving an extra load of metadata requests
> > > > > > and other tasks. we would like to put the controller's leaders to other
> > > > > > brokers to lower its CPU load. currently bouncing to lose leadership would
> > > > > > not work for Controller, because after the bounce, the controller fails
> > > > > > over to another broker.
> > > > > > > * Avoid bouncing broker in order to lose its leadership: it would be
> > > > > > good if we have a way to specify which broker should be excluded from
> > > > > > serving traffic/leadership (without changing the replica assignment
> > > > > > ordering by reassignments, even though that's quick), and run preferred
> > > > > > leader election.  A bouncing broker will cause temporary URP, and sometimes
> > > > > > other issues.  Also a bouncing of broker (e.g. broker_id 1) can temporarily
> > > > > > lose all its leadership, but if another broker (e.g. broker_id 2) fails or
> > > > > > gets bounced, some of its leaderships will likely failover to broker_id 1
> > > > > > on a replica with 3 brokers.  If broker_id 1 is in the blacklist, then in
> > > > > > such a scenario even broker_id 2 offline,  the 3rd broker can take
> > > > > > leadership.
> > > > > > > The current work-around of the above is to change the topic/partition's
> > > > > > replica reassignments to move the broker_id 1 from the first position to
> > > > > > the last position and run preferred leader election. e.g. (1, 2, 3) => (2,
> > > > > > 3, 1). This changes the replica reassignments, and we need to keep track of
> > > > > > the original one and restore if things change (e.g. controller fails over
> > > > > > to another broker, the swapped empty broker caught up). That’s a rather
> > > > > > tedious task.
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > This message was sent by Atlassian JIRA
> > > > > > (v7.6.3#76005)

Re: [DISCUSS] KIP-491: Preferred Leader Deprioritized List (Temporary Blacklist)

Posted by George Li <sq...@yahoo.com.INVALID>.
 Hi, 

Just want to ping and bubble up the discussion of KIP-491. 

On a large scale of Kafka clusters with thousands of brokers in many clusters.  Frequent hardware failures are common, although the reassignments to change the preferred leaders is a workaround, it incurs unnecessary additional work than the proposed preferred leader blacklist in KIP-491, and hard to scale. 

I am wondering whether others using Kafka in a big scale running into same problem. 


Satish,  

Regarding your previous question about whether there is use-case for TopicLevel preferred leader "blacklist",  I thought about one use-case:  to improve rebalance/reassignment, the large partition will usually cause performance/stability issues, planning to change the say the New Replica will start with Leader's latest offset(this way the replica is almost instantly in the ISR and reassignment completed), and put this partition's NewReplica into Preferred Leader "Blacklist" at the Topic Level config for that partition. After sometime(retention time), this new replica has caught up and ready to serve traffic, update/remove the TopicConfig for this partition's preferred leader blacklist. 

I will update the KIP-491 later for this use case of Topic Level config for Preferred Leader Blacklist.


Thanks,
George
 
    On Wednesday, August 7, 2019, 07:43:55 PM PDT, George Li <sq...@yahoo.com> wrote:  
 
  Hi Colin,

> In your example, I think we're comparing apples and oranges.  You started by outlining a scenario where "an empty broker... comes up... [without] any > leadership[s]."  But then you criticize using reassignment to switch the order of preferred replicas because it "would not actually switch the leader > automatically."  If the empty broker doesn't have any leaderships, there is nothing to be switched, right?

Let me explained in details of this particular use case example for comparing apples to apples. 

Let's say a healthy broker hosting 3000 partitions, and of which 1000 are the preferred leaders (leader count is 1000). There is a hardware failure (disk/memory, etc.), and kafka process crashed. We swap this host with another host but keep the same broker.id, when this new broker coming up, it has no historical data, and we manage to have the current last offsets of all partitions set in the replication-offset-checkpoint (if we don't set them, it could cause crazy ReplicaFetcher pulling of historical data from other brokers and cause cluster high latency and other instabilities), so when Kafka is brought up, it is quickly catching up as followers in the ISR.  Note, we have auto.leader.rebalance.enable  disabled, so it's not serving any traffic as leaders (leader count = 0), even there are 1000 partitions that this broker is the Preferred Leader. 

We need to make this broker not serving traffic for a few hours or days depending on the SLA of the topic retention requirement until after it's having enough historical data. 


* The traditional way using the reassignments to move this broker in that 1000 partitions where it's the preferred leader to the end of  assignment, this is O(N) operation. and from my experience, we can't submit all 1000 at the same time, otherwise cause higher latencies even the reassignment in this case can complete almost instantly.  After  a few hours/days whatever, this broker is ready to serve traffic,  we have to run reassignments again to restore that 1000 partitions preferred leaders for this broker: O(N) operation.  then run preferred leader election O(N) again.  So total 3 x O(N) operations.  The point is since the new empty broker is expected to be the same as the old one in terms of hosting partition/leaders, it would seem unnecessary to do reassignments (ordering of replica) during the broker catching up time. 



* The new feature Preferred Leader "Blacklist":  just need to put a dynamic config to indicate that this broker should be considered leader (preferred leader election or broker failover or unclean leader election) to the lowest priority. NO need to run any reassignments. After a few hours/days, when this broker is ready, remove the dynamic config, and run preferred leader election and this broker will serve traffic for that 1000 original partitions it was the preferred leader. So total  1 x O(N) operation. 


If auto.leader.rebalance.enable  is enabled,  the Preferred Leader "Blacklist" can be put it before Kafka is started to prevent this broker serving traffic.  In the traditional way of running reassignments, once the broker is up, with auto.leader.rebalance.enable  , if leadership starts going to this new empty broker, it might have to do preferred leader election after reassignments to remove its leaderships. e.g. (1,2,3) => (2,3,1) reassignment only change the ordering, 1 remains as the current leader, and needs prefer leader election to change to 2 after reassignment. so potentially one more O(N) operation. 

I hope the above example can show how easy to "blacklist" a broker serving leadership.  For someone managing Production Kafka cluster, it's important to react fast to certain alerts and mitigate/resolve some issues. As I listed the other use cases in KIP-291, I think this feature can make the Kafka product more easier to manage/operate. 

> In general, using an external rebalancing tool like Cruise Control is a good idea to keep things balanced without having deal with manual rebalancing.  > We expect more and more people who have a complex or large cluster will start using tools like this.
> 
> However, if you choose to do manual rebalancing, it shouldn't be that bad.  You would save the existing partition ordering before making your changes, then> make your changes (perhaps by running a simple command line tool that switches the order of the replicas).  Then, once you felt like the broker was ready to> serve traffic, you could just re-apply the old ordering which you had saved.


We do have our own rebalancing tool which has its own criteria like Rack diversity,  disk usage,  spread partitions/leaders across all brokers in the cluster per topic, leadership Bytes/BytesIn served per broker, etc.  We can run reassignments. The point is whether it's really necessary, and if there is more effective, easier, safer way to do it.    

take another use case example of taking leadership out of busy Controller to give it more power to serve metadata requests and other work. The controller can failover, with the preferred leader "blacklist",  it does not have to run reassignments again when controller failover, just change the blacklisted broker_id. 


> I was thinking about a PlacementPolicy filling the role of preventing people from creating single-replica partitions on a node that we didn't want to > ever be the leader.  I thought that it could also prevent people from designating those nodes as preferred leaders during topic creation, or Kafka from doing> itduring random topic creation.  I was assuming that the PlacementPolicy would determine which nodes were which through static configuration keys.  I agree> static configuration keys are somewhat less flexible than dynamic configuration.


I think single-replica partition might not be a good example.  There should not be any single-replica partition at all. If yes. it's probably because of trying to save disk space with less replicas.  I think at least minimum 2. The user purposely creating single-replica partition will take full responsibilities of data loss and unavailability when a broker fails or under maintenance. 


I think it would be better to use dynamic instead of static config.  I also think it would be better to have topic creation Policy enforced in Kafka server OR an external service. We have an external/central service managing topic creation/partition expansion which takes into account of rack-diversity, replication factor (2, 3 or 4 depending on cluster/topic type), Policy replicating the topic between kafka clusters, etc.  



Thanks,
George


    On Wednesday, August 7, 2019, 05:41:28 PM PDT, Colin McCabe <cm...@apache.org> wrote:  
 
 On Wed, Aug 7, 2019, at 12:48, George Li wrote:
>  Hi Colin,
> 
> Thanks for your feedbacks.  Comments below:
> > Even if you have a way of blacklisting an entire broker all at once, you still would need to run a leader election > for each partition where you want to move the leader off of the blacklisted broker.  So the operation is still O(N) in > that sense-- you have to do something per partition.
> 
> For a failed broker and swapped with an empty broker, when it comes up, 
> it will not have any leadership, and we would like it to remain not 
> having leaderships for a couple of hours or days. So there is no 
> preferred leader election needed which incurs O(N) operation in this 
> case.  Putting the preferred leader blacklist would safe guard this 
> broker serving traffic during that time. otherwise, if another broker 
> fails(if this broker is the 1st, 2nd in the assignment), or someone 
> runs preferred leader election, this new "empty" broker can still get 
> leaderships. 
> 
> Also running reassignment to change the ordering of preferred leader 
> would not actually switch the leader automatically.  e.g.  (1,2,3) => 
> (2,3,1). unless preferred leader election is run to switch current 
> leader from 1 to 2.  So the operation is at least 2 x O(N).  and then 
> after the broker is back to normal, another 2 x O(N) to rollback. 

Hi George,

Hmm.  I guess I'm still on the fence about this feature.

In your example, I think we're comparing apples and oranges.  You started by outlining a scenario where "an empty broker... comes up... [without] any leadership[s]."  But then you criticize using reassignment to switch the order of preferred replicas because it "would not actually switch the leader automatically."  If the empty broker doesn't have any leaderships, there is nothing to be switched, right?

> 
> 
> > In general, reassignment will get a lot easier and quicker once KIP-455 is implemented.  > Reassignments that just change the order of preferred replicas for a specific partition should complete pretty much instantly.
> >> I think it's simpler and easier just to have one source of truth for what the preferred replica is for a partition, rather than two.  So for> me, the fact that the replica assignment ordering isn't changed is actually a big disadvantage of this KIP.  If you are a new user (or just>  an existing user that didn't read all of the documentation) and you just look at the replica assignment, you might be confused by why> a particular broker wasn't getting any leaderships, even  though it appeared like it should.  More mechanisms mean more complexity> for users and developers most of the time.
> 
> 
> I would like stress the point that running reassignment to change the 
> ordering of the replica (putting a broker to the end of partition 
> assignment) is unnecessary, because after some time the broker is 
> caught up, it can start serving traffic and then need to run 
> reassignments again to "rollback" to previous states. As I mentioned in 
> KIP-491, this is just tedious work. 

In general, using an external rebalancing tool like Cruise Control is a good idea to keep things balanced without having deal with manual rebalancing.  We expect more and more people who have a complex or large cluster will start using tools like this.

However, if you choose to do manual rebalancing, it shouldn't be that bad.  You would save the existing partition ordering before making your changes, then make your changes (perhaps by running a simple command line tool that switches the order of the replicas).  Then, once you felt like the broker was ready to serve traffic, you could just re-apply the old ordering which you had saved.

> 
> I agree this might introduce some complexities for users/developers. 
> But if this feature is good, and well documented, it is good for the 
> kafka product/community.  Just like KIP-460 enabling unclean leader 
> election to override TopicLevel/Broker Level config of 
> `unclean.leader.election.enable`
> 
> > I agree that it would be nice if we could treat some brokers differently for the purposes of placing replicas, selecting leaders, etc. > Right now, we don't have any way of implementing that without forking the broker.  I would support a new PlacementPolicy class that> would close this gap.  But I don't think this KIP is flexible enough to fill this role.  For example, it can't prevent users from creating> new single-replica topics that get put on the "bad" replica.  Perhaps we should reopen the discussion> about https://cwiki.apache.org/confluence/display/KAFKA/KIP-201%3A+Rationalising+Policy+interfaces
> 
> Creating topic with single-replica is beyond what KIP-491 is trying to 
> achieve.  The user needs to take responsibility of doing that. I do see 
> some Samza clients notoriously creating single-replica topics and that 
> got flagged by alerts, because a single broker down/maintenance will 
> cause offline partitions. For KIP-491 preferred leader "blacklist",  
> the single-replica will still serve as leaders, because there is no 
> other alternative replica to be chosen as leader. 
> 
> Even with a new PlacementPolicy for topic creation/partition expansion, 
> it still needs the blacklist info (e.g. a zk path node, or broker 
> level/topic level config) to "blacklist" the broker to be preferred 
> leader? Would it be the same as KIP-491 is introducing? 

I was thinking about a PlacementPolicy filling the role of preventing people from creating single-replica partitions on a node that we didn't want to ever be the leader.  I thought that it could also prevent people from designating those nodes as preferred leaders during topic creation, or Kafka from doing itduring random topic creation.  I was assuming that the PlacementPolicy would determine which nodes were which through static configuration keys.  I agree static configuration keys are somewhat less flexible than dynamic configuration.

best,
Colin


> 
> 
> Thanks,
> George
> 
>    On Wednesday, August 7, 2019, 11:01:51 AM PDT, Colin McCabe 
> <cm...@apache.org> wrote:  
>  
>  On Fri, Aug 2, 2019, at 20:02, George Li wrote:
> >  Hi Colin,
> > Thanks for looking into this KIP.  Sorry for the late response. been busy. 
> > 
> > If a cluster has MAMY topic partitions, moving this "blacklist" broker 
> > to the end of replica list is still a rather "big" operation, involving 
> > submitting reassignments.  The KIP-491 way of blacklist is much 
> > simpler/easier and can undo easily without changing the replica 
> > assignment ordering. 
> 
> Hi George,
> 
> Even if you have a way of blacklisting an entire broker all at once, 
> you still would need to run a leader election for each partition where 
> you want to move the leader off of the blacklisted broker.  So the 
> operation is still O(N) in that sense-- you have to do something per 
> partition.
> 
> In general, reassignment will get a lot easier and quicker once KIP-455 
> is implemented.  Reassignments that just change the order of preferred 
> replicas for a specific partition should complete pretty much instantly.
> 
> I think it's simpler and easier just to have one source of truth for 
> what the preferred replica is for a partition, rather than two.  So for 
> me, the fact that the replica assignment ordering isn't changed is 
> actually a big disadvantage of this KIP.  If you are a new user (or 
> just an existing user that didn't read all of the documentation) and 
> you just look at the replica assignment, you might be confused by why a 
> particular broker wasn't getting any leaderships, even  though it 
> appeared like it should.  More mechanisms mean more complexity for 
> users and developers most of the time.
> 
> > Major use case for me, a failed broker got swapped with new hardware, 
> > and starts up as empty (with latest offset of all partitions), the SLA 
> > of retention is 1 day, so before this broker is up to be in-sync for 1 
> > day, we would like to blacklist this broker from serving traffic. after 
> > 1 day, the blacklist is removed and run preferred leader election.  
> > This way, no need to run reassignments before/after.  This is the 
> > "temporary" use-case.
> 
> What if we just add an option to the reassignment tool to generate a 
> plan to move all the leaders off of a specific broker?  The tool could 
> also run a leader election as well.  That would be a simple way of 
> doing this without adding new mechanisms or broker-side configurations, 
> etc.
> 
> > 
> > There are use-cases that this Preferred Leader "blacklist" can be 
> > somewhat permanent, as I explained in the AWS data center instances Vs. 
> > on-premises data center bare metal machines (heterogenous hardware), 
> > that the AWS broker_ids will be blacklisted.  So new topics created,  
> > or existing topic expansion would not make them serve traffic even they 
> > could be the preferred leader. 
> 
> I agree that it would be nice if we could treat some brokers 
> differently for the purposes of placing replicas, selecting leaders, 
> etc.  Right now, we don't have any way of implementing that without 
> forking the broker.  I would support a new PlacementPolicy class that 
> would close this gap.  But I don't think this KIP is flexible enough to 
> fill this role.  For example, it can't prevent users from creating new 
> single-replica topics that get put on the "bad" replica.  Perhaps we 
> should reopen the discussion about 
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-201%3A+Rationalising+Policy+interfaces
> 
> regards,
> Colin
> 
> > 
> > Please let me know there are more question. 
> > 
> > 
> > Thanks,
> > George
> > 
> >    On Thursday, July 25, 2019, 08:38:28 AM PDT, Colin McCabe 
> > <cm...@apache.org> wrote:  
> >  
> >  We still want to give the "blacklisted" broker the leadership if 
> > nobody else is available.  Therefore, isn't putting a broker on the 
> > blacklist pretty much the same as moving it to the last entry in the 
> > replicas list and then triggering a preferred leader election?
> > 
> > If we want this to be undone after a certain amount of time, or under 
> > certain conditions, that seems like something that would be more 
> > effectively done by an external system, rather than putting all these 
> > policies into Kafka.
> > 
> > best,
> > Colin
> > 
> > 
> > On Fri, Jul 19, 2019, at 18:23, George Li wrote:
> > >  Hi Satish,
> > > Thanks for the reviews and feedbacks.
> > > 
> > > > > The following is the requirements this KIP is trying to accomplish:
> > > > This can be moved to the"Proposed changes" section.
> > > 
> > > Updated the KIP-491. 
> > > 
> > > > >>The logic to determine the priority/order of which broker should be
> > > > preferred leader should be modified.  The broker in the preferred leader
> > > > blacklist should be moved to the end (lowest priority) when
> > > > determining leadership.
> > > >
> > > > I believe there is no change required in the ordering of the preferred
> > > > replica list. Brokers in the preferred leader blacklist are skipped
> > > > until other brokers int he list are unavailable.
> > > 
> > > Yes. partition assignment remained the same, replica & ordering. The 
> > > blacklist logic can be optimized during implementation. 
> > > 
> > > > >>The blacklist can be at the broker level. However, there might be use cases
> > > > where a specific topic should blacklist particular brokers, which
> > > > would be at the
> > > > Topic level Config. For this use cases of this KIP, it seems that broker level
> > > > blacklist would suffice.  Topic level preferred leader blacklist might
> > > > be future enhancement work.
> > > > 
> > > > I agree that the broker level preferred leader blacklist would be
> > > > sufficient. Do you have any use cases which require topic level
> > > > preferred blacklist?
> > > 
> > > 
> > > 
> > > I don't have any concrete use cases for Topic level preferred leader 
> > > blacklist.  One scenarios I can think of is when a broker has high CPU 
> > > usage, trying to identify the big topics (High MsgIn, High BytesIn, 
> > > etc), then try to move the leaders away from this broker,  before doing 
> > > an actual reassignment to change its preferred leader,  try to put this 
> > > preferred_leader_blacklist in the Topic Level config, and run preferred 
> > > leader election, and see whether CPU decreases for this broker,  if 
> > > yes, then do the reassignments to change the preferred leaders to be 
> > > "permanent" (the topic may have many partitions like 256 that has quite 
> > > a few of them having this broker as preferred leader).  So this Topic 
> > > Level config is an easy way of doing trial and check the result. 
> > > 
> > > 
> > > > You can add the below workaround as an item in the rejected alternatives section
> > > > "Reassigning all the topic/partitions which the intended broker is a
> > > > replica for."
> > > 
> > > Updated the KIP-491. 
> > > 
> > > 
> > > 
> > > Thanks, 
> > > George
> > > 
> > >    On Friday, July 19, 2019, 08:20:22 AM PDT, Satish Duggana 
> > > <sa...@gmail.com> wrote:  
> > >  
> > >  Thanks for the KIP. I have put my comments below.
> > > 
> > > This is a nice improvement to avoid cumbersome maintenance.
> > > 
> > > >> The following is the requirements this KIP is trying to accomplish:
> > >   The ability to add and remove the preferred leader deprioritized
> > > list/blacklist. e.g. new ZK path/node or new dynamic config.
> > > 
> > > This can be moved to the"Proposed changes" section.
> > > 
> > > >>The logic to determine the priority/order of which broker should be
> > > preferred leader should be modified.  The broker in the preferred leader
> > > blacklist should be moved to the end (lowest priority) when
> > > determining leadership.
> > > 
> > > I believe there is no change required in the ordering of the preferred
> > > replica list. Brokers in the preferred leader blacklist are skipped
> > > until other brokers int he list are unavailable.
> > > 
> > > >>The blacklist can be at the broker level. However, there might be use cases
> > > where a specific topic should blacklist particular brokers, which
> > > would be at the
> > > Topic level Config. For this use cases of this KIP, it seems that broker level
> > > blacklist would suffice.  Topic level preferred leader blacklist might
> > > be future enhancement work.
> > > 
> > > I agree that the broker level preferred leader blacklist would be
> > > sufficient. Do you have any use cases which require topic level
> > > preferred blacklist?
> > > 
> > > You can add the below workaround as an item in the rejected alternatives section
> > > "Reassigning all the topic/partitions which the intended broker is a
> > > replica for."
> > > 
> > > Thanks,
> > > Satish.
> > > 
> > > On Fri, Jul 19, 2019 at 7:33 AM Stanislav Kozlovski
> > > <st...@confluent.io> wrote:
> > > >
> > > > Hey George,
> > > >
> > > > Thanks for the KIP, it's an interesting idea.
> > > >
> > > > I was wondering whether we could achieve the same thing via the
> > > > kafka-reassign-partitions tool. As you had also said in the JIRA,  it is
> > > > true that this is currently very tedious with the tool. My thoughts are
> > > > that we could improve the tool and give it the notion of a "blacklisted
> > > > preferred leader".
> > > > This would have some benefits like:
> > > > - more fine-grained control over the blacklist. we may not want to
> > > > blacklist all the preferred leaders, as that would make the blacklisted
> > > > broker a follower of last resort which is not very useful. In the cases of
> > > > an underpowered AWS machine or a controller, you might overshoot and make
> > > > the broker very underutilized if you completely make it leaderless.
> > > > - is not permanent. If we are to have a blacklist leaders config,
> > > > rebalancing tools would also need to know about it and manipulate/respect
> > > > it to achieve a fair balance.
> > > > It seems like both problems are tied to balancing partitions, it's just
> > > > that KIP-491's use case wants to balance them against other factors in a
> > > > more nuanced way. It makes sense to have both be done from the same place
> > > >
> > > > To make note of the motivation section:
> > > > > Avoid bouncing broker in order to lose its leadership
> > > > The recommended way to make a broker lose its leadership is to run a
> > > > reassignment on its partitions
> > > > > The cross-data center cluster has AWS cloud instances which have less
> > > > computing power
> > > > We recommend running Kafka on homogeneous machines. It would be cool if the
> > > > system supported more flexibility in that regard but that is more nuanced
> > > > and a preferred leader blacklist may not be the best first approach to the
> > > > issue
> > > >
> > > > Adding a new config which can fundamentally change the way replication is
> > > > done is complex, both for the system (the replication code is complex
> > > > enough) and the user. Users would have another potential config that could
> > > > backfire on them - e.g if left forgotten.
> > > >
> > > > Could you think of any downsides to implementing this functionality (or a
> > > > variation of it) in the kafka-reassign-partitions.sh tool?
> > > > One downside I can see is that we would not have it handle new partitions
> > > > created after the "blacklist operation". As a first iteration I think that
> > > > may be acceptable
> > > >
> > > > Thanks,
> > > > Stanislav
> > > >
> > > > On Fri, Jul 19, 2019 at 3:20 AM George Li <sq...@yahoo.com.invalid>
> > > > wrote:
> > > >
> > > > >  Hi,
> > > > >
> > > > > Pinging the list for the feedbacks of this KIP-491  (
> > > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120736982
> > > > > )
> > > > >
> > > > >
> > > > > Thanks,
> > > > > George
> > > > >
> > > > >    On Saturday, July 13, 2019, 08:43:25 PM PDT, George Li <
> > > > > sql_consulting@yahoo.com.INVALID> wrote:
> > > > >
> > > > >  Hi,
> > > > >
> > > > > I have created KIP-491 (
> > > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120736982)
> > > > > for putting a broker to the preferred leader blacklist or deprioritized
> > > > > list so when determining leadership,  it's moved to the lowest priority for
> > > > > some of the listed use-cases.
> > > > >
> > > > > Please provide your comments/feedbacks.
> > > > >
> > > > > Thanks,
> > > > > George
> > > > >
> > > > >
> > > > >
> > > > >  ----- Forwarded Message ----- From: Jose Armando Garcia Sancio (JIRA) <
> > > > > jira@apache.org>To: "sql_consulting@yahoo.com" <sq...@yahoo.com>Sent:
> > > > > Tuesday, July 9, 2019, 01:06:05 PM PDTSubject: [jira] [Commented]
> > > > > (KAFKA-8638) Preferred Leader Blacklist (deprioritized list)
> > > > >
> > > > >    [
> > > > > https://issues.apache.org/jira/browse/KAFKA-8638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881511#comment-16881511
> > > > > ]
> > > > >
> > > > > Jose Armando Garcia Sancio commented on KAFKA-8638:
> > > > > ---------------------------------------------------
> > > > >
> > > > > Thanks for feedback and clear use cases [~sql_consulting].
> > > > >
> > > > > > Preferred Leader Blacklist (deprioritized list)
> > > > > > -----------------------------------------------
> > > > > >
> > > > > >                Key: KAFKA-8638
> > > > > >                URL: https://issues.apache.org/jira/browse/KAFKA-8638
> > > > > >            Project: Kafka
> > > > > >          Issue Type: Improvement
> > > > > >          Components: config, controller, core
> > > > > >    Affects Versions: 1.1.1, 2.3.0, 2.2.1
> > > > > >            Reporter: GEORGE LI
> > > > > >            Assignee: GEORGE LI
> > > > > >            Priority: Major
> > > > > >
> > > > > > Currently, the kafka preferred leader election will pick the broker_id
> > > > > in the topic/partition replica assignments in a priority order when the
> > > > > broker is in ISR. The preferred leader is the broker id in the first
> > > > > position of replica. There are use-cases that, even the first broker in the
> > > > > replica assignment is in ISR, there is a need for it to be moved to the end
> > > > > of ordering (lowest priority) when deciding leadership during  preferred
> > > > > leader election.
> > > > > > Let’s use topic/partition replica (1,2,3) as an example. 1 is the
> > > > > preferred leader.  When preferred leadership is run, it will pick 1 as the
> > > > > leader if it's ISR, if 1 is not online and in ISR, then pick 2, if 2 is not
> > > > > in ISR, then pick 3 as the leader. There are use cases that, even 1 is in
> > > > > ISR, we would like it to be moved to the end of ordering (lowest priority)
> > > > > when deciding leadership during preferred leader election.  Below is a list
> > > > > of use cases:
> > > > > > * (If broker_id 1 is a swapped failed host and brought up with last
> > > > > segments or latest offset without historical data (There is another effort
> > > > > on this), it's better for it to not serve leadership till it's caught-up.
> > > > > > * The cross-data center cluster has AWS instances which have less
> > > > > computing power than the on-prem bare metal machines.  We could put the AWS
> > > > > broker_ids in Preferred Leader Blacklist, so on-prem brokers can be elected
> > > > > leaders, without changing the reassignments ordering of the replicas.
> > > > > > * If the broker_id 1 is constantly losing leadership after some time:
> > > > > "Flapping". we would want to exclude 1 to be a leader unless all other
> > > > > brokers of this topic/partition are offline.  The “Flapping” effect was
> > > > > seen in the past when 2 or more brokers were bad, when they lost leadership
> > > > > constantly/quickly, the sets of partition replicas they belong to will see
> > > > > leadership constantly changing.  The ultimate solution is to swap these bad
> > > > > hosts.  But for quick mitigation, we can also put the bad hosts in the
> > > > > Preferred Leader Blacklist to move the priority of its being elected as
> > > > > leaders to the lowest.
> > > > > > *  If the controller is busy serving an extra load of metadata requests
> > > > > and other tasks. we would like to put the controller's leaders to other
> > > > > brokers to lower its CPU load. currently bouncing to lose leadership would
> > > > > not work for Controller, because after the bounce, the controller fails
> > > > > over to another broker.
> > > > > > * Avoid bouncing broker in order to lose its leadership: it would be
> > > > > good if we have a way to specify which broker should be excluded from
> > > > > serving traffic/leadership (without changing the replica assignment
> > > > > ordering by reassignments, even though that's quick), and run preferred
> > > > > leader election.  A bouncing broker will cause temporary URP, and sometimes
> > > > > other issues.  Also a bouncing of broker (e.g. broker_id 1) can temporarily
> > > > > lose all its leadership, but if another broker (e.g. broker_id 2) fails or
> > > > > gets bounced, some of its leaderships will likely failover to broker_id 1
> > > > > on a replica with 3 brokers.  If broker_id 1 is in the blacklist, then in
> > > > > such a scenario even broker_id 2 offline,  the 3rd broker can take
> > > > > leadership.
> > > > > > The current work-around of the above is to change the topic/partition's
> > > > > replica reassignments to move the broker_id 1 from the first position to
> > > > > the last position and run preferred leader election. e.g. (1, 2, 3) => (2,
> > > > > 3, 1). This changes the replica reassignments, and we need to keep track of
> > > > > the original one and restore if things change (e.g. controller fails over
> > > > > to another broker, the swapped empty broker caught up). That’s a rather
> > > > > tedious task.
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > This message was sent by Atlassian JIRA
> > > > > (v7.6.3#76005)    

Re: [DISCUSS] KIP-491: Preferred Leader Deprioritized List (Temporary Blacklist)

Posted by George Li <sq...@yahoo.com>.
 Hi Colin,

> In your example, I think we're comparing apples and oranges.  You started by outlining a scenario where "an empty broker... comes up... [without] any > leadership[s]."  But then you criticize using reassignment to switch the order of preferred replicas because it "would not actually switch the leader > automatically."  If the empty broker doesn't have any leaderships, there is nothing to be switched, right?

Let me explained in details of this particular use case example for comparing apples to apples. 

Let's say a healthy broker hosting 3000 partitions, and of which 1000 are the preferred leaders (leader count is 1000). There is a hardware failure (disk/memory, etc.), and kafka process crashed. We swap this host with another host but keep the same broker.id, when this new broker coming up, it has no historical data, and we manage to have the current last offsets of all partitions set in the replication-offset-checkpoint (if we don't set them, it could cause crazy ReplicaFetcher pulling of historical data from other brokers and cause cluster high latency and other instabilities), so when Kafka is brought up, it is quickly catching up as followers in the ISR.  Note, we have auto.leader.rebalance.enable  disabled, so it's not serving any traffic as leaders (leader count = 0), even there are 1000 partitions that this broker is the Preferred Leader. 

We need to make this broker not serving traffic for a few hours or days depending on the SLA of the topic retention requirement until after it's having enough historical data. 


* The traditional way using the reassignments to move this broker in that 1000 partitions where it's the preferred leader to the end of  assignment, this is O(N) operation. and from my experience, we can't submit all 1000 at the same time, otherwise cause higher latencies even the reassignment in this case can complete almost instantly.  After  a few hours/days whatever, this broker is ready to serve traffic,  we have to run reassignments again to restore that 1000 partitions preferred leaders for this broker: O(N) operation.  then run preferred leader election O(N) again.  So total 3 x O(N) operations.  The point is since the new empty broker is expected to be the same as the old one in terms of hosting partition/leaders, it would seem unnecessary to do reassignments (ordering of replica) during the broker catching up time. 



* The new feature Preferred Leader "Blacklist":  just need to put a dynamic config to indicate that this broker should be considered leader (preferred leader election or broker failover or unclean leader election) to the lowest priority. NO need to run any reassignments. After a few hours/days, when this broker is ready, remove the dynamic config, and run preferred leader election and this broker will serve traffic for that 1000 original partitions it was the preferred leader. So total  1 x O(N) operation. 


If auto.leader.rebalance.enable  is enabled,  the Preferred Leader "Blacklist" can be put it before Kafka is started to prevent this broker serving traffic.  In the traditional way of running reassignments, once the broker is up, with auto.leader.rebalance.enable  , if leadership starts going to this new empty broker, it might have to do preferred leader election after reassignments to remove its leaderships. e.g. (1,2,3) => (2,3,1) reassignment only change the ordering, 1 remains as the current leader, and needs prefer leader election to change to 2 after reassignment. so potentially one more O(N) operation. 

I hope the above example can show how easy to "blacklist" a broker serving leadership.  For someone managing Production Kafka cluster, it's important to react fast to certain alerts and mitigate/resolve some issues. As I listed the other use cases in KIP-291, I think this feature can make the Kafka product more easier to manage/operate. 

> In general, using an external rebalancing tool like Cruise Control is a good idea to keep things balanced without having deal with manual rebalancing.  > We expect more and more people who have a complex or large cluster will start using tools like this.
> 
> However, if you choose to do manual rebalancing, it shouldn't be that bad.  You would save the existing partition ordering before making your changes, then> make your changes (perhaps by running a simple command line tool that switches the order of the replicas).  Then, once you felt like the broker was ready to> serve traffic, you could just re-apply the old ordering which you had saved.


We do have our own rebalancing tool which has its own criteria like Rack diversity,  disk usage,  spread partitions/leaders across all brokers in the cluster per topic, leadership Bytes/BytesIn served per broker, etc.  We can run reassignments. The point is whether it's really necessary, and if there is more effective, easier, safer way to do it.    

take another use case example of taking leadership out of busy Controller to give it more power to serve metadata requests and other work. The controller can failover, with the preferred leader "blacklist",  it does not have to run reassignments again when controller failover, just change the blacklisted broker_id. 


> I was thinking about a PlacementPolicy filling the role of preventing people from creating single-replica partitions on a node that we didn't want to > ever be the leader.  I thought that it could also prevent people from designating those nodes as preferred leaders during topic creation, or Kafka from doing> itduring random topic creation.  I was assuming that the PlacementPolicy would determine which nodes were which through static configuration keys.  I agree> static configuration keys are somewhat less flexible than dynamic configuration.


I think single-replica partition might not be a good example.  There should not be any single-replica partition at all. If yes. it's probably because of trying to save disk space with less replicas.  I think at least minimum 2. The user purposely creating single-replica partition will take full responsibilities of data loss and unavailability when a broker fails or under maintenance. 


I think it would be better to use dynamic instead of static config.  I also think it would be better to have topic creation Policy enforced in Kafka server OR an external service. We have an external/central service managing topic creation/partition expansion which takes into account of rack-diversity, replication factor (2, 3 or 4 depending on cluster/topic type), Policy replicating the topic between kafka clusters, etc.  



Thanks,
George


    On Wednesday, August 7, 2019, 05:41:28 PM PDT, Colin McCabe <cm...@apache.org> wrote:  
 
 On Wed, Aug 7, 2019, at 12:48, George Li wrote:
>  Hi Colin,
> 
> Thanks for your feedbacks.  Comments below:
> > Even if you have a way of blacklisting an entire broker all at once, you still would need to run a leader election > for each partition where you want to move the leader off of the blacklisted broker.  So the operation is still O(N) in > that sense-- you have to do something per partition.
> 
> For a failed broker and swapped with an empty broker, when it comes up, 
> it will not have any leadership, and we would like it to remain not 
> having leaderships for a couple of hours or days. So there is no 
> preferred leader election needed which incurs O(N) operation in this 
> case.  Putting the preferred leader blacklist would safe guard this 
> broker serving traffic during that time. otherwise, if another broker 
> fails(if this broker is the 1st, 2nd in the assignment), or someone 
> runs preferred leader election, this new "empty" broker can still get 
> leaderships. 
> 
> Also running reassignment to change the ordering of preferred leader 
> would not actually switch the leader automatically.  e.g.  (1,2,3) => 
> (2,3,1). unless preferred leader election is run to switch current 
> leader from 1 to 2.  So the operation is at least 2 x O(N).  and then 
> after the broker is back to normal, another 2 x O(N) to rollback. 

Hi George,

Hmm.  I guess I'm still on the fence about this feature.

In your example, I think we're comparing apples and oranges.  You started by outlining a scenario where "an empty broker... comes up... [without] any leadership[s]."  But then you criticize using reassignment to switch the order of preferred replicas because it "would not actually switch the leader automatically."  If the empty broker doesn't have any leaderships, there is nothing to be switched, right?

> 
> 
> > In general, reassignment will get a lot easier and quicker once KIP-455 is implemented.  > Reassignments that just change the order of preferred replicas for a specific partition should complete pretty much instantly.
> >> I think it's simpler and easier just to have one source of truth for what the preferred replica is for a partition, rather than two.  So for> me, the fact that the replica assignment ordering isn't changed is actually a big disadvantage of this KIP.  If you are a new user (or just>  an existing user that didn't read all of the documentation) and you just look at the replica assignment, you might be confused by why> a particular broker wasn't getting any leaderships, even  though it appeared like it should.  More mechanisms mean more complexity> for users and developers most of the time.
> 
> 
> I would like stress the point that running reassignment to change the 
> ordering of the replica (putting a broker to the end of partition 
> assignment) is unnecessary, because after some time the broker is 
> caught up, it can start serving traffic and then need to run 
> reassignments again to "rollback" to previous states. As I mentioned in 
> KIP-491, this is just tedious work. 

In general, using an external rebalancing tool like Cruise Control is a good idea to keep things balanced without having deal with manual rebalancing.  We expect more and more people who have a complex or large cluster will start using tools like this.

However, if you choose to do manual rebalancing, it shouldn't be that bad.  You would save the existing partition ordering before making your changes, then make your changes (perhaps by running a simple command line tool that switches the order of the replicas).  Then, once you felt like the broker was ready to serve traffic, you could just re-apply the old ordering which you had saved.

> 
> I agree this might introduce some complexities for users/developers. 
> But if this feature is good, and well documented, it is good for the 
> kafka product/community.  Just like KIP-460 enabling unclean leader 
> election to override TopicLevel/Broker Level config of 
> `unclean.leader.election.enable`
> 
> > I agree that it would be nice if we could treat some brokers differently for the purposes of placing replicas, selecting leaders, etc. > Right now, we don't have any way of implementing that without forking the broker.  I would support a new PlacementPolicy class that> would close this gap.  But I don't think this KIP is flexible enough to fill this role.  For example, it can't prevent users from creating> new single-replica topics that get put on the "bad" replica.  Perhaps we should reopen the discussion> about https://cwiki.apache.org/confluence/display/KAFKA/KIP-201%3A+Rationalising+Policy+interfaces
> 
> Creating topic with single-replica is beyond what KIP-491 is trying to 
> achieve.  The user needs to take responsibility of doing that. I do see 
> some Samza clients notoriously creating single-replica topics and that 
> got flagged by alerts, because a single broker down/maintenance will 
> cause offline partitions. For KIP-491 preferred leader "blacklist",  
> the single-replica will still serve as leaders, because there is no 
> other alternative replica to be chosen as leader. 
> 
> Even with a new PlacementPolicy for topic creation/partition expansion, 
> it still needs the blacklist info (e.g. a zk path node, or broker 
> level/topic level config) to "blacklist" the broker to be preferred 
> leader? Would it be the same as KIP-491 is introducing? 

I was thinking about a PlacementPolicy filling the role of preventing people from creating single-replica partitions on a node that we didn't want to ever be the leader.  I thought that it could also prevent people from designating those nodes as preferred leaders during topic creation, or Kafka from doing itduring random topic creation.  I was assuming that the PlacementPolicy would determine which nodes were which through static configuration keys.  I agree static configuration keys are somewhat less flexible than dynamic configuration.

best,
Colin


> 
> 
> Thanks,
> George
> 
>    On Wednesday, August 7, 2019, 11:01:51 AM PDT, Colin McCabe 
> <cm...@apache.org> wrote:  
>  
>  On Fri, Aug 2, 2019, at 20:02, George Li wrote:
> >  Hi Colin,
> > Thanks for looking into this KIP.  Sorry for the late response. been busy. 
> > 
> > If a cluster has MAMY topic partitions, moving this "blacklist" broker 
> > to the end of replica list is still a rather "big" operation, involving 
> > submitting reassignments.  The KIP-491 way of blacklist is much 
> > simpler/easier and can undo easily without changing the replica 
> > assignment ordering. 
> 
> Hi George,
> 
> Even if you have a way of blacklisting an entire broker all at once, 
> you still would need to run a leader election for each partition where 
> you want to move the leader off of the blacklisted broker.  So the 
> operation is still O(N) in that sense-- you have to do something per 
> partition.
> 
> In general, reassignment will get a lot easier and quicker once KIP-455 
> is implemented.  Reassignments that just change the order of preferred 
> replicas for a specific partition should complete pretty much instantly.
> 
> I think it's simpler and easier just to have one source of truth for 
> what the preferred replica is for a partition, rather than two.  So for 
> me, the fact that the replica assignment ordering isn't changed is 
> actually a big disadvantage of this KIP.  If you are a new user (or 
> just an existing user that didn't read all of the documentation) and 
> you just look at the replica assignment, you might be confused by why a 
> particular broker wasn't getting any leaderships, even  though it 
> appeared like it should.  More mechanisms mean more complexity for 
> users and developers most of the time.
> 
> > Major use case for me, a failed broker got swapped with new hardware, 
> > and starts up as empty (with latest offset of all partitions), the SLA 
> > of retention is 1 day, so before this broker is up to be in-sync for 1 
> > day, we would like to blacklist this broker from serving traffic. after 
> > 1 day, the blacklist is removed and run preferred leader election.  
> > This way, no need to run reassignments before/after.  This is the 
> > "temporary" use-case.
> 
> What if we just add an option to the reassignment tool to generate a 
> plan to move all the leaders off of a specific broker?  The tool could 
> also run a leader election as well.  That would be a simple way of 
> doing this without adding new mechanisms or broker-side configurations, 
> etc.
> 
> > 
> > There are use-cases that this Preferred Leader "blacklist" can be 
> > somewhat permanent, as I explained in the AWS data center instances Vs. 
> > on-premises data center bare metal machines (heterogenous hardware), 
> > that the AWS broker_ids will be blacklisted.  So new topics created,  
> > or existing topic expansion would not make them serve traffic even they 
> > could be the preferred leader. 
> 
> I agree that it would be nice if we could treat some brokers 
> differently for the purposes of placing replicas, selecting leaders, 
> etc.  Right now, we don't have any way of implementing that without 
> forking the broker.  I would support a new PlacementPolicy class that 
> would close this gap.  But I don't think this KIP is flexible enough to 
> fill this role.  For example, it can't prevent users from creating new 
> single-replica topics that get put on the "bad" replica.  Perhaps we 
> should reopen the discussion about 
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-201%3A+Rationalising+Policy+interfaces
> 
> regards,
> Colin
> 
> > 
> > Please let me know there are more question. 
> > 
> > 
> > Thanks,
> > George
> > 
> >    On Thursday, July 25, 2019, 08:38:28 AM PDT, Colin McCabe 
> > <cm...@apache.org> wrote:  
> >  
> >  We still want to give the "blacklisted" broker the leadership if 
> > nobody else is available.  Therefore, isn't putting a broker on the 
> > blacklist pretty much the same as moving it to the last entry in the 
> > replicas list and then triggering a preferred leader election?
> > 
> > If we want this to be undone after a certain amount of time, or under 
> > certain conditions, that seems like something that would be more 
> > effectively done by an external system, rather than putting all these 
> > policies into Kafka.
> > 
> > best,
> > Colin
> > 
> > 
> > On Fri, Jul 19, 2019, at 18:23, George Li wrote:
> > >  Hi Satish,
> > > Thanks for the reviews and feedbacks.
> > > 
> > > > > The following is the requirements this KIP is trying to accomplish:
> > > > This can be moved to the"Proposed changes" section.
> > > 
> > > Updated the KIP-491. 
> > > 
> > > > >>The logic to determine the priority/order of which broker should be
> > > > preferred leader should be modified.  The broker in the preferred leader
> > > > blacklist should be moved to the end (lowest priority) when
> > > > determining leadership.
> > > >
> > > > I believe there is no change required in the ordering of the preferred
> > > > replica list. Brokers in the preferred leader blacklist are skipped
> > > > until other brokers int he list are unavailable.
> > > 
> > > Yes. partition assignment remained the same, replica & ordering. The 
> > > blacklist logic can be optimized during implementation. 
> > > 
> > > > >>The blacklist can be at the broker level. However, there might be use cases
> > > > where a specific topic should blacklist particular brokers, which
> > > > would be at the
> > > > Topic level Config. For this use cases of this KIP, it seems that broker level
> > > > blacklist would suffice.  Topic level preferred leader blacklist might
> > > > be future enhancement work.
> > > > 
> > > > I agree that the broker level preferred leader blacklist would be
> > > > sufficient. Do you have any use cases which require topic level
> > > > preferred blacklist?
> > > 
> > > 
> > > 
> > > I don't have any concrete use cases for Topic level preferred leader 
> > > blacklist.  One scenarios I can think of is when a broker has high CPU 
> > > usage, trying to identify the big topics (High MsgIn, High BytesIn, 
> > > etc), then try to move the leaders away from this broker,  before doing 
> > > an actual reassignment to change its preferred leader,  try to put this 
> > > preferred_leader_blacklist in the Topic Level config, and run preferred 
> > > leader election, and see whether CPU decreases for this broker,  if 
> > > yes, then do the reassignments to change the preferred leaders to be 
> > > "permanent" (the topic may have many partitions like 256 that has quite 
> > > a few of them having this broker as preferred leader).  So this Topic 
> > > Level config is an easy way of doing trial and check the result. 
> > > 
> > > 
> > > > You can add the below workaround as an item in the rejected alternatives section
> > > > "Reassigning all the topic/partitions which the intended broker is a
> > > > replica for."
> > > 
> > > Updated the KIP-491. 
> > > 
> > > 
> > > 
> > > Thanks, 
> > > George
> > > 
> > >    On Friday, July 19, 2019, 08:20:22 AM PDT, Satish Duggana 
> > > <sa...@gmail.com> wrote:  
> > >  
> > >  Thanks for the KIP. I have put my comments below.
> > > 
> > > This is a nice improvement to avoid cumbersome maintenance.
> > > 
> > > >> The following is the requirements this KIP is trying to accomplish:
> > >   The ability to add and remove the preferred leader deprioritized
> > > list/blacklist. e.g. new ZK path/node or new dynamic config.
> > > 
> > > This can be moved to the"Proposed changes" section.
> > > 
> > > >>The logic to determine the priority/order of which broker should be
> > > preferred leader should be modified.  The broker in the preferred leader
> > > blacklist should be moved to the end (lowest priority) when
> > > determining leadership.
> > > 
> > > I believe there is no change required in the ordering of the preferred
> > > replica list. Brokers in the preferred leader blacklist are skipped
> > > until other brokers int he list are unavailable.
> > > 
> > > >>The blacklist can be at the broker level. However, there might be use cases
> > > where a specific topic should blacklist particular brokers, which
> > > would be at the
> > > Topic level Config. For this use cases of this KIP, it seems that broker level
> > > blacklist would suffice.  Topic level preferred leader blacklist might
> > > be future enhancement work.
> > > 
> > > I agree that the broker level preferred leader blacklist would be
> > > sufficient. Do you have any use cases which require topic level
> > > preferred blacklist?
> > > 
> > > You can add the below workaround as an item in the rejected alternatives section
> > > "Reassigning all the topic/partitions which the intended broker is a
> > > replica for."
> > > 
> > > Thanks,
> > > Satish.
> > > 
> > > On Fri, Jul 19, 2019 at 7:33 AM Stanislav Kozlovski
> > > <st...@confluent.io> wrote:
> > > >
> > > > Hey George,
> > > >
> > > > Thanks for the KIP, it's an interesting idea.
> > > >
> > > > I was wondering whether we could achieve the same thing via the
> > > > kafka-reassign-partitions tool. As you had also said in the JIRA,  it is
> > > > true that this is currently very tedious with the tool. My thoughts are
> > > > that we could improve the tool and give it the notion of a "blacklisted
> > > > preferred leader".
> > > > This would have some benefits like:
> > > > - more fine-grained control over the blacklist. we may not want to
> > > > blacklist all the preferred leaders, as that would make the blacklisted
> > > > broker a follower of last resort which is not very useful. In the cases of
> > > > an underpowered AWS machine or a controller, you might overshoot and make
> > > > the broker very underutilized if you completely make it leaderless.
> > > > - is not permanent. If we are to have a blacklist leaders config,
> > > > rebalancing tools would also need to know about it and manipulate/respect
> > > > it to achieve a fair balance.
> > > > It seems like both problems are tied to balancing partitions, it's just
> > > > that KIP-491's use case wants to balance them against other factors in a
> > > > more nuanced way. It makes sense to have both be done from the same place
> > > >
> > > > To make note of the motivation section:
> > > > > Avoid bouncing broker in order to lose its leadership
> > > > The recommended way to make a broker lose its leadership is to run a
> > > > reassignment on its partitions
> > > > > The cross-data center cluster has AWS cloud instances which have less
> > > > computing power
> > > > We recommend running Kafka on homogeneous machines. It would be cool if the
> > > > system supported more flexibility in that regard but that is more nuanced
> > > > and a preferred leader blacklist may not be the best first approach to the
> > > > issue
> > > >
> > > > Adding a new config which can fundamentally change the way replication is
> > > > done is complex, both for the system (the replication code is complex
> > > > enough) and the user. Users would have another potential config that could
> > > > backfire on them - e.g if left forgotten.
> > > >
> > > > Could you think of any downsides to implementing this functionality (or a
> > > > variation of it) in the kafka-reassign-partitions.sh tool?
> > > > One downside I can see is that we would not have it handle new partitions
> > > > created after the "blacklist operation". As a first iteration I think that
> > > > may be acceptable
> > > >
> > > > Thanks,
> > > > Stanislav
> > > >
> > > > On Fri, Jul 19, 2019 at 3:20 AM George Li <sq...@yahoo.com.invalid>
> > > > wrote:
> > > >
> > > > >  Hi,
> > > > >
> > > > > Pinging the list for the feedbacks of this KIP-491  (
> > > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120736982
> > > > > )
> > > > >
> > > > >
> > > > > Thanks,
> > > > > George
> > > > >
> > > > >    On Saturday, July 13, 2019, 08:43:25 PM PDT, George Li <
> > > > > sql_consulting@yahoo.com.INVALID> wrote:
> > > > >
> > > > >  Hi,
> > > > >
> > > > > I have created KIP-491 (
> > > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120736982)
> > > > > for putting a broker to the preferred leader blacklist or deprioritized
> > > > > list so when determining leadership,  it's moved to the lowest priority for
> > > > > some of the listed use-cases.
> > > > >
> > > > > Please provide your comments/feedbacks.
> > > > >
> > > > > Thanks,
> > > > > George
> > > > >
> > > > >
> > > > >
> > > > >  ----- Forwarded Message ----- From: Jose Armando Garcia Sancio (JIRA) <
> > > > > jira@apache.org>To: "sql_consulting@yahoo.com" <sq...@yahoo.com>Sent:
> > > > > Tuesday, July 9, 2019, 01:06:05 PM PDTSubject: [jira] [Commented]
> > > > > (KAFKA-8638) Preferred Leader Blacklist (deprioritized list)
> > > > >
> > > > >    [
> > > > > https://issues.apache.org/jira/browse/KAFKA-8638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881511#comment-16881511
> > > > > ]
> > > > >
> > > > > Jose Armando Garcia Sancio commented on KAFKA-8638:
> > > > > ---------------------------------------------------
> > > > >
> > > > > Thanks for feedback and clear use cases [~sql_consulting].
> > > > >
> > > > > > Preferred Leader Blacklist (deprioritized list)
> > > > > > -----------------------------------------------
> > > > > >
> > > > > >                Key: KAFKA-8638
> > > > > >                URL: https://issues.apache.org/jira/browse/KAFKA-8638
> > > > > >            Project: Kafka
> > > > > >          Issue Type: Improvement
> > > > > >          Components: config, controller, core
> > > > > >    Affects Versions: 1.1.1, 2.3.0, 2.2.1
> > > > > >            Reporter: GEORGE LI
> > > > > >            Assignee: GEORGE LI
> > > > > >            Priority: Major
> > > > > >
> > > > > > Currently, the kafka preferred leader election will pick the broker_id
> > > > > in the topic/partition replica assignments in a priority order when the
> > > > > broker is in ISR. The preferred leader is the broker id in the first
> > > > > position of replica. There are use-cases that, even the first broker in the
> > > > > replica assignment is in ISR, there is a need for it to be moved to the end
> > > > > of ordering (lowest priority) when deciding leadership during  preferred
> > > > > leader election.
> > > > > > Let’s use topic/partition replica (1,2,3) as an example. 1 is the
> > > > > preferred leader.  When preferred leadership is run, it will pick 1 as the
> > > > > leader if it's ISR, if 1 is not online and in ISR, then pick 2, if 2 is not
> > > > > in ISR, then pick 3 as the leader. There are use cases that, even 1 is in
> > > > > ISR, we would like it to be moved to the end of ordering (lowest priority)
> > > > > when deciding leadership during preferred leader election.  Below is a list
> > > > > of use cases:
> > > > > > * (If broker_id 1 is a swapped failed host and brought up with last
> > > > > segments or latest offset without historical data (There is another effort
> > > > > on this), it's better for it to not serve leadership till it's caught-up.
> > > > > > * The cross-data center cluster has AWS instances which have less
> > > > > computing power than the on-prem bare metal machines.  We could put the AWS
> > > > > broker_ids in Preferred Leader Blacklist, so on-prem brokers can be elected
> > > > > leaders, without changing the reassignments ordering of the replicas.
> > > > > > * If the broker_id 1 is constantly losing leadership after some time:
> > > > > "Flapping". we would want to exclude 1 to be a leader unless all other
> > > > > brokers of this topic/partition are offline.  The “Flapping” effect was
> > > > > seen in the past when 2 or more brokers were bad, when they lost leadership
> > > > > constantly/quickly, the sets of partition replicas they belong to will see
> > > > > leadership constantly changing.  The ultimate solution is to swap these bad
> > > > > hosts.  But for quick mitigation, we can also put the bad hosts in the
> > > > > Preferred Leader Blacklist to move the priority of its being elected as
> > > > > leaders to the lowest.
> > > > > > *  If the controller is busy serving an extra load of metadata requests
> > > > > and other tasks. we would like to put the controller's leaders to other
> > > > > brokers to lower its CPU load. currently bouncing to lose leadership would
> > > > > not work for Controller, because after the bounce, the controller fails
> > > > > over to another broker.
> > > > > > * Avoid bouncing broker in order to lose its leadership: it would be
> > > > > good if we have a way to specify which broker should be excluded from
> > > > > serving traffic/leadership (without changing the replica assignment
> > > > > ordering by reassignments, even though that's quick), and run preferred
> > > > > leader election.  A bouncing broker will cause temporary URP, and sometimes
> > > > > other issues.  Also a bouncing of broker (e.g. broker_id 1) can temporarily
> > > > > lose all its leadership, but if another broker (e.g. broker_id 2) fails or
> > > > > gets bounced, some of its leaderships will likely failover to broker_id 1
> > > > > on a replica with 3 brokers.  If broker_id 1 is in the blacklist, then in
> > > > > such a scenario even broker_id 2 offline,  the 3rd broker can take
> > > > > leadership.
> > > > > > The current work-around of the above is to change the topic/partition's
> > > > > replica reassignments to move the broker_id 1 from the first position to
> > > > > the last position and run preferred leader election. e.g. (1, 2, 3) => (2,
> > > > > 3, 1). This changes the replica reassignments, and we need to keep track of
> > > > > the original one and restore if things change (e.g. controller fails over
> > > > > to another broker, the swapped empty broker caught up). That’s a rather
> > > > > tedious task.
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > This message was sent by Atlassian JIRA
> > > > > (v7.6.3#76005)  

Re: [DISCUSS] KIP-491: Preferred Leader Deprioritized List (Temporary Blacklist)

Posted by Colin McCabe <cm...@apache.org>.
On Wed, Aug 7, 2019, at 12:48, George Li wrote:
>  Hi Colin,
> 
> Thanks for your feedbacks.  Comments below:
> > Even if you have a way of blacklisting an entire broker all at once, you still would need to run a leader election > for each partition where you want to move the leader off of the blacklisted broker.  So the operation is still O(N) in > that sense-- you have to do something per partition.
> 
> For a failed broker and swapped with an empty broker, when it comes up, 
> it will not have any leadership, and we would like it to remain not 
> having leaderships for a couple of hours or days. So there is no 
> preferred leader election needed which incurs O(N) operation in this 
> case.  Putting the preferred leader blacklist would safe guard this 
> broker serving traffic during that time. otherwise, if another broker 
> fails(if this broker is the 1st, 2nd in the assignment), or someone 
> runs preferred leader election, this new "empty" broker can still get 
> leaderships. 
> 
> Also running reassignment to change the ordering of preferred leader 
> would not actually switch the leader automatically.  e.g.  (1,2,3) => 
> (2,3,1). unless preferred leader election is run to switch current 
> leader from 1 to 2.  So the operation is at least 2 x O(N).  and then 
> after the broker is back to normal, another 2 x O(N) to rollback. 

Hi George,

Hmm.  I guess I'm still on the fence about this feature.

In your example, I think we're comparing apples and oranges.  You started by outlining a scenario where "an empty broker... comes up... [without] any leadership[s]."  But then you criticize using reassignment to switch the order of preferred replicas because it "would not actually switch the leader automatically."  If the empty broker doesn't have any leaderships, there is nothing to be switched, right?

> 
> 
> > In general, reassignment will get a lot easier and quicker once KIP-455 is implemented.  > Reassignments that just change the order of preferred replicas for a specific partition should complete pretty much instantly.
> >> I think it's simpler and easier just to have one source of truth for what the preferred replica is for a partition, rather than two.  So for> me, the fact that the replica assignment ordering isn't changed is actually a big disadvantage of this KIP.  If you are a new user (or just>  an existing user that didn't read all of the documentation) and you just look at the replica assignment, you might be confused by why> a particular broker wasn't getting any leaderships, even  though it appeared like it should.  More mechanisms mean more complexity> for users and developers most of the time.
> 
> 
> I would like stress the point that running reassignment to change the 
> ordering of the replica (putting a broker to the end of partition 
> assignment) is unnecessary, because after some time the broker is 
> caught up, it can start serving traffic and then need to run 
> reassignments again to "rollback" to previous states. As I mentioned in 
> KIP-491, this is just tedious work. 

In general, using an external rebalancing tool like Cruise Control is a good idea to keep things balanced without having deal with manual rebalancing.  We expect more and more people who have a complex or large cluster will start using tools like this.

However, if you choose to do manual rebalancing, it shouldn't be that bad.  You would save the existing partition ordering before making your changes, then make your changes (perhaps by running a simple command line tool that switches the order of the replicas).  Then, once you felt like the broker was ready to serve traffic, you could just re-apply the old ordering which you had saved.

> 
> I agree this might introduce some complexities for users/developers. 
> But if this feature is good, and well documented, it is good for the 
> kafka product/community.  Just like KIP-460 enabling unclean leader 
> election to override TopicLevel/Broker Level config of 
> `unclean.leader.election.enable`
> 
> > I agree that it would be nice if we could treat some brokers differently for the purposes of placing replicas, selecting leaders, etc. > Right now, we don't have any way of implementing that without forking the broker.  I would support a new PlacementPolicy class that> would close this gap.  But I don't think this KIP is flexible enough to fill this role.  For example, it can't prevent users from creating> new single-replica topics that get put on the "bad" replica.  Perhaps we should reopen the discussion> about https://cwiki.apache.org/confluence/display/KAFKA/KIP-201%3A+Rationalising+Policy+interfaces
> 
> Creating topic with single-replica is beyond what KIP-491 is trying to 
> achieve.  The user needs to take responsibility of doing that. I do see 
> some Samza clients notoriously creating single-replica topics and that 
> got flagged by alerts, because a single broker down/maintenance will 
> cause offline partitions. For KIP-491 preferred leader "blacklist",  
> the single-replica will still serve as leaders, because there is no 
> other alternative replica to be chosen as leader. 
> 
> Even with a new PlacementPolicy for topic creation/partition expansion, 
> it still needs the blacklist info (e.g. a zk path node, or broker 
> level/topic level config) to "blacklist" the broker to be preferred 
> leader? Would it be the same as KIP-491 is introducing? 

I was thinking about a PlacementPolicy filling the role of preventing people from creating single-replica partitions on a node that we didn't want to ever be the leader.  I thought that it could also prevent people from designating those nodes as preferred leaders during topic creation, or Kafka from doing itduring random topic creation.  I was assuming that the PlacementPolicy would determine which nodes were which through static configuration keys.  I agree static configuration keys are somewhat less flexible than dynamic configuration.

best,
Colin


> 
> 
> Thanks,
> George
> 
>     On Wednesday, August 7, 2019, 11:01:51 AM PDT, Colin McCabe 
> <cm...@apache.org> wrote:  
>  
>  On Fri, Aug 2, 2019, at 20:02, George Li wrote:
> >  Hi Colin,
> > Thanks for looking into this KIP.  Sorry for the late response. been busy. 
> > 
> > If a cluster has MAMY topic partitions, moving this "blacklist" broker 
> > to the end of replica list is still a rather "big" operation, involving 
> > submitting reassignments.  The KIP-491 way of blacklist is much 
> > simpler/easier and can undo easily without changing the replica 
> > assignment ordering. 
> 
> Hi George,
> 
> Even if you have a way of blacklisting an entire broker all at once, 
> you still would need to run a leader election for each partition where 
> you want to move the leader off of the blacklisted broker.  So the 
> operation is still O(N) in that sense-- you have to do something per 
> partition.
> 
> In general, reassignment will get a lot easier and quicker once KIP-455 
> is implemented.  Reassignments that just change the order of preferred 
> replicas for a specific partition should complete pretty much instantly.
> 
> I think it's simpler and easier just to have one source of truth for 
> what the preferred replica is for a partition, rather than two.  So for 
> me, the fact that the replica assignment ordering isn't changed is 
> actually a big disadvantage of this KIP.  If you are a new user (or 
> just an existing user that didn't read all of the documentation) and 
> you just look at the replica assignment, you might be confused by why a 
> particular broker wasn't getting any leaderships, even  though it 
> appeared like it should.  More mechanisms mean more complexity for 
> users and developers most of the time.
> 
> > Major use case for me, a failed broker got swapped with new hardware, 
> > and starts up as empty (with latest offset of all partitions), the SLA 
> > of retention is 1 day, so before this broker is up to be in-sync for 1 
> > day, we would like to blacklist this broker from serving traffic. after 
> > 1 day, the blacklist is removed and run preferred leader election.  
> > This way, no need to run reassignments before/after.  This is the 
> > "temporary" use-case.
> 
> What if we just add an option to the reassignment tool to generate a 
> plan to move all the leaders off of a specific broker?  The tool could 
> also run a leader election as well.  That would be a simple way of 
> doing this without adding new mechanisms or broker-side configurations, 
> etc.
> 
> > 
> > There are use-cases that this Preferred Leader "blacklist" can be 
> > somewhat permanent, as I explained in the AWS data center instances Vs. 
> > on-premises data center bare metal machines (heterogenous hardware), 
> > that the AWS broker_ids will be blacklisted.  So new topics created,  
> > or existing topic expansion would not make them serve traffic even they 
> > could be the preferred leader. 
> 
> I agree that it would be nice if we could treat some brokers 
> differently for the purposes of placing replicas, selecting leaders, 
> etc.  Right now, we don't have any way of implementing that without 
> forking the broker.  I would support a new PlacementPolicy class that 
> would close this gap.  But I don't think this KIP is flexible enough to 
> fill this role.  For example, it can't prevent users from creating new 
> single-replica topics that get put on the "bad" replica.  Perhaps we 
> should reopen the discussion about 
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-201%3A+Rationalising+Policy+interfaces
> 
> regards,
> Colin
> 
> > 
> > Please let me know there are more question. 
> > 
> > 
> > Thanks,
> > George
> > 
> >    On Thursday, July 25, 2019, 08:38:28 AM PDT, Colin McCabe 
> > <cm...@apache.org> wrote:  
> >  
> >  We still want to give the "blacklisted" broker the leadership if 
> > nobody else is available.  Therefore, isn't putting a broker on the 
> > blacklist pretty much the same as moving it to the last entry in the 
> > replicas list and then triggering a preferred leader election?
> > 
> > If we want this to be undone after a certain amount of time, or under 
> > certain conditions, that seems like something that would be more 
> > effectively done by an external system, rather than putting all these 
> > policies into Kafka.
> > 
> > best,
> > Colin
> > 
> > 
> > On Fri, Jul 19, 2019, at 18:23, George Li wrote:
> > >  Hi Satish,
> > > Thanks for the reviews and feedbacks.
> > > 
> > > > > The following is the requirements this KIP is trying to accomplish:
> > > > This can be moved to the"Proposed changes" section.
> > > 
> > > Updated the KIP-491. 
> > > 
> > > > >>The logic to determine the priority/order of which broker should be
> > > > preferred leader should be modified.  The broker in the preferred leader
> > > > blacklist should be moved to the end (lowest priority) when
> > > > determining leadership.
> > > >
> > > > I believe there is no change required in the ordering of the preferred
> > > > replica list. Brokers in the preferred leader blacklist are skipped
> > > > until other brokers int he list are unavailable.
> > > 
> > > Yes. partition assignment remained the same, replica & ordering. The 
> > > blacklist logic can be optimized during implementation. 
> > > 
> > > > >>The blacklist can be at the broker level. However, there might be use cases
> > > > where a specific topic should blacklist particular brokers, which
> > > > would be at the
> > > > Topic level Config. For this use cases of this KIP, it seems that broker level
> > > > blacklist would suffice.  Topic level preferred leader blacklist might
> > > > be future enhancement work.
> > > > 
> > > > I agree that the broker level preferred leader blacklist would be
> > > > sufficient. Do you have any use cases which require topic level
> > > > preferred blacklist?
> > > 
> > > 
> > > 
> > > I don't have any concrete use cases for Topic level preferred leader 
> > > blacklist.  One scenarios I can think of is when a broker has high CPU 
> > > usage, trying to identify the big topics (High MsgIn, High BytesIn, 
> > > etc), then try to move the leaders away from this broker,  before doing 
> > > an actual reassignment to change its preferred leader,  try to put this 
> > > preferred_leader_blacklist in the Topic Level config, and run preferred 
> > > leader election, and see whether CPU decreases for this broker,  if 
> > > yes, then do the reassignments to change the preferred leaders to be 
> > > "permanent" (the topic may have many partitions like 256 that has quite 
> > > a few of them having this broker as preferred leader).  So this Topic 
> > > Level config is an easy way of doing trial and check the result. 
> > > 
> > > 
> > > > You can add the below workaround as an item in the rejected alternatives section
> > > > "Reassigning all the topic/partitions which the intended broker is a
> > > > replica for."
> > > 
> > > Updated the KIP-491. 
> > > 
> > > 
> > > 
> > > Thanks, 
> > > George
> > > 
> > >    On Friday, July 19, 2019, 08:20:22 AM PDT, Satish Duggana 
> > > <sa...@gmail.com> wrote:  
> > >  
> > >  Thanks for the KIP. I have put my comments below.
> > > 
> > > This is a nice improvement to avoid cumbersome maintenance.
> > > 
> > > >> The following is the requirements this KIP is trying to accomplish:
> > >   The ability to add and remove the preferred leader deprioritized
> > > list/blacklist. e.g. new ZK path/node or new dynamic config.
> > > 
> > > This can be moved to the"Proposed changes" section.
> > > 
> > > >>The logic to determine the priority/order of which broker should be
> > > preferred leader should be modified.  The broker in the preferred leader
> > > blacklist should be moved to the end (lowest priority) when
> > > determining leadership.
> > > 
> > > I believe there is no change required in the ordering of the preferred
> > > replica list. Brokers in the preferred leader blacklist are skipped
> > > until other brokers int he list are unavailable.
> > > 
> > > >>The blacklist can be at the broker level. However, there might be use cases
> > > where a specific topic should blacklist particular brokers, which
> > > would be at the
> > > Topic level Config. For this use cases of this KIP, it seems that broker level
> > > blacklist would suffice.  Topic level preferred leader blacklist might
> > > be future enhancement work.
> > > 
> > > I agree that the broker level preferred leader blacklist would be
> > > sufficient. Do you have any use cases which require topic level
> > > preferred blacklist?
> > > 
> > > You can add the below workaround as an item in the rejected alternatives section
> > > "Reassigning all the topic/partitions which the intended broker is a
> > > replica for."
> > > 
> > > Thanks,
> > > Satish.
> > > 
> > > On Fri, Jul 19, 2019 at 7:33 AM Stanislav Kozlovski
> > > <st...@confluent.io> wrote:
> > > >
> > > > Hey George,
> > > >
> > > > Thanks for the KIP, it's an interesting idea.
> > > >
> > > > I was wondering whether we could achieve the same thing via the
> > > > kafka-reassign-partitions tool. As you had also said in the JIRA,  it is
> > > > true that this is currently very tedious with the tool. My thoughts are
> > > > that we could improve the tool and give it the notion of a "blacklisted
> > > > preferred leader".
> > > > This would have some benefits like:
> > > > - more fine-grained control over the blacklist. we may not want to
> > > > blacklist all the preferred leaders, as that would make the blacklisted
> > > > broker a follower of last resort which is not very useful. In the cases of
> > > > an underpowered AWS machine or a controller, you might overshoot and make
> > > > the broker very underutilized if you completely make it leaderless.
> > > > - is not permanent. If we are to have a blacklist leaders config,
> > > > rebalancing tools would also need to know about it and manipulate/respect
> > > > it to achieve a fair balance.
> > > > It seems like both problems are tied to balancing partitions, it's just
> > > > that KIP-491's use case wants to balance them against other factors in a
> > > > more nuanced way. It makes sense to have both be done from the same place
> > > >
> > > > To make note of the motivation section:
> > > > > Avoid bouncing broker in order to lose its leadership
> > > > The recommended way to make a broker lose its leadership is to run a
> > > > reassignment on its partitions
> > > > > The cross-data center cluster has AWS cloud instances which have less
> > > > computing power
> > > > We recommend running Kafka on homogeneous machines. It would be cool if the
> > > > system supported more flexibility in that regard but that is more nuanced
> > > > and a preferred leader blacklist may not be the best first approach to the
> > > > issue
> > > >
> > > > Adding a new config which can fundamentally change the way replication is
> > > > done is complex, both for the system (the replication code is complex
> > > > enough) and the user. Users would have another potential config that could
> > > > backfire on them - e.g if left forgotten.
> > > >
> > > > Could you think of any downsides to implementing this functionality (or a
> > > > variation of it) in the kafka-reassign-partitions.sh tool?
> > > > One downside I can see is that we would not have it handle new partitions
> > > > created after the "blacklist operation". As a first iteration I think that
> > > > may be acceptable
> > > >
> > > > Thanks,
> > > > Stanislav
> > > >
> > > > On Fri, Jul 19, 2019 at 3:20 AM George Li <sq...@yahoo.com.invalid>
> > > > wrote:
> > > >
> > > > >  Hi,
> > > > >
> > > > > Pinging the list for the feedbacks of this KIP-491  (
> > > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120736982
> > > > > )
> > > > >
> > > > >
> > > > > Thanks,
> > > > > George
> > > > >
> > > > >    On Saturday, July 13, 2019, 08:43:25 PM PDT, George Li <
> > > > > sql_consulting@yahoo.com.INVALID> wrote:
> > > > >
> > > > >  Hi,
> > > > >
> > > > > I have created KIP-491 (
> > > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120736982)
> > > > > for putting a broker to the preferred leader blacklist or deprioritized
> > > > > list so when determining leadership,  it's moved to the lowest priority for
> > > > > some of the listed use-cases.
> > > > >
> > > > > Please provide your comments/feedbacks.
> > > > >
> > > > > Thanks,
> > > > > George
> > > > >
> > > > >
> > > > >
> > > > >  ----- Forwarded Message ----- From: Jose Armando Garcia Sancio (JIRA) <
> > > > > jira@apache.org>To: "sql_consulting@yahoo.com" <sq...@yahoo.com>Sent:
> > > > > Tuesday, July 9, 2019, 01:06:05 PM PDTSubject: [jira] [Commented]
> > > > > (KAFKA-8638) Preferred Leader Blacklist (deprioritized list)
> > > > >
> > > > >    [
> > > > > https://issues.apache.org/jira/browse/KAFKA-8638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881511#comment-16881511
> > > > > ]
> > > > >
> > > > > Jose Armando Garcia Sancio commented on KAFKA-8638:
> > > > > ---------------------------------------------------
> > > > >
> > > > > Thanks for feedback and clear use cases [~sql_consulting].
> > > > >
> > > > > > Preferred Leader Blacklist (deprioritized list)
> > > > > > -----------------------------------------------
> > > > > >
> > > > > >                Key: KAFKA-8638
> > > > > >                URL: https://issues.apache.org/jira/browse/KAFKA-8638
> > > > > >            Project: Kafka
> > > > > >          Issue Type: Improvement
> > > > > >          Components: config, controller, core
> > > > > >    Affects Versions: 1.1.1, 2.3.0, 2.2.1
> > > > > >            Reporter: GEORGE LI
> > > > > >            Assignee: GEORGE LI
> > > > > >            Priority: Major
> > > > > >
> > > > > > Currently, the kafka preferred leader election will pick the broker_id
> > > > > in the topic/partition replica assignments in a priority order when the
> > > > > broker is in ISR. The preferred leader is the broker id in the first
> > > > > position of replica. There are use-cases that, even the first broker in the
> > > > > replica assignment is in ISR, there is a need for it to be moved to the end
> > > > > of ordering (lowest priority) when deciding leadership during  preferred
> > > > > leader election.
> > > > > > Let’s use topic/partition replica (1,2,3) as an example. 1 is the
> > > > > preferred leader.  When preferred leadership is run, it will pick 1 as the
> > > > > leader if it's ISR, if 1 is not online and in ISR, then pick 2, if 2 is not
> > > > > in ISR, then pick 3 as the leader. There are use cases that, even 1 is in
> > > > > ISR, we would like it to be moved to the end of ordering (lowest priority)
> > > > > when deciding leadership during preferred leader election.  Below is a list
> > > > > of use cases:
> > > > > > * (If broker_id 1 is a swapped failed host and brought up with last
> > > > > segments or latest offset without historical data (There is another effort
> > > > > on this), it's better for it to not serve leadership till it's caught-up.
> > > > > > * The cross-data center cluster has AWS instances which have less
> > > > > computing power than the on-prem bare metal machines.  We could put the AWS
> > > > > broker_ids in Preferred Leader Blacklist, so on-prem brokers can be elected
> > > > > leaders, without changing the reassignments ordering of the replicas.
> > > > > > * If the broker_id 1 is constantly losing leadership after some time:
> > > > > "Flapping". we would want to exclude 1 to be a leader unless all other
> > > > > brokers of this topic/partition are offline.  The “Flapping” effect was
> > > > > seen in the past when 2 or more brokers were bad, when they lost leadership
> > > > > constantly/quickly, the sets of partition replicas they belong to will see
> > > > > leadership constantly changing.  The ultimate solution is to swap these bad
> > > > > hosts.  But for quick mitigation, we can also put the bad hosts in the
> > > > > Preferred Leader Blacklist to move the priority of its being elected as
> > > > > leaders to the lowest.
> > > > > > *  If the controller is busy serving an extra load of metadata requests
> > > > > and other tasks. we would like to put the controller's leaders to other
> > > > > brokers to lower its CPU load. currently bouncing to lose leadership would
> > > > > not work for Controller, because after the bounce, the controller fails
> > > > > over to another broker.
> > > > > > * Avoid bouncing broker in order to lose its leadership: it would be
> > > > > good if we have a way to specify which broker should be excluded from
> > > > > serving traffic/leadership (without changing the replica assignment
> > > > > ordering by reassignments, even though that's quick), and run preferred
> > > > > leader election.  A bouncing broker will cause temporary URP, and sometimes
> > > > > other issues.  Also a bouncing of broker (e.g. broker_id 1) can temporarily
> > > > > lose all its leadership, but if another broker (e.g. broker_id 2) fails or
> > > > > gets bounced, some of its leaderships will likely failover to broker_id 1
> > > > > on a replica with 3 brokers.  If broker_id 1 is in the blacklist, then in
> > > > > such a scenario even broker_id 2 offline,  the 3rd broker can take
> > > > > leadership.
> > > > > > The current work-around of the above is to change the topic/partition's
> > > > > replica reassignments to move the broker_id 1 from the first position to
> > > > > the last position and run preferred leader election. e.g. (1, 2, 3) => (2,
> > > > > 3, 1). This changes the replica reassignments, and we need to keep track of
> > > > > the original one and restore if things change (e.g. controller fails over
> > > > > to another broker, the swapped empty broker caught up). That’s a rather
> > > > > tedious task.
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > This message was sent by Atlassian JIRA
> > > > > (v7.6.3#76005)

Re: [DISCUSS] KIP-491: Preferred Leader Deprioritized List (Temporary Blacklist)

Posted by George Li <sq...@yahoo.com>.
 Hi Colin,

Thanks for your feedbacks.  Comments below:
> Even if you have a way of blacklisting an entire broker all at once, you still would need to run a leader election > for each partition where you want to move the leader off of the blacklisted broker.  So the operation is still O(N) in > that sense-- you have to do something per partition.

For a failed broker and swapped with an empty broker, when it comes up, it will not have any leadership, and we would like it to remain not having leaderships for a couple of hours or days. So there is no preferred leader election needed which incurs O(N) operation in this case.  Putting the preferred leader blacklist would safe guard this broker serving traffic during that time. otherwise, if another broker fails(if this broker is the 1st, 2nd in the assignment), or someone runs preferred leader election, this new "empty" broker can still get leaderships. 

Also running reassignment to change the ordering of preferred leader would not actually switch the leader automatically.  e.g.  (1,2,3) => (2,3,1). unless preferred leader election is run to switch current leader from 1 to 2.  So the operation is at least 2 x O(N).  and then after the broker is back to normal, another 2 x O(N) to rollback. 


> In general, reassignment will get a lot easier and quicker once KIP-455 is implemented.  > Reassignments that just change the order of preferred replicas for a specific partition should complete pretty much instantly.
>> I think it's simpler and easier just to have one source of truth for what the preferred replica is for a partition, rather than two.  So for> me, the fact that the replica assignment ordering isn't changed is actually a big disadvantage of this KIP.  If you are a new user (or just>  an existing user that didn't read all of the documentation) and you just look at the replica assignment, you might be confused by why> a particular broker wasn't getting any leaderships, even  though it appeared like it should.  More mechanisms mean more complexity> for users and developers most of the time.


I would like stress the point that running reassignment to change the ordering of the replica (putting a broker to the end of partition assignment) is unnecessary, because after some time the broker is caught up, it can start serving traffic and then need to run reassignments again to "rollback" to previous states. As I mentioned in KIP-491, this is just tedious work. 

I agree this might introduce some complexities for users/developers. But if this feature is good, and well documented, it is good for the kafka product/community.  Just like KIP-460 enabling unclean leader election to override TopicLevel/Broker Level config of `unclean.leader.election.enable`

> I agree that it would be nice if we could treat some brokers differently for the purposes of placing replicas, selecting leaders, etc. > Right now, we don't have any way of implementing that without forking the broker.  I would support a new PlacementPolicy class that> would close this gap.  But I don't think this KIP is flexible enough to fill this role.  For example, it can't prevent users from creating> new single-replica topics that get put on the "bad" replica.  Perhaps we should reopen the discussion> about https://cwiki.apache.org/confluence/display/KAFKA/KIP-201%3A+Rationalising+Policy+interfaces

Creating topic with single-replica is beyond what KIP-491 is trying to achieve.  The user needs to take responsibility of doing that. I do see some Samza clients notoriously creating single-replica topics and that got flagged by alerts, because a single broker down/maintenance will cause offline partitions. For KIP-491 preferred leader "blacklist",  the single-replica will still serve as leaders, because there is no other alternative replica to be chosen as leader. 

Even with a new PlacementPolicy for topic creation/partition expansion, it still needs the blacklist info (e.g. a zk path node, or broker level/topic level config) to "blacklist" the broker to be preferred leader? Would it be the same as KIP-491 is introducing? 


Thanks,
George

    On Wednesday, August 7, 2019, 11:01:51 AM PDT, Colin McCabe <cm...@apache.org> wrote:  
 
 On Fri, Aug 2, 2019, at 20:02, George Li wrote:
>  Hi Colin,
> Thanks for looking into this KIP.  Sorry for the late response. been busy. 
> 
> If a cluster has MAMY topic partitions, moving this "blacklist" broker 
> to the end of replica list is still a rather "big" operation, involving 
> submitting reassignments.  The KIP-491 way of blacklist is much 
> simpler/easier and can undo easily without changing the replica 
> assignment ordering. 

Hi George,

Even if you have a way of blacklisting an entire broker all at once, you still would need to run a leader election for each partition where you want to move the leader off of the blacklisted broker.  So the operation is still O(N) in that sense-- you have to do something per partition.

In general, reassignment will get a lot easier and quicker once KIP-455 is implemented.  Reassignments that just change the order of preferred replicas for a specific partition should complete pretty much instantly.

I think it's simpler and easier just to have one source of truth for what the preferred replica is for a partition, rather than two.  So for me, the fact that the replica assignment ordering isn't changed is actually a big disadvantage of this KIP.  If you are a new user (or just an existing user that didn't read all of the documentation) and you just look at the replica assignment, you might be confused by why a particular broker wasn't getting any leaderships, even  though it appeared like it should.  More mechanisms mean more complexity for users and developers most of the time.

> Major use case for me, a failed broker got swapped with new hardware, 
> and starts up as empty (with latest offset of all partitions), the SLA 
> of retention is 1 day, so before this broker is up to be in-sync for 1 
> day, we would like to blacklist this broker from serving traffic. after 
> 1 day, the blacklist is removed and run preferred leader election.  
> This way, no need to run reassignments before/after.  This is the 
> "temporary" use-case.

What if we just add an option to the reassignment tool to generate a plan to move all the leaders off of a specific broker?  The tool could also run a leader election as well.  That would be a simple way of doing this without adding new mechanisms or broker-side configurations, etc.

> 
> There are use-cases that this Preferred Leader "blacklist" can be 
> somewhat permanent, as I explained in the AWS data center instances Vs. 
> on-premises data center bare metal machines (heterogenous hardware), 
> that the AWS broker_ids will be blacklisted.  So new topics created,  
> or existing topic expansion would not make them serve traffic even they 
> could be the preferred leader. 

I agree that it would be nice if we could treat some brokers differently for the purposes of placing replicas, selecting leaders, etc.  Right now, we don't have any way of implementing that without forking the broker.  I would support a new PlacementPolicy class that would close this gap.  But I don't think this KIP is flexible enough to fill this role.  For example, it can't prevent users from creating new single-replica topics that get put on the "bad" replica.  Perhaps we should reopen the discussion about https://cwiki.apache.org/confluence/display/KAFKA/KIP-201%3A+Rationalising+Policy+interfaces

regards,
Colin

> 
> Please let me know there are more question. 
> 
> 
> Thanks,
> George
> 
>    On Thursday, July 25, 2019, 08:38:28 AM PDT, Colin McCabe 
> <cm...@apache.org> wrote:  
>  
>  We still want to give the "blacklisted" broker the leadership if 
> nobody else is available.  Therefore, isn't putting a broker on the 
> blacklist pretty much the same as moving it to the last entry in the 
> replicas list and then triggering a preferred leader election?
> 
> If we want this to be undone after a certain amount of time, or under 
> certain conditions, that seems like something that would be more 
> effectively done by an external system, rather than putting all these 
> policies into Kafka.
> 
> best,
> Colin
> 
> 
> On Fri, Jul 19, 2019, at 18:23, George Li wrote:
> >  Hi Satish,
> > Thanks for the reviews and feedbacks.
> > 
> > > > The following is the requirements this KIP is trying to accomplish:
> > > This can be moved to the"Proposed changes" section.
> > 
> > Updated the KIP-491. 
> > 
> > > >>The logic to determine the priority/order of which broker should be
> > > preferred leader should be modified.  The broker in the preferred leader
> > > blacklist should be moved to the end (lowest priority) when
> > > determining leadership.
> > >
> > > I believe there is no change required in the ordering of the preferred
> > > replica list. Brokers in the preferred leader blacklist are skipped
> > > until other brokers int he list are unavailable.
> > 
> > Yes. partition assignment remained the same, replica & ordering. The 
> > blacklist logic can be optimized during implementation. 
> > 
> > > >>The blacklist can be at the broker level. However, there might be use cases
> > > where a specific topic should blacklist particular brokers, which
> > > would be at the
> > > Topic level Config. For this use cases of this KIP, it seems that broker level
> > > blacklist would suffice.  Topic level preferred leader blacklist might
> > > be future enhancement work.
> > > 
> > > I agree that the broker level preferred leader blacklist would be
> > > sufficient. Do you have any use cases which require topic level
> > > preferred blacklist?
> > 
> > 
> > 
> > I don't have any concrete use cases for Topic level preferred leader 
> > blacklist.  One scenarios I can think of is when a broker has high CPU 
> > usage, trying to identify the big topics (High MsgIn, High BytesIn, 
> > etc), then try to move the leaders away from this broker,  before doing 
> > an actual reassignment to change its preferred leader,  try to put this 
> > preferred_leader_blacklist in the Topic Level config, and run preferred 
> > leader election, and see whether CPU decreases for this broker,  if 
> > yes, then do the reassignments to change the preferred leaders to be 
> > "permanent" (the topic may have many partitions like 256 that has quite 
> > a few of them having this broker as preferred leader).  So this Topic 
> > Level config is an easy way of doing trial and check the result. 
> > 
> > 
> > > You can add the below workaround as an item in the rejected alternatives section
> > > "Reassigning all the topic/partitions which the intended broker is a
> > > replica for."
> > 
> > Updated the KIP-491. 
> > 
> > 
> > 
> > Thanks, 
> > George
> > 
> >    On Friday, July 19, 2019, 08:20:22 AM PDT, Satish Duggana 
> > <sa...@gmail.com> wrote:  
> >  
> >  Thanks for the KIP. I have put my comments below.
> > 
> > This is a nice improvement to avoid cumbersome maintenance.
> > 
> > >> The following is the requirements this KIP is trying to accomplish:
> >   The ability to add and remove the preferred leader deprioritized
> > list/blacklist. e.g. new ZK path/node or new dynamic config.
> > 
> > This can be moved to the"Proposed changes" section.
> > 
> > >>The logic to determine the priority/order of which broker should be
> > preferred leader should be modified.  The broker in the preferred leader
> > blacklist should be moved to the end (lowest priority) when
> > determining leadership.
> > 
> > I believe there is no change required in the ordering of the preferred
> > replica list. Brokers in the preferred leader blacklist are skipped
> > until other brokers int he list are unavailable.
> > 
> > >>The blacklist can be at the broker level. However, there might be use cases
> > where a specific topic should blacklist particular brokers, which
> > would be at the
> > Topic level Config. For this use cases of this KIP, it seems that broker level
> > blacklist would suffice.  Topic level preferred leader blacklist might
> > be future enhancement work.
> > 
> > I agree that the broker level preferred leader blacklist would be
> > sufficient. Do you have any use cases which require topic level
> > preferred blacklist?
> > 
> > You can add the below workaround as an item in the rejected alternatives section
> > "Reassigning all the topic/partitions which the intended broker is a
> > replica for."
> > 
> > Thanks,
> > Satish.
> > 
> > On Fri, Jul 19, 2019 at 7:33 AM Stanislav Kozlovski
> > <st...@confluent.io> wrote:
> > >
> > > Hey George,
> > >
> > > Thanks for the KIP, it's an interesting idea.
> > >
> > > I was wondering whether we could achieve the same thing via the
> > > kafka-reassign-partitions tool. As you had also said in the JIRA,  it is
> > > true that this is currently very tedious with the tool. My thoughts are
> > > that we could improve the tool and give it the notion of a "blacklisted
> > > preferred leader".
> > > This would have some benefits like:
> > > - more fine-grained control over the blacklist. we may not want to
> > > blacklist all the preferred leaders, as that would make the blacklisted
> > > broker a follower of last resort which is not very useful. In the cases of
> > > an underpowered AWS machine or a controller, you might overshoot and make
> > > the broker very underutilized if you completely make it leaderless.
> > > - is not permanent. If we are to have a blacklist leaders config,
> > > rebalancing tools would also need to know about it and manipulate/respect
> > > it to achieve a fair balance.
> > > It seems like both problems are tied to balancing partitions, it's just
> > > that KIP-491's use case wants to balance them against other factors in a
> > > more nuanced way. It makes sense to have both be done from the same place
> > >
> > > To make note of the motivation section:
> > > > Avoid bouncing broker in order to lose its leadership
> > > The recommended way to make a broker lose its leadership is to run a
> > > reassignment on its partitions
> > > > The cross-data center cluster has AWS cloud instances which have less
> > > computing power
> > > We recommend running Kafka on homogeneous machines. It would be cool if the
> > > system supported more flexibility in that regard but that is more nuanced
> > > and a preferred leader blacklist may not be the best first approach to the
> > > issue
> > >
> > > Adding a new config which can fundamentally change the way replication is
> > > done is complex, both for the system (the replication code is complex
> > > enough) and the user. Users would have another potential config that could
> > > backfire on them - e.g if left forgotten.
> > >
> > > Could you think of any downsides to implementing this functionality (or a
> > > variation of it) in the kafka-reassign-partitions.sh tool?
> > > One downside I can see is that we would not have it handle new partitions
> > > created after the "blacklist operation". As a first iteration I think that
> > > may be acceptable
> > >
> > > Thanks,
> > > Stanislav
> > >
> > > On Fri, Jul 19, 2019 at 3:20 AM George Li <sq...@yahoo.com.invalid>
> > > wrote:
> > >
> > > >  Hi,
> > > >
> > > > Pinging the list for the feedbacks of this KIP-491  (
> > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120736982
> > > > )
> > > >
> > > >
> > > > Thanks,
> > > > George
> > > >
> > > >    On Saturday, July 13, 2019, 08:43:25 PM PDT, George Li <
> > > > sql_consulting@yahoo.com.INVALID> wrote:
> > > >
> > > >  Hi,
> > > >
> > > > I have created KIP-491 (
> > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120736982)
> > > > for putting a broker to the preferred leader blacklist or deprioritized
> > > > list so when determining leadership,  it's moved to the lowest priority for
> > > > some of the listed use-cases.
> > > >
> > > > Please provide your comments/feedbacks.
> > > >
> > > > Thanks,
> > > > George
> > > >
> > > >
> > > >
> > > >  ----- Forwarded Message ----- From: Jose Armando Garcia Sancio (JIRA) <
> > > > jira@apache.org>To: "sql_consulting@yahoo.com" <sq...@yahoo.com>Sent:
> > > > Tuesday, July 9, 2019, 01:06:05 PM PDTSubject: [jira] [Commented]
> > > > (KAFKA-8638) Preferred Leader Blacklist (deprioritized list)
> > > >
> > > >    [
> > > > https://issues.apache.org/jira/browse/KAFKA-8638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881511#comment-16881511
> > > > ]
> > > >
> > > > Jose Armando Garcia Sancio commented on KAFKA-8638:
> > > > ---------------------------------------------------
> > > >
> > > > Thanks for feedback and clear use cases [~sql_consulting].
> > > >
> > > > > Preferred Leader Blacklist (deprioritized list)
> > > > > -----------------------------------------------
> > > > >
> > > > >                Key: KAFKA-8638
> > > > >                URL: https://issues.apache.org/jira/browse/KAFKA-8638
> > > > >            Project: Kafka
> > > > >          Issue Type: Improvement
> > > > >          Components: config, controller, core
> > > > >    Affects Versions: 1.1.1, 2.3.0, 2.2.1
> > > > >            Reporter: GEORGE LI
> > > > >            Assignee: GEORGE LI
> > > > >            Priority: Major
> > > > >
> > > > > Currently, the kafka preferred leader election will pick the broker_id
> > > > in the topic/partition replica assignments in a priority order when the
> > > > broker is in ISR. The preferred leader is the broker id in the first
> > > > position of replica. There are use-cases that, even the first broker in the
> > > > replica assignment is in ISR, there is a need for it to be moved to the end
> > > > of ordering (lowest priority) when deciding leadership during  preferred
> > > > leader election.
> > > > > Let’s use topic/partition replica (1,2,3) as an example. 1 is the
> > > > preferred leader.  When preferred leadership is run, it will pick 1 as the
> > > > leader if it's ISR, if 1 is not online and in ISR, then pick 2, if 2 is not
> > > > in ISR, then pick 3 as the leader. There are use cases that, even 1 is in
> > > > ISR, we would like it to be moved to the end of ordering (lowest priority)
> > > > when deciding leadership during preferred leader election.  Below is a list
> > > > of use cases:
> > > > > * (If broker_id 1 is a swapped failed host and brought up with last
> > > > segments or latest offset without historical data (There is another effort
> > > > on this), it's better for it to not serve leadership till it's caught-up.
> > > > > * The cross-data center cluster has AWS instances which have less
> > > > computing power than the on-prem bare metal machines.  We could put the AWS
> > > > broker_ids in Preferred Leader Blacklist, so on-prem brokers can be elected
> > > > leaders, without changing the reassignments ordering of the replicas.
> > > > > * If the broker_id 1 is constantly losing leadership after some time:
> > > > "Flapping". we would want to exclude 1 to be a leader unless all other
> > > > brokers of this topic/partition are offline.  The “Flapping” effect was
> > > > seen in the past when 2 or more brokers were bad, when they lost leadership
> > > > constantly/quickly, the sets of partition replicas they belong to will see
> > > > leadership constantly changing.  The ultimate solution is to swap these bad
> > > > hosts.  But for quick mitigation, we can also put the bad hosts in the
> > > > Preferred Leader Blacklist to move the priority of its being elected as
> > > > leaders to the lowest.
> > > > > *  If the controller is busy serving an extra load of metadata requests
> > > > and other tasks. we would like to put the controller's leaders to other
> > > > brokers to lower its CPU load. currently bouncing to lose leadership would
> > > > not work for Controller, because after the bounce, the controller fails
> > > > over to another broker.
> > > > > * Avoid bouncing broker in order to lose its leadership: it would be
> > > > good if we have a way to specify which broker should be excluded from
> > > > serving traffic/leadership (without changing the replica assignment
> > > > ordering by reassignments, even though that's quick), and run preferred
> > > > leader election.  A bouncing broker will cause temporary URP, and sometimes
> > > > other issues.  Also a bouncing of broker (e.g. broker_id 1) can temporarily
> > > > lose all its leadership, but if another broker (e.g. broker_id 2) fails or
> > > > gets bounced, some of its leaderships will likely failover to broker_id 1
> > > > on a replica with 3 brokers.  If broker_id 1 is in the blacklist, then in
> > > > such a scenario even broker_id 2 offline,  the 3rd broker can take
> > > > leadership.
> > > > > The current work-around of the above is to change the topic/partition's
> > > > replica reassignments to move the broker_id 1 from the first position to
> > > > the last position and run preferred leader election. e.g. (1, 2, 3) => (2,
> > > > 3, 1). This changes the replica reassignments, and we need to keep track of
> > > > the original one and restore if things change (e.g. controller fails over
> > > > to another broker, the swapped empty broker caught up). That’s a rather
> > > > tedious task.
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > This message was sent by Atlassian JIRA
> > > > (v7.6.3#76005)  

Re: [DISCUSS] KIP-491: Preferred Leader Deprioritized List (Temporary Blacklist)

Posted by Colin McCabe <cm...@apache.org>.
On Fri, Aug 2, 2019, at 20:02, George Li wrote:
>  Hi Colin,
> Thanks for looking into this KIP.  Sorry for the late response. been busy. 
> 
> If a cluster has MAMY topic partitions, moving this "blacklist" broker 
> to the end of replica list is still a rather "big" operation, involving 
> submitting reassignments.  The KIP-491 way of blacklist is much 
> simpler/easier and can undo easily without changing the replica 
> assignment ordering. 

Hi George,

Even if you have a way of blacklisting an entire broker all at once, you still would need to run a leader election for each partition where you want to move the leader off of the blacklisted broker.  So the operation is still O(N) in that sense-- you have to do something per partition.

In general, reassignment will get a lot easier and quicker once KIP-455 is implemented.  Reassignments that just change the order of preferred replicas for a specific partition should complete pretty much instantly.

I think it's simpler and easier just to have one source of truth for what the preferred replica is for a partition, rather than two.  So for me, the fact that the replica assignment ordering isn't changed is actually a big disadvantage of this KIP.  If you are a new user (or just an existing user that didn't read all of the documentation) and you just look at the replica assignment, you might be confused by why a particular broker wasn't getting any leaderships, even  though it appeared like it should.  More mechanisms mean more complexity for users and developers most of the time.

> Major use case for me, a failed broker got swapped with new hardware, 
> and starts up as empty (with latest offset of all partitions), the SLA 
> of retention is 1 day, so before this broker is up to be in-sync for 1 
> day, we would like to blacklist this broker from serving traffic. after 
> 1 day, the blacklist is removed and run preferred leader election.  
> This way, no need to run reassignments before/after.  This is the 
> "temporary" use-case.

What if we just add an option to the reassignment tool to generate a plan to move all the leaders off of a specific broker?  The tool could also run a leader election as well.  That would be a simple way of doing this without adding new mechanisms or broker-side configurations, etc.

> 
> There are use-cases that this Preferred Leader "blacklist" can be 
> somewhat permanent, as I explained in the AWS data center instances Vs. 
> on-premises data center bare metal machines (heterogenous hardware), 
> that the AWS broker_ids will be blacklisted.  So new topics created,  
> or existing topic expansion would not make them serve traffic even they 
> could be the preferred leader. 

I agree that it would be nice if we could treat some brokers differently for the purposes of placing replicas, selecting leaders, etc.  Right now, we don't have any way of implementing that without forking the broker.  I would support a new PlacementPolicy class that would close this gap.  But I don't think this KIP is flexible enough to fill this role.  For example, it can't prevent users from creating new single-replica topics that get put on the "bad" replica.  Perhaps we should reopen the discussion about https://cwiki.apache.org/confluence/display/KAFKA/KIP-201%3A+Rationalising+Policy+interfaces

regards,
Colin

> 
> Please let me know there are more question. 
> 
> 
> Thanks,
> George
> 
>     On Thursday, July 25, 2019, 08:38:28 AM PDT, Colin McCabe 
> <cm...@apache.org> wrote:  
>  
>  We still want to give the "blacklisted" broker the leadership if 
> nobody else is available.  Therefore, isn't putting a broker on the 
> blacklist pretty much the same as moving it to the last entry in the 
> replicas list and then triggering a preferred leader election?
> 
> If we want this to be undone after a certain amount of time, or under 
> certain conditions, that seems like something that would be more 
> effectively done by an external system, rather than putting all these 
> policies into Kafka.
> 
> best,
> Colin
> 
> 
> On Fri, Jul 19, 2019, at 18:23, George Li wrote:
> >  Hi Satish,
> > Thanks for the reviews and feedbacks.
> > 
> > > > The following is the requirements this KIP is trying to accomplish:
> > > This can be moved to the"Proposed changes" section.
> > 
> > Updated the KIP-491. 
> > 
> > > >>The logic to determine the priority/order of which broker should be
> > > preferred leader should be modified.  The broker in the preferred leader
> > > blacklist should be moved to the end (lowest priority) when
> > > determining leadership.
> > >
> > > I believe there is no change required in the ordering of the preferred
> > > replica list. Brokers in the preferred leader blacklist are skipped
> > > until other brokers int he list are unavailable.
> > 
> > Yes. partition assignment remained the same, replica & ordering. The 
> > blacklist logic can be optimized during implementation. 
> > 
> > > >>The blacklist can be at the broker level. However, there might be use cases
> > > where a specific topic should blacklist particular brokers, which
> > > would be at the
> > > Topic level Config. For this use cases of this KIP, it seems that broker level
> > > blacklist would suffice.  Topic level preferred leader blacklist might
> > > be future enhancement work.
> > > 
> > > I agree that the broker level preferred leader blacklist would be
> > > sufficient. Do you have any use cases which require topic level
> > > preferred blacklist?
> > 
> > 
> > 
> > I don't have any concrete use cases for Topic level preferred leader 
> > blacklist.  One scenarios I can think of is when a broker has high CPU 
> > usage, trying to identify the big topics (High MsgIn, High BytesIn, 
> > etc), then try to move the leaders away from this broker,  before doing 
> > an actual reassignment to change its preferred leader,  try to put this 
> > preferred_leader_blacklist in the Topic Level config, and run preferred 
> > leader election, and see whether CPU decreases for this broker,  if 
> > yes, then do the reassignments to change the preferred leaders to be 
> > "permanent" (the topic may have many partitions like 256 that has quite 
> > a few of them having this broker as preferred leader).  So this Topic 
> > Level config is an easy way of doing trial and check the result. 
> > 
> > 
> > > You can add the below workaround as an item in the rejected alternatives section
> > > "Reassigning all the topic/partitions which the intended broker is a
> > > replica for."
> > 
> > Updated the KIP-491. 
> > 
> > 
> > 
> > Thanks, 
> > George
> > 
> >    On Friday, July 19, 2019, 08:20:22 AM PDT, Satish Duggana 
> > <sa...@gmail.com> wrote:  
> >  
> >  Thanks for the KIP. I have put my comments below.
> > 
> > This is a nice improvement to avoid cumbersome maintenance.
> > 
> > >> The following is the requirements this KIP is trying to accomplish:
> >   The ability to add and remove the preferred leader deprioritized
> > list/blacklist. e.g. new ZK path/node or new dynamic config.
> > 
> > This can be moved to the"Proposed changes" section.
> > 
> > >>The logic to determine the priority/order of which broker should be
> > preferred leader should be modified.  The broker in the preferred leader
> > blacklist should be moved to the end (lowest priority) when
> > determining leadership.
> > 
> > I believe there is no change required in the ordering of the preferred
> > replica list. Brokers in the preferred leader blacklist are skipped
> > until other brokers int he list are unavailable.
> > 
> > >>The blacklist can be at the broker level. However, there might be use cases
> > where a specific topic should blacklist particular brokers, which
> > would be at the
> > Topic level Config. For this use cases of this KIP, it seems that broker level
> > blacklist would suffice.  Topic level preferred leader blacklist might
> > be future enhancement work.
> > 
> > I agree that the broker level preferred leader blacklist would be
> > sufficient. Do you have any use cases which require topic level
> > preferred blacklist?
> > 
> > You can add the below workaround as an item in the rejected alternatives section
> > "Reassigning all the topic/partitions which the intended broker is a
> > replica for."
> > 
> > Thanks,
> > Satish.
> > 
> > On Fri, Jul 19, 2019 at 7:33 AM Stanislav Kozlovski
> > <st...@confluent.io> wrote:
> > >
> > > Hey George,
> > >
> > > Thanks for the KIP, it's an interesting idea.
> > >
> > > I was wondering whether we could achieve the same thing via the
> > > kafka-reassign-partitions tool. As you had also said in the JIRA,  it is
> > > true that this is currently very tedious with the tool. My thoughts are
> > > that we could improve the tool and give it the notion of a "blacklisted
> > > preferred leader".
> > > This would have some benefits like:
> > > - more fine-grained control over the blacklist. we may not want to
> > > blacklist all the preferred leaders, as that would make the blacklisted
> > > broker a follower of last resort which is not very useful. In the cases of
> > > an underpowered AWS machine or a controller, you might overshoot and make
> > > the broker very underutilized if you completely make it leaderless.
> > > - is not permanent. If we are to have a blacklist leaders config,
> > > rebalancing tools would also need to know about it and manipulate/respect
> > > it to achieve a fair balance.
> > > It seems like both problems are tied to balancing partitions, it's just
> > > that KIP-491's use case wants to balance them against other factors in a
> > > more nuanced way. It makes sense to have both be done from the same place
> > >
> > > To make note of the motivation section:
> > > > Avoid bouncing broker in order to lose its leadership
> > > The recommended way to make a broker lose its leadership is to run a
> > > reassignment on its partitions
> > > > The cross-data center cluster has AWS cloud instances which have less
> > > computing power
> > > We recommend running Kafka on homogeneous machines. It would be cool if the
> > > system supported more flexibility in that regard but that is more nuanced
> > > and a preferred leader blacklist may not be the best first approach to the
> > > issue
> > >
> > > Adding a new config which can fundamentally change the way replication is
> > > done is complex, both for the system (the replication code is complex
> > > enough) and the user. Users would have another potential config that could
> > > backfire on them - e.g if left forgotten.
> > >
> > > Could you think of any downsides to implementing this functionality (or a
> > > variation of it) in the kafka-reassign-partitions.sh tool?
> > > One downside I can see is that we would not have it handle new partitions
> > > created after the "blacklist operation". As a first iteration I think that
> > > may be acceptable
> > >
> > > Thanks,
> > > Stanislav
> > >
> > > On Fri, Jul 19, 2019 at 3:20 AM George Li <sq...@yahoo.com.invalid>
> > > wrote:
> > >
> > > >  Hi,
> > > >
> > > > Pinging the list for the feedbacks of this KIP-491  (
> > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120736982
> > > > )
> > > >
> > > >
> > > > Thanks,
> > > > George
> > > >
> > > >    On Saturday, July 13, 2019, 08:43:25 PM PDT, George Li <
> > > > sql_consulting@yahoo.com.INVALID> wrote:
> > > >
> > > >  Hi,
> > > >
> > > > I have created KIP-491 (
> > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120736982)
> > > > for putting a broker to the preferred leader blacklist or deprioritized
> > > > list so when determining leadership,  it's moved to the lowest priority for
> > > > some of the listed use-cases.
> > > >
> > > > Please provide your comments/feedbacks.
> > > >
> > > > Thanks,
> > > > George
> > > >
> > > >
> > > >
> > > >  ----- Forwarded Message ----- From: Jose Armando Garcia Sancio (JIRA) <
> > > > jira@apache.org>To: "sql_consulting@yahoo.com" <sq...@yahoo.com>Sent:
> > > > Tuesday, July 9, 2019, 01:06:05 PM PDTSubject: [jira] [Commented]
> > > > (KAFKA-8638) Preferred Leader Blacklist (deprioritized list)
> > > >
> > > >    [
> > > > https://issues.apache.org/jira/browse/KAFKA-8638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881511#comment-16881511
> > > > ]
> > > >
> > > > Jose Armando Garcia Sancio commented on KAFKA-8638:
> > > > ---------------------------------------------------
> > > >
> > > > Thanks for feedback and clear use cases [~sql_consulting].
> > > >
> > > > > Preferred Leader Blacklist (deprioritized list)
> > > > > -----------------------------------------------
> > > > >
> > > > >                Key: KAFKA-8638
> > > > >                URL: https://issues.apache.org/jira/browse/KAFKA-8638
> > > > >            Project: Kafka
> > > > >          Issue Type: Improvement
> > > > >          Components: config, controller, core
> > > > >    Affects Versions: 1.1.1, 2.3.0, 2.2.1
> > > > >            Reporter: GEORGE LI
> > > > >            Assignee: GEORGE LI
> > > > >            Priority: Major
> > > > >
> > > > > Currently, the kafka preferred leader election will pick the broker_id
> > > > in the topic/partition replica assignments in a priority order when the
> > > > broker is in ISR. The preferred leader is the broker id in the first
> > > > position of replica. There are use-cases that, even the first broker in the
> > > > replica assignment is in ISR, there is a need for it to be moved to the end
> > > > of ordering (lowest priority) when deciding leadership during  preferred
> > > > leader election.
> > > > > Let’s use topic/partition replica (1,2,3) as an example. 1 is the
> > > > preferred leader.  When preferred leadership is run, it will pick 1 as the
> > > > leader if it's ISR, if 1 is not online and in ISR, then pick 2, if 2 is not
> > > > in ISR, then pick 3 as the leader. There are use cases that, even 1 is in
> > > > ISR, we would like it to be moved to the end of ordering (lowest priority)
> > > > when deciding leadership during preferred leader election.  Below is a list
> > > > of use cases:
> > > > > * (If broker_id 1 is a swapped failed host and brought up with last
> > > > segments or latest offset without historical data (There is another effort
> > > > on this), it's better for it to not serve leadership till it's caught-up.
> > > > > * The cross-data center cluster has AWS instances which have less
> > > > computing power than the on-prem bare metal machines.  We could put the AWS
> > > > broker_ids in Preferred Leader Blacklist, so on-prem brokers can be elected
> > > > leaders, without changing the reassignments ordering of the replicas.
> > > > > * If the broker_id 1 is constantly losing leadership after some time:
> > > > "Flapping". we would want to exclude 1 to be a leader unless all other
> > > > brokers of this topic/partition are offline.  The “Flapping” effect was
> > > > seen in the past when 2 or more brokers were bad, when they lost leadership
> > > > constantly/quickly, the sets of partition replicas they belong to will see
> > > > leadership constantly changing.  The ultimate solution is to swap these bad
> > > > hosts.  But for quick mitigation, we can also put the bad hosts in the
> > > > Preferred Leader Blacklist to move the priority of its being elected as
> > > > leaders to the lowest.
> > > > > *  If the controller is busy serving an extra load of metadata requests
> > > > and other tasks. we would like to put the controller's leaders to other
> > > > brokers to lower its CPU load. currently bouncing to lose leadership would
> > > > not work for Controller, because after the bounce, the controller fails
> > > > over to another broker.
> > > > > * Avoid bouncing broker in order to lose its leadership: it would be
> > > > good if we have a way to specify which broker should be excluded from
> > > > serving traffic/leadership (without changing the replica assignment
> > > > ordering by reassignments, even though that's quick), and run preferred
> > > > leader election.  A bouncing broker will cause temporary URP, and sometimes
> > > > other issues.  Also a bouncing of broker (e.g. broker_id 1) can temporarily
> > > > lose all its leadership, but if another broker (e.g. broker_id 2) fails or
> > > > gets bounced, some of its leaderships will likely failover to broker_id 1
> > > > on a replica with 3 brokers.  If broker_id 1 is in the blacklist, then in
> > > > such a scenario even broker_id 2 offline,  the 3rd broker can take
> > > > leadership.
> > > > > The current work-around of the above is to change the topic/partition's
> > > > replica reassignments to move the broker_id 1 from the first position to
> > > > the last position and run preferred leader election. e.g. (1, 2, 3) => (2,
> > > > 3, 1). This changes the replica reassignments, and we need to keep track of
> > > > the original one and restore if things change (e.g. controller fails over
> > > > to another broker, the swapped empty broker caught up). That’s a rather
> > > > tedious task.
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > This message was sent by Atlassian JIRA
> > > > (v7.6.3#76005)