Posted to jira@kafka.apache.org by "GEORGE LI (Jira)" <ji...@apache.org> on 2020/04/06 02:27:00 UTC
[jira] [Comment Edited] (KAFKA-4084) automated leader rebalance causes replication downtime for clusters with too many partitions

    [ https://issues.apache.org/jira/browse/KAFKA-4084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17076006#comment-17076006 ] 

GEORGE LI edited comment on KAFKA-4084 at 4/6/20, 2:26 AM:
-----------------------------------------------------------

[~blodsbror]

I had some free time this weekend to troubleshoot and found that after 1.1 (at least in 2.4) the controller code has an optimization for PLE: it does not run PLE at all if current_leader == head of the replica list. That caused my unit/integration tests to fail, so I have patched that as well. I have landed my code change on my repo's feature branch, 2.4-leader-deprioritized-list (based on the 2.4 branch).
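
To illustrate why that optimization can get in the way, here is a minimal Python sketch of the check. It is not the actual controller code (which is Scala), and the function name, the `deprioritized` parameter, and the rule that deprioritized brokers move to the back of the candidate order are all assumptions made for illustration; the real patch may work differently.

```python
def needs_preferred_leader_election(current_leader, replicas, deprioritized=frozenset()):
    """Return True if an election would pick a different leader.

    Hypothetical model: `replicas` is the ordered replica assignment of one
    partition, and its head is the preferred leader. Brokers listed in
    `deprioritized` are assumed to be moved to the back of that order.
    """
    # Keep the original relative order inside each group.
    preferred_order = [b for b in replicas if b not in deprioritized]
    preferred_order += [b for b in replicas if b in deprioritized]
    # Without a deprioritized list this reduces to current_leader != replicas[0],
    # which is the "skip PLE" shortcut described above.
    return current_leader != preferred_order[0]
```

Under these assumptions, a partition led by broker 1 with replicas [1, 2, 3] is skipped by the plain check, but is no longer skipped once broker 1 is deprioritized.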

The detailed installation and testing steps are in this [Google doc|https://docs.google.com/document/d/1ZuOcYTSuCAqCut_hjI_EY3lA9W7BuIlHVUdOUcSpWww/edit]. Please let me know if you have issues with the patch or the testing. If you cannot view the doc, please click the request-access button or send me your email address so I can add you to the share. My email: sqlconsulting@gmail.com

Please keep us posted with your testing results. 

Thanks,
George




> automated leader rebalance causes replication downtime for clusters with too many partitions
> --------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-4084
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4084
>             Project: Kafka
>          Issue Type: Bug
>          Components: controller
>    Affects Versions: 0.8.2.2, 0.9.0.0, 0.9.0.1, 0.10.0.0, 0.10.0.1
>            Reporter: Tom Crayford
>            Priority: Major
>              Labels: reliability
>             Fix For: 1.1.0
>
>
> If you enable {{auto.leader.rebalance.enable}} (which is on by default), and you have a cluster with many partitions, there is a severe amount of replication downtime following a restart. This causes `UnderReplicatedPartitions` to fire, and replication is paused.
> This is because the current automated leader rebalance mechanism changes leaders for *all* imbalanced partitions at once, instead of doing it gradually. This effectively stops all replica fetchers in the cluster (assuming there are enough imbalanced partitions), and restarts them. This can take minutes on busy clusters, during which no replication is happening and user data is at risk. Clients with {{acks=-1}} also see issues at this time, because replication is effectively stalled.
> To quote Todd Palino from the mailing list:
> bq. There is an admin CLI command to trigger the preferred replica election manually. There is also a broker configuration “auto.leader.rebalance.enable” which you can set to have the broker automatically perform the PLE when needed. DO NOT USE THIS OPTION. There are serious performance issues when doing so, especially on larger clusters. It needs some development work that has not been fully identified yet.
> This setting is extremely useful for smaller clusters, but with high partition counts it causes the severe issues described above.
> One potential fix could be adding a new configuration for the maximum number of partitions to rebalance leaders for at once, and to *stop* starting new ones once that many leader rebalances are in flight, until they're done. There may be better mechanisms, and I'd love to hear if anybody has any ideas.
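
The batching idea proposed in the description can be sketched roughly as follows. This is only an illustration in Python, not Kafka code and not the fix that shipped in 1.1.0: `rebalance_in_batches`, `max_in_flight`, `elect`, and `wait_until_done` are hypothetical names, and the two callbacks stand in for whatever the controller would use to trigger an election and to detect its completion.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Partition = Tuple[str, int]  # (topic, partition id)


def batches(items: Iterable[Partition], size: int) -> Iterator[List[Partition]]:
    """Yield consecutive chunks of at most `size` partitions."""
    if size < 1:
        raise ValueError("size must be at least 1")
    batch: List[Partition] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def rebalance_in_batches(
    imbalanced: Iterable[Partition],
    max_in_flight: int,
    elect: Callable[[Partition], None],
    wait_until_done: Callable[[List[Partition]], None],
) -> int:
    """Move leaders for at most `max_in_flight` partitions at a time.

    Returns the number of partitions processed.
    """
    completed = 0
    for batch in batches(imbalanced, max_in_flight):
        for partition in batch:
            elect(partition)        # start the leader change (hypothetical callback)
        wait_until_done(batch)      # block until this batch settles (hypothetical callback)
        completed += len(batch)
    return completed
```

With five imbalanced partitions and `max_in_flight=2`, this would start two elections, wait, start two more, wait, and then handle the last one, so only a bounded number of replica fetchers is disturbed at any moment.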



--
This message was sent by Atlassian Jira
(v8.3.4#803005)