Posted to issues@ambari.apache.org by "Hari Sekhon (JIRA)" <ji...@apache.org> on 2018/12/12 11:31:00 UTC

[jira] [Updated] (AMBARI-24719) Kafka Rolling Restart causes outage(s) due to not checking for under replicated partitions

     [ https://issues.apache.org/jira/browse/AMBARI-24719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hari Sekhon updated AMBARI-24719:
---------------------------------
    Summary: Kafka Rolling Restart causes outage(s) due to not checking for under replicated partitions  (was: Kafka Rolling Restart causes outage(s))

> Kafka Rolling Restart causes outage(s) due to not checking for under replicated partitions
> ------------------------------------------------------------------------------------------
>
>                 Key: AMBARI-24719
>                 URL: https://issues.apache.org/jira/browse/AMBARI-24719
>             Project: Ambari
>          Issue Type: Improvement
>          Components: ambari-server
>    Affects Versions: 2.6.2
>            Reporter: Hari Sekhon
>            Priority: Critical
>
> Ambari causes Kafka topic partition outages during rolling restarts because it only does a simplistic 2-minute wait between brokers and doesn't check the state of partition replicas before taking another broker down.
> On busy Kafka clusters with lots of topics / partitions / data it might take a while before in-sync replicas recover.
> Ambari should therefore check for any under-replicated partitions and wait as long as it takes for them to recover before proceeding to the next broker. There is, however, an issue in doing so: if there is a topic partition with a replica that no longer exists (e.g. ambari_kafka_service_check), it will never recover, so there needs to be some thoughtful handling around that.
> This might be solved by AMBARI-24203, but I'm not sure whether it is tied in properly to the rolling restarts, what its timeout policy or time interval is, or whether it takes the above paragraph into account.
> This could also have been easily offset if Ambari had proper extensible checking as raised in AMBARI-24381.
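
For illustration only, the gating behaviour requested above could be sketched roughly as follows. This is not Ambari code. The callables list_under_replicated and restart_broker are hypothetical placeholders, and two choices here are assumptions rather than anything stated in the issue: partitions that are already under-replicated before the restart starts are snapshotted and ignored (one possible way to handle replicas that will never recover), and a timeout is applied instead of waiting indefinitely.

```python
import time


def wait_until_replicated(list_under_replicated, known_broken=frozenset(),
                          timeout_s=1800, poll_s=10,
                          sleep=time.sleep, clock=time.monotonic):
    """Block until no recoverable partition is under-replicated.

    list_under_replicated: hypothetical callable returning an iterable of
        (topic, partition) tuples that are currently under-replicated.
    known_broken: partitions that were already under-replicated before the
        restart began; ignored because they may never recover (assumption).
    Returns True once replicas have recovered, False if timeout_s elapses.
    """
    deadline = clock() + timeout_s
    while True:
        pending = set(list_under_replicated()) - set(known_broken)
        if not pending:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)


def rolling_restart(brokers, restart_broker, list_under_replicated,
                    **wait_kwargs):
    """Restart brokers one at a time, gating each step on replica recovery.

    restart_broker is a hypothetical callable that restarts one broker.
    Returns the list of brokers that were actually restarted.
    """
    # Snapshot partitions that are broken before any broker is touched.
    known_broken = frozenset(list_under_replicated())
    restarted = []
    for broker in brokers:
        restart_broker(broker)
        restarted.append(broker)
        if not wait_until_replicated(list_under_replicated, known_broken,
                                     **wait_kwargs):
            break  # stop rather than take another broker down
    return restarted
```

In practice list_under_replicated might wrap the output of kafka-topics.sh --describe --under-replicated-partitions or a broker's UnderReplicatedPartitions metric. Which source to use, and what timeout is acceptable, is left open here.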



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)