Posted to issues@ambari.apache.org by "Hari Sekhon (JIRA)" <ji...@apache.org> on 2018/12/12 11:29:00 UTC

[jira] [Updated] (AMBARI-24719) Kafka Rolling Restart causes outage(s)

     [ https://issues.apache.org/jira/browse/AMBARI-24719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hari Sekhon updated AMBARI-24719:
---------------------------------
    Description: 
Ambari causes Kafka topic partition outages during rolling restarts because it only does a simplistic 2 minute wait between brokers and doesn't check the state of partition replicas before taking another broker down.

On busy Kafka clusters with lots of topics / partitions / data it might take a while before in-sync replicas recover.

Ambari should therefore check for any under-replicated partitions and wait as long as it takes for them to recover before proceeding to the next broker. There is, however, a complication: if there is a topic partition with a replica that no longer exists (e.g. ambari_kafka_service_check), it will never recover, so there needs to be some thoughtful handling around that.

This might be solved by AMBARI-24203, but I'm not sure whether it is tied in properly to the rolling restarts, what its timeout policy or time interval is, or whether it takes the above paragraph into account.

  was:
Ambari causes Kafka topic partition outages during rolling restarts because it only does a simplistic 2 minute wait between brokers and doesn't check the state of partition replicas before taking another broker down.

On busty Kafka clusters with lots topics / partitions / data it might take a while before in-sync replicas recover.

Ambari should therefore check for under any replicated partitions and wait as long as it takes for them to recover before proceeding to the next broker. There is however an issue in doing so which is there is a topic partition with a replica that no longer exists (eg. ambari_kafka_service_check) then it will never recover so there needs to be some thoughtful handling around that.

This might be solved by AMBARI-24203 but I'm not sure it is tied in properly to the rolling restarts or what the timeout policy or time interval is for it, or whether it takes the above paragraph in to account.


> Kafka Rolling Restart causes outage(s)
> --------------------------------------
>
>                 Key: AMBARI-24719
>                 URL: https://issues.apache.org/jira/browse/AMBARI-24719
>             Project: Ambari
>          Issue Type: Improvement
>          Components: ambari-server
>    Affects Versions: 2.6.2
>            Reporter: Hari Sekhon
>            Priority: Critical
>
> Ambari causes Kafka topic partition outages during rolling restarts because it only does a simplistic 2 minute wait between brokers and doesn't check the state of partition replicas before taking another broker down.
> On busy Kafka clusters with lots of topics / partitions / data it might take a while before in-sync replicas recover.
> Ambari should therefore check for any under-replicated partitions and wait as long as it takes for them to recover before proceeding to the next broker. There is, however, a complication: if there is a topic partition with a replica that no longer exists (e.g. ambari_kafka_service_check), it will never recover, so there needs to be some thoughtful handling around that.
> This might be solved by AMBARI-24203, but I'm not sure whether it is tied in properly to the rolling restarts, what its timeout policy or time interval is, or whether it takes the above paragraph into account.
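
To make the requested behaviour concrete, here is a minimal Python sketch of the wait loop described above. It is not Ambari's implementation and not the AMBARI-24203 change. All names (wait_for_full_replication, rolling_restart, list_under_replicated, ignored_topics, restart_broker) are hypothetical. The sketch assumes the caller supplies a function returning the currently under-replicated (topic, partition) pairs, for example by parsing the output of kafka-topics.sh --describe --under-replicated-partitions. Topics known never to recover are skipped via an ignore set, and a timeout halts the rollout instead of taking another broker down. Halting on timeout is one possible policy, chosen here only for illustration.

```python
import time
from typing import AbstractSet, Callable, Iterable, List, Tuple

Partition = Tuple[str, int]  # (topic name, partition id)


def wait_for_full_replication(
    list_under_replicated: Callable[[], Iterable[Partition]],
    ignored_topics: AbstractSet[str] = frozenset(),
    timeout_s: float = 1800.0,       # assumed default, not an Ambari setting
    poll_interval_s: float = 10.0,   # assumed default, not an Ambari setting
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Return True once no relevant partition is under-replicated, False on timeout."""
    deadline = clock() + timeout_s
    while True:
        # Drop partitions of topics that are known never to recover.
        pending = [p for p in list_under_replicated() if p[0] not in ignored_topics]
        if not pending:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval_s)


def rolling_restart(
    brokers: Iterable[str],
    restart_broker: Callable[[str], None],
    list_under_replicated: Callable[[], Iterable[Partition]],
    **wait_kwargs,
) -> List[str]:
    """Restart brokers one at a time; stop if replication does not recover."""
    restarted = []
    for broker in brokers:
        restart_broker(broker)
        restarted.append(broker)
        if not wait_for_full_replication(list_under_replicated, **wait_kwargs):
            break  # do not take another broker down while replicas lag
    return restarted
```

The list of brokers actually restarted is returned so that a caller can tell a completed rollout from one that stopped early.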



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)