
Posted to users@kafka.apache.org by Anatoliy Soldatov <ak...@avito.ru.INVALID> on 2019/09/10 10:31:12 UTC

Possible improvement of controller failure detection mechanism

Hello!

We are running Kafka 2.1.1. Yesterday we accidentally corrupted the DNS record for our controller broker. As a result, the broker was not reachable from outside, but it could still connect to the ZooKeeper cluster (the ZooKeeper nodes were on remote servers) and to the other Kafka brokers (also on remote servers). Thus, ZooKeeper did not delete the controller's ephemeral node, and controller re-election was not triggered. Leadership of the partitions on that broker was not reassigned either, because the broker continued to report itself as healthy.

To sum up, clients and replicas experienced many connection timeouts for some partitions of the Kafka cluster (those whose leaders were on the affected broker), and we effectively lost a broker for some time because it was inaccessible. Nevertheless, the Kafka cluster did not react in any way and continued to report a healthy state.

I believe this is not a bug, but Kafka's behavior could be improved (for example, with heartbeats from ZooKeeper to the Kafka brokers, or some kind of acknowledgement that a broker is really accessible from the outside world). I am interested in the community's opinion.
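
To illustrate the kind of outside-in check meant here, below is a minimal sketch of an external reachability probe. It is not part of Kafka and does not trigger controller re-election. It only assumes that a monitoring host outside the cluster knows each broker's advertised host and port, and that a successful DNS lookup plus TCP connect is a reasonable proxy for "accessible from the outside world". The function names and the broker addresses in the usage note are hypothetical.

```python
import socket


def is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if `host` resolves and accepts a TCP connection on `port`."""
    try:
        # create_connection performs both the DNS lookup and the TCP handshake,
        # so a broken DNS record fails here just like a dead broker would.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # OSError covers DNS failures (socket.gaierror), refused connections,
        # unreachable networks, and timeouts.
        return False


def unreachable_brokers(advertised, timeout: float = 3.0):
    """Return the (host, port) pairs that could not be reached.

    `advertised` is an iterable of (host, port) tuples, as a client would
    see them (hypothetical input, e.g. copied from the broker configs).
    """
    return [(h, p) for h, p in advertised if not is_reachable(h, p, timeout)]
```

Run from a machine outside the cluster, something like `unreachable_brokers([("kafka-1.example.com", 9092), ("kafka-2.example.com", 9092)])` would list the brokers that clients cannot reach, even while those brokers still hold a live ZooKeeper session. What to do with that list (alerting, a controlled shutdown of the broker) is left open.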

Regards,
Tolya
