Posted to jira@kafka.apache.org by "Ray Chiang (JIRA)" <ji...@apache.org> on 2018/07/19 20:17:00 UTC

[jira] [Updated] (KAFKA-6013) Controller getting stuck

     [ https://issues.apache.org/jira/browse/KAFKA-6013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ray Chiang updated KAFKA-6013:
------------------------------
    Component/s: controller

> Controller getting stuck
> ------------------------
>
>                 Key: KAFKA-6013
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6013
>             Project: Kafka
>          Issue Type: Bug
>          Components: controller
>    Affects Versions: 0.11.0.0, 0.11.0.1
>            Reporter: Ivan Babrou
>            Priority: Major
>              Labels: reliability
>
> It looks like a new issue in 0.11.0.0, and 0.11.0.1 still has it.
> We upgraded one of the clusters from 0.11.0.0 to 0.11.0.1 by shutting down 28 machines at once (a single rack). When the nodes came up, none of them progressed past these log lines:
> {noformat}
> Oct 05 02:17:42 mybroker14 kafka[32940]: INFO Kafka version : 0.11.0.1 (org.apache.kafka.common.utils.AppInfoParser)
> Oct 05 02:17:42 mybroker14 kafka[32940]: INFO Kafka commitId : c2a0d5f9b1f45bf5 (org.apache.kafka.common.utils.AppInfoParser)
> Oct 05 02:17:42 mybroker14 kafka[32940]: INFO [Kafka Server 10014], started (kafka.server.KafkaServer)
> {noformat}
> There was no indication in the controller's logs that it picked up the rebooted nodes. This happened multiple times during the upgrade: once per rack, plus a few more on top of that.
> Each reboot took ~20m; all nodes in a single rack rebooted in parallel.
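> For reference, a rough way to see what the controller should have registered (hostnames and ports below are placeholders for our setup, not commands captured during the incident):
> {noformat}
> # Which broker currently holds the controller role
> bin/zookeeper-shell.sh myzk1:2181 get /controller
> # Which broker ids are currently registered
> bin/zookeeper-shell.sh myzk1:2181 ls /brokers/ids
> {noformat}
> When the controller is healthy, a broker re-registering under /brokers/ids is followed by a "Newly added brokers" line in the controller log, like the one quoted further below.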
> The fix was to restart the controller node, but that did not go cleanly either:
> {noformat}
> ivan@mybroker26:~$ sudo journalctl --since 01:00 -u kafka | fgrep 'Error during controlled shutdown' -A1
> Oct 05 01:57:41 mybroker26 kafka[37409]: WARN [Kafka Server 10026], Error during controlled shutdown, possibly because leader movement took longer than the configured controller.socket.timeout.ms and/or request.timeout.ms: Connection to 10026 was disconnected before the response was read (kafka.server.KafkaServer)
> Oct 05 01:57:46 mybroker26 kafka[37409]: WARN [Kafka Server 10026], Retrying controlled shutdown after the previous attempt failed... (kafka.server.KafkaServer)
> --
> Oct 05 01:58:16 mybroker26 kafka[37409]: WARN [Kafka Server 10026], Error during controlled shutdown, possibly because leader movement took longer than the configured controller.socket.timeout.ms and/or request.timeout.ms: Connection to 10026 was disconnected before the response was read (kafka.server.KafkaServer)
> Oct 05 01:58:18 mybroker26 kafka[37409]: INFO Rolled new log segment for 'requests-40' in 3 ms. (kafka.log.Log)
> --
> Oct 05 01:58:51 mybroker26 kafka[37409]: WARN [Kafka Server 10026], Error during controlled shutdown, possibly because leader movement took longer than the configured controller.socket.timeout.ms and/or request.timeout.ms: Connection to 10026 was disconnected before the response was read (kafka.server.KafkaServer)
> Oct 05 01:58:56 mybroker26 kafka[37409]: WARN [Kafka Server 10026], Retrying controlled shutdown after the previous attempt failed... (kafka.server.KafkaServer)
> {noformat}
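> The warnings above point at two broker-side timeouts, and the retry loop is governed by the controlled shutdown retry settings. For reference (values below are only illustrative, not what we run in production):
> {noformat}
> # server.properties excerpt -- illustrative values only
> controller.socket.timeout.ms=30000         # socket timeout for controller-to-broker channels
> request.timeout.ms=30000                   # how long to wait for a response before failing the request
> controlled.shutdown.max.retries=3          # how many controlled shutdown attempts before giving up
> controlled.shutdown.retry.backoff.ms=5000  # pause between controlled shutdown attempts
> {noformat}
> Tuning these is probably beside the point here, though, since the fix ended up being a controller restart.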
> I'm unable to reproduce the issue by just restarting or even rebooting a single broker; the controller picks it up:
> {noformat}
> Oct 05 03:18:18 mybroker83 kafka[37402]: INFO [Controller 10083]: Newly added brokers: 10001, deleted brokers: , all live brokers: ...
> {noformat}
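> (That line is what I grep for on the controller to confirm it noticed a broker coming back, with something along these lines; unit and host names follow our setup:)
> {noformat}
> ivan@mybroker83:~$ sudo journalctl -u kafka --since 03:00 | fgrep 'Newly added brokers'
> {noformat}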
> KAFKA-5028 happened in 0.11.0.0, so it's likely related.
> cc [~ijuma]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)