You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@kafka.apache.org by "Jason Gustafson (Jira)" <ji...@apache.org> on 2019/12/30 22:22:00 UTC

[jira] [Resolved] (KAFKA-3944) After the broker restart, fetchers stopped due to a delayed controlled shutdown message

     [ https://issues.apache.org/jira/browse/KAFKA-3944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Gustafson resolved KAFKA-3944.
------------------------------------
    Resolution: Fixed

> After the broker restart, fetchers stopped due to a delayed controlled shutdown message
> ---------------------------------------------------------------------------------------
>
>                 Key: KAFKA-3944
>                 URL: https://issues.apache.org/jira/browse/KAFKA-3944
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>            Reporter: Maysam Yabandeh
>            Priority: Minor
>              Labels: reliability
>
> The symptom is that cluster reports under-replicated blocks and some replicas do not seem to catch up ever. It turns out that the corresponding fetchers in those brokers were stopped shortly after the broker's restart. The broker had stopped the fetcher upon receiving stop-replica request from the controller. The controller had issued those request upon processing controlled shutdown request form the same broker. However those requests were all sent before the broker restart but the controller processed them after. Here is the timeline:
> # broker sends controlled shutdown message to controller
> # the process fails and the broker proceeds with an unclean shutdown
> # the broker is restated
> # the controller processes the perviously sent controlled shutdown messages
> # the controller sends stop replica messages to the broker
> # the broker shuts down the fetchers, while it has no intent to shut down again
> # this leads to under-replicated blocks
> Example from logs:
> {code}
> broker19.com:/var/log/kafka$ grep "Retrying controlled shutdow\|unclean shutdown" server.log.2016-07-07.2 
> 2016-07-07 15:58:10,818 WARN server.KafkaServer: [Kafka Server 19], Retrying controlled shutdown after the previous attempt failed...
> 2016-07-07 15:58:45,887 WARN server.KafkaServer: [Kafka Server 19], Retrying controlled shutdown after the previous attempt failed...
> 2016-07-07 15:59:20,927 WARN server.KafkaServer: [Kafka Server 19], Retrying controlled shutdown after the previous attempt failed...
> 2016-07-07 15:59:20,929 WARN server.KafkaServer: [Kafka Server 19], Proceeding to do an unclean shutdown as all the controlled shutdown attempts failed
> broker19.com:/var/log/kafka$ head -1 server.log.2016-07-07.3
> 2016-07-07 16:00:23,191 INFO server.KafkaConfig: KafkaConfig values: 
> {code}
> {code}
> broker13.com:/var/log/kafka$ grep "Shutting down broker 19" controller.log.2016-07-07.1 
> 2016-07-07 15:57:35,822 INFO controller.KafkaController: [Controller 13]: Shutting down broker 19
> 2016-07-07 16:02:45,526 INFO controller.KafkaController: [Controller 13]: Shutting down broker 19
> 2016-07-07 16:05:42,432 INFO controller.KafkaController: [Controller 13]: Shutting down broker 19
> {code}
> which resulted into many stop replica request to broker 19:
> {code}
> broker13.com:/var/log/kafka$ grep "The stop replica request (delete = false) sent to broker 19 is" controller.log.2016-07-07.1 | tail -1
> 2016-07-07 16:06:02,374 DEBUG controller.ControllerBrokerRequestBatch: The stop replica request (delete = false) sent to broker 19 is [Topic=topic-xyz,Partition=6,Replica=19]
> {code}
> broker 19 processes them AFTER its restart:
> {code}
> broker19.com:/var/log/kafka$ grep "handling stop replica (delete=false) for partition .topic-xzy,3." state-change.log.2016-07-07.2 
> 2016-07-07 16:06:00,154 TRACE change.logger: Broker 19 handling stop replica (delete=false) for partition [topic-xzy,3]
> 2016-07-07 16:06:00,154 TRACE change.logger: Broker 19 finished handling stop replica (delete=false) for partition [topic-xyz,3]
> 2016-07-07 16:06:00,155 TRACE change.logger: Broker 19 handling stop replica (delete=false) for partition [topic-xyz,3]
> 2016-07-07 16:06:00,155 TRACE change.logger: Broker 19 finished handling stop replica (delete=false) for partition [topic-xyz,3]
> {code}
> and removes the fetchers:
> {code}
> broker19.com:/var/log/kafka$ grep "Removed fetcher.*topic-xyz.3" server.log.2016-07-07.3 | tail -2
> 2016-07-07 16:06:00,154 INFO server.ReplicaFetcherManager: [ReplicaFetcherManager on broker 19] Removed fetcher for partitions [topic-xyz,3]
> 2016-07-07 16:06:00,155 INFO server.ReplicaFetcherManager: [ReplicaFetcherManager on broker 19] Removed fetcher for partitions [topic-xyz,3]
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)