Posted to dev@kafka.apache.org by "Gwen Shapira (JIRA)" <ji...@apache.org> on 2015/12/18 00:37:46 UTC

[jira] [Resolved] (KAFKA-2125) Infinite loop after controlled shutdown succeeds

     [ https://issues.apache.org/jira/browse/KAFKA-2125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gwen Shapira resolved KAFKA-2125.
---------------------------------
    Resolution: Cannot Reproduce

Closing this since we think this is resolved in trunk.

> Infinite loop after controlled shutdown succeeds
> ------------------------------------------------
>
>                 Key: KAFKA-2125
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2125
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8.2.1
>            Reporter: Jason Rosenberg
>            Priority: Blocker
>         Attachments: grep_shut_edited.log
>
>
> I have a 4 node cluster, running 0.8.2.1, that got into a bad state last night during a rolling restart.  The first node to be restarted was the controller.  Controlled Shutdown completed successfully, after about 800ms.  But after that, the server failed to shut down, and got into what appears to be an infinite cycle involving the 'Controller-45-to-broker-45-send-thread' and the 'kafka-scheduler-0' thread.  Ultimately, the shutdown timed out (after 3 minutes), the process was terminated by the deployment system, and the server was restarted.  As expected, when it came back up it took some time re-syncing its partitions, but eventually all was well.  However, I think there was something fundamentally wrong with the shutdown process.  For one thing, the controller didn't seem to give up controller status as part of the controlled shutdown (which I should think would be the first thing it should do?).
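> For what it's worth, here is a minimal sketch of the shutdown ordering I would have expected when the broker being stopped is also the active controller.  The names below (resignController, controlledShutdown, etc.) are purely illustrative and are not the actual KafkaServer code:
> {code}
> // Illustrative sketch only (not the real KafkaServer shutdown path):
> // the ordering I'd expect when the stopping broker is the active controller.
> object ShutdownOrderSketch {
>   // Stubs standing in for the real broker components.
>   def isActiveController: Boolean = true
>   def resignController(): Unit    = println("resign controller: release /controller, stop controller send threads")
>   def controlledShutdown(): Unit  = println("controlled shutdown: ask the (new) controller to migrate leadership away")
>   def stopRemainingComponents(): Unit = println("stop socket server, request handlers, log manager, ...")
>
>   def shutdownBroker(): Unit = {
>     if (isActiveController) resignController() // give up controller status first
>     controlledShutdown()                        // then move partition leadership off this broker
>     stopRemainingComponents()                   // only then tear the rest down
>   }
>
>   def main(args: Array[String]): Unit = shutdownBroker()
> }
> {code}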
> Anyway, below are some log snippets. I do have full logs from each broker in the cluster, which I can provide (but would have to significantly anonymize the logs before forwarding along).
> Controlled shutdown starts and succeeds:
> {code}
> 2015-04-14 05:56:10,134  INFO [Thread-38] server.KafkaServer - [Kafka Server 45], shutting down
> 2015-04-14 05:56:10,136  INFO [Thread-38] server.KafkaServer - [Kafka Server 45], Starting controlled shutdown
> 2015-04-14 05:56:10,150  INFO [kafka-request-handler-0] controller.KafkaController - [Controller 45]: Shutting down broker 45
> ...
> ...
> 2015-04-14 05:56:10,951  INFO [Thread-38] server.KafkaServer - [Kafka Server 45], Controlled shutdown succeeded
> {code}
> Subsequently, the 'Controller-45-to-broker-45-send-thread' starts repeatedly spamming the logs, like so:
> {code}
> 2015-04-14 05:56:11,281  WARN [Controller-45-to-broker-45-send-thread] controller.RequestSendThread - [Controller-45-to-broker-45-send-thread], Controller 45 epoch 21 fails to send request Name:LeaderAndIsrRequest;Version:0;Controller:45;ControllerEpoch:21;CorrelationId:305175;ClientId:id_45-host_null-port_12345;Leaders:id:48,host:broker48.square,port:12345;PartitionState:(mytopic,1) -> (LeaderAndIsrInfo:(Leader:48,ISR:48,LeaderEpoch:5,ControllerEpoch:21),ReplicationFactor:2),AllReplicas:45,48) to broker id:45,host:broker45.square,port:12345. Reconnecting to broker.
> java.nio.channels.ClosedChannelException
>         at kafka.network.BlockingChannel.send(BlockingChannel.scala:100)
>         at kafka.controller.RequestSendThread.liftedTree1$1(ControllerChannelManager.scala:132)
>         at kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:131)
>         at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:60)
> 2015-04-14 05:56:11,281  INFO [Controller-45-to-broker-45-send-thread] controller.RequestSendThread - [Controller-45-to-broker-45-send-thread], Controller 45 connected to id:45,host:broker45.square,port:12345 for sending state change requests
> 2015-04-14 05:56:11,582  WARN [Controller-45-to-broker-45-send-thread] controller.RequestSendThread - [Controller-45-to-broker-45-send-thread], Controller 45 epoch 21 fails to send request Name:LeaderAndIsrRequest;Version:0;Controller:45;ControllerEpoch:21;CorrelationId:305175;ClientId:id_45-host_null-port_12345;Leaders:id:48,host:broker48.square,port:12345;PartitionState:(mytopic,1) -> (LeaderAndIsrInfo:(Leader:48,ISR:48,LeaderEpoch:5,ControllerEpoch:21),ReplicationFactor:2),AllReplicas:45,48) to broker id:45,host:broker45.square,port:12345. Reconnecting to broker.
> java.nio.channels.ClosedChannelException
>         at kafka.network.BlockingChannel.send(BlockingChannel.scala:100)
>         at kafka.controller.RequestSendThread.liftedTree1$1(ControllerChannelManager.scala:132)
>         at kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:131)
>         at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:60)
> 2015-04-14 05:56:11,582  INFO [Controller-45-to-broker-45-send-thread] controller.RequestSendThread - [Controller-45-to-broker-45-send-thread], Controller 45 connected to id:45,host:broker45.square,port:12345 for sending state change requests
> {code}
> This seems to repeat every 300 ms or so, and continues until the server is forcibly shut down (about 3 minutes later in this case).  The partition being logged (anonymized here as 'mytopic,1') is always the same topic/partition in these repeated log messages, so perhaps that one partition got into a bad state.
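> From the stack trace, this looks like the send thread's retry loop: it keeps trying to push the same queued LeaderAndIsrRequest to broker 45 over a channel that is already closed, reconnects, backs off roughly 300ms, and tries again.  Below is a simplified sketch of that retry pattern, paraphrased from the stack trace rather than copied from the ControllerChannelManager source:
> {code}
> // Simplified paraphrase of what the send thread appears to be doing
> // (based on the stack trace above; not the verbatim 0.8.2 source).
> object SendRetrySketch {
>   @volatile var isRunning = true
>
>   // Stand-in for BlockingChannel.send: the target broker's socket server
>   // is already gone, so every attempt fails.
>   def sendOverBlockingChannel(): Unit =
>     throw new java.nio.channels.ClosedChannelException()
>
>   def doWork(): Unit = {
>     var sendSuccessful = false
>     while (isRunning && !sendSuccessful) {
>       try {
>         sendOverBlockingChannel()
>         sendSuccessful = true
>       } catch {
>         case e: java.io.IOException =>
>           // Warn, reconnect, back off ~300ms, then retry the same request,
>           // which matches the ~300ms cadence of the log spam above.
>           println(s"fails to send request, reconnecting to broker: $e")
>           Thread.sleep(300)
>       }
>     }
>   }
>
>   def main(args: Array[String]): Unit = {
>     val t = new Thread(new Runnable { def run(): Unit = doWork() })
>     t.start()
>     Thread.sleep(1000) // observe a few retries...
>     isRunning = false  // ...then stop the loop
>     t.join()
>   }
> }
> {code}
> As long as the controller machinery on broker 45 stays alive after its own socket server has gone away, a loop like that can never make progress.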
> Interspersed, there are floods of log entries like the ones below, repeated seemingly for every partition on the broker (the broker has about 450 partitions).  The broker is trying to shrink the ISR down to just itself (EVEN THOUGH IT JUST COMPLETED CONTROLLED SHUTDOWN for all partitions).  The following sequence goes by in bursts of about 500ms, once for each partition on the box, and then recurs every 10 seconds:
> {code}
> 2015-04-14 05:56:23,716  INFO [kafka-scheduler-0] cluster.Partition - Partition [mytopic2,0] on broker 45: Shrinking ISR for partition [mytopic2,0] from 46,45 to 45
> 2015-04-14 05:56:23,720  INFO [kafka-scheduler-0] cluster.Partition - Partition [mytopic2,0] on broker 45: Cached zkVersion [19] not equal to that in zookeeper, skip updating ISR
> 2015-04-14 05:56:23,720  INFO [kafka-scheduler-0] cluster.Partition - Partition [mytopic3,13] on broker 45: Shrinking ISR for partition [mytopic3,13] from 45,48 to 45
> 2015-04-14 05:56:23,723  INFO [kafka-scheduler-0] cluster.Partition - Partition [mytopic3,13] on broker 45: Cached zkVersion [8] not equal to that in zookeeper, skip updating ISR
> 2015-04-14 05:56:23,723  INFO [kafka-scheduler-0] cluster.Partition - Partition [mytopic4,12] on broker 45: Shrinking ISR for partition [mytopic4,12] from 46,45 to 45
> 2015-04-14 05:56:23,726  INFO [kafka-scheduler-0] cluster.Partition - Partition [mytopic4,12] on broker 45: Cached zkVersion [19] not equal to that in zookeeper, skip updating ISR
> {code}
> The above sequence is repeated every 10 seconds, it seems.
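> If I had to guess, this second loop is the periodic ISR-expiration task (which I believe runs every replica.lag.time.max.ms, 10 seconds by default, matching the recurrence above): broker 45 still thinks it leads these partitions, tries to shrink each ISR down to just itself, and the conditional write to ZooKeeper fails because its cached zkVersion is stale, so the cached state is never corrected and the same attempt repeats on the next run.  Here is a rough sketch of that check, with made-up names rather than the real kafka.cluster.Partition code:
> {code}
> // Illustrative sketch of the conditional-update pattern behind the
> // "Cached zkVersion [..] not equal to that in zookeeper" message
> // (made-up names, not the real kafka.cluster.Partition code).
> object IsrShrinkSketch {
>   final case class PartitionState(isr: Set[Int], zkVersion: Int)
>
>   // Pretend ZooKeeper already holds a newer version for this partition,
>   // e.g. because the new leader updated it after leadership moved away.
>   val zookeeperVersion = 20
>   var cached = PartitionState(isr = Set(46, 45), zkVersion = 19)
>
>   // Stand-in for a versioned (conditional) ZooKeeper write: it only
>   // succeeds if the caller's expected version matches what ZK holds.
>   def conditionalZkUpdate(newIsr: Set[Int], expectedVersion: Int): Boolean =
>     expectedVersion == zookeeperVersion
>
>   def maybeShrinkIsr(localBrokerId: Int): Unit = {
>     val newIsr = Set(localBrokerId)
>     println(s"Shrinking ISR from ${cached.isr.mkString(",")} to ${newIsr.mkString(",")}")
>     if (conditionalZkUpdate(newIsr, cached.zkVersion)) {
>       cached = PartitionState(newIsr, cached.zkVersion + 1)
>     } else {
>       // The stale cache is not refreshed on this path, so the same shrink
>       // attempt repeats on every run of the scheduled task.
>       println(s"Cached zkVersion [${cached.zkVersion}] not equal to that in zookeeper, skip updating ISR")
>     }
>   }
>
>   def main(args: Array[String]): Unit = maybeShrinkIsr(localBrokerId = 45)
> }
> {code}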
> So there appear to be 2 different non-terminating loops:
> 1. the Controller-45-to-broker-45-send-thread retrying every 300ms
> 2. the kafka-scheduler thread attempting to shrink the ISR for every partition on the box, every 10 seconds
> It seems clear that this sequence would recur indefinitely had the app not been killed hard.


