You are viewing a plain text version of this content. The canonical link for it is here.
Posted to jira@kafka.apache.org by "Guozhang Wang (JIRA)" <ji...@apache.org> on 2017/12/12 00:52:00 UTC

[jira] [Commented] (KAFKA-2319) After controlled shutdown: IllegalStateException: Kafka scheduler has not been started

    [ https://issues.apache.org/jira/browse/KAFKA-2319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16286884#comment-16286884 ] 

Guozhang Wang commented on KAFKA-2319:
--------------------------------------

I think this issue still exists, but probably in a slight different manner.

Here is what I observed while shutting down a server:

{code}
[2017-12-11 16:37:32,620] ERROR [kafka-log-cleaner-thread-0]: Error due to (kafka.log.LogCleaner$CleanerThread:73)
java.lang.IllegalStateException: Kafka scheduler is not running.
	at kafka.utils.KafkaScheduler.ensureRunning(KafkaScheduler.scala:132)
	at kafka.utils.KafkaScheduler.schedule(KafkaScheduler.scala:106)
	at kafka.log.Log.kafka$log$Log$$asyncDeleteSegment(Log.scala:1637)
	at kafka.log.Log$$anonfun$replaceSegments$1.apply(Log.scala:1682)
	at kafka.log.Log$$anonfun$replaceSegments$1.apply(Log.scala:1677)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at kafka.log.Log.replaceSegments(Log.scala:1677)
	at kafka.log.Cleaner.cleanSegments(LogCleaner.scala:489)
	at kafka.log.Cleaner$$anonfun$doClean$4.apply(LogCleaner.scala:400)
	at kafka.log.Cleaner$$anonfun$doClean$4.apply(LogCleaner.scala:399)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at kafka.log.Cleaner.doClean(LogCleaner.scala:399)
	at kafka.log.Cleaner.clean(LogCleaner.scala:376)
	at kafka.log.LogCleaner$CleanerThread.cleanOrSleep(LogCleaner.scala:267)
	at kafka.log.LogCleaner$CleanerThread.doWork(LogCleaner.scala:247)
	at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:68)
{code}

The root cause is that we call {{Log#asyncDeleteSegment}} to asynchronously delete the segment, AND we shutdown scheduler first then cleaner thread. Hence the following race condition can happen (very likely):

1. Kafka server shutdown, {{KafkaScheduler#shutdown}} called first.
2. {{cachedExecutor.shutdown()}} called, which executed all the already-scheduled tasks.
3. {{KafkaScheduler#shutdown}} completed, {{executor}} is set to {{null}}.
4. The {{LogCleaner.doWork()}} is called then before it is shutting down, and then calling {{scheduler.schedule("delete-file", deleteSeg _, delay = config.fileDeleteDelayMs)}}. BOOM.

An easy fix would be re-ordering the shutdown sequence, to make sure callers of the scheduler like the cleaner thread is shutdown first.


> After controlled shutdown: IllegalStateException: Kafka scheduler has not been started
> --------------------------------------------------------------------------------------
>
>                 Key: KAFKA-2319
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2319
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 1.0.1
>            Reporter: Jason Rosenberg
>
> Running 0.8.2.1, just saw this today at the end of a controlled shutdown.  It doesn't happen every time, but I've seen it several times:
> {code}
> 2015-07-07 18:54:28,424  INFO [Thread-4] server.KafkaServer - [Kafka Server 99], Controlled shutdown succeeded
> 2015-07-07 18:54:28,425  INFO [Thread-4] network.SocketServer - [Socket Server on Broker 99], Shutting down
> 2015-07-07 18:54:28,435  INFO [Thread-4] network.SocketServer - [Socket Server on Broker 99], Shutdown completed
> 2015-07-07 18:54:28,435  INFO [Thread-4] server.KafkaRequestHandlerPool - [Kafka Request Handler on Broker 99], shutting down
> 2015-07-07 18:54:28,444  INFO [Thread-4] server.KafkaRequestHandlerPool - [Kafka Request Handler on Broker 99], shut down completely
> 2015-07-07 18:54:28,649  INFO [Thread-4] server.ReplicaManager - [Replica Manager on Broker 99]: Shut down
> 2015-07-07 18:54:28,649  INFO [Thread-4] server.ReplicaFetcherManager - [ReplicaFetcherManager on broker 99] shutting down
> 2015-07-07 18:54:28,650  INFO [Thread-4] server.ReplicaFetcherThread - [ReplicaFetcherThread-0-95], Shutting down
> 2015-07-07 18:54:28,750  INFO [Thread-4] server.ReplicaFetcherThread - [ReplicaFetcherThread-0-95], Shutdown completed
> 2015-07-07 18:54:28,750  INFO [ReplicaFetcherThread-0-95] server.ReplicaFetcherThread - [ReplicaFetcherThread-0-95], Stopped
> 2015-07-07 18:54:28,750  INFO [Thread-4] server.ReplicaFetcherThread - [ReplicaFetcherThread-0-98], Shutting down
> 2015-07-07 18:54:28,791  INFO [Thread-4] server.ReplicaFetcherThread - [ReplicaFetcherThread-0-98], Shutdown completed
> 2015-07-07 18:54:28,791  INFO [ReplicaFetcherThread-0-98] server.ReplicaFetcherThread - [ReplicaFetcherThread-0-98], Stopped
> 2015-07-07 18:54:28,791  INFO [Thread-4] server.ReplicaFetcherManager - [ReplicaFetcherManager on broker 99] shutdown completed
> 2015-07-07 18:54:28,819  INFO [Thread-4] server.ReplicaManager - [Replica Manager on Broker 99]: Shut down completely
> 2015-07-07 18:54:28,826  INFO [Thread-4] log.LogManager - Shutting down.
> 2015-07-07 18:54:30,459  INFO [Thread-4] log.LogManager - Shutdown complete.
> 2015-07-07 18:54:30,463  WARN [Thread-4] utils.Utils$ - Kafka scheduler has not been started
> java.lang.IllegalStateException: Kafka scheduler has not been started
>         at kafka.utils.KafkaScheduler.ensureStarted(KafkaScheduler.scala:114)
>         at kafka.utils.KafkaScheduler.shutdown(KafkaScheduler.scala:86)
>         at kafka.controller.KafkaController.onControllerResignation(KafkaController.scala:350)
>         at kafka.controller.KafkaController.shutdown(KafkaController.scala:664)
>         at kafka.server.KafkaServer$$anonfun$shutdown$8.apply$mcV$sp(KafkaServer.scala:285)
>         at kafka.utils.Utils$.swallow(Utils.scala:172)
>         at kafka.utils.Logging$class.swallowWarn(Logging.scala:92)
>         at kafka.utils.Utils$.swallowWarn(Utils.scala:45)
>         at kafka.utils.Logging$class.swallow(Logging.scala:94)
>         at kafka.utils.Utils$.swallow(Utils.scala:45)
>         at kafka.server.KafkaServer.shutdown(KafkaServer.scala:285)
>  ...
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)