You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@kafka.apache.org by "Rajini Sivaram (JIRA)" <ji...@apache.org> on 2017/05/22 08:35:04 UTC

[jira] [Commented] (KAFKA-5298) MirrorMaker deadlocks with missing topics

    [ https://issues.apache.org/jira/browse/KAFKA-5298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16019285#comment-16019285 ] 

Rajini Sivaram commented on KAFKA-5298:
---------------------------------------

This has been fixed under KAFKA-5086. Expired topics are now added back to the metadata list while in the metadata wait loop.

> MirrorMaker deadlocks with missing topics
> -----------------------------------------
>
>                 Key: KAFKA-5298
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5298
>             Project: Kafka
>          Issue Type: Bug
>          Components: clients, tools
>    Affects Versions: 0.10.2.1
>            Reporter: Raymond Conn
>
> When mirrorMaker mirrors a topic to destination brokers that have topic auto create disabled and a topic doesn't exist on the destination brokers, the producer in mirror maker logs the following 
> {code}
> Error while fetching metadata with correlation id 467 : \{mirror-test2=UNKNOWN_TOPIC_OR_PARTITION\}
> Error while fetching metadata with correlation id 468 : {mirror-test2=UNKNOWN_TOPIC_OR_PARTITION}
> {code}
> This log message is fine and expected. The problem is the log message stops ~5 min later. At which point the logs look fine, but mirror maker is not mirroring any of its topics. 
> What's worse is mirrorMaker is basically in an unrecoverable state once this happens (the log statement stops). If you create the topic at the destination mirrorMaker still won't mirror data until a restart. Attempts to restart mirrorMaker (cleanly) fail because the process is more or less deadlocked in its shutdown hook.
> Here is the reasoning:
> * MirrorMaker becomes unrecoverable after 5 minutes because of this loop in the [producer|https://github.com/apache/kafka/blob/e06cd3e55f25a0bb414e0770493906ea8019420a/clients/src/main/java/org/apache/kafka/clients/producer/KafkaProducer.java#L543-L561]
> * The producer will keep waiting for metadata for the missing topic or until the max timeout is reached. (max long in this case) 
> * after 5 minutes the producer stops making a metadata request for the topic because that topic expires [here|https://github.com/apache/kafka/blob/e06cd3e55f25a0bb414e0770493906ea8019420a/clients/src/main/java/org/apache/kafka/clients/Metadata.java#L218] 
> * topic is never re-added for metadata requests since the only add is before entering the loop [here|(https://github.com/apache/kafka/blob/e06cd3e55f25a0bb414e0770493906ea8019420a/clients/src/main/java/org/apache/kafka/clients/producer/KafkaProducer.java#L528]
> So basically after 5 minutes all metadata requests moving forward are for no topics since the topic expired. The mirrorMaker thread essentially gets stuck waiting forever since there will never be a metadata request for the topic the thread is waiting on
> All of this basically leads to a deadlock state in the shutdown hook. 
> * shutdown hook sends a shutdown to the mirrorMaker threads 
> * waits for threads to exit their loop by waitind on a [latch|https://github.com/apache/kafka/blob/0.10.2/core/src/main/scala/kafka/tools/MirrorMaker.scala#L396]
> * latch is never counted down in [produce|https://github.com/apache/kafka/blob/0.10.2/core/src/main/scala/kafka/tools/MirrorMaker.scala#L434] 
> * thread will never exit the loop to countdown the latch on line 462.
> This can be seen with a thread dump of the shutdown hook thread
> {code}
> Name: MirrorMakerShutdownHook
> State: WAITING on java.util.concurrent.CountDownLatch$Sync@3ffebeac
> Total blocked: 0  Total waited: 1
> Stack trace: 
> sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
> java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
> kafka.tools.MirrorMaker$MirrorMakerThread.awaitShutdown(MirrorMaker.scala:498)
> kafka.tools.MirrorMaker$$anonfun$cleanShutdown$4.apply(MirrorMaker.scala:396)
> kafka.tools.MirrorMaker$$anonfun$cleanShutdown$4.apply(MirrorMaker.scala:396)
> scala.collection.Iterator$class.foreach(Iterator.scala:893)
> {code}
> The root of the issue more or less related to the issue documented here where the producer can block waiting for metadata. https://issues.apache.org/jira/browse/KAFKA-3450



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)