Posted to commits@samza.apache.org by "Xinyu Liu (JIRA)" <ji...@apache.org> on 2016/12/23 00:07:58 UTC

[jira] [Commented] (SAMZA-1069) Deadlock between KafkaSystemProducer and KafkaProducer from kafka-clients lib

    [ https://issues.apache.org/jira/browse/SAMZA-1069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15771442#comment-15771442 ] 

Xinyu Liu commented on SAMZA-1069:
----------------------------------

It looks like flush() is thread-safe by itself, so there is no reason for it to hold the producerLock. Moving it outside the lock should resolve this issue. I will put up a patch. Thanks for the thorough investigation!
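
A minimal sketch of the proposed change, in plain Java with no Kafka or Samza dependency. Everything here is a hypothetical stand-in and not the actual patch: FlushOutsideLockSketch plays the role of KafkaSystemProducer, a CountDownLatch models the pending request that flush() waits on, and the timeout exists only so the sketch cannot hang (the real KafkaProducer.flush() takes no timeout).

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for KafkaSystemProducer; not Samza's actual code.
class FlushOutsideLockSketch {
    private final Object producerLock = new Object();
    // Models one in-flight request that flush() has to wait for.
    private final CountDownLatch pendingSend = new CountDownLatch(1);
    private boolean stopped = false;

    // Models the send callback that runs on the producer's network thread.
    void onSendComplete() {
        synchronized (producerLock) {
            // Shared producer state would be updated here.
        }
        pendingSend.countDown(); // only now can flush() return
    }

    // Models flush(); bounded wait is an assumption made for the sketch.
    boolean flush(long timeoutMs) throws InterruptedException {
        return pendingSend.await(timeoutMs, TimeUnit.MILLISECONDS);
    }

    // flush() runs before the lock is taken, so the callback is never blocked.
    boolean stop(long timeoutMs) throws InterruptedException {
        boolean flushed = flush(timeoutMs);
        synchronized (producerLock) {
            stopped = true; // only the state change needs the lock
        }
        return flushed;
    }

    static boolean run() {
        FlushOutsideLockSketch producer = new FlushOutsideLockSketch();
        Thread network = new Thread(producer::onSendComplete, "network-thread");
        network.start();
        try {
            boolean flushed = producer.stop(2000);
            network.join();
            return flushed;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("flushed cleanly: " + run());
    }
}
```

Because stop() waits for the flush without holding producerLock, the callback can always acquire the lock, finish, and release the latch.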

> Deadlock between KafkaSystemProducer and KafkaProducer from kafka-clients lib
> -----------------------------------------------------------------------------
>
>                 Key: SAMZA-1069
>                 URL: https://issues.apache.org/jira/browse/SAMZA-1069
>             Project: Samza
>          Issue Type: Bug
>    Affects Versions: 0.11.0
>            Reporter: Yi Pan (Data Infrastructure)
>             Fix For: 0.12.0
>
>
> We have identified one deadlock scenario between the main thread that calls KafkaSystemProducer.close() and the KafkaProducer client lib's network thread that calls the callback function within KafkaSystemProducer.send().
> The scenario is the following:
> # The SamzaContainer main thread catches an exception from a previous commit and the container initiates shutdown, which calls KafkaSystemProducer.stop(). That call grabs the synchronized producerLock in KafkaSystemProducer and calls KafkaProducer.flush() to wait for all pending requests to complete.
> # The KafkaProducer network I/O thread then calls KafkaSystemProducer’s callback function (in RecordBatch.done()), which waits on the same producerLock in KafkaSystemProducer before it can return, call producerFuture.done(), and release the CountDownLatch that the main thread in KafkaSystemProducer.close() is waiting on. Hence, deadlock!
> We need to make sure that KafkaSystemProducer.close() cannot race with the callbacks triggered by the KafkaProducer's network thread.
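
The lock ordering in the scenario above can be reproduced in a small self-contained Java sketch. This is an illustration under stated assumptions, not the Samza or kafka-clients code: FlushUnderLockSketch, stopHoldingLock, and onSendComplete are hypothetical names, a CountDownLatch stands in for the pending request, and the wait is bounded so the program reports the stall instead of hanging forever.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical model of the reported lock ordering; not Samza's actual code.
class FlushUnderLockSketch {
    private final Object producerLock = new Object();
    // Models the pending request that the flush waits on.
    private final CountDownLatch pendingSend = new CountDownLatch(1);
    // Lets the network thread start only after the main thread holds the lock.
    private final CountDownLatch lockHeld = new CountDownLatch(1);

    // Models the send callback: it needs producerLock before it can finish.
    void onSendComplete() {
        synchronized (producerLock) {
            // Shared producer state would be updated here.
        }
        pendingSend.countDown();
    }

    // Models stop(): waits for the pending request while holding producerLock.
    boolean stopHoldingLock(long timeoutMs) throws InterruptedException {
        synchronized (producerLock) {
            lockHeld.countDown();
            // With an unbounded wait this would never return.
            return pendingSend.await(timeoutMs, TimeUnit.MILLISECONDS);
        }
    }

    static boolean run() {
        FlushUnderLockSketch producer = new FlushUnderLockSketch();
        Thread network = new Thread(() -> {
            try {
                producer.lockHeld.await();
            } catch (InterruptedException e) {
                return;
            }
            producer.onSendComplete(); // blocks until the lock is released
        }, "network-thread");
        network.start();
        try {
            boolean flushed = producer.stopHoldingLock(300);
            network.join(); // callback finishes once the lock is free
            return flushed;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("flushed: " + run()); // prints "flushed: false"
    }
}
```

The wait can only end by timing out, because the callback cannot count the latch down until the main thread leaves the synchronized block.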


