You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@kafka.apache.org by "Federico Giraud (JIRA)" <ji...@apache.org> on 2017/02/28 14:43:46 UTC

[jira] [Commented] (KAFKA-4669) KafkaProducer.flush hangs when NetworkClient.handleCompletedReceives throws exception

    [ https://issues.apache.org/jira/browse/KAFKA-4669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15888126#comment-15888126 ] 

Federico Giraud commented on KAFKA-4669:
----------------------------------------

We had the same issue with a Kafka 0.8.2.0 producer and a Kafka 0.9 cluster. The bug seems to have been triggered by a change in the ISR (a broker fell out of sync and cause a rebalance on multiple topics). A producer that was sending messages to multiple topics in the cluster reporter the following exception multiple times, within a few seconds:

{code}
[2017-02-19 06:02:05,172] ERROR {kafka-producer-network-thread | egress} org.apache.kafka.clients.producer.internals.Sender - Uncaught error in kafka producer I/O thread:
java.lang.IllegalStateException: Correlation id for response (90177743) does not match request (90177741)
        at org.apache.kafka.clients.NetworkClient.correlate(NetworkClient.java:356)
        at org.apache.kafka.clients.NetworkClient.handleCompletedReceives(NetworkClient.java:296)
        at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:199)
        at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:191)
        at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:122)
        at java.lang.Thread.run(Thread.java:745)
{code}

There were no producer errors preceding the sequence of IllegalStateExceptions. We have numerous other producers running and none of them reported the issue. The only solution was to terminate the process and restarting it.

Please let me know if you need any additional information.

> KafkaProducer.flush hangs when NetworkClient.handleCompletedReceives throws exception
> -------------------------------------------------------------------------------------
>
>                 Key: KAFKA-4669
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4669
>             Project: Kafka
>          Issue Type: Bug
>          Components: clients
>    Affects Versions: 0.9.0.1
>            Reporter: Cheng Ju
>            Priority: Critical
>              Labels: reliability
>             Fix For: 0.10.3.0
>
>
> There is no try catch in NetworkClient.handleCompletedReceives.  If an exception is thrown after inFlightRequests.completeNext(source), then the corresponding RecordBatch's done will never get called, and KafkaProducer.flush will hang on this RecordBatch.
> I've checked 0.10 code and think this bug does exist in 0.10 versions.
> A real case.  First a correlateId not match exception happens:
> 13 Jan 2017 17:08:24,059 ERROR [kafka-producer-network-thread | producer-21] (org.apache.kafka.clients.producer.internals.Sender.run:130)  - Uncaught error in kafka producer I/O thread: 
> java.lang.IllegalStateException: Correlation id for response (703766) does not match request (703764)
> 	at org.apache.kafka.clients.NetworkClient.correlate(NetworkClient.java:477)
> 	at org.apache.kafka.clients.NetworkClient.handleCompletedReceives(NetworkClient.java:440)
> 	at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:265)
> 	at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:216)
> 	at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:128)
> 	at java.lang.Thread.run(Thread.java:745)
> Then jstack shows the thread is hanging on:
> 	at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
> 	at org.apache.kafka.clients.producer.internals.ProduceRequestResult.await(ProduceRequestResult.java:57)
> 	at org.apache.kafka.clients.producer.internals.RecordAccumulator.awaitFlushCompletion(RecordAccumulator.java:425)
> 	at org.apache.kafka.clients.producer.KafkaProducer.flush(KafkaProducer.java:544)
> 	at org.apache.flume.sink.kafka.KafkaSink.process(KafkaSink.java:224)
> 	at org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:67)
> 	at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:145)
> 	at java.lang.Thread.run(Thread.java:745)
> client code 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)