Posted to dev@kafka.apache.org by "Catalina-Alina Dobrica (JIRA)" <ji...@apache.org> on 2016/11/09 12:34:58 UTC

[jira] [Commented] (KAFKA-1894) Avoid long or infinite blocking in the consumer

    [ https://issues.apache.org/jira/browse/KAFKA-1894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15650814#comment-15650814 ] 

Catalina-Alina Dobrica commented on KAFKA-1894:
-----------------------------------------------

This issue also prevents the consumer's thread from being interrupted. This is particularly relevant when the consumer is integrated into an external system, such as a Camel ecosystem. Forcing the shutdown of the ExecutorService that manages the thread, or trying to terminate the thread itself, has no effect: the thread stays in the infinite loop. If enough such threads are started, this eventually leads to an OutOfMemoryError.
I found this issue when providing an incorrect SSL protocol to the consumer in version 0.10.1.0, but it can occur in any circumstance where the channel is not established, such as when Kafka is not enabled. The thread loops indefinitely, checking whether the connection has been established, which in some cases will never happen.
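
To illustrate the interruption problem described above, here is a minimal sketch of a completion loop that checks the thread's interrupt flag on every pass, so that ExecutorService.shutdownNow() or Thread.interrupt() can actually stop it. The class InterruptiblePoll and its method are hypothetical names made up for this example; this is not the consumer's actual code, and the real loops may need more than this.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical helper for illustration only; not part of the Kafka client. */
final class InterruptiblePoll {
    private InterruptiblePoll() {}

    /**
     * Calls pollOnce until isComplete reports true. Unlike a bare
     * while(!isThingComplete()) loop, it gives up as soon as the
     * current thread has been interrupted.
     */
    static void pollUntilComplete(BooleanSupplier isComplete, Runnable pollOnce)
            throws InterruptedException {
        while (!isComplete.getAsBoolean()) {
            // Thread.interrupted() reads and clears the interrupt flag.
            if (Thread.interrupted()) {
                throw new InterruptedException("interrupted while waiting for completion");
            }
            pollOnce.run();
        }
    }
}
```

With a check like this, a thread whose connection is never established would exit the loop on the first interrupt instead of spinning forever.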

> Avoid long or infinite blocking in the consumer
> -----------------------------------------------
>
>                 Key: KAFKA-1894
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1894
>             Project: Kafka
>          Issue Type: Sub-task
>          Components: consumer
>            Reporter: Jay Kreps
>            Assignee: Jason Gustafson
>             Fix For: 0.10.2.0
>
>
> The new consumer has a lot of loops that look something like
> {code}
>   while(!isThingComplete())
>     client.poll();
> {code}
> This occurs both in KafkaConsumer and in NetworkClient.completeAll. These retry loops are mostly the behavior we want, but there are several cases where they may cause problems:
>  - In the case of a hard failure we may hang for a long time or indefinitely before realizing the connection is lost.
>  - In the case where the cluster is malfunctioning or down we may retry forever.
> It would probably be better to give these loops a timeout. The proposed approach would be to add something like retry.time.ms=60000 and to continue retrying only for that period of time.
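
The timeout proposed in the issue description can be sketched roughly as follows. This is only an illustration under stated assumptions: BoundedRetry, pollUntil and the retryTimeMs parameter are hypothetical names standing in for the suggested retry.time.ms setting, and the two callbacks stand in for isThingComplete() and client.poll().

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

/** Hypothetical helper for illustration only; not part of the Kafka client. */
final class BoundedRetry {
    private BoundedRetry() {}

    /**
     * Calls pollOnce until isComplete reports true or retryTimeMs has elapsed.
     *
     * @return true if the condition completed, false if the retry window expired
     */
    static boolean pollUntil(BooleanSupplier isComplete, Runnable pollOnce, long retryTimeMs) {
        // nanoTime is monotonic, so wall-clock adjustments do not affect the deadline.
        long deadlineNanos = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(retryTimeMs);
        while (!isComplete.getAsBoolean()) {
            if (System.nanoTime() - deadlineNanos >= 0) {
                return false; // stop retrying once the window has passed
            }
            pollOnce.run();
        }
        return true;
    }
}
```

A caller would treat a false return as a failure to report, for example by throwing a timeout exception, rather than continuing to loop. Note that the deadline is only checked between calls, so a single pollOnce call that blocks would still need its own timeout.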



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)