Posted to jira@kafka.apache.org by "Geoffrey Stewart (JIRA)" <ji...@apache.org> on 2017/07/14 23:34:00 UTC
[jira] [Commented] (KAFKA-5016) Consumer hang in poll method while rebalancing is in progress

    [ https://issues.apache.org/jira/browse/KAFKA-5016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16088285#comment-16088285 ] 

Geoffrey Stewart commented on KAFKA-5016:
-----------------------------------------

I have also encountered the issue documented in this Jira using 0.10.2.0 brokers with the 0.10.2.0 client. The issue occurs only when we use the "subscribe" call from the API, which dynamically assigns partitions. When we use the "assign" call from the API to manually assign lists of partitions, we do not have any issue.

I don't think what is being described above represents the expected behavior of dynamic partition assignment and consumer group coordination. Based on the above explanation, it sounds like it would not be possible to have 2 or more simultaneous consumer instances in the same consumer group when using dynamic partition assignment (subscribe). For example, there could be one consumer instance in the group which has made some calls to "poll". As soon as a second consumer instance comes along, its call to "poll" is processed only after max.poll.interval.ms has elapsed since the first consumer's most recent poll request, at which point the broker no longer considers the first consumer part of the group.

I certainly agree that with the arrival of the second consumer in the group, the broker must perform a rebalance or restabilization, which may take some time. However, this should not take max.poll.interval.ms, since the liveness of the first consumer should be maintained by its heartbeat, which occurs every heartbeat.interval.ms. I have confirmed that with the default value of 300000 for max.poll.interval.ms, the group restabilization (rebalance) takes about this long (5 minutes), and only then is the second consumer instance's poll request processed. Lowering this value to 30000 reduces the group restabilization (rebalance) to about 30 seconds before the second consumer instance's poll request is processed.

To summarize, please explain how I can establish parallel consumer instances in the same group using the subscribe method from the API, which dynamically assigns partitions. Further, please help me understand why the consumer instance's heartbeat does not seem to be maintaining its liveness.
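
The timing reported above (about 5 minutes with 300000, about 30 seconds with 30000) can be illustrated with a small model. This is only a sketch under a stated assumption: since 0.10.1.0 (KIP-62), max.poll.interval.ms is also used as the group's rebalance timeout, and a member rejoins the group from inside poll(), not from the background heartbeat thread. The class and method names below (RebalanceWaitModel, joinWaitMs, NEVER) are hypothetical and are not part of the Kafka client API.

```java
import java.util.List;

public class RebalanceWaitModel {
    /** Marker for an existing member that never calls poll() again. */
    public static final long NEVER = Long.MAX_VALUE;

    /**
     * Simplified model: a new member's join completes once every existing
     * member has rejoined (on its next poll()) or the rebalance timeout
     * expires, whichever comes first for each member.
     *
     * @param msUntilNextPoll   per existing member, ms until its next poll()
     * @param maxPollIntervalMs assumed to double as the rebalance timeout
     * @return approximate ms the new member's first poll() is delayed
     */
    public static long joinWaitMs(List<Long> msUntilNextPoll, long maxPollIntervalMs) {
        long wait = 0L;
        for (long delay : msUntilNextPoll) {
            // A member that does not poll in time is dropped at the timeout.
            long effective = Math.min(delay, maxPollIntervalMs);
            wait = Math.max(wait, effective);
        }
        return wait;
    }

    public static void main(String[] args) {
        // First consumer is idle (alive, heartbeating, but not polling).
        System.out.println(joinWaitMs(List.of(NEVER), 300_000L));
        System.out.println(joinWaitMs(List.of(NEVER), 30_000L));
        // First consumer polls again 100 ms after the rebalance starts.
        System.out.println(joinWaitMs(List.of(100L), 300_000L));
    }
}
```

Under this model, the delay tracks max.poll.interval.ms only when the first consumer stays alive without polling; if both consumers keep calling poll() in a loop, the wait is bounded by the slowest member's gap between polls.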

> Consumer hang in poll method while rebalancing is in progress
> -------------------------------------------------------------
>
>                 Key: KAFKA-5016
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5016
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.10.1.0, 0.10.2.0
>            Reporter: Domenico Di Giulio
>            Assignee: Vahid Hashemian
>         Attachments: Kafka 0.10.2.0 Issue (TRACE) - Server + Client.txt, Kafka 0.10.2.0 Issue (TRACE).txt, KAFKA_5016.java
>
>
> After moving to Kafka 0.10.2.0, it looks like I'm experiencing a hang in the rebalancing code. 
> This is a test case, not (yet) production code. It does the following with a single-partition topic and two consumers in the same group:
> 1) a topic with one partition is forced to be created (auto-created)
> 2) a producer is used to write 10 messages
> 3) the first consumer reads all the messages and commits
> 4) the second consumer attempts a poll() and hangs indefinitely
> The same issue can't be found with 0.10.0.0.
> See the attached logs at TRACE level. Look for "SERVER HANGS" to see where the hang is found: when this happens, the client keeps failing every heartbeat attempt because the rebalance is in progress, and the poll method hangs indefinitely.
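
For anyone reproducing the test case, the three consumer timeouts discussed in this thread can be set explicitly so their effect is easier to observe. The following is a minimal sketch that only builds the configuration. The broker address, the deserializer choice, and the class name ConsumerTimeoutConfig are assumptions for illustration; the resulting Properties would be passed to the KafkaConsumer constructor from kafka-clients, which is not included here.

```java
import java.util.Properties;

public class ConsumerTimeoutConfig {
    /**
     * Builds consumer properties with the timeouts relevant to this issue.
     * The session and heartbeat values are the 0.10.2 client defaults.
     */
    public static Properties build(String groupId, int maxPollIntervalMs) {
        Properties props = new Properties();
        // Assumption: a single local broker; replace with the real address.
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("group.id", groupId);
        props.setProperty("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.setProperty("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        // Maximum allowed gap between poll() calls (default 300000).
        props.setProperty("max.poll.interval.ms", Integer.toString(maxPollIntervalMs));
        // Liveness timeout enforced through heartbeats (default 10000).
        props.setProperty("session.timeout.ms", "10000");
        // Interval between background heartbeats (default 3000).
        props.setProperty("heartbeat.interval.ms", "3000");
        return props;
    }

    public static void main(String[] args) {
        Properties props = build("test-group", 30000);
        // With kafka-clients on the classpath: new KafkaConsumer<>(props)
        props.forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```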



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)