Posted to jira@kafka.apache.org by "Panos Skianis (JIRA)" <ji...@apache.org> on 2017/07/22 01:16:02 UTC

[jira] [Comment Edited] (KAFKA-5611) One or more consumers in a consumer-group stop consuming after rebalancing

    [ https://issues.apache.org/jira/browse/KAFKA-5611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16097048#comment-16097048 ] 

Panos Skianis edited comment on KAFKA-5611 at 7/22/17 1:15 AM:
---------------------------------------------------------------

Somehow I managed to recreate this in a different environment but still cannot do it at will. I am attaching a filtered log from the "consumer" app that had the issue. 
Debugging was switched on for the consumer internals package of the kafka clients (o.a.k.c.consumer.internals). 
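For anyone wanting the same output: switching that package to DEBUG in a log4j.properties would look something like the lines below (assuming log4j; adjust accordingly if the app uses logback).

    # switch the new-consumer internals (coordinator, fetcher, heartbeat) to DEBUG
    log4j.logger.org.apache.kafka.clients.consumer.internals=DEBUG
    # widen to the whole clients package if more context is needed
    # log4j.logger.org.apache.kafka.clients=DEBUG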

We have introduced a way to force the Kafka consumer code to be "restarted" manually. At the moment, on a cluster of 4 Kafka brokers/3 ZooKeepers and 3 client consumer apps, we have been trying to recreate this problem by causing rebalancing on purpose (group.id = salesforce-group-gb). You may notice other groups as well since this app is using multiple groups, but those are OK at the moment afaik. 
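For clarity, the effect of that manual "restart" is roughly the following. This is only a minimal sketch with a plain KafkaConsumer (the real app goes through akka-kafka-streams, and the broker address and topic below are placeholders), but it shows why the group rebalances twice: once when the old consumer leaves and once when the fresh one rejoins.

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.Consumer;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    final class ConsumerRestartSketch {
        // Hypothetical helper, not our actual code: closing the consumer and
        // creating a fresh one with the same group.id makes the coordinator
        // rebalance the group, which is what the manual "restart" hook does.
        static Consumer<String, String> restart(Consumer<String, String> old, String topic) {
            old.close();                                        // leave the group -> rebalance
            Properties props = new Properties();
            props.put("bootstrap.servers", "kafka1:9092");      // placeholder broker
            props.put("group.id", "salesforce-group-gb");
            props.put("key.deserializer",
                      "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                      "org.apache.kafka.common.serialization.StringDeserializer");
            Consumer<String, String> fresh = new KafkaConsumer<>(props);
            fresh.subscribe(Collections.singletonList(topic));  // rejoin -> second rebalance
            return fresh;
        }
    }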

The file related to this comment is attached as bad-server-with-more-logging-1 (tarballed and gzipped). Let me know if you want me to provide it in another format.
There are some comments inside that file (normally starting with the word COMMENT). 

Note: the app is using akka-kafka-streams and I believe there is a timeout involved. You will be able to see this if you search for WakeupException.
I am wondering if that is causing the issue because there is a delay in getting the information back from Kafka (timeout currently set at 3s). I will go looking down that route in case this is just a long network call.
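To illustrate the mechanism I am suspecting: a sketch only, assuming the akka layer arms a timer that calls consumer.wakeup() when a poll does not return within its timeout (taken here as the 3s mentioned above). This is not our actual code, just the standard wakeup pattern.

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import org.apache.kafka.clients.consumer.Consumer;
    import org.apache.kafka.common.errors.WakeupException;

    final class WakeupSketch {
        // wakeup() is the one thread-safe KafkaConsumer call; it makes a blocked
        // poll() throw WakeupException on the polling thread.
        static void pollWithDeadline(Consumer<String, String> consumer) {
            ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
            timer.schedule(consumer::wakeup, 3, TimeUnit.SECONDS);  // the ~3s timeout
            try {
                consumer.poll(10_000);   // 0.10.2 API: poll(long timeoutMs)
            } catch (WakeupException e) {
                // a slow response from the broker (e.g. a long network call) would
                // surface here as an interrupted poll rather than a hard failure
            } finally {
                timer.shutdownNow();
            }
        }
    }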





> One or more consumers in a consumer-group stop consuming after rebalancing
> --------------------------------------------------------------------------
>
>                 Key: KAFKA-5611
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5611
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.10.2.0
>            Reporter: Panos Skianis
>         Attachments: bad-server-with-more-logging-1.tar.gz, kka02, Server 1, Server 2, Server 3
>
>
> Scenario: 
>   - 3 ZooKeepers, 4 Kafka brokers on 0.10.2.0, with 0.9.0 compatibility still on (other apps need it, but the app mentioned below is already on the 0.10.2.0 client).
>   - 3 servers running 1 consumer each under the same consumer groupId. 
>   - Servers seem to be consuming messages happily, but then a timeout to an external service causes our app to restart the Kafka consumer on one of the servers (this is by design). That causes rebalancing of the group, and upon restart one of the consumers seems to "block".
>   - Server 3 is where the problems occur.
>   - The problem fixes itself either by restarting one of the 3 servers or by causing the group to rebalance again, using the console consumer with autocommit set to false and the same group (an example command is sketched after this description).
>  
> Note: 
>  - Haven't managed to recreate it at will yet.
>  - Mainly happens in the production environment, often enough. Hence I do not have any logs with DEBUG/TRACE statements yet.
>  - Extracts from the log of each app server are attached, as well as the log of the Kafka broker that seems to be dealing with the related group and generations.
>  - See COMMENT lines in the files for further info.
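For reference, the console-consumer trick mentioned in the description (joining the group with autocommit off to force another rebalance) might look roughly like this; the broker address and topic are placeholders, and on 0.10.2 the group and autocommit are passed as consumer properties:

    kafka-console-consumer.sh --bootstrap-server kafka1:9092 --topic some-topic \
      --consumer-property group.id=salesforce-group-gb \
      --consumer-property enable.auto.commit=false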



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)