You are viewing a plain text version of this content. The canonical link for it is here.

Posted to users@kafka.apache.org by Jun MA <mj...@gmail.com> on 2016/11/19 00:19:18 UTC

Consumer group keep bouncing between generation id 0 and 1

Hi guys,

I’m using kafka 0.9.0.1 and Java client. I saw the following exceptions throw by my consumer:
Caused by: java.lang.IllegalStateException: Correlation id for response (767587) does not match request (767585)
	at org.apache.kafka.clients.NetworkClient.correlate(NetworkClient.java:477)
	at org.apache.kafka.clients.NetworkClient.handleCompletedReceives(NetworkClient.java:440)
	at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:265)
	at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.clientPoll(ConsumerNetworkClient.java:320)
	at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:213)
	at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:193)
	at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:908)
	at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:853)
	at com.hulu.flintan.metadatacache.kafka.AssetChangeEventConsumer.run(AssetChangeEventConsumer.java:47)
35 seconds later, I started seeing 

Error ILLEGAL_GENERATION occurred while committing offsets for group flintan-metadatacache-5b9d551b-bd4a-41fa-812f-3989c76c3856

happens all the time.

I then checked the server side log, it shows 

[2016-11-17 06:17:28,868] INFO [GroupCoordinator 2]: Preparing to restabilize group flintan-metadatacache-5b9d551b-bd4a-41fa-812f-3989c76c3856 with old generation 1 (kafka.coordinator.GroupCoordinator)
[2016-11-17 06:17:28,869] INFO [GroupCoordinator 2]: Group flintan-metadatacache-5b9d551b-bd4a-41fa-812f-3989c76c3856 generation 1 is dead and removed (kafka.coordinator.GroupCoordinator)
[2016-11-17 06:17:42,396] INFO [GroupCoordinator 2]: Preparing to restabilize group flintan-metadatacache-5b9d551b-bd4a-41fa-812f-3989c76c3856 with old generation 0 (kafka.coordinator.GroupCoordinator)
[2016-11-17 06:17:42,396] INFO [GroupCoordinator 2]: Stabilized group flintan-metadatacache-5b9d551b-bd4a-41fa-812f-3989c76c3856 generation 1 (kafka.coordinator.GroupCoordinator)
[2016-11-17 06:17:42,399] INFO [GroupCoordinator 2]: Assignment received from leader for group flintan-metadatacache-5b9d551b-bd4a-41fa-812f-3989c76c3856 for generation 1 (kafka.coordinator.GroupCoordinator)
[2016-11-17 06:18:12,404] INFO [GroupCoordinator 2]: Preparing to restabilize group flintan-metadatacache-5b9d551b-bd4a-41fa-812f-3989c76c3856 with old generation 1 (kafka.coordinator.GroupCoordinator)

over and over all the time. It looks like the consumer group is bouncing between generation 0 and 1, and it stop consuming anything. This consumer group only have 1 consumer with it.

We are using 
auto commit with interval 1000ms
session timeout 30000ms
heartbeat interval 3000ms

My questions are:
1. Why this happens and how to prevent it happening again?
2. If it happens, how should I react in this case? Catch IllegalGeneration exception and resubscribe the topic? Or recreate the consumer w/ same (or different) consumer group id?
3. Where can I find the active consumer group for a topic? Does that store in zookeeper?

Thanks,
Jun

Re: Consumer group keep bouncing between generation id 0 and 1

Posted by Guozhang Wang <wa...@gmail.com>.

1. If you are not having heavy callback, then I'd suggest you checking the
frequency of consumer.poll() call, because that is when the heartbeat is
sent. For example, between

[2016-11-17 06:17:42,399] new group formed with the single consumer member

[2016-11-17 06:18:12,404] that single member has failed because no
heartbeat is received, hence removing the group

Are there any consumer.poll calls between these 30 seconds. For example, if
your processing logic is slow, it may be possible that it is taking more
than 30 seconds to process a batch and hence the above issue.

You can try with the "max.poll.records" config to return less messages in
one poll call and see if that helps, or upgrade to 0.10.x version in which
this issue is better fixed in KIP-62:

https://cwiki.apache.org/confluence/display/KAFKA/KIP-62%3A+Allow+consumer+to+send+heartbeats+from+a+background+thread


2. The consumer group information is not stored in zookeeper, but in Kafka
server.



Guozhang


On Sat, Nov 19, 2016 at 10:39 AM, Jun Ma <mj...@gmail.com> wrote:

> Hi Guozhang,
>
> Thank you so much for your reply. Follow up questions:
>
> 1. We are using default onPartitionAssigned callback, not customized one.
> What do you think would be the possible reason of the heavy callback? Can
> you point me the actual implementation class for Java 0.9.0.1 driver?
> 2. Does consumer group information store in zookeeper? If it is, do you
> think it is possible that the session timeout is because of the zookeeper?
> Can you tell me the actual znode that stores the consumer group
> information?
>
> Thanks,
> Jun
>
> On Sat, Nov 19, 2016 at 9:32 AM, Guozhang Wang <wa...@gmail.com> wrote:
>
> > Hello Jun,
> >
> > The reason that the group is bouncing between generation 0 and 1 is that
> it
> > is iteratively removed as an empty group after the only member is
> removed,
> > and then re-create the group with the member jumping back to re-join.
> >
> > My suspect is that your PartitionAssigned callback is too heavy, that it
> > takes more than the session timeout. Note that the coordinator start
> > ticking on the member right after the partition is assigned, but the
> > consumer will not start heartbeating until the callback is completed.
> >
> > As for the "Correlation id for response (767587) does not match request
> > (767585)" issue, it is indeed a bit weird and I have not seen it before.
> > Would need to investigate further from the logs.
> >
> > Currently there is no tools for searching for all the active consumer
> > groups for a given topic, one has to query all the consumer groups and
> > filter by their subscribed topics.
> >
> >
> > Guozhang
> >
> >
> >
> >
> > On Fri, Nov 18, 2016 at 4:19 PM, Jun MA <mj...@gmail.com> wrote:
> >
> > > Hi guys,
> > >
> > > I’m using kafka 0.9.0.1 and Java client. I saw the following exceptions
> > > throw by my consumer:
> > > Caused by: java.lang.IllegalStateException: Correlation id for
> response
> > > (767587) does not match request (767585)
> > >         at org.apache.kafka.clients.NetworkClient.correlate(
> > > NetworkClient.java:477)
> > >         at org.apache.kafka.clients.NetworkClient.
> > handleCompletedReceives(
> > > NetworkClient.java:440)
> > >         at org.apache.kafka.clients.NetworkClient.poll(
> > > NetworkClient.java:265)
> > >         at org.apache.kafka.clients.consumer.internals.
> > > ConsumerNetworkClient.clientPoll(ConsumerNetworkClient.java:320)
> > >         at org.apache.kafka.clients.consumer.internals.
> > > ConsumerNetworkClient.poll(ConsumerNetworkClient.java:213)
> > >         at org.apache.kafka.clients.consumer.internals.
> > > ConsumerNetworkClient.poll(ConsumerNetworkClient.java:193)
> > >         at org.apache.kafka.clients.consumer.KafkaConsumer.
> > > pollOnce(KafkaConsumer.java:908)
> > >         at org.apache.kafka.clients.consumer.KafkaConsumer.poll(
> > > KafkaConsumer.java:853)
> > >         at com.hulu.flintan.metadatacache.kafka.
> > > AssetChangeEventConsumer.run(AssetChangeEventConsumer.java:47)
> > > 35 seconds later, I started seeing
> > >
> > > Error ILLEGAL_GENERATION occurred while committing offsets for group
> > > flintan-metadatacache-5b9d551b-bd4a-41fa-812f-3989c76c3856
> > >
> > > happens all the time.
> > >
> > > I then checked the server side log, it shows
> > >
> > > [2016-11-17 06:17:28,868] INFO [GroupCoordinator 2]: Preparing to
> > > restabilize group flintan-metadatacache-5b9d551b-bd4a-41fa-812f-
> > 3989c76c3856
> > > with old generation 1 (kafka.coordinator.GroupCoordinator)
> > > [2016-11-17 06:17:28,869] INFO [GroupCoordinator 2]: Group
> > > flintan-metadatacache-5b9d551b-bd4a-41fa-812f-3989c76c3856 generation
> 1
> > > is dead and removed (kafka.coordinator.GroupCoordinator)
> > > [2016-11-17 06:17:42,396] INFO [GroupCoordinator 2]: Preparing to
> > > restabilize group flintan-metadatacache-5b9d551b-bd4a-41fa-812f-
> > 3989c76c3856
> > > with old generation 0 (kafka.coordinator.GroupCoordinator)
> > > [2016-11-17 06:17:42,396] INFO [GroupCoordinator 2]: Stabilized group
> > > flintan-metadatacache-5b9d551b-bd4a-41fa-812f-3989c76c3856 generation
> 1
> > > (kafka.coordinator.GroupCoordinator)
> > > [2016-11-17 06:17:42,399] INFO [GroupCoordinator 2]: Assignment
> received
> > > from leader for group flintan-metadatacache-5b9d551b-bd4a-41fa-812f-
> > 3989c76c3856
> > > for generation 1 (kafka.coordinator.GroupCoordinator)
> > > [2016-11-17 06:18:12,404] INFO [GroupCoordinator 2]: Preparing to
> > > restabilize group flintan-metadatacache-5b9d551b-bd4a-41fa-812f-
> > 3989c76c3856
> > > with old generation 1 (kafka.coordinator.GroupCoordinator)
> > >
> > > over and over all the time. It looks like the consumer group is
> bouncing
> > > between generation 0 and 1, and it stop consuming anything. This
> consumer
> > > group only have 1 consumer with it.
> > >
> > > We are using
> > > auto commit with interval 1000ms
> > > session timeout 30000ms
> > > heartbeat interval 3000ms
> > >
> > > My questions are:
> > > 1. Why this happens and how to prevent it happening again?
> > > 2. If it happens, how should I react in this case? Catch
> > IllegalGeneration
> > > exception and resubscribe the topic? Or recreate the consumer w/ same
> (or
> > > different) consumer group id?
> > > 3. Where can I find the active consumer group for a topic? Does that
> > store
> > > in zookeeper?
> > >
> > > Thanks,
> > > Jun
> >
> >
> >
> >
> > --
> > -- Guozhang
> >
>



-- 
-- Guozhang

Re: Consumer group keep bouncing between generation id 0 and 1

Posted by Jun Ma <mj...@gmail.com>.

Hi Guozhang,

Thank you so much for your reply. Follow up questions:

1. We are using default onPartitionAssigned callback, not customized one.
What do you think would be the possible reason of the heavy callback? Can
you point me the actual implementation class for Java 0.9.0.1 driver?
2. Does consumer group information store in zookeeper? If it is, do you
think it is possible that the session timeout is because of the zookeeper?
Can you tell me the actual znode that stores the consumer group
information?

Thanks,
Jun

On Sat, Nov 19, 2016 at 9:32 AM, Guozhang Wang <wa...@gmail.com> wrote:

> Hello Jun,
>
> The reason that the group is bouncing between generation 0 and 1 is that it
> is iteratively removed as an empty group after the only member is removed,
> and then re-create the group with the member jumping back to re-join.
>
> My suspect is that your PartitionAssigned callback is too heavy, that it
> takes more than the session timeout. Note that the coordinator start
> ticking on the member right after the partition is assigned, but the
> consumer will not start heartbeating until the callback is completed.
>
> As for the "Correlation id for response (767587) does not match request
> (767585)" issue, it is indeed a bit weird and I have not seen it before.
> Would need to investigate further from the logs.
>
> Currently there is no tools for searching for all the active consumer
> groups for a given topic, one has to query all the consumer groups and
> filter by their subscribed topics.
>
>
> Guozhang
>
>
>
>
> On Fri, Nov 18, 2016 at 4:19 PM, Jun MA <mj...@gmail.com> wrote:
>
> > Hi guys,
> >
> > I’m using kafka 0.9.0.1 and Java client. I saw the following exceptions
> > throw by my consumer:
> > Caused by: java.lang.IllegalStateException: Correlation id for response
> > (767587) does not match request (767585)
> >         at org.apache.kafka.clients.NetworkClient.correlate(
> > NetworkClient.java:477)
> >         at org.apache.kafka.clients.NetworkClient.
> handleCompletedReceives(
> > NetworkClient.java:440)
> >         at org.apache.kafka.clients.NetworkClient.poll(
> > NetworkClient.java:265)
> >         at org.apache.kafka.clients.consumer.internals.
> > ConsumerNetworkClient.clientPoll(ConsumerNetworkClient.java:320)
> >         at org.apache.kafka.clients.consumer.internals.
> > ConsumerNetworkClient.poll(ConsumerNetworkClient.java:213)
> >         at org.apache.kafka.clients.consumer.internals.
> > ConsumerNetworkClient.poll(ConsumerNetworkClient.java:193)
> >         at org.apache.kafka.clients.consumer.KafkaConsumer.
> > pollOnce(KafkaConsumer.java:908)
> >         at org.apache.kafka.clients.consumer.KafkaConsumer.poll(
> > KafkaConsumer.java:853)
> >         at com.hulu.flintan.metadatacache.kafka.
> > AssetChangeEventConsumer.run(AssetChangeEventConsumer.java:47)
> > 35 seconds later, I started seeing
> >
> > Error ILLEGAL_GENERATION occurred while committing offsets for group
> > flintan-metadatacache-5b9d551b-bd4a-41fa-812f-3989c76c3856
> >
> > happens all the time.
> >
> > I then checked the server side log, it shows
> >
> > [2016-11-17 06:17:28,868] INFO [GroupCoordinator 2]: Preparing to
> > restabilize group flintan-metadatacache-5b9d551b-bd4a-41fa-812f-
> 3989c76c3856
> > with old generation 1 (kafka.coordinator.GroupCoordinator)
> > [2016-11-17 06:17:28,869] INFO [GroupCoordinator 2]: Group
> > flintan-metadatacache-5b9d551b-bd4a-41fa-812f-3989c76c3856 generation 1
> > is dead and removed (kafka.coordinator.GroupCoordinator)
> > [2016-11-17 06:17:42,396] INFO [GroupCoordinator 2]: Preparing to
> > restabilize group flintan-metadatacache-5b9d551b-bd4a-41fa-812f-
> 3989c76c3856
> > with old generation 0 (kafka.coordinator.GroupCoordinator)
> > [2016-11-17 06:17:42,396] INFO [GroupCoordinator 2]: Stabilized group
> > flintan-metadatacache-5b9d551b-bd4a-41fa-812f-3989c76c3856 generation 1
> > (kafka.coordinator.GroupCoordinator)
> > [2016-11-17 06:17:42,399] INFO [GroupCoordinator 2]: Assignment received
> > from leader for group flintan-metadatacache-5b9d551b-bd4a-41fa-812f-
> 3989c76c3856
> > for generation 1 (kafka.coordinator.GroupCoordinator)
> > [2016-11-17 06:18:12,404] INFO [GroupCoordinator 2]: Preparing to
> > restabilize group flintan-metadatacache-5b9d551b-bd4a-41fa-812f-
> 3989c76c3856
> > with old generation 1 (kafka.coordinator.GroupCoordinator)
> >
> > over and over all the time. It looks like the consumer group is bouncing
> > between generation 0 and 1, and it stop consuming anything. This consumer
> > group only have 1 consumer with it.
> >
> > We are using
> > auto commit with interval 1000ms
> > session timeout 30000ms
> > heartbeat interval 3000ms
> >
> > My questions are:
> > 1. Why this happens and how to prevent it happening again?
> > 2. If it happens, how should I react in this case? Catch
> IllegalGeneration
> > exception and resubscribe the topic? Or recreate the consumer w/ same (or
> > different) consumer group id?
> > 3. Where can I find the active consumer group for a topic? Does that
> store
> > in zookeeper?
> >
> > Thanks,
> > Jun
>
>
>
>
> --
> -- Guozhang
>

Re: Consumer group keep bouncing between generation id 0 and 1

Posted by Guozhang Wang <wa...@gmail.com>.

Hello Jun,

The reason that the group is bouncing between generation 0 and 1 is that it
is iteratively removed as an empty group after the only member is removed,
and then re-create the group with the member jumping back to re-join.

My suspect is that your PartitionAssigned callback is too heavy, that it
takes more than the session timeout. Note that the coordinator start
ticking on the member right after the partition is assigned, but the
consumer will not start heartbeating until the callback is completed.

As for the "Correlation id for response (767587) does not match request
(767585)" issue, it is indeed a bit weird and I have not seen it before.
Would need to investigate further from the logs.

Currently there is no tools for searching for all the active consumer
groups for a given topic, one has to query all the consumer groups and
filter by their subscribed topics.


Guozhang




On Fri, Nov 18, 2016 at 4:19 PM, Jun MA <mj...@gmail.com> wrote:

> Hi guys,
>
> I’m using kafka 0.9.0.1 and Java client. I saw the following exceptions
> throw by my consumer:
> Caused by: java.lang.IllegalStateException: Correlation id for response
> (767587) does not match request (767585)
>         at org.apache.kafka.clients.NetworkClient.correlate(
> NetworkClient.java:477)
>         at org.apache.kafka.clients.NetworkClient.handleCompletedReceives(
> NetworkClient.java:440)
>         at org.apache.kafka.clients.NetworkClient.poll(
> NetworkClient.java:265)
>         at org.apache.kafka.clients.consumer.internals.
> ConsumerNetworkClient.clientPoll(ConsumerNetworkClient.java:320)
>         at org.apache.kafka.clients.consumer.internals.
> ConsumerNetworkClient.poll(ConsumerNetworkClient.java:213)
>         at org.apache.kafka.clients.consumer.internals.
> ConsumerNetworkClient.poll(ConsumerNetworkClient.java:193)
>         at org.apache.kafka.clients.consumer.KafkaConsumer.
> pollOnce(KafkaConsumer.java:908)
>         at org.apache.kafka.clients.consumer.KafkaConsumer.poll(
> KafkaConsumer.java:853)
>         at com.hulu.flintan.metadatacache.kafka.
> AssetChangeEventConsumer.run(AssetChangeEventConsumer.java:47)
> 35 seconds later, I started seeing
>
> Error ILLEGAL_GENERATION occurred while committing offsets for group
> flintan-metadatacache-5b9d551b-bd4a-41fa-812f-3989c76c3856
>
> happens all the time.
>
> I then checked the server side log, it shows
>
> [2016-11-17 06:17:28,868] INFO [GroupCoordinator 2]: Preparing to
> restabilize group flintan-metadatacache-5b9d551b-bd4a-41fa-812f-3989c76c3856
> with old generation 1 (kafka.coordinator.GroupCoordinator)
> [2016-11-17 06:17:28,869] INFO [GroupCoordinator 2]: Group
> flintan-metadatacache-5b9d551b-bd4a-41fa-812f-3989c76c3856 generation 1
> is dead and removed (kafka.coordinator.GroupCoordinator)
> [2016-11-17 06:17:42,396] INFO [GroupCoordinator 2]: Preparing to
> restabilize group flintan-metadatacache-5b9d551b-bd4a-41fa-812f-3989c76c3856
> with old generation 0 (kafka.coordinator.GroupCoordinator)
> [2016-11-17 06:17:42,396] INFO [GroupCoordinator 2]: Stabilized group
> flintan-metadatacache-5b9d551b-bd4a-41fa-812f-3989c76c3856 generation 1
> (kafka.coordinator.GroupCoordinator)
> [2016-11-17 06:17:42,399] INFO [GroupCoordinator 2]: Assignment received
> from leader for group flintan-metadatacache-5b9d551b-bd4a-41fa-812f-3989c76c3856
> for generation 1 (kafka.coordinator.GroupCoordinator)
> [2016-11-17 06:18:12,404] INFO [GroupCoordinator 2]: Preparing to
> restabilize group flintan-metadatacache-5b9d551b-bd4a-41fa-812f-3989c76c3856
> with old generation 1 (kafka.coordinator.GroupCoordinator)
>
> over and over all the time. It looks like the consumer group is bouncing
> between generation 0 and 1, and it stop consuming anything. This consumer
> group only have 1 consumer with it.
>
> We are using
> auto commit with interval 1000ms
> session timeout 30000ms
> heartbeat interval 3000ms
>
> My questions are:
> 1. Why this happens and how to prevent it happening again?
> 2. If it happens, how should I react in this case? Catch IllegalGeneration
> exception and resubscribe the topic? Or recreate the consumer w/ same (or
> different) consumer group id?
> 3. Where can I find the active consumer group for a topic? Does that store
> in zookeeper?
>
> Thanks,
> Jun




-- 
-- Guozhang