Posted to users@kafka.apache.org by Vincent Dautremont <vi...@olamobile.com.INVALID> on 2017/09/26 15:42:56 UTC

consumer group offset chaos

Hi,
I recently experienced a reset of consumer group offsets on a cluster of
3 Kafka nodes (v0.11.0.0).

I run 3 high-level consumers built on librdkafka 0.9.4.
They query the committed offsets of the consumer group's assigned
partitions just after each rebalance and before consuming anything.

Every offset-related action is logged to a file so that problems can be
retraced. On those 3 consumers the log ends at an offset near 313 000 000
for all 12 partitions.
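
A check that such an offset log makes possible, sketched in Python. This is
not what my clients actually do: the function name, the dictionaries and the
tolerance parameter are hypothetical, and the values are rounded from the
numbers in this post (the 9 000 000 figure is what the clients reported after
the incident described below). The idea is to compare the committed offsets
reported after a rebalance with the last offsets written to the local log,
and to flag any partition that went backwards.

```python
def find_offset_regressions(last_logged, committed, tolerance=0):
    """Return {partition: (logged, reported)} for every partition whose
    committed offset is more than `tolerance` below the last logged one."""
    regressions = {}
    for partition, logged in sorted(last_logged.items()):
        reported = committed.get(partition)
        if reported is None:
            continue  # no committed offset for this partition; not judged here
        if reported < logged - tolerance:
            regressions[partition] = (logged, reported)
    return regressions


# Illustrative values only, rounded from the figures in this post.
last_logged = {p: 313_000_000 for p in range(12)}
committed = {p: 9_000_000 for p in range(12)}

suspicious = find_offset_regressions(last_logged, committed)
for partition, (logged, reported) in suspicious.items():
    print(f"partition {partition}: logged {logged}, group reports {reported}")
```

A client could refuse to start consuming, or raise an alert, when this
returns a non-empty result.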

Following an unknown "cluster problem", communication between my 3 clients
and the cluster ended unexpectedly: the error responses to one of the
offset commits were
err = *Broker: Not coordinator for group*
and
err = *Broker: Unknown member*

So my 3 clients all reset and retried the consumer group assignment 30
seconds later.
On the 3 clients the committed offsets of the assigned partitions were then
reported around 9 000 000.
Because the first valid message of each partition was around offset
300 000 000, the clients reprocessed the whole topic from its beginning,
about 13 000 000 messages.
My near-real-time client programs could no longer keep up in near real
time from that point, which is quite critical.
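
My reading of those numbers, as a small Python sketch. It assumes the
consumers behave as if auto.offset.reset were set to "earliest" when the
committed offset is out of range. I have not confirmed that this is the
code path that was taken, and the class and function names are made up
for illustration.

```python
from dataclasses import dataclass


@dataclass
class PartitionState:
    log_start: int  # first offset still retained on the broker
    log_end: int    # end of the partition log
    committed: int  # offset reported for the consumer group


def resolve_start(state, reset_policy="earliest"):
    """Return the offset consumption would start from.

    A committed offset inside [log_start, log_end] is used as-is.
    Otherwise the reset policy decides (assumed behaviour, see above).
    """
    if state.log_start <= state.committed <= state.log_end:
        return state.committed
    if reset_policy == "earliest":
        return state.log_start
    if reset_policy == "latest":
        return state.log_end
    raise ValueError(f"unsupported reset policy: {reset_policy}")


# Rounded figures from this post.
p = PartitionState(log_start=300_000_000, log_end=313_000_000,
                   committed=9_000_000)
start = resolve_start(p)
backlog = p.log_end - start
print(start, backlog)  # 300000000 13000000
```

With these inputs the reported offset of 9 000 000 is below the first
retained offset, so consumption restarts at 300 000 000 and about
13 000 000 offsets have to be replayed.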

I'd like help understanding what happened here. I wonder whether there is
a bug in Kafka, or in its use of ZooKeeper, related to this.

We can clearly see from the logs that there was a problem with ZooKeeper on
node 2, and that Kafka node 2 was removed from the cluster for a few seconds.

log of KafkaNode3-controller.log.2017-09-22-21 https://pastebin.com/ekXNd13G

log of KafkaNode2-server.log.2017-09-22-21 https://pastebin.com/3d05jtNx

log of *ServerNode2-zookeeper.log* https://pastebin.com/yY9vMWDQ
log of *ServerNode3-zookeeper.log* https://pastebin.com/nQRt8dh1

I don't understand the source of all this, but it seems related to a
ZooKeeper problem on server 2 (what caused it?).

Could all this be linked to https://issues.apache.org/jira/browse/KAFKA-5600
(fixed in v0.11.0.1)?


Thanks for your thoughts.
Vincent.
