Posted to users@kafka.apache.org by Enrique Medina Montenegro <e....@gmail.com> on 2018/03/31 13:39:35 UTC

Marking coordinator dead and consumer 0.10.0.0

Hi,

We've been running into this issue over and over since we started using
Kafka. Essentially, it comes down to network glitches (which we are trying
to resolve) that temporarily prevent consumers from seeing the brokers, and
sometimes even prevent the brokers from seeing each other.

So the scenario is the following:

1) 3 brokers up and running with several topics, where each topic (5
partitions) is consumed by a single consumer group (with 3 consumers on
average).

2) Everything works fine during the working day and we experience no issues
whatsoever.

3) However, sometimes when we get back to the office in the morning, we
realize that some consumers in some consumer groups are no longer
consuming, while others in the same consumer group run normally.

For example, in a consumer group named
"absolutegrounds.helper.processor.datapipeline" we see that, out of 3
consumers, 2 of them stopped consuming, whereas 1 of them managed to
"recover" and continue consuming. These are their (last) respective logs:

One consumer in consumer group "absolutegrounds.helper.processor.datapipeline":
2018-03-26 01:01:04,070 INFO -kafka-consumer-1
o.a.k.c.c.i.AbstractCoordinator:542 - Marking the coordinator
10.141.36.18:9092 (id: 2147483647 rack: null) dead for group
absolutegrounds.helper.processor.datapipeline
2018-03-26 01:01:12,026 INFO -kafka-consumer-1
o.a.k.c.c.i.AbstractCoordinator:505 - Discovered coordinator
10.141.36.18:9092 (id: 2147483647 rack: null) for group
absolutegrounds.helper.processor.datapipeline.

Another consumer in consumer group "absolutegrounds.helper.processor.datapipeline":
2018-03-26 01:01:04,157 INFO -kafka-consumer-1
o.a.k.c.c.i.AbstractCoordinator:542 - Marking the coordinator
10.141.36.18:9092 (id: 2147483647 rack: null) dead for group
absolutegrounds.helper.processor.datapipeline
2018-03-26 01:01:12,040 INFO -kafka-consumer-1
o.a.k.c.c.i.AbstractCoordinator:505 - Discovered coordinator
10.141.36.18:9092 (id: 2147483647 rack: null) for group
absolutegrounds.helper.processor.datapipeline.

Last consumer in consumer group "absolutegrounds.helper.processor.datapipeline":
March 26th 2018, 03:01:07.757 -kafka-consumer-1
o.a.k.c.c.i.AbstractCoordinator:542 - Marking the coordinator
10.141.36.18:9092 (id: 2147483647 rack: null) dead for group
absolutegrounds.helper.processor.datapipeline
March 26th 2018, 03:01:11.561 -kafka-consumer-1
o.a.k.c.c.i.AbstractCoordinator:505 - Discovered coordinator
10.141.36.18:9092 (id: 2147483647 rack: null) for group
absolutegrounds.helper.processor.datapipeline.
March 26th 2018, 03:01:16.216 -kafka-consumer-1
o.a.k.c.c.i.ConsumerCoordinator:292 - Revoking previously assigned
partitions [AG_TASK_SOURCE_DP-4] for group
absolutegrounds.helper.processor.datapipeline
March 26th 2018, 03:01:16.948 -kafka-consumer-1
o.a.k.c.c.i.AbstractCoordinator:326 - (Re-)joining group
absolutegrounds.helper.processor.datapipeline
March 26th 2018, 03:01:18.478 -kafka-consumer-1
o.a.k.c.c.i.AbstractCoordinator:434 - Successfully joined group
absolutegrounds.helper.processor.datapipeline with generation 1
March 26th 2018, 03:01:18.478 -kafka-consumer-1
o.a.k.c.c.i.ConsumerCoordinator:231 - Setting newly assigned partitions
[AG_TASK_SOURCE_DP-0, AG_TASK_SOURCE_DP-1, AG_TASK_SOURCE_DP-2,
AG_TASK_SOURCE_DP-3, AG_TASK_SOURCE_DP-4] for group
absolutegrounds.helper.processor.datapipeline
March 26th 2018, 03:01:18.780 -kafka-listener-5
e.e.t.d.a.h.p.TaskProcessor:203 - Published Event CREATED. Task:
{dossiertype=1, tasktype=current, taskdate=18/04/2018 00:00:00, examiner=,
dossierid=017879332, tyoper=1, outcome={description=Pending, code=-1},
taskid=133838532, status={description=Completed, code=2}, logo=null,
owner={ownerid=711016, ownername=Jaguar Land Rover Limited}, firstlang=EN,
gsclasses=1;2;7;10;11;13;15;17;19;20;22;23;29;30;31;33;34;43;44;45,
acl=f4b794ffba01d3c8d68d21e98f6d7f75, markdate=24/03/2018 19:18:31,
kdmark=1, denomination=, milestone=EXAMINATION, marktype=2,
clazz=f4b794ffba01d3c8d68d21e98f6d7f75, lct=false}

So for the same consumer group "absolutegrounds.helper.processor.datapipeline",
2 out of 3 consumers stopped consuming, and the remaining one managed to
recover and continue consuming, apparently taking over all the partitions in
the topic (probably because the other consumers were stuck). All of them
showed the "Marking the coordinator dead" message for the same broker
(10.141.36.18:9092); see the logging sketch after point 4 below.

4) Checking with the network admins, they swear they are not aware of any
issues with the network; the only thing that might be related is that some
backup processes are triggered around that time. (We have not invested much
time in pinning down the root cause of the network glitches because,
whatever it is, it will eventually impact our brokers just as much.)
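
Regarding the logging sketch mentioned in point 3: the log sequence of the
consumer that recovered (revoke -> rejoin -> assign) is exactly what a
ConsumerRebalanceListener sees, so one thing we are considering is adding
explicit, timestamped logging around those callbacks on every consumer to
see where the stuck ones stop. A minimal plain-Java sketch of that idea
(the class name is made up and this is not our actual Spring code; the
topic name is taken from the logs above):

import java.util.Collection;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class RebalanceLoggingConsumer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "10.141.36.18:9092");
        props.put("group.id", "absolutegrounds.helper.processor.datapipeline");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);

        // Log every revocation/assignment, so a consumer that never reaches
        // onPartitionsAssigned again after "Marking the coordinator dead"
        // stands out immediately.
        consumer.subscribe(Collections.singletonList("AG_TASK_SOURCE_DP"),
                new ConsumerRebalanceListener() {
                    @Override
                    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                        System.out.println(System.currentTimeMillis() + " revoked: " + partitions);
                    }

                    @Override
                    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                        System.out.println(System.currentTimeMillis() + " assigned: " + partitions);
                    }
                });

        while (true) {
            consumer.poll(1000).forEach(record ->
                    System.out.println(record.partition() + "/" + record.offset()));
        }
    }
}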

As mentioned in the subject of this message, we are using Kafka 0.10.0.0
for both the brokers and the clients/consumers, and our consumers use the
high-level consumer API through Spring Kafka (actually Spring Cloud Stream
with the Kafka binder).
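
In case it is relevant, these are the consumer settings that, as far as we
understand, influence how quickly the client marks the coordinator dead and
reconnects. A plain-consumer sketch with illustrative values (not our actual
configuration; in our real setup these keys would go through the Spring
Cloud Stream Kafka binder configuration rather than a hand-built consumer):

import java.util.Properties;

import org.apache.kafka.clients.consumer.KafkaConsumer;

public class CoordinatorTimeoutSettings {

    public static KafkaConsumer<String, String> build() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "10.141.36.18:9092"); // one of our three brokers
        props.put("group.id", "absolutegrounds.helper.processor.datapipeline");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        // Values below are illustrative, not our production configuration.
        props.put("session.timeout.ms", "30000");   // how long the coordinator waits for heartbeats before evicting the member
        props.put("heartbeat.interval.ms", "3000"); // how often heartbeats are sent (driven by poll() on 0.10.0.0)
        props.put("request.timeout.ms", "40000");   // client-side timeout for requests, including heartbeats
        props.put("reconnect.backoff.ms", "50");    // backoff before retrying a connection to a broker

        return new KafkaConsumer<>(props);
    }
}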

We have also tried to reproduce this issue, without success: in a
"controlled" environment the consumers always recover properly.

Not sure whether this could be related to this issue -->
https://issues.apache.org/jira/browse/KAFKA-6671

Anything we can try out to spot the issue?

Thanks.