Posted to dev@kafka.apache.org by "Raman Gupta (Jira)" <ji...@apache.org> on 2020/07/02 18:03:00 UTC

[jira] [Created] (KAFKA-10229) Kafka stream dies when earlier shut down node leaves group, no errors logged on client

Raman Gupta created KAFKA-10229:
-----------------------------------

             Summary: Kafka stream dies when earlier shut down node leaves group, no errors logged on client
                 Key: KAFKA-10229
                 URL: https://issues.apache.org/jira/browse/KAFKA-10229
             Project: Kafka
          Issue Type: Bug
          Components: streams
    Affects Versions: 2.4.1
            Reporter: Raman Gupta


My broker and clients are 2.4.1, and I'm currently running a single broker. I have a Kafka Streams application with exactly-once processing turned on, and an uncaught exception handler defined on the client. I noticed that this stream was lagging, and upon investigation I saw that its consumer group was empty.
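
For reference, the setup is roughly equivalent to this minimal sketch (the topology and topic names are placeholders, not the real application; the application id is the stream's consumer group, produs-cisFileIndexer-stream):

```
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class CisFileIndexerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "produs-cisFileIndexer-stream");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
        // exactly-once processing, as described above
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // placeholder topology; the real application indexes files
        builder.stream("cis-files-in").to("cis-files-indexed");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        // uncaught exception handler on the client (2.4.x API), which never fired
        streams.setUncaughtExceptionHandler((thread, throwable) ->
                System.err.println("Stream thread " + thread.getName() + " died: " + throwable));
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```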

On restarting the consumers, the consumer group re-established itself, but after about 8 minutes, the group became empty again. There is nothing logged on the client side about any stream errors, despite the existence of an uncaught exception handler.
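
The empty group state can be seen with kafka-consumer-groups.sh --describe, or programmatically along the lines of this sketch with the admin client (broker address assumed):

```
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ConsumerGroupDescription;

public class GroupStateCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
        try (AdminClient admin = AdminClient.create(props)) {
            ConsumerGroupDescription desc = admin
                    .describeConsumerGroups(Collections.singleton("produs-cisFileIndexer-stream"))
                    .describedGroups()
                    .get("produs-cisFileIndexer-stream")
                    .get();
            // state() reports Empty once all members have left or expired
            System.out.println(desc.state() + " with " + desc.members().size() + " members");
        }
    }
}
```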

In the broker logs, about 8 minutes after the clients restart and the stream goes to the RUNNING state, I see:

```
[2020-07-02 17:34:47,033] INFO [GroupCoordinator 0]: Member cis-d7fb64c95-kl9wl-1-630af77f-138e-49d1-b76a-6034801ee359 in group produs-cisFileIndexer-stream has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
[2020-07-02 17:34:47,033] INFO [GroupCoordinator 0]: Preparing to rebalance group produs-cisFileIndexer-stream in state PreparingRebalance with old generation 228 (__consumer_offsets-3) (reason: removing member cis-d7fb64c95-kl9wl-1-630af77f-138e-49d1-b76a-6034801ee359 on heartbeat expiration) (kafka.coordinator.group.GroupCoordinator)
```

So according to this, the consumer heartbeat has expired. I don't know why that would be: logging shows that the stream was running and processing messages normally, and then it simply stopped processing anything about 4 minutes before it died, with no apparent errors or issues, and nothing logged via the uncaught exception handler.
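
For reference, heartbeats are sent from a background thread in the consumer, so a stalled processing loop alone shouldn't normally expire the session. The relevant settings can be overridden through Streams along the lines of the sketch below (values are illustrative only, not confirmed fixes):

```
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.streams.StreamsConfig;

public class HeartbeatTuningSketch {
    // Sketch only: heartbeat-related consumer settings passed through Streams.
    static Properties heartbeatOverrides() {
        Properties props = new Properties();
        // the broker removes a member when no heartbeat arrives within session.timeout.ms
        props.put(StreamsConfig.consumerPrefix(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG), 30000);
        // heartbeat.interval.ms should be well below the session timeout
        props.put(StreamsConfig.consumerPrefix(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG), 10000);
        // a separate limit: time allowed between poll() calls before eviction
        props.put(StreamsConfig.consumerPrefix(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG), 300000);
        return props;
    }
}
```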

It doesn't appear to be related to any specific poison-pill message: restarting the stream causes it to reprocess a bunch more messages from the backlog, and then die again approximately 8 minutes later. At the time of the last message consumed by the stream, there are no `INFO`-level or above logs on either the client or the broker, nor any errors whatsoever. The stream consumption simply stops.

There are two consumers -- even if I limit consumption to only a single consumer, the same thing happens.

The runtime environment is Kubernetes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)