You are viewing a plain text version of this content. The canonical link for it is here.
Posted to jira@kafka.apache.org by "Tim Costa (Jira)" <ji...@apache.org> on 2022/01/24 22:28:00 UTC

[jira] [Created] (KAFKA-13615) Kafka Streams does not transition state on LeaveGroup due to poll interval being exceeded

Tim Costa created KAFKA-13615:
---------------------------------

             Summary: Kafka Streams does not transition state on LeaveGroup due to poll interval being exceeded
                 Key: KAFKA-13615
                 URL: https://issues.apache.org/jira/browse/KAFKA-13615
             Project: Kafka
          Issue Type: Bug
          Components: streams
    Affects Versions: 2.8.1
            Reporter: Tim Costa


We are running a KafkaStreams application with largely default settings. Occasionally one of our consumers in the group takes too long between polls, and streams leaves the consumer group but the state of the application remains `RUNNING`. We are using the default `max.poll.interval.ms` of 5000.

The process stays alive with no exception that bubbles to our code, so when this occurs our app just kinda sits there idle until a manual restart is performed.

Here are the logs from around the time of the problem:
{code:java}
{"timestamp":"2022-01-24 19:56:44.404","level":"INFO","thread":"kubepodname-StreamThread-1","logger":"org.apache.kafka.streams.processor.internals.StreamThread","message":"stream-thread [kubepodname-StreamThread-1] Processed 65296 total records, ran 0 punctuators, and committed 400 total tasks since the last update","context":"default"} {"timestamp":"2022-01-24 19:58:44.478","level":"INFO","thread":"kubepodname-StreamThread-1","logger":"org.apache.kafka.streams.processor.internals.StreamThread","message":"stream-thread [kubepodname-StreamThread-1] Processed 65284 total records, ran 0 punctuators, and committed 400 total tasks since the last update","context":"default"} {"timestamp":"2022-01-24 20:03:50.383","level":"INFO","thread":"kafka-coordinator-heartbeat-thread | stage-us-1-fanout-logs-2c99","logger":"org.apache.kafka.clients.consumer.internals.AbstractCoordinator","message":"[Consumer clientId=kubepodname-StreamThread-1-consumer, groupId=stage-us-1-fanout-logs-2c99] Member kubepodname-StreamThread-1-consumer-283f0e0d-defa-4edf-88b2-39703f845db5 sending LeaveGroup request to coordinator b-2.***.kafka.us-east-1.amazonaws.com:9096 (id: 2147483645 rack: null) due to consumer poll timeout has expired. This means the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time processing messages. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.","context":"default"} {code}
At this point the application entirely stops processing data. We initiated a shutdown by deleting the kubernetes pod, and the line printed immediately by kafka is the following:
{code:java}
{"timestamp":"2022-01-24 20:05:27.368","level":"INFO","thread":"kafka-streams-close-thread","logger":"org.apache.kafka.streams.processor.internals.StreamThread","message":"stream-thread [kubepodname-StreamThread-1] State transition from RUNNING to PENDING_SHUTDOWN","context":"default"}
 {code}
For a period of over a minute the application was in a state of hiatus where it had left the group, however it was still marked as being in a `RUNNING` state so we had no way to detect that the application had entered a bad state to kill it in an automated fashion. While the above logs are from an app that we shut down within a minute or two manually, we have seen this stay in a bad state for up to an hour before.

It feels like a bug to me that the streams consumer can leave the consumer group but not exit the `RUNNING` state. I tried searching for other bugs like this, but couldn't find any. Any ideas on how to detect this, or thoughts on whether this is actually a bug?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)