Posted to jira@kafka.apache.org by "Greg Harris (Jira)" <ji...@apache.org> on 2022/12/22 19:25:00 UTC

[jira] [Commented] (KAFKA-14548) Stable streams applications stall due to infrequent restoreConsumer polls

    [ https://issues.apache.org/jira/browse/KAFKA-14548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17651401#comment-17651401 ] 

Greg Harris commented on KAFKA-14548:
-------------------------------------

[~mjsax] since you previously categorized https://issues.apache.org/jira/browse/KAFKA-13405 (which has the exact same cause and symptoms as this issue) as Not A Bug, do you think the reasoning for the above tactical fix makes sense?

> Stable streams applications stall due to infrequent restoreConsumer polls
> -------------------------------------------------------------------------
>
>                 Key: KAFKA-14548
>                 URL: https://issues.apache.org/jira/browse/KAFKA-14548
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>            Reporter: Greg Harris
>            Priority: Major
>
> We have observed behavior with Streams where otherwise healthy applications stall and become unable to process data after a rebalance. The root cause is that a restoreConsumer can be partitioned from a Kafka cluster by stale metadata while the mainConsumer remains healthy with up-to-date metadata. This is due to both an issue in streams and an issue in the consumer logic.
> In StoreChangelogReader, a long-lived restoreConsumer is kept instantiated while the streams app is running. This consumer is only `poll()`ed when the ChangelogReader::restore method is called and at least one changelog is in the RESTORING state. This may be very infrequent if the streams app is stable.
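> A minimal sketch of that polling pattern (hypothetical names, not the actual StoreChangelogReader code; `hasRestoringChangelogs` is a stand-in for the RESTORING-state check):
> {code:java}
> import java.time.Duration;
> import org.apache.kafka.clients.consumer.Consumer;
> import org.apache.kafka.clients.consumer.ConsumerRecords;
>
> // Sketch only: the point is that poll() runs solely while a restore is in
> // progress, so a stable app may not poll the restoreConsumer for hours.
> final class RestoreLoopSketch {
>     static void restoreLoop(Consumer<byte[], byte[]> restoreConsumer) {
>         while (true) {
>             if (hasRestoringChangelogs()) {
>                 ConsumerRecords<byte[], byte[]> records =
>                     restoreConsumer.poll(Duration.ofMillis(100));
>                 // ... write the records into the state stores ...
>             }
>             // Otherwise the restoreConsumer is never polled: no poll-driven
>             // metadata refresh or liveness signal ever happens.
>         }
>     }
>
>     static boolean hasRestoringChangelogs() { return false; } // stand-in
> }
> {code}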
> This is an anti-pattern, as frequent poll()s are expected to keep Kafka consumers in contact with the Kafka cluster. Infrequent polls are considered failures from the perspective of the consumer API. From the [official Kafka Consumer documentation|https://kafka.apache.org/33/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html]:
> {noformat}
> The poll API is designed to ensure consumer liveness.
> ...
> So to stay in the group, you must continue to call poll.
> ...
> The recommended way to handle these cases [where the main thread is not ready for more data] is to move message processing to another thread, which allows the consumer to continue calling poll while the processor is still working.
> ...
> Note also that you will need to pause the partition so that no new records are received from poll until after thread has finished handling those previously returned.{noformat}
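> For reference, the pause-and-poll pattern that documentation recommends looks roughly like this (a sketch with hypothetical names, not Streams code): the partitions are paused so poll() keeps the consumer alive without delivering new records:
> {code:java}
> import java.time.Duration;
> import org.apache.kafka.clients.consumer.Consumer;
>
> // Sketch of the javadoc's recommendation: while the processing thread is
> // busy, keep calling poll() with all partitions paused so the consumer
> // stays live in the group (and keeps its metadata fresh) without
> // receiving new records.
> final class PollWhilePausedSketch {
>     static void waitForProcessor(Consumer<byte[], byte[]> consumer,
>                                  java.util.concurrent.Future<?> processing) {
>         consumer.pause(consumer.assignment());
>         while (!processing.isDone()) {
>             consumer.poll(Duration.ofMillis(100)); // no records while paused
>         }
>         consumer.resume(consumer.assignment());
>     }
> }
> {code}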
> With the current behavior, the restoreConsumer is expected to fall out of the group regularly and be considered failed, even when the rest of the application is running exactly as intended.
> This is not normally an issue, as falling out of the group is easily repaired by rejoining during the next poll. It does mean slightly higher latency when performing a restore, but that does not appear to be a major concern at this time.
> This does become an issue when other, deeper assumptions about the usage of Kafka clients are violated. Relevant to this issue, the client metadata management logic assumes that regular polling will take place and that a metadata update can piggy-back on the regular poll call. Without a regular poll, the periodic metadata update cannot be performed, and the consumer violates its own `metadata.max.age.ms` configuration. This can leave the restoreConsumer with metadata so stale that it contains none of the currently live brokers, partitioning it from the cluster.
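> For illustration, the refresh interval in question is an ordinary consumer setting (the value shown below is the default), and it can only be honored if poll() runs often enough to trigger the refresh:
> {code:java}
> import java.util.Properties;
>
> final class MetadataAgeSketch {
>     static Properties consumerProps() {
>         Properties props = new Properties();
>         props.put("bootstrap.servers", "localhost:9092");
>         // Default 5 minutes: the client should refresh cluster metadata at
>         // least this often, but the consumer only initiates the refresh
>         // from its poll path, so an un-polled consumer silently violates it.
>         props.put("metadata.max.age.ms", "300000");
>         return props;
>     }
> }
> {code}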
> Alleviating this failure mode does not _require_ Streams' polling behavior to change, as solutions for all clients have been considered (https://issues.apache.org/jira/browse/KAFKA-3068 and that family of duplicate issues).
> However, as a tactical fix for the issue, and one which does not require a KIP changing the behavior of {_}every Kafka client{_}, we should consider changing the restoreConsumer poll behavior to bring it closer to the expected happy path of at least one poll() every max.poll.interval.ms.
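> One possible shape for such a fix (a sketch under the assumptions above, not an actual patch): when nothing is restoring, still poll the restoreConsumer occasionally with all partitions paused, so the metadata refresh piggy-backs on poll() without consuming changelog records:
> {code:java}
> import java.time.Duration;
> import java.util.Set;
> import org.apache.kafka.clients.consumer.Consumer;
> import org.apache.kafka.clients.consumer.ConsumerRecords;
> import org.apache.kafka.common.TopicPartition;
>
> // Hypothetical sketch: poll even when idle, with all partitions paused, so
> // the poll returns no records but still drives the consumer's network
> // activity (metadata refresh included).
> final class IdlePollSketch {
>     static ConsumerRecords<byte[], byte[]> pollRestoreConsumer(
>             Consumer<byte[], byte[]> restoreConsumer,
>             Set<TopicPartition> restoringPartitions,
>             Duration pollTime) {
>         if (restoringPartitions.isEmpty()) {
>             restoreConsumer.pause(restoreConsumer.assignment());
>             return restoreConsumer.poll(Duration.ZERO);
>         }
>         restoreConsumer.resume(restoringPartitions);
>         return restoreConsumer.poll(pollTime);
>     }
> }
> {code}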
> If there is another hidden assumption in the clients that relies on regular polling, this tactical fix may also prevent users of the streams library from being affected by it, reducing the impact of that hidden assumption through defense-in-depth.
> This would also be a backportable fix for streams users, whereas a fix in the consumers would only apply to new consumer versions.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)