Posted to dev@kafka.apache.org by "Matthias J. Sax (Jira)" <ji...@apache.org> on 2022/12/06 03:40:00 UTC

[jira] [Resolved] (KAFKA-14440) Local state wipeout with EOS

     [ https://issues.apache.org/jira/browse/KAFKA-14440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matthias J. Sax resolved KAFKA-14440.
-------------------------------------
    Resolution: Duplicate

> Local state wipeout with EOS
> ----------------------------
>
>                 Key: KAFKA-14440
>                 URL: https://issues.apache.org/jira/browse/KAFKA-14440
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 3.2.3
>            Reporter: Abdullah alkhawatrah
>            Priority: Major
>         Attachments: Screenshot 2022-12-02 at 09.26.27.png
>
>
> Hey,
> I have a Kafka Streams service that aggregates events from multiple input topics (running in a k8s cluster). The topology has multiple FKJs (foreign-key joins). The input topics had around 7 billion events when the service was started from `earliest`.
> The service has EOS enabled and the following transaction timeout configured:
> {code:java}
> transaction.timeout.ms: 600000{code}
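> For context, a minimal sketch of how such a setup is typically configured (the application id, bootstrap server, and the exact wiring below are my assumptions for illustration, not the actual config of this service):
> {code:java}
> import java.util.Properties;
> import org.apache.kafka.clients.producer.ProducerConfig;
> import org.apache.kafka.streams.StreamsConfig;
>
> Properties props = new Properties();
> props.put(StreamsConfig.APPLICATION_ID_CONFIG, "aggregation-service"); // hypothetical id
> props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");      // hypothetical broker
> // Enable exactly-once processing (EXACTLY_ONCE_V2 is available in 3.2.x).
> props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);
> // Forward the 10-minute transaction timeout to the embedded producer.
> props.put(StreamsConfig.producerPrefix(ProducerConfig.TRANSACTION_TIMEOUT_CONFIG), 600000);
> {code}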
> The problem I am having is frequent local state wipe-outs, which lead to very long rebalances. As can be seen from the attached screenshot, local disk sizes drop to ~0 very often. These wipe-outs are part of the EOS guarantee, judging by this log message:
> {code:java}
> State store transfer-store did not find checkpoint offsets while stores are not empty, since under EOS it has the risk of getting uncommitted data in stores we have to treat it as a task corruption error and wipe out the local state of task 1_8 before re-bootstrapping{code}
>  
> I noticed that this happens as a result of one of the following:
>  * The process gets a SIGKILL when it runs out of memory or fails to shut down gracefully (on pod rotation, for example). This explains the missing local checkpoint file, but I thought local checkpoint updates were frequent, so I expected only part of the state to be reset, not the whole local state (see the shutdown-hook sketch after this list).
>  * Although we have a long transaction timeout configured, the following appears many times in the logs, after which Kafka Streams enters an error state. On the next startup, the local checkpoint file is not found:
> {code:java}
> Transiting to abortable error state due to org.apache.kafka.common.errors.InvalidProducerEpochException: Producer attempted to produce with an old epoch.{code}
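> On the SIGKILL point above: as far as I understand, under EOS the checkpoint file is only written on a clean shutdown (or rebalance), not on every commit, which would explain why a kill leaves no checkpoint at all. A graceful shutdown path therefore matters; a minimal sketch of what that could look like (the timeout and hook wiring are assumptions on my side):
> {code:java}
> import java.time.Duration;
> import org.apache.kafka.streams.KafkaStreams;
>
> KafkaStreams streams = new KafkaStreams(topology, props); // topology/props as above
> streams.start();
> // Close cleanly on SIGTERM so state stores are flushed and the local
> // checkpoint file is written; a SIGKILL (e.g. when the pod's grace
> // period expires) skips this entirely and leaves no checkpoint.
> Runtime.getRuntime().addShutdownHook(new Thread(() ->
>     streams.close(Duration.ofSeconds(30))));
> {code}
> In a k8s deployment, the pod's terminationGracePeriodSeconds has to exceed the close timeout for this to take effect.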
> The service has 10 instances, all showing the same behaviour. The issue disappears when EOS is disabled.
> The Kafka cluster runs Kafka 2.6 with a minimum ISR of 3.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)