Posted to jira@kafka.apache.org by "Abdullah alkhawatrah (Jira)" <ji...@apache.org> on 2022/12/03 10:38:00 UTC

[jira] [Created] (KAFKA-14440) Local state wipeout with EOS

Abdullah alkhawatrah created KAFKA-14440:
--------------------------------------------

             Summary: Local state wipeout with EOS
                 Key: KAFKA-14440
                 URL: https://issues.apache.org/jira/browse/KAFKA-14440
             Project: Kafka
          Issue Type: Bug
          Components: streams
    Affects Versions: 3.2.3
            Reporter: Abdullah alkhawatrah
         Attachments: Screenshot 2022-12-02 at 09.26.27.png

Hey,

I have a Kafka Streams service that aggregates events from multiple input topics (running in a k8s cluster). The topology has multiple foreign-key joins (FKJs). The input topics had around 7 billion events when the service was started from `earliest`.
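
For context, the topology shape is roughly the following (heavily simplified; topic names, store name and join logic are placeholders, not our real code):

    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.kstream.KTable;
    import org.apache.kafka.streams.kstream.Materialized;

    StreamsBuilder builder = new StreamsBuilder();
    KTable<String, String> left = builder.table("input-a");    // placeholder topics
    KTable<String, String> right = builder.table("input-b");

    // Foreign-key join: the foreign key is extracted from the left-hand value,
    // and the join result is materialized in a local RocksDB store.
    KTable<String, String> joined = left.join(
        right,
        leftValue -> leftValue.split(":")[0],                  // placeholder FK extractor
        (leftValue, rightValue) -> leftValue + "|" + rightValue,
        Materialized.as("fkj-store"));                         // placeholder store name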

The service has EOS enabled and `transaction.timeout.ms` is `600000`.
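
The relevant part of the config is roughly this (a sketch; the transaction timeout is a producer config, so it goes through the producer prefix):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.streams.StreamsConfig;

    Properties props = new Properties();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "aggregator");      // placeholder
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");  // placeholder
    props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);
    props.put(StreamsConfig.producerPrefix(ProducerConfig.TRANSACTION_TIMEOUT_CONFIG), 600000);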

The problem I am having is frequent local state wipe-outs, which lead to very long rebalances. As can be seen from the attached screenshot, local disk sizes drop to ~0 very often. The wipe-outs are part of the EOS guarantee, based on this log message:
State store transfer-store did not find checkpoint offsets while stores are not empty, since under EOS it has the risk of getting uncommitted data in stores we have to treat it as a task corruption error and wipe out the local state of task 1_8 before re-bootstrapping
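
My reading of that message is that the startup check is effectively the following (pseudocode of my understanding, not the actual Kafka Streams source):

    // Illustrative only: if EOS is on, the store has data, but no checkpoint
    // offsets exist, the task is treated as corrupted.
    void initializeFromCheckpoint(boolean eosEnabled, boolean checkpointFound, boolean storeEmpty) {
        if (eosEnabled && !checkpointFound && !storeEmpty) {
            // Existing RocksDB content without checkpoint offsets may include
            // uncommitted writes, so the local task directory is wiped and
            // the state is re-bootstrapped from the changelog topics.
            throw new org.apache.kafka.streams.errors.TaskCorruptedException(java.util.Set.of());
        }
    }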

I noticed that this happens as a result of one of the following:
 * Process gets SIGKILLed when running out of memory or failing to shut down gracefully (on pod rotation, for example). This explains the missing local checkpoint file, but I thought local checkpoint updates were frequent, so I expected part of the state to be reset, not the whole local state.
 * Although we have a long transaction timeout, the following appears many times in the logs, after which Kafka Streams transitions to the ERROR state. On restart, the local checkpoint file is not found:

Transiting to abortable error state due to org.apache.kafka.common.errors.InvalidProducerEpochException: Producer attempted to produce with an old epoch.
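
For reference, the ERROR transition is easy to see with a plain state listener; a minimal sketch (`builder` and `props` as in the snippets above, simplified from our actual wiring):

    import org.apache.kafka.streams.KafkaStreams;

    KafkaStreams streams = new KafkaStreams(builder.build(), props);
    // Log every state transition; after the epoch errors the client ends up in ERROR.
    streams.setStateListener((newState, oldState) ->
        System.out.printf("streams state: %s -> %s%n", oldState, newState));
    streams.start();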

The service has 10 instances, all showing the same behaviour. The issue disappears when EOS is disabled.

The Kafka cluster runs Kafka 2.6, with a minimum ISR of 3.
