You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@flink.apache.org by "Yordan Pavlov (Jira)" <ji...@apache.org> on 2022/08/04 08:25:00 UTC

[jira] [Created] (FLINK-28802) EOFException when recovering from a checkpoint

Yordan Pavlov created FLINK-28802:
-------------------------------------

             Summary: EOFException when recovering from a checkpoint
                 Key: FLINK-28802
                 URL: https://issues.apache.org/jira/browse/FLINK-28802
             Project: Flink
          Issue Type: Bug
            Reporter: Yordan Pavlov


Flink version: 1.14.2

When recovering from a Savepoint, the TaskManager would throw an EOFException at this point:

[https://github.com/apache/flink/blob/master/flink-connectors/flink-connector-kafka/src/main/java/org/apache/flink/streaming/connectors/kafka/FlinkKafkaProducer.java#L1681]

It looks like the transactional id is missing on deserialization. As an interesting observation, before trying to recover from the mentioned checkpoint the job failed with:

 
{code:java}
 Aug 2, 2022 @ 16:50:40.107    Caused by: org.apache.kafka.common.errors.OutOfOrderSequenceException: The broker received an out of order sequence number.
    Aug 2, 2022 @ 16:50:40.106    2022-08-02 13:50:40.106 [flink-akka.actor.default-dispatcher-5] INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - Time correct sink UTXO Balances -> Sink Save to Kafka UTXO Balances bch-balances-v4 (1/1) (d2847bd31137561e3ba075a7cf70ee7c) switched from RUNNING to FAILED on 10.42.238.71:37353-fc3e48 @ bch-balances-v4-flink-taskmanager-0.bch-balances-v4-flink-taskmanager.flink.svc.cluster.local (dataPort=42553).
    Aug 2, 2022 @ 16:50:40.106    org.apache.flink.util.FlinkRuntimeException: Failed to send data to Kafka bch-balances-v4-0@-1 with FlinkKafkaInternalProducer{transactionalId='bch-balances-v4-0-690507', inTransaction=true, closed=false} {code}
 

Could it be that Flink actually failed to construct the checkpoint but still marked it as completed. What would be the way to check this? Is there something like a checksum byte that is checked on recovery?
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)