Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2022/06/04 18:18:25 UTC

[GitHub] [beam] damccorm opened a new issue, #20622: Beam Flink Runner 1.10 checkpoint failure

damccorm opened a new issue, #20622:
URL: https://github.com/apache/beam/issues/20622

I recently upgraded from beam-runners-flink-1.9 v2.23.0 to beam-runners-flink-1.10 v2.23.0, and also upgraded the Flink server from 1.9.3 to 1.10.2.

The Beam pipeline reads from KafkaIO and writes to KafkaIO, with an in-memory ParDo between PBegin and PDone. The application is configured to use S3 for checkpointing, and the state backend is RocksDB.

This Beam pipeline worked as expected with beam-runners-flink-1.9. After upgrading to beam-runners-flink-1.10, the checkpoints keep timing out. I have tried increasing the timeout to several hours, but the checkpoints still time out.

There are no exceptions in the log. Based on the logs, neither the synchronous nor the asynchronous phase of checkpointing is happening. Usually, the "Trigger checkpoint" log statement is followed by "Confirm checkpoint" when the checkpoint succeeds. With 1.10, I only see "Trigger checkpoint" and no confirmation of completion or even an indication of progress. There is enough CPU and memory available, and there is no deadlock.
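
For anyone trying to reproduce the log check described above, here is a minimal sketch that lists checkpoints that were triggered but never confirmed. It assumes the two marker strings quoted above appear in the log lines and that the checkpoint id is the first integer after the marker. The function name `unconfirmed_checkpoints` and the sample lines are hypothetical, so adjust them to the actual log format.

```python
import re
from typing import Iterable, List


def unconfirmed_checkpoints(
    lines: Iterable[str],
    trigger_marker: str = "Trigger checkpoint",
    confirm_marker: str = "Confirm checkpoint",
) -> List[int]:
    """Return ids of checkpoints that were triggered but never confirmed.

    Assumption: the checkpoint id is the first integer that follows the
    marker text on the same line.
    """
    triggered: List[int] = []  # keeps the order in which ids were first seen
    confirmed = set()
    for line in lines:
        for marker, is_trigger in ((trigger_marker, True), (confirm_marker, False)):
            pos = line.find(marker)
            if pos < 0:
                continue
            # Search only after the marker so leading timestamps are ignored.
            match = re.search(r"\d+", line[pos + len(marker):])
            if match is None:
                continue
            checkpoint_id = int(match.group())
            if is_trigger:
                if checkpoint_id not in triggered:
                    triggered.append(checkpoint_id)
            else:
                confirmed.add(checkpoint_id)
    return [cp for cp in triggered if cp not in confirmed]


if __name__ == "__main__":
    # Hypothetical log lines, for illustration only.
    sample = [
        "12:00:01 Trigger checkpoint 1",
        "12:00:02 Confirm checkpoint 1",
        "12:10:01 Trigger checkpoint 2",
        "12:20:01 Trigger checkpoint 3",
    ]
    print(unconfirmed_checkpoints(sample))  # prints [2, 3]
```

With the behavior described above, every checkpoint id triggered under 1.10 would show up in the returned list.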

Imported from Jira [BEAM-10927](https://issues.apache.org/jira/browse/BEAM-10927). Original Jira may contain additional context.
Reported by: omkardeshpande8.
