Posted to issues@beam.apache.org by "Kenneth Knowles (Jira)" <ji...@apache.org> on 2021/03/16 00:15:00 UTC

[jira] [Commented] (BEAM-10927) Beam Flink Runner 1.10 checkpoint failure

    [ https://issues.apache.org/jira/browse/BEAM-10927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17302115#comment-17302115 ] 

Kenneth Knowles commented on BEAM-10927:
----------------------------------------

Pinging [~mxm] and [~thw], who may know more about this. We probably need more information to reproduce this and determine the cause.

> Beam Flink Runner 1.10 checkpoint failure
> -----------------------------------------
>
>                 Key: BEAM-10927
>                 URL: https://issues.apache.org/jira/browse/BEAM-10927
>             Project: Beam
>          Issue Type: Bug
>          Components: runner-flink
>    Affects Versions: 2.23.0
>            Reporter: Omkar Deshpande
>            Priority: P1
>
> Recently upgraded from beam-runners-flink-1.9 v2.23.0 to beam-runners-flink-1.10 v2.23.0. Also upgraded the Flink server from 1.9.3 to 1.10.2.
> The Beam pipeline reads from KafkaIO and writes to KafkaIO, with an in-memory ParDo between PBegin and PDone. The application is configured to use S3 for checkpointing, and the state backend is RocksDB.
> This Beam pipeline was working as expected with beam-runners-flink-1.9. But after upgrading to beam-runners-flink-1.10, the checkpoints keep timing out. I have tried increasing the timeout to several hours, but checkpoints still time out.
> There are no exceptions in the log. Based on the logs, neither the synchronous nor the asynchronous phase of checkpointing is happening. Usually the "Trigger checkpoint" log statement is followed by "Confirm checkpoint" when the checkpoint succeeds. But with 1.10, I only see "Trigger checkpoint" and no confirmation of completion or even indication of progress. There is enough CPU and memory available, and there is no deadlock.
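
The log symptom described in the report (a "Trigger checkpoint" statement with no matching "Confirm checkpoint") could be checked mechanically with something like the sketch below. The two marker phrases come from the report; the assumption that each phrase is followed by whitespace and a numeric checkpoint id, the function name, and the sample lines are all hypothetical, so the patterns would need adjusting to the actual log layout.

```python
import re

# Marker phrases are quoted from the report. The "<phrase> <numeric id>"
# layout is an ASSUMPTION; adapt the patterns to the real log format.
TRIGGER = re.compile(r"Trigger checkpoint\s+(\d+)")
CONFIRM = re.compile(r"Confirm checkpoint\s+(\d+)")


def unconfirmed_checkpoints(lines):
    """Return ids of checkpoints that were triggered but never confirmed,
    in the order they were triggered."""
    pending = {}  # checkpoint id -> triggering log line (insertion-ordered)
    for line in lines:
        match = TRIGGER.search(line)
        if match:
            pending[int(match.group(1))] = line
            continue
        match = CONFIRM.search(line)
        if match:
            # Drop the id once its confirmation shows up; ignore unknown ids.
            pending.pop(int(match.group(1)), None)
    return list(pending)


if __name__ == "__main__":
    # Hypothetical log excerpt, not taken from the reporter's environment.
    sample = [
        "Trigger checkpoint 1",
        "Confirm checkpoint 1",
        "Trigger checkpoint 2",
        "Trigger checkpoint 3",
    ]
    print(unconfirmed_checkpoints(sample))  # prints [2, 3]
```

Running this over the task manager logs from both runner versions would show whether every triggered checkpoint goes unconfirmed on 1.10 or only some, which is the kind of detail that may help with reproduction.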



--
This message was sent by Atlassian Jira
(v8.3.4#803005)