Posted to issues@flink.apache.org by "Stephan Ewen (Jira)" <ji...@apache.org> on 2020/05/14 09:19:00 UTC

[jira] [Commented] (FLINK-17089) Checkpoint fail because RocksDBException: Error While opening a file for sequentially reading

    [ https://issues.apache.org/jira/browse/FLINK-17089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17107111#comment-17107111 ] 

Stephan Ewen commented on FLINK-17089:
--------------------------------------

Are you certain that the root cause is that the user has also put RocksDB into their job jar file?
Flink classes should always be loaded "parent first", which should prevent any Flink code that is duplicated in the job jar from having any effect.
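
One way to check which copy of a class is actually in use is to print the class loader and jar location it was loaded from. The following is only a minimal sketch using plain JDK calls; the class name `ClassOriginCheck` and the method `describeOrigin` are hypothetical names chosen for illustration, not part of Flink or RocksDB. To inspect the class from the stack trace, one would pass `org.rocksdb.RocksDBException` as the argument, assuming it is on the classpath where the check runs.

```java
import java.security.CodeSource;

public class ClassOriginCheck {

    /** Describes which class loader and code location a class was loaded from. */
    public static String describeOrigin(Class<?> clazz) {
        // getClassLoader() returns null for classes loaded by the bootstrap loader.
        ClassLoader loader = clazz.getClassLoader();
        // The code source can also be null, e.g. for JDK classes.
        CodeSource source = clazz.getProtectionDomain().getCodeSource();
        String loaderName = (loader == null) ? "bootstrap" : loader.getClass().getName();
        String location = (source == null || source.getLocation() == null)
                ? "unknown"
                : source.getLocation().toString();
        return clazz.getName() + " loaded by " + loaderName + " from " + location;
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Class to inspect; defaults to a JDK class so the sketch runs anywhere.
        String className = args.length > 0 ? args[0] : "java.lang.String";
        // initialize=false avoids running static initializers of the inspected class.
        Class<?> clazz = Class.forName(
                className, false, Thread.currentThread().getContextClassLoader());
        System.out.println(describeOrigin(clazz));
    }
}
```

If the reported location is the job jar rather than a jar from the Flink distribution, that would suggest the duplicated class is the one in effect. Calling `describeOrigin` from inside a user function, where the context class loader is the one the job actually runs with, is likely to be more informative than running `main` standalone.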

> Checkpoint fail because RocksDBException: Error While opening a file for sequentially reading
> ---------------------------------------------------------------------------------------------
>
>                 Key: FLINK-17089
>                 URL: https://issues.apache.org/jira/browse/FLINK-17089
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>            Reporter: Lu Niu
>            Priority: Major
>
> We use the incremental RocksDB state backend. The Flink job's checkpoint throws the following exception after running for about 20 hours:
> {code:java}
> Caused by: org.rocksdb.RocksDBException: While opening a file for sequentially reading: /data/foo/bar/usercache/xxx/appcache/application_1584397637704_9072/flink-io-4e2294f0-7e9b-4102-b079-1089f23c47aa/job_d781983f4967703b0480c7943e8100af_op_KeyedProcessOperator_b9daf26d7397cd4b00184cc833054139__27_60__uuid_dee7e33b-9bce-42f3-909a-f6fa4ab52d8c/db/MANIFEST-000006: No such file or directory
> 	at org.rocksdb.Checkpoint.createCheckpoint(Native Method)
> 	at org.rocksdb.Checkpoint.createCheckpoint(Checkpoint.java:51)
> 	at org.apache.flink.contrib.streaming.state.snapshot.RocksIncrementalSnapshotStrategy.takeDBNativeCheckpoint(RocksIncrementalSnapshotStrategy.java:249)
> 	at org.apache.flink.contrib.streaming.state.snapshot.RocksIncrementalSnapshotStrategy.doSnapshot(RocksIncrementalSnapshotStrategy.java:160)
> 	at org.apache.flink.contrib.streaming.state.snapshot.RocksDBSnapshotStrategyBase.snapshot(RocksDBSnapshotStrategyBase.java:126)
> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.snapshot(RocksDBKeyedStateBackend.java:439)
> 	at org.apache.flink.streaming.api.operators.AbstractStreamOperator.snapshotState(AbstractStreamOperator.java:411)
> 	... 17 more
> {code}
> This failure happens consistently until the job restarts.
> Some findings:
> The JobManager log shows that the error came from a different subtask each time:
> {code:java}
> // grep jobManager log on appcache/application_1584397637704_9622
> Caused by: org.rocksdb.RocksDBException: While opening a file for sequentially reading: /data/nvme3n1/nm-local-dir/usercache/dkapoor/appcache/application_1584397637704_9622/flink-io-c42b6665-0170-4dc9-9933-8abd78812fd5/job_03a4b302f44a8d9f5b31693a80bde30c_op_KeyedProcessOperator_b9daf26d7397cd4b00184cc833054139__5_60__uuid_fa8124e4-1678-4555-a90a-8eec4d974a22/db/MANIFEST-000006: No such file or directory
> Caused by: org.rocksdb.RocksDBException: While opening a file for sequentially reading: /data/nvme3n1/nm-local-dir/usercache/dkapoor/appcache/application_1584397637704_9622/flink-io-a8dfe34d-909e-4aea-8d20-c89199b20856/job_03a4b302f44a8d9f5b31693a80bde30c_op_KeyedProcessOperator_b9daf26d7397cd4b00184cc833054139__4_60__uuid_12fc9764-418e-4802-800e-3623e385743f/db/MANIFEST-000006: No such file or directory
> Caused by: org.rocksdb.RocksDBException: While opening a file for sequentially reading: /data/nvme1n1/nm-local-dir/usercache/dkapoor/appcache/application_1584397637704_9622/flink-io-e98c35d7-586a-4edb-9eba-99c6fd823540/job_03a4b302f44a8d9f5b31693a80bde30c_op_KeyedProcessOperator_b9daf26d7397cd4b00184cc833054139__9_60__uuid_f52a3f02-aa12-4285-b594-b94e1b0f8ba7/db/MANIFEST-000006: No such file or directory
> Caused by: org.rocksdb.RocksDBException: While opening a file for sequentially reading: /data/nvme3n1/nm-local-dir/usercache/dkapoor/appcache/application_1584397637704_9622/flink-io-a2887f93-1c75-48b1-8b67-72acdc69ce1b/job_03a4b302f44a8d9f5b31693a80bde30c_op_KeyedProcessOperator_b9daf26d7397cd4b00184cc833054139__2_60__uuid_6a8267eb-aa04-48a3-b82f-7b5b9f21c8e0/db/MANIFEST-000006: No such file or directory
> Caused by: org.rocksdb.RocksDBException: While opening a file for sequentially reading: /data/nvme2n1/nm-local-dir/usercache/dkapoor/appcache/application_1584397637704_9622/flink-io-27e797c3-de39-4140-84e8-b94e640154cc/job_03a4b302f44a8d9f5b31693a80bde30c_op_KeyedProcessOperator_b9daf26d7397cd4b00184cc833054139__1_60__uuid_fde8b198-32d8-4e0c-a412-f316a4fe1e3e/db/MANIFEST-000006: No such file or directory
> Caused by: org.rocksdb.RocksDBException: While opening a file for sequentially reading: /data/nvme1n1/nm-local-dir/usercache/dkapoor/appcache/application_1584397637704_9622/flink-io-e98c35d7-586a-4edb-9eba-99c6fd823540/job_03a4b302f44a8d9f5b31693a80bde30c_op_KeyedProcessOperator_b9daf26d7397cd4b00184cc833054139__9_60__uuid_f52a3f02-aa12-4285-b594-b94e1b0f8ba7/db/MANIFEST-000006: No such file or directory
> Caused by: org.rocksdb.RocksDBException: While opening a file for sequentially reading: /data/nvme2n1/nm-local-dir/usercache/dkapoor/appcache/application_1584397637704_9622/flink-io-7be6a975-c0cd-4083-a1c3-b47e4c8fbb1b/job_03a4b302f44a8d9f5b31693a80bde30c_op_KeyedProcessOperator_b9daf26d7397cd4b00184cc833054139__13_60__uuid_d779fe65-181f-40d2-b32e-e17a023c128d/db/MANIFEST-000006: No such file or directory
> Caused by: org.rocksdb.RocksDBException: While opening a file for sequentially reading: /data/nvme1n1/nm-local-dir/usercache/dkapoor/appcache/application_1584397637704_9622/flink-io-44fefa0f-c58a-4ce5-ac44-b8b9a436eae5/job_03a4b302f44a8d9f5b31693a80bde30c_op_KeyedProcessOperator_b9daf26d7397cd4b00184cc833054139__40_60__uuid_bfcd85f6-270b-4e56-8c09-250d9171b8a3/db/MANIFEST-000006: No such file or directory
> Caused by: org.rocksdb.RocksDBException: While opening a file for sequentially reading: /data/nvme1n1/nm-local-dir/usercache/dkapoor/appcache/application_1584397637704_9622/flink-io-1dff583b-5fb3-4521-8cdf-261a2e3a0f4d/job_03a4b302f44a8d9f5b31693a80bde30c_op_KeyedProcessOperator_b9daf26d7397cd4b00184cc833054139__6_60__uuid_27a20e68-22d6-4e35-a23f-f267c523b829/db/MANIFEST-000006: No such file or directory
> Caused by: org.rocksdb.RocksDBException: While opening a file for sequentially reading: /data/nvme2n1/nm-local-dir/usercache/dkapoor/appcache/application_1584397637704_9622/flink-io-27e797c3-de39-4140-84e8-b94e640154cc/job_03a4b302f44a8d9f5b31693a80bde30c_op_KeyedProcessOperator_b9daf26d7397cd4b00184cc833054139__1_60__uuid_fde8b198-32d8-4e0c-a412-f316a4fe1e3e/db/MANIFEST-000006: No such file or directory
> Caused by: org.rocksdb.RocksDBException: While opening a file for sequentially reading: /data/nvme3n1/nm-local-dir/usercache/dkapoor/appcache/application_1584397637704_9622/flink-io-a8dfe34d-909e-4aea-8d20-c89199b20856/job_03a4b302f44a8d9f5b31693a80bde30c_op_KeyedProcessOperator_b9daf26d7397cd4b00184cc833054139__4_60__uuid_12fc9764-418e-4802-800e-3623e385743f/db/MANIFEST-000006: No such file or directory
> Caused by: org.rocksdb.RocksDBException: While opening a file for sequentially reading: /data/nvme3n1/nm-local-dir/usercache/dkapoor/appcache/application_1584397637704_9622/flink-io-a2887f93-1c75-48b1-8b67-72acdc69ce1b/job_03a4b302f44a8d9f5b31693a80bde30c_op_KeyedProcessOperator_b9daf26d7397cd4b00184cc833054139__2_60__uuid_6a8267eb-aa04-48a3-b82f-7b5b9f21c8e0/db/MANIFEST-000006: No such file or directory
> Caused by: org.rocksdb.RocksDBException: While opening a file for sequentially reading: /data/nvme2n1/nm-local-dir/usercache/dkapoor/appcache/application_1584397637704_9622/flink-io-27e797c3-de39-4140-84e8-b94e640154cc/job_03a4b302f44a8d9f5b31693a80bde30c_op_KeyedProcessOperator_b9daf26d7397cd4b00184cc833054139__1_60__uuid_fde8b198-32d8-4e0c-a412-f316a4fe1e3e/db/MANIFEST-000006: No such file or directory
> {code}
> Question:
> The state size is actually small; the largest state is ~3 KB, which is smaller than the state.backend.fs.memory-threshold we set. In this case, why does the data still need to be stored in RocksDB?


