Posted to issues@flink.apache.org by "Lu Niu (Jira)" <ji...@apache.org> on 2020/04/10 23:35:00 UTC

[jira] [Commented] (FLINK-17089) Checkpoint fail because RocksDBException: Error While opening a file for sequentially reading

    [ https://issues.apache.org/jira/browse/FLINK-17089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17081039#comment-17081039 ] 

Lu Niu commented on FLINK-17089:
--------------------------------

The root cause is actually that the user bundles rocksdb 6.1.0 in their application jar. How can we prevent this from the perspective of the Flink framework? Can we shade RocksDB under a different namespace?
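A quick way to confirm this kind of dependency clash is to check which jar actually supplied the org.rocksdb classes at runtime. Below is a minimal diagnostic sketch (hypothetical class name, not part of Flink): if the reported location is the application's rocksdb 6.1.0 jar rather than the RocksDB build shipped with flink-statebackend-rocksdb, the user jar is shadowing Flink's dependency. Relocating org.rocksdb inside the user jar (for example with the Maven Shade Plugin's relocation feature) is one way to avoid the clash.

{code:java}
// Hypothetical diagnostic snippet (not part of Flink): print which jar on the
// classpath provided the org.rocksdb classes that the job is actually using.
import org.rocksdb.RocksDB;

public class RocksDbOriginCheck {
    public static void main(String[] args) {
        java.security.CodeSource source =
                RocksDB.class.getProtectionDomain().getCodeSource();
        // getCodeSource() can be null for bootstrap-loaded classes;
        // for classes loaded from a jar it reports the jar's URL.
        System.out.println("org.rocksdb.RocksDB loaded from: "
                + (source != null ? source.getLocation() : "unknown"));
    }
}
{code}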

> Checkpoint fail because RocksDBException: Error While opening a file for sequentially reading
> ---------------------------------------------------------------------------------------------
>
>                 Key: FLINK-17089
>                 URL: https://issues.apache.org/jira/browse/FLINK-17089
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>            Reporter: Lu Niu
>            Priority: Major
>
> We use the incremental RocksDB state backend. The Flink job's checkpoint throws the following exception after running for about 20 hours:
> {code:java}
> Caused by: org.rocksdb.RocksDBException: While opening a file for sequentially reading: /data/foo/bar/usercache/xxx/appcache/application_1584397637704_9072/flink-io-4e2294f0-7e9b-4102-b079-1089f23c47aa/job_d781983f4967703b0480c7943e8100af_op_KeyedProcessOperator_b9daf26d7397cd4b00184cc833054139__27_60__uuid_dee7e33b-9bce-42f3-909a-f6fa4ab52d8c/db/MANIFEST-000006: No such file or directory
>     at org.rocksdb.Checkpoint.createCheckpoint(Native Method)
>     at org.rocksdb.Checkpoint.createCheckpoint(Checkpoint.java:51)
>     at org.apache.flink.contrib.streaming.state.snapshot.RocksIncrementalSnapshotStrategy.takeDBNativeCheckpoint(RocksIncrementalSnapshotStrategy.java:249)
>     at org.apache.flink.contrib.streaming.state.snapshot.RocksIncrementalSnapshotStrategy.doSnapshot(RocksIncrementalSnapshotStrategy.java:160)
>     at org.apache.flink.contrib.streaming.state.snapshot.RocksDBSnapshotStrategyBase.snapshot(RocksDBSnapshotStrategyBase.java:126)
>     at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.snapshot(RocksDBKeyedStateBackend.java:439)
>     at org.apache.flink.streaming.api.operators.AbstractStreamOperator.snapshotState(AbstractStreamOperator.java:411)
>     ... 17 more
> {code}
> This failure happens consistently until the job restarts.
> Some findings:
> The JobManager log shows that the error came from a different subtask each time:
> {code:java}
> // grep jobManager log on appcache/application_1584397637704_9622
> Caused by: org.rocksdb.RocksDBException: While opening a file for sequentially reading: /data/nvme3n1/nm-local-dir/usercache/dkapoor/appcache/application_1584397637704_9622/flink-io-c42b6665-0170-4dc9-9933-8abd78812fd5/job_03a4b302f44a8d9f5b31693a80bde30c_op_KeyedProcessOperator_b9daf26d7397cd4b00184cc833054139__5_60__uuid_fa8124e4-1678-4555-a90a-8eec4d974a22/db/MANIFEST-000006: No such file or directory
> Caused by: org.rocksdb.RocksDBException: While opening a file for sequentially reading: /data/nvme3n1/nm-local-dir/usercache/dkapoor/appcache/application_1584397637704_9622/flink-io-a8dfe34d-909e-4aea-8d20-c89199b20856/job_03a4b302f44a8d9f5b31693a80bde30c_op_KeyedProcessOperator_b9daf26d7397cd4b00184cc833054139__4_60__uuid_12fc9764-418e-4802-800e-3623e385743f/db/MANIFEST-000006: No such file or directory
> Caused by: org.rocksdb.RocksDBException: While opening a file for sequentially reading: /data/nvme1n1/nm-local-dir/usercache/dkapoor/appcache/application_1584397637704_9622/flink-io-e98c35d7-586a-4edb-9eba-99c6fd823540/job_03a4b302f44a8d9f5b31693a80bde30c_op_KeyedProcessOperator_b9daf26d7397cd4b00184cc833054139__9_60__uuid_f52a3f02-aa12-4285-b594-b94e1b0f8ba7/db/MANIFEST-000006: No such file or directory
> Caused by: org.rocksdb.RocksDBException: While opening a file for sequentially reading: /data/nvme3n1/nm-local-dir/usercache/dkapoor/appcache/application_1584397637704_9622/flink-io-a2887f93-1c75-48b1-8b67-72acdc69ce1b/job_03a4b302f44a8d9f5b31693a80bde30c_op_KeyedProcessOperator_b9daf26d7397cd4b00184cc833054139__2_60__uuid_6a8267eb-aa04-48a3-b82f-7b5b9f21c8e0/db/MANIFEST-000006: No such file or directory
> Caused by: org.rocksdb.RocksDBException: While opening a file for sequentially reading: /data/nvme2n1/nm-local-dir/usercache/dkapoor/appcache/application_1584397637704_9622/flink-io-27e797c3-de39-4140-84e8-b94e640154cc/job_03a4b302f44a8d9f5b31693a80bde30c_op_KeyedProcessOperator_b9daf26d7397cd4b00184cc833054139__1_60__uuid_fde8b198-32d8-4e0c-a412-f316a4fe1e3e/db/MANIFEST-000006: No such file or directory
> Caused by: org.rocksdb.RocksDBException: While opening a file for sequentially reading: /data/nvme1n1/nm-local-dir/usercache/dkapoor/appcache/application_1584397637704_9622/flink-io-e98c35d7-586a-4edb-9eba-99c6fd823540/job_03a4b302f44a8d9f5b31693a80bde30c_op_KeyedProcessOperator_b9daf26d7397cd4b00184cc833054139__9_60__uuid_f52a3f02-aa12-4285-b594-b94e1b0f8ba7/db/MANIFEST-000006: No such file or directory
> Caused by: org.rocksdb.RocksDBException: While opening a file for sequentially reading: /data/nvme2n1/nm-local-dir/usercache/dkapoor/appcache/application_1584397637704_9622/flink-io-7be6a975-c0cd-4083-a1c3-b47e4c8fbb1b/job_03a4b302f44a8d9f5b31693a80bde30c_op_KeyedProcessOperator_b9daf26d7397cd4b00184cc833054139__13_60__uuid_d779fe65-181f-40d2-b32e-e17a023c128d/db/MANIFEST-000006: No such file or directory
> Caused by: org.rocksdb.RocksDBException: While opening a file for sequentially reading: /data/nvme1n1/nm-local-dir/usercache/dkapoor/appcache/application_1584397637704_9622/flink-io-44fefa0f-c58a-4ce5-ac44-b8b9a436eae5/job_03a4b302f44a8d9f5b31693a80bde30c_op_KeyedProcessOperator_b9daf26d7397cd4b00184cc833054139__40_60__uuid_bfcd85f6-270b-4e56-8c09-250d9171b8a3/db/MANIFEST-000006: No such file or directory
> Caused by: org.rocksdb.RocksDBException: While opening a file for sequentially reading: /data/nvme1n1/nm-local-dir/usercache/dkapoor/appcache/application_1584397637704_9622/flink-io-1dff583b-5fb3-4521-8cdf-261a2e3a0f4d/job_03a4b302f44a8d9f5b31693a80bde30c_op_KeyedProcessOperator_b9daf26d7397cd4b00184cc833054139__6_60__uuid_27a20e68-22d6-4e35-a23f-f267c523b829/db/MANIFEST-000006: No such file or directory
> Caused by: org.rocksdb.RocksDBException: While opening a file for sequentially reading: /data/nvme2n1/nm-local-dir/usercache/dkapoor/appcache/application_1584397637704_9622/flink-io-27e797c3-de39-4140-84e8-b94e640154cc/job_03a4b302f44a8d9f5b31693a80bde30c_op_KeyedProcessOperator_b9daf26d7397cd4b00184cc833054139__1_60__uuid_fde8b198-32d8-4e0c-a412-f316a4fe1e3e/db/MANIFEST-000006: No such file or directory
> Caused by: org.rocksdb.RocksDBException: While opening a file for sequentially reading: /data/nvme3n1/nm-local-dir/usercache/dkapoor/appcache/application_1584397637704_9622/flink-io-a8dfe34d-909e-4aea-8d20-c89199b20856/job_03a4b302f44a8d9f5b31693a80bde30c_op_KeyedProcessOperator_b9daf26d7397cd4b00184cc833054139__4_60__uuid_12fc9764-418e-4802-800e-3623e385743f/db/MANIFEST-000006: No such file or directory
> Caused by: org.rocksdb.RocksDBException: While opening a file for sequentially reading: /data/nvme3n1/nm-local-dir/usercache/dkapoor/appcache/application_1584397637704_9622/flink-io-a2887f93-1c75-48b1-8b67-72acdc69ce1b/job_03a4b302f44a8d9f5b31693a80bde30c_op_KeyedProcessOperator_b9daf26d7397cd4b00184cc833054139__2_60__uuid_6a8267eb-aa04-48a3-b82f-7b5b9f21c8e0/db/MANIFEST-000006: No such file or directory
> Caused by: org.rocksdb.RocksDBException: While opening a file for sequentially reading: /data/nvme2n1/nm-local-dir/usercache/dkapoor/appcache/application_1584397637704_9622/flink-io-27e797c3-de39-4140-84e8-b94e640154cc/job_03a4b302f44a8d9f5b31693a80bde30c_op_KeyedProcessOperator_b9daf26d7397cd4b00184cc833054139__1_60__uuid_fde8b198-32d8-4e0c-a412-f316a4fe1e3e/db/MANIFEST-000006: No such file or directory
> {code}
> Question:
> The state size is actually small; the largest is ~3 KB, which is smaller than the state.backend.fs.memory-threshold we set. In this case, why does it still need to store data in RocksDB?
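For reference, a minimal sketch of the setup described in this issue (hypothetical checkpoint path, Flink 1.10-era API): an incremental RocksDB state backend with checkpointing enabled. Note that state.backend.fs.memory-threshold is normally set in flink-conf.yaml and governs whether small checkpoint data is inlined into the checkpoint metadata instead of being written as separate files; keyed state itself still lives in RocksDB while the job runs.

{code:java}
// Minimal sketch of the setup described above (hypothetical paths and values).
// state.backend.fs.memory-threshold would be set in flink-conf.yaml, e.g.:
//   state.backend.fs.memory-threshold: 4096
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class IncrementalRocksDbJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // true = incremental checkpoints, as used by the job in this issue
        env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints", true));
        env.enableCheckpointing(60_000); // checkpoint every 60 s

        // Toy keyed pipeline; keyed state for such operators is stored in RocksDB.
        env.fromElements("a", "b", "a")
                .keyBy(value -> value)
                .map(value -> value.toUpperCase())
                .print();

        env.execute("incremental-rocksdb-example");
    }
}
{code}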



--
This message was sent by Atlassian Jira
(v8.3.4#803005)