Posted to user@flink.apache.org by Ben Yan <ya...@gmail.com> on 2018/12/07 10:28:27 UTC

Failed to resume job from checkpoint

hi. I am using flink-1.7.0, with RocksDB and HDFS as the state backend,
but recently I found the following exception when the job resumed from the
checkpoint. Task-local state is always considered a secondary copy; the
ground truth of the checkpoint state is the primary copy in the distributed
store. But it seems that the job did not recover from HDFS, and it failed
directly. I hope someone can give me advice or hints about the problem I
encountered.
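
(For reference, a RocksDB-on-HDFS state backend of this kind is typically configured along these lines in flink-conf.yaml; the checkpoint path below is only a placeholder, not my actual path:)

```yaml
# Illustrative flink-conf.yaml fragment (checkpoint path is a placeholder)
state.backend: rocksdb
state.backend.incremental: true
state.checkpoints.dir: hdfs:///flink/checkpoints
```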


2018-12-06 22:54:04,171 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph        -
KeyedProcess (3/138) (5d96a585130f7a21f22f82f79941fb1d) switched from
RUNNING to FAILED.
java.lang.Exception: Exception while creating StreamOperatorStateContext.
	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:195)
	at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:250)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:738)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:289)
	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:704)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.flink.util.FlinkException: Could not restore
keyed state backend for
KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5_(3/138) from any
of the 1 provided restore options.
	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:137)
	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:284)
	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:135)
	... 5 more
Caused by: java.nio.file.NoSuchFileException:
/mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-0115e9d6-a816-4b65-8944-1423f0fdae58/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__3_138__uuid_1c6a5a11-caaf-4564-b3d0-9c7dadddc390/db/000495.sst
-> /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-0115e9d6-a816-4b65-8944-1423f0fdae58/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__3_138__uuid_1c6a5a11-caaf-4564-b3d0-9c7dadddc390/5683a26f-cde2-406d-b4cf-3c6c3976f8ba/000495.sst
	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
	at sun.nio.fs.UnixFileSystemProvider.createLink(UnixFileSystemProvider.java:476)
	at java.nio.file.Files.createLink(Files.java:1086)
	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreInstanceDirectoryFromPath(RocksDBKeyedStateBackend.java:1238)
	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreLocalStateIntoFullInstance(RocksDBKeyedStateBackend.java:1186)
	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreWithoutRescaling(RocksDBKeyedStateBackend.java:916)
	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restore(RocksDBKeyedStateBackend.java:864)
	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:525)
	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:147)
	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:151)
	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:123)
	... 7 more


Best

Ben

Re: Failed to resume job from checkpoint

Posted by Stefan Richter <s....@data-artisans.com>.
Hi,

good that you found the cause of the problem in your configuration settings, but unfortunately I cannot yet follow your reasoning. Can you explain why the code would fail for a “slow” HDFS? If no local recovery is happening (local recovery only happens on a job failover with local recovery activated), the job will always first download all files from HDFS to your local disk. After that, it will hard link the files on local disk into another directory. I would assume that any HDFS problems, like slowdowns, would show up in the part that downloads the files to local disk. But your exception comes after that, when the file supposedly was already copied. So I don’t understand how you think this is connected; can you please explain it in more detail?
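
The download-then-link sequence described above can be sketched with plain java.nio; directory and file names here are illustrative, and this is a simplified stand-in, not Flink’s actual restore code:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class RestoreLinkSketch {
    public static void main(String[] args) throws IOException {
        // Step 1 (stand-in for the HDFS download): place a file into a
        // temporary restore directory on local disk.
        Path tmpDir = Files.createTempDirectory("restore-tmp");
        Files.write(tmpDir.resolve("000495.sst"), new byte[] {1, 2, 3});

        // Step 2: list the downloaded files and hard link each one into
        // the new RocksDB instance path.
        Path dbDir = Files.createTempDirectory("db-instance");
        try (DirectoryStream<Path> listed = Files.newDirectoryStream(tmpDir)) {
            for (Path existing : listed) {
                // Files.createLink(link, existing) throws NoSuchFileException
                // if 'existing' vanished between listing and linking -- the
                // failure reported in this thread.
                Files.createLink(dbDir.resolve(existing.getFileName()), existing);
            }
        }

        // The hard link is independent of the temporary copy.
        System.out.println(Files.exists(dbDir.resolve("000495.sst")));
    }
}
```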

For your second question, Flink currently assumes that your HDFS (or whatever checkpoint filesystem you use) is stable, highly available storage and that files do not “get lost”. It can tolerate temporary outages through multiple restart attempts, but your setup of the checkpoint directory should prevent data loss.
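
The temporary outages mentioned above are handled by the restart strategy, which can be configured in flink-conf.yaml, for example (values here are examples, not recommendations):

```yaml
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 10
restart-strategy.fixed-delay.delay: 30 s
```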

Best,
Stefan 

> On 9. Dec 2018, at 14:05, Ben Yan <ya...@gmail.com> wrote:
> 
> hi,
> 
> 1. I took a closer look at the relevant code in RocksDBIncrementalRestoreOperation::restoreInstanceDirectoryFromPath and did some verification. I found that this problem is likely related to file system connection restrictions. At first I was worried that my HDFS would be overloaded by a large number of connections, so I configured the following parameters:
> 
> fs.<scheme>.limit.total: (number, 0/-1 mean no limit)
> fs.<scheme>.limit.input: (number, 0/-1 mean no limit)
> fs.<scheme>.limit.output: (number, 0/-1 mean no limit)
> fs.<scheme>.limit.timeout: (milliseconds, 0 means infinite)
> fs.<scheme>.limit.stream-timeout: (milliseconds, 0 means infinite)
> 
> Since I added the above configuration, this problem has started to appear! When I removed it, the problem disappeared. I think that when Flink is configured with file system connection restrictions, the mechanism for recovering from a checkpoint needs to be improved: jobs should recover from checkpoints more slowly under file system connection restrictions, rather than failing directly with the above exceptions.
> 
> 2. After a job has been running for a long time, if the state data stored in the state backend (such as HDFS) is lost for some reason, is there any way to restore this state quickly, for example through some kind of offline task that rebuilds the state from offline data, so that the streaming job can be relaunched from the recovered state?
> 
> Best
> Ben
> 
> Ben Yan <yan.xiao.bin.mail@gmail.com> wrote on Sat, Dec 8, 2018 at 11:08 AM:
> I have already tested it.
> 
> [root@node ~]#ll /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0038/
> total 32
> drwxr-xr-x  2 yarn hadoop 4096 Dec  8 02:29 blobStore-273cf1a6-0f98-4c86-801e-5d76fef66a58
> drwxr-xr-x  2 yarn hadoop 4096 Dec  8 02:29 blobStore-992562a5-f42f-43f7-90de-a415b4dcd398
> drwx--x---  4 yarn hadoop 4096 Dec  8 02:29 container_e73_1544101169829_0038_01_000059
> drwx--x--- 13 yarn hadoop 4096 Dec  8 02:29 filecache
> drwxr-xr-x  2 yarn hadoop 4096 Dec  8 02:29 flink-dist-cache-6d8dab0c-4034-4bbe-a9b9-b524cf6856e3
> drwxr-xr-x  8 yarn hadoop 4096 Dec  8 02:29 flink-io-6fba8471-4d84-4c13-9e3c-ef3891b366f0
> drwxr-xr-x  4 yarn hadoop 4096 Dec  8 02:29 localState
> drwxr-xr-x  2 yarn hadoop 4096 Dec  8 02:29 rocksdb-lib-7ef4471db8d3b8c1bdcfa4dba4d95a36
> 
> And now the directory "flink-io-6fba8471-4d84-4c13-9e3c-ef3891b366f0" no longer exists:
> 
> [root@node ~]#ll /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0038/
> total 12
> drwx--x--- 13 yarn hadoop 4096 Dec  8 02:29 filecache
> drwxr-xr-x  2 yarn hadoop 4096 Dec  8 02:53 localState
> drwxr-xr-x  2 yarn hadoop 4096 Dec  8 02:53 rocksdb-lib-7ef4471db8d3b8c1bdcfa4dba4d95a36
> 
> Ben Yan <yan.xiao.bin.mail@gmail.com> wrote on Sat, Dec 8, 2018 at 12:23 AM:
> Thank you for your advice! I will check this out next, and I will share any new progress as soon as I have it.
> 
> Stefan Richter <s.richter@data-artisans.com> wrote on Sat, Dec 8, 2018 at 12:05 AM:
> I think then you need to investigate what goes wrong in RocksDBIncrementalRestoreOperation::restoreInstanceDirectoryFromPath. If you look at the code, it lists the files in a directory and tries to hard link them into another directory, and I would only expect to see the mentioned exception if the original file that we try to link does not exist. However, in my opinion it must exist, because we list it in the directory right before the link attempt and Flink does not delete anything in the meantime. So the question is: why can a file that was listed just before suddenly disappear when it is hard linked? The only potential problem could be in the path transformations and concatenations, but they look good to me and also pass all tests, including end-to-end tests that do exactly such a restore. I suggest to either observe the created files and what happens with the one mentioned in the exception, or to introduce debug logging in the code, in particular a check whether the listed file (the link target) exists before linking, which it should in my opinion because it is listed in the directory.
> 
>> On 7. Dec 2018, at 16:33, Ben Yan <yan.xiao.bin.mail@gmail.com> wrote:
>> 
>> The version of the recovered checkpoint is also 1.7.0.
>> 
>> Stefan Richter <s.richter@data-artisans.com> wrote on Fri, Dec 7, 2018 at 11:06 PM:
>> Just to clarify, the checkpoint from which you want to resume in 1.7, was that taken by 1.6 or by 1.7? So far this is a bit mysterious because it says FileNotFound, but the whole iteration is driven by listing the existing files. Can you somehow monitor which files and directories are created during the restore attempt?
>> 
>>> On 7. Dec 2018, at 15:53, Ben Yan <yan.xiao.bin.mail@gmail.com> wrote:
>>> 
>>> hi, Stefan
>>> 
>>> Thank you for your explanation. I used Flink 1.6.2 without any problems. I have tested version 1.7.0 a few times, but every time I resume from the checkpoint the job shows the exception I posted earlier, which makes the job unrecoverable. And I checked all the logs; apart from this exception, there are no other exceptions.
>>> 
>>> The following is all the logs when an exception occurs:
>>> 2018-12-06 22:53:41,282 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - KeyedProcess (120/138) (25ab0c8d0bc657860b766fa4c8d85a42) switched from DEPLOYING to RUNNING.
>>> 2018-12-06 22:53:41,285 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - KeyedProcess (2/138) (f770d22a976463d90fb4349d1c8521b8) switched from RUNNING to FAILED.
>>> java.lang.Exception: Exception while creating StreamOperatorStateContext.
>>> 	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:195)
>>> 	at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:250)
>>> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:738)
>>> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:289)
>>> 	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:704)
>>> 	at java.lang.Thread.run(Thread.java:748)
>>> Caused by: org.apache.flink.util.FlinkException: Could not restore keyed state backend for KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5_(2/138) from any of the 1 provided restore options.
>>> 	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:137)
>>> 	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:284)
>>> 	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:135)
>>> 	... 5 more
>>> Caused by: java.nio.file.NoSuchFileException: /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-133c16c5-4565-4014-b769-2a978af8e772/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__2_138__uuid_ab38b75f-77d7-4124-a410-6444b35d232d/db/000495.sst -> /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-133c16c5-4565-4014-b769-2a978af8e772/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__2_138__uuid_ab38b75f-77d7-4124-a410-6444b35d232d/cf45eae8-d5d4-4f04-8bf9-8d54ac078769/000495.sst
>>> 	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
>>> 	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
>>> 	at sun.nio.fs.UnixFileSystemProvider.createLink(UnixFileSystemProvider.java:476)
>>> 	at java.nio.file.Files.createLink(Files.java:1086)
>>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreInstanceDirectoryFromPath(RocksDBKeyedStateBackend.java:1238)
>>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreLocalStateIntoFullInstance(RocksDBKeyedStateBackend.java:1186)
>>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreWithoutRescaling(RocksDBKeyedStateBackend.java:916)
>>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restore(RocksDBKeyedStateBackend.java:864)
>>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:525)
>>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:147)
>>> 	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:151)
>>> 	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:123)
>>> 	... 7 more
>>> 2018-12-06 22:53:41,286 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job Flink-Job-Offline (6e40c9381aa12f69b6ac182c91d993f5) switched from state RUNNING to FAILING.
>>> java.lang.Exception: Exception while creating StreamOperatorStateContext.
>>> 	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:195)
>>> 	at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:250)
>>> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:738)
>>> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:289)
>>> 	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:704)
>>> 	at java.lang.Thread.run(Thread.java:748)
>>> Caused by: org.apache.flink.util.FlinkException: Could not restore keyed state backend for KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5_(2/138) from any of the 1 provided restore options.
>>> 	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:137)
>>> 	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:284)
>>> 	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:135)
>>> 	... 5 more
>>> Caused by: java.nio.file.NoSuchFileException: /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-133c16c5-4565-4014-b769-2a978af8e772/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__2_138__uuid_ab38b75f-77d7-4124-a410-6444b35d232d/db/000495.sst -> /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-133c16c5-4565-4014-b769-2a978af8e772/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__2_138__uuid_ab38b75f-77d7-4124-a410-6444b35d232d/cf45eae8-d5d4-4f04-8bf9-8d54ac078769/000495.sst
>>> 	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
>>> 	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
>>> 	at sun.nio.fs.UnixFileSystemProvider.createLink(UnixFileSystemProvider.java:476)
>>> 	at java.nio.file.Files.createLink(Files.java:1086)
>>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreInstanceDirectoryFromPath(RocksDBKeyedStateBackend.java:1238)
>>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreLocalStateIntoFullInstance(RocksDBKeyedStateBackend.java:1186)
>>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreWithoutRescaling(RocksDBKeyedStateBackend.java:916)
>>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restore(RocksDBKeyedStateBackend.java:864)
>>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:525)
>>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:147)
>>> 	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:151)
>>> 	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:123)
>>> 	... 7 more
>>> 2018-12-06 22:53:41,287 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source: topic.rate (1/16) (5637f1c3568ca7c29db002e579c05546) switched from RUNNING to CANCELING.
>>> 
>>> 
>>> Best, 
>>> Ben
>>> 
>>> Stefan Richter <s.richter@data-artisans.com> wrote on Fri, Dec 7, 2018 at 10:00 PM:
>>> Hi,
>>> 
>>> From what I can see in the log here, it looks like your RocksDB is not recovering from local state but from a remote filesystem. This recovery basically has three steps:
>>> 
>>> 1: Create a temporary directory (in your example, this is the dir that ends …/5683a26f-cde2-406d-b4cf-3c6c3976f8ba) and download all the files, mainly sst files from remote fs to the temporary directory in local fs.
>>> 
>>> 2: List all the downloaded files in the temporary directory and either hard link (for sst files) or copy (for all other files) the listed files into the new RocksDB instance path (the path that ends with …/db).
>>> 
>>> 3: Open the new db from the instance path, delete the temporary directory.
>>> 
>>> Now what is very surprising here is that it claims some file was not found (not clear which one, but I assume the downloaded file). However, how the file can be lost between downloading/listing and the attempt to hard link it is very mysterious. Can you check the logs for any other exceptions, and can you check what files exist during the recovery (e.g. what is downloaded, whether the instance path is there, …)? For now, I cannot see how a listed file could suddenly disappear; Flink will only delete the temporary directory once recovery has completed or failed.
>>> 
>>> Also: is this problem deterministic, or was it a one-off? Did you use a different Flink version before (which worked)?
>>> 
>>> Best,
>>> Stefan
>>> 
>>>> On 7. Dec 2018, at 11:28, Ben Yan <yan.xiao.bin.mail@gmail.com> wrote:
>>>> 
>>>> hi. I am using flink-1.7.0, with RocksDB and HDFS as the state backend, but recently I found the following exception when the job resumed from the checkpoint. Task-local state is always considered a secondary copy; the ground truth of the checkpoint state is the primary copy in the distributed store. But it seems that the job did not recover from HDFS, and it failed directly. I hope someone can give me advice or hints about the problem I encountered.
>>>> 
>>>> 
>>>> 2018-12-06 22:54:04,171 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - KeyedProcess (3/138) (5d96a585130f7a21f22f82f79941fb1d) switched from RUNNING to FAILED.
>>>> java.lang.Exception: Exception while creating StreamOperatorStateContext.
>>>> 	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:195)
>>>> 	at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:250)
>>>> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:738)
>>>> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:289)
>>>> 	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:704)
>>>> 	at java.lang.Thread.run(Thread.java:748)
>>>> Caused by: org.apache.flink.util.FlinkException: Could not restore keyed state backend for KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5_(3/138) from any of the 1 provided restore options.
>>>> 	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:137)
>>>> 	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:284)
>>>> 	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:135)
>>>> 	... 5 more
>>>> Caused by: java.nio.file.NoSuchFileException: /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-0115e9d6-a816-4b65-8944-1423f0fdae58/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__3_138__uuid_1c6a5a11-caaf-4564-b3d0-9c7dadddc390/db/000495.sst -> /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-0115e9d6-a816-4b65-8944-1423f0fdae58/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__3_138__uuid_1c6a5a11-caaf-4564-b3d0-9c7dadddc390/5683a26f-cde2-406d-b4cf-3c6c3976f8ba/000495.sst
>>>> 	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
>>>> 	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
>>>> 	at sun.nio.fs.UnixFileSystemProvider.createLink(UnixFileSystemProvider.java:476)
>>>> 	at java.nio.file.Files.createLink(Files.java:1086)
>>>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreInstanceDirectoryFromPath(RocksDBKeyedStateBackend.java:1238)
>>>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreLocalStateIntoFullInstance(RocksDBKeyedStateBackend.java:1186)
>>>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreWithoutRescaling(RocksDBKeyedStateBackend.java:916)
>>>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restore(RocksDBKeyedStateBackend.java:864)
>>>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:525)
>>>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:147)
>>>> 	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:151)
>>>> 	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:123)
>>>> 	... 7 more
>>>> 
>>>> Best
>>>> Ben
>>> 
>> 
> 


Re: Failed to resume job from checkpoint

Posted by Ben Yan <ya...@gmail.com>.
hi,

1. I took a closer look at the relevant code in
RocksDBIncrementalRestoreOperation::restoreInstanceDirectoryFromPath and
did some verification. I found that this problem is likely related to file
system connection restrictions. At first I was worried that my HDFS would
be overloaded by a large number of connections, so I configured the
following parameters:

fs.<scheme>.limit.total: (number, 0/-1 mean no limit)
fs.<scheme>.limit.input: (number, 0/-1 mean no limit)
fs.<scheme>.limit.output: (number, 0/-1 mean no limit)
fs.<scheme>.limit.timeout: (milliseconds, 0 means infinite)
fs.<scheme>.limit.stream-timeout: (milliseconds, 0 means infinite)
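
As a concrete example, for HDFS these keys would be set with actual values, for instance (the numbers here are only illustrative, not recommendations):

```yaml
fs.hdfs.limit.total: 128
fs.hdfs.limit.input: 64
fs.hdfs.limit.output: 64
fs.hdfs.limit.timeout: 10000
fs.hdfs.limit.stream-timeout: 30000
```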


Since I added the above configuration, this problem has started to
appear! When I removed it, the problem disappeared. I think that when
Flink is configured with file system connection restrictions, the
mechanism for recovering from a checkpoint needs to be improved: jobs
should recover from checkpoints more slowly under file system connection
restrictions, rather than failing directly with the above exceptions.

2. After a job has been running for a long time, if the state data stored
in the state backend (such as HDFS) is lost for some reason, is there any
way to restore this state quickly, for example through some kind of
offline task that rebuilds the state from offline data, so that the
streaming job can be relaunched from the recovered state?

Best
Ben

Ben Yan <ya...@gmail.com> wrote on Sat, Dec 8, 2018 at 11:08 AM:

> I have already tested it.
>
> [root@node ~]#ll
> /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0038/
> total 32
> drwxr-xr-x  2 yarn hadoop 4096 Dec  8 02:29
> blobStore-273cf1a6-0f98-4c86-801e-5d76fef66a58
> drwxr-xr-x  2 yarn hadoop 4096 Dec  8 02:29
> blobStore-992562a5-f42f-43f7-90de-a415b4dcd398
> drwx--x---  4 yarn hadoop 4096 Dec  8 02:29
> container_e73_1544101169829_0038_01_000059
> drwx--x--- 13 yarn hadoop 4096 Dec  8 02:29 filecache
> drwxr-xr-x  2 yarn hadoop 4096 Dec  8 02:29
> flink-dist-cache-6d8dab0c-4034-4bbe-a9b9-b524cf6856e3
> drwxr-xr-x  8 yarn hadoop 4096 Dec  8 02:29
> flink-io-6fba8471-4d84-4c13-9e3c-ef3891b366f0
> drwxr-xr-x  4 yarn hadoop 4096 Dec  8 02:29 localState
> drwxr-xr-x  2 yarn hadoop 4096 Dec  8 02:29
> rocksdb-lib-7ef4471db8d3b8c1bdcfa4dba4d95a36
>
> And now the directory "flink-io-6fba8471-4d84-4c13-9e3c-ef3891b366f0" no
> longer exists:
>
> [root@node ~]#ll
> /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0038/
> total 12
> drwx--x--- 13 yarn hadoop 4096 Dec  8 02:29 filecache
> drwxr-xr-x  2 yarn hadoop 4096 Dec  8 02:53 localState
> drwxr-xr-x  2 yarn hadoop 4096 Dec  8 02:53
> rocksdb-lib-7ef4471db8d3b8c1bdcfa4dba4d95a36
>
> Ben Yan <ya...@gmail.com> wrote on Sat, Dec 8, 2018 at 12:23 AM:
>
>> Thank you for your advice! I will check this out next, and I will share
>> any new progress as soon as I have it.
>>
>> Stefan Richter <s....@data-artisans.com> wrote on Sat, Dec 8, 2018 at 12:05 AM:
>>
>>> I think then you need to investigate what goes wrong in
>>> RocksDBIncrementalRestoreOperation::restoreInstanceDirectoryFromPath. If
>>> you look at the code, it lists the files in a directory and tries to hard
>>> link them into another directory, and I would only expect to see the
>>> mentioned exception if the original file that we try to link does not
>>> exist. However, in my opinion it must exist, because we list it in the
>>> directory right before the link attempt and Flink does not delete
>>> anything in the meantime. So the question is: why can a file that was
>>> listed just before suddenly disappear when it is hard linked? The only
>>> potential problem could be in the path transformations and
>>> concatenations, but they look good to me and also pass all tests,
>>> including end-to-end tests that do exactly such a restore. I suggest to
>>> either observe the created files and what happens with the one mentioned
>>> in the exception, or to introduce debug logging in the code, in
>>> particular a check whether the listed file (the link target) exists
>>> before linking, which it should in my opinion because it is listed in
>>> the directory.
>>>
>>> On 7. Dec 2018, at 16:33, Ben Yan <ya...@gmail.com> wrote:
>>>
>>> The version of the recovered checkpoint is also 1.7.0.
>>>
>>> Stefan Richter <s....@data-artisans.com> wrote on Fri, Dec 7, 2018 at 11:06 PM:
>>>
>>>> Just to clarify, the checkpoint from which you want to resume in 1.7,
>>>> was that taken by 1.6 or by 1.7? So far this is a bit mysterious because it
>>>> says FileNotFound, but the whole iteration is driven by listing the
>>>> existing files. Can you somehow monitor which files and directories are
>>>> created during the restore attempt?
>>>>
>>>> On 7. Dec 2018, at 15:53, Ben Yan <ya...@gmail.com> wrote:
>>>>
>>>> hi, Stefan
>>>>
>>>> Thank you for your explanation. I used Flink 1.6.2 without any problems.
>>>> I have tested version 1.7.0 a few times, but every time I resume from
>>>> the checkpoint the job shows the exception I posted earlier, which makes
>>>> the job unrecoverable. And I checked all the logs; apart from this
>>>> exception, there are no other exceptions.
>>>>
>>>> The following is all the logs when an exception occurs:
>>>>
>>>> 2018-12-06 22:53:41,282 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - KeyedProcess (120/138) (25ab0c8d0bc657860b766fa4c8d85a42) switched from DEPLOYING to RUNNING.
>>>>
>>>> 2018-12-06 22:53:41,285 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - KeyedProcess (2/138) (f770d22a976463d90fb4349d1c8521b8) switched from RUNNING to FAILED.
>>>> java.lang.Exception: Exception while creating StreamOperatorStateContext.
>>>> 	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:195)
>>>> 	at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:250)
>>>> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:738)
>>>> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:289)
>>>> 	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:704)
>>>> 	at java.lang.Thread.run(Thread.java:748)
>>>> Caused by: org.apache.flink.util.FlinkException: Could not restore keyed state backend for KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5_(2/138) from any of the 1 provided restore options.
>>>> 	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:137)
>>>> 	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:284)
>>>> 	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:135)
>>>> 	... 5 more
>>>> Caused by: java.nio.file.NoSuchFileException: /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-133c16c5-4565-4014-b769-2a978af8e772/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__2_138__uuid_ab38b75f-77d7-4124-a410-6444b35d232d/db/000495.sst -> /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-133c16c5-4565-4014-b769-2a978af8e772/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__2_138__uuid_ab38b75f-77d7-4124-a410-6444b35d232d/cf45eae8-d5d4-4f04-8bf9-8d54ac078769/000495.sst
>>>> 	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
>>>> 	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
>>>> 	at sun.nio.fs.UnixFileSystemProvider.createLink(UnixFileSystemProvider.java:476)
>>>> 	at java.nio.file.Files.createLink(Files.java:1086)
>>>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreInstanceDirectoryFromPath(RocksDBKeyedStateBackend.java:1238)
>>>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreLocalStateIntoFullInstance(RocksDBKeyedStateBackend.java:1186)
>>>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreWithoutRescaling(RocksDBKeyedStateBackend.java:916)
>>>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restore(RocksDBKeyedStateBackend.java:864)
>>>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:525)
>>>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:147)
>>>> 	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:151)
>>>> 	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:123)
>>>> 	... 7 more
>>>> 2018-12-06 22:53:41,286 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job Flink-Job-Offline (6e40c9381aa12f69b6ac182c91d993f5) switched from state RUNNING to FAILING.
>>>> java.lang.Exception: Exception while creating StreamOperatorStateContext.
>>>> 	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:195)
>>>> 	at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:250)
>>>> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:738)
>>>> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:289)
>>>> 	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:704)
>>>> 	at java.lang.Thread.run(Thread.java:748)
>>>> Caused by: org.apache.flink.util.FlinkException: Could not restore keyed state backend for KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5_(2/138) from any of the 1 provided restore options.
>>>> 	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:137)
>>>> 	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:284)
>>>> 	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:135)
>>>> 	... 5 more
>>>> Caused by: java.nio.file.NoSuchFileException: /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-133c16c5-4565-4014-b769-2a978af8e772/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__2_138__uuid_ab38b75f-77d7-4124-a410-6444b35d232d/db/000495.sst -> /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-133c16c5-4565-4014-b769-2a978af8e772/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__2_138__uuid_ab38b75f-77d7-4124-a410-6444b35d232d/cf45eae8-d5d4-4f04-8bf9-8d54ac078769/000495.sst
>>>> 	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
>>>> 	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
>>>> 	at sun.nio.fs.UnixFileSystemProvider.createLink(UnixFileSystemProvider.java:476)
>>>> 	at java.nio.file.Files.createLink(Files.java:1086)
>>>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreInstanceDirectoryFromPath(RocksDBKeyedStateBackend.java:1238)
>>>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreLocalStateIntoFullInstance(RocksDBKeyedStateBackend.java:1186)
>>>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreWithoutRescaling(RocksDBKeyedStateBackend.java:916)
>>>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restore(RocksDBKeyedStateBackend.java:864)
>>>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:525)
>>>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:147)
>>>> 	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:151)
>>>> 	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:123)
>>>> 	... 7 more
>>>> 2018-12-06 22:53:41,287 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source: topic.rate (1/16) (5637f1c3568ca7c29db002e579c05546) switched from RUNNING to CANCELING.
>>>>
>>>>
>>>>
>>>> Best,
>>>> Ben
>>>>
>>>> Stefan Richter <s....@data-artisans.com> wrote on Fri, Dec 7, 2018 at 10:00 PM:
>>>>
>>>>> Hi,
>>>>>
>>>>> From what I can see in the log here, it looks like your RocksDB is not
>>>>> recovering from local state but from a remote filesystem. This recovery
>>>>> basically has three steps:
>>>>>
>>>>> 1: Create a temporary directory (in your example, this is the dir that
>>>>> ends …/5683a26f-cde2-406d-b4cf-3c6c3976f8ba) and download all the files,
>>>>> mainly sst files from remote fs to the temporary directory in local fs.
>>>>>
>>>>> 2: List all the downloaded files in the temporary directory and either
>>>>> hardlink (for sst files) or copy (for all other files) the listed files
>>>>> into the new RocksDB instance path (the path that ends with …/db).
>>>>>
>>>>> 3: Open the new db from the instance path, delete the temporary
>>>>> directory.
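The hardlink-or-copy logic in step 2 can be sketched roughly like this. This is a simplified illustration, not Flink's actual code; the class and method names are made up for the example:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of step 2: link .sst files and copy everything else
// from the downloaded temporary directory into the db instance directory.
public class RestoreSketch {

    static void linkOrCopyIntoInstance(Path tempDir, Path instanceDir) throws IOException {
        Files.createDirectories(instanceDir);
        try (DirectoryStream<Path> files = Files.newDirectoryStream(tempDir)) {
            for (Path file : files) {
                Path target = instanceDir.resolve(file.getFileName());
                if (file.getFileName().toString().endsWith(".sst")) {
                    // Hard link is cheap: both paths share the same inode.
                    // Files.createLink(link, existing) throws NoSuchFileException
                    // if "existing" is gone, which is the error seen in the logs.
                    Files.createLink(target, file);
                } else {
                    // MANIFEST, CURRENT, OPTIONS etc. are copied instead.
                    Files.copy(file, target);
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path temp = Files.createTempDirectory("restore-download");
        Path db = Files.createTempDirectory("restore-db").resolve("db");
        Files.write(temp.resolve("000495.sst"), new byte[]{1, 2, 3});
        Files.write(temp.resolve("MANIFEST-000001"), new byte[]{4});

        linkOrCopyIntoInstance(temp, db);

        System.out.println(Files.exists(db.resolve("000495.sst")));
        System.out.println(Files.exists(db.resolve("MANIFEST-000001")));
    }
}
```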
>>>>>
>>>>> Now what is very surprising here is that it claims some file was not
>>>>> found (not clear which one, but I assume the downloaded file). However, how
>>>>> the file can be lost between downloading/listing and the attempt to
>>>>> hardlink it is very mysterious. Can you check the logs for any other
>>>>> exceptions, and can you check what files exist in the recovery (e.g. what is
>>>>> downloaded, whether the instance path is there, …)? For now, I cannot see how a
>>>>> listed file could suddenly disappear; Flink will only delete the temporary
>>>>> directory once recovery has completed or failed.
>>>>>
>>>>> Also: is this problem deterministic, or was it a one-off? Did you
>>>>> use a different Flink version before (which worked)?
>>>>>
>>>>> Best,
>>>>> Stefan
>>>>>
>>>>> On 7. Dec 2018, at 11:28, Ben Yan <ya...@gmail.com> wrote:
>>>>>
>>>>> hi. I am using flink-1.7.0, with RocksDB and HDFS as the
>>>>> state backend, but recently I found the following exception when the job
>>>>> resumed from the checkpoint. Task-local state is always considered a
>>>>> secondary copy; the ground truth of the checkpoint state is the primary
>>>>> copy in the distributed store. But it seems that the job did not
>>>>> recover from HDFS, and it failed directly. I hope someone can give me
>>>>> advice or hints about the problem that I encountered.
>>>>>
>>>>>
>>>>> 2018-12-06 22:54:04,171 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - KeyedProcess (3/138) (5d96a585130f7a21f22f82f79941fb1d) switched from RUNNING to FAILED.
>>>>> java.lang.Exception: Exception while creating StreamOperatorStateContext.
>>>>> 	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:195)
>>>>> 	at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:250)
>>>>> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:738)
>>>>> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:289)
>>>>> 	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:704)
>>>>> 	at java.lang.Thread.run(Thread.java:748)
>>>>> Caused by: org.apache.flink.util.FlinkException: Could not restore keyed state backend for KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5_(3/138) from any of the 1 provided restore options.
>>>>> 	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:137)
>>>>> 	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:284)
>>>>> 	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:135)
>>>>> 	... 5 more
>>>>> Caused by: java.nio.file.NoSuchFileException: /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-0115e9d6-a816-4b65-8944-1423f0fdae58/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__3_138__uuid_1c6a5a11-caaf-4564-b3d0-9c7dadddc390/db/000495.sst -> /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-0115e9d6-a816-4b65-8944-1423f0fdae58/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__3_138__uuid_1c6a5a11-caaf-4564-b3d0-9c7dadddc390/5683a26f-cde2-406d-b4cf-3c6c3976f8ba/000495.sst
>>>>> 	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
>>>>> 	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
>>>>> 	at sun.nio.fs.UnixFileSystemProvider.createLink(UnixFileSystemProvider.java:476)
>>>>> 	at java.nio.file.Files.createLink(Files.java:1086)
>>>>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreInstanceDirectoryFromPath(RocksDBKeyedStateBackend.java:1238)
>>>>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreLocalStateIntoFullInstance(RocksDBKeyedStateBackend.java:1186)
>>>>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreWithoutRescaling(RocksDBKeyedStateBackend.java:916)
>>>>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restore(RocksDBKeyedStateBackend.java:864)
>>>>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:525)
>>>>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:147)
>>>>> 	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:151)
>>>>> 	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:123)
>>>>> 	... 7 more
>>>>>
>>>>>
>>>>> Best
>>>>>
>>>>> Ben
>>>>>
>>>>>
>>>>>
>>>>
>>>

Re: Failed to resume job from checkpoint

Posted by Ben Yan <ya...@gmail.com>.
I have already tested it.

[root@node ~]#ll
/mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0038/
total 32
drwxr-xr-x  2 yarn hadoop 4096 Dec  8 02:29
blobStore-273cf1a6-0f98-4c86-801e-5d76fef66a58
drwxr-xr-x  2 yarn hadoop 4096 Dec  8 02:29
blobStore-992562a5-f42f-43f7-90de-a415b4dcd398
drwx--x---  4 yarn hadoop 4096 Dec  8 02:29
container_e73_1544101169829_0038_01_000059
drwx--x--- 13 yarn hadoop 4096 Dec  8 02:29 filecache
drwxr-xr-x  2 yarn hadoop 4096 Dec  8 02:29
flink-dist-cache-6d8dab0c-4034-4bbe-a9b9-b524cf6856e3
drwxr-xr-x  8 yarn hadoop 4096 Dec  8 02:29
flink-io-6fba8471-4d84-4c13-9e3c-ef3891b366f0
drwxr-xr-x  4 yarn hadoop 4096 Dec  8 02:29 localState
drwxr-xr-x  2 yarn hadoop 4096 Dec  8 02:29
rocksdb-lib-7ef4471db8d3b8c1bdcfa4dba4d95a36

And the directory "flink-io-6fba8471-4d84-4c13-9e3c-ef3891b366f0" does not
exist afterwards:

[root@node ~]#ll
/mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0038/
total 12
drwx--x--- 13 yarn hadoop 4096 Dec  8 02:29 filecache
drwxr-xr-x  2 yarn hadoop 4096 Dec  8 02:53 localState
drwxr-xr-x  2 yarn hadoop 4096 Dec  8 02:53
rocksdb-lib-7ef4471db8d3b8c1bdcfa4dba4d95a36

Ben Yan <ya...@gmail.com> wrote on Sat, Dec 8, 2018 at 12:23 AM:

> Thank you for your advice! I will check this out next, and I will share
> any new findings as soon as I have them.
>
> Stefan Richter <s....@data-artisans.com> wrote on Sat, Dec 8, 2018 at 12:05 AM:
>
>> I think then you need to investigate what goes wrong
>> in RocksDBIncrementalRestoreOperation::restoreInstanceDirectoryFromPath. If
>> you look at the code, it lists the files in a directory and tries to hard
>> link them into another directory, and I would only expect to see the
>> mentioned exception if the original file that we try to link does not
>> exist. However, imo it must exist, because we list it in the directory right
>> before the link attempt and Flink does not delete anything in the meantime.
>> So the question is: why can a file that was listed before suddenly
>> disappear when it is hard linked? The only potential problem could be in
>> the path transformations and concatenations, but they look good to me and
>> also pass all tests, including end-to-end tests that do exactly such a
>> restore. I suggest either observing the created files and what happens
>> with the one that is mentioned in the exception, or introducing debug logging
>> in the code, in particular a check whether the listed file (the link target)
>> exists before linking, which it should in my opinion because it is
>> listed in the directory.
>>
>> On 7. Dec 2018, at 16:33, Ben Yan <ya...@gmail.com> wrote:
>>
>> The version of the recovered checkpoint is also 1.7.0.
>>
>> Stefan Richter <s....@data-artisans.com> wrote on Fri, Dec 7, 2018 at 11:06 PM:
>>
>>> Just to clarify, the checkpoint from which you want to resume in 1.7,
>>> was that taken by 1.6 or by 1.7? So far this is a bit mysterious because it
>>> says FileNotFound, but the whole iteration is driven by listing the
>>> existing files. Can you somehow monitor which files and directories are
>>> created during the restore attempt?
>>>
>>> On 7. Dec 2018, at 15:53, Ben Yan <ya...@gmail.com> wrote:
>>>
>>> Hi Stefan,
>>>
>>> Thank you for your explanation. I used Flink 1.6.2 without any
>>> problems. I have tested it a few times with version 1.7.0, but every time I
>>> resume from the checkpoint, the job shows the exception I showed
>>> earlier, which makes the job unrecoverable. And I checked all the logs;
>>> apart from this exception, there are no other exceptions.
>>>
>>> The following is all the logs when an exception occurs:
>>>
>>> 2018-12-06 22:53:41,282 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - KeyedProcess (120/138) (25ab0c8d0bc657860b766fa4c8d85a42) switched from DEPLOYING to RUNNING.
>>>
>>> 2018-12-06 22:53:41,285 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - KeyedProcess (2/138) (f770d22a976463d90fb4349d1c8521b8) switched from RUNNING to FAILED.
>>> java.lang.Exception: Exception while creating StreamOperatorStateContext.
>>> 	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:195)
>>> 	at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:250)
>>> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:738)
>>> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:289)
>>> 	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:704)
>>> 	at java.lang.Thread.run(Thread.java:748)
>>> Caused by: org.apache.flink.util.FlinkException: Could not restore keyed state backend for KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5_(2/138) from any of the 1 provided restore options.
>>> 	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:137)
>>> 	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:284)
>>> 	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:135)
>>> 	... 5 more
>>> Caused by: java.nio.file.NoSuchFileException: /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-133c16c5-4565-4014-b769-2a978af8e772/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__2_138__uuid_ab38b75f-77d7-4124-a410-6444b35d232d/db/000495.sst -> /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-133c16c5-4565-4014-b769-2a978af8e772/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__2_138__uuid_ab38b75f-77d7-4124-a410-6444b35d232d/cf45eae8-d5d4-4f04-8bf9-8d54ac078769/000495.sst
>>> 	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
>>> 	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
>>> 	at sun.nio.fs.UnixFileSystemProvider.createLink(UnixFileSystemProvider.java:476)
>>> 	at java.nio.file.Files.createLink(Files.java:1086)
>>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreInstanceDirectoryFromPath(RocksDBKeyedStateBackend.java:1238)
>>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreLocalStateIntoFullInstance(RocksDBKeyedStateBackend.java:1186)
>>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreWithoutRescaling(RocksDBKeyedStateBackend.java:916)
>>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restore(RocksDBKeyedStateBackend.java:864)
>>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:525)
>>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:147)
>>> 	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:151)
>>> 	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:123)
>>> 	... 7 more
>>> 2018-12-06 22:53:41,286 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job Flink-Job-Offline (6e40c9381aa12f69b6ac182c91d993f5) switched from state RUNNING to FAILING.
>>> java.lang.Exception: Exception while creating StreamOperatorStateContext.
>>> 	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:195)
>>> 	at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:250)
>>> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:738)
>>> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:289)
>>> 	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:704)
>>> 	at java.lang.Thread.run(Thread.java:748)
>>> Caused by: org.apache.flink.util.FlinkException: Could not restore keyed state backend for KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5_(2/138) from any of the 1 provided restore options.
>>> 	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:137)
>>> 	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:284)
>>> 	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:135)
>>> 	... 5 more
>>> Caused by: java.nio.file.NoSuchFileException: /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-133c16c5-4565-4014-b769-2a978af8e772/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__2_138__uuid_ab38b75f-77d7-4124-a410-6444b35d232d/db/000495.sst -> /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-133c16c5-4565-4014-b769-2a978af8e772/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__2_138__uuid_ab38b75f-77d7-4124-a410-6444b35d232d/cf45eae8-d5d4-4f04-8bf9-8d54ac078769/000495.sst
>>> 	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
>>> 	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
>>> 	at sun.nio.fs.UnixFileSystemProvider.createLink(UnixFileSystemProvider.java:476)
>>> 	at java.nio.file.Files.createLink(Files.java:1086)
>>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreInstanceDirectoryFromPath(RocksDBKeyedStateBackend.java:1238)
>>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreLocalStateIntoFullInstance(RocksDBKeyedStateBackend.java:1186)
>>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreWithoutRescaling(RocksDBKeyedStateBackend.java:916)
>>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restore(RocksDBKeyedStateBackend.java:864)
>>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:525)
>>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:147)
>>> 	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:151)
>>> 	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:123)
>>> 	... 7 more
>>> 2018-12-06 22:53:41,287 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source: topic.rate (1/16) (5637f1c3568ca7c29db002e579c05546) switched from RUNNING to CANCELING.
>>>
>>>
>>>
>>> Best,
>>> Ben
>>>
>>> Stefan Richter <s....@data-artisans.com> wrote on Fri, Dec 7, 2018 at 10:00 PM:
>>>
>>>> Hi,
>>>>
>>>> From what I can see in the log here, it looks like your RocksDB is not
>>>> recovering from local state but from a remote filesystem. This recovery
>>>> basically has three steps:
>>>>
>>>> 1: Create a temporary directory (in your example, this is the dir that
>>>> ends …/5683a26f-cde2-406d-b4cf-3c6c3976f8ba) and download all the files,
>>>> mainly sst files from remote fs to the temporary directory in local fs.
>>>>
>>>> 2: List all the downloaded files in the temporary directory and either
>>>> hardlink (for sst files) or copy (for all other files) the listed files
>>>> into the new RocksDB instance path (the path that ends with …/db).
>>>>
>>>> 3: Open the new db from the instance path, delete the temporary
>>>> directory.
>>>>
>>>> Now what is very surprising here is that it claims some file was not
>>>> found (not clear which one, but I assume the downloaded file). However, how
>>>> the file can be lost between downloading/listing and the attempt to
>>>> hardlink it is very mysterious. Can you check the logs for any other
>>>> exceptions, and can you check what files exist in the recovery (e.g. what is
>>>> downloaded, whether the instance path is there, …)? For now, I cannot see how a
>>>> listed file could suddenly disappear; Flink will only delete the temporary
>>>> directory once recovery has completed or failed.
>>>>
>>>> Also: is this problem deterministic, or was it a one-off? Did you
>>>> use a different Flink version before (which worked)?
>>>>
>>>> Best,
>>>> Stefan
>>>>
>>>> On 7. Dec 2018, at 11:28, Ben Yan <ya...@gmail.com> wrote:
>>>>
>>>> hi. I am using flink-1.7.0, with RocksDB and HDFS as the
>>>> state backend, but recently I found the following exception when the job
>>>> resumed from the checkpoint. Task-local state is always considered a
>>>> secondary copy; the ground truth of the checkpoint state is the primary
>>>> copy in the distributed store. But it seems that the job did not
>>>> recover from HDFS, and it failed directly. I hope someone can give me
>>>> advice or hints about the problem that I encountered.
>>>>
>>>>
>>>> 2018-12-06 22:54:04,171 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - KeyedProcess (3/138) (5d96a585130f7a21f22f82f79941fb1d) switched from RUNNING to FAILED.
>>>> java.lang.Exception: Exception while creating StreamOperatorStateContext.
>>>> 	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:195)
>>>> 	at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:250)
>>>> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:738)
>>>> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:289)
>>>> 	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:704)
>>>> 	at java.lang.Thread.run(Thread.java:748)
>>>> Caused by: org.apache.flink.util.FlinkException: Could not restore keyed state backend for KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5_(3/138) from any of the 1 provided restore options.
>>>> 	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:137)
>>>> 	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:284)
>>>> 	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:135)
>>>> 	... 5 more
>>>> Caused by: java.nio.file.NoSuchFileException: /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-0115e9d6-a816-4b65-8944-1423f0fdae58/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__3_138__uuid_1c6a5a11-caaf-4564-b3d0-9c7dadddc390/db/000495.sst -> /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-0115e9d6-a816-4b65-8944-1423f0fdae58/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__3_138__uuid_1c6a5a11-caaf-4564-b3d0-9c7dadddc390/5683a26f-cde2-406d-b4cf-3c6c3976f8ba/000495.sst
>>>> 	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
>>>> 	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
>>>> 	at sun.nio.fs.UnixFileSystemProvider.createLink(UnixFileSystemProvider.java:476)
>>>> 	at java.nio.file.Files.createLink(Files.java:1086)
>>>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreInstanceDirectoryFromPath(RocksDBKeyedStateBackend.java:1238)
>>>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreLocalStateIntoFullInstance(RocksDBKeyedStateBackend.java:1186)
>>>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreWithoutRescaling(RocksDBKeyedStateBackend.java:916)
>>>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restore(RocksDBKeyedStateBackend.java:864)
>>>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:525)
>>>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:147)
>>>> 	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:151)
>>>> 	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:123)
>>>> 	... 7 more
>>>>
>>>>
>>>> Best
>>>>
>>>> Ben
>>>>
>>>>
>>>>
>>>
>>

Re: Failed to resume job from checkpoint

Posted by Ben Yan <ya...@gmail.com>.
Thank you for your advice! I will check this out next, and I will share
any new findings as soon as I have them.

Stefan Richter <s....@data-artisans.com> wrote on Sat, Dec 8, 2018 at 12:05 AM:

> I think then you need to investigate what goes wrong
> in RocksDBIncrementalRestoreOperation::restoreInstanceDirectoryFromPath. If
> you look at the code, it lists the files in a directory and tries to hard
> link them into another directory, and I would only expect to see the
> mentioned exception if the original file that we try to link does not
> exist. However, imo it must exist, because we list it in the directory right
> before the link attempt and Flink does not delete anything in the meantime.
> So the question is: why can a file that was listed before suddenly
> disappear when it is hard linked? The only potential problem could be in
> the path transformations and concatenations, but they look good to me and
> also pass all tests, including end-to-end tests that do exactly such a
> restore. I suggest either observing the created files and what happens
> with the one that is mentioned in the exception, or introducing debug logging
> in the code, in particular a check whether the listed file (the link target)
> exists before linking, which it should in my opinion because it is
> listed in the directory.
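The suggested debug check (log whether the link target still exists immediately before the hard-link call) could look roughly like this. This is a hypothetical sketch for the investigation, not Flink's code; names are made up:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical wrapper around the hard-link call: log if the source file
// is already missing right before linking, so a disappearing file shows
// up in the logs instead of only as a NoSuchFileException.
public class LinkWithCheck {

    static void linkWithExistenceCheck(Path link, Path existing) throws IOException {
        if (!Files.exists(existing)) {
            // In the real investigation this would be a logger call.
            System.out.println("MISSING before link: " + existing);
        }
        Files.createLink(link, existing);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("link-check");
        Path src = dir.resolve("000495.sst");
        Files.write(src, new byte[]{1});

        linkWithExistenceCheck(dir.resolve("linked.sst"), src);

        // Both paths now refer to the same underlying file (same inode).
        System.out.println(Files.isSameFile(src, dir.resolve("linked.sst")));
    }
}
```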
>
> On 7. Dec 2018, at 16:33, Ben Yan <ya...@gmail.com> wrote:
>
> The version of the recovered checkpoint is also 1.7.0.
>
> Stefan Richter <s....@data-artisans.com> wrote on Fri, Dec 7, 2018 at 11:06 PM:
>
>> Just to clarify, the checkpoint from which you want to resume in 1.7, was
>> that taken by 1.6 or by 1.7? So far this is a bit mysterious because it
>> says FileNotFound, but the whole iteration is driven by listing the
>> existing files. Can you somehow monitor which files and directories are
>> created during the restore attempt?
>>
>> On 7. Dec 2018, at 15:53, Ben Yan <ya...@gmail.com> wrote:
>>
>> Hi Stefan,
>>
>> Thank you for your explanation. I used Flink 1.6.2 without any
>> problems. I have tested it a few times with version 1.7.0, but every time I
>> resume from the checkpoint, the job shows the exception I showed
>> earlier, which makes the job unrecoverable. And I checked all the logs;
>> apart from this exception, there are no other exceptions.
>>
>> The following is all the logs when an exception occurs:
>>
>> 2018-12-06 22:53:41,282 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - KeyedProcess (120/138) (25ab0c8d0bc657860b766fa4c8d85a42) switched from DEPLOYING to RUNNING.
>>
>> 2018-12-06 22:53:41,285 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - KeyedProcess (2/138) (f770d22a976463d90fb4349d1c8521b8) switched from RUNNING to FAILED.
>> java.lang.Exception: Exception while creating StreamOperatorStateContext.
>> 	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:195)
>> 	at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:250)
>> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:738)
>> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:289)
>> 	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:704)
>> 	at java.lang.Thread.run(Thread.java:748)
>> Caused by: org.apache.flink.util.FlinkException: Could not restore keyed state backend for KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5_(2/138) from any of the 1 provided restore options.
>> 	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:137)
>> 	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:284)
>> 	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:135)
>> 	... 5 more
>> Caused by: java.nio.file.NoSuchFileException: /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-133c16c5-4565-4014-b769-2a978af8e772/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__2_138__uuid_ab38b75f-77d7-4124-a410-6444b35d232d/db/000495.sst -> /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-133c16c5-4565-4014-b769-2a978af8e772/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__2_138__uuid_ab38b75f-77d7-4124-a410-6444b35d232d/cf45eae8-d5d4-4f04-8bf9-8d54ac078769/000495.sst
>> 	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
>> 	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
>> 	at sun.nio.fs.UnixFileSystemProvider.createLink(UnixFileSystemProvider.java:476)
>> 	at java.nio.file.Files.createLink(Files.java:1086)
>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreInstanceDirectoryFromPath(RocksDBKeyedStateBackend.java:1238)
>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreLocalStateIntoFullInstance(RocksDBKeyedStateBackend.java:1186)
>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreWithoutRescaling(RocksDBKeyedStateBackend.java:916)
>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restore(RocksDBKeyedStateBackend.java:864)
>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:525)
>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:147)
>> 	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:151)
>> 	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:123)
>> 	... 7 more
>> 2018-12-06 22:53:41,286 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job Flink-Job-Offline (6e40c9381aa12f69b6ac182c91d993f5) switched from state RUNNING to FAILING.
>> java.lang.Exception: Exception while creating StreamOperatorStateContext.
>> 	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:195)
>> 	at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:250)
>> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:738)
>> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:289)
>> 	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:704)
>> 	at java.lang.Thread.run(Thread.java:748)
>> Caused by: org.apache.flink.util.FlinkException: Could not restore keyed state backend for KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5_(2/138) from any of the 1 provided restore options.
>> 	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:137)
>> 	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:284)
>> 	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:135)
>> 	... 5 more
>> Caused by: java.nio.file.NoSuchFileException: /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-133c16c5-4565-4014-b769-2a978af8e772/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__2_138__uuid_ab38b75f-77d7-4124-a410-6444b35d232d/db/000495.sst -> /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-133c16c5-4565-4014-b769-2a978af8e772/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__2_138__uuid_ab38b75f-77d7-4124-a410-6444b35d232d/cf45eae8-d5d4-4f04-8bf9-8d54ac078769/000495.sst
>> 	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
>> 	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
>> 	at sun.nio.fs.UnixFileSystemProvider.createLink(UnixFileSystemProvider.java:476)
>> 	at java.nio.file.Files.createLink(Files.java:1086)
>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreInstanceDirectoryFromPath(RocksDBKeyedStateBackend.java:1238)
>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreLocalStateIntoFullInstance(RocksDBKeyedStateBackend.java:1186)
>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreWithoutRescaling(RocksDBKeyedStateBackend.java:916)
>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restore(RocksDBKeyedStateBackend.java:864)
>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:525)
>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:147)
>> 	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:151)
>> 	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:123)
>> 	... 7 more
>> 2018-12-06 22:53:41,287 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source: topic.rate (1/16) (5637f1c3568ca7c29db002e579c05546) switched from RUNNING to CANCELING.
>>
>>
>>
>> Best,
>> Ben
>>
Stefan Richter <s....@data-artisans.com> wrote on Fri, Dec 7, 2018 at 10:00 PM:
>>
>>> Hi,
>>>
>>> From what I can see in the log here, it looks like your RocksDB is not
>>> recovering from local but from a remote filesystem. This recovery basically
>>> has three steps:
>>>
>>> 1: Create a temporary directory (in your example, this is the dir that
>>> ends …/5683a26f-cde2-406d-b4cf-3c6c3976f8ba) and download all the files,
>>> mainly sst files from remote fs to the temporary directory in local fs.
>>>
>>> 2: List all the downloaded files in the temporary directory and either
>>> hardlink (for sst files) or copy (for all other files) the listed files
>>> into the new RocksDB instance path (the path that ends with …/db).
>>>
>>> 3: Open the new db from the instance path, delete the temporary
>>> directory.
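For illustration, the three steps above could be sketched roughly like this. This is a hypothetical, simplified sketch, not Flink's actual code: the class and method names are made up, and step 1 (the download) is assumed to have already populated the temporary directory.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of the restore steps described above (not Flink code).
public class RestoreSketch {

    // Step 2: hard-link sst files (copy everything else) from the temporary
    // download directory into the new RocksDB instance path (".../db").
    static void linkOrCopyIntoInstancePath(Path tmpDir, Path instancePath) throws IOException {
        Files.createDirectories(instancePath);
        try (DirectoryStream<Path> listed = Files.newDirectoryStream(tmpDir)) {
            for (Path src : listed) {
                Path dst = instancePath.resolve(src.getFileName());
                if (src.getFileName().toString().endsWith(".sst")) {
                    // Hard link, as in the Files.createLink call in the stack trace.
                    Files.createLink(dst, src);
                } else {
                    Files.copy(src, dst);
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmpDir = Files.createTempDirectory("restore-download");
        Files.write(tmpDir.resolve("000495.sst"), new byte[]{1, 2, 3});
        Files.write(tmpDir.resolve("CURRENT"), "MANIFEST-000001\n".getBytes());

        Path instancePath = Files.createTempDirectory("restore-instance").resolve("db");
        linkOrCopyIntoInstancePath(tmpDir, instancePath);

        // Step 3 would open RocksDB on instancePath and delete tmpDir.
        System.out.println(Files.exists(instancePath.resolve("000495.sst"))); // true
        // Hard link means both paths refer to the same inode:
        System.out.println(Files.isSameFile(tmpDir.resolve("000495.sst"),
                instancePath.resolve("000495.sst"))); // true
    }
}
```

The reported NoSuchFileException would be thrown by the Files.createLink call in step 2 if the listed source file no longer exists at link time.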
>>>
>>> Now what is very surprising here is that it claims some file was not
>>> found (not clear which one, but I assume the downloaded file). However, how
>>> the file can be lost between downloading/listing and the attempt to
>>> hardlink it is very mysterious. Can you check the logs for any other
>>> exceptions and can you check what files exist in the recovery (e.g. what is
>>> downloaded, if the instance path is there, …). For now, I cannot see how a
>>> listed file could suddenly disappear, Flink will only delete the temporary
>>> directory if recovery is completed or failed.
>>>
>>> Also: is this problem deterministic or was this a singularity? Did you
>>> use a different Flink version before (which worked)?
>>>
>>> Best,
>>> Stefan
>>>
>>> On 7. Dec 2018, at 11:28, Ben Yan <ya...@gmail.com> wrote:
>>>
>>> Hi. I am using flink-1.7.0, with RocksDB and HDFS as the state backend,
>>> but recently I found the following exception when the job resumed from the
>>> checkpoint. Task-local state is always considered a secondary copy; the
>>> ground truth of the checkpoint state is the primary copy in the distributed
>>> store. But it seems that the job did not recover from HDFS and
>>> failed directly. I hope someone can give me advice or hints about the
>>> problem that I encountered.
>>>
>>>
>>> 2018-12-06 22:54:04,171 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - KeyedProcess (3/138) (5d96a585130f7a21f22f82f79941fb1d) switched from RUNNING to FAILED.
>>> java.lang.Exception: Exception while creating StreamOperatorStateContext.
>>> 	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:195)
>>> 	at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:250)
>>> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:738)
>>> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:289)
>>> 	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:704)
>>> 	at java.lang.Thread.run(Thread.java:748)
>>> Caused by: org.apache.flink.util.FlinkException: Could not restore keyed state backend for KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5_(3/138) from any of the 1 provided restore options.
>>> 	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:137)
>>> 	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:284)
>>> 	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:135)
>>> 	... 5 more
>>> Caused by: java.nio.file.NoSuchFileException: /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-0115e9d6-a816-4b65-8944-1423f0fdae58/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__3_138__uuid_1c6a5a11-caaf-4564-b3d0-9c7dadddc390/db/000495.sst -> /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-0115e9d6-a816-4b65-8944-1423f0fdae58/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__3_138__uuid_1c6a5a11-caaf-4564-b3d0-9c7dadddc390/5683a26f-cde2-406d-b4cf-3c6c3976f8ba/000495.sst
>>> 	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
>>> 	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
>>> 	at sun.nio.fs.UnixFileSystemProvider.createLink(UnixFileSystemProvider.java:476)
>>> 	at java.nio.file.Files.createLink(Files.java:1086)
>>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreInstanceDirectoryFromPath(RocksDBKeyedStateBackend.java:1238)
>>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreLocalStateIntoFullInstance(RocksDBKeyedStateBackend.java:1186)
>>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreWithoutRescaling(RocksDBKeyedStateBackend.java:916)
>>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restore(RocksDBKeyedStateBackend.java:864)
>>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:525)
>>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:147)
>>> 	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:151)
>>> 	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:123)
>>> 	... 7 more
>>>
>>>
>>> Best
>>>
>>> Ben
>>>
>>>
>>>
>>
>

Re: Failed to resume job from checkpoint

Posted by Stefan Richter <s....@data-artisans.com>.
I think then you need to investigate what goes wrong in RocksDBIncrementalRestoreOperation::restoreInstanceDirectoryFromPath. If you look at the code, it lists the files in a directory and tries to hard-link them into another directory, and I would only expect the mentioned exception if the original file that we try to link does not exist. However, in my opinion it must exist, because we list it in the directory right before the link attempt and Flink does not delete anything in the meantime. So the question is: why can a file that was listed just before suddenly disappear when it is hard-linked? The only potential problem could be in the path transformations and concatenations, but they look good to me and also pass all tests, including end-to-end tests that do exactly such a restore. I suggest either observing the created files and what happens to the one mentioned in the exception, or introducing debug logging in the code, in particular a check whether the listed file (the link target) exists before linking, which it should in my opinion because it is listed in the directory.
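Such a debug check could look like the following. This is a hypothetical sketch, not a patch against the actual Flink sources: it wraps the hard-link step so the listed link target is verified and logged before linking, which helps distinguish a file that genuinely vanished from a path-concatenation bug.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;

// Hypothetical debug wrapper around the hard-link step (not Flink code).
public class LinkDebug {

    static void createLinkWithCheck(Path link, Path existing) throws IOException {
        // The target was just listed from the directory, so it should exist;
        // log both paths so a failure can be traced to a wrong concatenated
        // path vs. a file that was deleted concurrently.
        System.err.println("DEBUG link: " + link + " -> " + existing
                + " (target exists: " + Files.exists(existing) + ")");
        if (!Files.exists(existing)) {
            throw new NoSuchFileException(existing.toString(),
                    link.toString(), "listed file vanished before hard link");
        }
        Files.createLink(link, existing);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("linkdebug");
        Path target = dir.resolve("000495.sst");
        Files.write(target, new byte[]{42});
        createLinkWithCheck(dir.resolve("linked.sst"), target);
        System.out.println(Files.isSameFile(target, dir.resolve("linked.sst"))); // true
    }
}
```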

> On 7. Dec 2018, at 16:33, Ben Yan <ya...@gmail.com> wrote:
> 
> The version of the recovered checkpoint is also 1.7.0.
> 
> Stefan Richter <s.richter@data-artisans.com> wrote on Fri, Dec 7, 2018 at 11:06 PM:
> Just to clarify, the checkpoint from which you want to resume in 1.7, was that taken by 1.6 or by 1.7? So far this is a bit mysterious because it says FileNotFound, but the whole iteration is driven by listing the existing files. Can you somehow monitor which files and directories are created during the restore attempt?
> 
>> On 7. Dec 2018, at 15:53, Ben Yan <yan.xiao.bin.mail@gmail.com> wrote:
>> 
>> Hi, Stefan
>> 
>> Thank you for your explanation. I used Flink 1.6.2 without any problems. I have tested version 1.7.0 a few times, but every time I resume from the checkpoint the job shows the exception I posted earlier, which makes the job unrecoverable. And I checked all the logs; apart from this exception, there are no other exceptions.
>> 
>> The following is all the logs when an exception occurs:
>> 2018-12-06 22:53:41,282 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - KeyedProcess (120/138) (25ab0c8d0bc657860b766fa4c8d85a42) switched from DEPLOYING to RUNNING.
>> 2018-12-06 22:53:41,285 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - KeyedProcess (2/138) (f770d22a976463d90fb4349d1c8521b8) switched from RUNNING to FAILED.
>> java.lang.Exception: Exception while creating StreamOperatorStateContext.
>> 	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:195)
>> 	at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:250)
>> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:738)
>> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:289)
>> 	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:704)
>> 	at java.lang.Thread.run(Thread.java:748)
>> Caused by: org.apache.flink.util.FlinkException: Could not restore keyed state backend for KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5_(2/138) from any of the 1 provided restore options.
>> 	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:137)
>> 	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:284)
>> 	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:135)
>> 	... 5 more
>> Caused by: java.nio.file.NoSuchFileException: /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-133c16c5-4565-4014-b769-2a978af8e772/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__2_138__uuid_ab38b75f-77d7-4124-a410-6444b35d232d/db/000495.sst -> /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-133c16c5-4565-4014-b769-2a978af8e772/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__2_138__uuid_ab38b75f-77d7-4124-a410-6444b35d232d/cf45eae8-d5d4-4f04-8bf9-8d54ac078769/000495.sst
>> 	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
>> 	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
>> 	at sun.nio.fs.UnixFileSystemProvider.createLink(UnixFileSystemProvider.java:476)
>> 	at java.nio.file.Files.createLink(Files.java:1086)
>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreInstanceDirectoryFromPath(RocksDBKeyedStateBackend.java:1238)
>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreLocalStateIntoFullInstance(RocksDBKeyedStateBackend.java:1186)
>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreWithoutRescaling(RocksDBKeyedStateBackend.java:916)
>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restore(RocksDBKeyedStateBackend.java:864)
>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:525)
>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:147)
>> 	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:151)
>> 	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:123)
>> 	... 7 more
>> 2018-12-06 22:53:41,286 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job Flink-Job-Offline (6e40c9381aa12f69b6ac182c91d993f5) switched from state RUNNING to FAILING.
>> java.lang.Exception: Exception while creating StreamOperatorStateContext.
>> 	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:195)
>> 	at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:250)
>> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:738)
>> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:289)
>> 	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:704)
>> 	at java.lang.Thread.run(Thread.java:748)
>> Caused by: org.apache.flink.util.FlinkException: Could not restore keyed state backend for KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5_(2/138) from any of the 1 provided restore options.
>> 	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:137)
>> 	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:284)
>> 	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:135)
>> 	... 5 more
>> Caused by: java.nio.file.NoSuchFileException: /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-133c16c5-4565-4014-b769-2a978af8e772/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__2_138__uuid_ab38b75f-77d7-4124-a410-6444b35d232d/db/000495.sst -> /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-133c16c5-4565-4014-b769-2a978af8e772/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__2_138__uuid_ab38b75f-77d7-4124-a410-6444b35d232d/cf45eae8-d5d4-4f04-8bf9-8d54ac078769/000495.sst
>> 	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
>> 	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
>> 	at sun.nio.fs.UnixFileSystemProvider.createLink(UnixFileSystemProvider.java:476)
>> 	at java.nio.file.Files.createLink(Files.java:1086)
>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreInstanceDirectoryFromPath(RocksDBKeyedStateBackend.java:1238)
>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreLocalStateIntoFullInstance(RocksDBKeyedStateBackend.java:1186)
>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreWithoutRescaling(RocksDBKeyedStateBackend.java:916)
>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restore(RocksDBKeyedStateBackend.java:864)
>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:525)
>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:147)
>> 	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:151)
>> 	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:123)
>> 	... 7 more
>> 2018-12-06 22:53:41,287 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source: topic.rate (1/16) (5637f1c3568ca7c29db002e579c05546) switched from RUNNING to CANCELING.
>> 
>> 
>> Best, 
>> Ben
>> 
>> Stefan Richter <s.richter@data-artisans.com> wrote on Fri, Dec 7, 2018 at 10:00 PM:
>> Hi,
>> 
>> From what I can see in the log here, it looks like your RocksDB is not recovering from local but from a remote filesystem. This recovery basically has three steps:
>> 
>> 1: Create a temporary directory (in your example, this is the dir that ends …/5683a26f-cde2-406d-b4cf-3c6c3976f8ba) and download all the files, mainly sst files from remote fs to the temporary directory in local fs.
>> 
>> 2: List all the downloaded files in the temporary directory and either hardlink (for sst files) or copy (for all other files) the listed files into the new RocksDB instance path (the path that ends with …/db).
>> 
>> 3: Open the new db from the instance path, delete the temporary directory.
>> 
>> Now what is very surprising here is that it claims some file was not found (not clear which one, but I assume the downloaded file). However, how the file can be lost between downloading/listing and the attempt to hardlink it is very mysterious. Can you check the logs for any other exceptions and can you check what files exist in the recovery (e.g. what is downloaded, if the instance path is there, …). For now, I cannot see how a listed file could suddenly disappear, Flink will only delete the temporary directory if recovery is completed or failed. 
>> 
>> Also: is this problem deterministic or was this a singularity? Did you use a different Flink version before (which worked)?
>> 
>> Best,
>> Stefan
>> 
>>> On 7. Dec 2018, at 11:28, Ben Yan <yan.xiao.bin.mail@gmail.com> wrote:
>>> 
>>> Hi. I am using flink-1.7.0, with RocksDB and HDFS as the state backend, but recently I found the following exception when the job resumed from the checkpoint. Task-local state is always considered a secondary copy; the ground truth of the checkpoint state is the primary copy in the distributed store. But it seems that the job did not recover from HDFS and failed directly. I hope someone can give me advice or hints about the problem that I encountered.
>>> 
>>> 
>>> 2018-12-06 22:54:04,171 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - KeyedProcess (3/138) (5d96a585130f7a21f22f82f79941fb1d) switched from RUNNING to FAILED.
>>> java.lang.Exception: Exception while creating StreamOperatorStateContext.
>>> 	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:195)
>>> 	at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:250)
>>> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:738)
>>> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:289)
>>> 	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:704)
>>> 	at java.lang.Thread.run(Thread.java:748)
>>> Caused by: org.apache.flink.util.FlinkException: Could not restore keyed state backend for KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5_(3/138) from any of the 1 provided restore options.
>>> 	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:137)
>>> 	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:284)
>>> 	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:135)
>>> 	... 5 more
>>> Caused by: java.nio.file.NoSuchFileException: /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-0115e9d6-a816-4b65-8944-1423f0fdae58/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__3_138__uuid_1c6a5a11-caaf-4564-b3d0-9c7dadddc390/db/000495.sst -> /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-0115e9d6-a816-4b65-8944-1423f0fdae58/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__3_138__uuid_1c6a5a11-caaf-4564-b3d0-9c7dadddc390/5683a26f-cde2-406d-b4cf-3c6c3976f8ba/000495.sst
>>> 	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
>>> 	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
>>> 	at sun.nio.fs.UnixFileSystemProvider.createLink(UnixFileSystemProvider.java:476)
>>> 	at java.nio.file.Files.createLink(Files.java:1086)
>>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreInstanceDirectoryFromPath(RocksDBKeyedStateBackend.java:1238)
>>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreLocalStateIntoFullInstance(RocksDBKeyedStateBackend.java:1186)
>>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreWithoutRescaling(RocksDBKeyedStateBackend.java:916)
>>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restore(RocksDBKeyedStateBackend.java:864)
>>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:525)
>>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:147)
>>> 	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:151)
>>> 	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:123)
>>> 	... 7 more
>>> 
>>> Best
>>> Ben
>> 
> 


Re: Failed to resume job from checkpoint

Posted by Ben Yan <ya...@gmail.com>.
The version of the recovered checkpoint is also 1.7.0.

Stefan Richter <s....@data-artisans.com> wrote on Fri, Dec 7, 2018 at 11:06 PM:

> Just to clarify, the checkpoint from which you want to resume in 1.7, was
> that taken by 1.6 or by 1.7? So far this is a bit mysterious because it
> says FileNotFound, but the whole iteration is driven by listing the
> existing files. Can you somehow monitor which files and directories are
> created during the restore attempt?
>
> On 7. Dec 2018, at 15:53, Ben Yan <ya...@gmail.com> wrote:
>
> Hi, Stefan
>
> Thank you for your explanation. I used Flink 1.6.2 without any
> problems. I have tested version 1.7.0 a few times, but every time I
> resume from the checkpoint the job shows the exception I posted
> earlier, which makes the job unrecoverable. And I checked all the logs;
> apart from this exception, there are no other exceptions.
>
> The following is all the logs when an exception occurs:
>
> 2018-12-06 22:53:41,282 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - KeyedProcess (120/138) (25ab0c8d0bc657860b766fa4c8d85a42) switched from DEPLOYING to RUNNING.
>
> 2018-12-06 22:53:41,285 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - KeyedProcess (2/138) (f770d22a976463d90fb4349d1c8521b8) switched from RUNNING to FAILED.
> java.lang.Exception: Exception while creating StreamOperatorStateContext.
> 	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:195)
> 	at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:250)
> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:738)
> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:289)
> 	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:704)
> 	at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.flink.util.FlinkException: Could not restore keyed state backend for KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5_(2/138) from any of the 1 provided restore options.
> 	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:137)
> 	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:284)
> 	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:135)
> 	... 5 more
> Caused by: java.nio.file.NoSuchFileException: /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-133c16c5-4565-4014-b769-2a978af8e772/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__2_138__uuid_ab38b75f-77d7-4124-a410-6444b35d232d/db/000495.sst -> /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-133c16c5-4565-4014-b769-2a978af8e772/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__2_138__uuid_ab38b75f-77d7-4124-a410-6444b35d232d/cf45eae8-d5d4-4f04-8bf9-8d54ac078769/000495.sst
> 	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
> 	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
> 	at sun.nio.fs.UnixFileSystemProvider.createLink(UnixFileSystemProvider.java:476)
> 	at java.nio.file.Files.createLink(Files.java:1086)
> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreInstanceDirectoryFromPath(RocksDBKeyedStateBackend.java:1238)
> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreLocalStateIntoFullInstance(RocksDBKeyedStateBackend.java:1186)
> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreWithoutRescaling(RocksDBKeyedStateBackend.java:916)
> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restore(RocksDBKeyedStateBackend.java:864)
> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:525)
> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:147)
> 	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:151)
> 	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:123)
> 	... 7 more
> 2018-12-06 22:53:41,286 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job Flink-Job-Offline (6e40c9381aa12f69b6ac182c91d993f5) switched from state RUNNING to FAILING.
> java.lang.Exception: Exception while creating StreamOperatorStateContext.
> 	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:195)
> 	at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:250)
> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:738)
> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:289)
> 	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:704)
> 	at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.flink.util.FlinkException: Could not restore keyed state backend for KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5_(2/138) from any of the 1 provided restore options.
> 	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:137)
> 	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:284)
> 	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:135)
> 	... 5 more
> Caused by: java.nio.file.NoSuchFileException: /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-133c16c5-4565-4014-b769-2a978af8e772/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__2_138__uuid_ab38b75f-77d7-4124-a410-6444b35d232d/db/000495.sst -> /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-133c16c5-4565-4014-b769-2a978af8e772/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__2_138__uuid_ab38b75f-77d7-4124-a410-6444b35d232d/cf45eae8-d5d4-4f04-8bf9-8d54ac078769/000495.sst
> 	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
> 	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
> 	at sun.nio.fs.UnixFileSystemProvider.createLink(UnixFileSystemProvider.java:476)
> 	at java.nio.file.Files.createLink(Files.java:1086)
> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreInstanceDirectoryFromPath(RocksDBKeyedStateBackend.java:1238)
> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreLocalStateIntoFullInstance(RocksDBKeyedStateBackend.java:1186)
> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreWithoutRescaling(RocksDBKeyedStateBackend.java:916)
> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restore(RocksDBKeyedStateBackend.java:864)
> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:525)
> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:147)
> 	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:151)
> 	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:123)
> 	... 7 more
> 2018-12-06 22:53:41,287 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source: topic.rate (1/16) (5637f1c3568ca7c29db002e579c05546) switched from RUNNING to CANCELING.
>
>
>
> Best,
> Ben
>
> Stefan Richter <s....@data-artisans.com> 于2018年12月7日周五 下午10:00写道:
>
>> Hi,
>>
>> From what I can see in the log here, it looks like your RocksDB is not
>> recovering from local state but from a remote filesystem. This recovery
>> basically has three steps:
>>
>> 1: Create a temporary directory (in your example, this is the dir that
>> ends …/5683a26f-cde2-406d-b4cf-3c6c3976f8ba) and download all the files,
>> mainly sst files from remote fs to the temporary directory in local fs.
>>
>> 2: List all the downloaded files in the temporary directory and either
>> hardlink (for sst files) or copy (for all other files) the listed files
>> into the new RocksDB instance path (the path that ends with …/db).
>>
>> 3: Open the new db from the instance path, delete the temporary directory.
>>
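[The hardlink-or-copy step described above can be sketched roughly as follows. This is an illustrative sketch only, not Flink's actual `RocksDBIncrementalRestoreOperation` code; the class and method names here are invented. It does show why a `NoSuchFileException` from `Files.createLink` means the downloaded source file was gone by the time the link was attempted.]

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Illustrative sketch of restore step 2: hardlink .sst files and copy all
// other files from the downloaded temporary directory into the new RocksDB
// instance path. Files.createLink(link, existing) throws NoSuchFileException
// if `existing` has vanished between the directory listing and the link call,
// which matches the exception in the logs above.
public class RestoreStepSketch {

    public static void materialize(Path tmpDir, Path instancePath) throws IOException {
        Files.createDirectories(instancePath);
        try (Stream<Path> downloaded = Files.list(tmpDir)) {
            for (Path src : (Iterable<Path>) downloaded::iterator) {
                Path dst = instancePath.resolve(src.getFileName());
                if (src.toString().endsWith(".sst")) {
                    // Hardlink SST files: cheap, shares the on-disk data.
                    Files.createLink(dst, src);
                } else {
                    // Copy everything else (MANIFEST, CURRENT, OPTIONS, ...).
                    Files.copy(src, dst);
                }
            }
        }
    }
}
```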
>> Now what is very surprising here is that it claims some file was not
>> found (not clear which one, but I assume the downloaded file). However, how
>> the file can be lost between downloading/listing and the attempt to
>> hardlink it is very mysterious. Can you check the logs for any other
>> exceptions and can you check what files exist in the recovery (e.g. what is
>> downloaded, if the instance path is there, …). For now, I cannot see how a
>> listed file could suddenly disappear, Flink will only delete the temporary
>> directory if recovery is completed or failed.
>>
>> Also: is this problem deterministic or was this a singularity? Did you
>> use a different Flink version before (which worked)?
>>
>> Best,
>> Stefan
>>
>> On 7. Dec 2018, at 11:28, Ben Yan <ya...@gmail.com> wrote:
>>
>> hi. I am using flink-1.7.0, with RocksDB and HDFS as the state backend,
>> but recently I found the following exception when the job resumed from the
>> checkpoint. Task-local state is always considered a secondary copy; the
>> ground truth of the checkpoint state is the primary copy in the distributed
>> store. But it seems that the job did not recover from HDFS, and it
>> failed directly. Hope someone can give me advice or hints about the
>> problem that I encountered.
>>
>>
>> 2018-12-06 22:54:04,171 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - KeyedProcess (3/138) (5d96a585130f7a21f22f82f79941fb1d) switched from RUNNING to FAILED.
>> java.lang.Exception: Exception while creating StreamOperatorStateContext.
>> 	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:195)
>> 	at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:250)
>> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:738)
>> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:289)
>> 	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:704)
>> 	at java.lang.Thread.run(Thread.java:748)
>> Caused by: org.apache.flink.util.FlinkException: Could not restore keyed state backend for KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5_(3/138) from any of the 1 provided restore options.
>> 	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:137)
>> 	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:284)
>> 	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:135)
>> 	... 5 more
>> Caused by: java.nio.file.NoSuchFileException: /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-0115e9d6-a816-4b65-8944-1423f0fdae58/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__3_138__uuid_1c6a5a11-caaf-4564-b3d0-9c7dadddc390/db/000495.sst -> /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-0115e9d6-a816-4b65-8944-1423f0fdae58/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__3_138__uuid_1c6a5a11-caaf-4564-b3d0-9c7dadddc390/5683a26f-cde2-406d-b4cf-3c6c3976f8ba/000495.sst
>> 	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
>> 	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
>> 	at sun.nio.fs.UnixFileSystemProvider.createLink(UnixFileSystemProvider.java:476)
>> 	at java.nio.file.Files.createLink(Files.java:1086)
>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreInstanceDirectoryFromPath(RocksDBKeyedStateBackend.java:1238)
>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreLocalStateIntoFullInstance(RocksDBKeyedStateBackend.java:1186)
>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreWithoutRescaling(RocksDBKeyedStateBackend.java:916)
>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restore(RocksDBKeyedStateBackend.java:864)
>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:525)
>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:147)
>> 	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:151)
>> 	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:123)
>> 	... 7 more
>>
>>
>> Best
>>
>> Ben
>>
>>
>>
>

Re: Failed to resume job from checkpoint

Posted by Stefan Richter <s....@data-artisans.com>.
Just to clarify: the checkpoint from which you want to resume in 1.7, was that taken by 1.6 or by 1.7? So far this is a bit mysterious, because it reports a missing file even though the whole iteration is driven by listing the files that exist. Can you somehow monitor which files and directories are created during the restore attempt?
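[One way to follow up on Stefan's request is to dump the contents of the task's local recovery directory while the restore is running. The helper below is a generic sketch, not a Flink utility; the path you pass in is the `flink-io-…/job_…_op_…` directory from the exception message.]

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Hypothetical helper: recursively lists everything under a Flink local
// recovery directory so you can see which files the restore attempt
// downloaded before the hardlink step failed.
public class ListRestoreDir {

    public static long dump(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.peek(p -> System.out.println(
                    (Files.isDirectory(p) ? "dir  " : "file ") + root.relativize(p)))
                .count();  // number of entries seen, including root
        }
    }

    public static void main(String[] args) throws IOException {
        // Replace with the actual flink-io directory from the exception message.
        dump(Paths.get(args.length > 0 ? args[0] : "."));
    }
}
```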

> On 7. Dec 2018, at 15:53, Ben Yan <ya...@gmail.com> wrote:
> 
> hi, Stefan
> 
> Thank you for your explanation. I used Flink 1.6.2 without any problems. I have tested version 1.7.0 a few times, but every time I resume from the checkpoint, the job shows the exception I posted earlier, which makes the job unrecoverable. And I checked all the logs; except for this exception, there are no other exceptions.
> 
> The following is all the logs when the exception occurs:
> 2018-12-06 22:53:41,282 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - KeyedProcess (120/138) (25ab0c8d0bc657860b766fa4c8d85a42) switched from DEPLOYING to RUNNING.
> 2018-12-06 22:53:41,285 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - KeyedProcess (2/138) (f770d22a976463d90fb4349d1c8521b8) switched from RUNNING to FAILED.
> java.lang.Exception: Exception while creating StreamOperatorStateContext.
> 	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:195)
> 	at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:250)
> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:738)
> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:289)
> 	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:704)
> 	at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.flink.util.FlinkException: Could not restore keyed state backend for KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5_(2/138) from any of the 1 provided restore options.
> 	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:137)
> 	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:284)
> 	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:135)
> 	... 5 more
> Caused by: java.nio.file.NoSuchFileException: /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-133c16c5-4565-4014-b769-2a978af8e772/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__2_138__uuid_ab38b75f-77d7-4124-a410-6444b35d232d/db/000495.sst -> /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-133c16c5-4565-4014-b769-2a978af8e772/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__2_138__uuid_ab38b75f-77d7-4124-a410-6444b35d232d/cf45eae8-d5d4-4f04-8bf9-8d54ac078769/000495.sst
> 	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
> 	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
> 	at sun.nio.fs.UnixFileSystemProvider.createLink(UnixFileSystemProvider.java:476)
> 	at java.nio.file.Files.createLink(Files.java:1086)
> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreInstanceDirectoryFromPath(RocksDBKeyedStateBackend.java:1238)
> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreLocalStateIntoFullInstance(RocksDBKeyedStateBackend.java:1186)
> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreWithoutRescaling(RocksDBKeyedStateBackend.java:916)
> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restore(RocksDBKeyedStateBackend.java:864)
> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:525)
> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:147)
> 	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:151)
> 	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:123)
> 	... 7 more
> 2018-12-06 22:53:41,286 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job Flink-Job-Offline (6e40c9381aa12f69b6ac182c91d993f5) switched from state RUNNING to FAILING.
> java.lang.Exception: Exception while creating StreamOperatorStateContext.
> 	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:195)
> 	at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:250)
> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:738)
> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:289)
> 	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:704)
> 	at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.flink.util.FlinkException: Could not restore keyed state backend for KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5_(2/138) from any of the 1 provided restore options.
> 	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:137)
> 	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:284)
> 	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:135)
> 	... 5 more
> Caused by: java.nio.file.NoSuchFileException: /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-133c16c5-4565-4014-b769-2a978af8e772/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__2_138__uuid_ab38b75f-77d7-4124-a410-6444b35d232d/db/000495.sst -> /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-133c16c5-4565-4014-b769-2a978af8e772/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__2_138__uuid_ab38b75f-77d7-4124-a410-6444b35d232d/cf45eae8-d5d4-4f04-8bf9-8d54ac078769/000495.sst
> 	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
> 	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
> 	at sun.nio.fs.UnixFileSystemProvider.createLink(UnixFileSystemProvider.java:476)
> 	at java.nio.file.Files.createLink(Files.java:1086)
> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreInstanceDirectoryFromPath(RocksDBKeyedStateBackend.java:1238)
> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreLocalStateIntoFullInstance(RocksDBKeyedStateBackend.java:1186)
> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreWithoutRescaling(RocksDBKeyedStateBackend.java:916)
> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restore(RocksDBKeyedStateBackend.java:864)
> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:525)
> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:147)
> 	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:151)
> 	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:123)
> 	... 7 more
> 2018-12-06 22:53:41,287 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source: topic.rate (1/16) (5637f1c3568ca7c29db002e579c05546) switched from RUNNING to CANCELING.
> 
> 
> Best, 
> Ben
> 
> Stefan Richter <s.richter@data-artisans.com <ma...@data-artisans.com>> 于2018年12月7日周五 下午10:00写道:
> Hi,
> 
> From what I can see in the log here, it looks like your RocksDB is not recovering from local state but from a remote filesystem. This recovery basically has three steps:
> 
> 1: Create a temporary directory (in your example, this is the dir that ends …/5683a26f-cde2-406d-b4cf-3c6c3976f8ba) and download all the files, mainly sst files from remote fs to the temporary directory in local fs.
> 
> 2: List all the downloaded files in the temporary directory and either hardlink (for sst files) or copy (for all other files) the listed files into the new RocksDB instance path (the path that ends with …/db).
> 
> 3: Open the new db from the instance path, delete the temporary directory.
> 
> Now what is very surprising here is that it claims some file was not found (not clear which one, but I assume the downloaded file). However, how the file can be lost between downloading/listing and the attempt to hardlink it is very mysterious. Can you check the logs for any other exceptions and can you check what files exist in the recovery (e.g. what is downloaded, if the instance path is there, …). For now, I cannot see how a listed file could suddenly disappear, Flink will only delete the temporary directory if recovery is completed or failed. 
> 
> Also: is this problem deterministic or was this a singularity? Did you use a different Flink version before (which worked)?
> 
> Best,
> Stefan
> 
>> On 7. Dec 2018, at 11:28, Ben Yan <yan.xiao.bin.mail@gmail.com <ma...@gmail.com>> wrote:
>> 
>> hi. I am using flink-1.7.0, with RocksDB and HDFS as the state backend, but recently I found the following exception when the job resumed from the checkpoint. Task-local state is always considered a secondary copy; the ground truth of the checkpoint state is the primary copy in the distributed store. But it seems that the job did not recover from HDFS, and it failed directly. Hope someone can give me advice or hints about the problem that I encountered.
>> 
>> 
>> 2018-12-06 22:54:04,171 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - KeyedProcess (3/138) (5d96a585130f7a21f22f82f79941fb1d) switched from RUNNING to FAILED.
>> java.lang.Exception: Exception while creating StreamOperatorStateContext.
>> 	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:195)
>> 	at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:250)
>> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:738)
>> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:289)
>> 	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:704)
>> 	at java.lang.Thread.run(Thread.java:748)
>> Caused by: org.apache.flink.util.FlinkException: Could not restore keyed state backend for KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5_(3/138) from any of the 1 provided restore options.
>> 	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:137)
>> 	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:284)
>> 	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:135)
>> 	... 5 more
>> Caused by: java.nio.file.NoSuchFileException: /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-0115e9d6-a816-4b65-8944-1423f0fdae58/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__3_138__uuid_1c6a5a11-caaf-4564-b3d0-9c7dadddc390/db/000495.sst -> /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-0115e9d6-a816-4b65-8944-1423f0fdae58/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__3_138__uuid_1c6a5a11-caaf-4564-b3d0-9c7dadddc390/5683a26f-cde2-406d-b4cf-3c6c3976f8ba/000495.sst
>> 	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
>> 	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
>> 	at sun.nio.fs.UnixFileSystemProvider.createLink(UnixFileSystemProvider.java:476)
>> 	at java.nio.file.Files.createLink(Files.java:1086)
>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreInstanceDirectoryFromPath(RocksDBKeyedStateBackend.java:1238)
>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreLocalStateIntoFullInstance(RocksDBKeyedStateBackend.java:1186)
>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreWithoutRescaling(RocksDBKeyedStateBackend.java:916)
>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restore(RocksDBKeyedStateBackend.java:864)
>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:525)
>> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:147)
>> 	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:151)
>> 	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:123)
>> 	... 7 more
>> 
>> Best
>> Ben
> 


Re: Failed to resume job from checkpoint

Posted by Ben Yan <ya...@gmail.com>.
hi, Stefan

Thank you for your explanation. I used Flink 1.6.2 without any
problems. I have tested version 1.7.0 a few times, but every time I
resume from the checkpoint, the job shows the exception I posted
earlier, which makes the job unrecoverable. And I checked all the logs;
except for this exception, there are no other exceptions.

The following is all the logs when an exception occurs:

2018-12-06 22:53:41,282 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph        -
KeyedProcess (120/138) (25ab0c8d0bc657860b766fa4c8d85a42) switched
from DEPLOYING to RUNNING.

2018-12-06 22:53:41,285 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph        -
KeyedProcess (2/138) (f770d22a976463d90fb4349d1c8521b8) switched from
RUNNING to FAILED.
java.lang.Exception: Exception while creating StreamOperatorStateContext.
	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:195)
	at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:250)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:738)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:289)
	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:704)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.flink.util.FlinkException: Could not restore
keyed state backend for
KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5_(2/138) from any
of the 1 provided restore options.
	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:137)
	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:284)
	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:135)
	... 5 more
Caused by: java.nio.file.NoSuchFileException:
/mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-133c16c5-4565-4014-b769-2a978af8e772/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__2_138__uuid_ab38b75f-77d7-4124-a410-6444b35d232d/db/000495.sst
-> /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-133c16c5-4565-4014-b769-2a978af8e772/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__2_138__uuid_ab38b75f-77d7-4124-a410-6444b35d232d/cf45eae8-d5d4-4f04-8bf9-8d54ac078769/000495.sst
	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
	at sun.nio.fs.UnixFileSystemProvider.createLink(UnixFileSystemProvider.java:476)
	at java.nio.file.Files.createLink(Files.java:1086)
	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreInstanceDirectoryFromPath(RocksDBKeyedStateBackend.java:1238)
	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreLocalStateIntoFullInstance(RocksDBKeyedStateBackend.java:1186)
	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreWithoutRescaling(RocksDBKeyedStateBackend.java:916)
	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restore(RocksDBKeyedStateBackend.java:864)
	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:525)
	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:147)
	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:151)
	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:123)
	... 7 more
2018-12-06 22:53:41,286 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job
Flink-Job-Offline (6e40c9381aa12f69b6ac182c91d993f5) switched from
state RUNNING to FAILING.
java.lang.Exception: Exception while creating StreamOperatorStateContext.
	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:195)
	at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:250)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:738)
	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:289)
	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:704)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.flink.util.FlinkException: Could not restore
keyed state backend for
KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5_(2/138) from any
of the 1 provided restore options.
	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:137)
	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:284)
	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:135)
	... 5 more
Caused by: java.nio.file.NoSuchFileException:
/mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-133c16c5-4565-4014-b769-2a978af8e772/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__2_138__uuid_ab38b75f-77d7-4124-a410-6444b35d232d/db/000495.sst
-> /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-133c16c5-4565-4014-b769-2a978af8e772/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__2_138__uuid_ab38b75f-77d7-4124-a410-6444b35d232d/cf45eae8-d5d4-4f04-8bf9-8d54ac078769/000495.sst
	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
	at sun.nio.fs.UnixFileSystemProvider.createLink(UnixFileSystemProvider.java:476)
	at java.nio.file.Files.createLink(Files.java:1086)
	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreInstanceDirectoryFromPath(RocksDBKeyedStateBackend.java:1238)
	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreLocalStateIntoFullInstance(RocksDBKeyedStateBackend.java:1186)
	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreWithoutRescaling(RocksDBKeyedStateBackend.java:916)
	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restore(RocksDBKeyedStateBackend.java:864)
	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:525)
	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:147)
	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:151)
	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:123)
	... 7 more
2018-12-06 22:53:41,287 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph        -
Source: topic.rate (1/16) (5637f1c3568ca7c29db002e579c05546) switched
from RUNNING to CANCELING.



Best,
Ben

Stefan Richter <s....@data-artisans.com> 于2018年12月7日周五 下午10:00写道:

> Hi,
>
> From what I can see in the log here, it looks like your RocksDB is not
> recovering from local state but from a remote filesystem. This recovery
> basically has three steps:
>
> 1: Create a temporary directory (in your example, this is the dir that
> ends …/5683a26f-cde2-406d-b4cf-3c6c3976f8ba) and download all the files,
> mainly sst files from remote fs to the temporary directory in local fs.
>
> 2: List all the downloaded files in the temporary directory and either
> hardlink (for sst files) or copy (for all other files) the listed files
> into the new RocksDB instance path (the path that ends with …/db).
>
> 3: Open the new db from the instance path, delete the temporary directory.
>
> Now what is very surprising here is that it claims some file was not found
> (not clear which one, but I assume the downloaded file). However, how the
> file can be lost between downloading/listing and the attempt to hardlink it
> is very mysterious. Can you check the logs for any other exceptions and can
> you check what files exist in the recovery (e.g. what is downloaded, if the
> instance path is there, …). For now, I cannot see how a listed file could
> suddenly disappear, Flink will only delete the temporary directory if
> recovery is completed or failed.
>
> Also: is this problem deterministic or was this a singularity? Did you use
> a different Flink version before (which worked)?
>
> Best,
> Stefan
>
> On 7. Dec 2018, at 11:28, Ben Yan <ya...@gmail.com> wrote:
>
> hi . I am using flink-1.7.0. I am using RockDB and hdfs as statebackend,
> but recently I found the following exception when the job resumed from the
> checkpoint. Task-local state is always considered a secondary copy, the
> ground truth of the checkpoint state is the primary copy in the distributed
> store. But it seems that the job did not recover from hdfs, and it failed
> directly.Hope someone can give me advices or hints about the problem that
> I encountered.
>
>
> 2018-12-06 22:54:04,171 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - KeyedProcess (3/138) (5d96a585130f7a21f22f82f79941fb1d) switched from RUNNING to FAILED.
> java.lang.Exception: Exception while creating StreamOperatorStateContext.
> 	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:195)
> 	at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:250)
> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:738)
> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:289)
> 	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:704)
> 	at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.flink.util.FlinkException: Could not restore keyed state backend for KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5_(3/138) from any of the 1 provided restore options.
> 	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:137)
> 	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:284)
> 	at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:135)
> 	... 5 more
> Caused by: java.nio.file.NoSuchFileException: /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-0115e9d6-a816-4b65-8944-1423f0fdae58/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__3_138__uuid_1c6a5a11-caaf-4564-b3d0-9c7dadddc390/db/000495.sst -> /mnt/yarn/local/usercache/yarn/appcache/application_1544101169829_0004/flink-io-0115e9d6-a816-4b65-8944-1423f0fdae58/job_6e40c9381aa12f69b6ac182c91d993f5_op_KeyedProcessOperator_e528d5d97ea2d7cefbcf6ff5b46354d5__3_138__uuid_1c6a5a11-caaf-4564-b3d0-9c7dadddc390/5683a26f-cde2-406d-b4cf-3c6c3976f8ba/000495.sst
> 	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
> 	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
> 	at sun.nio.fs.UnixFileSystemProvider.createLink(UnixFileSystemProvider.java:476)
> 	at java.nio.file.Files.createLink(Files.java:1086)
> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreInstanceDirectoryFromPath(RocksDBKeyedStateBackend.java:1238)
> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreLocalStateIntoFullInstance(RocksDBKeyedStateBackend.java:1186)
> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restoreWithoutRescaling(RocksDBKeyedStateBackend.java:916)
> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend$RocksDBIncrementalRestoreOperation.restore(RocksDBKeyedStateBackend.java:864)
> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:525)
> 	at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.restore(RocksDBKeyedStateBackend.java:147)
> 	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:151)
> 	at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:123)
> 	... 7 more
>
>
> Best
>
> Ben
>
>
>

Re: Failed to resume job from checkpoint

Posted by Stefan Richter <s....@data-artisans.com>.
Hi,

From what I can see in the log here, it looks like your RocksDB is not recovering from local state but from a remote filesystem. This recovery basically has three steps:

1: Create a temporary directory (in your example, this is the dir that ends …/5683a26f-cde2-406d-b4cf-3c6c3976f8ba) and download all the files, mainly sst files from remote fs to the temporary directory in local fs.

2: List all the downloaded files in the temporary directory and either hardlink (for sst files) or copy (for all other files) the listed files into the new RocksDB instance path (the path that ends with …/db)

3: Open the new db from the instance path, delete the temporary directory.
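The hardlink-or-copy part of these steps can be sketched as follows. This is a simplified illustration only, not Flink's actual code; all file names and directories below are made up:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Simplified sketch of step 2 of the restore: take the files that step 1
// downloaded into a temporary directory, and hardlink sst files (copy all
// others) into the new RocksDB instance path. Not Flink's real code.
public class RestoreSketch {

    static void restore(Path tmpDir, Path instanceDir) throws IOException {
        Files.createDirectories(instanceDir);
        try (DirectoryStream<Path> files = Files.newDirectoryStream(tmpDir)) {
            for (Path file : files) {
                Path target = instanceDir.resolve(file.getFileName());
                if (file.getFileName().toString().endsWith(".sst")) {
                    // Hard link: first argument is the new link to create,
                    // second is the existing (downloaded) file.
                    Files.createLink(target, file);
                } else {
                    Files.copy(file, target);
                }
            }
        }
        // Step 3 would open the db from instanceDir, then delete tmpDir.
    }

    public static void main(String[] args) throws IOException {
        // Step 1 stand-in: "download" two files into a temporary directory.
        Path tmpDir = Files.createTempDirectory("restore-tmp");
        Files.write(tmpDir.resolve("000001.sst"), new byte[]{1, 2, 3});
        Files.write(tmpDir.resolve("MANIFEST"), new byte[]{4});

        Path instanceDir = tmpDir.resolveSibling(tmpDir.getFileName() + "-db");
        restore(tmpDir, instanceDir);

        System.out.println(Files.exists(instanceDir.resolve("000001.sst")));
        System.out.println(Files.exists(instanceDir.resolve("MANIFEST")));
    }
}
```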

Now what is very surprising here is that it claims some file was not found (it is not clear which one, but I assume the downloaded file). However, how the file could be lost between downloading/listing and the attempt to hardlink it is very mysterious. Can you check the logs for any other exceptions, and can you check what files exist during the recovery (e.g. what is downloaded, whether the instance path is there, …)? For now, I cannot see how a listed file could suddenly disappear; Flink only deletes the temporary directory once recovery has completed or failed.
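For reference, the `link -> existing` pair in the exception message corresponds to the two arguments of `Files.createLink(link, existing)`, which fails with `NoSuchFileException` exactly when the existing file (here, the downloaded sst in the temporary directory) is gone. A small demo of that failure mode, with made-up paths:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;

// Demo: Files.createLink throws NoSuchFileException when the file to link
// to does not exist -- the same failure mode as in the stack trace, where
// .../db/000495.sst could not be linked to a missing downloaded sst file.
public class LinkDemo {
    static String tryLink(Path link, Path existing) {
        try {
            Files.createLink(link, existing);
            return "ok";
        } catch (NoSuchFileException e) {
            return "NoSuchFileException";
        } catch (IOException e) {
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("link-demo");
        Path missing = dir.resolve("000495.sst"); // never created
        System.out.println(tryLink(dir.resolve("db.sst"), missing));
    }
}
```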

Also: is this problem deterministic, or was it a one-off? Did you use a different Flink version before (which worked)?

Best,
Stefan



Re: Failed to resume job from checkpoint

Posted by Ben Yan <ya...@gmail.com>.
Thanks. If you need any further information, please let me know and I will provide it.


Re: Failed to resume job from checkpoint

Posted by Piotr Nowojski <pi...@data-artisans.com>.
Adding back user mailing list.

Andrey, could you take a look at this?

Piotrek

> On 7 Dec 2018, at 12:28, Ben Yan <ya...@gmail.com> wrote:
> 
> Yes. Previous versions never happened
> 
> Piotr Nowojski <piotr@data-artisans.com> wrote on Fri, Dec 7, 2018 at 7:27 PM:
> Hey,
> 
> Do you mean that the problem started occurring only after upgrading to Flink 1.7.0?
> 
> Piotrek
> 


Re: Failed to resume job from checkpoint

Posted by Piotr Nowojski <pi...@data-artisans.com>.
Hey,

Do you mean that the problem started occurring only after upgrading to Flink 1.7.0?

Piotrek
