Posted to user@flink.apache.org by Dominique Rondé <do...@allsecur.de> on 2016/05/04 16:27:06 UTC

Restart Flink in Yarn

Hi @all,

I have a YARN cluster with 5 nodes and a running Flink (0.10.2)
instance. Today we shut down one of the YARN hosts for maintenance.
After the restart, some Flink streaming routes are stuck in a
restarting status (see stack trace below). Now I want to restart these
routes so that they continue their work from the last checkpoint. What can I do?

Greets
Dominique

Stacktrace
===================================================================================

java.io.IOException: Cannot get library with hash 8f15fe4a8137ca2f9fb348ec634f3703f4fd7317
	at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager.registerReferenceToBlobKeyAndGetURL(BlobLibraryCacheManager.java:254)
	at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager.registerTask(BlobLibraryCacheManager.java:114)
	at org.apache.flink.runtime.taskmanager.Task.createUserCodeClassloader(Task.java:710)
	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:471)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Failed to fetch BLOB 8f15fe4a8137ca2f9fb348ec634f3703f4fd7317 from /10.24.20.14:60485 and store it under /tmp/blobStore-efdeddf9-d096-440f-a4cb-9c79334ff92c/cache/blob_8f15fe4a8137ca2f9fb348ec634f3703f4fd7317
	at org.apache.flink.runtime.blob.BlobCache.getURL(BlobCache.java:177)
	at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager.registerReferenceToBlobKeyAndGetURL(BlobLibraryCacheManager.java:245)
	... 4 more
Caused by: java.io.IOException: GET operation failed: Server side error: Cannot find required BLOB at /tmp/blobStore-0f9a63e3-5700-4d47-aea7-310506c1496c/cache/blob_8f15fe4a8137ca2f9fb348ec634f3703f4fd7317
	at org.apache.flink.runtime.blob.BlobClient.get(BlobClient.java:165)
	at org.apache.flink.runtime.blob.BlobCache.getURL(BlobCache.java:125)
	... 5 more
Caused by: java.io.IOException: Server side error: Cannot find required BLOB at /tmp/blobStore-0f9a63e3-5700-4d47-aea7-310506c1496c/cache/blob_8f15fe4a8137ca2f9fb348ec634f3703f4fd7317
	at org.apache.flink.runtime.blob.BlobClient.receiveAndCheckResponse(BlobClient.java:213)
	at org.apache.flink.runtime.blob.BlobClient.get(BlobClient.java:159)
	... 6 more
Caused by: java.io.IOException: Cannot find required BLOB at /tmp/blobStore-0f9a63e3-5700-4d47-aea7-310506c1496c/cache/blob_8f15fe4a8137ca2f9fb348ec634f3703f4fd7317
	at org.apache.flink.runtime.blob.BlobServerConnection.get(BlobServerConnection.java:202)
	at org.apache.flink.runtime.blob.BlobServerConnection.run(BlobServerConnection.java:112)



Re: Restart Flink in Yarn

Posted by Ufuk Celebi <uc...@apache.org>.
Hey Dominique!

Are you running the job in HA mode?

– Ufuk

On Thu, May 5, 2016 at 1:49 PM, Robert Metzger <rm...@apache.org> wrote:
> Hi Dominique,
> I'm sorry that you ran into this issue.
> What do you mean by "flink streaming routes"?
>
> Regarding the second question: "Now I want to restart these routes to
> continue their work from the last checkpoint. What can i do?"
> I think the feature you are looking for is savepoints:
> https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming/savepoints.html
> However, savepoints were added in Flink 1.0, so they are not available in
> your 0.10 release.
>
>
> I have to admit that I haven't seen the "Cannot find required BLOB at ..."
> exception before. Is there any chance that the files were deleted from
> the /tmp directory by an external service (like a periodic cleanup script),
> or has the /tmp dir been mounted to another disk in the meantime?

Re: Restart Flink in Yarn

Posted by Robert Metzger <rm...@apache.org>.
Hi Dominique,
I'm sorry that you ran into this issue.
What do you mean by "flink streaming routes"?

Regarding the second question: "Now I want to restart these routes to
continue their work from the last checkpoint. What can i do?"
I think the feature you are looking for is savepoints:
https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming/savepoints.html
However, savepoints were added in Flink 1.0, so they are not available in
your 0.10 release.


I have to admit that I haven't seen the "Cannot find required BLOB at ..."
exception before. Is there any chance that the files were deleted from
the /tmp directory by an external service (like a periodic cleanup
script), or has the /tmp dir been mounted to another disk in the meantime?
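
One way to check the /tmp question is to look for the BLOB file directly on the host that serves it (10.24.20.14 in the stack trace). The following is only a rough sketch, not a Flink tool: it assumes the directory layout seen in the stack trace (/tmp/blobStore-<id>/cache/blob_<hash>), and the function names find_blob and list_blob_stores are made up for this example.

```python
import glob
import os


def list_blob_stores(tmp_dir="/tmp"):
    """Return all blobStore-* directories found directly under tmp_dir."""
    return sorted(glob.glob(os.path.join(tmp_dir, "blobStore-*")))


def find_blob(blob_hash, tmp_dir="/tmp"):
    """Return every cached copy of the BLOB with the given hash.

    Assumes the layout from the stack trace:
    <tmp_dir>/blobStore-<id>/cache/blob_<hash>
    """
    pattern = os.path.join(tmp_dir, "blobStore-*", "cache", "blob_" + blob_hash)
    return sorted(glob.glob(pattern))
```

Calling find_blob("8f15fe4a8137ca2f9fb348ec634f3703f4fd7317") on that host returns an empty list if the file is gone. If list_blob_stores() is also empty, or does not contain blobStore-0f9a63e3-5700-4d47-aea7-310506c1496c, that would hint that the whole directory was cleaned up or that /tmp now points somewhere else.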


