Posted to issues@flink.apache.org by "Sushant (Jira)" <ji...@apache.org> on 2022/06/01 12:48:00 UTC
[jira] [Commented] (FLINK-27855) Job Manager fails to recover with S3 storage and HA enabled

    [ https://issues.apache.org/jira/browse/FLINK-27855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17544887#comment-17544887 ] 

Sushant commented on FLINK-27855:
---------------------------------

Recovery files weren't deleted during creation/deletion of the Flink deployment because of S3 permission issues. I can't reproduce the issue after granting the AWS EKS cluster pods the required S3 permissions using IRSA (IAM Roles for Service Accounts).
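
For anyone hitting the same symptom, here is a minimal sketch of the kind of IAM policy document that could be attached to the role used by IRSA. The bucket name, prefix, and helper name are hypothetical, and the exact list of actions is an assumption, since the comment above only says "required permissions". The relevant point is that delete access is included, because the failure was in cleanup.

```python
import json


def build_ha_bucket_policy(bucket: str, prefix: str) -> dict:
    """Build an IAM policy document for a Flink HA storage prefix.

    Hypothetical helper: the action list is an assumption, not taken
    from the Flink documentation or from this issue.
    """
    prefix = prefix.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
            },
            {
                # Object-level access, including delete, so that
                # recovery files can be cleaned up.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/{prefix}/*"],
            },
        ],
    }


# Placeholder bucket and prefix; substitute the location used in
# high-availability.storageDir.
policy = build_ha_bucket_policy("example-flink-ha-bucket", "recovery")
print(json.dumps(policy, indent=2))
```

The printed JSON would still need to be attached to the IAM role that the job manager's service account assumes. That step is outside this sketch.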

> Job Manager fails to recover with S3 storage and HA enabled
> -----------------------------------------------------------
>
>                 Key: FLINK-27855
>                 URL: https://issues.apache.org/jira/browse/FLINK-27855
>             Project: Flink
>          Issue Type: Bug
>          Components: Kubernetes Operator
>            Reporter: Sushant
>            Priority: Minor
>
> Flink version: 1.15 with native Kubernetes integration, using the Kubernetes operator: https://github.com/apache/flink-kubernetes-operator
> Steps to replicate:
> 1. Enable HA and set an S3 recovery path in the Flink configuration property high-availability.storageDir
> 2. Create the Flink application deployment and let it run for some time to generate checkpoints
> 3. Delete the Flink application deployment
> 4. Recreate the deployment; the job manager pod doesn't come up and complains about S3 recovery cleanup (the error is shown below)
> Note that the above steps work fine if AWS EFS is used instead of S3 for HA
> Error Traceback:
> {code:java}
> 2022-05-31 16:39:44,332 WARN  org.apache.flink.runtime.dispatcher.cleanup.DefaultResourceCleaner [] - Cleanup of BlobServer failed for job 00000000000000000000000000000000 due to a CompletionException: java.io.IOException: java.io.IOException: Error while cleaning up the BlobStore for job 00000000000000000000000000000000
> 2022-05-31 16:42:56,955 WARN  org.apache.flink.runtime.dispatcher.StandaloneDispatcher     [] - Ignoring JobGraph submission (00000000000000000000000000000000) because the job already reached a globally-terminal state (i.e. FAILED, CANCELED, FINISHED) in a previous execution.
> 2022-05-31 16:42:57,026 ERROR              [] - Error while processing events :
> org.apache.flink.util.FlinkException: Failed to execute job
> 	at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.executeAsync(StreamExecutionEnvironment.java:2108) ~[flink-dist-1.15.0.jar:1.15.0]
> Caused by: org.apache.flink.runtime.client.DuplicateJobSubmissionException: Job has already been submitted.
> 	at org.apache.flink.runtime.client.DuplicateJobSubmissionException.ofGloballyTerminated(DuplicateJobSubmissionException.java:35) ~[flink-dist-1.15.0.jar:1.15.0]
> 2022-05-31 16:42:57,130 INFO  org.apache.flink.client.deployment.application.ApplicationDispatcherBootstrap [] - Application CANCELED:
> java.util.concurrent.CompletionException: org.apache.flink.client.deployment.application.UnsuccessfulExecutionException: Application Status: CANCELED
> 	at org.apache.flink.client.deployment.application.ApplicationDispatcherBootstrap.lambda$unwrapJobResultException$6(ApplicationDispatcherBootstrap.java:389) ~[flink-dist-1.15.0.jar:1.15.0]
> 	at java.util.concurrent.ForkJoinWorkerThread.run(Unknown Source) [?:?]
> Caused by: org.apache.flink.client.deployment.application.UnsuccessfulExecutionException: Application Status: CANCELED
> 	at org.apache.flink.client.deployment.application.UnsuccessfulExecutionException.fromJobResult(UnsuccessfulExecutionException.java:71) ~[flink-dist-1.15.0.jar:1.15.0]
> 	... 56 more
> Caused by: org.apache.flink.runtime.client.JobCancellationException: Job was cancelled.
> 	at org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:146) ~[flink-dist-1.15.0.jar:1.15.0]
> 	at org.apache.flink.client.deployment.application.UnsuccessfulExecutionException.fromJobResult(UnsuccessfulExecutionException.java:60) ~[flink-dist-1.15.0.jar:1.15.0]
> 	... 56 more
> {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)