You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@streampark.apache.org by GitBox <gi...@apache.org> on 2022/12/16 10:00:33 UTC

[GitHub] [incubator-streampark] xujiangfeng001 opened a new issue, #2172: [Bug] Flink sql unable to recover from sp and cp

xujiangfeng001 opened a new issue, #2172:
URL: https://github.com/apache/incubator-streampark/issues/2172

   ### Search before asking
   
   - [X] I had searched in the [issues](https://github.com/apache/incubator-streampark/issues?q=is%3Aissue+label%3A%22bug%22) and found no similar issues.
   
   
   ### What happened
   
   Flink sql cannot recover tasks from savepoints and checkpoints in the yarn session mode and per job mode.
   
   ### StreamPark Version
   
   2.0.0
   
   ### Java Version
   
   1.8
   
   ### Flink Version
   
   1.15.0
   
   ### Scala Version of Flink
   
   2.12
   
   ### Error Exception
   
   ```log
   org.apache.flink.util.FlinkException: JobMaster for job 5cead21d3c963a174e96ebeed0db44ee failed.
   	at org.apache.flink.runtime.dispatcher.Dispatcher.jobMasterFailed(Dispatcher.java:1149) ~[flink-dist-1.15.0.jar:1.15.0]
   	at org.apache.flink.runtime.dispatcher.Dispatcher.jobManagerRunnerFailed(Dispatcher.java:667) ~[flink-dist-1.15.0.jar:1.15.0]
   	at org.apache.flink.runtime.dispatcher.Dispatcher.handleJobManagerRunnerResult(Dispatcher.java:654) ~[flink-dist-1.15.0.jar:1.15.0]
   	at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$runJob$4(Dispatcher.java:612) ~[flink-dist-1.15.0.jar:1.15.0]
   	at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:822) ~[?:1.8.0_181]
   	at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:797) ~[?:1.8.0_181]
   	at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442) ~[?:1.8.0_181]
   	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.lambda$handleRunAsync$4(AkkaRpcActor.java:443) ~[flink-rpc-akka_de352cbe-0e82-4b92-ab58-dbf650c2a132.jar:1.15.0]
   	at org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:68) ~[flink-rpc-akka_de352cbe-0e82-4b92-ab58-dbf650c2a132.jar:1.15.0]
   	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:443) ~[flink-rpc-akka_de352cbe-0e82-4b92-ab58-dbf650c2a132.jar:1.15.0]
   	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:213) ~[flink-rpc-akka_de352cbe-0e82-4b92-ab58-dbf650c2a132.jar:1.15.0]
   	at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:78) ~[flink-rpc-akka_de352cbe-0e82-4b92-ab58-dbf650c2a132.jar:1.15.0]
   	at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:163) ~[flink-rpc-akka_de352cbe-0e82-4b92-ab58-dbf650c2a132.jar:1.15.0]
   	at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:24) [flink-rpc-akka_de352cbe-0e82-4b92-ab58-dbf650c2a132.jar:1.15.0]
   	at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:20) [flink-rpc-akka_de352cbe-0e82-4b92-ab58-dbf650c2a132.jar:1.15.0]
   	at scala.PartialFunction.applyOrElse(PartialFunction.scala:123) [flink-rpc-akka_de352cbe-0e82-4b92-ab58-dbf650c2a132.jar:1.15.0]
   	at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122) [flink-rpc-akka_de352cbe-0e82-4b92-ab58-dbf650c2a132.jar:1.15.0]
   	at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:20) [flink-rpc-akka_de352cbe-0e82-4b92-ab58-dbf650c2a132.jar:1.15.0]
   	at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [flink-rpc-akka_de352cbe-0e82-4b92-ab58-dbf650c2a132.jar:1.15.0]
   	at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172) [flink-rpc-akka_de352cbe-0e82-4b92-ab58-dbf650c2a132.jar:1.15.0]
   	at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172) [flink-rpc-akka_de352cbe-0e82-4b92-ab58-dbf650c2a132.jar:1.15.0]
   	at akka.actor.Actor.aroundReceive(Actor.scala:537) [flink-rpc-akka_de352cbe-0e82-4b92-ab58-dbf650c2a132.jar:1.15.0]
   	at akka.actor.Actor.aroundReceive$(Actor.scala:535) [flink-rpc-akka_de352cbe-0e82-4b92-ab58-dbf650c2a132.jar:1.15.0]
   	at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:220) [flink-rpc-akka_de352cbe-0e82-4b92-ab58-dbf650c2a132.jar:1.15.0]
   	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:580) [flink-rpc-akka_de352cbe-0e82-4b92-ab58-dbf650c2a132.jar:1.15.0]
   	at akka.actor.ActorCell.invoke(ActorCell.scala:548) [flink-rpc-akka_de352cbe-0e82-4b92-ab58-dbf650c2a132.jar:1.15.0]
   	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:270) [flink-rpc-akka_de352cbe-0e82-4b92-ab58-dbf650c2a132.jar:1.15.0]
   	at akka.dispatch.Mailbox.run(Mailbox.scala:231) [flink-rpc-akka_de352cbe-0e82-4b92-ab58-dbf650c2a132.jar:1.15.0]
   	at akka.dispatch.Mailbox.exec(Mailbox.scala:243) [flink-rpc-akka_de352cbe-0e82-4b92-ab58-dbf650c2a132.jar:1.15.0]
   	at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) [?:1.8.0_181]
   	at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056) [?:1.8.0_181]
   	at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692) [?:1.8.0_181]
   	at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157) [?:1.8.0_181]
   Caused by: org.apache.flink.runtime.client.JobInitializationException: Could not start the JobMaster.
   	at org.apache.flink.runtime.jobmaster.DefaultJobMasterServiceProcess.lambda$new$0(DefaultJobMasterServiceProcess.java:97) ~[flink-dist-1.15.0.jar:1.15.0]
   	at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760) ~[?:1.8.0_181]
   	at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736) ~[?:1.8.0_181]
   	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) ~[?:1.8.0_181]
   	at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1595) ~[?:1.8.0_181]
   	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_181]
   	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_181]
   	at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_181]
   Caused by: java.util.concurrent.CompletionException: java.lang.IllegalStateException: Failed to rollback to checkpoint/savepoint hdfs://hadoop-node01:8020/flink/jobs/savepoint/test/savepoint-8b63ea-6498a5f5861e. Cannot map checkpoint/savepoint state for operator e30ccce83304773b609843a7d9d3e8d7 to the new program, because the operator is not available in the new program. If you want to allow to skip this, you can set the --allowNonRestoredState option on the CLI.
   	at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273) ~[?:1.8.0_181]
   	at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280) ~[?:1.8.0_181]
   	at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1592) ~[?:1.8.0_181]
   	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_181]
   	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_181]
   	at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_181]
   Caused by: java.lang.IllegalStateException: Failed to rollback to checkpoint/savepoint hdfs://hadoop-node01:8020/flink/jobs/savepoint/test/savepoint-8b63ea-6498a5f5861e. Cannot map checkpoint/savepoint state for operator e30ccce83304773b609843a7d9d3e8d7 to the new program, because the operator is not available in the new program. If you want to allow to skip this, you can set the --allowNonRestoredState option on the CLI.
   	at org.apache.flink.runtime.checkpoint.Checkpoints.throwNonRestoredStateException(Checkpoints.java:233) ~[flink-dist-1.15.0.jar:1.15.0]
   	at org.apache.flink.runtime.checkpoint.Checkpoints.loadAndValidateCheckpoint(Checkpoints.java:197) ~[flink-dist-1.15.0.jar:1.15.0]
   	at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.restoreSavepoint(CheckpointCoordinator.java:1771) ~[flink-dist-1.15.0.jar:1.15.0]
   	at org.apache.flink.runtime.scheduler.DefaultExecutionGraphFactory.tryRestoreExecutionGraphFromSavepoint(DefaultExecutionGraphFactory.java:206) ~[flink-dist-1.15.0.jar:1.15.0]
   	at org.apache.flink.runtime.scheduler.DefaultExecutionGraphFactory.createAndRestoreExecutionGraph(DefaultExecutionGraphFactory.java:181) ~[flink-dist-1.15.0.jar:1.15.0]
   	at org.apache.flink.runtime.scheduler.SchedulerBase.createAndRestoreExecutionGraph(SchedulerBase.java:363) ~[flink-dist-1.15.0.jar:1.15.0]
   	at org.apache.flink.runtime.scheduler.SchedulerBase.<init>(SchedulerBase.java:208) ~[flink-dist-1.15.0.jar:1.15.0]
   	at org.apache.flink.runtime.scheduler.DefaultScheduler.<init>(DefaultScheduler.java:191) ~[flink-dist-1.15.0.jar:1.15.0]
   	at org.apache.flink.runtime.scheduler.DefaultScheduler.<init>(DefaultScheduler.java:139) ~[flink-dist-1.15.0.jar:1.15.0]
   	at org.apache.flink.runtime.scheduler.DefaultSchedulerFactory.createInstance(DefaultSchedulerFactory.java:135) ~[flink-dist-1.15.0.jar:1.15.0]
   	at org.apache.flink.runtime.jobmaster.DefaultSlotPoolServiceSchedulerFactory.createScheduler(DefaultSlotPoolServiceSchedulerFactory.java:115) ~[flink-dist-1.15.0.jar:1.15.0]
   	at org.apache.flink.runtime.jobmaster.JobMaster.createScheduler(JobMaster.java:345) ~[flink-dist-1.15.0.jar:1.15.0]
   	at org.apache.flink.runtime.jobmaster.JobMaster.<init>(JobMaster.java:322) ~[flink-dist-1.15.0.jar:1.15.0]
   	at org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.internalCreateJobMasterService(DefaultJobMasterServiceFactory.java:106) ~[flink-dist-1.15.0.jar:1.15.0]
   	at org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.lambda$createJobMasterService$0(DefaultJobMasterServiceFactory.java:94) ~[flink-dist-1.15.0.jar:1.15.0]
   	at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedSupplier$4(FunctionUtils.java:112) ~[flink-dist-1.15.0.jar:1.15.0]
   	at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590) ~[?:1.8.0_181]
   	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_181]
   	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_181]
   	at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_181]
   2022-12-16 17:46:16,305 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] - Shutting YarnJobClusterEntrypoint down with application status UNKNOWN. Diagnostics Cluster entrypoint has been closed externally..
   2022-12-16 17:46:16,307 INFO  org.apache.flink.runtime.blob.BlobServer                     [] - Stopped BLOB server at 0.0.0.0:34542
   2022-12-16 17:46:16,309 INFO  org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint [] - Shutting down rest endpoint.
   2022-12-16 17:46:16,324 INFO  org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint [] - Removing cache directory /tmp/flink-web-b27feb87-0c73-4d52-acae-7876b56e48c5/flink-web-ui
   2022-12-16 17:46:16,325 INFO  org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint [] - http://10.250.250.6:41203 lost leadership
   2022-12-16 17:46:16,325 INFO  org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint [] - Shut down complete.
   2022-12-16 17:46:16,325 INFO  org.apache.flink.runtime.entrypoint.component.DispatcherResourceManagerComponent [] - Closing components.
   2022-12-16 17:46:16,325 INFO  org.apache.flink.runtime.dispatcher.runner.DefaultDispatcherRunner [] - DefaultDispatcherRunner was revoked the leadership with leader id 00000000-0000-0000-0000-000000000000. Stopping the DispatcherLeaderProcess.
   2022-12-16 17:46:16,326 INFO  org.apache.flink.runtime.dispatcher.runner.JobDispatcherLeaderProcess [] - Stopping JobDispatcherLeaderProcess.
   2022-12-16 17:46:16,326 INFO  org.apache.flink.runtime.resourcemanager.ResourceManagerServiceImpl [] - Stopping resource manager service.
   2022-12-16 17:46:16,327 INFO  org.apache.flink.runtime.resourcemanager.ResourceManagerServiceImpl [] - Resource manager service is not running. Ignore revoking leadership.
   2022-12-16 17:46:16,329 INFO  org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] - Closing the slot manager.
   2022-12-16 17:46:16,329 INFO  org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] - Suspending the slot manager.
   ```
   
   
   ### Screenshots
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@streampark.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-streampark] ziqiang-wang commented on issue #2172: [Bug] Flink sql unable to recover from sp and cp

Posted by GitBox <gi...@apache.org>.
ziqiang-wang commented on issue #2172:
URL: https://github.com/apache/incubator-streampark/issues/2172#issuecomment-1367037786

   If you change the contents of your flink sql and then restart from cp or sp, you may not be able to restart because Flink does not support manually specifying Uids for the underlying operators of the flink sql transformation. After modifying flink sql, Some operators have changed Uids, so restarting with the cp or sp generated by the task before modifying flink sql will fail.
   
   By default, Flink generates Uids based on the operator's code. Therefore, after modifying the flink sql content, the code of the underlying operator changes, and so does the corresponding uid.
   
   After the flink sql is modified and restarted using cp or sp, you can check the "ignore restored" button on the startup interface to ignore the restored operator. However, this behavior may cause the original state of the operator to be lost and result data errors.
   
   ![image](https://user-images.githubusercontent.com/107013241/209896661-b532f869-eb52-4988-ab31-5c0520131259.png)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@streampark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-streampark] wolfboys commented on issue #2172: [Bug] Flink sql unable to recover from sp and cp

Posted by "wolfboys (via GitHub)" <gi...@apache.org>.
wolfboys commented on issue #2172:
URL: https://github.com/apache/incubator-streampark/issues/2172#issuecomment-1434666106

   fixed in latest code


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@streampark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [incubator-streampark] wolfboys closed issue #2172: [Bug] Flink sql unable to recover from sp and cp

Posted by "wolfboys (via GitHub)" <gi...@apache.org>.
wolfboys closed issue #2172: [Bug] Flink sql unable to recover from sp and cp
URL: https://github.com/apache/incubator-streampark/issues/2172


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@streampark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org