Posted to issues@flink.apache.org by "Clive Wong (Jira)" <ji...@apache.org> on 2022/07/19 18:52:00 UTC

[jira] [Commented] (FLINK-28613) PyFlink 1.15 unable to start in Application Mode in k8s

    [ https://issues.apache.org/jira/browse/FLINK-28613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17568685#comment-17568685 ] 

Clive Wong commented on FLINK-28613:
------------------------------------

Turns out it's because of this change:
[https://github.com/apache/flink/blob/adbf09fb941c8f793df6d322ed95df87bc4254f3/flink-core/src/main/java/org/apache/flink/util/NetUtils.java#L166]

which attempts to write to a path that Flink doesn't have access to. We fixed it by running chmod on the path it tries to write to (the Flink bin path) in the container.

I'd recommend exposing an option, for example an environment variable, so that the FileLock can be created in a different directory.
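
To illustrate the suggestion, here is a rough sketch of how a configurable lock directory could behave. It is not Flink's actual code: the class PortLockSketch, the environment variable PORT_LOCK_DIR, the lock file naming and both helper methods are hypothetical, and it uses the JDK's java.nio FileLock rather than Flink's own FileLock class. It only shows that a lock attempt against an unusable directory fails on every try, which could then surface as an error about ports rather than about permissions.

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

class PortLockSketch {

    // Hypothetical variable name, for illustration only; Flink does not define it.
    public static final String LOCK_DIR_ENV = "PORT_LOCK_DIR";

    /** Uses the override when it is set and non-blank, otherwise java.io.tmpdir. */
    public static Path resolveLockDir(String envValue) {
        if (envValue != null && !envValue.trim().isEmpty()) {
            return Paths.get(envValue.trim());
        }
        return Paths.get(System.getProperty("java.io.tmpdir"));
    }

    /**
     * Tries to take an exclusive lock on a per-port file inside lockDir.
     * Returns the open channel (closing it releases the lock), or null when
     * the lock is already held or the directory cannot be used.
     */
    public static FileChannel tryLockPort(Path lockDir, int port) {
        Path lockFile = lockDir.resolve("port-" + port + ".lock");
        FileChannel channel = null;
        try {
            channel = FileChannel.open(
                    lockFile, StandardOpenOption.CREATE, StandardOpenOption.WRITE);
            FileLock lock = channel.tryLock();
            if (lock != null) {
                return channel;
            }
        } catch (IOException | OverlappingFileLockException e) {
            // A missing or read-only directory ends up here. If a caller swallowed
            // this silently, only a generic "no free port" error would remain.
            System.err.println("Cannot lock " + lockFile + ": " + e);
        }
        // Caveat: on some platforms closing this channel also drops locks that
        // other channels in the same JVM hold on the same file.
        closeQuietly(channel);
        return null;
    }

    private static void closeQuietly(FileChannel channel) {
        if (channel != null) {
            try {
                channel.close();
            } catch (IOException ignored) {
                // Nothing useful to do in a sketch.
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path lockDir = resolveLockDir(System.getenv(LOCK_DIR_ENV));
        int port;
        // Ask the OS for any free port, then release the socket again.
        try (ServerSocket socket = new ServerSocket(0)) {
            port = socket.getLocalPort();
        }
        FileChannel held = tryLockPort(lockDir, port);
        System.out.println(held != null
                ? "Reserved port " + port + " using " + lockDir
                : "Could not reserve port " + port + " using " + lockDir);
        closeQuietly(held);
    }
}
```

With something along these lines, a container image would only need to point the variable at a writable directory instead of changing permissions on the Flink bin path.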

> PyFlink 1.15 unable to start in Application Mode in k8s
> -------------------------------------------------------
>
>                 Key: FLINK-28613
>                 URL: https://issues.apache.org/jira/browse/FLINK-28613
>             Project: Flink
>          Issue Type: Bug
>          Components: Client / Job Submission
>    Affects Versions: 1.15.1
>            Reporter: Clive Wong
>            Priority: Major
>
> I recently bumped my PyFlink job from 1.14 to 1.15, and the job is failing with build 1.15 in k8s.
> The error is due to NetUtils being unable to getAvailablePort. I suspect this is related to the version bump of py4j from 0.10.8.1 to 0.10.9.3 required by apache-flink 1.15 in Python.
> The error stack is:
> {code:java}
> 2022-07-19 11:17:06,225 INFO  org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess [] - Start SessionDispatcherLeaderProcess.
> 2022-07-19 11:17:06,226 INFO  org.apache.flink.runtime.resourcemanager.ResourceManagerServiceImpl [] - Starting resource manager service.
> 2022-07-19 11:17:06,227 INFO  org.apache.flink.runtime.resourcemanager.ResourceManagerServiceImpl [] - Resource manager service is granted leadership with session id 00000000-0000-0000-0000-000000000000.
> 2022-07-19 11:17:06,229 INFO  org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess [] - Recover all persisted job graphs that are not finished, yet.
> 2022-07-19 11:17:06,229 INFO  org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess [] - Successfully recovered 0 persisted job graphs.
> 2022-07-19 11:17:06,306 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService             [] - Starting RPC endpoint for org.apache.flink.runtime.dispatcher.StandaloneDispatcher at akka://flink/user/rpc/dispatcher_0 .
> 2022-07-19 11:17:06,309 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService             [] - Starting RPC endpoint for org.apache.flink.runtime.resourcemanager.StandaloneResourceManager at akka://flink/user/rpc/resourcemanager_1 .
> 2022-07-19 11:17:06,317 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Starting the resource manager.
> 2022-07-19 11:17:06,401 INFO  org.apache.flink.client.ClientUtils                          [] - Starting program (detached: true)
> 2022-07-19 11:17:06,500 WARN  org.apache.flink.client.deployment.application.ApplicationDispatcherBootstrap [] - Application failed unexpectedly: 
> java.util.concurrent.CompletionException: org.apache.flink.client.deployment.application.ApplicationExecutionException: Could not execute application.
>     at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:331) ~[?:?]
>     at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:346) ~[?:?]
>     at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1063) ~[?:?]
>     at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) ~[?:?]
>     at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088) ~[?:?]
>     at org.apache.flink.client.deployment.application.ApplicationDispatcherBootstrap.runApplicationEntryPoint(ApplicationDispatcherBootstrap.java:323) ~[flink-dist-1.15.0-stream1.jar:1.15.0-stream1]
>     at org.apache.flink.client.deployment.application.ApplicationDispatcherBootstrap.lambda$runApplicationAsync$2(ApplicationDispatcherBootstrap.java:244) ~[flink-dist-1.15.0-stream1.jar:1.15.0-stream1]
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
>     at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
>     at org.apache.flink.runtime.concurrent.akka.ActorSystemScheduledExecutorAdapter$ScheduledFutureTask.run(ActorSystemScheduledExecutorAdapter.java:171) ~[flink-rpc-akka_73d9230b-9d22-4143-8bbc-2ab5d539166f.jar:1.15.0-stream1]
>     at org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:68) ~[flink-rpc-akka_73d9230b-9d22-4143-8bbc-2ab5d539166f.jar:1.15.0-stream1]
>     at org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.lambda$withContextClassLoader$0(ClassLoadingUtils.java:41) ~[flink-rpc-akka_73d9230b-9d22-4143-8bbc-2ab5d539166f.jar:1.15.0-stream1]
>     at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:49) [flink-rpc-akka_73d9230b-9d22-4143-8bbc-2ab5d539166f.jar:1.15.0-stream1]
>     at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:48) [flink-rpc-akka_73d9230b-9d22-4143-8bbc-2ab5d539166f.jar:1.15.0-stream1]
>     at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290) [?:?]
>     at java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020) [?:?]
>     at java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656) [?:?]
>     at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594) [?:?]
>     at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183) [?:?]
> Caused by: org.apache.flink.client.deployment.application.ApplicationExecutionException: Could not execute application.
>     ... 14 more
> Caused by: org.apache.flink.client.program.ProgramInvocationException: The main method caused an error: java.lang.RuntimeException: Could not find a free permitted port on the machine.
>     at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:372) ~[flink-dist-1.15.0-stream1.jar:1.15.0-stream1]
>     at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:222) ~[flink-dist-1.15.0-stream1.jar:1.15.0-stream1]
>     at org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:114) ~[flink-dist-1.15.0-stream1.jar:1.15.0-stream1]
>     at org.apache.flink.client.deployment.application.ApplicationDispatcherBootstrap.runApplicationEntryPoint(ApplicationDispatcherBootstrap.java:291) ~[flink-dist-1.15.0-stream1.jar:1.15.0-stream1]
>     ... 13 more
> Caused by: java.util.concurrent.ExecutionException: java.lang.RuntimeException: Could not find a free permitted port on the machine.
>     at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:395) ~[?:?]
>     at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1999) ~[?:?]
>     at org.apache.flink.client.python.PythonEnvUtils.startGatewayServer(PythonEnvUtils.java:387) ~[?:?]
>     at org.apache.flink.client.python.PythonDriver.main(PythonDriver.java:75) ~[?:?]
>     at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
>     at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:?]
>     at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
>     at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]
>     at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:355) ~[flink-dist-1.15.0-stream1.jar:1.15.0-stream1]
>     at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:222) ~[flink-dist-1.15.0-stream1.jar:1.15.0-stream1]
>     at org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:114) ~[flink-dist-1.15.0-stream1.jar:1.15.0-stream1]
>     at org.apache.flink.client.deployment.application.ApplicationDispatcherBootstrap.runApplicationEntryPoint(ApplicationDispatcherBootstrap.java:291) ~[flink-dist-1.15.0-stream1.jar:1.15.0-stream1]
>     ... 13 more
> Caused by: java.lang.RuntimeException: Could not find a free permitted port on the machine.
>     at org.apache.flink.util.NetUtils.getAvailablePort(NetUtils.java:177) ~[flink-dist-1.15.0-stream1.jar:1.15.0-stream1]
>     at org.apache.flink.client.python.PythonEnvUtils.lambda$startGatewayServer$3(PythonEnvUtils.java:365) ~[?:?]
>     at java.lang.Thread.run(Thread.java:834) ~[?:?]
> 2022-07-19 11:17:06,505 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] - Fatal error occurred in the cluster entrypoint.
> java.util.concurrent.CompletionException: org.apache.flink.client.deployment.application.ApplicationExecutionException: Could not execute application.
>     at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:331) ~[?:?]
>     at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:346) ~[?:?]
>     at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1063) ~[?:?]
>     at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) ~[?:?]
>     at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088) ~[?:?]
>     at org.apache.flink.client.deployment.application.ApplicationDispatcherBootstrap.runApplicationEntryPoint(ApplicationDispatcherBootstrap.java:323) ~[flink-dist-1.15.0-stream1.jar:1.15.0-stream1]
>     at org.apache.flink.client.deployment.application.ApplicationDispatcherBootstrap.lambda$runApplicationAsync$2(ApplicationDispatcherBootstrap.java:244) ~[flink-dist-1.15.0-stream1.jar:1.15.0-stream1]
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
>     at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
>     at org.apache.flink.runtime.concurrent.akka.ActorSystemScheduledExecutorAdapter$ScheduledFutureTask.run(ActorSystemScheduledExecutorAdapter.java:171) ~[flink-rpc-akka_73d9230b-9d22-4143-8bbc-2ab5d539166f.jar:1.15.0-stream1]
>     at org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:68) ~[flink-rpc-akka_73d9230b-9d22-4143-8bbc-2ab5d539166f.jar:1.15.0-stream1]
>     at org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.lambda$withContextClassLoader$0(ClassLoadingUtils.java:41) ~[flink-rpc-akka_73d9230b-9d22-4143-8bbc-2ab5d539166f.jar:1.15.0-stream1]
>     at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:49) [flink-rpc-akka_73d9230b-9d22-4143-8bbc-2ab5d539166f.jar:1.15.0-stream1]
>     at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:48) [flink-rpc-akka_73d9230b-9d22-4143-8bbc-2ab5d539166f.jar:1.15.0-stream1]
>     at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290) [?:?]
>     at java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020) [?:?]
>     at java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656) [?:?]
>     at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594) [?:?]
>     at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183) [?:?]
> Caused by: org.apache.flink.client.deployment.application.ApplicationExecutionException: Could not execute application.
>     ... 14 more
> Caused by: org.apache.flink.client.program.ProgramInvocationException: The main method caused an error: java.lang.RuntimeException: Could not find a free permitted port on the machine.
>     at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:372) ~[flink-dist-1.15.0-stream1.jar:1.15.0-stream1]
>     at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:222) ~[flink-dist-1.15.0-stream1.jar:1.15.0-stream1]
>     at org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:114) ~[flink-dist-1.15.0-stream1.jar:1.15.0-stream1]
>     at org.apache.flink.client.deployment.application.ApplicationDispatcherBootstrap.runApplicationEntryPoint(ApplicationDispatcherBootstrap.java:291) ~[flink-dist-1.15.0-stream1.jar:1.15.0-stream1]
>     ... 13 more
> Caused by: java.util.concurrent.ExecutionException: java.lang.RuntimeException: Could not find a free permitted port on the machine.
>     at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:395) ~[?:?]
>     at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1999) ~[?:?]
>     at org.apache.flink.client.python.PythonEnvUtils.startGatewayServer(PythonEnvUtils.java:387) ~[?:?]
>     at org.apache.flink.client.python.PythonDriver.main(PythonDriver.java:75) ~[?:?]
>     at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
>     at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:?]
>     at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
>     at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]
>     at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:355) ~[flink-dist-1.15.0-stream1.jar:1.15.0-stream1]
>     at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:222) ~[flink-dist-1.15.0-stream1.jar:1.15.0-stream1]
>     at org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:114) ~[flink-dist-1.15.0-stream1.jar:1.15.0-stream1]
>     at org.apache.flink.client.deployment.application.ApplicationDispatcherBootstrap.runApplicationEntryPoint(ApplicationDispatcherBootstrap.java:291) ~[flink-dist-1.15.0-stream1.jar:1.15.0-stream1]
>     ... 13 more
> Caused by: java.lang.RuntimeException: Could not find a free permitted port on the machine.
>     at org.apache.flink.util.NetUtils.getAvailablePort(NetUtils.java:177) ~[flink-dist-1.15.0-stream1.jar:1.15.0-stream1]
>     at org.apache.flink.client.python.PythonEnvUtils.lambda$startGatewayServer$3(PythonEnvUtils.java:365) ~[?:?]
>     at java.lang.Thread.run(Thread.java:834) ~[?:?]
> 2022-07-19 11:17:06,508 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] - Shutting StandaloneApplicationClusterEntryPoint down with application status UNKNOWN. Diagnostics Cluster entrypoint has been closed externally..
> 2022-07-19 11:17:06,509 INFO  org.apache.flink.runtime.blob.BlobServer                     [] - Stopped BLOB server at 0.0.0.0:6124 {code}
> It's the same with Python 3.7 and Python 3.8.


