You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@flink.apache.org by Marke Builder <ma...@gmail.com> on 2018/10/19 04:40:48 UTC

Flink fails continuously: Couldn't retrieve the JobExecutionResult from the JobManager.

Hi,

my flink job fails continously(sometimes behind minutes, sometimes behind
hours) with the
follwing exception.

Flink run configuration:
run with yarn: -yn 2 -ys 5 -yjm 8192 -ymt 12288
streaming-job: kafka source and redis sink


The program finished with the following exception:

org.apache.flink.client.program.ProgramInvocationException: The program
execution failed: Couldn't retrieve the JobExecutionResult from the
JobManager.
        at
org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:492)
        at
org.apache.flink.yarn.YarnClusterClient.submitJob(YarnClusterClient.java:215)
        at
org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:456)
        at
org.apache.flink.streaming.api.environment.StreamContextEnvironment.execute(StreamContextEnvironment.java:66)
        at
com.voith.cloud.app.timeseries.fasttrack.StreamToCache.main(StreamToCache.java:54)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at
org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:525)
        at
org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:417)
        at
org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:396)
        at
org.apache.flink.client.CliFrontend.executeProgram(CliFrontend.java:802)
        at org.apache.flink.client.CliFrontend.run(CliFrontend.java:282)
        at
org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1054)
        at org.apache.flink.client.CliFrontend$1.call(CliFrontend.java:1101)
        at org.apache.flink.client.CliFrontend$1.call(CliFrontend.java:1098)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1692)
        at
org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
        at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1098)
Caused by: org.apache.flink.runtime.client.JobExecutionException: Couldn't
retrieve the JobExecutionResult from the JobManager.
        at
org.apache.flink.runtime.client.JobClient.awaitJobResult(JobClient.java:300)
        at
org.apache.flink.runtime.client.JobClient.submitJobAndWait(JobClient.java:387)
        at
org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:481)
        ... 21 more
Caused by:
org.apache.flink.runtime.client.JobClientActorConnectionTimeoutException:
Lost connection to the JobManager.
        at
org.apache.flink.runtime.client.JobClientActor.handleMessage(JobClientActor.java:219)
        at
org.apache.flink.runtime.akka.FlinkUntypedActor.handleLeaderSessionID(FlinkUntypedActor.java:104)
        at
org.apache.flink.runtime.akka.FlinkUntypedActor.onReceive(FlinkUntypedActor.java:71)
        at
akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
        at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
        at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
        at akka.actor.ActorCell.invoke(ActorCell.scala:495)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
        at akka.dispatch.Mailbox.run(Mailbox.scala:224)
        at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
        at
scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)




and:

Re: Flink fails continuously: Couldn't retrieve the JobExecutionResult from the JobManager.

Posted by vino yang <ya...@gmail.com>.
Hi marke,

My advice is not to keep your client connected to JM.
If you expect continuous output, you can sink it out.
In addition, it does not rule out that your JM load is too high, such as
the emergence of full GC and so on.
So, make sure your JM has enough resources to use and monitor it.

Thanks, vino.

Marke Builder <ma...@gmail.com> 于2018年11月1日周四 上午7:25写道:

> Hi Vino,
> yes I'm expecting a quickly result of the stream (parse, normalize and
> store).
> Thanks for the tip with the detached mode, the job is now a bit more
> stable, but sometimes it failed with the same issue.
> Any other proposals ?
>
> Thanks, marke.
>
> Am Mo., 22. Okt. 2018 um 04:04 Uhr schrieb vino yang <
> yanghua1127@gmail.com>:
>
>> Hi Marke,
>>
>> Are you expecting your job to quickly return the results of the stream
>> calculation?
>> If it is running for a long time, you can run it in detached mode when
>> you submit the job[1].
>> It will not cause your client to be blocked and stay connected to the
>> Flink JobManager.
>>
>> Thanks, vino.
>>
>> [1]:
>> https://ci.apache.org/projects/flink/flink-docs-release-1.6/ops/deployment/yarn_setup.html#run-a-single-flink-job-on-yarn
>>
>> Marke Builder <ma...@gmail.com> 于2018年10月19日周五 下午12:41写道:
>>
>>> Hi,
>>>
>>> my flink job fails continously(sometimes behind minutes, sometimes
>>> behind hours) with the
>>> follwing exception.
>>>
>>> Flink run configuration:
>>> run with yarn: -yn 2 -ys 5 -yjm 8192 -ymt 12288
>>> streaming-job: kafka source and redis sink
>>>
>>>
>>> The program finished with the following exception:
>>>
>>> org.apache.flink.client.program.ProgramInvocationException: The program
>>> execution failed: Couldn't retrieve the JobExecutionResult from the
>>> JobManager.
>>>         at
>>> org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:492)
>>>         at
>>> org.apache.flink.yarn.YarnClusterClient.submitJob(YarnClusterClient.java:215)
>>>         at
>>> org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:456)
>>>         at
>>> org.apache.flink.streaming.api.environment.StreamContextEnvironment.execute(StreamContextEnvironment.java:66)
>>>         at
>>> com.voith.cloud.app.timeseries.fasttrack.StreamToCache.main(StreamToCache.java:54)
>>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>         at
>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>>         at
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>         at java.lang.reflect.Method.invoke(Method.java:498)
>>>         at
>>> org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:525)
>>>         at
>>> org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:417)
>>>         at
>>> org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:396)
>>>         at
>>> org.apache.flink.client.CliFrontend.executeProgram(CliFrontend.java:802)
>>>         at org.apache.flink.client.CliFrontend.run(CliFrontend.java:282)
>>>         at
>>> org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1054)
>>>         at
>>> org.apache.flink.client.CliFrontend$1.call(CliFrontend.java:1101)
>>>         at
>>> org.apache.flink.client.CliFrontend$1.call(CliFrontend.java:1098)
>>>         at java.security.AccessController.doPrivileged(Native Method)
>>>         at javax.security.auth.Subject.doAs(Subject.java:422)
>>>         at
>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1692)
>>>         at
>>> org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>>>         at
>>> org.apache.flink.client.CliFrontend.main(CliFrontend.java:1098)
>>> Caused by: org.apache.flink.runtime.client.JobExecutionException:
>>> Couldn't retrieve the JobExecutionResult from the JobManager.
>>>         at
>>> org.apache.flink.runtime.client.JobClient.awaitJobResult(JobClient.java:300)
>>>         at
>>> org.apache.flink.runtime.client.JobClient.submitJobAndWait(JobClient.java:387)
>>>         at
>>> org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:481)
>>>         ... 21 more
>>> Caused by:
>>> org.apache.flink.runtime.client.JobClientActorConnectionTimeoutException:
>>> Lost connection to the JobManager.
>>>         at
>>> org.apache.flink.runtime.client.JobClientActor.handleMessage(JobClientActor.java:219)
>>>         at
>>> org.apache.flink.runtime.akka.FlinkUntypedActor.handleLeaderSessionID(FlinkUntypedActor.java:104)
>>>         at
>>> org.apache.flink.runtime.akka.FlinkUntypedActor.onReceive(FlinkUntypedActor.java:71)
>>>         at
>>> akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
>>>         at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
>>>         at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
>>>         at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
>>>         at akka.actor.ActorCell.invoke(ActorCell.scala:495)
>>>         at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
>>>         at akka.dispatch.Mailbox.run(Mailbox.scala:224)
>>>         at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
>>>         at
>>> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>>         at
>>> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>>         at
>>> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>>         at
>>> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>>
>>>
>>>
>>>
>>> and:
>>>
>>>
>>>
>>>

Re: Flink fails continuously: Couldn't retrieve the JobExecutionResult from the JobManager.

Posted by vino yang <ya...@gmail.com>.
Hi Marke,

Are you expecting your job to quickly return the results of the stream
calculation?
If it is running for a long time, you can run it in detached mode when you
submit the job[1].
It will not cause your client to be blocked and stay connected to the Flink
JobManager.

Thanks, vino.

[1]:
https://ci.apache.org/projects/flink/flink-docs-release-1.6/ops/deployment/yarn_setup.html#run-a-single-flink-job-on-yarn

Marke Builder <ma...@gmail.com> 于2018年10月19日周五 下午12:41写道:

> Hi,
>
> my flink job fails continously(sometimes behind minutes, sometimes behind
> hours) with the
> follwing exception.
>
> Flink run configuration:
> run with yarn: -yn 2 -ys 5 -yjm 8192 -ymt 12288
> streaming-job: kafka source and redis sink
>
>
> The program finished with the following exception:
>
> org.apache.flink.client.program.ProgramInvocationException: The program
> execution failed: Couldn't retrieve the JobExecutionResult from the
> JobManager.
>         at
> org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:492)
>         at
> org.apache.flink.yarn.YarnClusterClient.submitJob(YarnClusterClient.java:215)
>         at
> org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:456)
>         at
> org.apache.flink.streaming.api.environment.StreamContextEnvironment.execute(StreamContextEnvironment.java:66)
>         at
> com.voith.cloud.app.timeseries.fasttrack.StreamToCache.main(StreamToCache.java:54)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>         at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:498)
>         at
> org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:525)
>         at
> org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:417)
>         at
> org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:396)
>         at
> org.apache.flink.client.CliFrontend.executeProgram(CliFrontend.java:802)
>         at org.apache.flink.client.CliFrontend.run(CliFrontend.java:282)
>         at
> org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1054)
>         at
> org.apache.flink.client.CliFrontend$1.call(CliFrontend.java:1101)
>         at
> org.apache.flink.client.CliFrontend$1.call(CliFrontend.java:1098)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1692)
>         at
> org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>         at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1098)
> Caused by: org.apache.flink.runtime.client.JobExecutionException: Couldn't
> retrieve the JobExecutionResult from the JobManager.
>         at
> org.apache.flink.runtime.client.JobClient.awaitJobResult(JobClient.java:300)
>         at
> org.apache.flink.runtime.client.JobClient.submitJobAndWait(JobClient.java:387)
>         at
> org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:481)
>         ... 21 more
> Caused by:
> org.apache.flink.runtime.client.JobClientActorConnectionTimeoutException:
> Lost connection to the JobManager.
>         at
> org.apache.flink.runtime.client.JobClientActor.handleMessage(JobClientActor.java:219)
>         at
> org.apache.flink.runtime.akka.FlinkUntypedActor.handleLeaderSessionID(FlinkUntypedActor.java:104)
>         at
> org.apache.flink.runtime.akka.FlinkUntypedActor.onReceive(FlinkUntypedActor.java:71)
>         at
> akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
>         at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
>         at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
>         at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
>         at akka.actor.ActorCell.invoke(ActorCell.scala:495)
>         at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
>         at akka.dispatch.Mailbox.run(Mailbox.scala:224)
>         at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
>         at
> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>         at
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>         at
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>         at
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>
>
>
>
> and:
>
>
>
>