Posted to user@flink.apache.org by Son Mai <ho...@gmail.com> on 2019/02/27 02:27:17 UTC

ProgramInvocationException when I submit job by 'flink run' after running Flink stand-alone more than 1 month?

Hi,
I'm having a question regarding Flink.
I'm running Flink in stand-alone mode on 1 host (JobManager, TaskManager on
the same host). At first, I'm able to submit and cancel jobs normally, the
jobs showed up in the web UI and ran.
However, after ~1 month, when I canceled the old job and submitted a new
one, I faced *org.apache.flink.client.program.ProgramInvocationException:
Could not retrieve the execution result.*
At this moment, I was able to run *flink list* to list current jobs and *flink
cancel* to cancel the job, but *flink run* failed. An exception was thrown and
the job was not shown in the web UI.
When I tried to stop the current stand-alone cluster using *stop-cluster*,
it said 'no cluster was found'. I then had to find the pids of the Flink
processes and kill them manually. After that, if I ran *start-cluster* to
create a new stand-alone cluster, I was able to submit jobs normally.
The shortened stack-trace: (full stack-trace at google docs link
<https://docs.google.com/document/d/1v07A4Jp45worykjgMyQTVR-BAoPXwL-O9qGxxhNjXyE/edit?usp=sharing>
)
org.apache.flink.client.program.ProgramInvocationException: Could not
retrieve the execution result. (JobID: 7ef1cbddb744cd5769297f4059f7c531)
at org.apache.flink.client.program.rest.RestClusterClient.submitJob
(RestClusterClient.java:261)
Caused by: org.apache.flink.runtime.client.JobSubmissionException: Failed
to submit JobGraph.
Caused by: org.apache.flink.runtime.concurrent.FutureUtils$RetryException:
Could not complete the operation. Number of retries has been exhausted.
Caused by: java.util.concurrent.CompletionException:
org.apache.flink.runtime.rest.ConnectionClosedException: Channel became
inactive.
Caused by: org.apache.flink.runtime.rest.ConnectionClosedException: Channel
became inactive.
... 37 more
The error is consistent: it always happens after I let Flink run for a
while (usually more than 1 month). Why am I not able to submit jobs to
Flink after a while? What happened here?
Regards,

Son

Re: ProgramInvocationException when I submit job by 'flink run' after running Flink stand-alone more than 1 month?

Posted by Benchao Li <li...@gmail.com>.
Hi Son,

According to your description, maybe it's caused by the '/tmp' filesystem
retention policy, which removes temporary files regularly.
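If the OS's periodic /tmp cleanup (systemd-tmpfiles or tmpwatch) is the culprit, one option is to move Flink's temporary and blob directories out of /tmp. A minimal sketch, assuming a path like /data/flink-tmp exists and is writable by the Flink user (the path is an assumption, not from the thread):

```shell
# Inspect the OS tmp-cleanup policy, if present (systemd-based distros).
cat /usr/lib/tmpfiles.d/tmp.conf 2>/dev/null || true

# Append the relocated directories to a flink-conf.yaml fragment.
# /data/flink-tmp is an assumed example path; merge these lines into
# your actual conf/flink-conf.yaml and restart the cluster.
cat >> flink-conf.yaml <<'EOF'
io.tmp.dirs: /data/flink-tmp
blob.storage.directory: /data/flink-tmp/blob
EOF

grep 'io.tmp.dirs' flink-conf.yaml
```

Both `io.tmp.dirs` and `blob.storage.directory` are standard Flink configuration keys; pointing them at a directory that the OS does not auto-clean avoids losing blob files under a long-running cluster.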



-- 

Benchao Li
School of Electronics Engineering and Computer Science, Peking University
Tel:+86-15650713730
Email: libenchao@gmail.com; libenchao@pku.edu.cn

Re: ProgramInvocationException when I submit job by 'flink run' after running Flink stand-alone more than 1 month?

Posted by Zhenghua Gao <do...@gmail.com>.
It seems like something is wrong with the RestServer and the RestClient
couldn't connect to it.
You can check the standalonesession log to investigate the cause.

By the way: the cause of "no cluster was found" is that your pid information
was cleaned up for some reason.
The pid information is stored in your TMP directory; it should look like
/tmp/flink-user-taskexecutor.pid or /tmp/flink-user-standalonesession.pid
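Since the stand-alone scripts track daemons through those pid files, a cleaned /tmp explains why stop-cluster reports "no cluster was found". A sketch of checking for the files and relocating them, assuming /var/run/flink as the new directory (an assumed path; create it and grant the Flink user write access):

```shell
# List the stand-alone cluster's pid files; by default they live in /tmp,
# so OS tmp cleanup can silently delete them while the daemons keep running.
ls /tmp/flink-*-standalonesession.pid /tmp/flink-*-taskexecutor.pid 2>/dev/null || true

# Move the pid files to a directory that is not auto-cleaned.
# /var/run/flink is an assumed example path; merge this line into
# your actual conf/flink-conf.yaml before restarting the cluster.
cat >> flink-conf.yaml <<'EOF'
env.pid.dir: /var/run/flink
EOF
```

With `env.pid.dir` set, stop-cluster.sh can still find the daemons after /tmp has been pruned, so you no longer need to kill the processes by hand.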
