Posted to user@flink.apache.org by Robert Metzger <rm...@apache.org> on 2020/07/03 08:25:39 UTC

Re: Task recovery?

Hi John,

Did you also restart the JobManager, or just the TaskManagers?
In either case, the jobs should recover.

Do you still have the JobManager logs around, so that we can analyze them?

On Thu, Jun 25, 2020 at 6:40 PM John Smith <ja...@gmail.com> wrote:

> Hi, I'm running 1.10.0
>
> 3 Zookeepers
> 3 Job Nodes
> 3 Task Nodes
>
> Yesterday my task nodes failed with a metaspace error. I increased the
> metaspace a bit to be sure and restarted the 3 task nodes.
>
> But none of the jobs recovered, and no jobs are running. Should they not
> recover from the job and ZooKeeper state? It's as if no jobs ever ran.
>
> P.S.: I'm not running the history server.
>

Re: Task recovery?

Posted by John Smith <ja...@gmail.com>.

Hi Robert, is my assumption correct?

On Fri., Jul. 3, 2020, 12:42 p.m. John Smith, <ja...@gmail.com>
wrote:

> Here is one log....
>
> https://www.dropbox.com/s/s8uom5uto708izf/flink-job-001.log?dl=0
>
> If I understand correctly, on June 23rd it suspended the jobs? So at that
> point they would no longer show in the UI or be restarted?
>
> On Fri, 3 Jul 2020 at 12:05, John Smith <ja...@gmail.com> wrote:
>
>> I didn't restart the job manager. Let me see if I can dig up the logs...
>> Also I just realised it's possible that the retry attempts to recover may
>> have been exhausted.
>>
>

Re: Task recovery?

Posted by John Smith <ja...@gmail.com>.

Yeah, it's fine. The thing is, I don't have the history server and the UI
wasn't showing any jobs, so I didn't have any job ID with which to go and
look for the checkpoints.

I restarted the jobs, but instead of restoring from a checkpoint I replayed
from a few days earlier just to be sure. All my jobs also take a Kafka start
time.

On Fri, 10 Jul 2020 at 09:31, Aljoscha Krettek <al...@apache.org> wrote:

> On 03.07.20 18:42, John Smith wrote:
> > If I understand correctly, on June 23rd it suspended the jobs? So at that
> > point they would no longer show in the UI or be restarted?
>
> Yes, that is correct, though in the logs it seems the jobs failed
> terminally on June 22nd:
>
> 2020-06-22 23:30:22,130 INFO
> org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Job
> ba50a77608992097a98b250b87a08da0 reached globally terminal state FAILED.
>
> What you can do in that case is restore the jobs from a savepoint or
> from a retained checkpoint. See
>
> https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/checkpoints.html#retained-checkpoints
>
> Note that you need to manually enable checkpoint retention.
>
> I hope that helps.
>
> Best,
> Aljoscha
>

Re: Task recovery?

Posted by Aljoscha Krettek <al...@apache.org>.

On 03.07.20 18:42, John Smith wrote:
> If I understand correctly, on June 23rd it suspended the jobs? So at that
> point they would no longer show in the UI or be restarted?

Yes, that is correct, though in the logs it seems the jobs failed 
terminally on June 22nd:

2020-06-22 23:30:22,130 INFO 
org.apache.flink.runtime.dispatcher.StandaloneDispatcher      - Job 
ba50a77608992097a98b250b87a08da0 reached globally terminal state FAILED.

What you can do in that case is restore the jobs from a savepoint or
from a retained checkpoint. See
https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/checkpoints.html#retained-checkpoints
Note that you need to manually enable checkpoint retention.

I hope that helps.

Best,
Aljoscha
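
Note on finding the job ID: elsewhere in this thread the job IDs were not available because the UI no longer listed the jobs. One possible way to recover them is to scan the JobManager log for the Dispatcher message quoted above. The following is a minimal sketch using only the Java standard library, not a Flink tool. It assumes that each log entry sits on a single line in the actual log file (the message is wrapped here only by the mail client) and that job IDs are 32 hexadecimal characters, as in the quoted line. The class name `TerminalJobFinder` and its method are hypothetical names chosen for illustration.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TerminalJobFinder {
    // Matches the message quoted in this thread:
    //   "Job <32 hex chars> reached globally terminal state <STATE>."
    private static final Pattern TERMINAL = Pattern.compile(
        "Job\\s+([0-9a-f]{32})\\s+reached globally terminal state\\s+(\\w+)");

    /** Returns one "jobId state" string per log line that reports a terminal state. */
    public static List<String> findTerminalJobs(List<String> logLines) {
        List<String> result = new ArrayList<>();
        for (String line : logLines) {
            Matcher m = TERMINAL.matcher(line);
            if (m.find()) {
                result.add(m.group(1) + " " + m.group(2));
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: java TerminalJobFinder <jobmanager-log-file>");
            return;
        }
        // Read the whole log and print every job that reached a terminal state.
        for (String hit : findTerminalJobs(Files.readAllLines(Paths.get(args[0])))) {
            System.out.println(hit);
        }
    }
}
```

Run against a JobManager log, this prints one line per job with its ID and final state. If checkpoint retention was enabled, the job ID should also appear as a directory name under the configured checkpoint directory, which is where a retained checkpoint for that job would be found.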

Re: Task recovery?

Posted by John Smith <ja...@gmail.com>.

Here is one log....

https://www.dropbox.com/s/s8uom5uto708izf/flink-job-001.log?dl=0

If I understand correctly, on June 23rd it suspended the jobs? So at that
point they would no longer show in the UI or be restarted?

On Fri, 3 Jul 2020 at 12:05, John Smith <ja...@gmail.com> wrote:

> I didn't restart the job manager. Let me see if I can dig up the logs...
> Also I just realised it's possible that the retry attempts to recover may
> have been exhausted.
>

Re: Task recovery?

Posted by John Smith <ja...@gmail.com>.

I didn't restart the job manager. Let me see if I can dig up the logs...
Also I just realised it's possible that the retry attempts to recover may
have been exhausted.