Posted to user@flink.apache.org by Dan Hill <qu...@gmail.com> on 2021/04/26 05:20:02 UTC

Checkpoint error - "The job has failed"

My Flink job failed to checkpoint with a "The job has failed" error.  The
logs contained no other recent errors.  I keep hitting the error even if I
cancel the jobs and restart them.  When I restarted my jobmanager and
taskmanager, the error went away.

What error am I hitting?  It looks like there is bad state that lives
outside the scope of a job.

How often do people restart their jobmanagers and taskmanagers to deal with
errors like this?

Re: Checkpoint error - "The job has failed"

Posted by Dan Hill <qu...@gmail.com>.
Oh, interesting.  Yeah, could be.  We'll soon update to v1.12.  Thanks, Robert
and Yun!

On Wed, Apr 28, 2021 at 1:30 AM Yun Tang <my...@live.com> wrote:

> Hi Dan,
>
> You can check the "Fix Versions" field in FLINK-16753 [1]: this bug is
> fixed in 1.11.3, not 1.11.1.
>
> [1] https://issues.apache.org/jira/browse/FLINK-16753
>
> Best
> Yun Tang

Re: Checkpoint error - "The job has failed"

Posted by Yun Tang <my...@live.com>.
Hi Dan,

You can check the "Fix Versions" field in FLINK-16753 [1]: this bug is fixed in 1.11.3, not 1.11.1.

[1] https://issues.apache.org/jira/browse/FLINK-16753

Best
Yun Tang
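
To check a given Flink version against the fix, here is a minimal sketch in Python. It assumes the fix landed in patch release 3 of the 1.10 and 1.11 lines (the versions named in this thread) and that later minor lines include it. `FIX_VERSIONS`, `parse_version`, and `has_fix` are hypothetical names used for illustration, not part of Flink; confirm the actual list against "Fix Versions" on the ticket.

```python
# Assumption: first patch release carrying the fix, per minor line, taken from
# the versions mentioned in this thread (1.10.3 and 1.11.3). Not read from JIRA.
FIX_VERSIONS = {(1, 10): 3, (1, 11): 3}


def parse_version(text):
    """Turn '1.11.1' or 'v1.11.1' into a tuple of ints such as (1, 11, 1)."""
    return tuple(int(part) for part in text.lstrip("v").split("."))


def has_fix(version_text):
    """Return True if the version is at or past the assumed fix release."""
    version = parse_version(version_text)
    major_minor = version[:2]
    patch = version[2] if len(version) > 2 else 0
    if major_minor in FIX_VERSIONS:
        return patch >= FIX_VERSIONS[major_minor]
    # Assumption: minor lines newer than every listed line include the fix,
    # and older lines do not.
    return major_minor > max(FIX_VERSIONS)


if __name__ == "__main__":
    for candidate in ("1.11.1", "1.11.3", "1.12.0"):
        print(candidate, has_fix(candidate))
```

Under these assumptions, `has_fix("1.11.1")` is false, which is consistent with 1.11.1 still being affected.
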
________________________________
From: Dan Hill <qu...@gmail.com>
Sent: Tuesday, April 27, 2021 7:50
To: Yun Tang <my...@live.com>
Cc: Robert Metzger <rm...@apache.org>; user <us...@flink.apache.org>
Subject: Re: Checkpoint error - "The job has failed"

Hey Yun and Robert,

I'm using Flink v1.11.1.

Robert, I'll send you a separate email with the logs.


Re: Checkpoint error - "The job has failed"

Posted by Dan Hill <qu...@gmail.com>.
Hey Yun and Robert,

I'm using Flink v1.11.1.

Robert, I'll send you a separate email with the logs.

On Mon, Apr 26, 2021 at 12:46 AM Yun Tang <my...@live.com> wrote:

> Hi Dan,
>
> I think you might be using an older version of Flink. This problem was
> resolved by FLINK-16753 [1] in Flink 1.10.3.
>
>
> [1] https://issues.apache.org/jira/browse/FLINK-16753
>
> Best
> Yun Tang

Re: Checkpoint error - "The job has failed"

Posted by Yun Tang <my...@live.com>.
Hi Dan,

I think you might be using an older version of Flink. This problem was resolved by FLINK-16753 [1] in Flink 1.10.3.


[1] https://issues.apache.org/jira/browse/FLINK-16753

Best
Yun Tang
________________________________
From: Robert Metzger <rm...@apache.org>
Sent: Monday, April 26, 2021 14:46
To: Dan Hill <qu...@gmail.com>
Cc: user <us...@flink.apache.org>
Subject: Re: Checkpoint error - "The job has failed"

Hi Dan,

Can you send me the JobManager logs so I can take a look as well? (They will also tell me which Flink version you are using.)




Re: Checkpoint error - "The job has failed"

Posted by Robert Metzger <rm...@apache.org>.
Hi Dan,

Can you send me the JobManager logs so I can take a look as well? (They
will also tell me which Flink version you are using.)


