Posted to user@flink.apache.org by 刘 家锹 <LJ...@outlook.com> on 2022/03/02 05:54:40 UTC

Re: Flink failure rate restart not work as expect

Hi, all

I think we may have found the reason: it is related to the 'jobmanager.execution.failover-strategy' configuration and the number of regions in the job. In our case, we set failover-strategy to 'region', and this job has 6 regions running on a single TaskManager. So when the container goes down, every region needs to be restarted, because they all belong to that one TaskManager.
With 6 regions failing at once, a limit of 4 failures per interval is clearly not enough, so it is reasonable that the job quit.
As for why my test job didn't quit: that job is different, it has only one region, so its behavior is also as expected.
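
The arithmetic above can be sketched in a few lines of Python. This is a simplified model, not Flink's implementation: the class FailureRateModel and the helper survives are hypothetical names, and the rule that recovery is suppressed once maxFailuresPerInterval failures fall inside the interval is an assumption chosen to match the 4 exceptions reported in this thread. The 60000 ms interval, the limit of 4, and the roughly 10 ms gap between exceptions come from the original report.

```python
from collections import deque


class FailureRateModel:
    """Simplified model of a failure-rate restart limit (not Flink's code)."""

    def __init__(self, max_failures, interval_ms):
        self.max_failures = max_failures
        self.interval_ms = interval_ms
        self.timestamps = deque()

    def notify_failure(self, now_ms):
        """Record a failure; return True if a restart is still allowed."""
        self.timestamps.append(now_ms)
        # Forget failures that have fallen out of the sliding window.
        while now_ms - self.timestamps[0] > self.interval_ms:
            self.timestamps.popleft()
        # Assumption: reaching max_failures inside the window suppresses recovery.
        return len(self.timestamps) < self.max_failures


def survives(region_count, max_failures=4, interval_ms=60_000, gap_ms=10):
    """Return True if every region failure is still allowed to restart."""
    model = FailureRateModel(max_failures, interval_ms)
    return all(model.notify_failure(i * gap_ms) for i in range(region_count))


print(survives(6))  # six regions on the lost TaskManager -> False
print(survives(1))  # single-region test job -> True
```

In this model, each failing region is counted as a separate failure, which is why the number of regions on the lost TaskManager matters as much as the configured limit.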

For us, we changed failover-strategy to 'full', since most of our jobs have only one TaskManager and a simple topology. That should help in most cases. Furthermore, with region failover it is rather complex to choose the right parameters, so we will apply it to complex jobs only.

If there are any best practices for pipelined-region failover restarts, or any documentation about regions, that would be helpful.

Again, thanks for taking the time to reply; it helped us a lot.
________________________________
From: 刘 家锹 <lj...@outlook.com>
Sent: March 1, 2022 23:06
To: Matthias Pohl <ma...@ververica.com>; user <us...@flink.apache.org>; David Morávek <dm...@apache.org>
Subject: Re: Flink failure rate restart not work as expect

I realized I forgot to mention something above: the container exit code is 163, which is not a normal code; at least I couldn't find its meaning on Google. My test didn't cover this situation, so I don't know whether it affects the results.

________________________________
From: 刘 家锹 <LJ...@outlook.com>
Sent: Tuesday, March 1, 2022 10:23:50 PM
To: Matthias Pohl <ma...@ververica.com>; user <us...@flink.apache.org>; David Morávek <dm...@apache.org>
Subject: Re: Flink failure rate restart not work as expect

We didn't find any obvious configuration issues in our cluster. As far as I know, it works fine in most cases. I also simulated a failover under the current configuration by starting a new job with only one TaskManager and then killing the TaskManager container, and the job recovered from the failure successfully.
As you said, the YARN logs suggest there may be a problem; we will dig into them to see if we can find any hints.

________________________________
From: Matthias Pohl <ma...@ververica.com>
Sent: Tuesday, March 1, 2022 9:50:36 PM
To: 刘 家锹 <LJ...@outlook.com>; user <us...@flink.apache.org>; David Morávek <dm...@apache.org>
Subject: Re: Flink failure rate restart not work as expect

The YARN node manager logs support my observation: The container exits with a failure which, if I understand it correctly, should cause a container restart on the YARN side. In HA mode, Flink expects the underlying resource management to restart the Flink cluster in case of failure. This does not seem to happen in your case. Is there a configuration issue in your YARN cluster? Or does the container recovery usually work in failure cases for you? I'm not that experienced with YARN deployments. I'm adding David to this thread. He might have some additional insights.

Matthias

On Tue, Mar 1, 2022 at 12:19 PM 刘 家锹 <LJ...@outlook.com> wrote:
Unfortunately we didn't keep the logs properly. This happened too long ago: the YARN ResourceManager log had already been cleaned up, and the broken machine had been reinstalled. We only found the YARN log of the JobManager on the YARN NodeManager, which may be useless. We will post detailed logs to this thread when it happens again; it happens from time to time, roughly every two weeks, when one of our cluster machines goes down.
________________________________
From: Matthias Pohl <ma...@ververica.com>
Sent: March 1, 2022 17:57
To: Alexander Preuß <al...@ververica.com>
Cc: 刘 家锹 <LJ...@outlook.com>; user@flink.apache.org <us...@flink.apache.org>
Subject: Re: Flink failure rate restart not work as expect

Hi,
I second Alex' observation - based on the logs it looks like the task restart functionality worked as expected: It tried to restart the tasks until it reached the limit of 4 attempts due to the missing TaskManager. The job-cluster shut down with an error code. At this point, YARN should pick it up and bring up a new JobManager based on the non-0 exit code of the Flink cluster. It would be interesting to see the YARN logs to figure out why the cluster failover didn't work.

Best,
Matthias

On Tue, Mar 1, 2022 at 8:00 AM Alexander Preuß <al...@ververica.com> wrote:
Hi,
at first glance it looks like the exceptions were thrown in such rapid succession that they exceeded maxFailuresPerInterval, so the FailureRestartStrategy decided not to restart. Why do you think this is different from the expected behavior?

Best,
Alex

On Tue, Mar 1, 2022 at 3:23 AM 刘 家锹 <LJ...@outlook.com> wrote:
Hi, all
We have run into a problem with FailureRateRestartStrategy that confuses us, and we don't know how to solve it. Here's the situation:

Flink version: 1.10.1
Development env: on Yarn
FailureRateRestartStrategy: failuresIntervalMS=60000,backoffTimeMS=15000,maxFailuresPerInterval=4

One of our Hadoop machines, the one our job's TaskManager was running on, got stuck and stopped responding. At that moment, the JobManager received a heartbeat timeout exception, but after 4 exceptions were thrown in a very short time (about 10 ms apart), it hit the FailureRateRestartStrategy limit and the whole job quit. We got the message 'org.apache.flink.runtime.JobException: Recovery is suppressed by FailureRateRestartBackoffTimeStrategy'.
As far as I understand from the documentation, the expected behavior is that the JobManager tries to restart the job, which would bring up a new TaskManager on another machine, but it did not.
We also ran a test: we started a new job and simply killed the TaskManager, and it restarted as expected.

This is what confuses us most. If anyone knows what happened, we would be grateful.

The JobManager and TaskManager logs are appended below.



Re: Flink failure rate restart not work as expect

Posted by Zhilong Hong <zh...@gmail.com>.
Hi, Jiaqiao:

Since your job has checkpointing enabled, you can simply try removing the
restart strategy config. The default will then be fixed-delay with
Integer.MAX_VALUE restart attempts and a '1 s' delay, as mentioned in [1].
This way, when a failover occurs, your job will wait for 1 second before it
restarts. Since the maximum number of restart attempts is Integer.MAX_VALUE,
the job will not transition to FAILED unless a fatal error occurs.
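
To illustrate why such a default would tolerate the burst of region failures described in this thread, here is a minimal Python sketch. It is a simplified model and not Flink's implementation: FixedDelayModel and its fields are hypothetical names, and the only rule modeled is that restarts are refused after the attempt count exceeds the maximum. The 1 s delay is carried along as data only; the sketch does not sleep.

```python
JAVA_INT_MAX = 2**31 - 1  # value of Java's Integer.MAX_VALUE


class FixedDelayModel:
    """Simplified model of a fixed-delay restart limit (not Flink's code)."""

    def __init__(self, max_attempts=JAVA_INT_MAX, delay_ms=1_000):
        self.max_attempts = max_attempts
        self.delay_ms = delay_ms  # informational only in this sketch
        self.attempts = 0

    def notify_failure(self):
        """Count a failure; return True if a restart is still allowed."""
        self.attempts += 1
        return self.attempts <= self.max_attempts


model = FixedDelayModel()
# Six failures in a row, as with six regions on one lost TaskManager.
burst = [model.notify_failure() for _ in range(6)]
print(all(burst))  # -> True
```

With a limit this large, the attempt counter is effectively never exhausted, so the timing of the failures no longer matters in this model.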

Best,
Zhilong

[1]
https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#restart-strategy
