Posted to user@flink.apache.org by dhirajpraj <dh...@gmail.com> on 2018/03/30 11:59:26 UTC

Task Manager fault tolerance does not work

Hi,
I have set up a Flink 1.4 cluster with one job manager and two task managers.
The configs taskmanager.numberOfTaskSlots and parallelism.default were set
to 2 on each node. I submitted a job to this cluster and it runs fine. To
test fault tolerance, I killed one task manager. I was expecting the job to
keep running because one of the two task managers was still up.
However, the job failed. Am I missing something?



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Re: Task Manager fault tolerance does not work

Posted by Fabian Hueske <fh...@gmail.com>.

Hi,

Thanks for the feedback!
As Till explained, the problem is that the JM first tries to schedule the
job to the failed TM (which hasn't been detected as failed yet).
The configured three restart attempts are "consumed" by these attempts and
the job fails afterwards.

Best, Fabian

2018-04-05 8:17 GMT+02:00 dhirajpraj <dh...@gmail.com>:

> Just for the record,
> It did not work with RestartStrategies.fixedDelayRestart(3, 5000) but
> worked with RestartStrategies.fixedDelayRestart(20, 5000)

Re: Task Manager fault tolerance does not work

Posted by dhirajpraj <dh...@gmail.com>.

Just for the record,
It did not work with RestartStrategies.fixedDelayRestart(3, 5000) but worked
with RestartStrategies.fixedDelayRestart(20, 5000)




Re: Task Manager fault tolerance does not work

Posted by dhirajpraj <dh...@gmail.com>.

As suggested by Till, it works perfectly fine after increasing the number of
retries. Thanks, everyone.




Re: Task Manager fault tolerance does not work

Posted by Till Rohrmann <tr...@apache.org>.

There is a JIRA issue for the problem:
https://issues.apache.org/jira/browse/FLINK-9120. Mirroring my response to
this thread:

The logs (attached to the JIRA ticket) show that the JM has not yet
recognized the killed TM as dead when it tries to restart the job. Thus, it
tries to re-deploy tasks to this machine. When it finally realizes that the TM
has been killed, it fails the job. At this point, it would try to recover the
job. However, since the number of restart attempts is depleted (set to 3),
it fails the job terminally. Please try raising the number of retry
attempts. This should hopefully fix your problem.
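
To illustrate the reasoning, here is a minimal Java sketch of how the restart
budget interacts with failure detection. It is a simplified model, not Flink
code: the class and method names are invented, redeployment time is ignored,
and the 60-second detection time is an assumed value chosen only for
illustration.

```java
public class RestartBudgetSketch {

    /**
     * Returns the 1-based restart attempt that succeeds, or -1 if the
     * restart budget is used up before the dead TM has been detected.
     * Assumption: every attempt made before detection is deployed to the
     * dead TM and therefore fails, consuming one attempt.
     */
    static int firstSuccessfulAttempt(int maxAttempts, long delayMs, long detectionMs) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            // Time since the TM was killed when this attempt starts.
            long elapsedMs = attempt * delayMs;
            if (elapsedMs >= detectionMs) {
                return attempt;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        long detectionMs = 60_000L; // assumed detection time, illustration only

        // fixedDelayRestart(3, 5000): all attempts happen within 15 s.
        System.out.println(firstSuccessfulAttempt(3, 5_000L, detectionMs));  // prints -1

        // fixedDelayRestart(20, 5000): attempts keep going for up to 100 s.
        System.out.println(firstSuccessfulAttempt(20, 5_000L, detectionMs)); // prints 12
    }
}
```

In this model the job survives only if attempts multiplied by delay covers the
detection time, so a longer delay between attempts would help in the same way
as a higher attempt count.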

Cheers,
Till

On Tue, Apr 3, 2018 at 3:26 PM, Timo Walther <tw...@apache.org> wrote:

> @Till: Do you have any advice for this issue?
>
> On 03.04.18 at 11:54, dhirajpraj wrote:
>
>> What I have found is that the TM fault tolerance behaviour is not
>> consistent. Sometimes it works and sometimes it doesn't. I am attaching
>> my Java code file (which is the main class).
>>
>> What I did was:
>> 1) Run cluster with JM on machine A, one TM on machine B and one TM on
>> machine C
>> 2) Submit a job to the cluster. Works fine till now.
>> 3) Forcefully kill the TM on machine C. The web UI shows job failing and
>> then restarting and finally the job is up on its own. This is perfect.
>> 4) Now I start the TM on machine C and wait for sufficient time
>> 5) Now kill the TM on machine B. At this point the job fails. Shouldn't
>> the job be handled by the running TM on machine C?
>> FlinkPatternDetection.java
>> <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t1400/FlinkPatternDetection.java>

Re: Task Manager fault tolerance does not work

Posted by Timo Walther <tw...@apache.org>.

@Till: Do you have any advice for this issue?


On 03.04.18 at 11:54, dhirajpraj wrote:
> What I have found is that the TM fault tolerance behaviour is not consistent.
> Sometimes it works and sometimes it doesn't. I am attaching my Java code file
> (which is the main class).
>
> What I did was:
> 1) Run cluster with JM on machine A, one TM on machine B and one TM on
> machine C
> 2) Submit a job to the cluster. Works fine till now.
> 3) Forcefully kill the TM on machine C. The web UI shows job failing and
> then restarting and finally the job is up on its own. This is perfect.
> 4) Now I start the TM on machine C and wait for sufficient time
> 5) Now kill the TM on machine B. At this point the job fails. Shouldn't the
> job be handled by the running TM on machine C? FlinkPatternDetection.java
> <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t1400/FlinkPatternDetection.java>

Re: Task Manager fault tolerance does not work

Posted by dhirajpraj <dh...@gmail.com>.

What I have found is that the TM fault tolerance behaviour is not consistent.
Sometimes it works and sometimes it doesn't. I am attaching my Java code file
(which is the main class).

What I did was:
1) Run cluster with JM on machine A, one TM on machine B and one TM on
machine C
2) Submit a job to the cluster. Works fine till now.
3) Forcefully kill the TM on machine C. The web UI shows job failing and
then restarting and finally the job is up on its own. This is perfect.
4) Now I start the TM on machine C and wait for sufficient time
5) Now kill the TM on machine B. At this point the job fails. Shouldn't the
job be handled by the running TM on machine C? FlinkPatternDetection.java
<http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t1400/FlinkPatternDetection.java>




Re: Task Manager fault tolerance does not work

Posted by Timo Walther <tw...@apache.org>.

Could you provide a small reproducible example? Which file system are
you using? This sounds like a bug to me that should be fixed if valid.

On 03.04.18 at 11:28, dhirajpraj wrote:
> I have not specified any parallelism in the job code. So I guess the
> parallelism should be set to parallelism.default defined in
> flink-conf.yaml.
>
> An update: The TMs were on different machines and I was using FsStateBackend
> with state backend directories pointing to instance-specific file paths.
> After using MemoryStateBackend instead of FsStateBackend, the issue seems
> to be resolved.

Re: Task Manager fault tolerance does not work

Posted by dhirajpraj <dh...@gmail.com>.

I have not specified any parallelism in the job code. So I guess the
parallelism should be set to parallelism.default defined in
flink-conf.yaml.

An update: The TMs were on different machines and I was using FsStateBackend
with state backend directories pointing to instance-specific file paths.
After using MemoryStateBackend instead of FsStateBackend, the issue seems
to be resolved.




Re: Task Manager fault tolerance does not work

Posted by Timo Walther <tw...@apache.org>.

Hi,

Does your job code declare a parallelism higher than 2, or is the job
submitted with a higher parallelism? What does the web UI display?
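
For context on why the parallelism matters, here is a rough Java sketch of the
slot arithmetic. It assumes default slot sharing, where a job needs as many
slots as its highest operator parallelism. The class and method names are
made up for illustration and are not Flink API.

```java
public class SlotCheckSketch {

    /** True if the surviving task managers still offer enough slots for the job. */
    static boolean canReschedule(int taskManagers, int slotsPerTaskManager,
                                 int failedTaskManagers, int jobParallelism) {
        int remainingSlots = (taskManagers - failedTaskManagers) * slotsPerTaskManager;
        return remainingSlots >= jobParallelism;
    }

    public static void main(String[] args) {
        // Setup from this thread: 2 TMs with 2 slots each, one TM killed.
        System.out.println(canReschedule(2, 2, 1, 2)); // prints true: parallelism 2 fits in 2 slots
        System.out.println(canReschedule(2, 2, 1, 4)); // prints false: parallelism 4 needs 4 slots
    }
}
```

With parallelism 2 the remaining TM should be able to host the whole job, so a
failure in that case would point to something other than missing slots.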

Regards,
Timo

On 03.04.18 at 10:48, dhirajpraj wrote:
> Hi,
> I have done that
> env.enableCheckpointing(5000L);
> env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, 5000));

Re: Task Manager fault tolerance does not work

Posted by dhirajpraj <dh...@gmail.com>.

Hi,
I have done that
env.enableCheckpointing(5000L);
env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, 5000));




Re: Task Manager fault tolerance does not work

Posted by Stephan Ewen <se...@apache.org>.

Please make sure you have set a number of retries and have checkpointing
activated if you use streaming.

On Fri, Mar 30, 2018 at 1:59 PM, dhirajpraj <dh...@gmail.com> wrote:

> Hi,
> I have set up a Flink 1.4 cluster with one job manager and two task managers.
> The configs taskmanager.numberOfTaskSlots and parallelism.default were set
> to 2 on each node. I submitted a job to this cluster and it runs fine. To
> test fault tolerance, I killed one task manager. I was expecting the job to
> keep running because one of the two task managers was still up.
> However, the job failed. Am I missing something?