Posted to user@flink.apache.org by Anchit Jatana <de...@gmail.com> on 2016/10/28 22:13:27 UTC

Flink on YARN - Fault Tolerance | use case supported or not

Hi All,

I tried testing fault tolerance of my running Flink application in a
different way (not sure if it is an appropriate way). I ran the Flink
application on YARN and, after it had completed a few checkpoints, killed
the YARN application using:

yarn application -kill application_1476277440022_xxxx

I then tried restarting the application, providing the same checkpoint
directory path. The application started afresh and did not resume from the
last checkpointed state. I just wanted to confirm whether fault tolerance
is supported in this use case. If yes, what am I doing wrong?

I'm aware of the savepoint process (create a savepoint, stop the
application, and resume a new application from that savepoint), but I
wished to check the above use case: if the YARN application gets killed,
perhaps accidentally or for some other reason, is this kind of fault
tolerance supported?
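For reference, the savepoint workflow mentioned above can be sketched with the Flink 1.1-era CLI. The job ID, savepoint path, and jar name below are placeholders, not values from this thread:

```shell
# Trigger a savepoint for the running job (job ID is a placeholder)
bin/flink savepoint 5e20cb6b0f357591171dfcca2eea09de

# Cancel the job once the savepoint has been written
bin/flink cancel 5e20cb6b0f357591171dfcca2eea09de

# Resume a new run of the application from that savepoint
# (the savepoint path is printed by the savepoint command)
bin/flink run -s hdfs:///flink/savepoints/savepoint-example ./my-job.jar
```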


Regards,
Anchit

Re: Flink on YARN - Fault Tolerance | use case supported or not

Posted by Anchit Jatana <de...@gmail.com>.
Yes, thanks Stephan.

Regards,
Anchit




Re: Flink on YARN - Fault Tolerance | use case supported or not

Posted by Stephan Ewen <se...@apache.org>.
Hi Anchit!

In high-availability setups, a Flink cluster recovers the jobs that it
considers to belong to that cluster. Which jobs those are is determined by
the Zookeeper cluster namespace setting, "recovery.zookeeper.path.namespace":
https://github.com/apache/flink/blob/release-1.1.3/flink-core/src/main/java/org/apache/flink/configuration/ConfigConstants.java#L646

If you submit the job in the "per-job-yarn" mode (via 'bin/flink run -m
yarn-cluster ...') then this gets a unique auto-generated namespace. The
assumption is that the job recovers itself as long as the yarn job keeps
running. If you force yarn to terminate the job, it is gone.

If you start a "yarn session", then it picks up the namespace from the
config. If you kill that yarn session while jobs are running, and then
start a new session with the same namespace, it will start recovering the
previously running jobs.
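The two modes Stephan describes can be sketched as follows. This is a hedged sketch against the Flink 1.1-era CLI and config keys; container counts, hostnames, paths, and the job jar are placeholders:

```shell
# Per-job YARN mode: the Zookeeper namespace is auto-generated, so the job
# is only recovered while this YARN application itself stays alive.
bin/flink run -m yarn-cluster -yn 2 ./my-job.jar

# YARN session mode: the namespace is taken from flink-conf.yaml, e.g.
#   recovery.mode: zookeeper
#   recovery.zookeeper.quorum: zk-host:2181
#   recovery.zookeeper.path.namespace: /flink/my-session
# A new session started with the same namespace recovers jobs that were
# running when the previous session was killed.
bin/yarn-session.sh -n 2
bin/flink run ./my-job.jar
```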

Does that make sense?

Greetings,
Stephan




Re: Flink on YARN - Fault Tolerance | use case supported or not

Posted by Kostas Kloudas <k....@data-artisans.com>.
Hi Jatana,

As you pointed out, the correct way to do the above is to use savepoints.
If you kill your application, then this is not a crash but rather a
voluntary action.

I am also looping in Max, as he may have something more to say on this.

Cheers,
Kostas
