Posted to user@flink.apache.org by "Bajaj, Abhinav" <ab...@here.com> on 2018/05/14 23:42:38 UTC

Akka heartbeat configurations

Hi,

We are running into issues where GC pauses result in TaskManagers being incorrectly marked dead.
The Flink documentation<https://ci.apache.org/projects/flink/flink-docs-release-1.4/ops/config.html#distributed-coordination-via-akka> describes some Akka configuration knobs that can be tuned.

For “akka.watch.heartbeat.pause”, it says: “Higher value increases the time to detect a dead TaskManager”.

Can someone please help me understand the downside of increasing the time to detect a dead TaskManager?
Will this affect the fault tolerance guarantees, state management, or checkpointing?

Thanks,
Abhinav
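
For reference, the setting discussed above lives in flink-conf.yaml. The fragment below is only a sketch: the two keys are the ones listed in the linked Flink 1.4 configuration page, but the values are illustrative assumptions, not recommendations or confirmed defaults. Check the linked page for the defaults of your version.

```yaml
# flink-conf.yaml (Flink 1.4), Akka DeathWatch settings.
# Values are illustrative assumptions; tune them against observed GC pause lengths.

# How often heartbeats are sent.
akka.watch.heartbeat.interval: 10 s

# How long a missing heartbeat is tolerated before the TaskManager is
# considered dead. A higher value tolerates longer GC pauses but delays
# detection of a TaskManager that really is dead.
akka.watch.heartbeat.pause: 120 s
```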

Re: Akka heartbeat configurations

Posted by "Bajaj, Abhinav" <ab...@here.com>.

I had the same feeling.

Thanks Timo for clarifying.

~ Abhinav

From: Timo Walther <tw...@apache.org>
Date: Tuesday, May 15, 2018 at 6:05 AM
To: "user@flink.apache.org" <us...@flink.apache.org>
Subject: Re: Akka heartbeat configurations

Hi,

Increasing the time to detect a dead TaskManager usually increases the number of elements that need to be reprocessed after a failure. Once a dead TaskManager is identified, the entire application is rolled back to its latest successfully checkpointed (consistent) state. It is therefore desirable to keep the detection time low so that the time to catch up stays low. Fault tolerance guarantees should not be affected.

I hope this helps.

Regards,
Timo

Re: Akka heartbeat configurations

Posted by Timo Walther <tw...@apache.org>.

Hi,

Increasing the time to detect a dead TaskManager usually increases the
number of elements that need to be reprocessed after a failure. Once a
dead TaskManager is identified, the entire application is rolled back to
its latest successfully checkpointed (consistent) state. It is therefore
desirable to keep the detection time low so that the time to catch up
stays low. Fault tolerance guarantees should not be affected.

I hope this helps.

Regards,
Timo
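
The trade-off described above can be put into rough numbers. The Python sketch below is a back-of-the-envelope model, not Flink code: the function name, the constant ingest rate, and the assumption that the detection delay roughly equals the configured heartbeat pause are all simplifications made here for illustration. It also assumes that no checkpoint completes while the dead TaskManager goes undetected.

```python
def elements_to_reprocess(rate_per_s, since_checkpoint_s, detection_delay_s):
    """Rough number of elements replayed after rolling back to the last checkpoint.

    Hypothetical model: input arrives at a constant rate, and no checkpoint
    completes between the failure and its detection, so everything received
    since the last successful checkpoint has to be processed again.
    """
    return rate_per_s * (since_checkpoint_s + detection_delay_s)


if __name__ == "__main__":
    rate = 10_000          # elements per second (assumed)
    since_checkpoint = 30  # seconds between last checkpoint and the failure (assumed)
    for pause_s in (60, 120, 300):
        backlog = elements_to_reprocess(rate, since_checkpoint, pause_s)
        print(f"detection delay {pause_s:>3} s: about {backlog:,} elements to reprocess")
```

Under this model the backlog grows linearly with the detection delay, which is why a longer heartbeat pause lengthens the catch-up time without changing what is eventually processed.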
