Posted to user@flink.apache.org by Morgan Geldenhuys <mo...@tu-berlin.de> on 2020/03/10 13:54:18 UTC

Failure detection and Heartbeats

Hi community,

I am interested in knowing more about the failure detection mechanism 
used by Flink, unfortunately information is a little thin on the ground 
and I was hoping someone could shed a little light on the topic.

Looking at the documentation 
(https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html), 
there are these two configuration options:


heartbeat.interval   (default: 10000, type: Long)
    Time interval for requesting heartbeat from sender side.

heartbeat.timeout    (default: 50000, type: Long)
    Timeout for requesting and receiving heartbeat for both sender and
    receiver sides.

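For reference, these options go into flink-conf.yaml; a minimal sketch using the default values from the table above:

```yaml
# flink-conf.yaml fragment (values in milliseconds;
# these are the documented defaults shown above)
heartbeat.interval: 10000
heartbeat.timeout: 50000
```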
This would indicate Flink uses a heartbeat mechanism to ascertain the 
liveness of TaskManagers. From this the following assumptions are made:

- The JobManager is responsible for broadcasting heartbeat requests to
all TaskManagers and awaiting responses.
- If a response is not forthcoming from a particular node within the
heartbeat timeout period (50 seconds by default), that node is timed
out and assumed to have failed.
- The heartbeat interval indicates how often the heartbeat request
broadcast is scheduled.
- Having the heartbeat interval shorter than the heartbeat timeout
means that multiple requests can be underway at the same time.
- Therefore, a TaskManager would need to fail to respond to 4
consecutive requests (assuming normal response times are below 10
seconds) before being timed out after 50 seconds.
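The request count in the last point can be sketched with a bit of arithmetic (illustrative only; it follows the model described above, not Flink's actual implementation):

```python
# With the default settings, count how many heartbeat requests fall
# inside one timeout window after the last successful response at t=0.
interval_ms = 10_000   # heartbeat.interval default
timeout_ms = 50_000    # heartbeat.timeout default

# Requests go out at t = 10s, 20s, 30s, 40s before the timeout fires
# at t = 50s, i.e. timeout/interval - 1 unanswered requests.
missed_before_timeout = timeout_ms // interval_ms - 1
print(missed_before_timeout)  # 4
```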

Therefore, if a failure were to occur (with the default settings):
- In the best case the JobManager would detect the failure in the
shortest time, i.e. roughly 50 seconds (the node fails just before
receiving the next heartbeat request).
- In the worst case the JobManager would detect the failure in the
longest time, i.e. roughly 60 seconds (the node fails just after
sending its last heartbeat response).

Is this correct?

For JobManagers in HA mode, I assume this is left to ZooKeeper session 
timeouts, which then trigger a leader election, and the new leader 
picks up from the previous checkpoint.

Thank you in advance.

Regards,
M.

Re: Failure detection and Heartbeats

Posted by Gary Yao <ga...@apache.org>.
Hi Morgan,

> I am interested in knowing more about the failure detection mechanism
used by Flink, unfortunately information is a little thin on the ground and
I was hoping someone could shed a little light on the topic.
It is probably best to look into the implementation (see my answers below).

> Having the heartbeat interval shorter than the heartbeat timeout would
mean that multiple requests can be underway at the same time.
Yes, in fact the heartbeat interval must be shorter than the timeout, or
else an exception is thrown [1].
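That argument check can be sketched as follows (names are illustrative, not Flink's actual API; the exact comparison used in HeartbeatServices may differ, see [1]):

```python
# Sketch of the interval-vs-timeout validation described above:
# the interval must be positive and shorter than the timeout.
def validate_heartbeat_config(interval_ms: int, timeout_ms: int) -> None:
    if interval_ms <= 0:
        raise ValueError("heartbeat.interval must be positive")
    if interval_ms >= timeout_ms:
        raise ValueError(
            "heartbeat.interval must be smaller than heartbeat.timeout")

validate_heartbeat_config(10_000, 50_000)  # the defaults pass the check
```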

> - In the worst case the JobManager would detect the failure in the
longest time, i.e. 60 seconds +- (node fails just after sending the last
heartbeat response)
If a heartbeat response is received, the 50s timeout is reset [2]. If we do
not receive a single heartbeat response for 50s, we will assume a failure
[3]. Therefore, I do not think that there is a worst case or best case here.
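The reset semantics described here can be illustrated with a minimal single-threaded sketch (this is an illustration of the idea, not Flink's HeartbeatMonitorImpl):

```python
# The timeout clock restarts whenever a heartbeat response arrives;
# a failure is reported only after a full quiet period of timeout_ms.
class HeartbeatMonitor:
    def __init__(self, timeout_ms: int):
        self.timeout_ms = timeout_ms
        self.last_response_ms = 0  # time of the most recent response

    def on_response(self, now_ms: int) -> None:
        # Receiving a response resets the timeout (cf. [2]).
        self.last_response_ms = now_ms

    def is_timed_out(self, now_ms: int) -> bool:
        # Failure is assumed only after timeout_ms of silence (cf. [3]).
        return now_ms - self.last_response_ms > self.timeout_ms

monitor = HeartbeatMonitor(timeout_ms=50_000)
monitor.on_response(10_000)  # response at t=10s
monitor.on_response(40_000)  # response at t=40s resets the clock again
print(monitor.is_timed_out(60_000))  # False: only 20s since last response
print(monitor.is_timed_out(95_000))  # True: 55s of silence
```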

Lastly, I wanted to mention that since FLIP-6 [4], the responsibilities
of the JobManager have been split. We now have a ResourceManager and one
JobManager for every job (note that in the code the class is called
JobMaster). These components heartbeat each other as well as the
TaskManagers.

Best,
Gary

[1]
https://github.com/apache/flink/blob/bf1195232a49cce1897c1fa86c5af9ee005212c6/flink-runtime/src/main/java/org/apache/flink/runtime/heartbeat/HeartbeatServices.java#L43
[2]
https://github.com/apache/flink/blob/1b628d4a7d92f9c79c31f3fe90911940e0676b22/flink-runtime/src/main/java/org/apache/flink/runtime/heartbeat/HeartbeatMonitorImpl.java#L117-L128
[3]
https://github.com/apache/flink/blob/1b628d4a7d92f9c79c31f3fe90911940e0676b22/flink-runtime/src/main/java/org/apache/flink/runtime/heartbeat/HeartbeatMonitorImpl.java#L106-L111
[4]
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=65147077

On Tue, Mar 10, 2020 at 2:54 PM Morgan Geldenhuys <
morgan.geldenhuys@tu-berlin.de> wrote:
