Posted to issues@flink.apache.org by "zhangjing (JIRA)" <ji...@apache.org> on 2016/08/25 02:46:20 UTC
[jira] [Commented] (FLINK-4478) Implement heartbeat logic

    [ https://issues.apache.org/jira/browse/FLINK-4478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15436167#comment-15436167 ] 

zhangjing commented on FLINK-4478:
----------------------------------

[~till.rohrmann], I agree that we should first define what it should look like. Below are my answers to your questions. What do you think?
1. Exponential backoff strategy.

In fact, it is not a pure exponential backoff. The timeout grows like 'Math.min(2 * timeoutMillis, maxHeartbeatTimeout)'. Maybe we could use maxHeartbeatTimeout as a cap to reduce the risk of waiting twice as long as configured before being notified about a heartbeat failure.
Alternatively, we could use a constant retry period instead of a backoff strategy.
2. Should every heartbeat connection be responsible for triggering itself, or should the heartbeat manager be responsible for that?
A heartbeat scheduler does not trigger itself. It relies on an outside component (here I mean the HeartbeatManager) calling its start method to trigger it.

3. Is the heartbeat receiving end an independent RpcEndpoint? How does the payload delivery work? Does the sender side ask for the result (future), or does the receiving side answer via a tell message to the heartbeat manager?
From the sender's point of view, the receiving end is a gateway that can be obtained by its address. The sender side asks the receiver for the heartbeat payload.
4. How does the receiving end monitor the sender, so that it can mark the sending end as dead if heartbeat requests are not delivered?
I think this monitoring can be independent of the heartbeat manager on the sending side. It should run on the receiving end, while the heartbeat scheduler runs on the sending side.
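
As a rough illustration of points 1 and 4, here is a minimal Java sketch. It is not Flink code: the class names HeartbeatBackoffSketch and ReceiverSideMonitorSketch, their methods, and the explicit millisecond timestamps are all hypothetical and chosen only for this example. The first class applies the capped doubling 'Math.min(2 * timeoutMillis, maxHeartbeatTimeout)'. The second shows one possible way for the receiving end to decide on its own that the sender is dead, assuming it simply remembers when the last heartbeat request arrived.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not Flink API): capped doubling of the heartbeat timeout.
final class HeartbeatBackoffSketch {

    // Double the current timeout, but never exceed maxHeartbeatTimeout.
    static long nextTimeout(long timeoutMillis, long maxHeartbeatTimeout) {
        return Math.min(2 * timeoutMillis, maxHeartbeatTimeout);
    }

    // Timeouts for the first `attempts` heartbeat requests.
    static List<Long> schedule(long initialTimeoutMillis, long maxHeartbeatTimeout, int attempts) {
        List<Long> timeouts = new ArrayList<>();
        long current = Math.min(initialTimeoutMillis, maxHeartbeatTimeout);
        for (int i = 0; i < attempts; i++) {
            timeouts.add(current);
            current = nextTimeout(current, maxHeartbeatTimeout);
        }
        return timeouts;
    }

    public static void main(String[] args) {
        // Print the timeout sequence for an assumed 1s start and 5s cap.
        System.out.println(schedule(1000L, 5000L, 5));
    }
}

// Hypothetical receiver-side monitor (not Flink API): runs on the receiving end
// and is independent of whatever schedules heartbeats on the sending side.
final class ReceiverSideMonitorSketch {
    private final long timeoutMillis;
    private long lastHeartbeatMillis;

    ReceiverSideMonitorSketch(long timeoutMillis, long startMillis) {
        this.timeoutMillis = timeoutMillis;
        this.lastHeartbeatMillis = startMillis;
    }

    // Called whenever a heartbeat request from the sender arrives.
    void onHeartbeatRequest(long nowMillis) {
        lastHeartbeatMillis = nowMillis;
    }

    // The sender is considered dead once more than timeoutMillis has passed
    // since the last heartbeat request.
    boolean isSenderDead(long nowMillis) {
        return nowMillis - lastHeartbeatMillis > timeoutMillis;
    }
}
```

Passing the current time in explicitly keeps the sketch deterministic. A real implementation would more likely use a scheduled timeout that is reset on every heartbeat request.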



> Implement heartbeat logic
> -------------------------
>
>                 Key: FLINK-4478
>                 URL: https://issues.apache.org/jira/browse/FLINK-4478
>             Project: Flink
>          Issue Type: Improvement
>          Components: Distributed Coordination
>    Affects Versions: 1.1.0
>            Reporter: Till Rohrmann
>             Fix For: 1.2.0
>
>
> Parent issue to track the development of the heartbeat logic (sender and receiver) for the new Flip-6 refactoring.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)