Posted to issues@flink.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2018/05/17 14:44:00 UTC

[jira] [Commented] (FLINK-6160) Retry JobManager/ResourceManager connection in case of timeout

    [ https://issues.apache.org/jira/browse/FLINK-6160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16479151#comment-16479151 ] 

ASF GitHub Bot commented on FLINK-6160:
---------------------------------------

GitHub user tillrohrmann opened a pull request:

    https://github.com/apache/flink/pull/6035

    [FLINK-6160] Add reconnection attempts in case of heartbeat timeouts to JobMaster and TaskExecutor

    ## What is the purpose of the change
    
    If a heartbeat timeout with the RM occurs on the JobMaster or the TaskExecutor, the affected component
    now tries to reconnect to the last known RM address.
    
    Additionally, we now respect TaskManagerOptions#REGISTRATION_TIMEOUT on the TaskExecutor. If the
    TaskExecutor cannot register at an RM within the given registration timeout, it fails with a
    fatal exception. This lets the TaskExecutor process terminate when it cannot establish a connection,
    which ultimately frees the occupied resources.
    
    The commit also changes the default value for TaskManagerOptions#REGISTRATION_TIMEOUT from "Inf" to "5 min".
    
    cc @GJL.
    
    ## Brief change log
    
    - Retry connection to RM in case of heartbeat timeout on `JobMaster` and `TaskExecutor`
    - Fail `TaskExecutor` if we could not connect to `RM` within `TaskManagerOptions#REGISTRATION_TIMEOUT`
    
    ## Verifying this change
    
    - Adapted `JobMasterTest#testHeartbeatTimeoutWithResourceManager`
    - Adapted `TaskExecutorTest#testHeartbeatTimeoutWithResourceManager`
    - Added `TaskExecutorTest#testMaximumRegistrationDuration` and `TaskExecutorTest#testMaximumRegistrationDurationAfterConnectionLoss`
    
    ## Does this pull request potentially affect one of the following parts:
    
      - Dependencies (does it add or upgrade a dependency): (no)
      - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (no)
      - The serializers: (no)
      - The runtime per-record code paths (performance sensitive): (no)
      - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes)
      - The S3 file system connector: (no)
    
    ## Documentation
    
      - Does this pull request introduce a new feature? (no)
      - If yes, how is the feature documented? (not applicable)


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tillrohrmann/flink fixReconnection

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/6035.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #6035
    
----
commit 6b45c84cf06688099e71c9e1809917653af43d31
Author: Till Rohrmann <tr...@...>
Date:   2018-05-17T12:44:14Z

    [FLINK-6160] Add reconnection attempts in case of heartbeat timeouts to JobMaster and TaskExecutor
    
    If a heartbeat timeout with the RM occurs on the JobMaster or the TaskExecutor, the affected component
    now tries to reconnect to the last known RM address.
    
    Additionally, we now respect TaskManagerOptions#REGISTRATION_TIMEOUT on the TaskExecutor. If the
    TaskExecutor cannot register at an RM within the given registration timeout, it fails with a
    fatal exception. This lets the TaskExecutor process terminate when it cannot establish a connection,
    which ultimately frees the occupied resources.
    
    The commit also changes the default value for TaskManagerOptions#REGISTRATION_TIMEOUT from "Inf" to "5 min".

----


>  Retry JobManager/ResourceManager connection in case of timeout
> ---------------------------------------------------------------
>
>                 Key: FLINK-6160
>                 URL: https://issues.apache.org/jira/browse/FLINK-6160
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Distributed Coordination
>    Affects Versions: 1.3.0, 1.5.0, 1.6.0
>            Reporter: Till Rohrmann
>            Assignee: Till Rohrmann
>            Priority: Blocker
>              Labels: flip-6
>             Fix For: 1.5.0
>
>
> In case of a heartbeat timeout, the {{TaskExecutor}} closes the connection to the remote component. Furthermore, it assumes that the component has actually failed and, thus, it will only start trying to connect to the component if it is notified about a new leader address and leader session id. This is brittle, because the heartbeat could also time out without the component having crashed. Thus, we should add an automatic retry to the latest known leader address information in case of a timeout.
> *Acceptance criteria:*
>   - The registration should be retried until a time limit expires after which the {{TaskExecutor}} terminates.
>   - If the registration is declined ({{RegistrationResponse.Decline}}), the {{TaskExecutor}} should terminate.
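
The acceptance criteria above can be illustrated with a minimal, self-contained sketch. It is not Flink code: the class `RegistrationRetrySketch`, the enums `Response` and `Outcome`, the `gateway` function and the address string are all hypothetical names chosen for illustration, and time is simulated (each unanswered attempt is assumed to cost one `retryDelay`) so that the example runs instantly. It only shows the intended control flow: keep retrying against the last known leader address, stop on success, terminate on a decline, and terminate once the time limit has expired.

```java
import java.time.Duration;
import java.util.function.Function;

/** Illustrative sketch only; none of these names are Flink classes. */
public class RegistrationRetrySketch {

    /** Hypothetical answers to a single registration attempt. */
    enum Response { SUCCESS, DECLINE, NO_ANSWER }

    /** Hypothetical final result of the whole registration procedure. */
    enum Outcome { REGISTERED, TERMINATED_DECLINED, TERMINATED_TIMEOUT }

    /**
     * Retries registration at the last known leader address until it succeeds,
     * is declined, or the time limit expires. Time is simulated: every
     * unanswered attempt advances the elapsed time by retryDelay.
     */
    static Outcome register(
            String lastKnownAddress,
            Function<String, Response> gateway,
            Duration timeout,
            Duration retryDelay) {
        if (retryDelay.isZero() || retryDelay.isNegative()) {
            throw new IllegalArgumentException("retryDelay must be positive");
        }
        Duration elapsed = Duration.ZERO;
        while (elapsed.compareTo(timeout) < 0) {
            Response response = gateway.apply(lastKnownAddress);
            if (response == Response.SUCCESS) {
                return Outcome.REGISTERED;
            }
            if (response == Response.DECLINE) {
                // A declined registration is final: do not retry.
                return Outcome.TERMINATED_DECLINED;
            }
            // No answer (e.g. a timeout): wait and retry the same address.
            elapsed = elapsed.plus(retryDelay);
        }
        // Time limit expired without a successful registration.
        return Outcome.TERMINATED_TIMEOUT;
    }

    public static void main(String[] args) {
        Duration timeout = Duration.ofMinutes(5);
        Duration retryDelay = Duration.ofSeconds(10);

        // The remote side misses two attempts, then answers.
        int[] attempts = {0};
        Function<String, Response> flaky =
                address -> ++attempts[0] < 3 ? Response.NO_ANSWER : Response.SUCCESS;

        System.out.println(register("rm-address", flaky, timeout, retryDelay));
        System.out.println(register("rm-address", address -> Response.DECLINE, timeout, retryDelay));
        System.out.println(register("rm-address", address -> Response.NO_ANSWER, timeout, retryDelay));
    }
}
```

Running `main` prints `REGISTERED`, `TERMINATED_DECLINED` and `TERMINATED_TIMEOUT`, one per line. A real implementation would be asynchronous and would measure wall-clock time; the simulated clock here is purely a simplification to keep the sketch deterministic.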


