Posted to issues@flink.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2017/08/04 09:18:00 UTC

[jira] [Commented] (FLINK-3347) TaskManager (or its ActorSystem) need to restart in case they notice quarantine

    [ https://issues.apache.org/jira/browse/FLINK-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16114159#comment-16114159 ] 

ASF GitHub Bot commented on FLINK-3347:
---------------------------------------

GitHub user NicoK opened a pull request:

    https://github.com/apache/flink/pull/4478

    [hotfix][docs] add documentation for `taskmanager.exit-on-fatal-akka-error`

    ## What is the purpose of the change
    
    When the quarantine monitor was added in FLINK-3347, the documentation for enabling it went only into the backports for the 1.2 and 1.1 branches, not into master, and therefore not into the 1.3 release either. This change adds the missing documentation and should also be applied to the `release-1.3` branch.
    
    ## Brief change log
    
    - add configuration documentation for `taskmanager.exit-on-fatal-akka-error`
    
    ## Verifying this change
    
    This change is a trivial rework / code cleanup without any test coverage.
    
    ## Does this pull request potentially affect one of the following parts:
    
      - Dependencies (does it add or upgrade a dependency): (no)
      - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (no)
      - The serializers: (no)
      - The runtime per-record code paths (performance sensitive): (no)
      - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (no)
    
    ## Documentation
    
      - Does this pull request introduce a new feature? (no)
      - If yes, how is the feature documented? (not applicable)
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/NicoK/flink hotfix_quarantine_monitor_config

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/4478.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4478
    
----
commit 6111a0626e13b85b8996dcdf9f3d741c23739cf5
Author: Nico Kruber <ni...@data-artisans.com>
Date:   2017-08-04T09:11:35Z

    [hotfix][docs] add documentation for `taskmanager.exit-on-fatal-akka-error`
    
    When the quarantine monitor was added as of FLINK-3347, this documentation for
    enabling it only went into the backport for the 1.2 branch, not into master and
    therefore not into the 1.3 release either. This adds it again.

----


> TaskManager (or its ActorSystem) need to restart in case they notice quarantine
> -------------------------------------------------------------------------------
>
>                 Key: FLINK-3347
>                 URL: https://issues.apache.org/jira/browse/FLINK-3347
>             Project: Flink
>          Issue Type: Improvement
>          Components: Distributed Coordination
>    Affects Versions: 0.10.1
>            Reporter: Stephan Ewen
>            Assignee: Till Rohrmann
>            Priority: Critical
>             Fix For: 1.0.0, 1.1.4, 1.3.0, 1.2.1
>
>
> There are cases where Akka quarantines remote actor systems. In that case, no further communication is possible with that actor system unless one of the two actor systems is restarted.
> The result is that a TaskManager is up and available but cannot register with the JobManager (Akka refuses the connection because of the quarantined state), which makes the TaskManager a useless process.
> I suggest letting the TaskManager restart itself once it notices that either it has quarantined the JobManager or the JobManager has quarantined it.
> It is possible to recognize that by listening to certain events in the actor system event stream: http://stackoverflow.com/questions/32471088/akka-cluster-detecting-quarantined-state
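
The mechanism described in the issue (listen on the event stream, react to a quarantine event, and gate that reaction behind `taskmanager.exit-on-fatal-akka-error`) can be illustrated with a small dependency-free sketch. This is not Flink's or Akka's actual code: the class `QuarantineMonitorSketch`, the `EventStream` type, and the `RemotingEvent` constants are hypothetical stand-ins for the actor system event stream and its remoting events, and treating a missing key as "disabled" is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

/** Dependency-free sketch of a quarantine monitor. All names are hypothetical. */
class QuarantineMonitorSketch {

    /** Stand-ins for the remoting events published on an actor system event stream. */
    enum RemotingEvent { ASSOCIATED, DISASSOCIATED, THIS_SYSTEM_QUARANTINED, REMOTE_SYSTEM_QUARANTINED }

    /** Minimal event stream: every subscriber is called synchronously for each published event. */
    static final class EventStream {
        private final List<Consumer<RemotingEvent>> subscribers = new ArrayList<>();

        void subscribe(Consumer<RemotingEvent> subscriber) {
            subscribers.add(subscriber);
        }

        void publish(RemotingEvent event) {
            subscribers.forEach(s -> s.accept(event));
        }
    }

    /** Configuration key taken from the pull request title. */
    static final String EXIT_KEY = "taskmanager.exit-on-fatal-akka-error";

    /** True if the event means that one side has quarantined the other. */
    static boolean isQuarantine(RemotingEvent event) {
        return event == RemotingEvent.THIS_SYSTEM_QUARANTINED
                || event == RemotingEvent.REMOTE_SYSTEM_QUARANTINED;
    }

    /**
     * Subscribes a monitor that runs onFatal for each quarantine event.
     * Assumption: a missing key means the option is disabled, so nothing is subscribed.
     */
    static void install(EventStream stream, Map<String, String> config, Runnable onFatal) {
        boolean enabled = Boolean.parseBoolean(config.getOrDefault(EXIT_KEY, "false"));
        if (!enabled) {
            return;
        }
        stream.subscribe(event -> {
            if (isQuarantine(event)) {
                onFatal.run();
            }
        });
    }

    public static void main(String[] args) {
        EventStream stream = new EventStream();
        Map<String, String> config = Collections.singletonMap(EXIT_KEY, "true");
        // A real process would terminate here so that an external supervisor can restart it.
        install(stream, config, () -> System.out.println("quarantine detected, process would exit"));
        stream.publish(RemotingEvent.ASSOCIATED);              // ignored by the monitor
        stream.publish(RemotingEvent.THIS_SYSTEM_QUARANTINED); // triggers the handler
    }
}
```

Passing the reaction in as a `Runnable` keeps the sketch testable. In a real deployment the handler would end the process instead of printing, on the assumption that an external supervisor such as YARN or a service manager starts a fresh TaskManager.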



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)