Posted to dev@flink.apache.org by "Till Rohrmann (JIRA)" <ji...@apache.org> on 2016/02/05 15:47:40 UTC
[jira] [Created] (FLINK-3345) Restart TaskManager in case of an Akka
quarantine event
Till Rohrmann created FLINK-3345:
------------------------------------
Summary: Restart TaskManager in case of an Akka quarantine event
Key: FLINK-3345
URL: https://issues.apache.org/jira/browse/FLINK-3345
Project: Flink
Issue Type: Improvement
Components: Distributed Runtime
Affects Versions: 1.0.0
Reporter: Till Rohrmann
{{ActorSystems}} that get quarantined (death watch trigger, system message failure) are not able to reconnect to the quarantining {{ActorSystem}}. To reconnect, the quarantined {{ActorSystem}} has to be restarted.
This is a problem for the {{TaskManager}}-{{JobManager}} communication. Whenever a {{TaskManager}} gets quarantined, it is effectively useless for the Flink cluster because it cannot reconnect to the {{JobManager}}. In such a case, the {{TaskManager}} has to be restarted.
The following link [1] describes how an {{ActorSystem}} can detect that it got quarantined.
When the {{TaskManager}} (TM) detects that it has been quarantined, it should shut itself down. To restart the TM, we could add a retry loop to the `taskmanager.sh` start script that restarts the TM whenever it exits with a non-zero return code.
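The retry loop could look roughly like the following. This is only a sketch under stated assumptions, not the actual contents of `taskmanager.sh`: the function name `run_with_restart` and the variables `MAX_RESTARTS` and `RESTART_DELAY` are hypothetical, and the sketch assumes the TM process runs in the foreground and exits with a non-zero code when it shuts itself down after a quarantine.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a restart loop for a start script.
# MAX_RESTARTS and RESTART_DELAY are illustrative names, not Flink options.

run_with_restart() {
    local max_restarts="${MAX_RESTARTS:-3}"   # upper bound on restarts
    local delay="${RESTART_DELAY:-5}"         # seconds to wait between runs
    local attempt=0
    local rc

    while true; do
        # Run the given command in the foreground and capture its exit code.
        rc=0
        "$@" || rc=$?

        # A zero exit code is treated as a regular shutdown: do not restart.
        if [ "$rc" -eq 0 ]; then
            return 0
        fi

        # Stop retrying once the restart budget is used up.
        if [ "$attempt" -ge "$max_restarts" ]; then
            echo "giving up after $attempt restarts (last exit code $rc)" >&2
            return "$rc"
        fi

        attempt=$((attempt + 1))
        echo "process exited with code $rc, restart $attempt/$max_restarts" >&2
        sleep "$delay"
    done
}
```

The start script would then wrap whatever command launches the TM JVM, for example `run_with_restart <command that starts the TM>`. Bounding the number of restarts is a design choice of this sketch so that a TM failing for an unrelated, persistent reason does not loop forever.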
[1] http://stackoverflow.com/questions/32471088/akka-cluster-detecting-quarantined-state