
Posted to issues@flink.apache.org by "Till Rohrmann (JIRA)" <ji...@apache.org> on 2016/09/29 14:35:20 UTC

[jira] [Created] (FLINK-4711) TaskManager can crash due to failing onPartitionStateUpdate call

Till Rohrmann created FLINK-4711:
------------------------------------

             Summary: TaskManager can crash due to failing onPartitionStateUpdate call
                 Key: FLINK-4711
                 URL: https://issues.apache.org/jira/browse/FLINK-4711
             Project: Flink
          Issue Type: Bug
          Components: Distributed Coordination
    Affects Versions: 1.2.0
            Reporter: Till Rohrmann
            Assignee: Till Rohrmann
             Fix For: 1.2.0


The {{TaskManager}} can crash because it calls {{Task.onPartitionStateUpdate}} when it receives a {{PartitionState}} message. The {{onPartitionStateUpdate}} method can throw an {{IOException}} or an {{InterruptedException}}, neither of which is handled at the {{TaskManager}} level.
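
To illustrate the failure mode, here is a minimal sketch of a call site that contains the two checked exceptions and fails only the affected task instead of letting them escape into the message handler. This is not the fix proposed in this issue and not Flink's actual code: {{PartitionStateListener}}, {{PartitionStateHandler}} and the method signatures are hypothetical stand-ins.

```java
import java.io.IOException;

/** Hypothetical stand-in for a task whose state-update callback may throw. */
interface PartitionStateListener {
    void onPartitionStateUpdate(String partitionId, String state)
            throws IOException, InterruptedException;

    /** Hypothetical hook that fails only this task. */
    void failExternally(Throwable cause);
}

/** Hypothetical stand-in for the code that handles a PartitionState message. */
final class PartitionStateHandler {

    /** Returns true if the update was delivered, false if the task was failed. */
    static boolean forward(PartitionStateListener task, String partitionId, String state) {
        try {
            task.onPartitionStateUpdate(partitionId, state);
            return true;
        } catch (InterruptedException e) {
            // Restore the interrupt flag so the calling thread can still observe it.
            Thread.currentThread().interrupt();
            task.failExternally(e);
            return false;
        } catch (IOException e) {
            // Fail the task rather than letting the exception crash the caller.
            task.failExternally(e);
            return false;
        }
    }
}
```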

Another problem is that the initial partition state request is triggered within the {{SingleInputGate}}. The request causes the {{JobManager}} to send a {{PartitionState}} message to the {{TaskManager}}, which forwards it to the {{Task}}. If a message gets lost at any of these points, the request is not retried and the partition state remains unknown.

In order to handle the exceptions, to make the data flow clearer and to add automatic retries, I propose to let the {{Task}} send the partition state check requests. Furthermore, the {{JobManager}} should answer the {{Task}} directly by replying to an ask operation. That way the message does not have to be routed through the {{TaskManager}}.
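
The ask-with-retry idea could look roughly like the sketch below. It uses plain {{CompletableFuture}} instead of Flink's or Akka's actual ask API, and {{PartitionStateRequester}}, {{askWithRetry}} and all parameters are hypothetical names chosen for illustration. A lost message is modelled as a reply future that never completes, so each attempt is bounded by a timeout.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

/** Hypothetical helper: the task asks for the partition state and retries on failure. */
final class PartitionStateRequester {

    /** Retries the ask up to maxAttempts times (must be at least 1). */
    static <T> CompletableFuture<T> askWithRetry(
            Supplier<CompletableFuture<T>> ask,
            int maxAttempts,
            long timeoutMillis,
            ScheduledExecutorService timer) {
        CompletableFuture<T> result = new CompletableFuture<>();
        attempt(ask, maxAttempts, timeoutMillis, timer, result);
        return result;
    }

    private static <T> void attempt(
            Supplier<CompletableFuture<T>> ask,
            int attemptsLeft,
            long timeoutMillis,
            ScheduledExecutorService timer,
            CompletableFuture<T> result) {
        CompletableFuture<T> reply = ask.get();

        // A lost request or reply never completes the future, so fail it after the timeout.
        ScheduledFuture<?> timeout = timer.schedule(
                () -> { reply.completeExceptionally(new TimeoutException("no reply")); },
                timeoutMillis, TimeUnit.MILLISECONDS);

        reply.whenComplete((value, error) -> {
            timeout.cancel(false);
            if (error == null) {
                result.complete(value);
            } else if (attemptsLeft > 1) {
                // Ask again; the caller only sees the final outcome.
                attempt(ask, attemptsLeft - 1, timeoutMillis, timer, result);
            } else {
                result.completeExceptionally(error);
            }
        });
    }
}
```

Because the reply arrives as the result of the ask, any failure (timeout or exception) surfaces in the future held by the requester, which can then decide whether to retry or to fail the task.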


