You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by "Till Rohrmann (JIRA)" <ji...@apache.org> on 2016/09/29 14:35:20 UTC
[jira] [Created] (FLINK-4711) TaskManager can crash due to failing
onPartitionStateUpdate call
Till Rohrmann created FLINK-4711:
------------------------------------
Summary: TaskManager can crash due to failing onPartitionStateUpdate call
Key: FLINK-4711
URL: https://issues.apache.org/jira/browse/FLINK-4711
Project: Flink
Issue Type: Bug
Components: Distributed Coordination
Affects Versions: 1.2.0
Reporter: Till Rohrmann
Assignee: Till Rohrmann
Fix For: 1.2.0
The {{TaskManager}} can crash because it calls {{Task.onPartitionStateUpdate}} when it receives a {{PartitionState}} message. The {{onPartitionStateUpdate}} method can throw an {{IOException}} or {{InterruptedException}} which are not handled on the {{TaskManager}} level.
Another problem is that the initial partition state request is triggered within the {{SingleInputGate}}. The request causes the {{JobManager}} to send a {{PartitionState}} message to the {{TaskManager}} which forwards it to the {{Task}}. If the at any of these points a message gets lost, then it is not retried and the partition state remains unknown.
In order to handle the exceptions, to make the data flow clearer and to add automatic retries, I propose to let the {{Task}} send the partition state check requests. Furthermore, the {{JobManager}} should directly answer to the {{Task}} by replying to an ask operation. That way the message does not have to be routed through the {{TaskManager}}.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)