Posted to issues@flink.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2015/02/03 10:27:35 UTC

[jira] [Commented] (FLINK-1376) SubSlots are not properly released in case that a TaskManager fatally fails, leaving the system in a corrupted state

    [ https://issues.apache.org/jira/browse/FLINK-1376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302997#comment-14302997 ] 

ASF GitHub Bot commented on FLINK-1376:
---------------------------------------

Github user StephanEwen commented on the pull request:

    https://github.com/apache/flink/pull/317#issuecomment-72618925
  
    I think this is a good fix overall. There is one issue I would really like to fix: the serializability of the `Instance` class. This class is not meant to be serialized and moved around, as reflected by the fact that it holds an `ActorRef` and by the need to make many of its fields transient.
    
    I assume that the instance needs to be serialized as part of the ExecutionGraph archiving, where the ExecutionGraph is sent via an actor message to the archiver.
    
    I would like to solve that differently. The execution graph is "cleaned" before archiving (see #344) to reduce its memory footprint. At this point, I would replace the `Instance` in the Executions with the `InstanceConnectionInfo`, which holds all the necessary info. Then we won't have to send instances through actor messages, which would be the cleaner solution.
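
To illustrate the idea of swapping the live instance for a plain connection-info snapshot before archiving, here is a minimal Java sketch. It is not Flink's actual code: `LiveHandle`, `LiveInstance`, `ConnectionInfo`, `ExecutionRecord`, and `prepareForArchiving` are hypothetical stand-ins invented for this example, and the host name and port are made-up values.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class ArchivingSketch {

    /** Hypothetical stand-in for a live actor handle; deliberately NOT Serializable. */
    static final class LiveHandle {
        final String path;
        LiveHandle(String path) { this.path = path; }
    }

    /** Hypothetical stand-in for the live instance, which holds runtime-only state. */
    static final class LiveInstance {
        final LiveHandle handle;
        final String hostname;
        final int dataPort;

        LiveInstance(LiveHandle handle, String hostname, int dataPort) {
            this.handle = handle;
            this.hostname = hostname;
            this.dataPort = dataPort;
        }

        /** Copies only the plain connection data out of the live object. */
        ConnectionInfo toConnectionInfo() {
            return new ConnectionInfo(hostname, dataPort);
        }
    }

    /** Hypothetical serializable snapshot of where a task ran. */
    static final class ConnectionInfo implements Serializable {
        private static final long serialVersionUID = 1L;
        final String hostname;
        final int dataPort;

        ConnectionInfo(String hostname, int dataPort) {
            this.hostname = hostname;
            this.dataPort = dataPort;
        }
    }

    /** Hypothetical execution record that is "cleaned" before archiving. */
    static final class ExecutionRecord implements Serializable {
        private static final long serialVersionUID = 1L;
        private LiveInstance instance;          // live reference, not serializable
        private ConnectionInfo archivedLocation; // set when the record is cleaned

        ExecutionRecord(LiveInstance instance) { this.instance = instance; }

        /** Replaces the live instance with its connection info and drops the reference. */
        void prepareForArchiving() {
            if (instance != null) {
                archivedLocation = instance.toConnectionInfo();
                instance = null;
            }
        }

        ConnectionInfo getArchivedLocation() { return archivedLocation; }
    }

    /** Returns true if Java serialization of the object succeeds. */
    static boolean isSerializable(Object obj) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(obj);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        LiveInstance instance = new LiveInstance(new LiveHandle("akka://taskmanager"), "worker-1", 6121);
        ExecutionRecord record = new ExecutionRecord(instance);

        System.out.println("serializable before cleaning: " + isSerializable(record));
        record.prepareForArchiving();
        System.out.println("serializable after cleaning:  " + isSerializable(record));
    }
}
```

In this sketch the record only becomes safe to send in a message once the live reference has been dropped, so neither the live instance class nor its handle needs to be made serializable.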


> SubSlots are not properly released in case that a TaskManager fatally fails, leaving the system in a corrupted state
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-1376
>                 URL: https://issues.apache.org/jira/browse/FLINK-1376
>             Project: Flink
>          Issue Type: Bug
>            Reporter: Till Rohrmann
>            Assignee: Till Rohrmann
>
> If a TaskManager fatally fails and some of the failing node's slots are SharedSlots, then the slots are not properly released by the JobManager. As a result, the corresponding job is not properly failed, leaving the system in a corrupted state.
> The reason is that the AllocatedSlot is not aware of being treated as a SharedSlot, and thus it cannot release the associated SubSlots.
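
The release problem described in the issue can be sketched in a few lines of Java. This is only an illustration under assumptions, not the actual fix from the pull request: the `Slot` and `SharedSlot` classes and the `allocateSubSlot` method below are simplified, hypothetical stand-ins. The point is that a slot which knows about its sub-slots can release them when it is released itself.

```java
import java.util.ArrayList;
import java.util.List;

public class SlotReleaseSketch {

    /** Hypothetical base slot: tracks only whether it has been released. */
    static class Slot {
        private boolean released;

        void release() { released = true; }

        boolean isReleased() { return released; }
    }

    /** Hypothetical shared slot that is aware of its sub-slots. */
    static final class SharedSlot extends Slot {
        private final List<Slot> subSlots = new ArrayList<>();

        /** Hands out a new sub-slot and remembers it for later release. */
        Slot allocateSubSlot() {
            Slot sub = new Slot();
            subSlots.add(sub);
            return sub;
        }

        /** Releases every sub-slot first, then the shared slot itself. */
        @Override
        void release() {
            for (Slot sub : subSlots) {
                sub.release();
            }
            subSlots.clear();
            super.release();
        }
    }

    public static void main(String[] args) {
        SharedSlot shared = new SharedSlot();
        Slot first = shared.allocateSubSlot();
        Slot second = shared.allocateSubSlot();

        // Simulates the JobManager releasing the slots of a failed TaskManager.
        shared.release();

        System.out.println("shared released: " + shared.isReleased());
        System.out.println("sub-slots released: " + (first.isReleased() && second.isReleased()));
    }
}
```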


