Posted to issues@flink.apache.org by "Robert Metzger (JIRA)" <ji...@apache.org> on 2019/02/28 13:10:00 UTC
[jira] [Updated] (FLINK-5621) Flink should provide a mechanism to
prevent scheduling tasks on TaskManagers with operational issues
[ https://issues.apache.org/jira/browse/FLINK-5621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Robert Metzger updated FLINK-5621:
----------------------------------
Component/s: Runtime / Coordination (was: Core)
> Flink should provide a mechanism to prevent scheduling tasks on TaskManagers with operational issues
> ----------------------------------------------------------------------------------------------------
>
> Key: FLINK-5621
> URL: https://issues.apache.org/jira/browse/FLINK-5621
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.1.4
> Reporter: Jamie Grier
> Assignee: vinoyang
> Priority: Critical
>
> There are cases where jobs can get into a state where no progress can be made if there is something pathologically wrong with one of the TaskManager nodes in the cluster.
> An example of this would be a TaskManager on a machine that runs out of disk space. Flink never considers the TM to be "bad" and will keep using it to attempt to run tasks -- which will continue to fail.
> One way to overcome this would be an option that makes a TM shut itself down if it was the source of an exception that caused a job to fail or restart.
> I'm sure there are plenty of other approaches to solving this.
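
To make the idea in the description concrete, here is a minimal sketch of one possible approach: count the failures attributed to each TaskManager and stop scheduling on one that crosses a threshold. It is not Flink code. The class name TaskManagerFailureTracker, its methods, and the maxFailures threshold are all hypothetical, chosen only for illustration, and how failures would be attributed to a TM is left out.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical helper, not part of Flink: counts task failures attributed to
 * each TaskManager and reports when one should be excluded from scheduling.
 */
final class TaskManagerFailureTracker {

    // Assumed policy knob: failures tolerated before a TM is excluded.
    private final int maxFailures;
    private final Map<String, Integer> failuresByTaskManager = new HashMap<>();

    TaskManagerFailureTracker(int maxFailures) {
        if (maxFailures < 1) {
            throw new IllegalArgumentException("maxFailures must be at least 1");
        }
        this.maxFailures = maxFailures;
    }

    /** Records one failure for the TM; returns true once the threshold is reached. */
    boolean recordFailure(String taskManagerId) {
        int count = failuresByTaskManager.merge(taskManagerId, 1, Integer::sum);
        return count >= maxFailures;
    }

    /** Returns true if tasks may still be scheduled on the given TM. */
    boolean isSchedulable(String taskManagerId) {
        return failuresByTaskManager.getOrDefault(taskManagerId, 0) < maxFailures;
    }
}
```

A scheduler could consult isSchedulable before placing a task, or the TM itself could exit when recordFailure returns true, which is closer to the option suggested above. A real implementation would probably also need to expire old failures so that a TM that recovers (for example, after disk space is freed) becomes usable again.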
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)