Posted to issues@flink.apache.org by "vinoyang (JIRA)" <ji...@apache.org> on 2018/03/01 03:57:00 UTC

[jira] [Commented] (FLINK-5621) Flink should provide a mechanism to prevent scheduling tasks on TaskManagers with operational issues

    [ https://issues.apache.org/jira/browse/FLINK-5621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16381479#comment-16381479 ] 

vinoyang commented on FLINK-5621:
---------------------------------

Hi [~jgrier] What if we introduce a "TM tag / label" mechanism (like YARN node labels) for standalone clusters to mark different types of TaskManagers, for example "disk space insufficient", "network congestion", and so on? The task scheduler would pay attention to critical tags and avoid the risk of potential task failures. We could also report the tags as metrics and show them in the web interface so that DevOps can monitor their nodes. A sketch of the idea is below.

We are thinking about this feature in our internal Flink version at Tencent.
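
A minimal sketch of what a tag-aware scheduling filter could look like, assuming each TM reports its tags with its heartbeat. All names here (TaskManagerTag, TaggedTaskManager, TagAwareSlotFilter) are hypothetical and not actual Flink APIs:

    import java.util.EnumSet;
    import java.util.List;
    import java.util.Set;
    import java.util.stream.Collectors;

    // Hypothetical set of operational tags a TaskManager could report.
    enum TaskManagerTag {
        DISK_SPACE_INSUFFICIENT,
        NETWORK_CONGESTION
    }

    // Hypothetical view of a registered TaskManager and its reported tags.
    class TaggedTaskManager {
        final String resourceId;
        final Set<TaskManagerTag> tags;

        TaggedTaskManager(String resourceId, Set<TaskManagerTag> tags) {
            this.resourceId = resourceId;
            this.tags = tags;
        }
    }

    class TagAwareSlotFilter {
        // Tags the scheduler treats as critical; TMs carrying any of
        // them are excluded from slot allocation until the tag clears.
        private static final Set<TaskManagerTag> CRITICAL_TAGS =
            EnumSet.of(TaskManagerTag.DISK_SPACE_INSUFFICIENT,
                       TaskManagerTag.NETWORK_CONGESTION);

        // Keep only TaskManagers that carry no critical tag.
        static List<TaggedTaskManager> schedulableTaskManagers(
                List<TaggedTaskManager> registered) {
            return registered.stream()
                .filter(tm -> tm.tags.stream()
                    .noneMatch(CRITICAL_TAGS::contains))
                .collect(Collectors.toList());
        }
    }

The scheduler would run this filter before offering slots, and the same tag set could be exported as metrics for the web UI.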

> Flink should provide a mechanism to prevent scheduling tasks on TaskManagers with operational issues
> ----------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-5621
>                 URL: https://issues.apache.org/jira/browse/FLINK-5621
>             Project: Flink
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.1.4
>            Reporter: Jamie Grier
>            Priority: Critical
>
> There are cases where jobs can get into a state where no progress can be made if there is something pathologically wrong with one of the TaskManager nodes in the cluster.
> An example of this would be a TaskManager on a machine that runs out of disk space.  Flink never considers the TM to be "bad" and will keep using it to attempt to run tasks -- which will continue to fail.
> A suggestion for overcoming this would be to allow an option where a TM will commit suicide if that TM was the source of an exception that caused a job to fail/restart.
> I'm sure there are plenty of other approaches to solving this...
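
A minimal sketch of the self-termination option suggested in the description above, assuming a hypothetical per-TM failure counter; the class, method, and threshold names are illustrative, not actual Flink APIs:

    import java.util.concurrent.atomic.AtomicInteger;

    // Hypothetical watchdog: if this TaskManager is repeatedly the source
    // of task failures, it shuts itself down so the cluster stops
    // scheduling tasks onto it.
    class TaskManagerSelfTerminationWatchdog {
        // Illustrative threshold; a real version would make this configurable.
        private static final int FAILURE_THRESHOLD = 3;
        private final AtomicInteger localFailures = new AtomicInteger();

        void onLocalTaskFailure(Throwable cause) {
            if (localFailures.incrementAndGet() >= FAILURE_THRESHOLD) {
                // Exiting the JVM forces deregistration; a resource manager
                // (e.g. YARN) could then restart the TM on a healthy node.
                System.err.println(
                    "Too many local task failures, shutting down TM: " + cause);
                System.exit(-1);
            }
        }
    }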



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)