Posted to issues@flink.apache.org by "Yingjie Cao (JIRA)" <ji...@apache.org> on 2018/11/24 05:43:00 UTC

[jira] [Created] (FLINK-11000) Introduce Resource Blacklist Mechanism

Yingjie Cao created FLINK-11000:
-----------------------------------

             Summary: Introduce Resource Blacklist Mechanism
                 Key: FLINK-11000
                 URL: https://issues.apache.org/jira/browse/FLINK-11000
             Project: Flink
          Issue Type: Improvement
          Components: Scheduler
            Reporter: Yingjie Cao
             Fix For: 1.8.0


In large clusters, jobs occasionally encounter hardware and software 
environment problems, such as missing software libraries, bad hardware, or 
resource shortages like running out of disk space. These problems lead to task 
failures, which the failover strategy handles by redeploying the affected tasks. 
However, because of location preference and limited total resources, a failed 
task may be scheduled onto the same host again, where it fails repeatedly. 
The root cause is a mismatch between the task and the resource, which the 
current resource allocation algorithm does not take into account. 

A blacklist mechanism can solve this problem. The basic idea 
is that when a task fails too many times on a resource, the Scheduler 
no longer assigns that resource to the task. The detailed design doc is here: 
[https://docs.google.com/document/d/1Qfb_QPd7CLcGT-kJjWSCdO8xFeobSCHF0vNcfiO4Bkw]
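
To make the basic idea concrete, here is a minimal sketch of per-task failure tracking. It is not taken from the design doc or from Flink's code base: the class name ResourceBlacklist, its methods, the use of plain strings for task and host identifiers, and the rule that a host is blacklisted once the failure count reaches a fixed threshold are all assumptions made for illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper, not part of Flink: counts failures per (task, host) pair. */
final class ResourceBlacklist {
    // Assumed policy: a host is blacklisted for a task once the task has
    // failed on it at least this many times.
    private final int maxFailures;

    // taskId -> (host -> number of failures of that task on that host)
    private final Map<String, Map<String, Integer>> failures = new HashMap<>();

    ResourceBlacklist(int maxFailures) {
        if (maxFailures < 1) {
            throw new IllegalArgumentException("maxFailures must be >= 1");
        }
        this.maxFailures = maxFailures;
    }

    /** Records one failure of the given task on the given host. */
    void recordFailure(String taskId, String host) {
        failures.computeIfAbsent(taskId, k -> new HashMap<>())
                .merge(host, 1, Integer::sum);
    }

    /** Returns true if a scheduler should avoid placing the task on the host. */
    boolean isBlacklisted(String taskId, String host) {
        Map<String, Integer> perHost =
                failures.getOrDefault(taskId, Collections.emptyMap());
        return perHost.getOrDefault(host, 0) >= maxFailures;
    }
}
```

In this sketch a scheduler would call isBlacklisted before assigning a slot and recordFailure from the failover path. Counts are kept per task, so a host that is bad for one task remains available to others. A real design would also need to decide when entries expire and what to do when every host is blacklisted.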



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)