Posted to dev@uima.apache.org by "Lou DeGenaro (JIRA)" <de...@uima.apache.org> on 2015/05/27 22:46:31 UTC

[jira] [Created] (UIMA-4434) DUCC Orchestrator (OR) job:node blacklisting

Lou DeGenaro created UIMA-4434:
----------------------------------

             Summary: DUCC Orchestrator (OR) job:node blacklisting
                 Key: UIMA-4434
                 URL: https://issues.apache.org/jira/browse/UIMA-4434
             Project: UIMA
          Issue Type: Improvement
          Components: DUCC
            Reporter: Lou DeGenaro
            Assignee: Lou DeGenaro
             Fix For: future-DUCC


A submitted Job may have shares allocated on some nodes where the JP works and some nodes where the JP fails.

With respect to initialization, the OR should enforce a limit on the number of initialization failures on a node before that node is blacklisted for the Job.  The OR should communicate the blacklisted nodes for each Job to the RM, which should then not allocate any shares on those nodes for the corresponding Jobs.
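
As a rough illustration of the bookkeeping described above, the following is a minimal sketch only, not DUCC code. The class name JobNodeBlacklist, its methods, and the per-node limit parameter are all hypothetical; the sketch assumes Jobs and nodes can be identified by plain strings.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: none of these names come from the DUCC code base.
class JobNodeBlacklist {
    // Assumed limit on initialization failures per (Job, node) pair.
    private final int maxInitFailuresPerNode;
    // jobId -> (node -> number of JP initialization failures seen so far)
    private final Map<String, Map<String, Integer>> failures = new HashMap<>();
    // jobId -> nodes blacklisted for that Job
    private final Map<String, Set<String>> blacklisted = new HashMap<>();

    JobNodeBlacklist(int maxInitFailuresPerNode) {
        if (maxInitFailuresPerNode < 1) {
            throw new IllegalArgumentException("limit must be at least 1");
        }
        this.maxInitFailuresPerNode = maxInitFailuresPerNode;
    }

    // Records one JP initialization failure for the Job on the node.
    // Returns true only when this failure newly blacklists the node.
    boolean recordInitFailure(String jobId, String node) {
        int count = failures
                .computeIfAbsent(jobId, k -> new HashMap<>())
                .merge(node, 1, Integer::sum);
        if (count >= maxInitFailuresPerNode) {
            return blacklisted
                    .computeIfAbsent(jobId, k -> new HashSet<>())
                    .add(node);
        }
        return false;
    }

    // Read-only view of the nodes blacklisted for the Job; this is the
    // information the OR would pass on to the RM.
    Set<String> blacklistedNodes(String jobId) {
        return Collections.unmodifiableSet(
                blacklisted.getOrDefault(jobId, Collections.emptySet()));
    }
}
```

Keeping the counts per Job means a node that is "bad" for one Job (for example, because of a missing mount) remains usable by other Jobs.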

An example failure situation is as follows:

1. Node X does not have Filesystem F mounted
2. Job 1 is submitted and is allocated to Node X
3. Job 1's JP on Node X fails initialization (missing files!)
4. RM allocates the next JP for Job 1 to the same Node X, repeatedly, until max init failures is reached
5. Job 1 is prevented from expanding because of a single "bad" Node

If Node X had been blacklisted, then the RM would have allocated Node Y to Job 1 and expansion could have occurred.
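
The RM side of the scenario above might look roughly like the sketch below. It is an assumption-laden illustration, not the RM's actual scheduling logic: NodeSelector and pickNode are hypothetical names, and real allocation would also weigh share sizes, classes, and fairness.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Hypothetical sketch: not part of the DUCC Resource Manager.
class NodeSelector {
    // Returns the first candidate node that is not blacklisted for the Job,
    // or an empty Optional if every candidate is blacklisted.
    static Optional<String> pickNode(List<String> candidates,
                                     Set<String> blacklistedForJob) {
        return candidates.stream()
                .filter(node -> !blacklistedForJob.contains(node))
                .findFirst();
    }
}
```

With candidates Node X and Node Y (in that order) and an empty blacklist, the sketch keeps choosing Node X; once Node X is in the Job's blacklist, it chooses Node Y instead.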

Other types of JP failure, such as process croak and work item failure/timeout, will not be considered for blacklisting at present.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)