Posted to dev@uima.apache.org by "Lou DeGenaro (JIRA)" <de...@uima.apache.org> on 2015/05/27 22:46:31 UTC
[jira] [Created] (UIMA-4434) DUCC Orchestrator (OR) job:node blacklisting
Lou DeGenaro created UIMA-4434:
----------------------------------
Summary: DUCC Orchestrator (OR) job:node blacklisting
Key: UIMA-4434
URL: https://issues.apache.org/jira/browse/UIMA-4434
Project: UIMA
Issue Type: Improvement
Components: DUCC
Reporter: Lou DeGenaro
Assignee: Lou DeGenaro
Fix For: future-DUCC
A submitted Job may have shares allocated on some nodes where the JP works and some nodes where the JP fails.
With respect to initialization, the OR should limit the number of initialization failures allowed on a node before that node is banished for the Job. The OR should communicate the blacklisted nodes for each Job to the RM, which should then not allocate any shares on those nodes for the corresponding Jobs.
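A minimal sketch of the per-Job, per-node bookkeeping described above, in Java. The class name, method names, and the limit parameter are all hypothetical and are not taken from the DUCC code base; the sketch only illustrates counting initialization failures per job:node pair and banishing the node once an assumed limit is reached.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: none of these names exist in DUCC.
class JobNodeBlacklist {
    // Assumed limit on JP initialization failures per node, per Job.
    private final int maxInitFailuresPerNode;
    // jobId -> (node -> number of JP initialization failures seen so far)
    private final Map<String, Map<String, Integer>> failures = new HashMap<>();
    // jobId -> nodes banished for that Job
    private final Map<String, Set<String>> blacklisted = new HashMap<>();

    JobNodeBlacklist(int maxInitFailuresPerNode) {
        if (maxInitFailuresPerNode < 1) {
            throw new IllegalArgumentException("limit must be at least 1");
        }
        this.maxInitFailuresPerNode = maxInitFailuresPerNode;
    }

    // Records one JP initialization failure for the job:node pair.
    // Returns true only when this failure causes the node to become
    // newly blacklisted for the Job.
    boolean recordInitFailure(String jobId, String node) {
        int count = failures
                .computeIfAbsent(jobId, k -> new HashMap<>())
                .merge(node, 1, Integer::sum);
        if (count >= maxInitFailuresPerNode) {
            return blacklisted
                    .computeIfAbsent(jobId, k -> new HashSet<>())
                    .add(node);
        }
        return false;
    }

    // Read-only view of the nodes the OR would report to the RM for this Job.
    Set<String> blacklistedNodes(String jobId) {
        Set<String> nodes = blacklisted.get(jobId);
        return nodes == null
                ? Collections.<String>emptySet()
                : Collections.unmodifiableSet(nodes);
    }
}
```

The blacklist is keyed by Job, so a node banished for one Job stays eligible for every other Job.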
An example failure situation is as follows:
1. Node X does not have Filesystem F mounted
2. Job 1 is submitted and is allocated to Node X
3. Job 1's JP on Node X fails initialization (missing files!)
4. RM allocates the next JP for Job 1 to the same Node X, and keeps doing so until the maximum number of initialization failures is reached
5. Job 1 is prevented from expanding because of a single "bad" Node
If Node X had been blacklisted, then the RM would have allocated Node Y to Job 1 and expansion could have occurred.
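The difference blacklisting makes in this scenario can be sketched on the RM side as a filter over candidate nodes. This is a simplified Java illustration, not the RM's actual scheduling logic: the class and method names are hypothetical, and the real RM weighs many other factors (share sizes, classes, fairness) that are ignored here.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Hypothetical sketch: not part of the DUCC RM.
class ShareAllocatorSketch {
    // Returns the first candidate node that is not blacklisted for the Job,
    // or an empty Optional if every candidate has been banished.
    static Optional<String> pickNode(List<String> candidateNodes,
                                     Set<String> blacklistedForJob) {
        return candidateNodes.stream()
                .filter(node -> !blacklistedForJob.contains(node))
                .findFirst();
    }
}
```

With candidates Node X and Node Y and an empty blacklist, this sketch keeps returning Node X, which reproduces step 4 above. Once Node X is in the blacklist for Job 1, the same call returns Node Y.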
Other types of JP failure (process croak and work item failure/timeout) will not be considered for blacklisting at present.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)