Posted to dev@uima.apache.org by "Richard Eckart de Castilho (Jira)" <de...@uima.apache.org> on 2023/01/12 15:26:00 UTC

[jira] [Resolved] (UIMA-4434) DUCC Orchestrator (OR) job:node blacklisting

     [ https://issues.apache.org/jira/browse/UIMA-4434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Eckart de Castilho resolved UIMA-4434.
----------------------------------------------
    Resolution: Abandoned

DUCC has been retired.

> DUCC Orchestrator (OR) job:node blacklisting
> --------------------------------------------
>
>                 Key: UIMA-4434
>                 URL: https://issues.apache.org/jira/browse/UIMA-4434
>             Project: UIMA
>          Issue Type: Improvement
>          Components: DUCC
>            Reporter: Lou DeGenaro
>            Assignee: Lou DeGenaro
>            Priority: Major
>             Fix For: future-DUCC
>
>
> A submitted Job may have shares allocated on some nodes where the JP works and some nodes where the JP fails.
> With respect to initialization, the OR should limit the number of initialization failures on a node before that node is banished for the Job.  The OR should communicate the blacklisted nodes for each Job to the RM, which should then not allocate any shares on those nodes for the corresponding Jobs.
> An example failure situation is as follows:
> 1. Node X does not have Filesystem F mounted
> 2. Job 1 is submitted and is allocated to Node X
> 3. Job 1's JP on Node X fails initialization (missing files!)
> 4. RM allocates the next JP for Job 1 to the same Node X, repeatedly, until the max init failures limit is reached
> 5. Job 1 is prevented from expanding because of a single "bad" Node
> If Node X had been blacklisted, then the RM would have allocated Node Y to Job 1 and expansion could have occurred.
> Other JP failure scenarios (process croak and work item failure/timeout) will not be considered for blacklisting at present.
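
The per-Job, per-node bookkeeping described above could look roughly like the following. This is only a minimal sketch, not DUCC code: the class JobNodeBlacklist, its method names, and the threshold parameter are all hypothetical, and it assumes Jobs and nodes are identified by plain strings. The OR side counts initialization failures per (Job, node) pair and banishes a node once the limit is reached; the RM side filters candidate nodes before allocating shares.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: none of these names come from the DUCC code base.
class JobNodeBlacklist {
    // Assumed limit on JP initialization failures per node, per Job.
    private final int maxInitFailuresPerNode;
    // jobId -> (node -> number of initialization failures seen so far)
    private final Map<String, Map<String, Integer>> failures = new HashMap<>();
    // jobId -> nodes banished for that Job only
    private final Map<String, Set<String>> blacklisted = new HashMap<>();

    JobNodeBlacklist(int maxInitFailuresPerNode) {
        if (maxInitFailuresPerNode < 1) {
            throw new IllegalArgumentException("limit must be at least 1");
        }
        this.maxInitFailuresPerNode = maxInitFailuresPerNode;
    }

    // OR side: record one JP initialization failure.
    // Returns true if the node is blacklisted for this Job after the update.
    boolean recordInitFailure(String jobId, String node) {
        int count = failures
                .computeIfAbsent(jobId, k -> new HashMap<>())
                .merge(node, 1, Integer::sum);
        if (count >= maxInitFailuresPerNode) {
            blacklisted.computeIfAbsent(jobId, k -> new HashSet<>()).add(node);
        }
        return isBlacklisted(jobId, node);
    }

    boolean isBlacklisted(String jobId, String node) {
        return blacklisted
                .getOrDefault(jobId, Collections.<String>emptySet())
                .contains(node);
    }

    // RM side: drop nodes that are blacklisted for this Job before allocating.
    List<String> eligibleNodes(String jobId, List<String> candidates) {
        List<String> result = new ArrayList<>();
        for (String node : candidates) {
            if (!isBlacklisted(jobId, node)) {
                result.add(node);
            }
        }
        return result;
    }
}
```

With a limit of 2, two initialization failures of Job 1 on Node X would remove Node X from the candidates for Job 1 only, so the next allocation could go to Node Y while other Jobs could still use Node X. Only initialization failures are counted here, in line with the scope stated in the description.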



--
This message was sent by Atlassian Jira
(v8.20.10#820010)