Posted to dev@uima.apache.org by "Lou DeGenaro (JIRA)" <de...@uima.apache.org> on 2013/03/25 21:23:16 UTC

[jira] [Commented] (UIMA-2772) DUCC resource manager - Restart and fast-start

    [ https://issues.apache.org/jira/browse/UIMA-2772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13613070#comment-13613070 ] 

Lou DeGenaro commented on UIMA-2772:
------------------------------------

Update transport so that DuccProcess and DuccReservation each carry a Node field, and provide a getter/setter and constructors that accept it.

Update orchestrator to use the newly added constructors.
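
For illustration only, the shape of the transport change described above might look like the sketch below. This is not the delivered code: NodeSketch, DuccProcessSketch and DuccReservationSketch are hypothetical, simplified stand-ins, and the fields on the node (a name and a memory size) are assumptions.

```java
import java.io.Serializable;
import java.util.Objects;

// Hypothetical stand-in for the Node object; its fields are assumptions.
final class NodeSketch implements Serializable {
    private static final long serialVersionUID = 1L;
    private final String name;
    private final long memoryKb;

    NodeSketch(String name, long memoryKb) {
        this.name = Objects.requireNonNull(name);
        this.memoryKb = memoryKb;
    }

    String getName() { return name; }
    long getMemoryKb() { return memoryKb; }
}

// Simplified stand-in for a process transport object that carries a node.
class DuccProcessSketch implements Serializable {
    private static final long serialVersionUID = 1L;
    private final String id;
    private NodeSketch node; // may be null until a node is assigned

    // Original-style constructor: no node known yet.
    DuccProcessSketch(String id) { this(id, null); }

    // Added-style constructor: the caller supplies the node up front.
    DuccProcessSketch(String id, NodeSketch node) {
        this.id = Objects.requireNonNull(id);
        this.node = node;
    }

    String getId() { return id; }
    NodeSketch getNode() { return node; }
    void setNode(NodeSketch node) { this.node = node; }
}

// Simplified stand-in for a reservation transport object, same pattern.
class DuccReservationSketch implements Serializable {
    private static final long serialVersionUID = 1L;
    private final String id;
    private NodeSketch node;

    DuccReservationSketch(String id) { this(id, null); }

    DuccReservationSketch(String id, NodeSketch node) {
        this.id = Objects.requireNonNull(id);
        this.node = node;
    }

    String getId() { return id; }
    NodeSketch getNode() { return node; }
    void setNode(NodeSketch node) { this.node = node; }
}
```

In this sketch a caller such as the orchestrator would construct new DuccProcessSketch("p1", node) when the node is already known, or use setNode later.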

Code delivered.
                
> DUCC resource manager - Restart and fast-start
> ----------------------------------------------
>
>                 Key: UIMA-2772
>                 URL: https://issues.apache.org/jira/browse/UIMA-2772
>             Project: UIMA
>          Issue Type: Bug
>          Components: DUCC
>            Reporter: Jim Challenger
>            Assignee: Jim Challenger
>
> Currently RM waits a "reasonable time" (init-stability) on startup to allow nodes to check in before accepting scheduling requests.  It is not possible to know exactly how long to wait, which makes init-stability a heuristic.  For normal startup this is not a problem.  However, if RM is restarting 'hot', or if the orchestrator publishes non-preemptable jobs on restart, and the necessary nodes have not arrived by the end of the init-stability wait, this can cause many problems: over-commitment, under-commitment, and in some cases inconsistent state (and crashes).
> To remedy this, RM will include the full Node object in its publications to the OR, which will echo them back for work that it believes to be active. On startup RM can fully reconstruct state as of its last publication from this, eliminating the problem. A side-effect of this is that RM need not wait for nodes to check in, significantly decreasing its startup time.  If nodes added to the resource pool in this way never check in, the normal "dead node" mechanism will kick in, maintaining consistency.
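
As a rough illustration of the restart scheme described in the issue, the sketch below rebuilds a node pool from work echoed back by the orchestrator and later evicts nodes that never check in. It is a minimal sketch under stated assumptions, not the RM implementation: EchoedNode, EchoedWork and RestartPool are hypothetical names, memory is the only resource tracked, and the "dead node" mechanism is reduced to a simple timeout.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical node record echoed back by the orchestrator.
final class EchoedNode {
    final String name;
    final long memoryKb;

    EchoedNode(String name, long memoryKb) {
        this.name = name;
        this.memoryKb = memoryKb;
    }
}

// Hypothetical unit of active work (job process or reservation) and its node.
final class EchoedWork {
    final String workId;
    final EchoedNode node;
    final long memoryKb;

    EchoedWork(String workId, EchoedNode node, long memoryKb) {
        this.workId = workId;
        this.node = node;
        this.memoryKb = memoryKb;
    }
}

// Hypothetical resource pool rebuilt on restart.
final class RestartPool {
    private final Map<String, EchoedNode> nodes = new HashMap<>();
    private final Map<String, Long> committedKb = new HashMap<>();
    private final Map<String, Long> lastSeenMs = new HashMap<>();

    // Rebuild nodes and commitments from echoed work, without waiting for check-ins.
    void restore(List<EchoedWork> echoed, long nowMs) {
        for (EchoedWork w : echoed) {
            nodes.putIfAbsent(w.node.name, w.node);
            committedKb.merge(w.node.name, w.memoryKb, Long::sum);
            // Assumption: the dead-node clock starts at restore time.
            lastSeenMs.putIfAbsent(w.node.name, nowMs);
        }
    }

    // A node reports in; refresh its timestamp (and add it if it is new).
    void checkIn(EchoedNode node, long nowMs) {
        nodes.putIfAbsent(node.name, node);
        lastSeenMs.put(node.name, nowMs);
    }

    boolean knows(String name) {
        return nodes.containsKey(name);
    }

    long freeKb(String name) {
        EchoedNode n = nodes.get(name);
        if (n == null) {
            throw new IllegalArgumentException("unknown node: " + name);
        }
        return n.memoryKb - committedKb.getOrDefault(name, 0L);
    }

    // Remove nodes not heard from within timeoutMs; returns their names, sorted.
    List<String> evictDead(long nowMs, long timeoutMs) {
        List<String> dead = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = lastSeenMs.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> e = it.next();
            if (nowMs - e.getValue() > timeoutMs) {
                dead.add(e.getKey());
                it.remove();
            }
        }
        for (String name : dead) {
            nodes.remove(name);
            committedKb.remove(name);
        }
        Collections.sort(dead);
        return dead;
    }
}
```

Because restore() records commitments immediately, the pool can answer freeKb() right after restart. A restored node that never calls checkIn() is dropped by a later evictDead() call, which mirrors the consistency argument made in the issue.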
