Posted to issues@airavata.apache.org by "Dimuthu Upeksha (JIRA)" <ji...@apache.org> on 2019/03/01 22:17:05 UTC

[jira] [Commented] (AIRAVATA-2943) Re-queueing and node failures in HPC clusters need to be handled in gateway middleware as resubmitting failures

    [ https://issues.apache.org/jira/browse/AIRAVATA-2943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16782158#comment-16782158 ] 

Dimuthu Upeksha commented on AIRAVATA-2943:
-------------------------------------------

Fixed in https://github.com/apache/airavata/commit/8b10120be4ce1d0720f214dc5e849d1dc862c595

> Re-queueing and node failures in HPC clusters need to be handled in gateway middleware as resubmitting failures 
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: AIRAVATA-2943
>                 URL: https://issues.apache.org/jira/browse/AIRAVATA-2943
>             Project: Airavata
>          Issue Type: Bug
>          Components: helix implementation
>    Affects Versions: 0.18
>         Environment: https://staging.ultrascan.scigap.org slurm job ID 8560 in Jetstream
>            Reporter: Eroma
>            Assignee: Dimuthu Upeksha
>            Priority: Major
>             Fix For: 0.18
>
>
> Currently in clusters (PBS and SLURM), jobs are getting re-queued due to node failures. In such scenarios the jobs are executed after re-queueing, but on the gateway side the job is marked as FAILED at the initial NODE_FAIL. 
> These types of failures need to be captured as retrying failures instead of being treated as an end result.
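
To illustrate the behaviour requested above, here is a minimal sketch in Python. It is not the change made in the referenced commit and not Airavata code: the GatewayJobState enum, the next_gateway_state function and the state mapping are all hypothetical names chosen for illustration. NODE_FAIL comes from the issue text; treating the scheduler's REQUEUED state the same way is an assumption.

```python
from enum import Enum


class GatewayJobState(Enum):
    """Hypothetical gateway-side job states (illustrative only)."""
    QUEUED = "QUEUED"
    ACTIVE = "ACTIVE"
    COMPLETE = "COMPLETE"
    FAILED = "FAILED"


# Assumption: scheduler states that mean "the job will run again",
# so they must not be recorded as a final result on the gateway side.
RETRYING_SCHEDULER_STATES = {"NODE_FAIL", "REQUEUED"}

# Assumption: a simple one-to-one mapping for the remaining scheduler states.
DIRECT_MAPPING = {
    "PENDING": GatewayJobState.QUEUED,
    "RUNNING": GatewayJobState.ACTIVE,
    "COMPLETED": GatewayJobState.COMPLETE,
    "FAILED": GatewayJobState.FAILED,
}


def next_gateway_state(current, scheduler_state):
    """Return the gateway state after seeing a scheduler state.

    A node failure or re-queue puts the job back to QUEUED instead of
    FAILED; unknown scheduler states leave the gateway state unchanged.
    """
    if scheduler_state in RETRYING_SCHEDULER_STATES:
        return GatewayJobState.QUEUED
    return DIRECT_MAPPING.get(scheduler_state, current)


if __name__ == "__main__":
    state = GatewayJobState.ACTIVE
    # Sequence described in the issue: node fails, job is re-queued, then runs.
    for seen in ["NODE_FAIL", "PENDING", "RUNNING", "COMPLETED"]:
        state = next_gateway_state(state, seen)
        print(seen, "->", state.value)
```

With this mapping, only a FAILED state reported by the scheduler ends the job as failed on the gateway side; a NODE_FAIL followed by a successful rerun ends as COMPLETE.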



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)