Posted to issues@airavata.apache.org by "Dimuthu Upeksha (JIRA)" <ji...@apache.org> on 2019/03/01 22:17:05 UTC

[jira] [Closed] (AIRAVATA-2943) Re-queueing and node failures in HPC clusters need to be handled in gateway middleware as resubmitting failures

     [ https://issues.apache.org/jira/browse/AIRAVATA-2943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dimuthu Upeksha closed AIRAVATA-2943.
-------------------------------------
    Resolution: Fixed

> Re-queueing and node failures in HPC clusters need to be handled in gateway middleware as resubmitting failures 
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: AIRAVATA-2943
>                 URL: https://issues.apache.org/jira/browse/AIRAVATA-2943
>             Project: Airavata
>          Issue Type: Bug
>          Components: helix implementation
>    Affects Versions: 0.18
>         Environment: https://staging.ultrascan.scigap.org slurm job ID 8560 in Jetstream
>            Reporter: Eroma
>            Assignee: Dimuthu Upeksha
>            Priority: Major
>             Fix For: 0.18
>
>
> Currently, on PBS and SLURM clusters, jobs are re-queued by the scheduler when a node fails. In such scenarios the jobs are executed after re-queueing, but on the gateway side the job is recorded as FAILED at the initial NODE_FAIL.
> These types of failures need to be captured as retryable failures instead of being treated as the end result.
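
The behaviour requested above could be sketched roughly as follows. This is only an illustration, not the Airavata implementation: the GatewayJobStatus enum, its RETRYING value, and the map_scheduler_state and is_terminal helpers are hypothetical names invented for the example. The scheduler state strings (NODE_FAIL, REQUEUED, PENDING, RUNNING, COMPLETED, FAILED) are SLURM job states; PBS would need its own mapping.

```python
from enum import Enum


class GatewayJobStatus(Enum):
    """Hypothetical gateway-side job states (not Airavata's actual model)."""
    QUEUED = "QUEUED"
    ACTIVE = "ACTIVE"
    COMPLETE = "COMPLETE"
    FAILED = "FAILED"
    RETRYING = "RETRYING"  # assumed non-terminal state for re-queued jobs


# SLURM states that indicate the scheduler will run the job again.
TRANSIENT_STATES = {"NODE_FAIL", "REQUEUED"}

# Direct mapping for the remaining SLURM states used in this sketch.
DIRECT_MAPPING = {
    "PENDING": GatewayJobStatus.QUEUED,
    "RUNNING": GatewayJobStatus.ACTIVE,
    "COMPLETED": GatewayJobStatus.COMPLETE,
    "FAILED": GatewayJobStatus.FAILED,
}


def map_scheduler_state(state: str) -> GatewayJobStatus:
    """Map a scheduler-reported state to a gateway status.

    Node failures and re-queues are reported as RETRYING rather than FAILED,
    so monitoring continues until the scheduler reports a real end state.
    """
    normalized = state.strip().upper()
    if normalized in TRANSIENT_STATES:
        return GatewayJobStatus.RETRYING
    if normalized in DIRECT_MAPPING:
        return DIRECT_MAPPING[normalized]
    raise ValueError(f"Unrecognized scheduler state: {state!r}")


def is_terminal(status: GatewayJobStatus) -> bool:
    """Only COMPLETE and FAILED end monitoring; RETRYING does not."""
    return status in {GatewayJobStatus.COMPLETE, GatewayJobStatus.FAILED}
```

With a mapping like this, a job that reports NODE_FAIL and later COMPLETED would be tracked as RETRYING and then COMPLETE, instead of stopping at FAILED on the first node failure.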



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)