Posted to issues@airavata.apache.org by "Dimuthu Upeksha (JIRA)" <ji...@apache.org> on 2019/03/01 22:17:05 UTC
[jira] [Commented] (AIRAVATA-2943) Re-queueing and node failures in
HPC clusters need to be handled in gateway middleware as resubmitting
failures
[ https://issues.apache.org/jira/browse/AIRAVATA-2943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16782158#comment-16782158 ]
Dimuthu Upeksha commented on AIRAVATA-2943:
-------------------------------------------
Fixed in https://github.com/apache/airavata/commit/8b10120be4ce1d0720f214dc5e849d1dc862c595
> Re-queueing and node failures in HPC clusters need to be handled in gateway middleware as resubmitting failures
> ----------------------------------------------------------------------------------------------------------------
>
> Key: AIRAVATA-2943
> URL: https://issues.apache.org/jira/browse/AIRAVATA-2943
> Project: Airavata
> Issue Type: Bug
> Components: helix implementation
> Affects Versions: 0.18
> Environment: https://staging.ultrascan.scigap.org slurm job ID 8560 in Jetstream
> Reporter: Eroma
> Assignee: Dimuthu Upeksha
> Priority: Major
> Fix For: 0.18
>
>
> Currently in clusters (PBS and SLURM), jobs are sometimes re-queued due to node failures. In such scenarios the jobs are executed after re-queueing, but on the gateway side the job is marked as FAILED at the initial NODE_FAIL.
> These types of failures need to be treated as retryable failures rather than as a final result.
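
The behavior requested above can be illustrated with a minimal sketch. It is not the code from the linked commit and does not reflect Airavata's actual classes. The `GatewayStatus` enum, the `RETRYING` status, and both function names are hypothetical. The choice of which scheduler states count as retryable or terminal is an assumption made for illustration.

```python
from enum import Enum


class GatewayStatus(Enum):
    """Hypothetical gateway-side job statuses, for illustration only."""
    ACTIVE = "ACTIVE"
    RETRYING = "RETRYING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"


# Assumption: these scheduler states mean the job will be run again.
RETRYABLE_STATES = {"NODE_FAIL", "REQUEUED"}
# Assumption: these scheduler states are treated as final failures.
TERMINAL_FAILURE_STATES = {"FAILED", "CANCELLED", "TIMEOUT"}


def map_scheduler_state(state: str) -> GatewayStatus:
    """Map one scheduler-reported state to a gateway status."""
    state = state.strip().upper()
    if state in RETRYABLE_STATES:
        # Node failure or re-queue: keep monitoring instead of failing.
        return GatewayStatus.RETRYING
    if state in TERMINAL_FAILURE_STATES:
        return GatewayStatus.FAILED
    if state == "COMPLETED":
        return GatewayStatus.COMPLETED
    # PENDING, RUNNING and anything unrecognized: job is still in flight.
    return GatewayStatus.ACTIVE


def final_status(observed_states):
    """Walk the observed states in order and stop at the first terminal one."""
    status = GatewayStatus.ACTIVE
    for state in observed_states:
        status = map_scheduler_state(state)
        if status in (GatewayStatus.COMPLETED, GatewayStatus.FAILED):
            break
    return status
```

With this mapping, a job whose reported states are PENDING, RUNNING, NODE_FAIL, PENDING, RUNNING, COMPLETED ends as COMPLETED on the gateway side, rather than FAILED at the first NODE_FAIL.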
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)