Posted to issues@airavata.apache.org by "Eroma (JIRA)" <ji...@apache.org> on 2017/08/01 14:30:01 UTC

[jira] [Comment Edited] (AIRAVATA-2388) Job ID is not returned by the cluster when airavata check for job ID

    [ https://issues.apache.org/jira/browse/AIRAVATA-2388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16108976#comment-16108976 ] 

Eroma edited comment on AIRAVATA-2388 at 8/1/17 2:29 PM:
---------------------------------------------------------

1. In this issue we did not receive the job ID in the submission step's response. (There is no wait time for this step: when the job is submitted, we get a response that normally has the job ID in it.) The next step is to run qstat/squeue with the job name and try to get the job ID.
2. The verification step makes three tries at 10-second intervals. When the job ID is still not received, the experiment is tagged as FAILED.
3. However, it seems the job was submitted and ran, because the job start and end emails were received from the system.
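
The verification behaviour described above (three tries, 10 seconds apart, then give up) can be sketched roughly as follows. This is only an illustration in Python, not Airavata's actual GFac code: the function name verify_job_id and the lookup callable (standing in for a qstat/squeue query by job name) are hypothetical.

```python
import time
from typing import Callable, Optional


def verify_job_id(
    lookup: Callable[[str], Optional[str]],
    job_name: str,
    attempts: int = 3,
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Poll for a job ID by job name.

    `lookup` is a hypothetical stand-in for a qstat/squeue query; it should
    return the job ID as a string, or None when the scheduler reports nothing.
    """
    for attempt in range(attempts):
        job_id = lookup(job_name)
        if job_id:
            return job_id
        # Wait before the next try, but not after the final one.
        if attempt < attempts - 1:
            sleep(interval_s)
    # No ID after all tries: the caller would tag the experiment as FAILED.
    return None
```

With the defaults, a job that leaves the queue (or has not yet appeared in it) within the polling window of roughly 20 seconds would never be seen, which is consistent with the symptom reported here.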

This issue needs to be investigated further:
1. Try to locate a job whose job ID was returned in the verification step (to make sure that step works).
2. Try to calculate the time gap between the actual job submission and the last verification try that didn't return the job ID. (Since we have the job start time from the email, together with the queued time we should be able to get a rough estimate.)
3. Review the job submission return message: when the job ID is not returned, what does this message contain, and is the content the same every time?

Actions to take:
1. Increase the number of verification tries and see whether the job ID is returned.
2. Change the current squeue command to sacct on SLURM machines. sacct can locate the job even if it has completed and is no longer in the queue.
3. If none of the above steps returns a job ID, delete the job. This way the SUs won't be used and unread emails won't accumulate in the email system. This is more of a clean-up step.
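
Actions 2 and 3 could look roughly like the sketch below on a SLURM machine. It is a sketch under assumptions, not the Airavata implementation: the function names and the injectable run parameter are hypothetical, the job is assumed to have a unique job name, and cancelling by name with scancel is one possible way to "delete the job" when no job ID is known. Note that sacct only searches from the start of the current day unless a start time is given.

```python
import subprocess
from typing import Callable, List, Optional

Runner = Callable[[List[str]], str]


def run_command(argv: List[str]) -> str:
    """Run a command and return its stdout (errors are not raised in this sketch)."""
    return subprocess.run(argv, capture_output=True, text=True).stdout


def find_job_id_with_sacct(job_name: str, run: Runner = run_command) -> Optional[str]:
    """Look up a job ID by name in SLURM accounting, including finished jobs."""
    out = run([
        "sacct",
        "--name=" + job_name,
        "--allocations",   # one line per job, no job steps
        "--noheader",
        "--parsable2",     # '|'-delimited fields
        "--format=JobID,JobName,State",
    ])
    for line in out.splitlines():
        fields = line.strip().split("|")
        if len(fields) >= 2 and fields[0] and fields[1] == job_name:
            return fields[0]
    return None


def locate_or_cancel(job_name: str, run: Runner = run_command) -> Optional[str]:
    """Return the job ID if sacct finds it; otherwise cancel by name as clean-up."""
    job_id = find_job_id_with_sacct(job_name, run)
    if job_id is None:
        # Assumption: job names are unique enough that this is safe.
        run(["scancel", "--name=" + job_name])
    return job_id
```

Passing the command runner in as a parameter keeps the lookup logic testable without access to a cluster.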



> Job ID is not returned by the cluster  when airavata check for job ID 
> ----------------------------------------------------------------------
>
>                 Key: AIRAVATA-2388
>                 URL: https://issues.apache.org/jira/browse/AIRAVATA-2388
>             Project: Airavata
>          Issue Type: Sub-task
>          Components: Airavata Job Monitor, GFac
>    Affects Versions: 0.17
>            Reporter: Eroma
>             Fix For: 0.18
>
>
> When Airavata waits for a job ID, the ID is not returned, although the job was actually submitted and executed on the cluster. The error in the logs looks like [1]. Emails are sent, but since we have already tagged the experiment as failed, Airavata is not monitoring for the emails.
> [1]
> org.apache.airavata.gfac.core.GFacException: Error: userFriendly msg :Error while executing JOB_SUBMISSION task, actual msg :expId: h2o_9d4058c6-219c-4a10-911c-f99f605eba3f, processId: PROCESS_2e82822e-3f30-4981-bd4b-c9d2a7ac355a, taskId: TASK_1c52403c-7966-4f8b-b6d8-290f7516c56e, type: JOB_SUBMISSION :- JOB_SUBMISSION failed. Reason: Couldn't find job id in both submitted and verified steps



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)