Posted to dev@airavata.apache.org by Lahiru Gunathilake <gl...@gmail.com> on 2014/08/13 08:57:51 UTC

Experiment Cancellation

Hi All,

I have a few concerns about experiment cancellation. When we want to cancel
an experiment we have to run a particular command on the computing
resource. Different computing resources report the job status of cancelled
jobs in different ways. Ex: trestles shows cancelled jobs as completed,
some other machines show them as cancelled, and some might show them as
failed.

I think we should replicate this information in the JobDetails object as
the Job status and make sure the Experiment and Task statuses are set to
cancelled. The other approach is that when we cancel, we explicitly mark all
the states in the experiment model (experiment, task and job states) as
cancelled and manually handle the state we get from the computing
resource.

My concern is: should we really hide the information shown by the computing
resource from the Job status we are storing in the registry? Or should we
leave it as it is and handle the other statuses to represent the cancelled
experiments? If we mark everything as cancelled there will be an
inconsistency in the JobStatus.

WDYT ?

Lahiru

-- 
System Analyst Programmer
PTI Lab
Indiana University
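
The first approach in the message above (keep the resource-reported job
status, set the Experiment and Task statuses to cancelled) could look
roughly like this. This is only a sketch: ModelState and
CancelledJobRecord are hypothetical names, not Airavata's data model.

```java
// Hypothetical types for illustration; not Airavata's actual data model.
enum ModelState { EXECUTING, CANCELLING, CANCELLED }

final class CancelledJobRecord {
    final String reportedJobStatus;   // exactly what the computing resource reported
    final ModelState taskState;       // set by Airavata
    final ModelState experimentState; // set by Airavata

    private CancelledJobRecord(String reportedJobStatus) {
        this.reportedJobStatus = reportedJobStatus;
        this.taskState = ModelState.CANCELLED;
        this.experimentState = ModelState.CANCELLED;
    }

    // Builds the record once the cancel command has gone through.
    static CancelledJobRecord afterCancel(String reportedJobStatus) {
        return new CancelledJobRecord(reportedJobStatus);
    }
}

class CancelledJobRecordDemo {
    public static void main(String[] args) {
        // A trestles-like resource: the cancelled job shows up as "completed".
        CancelledJobRecord r = CancelledJobRecord.afterCancel("completed");
        System.out.println(r.reportedJobStatus + " / " + r.taskState + " / " + r.experimentState);
    }
}
```

The job status stays machine-specific, while the Task and Experiment
statuses give the client one consistent answer.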

Re: Experiment Cancellation

Posted by Lahiru Gunathilake <gl...@gmail.com>.

Thanks for the responses. I will take these points into consideration
during the cancel implementation.

Lahiru


On Wed, Aug 13, 2014 at 7:33 PM, Eroma Abeysinghe <
eroma.abeysinghe@gmail.com> wrote:

> My questions and thoughts on Experiment cancellation:
> 1. What are we going to do with the output or partial output of the job at
> the time of cancelling?
>     Are we going to discard it or make it available for the experiment? Are
> we safekeeping all the job information and messages on CANCELLED jobs, or
> discarding them as well?
>
> 2. Are we going to allow editing of CANCELLED or CANCELLING experiments?
> IMO we should not, because editing is only required if the experiment is
> going to be re-launched.
>
> 3. With the existing experiment and job states we need to decide which
> can be CANCELLED.
> Out of the Airavata Experiment states, cancellation should be allowed for
> the states:
> CREATED
> VALIDATED
> SCHEDULED
> LAUNCHED
> EXECUTING
> Cancellation should be communicated to resources if the job state is:
> SUBMITTED
> SETUP
> QUEUED
> ACTIVE
> HELD
>
> There is a SUSPENDED state in both experiment and job, but is this a
> currently active state?
>
> 4. Cloning will be available for CANCELLED and CANCELLING experiments.
>
> 5. In the Experiment Summary we should display any errors that took place
> in the cancelling process.
>
> On Wed, Aug 13, 2014 at 9:01 AM, Marlon Pierce <ma...@iu.edu> wrote:
>
>> There is an advantage for task (or job) state to capture the information
>> that really comes from the machine (completed, cancelled, failed, etc), and
>> for experiment state to be set to canceled by Airavata.  That is, there
>> should be parts of Airavata that capture machine-specific state information
>> about the job for logging/auditing purposes.
>>
>> * Airavata issues "cancel" command to job in "launched" or "executing"
>> state.
>>
>> * Airavata confirms that the job has left the queue or is no longer
>> executing. This could be machine-specific, but the main question is "has
>> the job left the queue?" or "is the job no longer in executing state?"  I
>> don't think it is "if this is trestles, and since we issued a qdel command,
>> is the job marked as completed; or if this is stampede, is the job now
>> marked as failed?"
>>
>> * If the job cancel works, Airavata marks this as canceled.
>>
>> * If cancel fails for some reason, don't change the Experiment state but
>> throw an error.
>>
>>
>> Marlon
>>
>>
>>
>
>
> --
> Thank You,
> Best Regards,
> Eroma
>
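
The two state lists in point 3 of the quoted message can be written down
as lookup sets. A minimal sketch: the state names come from the thread,
but the enums and the CancelPolicy class are hypothetical, and SUSPENDED
is left out of both sets because the thread leaves that question open.

```java
import java.util.EnumSet;
import java.util.Set;

// State names are taken from the thread; the enums themselves are illustrative.
enum ExperimentState {
    CREATED, VALIDATED, SCHEDULED, LAUNCHED, EXECUTING,
    CANCELLING, CANCELLED, COMPLETED, FAILED, SUSPENDED
}

enum JobState {
    SUBMITTED, SETUP, QUEUED, ACTIVE, HELD,
    COMPLETED, CANCELLED, FAILED, SUSPENDED
}

final class CancelPolicy {
    // Experiment states from which a cancel request would be accepted.
    static final Set<ExperimentState> CANCELLABLE = EnumSet.of(
            ExperimentState.CREATED, ExperimentState.VALIDATED, ExperimentState.SCHEDULED,
            ExperimentState.LAUNCHED, ExperimentState.EXECUTING);

    // Job states for which the cancel must also be sent to the computing resource.
    static final Set<JobState> NOTIFY_RESOURCE = EnumSet.of(
            JobState.SUBMITTED, JobState.SETUP, JobState.QUEUED,
            JobState.ACTIVE, JobState.HELD);

    private CancelPolicy() {}

    static boolean canCancel(ExperimentState s) { return CANCELLABLE.contains(s); }

    static boolean mustNotifyResource(JobState s) { return NOTIFY_RESOURCE.contains(s); }
}
```

For example, CancelPolicy.canCancel(ExperimentState.EXECUTING) is true,
while CancelPolicy.canCancel(ExperimentState.CANCELLING) is false.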



-- 
System Analyst Programmer
PTI Lab
Indiana University
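
The cancel flow Marlon outlines in the quoted text (issue the cancel,
confirm the job has left the queue, then mark the experiment cancelled or
throw an error) might be sketched as follows. Every name here
(ComputeResource, ExperimentRecord, Canceller, CancelFailedException) is
hypothetical and only illustrates the idea; it is not Airavata code.

```java
// Hypothetical view of a computing resource; not an Airavata interface.
interface ComputeResource {
    void cancel(String jobId);               // e.g. issues the scheduler's delete command
    String reportedState(String jobId);      // raw, machine-specific job state
    boolean isQueuedOrRunning(String jobId); // "has the job left the queue?"
}

enum ExperimentState { LAUNCHED, EXECUTING, CANCELLED }

class CancelFailedException extends RuntimeException {
    CancelFailedException(String message) { super(message); }
}

final class ExperimentRecord {
    ExperimentState experimentState = ExperimentState.EXECUTING;
    String reportedJobState; // kept as reported, for logging/auditing
}

final class Canceller {
    private Canceller() {}

    static void cancel(ExperimentRecord exp, String jobId, ComputeResource resource) {
        resource.cancel(jobId);
        // Record whatever the machine says, without interpreting it.
        exp.reportedJobState = resource.reportedState(jobId);
        if (resource.isQueuedOrRunning(jobId)) {
            // Cancel did not take effect: leave the experiment state untouched.
            throw new CancelFailedException("job " + jobId + " is still queued or running");
        }
        exp.experimentState = ExperimentState.CANCELLED;
    }
}
```

The only machine-independent question asked is whether the job is still
queued or running; the raw state string is stored but never used to decide
the experiment state.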

Re: Experiment Cancellation

Posted by Raminder Singh <ra...@gmail.com>.

Thanks Lahiru. 

I will give this a try and test for different cases. 

Raminder

On Aug 19, 2014, at 5:42 AM, Lahiru Gunathilake <gl...@gmail.com> wrote:

> Hi All,
> 
> I have committed the initial version of Experiment cancelling.
>
> Experiment cancel is an Airavata API method which can be invoked by the Airavata client. The request reaches GFac Provider-level cancellation only if the job has already been submitted to the computing resource; otherwise it is handled by the orchestrator.
>
> If a cancel request comes for an Experiment that is already completed, failed or cancelling, the cancel operation fails and an error is thrown to the client.
>
> If the job is marked cancelled successfully, the experiment launch execution (the launchExperiment operation, which runs in a separate thread) is stopped at the next plugin invocation. Ex: if GFac is running Handler1 when the cancel arrives, the experiment launch execution is stopped before the next plugin invocation.
> Limitation: if there are 500 file transfers in the Input Handlers (currently transferring file number 100) and the user cancels the experiment during that step, the rest of the files will still be transferred, and the original execution will be cancelled before the next plugin. (If we want to download partial outputs we have to modify this logic.) Either the GFac framework handles the cancel (that's what we have now), or the framework just tries to execute all the plugins and each plugin implementation listens for a cancellation of that particular execution and acts accordingly.
>
> If the job is already submitted and GFac is monitoring the job, it is cancelled by invoking the provider's cancel operation. Experiment statuses, Task statuses and Job statuses are updated properly, and monitoring is stopped for jobs whose monitoring results show a terminating Job status.
>
> When there are multiple GFac instances, the original experiment launch request can go to GFac Node1 (a separate JVM) and the cancel request doesn't have to go to the same GFac node. The orchestrator handles this scenario and makes the job cancel request succeed, and the experiment launch is stopped as explained above.
>
> During a GFac node failure there could be job launches and job cancel executions in progress on that instance. The orchestrator will route both types of requests to an available GFac node and recover the executions.
>
> I have a known issue to fix: when I run the cancel operation, GFac-level authentication sometimes fails. I will try to find out what is happening. This problem comes up from time to time and I am not sure whether it is related to the cancel feature or has something to do with trestles.
> 
> Regards
> Lahiru

Re: Experiment Cancellation

Posted by Lahiru Gunathilake <gl...@gmail.com>.

Hi All,

I have committed the initial version of Experiment cancelling.

Experiment cancel is an Airavata API method which can be invoked by the
Airavata client. The request reaches GFac Provider-level cancellation
only if the job has already been submitted to the computing resource;
otherwise it is handled by the orchestrator.

If a cancel request comes for an Experiment that is already completed,
failed or cancelling, the cancel operation fails and an error is thrown to
the client.
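
The routing and rejection rules in the two paragraphs above could be
expressed roughly as follows. This is an illustrative sketch, not the
committed code: CancelRouter, Route and the reduced ExperimentState enum
are hypothetical names.

```java
// Reduced, hypothetical state list; only the states needed for the sketch.
enum ExperimentState { CREATED, LAUNCHED, EXECUTING, CANCELLING, COMPLETED, FAILED }

final class CancelRouter {
    enum Route { GFAC_PROVIDER, ORCHESTRATOR }

    private CancelRouter() {}

    // Decides which component handles a cancel request, or rejects it.
    static Route route(ExperimentState state, boolean jobSubmitted) {
        switch (state) {
            case COMPLETED:
            case FAILED:
            case CANCELLING:
                // The cancel operation fails and the error goes back to the client.
                throw new IllegalStateException("cannot cancel experiment in state " + state);
            default:
                // Provider-level cancel only once the job is on the computing resource.
                return jobSubmitted ? Route.GFAC_PROVIDER : Route.ORCHESTRATOR;
        }
    }
}
```

IllegalStateException stands in for whatever error type the API actually
returns to the client.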

If the job is marked cancelled successfully, the experiment launch execution
(the launchExperiment operation, which runs in a separate thread) is stopped
at the next plugin invocation. Ex: if GFac is running Handler1 when the
cancel arrives, the experiment launch execution is stopped before the next
plugin invocation.
Limitation: if there are 500 file transfers in the Input Handlers (currently
transferring file number 100) and the user cancels the experiment during
that step, the rest of the files will still be transferred, and the original
execution will be cancelled before the next plugin. (If we want to download
partial outputs we have to modify this logic.) Either the GFac framework
handles the cancel (that's what we have now), or the framework just tries to
execute all the plugins and each plugin implementation listens for a
cancellation of that particular execution and acts accordingly.
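
The framework-level behaviour described above, where the cancel flag is
checked only between plugin invocations, can be sketched like this.
Handler and HandlerChain are hypothetical names used for illustration and
are not the GFac classes.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical plugin interface; one invocation may do a lot of work
// (for example, transfer 500 files).
interface Handler {
    void invoke();
}

final class HandlerChain {
    private final List<Handler> handlers;
    private final AtomicBoolean cancelled = new AtomicBoolean(false);

    HandlerChain(List<Handler> handlers) {
        this.handlers = handlers;
    }

    // May be called from another thread while run() is in progress.
    void requestCancel() {
        cancelled.set(true);
    }

    // Runs handlers in order and returns how many of them actually ran.
    int run() {
        int ran = 0;
        for (Handler h : handlers) {
            if (cancelled.get()) {
                break; // checked only between handlers
            }
            h.invoke(); // a running handler is never interrupted
            ran++;
        }
        return ran;
    }
}
```

Because the flag is read only before each handler starts, a handler that
is already running finishes all of its work, which is the file-transfer
limitation noted above. The alternative mentioned in the message would
hand the flag to each handler so that it can check it between files.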

If the job is already submitted and GFac is monitoring the job, it is
cancelled by invoking the provider's cancel operation. Experiment statuses,
Task statuses and Job statuses are updated properly, and monitoring is
stopped for jobs whose monitoring results show a terminating Job status.

When there are multiple GFac instances, the original experiment launch
request can go to GFac Node1 (a separate JVM) and the cancel request doesn't
have to go to the same GFac node. The orchestrator handles this scenario and
makes the job cancel request succeed, and the experiment launch is stopped
as explained above.

During a GFac node failure there could be job launches and job cancel
executions in progress on that instance. The orchestrator will route both
types of requests to an available GFac node and recover the executions.

I have a known issue to fix: when I run the cancel operation, GFac-level
authentication sometimes fails. I will try to find out what is happening.
This problem comes up from time to time and I am not sure whether it is
related to the cancel feature or has something to do with trestles.

Regards
Lahiru

On Mon, Aug 18, 2014 at 7:13 PM, Lahiru Gunathilake <gl...@gmail.com>
wrote:

> Hi Marlon,
>
> I should be able to wrap up later today or early tomorrow.
>
> Regards
> Lahiru

-- 
System Analyst Programmer
PTI Lab
Indiana University

Re: Experiment Cancellation

Posted by Lahiru Gunathilake <gl...@gmail.com>.

Hi Marlon,

I should be able to wrap up later today or early tomorrow.

Regards
Lahiru


On Mon, Aug 18, 2014 at 7:01 PM, Marlon Pierce <ma...@iu.edu> wrote:

> How goes the implementation?
>
> Marlon


-- 
System Analyst Programmer
PTI Lab
Indiana University

Re: Experiment Cancellation

Posted by Marlon Pierce <ma...@iu.edu>.

How goes the implementation?

Marlon

On 8/13/14, 11:09 PM, Lahiru Gunathilake wrote:
> Thank you very much for all the inputs! I will take these into
> consideration.
>
> Regards
> Lahiru

Re: Experiment Cancellation

Posted by Lahiru Gunathilake <gl...@gmail.com>.

Thank you very much for all the inputs! I will take these into
consideration.

Regards
Lahiru


On Wed, Aug 13, 2014 at 10:31 PM, Miller, Mark <mm...@sdsc.edu> wrote:

> If I understand this correctly, I want to offer some input from our
> experience with CIPRES.
>
> Currently, if a CIPRES user wishes to cancel a job, they must delete the
> entire job, and therefore all ability to view the input and other files
> used becomes unavailable.
>
> This is not an ideal solution.
>
> There is value to the user in being able to see partially completed
> results, or even the input files they used.
>
> So I would vote for making partial output of the job available as an
> option.
>
> Any additional information you can provide about status would be useful,
> especially for folks who are debugging failures.
>
> Just my 2c.
>
> Mark
>
>
>
> *From:* Eroma Abeysinghe [mailto:eroma.abeysinghe@gmail.com]
> *Sent:* Wednesday, August 13, 2014 7:04 AM
> *To:* dev@airavata.apache.org
> *Subject:* Re: Experiment Cancellation
>
>
>
> My questions and thoughts on Experiment cancellation
> 1. What are we going to do for output or partial output of the job at the
> time of cancelling?
>     Are we going to discard or make them available for the experiment. Are
> we safe keeping all the job information, messages on CANCELLED jobs or
> discard them as well?
>
> 2. Are we going to allow editing for CANCELLED or CANCELLING experiments?
> IMO we should not. because allowing editing is required if its going to
> Re-launch.
>
> 3. With existing experiment and job states we need to decide which are
> going to be CANCELLED
> Out of Airavata Experiment states Cancellation should be allowed for
> states;
> CREATED
> VALIDATED
> SCHEDULED
> LAUNCHED
> EXECUTING
> Cancellation should be communicated to resources if the job states are;
> SUBMITTED
> SETUP
> QUEUED
> ACTIVE
> HELD
>
>
> There is SUSPENDED state in both experiment and job but is this a
> currently active state?
>
> 4. Cloning will be available for CANCELLED and CANCELLING experiments.
>
> 5. In Experiment Summary we should display any errors took place in
> cancelling process
>
>
>
>
>
> On Wed, Aug 13, 2014 at 9:01 AM, Marlon Pierce <ma...@iu.edu> wrote:
>
> There is an advantage for task (or job) state to capture the information
> that really comes from the machine (completed, cancelled, failed, etc), and
> for experiment state to be set to canceled by Airavata.  That is, there
> should be parts of Airavata that capture machine-specific state information
> about the job for logging/auditing purposes.
>
> * Airavata issues "cancel" command to job in "launched" or "executing"
> state.
>
> * Airavata confirms that the job has left the queue or is no longer
> executing. This could be machine-specific, but the main question is "has
> the job left the queue?" or "is the job no longer in executing state?"  I
> don't think it is "if this is trestles, and since we issued a qdel command,
> is the job marked as completed; of if this is stampede, is the job now
> marked as failed?"
>
> * If the job cancel works, the Airavata marks this as canceled.
>
> * If cancel fails for some reason, don't change the Experiment state but
> throw an error.
>
>
> Marlon
>
>
>
> On 8/13/14, 2:57 AM, Lahiru Gunathilake wrote:
>
> Hi All,
>
> I have few concerns about experiment cancellation. When we want to cancel
> and experiment we have to run a particular command in the computing
> resource. Based on the computing resource different resources show the job
> status of the cancelled jobs in a different way. Ex: trestles shows the
> cancelled jobs as completed, some other machines show it as as cancelled,
> some might show it as failed.
>
> I think we should replicated this information in the JobDetails object as
> the Job status and make sure the Experiments and Task statuses as
> cancelled. The other approach is when we cancel we explicitly make all the
> states in the experiment model (experiments,tasks,job states as cancelled)
> as cancelled and manually handle the state we get from the computing
> resource.
>
> My concerns should we really hide that information shown in the computing
> resource from the Job status we are storing in to the registry ? or leave
> it as it is and handle other statuses to represent the cancelled
> experiments ? If we make everything cancel there will be inconsistency in
> the JobStatus.
>
> WDYT ?
>
> Lahiru
>
>
>
>
>
>
> --
>
> Thank You,
>
> Best Regards,
>
> Eroma
>



-- 
System Analyst Programmer
PTI Lab
Indiana University

RE: Experiment Cancellation

Posted by "Miller, Mark" <mm...@sdsc.edu>.

If I understand this correctly, I want to offer some input from our experience with CIPRES.
Currently, if a CIPRES user wishes to cancel a job, they must delete the entire job, so they lose all ability to view the input and other files it used.
This is not an ideal solution.

There is value to the user in being able to see partially completed results, or even the input files they used.

So I would vote for making partial output of the job available as an option.
Any additional information you can provide about status would be useful, especially for folks who are debugging failures.

Just my 2c.

Mark

From: Eroma Abeysinghe [mailto:eroma.abeysinghe@gmail.com]
Sent: Wednesday, August 13, 2014 7:04 AM
To: dev@airavata.apache.org
Subject: Re: Experiment Cancellation

My questions and thoughts on Experiment cancellation
1. What are we going to do for output or partial output of the job at the time of cancelling?
    Are we going to discard or make them available for the experiment. Are we safe keeping all the job information, messages on CANCELLED jobs or discard them as well?

2. Are we going to allow editing for CANCELLED or CANCELLING experiments?
IMO we should not. because allowing editing is required if its going to Re-launch.

3. With existing experiment and job states we need to decide which are going to be CANCELLED
Out of Airavata Experiment states Cancellation should be allowed for states;
CREATED
VALIDATED
SCHEDULED
LAUNCHED
EXECUTING
Cancellation should be communicated to resources if the job states are;
SUBMITTED
SETUP
QUEUED
ACTIVE
HELD


There is SUSPENDED state in both experiment and job but is this a currently active state?

4. Cloning will be available for CANCELLED and CANCELLING experiments.

5. In Experiment Summary we should display any errors took place in cancelling process

On Wed, Aug 13, 2014 at 9:01 AM, Marlon Pierce <ma...@iu.edu> wrote:
There is an advantage for task (or job) state to capture the information that really comes from the machine (completed, cancelled, failed, etc), and for experiment state to be set to canceled by Airavata.  That is, there should be parts of Airavata that capture machine-specific state information about the job for logging/auditing purposes.

* Airavata issues "cancel" command to job in "launched" or "executing" state.

* Airavata confirms that the job has left the queue or is no longer executing. This could be machine-specific, but the main question is "has the job left the queue?" or "is the job no longer in executing state?"  I don't think it is "if this is trestles, and since we issued a qdel command, is the job marked as completed; of if this is stampede, is the job now marked as failed?"

* If the job cancel works, the Airavata marks this as canceled.

* If cancel fails for some reason, don't change the Experiment state but throw an error.


Marlon


On 8/13/14, 2:57 AM, Lahiru Gunathilake wrote:
Hi All,

I have few concerns about experiment cancellation. When we want to cancel
and experiment we have to run a particular command in the computing
resource. Based on the computing resource different resources show the job
status of the cancelled jobs in a different way. Ex: trestles shows the
cancelled jobs as completed, some other machines show it as as cancelled,
some might show it as failed.

I think we should replicated this information in the JobDetails object as
the Job status and make sure the Experiments and Task statuses as
cancelled. The other approach is when we cancel we explicitly make all the
states in the experiment model (experiments,tasks,job states as cancelled)
as cancelled and manually handle the state we get from the computing
resource.

My concerns should we really hide that information shown in the computing
resource from the Job status we are storing in to the registry ? or leave
it as it is and handle other statuses to represent the cancelled
experiments ? If we make everything cancel there will be inconsistency in
the JobStatus.

WDYT ?

Lahiru




--
Thank You,
Best Regards,
Eroma

Re: Experiment Cancellation

Posted by Eroma Abeysinghe <er...@gmail.com>.

My questions and thoughts on Experiment cancellation
1. What are we going to do with the output, or partial output, of the job at
the time of cancelling?
    Are we going to discard it or make it available for the experiment? Are
we going to keep all the job information and messages for CANCELLED jobs, or
discard them as well?

2. Are we going to allow editing of CANCELLED or CANCELLING experiments?
IMO we should not, because editing is only required if the experiment is
going to be re-launched.

3. With the existing experiment and job states, we need to decide which can
be CANCELLED.
Of the Airavata experiment states, cancellation should be allowed for these:
CREATED
VALIDATED
SCHEDULED
LAUNCHED
EXECUTING
Cancellation should be communicated to resources if the job state is one of:
SUBMITTED
SETUP
QUEUED
ACTIVE
HELD

There is a SUSPENDED state in both experiment and job, but is this a currently
active state?

4. Cloning will be available for CANCELLED and CANCELLING experiments.

5. The Experiment Summary should display any errors that took place during
the cancelling process.
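
The state rules in point 3 could be written down roughly as follows. This is
only a sketch: the state names come from the lists above, while the set and
function names (CANCELLABLE_EXPERIMENT_STATES, RESOURCE_NOTIFY_JOB_STATES,
can_cancel, must_notify_resource) are made up for illustration and are not
Airavata APIs. SUSPENDED is left out because its status is still an open
question.

```python
# State names are taken from the lists in point 3; all identifiers are hypothetical.
CANCELLABLE_EXPERIMENT_STATES = {
    "CREATED", "VALIDATED", "SCHEDULED", "LAUNCHED", "EXECUTING",
}
RESOURCE_NOTIFY_JOB_STATES = {
    "SUBMITTED", "SETUP", "QUEUED", "ACTIVE", "HELD",
}


def can_cancel(experiment_state):
    """Return True if an experiment in this state may be cancelled."""
    return experiment_state in CANCELLABLE_EXPERIMENT_STATES


def must_notify_resource(job_state):
    """Return True if the cancel has to be sent to the computing resource."""
    return job_state in RESOURCE_NOTIFY_JOB_STATES
```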

On Wed, Aug 13, 2014 at 9:01 AM, Marlon Pierce <ma...@iu.edu> wrote:

> There is an advantage for task (or job) state to capture the information
> that really comes from the machine (completed, cancelled, failed, etc), and
> for experiment state to be set to canceled by Airavata.  That is, there
> should be parts of Airavata that capture machine-specific state information
> about the job for logging/auditing purposes.
>
> * Airavata issues "cancel" command to job in "launched" or "executing"
> state.
>
> * Airavata confirms that the job has left the queue or is no longer
> executing. This could be machine-specific, but the main question is "has
> the job left the queue?" or "is the job no longer in executing state?"  I
> don't think it is "if this is trestles, and since we issued a qdel command,
> is the job marked as completed; of if this is stampede, is the job now
> marked as failed?"
>
> * If the job cancel works, the Airavata marks this as canceled.
>
> * If cancel fails for some reason, don't change the Experiment state but
> throw an error.
>
>
> Marlon
>
>
> On 8/13/14, 2:57 AM, Lahiru Gunathilake wrote:
>
>> Hi All,
>>
>> I have few concerns about experiment cancellation. When we want to cancel
>> and experiment we have to run a particular command in the computing
>> resource. Based on the computing resource different resources show the job
>> status of the cancelled jobs in a different way. Ex: trestles shows the
>> cancelled jobs as completed, some other machines show it as as cancelled,
>> some might show it as failed.
>>
>> I think we should replicated this information in the JobDetails object as
>> the Job status and make sure the Experiments and Task statuses as
>> cancelled. The other approach is when we cancel we explicitly make all the
>> states in the experiment model (experiments,tasks,job states as cancelled)
>> as cancelled and manually handle the state we get from the computing
>> resource.
>>
>> My concerns should we really hide that information shown in the computing
>> resource from the Job status we are storing in to the registry ? or leave
>> it as it is and handle other statuses to represent the cancelled
>> experiments ? If we make everything cancel there will be inconsistency in
>> the JobStatus.
>>
>> WDYT ?
>>
>> Lahiru
>>
>>
>


-- 
Thank You,
Best Regards,
Eroma

Re: Experiment Cancellation

Posted by Raminder Singh <ra...@gmail.com>.

We can’t depend on the queue status, as it is different on different machines and none of the machines reports that the job was canceled (see examples below). As Airavata is managing the job and got the cancel request from the user, Airavata should mark the job status as canceled, along with the task and experiment statuses, on a successful attempt.

If the job was canceled in the queued state, we don’t have stdout/stderr, and in the running state stdout/stderr will not contain any detail showing that the job was canceled. As we discussed, when we successfully cancel the job, we should mark the job status as canceled and stop monitoring the job.

In the case of UltraScan, we don’t want to run output handlers. Other gateways may need some outputs to be retrieved, and that can be handled with an API flag. According to my understanding, the simple workflow steps are as follows. Please add to this if I missed anything.

1. User calls job cancel with the intermediate-outputs flag set to false.
2. Validator checks the current status.
	2.A. If the status is executing:
		1. Call the job cancel function from the orchestrator.
		2. On success, remove the job from the queue viewer or mark the status as canceled.
		3. If the job status is canceled and the flag is false, we don’t call the output handler.
		4. If the intermediate flag is true, search for stdout/stderr.
	2.B. If the status is anything else, the API returns an exception saying that the operation is not allowed.
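
A rough sketch of these steps, under stated assumptions: the job is a plain
dict, and cancel_on_resource and fetch_outputs are hypothetical callables
standing in for the orchestrator's cancel function and the output handler.
None of these names are real Airavata APIs. The steps do not say what happens
when the cancel attempt fails, so the sketch simply leaves the job unchanged
in that case.

```python
class OperationNotAllowed(Exception):
    """Raised when cancel is requested in a state that does not allow it (step 2.B)."""


def cancel_job(job, cancel_on_resource, fetch_outputs, intermediate_outputs=False):
    # Step 2: validate the current status.
    if job["status"] != "EXECUTING":
        # Step 2.B: any other status is rejected.
        raise OperationNotAllowed("cancel not allowed in state " + job["status"])

    # Step 2.A.1: ask the (hypothetical) orchestrator hook to cancel the job.
    if not cancel_on_resource(job["id"]):
        return job  # assumption: a failed attempt leaves the job untouched

    # Step 2.A.2: mark the job canceled and stop monitoring it.
    job["status"] = "CANCELED"
    job["monitored"] = False

    # Steps 2.A.3 and 2.A.4: only look for stdout/stderr when the flag is set.
    if intermediate_outputs:
        job["outputs"] = fetch_outputs(job["id"])
    return job
```

With the flag left at false, a gateway such as UltraScan would get a canceled
job and no output handling at all; a gateway that wants partial results would
pass intermediate_outputs=True.
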
Thanks
Raminder

Trestles >> 
[us3@trestles-login1 ~]$ qstat -u us3

trestles-fe1.local:
                                                                                  Req'd    Req'd       Elap
Job ID                  Username    Queue    Jobname          SessID  NDS   TSK   Memory   Time    S   Time
----------------------- ----------- -------- ---------------- ------ ----- ------ ------ --------- - ---------
2242884.trestles-fe1.l  us3         shared   A1613788797         --      2     64    --   00:30:00 Q       --
[us3@trestles-login1 ~]$ qdel 2242884
[us3@trestles-login1 ~]$ qstat -u us3

trestles-fe1.local:
                                                                                  Req'd    Req'd       Elap
Job ID                  Username    Queue    Jobname          SessID  NDS   TSK   Memory   Time    S   Time
----------------------- ----------- -------- ---------------- ------ ----- ------ ------ --------- - ---------
2242884.trestles-fe1.l  us3         shared   A1613788797           0     2     64    --   00:30:00 R  00:00:05

[us3@trestles-login1 ~]$ qstat -u us3

trestles-fe1.local:
                                                                                  Req'd    Req'd       Elap
Job ID                  Username    Queue    Jobname          SessID  NDS   TSK   Memory   Time    S   Time
----------------------- ----------- -------- ---------------- ------ ----- ------ ------ --------- - ---------
2242884.trestles-fe1.l  us3         shared   A1613788797       10302     2     64    --   00:30:00 C       --


Stampede >>
us3@login4.stampede ~ $ squeue -u us3
             JOBID   PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           3897023      normal A8020068      us3 PD       0:00      2 (Priority)
us3@login4.stampede ~ $ scancel 3897023
us3@login4.stampede ~ $ squeue -u us3
             JOBID   PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

Lonestar >>
us3@lonestar ~ $ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
2109621 0.00000 A619522656 us3          qw    08/13/2014 09:44:43                                   24
us3@lonestar ~ $ qdel 2109621
us3 has deleted job 2109621
us3@lonestar ~ $ qstat
us3@lonestar ~ $

Alamo >>
us3@alamo ~ $ qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
193052.alamo              1967556229       us3                    0 R default
us3@alamo ~ $ qdel 193052
us3@alamo ~ $ qstat
us3@alamo ~ $
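
Going by the listings above, a machine-independent check could treat a job as
gone when it no longer appears in the listing, or when a PBS-style listing
shows it with state C as on Trestles. The helper below is a heuristic sketch
only: the name has_left_queue, the finished_codes parameter, and the
assumption that the state letter sits in the second-to-last column are
illustrative and hold only for the Trestles-style output shown here. Other
schedulers lay out their columns differently.

```python
def has_left_queue(listing, job_id, finished_codes=("C",)):
    """Return True if job_id is absent from the listing or shown as finished."""
    for line in listing.splitlines():
        fields = line.split()
        if not fields:
            continue
        # Job IDs may carry a host suffix, e.g. "2242884.trestles-fe1.l".
        if fields[0] == job_id or fields[0].startswith(job_id + "."):
            # Assumption: the state letter is the second-to-last column,
            # as in the Trestles "qstat -u" output above.
            return len(fields) >= 2 and fields[-2] in finished_codes
    # No row for this job: it has left the queue.
    return True
```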


On Aug 13, 2014, at 9:01 AM, Marlon Pierce <ma...@iu.edu> wrote:

> There is an advantage for task (or job) state to capture the information that really comes from the machine (completed, cancelled, failed, etc), and for experiment state to be set to canceled by Airavata.  That is, there should be parts of Airavata that capture machine-specific state information about the job for logging/auditing purposes.
> 
> * Airavata issues "cancel" command to job in "launched" or "executing" state.
> 
> * Airavata confirms that the job has left the queue or is no longer executing. This could be machine-specific, but the main question is "has the job left the queue?" or "is the job no longer in executing state?"  I don't think it is "if this is trestles, and since we issued a qdel command, is the job marked as completed; of if this is stampede, is the job now marked as failed?"
> 
> * If the job cancel works, the Airavata marks this as canceled.
> 
> * If cancel fails for some reason, don't change the Experiment state but throw an error.
> 
> 
> Marlon
> 
> On 8/13/14, 2:57 AM, Lahiru Gunathilake wrote:
>> Hi All,
>> 
>> I have few concerns about experiment cancellation. When we want to cancel
>> and experiment we have to run a particular command in the computing
>> resource. Based on the computing resource different resources show the job
>> status of the cancelled jobs in a different way. Ex: trestles shows the
>> cancelled jobs as completed, some other machines show it as as cancelled,
>> some might show it as failed.
>> 
>> I think we should replicated this information in the JobDetails object as
>> the Job status and make sure the Experiments and Task statuses as
>> cancelled. The other approach is when we cancel we explicitly make all the
>> states in the experiment model (experiments,tasks,job states as cancelled)
>> as cancelled and manually handle the state we get from the computing
>> resource.
>> 
>> My concerns should we really hide that information shown in the computing
>> resource from the Job status we are storing in to the registry ? or leave
>> it as it is and handle other statuses to represent the cancelled
>> experiments ? If we make everything cancel there will be inconsistency in
>> the JobStatus.
>> 
>> WDYT ?
>> 
>> Lahiru
>> 
>

Re: Experiment Cancellation

Posted by Marlon Pierce <ma...@iu.edu>.

There is an advantage for task (or job) state to capture the information 
that really comes from the machine (completed, cancelled, failed, etc), 
and for experiment state to be set to canceled by Airavata.  That is, 
there should be parts of Airavata that capture machine-specific state 
information about the job for logging/auditing purposes.

* Airavata issues "cancel" command to job in "launched" or "executing" 
state.

* Airavata confirms that the job has left the queue or is no longer
executing. This could be machine-specific, but the main question is "has
the job left the queue?" or "is the job no longer in executing state?"
I don't think it is "if this is Trestles, and since we issued a qdel
command, is the job marked as completed; or if this is Stampede, is the
job now marked as failed?"

* If the job cancel works, Airavata marks this as canceled.

* If cancel fails for some reason, don't change the Experiment state but
throw an error.
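
As a rough illustration of the bullets above (not actual Airavata code), the
flow could look like this. The experiment is a plain dict, and issue_cancel,
left_queue and raw_job_state are hypothetical callables for the
machine-specific parts. Rejecting states other than "launched" or "executing"
with an error is an assumption of the sketch.

```python
class CancelFailed(Exception):
    """Raised when the cancel could not be confirmed; experiment state is untouched."""


def cancel_experiment(experiment, issue_cancel, left_queue, raw_job_state):
    # Cancel is only issued for jobs in "launched" or "executing" state.
    if experiment["state"] not in ("LAUNCHED", "EXECUTING"):
        raise CancelFailed("cannot cancel in state " + experiment["state"])

    job_id = experiment["job_id"]
    issue_cancel(job_id)

    # The only question asked: has the job left the queue / stopped executing?
    if not left_queue(job_id):
        raise CancelFailed("job " + job_id + " is still queued or executing")

    # Keep whatever the machine reports (completed, cancelled, failed, ...)
    # for logging/auditing, and let Airavata set the experiment state itself.
    experiment["job_state"] = raw_job_state(job_id)
    experiment["state"] = "CANCELED"
    return experiment
```

The job state stays machine-specific while the experiment state is uniform,
which is the split argued for in the first paragraph.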

Marlon

On 8/13/14, 2:57 AM, Lahiru Gunathilake wrote:
> Hi All,
>
> I have few concerns about experiment cancellation. When we want to cancel
> and experiment we have to run a particular command in the computing
> resource. Based on the computing resource different resources show the job
> status of the cancelled jobs in a different way. Ex: trestles shows the
> cancelled jobs as completed, some other machines show it as as cancelled,
> some might show it as failed.
>
> I think we should replicated this information in the JobDetails object as
> the Job status and make sure the Experiments and Task statuses as
> cancelled. The other approach is when we cancel we explicitly make all the
> states in the experiment model (experiments,tasks,job states as cancelled)
> as cancelled and manually handle the state we get from the computing
> resource.
>
> My concerns should we really hide that information shown in the computing
> resource from the Job status we are storing in to the registry ? or leave
> it as it is and handle other statuses to represent the cancelled
> experiments ? If we make everything cancel there will be inconsistency in
> the JobStatus.
>
> WDYT ?
>
> Lahiru
>