Posted to dev@airavata.apache.org by Marlon Pierce <ma...@iu.edu> on 2013/10/24 17:48:19 UTC

Stateful vs. fire-and-forget GFac providers

The current GFAC providers all execute tasks in "blocking" mode: the
provider stays active until the job terminates. This introduces some
tradeoffs. On the one hand, determining the job state is very
provider-specific. Doing it all in the provider makes things relatively
simple to implement. 

On the other hand, this makes Airavata's state complicated.  This
increases the difficulty of handling fault recovery and "elastic"
scenarios, where we may need to restart failed servers, pass work from
one running instance to another, and so forth.

If we wanted to make the provider stateless and move monitoring to a
different place, this would take some thoughtful design--I don't have a
good sense of the scope--so even if we all agreed it is a good idea, we
would have to overcome the energy barrier of a current system that is
good enough for what we need to do.
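To make the tradeoff concrete, here is a minimal sketch of the two styles. This is purely illustrative pseudocode-in-Python; none of these class names are Airavata APIs, and the submit/poll callables stand in for provider-specific logic:

```python
# Illustrative sketch only -- BlockingProvider, FireAndForgetProvider, and
# Monitor are invented names, not Airavata classes.

class BlockingProvider:
    """Current style: the provider stays active, polling until the job ends."""
    def __init__(self, submit, poll):
        self.submit, self.poll = submit, poll

    def execute(self, job):
        job_id = self.submit(job)
        state = self.poll(job_id)
        while state not in ("COMPLETED", "FAILED"):  # provider stays alive
            state = self.poll(job_id)
        return state

class FireAndForgetProvider:
    """Proposed style: submit, persist a handle, and return immediately."""
    def __init__(self, submit, store):
        self.submit, self.store = submit, store

    def execute(self, job):
        job_id = self.submit(job)
        self.store[job_id] = "SUBMITTED"  # state lives outside the provider
        return job_id                     # monitoring happens elsewhere

class Monitor:
    """Separate monitoring component; any instance can check any job."""
    def __init__(self, poll, store):
        self.poll, self.store = poll, store

    def refresh(self, job_id):
        self.store[job_id] = self.poll(job_id)
        return self.store[job_id]
```

In the second style a crashed provider instance loses nothing: the shared store plus the monitor carry all the state needed for another instance to take over.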

What are your thoughts?  We had a related discussion about this for a
specific use case back in July [1]. 


Marlon


[1]
http://mail-archives.apache.org/mod_mbox/airavata-dev/201307.mbox/%3C1F73EAE9-5E50-4DD7-BB29-FFBC98DAF0D8@gmail.com%3E

Re: Stateful vs. fire-and-forget GFac providers

Posted by Lahiru Gunathilake <gl...@gmail.com>.
Hi Supun,

I think your idea is very similar to what we currently have.


On Thu, Oct 24, 2013 at 2:43 PM, Supun Kamburugamuva <su...@gmail.com> wrote:

> My thoughts are along the lines of making GFac stateless for better
> replication and recovery.
>
> My be what Airavata need is an Execution Plan concept in GFac.
>
It's already there in gfac-config.xml, and this can be communicated
between nodes.

> An execution plan consists of execution blocks and their execution order
> and plan should be serializable for replication. When a job is submitted to
> GFac, it should be able to create a full execution plan from the
> information provided. Then this plan can be replicated to other nodes and
> coordinated execution of the blocks can be done.
>
From my understanding this can be done.

>
> GFac can execute each of the execution blocks in this plan. The blocks
> should be stateless. The execution blocks corresponds to tasks like file
> transfer, invoking the job etc. The output of a block can be made available
> to other blocks down the execution.
>
This is also implemented (it is done by JobExecution); my only issue is
how to replicate the JobExecution context information between replicas.

If you are familiar with the current structure, can you please explain how
it is different from your idea of having blocks? As per my understanding,
handlers and blocks are pretty similar. (Maybe you are using a bunch of
handlers as a block.)
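On the open replication question above, one hedged sketch: if the execution context is kept fully serializable, it can be checkpointed after each handler and shipped to a replica. The "handlers" here only mimic a gfac-config.xml handler chain, and `Context` is a stand-in for the JobExecution context -- none of this is real Airavata code:

```python
# Hypothetical sketch: replicating per-job context between GFac replicas.
import json

class Context:
    """Replicable only if everything inside it serializes cleanly."""
    def __init__(self, job_id, data=None):
        self.job_id = job_id
        self.data = dict(data or {})

    def to_json(self):
        return json.dumps({"job_id": self.job_id, "data": self.data})

    @staticmethod
    def from_json(s):
        d = json.loads(s)
        return Context(d["job_id"], d["data"])

def run_handlers(ctx, handlers, start=0):
    """Run the chain from `start`, checkpointing the context after each
    handler so another replica could resume from the last checkpoint."""
    checkpoint = None
    for i in range(start, len(handlers)):
        handlers[i](ctx)
        ctx.data["completed"] = i + 1
        checkpoint = ctx.to_json()
    return checkpoint
```

The design cost is that handlers may no longer hold live, non-serializable resources (open connections, threads) in the context between steps.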

Regards
Lahiru


-- 
System Analyst Programmer
PTI Lab
Indiana University

Re: Stateful vs. fire-and-forget GFac providers

Posted by Supun Kamburugamuva <su...@gmail.com>.
My thoughts are along the lines of making GFac stateless for better
replication and recovery.

Maybe what Airavata needs is an Execution Plan concept in GFac. An execution
plan consists of execution blocks and their execution order and plan should
be serializable for replication. When a job is submitted to GFac, it should
be able to create a full execution plan from the information provided. Then
this plan can be replicated to other nodes and coordinated execution of the
blocks can be done.

GFac can execute each of the execution blocks in this plan. The blocks
should be stateless. The execution blocks correspond to tasks like file
transfer, invoking the job, etc. The output of a block can be made
available to other blocks downstream in the execution.

Because the state of the execution plan is replicated, any node should be
able to take over the execution and continue.
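A minimal sketch of this Execution Plan idea: a plan is plain data (ordered block names, a cursor, accumulated state), blocks are stateless functions, and the serialized plan is what gets replicated. The block registry and names below are invented for illustration, not an Airavata API:

```python
# Illustrative Execution Plan sketch; not Airavata code.
import json

BLOCKS = {}  # registry: name -> stateless function(state_dict) -> state_dict

def block(fn):
    BLOCKS[fn.__name__] = fn
    return fn

@block
def stage_input(state):
    return {**state, "staged": True}

@block
def submit_job(state):
    return {**state, "job_id": "job-42"}  # hypothetical job handle

def make_plan(block_names):
    """A plan is just data: ordered block names, a cursor, and state."""
    return {"blocks": list(block_names), "next": 0, "state": {}}

def step(plan):
    """Run the next block and return the serialized plan -- this is the
    form that would be replicated to other nodes."""
    name = plan["blocks"][plan["next"]]
    plan["state"] = BLOCKS[name](plan["state"])
    plan["next"] += 1
    return json.dumps(plan)

def resume(serialized):
    """Any node holding the replicated plan can continue execution."""
    return json.loads(serialized)
```

Because each block is a pure function of the plan's state, a node that receives the serialized plan mid-execution can simply keep calling `step` from the cursor onward.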

Thanks,
Supun..

-- 
Supun Kamburugamuva
Member, Apache Software Foundation; http://www.apache.org
E-mail: supun06@gmail.com;  Mobile: +1 812 369 6762
Blog: http://supunk.blogspot.com

Re: Stateful vs. fire-and-forget GFac providers

Posted by Raminder Singh <ra...@gmail.com>.
Thanks Marlon for starting the discussion.  I think this change can solve multiple issues gateways face. 

1. Jobs sometimes become zombies and lose their state. Having a monitoring component outside the GFAC can allow us to provide an interface to update the state if the client thinks the job has already finished. Then jobs will not be a black box for clients.
2. This can lead to providing a better job management interface to gateways, as the job state is saved outside the GFAC. We can also make better recovery decisions based on human input.
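Point 1 could look roughly like the sketch below: state is kept in a shared store outside the provider, the monitor refreshes it when polling succeeds, and a client-facing override handles zombie jobs. All names here are hypothetical, not an Airavata interface:

```python
# Hypothetical external monitor with a client override for zombie jobs.
class ExternalJobMonitor:
    def __init__(self, poll, store):
        self.poll = poll    # provider-specific status check
        self.store = store  # shared state store outside the provider

    def refresh(self, job_id):
        state = self.poll(job_id)
        if state is not None:       # poll can return nothing for zombie jobs
            self.store[job_id] = state
        return self.store.get(job_id)

    def client_override(self, job_id, state):
        """Human/client input: e.g. mark a zombie job FINISHED."""
        self.store[job_id] = state
        return state
```

Because the store, not the provider, is the source of truth, the override survives provider restarts and is visible to every gateway reading job state.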

I think we will also be able to solve the workflow problem along this way, by introducing a Job Orchestrator or some state machine that the workflow interpreter can rely on for workflow orchestration.

+1 to adding this and bringing some design discussion to the list. 

Thanks
Raminder


Re: Stateful vs. fire-and-forget GFac providers

Posted by Lahiru Gunathilake <gl...@gmail.com>.
Hi Marlon,

In Airavata, since we are using GFAC in embedded mode with the Workflow
Interpreter, it is not really fire-and-forget even if we implement this in
GFAC core.

But it will not be bad, since in WorkflowInterpreter we are handling each
node in a separate thread. And if we are going to use GFac as a separate
job-submitting component, this will definitely make sense.

So I am +1 for this change.

Regards
Lahiru


-- 
System Analyst Programmer
PTI Lab
Indiana University