You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@airavata.apache.org by Lahiru Gunathilake <gl...@gmail.com> on 2014/01/09 19:24:46 UTC

Re: Airavata Orchestrator component

Hi All,

I have started the orchestrator implementation[1] and committed the initial
version (not fully functioning at this point). I need lot more work to
finish the functioning version of the orchestrator, but at this point it
would be useful to get some usecases for the test cases.

If you have any usecases please provide some information and it can be used
to change the current implementation and support those useful usecases,
before I go to working version.


[1]https://issues.apache.org/jira/browse/AIRAVATA-964

Regards
Lahiru


On Mon, Dec 9, 2013 at 9:36 AM, Raminder Singh <rs...@gmail.com> wrote:

> Lahiru,
>
> I added my comments to the google doc.
>
> https://docs.google.com/document/d/11fjql09tOiC0NLBaqdhZ9WAiMoBhkBJl7WC1N7DigcU/edit
>
> About the pull model: We don’t want to create locking issues at database
> level as Orchestrator and GFAC will be monitoring similar tables. Another
> problem i can see is a delay created by pull model to submit user job. GFAC
> need to look for submitted jobs and there need to be a frequency set to
> check the database. That will create a delay to handle user submission.
> Thats why i like the Async submission of job by Orchestrator using GFAC SPI.
>
> Thanks
> Raminder
>
> On Dec 9, 2013, at 9:12 AM, Lahiru Gunathilake <gl...@gmail.com> wrote:
>
> Hi Raman,
>
>
> On Fri, Dec 6, 2013 at 12:34 PM, Raminder Singh <rs...@gmail.com>wrote:
>
>> Lahiru: Can you please start a document to record this conversation?
>> There are very valuable points to records and don’t want to loose anything
>> in email threads.
>>
>> My comments are inline with prefix RS>>:
>>
>> On Dec 5, 2013, at 10:12 PM, Lahiru Gunathilake <gl...@gmail.com>
>> wrote:
>>
>> Hi Amila,
>>
>> I have answered questions you raised except some how to questions (for
>> how questions we need to figure out solutions, before that we need to come
>> up with good design).
>>
>>
>> On Thu, Dec 5, 2013 at 7:58 PM, Amila Jayasekara <thejaka.amila@gmail.com
>> > wrote:
>>
>>>
>>>
>>>
>>> On Thu, Dec 5, 2013 at 2:34 PM, Lahiru Gunathilake <gl...@gmail.com>wrote:
>>>
>>>> Hi All,
>>>>
>>>> We are thinking of implementing an Airavata Orchestrator component to
>>>> replace WorkflowInterpreter to avoid gateway developers to dealing with
>>>> workflows when they simply have one single independent jobs to run in their
>>>> gateways. This component is mainly focusing on how to invoke GFAC and
>>>> accept requests from the client API.
>>>>
>>>> I have following features in mind about this component.
>>>>
>>>> 1. It gives a web services or REST interface where we can implement a
>>>> client to invoke it to submit jobs.
>>>>
>>> RS >> We need a API method to handle this and protocol interfacing of
>> API can be handled separately using Thrift or Web services.
>>
>>
>>>> 2. Accepts a job request and parse the input types and if input types
>>>> are correct, this will create an Airavata experiment ID.
>>>>
>>> RS >> According to me, we need to save every request to registry before
>> verification and have a input configuration error if the inputs were not
>> correct. That will help us to find if there were any API invocation errors.
>>
> +1, we need to save the request to registry right away.
>
>>
>>
>>>> 3. Orchestrtor then store the job information to registry against the
>>>> generated experiment ID (All the other components identify the job using
>>>> this experiment ID).
>>>>
>>>> 4. After that Orchestrator pull up all the descriptors related to this
>>>> request and do some scheduling to decide where to run the job and submit
>>>> the job to a GFAC node (Handling multiple GFAC nodes is going to be a
>>>> future improvement in Orchestrator).
>>>>
>>>> If we are trying to do pull based job submission it might be a good
>>>> idea to handle errors, if we store jobs to Registry and GFAC pull jobs and
>>>> execute them Orchestrator component really doesn' t have to worry about the
>>>> error handling.
>>>>
>>>
>>> I did not quite understand what you meant by "pull based job
>>> submission". I believe it is saving job in registry and periodically GFAC
>>> looking up for new jobs and submitting them.
>>>
>> Yes.
>>
>> RS >> I think orchestrator should call GFAC to invoke the job than GFAC
>> polling for the jobs. Orchestrator should make a decision that to which
>> instance of GFAC it submit the job and if there is a system error then
>> bring up or communicate to another instance.I think pull based model for
>> GFAC will add an overhead. We will add another point of failure.
>>
> Can you please explain bit more what did you mean by "another point of
> failure" and "add an overhead".
>
>>
>>  Further why are you saying you dont need to worry about error handling ?
>>> What sort of errors are you considering ?
>>>
>> I am considering GFAC failures or connection between Orchestrator and
>> GFAC goes down.
>>
>>>
>>>
>>>>
>>>> Because we can implement a logic to GFAC if a particular job is not
>>>> updating its status fora g iven time it assume job is hanged or either GFAC
>>>> node which handles that job is fauiled, so  GFAC pull that job (we
>>>> definitely need a locking mechanism here, to avoid two instances are not
>>>> going to  execute hanged job) and  start execute it. (If GFAC is handling a
>>>> long running job still it has to update the job stutus frequently with the
>>>> same status to make sure GFAC node is running).
>>>>
>>>
>>> I have some comments/questions on this regard;
>>>
>>> 1. How are you going to detect that job is hanged ?
>>>
>>> 2. We clearly need to distinguish between fault jobs and fault GFAC
>>> instances. Because GFAC replication should not pick the job if its logic is
>>> leading to hang situation.
>>>
>> I haven't seen hanged logic situation, may be there are.
>>
>>>  GFAC replication should pick the job only if primary GFAC instance is
>>> down. I believe you proposed locking mechanism to handle this scenario. But
>>> I dont see how locking mechanism going to resolve this situation. Can you
>>> explain more ?
>>>
>> Example if gfac has an logic of picking up a job which didn't response in
>> a given time there could be a scenario where two gfac instances try to pick
>> the same job. Ex: there are 3 gfac nodes working and one goes down with a
>> given job. And two other nodes recognize this at the same time and try to
>> launch the sam ejob. I was talking about locks to fix this issue.
>>
>> RS >> One way to handle is to look at job walltime. If the walltime for a
>> running job is expired and we still don’t have the status of the job then
>> we can go ahead and check the status and start cleaning up the job.
>>
>
>>
>>> 2. According to your description, it seems there is no communication
>>> between GFAC instance and Orchastrator.So GFAC and Orchastrator exchange
>>> data through registry (Database). Performance might drop since we are going
>>> through persisting mediums.
>>>
>> Yes you are correct, I am assuming we are mostly focusing on implementing
>> more reliable system and most of these jobs are running hours, and we don't
>> need to implement high performance system for a system with  long running
>> jobs.
>>
>> RS >> We need to discuss this. I think orchestrator should only maintain
>> state of request not GFAC.
>>
>>
>>> 3. What is the strategy to divide jobs among GFAC instances ?
>>>
>> Not sure, we have to discuss it.
>>
>>
>>> 4. How to identify GFAC instance is failed ?
>>>
>>
>>> 5. How GFAC instances should be registered with the orchestrator ?
>>>
>> RS >> We need to have a mechanism which record how many GFAC instance are
>> running and how many jobs per instance.
>>
> If we are going to do pull based model its going to be a hassle otherwise
> orchestrator can keep track of that.
>
>>
>>
>>> 6. How job cancellations are handled ?
>>>
>> RS >> Single job canceling is simple and should have a API function to
>> cancel based on experiment id and/or local job id.
>>
>>
>>> 7. What happend if Orchestrator goes down ?
>>>
>> This is under assumption Orchestrator doesn't go down (Ex: as a Head node
>> in Map reduce).
>>
>> RS >> I think registration of job happen outside orchestrator and
>> orchestrator/GFAC progress the states.
>>
>>
>>
> Regards
> Lahiru
>
> --
> System Analyst Programmer
> PTI Lab
> Indiana University
>
>
>


-- 
System Analyst Programmer
PTI Lab
Indiana University