Posted to dev@airavata.apache.org by Shameera Rathnayaka <sh...@gmail.com> on 2014/09/23 19:04:57 UTC

Job throttling implementation clarification.

Hi Devs,

I am working on a queue-based job throttling implementation, and here is
the related JIRA [1] ticket created to track the implementation steps.

The following explains how job throttling is currently implemented. It
applies only to compute resources that have batch queues defined; other
resources are not throttled.

There is a validator called JobCountValidator. It checks whether there is
room to submit a new job to a resource and returns "true" or "false"
accordingly. I am using ZooKeeper to track runtime data such as how many
jobs have been submitted to a given host. In the current implementation,
the job count is incremented when a job is added to the monitoring queue
and decremented when the job is removed from it. I ran a few tests and this
approach works fine. However, after running a load test at a high
submission rate, I observed that it breaks down because validation happens
in the Orchestrator while the job count is updated in GFac. This is a race
condition: the Orchestrator can still pass the validation step even though
the maximum allowed number of jobs has already been submitted to a
resource, because the job count in ZooKeeper has not yet been updated.
Therefore, to fix this, the job submission and the job count increment need
to happen in the same place.
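
To make the validator idea concrete, here is a rough sketch of the check it
performs (the class name, znode path and counter encoding are assumptions
made for illustration, not the actual Airavata code):

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

// Illustrative sketch of a job-count check backed by ZooKeeper. Assumes one
// counter znode per host, holding the count as a plain decimal string.
public class JobCountValidatorSketch {

    private final ZooKeeper zk;

    public JobCountValidatorSketch(ZooKeeper zk) {
        this.zk = zk;
    }

    // Returns true if the host still has room for another job.
    public boolean validate(String hostId, int maxAllowedJobs)
            throws KeeperException, InterruptedException {
        String counterPath = "/airavata/job-counts/" + hostId; // hypothetical path
        if (zk.exists(counterPath, false) == null) {
            return true; // no jobs recorded for this host yet
        }
        byte[] data = zk.getData(counterPath, false, null);
        int currentCount = (data == null || data.length == 0)
                ? 0 : Integer.parseInt(new String(data));
        return currentCount < maxAllowedJobs;
    }
}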

So the potential place for this is the SimpleOrchestratorImpl#launchExperiment
method. WDYT?

Since validation and launch are invoked as two separate client calls, that
race condition still exists; I have sent a separate mail about that.

Thanks,
Shameera.

-- 
Best Regards,
Shameera Rathnayaka.

email: shameera AT apache.org , shameerainfo AT gmail.com
Blog : http://shameerarathnayaka.blogspot.com/

Re: Job throttling implementation clarification.

Posted by Lahiru Gunathilake <gl...@gmail.com>.
On Tue, Sep 23, 2014 at 4:04 PM, Shameera Rathnayaka <shameerainfo@gmail.com>
wrote:

> Hi Lahiru,
>
> I could resolve this by moving the job throttling logic to the
> launchExperiment method and synchronizing the jobSubmitter.submit call and
> the job count update. This introduces a small performance bottleneck; if we
> can tolerate that bottleneck in the job submission phase, this will work
> without issue as long as we have one Orchestrator in our deployment. WDYT?
> Can we go with this and change it to a better approach later?
>
OK

Regards
Lahiru

-- 
Research Assistant
Science Gateways Group
Indiana University

Re: Job throttling implementation clarification.

Posted by Shameera Rathnayaka <sh...@gmail.com>.
Hi Lahiru,

I could resolve this by moving the job throttling logic to the
launchExperiment method and synchronizing the jobSubmitter.submit call and
the job count update. This introduces a small performance bottleneck; if we
can tolerate that bottleneck in the job submission phase, this will work
without issue as long as we have one Orchestrator in our deployment. WDYT?
Can we go with this and change it to a better approach later?
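
Roughly, what I have in mind looks like the following sketch (the class,
method and path names are illustrative assumptions rather than the exact
Airavata code, and the per-host counter znode is assumed to already exist):

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

// Illustrative sketch: do the throttle check, the submission and the count
// update under one lock so they cannot interleave.
public class ThrottledLaunchSketch {

    public interface JobSubmitter {              // stand-in for the real submitter
        boolean submit(String experimentId) throws Exception;
    }

    private final Object throttleLock = new Object();
    private final ZooKeeper zk;
    private final JobSubmitter jobSubmitter;

    public ThrottledLaunchSketch(ZooKeeper zk, JobSubmitter jobSubmitter) {
        this.zk = zk;
        this.jobSubmitter = jobSubmitter;
    }

    public boolean launchExperiment(String experimentId, String hostId, int maxJobs)
            throws Exception {
        synchronized (throttleLock) {
            if (readCount(hostId) >= maxJobs) {
                return false;                    // throttled, try again later
            }
            boolean submitted = jobSubmitter.submit(experimentId);
            if (submitted) {
                incrementCount(hostId);          // count only successful submissions
            }
            return submitted;
        }
    }

    // Hypothetical znode layout: one counter znode per host.
    private String path(String hostId) {
        return "/airavata/job-counts/" + hostId;
    }

    private int readCount(String hostId) throws KeeperException, InterruptedException {
        byte[] data = zk.getData(path(hostId), false, null);
        return (data == null || data.length == 0) ? 0 : Integer.parseInt(new String(data));
    }

    private void incrementCount(String hostId) throws KeeperException, InterruptedException {
        int next = readCount(hostId) + 1;
        zk.setData(path(hostId), Integer.toString(next).getBytes(), -1); // unconditional write
    }
}

The synchronized block is the small bottleneck mentioned above: it
serializes the check, the submission and the count update, which is only
correct while a single Orchestrator instance is doing the launching.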

Thanks,
Shameera.

-- 
Best Regards,
Shameera Rathnayaka.

email: shameera AT apache.org , shameerainfo AT gmail.com
Blog : http://shameerarathnayaka.blogspot.com/

Re: Job throttling implementation clarification.

Posted by Shameera Rathnayaka <sh...@gmail.com>.
Hi Lahiru,

On Tue, Sep 23, 2014 at 1:38 PM, Lahiru Gunathilake <gl...@gmail.com>
wrote:

> It's wrong to update the count before a successful job submission (because
> the submission might still fail, and then the count would not reflect the
> actual number of jobs in the queue), and even if we do both in the same
> place there will always be a race condition.
>

Can't we assume that if the jobSubmitter.submit(..) method returns "true",
the job has been submitted to the compute resource without any issue? If
so, incrementing the job count after the submit operation would solve our
issue to some extent (yes, I can see it is hard to completely eliminate the
race condition).
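
As a sketch of what "increment only after a successful submit" could look
like, one option is ZooKeeper's version-checked setData used as a
compare-and-set, so concurrent updaters cannot lose increments (the helper
name and path handling are assumptions for illustration):

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

// Illustrative helper: retry a version-checked setData until the increment
// applies, so two workers incrementing the same counter cannot lose an
// update. Assumes the counter znode already exists.
public final class ZkCounterSketch {

    private ZkCounterSketch() {
    }

    public static void increment(ZooKeeper zk, String counterPath)
            throws KeeperException, InterruptedException {
        while (true) {
            Stat stat = new Stat();
            byte[] data = zk.getData(counterPath, false, stat);
            int current = (data == null || data.length == 0)
                    ? 0 : Integer.parseInt(new String(data));
            try {
                // Fails with BadVersionException if someone else updated the
                // counter after we read it; then we re-read and try again.
                zk.setData(counterPath, Integer.toString(current + 1).getBytes(),
                        stat.getVersion());
                return;
            } catch (KeeperException.BadVersionException e) {
                // lost the race, retry
            }
        }
    }
}

Calling this only once jobSubmitter.submit(..) has returned "true" keeps
the counter a count of jobs the resource actually accepted.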


> If we really want to fix this, we have to implement a queue-based approach
> where GFac picks jobs from a worker queue and delays submission when the
> count is exceeded.
>

Are you suggesting moving the scheduling part to GFac instead of doing it
in the Orchestrator? And is this a global queue that every GFac node can
access, or a queue per GFac node?


-- 
Best Regards,
Shameera Rathnayaka.

email: shameera AT apache.org , shameerainfo AT gmail.com
Blog : http://shameerarathnayaka.blogspot.com/

Re: Job throttling implementation clarification.

Posted by Lahiru Gunathilake <gl...@gmail.com>.
It's wrong to update the count before a successful job submission (because
the submission might still fail, and then the count would not reflect the
actual number of jobs in the queue), and even if we do both in the same
place there will always be a race condition. If we really want to fix this,
we have to implement a queue-based approach where GFac picks jobs from a
worker queue and delays submission when the count is exceeded.
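
A minimal sketch of the kind of worker loop this suggests, assuming an
in-memory queue and a pluggable count lookup (none of these names are
existing GFac classes):

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.function.IntSupplier;

// Illustrative worker loop: take jobs from a queue and hold them back while
// the running-job count for the target resource is at its limit.
public class ThrottlingWorkerSketch implements Runnable {

    private final BlockingQueue<String> workerQueue = new LinkedBlockingQueue<>();
    private final IntSupplier currentJobCount;   // e.g. backed by the ZooKeeper counter
    private final int maxJobs;

    public ThrottlingWorkerSketch(IntSupplier currentJobCount, int maxJobs) {
        this.currentJobCount = currentJobCount;
        this.maxJobs = maxJobs;
    }

    public void enqueue(String experimentId) {
        workerQueue.offer(experimentId);
    }

    @Override
    public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                String experimentId = workerQueue.take();   // block until a job arrives
                while (currentJobCount.getAsInt() >= maxJobs) {
                    TimeUnit.SECONDS.sleep(10);             // delay submission, not reject
                }
                submit(experimentId);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private void submit(String experimentId) {
        // placeholder for the real GFac submission + count increment
        System.out.println("submitting " + experimentId);
    }
}

Whether the queue is global or one per GFac node mainly changes what backs
workerQueue; the polling loop itself stays the same.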



-- 
Research Assistant
Science Gateways Group
Indiana University