Posted to user@aurora.apache.org by Josh Adams <jo...@gmail.com> on 2015/10/28 23:45:02 UTC

Throttling task kill rates per job?

Good afternoon all,

Is it possible to tell the scheduler to throttle kill rates for a given
job? When all tasks in a job start consuming too much disk or RAM because
of an unexpected service dependency meltdown, it would be nice if we had a
little buffer time to triage the issue without the scheduler killing them
all en masse for simultaneously using more than their allocated resources...
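
To make "throttle kill rates" concrete, I'm imagining something like a
per-job token bucket on resource-limit kills (totally hypothetical; no such
option exists today as far as I know):

import time
from collections import defaultdict

# Hypothetical sketch: cap resource-limit kills to N per job per window.
class KillRateLimiter(object):
    def __init__(self, max_kills, per_seconds):
        self.max_kills = max_kills
        self.per_seconds = per_seconds
        self._kill_times = defaultdict(list)  # job key -> recent kill times

    def may_kill_now(self, job_key, now=None):
        now = time.time() if now is None else now
        recent = [t for t in self._kill_times[job_key]
                  if t >= now - self.per_seconds]
        self._kill_times[job_key] = recent
        if len(recent) < self.max_kills:
            recent.append(now)
            return True   # under the cap: kill proceeds
        return False      # over the cap: defer, buying us triage time

# e.g. at most 2 resource-limit kills per job per minute:
limiter = KillRateLimiter(max_kills=2, per_seconds=60)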

Cheers,
Josh

Re: Throttling task kill rates per job?

Posted by Zameer Manji <zm...@apache.org>.
Josh,

I think you are advocating for something in AURORA-279
<https://issues.apache.org/jira/browse/AURORA-279>, where health check
failures are sent to the scheduler to prevent all tasks from being killed
concurrently by the health checker. I think this is the only case where
Aurora can possibly do throttling, as the other cases are resource
exhaustion or task failure (i.e. a process exited).

There is no design doc out yet, so no work has started on this effort. It
is a lot more complicated than it seems because the following things need
to be done (see the sketch after this list):
1. Create a reliable mechanism for the executor to relay information
(health check failures) to the scheduler.
2. Define some sort of SLA/threshold that determines whether a health check
failure should result in a kill or not.
3. Modify the scheduler to act on the information in #1 and #2.
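
A very rough sketch of what #2 could look like on the scheduler side (all
names here are made up; none of this exists in Aurora today):

# Hypothetical: decide whether a reported health check failure should
# result in a kill, given a per-job budget of concurrently unhealthy
# instances. This is only meant to show the shape of the SLA/threshold.
def should_kill(instance_id, instance_count, unhealthy,
                max_unhealthy_fraction=0.2):
    if instance_id in unhealthy:
        return True  # already counted against the budget
    budget = int(max_unhealthy_fraction * instance_count)
    # Kill only while the number of unhealthy instances is below the budget.
    return len(unhealthy) < budget

# A 10-instance job with a 20% budget tolerates 2 concurrent kills:
print(should_kill(3, 10, unhealthy={1}))     # True: 1 unhealthy so far
print(should_kill(4, 10, unhealthy={1, 3}))  # False: budget of 2 exhausted

Even this toy version needs job-wide state (the set of currently unhealthy
instances), which is exactly the information the scheduler has and the
executor does not.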


On Wed, Oct 28, 2015 at 4:00 PM, Josh Adams <jo...@gmail.com> wrote:

> Hi Bill, thanks for the quick response.
>
> That's fair. I wonder if we could set a "start killing" threshold instead?
> For example, we set a "danger zone" limit so that any task that's in the
> danger zone is fair game to get killed. The closer a task gets to the max
> (or over the max, of course), the more likely it is to get killed, up to
> "it absolutely will be killed right away." This would achieve our goal of
> reducing the likelihood of all shards getting killed at the same time, and
> preserve the resource exhaustion protection you describe.
>
> Josh
>
> On Wed, Oct 28, 2015 at 3:55 PM, Bill Farner <wf...@apache.org> wrote:
>
>> For some resources (like disk, or more acutely - RAM), there's not much
>> we can do to provide assurances.  Ultimately resource-driven task
>> termination is managed at the node level, and may represent a real
>> exhaustion of the resource.  I'd be worried that trying to augment this
>> might trade one problem for another - where the rationale for killing a
>> task becomes non-deterministic, or even error-prone.
>>
>> On Wed, Oct 28, 2015 at 3:45 PM, Josh Adams <jo...@gmail.com> wrote:
>>
>>> Good afternoon all,
>>>
>>> Is it possible to tell the scheduler to throttle kill rates for a given
>>> job? When all tasks in a job start consuming too much disk or RAM because
>>> of an unexpected service dependency meltdown, it would be nice if we had a
>>> little buffer time to triage the issue without the scheduler killing them
>>> all en masse for simultaneously using more than their allocated resources...
>>>
>>> Cheers,
>>> Josh
>>>
>>
>>
>


-- 
Zameer Manji

Re: Throttling task kill rates per job?

Posted by Josh Adams <jo...@gmail.com>.
Hi Bill, thanks for the quick response.

That's fair. I wonder if we could set a "start killing" threshold instead?
For example, we set a "danger zone" limit so that any task that's in the
danger zone is fair game to get killed. The closer a task gets to the max
(or over the max, of course), the more likely it is to get killed, up to
"it absolutely will be killed right away." This would achieve our goal of
reducing the likelihood of all shards getting killed at the same time, and
preserve the resource exhaustion protection you describe.
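
Concretely (purely illustrative names and numbers), the kill probability
could just ramp linearly between the danger threshold and the hard limit:

# Illustrative only: map a task's resource usage to a kill probability.
# Below the danger threshold: never killed for this resource.
# Between the threshold and the hard limit: increasingly likely.
# At or above the hard limit: killed immediately.
def kill_probability(usage, danger_threshold, hard_limit):
    if usage < danger_threshold:
        return 0.0
    if usage >= hard_limit:
        return 1.0
    return (usage - danger_threshold) / float(hard_limit - danger_threshold)

# e.g. a task at 3.5 GiB of a 4 GiB limit, with the danger zone at 3 GiB:
print(kill_probability(3.5, danger_threshold=3.0, hard_limit=4.0))  # 0.5

If each node then killed with that probability per check interval, shards
near the limit would die off gradually instead of all at once, which is
really all we're after.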

Josh

On Wed, Oct 28, 2015 at 3:55 PM, Bill Farner <wf...@apache.org> wrote:

> For some resources (like disk, or more acutely - RAM), there's not much we
> can do to provide assurances.  Ultimately resource-driven task termination
> is managed at the node level, and may represent a real exhaustion of the
> resource.  I'd be worried that trying to augment this might trade one
> problem for another - where the rationale for killing a task becomes
> non-deterministic, or even error-prone.
>
> On Wed, Oct 28, 2015 at 3:45 PM, Josh Adams <jo...@gmail.com> wrote:
>
>> Good afternoon all,
>>
>> Is it possible to tell the scheduler to throttle kill rates for a given
>> job? When all tasks in a job start consuming too much disk or RAM because
>> of an unexpected service dependency meltdown, it would be nice if we had a
>> little buffer time to triage the issue without the scheduler killing them
>> all en masse for simultaneously using more than their allocated resources...
>>
>> Cheers,
>> Josh
>>
>
>

Re: Throttling task kill rates per job?

Posted by Bill Farner <wf...@apache.org>.
For some resources (like disk, or more acutely - RAM), there's not much we
can do to provide assurances.  Ultimately resource-driven task termination
is managed at the node level, and may represent a real exhaustion of the
resource.  I'd be worried that trying to augment this might trade one
problem for another - where the rationale for killing a task becomes
non-deterministic, or even error-prone.
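
To be concrete about why (a simplified, hypothetical sketch, not Aurora's
actual Thermos resource enforcement): the node-level check is purely local
to one task on one machine, so there is no job-wide view to throttle
against.

# Hypothetical sketch of node-level enforcement. The decision sees only one
# task's usage and limit on this machine; exceeding the limit may mean the
# node itself is about to run out, so the kill happens immediately.
def enforce(task_id, disk_used_bytes, disk_limit_bytes, kill):
    if disk_used_bytes > disk_limit_bytes:
        kill(task_id, reason='disk limit exceeded')

def log_and_kill(task_id, reason):
    print('killing %s: %s' % (task_id, reason))

enforce('hello-world-0',
        disk_used_bytes=9 * 1024 ** 3,
        disk_limit_bytes=8 * 1024 ** 3,
        kill=log_and_kill)

Anything softer than that has to reason about state the node doesn't have.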

On Wed, Oct 28, 2015 at 3:45 PM, Josh Adams <jo...@gmail.com> wrote:

> Good afternoon all,
>
> Is it possible to tell the scheduler to throttle kill rates for a given
> job? When all tasks in a job start consuming too much disk or RAM because
> of an unexpected service dependency meltdown, it would be nice if we had a
> little buffer time to triage the issue without the scheduler killing them
> all en masse for simultaneously using more than their allocated resources...
>
> Cheers,
> Josh
>