Posted to user@aurora.apache.org by Mauricio Garavaglia <ma...@gmail.com> on 2017/11/29 19:54:04 UTC

Aurora taking really long to reschedule a full cluster

Hello!

Recently, while running some reliability tests, we restarted all the nodes in
a cluster of ~300 hosts and 3k tasks. Aurora took about 1 hour to reschedule
everything; we had a leader change in the middle of the scheduling, and
that slowed it down even more. So we started looking at which Aurora
parameters need more tuning.

The value of max_tasks_per_schedule_attempt is currently set to the default
and probably needs to be increased. Is there a rule of thumb for tuning it
based on cluster size, # of jobs, # of frameworks, etc.?

Regarding the JVM, we are running it with Xmx=24G; so far we haven't seen
pressure there.

Any input on where to look would be really appreciated :)

Mauricio

Re: Aurora taking really long to reschedule a full cluster

Posted by David McLaughlin <dm...@apache.org>.

You should not need to adjust max_schedule_attempts_per_sec, as it defaults
to 40, which should give you pretty close to 2400 schedule attempts per
minute. (Our max is set to 30, and in our scale tests we hit 1800 tasks
scheduled per minute pretty consistently.)
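
A back-of-the-envelope sketch of the arithmetic above, in Python. The
function names are hypothetical, and it assumes (as a simplification, not
something stated in this thread) that each schedule attempt places exactly
one task, so the result is only a rough upper bound:

```python
# Rough ceiling on scheduling throughput implied by a per-second attempt limit.
# Assumption (simplification): every schedule attempt places exactly one task.

def attempts_per_minute(max_schedule_attempts_per_sec: int) -> int:
    """Schedule attempts allowed per minute for a given per-second limit."""
    return max_schedule_attempts_per_sec * 60


def minutes_to_schedule(task_count: int, max_schedule_attempts_per_sec: int) -> float:
    """Best-case minutes needed to schedule task_count tasks at that limit."""
    return task_count / attempts_per_minute(max_schedule_attempts_per_sec)


print(attempts_per_minute(40))        # 2400
print(attempts_per_minute(30))        # 1800
print(minutes_to_schedule(3000, 40))  # 1.25
```

Under that assumption, 3k tasks at the default limit would take on the
order of a minute or two, not an hour.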

Can you provide more info on how you are scheduling? Are you scheduling
from scratch (using job create or update start)? How many job keys?

On Wed, Nov 29, 2017 at 4:42 PM, Mauricio Garavaglia <
mauriciogaravaglia@gmail.com> wrote:

> This was on 0.17. No logs, sorry. I'll run the same test again in a week or
> so; I can share the new logs and even kill the leader in the middle of the
> process.
>
> Tasks continued to run. I remember digging through the logs to see how long
> it took for a particular task to show up again as assigned. I'll adjust
> max_tasks_per_schedule_attempt and test it again.
>
> Thanks!
>
> On Wed, Nov 29, 2017 at 12:03 PM, Bill Farner <wf...@apache.org> wrote:
>
>> That works out to scheduling about 1 task/sec, which is at least one
>> order of magnitude lower than I would expect.  Are you sure tasks were
>> scheduling and continuing to run, rather than exiting/failing and
>> triggering more scheduling work?
>>
>> What build is this from?  Can you share (scrubbed) scheduler logs from
>> this period?

Re: Aurora taking really long to reschedule a full cluster

Posted by Mauricio Garavaglia <ma...@gmail.com>.

This was on 0.17. No logs, sorry. I'll run the same test again in a week or
so; I can share the new logs and even kill the leader in the middle of the
process.

Tasks continued to run. I remember digging through the logs to see how long
it took for a particular task to show up again as assigned. I'll adjust
max_tasks_per_schedule_attempt and test it again.

Thanks!

On Wed, Nov 29, 2017 at 12:03 PM, Bill Farner <wf...@apache.org> wrote:

> That works out to scheduling about 1 task/sec, which is at least one order
> of magnitude lower than I would expect.  Are you sure tasks were scheduling
> and continuing to run, rather than exiting/failing and triggering more
> scheduling work?
>
> What build is this from?  Can you share (scrubbed) scheduler logs from
> this period?

Re: Aurora taking really long to reschedule a full cluster

Posted by Bill Farner <wf...@apache.org>.

That works out to scheduling about 1 task/sec, which is at least one order
of magnitude lower than I would expect.  Are you sure tasks were scheduling
and continuing to run, rather than exiting/failing and triggering more
scheduling work?
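
For reference, a minimal Python sketch of that estimate, using the figures
reported in this thread (about 3k tasks in about 1 hour). The helper name is
hypothetical, and the comparison treats the 40 attempts/sec default quoted
elsewhere in the thread as a ceiling with one task per attempt, which is a
simplifying assumption:

```python
# Observed rescheduling rate from the figures in this thread:
# roughly 3k tasks rescheduled in about 1 hour.
# Assumption (simplification): the 40 attempts/sec default is the ceiling
# and each attempt places at most one task.

def tasks_per_second(task_count: int, elapsed_seconds: float) -> float:
    """Average number of tasks scheduled per second."""
    return task_count / elapsed_seconds


observed = tasks_per_second(3000, 3600)
ceiling = 40.0
slowdown = ceiling / observed

print(f"observed: {observed:.2f} tasks/sec")            # observed: 0.83 tasks/sec
print(f"ceiling is {slowdown:.0f}x the observed rate")  # ceiling is 48x the observed rate
```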

What build is this from?  Can you share (scrubbed) scheduler logs from this
period?
