Posted to issues@aurora.apache.org by "Maxim Khutornenko (JIRA)" <ji...@apache.org> on 2014/08/18 19:36:19 UTC

[jira] [Commented] (AURORA-350) Parallelize updates to speed up deploys

    [ https://issues.apache.org/jira/browse/AURORA-350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14100902#comment-14100902 ] 

Maxim Khutornenko commented on AURORA-350:
------------------------------------------

A few points on implementation:

- The batch_size config setting now defines the number of simultaneous instance update threads;
- Every instance is updated individually (i.e. kill/add/getStatus RPCs are handled per thread);
- All scheduler calls are multiplexed at a lower level, which assembles batches of matching action calls (i.e. kill/add/getStatus) and sends them at predefined intervals (default: 2 seconds). This avoids DDoSing the scheduler and avoids a contention bottleneck on the scheduler_client lock.
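A minimal sketch of the batching described above, assuming hypothetical names (RpcBatcher, enqueue, flush); this is not the actual Aurora client code, just an illustration of collecting per-thread calls and flushing one batched RPC per action type on an interval:

```python
import threading
import time
from collections import defaultdict

class RpcBatcher:
    """Collects per-instance scheduler calls from many updater threads and
    flushes them as one batched RPC per action type (kill/add/getStatus)
    at a fixed interval, so the scheduler sees a few large calls instead
    of many small ones."""

    def __init__(self, scheduler, flush_interval=2.0):
        self._scheduler = scheduler        # exposes kill/add/getStatus(list)
        self._interval = flush_interval
        self._lock = threading.Lock()
        self._pending = defaultdict(list)  # action name -> [instance_id, ...]
        self._flusher = threading.Thread(target=self._loop, daemon=True)
        self._flusher.start()

    def enqueue(self, action, instance_id):
        """Called by individual instance-update threads."""
        with self._lock:
            self._pending[action].append(instance_id)

    def _loop(self):
        while True:
            time.sleep(self._interval)
            self.flush()

    def flush(self):
        """Send one batched RPC per pending action type."""
        with self._lock:
            pending, self._pending = self._pending, defaultdict(list)
        for action, instances in pending.items():
            getattr(self._scheduler, action)(instances)
```

With this shape, a hundred concurrent update threads enqueueing kills still produce at most one kill RPC every flush interval.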

> Parallelize updates to speed up deploys
> ---------------------------------------
>
>                 Key: AURORA-350
>                 URL: https://issues.apache.org/jira/browse/AURORA-350
>             Project: Aurora
>          Issue Type: Story
>          Components: Client
>            Reporter: Maxim Khutornenko
>            Assignee: Maxim Khutornenko
>             Fix For: 0.5.0
>
>
> The way aurora deploy works inherently contributes to depressed deploy speeds.
> Aurora deploy, like cap/TCU, uses the "batch" model. You have 100 things, and you loop over them in batches of N at a time. You restart N things all at once, those N things come back online all at once (cold), you wait for all of them to become available, and repeat.
> Disadvantages:
> - you can proceed no faster than the slowest guy in the batch. If one instance is "stuck" or slow, the whole deploy slows down.
> - The speed at which your deploy proceeds is bounded by your success rate, which is bounded by the number of instances currently online but serving below par due to warmup (because, computers). The batch methodology maximizes this effect because the restarted shards tend to come back online all at the same time.
> Let's say a full cycle of shutdown, reschedule, restart, wait-for-online-and-good takes 2 minutes, but the "bad time" is only 15 seconds. If we do these 8 at a time, we have a period where 8 boxes are bad for 15 seconds. That's a big success-rate spike. What if we were able to do 8 of these in parallel such that only one of them is bad at any given moment? It's the same speed (all other things being equal), but the impact is much less. We could leverage that to make the deploy go even faster.
> It's easy to see that we could speed deploys up by 2x or more by using an algorithm which minimizes the number of instances starting at any given time but still proceeds quickly in parallel.
> Aurora should be rewritten to use a thread-based deploy model. You have 100 things and N threads. The main thread dispatches restart tasks to the threads (blocking if none are ready) in a user-set rate-limited fashion (e.g. no more than one per 15 seconds), where the rate is defined by your per-instance warmup time (the time an instance is listening/serving but slow). Each thread then restarts one instance, waits for it to come back healthy, and reports done/failure/etc. Continue until the list is exhausted.
> This way you have a steady stream of single instances coming online with no clumping of restarts, and if any one gets hung up or slow, it doesn't significantly impact the speed of the deploy (you can "overprovision" the number of threads). You can also retain most of the current deploy semantics around failure counts, retry intervals, etc.
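The thread-based model described in the quoted story could be sketched roughly as follows. This is an illustration under assumed names (parallel_restart, restart_one), not the actual Aurora client implementation:

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def parallel_restart(instances, restart_one, num_threads=8,
                     dispatch_interval=15.0):
    """Restart instances with at most num_threads in flight, dispatching
    no more than one restart per dispatch_interval seconds so instances
    come back online in a steady stream rather than in clumps.

    restart_one(instance) should restart one instance, wait for it to
    come back healthy, and return a result (or raise on failure)."""
    results = {}
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        futures = {}
        for instance in instances:
            # Rate-limit dispatch: roughly one restart per warmup interval.
            futures[pool.submit(restart_one, instance)] = instance
            time.sleep(dispatch_interval)
        # Collect per-instance outcomes; one slow instance ties up only
        # its own thread instead of stalling a whole batch.
        for future in as_completed(futures):
            instance = futures[future]
            try:
                results[instance] = future.result()
            except Exception as e:
                results[instance] = e
    return results
```

Setting dispatch_interval to the per-instance warmup time bounds how many instances are simultaneously "bad", while num_threads controls how much slack is available to absorb stuck or slow instances.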



--
This message was sent by Atlassian JIRA
(v6.2#6252)