Posted to issues@aurora.apache.org by "Chris Lambert (JIRA)" <ji...@apache.org> on 2014/05/12 21:51:18 UTC

[jira] [Updated] (AURORA-350) Parallelize updates to speed up deploys

     [ https://issues.apache.org/jira/browse/AURORA-350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Lambert updated AURORA-350:
---------------------------------

    Sprint: Sprint 1

> Parallelize updates to speed up deploys
> ---------------------------------------
>
>                 Key: AURORA-350
>                 URL: https://issues.apache.org/jira/browse/AURORA-350
>             Project: Aurora
>          Issue Type: Story
>          Components: Client
>            Reporter: Maxim Khutornenko
>            Assignee: Maxim Khutornenko
>
> The way aurora deploy works inherently limits deploy speed.
> Aurora deploy, like cap/TCU, uses the "batch" model. You have 100 things, and you loop over them in batches of N at a time. You restart N things all at once, those N things come back online all at once (cold), you wait for all of them to become available, and repeat.
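The batch model described above can be sketched roughly as follows. This is an illustration, not actual Aurora client code; `restart()` and `wait_until_healthy()` are hypothetical stand-ins for the real restart cycle.

```python
restart_log = []

def restart(instance):
    """Stand-in: take the instance down and reschedule it."""
    restart_log.append(instance)

def wait_until_healthy(instance):
    """Stand-in: block until the instance passes health checks."""
    pass

def batch_restart(instance_ids, batch_size):
    """Restart instances N at a time, waiting on each whole batch."""
    for i in range(0, len(instance_ids), batch_size):
        batch = instance_ids[i:i + batch_size]
        for instance in batch:
            restart(instance)             # all N go down (and come back cold) together
        for instance in batch:
            wait_until_healthy(instance)  # the next batch cannot start until
                                          # the slowest instance is back

batch_restart(list(range(100)), 8)
```

Note that the second inner loop is where both disadvantages below come from: progress is gated on the slowest instance, and the whole batch warms up at once.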
> Disadvantages:
> - You can proceed no faster than the slowest instance in the batch. If one instance is "stuck" or slow, the whole deploy slows down.
> - The speed of your deploy is bounded by your success rate, which in turn is bounded by the number of instances that are online but serving below par while they warm up. The batch methodology maximizes this effect because the restarted shards tend to come back online all at the same time.
> Let's say a full cycle of shutdown, reschedule, restart, and wait-for-online-and-healthy takes 2 minutes, but the "bad time" is only 15 seconds. If we do these 8 at a time, we have a period where 8 boxes are bad for 15 seconds at once. That's a big success rate dip. What if we were able to do 8 of these in parallel, staggered such that only one of them is bad at any given moment? It's the same speed (all other things being equal), but the impact is much less. We could leverage that to make the deploy go even faster.
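The difference in impact is easy to check with a little arithmetic. The sketch below (hypothetical, not part of Aurora) counts how many instances are warming up at the same moment under the two schedules, assuming the 15-second warmup window from the example above.

```python
WARMUP = 15  # seconds an instance serves below par after coming back online

def max_concurrent_bad(start_times, warmup=WARMUP):
    """Max number of instances that are in their warmup window at once."""
    # Sweep over start/end events; +1 when an instance goes bad, -1 when it recovers.
    events = [(t, 1) for t in start_times] + [(t + warmup, -1) for t in start_times]
    worst = current = 0
    for _, delta in sorted(events):
        current += delta
        worst = max(worst, current)
    return worst

batched = [0] * 8                            # 8 instances come back at once
staggered = [i * WARMUP for i in range(8)]   # one comes back every 15 seconds

print(max_concurrent_bad(batched))    # whole batch is bad simultaneously
print(max_concurrent_bad(staggered))  # at most one bad at a time
```

Same 8 restarts, same elapsed time, but the staggered schedule never has more than one instance below par.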
> It's easy to see that we could speed deploys up by 2x or more by using an algorithm which minimizes the number of instances starting at any given time but still proceeds quickly in parallel.
> Aurora should be rewritten to use a thread-based deploy model. You have 100 things and N threads. The main thread dispatches restart tasks to the threads (blocking if no thread is ready) in a user-configurable, rate-limited fashion (e.g. no more than one per 15 seconds), where the rate is defined by your per-instance warmup time (the time an instance is listening/serving but slow). Each thread then restarts one instance, waits for it to come back healthy, and reports done/failure/etc. Continue until the list is exhausted.
> This way you have a steady stream of single instances coming online with no clumping of restarts, and if any one gets hung up or slow, it doesn't significantly impact the speed of the deploy (you can "overprovision" the number of threads). You can also retain most of the current deploy semantics around failure counts, retry intervals, etc.
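A minimal sketch of that thread-based model, using a bounded queue for blocking dispatch and a sleep for the rate limit. This is not the actual Aurora client implementation; `restart_instance`, `DISPATCH_INTERVAL`, and `NUM_THREADS` are hypothetical placeholders.

```python
import queue
import threading
import time

DISPATCH_INTERVAL = 15  # seconds; set this to your per-instance warmup time
NUM_THREADS = 8

def restart_instance(instance_id):
    """Placeholder for the real shutdown/reschedule/wait-for-healthy cycle."""
    time.sleep(0.01)
    return True

def worker(tasks, results):
    while True:
        instance_id = tasks.get()
        if instance_id is None:                # sentinel: no more work
            return
        ok = restart_instance(instance_id)
        results.append((instance_id, ok))      # list.append is thread-safe in CPython

def rolling_restart(instance_ids, interval=DISPATCH_INTERVAL):
    tasks = queue.Queue(maxsize=1)  # put() blocks until a worker is ready to take work
    results = []
    threads = [threading.Thread(target=worker, args=(tasks, results))
               for _ in range(NUM_THREADS)]
    for t in threads:
        t.start()
    for instance_id in instance_ids:
        tasks.put(instance_id)  # blocks if no thread is free
        time.sleep(interval)    # rate limit: at most one restart per interval
    for _ in threads:
        tasks.put(None)         # shut the workers down
    for t in threads:
        t.join()
    return results
```

A slow or hung instance ties up only its own thread; the other N-1 threads keep draining the queue, which is the "overprovisioning" mentioned above.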



--
This message was sent by Atlassian JIRA
(v6.2#6252)