Posted to dev@mxnet.apache.org by Marco de Abreu <ma...@googlemail.com> on 2018/05/15 17:48:43 UTC

Auto scaling for MXNet CI

Hello,

I'd like to announce that auto scaling for our CI system (see [1] for the
proposal and [2] for the setup documentation) will be deployed today,
05/15/18, at 11:00 PM PST. I expect no downtime since these changes happen
outside of Jenkins.

Auto scaling makes our CI flexible enough to handle the increasing load
that results from the growing number of great contributions! In the future,
our CI will automatically adapt to the current load and will thus support
big tasks like the to-be-migrated nightly tests or increased load before a
release. Additionally, we're now able to provide scalable p3.2xlarge
instances and can add new instance types without much effort.

Going forward, you will see new slaves being started as the queue grows and
stopped once they become idle. Your tasks will be picked up automatically,
and our system makes sure every PR gets processed as fast as possible.
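
As a rough illustration of the behaviour described above, here is a minimal
Python sketch of such a scaling decision. It is not the deployed
implementation: the names PoolState and plan_scaling, the cap max_slaves and
the "one slave per queued job" rule are all assumptions made for this
example.

```python
from dataclasses import dataclass


@dataclass
class PoolState:
    """Snapshot of one slave pool (hypothetical structure)."""
    queued_jobs: int  # jobs waiting in the build queue
    busy: int         # slaves currently running a job
    idle: int         # slaves running but without a job


def plan_scaling(state: PoolState, max_slaves: int) -> tuple[int, int]:
    """Return (slaves_to_start, slaves_to_stop) for the given snapshot."""
    # Queued jobs are first handed to slaves that are already idle.
    uncovered = max(0, state.queued_jobs - state.idle)
    spare_idle = max(0, state.idle - state.queued_jobs)

    # Start one slave per uncovered job, but never exceed the pool cap.
    running = state.busy + state.idle
    to_start = min(uncovered, max(0, max_slaves - running))

    # Only stop idle slaves that no queued job could use.
    to_stop = spare_idle if uncovered == 0 else 0
    return to_start, to_stop


if __name__ == "__main__":
    print(plan_scaling(PoolState(queued_jobs=5, busy=2, idle=1), max_slaves=10))
    print(plan_scaling(PoolState(queued_jobs=0, busy=1, idle=3), max_slaves=10))
```

A real policy would likely also wait for a grace period before stopping idle
slaves, so that instances are not cycled on every short gap in the queue.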

If you encounter any issues in the next week, please don't hesitate to
reach out to me. I'm looking forward to everyone's feedback!

Best regards,
Marco


[1]: https://cwiki.apache.org/confluence/display/MXNET/Proposal%3A+Auto+Scaling
[2]: https://cwiki.apache.org/confluence/display/MXNET/Setup

Re: Auto scaling for MXNet CI

Posted by Marco de Abreu <ma...@googlemail.com>.

Thanks a lot!

The following numbers are based on our experience in the test environment.

Best case: ~1:50h (unchanged) (0:01 + 0:38 + 0:39 + 0:33 + 0:03) -
conditions: no instances have to be provisioned and caches are primed

Average case: 2:10h (1:50h + 0:10 for instance startup + 0:10 for cache
loading) - conditions: Windows instances are available (they get turned off
less frequently), Ubuntu instances have to be provisioned and caches are
not present

Worst case: 3:06h (1:50h + 0:02 + 0:50 + 0:20 + 0:02 + 0:02) - conditions:
no available instances
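
For anyone who wants to re-check the sums, the breakdowns above can be added
up with a few lines of Python. The helper names to_minutes and total are
made up for this sketch; the durations are copied from the figures above.
Note that the best-case stages add up to 1:54, which is what the "~1:50h"
rounds down from.

```python
def to_minutes(h_mm: str) -> int:
    """Convert an 'H:MM' duration string into minutes."""
    hours, minutes = h_mm.split(":")
    return int(hours) * 60 + int(minutes)


def total(parts: list[str]) -> str:
    """Sum 'H:MM' durations and format the result as 'H:MM'."""
    minutes = sum(to_minutes(p) for p in parts)
    return f"{minutes // 60}:{minutes % 60:02d}"


# Durations taken from the three cases listed above.
best = ["0:01", "0:38", "0:39", "0:33", "0:03"]
average = ["1:50", "0:10", "0:10"]
worst = ["1:50", "0:02", "0:50", "0:20", "0:02", "0:02"]

if __name__ == "__main__":
    for name, parts in [("best", best), ("average", average), ("worst", worst)]:
        print(f"{name}: {total(parts)}")
```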

The worst case is bottlenecked by the Windows instances. They take about 20
minutes to start, and the unprimed MSVC cache increases build times by about
30 minutes. To balance this out, we're applying a less aggressive
downscaling policy for Windows and using larger buffers. At the same time,
we're working on persistent build caches. Reserved instances could be an
additional option.
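
To make "less aggressive downscaling" and "buffers" a bit more concrete,
here is a small Python sketch of a per-platform policy. Everything in it is
hypothetical: the DownscalePolicy fields, the should_stop helper and all
numeric values are placeholders chosen for illustration and do not reflect
the real configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DownscalePolicy:
    idle_minutes_before_stop: int  # how long an instance may idle before stopping
    min_warm_instances: int        # buffer of instances that is never stopped


# Placeholder values: Windows idles longer and keeps a larger buffer because
# its instances are slow to start and their MSVC cache is expensive to rebuild.
POLICIES = {
    "ubuntu": DownscalePolicy(idle_minutes_before_stop=15, min_warm_instances=0),
    "windows": DownscalePolicy(idle_minutes_before_stop=120, min_warm_instances=2),
}


def should_stop(platform: str, idle_minutes: int, warm_instances: int) -> bool:
    """Decide whether one idle instance of the given platform may be stopped."""
    policy = POLICIES[platform]
    # Never shrink the pool below its warm buffer.
    if warm_instances <= policy.min_warm_instances:
        return False
    return idle_minutes >= policy.idle_minutes_before_stop


if __name__ == "__main__":
    print(should_stop("ubuntu", idle_minutes=30, warm_instances=1))
    print(should_stop("windows", idle_minutes=30, warm_instances=3))
```

The trade-off is cost versus turnaround time: a longer idle timeout and a
larger buffer keep more instances running, but make the worst case above
less likely.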

We will observe the situation and then assess the required next steps. For
now, we want to make sure everything runs stably and no builds get
interrupted.

Best regards,
Marco

On Wed, May 16, 2018 at 3:47 AM, Thomas DELTEIL <th...@gmail.com>
wrote:

> Great news :) thanks Marco!
>
> On Tue, May 15, 2018, 18:29 Steffen Rochel <st...@gmail.com>
> wrote:
>
> > Thanks Marco, good step forward.
> > What are the expected, typical and worst-case turnaround times (TAT)
> > for PR checks?
> >
> > Steffen

Re: Auto scaling for MXNet CI

Posted by Thomas DELTEIL <th...@gmail.com>.

Great news :) thanks Marco!

On Tue, May 15, 2018, 18:29 Steffen Rochel <st...@gmail.com> wrote:

> Thanks Marco, good step forward.
> What are the expected, typical and worst-case turnaround times (TAT) for
> PR checks?
>
> Steffen

Re: Auto scaling for MXNet CI

Posted by Steffen Rochel <st...@gmail.com>.

Thanks Marco, good step forward.
What are the expected, typical and worst-case turnaround times (TAT) for PR
checks?

Steffen
