Posted to dev@mxnet.apache.org by Marco de Abreu <ma...@googlemail.com.INVALID> on 2018/11/21 04:24:20 UTC

CI impaired

Hello,

I'd like to let you know that our CI was impaired and down for the last
few hours. After getting the CI back up, I noticed that our auto scaling
is broken because a silent Jenkins update broke our upscale detection.
Manual scaling is currently not possible, and stopping the scaling won't
help either because there are currently no p3 instances available, which
means that all jobs will fail nonetheless. In a few hours, the auto
scaling will have recycled all slaves through the down-scale mechanism and
we will be out of capacity. This will lead to resource starvation and thus
timeouts.
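
For illustration, here is a minimal sketch of the kind of upscale
detection the auto scaling relies on: poll the Jenkins build queue and
raise the desired capacity of an EC2 Auto Scaling group while jobs are
waiting. The Jenkins URL, the group name and the capacity limit are
placeholders, not our actual configuration.

    # Hypothetical sketch - names and limits are placeholders.
    import boto3
    import requests

    JENKINS_URL = "https://jenkins.example.org"   # placeholder
    ASG_NAME = "ci-slaves-example"                # placeholder

    def queued_builds():
        """Number of items currently waiting in the Jenkins build queue."""
        resp = requests.get(f"{JENKINS_URL}/queue/api/json", timeout=30)
        resp.raise_for_status()
        return len(resp.json().get("items", []))

    def scale_up_if_needed(max_capacity=50):
        """Raise the group's desired capacity to match the build backlog."""
        asg = boto3.client("autoscaling")
        group = asg.describe_auto_scaling_groups(
            AutoScalingGroupNames=[ASG_NAME]
        )["AutoScalingGroups"][0]
        desired = group["DesiredCapacity"]
        backlog = queued_builds()
        if backlog > 0 and desired < max_capacity:
            asg.set_desired_capacity(
                AutoScalingGroupName=ASG_NAME,
                DesiredCapacity=min(desired + backlog, max_capacity),
            )

    if __name__ == "__main__":
        scale_up_if_needed()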

Your PRs will be properly registered by Jenkins, but please expect the jobs
to time out and thus fail your PRs.

I will fix the auto scaling as soon as I'm awake again.

Sorry for the inconvenience caused.

Best regards,
Marco


P.S. Sorry for the brief email and my lack of further fixes, but it's
5:30AM now and I've been working for 17 hours.

Re: CI impaired

Posted by Naveen Swamy <mn...@gmail.com>.
There are still PRs pending that need to be merged and cherry-picked to the branch.


Re: CI impaired

Posted by Hagay Lupesko <lu...@gmail.com>.
Thanks for the update, Marco, and for all the hard work put into the CI!


Re: CI impaired

Posted by Marco de Abreu <ma...@googlemail.com.INVALID>.
Hello everyone,

the move has just been completed and the old big pipeline as well as the
corresponding job have been disabled. From now on, you will see the
detailed status messages below your PRs.

Some people wanted to make modifications to the Jenkinsfiles recently. In
that case, your PR will show a merge conflict. The new Jenkinsfiles are
available at [1].

Yesterday, I indexed all PRs with our CI system to make sure that each one
gets properly validated and our merge processes don't get impaired.
Everything looks good so far, but due to the flakiness of our tests, it's
quite unlikely that every single test has passed. If your particular PR
shows a failure for a certain test, please follow the same procedure as
usual and retrigger it by pushing another commit. From now on, you can
also trigger partial runs of the CI. For this, just contact a committer
and they will be happy to trigger that specific job on your behalf.

If somebody in the community is interested, we would also be happy to
collaborate on a bot that allows controlling CI runs - e.g. retriggering
certain jobs or requesting additional non-PR jobs to run, for example when
you made changes to the nightly tests.

Thanks everybody for being patient and so collaborative during this
transition time. I'm looking forward to everybody's contributions.

Best regards,
Marco

[1]: https://github.com/apache/incubator-mxnet/tree/master/ci/jenkins


Re: CI impaired

Posted by Marco de Abreu <ma...@googlemail.com.INVALID>.
Thanks Naveen and Gavin!

#1 has been completed and every job has finished its processing.

#2 is the ticket filed with Apache Infra:
https://issues.apache.org/jira/browse/INFRA-17346

I'm now waiting for their response.

-Marco


Re: CI impaired

Posted by Naveen Swamy <mn...@gmail.com>.
Hi Marco/Gavin,

Thanks for the clarification. I was not aware that it had been tested in a
separate test environment (this is what I was suggesting: make the changes
in a more controlled manner). Last time such a change was made, many PRs
were left dangling and developers had to go retrigger them; I retriggered
mine at least five times before it succeeded today.

Appreciate all the hard work to make CI better.

-Naveen


Re: CI impaired

Posted by "Gavin M. Bell" <ga...@gmail.com>.
Hey Folks,

Marco has been running this change in dev, with flying colors, for some
time. This is not an experiment but a rollout that was announced. We also
decided to make this change after the release cut to limit the blast
radius on any critical obligations to the community. Marco is accountable
for this work and will address any issues that may occur, as he has been
put on call. We have, to the best of our ability, mitigated as much risk
as possible, and now it is time to pull the trigger. The community will
enjoy a bit more visibility and clarity into the test process, which will
be advantageous, as well as allowing us to extend our infrastructure in a
way that affords us more flexibility.

No pending PRs will be impacted.

Thank you for your support as we evolve this system to better serve the
community.

-Gavin



-- 
Sincerely,
Gavin M. Bell

 "Never mistake a clear view for a short distance."
              -Paul Saffo

Re: CI impaired

Posted by Marco de Abreu <ma...@googlemail.com.INVALID>.
Hello Naveen, this is not an experiment. Everything has been tested in our
test system and is considered to be working 100%. This is not a test but
actually the move into production - the merge into master happened a week
ago. We now just have to put all PRs into the catalogue, which means that
all PRs have to be analyzed with the new pipelines; the only thing that
will be noticeable is that the CI is under higher load.

The pending PRs will not be impacted. The existing pipeline is still
running in parallel and everything will behave as before.

-Marco


Re: CI impaired

Posted by Naveen Swamy <mn...@gmail.com>.
Marco, run your experiments on a branch - set it up, test it well and then bring it to master.


Re: CI impaired

Posted by Marco de Abreu <ma...@googlemail.com.INVALID>.
Hello,

I'm now moving forward with #1. I will try to get to #3 as soon as
possible to reduce parallel jobs in our CI. You might notice some
unfinished jobs. I will let you know as soon as this process has been
completed. Until then, please bear with me since we have hundreds of jobs
to run in order to validate all PRs.

Best regards,
Marco


Re: CI impaired

Posted by Marco de Abreu <ma...@googlemail.com.INVALID>.
Hello,

since the release branch has now been cut, I would like to move forward
with the CI improvements for the master branch. This would include the
following actions:
1. Re-enable the new Jenkins job
2. Request Apache Infra to move the protected branch check from the main
pipeline to our new ones
3. Merge https://github.com/apache/incubator-mxnet/pull/13474 - this
finalizes the deprecation process

If nobody objects, I would like to start with #1 soon. Mentors, could you
please assist in creating the Apache Infra ticket? I would then take it
from there and talk to Infra.

Best regards,
Marco


Re: CI impaired

Posted by kellen sunderland <ke...@gmail.com>.
Sorry, [1] meant to reference
https://issues.jenkins-ci.org/browse/JENKINS-37984 .


Re: CI impaired

Posted by kellen sunderland <ke...@gmail.com>.
Marco and I ran into another urgent issue over the weekend that was causing
builds to fail.  This issue was unrelated to any feature development work,
or other CI fixes applied recently, but it did require quite a bit of work
from Marco (and a little from me) to fix.

We spent enough time on the problem that it caused us to take a step back
and consider how we could both fix issues in CI and support the 1.4 release
with the least impact possible on MXNet devs.  Marco had planned to make a
significant change to the CI to fix a long-standing Jenkins error [1], but
we feel that most developers would prioritize having a stable build
environment for the next few weeks over having this fix in place.

To properly introduce a new CI system, the intent was to do a gradual
blue/green rollout of the fix. Managing this rollout would have taken
operational effort and doubled the compute load, as we would run both
systems in parallel. This risks outages due to scaling limits, and we'd
rather make this change during a period of low developer activity, i.e.
shortly after the 1.4 release.
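
As a purely conceptual sketch (not the actual rollout mechanism), a
gradual blue/green rollout could deterministically route a configurable
fraction of PR builds to the new pipeline while the rest stay on the old
one, so a given PR always lands on the same side across retriggers.

    # Conceptual sketch only - the routing fraction is an assumption.
    import hashlib

    def use_new_pipeline(pr_number: int, rollout_fraction: float) -> bool:
        """Deterministically assign a PR to the new (green) pipeline."""
        digest = hashlib.sha256(str(pr_number).encode()).digest()
        bucket = digest[0] / 255.0   # stable value in [0, 1] per PR
        return bucket <= rollout_fraction

    # Example: at a 10% rollout roughly one in ten PRs uses the new pipeline.
    print(use_new_pipeline(13474, 0.10))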

This means that from now until the 1.4 release, in order to reduce
complexity, MXNet developers should only see a single Jenkins verification
check and a single Travis check.

Re: CI impaired

Posted by Steffen Rochel <st...@gmail.com>.
Hi Marco - I suggest retriggering PRs, if needed in stages:
- pr-awaiting-merge
- pr-awaiting-review
That would cover 78 PRs. In any case I would exclude pr-work-in-progress.

Steffen
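
A small sketch of how the per-label counts could be gathered before
retriggering in stages, using the GitHub search API; the repository and
label names are taken from this thread, everything else is illustrative
(unauthenticated requests are subject to low rate limits).

    # Illustrative sketch - add an Authorization header for real use.
    import requests

    REPO = "apache/incubator-mxnet"
    LABELS = ["pr-awaiting-merge", "pr-awaiting-review"]

    def open_prs_with_label(label):
        """Count open PRs in the repository carrying the given label."""
        query = f"repo:{REPO} is:pr is:open label:{label}"
        resp = requests.get(
            "https://api.github.com/search/issues",
            params={"q": query, "per_page": 1},
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json()["total_count"]

    for label in LABELS:
        print(label, open_prs_with_label(label))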

On Sat, Nov 24, 2018 at 9:11 PM kellen sunderland <
kellen.sunderland@gmail.com> wrote:

> Hey Marco, I'm still having quite a few issues passing PRs.  Would you be
> able to at least test a handful of PRs and make sure they pass/fail tests
> as you expect?
>
> On Sat, Nov 24, 2018, 7:01 PM Marco de Abreu
> <marco.g.abreu@googlemail.com.invalid> wrote:
>
> > Hello Steffen,
> >
> > thank you for bringing up these PRs.
> >
> > I had to abort the builds during the outage, which means that the jobs
> > didn't finish and not even the status propagation could finish (hence
> > they show pending instead of failure or aborted).
> >
> > Recently, we merged a PR that adds utility slaves. This will ensure that
> > status updates will always be posted, no matter whether the main queue
> > hangs or not. This means that the status will then be properly reflected
> > and there should be no hanging pending runs.
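
For context, a minimal sketch of the kind of status update such a utility
slave would post back to a PR's head commit via the GitHub commit status
API; the SHA, token, context name and target URL below are placeholders.

    # Hypothetical sketch - all identifiers are placeholders.
    import requests

    REPO = "apache/incubator-mxnet"
    SHA = "0123456789abcdef0123456789abcdef01234567"   # placeholder commit
    TOKEN = "example-token"                             # placeholder token

    def post_status(state, context, description, target_url):
        """Create a commit status (pending/success/failure/error) on the SHA."""
        resp = requests.post(
            f"https://api.github.com/repos/{REPO}/statuses/{SHA}",
            headers={"Authorization": f"token {TOKEN}"},
            json={
                "state": state,
                "context": context,
                "description": description,
                "target_url": target_url,
            },
            timeout=30,
        )
        resp.raise_for_status()

    post_status("pending", "ci/jenkins/unix-cpu", "Job has been queued",
                "https://jenkins.example.org/queue/item/42/")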
> >
> > I could retrigger all PRs to kick off another round of validation, but
> > this would result in 240 jobs to run (2 main pipelines times 120 open
> > PRs). Since we are currently in the pre-release stage, I wanted to avoid
> > putting the system under such heavy load.
> >
> > Instead, I'd kindly like to ask the PR creators to make a new commit to
> > trigger the pipelines. In order to merge a PR, only PR-merge has to pass,
> > and I tried to retrigger all PRs that had been aborted during the outage.
> > It is possible that I missed a few.
> >
> > Since it's still the weekend and there's not much going on, I can use
> > the time to trigger all PRs. Please advise whether you think I should
> > move forward (I expect the CI to finish all PRs within 6-10 hours) or if
> > it's fine to ask people to retrigger themselves.
> >
> > Please excuse the inconvenience caused.
> >
> > Best regards,
> > Marco
> >
> >
> > On Sun, Nov 25, 2018, 03:48 Steffen Rochel <steffenrochel@gmail.com>
> > wrote:
> >
> > > Thanks Marco for the updates and for resolving the issues.
> > > However, I do see a number of PRs waiting to be merged with an
> > > inconsistent PR validation status check.
> > > E.g. https://github.com/apache/incubator-mxnet/pull/13041 shows 9
> > > pending checks being queued. However, when you look at the details,
> > > the checks have either passed or failed (centos-cpu, edge, unix-cpu,
> > > windows-cpu, windows-gpu failed; the required pr-merge, which includes
> > > the edge and gpu tests, passed).
> > > The same applies to other PRs with the label pr-awaiting-merge (
> > > https://github.com/apache/incubator-mxnet/pulls?utf8=%E2%9C%93&q=is%3Apr+is%3Aopen+label%3Apr-awaiting-merge
> > > )
> > > Please advise on resolution.
> > >
> > > Regards,
> > > Steffen
> > >
> > > On Thu, Nov 22, 2018 at 12:09 PM Marco de Abreu
> > > <ma...@googlemail.com.invalid> wrote:
> > >
> > > > Thanks everybody, I really appreciate it!
> > > >
> > > > Today was a good day: there were no incidents and everything appears
> > > > to be stable. In the meantime I did a deep dive on why we had such a
> > > > significant performance decrease in our compilation jobs - which then
> > > > clogged up the queue and resulted in 1000 jobs waiting to be
> > > > scheduled.
> > > >
> > > > The reason was the way we use ccache to speed up our compilation
> > > > jobs. Usually, this yields a huge performance improvement (CPU
> > > > openblas, for example, goes from 30 minutes down to ~3min, ARMv7 from
> > > > 30 minutes down to ~1.5min, etc.). Unfortunately, in this case ccache
> > > > was our limiting factor. Here's some background about how we operate
> > > > our cache:
> > > >
> > > > We use EFS to have a distributed ccache shared between all of our
> > > > unrestricted-prod-slaves. EFS is rated for almost unlimited
> > > > scalability (being consumed by thousands of instances in parallel
> > > > [1]) with a theoretical throughput of over 10Gbps. One thing I didn't
> > > > know when I designed this approach was how throughput is granted.
> > > > Similar to T2 CPU credits, EFS uses BurstCredits to grant you higher
> > > > throughput (the default is 50MiB/s) [2]. Due to the high load, we
> > > > consumed all of our credits - here's a very interesting graph: [3].
> > > >
> > > > To avoid similar incidents in future, I have taken the following
> > actions:
> > > > 1. I switched EFS from burst-mode to provisioned throughput with
> > 300MB/s
> > > > (in the graph at [3] you can see how our IO immediately increases -
> and
> > > > thus our CI gets faster - as soon as I added provisioned throughput).
> > > > 2. I created internal follow-up tickets to add monitoring and
> automated
> > > > actions.
> > > >
> > > > First, we should be notified if we are running low on credits to
> > kick-off
> > > > an investigation. Second (nice to have), we could have a
> > lambda-function
> > > > which listens for that event and automatically switches the EFS
> volume
> > > from
> > > > burst-mode to provisioned throughput during high-load-times. The
> > required
> > > > throughput could be retrieved via CloudWatch and then multiplied by a
> > > > factor. EFS allows to downgrade the throughput mode 24h after the
> last
> > > > changes (to reduce capacity if the load is over) and always allows to
> > > > upgrade the provisioned capacity (if the load goes even higher). I've
> > > been
> > > > looking for a pre-made CloudFormation template to facilitate that,
> but
> > so
> > > > far, I haven't been able to find it.
> > > >
> > > > I'm now running additional load tests on our test CI environment to
> > > detect
> > > > other potential bottlenecks.
> > > >
> > > > Thanks a lot for your support!
> > > >
> > > > Best regards,
> > > > Marco
> > > >
> > > > [1]: https://docs.aws.amazon.com/efs/latest/ug/performance.html
> > > > [2]:
> > > >
> > >
> >
> https://docs.aws.amazon.com/efs/latest/ug/performance.html#throughput-modes
> > > > [3]: https://i.imgur.com/nboQLOn.png
> > > >
> > > > On Thu, Nov 22, 2018 at 1:40 AM Qing Lan <la...@live.com>
> wrote:
> > > >
> > > > > Appreciated for your effort and help to make CI a better place!
> > > > >
> > > > > Qing
> > > > >
> > > > > On 11/21/18, 4:38 PM, "Lin Yuan" <ap...@gmail.com> wrote:
> > > > >
> > > > >     Thanks for your efforts, Marco!
> > > > >
> > > > >     On Wed, Nov 21, 2018 at 4:02 PM Anirudh Subramanian <
> > > > > anirudh2290@gmail.com>
> > > > >     wrote:
> > > > >
> > > > >     > Thanks for the quick response and mitigation!
> > > > >     >
> > > > >     > On Wed, Nov 21, 2018 at 3:55 PM Marco de Abreu
> > > > >     > <ma...@googlemail.com.invalid> wrote:
> > > > >     >
> > > > >     > > Hello,
> > > > >     > >
> > > > >     > > today, CI had some issues and I had to cancel all jobs a
> few
> > > > > minutes ago.
> > > > >     > > This was basically caused by the high load that is
> currently
> > > > being
> > > > > put on
> > > > >     > > our CI system due to the pre-release efforts for this
> Friday.
> > > > >     > >
> > > > >     > > It's really unfortunate that we just had outages of three
> > core
> > > > > components
> > > > >     > > within the last two days - sorry about that!. To recap, we
> > had
> > > > the
> > > > >     > > following outages (which are unrelated to the parallel
> > refactor
> > > > of
> > > > > the
> > > > >     > > Jenkins pipeline):
> > > > >     > > - (yesterday evening) The Jenkins master ran out of disk
> > space
> > > > and
> > > > > thus
> > > > >     > > processed requests at reduced capacity
> > > > >     > > - (this morning) The Jenkins master got updated which broke
> > our
> > > > >     > > autoscalings upscaling capabilities.
> > > > >     > > - (new, this evening) Jenkins API was irresponsive: Due to
> > the
> > > > high
> > > > >     > number
> > > > >     > > of jobs and a bad API design in the Jenkins REST API, the
> > > > > time-complexity
> > > > >     > > of a simple create or delete request was quadratic which
> > > resulted
> > > > > in all
> > > > >     > > requests timing out (that was the current outage). This
> > > resulted
> > > > > in our
> > > > >     > > auto scaling to be unable to interface with the Jenkins
> > master.
> > > > >     > >
> > > > >     > > I have now made improvements to our REST API calls which
> > > reduced
> > > > > the
> > > > >     > > complexity from O(N^2) to O(1). The reason was an
> underlying
> > > > > redirect
> > > > >     > loop
> > > > >     > > in the Jenkins createNode and deleteNode REST API in
> > > combination
> > > > > with
> > > > >     > > unrolling the entire slave and job graph (which got quite
> > huge
> > > > > during
> > > > >     > > extensive load) upon every single request. Since we had
> about
> > > 150
> > > > >     > > registered slaves and 1000 jobs in the queue, the duration
> > for
> > > a
> > > > > single
> > > > >     > > REST API call rose to up to 45 seconds (we execute up to a
> > few
> > > > > hundred
> > > > >     > > queries per auto scaling loop). This lead to our auto
> scaling
> > > > > timing out.
> > > > >     > >
> > > > >     > > Everything should be back to normal now. I'm closely
> > observing
> > > > the
> > > > >     > > situation and I'll let you know if I encounter any
> additional
> > > > > issues.
> > > > >     > >
> > > > >     > > Again, sorry for any caused inconveniences.
> > > > >     > >
> > > > >     > > Best regards,
> > > > >     > > Marco
> > > > >     > >
> > > > >     > > On Wed, Nov 21, 2018 at 5:10 PM Gavin M Bell <
> > > > > gavin.max.bell@gmail.com>
> > > > >     > > wrote:
> > > > >     > >
> > > > >     > > > Yes, let me add to the kudos, very nice work Marco.
> > > > >     > > >
> > > > >     > > >
> > > > >     > > > "I'm trying real hard to be the shepherd." -Jules
> Winnfield
> > > > >     > > >
> > > > >     > > >
> > > > >     > > > > On Nov 21, 2018, at 5:04 PM, Sunderland, Kellen
> > > > >     > > > <ke...@amazon.de.INVALID> wrote:
> > > > >     > > > >
> > > > >     > > > > Appreciate the big effort in bring the CI back so
> > quickly.
> > > > > Thanks
> > > > >     > > Marco.
> > > > >     > > > >
> > > > >     > > > > On Nov 21, 2018 5:52 AM, Marco de Abreu <
> > > > >     > marco.g.abreu@googlemail.com
> > > > >     > > .INVALID>
> > > > >     > > > wrote:
> > > > >     > > > > Thanks Aaron! Just for the record, the new Jenkins jobs
> > > were
> > > > >     > unrelated
> > > > >     > > to
> > > > >     > > > > that incident.
> > > > >     > > > >
> > > > >     > > > > If somebody is interested in the details around the
> > outage:
> > > > >     > > > >
> > > > >     > > > > Due to a required maintenance (disk running full), we
> had
> > > to
> > > > > upgrade
> > > > >     > > our
> > > > >     > > > > Jenkins master because it was running on Ubuntu 17.04
> > (for
> > > an
> > > > > unknown
> > > > >     > > > > reason, it used to be 16.04) and we needed to install
> > some
> > > > > packages.
> > > > >     > > > Since
> > > > >     > > > > the support for Ubuntu 17.04 was stopped, this resulted
> > in
> > > > all
> > > > >     > package
> > > > >     > > > > updates and installations to fail because the
> > repositories
> > > > > were taken
> > > > >     > > > > offline. Due to the unavailable maintenance package and
> > > other
> > > > > issues
> > > > >     > > with
> > > > >     > > > > the installed OpenJDK8 version, we made the decision to
> > > > > upgrade the
> > > > >     > > > Jenkins
> > > > >     > > > > master to Ubuntu 18.04 LTS in order to get back to a
> > > > supported
> > > > >     > version
> > > > >     > > > with
> > > > >     > > > > maintenance tools. During this upgrade, Jenkins was
> > > > > automatically
> > > > >     > > updated
> > > > >     > > > > by APT as part of the dist-upgrade process.
> > > > >     > > > >
> > > > >     > > > > In the latest version of Jenkins, some labels have been
> > > > > changed which
> > > > >     > > we
> > > > >     > > > > depend on for our auto scaling. To be more specific:
> > > > >     > > > >> Waiting for next available executor on mxnetlinux-gpu
> > > > >     > > > > has been changed to
> > > > >     > > > >> Waiting for next available executor on
> ‘mxnetlinux-gpu’
> > > > >     > > > > Notice the quote characters.
> > > > >     > > > >
> > > > >     > > > > Jenkins does not offer a better way than to parse these
> > > > > messages
> > > > >     > > > > unfortunately - there's no standardized way to express
> > > queue
> > > > > items.
> > > > >     > > Since
> > > > >     > > > > our parser expected the above message without quote
> > signs,
> > > > this
> > > > >     > message
> > > > >     > > > was
> > > > >     > > > > discarded.
> > > > >     > > > >
> > > > >     > > > > We support various queue reasons (5 of them to be
> exact)
> > > that
> > > > >     > indicate
> > > > >     > > > > resource starvation. If we run super low on capacity,
> the
> > > > queue
> > > > >     > reason
> > > > >     > > is
> > > > >     > > > > different and we would still be able to scale up, but
> > most
> > > of
> > > > > the
> > > > >     > cases
> > > > >     > > > > would have printed the unsupported message. This
> resulted
> > > in
> > > > > reduced
> > > > >     > > > > capacity (to be specific, the limit during that time
> was
> > 1
> > > > > slave per
> > > > >     > > > type).
> > > > >     > > > >
> > > > >     > > > > We have now fixed our autoscaling to automatically
> strip
> > > > these
> > > > >     > > characters
> > > > >     > > > > and added that message to our test suite.
> > > > >     > > > >
> > > > >     > > > > Best regards,
> > > > >     > > > > Marco
> > > > >     > > > >
> > > > >     > > > > On Wed, Nov 21, 2018 at 2:49 PM Aaron Markham <
> > > > >     > > aaron.s.markham@gmail.com
> > > > >     > > > >
> > > > >     > > > > wrote:
> > > > >     > > > >
> > > > >     > > > >> Marco, thanks for your hard work on this. I'm super
> > > excited
> > > > > about
> > > > >     > the
> > > > >     > > > new
> > > > >     > > > >> Jenkins jobs. This is going to be very helpful and
> > improve
> > > > > sanity
> > > > >     > for
> > > > >     > > > our
> > > > >     > > > >> PRs and ourselves!
> > > > >     > > > >>
> > > > >     > > > >> Cheers,
> > > > >     > > > >> Aaron
> > > > >     > > > >>
> > > > >     > > > >> On Wed, Nov 21, 2018, 05:37 Marco de Abreu
> > > > >     > > > >> <marco.g.abreu@googlemail.com.invalid wrote:
> > > > >     > > > >>
> > > > >     > > > >>> Hello,
> > > > >     > > > >>>
> > > > >     > > > >>> the CI is now back up and running. Auto scaling is
> > > working
> > > > as
> > > > >     > > expected
> > > > >     > > > >> and
> > > > >     > > > >>> it passed our load tests.
> > > > >     > > > >>>
> > > > >     > > > >>> Please excuse the caused inconveniences.
> > > > >     > > > >>>
> > > > >     > > > >>> Best regards,
> > > > >     > > > >>> Marco
> > > > >     > > > >>>
> > > > >     > > > >>> On Wed, Nov 21, 2018 at 5:24 AM Marco de Abreu <
> > > > >     > > > >>> marco.g.abreu@googlemail.com>
> > > > >     > > > >>> wrote:
> > > > >     > > > >>>
> > > > >     > > > >>>> Hello,
> > > > >     > > > >>>>
> > > > >     > > > >>>> I'd like to let you know that our CI was impaired
> and
> > > down
> > > > > for the
> > > > >     > > > last
> > > > >     > > > >>>> few hours. After getting the CI back up, I noticed
> > that
> > > > our
> > > > > auto
> > > > >     > > > >> scaling
> > > > >     > > > >>>> broke due to a silent update of Jenkins which broke
> > our
> > > > >     > > > >>> upscale-detection.
> > > > >     > > > >>>> Manual scaling is currently not possible and
> stopping
> > > the
> > > > > scaling
> > > > >     > > > won't
> > > > >     > > > >>>> help either because there are currently no p3
> > instances
> > > > > available,
> > > > >     > > > >> which
> > > > >     > > > >>>> means that all jobs will fail none the less. In a
> few
> > > > > hours, the
> > > > >     > > auto
> > > > >     > > > >>>> scaling will have recycled all slaves through the
> > > > down-scale
> > > > >     > > mechanism
> > > > >     > > > >>> and
> > > > >     > > > >>>> we will be out of capacity. This will lead to
> resource
> > > > > starvation
> > > > >     > > and
> > > > >     > > > >>> thus
> > > > >     > > > >>>> timeouts.
> > > > >     > > > >>>>
> > > > >     > > > >>>> Your PRs will be properly registered by Jenkins, but
> > > > please
> > > > > expect
> > > > >     > > the
> > > > >     > > > >>>> jobs to time out and thus fail your PRs.
> > > > >     > > > >>>>
> > > > >     > > > >>>> I will fix the auto scaling as soon as I'm awake
> > again.
> > > > >     > > > >>>>
> > > > >     > > > >>>> Sorry for the caused inconveniences.
> > > > >     > > > >>>>
> > > > >     > > > >>>> Best regards,
> > > > >     > > > >>>> Marco
> > > > >     > > > >>>>
> > > > >     > > > >>>>
> > > > >     > > > >>>> P.S. Sorry for the brief email and my lack of
> further
> > > > > fixes, but
> > > > >     > > it's
> > > > >     > > > >>>> 5:30AM now and I've been working for 17 hours.
> > > > >     > > > >>>>
> > > > >     > > > >>>
> > > > >     > > > >>
> > > > >     > > >
> > > > >     > >
> > > > >     >
> > > > >
> > > > >
> > > > >
> > > >
> > >
> >
>

Re: CI impaired

Posted by kellen sunderland <ke...@gmail.com>.
Hey Marco, I'm still having quite a few issues passing PRs.  Would you be
able to at least test a handful of PRs and make sure they pass/fail tests
as you expect?

Re: CI impaired

Posted by Marco de Abreu <ma...@googlemail.com.INVALID>.
Hello Steffen,

thank you for bringing up these PRs.

I had to abort the builds during the outage, which means that the jobs
didn't finish and not even the status propagation could complete (hence
they show pending instead of failure or aborted).

Recently, we merged a PR that adds utility slaves. This will ensure that
status updates will always be posted, no matter whether the main queue
hangs or not. This means that the status would then be properly reflected
and there should be no hanging pending runs.

I could retrigger all PRs to kick off another round of validation, but
this would result in 240 jobs to run (2 main pipelines times 120 open
PRs). Since we are currently in the pre-release stage, I wanted to avoid
putting the system under such heavy load.
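
For illustration, the 240-job estimate can be reproduced by counting the
open PRs via the public GitHub search API. This is only a rough sketch of
the arithmetic above - the script name and the hard-coded factor of two
pipelines are assumptions, not part of our CI tooling:

    # estimate_retrigger_load.py - hypothetical helper, not part of the CI.
    # Counts open PRs in the repository and multiplies by the number of
    # main pipelines to estimate how many jobs a full retrigger schedules.
    import json
    import urllib.request

    SEARCH_URL = ("https://api.github.com/search/issues"
                  "?q=repo:apache/incubator-mxnet+is:pr+is:open")
    PIPELINES_PER_PR = 2  # the two main pipelines mentioned above

    req = urllib.request.Request(SEARCH_URL,
                                 headers={"User-Agent": "ci-load-estimate"})
    with urllib.request.urlopen(req) as resp:
        open_prs = json.load(resp)["total_count"]

    print(open_prs, "open PRs ->", open_prs * PIPELINES_PER_PR, "pipeline runs")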

Instead, I'd kindly like to ask the PR creators to make a new commit to
trigger the pipelines. In order to merge a PR, only PR-merge has to pass,
and I tried to retrigger all PRs that were aborted during the outage. It
is possible that I missed a few.

Since it's still the weekend and there's not much going on, I can use the
time to trigger all PRs. Please advise whether you think I should move
forward (I expect the CI to finish all PRs within 6-10 hours) or if it's
fine to ask people to retrigger themselves.

Please excuse the caused inconveniences.

Best regards,
Marco


Re: CI impaired

Posted by Steffen Rochel <st...@gmail.com>.
Thanks Marco for the updates and for resolving the issues.
However, I do see a number of PRs waiting to be merged with an
inconsistent PR validation status check.
E.g. https://github.com/apache/incubator-mxnet/pull/13041 shows 9 pending
checks being queued. However, when you look at the details, the checks
have either passed or failed (centos-cpu, edge, unix-cpu, windows-cpu and
windows-gpu failed; the required pr-merge, which includes the edge and gpu
tests, passed).
The same applies to other PRs with the label pr-awaiting-merge (
https://github.com/apache/incubator-mxnet/pulls?utf8=%E2%9C%93&q=is%3Apr+is%3Aopen+label%3Apr-awaiting-merge
)
Please advise on resolution.

Regards,
Steffen

Re: CI impaired

Posted by Marco de Abreu <ma...@googlemail.com.INVALID>.
Thanks everybody, I really appreciate it!

Today was a good day: there were no incidents and everything appears to be
stable. In the meantime I did a deep dive on why we had such a significant
performance decrease in our compilation jobs - which then clogged up the
queue and resulted in 1000 jobs waiting to be scheduled.

The reason was the way we use ccache to speed up our compilation jobs.
Usually, this yields us a huge performance improvement (CPU openblas, for
example, goes from 30 minutes down to ~3min, ARMv7 from 30 minutes down to
~1.5min, etc.). Unfortunately in this case, ccache was our limiting factor.
Here's some background about how we operate our cache:

We use EFS as a distributed ccache shared between all of our
unrestricted-prod-slaves. EFS is rated for almost unlimited scalability
(being consumed by thousands of instances in parallel [1]) with a
theoretical throughput of over 10Gbps. One thing I didn't know when I
designed this approach was how throughput is granted: similar to T2 CPU
credits, EFS uses burst credits to allow higher throughput (the default
is 50MiB/s) [2]. Due to the high load, we consumed all of our credits -
here's a very interesting graph: [3].
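
The credit balance itself is exposed as the CloudWatch metric
BurstCreditBalance. As an aside, here is a minimal sketch (using boto3;
the file system id and the threshold are placeholders, not our actual
values) of how it can be read:

    import datetime
    import boto3

    cloudwatch = boto3.client("cloudwatch")

    def burst_credit_balance(fs_id):
        # Return the most recent BurstCreditBalance (in bytes) for an EFS volume.
        now = datetime.datetime.utcnow()
        stats = cloudwatch.get_metric_statistics(
            Namespace="AWS/EFS",
            MetricName="BurstCreditBalance",
            Dimensions=[{"Name": "FileSystemId", "Value": fs_id}],
            StartTime=now - datetime.timedelta(minutes=15),
            EndTime=now,
            Period=300,
            Statistics=["Minimum"],
        )
        datapoints = sorted(stats["Datapoints"], key=lambda d: d["Timestamp"])
        return datapoints[-1]["Minimum"] if datapoints else None

    balance = burst_credit_balance("fs-12345678")  # placeholder file system id
    if balance is not None and balance < 100 * 1024 ** 3:  # arbitrary 100 GiB threshold
        print("EFS burst credits running low - consider provisioned throughput")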

To avoid similar incidents in the future, I have taken the following actions:
1. I switched EFS from burst-mode to provisioned throughput with 300MB/s
(in the graph at [3] you can see how our IO immediately increases - and
thus our CI gets faster - as soon as I added provisioned throughput).
2. I created internal follow-up tickets to add monitoring and automated
actions.

First, we should be notified if we are running low on credits so that we
can kick off an investigation. Second (nice to have), we could have a
Lambda function that listens for that event and automatically switches the
EFS volume from burst mode to provisioned throughput during high-load
periods. The required throughput could be retrieved via CloudWatch and
then multiplied by a factor. EFS allows downgrading the throughput mode
24 hours after the last change (to reduce capacity once the load is over)
and always allows upgrading the provisioned capacity (if the load goes
even higher). I've been looking for a pre-made CloudFormation template to
facilitate that, but so far, I haven't been able to find one.
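
To make the idea concrete, a rough sketch of such a Lambda handler could
look like the following (hypothetical code, not our actual automation; the
file system id and target rate are placeholders and would be derived from
CloudWatch in practice):

    import boto3

    efs = boto3.client("efs")

    def handler(event, context):
        # Assumed to be invoked by a CloudWatch alarm on BurstCreditBalance
        # (e.g. via SNS); the event payload is not inspected in this sketch.
        fs_id = "fs-12345678"   # placeholder file system id
        target_mibps = 300      # placeholder target rate (MiB/s)

        fs = efs.describe_file_systems(FileSystemId=fs_id)["FileSystems"][0]
        if fs["ThroughputMode"] == "provisioned":
            return "already on provisioned throughput"

        efs.update_file_system(
            FileSystemId=fs_id,
            ThroughputMode="provisioned",
            ProvisionedThroughputInMibps=target_mibps,
        )
        return "switched %s to %d MiB/s provisioned throughput" % (fs_id, target_mibps)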

I'm now running additional load tests on our test CI environment to detect
other potential bottlenecks.

Thanks a lot for your support!

Best regards,
Marco

[1]: https://docs.aws.amazon.com/efs/latest/ug/performance.html
[2]:
https://docs.aws.amazon.com/efs/latest/ug/performance.html#throughput-modes
[3]: https://i.imgur.com/nboQLOn.png

On Thu, Nov 22, 2018 at 1:40 AM Qing Lan <la...@live.com> wrote:

> Appreciated for your effort and help to make CI a better place!
>
> Qing
>
> On 11/21/18, 4:38 PM, "Lin Yuan" <ap...@gmail.com> wrote:
>
>     Thanks for your efforts, Marco!
>
>     On Wed, Nov 21, 2018 at 4:02 PM Anirudh Subramanian <
> anirudh2290@gmail.com>
>     wrote:
>
>     > Thanks for the quick response and mitigation!
>     >
>     > On Wed, Nov 21, 2018 at 3:55 PM Marco de Abreu
>     > <ma...@googlemail.com.invalid> wrote:
>     >
>     > > Hello,
>     > >
>     > > today, CI had some issues and I had to cancel all jobs a few
> minutes ago.
>     > > This was basically caused by the high load that is currently being
> put on
>     > > our CI system due to the pre-release efforts for this Friday.
>     > >
>     > > It's really unfortunate that we just had outages of three core
> components
>     > > within the last two days - sorry about that!. To recap, we had the
>     > > following outages (which are unrelated to the parallel refactor of
> the
>     > > Jenkins pipeline):
>     > > - (yesterday evening) The Jenkins master ran out of disk space and
> thus
>     > > processed requests at reduced capacity
>     > > - (this morning) The Jenkins master got updated which broke our
>     > > autoscalings upscaling capabilities.
>     > > - (new, this evening) Jenkins API was irresponsive: Due to the high
>     > number
>     > > of jobs and a bad API design in the Jenkins REST API, the
> time-complexity
>     > > of a simple create or delete request was quadratic which resulted
> in all
>     > > requests timing out (that was the current outage). This resulted
> in our
>     > > auto scaling to be unable to interface with the Jenkins master.
>     > >
>     > > I have now made improvements to our REST API calls which reduced
> the
>     > > complexity from O(N^2) to O(1). The reason was an underlying
> redirect
>     > loop
>     > > in the Jenkins createNode and deleteNode REST API in combination
> with
>     > > unrolling the entire slave and job graph (which got quite huge
> during
>     > > extensive load) upon every single request. Since we had about 150
>     > > registered slaves and 1000 jobs in the queue, the duration for a
> single
>     > > REST API call rose to up to 45 seconds (we execute up to a few
> hundred
>     > > queries per auto scaling loop). This lead to our auto scaling
> timing out.
>     > >
>     > > Everything should be back to normal now. I'm closely observing the
>     > > situation and I'll let you know if I encounter any additional
> issues.
>     > >
>     > > Again, sorry for any caused inconveniences.
>     > >
>     > > Best regards,
>     > > Marco
>     > >
>     > > On Wed, Nov 21, 2018 at 5:10 PM Gavin M Bell <
> gavin.max.bell@gmail.com>
>     > > wrote:
>     > >
>     > > > Yes, let me add to the kudos, very nice work Marco.
>     > > >
>     > > >
>     > > > "I'm trying real hard to be the shepherd." -Jules Winnfield
>     > > >
>     > > >
>     > > > > On Nov 21, 2018, at 5:04 PM, Sunderland, Kellen
>     > > > <ke...@amazon.de.INVALID> wrote:
>     > > > >
>     > > > > Appreciate the big effort in bring the CI back so quickly.
> Thanks
>     > > Marco.
>     > > > >
>     > > > > On Nov 21, 2018 5:52 AM, Marco de Abreu <
>     > marco.g.abreu@googlemail.com
>     > > .INVALID>
>     > > > wrote:
>     > > > > Thanks Aaron! Just for the record, the new Jenkins jobs were
>     > unrelated
>     > > to
>     > > > > that incident.
>     > > > >
>     > > > > If somebody is interested in the details around the outage:
>     > > > >
>     > > > > Due to a required maintenance (disk running full), we had to
> upgrade
>     > > our
>     > > > > Jenkins master because it was running on Ubuntu 17.04 (for an
> unknown
>     > > > > reason, it used to be 16.04) and we needed to install some
> packages.
>     > > > Since
>     > > > > the support for Ubuntu 17.04 was stopped, this resulted in all
>     > package
>     > > > > updates and installations to fail because the repositories
> were taken
>     > > > > offline. Due to the unavailable maintenance package and other
> issues
>     > > with
>     > > > > the installed OpenJDK8 version, we made the decision to
> upgrade the
>     > > > Jenkins
>     > > > > master to Ubuntu 18.04 LTS in order to get back to a supported
>     > version
>     > > > with
>     > > > > maintenance tools. During this upgrade, Jenkins was
> automatically
>     > > updated
>     > > > > by APT as part of the dist-upgrade process.
>     > > > >
>     > > > > In the latest version of Jenkins, some labels have been
> changed which
>     > > we
>     > > > > depend on for our auto scaling. To be more specific:
>     > > > >> Waiting for next available executor on mxnetlinux-gpu
>     > > > > has been changed to
>     > > > >> Waiting for next available executor on ‘mxnetlinux-gpu’
>     > > > > Notice the quote characters.
>     > > > >
>     > > > > Jenkins does not offer a better way than to parse these
> messages
>     > > > > unfortunately - there's no standardized way to express queue
> items.
>     > > Since
>     > > > > our parser expected the above message without quote signs, this
>     > message
>     > > > was
>     > > > > discarded.
>     > > > >
>     > > > > We support various queue reasons (5 of them to be exact) that
>     > indicate
>     > > > > resource starvation. If we run super low on capacity, the queue
>     > reason
>     > > is
>     > > > > different and we would still be able to scale up, but most of
> the
>     > cases
>     > > > > would have printed the unsupported message. This resulted in
> reduced
>     > > > > capacity (to be specific, the limit during that time was 1
> slave per
>     > > > type).
>     > > > >
>     > > > > We have now fixed our autoscaling to automatically strip these
>     > > characters
>     > > > > and added that message to our test suite.
>     > > > >
>     > > > > Best regards,
>     > > > > Marco
>     > > > >
>     > > > > On Wed, Nov 21, 2018 at 2:49 PM Aaron Markham <
>     > > aaron.s.markham@gmail.com
>     > > > >
>     > > > > wrote:
>     > > > >
>     > > > >> Marco, thanks for your hard work on this. I'm super excited
> about
>     > the
>     > > > new
>     > > > >> Jenkins jobs. This is going to be very helpful and improve
> sanity
>     > for
>     > > > our
>     > > > >> PRs and ourselves!
>     > > > >>
>     > > > >> Cheers,
>     > > > >> Aaron
>     > > > >>
>     > > > >> On Wed, Nov 21, 2018, 05:37 Marco de Abreu
>     > > > >> <marco.g.abreu@googlemail.com.invalid wrote:
>     > > > >>
>     > > > >>> Hello,
>     > > > >>>
>     > > > >>> the CI is now back up and running. Auto scaling is working as
>     > > expected
>     > > > >> and
>     > > > >>> it passed our load tests.
>     > > > >>>
>     > > > >>> Please excuse the caused inconveniences.
>     > > > >>>
>     > > > >>> Best regards,
>     > > > >>> Marco
>     > > > >>>
>     > > > >>> On Wed, Nov 21, 2018 at 5:24 AM Marco de Abreu <
>     > > > >>> marco.g.abreu@googlemail.com>
>     > > > >>> wrote:
>     > > > >>>
>     > > > >>>> Hello,
>     > > > >>>>
>     > > > >>>> I'd like to let you know that our CI was impaired and down
> for the
>     > > > last
>     > > > >>>> few hours. After getting the CI back up, I noticed that our
> auto
>     > > > >> scaling
>     > > > >>>> broke due to a silent update of Jenkins which broke our
>     > > > >>> upscale-detection.
>     > > > >>>> Manual scaling is currently not possible and stopping the
> scaling
>     > > > won't
>     > > > >>>> help either because there are currently no p3 instances
> available,
>     > > > >> which
>     > > > >>>> means that all jobs will fail none the less. In a few
> hours, the
>     > > auto
>     > > > >>>> scaling will have recycled all slaves through the down-scale
>     > > mechanism
>     > > > >>> and
>     > > > >>>> we will be out of capacity. This will lead to resource
> starvation
>     > > and
>     > > > >>> thus
>     > > > >>>> timeouts.
>     > > > >>>>
>     > > > >>>> Your PRs will be properly registered by Jenkins, but please
> expect
>     > > the
>     > > > >>>> jobs to time out and thus fail your PRs.
>     > > > >>>>
>     > > > >>>> I will fix the auto scaling as soon as I'm awake again.
>     > > > >>>>
>     > > > >>>> Sorry for the caused inconveniences.
>     > > > >>>>
>     > > > >>>> Best regards,
>     > > > >>>> Marco
>     > > > >>>>
>     > > > >>>>
>     > > > >>>> P.S. Sorry for the brief email and my lack of further
> fixes, but
>     > > it's
>     > > > >>>> 5:30AM now and I've been working for 17 hours.
>     > > > >>>>
>     > > > >>>
>     > > > >>
>     > > >
>     > >
>     >
>
>
>

Re: CI impaired

Posted by Qing Lan <la...@live.com>.
I appreciate your effort and help in making CI a better place!

Qing 

On 11/21/18, 4:38 PM, "Lin Yuan" <ap...@gmail.com> wrote:

    Thanks for your efforts, Marco!
    
    On Wed, Nov 21, 2018 at 4:02 PM Anirudh Subramanian <an...@gmail.com>
    wrote:
    
    > Thanks for the quick response and mitigation!
    >
    > On Wed, Nov 21, 2018 at 3:55 PM Marco de Abreu
    > <ma...@googlemail.com.invalid> wrote:
    >
    > > Hello,
    > >
    > > today, CI had some issues and I had to cancel all jobs a few minutes ago.
    > > This was basically caused by the high load that is currently being put on
    > > our CI system due to the pre-release efforts for this Friday.
    > >
    > > It's really unfortunate that we just had outages of three core components
    > > within the last two days - sorry about that!. To recap, we had the
    > > following outages (which are unrelated to the parallel refactor of the
    > > Jenkins pipeline):
    > > - (yesterday evening) The Jenkins master ran out of disk space and thus
    > > processed requests at reduced capacity
    > > - (this morning) The Jenkins master got updated which broke our
    > > autoscalings upscaling capabilities.
    > > - (new, this evening) Jenkins API was irresponsive: Due to the high
    > number
    > > of jobs and a bad API design in the Jenkins REST API, the time-complexity
    > > of a simple create or delete request was quadratic which resulted in all
    > > requests timing out (that was the current outage). This resulted in our
    > > auto scaling to be unable to interface with the Jenkins master.
    > >
    > > I have now made improvements to our REST API calls which reduced the
    > > complexity from O(N^2) to O(1). The reason was an underlying redirect
    > loop
    > > in the Jenkins createNode and deleteNode REST API in combination with
    > > unrolling the entire slave and job graph (which got quite huge during
    > > extensive load) upon every single request. Since we had about 150
    > > registered slaves and 1000 jobs in the queue, the duration for a single
    > > REST API call rose to up to 45 seconds (we execute up to a few hundred
    > > queries per auto scaling loop). This lead to our auto scaling timing out.
    > >
    > > Everything should be back to normal now. I'm closely observing the
    > > situation and I'll let you know if I encounter any additional issues.
    > >
    > > Again, sorry for any caused inconveniences.
    > >
    > > Best regards,
    > > Marco
    > >
    > > On Wed, Nov 21, 2018 at 5:10 PM Gavin M Bell <ga...@gmail.com>
    > > wrote:
    > >
    > > > Yes, let me add to the kudos, very nice work Marco.
    > > >
    > > >
    > > > "I'm trying real hard to be the shepherd." -Jules Winnfield
    > > >
    > > >
    > > > > On Nov 21, 2018, at 5:04 PM, Sunderland, Kellen
    > > > <ke...@amazon.de.INVALID> wrote:
    > > > >
    > > > > Appreciate the big effort in bring the CI back so quickly.  Thanks
    > > Marco.
    > > > >
    > > > > On Nov 21, 2018 5:52 AM, Marco de Abreu <
    > marco.g.abreu@googlemail.com
    > > .INVALID>
    > > > wrote:
    > > > > Thanks Aaron! Just for the record, the new Jenkins jobs were
    > unrelated
    > > to
    > > > > that incident.
    > > > >
    > > > > If somebody is interested in the details around the outage:
    > > > >
    > > > > Due to a required maintenance (disk running full), we had to upgrade
    > > our
    > > > > Jenkins master because it was running on Ubuntu 17.04 (for an unknown
    > > > > reason, it used to be 16.04) and we needed to install some packages.
    > > > Since
    > > > > the support for Ubuntu 17.04 was stopped, this resulted in all
    > package
    > > > > updates and installations to fail because the repositories were taken
    > > > > offline. Due to the unavailable maintenance package and other issues
    > > with
    > > > > the installed OpenJDK8 version, we made the decision to upgrade the
    > > > Jenkins
    > > > > master to Ubuntu 18.04 LTS in order to get back to a supported
    > version
    > > > with
    > > > > maintenance tools. During this upgrade, Jenkins was automatically
    > > updated
    > > > > by APT as part of the dist-upgrade process.
    > > > >
    > > > > In the latest version of Jenkins, some labels have been changed which
    > > we
    > > > > depend on for our auto scaling. To be more specific:
    > > > >> Waiting for next available executor on mxnetlinux-gpu
    > > > > has been changed to
    > > > >> Waiting for next available executor on ‘mxnetlinux-gpu’
    > > > > Notice the quote characters.
    > > > >
    > > > > Jenkins does not offer a better way than to parse these messages
    > > > > unfortunately - there's no standardized way to express queue items.
    > > Since
    > > > > our parser expected the above message without quote signs, this
    > message
    > > > was
    > > > > discarded.
    > > > >
    > > > > We support various queue reasons (5 of them to be exact) that
    > indicate
    > > > > resource starvation. If we run super low on capacity, the queue
    > reason
    > > is
    > > > > different and we would still be able to scale up, but most of the
    > cases
    > > > > would have printed the unsupported message. This resulted in reduced
    > > > > capacity (to be specific, the limit during that time was 1 slave per
    > > > type).
    > > > >
    > > > > We have now fixed our autoscaling to automatically strip these
    > > characters
    > > > > and added that message to our test suite.
    > > > >
    > > > > Best regards,
    > > > > Marco
    > > > >
    > > > > On Wed, Nov 21, 2018 at 2:49 PM Aaron Markham <
    > > aaron.s.markham@gmail.com
    > > > >
    > > > > wrote:
    > > > >
    > > > >> Marco, thanks for your hard work on this. I'm super excited about
    > the
    > > > new
    > > > >> Jenkins jobs. This is going to be very helpful and improve sanity
    > for
    > > > our
    > > > >> PRs and ourselves!
    > > > >>
    > > > >> Cheers,
    > > > >> Aaron
    > > > >>
    > > > >> On Wed, Nov 21, 2018, 05:37 Marco de Abreu
    > > > >> <marco.g.abreu@googlemail.com.invalid wrote:
    > > > >>
    > > > >>> Hello,
    > > > >>>
    > > > >>> the CI is now back up and running. Auto scaling is working as
    > > expected
    > > > >> and
    > > > >>> it passed our load tests.
    > > > >>>
    > > > >>> Please excuse the caused inconveniences.
    > > > >>>
    > > > >>> Best regards,
    > > > >>> Marco
    > > > >>>
    > > > >>> On Wed, Nov 21, 2018 at 5:24 AM Marco de Abreu <
    > > > >>> marco.g.abreu@googlemail.com>
    > > > >>> wrote:
    > > > >>>
    > > > >>>> Hello,
    > > > >>>>
    > > > >>>> I'd like to let you know that our CI was impaired and down for the
    > > > last
    > > > >>>> few hours. After getting the CI back up, I noticed that our auto
    > > > >> scaling
    > > > >>>> broke due to a silent update of Jenkins which broke our
    > > > >>> upscale-detection.
    > > > >>>> Manual scaling is currently not possible and stopping the scaling
    > > > won't
    > > > >>>> help either because there are currently no p3 instances available,
    > > > >> which
    > > > >>>> means that all jobs will fail none the less. In a few hours, the
    > > auto
    > > > >>>> scaling will have recycled all slaves through the down-scale
    > > mechanism
    > > > >>> and
    > > > >>>> we will be out of capacity. This will lead to resource starvation
    > > and
    > > > >>> thus
    > > > >>>> timeouts.
    > > > >>>>
    > > > >>>> Your PRs will be properly registered by Jenkins, but please expect
    > > the
    > > > >>>> jobs to time out and thus fail your PRs.
    > > > >>>>
    > > > >>>> I will fix the auto scaling as soon as I'm awake again.
    > > > >>>>
    > > > >>>> Sorry for the caused inconveniences.
    > > > >>>>
    > > > >>>> Best regards,
    > > > >>>> Marco
    > > > >>>>
    > > > >>>>
    > > > >>>> P.S. Sorry for the brief email and my lack of further fixes, but
    > > it's
    > > > >>>> 5:30AM now and I've been working for 17 hours.
    > > > >>>>
    > > > >>>
    > > > >>
    > > >
    > >
    >
    


Re: CI impaired

Posted by Lin Yuan <ap...@gmail.com>.
Thanks for your efforts, Marco!

On Wed, Nov 21, 2018 at 4:02 PM Anirudh Subramanian <an...@gmail.com>
wrote:

> Thanks for the quick response and mitigation!
>
> On Wed, Nov 21, 2018 at 3:55 PM Marco de Abreu
> <ma...@googlemail.com.invalid> wrote:
>
> > Hello,
> >
> > today, CI had some issues and I had to cancel all jobs a few minutes ago.
> > This was basically caused by the high load that is currently being put on
> > our CI system due to the pre-release efforts for this Friday.
> >
> > It's really unfortunate that we just had outages of three core components
> > within the last two days - sorry about that!. To recap, we had the
> > following outages (which are unrelated to the parallel refactor of the
> > Jenkins pipeline):
> > - (yesterday evening) The Jenkins master ran out of disk space and thus
> > processed requests at reduced capacity
> > - (this morning) The Jenkins master got updated which broke our
> > autoscalings upscaling capabilities.
> > - (new, this evening) Jenkins API was irresponsive: Due to the high
> number
> > of jobs and a bad API design in the Jenkins REST API, the time-complexity
> > of a simple create or delete request was quadratic which resulted in all
> > requests timing out (that was the current outage). This resulted in our
> > auto scaling to be unable to interface with the Jenkins master.
> >
> > I have now made improvements to our REST API calls which reduced the
> > complexity from O(N^2) to O(1). The reason was an underlying redirect
> loop
> > in the Jenkins createNode and deleteNode REST API in combination with
> > unrolling the entire slave and job graph (which got quite huge during
> > extensive load) upon every single request. Since we had about 150
> > registered slaves and 1000 jobs in the queue, the duration for a single
> > REST API call rose to up to 45 seconds (we execute up to a few hundred
> > queries per auto scaling loop). This lead to our auto scaling timing out.
> >
> > Everything should be back to normal now. I'm closely observing the
> > situation and I'll let you know if I encounter any additional issues.
> >
> > Again, sorry for any caused inconveniences.
> >
> > Best regards,
> > Marco
> >
> > On Wed, Nov 21, 2018 at 5:10 PM Gavin M Bell <ga...@gmail.com>
> > wrote:
> >
> > > Yes, let me add to the kudos, very nice work Marco.
> > >
> > >
> > > "I'm trying real hard to be the shepherd." -Jules Winnfield
> > >
> > >
> > > > On Nov 21, 2018, at 5:04 PM, Sunderland, Kellen
> > > <ke...@amazon.de.INVALID> wrote:
> > > >
> > > > Appreciate the big effort in bring the CI back so quickly.  Thanks
> > Marco.
> > > >
> > > > On Nov 21, 2018 5:52 AM, Marco de Abreu <
> marco.g.abreu@googlemail.com
> > .INVALID>
> > > wrote:
> > > > Thanks Aaron! Just for the record, the new Jenkins jobs were
> unrelated
> > to
> > > > that incident.
> > > >
> > > > If somebody is interested in the details around the outage:
> > > >
> > > > Due to a required maintenance (disk running full), we had to upgrade
> > our
> > > > Jenkins master because it was running on Ubuntu 17.04 (for an unknown
> > > > reason, it used to be 16.04) and we needed to install some packages.
> > > Since
> > > > the support for Ubuntu 17.04 was stopped, this resulted in all
> package
> > > > updates and installations to fail because the repositories were taken
> > > > offline. Due to the unavailable maintenance package and other issues
> > with
> > > > the installed OpenJDK8 version, we made the decision to upgrade the
> > > Jenkins
> > > > master to Ubuntu 18.04 LTS in order to get back to a supported
> version
> > > with
> > > > maintenance tools. During this upgrade, Jenkins was automatically
> > updated
> > > > by APT as part of the dist-upgrade process.
> > > >
> > > > In the latest version of Jenkins, some labels have been changed which
> > we
> > > > depend on for our auto scaling. To be more specific:
> > > >> Waiting for next available executor on mxnetlinux-gpu
> > > > has been changed to
> > > >> Waiting for next available executor on ‘mxnetlinux-gpu’
> > > > Notice the quote characters.
> > > >
> > > > Jenkins does not offer a better way than to parse these messages
> > > > unfortunately - there's no standardized way to express queue items.
> > Since
> > > > our parser expected the above message without quote signs, this
> message
> > > was
> > > > discarded.
> > > >
> > > > We support various queue reasons (5 of them to be exact) that
> indicate
> > > > resource starvation. If we run super low on capacity, the queue
> reason
> > is
> > > > different and we would still be able to scale up, but most of the
> cases
> > > > would have printed the unsupported message. This resulted in reduced
> > > > capacity (to be specific, the limit during that time was 1 slave per
> > > type).
> > > >
> > > > We have now fixed our autoscaling to automatically strip these
> > characters
> > > > and added that message to our test suite.
> > > >
> > > > Best regards,
> > > > Marco
> > > >
> > > > On Wed, Nov 21, 2018 at 2:49 PM Aaron Markham <
> > aaron.s.markham@gmail.com
> > > >
> > > > wrote:
> > > >
> > > >> Marco, thanks for your hard work on this. I'm super excited about
> the
> > > new
> > > >> Jenkins jobs. This is going to be very helpful and improve sanity
> for
> > > our
> > > >> PRs and ourselves!
> > > >>
> > > >> Cheers,
> > > >> Aaron
> > > >>
> > > >> On Wed, Nov 21, 2018, 05:37 Marco de Abreu
> > > >> <marco.g.abreu@googlemail.com.invalid wrote:
> > > >>
> > > >>> Hello,
> > > >>>
> > > >>> the CI is now back up and running. Auto scaling is working as
> > expected
> > > >> and
> > > >>> it passed our load tests.
> > > >>>
> > > >>> Please excuse the caused inconveniences.
> > > >>>
> > > >>> Best regards,
> > > >>> Marco
> > > >>>
> > > >>> On Wed, Nov 21, 2018 at 5:24 AM Marco de Abreu <
> > > >>> marco.g.abreu@googlemail.com>
> > > >>> wrote:
> > > >>>
> > > >>>> Hello,
> > > >>>>
> > > >>>> I'd like to let you know that our CI was impaired and down for the
> > > last
> > > >>>> few hours. After getting the CI back up, I noticed that our auto
> > > >> scaling
> > > >>>> broke due to a silent update of Jenkins which broke our
> > > >>> upscale-detection.
> > > >>>> Manual scaling is currently not possible and stopping the scaling
> > > won't
> > > >>>> help either because there are currently no p3 instances available,
> > > >> which
> > > >>>> means that all jobs will fail none the less. In a few hours, the
> > auto
> > > >>>> scaling will have recycled all slaves through the down-scale
> > mechanism
> > > >>> and
> > > >>>> we will be out of capacity. This will lead to resource starvation
> > and
> > > >>> thus
> > > >>>> timeouts.
> > > >>>>
> > > >>>> Your PRs will be properly registered by Jenkins, but please expect
> > the
> > > >>>> jobs to time out and thus fail your PRs.
> > > >>>>
> > > >>>> I will fix the auto scaling as soon as I'm awake again.
> > > >>>>
> > > >>>> Sorry for the caused inconveniences.
> > > >>>>
> > > >>>> Best regards,
> > > >>>> Marco
> > > >>>>
> > > >>>>
> > > >>>> P.S. Sorry for the brief email and my lack of further fixes, but
> > it's
> > > >>>> 5:30AM now and I've been working for 17 hours.
> > > >>>>
> > > >>>
> > > >>
> > >
> >
>

Re: CI impaired

Posted by Anirudh Subramanian <an...@gmail.com>.
Thanks for the quick response and mitigation!

On Wed, Nov 21, 2018 at 3:55 PM Marco de Abreu
<ma...@googlemail.com.invalid> wrote:

> Hello,
>
> today, CI had some issues and I had to cancel all jobs a few minutes ago.
> This was basically caused by the high load that is currently being put on
> our CI system due to the pre-release efforts for this Friday.
>
> It's really unfortunate that we just had outages of three core components
> within the last two days - sorry about that!. To recap, we had the
> following outages (which are unrelated to the parallel refactor of the
> Jenkins pipeline):
> - (yesterday evening) The Jenkins master ran out of disk space and thus
> processed requests at reduced capacity
> - (this morning) The Jenkins master got updated which broke our
> autoscalings upscaling capabilities.
> - (new, this evening) Jenkins API was irresponsive: Due to the high number
> of jobs and a bad API design in the Jenkins REST API, the time-complexity
> of a simple create or delete request was quadratic which resulted in all
> requests timing out (that was the current outage). This resulted in our
> auto scaling to be unable to interface with the Jenkins master.
>
> I have now made improvements to our REST API calls which reduced the
> complexity from O(N^2) to O(1). The reason was an underlying redirect loop
> in the Jenkins createNode and deleteNode REST API in combination with
> unrolling the entire slave and job graph (which got quite huge during
> extensive load) upon every single request. Since we had about 150
> registered slaves and 1000 jobs in the queue, the duration for a single
> REST API call rose to up to 45 seconds (we execute up to a few hundred
> queries per auto scaling loop). This lead to our auto scaling timing out.
>
> Everything should be back to normal now. I'm closely observing the
> situation and I'll let you know if I encounter any additional issues.
>
> Again, sorry for any caused inconveniences.
>
> Best regards,
> Marco
>
> On Wed, Nov 21, 2018 at 5:10 PM Gavin M Bell <ga...@gmail.com>
> wrote:
>
> > Yes, let me add to the kudos, very nice work Marco.
> >
> >
> > "I'm trying real hard to be the shepherd." -Jules Winnfield
> >
> >
> > > On Nov 21, 2018, at 5:04 PM, Sunderland, Kellen
> > <ke...@amazon.de.INVALID> wrote:
> > >
> > > Appreciate the big effort in bring the CI back so quickly.  Thanks
> Marco.
> > >
> > > On Nov 21, 2018 5:52 AM, Marco de Abreu <marco.g.abreu@googlemail.com
> .INVALID>
> > wrote:
> > > Thanks Aaron! Just for the record, the new Jenkins jobs were unrelated
> to
> > > that incident.
> > >
> > > If somebody is interested in the details around the outage:
> > >
> > > Due to a required maintenance (disk running full), we had to upgrade
> our
> > > Jenkins master because it was running on Ubuntu 17.04 (for an unknown
> > > reason, it used to be 16.04) and we needed to install some packages.
> > Since
> > > the support for Ubuntu 17.04 was stopped, this resulted in all package
> > > updates and installations to fail because the repositories were taken
> > > offline. Due to the unavailable maintenance package and other issues
> with
> > > the installed OpenJDK8 version, we made the decision to upgrade the
> > Jenkins
> > > master to Ubuntu 18.04 LTS in order to get back to a supported version
> > with
> > > maintenance tools. During this upgrade, Jenkins was automatically
> updated
> > > by APT as part of the dist-upgrade process.
> > >
> > > In the latest version of Jenkins, some labels have been changed which
> we
> > > depend on for our auto scaling. To be more specific:
> > >> Waiting for next available executor on mxnetlinux-gpu
> > > has been changed to
> > >> Waiting for next available executor on ‘mxnetlinux-gpu’
> > > Notice the quote characters.
> > >
> > > Jenkins does not offer a better way than to parse these messages
> > > unfortunately - there's no standardized way to express queue items.
> Since
> > > our parser expected the above message without quote signs, this message
> > was
> > > discarded.
> > >
> > > We support various queue reasons (5 of them to be exact) that indicate
> > > resource starvation. If we run super low on capacity, the queue reason
> is
> > > different and we would still be able to scale up, but most of the cases
> > > would have printed the unsupported message. This resulted in reduced
> > > capacity (to be specific, the limit during that time was 1 slave per
> > type).
> > >
> > > We have now fixed our autoscaling to automatically strip these
> characters
> > > and added that message to our test suite.
> > >
> > > Best regards,
> > > Marco
> > >
> > > On Wed, Nov 21, 2018 at 2:49 PM Aaron Markham <
> aaron.s.markham@gmail.com
> > >
> > > wrote:
> > >
> > >> Marco, thanks for your hard work on this. I'm super excited about the
> > new
> > >> Jenkins jobs. This is going to be very helpful and improve sanity for
> > our
> > >> PRs and ourselves!
> > >>
> > >> Cheers,
> > >> Aaron
> > >>
> > >> On Wed, Nov 21, 2018, 05:37 Marco de Abreu
> > >> <marco.g.abreu@googlemail.com.invalid wrote:
> > >>
> > >>> Hello,
> > >>>
> > >>> the CI is now back up and running. Auto scaling is working as
> expected
> > >> and
> > >>> it passed our load tests.
> > >>>
> > >>> Please excuse the caused inconveniences.
> > >>>
> > >>> Best regards,
> > >>> Marco
> > >>>
> > >>> On Wed, Nov 21, 2018 at 5:24 AM Marco de Abreu <
> > >>> marco.g.abreu@googlemail.com>
> > >>> wrote:
> > >>>
> > >>>> Hello,
> > >>>>
> > >>>> I'd like to let you know that our CI was impaired and down for the
> > last
> > >>>> few hours. After getting the CI back up, I noticed that our auto
> > >> scaling
> > >>>> broke due to a silent update of Jenkins which broke our
> > >>> upscale-detection.
> > >>>> Manual scaling is currently not possible and stopping the scaling
> > won't
> > >>>> help either because there are currently no p3 instances available,
> > >> which
> > >>>> means that all jobs will fail none the less. In a few hours, the
> auto
> > >>>> scaling will have recycled all slaves through the down-scale
> mechanism
> > >>> and
> > >>>> we will be out of capacity. This will lead to resource starvation
> and
> > >>> thus
> > >>>> timeouts.
> > >>>>
> > >>>> Your PRs will be properly registered by Jenkins, but please expect
> the
> > >>>> jobs to time out and thus fail your PRs.
> > >>>>
> > >>>> I will fix the auto scaling as soon as I'm awake again.
> > >>>>
> > >>>> Sorry for the caused inconveniences.
> > >>>>
> > >>>> Best regards,
> > >>>> Marco
> > >>>>
> > >>>>
> > >>>> P.S. Sorry for the brief email and my lack of further fixes, but
> it's
> > >>>> 5:30AM now and I've been working for 17 hours.
> > >>>>
> > >>>
> > >>
> >
>

Re: CI impaired

Posted by Marco de Abreu <ma...@googlemail.com.INVALID>.
Hello,

today, CI had some issues and I had to cancel all jobs a few minutes ago.
This was basically caused by the high load that is currently being put on
our CI system due to the pre-release efforts for this Friday.

It's really unfortunate that we just had outages of three core components
within the last two days - sorry about that! To recap, we had the
following outages (which are unrelated to the parallel refactor of the
Jenkins pipeline):
- (yesterday evening) The Jenkins master ran out of disk space and thus
processed requests at reduced capacity
- (this morning) The Jenkins master got updated, which broke our
auto scaling's upscaling capabilities.
- (new, this evening) The Jenkins API was unresponsive: due to the high number
of jobs and a bad API design in the Jenkins REST API, the time-complexity
of a simple create or delete request was quadratic, which resulted in all
requests timing out (that was the current outage). As a result, our
auto scaling was unable to interface with the Jenkins master.

I have now made improvements to our REST API calls which reduced the
complexity from O(N^2) to O(1). The reason was an underlying redirect loop
in the Jenkins createNode and deleteNode REST API in combination with
unrolling the entire slave and job graph (which got quite huge during
extensive load) upon every single request. Since we had about 150
registered slaves and 1000 jobs in the queue, the duration for a single
REST API call rose to up to 45 seconds (we execute up to a few hundred
queries per auto scaling loop). This led to our auto scaling timing out.
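
For illustration, the kind of scoped query that keeps a status poll cheap
uses the "tree" parameter of the Jenkins JSON API, so only the fields that
are actually needed get returned instead of the full object graph. The
snippet below is a hypothetical sketch (URL, credentials and label are
placeholders), not the code that runs in our auto scaling:

    import requests

    JENKINS = "https://jenkins.example.com"   # placeholder URL
    AUTH = ("ci-bot", "api-token")            # placeholder credentials

    def idle_slaves(label="mxnetlinux-gpu"):
        # Restrict the response to a handful of fields per node instead of
        # letting Jenkins serialize the entire computer graph.
        resp = requests.get(
            JENKINS + "/computer/api/json",
            params={"tree": "computer[displayName,idle,offline,assignedLabels[name]]"},
            auth=AUTH,
            timeout=30,
        )
        resp.raise_for_status()
        return [
            c["displayName"]
            for c in resp.json()["computer"]
            if c["idle"] and not c["offline"]
            and any(l["name"] == label for l in c["assignedLabels"])
        ]

    def queue_reasons():
        # Only the human-readable "why" of each queue item is needed by the
        # auto scaling parser.
        resp = requests.get(
            JENKINS + "/queue/api/json",
            params={"tree": "items[why]"},
            auth=AUTH,
            timeout=30,
        )
        resp.raise_for_status()
        return [item.get("why", "") for item in resp.json()["items"]]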

Everything should be back to normal now. I'm closely observing the
situation and I'll let you know if I encounter any additional issues.

Again, sorry for any caused inconveniences.

Best regards,
Marco

On Wed, Nov 21, 2018 at 5:10 PM Gavin M Bell <ga...@gmail.com>
wrote:

> Yes, let me add to the kudos, very nice work Marco.
>
>
> "I'm trying real hard to be the shepherd." -Jules Winnfield
>
>
> > On Nov 21, 2018, at 5:04 PM, Sunderland, Kellen
> <ke...@amazon.de.INVALID> wrote:
> >
> > Appreciate the big effort in bring the CI back so quickly.  Thanks Marco.
> >
> > On Nov 21, 2018 5:52 AM, Marco de Abreu <ma...@googlemail.com.INVALID>
> wrote:
> > Thanks Aaron! Just for the record, the new Jenkins jobs were unrelated to
> > that incident.
> >
> > If somebody is interested in the details around the outage:
> >
> > Due to a required maintenance (disk running full), we had to upgrade our
> > Jenkins master because it was running on Ubuntu 17.04 (for an unknown
> > reason, it used to be 16.04) and we needed to install some packages.
> Since
> > the support for Ubuntu 17.04 was stopped, this resulted in all package
> > updates and installations to fail because the repositories were taken
> > offline. Due to the unavailable maintenance package and other issues with
> > the installed OpenJDK8 version, we made the decision to upgrade the
> Jenkins
> > master to Ubuntu 18.04 LTS in order to get back to a supported version
> with
> > maintenance tools. During this upgrade, Jenkins was automatically updated
> > by APT as part of the dist-upgrade process.
> >
> > In the latest version of Jenkins, some labels have been changed which we
> > depend on for our auto scaling. To be more specific:
> >> Waiting for next available executor on mxnetlinux-gpu
> > has been changed to
> >> Waiting for next available executor on ‘mxnetlinux-gpu’
> > Notice the quote characters.
> >
> > Jenkins does not offer a better way than to parse these messages
> > unfortunately - there's no standardized way to express queue items. Since
> > our parser expected the above message without quote signs, this message
> was
> > discarded.
> >
> > We support various queue reasons (5 of them to be exact) that indicate
> > resource starvation. If we run super low on capacity, the queue reason is
> > different and we would still be able to scale up, but most of the cases
> > would have printed the unsupported message. This resulted in reduced
> > capacity (to be specific, the limit during that time was 1 slave per
> type).
> >
> > We have now fixed our autoscaling to automatically strip these characters
> > and added that message to our test suite.
> >
> > Best regards,
> > Marco
> >
> > On Wed, Nov 21, 2018 at 2:49 PM Aaron Markham <aaron.s.markham@gmail.com
> >
> > wrote:
> >
> >> Marco, thanks for your hard work on this. I'm super excited about the
> new
> >> Jenkins jobs. This is going to be very helpful and improve sanity for
> our
> >> PRs and ourselves!
> >>
> >> Cheers,
> >> Aaron
> >>
> >> On Wed, Nov 21, 2018, 05:37 Marco de Abreu
> >> <marco.g.abreu@googlemail.com.invalid wrote:
> >>
> >>> Hello,
> >>>
> >>> the CI is now back up and running. Auto scaling is working as expected
> >> and
> >>> it passed our load tests.
> >>>
> >>> Please excuse the caused inconveniences.
> >>>
> >>> Best regards,
> >>> Marco
> >>>
> >>> On Wed, Nov 21, 2018 at 5:24 AM Marco de Abreu <
> >>> marco.g.abreu@googlemail.com>
> >>> wrote:
> >>>
> >>>> Hello,
> >>>>
> >>>> I'd like to let you know that our CI was impaired and down for the
> last
> >>>> few hours. After getting the CI back up, I noticed that our auto
> >> scaling
> >>>> broke due to a silent update of Jenkins which broke our
> >>> upscale-detection.
> >>>> Manual scaling is currently not possible and stopping the scaling
> won't
> >>>> help either because there are currently no p3 instances available,
> >> which
> >>>> means that all jobs will fail none the less. In a few hours, the auto
> >>>> scaling will have recycled all slaves through the down-scale mechanism
> >>> and
> >>>> we will be out of capacity. This will lead to resource starvation and
> >>> thus
> >>>> timeouts.
> >>>>
> >>>> Your PRs will be properly registered by Jenkins, but please expect the
> >>>> jobs to time out and thus fail your PRs.
> >>>>
> >>>> I will fix the auto scaling as soon as I'm awake again.
> >>>>
> >>>> Sorry for the caused inconveniences.
> >>>>
> >>>> Best regards,
> >>>> Marco
> >>>>
> >>>>
> >>>> P.S. Sorry for the brief email and my lack of further fixes, but it's
> >>>> 5:30AM now and I've been working for 17 hours.
> >>>>
> >>>
> >>
>

Re: CI impaired

Posted by Gavin M Bell <ga...@gmail.com>.
Yes, let me add to the kudos, very nice work Marco. 


"I'm trying real hard to be the shepherd." -Jules Winnfield


> On Nov 21, 2018, at 5:04 PM, Sunderland, Kellen <ke...@amazon.de.INVALID> wrote:
> 
> Appreciate the big effort in bring the CI back so quickly.  Thanks Marco.
> 
> On Nov 21, 2018 5:52 AM, Marco de Abreu <ma...@googlemail.com.INVALID> wrote:
> Thanks Aaron! Just for the record, the new Jenkins jobs were unrelated to
> that incident.
> 
> If somebody is interested in the details around the outage:
> 
> Due to a required maintenance (disk running full), we had to upgrade our
> Jenkins master because it was running on Ubuntu 17.04 (for an unknown
> reason, it used to be 16.04) and we needed to install some packages. Since
> the support for Ubuntu 17.04 was stopped, this resulted in all package
> updates and installations to fail because the repositories were taken
> offline. Due to the unavailable maintenance package and other issues with
> the installed OpenJDK8 version, we made the decision to upgrade the Jenkins
> master to Ubuntu 18.04 LTS in order to get back to a supported version with
> maintenance tools. During this upgrade, Jenkins was automatically updated
> by APT as part of the dist-upgrade process.
> 
> In the latest version of Jenkins, some labels have been changed which we
> depend on for our auto scaling. To be more specific:
>> Waiting for next available executor on mxnetlinux-gpu
> has been changed to
>> Waiting for next available executor on ‘mxnetlinux-gpu’
> Notice the quote characters.
> 
> Jenkins does not offer a better way than to parse these messages
> unfortunately - there's no standardized way to express queue items. Since
> our parser expected the above message without quote signs, this message was
> discarded.
> 
> We support various queue reasons (5 of them to be exact) that indicate
> resource starvation. If we run super low on capacity, the queue reason is
> different and we would still be able to scale up, but most of the cases
> would have printed the unsupported message. This resulted in reduced
> capacity (to be specific, the limit during that time was 1 slave per type).
> 
> We have now fixed our autoscaling to automatically strip these characters
> and added that message to our test suite.
> 
> Best regards,
> Marco
> 
> On Wed, Nov 21, 2018 at 2:49 PM Aaron Markham <aa...@gmail.com>
> wrote:
> 
>> Marco, thanks for your hard work on this. I'm super excited about the new
>> Jenkins jobs. This is going to be very helpful and improve sanity for our
>> PRs and ourselves!
>> 
>> Cheers,
>> Aaron
>> 
>> On Wed, Nov 21, 2018, 05:37 Marco de Abreu
>> <marco.g.abreu@googlemail.com.invalid wrote:
>> 
>>> Hello,
>>> 
>>> the CI is now back up and running. Auto scaling is working as expected
>> and
>>> it passed our load tests.
>>> 
>>> Please excuse the caused inconveniences.
>>> 
>>> Best regards,
>>> Marco
>>> 
>>> On Wed, Nov 21, 2018 at 5:24 AM Marco de Abreu <
>>> marco.g.abreu@googlemail.com>
>>> wrote:
>>> 
>>>> Hello,
>>>> 
>>>> I'd like to let you know that our CI was impaired and down for the last
>>>> few hours. After getting the CI back up, I noticed that our auto
>> scaling
>>>> broke due to a silent update of Jenkins which broke our
>>> upscale-detection.
>>>> Manual scaling is currently not possible and stopping the scaling won't
>>>> help either because there are currently no p3 instances available,
>> which
>>>> means that all jobs will fail none the less. In a few hours, the auto
>>>> scaling will have recycled all slaves through the down-scale mechanism
>>> and
>>>> we will be out of capacity. This will lead to resource starvation and
>>> thus
>>>> timeouts.
>>>> 
>>>> Your PRs will be properly registered by Jenkins, but please expect the
>>>> jobs to time out and thus fail your PRs.
>>>> 
>>>> I will fix the auto scaling as soon as I'm awake again.
>>>> 
>>>> Sorry for the caused inconveniences.
>>>> 
>>>> Best regards,
>>>> Marco
>>>> 
>>>> 
>>>> P.S. Sorry for the brief email and my lack of further fixes, but it's
>>>> 5:30AM now and I've been working for 17 hours.
>>>> 
>>> 
>> 

Re: CI impaired

Posted by "Sunderland, Kellen" <ke...@amazon.de.INVALID>.
Appreciate the big effort in bringing the CI back so quickly.  Thanks Marco.

On Nov 21, 2018 5:52 AM, Marco de Abreu <ma...@googlemail.com.INVALID> wrote:
Thanks Aaron! Just for the record, the new Jenkins jobs were unrelated to
that incident.

If somebody is interested in the details around the outage:

Due to a required maintenance (disk running full), we had to upgrade our
Jenkins master because it was running on Ubuntu 17.04 (for an unknown
reason, it used to be 16.04) and we needed to install some packages. Since
the support for Ubuntu 17.04 was stopped, this resulted in all package
updates and installations to fail because the repositories were taken
offline. Due to the unavailable maintenance package and other issues with
the installed OpenJDK8 version, we made the decision to upgrade the Jenkins
master to Ubuntu 18.04 LTS in order to get back to a supported version with
maintenance tools. During this upgrade, Jenkins was automatically updated
by APT as part of the dist-upgrade process.

In the latest version of Jenkins, some labels have been changed which we
depend on for our auto scaling. To be more specific:
> Waiting for next available executor on mxnetlinux-gpu
has been changed to
> Waiting for next available executor on ‘mxnetlinux-gpu’
Notice the quote characters.

Jenkins does not offer a better way than to parse these messages
unfortunately - there's no standardized way to express queue items. Since
our parser expected the above message without quote signs, this message was
discarded.

We support various queue reasons (5 of them to be exact) that indicate
resource starvation. If we run super low on capacity, the queue reason is
different and we would still be able to scale up, but most of the cases
would have printed the unsupported message. This resulted in reduced
capacity (to be specific, the limit during that time was 1 slave per type).

We have now fixed our autoscaling to automatically strip these characters
and added that message to our test suite.

Best regards,
Marco

On Wed, Nov 21, 2018 at 2:49 PM Aaron Markham <aa...@gmail.com>
wrote:

> Marco, thanks for your hard work on this. I'm super excited about the new
> Jenkins jobs. This is going to be very helpful and improve sanity for our
> PRs and ourselves!
>
> Cheers,
> Aaron
>
> On Wed, Nov 21, 2018, 05:37 Marco de Abreu
> <marco.g.abreu@googlemail.com.invalid wrote:
>
> > Hello,
> >
> > the CI is now back up and running. Auto scaling is working as expected
> and
> > it passed our load tests.
> >
> > Please excuse the caused inconveniences.
> >
> > Best regards,
> > Marco
> >
> > On Wed, Nov 21, 2018 at 5:24 AM Marco de Abreu <
> > marco.g.abreu@googlemail.com>
> > wrote:
> >
> > > Hello,
> > >
> > > I'd like to let you know that our CI was impaired and down for the last
> > > few hours. After getting the CI back up, I noticed that our auto
> scaling
> > > broke due to a silent update of Jenkins which broke our
> > upscale-detection.
> > > Manual scaling is currently not possible and stopping the scaling won't
> > > help either because there are currently no p3 instances available,
> which
> > > means that all jobs will fail none the less. In a few hours, the auto
> > > scaling will have recycled all slaves through the down-scale mechanism
> > and
> > > we will be out of capacity. This will lead to resource starvation and
> > thus
> > > timeouts.
> > >
> > > Your PRs will be properly registered by Jenkins, but please expect the
> > > jobs to time out and thus fail your PRs.
> > >
> > > I will fix the auto scaling as soon as I'm awake again.
> > >
> > > Sorry for the caused inconveniences.
> > >
> > > Best regards,
> > > Marco
> > >
> > >
> > > P.S. Sorry for the brief email and my lack of further fixes, but it's
> > > 5:30AM now and I've been working for 17 hours.
> > >
> >
>

Re: CI impaired

Posted by Marco de Abreu <ma...@googlemail.com.INVALID>.
Thanks Aaron! Just for the record, the new Jenkins jobs were unrelated to
that incident.

If somebody is interested in the details around the outage:

Due to required maintenance (the disk was running full), we had to upgrade
our Jenkins master because it was running on Ubuntu 17.04 (for an unknown
reason - it used to be 16.04) and we needed to install some packages. Since
support for Ubuntu 17.04 had been discontinued, all package updates and
installations failed because the repositories were taken offline. Due to
the unavailable maintenance packages and other issues with the installed
OpenJDK8 version, we made the decision to upgrade the Jenkins master to
Ubuntu 18.04 LTS in order to get back to a supported version with
maintenance tooling. During this upgrade, Jenkins was automatically updated
by APT as part of the dist-upgrade process.

In the latest version of Jenkins, some labels that we depend on for our
auto scaling have been changed. To be more specific:
> Waiting for next available executor on mxnetlinux-gpu
has been changed to
> Waiting for next available executor on ‘mxnetlinux-gpu’
Notice the quote characters.

Unfortunately, Jenkins does not offer a better way than parsing these
messages - there's no standardized way to express queue items. Since our
parser expected the above message without quote characters, this message
was discarded.

We support various queue reasons (5 of them, to be exact) that indicate
resource starvation. If we run super low on capacity, the queue reason is
different and we would still be able to scale up, but most cases would
have printed the unsupported message. This resulted in reduced capacity
(to be specific, the limit during that time was 1 slave per type).

We have now fixed our autoscaling to automatically strip these characters
and added that message to our test suite.
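
For illustration only, a minimal sketch of the stripping logic
(hypothetical code, not our actual parser) could look like this:

    import re

    # Jenkins may or may not wrap the label in typographic quotes, e.g.
    #   Waiting for next available executor on mxnetlinux-gpu
    #   Waiting for next available executor on ‘mxnetlinux-gpu’
    QUEUE_REASON = re.compile(r"^Waiting for next available executor on (?P<label>.+)$")

    def parse_starved_label(why):
        # Return the starved node label from a queue item's "why" message, or None.
        match = QUEUE_REASON.match(why.strip())
        if not match:
            return None
        # Strip ASCII and typographic quote characters around the label.
        return match.group("label").strip("'\"\u2018\u2019")

    assert parse_starved_label(
        "Waiting for next available executor on ‘mxnetlinux-gpu’") == "mxnetlinux-gpu"
    assert parse_starved_label(
        "Waiting for next available executor on mxnetlinux-gpu") == "mxnetlinux-gpu"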

Best regards,
Marco

On Wed, Nov 21, 2018 at 2:49 PM Aaron Markham <aa...@gmail.com>
wrote:

> Marco, thanks for your hard work on this. I'm super excited about the new
> Jenkins jobs. This is going to be very helpful and improve sanity for our
> PRs and ourselves!
>
> Cheers,
> Aaron
>
> On Wed, Nov 21, 2018, 05:37 Marco de Abreu
> <marco.g.abreu@googlemail.com.invalid wrote:
>
> > Hello,
> >
> > the CI is now back up and running. Auto scaling is working as expected
> and
> > it passed our load tests.
> >
> > Please excuse the caused inconveniences.
> >
> > Best regards,
> > Marco
> >
> > On Wed, Nov 21, 2018 at 5:24 AM Marco de Abreu <
> > marco.g.abreu@googlemail.com>
> > wrote:
> >
> > > Hello,
> > >
> > > I'd like to let you know that our CI was impaired and down for the last
> > > few hours. After getting the CI back up, I noticed that our auto
> scaling
> > > broke due to a silent update of Jenkins which broke our
> > upscale-detection.
> > > Manual scaling is currently not possible and stopping the scaling won't
> > > help either because there are currently no p3 instances available,
> which
> > > means that all jobs will fail none the less. In a few hours, the auto
> > > scaling will have recycled all slaves through the down-scale mechanism
> > and
> > > we will be out of capacity. This will lead to resource starvation and
> > thus
> > > timeouts.
> > >
> > > Your PRs will be properly registered by Jenkins, but please expect the
> > > jobs to time out and thus fail your PRs.
> > >
> > > I will fix the auto scaling as soon as I'm awake again.
> > >
> > > Sorry for the caused inconveniences.
> > >
> > > Best regards,
> > > Marco
> > >
> > >
> > > P.S. Sorry for the brief email and my lack of further fixes, but it's
> > > 5:30AM now and I've been working for 17 hours.
> > >
> >
>

Re: CI impaired

Posted by Aaron Markham <aa...@gmail.com>.
Marco, thanks for your hard work on this. I'm super excited about the new
Jenkins jobs. This is going to be very helpful and improve sanity for our
PRs and ourselves!

Cheers,
Aaron

On Wed, Nov 21, 2018, 05:37 Marco de Abreu
<marco.g.abreu@googlemail.com.invalid wrote:

> Hello,
>
> the CI is now back up and running. Auto scaling is working as expected and
> it passed our load tests.
>
> Please excuse the caused inconveniences.
>
> Best regards,
> Marco
>
> On Wed, Nov 21, 2018 at 5:24 AM Marco de Abreu <
> marco.g.abreu@googlemail.com>
> wrote:
>
> > Hello,
> >
> > I'd like to let you know that our CI was impaired and down for the last
> > few hours. After getting the CI back up, I noticed that our auto scaling
> > broke due to a silent update of Jenkins which broke our
> upscale-detection.
> > Manual scaling is currently not possible and stopping the scaling won't
> > help either because there are currently no p3 instances available, which
> > means that all jobs will fail none the less. In a few hours, the auto
> > scaling will have recycled all slaves through the down-scale mechanism
> and
> > we will be out of capacity. This will lead to resource starvation and
> thus
> > timeouts.
> >
> > Your PRs will be properly registered by Jenkins, but please expect the
> > jobs to time out and thus fail your PRs.
> >
> > I will fix the auto scaling as soon as I'm awake again.
> >
> > Sorry for the caused inconveniences.
> >
> > Best regards,
> > Marco
> >
> >
> > P.S. Sorry for the brief email and my lack of further fixes, but it's
> > 5:30AM now and I've been working for 17 hours.
> >
>

Re: CI impaired

Posted by Marco de Abreu <ma...@googlemail.com.INVALID>.
Hello,

the CI is now back up and running. Auto scaling is working as expected and
it passed our load tests.

Please excuse the inconvenience caused.

Best regards,
Marco

On Wed, Nov 21, 2018 at 5:24 AM Marco de Abreu <ma...@googlemail.com>
wrote:

> Hello,
>
> I'd like to let you know that our CI was impaired and down for the last
> few hours. After getting the CI back up, I noticed that our auto scaling
> broke due to a silent update of Jenkins which broke our upscale-detection.
> Manual scaling is currently not possible and stopping the scaling won't
> help either because there are currently no p3 instances available, which
> means that all jobs will fail none the less. In a few hours, the auto
> scaling will have recycled all slaves through the down-scale mechanism and
> we will be out of capacity. This will lead to resource starvation and thus
> timeouts.
>
> Your PRs will be properly registered by Jenkins, but please expect the
> jobs to time out and thus fail your PRs.
>
> I will fix the auto scaling as soon as I'm awake again.
>
> Sorry for the caused inconveniences.
>
> Best regards,
> Marco
>
>
> P.S. Sorry for the brief email and my lack of further fixes, but it's
> 5:30AM now and I've been working for 17 hours.
>