You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@mxnet.apache.org by Marco de Abreu <ma...@apache.org> on 2020/06/15 23:19:51 UTC

Re: [CI] Staggered build pipelines enabled

Hello,

I'd like to revisit this decision and review whether the expected benefit
(cost reduction) was achieved and how the overall PR validation duration
has changed. Could you guys share some information on this matter?

Just today, we've had two incidents which were caused by this change:

1. This PR was merged prematurely because the follow up pipelines didn't
run (whether it's a timing issue or something else I don't know). We've
always had the policy of not enforcing protected feature branches and this
pipeline is causing human errors.
https://github.com/apache/incubator-mxnet/pull/18560

2. The sanity check, which got into the critical path here, starts running
into timeouts. I'm aware that this is in case of cache misses, but none the
less does that heavily increase the waiting duration since developers now
have to wait for two entire cache build cycles instead of just one:
https://github.com/apache/incubator-mxnet/pull/18568

I understand that money has to be conserved, but I still stand by my
opinion that this was the wrong move and development speed was sacrificed.

If there are no other compelling arguments, I'd prefer if the previous
state of parallel pipelines could be restored.

Best regards
Marco

sandeep krishnamurthy <sa...@gmail.com> schrieb am Di., 28.
Apr. 2020, 07:55:

> Thanks a lot Joe for your contributions. Thank you Marco, Chai and Leo for
> helping this.
> Especially given that you had seen around 57% build failing in sanity
> check, this should be very helpful to provide faster feedback for PR
> authors on sanity issues plus save a lot of unnecessary builds.
>
> Best,
> Sandeep
>
> On Mon, 27 Apr 2020, 10:20 pm Joe Evans, <jo...@gmail.com> wrote:
>
> > Hi dev community,
> >
> >
> > We have made the changes to the mxnet CI system to incorporate the
> > staggered build pipelines. With this change, when a new PR is created or
> an
> > existing PR is updated, the status checks will only show
> > “ci/jenkins/mxnet-validation/sanity” build job at first. Once this build
> > completes successfully (avg. run time is about 10min), the remaining CI
> > build jobs will appear and function as previously.
> >
> >
> > Please let me know if you experience any issues with this change.
> >
> >
> > Thanks!
> >
> > Joe
> >
> >
> > References:
> >
> >
> > https://github.com/apache/incubator-mxnet/issues/17802
> >
>

Re: [CI] Staggered build pipelines enabled

Posted by Sheng Zha <zh...@apache.org>.
Hi Marco,

> We've always had the policy of not enforcing protected feature branches and this pipeline is causing human errors.

There are several problems with this statement:
1. This is a release branch as defined in our release process. It's not a feature branch.
2. There was no explicit decision or an agreed upon policy of not protecting the release branch. To me it doesn't make sense to only protect the development branch while allowing force push and revision of history on release branches, and this is what needs to be fixed.
3. I made a conscious decision when merging it given the CI sanity check at the time, given that there were changes from Chai to CI pipelines which may not take effect. While a better approach would have been to test this separately first, I see such issue as a normal part of development and I wouldn't be too concerned given that we use proper version control.

Regarding the benefit of having the sanity check step, you can easily measure its impact by looking at the recent PR build history of the sanity step [1], which I believe you do have access to. Out of the recent 563 builds in PR validation, 137 builds have broken sanity check, or 24.3%. Given the amount of reduction in wasted compute on already failed validation pipelines, to me waiting for two docker cache builds in rare occasions is justified.

Since the increase in PR validation time is due to the docker build time, you can help by improving the docker cache of the CI. While a couple of people are already investing time in improving the status quo in this regard, you are more than welcome to step up and participate on improving it.

-sz

[1] http://jenkins.mxnet-ci.amazon-ml.com/job/mxnet-validation/job/sanity/view/change-requests/builds

On 2020/06/15 23:19:51, Marco de Abreu <ma...@apache.org> wrote: 
> Hello,
> 
> I'd like to revisit this decision and review whether the expected benefit
> (cost reduction) was achieved and how the overall PR validation duration
> has changed. Could you guys share some information on this matter?
> 
> Just today, we've had two incidents which were caused by this change:
> 
> 1. This PR was merged prematurely because the follow up pipelines didn't
> run (whether it's a timing issue or something else I don't know). We've
> always had the policy of not enforcing protected feature branches and this
> pipeline is causing human errors.
> https://github.com/apache/incubator-mxnet/pull/18560
> 
> 2. The sanity check, which got into the critical path here, starts running
> into timeouts. I'm aware that this is in case of cache misses, but none the
> less does that heavily increase the waiting duration since developers now
> have to wait for two entire cache build cycles instead of just one:
> https://github.com/apache/incubator-mxnet/pull/18568
> 
> I understand that money has to be conserved, but I still stand by my
> opinion that this was the wrong move and development speed was sacrificed.
> 
> If there are no other compelling arguments, I'd prefer if the previous
> state of parallel pipelines could be restored.
> 
> Best regards
> Marco
> 
> sandeep krishnamurthy <sa...@gmail.com> schrieb am Di., 28.
> Apr. 2020, 07:55:
> 
> > Thanks a lot Joe for your contributions. Thank you Marco, Chai and Leo for
> > helping this.
> > Especially given that you had seen around 57% build failing in sanity
> > check, this should be very helpful to provide faster feedback for PR
> > authors on sanity issues plus save a lot of unnecessary builds.
> >
> > Best,
> > Sandeep
> >
> > On Mon, 27 Apr 2020, 10:20 pm Joe Evans, <jo...@gmail.com> wrote:
> >
> > > Hi dev community,
> > >
> > >
> > > We have made the changes to the mxnet CI system to incorporate the
> > > staggered build pipelines. With this change, when a new PR is created or
> > an
> > > existing PR is updated, the status checks will only show
> > > “ci/jenkins/mxnet-validation/sanity” build job at first. Once this build
> > > completes successfully (avg. run time is about 10min), the remaining CI
> > > build jobs will appear and function as previously.
> > >
> > >
> > > Please let me know if you experience any issues with this change.
> > >
> > >
> > > Thanks!
> > >
> > > Joe
> > >
> > >
> > > References:
> > >
> > >
> > > https://github.com/apache/incubator-mxnet/issues/17802
> > >
> >
>