You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@mxnet.apache.org by Pedro Larroy <pe...@gmail.com> on 2019/08/13 23:47:08 UTC

CI and PRs

Hi

Seems we are hitting some problems in CI. I propose the following action
items to remedy the situation and accelerate turn around times in CI,
reduce cost, complexity and probability of failure blocking PRs and
frustrating developers:

* Upgrade Windows visual studio from VS 2015 to VS 2017. The
build_windows.py infrastructure should easily work with the new version.
Currently some PRs are blocked by this:
https://github.com/apache/incubator-mxnet/issues/13958
* Move Gluon Model zoo tests to nightly. Tracked at
https://github.com/apache/incubator-mxnet/issues/15295
* Move non-python bindings tests to nightly. If a commit is touching other
bindings, the reviewer should ask for a full run which can be done locally,
use the label bot to trigger a full CI build, or defer to nightly.
* Provide a couple of basic sanity performance tests on small models that
are run on CI and can be echoed by the label bot as a comment for PRs.
* Address unit tests that take more than 10-20s, streamline them or move
them to nightly if it can't be done.
* Open sourcing the remaining CI infrastructure scripts so the community
can contribute.

I think our goal should be turnaround under 30min.

I would also like to touch base with the community that some PRs are not
being followed up by committers asking for changes. For example this PR is
importtant and is hanging for a long time.

https://github.com/apache/incubator-mxnet/pull/15051

This is another, less important but more trivial to review:

https://github.com/apache/incubator-mxnet/pull/14940

I think comitters requesting changes and not folllowing up in reasonable
time is not healthy for the project. I suggest configuring github
Notifications for a good SNR and following up.

Regards.

Pedro.

Re: CI and PRs

Posted by Pedro Larroy <pe...@gmail.com>.

Hi Marco.

I have to agree with you on that, from past experience.
What do you suggest for maintenance?  Do we need a watermark that fails the
validation if the total runtime exceeds a high threshold?

Pedro.

On Wed, Aug 14, 2019 at 1:03 PM Marco de Abreu <ma...@gmail.com>
wrote:

> With regards to time I rather prefer us spending a bit more time on
> maintenance than somebody running into an error that could've been caught
> with a test.
>
> I mean, our Publishing pipeline for Scala GPU has been broken for quite
> some time now, but nobody noticed that. Basically my stance on that matter
> is that as soon as something is not blocking, you can also just deactivate
> it since you don't have a forcing function in an open source project.
> People will rarely come back and fix the errors of some nightly test that
> they introduced.
>
> -Marco
>
> Carin Meier <ca...@gmail.com> schrieb am Mi., 14. Aug. 2019, 21:59:
>
> > If a language binding test is failing for a not important reason, then it
> > is too brittle and needs to be fixed (we have fixed some of these with
> the
> > Clojure package [1]).
> > But in general, if we thinking of the MXNet project as one project that
> is
> > across all the language bindings, then we want to know if some
> fundamental
> > code change is going to break a downstream package.
> > I can't speak for all the high level package binding maintainers, but I'm
> > always happy to pitch in to provide code fixes to help the base PR get
> > green.
> >
> > The time costs to maintain such a large CI project obviously needs to be
> > considered as well.
> >
> > [1] https://github.com/apache/incubator-mxnet/pull/15579
> >
> > On Wed, Aug 14, 2019 at 3:48 PM Pedro Larroy <
> pedro.larroy.lists@gmail.com
> > >
> > wrote:
> >
> > > From what I have seen Clojure is 15 minutes, which I think is
> reasonable.
> > > The only question is that when a binding such as R, Perl or Clojure
> > fails,
> > > some devs are a bit confused about how to fix them since they are not
> > > familiar with the testing tools and the language.
> > >
> > > On Wed, Aug 14, 2019 at 11:57 AM Carin Meier <ca...@gmail.com>
> > wrote:
> > >
> > > > Great idea Marco! Anything that you think would be valuable to share
> > > would
> > > > be good. The duration of each node in the test stage sounds like a
> good
> > > > start.
> > > >
> > > > - Carin
> > > >
> > > > On Wed, Aug 14, 2019 at 2:48 PM Marco de Abreu <
> > marco.g.abreu@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > we record a bunch of metrics about run statistics (down to the
> > duration
> > > > of
> > > > > every individual step). If you tell me which ones you're
> particularly
> > > > > interested in (probably total duration of each node in the test
> > stage),
> > > > I'm
> > > > > happy to provide them.
> > > > >
> > > > > Dimensions are (in hierarchical order):
> > > > > - job
> > > > > - branch
> > > > > - stage
> > > > > - node
> > > > > - step
> > > > >
> > > > > Unfortunately I don't have the possibility to export them since we
> > > store
> > > > > them in CloudWatch Metrics which afaik doesn't offer raw exports.
> > > > >
> > > > > Best regards,
> > > > > Marco
> > > > >
> > > > > Carin Meier <ca...@gmail.com> schrieb am Mi., 14. Aug. 2019,
> > > 19:43:
> > > > >
> > > > > > I would prefer to keep the language binding in the PR process.
> > > Perhaps
> > > > we
> > > > > > could do some analytics to see how much each of the language
> > bindings
> > > > is
> > > > > > contributing to overall run time.
> > > > > > If we have some metrics on that, maybe we can come up with a
> > > guideline
> > > > of
> > > > > > how much time each should take. Another possibility is leverage
> the
> > > > > > parallel builds more.
> > > > > >
> > > > > > On Wed, Aug 14, 2019 at 1:30 PM Pedro Larroy <
> > > > > pedro.larroy.lists@gmail.com
> > > > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Carin.
> > > > > > >
> > > > > > > That's a good point, all things considered would your
> preference
> > be
> > > > to
> > > > > > keep
> > > > > > > the Clojure tests as part of the PR process or in Nightly?
> > > > > > > Some options are having notifications here or in slack. But if
> we
> > > > think
> > > > > > > breakages would go unnoticed maybe is not a good idea to fully
> > > remove
> > > > > > > bindings from the PR process and just streamline the process.
> > > > > > >
> > > > > > > Pedro.
> > > > > > >
> > > > > > > On Wed, Aug 14, 2019 at 5:09 AM Carin Meier <
> > carinmeier@gmail.com>
> > > > > > wrote:
> > > > > > >
> > > > > > > > Before any binding tests are moved to nightly, I think we
> need
> > to
> > > > > > figure
> > > > > > > > out how the community can get proper notifications of failure
> > and
> > > > > > success
> > > > > > > > on those nightly runs. Otherwise, I think that breakages
> would
> > go
> > > > > > > > unnoticed.
> > > > > > > >
> > > > > > > > -Carin
> > > > > > > >
> > > > > > > > On Tue, Aug 13, 2019 at 7:47 PM Pedro Larroy <
> > > > > > > pedro.larroy.lists@gmail.com
> > > > > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi
> > > > > > > > >
> > > > > > > > > Seems we are hitting some problems in CI. I propose the
> > > following
> > > > > > > action
> > > > > > > > > items to remedy the situation and accelerate turn around
> > times
> > > in
> > > > > CI,
> > > > > > > > > reduce cost, complexity and probability of failure blocking
> > PRs
> > > > and
> > > > > > > > > frustrating developers:
> > > > > > > > >
> > > > > > > > > * Upgrade Windows visual studio from VS 2015 to VS 2017.
> The
> > > > > > > > > build_windows.py infrastructure should easily work with the
> > new
> > > > > > > version.
> > > > > > > > > Currently some PRs are blocked by this:
> > > > > > > > > https://github.com/apache/incubator-mxnet/issues/13958
> > > > > > > > > * Move Gluon Model zoo tests to nightly. Tracked at
> > > > > > > > > https://github.com/apache/incubator-mxnet/issues/15295
> > > > > > > > > * Move non-python bindings tests to nightly. If a commit is
> > > > > touching
> > > > > > > > other
> > > > > > > > > bindings, the reviewer should ask for a full run which can
> be
> > > > done
> > > > > > > > locally,
> > > > > > > > > use the label bot to trigger a full CI build, or defer to
> > > > nightly.
> > > > > > > > > * Provide a couple of basic sanity performance tests on
> small
> > > > > models
> > > > > > > that
> > > > > > > > > are run on CI and can be echoed by the label bot as a
> comment
> > > for
> > > > > > PRs.
> > > > > > > > > * Address unit tests that take more than 10-20s, streamline
> > > them
> > > > or
> > > > > > > move
> > > > > > > > > them to nightly if it can't be done.
> > > > > > > > > * Open sourcing the remaining CI infrastructure scripts so
> > the
> > > > > > > community
> > > > > > > > > can contribute.
> > > > > > > > >
> > > > > > > > > I think our goal should be turnaround under 30min.
> > > > > > > > >
> > > > > > > > > I would also like to touch base with the community that
> some
> > > PRs
> > > > > are
> > > > > > > not
> > > > > > > > > being followed up by committers asking for changes. For
> > example
> > > > > this
> > > > > > PR
> > > > > > > > is
> > > > > > > > > importtant and is hanging for a long time.
> > > > > > > > >
> > > > > > > > > https://github.com/apache/incubator-mxnet/pull/15051
> > > > > > > > >
> > > > > > > > > This is another, less important but more trivial to review:
> > > > > > > > >
> > > > > > > > > https://github.com/apache/incubator-mxnet/pull/14940
> > > > > > > > >
> > > > > > > > > I think comitters requesting changes and not folllowing up
> in
> > > > > > > reasonable
> > > > > > > > > time is not healthy for the project. I suggest configuring
> > > github
> > > > > > > > > Notifications for a good SNR and following up.
> > > > > > > > >
> > > > > > > > > Regards.
> > > > > > > > >
> > > > > > > > > Pedro.
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: CI and PRs

Posted by Marco de Abreu <ma...@gmail.com>.

It's rerunning as soon as that particular script has been modified. Since
the following steps depend on it, it means that once step 4 has a cache
mismatch, steps 5-15 are also no longer valid.

Our cache is always controlled by master. This means that the only thing
that matters is the diff between your branch and master and not the fact
whether it already has been run once. A single Jenkins run will juggle with
over 100gb of Docker images. If we held a cache that records every single
occurrence, the storage requirements and traffic would be very expensive.
Thus, the most efficient and less error prone approach was to make master
be the branch that defines the cache.

-Marco

Aaron Markham <aa...@gmail.com> schrieb am Fr., 16. Aug. 2019,
04:06:

> When you create a new Dockerfile and use that on CI, it doesn't seem
> to cache some of the steps... like this:
>
> Step 13/15 : RUN /work/ubuntu_docs.sh
>  ---> Running in a1e522f3283b
>  [91m+ echo 'Installing dependencies...'
> + apt-get update
>  [0mInstalling dependencies.
>
> Or this....
>
> Step 4/13 : RUN /work/ubuntu_core.sh
>  ---> Running in e7882d7aa750
>  [91m+ apt-get update
>
> I get if I was changing those scripts, but then I'd think it should
> cache after running it once... but, no.
>
>
> On Thu, Aug 15, 2019 at 3:51 PM Marco de Abreu <ma...@gmail.com>
> wrote:
> >
> > Do I understand it correctly that you are saying that the Docker cache
> > doesn't work properly and regularly reinstalls dependencies? Or do you
> mean
> > that you only have cache misses when you modify the dependencies - which
> > would be expected?
> >
> > -Marco
> >
> > On Fri, Aug 16, 2019 at 12:48 AM Aaron Markham <
> aaron.s.markham@gmail.com>
> > wrote:
> >
> > > Many of the CI pipelines follow this pattern:
> > > Load ubuntu 16.04, install deps, build mxnet, then run some tests. Why
> > > repeat steps 1-3 over and over?
> > >
> > > Now, some tests use a stashed binary and docker cache. And I see this
> work
> > > locally, but for the most part, on CI, you're gonna sit through a
> > > dependency install.
> > >
> > > I noticed that almost all jobs use an ubuntu setup that is fully
> loaded.
> > > Without cache, it can take 10 or more minutes to build.  So I made a
> lite
> > > version. Takes only a few minutes instead.
> > >
> > > In some cases archiving worked great to share across pipelines, but as
> > > Marco mentioned we need a storage solution to make that happen. We
> can't
> > > archive every intermediate artifact for each PR.
> > >
> > > On Thu, Aug 15, 2019, 13:47 Pedro Larroy <pedro.larroy.lists@gmail.com
> >
> > > wrote:
> > >
> > > > Hi Aaron. Why speeds things up? What's the difference?
> > > >
> > > > Pedro.
> > > >
> > > > On Wed, Aug 14, 2019 at 8:39 PM Aaron Markham <
> aaron.s.markham@gmail.com
> > > >
> > > > wrote:
> > > >
> > > > > The PRs Thomas and I are working on for the new docs and website
> share
> > > > the
> > > > > mxnet binary in the new CI pipelines we made. Speeds things up a
> lot.
> > > > >
> > > > > On Wed, Aug 14, 2019, 18:16 Chris Olivier <cj...@gmail.com>
> > > wrote:
> > > > >
> > > > > > I see it done daily now, and while I can’t share all the details,
> > > it’s
> > > > > not
> > > > > > an incredibly complex thing, and involves not much more than
> nfs/efs
> > > > > > sharing and remote ssh commands.  All it takes is a little
> ingenuity
> > > > and
> > > > > > some imagination.
> > > > > >
> > > > > > On Wed, Aug 14, 2019 at 4:31 PM Pedro Larroy <
> > > > > pedro.larroy.lists@gmail.com
> > > > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Sounds good in theory. I think there are complex details with
> > > regards
> > > > > of
> > > > > > > resource sharing during parallel execution. Still I think both
> ways
> > > > can
> > > > > > be
> > > > > > > explored. I think some tests run for unreasonably long times
> for
> > > what
> > > > > > they
> > > > > > > are doing. We already scale parts of the pipeline horizontally
> > > across
> > > > > > > workers.
> > > > > > >
> > > > > > >
> > > > > > > On Wed, Aug 14, 2019 at 5:12 PM Chris Olivier <
> > > > cjolivier01@apache.org>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > +1
> > > > > > > >
> > > > > > > > Rather than remove tests (which doesn’t scale as a
> solution), why
> > > > not
> > > > > > > scale
> > > > > > > > them horizontally so that they finish more quickly? Across
> > > > processes
> > > > > or
> > > > > > > > even on a pool of machines that aren’t necessarily the build
> > > > machine?
> > > > > > > >
> > > > > > > > On Wed, Aug 14, 2019 at 12:03 PM Marco de Abreu <
> > > > > > marco.g.abreu@gmail.com
> > > > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > With regards to time I rather prefer us spending a bit more
> > > time
> > > > on
> > > > > > > > > maintenance than somebody running into an error that
> could've
> > > > been
> > > > > > > caught
> > > > > > > > > with a test.
> > > > > > > > >
> > > > > > > > > I mean, our Publishing pipeline for Scala GPU has been
> broken
> > > for
> > > > > > quite
> > > > > > > > > some time now, but nobody noticed that. Basically my
> stance on
> > > > that
> > > > > > > > matter
> > > > > > > > > is that as soon as something is not blocking, you can also
> just
> > > > > > > > deactivate
> > > > > > > > > it since you don't have a forcing function in an open
> source
> > > > > project.
> > > > > > > > > People will rarely come back and fix the errors of some
> nightly
> > > > > test
> > > > > > > that
> > > > > > > > > they introduced.
> > > > > > > > >
> > > > > > > > > -Marco
> > > > > > > > >
> > > > > > > > > Carin Meier <ca...@gmail.com> schrieb am Mi., 14.
> Aug.
> > > > 2019,
> > > > > > > 21:59:
> > > > > > > > >
> > > > > > > > > > If a language binding test is failing for a not important
> > > > reason,
> > > > > > > then
> > > > > > > > it
> > > > > > > > > > is too brittle and needs to be fixed (we have fixed some
> of
> > > > these
> > > > > > > with
> > > > > > > > > the
> > > > > > > > > > Clojure package [1]).
> > > > > > > > > > But in general, if we thinking of the MXNet project as
> one
> > > > > project
> > > > > > > that
> > > > > > > > > is
> > > > > > > > > > across all the language bindings, then we want to know if
> > > some
> > > > > > > > > fundamental
> > > > > > > > > > code change is going to break a downstream package.
> > > > > > > > > > I can't speak for all the high level package binding
> > > > maintainers,
> > > > > > but
> > > > > > > > I'm
> > > > > > > > > > always happy to pitch in to provide code fixes to help
> the
> > > base
> > > > > PR
> > > > > > > get
> > > > > > > > > > green.
> > > > > > > > > >
> > > > > > > > > > The time costs to maintain such a large CI project
> obviously
> > > > > needs
> > > > > > to
> > > > > > > > be
> > > > > > > > > > considered as well.
> > > > > > > > > >
> > > > > > > > > > [1] https://github.com/apache/incubator-mxnet/pull/15579
> > > > > > > > > >
> > > > > > > > > > On Wed, Aug 14, 2019 at 3:48 PM Pedro Larroy <
> > > > > > > > > pedro.larroy.lists@gmail.com
> > > > > > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > From what I have seen Clojure is 15 minutes, which I
> think
> > > is
> > > > > > > > > reasonable.
> > > > > > > > > > > The only question is that when a binding such as R,
> Perl or
> > > > > > Clojure
> > > > > > > > > > fails,
> > > > > > > > > > > some devs are a bit confused about how to fix them
> since
> > > they
> > > > > are
> > > > > > > not
> > > > > > > > > > > familiar with the testing tools and the language.
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Aug 14, 2019 at 11:57 AM Carin Meier <
> > > > > > carinmeier@gmail.com
> > > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Great idea Marco! Anything that you think would be
> > > valuable
> > > > > to
> > > > > > > > share
> > > > > > > > > > > would
> > > > > > > > > > > > be good. The duration of each node in the test stage
> > > sounds
> > > > > > like
> > > > > > > a
> > > > > > > > > good
> > > > > > > > > > > > start.
> > > > > > > > > > > >
> > > > > > > > > > > > - Carin
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, Aug 14, 2019 at 2:48 PM Marco de Abreu <
> > > > > > > > > > marco.g.abreu@gmail.com>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi,
> > > > > > > > > > > > >
> > > > > > > > > > > > > we record a bunch of metrics about run statistics
> (down
> > > > to
> > > > > > the
> > > > > > > > > > duration
> > > > > > > > > > > > of
> > > > > > > > > > > > > every individual step). If you tell me which ones
> > > you're
> > > > > > > > > particularly
> > > > > > > > > > > > > interested in (probably total duration of each
> node in
> > > > the
> > > > > > test
> > > > > > > > > > stage),
> > > > > > > > > > > > I'm
> > > > > > > > > > > > > happy to provide them.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Dimensions are (in hierarchical order):
> > > > > > > > > > > > > - job
> > > > > > > > > > > > > - branch
> > > > > > > > > > > > > - stage
> > > > > > > > > > > > > - node
> > > > > > > > > > > > > - step
> > > > > > > > > > > > >
> > > > > > > > > > > > > Unfortunately I don't have the possibility to
> export
> > > them
> > > > > > since
> > > > > > > > we
> > > > > > > > > > > store
> > > > > > > > > > > > > them in CloudWatch Metrics which afaik doesn't
> offer
> > > raw
> > > > > > > exports.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Best regards,
> > > > > > > > > > > > > Marco
> > > > > > > > > > > > >
> > > > > > > > > > > > > Carin Meier <ca...@gmail.com> schrieb am
> Mi., 14.
> > > > > Aug.
> > > > > > > > 2019,
> > > > > > > > > > > 19:43:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > I would prefer to keep the language binding in
> the PR
> > > > > > > process.
> > > > > > > > > > > Perhaps
> > > > > > > > > > > > we
> > > > > > > > > > > > > > could do some analytics to see how much each of
> the
> > > > > > language
> > > > > > > > > > bindings
> > > > > > > > > > > > is
> > > > > > > > > > > > > > contributing to overall run time.
> > > > > > > > > > > > > > If we have some metrics on that, maybe we can
> come up
> > > > > with
> > > > > > a
> > > > > > > > > > > guideline
> > > > > > > > > > > > of
> > > > > > > > > > > > > > how much time each should take. Another
> possibility
> > > is
> > > > > > > leverage
> > > > > > > > > the
> > > > > > > > > > > > > > parallel builds more.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Wed, Aug 14, 2019 at 1:30 PM Pedro Larroy <
> > > > > > > > > > > > > pedro.larroy.lists@gmail.com
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hi Carin.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > That's a good point, all things considered
> would
> > > your
> > > > > > > > > preference
> > > > > > > > > > be
> > > > > > > > > > > > to
> > > > > > > > > > > > > > keep
> > > > > > > > > > > > > > > the Clojure tests as part of the PR process or
> in
> > > > > > Nightly?
> > > > > > > > > > > > > > > Some options are having notifications here or
> in
> > > > slack.
> > > > > > But
> > > > > > > > if
> > > > > > > > > we
> > > > > > > > > > > > think
> > > > > > > > > > > > > > > breakages would go unnoticed maybe is not a
> good
> > > idea
> > > > > to
> > > > > > > > fully
> > > > > > > > > > > remove
> > > > > > > > > > > > > > > bindings from the PR process and just
> streamline
> > > the
> > > > > > > process.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Pedro.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Wed, Aug 14, 2019 at 5:09 AM Carin Meier <
> > > > > > > > > > carinmeier@gmail.com>
> > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Before any binding tests are moved to
> nightly, I
> > > > > think
> > > > > > we
> > > > > > > > > need
> > > > > > > > > > to
> > > > > > > > > > > > > > figure
> > > > > > > > > > > > > > > > out how the community can get proper
> > > notifications
> > > > of
> > > > > > > > failure
> > > > > > > > > > and
> > > > > > > > > > > > > > success
> > > > > > > > > > > > > > > > on those nightly runs. Otherwise, I think
> that
> > > > > > breakages
> > > > > > > > > would
> > > > > > > > > > go
> > > > > > > > > > > > > > > > unnoticed.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > -Carin
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Tue, Aug 13, 2019 at 7:47 PM Pedro Larroy
> <
> > > > > > > > > > > > > > > pedro.larroy.lists@gmail.com
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Hi
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Seems we are hitting some problems in CI. I
> > > > propose
> > > > > > the
> > > > > > > > > > > following
> > > > > > > > > > > > > > > action
> > > > > > > > > > > > > > > > > items to remedy the situation and
> accelerate
> > > turn
> > > > > > > around
> > > > > > > > > > times
> > > > > > > > > > > in
> > > > > > > > > > > > > CI,
> > > > > > > > > > > > > > > > > reduce cost, complexity and probability of
> > > > failure
> > > > > > > > blocking
> > > > > > > > > > PRs
> > > > > > > > > > > > and
> > > > > > > > > > > > > > > > > frustrating developers:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > * Upgrade Windows visual studio from VS
> 2015 to
> > > > VS
> > > > > > > 2017.
> > > > > > > > > The
> > > > > > > > > > > > > > > > > build_windows.py infrastructure should
> easily
> > > > work
> > > > > > with
> > > > > > > > the
> > > > > > > > > > new
> > > > > > > > > > > > > > > version.
> > > > > > > > > > > > > > > > > Currently some PRs are blocked by this:
> > > > > > > > > > > > > > > > >
> > > > > > https://github.com/apache/incubator-mxnet/issues/13958
> > > > > > > > > > > > > > > > > * Move Gluon Model zoo tests to nightly.
> > > Tracked
> > > > at
> > > > > > > > > > > > > > > > >
> > > > > > https://github.com/apache/incubator-mxnet/issues/15295
> > > > > > > > > > > > > > > > > * Move non-python bindings tests to
> nightly.
> > > If a
> > > > > > > commit
> > > > > > > > is
> > > > > > > > > > > > > touching
> > > > > > > > > > > > > > > > other
> > > > > > > > > > > > > > > > > bindings, the reviewer should ask for a
> full
> > > run
> > > > > > which
> > > > > > > > can
> > > > > > > > > be
> > > > > > > > > > > > done
> > > > > > > > > > > > > > > > locally,
> > > > > > > > > > > > > > > > > use the label bot to trigger a full CI
> build,
> > > or
> > > > > > defer
> > > > > > > to
> > > > > > > > > > > > nightly.
> > > > > > > > > > > > > > > > > * Provide a couple of basic sanity
> performance
> > > > > tests
> > > > > > on
> > > > > > > > > small
> > > > > > > > > > > > > models
> > > > > > > > > > > > > > > that
> > > > > > > > > > > > > > > > > are run on CI and can be echoed by the
> label
> > > bot
> > > > > as a
> > > > > > > > > comment
> > > > > > > > > > > for
> > > > > > > > > > > > > > PRs.
> > > > > > > > > > > > > > > > > * Address unit tests that take more than
> > > 10-20s,
> > > > > > > > streamline
> > > > > > > > > > > them
> > > > > > > > > > > > or
> > > > > > > > > > > > > > > move
> > > > > > > > > > > > > > > > > them to nightly if it can't be done.
> > > > > > > > > > > > > > > > > * Open sourcing the remaining CI
> infrastructure
> > > > > > scripts
> > > > > > > > so
> > > > > > > > > > the
> > > > > > > > > > > > > > > community
> > > > > > > > > > > > > > > > > can contribute.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > I think our goal should be turnaround under
> > > > 30min.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > I would also like to touch base with the
> > > > community
> > > > > > that
> > > > > > > > > some
> > > > > > > > > > > PRs
> > > > > > > > > > > > > are
> > > > > > > > > > > > > > > not
> > > > > > > > > > > > > > > > > being followed up by committers asking for
> > > > changes.
> > > > > > For
> > > > > > > > > > example
> > > > > > > > > > > > > this
> > > > > > > > > > > > > > PR
> > > > > > > > > > > > > > > > is
> > > > > > > > > > > > > > > > > importtant and is hanging for a long time.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > https://github.com/apache/incubator-mxnet/pull/15051
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > This is another, less important but more
> > > trivial
> > > > to
> > > > > > > > review:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > https://github.com/apache/incubator-mxnet/pull/14940
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > I think comitters requesting changes and
> not
> > > > > > folllowing
> > > > > > > > up
> > > > > > > > > in
> > > > > > > > > > > > > > > reasonable
> > > > > > > > > > > > > > > > > time is not healthy for the project. I
> suggest
> > > > > > > > configuring
> > > > > > > > > > > github
> > > > > > > > > > > > > > > > > Notifications for a good SNR and following
> up.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Regards.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Pedro.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
>

Re: CI and PRs

Posted by Pedro Larroy <pe...@gmail.com>.

Hi Chris, you are reading some confrontational or negative things where
there is no bad intention and just diverse opinions and ways to express
them.

We went with Marco for a beer and dinner together and talked about this and
we had a good exchange of technical ideas and opinions with mutual respect,
often is much easier to talk  in person than getting the wrong
interpretation over email. (Isn't Apache about community over code?) You
and me should do it sometime if you want. I send you an initial beer emoji
🍻 as a friendly token of good intention.

Marco's role in the project and in CI, his PMC role or contributions were
never put into question. The question is how can we have a more diverse
contributions and making it easier to do so to grow the community and help
people contribute. Giving credit, acknowledging that many activities are a
team effort and supporting them are some ideas that I think might be useful
looking forward. My proposal is to acknowledge those contributions and be
more inclusive now that the remaining infrastructure is open sourced.


Pedro.










On Fri, Aug 23, 2019 at 7:43 PM Chris Olivier <cj...@gmail.com> wrote:

> Pedro,
>
> I don’t see where Marco says that he “designed and implemented all aspects
> of CI by himself”.  I do think, however, that it’s fair to say that Marco
> was in charge of the design and most likely made the majority of design
> decisions as the CI was being built, especially around those tenents that
> he mentioned.  I know this because before I submitted Marco as a committer,
> I asked some his teammates whether Marco was really responsible for CI and
> the answer by all I asked were that CI was Marco's baby and he did most of
> it by some large margin (I am paraphrasing).  Taking other design inputs
> and examples (i.e. Apache CI) is all part of any responsible design
> process.
>
> In addition, I am not understanding the obfuscation of “people who
> contributed to CI”, “person/people who designed CI”, or even
> "person who oversees CI" as it is weaponized in your email.  Again, nowhere
> did Marco say that he did everything back then or since then.  I don't
> think it's fair to try to modify what Marco wrote and then try to turn it
> against him.  Reminds me of the techniques of network news these days,
> quite frankly (whichever side you're "on" doesn't matter, because both
> sides do it).
>
> -Chris
>
>
>
>
>
> On Fri, Aug 23, 2019 at 3:56 PM Pedro Larroy <pedro.larroy.lists@gmail.com
> >
> wrote:
>
> > Thanks for your response Marco, I think you have totally missed my
> original
> > point which was basically that someone volunteering effort on the CI is
> as
> > important as someone contributing a feature. From my perspective this
> > hasn't been the case, and we had to rely a lot on you and Sheng to submit
> > fixes which required access, also to relay communication with Apache
> infra.
> > Also in many cases we had to rely on you to channel fixes, PRs, disable
> > tests etc. If the community is fine having this kind of bottleneck, fine
> > with me. From my point of view and the feedback from myself and other
> > people which contributed to CI this was not always a good experience.
> > Having a welcoming and inclusive community is very important. I don't
> want
> > to start a discussion on this, but invite the community to do a bit of
> soul
> > searching on this topic, now that the infrastructure is open source.
> >
> > Also I find surprising that you claim that you designed the CI yourself,
> > when this was a joint work of many individuals, including the old Apache
> CI
> > and additional contributions and code reviewers, people who were oncall
> for
> > this service or the autoscaling approach which if I remember correctly
> came
> > from a humble servant. Kellen did a lot of pair programming and code
> > reviews. Obviously you have a done a lot of work on CI which has had a
> huge
> > positive impact on the project and your recognition is well deserved. The
> > technical details you mention on your email are perfectly true and valid.
> >
> > Below is a rough list of individuals who contributed to CI, I would like
> to
> > thank all of them since without this work, we wouldn't be able to deliver
> > with the quality that we have done in the past.
> >
> >
> > pllarroy@mac:0: ~/d/m/ci [fc_higher_order_grad_2]> git log
> > --pretty=format:%aN . | sort | uniq -c | sort -n | tail -n 10
> >    6 Zach Kimberg
> >    6 stu1130
> >    7 Jake Lee
> >    8 Aaron Markham
> >   11 Lanking
> >   12 Anton Chernov
> >   13 perdasilva
> >   26 Kellen Sunderland
> >   34 Marco de Abreu
> >   46 Pedro Larroy
> >
> > pllarroy@mac:0: ~/d/mxnet_ci_general [master]> git log
> --pretty=format:%aN
> > | sort | uniq -c | sort -n
> >    1 Gavin M. Bell
> >    1 de Abreu
> >    6 Bair
> >    7 Kellen Sunderland
> >    8 Jose Luis Contreras
> >   14 perdasilva
> >   20 Per Goncalves da Silva
> >   29 Anton Chernov
> >   39 Chance Bair
> >   96 Pedro Larroy
> >  209 Marco de Abreu
> >
> >
> >
> > Pedro.
> >
> > On Fri, Aug 23, 2019 at 3:18 PM Marco de Abreu <ma...@gmail.com>
> > wrote:
> >
> > > I've heard this request multiple times and so far, I'm having issues
> > > understanding the direct correlation between having committer
> permissions
> > > and being able to manage CI.
> > >
> > > When I designed the CI, one of the tenets was maintainability and
> > > accessbility for the community: I wanted to avoid that somebody needs
> > > certain privileges in order to execute regular actions. The result was
> > the
> > > strong usage of Jenkinsfiles, Dockerfiles and the runtime functions.
> The
> > > combination of these techniques allowed somebody to create a job from
> the
> > > process flow level (Jenkinsfile), over the environment level
> (Dockerfile)
> > > to the individual action level (runtime functions). This design
> basically
> > > gives the community full access over the entire flow.
> > >
> > > The jobs that are configured to source only Jenkinsfile. Jenkins
> > supports a
> > > lot of different ways how to define pipelines, but I have made sure to
> > > encourage everybody to use only Jenkinsfiles. This makes sure that no
> > > configuration is done in the web-interface. This firs of all alleviates
> > the
> > > permission issue since there's literally no config in the web interface
> > and
> > > second it allows auditing since all changes have to be done in the
> MXNet
> > > GitHub repository.
> > >
> > > Committers have elevated permissions in Jenkins. These contain the
> > > permission to run, stop and configure jobs. All other permissions are
> > > restricted to system administrators for the sake of ensuring stability
> of
> > > the system. On the dev-CI on the other hand, we're happy to add people
> so
> > > they can experiment as much as they want. The transition to prod-CI is
> > then
> > > assisted by me to ensure smooth operations and adhering to the best
> > > practices (like using our Jenkinsfiles and Docker structure, for
> > example).
> > >
> > > The only case where somebody would need elevated permissions is if they
> > > would like to change system settings. But at that point, we're talking
> > > about instance settings and AWS account configuration. Since that now
> > > reaches into the next permission level, which is restricted to the
> donor
> > of
> > > the CI system - Amazon Web Services - this is something that not even
> PMC
> > > members will receive. The same policy is in place for the official
> Apache
> > > CI: Committers/PMCs can configure their job, but don't have system
> level
> > > access to either Jenkins or the underlying AWS account for obvious
> > reasons.
> > > We're trying to stay in line with the same policy, but in the past I've
> > > granted Jenkins administrator access to people who required elevated
> > access
> > > to properly do their job - Aaron Markham with regards to the website
> > being
> > > one example.
> > >
> > > This means that the only case when a contributor needs committer
> > assistance
> > > is the moment when somebody would like to set up a new Jenkins job. It
> > > would be a matter of setting up the job to point to the persons branch
> -
> > > Jenkins will then automatically pull the Jenkinsfile and thus no
> further
> > > configuration is necessary and updates are directly consumed. Such a
> > > request IMO is on the same level as us having to cut a ticket to Apache
> > > INFRA to create a new job.
> > >
> > > With regards to speed: So far, I was the only "CI-Person" with
> committer
> > > privileges. But due to our 4-eye-rule for PRs, I wasn't able to merge
> my
> > > own changes anyways - most of them were reviewed by Sheng, for example.
> > In
> > > an emergency, I'm sure that somebody can be reached to assist since we
> > > currently have 39 PMC members and 20 committers spanning multiple
> > > timezones.
> > >
> > > For these reasons, I don't agree with the sentiment that contributors
> are
> > > unable to effectively work with the CI system unless they have
> committer
> > > privileges.
> > >
> > > Best regards,
> > > Marco
> > >
> > >
> > > On Fri, Aug 23, 2019 at 10:33 AM Pedro Larroy <
> > > pedro.larroy.lists@gmail.com>
> > > wrote:
> > >
> > > > As Marco has open sourced the bulk of the CI infrastructure donated
> > from
> > > > Amazon to the community, I would like to raise the recommendation
> that
> > > the
> > > > community takes action to help volunteers working on the CI have a
> > better
> > > > experience. In the past, it's my impression that there hasn't been
> much
> > > > action granting PMC or committer privileges to engineers volunteering
> > to
> > > > help CI other than Marco. This would encourage more contributions and
> > > help
> > > > expedite critical fixes and corrective actions. I think this has not
> > > > properly enabled those individuals to be as effective as they could,
> as
> > > > well as the lack of recognition for such a critical activity. I'm not
> > > sure
> > > > about the cause but I believe this is something that should be
> > rectified
> > > > for future contributions and help on the CI front if improvements are
> > > > desired.
> > > >
> > > > In spanish we have a saying: "es de bien nacido ser agradecido".
> > > >
> > > > Pedro.
> > > >
> > > > On Fri, Aug 16, 2019 at 4:03 PM Pedro Larroy <
> > > pedro.larroy.lists@gmail.com
> > > > >
> > > > wrote:
> > > >
> > > > > Hi Aaron. This is difficult to diagnose, because I don't know what
> to
> > > do
> > > > > when the hash of the layer in docker doesn't match and decides to
> > > rebuild
> > > > > it. the r script seems not to have changed. I have observed this in
> > the
> > > > > past and I think is due to bugs in docker.   Maybe Kellen is able
> to
> > > give
> > > > > some tips here.
> > > > >
> > > > > In this case you should use -R which is already in master. (you can
> > > > always
> > > > > copy the script on top if you are in an older revision).
> > > > >
> > > > > Another thing that worked for me in the past was to completely nuke
> > the
> > > > > docker cache, so it redonwloads from the CI repo. After that it
> > worked
> > > > fine
> > > > > in some cases.
> > > > >
> > > > > These two workarounds are not ideal, but should unblock you.
> > > > >
> > > > > Pedro.
> > > > >
> > > > > On Fri, Aug 16, 2019 at 11:39 AM Aaron Markham <
> > > > aaron.s.markham@gmail.com>
> > > > > wrote:
> > > > >
> > > > >> Is -R already in there?
> > > > >>
> > > > >> Here's an example of it happening to me right now.... I am making
> > > > >> minor changes to the runtime_functions logic for handling the R
> docs
> > > > >> output. I pull the fix, then run the container, but I see the R
> deps
> > > > >> layer re-running. I didn't touch that. Why it that running again?
> > > > >>
> > > > >> From https://github.com/aaronmarkham/incubator-mxnet
> > > > >>    f71cc6d..deec6aa  new_website_pipeline_2_aaron_rdocs ->
> > > > >> origin/new_website_pipeline_2_aaron_rdocs
> > > > >> Updating f71cc6d..deec6aa
> > > > >> Fast-forward
> > > > >>  ci/docker/runtime_functions.sh | 6 +++---
> > > > >>  1 file changed, 3 insertions(+), 3 deletions(-)
> > > > >> (base) ubuntu@ip-172-31-47-182:~/aaron/ci$ ./build.py
> > > > >> --docker-registry mxnetci --platform ubuntu_cpu_r
> > > > >> --docker-build-retries 3 --shm-size 500m
> /work/runtime_functions.sh
> > > > >> build_r_docs
> > > > >> build.py: 2019-08-16 18:34:44,639Z INFO MXNet container based
> build
> > > > tool.
> > > > >> build.py: 2019-08-16 18:34:44,641Z INFO Docker cache download is
> > > > >> enabled from registry mxnetci
> > > > >> build.py: 2019-08-16 18:34:44,641Z INFO Loading Docker cache for
> > > > >> mxnetci/build.ubuntu_cpu_r from mxnetci
> > > > >> Using default tag: latest
> > > > >> latest: Pulling from mxnetci/build.ubuntu_cpu_r
> > > > >> Digest:
> > > > >>
> > > sha256:7dc515c288b3e66d96920eb8975f985a501bb57f70595fbe0cb1c4fcd8d4184b
> > > > >> Status: Downloaded newer image for
> mxnetci/build.ubuntu_cpu_r:latest
> > > > >> build.py: 2019-08-16 18:34:44,807Z INFO Successfully pulled docker
> > > cache
> > > > >> build.py: 2019-08-16 18:34:44,807Z INFO Building docker container
> > > > >> tagged 'mxnetci/build.ubuntu_cpu_r' with docker
> > > > >> build.py: 2019-08-16 18:34:44,807Z INFO Running command: 'docker
> > build
> > > > >> -f docker/Dockerfile.build.ubuntu_cpu_r --build-arg USER_ID=1000
> > > > >> --build-arg GROUP_ID=1000 --cache-from mxnetci/build.ubuntu_cpu_r
> -t
> > > > >> mxnetci/build.ubuntu_cpu_r docker'
> > > > >> Sending build context to Docker daemon  289.8kB
> > > > >> Step 1/15 : FROM ubuntu:16.04
> > > > >>  ---> 5e13f8dd4c1a
> > > > >> Step 2/15 : WORKDIR /work/deps
> > > > >>  ---> Using cache
> > > > >>  ---> afc2a135945d
> > > > >> Step 3/15 : COPY install/ubuntu_core.sh /work/
> > > > >>  ---> Using cache
> > > > >>  ---> da2b2e7f35e1
> > > > >> Step 4/15 : RUN /work/ubuntu_core.sh
> > > > >>  ---> Using cache
> > > > >>  ---> d1e88b26b1d2
> > > > >> Step 5/15 : COPY install/deb_ubuntu_ccache.sh /work/
> > > > >>  ---> Using cache
> > > > >>  ---> 3aa97dea3b7b
> > > > >> Step 6/15 : RUN /work/deb_ubuntu_ccache.sh
> > > > >>  ---> Using cache
> > > > >>  ---> bec503f1d149
> > > > >> Step 7/15 : COPY install/ubuntu_r.sh /work/
> > > > >>  ---> c5e77c38031d
> > > > >> Step 8/15 : COPY install/r.gpg /work/
> > > > >>  ---> d8cdbf015d2b
> > > > >> Step 9/15 : RUN /work/ubuntu_r.sh
> > > > >>  ---> Running in c6c90b9e1538
> > > > >> ++ dirname /work/ubuntu_r.sh
> > > > >> + cd /work
> > > > >> + echo 'deb http://cran.rstudio.com/bin/linux/ubuntu trusty/'
> > > > >> + apt-key add r.gpg
> > > > >> OK
> > > > >> + add-apt-repository 'deb [arch=amd64,i386]
> > > > >> https://cran.rstudio.com/bin/linux/ubuntu xenial/'
> > > > >> + apt-get update
> > > > >> Ign:1 http://cran.rstudio.com/bin/linux/ubuntu trusty/ InRelease
> > > > >>
> > > > >> On Fri, Aug 16, 2019 at 11:32 AM Pedro Larroy
> > > > >> <pe...@gmail.com> wrote:
> > > > >> >
> > > > >> > Also, I forgot, another workaround is that I added the -R flag
> to
> > > the
> > > > >> build
> > > > >> > logic (build.py) so the container is not rebuilt for manual use.
> > > > >> >
> > > > >> > On Fri, Aug 16, 2019 at 11:18 AM Pedro Larroy <
> > > > >> pedro.larroy.lists@gmail.com>
> > > > >> > wrote:
> > > > >> >
> > > > >> > >
> > > > >> > > Hi Aaron.
> > > > >> > >
> > > > >> > > As Marco explained, if you are in master the cache usually
> > works,
> > > > >> there's
> > > > >> > > two issues that I have observed:
> > > > >> > >
> > > > >> > > 1 - Docker doesn't automatically pull the base image (ex.
> > > > >> ubuntu:16.04) so
> > > > >> > > if your cached base which is used in the FROM statement
> becomes
> > > > >> outdated
> > > > >> > > your caching won't work. (Using docker pull ubuntu:16.04) or
> the
> > > > base
> > > > >> > > images from the container helps with this.
> > > > >> > >
> > > > >> > > 2 - There's another situation where the above doesn't help
> which
> > > > >> seems to
> > > > >> > > be an unidentified issue with the docker cache:
> > > > >> > > https://github.com/docker/docker.github.io/issues/8886
> > > > >> > >
> > > > >> > > We can get a short term workaround for #1 by explicitly
> pulling
> > > > bases
> > > > >> from
> > > > >> > > the script, but I think docker should do it when using
> > > --cache-from
> > > > so
> > > > >> > > maybe contributing a patch to docker would the best approach.
> > > > >> > >
> > > > >> > > Pedro
> > > > >> > >
> > > > >> > > On Thu, Aug 15, 2019 at 7:06 PM Aaron Markham <
> > > > >> aaron.s.markham@gmail.com>
> > > > >> > > wrote:
> > > > >> > >
> > > > >> > >> When you create a new Dockerfile and use that on CI, it
> doesn't
> > > > seem
> > > > >> > >> to cache some of the steps... like this:
> > > > >> > >>
> > > > >> > >> Step 13/15 : RUN /work/ubuntu_docs.sh
> > > > >> > >>  ---> Running in a1e522f3283b
> > > > >> > >>  [91m+ echo 'Installing dependencies...'
> > > > >> > >> + apt-get update
> > > > >> > >>  [0mInstalling dependencies.
> > > > >> > >>
> > > > >> > >> Or this....
> > > > >> > >>
> > > > >> > >> Step 4/13 : RUN /work/ubuntu_core.sh
> > > > >> > >>  ---> Running in e7882d7aa750
> > > > >> > >>  [91m+ apt-get update
> > > > >> > >>
> > > > >> > >> I get if I was changing those scripts, but then I'd think it
> > > should
> > > > >> > >> cache after running it once... but, no.
> > > > >> > >>
> > > > >> > >>
> > > > >> > >> On Thu, Aug 15, 2019 at 3:51 PM Marco de Abreu <
> > > > >> marco.g.abreu@gmail.com>
> > > > >> > >> wrote:
> > > > >> > >> >
> > > > >> > >> > Do I understand it correctly that you are saying that the
> > > Docker
> > > > >> cache
> > > > >> > >> > doesn't work properly and regularly reinstalls
> dependencies?
> > Or
> > > > do
> > > > >> you
> > > > >> > >> mean
> > > > >> > >> > that you only have cache misses when you modify the
> > > dependencies
> > > > -
> > > > >> which
> > > > >> > >> > would be expected?
> > > > >> > >> >
> > > > >> > >> > -Marco
> > > > >> > >> >
> > > > >> > >> > On Fri, Aug 16, 2019 at 12:48 AM Aaron Markham <
> > > > >> > >> aaron.s.markham@gmail.com>
> > > > >> > >> > wrote:
> > > > >> > >> >
> > > > >> > >> > > Many of the CI pipelines follow this pattern:
> > > > >> > >> > > Load ubuntu 16.04, install deps, build mxnet, then run
> some
> > > > >> tests. Why
> > > > >> > >> > > repeat steps 1-3 over and over?
> > > > >> > >> > >
> > > > >> > >> > > Now, some tests use a stashed binary and docker cache.
> And
> > I
> > > > see
> > > > >> this
> > > > >> > >> work
> > > > >> > >> > > locally, but for the most part, on CI, you're gonna sit
> > > > through a
> > > > >> > >> > > dependency install.
> > > > >> > >> > >
> > > > >> > >> > > I noticed that almost all jobs use an ubuntu setup that
> is
> > > > fully
> > > > >> > >> loaded.
> > > > >> > >> > > Without cache, it can take 10 or more minutes to build.
> > So I
> > > > >> made a
> > > > >> > >> lite
> > > > >> > >> > > version. Takes only a few minutes instead.
> > > > >> > >> > >
> > > > >> > >> > > In some cases archiving worked great to share across
> > > pipelines,
> > > > >> but as
> > > > >> > >> > > Marco mentioned we need a storage solution to make that
> > > happen.
> > > > >> We
> > > > >> > >> can't
> > > > >> > >> > > archive every intermediate artifact for each PR.
> > > > >> > >> > >
> > > > >> > >> > > On Thu, Aug 15, 2019, 13:47 Pedro Larroy <
> > > > >> > >> pedro.larroy.lists@gmail.com>
> > > > >> > >> > > wrote:
> > > > >> > >> > >
> > > > >> > >> > > > Hi Aaron. Why speeds things up? What's the difference?
> > > > >> > >> > > >
> > > > >> > >> > > > Pedro.
> > > > >> > >> > > >
> > > > >> > >> > > > On Wed, Aug 14, 2019 at 8:39 PM Aaron Markham <
> > > > >> > >> aaron.s.markham@gmail.com
> > > > >> > >> > > >
> > > > >> > >> > > > wrote:
> > > > >> > >> > > >
> > > > >> > >> > > > > The PRs Thomas and I are working on for the new docs
> > and
> > > > >> website
> > > > >> > >> share
> > > > >> > >> > > > the
> > > > >> > >> > > > > mxnet binary in the new CI pipelines we made. Speeds
> > > things
> > > > >> up a
> > > > >> > >> lot.
> > > > >> > >> > > > >
> > > > >> > >> > > > > On Wed, Aug 14, 2019, 18:16 Chris Olivier <
> > > > >> cjolivier01@gmail.com>
> > > > >> > >> > > wrote:
> > > > >> > >> > > > >
> > > > >> > >> > > > > > I see it done daily now, and while I can’t share
> all
> > > the
> > > > >> > >> details,
> > > > >> > >> > > it’s
> > > > >> > >> > > > > not
> > > > >> > >> > > > > > an incredibly complex thing, and involves not much
> > more
> > > > >> than
> > > > >> > >> nfs/efs
> > > > >> > >> > > > > > sharing and remote ssh commands.  All it takes is a
> > > > little
> > > > >> > >> ingenuity
> > > > >> > >> > > > and
> > > > >> > >> > > > > > some imagination.
> > > > >> > >> > > > > >
> > > > >> > >> > > > > > On Wed, Aug 14, 2019 at 4:31 PM Pedro Larroy <
> > > > >> > >> > > > > pedro.larroy.lists@gmail.com
> > > > >> > >> > > > > > >
> > > > >> > >> > > > > > wrote:
> > > > >> > >> > > > > >
> > > > >> > >> > > > > > > Sounds good in theory. I think there are complex
> > > > details
> > > > >> with
> > > > >> > >> > > regards
> > > > >> > >> > > > > of
> > > > >> > >> > > > > > > resource sharing during parallel execution.
> Still I
> > > > think
> > > > >> > >> both ways
> > > > >> > >> > > > can
> > > > >> > >> > > > > > be
> > > > >> > >> > > > > > > explored. I think some tests run for unreasonably
> > > long
> > > > >> times
> > > > >> > >> for
> > > > >> > >> > > what
> > > > >> > >> > > > > > they
> > > > >> > >> > > > > > > are doing. We already scale parts of the pipeline
> > > > >> horizontally
> > > > >> > >> > > across
> > > > >> > >> > > > > > > workers.
> > > > >> > >> > > > > > >
> > > > >> > >> > > > > > >
> > > > >> > >> > > > > > > On Wed, Aug 14, 2019 at 5:12 PM Chris Olivier <
> > > > >> > >> > > > cjolivier01@apache.org>
> > > > >> > >> > > > > > > wrote:
> > > > >> > >> > > > > > >
> > > > >> > >> > > > > > > > +1
> > > > >> > >> > > > > > > >
> > > > >> > >> > > > > > > > Rather than remove tests (which doesn’t scale
> as
> > a
> > > > >> > >> solution), why
> > > > >> > >> > > > not
> > > > >> > >> > > > > > > scale
> > > > >> > >> > > > > > > > them horizontally so that they finish more
> > quickly?
> > > > >> Across
> > > > >> > >> > > > processes
> > > > >> > >> > > > > or
> > > > >> > >> > > > > > > > even on a pool of machines that aren’t
> > necessarily
> > > > the
> > > > >> build
> > > > >> > >> > > > machine?
> > > > >> > >> > > > > > > >
> > > > >> > >> > > > > > > > On Wed, Aug 14, 2019 at 12:03 PM Marco de
> Abreu <
> > > > >> > >> > > > > > marco.g.abreu@gmail.com
> > > > >> > >> > > > > > > >
> > > > >> > >> > > > > > > > wrote:
> > > > >> > >> > > > > > > >
> > > > >> > >> > > > > > > > > With regards to time I rather prefer us
> > spending
> > > a
> > > > >> bit
> > > > >> > >> more
> > > > >> > >> > > time
> > > > >> > >> > > > on
> > > > >> > >> > > > > > > > > maintenance than somebody running into an
> error
> > > > that
> > > > >> > >> could've
> > > > >> > >> > > > been
> > > > >> > >> > > > > > > caught
> > > > >> > >> > > > > > > > > with a test.
> > > > >> > >> > > > > > > > >
> > > > >> > >> > > > > > > > > I mean, our Publishing pipeline for Scala GPU
> > has
> > > > >> been
> > > > >> > >> broken
> > > > >> > >> > > for
> > > > >> > >> > > > > > quite
> > > > >> > >> > > > > > > > > some time now, but nobody noticed that.
> > Basically
> > > > my
> > > > >> > >> stance on
> > > > >> > >> > > > that
> > > > >> > >> > > > > > > > matter
> > > > >> > >> > > > > > > > > is that as soon as something is not blocking,
> > you
> > > > can
> > > > >> > >> also just
> > > > >> > >> > > > > > > > deactivate
> > > > >> > >> > > > > > > > > it since you don't have a forcing function in
> > an
> > > > open
> > > > >> > >> source
> > > > >> > >> > > > > project.
> > > > >> > >> > > > > > > > > People will rarely come back and fix the
> errors
> > > of
> > > > >> some
> > > > >> > >> nightly
> > > > >> > >> > > > > test
> > > > >> > >> > > > > > > that
> > > > >> > >> > > > > > > > > they introduced.
> > > > >> > >> > > > > > > > >
> > > > >> > >> > > > > > > > > -Marco
> > > > >> > >> > > > > > > > >
> > > > >> > >> > > > > > > > > Carin Meier <ca...@gmail.com> schrieb
> am
> > > Mi.,
> > > > >> 14.
> > > > >> > >> Aug.
> > > > >> > >> > > > 2019,
> > > > >> > >> > > > > > > 21:59:
> > > > >> > >> > > > > > > > >
> > > > >> > >> > > > > > > > > > If a language binding test is failing for a
> > not
> > > > >> > >> important
> > > > >> > >> > > > reason,
> > > > >> > >> > > > > > > then
> > > > >> > >> > > > > > > > it
> > > > >> > >> > > > > > > > > > is too brittle and needs to be fixed (we
> have
> > > > fixed
> > > > >> > >> some of
> > > > >> > >> > > > these
> > > > >> > >> > > > > > > with
> > > > >> > >> > > > > > > > > the
> > > > >> > >> > > > > > > > > > Clojure package [1]).
> > > > >> > >> > > > > > > > > > But in general, if we thinking of the MXNet
> > > > >> project as
> > > > >> > >> one
> > > > >> > >> > > > > project
> > > > >> > >> > > > > > > that
> > > > >> > >> > > > > > > > > is
> > > > >> > >> > > > > > > > > > across all the language bindings, then we
> > want
> > > to
> > > > >> know
> > > > >> > >> if
> > > > >> > >> > > some
> > > > >> > >> > > > > > > > > fundamental
> > > > >> > >> > > > > > > > > > code change is going to break a downstream
> > > > package.
> > > > >> > >> > > > > > > > > > I can't speak for all the high level
> package
> > > > >> binding
> > > > >> > >> > > > maintainers,
> > > > >> > >> > > > > > but
> > > > >> > >> > > > > > > > I'm
> > > > >> > >> > > > > > > > > > always happy to pitch in to provide code
> > fixes
> > > to
> > > > >> help
> > > > >> > >> the
> > > > >> > >> > > base
> > > > >> > >> > > > > PR
> > > > >> > >> > > > > > > get
> > > > >> > >> > > > > > > > > > green.
> > > > >> > >> > > > > > > > > >
> > > > >> > >> > > > > > > > > > The time costs to maintain such a large CI
> > > > project
> > > > >> > >> obviously
> > > > >> > >> > > > > needs
> > > > >> > >> > > > > > to
> > > > >> > >> > > > > > > > be
> > > > >> > >> > > > > > > > > > considered as well.
> > > > >> > >> > > > > > > > > >
> > > > >> > >> > > > > > > > > > [1]
> > > > >> > >> https://github.com/apache/incubator-mxnet/pull/15579
> > > > >> > >> > > > > > > > > >
> > > > >> > >> > > > > > > > > > On Wed, Aug 14, 2019 at 3:48 PM Pedro
> Larroy
> > <
> > > > >> > >> > > > > > > > > pedro.larroy.lists@gmail.com
> > > > >> > >> > > > > > > > > > >
> > > > >> > >> > > > > > > > > > wrote:
> > > > >> > >> > > > > > > > > >
> > > > >> > >> > > > > > > > > > > From what I have seen Clojure is 15
> > minutes,
> > > > >> which I
> > > > >> > >> think
> > > > >> > >> > > is
> > > > >> > >> > > > > > > > > reasonable.
> > > > >> > >> > > > > > > > > > > The only question is that when a binding
> > such
> > > > as
> > > > >> R,
> > > > >> > >> Perl or
> > > > >> > >> > > > > > Clojure
> > > > >> > >> > > > > > > > > > fails,
> > > > >> > >> > > > > > > > > > > some devs are a bit confused about how to
> > fix
> > > > >> them
> > > > >> > >> since
> > > > >> > >> > > they
> > > > >> > >> > > > > are
> > > > >> > >> > > > > > > not
> > > > >> > >> > > > > > > > > > > familiar with the testing tools and the
> > > > language.
> > > > >> > >> > > > > > > > > > >
> > > > >> > >> > > > > > > > > > > On Wed, Aug 14, 2019 at 11:57 AM Carin
> > Meier
> > > <
> > > > >> > >> > > > > > carinmeier@gmail.com
> > > > >> > >> > > > > > > >
> > > > >> > >> > > > > > > > > > wrote:
> > > > >> > >> > > > > > > > > > >
> > > > >> > >> > > > > > > > > > > > Great idea Marco! Anything that you
> think
> > > > >> would be
> > > > >> > >> > > valuable
> > > > >> > >> > > > > to
> > > > >> > >> > > > > > > > share
> > > > >> > >> > > > > > > > > > > would
> > > > >> > >> > > > > > > > > > > > be good. The duration of each node in
> the
> > > > test
> > > > >> stage
> > > > >> > >> > > sounds
> > > > >> > >> > > > > > like
> > > > >> > >> > > > > > > a
> > > > >> > >> > > > > > > > > good
> > > > >> > >> > > > > > > > > > > > start.
> > > > >> > >> > > > > > > > > > > >
> > > > >> > >> > > > > > > > > > > > - Carin
> > > > >> > >> > > > > > > > > > > >
> > > > >> > >> > > > > > > > > > > > On Wed, Aug 14, 2019 at 2:48 PM Marco
> de
> > > > Abreu
> > > > >> <
> > > > >> > >> > > > > > > > > > marco.g.abreu@gmail.com>
> > > > >> > >> > > > > > > > > > > > wrote:
> > > > >> > >> > > > > > > > > > > >
> > > > >> > >> > > > > > > > > > > > > Hi,
> > > > >> > >> > > > > > > > > > > > >
> > > > >> > >> > > > > > > > > > > > > we record a bunch of metrics about
> run
> > > > >> statistics
> > > > >> > >> (down
> > > > >> > >> > > > to
> > > > >> > >> > > > > > the
> > > > >> > >> > > > > > > > > > duration
> > > > >> > >> > > > > > > > > > > > of
> > > > >> > >> > > > > > > > > > > > > every individual step). If you tell
> me
> > > > which
> > > > >> ones
> > > > >> > >> > > you're
> > > > >> > >> > > > > > > > > particularly
> > > > >> > >> > > > > > > > > > > > > interested in (probably total
> duration
> > of
> > > > >> each
> > > > >> > >> node in
> > > > >> > >> > > > the
> > > > >> > >> > > > > > test
> > > > >> > >> > > > > > > > > > stage),
> > > > >> > >> > > > > > > > > > > > I'm
> > > > >> > >> > > > > > > > > > > > > happy to provide them.
> > > > >> > >> > > > > > > > > > > > >
> > > > >> > >> > > > > > > > > > > > > Dimensions are (in hierarchical
> order):
> > > > >> > >> > > > > > > > > > > > > - job
> > > > >> > >> > > > > > > > > > > > > - branch
> > > > >> > >> > > > > > > > > > > > > - stage
> > > > >> > >> > > > > > > > > > > > > - node
> > > > >> > >> > > > > > > > > > > > > - step
> > > > >> > >> > > > > > > > > > > > >
> > > > >> > >> > > > > > > > > > > > > Unfortunately I don't have the
> > > possibility
> > > > to
> > > > >> > >> export
> > > > >> > >> > > them
> > > > >> > >> > > > > > since
> > > > >> > >> > > > > > > > we
> > > > >> > >> > > > > > > > > > > store
> > > > >> > >> > > > > > > > > > > > > them in CloudWatch Metrics which
> afaik
> > > > >> doesn't
> > > > >> > >> offer
> > > > >> > >> > > raw
> > > > >> > >> > > > > > > exports.
> > > > >> > >> > > > > > > > > > > > >
> > > > >> > >> > > > > > > > > > > > > Best regards,
> > > > >> > >> > > > > > > > > > > > > Marco
> > > > >> > >> > > > > > > > > > > > >
> > > > >> > >> > > > > > > > > > > > > Carin Meier <ca...@gmail.com>
> > > schrieb
> > > > >> am
> > > > >> > >> Mi., 14.
> > > > >> > >> > > > > Aug.
> > > > >> > >> > > > > > > > 2019,
> > > > >> > >> > > > > > > > > > > 19:43:
> > > > >> > >> > > > > > > > > > > > >
> > > > >> > >> > > > > > > > > > > > > > I would prefer to keep the language
> > > > >> binding in
> > > > >> > >> the PR
> > > > >> > >> > > > > > > process.
> > > > >> > >> > > > > > > > > > > Perhaps
> > > > >> > >> > > > > > > > > > > > we
> > > > >> > >> > > > > > > > > > > > > > could do some analytics to see how
> > much
> > > > >> each of
> > > > >> > >> the
> > > > >> > >> > > > > > language
> > > > >> > >> > > > > > > > > > bindings
> > > > >> > >> > > > > > > > > > > > is
> > > > >> > >> > > > > > > > > > > > > > contributing to overall run time.
> > > > >> > >> > > > > > > > > > > > > > If we have some metrics on that,
> > maybe
> > > we
> > > > >> can
> > > > >> > >> come up
> > > > >> > >> > > > > with
> > > > >> > >> > > > > > a
> > > > >> > >> > > > > > > > > > > guideline
> > > > >> > >> > > > > > > > > > > > of
> > > > >> > >> > > > > > > > > > > > > > how much time each should take.
> > Another
> > > > >> > >> possibility
> > > > >> > >> > > is
> > > > >> > >> > > > > > > leverage
> > > > >> > >> > > > > > > > > the
> > > > >> > >> > > > > > > > > > > > > > parallel builds more.
> > > > >> > >> > > > > > > > > > > > > >
> > > > >> > >> > > > > > > > > > > > > > On Wed, Aug 14, 2019 at 1:30 PM
> Pedro
> > > > >> Larroy <
> > > > >> > >> > > > > > > > > > > > > pedro.larroy.lists@gmail.com
> > > > >> > >> > > > > > > > > > > > > > >
> > > > >> > >> > > > > > > > > > > > > > wrote:
> > > > >> > >> > > > > > > > > > > > > >
> > > > >> > >> > > > > > > > > > > > > > > Hi Carin.
> > > > >> > >> > > > > > > > > > > > > > >
> > > > >> > >> > > > > > > > > > > > > > > That's a good point, all things
> > > > >> considered
> > > > >> > >> would
> > > > >> > >> > > your
> > > > >> > >> > > > > > > > > preference
> > > > >> > >> > > > > > > > > > be
> > > > >> > >> > > > > > > > > > > > to
> > > > >> > >> > > > > > > > > > > > > > keep
> > > > >> > >> > > > > > > > > > > > > > > the Clojure tests as part of the
> PR
> > > > >> process
> > > > >> > >> or in
> > > > >> > >> > > > > > Nightly?
> > > > >> > >> > > > > > > > > > > > > > > Some options are having
> > notifications
> > > > >> here or
> > > > >> > >> in
> > > > >> > >> > > > slack.
> > > > >> > >> > > > > > But
> > > > >> > >> > > > > > > > if
> > > > >> > >> > > > > > > > > we
> > > > >> > >> > > > > > > > > > > > think
> > > > >> > >> > > > > > > > > > > > > > > breakages would go unnoticed
> maybe
> > is
> > > > >> not a
> > > > >> > >> good
> > > > >> > >> > > idea
> > > > >> > >> > > > > to
> > > > >> > >> > > > > > > > fully
> > > > >> > >> > > > > > > > > > > remove
> > > > >> > >> > > > > > > > > > > > > > > bindings from the PR process and
> > just
> > > > >> > >> streamline
> > > > >> > >> > > the
> > > > >> > >> > > > > > > process.
> > > > >> > >> > > > > > > > > > > > > > >
> > > > >> > >> > > > > > > > > > > > > > > Pedro.
> > > > >> > >> > > > > > > > > > > > > > >
> > > > >> > >> > > > > > > > > > > > > > > On Wed, Aug 14, 2019 at 5:09 AM
> > Carin
> > > > >> Meier <
> > > > >> > >> > > > > > > > > > carinmeier@gmail.com>
> > > > >> > >> > > > > > > > > > > > > > wrote:
> > > > >> > >> > > > > > > > > > > > > > >
> > > > >> > >> > > > > > > > > > > > > > > > Before any binding tests are
> > moved
> > > to
> > > > >> > >> nightly, I
> > > > >> > >> > > > > think
> > > > >> > >> > > > > > we
> > > > >> > >> > > > > > > > > need
> > > > >> > >> > > > > > > > > > to
> > > > >> > >> > > > > > > > > > > > > > figure
> > > > >> > >> > > > > > > > > > > > > > > > out how the community can get
> > > proper
> > > > >> > >> > > notifications
> > > > >> > >> > > > of
> > > > >> > >> > > > > > > > failure
> > > > >> > >> > > > > > > > > > and
> > > > >> > >> > > > > > > > > > > > > > success
> > > > >> > >> > > > > > > > > > > > > > > > on those nightly runs.
> > Otherwise, I
> > > > >> think
> > > > >> > >> that
> > > > >> > >> > > > > > breakages
> > > > >> > >> > > > > > > > > would
> > > > >> > >> > > > > > > > > > go
> > > > >> > >> > > > > > > > > > > > > > > > unnoticed.
> > > > >> > >> > > > > > > > > > > > > > > >
> > > > >> > >> > > > > > > > > > > > > > > > -Carin
> > > > >> > >> > > > > > > > > > > > > > > >
> > > > >> > >> > > > > > > > > > > > > > > > On Tue, Aug 13, 2019 at 7:47 PM
> > > Pedro
> > > > >> > >> Larroy <
> > > > >> > >> > > > > > > > > > > > > > > pedro.larroy.lists@gmail.com
> > > > >> > >> > > > > > > > > > > > > > > > >
> > > > >> > >> > > > > > > > > > > > > > > > wrote:
> > > > >> > >> > > > > > > > > > > > > > > >
> > > > >> > >> > > > > > > > > > > > > > > > > Hi
> > > > >> > >> > > > > > > > > > > > > > > > >
> > > > >> > >> > > > > > > > > > > > > > > > > Seems we are hitting some
> > > problems
> > > > >> in CI.
> > > > >> > >> I
> > > > >> > >> > > > propose
> > > > >> > >> > > > > > the
> > > > >> > >> > > > > > > > > > > following
> > > > >> > >> > > > > > > > > > > > > > > action
> > > > >> > >> > > > > > > > > > > > > > > > > items to remedy the situation
> > and
> > > > >> > >> accelerate
> > > > >> > >> > > turn
> > > > >> > >> > > > > > > around
> > > > >> > >> > > > > > > > > > times
> > > > >> > >> > > > > > > > > > > in
> > > > >> > >> > > > > > > > > > > > > CI,
> > > > >> > >> > > > > > > > > > > > > > > > > reduce cost, complexity and
> > > > >> probability of
> > > > >> > >> > > > failure
> > > > >> > >> > > > > > > > blocking
> > > > >> > >> > > > > > > > > > PRs
> > > > >> > >> > > > > > > > > > > > and
> > > > >> > >> > > > > > > > > > > > > > > > > frustrating developers:
> > > > >> > >> > > > > > > > > > > > > > > > >
> > > > >> > >> > > > > > > > > > > > > > > > > * Upgrade Windows visual
> studio
> > > > from
> > > > >> VS
> > > > >> > >> 2015 to
> > > > >> > >> > > > VS
> > > > >> > >> > > > > > > 2017.
> > > > >> > >> > > > > > > > > The
> > > > >> > >> > > > > > > > > > > > > > > > > build_windows.py
> infrastructure
> > > > >> should
> > > > >> > >> easily
> > > > >> > >> > > > work
> > > > >> > >> > > > > > with
> > > > >> > >> > > > > > > > the
> > > > >> > >> > > > > > > > > > new
> > > > >> > >> > > > > > > > > > > > > > > version.
> > > > >> > >> > > > > > > > > > > > > > > > > Currently some PRs are
> blocked
> > by
> > > > >> this:
> > > > >> > >> > > > > > > > > > > > > > > > >
> > > > >> > >> > > > > >
> > https://github.com/apache/incubator-mxnet/issues/13958
> > > > >> > >> > > > > > > > > > > > > > > > > * Move Gluon Model zoo tests
> to
> > > > >> nightly.
> > > > >> > >> > > Tracked
> > > > >> > >> > > > at
> > > > >> > >> > > > > > > > > > > > > > > > >
> > > > >> > >> > > > > >
> > https://github.com/apache/incubator-mxnet/issues/15295
> > > > >> > >> > > > > > > > > > > > > > > > > * Move non-python bindings
> > tests
> > > to
> > > > >> > >> nightly.
> > > > >> > >> > > If a
> > > > >> > >> > > > > > > commit
> > > > >> > >> > > > > > > > is
> > > > >> > >> > > > > > > > > > > > > touching
> > > > >> > >> > > > > > > > > > > > > > > > other
> > > > >> > >> > > > > > > > > > > > > > > > > bindings, the reviewer should
> > ask
> > > > >> for a
> > > > >> > >> full
> > > > >> > >> > > run
> > > > >> > >> > > > > > which
> > > > >> > >> > > > > > > > can
> > > > >> > >> > > > > > > > > be
> > > > >> > >> > > > > > > > > > > > done
> > > > >> > >> > > > > > > > > > > > > > > > locally,
> > > > >> > >> > > > > > > > > > > > > > > > > use the label bot to trigger
> a
> > > full
> > > > >> CI
> > > > >> > >> build,
> > > > >> > >> > > or
> > > > >> > >> > > > > > defer
> > > > >> > >> > > > > > > to
> > > > >> > >> > > > > > > > > > > > nightly.
> > > > >> > >> > > > > > > > > > > > > > > > > * Provide a couple of basic
> > > sanity
> > > > >> > >> performance
> > > > >> > >> > > > > tests
> > > > >> > >> > > > > > on
> > > > >> > >> > > > > > > > > small
> > > > >> > >> > > > > > > > > > > > > models
> > > > >> > >> > > > > > > > > > > > > > > that
> > > > >> > >> > > > > > > > > > > > > > > > > are run on CI and can be
> echoed
> > > by
> > > > >> the
> > > > >> > >> label
> > > > >> > >> > > bot
> > > > >> > >> > > > > as a
> > > > >> > >> > > > > > > > > comment
> > > > >> > >> > > > > > > > > > > for
> > > > >> > >> > > > > > > > > > > > > > PRs.
> > > > >> > >> > > > > > > > > > > > > > > > > * Address unit tests that
> take
> > > more
> > > > >> than
> > > > >> > >> > > 10-20s,
> > > > >> > >> > > > > > > > streamline
> > > > >> > >> > > > > > > > > > > them
> > > > >> > >> > > > > > > > > > > > or
> > > > >> > >> > > > > > > > > > > > > > > move
> > > > >> > >> > > > > > > > > > > > > > > > > them to nightly if it can't
> be
> > > > done.
> > > > >> > >> > > > > > > > > > > > > > > > > * Open sourcing the remaining
> > CI
> > > > >> > >> infrastructure
> > > > >> > >> > > > > > scripts
> > > > >> > >> > > > > > > > so
> > > > >> > >> > > > > > > > > > the
> > > > >> > >> > > > > > > > > > > > > > > community
> > > > >> > >> > > > > > > > > > > > > > > > > can contribute.
> > > > >> > >> > > > > > > > > > > > > > > > >
> > > > >> > >> > > > > > > > > > > > > > > > > I think our goal should be
> > > > turnaround
> > > > >> > >> under
> > > > >> > >> > > > 30min.
> > > > >> > >> > > > > > > > > > > > > > > > >
> > > > >> > >> > > > > > > > > > > > > > > > > I would also like to touch
> base
> > > > with
> > > > >> the
> > > > >> > >> > > > community
> > > > >> > >> > > > > > that
> > > > >> > >> > > > > > > > > some
> > > > >> > >> > > > > > > > > > > PRs
> > > > >> > >> > > > > > > > > > > > > are
> > > > >> > >> > > > > > > > > > > > > > > not
> > > > >> > >> > > > > > > > > > > > > > > > > being followed up by
> committers
> > > > >> asking for
> > > > >> > >> > > > changes.
> > > > >> > >> > > > > > For
> > > > >> > >> > > > > > > > > > example
> > > > >> > >> > > > > > > > > > > > > this
> > > > >> > >> > > > > > > > > > > > > > PR
> > > > >> > >> > > > > > > > > > > > > > > > is
> > > > >> > >> > > > > > > > > > > > > > > > > importtant and is hanging
> for a
> > > > long
> > > > >> time.
> > > > >> > >> > > > > > > > > > > > > > > > >
> > > > >> > >> > > > > > > > > > > > > > > > >
> > > > >> > >> > > > > https://github.com/apache/incubator-mxnet/pull/15051
> > > > >> > >> > > > > > > > > > > > > > > > >
> > > > >> > >> > > > > > > > > > > > > > > > > This is another, less
> important
> > > but
> > > > >> more
> > > > >> > >> > > trivial
> > > > >> > >> > > > to
> > > > >> > >> > > > > > > > review:
> > > > >> > >> > > > > > > > > > > > > > > > >
> > > > >> > >> > > > > > > > > > > > > > > > >
> > > > >> > >> > > > > https://github.com/apache/incubator-mxnet/pull/14940
> > > > >> > >> > > > > > > > > > > > > > > > >
> > > > >> > >> > > > > > > > > > > > > > > > > I think comitters requesting
> > > > changes
> > > > >> and
> > > > >> > >> not
> > > > >> > >> > > > > > folllowing
> > > > >> > >> > > > > > > > up
> > > > >> > >> > > > > > > > > in
> > > > >> > >> > > > > > > > > > > > > > > reasonable
> > > > >> > >> > > > > > > > > > > > > > > > > time is not healthy for the
> > > > project.
> > > > >> I
> > > > >> > >> suggest
> > > > >> > >> > > > > > > > configuring
> > > > >> > >> > > > > > > > > > > github
> > > > >> > >> > > > > > > > > > > > > > > > > Notifications for a good SNR
> > and
> > > > >> > >> following up.
> > > > >> > >> > > > > > > > > > > > > > > > >
> > > > >> > >> > > > > > > > > > > > > > > > > Regards.
> > > > >> > >> > > > > > > > > > > > > > > > >
> > > > >> > >> > > > > > > > > > > > > > > > > Pedro.
> > > > >> > >> > > > > > > > > > > > > > > > >
> > > > >> > >> > > > > > > > > > > > > > > >
> > > > >> > >> > > > > > > > > > > > > > >
> > > > >> > >> > > > > > > > > > > > > >
> > > > >> > >> > > > > > > > > > > > >
> > > > >> > >> > > > > > > > > > > >
> > > > >> > >> > > > > > > > > > >
> > > > >> > >> > > > > > > > > >
> > > > >> > >> > > > > > > > >
> > > > >> > >> > > > > > > >
> > > > >> > >> > > > > > >
> > > > >> > >> > > > > >
> > > > >> > >> > > > >
> > > > >> > >> > > >
> > > > >> > >> > >
> > > > >> > >>
> > > > >> > >
> > > > >>
> > > > >
> > > >
> > >
> >
>

Re: CI and PRs

Posted by Chris Olivier <cj...@gmail.com>.

Pedro,

I don’t see where Marco says that he “designed and implemented all aspects
of CI by himself”.  I do think, however, that it’s fair to say that Marco
was in charge of the design and most likely made the majority of design
decisions as the CI was being built, especially around those tenents that
he mentioned.  I know this because before I submitted Marco as a committer,
I asked some his teammates whether Marco was really responsible for CI and
the answer by all I asked were that CI was Marco's baby and he did most of
it by some large margin (I am paraphrasing).  Taking other design inputs
and examples (i.e. Apache CI) is all part of any responsible design process.

In addition, I am not understanding the obfuscation of “people who
contributed to CI”, “person/people who designed CI”, or even
"person who oversees CI" as it is weaponized in your email.  Again, nowhere
did Marco say that he did everything back then or since then.  I don't
think it's fair to try to modify what Marco wrote and then try to turn it
against him.  Reminds me of the techniques of network news these days,
quite frankly (whichever side you're "on" doesn't matter, because both
sides do it).

-Chris





On Fri, Aug 23, 2019 at 3:56 PM Pedro Larroy <pe...@gmail.com>
wrote:

> Thanks for your response Marco, I think you have totally missed my original
> point which was basically that someone volunteering effort on the CI is as
> important as someone contributing a feature. From my perspective this
> hasn't been the case, and we had to rely a lot on you and Sheng to submit
> fixes which required access, also to relay communication with Apache infra.
> Also in many cases we had to rely on you to channel fixes, PRs, disable
> tests etc. If the community is fine having this kind of bottleneck, fine
> with me. From my point of view and the feedback from myself and other
> people which contributed to CI this was not always a good experience.
> Having a welcoming and inclusive community is very important. I don't want
> to start a discussion on this, but invite the community to do a bit of soul
> searching on this topic, now that the infrastructure is open source.
>
> Also I find surprising that you claim that you designed the CI yourself,
> when this was a joint work of many individuals, including the old Apache CI
> and additional contributions and code reviewers, people who were oncall for
> this service or the autoscaling approach which if I remember correctly came
> from a humble servant. Kellen did a lot of pair programming and code
> reviews. Obviously you have a done a lot of work on CI which has had a huge
> positive impact on the project and your recognition is well deserved. The
> technical details you mention on your email are perfectly true and valid.
>
> Below is a rough list of individuals who contributed to CI, I would like to
> thank all of them since without this work, we wouldn't be able to deliver
> with the quality that we have done in the past.
>
>
> pllarroy@mac:0: ~/d/m/ci [fc_higher_order_grad_2]> git log
> --pretty=format:%aN . | sort | uniq -c | sort -n | tail -n 10
>    6 Zach Kimberg
>    6 stu1130
>    7 Jake Lee
>    8 Aaron Markham
>   11 Lanking
>   12 Anton Chernov
>   13 perdasilva
>   26 Kellen Sunderland
>   34 Marco de Abreu
>   46 Pedro Larroy
>
> pllarroy@mac:0: ~/d/mxnet_ci_general [master]> git log --pretty=format:%aN
> | sort | uniq -c | sort -n
>    1 Gavin M. Bell
>    1 de Abreu
>    6 Bair
>    7 Kellen Sunderland
>    8 Jose Luis Contreras
>   14 perdasilva
>   20 Per Goncalves da Silva
>   29 Anton Chernov
>   39 Chance Bair
>   96 Pedro Larroy
>  209 Marco de Abreu
>
>
>
> Pedro.
>
> On Fri, Aug 23, 2019 at 3:18 PM Marco de Abreu <ma...@gmail.com>
> wrote:
>
> > I've heard this request multiple times and so far, I'm having issues
> > understanding the direct correlation between having committer permissions
> > and being able to manage CI.
> >
> > When I designed the CI, one of the tenets was maintainability and
> > accessbility for the community: I wanted to avoid that somebody needs
> > certain privileges in order to execute regular actions. The result was
> the
> > strong usage of Jenkinsfiles, Dockerfiles and the runtime functions. The
> > combination of these techniques allowed somebody to create a job from the
> > process flow level (Jenkinsfile), over the environment level (Dockerfile)
> > to the individual action level (runtime functions). This design basically
> > gives the community full access over the entire flow.
> >
> > The jobs that are configured to source only Jenkinsfile. Jenkins
> supports a
> > lot of different ways how to define pipelines, but I have made sure to
> > encourage everybody to use only Jenkinsfiles. This makes sure that no
> > configuration is done in the web-interface. This firs of all alleviates
> the
> > permission issue since there's literally no config in the web interface
> and
> > second it allows auditing since all changes have to be done in the MXNet
> > GitHub repository.
> >
> > Committers have elevated permissions in Jenkins. These contain the
> > permission to run, stop and configure jobs. All other permissions are
> > restricted to system administrators for the sake of ensuring stability of
> > the system. On the dev-CI on the other hand, we're happy to add people so
> > they can experiment as much as they want. The transition to prod-CI is
> then
> > assisted by me to ensure smooth operations and adhering to the best
> > practices (like using our Jenkinsfiles and Docker structure, for
> example).
> >
> > The only case where somebody would need elevated permissions is if they
> > would like to change system settings. But at that point, we're talking
> > about instance settings and AWS account configuration. Since that now
> > reaches into the next permission level, which is restricted to the donor
> of
> > the CI system - Amazon Web Services - this is something that not even PMC
> > members will receive. The same policy is in place for the official Apache
> > CI: Committers/PMCs can configure their job, but don't have system level
> > access to either Jenkins or the underlying AWS account for obvious
> reasons.
> > We're trying to stay in line with the same policy, but in the past I've
> > granted Jenkins administrator access to people who required elevated
> access
> > to properly do their job - Aaron Markham with regards to the website
> being
> > one example.
> >
> > This means that the only case when a contributor needs committer
> assistance
> > is the moment when somebody would like to set up a new Jenkins job. It
> > would be a matter of setting up the job to point to the persons branch -
> > Jenkins will then automatically pull the Jenkinsfile and thus no further
> > configuration is necessary and updates are directly consumed. Such a
> > request IMO is on the same level as us having to cut a ticket to Apache
> > INFRA to create a new job.
> >
> > With regards to speed: So far, I was the only "CI-Person" with committer
> > privileges. But due to our 4-eye-rule for PRs, I wasn't able to merge my
> > own changes anyways - most of them were reviewed by Sheng, for example.
> In
> > an emergency, I'm sure that somebody can be reached to assist since we
> > currently have 39 PMC members and 20 committers spanning multiple
> > timezones.
> >
> > For these reasons, I don't agree with the sentiment that contributors are
> > unable to effectively work with the CI system unless they have committer
> > privileges.
> >
> > Best regards,
> > Marco
> >
> >
> > On Fri, Aug 23, 2019 at 10:33 AM Pedro Larroy <
> > pedro.larroy.lists@gmail.com>
> > wrote:
> >
> > > As Marco has open sourced the bulk of the CI infrastructure donated
> from
> > > Amazon to the community, I would like to raise the recommendation that
> > the
> > > community takes action to help volunteers working on the CI have a
> better
> > > experience. In the past, it's my impression that there hasn't been much
> > > action granting PMC or committer privileges to engineers volunteering
> to
> > > help CI other than Marco. This would encourage more contributions and
> > help
> > > expedite critical fixes and corrective actions. I think this has not
> > > properly enabled those individuals to be as effective as they could, as
> > > well as the lack of recognition for such a critical activity. I'm not
> > sure
> > > about the cause but I believe this is something that should be
> rectified
> > > for future contributions and help on the CI front if improvements are
> > > desired.
> > >
> > > In spanish we have a saying: "es de bien nacido ser agradecido".
> > >
> > > Pedro.
> > >
> > > On Fri, Aug 16, 2019 at 4:03 PM Pedro Larroy <
> > pedro.larroy.lists@gmail.com
> > > >
> > > wrote:
> > >
> > > > Hi Aaron. This is difficult to diagnose, because I don't know what to
> > do
> > > > when the hash of the layer in docker doesn't match and decides to
> > rebuild
> > > > it. the r script seems not to have changed. I have observed this in
> the
> > > > past and I think is due to bugs in docker.   Maybe Kellen is able to
> > give
> > > > some tips here.
> > > >
> > > > In this case you should use -R which is already in master. (you can
> > > always
> > > > copy the script on top if you are in an older revision).
> > > >
> > > > Another thing that worked for me in the past was to completely nuke
> the
> > > > docker cache, so it redonwloads from the CI repo. After that it
> worked
> > > fine
> > > > in some cases.
> > > >
> > > > These two workarounds are not ideal, but should unblock you.
> > > >
> > > > Pedro.
> > > >
> > > > On Fri, Aug 16, 2019 at 11:39 AM Aaron Markham <
> > > aaron.s.markham@gmail.com>
> > > > wrote:
> > > >
> > > >> Is -R already in there?
> > > >>
> > > >> Here's an example of it happening to me right now.... I am making
> > > >> minor changes to the runtime_functions logic for handling the R docs
> > > >> output. I pull the fix, then run the container, but I see the R deps
> > > >> layer re-running. I didn't touch that. Why it that running again?
> > > >>
> > > >> From https://github.com/aaronmarkham/incubator-mxnet
> > > >>    f71cc6d..deec6aa  new_website_pipeline_2_aaron_rdocs ->
> > > >> origin/new_website_pipeline_2_aaron_rdocs
> > > >> Updating f71cc6d..deec6aa
> > > >> Fast-forward
> > > >>  ci/docker/runtime_functions.sh | 6 +++---
> > > >>  1 file changed, 3 insertions(+), 3 deletions(-)
> > > >> (base) ubuntu@ip-172-31-47-182:~/aaron/ci$ ./build.py
> > > >> --docker-registry mxnetci --platform ubuntu_cpu_r
> > > >> --docker-build-retries 3 --shm-size 500m /work/runtime_functions.sh
> > > >> build_r_docs
> > > >> build.py: 2019-08-16 18:34:44,639Z INFO MXNet container based build
> > > tool.
> > > >> build.py: 2019-08-16 18:34:44,641Z INFO Docker cache download is
> > > >> enabled from registry mxnetci
> > > >> build.py: 2019-08-16 18:34:44,641Z INFO Loading Docker cache for
> > > >> mxnetci/build.ubuntu_cpu_r from mxnetci
> > > >> Using default tag: latest
> > > >> latest: Pulling from mxnetci/build.ubuntu_cpu_r
> > > >> Digest:
> > > >>
> > sha256:7dc515c288b3e66d96920eb8975f985a501bb57f70595fbe0cb1c4fcd8d4184b
> > > >> Status: Downloaded newer image for mxnetci/build.ubuntu_cpu_r:latest
> > > >> build.py: 2019-08-16 18:34:44,807Z INFO Successfully pulled docker
> > cache
> > > >> build.py: 2019-08-16 18:34:44,807Z INFO Building docker container
> > > >> tagged 'mxnetci/build.ubuntu_cpu_r' with docker
> > > >> build.py: 2019-08-16 18:34:44,807Z INFO Running command: 'docker
> build
> > > >> -f docker/Dockerfile.build.ubuntu_cpu_r --build-arg USER_ID=1000
> > > >> --build-arg GROUP_ID=1000 --cache-from mxnetci/build.ubuntu_cpu_r -t
> > > >> mxnetci/build.ubuntu_cpu_r docker'
> > > >> Sending build context to Docker daemon  289.8kB
> > > >> Step 1/15 : FROM ubuntu:16.04
> > > >>  ---> 5e13f8dd4c1a
> > > >> Step 2/15 : WORKDIR /work/deps
> > > >>  ---> Using cache
> > > >>  ---> afc2a135945d
> > > >> Step 3/15 : COPY install/ubuntu_core.sh /work/
> > > >>  ---> Using cache
> > > >>  ---> da2b2e7f35e1
> > > >> Step 4/15 : RUN /work/ubuntu_core.sh
> > > >>  ---> Using cache
> > > >>  ---> d1e88b26b1d2
> > > >> Step 5/15 : COPY install/deb_ubuntu_ccache.sh /work/
> > > >>  ---> Using cache
> > > >>  ---> 3aa97dea3b7b
> > > >> Step 6/15 : RUN /work/deb_ubuntu_ccache.sh
> > > >>  ---> Using cache
> > > >>  ---> bec503f1d149
> > > >> Step 7/15 : COPY install/ubuntu_r.sh /work/
> > > >>  ---> c5e77c38031d
> > > >> Step 8/15 : COPY install/r.gpg /work/
> > > >>  ---> d8cdbf015d2b
> > > >> Step 9/15 : RUN /work/ubuntu_r.sh
> > > >>  ---> Running in c6c90b9e1538
> > > >> ++ dirname /work/ubuntu_r.sh
> > > >> + cd /work
> > > >> + echo 'deb http://cran.rstudio.com/bin/linux/ubuntu trusty/'
> > > >> + apt-key add r.gpg
> > > >> OK
> > > >> + add-apt-repository 'deb [arch=amd64,i386]
> > > >> https://cran.rstudio.com/bin/linux/ubuntu xenial/'
> > > >> + apt-get update
> > > >> Ign:1 http://cran.rstudio.com/bin/linux/ubuntu trusty/ InRelease
> > > >>
> > > >> On Fri, Aug 16, 2019 at 11:32 AM Pedro Larroy
> > > >> <pe...@gmail.com> wrote:
> > > >> >
> > > >> > Also, I forgot, another workaround is that I added the -R flag to
> > the
> > > >> build
> > > >> > logic (build.py) so the container is not rebuilt for manual use.
> > > >> >
> > > >> > On Fri, Aug 16, 2019 at 11:18 AM Pedro Larroy <
> > > >> pedro.larroy.lists@gmail.com>
> > > >> > wrote:
> > > >> >
> > > >> > >
> > > >> > > Hi Aaron.
> > > >> > >
> > > >> > > As Marco explained, if you are in master the cache usually
> works,
> > > >> there's
> > > >> > > two issues that I have observed:
> > > >> > >
> > > >> > > 1 - Docker doesn't automatically pull the base image (ex.
> > > >> ubuntu:16.04) so
> > > >> > > if your cached base which is used in the FROM statement becomes
> > > >> outdated
> > > >> > > your caching won't work. (Using docker pull ubuntu:16.04) or the
> > > base
> > > >> > > images from the container helps with this.
> > > >> > >
> > > >> > > 2 - There's another situation where the above doesn't help which
> > > >> seems to
> > > >> > > be an unidentified issue with the docker cache:
> > > >> > > https://github.com/docker/docker.github.io/issues/8886
> > > >> > >
> > > >> > > We can get a short term workaround for #1 by explicitly pulling
> > > bases
> > > >> from
> > > >> > > the script, but I think docker should do it when using
> > --cache-from
> > > so
> > > >> > > maybe contributing a patch to docker would the best approach.
> > > >> > >
> > > >> > > Pedro
> > > >> > >
> > > >> > > On Thu, Aug 15, 2019 at 7:06 PM Aaron Markham <
> > > >> aaron.s.markham@gmail.com>
> > > >> > > wrote:
> > > >> > >
> > > >> > >> When you create a new Dockerfile and use that on CI, it doesn't
> > > seem
> > > >> > >> to cache some of the steps... like this:
> > > >> > >>
> > > >> > >> Step 13/15 : RUN /work/ubuntu_docs.sh
> > > >> > >>  ---> Running in a1e522f3283b
> > > >> > >>  [91m+ echo 'Installing dependencies...'
> > > >> > >> + apt-get update
> > > >> > >>  [0mInstalling dependencies.
> > > >> > >>
> > > >> > >> Or this....
> > > >> > >>
> > > >> > >> Step 4/13 : RUN /work/ubuntu_core.sh
> > > >> > >>  ---> Running in e7882d7aa750
> > > >> > >>  [91m+ apt-get update
> > > >> > >>
> > > >> > >> I get if I was changing those scripts, but then I'd think it
> > should
> > > >> > >> cache after running it once... but, no.
> > > >> > >>
> > > >> > >>
> > > >> > >> On Thu, Aug 15, 2019 at 3:51 PM Marco de Abreu <
> > > >> marco.g.abreu@gmail.com>
> > > >> > >> wrote:
> > > >> > >> >
> > > >> > >> > Do I understand it correctly that you are saying that the
> > Docker
> > > >> cache
> > > >> > >> > doesn't work properly and regularly reinstalls dependencies?
> Or
> > > do
> > > >> you
> > > >> > >> mean
> > > >> > >> > that you only have cache misses when you modify the
> > dependencies
> > > -
> > > >> which
> > > >> > >> > would be expected?
> > > >> > >> >
> > > >> > >> > -Marco
> > > >> > >> >
> > > >> > >> > On Fri, Aug 16, 2019 at 12:48 AM Aaron Markham <
> > > >> > >> aaron.s.markham@gmail.com>
> > > >> > >> > wrote:
> > > >> > >> >
> > > >> > >> > > Many of the CI pipelines follow this pattern:
> > > >> > >> > > Load ubuntu 16.04, install deps, build mxnet, then run some
> > > >> tests. Why
> > > >> > >> > > repeat steps 1-3 over and over?
> > > >> > >> > >
> > > >> > >> > > Now, some tests use a stashed binary and docker cache. And
> I
> > > see
> > > >> this
> > > >> > >> work
> > > >> > >> > > locally, but for the most part, on CI, you're gonna sit
> > > through a
> > > >> > >> > > dependency install.
> > > >> > >> > >
> > > >> > >> > > I noticed that almost all jobs use an ubuntu setup that is
> > > fully
> > > >> > >> loaded.
> > > >> > >> > > Without cache, it can take 10 or more minutes to build.
> So I
> > > >> made a
> > > >> > >> lite
> > > >> > >> > > version. Takes only a few minutes instead.
> > > >> > >> > >
> > > >> > >> > > In some cases archiving worked great to share across
> > pipelines,
> > > >> but as
> > > >> > >> > > Marco mentioned we need a storage solution to make that
> > happen.
> > > >> We
> > > >> > >> can't
> > > >> > >> > > archive every intermediate artifact for each PR.
> > > >> > >> > >
> > > >> > >> > > On Thu, Aug 15, 2019, 13:47 Pedro Larroy <
> > > >> > >> pedro.larroy.lists@gmail.com>
> > > >> > >> > > wrote:
> > > >> > >> > >
> > > >> > >> > > > Hi Aaron. Why speeds things up? What's the difference?
> > > >> > >> > > >
> > > >> > >> > > > Pedro.
> > > >> > >> > > >
> > > >> > >> > > > On Wed, Aug 14, 2019 at 8:39 PM Aaron Markham <
> > > >> > >> aaron.s.markham@gmail.com
> > > >> > >> > > >
> > > >> > >> > > > wrote:
> > > >> > >> > > >
> > > >> > >> > > > > The PRs Thomas and I are working on for the new docs
> and
> > > >> website
> > > >> > >> share
> > > >> > >> > > > the
> > > >> > >> > > > > mxnet binary in the new CI pipelines we made. Speeds
> > things
> > > >> up a
> > > >> > >> lot.
> > > >> > >> > > > >
> > > >> > >> > > > > On Wed, Aug 14, 2019, 18:16 Chris Olivier <
> > > >> cjolivier01@gmail.com>
> > > >> > >> > > wrote:
> > > >> > >> > > > >
> > > >> > >> > > > > > I see it done daily now, and while I can’t share all
> > the
> > > >> > >> details,
> > > >> > >> > > it’s
> > > >> > >> > > > > not
> > > >> > >> > > > > > an incredibly complex thing, and involves not much
> more
> > > >> than
> > > >> > >> nfs/efs
> > > >> > >> > > > > > sharing and remote ssh commands.  All it takes is a
> > > little
> > > >> > >> ingenuity
> > > >> > >> > > > and
> > > >> > >> > > > > > some imagination.
> > > >> > >> > > > > >
> > > >> > >> > > > > > On Wed, Aug 14, 2019 at 4:31 PM Pedro Larroy <
> > > >> > >> > > > > pedro.larroy.lists@gmail.com
> > > >> > >> > > > > > >
> > > >> > >> > > > > > wrote:
> > > >> > >> > > > > >
> > > >> > >> > > > > > > Sounds good in theory. I think there are complex
> > > details
> > > >> with
> > > >> > >> > > regards
> > > >> > >> > > > > of
> > > >> > >> > > > > > > resource sharing during parallel execution. Still I
> > > think
> > > >> > >> both ways
> > > >> > >> > > > can
> > > >> > >> > > > > > be
> > > >> > >> > > > > > > explored. I think some tests run for unreasonably
> > long
> > > >> times
> > > >> > >> for
> > > >> > >> > > what
> > > >> > >> > > > > > they
> > > >> > >> > > > > > > are doing. We already scale parts of the pipeline
> > > >> horizontally
> > > >> > >> > > across
> > > >> > >> > > > > > > workers.
> > > >> > >> > > > > > >
> > > >> > >> > > > > > >
> > > >> > >> > > > > > > On Wed, Aug 14, 2019 at 5:12 PM Chris Olivier <
> > > >> > >> > > > cjolivier01@apache.org>
> > > >> > >> > > > > > > wrote:
> > > >> > >> > > > > > >
> > > >> > >> > > > > > > > +1
> > > >> > >> > > > > > > >
> > > >> > >> > > > > > > > Rather than remove tests (which doesn’t scale as
> a
> > > >> > >> solution), why
> > > >> > >> > > > not
> > > >> > >> > > > > > > scale
> > > >> > >> > > > > > > > them horizontally so that they finish more
> quickly?
> > > >> Across
> > > >> > >> > > > processes
> > > >> > >> > > > > or
> > > >> > >> > > > > > > > even on a pool of machines that aren’t
> necessarily
> > > the
> > > >> build
> > > >> > >> > > > machine?
> > > >> > >> > > > > > > >
> > > >> > >> > > > > > > > On Wed, Aug 14, 2019 at 12:03 PM Marco de Abreu <
> > > >> > >> > > > > > marco.g.abreu@gmail.com
> > > >> > >> > > > > > > >
> > > >> > >> > > > > > > > wrote:
> > > >> > >> > > > > > > >
> > > >> > >> > > > > > > > > With regards to time I rather prefer us
> spending
> > a
> > > >> bit
> > > >> > >> more
> > > >> > >> > > time
> > > >> > >> > > > on
> > > >> > >> > > > > > > > > maintenance than somebody running into an error
> > > that
> > > >> > >> could've
> > > >> > >> > > > been
> > > >> > >> > > > > > > caught
> > > >> > >> > > > > > > > > with a test.
> > > >> > >> > > > > > > > >
> > > >> > >> > > > > > > > > I mean, our Publishing pipeline for Scala GPU
> has
> > > >> been
> > > >> > >> broken
> > > >> > >> > > for
> > > >> > >> > > > > > quite
> > > >> > >> > > > > > > > > some time now, but nobody noticed that.
> Basically
> > > my
> > > >> > >> stance on
> > > >> > >> > > > that
> > > >> > >> > > > > > > > matter
> > > >> > >> > > > > > > > > is that as soon as something is not blocking,
> you
> > > can
> > > >> > >> also just
> > > >> > >> > > > > > > > deactivate
> > > >> > >> > > > > > > > > it since you don't have a forcing function in
> an
> > > open
> > > >> > >> source
> > > >> > >> > > > > project.
> > > >> > >> > > > > > > > > People will rarely come back and fix the errors
> > of
> > > >> some
> > > >> > >> nightly
> > > >> > >> > > > > test
> > > >> > >> > > > > > > that
> > > >> > >> > > > > > > > > they introduced.
> > > >> > >> > > > > > > > >
> > > >> > >> > > > > > > > > -Marco
> > > >> > >> > > > > > > > >
> > > >> > >> > > > > > > > > Carin Meier <ca...@gmail.com> schrieb am
> > Mi.,
> > > >> 14.
> > > >> > >> Aug.
> > > >> > >> > > > 2019,
> > > >> > >> > > > > > > 21:59:
> > > >> > >> > > > > > > > >
> > > >> > >> > > > > > > > > > If a language binding test is failing for a
> not
> > > >> > >> important
> > > >> > >> > > > reason,
> > > >> > >> > > > > > > then
> > > >> > >> > > > > > > > it
> > > >> > >> > > > > > > > > > is too brittle and needs to be fixed (we have
> > > fixed
> > > >> > >> some of
> > > >> > >> > > > these
> > > >> > >> > > > > > > with
> > > >> > >> > > > > > > > > the
> > > >> > >> > > > > > > > > > Clojure package [1]).
> > > >> > >> > > > > > > > > > But in general, if we thinking of the MXNet
> > > >> project as
> > > >> > >> one
> > > >> > >> > > > > project
> > > >> > >> > > > > > > that
> > > >> > >> > > > > > > > > is
> > > >> > >> > > > > > > > > > across all the language bindings, then we
> want
> > to
> > > >> know
> > > >> > >> if
> > > >> > >> > > some
> > > >> > >> > > > > > > > > fundamental
> > > >> > >> > > > > > > > > > code change is going to break a downstream
> > > package.
> > > >> > >> > > > > > > > > > I can't speak for all the high level package
> > > >> binding
> > > >> > >> > > > maintainers,
> > > >> > >> > > > > > but
> > > >> > >> > > > > > > > I'm
> > > >> > >> > > > > > > > > > always happy to pitch in to provide code
> fixes
> > to
> > > >> help
> > > >> > >> the
> > > >> > >> > > base
> > > >> > >> > > > > PR
> > > >> > >> > > > > > > get
> > > >> > >> > > > > > > > > > green.
> > > >> > >> > > > > > > > > >
> > > >> > >> > > > > > > > > > The time costs to maintain such a large CI
> > > project
> > > >> > >> obviously
> > > >> > >> > > > > needs
> > > >> > >> > > > > > to
> > > >> > >> > > > > > > > be
> > > >> > >> > > > > > > > > > considered as well.
> > > >> > >> > > > > > > > > >
> > > >> > >> > > > > > > > > > [1]
> > > >> > >> https://github.com/apache/incubator-mxnet/pull/15579
> > > >> > >> > > > > > > > > >
> > > >> > >> > > > > > > > > > On Wed, Aug 14, 2019 at 3:48 PM Pedro Larroy
> <
> > > >> > >> > > > > > > > > pedro.larroy.lists@gmail.com
> > > >> > >> > > > > > > > > > >
> > > >> > >> > > > > > > > > > wrote:
> > > >> > >> > > > > > > > > >
> > > >> > >> > > > > > > > > > > From what I have seen Clojure is 15
> minutes,
> > > >> which I
> > > >> > >> think
> > > >> > >> > > is
> > > >> > >> > > > > > > > > reasonable.
> > > >> > >> > > > > > > > > > > The only question is that when a binding
> such
> > > as
> > > >> R,
> > > >> > >> Perl or
> > > >> > >> > > > > > Clojure
> > > >> > >> > > > > > > > > > fails,
> > > >> > >> > > > > > > > > > > some devs are a bit confused about how to
> fix
> > > >> them
> > > >> > >> since
> > > >> > >> > > they
> > > >> > >> > > > > are
> > > >> > >> > > > > > > not
> > > >> > >> > > > > > > > > > > familiar with the testing tools and the
> > > language.
> > > >> > >> > > > > > > > > > >
> > > >> > >> > > > > > > > > > > On Wed, Aug 14, 2019 at 11:57 AM Carin
> Meier
> > <
> > > >> > >> > > > > > carinmeier@gmail.com
> > > >> > >> > > > > > > >
> > > >> > >> > > > > > > > > > wrote:
> > > >> > >> > > > > > > > > > >
> > > >> > >> > > > > > > > > > > > Great idea Marco! Anything that you think
> > > >> would be
> > > >> > >> > > valuable
> > > >> > >> > > > > to
> > > >> > >> > > > > > > > share
> > > >> > >> > > > > > > > > > > would
> > > >> > >> > > > > > > > > > > > be good. The duration of each node in the
> > > test
> > > >> stage
> > > >> > >> > > sounds
> > > >> > >> > > > > > like
> > > >> > >> > > > > > > a
> > > >> > >> > > > > > > > > good
> > > >> > >> > > > > > > > > > > > start.
> > > >> > >> > > > > > > > > > > >
> > > >> > >> > > > > > > > > > > > - Carin
> > > >> > >> > > > > > > > > > > >
> > > >> > >> > > > > > > > > > > > On Wed, Aug 14, 2019 at 2:48 PM Marco de
> > > Abreu
> > > >> <
> > > >> > >> > > > > > > > > > marco.g.abreu@gmail.com>
> > > >> > >> > > > > > > > > > > > wrote:
> > > >> > >> > > > > > > > > > > >
> > > >> > >> > > > > > > > > > > > > Hi,
> > > >> > >> > > > > > > > > > > > >
> > > >> > >> > > > > > > > > > > > > we record a bunch of metrics about run
> > > >> statistics
> > > >> > >> (down
> > > >> > >> > > > to
> > > >> > >> > > > > > the
> > > >> > >> > > > > > > > > > duration
> > > >> > >> > > > > > > > > > > > of
> > > >> > >> > > > > > > > > > > > > every individual step). If you tell me
> > > which
> > > >> ones
> > > >> > >> > > you're
> > > >> > >> > > > > > > > > particularly
> > > >> > >> > > > > > > > > > > > > interested in (probably total duration
> of
> > > >> each
> > > >> > >> node in
> > > >> > >> > > > the
> > > >> > >> > > > > > test
> > > >> > >> > > > > > > > > > stage),
> > > >> > >> > > > > > > > > > > > I'm
> > > >> > >> > > > > > > > > > > > > happy to provide them.
> > > >> > >> > > > > > > > > > > > >
> > > >> > >> > > > > > > > > > > > > Dimensions are (in hierarchical order):
> > > >> > >> > > > > > > > > > > > > - job
> > > >> > >> > > > > > > > > > > > > - branch
> > > >> > >> > > > > > > > > > > > > - stage
> > > >> > >> > > > > > > > > > > > > - node
> > > >> > >> > > > > > > > > > > > > - step
> > > >> > >> > > > > > > > > > > > >
> > > >> > >> > > > > > > > > > > > > Unfortunately I don't have the
> > possibility
> > > to
> > > >> > >> export
> > > >> > >> > > them
> > > >> > >> > > > > > since
> > > >> > >> > > > > > > > we
> > > >> > >> > > > > > > > > > > store
> > > >> > >> > > > > > > > > > > > > them in CloudWatch Metrics which afaik
> > > >> doesn't
> > > >> > >> offer
> > > >> > >> > > raw
> > > >> > >> > > > > > > exports.
> > > >> > >> > > > > > > > > > > > >
> > > >> > >> > > > > > > > > > > > > Best regards,
> > > >> > >> > > > > > > > > > > > > Marco
> > > >> > >> > > > > > > > > > > > >
> > > >> > >> > > > > > > > > > > > > Carin Meier <ca...@gmail.com>
> > schrieb
> > > >> am
> > > >> > >> Mi., 14.
> > > >> > >> > > > > Aug.
> > > >> > >> > > > > > > > 2019,
> > > >> > >> > > > > > > > > > > 19:43:
> > > >> > >> > > > > > > > > > > > >
> > > >> > >> > > > > > > > > > > > > > I would prefer to keep the language
> > > >> binding in
> > > >> > >> the PR
> > > >> > >> > > > > > > process.
> > > >> > >> > > > > > > > > > > Perhaps
> > > >> > >> > > > > > > > > > > > we
> > > >> > >> > > > > > > > > > > > > > could do some analytics to see how
> much
> > > >> each of
> > > >> > >> the
> > > >> > >> > > > > > language
> > > >> > >> > > > > > > > > > bindings
> > > >> > >> > > > > > > > > > > > is
> > > >> > >> > > > > > > > > > > > > > contributing to overall run time.
> > > >> > >> > > > > > > > > > > > > > If we have some metrics on that,
> maybe
> > we
> > > >> can
> > > >> > >> come up
> > > >> > >> > > > > with
> > > >> > >> > > > > > a
> > > >> > >> > > > > > > > > > > guideline
> > > >> > >> > > > > > > > > > > > of
> > > >> > >> > > > > > > > > > > > > > how much time each should take.
> Another
> > > >> > >> possibility
> > > >> > >> > > is
> > > >> > >> > > > > > > leverage
> > > >> > >> > > > > > > > > the
> > > >> > >> > > > > > > > > > > > > > parallel builds more.
> > > >> > >> > > > > > > > > > > > > >
> > > >> > >> > > > > > > > > > > > > > On Wed, Aug 14, 2019 at 1:30 PM Pedro
> > > >> Larroy <
> > > >> > >> > > > > > > > > > > > > pedro.larroy.lists@gmail.com
> > > >> > >> > > > > > > > > > > > > > >
> > > >> > >> > > > > > > > > > > > > > wrote:
> > > >> > >> > > > > > > > > > > > > >
> > > >> > >> > > > > > > > > > > > > > > Hi Carin.
> > > >> > >> > > > > > > > > > > > > > >
> > > >> > >> > > > > > > > > > > > > > > That's a good point, all things
> > > >> considered
> > > >> > >> would
> > > >> > >> > > your
> > > >> > >> > > > > > > > > preference
> > > >> > >> > > > > > > > > > be
> > > >> > >> > > > > > > > > > > > to
> > > >> > >> > > > > > > > > > > > > > keep
> > > >> > >> > > > > > > > > > > > > > > the Clojure tests as part of the PR
> > > >> process
> > > >> > >> or in
> > > >> > >> > > > > > Nightly?
> > > >> > >> > > > > > > > > > > > > > > Some options are having
> notifications
> > > >> here or
> > > >> > >> in
> > > >> > >> > > > slack.
> > > >> > >> > > > > > But
> > > >> > >> > > > > > > > if
> > > >> > >> > > > > > > > > we
> > > >> > >> > > > > > > > > > > > think
> > > >> > >> > > > > > > > > > > > > > > breakages would go unnoticed maybe
> is
> > > >> not a
> > > >> > >> good
> > > >> > >> > > idea
> > > >> > >> > > > > to
> > > >> > >> > > > > > > > fully
> > > >> > >> > > > > > > > > > > remove
> > > >> > >> > > > > > > > > > > > > > > bindings from the PR process and
> just
> > > >> > >> streamline
> > > >> > >> > > the
> > > >> > >> > > > > > > process.
> > > >> > >> > > > > > > > > > > > > > >
> > > >> > >> > > > > > > > > > > > > > > Pedro.
> > > >> > >> > > > > > > > > > > > > > >
> > > >> > >> > > > > > > > > > > > > > > On Wed, Aug 14, 2019 at 5:09 AM
> Carin
> > > >> Meier <
> > > >> > >> > > > > > > > > > carinmeier@gmail.com>
> > > >> > >> > > > > > > > > > > > > > wrote:
> > > >> > >> > > > > > > > > > > > > > >
> > > >> > >> > > > > > > > > > > > > > > > Before any binding tests are
> moved
> > to
> > > >> > >> nightly, I
> > > >> > >> > > > > think
> > > >> > >> > > > > > we
> > > >> > >> > > > > > > > > need
> > > >> > >> > > > > > > > > > to
> > > >> > >> > > > > > > > > > > > > > figure
> > > >> > >> > > > > > > > > > > > > > > > out how the community can get
> > proper
> > > >> > >> > > notifications
> > > >> > >> > > > of
> > > >> > >> > > > > > > > failure
> > > >> > >> > > > > > > > > > and
> > > >> > >> > > > > > > > > > > > > > success
> > > >> > >> > > > > > > > > > > > > > > > on those nightly runs.
> Otherwise, I
> > > >> think
> > > >> > >> that
> > > >> > >> > > > > > breakages
> > > >> > >> > > > > > > > > would
> > > >> > >> > > > > > > > > > go
> > > >> > >> > > > > > > > > > > > > > > > unnoticed.
> > > >> > >> > > > > > > > > > > > > > > >
> > > >> > >> > > > > > > > > > > > > > > > -Carin
> > > >> > >> > > > > > > > > > > > > > > >
> > > >> > >> > > > > > > > > > > > > > > > On Tue, Aug 13, 2019 at 7:47 PM
> > Pedro
> > > >> > >> Larroy <
> > > >> > >> > > > > > > > > > > > > > > pedro.larroy.lists@gmail.com
> > > >> > >> > > > > > > > > > > > > > > > >
> > > >> > >> > > > > > > > > > > > > > > > wrote:
> > > >> > >> > > > > > > > > > > > > > > >
> > > >> > >> > > > > > > > > > > > > > > > > Hi
> > > >> > >> > > > > > > > > > > > > > > > >
> > > >> > >> > > > > > > > > > > > > > > > > Seems we are hitting some
> > problems
> > > >> in CI.
> > > >> > >> I
> > > >> > >> > > > propose
> > > >> > >> > > > > > the
> > > >> > >> > > > > > > > > > > following
> > > >> > >> > > > > > > > > > > > > > > action
> > > >> > >> > > > > > > > > > > > > > > > > items to remedy the situation
> and
> > > >> > >> accelerate
> > > >> > >> > > turn
> > > >> > >> > > > > > > around
> > > >> > >> > > > > > > > > > times
> > > >> > >> > > > > > > > > > > in
> > > >> > >> > > > > > > > > > > > > CI,
> > > >> > >> > > > > > > > > > > > > > > > > reduce cost, complexity and
> > > >> probability of
> > > >> > >> > > > failure
> > > >> > >> > > > > > > > blocking
> > > >> > >> > > > > > > > > > PRs
> > > >> > >> > > > > > > > > > > > and
> > > >> > >> > > > > > > > > > > > > > > > > frustrating developers:
> > > >> > >> > > > > > > > > > > > > > > > >
> > > >> > >> > > > > > > > > > > > > > > > > * Upgrade Windows visual studio
> > > from
> > > >> VS
> > > >> > >> 2015 to
> > > >> > >> > > > VS
> > > >> > >> > > > > > > 2017.
> > > >> > >> > > > > > > > > The
> > > >> > >> > > > > > > > > > > > > > > > > build_windows.py infrastructure
> > > >> should
> > > >> > >> easily
> > > >> > >> > > > work
> > > >> > >> > > > > > with
> > > >> > >> > > > > > > > the
> > > >> > >> > > > > > > > > > new
> > > >> > >> > > > > > > > > > > > > > > version.
> > > >> > >> > > > > > > > > > > > > > > > > Currently some PRs are blocked
> by
> > > >> this:
> > > >> > >> > > > > > > > > > > > > > > > >
> > > >> > >> > > > > >
> https://github.com/apache/incubator-mxnet/issues/13958
> > > >> > >> > > > > > > > > > > > > > > > > * Move Gluon Model zoo tests to
> > > >> nightly.
> > > >> > >> > > Tracked
> > > >> > >> > > > at
> > > >> > >> > > > > > > > > > > > > > > > >
> > > >> > >> > > > > >
> https://github.com/apache/incubator-mxnet/issues/15295
> > > >> > >> > > > > > > > > > > > > > > > > * Move non-python bindings
> tests
> > to
> > > >> > >> nightly.
> > > >> > >> > > If a
> > > >> > >> > > > > > > commit
> > > >> > >> > > > > > > > is
> > > >> > >> > > > > > > > > > > > > touching
> > > >> > >> > > > > > > > > > > > > > > > other
> > > >> > >> > > > > > > > > > > > > > > > > bindings, the reviewer should
> ask
> > > >> for a
> > > >> > >> full
> > > >> > >> > > run
> > > >> > >> > > > > > which
> > > >> > >> > > > > > > > can
> > > >> > >> > > > > > > > > be
> > > >> > >> > > > > > > > > > > > done
> > > >> > >> > > > > > > > > > > > > > > > locally,
> > > >> > >> > > > > > > > > > > > > > > > > use the label bot to trigger a
> > full
> > > >> CI
> > > >> > >> build,
> > > >> > >> > > or
> > > >> > >> > > > > > defer
> > > >> > >> > > > > > > to
> > > >> > >> > > > > > > > > > > > nightly.
> > > >> > >> > > > > > > > > > > > > > > > > * Provide a couple of basic
> > sanity
> > > >> > >> performance
> > > >> > >> > > > > tests
> > > >> > >> > > > > > on
> > > >> > >> > > > > > > > > small
> > > >> > >> > > > > > > > > > > > > models
> > > >> > >> > > > > > > > > > > > > > > that
> > > >> > >> > > > > > > > > > > > > > > > > are run on CI and can be echoed
> > by
> > > >> the
> > > >> > >> label
> > > >> > >> > > bot
> > > >> > >> > > > > as a
> > > >> > >> > > > > > > > > comment
> > > >> > >> > > > > > > > > > > for
> > > >> > >> > > > > > > > > > > > > > PRs.
> > > >> > >> > > > > > > > > > > > > > > > > * Address unit tests that take
> > more
> > > >> than
> > > >> > >> > > 10-20s,
> > > >> > >> > > > > > > > streamline
> > > >> > >> > > > > > > > > > > them
> > > >> > >> > > > > > > > > > > > or
> > > >> > >> > > > > > > > > > > > > > > move
> > > >> > >> > > > > > > > > > > > > > > > > them to nightly if it can't be
> > > done.
> > > >> > >> > > > > > > > > > > > > > > > > * Open sourcing the remaining
> CI
> > > >> > >> infrastructure
> > > >> > >> > > > > > scripts
> > > >> > >> > > > > > > > so
> > > >> > >> > > > > > > > > > the
> > > >> > >> > > > > > > > > > > > > > > community
> > > >> > >> > > > > > > > > > > > > > > > > can contribute.
> > > >> > >> > > > > > > > > > > > > > > > >
> > > >> > >> > > > > > > > > > > > > > > > > I think our goal should be
> > > turnaround
> > > >> > >> under
> > > >> > >> > > > 30min.
> > > >> > >> > > > > > > > > > > > > > > > >
> > > >> > >> > > > > > > > > > > > > > > > > I would also like to touch base
> > > with
> > > >> the
> > > >> > >> > > > community
> > > >> > >> > > > > > that
> > > >> > >> > > > > > > > > some
> > > >> > >> > > > > > > > > > > PRs
> > > >> > >> > > > > > > > > > > > > are
> > > >> > >> > > > > > > > > > > > > > > not
> > > >> > >> > > > > > > > > > > > > > > > > being followed up by committers
> > > >> asking for
> > > >> > >> > > > changes.
> > > >> > >> > > > > > For
> > > >> > >> > > > > > > > > > example
> > > >> > >> > > > > > > > > > > > > this
> > > >> > >> > > > > > > > > > > > > > PR
> > > >> > >> > > > > > > > > > > > > > > > is
> > > >> > >> > > > > > > > > > > > > > > > > importtant and is hanging for a
> > > long
> > > >> time.
> > > >> > >> > > > > > > > > > > > > > > > >
> > > >> > >> > > > > > > > > > > > > > > > >
> > > >> > >> > > > > https://github.com/apache/incubator-mxnet/pull/15051
> > > >> > >> > > > > > > > > > > > > > > > >
> > > >> > >> > > > > > > > > > > > > > > > > This is another, less important
> > but
> > > >> more
> > > >> > >> > > trivial
> > > >> > >> > > > to
> > > >> > >> > > > > > > > review:
> > > >> > >> > > > > > > > > > > > > > > > >
> > > >> > >> > > > > > > > > > > > > > > > >
> > > >> > >> > > > > https://github.com/apache/incubator-mxnet/pull/14940
> > > >> > >> > > > > > > > > > > > > > > > >
> > > >> > >> > > > > > > > > > > > > > > > > I think comitters requesting
> > > changes
> > > >> and
> > > >> > >> not
> > > >> > >> > > > > > folllowing
> > > >> > >> > > > > > > > up
> > > >> > >> > > > > > > > > in
> > > >> > >> > > > > > > > > > > > > > > reasonable
> > > >> > >> > > > > > > > > > > > > > > > > time is not healthy for the
> > > project.
> > > >> I
> > > >> > >> suggest
> > > >> > >> > > > > > > > configuring
> > > >> > >> > > > > > > > > > > github
> > > >> > >> > > > > > > > > > > > > > > > > Notifications for a good SNR
> and
> > > >> > >> following up.
> > > >> > >> > > > > > > > > > > > > > > > >
> > > >> > >> > > > > > > > > > > > > > > > > Regards.
> > > >> > >> > > > > > > > > > > > > > > > >
> > > >> > >> > > > > > > > > > > > > > > > > Pedro.
> > > >> > >> > > > > > > > > > > > > > > > >
> > > >> > >> > > > > > > > > > > > > > > >
> > > >> > >> > > > > > > > > > > > > > >
> > > >> > >> > > > > > > > > > > > > >
> > > >> > >> > > > > > > > > > > > >
> > > >> > >> > > > > > > > > > > >
> > > >> > >> > > > > > > > > > >
> > > >> > >> > > > > > > > > >
> > > >> > >> > > > > > > > >
> > > >> > >> > > > > > > >
> > > >> > >> > > > > > >
> > > >> > >> > > > > >
> > > >> > >> > > > >
> > > >> > >> > > >
> > > >> > >> > >
> > > >> > >>
> > > >> > >
> > > >>
> > > >
> > >
> >
>

Re: CI and PRs

Posted by Pedro Larroy <pe...@gmail.com>.

Thanks for your response Marco, I think you have totally missed my original
point which was basically that someone volunteering effort on the CI is as
important as someone contributing a feature. From my perspective this
hasn't been the case, and we had to rely a lot on you and Sheng to submit
fixes which required access, also to relay communication with Apache infra.
Also in many cases we had to rely on you to channel fixes, PRs, disable
tests etc. If the community is fine having this kind of bottleneck, fine
with me. From my point of view and the feedback from myself and other
people which contributed to CI this was not always a good experience.
Having a welcoming and inclusive community is very important. I don't want
to start a discussion on this, but invite the community to do a bit of soul
searching on this topic, now that the infrastructure is open source.

Also I find surprising that you claim that you designed the CI yourself,
when this was a joint work of many individuals, including the old Apache CI
and additional contributions and code reviewers, people who were oncall for
this service or the autoscaling approach which if I remember correctly came
from a humble servant. Kellen did a lot of pair programming and code
reviews. Obviously you have a done a lot of work on CI which has had a huge
positive impact on the project and your recognition is well deserved. The
technical details you mention on your email are perfectly true and valid.

Below is a rough list of individuals who contributed to CI, I would like to
thank all of them since without this work, we wouldn't be able to deliver
with the quality that we have done in the past.


pllarroy@mac:0: ~/d/m/ci [fc_higher_order_grad_2]> git log
--pretty=format:%aN . | sort | uniq -c | sort -n | tail -n 10
   6 Zach Kimberg
   6 stu1130
   7 Jake Lee
   8 Aaron Markham
  11 Lanking
  12 Anton Chernov
  13 perdasilva
  26 Kellen Sunderland
  34 Marco de Abreu
  46 Pedro Larroy

pllarroy@mac:0: ~/d/mxnet_ci_general [master]> git log --pretty=format:%aN
| sort | uniq -c | sort -n
   1 Gavin M. Bell
   1 de Abreu
   6 Bair
   7 Kellen Sunderland
   8 Jose Luis Contreras
  14 perdasilva
  20 Per Goncalves da Silva
  29 Anton Chernov
  39 Chance Bair
  96 Pedro Larroy
 209 Marco de Abreu



Pedro.

On Fri, Aug 23, 2019 at 3:18 PM Marco de Abreu <ma...@gmail.com>
wrote:

> I've heard this request multiple times and so far, I'm having issues
> understanding the direct correlation between having committer permissions
> and being able to manage CI.
>
> When I designed the CI, one of the tenets was maintainability and
> accessbility for the community: I wanted to avoid that somebody needs
> certain privileges in order to execute regular actions. The result was the
> strong usage of Jenkinsfiles, Dockerfiles and the runtime functions. The
> combination of these techniques allowed somebody to create a job from the
> process flow level (Jenkinsfile), over the environment level (Dockerfile)
> to the individual action level (runtime functions). This design basically
> gives the community full access over the entire flow.
>
> The jobs that are configured to source only Jenkinsfile. Jenkins supports a
> lot of different ways how to define pipelines, but I have made sure to
> encourage everybody to use only Jenkinsfiles. This makes sure that no
> configuration is done in the web-interface. This firs of all alleviates the
> permission issue since there's literally no config in the web interface and
> second it allows auditing since all changes have to be done in the MXNet
> GitHub repository.
>
> Committers have elevated permissions in Jenkins. These contain the
> permission to run, stop and configure jobs. All other permissions are
> restricted to system administrators for the sake of ensuring stability of
> the system. On the dev-CI on the other hand, we're happy to add people so
> they can experiment as much as they want. The transition to prod-CI is then
> assisted by me to ensure smooth operations and adhering to the best
> practices (like using our Jenkinsfiles and Docker structure, for example).
>
> The only case where somebody would need elevated permissions is if they
> would like to change system settings. But at that point, we're talking
> about instance settings and AWS account configuration. Since that now
> reaches into the next permission level, which is restricted to the donor of
> the CI system - Amazon Web Services - this is something that not even PMC
> members will receive. The same policy is in place for the official Apache
> CI: Committers/PMCs can configure their job, but don't have system level
> access to either Jenkins or the underlying AWS account for obvious reasons.
> We're trying to stay in line with the same policy, but in the past I've
> granted Jenkins administrator access to people who required elevated access
> to properly do their job - Aaron Markham with regards to the website being
> one example.
>
> This means that the only case when a contributor needs committer assistance
> is the moment when somebody would like to set up a new Jenkins job. It
> would be a matter of setting up the job to point to the persons branch -
> Jenkins will then automatically pull the Jenkinsfile and thus no further
> configuration is necessary and updates are directly consumed. Such a
> request IMO is on the same level as us having to cut a ticket to Apache
> INFRA to create a new job.
>
> With regards to speed: So far, I was the only "CI-Person" with committer
> privileges. But due to our 4-eye-rule for PRs, I wasn't able to merge my
> own changes anyways - most of them were reviewed by Sheng, for example. In
> an emergency, I'm sure that somebody can be reached to assist since we
> currently have 39 PMC members and 20 committers spanning multiple
> timezones.
>
> For these reasons, I don't agree with the sentiment that contributors are
> unable to effectively work with the CI system unless they have committer
> privileges.
>
> Best regards,
> Marco
>
>
> On Fri, Aug 23, 2019 at 10:33 AM Pedro Larroy <
> pedro.larroy.lists@gmail.com>
> wrote:
>
> > As Marco has open sourced the bulk of the CI infrastructure donated from
> > Amazon to the community, I would like to raise the recommendation that
> the
> > community takes action to help volunteers working on the CI have a better
> > experience. In the past, it's my impression that there hasn't been much
> > action granting PMC or committer privileges to engineers volunteering to
> > help CI other than Marco. This would encourage more contributions and
> help
> > expedite critical fixes and corrective actions. I think this has not
> > properly enabled those individuals to be as effective as they could, as
> > well as the lack of recognition for such a critical activity. I'm not
> sure
> > about the cause but I believe this is something that should be rectified
> > for future contributions and help on the CI front if improvements are
> > desired.
> >
> > In spanish we have a saying: "es de bien nacido ser agradecido".
> >
> > Pedro.
> >
> > On Fri, Aug 16, 2019 at 4:03 PM Pedro Larroy <
> pedro.larroy.lists@gmail.com
> > >
> > wrote:
> >
> > > Hi Aaron. This is difficult to diagnose, because I don't know what to
> do
> > > when the hash of the layer in docker doesn't match and decides to
> rebuild
> > > it. the r script seems not to have changed. I have observed this in the
> > > past and I think is due to bugs in docker.   Maybe Kellen is able to
> give
> > > some tips here.
> > >
> > > In this case you should use -R which is already in master. (you can
> > always
> > > copy the script on top if you are in an older revision).
> > >
> > > Another thing that worked for me in the past was to completely nuke the
> > > docker cache, so it redonwloads from the CI repo. After that it worked
> > fine
> > > in some cases.
> > >
> > > These two workarounds are not ideal, but should unblock you.
> > >
> > > Pedro.
> > >
> > > On Fri, Aug 16, 2019 at 11:39 AM Aaron Markham <
> > aaron.s.markham@gmail.com>
> > > wrote:
> > >
> > >> Is -R already in there?
> > >>
> > >> Here's an example of it happening to me right now.... I am making
> > >> minor changes to the runtime_functions logic for handling the R docs
> > >> output. I pull the fix, then run the container, but I see the R deps
> > >> layer re-running. I didn't touch that. Why it that running again?
> > >>
> > >> From https://github.com/aaronmarkham/incubator-mxnet
> > >>    f71cc6d..deec6aa  new_website_pipeline_2_aaron_rdocs ->
> > >> origin/new_website_pipeline_2_aaron_rdocs
> > >> Updating f71cc6d..deec6aa
> > >> Fast-forward
> > >>  ci/docker/runtime_functions.sh | 6 +++---
> > >>  1 file changed, 3 insertions(+), 3 deletions(-)
> > >> (base) ubuntu@ip-172-31-47-182:~/aaron/ci$ ./build.py
> > >> --docker-registry mxnetci --platform ubuntu_cpu_r
> > >> --docker-build-retries 3 --shm-size 500m /work/runtime_functions.sh
> > >> build_r_docs
> > >> build.py: 2019-08-16 18:34:44,639Z INFO MXNet container based build
> > tool.
> > >> build.py: 2019-08-16 18:34:44,641Z INFO Docker cache download is
> > >> enabled from registry mxnetci
> > >> build.py: 2019-08-16 18:34:44,641Z INFO Loading Docker cache for
> > >> mxnetci/build.ubuntu_cpu_r from mxnetci
> > >> Using default tag: latest
> > >> latest: Pulling from mxnetci/build.ubuntu_cpu_r
> > >> Digest:
> > >>
> sha256:7dc515c288b3e66d96920eb8975f985a501bb57f70595fbe0cb1c4fcd8d4184b
> > >> Status: Downloaded newer image for mxnetci/build.ubuntu_cpu_r:latest
> > >> build.py: 2019-08-16 18:34:44,807Z INFO Successfully pulled docker
> cache
> > >> build.py: 2019-08-16 18:34:44,807Z INFO Building docker container
> > >> tagged 'mxnetci/build.ubuntu_cpu_r' with docker
> > >> build.py: 2019-08-16 18:34:44,807Z INFO Running command: 'docker build
> > >> -f docker/Dockerfile.build.ubuntu_cpu_r --build-arg USER_ID=1000
> > >> --build-arg GROUP_ID=1000 --cache-from mxnetci/build.ubuntu_cpu_r -t
> > >> mxnetci/build.ubuntu_cpu_r docker'
> > >> Sending build context to Docker daemon  289.8kB
> > >> Step 1/15 : FROM ubuntu:16.04
> > >>  ---> 5e13f8dd4c1a
> > >> Step 2/15 : WORKDIR /work/deps
> > >>  ---> Using cache
> > >>  ---> afc2a135945d
> > >> Step 3/15 : COPY install/ubuntu_core.sh /work/
> > >>  ---> Using cache
> > >>  ---> da2b2e7f35e1
> > >> Step 4/15 : RUN /work/ubuntu_core.sh
> > >>  ---> Using cache
> > >>  ---> d1e88b26b1d2
> > >> Step 5/15 : COPY install/deb_ubuntu_ccache.sh /work/
> > >>  ---> Using cache
> > >>  ---> 3aa97dea3b7b
> > >> Step 6/15 : RUN /work/deb_ubuntu_ccache.sh
> > >>  ---> Using cache
> > >>  ---> bec503f1d149
> > >> Step 7/15 : COPY install/ubuntu_r.sh /work/
> > >>  ---> c5e77c38031d
> > >> Step 8/15 : COPY install/r.gpg /work/
> > >>  ---> d8cdbf015d2b
> > >> Step 9/15 : RUN /work/ubuntu_r.sh
> > >>  ---> Running in c6c90b9e1538
> > >> ++ dirname /work/ubuntu_r.sh
> > >> + cd /work
> > >> + echo 'deb http://cran.rstudio.com/bin/linux/ubuntu trusty/'
> > >> + apt-key add r.gpg
> > >> OK
> > >> + add-apt-repository 'deb [arch=amd64,i386]
> > >> https://cran.rstudio.com/bin/linux/ubuntu xenial/'
> > >> + apt-get update
> > >> Ign:1 http://cran.rstudio.com/bin/linux/ubuntu trusty/ InRelease
> > >>
> > >> On Fri, Aug 16, 2019 at 11:32 AM Pedro Larroy
> > >> <pe...@gmail.com> wrote:
> > >> >
> > >> > Also, I forgot, another workaround is that I added the -R flag to
> the
> > >> build
> > >> > logic (build.py) so the container is not rebuilt for manual use.
> > >> >
> > >> > On Fri, Aug 16, 2019 at 11:18 AM Pedro Larroy <
> > >> pedro.larroy.lists@gmail.com>
> > >> > wrote:
> > >> >
> > >> > >
> > >> > > Hi Aaron.
> > >> > >
> > >> > > As Marco explained, if you are in master the cache usually works,
> > >> there's
> > >> > > two issues that I have observed:
> > >> > >
> > >> > > 1 - Docker doesn't automatically pull the base image (ex.
> > >> ubuntu:16.04) so
> > >> > > if your cached base which is used in the FROM statement becomes
> > >> outdated
> > >> > > your caching won't work. (Using docker pull ubuntu:16.04) or the
> > base
> > >> > > images from the container helps with this.
> > >> > >
> > >> > > 2 - There's another situation where the above doesn't help which
> > >> seems to
> > >> > > be an unidentified issue with the docker cache:
> > >> > > https://github.com/docker/docker.github.io/issues/8886
> > >> > >
> > >> > > We can get a short term workaround for #1 by explicitly pulling
> > bases
> > >> from
> > >> > > the script, but I think docker should do it when using
> --cache-from
> > so
> > >> > > maybe contributing a patch to docker would the best approach.
> > >> > >
> > >> > > Pedro
> > >> > >
> > >> > > On Thu, Aug 15, 2019 at 7:06 PM Aaron Markham <
> > >> aaron.s.markham@gmail.com>
> > >> > > wrote:
> > >> > >
> > >> > >> When you create a new Dockerfile and use that on CI, it doesn't
> > seem
> > >> > >> to cache some of the steps... like this:
> > >> > >>
> > >> > >> Step 13/15 : RUN /work/ubuntu_docs.sh
> > >> > >>  ---> Running in a1e522f3283b
> > >> > >>  [91m+ echo 'Installing dependencies...'
> > >> > >> + apt-get update
> > >> > >>  [0mInstalling dependencies.
> > >> > >>
> > >> > >> Or this....
> > >> > >>
> > >> > >> Step 4/13 : RUN /work/ubuntu_core.sh
> > >> > >>  ---> Running in e7882d7aa750
> > >> > >>  [91m+ apt-get update
> > >> > >>
> > >> > >> I get if I was changing those scripts, but then I'd think it
> should
> > >> > >> cache after running it once... but, no.
> > >> > >>
> > >> > >>
> > >> > >> On Thu, Aug 15, 2019 at 3:51 PM Marco de Abreu <
> > >> marco.g.abreu@gmail.com>
> > >> > >> wrote:
> > >> > >> >
> > >> > >> > Do I understand it correctly that you are saying that the
> Docker
> > >> cache
> > >> > >> > doesn't work properly and regularly reinstalls dependencies? Or
> > do
> > >> you
> > >> > >> mean
> > >> > >> > that you only have cache misses when you modify the
> dependencies
> > -
> > >> which
> > >> > >> > would be expected?
> > >> > >> >
> > >> > >> > -Marco
> > >> > >> >
> > >> > >> > On Fri, Aug 16, 2019 at 12:48 AM Aaron Markham <
> > >> > >> aaron.s.markham@gmail.com>
> > >> > >> > wrote:
> > >> > >> >
> > >> > >> > > Many of the CI pipelines follow this pattern:
> > >> > >> > > Load ubuntu 16.04, install deps, build mxnet, then run some
> > >> tests. Why
> > >> > >> > > repeat steps 1-3 over and over?
> > >> > >> > >
> > >> > >> > > Now, some tests use a stashed binary and docker cache. And I
> > see
> > >> this
> > >> > >> work
> > >> > >> > > locally, but for the most part, on CI, you're gonna sit
> > through a
> > >> > >> > > dependency install.
> > >> > >> > >
> > >> > >> > > I noticed that almost all jobs use an ubuntu setup that is
> > fully
> > >> > >> loaded.
> > >> > >> > > Without cache, it can take 10 or more minutes to build.  So I
> > >> made a
> > >> > >> lite
> > >> > >> > > version. Takes only a few minutes instead.
> > >> > >> > >
> > >> > >> > > In some cases archiving worked great to share across
> pipelines,
> > >> but as
> > >> > >> > > Marco mentioned we need a storage solution to make that
> happen.
> > >> We
> > >> > >> can't
> > >> > >> > > archive every intermediate artifact for each PR.
> > >> > >> > >
> > >> > >> > > On Thu, Aug 15, 2019, 13:47 Pedro Larroy <
> > >> > >> pedro.larroy.lists@gmail.com>
> > >> > >> > > wrote:
> > >> > >> > >
> > >> > >> > > > Hi Aaron. Why speeds things up? What's the difference?
> > >> > >> > > >
> > >> > >> > > > Pedro.
> > >> > >> > > >
> > >> > >> > > > On Wed, Aug 14, 2019 at 8:39 PM Aaron Markham <
> > >> > >> aaron.s.markham@gmail.com
> > >> > >> > > >
> > >> > >> > > > wrote:
> > >> > >> > > >
> > >> > >> > > > > The PRs Thomas and I are working on for the new docs and
> > >> website
> > >> > >> share
> > >> > >> > > > the
> > >> > >> > > > > mxnet binary in the new CI pipelines we made. Speeds
> things
> > >> up a
> > >> > >> lot.
> > >> > >> > > > >
> > >> > >> > > > > On Wed, Aug 14, 2019, 18:16 Chris Olivier <
> > >> cjolivier01@gmail.com>
> > >> > >> > > wrote:
> > >> > >> > > > >
> > >> > >> > > > > > I see it done daily now, and while I can’t share all
> the
> > >> > >> details,
> > >> > >> > > it’s
> > >> > >> > > > > not
> > >> > >> > > > > > an incredibly complex thing, and involves not much more
> > >> than
> > >> > >> nfs/efs
> > >> > >> > > > > > sharing and remote ssh commands.  All it takes is a
> > little
> > >> > >> ingenuity
> > >> > >> > > > and
> > >> > >> > > > > > some imagination.
> > >> > >> > > > > >
> > >> > >> > > > > > On Wed, Aug 14, 2019 at 4:31 PM Pedro Larroy <
> > >> > >> > > > > pedro.larroy.lists@gmail.com
> > >> > >> > > > > > >
> > >> > >> > > > > > wrote:
> > >> > >> > > > > >
> > >> > >> > > > > > > Sounds good in theory. I think there are complex
> > details
> > >> with
> > >> > >> > > regards
> > >> > >> > > > > of
> > >> > >> > > > > > > resource sharing during parallel execution. Still I
> > think
> > >> > >> both ways
> > >> > >> > > > can
> > >> > >> > > > > > be
> > >> > >> > > > > > > explored. I think some tests run for unreasonably
> long
> > >> times
> > >> > >> for
> > >> > >> > > what
> > >> > >> > > > > > they
> > >> > >> > > > > > > are doing. We already scale parts of the pipeline
> > >> horizontally
> > >> > >> > > across
> > >> > >> > > > > > > workers.
> > >> > >> > > > > > >
> > >> > >> > > > > > >
> > >> > >> > > > > > > On Wed, Aug 14, 2019 at 5:12 PM Chris Olivier <
> > >> > >> > > > cjolivier01@apache.org>
> > >> > >> > > > > > > wrote:
> > >> > >> > > > > > >
> > >> > >> > > > > > > > +1
> > >> > >> > > > > > > >
> > >> > >> > > > > > > > Rather than remove tests (which doesn’t scale as a
> > >> > >> solution), why
> > >> > >> > > > not
> > >> > >> > > > > > > scale
> > >> > >> > > > > > > > them horizontally so that they finish more quickly?
> > >> Across
> > >> > >> > > > processes
> > >> > >> > > > > or
> > >> > >> > > > > > > > even on a pool of machines that aren’t necessarily
> > the
> > >> build
> > >> > >> > > > machine?
> > >> > >> > > > > > > >
> > >> > >> > > > > > > > On Wed, Aug 14, 2019 at 12:03 PM Marco de Abreu <
> > >> > >> > > > > > marco.g.abreu@gmail.com
> > >> > >> > > > > > > >
> > >> > >> > > > > > > > wrote:
> > >> > >> > > > > > > >
> > >> > >> > > > > > > > > With regards to time I rather prefer us spending
> a
> > >> bit
> > >> > >> more
> > >> > >> > > time
> > >> > >> > > > on
> > >> > >> > > > > > > > > maintenance than somebody running into an error
> > that
> > >> > >> could've
> > >> > >> > > > been
> > >> > >> > > > > > > caught
> > >> > >> > > > > > > > > with a test.
> > >> > >> > > > > > > > >
> > >> > >> > > > > > > > > I mean, our Publishing pipeline for Scala GPU has
> > >> been
> > >> > >> broken
> > >> > >> > > for
> > >> > >> > > > > > quite
> > >> > >> > > > > > > > > some time now, but nobody noticed that. Basically
> > my
> > >> > >> stance on
> > >> > >> > > > that
> > >> > >> > > > > > > > matter
> > >> > >> > > > > > > > > is that as soon as something is not blocking, you
> > can
> > >> > >> also just
> > >> > >> > > > > > > > deactivate
> > >> > >> > > > > > > > > it since you don't have a forcing function in an
> > open
> > >> > >> source
> > >> > >> > > > > project.
> > >> > >> > > > > > > > > People will rarely come back and fix the errors
> of
> > >> some
> > >> > >> nightly
> > >> > >> > > > > test
> > >> > >> > > > > > > that
> > >> > >> > > > > > > > > they introduced.
> > >> > >> > > > > > > > >
> > >> > >> > > > > > > > > -Marco
> > >> > >> > > > > > > > >
> > >> > >> > > > > > > > > Carin Meier <ca...@gmail.com> schrieb am
> Mi.,
> > >> 14.
> > >> > >> Aug.
> > >> > >> > > > 2019,
> > >> > >> > > > > > > 21:59:
> > >> > >> > > > > > > > >
> > >> > >> > > > > > > > > > If a language binding test is failing for a not
> > >> > >> important
> > >> > >> > > > reason,
> > >> > >> > > > > > > then
> > >> > >> > > > > > > > it
> > >> > >> > > > > > > > > > is too brittle and needs to be fixed (we have
> > fixed
> > >> > >> some of
> > >> > >> > > > these
> > >> > >> > > > > > > with
> > >> > >> > > > > > > > > the
> > >> > >> > > > > > > > > > Clojure package [1]).
> > >> > >> > > > > > > > > > But in general, if we thinking of the MXNet
> > >> project as
> > >> > >> one
> > >> > >> > > > > project
> > >> > >> > > > > > > that
> > >> > >> > > > > > > > > is
> > >> > >> > > > > > > > > > across all the language bindings, then we want
> to
> > >> know
> > >> > >> if
> > >> > >> > > some
> > >> > >> > > > > > > > > fundamental
> > >> > >> > > > > > > > > > code change is going to break a downstream
> > package.
> > >> > >> > > > > > > > > > I can't speak for all the high level package
> > >> binding
> > >> > >> > > > maintainers,
> > >> > >> > > > > > but
> > >> > >> > > > > > > > I'm
> > >> > >> > > > > > > > > > always happy to pitch in to provide code fixes
> to
> > >> help
> > >> > >> the
> > >> > >> > > base
> > >> > >> > > > > PR
> > >> > >> > > > > > > get
> > >> > >> > > > > > > > > > green.
> > >> > >> > > > > > > > > >
> > >> > >> > > > > > > > > > The time costs to maintain such a large CI
> > project
> > >> > >> obviously
> > >> > >> > > > > needs
> > >> > >> > > > > > to
> > >> > >> > > > > > > > be
> > >> > >> > > > > > > > > > considered as well.
> > >> > >> > > > > > > > > >
> > >> > >> > > > > > > > > > [1]
> > >> > >> https://github.com/apache/incubator-mxnet/pull/15579
> > >> > >> > > > > > > > > >
> > >> > >> > > > > > > > > > On Wed, Aug 14, 2019 at 3:48 PM Pedro Larroy <
> > >> > >> > > > > > > > > pedro.larroy.lists@gmail.com
> > >> > >> > > > > > > > > > >
> > >> > >> > > > > > > > > > wrote:
> > >> > >> > > > > > > > > >
> > >> > >> > > > > > > > > > > From what I have seen Clojure is 15 minutes,
> > >> which I
> > >> > >> think
> > >> > >> > > is
> > >> > >> > > > > > > > > reasonable.
> > >> > >> > > > > > > > > > > The only question is that when a binding such
> > as
> > >> R,
> > >> > >> Perl or
> > >> > >> > > > > > Clojure
> > >> > >> > > > > > > > > > fails,
> > >> > >> > > > > > > > > > > some devs are a bit confused about how to fix
> > >> them
> > >> > >> since
> > >> > >> > > they
> > >> > >> > > > > are
> > >> > >> > > > > > > not
> > >> > >> > > > > > > > > > > familiar with the testing tools and the
> > language.
> > >> > >> > > > > > > > > > >
> > >> > >> > > > > > > > > > > On Wed, Aug 14, 2019 at 11:57 AM Carin Meier
> <
> > >> > >> > > > > > carinmeier@gmail.com
> > >> > >> > > > > > > >
> > >> > >> > > > > > > > > > wrote:
> > >> > >> > > > > > > > > > >
> > >> > >> > > > > > > > > > > > Great idea Marco! Anything that you think
> > >> would be
> > >> > >> > > valuable
> > >> > >> > > > > to
> > >> > >> > > > > > > > share
> > >> > >> > > > > > > > > > > would
> > >> > >> > > > > > > > > > > > be good. The duration of each node in the
> > test
> > >> stage
> > >> > >> > > sounds
> > >> > >> > > > > > like
> > >> > >> > > > > > > a
> > >> > >> > > > > > > > > good
> > >> > >> > > > > > > > > > > > start.
> > >> > >> > > > > > > > > > > >
> > >> > >> > > > > > > > > > > > - Carin
> > >> > >> > > > > > > > > > > >
> > >> > >> > > > > > > > > > > > On Wed, Aug 14, 2019 at 2:48 PM Marco de
> > Abreu
> > >> <
> > >> > >> > > > > > > > > > marco.g.abreu@gmail.com>
> > >> > >> > > > > > > > > > > > wrote:
> > >> > >> > > > > > > > > > > >
> > >> > >> > > > > > > > > > > > > Hi,
> > >> > >> > > > > > > > > > > > >
> > >> > >> > > > > > > > > > > > > we record a bunch of metrics about run
> > >> statistics
> > >> > >> (down
> > >> > >> > > > to
> > >> > >> > > > > > the
> > >> > >> > > > > > > > > > duration
> > >> > >> > > > > > > > > > > > of
> > >> > >> > > > > > > > > > > > > every individual step). If you tell me
> > which
> > >> ones
> > >> > >> > > you're
> > >> > >> > > > > > > > > particularly
> > >> > >> > > > > > > > > > > > > interested in (probably total duration of
> > >> each
> > >> > >> node in
> > >> > >> > > > the
> > >> > >> > > > > > test
> > >> > >> > > > > > > > > > stage),
> > >> > >> > > > > > > > > > > > I'm
> > >> > >> > > > > > > > > > > > > happy to provide them.
> > >> > >> > > > > > > > > > > > >
> > >> > >> > > > > > > > > > > > > Dimensions are (in hierarchical order):
> > >> > >> > > > > > > > > > > > > - job
> > >> > >> > > > > > > > > > > > > - branch
> > >> > >> > > > > > > > > > > > > - stage
> > >> > >> > > > > > > > > > > > > - node
> > >> > >> > > > > > > > > > > > > - step
> > >> > >> > > > > > > > > > > > >
> > >> > >> > > > > > > > > > > > > Unfortunately I don't have the
> possibility
> > to
> > >> > >> export
> > >> > >> > > them
> > >> > >> > > > > > since
> > >> > >> > > > > > > > we
> > >> > >> > > > > > > > > > > store
> > >> > >> > > > > > > > > > > > > them in CloudWatch Metrics which afaik
> > >> doesn't
> > >> > >> offer
> > >> > >> > > raw
> > >> > >> > > > > > > exports.
> > >> > >> > > > > > > > > > > > >
> > >> > >> > > > > > > > > > > > > Best regards,
> > >> > >> > > > > > > > > > > > > Marco
> > >> > >> > > > > > > > > > > > >
> > >> > >> > > > > > > > > > > > > Carin Meier <ca...@gmail.com>
> schrieb
> > >> am
> > >> > >> Mi., 14.
> > >> > >> > > > > Aug.
> > >> > >> > > > > > > > 2019,
> > >> > >> > > > > > > > > > > 19:43:
> > >> > >> > > > > > > > > > > > >
> > >> > >> > > > > > > > > > > > > > I would prefer to keep the language
> > >> binding in
> > >> > >> the PR
> > >> > >> > > > > > > process.
> > >> > >> > > > > > > > > > > Perhaps
> > >> > >> > > > > > > > > > > > we
> > >> > >> > > > > > > > > > > > > > could do some analytics to see how much
> > >> each of
> > >> > >> the
> > >> > >> > > > > > language
> > >> > >> > > > > > > > > > bindings
> > >> > >> > > > > > > > > > > > is
> > >> > >> > > > > > > > > > > > > > contributing to overall run time.
> > >> > >> > > > > > > > > > > > > > If we have some metrics on that, maybe
> we
> > >> can
> > >> > >> come up
> > >> > >> > > > > with
> > >> > >> > > > > > a
> > >> > >> > > > > > > > > > > guideline
> > >> > >> > > > > > > > > > > > of
> > >> > >> > > > > > > > > > > > > > how much time each should take. Another
> > >> > >> possibility
> > >> > >> > > is
> > >> > >> > > > > > > leverage
> > >> > >> > > > > > > > > the
> > >> > >> > > > > > > > > > > > > > parallel builds more.
> > >> > >> > > > > > > > > > > > > >
> > >> > >> > > > > > > > > > > > > > On Wed, Aug 14, 2019 at 1:30 PM Pedro
> > >> Larroy <
> > >> > >> > > > > > > > > > > > > pedro.larroy.lists@gmail.com
> > >> > >> > > > > > > > > > > > > > >
> > >> > >> > > > > > > > > > > > > > wrote:
> > >> > >> > > > > > > > > > > > > >
> > >> > >> > > > > > > > > > > > > > > Hi Carin.
> > >> > >> > > > > > > > > > > > > > >
> > >> > >> > > > > > > > > > > > > > > That's a good point, all things
> > >> considered
> > >> > >> would
> > >> > >> > > your
> > >> > >> > > > > > > > > preference
> > >> > >> > > > > > > > > > be
> > >> > >> > > > > > > > > > > > to
> > >> > >> > > > > > > > > > > > > > keep
> > >> > >> > > > > > > > > > > > > > > the Clojure tests as part of the PR
> > >> process
> > >> > >> or in
> > >> > >> > > > > > Nightly?
> > >> > >> > > > > > > > > > > > > > > Some options are having notifications
> > >> here or
> > >> > >> in
> > >> > >> > > > slack.
> > >> > >> > > > > > But
> > >> > >> > > > > > > > if
> > >> > >> > > > > > > > > we
> > >> > >> > > > > > > > > > > > think
> > >> > >> > > > > > > > > > > > > > > breakages would go unnoticed maybe is
> > >> not a
> > >> > >> good
> > >> > >> > > idea
> > >> > >> > > > > to
> > >> > >> > > > > > > > fully
> > >> > >> > > > > > > > > > > remove
> > >> > >> > > > > > > > > > > > > > > bindings from the PR process and just
> > >> > >> streamline
> > >> > >> > > the
> > >> > >> > > > > > > process.
> > >> > >> > > > > > > > > > > > > > >
> > >> > >> > > > > > > > > > > > > > > Pedro.
> > >> > >> > > > > > > > > > > > > > >
> > >> > >> > > > > > > > > > > > > > > On Wed, Aug 14, 2019 at 5:09 AM Carin
> > >> Meier <
> > >> > >> > > > > > > > > > carinmeier@gmail.com>
> > >> > >> > > > > > > > > > > > > > wrote:
> > >> > >> > > > > > > > > > > > > > >
> > >> > >> > > > > > > > > > > > > > > > Before any binding tests are moved
> to
> > >> > >> nightly, I
> > >> > >> > > > > think
> > >> > >> > > > > > we
> > >> > >> > > > > > > > > need
> > >> > >> > > > > > > > > > to
> > >> > >> > > > > > > > > > > > > > figure
> > >> > >> > > > > > > > > > > > > > > > out how the community can get
> proper
> > >> > >> > > notifications
> > >> > >> > > > of
> > >> > >> > > > > > > > failure
> > >> > >> > > > > > > > > > and
> > >> > >> > > > > > > > > > > > > > success
> > >> > >> > > > > > > > > > > > > > > > on those nightly runs. Otherwise, I
> > >> think
> > >> > >> that
> > >> > >> > > > > > breakages
> > >> > >> > > > > > > > > would
> > >> > >> > > > > > > > > > go
> > >> > >> > > > > > > > > > > > > > > > unnoticed.
> > >> > >> > > > > > > > > > > > > > > >
> > >> > >> > > > > > > > > > > > > > > > -Carin
> > >> > >> > > > > > > > > > > > > > > >
> > >> > >> > > > > > > > > > > > > > > > On Tue, Aug 13, 2019 at 7:47 PM
> Pedro
> > >> > >> Larroy <
> > >> > >> > > > > > > > > > > > > > > pedro.larroy.lists@gmail.com
> > >> > >> > > > > > > > > > > > > > > > >
> > >> > >> > > > > > > > > > > > > > > > wrote:
> > >> > >> > > > > > > > > > > > > > > >
> > >> > >> > > > > > > > > > > > > > > > > Hi
> > >> > >> > > > > > > > > > > > > > > > >
> > >> > >> > > > > > > > > > > > > > > > > Seems we are hitting some
> problems
> > >> in CI.
> > >> > >> I
> > >> > >> > > > propose
> > >> > >> > > > > > the
> > >> > >> > > > > > > > > > > following
> > >> > >> > > > > > > > > > > > > > > action
> > >> > >> > > > > > > > > > > > > > > > > items to remedy the situation and
> > >> > >> accelerate
> > >> > >> > > turn
> > >> > >> > > > > > > around
> > >> > >> > > > > > > > > > times
> > >> > >> > > > > > > > > > > in
> > >> > >> > > > > > > > > > > > > CI,
> > >> > >> > > > > > > > > > > > > > > > > reduce cost, complexity and
> > >> probability of
> > >> > >> > > > failure
> > >> > >> > > > > > > > blocking
> > >> > >> > > > > > > > > > PRs
> > >> > >> > > > > > > > > > > > and
> > >> > >> > > > > > > > > > > > > > > > > frustrating developers:
> > >> > >> > > > > > > > > > > > > > > > >
> > >> > >> > > > > > > > > > > > > > > > > * Upgrade Windows visual studio
> > from
> > >> VS
> > >> > >> 2015 to
> > >> > >> > > > VS
> > >> > >> > > > > > > 2017.
> > >> > >> > > > > > > > > The
> > >> > >> > > > > > > > > > > > > > > > > build_windows.py infrastructure
> > >> should
> > >> > >> easily
> > >> > >> > > > work
> > >> > >> > > > > > with
> > >> > >> > > > > > > > the
> > >> > >> > > > > > > > > > new
> > >> > >> > > > > > > > > > > > > > > version.
> > >> > >> > > > > > > > > > > > > > > > > Currently some PRs are blocked by
> > >> this:
> > >> > >> > > > > > > > > > > > > > > > >
> > >> > >> > > > > > https://github.com/apache/incubator-mxnet/issues/13958
> > >> > >> > > > > > > > > > > > > > > > > * Move Gluon Model zoo tests to
> > >> nightly.
> > >> > >> > > Tracked
> > >> > >> > > > at
> > >> > >> > > > > > > > > > > > > > > > >
> > >> > >> > > > > > https://github.com/apache/incubator-mxnet/issues/15295
> > >> > >> > > > > > > > > > > > > > > > > * Move non-python bindings tests
> to
> > >> > >> nightly.
> > >> > >> > > If a
> > >> > >> > > > > > > commit
> > >> > >> > > > > > > > is
> > >> > >> > > > > > > > > > > > > touching
> > >> > >> > > > > > > > > > > > > > > > other
> > >> > >> > > > > > > > > > > > > > > > > bindings, the reviewer should ask
> > >> for a
> > >> > >> full
> > >> > >> > > run
> > >> > >> > > > > > which
> > >> > >> > > > > > > > can
> > >> > >> > > > > > > > > be
> > >> > >> > > > > > > > > > > > done
> > >> > >> > > > > > > > > > > > > > > > locally,
> > >> > >> > > > > > > > > > > > > > > > > use the label bot to trigger a
> full
> > >> CI
> > >> > >> build,
> > >> > >> > > or
> > >> > >> > > > > > defer
> > >> > >> > > > > > > to
> > >> > >> > > > > > > > > > > > nightly.
> > >> > >> > > > > > > > > > > > > > > > > * Provide a couple of basic
> sanity
> > >> > >> performance
> > >> > >> > > > > tests
> > >> > >> > > > > > on
> > >> > >> > > > > > > > > small
> > >> > >> > > > > > > > > > > > > models
> > >> > >> > > > > > > > > > > > > > > that
> > >> > >> > > > > > > > > > > > > > > > > are run on CI and can be echoed
> by
> > >> the
> > >> > >> label
> > >> > >> > > bot
> > >> > >> > > > > as a
> > >> > >> > > > > > > > > comment
> > >> > >> > > > > > > > > > > for
> > >> > >> > > > > > > > > > > > > > PRs.
> > >> > >> > > > > > > > > > > > > > > > > * Address unit tests that take
> more
> > >> than
> > >> > >> > > 10-20s,
> > >> > >> > > > > > > > streamline
> > >> > >> > > > > > > > > > > them
> > >> > >> > > > > > > > > > > > or
> > >> > >> > > > > > > > > > > > > > > move
> > >> > >> > > > > > > > > > > > > > > > > them to nightly if it can't be
> > done.
> > >> > >> > > > > > > > > > > > > > > > > * Open sourcing the remaining CI
> > >> > >> infrastructure
> > >> > >> > > > > > scripts
> > >> > >> > > > > > > > so
> > >> > >> > > > > > > > > > the
> > >> > >> > > > > > > > > > > > > > > community
> > >> > >> > > > > > > > > > > > > > > > > can contribute.
> > >> > >> > > > > > > > > > > > > > > > >
> > >> > >> > > > > > > > > > > > > > > > > I think our goal should be
> > turnaround
> > >> > >> under
> > >> > >> > > > 30min.
> > >> > >> > > > > > > > > > > > > > > > >
> > >> > >> > > > > > > > > > > > > > > > > I would also like to touch base
> > with
> > >> the
> > >> > >> > > > community
> > >> > >> > > > > > that
> > >> > >> > > > > > > > > some
> > >> > >> > > > > > > > > > > PRs
> > >> > >> > > > > > > > > > > > > are
> > >> > >> > > > > > > > > > > > > > > not
> > >> > >> > > > > > > > > > > > > > > > > being followed up by committers
> > >> asking for
> > >> > >> > > > changes.
> > >> > >> > > > > > For
> > >> > >> > > > > > > > > > example
> > >> > >> > > > > > > > > > > > > this
> > >> > >> > > > > > > > > > > > > > PR
> > >> > >> > > > > > > > > > > > > > > > is
> > >> > >> > > > > > > > > > > > > > > > > importtant and is hanging for a
> > long
> > >> time.
> > >> > >> > > > > > > > > > > > > > > > >
> > >> > >> > > > > > > > > > > > > > > > >
> > >> > >> > > > > https://github.com/apache/incubator-mxnet/pull/15051
> > >> > >> > > > > > > > > > > > > > > > >
> > >> > >> > > > > > > > > > > > > > > > > This is another, less important
> but
> > >> more
> > >> > >> > > trivial
> > >> > >> > > > to
> > >> > >> > > > > > > > review:
> > >> > >> > > > > > > > > > > > > > > > >
> > >> > >> > > > > > > > > > > > > > > > >
> > >> > >> > > > > https://github.com/apache/incubator-mxnet/pull/14940
> > >> > >> > > > > > > > > > > > > > > > >
> > >> > >> > > > > > > > > > > > > > > > > I think comitters requesting
> > changes
> > >> and
> > >> > >> not
> > >> > >> > > > > > folllowing
> > >> > >> > > > > > > > up
> > >> > >> > > > > > > > > in
> > >> > >> > > > > > > > > > > > > > > reasonable
> > >> > >> > > > > > > > > > > > > > > > > time is not healthy for the
> > project.
> > >> I
> > >> > >> suggest
> > >> > >> > > > > > > > configuring
> > >> > >> > > > > > > > > > > github
> > >> > >> > > > > > > > > > > > > > > > > Notifications for a good SNR and
> > >> > >> following up.
> > >> > >> > > > > > > > > > > > > > > > >
> > >> > >> > > > > > > > > > > > > > > > > Regards.
> > >> > >> > > > > > > > > > > > > > > > >
> > >> > >> > > > > > > > > > > > > > > > > Pedro.
> > >> > >> > > > > > > > > > > > > > > > >
> > >> > >> > > > > > > > > > > > > > > >
> > >> > >> > > > > > > > > > > > > > >
> > >> > >> > > > > > > > > > > > > >
> > >> > >> > > > > > > > > > > > >
> > >> > >> > > > > > > > > > > >
> > >> > >> > > > > > > > > > >
> > >> > >> > > > > > > > > >
> > >> > >> > > > > > > > >
> > >> > >> > > > > > > >
> > >> > >> > > > > > >
> > >> > >> > > > > >
> > >> > >> > > > >
> > >> > >> > > >
> > >> > >> > >
> > >> > >>
> > >> > >
> > >>
> > >
> >
>

Re: CI and PRs

Posted by Marco de Abreu <ma...@gmail.com>.

I've heard this request multiple times and so far, I'm having issues
understanding the direct correlation between having committer permissions
and being able to manage CI.

When I designed the CI, one of the tenets was maintainability and
accessbility for the community: I wanted to avoid that somebody needs
certain privileges in order to execute regular actions. The result was the
strong usage of Jenkinsfiles, Dockerfiles and the runtime functions. The
combination of these techniques allowed somebody to create a job from the
process flow level (Jenkinsfile), over the environment level (Dockerfile)
to the individual action level (runtime functions). This design basically
gives the community full access over the entire flow.

The jobs that are configured to source only Jenkinsfile. Jenkins supports a
lot of different ways how to define pipelines, but I have made sure to
encourage everybody to use only Jenkinsfiles. This makes sure that no
configuration is done in the web-interface. This firs of all alleviates the
permission issue since there's literally no config in the web interface and
second it allows auditing since all changes have to be done in the MXNet
GitHub repository.

Committers have elevated permissions in Jenkins. These contain the
permission to run, stop and configure jobs. All other permissions are
restricted to system administrators for the sake of ensuring stability of
the system. On the dev-CI on the other hand, we're happy to add people so
they can experiment as much as they want. The transition to prod-CI is then
assisted by me to ensure smooth operations and adhering to the best
practices (like using our Jenkinsfiles and Docker structure, for example).

The only case where somebody would need elevated permissions is if they
would like to change system settings. But at that point, we're talking
about instance settings and AWS account configuration. Since that now
reaches into the next permission level, which is restricted to the donor of
the CI system - Amazon Web Services - this is something that not even PMC
members will receive. The same policy is in place for the official Apache
CI: Committers/PMCs can configure their job, but don't have system level
access to either Jenkins or the underlying AWS account for obvious reasons.
We're trying to stay in line with the same policy, but in the past I've
granted Jenkins administrator access to people who required elevated access
to properly do their job - Aaron Markham with regards to the website being
one example.

This means that the only case when a contributor needs committer assistance
is the moment when somebody would like to set up a new Jenkins job. It
would be a matter of setting up the job to point to the persons branch -
Jenkins will then automatically pull the Jenkinsfile and thus no further
configuration is necessary and updates are directly consumed. Such a
request IMO is on the same level as us having to cut a ticket to Apache
INFRA to create a new job.

With regards to speed: So far, I was the only "CI-Person" with committer
privileges. But due to our 4-eye-rule for PRs, I wasn't able to merge my
own changes anyways - most of them were reviewed by Sheng, for example. In
an emergency, I'm sure that somebody can be reached to assist since we
currently have 39 PMC members and 20 committers spanning multiple
timezones.

For these reasons, I don't agree with the sentiment that contributors are
unable to effectively work with the CI system unless they have committer
privileges.

Best regards,
Marco


On Fri, Aug 23, 2019 at 10:33 AM Pedro Larroy <pe...@gmail.com>
wrote:

> As Marco has open sourced the bulk of the CI infrastructure donated from
> Amazon to the community, I would like to raise the recommendation that the
> community takes action to help volunteers working on the CI have a better
> experience. In the past, it's my impression that there hasn't been much
> action granting PMC or committer privileges to engineers volunteering to
> help CI other than Marco. This would encourage more contributions and help
> expedite critical fixes and corrective actions. I think this has not
> properly enabled those individuals to be as effective as they could, as
> well as the lack of recognition for such a critical activity. I'm not sure
> about the cause but I believe this is something that should be rectified
> for future contributions and help on the CI front if improvements are
> desired.
>
> In spanish we have a saying: "es de bien nacido ser agradecido".
>
> Pedro.
>
> On Fri, Aug 16, 2019 at 4:03 PM Pedro Larroy <pedro.larroy.lists@gmail.com
> >
> wrote:
>
> > Hi Aaron. This is difficult to diagnose, because I don't know what to do
> > when the hash of the layer in docker doesn't match and decides to rebuild
> > it. the r script seems not to have changed. I have observed this in the
> > past and I think is due to bugs in docker.   Maybe Kellen is able to give
> > some tips here.
> >
> > In this case you should use -R which is already in master. (you can
> always
> > copy the script on top if you are in an older revision).
> >
> > Another thing that worked for me in the past was to completely nuke the
> > docker cache, so it redonwloads from the CI repo. After that it worked
> fine
> > in some cases.
> >
> > These two workarounds are not ideal, but should unblock you.
> >
> > Pedro.
> >
> > On Fri, Aug 16, 2019 at 11:39 AM Aaron Markham <
> aaron.s.markham@gmail.com>
> > wrote:
> >
> >> Is -R already in there?
> >>
> >> Here's an example of it happening to me right now.... I am making
> >> minor changes to the runtime_functions logic for handling the R docs
> >> output. I pull the fix, then run the container, but I see the R deps
> >> layer re-running. I didn't touch that. Why it that running again?
> >>
> >> From https://github.com/aaronmarkham/incubator-mxnet
> >>    f71cc6d..deec6aa  new_website_pipeline_2_aaron_rdocs ->
> >> origin/new_website_pipeline_2_aaron_rdocs
> >> Updating f71cc6d..deec6aa
> >> Fast-forward
> >>  ci/docker/runtime_functions.sh | 6 +++---
> >>  1 file changed, 3 insertions(+), 3 deletions(-)
> >> (base) ubuntu@ip-172-31-47-182:~/aaron/ci$ ./build.py
> >> --docker-registry mxnetci --platform ubuntu_cpu_r
> >> --docker-build-retries 3 --shm-size 500m /work/runtime_functions.sh
> >> build_r_docs
> >> build.py: 2019-08-16 18:34:44,639Z INFO MXNet container based build
> tool.
> >> build.py: 2019-08-16 18:34:44,641Z INFO Docker cache download is
> >> enabled from registry mxnetci
> >> build.py: 2019-08-16 18:34:44,641Z INFO Loading Docker cache for
> >> mxnetci/build.ubuntu_cpu_r from mxnetci
> >> Using default tag: latest
> >> latest: Pulling from mxnetci/build.ubuntu_cpu_r
> >> Digest:
> >> sha256:7dc515c288b3e66d96920eb8975f985a501bb57f70595fbe0cb1c4fcd8d4184b
> >> Status: Downloaded newer image for mxnetci/build.ubuntu_cpu_r:latest
> >> build.py: 2019-08-16 18:34:44,807Z INFO Successfully pulled docker cache
> >> build.py: 2019-08-16 18:34:44,807Z INFO Building docker container
> >> tagged 'mxnetci/build.ubuntu_cpu_r' with docker
> >> build.py: 2019-08-16 18:34:44,807Z INFO Running command: 'docker build
> >> -f docker/Dockerfile.build.ubuntu_cpu_r --build-arg USER_ID=1000
> >> --build-arg GROUP_ID=1000 --cache-from mxnetci/build.ubuntu_cpu_r -t
> >> mxnetci/build.ubuntu_cpu_r docker'
> >> Sending build context to Docker daemon  289.8kB
> >> Step 1/15 : FROM ubuntu:16.04
> >>  ---> 5e13f8dd4c1a
> >> Step 2/15 : WORKDIR /work/deps
> >>  ---> Using cache
> >>  ---> afc2a135945d
> >> Step 3/15 : COPY install/ubuntu_core.sh /work/
> >>  ---> Using cache
> >>  ---> da2b2e7f35e1
> >> Step 4/15 : RUN /work/ubuntu_core.sh
> >>  ---> Using cache
> >>  ---> d1e88b26b1d2
> >> Step 5/15 : COPY install/deb_ubuntu_ccache.sh /work/
> >>  ---> Using cache
> >>  ---> 3aa97dea3b7b
> >> Step 6/15 : RUN /work/deb_ubuntu_ccache.sh
> >>  ---> Using cache
> >>  ---> bec503f1d149
> >> Step 7/15 : COPY install/ubuntu_r.sh /work/
> >>  ---> c5e77c38031d
> >> Step 8/15 : COPY install/r.gpg /work/
> >>  ---> d8cdbf015d2b
> >> Step 9/15 : RUN /work/ubuntu_r.sh
> >>  ---> Running in c6c90b9e1538
> >> ++ dirname /work/ubuntu_r.sh
> >> + cd /work
> >> + echo 'deb http://cran.rstudio.com/bin/linux/ubuntu trusty/'
> >> + apt-key add r.gpg
> >> OK
> >> + add-apt-repository 'deb [arch=amd64,i386]
> >> https://cran.rstudio.com/bin/linux/ubuntu xenial/'
> >> + apt-get update
> >> Ign:1 http://cran.rstudio.com/bin/linux/ubuntu trusty/ InRelease
> >>
> >> On Fri, Aug 16, 2019 at 11:32 AM Pedro Larroy
> >> <pe...@gmail.com> wrote:
> >> >
> >> > Also, I forgot, another workaround is that I added the -R flag to the
> >> build
> >> > logic (build.py) so the container is not rebuilt for manual use.
> >> >
> >> > On Fri, Aug 16, 2019 at 11:18 AM Pedro Larroy <
> >> pedro.larroy.lists@gmail.com>
> >> > wrote:
> >> >
> >> > >
> >> > > Hi Aaron.
> >> > >
> >> > > As Marco explained, if you are in master the cache usually works,
> >> there's
> >> > > two issues that I have observed:
> >> > >
> >> > > 1 - Docker doesn't automatically pull the base image (ex.
> >> ubuntu:16.04) so
> >> > > if your cached base which is used in the FROM statement becomes
> >> outdated
> >> > > your caching won't work. (Using docker pull ubuntu:16.04) or the
> base
> >> > > images from the container helps with this.
> >> > >
> >> > > 2 - There's another situation where the above doesn't help which
> >> seems to
> >> > > be an unidentified issue with the docker cache:
> >> > > https://github.com/docker/docker.github.io/issues/8886
> >> > >
> >> > > We can get a short term workaround for #1 by explicitly pulling
> bases
> >> from
> >> > > the script, but I think docker should do it when using --cache-from
> so
> >> > > maybe contributing a patch to docker would the best approach.
> >> > >
> >> > > Pedro
> >> > >
> >> > > On Thu, Aug 15, 2019 at 7:06 PM Aaron Markham <
> >> aaron.s.markham@gmail.com>
> >> > > wrote:
> >> > >
> >> > >> When you create a new Dockerfile and use that on CI, it doesn't
> seem
> >> > >> to cache some of the steps... like this:
> >> > >>
> >> > >> Step 13/15 : RUN /work/ubuntu_docs.sh
> >> > >>  ---> Running in a1e522f3283b
> >> > >>  [91m+ echo 'Installing dependencies...'
> >> > >> + apt-get update
> >> > >>  [0mInstalling dependencies.
> >> > >>
> >> > >> Or this....
> >> > >>
> >> > >> Step 4/13 : RUN /work/ubuntu_core.sh
> >> > >>  ---> Running in e7882d7aa750
> >> > >>  [91m+ apt-get update
> >> > >>
> >> > >> I get if I was changing those scripts, but then I'd think it should
> >> > >> cache after running it once... but, no.
> >> > >>
> >> > >>
> >> > >> On Thu, Aug 15, 2019 at 3:51 PM Marco de Abreu <
> >> marco.g.abreu@gmail.com>
> >> > >> wrote:
> >> > >> >
> >> > >> > Do I understand it correctly that you are saying that the Docker
> >> cache
> >> > >> > doesn't work properly and regularly reinstalls dependencies? Or
> do
> >> you
> >> > >> mean
> >> > >> > that you only have cache misses when you modify the dependencies
> -
> >> which
> >> > >> > would be expected?
> >> > >> >
> >> > >> > -Marco
> >> > >> >
> >> > >> > On Fri, Aug 16, 2019 at 12:48 AM Aaron Markham <
> >> > >> aaron.s.markham@gmail.com>
> >> > >> > wrote:
> >> > >> >
> >> > >> > > Many of the CI pipelines follow this pattern:
> >> > >> > > Load ubuntu 16.04, install deps, build mxnet, then run some
> >> tests. Why
> >> > >> > > repeat steps 1-3 over and over?
> >> > >> > >
> >> > >> > > Now, some tests use a stashed binary and docker cache. And I
> see
> >> this
> >> > >> work
> >> > >> > > locally, but for the most part, on CI, you're gonna sit
> through a
> >> > >> > > dependency install.
> >> > >> > >
> >> > >> > > I noticed that almost all jobs use an ubuntu setup that is
> fully
> >> > >> loaded.
> >> > >> > > Without cache, it can take 10 or more minutes to build.  So I
> >> made a
> >> > >> lite
> >> > >> > > version. Takes only a few minutes instead.
> >> > >> > >
> >> > >> > > In some cases archiving worked great to share across pipelines,
> >> but as
> >> > >> > > Marco mentioned we need a storage solution to make that happen.
> >> We
> >> > >> can't
> >> > >> > > archive every intermediate artifact for each PR.
> >> > >> > >
> >> > >> > > On Thu, Aug 15, 2019, 13:47 Pedro Larroy <
> >> > >> pedro.larroy.lists@gmail.com>
> >> > >> > > wrote:
> >> > >> > >
> >> > >> > > > Hi Aaron. Why speeds things up? What's the difference?
> >> > >> > > >
> >> > >> > > > Pedro.
> >> > >> > > >
> >> > >> > > > On Wed, Aug 14, 2019 at 8:39 PM Aaron Markham <
> >> > >> aaron.s.markham@gmail.com
> >> > >> > > >
> >> > >> > > > wrote:
> >> > >> > > >
> >> > >> > > > > The PRs Thomas and I are working on for the new docs and
> >> website
> >> > >> share
> >> > >> > > > the
> >> > >> > > > > mxnet binary in the new CI pipelines we made. Speeds things
> >> up a
> >> > >> lot.
> >> > >> > > > >
> >> > >> > > > > On Wed, Aug 14, 2019, 18:16 Chris Olivier <
> >> cjolivier01@gmail.com>
> >> > >> > > wrote:
> >> > >> > > > >
> >> > >> > > > > > I see it done daily now, and while I can’t share all the
> >> > >> details,
> >> > >> > > it’s
> >> > >> > > > > not
> >> > >> > > > > > an incredibly complex thing, and involves not much more
> >> than
> >> > >> nfs/efs
> >> > >> > > > > > sharing and remote ssh commands.  All it takes is a
> little
> >> > >> ingenuity
> >> > >> > > > and
> >> > >> > > > > > some imagination.
> >> > >> > > > > >
> >> > >> > > > > > On Wed, Aug 14, 2019 at 4:31 PM Pedro Larroy <
> >> > >> > > > > pedro.larroy.lists@gmail.com
> >> > >> > > > > > >
> >> > >> > > > > > wrote:
> >> > >> > > > > >
> >> > >> > > > > > > Sounds good in theory. I think there are complex
> details
> >> with
> >> > >> > > regards
> >> > >> > > > > of
> >> > >> > > > > > > resource sharing during parallel execution. Still I
> think
> >> > >> both ways
> >> > >> > > > can
> >> > >> > > > > > be
> >> > >> > > > > > > explored. I think some tests run for unreasonably long
> >> times
> >> > >> for
> >> > >> > > what
> >> > >> > > > > > they
> >> > >> > > > > > > are doing. We already scale parts of the pipeline
> >> horizontally
> >> > >> > > across
> >> > >> > > > > > > workers.
> >> > >> > > > > > >
> >> > >> > > > > > >
> >> > >> > > > > > > On Wed, Aug 14, 2019 at 5:12 PM Chris Olivier <
> >> > >> > > > cjolivier01@apache.org>
> >> > >> > > > > > > wrote:
> >> > >> > > > > > >
> >> > >> > > > > > > > +1
> >> > >> > > > > > > >
> >> > >> > > > > > > > Rather than remove tests (which doesn’t scale as a
> >> > >> solution), why
> >> > >> > > > not
> >> > >> > > > > > > scale
> >> > >> > > > > > > > them horizontally so that they finish more quickly?
> >> Across
> >> > >> > > > processes
> >> > >> > > > > or
> >> > >> > > > > > > > even on a pool of machines that aren’t necessarily
> the
> >> build
> >> > >> > > > machine?
> >> > >> > > > > > > >
> >> > >> > > > > > > > On Wed, Aug 14, 2019 at 12:03 PM Marco de Abreu <
> >> > >> > > > > > marco.g.abreu@gmail.com
> >> > >> > > > > > > >
> >> > >> > > > > > > > wrote:
> >> > >> > > > > > > >
> >> > >> > > > > > > > > With regards to time I rather prefer us spending a
> >> bit
> >> > >> more
> >> > >> > > time
> >> > >> > > > on
> >> > >> > > > > > > > > maintenance than somebody running into an error
> that
> >> > >> could've
> >> > >> > > > been
> >> > >> > > > > > > caught
> >> > >> > > > > > > > > with a test.
> >> > >> > > > > > > > >
> >> > >> > > > > > > > > I mean, our Publishing pipeline for Scala GPU has
> >> been
> >> > >> broken
> >> > >> > > for
> >> > >> > > > > > quite
> >> > >> > > > > > > > > some time now, but nobody noticed that. Basically
> my
> >> > >> stance on
> >> > >> > > > that
> >> > >> > > > > > > > matter
> >> > >> > > > > > > > > is that as soon as something is not blocking, you
> can
> >> > >> also just
> >> > >> > > > > > > > deactivate
> >> > >> > > > > > > > > it since you don't have a forcing function in an
> open
> >> > >> source
> >> > >> > > > > project.
> >> > >> > > > > > > > > People will rarely come back and fix the errors of
> >> some
> >> > >> nightly
> >> > >> > > > > test
> >> > >> > > > > > > that
> >> > >> > > > > > > > > they introduced.
> >> > >> > > > > > > > >
> >> > >> > > > > > > > > -Marco
> >> > >> > > > > > > > >
> >> > >> > > > > > > > > Carin Meier <ca...@gmail.com> schrieb am Mi.,
> >> 14.
> >> > >> Aug.
> >> > >> > > > 2019,
> >> > >> > > > > > > 21:59:
> >> > >> > > > > > > > >
> >> > >> > > > > > > > > > If a language binding test is failing for a not
> >> > >> important
> >> > >> > > > reason,
> >> > >> > > > > > > then
> >> > >> > > > > > > > it
> >> > >> > > > > > > > > > is too brittle and needs to be fixed (we have
> fixed
> >> > >> some of
> >> > >> > > > these
> >> > >> > > > > > > with
> >> > >> > > > > > > > > the
> >> > >> > > > > > > > > > Clojure package [1]).
> >> > >> > > > > > > > > > But in general, if we thinking of the MXNet
> >> project as
> >> > >> one
> >> > >> > > > > project
> >> > >> > > > > > > that
> >> > >> > > > > > > > > is
> >> > >> > > > > > > > > > across all the language bindings, then we want to
> >> know
> >> > >> if
> >> > >> > > some
> >> > >> > > > > > > > > fundamental
> >> > >> > > > > > > > > > code change is going to break a downstream
> package.
> >> > >> > > > > > > > > > I can't speak for all the high level package
> >> binding
> >> > >> > > > maintainers,
> >> > >> > > > > > but
> >> > >> > > > > > > > I'm
> >> > >> > > > > > > > > > always happy to pitch in to provide code fixes to
> >> help
> >> > >> the
> >> > >> > > base
> >> > >> > > > > PR
> >> > >> > > > > > > get
> >> > >> > > > > > > > > > green.
> >> > >> > > > > > > > > >
> >> > >> > > > > > > > > > The time costs to maintain such a large CI
> project
> >> > >> obviously
> >> > >> > > > > needs
> >> > >> > > > > > to
> >> > >> > > > > > > > be
> >> > >> > > > > > > > > > considered as well.
> >> > >> > > > > > > > > >
> >> > >> > > > > > > > > > [1]
> >> > >> https://github.com/apache/incubator-mxnet/pull/15579
> >> > >> > > > > > > > > >
> >> > >> > > > > > > > > > On Wed, Aug 14, 2019 at 3:48 PM Pedro Larroy <
> >> > >> > > > > > > > > pedro.larroy.lists@gmail.com
> >> > >> > > > > > > > > > >
> >> > >> > > > > > > > > > wrote:
> >> > >> > > > > > > > > >
> >> > >> > > > > > > > > > > From what I have seen Clojure is 15 minutes,
> >> which I
> >> > >> think
> >> > >> > > is
> >> > >> > > > > > > > > reasonable.
> >> > >> > > > > > > > > > > The only question is that when a binding such
> as
> >> R,
> >> > >> Perl or
> >> > >> > > > > > Clojure
> >> > >> > > > > > > > > > fails,
> >> > >> > > > > > > > > > > some devs are a bit confused about how to fix
> >> them
> >> > >> since
> >> > >> > > they
> >> > >> > > > > are
> >> > >> > > > > > > not
> >> > >> > > > > > > > > > > familiar with the testing tools and the
> language.
> >> > >> > > > > > > > > > >
> >> > >> > > > > > > > > > > On Wed, Aug 14, 2019 at 11:57 AM Carin Meier <
> >> > >> > > > > > carinmeier@gmail.com
> >> > >> > > > > > > >
> >> > >> > > > > > > > > > wrote:
> >> > >> > > > > > > > > > >
> >> > >> > > > > > > > > > > > Great idea Marco! Anything that you think
> >> would be
> >> > >> > > valuable
> >> > >> > > > > to
> >> > >> > > > > > > > share
> >> > >> > > > > > > > > > > would
> >> > >> > > > > > > > > > > > be good. The duration of each node in the
> test
> >> stage
> >> > >> > > sounds
> >> > >> > > > > > like
> >> > >> > > > > > > a
> >> > >> > > > > > > > > good
> >> > >> > > > > > > > > > > > start.
> >> > >> > > > > > > > > > > >
> >> > >> > > > > > > > > > > > - Carin
> >> > >> > > > > > > > > > > >
> >> > >> > > > > > > > > > > > On Wed, Aug 14, 2019 at 2:48 PM Marco de
> Abreu
> >> <
> >> > >> > > > > > > > > > marco.g.abreu@gmail.com>
> >> > >> > > > > > > > > > > > wrote:
> >> > >> > > > > > > > > > > >
> >> > >> > > > > > > > > > > > > Hi,
> >> > >> > > > > > > > > > > > >
> >> > >> > > > > > > > > > > > > we record a bunch of metrics about run
> >> statistics
> >> > >> (down
> >> > >> > > > to
> >> > >> > > > > > the
> >> > >> > > > > > > > > > duration
> >> > >> > > > > > > > > > > > of
> >> > >> > > > > > > > > > > > > every individual step). If you tell me
> which
> >> ones
> >> > >> > > you're
> >> > >> > > > > > > > > particularly
> >> > >> > > > > > > > > > > > > interested in (probably total duration of
> >> each
> >> > >> node in
> >> > >> > > > the
> >> > >> > > > > > test
> >> > >> > > > > > > > > > stage),
> >> > >> > > > > > > > > > > > I'm
> >> > >> > > > > > > > > > > > > happy to provide them.
> >> > >> > > > > > > > > > > > >
> >> > >> > > > > > > > > > > > > Dimensions are (in hierarchical order):
> >> > >> > > > > > > > > > > > > - job
> >> > >> > > > > > > > > > > > > - branch
> >> > >> > > > > > > > > > > > > - stage
> >> > >> > > > > > > > > > > > > - node
> >> > >> > > > > > > > > > > > > - step
> >> > >> > > > > > > > > > > > >
> >> > >> > > > > > > > > > > > > Unfortunately I don't have the possibility
> to
> >> > >> export
> >> > >> > > them
> >> > >> > > > > > since
> >> > >> > > > > > > > we
> >> > >> > > > > > > > > > > store
> >> > >> > > > > > > > > > > > > them in CloudWatch Metrics which afaik
> >> doesn't
> >> > >> offer
> >> > >> > > raw
> >> > >> > > > > > > exports.
> >> > >> > > > > > > > > > > > >
> >> > >> > > > > > > > > > > > > Best regards,
> >> > >> > > > > > > > > > > > > Marco
> >> > >> > > > > > > > > > > > >
> >> > >> > > > > > > > > > > > > Carin Meier <ca...@gmail.com> schrieb
> >> am
> >> > >> Mi., 14.
> >> > >> > > > > Aug.
> >> > >> > > > > > > > 2019,
> >> > >> > > > > > > > > > > 19:43:
> >> > >> > > > > > > > > > > > >
> >> > >> > > > > > > > > > > > > > I would prefer to keep the language
> >> binding in
> >> > >> the PR
> >> > >> > > > > > > process.
> >> > >> > > > > > > > > > > Perhaps
> >> > >> > > > > > > > > > > > we
> >> > >> > > > > > > > > > > > > > could do some analytics to see how much
> >> each of
> >> > >> the
> >> > >> > > > > > language
> >> > >> > > > > > > > > > bindings
> >> > >> > > > > > > > > > > > is
> >> > >> > > > > > > > > > > > > > contributing to overall run time.
> >> > >> > > > > > > > > > > > > > If we have some metrics on that, maybe we
> >> can
> >> > >> come up
> >> > >> > > > > with
> >> > >> > > > > > a
> >> > >> > > > > > > > > > > guideline
> >> > >> > > > > > > > > > > > of
> >> > >> > > > > > > > > > > > > > how much time each should take. Another
> >> > >> possibility
> >> > >> > > is
> >> > >> > > > > > > leverage
> >> > >> > > > > > > > > the
> >> > >> > > > > > > > > > > > > > parallel builds more.
> >> > >> > > > > > > > > > > > > >
> >> > >> > > > > > > > > > > > > > On Wed, Aug 14, 2019 at 1:30 PM Pedro
> >> Larroy <
> >> > >> > > > > > > > > > > > > pedro.larroy.lists@gmail.com
> >> > >> > > > > > > > > > > > > > >
> >> > >> > > > > > > > > > > > > > wrote:
> >> > >> > > > > > > > > > > > > >
> >> > >> > > > > > > > > > > > > > > Hi Carin.
> >> > >> > > > > > > > > > > > > > >
> >> > >> > > > > > > > > > > > > > > That's a good point, all things
> >> considered
> >> > >> would
> >> > >> > > your
> >> > >> > > > > > > > > preference
> >> > >> > > > > > > > > > be
> >> > >> > > > > > > > > > > > to
> >> > >> > > > > > > > > > > > > > keep
> >> > >> > > > > > > > > > > > > > > the Clojure tests as part of the PR
> >> process
> >> > >> or in
> >> > >> > > > > > Nightly?
> >> > >> > > > > > > > > > > > > > > Some options are having notifications
> >> here or
> >> > >> in
> >> > >> > > > slack.
> >> > >> > > > > > But
> >> > >> > > > > > > > if
> >> > >> > > > > > > > > we
> >> > >> > > > > > > > > > > > think
> >> > >> > > > > > > > > > > > > > > breakages would go unnoticed maybe is
> >> not a
> >> > >> good
> >> > >> > > idea
> >> > >> > > > > to
> >> > >> > > > > > > > fully
> >> > >> > > > > > > > > > > remove
> >> > >> > > > > > > > > > > > > > > bindings from the PR process and just
> >> > >> streamline
> >> > >> > > the
> >> > >> > > > > > > process.
> >> > >> > > > > > > > > > > > > > >
> >> > >> > > > > > > > > > > > > > > Pedro.
> >> > >> > > > > > > > > > > > > > >
> >> > >> > > > > > > > > > > > > > > On Wed, Aug 14, 2019 at 5:09 AM Carin
> >> Meier <
> >> > >> > > > > > > > > > carinmeier@gmail.com>
> >> > >> > > > > > > > > > > > > > wrote:
> >> > >> > > > > > > > > > > > > > >
> >> > >> > > > > > > > > > > > > > > > Before any binding tests are moved to
> >> > >> nightly, I
> >> > >> > > > > think
> >> > >> > > > > > we
> >> > >> > > > > > > > > need
> >> > >> > > > > > > > > > to
> >> > >> > > > > > > > > > > > > > figure
> >> > >> > > > > > > > > > > > > > > > out how the community can get proper
> >> > >> > > notifications
> >> > >> > > > of
> >> > >> > > > > > > > failure
> >> > >> > > > > > > > > > and
> >> > >> > > > > > > > > > > > > > success
> >> > >> > > > > > > > > > > > > > > > on those nightly runs. Otherwise, I
> >> think
> >> > >> that
> >> > >> > > > > > breakages
> >> > >> > > > > > > > > would
> >> > >> > > > > > > > > > go
> >> > >> > > > > > > > > > > > > > > > unnoticed.
> >> > >> > > > > > > > > > > > > > > >
> >> > >> > > > > > > > > > > > > > > > -Carin
> >> > >> > > > > > > > > > > > > > > >
> >> > >> > > > > > > > > > > > > > > > On Tue, Aug 13, 2019 at 7:47 PM Pedro
> >> > >> Larroy <
> >> > >> > > > > > > > > > > > > > > pedro.larroy.lists@gmail.com
> >> > >> > > > > > > > > > > > > > > > >
> >> > >> > > > > > > > > > > > > > > > wrote:
> >> > >> > > > > > > > > > > > > > > >
> >> > >> > > > > > > > > > > > > > > > > Hi
> >> > >> > > > > > > > > > > > > > > > >
> >> > >> > > > > > > > > > > > > > > > > Seems we are hitting some problems
> >> in CI.
> >> > >> I
> >> > >> > > > propose
> >> > >> > > > > > the
> >> > >> > > > > > > > > > > following
> >> > >> > > > > > > > > > > > > > > action
> >> > >> > > > > > > > > > > > > > > > > items to remedy the situation and
> >> > >> accelerate
> >> > >> > > turn
> >> > >> > > > > > > around
> >> > >> > > > > > > > > > times
> >> > >> > > > > > > > > > > in
> >> > >> > > > > > > > > > > > > CI,
> >> > >> > > > > > > > > > > > > > > > > reduce cost, complexity and
> >> probability of
> >> > >> > > > failure
> >> > >> > > > > > > > blocking
> >> > >> > > > > > > > > > PRs
> >> > >> > > > > > > > > > > > and
> >> > >> > > > > > > > > > > > > > > > > frustrating developers:
> >> > >> > > > > > > > > > > > > > > > >
> >> > >> > > > > > > > > > > > > > > > > * Upgrade Windows visual studio
> from
> >> VS
> >> > >> 2015 to
> >> > >> > > > VS
> >> > >> > > > > > > 2017.
> >> > >> > > > > > > > > The
> >> > >> > > > > > > > > > > > > > > > > build_windows.py infrastructure
> >> should
> >> > >> easily
> >> > >> > > > work
> >> > >> > > > > > with
> >> > >> > > > > > > > the
> >> > >> > > > > > > > > > new
> >> > >> > > > > > > > > > > > > > > version.
> >> > >> > > > > > > > > > > > > > > > > Currently some PRs are blocked by
> >> this:
> >> > >> > > > > > > > > > > > > > > > >
> >> > >> > > > > > https://github.com/apache/incubator-mxnet/issues/13958
> >> > >> > > > > > > > > > > > > > > > > * Move Gluon Model zoo tests to
> >> nightly.
> >> > >> > > Tracked
> >> > >> > > > at
> >> > >> > > > > > > > > > > > > > > > >
> >> > >> > > > > > https://github.com/apache/incubator-mxnet/issues/15295
> >> > >> > > > > > > > > > > > > > > > > * Move non-python bindings tests to
> >> > >> nightly.
> >> > >> > > If a
> >> > >> > > > > > > commit
> >> > >> > > > > > > > is
> >> > >> > > > > > > > > > > > > touching
> >> > >> > > > > > > > > > > > > > > > other
> >> > >> > > > > > > > > > > > > > > > > bindings, the reviewer should ask
> >> for a
> >> > >> full
> >> > >> > > run
> >> > >> > > > > > which
> >> > >> > > > > > > > can
> >> > >> > > > > > > > > be
> >> > >> > > > > > > > > > > > done
> >> > >> > > > > > > > > > > > > > > > locally,
> >> > >> > > > > > > > > > > > > > > > > use the label bot to trigger a full
> >> CI
> >> > >> build,
> >> > >> > > or
> >> > >> > > > > > defer
> >> > >> > > > > > > to
> >> > >> > > > > > > > > > > > nightly.
> >> > >> > > > > > > > > > > > > > > > > * Provide a couple of basic sanity
> >> > >> performance
> >> > >> > > > > tests
> >> > >> > > > > > on
> >> > >> > > > > > > > > small
> >> > >> > > > > > > > > > > > > models
> >> > >> > > > > > > > > > > > > > > that
> >> > >> > > > > > > > > > > > > > > > > are run on CI and can be echoed by
> >> the
> >> > >> label
> >> > >> > > bot
> >> > >> > > > > as a
> >> > >> > > > > > > > > comment
> >> > >> > > > > > > > > > > for
> >> > >> > > > > > > > > > > > > > PRs.
> >> > >> > > > > > > > > > > > > > > > > * Address unit tests that take more
> >> than
> >> > >> > > 10-20s,
> >> > >> > > > > > > > streamline
> >> > >> > > > > > > > > > > them
> >> > >> > > > > > > > > > > > or
> >> > >> > > > > > > > > > > > > > > move
> >> > >> > > > > > > > > > > > > > > > > them to nightly if it can't be
> done.
> >> > >> > > > > > > > > > > > > > > > > * Open sourcing the remaining CI
> >> > >> infrastructure
> >> > >> > > > > > scripts
> >> > >> > > > > > > > so
> >> > >> > > > > > > > > > the
> >> > >> > > > > > > > > > > > > > > community
> >> > >> > > > > > > > > > > > > > > > > can contribute.
> >> > >> > > > > > > > > > > > > > > > >
> >> > >> > > > > > > > > > > > > > > > > I think our goal should be
> turnaround
> >> > >> under
> >> > >> > > > 30min.
> >> > >> > > > > > > > > > > > > > > > >
> >> > >> > > > > > > > > > > > > > > > > I would also like to touch base
> with
> >> the
> >> > >> > > > community
> >> > >> > > > > > that
> >> > >> > > > > > > > > some
> >> > >> > > > > > > > > > > PRs
> >> > >> > > > > > > > > > > > > are
> >> > >> > > > > > > > > > > > > > > not
> >> > >> > > > > > > > > > > > > > > > > being followed up by committers
> >> asking for
> >> > >> > > > changes.
> >> > >> > > > > > For
> >> > >> > > > > > > > > > example
> >> > >> > > > > > > > > > > > > this
> >> > >> > > > > > > > > > > > > > PR
> >> > >> > > > > > > > > > > > > > > > is
> >> > >> > > > > > > > > > > > > > > > > importtant and is hanging for a
> long
> >> time.
> >> > >> > > > > > > > > > > > > > > > >
> >> > >> > > > > > > > > > > > > > > > >
> >> > >> > > > > https://github.com/apache/incubator-mxnet/pull/15051
> >> > >> > > > > > > > > > > > > > > > >
> >> > >> > > > > > > > > > > > > > > > > This is another, less important but
> >> more
> >> > >> > > trivial
> >> > >> > > > to
> >> > >> > > > > > > > review:
> >> > >> > > > > > > > > > > > > > > > >
> >> > >> > > > > > > > > > > > > > > > >
> >> > >> > > > > https://github.com/apache/incubator-mxnet/pull/14940
> >> > >> > > > > > > > > > > > > > > > >
> >> > >> > > > > > > > > > > > > > > > > I think comitters requesting
> changes
> >> and
> >> > >> not
> >> > >> > > > > > folllowing
> >> > >> > > > > > > > up
> >> > >> > > > > > > > > in
> >> > >> > > > > > > > > > > > > > > reasonable
> >> > >> > > > > > > > > > > > > > > > > time is not healthy for the
> project.
> >> I
> >> > >> suggest
> >> > >> > > > > > > > configuring
> >> > >> > > > > > > > > > > github
> >> > >> > > > > > > > > > > > > > > > > Notifications for a good SNR and
> >> > >> following up.
> >> > >> > > > > > > > > > > > > > > > >
> >> > >> > > > > > > > > > > > > > > > > Regards.
> >> > >> > > > > > > > > > > > > > > > >
> >> > >> > > > > > > > > > > > > > > > > Pedro.
> >> > >> > > > > > > > > > > > > > > > >
> >> > >> > > > > > > > > > > > > > > >
> >> > >> > > > > > > > > > > > > > >
> >> > >> > > > > > > > > > > > > >
> >> > >> > > > > > > > > > > > >
> >> > >> > > > > > > > > > > >
> >> > >> > > > > > > > > > >
> >> > >> > > > > > > > > >
> >> > >> > > > > > > > >
> >> > >> > > > > > > >
> >> > >> > > > > > >
> >> > >> > > > > >
> >> > >> > > > >
> >> > >> > > >
> >> > >> > >
> >> > >>
> >> > >
> >>
> >
>

Re: CI and PRs

Posted by Pedro Larroy <pe...@gmail.com>.

As Marco has open sourced the bulk of the CI infrastructure donated from
Amazon to the community, I would like to raise the recommendation that the
community takes action to help volunteers working on the CI have a better
experience. In the past, it's my impression that there hasn't been much
action granting PMC or committer privileges to engineers volunteering to
help CI other than Marco. This would encourage more contributions and help
expedite critical fixes and corrective actions. I think this has not
properly enabled those individuals to be as effective as they could, as
well as the lack of recognition for such a critical activity. I'm not sure
about the cause but I believe this is something that should be rectified
for future contributions and help on the CI front if improvements are
desired.

In spanish we have a saying: "es de bien nacido ser agradecido".

Pedro.

On Fri, Aug 16, 2019 at 4:03 PM Pedro Larroy <pe...@gmail.com>
wrote:

> Hi Aaron. This is difficult to diagnose, because I don't know what to do
> when the hash of the layer in docker doesn't match and decides to rebuild
> it. the r script seems not to have changed. I have observed this in the
> past and I think is due to bugs in docker.   Maybe Kellen is able to give
> some tips here.
>
> In this case you should use -R which is already in master. (you can always
> copy the script on top if you are in an older revision).
>
> Another thing that worked for me in the past was to completely nuke the
> docker cache, so it redonwloads from the CI repo. After that it worked fine
> in some cases.
>
> These two workarounds are not ideal, but should unblock you.
>
> Pedro.
>
> On Fri, Aug 16, 2019 at 11:39 AM Aaron Markham <aa...@gmail.com>
> wrote:
>
>> Is -R already in there?
>>
>> Here's an example of it happening to me right now.... I am making
>> minor changes to the runtime_functions logic for handling the R docs
>> output. I pull the fix, then run the container, but I see the R deps
>> layer re-running. I didn't touch that. Why it that running again?
>>
>> From https://github.com/aaronmarkham/incubator-mxnet
>>    f71cc6d..deec6aa  new_website_pipeline_2_aaron_rdocs ->
>> origin/new_website_pipeline_2_aaron_rdocs
>> Updating f71cc6d..deec6aa
>> Fast-forward
>>  ci/docker/runtime_functions.sh | 6 +++---
>>  1 file changed, 3 insertions(+), 3 deletions(-)
>> (base) ubuntu@ip-172-31-47-182:~/aaron/ci$ ./build.py
>> --docker-registry mxnetci --platform ubuntu_cpu_r
>> --docker-build-retries 3 --shm-size 500m /work/runtime_functions.sh
>> build_r_docs
>> build.py: 2019-08-16 18:34:44,639Z INFO MXNet container based build tool.
>> build.py: 2019-08-16 18:34:44,641Z INFO Docker cache download is
>> enabled from registry mxnetci
>> build.py: 2019-08-16 18:34:44,641Z INFO Loading Docker cache for
>> mxnetci/build.ubuntu_cpu_r from mxnetci
>> Using default tag: latest
>> latest: Pulling from mxnetci/build.ubuntu_cpu_r
>> Digest:
>> sha256:7dc515c288b3e66d96920eb8975f985a501bb57f70595fbe0cb1c4fcd8d4184b
>> Status: Downloaded newer image for mxnetci/build.ubuntu_cpu_r:latest
>> build.py: 2019-08-16 18:34:44,807Z INFO Successfully pulled docker cache
>> build.py: 2019-08-16 18:34:44,807Z INFO Building docker container
>> tagged 'mxnetci/build.ubuntu_cpu_r' with docker
>> build.py: 2019-08-16 18:34:44,807Z INFO Running command: 'docker build
>> -f docker/Dockerfile.build.ubuntu_cpu_r --build-arg USER_ID=1000
>> --build-arg GROUP_ID=1000 --cache-from mxnetci/build.ubuntu_cpu_r -t
>> mxnetci/build.ubuntu_cpu_r docker'
>> Sending build context to Docker daemon  289.8kB
>> Step 1/15 : FROM ubuntu:16.04
>>  ---> 5e13f8dd4c1a
>> Step 2/15 : WORKDIR /work/deps
>>  ---> Using cache
>>  ---> afc2a135945d
>> Step 3/15 : COPY install/ubuntu_core.sh /work/
>>  ---> Using cache
>>  ---> da2b2e7f35e1
>> Step 4/15 : RUN /work/ubuntu_core.sh
>>  ---> Using cache
>>  ---> d1e88b26b1d2
>> Step 5/15 : COPY install/deb_ubuntu_ccache.sh /work/
>>  ---> Using cache
>>  ---> 3aa97dea3b7b
>> Step 6/15 : RUN /work/deb_ubuntu_ccache.sh
>>  ---> Using cache
>>  ---> bec503f1d149
>> Step 7/15 : COPY install/ubuntu_r.sh /work/
>>  ---> c5e77c38031d
>> Step 8/15 : COPY install/r.gpg /work/
>>  ---> d8cdbf015d2b
>> Step 9/15 : RUN /work/ubuntu_r.sh
>>  ---> Running in c6c90b9e1538
>> ++ dirname /work/ubuntu_r.sh
>> + cd /work
>> + echo 'deb http://cran.rstudio.com/bin/linux/ubuntu trusty/'
>> + apt-key add r.gpg
>> OK
>> + add-apt-repository 'deb [arch=amd64,i386]
>> https://cran.rstudio.com/bin/linux/ubuntu xenial/'
>> + apt-get update
>> Ign:1 http://cran.rstudio.com/bin/linux/ubuntu trusty/ InRelease
>>
>> On Fri, Aug 16, 2019 at 11:32 AM Pedro Larroy
>> <pe...@gmail.com> wrote:
>> >
>> > Also, I forgot, another workaround is that I added the -R flag to the
>> build
>> > logic (build.py) so the container is not rebuilt for manual use.
>> >
>> > On Fri, Aug 16, 2019 at 11:18 AM Pedro Larroy <
>> pedro.larroy.lists@gmail.com>
>> > wrote:
>> >
>> > >
>> > > Hi Aaron.
>> > >
>> > > As Marco explained, if you are in master the cache usually works,
>> there's
>> > > two issues that I have observed:
>> > >
>> > > 1 - Docker doesn't automatically pull the base image (ex.
>> ubuntu:16.04) so
>> > > if your cached base which is used in the FROM statement becomes
>> outdated
>> > > your caching won't work. (Using docker pull ubuntu:16.04) or the base
>> > > images from the container helps with this.
>> > >
>> > > 2 - There's another situation where the above doesn't help which
>> seems to
>> > > be an unidentified issue with the docker cache:
>> > > https://github.com/docker/docker.github.io/issues/8886
>> > >
>> > > We can get a short term workaround for #1 by explicitly pulling bases
>> from
>> > > the script, but I think docker should do it when using --cache-from so
>> > > maybe contributing a patch to docker would the best approach.
>> > >
>> > > Pedro
>> > >
>> > > On Thu, Aug 15, 2019 at 7:06 PM Aaron Markham <
>> aaron.s.markham@gmail.com>
>> > > wrote:
>> > >
>> > >> When you create a new Dockerfile and use that on CI, it doesn't seem
>> > >> to cache some of the steps... like this:
>> > >>
>> > >> Step 13/15 : RUN /work/ubuntu_docs.sh
>> > >>  ---> Running in a1e522f3283b
>> > >>  [91m+ echo 'Installing dependencies...'
>> > >> + apt-get update
>> > >>  [0mInstalling dependencies.
>> > >>
>> > >> Or this....
>> > >>
>> > >> Step 4/13 : RUN /work/ubuntu_core.sh
>> > >>  ---> Running in e7882d7aa750
>> > >>  [91m+ apt-get update
>> > >>
>> > >> I get if I was changing those scripts, but then I'd think it should
>> > >> cache after running it once... but, no.
>> > >>
>> > >>
>> > >> On Thu, Aug 15, 2019 at 3:51 PM Marco de Abreu <
>> marco.g.abreu@gmail.com>
>> > >> wrote:
>> > >> >
>> > >> > Do I understand it correctly that you are saying that the Docker
>> cache
>> > >> > doesn't work properly and regularly reinstalls dependencies? Or do
>> you
>> > >> mean
>> > >> > that you only have cache misses when you modify the dependencies -
>> which
>> > >> > would be expected?
>> > >> >
>> > >> > -Marco
>> > >> >
>> > >> > On Fri, Aug 16, 2019 at 12:48 AM Aaron Markham <
>> > >> aaron.s.markham@gmail.com>
>> > >> > wrote:
>> > >> >
>> > >> > > Many of the CI pipelines follow this pattern:
>> > >> > > Load ubuntu 16.04, install deps, build mxnet, then run some
>> tests. Why
>> > >> > > repeat steps 1-3 over and over?
>> > >> > >
>> > >> > > Now, some tests use a stashed binary and docker cache. And I see
>> this
>> > >> work
>> > >> > > locally, but for the most part, on CI, you're gonna sit through a
>> > >> > > dependency install.
>> > >> > >
>> > >> > > I noticed that almost all jobs use an ubuntu setup that is fully
>> > >> loaded.
>> > >> > > Without cache, it can take 10 or more minutes to build.  So I
>> made a
>> > >> lite
>> > >> > > version. Takes only a few minutes instead.
>> > >> > >
>> > >> > > In some cases archiving worked great to share across pipelines,
>> but as
>> > >> > > Marco mentioned we need a storage solution to make that happen.
>> We
>> > >> can't
>> > >> > > archive every intermediate artifact for each PR.
>> > >> > >
>> > >> > > On Thu, Aug 15, 2019, 13:47 Pedro Larroy <
>> > >> pedro.larroy.lists@gmail.com>
>> > >> > > wrote:
>> > >> > >
>> > >> > > > Hi Aaron. Why speeds things up? What's the difference?
>> > >> > > >
>> > >> > > > Pedro.
>> > >> > > >
>> > >> > > > On Wed, Aug 14, 2019 at 8:39 PM Aaron Markham <
>> > >> aaron.s.markham@gmail.com
>> > >> > > >
>> > >> > > > wrote:
>> > >> > > >
>> > >> > > > > The PRs Thomas and I are working on for the new docs and
>> website
>> > >> share
>> > >> > > > the
>> > >> > > > > mxnet binary in the new CI pipelines we made. Speeds things
>> up a
>> > >> lot.
>> > >> > > > >
>> > >> > > > > On Wed, Aug 14, 2019, 18:16 Chris Olivier <
>> cjolivier01@gmail.com>
>> > >> > > wrote:
>> > >> > > > >
>> > >> > > > > > I see it done daily now, and while I can’t share all the
>> > >> details,
>> > >> > > it’s
>> > >> > > > > not
>> > >> > > > > > an incredibly complex thing, and involves not much more
>> than
>> > >> nfs/efs
>> > >> > > > > > sharing and remote ssh commands.  All it takes is a little
>> > >> ingenuity
>> > >> > > > and
>> > >> > > > > > some imagination.
>> > >> > > > > >
>> > >> > > > > > On Wed, Aug 14, 2019 at 4:31 PM Pedro Larroy <
>> > >> > > > > pedro.larroy.lists@gmail.com
>> > >> > > > > > >
>> > >> > > > > > wrote:
>> > >> > > > > >
>> > >> > > > > > > Sounds good in theory. I think there are complex details
>> with
>> > >> > > regards
>> > >> > > > > of
>> > >> > > > > > > resource sharing during parallel execution. Still I think
>> > >> both ways
>> > >> > > > can
>> > >> > > > > > be
>> > >> > > > > > > explored. I think some tests run for unreasonably long
>> times
>> > >> for
>> > >> > > what
>> > >> > > > > > they
>> > >> > > > > > > are doing. We already scale parts of the pipeline
>> horizontally
>> > >> > > across
>> > >> > > > > > > workers.
>> > >> > > > > > >
>> > >> > > > > > >
>> > >> > > > > > > On Wed, Aug 14, 2019 at 5:12 PM Chris Olivier <
>> > >> > > > cjolivier01@apache.org>
>> > >> > > > > > > wrote:
>> > >> > > > > > >
>> > >> > > > > > > > +1
>> > >> > > > > > > >
>> > >> > > > > > > > Rather than remove tests (which doesn’t scale as a
>> > >> solution), why
>> > >> > > > not
>> > >> > > > > > > scale
>> > >> > > > > > > > them horizontally so that they finish more quickly?
>> Across
>> > >> > > > processes
>> > >> > > > > or
>> > >> > > > > > > > even on a pool of machines that aren’t necessarily the
>> build
>> > >> > > > machine?
>> > >> > > > > > > >
>> > >> > > > > > > > On Wed, Aug 14, 2019 at 12:03 PM Marco de Abreu <
>> > >> > > > > > marco.g.abreu@gmail.com
>> > >> > > > > > > >
>> > >> > > > > > > > wrote:
>> > >> > > > > > > >
>> > >> > > > > > > > > With regards to time I rather prefer us spending a
>> bit
>> > >> more
>> > >> > > time
>> > >> > > > on
>> > >> > > > > > > > > maintenance than somebody running into an error that
>> > >> could've
>> > >> > > > been
>> > >> > > > > > > caught
>> > >> > > > > > > > > with a test.
>> > >> > > > > > > > >
>> > >> > > > > > > > > I mean, our Publishing pipeline for Scala GPU has
>> been
>> > >> broken
>> > >> > > for
>> > >> > > > > > quite
>> > >> > > > > > > > > some time now, but nobody noticed that. Basically my
>> > >> stance on
>> > >> > > > that
>> > >> > > > > > > > matter
>> > >> > > > > > > > > is that as soon as something is not blocking, you can
>> > >> also just
>> > >> > > > > > > > deactivate
>> > >> > > > > > > > > it since you don't have a forcing function in an open
>> > >> source
>> > >> > > > > project.
>> > >> > > > > > > > > People will rarely come back and fix the errors of
>> some
>> > >> nightly
>> > >> > > > > test
>> > >> > > > > > > that
>> > >> > > > > > > > > they introduced.
>> > >> > > > > > > > >
>> > >> > > > > > > > > -Marco
>> > >> > > > > > > > >
>> > >> > > > > > > > > Carin Meier <ca...@gmail.com> schrieb am Mi.,
>> 14.
>> > >> Aug.
>> > >> > > > 2019,
>> > >> > > > > > > 21:59:
>> > >> > > > > > > > >
>> > >> > > > > > > > > > If a language binding test is failing for a not
>> > >> important
>> > >> > > > reason,
>> > >> > > > > > > then
>> > >> > > > > > > > it
>> > >> > > > > > > > > > is too brittle and needs to be fixed (we have fixed
>> > >> some of
>> > >> > > > these
>> > >> > > > > > > with
>> > >> > > > > > > > > the
>> > >> > > > > > > > > > Clojure package [1]).
>> > >> > > > > > > > > > But in general, if we thinking of the MXNet
>> project as
>> > >> one
>> > >> > > > > project
>> > >> > > > > > > that
>> > >> > > > > > > > > is
>> > >> > > > > > > > > > across all the language bindings, then we want to
>> know
>> > >> if
>> > >> > > some
>> > >> > > > > > > > > fundamental
>> > >> > > > > > > > > > code change is going to break a downstream package.
>> > >> > > > > > > > > > I can't speak for all the high level package
>> binding
>> > >> > > > maintainers,
>> > >> > > > > > but
>> > >> > > > > > > > I'm
>> > >> > > > > > > > > > always happy to pitch in to provide code fixes to
>> help
>> > >> the
>> > >> > > base
>> > >> > > > > PR
>> > >> > > > > > > get
>> > >> > > > > > > > > > green.
>> > >> > > > > > > > > >
>> > >> > > > > > > > > > The time costs to maintain such a large CI project
>> > >> obviously
>> > >> > > > > needs
>> > >> > > > > > to
>> > >> > > > > > > > be
>> > >> > > > > > > > > > considered as well.
>> > >> > > > > > > > > >
>> > >> > > > > > > > > > [1]
>> > >> https://github.com/apache/incubator-mxnet/pull/15579
>> > >> > > > > > > > > >
>> > >> > > > > > > > > > On Wed, Aug 14, 2019 at 3:48 PM Pedro Larroy <
>> > >> > > > > > > > > pedro.larroy.lists@gmail.com
>> > >> > > > > > > > > > >
>> > >> > > > > > > > > > wrote:
>> > >> > > > > > > > > >
>> > >> > > > > > > > > > > From what I have seen Clojure is 15 minutes,
>> which I
>> > >> think
>> > >> > > is
>> > >> > > > > > > > > reasonable.
>> > >> > > > > > > > > > > The only question is that when a binding such as
>> R,
>> > >> Perl or
>> > >> > > > > > Clojure
>> > >> > > > > > > > > > fails,
>> > >> > > > > > > > > > > some devs are a bit confused about how to fix
>> them
>> > >> since
>> > >> > > they
>> > >> > > > > are
>> > >> > > > > > > not
>> > >> > > > > > > > > > > familiar with the testing tools and the language.
>> > >> > > > > > > > > > >
>> > >> > > > > > > > > > > On Wed, Aug 14, 2019 at 11:57 AM Carin Meier <
>> > >> > > > > > carinmeier@gmail.com
>> > >> > > > > > > >
>> > >> > > > > > > > > > wrote:
>> > >> > > > > > > > > > >
>> > >> > > > > > > > > > > > Great idea Marco! Anything that you think
>> would be
>> > >> > > valuable
>> > >> > > > > to
>> > >> > > > > > > > share
>> > >> > > > > > > > > > > would
>> > >> > > > > > > > > > > > be good. The duration of each node in the test
>> stage
>> > >> > > sounds
>> > >> > > > > > like
>> > >> > > > > > > a
>> > >> > > > > > > > > good
>> > >> > > > > > > > > > > > start.
>> > >> > > > > > > > > > > >
>> > >> > > > > > > > > > > > - Carin
>> > >> > > > > > > > > > > >
>> > >> > > > > > > > > > > > On Wed, Aug 14, 2019 at 2:48 PM Marco de Abreu
>> <
>> > >> > > > > > > > > > marco.g.abreu@gmail.com>
>> > >> > > > > > > > > > > > wrote:
>> > >> > > > > > > > > > > >
>> > >> > > > > > > > > > > > > Hi,
>> > >> > > > > > > > > > > > >
>> > >> > > > > > > > > > > > > we record a bunch of metrics about run
>> statistics
>> > >> (down
>> > >> > > > to
>> > >> > > > > > the
>> > >> > > > > > > > > > duration
>> > >> > > > > > > > > > > > of
>> > >> > > > > > > > > > > > > every individual step). If you tell me which
>> ones
>> > >> > > you're
>> > >> > > > > > > > > particularly
>> > >> > > > > > > > > > > > > interested in (probably total duration of
>> each
>> > >> node in
>> > >> > > > the
>> > >> > > > > > test
>> > >> > > > > > > > > > stage),
>> > >> > > > > > > > > > > > I'm
>> > >> > > > > > > > > > > > > happy to provide them.
>> > >> > > > > > > > > > > > >
>> > >> > > > > > > > > > > > > Dimensions are (in hierarchical order):
>> > >> > > > > > > > > > > > > - job
>> > >> > > > > > > > > > > > > - branch
>> > >> > > > > > > > > > > > > - stage
>> > >> > > > > > > > > > > > > - node
>> > >> > > > > > > > > > > > > - step
>> > >> > > > > > > > > > > > >
>> > >> > > > > > > > > > > > > Unfortunately I don't have the possibility to
>> > >> export
>> > >> > > them
>> > >> > > > > > since
>> > >> > > > > > > > we
>> > >> > > > > > > > > > > store
>> > >> > > > > > > > > > > > > them in CloudWatch Metrics which afaik
>> doesn't
>> > >> offer
>> > >> > > raw
>> > >> > > > > > > exports.
>> > >> > > > > > > > > > > > >
>> > >> > > > > > > > > > > > > Best regards,
>> > >> > > > > > > > > > > > > Marco
>> > >> > > > > > > > > > > > >
>> > >> > > > > > > > > > > > > Carin Meier <ca...@gmail.com> schrieb
>> am
>> > >> Mi., 14.
>> > >> > > > > Aug.
>> > >> > > > > > > > 2019,
>> > >> > > > > > > > > > > 19:43:
>> > >> > > > > > > > > > > > >
>> > >> > > > > > > > > > > > > > I would prefer to keep the language
>> binding in
>> > >> the PR
>> > >> > > > > > > process.
>> > >> > > > > > > > > > > Perhaps
>> > >> > > > > > > > > > > > we
>> > >> > > > > > > > > > > > > > could do some analytics to see how much
>> each of
>> > >> the
>> > >> > > > > > language
>> > >> > > > > > > > > > bindings
>> > >> > > > > > > > > > > > is
>> > >> > > > > > > > > > > > > > contributing to overall run time.
>> > >> > > > > > > > > > > > > > If we have some metrics on that, maybe we
>> can
>> > >> come up
>> > >> > > > > with
>> > >> > > > > > a
>> > >> > > > > > > > > > > guideline
>> > >> > > > > > > > > > > > of
>> > >> > > > > > > > > > > > > > how much time each should take. Another
>> > >> possibility
>> > >> > > is
>> > >> > > > > > > leverage
>> > >> > > > > > > > > the
>> > >> > > > > > > > > > > > > > parallel builds more.
>> > >> > > > > > > > > > > > > >
>> > >> > > > > > > > > > > > > > On Wed, Aug 14, 2019 at 1:30 PM Pedro
>> Larroy <
>> > >> > > > > > > > > > > > > pedro.larroy.lists@gmail.com
>> > >> > > > > > > > > > > > > > >
>> > >> > > > > > > > > > > > > > wrote:
>> > >> > > > > > > > > > > > > >
>> > >> > > > > > > > > > > > > > > Hi Carin.
>> > >> > > > > > > > > > > > > > >
>> > >> > > > > > > > > > > > > > > That's a good point, all things
>> considered
>> > >> would
>> > >> > > your
>> > >> > > > > > > > > preference
>> > >> > > > > > > > > > be
>> > >> > > > > > > > > > > > to
>> > >> > > > > > > > > > > > > > keep
>> > >> > > > > > > > > > > > > > > the Clojure tests as part of the PR
>> process
>> > >> or in
>> > >> > > > > > Nightly?
>> > >> > > > > > > > > > > > > > > Some options are having notifications
>> here or
>> > >> in
>> > >> > > > slack.
>> > >> > > > > > But
>> > >> > > > > > > > if
>> > >> > > > > > > > > we
>> > >> > > > > > > > > > > > think
>> > >> > > > > > > > > > > > > > > breakages would go unnoticed maybe is
>> not a
>> > >> good
>> > >> > > idea
>> > >> > > > > to
>> > >> > > > > > > > fully
>> > >> > > > > > > > > > > remove
>> > >> > > > > > > > > > > > > > > bindings from the PR process and just
>> > >> streamline
>> > >> > > the
>> > >> > > > > > > process.
>> > >> > > > > > > > > > > > > > >
>> > >> > > > > > > > > > > > > > > Pedro.
>> > >> > > > > > > > > > > > > > >
>> > >> > > > > > > > > > > > > > > On Wed, Aug 14, 2019 at 5:09 AM Carin
>> Meier <
>> > >> > > > > > > > > > carinmeier@gmail.com>
>> > >> > > > > > > > > > > > > > wrote:
>> > >> > > > > > > > > > > > > > >
>> > >> > > > > > > > > > > > > > > > Before any binding tests are moved to
>> > >> nightly, I
>> > >> > > > > think
>> > >> > > > > > we
>> > >> > > > > > > > > need
>> > >> > > > > > > > > > to
>> > >> > > > > > > > > > > > > > figure
>> > >> > > > > > > > > > > > > > > > out how the community can get proper
>> > >> > > notifications
>> > >> > > > of
>> > >> > > > > > > > failure
>> > >> > > > > > > > > > and
>> > >> > > > > > > > > > > > > > success
>> > >> > > > > > > > > > > > > > > > on those nightly runs. Otherwise, I
>> think
>> > >> that
>> > >> > > > > > breakages
>> > >> > > > > > > > > would
>> > >> > > > > > > > > > go
>> > >> > > > > > > > > > > > > > > > unnoticed.
>> > >> > > > > > > > > > > > > > > >
>> > >> > > > > > > > > > > > > > > > -Carin
>> > >> > > > > > > > > > > > > > > >
>> > >> > > > > > > > > > > > > > > > On Tue, Aug 13, 2019 at 7:47 PM Pedro
>> > >> Larroy <
>> > >> > > > > > > > > > > > > > > pedro.larroy.lists@gmail.com
>> > >> > > > > > > > > > > > > > > > >
>> > >> > > > > > > > > > > > > > > > wrote:
>> > >> > > > > > > > > > > > > > > >
>> > >> > > > > > > > > > > > > > > > > Hi
>> > >> > > > > > > > > > > > > > > > >
>> > >> > > > > > > > > > > > > > > > > Seems we are hitting some problems
>> in CI.
>> > >> I
>> > >> > > > propose
>> > >> > > > > > the
>> > >> > > > > > > > > > > following
>> > >> > > > > > > > > > > > > > > action
>> > >> > > > > > > > > > > > > > > > > items to remedy the situation and
>> > >> accelerate
>> > >> > > turn
>> > >> > > > > > > around
>> > >> > > > > > > > > > times
>> > >> > > > > > > > > > > in
>> > >> > > > > > > > > > > > > CI,
>> > >> > > > > > > > > > > > > > > > > reduce cost, complexity and
>> probability of
>> > >> > > > failure
>> > >> > > > > > > > blocking
>> > >> > > > > > > > > > PRs
>> > >> > > > > > > > > > > > and
>> > >> > > > > > > > > > > > > > > > > frustrating developers:
>> > >> > > > > > > > > > > > > > > > >
>> > >> > > > > > > > > > > > > > > > > * Upgrade Windows visual studio from
>> VS
>> > >> 2015 to
>> > >> > > > VS
>> > >> > > > > > > 2017.
>> > >> > > > > > > > > The
>> > >> > > > > > > > > > > > > > > > > build_windows.py infrastructure
>> should
>> > >> easily
>> > >> > > > work
>> > >> > > > > > with
>> > >> > > > > > > > the
>> > >> > > > > > > > > > new
>> > >> > > > > > > > > > > > > > > version.
>> > >> > > > > > > > > > > > > > > > > Currently some PRs are blocked by
>> this:
>> > >> > > > > > > > > > > > > > > > >
>> > >> > > > > > https://github.com/apache/incubator-mxnet/issues/13958
>> > >> > > > > > > > > > > > > > > > > * Move Gluon Model zoo tests to
>> nightly.
>> > >> > > Tracked
>> > >> > > > at
>> > >> > > > > > > > > > > > > > > > >
>> > >> > > > > > https://github.com/apache/incubator-mxnet/issues/15295
>> > >> > > > > > > > > > > > > > > > > * Move non-python bindings tests to
>> > >> nightly.
>> > >> > > If a
>> > >> > > > > > > commit
>> > >> > > > > > > > is
>> > >> > > > > > > > > > > > > touching
>> > >> > > > > > > > > > > > > > > > other
>> > >> > > > > > > > > > > > > > > > > bindings, the reviewer should ask
>> for a
>> > >> full
>> > >> > > run
>> > >> > > > > > which
>> > >> > > > > > > > can
>> > >> > > > > > > > > be
>> > >> > > > > > > > > > > > done
>> > >> > > > > > > > > > > > > > > > locally,
>> > >> > > > > > > > > > > > > > > > > use the label bot to trigger a full
>> CI
>> > >> build,
>> > >> > > or
>> > >> > > > > > defer
>> > >> > > > > > > to
>> > >> > > > > > > > > > > > nightly.
>> > >> > > > > > > > > > > > > > > > > * Provide a couple of basic sanity
>> > >> performance
>> > >> > > > > tests
>> > >> > > > > > on
>> > >> > > > > > > > > small
>> > >> > > > > > > > > > > > > models
>> > >> > > > > > > > > > > > > > > that
>> > >> > > > > > > > > > > > > > > > > are run on CI and can be echoed by
>> the
>> > >> label
>> > >> > > bot
>> > >> > > > > as a
>> > >> > > > > > > > > comment
>> > >> > > > > > > > > > > for
>> > >> > > > > > > > > > > > > > PRs.
>> > >> > > > > > > > > > > > > > > > > * Address unit tests that take more
>> than
>> > >> > > 10-20s,
>> > >> > > > > > > > streamline
>> > >> > > > > > > > > > > them
>> > >> > > > > > > > > > > > or
>> > >> > > > > > > > > > > > > > > move
>> > >> > > > > > > > > > > > > > > > > them to nightly if it can't be done.
>> > >> > > > > > > > > > > > > > > > > * Open sourcing the remaining CI
>> > >> infrastructure
>> > >> > > > > > scripts
>> > >> > > > > > > > so
>> > >> > > > > > > > > > the
>> > >> > > > > > > > > > > > > > > community
>> > >> > > > > > > > > > > > > > > > > can contribute.
>> > >> > > > > > > > > > > > > > > > >
>> > >> > > > > > > > > > > > > > > > > I think our goal should be turnaround
>> > >> under
>> > >> > > > 30min.
>> > >> > > > > > > > > > > > > > > > >
>> > >> > > > > > > > > > > > > > > > > I would also like to touch base with
>> the
>> > >> > > > community
>> > >> > > > > > that
>> > >> > > > > > > > > some
>> > >> > > > > > > > > > > PRs
>> > >> > > > > > > > > > > > > are
>> > >> > > > > > > > > > > > > > > not
>> > >> > > > > > > > > > > > > > > > > being followed up by committers
>> asking for
>> > >> > > > changes.
>> > >> > > > > > For
>> > >> > > > > > > > > > example
>> > >> > > > > > > > > > > > > this
>> > >> > > > > > > > > > > > > > PR
>> > >> > > > > > > > > > > > > > > > is
>> > >> > > > > > > > > > > > > > > > > importtant and is hanging for a long
>> time.
>> > >> > > > > > > > > > > > > > > > >
>> > >> > > > > > > > > > > > > > > > >
>> > >> > > > > https://github.com/apache/incubator-mxnet/pull/15051
>> > >> > > > > > > > > > > > > > > > >
>> > >> > > > > > > > > > > > > > > > > This is another, less important but
>> more
>> > >> > > trivial
>> > >> > > > to
>> > >> > > > > > > > review:
>> > >> > > > > > > > > > > > > > > > >
>> > >> > > > > > > > > > > > > > > > >
>> > >> > > > > https://github.com/apache/incubator-mxnet/pull/14940
>> > >> > > > > > > > > > > > > > > > >
>> > >> > > > > > > > > > > > > > > > > I think comitters requesting changes
>> and
>> > >> not
>> > >> > > > > > folllowing
>> > >> > > > > > > > up
>> > >> > > > > > > > > in
>> > >> > > > > > > > > > > > > > > reasonable
>> > >> > > > > > > > > > > > > > > > > time is not healthy for the project.
>> I
>> > >> suggest
>> > >> > > > > > > > configuring
>> > >> > > > > > > > > > > github
>> > >> > > > > > > > > > > > > > > > > Notifications for a good SNR and
>> > >> following up.
>> > >> > > > > > > > > > > > > > > > >
>> > >> > > > > > > > > > > > > > > > > Regards.
>> > >> > > > > > > > > > > > > > > > >
>> > >> > > > > > > > > > > > > > > > > Pedro.
>> > >> > > > > > > > > > > > > > > > >
>> > >> > > > > > > > > > > > > > > >
>> > >> > > > > > > > > > > > > > >
>> > >> > > > > > > > > > > > > >
>> > >> > > > > > > > > > > > >
>> > >> > > > > > > > > > > >
>> > >> > > > > > > > > > >
>> > >> > > > > > > > > >
>> > >> > > > > > > > >
>> > >> > > > > > > >
>> > >> > > > > > >
>> > >> > > > > >
>> > >> > > > >
>> > >> > > >
>> > >> > >
>> > >>
>> > >
>>
>

Re: CI and PRs

Posted by Pedro Larroy <pe...@gmail.com>.

Hi Aaron. This is difficult to diagnose, because I don't know what to do
when the hash of the layer in docker doesn't match and decides to rebuild
it. the r script seems not to have changed. I have observed this in the
past and I think is due to bugs in docker.   Maybe Kellen is able to give
some tips here.

In this case you should use -R which is already in master. (you can always
copy the script on top if you are in an older revision).

Another thing that worked for me in the past was to completely nuke the
docker cache, so it redonwloads from the CI repo. After that it worked fine
in some cases.

These two workarounds are not ideal, but should unblock you.

Pedro.

On Fri, Aug 16, 2019 at 11:39 AM Aaron Markham <aa...@gmail.com>
wrote:

> Is -R already in there?
>
> Here's an example of it happening to me right now.... I am making
> minor changes to the runtime_functions logic for handling the R docs
> output. I pull the fix, then run the container, but I see the R deps
> layer re-running. I didn't touch that. Why it that running again?
>
> From https://github.com/aaronmarkham/incubator-mxnet
>    f71cc6d..deec6aa  new_website_pipeline_2_aaron_rdocs ->
> origin/new_website_pipeline_2_aaron_rdocs
> Updating f71cc6d..deec6aa
> Fast-forward
>  ci/docker/runtime_functions.sh | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> (base) ubuntu@ip-172-31-47-182:~/aaron/ci$ ./build.py
> --docker-registry mxnetci --platform ubuntu_cpu_r
> --docker-build-retries 3 --shm-size 500m /work/runtime_functions.sh
> build_r_docs
> build.py: 2019-08-16 18:34:44,639Z INFO MXNet container based build tool.
> build.py: 2019-08-16 18:34:44,641Z INFO Docker cache download is
> enabled from registry mxnetci
> build.py: 2019-08-16 18:34:44,641Z INFO Loading Docker cache for
> mxnetci/build.ubuntu_cpu_r from mxnetci
> Using default tag: latest
> latest: Pulling from mxnetci/build.ubuntu_cpu_r
> Digest:
> sha256:7dc515c288b3e66d96920eb8975f985a501bb57f70595fbe0cb1c4fcd8d4184b
> Status: Downloaded newer image for mxnetci/build.ubuntu_cpu_r:latest
> build.py: 2019-08-16 18:34:44,807Z INFO Successfully pulled docker cache
> build.py: 2019-08-16 18:34:44,807Z INFO Building docker container
> tagged 'mxnetci/build.ubuntu_cpu_r' with docker
> build.py: 2019-08-16 18:34:44,807Z INFO Running command: 'docker build
> -f docker/Dockerfile.build.ubuntu_cpu_r --build-arg USER_ID=1000
> --build-arg GROUP_ID=1000 --cache-from mxnetci/build.ubuntu_cpu_r -t
> mxnetci/build.ubuntu_cpu_r docker'
> Sending build context to Docker daemon  289.8kB
> Step 1/15 : FROM ubuntu:16.04
>  ---> 5e13f8dd4c1a
> Step 2/15 : WORKDIR /work/deps
>  ---> Using cache
>  ---> afc2a135945d
> Step 3/15 : COPY install/ubuntu_core.sh /work/
>  ---> Using cache
>  ---> da2b2e7f35e1
> Step 4/15 : RUN /work/ubuntu_core.sh
>  ---> Using cache
>  ---> d1e88b26b1d2
> Step 5/15 : COPY install/deb_ubuntu_ccache.sh /work/
>  ---> Using cache
>  ---> 3aa97dea3b7b
> Step 6/15 : RUN /work/deb_ubuntu_ccache.sh
>  ---> Using cache
>  ---> bec503f1d149
> Step 7/15 : COPY install/ubuntu_r.sh /work/
>  ---> c5e77c38031d
> Step 8/15 : COPY install/r.gpg /work/
>  ---> d8cdbf015d2b
> Step 9/15 : RUN /work/ubuntu_r.sh
>  ---> Running in c6c90b9e1538
> ++ dirname /work/ubuntu_r.sh
> + cd /work
> + echo 'deb http://cran.rstudio.com/bin/linux/ubuntu trusty/'
> + apt-key add r.gpg
> OK
> + add-apt-repository 'deb [arch=amd64,i386]
> https://cran.rstudio.com/bin/linux/ubuntu xenial/'
> + apt-get update
> Ign:1 http://cran.rstudio.com/bin/linux/ubuntu trusty/ InRelease
>
> On Fri, Aug 16, 2019 at 11:32 AM Pedro Larroy
> <pe...@gmail.com> wrote:
> >
> > Also, I forgot, another workaround is that I added the -R flag to the
> build
> > logic (build.py) so the container is not rebuilt for manual use.
> >
> > On Fri, Aug 16, 2019 at 11:18 AM Pedro Larroy <
> pedro.larroy.lists@gmail.com>
> > wrote:
> >
> > >
> > > Hi Aaron.
> > >
> > > As Marco explained, if you are in master the cache usually works,
> there's
> > > two issues that I have observed:
> > >
> > > 1 - Docker doesn't automatically pull the base image (ex.
> ubuntu:16.04) so
> > > if your cached base which is used in the FROM statement becomes
> outdated
> > > your caching won't work. (Using docker pull ubuntu:16.04) or the base
> > > images from the container helps with this.
> > >
> > > 2 - There's another situation where the above doesn't help which seems
> to
> > > be an unidentified issue with the docker cache:
> > > https://github.com/docker/docker.github.io/issues/8886
> > >
> > > We can get a short term workaround for #1 by explicitly pulling bases
> from
> > > the script, but I think docker should do it when using --cache-from so
> > > maybe contributing a patch to docker would the best approach.
> > >
> > > Pedro
> > >
> > > On Thu, Aug 15, 2019 at 7:06 PM Aaron Markham <
> aaron.s.markham@gmail.com>
> > > wrote:
> > >
> > >> When you create a new Dockerfile and use that on CI, it doesn't seem
> > >> to cache some of the steps... like this:
> > >>
> > >> Step 13/15 : RUN /work/ubuntu_docs.sh
> > >>  ---> Running in a1e522f3283b
> > >>  [91m+ echo 'Installing dependencies...'
> > >> + apt-get update
> > >>  [0mInstalling dependencies.
> > >>
> > >> Or this....
> > >>
> > >> Step 4/13 : RUN /work/ubuntu_core.sh
> > >>  ---> Running in e7882d7aa750
> > >>  [91m+ apt-get update
> > >>
> > >> I get if I was changing those scripts, but then I'd think it should
> > >> cache after running it once... but, no.
> > >>
> > >>
> > >> On Thu, Aug 15, 2019 at 3:51 PM Marco de Abreu <
> marco.g.abreu@gmail.com>
> > >> wrote:
> > >> >
> > >> > Do I understand it correctly that you are saying that the Docker
> cache
> > >> > doesn't work properly and regularly reinstalls dependencies? Or do
> you
> > >> mean
> > >> > that you only have cache misses when you modify the dependencies -
> which
> > >> > would be expected?
> > >> >
> > >> > -Marco
> > >> >
> > >> > On Fri, Aug 16, 2019 at 12:48 AM Aaron Markham <
> > >> aaron.s.markham@gmail.com>
> > >> > wrote:
> > >> >
> > >> > > Many of the CI pipelines follow this pattern:
> > >> > > Load ubuntu 16.04, install deps, build mxnet, then run some
> tests. Why
> > >> > > repeat steps 1-3 over and over?
> > >> > >
> > >> > > Now, some tests use a stashed binary and docker cache. And I see
> this
> > >> work
> > >> > > locally, but for the most part, on CI, you're gonna sit through a
> > >> > > dependency install.
> > >> > >
> > >> > > I noticed that almost all jobs use an ubuntu setup that is fully
> > >> loaded.
> > >> > > Without cache, it can take 10 or more minutes to build.  So I
> made a
> > >> lite
> > >> > > version. Takes only a few minutes instead.
> > >> > >
> > >> > > In some cases archiving worked great to share across pipelines,
> but as
> > >> > > Marco mentioned we need a storage solution to make that happen. We
> > >> can't
> > >> > > archive every intermediate artifact for each PR.
> > >> > >
> > >> > > On Thu, Aug 15, 2019, 13:47 Pedro Larroy <
> > >> pedro.larroy.lists@gmail.com>
> > >> > > wrote:
> > >> > >
> > >> > > > Hi Aaron. Why speeds things up? What's the difference?
> > >> > > >
> > >> > > > Pedro.
> > >> > > >
> > >> > > > On Wed, Aug 14, 2019 at 8:39 PM Aaron Markham <
> > >> aaron.s.markham@gmail.com
> > >> > > >
> > >> > > > wrote:
> > >> > > >
> > >> > > > > The PRs Thomas and I are working on for the new docs and
> website
> > >> share
> > >> > > > the
> > >> > > > > mxnet binary in the new CI pipelines we made. Speeds things
> up a
> > >> lot.
> > >> > > > >
> > >> > > > > On Wed, Aug 14, 2019, 18:16 Chris Olivier <
> cjolivier01@gmail.com>
> > >> > > wrote:
> > >> > > > >
> > >> > > > > > I see it done daily now, and while I can’t share all the
> > >> details,
> > >> > > it’s
> > >> > > > > not
> > >> > > > > > an incredibly complex thing, and involves not much more than
> > >> nfs/efs
> > >> > > > > > sharing and remote ssh commands.  All it takes is a little
> > >> ingenuity
> > >> > > > and
> > >> > > > > > some imagination.
> > >> > > > > >
> > >> > > > > > On Wed, Aug 14, 2019 at 4:31 PM Pedro Larroy <
> > >> > > > > pedro.larroy.lists@gmail.com
> > >> > > > > > >
> > >> > > > > > wrote:
> > >> > > > > >
> > >> > > > > > > Sounds good in theory. I think there are complex details
> with
> > >> > > regards
> > >> > > > > of
> > >> > > > > > > resource sharing during parallel execution. Still I think
> > >> both ways
> > >> > > > can
> > >> > > > > > be
> > >> > > > > > > explored. I think some tests run for unreasonably long
> times
> > >> for
> > >> > > what
> > >> > > > > > they
> > >> > > > > > > are doing. We already scale parts of the pipeline
> horizontally
> > >> > > across
> > >> > > > > > > workers.
> > >> > > > > > >
> > >> > > > > > >
> > >> > > > > > > On Wed, Aug 14, 2019 at 5:12 PM Chris Olivier <
> > >> > > > cjolivier01@apache.org>
> > >> > > > > > > wrote:
> > >> > > > > > >
> > >> > > > > > > > +1
> > >> > > > > > > >
> > >> > > > > > > > Rather than remove tests (which doesn’t scale as a
> > >> solution), why
> > >> > > > not
> > >> > > > > > > scale
> > >> > > > > > > > them horizontally so that they finish more quickly?
> Across
> > >> > > > processes
> > >> > > > > or
> > >> > > > > > > > even on a pool of machines that aren’t necessarily the
> build
> > >> > > > machine?
> > >> > > > > > > >
> > >> > > > > > > > On Wed, Aug 14, 2019 at 12:03 PM Marco de Abreu <
> > >> > > > > > marco.g.abreu@gmail.com
> > >> > > > > > > >
> > >> > > > > > > > wrote:
> > >> > > > > > > >
> > >> > > > > > > > > With regards to time I rather prefer us spending a bit
> > >> more
> > >> > > time
> > >> > > > on
> > >> > > > > > > > > maintenance than somebody running into an error that
> > >> could've
> > >> > > > been
> > >> > > > > > > caught
> > >> > > > > > > > > with a test.
> > >> > > > > > > > >
> > >> > > > > > > > > I mean, our Publishing pipeline for Scala GPU has been
> > >> broken
> > >> > > for
> > >> > > > > > quite
> > >> > > > > > > > > some time now, but nobody noticed that. Basically my
> > >> stance on
> > >> > > > that
> > >> > > > > > > > matter
> > >> > > > > > > > > is that as soon as something is not blocking, you can
> > >> also just
> > >> > > > > > > > deactivate
> > >> > > > > > > > > it since you don't have a forcing function in an open
> > >> source
> > >> > > > > project.
> > >> > > > > > > > > People will rarely come back and fix the errors of
> some
> > >> nightly
> > >> > > > > test
> > >> > > > > > > that
> > >> > > > > > > > > they introduced.
> > >> > > > > > > > >
> > >> > > > > > > > > -Marco
> > >> > > > > > > > >
> > >> > > > > > > > > Carin Meier <ca...@gmail.com> schrieb am Mi.,
> 14.
> > >> Aug.
> > >> > > > 2019,
> > >> > > > > > > 21:59:
> > >> > > > > > > > >
> > >> > > > > > > > > > If a language binding test is failing for a not
> > >> important
> > >> > > > reason,
> > >> > > > > > > then
> > >> > > > > > > > it
> > >> > > > > > > > > > is too brittle and needs to be fixed (we have fixed
> > >> some of
> > >> > > > these
> > >> > > > > > > with
> > >> > > > > > > > > the
> > >> > > > > > > > > > Clojure package [1]).
> > >> > > > > > > > > > But in general, if we thinking of the MXNet project
> as
> > >> one
> > >> > > > > project
> > >> > > > > > > that
> > >> > > > > > > > > is
> > >> > > > > > > > > > across all the language bindings, then we want to
> know
> > >> if
> > >> > > some
> > >> > > > > > > > > fundamental
> > >> > > > > > > > > > code change is going to break a downstream package.
> > >> > > > > > > > > > I can't speak for all the high level package binding
> > >> > > > maintainers,
> > >> > > > > > but
> > >> > > > > > > > I'm
> > >> > > > > > > > > > always happy to pitch in to provide code fixes to
> help
> > >> the
> > >> > > base
> > >> > > > > PR
> > >> > > > > > > get
> > >> > > > > > > > > > green.
> > >> > > > > > > > > >
> > >> > > > > > > > > > The time costs to maintain such a large CI project
> > >> obviously
> > >> > > > > needs
> > >> > > > > > to
> > >> > > > > > > > be
> > >> > > > > > > > > > considered as well.
> > >> > > > > > > > > >
> > >> > > > > > > > > > [1]
> > >> https://github.com/apache/incubator-mxnet/pull/15579
> > >> > > > > > > > > >
> > >> > > > > > > > > > On Wed, Aug 14, 2019 at 3:48 PM Pedro Larroy <
> > >> > > > > > > > > pedro.larroy.lists@gmail.com
> > >> > > > > > > > > > >
> > >> > > > > > > > > > wrote:
> > >> > > > > > > > > >
> > >> > > > > > > > > > > From what I have seen Clojure is 15 minutes,
> which I
> > >> think
> > >> > > is
> > >> > > > > > > > > reasonable.
> > >> > > > > > > > > > > The only question is that when a binding such as
> R,
> > >> Perl or
> > >> > > > > > Clojure
> > >> > > > > > > > > > fails,
> > >> > > > > > > > > > > some devs are a bit confused about how to fix them
> > >> since
> > >> > > they
> > >> > > > > are
> > >> > > > > > > not
> > >> > > > > > > > > > > familiar with the testing tools and the language.
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > On Wed, Aug 14, 2019 at 11:57 AM Carin Meier <
> > >> > > > > > carinmeier@gmail.com
> > >> > > > > > > >
> > >> > > > > > > > > > wrote:
> > >> > > > > > > > > > >
> > >> > > > > > > > > > > > Great idea Marco! Anything that you think would
> be
> > >> > > valuable
> > >> > > > > to
> > >> > > > > > > > share
> > >> > > > > > > > > > > would
> > >> > > > > > > > > > > > be good. The duration of each node in the test
> stage
> > >> > > sounds
> > >> > > > > > like
> > >> > > > > > > a
> > >> > > > > > > > > good
> > >> > > > > > > > > > > > start.
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > > - Carin
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > > On Wed, Aug 14, 2019 at 2:48 PM Marco de Abreu <
> > >> > > > > > > > > > marco.g.abreu@gmail.com>
> > >> > > > > > > > > > > > wrote:
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > > > > Hi,
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > > > we record a bunch of metrics about run
> statistics
> > >> (down
> > >> > > > to
> > >> > > > > > the
> > >> > > > > > > > > > duration
> > >> > > > > > > > > > > > of
> > >> > > > > > > > > > > > > every individual step). If you tell me which
> ones
> > >> > > you're
> > >> > > > > > > > > particularly
> > >> > > > > > > > > > > > > interested in (probably total duration of each
> > >> node in
> > >> > > > the
> > >> > > > > > test
> > >> > > > > > > > > > stage),
> > >> > > > > > > > > > > > I'm
> > >> > > > > > > > > > > > > happy to provide them.
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > > > Dimensions are (in hierarchical order):
> > >> > > > > > > > > > > > > - job
> > >> > > > > > > > > > > > > - branch
> > >> > > > > > > > > > > > > - stage
> > >> > > > > > > > > > > > > - node
> > >> > > > > > > > > > > > > - step
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > > > Unfortunately I don't have the possibility to
> > >> export
> > >> > > them
> > >> > > > > > since
> > >> > > > > > > > we
> > >> > > > > > > > > > > store
> > >> > > > > > > > > > > > > them in CloudWatch Metrics which afaik doesn't
> > >> offer
> > >> > > raw
> > >> > > > > > > exports.
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > > > Best regards,
> > >> > > > > > > > > > > > > Marco
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > > > Carin Meier <ca...@gmail.com> schrieb am
> > >> Mi., 14.
> > >> > > > > Aug.
> > >> > > > > > > > 2019,
> > >> > > > > > > > > > > 19:43:
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > I would prefer to keep the language binding
> in
> > >> the PR
> > >> > > > > > > process.
> > >> > > > > > > > > > > Perhaps
> > >> > > > > > > > > > > > we
> > >> > > > > > > > > > > > > > could do some analytics to see how much
> each of
> > >> the
> > >> > > > > > language
> > >> > > > > > > > > > bindings
> > >> > > > > > > > > > > > is
> > >> > > > > > > > > > > > > > contributing to overall run time.
> > >> > > > > > > > > > > > > > If we have some metrics on that, maybe we
> can
> > >> come up
> > >> > > > > with
> > >> > > > > > a
> > >> > > > > > > > > > > guideline
> > >> > > > > > > > > > > > of
> > >> > > > > > > > > > > > > > how much time each should take. Another
> > >> possibility
> > >> > > is
> > >> > > > > > > leverage
> > >> > > > > > > > > the
> > >> > > > > > > > > > > > > > parallel builds more.
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > On Wed, Aug 14, 2019 at 1:30 PM Pedro
> Larroy <
> > >> > > > > > > > > > > > > pedro.larroy.lists@gmail.com
> > >> > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > wrote:
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > Hi Carin.
> > >> > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > That's a good point, all things considered
> > >> would
> > >> > > your
> > >> > > > > > > > > preference
> > >> > > > > > > > > > be
> > >> > > > > > > > > > > > to
> > >> > > > > > > > > > > > > > keep
> > >> > > > > > > > > > > > > > > the Clojure tests as part of the PR
> process
> > >> or in
> > >> > > > > > Nightly?
> > >> > > > > > > > > > > > > > > Some options are having notifications
> here or
> > >> in
> > >> > > > slack.
> > >> > > > > > But
> > >> > > > > > > > if
> > >> > > > > > > > > we
> > >> > > > > > > > > > > > think
> > >> > > > > > > > > > > > > > > breakages would go unnoticed maybe is not
> a
> > >> good
> > >> > > idea
> > >> > > > > to
> > >> > > > > > > > fully
> > >> > > > > > > > > > > remove
> > >> > > > > > > > > > > > > > > bindings from the PR process and just
> > >> streamline
> > >> > > the
> > >> > > > > > > process.
> > >> > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > Pedro.
> > >> > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > On Wed, Aug 14, 2019 at 5:09 AM Carin
> Meier <
> > >> > > > > > > > > > carinmeier@gmail.com>
> > >> > > > > > > > > > > > > > wrote:
> > >> > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > Before any binding tests are moved to
> > >> nightly, I
> > >> > > > > think
> > >> > > > > > we
> > >> > > > > > > > > need
> > >> > > > > > > > > > to
> > >> > > > > > > > > > > > > > figure
> > >> > > > > > > > > > > > > > > > out how the community can get proper
> > >> > > notifications
> > >> > > > of
> > >> > > > > > > > failure
> > >> > > > > > > > > > and
> > >> > > > > > > > > > > > > > success
> > >> > > > > > > > > > > > > > > > on those nightly runs. Otherwise, I
> think
> > >> that
> > >> > > > > > breakages
> > >> > > > > > > > > would
> > >> > > > > > > > > > go
> > >> > > > > > > > > > > > > > > > unnoticed.
> > >> > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > -Carin
> > >> > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > On Tue, Aug 13, 2019 at 7:47 PM Pedro
> > >> Larroy <
> > >> > > > > > > > > > > > > > > pedro.larroy.lists@gmail.com
> > >> > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > wrote:
> > >> > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > Hi
> > >> > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > Seems we are hitting some problems in
> CI.
> > >> I
> > >> > > > propose
> > >> > > > > > the
> > >> > > > > > > > > > > following
> > >> > > > > > > > > > > > > > > action
> > >> > > > > > > > > > > > > > > > > items to remedy the situation and
> > >> accelerate
> > >> > > turn
> > >> > > > > > > around
> > >> > > > > > > > > > times
> > >> > > > > > > > > > > in
> > >> > > > > > > > > > > > > CI,
> > >> > > > > > > > > > > > > > > > > reduce cost, complexity and
> probability of
> > >> > > > failure
> > >> > > > > > > > blocking
> > >> > > > > > > > > > PRs
> > >> > > > > > > > > > > > and
> > >> > > > > > > > > > > > > > > > > frustrating developers:
> > >> > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > * Upgrade Windows visual studio from
> VS
> > >> 2015 to
> > >> > > > VS
> > >> > > > > > > 2017.
> > >> > > > > > > > > The
> > >> > > > > > > > > > > > > > > > > build_windows.py infrastructure should
> > >> easily
> > >> > > > work
> > >> > > > > > with
> > >> > > > > > > > the
> > >> > > > > > > > > > new
> > >> > > > > > > > > > > > > > > version.
> > >> > > > > > > > > > > > > > > > > Currently some PRs are blocked by
> this:
> > >> > > > > > > > > > > > > > > > >
> > >> > > > > > https://github.com/apache/incubator-mxnet/issues/13958
> > >> > > > > > > > > > > > > > > > > * Move Gluon Model zoo tests to
> nightly.
> > >> > > Tracked
> > >> > > > at
> > >> > > > > > > > > > > > > > > > >
> > >> > > > > > https://github.com/apache/incubator-mxnet/issues/15295
> > >> > > > > > > > > > > > > > > > > * Move non-python bindings tests to
> > >> nightly.
> > >> > > If a
> > >> > > > > > > commit
> > >> > > > > > > > is
> > >> > > > > > > > > > > > > touching
> > >> > > > > > > > > > > > > > > > other
> > >> > > > > > > > > > > > > > > > > bindings, the reviewer should ask for
> a
> > >> full
> > >> > > run
> > >> > > > > > which
> > >> > > > > > > > can
> > >> > > > > > > > > be
> > >> > > > > > > > > > > > done
> > >> > > > > > > > > > > > > > > > locally,
> > >> > > > > > > > > > > > > > > > > use the label bot to trigger a full CI
> > >> build,
> > >> > > or
> > >> > > > > > defer
> > >> > > > > > > to
> > >> > > > > > > > > > > > nightly.
> > >> > > > > > > > > > > > > > > > > * Provide a couple of basic sanity
> > >> performance
> > >> > > > > tests
> > >> > > > > > on
> > >> > > > > > > > > small
> > >> > > > > > > > > > > > > models
> > >> > > > > > > > > > > > > > > that
> > >> > > > > > > > > > > > > > > > > are run on CI and can be echoed by the
> > >> label
> > >> > > bot
> > >> > > > > as a
> > >> > > > > > > > > comment
> > >> > > > > > > > > > > for
> > >> > > > > > > > > > > > > > PRs.
> > >> > > > > > > > > > > > > > > > > * Address unit tests that take more
> than
> > >> > > 10-20s,
> > >> > > > > > > > streamline
> > >> > > > > > > > > > > them
> > >> > > > > > > > > > > > or
> > >> > > > > > > > > > > > > > > move
> > >> > > > > > > > > > > > > > > > > them to nightly if it can't be done.
> > >> > > > > > > > > > > > > > > > > * Open sourcing the remaining CI
> > >> infrastructure
> > >> > > > > > scripts
> > >> > > > > > > > so
> > >> > > > > > > > > > the
> > >> > > > > > > > > > > > > > > community
> > >> > > > > > > > > > > > > > > > > can contribute.
> > >> > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > I think our goal should be turnaround
> > >> under
> > >> > > > 30min.
> > >> > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > I would also like to touch base with
> the
> > >> > > > community
> > >> > > > > > that
> > >> > > > > > > > > some
> > >> > > > > > > > > > > PRs
> > >> > > > > > > > > > > > > are
> > >> > > > > > > > > > > > > > > not
> > >> > > > > > > > > > > > > > > > > being followed up by committers
> asking for
> > >> > > > changes.
> > >> > > > > > For
> > >> > > > > > > > > > example
> > >> > > > > > > > > > > > > this
> > >> > > > > > > > > > > > > > PR
> > >> > > > > > > > > > > > > > > > is
> > >> > > > > > > > > > > > > > > > > importtant and is hanging for a long
> time.
> > >> > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > >
> > >> > > > > https://github.com/apache/incubator-mxnet/pull/15051
> > >> > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > This is another, less important but
> more
> > >> > > trivial
> > >> > > > to
> > >> > > > > > > > review:
> > >> > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > >
> > >> > > > > https://github.com/apache/incubator-mxnet/pull/14940
> > >> > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > I think comitters requesting changes
> and
> > >> not
> > >> > > > > > folllowing
> > >> > > > > > > > up
> > >> > > > > > > > > in
> > >> > > > > > > > > > > > > > > reasonable
> > >> > > > > > > > > > > > > > > > > time is not healthy for the project. I
> > >> suggest
> > >> > > > > > > > configuring
> > >> > > > > > > > > > > github
> > >> > > > > > > > > > > > > > > > > Notifications for a good SNR and
> > >> following up.
> > >> > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > Regards.
> > >> > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > > > Pedro.
> > >> > > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > > >
> > >> > > > > > > > > > > > > >
> > >> > > > > > > > > > > > >
> > >> > > > > > > > > > > >
> > >> > > > > > > > > > >
> > >> > > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > >
> > >> > > > > > >
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >>
> > >
>

Re: CI and PRs

Posted by Aaron Markham <aa...@gmail.com>.

Is -R already in there?

Here's an example of it happening to me right now.... I am making
minor changes to the runtime_functions logic for handling the R docs
output. I pull the fix, then run the container, but I see the R deps
layer re-running. I didn't touch that. Why it that running again?

From https://github.com/aaronmarkham/incubator-mxnet
   f71cc6d..deec6aa  new_website_pipeline_2_aaron_rdocs ->
origin/new_website_pipeline_2_aaron_rdocs
Updating f71cc6d..deec6aa
Fast-forward
 ci/docker/runtime_functions.sh | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)
(base) ubuntu@ip-172-31-47-182:~/aaron/ci$ ./build.py
--docker-registry mxnetci --platform ubuntu_cpu_r
--docker-build-retries 3 --shm-size 500m /work/runtime_functions.sh
build_r_docs
build.py: 2019-08-16 18:34:44,639Z INFO MXNet container based build tool.
build.py: 2019-08-16 18:34:44,641Z INFO Docker cache download is
enabled from registry mxnetci
build.py: 2019-08-16 18:34:44,641Z INFO Loading Docker cache for
mxnetci/build.ubuntu_cpu_r from mxnetci
Using default tag: latest
latest: Pulling from mxnetci/build.ubuntu_cpu_r
Digest: sha256:7dc515c288b3e66d96920eb8975f985a501bb57f70595fbe0cb1c4fcd8d4184b
Status: Downloaded newer image for mxnetci/build.ubuntu_cpu_r:latest
build.py: 2019-08-16 18:34:44,807Z INFO Successfully pulled docker cache
build.py: 2019-08-16 18:34:44,807Z INFO Building docker container
tagged 'mxnetci/build.ubuntu_cpu_r' with docker
build.py: 2019-08-16 18:34:44,807Z INFO Running command: 'docker build
-f docker/Dockerfile.build.ubuntu_cpu_r --build-arg USER_ID=1000
--build-arg GROUP_ID=1000 --cache-from mxnetci/build.ubuntu_cpu_r -t
mxnetci/build.ubuntu_cpu_r docker'
Sending build context to Docker daemon  289.8kB
Step 1/15 : FROM ubuntu:16.04
 ---> 5e13f8dd4c1a
Step 2/15 : WORKDIR /work/deps
 ---> Using cache
 ---> afc2a135945d
Step 3/15 : COPY install/ubuntu_core.sh /work/
 ---> Using cache
 ---> da2b2e7f35e1
Step 4/15 : RUN /work/ubuntu_core.sh
 ---> Using cache
 ---> d1e88b26b1d2
Step 5/15 : COPY install/deb_ubuntu_ccache.sh /work/
 ---> Using cache
 ---> 3aa97dea3b7b
Step 6/15 : RUN /work/deb_ubuntu_ccache.sh
 ---> Using cache
 ---> bec503f1d149
Step 7/15 : COPY install/ubuntu_r.sh /work/
 ---> c5e77c38031d
Step 8/15 : COPY install/r.gpg /work/
 ---> d8cdbf015d2b
Step 9/15 : RUN /work/ubuntu_r.sh
 ---> Running in c6c90b9e1538
++ dirname /work/ubuntu_r.sh
+ cd /work
+ echo 'deb http://cran.rstudio.com/bin/linux/ubuntu trusty/'
+ apt-key add r.gpg
OK
+ add-apt-repository 'deb [arch=amd64,i386]
https://cran.rstudio.com/bin/linux/ubuntu xenial/'
+ apt-get update
Ign:1 http://cran.rstudio.com/bin/linux/ubuntu trusty/ InRelease

On Fri, Aug 16, 2019 at 11:32 AM Pedro Larroy
<pe...@gmail.com> wrote:
>
> Also, I forgot, another workaround is that I added the -R flag to the build
> logic (build.py) so the container is not rebuilt for manual use.
>
> On Fri, Aug 16, 2019 at 11:18 AM Pedro Larroy <pe...@gmail.com>
> wrote:
>
> >
> > Hi Aaron.
> >
> > As Marco explained, if you are in master the cache usually works, there's
> > two issues that I have observed:
> >
> > 1 - Docker doesn't automatically pull the base image (ex. ubuntu:16.04) so
> > if your cached base which is used in the FROM statement becomes outdated
> > your caching won't work. (Using docker pull ubuntu:16.04) or the base
> > images from the container helps with this.
> >
> > 2 - There's another situation where the above doesn't help which seems to
> > be an unidentified issue with the docker cache:
> > https://github.com/docker/docker.github.io/issues/8886
> >
> > We can get a short term workaround for #1 by explicitly pulling bases from
> > the script, but I think docker should do it when using --cache-from so
> > maybe contributing a patch to docker would the best approach.
> >
> > Pedro
> >
> > On Thu, Aug 15, 2019 at 7:06 PM Aaron Markham <aa...@gmail.com>
> > wrote:
> >
> >> When you create a new Dockerfile and use that on CI, it doesn't seem
> >> to cache some of the steps... like this:
> >>
> >> Step 13/15 : RUN /work/ubuntu_docs.sh
> >>  ---> Running in a1e522f3283b
> >>  [91m+ echo 'Installing dependencies...'
> >> + apt-get update
> >>  [0mInstalling dependencies.
> >>
> >> Or this....
> >>
> >> Step 4/13 : RUN /work/ubuntu_core.sh
> >>  ---> Running in e7882d7aa750
> >>  [91m+ apt-get update
> >>
> >> I get if I was changing those scripts, but then I'd think it should
> >> cache after running it once... but, no.
> >>
> >>
> >> On Thu, Aug 15, 2019 at 3:51 PM Marco de Abreu <ma...@gmail.com>
> >> wrote:
> >> >
> >> > Do I understand it correctly that you are saying that the Docker cache
> >> > doesn't work properly and regularly reinstalls dependencies? Or do you
> >> mean
> >> > that you only have cache misses when you modify the dependencies - which
> >> > would be expected?
> >> >
> >> > -Marco
> >> >
> >> > On Fri, Aug 16, 2019 at 12:48 AM Aaron Markham <
> >> aaron.s.markham@gmail.com>
> >> > wrote:
> >> >
> >> > > Many of the CI pipelines follow this pattern:
> >> > > Load ubuntu 16.04, install deps, build mxnet, then run some tests. Why
> >> > > repeat steps 1-3 over and over?
> >> > >
> >> > > Now, some tests use a stashed binary and docker cache. And I see this
> >> work
> >> > > locally, but for the most part, on CI, you're gonna sit through a
> >> > > dependency install.
> >> > >
> >> > > I noticed that almost all jobs use an ubuntu setup that is fully
> >> loaded.
> >> > > Without cache, it can take 10 or more minutes to build.  So I made a
> >> lite
> >> > > version. Takes only a few minutes instead.
> >> > >
> >> > > In some cases archiving worked great to share across pipelines, but as
> >> > > Marco mentioned we need a storage solution to make that happen. We
> >> can't
> >> > > archive every intermediate artifact for each PR.
> >> > >
> >> > > On Thu, Aug 15, 2019, 13:47 Pedro Larroy <
> >> pedro.larroy.lists@gmail.com>
> >> > > wrote:
> >> > >
> >> > > > Hi Aaron. Why speeds things up? What's the difference?
> >> > > >
> >> > > > Pedro.
> >> > > >
> >> > > > On Wed, Aug 14, 2019 at 8:39 PM Aaron Markham <
> >> aaron.s.markham@gmail.com
> >> > > >
> >> > > > wrote:
> >> > > >
> >> > > > > The PRs Thomas and I are working on for the new docs and website
> >> share
> >> > > > the
> >> > > > > mxnet binary in the new CI pipelines we made. Speeds things up a
> >> lot.
> >> > > > >
> >> > > > > On Wed, Aug 14, 2019, 18:16 Chris Olivier <cj...@gmail.com>
> >> > > wrote:
> >> > > > >
> >> > > > > > I see it done daily now, and while I can’t share all the
> >> details,
> >> > > it’s
> >> > > > > not
> >> > > > > > an incredibly complex thing, and involves not much more than
> >> nfs/efs
> >> > > > > > sharing and remote ssh commands.  All it takes is a little
> >> ingenuity
> >> > > > and
> >> > > > > > some imagination.
> >> > > > > >
> >> > > > > > On Wed, Aug 14, 2019 at 4:31 PM Pedro Larroy <
> >> > > > > pedro.larroy.lists@gmail.com
> >> > > > > > >
> >> > > > > > wrote:
> >> > > > > >
> >> > > > > > > Sounds good in theory. I think there are complex details with
> >> > > regards
> >> > > > > of
> >> > > > > > > resource sharing during parallel execution. Still I think
> >> both ways
> >> > > > can
> >> > > > > > be
> >> > > > > > > explored. I think some tests run for unreasonably long times
> >> for
> >> > > what
> >> > > > > > they
> >> > > > > > > are doing. We already scale parts of the pipeline horizontally
> >> > > across
> >> > > > > > > workers.
> >> > > > > > >
> >> > > > > > >
> >> > > > > > > On Wed, Aug 14, 2019 at 5:12 PM Chris Olivier <
> >> > > > cjolivier01@apache.org>
> >> > > > > > > wrote:
> >> > > > > > >
> >> > > > > > > > +1
> >> > > > > > > >
> >> > > > > > > > Rather than remove tests (which doesn’t scale as a
> >> solution), why
> >> > > > not
> >> > > > > > > scale
> >> > > > > > > > them horizontally so that they finish more quickly? Across
> >> > > > processes
> >> > > > > or
> >> > > > > > > > even on a pool of machines that aren’t necessarily the build
> >> > > > machine?
> >> > > > > > > >
> >> > > > > > > > On Wed, Aug 14, 2019 at 12:03 PM Marco de Abreu <
> >> > > > > > marco.g.abreu@gmail.com
> >> > > > > > > >
> >> > > > > > > > wrote:
> >> > > > > > > >
> >> > > > > > > > > With regards to time I rather prefer us spending a bit
> >> more
> >> > > time
> >> > > > on
> >> > > > > > > > > maintenance than somebody running into an error that
> >> could've
> >> > > > been
> >> > > > > > > caught
> >> > > > > > > > > with a test.
> >> > > > > > > > >
> >> > > > > > > > > I mean, our Publishing pipeline for Scala GPU has been
> >> broken
> >> > > for
> >> > > > > > quite
> >> > > > > > > > > some time now, but nobody noticed that. Basically my
> >> stance on
> >> > > > that
> >> > > > > > > > matter
> >> > > > > > > > > is that as soon as something is not blocking, you can
> >> also just
> >> > > > > > > > deactivate
> >> > > > > > > > > it since you don't have a forcing function in an open
> >> source
> >> > > > > project.
> >> > > > > > > > > People will rarely come back and fix the errors of some
> >> nightly
> >> > > > > test
> >> > > > > > > that
> >> > > > > > > > > they introduced.
> >> > > > > > > > >
> >> > > > > > > > > -Marco
> >> > > > > > > > >
> >> > > > > > > > > Carin Meier <ca...@gmail.com> schrieb am Mi., 14.
> >> Aug.
> >> > > > 2019,
> >> > > > > > > 21:59:
> >> > > > > > > > >
> >> > > > > > > > > > If a language binding test is failing for a not
> >> important
> >> > > > reason,
> >> > > > > > > then
> >> > > > > > > > it
> >> > > > > > > > > > is too brittle and needs to be fixed (we have fixed
> >> some of
> >> > > > these
> >> > > > > > > with
> >> > > > > > > > > the
> >> > > > > > > > > > Clojure package [1]).
> >> > > > > > > > > > But in general, if we thinking of the MXNet project as
> >> one
> >> > > > > project
> >> > > > > > > that
> >> > > > > > > > > is
> >> > > > > > > > > > across all the language bindings, then we want to know
> >> if
> >> > > some
> >> > > > > > > > > fundamental
> >> > > > > > > > > > code change is going to break a downstream package.
> >> > > > > > > > > > I can't speak for all the high level package binding
> >> > > > maintainers,
> >> > > > > > but
> >> > > > > > > > I'm
> >> > > > > > > > > > always happy to pitch in to provide code fixes to help
> >> the
> >> > > base
> >> > > > > PR
> >> > > > > > > get
> >> > > > > > > > > > green.
> >> > > > > > > > > >
> >> > > > > > > > > > The time costs to maintain such a large CI project
> >> obviously
> >> > > > > needs
> >> > > > > > to
> >> > > > > > > > be
> >> > > > > > > > > > considered as well.
> >> > > > > > > > > >
> >> > > > > > > > > > [1]
> >> https://github.com/apache/incubator-mxnet/pull/15579
> >> > > > > > > > > >
> >> > > > > > > > > > On Wed, Aug 14, 2019 at 3:48 PM Pedro Larroy <
> >> > > > > > > > > pedro.larroy.lists@gmail.com
> >> > > > > > > > > > >
> >> > > > > > > > > > wrote:
> >> > > > > > > > > >
> >> > > > > > > > > > > From what I have seen Clojure is 15 minutes, which I
> >> think
> >> > > is
> >> > > > > > > > > reasonable.
> >> > > > > > > > > > > The only question is that when a binding such as R,
> >> Perl or
> >> > > > > > Clojure
> >> > > > > > > > > > fails,
> >> > > > > > > > > > > some devs are a bit confused about how to fix them
> >> since
> >> > > they
> >> > > > > are
> >> > > > > > > not
> >> > > > > > > > > > > familiar with the testing tools and the language.
> >> > > > > > > > > > >
> >> > > > > > > > > > > On Wed, Aug 14, 2019 at 11:57 AM Carin Meier <
> >> > > > > > carinmeier@gmail.com
> >> > > > > > > >
> >> > > > > > > > > > wrote:
> >> > > > > > > > > > >
> >> > > > > > > > > > > > Great idea Marco! Anything that you think would be
> >> > > valuable
> >> > > > > to
> >> > > > > > > > share
> >> > > > > > > > > > > would
> >> > > > > > > > > > > > be good. The duration of each node in the test stage
> >> > > sounds
> >> > > > > > like
> >> > > > > > > a
> >> > > > > > > > > good
> >> > > > > > > > > > > > start.
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > - Carin
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > On Wed, Aug 14, 2019 at 2:48 PM Marco de Abreu <
> >> > > > > > > > > > marco.g.abreu@gmail.com>
> >> > > > > > > > > > > > wrote:
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > > Hi,
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > we record a bunch of metrics about run statistics
> >> (down
> >> > > > to
> >> > > > > > the
> >> > > > > > > > > > duration
> >> > > > > > > > > > > > of
> >> > > > > > > > > > > > > every individual step). If you tell me which ones
> >> > > you're
> >> > > > > > > > > particularly
> >> > > > > > > > > > > > > interested in (probably total duration of each
> >> node in
> >> > > > the
> >> > > > > > test
> >> > > > > > > > > > stage),
> >> > > > > > > > > > > > I'm
> >> > > > > > > > > > > > > happy to provide them.
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > Dimensions are (in hierarchical order):
> >> > > > > > > > > > > > > - job
> >> > > > > > > > > > > > > - branch
> >> > > > > > > > > > > > > - stage
> >> > > > > > > > > > > > > - node
> >> > > > > > > > > > > > > - step
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > Unfortunately I don't have the possibility to
> >> export
> >> > > them
> >> > > > > > since
> >> > > > > > > > we
> >> > > > > > > > > > > store
> >> > > > > > > > > > > > > them in CloudWatch Metrics which afaik doesn't
> >> offer
> >> > > raw
> >> > > > > > > exports.
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > Best regards,
> >> > > > > > > > > > > > > Marco
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > Carin Meier <ca...@gmail.com> schrieb am
> >> Mi., 14.
> >> > > > > Aug.
> >> > > > > > > > 2019,
> >> > > > > > > > > > > 19:43:
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > > I would prefer to keep the language binding in
> >> the PR
> >> > > > > > > process.
> >> > > > > > > > > > > Perhaps
> >> > > > > > > > > > > > we
> >> > > > > > > > > > > > > > could do some analytics to see how much each of
> >> the
> >> > > > > > language
> >> > > > > > > > > > bindings
> >> > > > > > > > > > > > is
> >> > > > > > > > > > > > > > contributing to overall run time.
> >> > > > > > > > > > > > > > If we have some metrics on that, maybe we can
> >> come up
> >> > > > > with
> >> > > > > > a
> >> > > > > > > > > > > guideline
> >> > > > > > > > > > > > of
> >> > > > > > > > > > > > > > how much time each should take. Another
> >> possibility
> >> > > is
> >> > > > > > > leverage
> >> > > > > > > > > the
> >> > > > > > > > > > > > > > parallel builds more.
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > On Wed, Aug 14, 2019 at 1:30 PM Pedro Larroy <
> >> > > > > > > > > > > > > pedro.larroy.lists@gmail.com
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > wrote:
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > Hi Carin.
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > That's a good point, all things considered
> >> would
> >> > > your
> >> > > > > > > > > preference
> >> > > > > > > > > > be
> >> > > > > > > > > > > > to
> >> > > > > > > > > > > > > > keep
> >> > > > > > > > > > > > > > > the Clojure tests as part of the PR process
> >> or in
> >> > > > > > Nightly?
> >> > > > > > > > > > > > > > > Some options are having notifications here or
> >> in
> >> > > > slack.
> >> > > > > > But
> >> > > > > > > > if
> >> > > > > > > > > we
> >> > > > > > > > > > > > think
> >> > > > > > > > > > > > > > > breakages would go unnoticed maybe is not a
> >> good
> >> > > idea
> >> > > > > to
> >> > > > > > > > fully
> >> > > > > > > > > > > remove
> >> > > > > > > > > > > > > > > bindings from the PR process and just
> >> streamline
> >> > > the
> >> > > > > > > process.
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > Pedro.
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > On Wed, Aug 14, 2019 at 5:09 AM Carin Meier <
> >> > > > > > > > > > carinmeier@gmail.com>
> >> > > > > > > > > > > > > > wrote:
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > Before any binding tests are moved to
> >> nightly, I
> >> > > > > think
> >> > > > > > we
> >> > > > > > > > > need
> >> > > > > > > > > > to
> >> > > > > > > > > > > > > > figure
> >> > > > > > > > > > > > > > > > out how the community can get proper
> >> > > notifications
> >> > > > of
> >> > > > > > > > failure
> >> > > > > > > > > > and
> >> > > > > > > > > > > > > > success
> >> > > > > > > > > > > > > > > > on those nightly runs. Otherwise, I think
> >> that
> >> > > > > > breakages
> >> > > > > > > > > would
> >> > > > > > > > > > go
> >> > > > > > > > > > > > > > > > unnoticed.
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > -Carin
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > On Tue, Aug 13, 2019 at 7:47 PM Pedro
> >> Larroy <
> >> > > > > > > > > > > > > > > pedro.larroy.lists@gmail.com
> >> > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > wrote:
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > Hi
> >> > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > Seems we are hitting some problems in CI.
> >> I
> >> > > > propose
> >> > > > > > the
> >> > > > > > > > > > > following
> >> > > > > > > > > > > > > > > action
> >> > > > > > > > > > > > > > > > > items to remedy the situation and
> >> accelerate
> >> > > turn
> >> > > > > > > around
> >> > > > > > > > > > times
> >> > > > > > > > > > > in
> >> > > > > > > > > > > > > CI,
> >> > > > > > > > > > > > > > > > > reduce cost, complexity and probability of
> >> > > > failure
> >> > > > > > > > blocking
> >> > > > > > > > > > PRs
> >> > > > > > > > > > > > and
> >> > > > > > > > > > > > > > > > > frustrating developers:
> >> > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > * Upgrade Windows visual studio from VS
> >> 2015 to
> >> > > > VS
> >> > > > > > > 2017.
> >> > > > > > > > > The
> >> > > > > > > > > > > > > > > > > build_windows.py infrastructure should
> >> easily
> >> > > > work
> >> > > > > > with
> >> > > > > > > > the
> >> > > > > > > > > > new
> >> > > > > > > > > > > > > > > version.
> >> > > > > > > > > > > > > > > > > Currently some PRs are blocked by this:
> >> > > > > > > > > > > > > > > > >
> >> > > > > > https://github.com/apache/incubator-mxnet/issues/13958
> >> > > > > > > > > > > > > > > > > * Move Gluon Model zoo tests to nightly.
> >> > > Tracked
> >> > > > at
> >> > > > > > > > > > > > > > > > >
> >> > > > > > https://github.com/apache/incubator-mxnet/issues/15295
> >> > > > > > > > > > > > > > > > > * Move non-python bindings tests to
> >> nightly.
> >> > > If a
> >> > > > > > > commit
> >> > > > > > > > is
> >> > > > > > > > > > > > > touching
> >> > > > > > > > > > > > > > > > other
> >> > > > > > > > > > > > > > > > > bindings, the reviewer should ask for a
> >> full
> >> > > run
> >> > > > > > which
> >> > > > > > > > can
> >> > > > > > > > > be
> >> > > > > > > > > > > > done
> >> > > > > > > > > > > > > > > > locally,
> >> > > > > > > > > > > > > > > > > use the label bot to trigger a full CI
> >> build,
> >> > > or
> >> > > > > > defer
> >> > > > > > > to
> >> > > > > > > > > > > > nightly.
> >> > > > > > > > > > > > > > > > > * Provide a couple of basic sanity
> >> performance
> >> > > > > tests
> >> > > > > > on
> >> > > > > > > > > small
> >> > > > > > > > > > > > > models
> >> > > > > > > > > > > > > > > that
> >> > > > > > > > > > > > > > > > > are run on CI and can be echoed by the
> >> label
> >> > > bot
> >> > > > > as a
> >> > > > > > > > > comment
> >> > > > > > > > > > > for
> >> > > > > > > > > > > > > > PRs.
> >> > > > > > > > > > > > > > > > > * Address unit tests that take more than
> >> > > 10-20s,
> >> > > > > > > > streamline
> >> > > > > > > > > > > them
> >> > > > > > > > > > > > or
> >> > > > > > > > > > > > > > > move
> >> > > > > > > > > > > > > > > > > them to nightly if it can't be done.
> >> > > > > > > > > > > > > > > > > * Open sourcing the remaining CI
> >> infrastructure
> >> > > > > > scripts
> >> > > > > > > > so
> >> > > > > > > > > > the
> >> > > > > > > > > > > > > > > community
> >> > > > > > > > > > > > > > > > > can contribute.
> >> > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > I think our goal should be turnaround
> >> under
> >> > > > 30min.
> >> > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > I would also like to touch base with the
> >> > > > community
> >> > > > > > that
> >> > > > > > > > > some
> >> > > > > > > > > > > PRs
> >> > > > > > > > > > > > > are
> >> > > > > > > > > > > > > > > not
> >> > > > > > > > > > > > > > > > > being followed up by committers asking for
> >> > > > changes.
> >> > > > > > For
> >> > > > > > > > > > example
> >> > > > > > > > > > > > > this
> >> > > > > > > > > > > > > > PR
> >> > > > > > > > > > > > > > > > is
> >> > > > > > > > > > > > > > > > > importtant and is hanging for a long time.
> >> > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > >
> >> > > > > https://github.com/apache/incubator-mxnet/pull/15051
> >> > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > This is another, less important but more
> >> > > trivial
> >> > > > to
> >> > > > > > > > review:
> >> > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > >
> >> > > > > https://github.com/apache/incubator-mxnet/pull/14940
> >> > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > I think comitters requesting changes and
> >> not
> >> > > > > > folllowing
> >> > > > > > > > up
> >> > > > > > > > > in
> >> > > > > > > > > > > > > > > reasonable
> >> > > > > > > > > > > > > > > > > time is not healthy for the project. I
> >> suggest
> >> > > > > > > > configuring
> >> > > > > > > > > > > github
> >> > > > > > > > > > > > > > > > > Notifications for a good SNR and
> >> following up.
> >> > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > Regards.
> >> > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > > > Pedro.
> >> > > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > > >
> >> > > > > > > > > > > > > >
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >>
> >

Re: CI and PRs

Posted by Pedro Larroy <pe...@gmail.com>.

Also, I forgot, another workaround is that I added the -R flag to the build
logic (build.py) so the container is not rebuilt for manual use.

On Fri, Aug 16, 2019 at 11:18 AM Pedro Larroy <pe...@gmail.com>
wrote:

>
> Hi Aaron.
>
> As Marco explained, if you are in master the cache usually works, there's
> two issues that I have observed:
>
> 1 - Docker doesn't automatically pull the base image (ex. ubuntu:16.04) so
> if your cached base which is used in the FROM statement becomes outdated
> your caching won't work. (Using docker pull ubuntu:16.04) or the base
> images from the container helps with this.
>
> 2 - There's another situation where the above doesn't help which seems to
> be an unidentified issue with the docker cache:
> https://github.com/docker/docker.github.io/issues/8886
>
> We can get a short term workaround for #1 by explicitly pulling bases from
> the script, but I think docker should do it when using --cache-from so
> maybe contributing a patch to docker would the best approach.
>
> Pedro
>
> On Thu, Aug 15, 2019 at 7:06 PM Aaron Markham <aa...@gmail.com>
> wrote:
>
>> When you create a new Dockerfile and use that on CI, it doesn't seem
>> to cache some of the steps... like this:
>>
>> Step 13/15 : RUN /work/ubuntu_docs.sh
>>  ---> Running in a1e522f3283b
>>  [91m+ echo 'Installing dependencies...'
>> + apt-get update
>>  [0mInstalling dependencies.
>>
>> Or this....
>>
>> Step 4/13 : RUN /work/ubuntu_core.sh
>>  ---> Running in e7882d7aa750
>>  [91m+ apt-get update
>>
>> I get if I was changing those scripts, but then I'd think it should
>> cache after running it once... but, no.
>>
>>
>> On Thu, Aug 15, 2019 at 3:51 PM Marco de Abreu <ma...@gmail.com>
>> wrote:
>> >
>> > Do I understand it correctly that you are saying that the Docker cache
>> > doesn't work properly and regularly reinstalls dependencies? Or do you
>> mean
>> > that you only have cache misses when you modify the dependencies - which
>> > would be expected?
>> >
>> > -Marco
>> >
>> > On Fri, Aug 16, 2019 at 12:48 AM Aaron Markham <
>> aaron.s.markham@gmail.com>
>> > wrote:
>> >
>> > > Many of the CI pipelines follow this pattern:
>> > > Load ubuntu 16.04, install deps, build mxnet, then run some tests. Why
>> > > repeat steps 1-3 over and over?
>> > >
>> > > Now, some tests use a stashed binary and docker cache. And I see this
>> work
>> > > locally, but for the most part, on CI, you're gonna sit through a
>> > > dependency install.
>> > >
>> > > I noticed that almost all jobs use an ubuntu setup that is fully
>> loaded.
>> > > Without cache, it can take 10 or more minutes to build.  So I made a
>> lite
>> > > version. Takes only a few minutes instead.
>> > >
>> > > In some cases archiving worked great to share across pipelines, but as
>> > > Marco mentioned we need a storage solution to make that happen. We
>> can't
>> > > archive every intermediate artifact for each PR.
>> > >
>> > > On Thu, Aug 15, 2019, 13:47 Pedro Larroy <
>> pedro.larroy.lists@gmail.com>
>> > > wrote:
>> > >
>> > > > Hi Aaron. Why speeds things up? What's the difference?
>> > > >
>> > > > Pedro.
>> > > >
>> > > > On Wed, Aug 14, 2019 at 8:39 PM Aaron Markham <
>> aaron.s.markham@gmail.com
>> > > >
>> > > > wrote:
>> > > >
>> > > > > The PRs Thomas and I are working on for the new docs and website
>> share
>> > > > the
>> > > > > mxnet binary in the new CI pipelines we made. Speeds things up a
>> lot.
>> > > > >
>> > > > > On Wed, Aug 14, 2019, 18:16 Chris Olivier <cj...@gmail.com>
>> > > wrote:
>> > > > >
>> > > > > > I see it done daily now, and while I can’t share all the
>> details,
>> > > it’s
>> > > > > not
>> > > > > > an incredibly complex thing, and involves not much more than
>> nfs/efs
>> > > > > > sharing and remote ssh commands.  All it takes is a little
>> ingenuity
>> > > > and
>> > > > > > some imagination.
>> > > > > >
>> > > > > > On Wed, Aug 14, 2019 at 4:31 PM Pedro Larroy <
>> > > > > pedro.larroy.lists@gmail.com
>> > > > > > >
>> > > > > > wrote:
>> > > > > >
>> > > > > > > Sounds good in theory. I think there are complex details with
>> > > regards
>> > > > > of
>> > > > > > > resource sharing during parallel execution. Still I think
>> both ways
>> > > > can
>> > > > > > be
>> > > > > > > explored. I think some tests run for unreasonably long times
>> for
>> > > what
>> > > > > > they
>> > > > > > > are doing. We already scale parts of the pipeline horizontally
>> > > across
>> > > > > > > workers.
>> > > > > > >
>> > > > > > >
>> > > > > > > On Wed, Aug 14, 2019 at 5:12 PM Chris Olivier <
>> > > > cjolivier01@apache.org>
>> > > > > > > wrote:
>> > > > > > >
>> > > > > > > > +1
>> > > > > > > >
>> > > > > > > > Rather than remove tests (which doesn’t scale as a
>> solution), why
>> > > > not
>> > > > > > > scale
>> > > > > > > > them horizontally so that they finish more quickly? Across
>> > > > processes
>> > > > > or
>> > > > > > > > even on a pool of machines that aren’t necessarily the build
>> > > > machine?
>> > > > > > > >
>> > > > > > > > On Wed, Aug 14, 2019 at 12:03 PM Marco de Abreu <
>> > > > > > marco.g.abreu@gmail.com
>> > > > > > > >
>> > > > > > > > wrote:
>> > > > > > > >
>> > > > > > > > > With regards to time I rather prefer us spending a bit
>> more
>> > > time
>> > > > on
>> > > > > > > > > maintenance than somebody running into an error that
>> could've
>> > > > been
>> > > > > > > caught
>> > > > > > > > > with a test.
>> > > > > > > > >
>> > > > > > > > > I mean, our Publishing pipeline for Scala GPU has been
>> broken
>> > > for
>> > > > > > quite
>> > > > > > > > > some time now, but nobody noticed that. Basically my
>> stance on
>> > > > that
>> > > > > > > > matter
>> > > > > > > > > is that as soon as something is not blocking, you can
>> also just
>> > > > > > > > deactivate
>> > > > > > > > > it since you don't have a forcing function in an open
>> source
>> > > > > project.
>> > > > > > > > > People will rarely come back and fix the errors of some
>> nightly
>> > > > > test
>> > > > > > > that
>> > > > > > > > > they introduced.
>> > > > > > > > >
>> > > > > > > > > -Marco
>> > > > > > > > >
>> > > > > > > > > Carin Meier <ca...@gmail.com> schrieb am Mi., 14.
>> Aug.
>> > > > 2019,
>> > > > > > > 21:59:
>> > > > > > > > >
>> > > > > > > > > > If a language binding test is failing for a not
>> important
>> > > > reason,
>> > > > > > > then
>> > > > > > > > it
>> > > > > > > > > > is too brittle and needs to be fixed (we have fixed
>> some of
>> > > > these
>> > > > > > > with
>> > > > > > > > > the
>> > > > > > > > > > Clojure package [1]).
>> > > > > > > > > > But in general, if we thinking of the MXNet project as
>> one
>> > > > > project
>> > > > > > > that
>> > > > > > > > > is
>> > > > > > > > > > across all the language bindings, then we want to know
>> if
>> > > some
>> > > > > > > > > fundamental
>> > > > > > > > > > code change is going to break a downstream package.
>> > > > > > > > > > I can't speak for all the high level package binding
>> > > > maintainers,
>> > > > > > but
>> > > > > > > > I'm
>> > > > > > > > > > always happy to pitch in to provide code fixes to help
>> the
>> > > base
>> > > > > PR
>> > > > > > > get
>> > > > > > > > > > green.
>> > > > > > > > > >
>> > > > > > > > > > The time costs to maintain such a large CI project
>> obviously
>> > > > > needs
>> > > > > > to
>> > > > > > > > be
>> > > > > > > > > > considered as well.
>> > > > > > > > > >
>> > > > > > > > > > [1]
>> https://github.com/apache/incubator-mxnet/pull/15579
>> > > > > > > > > >
>> > > > > > > > > > On Wed, Aug 14, 2019 at 3:48 PM Pedro Larroy <
>> > > > > > > > > pedro.larroy.lists@gmail.com
>> > > > > > > > > > >
>> > > > > > > > > > wrote:
>> > > > > > > > > >
>> > > > > > > > > > > From what I have seen Clojure is 15 minutes, which I
>> think
>> > > is
>> > > > > > > > > reasonable.
>> > > > > > > > > > > The only question is that when a binding such as R,
>> Perl or
>> > > > > > Clojure
>> > > > > > > > > > fails,
>> > > > > > > > > > > some devs are a bit confused about how to fix them
>> since
>> > > they
>> > > > > are
>> > > > > > > not
>> > > > > > > > > > > familiar with the testing tools and the language.
>> > > > > > > > > > >
>> > > > > > > > > > > On Wed, Aug 14, 2019 at 11:57 AM Carin Meier <
>> > > > > > carinmeier@gmail.com
>> > > > > > > >
>> > > > > > > > > > wrote:
>> > > > > > > > > > >
>> > > > > > > > > > > > Great idea Marco! Anything that you think would be
>> > > valuable
>> > > > > to
>> > > > > > > > share
>> > > > > > > > > > > would
>> > > > > > > > > > > > be good. The duration of each node in the test stage
>> > > sounds
>> > > > > > like
>> > > > > > > a
>> > > > > > > > > good
>> > > > > > > > > > > > start.
>> > > > > > > > > > > >
>> > > > > > > > > > > > - Carin
>> > > > > > > > > > > >
>> > > > > > > > > > > > On Wed, Aug 14, 2019 at 2:48 PM Marco de Abreu <
>> > > > > > > > > > marco.g.abreu@gmail.com>
>> > > > > > > > > > > > wrote:
>> > > > > > > > > > > >
>> > > > > > > > > > > > > Hi,
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > we record a bunch of metrics about run statistics
>> (down
>> > > > to
>> > > > > > the
>> > > > > > > > > > duration
>> > > > > > > > > > > > of
>> > > > > > > > > > > > > every individual step). If you tell me which ones
>> > > you're
>> > > > > > > > > particularly
>> > > > > > > > > > > > > interested in (probably total duration of each
>> node in
>> > > > the
>> > > > > > test
>> > > > > > > > > > stage),
>> > > > > > > > > > > > I'm
>> > > > > > > > > > > > > happy to provide them.
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > Dimensions are (in hierarchical order):
>> > > > > > > > > > > > > - job
>> > > > > > > > > > > > > - branch
>> > > > > > > > > > > > > - stage
>> > > > > > > > > > > > > - node
>> > > > > > > > > > > > > - step
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > Unfortunately I don't have the possibility to
>> export
>> > > them
>> > > > > > since
>> > > > > > > > we
>> > > > > > > > > > > store
>> > > > > > > > > > > > > them in CloudWatch Metrics which afaik doesn't
>> offer
>> > > raw
>> > > > > > > exports.
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > Best regards,
>> > > > > > > > > > > > > Marco
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > Carin Meier <ca...@gmail.com> schrieb am
>> Mi., 14.
>> > > > > Aug.
>> > > > > > > > 2019,
>> > > > > > > > > > > 19:43:
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > > I would prefer to keep the language binding in
>> the PR
>> > > > > > > process.
>> > > > > > > > > > > Perhaps
>> > > > > > > > > > > > we
>> > > > > > > > > > > > > > could do some analytics to see how much each of
>> the
>> > > > > > language
>> > > > > > > > > > bindings
>> > > > > > > > > > > > is
>> > > > > > > > > > > > > > contributing to overall run time.
>> > > > > > > > > > > > > > If we have some metrics on that, maybe we can
>> come up
>> > > > > with
>> > > > > > a
>> > > > > > > > > > > guideline
>> > > > > > > > > > > > of
>> > > > > > > > > > > > > > how much time each should take. Another
>> possibility
>> > > is
>> > > > > > > leverage
>> > > > > > > > > the
>> > > > > > > > > > > > > > parallel builds more.
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > On Wed, Aug 14, 2019 at 1:30 PM Pedro Larroy <
>> > > > > > > > > > > > > pedro.larroy.lists@gmail.com
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > wrote:
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > Hi Carin.
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > That's a good point, all things considered
>> would
>> > > your
>> > > > > > > > > preference
>> > > > > > > > > > be
>> > > > > > > > > > > > to
>> > > > > > > > > > > > > > keep
>> > > > > > > > > > > > > > > the Clojure tests as part of the PR process
>> or in
>> > > > > > Nightly?
>> > > > > > > > > > > > > > > Some options are having notifications here or
>> in
>> > > > slack.
>> > > > > > But
>> > > > > > > > if
>> > > > > > > > > we
>> > > > > > > > > > > > think
>> > > > > > > > > > > > > > > breakages would go unnoticed maybe is not a
>> good
>> > > idea
>> > > > > to
>> > > > > > > > fully
>> > > > > > > > > > > remove
>> > > > > > > > > > > > > > > bindings from the PR process and just
>> streamline
>> > > the
>> > > > > > > process.
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > Pedro.
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > On Wed, Aug 14, 2019 at 5:09 AM Carin Meier <
>> > > > > > > > > > carinmeier@gmail.com>
>> > > > > > > > > > > > > > wrote:
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > Before any binding tests are moved to
>> nightly, I
>> > > > > think
>> > > > > > we
>> > > > > > > > > need
>> > > > > > > > > > to
>> > > > > > > > > > > > > > figure
>> > > > > > > > > > > > > > > > out how the community can get proper
>> > > notifications
>> > > > of
>> > > > > > > > failure
>> > > > > > > > > > and
>> > > > > > > > > > > > > > success
>> > > > > > > > > > > > > > > > on those nightly runs. Otherwise, I think
>> that
>> > > > > > breakages
>> > > > > > > > > would
>> > > > > > > > > > go
>> > > > > > > > > > > > > > > > unnoticed.
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > -Carin
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > On Tue, Aug 13, 2019 at 7:47 PM Pedro
>> Larroy <
>> > > > > > > > > > > > > > > pedro.larroy.lists@gmail.com
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > wrote:
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > Hi
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > Seems we are hitting some problems in CI.
>> I
>> > > > propose
>> > > > > > the
>> > > > > > > > > > > following
>> > > > > > > > > > > > > > > action
>> > > > > > > > > > > > > > > > > items to remedy the situation and
>> accelerate
>> > > turn
>> > > > > > > around
>> > > > > > > > > > times
>> > > > > > > > > > > in
>> > > > > > > > > > > > > CI,
>> > > > > > > > > > > > > > > > > reduce cost, complexity and probability of
>> > > > failure
>> > > > > > > > blocking
>> > > > > > > > > > PRs
>> > > > > > > > > > > > and
>> > > > > > > > > > > > > > > > > frustrating developers:
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > * Upgrade Windows visual studio from VS
>> 2015 to
>> > > > VS
>> > > > > > > 2017.
>> > > > > > > > > The
>> > > > > > > > > > > > > > > > > build_windows.py infrastructure should
>> easily
>> > > > work
>> > > > > > with
>> > > > > > > > the
>> > > > > > > > > > new
>> > > > > > > > > > > > > > > version.
>> > > > > > > > > > > > > > > > > Currently some PRs are blocked by this:
>> > > > > > > > > > > > > > > > >
>> > > > > > https://github.com/apache/incubator-mxnet/issues/13958
>> > > > > > > > > > > > > > > > > * Move Gluon Model zoo tests to nightly.
>> > > Tracked
>> > > > at
>> > > > > > > > > > > > > > > > >
>> > > > > > https://github.com/apache/incubator-mxnet/issues/15295
>> > > > > > > > > > > > > > > > > * Move non-python bindings tests to
>> nightly.
>> > > If a
>> > > > > > > commit
>> > > > > > > > is
>> > > > > > > > > > > > > touching
>> > > > > > > > > > > > > > > > other
>> > > > > > > > > > > > > > > > > bindings, the reviewer should ask for a
>> full
>> > > run
>> > > > > > which
>> > > > > > > > can
>> > > > > > > > > be
>> > > > > > > > > > > > done
>> > > > > > > > > > > > > > > > locally,
>> > > > > > > > > > > > > > > > > use the label bot to trigger a full CI
>> build,
>> > > or
>> > > > > > defer
>> > > > > > > to
>> > > > > > > > > > > > nightly.
>> > > > > > > > > > > > > > > > > * Provide a couple of basic sanity
>> performance
>> > > > > tests
>> > > > > > on
>> > > > > > > > > small
>> > > > > > > > > > > > > models
>> > > > > > > > > > > > > > > that
>> > > > > > > > > > > > > > > > > are run on CI and can be echoed by the
>> label
>> > > bot
>> > > > > as a
>> > > > > > > > > comment
>> > > > > > > > > > > for
>> > > > > > > > > > > > > > PRs.
>> > > > > > > > > > > > > > > > > * Address unit tests that take more than
>> > > 10-20s,
>> > > > > > > > streamline
>> > > > > > > > > > > them
>> > > > > > > > > > > > or
>> > > > > > > > > > > > > > > move
>> > > > > > > > > > > > > > > > > them to nightly if it can't be done.
>> > > > > > > > > > > > > > > > > * Open sourcing the remaining CI
>> infrastructure
>> > > > > > scripts
>> > > > > > > > so
>> > > > > > > > > > the
>> > > > > > > > > > > > > > > community
>> > > > > > > > > > > > > > > > > can contribute.
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > I think our goal should be turnaround
>> under
>> > > > 30min.
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > I would also like to touch base with the
>> > > > community
>> > > > > > that
>> > > > > > > > > some
>> > > > > > > > > > > PRs
>> > > > > > > > > > > > > are
>> > > > > > > > > > > > > > > not
>> > > > > > > > > > > > > > > > > being followed up by committers asking for
>> > > > changes.
>> > > > > > For
>> > > > > > > > > > example
>> > > > > > > > > > > > > this
>> > > > > > > > > > > > > > PR
>> > > > > > > > > > > > > > > > is
>> > > > > > > > > > > > > > > > > importtant and is hanging for a long time.
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > >
>> > > > > https://github.com/apache/incubator-mxnet/pull/15051
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > This is another, less important but more
>> > > trivial
>> > > > to
>> > > > > > > > review:
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > >
>> > > > > https://github.com/apache/incubator-mxnet/pull/14940
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > I think comitters requesting changes and
>> not
>> > > > > > folllowing
>> > > > > > > > up
>> > > > > > > > > in
>> > > > > > > > > > > > > > > reasonable
>> > > > > > > > > > > > > > > > > time is not healthy for the project. I
>> suggest
>> > > > > > > > configuring
>> > > > > > > > > > > github
>> > > > > > > > > > > > > > > > > Notifications for a good SNR and
>> following up.
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > Regards.
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > Pedro.
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > >
>> > > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > >
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>>
>

Re: CI and PRs

Posted by Pedro Larroy <pe...@gmail.com>.

Hi Aaron.

As Marco explained, if you are in master the cache usually works, there's
two issues that I have observed:

1 - Docker doesn't automatically pull the base image (ex. ubuntu:16.04) so
if your cached base which is used in the FROM statement becomes outdated
your caching won't work. (Using docker pull ubuntu:16.04) or the base
images from the container helps with this.

2 - There's another situation where the above doesn't help which seems to
be an unidentified issue with the docker cache:
https://github.com/docker/docker.github.io/issues/8886

We can get a short term workaround for #1 by explicitly pulling bases from
the script, but I think docker should do it when using --cache-from so
maybe contributing a patch to docker would the best approach.

Pedro

On Thu, Aug 15, 2019 at 7:06 PM Aaron Markham <aa...@gmail.com>
wrote:

> When you create a new Dockerfile and use that on CI, it doesn't seem
> to cache some of the steps... like this:
>
> Step 13/15 : RUN /work/ubuntu_docs.sh
>  ---> Running in a1e522f3283b
>  [91m+ echo 'Installing dependencies...'
> + apt-get update
>  [0mInstalling dependencies.
>
> Or this....
>
> Step 4/13 : RUN /work/ubuntu_core.sh
>  ---> Running in e7882d7aa750
>  [91m+ apt-get update
>
> I get if I was changing those scripts, but then I'd think it should
> cache after running it once... but, no.
>
>
> On Thu, Aug 15, 2019 at 3:51 PM Marco de Abreu <ma...@gmail.com>
> wrote:
> >
> > Do I understand it correctly that you are saying that the Docker cache
> > doesn't work properly and regularly reinstalls dependencies? Or do you
> mean
> > that you only have cache misses when you modify the dependencies - which
> > would be expected?
> >
> > -Marco
> >
> > On Fri, Aug 16, 2019 at 12:48 AM Aaron Markham <
> aaron.s.markham@gmail.com>
> > wrote:
> >
> > > Many of the CI pipelines follow this pattern:
> > > Load ubuntu 16.04, install deps, build mxnet, then run some tests. Why
> > > repeat steps 1-3 over and over?
> > >
> > > Now, some tests use a stashed binary and docker cache. And I see this
> work
> > > locally, but for the most part, on CI, you're gonna sit through a
> > > dependency install.
> > >
> > > I noticed that almost all jobs use an ubuntu setup that is fully
> loaded.
> > > Without cache, it can take 10 or more minutes to build.  So I made a
> lite
> > > version. Takes only a few minutes instead.
> > >
> > > In some cases archiving worked great to share across pipelines, but as
> > > Marco mentioned we need a storage solution to make that happen. We
> can't
> > > archive every intermediate artifact for each PR.
> > >
> > > On Thu, Aug 15, 2019, 13:47 Pedro Larroy <pedro.larroy.lists@gmail.com
> >
> > > wrote:
> > >
> > > > Hi Aaron. Why speeds things up? What's the difference?
> > > >
> > > > Pedro.
> > > >
> > > > On Wed, Aug 14, 2019 at 8:39 PM Aaron Markham <
> aaron.s.markham@gmail.com
> > > >
> > > > wrote:
> > > >
> > > > > The PRs Thomas and I are working on for the new docs and website
> share
> > > > the
> > > > > mxnet binary in the new CI pipelines we made. Speeds things up a
> lot.
> > > > >
> > > > > On Wed, Aug 14, 2019, 18:16 Chris Olivier <cj...@gmail.com>
> > > wrote:
> > > > >
> > > > > > I see it done daily now, and while I can’t share all the details,
> > > it’s
> > > > > not
> > > > > > an incredibly complex thing, and involves not much more than
> nfs/efs
> > > > > > sharing and remote ssh commands.  All it takes is a little
> ingenuity
> > > > and
> > > > > > some imagination.
> > > > > >
> > > > > > On Wed, Aug 14, 2019 at 4:31 PM Pedro Larroy <
> > > > > pedro.larroy.lists@gmail.com
> > > > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Sounds good in theory. I think there are complex details with
> > > regards
> > > > > of
> > > > > > > resource sharing during parallel execution. Still I think both
> ways
> > > > can
> > > > > > be
> > > > > > > explored. I think some tests run for unreasonably long times
> for
> > > what
> > > > > > they
> > > > > > > are doing. We already scale parts of the pipeline horizontally
> > > across
> > > > > > > workers.
> > > > > > >
> > > > > > >
> > > > > > > On Wed, Aug 14, 2019 at 5:12 PM Chris Olivier <
> > > > cjolivier01@apache.org>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > +1
> > > > > > > >
> > > > > > > > Rather than remove tests (which doesn’t scale as a
> solution), why
> > > > not
> > > > > > > scale
> > > > > > > > them horizontally so that they finish more quickly? Across
> > > > processes
> > > > > or
> > > > > > > > even on a pool of machines that aren’t necessarily the build
> > > > machine?
> > > > > > > >
> > > > > > > > On Wed, Aug 14, 2019 at 12:03 PM Marco de Abreu <
> > > > > > marco.g.abreu@gmail.com
> > > > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > With regards to time I rather prefer us spending a bit more
> > > time
> > > > on
> > > > > > > > > maintenance than somebody running into an error that
> could've
> > > > been
> > > > > > > caught
> > > > > > > > > with a test.
> > > > > > > > >
> > > > > > > > > I mean, our Publishing pipeline for Scala GPU has been
> broken
> > > for
> > > > > > quite
> > > > > > > > > some time now, but nobody noticed that. Basically my
> stance on
> > > > that
> > > > > > > > matter
> > > > > > > > > is that as soon as something is not blocking, you can also
> just
> > > > > > > > deactivate
> > > > > > > > > it since you don't have a forcing function in an open
> source
> > > > > project.
> > > > > > > > > People will rarely come back and fix the errors of some
> nightly
> > > > > test
> > > > > > > that
> > > > > > > > > they introduced.
> > > > > > > > >
> > > > > > > > > -Marco
> > > > > > > > >
> > > > > > > > > Carin Meier <ca...@gmail.com> schrieb am Mi., 14.
> Aug.
> > > > 2019,
> > > > > > > 21:59:
> > > > > > > > >
> > > > > > > > > > If a language binding test is failing for a not important
> > > > reason,
> > > > > > > then
> > > > > > > > it
> > > > > > > > > > is too brittle and needs to be fixed (we have fixed some
> of
> > > > these
> > > > > > > with
> > > > > > > > > the
> > > > > > > > > > Clojure package [1]).
> > > > > > > > > > But in general, if we thinking of the MXNet project as
> one
> > > > > project
> > > > > > > that
> > > > > > > > > is
> > > > > > > > > > across all the language bindings, then we want to know if
> > > some
> > > > > > > > > fundamental
> > > > > > > > > > code change is going to break a downstream package.
> > > > > > > > > > I can't speak for all the high level package binding
> > > > maintainers,
> > > > > > but
> > > > > > > > I'm
> > > > > > > > > > always happy to pitch in to provide code fixes to help
> the
> > > base
> > > > > PR
> > > > > > > get
> > > > > > > > > > green.
> > > > > > > > > >
> > > > > > > > > > The time costs to maintain such a large CI project
> obviously
> > > > > needs
> > > > > > to
> > > > > > > > be
> > > > > > > > > > considered as well.
> > > > > > > > > >
> > > > > > > > > > [1] https://github.com/apache/incubator-mxnet/pull/15579
> > > > > > > > > >
> > > > > > > > > > On Wed, Aug 14, 2019 at 3:48 PM Pedro Larroy <
> > > > > > > > > pedro.larroy.lists@gmail.com
> > > > > > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > From what I have seen Clojure is 15 minutes, which I
> think
> > > is
> > > > > > > > > reasonable.
> > > > > > > > > > > The only question is that when a binding such as R,
> Perl or
> > > > > > Clojure
> > > > > > > > > > fails,
> > > > > > > > > > > some devs are a bit confused about how to fix them
> since
> > > they
> > > > > are
> > > > > > > not
> > > > > > > > > > > familiar with the testing tools and the language.
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Aug 14, 2019 at 11:57 AM Carin Meier <
> > > > > > carinmeier@gmail.com
> > > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Great idea Marco! Anything that you think would be
> > > valuable
> > > > > to
> > > > > > > > share
> > > > > > > > > > > would
> > > > > > > > > > > > be good. The duration of each node in the test stage
> > > sounds
> > > > > > like
> > > > > > > a
> > > > > > > > > good
> > > > > > > > > > > > start.
> > > > > > > > > > > >
> > > > > > > > > > > > - Carin
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, Aug 14, 2019 at 2:48 PM Marco de Abreu <
> > > > > > > > > > marco.g.abreu@gmail.com>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi,
> > > > > > > > > > > > >
> > > > > > > > > > > > > we record a bunch of metrics about run statistics
> (down
> > > > to
> > > > > > the
> > > > > > > > > > duration
> > > > > > > > > > > > of
> > > > > > > > > > > > > every individual step). If you tell me which ones
> > > you're
> > > > > > > > > particularly
> > > > > > > > > > > > > interested in (probably total duration of each
> node in
> > > > the
> > > > > > test
> > > > > > > > > > stage),
> > > > > > > > > > > > I'm
> > > > > > > > > > > > > happy to provide them.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Dimensions are (in hierarchical order):
> > > > > > > > > > > > > - job
> > > > > > > > > > > > > - branch
> > > > > > > > > > > > > - stage
> > > > > > > > > > > > > - node
> > > > > > > > > > > > > - step
> > > > > > > > > > > > >
> > > > > > > > > > > > > Unfortunately I don't have the possibility to
> export
> > > them
> > > > > > since
> > > > > > > > we
> > > > > > > > > > > store
> > > > > > > > > > > > > them in CloudWatch Metrics which afaik doesn't
> offer
> > > raw
> > > > > > > exports.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Best regards,
> > > > > > > > > > > > > Marco
> > > > > > > > > > > > >
> > > > > > > > > > > > > Carin Meier <ca...@gmail.com> schrieb am
> Mi., 14.
> > > > > Aug.
> > > > > > > > 2019,
> > > > > > > > > > > 19:43:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > I would prefer to keep the language binding in
> the PR
> > > > > > > process.
> > > > > > > > > > > Perhaps
> > > > > > > > > > > > we
> > > > > > > > > > > > > > could do some analytics to see how much each of
> the
> > > > > > language
> > > > > > > > > > bindings
> > > > > > > > > > > > is
> > > > > > > > > > > > > > contributing to overall run time.
> > > > > > > > > > > > > > If we have some metrics on that, maybe we can
> come up
> > > > > with
> > > > > > a
> > > > > > > > > > > guideline
> > > > > > > > > > > > of
> > > > > > > > > > > > > > how much time each should take. Another
> possibility
> > > is
> > > > > > > leverage
> > > > > > > > > the
> > > > > > > > > > > > > > parallel builds more.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Wed, Aug 14, 2019 at 1:30 PM Pedro Larroy <
> > > > > > > > > > > > > pedro.larroy.lists@gmail.com
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hi Carin.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > That's a good point, all things considered
> would
> > > your
> > > > > > > > > preference
> > > > > > > > > > be
> > > > > > > > > > > > to
> > > > > > > > > > > > > > keep
> > > > > > > > > > > > > > > the Clojure tests as part of the PR process or
> in
> > > > > > Nightly?
> > > > > > > > > > > > > > > Some options are having notifications here or
> in
> > > > slack.
> > > > > > But
> > > > > > > > if
> > > > > > > > > we
> > > > > > > > > > > > think
> > > > > > > > > > > > > > > breakages would go unnoticed maybe is not a
> good
> > > idea
> > > > > to
> > > > > > > > fully
> > > > > > > > > > > remove
> > > > > > > > > > > > > > > bindings from the PR process and just
> streamline
> > > the
> > > > > > > process.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Pedro.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Wed, Aug 14, 2019 at 5:09 AM Carin Meier <
> > > > > > > > > > carinmeier@gmail.com>
> > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Before any binding tests are moved to
> nightly, I
> > > > > think
> > > > > > we
> > > > > > > > > need
> > > > > > > > > > to
> > > > > > > > > > > > > > figure
> > > > > > > > > > > > > > > > out how the community can get proper
> > > notifications
> > > > of
> > > > > > > > failure
> > > > > > > > > > and
> > > > > > > > > > > > > > success
> > > > > > > > > > > > > > > > on those nightly runs. Otherwise, I think
> that
> > > > > > breakages
> > > > > > > > > would
> > > > > > > > > > go
> > > > > > > > > > > > > > > > unnoticed.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > -Carin
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Tue, Aug 13, 2019 at 7:47 PM Pedro Larroy
> <
> > > > > > > > > > > > > > > pedro.larroy.lists@gmail.com
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Hi
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Seems we are hitting some problems in CI. I
> > > > propose
> > > > > > the
> > > > > > > > > > > following
> > > > > > > > > > > > > > > action
> > > > > > > > > > > > > > > > > items to remedy the situation and
> accelerate
> > > turn
> > > > > > > around
> > > > > > > > > > times
> > > > > > > > > > > in
> > > > > > > > > > > > > CI,
> > > > > > > > > > > > > > > > > reduce cost, complexity and probability of
> > > > failure
> > > > > > > > blocking
> > > > > > > > > > PRs
> > > > > > > > > > > > and
> > > > > > > > > > > > > > > > > frustrating developers:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > * Upgrade Windows visual studio from VS
> 2015 to
> > > > VS
> > > > > > > 2017.
> > > > > > > > > The
> > > > > > > > > > > > > > > > > build_windows.py infrastructure should
> easily
> > > > work
> > > > > > with
> > > > > > > > the
> > > > > > > > > > new
> > > > > > > > > > > > > > > version.
> > > > > > > > > > > > > > > > > Currently some PRs are blocked by this:
> > > > > > > > > > > > > > > > >
> > > > > > https://github.com/apache/incubator-mxnet/issues/13958
> > > > > > > > > > > > > > > > > * Move Gluon Model zoo tests to nightly.
> > > Tracked
> > > > at
> > > > > > > > > > > > > > > > >
> > > > > > https://github.com/apache/incubator-mxnet/issues/15295
> > > > > > > > > > > > > > > > > * Move non-python bindings tests to
> nightly.
> > > If a
> > > > > > > commit
> > > > > > > > is
> > > > > > > > > > > > > touching
> > > > > > > > > > > > > > > > other
> > > > > > > > > > > > > > > > > bindings, the reviewer should ask for a
> full
> > > run
> > > > > > which
> > > > > > > > can
> > > > > > > > > be
> > > > > > > > > > > > done
> > > > > > > > > > > > > > > > locally,
> > > > > > > > > > > > > > > > > use the label bot to trigger a full CI
> build,
> > > or
> > > > > > defer
> > > > > > > to
> > > > > > > > > > > > nightly.
> > > > > > > > > > > > > > > > > * Provide a couple of basic sanity
> performance
> > > > > tests
> > > > > > on
> > > > > > > > > small
> > > > > > > > > > > > > models
> > > > > > > > > > > > > > > that
> > > > > > > > > > > > > > > > > are run on CI and can be echoed by the
> label
> > > bot
> > > > > as a
> > > > > > > > > comment
> > > > > > > > > > > for
> > > > > > > > > > > > > > PRs.
> > > > > > > > > > > > > > > > > * Address unit tests that take more than
> > > 10-20s,
> > > > > > > > streamline
> > > > > > > > > > > them
> > > > > > > > > > > > or
> > > > > > > > > > > > > > > move
> > > > > > > > > > > > > > > > > them to nightly if it can't be done.
> > > > > > > > > > > > > > > > > * Open sourcing the remaining CI
> infrastructure
> > > > > > scripts
> > > > > > > > so
> > > > > > > > > > the
> > > > > > > > > > > > > > > community
> > > > > > > > > > > > > > > > > can contribute.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > I think our goal should be turnaround under
> > > > 30min.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > I would also like to touch base with the
> > > > community
> > > > > > that
> > > > > > > > > some
> > > > > > > > > > > PRs
> > > > > > > > > > > > > are
> > > > > > > > > > > > > > > not
> > > > > > > > > > > > > > > > > being followed up by committers asking for
> > > > changes.
> > > > > > For
> > > > > > > > > > example
> > > > > > > > > > > > > this
> > > > > > > > > > > > > > PR
> > > > > > > > > > > > > > > > is
> > > > > > > > > > > > > > > > > importtant and is hanging for a long time.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > https://github.com/apache/incubator-mxnet/pull/15051
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > This is another, less important but more
> > > trivial
> > > > to
> > > > > > > > review:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > https://github.com/apache/incubator-mxnet/pull/14940
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > I think comitters requesting changes and
> not
> > > > > > folllowing
> > > > > > > > up
> > > > > > > > > in
> > > > > > > > > > > > > > > reasonable
> > > > > > > > > > > > > > > > > time is not healthy for the project. I
> suggest
> > > > > > > > configuring
> > > > > > > > > > > github
> > > > > > > > > > > > > > > > > Notifications for a good SNR and following
> up.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Regards.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Pedro.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
>

Re: CI and PRs

Posted by Aaron Markham <aa...@gmail.com>.

When you create a new Dockerfile and use that on CI, it doesn't seem
to cache some of the steps... like this:

Step 13/15 : RUN /work/ubuntu_docs.sh
 ---> Running in a1e522f3283b
 [91m+ echo 'Installing dependencies...'
+ apt-get update
 [0mInstalling dependencies.

Or this....

Step 4/13 : RUN /work/ubuntu_core.sh
 ---> Running in e7882d7aa750
 [91m+ apt-get update

I get if I was changing those scripts, but then I'd think it should
cache after running it once... but, no.


On Thu, Aug 15, 2019 at 3:51 PM Marco de Abreu <ma...@gmail.com> wrote:
>
> Do I understand it correctly that you are saying that the Docker cache
> doesn't work properly and regularly reinstalls dependencies? Or do you mean
> that you only have cache misses when you modify the dependencies - which
> would be expected?
>
> -Marco
>
> On Fri, Aug 16, 2019 at 12:48 AM Aaron Markham <aa...@gmail.com>
> wrote:
>
> > Many of the CI pipelines follow this pattern:
> > Load ubuntu 16.04, install deps, build mxnet, then run some tests. Why
> > repeat steps 1-3 over and over?
> >
> > Now, some tests use a stashed binary and docker cache. And I see this work
> > locally, but for the most part, on CI, you're gonna sit through a
> > dependency install.
> >
> > I noticed that almost all jobs use an ubuntu setup that is fully loaded.
> > Without cache, it can take 10 or more minutes to build.  So I made a lite
> > version. Takes only a few minutes instead.
> >
> > In some cases archiving worked great to share across pipelines, but as
> > Marco mentioned we need a storage solution to make that happen. We can't
> > archive every intermediate artifact for each PR.
> >
> > On Thu, Aug 15, 2019, 13:47 Pedro Larroy <pe...@gmail.com>
> > wrote:
> >
> > > Hi Aaron. Why speeds things up? What's the difference?
> > >
> > > Pedro.
> > >
> > > On Wed, Aug 14, 2019 at 8:39 PM Aaron Markham <aaron.s.markham@gmail.com
> > >
> > > wrote:
> > >
> > > > The PRs Thomas and I are working on for the new docs and website share
> > > the
> > > > mxnet binary in the new CI pipelines we made. Speeds things up a lot.
> > > >
> > > > On Wed, Aug 14, 2019, 18:16 Chris Olivier <cj...@gmail.com>
> > wrote:
> > > >
> > > > > I see it done daily now, and while I can’t share all the details,
> > it’s
> > > > not
> > > > > an incredibly complex thing, and involves not much more than nfs/efs
> > > > > sharing and remote ssh commands.  All it takes is a little ingenuity
> > > and
> > > > > some imagination.
> > > > >
> > > > > On Wed, Aug 14, 2019 at 4:31 PM Pedro Larroy <
> > > > pedro.larroy.lists@gmail.com
> > > > > >
> > > > > wrote:
> > > > >
> > > > > > Sounds good in theory. I think there are complex details with
> > regards
> > > > of
> > > > > > resource sharing during parallel execution. Still I think both ways
> > > can
> > > > > be
> > > > > > explored. I think some tests run for unreasonably long times for
> > what
> > > > > they
> > > > > > are doing. We already scale parts of the pipeline horizontally
> > across
> > > > > > workers.
> > > > > >
> > > > > >
> > > > > > On Wed, Aug 14, 2019 at 5:12 PM Chris Olivier <
> > > cjolivier01@apache.org>
> > > > > > wrote:
> > > > > >
> > > > > > > +1
> > > > > > >
> > > > > > > Rather than remove tests (which doesn’t scale as a solution), why
> > > not
> > > > > > scale
> > > > > > > them horizontally so that they finish more quickly? Across
> > > processes
> > > > or
> > > > > > > even on a pool of machines that aren’t necessarily the build
> > > machine?
> > > > > > >
> > > > > > > On Wed, Aug 14, 2019 at 12:03 PM Marco de Abreu <
> > > > > marco.g.abreu@gmail.com
> > > > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > With regards to time I rather prefer us spending a bit more
> > time
> > > on
> > > > > > > > maintenance than somebody running into an error that could've
> > > been
> > > > > > caught
> > > > > > > > with a test.
> > > > > > > >
> > > > > > > > I mean, our Publishing pipeline for Scala GPU has been broken
> > for
> > > > > quite
> > > > > > > > some time now, but nobody noticed that. Basically my stance on
> > > that
> > > > > > > matter
> > > > > > > > is that as soon as something is not blocking, you can also just
> > > > > > > deactivate
> > > > > > > > it since you don't have a forcing function in an open source
> > > > project.
> > > > > > > > People will rarely come back and fix the errors of some nightly
> > > > test
> > > > > > that
> > > > > > > > they introduced.
> > > > > > > >
> > > > > > > > -Marco
> > > > > > > >
> > > > > > > > Carin Meier <ca...@gmail.com> schrieb am Mi., 14. Aug.
> > > 2019,
> > > > > > 21:59:
> > > > > > > >
> > > > > > > > > If a language binding test is failing for a not important
> > > reason,
> > > > > > then
> > > > > > > it
> > > > > > > > > is too brittle and needs to be fixed (we have fixed some of
> > > these
> > > > > > with
> > > > > > > > the
> > > > > > > > > Clojure package [1]).
> > > > > > > > > But in general, if we thinking of the MXNet project as one
> > > > project
> > > > > > that
> > > > > > > > is
> > > > > > > > > across all the language bindings, then we want to know if
> > some
> > > > > > > > fundamental
> > > > > > > > > code change is going to break a downstream package.
> > > > > > > > > I can't speak for all the high level package binding
> > > maintainers,
> > > > > but
> > > > > > > I'm
> > > > > > > > > always happy to pitch in to provide code fixes to help the
> > base
> > > > PR
> > > > > > get
> > > > > > > > > green.
> > > > > > > > >
> > > > > > > > > The time costs to maintain such a large CI project obviously
> > > > needs
> > > > > to
> > > > > > > be
> > > > > > > > > considered as well.
> > > > > > > > >
> > > > > > > > > [1] https://github.com/apache/incubator-mxnet/pull/15579
> > > > > > > > >
> > > > > > > > > On Wed, Aug 14, 2019 at 3:48 PM Pedro Larroy <
> > > > > > > > pedro.larroy.lists@gmail.com
> > > > > > > > > >
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > From what I have seen Clojure is 15 minutes, which I think
> > is
> > > > > > > > reasonable.
> > > > > > > > > > The only question is that when a binding such as R, Perl or
> > > > > Clojure
> > > > > > > > > fails,
> > > > > > > > > > some devs are a bit confused about how to fix them since
> > they
> > > > are
> > > > > > not
> > > > > > > > > > familiar with the testing tools and the language.
> > > > > > > > > >
> > > > > > > > > > On Wed, Aug 14, 2019 at 11:57 AM Carin Meier <
> > > > > carinmeier@gmail.com
> > > > > > >
> > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Great idea Marco! Anything that you think would be
> > valuable
> > > > to
> > > > > > > share
> > > > > > > > > > would
> > > > > > > > > > > be good. The duration of each node in the test stage
> > sounds
> > > > > like
> > > > > > a
> > > > > > > > good
> > > > > > > > > > > start.
> > > > > > > > > > >
> > > > > > > > > > > - Carin
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Aug 14, 2019 at 2:48 PM Marco de Abreu <
> > > > > > > > > marco.g.abreu@gmail.com>
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi,
> > > > > > > > > > > >
> > > > > > > > > > > > we record a bunch of metrics about run statistics (down
> > > to
> > > > > the
> > > > > > > > > duration
> > > > > > > > > > > of
> > > > > > > > > > > > every individual step). If you tell me which ones
> > you're
> > > > > > > > particularly
> > > > > > > > > > > > interested in (probably total duration of each node in
> > > the
> > > > > test
> > > > > > > > > stage),
> > > > > > > > > > > I'm
> > > > > > > > > > > > happy to provide them.
> > > > > > > > > > > >
> > > > > > > > > > > > Dimensions are (in hierarchical order):
> > > > > > > > > > > > - job
> > > > > > > > > > > > - branch
> > > > > > > > > > > > - stage
> > > > > > > > > > > > - node
> > > > > > > > > > > > - step
> > > > > > > > > > > >
> > > > > > > > > > > > Unfortunately I don't have the possibility to export
> > them
> > > > > since
> > > > > > > we
> > > > > > > > > > store
> > > > > > > > > > > > them in CloudWatch Metrics which afaik doesn't offer
> > raw
> > > > > > exports.
> > > > > > > > > > > >
> > > > > > > > > > > > Best regards,
> > > > > > > > > > > > Marco
> > > > > > > > > > > >
> > > > > > > > > > > > Carin Meier <ca...@gmail.com> schrieb am Mi., 14.
> > > > Aug.
> > > > > > > 2019,
> > > > > > > > > > 19:43:
> > > > > > > > > > > >
> > > > > > > > > > > > > I would prefer to keep the language binding in the PR
> > > > > > process.
> > > > > > > > > > Perhaps
> > > > > > > > > > > we
> > > > > > > > > > > > > could do some analytics to see how much each of the
> > > > > language
> > > > > > > > > bindings
> > > > > > > > > > > is
> > > > > > > > > > > > > contributing to overall run time.
> > > > > > > > > > > > > If we have some metrics on that, maybe we can come up
> > > > with
> > > > > a
> > > > > > > > > > guideline
> > > > > > > > > > > of
> > > > > > > > > > > > > how much time each should take. Another possibility
> > is
> > > > > > leverage
> > > > > > > > the
> > > > > > > > > > > > > parallel builds more.
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Wed, Aug 14, 2019 at 1:30 PM Pedro Larroy <
> > > > > > > > > > > > pedro.larroy.lists@gmail.com
> > > > > > > > > > > > > >
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi Carin.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > That's a good point, all things considered would
> > your
> > > > > > > > preference
> > > > > > > > > be
> > > > > > > > > > > to
> > > > > > > > > > > > > keep
> > > > > > > > > > > > > > the Clojure tests as part of the PR process or in
> > > > > Nightly?
> > > > > > > > > > > > > > Some options are having notifications here or in
> > > slack.
> > > > > But
> > > > > > > if
> > > > > > > > we
> > > > > > > > > > > think
> > > > > > > > > > > > > > breakages would go unnoticed maybe is not a good
> > idea
> > > > to
> > > > > > > fully
> > > > > > > > > > remove
> > > > > > > > > > > > > > bindings from the PR process and just streamline
> > the
> > > > > > process.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Pedro.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Wed, Aug 14, 2019 at 5:09 AM Carin Meier <
> > > > > > > > > carinmeier@gmail.com>
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Before any binding tests are moved to nightly, I
> > > > think
> > > > > we
> > > > > > > > need
> > > > > > > > > to
> > > > > > > > > > > > > figure
> > > > > > > > > > > > > > > out how the community can get proper
> > notifications
> > > of
> > > > > > > failure
> > > > > > > > > and
> > > > > > > > > > > > > success
> > > > > > > > > > > > > > > on those nightly runs. Otherwise, I think that
> > > > > breakages
> > > > > > > > would
> > > > > > > > > go
> > > > > > > > > > > > > > > unnoticed.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > -Carin
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Tue, Aug 13, 2019 at 7:47 PM Pedro Larroy <
> > > > > > > > > > > > > > pedro.larroy.lists@gmail.com
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Hi
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Seems we are hitting some problems in CI. I
> > > propose
> > > > > the
> > > > > > > > > > following
> > > > > > > > > > > > > > action
> > > > > > > > > > > > > > > > items to remedy the situation and accelerate
> > turn
> > > > > > around
> > > > > > > > > times
> > > > > > > > > > in
> > > > > > > > > > > > CI,
> > > > > > > > > > > > > > > > reduce cost, complexity and probability of
> > > failure
> > > > > > > blocking
> > > > > > > > > PRs
> > > > > > > > > > > and
> > > > > > > > > > > > > > > > frustrating developers:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > * Upgrade Windows visual studio from VS 2015 to
> > > VS
> > > > > > 2017.
> > > > > > > > The
> > > > > > > > > > > > > > > > build_windows.py infrastructure should easily
> > > work
> > > > > with
> > > > > > > the
> > > > > > > > > new
> > > > > > > > > > > > > > version.
> > > > > > > > > > > > > > > > Currently some PRs are blocked by this:
> > > > > > > > > > > > > > > >
> > > > > https://github.com/apache/incubator-mxnet/issues/13958
> > > > > > > > > > > > > > > > * Move Gluon Model zoo tests to nightly.
> > Tracked
> > > at
> > > > > > > > > > > > > > > >
> > > > > https://github.com/apache/incubator-mxnet/issues/15295
> > > > > > > > > > > > > > > > * Move non-python bindings tests to nightly.
> > If a
> > > > > > commit
> > > > > > > is
> > > > > > > > > > > > touching
> > > > > > > > > > > > > > > other
> > > > > > > > > > > > > > > > bindings, the reviewer should ask for a full
> > run
> > > > > which
> > > > > > > can
> > > > > > > > be
> > > > > > > > > > > done
> > > > > > > > > > > > > > > locally,
> > > > > > > > > > > > > > > > use the label bot to trigger a full CI build,
> > or
> > > > > defer
> > > > > > to
> > > > > > > > > > > nightly.
> > > > > > > > > > > > > > > > * Provide a couple of basic sanity performance
> > > > tests
> > > > > on
> > > > > > > > small
> > > > > > > > > > > > models
> > > > > > > > > > > > > > that
> > > > > > > > > > > > > > > > are run on CI and can be echoed by the label
> > bot
> > > > as a
> > > > > > > > comment
> > > > > > > > > > for
> > > > > > > > > > > > > PRs.
> > > > > > > > > > > > > > > > * Address unit tests that take more than
> > 10-20s,
> > > > > > > streamline
> > > > > > > > > > them
> > > > > > > > > > > or
> > > > > > > > > > > > > > move
> > > > > > > > > > > > > > > > them to nightly if it can't be done.
> > > > > > > > > > > > > > > > * Open sourcing the remaining CI infrastructure
> > > > > scripts
> > > > > > > so
> > > > > > > > > the
> > > > > > > > > > > > > > community
> > > > > > > > > > > > > > > > can contribute.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I think our goal should be turnaround under
> > > 30min.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I would also like to touch base with the
> > > community
> > > > > that
> > > > > > > > some
> > > > > > > > > > PRs
> > > > > > > > > > > > are
> > > > > > > > > > > > > > not
> > > > > > > > > > > > > > > > being followed up by committers asking for
> > > changes.
> > > > > For
> > > > > > > > > example
> > > > > > > > > > > > this
> > > > > > > > > > > > > PR
> > > > > > > > > > > > > > > is
> > > > > > > > > > > > > > > > importtant and is hanging for a long time.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > https://github.com/apache/incubator-mxnet/pull/15051
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > This is another, less important but more
> > trivial
> > > to
> > > > > > > review:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > https://github.com/apache/incubator-mxnet/pull/14940
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I think comitters requesting changes and not
> > > > > folllowing
> > > > > > > up
> > > > > > > > in
> > > > > > > > > > > > > > reasonable
> > > > > > > > > > > > > > > > time is not healthy for the project. I suggest
> > > > > > > configuring
> > > > > > > > > > github
> > > > > > > > > > > > > > > > Notifications for a good SNR and following up.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Regards.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Pedro.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >

Re: CI and PRs

Posted by Marco de Abreu <ma...@gmail.com>.

Do I understand it correctly that you are saying that the Docker cache
doesn't work properly and regularly reinstalls dependencies? Or do you mean
that you only have cache misses when you modify the dependencies - which
would be expected?

-Marco

On Fri, Aug 16, 2019 at 12:48 AM Aaron Markham <aa...@gmail.com>
wrote:

> Many of the CI pipelines follow this pattern:
> Load ubuntu 16.04, install deps, build mxnet, then run some tests. Why
> repeat steps 1-3 over and over?
>
> Now, some tests use a stashed binary and docker cache. And I see this work
> locally, but for the most part, on CI, you're gonna sit through a
> dependency install.
>
> I noticed that almost all jobs use an ubuntu setup that is fully loaded.
> Without cache, it can take 10 or more minutes to build.  So I made a lite
> version. Takes only a few minutes instead.
>
> In some cases archiving worked great to share across pipelines, but as
> Marco mentioned we need a storage solution to make that happen. We can't
> archive every intermediate artifact for each PR.
>
> On Thu, Aug 15, 2019, 13:47 Pedro Larroy <pe...@gmail.com>
> wrote:
>
> > Hi Aaron. Why speeds things up? What's the difference?
> >
> > Pedro.
> >
> > On Wed, Aug 14, 2019 at 8:39 PM Aaron Markham <aaron.s.markham@gmail.com
> >
> > wrote:
> >
> > > The PRs Thomas and I are working on for the new docs and website share
> > the
> > > mxnet binary in the new CI pipelines we made. Speeds things up a lot.
> > >
> > > On Wed, Aug 14, 2019, 18:16 Chris Olivier <cj...@gmail.com>
> wrote:
> > >
> > > > I see it done daily now, and while I can’t share all the details,
> it’s
> > > not
> > > > an incredibly complex thing, and involves not much more than nfs/efs
> > > > sharing and remote ssh commands.  All it takes is a little ingenuity
> > and
> > > > some imagination.
> > > >
> > > > On Wed, Aug 14, 2019 at 4:31 PM Pedro Larroy <
> > > pedro.larroy.lists@gmail.com
> > > > >
> > > > wrote:
> > > >
> > > > > Sounds good in theory. I think there are complex details with
> regards
> > > of
> > > > > resource sharing during parallel execution. Still I think both ways
> > can
> > > > be
> > > > > explored. I think some tests run for unreasonably long times for
> what
> > > > they
> > > > > are doing. We already scale parts of the pipeline horizontally
> across
> > > > > workers.
> > > > >
> > > > >
> > > > > On Wed, Aug 14, 2019 at 5:12 PM Chris Olivier <
> > cjolivier01@apache.org>
> > > > > wrote:
> > > > >
> > > > > > +1
> > > > > >
> > > > > > Rather than remove tests (which doesn’t scale as a solution), why
> > not
> > > > > scale
> > > > > > them horizontally so that they finish more quickly? Across
> > processes
> > > or
> > > > > > even on a pool of machines that aren’t necessarily the build
> > machine?
> > > > > >
> > > > > > On Wed, Aug 14, 2019 at 12:03 PM Marco de Abreu <
> > > > marco.g.abreu@gmail.com
> > > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > With regards to time I rather prefer us spending a bit more
> time
> > on
> > > > > > > maintenance than somebody running into an error that could've
> > been
> > > > > caught
> > > > > > > with a test.
> > > > > > >
> > > > > > > I mean, our Publishing pipeline for Scala GPU has been broken
> for
> > > > quite
> > > > > > > some time now, but nobody noticed that. Basically my stance on
> > that
> > > > > > matter
> > > > > > > is that as soon as something is not blocking, you can also just
> > > > > > deactivate
> > > > > > > it since you don't have a forcing function in an open source
> > > project.
> > > > > > > People will rarely come back and fix the errors of some nightly
> > > test
> > > > > that
> > > > > > > they introduced.
> > > > > > >
> > > > > > > -Marco
> > > > > > >
> > > > > > > Carin Meier <ca...@gmail.com> schrieb am Mi., 14. Aug.
> > 2019,
> > > > > 21:59:
> > > > > > >
> > > > > > > > If a language binding test is failing for a not important
> > reason,
> > > > > then
> > > > > > it
> > > > > > > > is too brittle and needs to be fixed (we have fixed some of
> > these
> > > > > with
> > > > > > > the
> > > > > > > > Clojure package [1]).
> > > > > > > > But in general, if we thinking of the MXNet project as one
> > > project
> > > > > that
> > > > > > > is
> > > > > > > > across all the language bindings, then we want to know if
> some
> > > > > > > fundamental
> > > > > > > > code change is going to break a downstream package.
> > > > > > > > I can't speak for all the high level package binding
> > maintainers,
> > > > but
> > > > > > I'm
> > > > > > > > always happy to pitch in to provide code fixes to help the
> base
> > > PR
> > > > > get
> > > > > > > > green.
> > > > > > > >
> > > > > > > > The time costs to maintain such a large CI project obviously
> > > needs
> > > > to
> > > > > > be
> > > > > > > > considered as well.
> > > > > > > >
> > > > > > > > [1] https://github.com/apache/incubator-mxnet/pull/15579
> > > > > > > >
> > > > > > > > On Wed, Aug 14, 2019 at 3:48 PM Pedro Larroy <
> > > > > > > pedro.larroy.lists@gmail.com
> > > > > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > From what I have seen Clojure is 15 minutes, which I think
> is
> > > > > > > reasonable.
> > > > > > > > > The only question is that when a binding such as R, Perl or
> > > > Clojure
> > > > > > > > fails,
> > > > > > > > > some devs are a bit confused about how to fix them since
> they
> > > are
> > > > > not
> > > > > > > > > familiar with the testing tools and the language.
> > > > > > > > >
> > > > > > > > > On Wed, Aug 14, 2019 at 11:57 AM Carin Meier <
> > > > carinmeier@gmail.com
> > > > > >
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Great idea Marco! Anything that you think would be
> valuable
> > > to
> > > > > > share
> > > > > > > > > would
> > > > > > > > > > be good. The duration of each node in the test stage
> sounds
> > > > like
> > > > > a
> > > > > > > good
> > > > > > > > > > start.
> > > > > > > > > >
> > > > > > > > > > - Carin
> > > > > > > > > >
> > > > > > > > > > On Wed, Aug 14, 2019 at 2:48 PM Marco de Abreu <
> > > > > > > > marco.g.abreu@gmail.com>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi,
> > > > > > > > > > >
> > > > > > > > > > > we record a bunch of metrics about run statistics (down
> > to
> > > > the
> > > > > > > > duration
> > > > > > > > > > of
> > > > > > > > > > > every individual step). If you tell me which ones
> you're
> > > > > > > particularly
> > > > > > > > > > > interested in (probably total duration of each node in
> > the
> > > > test
> > > > > > > > stage),
> > > > > > > > > > I'm
> > > > > > > > > > > happy to provide them.
> > > > > > > > > > >
> > > > > > > > > > > Dimensions are (in hierarchical order):
> > > > > > > > > > > - job
> > > > > > > > > > > - branch
> > > > > > > > > > > - stage
> > > > > > > > > > > - node
> > > > > > > > > > > - step
> > > > > > > > > > >
> > > > > > > > > > > Unfortunately I don't have the possibility to export
> them
> > > > since
> > > > > > we
> > > > > > > > > store
> > > > > > > > > > > them in CloudWatch Metrics which afaik doesn't offer
> raw
> > > > > exports.
> > > > > > > > > > >
> > > > > > > > > > > Best regards,
> > > > > > > > > > > Marco
> > > > > > > > > > >
> > > > > > > > > > > Carin Meier <ca...@gmail.com> schrieb am Mi., 14.
> > > Aug.
> > > > > > 2019,
> > > > > > > > > 19:43:
> > > > > > > > > > >
> > > > > > > > > > > > I would prefer to keep the language binding in the PR
> > > > > process.
> > > > > > > > > Perhaps
> > > > > > > > > > we
> > > > > > > > > > > > could do some analytics to see how much each of the
> > > > language
> > > > > > > > bindings
> > > > > > > > > > is
> > > > > > > > > > > > contributing to overall run time.
> > > > > > > > > > > > If we have some metrics on that, maybe we can come up
> > > with
> > > > a
> > > > > > > > > guideline
> > > > > > > > > > of
> > > > > > > > > > > > how much time each should take. Another possibility
> is
> > > > > leverage
> > > > > > > the
> > > > > > > > > > > > parallel builds more.
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, Aug 14, 2019 at 1:30 PM Pedro Larroy <
> > > > > > > > > > > pedro.larroy.lists@gmail.com
> > > > > > > > > > > > >
> > > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi Carin.
> > > > > > > > > > > > >
> > > > > > > > > > > > > That's a good point, all things considered would
> your
> > > > > > > preference
> > > > > > > > be
> > > > > > > > > > to
> > > > > > > > > > > > keep
> > > > > > > > > > > > > the Clojure tests as part of the PR process or in
> > > > Nightly?
> > > > > > > > > > > > > Some options are having notifications here or in
> > slack.
> > > > But
> > > > > > if
> > > > > > > we
> > > > > > > > > > think
> > > > > > > > > > > > > breakages would go unnoticed maybe is not a good
> idea
> > > to
> > > > > > fully
> > > > > > > > > remove
> > > > > > > > > > > > > bindings from the PR process and just streamline
> the
> > > > > process.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Pedro.
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Wed, Aug 14, 2019 at 5:09 AM Carin Meier <
> > > > > > > > carinmeier@gmail.com>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Before any binding tests are moved to nightly, I
> > > think
> > > > we
> > > > > > > need
> > > > > > > > to
> > > > > > > > > > > > figure
> > > > > > > > > > > > > > out how the community can get proper
> notifications
> > of
> > > > > > failure
> > > > > > > > and
> > > > > > > > > > > > success
> > > > > > > > > > > > > > on those nightly runs. Otherwise, I think that
> > > > breakages
> > > > > > > would
> > > > > > > > go
> > > > > > > > > > > > > > unnoticed.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > -Carin
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Tue, Aug 13, 2019 at 7:47 PM Pedro Larroy <
> > > > > > > > > > > > > pedro.larroy.lists@gmail.com
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hi
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Seems we are hitting some problems in CI. I
> > propose
> > > > the
> > > > > > > > > following
> > > > > > > > > > > > > action
> > > > > > > > > > > > > > > items to remedy the situation and accelerate
> turn
> > > > > around
> > > > > > > > times
> > > > > > > > > in
> > > > > > > > > > > CI,
> > > > > > > > > > > > > > > reduce cost, complexity and probability of
> > failure
> > > > > > blocking
> > > > > > > > PRs
> > > > > > > > > > and
> > > > > > > > > > > > > > > frustrating developers:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > * Upgrade Windows visual studio from VS 2015 to
> > VS
> > > > > 2017.
> > > > > > > The
> > > > > > > > > > > > > > > build_windows.py infrastructure should easily
> > work
> > > > with
> > > > > > the
> > > > > > > > new
> > > > > > > > > > > > > version.
> > > > > > > > > > > > > > > Currently some PRs are blocked by this:
> > > > > > > > > > > > > > >
> > > > https://github.com/apache/incubator-mxnet/issues/13958
> > > > > > > > > > > > > > > * Move Gluon Model zoo tests to nightly.
> Tracked
> > at
> > > > > > > > > > > > > > >
> > > > https://github.com/apache/incubator-mxnet/issues/15295
> > > > > > > > > > > > > > > * Move non-python bindings tests to nightly.
> If a
> > > > > commit
> > > > > > is
> > > > > > > > > > > touching
> > > > > > > > > > > > > > other
> > > > > > > > > > > > > > > bindings, the reviewer should ask for a full
> run
> > > > which
> > > > > > can
> > > > > > > be
> > > > > > > > > > done
> > > > > > > > > > > > > > locally,
> > > > > > > > > > > > > > > use the label bot to trigger a full CI build,
> or
> > > > defer
> > > > > to
> > > > > > > > > > nightly.
> > > > > > > > > > > > > > > * Provide a couple of basic sanity performance
> > > tests
> > > > on
> > > > > > > small
> > > > > > > > > > > models
> > > > > > > > > > > > > that
> > > > > > > > > > > > > > > are run on CI and can be echoed by the label
> bot
> > > as a
> > > > > > > comment
> > > > > > > > > for
> > > > > > > > > > > > PRs.
> > > > > > > > > > > > > > > * Address unit tests that take more than
> 10-20s,
> > > > > > streamline
> > > > > > > > > them
> > > > > > > > > > or
> > > > > > > > > > > > > move
> > > > > > > > > > > > > > > them to nightly if it can't be done.
> > > > > > > > > > > > > > > * Open sourcing the remaining CI infrastructure
> > > > scripts
> > > > > > so
> > > > > > > > the
> > > > > > > > > > > > > community
> > > > > > > > > > > > > > > can contribute.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I think our goal should be turnaround under
> > 30min.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I would also like to touch base with the
> > community
> > > > that
> > > > > > > some
> > > > > > > > > PRs
> > > > > > > > > > > are
> > > > > > > > > > > > > not
> > > > > > > > > > > > > > > being followed up by committers asking for
> > changes.
> > > > For
> > > > > > > > example
> > > > > > > > > > > this
> > > > > > > > > > > > PR
> > > > > > > > > > > > > > is
> > > > > > > > > > > > > > > importtant and is hanging for a long time.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > https://github.com/apache/incubator-mxnet/pull/15051
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > This is another, less important but more
> trivial
> > to
> > > > > > review:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > https://github.com/apache/incubator-mxnet/pull/14940
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I think comitters requesting changes and not
> > > > folllowing
> > > > > > up
> > > > > > > in
> > > > > > > > > > > > > reasonable
> > > > > > > > > > > > > > > time is not healthy for the project. I suggest
> > > > > > configuring
> > > > > > > > > github
> > > > > > > > > > > > > > > Notifications for a good SNR and following up.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Regards.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Pedro.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: CI and PRs

Posted by Aaron Markham <aa...@gmail.com>.

Many of the CI pipelines follow this pattern:
Load ubuntu 16.04, install deps, build mxnet, then run some tests. Why
repeat steps 1-3 over and over?

Now, some tests use a stashed binary and docker cache. And I see this work
locally, but for the most part, on CI, you're gonna sit through a
dependency install.

I noticed that almost all jobs use an ubuntu setup that is fully loaded.
Without cache, it can take 10 or more minutes to build.  So I made a lite
version. Takes only a few minutes instead.

In some cases archiving worked great to share across pipelines, but as
Marco mentioned we need a storage solution to make that happen. We can't
archive every intermediate artifact for each PR.

On Thu, Aug 15, 2019, 13:47 Pedro Larroy <pe...@gmail.com>
wrote:

> Hi Aaron. Why speeds things up? What's the difference?
>
> Pedro.
>
> On Wed, Aug 14, 2019 at 8:39 PM Aaron Markham <aa...@gmail.com>
> wrote:
>
> > The PRs Thomas and I are working on for the new docs and website share
> the
> > mxnet binary in the new CI pipelines we made. Speeds things up a lot.
> >
> > On Wed, Aug 14, 2019, 18:16 Chris Olivier <cj...@gmail.com> wrote:
> >
> > > I see it done daily now, and while I can’t share all the details, it’s
> > not
> > > an incredibly complex thing, and involves not much more than nfs/efs
> > > sharing and remote ssh commands.  All it takes is a little ingenuity
> and
> > > some imagination.
> > >
> > > On Wed, Aug 14, 2019 at 4:31 PM Pedro Larroy <
> > pedro.larroy.lists@gmail.com
> > > >
> > > wrote:
> > >
> > > > Sounds good in theory. I think there are complex details with regards
> > of
> > > > resource sharing during parallel execution. Still I think both ways
> can
> > > be
> > > > explored. I think some tests run for unreasonably long times for what
> > > they
> > > > are doing. We already scale parts of the pipeline horizontally across
> > > > workers.
> > > >
> > > >
> > > > On Wed, Aug 14, 2019 at 5:12 PM Chris Olivier <
> cjolivier01@apache.org>
> > > > wrote:
> > > >
> > > > > +1
> > > > >
> > > > > Rather than remove tests (which doesn’t scale as a solution), why
> not
> > > > scale
> > > > > them horizontally so that they finish more quickly? Across
> processes
> > or
> > > > > even on a pool of machines that aren’t necessarily the build
> machine?
> > > > >
> > > > > On Wed, Aug 14, 2019 at 12:03 PM Marco de Abreu <
> > > marco.g.abreu@gmail.com
> > > > >
> > > > > wrote:
> > > > >
> > > > > > With regards to time I rather prefer us spending a bit more time
> on
> > > > > > maintenance than somebody running into an error that could've
> been
> > > > caught
> > > > > > with a test.
> > > > > >
> > > > > > I mean, our Publishing pipeline for Scala GPU has been broken for
> > > quite
> > > > > > some time now, but nobody noticed that. Basically my stance on
> that
> > > > > matter
> > > > > > is that as soon as something is not blocking, you can also just
> > > > > deactivate
> > > > > > it since you don't have a forcing function in an open source
> > project.
> > > > > > People will rarely come back and fix the errors of some nightly
> > test
> > > > that
> > > > > > they introduced.
> > > > > >
> > > > > > -Marco
> > > > > >
> > > > > > Carin Meier <ca...@gmail.com> schrieb am Mi., 14. Aug.
> 2019,
> > > > 21:59:
> > > > > >
> > > > > > > If a language binding test is failing for a not important
> reason,
> > > > then
> > > > > it
> > > > > > > is too brittle and needs to be fixed (we have fixed some of
> these
> > > > with
> > > > > > the
> > > > > > > Clojure package [1]).
> > > > > > > But in general, if we thinking of the MXNet project as one
> > project
> > > > that
> > > > > > is
> > > > > > > across all the language bindings, then we want to know if some
> > > > > > fundamental
> > > > > > > code change is going to break a downstream package.
> > > > > > > I can't speak for all the high level package binding
> maintainers,
> > > but
> > > > > I'm
> > > > > > > always happy to pitch in to provide code fixes to help the base
> > PR
> > > > get
> > > > > > > green.
> > > > > > >
> > > > > > > The time costs to maintain such a large CI project obviously
> > needs
> > > to
> > > > > be
> > > > > > > considered as well.
> > > > > > >
> > > > > > > [1] https://github.com/apache/incubator-mxnet/pull/15579
> > > > > > >
> > > > > > > On Wed, Aug 14, 2019 at 3:48 PM Pedro Larroy <
> > > > > > pedro.larroy.lists@gmail.com
> > > > > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > From what I have seen Clojure is 15 minutes, which I think is
> > > > > > reasonable.
> > > > > > > > The only question is that when a binding such as R, Perl or
> > > Clojure
> > > > > > > fails,
> > > > > > > > some devs are a bit confused about how to fix them since they
> > are
> > > > not
> > > > > > > > familiar with the testing tools and the language.
> > > > > > > >
> > > > > > > > On Wed, Aug 14, 2019 at 11:57 AM Carin Meier <
> > > carinmeier@gmail.com
> > > > >
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Great idea Marco! Anything that you think would be valuable
> > to
> > > > > share
> > > > > > > > would
> > > > > > > > > be good. The duration of each node in the test stage sounds
> > > like
> > > > a
> > > > > > good
> > > > > > > > > start.
> > > > > > > > >
> > > > > > > > > - Carin
> > > > > > > > >
> > > > > > > > > On Wed, Aug 14, 2019 at 2:48 PM Marco de Abreu <
> > > > > > > marco.g.abreu@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi,
> > > > > > > > > >
> > > > > > > > > > we record a bunch of metrics about run statistics (down
> to
> > > the
> > > > > > > duration
> > > > > > > > > of
> > > > > > > > > > every individual step). If you tell me which ones you're
> > > > > > particularly
> > > > > > > > > > interested in (probably total duration of each node in
> the
> > > test
> > > > > > > stage),
> > > > > > > > > I'm
> > > > > > > > > > happy to provide them.
> > > > > > > > > >
> > > > > > > > > > Dimensions are (in hierarchical order):
> > > > > > > > > > - job
> > > > > > > > > > - branch
> > > > > > > > > > - stage
> > > > > > > > > > - node
> > > > > > > > > > - step
> > > > > > > > > >
> > > > > > > > > > Unfortunately I don't have the possibility to export them
> > > since
> > > > > we
> > > > > > > > store
> > > > > > > > > > them in CloudWatch Metrics which afaik doesn't offer raw
> > > > exports.
> > > > > > > > > >
> > > > > > > > > > Best regards,
> > > > > > > > > > Marco
> > > > > > > > > >
> > > > > > > > > > Carin Meier <ca...@gmail.com> schrieb am Mi., 14.
> > Aug.
> > > > > 2019,
> > > > > > > > 19:43:
> > > > > > > > > >
> > > > > > > > > > > I would prefer to keep the language binding in the PR
> > > > process.
> > > > > > > > Perhaps
> > > > > > > > > we
> > > > > > > > > > > could do some analytics to see how much each of the
> > > language
> > > > > > > bindings
> > > > > > > > > is
> > > > > > > > > > > contributing to overall run time.
> > > > > > > > > > > If we have some metrics on that, maybe we can come up
> > with
> > > a
> > > > > > > > guideline
> > > > > > > > > of
> > > > > > > > > > > how much time each should take. Another possibility is
> > > > leverage
> > > > > > the
> > > > > > > > > > > parallel builds more.
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Aug 14, 2019 at 1:30 PM Pedro Larroy <
> > > > > > > > > > pedro.larroy.lists@gmail.com
> > > > > > > > > > > >
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi Carin.
> > > > > > > > > > > >
> > > > > > > > > > > > That's a good point, all things considered would your
> > > > > > preference
> > > > > > > be
> > > > > > > > > to
> > > > > > > > > > > keep
> > > > > > > > > > > > the Clojure tests as part of the PR process or in
> > > Nightly?
> > > > > > > > > > > > Some options are having notifications here or in
> slack.
> > > But
> > > > > if
> > > > > > we
> > > > > > > > > think
> > > > > > > > > > > > breakages would go unnoticed maybe is not a good idea
> > to
> > > > > fully
> > > > > > > > remove
> > > > > > > > > > > > bindings from the PR process and just streamline the
> > > > process.
> > > > > > > > > > > >
> > > > > > > > > > > > Pedro.
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, Aug 14, 2019 at 5:09 AM Carin Meier <
> > > > > > > carinmeier@gmail.com>
> > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Before any binding tests are moved to nightly, I
> > think
> > > we
> > > > > > need
> > > > > > > to
> > > > > > > > > > > figure
> > > > > > > > > > > > > out how the community can get proper notifications
> of
> > > > > failure
> > > > > > > and
> > > > > > > > > > > success
> > > > > > > > > > > > > on those nightly runs. Otherwise, I think that
> > > breakages
> > > > > > would
> > > > > > > go
> > > > > > > > > > > > > unnoticed.
> > > > > > > > > > > > >
> > > > > > > > > > > > > -Carin
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Tue, Aug 13, 2019 at 7:47 PM Pedro Larroy <
> > > > > > > > > > > > pedro.larroy.lists@gmail.com
> > > > > > > > > > > > > >
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Seems we are hitting some problems in CI. I
> propose
> > > the
> > > > > > > > following
> > > > > > > > > > > > action
> > > > > > > > > > > > > > items to remedy the situation and accelerate turn
> > > > around
> > > > > > > times
> > > > > > > > in
> > > > > > > > > > CI,
> > > > > > > > > > > > > > reduce cost, complexity and probability of
> failure
> > > > > blocking
> > > > > > > PRs
> > > > > > > > > and
> > > > > > > > > > > > > > frustrating developers:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > * Upgrade Windows visual studio from VS 2015 to
> VS
> > > > 2017.
> > > > > > The
> > > > > > > > > > > > > > build_windows.py infrastructure should easily
> work
> > > with
> > > > > the
> > > > > > > new
> > > > > > > > > > > > version.
> > > > > > > > > > > > > > Currently some PRs are blocked by this:
> > > > > > > > > > > > > >
> > > https://github.com/apache/incubator-mxnet/issues/13958
> > > > > > > > > > > > > > * Move Gluon Model zoo tests to nightly. Tracked
> at
> > > > > > > > > > > > > >
> > > https://github.com/apache/incubator-mxnet/issues/15295
> > > > > > > > > > > > > > * Move non-python bindings tests to nightly. If a
> > > > commit
> > > > > is
> > > > > > > > > > touching
> > > > > > > > > > > > > other
> > > > > > > > > > > > > > bindings, the reviewer should ask for a full run
> > > which
> > > > > can
> > > > > > be
> > > > > > > > > done
> > > > > > > > > > > > > locally,
> > > > > > > > > > > > > > use the label bot to trigger a full CI build, or
> > > defer
> > > > to
> > > > > > > > > nightly.
> > > > > > > > > > > > > > * Provide a couple of basic sanity performance
> > tests
> > > on
> > > > > > small
> > > > > > > > > > models
> > > > > > > > > > > > that
> > > > > > > > > > > > > > are run on CI and can be echoed by the label bot
> > as a
> > > > > > comment
> > > > > > > > for
> > > > > > > > > > > PRs.
> > > > > > > > > > > > > > * Address unit tests that take more than 10-20s,
> > > > > streamline
> > > > > > > > them
> > > > > > > > > or
> > > > > > > > > > > > move
> > > > > > > > > > > > > > them to nightly if it can't be done.
> > > > > > > > > > > > > > * Open sourcing the remaining CI infrastructure
> > > scripts
> > > > > so
> > > > > > > the
> > > > > > > > > > > > community
> > > > > > > > > > > > > > can contribute.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I think our goal should be turnaround under
> 30min.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I would also like to touch base with the
> community
> > > that
> > > > > > some
> > > > > > > > PRs
> > > > > > > > > > are
> > > > > > > > > > > > not
> > > > > > > > > > > > > > being followed up by committers asking for
> changes.
> > > For
> > > > > > > example
> > > > > > > > > > this
> > > > > > > > > > > PR
> > > > > > > > > > > > > is
> > > > > > > > > > > > > > importtant and is hanging for a long time.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > https://github.com/apache/incubator-mxnet/pull/15051
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > This is another, less important but more trivial
> to
> > > > > review:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > https://github.com/apache/incubator-mxnet/pull/14940
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I think comitters requesting changes and not
> > > folllowing
> > > > > up
> > > > > > in
> > > > > > > > > > > > reasonable
> > > > > > > > > > > > > > time is not healthy for the project. I suggest
> > > > > configuring
> > > > > > > > github
> > > > > > > > > > > > > > Notifications for a good SNR and following up.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Regards.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Pedro.
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: CI and PRs

Posted by Pedro Larroy <pe...@gmail.com>.

Hi Aaron. Why speeds things up? What's the difference?

Pedro.

On Wed, Aug 14, 2019 at 8:39 PM Aaron Markham <aa...@gmail.com>
wrote:

> The PRs Thomas and I are working on for the new docs and website share the
> mxnet binary in the new CI pipelines we made. Speeds things up a lot.
>
> On Wed, Aug 14, 2019, 18:16 Chris Olivier <cj...@gmail.com> wrote:
>
> > I see it done daily now, and while I can’t share all the details, it’s
> not
> > an incredibly complex thing, and involves not much more than nfs/efs
> > sharing and remote ssh commands.  All it takes is a little ingenuity and
> > some imagination.
> >
> > On Wed, Aug 14, 2019 at 4:31 PM Pedro Larroy <
> pedro.larroy.lists@gmail.com
> > >
> > wrote:
> >
> > > Sounds good in theory. I think there are complex details with regards
> of
> > > resource sharing during parallel execution. Still I think both ways can
> > be
> > > explored. I think some tests run for unreasonably long times for what
> > they
> > > are doing. We already scale parts of the pipeline horizontally across
> > > workers.
> > >
> > >
> > > On Wed, Aug 14, 2019 at 5:12 PM Chris Olivier <cj...@apache.org>
> > > wrote:
> > >
> > > > +1
> > > >
> > > > Rather than remove tests (which doesn’t scale as a solution), why not
> > > scale
> > > > them horizontally so that they finish more quickly? Across processes
> or
> > > > even on a pool of machines that aren’t necessarily the build machine?
> > > >
> > > > On Wed, Aug 14, 2019 at 12:03 PM Marco de Abreu <
> > marco.g.abreu@gmail.com
> > > >
> > > > wrote:
> > > >
> > > > > With regards to time I rather prefer us spending a bit more time on
> > > > > maintenance than somebody running into an error that could've been
> > > caught
> > > > > with a test.
> > > > >
> > > > > I mean, our Publishing pipeline for Scala GPU has been broken for
> > quite
> > > > > some time now, but nobody noticed that. Basically my stance on that
> > > > matter
> > > > > is that as soon as something is not blocking, you can also just
> > > > deactivate
> > > > > it since you don't have a forcing function in an open source
> project.
> > > > > People will rarely come back and fix the errors of some nightly
> test
> > > that
> > > > > they introduced.
> > > > >
> > > > > -Marco
> > > > >
> > > > > Carin Meier <ca...@gmail.com> schrieb am Mi., 14. Aug. 2019,
> > > 21:59:
> > > > >
> > > > > > If a language binding test is failing for a not important reason,
> > > then
> > > > it
> > > > > > is too brittle and needs to be fixed (we have fixed some of these
> > > with
> > > > > the
> > > > > > Clojure package [1]).
> > > > > > But in general, if we thinking of the MXNet project as one
> project
> > > that
> > > > > is
> > > > > > across all the language bindings, then we want to know if some
> > > > > fundamental
> > > > > > code change is going to break a downstream package.
> > > > > > I can't speak for all the high level package binding maintainers,
> > but
> > > > I'm
> > > > > > always happy to pitch in to provide code fixes to help the base
> PR
> > > get
> > > > > > green.
> > > > > >
> > > > > > The time costs to maintain such a large CI project obviously
> needs
> > to
> > > > be
> > > > > > considered as well.
> > > > > >
> > > > > > [1] https://github.com/apache/incubator-mxnet/pull/15579
> > > > > >
> > > > > > On Wed, Aug 14, 2019 at 3:48 PM Pedro Larroy <
> > > > > pedro.larroy.lists@gmail.com
> > > > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > From what I have seen Clojure is 15 minutes, which I think is
> > > > > reasonable.
> > > > > > > The only question is that when a binding such as R, Perl or
> > Clojure
> > > > > > fails,
> > > > > > > some devs are a bit confused about how to fix them since they
> are
> > > not
> > > > > > > familiar with the testing tools and the language.
> > > > > > >
> > > > > > > On Wed, Aug 14, 2019 at 11:57 AM Carin Meier <
> > carinmeier@gmail.com
> > > >
> > > > > > wrote:
> > > > > > >
> > > > > > > > Great idea Marco! Anything that you think would be valuable
> to
> > > > share
> > > > > > > would
> > > > > > > > be good. The duration of each node in the test stage sounds
> > like
> > > a
> > > > > good
> > > > > > > > start.
> > > > > > > >
> > > > > > > > - Carin
> > > > > > > >
> > > > > > > > On Wed, Aug 14, 2019 at 2:48 PM Marco de Abreu <
> > > > > > marco.g.abreu@gmail.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi,
> > > > > > > > >
> > > > > > > > > we record a bunch of metrics about run statistics (down to
> > the
> > > > > > duration
> > > > > > > > of
> > > > > > > > > every individual step). If you tell me which ones you're
> > > > > particularly
> > > > > > > > > interested in (probably total duration of each node in the
> > test
> > > > > > stage),
> > > > > > > > I'm
> > > > > > > > > happy to provide them.
> > > > > > > > >
> > > > > > > > > Dimensions are (in hierarchical order):
> > > > > > > > > - job
> > > > > > > > > - branch
> > > > > > > > > - stage
> > > > > > > > > - node
> > > > > > > > > - step
> > > > > > > > >
> > > > > > > > > Unfortunately I don't have the possibility to export them
> > since
> > > > we
> > > > > > > store
> > > > > > > > > them in CloudWatch Metrics which afaik doesn't offer raw
> > > exports.
> > > > > > > > >
> > > > > > > > > Best regards,
> > > > > > > > > Marco
> > > > > > > > >
> > > > > > > > > Carin Meier <ca...@gmail.com> schrieb am Mi., 14.
> Aug.
> > > > 2019,
> > > > > > > 19:43:
> > > > > > > > >
> > > > > > > > > > I would prefer to keep the language binding in the PR
> > > process.
> > > > > > > Perhaps
> > > > > > > > we
> > > > > > > > > > could do some analytics to see how much each of the
> > language
> > > > > > bindings
> > > > > > > > is
> > > > > > > > > > contributing to overall run time.
> > > > > > > > > > If we have some metrics on that, maybe we can come up
> with
> > a
> > > > > > > guideline
> > > > > > > > of
> > > > > > > > > > how much time each should take. Another possibility is
> > > leverage
> > > > > the
> > > > > > > > > > parallel builds more.
> > > > > > > > > >
> > > > > > > > > > On Wed, Aug 14, 2019 at 1:30 PM Pedro Larroy <
> > > > > > > > > pedro.larroy.lists@gmail.com
> > > > > > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi Carin.
> > > > > > > > > > >
> > > > > > > > > > > That's a good point, all things considered would your
> > > > > preference
> > > > > > be
> > > > > > > > to
> > > > > > > > > > keep
> > > > > > > > > > > the Clojure tests as part of the PR process or in
> > Nightly?
> > > > > > > > > > > Some options are having notifications here or in slack.
> > But
> > > > if
> > > > > we
> > > > > > > > think
> > > > > > > > > > > breakages would go unnoticed maybe is not a good idea
> to
> > > > fully
> > > > > > > remove
> > > > > > > > > > > bindings from the PR process and just streamline the
> > > process.
> > > > > > > > > > >
> > > > > > > > > > > Pedro.
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Aug 14, 2019 at 5:09 AM Carin Meier <
> > > > > > carinmeier@gmail.com>
> > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Before any binding tests are moved to nightly, I
> think
> > we
> > > > > need
> > > > > > to
> > > > > > > > > > figure
> > > > > > > > > > > > out how the community can get proper notifications of
> > > > failure
> > > > > > and
> > > > > > > > > > success
> > > > > > > > > > > > on those nightly runs. Otherwise, I think that
> > breakages
> > > > > would
> > > > > > go
> > > > > > > > > > > > unnoticed.
> > > > > > > > > > > >
> > > > > > > > > > > > -Carin
> > > > > > > > > > > >
> > > > > > > > > > > > On Tue, Aug 13, 2019 at 7:47 PM Pedro Larroy <
> > > > > > > > > > > pedro.larroy.lists@gmail.com
> > > > > > > > > > > > >
> > > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi
> > > > > > > > > > > > >
> > > > > > > > > > > > > Seems we are hitting some problems in CI. I propose
> > the
> > > > > > > following
> > > > > > > > > > > action
> > > > > > > > > > > > > items to remedy the situation and accelerate turn
> > > around
> > > > > > times
> > > > > > > in
> > > > > > > > > CI,
> > > > > > > > > > > > > reduce cost, complexity and probability of failure
> > > > blocking
> > > > > > PRs
> > > > > > > > and
> > > > > > > > > > > > > frustrating developers:
> > > > > > > > > > > > >
> > > > > > > > > > > > > * Upgrade Windows visual studio from VS 2015 to VS
> > > 2017.
> > > > > The
> > > > > > > > > > > > > build_windows.py infrastructure should easily work
> > with
> > > > the
> > > > > > new
> > > > > > > > > > > version.
> > > > > > > > > > > > > Currently some PRs are blocked by this:
> > > > > > > > > > > > >
> > https://github.com/apache/incubator-mxnet/issues/13958
> > > > > > > > > > > > > * Move Gluon Model zoo tests to nightly. Tracked at
> > > > > > > > > > > > >
> > https://github.com/apache/incubator-mxnet/issues/15295
> > > > > > > > > > > > > * Move non-python bindings tests to nightly. If a
> > > commit
> > > > is
> > > > > > > > > touching
> > > > > > > > > > > > other
> > > > > > > > > > > > > bindings, the reviewer should ask for a full run
> > which
> > > > can
> > > > > be
> > > > > > > > done
> > > > > > > > > > > > locally,
> > > > > > > > > > > > > use the label bot to trigger a full CI build, or
> > defer
> > > to
> > > > > > > > nightly.
> > > > > > > > > > > > > * Provide a couple of basic sanity performance
> tests
> > on
> > > > > small
> > > > > > > > > models
> > > > > > > > > > > that
> > > > > > > > > > > > > are run on CI and can be echoed by the label bot
> as a
> > > > > comment
> > > > > > > for
> > > > > > > > > > PRs.
> > > > > > > > > > > > > * Address unit tests that take more than 10-20s,
> > > > streamline
> > > > > > > them
> > > > > > > > or
> > > > > > > > > > > move
> > > > > > > > > > > > > them to nightly if it can't be done.
> > > > > > > > > > > > > * Open sourcing the remaining CI infrastructure
> > scripts
> > > > so
> > > > > > the
> > > > > > > > > > > community
> > > > > > > > > > > > > can contribute.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I think our goal should be turnaround under 30min.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I would also like to touch base with the community
> > that
> > > > > some
> > > > > > > PRs
> > > > > > > > > are
> > > > > > > > > > > not
> > > > > > > > > > > > > being followed up by committers asking for changes.
> > For
> > > > > > example
> > > > > > > > > this
> > > > > > > > > > PR
> > > > > > > > > > > > is
> > > > > > > > > > > > > importtant and is hanging for a long time.
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> https://github.com/apache/incubator-mxnet/pull/15051
> > > > > > > > > > > > >
> > > > > > > > > > > > > This is another, less important but more trivial to
> > > > review:
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> https://github.com/apache/incubator-mxnet/pull/14940
> > > > > > > > > > > > >
> > > > > > > > > > > > > I think comitters requesting changes and not
> > folllowing
> > > > up
> > > > > in
> > > > > > > > > > > reasonable
> > > > > > > > > > > > > time is not healthy for the project. I suggest
> > > > configuring
> > > > > > > github
> > > > > > > > > > > > > Notifications for a good SNR and following up.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Regards.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Pedro.
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: CI and PRs

Posted by Marco de Abreu <ma...@gmail.com>.

No worries, auto scaling is taking care of that :)

-Marco

Sheng Zha <sz...@gmail.com> schrieb am Do., 15. Aug. 2019, 19:50:

> The AWS Batch approach should also help with hardware utilization as
> machines are launched only when needed :)
>
> -sz
>
> > On Aug 15, 2019, at 9:11 AM, Marco de Abreu <ma...@gmail.com>
> wrote:
> >
> > Thanks Leonard. Naively dividing by test files would certainly be an easy
> > and doable way before going into to proper nose parallelization. Great
> idea!
> >
> > Scalability in terms of nodes is not an issue. Our system can handle at
> > least 600 slaves (didn't want to go higher for obvious reasons). But I
> > think we don't even have to go that far because most of the time, our
> > machines are heavily under utilized due to the single-threaded nature of
> > most tests. Thus, parallel test execution on the same machine would
> already
> > speed up the process by great lengths.
> >
> > -Marco
> >
> > P.S. the structure of the Jenkinsfiles seems pretty familiar :P i am glad
> > my approach is considered helpful :)
> >
> > Leonard Lausen <l-...@lausen.nl> schrieb am Do., 15. Aug. 2019,
> 18:59:
> >
> >> To parallelize across machines: For GluonNLP we started submitting test
> >> jobs to AWS Batch. Just adding a for-loop over the units in the
> >> Jenkinsfile [1] and submitting a job for each [2] works quite well. Then
> >> Jenkins just waits for all jobs to finish and retrieves their status.
> >> This works since AWS Batch added GPU support this April [3].
> >>
> >> For MXNet, naively parallelizing over the files defining the test cases
> >> that are in the longest running Pipeline stage may already help?
> >>
> >> [1]:
> >>
> https://github.com/dmlc/gluon-nlp/blob/master/ci/jenkins/Jenkinsfile_py3-master_gpu_doc#L53
> >> [2]:
> https://github.com/dmlc/gluon-nlp/blob/master/ci/batch/submit-job.py
> >> [3]: https://aws.amazon.com/blogs/compute/gpu-workloads-on-aws-batch/
> >>
> >> Marco de Abreu <ma...@gmail.com> writes:
> >>
> >>> The first start wrt parallelization could certainly be start adding
> >>> parallel test execution in nosetests.
> >>>
> >>> -Marco
> >>>
> >>> Aaron Markham <aa...@gmail.com> schrieb am Do., 15. Aug.
> 2019,
> >>> 05:39:
> >>>
> >>>> The PRs Thomas and I are working on for the new docs and website share
> >> the
> >>>> mxnet binary in the new CI pipelines we made. Speeds things up a lot.
> >>>>
> >>>> On Wed, Aug 14, 2019, 18:16 Chris Olivier <cj...@gmail.com>
> >> wrote:
> >>>>
> >>>>> I see it done daily now, and while I can’t share all the details,
> it’s
> >>>> not
> >>>>> an incredibly complex thing, and involves not much more than nfs/efs
> >>>>> sharing and remote ssh commands.  All it takes is a little ingenuity
> >> and
> >>>>> some imagination.
> >>>>>
> >>>>> On Wed, Aug 14, 2019 at 4:31 PM Pedro Larroy <
> >>>> pedro.larroy.lists@gmail.com
> >>>>>>
> >>>>> wrote:
> >>>>>
> >>>>>> Sounds good in theory. I think there are complex details with
> >> regards
> >>>> of
> >>>>>> resource sharing during parallel execution. Still I think both ways
> >> can
> >>>>> be
> >>>>>> explored. I think some tests run for unreasonably long times for
> >> what
> >>>>> they
> >>>>>> are doing. We already scale parts of the pipeline horizontally
> >> across
> >>>>>> workers.
> >>>>>>
> >>>>>>
> >>>>>> On Wed, Aug 14, 2019 at 5:12 PM Chris Olivier <
> >> cjolivier01@apache.org>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> +1
> >>>>>>>
> >>>>>>> Rather than remove tests (which doesn’t scale as a solution), why
> >> not
> >>>>>> scale
> >>>>>>> them horizontally so that they finish more quickly? Across
> >> processes
> >>>> or
> >>>>>>> even on a pool of machines that aren’t necessarily the build
> >> machine?
> >>>>>>>
> >>>>>>> On Wed, Aug 14, 2019 at 12:03 PM Marco de Abreu <
> >>>>> marco.g.abreu@gmail.com
> >>>>>>>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> With regards to time I rather prefer us spending a bit more
> >> time on
> >>>>>>>> maintenance than somebody running into an error that could've
> >> been
> >>>>>> caught
> >>>>>>>> with a test.
> >>>>>>>>
> >>>>>>>> I mean, our Publishing pipeline for Scala GPU has been broken
> >> for
> >>>>> quite
> >>>>>>>> some time now, but nobody noticed that. Basically my stance on
> >> that
> >>>>>>> matter
> >>>>>>>> is that as soon as something is not blocking, you can also just
> >>>>>>> deactivate
> >>>>>>>> it since you don't have a forcing function in an open source
> >>>> project.
> >>>>>>>> People will rarely come back and fix the errors of some nightly
> >>>> test
> >>>>>> that
> >>>>>>>> they introduced.
> >>>>>>>>
> >>>>>>>> -Marco
> >>>>>>>>
> >>>>>>>> Carin Meier <ca...@gmail.com> schrieb am Mi., 14. Aug.
> >> 2019,
> >>>>>> 21:59:
> >>>>>>>>
> >>>>>>>>> If a language binding test is failing for a not important
> >> reason,
> >>>>>> then
> >>>>>>> it
> >>>>>>>>> is too brittle and needs to be fixed (we have fixed some of
> >> these
> >>>>>> with
> >>>>>>>> the
> >>>>>>>>> Clojure package [1]).
> >>>>>>>>> But in general, if we thinking of the MXNet project as one
> >>>> project
> >>>>>> that
> >>>>>>>> is
> >>>>>>>>> across all the language bindings, then we want to know if some
> >>>>>>>> fundamental
> >>>>>>>>> code change is going to break a downstream package.
> >>>>>>>>> I can't speak for all the high level package binding
> >> maintainers,
> >>>>> but
> >>>>>>> I'm
> >>>>>>>>> always happy to pitch in to provide code fixes to help the
> >> base
> >>>> PR
> >>>>>> get
> >>>>>>>>> green.
> >>>>>>>>>
> >>>>>>>>> The time costs to maintain such a large CI project obviously
> >>>> needs
> >>>>> to
> >>>>>>> be
> >>>>>>>>> considered as well.
> >>>>>>>>>
> >>>>>>>>> [1] https://github.com/apache/incubator-mxnet/pull/15579
> >>>>>>>>>
> >>>>>>>>> On Wed, Aug 14, 2019 at 3:48 PM Pedro Larroy <
> >>>>>>>> pedro.larroy.lists@gmail.com
> >>>>>>>>>>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> From what I have seen Clojure is 15 minutes, which I think
> >> is
> >>>>>>>> reasonable.
> >>>>>>>>>> The only question is that when a binding such as R, Perl or
> >>>>> Clojure
> >>>>>>>>> fails,
> >>>>>>>>>> some devs are a bit confused about how to fix them since
> >> they
> >>>> are
> >>>>>> not
> >>>>>>>>>> familiar with the testing tools and the language.
> >>>>>>>>>>
> >>>>>>>>>> On Wed, Aug 14, 2019 at 11:57 AM Carin Meier <
> >>>>> carinmeier@gmail.com
> >>>>>>>
> >>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Great idea Marco! Anything that you think would be
> >> valuable
> >>>> to
> >>>>>>> share
> >>>>>>>>>> would
> >>>>>>>>>>> be good. The duration of each node in the test stage
> >> sounds
> >>>>> like
> >>>>>> a
> >>>>>>>> good
> >>>>>>>>>>> start.
> >>>>>>>>>>>
> >>>>>>>>>>> - Carin
> >>>>>>>>>>>
> >>>>>>>>>>> On Wed, Aug 14, 2019 at 2:48 PM Marco de Abreu <
> >>>>>>>>> marco.g.abreu@gmail.com>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Hi,
> >>>>>>>>>>>>
> >>>>>>>>>>>> we record a bunch of metrics about run statistics (down
> >> to
> >>>>> the
> >>>>>>>>> duration
> >>>>>>>>>>> of
> >>>>>>>>>>>> every individual step). If you tell me which ones you're
> >>>>>>>> particularly
> >>>>>>>>>>>> interested in (probably total duration of each node in
> >> the
> >>>>> test
> >>>>>>>>> stage),
> >>>>>>>>>>> I'm
> >>>>>>>>>>>> happy to provide them.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Dimensions are (in hierarchical order):
> >>>>>>>>>>>> - job
> >>>>>>>>>>>> - branch
> >>>>>>>>>>>> - stage
> >>>>>>>>>>>> - node
> >>>>>>>>>>>> - step
> >>>>>>>>>>>>
> >>>>>>>>>>>> Unfortunately I don't have the possibility to export
> >> them
> >>>>> since
> >>>>>>> we
> >>>>>>>>>> store
> >>>>>>>>>>>> them in CloudWatch Metrics which afaik doesn't offer raw
> >>>>>> exports.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Best regards,
> >>>>>>>>>>>> Marco
> >>>>>>>>>>>>
> >>>>>>>>>>>> Carin Meier <ca...@gmail.com> schrieb am Mi., 14.
> >>>> Aug.
> >>>>>>> 2019,
> >>>>>>>>>> 19:43:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> I would prefer to keep the language binding in the PR
> >>>>>> process.
> >>>>>>>>>> Perhaps
> >>>>>>>>>>> we
> >>>>>>>>>>>>> could do some analytics to see how much each of the
> >>>>> language
> >>>>>>>>> bindings
> >>>>>>>>>>> is
> >>>>>>>>>>>>> contributing to overall run time.
> >>>>>>>>>>>>> If we have some metrics on that, maybe we can come up
> >>>> with
> >>>>> a
> >>>>>>>>>> guideline
> >>>>>>>>>>> of
> >>>>>>>>>>>>> how much time each should take. Another possibility is
> >>>>>> leverage
> >>>>>>>> the
> >>>>>>>>>>>>> parallel builds more.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Wed, Aug 14, 2019 at 1:30 PM Pedro Larroy <
> >>>>>>>>>>>> pedro.larroy.lists@gmail.com
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hi Carin.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> That's a good point, all things considered would
> >> your
> >>>>>>>> preference
> >>>>>>>>> be
> >>>>>>>>>>> to
> >>>>>>>>>>>>> keep
> >>>>>>>>>>>>>> the Clojure tests as part of the PR process or in
> >>>>> Nightly?
> >>>>>>>>>>>>>> Some options are having notifications here or in
> >> slack.
> >>>>> But
> >>>>>>> if
> >>>>>>>> we
> >>>>>>>>>>> think
> >>>>>>>>>>>>>> breakages would go unnoticed maybe is not a good
> >> idea
> >>>> to
> >>>>>>> fully
> >>>>>>>>>> remove
> >>>>>>>>>>>>>> bindings from the PR process and just streamline the
> >>>>>> process.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Pedro.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Wed, Aug 14, 2019 at 5:09 AM Carin Meier <
> >>>>>>>>> carinmeier@gmail.com>
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Before any binding tests are moved to nightly, I
> >>>> think
> >>>>> we
> >>>>>>>> need
> >>>>>>>>> to
> >>>>>>>>>>>>> figure
> >>>>>>>>>>>>>>> out how the community can get proper
> >> notifications of
> >>>>>>> failure
> >>>>>>>>> and
> >>>>>>>>>>>>> success
> >>>>>>>>>>>>>>> on those nightly runs. Otherwise, I think that
> >>>>> breakages
> >>>>>>>> would
> >>>>>>>>> go
> >>>>>>>>>>>>>>> unnoticed.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> -Carin
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Tue, Aug 13, 2019 at 7:47 PM Pedro Larroy <
> >>>>>>>>>>>>>> pedro.larroy.lists@gmail.com
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Hi
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Seems we are hitting some problems in CI. I
> >> propose
> >>>>> the
> >>>>>>>>>> following
> >>>>>>>>>>>>>> action
> >>>>>>>>>>>>>>>> items to remedy the situation and accelerate
> >> turn
> >>>>>> around
> >>>>>>>>> times
> >>>>>>>>>> in
> >>>>>>>>>>>> CI,
> >>>>>>>>>>>>>>>> reduce cost, complexity and probability of
> >> failure
> >>>>>>> blocking
> >>>>>>>>> PRs
> >>>>>>>>>>> and
> >>>>>>>>>>>>>>>> frustrating developers:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> * Upgrade Windows visual studio from VS 2015 to
> >> VS
> >>>>>> 2017.
> >>>>>>>> The
> >>>>>>>>>>>>>>>> build_windows.py infrastructure should easily
> >> work
> >>>>> with
> >>>>>>> the
> >>>>>>>>> new
> >>>>>>>>>>>>>> version.
> >>>>>>>>>>>>>>>> Currently some PRs are blocked by this:
> >>>>>>>>>>>>>>>>
> >>>>> https://github.com/apache/incubator-mxnet/issues/13958
> >>>>>>>>>>>>>>>> * Move Gluon Model zoo tests to nightly.
> >> Tracked at
> >>>>>>>>>>>>>>>>
> >>>>> https://github.com/apache/incubator-mxnet/issues/15295
> >>>>>>>>>>>>>>>> * Move non-python bindings tests to nightly. If
> >> a
> >>>>>> commit
> >>>>>>> is
> >>>>>>>>>>>> touching
> >>>>>>>>>>>>>>> other
> >>>>>>>>>>>>>>>> bindings, the reviewer should ask for a full run
> >>>>> which
> >>>>>>> can
> >>>>>>>> be
> >>>>>>>>>>> done
> >>>>>>>>>>>>>>> locally,
> >>>>>>>>>>>>>>>> use the label bot to trigger a full CI build, or
> >>>>> defer
> >>>>>> to
> >>>>>>>>>>> nightly.
> >>>>>>>>>>>>>>>> * Provide a couple of basic sanity performance
> >>>> tests
> >>>>> on
> >>>>>>>> small
> >>>>>>>>>>>> models
> >>>>>>>>>>>>>> that
> >>>>>>>>>>>>>>>> are run on CI and can be echoed by the label bot
> >>>> as a
> >>>>>>>> comment
> >>>>>>>>>> for
> >>>>>>>>>>>>> PRs.
> >>>>>>>>>>>>>>>> * Address unit tests that take more than 10-20s,
> >>>>>>> streamline
> >>>>>>>>>> them
> >>>>>>>>>>> or
> >>>>>>>>>>>>>> move
> >>>>>>>>>>>>>>>> them to nightly if it can't be done.
> >>>>>>>>>>>>>>>> * Open sourcing the remaining CI infrastructure
> >>>>> scripts
> >>>>>>> so
> >>>>>>>>> the
> >>>>>>>>>>>>>> community
> >>>>>>>>>>>>>>>> can contribute.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> I think our goal should be turnaround under
> >> 30min.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> I would also like to touch base with the
> >> community
> >>>>> that
> >>>>>>>> some
> >>>>>>>>>> PRs
> >>>>>>>>>>>> are
> >>>>>>>>>>>>>> not
> >>>>>>>>>>>>>>>> being followed up by committers asking for
> >> changes.
> >>>>> For
> >>>>>>>>> example
> >>>>>>>>>>>> this
> >>>>>>>>>>>>> PR
> >>>>>>>>>>>>>>> is
> >>>>>>>>>>>>>>>> importtant and is hanging for a long time.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>> https://github.com/apache/incubator-mxnet/pull/15051
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> This is another, less important but more
> >> trivial to
> >>>>>>> review:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>> https://github.com/apache/incubator-mxnet/pull/14940
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> I think comitters requesting changes and not
> >>>>> folllowing
> >>>>>>> up
> >>>>>>>> in
> >>>>>>>>>>>>>> reasonable
> >>>>>>>>>>>>>>>> time is not healthy for the project. I suggest
> >>>>>>> configuring
> >>>>>>>>>> github
> >>>>>>>>>>>>>>>> Notifications for a good SNR and following up.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Regards.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Pedro.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>
>

Re: CI and PRs

Posted by Sheng Zha <sz...@gmail.com>.

The AWS Batch approach should also help with hardware utilization as machines are launched only when needed :)

-sz

> On Aug 15, 2019, at 9:11 AM, Marco de Abreu <ma...@gmail.com> wrote:
> 
> Thanks Leonard. Naively dividing by test files would certainly be an easy
> and doable way before going into to proper nose parallelization. Great idea!
> 
> Scalability in terms of nodes is not an issue. Our system can handle at
> least 600 slaves (didn't want to go higher for obvious reasons). But I
> think we don't even have to go that far because most of the time, our
> machines are heavily under utilized due to the single-threaded nature of
> most tests. Thus, parallel test execution on the same machine would already
> speed up the process by great lengths.
> 
> -Marco
> 
> P.S. the structure of the Jenkinsfiles seems pretty familiar :P i am glad
> my approach is considered helpful :)
> 
> Leonard Lausen <l-...@lausen.nl> schrieb am Do., 15. Aug. 2019, 18:59:
> 
>> To parallelize across machines: For GluonNLP we started submitting test
>> jobs to AWS Batch. Just adding a for-loop over the units in the
>> Jenkinsfile [1] and submitting a job for each [2] works quite well. Then
>> Jenkins just waits for all jobs to finish and retrieves their status.
>> This works since AWS Batch added GPU support this April [3].
>> 
>> For MXNet, naively parallelizing over the files defining the test cases
>> that are in the longest running Pipeline stage may already help?
>> 
>> [1]:
>> https://github.com/dmlc/gluon-nlp/blob/master/ci/jenkins/Jenkinsfile_py3-master_gpu_doc#L53
>> [2]: https://github.com/dmlc/gluon-nlp/blob/master/ci/batch/submit-job.py
>> [3]: https://aws.amazon.com/blogs/compute/gpu-workloads-on-aws-batch/
>> 
>> Marco de Abreu <ma...@gmail.com> writes:
>> 
>>> The first start wrt parallelization could certainly be start adding
>>> parallel test execution in nosetests.
>>> 
>>> -Marco
>>> 
>>> Aaron Markham <aa...@gmail.com> schrieb am Do., 15. Aug. 2019,
>>> 05:39:
>>> 
>>>> The PRs Thomas and I are working on for the new docs and website share
>> the
>>>> mxnet binary in the new CI pipelines we made. Speeds things up a lot.
>>>> 
>>>> On Wed, Aug 14, 2019, 18:16 Chris Olivier <cj...@gmail.com>
>> wrote:
>>>> 
>>>>> I see it done daily now, and while I can’t share all the details, it’s
>>>> not
>>>>> an incredibly complex thing, and involves not much more than nfs/efs
>>>>> sharing and remote ssh commands.  All it takes is a little ingenuity
>> and
>>>>> some imagination.
>>>>> 
>>>>> On Wed, Aug 14, 2019 at 4:31 PM Pedro Larroy <
>>>> pedro.larroy.lists@gmail.com
>>>>>> 
>>>>> wrote:
>>>>> 
>>>>>> Sounds good in theory. I think there are complex details with
>> regards
>>>> of
>>>>>> resource sharing during parallel execution. Still I think both ways
>> can
>>>>> be
>>>>>> explored. I think some tests run for unreasonably long times for
>> what
>>>>> they
>>>>>> are doing. We already scale parts of the pipeline horizontally
>> across
>>>>>> workers.
>>>>>> 
>>>>>> 
>>>>>> On Wed, Aug 14, 2019 at 5:12 PM Chris Olivier <
>> cjolivier01@apache.org>
>>>>>> wrote:
>>>>>> 
>>>>>>> +1
>>>>>>> 
>>>>>>> Rather than remove tests (which doesn’t scale as a solution), why
>> not
>>>>>> scale
>>>>>>> them horizontally so that they finish more quickly? Across
>> processes
>>>> or
>>>>>>> even on a pool of machines that aren’t necessarily the build
>> machine?
>>>>>>> 
>>>>>>> On Wed, Aug 14, 2019 at 12:03 PM Marco de Abreu <
>>>>> marco.g.abreu@gmail.com
>>>>>>> 
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> With regards to time I rather prefer us spending a bit more
>> time on
>>>>>>>> maintenance than somebody running into an error that could've
>> been
>>>>>> caught
>>>>>>>> with a test.
>>>>>>>> 
>>>>>>>> I mean, our Publishing pipeline for Scala GPU has been broken
>> for
>>>>> quite
>>>>>>>> some time now, but nobody noticed that. Basically my stance on
>> that
>>>>>>> matter
>>>>>>>> is that as soon as something is not blocking, you can also just
>>>>>>> deactivate
>>>>>>>> it since you don't have a forcing function in an open source
>>>> project.
>>>>>>>> People will rarely come back and fix the errors of some nightly
>>>> test
>>>>>> that
>>>>>>>> they introduced.
>>>>>>>> 
>>>>>>>> -Marco
>>>>>>>> 
>>>>>>>> Carin Meier <ca...@gmail.com> schrieb am Mi., 14. Aug.
>> 2019,
>>>>>> 21:59:
>>>>>>>> 
>>>>>>>>> If a language binding test is failing for a not important
>> reason,
>>>>>> then
>>>>>>> it
>>>>>>>>> is too brittle and needs to be fixed (we have fixed some of
>> these
>>>>>> with
>>>>>>>> the
>>>>>>>>> Clojure package [1]).
>>>>>>>>> But in general, if we thinking of the MXNet project as one
>>>> project
>>>>>> that
>>>>>>>> is
>>>>>>>>> across all the language bindings, then we want to know if some
>>>>>>>> fundamental
>>>>>>>>> code change is going to break a downstream package.
>>>>>>>>> I can't speak for all the high level package binding
>> maintainers,
>>>>> but
>>>>>>> I'm
>>>>>>>>> always happy to pitch in to provide code fixes to help the
>> base
>>>> PR
>>>>>> get
>>>>>>>>> green.
>>>>>>>>> 
>>>>>>>>> The time costs to maintain such a large CI project obviously
>>>> needs
>>>>> to
>>>>>>> be
>>>>>>>>> considered as well.
>>>>>>>>> 
>>>>>>>>> [1] https://github.com/apache/incubator-mxnet/pull/15579
>>>>>>>>> 
>>>>>>>>> On Wed, Aug 14, 2019 at 3:48 PM Pedro Larroy <
>>>>>>>> pedro.larroy.lists@gmail.com
>>>>>>>>>> 
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> From what I have seen Clojure is 15 minutes, which I think
>> is
>>>>>>>> reasonable.
>>>>>>>>>> The only question is that when a binding such as R, Perl or
>>>>> Clojure
>>>>>>>>> fails,
>>>>>>>>>> some devs are a bit confused about how to fix them since
>> they
>>>> are
>>>>>> not
>>>>>>>>>> familiar with the testing tools and the language.
>>>>>>>>>> 
>>>>>>>>>> On Wed, Aug 14, 2019 at 11:57 AM Carin Meier <
>>>>> carinmeier@gmail.com
>>>>>>> 
>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Great idea Marco! Anything that you think would be
>> valuable
>>>> to
>>>>>>> share
>>>>>>>>>> would
>>>>>>>>>>> be good. The duration of each node in the test stage
>> sounds
>>>>> like
>>>>>> a
>>>>>>>> good
>>>>>>>>>>> start.
>>>>>>>>>>> 
>>>>>>>>>>> - Carin
>>>>>>>>>>> 
>>>>>>>>>>> On Wed, Aug 14, 2019 at 2:48 PM Marco de Abreu <
>>>>>>>>> marco.g.abreu@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Hi,
>>>>>>>>>>>> 
>>>>>>>>>>>> we record a bunch of metrics about run statistics (down
>> to
>>>>> the
>>>>>>>>> duration
>>>>>>>>>>> of
>>>>>>>>>>>> every individual step). If you tell me which ones you're
>>>>>>>> particularly
>>>>>>>>>>>> interested in (probably total duration of each node in
>> the
>>>>> test
>>>>>>>>> stage),
>>>>>>>>>>> I'm
>>>>>>>>>>>> happy to provide them.
>>>>>>>>>>>> 
>>>>>>>>>>>> Dimensions are (in hierarchical order):
>>>>>>>>>>>> - job
>>>>>>>>>>>> - branch
>>>>>>>>>>>> - stage
>>>>>>>>>>>> - node
>>>>>>>>>>>> - step
>>>>>>>>>>>> 
>>>>>>>>>>>> Unfortunately I don't have the possibility to export
>> them
>>>>> since
>>>>>>> we
>>>>>>>>>> store
>>>>>>>>>>>> them in CloudWatch Metrics which afaik doesn't offer raw
>>>>>> exports.
>>>>>>>>>>>> 
>>>>>>>>>>>> Best regards,
>>>>>>>>>>>> Marco
>>>>>>>>>>>> 
>>>>>>>>>>>> Carin Meier <ca...@gmail.com> schrieb am Mi., 14.
>>>> Aug.
>>>>>>> 2019,
>>>>>>>>>> 19:43:
>>>>>>>>>>>> 
>>>>>>>>>>>>> I would prefer to keep the language binding in the PR
>>>>>> process.
>>>>>>>>>> Perhaps
>>>>>>>>>>> we
>>>>>>>>>>>>> could do some analytics to see how much each of the
>>>>> language
>>>>>>>>> bindings
>>>>>>>>>>> is
>>>>>>>>>>>>> contributing to overall run time.
>>>>>>>>>>>>> If we have some metrics on that, maybe we can come up
>>>> with
>>>>> a
>>>>>>>>>> guideline
>>>>>>>>>>> of
>>>>>>>>>>>>> how much time each should take. Another possibility is
>>>>>> leverage
>>>>>>>> the
>>>>>>>>>>>>> parallel builds more.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Wed, Aug 14, 2019 at 1:30 PM Pedro Larroy <
>>>>>>>>>>>> pedro.larroy.lists@gmail.com
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hi Carin.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> That's a good point, all things considered would
>> your
>>>>>>>> preference
>>>>>>>>> be
>>>>>>>>>>> to
>>>>>>>>>>>>> keep
>>>>>>>>>>>>>> the Clojure tests as part of the PR process or in
>>>>> Nightly?
>>>>>>>>>>>>>> Some options are having notifications here or in
>> slack.
>>>>> But
>>>>>>> if
>>>>>>>> we
>>>>>>>>>>> think
>>>>>>>>>>>>>> breakages would go unnoticed maybe is not a good
>> idea
>>>> to
>>>>>>> fully
>>>>>>>>>> remove
>>>>>>>>>>>>>> bindings from the PR process and just streamline the
>>>>>> process.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Pedro.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Wed, Aug 14, 2019 at 5:09 AM Carin Meier <
>>>>>>>>> carinmeier@gmail.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Before any binding tests are moved to nightly, I
>>>> think
>>>>> we
>>>>>>>> need
>>>>>>>>> to
>>>>>>>>>>>>> figure
>>>>>>>>>>>>>>> out how the community can get proper
>> notifications of
>>>>>>> failure
>>>>>>>>> and
>>>>>>>>>>>>> success
>>>>>>>>>>>>>>> on those nightly runs. Otherwise, I think that
>>>>> breakages
>>>>>>>> would
>>>>>>>>> go
>>>>>>>>>>>>>>> unnoticed.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> -Carin
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Tue, Aug 13, 2019 at 7:47 PM Pedro Larroy <
>>>>>>>>>>>>>> pedro.larroy.lists@gmail.com
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Hi
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Seems we are hitting some problems in CI. I
>> propose
>>>>> the
>>>>>>>>>> following
>>>>>>>>>>>>>> action
>>>>>>>>>>>>>>>> items to remedy the situation and accelerate
>> turn
>>>>>> around
>>>>>>>>> times
>>>>>>>>>> in
>>>>>>>>>>>> CI,
>>>>>>>>>>>>>>>> reduce cost, complexity and probability of
>> failure
>>>>>>> blocking
>>>>>>>>> PRs
>>>>>>>>>>> and
>>>>>>>>>>>>>>>> frustrating developers:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> * Upgrade Windows visual studio from VS 2015 to
>> VS
>>>>>> 2017.
>>>>>>>> The
>>>>>>>>>>>>>>>> build_windows.py infrastructure should easily
>> work
>>>>> with
>>>>>>> the
>>>>>>>>> new
>>>>>>>>>>>>>> version.
>>>>>>>>>>>>>>>> Currently some PRs are blocked by this:
>>>>>>>>>>>>>>>> 
>>>>> https://github.com/apache/incubator-mxnet/issues/13958
>>>>>>>>>>>>>>>> * Move Gluon Model zoo tests to nightly.
>> Tracked at
>>>>>>>>>>>>>>>> 
>>>>> https://github.com/apache/incubator-mxnet/issues/15295
>>>>>>>>>>>>>>>> * Move non-python bindings tests to nightly. If
>> a
>>>>>> commit
>>>>>>> is
>>>>>>>>>>>> touching
>>>>>>>>>>>>>>> other
>>>>>>>>>>>>>>>> bindings, the reviewer should ask for a full run
>>>>> which
>>>>>>> can
>>>>>>>> be
>>>>>>>>>>> done
>>>>>>>>>>>>>>> locally,
>>>>>>>>>>>>>>>> use the label bot to trigger a full CI build, or
>>>>> defer
>>>>>> to
>>>>>>>>>>> nightly.
>>>>>>>>>>>>>>>> * Provide a couple of basic sanity performance
>>>> tests
>>>>> on
>>>>>>>> small
>>>>>>>>>>>> models
>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>>> are run on CI and can be echoed by the label bot
>>>> as a
>>>>>>>> comment
>>>>>>>>>> for
>>>>>>>>>>>>> PRs.
>>>>>>>>>>>>>>>> * Address unit tests that take more than 10-20s,
>>>>>>> streamline
>>>>>>>>>> them
>>>>>>>>>>> or
>>>>>>>>>>>>>> move
>>>>>>>>>>>>>>>> them to nightly if it can't be done.
>>>>>>>>>>>>>>>> * Open sourcing the remaining CI infrastructure
>>>>> scripts
>>>>>>> so
>>>>>>>>> the
>>>>>>>>>>>>>> community
>>>>>>>>>>>>>>>> can contribute.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> I think our goal should be turnaround under
>> 30min.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> I would also like to touch base with the
>> community
>>>>> that
>>>>>>>> some
>>>>>>>>>> PRs
>>>>>>>>>>>> are
>>>>>>>>>>>>>> not
>>>>>>>>>>>>>>>> being followed up by committers asking for
>> changes.
>>>>> For
>>>>>>>>> example
>>>>>>>>>>>> this
>>>>>>>>>>>>> PR
>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>> importtant and is hanging for a long time.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>> https://github.com/apache/incubator-mxnet/pull/15051
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> This is another, less important but more
>> trivial to
>>>>>>> review:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>> https://github.com/apache/incubator-mxnet/pull/14940
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> I think comitters requesting changes and not
>>>>> folllowing
>>>>>>> up
>>>>>>>> in
>>>>>>>>>>>>>> reasonable
>>>>>>>>>>>>>>>> time is not healthy for the project. I suggest
>>>>>>> configuring
>>>>>>>>>> github
>>>>>>>>>>>>>>>> Notifications for a good SNR and following up.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Regards.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Pedro.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>

Re: CI and PRs

Posted by Marco de Abreu <ma...@gmail.com>.

Thanks Leonard. Naively dividing by test files would certainly be an easy
and doable way before going into to proper nose parallelization. Great idea!

Scalability in terms of nodes is not an issue. Our system can handle at
least 600 slaves (didn't want to go higher for obvious reasons). But I
think we don't even have to go that far because most of the time, our
machines are heavily under utilized due to the single-threaded nature of
most tests. Thus, parallel test execution on the same machine would already
speed up the process by great lengths.

-Marco

P.S. the structure of the Jenkinsfiles seems pretty familiar :P i am glad
my approach is considered helpful :)

Leonard Lausen <l-...@lausen.nl> schrieb am Do., 15. Aug. 2019, 18:59:

> To parallelize across machines: For GluonNLP we started submitting test
> jobs to AWS Batch. Just adding a for-loop over the units in the
> Jenkinsfile [1] and submitting a job for each [2] works quite well. Then
> Jenkins just waits for all jobs to finish and retrieves their status.
> This works since AWS Batch added GPU support this April [3].
>
> For MXNet, naively parallelizing over the files defining the test cases
> that are in the longest running Pipeline stage may already help?
>
> [1]:
> https://github.com/dmlc/gluon-nlp/blob/master/ci/jenkins/Jenkinsfile_py3-master_gpu_doc#L53
> [2]: https://github.com/dmlc/gluon-nlp/blob/master/ci/batch/submit-job.py
> [3]: https://aws.amazon.com/blogs/compute/gpu-workloads-on-aws-batch/
>
> Marco de Abreu <ma...@gmail.com> writes:
>
> > The first start wrt parallelization could certainly be start adding
> > parallel test execution in nosetests.
> >
> > -Marco
> >
> > Aaron Markham <aa...@gmail.com> schrieb am Do., 15. Aug. 2019,
> > 05:39:
> >
> >> The PRs Thomas and I are working on for the new docs and website share
> the
> >> mxnet binary in the new CI pipelines we made. Speeds things up a lot.
> >>
> >> On Wed, Aug 14, 2019, 18:16 Chris Olivier <cj...@gmail.com>
> wrote:
> >>
> >> > I see it done daily now, and while I can’t share all the details, it’s
> >> not
> >> > an incredibly complex thing, and involves not much more than nfs/efs
> >> > sharing and remote ssh commands.  All it takes is a little ingenuity
> and
> >> > some imagination.
> >> >
> >> > On Wed, Aug 14, 2019 at 4:31 PM Pedro Larroy <
> >> pedro.larroy.lists@gmail.com
> >> > >
> >> > wrote:
> >> >
> >> > > Sounds good in theory. I think there are complex details with
> regards
> >> of
> >> > > resource sharing during parallel execution. Still I think both ways
> can
> >> > be
> >> > > explored. I think some tests run for unreasonably long times for
> what
> >> > they
> >> > > are doing. We already scale parts of the pipeline horizontally
> across
> >> > > workers.
> >> > >
> >> > >
> >> > > On Wed, Aug 14, 2019 at 5:12 PM Chris Olivier <
> cjolivier01@apache.org>
> >> > > wrote:
> >> > >
> >> > > > +1
> >> > > >
> >> > > > Rather than remove tests (which doesn’t scale as a solution), why
> not
> >> > > scale
> >> > > > them horizontally so that they finish more quickly? Across
> processes
> >> or
> >> > > > even on a pool of machines that aren’t necessarily the build
> machine?
> >> > > >
> >> > > > On Wed, Aug 14, 2019 at 12:03 PM Marco de Abreu <
> >> > marco.g.abreu@gmail.com
> >> > > >
> >> > > > wrote:
> >> > > >
> >> > > > > With regards to time I rather prefer us spending a bit more
> time on
> >> > > > > maintenance than somebody running into an error that could've
> been
> >> > > caught
> >> > > > > with a test.
> >> > > > >
> >> > > > > I mean, our Publishing pipeline for Scala GPU has been broken
> for
> >> > quite
> >> > > > > some time now, but nobody noticed that. Basically my stance on
> that
> >> > > > matter
> >> > > > > is that as soon as something is not blocking, you can also just
> >> > > > deactivate
> >> > > > > it since you don't have a forcing function in an open source
> >> project.
> >> > > > > People will rarely come back and fix the errors of some nightly
> >> test
> >> > > that
> >> > > > > they introduced.
> >> > > > >
> >> > > > > -Marco
> >> > > > >
> >> > > > > Carin Meier <ca...@gmail.com> schrieb am Mi., 14. Aug.
> 2019,
> >> > > 21:59:
> >> > > > >
> >> > > > > > If a language binding test is failing for a not important
> reason,
> >> > > then
> >> > > > it
> >> > > > > > is too brittle and needs to be fixed (we have fixed some of
> these
> >> > > with
> >> > > > > the
> >> > > > > > Clojure package [1]).
> >> > > > > > But in general, if we thinking of the MXNet project as one
> >> project
> >> > > that
> >> > > > > is
> >> > > > > > across all the language bindings, then we want to know if some
> >> > > > > fundamental
> >> > > > > > code change is going to break a downstream package.
> >> > > > > > I can't speak for all the high level package binding
> maintainers,
> >> > but
> >> > > > I'm
> >> > > > > > always happy to pitch in to provide code fixes to help the
> base
> >> PR
> >> > > get
> >> > > > > > green.
> >> > > > > >
> >> > > > > > The time costs to maintain such a large CI project obviously
> >> needs
> >> > to
> >> > > > be
> >> > > > > > considered as well.
> >> > > > > >
> >> > > > > > [1] https://github.com/apache/incubator-mxnet/pull/15579
> >> > > > > >
> >> > > > > > On Wed, Aug 14, 2019 at 3:48 PM Pedro Larroy <
> >> > > > > pedro.larroy.lists@gmail.com
> >> > > > > > >
> >> > > > > > wrote:
> >> > > > > >
> >> > > > > > > From what I have seen Clojure is 15 minutes, which I think
> is
> >> > > > > reasonable.
> >> > > > > > > The only question is that when a binding such as R, Perl or
> >> > Clojure
> >> > > > > > fails,
> >> > > > > > > some devs are a bit confused about how to fix them since
> they
> >> are
> >> > > not
> >> > > > > > > familiar with the testing tools and the language.
> >> > > > > > >
> >> > > > > > > On Wed, Aug 14, 2019 at 11:57 AM Carin Meier <
> >> > carinmeier@gmail.com
> >> > > >
> >> > > > > > wrote:
> >> > > > > > >
> >> > > > > > > > Great idea Marco! Anything that you think would be
> valuable
> >> to
> >> > > > share
> >> > > > > > > would
> >> > > > > > > > be good. The duration of each node in the test stage
> sounds
> >> > like
> >> > > a
> >> > > > > good
> >> > > > > > > > start.
> >> > > > > > > >
> >> > > > > > > > - Carin
> >> > > > > > > >
> >> > > > > > > > On Wed, Aug 14, 2019 at 2:48 PM Marco de Abreu <
> >> > > > > > marco.g.abreu@gmail.com>
> >> > > > > > > > wrote:
> >> > > > > > > >
> >> > > > > > > > > Hi,
> >> > > > > > > > >
> >> > > > > > > > > we record a bunch of metrics about run statistics (down
> to
> >> > the
> >> > > > > > duration
> >> > > > > > > > of
> >> > > > > > > > > every individual step). If you tell me which ones you're
> >> > > > > particularly
> >> > > > > > > > > interested in (probably total duration of each node in
> the
> >> > test
> >> > > > > > stage),
> >> > > > > > > > I'm
> >> > > > > > > > > happy to provide them.
> >> > > > > > > > >
> >> > > > > > > > > Dimensions are (in hierarchical order):
> >> > > > > > > > > - job
> >> > > > > > > > > - branch
> >> > > > > > > > > - stage
> >> > > > > > > > > - node
> >> > > > > > > > > - step
> >> > > > > > > > >
> >> > > > > > > > > Unfortunately I don't have the possibility to export
> them
> >> > since
> >> > > > we
> >> > > > > > > store
> >> > > > > > > > > them in CloudWatch Metrics which afaik doesn't offer raw
> >> > > exports.
> >> > > > > > > > >
> >> > > > > > > > > Best regards,
> >> > > > > > > > > Marco
> >> > > > > > > > >
> >> > > > > > > > > Carin Meier <ca...@gmail.com> schrieb am Mi., 14.
> >> Aug.
> >> > > > 2019,
> >> > > > > > > 19:43:
> >> > > > > > > > >
> >> > > > > > > > > > I would prefer to keep the language binding in the PR
> >> > > process.
> >> > > > > > > Perhaps
> >> > > > > > > > we
> >> > > > > > > > > > could do some analytics to see how much each of the
> >> > language
> >> > > > > > bindings
> >> > > > > > > > is
> >> > > > > > > > > > contributing to overall run time.
> >> > > > > > > > > > If we have some metrics on that, maybe we can come up
> >> with
> >> > a
> >> > > > > > > guideline
> >> > > > > > > > of
> >> > > > > > > > > > how much time each should take. Another possibility is
> >> > > leverage
> >> > > > > the
> >> > > > > > > > > > parallel builds more.
> >> > > > > > > > > >
> >> > > > > > > > > > On Wed, Aug 14, 2019 at 1:30 PM Pedro Larroy <
> >> > > > > > > > > pedro.larroy.lists@gmail.com
> >> > > > > > > > > > >
> >> > > > > > > > > > wrote:
> >> > > > > > > > > >
> >> > > > > > > > > > > Hi Carin.
> >> > > > > > > > > > >
> >> > > > > > > > > > > That's a good point, all things considered would
> your
> >> > > > > preference
> >> > > > > > be
> >> > > > > > > > to
> >> > > > > > > > > > keep
> >> > > > > > > > > > > the Clojure tests as part of the PR process or in
> >> > Nightly?
> >> > > > > > > > > > > Some options are having notifications here or in
> slack.
> >> > But
> >> > > > if
> >> > > > > we
> >> > > > > > > > think
> >> > > > > > > > > > > breakages would go unnoticed maybe is not a good
> idea
> >> to
> >> > > > fully
> >> > > > > > > remove
> >> > > > > > > > > > > bindings from the PR process and just streamline the
> >> > > process.
> >> > > > > > > > > > >
> >> > > > > > > > > > > Pedro.
> >> > > > > > > > > > >
> >> > > > > > > > > > > On Wed, Aug 14, 2019 at 5:09 AM Carin Meier <
> >> > > > > > carinmeier@gmail.com>
> >> > > > > > > > > > wrote:
> >> > > > > > > > > > >
> >> > > > > > > > > > > > Before any binding tests are moved to nightly, I
> >> think
> >> > we
> >> > > > > need
> >> > > > > > to
> >> > > > > > > > > > figure
> >> > > > > > > > > > > > out how the community can get proper
> notifications of
> >> > > > failure
> >> > > > > > and
> >> > > > > > > > > > success
> >> > > > > > > > > > > > on those nightly runs. Otherwise, I think that
> >> > breakages
> >> > > > > would
> >> > > > > > go
> >> > > > > > > > > > > > unnoticed.
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > -Carin
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > On Tue, Aug 13, 2019 at 7:47 PM Pedro Larroy <
> >> > > > > > > > > > > pedro.larroy.lists@gmail.com
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > wrote:
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > > Hi
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > Seems we are hitting some problems in CI. I
> propose
> >> > the
> >> > > > > > > following
> >> > > > > > > > > > > action
> >> > > > > > > > > > > > > items to remedy the situation and accelerate
> turn
> >> > > around
> >> > > > > > times
> >> > > > > > > in
> >> > > > > > > > > CI,
> >> > > > > > > > > > > > > reduce cost, complexity and probability of
> failure
> >> > > > blocking
> >> > > > > > PRs
> >> > > > > > > > and
> >> > > > > > > > > > > > > frustrating developers:
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > * Upgrade Windows visual studio from VS 2015 to
> VS
> >> > > 2017.
> >> > > > > The
> >> > > > > > > > > > > > > build_windows.py infrastructure should easily
> work
> >> > with
> >> > > > the
> >> > > > > > new
> >> > > > > > > > > > > version.
> >> > > > > > > > > > > > > Currently some PRs are blocked by this:
> >> > > > > > > > > > > > >
> >> > https://github.com/apache/incubator-mxnet/issues/13958
> >> > > > > > > > > > > > > * Move Gluon Model zoo tests to nightly.
> Tracked at
> >> > > > > > > > > > > > >
> >> > https://github.com/apache/incubator-mxnet/issues/15295
> >> > > > > > > > > > > > > * Move non-python bindings tests to nightly. If
> a
> >> > > commit
> >> > > > is
> >> > > > > > > > > touching
> >> > > > > > > > > > > > other
> >> > > > > > > > > > > > > bindings, the reviewer should ask for a full run
> >> > which
> >> > > > can
> >> > > > > be
> >> > > > > > > > done
> >> > > > > > > > > > > > locally,
> >> > > > > > > > > > > > > use the label bot to trigger a full CI build, or
> >> > defer
> >> > > to
> >> > > > > > > > nightly.
> >> > > > > > > > > > > > > * Provide a couple of basic sanity performance
> >> tests
> >> > on
> >> > > > > small
> >> > > > > > > > > models
> >> > > > > > > > > > > that
> >> > > > > > > > > > > > > are run on CI and can be echoed by the label bot
> >> as a
> >> > > > > comment
> >> > > > > > > for
> >> > > > > > > > > > PRs.
> >> > > > > > > > > > > > > * Address unit tests that take more than 10-20s,
> >> > > > streamline
> >> > > > > > > them
> >> > > > > > > > or
> >> > > > > > > > > > > move
> >> > > > > > > > > > > > > them to nightly if it can't be done.
> >> > > > > > > > > > > > > * Open sourcing the remaining CI infrastructure
> >> > scripts
> >> > > > so
> >> > > > > > the
> >> > > > > > > > > > > community
> >> > > > > > > > > > > > > can contribute.
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > I think our goal should be turnaround under
> 30min.
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > I would also like to touch base with the
> community
> >> > that
> >> > > > > some
> >> > > > > > > PRs
> >> > > > > > > > > are
> >> > > > > > > > > > > not
> >> > > > > > > > > > > > > being followed up by committers asking for
> changes.
> >> > For
> >> > > > > > example
> >> > > > > > > > > this
> >> > > > > > > > > > PR
> >> > > > > > > > > > > > is
> >> > > > > > > > > > > > > importtant and is hanging for a long time.
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > >
> >> https://github.com/apache/incubator-mxnet/pull/15051
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > This is another, less important but more
> trivial to
> >> > > > review:
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > >
> >> https://github.com/apache/incubator-mxnet/pull/14940
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > I think comitters requesting changes and not
> >> > folllowing
> >> > > > up
> >> > > > > in
> >> > > > > > > > > > > reasonable
> >> > > > > > > > > > > > > time is not healthy for the project. I suggest
> >> > > > configuring
> >> > > > > > > github
> >> > > > > > > > > > > > > Notifications for a good SNR and following up.
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > Regards.
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > Pedro.
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
>

Re: CI and PRs

Posted by Leonard Lausen <l-...@lausen.nl>.

To parallelize across machines: For GluonNLP we started submitting test
jobs to AWS Batch. Just adding a for-loop over the units in the
Jenkinsfile [1] and submitting a job for each [2] works quite well. Then
Jenkins just waits for all jobs to finish and retrieves their status.
This works since AWS Batch added GPU support this April [3].

For MXNet, naively parallelizing over the files defining the test cases
that are in the longest running Pipeline stage may already help?

[1]: https://github.com/dmlc/gluon-nlp/blob/master/ci/jenkins/Jenkinsfile_py3-master_gpu_doc#L53
[2]: https://github.com/dmlc/gluon-nlp/blob/master/ci/batch/submit-job.py
[3]: https://aws.amazon.com/blogs/compute/gpu-workloads-on-aws-batch/

Marco de Abreu <ma...@gmail.com> writes:

> The first start wrt parallelization could certainly be start adding
> parallel test execution in nosetests.
>
> -Marco
>
> Aaron Markham <aa...@gmail.com> schrieb am Do., 15. Aug. 2019,
> 05:39:
>
>> The PRs Thomas and I are working on for the new docs and website share the
>> mxnet binary in the new CI pipelines we made. Speeds things up a lot.
>>
>> On Wed, Aug 14, 2019, 18:16 Chris Olivier <cj...@gmail.com> wrote:
>>
>> > I see it done daily now, and while I can’t share all the details, it’s
>> not
>> > an incredibly complex thing, and involves not much more than nfs/efs
>> > sharing and remote ssh commands.  All it takes is a little ingenuity and
>> > some imagination.
>> >
>> > On Wed, Aug 14, 2019 at 4:31 PM Pedro Larroy <
>> pedro.larroy.lists@gmail.com
>> > >
>> > wrote:
>> >
>> > > Sounds good in theory. I think there are complex details with regards
>> of
>> > > resource sharing during parallel execution. Still I think both ways can
>> > be
>> > > explored. I think some tests run for unreasonably long times for what
>> > they
>> > > are doing. We already scale parts of the pipeline horizontally across
>> > > workers.
>> > >
>> > >
>> > > On Wed, Aug 14, 2019 at 5:12 PM Chris Olivier <cj...@apache.org>
>> > > wrote:
>> > >
>> > > > +1
>> > > >
>> > > > Rather than remove tests (which doesn’t scale as a solution), why not
>> > > scale
>> > > > them horizontally so that they finish more quickly? Across processes
>> or
>> > > > even on a pool of machines that aren’t necessarily the build machine?
>> > > >
>> > > > On Wed, Aug 14, 2019 at 12:03 PM Marco de Abreu <
>> > marco.g.abreu@gmail.com
>> > > >
>> > > > wrote:
>> > > >
>> > > > > With regards to time I rather prefer us spending a bit more time on
>> > > > > maintenance than somebody running into an error that could've been
>> > > caught
>> > > > > with a test.
>> > > > >
>> > > > > I mean, our Publishing pipeline for Scala GPU has been broken for
>> > quite
>> > > > > some time now, but nobody noticed that. Basically my stance on that
>> > > > matter
>> > > > > is that as soon as something is not blocking, you can also just
>> > > > deactivate
>> > > > > it since you don't have a forcing function in an open source
>> project.
>> > > > > People will rarely come back and fix the errors of some nightly
>> test
>> > > that
>> > > > > they introduced.
>> > > > >
>> > > > > -Marco
>> > > > >
>> > > > > Carin Meier <ca...@gmail.com> schrieb am Mi., 14. Aug. 2019,
>> > > 21:59:
>> > > > >
>> > > > > > If a language binding test is failing for a not important reason,
>> > > then
>> > > > it
>> > > > > > is too brittle and needs to be fixed (we have fixed some of these
>> > > with
>> > > > > the
>> > > > > > Clojure package [1]).
>> > > > > > But in general, if we thinking of the MXNet project as one
>> project
>> > > that
>> > > > > is
>> > > > > > across all the language bindings, then we want to know if some
>> > > > > fundamental
>> > > > > > code change is going to break a downstream package.
>> > > > > > I can't speak for all the high level package binding maintainers,
>> > but
>> > > > I'm
>> > > > > > always happy to pitch in to provide code fixes to help the base
>> PR
>> > > get
>> > > > > > green.
>> > > > > >
>> > > > > > The time costs to maintain such a large CI project obviously
>> needs
>> > to
>> > > > be
>> > > > > > considered as well.
>> > > > > >
>> > > > > > [1] https://github.com/apache/incubator-mxnet/pull/15579
>> > > > > >
>> > > > > > On Wed, Aug 14, 2019 at 3:48 PM Pedro Larroy <
>> > > > > pedro.larroy.lists@gmail.com
>> > > > > > >
>> > > > > > wrote:
>> > > > > >
>> > > > > > > From what I have seen Clojure is 15 minutes, which I think is
>> > > > > reasonable.
>> > > > > > > The only question is that when a binding such as R, Perl or
>> > Clojure
>> > > > > > fails,
>> > > > > > > some devs are a bit confused about how to fix them since they
>> are
>> > > not
>> > > > > > > familiar with the testing tools and the language.
>> > > > > > >
>> > > > > > > On Wed, Aug 14, 2019 at 11:57 AM Carin Meier <
>> > carinmeier@gmail.com
>> > > >
>> > > > > > wrote:
>> > > > > > >
>> > > > > > > > Great idea Marco! Anything that you think would be valuable
>> to
>> > > > share
>> > > > > > > would
>> > > > > > > > be good. The duration of each node in the test stage sounds
>> > like
>> > > a
>> > > > > good
>> > > > > > > > start.
>> > > > > > > >
>> > > > > > > > - Carin
>> > > > > > > >
>> > > > > > > > On Wed, Aug 14, 2019 at 2:48 PM Marco de Abreu <
>> > > > > > marco.g.abreu@gmail.com>
>> > > > > > > > wrote:
>> > > > > > > >
>> > > > > > > > > Hi,
>> > > > > > > > >
>> > > > > > > > > we record a bunch of metrics about run statistics (down to
>> > the
>> > > > > > duration
>> > > > > > > > of
>> > > > > > > > > every individual step). If you tell me which ones you're
>> > > > > particularly
>> > > > > > > > > interested in (probably total duration of each node in the
>> > test
>> > > > > > stage),
>> > > > > > > > I'm
>> > > > > > > > > happy to provide them.
>> > > > > > > > >
>> > > > > > > > > Dimensions are (in hierarchical order):
>> > > > > > > > > - job
>> > > > > > > > > - branch
>> > > > > > > > > - stage
>> > > > > > > > > - node
>> > > > > > > > > - step
>> > > > > > > > >
>> > > > > > > > > Unfortunately I don't have the possibility to export them
>> > since
>> > > > we
>> > > > > > > store
>> > > > > > > > > them in CloudWatch Metrics which afaik doesn't offer raw
>> > > exports.
>> > > > > > > > >
>> > > > > > > > > Best regards,
>> > > > > > > > > Marco
>> > > > > > > > >
>> > > > > > > > > Carin Meier <ca...@gmail.com> schrieb am Mi., 14.
>> Aug.
>> > > > 2019,
>> > > > > > > 19:43:
>> > > > > > > > >
>> > > > > > > > > > I would prefer to keep the language binding in the PR
>> > > process.
>> > > > > > > Perhaps
>> > > > > > > > we
>> > > > > > > > > > could do some analytics to see how much each of the
>> > language
>> > > > > > bindings
>> > > > > > > > is
>> > > > > > > > > > contributing to overall run time.
>> > > > > > > > > > If we have some metrics on that, maybe we can come up
>> with
>> > a
>> > > > > > > guideline
>> > > > > > > > of
>> > > > > > > > > > how much time each should take. Another possibility is
>> > > leverage
>> > > > > the
>> > > > > > > > > > parallel builds more.
>> > > > > > > > > >
>> > > > > > > > > > On Wed, Aug 14, 2019 at 1:30 PM Pedro Larroy <
>> > > > > > > > > pedro.larroy.lists@gmail.com
>> > > > > > > > > > >
>> > > > > > > > > > wrote:
>> > > > > > > > > >
>> > > > > > > > > > > Hi Carin.
>> > > > > > > > > > >
>> > > > > > > > > > > That's a good point, all things considered would your
>> > > > > preference
>> > > > > > be
>> > > > > > > > to
>> > > > > > > > > > keep
>> > > > > > > > > > > the Clojure tests as part of the PR process or in
>> > Nightly?
>> > > > > > > > > > > Some options are having notifications here or in slack.
>> > But
>> > > > if
>> > > > > we
>> > > > > > > > think
>> > > > > > > > > > > breakages would go unnoticed maybe is not a good idea
>> to
>> > > > fully
>> > > > > > > remove
>> > > > > > > > > > > bindings from the PR process and just streamline the
>> > > process.
>> > > > > > > > > > >
>> > > > > > > > > > > Pedro.
>> > > > > > > > > > >
>> > > > > > > > > > > On Wed, Aug 14, 2019 at 5:09 AM Carin Meier <
>> > > > > > carinmeier@gmail.com>
>> > > > > > > > > > wrote:
>> > > > > > > > > > >
>> > > > > > > > > > > > Before any binding tests are moved to nightly, I
>> think
>> > we
>> > > > > need
>> > > > > > to
>> > > > > > > > > > figure
>> > > > > > > > > > > > out how the community can get proper notifications of
>> > > > failure
>> > > > > > and
>> > > > > > > > > > success
>> > > > > > > > > > > > on those nightly runs. Otherwise, I think that
>> > breakages
>> > > > > would
>> > > > > > go
>> > > > > > > > > > > > unnoticed.
>> > > > > > > > > > > >
>> > > > > > > > > > > > -Carin
>> > > > > > > > > > > >
>> > > > > > > > > > > > On Tue, Aug 13, 2019 at 7:47 PM Pedro Larroy <
>> > > > > > > > > > > pedro.larroy.lists@gmail.com
>> > > > > > > > > > > > >
>> > > > > > > > > > > > wrote:
>> > > > > > > > > > > >
>> > > > > > > > > > > > > Hi
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > Seems we are hitting some problems in CI. I propose
>> > the
>> > > > > > > following
>> > > > > > > > > > > action
>> > > > > > > > > > > > > items to remedy the situation and accelerate turn
>> > > around
>> > > > > > times
>> > > > > > > in
>> > > > > > > > > CI,
>> > > > > > > > > > > > > reduce cost, complexity and probability of failure
>> > > > blocking
>> > > > > > PRs
>> > > > > > > > and
>> > > > > > > > > > > > > frustrating developers:
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > * Upgrade Windows visual studio from VS 2015 to VS
>> > > 2017.
>> > > > > The
>> > > > > > > > > > > > > build_windows.py infrastructure should easily work
>> > with
>> > > > the
>> > > > > > new
>> > > > > > > > > > > version.
>> > > > > > > > > > > > > Currently some PRs are blocked by this:
>> > > > > > > > > > > > >
>> > https://github.com/apache/incubator-mxnet/issues/13958
>> > > > > > > > > > > > > * Move Gluon Model zoo tests to nightly. Tracked at
>> > > > > > > > > > > > >
>> > https://github.com/apache/incubator-mxnet/issues/15295
>> > > > > > > > > > > > > * Move non-python bindings tests to nightly. If a
>> > > commit
>> > > > is
>> > > > > > > > > touching
>> > > > > > > > > > > > other
>> > > > > > > > > > > > > bindings, the reviewer should ask for a full run
>> > which
>> > > > can
>> > > > > be
>> > > > > > > > done
>> > > > > > > > > > > > locally,
>> > > > > > > > > > > > > use the label bot to trigger a full CI build, or
>> > defer
>> > > to
>> > > > > > > > nightly.
>> > > > > > > > > > > > > * Provide a couple of basic sanity performance
>> tests
>> > on
>> > > > > small
>> > > > > > > > > models
>> > > > > > > > > > > that
>> > > > > > > > > > > > > are run on CI and can be echoed by the label bot
>> as a
>> > > > > comment
>> > > > > > > for
>> > > > > > > > > > PRs.
>> > > > > > > > > > > > > * Address unit tests that take more than 10-20s,
>> > > > streamline
>> > > > > > > them
>> > > > > > > > or
>> > > > > > > > > > > move
>> > > > > > > > > > > > > them to nightly if it can't be done.
>> > > > > > > > > > > > > * Open sourcing the remaining CI infrastructure
>> > scripts
>> > > > so
>> > > > > > the
>> > > > > > > > > > > community
>> > > > > > > > > > > > > can contribute.
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > I think our goal should be turnaround under 30min.
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > I would also like to touch base with the community
>> > that
>> > > > > some
>> > > > > > > PRs
>> > > > > > > > > are
>> > > > > > > > > > > not
>> > > > > > > > > > > > > being followed up by committers asking for changes.
>> > For
>> > > > > > example
>> > > > > > > > > this
>> > > > > > > > > > PR
>> > > > > > > > > > > > is
>> > > > > > > > > > > > > importtant and is hanging for a long time.
>> > > > > > > > > > > > >
>> > > > > > > > > > > > >
>> https://github.com/apache/incubator-mxnet/pull/15051
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > This is another, less important but more trivial to
>> > > > review:
>> > > > > > > > > > > > >
>> > > > > > > > > > > > >
>> https://github.com/apache/incubator-mxnet/pull/14940
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > I think comitters requesting changes and not
>> > folllowing
>> > > > up
>> > > > > in
>> > > > > > > > > > > reasonable
>> > > > > > > > > > > > > time is not healthy for the project. I suggest
>> > > > configuring
>> > > > > > > github
>> > > > > > > > > > > > > Notifications for a good SNR and following up.
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > Regards.
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > Pedro.
>> > > > > > > > > > > > >
>> > > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > >
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>

Re: CI and PRs

Posted by Marco de Abreu <ma...@gmail.com>.

The first start wrt parallelization could certainly be start adding
parallel test execution in nosetests.

-Marco

Aaron Markham <aa...@gmail.com> schrieb am Do., 15. Aug. 2019,
05:39:

> The PRs Thomas and I are working on for the new docs and website share the
> mxnet binary in the new CI pipelines we made. Speeds things up a lot.
>
> On Wed, Aug 14, 2019, 18:16 Chris Olivier <cj...@gmail.com> wrote:
>
> > I see it done daily now, and while I can’t share all the details, it’s
> not
> > an incredibly complex thing, and involves not much more than nfs/efs
> > sharing and remote ssh commands.  All it takes is a little ingenuity and
> > some imagination.
> >
> > On Wed, Aug 14, 2019 at 4:31 PM Pedro Larroy <
> pedro.larroy.lists@gmail.com
> > >
> > wrote:
> >
> > > Sounds good in theory. I think there are complex details with regards
> of
> > > resource sharing during parallel execution. Still I think both ways can
> > be
> > > explored. I think some tests run for unreasonably long times for what
> > they
> > > are doing. We already scale parts of the pipeline horizontally across
> > > workers.
> > >
> > >
> > > On Wed, Aug 14, 2019 at 5:12 PM Chris Olivier <cj...@apache.org>
> > > wrote:
> > >
> > > > +1
> > > >
> > > > Rather than remove tests (which doesn’t scale as a solution), why not
> > > scale
> > > > them horizontally so that they finish more quickly? Across processes
> or
> > > > even on a pool of machines that aren’t necessarily the build machine?
> > > >
> > > > On Wed, Aug 14, 2019 at 12:03 PM Marco de Abreu <
> > marco.g.abreu@gmail.com
> > > >
> > > > wrote:
> > > >
> > > > > With regards to time I rather prefer us spending a bit more time on
> > > > > maintenance than somebody running into an error that could've been
> > > caught
> > > > > with a test.
> > > > >
> > > > > I mean, our Publishing pipeline for Scala GPU has been broken for
> > quite
> > > > > some time now, but nobody noticed that. Basically my stance on that
> > > > matter
> > > > > is that as soon as something is not blocking, you can also just
> > > > deactivate
> > > > > it since you don't have a forcing function in an open source
> project.
> > > > > People will rarely come back and fix the errors of some nightly
> test
> > > that
> > > > > they introduced.
> > > > >
> > > > > -Marco
> > > > >
> > > > > Carin Meier <ca...@gmail.com> schrieb am Mi., 14. Aug. 2019,
> > > 21:59:
> > > > >
> > > > > > If a language binding test is failing for a not important reason,
> > > then
> > > > it
> > > > > > is too brittle and needs to be fixed (we have fixed some of these
> > > with
> > > > > the
> > > > > > Clojure package [1]).
> > > > > > But in general, if we thinking of the MXNet project as one
> project
> > > that
> > > > > is
> > > > > > across all the language bindings, then we want to know if some
> > > > > fundamental
> > > > > > code change is going to break a downstream package.
> > > > > > I can't speak for all the high level package binding maintainers,
> > but
> > > > I'm
> > > > > > always happy to pitch in to provide code fixes to help the base
> PR
> > > get
> > > > > > green.
> > > > > >
> > > > > > The time costs to maintain such a large CI project obviously
> needs
> > to
> > > > be
> > > > > > considered as well.
> > > > > >
> > > > > > [1] https://github.com/apache/incubator-mxnet/pull/15579
> > > > > >
> > > > > > On Wed, Aug 14, 2019 at 3:48 PM Pedro Larroy <
> > > > > pedro.larroy.lists@gmail.com
> > > > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > From what I have seen Clojure is 15 minutes, which I think is
> > > > > reasonable.
> > > > > > > The only question is that when a binding such as R, Perl or
> > Clojure
> > > > > > fails,
> > > > > > > some devs are a bit confused about how to fix them since they
> are
> > > not
> > > > > > > familiar with the testing tools and the language.
> > > > > > >
> > > > > > > On Wed, Aug 14, 2019 at 11:57 AM Carin Meier <
> > carinmeier@gmail.com
> > > >
> > > > > > wrote:
> > > > > > >
> > > > > > > > Great idea Marco! Anything that you think would be valuable
> to
> > > > share
> > > > > > > would
> > > > > > > > be good. The duration of each node in the test stage sounds
> > like
> > > a
> > > > > good
> > > > > > > > start.
> > > > > > > >
> > > > > > > > - Carin
> > > > > > > >
> > > > > > > > On Wed, Aug 14, 2019 at 2:48 PM Marco de Abreu <
> > > > > > marco.g.abreu@gmail.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi,
> > > > > > > > >
> > > > > > > > > we record a bunch of metrics about run statistics (down to
> > the
> > > > > > duration
> > > > > > > > of
> > > > > > > > > every individual step). If you tell me which ones you're
> > > > > particularly
> > > > > > > > > interested in (probably total duration of each node in the
> > test
> > > > > > stage),
> > > > > > > > I'm
> > > > > > > > > happy to provide them.
> > > > > > > > >
> > > > > > > > > Dimensions are (in hierarchical order):
> > > > > > > > > - job
> > > > > > > > > - branch
> > > > > > > > > - stage
> > > > > > > > > - node
> > > > > > > > > - step
> > > > > > > > >
> > > > > > > > > Unfortunately I don't have the possibility to export them
> > since
> > > > we
> > > > > > > store
> > > > > > > > > them in CloudWatch Metrics which afaik doesn't offer raw
> > > exports.
> > > > > > > > >
> > > > > > > > > Best regards,
> > > > > > > > > Marco
> > > > > > > > >
> > > > > > > > > Carin Meier <ca...@gmail.com> schrieb am Mi., 14.
> Aug.
> > > > 2019,
> > > > > > > 19:43:
> > > > > > > > >
> > > > > > > > > > I would prefer to keep the language binding in the PR
> > > process.
> > > > > > > Perhaps
> > > > > > > > we
> > > > > > > > > > could do some analytics to see how much each of the
> > language
> > > > > > bindings
> > > > > > > > is
> > > > > > > > > > contributing to overall run time.
> > > > > > > > > > If we have some metrics on that, maybe we can come up
> with
> > a
> > > > > > > guideline
> > > > > > > > of
> > > > > > > > > > how much time each should take. Another possibility is
> > > leverage
> > > > > the
> > > > > > > > > > parallel builds more.
> > > > > > > > > >
> > > > > > > > > > On Wed, Aug 14, 2019 at 1:30 PM Pedro Larroy <
> > > > > > > > > pedro.larroy.lists@gmail.com
> > > > > > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi Carin.
> > > > > > > > > > >
> > > > > > > > > > > That's a good point, all things considered would your
> > > > > preference
> > > > > > be
> > > > > > > > to
> > > > > > > > > > keep
> > > > > > > > > > > the Clojure tests as part of the PR process or in
> > Nightly?
> > > > > > > > > > > Some options are having notifications here or in slack.
> > But
> > > > if
> > > > > we
> > > > > > > > think
> > > > > > > > > > > breakages would go unnoticed maybe is not a good idea
> to
> > > > fully
> > > > > > > remove
> > > > > > > > > > > bindings from the PR process and just streamline the
> > > process.
> > > > > > > > > > >
> > > > > > > > > > > Pedro.
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Aug 14, 2019 at 5:09 AM Carin Meier <
> > > > > > carinmeier@gmail.com>
> > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Before any binding tests are moved to nightly, I
> think
> > we
> > > > > need
> > > > > > to
> > > > > > > > > > figure
> > > > > > > > > > > > out how the community can get proper notifications of
> > > > failure
> > > > > > and
> > > > > > > > > > success
> > > > > > > > > > > > on those nightly runs. Otherwise, I think that
> > breakages
> > > > > would
> > > > > > go
> > > > > > > > > > > > unnoticed.
> > > > > > > > > > > >
> > > > > > > > > > > > -Carin
> > > > > > > > > > > >
> > > > > > > > > > > > On Tue, Aug 13, 2019 at 7:47 PM Pedro Larroy <
> > > > > > > > > > > pedro.larroy.lists@gmail.com
> > > > > > > > > > > > >
> > > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi
> > > > > > > > > > > > >
> > > > > > > > > > > > > Seems we are hitting some problems in CI. I propose
> > the
> > > > > > > following
> > > > > > > > > > > action
> > > > > > > > > > > > > items to remedy the situation and accelerate turn
> > > around
> > > > > > times
> > > > > > > in
> > > > > > > > > CI,
> > > > > > > > > > > > > reduce cost, complexity and probability of failure
> > > > blocking
> > > > > > PRs
> > > > > > > > and
> > > > > > > > > > > > > frustrating developers:
> > > > > > > > > > > > >
> > > > > > > > > > > > > * Upgrade Windows visual studio from VS 2015 to VS
> > > 2017.
> > > > > The
> > > > > > > > > > > > > build_windows.py infrastructure should easily work
> > with
> > > > the
> > > > > > new
> > > > > > > > > > > version.
> > > > > > > > > > > > > Currently some PRs are blocked by this:
> > > > > > > > > > > > >
> > https://github.com/apache/incubator-mxnet/issues/13958
> > > > > > > > > > > > > * Move Gluon Model zoo tests to nightly. Tracked at
> > > > > > > > > > > > >
> > https://github.com/apache/incubator-mxnet/issues/15295
> > > > > > > > > > > > > * Move non-python bindings tests to nightly. If a
> > > commit
> > > > is
> > > > > > > > > touching
> > > > > > > > > > > > other
> > > > > > > > > > > > > bindings, the reviewer should ask for a full run
> > which
> > > > can
> > > > > be
> > > > > > > > done
> > > > > > > > > > > > locally,
> > > > > > > > > > > > > use the label bot to trigger a full CI build, or
> > defer
> > > to
> > > > > > > > nightly.
> > > > > > > > > > > > > * Provide a couple of basic sanity performance
> tests
> > on
> > > > > small
> > > > > > > > > models
> > > > > > > > > > > that
> > > > > > > > > > > > > are run on CI and can be echoed by the label bot
> as a
> > > > > comment
> > > > > > > for
> > > > > > > > > > PRs.
> > > > > > > > > > > > > * Address unit tests that take more than 10-20s,
> > > > streamline
> > > > > > > them
> > > > > > > > or
> > > > > > > > > > > move
> > > > > > > > > > > > > them to nightly if it can't be done.
> > > > > > > > > > > > > * Open sourcing the remaining CI infrastructure
> > scripts
> > > > so
> > > > > > the
> > > > > > > > > > > community
> > > > > > > > > > > > > can contribute.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I think our goal should be turnaround under 30min.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I would also like to touch base with the community
> > that
> > > > > some
> > > > > > > PRs
> > > > > > > > > are
> > > > > > > > > > > not
> > > > > > > > > > > > > being followed up by committers asking for changes.
> > For
> > > > > > example
> > > > > > > > > this
> > > > > > > > > > PR
> > > > > > > > > > > > is
> > > > > > > > > > > > > importtant and is hanging for a long time.
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> https://github.com/apache/incubator-mxnet/pull/15051
> > > > > > > > > > > > >
> > > > > > > > > > > > > This is another, less important but more trivial to
> > > > review:
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> https://github.com/apache/incubator-mxnet/pull/14940
> > > > > > > > > > > > >
> > > > > > > > > > > > > I think comitters requesting changes and not
> > folllowing
> > > > up
> > > > > in
> > > > > > > > > > > reasonable
> > > > > > > > > > > > > time is not healthy for the project. I suggest
> > > > configuring
> > > > > > > github
> > > > > > > > > > > > > Notifications for a good SNR and following up.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Regards.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Pedro.
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: CI and PRs

Posted by Aaron Markham <aa...@gmail.com>.

The PRs Thomas and I are working on for the new docs and website share the
mxnet binary in the new CI pipelines we made. Speeds things up a lot.

On Wed, Aug 14, 2019, 18:16 Chris Olivier <cj...@gmail.com> wrote:

> I see it done daily now, and while I can’t share all the details, it’s not
> an incredibly complex thing, and involves not much more than nfs/efs
> sharing and remote ssh commands.  All it takes is a little ingenuity and
> some imagination.
>
> On Wed, Aug 14, 2019 at 4:31 PM Pedro Larroy <pedro.larroy.lists@gmail.com
> >
> wrote:
>
> > Sounds good in theory. I think there are complex details with regards of
> > resource sharing during parallel execution. Still I think both ways can
> be
> > explored. I think some tests run for unreasonably long times for what
> they
> > are doing. We already scale parts of the pipeline horizontally across
> > workers.
> >
> >
> > On Wed, Aug 14, 2019 at 5:12 PM Chris Olivier <cj...@apache.org>
> > wrote:
> >
> > > +1
> > >
> > > Rather than remove tests (which doesn’t scale as a solution), why not
> > scale
> > > them horizontally so that they finish more quickly? Across processes or
> > > even on a pool of machines that aren’t necessarily the build machine?
> > >
> > > On Wed, Aug 14, 2019 at 12:03 PM Marco de Abreu <
> marco.g.abreu@gmail.com
> > >
> > > wrote:
> > >
> > > > With regards to time I rather prefer us spending a bit more time on
> > > > maintenance than somebody running into an error that could've been
> > caught
> > > > with a test.
> > > >
> > > > I mean, our Publishing pipeline for Scala GPU has been broken for
> quite
> > > > some time now, but nobody noticed that. Basically my stance on that
> > > matter
> > > > is that as soon as something is not blocking, you can also just
> > > deactivate
> > > > it since you don't have a forcing function in an open source project.
> > > > People will rarely come back and fix the errors of some nightly test
> > that
> > > > they introduced.
> > > >
> > > > -Marco
> > > >
> > > > Carin Meier <ca...@gmail.com> schrieb am Mi., 14. Aug. 2019,
> > 21:59:
> > > >
> > > > > If a language binding test is failing for a not important reason,
> > then
> > > it
> > > > > is too brittle and needs to be fixed (we have fixed some of these
> > with
> > > > the
> > > > > Clojure package [1]).
> > > > > But in general, if we thinking of the MXNet project as one project
> > that
> > > > is
> > > > > across all the language bindings, then we want to know if some
> > > > fundamental
> > > > > code change is going to break a downstream package.
> > > > > I can't speak for all the high level package binding maintainers,
> but
> > > I'm
> > > > > always happy to pitch in to provide code fixes to help the base PR
> > get
> > > > > green.
> > > > >
> > > > > The time costs to maintain such a large CI project obviously needs
> to
> > > be
> > > > > considered as well.
> > > > >
> > > > > [1] https://github.com/apache/incubator-mxnet/pull/15579
> > > > >
> > > > > On Wed, Aug 14, 2019 at 3:48 PM Pedro Larroy <
> > > > pedro.larroy.lists@gmail.com
> > > > > >
> > > > > wrote:
> > > > >
> > > > > > From what I have seen Clojure is 15 minutes, which I think is
> > > > reasonable.
> > > > > > The only question is that when a binding such as R, Perl or
> Clojure
> > > > > fails,
> > > > > > some devs are a bit confused about how to fix them since they are
> > not
> > > > > > familiar with the testing tools and the language.
> > > > > >
> > > > > > On Wed, Aug 14, 2019 at 11:57 AM Carin Meier <
> carinmeier@gmail.com
> > >
> > > > > wrote:
> > > > > >
> > > > > > > Great idea Marco! Anything that you think would be valuable to
> > > share
> > > > > > would
> > > > > > > be good. The duration of each node in the test stage sounds
> like
> > a
> > > > good
> > > > > > > start.
> > > > > > >
> > > > > > > - Carin
> > > > > > >
> > > > > > > On Wed, Aug 14, 2019 at 2:48 PM Marco de Abreu <
> > > > > marco.g.abreu@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > we record a bunch of metrics about run statistics (down to
> the
> > > > > duration
> > > > > > > of
> > > > > > > > every individual step). If you tell me which ones you're
> > > > particularly
> > > > > > > > interested in (probably total duration of each node in the
> test
> > > > > stage),
> > > > > > > I'm
> > > > > > > > happy to provide them.
> > > > > > > >
> > > > > > > > Dimensions are (in hierarchical order):
> > > > > > > > - job
> > > > > > > > - branch
> > > > > > > > - stage
> > > > > > > > - node
> > > > > > > > - step
> > > > > > > >
> > > > > > > > Unfortunately I don't have the possibility to export them
> since
> > > we
> > > > > > store
> > > > > > > > them in CloudWatch Metrics which afaik doesn't offer raw
> > exports.
> > > > > > > >
> > > > > > > > Best regards,
> > > > > > > > Marco
> > > > > > > >
> > > > > > > > Carin Meier <ca...@gmail.com> schrieb am Mi., 14. Aug.
> > > 2019,
> > > > > > 19:43:
> > > > > > > >
> > > > > > > > > I would prefer to keep the language binding in the PR
> > process.
> > > > > > Perhaps
> > > > > > > we
> > > > > > > > > could do some analytics to see how much each of the
> language
> > > > > bindings
> > > > > > > is
> > > > > > > > > contributing to overall run time.
> > > > > > > > > If we have some metrics on that, maybe we can come up with
> a
> > > > > > guideline
> > > > > > > of
> > > > > > > > > how much time each should take. Another possibility is
> > leverage
> > > > the
> > > > > > > > > parallel builds more.
> > > > > > > > >
> > > > > > > > > On Wed, Aug 14, 2019 at 1:30 PM Pedro Larroy <
> > > > > > > > pedro.larroy.lists@gmail.com
> > > > > > > > > >
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi Carin.
> > > > > > > > > >
> > > > > > > > > > That's a good point, all things considered would your
> > > > preference
> > > > > be
> > > > > > > to
> > > > > > > > > keep
> > > > > > > > > > the Clojure tests as part of the PR process or in
> Nightly?
> > > > > > > > > > Some options are having notifications here or in slack.
> But
> > > if
> > > > we
> > > > > > > think
> > > > > > > > > > breakages would go unnoticed maybe is not a good idea to
> > > fully
> > > > > > remove
> > > > > > > > > > bindings from the PR process and just streamline the
> > process.
> > > > > > > > > >
> > > > > > > > > > Pedro.
> > > > > > > > > >
> > > > > > > > > > On Wed, Aug 14, 2019 at 5:09 AM Carin Meier <
> > > > > carinmeier@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Before any binding tests are moved to nightly, I think
> we
> > > > need
> > > > > to
> > > > > > > > > figure
> > > > > > > > > > > out how the community can get proper notifications of
> > > failure
> > > > > and
> > > > > > > > > success
> > > > > > > > > > > on those nightly runs. Otherwise, I think that
> breakages
> > > > would
> > > > > go
> > > > > > > > > > > unnoticed.
> > > > > > > > > > >
> > > > > > > > > > > -Carin
> > > > > > > > > > >
> > > > > > > > > > > On Tue, Aug 13, 2019 at 7:47 PM Pedro Larroy <
> > > > > > > > > > pedro.larroy.lists@gmail.com
> > > > > > > > > > > >
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi
> > > > > > > > > > > >
> > > > > > > > > > > > Seems we are hitting some problems in CI. I propose
> the
> > > > > > following
> > > > > > > > > > action
> > > > > > > > > > > > items to remedy the situation and accelerate turn
> > around
> > > > > times
> > > > > > in
> > > > > > > > CI,
> > > > > > > > > > > > reduce cost, complexity and probability of failure
> > > blocking
> > > > > PRs
> > > > > > > and
> > > > > > > > > > > > frustrating developers:
> > > > > > > > > > > >
> > > > > > > > > > > > * Upgrade Windows visual studio from VS 2015 to VS
> > 2017.
> > > > The
> > > > > > > > > > > > build_windows.py infrastructure should easily work
> with
> > > the
> > > > > new
> > > > > > > > > > version.
> > > > > > > > > > > > Currently some PRs are blocked by this:
> > > > > > > > > > > >
> https://github.com/apache/incubator-mxnet/issues/13958
> > > > > > > > > > > > * Move Gluon Model zoo tests to nightly. Tracked at
> > > > > > > > > > > >
> https://github.com/apache/incubator-mxnet/issues/15295
> > > > > > > > > > > > * Move non-python bindings tests to nightly. If a
> > commit
> > > is
> > > > > > > > touching
> > > > > > > > > > > other
> > > > > > > > > > > > bindings, the reviewer should ask for a full run
> which
> > > can
> > > > be
> > > > > > > done
> > > > > > > > > > > locally,
> > > > > > > > > > > > use the label bot to trigger a full CI build, or
> defer
> > to
> > > > > > > nightly.
> > > > > > > > > > > > * Provide a couple of basic sanity performance tests
> on
> > > > small
> > > > > > > > models
> > > > > > > > > > that
> > > > > > > > > > > > are run on CI and can be echoed by the label bot as a
> > > > comment
> > > > > > for
> > > > > > > > > PRs.
> > > > > > > > > > > > * Address unit tests that take more than 10-20s,
> > > streamline
> > > > > > them
> > > > > > > or
> > > > > > > > > > move
> > > > > > > > > > > > them to nightly if it can't be done.
> > > > > > > > > > > > * Open sourcing the remaining CI infrastructure
> scripts
> > > so
> > > > > the
> > > > > > > > > > community
> > > > > > > > > > > > can contribute.
> > > > > > > > > > > >
> > > > > > > > > > > > I think our goal should be turnaround under 30min.
> > > > > > > > > > > >
> > > > > > > > > > > > I would also like to touch base with the community
> that
> > > > some
> > > > > > PRs
> > > > > > > > are
> > > > > > > > > > not
> > > > > > > > > > > > being followed up by committers asking for changes.
> For
> > > > > example
> > > > > > > > this
> > > > > > > > > PR
> > > > > > > > > > > is
> > > > > > > > > > > > importtant and is hanging for a long time.
> > > > > > > > > > > >
> > > > > > > > > > > > https://github.com/apache/incubator-mxnet/pull/15051
> > > > > > > > > > > >
> > > > > > > > > > > > This is another, less important but more trivial to
> > > review:
> > > > > > > > > > > >
> > > > > > > > > > > > https://github.com/apache/incubator-mxnet/pull/14940
> > > > > > > > > > > >
> > > > > > > > > > > > I think comitters requesting changes and not
> folllowing
> > > up
> > > > in
> > > > > > > > > > reasonable
> > > > > > > > > > > > time is not healthy for the project. I suggest
> > > configuring
> > > > > > github
> > > > > > > > > > > > Notifications for a good SNR and following up.
> > > > > > > > > > > >
> > > > > > > > > > > > Regards.
> > > > > > > > > > > >
> > > > > > > > > > > > Pedro.
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: CI and PRs

Posted by Pedro Larroy <pe...@gmail.com>.

Hi Chris.
I suggest you send a PR to illustrate your proposal so we have a concrete
example to look into.
Pedro.

On Wed, Aug 14, 2019 at 6:16 PM Chris Olivier <cj...@gmail.com> wrote:

> I see it done daily now, and while I can’t share all the details, it’s not
> an incredibly complex thing, and involves not much more than nfs/efs
> sharing and remote ssh commands.  All it takes is a little ingenuity and
> some imagination.
>
> On Wed, Aug 14, 2019 at 4:31 PM Pedro Larroy <pedro.larroy.lists@gmail.com
> >
> wrote:
>
> > Sounds good in theory. I think there are complex details with regards of
> > resource sharing during parallel execution. Still I think both ways can
> be
> > explored. I think some tests run for unreasonably long times for what
> they
> > are doing. We already scale parts of the pipeline horizontally across
> > workers.
> >
> >
> > On Wed, Aug 14, 2019 at 5:12 PM Chris Olivier <cj...@apache.org>
> > wrote:
> >
> > > +1
> > >
> > > Rather than remove tests (which doesn’t scale as a solution), why not
> > scale
> > > them horizontally so that they finish more quickly? Across processes or
> > > even on a pool of machines that aren’t necessarily the build machine?
> > >
> > > On Wed, Aug 14, 2019 at 12:03 PM Marco de Abreu <
> marco.g.abreu@gmail.com
> > >
> > > wrote:
> > >
> > > > With regards to time I rather prefer us spending a bit more time on
> > > > maintenance than somebody running into an error that could've been
> > caught
> > > > with a test.
> > > >
> > > > I mean, our Publishing pipeline for Scala GPU has been broken for
> quite
> > > > some time now, but nobody noticed that. Basically my stance on that
> > > matter
> > > > is that as soon as something is not blocking, you can also just
> > > deactivate
> > > > it since you don't have a forcing function in an open source project.
> > > > People will rarely come back and fix the errors of some nightly test
> > that
> > > > they introduced.
> > > >
> > > > -Marco
> > > >
> > > > Carin Meier <ca...@gmail.com> schrieb am Mi., 14. Aug. 2019,
> > 21:59:
> > > >
> > > > > If a language binding test is failing for a not important reason,
> > then
> > > it
> > > > > is too brittle and needs to be fixed (we have fixed some of these
> > with
> > > > the
> > > > > Clojure package [1]).
> > > > > But in general, if we thinking of the MXNet project as one project
> > that
> > > > is
> > > > > across all the language bindings, then we want to know if some
> > > > fundamental
> > > > > code change is going to break a downstream package.
> > > > > I can't speak for all the high level package binding maintainers,
> but
> > > I'm
> > > > > always happy to pitch in to provide code fixes to help the base PR
> > get
> > > > > green.
> > > > >
> > > > > The time costs to maintain such a large CI project obviously needs
> to
> > > be
> > > > > considered as well.
> > > > >
> > > > > [1] https://github.com/apache/incubator-mxnet/pull/15579
> > > > >
> > > > > On Wed, Aug 14, 2019 at 3:48 PM Pedro Larroy <
> > > > pedro.larroy.lists@gmail.com
> > > > > >
> > > > > wrote:
> > > > >
> > > > > > From what I have seen Clojure is 15 minutes, which I think is
> > > > reasonable.
> > > > > > The only question is that when a binding such as R, Perl or
> Clojure
> > > > > fails,
> > > > > > some devs are a bit confused about how to fix them since they are
> > not
> > > > > > familiar with the testing tools and the language.
> > > > > >
> > > > > > On Wed, Aug 14, 2019 at 11:57 AM Carin Meier <
> carinmeier@gmail.com
> > >
> > > > > wrote:
> > > > > >
> > > > > > > Great idea Marco! Anything that you think would be valuable to
> > > share
> > > > > > would
> > > > > > > be good. The duration of each node in the test stage sounds
> like
> > a
> > > > good
> > > > > > > start.
> > > > > > >
> > > > > > > - Carin
> > > > > > >
> > > > > > > On Wed, Aug 14, 2019 at 2:48 PM Marco de Abreu <
> > > > > marco.g.abreu@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > we record a bunch of metrics about run statistics (down to
> the
> > > > > duration
> > > > > > > of
> > > > > > > > every individual step). If you tell me which ones you're
> > > > particularly
> > > > > > > > interested in (probably total duration of each node in the
> test
> > > > > stage),
> > > > > > > I'm
> > > > > > > > happy to provide them.
> > > > > > > >
> > > > > > > > Dimensions are (in hierarchical order):
> > > > > > > > - job
> > > > > > > > - branch
> > > > > > > > - stage
> > > > > > > > - node
> > > > > > > > - step
> > > > > > > >
> > > > > > > > Unfortunately I don't have the possibility to export them
> since
> > > we
> > > > > > store
> > > > > > > > them in CloudWatch Metrics which afaik doesn't offer raw
> > exports.
> > > > > > > >
> > > > > > > > Best regards,
> > > > > > > > Marco
> > > > > > > >
> > > > > > > > Carin Meier <ca...@gmail.com> schrieb am Mi., 14. Aug.
> > > 2019,
> > > > > > 19:43:
> > > > > > > >
> > > > > > > > > I would prefer to keep the language binding in the PR
> > process.
> > > > > > Perhaps
> > > > > > > we
> > > > > > > > > could do some analytics to see how much each of the
> language
> > > > > bindings
> > > > > > > is
> > > > > > > > > contributing to overall run time.
> > > > > > > > > If we have some metrics on that, maybe we can come up with
> a
> > > > > > guideline
> > > > > > > of
> > > > > > > > > how much time each should take. Another possibility is
> > leverage
> > > > the
> > > > > > > > > parallel builds more.
> > > > > > > > >
> > > > > > > > > On Wed, Aug 14, 2019 at 1:30 PM Pedro Larroy <
> > > > > > > > pedro.larroy.lists@gmail.com
> > > > > > > > > >
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi Carin.
> > > > > > > > > >
> > > > > > > > > > That's a good point, all things considered would your
> > > > preference
> > > > > be
> > > > > > > to
> > > > > > > > > keep
> > > > > > > > > > the Clojure tests as part of the PR process or in
> Nightly?
> > > > > > > > > > Some options are having notifications here or in slack.
> But
> > > if
> > > > we
> > > > > > > think
> > > > > > > > > > breakages would go unnoticed maybe is not a good idea to
> > > fully
> > > > > > remove
> > > > > > > > > > bindings from the PR process and just streamline the
> > process.
> > > > > > > > > >
> > > > > > > > > > Pedro.
> > > > > > > > > >
> > > > > > > > > > On Wed, Aug 14, 2019 at 5:09 AM Carin Meier <
> > > > > carinmeier@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Before any binding tests are moved to nightly, I think
> we
> > > > need
> > > > > to
> > > > > > > > > figure
> > > > > > > > > > > out how the community can get proper notifications of
> > > failure
> > > > > and
> > > > > > > > > success
> > > > > > > > > > > on those nightly runs. Otherwise, I think that
> breakages
> > > > would
> > > > > go
> > > > > > > > > > > unnoticed.
> > > > > > > > > > >
> > > > > > > > > > > -Carin
> > > > > > > > > > >
> > > > > > > > > > > On Tue, Aug 13, 2019 at 7:47 PM Pedro Larroy <
> > > > > > > > > > pedro.larroy.lists@gmail.com
> > > > > > > > > > > >
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi
> > > > > > > > > > > >
> > > > > > > > > > > > Seems we are hitting some problems in CI. I propose
> the
> > > > > > following
> > > > > > > > > > action
> > > > > > > > > > > > items to remedy the situation and accelerate turn
> > around
> > > > > times
> > > > > > in
> > > > > > > > CI,
> > > > > > > > > > > > reduce cost, complexity and probability of failure
> > > blocking
> > > > > PRs
> > > > > > > and
> > > > > > > > > > > > frustrating developers:
> > > > > > > > > > > >
> > > > > > > > > > > > * Upgrade Windows visual studio from VS 2015 to VS
> > 2017.
> > > > The
> > > > > > > > > > > > build_windows.py infrastructure should easily work
> with
> > > the
> > > > > new
> > > > > > > > > > version.
> > > > > > > > > > > > Currently some PRs are blocked by this:
> > > > > > > > > > > >
> https://github.com/apache/incubator-mxnet/issues/13958
> > > > > > > > > > > > * Move Gluon Model zoo tests to nightly. Tracked at
> > > > > > > > > > > >
> https://github.com/apache/incubator-mxnet/issues/15295
> > > > > > > > > > > > * Move non-python bindings tests to nightly. If a
> > commit
> > > is
> > > > > > > > touching
> > > > > > > > > > > other
> > > > > > > > > > > > bindings, the reviewer should ask for a full run
> which
> > > can
> > > > be
> > > > > > > done
> > > > > > > > > > > locally,
> > > > > > > > > > > > use the label bot to trigger a full CI build, or
> defer
> > to
> > > > > > > nightly.
> > > > > > > > > > > > * Provide a couple of basic sanity performance tests
> on
> > > > small
> > > > > > > > models
> > > > > > > > > > that
> > > > > > > > > > > > are run on CI and can be echoed by the label bot as a
> > > > comment
> > > > > > for
> > > > > > > > > PRs.
> > > > > > > > > > > > * Address unit tests that take more than 10-20s,
> > > streamline
> > > > > > them
> > > > > > > or
> > > > > > > > > > move
> > > > > > > > > > > > them to nightly if it can't be done.
> > > > > > > > > > > > * Open sourcing the remaining CI infrastructure
> scripts
> > > so
> > > > > the
> > > > > > > > > > community
> > > > > > > > > > > > can contribute.
> > > > > > > > > > > >
> > > > > > > > > > > > I think our goal should be turnaround under 30min.
> > > > > > > > > > > >
> > > > > > > > > > > > I would also like to touch base with the community
> that
> > > > some
> > > > > > PRs
> > > > > > > > are
> > > > > > > > > > not
> > > > > > > > > > > > being followed up by committers asking for changes.
> For
> > > > > example
> > > > > > > > this
> > > > > > > > > PR
> > > > > > > > > > > is
> > > > > > > > > > > > importtant and is hanging for a long time.
> > > > > > > > > > > >
> > > > > > > > > > > > https://github.com/apache/incubator-mxnet/pull/15051
> > > > > > > > > > > >
> > > > > > > > > > > > This is another, less important but more trivial to
> > > review:
> > > > > > > > > > > >
> > > > > > > > > > > > https://github.com/apache/incubator-mxnet/pull/14940
> > > > > > > > > > > >
> > > > > > > > > > > > I think comitters requesting changes and not
> folllowing
> > > up
> > > > in
> > > > > > > > > > reasonable
> > > > > > > > > > > > time is not healthy for the project. I suggest
> > > configuring
> > > > > > github
> > > > > > > > > > > > Notifications for a good SNR and following up.
> > > > > > > > > > > >
> > > > > > > > > > > > Regards.
> > > > > > > > > > > >
> > > > > > > > > > > > Pedro.
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: CI and PRs

Posted by Chris Olivier <cj...@gmail.com>.

I see it done daily now, and while I can’t share all the details, it’s not
an incredibly complex thing, and involves not much more than nfs/efs
sharing and remote ssh commands.  All it takes is a little ingenuity and
some imagination.

On Wed, Aug 14, 2019 at 4:31 PM Pedro Larroy <pe...@gmail.com>
wrote:

> Sounds good in theory. I think there are complex details with regards of
> resource sharing during parallel execution. Still I think both ways can be
> explored. I think some tests run for unreasonably long times for what they
> are doing. We already scale parts of the pipeline horizontally across
> workers.
>
>
> On Wed, Aug 14, 2019 at 5:12 PM Chris Olivier <cj...@apache.org>
> wrote:
>
> > +1
> >
> > Rather than remove tests (which doesn’t scale as a solution), why not
> scale
> > them horizontally so that they finish more quickly? Across processes or
> > even on a pool of machines that aren’t necessarily the build machine?
> >
> > On Wed, Aug 14, 2019 at 12:03 PM Marco de Abreu <marco.g.abreu@gmail.com
> >
> > wrote:
> >
> > > With regards to time I rather prefer us spending a bit more time on
> > > maintenance than somebody running into an error that could've been
> caught
> > > with a test.
> > >
> > > I mean, our Publishing pipeline for Scala GPU has been broken for quite
> > > some time now, but nobody noticed that. Basically my stance on that
> > matter
> > > is that as soon as something is not blocking, you can also just
> > deactivate
> > > it since you don't have a forcing function in an open source project.
> > > People will rarely come back and fix the errors of some nightly test
> that
> > > they introduced.
> > >
> > > -Marco
> > >
> > > Carin Meier <ca...@gmail.com> schrieb am Mi., 14. Aug. 2019,
> 21:59:
> > >
> > > > If a language binding test is failing for a not important reason,
> then
> > it
> > > > is too brittle and needs to be fixed (we have fixed some of these
> with
> > > the
> > > > Clojure package [1]).
> > > > But in general, if we thinking of the MXNet project as one project
> that
> > > is
> > > > across all the language bindings, then we want to know if some
> > > fundamental
> > > > code change is going to break a downstream package.
> > > > I can't speak for all the high level package binding maintainers, but
> > I'm
> > > > always happy to pitch in to provide code fixes to help the base PR
> get
> > > > green.
> > > >
> > > > The time costs to maintain such a large CI project obviously needs to
> > be
> > > > considered as well.
> > > >
> > > > [1] https://github.com/apache/incubator-mxnet/pull/15579
> > > >
> > > > On Wed, Aug 14, 2019 at 3:48 PM Pedro Larroy <
> > > pedro.larroy.lists@gmail.com
> > > > >
> > > > wrote:
> > > >
> > > > > From what I have seen Clojure is 15 minutes, which I think is
> > > reasonable.
> > > > > The only question is that when a binding such as R, Perl or Clojure
> > > > fails,
> > > > > some devs are a bit confused about how to fix them since they are
> not
> > > > > familiar with the testing tools and the language.
> > > > >
> > > > > On Wed, Aug 14, 2019 at 11:57 AM Carin Meier <carinmeier@gmail.com
> >
> > > > wrote:
> > > > >
> > > > > > Great idea Marco! Anything that you think would be valuable to
> > share
> > > > > would
> > > > > > be good. The duration of each node in the test stage sounds like
> a
> > > good
> > > > > > start.
> > > > > >
> > > > > > - Carin
> > > > > >
> > > > > > On Wed, Aug 14, 2019 at 2:48 PM Marco de Abreu <
> > > > marco.g.abreu@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > we record a bunch of metrics about run statistics (down to the
> > > > duration
> > > > > > of
> > > > > > > every individual step). If you tell me which ones you're
> > > particularly
> > > > > > > interested in (probably total duration of each node in the test
> > > > stage),
> > > > > > I'm
> > > > > > > happy to provide them.
> > > > > > >
> > > > > > > Dimensions are (in hierarchical order):
> > > > > > > - job
> > > > > > > - branch
> > > > > > > - stage
> > > > > > > - node
> > > > > > > - step
> > > > > > >
> > > > > > > Unfortunately I don't have the possibility to export them since
> > we
> > > > > store
> > > > > > > them in CloudWatch Metrics which afaik doesn't offer raw
> exports.
> > > > > > >
> > > > > > > Best regards,
> > > > > > > Marco
> > > > > > >
> > > > > > > Carin Meier <ca...@gmail.com> schrieb am Mi., 14. Aug.
> > 2019,
> > > > > 19:43:
> > > > > > >
> > > > > > > > I would prefer to keep the language binding in the PR
> process.
> > > > > Perhaps
> > > > > > we
> > > > > > > > could do some analytics to see how much each of the language
> > > > bindings
> > > > > > is
> > > > > > > > contributing to overall run time.
> > > > > > > > If we have some metrics on that, maybe we can come up with a
> > > > > guideline
> > > > > > of
> > > > > > > > how much time each should take. Another possibility is
> leverage
> > > the
> > > > > > > > parallel builds more.
> > > > > > > >
> > > > > > > > On Wed, Aug 14, 2019 at 1:30 PM Pedro Larroy <
> > > > > > > pedro.larroy.lists@gmail.com
> > > > > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi Carin.
> > > > > > > > >
> > > > > > > > > That's a good point, all things considered would your
> > > preference
> > > > be
> > > > > > to
> > > > > > > > keep
> > > > > > > > > the Clojure tests as part of the PR process or in Nightly?
> > > > > > > > > Some options are having notifications here or in slack. But
> > if
> > > we
> > > > > > think
> > > > > > > > > breakages would go unnoticed maybe is not a good idea to
> > fully
> > > > > remove
> > > > > > > > > bindings from the PR process and just streamline the
> process.
> > > > > > > > >
> > > > > > > > > Pedro.
> > > > > > > > >
> > > > > > > > > On Wed, Aug 14, 2019 at 5:09 AM Carin Meier <
> > > > carinmeier@gmail.com>
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Before any binding tests are moved to nightly, I think we
> > > need
> > > > to
> > > > > > > > figure
> > > > > > > > > > out how the community can get proper notifications of
> > failure
> > > > and
> > > > > > > > success
> > > > > > > > > > on those nightly runs. Otherwise, I think that breakages
> > > would
> > > > go
> > > > > > > > > > unnoticed.
> > > > > > > > > >
> > > > > > > > > > -Carin
> > > > > > > > > >
> > > > > > > > > > On Tue, Aug 13, 2019 at 7:47 PM Pedro Larroy <
> > > > > > > > > pedro.larroy.lists@gmail.com
> > > > > > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi
> > > > > > > > > > >
> > > > > > > > > > > Seems we are hitting some problems in CI. I propose the
> > > > > following
> > > > > > > > > action
> > > > > > > > > > > items to remedy the situation and accelerate turn
> around
> > > > times
> > > > > in
> > > > > > > CI,
> > > > > > > > > > > reduce cost, complexity and probability of failure
> > blocking
> > > > PRs
> > > > > > and
> > > > > > > > > > > frustrating developers:
> > > > > > > > > > >
> > > > > > > > > > > * Upgrade Windows visual studio from VS 2015 to VS
> 2017.
> > > The
> > > > > > > > > > > build_windows.py infrastructure should easily work with
> > the
> > > > new
> > > > > > > > > version.
> > > > > > > > > > > Currently some PRs are blocked by this:
> > > > > > > > > > > https://github.com/apache/incubator-mxnet/issues/13958
> > > > > > > > > > > * Move Gluon Model zoo tests to nightly. Tracked at
> > > > > > > > > > > https://github.com/apache/incubator-mxnet/issues/15295
> > > > > > > > > > > * Move non-python bindings tests to nightly. If a
> commit
> > is
> > > > > > > touching
> > > > > > > > > > other
> > > > > > > > > > > bindings, the reviewer should ask for a full run which
> > can
> > > be
> > > > > > done
> > > > > > > > > > locally,
> > > > > > > > > > > use the label bot to trigger a full CI build, or defer
> to
> > > > > > nightly.
> > > > > > > > > > > * Provide a couple of basic sanity performance tests on
> > > small
> > > > > > > models
> > > > > > > > > that
> > > > > > > > > > > are run on CI and can be echoed by the label bot as a
> > > comment
> > > > > for
> > > > > > > > PRs.
> > > > > > > > > > > * Address unit tests that take more than 10-20s,
> > streamline
> > > > > them
> > > > > > or
> > > > > > > > > move
> > > > > > > > > > > them to nightly if it can't be done.
> > > > > > > > > > > * Open sourcing the remaining CI infrastructure scripts
> > so
> > > > the
> > > > > > > > > community
> > > > > > > > > > > can contribute.
> > > > > > > > > > >
> > > > > > > > > > > I think our goal should be turnaround under 30min.
> > > > > > > > > > >
> > > > > > > > > > > I would also like to touch base with the community that
> > > some
> > > > > PRs
> > > > > > > are
> > > > > > > > > not
> > > > > > > > > > > being followed up by committers asking for changes. For
> > > > example
> > > > > > > this
> > > > > > > > PR
> > > > > > > > > > is
> > > > > > > > > > > importtant and is hanging for a long time.
> > > > > > > > > > >
> > > > > > > > > > > https://github.com/apache/incubator-mxnet/pull/15051
> > > > > > > > > > >
> > > > > > > > > > > This is another, less important but more trivial to
> > review:
> > > > > > > > > > >
> > > > > > > > > > > https://github.com/apache/incubator-mxnet/pull/14940
> > > > > > > > > > >
> > > > > > > > > > > I think comitters requesting changes and not folllowing
> > up
> > > in
> > > > > > > > > reasonable
> > > > > > > > > > > time is not healthy for the project. I suggest
> > configuring
> > > > > github
> > > > > > > > > > > Notifications for a good SNR and following up.
> > > > > > > > > > >
> > > > > > > > > > > Regards.
> > > > > > > > > > >
> > > > > > > > > > > Pedro.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: CI and PRs

Posted by Pedro Larroy <pe...@gmail.com>.

Sounds good in theory. I think there are complex details with regards of
resource sharing during parallel execution. Still I think both ways can be
explored. I think some tests run for unreasonably long times for what they
are doing. We already scale parts of the pipeline horizontally across
workers.


On Wed, Aug 14, 2019 at 5:12 PM Chris Olivier <cj...@apache.org>
wrote:

> +1
>
> Rather than remove tests (which doesn’t scale as a solution), why not scale
> them horizontally so that they finish more quickly? Across processes or
> even on a pool of machines that aren’t necessarily the build machine?
>
> On Wed, Aug 14, 2019 at 12:03 PM Marco de Abreu <ma...@gmail.com>
> wrote:
>
> > With regards to time I rather prefer us spending a bit more time on
> > maintenance than somebody running into an error that could've been caught
> > with a test.
> >
> > I mean, our Publishing pipeline for Scala GPU has been broken for quite
> > some time now, but nobody noticed that. Basically my stance on that
> matter
> > is that as soon as something is not blocking, you can also just
> deactivate
> > it since you don't have a forcing function in an open source project.
> > People will rarely come back and fix the errors of some nightly test that
> > they introduced.
> >
> > -Marco
> >
> > Carin Meier <ca...@gmail.com> schrieb am Mi., 14. Aug. 2019, 21:59:
> >
> > > If a language binding test is failing for a not important reason, then
> it
> > > is too brittle and needs to be fixed (we have fixed some of these with
> > the
> > > Clojure package [1]).
> > > But in general, if we thinking of the MXNet project as one project that
> > is
> > > across all the language bindings, then we want to know if some
> > fundamental
> > > code change is going to break a downstream package.
> > > I can't speak for all the high level package binding maintainers, but
> I'm
> > > always happy to pitch in to provide code fixes to help the base PR get
> > > green.
> > >
> > > The time costs to maintain such a large CI project obviously needs to
> be
> > > considered as well.
> > >
> > > [1] https://github.com/apache/incubator-mxnet/pull/15579
> > >
> > > On Wed, Aug 14, 2019 at 3:48 PM Pedro Larroy <
> > pedro.larroy.lists@gmail.com
> > > >
> > > wrote:
> > >
> > > > From what I have seen Clojure is 15 minutes, which I think is
> > reasonable.
> > > > The only question is that when a binding such as R, Perl or Clojure
> > > fails,
> > > > some devs are a bit confused about how to fix them since they are not
> > > > familiar with the testing tools and the language.
> > > >
> > > > On Wed, Aug 14, 2019 at 11:57 AM Carin Meier <ca...@gmail.com>
> > > wrote:
> > > >
> > > > > Great idea Marco! Anything that you think would be valuable to
> share
> > > > would
> > > > > be good. The duration of each node in the test stage sounds like a
> > good
> > > > > start.
> > > > >
> > > > > - Carin
> > > > >
> > > > > On Wed, Aug 14, 2019 at 2:48 PM Marco de Abreu <
> > > marco.g.abreu@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > we record a bunch of metrics about run statistics (down to the
> > > duration
> > > > > of
> > > > > > every individual step). If you tell me which ones you're
> > particularly
> > > > > > interested in (probably total duration of each node in the test
> > > stage),
> > > > > I'm
> > > > > > happy to provide them.
> > > > > >
> > > > > > Dimensions are (in hierarchical order):
> > > > > > - job
> > > > > > - branch
> > > > > > - stage
> > > > > > - node
> > > > > > - step
> > > > > >
> > > > > > Unfortunately I don't have the possibility to export them since
> we
> > > > store
> > > > > > them in CloudWatch Metrics which afaik doesn't offer raw exports.
> > > > > >
> > > > > > Best regards,
> > > > > > Marco
> > > > > >
> > > > > > Carin Meier <ca...@gmail.com> schrieb am Mi., 14. Aug.
> 2019,
> > > > 19:43:
> > > > > >
> > > > > > > I would prefer to keep the language binding in the PR process.
> > > > Perhaps
> > > > > we
> > > > > > > could do some analytics to see how much each of the language
> > > bindings
> > > > > is
> > > > > > > contributing to overall run time.
> > > > > > > If we have some metrics on that, maybe we can come up with a
> > > > guideline
> > > > > of
> > > > > > > how much time each should take. Another possibility is leverage
> > the
> > > > > > > parallel builds more.
> > > > > > >
> > > > > > > On Wed, Aug 14, 2019 at 1:30 PM Pedro Larroy <
> > > > > > pedro.larroy.lists@gmail.com
> > > > > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi Carin.
> > > > > > > >
> > > > > > > > That's a good point, all things considered would your
> > preference
> > > be
> > > > > to
> > > > > > > keep
> > > > > > > > the Clojure tests as part of the PR process or in Nightly?
> > > > > > > > Some options are having notifications here or in slack. But
> if
> > we
> > > > > think
> > > > > > > > breakages would go unnoticed maybe is not a good idea to
> fully
> > > > remove
> > > > > > > > bindings from the PR process and just streamline the process.
> > > > > > > >
> > > > > > > > Pedro.
> > > > > > > >
> > > > > > > > On Wed, Aug 14, 2019 at 5:09 AM Carin Meier <
> > > carinmeier@gmail.com>
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Before any binding tests are moved to nightly, I think we
> > need
> > > to
> > > > > > > figure
> > > > > > > > > out how the community can get proper notifications of
> failure
> > > and
> > > > > > > success
> > > > > > > > > on those nightly runs. Otherwise, I think that breakages
> > would
> > > go
> > > > > > > > > unnoticed.
> > > > > > > > >
> > > > > > > > > -Carin
> > > > > > > > >
> > > > > > > > > On Tue, Aug 13, 2019 at 7:47 PM Pedro Larroy <
> > > > > > > > pedro.larroy.lists@gmail.com
> > > > > > > > > >
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi
> > > > > > > > > >
> > > > > > > > > > Seems we are hitting some problems in CI. I propose the
> > > > following
> > > > > > > > action
> > > > > > > > > > items to remedy the situation and accelerate turn around
> > > times
> > > > in
> > > > > > CI,
> > > > > > > > > > reduce cost, complexity and probability of failure
> blocking
> > > PRs
> > > > > and
> > > > > > > > > > frustrating developers:
> > > > > > > > > >
> > > > > > > > > > * Upgrade Windows visual studio from VS 2015 to VS 2017.
> > The
> > > > > > > > > > build_windows.py infrastructure should easily work with
> the
> > > new
> > > > > > > > version.
> > > > > > > > > > Currently some PRs are blocked by this:
> > > > > > > > > > https://github.com/apache/incubator-mxnet/issues/13958
> > > > > > > > > > * Move Gluon Model zoo tests to nightly. Tracked at
> > > > > > > > > > https://github.com/apache/incubator-mxnet/issues/15295
> > > > > > > > > > * Move non-python bindings tests to nightly. If a commit
> is
> > > > > > touching
> > > > > > > > > other
> > > > > > > > > > bindings, the reviewer should ask for a full run which
> can
> > be
> > > > > done
> > > > > > > > > locally,
> > > > > > > > > > use the label bot to trigger a full CI build, or defer to
> > > > > nightly.
> > > > > > > > > > * Provide a couple of basic sanity performance tests on
> > small
> > > > > > models
> > > > > > > > that
> > > > > > > > > > are run on CI and can be echoed by the label bot as a
> > comment
> > > > for
> > > > > > > PRs.
> > > > > > > > > > * Address unit tests that take more than 10-20s,
> streamline
> > > > them
> > > > > or
> > > > > > > > move
> > > > > > > > > > them to nightly if it can't be done.
> > > > > > > > > > * Open sourcing the remaining CI infrastructure scripts
> so
> > > the
> > > > > > > > community
> > > > > > > > > > can contribute.
> > > > > > > > > >
> > > > > > > > > > I think our goal should be turnaround under 30min.
> > > > > > > > > >
> > > > > > > > > > I would also like to touch base with the community that
> > some
> > > > PRs
> > > > > > are
> > > > > > > > not
> > > > > > > > > > being followed up by committers asking for changes. For
> > > example
> > > > > > this
> > > > > > > PR
> > > > > > > > > is
> > > > > > > > > > importtant and is hanging for a long time.
> > > > > > > > > >
> > > > > > > > > > https://github.com/apache/incubator-mxnet/pull/15051
> > > > > > > > > >
> > > > > > > > > > This is another, less important but more trivial to
> review:
> > > > > > > > > >
> > > > > > > > > > https://github.com/apache/incubator-mxnet/pull/14940
> > > > > > > > > >
> > > > > > > > > > I think comitters requesting changes and not folllowing
> up
> > in
> > > > > > > > reasonable
> > > > > > > > > > time is not healthy for the project. I suggest
> configuring
> > > > github
> > > > > > > > > > Notifications for a good SNR and following up.
> > > > > > > > > >
> > > > > > > > > > Regards.
> > > > > > > > > >
> > > > > > > > > > Pedro.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: CI and PRs

Posted by Chris Olivier <cj...@apache.org>.

+1

Rather than remove tests (which doesn’t scale as a solution), why not scale
them horizontally so that they finish more quickly? Across processes or
even on a pool of machines that aren’t necessarily the build machine?

On Wed, Aug 14, 2019 at 12:03 PM Marco de Abreu <ma...@gmail.com>
wrote:

> With regards to time I rather prefer us spending a bit more time on
> maintenance than somebody running into an error that could've been caught
> with a test.
>
> I mean, our Publishing pipeline for Scala GPU has been broken for quite
> some time now, but nobody noticed that. Basically my stance on that matter
> is that as soon as something is not blocking, you can also just deactivate
> it since you don't have a forcing function in an open source project.
> People will rarely come back and fix the errors of some nightly test that
> they introduced.
>
> -Marco
>
> Carin Meier <ca...@gmail.com> schrieb am Mi., 14. Aug. 2019, 21:59:
>
> > If a language binding test is failing for a not important reason, then it
> > is too brittle and needs to be fixed (we have fixed some of these with
> the
> > Clojure package [1]).
> > But in general, if we thinking of the MXNet project as one project that
> is
> > across all the language bindings, then we want to know if some
> fundamental
> > code change is going to break a downstream package.
> > I can't speak for all the high level package binding maintainers, but I'm
> > always happy to pitch in to provide code fixes to help the base PR get
> > green.
> >
> > The time costs to maintain such a large CI project obviously needs to be
> > considered as well.
> >
> > [1] https://github.com/apache/incubator-mxnet/pull/15579
> >
> > On Wed, Aug 14, 2019 at 3:48 PM Pedro Larroy <
> pedro.larroy.lists@gmail.com
> > >
> > wrote:
> >
> > > From what I have seen Clojure is 15 minutes, which I think is
> reasonable.
> > > The only question is that when a binding such as R, Perl or Clojure
> > fails,
> > > some devs are a bit confused about how to fix them since they are not
> > > familiar with the testing tools and the language.
> > >
> > > On Wed, Aug 14, 2019 at 11:57 AM Carin Meier <ca...@gmail.com>
> > wrote:
> > >
> > > > Great idea Marco! Anything that you think would be valuable to share
> > > would
> > > > be good. The duration of each node in the test stage sounds like a
> good
> > > > start.
> > > >
> > > > - Carin
> > > >
> > > > On Wed, Aug 14, 2019 at 2:48 PM Marco de Abreu <
> > marco.g.abreu@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > we record a bunch of metrics about run statistics (down to the
> > duration
> > > > of
> > > > > every individual step). If you tell me which ones you're
> particularly
> > > > > interested in (probably total duration of each node in the test
> > stage),
> > > > I'm
> > > > > happy to provide them.
> > > > >
> > > > > Dimensions are (in hierarchical order):
> > > > > - job
> > > > > - branch
> > > > > - stage
> > > > > - node
> > > > > - step
> > > > >
> > > > > Unfortunately I don't have the possibility to export them since we
> > > store
> > > > > them in CloudWatch Metrics which afaik doesn't offer raw exports.
> > > > >
> > > > > Best regards,
> > > > > Marco
> > > > >
> > > > > Carin Meier <ca...@gmail.com> schrieb am Mi., 14. Aug. 2019,
> > > 19:43:
> > > > >
> > > > > > I would prefer to keep the language binding in the PR process.
> > > Perhaps
> > > > we
> > > > > > could do some analytics to see how much each of the language
> > bindings
> > > > is
> > > > > > contributing to overall run time.
> > > > > > If we have some metrics on that, maybe we can come up with a
> > > guideline
> > > > of
> > > > > > how much time each should take. Another possibility is leverage
> the
> > > > > > parallel builds more.
> > > > > >
> > > > > > On Wed, Aug 14, 2019 at 1:30 PM Pedro Larroy <
> > > > > pedro.larroy.lists@gmail.com
> > > > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Carin.
> > > > > > >
> > > > > > > That's a good point, all things considered would your
> preference
> > be
> > > > to
> > > > > > keep
> > > > > > > the Clojure tests as part of the PR process or in Nightly?
> > > > > > > Some options are having notifications here or in slack. But if
> we
> > > > think
> > > > > > > breakages would go unnoticed maybe is not a good idea to fully
> > > remove
> > > > > > > bindings from the PR process and just streamline the process.
> > > > > > >
> > > > > > > Pedro.
> > > > > > >
> > > > > > > On Wed, Aug 14, 2019 at 5:09 AM Carin Meier <
> > carinmeier@gmail.com>
> > > > > > wrote:
> > > > > > >
> > > > > > > > Before any binding tests are moved to nightly, I think we
> need
> > to
> > > > > > figure
> > > > > > > > out how the community can get proper notifications of failure
> > and
> > > > > > success
> > > > > > > > on those nightly runs. Otherwise, I think that breakages
> would
> > go
> > > > > > > > unnoticed.
> > > > > > > >
> > > > > > > > -Carin
> > > > > > > >
> > > > > > > > On Tue, Aug 13, 2019 at 7:47 PM Pedro Larroy <
> > > > > > > pedro.larroy.lists@gmail.com
> > > > > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi
> > > > > > > > >
> > > > > > > > > Seems we are hitting some problems in CI. I propose the
> > > following
> > > > > > > action
> > > > > > > > > items to remedy the situation and accelerate turn around
> > times
> > > in
> > > > > CI,
> > > > > > > > > reduce cost, complexity and probability of failure blocking
> > PRs
> > > > and
> > > > > > > > > frustrating developers:
> > > > > > > > >
> > > > > > > > > * Upgrade Windows visual studio from VS 2015 to VS 2017.
> The
> > > > > > > > > build_windows.py infrastructure should easily work with the
> > new
> > > > > > > version.
> > > > > > > > > Currently some PRs are blocked by this:
> > > > > > > > > https://github.com/apache/incubator-mxnet/issues/13958
> > > > > > > > > * Move Gluon Model zoo tests to nightly. Tracked at
> > > > > > > > > https://github.com/apache/incubator-mxnet/issues/15295
> > > > > > > > > * Move non-python bindings tests to nightly. If a commit is
> > > > > touching
> > > > > > > > other
> > > > > > > > > bindings, the reviewer should ask for a full run which can
> be
> > > > done
> > > > > > > > locally,
> > > > > > > > > use the label bot to trigger a full CI build, or defer to
> > > > nightly.
> > > > > > > > > * Provide a couple of basic sanity performance tests on
> small
> > > > > models
> > > > > > > that
> > > > > > > > > are run on CI and can be echoed by the label bot as a
> comment
> > > for
> > > > > > PRs.
> > > > > > > > > * Address unit tests that take more than 10-20s, streamline
> > > them
> > > > or
> > > > > > > move
> > > > > > > > > them to nightly if it can't be done.
> > > > > > > > > * Open sourcing the remaining CI infrastructure scripts so
> > the
> > > > > > > community
> > > > > > > > > can contribute.
> > > > > > > > >
> > > > > > > > > I think our goal should be turnaround under 30min.
> > > > > > > > >
> > > > > > > > > I would also like to touch base with the community that
> some
> > > PRs
> > > > > are
> > > > > > > not
> > > > > > > > > being followed up by committers asking for changes. For
> > example
> > > > > this
> > > > > > PR
> > > > > > > > is
> > > > > > > > > importtant and is hanging for a long time.
> > > > > > > > >
> > > > > > > > > https://github.com/apache/incubator-mxnet/pull/15051
> > > > > > > > >
> > > > > > > > > This is another, less important but more trivial to review:
> > > > > > > > >
> > > > > > > > > https://github.com/apache/incubator-mxnet/pull/14940
> > > > > > > > >
> > > > > > > > > I think comitters requesting changes and not folllowing up
> in
> > > > > > > reasonable
> > > > > > > > > time is not healthy for the project. I suggest configuring
> > > github
> > > > > > > > > Notifications for a good SNR and following up.
> > > > > > > > >
> > > > > > > > > Regards.
> > > > > > > > >
> > > > > > > > > Pedro.
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: CI and PRs

Posted by Marco de Abreu <ma...@gmail.com>.

With regards to time I rather prefer us spending a bit more time on
maintenance than somebody running into an error that could've been caught
with a test.

I mean, our Publishing pipeline for Scala GPU has been broken for quite
some time now, but nobody noticed that. Basically my stance on that matter
is that as soon as something is not blocking, you can also just deactivate
it since you don't have a forcing function in an open source project.
People will rarely come back and fix the errors of some nightly test that
they introduced.

-Marco

Carin Meier <ca...@gmail.com> schrieb am Mi., 14. Aug. 2019, 21:59:

> If a language binding test is failing for a not important reason, then it
> is too brittle and needs to be fixed (we have fixed some of these with the
> Clojure package [1]).
> But in general, if we thinking of the MXNet project as one project that is
> across all the language bindings, then we want to know if some fundamental
> code change is going to break a downstream package.
> I can't speak for all the high level package binding maintainers, but I'm
> always happy to pitch in to provide code fixes to help the base PR get
> green.
>
> The time costs to maintain such a large CI project obviously needs to be
> considered as well.
>
> [1] https://github.com/apache/incubator-mxnet/pull/15579
>
> On Wed, Aug 14, 2019 at 3:48 PM Pedro Larroy <pedro.larroy.lists@gmail.com
> >
> wrote:
>
> > From what I have seen Clojure is 15 minutes, which I think is reasonable.
> > The only question is that when a binding such as R, Perl or Clojure
> fails,
> > some devs are a bit confused about how to fix them since they are not
> > familiar with the testing tools and the language.
> >
> > On Wed, Aug 14, 2019 at 11:57 AM Carin Meier <ca...@gmail.com>
> wrote:
> >
> > > Great idea Marco! Anything that you think would be valuable to share
> > would
> > > be good. The duration of each node in the test stage sounds like a good
> > > start.
> > >
> > > - Carin
> > >
> > > On Wed, Aug 14, 2019 at 2:48 PM Marco de Abreu <
> marco.g.abreu@gmail.com>
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > we record a bunch of metrics about run statistics (down to the
> duration
> > > of
> > > > every individual step). If you tell me which ones you're particularly
> > > > interested in (probably total duration of each node in the test
> stage),
> > > I'm
> > > > happy to provide them.
> > > >
> > > > Dimensions are (in hierarchical order):
> > > > - job
> > > > - branch
> > > > - stage
> > > > - node
> > > > - step
> > > >
> > > > Unfortunately I don't have the possibility to export them since we
> > store
> > > > them in CloudWatch Metrics which afaik doesn't offer raw exports.
> > > >
> > > > Best regards,
> > > > Marco
> > > >
> > > > Carin Meier <ca...@gmail.com> schrieb am Mi., 14. Aug. 2019,
> > 19:43:
> > > >
> > > > > I would prefer to keep the language binding in the PR process.
> > Perhaps
> > > we
> > > > > could do some analytics to see how much each of the language
> bindings
> > > is
> > > > > contributing to overall run time.
> > > > > If we have some metrics on that, maybe we can come up with a
> > guideline
> > > of
> > > > > how much time each should take. Another possibility is leverage the
> > > > > parallel builds more.
> > > > >
> > > > > On Wed, Aug 14, 2019 at 1:30 PM Pedro Larroy <
> > > > pedro.larroy.lists@gmail.com
> > > > > >
> > > > > wrote:
> > > > >
> > > > > > Hi Carin.
> > > > > >
> > > > > > That's a good point, all things considered would your preference
> be
> > > to
> > > > > keep
> > > > > > the Clojure tests as part of the PR process or in Nightly?
> > > > > > Some options are having notifications here or in slack. But if we
> > > think
> > > > > > breakages would go unnoticed maybe is not a good idea to fully
> > remove
> > > > > > bindings from the PR process and just streamline the process.
> > > > > >
> > > > > > Pedro.
> > > > > >
> > > > > > On Wed, Aug 14, 2019 at 5:09 AM Carin Meier <
> carinmeier@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > > > Before any binding tests are moved to nightly, I think we need
> to
> > > > > figure
> > > > > > > out how the community can get proper notifications of failure
> and
> > > > > success
> > > > > > > on those nightly runs. Otherwise, I think that breakages would
> go
> > > > > > > unnoticed.
> > > > > > >
> > > > > > > -Carin
> > > > > > >
> > > > > > > On Tue, Aug 13, 2019 at 7:47 PM Pedro Larroy <
> > > > > > pedro.larroy.lists@gmail.com
> > > > > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi
> > > > > > > >
> > > > > > > > Seems we are hitting some problems in CI. I propose the
> > following
> > > > > > action
> > > > > > > > items to remedy the situation and accelerate turn around
> times
> > in
> > > > CI,
> > > > > > > > reduce cost, complexity and probability of failure blocking
> PRs
> > > and
> > > > > > > > frustrating developers:
> > > > > > > >
> > > > > > > > * Upgrade Windows visual studio from VS 2015 to VS 2017. The
> > > > > > > > build_windows.py infrastructure should easily work with the
> new
> > > > > > version.
> > > > > > > > Currently some PRs are blocked by this:
> > > > > > > > https://github.com/apache/incubator-mxnet/issues/13958
> > > > > > > > * Move Gluon Model zoo tests to nightly. Tracked at
> > > > > > > > https://github.com/apache/incubator-mxnet/issues/15295
> > > > > > > > * Move non-python bindings tests to nightly. If a commit is
> > > > touching
> > > > > > > other
> > > > > > > > bindings, the reviewer should ask for a full run which can be
> > > done
> > > > > > > locally,
> > > > > > > > use the label bot to trigger a full CI build, or defer to
> > > nightly.
> > > > > > > > * Provide a couple of basic sanity performance tests on small
> > > > models
> > > > > > that
> > > > > > > > are run on CI and can be echoed by the label bot as a comment
> > for
> > > > > PRs.
> > > > > > > > * Address unit tests that take more than 10-20s, streamline
> > them
> > > or
> > > > > > move
> > > > > > > > them to nightly if it can't be done.
> > > > > > > > * Open sourcing the remaining CI infrastructure scripts so
> the
> > > > > > community
> > > > > > > > can contribute.
> > > > > > > >
> > > > > > > > I think our goal should be turnaround under 30min.
> > > > > > > >
> > > > > > > > I would also like to touch base with the community that some
> > PRs
> > > > are
> > > > > > not
> > > > > > > > being followed up by committers asking for changes. For
> example
> > > > this
> > > > > PR
> > > > > > > is
> > > > > > > > importtant and is hanging for a long time.
> > > > > > > >
> > > > > > > > https://github.com/apache/incubator-mxnet/pull/15051
> > > > > > > >
> > > > > > > > This is another, less important but more trivial to review:
> > > > > > > >
> > > > > > > > https://github.com/apache/incubator-mxnet/pull/14940
> > > > > > > >
> > > > > > > > I think comitters requesting changes and not folllowing up in
> > > > > > reasonable
> > > > > > > > time is not healthy for the project. I suggest configuring
> > github
> > > > > > > > Notifications for a good SNR and following up.
> > > > > > > >
> > > > > > > > Regards.
> > > > > > > >
> > > > > > > > Pedro.
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: CI and PRs

Posted by Carin Meier <ca...@gmail.com>.

If a language binding test is failing for a not important reason, then it
is too brittle and needs to be fixed (we have fixed some of these with the
Clojure package [1]).
But in general, if we thinking of the MXNet project as one project that is
across all the language bindings, then we want to know if some fundamental
code change is going to break a downstream package.
I can't speak for all the high level package binding maintainers, but I'm
always happy to pitch in to provide code fixes to help the base PR get
green.

The time costs to maintain such a large CI project obviously needs to be
considered as well.

[1] https://github.com/apache/incubator-mxnet/pull/15579

On Wed, Aug 14, 2019 at 3:48 PM Pedro Larroy <pe...@gmail.com>
wrote:

> From what I have seen Clojure is 15 minutes, which I think is reasonable.
> The only question is that when a binding such as R, Perl or Clojure fails,
> some devs are a bit confused about how to fix them since they are not
> familiar with the testing tools and the language.
>
> On Wed, Aug 14, 2019 at 11:57 AM Carin Meier <ca...@gmail.com> wrote:
>
> > Great idea Marco! Anything that you think would be valuable to share
> would
> > be good. The duration of each node in the test stage sounds like a good
> > start.
> >
> > - Carin
> >
> > On Wed, Aug 14, 2019 at 2:48 PM Marco de Abreu <ma...@gmail.com>
> > wrote:
> >
> > > Hi,
> > >
> > > we record a bunch of metrics about run statistics (down to the duration
> > of
> > > every individual step). If you tell me which ones you're particularly
> > > interested in (probably total duration of each node in the test stage),
> > I'm
> > > happy to provide them.
> > >
> > > Dimensions are (in hierarchical order):
> > > - job
> > > - branch
> > > - stage
> > > - node
> > > - step
> > >
> > > Unfortunately I don't have the possibility to export them since we
> store
> > > them in CloudWatch Metrics which afaik doesn't offer raw exports.
> > >
> > > Best regards,
> > > Marco
> > >
> > > Carin Meier <ca...@gmail.com> schrieb am Mi., 14. Aug. 2019,
> 19:43:
> > >
> > > > I would prefer to keep the language binding in the PR process.
> Perhaps
> > we
> > > > could do some analytics to see how much each of the language bindings
> > is
> > > > contributing to overall run time.
> > > > If we have some metrics on that, maybe we can come up with a
> guideline
> > of
> > > > how much time each should take. Another possibility is leverage the
> > > > parallel builds more.
> > > >
> > > > On Wed, Aug 14, 2019 at 1:30 PM Pedro Larroy <
> > > pedro.larroy.lists@gmail.com
> > > > >
> > > > wrote:
> > > >
> > > > > Hi Carin.
> > > > >
> > > > > That's a good point, all things considered would your preference be
> > to
> > > > keep
> > > > > the Clojure tests as part of the PR process or in Nightly?
> > > > > Some options are having notifications here or in slack. But if we
> > think
> > > > > breakages would go unnoticed maybe is not a good idea to fully
> remove
> > > > > bindings from the PR process and just streamline the process.
> > > > >
> > > > > Pedro.
> > > > >
> > > > > On Wed, Aug 14, 2019 at 5:09 AM Carin Meier <ca...@gmail.com>
> > > > wrote:
> > > > >
> > > > > > Before any binding tests are moved to nightly, I think we need to
> > > > figure
> > > > > > out how the community can get proper notifications of failure and
> > > > success
> > > > > > on those nightly runs. Otherwise, I think that breakages would go
> > > > > > unnoticed.
> > > > > >
> > > > > > -Carin
> > > > > >
> > > > > > On Tue, Aug 13, 2019 at 7:47 PM Pedro Larroy <
> > > > > pedro.larroy.lists@gmail.com
> > > > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Hi
> > > > > > >
> > > > > > > Seems we are hitting some problems in CI. I propose the
> following
> > > > > action
> > > > > > > items to remedy the situation and accelerate turn around times
> in
> > > CI,
> > > > > > > reduce cost, complexity and probability of failure blocking PRs
> > and
> > > > > > > frustrating developers:
> > > > > > >
> > > > > > > * Upgrade Windows visual studio from VS 2015 to VS 2017. The
> > > > > > > build_windows.py infrastructure should easily work with the new
> > > > > version.
> > > > > > > Currently some PRs are blocked by this:
> > > > > > > https://github.com/apache/incubator-mxnet/issues/13958
> > > > > > > * Move Gluon Model zoo tests to nightly. Tracked at
> > > > > > > https://github.com/apache/incubator-mxnet/issues/15295
> > > > > > > * Move non-python bindings tests to nightly. If a commit is
> > > touching
> > > > > > other
> > > > > > > bindings, the reviewer should ask for a full run which can be
> > done
> > > > > > locally,
> > > > > > > use the label bot to trigger a full CI build, or defer to
> > nightly.
> > > > > > > * Provide a couple of basic sanity performance tests on small
> > > models
> > > > > that
> > > > > > > are run on CI and can be echoed by the label bot as a comment
> for
> > > > PRs.
> > > > > > > * Address unit tests that take more than 10-20s, streamline
> them
> > or
> > > > > move
> > > > > > > them to nightly if it can't be done.
> > > > > > > * Open sourcing the remaining CI infrastructure scripts so the
> > > > > community
> > > > > > > can contribute.
> > > > > > >
> > > > > > > I think our goal should be turnaround under 30min.
> > > > > > >
> > > > > > > I would also like to touch base with the community that some
> PRs
> > > are
> > > > > not
> > > > > > > being followed up by committers asking for changes. For example
> > > this
> > > > PR
> > > > > > is
> > > > > > > importtant and is hanging for a long time.
> > > > > > >
> > > > > > > https://github.com/apache/incubator-mxnet/pull/15051
> > > > > > >
> > > > > > > This is another, less important but more trivial to review:
> > > > > > >
> > > > > > > https://github.com/apache/incubator-mxnet/pull/14940
> > > > > > >
> > > > > > > I think comitters requesting changes and not folllowing up in
> > > > > reasonable
> > > > > > > time is not healthy for the project. I suggest configuring
> github
> > > > > > > Notifications for a good SNR and following up.
> > > > > > >
> > > > > > > Regards.
> > > > > > >
> > > > > > > Pedro.
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: CI and PRs

Posted by Pedro Larroy <pe...@gmail.com>.

From what I have seen Clojure is 15 minutes, which I think is reasonable.
The only question is that when a binding such as R, Perl or Clojure fails,
some devs are a bit confused about how to fix them since they are not
familiar with the testing tools and the language.

On Wed, Aug 14, 2019 at 11:57 AM Carin Meier <ca...@gmail.com> wrote:

> Great idea Marco! Anything that you think would be valuable to share would
> be good. The duration of each node in the test stage sounds like a good
> start.
>
> - Carin
>
> On Wed, Aug 14, 2019 at 2:48 PM Marco de Abreu <ma...@gmail.com>
> wrote:
>
> > Hi,
> >
> > we record a bunch of metrics about run statistics (down to the duration
> of
> > every individual step). If you tell me which ones you're particularly
> > interested in (probably total duration of each node in the test stage),
> I'm
> > happy to provide them.
> >
> > Dimensions are (in hierarchical order):
> > - job
> > - branch
> > - stage
> > - node
> > - step
> >
> > Unfortunately I don't have the possibility to export them since we store
> > them in CloudWatch Metrics which afaik doesn't offer raw exports.
> >
> > Best regards,
> > Marco
> >
> > Carin Meier <ca...@gmail.com> schrieb am Mi., 14. Aug. 2019, 19:43:
> >
> > > I would prefer to keep the language binding in the PR process. Perhaps
> we
> > > could do some analytics to see how much each of the language bindings
> is
> > > contributing to overall run time.
> > > If we have some metrics on that, maybe we can come up with a guideline
> of
> > > how much time each should take. Another possibility is leverage the
> > > parallel builds more.
> > >
> > > On Wed, Aug 14, 2019 at 1:30 PM Pedro Larroy <
> > pedro.larroy.lists@gmail.com
> > > >
> > > wrote:
> > >
> > > > Hi Carin.
> > > >
> > > > That's a good point, all things considered would your preference be
> to
> > > keep
> > > > the Clojure tests as part of the PR process or in Nightly?
> > > > Some options are having notifications here or in slack. But if we
> think
> > > > breakages would go unnoticed maybe is not a good idea to fully remove
> > > > bindings from the PR process and just streamline the process.
> > > >
> > > > Pedro.
> > > >
> > > > On Wed, Aug 14, 2019 at 5:09 AM Carin Meier <ca...@gmail.com>
> > > wrote:
> > > >
> > > > > Before any binding tests are moved to nightly, I think we need to
> > > figure
> > > > > out how the community can get proper notifications of failure and
> > > success
> > > > > on those nightly runs. Otherwise, I think that breakages would go
> > > > > unnoticed.
> > > > >
> > > > > -Carin
> > > > >
> > > > > On Tue, Aug 13, 2019 at 7:47 PM Pedro Larroy <
> > > > pedro.larroy.lists@gmail.com
> > > > > >
> > > > > wrote:
> > > > >
> > > > > > Hi
> > > > > >
> > > > > > Seems we are hitting some problems in CI. I propose the following
> > > > action
> > > > > > items to remedy the situation and accelerate turn around times in
> > CI,
> > > > > > reduce cost, complexity and probability of failure blocking PRs
> and
> > > > > > frustrating developers:
> > > > > >
> > > > > > * Upgrade Windows visual studio from VS 2015 to VS 2017. The
> > > > > > build_windows.py infrastructure should easily work with the new
> > > > version.
> > > > > > Currently some PRs are blocked by this:
> > > > > > https://github.com/apache/incubator-mxnet/issues/13958
> > > > > > * Move Gluon Model zoo tests to nightly. Tracked at
> > > > > > https://github.com/apache/incubator-mxnet/issues/15295
> > > > > > * Move non-python bindings tests to nightly. If a commit is
> > touching
> > > > > other
> > > > > > bindings, the reviewer should ask for a full run which can be
> done
> > > > > locally,
> > > > > > use the label bot to trigger a full CI build, or defer to
> nightly.
> > > > > > * Provide a couple of basic sanity performance tests on small
> > models
> > > > that
> > > > > > are run on CI and can be echoed by the label bot as a comment for
> > > PRs.
> > > > > > * Address unit tests that take more than 10-20s, streamline them
> or
> > > > move
> > > > > > them to nightly if it can't be done.
> > > > > > * Open sourcing the remaining CI infrastructure scripts so the
> > > > community
> > > > > > can contribute.
> > > > > >
> > > > > > I think our goal should be turnaround under 30min.
> > > > > >
> > > > > > I would also like to touch base with the community that some PRs
> > are
> > > > not
> > > > > > being followed up by committers asking for changes. For example
> > this
> > > PR
> > > > > is
> > > > > > importtant and is hanging for a long time.
> > > > > >
> > > > > > https://github.com/apache/incubator-mxnet/pull/15051
> > > > > >
> > > > > > This is another, less important but more trivial to review:
> > > > > >
> > > > > > https://github.com/apache/incubator-mxnet/pull/14940
> > > > > >
> > > > > > I think comitters requesting changes and not folllowing up in
> > > > reasonable
> > > > > > time is not healthy for the project. I suggest configuring github
> > > > > > Notifications for a good SNR and following up.
> > > > > >
> > > > > > Regards.
> > > > > >
> > > > > > Pedro.
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: CI and PRs

Posted by Carin Meier <ca...@gmail.com>.

Great idea Marco! Anything that you think would be valuable to share would
be good. The duration of each node in the test stage sounds like a good
start.

- Carin

On Wed, Aug 14, 2019 at 2:48 PM Marco de Abreu <ma...@gmail.com>
wrote:

> Hi,
>
> we record a bunch of metrics about run statistics (down to the duration of
> every individual step). If you tell me which ones you're particularly
> interested in (probably total duration of each node in the test stage), I'm
> happy to provide them.
>
> Dimensions are (in hierarchical order):
> - job
> - branch
> - stage
> - node
> - step
>
> Unfortunately I don't have the possibility to export them since we store
> them in CloudWatch Metrics which afaik doesn't offer raw exports.
>
> Best regards,
> Marco
>
> Carin Meier <ca...@gmail.com> schrieb am Mi., 14. Aug. 2019, 19:43:
>
> > I would prefer to keep the language binding in the PR process. Perhaps we
> > could do some analytics to see how much each of the language bindings is
> > contributing to overall run time.
> > If we have some metrics on that, maybe we can come up with a guideline of
> > how much time each should take. Another possibility is leverage the
> > parallel builds more.
> >
> > On Wed, Aug 14, 2019 at 1:30 PM Pedro Larroy <
> pedro.larroy.lists@gmail.com
> > >
> > wrote:
> >
> > > Hi Carin.
> > >
> > > That's a good point, all things considered would your preference be to
> > keep
> > > the Clojure tests as part of the PR process or in Nightly?
> > > Some options are having notifications here or in slack. But if we think
> > > breakages would go unnoticed maybe is not a good idea to fully remove
> > > bindings from the PR process and just streamline the process.
> > >
> > > Pedro.
> > >
> > > On Wed, Aug 14, 2019 at 5:09 AM Carin Meier <ca...@gmail.com>
> > wrote:
> > >
> > > > Before any binding tests are moved to nightly, I think we need to
> > figure
> > > > out how the community can get proper notifications of failure and
> > success
> > > > on those nightly runs. Otherwise, I think that breakages would go
> > > > unnoticed.
> > > >
> > > > -Carin
> > > >
> > > > On Tue, Aug 13, 2019 at 7:47 PM Pedro Larroy <
> > > pedro.larroy.lists@gmail.com
> > > > >
> > > > wrote:
> > > >
> > > > > Hi
> > > > >
> > > > > Seems we are hitting some problems in CI. I propose the following
> > > action
> > > > > items to remedy the situation and accelerate turn around times in
> CI,
> > > > > reduce cost, complexity and probability of failure blocking PRs and
> > > > > frustrating developers:
> > > > >
> > > > > * Upgrade Windows visual studio from VS 2015 to VS 2017. The
> > > > > build_windows.py infrastructure should easily work with the new
> > > version.
> > > > > Currently some PRs are blocked by this:
> > > > > https://github.com/apache/incubator-mxnet/issues/13958
> > > > > * Move Gluon Model zoo tests to nightly. Tracked at
> > > > > https://github.com/apache/incubator-mxnet/issues/15295
> > > > > * Move non-python bindings tests to nightly. If a commit is
> touching
> > > > other
> > > > > bindings, the reviewer should ask for a full run which can be done
> > > > locally,
> > > > > use the label bot to trigger a full CI build, or defer to nightly.
> > > > > * Provide a couple of basic sanity performance tests on small
> models
> > > that
> > > > > are run on CI and can be echoed by the label bot as a comment for
> > PRs.
> > > > > * Address unit tests that take more than 10-20s, streamline them or
> > > move
> > > > > them to nightly if it can't be done.
> > > > > * Open sourcing the remaining CI infrastructure scripts so the
> > > community
> > > > > can contribute.
> > > > >
> > > > > I think our goal should be turnaround under 30min.
> > > > >
> > > > > I would also like to touch base with the community that some PRs
> are
> > > not
> > > > > being followed up by committers asking for changes. For example
> this
> > PR
> > > > is
> > > > > importtant and is hanging for a long time.
> > > > >
> > > > > https://github.com/apache/incubator-mxnet/pull/15051
> > > > >
> > > > > This is another, less important but more trivial to review:
> > > > >
> > > > > https://github.com/apache/incubator-mxnet/pull/14940
> > > > >
> > > > > I think comitters requesting changes and not folllowing up in
> > > reasonable
> > > > > time is not healthy for the project. I suggest configuring github
> > > > > Notifications for a good SNR and following up.
> > > > >
> > > > > Regards.
> > > > >
> > > > > Pedro.
> > > > >
> > > >
> > >
> >
>

Re: CI and PRs

Posted by Marco de Abreu <ma...@gmail.com>.

Hi,

we record a bunch of metrics about run statistics (down to the duration of
every individual step). If you tell me which ones you're particularly
interested in (probably total duration of each node in the test stage), I'm
happy to provide them.

Dimensions are (in hierarchical order):
- job
- branch
- stage
- node
- step

Unfortunately I don't have the possibility to export them since we store
them in CloudWatch Metrics which afaik doesn't offer raw exports.

Best regards,
Marco

Carin Meier <ca...@gmail.com> schrieb am Mi., 14. Aug. 2019, 19:43:

> I would prefer to keep the language binding in the PR process. Perhaps we
> could do some analytics to see how much each of the language bindings is
> contributing to overall run time.
> If we have some metrics on that, maybe we can come up with a guideline of
> how much time each should take. Another possibility is leverage the
> parallel builds more.
>
> On Wed, Aug 14, 2019 at 1:30 PM Pedro Larroy <pedro.larroy.lists@gmail.com
> >
> wrote:
>
> > Hi Carin.
> >
> > That's a good point, all things considered would your preference be to
> keep
> > the Clojure tests as part of the PR process or in Nightly?
> > Some options are having notifications here or in slack. But if we think
> > breakages would go unnoticed maybe is not a good idea to fully remove
> > bindings from the PR process and just streamline the process.
> >
> > Pedro.
> >
> > On Wed, Aug 14, 2019 at 5:09 AM Carin Meier <ca...@gmail.com>
> wrote:
> >
> > > Before any binding tests are moved to nightly, I think we need to
> figure
> > > out how the community can get proper notifications of failure and
> success
> > > on those nightly runs. Otherwise, I think that breakages would go
> > > unnoticed.
> > >
> > > -Carin
> > >
> > > On Tue, Aug 13, 2019 at 7:47 PM Pedro Larroy <
> > pedro.larroy.lists@gmail.com
> > > >
> > > wrote:
> > >
> > > > Hi
> > > >
> > > > Seems we are hitting some problems in CI. I propose the following
> > action
> > > > items to remedy the situation and accelerate turn around times in CI,
> > > > reduce cost, complexity and probability of failure blocking PRs and
> > > > frustrating developers:
> > > >
> > > > * Upgrade Windows visual studio from VS 2015 to VS 2017. The
> > > > build_windows.py infrastructure should easily work with the new
> > version.
> > > > Currently some PRs are blocked by this:
> > > > https://github.com/apache/incubator-mxnet/issues/13958
> > > > * Move Gluon Model zoo tests to nightly. Tracked at
> > > > https://github.com/apache/incubator-mxnet/issues/15295
> > > > * Move non-python bindings tests to nightly. If a commit is touching
> > > other
> > > > bindings, the reviewer should ask for a full run which can be done
> > > locally,
> > > > use the label bot to trigger a full CI build, or defer to nightly.
> > > > * Provide a couple of basic sanity performance tests on small models
> > that
> > > > are run on CI and can be echoed by the label bot as a comment for
> PRs.
> > > > * Address unit tests that take more than 10-20s, streamline them or
> > move
> > > > them to nightly if it can't be done.
> > > > * Open sourcing the remaining CI infrastructure scripts so the
> > community
> > > > can contribute.
> > > >
> > > > I think our goal should be turnaround under 30min.
> > > >
> > > > I would also like to touch base with the community that some PRs are
> > not
> > > > being followed up by committers asking for changes. For example this
> PR
> > > is
> > > > importtant and is hanging for a long time.
> > > >
> > > > https://github.com/apache/incubator-mxnet/pull/15051
> > > >
> > > > This is another, less important but more trivial to review:
> > > >
> > > > https://github.com/apache/incubator-mxnet/pull/14940
> > > >
> > > > I think comitters requesting changes and not folllowing up in
> > reasonable
> > > > time is not healthy for the project. I suggest configuring github
> > > > Notifications for a good SNR and following up.
> > > >
> > > > Regards.
> > > >
> > > > Pedro.
> > > >
> > >
> >
>

Re: CI and PRs

Posted by Carin Meier <ca...@gmail.com>.

I would prefer to keep the language binding in the PR process. Perhaps we
could do some analytics to see how much each of the language bindings is
contributing to overall run time.
If we have some metrics on that, maybe we can come up with a guideline of
how much time each should take. Another possibility is leverage the
parallel builds more.

On Wed, Aug 14, 2019 at 1:30 PM Pedro Larroy <pe...@gmail.com>
wrote:

> Hi Carin.
>
> That's a good point, all things considered would your preference be to keep
> the Clojure tests as part of the PR process or in Nightly?
> Some options are having notifications here or in slack. But if we think
> breakages would go unnoticed maybe is not a good idea to fully remove
> bindings from the PR process and just streamline the process.
>
> Pedro.
>
> On Wed, Aug 14, 2019 at 5:09 AM Carin Meier <ca...@gmail.com> wrote:
>
> > Before any binding tests are moved to nightly, I think we need to figure
> > out how the community can get proper notifications of failure and success
> > on those nightly runs. Otherwise, I think that breakages would go
> > unnoticed.
> >
> > -Carin
> >
> > On Tue, Aug 13, 2019 at 7:47 PM Pedro Larroy <
> pedro.larroy.lists@gmail.com
> > >
> > wrote:
> >
> > > Hi
> > >
> > > Seems we are hitting some problems in CI. I propose the following
> action
> > > items to remedy the situation and accelerate turn around times in CI,
> > > reduce cost, complexity and probability of failure blocking PRs and
> > > frustrating developers:
> > >
> > > * Upgrade Windows visual studio from VS 2015 to VS 2017. The
> > > build_windows.py infrastructure should easily work with the new
> version.
> > > Currently some PRs are blocked by this:
> > > https://github.com/apache/incubator-mxnet/issues/13958
> > > * Move Gluon Model zoo tests to nightly. Tracked at
> > > https://github.com/apache/incubator-mxnet/issues/15295
> > > * Move non-python bindings tests to nightly. If a commit is touching
> > other
> > > bindings, the reviewer should ask for a full run which can be done
> > locally,
> > > use the label bot to trigger a full CI build, or defer to nightly.
> > > * Provide a couple of basic sanity performance tests on small models
> that
> > > are run on CI and can be echoed by the label bot as a comment for PRs.
> > > * Address unit tests that take more than 10-20s, streamline them or
> move
> > > them to nightly if it can't be done.
> > > * Open sourcing the remaining CI infrastructure scripts so the
> community
> > > can contribute.
> > >
> > > I think our goal should be turnaround under 30min.
> > >
> > > I would also like to touch base with the community that some PRs are
> not
> > > being followed up by committers asking for changes. For example this PR
> > is
> > > importtant and is hanging for a long time.
> > >
> > > https://github.com/apache/incubator-mxnet/pull/15051
> > >
> > > This is another, less important but more trivial to review:
> > >
> > > https://github.com/apache/incubator-mxnet/pull/14940
> > >
> > > I think comitters requesting changes and not folllowing up in
> reasonable
> > > time is not healthy for the project. I suggest configuring github
> > > Notifications for a good SNR and following up.
> > >
> > > Regards.
> > >
> > > Pedro.
> > >
> >
>

Re: CI and PRs

Posted by Pedro Larroy <pe...@gmail.com>.

Hi Carin.

That's a good point, all things considered would your preference be to keep
the Clojure tests as part of the PR process or in Nightly?
Some options are having notifications here or in slack. But if we think
breakages would go unnoticed maybe is not a good idea to fully remove
bindings from the PR process and just streamline the process.

Pedro.

On Wed, Aug 14, 2019 at 5:09 AM Carin Meier <ca...@gmail.com> wrote:

> Before any binding tests are moved to nightly, I think we need to figure
> out how the community can get proper notifications of failure and success
> on those nightly runs. Otherwise, I think that breakages would go
> unnoticed.
>
> -Carin
>
> On Tue, Aug 13, 2019 at 7:47 PM Pedro Larroy <pedro.larroy.lists@gmail.com
> >
> wrote:
>
> > Hi
> >
> > Seems we are hitting some problems in CI. I propose the following action
> > items to remedy the situation and accelerate turn around times in CI,
> > reduce cost, complexity and probability of failure blocking PRs and
> > frustrating developers:
> >
> > * Upgrade Windows visual studio from VS 2015 to VS 2017. The
> > build_windows.py infrastructure should easily work with the new version.
> > Currently some PRs are blocked by this:
> > https://github.com/apache/incubator-mxnet/issues/13958
> > * Move Gluon Model zoo tests to nightly. Tracked at
> > https://github.com/apache/incubator-mxnet/issues/15295
> > * Move non-python bindings tests to nightly. If a commit is touching
> other
> > bindings, the reviewer should ask for a full run which can be done
> locally,
> > use the label bot to trigger a full CI build, or defer to nightly.
> > * Provide a couple of basic sanity performance tests on small models that
> > are run on CI and can be echoed by the label bot as a comment for PRs.
> > * Address unit tests that take more than 10-20s, streamline them or move
> > them to nightly if it can't be done.
> > * Open sourcing the remaining CI infrastructure scripts so the community
> > can contribute.
> >
> > I think our goal should be turnaround under 30min.
> >
> > I would also like to touch base with the community that some PRs are not
> > being followed up by committers asking for changes. For example this PR
> is
> > importtant and is hanging for a long time.
> >
> > https://github.com/apache/incubator-mxnet/pull/15051
> >
> > This is another, less important but more trivial to review:
> >
> > https://github.com/apache/incubator-mxnet/pull/14940
> >
> > I think comitters requesting changes and not folllowing up in reasonable
> > time is not healthy for the project. I suggest configuring github
> > Notifications for a good SNR and following up.
> >
> > Regards.
> >
> > Pedro.
> >
>

Re: CI and PRs

Posted by Pedro Larroy <pe...@gmail.com>.

Yes another point is that pushing again to the PR should cancel previous
builds which is now not happening which wastes resources.

Any ideas how to make connection errors more robust? The Ivy cache for JVM
packages for example could be pre-populated in the workers. It's a balance
between complexity and efficiency and simplicity.

Maybe maven has some settings to retry download failures for example. For
failures downloading gpg keys we just stored them in the repository to
avoid networking problems.


On Wed, Aug 14, 2019 at 9:39 AM Chaitanya Bapat <ch...@gmail.com>
wrote:

> Pedro,
>
> great job of summarizing the set of tasks to restore CI's glory!
> As far as your list goes,
>
> > * Address unit tests that take more than 10-20s, streamline them or move
> > them to nightly if it can't be done.
>
> I would like to call out this request specifically. I'm tracking # of
> timeouts that happen (and this is by no means an exhaustive list) - PR
> #15880 <https://github.com/apache/incubator-mxnet/issues/15880>
> It's unreasonable for CI to run tests for 3 hours. So, we do need to
> address this issue with greater intent.
>
> Moreover, to add to the tale of CI woes, we should make it robust enough
> for network connection errors.
> At times, CI fails due to inability to fetch some packages.
> 1. Error log doesn't mention corrective action (on the part of PR author -
> "to retrigger the CI")
> 2. Would have been great had CI handled it smartly (or some sort of way to
> fasten the process of passing the CI)
>
> Hopefully, with the help of community, we would be able to catch exceptions
> and make CI great again!
>
>
> On Wed, 14 Aug 2019 at 05:09, Carin Meier <ca...@gmail.com> wrote:
>
> > Before any binding tests are moved to nightly, I think we need to figure
> > out how the community can get proper notifications of failure and success
> > on those nightly runs. Otherwise, I think that breakages would go
> > unnoticed.
> >
> > -Carin
> >
> > On Tue, Aug 13, 2019 at 7:47 PM Pedro Larroy <
> pedro.larroy.lists@gmail.com
> > >
> > wrote:
> >
> > > Hi
> > >
> > > Seems we are hitting some problems in CI. I propose the following
> action
> > > items to remedy the situation and accelerate turn around times in CI,
> > > reduce cost, complexity and probability of failure blocking PRs and
> > > frustrating developers:
> > >
> > > * Upgrade Windows visual studio from VS 2015 to VS 2017. The
> > > build_windows.py infrastructure should easily work with the new
> version.
> > > Currently some PRs are blocked by this:
> > > https://github.com/apache/incubator-mxnet/issues/13958
> > > * Move Gluon Model zoo tests to nightly. Tracked at
> > > https://github.com/apache/incubator-mxnet/issues/15295
> > > * Move non-python bindings tests to nightly. If a commit is touching
> > other
> > > bindings, the reviewer should ask for a full run which can be done
> > locally,
> > > use the label bot to trigger a full CI build, or defer to nightly.
> > > * Provide a couple of basic sanity performance tests on small models
> that
> > > are run on CI and can be echoed by the label bot as a comment for PRs.
> > > * Address unit tests that take more than 10-20s, streamline them or
> move
> > > them to nightly if it can't be done.
>
> > * Open sourcing the remaining CI infrastructure scripts so the community
> > > can contribute.
> > >
> > > I think our goal should be turnaround under 30min.
> > >
> > > I would also like to touch base with the community that some PRs are
> not
> > > being followed up by committers asking for changes. For example this PR
> > is
> > > importtant and is hanging for a long time.
> > >
> > > https://github.com/apache/incubator-mxnet/pull/15051
> > >
> > > This is another, less important but more trivial to review:
> > >
> > > https://github.com/apache/incubator-mxnet/pull/14940
> > >
> > > I think comitters requesting changes and not folllowing up in
> reasonable
> > > time is not healthy for the project. I suggest configuring github
> > > Notifications for a good SNR and following up.
> > >
> > > Regards.
> > >
> > > Pedro.
> > >
> >
>
>
> --
> *Chaitanya Prakash Bapat*
> *+1 (973) 953-6299*
>
> [image: https://www.linkedin.com//in/chaibapat25]
> <https://github.com/ChaiBapchya>[image: https://www.facebook.com/chaibapat
> ]
> <https://www.facebook.com/chaibapchya>[image:
> https://twitter.com/ChaiBapchya] <https://twitter.com/ChaiBapchya>[image:
> https://www.linkedin.com//in/chaibapat25]
> <https://www.linkedin.com//in/chaibapchya/>
>

Re: CI and PRs

Posted by Chaitanya Bapat <ch...@gmail.com>.

Pedro,

great job of summarizing the set of tasks to restore CI's glory!
As far as your list goes,

> * Address unit tests that take more than 10-20s, streamline them or move
> them to nightly if it can't be done.

I would like to call out this request specifically. I'm tracking # of
timeouts that happen (and this is by no means an exhaustive list) - PR
#15880 <https://github.com/apache/incubator-mxnet/issues/15880>
It's unreasonable for CI to run tests for 3 hours. So, we do need to
address this issue with greater intent.

Moreover, to add to the tale of CI woes, we should make it robust enough
for network connection errors.
At times, CI fails due to inability to fetch some packages.
1. Error log doesn't mention corrective action (on the part of PR author -
"to retrigger the CI")
2. Would have been great had CI handled it smartly (or some sort of way to
fasten the process of passing the CI)

Hopefully, with the help of community, we would be able to catch exceptions
and make CI great again!


On Wed, 14 Aug 2019 at 05:09, Carin Meier <ca...@gmail.com> wrote:

> Before any binding tests are moved to nightly, I think we need to figure
> out how the community can get proper notifications of failure and success
> on those nightly runs. Otherwise, I think that breakages would go
> unnoticed.
>
> -Carin
>
> On Tue, Aug 13, 2019 at 7:47 PM Pedro Larroy <pedro.larroy.lists@gmail.com
> >
> wrote:
>
> > Hi
> >
> > Seems we are hitting some problems in CI. I propose the following action
> > items to remedy the situation and accelerate turn around times in CI,
> > reduce cost, complexity and probability of failure blocking PRs and
> > frustrating developers:
> >
> > * Upgrade Windows visual studio from VS 2015 to VS 2017. The
> > build_windows.py infrastructure should easily work with the new version.
> > Currently some PRs are blocked by this:
> > https://github.com/apache/incubator-mxnet/issues/13958
> > * Move Gluon Model zoo tests to nightly. Tracked at
> > https://github.com/apache/incubator-mxnet/issues/15295
> > * Move non-python bindings tests to nightly. If a commit is touching
> other
> > bindings, the reviewer should ask for a full run which can be done
> locally,
> > use the label bot to trigger a full CI build, or defer to nightly.
> > * Provide a couple of basic sanity performance tests on small models that
> > are run on CI and can be echoed by the label bot as a comment for PRs.
> > * Address unit tests that take more than 10-20s, streamline them or move
> > them to nightly if it can't be done.

> * Open sourcing the remaining CI infrastructure scripts so the community
> > can contribute.
> >
> > I think our goal should be turnaround under 30min.
> >
> > I would also like to touch base with the community that some PRs are not
> > being followed up by committers asking for changes. For example this PR
> is
> > importtant and is hanging for a long time.
> >
> > https://github.com/apache/incubator-mxnet/pull/15051
> >
> > This is another, less important but more trivial to review:
> >
> > https://github.com/apache/incubator-mxnet/pull/14940
> >
> > I think comitters requesting changes and not folllowing up in reasonable
> > time is not healthy for the project. I suggest configuring github
> > Notifications for a good SNR and following up.
> >
> > Regards.
> >
> > Pedro.
> >
>


-- 
*Chaitanya Prakash Bapat*
*+1 (973) 953-6299*

[image: https://www.linkedin.com//in/chaibapat25]
<https://github.com/ChaiBapchya>[image: https://www.facebook.com/chaibapat]
<https://www.facebook.com/chaibapchya>[image:
https://twitter.com/ChaiBapchya] <https://twitter.com/ChaiBapchya>[image:
https://www.linkedin.com//in/chaibapat25]
<https://www.linkedin.com//in/chaibapchya/>

Re: CI and PRs

Posted by Carin Meier <ca...@gmail.com>.

Before any binding tests are moved to nightly, I think we need to figure
out how the community can get proper notifications of failure and success
on those nightly runs. Otherwise, I think that breakages would go unnoticed.

-Carin

On Tue, Aug 13, 2019 at 7:47 PM Pedro Larroy <pe...@gmail.com>
wrote:

> Hi
>
> Seems we are hitting some problems in CI. I propose the following action
> items to remedy the situation and accelerate turn around times in CI,
> reduce cost, complexity and probability of failure blocking PRs and
> frustrating developers:
>
> * Upgrade Windows visual studio from VS 2015 to VS 2017. The
> build_windows.py infrastructure should easily work with the new version.
> Currently some PRs are blocked by this:
> https://github.com/apache/incubator-mxnet/issues/13958
> * Move Gluon Model zoo tests to nightly. Tracked at
> https://github.com/apache/incubator-mxnet/issues/15295
> * Move non-python bindings tests to nightly. If a commit is touching other
> bindings, the reviewer should ask for a full run which can be done locally,
> use the label bot to trigger a full CI build, or defer to nightly.
> * Provide a couple of basic sanity performance tests on small models that
> are run on CI and can be echoed by the label bot as a comment for PRs.
> * Address unit tests that take more than 10-20s, streamline them or move
> them to nightly if it can't be done.
> * Open sourcing the remaining CI infrastructure scripts so the community
> can contribute.
>
> I think our goal should be turnaround under 30min.
>
> I would also like to touch base with the community that some PRs are not
> being followed up by committers asking for changes. For example this PR is
> importtant and is hanging for a long time.
>
> https://github.com/apache/incubator-mxnet/pull/15051
>
> This is another, less important but more trivial to review:
>
> https://github.com/apache/incubator-mxnet/pull/14940
>
> I think comitters requesting changes and not folllowing up in reasonable
> time is not healthy for the project. I suggest configuring github
> Notifications for a good SNR and following up.
>
> Regards.
>
> Pedro.
>