Posted to dev@mxnet.apache.org by sandeep krishnamurthy <sa...@gmail.com> on 2017/10/20 20:01:12 UTC

Fwd: [Proposal] Stabilizing Apache MXNet CI build system

Hello all,

I am hereby opening up a discussion thread on how we can stabilize Apache
MXNet CI build system.

Problems:

========

Recently, we have seen the following issues with the Apache MXNet CI build system:

   1. The Apache Jenkins master is overloaded, and we see issues such as being
   unable to trigger builds and difficulty loading the Blue Ocean and other
   Jenkins build status pages.
   2. We are generating too many requests/interactions for the Apache Infra team:
      1. Addition/deletion of slaves, caused by scaling activity, recycling,
      troubleshooting, or any other action that changes the slave machines.
      2. Plugins and other Jenkins master configuration.
      3. Experimentation on CI pipelines.
   3. Issues are harder to debug and resolve: since access to the master and
   the slaves is not held by the same community, Infra and the community have
   to dive deep together on every action item.

Possible Solutions:

==============

   1. Can we set up a separate Jenkins CI build system for Apache MXNet
   outside Apache Infra?
   2. Can we have a separate Jenkins Master in Apache Infra for MXNet?
   3. Review the design of the current setup, refine it, and fill the gaps.

@ Mentors/Infra team/Community:

==========================

Please provide your suggestions on how we can proceed and work on
stabilizing the CI build system for MXNet.

Also, if the community decides on a separate Jenkins CI build system, what
important points should be taken care of apart from the following:

   1. Community members being able to access the build page for build statuses.
   2. Committers being able to log in with Apache credentials.
   3. Hook setup from the apache/incubator-mxnet repo to the Jenkins master.


Irrespective of the solution we come up with, I think we should initiate a
technical design discussion on how to set up the CI build system, probably
via a 1-2 page document describing the architecture, to be reviewed with
Infra and community members.

***There were a few proposals and discussions on the Slack channel; to reach
wider community members, I am moving that discussion formally to this list.


My proposal: Option 1 - set up a separate Jenkins CI build system.

Thanks,

Sandeep



-- 
Sandeep Krishnamurthy

Re: [Proposal] Stabilizing Apache MXNet CI build system

Posted by Chris Olivier <cj...@gmail.com>.
I believe Mu already started that discussion about reusing the old mxnet.io
Jenkins server. I expect that deciding whether to replace it would hinge in
large part on what it would be replaced with.

On Fri, Oct 20, 2017 at 4:30 PM, sandeep krishnamurthy <sandeep.krishna98@gmail.com> wrote:

> Chris: If the community decides to go with a separate setup, then there will
> be a technical design discussion, and proposals such as CodeCommit, Jenkins,
> and Travis will be covered and discussed.
>
> Thanks,
> Sandeep
>
> On Fri, Oct 20, 2017 at 4:22 PM, Seb Kiureghian <se...@gmail.com> wrote:
>
> > But the feather can definitely be added once MXNet graduates.
> >
> > On Fri, Oct 20, 2017 at 4:21 PM, Seb Kiureghian <se...@gmail.com> wrote:
> >
> > > The feather can only be used by Top Level Projects.
> > >
> > > On Fri, Oct 20, 2017 at 4:19 PM, Chris Olivier <cj...@gmail.com> wrote:
> > >
> > >> When the word Apache is in the Hadoop logo (not always), it includes the
> > >> feather and color scheme.
> > >>
> > >> On Fri, Oct 20, 2017 at 4:18 PM, Chris Olivier <cjolivier01@gmail.com> wrote:
> > >>
> > >>> Thanks.
> > >>>
> > >>> Is there any way to work the feather into it?
> > >>>
> > >>> i.e.  https://goo.gl/images/BU4dnG
> > >>>
> > >>> On Fri, Oct 20, 2017 at 4:11 PM, Seb Kiureghian <se...@gmail.com> wrote:
> > >>>
> > >>>> https://imgur.com/a/aADkA
> > >>>>
> > >>>> On Fri, Oct 20, 2017 at 4:07 PM, Chris Olivier <cjolivier01@gmail.com> wrote:
> > >>>>
> > >>>> > Why don't we look into the fully managed AWS CodeBuild? It maintains
> > >>>> > everything. It's also compatible with Jenkins.
> > >>>> >
> > >>>> > On Fri, Oct 20, 2017 at 1:51 PM, Tianqi Chen <tqchen@cs.washington.edu> wrote:
> > >>>> >
> > >>>> > > +1
> > >>>> > >
> > >>>> > > Tianqi
> > >>>> > > On Fri, Oct 20, 2017 at 1:39 PM Mu Li <mu...@gmail.com> wrote:
> > >>>> > >
> > >>>> > > > +1
> > >>>> > > >
> > >>>> > > >
> > >>>> > > > It seems that the Apache CI is quite overloaded these days, and
> > >>>> > > > MXNet's CI pipeline is too complex to run there. In addition, we
> > >>>> > > > may need to add more devices, e.g. Mac Pro and Raspberry Pi, to the
> > >>>> > > > server, and more tasks such as pip builds. That means a lot of
> > >>>> > > > requests to the Infra team.
> > >>>> > > >
> > >>>> > > > We can reuse our previous Jenkins server at http://ci.mxnet.io/.
> > >>>> > > > But we probably need a dedicated developer to maintain it.
> > >>>> > > >
> > >>>> > > >
> > >>>> > > >
> > >>>> > > > On Fri, Oct 20, 2017 at 1:01 PM, sandeep krishnamurthy <sandeep.krishna98@gmail.com> wrote:
> >
>
>
>
> --
> Sandeep Krishnamurthy
>

Re: [Proposal] Stabilizing Apache MXNet CI build system

Posted by sandeep krishnamurthy <sa...@gmail.com>.
Chris: If the community decides to go with a separate setup, then there will
be a technical design discussion, and proposals such as CodeCommit, Jenkins,
and Travis will be covered and discussed.

Thanks,
Sandeep

-- 
Sandeep Krishnamurthy

Re: [Proposal] Stabilizing Apache MXNet CI build system

Posted by Seb Kiureghian <se...@gmail.com>.
But the feather can definitely be added once MXNet graduates.

Re: [Proposal] Stabilizing Apache MXNet CI build system

Posted by Seb Kiureghian <se...@gmail.com>.
The feather can only be used by Top Level Projects.

Re: [Proposal] Stabilizing Apache MXNet CI build system

Posted by Chris Olivier <cj...@gmail.com>.
When the word Apache is in the Hadoop logo (not always), it includes the
feather and color scheme.

Re: [Proposal] Stabilizing Apache MXNet CI build system

Posted by Chris Olivier <cj...@gmail.com>.
Thanks.

Is there any way to work the feather into it?

i.e.  https://goo.gl/images/BU4dnG

On Fri, Oct 20, 2017 at 4:11 PM, Seb Kiureghian <se...@gmail.com> wrote:

> https://imgur.com/a/aADkA
>
> On Fri, Oct 20, 2017 at 4:07 PM, Chris Olivier <cj...@gmail.com>
> wrote:
>
> > Why don;t we look into fully managed AWS CodeBuild?  It maintains
> > everything. It's also compatible with Jenkins.
> >
> > On Fri, Oct 20, 2017 at 1:51 PM, Tianqi Chen <tq...@cs.washington.edu>
> > wrote:
> >
> > > +1
> > >
> > > Tianqi
> > > On Fri, Oct 20, 2017 at 1:39 PM Mu Li <mu...@gmail.com> wrote:
> > >
> > > > +1
> > > >
> > > >
> > > > It seems that the Apache CI is quite overloaded these days, and
> MXNet's
> > > CI
> > > > pipeline is too complex to run there. In addition, we may need to add
> > > more
> > > > devices, e.g. macpro and rasbperry pi, into the server, and more
> tasks
> > > such
> > > > as pip build. It means a lot of requests to the Infra team.
> > > >
> > > > We can reuse our previous Jenkins server at http://ci.mxnet.io/. But
> > we
> > > > probably need a dedicate developer to maintain it.
> > > >
> > > >
> > > >
> > > > On Fri, Oct 20, 2017 at 1:01 PM, sandeep krishnamurthy <
> > > > sandeep.krishna98@gmail.com> wrote:
> > > >
> > > > > Hello all,
> > > > >
> > > > > I am hereby opening up a discussion thread on how we can stabilize
> > > Apache
> > > > > MXNet CI build system.
> > > > >
> > > > > Problems:
> > > > >
> > > > > ========
> > > > >
> > > > > Recently, we have seen following issues with Apache MXNet CI build
> > > > systems:
> > > > >
> > > > >    1. The Apache Jenkins master is overloaded and we see issues such
> > > > >    as being unable to trigger builds and difficulty loading and
> > > > >    viewing the Blue Ocean and other Jenkins build status pages.
> > > > >    2. We are generating too many requests/interactions for the Apache
> > > > >    Infra team.
> > > > >       1. Addition/deletion of slaves: caused by scaling activity,
> > > > >       recycling, troubleshooting or any action leading to a change of
> > > > >       slave machines.
> > > > >       2. Plugins / other Jenkins Master configurations.
> > > > >       3. Experimentation on CI pipelines.
> > > > >    3. Harder to debug and resolve issues - since access to the master
> > > > >    and the slaves is not held by the same community, Infra and the
> > > > >    community have to dive deep together on all action items.
> > > > >
> > > > > Possible Solutions:
> > > > >
> > > > > ==============
> > > > >
> > > > >    1. Can we set up a separate Jenkins CI build system for Apache
> > MXNet
> > > > >    outside Apache Infra?
> > > > >    2. Can we have a separate Jenkins Master in Apache Infra for
> > MXNet?
> > > > >    3. Review the design of the current setup, refine it and fill the gaps.
> > > > >
> > > > > @ Mentors/Infra team/Community:
> > > > >
> > > > > ==========================
> > > > >
> > > > > Please provide your suggestions on how we can proceed further and
> > work
> > > on
> > > > > stabilizing the CI build systems for MXNet.
> > > > >
> > > > > Also, if the community decides on a separate Jenkins CI build system,
> > > > > what important points should be taken care of apart from the below:
> > > > >
> > > > >    1. Community being able to access the build page for build statuses.
> > > > >    2. Committers being able to log in with Apache credentials.
> > > > >    3. Hook setup from apache/incubator-mxnet repo to Jenkins master.
> > > > >
> > > > >
> > > > > Irrespective of the solution we come up with, I think we should
> > > > > initiate a technical design discussion on how to set up the CI build
> > > > > system: probably a 1-2 page document with the architecture, reviewed
> > > > > with Infra and community members.
> > > > >
> > > > > ***There were a few proposals and discussions on the Slack channel;
> > > > > to reach wider community members, I am moving that discussion
> > > > > formally to this list.
> > > > >
> > > > >
> > > > > My Proposal: Option 1 - Set up separate Jenkins CI build system.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Sandeep
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Sandeep Krishnamurthy
> > > > >
> > > >
> > >
> >
>

Re: [Proposal] Stabilizing Apache MXNet CI build system

Posted by Seb Kiureghian <se...@gmail.com>.
https://imgur.com/a/aADkA

On Fri, Oct 20, 2017 at 4:07 PM, Chris Olivier <cj...@gmail.com>
wrote:

> Why don't we look into fully managed AWS CodeBuild? It maintains
> everything. It's also compatible with Jenkins.
>
> On Fri, Oct 20, 2017 at 1:51 PM, Tianqi Chen <tq...@cs.washington.edu>
> wrote:
>
> > +1
> >
> > Tianqi
> > On Fri, Oct 20, 2017 at 1:39 PM Mu Li <mu...@gmail.com> wrote:
> >
> > > +1
> > >
> > >
> > > It seems that the Apache CI is quite overloaded these days, and MXNet's
> > > CI pipeline is too complex to run there. In addition, we may need to add
> > > more devices, e.g. Mac Pro and Raspberry Pi, into the server, and more
> > > tasks such as pip build. It means a lot of requests to the Infra team.
> > >
> > > We can reuse our previous Jenkins server at http://ci.mxnet.io/. But we
> > > probably need a dedicated developer to maintain it.
> > >
> > >
> > >
> > > On Fri, Oct 20, 2017 at 1:01 PM, sandeep krishnamurthy <
> > > sandeep.krishna98@gmail.com> wrote:
> > >
> > > > Hello all,
> > > >
> > > > I am hereby opening up a discussion thread on how we can stabilize
> > Apache
> > > > MXNet CI build system.
> > > >
> > > > Problems:
> > > >
> > > > ========
> > > >
> > > > Recently, we have seen the following issues with the Apache MXNet CI
> > > > build system:
> > > >
> > > >    1. The Apache Jenkins master is overloaded and we see issues such
> > > >    as being unable to trigger builds and difficulty loading and
> > > >    viewing the Blue Ocean and other Jenkins build status pages.
> > > >    2. We are generating too many requests/interactions for the Apache
> > > >    Infra team.
> > > >       1. Addition/deletion of slaves: caused by scaling activity,
> > > >       recycling, troubleshooting or any action leading to a change of
> > > >       slave machines.
> > > >       2. Plugins / other Jenkins Master configurations.
> > > >       3. Experimentation on CI pipelines.
> > > >    3. Harder to debug and resolve issues - since access to the master
> > > >    and the slaves is not held by the same community, Infra and the
> > > >    community have to dive deep together on all action items.
> > > >
> > > > Possible Solutions:
> > > >
> > > > ==============
> > > >
> > > >    1. Can we set up a separate Jenkins CI build system for Apache
> MXNet
> > > >    outside Apache Infra?
> > > >    2. Can we have a separate Jenkins Master in Apache Infra for
> MXNet?
> > > >    3. Review the design of the current setup, refine it and fill the gaps.
> > > >
> > > > @ Mentors/Infra team/Community:
> > > >
> > > > ==========================
> > > >
> > > > Please provide your suggestions on how we can proceed further and
> work
> > on
> > > > stabilizing the CI build systems for MXNet.
> > > >
> > > > Also, if the community decides on a separate Jenkins CI build system,
> > > > what important points should be taken care of apart from the below:
> > > >
> > > >    1. Community being able to access the build page for build statuses.
> > > >    2. Committers being able to log in with Apache credentials.
> > > >    3. Hook setup from apache/incubator-mxnet repo to Jenkins master.
> > > >
> > > >
> > > > Irrespective of the solution we come up with, I think we should
> > > > initiate a technical design discussion on how to set up the CI build
> > > > system: probably a 1-2 page document with the architecture, reviewed
> > > > with Infra and community members.
> > > >
> > > > ***There were a few proposals and discussions on the Slack channel;
> > > > to reach wider community members, I am moving that discussion
> > > > formally to this list.
> > > >
> > > >
> > > > My Proposal: Option 1 - Set up separate Jenkins CI build system.
> > > >
> > > > Thanks,
> > > >
> > > > Sandeep
> > > >
> > > >
> > > >
> > > > --
> > > > Sandeep Krishnamurthy
> > > >
> > >
> >
>

Re: [Proposal] Stabilizing Apache MXNet CI build system

Posted by Hen <ba...@apache.org>.
Some inline thoughts.

On Wed, Nov 1, 2017 at 9:41 AM, Bhavin Thaker <bh...@gmail.com>
wrote:

> A few comments/suggestions:
>
> 1) Can we have this nice list of to-do items on the Apache MXNet wiki page
> to track them better?
>
> 2) Can we have a set of owners for each set of tests and source code
> directory? One of the problems I have observed is that when there is a test
> failure, it is difficult to find an owner who will take the responsibility
> of fixing the test OR identifying the culprit code promptly -- this causes
> the master to continue to fail for many days.
>

On this one, we're all volunteers and there shouldn't be situations of
"Bob's permission is needed to edit this file", or "We're waiting on Alice
to do that work". The project as a whole owns this.

Agreed that this can cause a tragedy of the commons, but raising the bar on
being a committer to someone who has the privilege of 24/7 time on the
project is worse.

As an employer of contributors, something you could do internally at Amazon
is to identify experts who own (from Amazon's point of view) contributions
to that area and they can be the ones you poke on an issue (internally).


>
> 3) Specifically, we need an owner for the Windows setup -- nobody seems to
> know much about it -- please feel free to correct me if required.
>

If there's no one in the community who can support it, then a) we should
seek someone (help wanted etc) on the lists/website/twitter, and b) if that
fails, we should move it to a contrib/deprecated path.


>
> 4) +1 to have a list of all feature requests on Jira or a similar commonly
> and easily accessible system.
>
> 5) -1 to the branching model -- I was the gatekeeper for the branching
> model at Informix for the database kernel code to be merged to master along
> with my day-job of being a database kernel engineer for around 9 months and
> hence have the opinion that a branching model just shifts the burden from
> one place to another. We don't have a dedicated team to do the branching
> model. If we really need a buildable master every day, then we could just
> tag every successful build as last_clean_build on master -- use this tag to
> get a clean master at any time. How many Apache projects are doing
> development on separate branches?
>

Typically I would expect separate-branch development to happen when a project
is experimenting with multiple futures. Most projects do have multiple
branches (I'd guess typically only 2) to support bugfixes to older versions
and new code on newer versions though.


>
> 6) FYI: Rahul (rahul003@) has fixed various warnings with this PR:
> https://github.com/apache/incubator-mxnet/pull/7109 and has a test added
> that fails for any warning found. We can build on top of his work.
>
> 7) FYI: For the unit-tests problems, Meghna identified that some of the
> unit-test run times have increased significantly in the recent builds. We
> need volunteers to help diagnose the root cause here:
>
> Unit Test Task     Build #337  Build #500  Build #556
> Python 2: GPU Win          25          38          40
> Python 3: GPU Win          15          38          46
> Python 2: CPU              25          35          80
> Python 3: CPU              14          28          72
> R: CPU                     20          34          24
> R: GPU                      5          24          24
>
>
> 8) Ensure that all PRs submitted have corresponding documentation on
> http://mxnet.io for it.  It may be fine to have documentation follow the
> code changes as long as there is ownership that this task will be done in a
> timely manner.  For example, I have requested the Nvidia team to submit PRs
> to update documentation on http://mxnet.io for the Volta changes to MXNet.
>

Why not expect documentation as a part of the PR?


>
>
> 9) Ensure that mega-PRs have some level of design or architecture
> document(s) shared on the Apache MXNet wiki. The mega-PR must have both
> unit-tests and nightly/integration tests submitted to demonstrate
> a high level of quality.
>

+1. These are the ones that should be having a dev@ discussion.


>
>
> 10) Finally, how do we get ownership for code submitted to MXNet? When
> something fails in a code segment that only a small set of folks know
> about, what is the expected SLA for a response from them? When users deploy
> MXNet in production environments, they will expect some form of SLA for
> support and a patch release.
>

Users can expect what they want. What they get is best effort/good
intentions. If they want someone to supply an SLA, then they can pay a
vendor who repackages MXNet/builds upon MXNet for that service.

Part of the value of Open Source is that users can always fix the issue
themselves, they are not beholden to a third party to fix it for them (and
thus need an SLA). For something like OpenOffice there is an obvious issue
there, many of its users would need longer to come up to speed to fix the
issue and the likely reply; but for MXNet, many of its users do know how to
code and don't need to go learn a programming language before starting to
look at the bug. This is also why it's very important that the MXNet
documentation explains how to get the source, how to build the source, and
how to contribute.

Security vulnerabilities are a little different. While good intentions
remain, it's assumed that a healthy project can fulfill the good
intentions, and repeated security issues without resolution will quickly
raise the question of whether the project is not mature enough for its user
base.

Hen

Re: [Proposal] Stabilizing Apache MXNet CI build system

Posted by kellen sunderland <ke...@gmail.com>.
On point 7), I did a little measuring/profiling of our test runs a
week or two ago and came to the same conclusion. I assumed the slowdowns
were mostly due to new tests that had recently been added. Quite a few
Gluon tests were added, for example, and I think they're fairly
resource intensive.
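A quick way to see which suites regressed most is to rank each task by how much its run time grew between Build #337 and Build #556, using only the numbers from the table in Bhavin's point 7 (quoted below). This is a minimal Python sketch, not a profiling tool: the thread does not state the time unit, so only the ratios are meaningful, and the task labels are normalized spellings of the table rows.

```python
from typing import Dict, List, Tuple

# Run times per unit-test task for Builds #337, #500 and #556, copied from the
# table in point 7. The unit was not stated in the thread.
RUN_TIMES: Dict[str, Tuple[int, int, int]] = {
    "Python 2: GPU Win": (25, 38, 40),
    "Python 3: GPU Win": (15, 38, 46),
    "Python 2: CPU": (25, 35, 80),
    "Python 3: CPU": (14, 28, 72),
    "R: CPU": (20, 34, 24),
    "R: GPU": (5, 24, 24),
}


def slowdown(times: Tuple[int, int, int]) -> float:
    """Ratio of the Build #556 run time to the Build #337 run time."""
    first, _, last = times
    return last / first


def rank_by_slowdown(run_times: Dict[str, Tuple[int, int, int]]) -> List[Tuple[str, float]]:
    """Return (task, ratio) pairs, largest relative slowdown first."""
    ratios = [(task, slowdown(times)) for task, times in run_times.items()]
    return sorted(ratios, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for task, ratio in rank_by_slowdown(RUN_TIMES):
        print(f"{task:<18} {ratio:.1f}x")
```

On these numbers Python 3: CPU grew the most (roughly 5x), followed by R: GPU and Python 2: CPU, so those suites are probably where profiling newly added tests would pay off first. A first-to-last ratio also hides non-monotonic cases: R: CPU went up at Build #500 and back down at Build #556.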

On Wed, Nov 1, 2017 at 6:40 PM, kellen sunderland <
kellen.sunderland@gmail.com> wrote:

> Bhavin: I would add on point 5 that it doesn't always make sense to attach
> ownership for the broken integration test to the PR author.  We're planning
> extensive integration tests on a variety of hardware.  Some of these test
> failures won't be reproducible by most PR authors and the effort to resolve
> these failures should be delegated to a test owner.  Agree with Pedro that
> this would be strictly fast-fwd merging from one branch to another after
> integration tests pass, so it shouldn't require much extra work beyond
> fixing failures.
>
> On Wed, Nov 1, 2017 at 6:35 PM, Pedro Larroy <pedro.larroy.lists@gmail.com
> > wrote:
>
>> Hi Bhavin
>>
>> Good suggestions.
>>
>> I wanted to respond to your point #5
>>
>> The promotion of integration to master would be done automatically by
>> Jenkins once a commit passes the nightly tests. So it should not
>> impose any additional burden on the developers, as there is no manual
>> step involved / human gatekeeper.
>>
>> It would be equivalent to your suggestion with tags. You can do the
>> same with branches; a git branch is just a pointer to some commit, so
>> I think we are talking about the same thing.
>>
>> Pedro.
>>
>>
>>
>>
>> On Wed, Nov 1, 2017 at 5:41 PM, Bhavin Thaker <bh...@gmail.com>
>> wrote:
>> > A few comments/suggestions:
>> >
>> > 1) Can we have this nice list of to-do items on the Apache MXNet wiki
>> > page to track them better?
>> >
>> > 2) Can we have a set of owners for each set of tests and source code
>> > directory? One of the problems I have observed is that when there is a
>> test
>> > failure, it is difficult to find an owner who will take the
>> responsibility
>> > of fixing the test OR identifying the culprit code promptly -- this
>> causes
>> > the master to continue to fail for many days.
>> >
>> > 3) Specifically, we need an owner for the Windows setup -- nobody seems
>> to
>> > know much about it -- please feel free to correct me if required.
>> >
>> > 4) +1 to have a list of all feature requests on Jira or a similar
>> commonly
>> > and easily accessible system.
>> >
>> > 5) -1 to the branching model -- I was the gatekeeper for the branching
>> > model at Informix for the database kernel code to be merged to master
>> along
>> > with my day-job of being a database kernel engineer for around 9 months
>> and
>> > hence have the opinion that a branching model just shifts the burden
>> from
>> > one place to another. We don't have a dedicated team to do the branching
>> > model. If we really need a buildable master every day, then we could just
>> > tag every successful build as last_clean_build on master -- use this
>> tag to
>> > get a clean master at any time. How many Apache projects are doing
>> > development on separate branches?
>> >
>> > 6) FYI: Rahul (rahul003@) has fixed various warnings with this PR:
>> > https://github.com/apache/incubator-mxnet/pull/7109 and has a test
>> added
>> > that fails for any warning found. We can build on top of his work.
>> >
>> > 7) FYI: For the unit-tests problems, Meghna identified that some of the
>> > unit-test run times have increased significantly in the recent builds. We
>> > need volunteers to help diagnose the root cause here:
>> >
>> > Unit Test Task     Build #337  Build #500  Build #556
>> > Python 2: GPU Win          25          38          40
>> > Python 3: GPU Win          15          38          46
>> > Python 2: CPU              25          35          80
>> > Python 3: CPU              14          28          72
>> > R: CPU                     20          34          24
>> > R: GPU                      5          24          24
>> >
>> >
>> > 8) Ensure that all PRs submitted have corresponding documentation on
>> > http://mxnet.io for it.  It may be fine to have documentation follow
>> the
>> > code changes as long as there is ownership that this task will be done
>> in a
>> > timely manner.  For example, I have requested the Nvidia team to submit
>> PRs
>> > to update documentation on http://mxnet.io for the Volta changes to
>> MXNet.
>> >
>> >
>> > 9) Ensure that mega-PRs have some level of design or architecture
>> > document(s) shared on the Apache MXNet wiki. The mega-PR must have both
>> > unit-tests and nightly/integration tests submitted to demonstrate
>> > a high level of quality.
>> >
>> >
>> > 10) Finally, how do we get ownership for code submitted to MXNet? When
>> > something fails in a code segment that only a small set of folks know
>> > about, what is the expected SLA for a response from them? When users
>> deploy
>> > MXNet in production environments, they will expect some form of SLA for
>> > support and a patch release.
>> >
>> >
>> > Regards,
>> > Bhavin Thaker.
>> >
>> >
>> >
>> >
>> >
>> >
>> > On Wed, Nov 1, 2017 at 8:20 AM, Pedro Larroy <
>> pedro.larroy.lists@gmail.com>
>> > wrote:
>> >
>> >> +1  That would be great.
>> >>
>> >> On Mon, Oct 30, 2017 at 5:35 PM, Hen <ba...@apache.org> wrote:
>> >> > How about we ask for a new mxnet repo to store all the config in?
>> >> >
>> >> > On Fri, Oct 27, 2017 at 05:30 Pedro Larroy <
>> pedro.larroy.lists@gmail.com
>> >> >
>> >> > wrote:
>> >> >
>> >> >> Just to provide a high level overview of the ideas and proposals
>> >> >> coming from different sources for the requirements for testing and
>> >> >> validation of builds:
>> >> >>
>> >> >> * Have Terraform files for the testing infrastructure: infrastructure
>> >> >> as code (IaC), excluding embedded hardware that is neither emulated
>> >> >> nor cloud based ("single command" replication of the testing
>> >> >> infrastructure, no manual steps).
>> >> >>
>> >> >> * CI software based on Jenkins, unless someone thinks there's a
>> better
>> >> >> alternative.
>> >> >>
>> >> >> * Use autoscaling groups and improve staggered build + test steps to
>> >> >> achieve higher parallelism and shorter feedback times.
>> >> >>
>> >> >> * Switch to a branching model based on stable master + integration
>> >> >> branch. PRs are merged into dev/integration which runs extended
>> >> >> nightly tests, which are
>> >> >> then merged into master, preferably in an automated way after
>> >> >> successful extended testing.
>> >> >> Master is always tested, and always buildable. Release branches or
>> >> >> tags in master as usual for releases.
>> >> >>
>> >> >> * Build + test feedback time targeting less than 15 minutes.
>> >> >> (Currently a build on a 16-core machine takes 7 minutes.) This
>> >> >> involves a lot of test refactoring: moving expensive tests / big
>> >> >> smoke tests to nightlies on the integration branch, as well as tests
>> >> >> on IoT devices and power and performance regressions...
>> >> >>
>> >> >> * Add code coverage and other quality metrics.
>> >> >>
>> >> >> * Eliminate warnings and treat warnings as errors. We have spent
>> time
>> >> >> tracking down "undefined behaviour" bugs that could have been caught
>> >> >> by compiler warnings.
>> >> >>
>> >> >> Is there something I'm missing or additional things that come to
>> your
>> >> >> mind that you would wish to add?
>> >> >>
>> >> >> Pedro.
>> >> >>
>> >>
>>
>
>
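The proposal quoted above targets build + test feedback in under 15 minutes while a build currently takes about 7 minutes, which leaves roughly 8 minutes for tests if they run after the build on the same machine (an assumption; parallel agents would change the arithmetic). Below is a minimal Python sketch of deciding which tests stay in the per-PR run and which move to the nightly run. The test names and durations are made up for illustration; real per-test timings would have to come from the CI's own reports.

```python
from typing import Dict, List, Tuple

FEEDBACK_TARGET_MIN = 15.0  # build + test target from the proposal
BUILD_MIN = 7.0             # current build time on a 16-core machine, per the proposal

# Hypothetical per-test durations in minutes, for illustration only.
EXAMPLE_DURATIONS: Dict[str, float] = {
    "test_operator": 3.0,
    "test_ndarray": 1.5,
    "test_gluon": 4.0,
    "test_smoke_imagenet": 20.0,
}


def split_tests(durations: Dict[str, float], budget: float) -> Tuple[List[str], List[str]]:
    """Keep the shortest tests in the per-PR stage until the budget is used up.

    Taking tests shortest-first maximizes how many fit within the budget;
    everything that does not fit is deferred to the nightly run.
    """
    per_pr: List[str] = []
    nightly: List[str] = []
    used = 0.0
    for name, minutes in sorted(durations.items(), key=lambda item: item[1]):
        if used + minutes <= budget:
            per_pr.append(name)
            used += minutes
        else:
            nightly.append(name)
    return per_pr, nightly


if __name__ == "__main__":
    per_pr, nightly = split_tests(EXAMPLE_DURATIONS, FEEDBACK_TARGET_MIN - BUILD_MIN)
    print("per-PR:", per_pr)
    print("nightly:", nightly)
```

Maximizing the number of per-PR tests is only one possible policy; in practice certain smoke tests would probably be pinned to the per-PR stage regardless of their duration.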

Re: [Proposal] Stabilizing Apache MXNet CI build system

Posted by kellen sunderland <ke...@gmail.com>.
Bhavin: I would add on point 5 that it doesn't always make sense to attach
ownership for the broken integration test to the PR author.  We're planning
extensive integration tests on a variety of hardware.  Some of these test
failures won't be reproducible by most PR authors and the effort to resolve
these failures should be delegated to a test owner.  Agree with Pedro that
this would be strictly fast-fwd merging from one branch to another after
integration tests pass, so it shouldn't require much extra work beyond
fixing failures.
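A fast-forward promotion like this is only possible when the current tip of master is an ancestor of the integration commit being promoted; otherwise a real merge would be needed, and `git merge --ff-only` fails instead of creating a merge commit. The minimal Python sketch below models that rule on a toy commit graph. The graph, commit ids and the `promote` helper are hypothetical and only illustrate the check; they are not how Jenkins or git implement it.

```python
from typing import Dict, List

# Hypothetical commit graph: each commit id maps to the ids of its parents.
Graph = Dict[str, List[str]]


def is_ancestor(graph: Graph, ancestor: str, descendant: str) -> bool:
    """True if `ancestor` is `descendant` or reachable from it via parent links."""
    stack, seen = [descendant], set()
    while stack:
        commit = stack.pop()
        if commit == ancestor:
            return True
        if commit in seen:
            continue
        seen.add(commit)
        stack.extend(graph.get(commit, []))
    return False


def promote(refs: Dict[str, str], graph: Graph, candidate: str, tests_passed: bool) -> bool:
    """Fast-forward refs["master"] to `candidate` if its tests passed and no merge is needed."""
    if not tests_passed:
        return False
    if not is_ancestor(graph, refs["master"], candidate):
        return False  # master is not an ancestor of candidate: no fast-forward possible
    refs["master"] = candidate  # a branch is just a pointer, so promotion is one assignment
    return True


if __name__ == "__main__":
    graph = {"a": [], "b": ["a"], "c": ["b"], "x": ["a"]}
    refs = {"master": "a", "integration": "c"}
    print(promote(refs, graph, refs["integration"], tests_passed=True), refs["master"])
    print(promote(refs, graph, "x", tests_passed=True), refs["master"])
```

The second call fails because "x" branched off "a" and does not contain "c", so moving master there would drop already-promoted commits.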

On Wed, Nov 1, 2017 at 6:35 PM, Pedro Larroy <pe...@gmail.com>
wrote:

> Hi Bhavin
>
> Good suggestions.
>
> I wanted to respond to your point #5
>
> The promotion of integration to master would be done automatically by
> Jenkins once a commit passes the nightly tests. So it should not
> impose any additional burden on the developers, as there is no manual
> step involved / human gatekeeper.
>
> It would be equivalent to your suggestion with tags. You can do the
> same with branches; a git branch is just a pointer to some commit, so
> I think we are talking about the same thing.
>
> Pedro.
>
>
>
>
> On Wed, Nov 1, 2017 at 5:41 PM, Bhavin Thaker <bh...@gmail.com>
> wrote:
> > A few comments/suggestions:
> >
> > 1) Can we have this nice list of to-do items on the Apache MXNet wiki
> > page to track them better?
> >
> > 2) Can we have a set of owners for each set of tests and source code
> > directory? One of the problems I have observed is that when there is a
> test
> > failure, it is difficult to find an owner who will take the
> responsibility
> > of fixing the test OR identifying the culprit code promptly -- this
> causes
> > the master to continue to fail for many days.
> >
> > 3) Specifically, we need an owner for the Windows setup -- nobody seems
> to
> > know much about it -- please feel free to correct me if required.
> >
> > 4) +1 to have a list of all feature requests on Jira or a similar
> commonly
> > and easily accessible system.
> >
> > 5) -1 to the branching model -- I was the gatekeeper for the branching
> > model at Informix for the database kernel code to be merged to master
> along
> > with my day-job of being a database kernel engineer for around 9 months
> and
> > hence have the opinion that a branching model just shifts the burden from
> > one place to another. We don't have a dedicated team to do the branching
> > model. If we really need a buildable master every day, then we could just
> > tag every successful build as last_clean_build on master -- use this tag
> to
> > get a clean master at any time. How many Apache projects are doing
> > development on separate branches?
> >
> > 6) FYI: Rahul (rahul003@) has fixed various warnings with this PR:
> > https://github.com/apache/incubator-mxnet/pull/7109 and has a test added
> > that fails for any warning found. We can build on top of his work.
> >
> > 7) FYI: For the unit-tests problems, Meghna identified that some of the
> > unit-test run times have increased significantly in the recent builds. We
> > need volunteers to help diagnose the root cause here:
> >
> > Unit Test Task     Build #337  Build #500  Build #556
> > Python 2: GPU Win          25          38          40
> > Python 3: GPU Win          15          38          46
> > Python 2: CPU              25          35          80
> > Python 3: CPU              14          28          72
> > R: CPU                     20          34          24
> > R: GPU                      5          24          24
> >
> >
> > 8) Ensure that all PRs submitted have corresponding documentation on
> > http://mxnet.io for it.  It may be fine to have documentation follow the
> > code changes as long as there is ownership that this task will be done
> in a
> > timely manner.  For example, I have requested the Nvidia team to submit
> PRs
> > to update documentation on http://mxnet.io for the Volta changes to
> MXNet.
> >
> >
> > 9) Ensure that mega-PRs have some level of design or architecture
> > document(s) shared on the Apache MXNet wiki. The mega-PR must have both
> > unit-tests and nightly/integration tests submitted to demonstrate
> > a high level of quality.
> >
> >
> > 10) Finally, how do we get ownership for code submitted to MXNet? When
> > something fails in a code segment that only a small set of folks know
> > about, what is the expected SLA for a response from them? When users
> deploy
> > MXNet in production environments, they will expect some form of SLA for
> > support and a patch release.
> >
> >
> > Regards,
> > Bhavin Thaker.
> >
> >
> >
> >
> >
> >
> > On Wed, Nov 1, 2017 at 8:20 AM, Pedro Larroy <
> pedro.larroy.lists@gmail.com>
> > wrote:
> >
> >> +1  That would be great.
> >>
> >> On Mon, Oct 30, 2017 at 5:35 PM, Hen <ba...@apache.org> wrote:
> >> > How about we ask for a new mxnet repo to store all the config in?
> >> >
> >> > On Fri, Oct 27, 2017 at 05:30 Pedro Larroy <
> pedro.larroy.lists@gmail.com
> >> >
> >> > wrote:
> >> >
> >> >> Just to provide a high level overview of the ideas and proposals
> >> >> coming from different sources for the requirements for testing and
> >> >> validation of builds:
> >> >>
> >> >> * Have Terraform files for the testing infrastructure: infrastructure
> >> >> as code (IaC), excluding embedded hardware that is neither emulated
> >> >> nor cloud based ("single command" replication of the testing
> >> >> infrastructure, no manual steps).
> >> >>
> >> >> * CI software based on Jenkins, unless someone thinks there's a
> better
> >> >> alternative.
> >> >>
> >> >> * Use autoscaling groups and improve staggered build + test steps to
> >> >> achieve higher parallelism and shorter feedback times.
> >> >>
> >> >> * Switch to a branching model based on stable master + integration
> >> >> branch. PRs are merged into dev/integration which runs extended
> >> >> nightly tests, which are
> >> >> then merged into master, preferably in an automated way after
> >> >> successful extended testing.
> >> >> Master is always tested, and always buildable. Release branches or
> >> >> tags in master as usual for releases.
> >> >>
> >> >> * Build + test feedback time targeting less than 15 minutes.
> >> >> (Currently a build on a 16-core machine takes 7 minutes.) This
> >> >> involves a lot of test refactoring: moving expensive tests / big
> >> >> smoke tests to nightlies on the integration branch, as well as tests
> >> >> on IoT devices and power and performance regressions...
> >> >>
> >> >> * Add code coverage and other quality metrics.
> >> >>
> >> >> * Eliminate warnings and treat warnings as errors. We have spent time
> >> >> tracking down "undefined behaviour" bugs that could have been caught
> >> >> by compiler warnings.
> >> >>
> >> >> Is there something I'm missing or additional things that come to your
> >> >> mind that you would wish to add?
> >> >>
> >> >> Pedro.
> >> >>
> >>
>

Re: [Proposal] Stabilizing Apache MXNet CI build system

Posted by Pedro Larroy <pe...@gmail.com>.
Hi Bhavin

Good suggestions.

I wanted to respond to your point #5

The promotion of integration to master would be done automatically by
Jenkins once a commit passes the nightly tests. So it should not
impose any additional burden on the developers, as there is no manual
step involved / human gatekeeper.

It would be equivalent to your suggestion with tags. You can do the
same with branches; a git branch is just a pointer to some commit, so
I think we are talking about the same thing.

Pedro.
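To make the tag-versus-branch equivalence concrete: whether the clean pointer is a `last_clean_build` tag on master (Bhavin's suggestion in point 5) or a master branch fed from an integration branch (Pedro's), it ends up naming the newest commit whose build passed. The minimal Python sketch below assumes the CI can report a pass/fail result per commit in chronological order; the build history and helper names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical CI history: (commit id, build passed?) pairs, oldest first.
Builds = List[Tuple[str, bool]]


def last_clean(builds: Builds) -> Optional[str]:
    """Return the newest commit whose build passed, or None if none did."""
    for commit, passed in reversed(builds):
        if passed:
            return commit
    return None


def update_ref(refs: Dict[str, str], name: str, builds: Builds) -> None:
    """Point the named ref (tag or branch) at the last clean commit, if any."""
    clean = last_clean(builds)
    if clean is not None:
        refs[name] = clean


if __name__ == "__main__":
    history = [("a1", True), ("b2", False), ("c3", True), ("d4", False)]
    refs: Dict[str, str] = {}
    update_ref(refs, "last_clean_build", history)  # tag model
    update_ref(refs, "master", history)            # branch model
    print(refs)
```

One practical difference the sketch ignores: git tags are conventionally immutable, so a moving tag has to be re-created with `git tag -f` and force-pushed, and existing clones may not pick up the moved tag without a forced fetch, whereas a branch simply advances on fetch.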




On Wed, Nov 1, 2017 at 5:41 PM, Bhavin Thaker <bh...@gmail.com> wrote:
> A few comments/suggestions:
>
> 1) Can we have this nice list of to-do items on the Apache MXNet wiki page
> to track them better?
>
> 2) Can we have a set of owners for each set of tests and source code
> directory? One of the problems I have observed is that when there is a test
> failure, it is difficult to find an owner who will take the responsibility
> of fixing the test OR identifying the culprit code promptly -- this causes
> the master to continue to fail for many days.
>
> 3) Specifically, we need an owner for the Windows setup -- nobody seems to
> know much about it -- please feel free to correct me if required.
>
> 4) +1 to have a list of all feature requests on Jira or a similar commonly
> and easily accessible system.
>
> 5) -1 to the branching model -- I was the gatekeeper for the branching
> model at Informix for the database kernel code to be merged to master along
> with my day-job of being a database kernel engineer for around 9 months and
> hence have the opinion that a branching model just shifts the burden from
> one place to another. We don't have a dedicated team to do the branching
> model. If we really need a buildable master every day, then we could just
> tag every successful build as last_clean_build on master -- use this tag to
> get a clean master at any time. How many Apache projects are doing
> development on separate branches?
>
> 6) FYI: Rahul (rahul003@) has fixed various warnings with this PR:
> https://github.com/apache/incubator-mxnet/pull/7109 and has a test added
> that fails for any warning found. We can build on top of his work.
>
> 7) FYI: For the unit-tests problems, Meghna identified that some of the
> unit-test run times have increased significantly in the recent builds. We
> need volunteers to help diagnose the root cause here:
>
> Unit Test Task     Build #337  Build #500  Build #556
> Python 2: GPU Win          25          38          40
> Python 3: GPU Win          15          38          46
> Python 2: CPU              25          35          80
> Python 3: CPU              14          28          72
> R: CPU                     20          34          24
> R: GPU                      5          24          24
>
>
> 8) Ensure that all PRs submitted have corresponding documentation on
> http://mxnet.io for it.  It may be fine to have documentation follow the
> code changes as long as there is ownership that this task will be done in a
> timely manner.  For example, I have requested the Nvidia team to submit PRs
> to update documentation on http://mxnet.io for the Volta changes to MXNet.
>
>
> 9) Ensure that mega-PRs have some level of design or architecture
> document(s) shared on the Apache MXNet wiki. The mega-PR must have both
> unit-tests and nightly/integration tests submitted to demonstrate
> a high level of quality.
>
>
> 10) Finally, how do we get ownership for code submitted to MXNet? When
> something fails in a code segment that only a small set of folks know
> about, what is the expected SLA for a response from them? When users deploy
> MXNet in production environments, they will expect some form of SLA for
> support and a patch release.
>
>
> Regards,
> Bhavin Thaker.
>
>
>
>
>
>
> On Wed, Nov 1, 2017 at 8:20 AM, Pedro Larroy <pe...@gmail.com>
> wrote:
>
>> +1  That would be great.
>>
>> On Mon, Oct 30, 2017 at 5:35 PM, Hen <ba...@apache.org> wrote:
>> > How about we ask for a new mxnet repo to store all the config in?
>> >
>> > On Fri, Oct 27, 2017 at 05:30 Pedro Larroy <pedro.larroy.lists@gmail.com
>> >
>> > wrote:
>> >
>> >> Just to provide a high level overview of the ideas and proposals
>> >> coming from different sources for the requirements for testing and
>> >> validation of builds:
>> >>
>> >> * Have Terraform files for the testing infrastructure: infrastructure
>> >> as code (IaC), excluding embedded hardware that is neither emulated
>> >> nor cloud based ("single command" replication of the testing
>> >> infrastructure, no manual steps).
>> >>
>> >> * CI software based on Jenkins, unless someone thinks there's a better
>> >> alternative.
>> >>
>> >> * Use autoscaling groups and improve staggered build + test steps to
>> >> achieve higher parallelism and shorter feedback times.
>> >>
>> >> * Switch to a branching model based on stable master + integration
>> >> branch. PRs are merged into dev/integration which runs extended
>> >> nightly tests, which are
>> >> then merged into master, preferably in an automated way after
>> >> successful extended testing.
>> >> Master is always tested, and always buildable. Release branches or
>> >> tags in master as usual for releases.
>> >>
>> >> * Build + test feedback time targeting less than 15 minutes.
>> >> (Currently a build on a 16-core machine takes 7 minutes.) This
>> >> involves a lot of test refactoring: moving expensive tests / big
>> >> smoke tests to nightlies on the integration branch, as well as tests
>> >> on IoT devices and power and performance regressions...
>> >>
>> >> * Add code coverage and other quality metrics.
>> >>
>> >> * Eliminate warnings and treat warnings as errors. We have spent time
>> >> tracking down "undefined behaviour" bugs that could have been caught
>> >> by compiler warnings.
>> >>
>> >> Is there something I'm missing or additional things that come to your
>> >> mind that you would wish to add?
>> >>
>> >> Pedro.
>> >>
>>

Re: [Proposal] Stabilizing Apache MXNet CI build system

Posted by Bhavin Thaker <bh...@gmail.com>.
A few comments/suggestions:

1) Can we put this list of to-do items on the Apache MXNet wiki page so we
can track them better?

2) Can we have a set of owners for each test suite and source-code
directory? One problem I have observed is that when a test fails, it is
difficult to find an owner who will take responsibility for fixing the test
or promptly identifying the culprit code. As a result, master keeps failing
for many days.

3) Specifically, we need an owner for the Windows setup. Nobody seems to
know much about it; please correct me if I am wrong.

4) +1 to tracking all feature requests in JIRA or a similar, easily
accessible system.

5) -1 to the branching model. For around 9 months at Informix, alongside my
day job as a database kernel engineer, I was the gatekeeper for merging
database kernel code to master under a branching model. That experience
tells me a branching model just shifts the burden from one place to
another, and we don't have a dedicated team to run it. If we really need a
buildable master every day, we could tag every successful build on master
as last_clean_build and use that tag to get a clean master at any time. How
many Apache projects do their development on separate branches?

6) FYI: Rahul (rahul003@) has fixed various warnings in this PR:
https://github.com/apache/incubator-mxnet/pull/7109 and added a test that
fails if any warning is found. We can build on top of his work.

7) FYI: On the unit-test problems, Meghna found that the run times of some
unit-test tasks have increased significantly in recent builds. We need
volunteers to help diagnose the root cause:

    Unit Test Task      Build #337  Build #500  Build #556
    Python 2: GPU Win           25          38          40
    Python 3: GPU Win           15          38          46
    Python 2: CPU               25          35          80
    Python 3: CPU               14          28          72
    R: CPU                      20          34          24
    R: GPU                       5          24          24


8) Ensure that every submitted PR has corresponding documentation on
http://mxnet.io. Documentation may follow the code changes, as long as
someone owns that task and it is done in a timely manner. For example, I
have asked the Nvidia team to submit PRs updating the documentation on
http://mxnet.io for the Volta changes to MXNet.


9) Ensure that mega-PRs have some level of design or architecture
documentation shared on the Apache MXNet wiki. A mega-PR must include both
unit tests and nightly/integration tests to demonstrate a high level of
quality.


10) Finally, how do we establish ownership of code submitted to MXNet? When
something fails in code that only a small set of people know about, what is
the expected SLA for a response from them? When users deploy MXNet in
production environments, they will expect some form of SLA for support and
patch releases.
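A minimal sketch of how the run times in item 7 could be screened for regressions, assuming the numbers are comparable run times per task (the table does not state units). The task names and values are copied from the table; the `flag_regressions` helper and the 2x threshold are illustrative choices, not an existing MXNet tool.

```python
from typing import Dict, List, Tuple

# Run times per unit-test task for builds #337, #500 and #556, copied from
# item 7. Units are not stated in the original table.
BUILDS = ("#337", "#500", "#556")
RUN_TIMES: Dict[str, Tuple[int, int, int]] = {
    "Python 2: GPU Win": (25, 38, 40),
    "Python 3: GPU Win": (15, 38, 46),
    "Python 2: CPU": (25, 35, 80),
    "Python 3: CPU": (14, 28, 72),
    "R: CPU": (20, 34, 24),
    "R: GPU": (5, 24, 24),
}


def flag_regressions(
    run_times: Dict[str, Tuple[int, ...]],
    start: int,
    end: int,
    threshold: float = 2.0,
) -> List[Tuple[str, float]]:
    """Return (task, slowdown ratio) for tasks whose run time grew by at least
    `threshold`x between build index `start` and build index `end`,
    sorted from the largest slowdown to the smallest."""
    flagged = []
    for task, times in run_times.items():
        ratio = times[end] / times[start]
        if ratio >= threshold:
            flagged.append((task, round(ratio, 2)))
    return sorted(flagged, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for start, end in ((0, 2), (1, 2)):
        print(f"{BUILDS[start]} -> {BUILDS[end]}:")
        for task, ratio in flag_regressions(RUN_TIMES, start, end):
            print(f"  {task}: {ratio}x")
```

Between #337 and #556 this flags Python 3: CPU (5.14x), R: GPU (4.8x), Python 2: CPU (3.2x) and Python 3: GPU Win (3.07x). Between #500 and #556 only the two Python CPU tasks more than doubled, which may help narrow down where to look.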


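The last_clean_build idea in item 5 could be automated by a CI job that runs after each green build on master. Below is a minimal sketch, assuming a remote named `origin` and that the job already knows the commit id it built; `mark_clean_build` is a hypothetical helper, not part of any existing MXNet tooling. It only uses the standard `git tag --force` and `git push --force` commands, and by default it returns them without running them.

```python
import string
import subprocess
from typing import List

TAG = "last_clean_build"  # tag name suggested in item 5


def retag_commands(sha: str, remote: str = "origin", tag: str = TAG) -> List[List[str]]:
    """Git commands that move `tag` to commit `sha` and publish it.

    Both commands need --force because the tag is moved on every clean build.
    """
    if not sha or any(c not in string.hexdigits for c in sha):
        raise ValueError(f"not a hex commit id: {sha!r}")
    return [
        ["git", "tag", "--force", tag, sha],
        ["git", "push", "--force", remote, f"refs/tags/{tag}"],
    ]


def mark_clean_build(sha: str, build_passed: bool, remote: str = "origin",
                     dry_run: bool = True) -> List[List[str]]:
    """Move the tag only when the build passed; otherwise it stays on the last clean commit."""
    if not build_passed:
        return []
    commands = retag_commands(sha, remote)
    if not dry_run:
        for cmd in commands:
            subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return commands


if __name__ == "__main__":
    for cmd in mark_clean_build("0a1b2c3", build_passed=True):
        print(" ".join(cmd))
```

With a recent Git, anyone who wants a known-good master could then run `git fetch --force --tags origin` followed by `git checkout last_clean_build`; `--force` is needed on fetch because the tag moves.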
Regards,
Bhavin Thaker.
On Wed, Nov 1, 2017 at 8:20 AM, Pedro Larroy <pe...@gmail.com>
wrote:

> +1  That would be great.
>
> On Mon, Oct 30, 2017 at 5:35 PM, Hen <ba...@apache.org> wrote:
> > How about we ask for a new mxnet repo to store all the config in?

Re: [Proposal] Stabilizing Apache MXNet CI build system

Posted by Pedro Larroy <pe...@gmail.com>.
+1  That would be great.

On Mon, Oct 30, 2017 at 5:35 PM, Hen <ba...@apache.org> wrote:
> How about we ask for a new mxnet repo to store all the config in?

Re: [Proposal] Stabilizing Apache MXNet CI build system

Posted by Hen <ba...@apache.org>.
How about we ask for a new mxnet repo to store all the config in?

On Fri, Oct 27, 2017 at 05:30 Pedro Larroy <pe...@gmail.com>
wrote:

> Just to provide a high level overview of the ideas and proposals
> coming from different sources for the requirements for testing and
> validation of builds:
>
> * Have terraform files for the testing infrastructure. Infrastructure
> as code (IaC). Minus not emulated / nor cloud based, embedded
> hardware. ("single command" replication of the testing infrastructure,
> no manual steps).
>
> * CI software based on Jenkins, unless someone thinks there's a better
> alternative.
>
> * Use autoscaling groups and improve staggered build + test steps to
> achieve higher parallelism and shorter feedback times.
>
> * Switch to a branching model based on stable master + integration
> branch. PRs are merged into dev/integration which runs extended
> nightly tests, which are
> then merged into master, preferably in an automated way after
> successful extended testing.
> Master is always tested, and always buildable. Release branches or
> tags in master as usual for releases.
>
> * Build + test feedback time targeting less than 15 minutes.
> (Currently a build in a 16x core takes 7m). This involves lot of
> refactoring of tests, move expensive tests / big smoke tests to
> nightlies on the integration branch, also tests on IoT devices / power
> and performance regressions...
>
> * Add code coverage and other quality metrics.
>
> * Eliminate warnings and treat warnings as errors. We have spent time
> tracking down "undefined behaviour" bugs that could have been caught
> by compiler warnings.
>
> Is there something I'm missing or additional things that come to your
> mind that you would wish to add?
>
> Pedro.
>

Re: [Proposal] Stabilizing Apache MXNet CI build system

Posted by Suneel Marthi <sm...@apache.org>.
+1

On Sat, Oct 28, 2017 at 5:29 AM, Chris Olivier <cj...@gmail.com>
wrote:

> IMHO, it would be nice to have Apache JIRA for mxnet where these sort of
> feature requests could be entered and publicly tracked and possibly taken
> up by whoever has cycles with the JIRA helping to avoid overlapping work.
> After the core system works, of course. WDYT?

Re: [Proposal] Stabilizing Apache MXNet CI build system

Posted by Chris Olivier <cj...@gmail.com>.
IMHO, it would be nice to have an Apache JIRA for MXNet where these sorts
of feature requests could be entered, tracked publicly, and picked up by
whoever has cycles, with JIRA helping to avoid overlapping work. After the
core system works, of course. WDYT?

On Fri, Oct 27, 2017 at 5:30 AM, Pedro Larroy <pe...@gmail.com>
wrote:

> Just to provide a high level overview of the ideas and proposals
> coming from different sources for the requirements for testing and
> validation of builds:
>
> * Have terraform files for the testing infrastructure. Infrastructure
> as code (IaC). Minus not emulated / nor cloud based, embedded
> hardware. ("single command" replication of the testing infrastructure,
> no manual steps).
>
> * CI software based on Jenkins, unless someone thinks there's a better
> alternative.
>
> * Use autoscaling groups and improve staggered build + test steps to
> achieve higher parallelism and shorter feedback times.
>
> * Switch to a branching model based on stable master + integration
> branch. PRs are merged into dev/integration which runs extended
> nightly tests, which are
> then merged into master, preferably in an automated way after
> successful extended testing.
> Master is always tested, and always buildable. Release branches or
> tags in master as usual for releases.
>
> * Build + test feedback time targeting less than 15 minutes.
> (Currently a build in a 16x core takes 7m). This involves lot of
> refactoring of tests, move expensive tests / big smoke tests to
> nightlies on the integration branch, also tests on IoT devices / power
> and performance regressions...
>
> * Add code coverage and other quality metrics.
>
> * Eliminate warnings and treat warnings as errors. We have spent time
> tracking down "undefined behaviour" bugs that could have been caught
> by compiler warnings.
>
> Is there something I'm missing or additional things that come to your
> mind that you would wish to add?
>
> Pedro.
>

Re: [Proposal] Stabilizing Apache MXNet CI build system

Posted by Pedro Larroy <pe...@gmail.com>.
Just to provide a high level overview of the ideas and proposals
coming from different sources for the requirements for testing and
validation of builds:

* Have Terraform files for the testing infrastructure: infrastructure
as code (IaC), so the testing infrastructure can be replicated with a
"single command" and no manual steps. The exception is embedded hardware,
which can be neither emulated nor cloud based.

* CI software based on Jenkins, unless someone thinks there's a better
alternative.

* Use autoscaling groups and improve staggered build + test steps to
achieve higher parallelism and shorter feedback times.

* Switch to a branching model based on a stable master plus an
integration branch. PRs are merged into a dev/integration branch, which
runs extended nightly tests; changes are then merged into master,
preferably automatically after the extended tests pass. Master is always
tested and always buildable. Releases use release branches or tags on
master, as usual.

* Target a build + test feedback time of less than 15 minutes.
(Currently a build on a 16-core machine takes 7 minutes.) This involves a
lot of test refactoring: moving expensive tests and big smoke tests to
nightly runs on the integration branch, along with tests on IoT devices and
power and performance regression tests.

* Add code coverage and other quality metrics.

* Eliminate warnings and treat warnings as errors. We have spent time
tracking down "undefined behaviour" bugs that could have been caught
by compiler warnings.
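One way to reason about the 15-minute target is to split the test suites across parallel workers and check the slowest worker against the budget. The sketch below uses greedy longest-processing-time-first (LPT) assignment, a standard scheduling heuristic swapped in here for illustration; the suite names and durations are hypothetical, and only the 7-minute build time and the 15-minute target come from this email. It also assumes tests start only after the build finishes.

```python
import heapq
from typing import Dict, List, Tuple

FEEDBACK_BUDGET_MIN = 15.0  # target from the proposal
BUILD_MIN = 7.0             # current build time quoted in the proposal


def shard_tests(durations: Dict[str, float], workers: int) -> List[List[str]]:
    """Assign suites to workers, longest first, each to the least-loaded worker."""
    if workers < 1:
        raise ValueError("need at least one worker")
    heap: List[Tuple[float, int]] = [(0.0, i) for i in range(workers)]  # (load, worker)
    shards: List[List[str]] = [[] for _ in range(workers)]
    for name, minutes in sorted(durations.items(), key=lambda kv: (-kv[1], kv[0])):
        load, idx = heapq.heappop(heap)
        shards[idx].append(name)
        heapq.heappush(heap, (load + minutes, idx))
    return shards


def makespan(shards: List[List[str]], durations: Dict[str, float]) -> float:
    """Wall-clock test time: the load of the busiest worker."""
    return max((sum(durations[n] for n in shard) for shard in shards), default=0.0)


if __name__ == "__main__":
    # Hypothetical per-suite durations in minutes (not measured MXNet numbers).
    suites = {"cpp_unit": 6.0, "python_cpu": 9.0, "python_gpu": 8.0,
              "r_cpu": 4.0, "scala": 5.0}
    shards = shard_tests(suites, workers=4)
    total = BUILD_MIN + makespan(shards, suites)
    print(shards, total, "within budget" if total <= FEEDBACK_BUDGET_MIN else "over budget")
```

With these made-up numbers the busiest worker needs 9 minutes, so build plus tests comes to 16 minutes, just over budget. Adding a fifth worker does not help, because a single 9-minute suite sets the floor; that is why splitting expensive suites or moving them to nightlies matters.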

Is there anything I'm missing, or anything else that comes to mind that you
would like to add?

Pedro.

Re: [Proposal] Stabilizing Apache MXNet CI build system

Posted by Pedro Larroy <pe...@gmail.com>.
Thanks for your input, guys. I think we are on a good track to get this
fixed, and I'm confident that Meghna and Marco are going to drive this to
success. We are collecting ideas and requirements for the document on how
we will revamp the testing infrastructure. My only question right now is
where to store this document so we can collaborate on it. I don't seem to
have permission to edit the Confluence wiki:
https://cwiki.apache.org/confluence/display/MXNET/Continuous+Integration

Otherwise, should we use a shared Google Doc, a GitHub wiki, or something
else?

Please advise.

Pedro.

On Thu, Oct 26, 2017 at 8:14 AM, Meghna Baijal <me...@gmail.com>
wrote:

> Thanks Sandeep for driving this discussion. I am also in contact with Pedro
> and his team to include their requirements.
> And thank you Sebastian, I will let you know!
>
> Meghna

Re: [Proposal] Stabilizing Apache MXNet CI build system

Posted by Meghna Baijal <me...@gmail.com>.
Thanks Sandeep for driving this discussion. I am also in contact with Pedro
and his team to include their requirements.
And thank you Sebastian, I will let you know!

Meghna

On Wed, Oct 25, 2017 at 11:05 PM, Sebastian <ss...@googlemail.com> wrote:

> @meghana @pedro let me know if you need someone with a mentor hat to open
> tickets or send mail to infra, happy to help here.
>
> Best,
> Sebastian
>
>
> On 25.10.2017 23:18, sandeep krishnamurthy wrote:
>
>> Thank you, everyone, for the discussion, proposal, and the vote.
>>
>> Here, the majority of community members see that the current CI system for
>> Apache MXNet has issues with scaling and with diverse test environments,
>> and the common suggestion is to have a separate CI setup for Apache MXNet.
>>
>> Following are the next steps:
>>
>> 1. Meghna offered to take the lead on this and come up with an initial
>> tech design write-up covering requirements, use cases, alternative
>> solutions, and a proposed solution for how we could set up the CI system
>> for MXNet.
>> 2. This tech design will be reviewed in the community; following that, we
>> will collaborate with the Infra team and mentors to complete the setup,
>> including integration of the new system with the repo, the website, and
>> more.
>>
>> @Pedro Larroy - We should sync up on how we can unify the setup you have
>> for various devices with the new setup being proposed and built. Ideally,
>> we should have a unified CI setup for the project that is accessible to
>> the community.
>>
>> Regards,
>> Sandeep
>>
>> On Mon, Oct 23, 2017 at 7:29 AM, Pedro Larroy <
>> pedro.larroy.lists@gmail.com>
>> wrote:
>>
>> +1
>>>
>>> We (with Kellen and Marco) are already working on a CI system that
>>> verifies
>>> MXNet on devices, so far a work in progress, but at least we are checking
>>> that the build is sane on Android, different arm flavors and ubuntu, also
>>> building PRs. So far we are still working on having the unit tests pass
>>> on
>>> some architectures like Jetson TX2 and ARM / Raspberry PI.
>>>
>>> http://ci.mxnet.amazon-ml.com/
>>>
>>> Agree with Steffen on creating a document with requirements and high
>>> level
>>> architecture. Also I would like to have quicker feedback and as we
>>> discussed before, saner unit tests. I think there's a big and nontrivial
>>> amount of effort required here.
>>>
>>> Pedro.
>>>

Re: [Proposal] Stabilizing Apache MXNet CI build system

Posted by Sebastian <ss...@googlemail.com>.
@meghana @pedro, let me know if you need someone with a mentor hat to
open tickets or send mail to Infra; happy to help here.

Best,
Sebastian

On 25.10.2017 23:18, sandeep krishnamurthy wrote:
> Thank you, everyone, for the discussion, proposal, and the vote.
> 
> The majority of community members see that the current CI system for
> Apache MXNet has issues with scaling and with supporting diverse test
> environments. The common suggestion is to have a separate CI setup for
> Apache MXNet.
> 
> Following are the next steps:
> 
> 1. Meghana has offered to take the lead on this and come up with an
> initial tech design write-up covering requirements, use cases, alternative
> solutions, and a proposed solution for how we could set up the CI system
> for MXNet.
> 2. The community will review this tech design. Following that, we will
> collaborate with the Infra team and mentors to complete the setup and the
> integration of the new system with the repo, the website, and more.
> 
> @Pedro Larroy - We should sync up on how we can unify the setup you have
> for various devices with the new setup being proposed and built. Ideally,
> we should have a unified CI setup for the project that is accessible to
> the community.
> 
> Regards,
> Sandeep

Re: [Proposal] Stabilizing Apache MXNet CI build system

Posted by sandeep krishnamurthy <sa...@gmail.com>.
Thank you, everyone, for the discussion, proposal, and the vote.

The majority of community members see that the current CI system for Apache
MXNet has issues with scaling and with supporting diverse test environments.
The common suggestion is to have a separate CI setup for Apache MXNet.

Following are the next steps:

1. Meghana has offered to take the lead on this and come up with an initial
tech design write-up covering requirements, use cases, alternative solutions,
and a proposed solution for how we could set up the CI system for MXNet.
2. The community will review this tech design. Following that, we will
collaborate with the Infra team and mentors to complete the setup and the
integration of the new system with the repo, the website, and more.

@Pedro Larroy - We should sync up on how we can unify the setup you have for
various devices with the new setup being proposed and built. Ideally, we
should have a unified CI setup for the project that is accessible to the
community.

Regards,
Sandeep

On Mon, Oct 23, 2017 at 7:29 AM, Pedro Larroy <pe...@gmail.com>
wrote:

> +1
>
> We (with Kellen and Marco) are already working on a CI system that verifies
> MXNet on devices. It is still a work in progress, but we are already
> checking that the build is sane on Android, different ARM flavors, and
> Ubuntu, and we are also building PRs. We are still working on getting the
> unit tests to pass on some architectures, such as Jetson TX2 and ARM /
> Raspberry Pi.
>
> http://ci.mxnet.amazon-ml.com/
>
> I agree with Steffen on creating a document with the requirements and
> high-level architecture. I would also like quicker feedback and, as we
> discussed before, saner unit tests. I think a big and nontrivial amount of
> effort is required here.
>
> Pedro.



-- 
Sandeep Krishnamurthy

Re: [Proposal] Stabilizing Apache MXNet CI build system

Posted by Pedro Larroy <pe...@gmail.com>.
+1

We (with Kellen and Marco) are already working on a CI system that verifies
MXNet on devices. It is still a work in progress, but we are already checking
that the build is sane on Android, different ARM flavors, and Ubuntu, and we
are also building PRs. We are still working on getting the unit tests to pass
on some architectures, such as Jetson TX2 and ARM / Raspberry Pi.

http://ci.mxnet.amazon-ml.com/

I agree with Steffen on creating a document with the requirements and
high-level architecture. I would also like quicker feedback and, as we
discussed before, saner unit tests. I think a big and nontrivial amount of
effort is required here.

Pedro.

On Mon, Oct 23, 2017 at 6:43 AM, Steffen Rochel <st...@gmail.com>
wrote:

> +1
> I support Option 1 - Set up a separate Jenkins CI build system. While the
> Apache service is appropriate for some projects, in our experience over the
> last 6 months it has not met the needs of the MXNet (incubating) project.
> AWS has provided, and will continue to provide, resources for such a project.
> I agree we should create a document summarizing the requirements and
> high-level architecture, which should answer the question of Jenkins versus
> an alternative.
>
> Steffen

Re: [Proposal] Stabilizing Apache MXNet CI build system

Posted by Steffen Rochel <st...@gmail.com>.
+1
I support Option 1 - Set up a separate Jenkins CI build system. While the
Apache service is appropriate for some projects, our experience over the
last 6 months is that it has not met the needs of the MXNet (incubating)
project. AWS has provided, and will continue to provide, resources for such
a project. I agree we should create a document summarizing the requirements
and high-level architecture, which should answer the question of whether to
use Jenkins or an alternative.

Steffen

On Sat, Oct 21, 2017 at 6:51 PM shiwen hu <ya...@gmail.com> wrote:

> +1

Re: [Proposal] Stabilizing Apache MXNet CI build system

Posted by shiwen hu <ya...@gmail.com>.
+1


2017-10-21 9:48 GMT+08:00 Chris Olivier <cj...@gmail.com>:

> OK, just looking for anything that can cut out a task if possible. I do
> support no longer using the Apache Jenkins server; it really hasn't been
> working out, for various reasons. But having a person full time is
> something that Steffen would have to address, I imagine.

Re: [Proposal] Stabilizing Apache MXNet CI build system

Posted by Chris Olivier <cj...@gmail.com>.
OK, just looking for anything that can cut out a task if possible. I do
support no longer using the Apache Jenkins server; it really hasn't been
working out, for various reasons. But having a person full time is
something that Steffen would have to address, I imagine.

On Fri, Oct 20, 2017 at 6:03 PM Mu Li <mu...@gmail.com> wrote:

> I don't see a clear advantage of CodePipeline over plain Jenkins, because
> we don't need to deploy here.

Re: [Proposal] Stabilizing Apache MXNet CI build system

Posted by Mu Li <mu...@gmail.com>.
I don't see a clear advantage of CodePipeline over plain Jenkins, because
we don't need to deploy here.

On Fri, Oct 20, 2017 at 5:34 PM, Chris Olivier <cj...@gmail.com>
wrote:

> CodePipeline, then.  You can point it to Jenkins instances.

Re: [Proposal] Stabilizing Apache MXNet CI build system

Posted by Chris Olivier <cj...@gmail.com>.
CodePipeline, then.  You can point it to Jenkins instances.


On Fri, Oct 20, 2017 at 4:49 PM Mu Li <mu...@gmail.com> wrote:

> AWS CodeBuild is not an option. It doesn't support GPU instances, Mac OS X,
> or Windows, not to mention edge devices.

Re: [Proposal] Stabilizing Apache MXNet CI build system

Posted by Mu Li <mu...@gmail.com>.
AWS CodeBuild is not an option. It doesn't support GPU instances, Mac OS X,
or Windows, not to mention edge devices.

On Fri, Oct 20, 2017 at 4:07 PM, Chris Olivier <cj...@gmail.com>
wrote:

> Why don't we look into the fully managed AWS CodeBuild? It maintains
> everything, and it's also compatible with Jenkins.

Re: [Proposal] Stabilizing Apache MXNet CI build system

Posted by Chris Olivier <cj...@gmail.com>.
Why don't we look into fully managed AWS CodeBuild? It maintains everything
for us, and it's also compatible with Jenkins.

On Fri, Oct 20, 2017 at 1:51 PM, Tianqi Chen <tq...@cs.washington.edu>
wrote:

> +1
>
> Tianqi
> On Fri, Oct 20, 2017 at 1:39 PM Mu Li <mu...@gmail.com> wrote:
>
> > +1
> >
> >
> > It seems that the Apache CI is quite overloaded these days, and MXNet's CI
> > pipeline is too complex to run there. In addition, we may need to add more
> > devices, e.g. Mac Pro and Raspberry Pi, to the server, and more tasks such
> > as pip builds. That means a lot of requests to the Infra team.
> >
> > We can reuse our previous Jenkins server at http://ci.mxnet.io/, but we
> > would probably need a dedicated developer to maintain it.
> >

Re: [Proposal] Stabilizing Apache MXNet CI build system

Posted by Tianqi Chen <tq...@cs.washington.edu>.
+1

Tianqi
On Fri, Oct 20, 2017 at 1:39 PM Mu Li <mu...@gmail.com> wrote:

> +1
>
>
> It seems that the Apache CI is quite overloaded these days, and MXNet's CI
> pipeline is too complex to run there. In addition, we may need to add more
> devices, e.g. Mac Pro and Raspberry Pi, to the server, and more tasks such
> as pip builds. That means a lot of requests to the Infra team.
>
> We can reuse our previous Jenkins server at http://ci.mxnet.io/, but we
> would probably need a dedicated developer to maintain it.

Re: [Proposal] Stabilizing Apache MXNet CI build system

Posted by Mu Li <mu...@gmail.com>.
+1


It seems that the Apache CI is quite overloaded these days, and MXNet's CI
pipeline is too complex to run there. In addition, we may need to add more
devices, e.g. Mac Pro and Raspberry Pi, to the server, and more tasks such
as pip builds. That means a lot of requests to the Infra team.

We can reuse our previous Jenkins server at http://ci.mxnet.io/, but we
would probably need a dedicated developer to maintain it.




Re: [Proposal] Stabilizing Apache MXNet CI build system

Posted by Meghna Baijal <me...@gmail.com>.
Chris,
The Windows slaves on Apache Jenkins use Elastic IPs (EIPs), which makes it
easier to replace/reboot/reconnect these instances. However, there are
reasons why EIPs cannot be used for the Ubuntu slaves.
Several workarounds are being explored for this, and one such solution is
to use the AWS CodeBuild plugin with Jenkins:

1. Jenkins has a plugin that integrates with AWS CodeBuild, which can be
used to automate slave management.
2. The idea is to configure only the *Ubuntu* slaves using this plugin.
This addresses both the EIP issue and automation on Ubuntu.
3. Other platforms, such as Windows and edge devices, continue to be
configured directly through Jenkins without this plugin. This is OK
since the Windows slaves use EIPs anyway.

At this point this is only at the POC stage.
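
The per-platform split in the list above amounts to a small lookup table. The sketch below expresses it in Python purely as an illustration: the `Provisioning` enum, the platform keys, and `provisioning_for` are hypothetical names, not part of Jenkins, the AWS CodeBuild plugin, or MXNet's CI configuration, and the code models only the decision, not any AWS or Jenkins calls.

```python
from enum import Enum


class Provisioning(Enum):
    # Build environments launched on demand through the Jenkins AWS CodeBuild plugin.
    CODEBUILD_PLUGIN = "codebuild-plugin"
    # Agent configured directly in Jenkins and reachable at a fixed Elastic IP,
    # so the instance can be replaced/rebooted without reconfiguring the master.
    JENKINS_AGENT_WITH_EIP = "jenkins-agent-eip"
    # Agent configured directly in Jenkins without the plugin (e.g. edge devices).
    JENKINS_AGENT = "jenkins-agent"


# Hypothetical policy table mirroring the POC: only Ubuntu goes through the plugin.
POLICY = {
    "ubuntu": Provisioning.CODEBUILD_PLUGIN,
    "windows": Provisioning.JENKINS_AGENT_WITH_EIP,
    "edge": Provisioning.JENKINS_AGENT,
}


def provisioning_for(platform: str) -> Provisioning:
    """Look up how build agents for a platform are provisioned (case-insensitive)."""
    try:
        return POLICY[platform.lower()]
    except KeyError:
        raise ValueError(f"no provisioning rule for platform {platform!r}") from None


if __name__ == "__main__":
    for name in ("ubuntu", "windows", "edge"):
        print(name, "->", provisioning_for(name).value)
```

Keeping the decision in one table would make moving another platform onto the plugin (if the POC works out) a one-line change; whether the real setup is organized this way is an assumption.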

Thanks,
Meghna Baijal

On Thu, Nov 9, 2017 at 12:23 PM, Meghna Baijal <me...@gmail.com>
wrote:

> Pedro, I created a row for BuildBot in the doc. Do you want to add some
> pros and cons about it? It would be good to have all this information
> collected in one place.
>
> Meghna
>
> On Thu, Nov 9, 2017 at 4:40 AM, Larroy, Pedro <pl...@amazon.de> wrote:
>
>> Thanks a lot for the document and for leading the discussion.
>>
>> Does anybody have experience with a build system other than Jenkins? In
>> the document we mention TeamCity as a possible option, and there is also
>> Buildbot, the second leading open source CI tool, which is not mentioned.
>>
>> I’m not sure we have strong enough evidence to make an informed decision
>> about using something other than Jenkins; also, from the document I gather
>> that the negatives of Jenkins are pretty minor compared to the other
>> frameworks.
>>
>> I would be interested to hear from anybody who has used another framework
>> in depth and is willing to vote against Jenkins, so that we can all cast an
>> informed vote.
>>
>> I don’t feel comfortable voting for Jenkins because it is the only one I
>> know as well.
>>
>> Kind regards.
>> --
>>
>> Pedro
>>
>> Amazon Development Center Germany GmbH
>> Berlin - Dresden - Aachen
>> main office: Krausenstr. 38, 10117 Berlin
>> Geschaeftsfuehrer: Dr. Ralf Herbrich, Christian Schlaeger
>> Ust-ID: DE289237879
>> Eingetragen am Amtsgericht Charlottenburg HRB 149173 B
>>
>
>

Re: [Proposal] Stabilizing Apache MXNet CI build system

Posted by Meghna Baijal <me...@gmail.com>.
Pedro, I created a row for BuildBot in the doc. Do you want to add some
pros and cons about it? It would be good to have all this information
collected in one place.

Meghna

On Thu, Nov 9, 2017 at 4:40 AM, Larroy, Pedro <pl...@amazon.de> wrote:

> Thanks a lot for the document and for leading the discussion.
>
> Does anybody have experience with a build system other than Jenkins? In
> the document we mention TeamCity as a possible option, and there is also
> Buildbot, the second leading open source CI tool, which is not mentioned.
>
> I’m not sure we have strong enough evidence to make an informed decision
> about using something other than Jenkins; also, from the document I gather
> that the negatives of Jenkins are pretty minor compared to the other
> frameworks.
>
> I would be interested to hear from anybody who has used another framework
> in depth and is willing to vote against Jenkins, so that we can all cast an
> informed vote.
>
> I don’t feel comfortable voting for Jenkins because it is the only one I
> know as well.
>
> Kind regards.
> --
>
> Pedro
>
> On 08/11/17 23:41, "Meghna Baijal" <me...@gmail.com> wrote:
>
>     Thanks for the active discussion on the document for the new CI for MXNet.
>     Now that many of you have reviewed it, do you think I should start a vote
>     on which framework the community wants to move forward with?
>
>     Thanks,
>     Meghna
>
>     On Mon, Nov 6, 2017 at 6:59 PM, Chris Olivier <cj...@gmail.com> wrote:
>
>     > After a decision is reached, I am willing to add tasks to the Apache
>     > MXNet JIRA.
>     >
>     > On Mon, Nov 6, 2017 at 6:15 AM, Pedro Larroy <pedro.larroy.lists@gmail.com>
>     > wrote:
>     >
>     > > Thanks for setting up the document, guys; it looks like a solid basis
>     > > to start working on!
>     > >
>     > > Marco, Kellen and I have already added some comments.
>     > >
>     > > Pedro
>     > >
>     > >
>     > > On Sun, Nov 5, 2017 at 3:43 AM, Meghna Baijal
>     > > <me...@gmail.com> wrote:
>     > > > Kellen, thank you for your comments in the doc.
>     > > > Sure, Steffen, I will continue to merge everyone’s comments into the
>     > > > doc and work with Pedro to finalize it.
>     > > > And then we can vote on the options.
>     > > >
>     > > > Thanks,
>     > > > Meghna Baijal
>     > > >
>     > > >
>     > > > On Sat, Nov 4, 2017 at 6:34 AM, Steffen Rochel <steffenrochel@gmail.com>
>     > > > wrote:
>     > > >
>     > > >> Sandeep and Meghna have been working in the background collecting
>     > > >> input and preparing a doc. I suggest we drive the discussion forward,
>     > > >> and I would like to ask everybody to contribute to
>     > > >> https://docs.google.com/document/d/17PEasQ2VWrXi2Cf7IGZSWGZMawxDkdlavUDASzUmLjk/edit?usp=sharing
>     > > >>
>     > > >> Let's converge on requirements and architecture so we can move
>     > > >> forward with implementation.
>     > > >>
>     > > >> I would like to suggest that Pedro and Meghna lead the discussion and
>     > > >> help resolve suggestions.
>     > > >>
>     > > >> I assume we need a vote once we have converged on a good draft, to
>     > > >> call it a plan and move forward with implementation. As we are all
>     > > >> unhappy with the current CI situation, I would also suggest a phased
>     > > >> approach, so we can get back to reliable and efficient basic CI
>     > > >> quickly and add advanced capabilities over time.
>     > > >>
>     > > >> Steffen
>     > > >>
>     > > >> On Wed, Nov 1, 2017 at 1:14 PM kellen sunderland <
>     > > >> kellen.sunderland@gmail.com> wrote:
>     > > >>
>     > > >> > Hey Henri, I think that's what a few of us are advocating: running
>     > > >> > a set of quick tests as part of the PR process, and then a more
>     > > >> > detailed regression test suite periodically (say every 4 hours).
>     > > >> > This fits nicely into a tagging or two-branch development system.
>     > > >> > Commits will be tagged (or merged into a stable branch) as soon as
>     > > >> > they pass the detailed regression testing.
>     > > >> >
>     > > >> > On Wed, Nov 1, 2017 at 9:07 PM, Hen <ba...@apache.org> wrote:
>     > > >> >
>     > > >> > > Random question: can the CI be split such that the Apache CI does
>     > > >> > > a basic set of checks on that hardware and is hooked to a PR,
>     > > >> > > while a larger "Is trunk good for release?" test runs
>     > > >> > > periodically rather than on every PR?
>     > > >> > >
>     > > >> > > i.e., do we need each PR to be run on varied hardware, or can we
>     > > >> > > have this two-tier approach?
>     > > >> > >
>     > > >> > > Hen
>
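
The two-tier approach Hen and Kellen describe in the quoted messages above (quick checks on every PR, a fuller regression suite run periodically against trunk, and tagging or merging into a stable branch only the commits that pass it) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the suite names, the `Commit` record, and the helpers `suite_for` and `stable_candidate` are hypothetical and do not come from MXNet's Jenkinsfile or any Jenkins API.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical suite names; the real split would come from the CI design doc.
QUICK_SUITE: List[str] = ["lint", "unit-cpu"]  # tier 1: runs on every PR
FULL_SUITE: List[str] = QUICK_SUITE + ["unit-gpu", "integration", "windows", "edge-devices"]  # tier 2


@dataclass
class Commit:
    sha: str
    # Result of the periodic full regression run; None means not yet covered by one.
    full_suite_passed: Optional[bool] = None


def suite_for(trigger: str) -> List[str]:
    """Pick the test tier for a CI trigger."""
    if trigger == "pull_request":
        return QUICK_SUITE  # fast feedback on each PR
    if trigger == "periodic":
        return FULL_SUITE  # e.g. scheduled every 4 hours against trunk
    raise ValueError(f"unknown trigger: {trigger!r}")


def stable_candidate(history: List[Commit]) -> Optional[str]:
    """Return the newest commit (history is oldest-first) that passed the full suite.

    This is the commit that would be tagged, or merged into a stable branch.
    """
    for commit in reversed(history):
        if commit.full_suite_passed:
            return commit.sha
    return None


if __name__ == "__main__":
    history = [Commit("a1b2c3", True), Commit("d4e5f6", False), Commit("0a9b8c")]
    print(suite_for("pull_request"))  # ['lint', 'unit-cpu']
    print(stable_candidate(history))  # a1b2c3
```

Keeping the periodic suite a superset of the PR suite means a commit promoted to stable has also passed every PR-level check; how the periodic run is actually scheduled (a cron trigger, a separate job, etc.) is left open here.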

Re: [Proposal] Stabilizing Apache MXNet CI build system

Posted by "Larroy, Pedro" <pl...@amazon.de>.
Thanks a lot for the document and for leading the discussion.

Does anybody have experience with a build system other than Jenkins? In the document we mention TeamCity as a possible option, and there is also Buildbot, the second leading open source CI tool, which is not mentioned.

I’m not sure we have strong enough evidence to make an informed decision about using something other than Jenkins; also, from the document I gather that the negatives of Jenkins are pretty minor compared to the other frameworks.

I would be interested to hear from anybody who has used another framework in depth and is willing to vote against Jenkins, so that we can all cast an informed vote.

I don’t feel comfortable voting for Jenkins because it is the only one I know as well.

Kind regards.
-- 

Pedro

On 08/11/17 23:41, "Meghna Baijal" <me...@gmail.com> wrote:

    Thanks for the active discussion on the document for the new CI for MXNet.
    Now that many of you have reviewed it, do you think I should start a vote
    on which framework the community wants to move forward with?
    
    Thanks,
    Meghna
    
    On Mon, Nov 6, 2017 at 6:59 PM, Chris Olivier <cj...@gmail.com> wrote:
    
    > After a decision is reached, I am willing to add tasks to the Apache
    > MXNet JIRA.
    >
    > On Mon, Nov 6, 2017 at 6:15 AM, Pedro Larroy <pedro.larroy.lists@gmail.com>
    > wrote:
    >
    > > Thanks for setting up the document, guys; it looks like a solid basis
    > > to start working on!
    > >
    > > Marco, Kellen and I have already added some comments.
    > >
    > > Pedro
    > >
    > >
    > > On Sun, Nov 5, 2017 at 3:43 AM, Meghna Baijal
    > > <me...@gmail.com> wrote:
    > > > Kellen, thank you for your comments in the doc.
    > > > Sure, Steffen, I will continue to merge everyone’s comments into the
    > > > doc and work with Pedro to finalize it.
    > > > And then we can vote on the options.
    > > >
    > > > Thanks,
    > > > Meghna Baijal
    > > >
    > > >
    > > > On Sat, Nov 4, 2017 at 6:34 AM, Steffen Rochel <steffenrochel@gmail.com>
    > > > wrote:
    > > >
    > > >> Sandeep and Meghna have been working in the background collecting
    > > >> input and preparing a doc. I suggest we drive the discussion forward,
    > > >> and I would like to ask everybody to contribute to
    > > >> https://docs.google.com/document/d/17PEasQ2VWrXi2Cf7IGZSWGZMawxDkdlavUDASzUmLjk/edit?usp=sharing
    > > >>
    > > >> Let's converge on requirements and architecture so we can move
    > > >> forward with implementation.
    > > >>
    > > >> I would like to suggest that Pedro and Meghna lead the discussion and
    > > >> help resolve suggestions.
    > > >>
    > > >> I assume we need a vote once we have converged on a good draft, to
    > > >> call it a plan and move forward with implementation. As we are all
    > > >> unhappy with the current CI situation, I would also suggest a phased
    > > >> approach, so we can get back to reliable and efficient basic CI
    > > >> quickly and add advanced capabilities over time.
    > > >>
    > > >> Steffen
    > > >>
    > > >> On Wed, Nov 1, 2017 at 1:14 PM kellen sunderland <
    > > >> kellen.sunderland@gmail.com> wrote:
    > > >>
    > > >> > Hey Henri, I think that's what a few of us are advocating.  Running
    > a
    > > set
    > > >> > of quick tests as part of the PR process, and then a more detailed
    > > >> > regression test suite periodically (say every 4 hours). This fits
    > > nicely
    > > >> > into a tagging or 2 branch development system.  Commits will be
    > tagged
    > > >> (or
    > > >> > merged into a stable branch) as soon as they pass the detailed
    > > regression
    > > >> > testing.
    > > >> >
    > > >> > On Wed, Nov 1, 2017 at 9:07 PM, Hen <ba...@apache.org> wrote:
    > > >> >
    > > >> > > Random question - can the CI be split such that the Apache CI is
    > > doing
    > > >> a
    > > >> > > basic set of checks on that hardware, and is hooked to a PR, while
    > > >> there
    > > >> > is
    > > >> > > a larger "Is trunk good for release?" test that is running
    > > periodically
    > > >> > > rather than on every PR?
    > > >> > >
    > > >> > > ie: do we need each PR to be run on varied hardware, or can we
    > have
    > > >> this
    > > >> > > two tier approach?
    > > >> > >
    > > >> > > Hen
    > > >> > >
    > > >> > > On Fri, Oct 20, 2017 at 1:01 PM, sandeep krishnamurthy <
    > > >> > > sandeep.krishna98@gmail.com> wrote:
    > > >> > >
    > > >> > > > Hello all,
    > > >> > > >
    > > >> > > > I am hereby opening up a discussion thread on how we can
    > stabilize
    > > >> > Apache
    > > >> > > > MXNet CI build system.
    > > >> > > >
    > > >> > > > Problems:
    > > >> > > >
    > > >> > > > ========
    > > >> > > >
    > > >> > > > Recently, we have seen following issues with Apache MXNet CI
    > build
    > > >> > > systems:
    > > >> > > >
    > > >> > > >    1. Apache Jenkins master is overloaded and we see issues
    > like -
    > > >> > unable
    > > >> > > >    to trigger builds, difficult to load and view the blue ocean
    > > and
    > > >> > other
    > > >> > > >    Jenkins build status page.
    > > >> > > >    2. We are generating too many request/interaction on Apache
    > > Infra
    > > >> > > team.
    > > >> > > >       1. Addition/deletion of new slave: Caused from scaling
    > > >> activity,
    > > >> > > >       recycling, troubleshooting or any actions leading to
    > change
    > > of
    > > >> > > slave
    > > >> > > >       machines.
    > > >> > > >       2. Plugins / other Jenkins Master configurations.
    > > >> > > >       3. Experimentation on CI pipelines.
    > > >> > > >    3. Harder to debug and resolve issues - Since access to
    > master
    > > and
    > > >> > > slave
    > > >> > > >    is not with the same community, it requires Infra and
    > > community to
    > > >> > > dive
    > > >> > > >    deep together on all action items.
    > > >> > > >
    > > >> > > > Possible Solutions:
    > > >> > > >
    > > >> > > > ==============
    > > >> > > >
    > > >> > > >    1. Can we set up a separate Jenkins CI build system for
    > Apache
    > > >> MXNet
    > > >> > > >    outside Apache Infra?
    > > >> > > >    2. Can we have a separate Jenkins Master in Apache Infra for
    > > >> MXNet?
    > > >> > > >    3. Review design of current setup, refine and fill the gaps.
    > > >> > > >
    > > >> > > > @ Mentors/Infra team/Community:
    > > >> > > >
    > > >> > > > ==========================
    > > >> > > >
    > > >> > > > Please provide your suggestions on how we can proceed further
    > and
    > > >> work
    > > >> > on
    > > >> > > > stabilizing the CI build systems for MXNet.
    > > >> > > >
    > > >> > > > Also, if the community decides on separate Jenkins CI build
    > > system,
    > > >> > what
    > > >> > > > important points should be taken care of apart from the below:
    > > >> > > >
    > > >> > > >    1. Community being able to access the build page for build
    > > >> statuses.
    > > >> > > >    2. Committers being able to login with apache credentials.
    > > >> > > >    3. Hook setup from apache/incubator-mxnet repo to Jenkins
    > > master.
    > > >> > > >
    > > >> > > >
    > > >> > > > Irrespective of the solution we come up, I think we should
    > > initiate a
    > > >> > > > technical design discussion on how to setup the CI build system.
    > > >> > > Probably 1
    > > >> > > > or 2 pager documents with the architecture and review with Infra
    > > and
    > > >> > > > community members.
    > > >> > > >
    > > >> > > > ***There were few proposal and discussion on the slack channel,
    > to
    > > >> > reach
    > > >> > > > wider community members, moving that discussion formally to this
    > > >> list.
    > > >> > > >
    > > >> > > >
    > > >> > > > My Proposal: Option 1 - Set up separate Jenkins CI build system.
    > > >> > > >
    > > >> > > > Thanks,
    > > >> > > >
    > > >> > > > Sandeep
    > > >> > > >
    > > >> > > >
    > > >> > > >
    > > >> > > > --
    > > >> > > > Sandeep Krishnamurthy
    > > >> > > >
    > > >> > >
    > > >> >
    > > >>
    > >
    >
    


Re: [Proposal] Stabilizing Apache MXNet CI build system

Posted by Chris Olivier <cj...@gmail.com>.
Can you please clarify the AWS CodeBuild/Windows issue? Does the document
state there is a workaround? I didn’t fully understand.


On Wed, Nov 8, 2017 at 9:32 PM sandeep krishnamurthy <
sandeep.krishna98@gmail.com> wrote:

> Good work, Meghna, and thanks to the community members for participating in the
> discussion and providing valuable input.
> Yes, please share the document again and ask for a vote and broader
> input.
>
>
> --
> Sandeep Krishnamurthy
>

Re: [Proposal] Stabilizing Apache MXNet CI build system

Posted by sandeep krishnamurthy <sa...@gmail.com>.
Good work, Meghna, and thanks to the community members for participating in the
discussion and providing valuable input.
Yes, please share the document again and ask for a vote and broader
input.

On Wed, Nov 8, 2017 at 2:43 PM, Chris Olivier <cj...@gmail.com> wrote:

> +1
>



-- 
Sandeep Krishnamurthy

Re: [Proposal] Stabilizing Apache MXNet CI build system

Posted by Chris Olivier <cj...@gmail.com>.
+1

On Wed, Nov 8, 2017 at 2:40 PM Meghna Baijal <me...@gmail.com>
wrote:

> Thanks for the active discussion on the document for the new CI for MXNet.
> Now that many of you have reviewed it, do you think I should start a vote
> on which framework the community wants to move forward with?
>
> Thanks,
> Meghna

Re: [Proposal] Stabilizing Apache MXNet CI build system

Posted by Meghna Baijal <me...@gmail.com>.
Thanks for the active discussion on the document for the new CI for MXNet.
Now that many of you have reviewed it, do you think I should start a vote
on which framework the community wants to move forward with?

Thanks,
Meghna

On Mon, Nov 6, 2017 at 6:59 PM, Chris Olivier <cj...@gmail.com> wrote:

> After a decision is reached, I am willing to add tasks to Apache MXNet JIRA
>
> On Mon, Nov 6, 2017 at 6:15 AM, Pedro Larroy <pedro.larroy.lists@gmail.com>
> wrote:
>
> > Thanks for setting up the document guys, looks like a solid basis to
> > start to work on!
> >
> > Marco, Kellen and I have already added some comments.
> >
> > Pedro
> >
> >
> > On Sun, Nov 5, 2017 at 3:43 AM, Meghna Baijal
> > <me...@gmail.com> wrote:
> > > Kellen, Thank you for your comments in the doc.
> > > Sure Steffen, I will continue to merge everyone’s comments into the doc
> > and
> > > work with Pedro to finalize it.
> > > And then we can vote on the options.
> > >
> > > Thanks,
> > > Meghna Baijal
> > >
> > >
> > > On Sat, Nov 4, 2017 at 6:34 AM, Steffen Rochel <
> steffenrochel@gmail.com>
> > > wrote:
> > >
> > >> Sandeep and Meghna have been working in background collecting input
> and
> > >> preparing a doc. I suggest to drive discussion forward and would like
> to
> > >> ask everybody to contribute to
> > >> https://docs.google.com/document/d/17PEasQ2VWrXi2Cf7IGZSWGZMawxDk
> > >> dlavUDASzUmLjk/edit?usp=sharing
> > >>
> > >> Lets converge on requirements and architecture, so we can move forward
> > with
> > >> implementation.
> > >>
> > >> I would like to suggest for Pedro  and Meghna to lead the discussion
> and
> > >> help to resolve suggestions.
> > >>
> > >> I assume we need a vote once we are converged on a good draft to call
> > it a
> > >> plan and move forward with implementation. As we all are unhappy with
> > the
> > >> current CI situation I would also suggest a phased approach, so we can
> > get
> > >> back to reliable and efficient basic CI quickly and add advanced
> > >> capabilities over time.
> > >>
> > >> Steffen
> > >>
> > >> On Wed, Nov 1, 2017 at 1:14 PM kellen sunderland <
> > >> kellen.sunderland@gmail.com> wrote:
> > >>
> > >> > Hey Henri, I think that's what a few of us are advocating.  Running
> a
> > set
> > >> > of quick tests as part of the PR process, and then a more detailed
> > >> > regression test suite periodically (say every 4 hours). This fits
> > nicely
> > >> > into a tagging or 2 branch development system.  Commits will be
> tagged
> > >> (or
> > >> > merged into a stable branch) as soon as they pass the detailed
> > regression
> > >> > testing.
> > >> >
> > >> > On Wed, Nov 1, 2017 at 9:07 PM, Hen <ba...@apache.org> wrote:
> > >> >
> > >> > > Random question - can the CI be split such that the Apache CI is
> > doing
> > >> a
> > >> > > basic set of checks on that hardware, and is hooked to a PR, while
> > >> there
> > >> > is
> > >> > > a larger "Is trunk good for release?" test that is running
> > periodically
> > >> > > rather than on every PR?
> > >> > >
> > >> > > ie: do we need each PR to be run on varied hardware, or can we
> have
> > >> this
> > >> > > two tier approach?
> > >> > >
> > >> > > Hen
> > >> > >
> > >> > > On Fri, Oct 20, 2017 at 1:01 PM, sandeep krishnamurthy <
> > >> > > sandeep.krishna98@gmail.com> wrote:
> > >> > >
> > >> > > > Hello all,
> > >> > > >
> > >> > > > I am hereby opening up a discussion thread on how we can
> stabilize
> > >> > Apache
> > >> > > > MXNet CI build system.
> > >> > > >
> > >> > > > Problems:
> > >> > > >
> > >> > > > ========
> > >> > > >
> > >> > > > Recently, we have seen following issues with Apache MXNet CI
> build
> > >> > > systems:
> > >> > > >
> > >> > > >    1. Apache Jenkins master is overloaded and we see issues
> like -
> > >> > unable
> > >> > > >    to trigger builds, difficult to load and view the blue ocean
> > and
> > >> > other
> > >> > > >    Jenkins build status page.
> > >> > > >    2. We are generating too many request/interaction on Apache
> > Infra
> > >> > > team.
> > >> > > >       1. Addition/deletion of new slave: Caused from scaling
> > >> activity,
> > >> > > >       recycling, troubleshooting or any actions leading to
> change
> > of
> > >> > > slave
> > >> > > >       machines.
> > >> > > >       2. Plugins / other Jenkins Master configurations.
> > >> > > >       3. Experimentation on CI pipelines.
> > >> > > >    3. Harder to debug and resolve issues - Since access to
> master
> > and
> > >> > > slave
> > >> > > >    is not with the same community, it requires Infra and
> > community to
> > >> > > dive
> > >> > > >    deep together on all action items.
> > >> > > >
> > >> > > > Possible Solutions:
> > >> > > >
> > >> > > > ==============
> > >> > > >
> > >> > > >    1. Can we set up a separate Jenkins CI build system for
> Apache
> > >> MXNet
> > >> > > >    outside Apache Infra?
> > >> > > >    2. Can we have a separate Jenkins Master in Apache Infra for
> > >> MXNet?
> > >> > > >    3. Review design of current setup, refine and fill the gaps.
> > >> > > >
> > >> > > > @ Mentors/Infra team/Community:
> > >> > > >
> > >> > > > ==========================
> > >> > > >
> > >> > > > Please provide your suggestions on how we can proceed further
> and
> > >> work
> > >> > on
> > >> > > > stabilizing the CI build systems for MXNet.
> > >> > > >
> > >> > > > Also, if the community decides on separate Jenkins CI build
> > system,
> > >> > what
> > >> > > > important points should be taken care of apart from the below:
> > >> > > >
> > >> > > >    1. Community being able to access the build page for build
> > >> statuses.
> > >> > > >    2. Committers being able to login with apache credentials.
> > >> > > >    3. Hook setup from apache/incubator-mxnet repo to Jenkins
> > master.
> > >> > > >
> > >> > > >
> > >> > > > Irrespective of the solution we come up, I think we should
> > initiate a
> > >> > > > technical design discussion on how to setup the CI build system.
> > >> > > Probably 1
> > >> > > > or 2 pager documents with the architecture and review with Infra
> > and
> > >> > > > community members.
> > >> > > >
> > >> > > > ***There were few proposal and discussion on the slack channel,
> to
> > >> > reach
> > >> > > > wider community members, moving that discussion formally to this
> > >> list.
> > >> > > >
> > >> > > >
> > >> > > > My Proposal: Option 1 - Set up separate Jenkins CI build system.
> > >> > > >
> > >> > > > Thanks,
> > >> > > >
> > >> > > > Sandeep
> > >> > > >
> > >> > > >
> > >> > > >
> > >> > > > --
> > >> > > > Sandeep Krishnamurthy
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
>

Re: [Proposal] Stabilizing Apache MXNet CI build system

Posted by Chris Olivier <cj...@gmail.com>.
After a decision is reached, I am willing to add tasks to the Apache MXNet JIRA.

On Mon, Nov 6, 2017 at 6:15 AM, Pedro Larroy <pe...@gmail.com>
wrote:

> Thanks for setting up the document, guys. It looks like a solid basis to
> start working from!
>
> Marco, Kellen and I have already added some comments.
>
> Pedro

Re: [Proposal] Stabilizing Apache MXNet CI build system

Posted by Pedro Larroy <pe...@gmail.com>.
Thanks for setting up the document, guys. It looks like a solid basis to
start working from!

Marco, Kellen and I have already added some comments.

Pedro


On Sun, Nov 5, 2017 at 3:43 AM, Meghna Baijal
<me...@gmail.com> wrote:
> Kellen, thank you for your comments in the doc.
> Sure, Steffen. I will continue to merge everyone’s comments into the doc and
> work with Pedro to finalize it.
> Then we can vote on the options.
>
> Thanks,
> Meghna Baijal

Re: [Proposal] Stabilizing Apache MXNet CI build system

Posted by Meghna Baijal <me...@gmail.com>.
Kellen, thank you for your comments in the doc.
Sure, Steffen. I will continue to merge everyone’s comments into the doc and
work with Pedro to finalize it.
Then we can vote on the options.

Thanks,
Meghna Baijal


On Sat, Nov 4, 2017 at 6:34 AM, Steffen Rochel <st...@gmail.com>
wrote:

> Sandeep and Meghna have been working in the background, collecting input and
> preparing a doc. To drive the discussion forward, I would like to ask
> everybody to contribute to
> https://docs.google.com/document/d/17PEasQ2VWrXi2Cf7IGZSWGZMawxDkdlavUDASzUmLjk/edit?usp=sharing
>
> Let's converge on requirements and architecture so that we can move forward
> with implementation.
>
> I suggest that Pedro and Meghna lead the discussion and help resolve
> suggestions.
>
> I assume we need a vote, once we have converged on a good draft, to call it a
> plan and move forward with implementation. As we are all unhappy with the
> current CI situation, I would also suggest a phased approach, so that we can
> get back to a reliable and efficient basic CI quickly and add advanced
> capabilities over time.
>
> Steffen

Re: [Proposal] Stabilizing Apache MXNet CI build system

Posted by Steffen Rochel <st...@gmail.com>.
Sandeep and Meghna have been working in the background, collecting input and
preparing a doc. To drive the discussion forward, I would like to ask
everybody to contribute to
https://docs.google.com/document/d/17PEasQ2VWrXi2Cf7IGZSWGZMawxDkdlavUDASzUmLjk/edit?usp=sharing

Let's converge on requirements and architecture so that we can move forward
with implementation.

I suggest that Pedro and Meghna lead the discussion and help resolve
suggestions.

I assume we need a vote, once we have converged on a good draft, to call it a
plan and move forward with implementation. As we are all unhappy with the
current CI situation, I would also suggest a phased approach, so that we can
get back to a reliable and efficient basic CI quickly and add advanced
capabilities over time.

Steffen

On Wed, Nov 1, 2017 at 1:14 PM kellen sunderland <
kellen.sunderland@gmail.com> wrote:

> Hey Henri, I think that's what a few of us are advocating: run a set of
> quick tests as part of the PR process, and then run a more detailed
> regression test suite periodically (say, every 4 hours). This fits nicely
> into a tagging or two-branch development system: commits would be tagged (or
> merged into a stable branch) as soon as they pass the detailed regression
> testing.

Re: [Proposal] Stabilizing Apache MXNet CI build system

Posted by kellen sunderland <ke...@gmail.com>.
Hey Henri, I think that's what a few of us are advocating: run a set of
quick tests as part of the PR process, and then run a more detailed
regression test suite periodically (say, every 4 hours). This fits nicely
into a tagging or two-branch development system: commits would be tagged (or
merged into a stable branch) as soon as they pass the detailed regression
testing.
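A minimal Python sketch of the two-tier flow described above, assuming a toy in-memory model of the repository: quick tests gate each PR merge into trunk, and a periodic job (which in Jenkins might be scheduled with a cron spec such as `H */4 * * *`) runs the detailed suite on the trunk head and promotes it only if it passes. The names `Repo`, `merge_pr`, `periodic_regression` and the `stable-` tag prefix are hypothetical, not part of any existing MXNet or Jenkins tooling.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Hypothetical toy model: a test suite is a callable that returns True on success.
SuiteFn = Callable[[str], bool]


@dataclass
class Repo:
    trunk: List[str] = field(default_factory=list)  # merged commit ids, oldest first
    stable: Optional[str] = None                     # last commit that passed the full suite
    tags: List[str] = field(default_factory=list)   # tags created on promotion


def merge_pr(repo: Repo, commit: str, quick_tests: SuiteFn) -> bool:
    """Tier 1: run only the quick tests on the PR; merge into trunk if they pass."""
    if quick_tests(commit):
        repo.trunk.append(commit)
        return True
    return False


def periodic_regression(repo: Repo, full_suite: SuiteFn) -> Optional[str]:
    """Tier 2: run the detailed suite on the trunk head (e.g. every 4 hours).

    Returns the promoted commit id, or None if nothing was promoted.
    """
    if not repo.trunk:
        return None
    head = repo.trunk[-1]
    if head == repo.stable:
        return None  # nothing new since the last promotion
    if full_suite(head):
        repo.stable = head                  # "merge into a stable branch"
        repo.tags.append(f"stable-{head}")  # or "tag the commit"
        return head
    return None  # stable stays on the previous good commit
```

One consequence of this design: a failing periodic run leaves the previous stable commit in place, so anyone consuming the stable branch or tags never picks up a commit that has only passed the quick PR tests.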

On Wed, Nov 1, 2017 at 9:07 PM, Hen <ba...@apache.org> wrote:

> Random question: can the CI be split so that the Apache CI runs a basic set
> of checks on that hardware and is hooked to PRs, while a larger "Is trunk good
> for release?" test runs periodically rather than on every PR?
>
> I.e., do we need each PR to be run on varied hardware, or can we have this
> two-tier approach?
>
> Hen

Re: [Proposal] Stabilizing Apache MXNet CI build system

Posted by Hen <ba...@apache.org>.
Random question: can the CI be split so that the Apache CI runs a basic set
of checks on that hardware and is hooked to PRs, while a larger "Is trunk good
for release?" test runs periodically rather than on every PR?

I.e., do we need each PR to be run on varied hardware, or can we have this
two-tier approach?

Hen
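One way to make the two-tier question above concrete is to decide, per test, whether it runs on every PR. The Python sketch below assumes each test carries a hypothetical annotation of its estimated runtime and whether it needs special hardware (e.g. a GPU). CPU-only tests are added to the PR tier shortest-first while they fit in a per-PR time budget; the periodic "Is trunk good for release?" tier always runs the full suite. The `CITest` fields, the budget and the greedy policy are illustrative assumptions, not an agreed MXNet design.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class CITest:
    name: str
    est_seconds: float       # estimated runtime (hypothetical annotation)
    needs_gpu: bool = False  # requires hardware beyond the basic PR builders


def split_tiers(tests: List[CITest], pr_budget_seconds: float) -> Dict[str, List[str]]:
    """Return the test names to run per PR and in the periodic full run."""
    pr: List[str] = []
    used = 0.0
    # Only CPU-only tests are PR candidates; consider the cheapest first.
    for t in sorted((t for t in tests if not t.needs_gpu), key=lambda t: t.est_seconds):
        if used + t.est_seconds > pr_budget_seconds:
            break  # sorted ascending, so no later test can fit either
        pr.append(t.name)
        used += t.est_seconds
    # The periodic release check reruns everything, including the PR subset.
    return {"pr": pr, "periodic": [t.name for t in tests]}
```

For example, with a 20-minute budget, a 1-minute lint and a 10-minute CPU unit test would run per PR, while GPU tests and an hour-long integration suite would only run in the periodic tier.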

On Fri, Oct 20, 2017 at 1:01 PM, sandeep krishnamurthy <
sandeep.krishna98@gmail.com> wrote:

> Hello all,
>
> I am hereby opening up a discussion thread on how we can stabilize the Apache
> MXNet CI build system.
>
> Problems:
>
> ========
>
> Recently, we have seen the following issues with the Apache MXNet CI build system:
>
>    1. The Apache Jenkins master is overloaded, and we see issues such as being
>    unable to trigger builds and difficulty loading and viewing the Blue Ocean
>    and other Jenkins build status pages.
>    2. We are generating too many requests/interactions for the Apache Infra team:
>       1. Addition/deletion of slaves, caused by scaling activity, recycling,
>       troubleshooting, or any other action that changes the slave machines.
>       2. Plugins / other Jenkins master configuration.
>       3. Experimentation with CI pipelines.
>    3. Issues are harder to debug and resolve: since access to the master and
>    the slaves does not lie with the same community, Infra and the community
>    have to dive deep together on every action item.
>
> Possible Solutions:
>
> ==============
>
>    1. Can we set up a separate Jenkins CI build system for Apache MXNet
>    outside Apache Infra?
>    2. Can we have a separate Jenkins master in Apache Infra for MXNet?
>    3. Review the design of the current setup, refine it, and fill the gaps.
>
> @ Mentors/Infra team/Community:
>
> ==========================
>
> Please provide your suggestions on how we can proceed and work on
> stabilizing the CI build system for MXNet.
>
> Also, if the community decides on a separate Jenkins CI build system, what
> important points should be taken care of apart from the following?
>
>    1. The community being able to access the build page for build statuses.
>    2. Committers being able to log in with Apache credentials.
>    3. Hook setup from the apache/incubator-mxnet repo to the Jenkins master.
>
>
> Irrespective of the solution we come up with, I think we should initiate a
> technical design discussion on how to set up the CI build system, probably
> via a 1- or 2-page document with the architecture, reviewed with Infra and
> community members.
>
> *** There were a few proposals and discussions on the Slack channel; to reach
> wider community members, I am moving that discussion formally to this list.
>
>
> My Proposal: Option 1 - Set up a separate Jenkins CI build system.
>
> Thanks,
>
> Sandeep
>
>
>
> --
> Sandeep Krishnamurthy
>
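On the third requirement quoted above (hook setup from the apache/incubator-mxnet repo to the Jenkins master): a self-hosted master would receive GitHub webhook deliveries and should reject any payload not signed with the shared webhook secret. GitHub signs each delivery with HMAC-SHA256 over the raw request body and sends the result in the `X-Hub-Signature-256` header as `sha256=<hex digest>` (older deliveries used `X-Hub-Signature` with SHA-1). The Python sketch below shows only that check, using the standard library; an existing Jenkins GitHub integration may already perform it, and the function name and secret are placeholders.

```python
import hashlib
import hmac


def verify_github_signature(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Return True if signature_header equals "sha256=" + HMAC-SHA256(secret, body).

    signature_header is the value of the X-Hub-Signature-256 request header.
    body must be the raw, unparsed request body exactly as received.
    """
    if not signature_header or not signature_header.startswith("sha256="):
        return False
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # Constant-time comparison; bytes avoid TypeError on non-ASCII header values.
    return hmac.compare_digest(expected.encode(), signature_header.encode())
```

Verifying against the raw bytes matters: re-serializing parsed JSON can change whitespace or key order and make a genuine delivery fail the check.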