Posted to dev@mxnet.apache.org by "Davydenko, Denis" <dz...@gmail.com> on 2020/02/12 18:12:07 UTC

Update on upcoming changes to the MXNet CI: Jenkins

Hello, MXNet dev community,
As you all know, the experience with CI infrastructure isn’t ideal in spite of its high cost. For this reason, we’re proposing the following changes to improve stability, reduce cost, and grant more control to contributors. As we work on a refresh of CI, we believe these changes will reduce the pain we all suffer when we try to push a PR through the system.

Following is the list of changes:
Fix missing status reports between GH and Jenkins
Update Jenkins permission groups to re-trigger builds
Introduce per-PR CI bot
Details:

- Fix missing status reports
Currently, once a commit is added to a PR, CI runs on that commit. Sometimes the CI run status is missing from the commit on GitHub even though the run completed in Jenkins. Example: CI run: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-cpu/detail/PR-17376/17/pipeline, commit status on GitHub (missing unix-cpu, unix-gpu and windows-gpu statuses): https://github.com/apache/incubator-mxnet/pull/17376#partial-pull-merging.
Problem: There seems to be a bug that causes some status reports to be missing on GitHub. The hypothesis is that there is an issue with GitHub hooks.
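
A minimal sketch of how the gap described above could be detected, assuming the set of jobs Jenkins finished and the statuses shown on the GitHub commit are both already available. The function name and the data shapes are hypothetical and are not part of the MXNet CI code.

```python
def unreported_jobs(finished_in_jenkins, github_statuses):
    """Return jobs that completed in Jenkins but have no status on the commit.

    finished_in_jenkins: iterable of job names that Jenkins reports as done.
    github_statuses: dict mapping job name -> state shown on GitHub.
    Both shapes are illustrative assumptions, not the real CI data model.
    """
    return sorted(
        job for job in set(finished_in_jenkins) if job not in github_statuses
    )
```

For the example PR above, passing the finished jobs together with the statuses visible on GitHub would return unix-cpu, unix-gpu and windows-gpu as the ones to re-report.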

- Update Jenkins permission groups to re-trigger builds
Problem: Currently, only MXNet Committers and selected people from AWS have the ability to re-trigger CI runs on PRs. This leaves PR authors waiting for an authorized user to re-trigger CI on their PRs.
Solution: Allow three membership categories (Jenkins Admins, MXNet Committers, and PR Authors) to re-trigger PR builds.

- Introduce per-PR CI bot
Problem: As of today, MXNet CI is triggered automatically. It runs every time a commit is pushed to your GitHub PR. This results in a lot of unnecessary CI runs and added cost.
Solution: Switch to a manual trigger. Users from the authorized groups (any of the 3 categories mentioned above) can trigger a CI run by adding a simple comment to the PR: “[mxnet-ci] run”.
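
As a rough sketch of the trigger check described above: the bot would need to recognize the comment and confirm that the commenter belongs to one of the three categories. The role strings, the function names, and the choice to match the comment case-insensitively are all assumptions made for illustration.

```python
TRIGGER = "[mxnet-ci] run"  # comment text from the proposal

# The three categories named in the proposal; the role strings are assumptions.
AUTHORIZED_ROLES = {"jenkins-admin", "committer", "pr-author"}


def roles_for(user, pr_author, committers, admins):
    """Collect the roles a commenter holds for this PR."""
    roles = set()
    if user in admins:
        roles.add("jenkins-admin")
    if user in committers:
        roles.add("committer")
    if user == pr_author:
        roles.add("pr-author")
    return roles


def should_trigger(comment, user, pr_author, committers=(), admins=()):
    """True when the comment is the trigger and the commenter is authorized."""
    if comment.strip().lower() != TRIGGER:
        return False  # any other comment is ignored
    return bool(roles_for(user, pr_author, committers, admins) & AUTHORIZED_ROLES)
```

With this shape, a PR author commenting "[mxnet-ci] run" on their own PR starts a run, while the same comment from an unrelated user does nothing.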

--
Thank you,

AWS MXNet team

 


Re: Update on upcoming changes to the MXNet CI: Jenkins

Posted by "Davydenko, Denis" <dz...@gmail.com>.
We intend this bot to be very simple initially. But your idea is very interesting, and we will consider whether we can roll it out as phase 2.


On 2/12/20, 10:57 AM, "Przemysław Trędak" <pt...@apache.org> wrote:

    Hi Denis,
    
    Could this bot be smart enough to first do the sanity pipeline (to catch stuff like lint errors etc.) before launching the full thing?
    
    Thanks
    Przemek
    



Re: Update on upcoming changes to the MXNet CI: Jenkins

Posted by Przemysław Trędak <pt...@apache.org>.
Hi Denis,

Could this bot be smart enough to first do the sanity pipeline (to catch stuff like lint errors etc.) before launching the full thing?
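
One way to picture the staged run asked about here is sketched below. It assumes a caller-supplied `run_job` function that runs a single job by name and returns True on success; that function, the name `run_staged`, and the job name "sanity" as a default are hypothetical.

```python
def run_staged(run_job, full_jobs, sanity_job="sanity"):
    """Run the sanity job first; start the full set only if it passes.

    run_job: caller-supplied function, job name -> True on success.
    Returns a dict of job name -> result for every job that was started.
    """
    results = {sanity_job: run_job(sanity_job)}
    if not results[sanity_job]:
        return results  # stop early, nothing expensive was launched
    for job in full_jobs:
        results[job] = run_job(job)
    return results
```

A lint failure would then end the run after a single cheap job instead of occupying the GPU and Windows workers.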

Thanks
Przemek


Re: Update on upcoming changes to the MXNet CI: Jenkins

Posted by "Davydenko, Denis" <dz...@gmail.com>.
This makes total sense, Aaron. We can probably spend some time on these modifications once we complete the originally mentioned changes.



On 2/13/20, 9:21 AM, "Aaron Markham" <aa...@gmail.com> wrote:

    +1 These are good action items that should help alleviate part of the
    CI issues.
    
    The following comments are not to take away from your proposal. Move
    forward, assuming the community agrees.
    I'd really like to see particular tests run only when the PR is
    touching a related part. While this is more effort, it would really
    make a major difference. Light research shows that projects have been
    doing this for quite some time, so it wouldn't require a new invention
    or deep exploration.
    
    I realize there are a lot of interdependencies and it would probably
    not work for everything. But, what if we start small?
    --> Docs pages (*.rst, *.md, *.html, *.js, *.css): don't trigger most
    tests, especially GPU and cross-platform tests.
    --> Tutorials that have GPU requirements run their own validation
    tests, and tutorials that don't have a GPU requirement don't get tested
    on GPUs.
    
    Cheers,
    Aaron
    
    
    



Re: Update on upcoming changes to the MXNet CI: Jenkins

Posted by Aaron Markham <aa...@gmail.com>.
+1 These are good action items that should help alleviate part of the
CI issues.

The following comments are not to take away from your proposal. Move
forward, assuming the community agrees.
I'd really like to see particular tests run only when the PR is
touching a related part. While this is more effort, it would really
make a major difference. Light research shows that projects have been
doing this for quite some time, so it wouldn't require a new invention
or deep exploration.

I realize there are a lot of interdependencies and it would probably
not work for everything. But, what if we start small?
--> Docs pages (*.rst, *.md, *.html, *.js, *.css): don't trigger most
tests, especially GPU and cross-platform tests.
--> Tutorials that have GPU requirements run their own validation
tests, and tutorials that don't have a GPU requirement don't get tested
on GPUs.
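
The "start small" rule for docs pages could look roughly like the sketch below. The suffix list comes from the item above; the job names and the split into light and full sets are assumptions made only to show the selection logic ("website" in particular is a made-up name).

```python
from pathlib import PurePosixPath

# Extensions from the list above; treating them as "docs only" is the assumption.
DOC_SUFFIXES = {".rst", ".md", ".html", ".js", ".css"}

# Hypothetical job names, used only to show the selection logic.
LIGHT_JOBS = ["sanity", "website"]
ALL_JOBS = LIGHT_JOBS + ["unix-cpu", "unix-gpu", "windows-gpu"]


def is_docs_only(changed_files):
    """True when every changed path has a documentation-style suffix."""
    return bool(changed_files) and all(
        PurePosixPath(path).suffix.lower() in DOC_SUFFIXES for path in changed_files
    )


def select_jobs(changed_files):
    """Skip GPU and cross-platform jobs for docs-only changes."""
    return LIGHT_JOBS if is_docs_only(changed_files) else ALL_JOBS
```

Any PR that touches even one non-docs file still gets the full set, which keeps the rule conservative given the interdependencies mentioned above.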

Cheers,
Aaron




Re: Update on upcoming changes to the MXNet CI: Jenkins

Posted by Tao Lv <ta...@apache.org>.
Can someone educate me on how to re-trigger a single test suite in CI?

On Thu, Feb 13, 2020 at 5:10 AM Lausen, Leonard <la...@amazon.com.invalid>
wrote:

> Hi Denis,
>
> "Pipeline" may be the wrong word; "job" may be the correct one. For example,
> committers can currently access a job page like
>
> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-17521/5/
>  , press "Login" and then the restart button to only retrigger that job,
> obtaining
>
> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-17521/6/
>
> This is correctly reported to GitHub, and the status will change from
> failed to passed depending on the result of the new job.
>
> Best regards
> Leonard
>

Re: Update on upcoming changes to the MXNet CI: Jenkins

Posted by "Lausen, Leonard" <la...@amazon.com.INVALID>.
Hi Denis,

"Pipeline" may be the wrong word; "job" may be the correct one. For example,
committers can currently access a job page like
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-17521/5/
 , press "Login" and then the restart button to only retrigger that job,
obtaining 
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-17521/6/

This is correctly reported to GitHub, and the status will change from failed to
passed depending on the result of the new job.

Best regards
Leonard

On Wed, 2020-02-12 at 20:23 +0000, Davydenko, Denis wrote:
> This might or might not work, given that a GH PR is marked failed or not based
> on the overall CI run status, not just a few builds from it. But it is a good
> suggestion to try out; we will evaluate whether it can be accomplished. Thanks!

Re: Update on upcoming changes to the MXNet CI: Jenkins

Posted by "Davydenko, Denis" <dd...@amazon.com.INVALID>.
This might or might not work, given that a GH PR is marked failed or not based on the overall CI run status, not just a few builds from it. But it is a good suggestion to try out; we will evaluate whether it can be accomplished. Thanks!



On 2/12/20, 11:05 AM, "Lausen, Leonard" <la...@amazon.com.INVALID> wrote:

    Thank you, Denis, for taking up this initiative. With respect to "Introduce
    per-PR CI bot" and the "[mxnet-ci] run" command: would it make sense to add
    "retriggering only failed pipelines" to the scope? For example, users could be
    asked to specify the name of the pipeline, or we could have "[mxnet-ci] run all"
    and "[mxnet-ci] run failed".
    
    In the current state, when retriggering all pipelines, it's likely that one of
    them will fail. Retriggering only the failed pipeline gives a higher chance of
    arriving at a state where all pipelines have succeeded.
    


Re: Update on upcoming changes to the MXNet CI: Jenkins

Posted by "Lausen, Leonard" <la...@amazon.com.INVALID>.
Thank you, Denis, for taking up this initiative. With respect to "Introduce
per-PR CI bot" and the "[mxnet-ci] run" command: would it make sense to add
"retriggering only failed pipelines" to the scope? For example, users could be
asked to specify the name of the pipeline, or we could have "[mxnet-ci] run all"
and "[mxnet-ci] run failed".

In the current state, when retriggering all pipelines, it's likely that one of
them will fail. Retriggering only the failed pipeline gives a higher chance of
arriving at a state where all pipelines have succeeded.
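
To make the "run all" / "run failed" idea concrete, here is a small sketch of how the two commands might be parsed and mapped to jobs. Treating a bare "[mxnet-ci] run" as "run all", the function names, and the shape of `last_results` are assumptions for illustration only.

```python
PREFIX = "[mxnet-ci] run"


def parse_command(comment):
    """Return "all", "failed", or None for a PR comment.

    A bare "[mxnet-ci] run" is treated as "all"; that default is an assumption.
    """
    text = " ".join(comment.split()).lower()  # normalize spacing and case
    if not text.startswith(PREFIX):
        return None
    arg = text[len(PREFIX):].strip()
    if arg in ("", "all"):
        return "all"
    if arg == "failed":
        return "failed"
    return None  # unknown argument, e.g. a typo


def jobs_to_run(command, last_results):
    """Pick jobs from last_results, a dict of job name -> passed (bool)."""
    if command == "all":
        return sorted(last_results)
    if command == "failed":
        return sorted(job for job, passed in last_results.items() if not passed)
    return []
```

With results such as unix-cpu passed and unix-gpu failed, "[mxnet-ci] run failed" would select only unix-gpu, leaving the already green jobs untouched.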
