Posted to dev@mxnet.apache.org by Chaitanya Bapat <ch...@gmail.com> on 2020/04/02 03:38:05 UTC

Re: Update : CI windows-gpu Failure

Hello MXNet Community,

CI has been blocked for a week now due to the Windows-GPU failure.
The PR to fix it is still WIP:
https://github.com/apache/incubator-mxnet/pull/17808

This PR updates the toolchain from 32-bit to 64-bit [to resolve the 2GB
memory linker error that CI is currently facing], along with a host of
other long-overdue updates [VS2019, opencv, cudnn, etc.].

We have 2 pending issues:
1. cuda segfault in Py3 Windows GPU test
OSError: exception: access violation writing 0x0000000000000000

2. Jenkins Channel Connection
"hudson.remoting.ChannelClosedException: Channel
"hudson.remoting.Channel@5cca06e6:JNLP4-connect connection from [...]
failed. The channel is closing down or has closed down"

We are hard at work to unblock the CI & get the PR fix merged.

Since we want to focus on fixing the windows-gpu issue and avoid
complicating the situation further, we are not disabling the windows-gpu
build as of now. As a backup plan, we will disable the windows-gpu builds
by 4/5 Sunday EOD if things don’t recover by then.

Thanks for the continued patience.
Chai,
on behalf of the MXNet CI team

On Thu, 26 Mar 2020 at 21:16, Chaitanya Bapat <ch...@gmail.com> wrote:

> Hello MXNet community,
>
> The windows-gpu builds have been failing on CI for over 3 days now.
> The team (me, Leo, Ningyuan, Joe, Pedro) is at work trying to identify
> the root cause and fix it.
>
> Issue: The linker is running OOM because the 32-bit toolchain cannot
> address the available memory of the machine.
>
> Multiple attempts have been made (albeit with limited success):
> 1. Reduced the number of builds per worker (for the windows-cpu node)
> from 3 to 1
> 2. Updated the toolchain from 32-bit to 64-bit (as pointed out by
> multiple people)
> PR: https://github.com/apache/incubator-mxnet/pull/17916
> [related to Leo’s PR:
> https://github.com/apache/incubator-mxnet/pull/17912]
>
> Road to unblock:
> An updated AMI coupled with the updated toolchain should help.
> Ningyuan has an updated AMI for Windows (PR:
> https://github.com/apache/incubator-mxnet/pull/17808) with vs2019,
> cuda10.2, cmake fixes, etc.
>
> We will get it deployed by tomorrow and update the status accordingly.
>
> Thanks for the patience. Apologies for the inconvenience caused.
> Thank you 🙏
> Chai,
> on behalf of the MXNet CI team

-- 
*Chaitanya Prakash Bapat*
*+1 (973) 953-6299*


Re: Update : CI windows-gpu Failure

Posted by Marco de Abreu <ma...@gmail.com>.

Done:  https://issues.apache.org/jira/browse/INFRA-20085

Since you didn't provide the required events, I took an educated guess.
Next time, please include the events.

-Marco

On Tue, Apr 7, 2020 at 6:57 AM Chaitanya Bapat <ch...@gmail.com> wrote:

> Hello MXNet community,
>
> > The new AMI will be tested in a CI Dev environment and we will notify
> > you when it's ready for adoption on the CI.
>
> As we continue to work on improving & updating the CI, I want to notify the
> community about the steps being taken to test in the CI Dev environment.
>
> To ensure the updated AMI is error-free, we plan to test it in the CI Dev
> environment for at least 1000 builds. To reach that many builds, we want
> to route GitHub PR events to the CI Dev environment. In addition to
> testing the AMI, there's a proposal to migrate instances from p3/g3
> [unix-gpu/centos-gpu/windows-gpu] to G4 for cost & speed reasons. That
> needs to be tested, and data needs to be gathered, before carrying out
> the migration.
>
> To route these events, I request Marco to cut a ticket to Apache Infra
> for another GitHub webhook. This webhook in the apache/incubator-mxnet
> repository will point to the Jenkins CI Dev server [used for A/B testing].
>
> Once testing succeeds, I will notify the community of the proposed
> migrations to make CI faster, better and less error-prone.
>
> Thank you,
> Chai,
> on behalf of MXNet CI team.

Re: Update : CI windows-gpu Failure

Posted by Chaitanya Bapat <ch...@gmail.com>.

Hello MXNet community,

> The new AMI will be tested in a CI Dev environment and we will notify
> you when it's ready for adoption on the CI.

As we continue to work on improving & updating the CI, I want to notify the
community about the steps being taken to test in the CI Dev environment.

To ensure the updated AMI is error-free, we plan to test it in the CI Dev
environment for at least 1000 builds. To reach that many builds, we want
to route GitHub PR events to the CI Dev environment. In addition to
testing the AMI, there's a proposal to migrate instances from p3/g3
[unix-gpu/centos-gpu/windows-gpu] to G4 for cost & speed reasons. That
needs to be tested, and data needs to be gathered, before carrying out
the migration.
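
As a rough illustration of the 1000-build gate, here is a minimal Python
sketch. The BuildTally class and its method names are hypothetical and not
part of any CI tooling; it assumes "error-free" means no failed build among
the builds observed, which is a stricter rule than may be practical if
failures unrelated to the AMI occur.

```python
from dataclasses import dataclass


@dataclass
class BuildTally:
    """Running count of CI Dev builds on the new AMI (hypothetical helper)."""
    total: int = 0
    failed: int = 0

    def record(self, passed: bool) -> None:
        """Register the outcome of one finished build."""
        self.total += 1
        if not passed:
            self.failed += 1

    def ready_for_adoption(self, min_builds: int = 1000) -> bool:
        # Assumed criterion: enough builds observed and none of them failed.
        return self.total >= min_builds and self.failed == 0


tally = BuildTally()
for _ in range(1000):
    tally.record(passed=True)
print(tally.ready_for_adoption())  # True
```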

To route these events, I request Marco to cut a ticket to Apache Infra
for another GitHub webhook. This webhook in the apache/incubator-mxnet
repository will point to the Jenkins CI Dev server [used for A/B testing].
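
To make the event routing concrete, here is a small Python sketch of how a
receiver could decide which deliveries to act on. GitHub sends the event
name in the X-GitHub-Event header; the set of forwarded events below is an
assumption (this thread does not list them), and should_forward is a
made-up name. In practice the event selection is normally configured on
the webhook itself.

```python
# Assumption: the events to forward; the thread does not list them.
FORWARDED_EVENTS = {"pull_request", "push"}


def should_forward(headers: dict) -> bool:
    """Decide from the X-GitHub-Event header whether CI Dev should act."""
    # Header names are case-insensitive, so normalise before the lookup.
    normalised = {name.lower(): value for name, value in headers.items()}
    return normalised.get("x-github-event") in FORWARDED_EVENTS


print(should_forward({"X-GitHub-Event": "pull_request"}))  # True
print(should_forward({"X-GitHub-Event": "watch"}))         # False
```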

Once testing succeeds, I will notify the community of the proposed
migrations to make CI faster, better and less error-prone.

Thank you,
Chai,
on behalf of MXNet CI team.



On Fri, 3 Apr 2020 at 21:30, sandeep krishnamurthy <
sandeep.krishna98@gmail.com> wrote:

> Thanks a lot Chai, Joe, Leo, Marco, Ningyuan, Sheng, Zhi for pulling many
> all-nighters to fix this issue.
> Thanks for pushing further to automate the AMI building and for Leo's
> proposal to update the toolchain. That should stabilize the CI for the
> project.
>
> Best,
> Sandeep



Re: Update : CI windows-gpu Failure

Posted by sandeep krishnamurthy <sa...@gmail.com>.

Thanks a lot Chai, Joe, Leo, Marco, Ningyuan, Sheng, Zhi for pulling many
all-nighters to fix this issue.
Thanks for pushing further to automate the AMI building and for Leo's
proposal to update the toolchain. That should stabilize the CI for the
project.

Best,
Sandeep

On Fri, Apr 3, 2020 at 8:50 PM Chaitanya Bapat <ch...@gmail.com> wrote:

> Hello MXNet community,
>
> The Windows GPU pipeline on CI is working again. To fix it, we updated the
> AMI used by the tests and preinstalled VS2019, which uses a 64-bit
> toolchain and resolves the OOM error. Prior attempts at using the VS2017
> 64-bit toolchain had failed, making the change to the AMI necessary.
>
> VS2019 only works with Cuda 10 and we thus also preinstalled Cuda 10.2 on
> the AMI. As the Windows GPU build was the only build testing Cuda 9, we now
> do not have any tests with Cuda 9. We will start a separate discussion
> thread to decide whether to add back Cuda 9 tests on one of the Unix
> platforms or to drop Cuda 9 support in MXNet 2. The AMI was further updated
> with the ninja build tool and a recent version of cmake, helping us to
> speed up the build further. Previously cmake was installed as part of every
> CI run, as the version on the AMI has been too outdated for some time
> already.
>
> All these updates were manually built on top of the existing AMI. The team
> is continuing to work on updating the automated AMI building process to
> include the updated toolchain. Unfortunately the automation scripts for the
> recent VS2019 do not work on the Windows Server 2016 version used so far,
> and we thus intend to switch to Windows Server 2019. The new AMI will be
> tested in a CI Dev environment and we will notify you when it's ready for
> adoption on the CI.
>
> https://github.com/apache/incubator-mxnet/pull/17962 contains the changes
> to the master branch required to make use of the updated toolchain.
>
> On your side, all you have to do is rebase your branch on master to pick
> up the changes. Refer to the Git commands to rebase on master here
> <https://gist.github.com/ChaiBapchya/2c52bce4b3d52ab03ccbc875a49996df>.
>
> Thanks to Joe, Leo, Marco, Ningyuan, Sandeep, Sheng, Zhi for their help,
> support & guidance.
>
> Thanks once again to the community for your patience. Apologies for the
> inconvenience caused.
> Regards,
> Chai,
> on behalf of MXNet CI team


-- 
Sandeep Krishnamurthy

Re: Update : CI windows-gpu Failure

Posted by Chaitanya Bapat <ch...@gmail.com>.

Hello MXNet community,

The Windows GPU pipeline on CI is working again. To fix it, we updated the
AMI used by the tests and preinstalled VS2019, which uses a 64-bit
toolchain and resolves the OOM error. Prior attempts at using the VS2017
64-bit toolchain had failed, making the change to the AMI necessary.
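
For anyone wondering why the toolchain bitness matters, the arithmetic can
be sketched in a few lines of Python. This is only an illustration: the
function names are made up for this note, the 3 GiB link step is an
invented number, and the 2 GiB figure is the default user-mode address
space of a 32-bit Windows process, in line with the 2GB linker limit
mentioned earlier in the thread.

```python
GIB = 1024 ** 3  # bytes in one gibibyte


def addressable_bytes(pointer_bits: int) -> int:
    """Total bytes a pointer of the given width can address."""
    return 2 ** pointer_bits


def fits_in_process(needed_bytes: int, user_space_bytes: int) -> bool:
    """True if a tool needing `needed_bytes` fits in the user address space."""
    return needed_bytes <= user_space_bytes


# Assumption: default user-mode address space of a 32-bit Windows process.
USER_SPACE_32BIT = 2 * GIB

# Hypothetical link step that needs 3 GiB of memory (illustrative only).
needed = 3 * GIB

print(addressable_bytes(32) // GIB)                    # 4
print(fits_in_process(needed, USER_SPACE_32BIT))       # False
print(fits_in_process(needed, addressable_bytes(64)))  # True
```

The point is that the limit comes from the width of the tool's pointers and
not from the RAM installed on the worker, so adding memory to the machine
would not help a 32-bit linker.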

VS2019 only works with Cuda 10 and we thus also preinstalled Cuda 10.2 on
the AMI. As the Windows GPU build was the only build testing Cuda 9, we now
do not have any tests with Cuda 9. We will start a separate discussion
thread to decide whether to add back Cuda 9 tests on one of the Unix
platforms or to drop Cuda 9 support in MXNet 2. The AMI was further updated
with the ninja build tool and a recent version of cmake, helping us to
speed up the build further. Previously cmake was installed as part of every
CI run, as the version on the AMI has been too outdated for some time
already.

All these updates were manually built on top of the existing AMI. The team
is continuing to work on updating the automated AMI building process to
include the updated toolchain. Unfortunately the automation scripts for the
recent VS2019 do not work on the Windows Server 2016 version used so far,
and we thus intend to switch to Windows Server 2019. The new AMI will be
tested in a CI Dev environment and we will notify you when it's ready for
adoption on the CI.

https://github.com/apache/incubator-mxnet/pull/17962 contains the changes
to the master branch required to make use of the updated toolchain.

On your side, all you have to do is rebase your branch on master to pick
up the changes. Refer to the Git commands to rebase on master here
<https://gist.github.com/ChaiBapchya/2c52bce4b3d52ab03ccbc875a49996df>.
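
For convenience, a rebase on master typically boils down to two git
commands, sketched here as a small Python wrapper. This is not a copy of
the linked gist: the remote name "upstream" (assumed to point at
apache/incubator-mxnet) and the helper names are assumptions for
illustration. The script defaults to a dry run that only prints the
commands.

```python
import subprocess


def rebase_commands(remote: str = "upstream", branch: str = "master") -> list:
    """Build the git commands that replay the current branch on remote/branch."""
    return [
        ["git", "fetch", remote],
        ["git", "rebase", f"{remote}/{branch}"],
    ]


def run_rebase(remote: str = "upstream", branch: str = "master",
               dry_run: bool = True) -> None:
    """Print each command and, unless dry_run is set, execute it."""
    for cmd in rebase_commands(remote, branch):
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops at the first failing command (e.g. a conflict).
            subprocess.run(cmd, check=True)


run_rebase()  # dry run: only prints the two commands
```

After a successful rebase, the PR branch usually has to be force-pushed to
your fork so that CI picks up the rebased commits.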

Thanks to Joe, Leo, Marco, Ningyuan, Sandeep, Sheng, Zhi for their help,
support & guidance.

Thanks once again to the community for your patience. Apologies for the
inconvenience caused.
Regards,
Chai,
on behalf of MXNet CI team

On Wed, 1 Apr 2020 at 20:38, Chaitanya Bapat <ch...@gmail.com> wrote:

> Hello MXNet Community,
>
> CI has been blocked for a week now due to the Windows-GPU failure.
> The PR to fix it is still WIP:
> https://github.com/apache/incubator-mxnet/pull/17808
>
> This PR updates the toolchain from 32-bit to 64-bit [to resolve the 2GB
> memory linker error that CI is currently facing], along with a host of
> other long-overdue updates [VS2019, opencv, cudnn, etc.].
>
> We have 2 pending issues:
> 1. cuda segfault in Py3 Windows GPU test
> OSError: exception: access violation writing 0x0000000000000000
>
> 2. Jenkins Channel Connection
> "hudson.remoting.ChannelClosedException: Channel
> "hudson.remoting.Channel@5cca06e6:JNLP4-connect connection from [...]
> failed. The channel is closing down or has closed down"
>
> We are hard at work to unblock the CI & get the PR fix merged.
>
> Since we want to focus on fixing the windows-gpu issue and avoid
> complicating the situation further, we are not disabling the windows-gpu
> build as of now. As a backup plan, we will disable the windows-gpu builds
> by 4/5 Sunday EOD if things don’t recover by then.
>
> Thanks for the continued patience.
> Chai,
> on behalf of the MXNet CI team
