Posted to dev@mxnet.apache.org by Dick Carter <di...@apache.org> on 2019/06/03 22:06:17 UTC

CUDA / CUDNN support revisited

I'd like to revisit the discussion of: https://lists.apache.org/thread.html/27b84e4fc0e0728f2e4ad8b6827d7f996635021a5a4d47b5d3f4dbfb@%3Cdev.mxnet.apache.org%3E now that a year has passed.

My motivation is:

1.  There's a lot of hard-to-read  '#if CUDNN_MAJOR' code referencing cuDNN versions back as far as v4(!?).  We need to clean this out before it hampers our ability to nimbly move the codebase forward.

2.  There seems to be a difference of opinion on whether we should be supporting version 'N-1' (e.g. cuDNN6).  Our current MXNet 1.5 candidate does not compile against cuDNN v6, so this should be either fixed or be up-front stated to the user community.  The breaking PR was https://github.com/apache/incubator-mxnet/pull/14476.

Having read the prior discussion, my take on it is:

- Users should be given an ample time period (1 year?) to move to a new CUDA/cuDNN version once it becomes 'usable.'

- We should not claim to support a given version if it is no longer part of the MXNet CI.  Users should be warned of an impending dropping of this 'testing support.'

So these statements do not necessarily promise 'N-1' support.  I could see the CI transitioning from CUDA9-only -> CUDA9&10 -> CUDA10-only.  For some period before CUDA9 is dropped from CI, the user community would be warned.  After that, CUDA10 might be the only version tested by CI, and hence the only version supported (until the next CUDA version comes around).

Let me propose as a 'strawman' that we claim to support CUDA versions 9 and 10, with cuDNN version 7 only.  Those versions have been out for over 1.5 years.  So no CUDA 8 or cuDNN v6 support: their replacements have been available for over 1.5 years, and neither has coverage by our CI.

    -Dick

Re: [DISCUSS] Make MXNet deploy it's own distribution

Posted by Carin Meier <ca...@gmail.com>.
From the Clojure package perspective, since it is compatible with Maven,
this approach will work fine.
It would also make it easier for developers to build on top of MXNet and
share their libraries.

- Carin


Re: [DISCUSS] Make MXNet deploy it's own distribution

Posted by Pedro Larroy <pe...@gmail.com>.
Nice!  +1 to this approach; it seems well thought out.  Thanks for including
Android and linux-arm.  Do Android and linux-arm use a different
classifier?


Re: [DISCUSS] Make MXNet deploy it's own distribution

Posted by Frank Liu <fr...@gmail.com>.
No, this is just a CD pipeline that publishes a jar file containing
libmxnet.so to Maven Central.
We already publish the Scala package to Maven Central. The issue with the
current Scala Maven package is that it is not flexible for developers who
want to switch between different versions of the libmxnet.so file.
With this proposal, developers can easily choose the right libmxnet.so file
by auto-detecting the OS type.




Re: [DISCUSS] Make MXNet deploy it's own distribution

Posted by Chris Olivier <cj...@apache.org>.
Will this be another repo under Apache repo? Is tensorflow java package in
a separate repo?


Re: [DISCUSS] Make MXNet deploy it's own distribution

Posted by Frank Liu <fr...@gmail.com>.
Maven packages are crucial to JVM-based languages. This proposal can
definitely benefit from your PR; without it, Maven publishing cannot be
automated.

It would be nice to automate Maven publishing as part of your CD pipeline.





Re: [DISCUSS] Make MXNet deploy it's own distribution

Posted by Per da Silva <pe...@gmail.com>.
Hi,

We've started working on something along these lines as part of the CD
pipeline framework. The idea is to compile and test libmxnet.so (both
statically and dynamically linked) for the different variants (cpu, gpu,
mkl, etc.), then have the different MXNet frontends (Python, Julia, Scala,
etc.) just wrap around the library.

I've been on long term sick leave and haven't been able to move forward
with this, although I have an open PR that kicks off this work:
https://github.com/apache/incubator-mxnet/pull/15051 - I welcome everyone
to take a look. It's the first of a series of PRs to automate the
distribution of the python (pip and docker) packages. Instead of using
maven, we have opted to use S3. But this decision can be revisited.

We also want to distribute what we termed "runtime" docker images. Docker
images containing the dynamically linked mxnet library and all of the
runtime dependencies (examples: https://hub.docker.com/r/mxnet/runtime).
This could facilitate the packaging and distribution of docker images for
the different frontends.

Cheers,

Per


Re: [DISCUSS] Make MXNet deploy it's own distribution

Posted by Qing Lan <la...@live.com>.
In that case, the answer is yes. The Scala package will be published in one version with a variety of backend package choices. Users can easily attach and detach different MXNet versions. However, the Scala package cannot run without a backend.

Another key advantage of this design is broader support for different implementations such as JavaCPP. Users will be able to implement their own customized MXNet frontends on top of the native library.

Thanks,
Qing


Re: [DISCUSS] Make MXNet deploy it's own distribution

Posted by Sheng Zha <sz...@gmail.com>.
Does it mean that the scala binding of mxnet will be an independent package that doesn’t directly depend on the native package, and user projects need to declare dependency on both the scala binding and one of the native packages?

-sz


[DISCUSS] Make MXNet deploy it's own distribution

Posted by Frank Liu <fr...@gmail.com>.
Currently, MXNet is built along with its different language bindings, such
as Scala.

The libmxnet.so files are bundled within the Scala jar package.

It would be nice to distribute the libmxnet.so library independently in Maven,
so the Scala package can choose which native library to use.

Here is the design document on cwiki:
https://cwiki.apache.org/confluence/display/MXNET/Make+MXNet+deploy+it%27s+own+distribution

Thanks,

Frank

Re: CUDA / CUDNN support revisited

Posted by Dick Carter <di...@apache.org>.
Heads up that I'll soon be submitting a PR to help with cuda/cudnn version checking.  My goal is to address two points:

    - rnn.cc of mxnet v1.5 does not compile against cudnn v6.  Do we scramble to fix it or admit that we no longer support cudnn v6 or earlier?
    - how do we handle the process of removing code that assumes these no-longer-supported cuda/cudnn versions?

I agree with Kellen's statements that the transition should be tied to timeframe more than N/N-1.  Users have had over 1.5 years to move to cuda 9 / cudnn 7, so it's time to drop cuda 8 / cudnn 6 in my opinion.

My PR will be supplying the 'mechanism' of dealing with cuda/cudnn versions.  We can continue the discussion on the final 'policy' settings here and in the PR.


On 2019/06/19 17:00:02, kellen sunderland <ke...@gmail.com> wrote: 
> Just double checked CUDA 9, 10 and 10.1 all support SM3, so actually I
> don't believe there's any need to drop SMs.
> 
> On Wed, Jun 19, 2019 at 9:56 AM kellen sunderland <
> kellen.sunderland@gmail.com> wrote:
> 
> > I think where we're all going to have agreement is that we shouldn't have
> > code targeting CUDA versions earlier than CUDA 9, or cuDNN versions earlier
> > than 6.  We can go ahead and remove any code that targets those old
> > versions, and drop any SMs that are not supported by CUDA 9 / cuDNN 6.  Id
> > suggest we also add some logging for users with prior versions letting them
> > know they can still use MXNet 1.4.
> >
> > Where things get interesting is CUDA 9 / cuDNN 6 support.  I was
> > originally a proponent of the N and N-1 route for simplicity.  Looking back
> > at the choice, one complication I see is that there's competing concerns
> > between semver library compatibility and feature releases on NVIDIA's
> > part.  NVIDIA is releasing new libraries with a lot of new features on a
> > regular basis, which is good, but for compatibility reasons they've begun
> > to bump major versions less often, which is also probably also good.  For
> > example if memory serves correctly cuDNN used to get an MV bump every 6
> > months or so, but now the N-1 MV (6) was released in April of 2017.  As a
> > project maintainer I would certainly like to drop support for library
> > versions that are 2 years old in my latest release.  Supporting a 2 year
> > wide range of dependency libraries in the CI for example is going to be a
> > burden.
> >
> > From the MXNet users' perspective obviously having to update dependencies
> > is a pain, but updating these libs are likely to give significant
> > performance increases (occasional perf regressions aside).  I think a
> > consistent thread I've heard from users is that training takes too long,
> > inference costs too much, and they want their DL framework to abstract the
> > complexity of using custom hardware like TCs or AVX with them having to put
> > in a lot of effort.  Another consideration is that using old versions of
> > MXNet is actually quite easy and convenient thanks to (IMO) some solid
> > release practices and naming conventions.
> >
> > Given how easy it is to use old MXNet versions I think it's reasonable to
> > target CUDA 10 and cuDNN 7 only in release 1.5 (and drop incompatible sm
> > versions).
> >
> > On Wed, Jun 19, 2019 at 4:01 AM Marco de Abreu <ma...@gmail.com>
> > wrote:
> >
> >> Good points anirudh. Generally I would understand N as being the major
> >> versions. Speak we would maintain CUDA 9 and 10.1 in your given example
> >> and
> >> drop 10.0 as soon as we verified that 10.1 is working. CUDA 9 would only
> >> be
> >> dropped when 11 is released and tested.
> >>
> >> At the same time, we would always only supported the latest compatible
> >> cudnn version. Or is there any reason somebody would use an old cudnn
> >> version?
> >>
> >> Wdyt?
> >>
> >> -Marco
> >>
> >> Anirudh Subramanian <an...@gmail.com> schrieb am Mi., 19. Juni
> >> 2019,
> >> 01:47:
> >>
> >> > +1, Agree this should be done for both CUDA and CUDNN versions. At max
> >> CUDA
> >> > Version N and CUDA Version N - 1 should be supported in CI.
> >> >
> >> > My question is what happens, when we are at a position, where we are on
> >> a
> >> > CUDA version N and removed support for CUDA version N - 1. Within a
> >> small
> >> > duration Nvidia comes up with a CUDA patch version N + 1, where  some
> >> perf
> >> > regressions and some bugs have been fixed. Should we just move to N + 1,
> >> > since version N will have all these issues for users and may also slow
> >> us
> >> > down on CI.
> >> >
> >> > I am facing a issue with CUDA 10 and CUDA 10.1 which also seems to be
> >> > causing intermittent CI failures:
> >> > https://github.com/apache/incubator-mxnet/issues/15273 . There is
> >> already
> >> > a
> >> > PR to bump up Nvidia version to 10.1 (
> >> > https://github.com/apache/incubator-mxnet/pull/14986/files).
> >> >
> >> > I think for situations where there is a quick follow up release like
> >> 10.1
> >> > and MXNet users are impacted by certain issues, we should just bump up
> >> the
> >> > version and stop support for 10.0.
> >> > Would like to hear more from Nvidia folks (on this particular case of
> >> CUDA
> >> > 10.0 vs CUDA 10.1 and what are the recommendations for existing
> >> customers).
> >> >
> >> > Anirudh
> >> >
> >> > On Mon, Jun 3, 2019 at 4:21 PM Dick Carter <di...@apache.org>
> >> wrote:
> >> >
> >> > > Actually, I tried to say that support *doesn't necessarily* include
> >> N-1.
> >> > > I'm proposing that the supported versions are 1) covered by CI and 2)
> >> > have
> >> > > been available in a usable form long enough that a semi-motivated user
> >> > has
> >> > > been able to transition to it.  That might mean only N (e.g. per my
> >> > > proposal, only cuDNN v7).
> >> > >
> >> > > Regarding precedent for N / N-1,  when a new CUDA version comes out,
> >> > users
> >> > > will transition to it at their own pace, thereby creating a N / N-1
> >> > support
> >> > > situation for some period.
> >> > >
> >> > >
> >> > > On 2019/06/03 22:43:20, Pedro Larroy <pe...@gmail.com>
> >> > > wrote:
> >> > > > Your proposal of having support for N and N-1 makes a lot of sense
> >> to
> >> > > > me. Are there use cases for supporting older CUDA versions?
> >> > > >
> >> > > >
> >> > > > Thanks.
> >> > > >

Re: CUDA / CUDNN support revisited

Posted by kellen sunderland <ke...@gmail.com>.
Just double-checked: CUDA 9, 10, and 10.1 all support SM3, so I don't
actually believe there's any need to drop SMs.


Re: CUDA / CUDNN support revisited

Posted by kellen sunderland <ke...@gmail.com>.
I think where we're all going to have agreement is that we shouldn't have
code targeting CUDA versions earlier than CUDA 9, or cuDNN versions earlier
than 6.  We can go ahead and remove any code that targets those old
versions, and drop any SMs that are not supported by CUDA 9 / cuDNN 6.  I'd
suggest we also add some logging for users on prior versions, letting them
know they can still use MXNet 1.4.

Where things get interesting is CUDA 9 / cuDNN 6 support.  I was originally
a proponent of the N and N-1 route for simplicity.  Looking back at the
choice, one complication I see is that there's competing concerns between
semver library compatibility and feature releases on NVIDIA's part.  NVIDIA
is releasing new libraries with a lot of new features on a regular basis,
which is good, but for compatibility reasons they've begun to bump major
versions less often, which is probably also good.  For example, if memory
serves correctly, cuDNN used to get a major-version bump every 6 months or
so, but the N-1 major version (6) was released back in April of 2017.  As a project
maintainer I would certainly like to drop support for library versions that
are 2 years old in my latest release.  Supporting a 2 year wide range of
dependency libraries in the CI for example is going to be a burden.

From the MXNet users' perspective, having to update dependencies is
obviously a pain, but updating these libs is likely to give significant
performance increases (occasional perf regressions aside).  A consistent
thread I've heard from users is that training takes too long, inference
costs too much, and they want their DL framework to abstract away the
complexity of using custom hardware like Tensor Cores or AVX without them
having to put in a lot of effort.  Another consideration is that using old versions of
MXNet is actually quite easy and convenient thanks to (IMO) some solid
release practices and naming conventions.

Given how easy it is to use old MXNet versions I think it's reasonable to
target CUDA 10 and cuDNN 7 only in release 1.5 (and drop incompatible sm
versions).


Re: CUDA / CUDNN support revisited

Posted by Marco de Abreu <ma...@gmail.com>.
Good points, Anirudh. Generally I would understand N as being the major
version. That is, we would maintain CUDA 9 and 10.1 in your given example and
drop 10.0 as soon as we verified that 10.1 is working. CUDA 9 would only be
dropped when 11 is released and tested.

At the same time, we would always support only the latest compatible cuDNN
version. Or is there any reason somebody would use an old cuDNN version?

Wdyt?

-Marco


Re: CUDA / CUDNN support revisited

Posted by Anirudh Subramanian <an...@gmail.com>.
+1, agreed this should be done for both CUDA and cuDNN versions. At most,
CUDA version N and CUDA version N - 1 should be supported in CI.

My question is what happens when we are on CUDA version N and have removed
support for CUDA version N - 1, and within a short time Nvidia comes up with
a CUDA patch version N + 1 in which some perf regressions and bugs have been
fixed. Should we just move to N + 1, since version N will have all these
issues for users and may also slow us down on CI?

I am facing an issue with CUDA 10 and CUDA 10.1 which also seems to be
causing intermittent CI failures:
https://github.com/apache/incubator-mxnet/issues/15273 . There is already a
PR to bump the Nvidia version up to 10.1 (
https://github.com/apache/incubator-mxnet/pull/14986/files).

I think for situations where there is a quick follow-up release like 10.1
and MXNet users are impacted by certain issues, we should just bump up the
version and stop support for 10.0. I would like to hear more from the Nvidia
folks (on this particular case of CUDA 10.0 vs CUDA 10.1, and what the
recommendations are for existing customers).

Anirudh


Re: CUDA / CUDNN support revisited

Posted by Dick Carter <di...@apache.org>.
Actually, I tried to say that support *doesn't necessarily* include N-1.  I'm proposing that the supported versions are 1) covered by CI and 2) have been available in a usable form long enough that a semi-motivated user has been able to transition to it.  That might mean only N (e.g. per my proposal, only cuDNN v7).

Regarding precedent for N / N-1, when a new CUDA version comes out, users will transition to it at their own pace, thereby creating an N / N-1 support situation for some period.



Re: CUDA / CUDNN support revisited

Posted by Pedro Larroy <pe...@gmail.com>.
Your proposal of having support for N and N-1 makes a lot of sense to
me. Are there use cases for supporting older CUDA versions?


Thanks.

On Mon, Jun 3, 2019 at 3:06 PM Dick Carter <di...@apache.org> wrote:
>
> I'd like to revisit the discussion of: https://lists.apache.org/thread.html/27b84e4fc0e0728f2e4ad8b6827d7f996635021a5a4d47b5d3f4dbfb@%3Cdev.mxnet.apache.org%3E now that a year has passed.
>
> My motivation is:
>
> 1.  There's a lot of hard-to-read  '#if CUDNN_MAJOR' code referencing cuDNN versions back as far as v4(!?).  We need to clean this out before it hampers our ability to nimbly move the codebase forward.
>
> > 2.  There seems to be a difference of opinion on whether we should be supporting version 'N-1' (e.g. cuDNN v6).  Our current MXNet 1.5 candidate does not compile against cuDNN v6, so this should either be fixed or be stated up front to the user community.  The breaking PR was https://github.com/apache/incubator-mxnet/pull/14476.
>
> Having read the prior discussion, my take on it is:
>
> - Users should be given an ample time period (1 year?) to move to a new CUDA/cuDNN version once it becomes 'usable.'
>
> > - We should not claim to support a given version if it is no longer part of the MXNet CI.  Users should be warned of an impending dropping of this 'testing support.'
>
> > So these statements do not necessarily promise 'N-1' support.  I could see a transition of the CI from CUDA9-only -> CUDA9&10 -> CUDA10-only.  For some period before CUDA9 is dropped from CI, the user community is warned.  After that time, CUDA10 might be the only version tested by CI, and hence the only version supported (until the next CUDA version came around).
>
> > Let me propose as a 'strawman' that we claim to support CUDA versions 9 and 10, with cuDNN version 7 only.  Those versions have been out for over 1.5 years.  So no CUDA 8 or cuDNN v6 support: both are over 1.5 years old with no coverage by our CI.
>
>     -Dick