Posted to dev@mxnet.apache.org by Dick Carter <di...@apache.org> on 2019/07/02 22:33:40 UTC
Re: CUDA / CUDNN support revisited

Heads up that I'll soon be submitting a PR to help with cuda/cudnn version checking.  My goal is to address two points:

    - rnn.cc of mxnet v1.5 does not compile against cudnn v6.  Do we scramble to fix it or admit that we no longer support cudnn v6 or earlier?
    - how do we handle the process of removing code that assumes these no-longer-supported cuda/cudnn versions?

I agree with Kellen's statements that the transition should be tied to timeframe more than N/N-1.  Users have had over 1.5 years to move to cuda 9 / cudnn 7, so it's time to drop cuda 8 / cudnn 6 in my opinion.

My PR will be supplying the 'mechanism' of dealing with cuda/cudnn versions.  We can continue the discussion on the final 'policy' settings here and in the PR.
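
To make the mechanism/policy split concrete, here is a minimal sketch of what such a version check might look like. It is not the code from the PR. The names (kMinCudaVersion, kMinCudnnMajor, CheckVersions) are hypothetical, and the thresholds are simply the strawman values from this thread (CUDA 9 or later, cuDNN v7 or later). The only outside fact it relies on is the CUDA_VERSION encoding from cuda.h (1000 * major + 10 * minor, e.g. 10010 for CUDA 10.1).

```cpp
#include <string>

// 'Policy': hypothetical minimums taken from the strawman in this thread.
// These are illustrative values, not settings that MXNet actually ships.
constexpr int kMinCudaVersion = 9000;  // CUDA_VERSION encoding: 1000*major + 10*minor
constexpr int kMinCudnnMajor = 7;      // compared against CUDNN_MAJOR

// Decode a CUDA_VERSION-style integer, e.g. 10010 -> major 10, minor 1.
constexpr int CudaMajor(int cuda_version) { return cuda_version / 1000; }
constexpr int CudaMinor(int cuda_version) { return (cuda_version % 1000) / 10; }

// 'Mechanism': returns an empty string if the versions meet the policy,
// otherwise a message suitable for a build error or a runtime warning.
std::string CheckVersions(int cuda_version, int cudnn_major) {
  if (cuda_version < kMinCudaVersion) {
    return "CUDA " + std::to_string(CudaMajor(cuda_version)) + "." +
           std::to_string(CudaMinor(cuda_version)) +
           " is below the minimum supported version; consider an older MXNet release.";
  }
  if (cudnn_major < kMinCudnnMajor) {
    return "cuDNN v" + std::to_string(cudnn_major) +
           " is below the minimum supported version; consider an older MXNet release.";
  }
  return "";
}

// In a real build the same policy could also be enforced at compile time,
// after including <cuda.h> and <cudnn.h>, for example:
//   static_assert(CUDNN_MAJOR >= kMinCudnnMajor, "cuDNN v7 or later is required");
```

With a check like this in one place, changing the policy means editing two constants, and any '#if CUDNN_MAJOR' branch below the minimum becomes dead code that can be removed.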


On 2019/06/19 17:00:02, kellen sunderland <ke...@gmail.com> wrote: 
> Just double checked CUDA 9, 10 and 10.1 all support SM3, so actually I
> don't believe there's any need to drop SMs.
> 
> On Wed, Jun 19, 2019 at 9:56 AM kellen sunderland <
> kellen.sunderland@gmail.com> wrote:
> 
> > I think where we're all going to have agreement is that we shouldn't have
> > code targeting CUDA versions earlier than CUDA 9, or cuDNN versions earlier
> > than 6.  We can go ahead and remove any code that targets those old
> > versions, and drop any SMs that are not supported by CUDA 9 / cuDNN 6.  I'd
> > suggest we also add some logging for users with prior versions letting them
> > know they can still use MXNet 1.4.
> >
> > Where things get interesting is CUDA 9 / cuDNN 6 support.  I was
> > originally a proponent of the N and N-1 route for simplicity.  Looking back
> > at the choice, one complication I see is that there's competing concerns
> > between semver library compatibility and feature releases on NVIDIA's
> > part.  NVIDIA is releasing new libraries with a lot of new features on a
> > regular basis, which is good, but for compatibility reasons they've begun
> > to bump major versions less often, which is probably also good.  For
> > example if memory serves correctly cuDNN used to get an MV bump every 6
> > months or so, but now the N-1 MV (6) was released in April of 2017.  As a
> > project maintainer I would certainly like to drop support for library
> > versions that are 2 years old in my latest release.  Supporting a 2 year
> > wide range of dependency libraries in the CI for example is going to be a
> > burden.
> >
> > From the MXNet users' perspective obviously having to update dependencies
> > is a pain, but updating these libs is likely to give significant
> > performance increases (occasional perf regressions aside).  I think a
> > consistent thread I've heard from users is that training takes too long,
> > inference costs too much, and they want their DL framework to abstract the
> > complexity of using custom hardware like TCs or AVX without them having to put
> > in a lot of effort.  Another consideration is that using old versions of
> > MXNet is actually quite easy and convenient thanks to (IMO) some solid
> > release practices and naming conventions.
> >
> > Given how easy it is to use old MXNet versions I think it's reasonable to
> > target CUDA 10 and cuDNN 7 only in release 1.5 (and drop incompatible sm
> > versions).
> >
> > On Wed, Jun 19, 2019 at 4:01 AM Marco de Abreu <ma...@gmail.com>
> > wrote:
> >
> >> Good points Anirudh. Generally I would understand N as being the major
> >> versions. That is, we would maintain CUDA 9 and 10.1 in your given example
> >> and drop 10.0 as soon as we verified that 10.1 is working. CUDA 9 would
> >> only be dropped when 11 is released and tested.
> >>
> >> At the same time, we would always only support the latest compatible
> >> cudnn version. Or is there any reason somebody would use an old cudnn
> >> version?
> >>
> >> Wdyt?
> >>
> >> -Marco
> >>
> >> Anirudh Subramanian <an...@gmail.com> wrote on Wed, 19 June 2019, 01:47:
> >>
> >> > +1, Agree this should be done for both CUDA and CUDNN versions. At most
> >> > CUDA Version N and CUDA Version N - 1 should be supported in CI.
> >> >
> >> > My question is what happens when we are on CUDA version N and have
> >> > removed support for CUDA version N - 1, and within a short time Nvidia
> >> > comes out with a CUDA patch version N + 1 in which some perf regressions
> >> > and some bugs have been fixed. Should we just move to N + 1, since
> >> > version N will have all these issues for users and may also slow us
> >> > down on CI?
> >> >
> >> > I am facing an issue with CUDA 10 and CUDA 10.1 which also seems to be
> >> > causing intermittent CI failures:
> >> > https://github.com/apache/incubator-mxnet/issues/15273 . There is
> >> > already a PR to bump up Nvidia version to 10.1 (
> >> > https://github.com/apache/incubator-mxnet/pull/14986/files).
> >> >
> >> > I think for situations where there is a quick follow up release like
> >> 10.1
> >> > and MXNet users are impacted by certain issues, we should just bump up
> >> the
> >> > version and stop support for 10.0.
> >> > Would like to hear more from Nvidia folks (on this particular case of
> >> CUDA
> >> > 10.0 vs CUDA 10.1 and what are the recommendations for existing
> >> customers).
> >> >
> >> > Anirudh
> >> >
> >> > On Mon, Jun 3, 2019 at 4:21 PM Dick Carter <di...@apache.org>
> >> wrote:
> >> >
> >> > > Actually, I tried to say that support *doesn't necessarily* include
> >> N-1.
> >> > > I'm proposing that the supported versions are 1) covered by CI and 2)
> >> > have
> >> > > been available in a usable form long enough that a semi-motivated user
> >> > has
> >> > > been able to transition to it.  That might mean only N (e.g. per my
> >> > > proposal, only cuDNN v7).
> >> > >
> >> > > Regarding precedent for N / N-1,  when a new CUDA version comes out,
> >> > users
> >> > > will transition to it at their own pace, thereby creating a N / N-1
> >> > support
> >> > > situation for some period.
> >> > >
> >> > >
> >> > > On 2019/06/03 22:43:20, Pedro Larroy <pe...@gmail.com>
> >> > > wrote:
> >> > > > Your proposal of having support for N and N-1 makes a lot of sense
> >> to
> >> > > > me. Are there use cases for supporting older CUDA versions?
> >> > > >
> >> > > >
> >> > > > Thanks.
> >> > > >
> >> > > > On Mon, Jun 3, 2019 at 3:06 PM Dick Carter <di...@apache.org>
> >> > wrote:
> >> > > > >
> >> > > > > I'd like to revisit the discussion of:
> >> > >
> >> >
> >> https://lists.apache.org/thread.html/27b84e4fc0e0728f2e4ad8b6827d7f996635021a5a4d47b5d3f4dbfb@%3Cdev.mxnet.apache.org%3E
> >> > > now that a year has passed.
> >> > > > >
> >> > > > > My motivation is:
> >> > > > >
> >> > > > > 1.  There's a lot of hard-to-read  '#if CUDNN_MAJOR' code
> >> referencing
> >> > > cuDNN versions back as far as v4(!?).  We need to clean this out
> >> before
> >> > it
> >> > > hampers our ability to nimbly move the codebase forward.
> >> > > > >
> >> > > > > 2.  There seems to be a difference of opinion on whether we
> >> should be
> >> > > supporting version 'N-1' (e.g. cuDNN6).  Our current MXNet 1.5
> >> candidate
> >> > > does not compile against cuDNN v6, so this should be either fixed or
> >> be
> >> > > up-front stated to the user community.  The breaking PR was
> >> > > https://github.com/apache/incubator-mxnet/pull/14476.
> >> > > > >
> >> > > > > Having read the prior discussion, my take on it is:
> >> > > > >
> >> > > > > - Users should be given an ample time period (1 year?) to move to
> >> a
> >> > > new CUDA/cuDNN version once it becomes 'usable.'
> >> > > > >
> >> > > > > - We should not claim to support a given version if it is no
> >> longer
> >> > > part of the MXNet CI.  Users should be warned of an impending
> >> dropping of
> >> > > this 'testing support.'
> >> > > > >
> >> > > > > So these statements do not necessarily promise 'N-1' support.  I
> >> > could
> >> > > see a transitioning of the CI from CUDA9-only -> CUDA9&10 -> CUDA10
> >> only.
> >> > > Some period before CUDA9 is dropped from CI, the user community is
> >> > warned.
> >> > > After that time, CUDA10 might be the only version tested by CI, and
> >> hence
> >> > > the only version supported (until the next CUDA version came around).
> >> > > > >
> >> > > > > Let me propose as a 'strawman' that we claim to support CUDA
> >> version
> >> > 9
> >> > > and 10, with cuDNN version 7 only.  Those versions have been out for
> >> over
> >> > > 1.5 years.  So no CUDA 8 or cuDNN v6 support- over 1.5 years old with
> >> no
> >> > > coverage by our CI.
> >> > > > >
> >> > > > >     -Dick
> >> > > >
> >> > >
> >> >
> >>
> >
>