Posted to dev@mxnet.apache.org by Per da Silva <pe...@gmail.com> on 2019/04/09 07:07:31 UTC

CUDNN 7.5 Issues

Dear MXNet community,

I've been trying to update the CI GPU images to CUDA 10, but the tests are
failing. I'm not sure why and would really appreciate some help =D

I've managed, at least, to narrow the problem down to the cuDNN version.
The current CUDA 10 image uses cuDNN version 7.5.0.56
(https://gitlab.com/nvidia/cuda/blob/ubuntu16.04/10.0/devel/cudnn7/Dockerfile).
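
In case anyone wants to double-check which cuDNN version the container
actually loads at runtime, here's a rough sketch (I'm assuming the runtime
ships as libcudnn.so.7, which is how the Ubuntu/NVIDIA packages usually
install it):

    # Sketch: ask the loaded cuDNN library for its version number.
    # Assumes the cuDNN 7.x runtime is installed as libcudnn.so.7.
    import ctypes

    libcudnn = ctypes.CDLL("libcudnn.so.7")
    libcudnn.cudnnGetVersion.restype = ctypes.c_size_t

    version = libcudnn.cudnnGetVersion()  # e.g. 7500 for 7.5.0, 7301 for 7.3.1
    major = version // 1000
    minor = (version % 1000) // 100
    patch = version % 100
    print("cuDNN runtime version: %d.%d.%d (raw %d)" % (major, minor, patch, version))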

I noticed that the binary in the Python packages we release uses cuDNN
7.3.1.20
(https://github.com/apache/incubator-mxnet/blob/master/tools/setup_gpu_build_tools.sh#L34),
so I decided to create a PR that updates CI to CUDA 10 with cuDNN 7.3.1.20,
and sure enough the tests passed
(https://github.com/apache/incubator-mxnet/pull/14513).

After talking with another contributor, we decided that I would try to
create a PR with CUDA 10 and cuDNN 7.5 and simply disable the failing tests
(to be fixed later). But the problem seems to be a bit more heinous: I
disable one test, and another one fails. So it might make sense to reach
out now and see if we can find the root cause and fix it.
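
For context, disabling a test here just means the standard unittest skip
decorator in our nose-based test suite; a tiny sketch, with a made-up test
name:

    # Sketch of skipping one failing GPU test; the test below is a
    # made-up placeholder, not an actual test in the suite.
    import unittest

    @unittest.skip("Fails with CUDNN_STATUS_ARCH_MISMATCH on cuDNN 7.5")
    def test_some_cudnn_rnn_case():
        pass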

Some things I've sanity checked:

We run the tests on g3.8xlarge instances. These instances contain Tesla M60
GPUs. The Tesla M60s have a compute capability of 5.2. CUDA 10 supports
compute capabilities of 3.0 - 7.5 (https://en.wikipedia.org/wiki/CUDA).
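
If you want to verify the compute capability from inside the container
rather than trusting the spec sheet, something like this sketch should work
(device 0 only, error handling mostly omitted; it assumes the driver library
is visible as libcuda.so.1):

    # Sketch: query the compute capability of GPU 0 via the CUDA driver API.
    import ctypes

    cuda = ctypes.CDLL("libcuda.so.1")
    CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR = 75
    CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR = 76

    dev = ctypes.c_int()
    major = ctypes.c_int()
    minor = ctypes.c_int()
    assert cuda.cuInit(0) == 0
    assert cuda.cuDeviceGet(ctypes.byref(dev), 0) == 0
    cuda.cuDeviceGetAttribute(ctypes.byref(major),
                              CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, dev)
    cuda.cuDeviceGetAttribute(ctypes.byref(minor),
                              CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, dev)
    print("Compute capability: %d.%d" % (major.value, minor.value))  # expect 5.2 on an M60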

According to the cuDNN support matrix
(https://docs.nvidia.com/deeplearning/sdk/cudnn-support-matrix/index.html),
cuDNN 7.5 is compatible with the GPU and with CUDA 10, and requires driver
r410.48 (I assume that means 410.48 or greater).

The AMIs running on the g3.8xlarge have CUDA 10 and driver 410.73.
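
The driver check itself boils down to something like this (driver_version
and name are standard nvidia-smi query fields, but the exact field set
depends on the driver, so treat that as an assumption):

    # Sketch: confirm the installed NVIDIA driver version on the host.
    import subprocess

    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=driver_version,name", "--format=csv,noheader"]
    )
    for line in out.decode().strip().splitlines():
        driver, name = [field.strip() for field in line.split(",")]
        print("%s: driver %s (need >= 410.48 for cuDNN 7.5 on CUDA 10)" % (name, driver))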

So, as best I can tell, our environment ought to support cuDNN 7.5, which
leads me to conclude that maybe there's something wrong in the code.

The errors are always: "src/operator/./cudnn_rnn-inl.h:759: Check failed: e
== CUDNN_STATUS_SUCCESS (6 vs. 0) cuDNN: CUDNN_STATUS_ARCH_MISMATCH".

According to the cuDNN user guide (
https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html):

CUDNN_STATUS_ARCH_MISMATCH

The function requires a feature absent from the current GPU device. Note
that cuDNN only supports devices with compute capabilities greater than or
equal to 3.0.

To correct: compile and run the application on a device with appropriate
compute capability.

But, as we've seen, our environment seems to support this version of cuDNN
and other versions go through CI w/o any problem...
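
Just to rule out me misreading the numeric code, mapping status 6 back to
its name via cudnnGetErrorString gives the same answer; rough sketch, with
the same libcudnn.so.7 assumption as above:

    # Sketch: translate the raw cuDNN status code from the log (6) into its name.
    import ctypes

    libcudnn = ctypes.CDLL("libcudnn.so.7")
    libcudnn.cudnnGetErrorString.restype = ctypes.c_char_p
    libcudnn.cudnnGetErrorString.argtypes = [ctypes.c_int]

    print(libcudnn.cudnnGetErrorString(6).decode())  # CUDNN_STATUS_ARCH_MISMATCH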

You can see some logs here:
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fcentos-gpu/detail/PR-14611/1/pipeline/

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-14611/12/pipeline/

I have about 13 runs of this pipeline. The errors for different runs can be
seen by changing the number before /pipeline (e.g.
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fcentos-gpu/detail/PR-14611/2/pipeline/
for the 2nd run, etc.)

Thanks in advance for the help!

You can reach me here or on Slack if you have any questions =D

Cheers,

Per

P.S. I'm attaching some instructions on how to reproduce the issue at home
(or at least on a g3.8xlarge instance running Ubuntu 16.04).
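
In case the attachment doesn't come through, the core of the repro is just
running any RNN layer on the GPU so that the cuDNN RNN operator
(cudnn_rnn-inl.h) gets exercised; here's a minimal sketch with made-up
shapes, not the attached instructions themselves:

    # Minimal sketch of the kind of call that hits the cuDNN RNN path.
    # Shapes below are arbitrary placeholders.
    import mxnet as mx
    from mxnet.gluon import rnn

    ctx = mx.gpu(0)
    layer = rnn.LSTM(hidden_size=128, num_layers=2)
    layer.initialize(ctx=ctx)

    # Layout is (sequence_length, batch_size, input_size).
    x = mx.nd.random.uniform(shape=(30, 16, 64), ctx=ctx)
    y = layer(x)
    y.wait_to_read()  # forces the async cuDNN call, surfacing the status check
    print(y.shape)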

Re: CUDNN 7.5 Issues

Posted by Per da Silva <pe...@gmail.com>.
Hey Kellen,

I really appreciate that. Thank you!

And thanks to the community for supporting me ^^

Per



Re: CUDNN 7.5 Issues

Posted by kellen sunderland <ke...@gmail.com>.
Hey Per, just wanted to drop a line and say thanks for supporting the
community on this one.


Re: CUDNN 7.5 Issues

Posted by Per da Silva <pe...@gmail.com>.
I've created an issue to track this problem:
https://github.com/apache/incubator-mxnet/issues/14652
