Posted to issues@mxnet.apache.org by GitBox <gi...@apache.org> on 2021/02/19 20:04:57 UTC

[GitHub] [incubator-mxnet] Zha0q1 opened a new issue #19929: [v1.x] CU102 CD Failure due to Cuda/Cudnn/CuBlas mismatch

Zha0q1 opened a new issue #19929:
URL: https://github.com/apache/incubator-mxnet/issues/19929


   https://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/restricted-mxnet-cd%2Fmxnet-cd-release-job-1.x/detail/mxnet-cd-release-job-1.x/1530/pipeline/435
   
   ```
   [2021-02-18T21:59:21.985Z]   what():  [21:59:18] /work/mxnet/3rdparty/mshadow/mshadow/./stream_gpu-inl.h:126: Check failed: err == CUBLAS_STATUS_SUCCESS (7 vs. 0) : Destory cublas handle failed
   
   [2021-02-18T21:59:21.985Z] Stack trace:
   
   [2021-02-18T21:59:21.985Z]   [bt] (0) /work/mxnet/python/mxnet/../../lib/libmxnet.so(+0x27ed308) [0x7fb4b07e3308]
   
   [2021-02-18T21:59:21.985Z]   [bt] (1) /work/mxnet/python/mxnet/../../lib/libmxnet.so(+0x77e1879) [0x7fb4b57d7879]
   
   [2021-02-18T21:59:21.985Z]   [bt] (2) /work/mxnet/python/mxnet/../../lib/libmxnet.so(+0x77e1e36) [0x7fb4b57d7e36]
   
   [2021-02-18T21:59:21.985Z]   [bt] (3) /work/mxnet/python/mxnet/../../lib/libmxnet.so(void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)1>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)1>*, std::shared_ptr<dmlc::ManualEvent> const&)+0x1c7) [0x7fb4b57f7097]
   
   [2021-02-18T21:59:21.985Z]   [bt] (4) /work/mxnet/python/mxnet/../../lib/libmxnet.so(std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#3}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>)+0x46) [0x7fb4b57f7346]
   
   [2021-02-18T21:59:21.985Z]   [bt] (5) /work/mxnet/python/mxnet/../../lib/libmxnet.so(+0x77f92b4) [0x7fb4b57ef2b4]
   
   [2021-02-18T21:59:21.985Z]   [bt] (6) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7fb555711c80]
   
   [2021-02-18T21:59:21.985Z]   [bt] (7) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7fb55d8886ba]
   
   [2021-02-18T21:59:21.985Z]   [bt] (8) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7fb55ca6b4dd]
   ```
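
The numeric status in the `Check failed` line can be decoded with a small helper. This is only a sketch: the code-to-name table is recalled from the `cublasStatus_t` enum in the cuBLAS headers and should be treated as an assumption to verify against the installed `cublas_api.h`, and `decode_cublas_check` is a hypothetical helper name, not part of MXNet or mshadow.

```python
import re

# Assumed mapping, recalled from the cublasStatus_t enum; verify against cublas_api.h.
CUBLAS_STATUS = {
    0: "CUBLAS_STATUS_SUCCESS",
    1: "CUBLAS_STATUS_NOT_INITIALIZED",
    3: "CUBLAS_STATUS_ALLOC_FAILED",
    7: "CUBLAS_STATUS_INVALID_VALUE",
    8: "CUBLAS_STATUS_ARCH_MISMATCH",
    11: "CUBLAS_STATUS_MAPPING_ERROR",
    13: "CUBLAS_STATUS_EXECUTION_FAILED",
    14: "CUBLAS_STATUS_INTERNAL_ERROR",
    15: "CUBLAS_STATUS_NOT_SUPPORTED",
    16: "CUBLAS_STATUS_LICENSE_ERROR",
}

# Matches the "(<code> vs. 0)" part of the failed check in the log line.
CHECK_RE = re.compile(r"err == CUBLAS_STATUS_SUCCESS \((\d+) vs\. 0\)")


def decode_cublas_check(log_line):
    """Return the status name for a failed cuBLAS check, or None if the line has no such check."""
    match = CHECK_RE.search(log_line)
    if match is None:
        return None
    code = int(match.group(1))
    return CUBLAS_STATUS.get(code, f"unknown cuBLAS status {code}")


# The message text (including its spelling) is copied from the log above.
LOG_LINE = (
    "Check failed: err == CUBLAS_STATUS_SUCCESS (7 vs. 0) : "
    "Destory cublas handle failed"
)
status = decode_cublas_check(LOG_LINE)
print(status)
```

If the table is right, the failing call reported an invalid-value status while destroying the handle. That would be consistent with a library mismatch, but it does not prove one.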
   
   This has been happening for a while now. https://github.com/apache/incubator-mxnet/pull/19506 attempted to fix it, but the error either persisted or came back. I think this is most likely a CUDA/cuDNN/cuBLAS version mismatch. I have created a branch in which the following section of https://github.com/apache/incubator-mxnet/blob/v1.x/ci/docker/Dockerfile.build.ubuntu_gpu_cu102 is removed altogether:
   ```
   ENV CUDA_VERSION=10.2.89
   ENV CUDNN_VERSION=8.0.4.30
   COPY install/ubuntu_cudnn.sh /work/
   RUN /work/ubuntu_cudnn.sh
   ```
   I have kicked off a run on that branch to see whether this solves the issue.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@mxnet.apache.org
For additional commands, e-mail: issues-help@mxnet.apache.org


[GitHub] [incubator-mxnet] Zha0q1 commented on issue #19929: [v1.x] CU102 CD Failure due to Cuda/Cudnn/CuBlas mismatch

Posted by GitBox <gi...@apache.org>.
Zha0q1 commented on issue #19929:
URL: https://github.com/apache/incubator-mxnet/issues/19929#issuecomment-783638485


   fixed


[GitHub] [incubator-mxnet] Zha0q1 commented on issue #19929: [v1.x] CU102 CD Failure due to Cuda/Cudnn/CuBlas mismatch

Posted by GitBox <gi...@apache.org>.
Zha0q1 commented on issue #19929:
URL: https://github.com/apache/incubator-mxnet/issues/19929#issuecomment-782441817


   I built mxnet in the test env and it passed the tests


[GitHub] [incubator-mxnet] Zha0q1 closed issue #19929: [v1.x] CU102 CD Failure due to Cuda/Cudnn/CuBlas mismatch

Posted by GitBox <gi...@apache.org>.
Zha0q1 closed issue #19929:
URL: https://github.com/apache/incubator-mxnet/issues/19929


   


[GitHub] [incubator-mxnet] Zha0q1 commented on issue #19929: [v1.x] CU102 CD Failure due to Cuda/Cudnn/CuBlas mismatch

Posted by GitBox <gi...@apache.org>.
Zha0q1 commented on issue #19929:
URL: https://github.com/apache/incubator-mxnet/issues/19929#issuecomment-782409556


   Removing 
   ```
   ENV CUDA_VERSION=10.2.89
   ENV CUDNN_VERSION=8.0.4.30
   COPY install/ubuntu_cudnn.sh /work/
   RUN /work/ubuntu_cudnn.sh
   ```
   in the test env did not work.
   @mseth10, could you provide some context on why you updated cuDNN from 7 to 8?
   I think the cu102 MXNet build is still built with cuDNN 7:
   https://jenkins.mxnet-ci.amazon-ml.com/blue/rest/organizations/jenkins/pipelines/restricted-mxnet-cd/pipelines/mxnet-cd-release-job-1.x/runs/1522/nodes/74/steps/203/log/?start=0
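
The suspected mismatch can be checked mechanically by comparing the cuDNN major version pinned in the test Dockerfile with the one the library is built against. The following is a minimal sketch, not CI code: `env_versions`, `major`, and `BUILD_CUDNN_MAJOR` are hypothetical names, and the build-side value 7 is only the guess stated in the comment above, not a confirmed fact.

```python
import re


def env_versions(dockerfile_text):
    """Collect `ENV NAME=value` pairs from Dockerfile text into a dict."""
    pattern = re.compile(r"^\s*ENV\s+(\w+)=(\S+)", re.MULTILINE)
    return dict(pattern.findall(dockerfile_text))


def major(version):
    """Return the leading numeric component of a dotted version string."""
    return int(version.split(".")[0])


# The section quoted from Dockerfile.build.ubuntu_gpu_cu102 in this thread.
DOCKERFILE_SECTION = """
ENV CUDA_VERSION=10.2.89
ENV CUDNN_VERSION=8.0.4.30
COPY install/ubuntu_cudnn.sh /work/
RUN /work/ubuntu_cudnn.sh
"""

# Assumption: the cu102 build uses cuDNN 7, as guessed in the comment above.
BUILD_CUDNN_MAJOR = 7

versions = env_versions(DOCKERFILE_SECTION)
test_cudnn_major = major(versions["CUDNN_VERSION"])
mismatch = test_cudnn_major != BUILD_CUDNN_MAJOR

print(f"test env cuDNN major: {test_cudnn_major}, build cuDNN major: {BUILD_CUDNN_MAJOR}")
print("mismatch" if mismatch else "match")
```

Comparing only major versions is a deliberate simplification. It flags the 7 versus 8 difference discussed here and ignores minor or patch differences.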

