Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2020/07/17 03:46:31 UTC

[GitHub] [incubator-mxnet] chengyuz opened a new issue #18743: AMP: an illegal memory access was encountered

chengyuz opened a new issue #18743:
URL: https://github.com/apache/incubator-mxnet/issues/18743


   ## Description
   I followed the AMP tutorial (https://mxnet.apache.org/api/python/docs/tutorials/performance/backend/amp.html) to enable AMP in my project, but training fails with the following error:
   INFO:root:----------------------------------------------------------------------------------------------------
   INFO:root:Using AMP
   INFO:root:Features in transition 1: 96 -> 96
   INFO:root:Features in transition 2: 192 -> 192
   INFO:root:Features in transition 3: 448 -> 448
   [11:43:40] /media/apache-mxnet-src-1.6.0-incubating/src/io/iter_image_recordio_2.cc:178: ImageRecordIOParser2: ./dataset/imagenet200/rec/train.rec, use 30 threads for decoding..
   [11:43:42] /media/apache-mxnet-src-1.6.0-incubating/src/io/iter_image_recordio_2.cc:178: ImageRecordIOParser2: ./dataset/imagenet200/rec/val.rec, use 30 threads for decoding..
   [11:44:05] /media/apache-mxnet-src-1.6.0-incubating/src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:97: Running performance tests to find the best convolution algorithm, this can take a while... (set the environment variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
   [11:44:10] /media/apache-mxnet-src-1.6.0-incubating/src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:97: Running performance tests to find the best convolution algorithm, this can take a while... (set the environment variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
   [11:44:18] /media/apache-mxnet-src-1.6.0-incubating/src/kvstore/././comm.h:744: only 0 out of 2 GPU pairs are enabled direct access. It may affect the performance. You can set MXNET_ENABLE_GPU_P2P=0 to turn it off
   [11:44:18] /media/apache-mxnet-src-1.6.0-incubating/src/kvstore/././comm.h:753: ..
   [11:44:18] /media/apache-mxnet-src-1.6.0-incubating/src/kvstore/././comm.h:753: ..
   Traceback (most recent call last):
     File "scripts/train_imagenet.py", line 807, in <module>
       main()
     File "scripts/train_imagenet.py", line 803, in main
       train(context)
     File "scripts/train_imagenet.py", line 736, in train
       trainer.step(batch_size)
     File "/media/apache-mxnet-src-1.6.0-incubating/python/mxnet/gluon/trainer.py", line 334, in step
       self._allreduce_grads()
     File "/media/apache-mxnet-src-1.6.0-incubating/python/mxnet/gluon/trainer.py", line 364, in _allreduce_grads
       self._kvstore.push(i, param.list_grad(), priority=-i)
     File "/media/apache-mxnet-src-1.6.0-incubating/python/mxnet/kvstore.py", line 234, in push
       self.handle, mx_uint(len(ckeys)), ckeys, cvals, ctypes.c_int(priority)))
     File "/media/apache-mxnet-src-1.6.0-incubating/python/mxnet/base.py", line 255, in check_call
       raise MXNetError(py_str(_LIB.MXGetLastError()))
   mxnet.base.MXNetError: [11:44:18] /media/apache-mxnet-src-1.6.0-incubating/src/storage/./pooled_storage_manager.h:164: cudaMalloc failed: an illegal memory access was encountered
   Stack trace:
     [bt] (0) /media/apache-mxnet-src-1.6.0-incubating/python/mxnet/../../build/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x43) [0x7f500e8f9493]
     [bt] (1) /media/apache-mxnet-src-1.6.0-incubating/python/mxnet/../../build/libmxnet.so(mxnet::storage::GPUPooledStorageManager::Alloc(mxnet::Storage::Handle*)+0x245) [0x7f50113b6775]
     [bt] (2) /media/apache-mxnet-src-1.6.0-incubating/python/mxnet/../../build/libmxnet.so(mxnet::StorageImpl::Alloc(mxnet::Storage::Handle*)+0x59) [0x7f50113b8c79]
     [bt] (3) /media/apache-mxnet-src-1.6.0-incubating/python/mxnet/../../build/libmxnet.so(mxnet::NDArray::NDArray(mxnet::TShape const&, mxnet::Context, bool, int)+0x52b) [0x7f500e91272b]
     [bt] (4) /media/apache-mxnet-src-1.6.0-incubating/python/mxnet/../../build/libmxnet.so(mxnet::kvstore::CommDevice::Reduce(int, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, int)+0x277) [0x7f500ebb5eb7]
     [bt] (5) /media/apache-mxnet-src-1.6.0-incubating/python/mxnet/../../build/libmxnet.so(mxnet::kvstore::KVStoreLocal::PushImpl(std::vector<int, std::allocator<int> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, int)+0x11d) [0x7f500ebb9f5d]
     [bt] (6) /media/apache-mxnet-src-1.6.0-incubating/python/mxnet/../../build/libmxnet.so(MXKVStorePush+0x105) [0x7f500e903845]
     [bt] (7) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f504603fdae]
     [bt] (8) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x22f) [0x7f504603f71f]
   
   ## Environment
   
   MXNet 1.6.0 (built from source), GTX 2080, Python 3.6.9
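
An illegal memory access is usually reported asynchronously, so the `cudaMalloc` call in the trace may only be where an earlier fault surfaced. One way to narrow it down is to rerun the script with synchronous execution enabled. The sketch below is an assumption about how one might do that, not part of the original report: `MXNET_CUDNN_AUTOTUNE_DEFAULT` and `MXNET_ENABLE_GPU_P2P` come from the log above, while `CUDA_LAUNCH_BLOCKING` and `MXNET_ENGINE_TYPE=NaiveEngine` are added here as debugging switches. The helper names `build_debug_env` and `run_with_debug_env` are hypothetical.

```python
import os
import subprocess
import sys

# Settings applied on top of the current environment.
# The first two are named in the log; the last two are assumed debugging aids.
DEBUG_ENV = {
    "MXNET_CUDNN_AUTOTUNE_DEFAULT": "0",   # skip cuDNN algorithm autotuning
    "MXNET_ENABLE_GPU_P2P": "0",           # turn off GPU peer-to-peer access
    "CUDA_LAUNCH_BLOCKING": "1",           # make CUDA kernel launches synchronous
    "MXNET_ENGINE_TYPE": "NaiveEngine",    # run MXNet operators synchronously
}


def build_debug_env(base=None):
    """Return a copy of `base` (default: os.environ) with DEBUG_ENV applied."""
    env = dict(os.environ if base is None else base)
    env.update(DEBUG_ENV)
    return env


def run_with_debug_env(argv):
    """Run `argv` as a child process with the debug environment.

    Returns the child's exit code.
    """
    return subprocess.run(argv, env=build_debug_env(), check=False).returncode


# Example (the script path is taken from the traceback; arguments are omitted):
# run_with_debug_env([sys.executable, "scripts/train_imagenet.py"])
```

With these settings the stack trace should point closer to the operator that actually faults, at the cost of much slower training.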
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-mxnet] szha commented on issue #18743: AMP: an illegal memory access was encountered

Posted by GitBox <gi...@apache.org>.
szha commented on issue #18743:
URL: https://github.com/apache/incubator-mxnet/issues/18743#issuecomment-659839040


   How do you reproduce the error?

