Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2018/06/28 19:51:47 UTC

[GitHub] DickJC123 removed a comment on issue #11359: Flaky test test_io:test_ImageRecordIter_seed_augmentation

URL: https://github.com/apache/incubator-mxnet/issues/11359#issuecomment-401135876
 
 
   I have a lead on the problem. There is an out-of-bounds read performed by the SequenceLastKernel. I'll stop here and let the person responsible for this kernel correct the problem. Kernels that read beyond their valid input tensor regions can be problematic, even if the randomly read data is never used in a subsequent kernel write. The problem surfaces when the reads fall outside of valid mapped address ranges, which results in an unserviceable TLB miss. The failures can be non-deterministic, since the input tensors may have non-deterministic placement within their mapped pages.
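   
   To make the failure mode concrete, here is a minimal CUDA sketch of the same bug class (a hypothetical gather kernel, not the actual SequenceLastKernel source): a per-batch sequence length selects which element each thread reads, and without the bounds check a bad length dereferences past the end of the input tensor. Whether that fault actually surfaces depends on whether the stray read crosses into an unmapped page.
   
   ```
   // sketch.cu -- hypothetical illustration of the bug class; this is NOT
   // the actual SequenceLastKernel, just the same out-of-bounds pattern.
   #include <cuda_runtime.h>
   #include <cstdio>
   
   __global__ void gather_last(const float *in, const int *lengths,
                               float *out, int batch, int seq_len) {
     int b = blockIdx.x * blockDim.x + threadIdx.x;
     if (b >= batch) return;            // guard the batch dimension
     int t = lengths[b] - 1;            // position of the last valid element
     // Without this check, a length outside [1, seq_len] reads past the end
     // of `in`. The read only faults when it lands in an unmapped page, so
     // the crash depends on where the tensor sits within its allocation.
     if (t < 0 || t >= seq_len) return;
     out[b] = in[t * batch + b];        // time-major layout: [seq_len, batch]
   }
   
   int main() {
     const int batch = 4, seq_len = 3;
     float *in, *out;
     int *lengths;
     cudaMallocManaged(&in, seq_len * batch * sizeof(float));
     cudaMallocManaged(&out, batch * sizeof(float));
     cudaMallocManaged(&lengths, batch * sizeof(int));
     for (int i = 0; i < seq_len * batch; ++i) in[i] = (float)i;
     for (int b = 0; b < batch; ++b) lengths[b] = seq_len;  // all full-length
     gather_last<<<1, 32>>>(in, lengths, out, batch, seq_len);
     cudaDeviceSynchronize();
     for (int b = 0; b < batch; ++b) printf("out[%d] = %f\n", b, out[b]);
     return 0;
   }
   ```
   
   As written (with the bounds check) this runs clean under cuda-memcheck; dropping the check and feeding an oversized length produces an "Invalid __global__ read" report of the same shape as the one below.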
   
   I debugged the problem by going to the first test that showed the failure in one of the above posts, capturing the MXNET_TEST_SEED, and then reproducing the error (on Linux, no less) with the following command:
   
   ```
   MXNET_TEST_SEED=731510245 cuda-memcheck nosetests --verbose -s tests/python/gpu/test_operator_gpu.py:test_sequence_last | c++filt
   [INFO] Setting module np/mx/python random seeds, use MXNET_MODULE_SEED=1613755850 to reproduce.
   [WARNING] *** test-level seed set: all "@with_seed()" tests run deterministically ***
   test_operator_gpu.test_sequence_last ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=731510245 to reproduce.
   ========= CUDA-MEMCHECK
   ========= Invalid __global__ read of size 4
   =========     at 0x00000390 in void mxnet::op::mxnet_op::mxnet_generic_kernel<mxnet::op::SequenceLastKernel<1>, float*, float*, float*, int, int, mshadow::Shape<2> >(int, float*, float*, float*, int, int, mshadow::Shape<2>)
   =========     by thread (2,0,0) in block (0,0,0)
   =========     Address 0x7f13f24003f8 is out of bounds
   =========     Device Frame:void mxnet::op::mxnet_op::mxnet_generic_kernel<mxnet::op::SequenceLastKernel<1>, float*, float*, float*, int, int, mshadow::Shape<2> >(int, float*, float*, float*, int, int, mshadow::Shape<2>) (void mxnet::op::mxnet_op::mxnet_generic_kernel<mxnet::op::SequenceLastKernel<1>, float*, float*, float*, int, int, mshadow::Shape<2> >(int, float*, float*, float*, int, int, mshadow::Shape<2>) : 0x390)
   =========     Saved host backtrace up to driver entry point at kernel launch time
   =========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 (cuLaunchKernel + 0x2cd) [0x24cc4d]
   =========     Host Frame:/usr/local/cuda/lib64/libcudart.so.9.0 [0x15680]
   =========     Host Frame:/usr/local/cuda/lib64/libcudart.so.9.0 (cudaLaunch + 0x14e) [0x33c9e]
   =========     Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (mxnet::op::SequenceLastOp<mshadow::gpu, float>::Forward(mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&) + 0xc3a) [0x53384da]
   =========     Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (mxnet::op::OperatorState::Forward(mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&) + 0x363) [0x3214a53]
   =========     Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (mxnet::exec::StatefulComputeExecutor::Run(mxnet::RunContext, bool) + 0x59) [0x3808f09]
   =========     Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so [0x37d5870]
   =========     Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*) + 0x8e5) [0x372ea15]
   =========     Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)0>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>*, std::shared_ptr<dmlc::ManualEvent> const&) + 0xeb) [0x3745b1b]
   =========     Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#3}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>&&) + 0x4e) [0x3745d8e]
   =========     Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (std::thread::_Impl<std::_Bind_simple<std::function<void (std::shared_ptr<dmlc::ManualEvent>)> (std::shared_ptr<dmlc::ManualEvent>)> >::_M_run() + 0x4a) [0x372e01a]
   =========     Host Frame:/usr/lib/x86_64-linux-gnu/libstdc++.so.6 [0xb8c80]
   =========     Host Frame:/lib/x86_64-linux-gnu/libpthread.so.0 [0x76ba]
   =========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (clone + 0x6d) [0x10741d]
   =========
   ========= Program hit cudaErrorLaunchFailure (error 4) due to "unspecified launch failure" on CUDA API call to cudaStreamSynchronize. 
   =========     Saved host backtrace up to driver entry point at error
   =========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 [0x3496d3]
   =========     Host Frame:/usr/local/cuda/lib64/libcudart.so.9.0 (cudaStreamSynchronize + 0x176) [0x47336]
   =========     Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (mshadow::Stream<mshadow::gpu>::Wait() + 0x26) [0x32635a6]
   =========     Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so [0x37d5945]
   =========     Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*) + 0x8e5) [0x372ea15]
   =========     Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)0>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>*, std::shared_ptr<dmlc::ManualEvent> const&) + 0xeb) [0x3745b1b]
   ERROR
   
   ======================================================================
   ERROR: test_operator_gpu.test_sequence_last
   ----------------------------------------------------------------------
   Traceback (most recent call last):
     File "/usr/lib/python2.7/dist-packages/nose/case.py", line 197, in runTest
       self.test(*self.arg)
     File "/usr/lib/python2.7/dist-packages/nose/util.py", line 620, in newfunc
       return func(*arg, **kw)
     File "/home/dcarter/mxnet_dev/dgx/mxnet/tests/python/gpu/../unittest/common.py", line 157, in test_new
       orig_test(*args, **kwargs)
     File "/home/dcarter/mxnet_dev/dgx/mxnet/tests/python/gpu/../unittest/test_operator.py", line 2998, in test_sequence_last
       check_sequence_func("last", axis=0)
     File "/home/dcarter/mxnet_dev/dgx/mxnet/tests/python/gpu/../unittest/test_operator.py", line 2989, in check_sequence_func
       numeric_eps=1e-2, rtol=1e-2)
     File "/home/dcarter/mxnet_dev/dgx/mxnet/python/mxnet/test_utils.py", line 906, in check_numeric_gradient
       eps=numeric_eps, use_forward_train=use_forward_train, dtype=dtype)
     File "/home/dcarter/mxnet_dev/dgx/mxnet/python/mxnet/test_utils.py", line 781, in numeric_grad
       f_neps = executor.outputs[0].asnumpy()
     File "/home/dcarter/mxnet_dev/dgx/mxnet/python/mxnet/ndarray/ndarray.py", line 1894, in asnumpy
       ctypes.c_size_t(data.size)))
     File "/home/dcarter/mxnet_dev/dgx/mxnet/python/mxnet/base.py", line 210, in check_call
       raise MXNetError(py_str(_LIB.MXGetLastError()))
   MXNetError: [11:31:45] /home/dcarter/mxnet_dev/dgx/mxnet/3rdparty/mshadow/mshadow/./stream_gpu-inl.h:62: Check failed: e == cudaSuccess CUDA: unspecified launch failure
   
   Stack trace returned 10 entries:
   [bt] (0) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x5b) [0x7f14f0b6619b]
   [bt] (1) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7f14f0b66d08]
   [bt] (2) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(mshadow::Stream<mshadow::gpu>::Wait()+0xd8) [0x7f14f31bc658]
   [bt] (3) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(+0x37d5945) [0x7f14f372e945]
   [bt] (4) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0x8e5) [0x7f14f3687a15]
   [bt] (5) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)0>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>*, std::shared_ptr<dmlc::ManualEvent> const&)+0xeb) [0x7f14f369eb1b]
   [bt] (6) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#3}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>&&)+0x4e) [0x7f14f369ed8e]
   [bt] (7) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(std::thread::_Impl<std::_Bind_simple<std::function<void (std::shared_ptr<dmlc::ManualEvent>)> (std::shared_ptr<dmlc::ManualEvent>)> >::_M_run()+0x4a) [0x7f14f368701a]
   [bt] (8) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7f151efb1c80]
   [bt] (9) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f15269af6ba]
   
   
   -------------------- >> begin captured logging << --------------------
   common: INFO: Setting module np/mx/python random seeds, use MXNET_MODULE_SEED=1613755850 to reproduce.
   common: WARNING: *** test-level seed set: all "@with_seed()" tests run deterministically ***
   common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=731510245 to reproduce.
   --------------------- >> end captured logging << ---------------------
   
   ----------------------------------------------------------------------
   Ran 1 test in 22.915s
   
   FAILED (errors=1)
   terminate called after throwing an instance of 'dmlc::Error'
     what():  [11:31:45] src/storage/./pooled_storage_manager.h:77: CUDA: unspecified launch failure
   
   Stack trace returned 10 entries:
   [bt] (0) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x5b) [0x7f14f0b6619b]
   [bt] (1) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7f14f0b66d08]
   [bt] (2) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(mxnet::storage::GPUPooledStorageManager::DirectFreeNoLock(mxnet::Storage::Handle)+0x8f) [0x7f14f36aa8cf]
   [bt] (3) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(mxnet::storage::GPUPooledStorageManager::ReleaseAll()+0x95) [0x7f14f36a2ef5]
   [bt] (4) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(mxnet::storage::GPUPooledStorageManager::~GPUPooledStorageManager()+0x1a) [0x7f14f36aa9ca]
   [bt] (5) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(std::_Sp_counted_ptr<mxnet::StorageImpl*, (__gnu_cxx::_Lock_policy)2>::_M_dispose()+0xa23) [0x7f14f36a9bd3]
   [bt] (6) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(std::shared_ptr<mxnet::Storage>::~shared_ptr()+0x52) [0x7f14f36aa822]
   [bt] (7) /lib/x86_64-linux-gnu/libc.so.6(+0x39ff8) [0x7f1526617ff8]
   [bt] (8) /lib/x86_64-linux-gnu/libc.so.6(+0x3a045) [0x7f1526618045]
   [bt] (9) /usr/bin/python() [0x51dc1f]
   
   
   =========     Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#3}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>&&) + 0x4e) [0x3745d8e]
   =========     Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (std::thread::_Impl<std::_Bind_simple<std::function<void (std::shared_ptr<dmlc::ManualEvent>)> (std::shared_ptr<dmlc::ManualEvent>)> >::_M_run() + 0x4a) [0x372e01a]
   =========     Host Frame:/usr/lib/x86_64-linux-gnu/libstdc++.so.6 [0xb8c80]
   =========     Host Frame:/lib/x86_64-linux-gnu/libpthread.so.0 [0x76ba]
   =========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (clone + 0x6d) [0x10741d]
   =========
   ========= Program hit cudaErrorLaunchFailure (error 4) due to "unspecified launch failure" on CUDA API call to cudaFree. 
   =========     Saved host backtrace up to driver entry point at error
   =========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 [0x3496d3]
   =========     Host Frame:/usr/local/cuda/lib64/libcudart.so.9.0 (cudaFree + 0x1a0) [0x419b0]
   =========     Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (mxnet::storage::GPUPooledStorageManager::DirectFreeNoLock(mxnet::Storage::Handle) + 0x32) [0x3751872]
   =========     Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (mxnet::storage::GPUPooledStorageManager::ReleaseAll() + 0x95) [0x3749ef5]
   =========     Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (mxnet::storage::GPUPooledStorageManager::~GPUPooledStorageManager() + 0x1a) [0x37519ca]
   =========     Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (std::_Sp_counted_ptr<mxnet::StorageImpl*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0xa23) [0x3750bd3]
   =========     Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (std::shared_ptr<mxnet::Storage>::~shared_ptr() + 0x52) [0x3751822]
   =========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 [0x39ff8]
   =========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 [0x3a045]
   =========     Host Frame:/usr/bin/python [0x11dc1f]
   =========     Host Frame:/usr/bin/python [0x11b1b7]
   =========     Host Frame:/usr/bin/python (PyErr_PrintEx + 0x2d) [0x11aadd]
   =========     Host Frame:/usr/bin/python [0x309d5]
   =========     Host Frame:/usr/bin/python (Py_Main + 0x612) [0x93ae2]
   =========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf0) [0x20830]
   =========     Host Frame:/usr/bin/python (_start + 0x29) [0x933e9]
   =========
   ========= Error: process didn't terminate successfully
   ========= No CUDA-MEMCHECK results found
   ```
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services