Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2018/03/25 12:55:01 UTC

chengdazhi opened a new issue #10240: [Python2] Occasional and random Illegal Memory Access Error | Cached Op Input Context Error
URL: https://github.com/apache/incubator-mxnet/issues/10240
 
 
   Hi all,
   
I am with the Visual Computing Group of Microsoft Research Asia. We are currently working on a Gluon-based Faster R-CNN and are dealing with a really tricky problem:
   
   ## Description
The system randomly reports an Illegal Memory Access error (numpy and mxnet random seeds are fixed and the data is not shuffled). In hybridized mode the error instead becomes "Inputs not on the same context", reported by CachedOp. Strangely, one of the two inputs reported in the log is always a parameter such as xxx_weight or xxx_moving_mean, which should live on all contexts. Please refer to the error section below for details.
   
The error typically happens after tens or hundreds of batches, with a batch size of four images.
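
For reference, "fixed seeds" above means something like the following (a minimal sketch; the seed value is a placeholder):

```python
import numpy as np
import mxnet as mx

# Placeholder seed; we fix both the numpy and the mxnet global RNGs so the
# failing batch should, in principle, be deterministic.
np.random.seed(0)
mx.random.seed(0)
```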
   
   ## Environment info (Required)
   
   > ----------Python Info----------
   > ('Version      :', '2.7.14')
   > ('Compiler     :', 'GCC 7.2.0')
   > ('Build        :', ('default', 'Dec  7 2017 17:05:42'))
   > ('Arch         :', ('64bit', ''))
   > ------------Pip Info-----------
   > ('Version      :', '9.0.1')
   > ('Directory    :', '/data1/home/v-dachen/anaconda2/lib/python2.7/site-packages/pip')
   > ----------MXNet Info-----------
   > No MXNet installed.
   > ----------System Info----------
   > ('Platform     :', 'Linux-4.4.0-104-generic-x86_64-with-debian-stretch-sid')
   > ('system       :', 'Linux')
   > ('node         :', 'MSRAVCG04')
   > ('release      :', '4.4.0-104-generic')
   > ('version      :', '#127-Ubuntu SMP Mon Dec 11 12:16:42 UTC 2017')
   > ----------Hardware Info----------
   > ('machine      :', 'x86_64')
   > ('processor    :', 'x86_64')
   > Architecture:          x86_64
   > CPU op-mode(s):        32-bit, 64-bit
   > Byte Order:            Little Endian
   > CPU(s):                32
   > On-line CPU(s) list:   0-31
   > Thread(s) per core:    2
   > Core(s) per socket:    8
   > Socket(s):             2
   > NUMA node(s):          2
   > Vendor ID:             GenuineIntel
   > CPU family:            6
   > Model:                 62
   > Model name:            Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
   > Stepping:              4
   > CPU MHz:               2600.000
   > CPU max MHz:           3400.0000
   > CPU min MHz:           1200.0000
   > BogoMIPS:              5202.06
   > Virtualization:        VT-x
   > L1d cache:             32K
   > L1i cache:             32K
   > L2 cache:              256K
   > L3 cache:              20480K
   > NUMA node0 CPU(s):     0-7,16-23
   > NUMA node1 CPU(s):     8-15,24-31
   > Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm epb tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts
   > ----------Network Test----------
   > Setting timeout: 10
   > Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0012 sec, LOAD: 1.1022 sec.
   > Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0673 sec, LOAD: 0.3463 sec.
   > Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.4961 sec, LOAD: 0.7315 sec.
   > Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0689 sec, LOAD: 0.3608 sec.
   > Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.1838 sec, LOAD: 0.3668 sec.
   > Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.1662 sec, LOAD: 0.9060 sec.
   > 
   > 
   
   Package used (Python/R/Scala/Julia):
   Python 2.7 
   
   ## Build info (Required if built from source)
   
Built from the mxnet repo, v1.1.0 branch, commit #9999.
   
   
   Build config:
   (Paste the content of config.mk, or the build command.)
   
   ## Error Message:
   
**NOTE: this Illegal Memory Access problem is only triggered when hybridize() is not called. It is not always raised from stream_gpu-inl.h; sometimes it comes from pooled_storage_manager.h:107.**
   
   > 
   > **terminate called after throwing an instance of 'dmlc::Error'
   >   what():  [20:04:58] src/engine/./threaded_engine.h:359: [20:04:58] /data1/home/v-dachen/external/mxnet/mxnet_1.0.0/incubator-mxnet/mshadow/mshadow/./stream_gpu-inl.h:62: Check failed: e == cudaSuccess CUDA: an illegal memory access was encountered**
   > 
   > Stack trace returned 10 entries:
   > [bt] (0) /data1/home/v-dachen/codebase/mxnet_relation_network/external/mxnet/mxnet_1.0.0/mxnet/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x5a) [0x7f7cb0c6aeaa]
   > [bt] (1) /data1/home/v-dachen/codebase/mxnet_relation_network/external/mxnet/mxnet_1.0.0/mxnet/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7f7cb0c6ba48]
   > [bt] (2) /data1/home/v-dachen/codebase/mxnet_relation_network/external/mxnet/mxnet_1.0.0/mxnet/libmxnet.so(mshadow::Stream<mshadow::gpu>::Wait()+0xd8) [0x7f7cb32358d8]
   > [bt] (3) /data1/home/v-dachen/codebase/mxnet_relation_network/external/mxnet/mxnet_1.0.0/mxnet/libmxnet.so(mxnet::imperative::PushFCompute(std::function<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)> const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&)::{lambda(mxnet::RunContext)#1}::operator()(mxnet::RunContext) const+0x39c) [0x7f7cb3244e5c]
   > [bt] (4) /data1/home/v-dachen/codebase/mxnet_relation_network/external/mxnet/mxnet_1.0.0/mxnet/libmxnet.so(+0x3348c6b) [0x7f7cb36ccc6b]
   > [bt] (5) /data1/home/v-dachen/codebase/mxnet_relation_network/external/mxnet/mxnet_1.0.0/mxnet/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0x93) [0x7f7cb36d9db3]
   > [bt] (6) /data1/home/v-dachen/codebase/mxnet_relation_network/external/mxnet/mxnet_1.0.0/mxnet/libmxnet.so(void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)0>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>*, std::shared_ptr<mxnet::engine::ThreadPool::SimpleEvent>)+0xcb) [0x7f7cb36e20db]
   > [bt] (7) /data1/home/v-dachen/codebase/mxnet_relation_network/external/mxnet/mxnet_1.0.0/mxnet/libmxnet.so(std::_Function_handler<void (std::shared_ptr<mxnet::engine::ThreadPool::SimpleEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#3}::operator()() const::{lambda(std::shared_ptr<mxnet::engine::ThreadPool::SimpleEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<mxnet::engine::ThreadPool::SimpleEvent>&&)+0x63) [0x7f7cb36e22d3]
   > [bt] (8) /data1/home/v-dachen/codebase/mxnet_relation_network/external/mxnet/mxnet_1.0.0/mxnet/libmxnet.so(std::thread::_Impl<std::_Bind_simple<std::function<void (std::shared_ptr<mxnet::engine::ThreadPool::SimpleEvent>)> (std::shared_ptr<mxnet::engine::ThreadPool::SimpleEvent>)> >::_M_run()+0x4a) [0x7f7cb36dc4aa]
   > [bt] (9) /data1/home/v-dachen/anaconda2/bin/../lib/libstdc++.so.6(+0xafc5c) [0x7f7cc4776c5c]
   > 
   > 
   > A fatal error occurred in asynchronous engine operation. If you do not know what caused this error, you can try set environment variable MXNET_ENGINE_TYPE to NaiveEngine and run with debugger (i.e. gdb). This will force all operations to be synchronous and backtrace will give you the series of calls that lead to this error. Remember to set MXNET_ENGINE_TYPE back to empty after debugging.
   > 
   > Stack trace returned 9 entries:
   > [bt] (0) /data1/home/v-dachen/codebase/mxnet_relation_network/external/mxnet/mxnet_1.0.0/mxnet/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x5a) [0x7f7cb0c6aeaa]
   > [bt] (1) /data1/home/v-dachen/codebase/mxnet_relation_network/external/mxnet/mxnet_1.0.0/mxnet/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7f7cb0c6ba48]
   > [bt] (2) /data1/home/v-dachen/codebase/mxnet_relation_network/external/mxnet/mxnet_1.0.0/mxnet/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0x332) [0x7f7cb36da052]
   > [bt] (3) /data1/home/v-dachen/codebase/mxnet_relation_network/external/mxnet/mxnet_1.0.0/mxnet/libmxnet.so(void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)0>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>*, std::shared_ptr<mxnet::engine::ThreadPool::SimpleEvent>)+0xcb) [0x7f7cb36e20db]
   > [bt] (4) /data1/home/v-dachen/codebase/mxnet_relation_network/external/mxnet/mxnet_1.0.0/mxnet/libmxnet.so(std::_Function_handler<void (std::shared_ptr<mxnet::engine::ThreadPool::SimpleEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#3}::operator()() const::{lambda(std::shared_ptr<mxnet::engine::ThreadPool::SimpleEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<mxnet::engine::ThreadPool::SimpleEvent>&&)+0x63) [0x7f7cb36e22d3]
   > [bt] (5) /data1/home/v-dachen/codebase/mxnet_relation_network/external/mxnet/mxnet_1.0.0/mxnet/libmxnet.so(std::thread::_Impl<std::_Bind_simple<std::function<void (std::shared_ptr<mxnet::engine::ThreadPool::SimpleEvent>)> (std::shared_ptr<mxnet::engine::ThreadPool::SimpleEvent>)> >::_M_run()+0x4a) [0x7f7cb36dc4aa]
   > [bt] (6) /data1/home/v-dachen/anaconda2/bin/../lib/libstdc++.so.6(+0xafc5c) [0x7f7cc4776c5c]
   > [bt] (7) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f7ccad766ba]
   > [bt] (8) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f7cca39c3dd]
   > 
   > 
   > Aborted (core dumped)
   > 
   > 
   
   
**And the following CachedOp-related problem is triggered when hybridize() is called. Note that the Block that triggers this error changes each time.**
   
   > 
   > Traceback (most recent call last):
   >   File "experiments/faster_rcnn/rcnn_end2end_train_test_gluon.py", line 27, in <module>
   >     train_end2end.main()
   >   File "experiments/faster_rcnn/../../faster_rcnn/train_end2end_gluon.py", line 187, in main
   >     config.TRAIN.lr, config.TRAIN.lr_step)
   >   File "experiments/faster_rcnn/../../faster_rcnn/train_end2end_gluon.py", line 180, in train_net
   >     batch_end_callback, epoch_end_callback)
   >   File "experiments/faster_rcnn/../../faster_rcnn/core/module_gluon.py", line 71, in train
   >     **conv4_feat = net.resnet_conv4(data) *(NOTE: this varies each time)***
   >   File "/data1/home/v-dachen/codebase/mxnet_relation_network/faster_rcnn/../external/mxnet/mxnet_1.0.0/mxnet/gluon/block.py", line 360, in __call__
   >     return self.forward(*args)
   >   File "/data1/home/v-dachen/codebase/mxnet_relation_network/faster_rcnn/../external/mxnet/mxnet_1.0.0/mxnet/gluon/block.py", line 664, in forward
   >     return self._call_cached_op(x, *args)
   >   File "/data1/home/v-dachen/codebase/mxnet_relation_network/faster_rcnn/../external/mxnet/mxnet_1.0.0/mxnet/gluon/block.py", line 477, in _call_cached_op
   >     out = self._cached_op(*cargs)
   >   File "/data1/home/v-dachen/codebase/mxnet_relation_network/faster_rcnn/../external/mxnet/mxnet_1.0.0/mxnet/_ctypes/ndarray.py", line 149, in __call__
   >     ctypes.byref(out_stypes)))
   >   File "/data1/home/v-dachen/codebase/mxnet_relation_network/faster_rcnn/../external/mxnet/mxnet_1.0.0/mxnet/base.py", line 146, in check_call
   >     raise MXNetError(py_str(_LIB.MXGetLastError()))
   > **mxnet.base.MXNetError: [15:53:24] src/imperative/cached_op.cc:351: Check failed: inputs[i]->ctx() == default_ctx (gpu(4) vs. gpu(7)) CachedOp requires all inputs to live on the same context. But data is on gpu(7) while bn3a_branch2c_moving_mean is on gpu *(This is very strange since moving_mean should live on all contexts)***
   > Stack trace returned 10 entries:
   > [bt] (0) /data1/home/v-dachen/codebase/mxnet_relation_network/external/mxnet/mxnet_1.0.0/mxnet/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x5a) [0x7f24aba98eaa]
   > [bt] (1) /data1/home/v-dachen/codebase/mxnet_relation_network/external/mxnet/mxnet_1.0.0/mxnet/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7f24aba99a48]
   > [bt] (2) /data1/home/v-dachen/codebase/mxnet_relation_network/external/mxnet/mxnet_1.0.0/mxnet/libmxnet.so(mxnet::Imperative::CachedOp::Forward(std::shared_ptr<mxnet::Imperative::CachedO
   > p> const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&)+0x620) [0x7f24ae03a1f0]
   > [bt] (3) /data1/home/v-dachen/codebase/mxnet_relation_network/external/mxnet/mxnet_1.0.0/mxnet/libmxnet.so(MXInvokeCachedOp+0xd94) [0x7f24ae5355c4]
   > [bt] (4) /data1/home/v-dachen/codebase/mxnet_relation_network/external/mxnet/mxnet_1.0.0/mxnet/libmxnet.so(MXInvokeCachedOpEx+0x3ee) [0x7f24ae5365de]
   > [bt] (5) /data1/home/v-dachen/anaconda2/lib/python2.7/lib-dynload/../../libffi.so.6(ffi_call_unix64+0x4c) [0x7f24bc19aec0]
   > [bt] (6) /data1/home/v-dachen/anaconda2/lib/python2.7/lib-dynload/../../libffi.so.6(ffi_call+0x22d) [0x7f24bc19a87d]
   > [bt] (7) /data1/home/v-dachen/anaconda2/lib/python2.7/lib-dynload/_ctypes.so(_ctypes_callproc+0x4d6) [0x7f24bc3b0736]
   > [bt] (8) /data1/home/v-dachen/anaconda2/lib/python2.7/lib-dynload/_ctypes.so(+0x8a61) [0x7f24bc3a6a61]
   > [bt] (9) /data1/home/v-dachen/anaconda2/bin/../lib/libpython2.7.so.1.0(PyObject_Call+0x43) [0x7f24c5e02773]
   > 
   > 
   > 
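
For context, parameters are normally broadcast to every device at initialization time. The sketch below is not our actual code (a model-zoo ResNet and an 8-GPU context list stand in for our network); it only illustrates why a BatchNorm statistic such as bn3a_branch2c_moving_mean should have a copy on every context:

```python
import mxnet as mx
from mxnet.gluon.model_zoo import vision

ctx_list = [mx.gpu(i) for i in range(8)]   # assumption: 8 GPUs
net = vision.resnet50_v1()                 # stand-in for our backbone
net.collect_params().initialize(mx.init.Xavier(), ctx=ctx_list)

# Every parameter, including BatchNorm moving means, should list all contexts.
for name, param in net.collect_params().items():
    print(name, param.list_ctx())
```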
   
   
**When I turn on NaiveEngine as suggested, it says that the naive engine only supports synchronous Push so far, and this also happens at random, whether hybridized or not. We do not use multi-threading explicitly (noted because the problem seems to be threading related).**
   
   > 
   > Exception in thread Thread-2:
   > Traceback (most recent call last):
   >   File "/data1/home/v-dachen/anaconda2/lib/python2.7/threading.py", line 801, in __bootstrap_inner
   >     self.run()
   >   File "/data1/home/v-dachen/anaconda2/lib/python2.7/threading.py", line 754, in run
   >     self.__target(*self.__args, \*\*self.__kwargs)
   >   File "experiments/faster_rcnn/../../faster_rcnn/../lib/utils/PrefetchingIter.py", line 194, in prefetch_func
   >     self.next_batch[i][0] = self.iters[0].next()
   >   File "experiments/faster_rcnn/../../faster_rcnn/core/loader.py", line 924, in next
   >     data, label = self.get_batch_individual(cur_from)
   >   File "experiments/faster_rcnn/../../faster_rcnn/core/loader.py", line 1064, in get_batch_individual
   >     data = [[mx.nd.array(data[key]) for key in self.data_name] for data in all_data]
   >   File "/data1/home/v-dachen/codebase/mxnet_relation_network/faster_rcnn/../external/mxnet/mxnet_1.0.0/mxnet/ndarray/utils.py", line 146, in array
   >     return _array(source_array, ctx=ctx, dtype=dtype)
   >   File "/data1/home/v-dachen/codebase/mxnet_relation_network/faster_rcnn/../external/mxnet/mxnet_1.0.0/mxnet/ndarray/ndarray.py", line 2245, in array
   >     arr[:] = source_array
   >   File "/data1/home/v-dachen/codebase/mxnet_relation_network/faster_rcnn/../external/mxnet/mxnet_1.0.0/mxnet/ndarray/ndarray.py", line 437, in __setitem__
   >     self._set_nd_basic_indexing(key, value)
   >   File "/data1/home/v-dachen/codebase/mxnet_relation_network/faster_rcnn/../external/mxnet/mxnet_1.0.0/mxnet/ndarray/ndarray.py", line 698, in _set_nd_basic_indexing
   >     self._sync_copyfrom(value)
   >   File "/data1/home/v-dachen/codebase/mxnet_relation_network/faster_rcnn/../external/mxnet/mxnet_1.0.0/mxnet/ndarray/ndarray.py", line 863, in _sync_copyfrom
   >     ctypes.c_size_t(source_array.size)))
   >   File "/data1/home/v-dachen/codebase/mxnet_relation_network/faster_rcnn/../external/mxnet/mxnet_1.0.0/mxnet/base.py", line 146, in check_call
   >     raise MXNetError(py_str(_LIB.MXGetLastError()))
   > **MXNetError: [15:54:16] src/engine/naive_engine.cc:168: Check failed: this->req_completed_ NaiveEngine only support synchronize Push so far **
   > Stack trace returned 10 entries:
   > [bt] (0) /data1/home/v-dachen/codebase/mxnet_relation_network/external/mxnet/mxnet_1.0.0/mxnet/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x5a) [0x7fd15f9b2eaa]
   > [bt] (1) /data1/home/v-dachen/codebase/mxnet_relation_network/external/mxnet/mxnet_1.0.0/mxnet/libmxnet.so(mxnet::engine::NaiveEngine::PushAsync(std::function<void (mxnet::RunContext, mxnet::engine::CallbackOnComplete)>, mxnet::Context, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, mxnet::FnProperty, int, char const*)+0x393) [0x7fd162407a03]
   > [bt] (2) /data1/home/v-dachen/codebase/mxnet_relation_network/external/mxnet/mxnet_1.0.0/mxnet/libmxnet.so(mxnet::NDArray::SyncCopyFromCPU(void const*, unsigned long) const+0x6c8) [0x7fd161fefea8]
   > [bt] (3) /data1/home/v-dachen/codebase/mxnet_relation_network/external/mxnet/mxnet_1.0.0/mxnet/libmxnet.so(MXNDArraySyncCopyFromCPU+0x6) [0x7fd16247b446]
   > [bt] (4) /data1/home/v-dachen/anaconda2/lib/python2.7/lib-dynload/../../libffi.so.6(ffi_call_unix64+0x4c) [0x7fd1700b4ec0]
   > [bt] (5) /data1/home/v-dachen/anaconda2/lib/python2.7/lib-dynload/../../libffi.so.6(ffi_call+0x22d) [0x7fd1700b487d]
   > [bt] (6) /data1/home/v-dachen/anaconda2/lib/python2.7/lib-dynload/_ctypes.so(_ctypes_callproc+0x4d6) [0x7fd1702ca736]
   > [bt] (7) /data1/home/v-dachen/anaconda2/lib/python2.7/lib-dynload/_ctypes.so(+0x8a61) [0x7fd1702c0a61]
   > [bt] (8) /data1/home/v-dachen/anaconda2/bin/../lib/libpython2.7.so.1.0(PyObject_Call+0x43) [0x7fd179d1c773]
   > [bt] (9) /data1/home/v-dachen/anaconda2/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x190b) [0x7fd179db053b]
   > 
   > 
   > 
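
For completeness, this is roughly how we switch to the naive engine for debugging (a sketch; we set the variable before importing mxnet, or equivalently export it in the shell before launching the script):

```python
import os

# MXNET_ENGINE_TYPE is read when the library initializes, so set it first.
os.environ['MXNET_ENGINE_TYPE'] = 'NaiveEngine'

import mxnet as mx  # now uses the synchronous naive engine
```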
   
   ## Minimum reproducible example
Our code is a straightforward Faster R-CNN; providing a reproducible example would be very demanding, so I have listed some key design points below (a stripped-down sketch of the per-GPU training step follows the list):
   
* Our network is implemented using HybridBlocks and Blocks.
* The DataLoader is implemented on top of DataIter.
* We pad input images to a fixed size, so the input shape does not change across GPUs.
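
As referenced above, here is a stripped-down sketch of the per-GPU training step (not our actual code; `net`, the loss, the trainer, and the iterator are placeholders). Each padded batch is split across GPUs with split_and_load, so every forward call should stay on a single context:

```python
import mxnet as mx
from mxnet import autograd, gluon

ctx_list = [mx.gpu(i) for i in range(4)]   # assumption: 4 GPUs, one image each

def train_one_batch(net, batch, loss_fn, trainer):
    """One multi-GPU step; `batch` is a DataBatch from our DataIter-based loader."""
    data_slices = gluon.utils.split_and_load(batch.data[0], ctx_list)
    label_slices = gluon.utils.split_and_load(batch.label[0], ctx_list)
    with autograd.record():
        losses = [loss_fn(net(x), y) for x, y in zip(data_slices, label_slices)]
    for loss in losses:
        loss.backward()
    trainer.step(batch.data[0].shape[0])
```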
   
   ### If you have any thoughts on debugging, please comment and I'll report the results.
   
   ## What have you tried to solve it?
   
1. Updated the NVIDIA driver to 384, CUDA to 9.0, and cuDNN to 7.1.2.
2. Gradually removed building blocks from the architecture and found that the time until the error shows up gradually increases, so it is unlikely that the behavior is triggered by the misuse of a single operator.
3. Changed the DataLoader to a single-threaded version.
4. Removed callbacks (like the speedometer) and metrics.
5. Built mxnet with the DMSHADOW_USE_PASCAL flag turned on, as suggested in https://github.com/apache/incubator-mxnet/issues/6592; this didn't work.
6. Turned the Python garbage collector on and off in each epoch, as suggested in https://github.com/apache/incubator-mxnet/issues/7710; this didn't work either (see the sketch after this list).
7. Looked at other similar issues, which did not provide any solutions: https://github.com/apache/incubator-mxnet/issues/7000, https://github.com/apache/incubator-mxnet/issues/6799, https://github.com/apache/incubator-mxnet/issues/6685.
8. Posting an issue here...
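
Item 6 above, toggling the garbage collector around each epoch, was roughly the following (a sketch; `train_one_epoch` and `num_epochs` are placeholders):

```python
import gc

for epoch in range(num_epochs):   # placeholder epoch count
    gc.disable()                  # keep the collector out of the training loop
    train_one_epoch()             # placeholder for our actual epoch loop
    gc.enable()
    gc.collect()                  # collect once between epochs
```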
   
