Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2019/10/28 05:57:13 UTC

[GitHub] [incubator-mxnet] Heermosi opened a new issue #16651: I'm sorry I've triggered an error in mxnet source code, how can I debug it? It seems like a check failure on custom operators, how can I find more details?

URL: https://github.com/apache/incubator-mxnet/issues/16651
 
 
   I'm working in Docker containers on Linux, with MXNet built from source.
   When I ran the RoITransformer project with FPN training, the following error appeared:
   >   File "experiments/fpn/fpn_end2end_train_test_RoITransformer.py", line 21, in <module>
   >     train_end2end_rotbox_RoITransformer.main()
   >   File "experiments/fpn/../../fpn/train_end2end_rotbox_RoITransformer.py", line 188, in main
   >     config.TRAIN.begin_epoch, config.TRAIN.end_epoch, config.TRAIN.lr, config.TRAIN.lr_step)
   >   File "experiments/fpn/../../fpn/train_end2end_rotbox_RoITransformer.py", line 181, in train_net
   >     arg_params=arg_params, aux_params=aux_params, begin_epoch=begin_epoch, num_epoch=end_epoch)
   >   File "experiments/fpn/../../fpn/core/module.py", line 989, in fit
   >     self.update_metric(eval_metric, data_batch.label)
   >   File "experiments/fpn/../../fpn/core/module.py", line 1081, in update_metric
   >     self._curr_module.update_metric(eval_metric, labels)
   >   File "experiments/fpn/../../fpn/core/module.py", line 672, in update_metric
   >     self._exec_group.update_metric(eval_metric, labels)
   >   File "experiments/fpn/../../fpn/core/DataParallelExecutorGroup.py", line 481, in update_metric
   >     eval_metric.update(labels, texec.outputs)
   >   File "/usr/local/lib/python2.7/site-packages/mxnet-1.6.0-py2.7.egg/mxnet/metric.py", line 364, in update
   >     metric.update(labels, preds)
   >   File "experiments/fpn/../../fpn/core/metric.py", line 53, in update
   >     pred_label = mx.ndarray.argmax_channel(pred).asnumpy().astype('int32')
   >   File "/usr/local/lib/python2.7/site-packages/mxnet-1.6.0-py2.7.egg/mxnet/ndarray/ndarray.py", line 2506, in asnumpy
   >     ctypes.c_size_t(data.size)))
   >   File "/usr/local/lib/python2.7/site-packages/mxnet-1.6.0-py2.7.egg/mxnet/base.py", line 254, in check_call
   >     raise MXNetError(py_str(_LIB.MXGetLastError()))
   > mxnet.base.MXNetError: [13:42:48] src/operator/custom/custom.cc:417: Check failed: reinterpret_cast<CustomOpFBFunc>(params.info->callbacks[kCustomOpBackward])( ptrs.size(), const_cast<void**>(ptrs.data()), const_cast<int*>(tags.data()), reinterpret_cast<const int*>(req.data()), static_cast<int>(ctx.is_train), params.info->contexts[kCustomOpBackward]):
   > Stack trace:
   >   [bt] (0) /usr/local/lib/python2.7/site-packages/mxnet-1.6.0-py2.7.egg/mxnet/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x43) [0x7fd12dfa0133]
   >   [bt] (1) /usr/local/lib/python2.7/site-packages/mxnet-1.6.0-py2.7.egg/mxnet/libmxnet.so(+0x16d265f) [0x7fd12e65365f]
   >   [bt] (2) /usr/local/lib/python2.7/site-packages/mxnet-1.6.0-py2.7.egg/mxnet/libmxnet.so(+0x16db4f9) [0x7fd12e65c4f9]
   >   [bt] (3) /usr/local/lib/python2.7/site-packages/mxnet-1.6.0-py2.7.egg/mxnet/libmxnet.so(std::thread::_State_impl<std::thread::_Invoker<std::tuple<mxnet::op::custom::CustomOperator::SetNumThreads(int)::{lambda()#1}> > >::_M_run()+0xde) [0x7fd12e66305e]
   >   [bt] (4) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd66f) [0x7fd174e7866f]
   >   [bt] (5) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7fd17b9786db]
   >   [bt] (6) /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7fd17aefc88f]
   > 
   I looked at the source code referenced in the error (src/operator/custom/custom.cc, line 417). It looks like this:
   
   `
     CustomOperator::Get()->Push(
       [=]() {
         CHECK(reinterpret_cast<CustomOpFBFunc>(params.info->callbacks[kCustomOpBackward])(
           ptrs.size(), const_cast<void**>(ptrs.data()), const_cast<int*>(tags.data()),
           reinterpret_cast<const int*>(req.data()), static_cast<int>(ctx.is_train),
           params.info->contexts[kCustomOpBackward]));
       }, ctx, false, ctx.is_train, cpys, tags, output_tags, outputs, "_backward_" + params.op_type);
   `
   I cannot figure out what this means or which custom operator caused it.
   Can anyone advise on how to debug this?
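
   One possible way to narrow this down is sketched below, under stated assumptions. The failing CHECK wraps the backward callback of a custom operator, so the idea is to wrap each candidate operator's Python `backward` method so that any exception is logged together with the class name before it propagates. The decorator `log_backward_errors` and the class `FakeRotatedRoIOp` are hypothetical names invented for illustration; the class is only a stand-in for an `mx.operator.CustomOp` subclass, and MXNet is deliberately not imported so the sketch runs on its own.

```python
import functools
import logging
import traceback

logging.basicConfig(level=logging.ERROR)
log = logging.getLogger("custom_op_debug")


def log_backward_errors(method):
    """Log the owning class name and the full traceback if `method` raises, then re-raise."""
    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        try:
            return method(self, *args, **kwargs)
        except Exception:
            # Record which operator class failed and where, before the
            # exception travels back up to whoever invoked the callback.
            log.error("backward failed in %s:\n%s",
                      type(self).__name__, traceback.format_exc())
            raise
    return wrapper


class FakeRotatedRoIOp(object):
    """Hypothetical stand-in for a custom operator class (not a real MXNet class)."""

    @log_backward_errors
    def backward(self, req, out_grad, in_data, out_data, in_grad, aux):
        # Deliberate bug, only to show what the log would capture.
        raise ValueError("shape mismatch in gradient")
```

   Applying the same decorator to the `backward` method of each custom operator in the project should, if the failure originates in Python code, show which class raised and on which line. It may also help to set the environment variable `MXNET_ENGINE_TYPE=NaiveEngine`, which makes MXNet execute operations synchronously, so the failure is reported closer to the call that triggered it rather than at a later `asnumpy()`.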
   
