Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2019/03/11 07:07:58 UTC

[GitHub] [incubator-mxnet] houweidong opened a new issue #14387: problem in metric.update(0, record) when train with multi GPU

houweidong opened a new issue #14387: problem in metric.update(0, record) when train with multi GPU
URL: https://github.com/apache/incubator-mxnet/issues/14387
 
 
   I run the code on a machine with nine 2080 Ti GPUs (11 GB each), 48 CPU cores, and 256 GB of memory.
   When I run the YOLOv3 code provided [here](url) on a single GPU, there is no problem. But when I try to train an end-to-end YOLOv3 model with the same code using
   python3 train_yolo3.py --network darknet53 --dataset coco --gpus 0,2 --batch-size 16 -j 16 --log-interval 100 --lr-decay-epoch 80,90 --epochs 100 --warmup-epochs 2 --mixup --no-mixup-epochs 20 --label-smooth --no-wd
   the following error occurs:
   
   
   creating index...
   index created!
   loading annotations into memory...
   Done (t=0.32s)
   creating index...
   index created!
   INFO:root:Namespace(batch_size=16, data_shape=416, dataset='coco', epochs=100, gpus='0,2', label_smooth=True, log_interval=100, lr=0.001, lr_decay=0.1, lr_decay_epoch='80,90', lr_decay_period=0, lr_mode='step', mixup=True, momentum=0.9, network='darknet53', no_mixup_epochs=20, no_random_shape=False, no_wd=True, num_samples=117266, num_workers=16, resume='', save_interval=10, save_prefix='yolo3_darknet53_coco', seed=233, start_epoch=0, syncbn=False, val_interval=5, warmup_epochs=2, warmup_lr=0.0, wd=0.0005)
   INFO:root:Start training from [Epoch 0]
   Traceback (most recent call last):
   File "/home/new/dev/weidong/yolov3/train_yolo3.py", line 347, in
   train(net, train_data, val_data, eval_metric, ctx, args)
   File "/home/new/dev/weidong/yolov3/train_yolo3.py", line 279, in train
   obj_metrics.update(0, obj_losses)
   File "/home/new/.local/lib/python3.6/site-packages/mxnet/metric.py", line 1506, in update
   loss = ndarray.sum(pred).asscalar()
   File "/home/new/.local/lib/python3.6/site-packages/mxnet/ndarray/ndarray.py", line 2013, in asscalar
   return self.asnumpy()[0]
   File "/home/new/.local/lib/python3.6/site-packages/mxnet/ndarray/ndarray.py", line 1995, in asnumpy
   ctypes.c_size_t(data.size)))
   File "/home/new/.local/lib/python3.6/site-packages/mxnet/base.py", line 252, in check_call
   raise MXNetError(py_str(_LIB.MXGetLastError()))
   mxnet.base.MXNetError: [11:14:15] src/operator/nn/./cudnn/cudnn_convolution-inl.h:160: Check failed: e == CUDNN_STATUS_SUCCESS (8 vs. 0) cuDNN: CUDNN_STATUS_EXECUTION_FAILED
   Stack trace returned 10 entries:
   [bt] (0) /home/new/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x42e352) [0x7fb4d4108352]
   [bt] (1) /home/new/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x42e928) [0x7fb4d4108928]
   [bt] (2) /home/new/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x3d8ac3f) [0x7fb4d7a64c3f]
   [bt] (3) /home/new/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x3dba817) [0x7fb4d7a94817]
   [bt] (4) /home/new/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(mxnet::imperative::PushFCompute(std::function<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)> const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&)::{lambda(mxnet::RunContext)#1}::operator()(mxnet::RunContext) const+0x2ff) [0x7fb4d6e299df]
   [bt] (5) /home/new/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x30a6c93) [0x7fb4d6d80c93]
   [bt] (6) /home/new/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x30af4ce) [0x7fb4d6d894ce]
   [bt] (7) /home/new/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x30b35ea) [0x7fb4d6d8d5ea]
   [bt] (8) /home/new/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x30b385e) [0x7fb4d6d8d85e]
   [bt] (9) /home/new/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x30afb7b) [0x7fb4d6d89b7b]
   
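   For context on the frame that raises the error: `metric.py` line 1506 is in the `update` method of what appears to be `mx.metric.Loss`, which ignores its label argument (hence the `0`), sums each NDArray it is given, and converts the result to a Python float with `asscalar()`. That `asscalar()` is the first blocking call after the forward/backward pass. A minimal sketch of the same pattern, with hypothetical per-GPU loss arrays standing in for the real `obj_losses` from the training loop:

   ```python
   import mxnet as mx

   # mx.metric.Loss ignores the label argument, which is why the script passes 0
   obj_metrics = mx.metric.Loss('ObjLoss')

   ctx = [mx.gpu(0), mx.gpu(2)]
   # hypothetical per-device loss arrays, one NDArray per GPU
   obj_losses = [mx.nd.array([0.7], ctx=c) for c in ctx]

   # for each array this calls ndarray.sum(pred).asscalar() -- the blocking call
   # where any pending cuDNN failure from the forward pass is finally raised
   obj_metrics.update(0, obj_losses)
   print(obj_metrics.get())   # ('ObjLoss', 0.7)
   ```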
   It seems the problem is triggered by `obj_metrics.update(0, obj_losses)`: when I comment out all of the metric-related code, training runs fine, but the same error still occurs in the eval phase, like this:
   
   > some earlier output omitted
   INFO:root:[Epoch 4][Batch 6699], LR: 1.00E-03, Speed: 185.507 samples/sec
   INFO:root:[Epoch 4][Batch 6799], LR: 1.00E-03, Speed: 183.469 samples/sec
   INFO:root:[Epoch 4][Batch 6899], LR: 1.00E-03, Speed: 161.316 samples/sec
   INFO:root:[Epoch 4][Batch 6999], LR: 1.00E-03, Speed: 196.679 samples/sec
   INFO:root:[Epoch 4][Batch 7099], LR: 1.00E-03, Speed: 221.820 samples/sec
   INFO:root:[Epoch 4][Batch 7199], LR: 1.00E-03, Speed: 178.233 samples/sec
   INFO:root:[Epoch 4][Batch 7299], LR: 1.00E-03, Speed: 194.973 samples/sec
   INFO:root:[Epoch 4] Training cost: 1727.013
   Traceback (most recent call last):
     File "/home/new/dev/weidong/yolov3/train_yolo3.py", line 354, in <module>
       train(net, train_data, val_data, eval_metric, ctx, args)
     File "/home/new/dev/weidong/yolov3/train_yolo3.py", line 309, in train
       map_name, mean_ap = validate(net, val_data, ctx, eval_metric)
     File "/home/new/dev/weidong/yolov3/train_yolo3.py", line 181, in validate
       eval_metric.update(det_bboxes, det_ids, det_scores, gt_bboxes, gt_ids, gt_difficults)
     File "/home/new/.local/lib/python3.6/site-packages/gluoncv/utils/metrics/coco_detection.py", line 175, in update
       *[as_numpy(x) for x in [pred_bboxes, pred_labels, pred_scores]]):
     File "/home/new/.local/lib/python3.6/site-packages/gluoncv/utils/metrics/coco_detection.py", line 175, in <listcomp>
       *[as_numpy(x) for x in [pred_bboxes, pred_labels, pred_scores]]):
     File "/home/new/.local/lib/python3.6/site-packages/gluoncv/utils/metrics/coco_detection.py", line 168, in as_numpy
       out = [x.asnumpy() if isinstance(x, mx.nd.NDArray) else x for x in a]
     File "/home/new/.local/lib/python3.6/site-packages/gluoncv/utils/metrics/coco_detection.py", line 168, in <listcomp>
       out = [x.asnumpy() if isinstance(x, mx.nd.NDArray) else x for x in a]
     File "/home/new/.local/lib/python3.6/site-packages/mxnet/ndarray/ndarray.py", line 1995, in asnumpy
       ctypes.c_size_t(data.size)))
     File "/home/new/.local/lib/python3.6/site-packages/mxnet/base.py", line 252, in check_call
       raise MXNetError(py_str(_LIB.MXGetLastError()))
   mxnet.base.MXNetError: [12:58:57] src/operator/nn/./cudnn/cudnn_convolution-inl.h:160: Check failed: e == CUDNN_STATUS_SUCCESS (8 vs. 0) cuDNN: CUDNN_STATUS_EXECUTION_FAILED
   Stack trace returned 10 entries:
   [bt] (0) /home/new/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x42e352) [0x7f6060b6f352]
   [bt] (1) /home/new/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x42e928) [0x7f6060b6f928]
   [bt] (2) /home/new/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x3d8ac3f) [0x7f60644cbc3f]
   [bt] (3) /home/new/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x3dba817) [0x7f60644fb817]
   [bt] (4) /home/new/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(mxnet::imperative::PushFCompute(std::function<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)> const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&)::{lambda(mxnet::RunContext)#1}::operator()(mxnet::RunContext) const+0x2ff) [0x7f60638909df]
   [bt] (5) /home/new/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x30a6c93) [0x7f60637e7c93]
   [bt] (6) /home/new/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x30af4ce) [0x7f60637f04ce]
   [bt] (7) /home/new/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x30b35ea) [0x7f60637f45ea]
   [bt] (8) /home/new/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x30b385e) [0x7f60637f485e]
   [bt] (9) /home/new/.local/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x30afb7b) [0x7f60637f0b7b]
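
   In case it helps with the diagnosis: MXNet queues GPU operators asynchronously, so the MXNetError raised inside `asscalar()`/`asnumpy()` is only where a pending failure surfaces; the cuDNN convolution that actually failed ran earlier in the forward pass. Below is a minimal sketch (with a placeholder one-layer network, not the real YOLOv3 model) of forcing synchronization right after a convolution on each of the two GPUs, so the error is reported closer to the operator that caused it. Running with the environment variable `MXNET_ENGINE_TYPE=NaiveEngine` should have a similar effect by executing every operator synchronously.

   ```python
   import mxnet as mx
   from mxnet import gluon

   # placeholder network: a single convolution, just enough to exercise cuDNN
   net = gluon.nn.Conv2D(channels=32, kernel_size=3, padding=1)
   ctx = [mx.gpu(0), mx.gpu(2)]
   net.initialize(ctx=ctx)

   for c in ctx:
       x = mx.nd.random.uniform(shape=(8, 3, 416, 416), ctx=c)
       y = net(x)
       # without this, a failing convolution only shows up later, at the first
       # blocking call (asscalar()/asnumpy() inside metric.update); waitall()
       # makes it surface here, right after the device that ran the op
       mx.nd.waitall()
   ```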
   
   I have no idea how to fix this at the moment. Any suggestions?
