Posted to issues@mxnet.apache.org by "Carl Yang (JIRA)" <ji...@apache.org> on 2018/10/05 20:06:01 UTC

[jira] [Created] (MXNET-1027) Horovod Random Segfault during Training

Carl Yang created MXNET-1027:
--------------------------------

             Summary: Horovod Random Segfault during Training
                 Key: MXNET-1027
                 URL: https://issues.apache.org/jira/browse/MXNET-1027
             Project: Apache MXNet
          Issue Type: Bug
          Components: Horovod
            Reporter: Carl Yang


Setup: 8 GPUs on a p3.16xlarge instance
Commit: most likely the Horovod branch (0a0240113fe5a24ec2c772fd7309840ba179562a)

nohup: ignoring input and appending output to 'nohup.out'
INFO:root:start with arguments Namespace(batch_size=128, benchmark=0, brightness=0.4, contrast=0.4, data_nthreads=4, data_train='/media/ramdisk/train-passthrough.rec', data_train_idx='/media/ramdisk/train-passthrough.idx', data_val='/media/ramdisk/val-passthrough.rec', data_val_idx='/media/ramdisk/val-passthrough.idx', disp_batches=20, dtype='float32', gc_threshold=0.5, gc_type='none', gpus='0', image_shape='3,224,224', initializer='default', kv_store='None', load_epoch=None, loss='', lr=0.8, lr_factor=0.1, lr_step_epochs='30,60,80', macrobatch_size=0, max_random_area=1, max_random_aspect_ratio=1.3333333333333333, max_random_h=0, max_random_l=0, max_random_rotate_angle=0, max_random_s=0, max_random_scale=1, max_random_shear_ratio=0, min_random_area=0.08, min_random_aspect_ratio=0.75, min_random_scale=1, model_prefix=None, mom=0.9, monitor=0, network='resnet-v1', num_classes=1000, num_epochs=90, num_examples=1281167, num_layers=50, optimizer='sgd', pad_size=0, pca_noise=0.1, random_crop=0, random_mirror=0, random_resized_crop=1, rgb_mean='123.68,116.779,103.939', saturation=0.4, save_period=1, test_io=0, top_k=0, warmup_epochs=10, warmup_strategy='linear', wd=0.0001)

…

INFO:root:Epoch[67] Batch [1140-1160]   Speed: 334.12 samples/sec       accuracy=0.710156
INFO:root:Epoch[67] Batch [1140-1160]   Speed: 335.77 samples/sec       accuracy=0.719922
INFO:root:Epoch[67] Batch [1140-1160]   Speed: 334.73 samples/sec       accuracy=0.714063
INFO:root:Epoch[67] Batch [1140-1160]   Speed: 334.85 samples/sec       accuracy=0.721875
INFO:root:Epoch[67] Batch [1140-1160]   Speed: 334.34 samples/sec       accuracy=0.711719
INFO:root:Epoch[67] Batch [1140-1160]   Speed: 333.82 samples/sec       accuracy=0.714844
INFO:root:Epoch[67] Batch [1160-1180]   Speed: 329.31 samples/sec       accuracy=0.722656
INFO:root:Epoch[67] Batch [1160-1180]   Speed: 329.31 samples/sec       accuracy=0.705859
INFO:root:Epoch[67] Batch [1160-1180]   Speed: 329.25 samples/sec       accuracy=0.712891
INFO:root:Epoch[67] Batch [1160-1180]   Speed: 329.31 samples/sec       accuracy=0.723828
INFO:root:Epoch[67] Batch [1160-1180]   Speed: 329.26 samples/sec       accuracy=0.717969
INFO:root:Epoch[67] Batch [1160-1180]   Speed: 329.71 samples/sec       accuracy=0.716016
INFO:root:Epoch[67] Batch [1160-1180]   Speed: 329.03 samples/sec       accuracy=0.722656
INFO:root:Epoch[67] Batch [1160-1180]   Speed: 329.27 samples/sec       accuracy=0.716797

Segmentation fault: 11

Stack trace returned 8 entries:
[bt] (0) /home/ubuntu/master/lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x5b) [0x7f7233aacaeb]
[bt] (1) /home/ubuntu/master/lib/libmxnet.so(+0x3e4d74f) [0x7f7236b9a74f]
[bt] (2) /lib/x86_64-linux-gnu/libc.so.6(+0x354b0) [0x7f724a0be4b0]
[bt] (3) /usr/local/lib/python3.5/dist-packages/horovod/mxnet/mpi_lib.cpython-35m-x86_64-linux-gnu.so(horovod::MX::HandleManager::ExecuteCallback(int)+0x19) [0x7f7227ef7009]
[bt] (4) /usr/local/lib/python3.5/dist-packages/horovod/mxnet/mpi_lib.cpython-35m-x86_64-linux-gnu.so(+0x24b2b) [0x7f7227edab2b]
[bt] (5) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7f7180a6bc80]
[bt] (6) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f724a45a6ba]
[bt] (7) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f724a19041d]

Segmentation fault: 11

Stack trace returned 9 entries:
[bt] (0) /home/ubuntu/master/lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x5b) [0x7f7233aacaeb]
[bt] (1) /home/ubuntu/master/lib/libmxnet.so(+0x3e4d74f) [0x7f7236b9a74f]
[bt] (2) /lib/x86_64-linux-gnu/libc.so.6(+0x354b0) [0x7f724a0be4b0]
[bt] (3) /usr/local/lib/python3.5/dist-packages/horovod/mxnet/mpi_lib.cpython-35m-x86_64-linux-gnu.so(horovod::MX::HandleManager::ExecuteCallback(int)+0x19) [0x7f7227ef7009]
[bt] (4) /usr/local/lib/python3.5/dist-packages/horovod/mxnet/mpi_lib.cpython-35m-x86_64-linux-gnu.so(+0x259fc) [0x7f7227edb9fc]
[bt] (5) /usr/local/lib/python3.5/dist-packages/horovod/mxnet/mpi_lib.cpython-35m-x86_64-linux-gnu.so(+0x30e6a) [0x7f7227ee6e6a]
[bt] (6) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7f7180a6bc80]
[bt] (7) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f724a45a6ba]
[bt] (8) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f724a19041d]
terminate called after throwing an instance of 'std::system_error'
  what():  Resource deadlock avoided
[ip-172-31-9-223:33837] *** Process received signal ***
[ip-172-31-9-223:33837] Signal: Aborted (6)
[ip-172-31-9-223:33837] Signal code:  (-6)
[ip-172-31-9-223:33837] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7f724a464390]
[ip-172-31-9-223:33837] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x38)[0x7f724a0be428]
[ip-172-31-9-223:33837] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x16a)[0x7f724a0c002a]
[ip-172-31-9-223:33837] [ 3] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x16d)[0x7f7180a4284d]
[ip-172-31-9-223:33837] [ 4] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x8d6b6)[0x7f7180a406b6]
[ip-172-31-9-223:33837] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x8c6a9)[0x7f7180a3f6a9]
[ip-172-31-9-223:33837] [ 6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(__gxx_personality_v0+0x2e5)[0x7f7180a40005]
[ip-172-31-9-223:33837] [ 7] /lib/x86_64-linux-gnu/libgcc_s.so.1(+0xff83)[0x7f718058af83]
[ip-172-31-9-223:33837] [ 8] /lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_RaiseException+0xfb)[0x7f718058b2eb]
[ip-172-31-9-223:33837] [ 9] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(__cxa_throw+0x5c)[0x7f7180a4090c]
[ip-172-31-9-223:33837] [10] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(_ZSt20__throw_system_errori+0x8e)[0x7f7180a697fe]
[ip-172-31-9-223:33837] [11] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(_ZNSt6thread4joinEv+0x18)[0x7f7180a6bb88]
[ip-172-31-9-223:33837] [12] /usr/local/lib/python3.5/dist-packages/horovod/mxnet/mpi_lib.cpython-35m-x86_64-linux-gnu.so(+0x243e3)[0x7f7227eda3e3]
[ip-172-31-9-223:33837] [13] /lib/x86_64-linux-gnu/libc.so.6(+0x39ff8)[0x7f724a0c2ff8]
[ip-172-31-9-223:33837] [14] /lib/x86_64-linux-gnu/libc.so.6(+0x3a045)[0x7f724a0c3045]
[ip-172-31-9-223:33837] [15] /home/ubuntu/master/lib/libmxnet.so(+0x3e4d786)[0x7f7236b9a786]
[ip-172-31-9-223:33837] [16] /lib/x86_64-linux-gnu/libc.so.6(+0x354b0)[0x7f724a0be4b0]
[ip-172-31-9-223:33837] [17] /usr/local/lib/python3.5/dist-packages/horovod/mxnet/mpi_lib.cpython-35m-x86_64-linux-gnu.so(_ZN7horovod2MX13HandleManager15ExecuteCallbackEi+0x19)[0x7f7227ef7009]
[ip-172-31-9-223:33837] [18] /usr/local/lib/python3.5/dist-packages/horovod/mxnet/mpi_lib.cpython-35m-x86_64-linux-gnu.so(+0x259fc)[0x7f7227edb9fc]
[ip-172-31-9-223:33837] [19] /usr/local/lib/python3.5/dist-packages/horovod/mxnet/mpi_lib.cpython-35m-x86_64-linux-gnu.so(+0x30e6a)[0x7f7227ee6e6a]
[ip-172-31-9-223:33837] [20] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80)[0x7f7180a6bc80]
[ip-172-31-9-223:33837] [21] /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba)[0x7f724a45a6ba]
[ip-172-31-9-223:33837] [22] /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f724a19041d]
[ip-172-31-9-223:33837] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 5 with PID 0 on node ip-172-31-9-223 exited on signal 6 (Aborted).
--------------------------------------------------------------------------
