Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2017/12/12 05:43:10 UTC
[GitHub] leleamol commented on issue #9004: test_nccl.py script causes a core dump on P2.16xlarge instance when ran against NCCL enabled MXNet build.
URL: https://github.com/apache/incubator-mxnet/issues/9004#issuecomment-350951363
@ptrendx, the following is the output of test_nccl.py when run with NCCL_DEBUG=INFO.
[ec2-user@ip-172-31-46-76 gpu]$ NCCL_DEBUG=INFO python test_nccl.py
ip-172-31-46-76:8258:8258 [0] misc/ibvwrap.cu:60 WARN Failed to open libibverbs.so[.1]
ip-172-31-46-76:8258:8258 [0] INFO Using internal Network Socket
ip-172-31-46-76:8258:8258 [0] INFO Using NCCL Low-latency algorithm for sizes below 16384
NCCL version 2.1.2+cuda9.0
ip-172-31-46-76:8258:8258 [0] INFO NET : Using interface eth0:172.31.46.76<0>
ip-172-31-46-76:8258:8258 [0] INFO NET/Socket : 1 interfaces found
ip-172-31-46-76:8258:8258 [1] INFO Using 512 threads
ip-172-31-46-76:8258:8258 [1] INFO Min Comp Cap 3
ip-172-31-46-76:8258:8258 [1] INFO NCCL_SINGLE_RING_THRESHOLD=131072
ip-172-31-46-76:8258:8258 [1] INFO [0] Ring 0 : 0 1
ip-172-31-46-76:8258:8258 [0] INFO 0 -> 1 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [0] INFO 0 -> 1 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [1] INFO 1 -> 0 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [1] INFO 1 -> 0 via P2P/direct pointer
ip-172-31-46-76:8258:8364 [0] INFO Launch mode Group
ip-172-31-46-76:8258:8258 [2] INFO Using 512 threads
ip-172-31-46-76:8258:8258 [2] INFO Min Comp Cap 3
ip-172-31-46-76:8258:8258 [2] INFO NCCL_SINGLE_RING_THRESHOLD=131072
ip-172-31-46-76:8258:8258 [2] INFO [0] Ring 0 : 0 1 2
ip-172-31-46-76:8258:8258 [0] INFO 0 -> 2 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [0] INFO 0 -> 1 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [1] INFO 1 -> 0 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [1] INFO 1 -> 2 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [2] INFO 2 -> 1 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [2] INFO 2 -> 0 via P2P/direct pointer
ip-172-31-46-76:8258:8362 [0] INFO Launch mode Group
ip-172-31-46-76:8258:8258 [3] INFO Using 512 threads
ip-172-31-46-76:8258:8258 [3] INFO Min Comp Cap 3
ip-172-31-46-76:8258:8258 [3] INFO NCCL_SINGLE_RING_THRESHOLD=131072
ip-172-31-46-76:8258:8258 [3] INFO [0] Ring 0 : 0 1 2 3
ip-172-31-46-76:8258:8258 [0] INFO 0 -> 3 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [0] INFO 0 -> 1 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [1] INFO 1 -> 0 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [1] INFO 1 -> 2 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [2] INFO 2 -> 1 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [2] INFO 2 -> 3 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [3] INFO 3 -> 2 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [3] INFO 3 -> 0 via P2P/direct pointer
ip-172-31-46-76:8258:8364 [0] INFO Launch mode Group
ip-172-31-46-76:8258:8258 [4] INFO Using 512 threads
ip-172-31-46-76:8258:8258 [4] INFO Min Comp Cap 3
ip-172-31-46-76:8258:8258 [4] INFO NCCL_SINGLE_RING_THRESHOLD=131072
ip-172-31-46-76:8258:8258 [4] INFO [0] Ring 0 : 0 1 2 3 4
ip-172-31-46-76:8258:8258 [0] INFO 0 -> 4 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [0] INFO 0 -> 1 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [1] INFO 1 -> 0 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [1] INFO 1 -> 2 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [2] INFO 2 -> 1 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [2] INFO 2 -> 3 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [3] INFO 3 -> 2 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [3] INFO 3 -> 4 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [4] INFO 4 -> 3 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [4] INFO 4 -> 0 via P2P/direct pointer
ip-172-31-46-76:8258:8365 [0] INFO Launch mode Group
ip-172-31-46-76:8258:8258 [5] INFO Using 512 threads
ip-172-31-46-76:8258:8258 [5] INFO Min Comp Cap 3
ip-172-31-46-76:8258:8258 [5] INFO NCCL_SINGLE_RING_THRESHOLD=131072
ip-172-31-46-76:8258:8258 [5] INFO [0] Ring 0 : 0 1 2 3 4 5
ip-172-31-46-76:8258:8258 [0] INFO 0 -> 5 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [0] INFO 0 -> 1 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [1] INFO 1 -> 0 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [1] INFO 1 -> 2 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [2] INFO 2 -> 1 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [2] INFO 2 -> 3 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [3] INFO 3 -> 2 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [3] INFO 3 -> 4 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [4] INFO 4 -> 3 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [4] INFO 4 -> 5 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [5] INFO 5 -> 4 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [5] INFO 5 -> 0 via P2P/direct pointer
ip-172-31-46-76:8258:8365 [0] INFO Launch mode Group
ip-172-31-46-76:8258:8258 [6] INFO Using 512 threads
ip-172-31-46-76:8258:8258 [6] INFO Min Comp Cap 3
ip-172-31-46-76:8258:8258 [6] INFO NCCL_SINGLE_RING_THRESHOLD=131072
ip-172-31-46-76:8258:8258 [6] INFO [0] Ring 0 : 0 1 2 3 4 5 6
ip-172-31-46-76:8258:8258 [0] INFO 0 -> 6 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [0] INFO 0 -> 1 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [1] INFO 1 -> 0 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [1] INFO 1 -> 2 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [2] INFO 2 -> 1 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [2] INFO 2 -> 3 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [3] INFO 3 -> 2 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [3] INFO 3 -> 4 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [4] INFO 4 -> 3 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [4] INFO 4 -> 5 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [5] INFO 5 -> 4 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [5] INFO 5 -> 6 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [6] INFO 6 -> 5 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [6] INFO 6 -> 0 via P2P/direct pointer
ip-172-31-46-76:8258:8363 [0] INFO Launch mode Group
ip-172-31-46-76:8258:8258 [7] INFO Using 512 threads
ip-172-31-46-76:8258:8258 [7] INFO Min Comp Cap 3
ip-172-31-46-76:8258:8258 [7] INFO NCCL_SINGLE_RING_THRESHOLD=131072
ip-172-31-46-76:8258:8258 [7] INFO [0] Ring 0 : 0 1 2 3 4 5 6 7
ip-172-31-46-76:8258:8258 [0] INFO 0 -> 7 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [0] INFO 0 -> 1 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [1] INFO 1 -> 0 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [1] INFO 1 -> 2 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [2] INFO 2 -> 1 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [2] INFO 2 -> 3 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [3] INFO 3 -> 2 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [3] INFO 3 -> 4 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [4] INFO 4 -> 3 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [4] INFO 4 -> 5 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [5] INFO 5 -> 4 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [5] INFO 5 -> 6 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [6] INFO 6 -> 5 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [6] INFO 6 -> 7 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [7] INFO 7 -> 6 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [7] INFO 7 -> 0 via P2P/direct pointer
ip-172-31-46-76:8258:8362 [0] INFO Launch mode Group
ip-172-31-46-76:8258:8258 [8] INFO Using 512 threads
ip-172-31-46-76:8258:8258 [8] INFO Min Comp Cap 3
ip-172-31-46-76:8258:8258 [8] INFO NCCL_SINGLE_RING_THRESHOLD=131072
ip-172-31-46-76:8258:8258 [8] INFO [0] Ring 0 : 0 1 2 3 4 5 6 7 8
ip-172-31-46-76:8258:8258 [0] INFO 0 -> 8 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [0] INFO 0 -> 1 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [1] INFO 1 -> 0 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [1] INFO 1 -> 2 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [2] INFO 2 -> 1 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [2] INFO 2 -> 3 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [3] INFO 3 -> 2 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [3] INFO 3 -> 4 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [4] INFO 4 -> 3 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [4] INFO 4 -> 5 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [5] INFO 5 -> 4 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [5] INFO 5 -> 6 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [6] INFO 6 -> 5 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [6] INFO 6 -> 7 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [7] INFO 7 -> 6 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [7] INFO 7 -> 8 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [8] INFO 8 -> 7 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [8] INFO 8 -> 0 via P2P/direct pointer
ip-172-31-46-76:8258:8364 [0] INFO Launch mode Group
ip-172-31-46-76:8258:8258 [9] INFO Using 512 threads
ip-172-31-46-76:8258:8258 [9] INFO Min Comp Cap 3
ip-172-31-46-76:8258:8258 [9] INFO NCCL_SINGLE_RING_THRESHOLD=131072
ip-172-31-46-76:8258:8258 [9] INFO [0] Ring 0 : 0 1 2 3 4 5 6 7 8 9
ip-172-31-46-76:8258:8258 [0] transport/p2p.cu:393 WARN failed to peer with device 9: 60 peer mapping resources exhausted
ip-172-31-46-76:8258:8258 [0] INFO init.cu:191 -> 3
ip-172-31-46-76:8258:8258 [0] INFO init.cu:266 -> 3
ip-172-31-46-76:8258:8258 [0] INFO init.cu:610 -> 3
ip-172-31-46-76:8258:8258 [0] INFO init.cu:678 -> 3
It created a core dump. The call stack is as follows:
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `python test_nccl.py'.
Program terminated with signal 11, Segmentation fault.
#0 0x00007f504ef23a6e in commFree (comm=0xece8380) at init.cu:100
100 init.cu: No such file or directory.
Missing separate debuginfos, use: debuginfo-install python26-2.6.9-2.89.amzn1.x86_64 python27-2.7.12-2.121.amzn1.x86_64 python34-3.4.3-1.35.amzn1.x86_64
(gdb) bt
#0 0x00007f504ef23a6e in commFree (comm=0xece8380) at init.cu:100
#1 0x00007f504ef27dad in ncclCommInitAll (comms=<optimized out>, ndev=<optimized out>, devlist=<optimized out>)
at init.cu:692
#2 0x00007f508c715a50 in mxnet::kvstore::KVStoreNCCL::Reduce(std::vector<int, std::allocator<int> >, std::vector<std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> >, std::allocator<std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > > > const&, int, std::vector<mxnet::NDArray const*, std::allocator<mxnet::NDArray const*> >*) () from /usr/lib/python2.7/dist-packages/mxnet-1.0.0-py2.7.egg/mxnet/libmxnet.so
#3 0x00007f508c72223a in mxnet::kvstore::KVStoreNCCL::PushImpl(std::vector<int, std::allocator<int> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, int) ()
from /usr/lib/python2.7/dist-packages/mxnet-1.0.0-py2.7.egg/mxnet/libmxnet.so
#4 0x00007f508c6d8ba1 in mxnet::kvstore::KVStoreLocal::Push(std::vector<std::string, std::allocator<std::string> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, int) ()
from /usr/lib/python2.7/dist-packages/mxnet-1.0.0-py2.7.egg/mxnet/libmxnet.so
#5 0x00007f508c6557fb in MXKVStorePushEx ()
from /usr/lib/python2.7/dist-packages/mxnet-1.0.0-py2.7.egg/mxnet/libmxnet.so
#6 0x00007f5149206cec in ffi_call_unix64 () from /usr/lib64/libffi.so.6
#7 0x00007f5149206615 in ffi_call () from /usr/lib64/libffi.so.6
#8 0x00007f514941997b in _ctypes_callproc () from /usr/lib64/python2.7/lib-dynload/_ctypes.so
#9 0x00007f5149413915 in ?? () from /usr/lib64/python2.7/lib-dynload/_ctypes.so
#10 0x00007f5150a74173 in PyObject_Call () from /usr/lib64/libpython2.7.so.1.0
#11 0x00007f5150b06f7d in PyEval_EvalFrameEx () from /usr/lib64/libpython2.7.so.1.0
#12 0x00007f5150b0cd8d in PyEval_EvalCodeEx () from /usr/lib64/libpython2.7.so.1.0
#13 0x00007f5150b098cc in PyEval_EvalFrameEx () from /usr/lib64/libpython2.7.so.1.0
#14 0x00007f5150b09972 in PyEval_EvalFrameEx () from /usr/lib64/libpython2.7.so.1.0
#15 0x00007f5150b0cd8d in PyEval_EvalCodeEx () from /usr/lib64/libpython2.7.so.1.0
#16 0x00007f5150b0ce92 in PyEval_EvalCode () from /usr/lib64/libpython2.7.so.1.0
#17 0x00007f5150b25d9f in ?? () from /usr/lib64/libpython2.7.so.1.0
#18 0x00007f5150b26ede in PyRun_FileExFlags () from /usr/lib64/libpython2.7.so.1.0
#19 0x00007f5150b28049 in PyRun_SimpleFileExFlags () from /usr/lib64/libpython2.7.so.1.0
#20 0x00007f5150b38c8f in Py_Main () from /usr/lib64/libpython2.7.so.1.0
#21 0x00007f514fd76b05 in __libc_start_main (main=0x4006c0 <main>, argc=2, argv=0x7fff28caa7f8,
init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fff28caa7e8)
at libc-start.c:269
#22 0x00000000004006f1 in _start ()
(gdb)
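For triage, the NCCL_DEBUG=INFO output above can be condensed with a short script. This is only a minimal sketch: it assumes log lines shaped like the ones pasted in this comment, the helper name summarize_nccl_log is hypothetical, and the regular expressions are guesses based on this one log rather than on any documented NCCL format. It reports the largest ring NCCL tried to build and every WARN line with its source location.

```python
import re

# Assumed line shapes, taken only from the log pasted above:
#   "<host>:<pid>:<tid> [9] INFO [0] Ring 0 : 0 1 2 3"
#   "<host>:<pid>:<tid> [0] transport/p2p.cu:393 WARN failed to peer ..."
RING_RE = re.compile(r"INFO \[\d+\] Ring \d+ : ([\d ]+)$")
WARN_RE = re.compile(r"\] (\S+:\d+) WARN (.*)$")


def summarize_nccl_log(lines):
    """Return (largest ring size seen, [(source location, message), ...])."""
    largest_ring = 0
    warnings = []
    for line in lines:
        line = line.strip()
        ring = RING_RE.search(line)
        if ring:
            # The ring line lists one rank per device in the ring.
            largest_ring = max(largest_ring, len(ring.group(1).split()))
            continue
        warn = WARN_RE.search(line)
        if warn:
            warnings.append((warn.group(1), warn.group(2)))
    return largest_ring, warnings


# A few lines copied from the log above, used as sample input.
sample = [
    "ip-172-31-46-76:8258:8258 [0] misc/ibvwrap.cu:60 WARN Failed to open libibverbs.so[.1]",
    "ip-172-31-46-76:8258:8258 [8] INFO [0] Ring 0 : 0 1 2 3 4 5 6 7 8",
    "ip-172-31-46-76:8258:8258 [9] INFO [0] Ring 0 : 0 1 2 3 4 5 6 7 8 9",
    "ip-172-31-46-76:8258:8258 [0] transport/p2p.cu:393 WARN failed to peer with device 9: 60 peer mapping resources exhausted",
    "ip-172-31-46-76:8258:8258 [0] INFO init.cu:191 -> 3",
]

largest, warns = summarize_nccl_log(sample)
print("largest ring attempted:", largest)
for location, message in warns:
    print(location, "->", message)
```

Run against the full log, this makes it easy to see that the rings of 2 through 9 devices were set up and that the peer-mapping warning appears only once the ring grows to 10 devices.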