Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2017/12/12 05:43:10 UTC

[GitHub] leleamol commented on issue #9004: test_nccl.py script causes a core dump on P2.16xlarge instance when ran against NCCL enabled MXNet build.

URL: https://github.com/apache/incubator-mxnet/issues/9004#issuecomment-350951363
 
 
   @ptrendx The following is the output of test_nccl.py when run with NCCL_DEBUG=INFO.
   
   [ec2-user@ip-172-31-46-76 gpu]$ NCCL_DEBUG=INFO python test_nccl.py
   
   
   
   ip-172-31-46-76:8258:8258 [0] misc/ibvwrap.cu:60 WARN Failed to open libibverbs.so[.1]
   
   ip-172-31-46-76:8258:8258 [0] INFO Using internal Network Socket
   
   ip-172-31-46-76:8258:8258 [0] INFO Using NCCL Low-latency algorithm for sizes below 16384
   
   NCCL version 2.1.2+cuda9.0
   
   ip-172-31-46-76:8258:8258 [0] INFO NET : Using interface eth0:172.31.46.76<0>
   
   ip-172-31-46-76:8258:8258 [0] INFO NET/Socket : 1 interfaces found
   
   ip-172-31-46-76:8258:8258 [1] INFO Using 512 threads
   
   ip-172-31-46-76:8258:8258 [1] INFO Min Comp Cap 3
   
   ip-172-31-46-76:8258:8258 [1] INFO NCCL_SINGLE_RING_THRESHOLD=131072
   
   ip-172-31-46-76:8258:8258 [1] INFO [0] Ring 0 :    0   1
   
   ip-172-31-46-76:8258:8258 [0] INFO 0 -> 1 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [0] INFO 0 -> 1 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [1] INFO 1 -> 0 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [1] INFO 1 -> 0 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8364 [0] INFO Launch mode Group
   
   ip-172-31-46-76:8258:8258 [2] INFO Using 512 threads
   
   ip-172-31-46-76:8258:8258 [2] INFO Min Comp Cap 3
   
   ip-172-31-46-76:8258:8258 [2] INFO NCCL_SINGLE_RING_THRESHOLD=131072
   
   ip-172-31-46-76:8258:8258 [2] INFO [0] Ring 0 :    0   1   2
   
   ip-172-31-46-76:8258:8258 [0] INFO 0 -> 2 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [0] INFO 0 -> 1 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [1] INFO 1 -> 0 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [1] INFO 1 -> 2 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [2] INFO 2 -> 1 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [2] INFO 2 -> 0 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8362 [0] INFO Launch mode Group
   
   ip-172-31-46-76:8258:8258 [3] INFO Using 512 threads
   
   ip-172-31-46-76:8258:8258 [3] INFO Min Comp Cap 3
   
   ip-172-31-46-76:8258:8258 [3] INFO NCCL_SINGLE_RING_THRESHOLD=131072
   
   ip-172-31-46-76:8258:8258 [3] INFO [0] Ring 0 :    0   1   2   3
   
   ip-172-31-46-76:8258:8258 [0] INFO 0 -> 3 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [0] INFO 0 -> 1 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [1] INFO 1 -> 0 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [1] INFO 1 -> 2 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [2] INFO 2 -> 1 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [2] INFO 2 -> 3 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [3] INFO 3 -> 2 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [3] INFO 3 -> 0 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8364 [0] INFO Launch mode Group
   
   ip-172-31-46-76:8258:8258 [4] INFO Using 512 threads
   
   ip-172-31-46-76:8258:8258 [4] INFO Min Comp Cap 3
   
   ip-172-31-46-76:8258:8258 [4] INFO NCCL_SINGLE_RING_THRESHOLD=131072
   
   ip-172-31-46-76:8258:8258 [4] INFO [0] Ring 0 :    0   1   2   3   4
   
   ip-172-31-46-76:8258:8258 [0] INFO 0 -> 4 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [0] INFO 0 -> 1 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [1] INFO 1 -> 0 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [1] INFO 1 -> 2 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [2] INFO 2 -> 1 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [2] INFO 2 -> 3 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [3] INFO 3 -> 2 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [3] INFO 3 -> 4 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [4] INFO 4 -> 3 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [4] INFO 4 -> 0 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8365 [0] INFO Launch mode Group
   
   ip-172-31-46-76:8258:8258 [5] INFO Using 512 threads
   
   ip-172-31-46-76:8258:8258 [5] INFO Min Comp Cap 3
   
   ip-172-31-46-76:8258:8258 [5] INFO NCCL_SINGLE_RING_THRESHOLD=131072
   
   ip-172-31-46-76:8258:8258 [5] INFO [0] Ring 0 :    0   1   2   3   4   5
   
   ip-172-31-46-76:8258:8258 [0] INFO 0 -> 5 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [0] INFO 0 -> 1 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [1] INFO 1 -> 0 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [1] INFO 1 -> 2 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [2] INFO 2 -> 1 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [2] INFO 2 -> 3 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [3] INFO 3 -> 2 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [3] INFO 3 -> 4 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [4] INFO 4 -> 3 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [4] INFO 4 -> 5 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [5] INFO 5 -> 4 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [5] INFO 5 -> 0 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8365 [0] INFO Launch mode Group
   
   ip-172-31-46-76:8258:8258 [6] INFO Using 512 threads
   
   ip-172-31-46-76:8258:8258 [6] INFO Min Comp Cap 3
   
   ip-172-31-46-76:8258:8258 [6] INFO NCCL_SINGLE_RING_THRESHOLD=131072
   
   ip-172-31-46-76:8258:8258 [6] INFO [0] Ring 0 :    0   1   2   3   4   5   6
   
   ip-172-31-46-76:8258:8258 [0] INFO 0 -> 6 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [0] INFO 0 -> 1 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [1] INFO 1 -> 0 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [1] INFO 1 -> 2 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [2] INFO 2 -> 1 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [2] INFO 2 -> 3 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [3] INFO 3 -> 2 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [3] INFO 3 -> 4 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [4] INFO 4 -> 3 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [4] INFO 4 -> 5 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [5] INFO 5 -> 4 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [5] INFO 5 -> 6 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [6] INFO 6 -> 5 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [6] INFO 6 -> 0 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8363 [0] INFO Launch mode Group
   
   ip-172-31-46-76:8258:8258 [7] INFO Using 512 threads
   
   ip-172-31-46-76:8258:8258 [7] INFO Min Comp Cap 3
   
   ip-172-31-46-76:8258:8258 [7] INFO NCCL_SINGLE_RING_THRESHOLD=131072
   
   ip-172-31-46-76:8258:8258 [7] INFO [0] Ring 0 :    0   1   2   3   4   5   6   7
   
   ip-172-31-46-76:8258:8258 [0] INFO 0 -> 7 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [0] INFO 0 -> 1 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [1] INFO 1 -> 0 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [1] INFO 1 -> 2 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [2] INFO 2 -> 1 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [2] INFO 2 -> 3 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [3] INFO 3 -> 2 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [3] INFO 3 -> 4 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [4] INFO 4 -> 3 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [4] INFO 4 -> 5 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [5] INFO 5 -> 4 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [5] INFO 5 -> 6 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [6] INFO 6 -> 5 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [6] INFO 6 -> 7 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [7] INFO 7 -> 6 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [7] INFO 7 -> 0 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8362 [0] INFO Launch mode Group
   
   ip-172-31-46-76:8258:8258 [8] INFO Using 512 threads
   
   ip-172-31-46-76:8258:8258 [8] INFO Min Comp Cap 3
   
   ip-172-31-46-76:8258:8258 [8] INFO NCCL_SINGLE_RING_THRESHOLD=131072
   
   ip-172-31-46-76:8258:8258 [8] INFO [0] Ring 0 :    0   1   2   3   4   5   6   7   8
   
   ip-172-31-46-76:8258:8258 [0] INFO 0 -> 8 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [0] INFO 0 -> 1 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [1] INFO 1 -> 0 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [1] INFO 1 -> 2 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [2] INFO 2 -> 1 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [2] INFO 2 -> 3 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [3] INFO 3 -> 2 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [3] INFO 3 -> 4 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [4] INFO 4 -> 3 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [4] INFO 4 -> 5 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [5] INFO 5 -> 4 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [5] INFO 5 -> 6 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [6] INFO 6 -> 5 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [6] INFO 6 -> 7 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [7] INFO 7 -> 6 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [7] INFO 7 -> 8 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [8] INFO 8 -> 7 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8258 [8] INFO 8 -> 0 via P2P/direct pointer
   
   ip-172-31-46-76:8258:8364 [0] INFO Launch mode Group
   
   ip-172-31-46-76:8258:8258 [9] INFO Using 512 threads
   
   ip-172-31-46-76:8258:8258 [9] INFO Min Comp Cap 3
   
   ip-172-31-46-76:8258:8258 [9] INFO NCCL_SINGLE_RING_THRESHOLD=131072
   
   ip-172-31-46-76:8258:8258 [9] INFO [0] Ring 0 :    0   1   2   3   4   5   6   7   8   9
   
   
   
   ip-172-31-46-76:8258:8258 [0] transport/p2p.cu:393 WARN failed to peer with device 9: 60 peer mapping resources exhausted
   
   ip-172-31-46-76:8258:8258 [0] INFO init.cu:191 -> 3
   
   ip-172-31-46-76:8258:8258 [0] INFO init.cu:266 -> 3
   
   ip-172-31-46-76:8258:8258 [0] INFO init.cu:610 -> 3
   
   ip-172-31-46-76:8258:8258 [0] INFO init.cu:678 -> 3
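
In the log above, device 0 reports P2P connections to devices 1 through 8 over the successive ring sizes, and the attempt to peer with device 9 fails with code 60, which in the CUDA runtime is cudaErrorTooManyPeers ("peer mapping resources exhausted"). That pattern would be consistent with a per-device limit on peer connections being reached, though the log alone does not prove it. The following is a minimal sketch, not part of test_nccl.py or NCCL, that tallies the distinct peers each device connected to in NCCL_DEBUG=INFO output. The names tally_peers, P2P_LINE and FAIL_LINE are hypothetical, and the regular expressions assume the line format shown in this log (NCCL 2.1.2); other NCCL versions may format these lines differently.

```python
import re
from collections import defaultdict

# Assumed format: "... [0] INFO 0 -> 8 via P2P/direct pointer"
P2P_LINE = re.compile(r"INFO (\d+) -> (\d+) via P2P")
# Assumed format: "... [0] transport/p2p.cu:393 WARN failed to peer with device 9: 60 ..."
FAIL_LINE = re.compile(r"\[(\d+)\] \S+ WARN failed to peer with device (\d+): (\d+)")


def tally_peers(log_text):
    """Return ({device: set of peers}, [(device, peer, error_code)]) from NCCL INFO output."""
    peers = defaultdict(set)
    failures = []
    for line in log_text.splitlines():
        ok = P2P_LINE.search(line)
        if ok:
            # Record the destination as a peer of the source device (duplicates collapse).
            peers[int(ok.group(1))].add(int(ok.group(2)))
            continue
        bad = FAIL_LINE.search(line)
        if bad:
            # (reporting device, device it failed to peer with, numeric error code)
            failures.append(tuple(int(g) for g in bad.groups()))
    return dict(peers), failures


if __name__ == "__main__":
    # Three lines in the same format as the log above, used only as sample input.
    sample = """\
ip-172-31-46-76:8258:8258 [0] INFO 0 -> 1 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [0] INFO 0 -> 2 via P2P/direct pointer
ip-172-31-46-76:8258:8258 [0] transport/p2p.cu:393 WARN failed to peer with device 9: 60 peer mapping resources exhausted
"""
    peers, failures = tally_peers(sample)
    for dev in sorted(peers):
        print(dev, sorted(peers[dev]))
    print(failures)
```

Feeding the full captured output to tally_peers would show how many distinct peers device 0 had accumulated by the time the WARN line appeared.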
   
   
   It created a core dump. The call stack is as follows:
   [Thread debugging using libthread_db enabled]
   Using host libthread_db library "/lib64/libthread_db.so.1".
   Core was generated by `python test_nccl.py'.
   Program terminated with signal 11, Segmentation fault.
   #0  0x00007f504ef23a6e in commFree (comm=0xece8380) at init.cu:100
   100	init.cu: No such file or directory.
   Missing separate debuginfos, use: debuginfo-install python26-2.6.9-2.89.amzn1.x86_64 python27-2.7.12-2.121.amzn1.x86_64 python34-3.4.3-1.35.amzn1.x86_64
   (gdb) bt
   #0  0x00007f504ef23a6e in commFree (comm=0xece8380) at init.cu:100
   #1  0x00007f504ef27dad in ncclCommInitAll (comms=<optimized out>, ndev=<optimized out>, devlist=<optimized out>)
       at init.cu:692
   #2  0x00007f508c715a50 in mxnet::kvstore::KVStoreNCCL::Reduce(std::vector<int, std::allocator<int> >, std::vector<std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> >, std::allocator<std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > > > const&, int, std::vector<mxnet::NDArray const*, std::allocator<mxnet::NDArray const*> >*) () from /usr/lib/python2.7/dist-packages/mxnet-1.0.0-py2.7.egg/mxnet/libmxnet.so
   #3  0x00007f508c72223a in mxnet::kvstore::KVStoreNCCL::PushImpl(std::vector<int, std::allocator<int> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, int) ()
      from /usr/lib/python2.7/dist-packages/mxnet-1.0.0-py2.7.egg/mxnet/libmxnet.so
   #4  0x00007f508c6d8ba1 in mxnet::kvstore::KVStoreLocal::Push(std::vector<std::string, std::allocator<std::string> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, int) ()
      from /usr/lib/python2.7/dist-packages/mxnet-1.0.0-py2.7.egg/mxnet/libmxnet.so
   #5  0x00007f508c6557fb in MXKVStorePushEx ()
      from /usr/lib/python2.7/dist-packages/mxnet-1.0.0-py2.7.egg/mxnet/libmxnet.so
   #6  0x00007f5149206cec in ffi_call_unix64 () from /usr/lib64/libffi.so.6
   #7  0x00007f5149206615 in ffi_call () from /usr/lib64/libffi.so.6
   #8  0x00007f514941997b in _ctypes_callproc () from /usr/lib64/python2.7/lib-dynload/_ctypes.so
   #9  0x00007f5149413915 in ?? () from /usr/lib64/python2.7/lib-dynload/_ctypes.so
   #10 0x00007f5150a74173 in PyObject_Call () from /usr/lib64/libpython2.7.so.1.0
   #11 0x00007f5150b06f7d in PyEval_EvalFrameEx () from /usr/lib64/libpython2.7.so.1.0
   #12 0x00007f5150b0cd8d in PyEval_EvalCodeEx () from /usr/lib64/libpython2.7.so.1.0
   #13 0x00007f5150b098cc in PyEval_EvalFrameEx () from /usr/lib64/libpython2.7.so.1.0
   #14 0x00007f5150b09972 in PyEval_EvalFrameEx () from /usr/lib64/libpython2.7.so.1.0
   #15 0x00007f5150b0cd8d in PyEval_EvalCodeEx () from /usr/lib64/libpython2.7.so.1.0
   #16 0x00007f5150b0ce92 in PyEval_EvalCode () from /usr/lib64/libpython2.7.so.1.0
   #17 0x00007f5150b25d9f in ?? () from /usr/lib64/libpython2.7.so.1.0
   #18 0x00007f5150b26ede in PyRun_FileExFlags () from /usr/lib64/libpython2.7.so.1.0
   #19 0x00007f5150b28049 in PyRun_SimpleFileExFlags () from /usr/lib64/libpython2.7.so.1.0
   #20 0x00007f5150b38c8f in Py_Main () from /usr/lib64/libpython2.7.so.1.0
   #21 0x00007f514fd76b05 in __libc_start_main (main=0x4006c0 <main>, argc=2, argv=0x7fff28caa7f8, 
       init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fff28caa7e8)
       at libc-start.c:269
   #22 0x00000000004006f1 in _start ()
   (gdb) 
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services