Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2017/12/08 21:13:16 UTC

[GitHub] leleamol opened a new issue #9004: test_nccl.py script causes a core dump on P2.16xlarge instance when run against NCCL enabled MXNet build.

leleamol opened a new issue #9004: test_nccl.py script causes a core dump on P2.16xlarge instance when run against NCCL enabled MXNet build.
URL: https://github.com/apache/incubator-mxnet/issues/9004
 
 
   ## Description
   The test_nccl.py script causes a core dump when run against an NCCL-enabled MXNet build.
   
   ## Environment info (Required)
   
   MXNet version v1.0.0, built with USE_NCCL=1 and USE_NCCL_PATH=/usr/nccl/cuda-9
   NCCL 2.1
   Instance type : p2.16xlarge
   
   ```
   [ec2-user@ip-172-31-42-123 tools]$ python diagnose.py 
   ----------Python Info----------
   ('Version      :', '2.7.12')
   ('Compiler     :', 'GCC 4.8.5 20150623 (Red Hat 4.8.5-11)')
   ('Build        :', ('default', 'Nov  2 2017 19:20:38'))
   ('Arch         :', ('64bit', 'ELF'))
   ------------Pip Info-----------
   ('Version      :', '9.0.1')
   ('Directory    :', '/usr/lib/python2.7/dist-packages/pip')
   ----------MXNet Info-----------
   ('Version      :', '1.0.0')
   ('Directory    :', '/usr/lib/python2.7/dist-packages/mxnet-1.0.0-py2.7.egg/mxnet')
   Traceback (most recent call last):
     File "diagnose.py", line 171, in <module>
       check_mxnet()
     File "diagnose.py", line 113, in check_mxnet
       except FileNotFoundError:
   NameError: global name 'FileNotFoundError' is not defined
   
   ```
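
The traceback above comes from diagnose.py itself rather than from MXNet: `FileNotFoundError` is a Python 3 builtin, so on Python 2.7 the `except FileNotFoundError:` clause raises `NameError`. A minimal sketch of a fallback that works on both versions, assuming the surrounding code only needs to catch a missing file (`read_first_line` is a hypothetical helper for illustration, not part of diagnose.py):

```python
# Python 2 has no FileNotFoundError builtin; open() raises IOError there.
try:
    FileNotFoundError
except NameError:  # only reached on Python 2
    FileNotFoundError = IOError


def read_first_line(path):
    """Hypothetical helper: return the first line of a file, or None if it is missing."""
    try:
        with open(path) as handle:
            return handle.readline().strip()
    except FileNotFoundError:
        return None
```

On Python 3 the alias is never created, so the builtin exception is used unchanged.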
   
   
   Package used (Python/R/Scala/Julia): Python
   
   ## Build info (Required if built from source)
   
   Compiler (gcc/clang/mingw/visual studio): gcc
   
   MXNet commit hash: 2b67436802b750e15b9fbfdf275958c1000be6a8
   
   Build config:
   USE_CUDA=1
   USE_CUDA_PATH=/usr/local/cuda
   USE_CUDNN=1
   USE_DIST_KVSTORE=1
   USE_MKL2017=1
   USE_BLAS=openblas
   USE_S3=1
   USE_NCCL=1
   USE_NCCL_PATH=/usr/nccl/cuda-9
   CUDA_ARCH := -gencode arch=compute_35,code=sm_35 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_70,code=compute_70
   
   ## Error Message:
   ```
   Core was generated by `/home/ec2-user/src/anaconda2/bin/python ./src/anaconda2/bin/nosetests /home/ec2'.
   Program terminated with signal 11, Segmentation fault.
   #0  0x00007f21e752aa6e in commFree (comm=0x5567ce4da6f0) at init.cu:100
   100	init.cu: No such file or directory.
   Missing separate debuginfos, use: debuginfo-install keyutils-libs-1.5.8-3.12.amzn1.x86_64 krb5-libs-1.15.1-8.43.amzn1.x86_64 libcom_err-1.42.12-4.40.amzn1.x86_64 libjpeg-turbo-1.2.90-5.14.amzn1.x86_64 libselinux-2.1.10-3.22.amzn1.x86_64 libuuid-2.23.2-33.28.amzn1.x86_64 openssl-1.0.2k-8.106.amzn1.x86_64
   (gdb) where
   #0  0x00007f21e752aa6e in commFree (comm=0x5567ce4da6f0) at init.cu:100
   #1  0x00007f21e752edad in ncclCommInitAll (comms=<optimized out>, ndev=<optimized out>, devlist=<optimized out>) at init.cu:692
   #2  0x00007f22294f7a50 in mxnet::kvstore::KVStoreNCCL::Reduce(std::vector<int, std::allocator<int> >, std::vector<std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> >, std::allocator<std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > > > const&, int, std::vector<mxnet::NDArray const*, std::allocator<mxnet::NDArray const*> >*) () from /home/ec2-user/src/anaconda2/lib/python2.7/site-packages/mxnet-1.0.0-py2.7.egg/mxnet/libmxnet.so
   #3  0x00007f222950423a in mxnet::kvstore::KVStoreNCCL::PushImpl(std::vector<int, std::allocator<int> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, int) () from /home/ec2-user/src/anaconda2/lib/python2.7/site-packages/mxnet-1.0.0-py2.7.egg/mxnet/libmxnet.so
   #4  0x00007f22294baba1 in mxnet::kvstore::KVStoreLocal::Push(std::vector<std::string, std::allocator<std::string> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, int) () from /home/ec2-user/src/anaconda2/lib/python2.7/site-packages/mxnet-1.0.0-py2.7.egg/mxnet/libmxnet.so
   #5  0x00007f22294377fb in MXKVStorePushEx () from /home/ec2-user/src/anaconda2/lib/python2.7/site-packages/mxnet-1.0.0-py2.7.egg/mxnet/libmxnet.so
   #6  0x00007f224231aec0 in ffi_call_unix64 () from /home/ec2-user/src/anaconda2/lib/python2.7/lib-dynload/../../libffi.so.6
   #7  0x00007f224231a87d in ffi_call () from /home/ec2-user/src/anaconda2/lib/python2.7/lib-dynload/../../libffi.so.6
   #8  0x00007f2242530736 in _ctypes_callproc () from /home/ec2-user/src/anaconda2/lib/python2.7/lib-dynload/_ctypes.so
   #9  0x00007f2242526a61 in PyCFuncPtr_call () from /home/ec2-user/src/anaconda2/lib/python2.7/lib-dynload/_ctypes.so
   #10 0x00007f224e029773 in PyObject_Call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #11 0x00007f224e0bd53b in PyEval_EvalFrameEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #12 0x00007f224e0c54e9 in PyEval_EvalCodeEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #13 0x00007f224e0c2482 in PyEval_EvalFrameEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #14 0x00007f224e0c54e9 in PyEval_EvalCodeEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #15 0x00007f224e04dfda in function_call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #16 0x00007f224e029773 in PyObject_Call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #17 0x00007f224e0be4d0 in PyEval_EvalFrameEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #18 0x00007f224e0c3dac in PyEval_EvalFrameEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #19 0x00007f224e0c54e9 in PyEval_EvalCodeEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #20 0x00007f224e04e0c7 in function_call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #21 0x00007f224e029773 in PyObject_Call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #22 0x00007f224e0be4d0 in PyEval_EvalFrameEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #23 0x00007f224e0c54e9 in PyEval_EvalCodeEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #24 0x00007f224e04dfda in function_call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #25 0x00007f224e029773 in PyObject_Call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #26 0x00007f224e03850d in instancemethod_call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #27 0x00007f224e029773 in PyObject_Call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #28 0x00007f224e082574 in slot_tp_call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #29 0x00007f224e029773 in PyObject_Call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #30 0x00007f224e0bd53b in PyEval_EvalFrameEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #31 0x00007f224e0c3dac in PyEval_EvalFrameEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #32 0x00007f224e0c54e9 in PyEval_EvalCodeEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #33 0x00007f224e04e0c7 in function_call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #34 0x00007f224e029773 in PyObject_Call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #35 0x00007f224e0be4d0 in PyEval_EvalFrameEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #36 0x00007f224e0c54e9 in PyEval_EvalCodeEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #37 0x00007f224e04dfda in function_call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #38 0x00007f224e029773 in PyObject_Call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #39 0x00007f224e03850d in instancemethod_call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #40 0x00007f224e029773 in PyObject_Call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #41 0x00007f224e082574 in slot_tp_call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #42 0x00007f224e029773 in PyObject_Call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #43 0x00007f224e0bd53b in PyEval_EvalFrameEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #44 0x00007f224e0c54e9 in PyEval_EvalCodeEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #45 0x00007f224e04e0c7 in function_call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #46 0x00007f224e029773 in PyObject_Call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #47 0x00007f224e0be4d0 in PyEval_EvalFrameEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #48 0x00007f224e0c54e9 in PyEval_EvalCodeEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #49 0x00007f224e04dfda in function_call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #50 0x00007f224e029773 in PyObject_Call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #51 0x00007f224e03850d in instancemethod_call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #52 0x00007f224e029773 in PyObject_Call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #53 0x00007f224e082574 in slot_tp_call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #54 0x00007f224e029773 in PyObject_Call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #55 0x00007f224e0bd53b in PyEval_EvalFrameEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #56 0x00007f224e0c54e9 in PyEval_EvalCodeEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #57 0x00007f224e04e0c7 in function_call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #58 0x00007f224e029773 in PyObject_Call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #59 0x00007f224e0be4d0 in PyEval_EvalFrameEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #60 0x00007f224e0c54e9 in PyEval_EvalCodeEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #61 0x00007f224e04dfda in function_call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #62 0x00007f224e029773 in PyObject_Call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #63 0x00007f224e03850d in instancemethod_call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #64 0x00007f224e029773 in PyObject_Call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #65 0x00007f224e082574 in slot_tp_call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #66 0x00007f224e029773 in PyObject_Call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #67 0x00007f224e0bd53b in PyEval_EvalFrameEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #68 0x00007f224e0c3dac in PyEval_EvalFrameEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #69 0x00007f224e0c3dac in PyEval_EvalFrameEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #70 0x00007f224e0c54e9 in PyEval_EvalCodeEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #71 0x00007f224e04e0c7 in function_call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #72 0x00007f224e029773 in PyObject_Call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #73 0x00007f224e03850d in instancemethod_call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #74 0x00007f224e029773 in PyObject_Call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #75 0x00007f224e0be4d0 in PyEval_EvalFrameEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #76 0x00007f224e0c54e9 in PyEval_EvalCodeEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #77 0x00007f224e04dfda in function_call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #78 0x00007f224e029773 in PyObject_Call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #79 0x00007f224e03850d in instancemethod_call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #80 0x00007f224e029773 in PyObject_Call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #81 0x00007f224e082254 in slot_tp_init () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #82 0x00007f224e07eb0b in type_call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #83 0x00007f224e029773 in PyObject_Call () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #84 0x00007f224e0bd53b in PyEval_EvalFrameEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #85 0x00007f224e0c54e9 in PyEval_EvalCodeEx () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #86 0x00007f224e0c570a in PyEval_EvalCode () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #87 0x00007f224e0de93d in run_mod () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #88 0x00007f224e0dfab8 in PyRun_FileExFlags () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #89 0x00007f224e0e0cd8 in PyRun_SimpleFileExFlags () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #90 0x00007f224e0f2d3c in Py_Main () from /home/ec2-user/src/anaconda2/bin/../lib/libpython2.7.so.1.0
   #91 0x00007f224d32fb05 in __libc_start_main (main=0x5567c5f66850 <main>, argc=3, argv=0x7ffe7888ded8, init=<optimized out>, fini=<optimized out>, 
       rtld_fini=<optimized out>, stack_end=0x7ffe7888dec8) at libc-start.c:269
   #92 0x00005567c5f6687f in _start ()
   ```
   
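
Only frames #0 to #5 in the backtrace above belong to NCCL and libmxnet.so; frames #6 to #90 are libffi, ctypes and CPython interpreter frames. A small sketch for pulling the native frames out of a saved gdb backtrace, assuming the frame-line layout shown above (the `NOISE` markers and the function name are illustrative choices, not part of any MXNet or gdb tooling):

```python
import re

# Matches gdb frame lines such as "#5  0x... in MXKVStorePushEx () from /path/libmxnet.so".
FRAME_RE = re.compile(r"^#(\d+)\s+(?:0x[0-9a-f]+ in )?([^\s(]+)")

# Substrings assumed to mark interpreter or FFI glue rather than MXNet/NCCL code.
NOISE = ("libpython", "_ctypes.so", "libffi")


def native_frames(backtrace):
    """Return (frame_number, symbol) pairs for frames outside the Python interpreter."""
    frames = []
    for line in backtrace.splitlines():
        line = line.strip()
        match = FRAME_RE.match(line)
        if match is None:
            continue  # not a frame line (gdb prompt, wrapped continuation, etc.)
        if any(marker in line for marker in NOISE):
            continue  # interpreter or FFI frame
        frames.append((int(match.group(1)), match.group(2)))
    return frames
```

Feeding it the text of a saved backtrace returns a list of (frame number, symbol) tuples, which makes the `ncclCommInitAll` to `commFree` path easier to spot.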
   ## Minimum reproducible example
   mxnet/tests/python/gpu/test_nccl.py
   
   ## Steps to reproduce
   
   1. Comment out the following line in mxnet/tests/python/gpu/test_nccl.py:
      `@unittest.skip("Test requires NCCL library installed and enabled during build")`
   2. Run the following command:
      `python tests/python/gpu/test_nccl.py`
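
As an alternative to editing the test file in step 1, the skip could be made conditional. This is only a sketch of the idea, not what test_nccl.py does: the `MXNET_TEST_NCCL` environment variable and the `TestNccl` class are hypothetical names introduced for illustration.

```python
import os
import unittest

# Hypothetical opt-in switch; MXNET_TEST_NCCL is not a variable MXNet defines.
NCCL_ENABLED = os.environ.get("MXNET_TEST_NCCL") == "1"


@unittest.skipUnless(NCCL_ENABLED, "Test requires NCCL library installed and enabled during build")
class TestNccl(unittest.TestCase):
    def test_placeholder(self):
        # Stand-in for the real NCCL push/pull test body.
        self.assertTrue(NCCL_ENABLED)
```

With this shape, running the file with `MXNET_TEST_NCCL=1` set would execute the test, and leaving it unset would report it as skipped.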
   
