Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2018/08/29 08:45:03 UTC

[GitHub] dzabraev opened a new issue #12393: Deadlock in save_checkpoint when using threading

URL: https://github.com/apache/incubator-mxnet/issues/12393
 
 
   ## Description
   To prepare batches, I create several (100) Python threads (I call them preparing-threads). Each of these threads prepares a batch and appends it to a queue. The main thread then takes a prepared batch and trains on it. Each preparing-thread calls `mx.nd.array`, and this causes a deadlock when the main thread calls `mx.model.save_checkpoint`. If I call `mx.nd.array` in the main thread instead, it works fine with no deadlock, but the training speed (samples/sec) is too low.
   
   Can anybody tell me whether this behaviour is a bug, or whether I should not use the `mx.nd.array` function in a non-main Python thread? I have read a lot of the MXNet documentation and couldn't find any restrictions on using `threading`.
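
For readers who want to picture the setup, here is a minimal sketch of the threading layout described above, using only the Python standard library. It is not the reporter's code: the names `prepare_batch`, `preparing_thread`, and `run`, the queue size, and the small thread count are all hypothetical, and the batch contents are placeholder integers. In the reported setup, `prepare_batch` is where each preparing-thread would call `mx.nd.array`, and the consuming loop is where the main thread would train and call `mx.model.save_checkpoint`.

```python
import queue
import threading

NUM_WORKERS = 4          # the report uses 100 threads; kept small for illustration
BATCHES_PER_WORKER = 3   # hypothetical value
BATCH_SIZE = 8           # hypothetical value


def prepare_batch(worker_id, step):
    # Placeholder for the real preprocessing. In the reported setup this is
    # where a preparing-thread would call mx.nd.array(...) on the batch data.
    return [worker_id * 1000 + step + i for i in range(BATCH_SIZE)]


def preparing_thread(worker_id, out_queue):
    # Each preparing-thread builds batches and appends them to the shared queue.
    for step in range(BATCHES_PER_WORKER):
        out_queue.put(prepare_batch(worker_id, step))


def run():
    batches = queue.Queue(maxsize=16)
    workers = [
        threading.Thread(target=preparing_thread, args=(i, batches))
        for i in range(NUM_WORKERS)
    ]
    for worker in workers:
        worker.daemon = True
        worker.start()

    total = NUM_WORKERS * BATCHES_PER_WORKER
    consumed = 0
    while consumed < total:
        batch = batches.get()  # main thread takes a prepared batch
        # A training step on `batch` would go here; in the report, the main
        # thread also calls mx.model.save_checkpoint at this level.
        consumed += 1

    for worker in workers:
        worker.join()
    return consumed
```

Calling `run()` drains every batch produced by the workers and returns the number consumed. The sketch only shows where the two MXNet calls sit relative to the threads; it does not reproduce the deadlock on its own.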
   
   
   ## Environment info
   I'm seeing the deadlock in the Python interface of MXNet 1.2.0 and 1.2.1.
   
   ```
   python diagnose.py
   ----------Python Info----------
   ('Version      :', '2.7.12')
   ('Compiler     :', 'GCC 5.4.0 20160609')
   ('Build        :', ('default', 'Dec  4 2017 14:50:18'))
   ('Arch         :', ('64bit', 'ELF'))
   ------------Pip Info-----------
   ('Version      :', '10.0.1')
   ('Directory    :', '/usr/local/lib/python2.7/dist-packages/pip')
   ----------MXNet Info-----------
   /usr/local/lib/python2.7/dist-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
     from ._conv import register_converters as _register_converters
   ('Version      :', '1.2.0')
   ('Directory    :', '/usr/local/lib/python2.7/dist-packages/mxnet')
   ('Commit Hash   :', '297c64fd2ee404612aa3ecc880b940fb2538039c')
   ----------System Info----------
   ('Platform     :', 'Linux-4.4.0-87-generic-x86_64-with-Ubuntu-16.04-xenial')
   ('system       :', 'Linux')
   ('node         :', '894febf28f08')
   ('release      :', '4.4.0-87-generic')
   ('version      :', '#110-Ubuntu SMP Tue Jul 18 12:55:35 UTC 2017')
   ----------Hardware Info----------
   ('machine      :', 'x86_64')
   ('processor    :', 'x86_64')
   Architecture:          x86_64
   CPU op-mode(s):        32-bit, 64-bit
   Byte Order:            Little Endian
   CPU(s):                88
   On-line CPU(s) list:   0-87
   Thread(s) per core:    2
   Core(s) per socket:    22
   Socket(s):             2
   NUMA node(s):          2
   Vendor ID:             GenuineIntel
   CPU family:            6
   Model:                 79
   Model name:            Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
   Stepping:              1
   CPU MHz:               2199.914
   CPU max MHz:           3600.0000
   CPU min MHz:           1200.0000
   BogoMIPS:              4403.10
   Virtualization:        VT-x
   Hypervisor vendor:     vertical
   Virtualization type:   full
   L1d cache:             32K
   L1i cache:             32K
   L2 cache:              256K
   L3 cache:              56320K
   NUMA node0 CPU(s):     0-21,44-65
   NUMA node1 CPU(s):     22-43,66-87
   Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdseed adx smap xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm arat pln pts
   ----------Network Test----------
   Setting timeout: 10
   Error open MXNet: https://github.com/apache/incubator-mxnet, <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:590)>, DNS finished in 0.0553529262543 sec.
   Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0011 sec, LOAD: 1.7442 sec.
   Error open FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:590)>, DNS finished in 0.284875869751 sec.
   Error open Conda: https://repo.continuum.io/pkgs/free/, <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:590)>, DNS finished in 0.0283861160278 sec.
   Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.9700 sec, LOAD: 0.7090 sec.
   Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 1.4856 sec, LOAD: 1.4765 sec.
   ```
   
   ## GDB backtrace
   
   ```
   #0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
   #1  0x00007f4ca484791c in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
   #2  0x00007f4b851826fe in std::condition_variable::wait<mxnet::engine::ThreadedEngine::WaitForVar(mxnet::Engine::VarHandle)::<lambda()> >(std::unique_lock<std::mutex> &, mxnet::engine::ThreadedEngine::<lambda()>) (this=0x21d3468,
       __lock=..., __p=...) at /usr/include/c++/5/condition_variable:98
   #3  0x00007f4b85182145 in mxnet::engine::ThreadedEngine::WaitForVar (this=0x21d3420, var=0x7f47b10c0160) at src/engine/threaded_engine.cc:397
   #4  0x00007f4b84a61cea in mxnet::NDArray::WaitToRead (this=0x7f4a4b4d0110) at include/mxnet/./ndarray.h:307
   #5  0x00007f4b84c1acae in mxnet::NDArray::Save (this=0x7f4a4b4d0110, strm=0x7f466fb685d0) at src/ndarray/ndarray.cc:1625
   #6  0x00007f4b84c56d81 in dmlc::serializer::SaveLoadClassHandler<mxnet::NDArray>::Write (strm=0x7f466fb685d0, data=...) at /learning/code/apache-mxnet-src-1.2.1-incubating/3rdparty/dmlc-core/include/dmlc/./serializer.h:83
   #7  0x00007f4b84c56816 in dmlc::serializer::IfThenElse<true, dmlc::serializer::SaveLoadClassHandler<mxnet::NDArray>, dmlc::serializer::UndefinedSerializerFor<mxnet::NDArray>, mxnet::NDArray>::Write (strm=0x7f466fb685d0, data=...)
       at /learning/code/apache-mxnet-src-1.2.1-incubating/3rdparty/dmlc-core/include/dmlc/./serializer.h:52
   #8  0x00007f4b84c56350 in dmlc::serializer::IfThenElse<false, dmlc::serializer::PODHandler<mxnet::NDArray>, dmlc::serializer::IfThenElse<true, dmlc::serializer::SaveLoadClassHandler<mxnet::NDArray>, dmlc::serializer::UndefinedSerializerFor<mxnet::NDArray>, mxnet::NDArray>, mxnet::NDArray>::Write (strm=0x7f466fb685d0, data=...) at /learning/code/apache-mxnet-src-1.2.1-incubating/3rdparty/dmlc-core/include/dmlc/./serializer.h:61
   #9  0x00007f4b84c554d1 in dmlc::serializer::Handler<mxnet::NDArray>::Write (strm=0x7f466fb685d0, data=...) at /learning/code/apache-mxnet-src-1.2.1-incubating/3rdparty/dmlc-core/include/dmlc/./serializer.h:248
   #10 0x00007f4b84c512fa in dmlc::serializer::ComposeVectorHandler<mxnet::NDArray>::Write (strm=0x7f466fb685d0, vec=std::vector of length 820, capacity 820 = {...})
       at /learning/code/apache-mxnet-src-1.2.1-incubating/3rdparty/dmlc-core/include/dmlc/./serializer.h:136
   #11 0x00007f4b84c4c923 in dmlc::serializer::IfThenElse<false, dmlc::serializer::PODVectorHandler<mxnet::NDArray>, dmlc::serializer::ComposeVectorHandler<mxnet::NDArray>, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > >::Write (strm=0x7f466fb685d0, data=std::vector of length 820, capacity 820 = {...}) at /learning/code/apache-mxnet-src-1.2.1-incubating/3rdparty/dmlc-core/include/dmlc/./serializer.h:61
   #12 0x00007f4b84c46e94 in dmlc::serializer::Handler<std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > >::Write (strm=0x7f466fb685d0, data=std::vector of length 820, capacity 820 = {...})
       at /learning/code/apache-mxnet-src-1.2.1-incubating/3rdparty/dmlc-core/include/dmlc/./serializer.h:277
   #13 0x00007f4b84c3b3ff in dmlc::Stream::Write<std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > > (this=0x7f466fb685d0, data=std::vector of length 820, capacity 820 = {...})
       at /learning/code/apache-mxnet-src-1.2.1-incubating/3rdparty/dmlc-core/include/dmlc/io.h:432
   #14 0x00007f4b84c1bdea in mxnet::NDArray::Save (fo=0x7f466fb685d0, data=std::vector of length 820, capacity 820 = {...}, names=std::vector of length 820, capacity 820 = {...}) at src/ndarray/ndarray.cc:1800
   #15 0x00007f4b851ea235 in MXNDArraySave (fname=0x7f4b56f1c9b4 "/learning/results/models/clean-testspeed/100r-0001.params", num_args=820, args=0x7f466fb7a930, keys=0x7f466fb78a10) at src/c_api/c_api.cc:297
   #16 0x00007f4cd056de40 in ffi_call_unix64 () from /usr/lib/x86_64-linux-gnu/libffi.so.6
   #17 0x00007f4cd056d8ab in ffi_call () from /usr/lib/x86_64-linux-gnu/libffi.so.6
   #18 0x00007f4cd077d3df in _ctypes_callproc () from /learning/code/dbgmxnet/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so
   #19 0x00007f4cd0781d82 in ?? () from /learning/code/dbgmxnet/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so
   #20 0x00000000004c15bf in PyEval_EvalFrameEx ()
   #21 0x00000000004c136f in PyEval_EvalFrameEx ()
   #22 0x00000000004c136f in PyEval_EvalFrameEx ()
   #23 0x00000000004b9ab6 in PyEval_EvalCodeEx ()
   #24 0x00000000004c16e7 in PyEval_EvalFrameEx ()
   #25 0x00000000004b9ab6 in PyEval_EvalCodeEx ()
   #26 0x00000000004c1e6f in PyEval_EvalFrameEx ()
   #27 0x00000000004b9ab6 in PyEval_EvalCodeEx ()
   #28 0x00000000004c16e7 in PyEval_EvalFrameEx ()
   #29 0x00000000004b9ab6 in PyEval_EvalCodeEx ()
   #30 0x00000000004c1e6f in PyEval_EvalFrameEx ()
   #31 0x00000000004b9ab6 in PyEval_EvalCodeEx ()
   #32 0x00000000004c1e6f in PyEval_EvalFrameEx ()
   #33 0x00000000004b9ab6 in PyEval_EvalCodeEx ()
   #34 0x00000000004eb30f in ?? ()
   #35 0x00000000004e5422 in PyRun_FileExFlags ()
   #36 0x00000000004e3cd6 in PyRun_SimpleFileExFlags ()
   #37 0x0000000000493ae2 in Py_Main ()
   #38 0x00007f4d48470830 in __libc_start_main (main=0x4934c0 <main>, argc=31, argv=0x7fff1312a3d8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fff1312a3c8) at ../csu/libc-start.c:291
   #39 0x00000000004933e9 in _start ()
   ```
   
   
   
