Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2018/08/03 19:48:17 UTC

[GitHub] David-Levinthal opened a new issue #12024: Speech_recognition crashes in threaded_engine.h:379: Error: compute_ctc_loss, stat = execution failed

URL: https://github.com/apache/incubator-mxnet/issues/12024
 
 
   Any insights into how to fix this would be greatly appreciated.
   ## Description
   Speech_recognition training on LibriSpeech crashes with `threaded_engine.h:379: Error: compute_ctc_loss, stat = execution failed`.
   ## Environment info (Required)
   Ubuntu 16.04, CUDA 9.2, cuDNN 7.1.4, NCCL 2.1.2, 4 NVIDIA V100s, CUDA_VISIBLE_DEVICES=0
   deepspeech.cfg set to use only 1 GPU
   
   ```
   Architecture:          x86_64
   CPU op-mode(s):        32-bit, 64-bit
   Byte Order:            Little Endian
   CPU(s):                28
   On-line CPU(s) list:   0-27
   Thread(s) per core:    1
   Core(s) per socket:    14
   Socket(s):             2
   NUMA node(s):          2
   Vendor ID:             GenuineIntel
   CPU family:            6
   Model:                 79
   Model name:            Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
   Stepping:              1
   CPU MHz:               2189.890
   CPU max MHz:           3500.0000
   CPU min MHz:           1200.0000
   BogoMIPS:              5189.99
   Virtualization:        VT-x
   L1d cache:             32K
   L1i cache:             32K
   L2 cache:              256K
   L3 cache:              35840K
   NUMA node0 CPU(s):     0-13
   NUMA node1 CPU(s):     14-27
   Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb invpcid_single intel_pt ibrs ibpb stibp kaiser tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdseed adx smap xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts
   ----------Python Info----------
   ('Version      :', '2.7.12')
   ('Compiler     :', 'GCC 5.4.0 20160609')
   ('Build        :', ('default', 'Dec  4 2017 14:50:18'))
   ('Arch         :', ('64bit', 'ELF'))
   ------------Pip Info-----------
   ('Version      :', '9.0.1')
   ('Directory    :', '/usr/local/lib/python2.7/dist-packages/pip')
   ----------MXNet Info-----------
   ('Version      :', '1.3.0')
   ('Directory    :', '/home/levinth/mxnet/python/mxnet')
   Hashtag not found. Not installed from pre-built package.
   ----------System Info----------
   ('Platform     :', 'Linux-4.4.0-130-generic-x86_64-with-Ubuntu-16.04-xenial')
   ('system       :', 'Linux')
   ('node         :', 'zt-gpu-lin')
   ('release      :', '4.4.0-130-generic')
   ('version      :', '#156-Ubuntu SMP Thu Jun 14 08:53:28 UTC 2018')
   ----------Hardware Info----------
   ('machine      :', 'x86_64')
   ('processor    :', 'x86_64')
   ----------Network Test----------
   Setting timeout: 10
   Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0267 sec, LOAD: 0.4810 sec.
   Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0077 sec, LOAD: 0.4934 sec.
   Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.2532 sec, LOAD: 0.2584 sec.
   Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0342 sec, LOAD: 0.3129 sec.
   Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0159 sec, LOAD: 0.1362 sec.
   Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.0608 sec, LOAD: 1.6993 sec.
   
   ```
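   For reference, the diagnostics above can be regenerated with the script from the MXNet issue template; a minimal sketch (the fetch requires network access, so it is shown commented out here):

   ```shell
   # URL of the diagnosis script, as given in the MXNet issue template.
   DIAGNOSE_URL=https://raw.githubusercontent.com/apache/incubator-mxnet/master/tools/diagnose.py
   # curl -sSL -o diagnose.py "$DIAGNOSE_URL"
   # python diagnose.py
   echo "diagnose script: $DIAGNOSE_URL"
   ```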
   
   Package used (Python/R/Scala/Julia):
   python
   
   ## Build info (Required if built from source)
   
    Compiler (gcc/clang/mingw/visual studio): gcc/nvcc
    `gcc --version`: gcc (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609
   
    MXNet commit hash (`git rev-parse HEAD`): 1fa04f2c9a7ba0c3273d080afa7fc993b927f114
   
   Build config:
   make -j USE_BLAS=openblas USE_CUDA=1 USE_CUDA_PATH=/usr/local/cuda USE_CUDNN=1 USE_NCCL=1 USE_NCCL_PATH=/usr/local/nccl
    config.mk is the default except for uncommenting the two lines that enable warp-ctc
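    For reference, the two commented warp-ctc lines in the stock make/config.mk template are likely the following (template default values; `WARPCTC_PATH` must point at an actual warp-ctc build, so verify against your checkout):

    ```make
    # Uncommented in make/config.mk to build the warp-ctc plugin
    WARPCTC_PATH = $(HOME)/warp-ctc
    MXNET_PLUGINS += plugin/warpctc/warpctc.mk
    ```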
   
   ## Error Message:
   [    INFO][2018/08/03 10:57:31.331] optimizer_params_dictionary = {"momentum":0.9}
   [    INFO][2018/08/03 10:57:31.332] clip_gradient = 100
   [    INFO][2018/08/03 10:57:31.332] weight_decay = 0.
   [    INFO][2018/08/03 10:57:31.332] 
   [    INFO][2018/08/03 10:57:59.214] ---------train---------
   [10:58:00] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:109: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
   terminate called after throwing an instance of 'dmlc::Error'
     what():  [10:58:03] src/engine/./threaded_engine.h:379: Error: compute_ctc_loss, stat = execution failed
   A fatal error occurred in asynchronous engine operation. If you do not know what caused this error, you can try set environment variable MXNET_ENGINE_TYPE to NaiveEngine and run with debugger (i.e. gdb). This will force all operations to be synchronous and backtrace will give you the series of calls that lead to this error. Remember to set MXNET_ENGINE_TYPE back to empty after debugging.
   
   Stack trace returned 9 entries:
   [bt] (0) /home/levinth/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x5b) [0x7f9dfe87440b]
   [bt] (1) /home/levinth/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7f9dfe874f78]
   [bt] (2) /home/levinth/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0xfa9) [0x7f9e015b3a59]
   [bt] (3) /home/levinth/mxnet/python/mxnet/../../lib/libmxnet.so(void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)0>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>*, std::shared_ptr<dmlc::ManualEvent> const&)+0xeb) [0x7f9e015c9d0b]
   [bt] (4) /home/levinth/mxnet/python/mxnet/../../lib/libmxnet.so(std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#4}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>&&)+0x4e) [0x7f9e015c9f7e]
   [bt] (5) /home/levinth/mxnet/python/mxnet/../../lib/libmxnet.so(std::thread::_Impl<std::_Bind_simple<std::function<void (std::shared_ptr<dmlc::ManualEvent>)> (std::shared_ptr<dmlc::ManualEvent>)> >::_M_run()+0x4a) [0x7f9e015b299a]
   [bt] (6) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7f9e4e10dc80]
   [bt] (7) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f9e581196ba]
   [bt] (8) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f9e57e4f41d]
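
    Per the engine's hint in the error message above, the crash can be re-run synchronously to get a usable backtrace; a minimal sketch (the train.py invocation is assumed from the deepspeech example and left commented):

    ```shell
    # Force all engine operations to run synchronously so a gdb backtrace
    # points at the operator that actually failed (per the hint above).
    export MXNET_ENGINE_TYPE=NaiveEngine
    # Pin to one GPU, matching the setup described in this report.
    export CUDA_VISIBLE_DEVICES=0
    # gdb --args python train.py --configfile deepspeech.cfg   # invocation assumed
    echo "engine=$MXNET_ENGINE_TYPE devices=$CUDA_VISIBLE_DEVICES"
    ```

    Remember to unset MXNET_ENGINE_TYPE after debugging, as the naive engine is much slower.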
   
   
   ## Minimum reproducible example
    The default deepspeech example with the LibriSpeech data set prepared per the instructions.
    train.py was edited to use:
   >     #summary_writer = SummaryWriter(tblog_dir)
   >     summary_writer = tf.summary.FileWriter(tblog_dir)
   
   ## Steps to reproduce
    See the attachment for installation and invocation notes:
   [mxnet_deepspeech_build_for_bug.txt](https://github.com/apache/incubator-mxnet/files/2258650/mxnet_deepspeech_build_for_bug.txt)
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services