Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2018/09/03 03:23:10 UTC

[GitHub] loadwiki opened a new issue #12438: core dump in macosx using big model

URL: https://github.com/apache/incubator-mxnet/issues/12438
 
 
   (Brief description of the problem in no more than 2 sentences.)
   My C++ program sometimes dumps core in libmxnet.so when the model is as large as 200 MB;
   there is no core dump with a small model.
   ## Environment info (Required)
   iMac, macOS 10.13.6
   ## Build info (Required if built from source)
   git diff make/config.mk
   @@ -82,7 +82,7 @@ USE_NCCL_PATH = NONE
    # whether use opencv during compilation
    # you can disable it, however, you will not able to use
    # imbin iterator
   -USE_OPENCV = 1
   +USE_OPENCV = 0
   
    #whether use libjpeg-turbo for image decode without OpenCV wrapper
    USE_LIBJPEG_TURBO = 0
   @@ -90,7 +90,7 @@ USE_LIBJPEG_TURBO = 0
    USE_LIBJPEG_TURBO_PATH = NONE
   
    # use openmp for parallelization
   -USE_OPENMP = 1
   +USE_OPENMP = 0
   
   ## Error Message:
   (Paste the complete error message, including stack trace.)
   lldb main -c /cores/core.97762
   (lldb) target create "main" --core "/cores/core.97762"
   Traceback (most recent call last):
     File "<input>", line 1, in <module>
     File "/usr/local/Cellar/python@2/2.7.15/Frameworks/Python.framework/Versions/2.7/lib/python2.7/copy.py", line 52, in <module>
       import weakref
     File "/usr/local/Cellar/python@2/2.7.15/Frameworks/Python.framework/Versions/2.7/lib/python2.7/weakref.py", line 14, in <module>
       from _weakref import (
   ImportError: cannot import name _remove_dead_weakref
   Core file '/cores/core.97762' (x86_64) was loaded.
   (lldb) bt
   warning: could not execute support code to read Objective-C class data in the process. This may reduce the quality of type information available.
   * thread #1, stop reason = signal SIGSTOP
     * frame #0: 0x00007fff63e7da16 libsystem_kernel.dylib`__psynch_cvwait + 10
       frame #1: 0x00007fff64046589 libsystem_pthread.dylib`_pthread_cond_wait + 732
       frame #2: 0x00007fff61c81cb0 libc++.1.dylib`std::__1::condition_variable::wait(std::__1::unique_lock<std::__1::mutex>&) + 18
       frame #3: 0x000000010d6bc364 libmxnet.so`mxnet::engine::ThreadedEngine::WaitForVar(mxnet::engine::Var*) + 596
       frame #4: 0x000000010d7cd49a libmxnet.so`mxnet::NDArray::SyncCopyToCPU(void*, unsigned long) const + 954
       frame #5: 0x000000010d6ad0d4 libmxnet.so`MXPredGetOutput + 340
       frame #6: 0x000000010c1cac30 main`Infer(pred_hnd=0x00007fcba2f00000, image_data=size=1, data=size=1) at face_predict.cpp:296
       frame #7: 0x000000010c120e99 main`process_camera(model_path="../models/ncnn", camera=0x00007ffee3af5170, output_folder="./output/192.168.150.244", mainThread=true) at main.cpp:278
       frame #8: 0x000000010c125f42 main`main(argc=4, argv=0x00007ffee3af57b0) at main.cpp:484
       frame #9: 0x00007fff63d2d015 libdyld.dylib`start + 1
   (lldb) thread list
   Process 0 stopped
   * thread #1: tid = 0x0000, 0x00007fff63e7da16 libsystem_kernel.dylib`__psynch_cvwait + 10, stop reason = signal SIGSTOP
     thread #2: tid = 0x0001, 0x00007fff63e7da16 libsystem_kernel.dylib`__psynch_cvwait + 10, stop reason = signal SIGSTOP
     thread #3: tid = 0x0002, 0x00007fff63e7da16 libsystem_kernel.dylib`__psynch_cvwait + 10, stop reason = signal SIGSTOP
     thread #4: tid = 0x0003, 0x00007fff63e7da16 libsystem_kernel.dylib`__psynch_cvwait + 10, stop reason = signal SIGSTOP
     thread #5: tid = 0x0004, 0x00007fff63e7da16 libsystem_kernel.dylib`__psynch_cvwait + 10, stop reason = signal SIGSTOP
     thread #6: tid = 0x0005, 0x000000010c589a4a libmxnet.so`void mxnet::op::BatchNormForwardImpl<mshadow::cpu, float, float>(mshadow::Stream<mshadow::cpu>*, mxnet::OpContext const&, mxnet::op::BatchNormParam const&, std::__1::vector<mxnet::TBlob, std::__1::allocator<mxnet::TBlob> > const&, std::__1::vector<mxnet::OpReqType, std::__1::allocator<mxnet::OpReqType> > const&, std::__1::vector<mxnet::TBlob, std::__1::allocator<mxnet::TBlob> > const&, std::__1::vector<mxnet::TBlob, std::__1::allocator<mxnet::TBlob> > const&) + 1002, stop reason = signal SIGSTOP
     thread #7: tid = 0x0006, 0x00007fff63e7da16 libsystem_kernel.dylib`__psynch_cvwait + 10, stop reason = signal SIGSTOP
     thread #8: tid = 0x0007, 0x00007fff63e7da16 libsystem_kernel.dylib`__psynch_cvwait + 10, stop reason = signal SIGSTOP
     thread #9: tid = 0x0008, 0x00007fff63e7da16 libsystem_kernel.dylib`__psynch_cvwait + 10, stop reason = signal SIGSTOP
     thread #10: tid = 0x0009, 0x00007fff63e7da16 libsystem_kernel.dylib`__psynch_cvwait + 10, stop reason = signal SIGSTOP
     thread #11: tid = 0x000a, 0x00007fff63e7e28a libsystem_kernel.dylib`__workq_kernreturn + 10, stop reason = signal SIGSTOP
     thread #12: tid = 0x000b, 0x00007fff63e7e28a libsystem_kernel.dylib`__workq_kernreturn + 10, stop reason = signal SIGSTOP
     thread #13: tid = 0x000c, 0x00007fff63e7e28a libsystem_kernel.dylib`__workq_kernreturn + 10, stop reason = signal SIGSTOP
   ## Minimum reproducible example
   There is no obvious condition that causes the core dump.
   When I manually send a SIGSTOP signal to my main program, it simply stops as usual.
   I find it curious that the core dump shows no segmentation fault, abort, or other signal, only SIGSTOP.
   I first compiled the MXNet master branch, then switched to the release tag '1.2.1.rc1'; the same thing happens.
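
One way to check whether a fatal signal other than SIGSTOP is delivered before the process dies is to log the signal from inside the program itself. The following is a minimal sketch, not code from the program in this report: `FatalSignalName`, `OnFatalSignal`, and `InstallFatalSignalLogger` are hypothetical helper names, and the list of signals covered is an assumption. It uses only POSIX calls (`sigaction`, `signal`, `raise`, `write`).

```cpp
#include <signal.h>
#include <string.h>
#include <unistd.h>

// Hypothetical helper: map the fatal signals we watch to printable names.
static const char* FatalSignalName(int sig) {
  switch (sig) {
    case SIGSEGV: return "SIGSEGV";
    case SIGBUS:  return "SIGBUS";
    case SIGABRT: return "SIGABRT";
    case SIGFPE:  return "SIGFPE";
    case SIGILL:  return "SIGILL";
    default:      return "UNKNOWN";
  }
}

// Handler: report the signal on stderr, then restore the default action and
// re-raise so the process still terminates (and can still dump core).
static void OnFatalSignal(int sig) {
  static const char prefix[] = "fatal signal: ";
  const char* name = FatalSignalName(sig);
  ssize_t ignored = write(STDERR_FILENO, prefix, sizeof(prefix) - 1);
  ignored = write(STDERR_FILENO, name, strlen(name));
  ignored = write(STDERR_FILENO, "\n", 1);
  (void)ignored;
  signal(sig, SIG_DFL);
  raise(sig);
}

// Install the logging handler for each watched signal (assumed list).
static void InstallFatalSignalLogger() {
  const int watched[] = {SIGSEGV, SIGBUS, SIGABRT, SIGFPE, SIGILL};
  for (int sig : watched) {
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = OnFatalSignal;
    sigemptyset(&sa.sa_mask);
    sigaction(sig, &sa, nullptr);
  }
}
```

Calling `InstallFatalSignalLogger()` at the start of `main` would print the name of the first fatal signal to stderr before the default action runs. SIGSTOP itself cannot be caught, so it is deliberately left out of the list.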
   
   
