Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2019/01/23 06:29:33 UTC

[GitHub] stephenrawls opened a new issue #13968: Hybridization SequenceLast Bug

stephenrawls opened a new issue #13968: Hybridization SequenceLast Bug
URL: https://github.com/apache/incubator-mxnet/issues/13968
 
 
   ## Description
   I am experiencing a crash related to hybridization and the SequenceLast operator. It crashes consistently for me, but others on my team haven't been able to reproduce it.
   
   ## Environment info (Required)
   
   ```
   ----------Python Info----------
   Version      : 3.7.2
   Compiler     : Clang 9.0.0 (clang-900.0.39.2)
   Build        : ('default', 'Dec 27 2018 07:35:45')
   Arch         : ('64bit', '')
   ------------Pip Info-----------
   Version      : 18.1
   Directory    : /usr/local/lib/python3.7/site-packages/pip
   ----------MXNet Info-----------
   Version      : 1.3.0
   Directory    : /usr/local/lib/python3.7/site-packages/mxnet
   Commit Hash   : b3be92f4a48bce62a5a8424271871c2f81c8f7f1
   ----------System Info----------
   Platform     : Darwin-16.7.0-x86_64-i386-64bit
   system       : Darwin
   node         : dca90487206c.ant.amazon.com
   release      : 16.7.0
   version      : Darwin Kernel Version 16.7.0: Sun Oct 28 22:30:19 PDT 2018; root:xnu-3789.73.27~1/RELEASE_X86_64
   ----------Hardware Info----------
   machine      : x86_64
   processor    : i386
   b'machdep.cpu.extfeatures: SYSCALL XD 1GBPAGE EM64T LAHF LZCNT PREFETCHW RDTSCP TSCI'
   b'machdep.cpu.leaf7_features: SMEP ERMS RDWRFSGS TSC_THREAD_OFFSET BMI1 HLE AVX2 BMI2 INVPCID RTM SMAP RDSEED ADX IPT SGX FPU_CSDS MPX CLFSOPT'
   b'machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX SMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C'
   b'machdep.cpu.brand_string: Intel(R) Core(TM) i7-7660U CPU @ 2.50GHz'
   ----------Network Test----------
   Setting timeout: 10
   Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0341 sec, LOAD: 0.6150 sec.
   Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0352 sec, LOAD: 0.2739 sec.
   Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.0428 sec, LOAD: 0.2802 sec.
   Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0420 sec, LOAD: 0.3961 sec.
   Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0305 sec, LOAD: 0.8088 sec.
   Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0337 sec, LOAD: 0.1836 sec.
   ```
   
   I'm using Python
   
   
   Unfortunately I don't have a minimal reproducible example yet. Every time I try, the bug does not occur in my condensed example.
   
   The problem I have is that I am seeing the following error message:
   ```
    File ".../lib/python3.4/site-packages/mxnet/base.py", line 251, in check_call
       raise MXNetError(py_str(_LIB.MXGetLastError()))
   mxnet.base.MXNetError: [23:04:29] .../src/src/imperative/imperative_utils.cc:58: Check failed: !ndinputs.back()->is_none() node_264 0
   Stack trace returned 10 entries:
   [bt] (0) .../lib/libmxnet.so(dmlc::StackTrace()+0x3d) [0x7ff375cd4e0d]
   [bt] (1) .../lib/libmxnet.so(mxnet::imperative::RunGraph(bool, nnvm::IndexedGraph const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> >, unsigned long, unsigned long, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> >&&, std::vector<unsigned int, std::allocator<unsigned int> >&&, std::vector<mxnet::OpStatePtr, std::allocator<mxnet::OpStatePtr> >*, std::vector<mxnet::DispatchMode, std::allocator<mxnet::DispatchMode> > const&, bool)+0x5a4) [0x7ff375e4a924]
   [bt] (2) .../lib/libmxnet.so(mxnet::Imperative::Backward(std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, bool, bool, bool)+0x22fc) [0x7ff375e3969c]
   [bt] (3) .../lib/libmxnet.so(MXAutogradBackwardEx+0x2e0) [0x7ff375cfe640]
   ```
   
   When I print out the symbolic debug string for my network and go to "node_264" it leads me to the following section of my model code:
   
   ```
           last_real_tag = mx.nd.SequenceLast(
               tags_t, sequence_length=mx.nd.cast(valid_lengths,dtype=tags_t.dtype), use_sequence_length=True, axis=0)
   ```
   
   The things that I think are important about this are:
     (1) The tags_t array is an integer sequence that we generate from ground truth data. So this SequenceLast call shouldn't be trying to flow any gradients back through it.  (Note that the stack trace is in the backward pass)
     (2) The model we run is a HybridBlock, but this particular code is part of our loss function, which is not a HybridBlock. It takes the output from the model, combines it with the ground truth values from this SequenceLast operation, and calculates our loss.
     (3) Everything worked fine when our model was a regular Gluon Block. It also works fine as a HybridBlock as long as we don't hybridize() it. But when I do hybridize() the model, I consistently see this crash (although other team members have yet to run into it).
   
   
   Our current workaround is to add the following statement directly after the SequenceLast operation:
   ```
   last_real_tag = last_real_tag.detach()
   ```
   
   When this is in place I never encounter the crash I highlighted above.
   
   Additionally, I have another, hackier workaround: instead of SequenceLast, I use a combination of `Take` and `diag` to perform the same function. With that in place I also never see the crash.
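
To illustrate why `Take` plus `diag` can stand in for SequenceLast with `axis=0`, here is a minimal sketch. It uses NumPy as a stand-in rather than the MXNet operators, so it only demonstrates the indexing logic, not the hybridization behavior. The function names and the sample `tags_t` / `valid_lengths` values are hypothetical and not taken from the actual model.

```python
import numpy as np


def sequence_last(data, lengths):
    """Reference behavior assumed for SequenceLast with axis=0:
    for each batch column b, pick data[lengths[b] - 1, b]."""
    idx = lengths.astype(np.int64) - 1
    return data[idx, np.arange(data.shape[1])]


def sequence_last_take_diag(data, lengths):
    """Same result via take + diag (hypothetical helper).

    take() gathers one full row per batch element, giving a
    (batch, batch) matrix; entry [i, i] of that matrix is the
    last valid step of sequence i, so diag() extracts the answer.
    """
    idx = lengths.astype(np.int64) - 1
    picked = np.take(data, idx, axis=0)  # shape: (batch, batch)
    return np.diag(picked)


# Hypothetical integer tag sequences, laid out as (time, batch).
tags_t = np.array([[1, 4, 7],
                   [2, 5, 8],
                   [3, 6, 9]])
valid_lengths = np.array([3, 1, 2])

print(sequence_last(tags_t, valid_lengths))
print(sequence_last_take_diag(tags_t, valid_lengths))
```

Note that the take + diag route builds an intermediate (batch, batch) matrix, so it costs more memory than a direct gather as the batch size grows.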
   
   I'm happy to provide further info as needed. Apologies for not being able to condense this down into a reproducible crash.
