
Posted to commits@tvm.apache.org by GitBox <gi...@apache.org> on 2020/07/03 00:09:18 UTC

[GitHub] [incubator-tvm] samskalicky opened a new pull request #5986: Fixes for GraphRuntime destruction

samskalicky opened a new pull request #5986:
URL: https://github.com/apache/incubator-tvm/pull/5986


   I've been hitting this issue when running tests: all tests pass, and then the process fails with a core dump as it starts to exit:
   ```
   pure virtual method called
   terminate called without an active exception
   Aborted (core dumped)
   
   #5  0x00007ffff11d9988 in __cxxabiv1::__cxa_pure_virtual ()
       at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libstdc++-v3/libsupc++/pure.cc:50
   #6  0x00007fff45589a82 in tvm::runtime::NDArray::Internal::DefaultDeleter (ptr_obj=0x55555754ece0)
       at /home/ubuntu/NeoMXNet/3rdparty/tvm/src/runtime/ndarray.cc:97
   #7  0x00007fff4557d439 in tvm::runtime::Object::DecRef (this=0x55555754ece0)
       at /home/ubuntu/NeoMXNet/3rdparty/tvm/include/tvm/runtime/object.h:833
   #8  0x00007fff455b2815 in tvm::runtime::ObjectPtr<tvm::runtime::Object>::reset (this=0x5555571c8c00)
       at /home/ubuntu/NeoMXNet/3rdparty/tvm/include/tvm/runtime/object.h:439
   #9  0x00007fff45598698 in tvm::runtime::ObjectPtr<tvm::runtime::Object>::~ObjectPtr (this=0x5555571c8c00, 
       __in_chrg=<optimized out>) at /home/ubuntu/NeoMXNet/3rdparty/tvm/include/tvm/runtime/object.h:388
   #10 0x00007fff4557d4aa in tvm::runtime::ObjectRef::~ObjectRef (this=0x5555571c8c00, __in_chrg=<optimized out>)
       at /home/ubuntu/NeoMXNet/3rdparty/tvm/include/tvm/runtime/object.h:511
   #11 0x00007fff4557df1e in tvm::runtime::NDArray::~NDArray (this=0x5555571c8c00, __in_chrg=<optimized out>)
       at /home/ubuntu/NeoMXNet/3rdparty/tvm/include/tvm/runtime/ndarray.h:42
   #12 0x00007fff455fafb3 in std::_Destroy<tvm::runtime::NDArray> (__pointer=0x5555571c8c00)
       at /usr/include/c++/5/bits/stl_construct.h:93
   #13 0x00007fff455edee1 in std::_Destroy_aux<false>::__destroy<tvm::runtime::NDArray*> (__first=0x5555571c8c00, 
       __last=0x5555571c8c10) at /usr/include/c++/5/bits/stl_construct.h:103
   #14 0x00007fff455dfa22 in std::_Destroy<tvm::runtime::NDArray*> (__first=0x5555571c8c00, __last=0x5555571c8c10)
       at /usr/include/c++/5/bits/stl_construct.h:126
   #15 0x00007fff455cd124 in std::_Destroy<tvm::runtime::NDArray*, tvm::runtime::NDArray> (__first=0x5555571c8c00, 
       __last=0x5555571c8c10) at /usr/include/c++/5/bits/stl_construct.h:151
   #16 0x00007fff455e0d81 in std::vector<tvm::runtime::NDArray, std::allocator<tvm::runtime::NDArray> >::~vector (
       this=0x55555752d2e8, __in_chrg=<optimized out>) at /usr/include/c++/5/bits/stl_vector.h:424
   #17 0x00007fff455e0ec8 in tvm::runtime::GraphRuntime::~GraphRuntime (this=0x55555752d130, 
       __in_chrg=<optimized out>) at /home/ubuntu/NeoMXNet/3rdparty/tvm/src/runtime/graph/graph_runtime.h:73
   #18 0x00007fff455e0fb8 in tvm::runtime::GraphRuntime::~GraphRuntime (this=0x55555752d130, 
       __in_chrg=<optimized out>) at /home/ubuntu/NeoMXNet/3rdparty/tvm/src/runtime/graph/graph_runtime.h:73
   ```
   
   It looks like there's a race condition in the shutdown sequence in TVM: an NDArray is being destructed after the DeviceAPI object has already been destructed, so when it calls FreeDataSpace to free the NDArray memory it runs into the “pure virtual method called” error.
   
   I added a destructor to the CUDADeviceAPI class (https://github.com/neo-ai/tvm/blob/dev/src/runtime/cuda/cuda_device_api.cc#L37) with a print statement and was able to confirm that the destructor was being called before the NDArray was destructed. This confirms the root cause, that the CUDA DeviceAPI was destructed before all the NDArrays were destructed (and their underlying memory freed).
   
   Basically, the issue is that the CUDADeviceAPI singleton is destructed before all GPU NDArrays are freed. The quick fix is to allow the CUDADeviceAPI singleton to be re-constructed after it has been destructed, so that it can be used to free the remaining GPU NDArrays.
   
   The DeviceAPIManager class (https://github.com/apache/incubator-tvm/blob/579da6b771584ff320b9c7edf635b681b2abd0ef/src/runtime/c_runtime_api.cc#L91) is a singleton that maintains a map of DeviceAPI objects, one for each context (CPU, GPU, etc). The Global API (https://github.com/apache/incubator-tvm/blob/579da6b771584ff320b9c7edf635b681b2abd0ef/src/runtime/c_runtime_api.cc#L107) is the static singleton “get_instance” function. The GetAPI API (https://github.com/apache/incubator-tvm/blob/579da6b771584ff320b9c7edf635b681b2abd0ef/src/runtime/c_runtime_api.cc#L112) returns the DeviceAPI object for a particular context type, which is looked up in the api_ map. If we reset the entries of the api_ array to nullptr upon destruction (e72b64b), each DeviceAPI object will be reconstructed. Upon reconstruction of the singleton CUDADeviceAPI class, we need to reset the static shared_ptr (3e50586) too.
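
The lookup-and-reconstruct behaviour described above can be illustrated with a small sketch. This is not the TVM code: `DeviceApiSketch`, `ApiRegistrySketch`, the enum values, and `Clear()` are hypothetical names invented for illustration, and the sketch only models the "empty slot is re-created on the next lookup" logic, not access to an object after its destructor has run.

```cpp
#include <array>
#include <cassert>
#include <memory>
#include <string>

// Hypothetical stand-in for a device API interface; not the real TVM class.
class DeviceApiSketch {
 public:
  virtual ~DeviceApiSketch() = default;
  virtual std::string Name() const = 0;
};

class CpuApiSketch : public DeviceApiSketch {
 public:
  std::string Name() const override { return "cpu"; }
};

class GpuApiSketch : public DeviceApiSketch {
 public:
  std::string Name() const override { return "gpu"; }
};

// Hypothetical device type indices used to address the registry slots.
enum DeviceTypeSketch { kCpuSketch = 0, kGpuSketch = 1, kNumTypesSketch = 2 };

// Hypothetical registry, loosely modeled on the description of DeviceAPIManager.
class ApiRegistrySketch {
 public:
  // Return the API for a device type, constructing it if the slot is empty.
  DeviceApiSketch* Get(int type) {
    assert(type >= 0 && type < kNumTypesSketch);
    if (!slots_[type]) {
      ++constructions_;
      if (type == kCpuSketch) {
        slots_[type].reset(new CpuApiSketch());
      } else {
        slots_[type].reset(new GpuApiSketch());
      }
    }
    return slots_[type].get();
  }

  // Models the teardown step: every slot goes back to nullptr, so a later
  // Get() builds a fresh object instead of using a destroyed one.
  void Clear() {
    for (auto& slot : slots_) slot.reset();
  }

  // Number of times any API object has been constructed.
  int constructions() const { return constructions_; }

 private:
  std::array<std::unique_ptr<DeviceApiSketch>, kNumTypesSketch> slots_;
  int constructions_ = 0;
};
```

In this sketch, calling `Get(kGpuSketch)` after `Clear()` constructs a second GPU API object, which is the behaviour the quick fix relies on to free the remaining arrays.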


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [incubator-tvm] tqchen commented on pull request #5986: Fixes for GraphRuntime destruction

Posted by GitBox <gi...@apache.org>.

tqchen commented on pull request #5986:
URL: https://github.com/apache/incubator-tvm/pull/5986#issuecomment-653288776


   Thanks @samskalicky . 
   
   I agree that the destruction order is an issue here. The fix, however, is a bit ad hoc. The root of the problem is the use of a static GraphRuntime that gets destructed.
   
   The best approach might be to ensure that the graph runtime is destructed at a well-defined point in time, rather than introducing the graph runtime as a static object.
   
   We could also try to allocate raw pointers for the device APIs and never destroy them (the resources will be de-allocated on unloading, so no explicit de-allocation is needed).
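
The "raw pointer that is never destroyed" idea can be sketched as follows. All names here (`DeviceApiSketch`, `HostApiSketch`, `GlobalHostApiSketch`, `BufferSketch`) are hypothetical, and host `malloc`/`free` stand in for device memory; the point is only that a heap-allocated singleton has no destructor call at exit, so late users can still reach it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical stand-in for a device API with virtual methods.
class DeviceApiSketch {
 public:
  virtual ~DeviceApiSketch() = default;
  virtual void* AllocDataSpace(std::size_t nbytes) = 0;
  virtual void FreeDataSpace(void* ptr) = 0;
};

// Host-memory implementation used only for illustration.
class HostApiSketch : public DeviceApiSketch {
 public:
  void* AllocDataSpace(std::size_t nbytes) override { return std::malloc(nbytes); }
  void FreeDataSpace(void* ptr) override {
    std::free(ptr);
    ++freed_;
  }
  int freed() const { return freed_; }

 private:
  int freed_ = 0;
};

// The static object here is only the pointer. The pointee is allocated once
// with new and intentionally never deleted, so no destructor runs for it
// during static destruction.
HostApiSketch* GlobalHostApiSketch() {
  static HostApiSketch* inst = new HostApiSketch();
  return inst;
}

// Hypothetical array-like owner that frees its memory through the global API.
class BufferSketch {
 public:
  explicit BufferSketch(std::size_t nbytes)
      : data_(GlobalHostApiSketch()->AllocDataSpace(nbytes)) {
    assert(data_ != nullptr);
  }
  ~BufferSketch() { GlobalHostApiSketch()->FreeDataSpace(data_); }
  BufferSketch(const BufferSketch&) = delete;
  BufferSketch& operator=(const BufferSketch&) = delete;

 private:
  void* data_;
};
```

The trade-off in this sketch is a deliberate one-time leak of the singleton, which the operating system reclaims when the process ends.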
   



[GitHub] [incubator-tvm] tqchen closed pull request #5986: Fixes for GraphRuntime destruction

Posted by GitBox <gi...@apache.org>.

tqchen closed pull request #5986:
URL: https://github.com/apache/incubator-tvm/pull/5986


   



[GitHub] [incubator-tvm] tqchen commented on pull request #5986: Fixes for GraphRuntime destruction

Posted by GitBox <gi...@apache.org>.

tqchen commented on pull request #5986:
URL: https://github.com/apache/incubator-tvm/pull/5986#issuecomment-676406806


   Closing this for now, as there are no further actionable items at the moment.



[GitHub] [incubator-tvm] tqchen commented on pull request #5986: Fixes for GraphRuntime destruction

Posted by GitBox <gi...@apache.org>.

tqchen commented on pull request #5986:
URL: https://github.com/apache/incubator-tvm/pull/5986#issuecomment-675124439


   Unfortunately, the device API encapsulation means we cannot simply make them static (other device APIs need virtual methods).



[GitHub] [incubator-tvm] tqchen commented on pull request #5986: Fixes for GraphRuntime destruction

Posted by GitBox <gi...@apache.org>.

tqchen commented on pull request #5986:
URL: https://github.com/apache/incubator-tvm/pull/5986#issuecomment-674175359


   The particular error message still seems to be due to the use of global state somewhere (perhaps an NDArray, given that the graph runtime is now resolved), perhaps on the Python side.



[GitHub] [incubator-tvm] samskalicky commented on pull request #5986: Fixes for GraphRuntime destruction

Posted by GitBox <gi...@apache.org>.

samskalicky commented on pull request #5986:
URL: https://github.com/apache/incubator-tvm/pull/5986#issuecomment-675137290


   Unfortunately, we're starting to see this problem in other frameworks as well. Here's PyTorch:
   ```
   #0  0x00007fff56b0ee60 in tvm::runtime::NDArray::Internal::DefaultDeleter(tvm::runtime::Object*) () from /home/ubuntu/anaconda3/envs/python3/lib/python3.6/site-packages/torch/lib/libtvm.so
   #1  0x00007fff56983f6b in std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, tvm::runtime::NDArray>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, tvm::runtime::NDArray> >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::~_Hashtable() () from /home/ubuntu/anaconda3/envs/python3/lib/python3.6/site-packages/torch/lib/libtvm.so
   #2  0x00007fff56afa604 in tvm::runtime::SimpleObjAllocator::Handler<tvm::runtime::MetadataModuleNode>::Deleter_(tvm::runtime::Object*) ()
      from /home/ubuntu/anaconda3/envs/python3/lib/python3.6/site-packages/torch/lib/libtvm.so
   #3  0x00007fff56b78d4e in tvm::runtime::GraphRuntime::~GraphRuntime() () from /home/ubuntu/anaconda3/envs/python3/lib/python3.6/site-packages/torch/lib/libtvm.so
   #4  0x00007fff56b79379 in tvm::runtime::SimpleObjAllocator::Handler<tvm::runtime::GraphRuntime>::Deleter_(tvm::runtime::Object*) ()
      from /home/ubuntu/anaconda3/envs/python3/lib/python3.6/site-packages/torch/lib/libtvm.so
   #5  0x00007fff8c1fbb43 in std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, tvm::runtime::Module>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, tvm::runtime::Module> >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::clear() ()
      from /home/ubuntu/anaconda3/envs/python3/lib/python3.6/site-packages/torch/lib/libtorch.so
   #6  0x00007fff8c1fbb5d in std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, tvm::runtime::Module>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, tvm::runtime::Module> >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::~_Hashtable()
       () from /home/ubuntu/anaconda3/envs/python3/lib/python3.6/site-packages/torch/lib/libtorch.so
   ```
   Maybe there's a better way to prevent the destruction of the DeviceAPI objects, such as a counter that ensures they aren't destructed before all the arrays that were allocated with them are freed.
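
One possible reading of the counter idea is plain reference counting: each array keeps a `std::shared_ptr` to the API that allocated it, so the API cannot be destroyed while any array is alive. The sketch below is an assumption about how that could look, not a proposal from the thread; `ApiSketch` and `ArraySketch` are hypothetical names and host memory stands in for device memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <memory>
#include <utility>

// Hypothetical device API that reports its own destruction through flags
// supplied by the caller, so the ordering can be observed.
class ApiSketch {
 public:
  ApiSketch(int* freed, bool* destroyed) : freed_(freed), destroyed_(destroyed) {}
  ~ApiSketch() { *destroyed_ = true; }
  void* Alloc(std::size_t nbytes) { return std::malloc(nbytes); }
  void Free(void* ptr) {
    std::free(ptr);
    ++*freed_;
  }

 private:
  int* freed_;
  bool* destroyed_;
};

// Hypothetical array that shares ownership of the API it was allocated with.
class ArraySketch {
 public:
  ArraySketch(std::shared_ptr<ApiSketch> api, std::size_t nbytes)
      : api_(std::move(api)), data_(nullptr) {
    assert(api_ != nullptr);
    data_ = api_->Alloc(nbytes);
  }
  // The destructor body runs before members are destroyed, so api_ is still
  // valid here; the reference count drops only afterwards.
  ~ArraySketch() { api_->Free(data_); }
  ArraySketch(const ArraySketch&) = delete;
  ArraySketch& operator=(const ArraySketch&) = delete;

 private:
  std::shared_ptr<ApiSketch> api_;
  void* data_;
};
```

With this arrangement the API object is destroyed when the last holder releases it, whether that is the global handle or the last array.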



[GitHub] [incubator-tvm] samskalicky commented on pull request #5986: Fixes for GraphRuntime destruction

Posted by GitBox <gi...@apache.org>.

samskalicky commented on pull request #5986:
URL: https://github.com/apache/incubator-tvm/pull/5986#issuecomment-674181057


   > the particular error message seems is still due to the use of global states(perhaps ndarray given that the graph rt is now resolved) somewhere(perhaps in the python),
   
   True, I'm running TVM inside a custom subgraph operator in MXNet. The subgraph operator is stateful and loads the GraphRuntime in its constructor.



[GitHub] [incubator-tvm] samskalicky commented on pull request #5986: Fixes for GraphRuntime destruction

Posted by GitBox <gi...@apache.org>.

samskalicky commented on pull request #5986:
URL: https://github.com/apache/incubator-tvm/pull/5986#issuecomment-653270573


   @tqchen for review, @zhiics @trevor-m FYI



[GitHub] [incubator-tvm] tqchen edited a comment on pull request #5986: Fixes for GraphRuntime destruction

Posted by GitBox <gi...@apache.org>.

tqchen edited a comment on pull request #5986:
URL: https://github.com/apache/incubator-tvm/pull/5986#issuecomment-675124439


   Unfortunately, the device API encapsulation means we cannot simply make them static (other device APIs need virtual methods). In this case I think we should update the MXNet subgraph API to avoid static state if possible, or simply avoid de-allocating the global state (by using new instead of creating a static instance).



[GitHub] [incubator-tvm] samskalicky commented on pull request #5986: Fixes for GraphRuntime destruction

Posted by GitBox <gi...@apache.org>.

samskalicky commented on pull request #5986:
URL: https://github.com/apache/incubator-tvm/pull/5986#issuecomment-675123247


   @tqchen the CPU/GPU device API classes don't seem to store any state. Can we just make these APIs static?



[GitHub] [incubator-tvm] samskalicky commented on pull request #5986: Fixes for GraphRuntime destruction

Posted by GitBox <gi...@apache.org>.

samskalicky commented on pull request #5986:
URL: https://github.com/apache/incubator-tvm/pull/5986#issuecomment-674158779


   After lots of testing over the past month, making the runtime non-static has definitely reduced the occurrence of the problem. But we are still seeing intermittent failures (more prevalent with some models):
   ```
   Segmentation fault: 11
   
   *** Error in `python': double free or corruption (!prev): 0x000055becd8c4460 ***
   ======= Backtrace: =========
   /lib/x86_64-linux-gnu/libc.so.6(+0x777f5)[0x7fd5a64827f5]
   /lib/x86_64-linux-gnu/libc.so.6(+0x8038a)[0x7fd5a648b38a]
   /lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7fd5a648f58c]
   /lib/x86_64-linux-gnu/libc.so.6(+0x3a035)[0x7fd5a6445035]
   /lib/x86_64-linux-gnu/libc.so.6(+0x3a055)[0x7fd5a6445055]
   /home/ubuntu/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x7fc3125)[0x7fd5498ea125]
   /lib/x86_64-linux-gnu/libc.so.6(+0x354c0)[0x7fd5a64404c0]
   /usr/local/cuda/lib64/libcudart.so.10.0(+0x1d9fe)[0x7fd4fc1909fe]
   /usr/local/cuda/lib64/libcudart.so.10.0(+0x2296b)[0x7fd4fc19596b]
   /usr/local/cuda/lib64/libcudart.so.10.0(cudaSetDevice+0x47)[0x7fd4fc1bd087]
   /home/ubuntu/anaconda3/lib/python3.7/site-packages/neomxnet/libdlr.so(_ZN3tvm7runtime13CUDADeviceAPI13FreeDataSpaceE9DLContextPv+0x3a)[0x7fd4eda8652a]
   ```



[GitHub] [incubator-tvm] samskalicky commented on pull request #5986: Fixes for GraphRuntime destruction

Posted by GitBox <gi...@apache.org>.

samskalicky commented on pull request #5986:
URL: https://github.com/apache/incubator-tvm/pull/5986#issuecomment-654382750


   Thanks for the quick reply @tqchen!
   
   Agreed, the proposed fix is ad hoc. I wanted to show a working solution to the problem as a starting point.
   
   I can try making the GraphRuntime object non-static so that it is destructed before the DeviceAPI, and see if that avoids the problem on my side.
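
A rough sketch of the non-static approach, under the assumption that the operator owns the runtime as an ordinary member: the runtime is then destroyed with the operator (or earlier, through an explicit release), while the API with static storage duration is still alive. `ApiSketch`, `RuntimeSketch`, `OperatorSketch`, and `Release()` are hypothetical names, not MXNet or TVM APIs.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical device API with static storage duration.
class ApiSketch {
 public:
  static ApiSketch& Global() {
    static ApiSketch inst;
    return inst;
  }
  void Free(int* ptr) {
    delete ptr;
    ++freed_;
  }
  int freed() const { return freed_; }

 private:
  int freed_ = 0;
};

// Hypothetical runtime whose buffers must be released through the API.
class RuntimeSketch {
 public:
  explicit RuntimeSketch(int num_buffers) {
    assert(num_buffers >= 0);
    for (int i = 0; i < num_buffers; ++i) buffers_.push_back(new int(i));
  }
  ~RuntimeSketch() {
    for (int* ptr : buffers_) ApiSketch::Global().Free(ptr);
  }
  RuntimeSketch(const RuntimeSketch&) = delete;
  RuntimeSketch& operator=(const RuntimeSketch&) = delete;

 private:
  std::vector<int*> buffers_;
};

// Hypothetical stateful operator. The runtime is a member, not a static, so
// its lifetime is bounded by the operator's lifetime.
class OperatorSketch {
 public:
  OperatorSketch() : runtime_(new RuntimeSketch(2)) {}
  // Explicit teardown hook; safe to call more than once.
  void Release() { runtime_.reset(); }

 private:
  std::unique_ptr<RuntimeSketch> runtime_;
};
```

This only helps if the operator itself is not kept alive by some other static or global object, which is the situation later comments in the thread point to.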



[GitHub] [incubator-tvm] tqchen commented on pull request #5986: Fixes for GraphRuntime destruction

Posted by GitBox <gi...@apache.org>.

tqchen commented on pull request #5986:
URL: https://github.com/apache/incubator-tvm/pull/5986#issuecomment-676409254


   It would be useful to do some exploration, dig further, and open a discuss thread about the details. For example, if we try to retain libtvm.so until PyTorch unloads, would the problem go away?



[GitHub] [incubator-tvm] samskalicky edited a comment on pull request #5986: Fixes for GraphRuntime destruction

Posted by GitBox <gi...@apache.org>.

samskalicky edited a comment on pull request #5986:
URL: https://github.com/apache/incubator-tvm/pull/5986#issuecomment-674181057


   > the particular error message seems is still due to the use of global states(perhaps ndarray given that the graph rt is now resolved) somewhere(perhaps in the python),
   
   True, I'm running TVM inside a custom subgraph operator in MXNet. The subgraph operator is stateful and loads the GraphRuntime in its constructor, so the DeviceAPI objects will be destructed before the runtime is.

