Posted to issues@mxnet.apache.org by GitBox <gi...@apache.org> on 2021/02/09 05:35:00 UTC

[GitHub] [incubator-mxnet] access2rohit opened a new issue #19871: [RFC] Fixing CPU Memory Leaks in MXNet

access2rohit opened a new issue #19871:
URL: https://github.com/apache/incubator-mxnet/issues/19871


   ## Problem statement
   When run under valgrind, MXNet shows a number of memory leaks, both in the unit tests and when running inference.
   I have collected a list of these leaks. Some may be by design, while others may be actual leaks. The spreadsheet linked below gives a comprehensive list, categorized by type (Engine, memory, CachedOp, or Op):
   https://docs.google.com/spreadsheets/d/184kbSuhCVUTohxkDYxp_eMxcEhIKhoOY65VEkFVjDi0/edit?usp=sharing
   
   ## Proposed solutions
   Investigate which leaks are not by design and fix them.
   
   ## Setup 
   ```
   ## Build Python from source in debug mode
   cd $HOME
   wget https://www.python.org/ftp/python/3.6.12/Python-3.6.12.tgz   
   tar -xvzf Python-3.6.12.tgz
   cd Python-3.6.12
   ./configure --with-pydebug --without-pymalloc --with-valgrind --prefix /opt/debugpython/
   sudo make OPT=-g && sudo make install
   
   ## Add python valgrind suppression file
   vi $HOME/Python-3.6.12/Misc/valgrind-python.supp
   ## Then uncomment the PyObject_Free and PyObject_Realloc entries in the valgrind suppression file.
   
   
   ## Build Valgrind from source, since apt-get installs version 3.13,
   ## which gives errors with Python
   cd $HOME
   git clone git://sourceware.org/git/valgrind.git
   cd $HOME/valgrind
   ./autogen.sh
   ./configure --prefix=$(pwd)
   make
   sudo make install
   export PATH=$PATH:$HOME/valgrind/bin
   export VALGRIND_LIB="$HOME/valgrind/lib/valgrind"
   
   ## go to MXNet directory and run valgrind
   cd $HOME/workspace/incubator-mxnet
   # Build MXNet 
   # run valgrind on single unittest via pytest
   $HOME/valgrind/bin/valgrind --tool=memcheck --suppressions=$HOME/Python-3.6.12/Misc/valgrind-python.supp --leak-check=full --error-exitcode=1 /opt/debugpython/bin/python3 -m pytest -s --exitfirst --verbose --timeout=0 tests/python/unittest/test_numpy_op.py::test_np_sort
   ```
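
To run more than one test this way, the valgrind invocation from the setup could be wrapped in a small helper. The following is only a sketch, not part of the original setup: the function name `valgrind_pytest_command` and its `home` and `log_file` parameters are made up for illustration, the suppression file is assumed to be the one edited under `$HOME/Python-3.6.12/Misc`, and `--log-file` is a standard valgrind option that the original command does not use.

```python
import os
import shlex


def valgrind_pytest_command(test_id, home=None, log_file=None):
    """Build the argv list that runs one pytest node under memcheck.

    Hypothetical helper: paths mirror the setup steps, nothing here is MXNet API.
    """
    home = home or os.path.expanduser("~")
    cmd = [
        f"{home}/valgrind/bin/valgrind",
        "--tool=memcheck",
        # Assumption: the suppression file edited in the setup steps.
        f"--suppressions={home}/Python-3.6.12/Misc/valgrind-python.supp",
        "--leak-check=full",
        "--error-exitcode=1",
    ]
    if log_file is not None:
        # Optional addition: write the valgrind report to a file instead of stderr.
        cmd.append(f"--log-file={log_file}")
    cmd += [
        "/opt/debugpython/bin/python3",
        "-m", "pytest", "-s", "--exitfirst", "--verbose", "--timeout=0",
        test_id,
    ]
    return cmd


if __name__ == "__main__":
    argv = valgrind_pytest_command(
        "tests/python/unittest/test_numpy_op.py::test_np_sort"
    )
    # Print a copy-pasteable shell command; pass argv to subprocess.run to execute it.
    print(" ".join(shlex.quote(part) for part in argv))
```

Returning a list rather than a string keeps the command safe to hand to `subprocess.run` without shell quoting issues.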
   
   ## Sample Leak
   
   ```
   ==23789== 34,652 (240 direct, 34,412 indirect) bytes in 3 blocks are definitely lost in loss record 126,460 of 126,809
   ==23789==    at 0x4C3257A: operator new(unsigned long) (vg_replace_malloc.c:342)
   ==23789==    by 0x5D4B98C1: void dmlc::any::construct<mxnet::Imperative::DCInfo, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&>(std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&) (any.h:267)
   ==23789==    by 0x5D4AB003: mxnet::Imperative::DCInfo::Create(std::shared_ptr<nnvm::Node> const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&) (imperative.cc:681)
   ==23789==    by 0x5D4A7278: mxnet::Imperative::RecordDeferredCompute(nnvm::NodeAttrs&&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&) (imperative.cc:341)
   ==23789==    by 0x5D226911: mxnet::Invoke(nnvm::Op const*, nnvm::NodeAttrs*, int, mxnet::NDArray**, int*, mxnet::NDArray**) (utils.cc:95)
   ==23789==    by 0x5D22414C: mxnet::UFuncHelper(mxnet::NDArray*, mxnet::NDArray*, mxnet::NDArray*, mxnet::runtime::MXNetRetValue*, nnvm::Op const*) (ufunc_helper.cc:47)
   ==23789==    by 0x5D224B35: mxnet::UFuncHelper(mxnet::runtime::MXNetArgs, mxnet::runtime::MXNetRetValue*, nnvm::Op const*, nnvm::Op const*, nnvm::Op const*) (ufunc_helper.cc:152)
   ==23789==    by 0x5D194321: mxnet::{lambda(mxnet::runtime::MXNetArgs, mxnet::runtime::MXNetRetValue*)#1}::operator()(mxnet::runtime::MXNetArgs, mxnet::runtime::MXNetRetValue*) const (np_elemwise_broadcast_op.cc:36)
   ==23789==    by 0x5D196729: std::_Function_handler<void (mxnet::runtime::MXNetArgs, mxnet::runtime::MXNetRetValue*), mxnet::{lambda(mxnet::runtime::MXNetArgs, mxnet::runtime::MXNetRetValue*)#1}>::_M_invoke(std::_Any_data const&, mxnet::runtime::MXNetArgs&&, mxnet::runtime::MXNetRetValue*&&) (std_function.h:316)
   ==23789==    by 0x6980964F: std::function<void (mxnet::runtime::MXNetArgs, mxnet::runtime::MXNetRetValue*)>::operator()(mxnet::runtime::MXNetArgs, mxnet::runtime::MXNetRetValue*) const (std_function.h:706)
   ==23789==    by 0x698095ED: mxnet::runtime::PackedFunc::CallPacked(mxnet::runtime::MXNetArgs, mxnet::runtime::MXNetRetValue*) const (packed_func.h:942)
   ==23789==    by 0x698083E4: MXNetFuncCall (c_runtime_api.cc:64)
   ```
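
Reports like the one above run to many thousands of loss records, so some scripted triage may help when categorizing them. The sketch below is an assumption-laden illustration, not the method used to build the spreadsheet: it parses memcheck loss-record headers and stack frames from text lines and picks the first frame whose symbol starts with `mxnet::` as a grouping key. The names `parse_leak_records` and `first_mxnet_frame` are hypothetical.

```python
import re

# First line of a memcheck loss record, for example:
# "==23789== 34,652 (240 direct, 34,412 indirect) bytes in 3 blocks are definitely lost ..."
RECORD_RE = re.compile(
    r"^==\d+==\s+([\d,]+)\s+(?:\([\d,]+ direct, [\d,]+ indirect\)\s+)?"
    r"bytes in ([\d,]+) blocks? (?:is|are) (definitely|indirectly|possibly) lost"
)
# Stack frame line: "==PID==    at|by 0xADDR: symbol (file:line)"
FRAME_RE = re.compile(r"^==\d+==\s+(?:at|by) 0x[0-9A-Fa-f]+: (.*)$")


def parse_leak_records(lines):
    """Collect loss records (size, block count, kind, frames) from valgrind output."""
    records = []
    current = None
    for raw in lines:
        line = raw.strip()
        head = RECORD_RE.match(line)
        if head:
            current = {
                "bytes": int(head.group(1).replace(",", "")),
                "blocks": int(head.group(2).replace(",", "")),
                "kind": head.group(3),
                "frames": [],
            }
            records.append(current)
            continue
        frame = FRAME_RE.match(line)
        if frame:
            if current is not None:
                current["frames"].append(frame.group(1))
        else:
            # Any other line (blank "==PID==" separator, other errors) ends the record.
            current = None
    return records


def first_mxnet_frame(record):
    """Return the innermost frame whose symbol starts with "mxnet::", without arguments."""
    for symbol in record["frames"]:
        if symbol.startswith("mxnet::"):
            return symbol.split("(", 1)[0]
    return None
```

Applied to the sample leak, this grouping key would be `mxnet::Imperative::DCInfo::Create`. Whether that frame is the right level for assigning a leak to a category is a judgment call.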


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@mxnet.apache.org
For additional commands, e-mail: issues-help@mxnet.apache.org


[GitHub] [incubator-mxnet] szha commented on issue #19871: [RFC] Fixing CPU Memory Leaks in MXNet

Posted by GitBox <gi...@apache.org>.
szha commented on issue #19871:
URL: https://github.com/apache/incubator-mxnet/issues/19871#issuecomment-776041501


   Thanks for the proposal! I think the effort for fixing memory leaks will be very helpful. Would you also consider adding the memory leak checks in the CI so that we could trace any new leak?




[GitHub] [incubator-mxnet] access2rohit commented on issue #19871: [RFC] Fixing CPU Memory Leaks in MXNet

Posted by GitBox <gi...@apache.org>.
access2rohit commented on issue #19871:
URL: https://github.com/apache/incubator-mxnet/issues/19871#issuecomment-776233634


   > Thanks for the proposal! I think the effort for fixing memory leaks will be very helpful. Would you also consider adding the memory leak checks in the CI so that we could trace any new leak?
   
   Yes, that is also part of the plan. Let me update that.




[GitHub] [incubator-mxnet] access2rohit edited a comment on issue #19871: [RFC] Fixing CPU Memory Leaks in MXNet

Posted by GitBox <gi...@apache.org>.
access2rohit edited a comment on issue #19871:
URL: https://github.com/apache/incubator-mxnet/issues/19871#issuecomment-776233634


   > Thanks for the proposal! I think the effort for fixing memory leaks will be very helpful. Would you also consider adding the memory leak checks in the CI so that we could trace any new leak?
   
   Yes, that is also part of the plan; let me update that. Setting up the environment takes a lot of time even when it is automated, and the scripts take roughly 1.5x as long as the originals that run without valgrind and debug-mode Python. I was hoping to push a Docker image that has everything installed already and pull it whenever we need to run memory leak checks on a PR. I am not sure where to push such images for our CI or what permissions I need to do that. @szha

