Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2019/03/26 07:07:31 UTC

[GitHub] [incubator-mxnet] YutingZhang opened a new issue #14522: mx.nd.Custom conflicts with memory management
URL: https://github.com/apache/incubator-mxnet/issues/14522
 
 
   When training or running a large neural network with a CustomOp, MXNet can get stuck. My speculation is that MXNet deadlocks whenever memory management (e.g., releasing/reallocating GPU memory, or raising an "out of memory" error) is needed while a CustomOp is running.
   
   A minimal piece of code showing that CustomOp can deadlock with memory management (out-of-memory in this case):
   As expected, main(1) works.
   main(100) should give an "out of GPU memory" error; instead, it simply hangs.
   
   The real-world problem I ran into is not just the inability to raise an "out of memory" error. MXNet can apparently release and reallocate memory dynamically, but that also seems to deadlock with CustomOp, so my program, which does fit into GPU memory, can get stuck as well.
   
   
   ```python
   import mxnet as mx
   
   
   class MyMulMax(mx.operator.CustomOp):
   
       def __init__(self):
           super().__init__()
   
       def forward(self, is_train, req, in_data, out_data, aux):
           a, b = in_data[0:2]
           # Batch matrix product followed by a max over the last axis;
           # the intermediate c is the large allocation in this repro.
           c = mx.nd.batch_dot(a, b)
           d = mx.nd.max(c, axis=-1, keepdims=True)
           self.assign(out_data[0], req[0], d)
   
       def backward(self, req, out_grad, in_data, out_data, in_grad, aux):
           # Gradients are irrelevant for this repro; just write zeros.
           self.assign(in_grad[0], req[0], 0)
           self.assign(in_grad[1], req[1], 0)
   
   
   @mx.operator.register("MyMulMax")
   class MyMulMaxProp(mx.operator.CustomOpProp):
   
       def __init__(self):
           super().__init__()
   
       def list_arguments(self):
           return ['a', 'b']
   
       def list_outputs(self):
           return ['d']
   
       def infer_shape(self, in_shape):
           # Output keeps the first input's shape with the last axis reduced to 1.
           return in_shape, [list(in_shape[0][:-1] + [1])]
   
       def create_operator(self, ctx, shapes, dtypes):
           return MyMulMax()
   
   
   def main(n):
       with mx.Context(mx.gpu(0)):
           a = mx.nd.random.uniform(shape=(n, 6000, 1))
           b = mx.nd.random.uniform(shape=(n, 1, 7000))
           d = mx.nd.Custom(a, b, op_type="MyMulMax")
           d_np = d.asnumpy()
   
   
   if __name__ == "__main__":
       main(1)
       print("DONE -- 1")
       main(100)
       print("DONE -- 2")
   ```
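   To see why main(100) should hit an out-of-memory error, it helps to look at what the forward pass allocates. Below is a small numpy sketch of the same computation (the helper names mul_max and intermediate_bytes are mine, not part of the issue); the intermediate batch_dot result dominates memory.

   ```python
   import numpy as np

   def mul_max(a, b):
       # Same computation as MyMulMax.forward, with built-in numpy ops:
       # a batch matrix product followed by a max over the last axis.
       c = np.matmul(a, b)                   # shape (n, 6000, 7000) in the repro
       return c.max(axis=-1, keepdims=True)  # shape (n, 6000, 1)

   def intermediate_bytes(n, rows=6000, cols=7000, itemsize=4):
       # Size of the intermediate c in bytes (float32 = 4 bytes per element).
       return n * rows * cols * itemsize

   # main(1) needs ~168 MB for c; main(100) needs ~16.8 GB, far more than
   # a single GPU's memory, hence the expected out-of-memory error.
   print(intermediate_bytes(1))    # 168000000
   print(intermediate_bytes(100))  # 16800000000
   ```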
   
   Tested with a nightly build of mxnet-cu90mkl on Python 3.6.
