Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2017/12/15 01:19:37 UTC

[GitHub] kalpitdixit opened a new issue #9079: multiple experiments on separate gpus get stuck

URL: https://github.com/apache/incubator-mxnet/issues/9079
 
 
   My machine has 8 GPUs, each running a separate but identical training run. After a few epochs of training, 7 of the runs get stuck, one by one.
   
   To narrow this down, I modified the forward() function in python/mxnet/executor.py as shown below. In the log files of the stuck runs, the last line is "aaaaa"; "bbbbb" is never printed, which leads me to conclude that the forward computation launched by module.forward(data, is_train=True) never completes.
   
   Each experiment runs normally for several epochs, but after ~10 epochs the runs start getting stuck. The stuck processes do not raise any errors and do not crash.
   
   # Modified Executor.forward() in python/mxnet/executor.py.
   # Assumes `import logging` is present at the top of the file.
   def forward(self, is_train=False, **kwargs):
       # Copy any keyword-argument arrays into the bound arguments.
       if len(kwargs) != 0:
           arg_dict = self.arg_dict
           for name, array in kwargs.items():
               if not isinstance(array, (NDArray, np.ndarray)):
                   raise ValueError('only accept keyword argument of NDArrays and numpy.ndarray')
               if name not in arg_dict:
                   raise TypeError('Unknown argument %s' % name)
               if arg_dict[name].shape != array.shape:
                   raise ValueError('Shape not match! Argument %s, need: %s, received: %s'
                                    % (name, str(arg_dict[name].shape), str(array.shape)))
               arg_dict[name][:] = array

       check_call(_LIB.MXExecutorForward(
           self.handle,
           ctypes.c_int(int(is_train))))
       logging.info('aaaaa')  # added: last line seen in the logs of stuck runs
       self.outputs[0].wait_to_read()  # added: block until the output is ready (I only have one output)
       logging.info('bbbbb')  # added: never printed in the stuck runs
       return self.outputs
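
The "aaaaa"/"bbbbb" prints show which call never returns, but not how long a healthy call takes or when a run stopped making progress. A minimal sketch of a reusable version of the same idea follows. `stall_watchdog` is a hypothetical helper, not part of MXNet; it only reports a stall and does not interrupt it. It assumes the blocking call releases the GIL so the timer thread can run, which should hold for foreign calls made through a `ctypes.CDLL` handle. `time.sleep` stands in for the blocking call here.

```python
import logging
import threading
import time
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")


@contextmanager
def stall_watchdog(label, timeout_s=60.0, on_stall=None):
    """Log a warning if the wrapped block runs longer than timeout_s seconds.

    Hypothetical helper for illustration. It never interrupts the block;
    it only reports, from a timer thread, that the block has not finished.
    """
    def _report():
        logging.warning("%s still running after %.2f s", label, timeout_s)
        if on_stall is not None:
            on_stall(label)

    timer = threading.Timer(timeout_s, _report)
    timer.daemon = True  # do not keep the process alive because of the timer
    start = time.monotonic()
    timer.start()
    try:
        yield
    finally:
        # Reached only if the block returns or raises; a true hang never gets here.
        timer.cancel()
        logging.info("%s finished in %.3f s", label, time.monotonic() - start)


# Stand-in for a slow blocking call: the warning fires before the block ends.
with stall_watchdog("demo blocking call", timeout_s=0.05):
    time.sleep(0.2)
```

In the modified forward() above, the same wrapper could surround the `self.outputs[0].wait_to_read()` line in place of the two `logging.info` calls, with a timeout well above the normal per-batch time.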
