Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2019/06/21 10:26:30 UTC

[GitHub] [incubator-mxnet] sulxxy opened a new issue #15310: program terminates when `backward()` does not complete yet

sulxxy opened a new issue #15310: program terminates when `backward()` does not complete yet
URL: https://github.com/apache/incubator-mxnet/issues/15310
 
 
   ## Description
   The program terminates before the `backward` pass has completed. No error message is shown.
   
   ## Environment info (Required)
   
   ```
   ----------Python Info----------
   Version      : 3.6.8
   Compiler     : GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)
   Build        : ('v3.6.8:3c6b436a57', 'Dec 24 2018 02:04:31')
   Arch         : ('64bit', '')
   ------------Pip Info-----------
   Version      : 19.0.3
   Directory    : /Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pip
   ----------MXNet Info-----------
   Version      : 1.4.1
   Directory    : /Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/mxnet
   Commit Hash   : 1a7199691f5cbc6012bb53eecbf884bed5ae6590
   ----------System Info----------
   Platform     : Darwin-18.6.0-x86_64-i386-64bit
   system       : Darwin
   node         : zhiwei.local
   release      : 18.6.0
   version      : Darwin Kernel Version 18.6.0: Thu Apr 25 23:16:27 PDT 2019; root:xnu-4903.261.4~2/RELEASE_X86_64
   ----------Hardware Info----------
   machine      : x86_64
   processor    : i386
   b'machdep.cpu.brand_string: Intel(R) Core(TM) i5-5257U CPU @ 2.70GHz'
   b'machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C'
   b'machdep.cpu.leaf7_features: SMEP ERMS RDWRFSGS TSC_THREAD_OFFSET BMI1 AVX2 BMI2 INVPCID SMAP RDSEED ADX IPT FPU_CSDS IBRS STIBP L1DF SSBD'
   b'machdep.cpu.extfeatures: SYSCALL XD 1GBPAGE EM64T LAHF LZCNT PREFETCHW RDTSCP TSCI'
   ----------Network Test----------
   Setting timeout: 10
   Error open MXNet: https://github.com/apache/incubator-mxnet, <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:852)>, DNS finished in 0.004141092300415039 sec.
   Error open Gluon Tutorial(en): http://gluon.mxnet.io, <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:852)>, DNS finished in 0.052942752838134766 sec.
   Error open Gluon Tutorial(cn): https://zh.gluon.ai, <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:852)>, DNS finished in 0.058538198471069336 sec.
   Error open FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:852)>, DNS finished in 0.03873300552368164 sec.
   Error open PYPI: https://pypi.python.org/pypi/pip, <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:852)>, DNS finished in 0.03042888641357422 sec.
   Error open Conda: https://repo.continuum.io/pkgs/free/, <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:852)>, DNS finished in 0.0011088848114013672 sec.
   ```
   
   Package used (Python/R/Scala/Julia):
   I'm using Python 3.6.8
   
   ## Error Message:
   The program terminates at an unexpected point, and no error message is shown.
   
   ## Minimum reproducible example
   ```python
   import mxnet as mx
   import logging
   import numpy as np
   
   
   class CustOperator(mx.operator.CustomOp):
       def __init__(self):
           super(CustOperator, self).__init__()
   
       def forward(self, is_train, req, in_data, out_data, aux):
           logging.debug('forward')
           for i in range(10):
               logging.debug('forward {}'.format(i))
               train_iter = mx.io.NDArrayIter(np.array([1,2,3]), np.array([2,2,2]))
           logging.debug('forward finished')
   
       def backward(self, req, out_grad, in_data, out_data, in_grad, aux):
           logging.debug('backward')
           for i in range(10):
               logging.debug('backward {}'.format(i))
               train_iter = mx.io.NDArrayIter(np.array([1,2,3]), np.array([2,2,2]))
           logging.debug('backward finished')
   
   
   @mx.operator.register("cust")
   class CustProp(mx.operator.CustomOpProp):
       def __init__(self):
           super(CustProp, self).__init__(need_top_grad=False)
   
       def list_arguments(self):
           return ['data', 'label', 'weight']
   
       def list_outputs(self):
           return ['output']
   
       def infer_type(self, in_type):
           dtype = in_type[0]
           return [dtype, dtype, dtype], [dtype], []
   
       def infer_shape(self, in_shape):
           data_shape = in_shape[0]
           label_shape = (in_shape[0][0],)
           output_shape = (1,)
           return [data_shape, label_shape, label_shape], [output_shape], []
   
       def create_operator(self, ctx, in_shapes, in_dtypes):
           return CustOperator()
   
   
   cust_prefix = 'cust'
   data_name = 'data'
   label_name = cust_prefix + '_label'
   logging.getLogger().setLevel(logging.DEBUG)  # show DEBUG-level log messages
   
   ctx = mx.cpu()
   
   x_train = np.array([1.314]).reshape(1, )
   y_train = np.array([3.126]).reshape(1, )
   batch_size = 1
   
   train_iter = mx.io.NDArrayIter(x_train, y_train, batch_size=batch_size, shuffle=True, data_name=data_name,
                                  label_name=label_name)
   
   dummy_data = mx.sym.var(data_name)
   dummy_label = mx.sym.var(label_name)
   cust = mx.symbol.Custom(data=dummy_data, label=dummy_label, name=cust_prefix, op_type=cust_prefix)
   
   def dummy_metric(label, pred):
       return 0
   
   metric = mx.metric.create(dummy_metric)
   cust_model = mx.mod.Module(symbol=cust, data_names=[data_name], label_names=[label_name])
   cust_model.fit(train_iter, eval_metric=metric, num_epoch=1)
   
   
   ```
   
   ## Steps to reproduce
   
   1. Save the code above as `cust_op.py`.
   2. Run `python3 cust_op.py`.
   
   The output looks like:
   ```console
   DEBUG:root:forward
   DEBUG:root:forward 0
   DEBUG:root:forward 1
   DEBUG:root:forward 2
   DEBUG:root:forward 3
   DEBUG:root:forward 4
   DEBUG:root:forward 5
   DEBUG:root:forward 6
   DEBUG:root:forward 7
   DEBUG:root:forward 8
   DEBUG:root:forward 9
   DEBUG:root:forward finished
   INFO:root:Epoch[0] Train-dummy_metric=0.000000
   INFO:root:Epoch[0] Time cost=0.017
   DEBUG:root:backward
   DEBUG:root:backward 0
   DEBUG:root:backward 1
   DEBUG:root:backward 2
   DEBUG:root:backward 3
   ```
   
   As the log shows, the program terminates before the `backward` pass has finished. The last iteration reached in `backward` is not always `3`; it varies from run to run.
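
In the log, the epoch summary is printed before `backward` starts, which suggests that the backward pass runs asynchronously with respect to the main thread. The sketch below illustrates that general pattern with the Python standard library only. It is an analogy under that assumption, not MXNet internals and not a confirmed diagnosis of this issue. The names `CHILD`, `run_child`, and the `exit`/`wait` modes are hypothetical and exist only for this illustration.

```python
import subprocess
import sys
import textwrap

# Child script: a daemon thread stands in for an asynchronous backward pass.
# The command-line argument decides whether the main thread waits for it.
CHILD = textwrap.dedent("""
    import sys, threading, time

    def backward():
        for i in range(10):
            print('backward {}'.format(i), flush=True)
            time.sleep(0.05)
        print('backward finished', flush=True)

    worker = threading.Thread(target=backward, daemon=True)
    worker.start()
    if sys.argv[1] == 'wait':
        worker.join()  # block until the background work is done
""")


def run_child(mode):
    """Run the child script in the given mode and return its stdout lines."""
    result = subprocess.run([sys.executable, '-c', CHILD, mode],
                            stdout=subprocess.PIPE, universal_newlines=True)
    return result.stdout.splitlines()


if __name__ == '__main__':
    # Without waiting, the process exits while the thread is still working.
    print(len(run_child('exit')), 'lines printed without waiting')
    # With an explicit wait, every iteration is logged.
    print(len(run_child('wait')), 'lines printed with waiting')
```

In `exit` mode the child process ends while the worker is still looping, so its log is cut off at an arbitrary iteration, much like the truncated `backward` output above. If the same mechanism applies here, an explicit synchronization point at the end of `cust_op.py`, such as `mx.nd.waitall()`, might let `backward` finish; whether it does is not verified in this report.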
   
