You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2017/11/23 13:25:14 UTC
[GitHub] mseeger opened a new issue #8799: Dangling outputs and dtype != float32: Gradient computation fails

mseeger opened a new issue #8799: Dangling outputs and dtype != float32: Gradient computation fails
URL: https://github.com/apache/incubator-mxnet/issues/8799
 
 
   Hello,
   
   please see the complete example below for full details.
   The problem is that if in the course of a computation, certain outputs of operators are not used downstream (I call this "dangling outputs"), then gradient computation fails IFF dtype != float32.
   
   I suppose this is because somewhere deep inside, output gradients constant 0 are created for these outputs, but this is done in float32 instead of the type of the output. I think this should be really easy to fix for somebody who knows where this happens (I don't!).
   
   You could argue that one should not ever produce dangling outputs. But at least for some operators, this is bound to happen. And in any case, things should not work for float32 and then fail for other dtype, and the user is left with a totally obscure error message (see below).
   
   The way around this problem is to make sure there are no dangling outputs, see example below. I'd argue that this is highly obscure, and users should not have to do this.
   
   Here the complete example. Note that this is *not* specific to linalg.gelqf. I am just using it as an example of an operator with >1 outputs. You can use any other op with >1 output and will get the same error (you can also use an op with 1 output, and then not use it, even though this is arguably pretty dumb).
   
   --- Code ---
   
   import mxnet as mx
   import numpy as np
   from mxnet import autograd
   
   def crit_func(a):
       q, l = mx.nd.linalg.gelqf(a)
       # Could as well write _, l = mx.nd.linalg.gelqf(a)
       # This circumvents the issue, but is highly obscure:
       #bogus = mx.nd.BlockGrad(q) * 0.0
       #return mx.nd.sum(l) + bogus
       return mx.nd.sum(l)
   
   ctx = mx.cpu()
   
   dtype = np.float32
   a32 = mx.nd.random.normal(shape=(2, 3), ctx=ctx, dtype=dtype)
   a32.attach_grad()
   with autograd.record():
       crit32 = crit_func(a32)
   print 'crit32 = %f' % crit32.asscalar()
   crit32.backward()
   print 'grad32\n', a32.grad.asnumpy()
   
   dtype = np.float64
   a64 = mx.nd.Cast(a32, dtype=dtype)
   a64.attach_grad()
   with autograd.record():
       crit64 = crit_func(a64)
   print 'crit64 = %f' % crit64.asscalar()
   crit64.backward()
   print 'grad64\n', a64.grad.asnumpy()
   
   --- Output ---
   
   crit32 = -0.633861
   grad32
   [[-0.55487823  0.57657677  0.63104177]
    [-1.29575324  0.44739273  0.3476553 ]]
   crit64 = -0.633862
   
   ---------------------------------------------------------------------------
   MXNetError                                Traceback (most recent call last)
   <ipython-input-3-2e10f88a3247> in <module>()
        24     crit64 = crit_func(a64)
        25 print 'crit64 = %f' % crit64.asscalar()
   ---> 26 crit64.backward()
        27 print 'grad64\n', a64.grad.asnumpy()
   
   /Users/matthis/MLBerlinHomes/matthis/virtenvs/linalg_ops/linalg_ops_env/lib/python2.7/site-packages/mxnet-0.12.1-py2.7.egg/mxnet/ndarray/ndarray.pyc in backward(self, out_grad, retain_graph, train_mode)
      1763             ctypes.c_int(train_mode),
      1764             ctypes.c_void_p(0),
   -> 1765             ctypes.c_void_p(0)))
      1766 
      1767     def tostype(self, stype):
   
   /Users/matthis/MLBerlinHomes/matthis/virtenvs/linalg_ops/linalg_ops_env/lib/python2.7/site-packages/mxnet-0.12.1-py2.7.egg/mxnet/base.pyc in check_call(ret)
       144     """
       145     if ret != 0:
   --> 146         raise MXNetError(py_str(_LIB.MXGetLastError()))
       147 
       148 if sys.version_info[0] < 3:
   
   MXNetError: [14:09:12] include/mxnet/./tensor_blob.h:216: Check failed: mshadow::DataType<DType>::kFlag == type_flag_ TBlob.get_with_shape: data type do not match specified type.Expected: 0 v.s. given 1
   
   Stack trace returned 10 entries:
   [bt] (0) 0   libmxnet.so                         0x000000010c8d8378 _ZN4dmlc15LogMessageFatalD2Ev + 40
   [bt] (1) 1   libmxnet.so                         0x000000010c9046d3 _ZNK5mxnet5TBlob4dptrIdEEPT_v + 211
   [bt] (2) 2   libmxnet.so                         0x000000010c904041 _ZNK5mxnet5TBlob14get_with_shapeIN7mshadow3cpuELi3EdEENS2_6TensorIT_XT0_ET1_EERKNS2_5ShapeIXT0_EEEPNS2_6StreamIS5_EE + 817
   [bt] (3) 3   libmxnet.so                         0x000000010d719ae7 _ZNK5mxnet5TBlob8FlatToKDIN7mshadow3cpuELi3EdEENS2_6TensorIT_XT0_ET1_EEPNS2_6StreamIS5_EE + 839
   [bt] (4) 4   libmxnet.so                         0x000000010d713248 _ZN5mxnet2op12LaOpBackwardIN7mshadow3cpuELi2ELi2ELi4ELi1ENS0_14gelqf_backwardEEEvRKN4nnvm9NodeAttrsERKNS_9OpContextERKNSt3__16vectorINS_5TBlobENSC_9allocatorISE_EEEERKNSD_INS_9OpReqTypeENSF_ISK_EEEESJ_ + 1464
   [bt] (5) 5   libmxnet.so                         0x000000010d8aed74 _ZZN5mxnet10imperative12PushFComputeERKNSt3__18functionIFvRKN4nnvm9NodeAttrsERKNS_9OpContextERKNS1_6vectorINS_5TBlobENS1_9allocatorISB_EEEERKNSA_INS_9OpReqTypeENSC_ISH_EEEESG_EEEPKNS3_2OpES6_RKNS_7ContextERKNSA_IPNS_6engine3VarENSC_ISY_EEEES12_RKNSA_INS_8ResourceENSC_IS13_EEEERKNSA_IPNS_7NDArrayENSC_IS19_EEEES1D_RKNSA_IjNSC_IjEEEESL_ENKUlNS_10RunContextENSW_18CallbackOnCompleteEE_clES1I_S1J_ + 532
   [bt] (6) 6   libmxnet.so                         0x000000010d8ae0f7 _ZNSt3__110__function6__funcIZN5mxnet10imperative12PushFComputeERKNS_8functionIFvRKN4nnvm9NodeAttrsERKNS2_9OpContextERKNS_6vectorINS2_5TBlobENS_9allocatorISD_EEEERKNSC_INS2_9OpReqTypeENSE_ISJ_EEEESI_EEEPKNS5_2OpES8_RKNS2_7ContextERKNSC_IPNS2_6engine3VarENSE_IS10_EEEES14_RKNSC_INS2_8ResourceENSE_IS15_EEEERKNSC_IPNS2_7NDArrayENSE_IS1B_EEEES1F_RKNSC_IjNSE_IjEEEESN_EUlNS2_10RunContextENSY_18CallbackOnCompleteEE_NSE_IS1M_EEFvS1K_S1L_EEclEOS1K_OS1L_ + 55
   [bt] (7) 7   libmxnet.so                         0x000000010d839cf2 _ZN5mxnet6engine11NaiveEngine9PushAsyncENSt3__18functionIFvNS_10RunContextENS0_18CallbackOnCompleteEEEENS_7ContextERKNS2_6vectorIPNS0_3VarENS2_9allocatorISB_EEEESG_NS_10FnPropertyEiPKc + 194
   [bt] (8) 8   libmxnet.so                         0x000000010d89d3c2 _ZN5mxnet10imperative12PushFComputeERKNSt3__18functionIFvRKN4nnvm9NodeAttrsERKNS_9OpContextERKNS1_6vectorINS_5TBlobENS1_9allocatorISB_EEEERKNSA_INS_9OpReqTypeENSC_ISH_EEEESG_EEEPKNS3_2OpES6_RKNS_7ContextERKNSA_IPNS_6engine3VarENSC_ISY_EEEES12_RKNSA_INS_8ResourceENSC_IS13_EEEERKNSA_IPNS_7NDArrayENSC_IS19_EEEES1D_RKNSA_IjNSC_IjEEEESL_ + 1186
   [bt] (9) 9   libmxnet.so                         0x000000010d89b501 _ZN5mxnet10Imperative8InvokeOpERKNS_7ContextERKN4nnvm9NodeAttrsERKNSt3__16vectorIPNS_7NDArrayENS8_9allocatorISB_EEEESG_RKNS9_INS_9OpReqTypeENSC_ISH_EEEENS_12DispatchModeENS_10OpStatePtrE + 1137
   
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services