Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2020/04/16 03:23:32 UTC

[GitHub] [incubator-mxnet] zleyk22 opened a new issue #18078: A possible bug when computing a gradient vector

URL: https://github.com/apache/incubator-mxnet/issues/18078
 
 
   ## Description
   I investigated the MXNet C++ code and noticed that the gradient of `prod` for an input array `[[x, y]]` is computed as `[[(x*y)/x, (x*y)/y]]`. Therefore, if `y` is zero, the second element becomes `0/0`, which evaluates to `nan`. (The correct gradient of `x*y` is `[y, x]`, so for `[[4, 0]]` it should be `[[0, 4]]`.) Is this a bug?
   
   ### Error Message
   No error message
   
   ## To Reproduce
   This is the Python program I run:
   ```python
   import mxnet as mx
   sym = mx.symbol.prod(data=mx.symbol.Variable("in"), axis=1)
   # Bind with an input containing a zero, plus a buffer for the gradient.
   executor = sym.bind(mx.cpu(0), {"in": mx.nd.array([[4, 0]])},
                       {"in": mx.nd.array([[9, 9]])})
   out = executor.forward()[0]
   executor.backward(mx.nd.ones(out.shape))
   print('-- Gradient:')
   print(executor.grad_dict['in'])
   ```
   I get this output:
   ```
   -- Gradient:
   [[ 0. nan]]
   <NDArray 1x2 @cpu(0)>
   ```
   ## Environment
   
   We recommend using our script for collecting the diagnostic information. Run the following command and paste the output below:
   ```
   curl --retry 10 -s https://raw.githubusercontent.com/dmlc/gluon-nlp/master/tools/diagnose.py | python
   ```
   ### paste outputs here
   ```
   ----------Python Info----------
   Version      : 3.7.7
   Compiler     : Clang 11.0.0 (clang-1100.0.33.17)
   Build        : ('default', 'Mar 10 2020 15:43:03')
   Arch         : ('64bit', '')
   ------------Pip Info-----------
   Version      : 20.0.2
   Directory    : /usr/local/lib/python3.7/site-packages/pip
   ----------MXNet Info-----------
   Version      : 1.6.0
   Directory    : /Users/zibi22/MXNet-1.6.0/python/mxnet
   Num GPUs     : 0
   Hashtag not found. Not installed from pre-built package.
   ----------System Info----------
   Platform     : Darwin-18.7.0-x86_64-i386-64bit
   system       : Darwin
   node         : zibi22maclap
   release      : 18.7.0
   version      : Darwin Kernel Version 18.7.0: Thu Jan 23 06:52:12 PST 2020; root:xnu-4903.278.25~1/RELEASE_X86_64
   ----------Hardware Info----------
   machine      : x86_64
   processor    : i386
   b'machdep.cpu.brand_string: Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz'
   b'machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX SMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C'
   b'machdep.cpu.leaf7_features: RDWRFSGS TSC_THREAD_OFFSET SGX BMI1 HLE AVX2 SMEP BMI2 ERMS INVPCID RTM FPU_CSDS MPX RDSEED ADX SMAP CLFSOPT IPT MDCLEAR TSXFA IBRS STIBP L1DF SSBD'
   b'machdep.cpu.extfeatures: SYSCALL XD 1GBPAGE EM64T LAHF LZCNT PREFETCHW RDTSCP TSCI'
   ----------Network Test----------
   Setting timeout: 10
   Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0756 sec, LOAD: 0.9306 sec.
   Timing for GluonNLP GitHub: https://github.com/dmlc/gluon-nlp, DNS: 0.0006 sec, LOAD: 0.8133 sec.
   Timing for GluonNLP: http://gluon-nlp.mxnet.io, DNS: 0.0008 sec, LOAD: 0.3825 sec.
   Timing for D2L: http://d2l.ai, DNS: 0.0006 sec, LOAD: 0.1266 sec.
   Timing for D2L (zh-cn): http://zh.d2l.ai, DNS: 0.0008 sec, LOAD: 0.1243 sec.
   Timing for FashionMNIST: https://repo.mxnet.io/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0009 sec, LOAD: 0.3131 sec.
   Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0731 sec, LOAD: 0.9831 sec.
   Error open Conda: https://repo.continuum.io/pkgs/free/, HTTP Error 403: Forbidden, DNS finished in 0.00061798095703125 sec.
   ```

[GitHub] [incubator-mxnet] sxjscience commented on issue #18078: A possible bug when computing a gradient vector

URL: https://github.com/apache/incubator-mxnet/issues/18078#issuecomment-614921875
 
 
   @zleyk22 I can think of two possible ways to solve this problem:
   1) Use two cumulative log-sums, a prefix and a suffix:
      - `[0, \log a_0, \log a_0 + \log a_1, \ldots, \log a_0 + \cdots + \log a_{n-2}]`,
      - `[\log a_1 + \cdots + \log a_{n-1}, \log a_2 + \cdots + \log a_{n-1}, \ldots, \log a_{n-1}, 0]`
   Adding these two element-wise gives, at position `i`, the log of the product of all elements except `a_i`; exponentiating then yields the gradient. I think we should take the log-sum approach to avoid the overflow/underflow problem of multiplying lots of numbers. (This is also the algorithm used to solve https://leetcode.com/problems/product-of-array-except-self/.)
   2) Detect zeros and give them special treatment: find the positions of the zeros and overwrite the gradient at those positions with the correct value.
   (Both approaches are sketched in the code below.)
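
   A minimal NumPy sketch of both ideas (the helper names are hypothetical and this is not MXNet code; the first variant uses cumulative products rather than log-sums, so zero and negative entries are handled directly):
   ```python
   import numpy as np

   def grad_prod_prefix_suffix(a):
       """Approach (1): the gradient of prod(a) w.r.t. a[i] is the product
       of all other elements, built from prefix/suffix cumulative products."""
       prefix = np.concatenate(([1.0], np.cumprod(a[:-1])))  # a[0]*...*a[i-1]
       tail = np.cumprod(a[::-1])[::-1]                      # a[i]*...*a[n-1]
       suffix = np.concatenate((tail[1:], [1.0]))            # a[i+1]*...*a[n-1]
       return prefix * suffix

   def grad_prod_zero_aware(a):
       """Approach (2): keep the out/a[i] division trick, but special-case zeros."""
       zeros = np.flatnonzero(a == 0)
       if zeros.size == 0:
           return np.prod(a) / a                # no zeros: the trick is safe
       grad = np.zeros_like(a)
       if zeros.size == 1:                      # exactly one zero: only its
           grad[zeros[0]] = np.prod(a[a != 0])  # position gets a nonzero grad
       return grad                              # two or more zeros: all grads 0

   a = np.array([4.0, 0.0])
   print(grad_prod_prefix_suffix(a), grad_prod_zero_aware(a))  # [0. 4.] [0. 4.]
   ```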

[GitHub] [incubator-mxnet] sxjscience commented on issue #18078: A possible bug when computing a gradient vector

URL: https://github.com/apache/incubator-mxnet/issues/18078#issuecomment-614406630
 
 
   @zleyk22 Yes, it's a bug, and it also exists in the numpy (`mx.np`) front-end; the expected gradient below is `[0. 1.]`.
   ```python
   import mxnet as mx
   mx.npx.set_np()
   a = mx.np.array([1,0])
   a.attach_grad()
   with mx.autograd.record():
       b = mx.np.prod(a)
       b.backward()
   print(a.grad)
   ```
   Output:
   ```
   [ 0. nan]
   ```

[GitHub] [incubator-mxnet] sxjscience commented on issue #18078: A possible bug when computing a gradient vector

URL: https://github.com/apache/incubator-mxnet/issues/18078#issuecomment-614766460
 
 
   @zleyk22 I think no one is currently looking at this issue. Would you like to try to solve it? Thanks for pointing out the problem!

[GitHub] [incubator-mxnet] zleyk22 commented on issue #18078: A possible bug when computing a gradient vector

URL: https://github.com/apache/incubator-mxnet/issues/18078#issuecomment-614769243
 
 
   Yes, I will try to fix it.
   

[GitHub] [incubator-mxnet] zleyk22 commented on issue #18078: A possible bug when computing a gradient vector

URL: https://github.com/apache/incubator-mxnet/issues/18078#issuecomment-614763343
 
 
   Xingjian,
   
   If it is a bug, then possibly somebody is already trying to fix it. Do you know such a person? I would like to contact them.
   
   If nobody is working on it, then I volunteer to fix it, although as I see it now it may not be an easy task.
   
   Zbigniew
   

[GitHub] [incubator-mxnet] matteosal commented on issue #18078: [Numpy] Gradient of np.prod is wrong when there are zeros in the array

URL: https://github.com/apache/incubator-mxnet/issues/18078#issuecomment-637637711


   @yzhliu 
   @zleyk22 and I spent some time investigating this and managed to track it down to [this line](https://github.com/apache/incubator-mxnet/blob/master/src/operator/tensor/broadcast_reduce_op.h#L966).
   `OP::Map(data[i], DType(out[out_idx]))` computes `DType(out[out_idx]) / data[i]`, which returns `nan` when `data[i]` is zero (our input is `mx.nd.array([[4, 0]])`).
   
   The logic behind this computation is that, as long as no zeros are involved, dividing the stored output `x*y*z` by e.g. `x` gives `y*z`, which is the desired gradient with respect to `x`. This trick breaks down when a zero is involved: in that case the code should recompute the entire product, leaving `x` out.
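
   A small NumPy illustration of the failing division trick versus a zero-safe recomputation (illustrative only, not the MXNet kernel):
   ```python
   import numpy as np

   a = np.array([4.0, 0.0])
   out = np.prod(a)                        # forward output: 0.0

   # Current trick: divide the stored product by each element.
   # When the element is zero this is 0/0, which yields nan.
   with np.errstate(invalid="ignore"):     # silence the 0/0 warning
       grad_div = out / a                  # -> [ 0. nan]

   # Zero-safe alternative: recompute the product leaving element i out.
   grad_safe = np.array([np.prod(np.delete(a, i)) for i in range(a.size)])
   print(grad_div, grad_safe)              # [ 0. nan] [0. 4.]
   ```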
   
   However, this infrastructure is quite rigid: it only allows re-using the previously computed output together with a single input element at a time. Fixing this bug would therefore require changes to a system that is shared by more than one operator. Given our limited familiarity with the codebase, we don't think we can proceed further without some guidance and help.
   
   Can you suggest a way forward? Thanks!


[GitHub] [incubator-mxnet] zleyk22 commented on issue #18078: [Numpy] Gradient of np.prod is wrong when there are zeros in the array

URL: https://github.com/apache/incubator-mxnet/issues/18078#issuecomment-628246366


   Yes and no. I have been looking into this code, but many parts of it are not clear to me, so I am not sure how to fix it in a reasonable way without introducing more bugs. I also investigated how this code came to be written: from what I saw, it was written by Eric Junyuan Xie at the end of 2016. I thought that by going back in time I would learn how the code was written and find some comments and explanations, but nothing suggested why the author chose that way of implementing it. So currently I am not going to fix it; I prefer to wait a while and figure out what to do next. Maybe it would be advisable to talk to Eric Junyuan Xie to learn more about his code?
   


[GitHub] [incubator-mxnet] yzhliu commented on issue #18078: [Numpy] Gradient of np.prod is wrong when there are zeros in the array

URL: https://github.com/apache/incubator-mxnet/issues/18078#issuecomment-628236372


   @zleyk22 are you working on this?

