Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2018/06/15 22:18:32 UTC

[GitHub] leezu opened a new issue #11314: Embedding Backward (AddTakeGradLargeBatchCaller) non-deterministic nan values

leezu opened a new issue #11314: Embedding Backward (AddTakeGradLargeBatchCaller) non-deterministic nan values
URL: https://github.com/apache/incubator-mxnet/issues/11314
 
 
   ## Description
   
    The `AddTakeGradLargeBatchCaller` code path used during the backward pass of `Embedding` is broken and writes `nan` values at random positions in the weight gradient array.
   
   ## Environment info (Required)
    - Cuda 9.0 or Cuda 9.2 with the respective mxnet-cu90 / mxnet-cu92 prebuilt binaries (both v1.2 and the nightly builds are affected)
   - EC2 p3.2xlarge or p2.xlarge instance
   
   
    While it occurs only rarely with Cuda 9.0 or on a p2.xlarge, it occurs almost always with Cuda 9.2 on a p3.2xlarge.
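    
    A minimal sketch to confirm which build and device are in use (the printed values are illustrative):
    
    ``` python
    import mxnet as mx
    
    print(mx.__version__)                # e.g. '1.2.0' for the affected prebuilt binaries
    x = mx.nd.zeros((1,), ctx=mx.gpu())
    mx.nd.waitall()                      # forces execution; raises if no usable CUDA GPU
    print(x.context)                     # gpu(0)
    ```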
   
   ## Minimum reproducible example
   
    ``` python
    import mxnet as mx
    import numpy as np
    
    N = 50000  # vocabulary size
    ctx = mx.gpu()
    
    embedding = mx.gluon.nn.Embedding(N, 300)
    embedding.initialize(ctx=ctx)
    i = 0
    np.random.seed(1)
    # A fixed batch of indices; the exact same data is reused in every iteration.
    idx = mx.nd.array(np.random.randint(0, N, size=(1024, 160)), ctx=ctx)
    
    got_nan = False
    while True:
        i += 1
        with mx.autograd.record():
            emb_in = embedding(idx)
            loss = emb_in.sum()
        loss.backward()
    
        grad = embedding.weight.grad().asnumpy()
        if not np.all(np.isfinite(grad)):
            # Row and column indices of the non-finite entries in the weight gradient.
            nan_rows, nan_cols = np.where(~np.isfinite(grad))
            print(f'Got nan {i}\tRetrying with same data. '
                  f'(Affected indices: {nan_rows.tolist()}, {nan_cols.tolist()}).')
            got_nan = True
        else:
            if got_nan:  # We got nan before and it disappeared now
                print(f'nan disappeared in {i}..')
                break
    
        if i % 100 == 0:
            print(f'{i}')
    
    ```
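    
    For context, a rough size check suggests why this repro always takes the `AddTakeGradLargeBatchCaller` path (a sketch, assuming `grad_out` is the output gradient flattened to `(batch*seq_len, embed_dim)` and `grad_in` is the `(vocab, embed_dim)` weight gradient, matching the shape check in `indexing_op.h` patched below):
    
    ``` python
    # Shapes from the repro script above.
    batch, seq_len, embed_dim, vocab = 1024, 160, 300, 50000
    
    shape_out_prod = (batch * seq_len) * embed_dim  # 49,152,000
    shape_in_prod = vocab * embed_dim               # 15,000,000
    
    # Both products are far above the 16384 threshold in indexing_op.h, so
    # EmbeddingOpBackward dispatches to AddTakeGradLargeBatchCaller rather than AddTakeGrad.
    print(shape_out_prod < 16384 and shape_in_prod < 16384)  # False -> large-batch path
    ```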
   
   ## Steps to reproduce
   
    Run the above script with Cuda 9.2 and observe very frequent nan values:
   
   ``` 
   % python debug_embedding_nan.py
   Got nan 3       Retrying with same data. (Affected indices: [14721, 14721], [1, 2]).
   Got nan 4       Retrying with same data. (Affected indices: [20, 20, 39, 39, 18232, 18232], [257, 258, 1, 2, 1, 2]).
   Got nan 5       Retrying with same data. (Affected indices: [20, 20, 71, 33346, 38015], [257, 258, 258, 130, 130]).
   Got nan 6       Retrying with same data. (Affected indices: [20, 20], [257, 258]).
   nan disappeared in 7..
   % python debug_embedding_nan.py 
   Got nan 7       Retrying with same data. (Affected indices: [20, 20, 33, 71, 71, 71, 71, 71], [257, 258, 1, 1, 2, 129, 130, 258]).
   nan disappeared in 8..
   % python debug_embedding_nan.py
   Got nan 1       Retrying with same data. (Affected indices: [1489], [129]).
   Got nan 2       Retrying with same data. (Affected indices: [42581, 42581], [257, 258]).
   nan disappeared in 3..
   
   ```
   
    Run the above script with Cuda 9.0 and observe that nan values occur much more rarely:
   
   ``` 
   100
   200
   300
   400
   500
   600
   700
   800
   900
   1000
   1100
   1200
   1300
   1400
   Got nan 1461    Retrying with same data. (Affected indices: [3254], [2]).
   nan disappeared in 1462..
   
   ```
   
   
   ## What have you tried to solve it?
   
    1. Apply the following patch, which adds an `MXNET_FORCE_ADDTAKEGRAD` environment variable that forces `EmbeddingOpBackward` to always use `AddTakeGrad`:
   ``` patch
   From 3fd91f0078e70cf990ce1549081c03cfb50292ad Mon Sep 17 00:00:00 2001
   From: Leonard Lausen <le...@lausen.nl>
   Date: Fri, 15 Jun 2018 18:45:39 +0000
   Subject: [PATCH] MXNET_FORCE_ADDTAKEGRAD to disable
    AddTakeGradLargeBatchCaller
   
   If MXNET_FORCE_ADDTAKEGRAD is set, EmbeddingOpBackward will always use
   AddTakeGrad independently of gradient input and output shape
   ---
    src/operator/tensor/indexing_op.h | 6 +++++-
    1 file changed, 5 insertions(+), 1 deletion(-)
   
   diff --git a/src/operator/tensor/indexing_op.h b/src/operator/tensor/indexing_op.h
   index 87381960e..d3a1bdfd6 100644
   --- a/src/operator/tensor/indexing_op.h
   +++ b/src/operator/tensor/indexing_op.h
   @@ -598,7 +598,11 @@ void EmbeddingOpBackward(const nnvm::NodeAttrs& attrs,
            uint64_t shape_out_prod =
              static_cast<uint64_t>(grad_out.shape_[0])*
              static_cast<uint64_t>(grad_out.shape_[1]);
   -        if (shape_out_prod < (uint64_t)16384 && shape_in_prod < (uint64_t)16384) {
   +
   +        const char *type = getenv("MXNET_FORCE_ADDTAKEGRAD");
   +        const bool default_addtakegrad = (type == nullptr);
   +
   +        if (!default_addtakegrad || ( shape_out_prod < (uint64_t)16384 && shape_in_prod < (uint64_t)16384 )) {
              AddTakeGrad(grad_in, data, grad_out);
            } else {
              AddTakeGradLargeBatchCaller(ctx, grad_in, data, grad_out);
   --
   2.17.1
   
   ```
   
    2. Run the above script with `MXNET_FORCE_ADDTAKEGRAD=1` exported, as shown below.
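    
    For example (assuming the repro script above is saved as `debug_embedding_nan.py`, as in the logs above):
    
    ``` 
    % MXNET_FORCE_ADDTAKEGRAD=1 python debug_embedding_nan.py
    ```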
   
