leezu opened a new issue #11314: Embedding Backward (AddTakeGradLargeBatchCaller) non-deterministic nan values
URL: https://github.com/apache/incubator-mxnet/issues/11314
## Description
The `AddTakeGradLargeBatchCaller` operator, called during the backward pass of `Embedding`, is broken and produces `nan` values at random positions in the gradient array.
## Environment info (Required)
- CUDA 9.0 or CUDA 9.2 with the respective mxnet-cu90 / mxnet-cu92 prebuilt binaries (both v1.2 and the nightly builds are affected)
- EC2 p3.2xlarge or p2.xlarge instance

While the issue occurs only rarely with CUDA 9.0 or on p2.xlarge, it occurs almost always with CUDA 9.2 on p3.2xlarge.
## Minimum reproducible example
``` python
import mxnet as mx
import numpy as np

N = 50000
ctx = mx.gpu()
embedding = mx.gluon.nn.Embedding(N, 300)
embedding.initialize(ctx=ctx)
i = 0
np.random.seed(1)
idx = mx.nd.array(np.random.randint(0, N, size=(1024, 160)), ctx=ctx)
got_nan = False
while True:
    i += 1
    with mx.autograd.record():
        emb_in = embedding(idx)
        loss = emb_in.sum()
    loss.backward()
    grad = embedding.weight.grad().asnumpy()
    if not np.all(np.isfinite(grad)):
        nan_rows, nan_cols = np.where(~np.isfinite(grad))
        print(f'Got nan {i}\tRetrying with same data. '
              f'(Affected indices: {nan_rows.tolist()}, {nan_cols.tolist()}).')
        got_nan = True
    else:
        if got_nan:  # We got nan before and it disappeared now
            print(f'nan disappeared in {i}..')
            break
    if i % 100 == 0:
        print(f'{i}')
```
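Note that because the loss is a plain sum, the correct gradient is fully determined: row `r` of `embedding.weight.grad()` should equal the number of times `r` occurs in `idx`, broadcast across all 300 columns. A minimal deterministic check along those lines (my sketch, not part of the original report; it assumes the variables from the script above are in scope) would be:

``` python
# Sketch of a deterministic gradient check: with loss = emb_in.sum(),
# d(loss)/d(weight[r, c]) is simply the occurrence count of index r in
# idx, identical for every column c.
idx_np = idx.asnumpy().astype(np.int64)
expected = np.bincount(idx_np.ravel(), minlength=N).astype(np.float32)
grad = embedding.weight.grad().asnumpy()
assert np.allclose(grad, expected[:, None]), 'gradient mismatch'
```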
## Steps to reproduce
Run the above script with CUDA 9.2 and observe very frequent `nan` values:
```
% python debug_embedding_nan.py
Got nan 3 Retrying with same data. (Affected indices: [14721, 14721], [1, 2]).
Got nan 4 Retrying with same data. (Affected indices: [20, 20, 39, 39, 18232, 18232], [257, 258, 1, 2, 1, 2]).
Got nan 5 Retrying with same data. (Affected indices: [20, 20, 71, 33346, 38015], [257, 258, 258, 130, 130]).
Got nan 6 Retrying with same data. (Affected indices: [20, 20], [257, 258]).
nan disappeared in 7..
% python debug_embedding_nan.py
Got nan 7 Retrying with same data. (Affected indices: [20, 20, 33, 71, 71, 71, 71, 71], [257, 258, 1, 1, 2, 129, 130, 258]).
nan disappeared in 8..
% python debug_embedding_nan.py
Got nan 1 Retrying with same data. (Affected indices: [1489], [129]).
Got nan 2 Retrying with same data. (Affected indices: [42581, 42581], [257, 258]).
nan disappeared in 3..
```
Run the above script with CUDA 9.0 and observe that `nan` values occur only rarely:
```
100
200
300
400
500
600
700
800
900
1000
1100
1200
1300
1400
Got nan 1461 Retrying with same data. (Affected indices: [3254], [2]).
nan disappeared in 1462..
```
## What have you tried to solve it?
1. Apply the following patch, which adds an `MXNET_FORCE_ADDTAKEGRAD` environment variable to force `AddTakeGrad` and thereby disable `AddTakeGradLargeBatchCaller`:
``` patch
From 3fd91f0078e70cf990ce1549081c03cfb50292ad Mon Sep 17 00:00:00 2001
From: Leonard Lausen <le...@lausen.nl>
Date: Fri, 15 Jun 2018 18:45:39 +0000
Subject: [PATCH] MXNET_FORCE_ADDTAKEGRAD to disable
 AddTakeGradLargeBatchCaller

If MXNET_FORCE_ADDTAKEGRAD is set, EmbeddingOpBackward will always use
AddTakeGrad independently of gradient input and output shape
---
 src/operator/tensor/indexing_op.h | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/src/operator/tensor/indexing_op.h b/src/operator/tensor/indexing_op.h
index 87381960e..d3a1bdfd6 100644
--- a/src/operator/tensor/indexing_op.h
+++ b/src/operator/tensor/indexing_op.h
@@ -598,7 +598,11 @@ void EmbeddingOpBackward(const nnvm::NodeAttrs& attrs,
     uint64_t shape_out_prod =
       static_cast<uint64_t>(grad_out.shape_[0])*
       static_cast<uint64_t>(grad_out.shape_[1]);
-    if (shape_out_prod < (uint64_t)16384 && shape_in_prod < (uint64_t)16384) {
+
+    const char *type = getenv("MXNET_FORCE_ADDTAKEGRAD");
+    const bool default_addtakegrad = (type == nullptr);
+
+    if (!default_addtakegrad || ( shape_out_prod < (uint64_t)16384 && shape_in_prod < (uint64_t)16384 )) {
       AddTakeGrad(grad_in, data, grad_out);
     } else {
       AddTakeGradLargeBatchCaller(ctx, grad_in, data, grad_out);
--
2.17.1
```
2. Run the above script with `MXNET_FORCE_ADDTAKEGRAD=1` exported.
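For context, here is my reading of why the reproduction always hits the broken path (a sketch of the dispatch arithmetic visible in the patch above; the 2D flattening of the gradients is assumed from the `shape_[0] * shape_[1]` products): both shape products are far above the 16384 threshold, so `AddTakeGradLargeBatchCaller` is chosen on every iteration.

``` python
# Sketch of the dispatch check in EmbeddingOpBackward, plugged with the
# shapes from the reproduction script above.
batch, seq_len, dim, vocab = 1024, 160, 300, 50000

shape_out_prod = (batch * seq_len) * dim  # 49152000, output gradient, flattened to 2D
shape_in_prod = vocab * dim               # 15000000, weight gradient

# Mirrors: if (shape_out_prod < 16384 && shape_in_prod < 16384) -> AddTakeGrad
use_add_take_grad = shape_out_prod < 16384 and shape_in_prod < 16384
print(use_add_take_grad)  # False -> AddTakeGradLargeBatchCaller runs
```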