You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2019/10/19 08:20:40 UTC
[GitHub] [incubator-mxnet] zixuanweeei opened a new issue #16548: Stacked RNNCell and FusedRNNCell may give inconsistent results with GPU context

zixuanweeei opened a new issue #16548: Stacked RNNCell and FusedRNNCell may give inconsistent results with GPU context
URL: https://github.com/apache/incubator-mxnet/issues/16548
 
 
   ## Description
   Some new test cases were added to unit test for covering the fusion routine of RNN operators in #16420. Then we found that flakiness of `test_operator.py:test_rnnrelu_sym` appears several times from the online CI of either Unix-GPU MKLDNN+GPU or Unix-GPU NOMKLDNN+GPU. We have no idea the root cause of the flakiness, but we can reproduce the inconsistent results locally. Please see the below parts of the details.
   
   ## Environment info (Required)
   
   ```
   ----------Python Info----------
   Version      : 3.7.3
   Compiler     : GCC 7.3.0
   Build        : ('default', 'Mar 27 2019 22:11:17')
   Arch         : ('64bit', '')
   ------------Pip Info-----------
   Version      : 19.1.1
   Directory    : /root/miniconda3/lib/python3.7/site-packages/pip
   ----------MXNet Info-----------
   Version      : 1.6.0
   Directory    : /root/dev/incubator-mxnet/python/mxnet
   Commit hash file "/root/dev/incubator-mxnet/python/mxnet/COMMIT_HASH" not found. Not installed from pre-built package or built from source.
   Library      : ['/root/dev/incubator-mxnet/lib/libmxnet.so', '/root/dev/incubator-mxnet/python/mxnet/../../lib/libmxnet.so']
   Build features:
   ✔ CUDA
   ✔ CUDNN
   ✖ NCCL
   ✔ CUDA_RTC
   ✖ TENSORRT
   ✔ CPU_SSE
   ✔ CPU_SSE2
   ✔ CPU_SSE3
   ✔ CPU_SSE4_1
   ✔ CPU_SSE4_2
   ✖ CPU_SSE4A
   ✔ CPU_AVX
   ✖ CPU_AVX2
   ✔ OPENMP
   ✖ SSE
   ✔ F16C
   ✖ JEMALLOC
   ✖ BLAS_OPEN
   ✖ BLAS_ATLAS
   ✔ BLAS_MKL
   ✖ BLAS_APPLE
   ✖ LAPACK
   ✖ MKLDNN
   ✔ OPENCV
   ✖ CAFFE
   ✖ PROFILER
   ✖ DIST_KVSTORE
   ✖ CXX14
   ✖ INT64_TENSOR_SIZE
   ✖ SIGNAL_HANDLER
   ✖ DEBUG
   ✖ TVM_OP
   ----------System Info----------
   Platform     : Linux-4.18.0-15-generic-x86_64-with-debian-buster-sid
   system       : Linux
   node         : d64ced67d422
   release      : 4.18.0-15-generic
   version      : #16~18.04.1-Ubuntu SMP Thu Feb 7 14:06:04 UTC 2019
   ```
   
   Package used (Python/R/Scala/Julia):
   I'm using Python Package
   
   ## Build info (Required if built from source)
   ```
   Using built-in specs.
   COLLECT_GCC=gcc
   COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/7/lto-wrapper
   OFFLOAD_TARGET_NAMES=nvptx-none
   OFFLOAD_TARGET_DEFAULT=1
   Target: x86_64-linux-gnu
   Configured with: ../src/configure -v --with-pkgversion='Ubuntu 7.4.0-1ubuntu1~18.04.1' --with-bugurl=file:///usr/share/doc/gcc-7/README.Bugs --enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++ --prefix=/usr --with-gcc-major-version-only --program-suffix=-7 --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-libmpx --enable-plugin --enable-default-pie --with-system-zlib --with-target-system-zlib --enable-objc-gc=auto --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-offload-targets=nvptx-none --without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
   Thread model: posix
   gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)
   ```
   
   MXNet commit hash:
   63fbfb12296e41263a6f40df6e616601f6d24d8a
   
   Build config:
   ```
   make -j10 USE_PROFILER=0 USE_CUDA=1 USE_CUDNN=1 USE_MKLDNN=0 USE_BLAS=mkl USE_INTEL_PATH=/opt/intel USE_CUDA_PATH=/usr/local/cuda-10.0 USE_CUDNN_PATH=/usr/local/cuda-10.0/lib64 test
   ```
   
   ## Error Message:
   ```
   /root/dev/incubator-mxnet/python/mxnet/rnn/rnn_cell.py:675: UserWarning: NTC layout detected. Consider using TNC for FusedRNNCell for faster speed
     warnings.warn("NTC layout detected. Consider using "
   Traceback (most recent call last):
     File "rnn_relu_unidirectional.py", line 69, in <module>
       check_consistency(fused_cell, stacked_cell, seq_len, batch_size, input_dim, "write")
     File "rnn_relu_unidirectional.py", line 62, in check_consistency
       assert_allclose(mod1.get_input_grads()[0].asnumpy(), mod2.get_input_grads()[0].asnumpy(), rtol=rtol, atol=atol)
     File "/root/miniconda3/lib/python3.7/site-packages/numpy/testing/_private/utils.py", line 1501, in assert_allclose
       verbose=verbose, header=header, equal_nan=equal_nan)
     File "/root/miniconda3/lib/python3.7/site-packages/numpy/testing/_private/utils.py", line 827, in assert_array_compare
       raise AssertionError(msg)
   AssertionError:
   Not equal to tolerance rtol=0.01, atol=0.0001
   
   Mismatch: 0.0195%
   Max absolute difference: 0.00012941
   Max relative difference: 1.6247102
    x: array([[[ 2.050658e-03,  1.623387e-03,  1.420917e-03, ...,
            -5.661430e-05,  1.032020e-03,  1.353525e-03],
           [ 2.300234e-05,  1.259212e-03,  1.050305e-03, ...,...
    y: array([[[ 2.050658e-03,  1.623387e-03,  1.420917e-03, ...,
            -5.661434e-05,  1.032020e-03,  1.353525e-03],
           [ 2.300218e-05,  1.259212e-03,  1.050305e-03, ...,...
   ```
   
   ## Minimum reproducible example
   ```Python
   import mxnet as mx
   import numpy as np
   from numpy.testing import assert_allclose
   from mxnet.test_utils import set_default_context, default_context
   
   
   def sym_gen(seq_len, batch_size, input_dim, state_dim):
     fused = mx.rnn.FusedRNNCell(state_dim, num_layers=3, mode='rnn_relu', get_next_state=True, prefix='')
     stack = mx.rnn.SequentialRNNCell()
     stack.add(mx.rnn.RNNCell(state_dim, activation='relu', prefix='l0_'))
     stack.add(mx.rnn.RNNCell(state_dim, activation='relu', prefix='l1_'))
     stack.add(mx.rnn.RNNCell(state_dim, activation='relu', prefix='l2_'))
   
     return fused, stack
   
   
   def check_consistency(cell1, cell2, seq_len, batch_size, input_dim, grad_req, atol=1e-4, rtol=1e-2):
     dshape = (batch_size, seq_len, input_dim)
     data = mx.sym.Variable('data')
   
     Y1, _ = cell1.unroll(seq_len, data, layout='NTC', merge_outputs=True)
     mod1 = mx.mod.Module(Y1, label_names=None, context=default_context())
     mod1.bind(data_shapes=[('data', dshape)], label_shapes=None, inputs_need_grad=True, grad_req=grad_req)
   
     Y2, _ = cell2.unroll(seq_len, data, layout='NTC', merge_outputs=True)
     mod2 = mx.mod.Module(Y2, label_names=None, context=default_context())
     mod2.bind(data_shapes=[('data', dshape)], label_shapes=None, inputs_need_grad=True, grad_req=grad_req)
   
     mod1.init_params()
     _, auxs = mod1.get_params()
     # args = cell1.unpack_weights(args)
     npzfile = np.load("./issue_array.npz")
     arrays = dict([(name, npzfile[name]) for name in npzfile.files])
     for name, array in arrays.items():
         arrays[name] = mx.nd.array(array)
     args1 = cell1.pack_weights(arrays)
     mod1.set_params(args1, auxs)
     args2 = cell2.pack_weights(arrays)
     mod2.set_params(args2, auxs)
   
     x = np.load("./x.npz")["x"]
     x = mx.nd.array(x)
     batch=mx.io.DataBatch(data=[x])
     # check inference
     mod1.forward(batch, is_train=False)
     mod2.forward(batch, is_train=False)
     assert_allclose(mod1.get_outputs()[0].asnumpy(), mod2.get_outputs()[0].asnumpy(), rtol=rtol, atol=atol)
   
     # check training
     mod1.forward(batch, is_train=True)
     mod2.forward(batch, is_train=True)
     assert_allclose(mod1.get_outputs()[0].asnumpy(), mod2.get_outputs()[0].asnumpy(), rtol=rtol, atol=atol)
   
     dy_array = np.load("./dy.npz")["dy"]
     dy = mx.nd.array(dy_array)
     mod1.backward(out_grads=[dy])
     mod2.backward(out_grads=[dy])
     if type(grad_req) is dict and grad_req['data'] == 'null' or grad_req == 'null':
       assert(mod1.get_input_grads()[0] == None)
       assert(mod2.get_input_grads()[0] == None)
     else:
       assert_allclose(mod1.get_input_grads()[0].asnumpy(), mod2.get_input_grads()[0].asnumpy(), rtol=rtol, atol=atol)
   
   
   if __name__ == "__main__":
     set_default_context(mx.gpu(0))
     seq_len, batch_size, input_dim, state_dim = 5, 32, 32, 512
     fused_cell, stacked_cell = sym_gen(seq_len, batch_size, input_dim, state_dim)
     check_consistency(fused_cell, stacked_cell, seq_len, batch_size, input_dim, "write")
   ```
   
   ## Steps to reproduce
   1. Please download the input, weights, gradients data from the link below. It contains three .npz files.
   2. Extract them to the directory of the Python script above
   3. Install mxnet-gpu and execute the script
   
   [data.tar.gz](https://github.com/apache/incubator-mxnet/files/3746390/data.tar.gz)
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services