Posted to issues@mxnet.apache.org by GitBox <gi...@apache.org> on 2020/12/26 08:37:27 UTC
[GitHub] [incubator-mxnet] Justobe opened a new issue #19717: mxnet.base.MXNetError: MXNetError: Error in operator batchnorm6
Justobe opened a new issue #19717:
URL: https://github.com/apache/incubator-mxnet/issues/19717
## Description
MXNet throws an exception when I build my model with MXNet as the Keras backend. The same script runs successfully on other Keras backends (such as TensorFlow and CNTK). On further investigation, the problem appears to be caused by the batch normalization layers when MXNet is used.
I also noticed that this issue was mentioned in #15721, but it still exists in the latest keras-mxnet 2.2.4.2 and mxnet-cu101 1.7.0.
### Error Message
> Traceback (most recent call last):
File "crash_checker.py", line 67, in <module>
model.add(Dense(10, activation='softmax'))
File "/root/anaconda3/envs/mxnet_170/lib/python3.6/site-packages/keras/engine/sequential.py", line 181, in add
output_tensor = layer(self.outputs[0])
File "/root/anaconda3/envs/mxnet_170/lib/python3.6/site-packages/keras/engine/base_layer.py", line 470, in __call__
output = self.call(inputs, **kwargs)
File "/root/anaconda3/envs/mxnet_170/lib/python3.6/site-packages/keras/layers/core.py", line 893, in call
output = K.bias_add(output, self.bias, data_format='channels_last')
File "/root/anaconda3/envs/mxnet_170/lib/python3.6/site-packages/keras/backend/mxnet_backend.py", line 94, in func_wrapper
train_symbol = func(*args, **kwargs)
File "/root/anaconda3/envs/mxnet_170/lib/python3.6/site-packages/keras/backend/mxnet_backend.py", line 3982, in bias_add
x_dim = ndim(x)
File "/root/anaconda3/envs/mxnet_170/lib/python3.6/site-packages/keras/backend/mxnet_backend.py", line 535, in ndim
shape = x.shape
File "/root/anaconda3/envs/mxnet_170/lib/python3.6/site-packages/keras/backend/mxnet_backend.py", line 4395, in shape
return self._get_shape()
File "/root/anaconda3/envs/mxnet_170/lib/python3.6/site-packages/keras/backend/mxnet_backend.py", line 4404, in _get_shape
_, out_shape, _ = self.symbol.infer_shape_partial()
File "/root/anaconda3/envs/mxnet_170/lib/python3.6/site-packages/mxnet/symbol/symbol.py", line 1177, in infer_shape_partial
return self._infer_shape_impl(True, *args, **kwargs)
File "/root/anaconda3/envs/mxnet_170/lib/python3.6/site-packages/mxnet/symbol/symbol.py", line 1265, in _infer_shape_impl
ctypes.byref(complete)))
File "/root/anaconda3/envs/mxnet_170/lib/python3.6/site-packages/mxnet/base.py", line 246, in check_call
raise get_last_ffi_error()
**mxnet.base.MXNetError: MXNetError: Error in operator batchnorm6: [16:26:44] include/mxnet/./tuple.h:245: Check failed: i >= 0 && i < ndim(): index = -2 must be in range [0, -1)**
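A note on reading the last line of the traceback: the valid range is printed as `[0, -1)`, which suggests that the shape being indexed reported `ndim() == -1` at that point, so no index could have passed the check. Whether `-1` here means a shape that has not been inferred yet is an assumption, not something the message states. The pure-Python sketch below only mimics the check as quoted in the message; `checked_index` is a hypothetical name and not an MXNet function.

```python
def checked_index(i, ndim):
    """Mimic the bounds check quoted in the error message (illustrative only)."""
    # The message reports the valid range as [0, ndim), so an ndim of -1
    # leaves no valid index at all.
    if not (0 <= i < ndim):
        raise IndexError(
            "Check failed: i >= 0 && i < ndim(): "
            f"index = {i} must be in range [0, {ndim})"
        )
    return i


# A known 2-D shape accepts indices 0 and 1.
print(checked_index(1, 2))

# The values from the traceback: index -2 against ndim -1 always fails.
try:
    checked_index(-2, -1)
except IndexError as err:
    print(err)
```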
## To Reproduce
The following simple script reproduces the bug:
```
import os
import sys
bk = sys.argv[1]
os.environ['KERAS_BACKEND'] = bk
from keras import backend as K
import keras
from keras.models import Sequential
from keras.layers.core import Dense
from keras.layers import Conv2D,MaxPooling2D,BatchNormalization,Flatten,Dropout
model = Sequential()
model.add(Conv2D(96, (3,3), strides=(2,2), activation='relu', padding='same', input_shape=(32, 32, 3,)))
model.add(MaxPooling2D(pool_size=(2, 2), strides=(2,2)))
# Local Response normalization for Original Alexnet
model.add(BatchNormalization())
model.add(Conv2D(96, (3,3), activation='relu', padding='same'))
model.add(MaxPooling2D(pool_size=(3, 3), strides=(2,2)))
model.add(BatchNormalization())
model.add(Conv2D(192, (3,3), activation='relu', padding='same'))
model.add(Conv2D(192, (3,3), activation='relu', padding='same'))
model.add(Conv2D(256, (3,3), activation='relu', padding='same'))
model.add(MaxPooling2D(pool_size=(3, 3), strides=(2,2)))
model.add(BatchNormalization())
model.add(Flatten())
model.add(Dense(512, activation='tanh'))
model.add(Dropout(0.5))
model.add(Dense(256, activation='tanh'))
# With the BatchNormalization line below commented out, the script runs successfully on mxnet.
# With the line in place, the script still runs successfully on both tensorflow and cntk.
model.add(BatchNormalization())
model.add(Dropout(0.5))
model.add(Dense(10, activation='softmax'))
# print the model summary
model.summary()
```
### Steps to reproduce
`python myscript.py mxnet` (replace `mxnet` with `tensorflow` to test with the TensorFlow backend)
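To compare backends in one go, the command above can be wrapped in a small driver. This is a minimal sketch that assumes the reproduction script is saved as `myscript.py` and that every listed backend is installed in the current environment; `run_with_backends` is a hypothetical helper name.

```python
import subprocess
import sys


def run_with_backends(script, backends):
    """Run `script` once per backend name and collect the exit codes."""
    results = {}
    for backend in backends:
        # The reproduction script reads the backend name from sys.argv[1].
        proc = subprocess.run(
            [sys.executable, script, backend],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
        )
        results[backend] = proc.returncode
    return results


if __name__ == "__main__":
    # "myscript.py" is the file name used in the command above.
    outcome = run_with_backends("myscript.py", ["mxnet", "tensorflow", "cntk"])
    for name, code in outcome.items():
        print(name, "ok" if code == 0 else "failed (exit code %d)" % code)
```

A non-zero exit code only shows that the script raised an exception or could not start. The traceback itself is captured in `proc.stderr` and would need to be printed to see which layer failed.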
## Environment
```
Package             Version
------------------- -------------------
cached-property     1.5.2
certifi             2020.12.5
chardet             4.0.0
cycler              0.10.0
decorator           4.4.2
graphviz            0.8.4
h5py                2.10.0
idna                2.10
Keras-Applications  1.0.8
keras-mxnet         2.2.4.2
Keras-Preprocessing 1.1.2
kiwisolver          1.3.1
matplotlib          3.2.2
mxnet-cu101         1.7.0
networkx            2.5
numpy               1.19.4
pandas              0.23.0
Pillow              5.1.0
pip                 20.3.3
pyparsing           2.4.7
python-dateutil     2.8.1
pytz                2020.5
PyWavelets          1.1.1
PyYAML              5.3.1
redis               3.3.2
requests            2.25.1
scikit-image        0.13.1
scikit-learn        0.19.1
scipy               1.1.0
setuptools          51.0.0.post20201207
six                 1.15.0
urllib3             1.26.2
wheel               0.36.2
```
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@mxnet.apache.org
For additional commands, e-mail: issues-help@mxnet.apache.org
[GitHub] [incubator-mxnet] yangshuo0323 commented on issue #19717: mxnet.base.MXNetError: MXNetError: Error in operator batchnorm6
Posted by GitBox <gi...@apache.org>.
yangshuo0323 commented on issue #19717:
URL: https://github.com/apache/incubator-mxnet/issues/19717#issuecomment-770141771
I see you trained your model with MXNet version 1.7.0. I am trying to train BERT on multiple GPUs and have a different question I would like to ask you. Have you encountered this problem?
```
[1,4]<stderr>:===================
[1,5]<stderr>:[node106:26502:0:26502] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x30)
[1,5]<stderr>:==== backtrace ====
[1,6]<stderr>:[node106:26503:0:26503] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x30)
[1,6]<stderr>:==== backtrace ====
[1,5]<stderr>: 0 /usr/lib/libucs.so.0(+0x1fcec) [0x7f40f065bcec]
[1,5]<stderr>: 1 /usr/lib/libucs.so.0(+0x1ff64) [0x7f40f065bf64]
[1,5]<stderr>: 2 /lib/x86_64-linux-gnu/libpthread.so.0(pthread_mutex_lock+0x4) [0x7f42ead77d44]
[1,5]<stderr>: 3 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet6engine11ThreadedVar21AppendWriteDependencyEPNS0_8OprBlockE+0x44) [0x7f428d022564]
[1,5]<stderr>: 4 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine4PushEPNS0_3OprENS_7ContextEib+0x280) [0x7f428d025790]
[1,5]<stderr>: 5 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine9PushAsyncESt8functionIFvNS_10RunContextENS0_18CallbackOnCompleteEEENS_7ContextERKSt6vectorIPNS0_3VarESaISA_EESE_NS_10FnPropertyEiPKcb+0x131) [0x7f428d01ded1]
[1,5]<stderr>: 6 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet10CopyFromToERKNS_7NDArrayES2_ib+0xaf4) [0x7f428cff89d4]
[1,5]<stderr>: 7 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/horovod/mxnet/mpi_lib.cpython-37m-x86_64-linux-gnu.so(_ZN7horovod5mxnet29PushHorovodOperationCudaOnCPUENS_6common7Request11RequestTypeEPN5mxnet7NDArrayES6_PKcii+0xe6f) [0x7f410243a18f]
[1,5]<stderr>: 8 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/horovod/mxnet/mpi_lib.cpython-37m-x86_64-linux-gnu.so(horovod_mxnet_broadcast_async+0x54) [0x7f4102431d84]
[1,5]<stderr>: 9 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/../../libffi.so.7(+0x69dd) [0x7f42e9da49dd]
[1,5]<stderr>: 10 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/../../libffi.so.7(+0x6067) [0x7f42e9da4067]
[1,5]<stderr>: 11 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so(_ctypes_callproc+0x2ce) [0x7f42eafd527e]
[1,5]<stderr>: 12 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so(+0x12cb4) [0x7f42eafd5cb4]
[1,5]<stderr>: 13 python(_PyObject_FastCallKeywords+0x48b) [0x564d0453c00b]
[1,5]<stderr>: 14 python(_PyEval_EvalFrameDefault+0x51d1) [0x564d045a09a1]
[1,5]<stderr>: 15 python(_PyEval_EvalCodeWithName+0x2f9) [0x564d044e42b9]
[1,5]<stderr>: 16 python(_PyFunction_FastCallKeywords+0x387) [0x564d04534497]
[1,5]<stderr>: 17 python(_PyEval_EvalFrameDefault+0x14ea) [0x564d0459ccba]
[1,5]<stderr>: 18 python(_PyEval_EvalCodeWithName+0x2f9) [0x564d044e42b9]
[1,5]<stderr>: 19 python(_PyFunction_FastCallKeywords+0x387) [0x564d04534497]
[1,5]<stderr>: 20 python(_PyEval_EvalFrameDefault+0x14ea) [0x564d0459ccba]
[1,5]<stderr>: 21 python(_PyFunction_FastCallKeywords+0xfb) [0x564d0453420b]
[1,5]<stderr>: 22 python(_PyEval_EvalFrameDefault+0x416) [0x564d0459bbe6]
[1,5]<stderr>: 23 python(_PyEval_EvalCodeWithName+0x2f9) [0x564d044e42b9]
[1,5]<stderr>: 24 python(PyEval_EvalCodeEx+0x44) [0x564d044e51d4]
[1,5]<stderr>: 25 python(PyEval_EvalCode+0x1c) [0x564d044e51fc]
[1,5]<stderr>: 26 python(+0x22bf44) [0x564d045faf44]
[1,5]<stderr>: 27 python(PyRun_FileExFlags+0xa1) [0x564d046052b1]
[1,5]<stderr>: 28 python(PyRun_SimpleFileExFlags+0x1c3) [0x564d046054a3]
[1,5]<stderr>: 29 python(+0x2375d5) [0x564d046065d5]
[1,5]<stderr>: 30 python(_Py_UnixMain+0x3c) [0x564d046066fc]
[1,5]<stderr>: 31 /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7f42ea9c4840]
[1,5]<stderr>: 32 python(+0x1dc3c0) [0x564d045ab3c0]
[1,5]<stderr>:===================
[1,6]<stderr>: 0 /usr/lib/libucs.so.0(+0x1fcec) [0x7f1a6c25bcec]
[1,6]<stderr>: 1 /usr/lib/libucs.so.0(+0x1ff64) [0x7f1a6c25bf64]
[1,6]<stderr>: 2 /lib/x86_64-linux-gnu/libpthread.so.0(pthread_mutex_lock+0x4) [0x7f1c66a2ad44]
[1,6]<stderr>: 3 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet6engine11ThreadedVar21AppendWriteDependencyEPNS0_8OprBlockE+0x44) [0x7f1c08cd5564]
[1,6]<stderr>: 4 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine4PushEPNS0_3OprENS_7ContextEib+0x280) [0x7f1c08cd8790]
[1,6]<stderr>: 5 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet6engine14ThreadedEngine9PushAsyncESt8functionIFvNS_10RunContextENS0_18CallbackOnCompleteEEENS_7ContextERKSt6vectorIPNS0_3VarESaISA_EESE_NS_10FnPropertyEiPKcb+0x131) [0x7f1c08cd0ed1]
[1,6]<stderr>: 6 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/mxnet/libmxnet.so(_ZN5mxnet10CopyFromToERKNS_7NDArrayES2_ib+0xaf4) [0x7f1c08cab9d4]
[1,6]<stderr>: 7 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/horovod/mxnet/mpi_lib.cpython-37m-x86_64-linux-gnu.so(_ZN7horovod5mxnet29PushHorovodOperationCudaOnCPUENS_6common7Request11RequestTypeEPN5mxnet7NDArrayES6_PKcii+0xe6f) [0x7f1a7e0e118f]
[1,6]<stderr>: 8 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/site-packages/horovod/mxnet/mpi_lib.cpython-37m-x86_64-linux-gnu.so(horovod_mxnet_broadcast_async+0x54) [0x7f1a7e0d8d84]
[1,6]<stderr>: 9 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/../../libffi.so.7(+0x69dd) [0x7f1c65a579dd]
[1,6]<stderr>: 10 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/../../libffi.so.7(+0x6067) [0x7f1c65a57067]
[1,6]<stderr>: 11 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so(_ctypes_callproc+0x2ce) [0x7f1c66c8827e]
[1,6]<stderr>: 12 /home/yangshuo/miniconda3/envs/yangshuo/lib/python3.7/lib-dynload/_ctypes.cpython-37m-x86_64-linux-gnu.so(+0x12cb4) [0x7f1c66c88cb4]
[1,6]<stderr>: 13 python(_PyObject_FastCallKeywords+0x48b) [0x562df52e800b]
[1,6]<stderr>: 14 python(_PyEval_EvalFrameDefault+0x51d1) [0x562df534c9a1]
[1,6]<stderr>: 15 python(_PyEval_EvalCodeWithName+0x2f9) [0x562df52902b9]
[1,6]<stderr>: 16 python(_PyFunction_FastCallKeywords+0x387) [0x562df52e0497]
[1,6]<stderr>: 17 python(_PyEval_EvalFrameDefault+0x14ea) [0x562df5348cba]
[1,6]<stderr>: 18 python(_PyEval_EvalCodeWithName+0x2f9) [0x562df52902b9]
[1,6]<stderr>: 19 python(_PyFunction_FastCallKeywords+0x387) [0x562df52e0497]
[1,6]<stderr>: 20 python(_PyEval_EvalFrameDefault+0x14ea) [0x562df5348cba]
[1,6]<stderr>: 21 python(_PyFunction_FastCallKeywords+0xfb) [0x562df52e020b]
[1,6]<stderr>: 22 python(_PyEval_EvalFrameDefault+0x416) [0x562df5347be6]
[1,6]<stderr>: 23 python(_PyEval_EvalCodeWithName+0x2f9) [0x562df52902b9]
[1,6]<stderr>: 24 python(PyEval_EvalCodeEx+0x44) [0x562df52911d4]
[1,6]<stderr>: 25 python(PyEval_EvalCode+0x1c) [0x562df52911fc]
[1,6]<stderr>: 26 python(+0x22bf44) [0x562df53a6f44]
[1,6]<stderr>: 27 python(PyRun_FileExFlags+0xa1) [0x562df53b12b1]
[1,6]<stderr>: 28 python(PyRun_SimpleFileExFlags+0x1c3) [0x562df53b14a3]
[1,6]<stderr>: 29 python(+0x2375d5) [0x562df53b25d5]
[1,6]<stderr>: 30 python(_Py_UnixMain+0x3c) [0x562df53b26fc]
[1,6]<stderr>: 31 /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7f1c66677840]
[1,6]<stderr>: 32 python(+0x1dc3c0) [0x562df53573c0]
[1,6]<stderr>:===================
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 7 with PID 0 on node node106 exited on signal 11 (Segmentation fault).
```
- My environment is:
```
gluonnlp 0.10.0
horovod 0.19.5
mxnet-cu102 1.7.0
```
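The mpirun output above interleaves lines from different processes, which makes the two backtraces hard to read side by side. A minimal sketch for separating them follows. It assumes every tagged line starts with `[job,rank]<stream>:` as in the log above, and that the second number in the tag is the process rank; `group_by_rank` is a hypothetical helper, not part of Horovod or MXNet.

```python
import re
from collections import defaultdict

# Matches tags such as "[1,5]<stderr>:" at the start of a line.
TAG = re.compile(r"^\[(\d+),(\d+)\]<(stdout|stderr)>:(.*)$")


def group_by_rank(lines):
    """Group tagged log lines by rank, preserving order within each rank."""
    grouped = defaultdict(list)
    for line in lines:
        match = TAG.match(line.rstrip("\n"))
        # Untagged lines (for example the mpirun summary) are skipped.
        if match:
            rank = int(match.group(2))
            grouped[rank].append(match.group(4))
    return dict(grouped)


# A few lines copied from the log above.
sample = [
    "[1,5]<stderr>:==== backtrace ====",
    "[1,6]<stderr>:==== backtrace ====",
    "[1,5]<stderr>: 0 /usr/lib/libucs.so.0(+0x1fcec) [0x7f40f065bcec]",
    "[1,6]<stderr>: 0 /usr/lib/libucs.so.0(+0x1fcec) [0x7f1a6c25bcec]",
]
for rank, rank_lines in sorted(group_by_rank(sample).items()):
    print("rank", rank)
    for text in rank_lines:
        print("   ", text)
```

Read per rank, both backtraces in the log pass through `horovod_mxnet_broadcast_async` and `CopyFromTo` before the fault in `pthread_mutex_lock`.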
[GitHub] [incubator-mxnet] yangshuo0323 commented on issue #19717: mxnet.base.MXNetError: MXNetError: Error in operator batchnorm6
Posted by GitBox <gi...@apache.org>.
yangshuo0323 commented on issue #19717:
URL: https://github.com/apache/incubator-mxnet/issues/19717#issuecomment-770141916
@Justobe
[GitHub] [incubator-mxnet] szha commented on issue #19717: mxnet.base.MXNetError: MXNetError: Error in operator batchnorm6
Posted by GitBox <gi...@apache.org>.
szha commented on issue #19717:
URL: https://github.com/apache/incubator-mxnet/issues/19717#issuecomment-754320285
cc @sandeep-krishnamurthy
[GitHub] [incubator-mxnet] Justobe commented on issue #19717: mxnet.base.MXNetError: MXNetError: Error in operator batchnorm6
Posted by GitBox <gi...@apache.org>.
Justobe commented on issue #19717:
URL: https://github.com/apache/incubator-mxnet/issues/19717#issuecomment-770142589
@yangshuo0323 Sorry, I have not run into a problem like that. The exception in my script was thrown when I used MXNet as the Keras backend.