Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2019/03/21 02:13:14 UTC
[GitHub] [incubator-mxnet] OliverColeman opened a new issue #14484: Odd behaviour with 'device' kvstore and CUDA illegal memory access errors
URL: https://github.com/apache/incubator-mxnet/issues/14484
## Description
Training the FCN model from gluon-cv on 2 GPUs, I encounter different but possibly related issues depending on which kvstore type I use ('local' or 'device'). (I don't think this is a gluon-cv issue.) A test script is included.
## Environment info (Required)
```
----------Python Info----------
Version : 3.5.6
Compiler : GCC 7.3.0
Build : ('default', 'Aug 26 2018 21:41:56')
Arch : ('64bit', '')
------------Pip Info-----------
Version : 8.1.2
Directory : /opt/conda/lib/python3.5/site-packages/pip
----------MXNet Info-----------
Version : 1.3.1
Directory : /opt/conda/lib/python3.5/site-packages/mxnet
Commit Hash : 19c501680183237d52a862e6ae1dc4ddc296305b
----------System Info----------
Platform : Linux-4.15.0-46-generic-x86_64-with-debian-stretch-sid
system : Linux
node : axl1
release : 4.15.0-46-generic
version : #49~16.04.1-Ubuntu SMP Tue Feb 12 17:45:24 UTC 2019
----------Hardware Info----------
machine : x86_64
processor : x86_64
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: AuthenticAMD
CPU family: 23
Model: 17
Model name: AMD Ryzen 3 2200G with Radeon Vega Graphics
Stepping: 0
CPU MHz: 1458.994
CPU max MHz: 3500.0000
CPU min MHz: 1600.0000
BogoMIPS: 6986.85
Virtualization: AMD-V
Hypervisor vendor: vertical
Virtualization type: full
L1d cache: 32K
L1i cache: 64K
L2 cache: 512K
L3 cache: 4096K
NUMA node0 CPU(s): 0-3
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx hw_pstate sme ssbd ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca
----------Network Test----------
Setting timeout: 10
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.1691 sec, LOAD: 0.6659 sec.
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0070 sec, LOAD: 1.3928 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0083 sec, LOAD: 0.8829 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.8895 sec, LOAD: 0.7720 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0078 sec, LOAD: 0.9719 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0093 sec, LOAD: 0.0760 sec.
```
Package used (Python/R/Scala/Julia):
Python
## Error Message:
### If kvstore is 'local':
```
epoch 0
[01:09:18] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:97: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
-------- autograd.backward(loss)
---------- trainer.step(batch_size)
Traceback (most recent call last):
File "train.py", line 131, in <module>
predTop = predTop.reshape((-1,)).astype('uint8').asnumpy()
File "/opt/conda/lib/python3.5/site-packages/mxnet/ndarray/ndarray.py", line 1972, in asnumpy
ctypes.c_size_t(data.size)))
File "/opt/conda/lib/python3.5/site-packages/mxnet/base.py", line 251, in check_call
raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [01:09:26] /home/travis/build/dmlc/mxnet-distro/mxnet-build/3rdparty/mshadow/mshadow/./stream_gpu-inl.h:62: Check failed: e == cudaSuccess CUDA: an illegal memory access was encountered
Stack trace returned 10 entries:
[bt] (0) /opt/conda/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x381822) [0x7fbe7f130822]
[bt] (1) /opt/conda/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x381e08) [0x7fbe7f130e08]
[bt] (2) /opt/conda/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x2f3e198) [0x7fbe81ced198]
[bt] (3) /opt/conda/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x2faf1ea) [0x7fbe81d5e1ea]
[bt] (4) /opt/conda/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x2f15123) [0x7fbe81cc4123]
[bt] (5) /opt/conda/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x2f1d334) [0x7fbe81ccc334]
[bt] (6) /opt/conda/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x2f213db) [0x7fbe81cd03db]
[bt] (7) /opt/conda/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x2f215fe) [0x7fbe81cd05fe]
[bt] (8) /opt/conda/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x2f1d9fb) [0x7fbe81ccc9fb]
[bt] (9) /opt/conda/bin/../lib/libstdc++.so.6(+0xb8678) [0x7fbe6a362678]
```
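The traceback above ends in `asnumpy()`, but MXNet runs operators asynchronously, so the illegal memory access may have happened in an earlier operation and only been reported at this synchronisation point. A minimal sketch of one way to check this, assuming the build honours the documented `MXNET_ENGINE_TYPE` variable and the standard CUDA `CUDA_LAUNCH_BLOCKING` variable; the helper name `synchronous_debug_env` is hypothetical and not part of the original script:

```python
import os


def synchronous_debug_env(base_env=None):
    """Return a copy of the environment with synchronous execution requested.

    Hypothetical helper for illustration only. It does not touch the current
    process; the returned dict is meant to be passed to a child process.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Ask MXNet to run each operator to completion before starting the next.
    env["MXNET_ENGINE_TYPE"] = "NaiveEngine"
    # Ask the CUDA runtime to make kernel launches blocking.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


# Possible usage (assumes the script is saved as train.py):
#   import subprocess, sys
#   subprocess.call([sys.executable, "train.py"], env=synchronous_debug_env())
```

With both variables set, training is expected to be much slower, but a failing operator should be reported closer to where it actually runs.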
### If kvstore is 'device':
There is no error message; the process hangs when trying to push to the kvstore in `gluon.Trainer._allreduce_grads()`. The example script below includes some debug prints to narrow down where the process hangs.
```
epoch 0
[01:21:38] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:97: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
-------- autograd.backward(loss)
---------- trainer.step(batch_size)
kvs 2 <mxnet.kvstore.KVStore object at 0x7f1e6f85f940>
a
a2
b 0 fcn0_resnetv1s_conv0_weight
c
c2
d
g
h
b 1 fcn0_resnetv1s_syncbatchnorm0_gamma
c
c2
d
g
h
b 2 fcn0_resnetv1s_syncbatchnorm0_beta
c
c2
d
g
h
b 3 fcn0_resnetv1s_syncbatchnorm0_running_mean
h
b 4 fcn0_resnetv1s_syncbatchnorm0_running_var
h
b 5 fcn0_resnetv1s_conv1_weight
c
c2
d
g
h
b 6 fcn0_resnetv1s_syncbatchnorm1_gamma
c
c2
d
g
h
b 7 fcn0_resnetv1s_syncbatchnorm1_beta
c
c2
d
g
h
b 8 fcn0_resnetv1s_syncbatchnorm1_running_mean
h
b 9 fcn0_resnetv1s_syncbatchnorm1_running_var
h
b 10 fcn0_resnetv1s_conv2_weight
c
c2
d
g
h
b 11 fcn0_resnetv1s_syncbatchnorm2_gamma
c
c2
d
g
h
b 12 fcn0_resnetv1s_syncbatchnorm2_beta
c
c2
d
g
h
b 13 fcn0_resnetv1s_syncbatchnorm2_running_mean
h
b 14 fcn0_resnetv1s_syncbatchnorm2_running_var
h
b 15 fcn0_resnetv1s_layers1_conv0_weight
c
c2
d
g
h
b 16 fcn0_resnetv1s_layers1_syncbatchnorm0_gamma
c
c2
d
g
h
b 17 fcn0_resnetv1s_layers1_syncbatchnorm0_beta
c
c2
d
g
h
b 18 fcn0_resnetv1s_layers1_syncbatchnorm0_running_mean
h
b 19 fcn0_resnetv1s_layers1_syncbatchnorm0_running_var
h
b 20 fcn0_resnetv1s_layers1_conv1_weight
c
c2
d
g
h
b 21 fcn0_resnetv1s_layers1_syncbatchnorm1_gamma
c
c2
d
g
h
b 22 fcn0_resnetv1s_layers1_syncbatchnorm1_beta
c
c2
d
g
h
b 23 fcn0_resnetv1s_layers1_syncbatchnorm1_running_mean
h
b 24 fcn0_resnetv1s_layers1_syncbatchnorm1_running_var
h
b 25 fcn0_resnetv1s_layers1_conv2_weight
c
c2
d
g
h
b 26 fcn0_resnetv1s_layers1_syncbatchnorm2_gamma
c
c2
d
g
h
b 27 fcn0_resnetv1s_layers1_syncbatchnorm2_beta
c
c2
d
g
h
b 28 fcn0_resnetv1s_layers1_syncbatchnorm2_running_mean
h
b 29 fcn0_resnetv1s_layers1_syncbatchnorm2_running_var
h
b 30 fcn0_resnetv1s_down1_conv0_weight
c
c2
d
g
h
b 31 fcn0_resnetv1s_down1_syncbatchnorm0_gamma
c
c2
[...hangs here. The python process then refuses to exit regardless of which kill signal I send to it. The docker container also refuses to stop. I have to restart the machine at this point.]
```
Note: the specific layer it stops on varies.
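As an alternative to hand-placed prints, a hang like this could possibly be located with Python's standard `faulthandler` module, which can dump the Python stack of every thread if a call does not return in time. This is only a sketch and not something the script in this report does; `run_with_watchdog`, `demo` and the timeout values are hypothetical names and numbers chosen for illustration:

```python
import faulthandler
import tempfile
import time


def run_with_watchdog(fn, timeout_s, log_file):
    """Call fn(); if it has not returned after timeout_s seconds,
    dump the tracebacks of all threads to log_file (the call itself
    is not interrupted)."""
    faulthandler.dump_traceback_later(timeout_s, file=log_file)
    try:
        return fn()
    finally:
        # Disarm the watchdog once the call has returned or raised.
        faulthandler.cancel_dump_traceback_later()


def demo():
    """Simulate a stalled call and return what the watchdog wrote."""
    with tempfile.TemporaryFile(mode="w+") as log:
        run_with_watchdog(lambda: time.sleep(0.5), 0.1, log)
        log.seek(0)
        return log.read()


if __name__ == "__main__":
    print(demo())
```

In the training loop this would wrap the suspect call, for example `run_with_watchdog(lambda: trainer.step(batch_size), 60, sys.stderr)`. The dump shows Python frames only, so it can confirm which Python line is blocked but not what the native engine is doing underneath.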
## Minimum reproducible example
```
import sys, math
import numpy as np
import mxnet as mx
from mxnet import gluon, autograd, metric
import gluoncv
from gluoncv.utils.parallel import DataParallelModel, DataParallelCriterion
from gluoncv.model_zoo import get_model
from gluoncv.loss import *
from gluoncv.model_zoo.segbase import *
from mxnet.gluon.data import dataset
from gluoncv.utils import LRScheduler


class DummyDataSet(dataset.Dataset):
    def __init__(self, crop_size):
        self.data = []
        for i in range(5):
            d = mx.ndarray.ones((3, crop_size, crop_size))
            l = mx.ndarray.ones((crop_size, crop_size))
            r = (d, l)
            self.data.append(r)

    @property
    def num_class(self):
        return 5

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        return self.data[index]


class Trainer(gluon.Trainer):
    def step(self, batch_size, ignore_stale_grad=False):
        if not self._kv_initialized:
            print("kvs %d %s" % (len(self._contexts), str(self._kvstore_params['kvstore'])))
            self._init_kvstore()
        if self._params_to_init:
            self._init_params()
        self._optimizer.rescale_grad = self._scale / batch_size
        self._allreduce_grads()
        self._update(ignore_stale_grad)

    def _allreduce_grads(self):
        print("a")
        if self._kvstore:
            print("a2")
            for i, param in enumerate(self._params):
                print("b %d %s" % (i, param.name))
                if param.grad_req != 'null':
                    print("c")
                    plg = param.list_grad()
                    print("c2")
                    self._kvstore.push(i, plg, priority=-i)
                    print("d")
                    if not self._update_on_kvstore:
                        print("e")
                        self._kvstore.pull(i, param.list_grad(), priority=-i, ignore_sparse=self._distributed)
                        print("f")
                    print("g")
                print("h")
            print("i")
        print("j")


if __name__ == "__main__":
    input_size = 480
    dataset_train = DummyDataSet(input_size)
    data_loader = gluon.data.DataLoader(dataset_train, 2, shuffle=True, last_batch='rollover', num_workers=4)
    net = get_segmentation_model(model='fcn', dataset='pascal_aug',
                                 backbone='resnet50', norm_layer=mx.gluon.contrib.nn.basic_layers.SyncBatchNorm,
                                 norm_kwargs={'num_devices': 2}, aux=True,
                                 crop_size=input_size)
    net.cast('float32')
    exec_contexts = [ mx.gpu(0), mx.gpu(1) ]
    net = DataParallelModel(net, exec_contexts)
    criterion = MixSoftmaxCrossEntropyLoss(True, aux_weight=0.5)
    criterion = DataParallelCriterion(criterion, exec_contexts, True)
    lr_scheduler = LRScheduler(mode='poly', baselr=0.001,
                               niters=len(dataset_train),
                               nepochs=30)
    optimizer_params = {'lr_scheduler': lr_scheduler,
                        'wd':0.0001,
                        'momentum': 0.9}
    kv = mx.kv.create('device')
    trainer = Trainer(net.module.collect_params(), 'sgd', optimizer_params, kvstore = kv)
    batch_size = 4
    for epoch in range(0, 30):
        print ("epoch", epoch)
        for i, (data, label) in enumerate(data_loader):
            lr_scheduler.update(i, epoch)
            with autograd.record(True):
                pred = net(data)
                #pred = upsize_parallel_output(pred)
                loss = criterion(pred, label)
                mx.nd.waitall()
                print ("-------- autograd.backward(loss)")
                autograd.backward(loss)
            print ("---------- trainer.step(batch_size)")
            trainer.step(batch_size)
            # DataParallelModel output is a tuple of tuples of NDArrays.
            pred = [ p[0] for p in pred ]
            pred = mx.ndarray.concat(*pred)
            predTop = mx.nd.argmax(pred, 1)
            predNP = predTop.reshape((-1,)).astype('uint8').asnumpy()
```
## Steps to reproduce
1. Run the above script, setting the kvstore type to either `local` or `device`.
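To switch between the two kvstore types without editing the script, the type could be taken from the command line. A minimal sketch, assuming a hypothetical `--kvstore` flag that the original script does not have:

```python
import argparse


def parse_args(argv=None):
    """Parse a hypothetical --kvstore option for the repro script."""
    parser = argparse.ArgumentParser(description="FCN multi-GPU repro")
    parser.add_argument("--kvstore", choices=["local", "device"], default="device",
                        help="kvstore type to pass to mx.kv.create()")
    return parser.parse_args(argv)


# In the script, the hard-coded line would then become:
#   args = parse_args()
#   kv = mx.kv.create(args.kvstore)
```

The script would then be run as `python train.py --kvstore local` or `python train.py --kvstore device`, assuming it is saved as `train.py`.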
## What have you tried to solve it?
1. Disabling Python garbage collection (gc) at the beginning of each epoch and re-enabling it at the end. This seemed to work in one similar-looking issue, but made no difference for me.
Note: I still get the same result when not using a sub-classed version of gluon.Trainer.
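The gc workaround mentioned above can be sketched with the standard `gc` module. This is an illustration of the general idea rather than the exact code that was tried; the context manager name `gc_paused` is hypothetical:

```python
import gc
from contextlib import contextmanager


@contextmanager
def gc_paused():
    """Disable automatic garbage collection inside the with-block,
    restoring the previous state afterwards."""
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        if was_enabled:
            gc.enable()


# Possible usage around one epoch of the training loop:
#   for epoch in range(0, 30):
#       with gc_paused():
#           ...  # iterate over data_loader and call trainer.step()
```

Reference counting still frees most objects while the collector is paused; only the cyclic collector is switched off.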