Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2018/05/05 15:41:11 UTC
[GitHub] dwSun opened a new issue #10824: Segmentation Fault when using as_in_context
URL: https://github.com/apache/incubator-mxnet/issues/10824
## Description
When using `as_in_context` with `gluon.data.DataLoader`, the script always fails with an error: sometimes a segmentation fault, sometimes "double free or corruption".
## Environment info (Required)
The error can be reproduced with mxnet-mkl 1.1.0 from PyPI; I have tested it on 2 different machines.
With mxnet-mkl 1.2.0b20180503 the error does not reproduce.
I am not sure about mxnet without MKL; it is too slow to finish my script.
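For reference, the installed build can be reported with a small self-contained snippet (the `installed_version` helper is hypothetical, not part of MXNet; it only reads the package's `__version__` attribute when the package is importable):

```python
import importlib
import importlib.util


def installed_version(pkg):
    """Return pkg.__version__ if pkg is importable, else None.

    Hypothetical diagnostic helper: works for any top-level package
    that exposes a __version__ attribute, e.g. mxnet.
    """
    if importlib.util.find_spec(pkg) is None:
        return None
    mod = importlib.import_module(pkg)
    return getattr(mod, "__version__", None)


print("mxnet:", installed_version("mxnet"))
```

Running this under each virtualenv makes it unambiguous which wheel (mxnet vs. mxnet-mkl, 1.1.0 vs. the 1.2.0 nightly) produced a given crash log.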
## Error Message:
```
...
249:[15784.430563088692] samples/s
274:[15608.4416962894] samples/s
299:[16002.520640366014] samples/s
312:[16143.928433001587] samples/s
####### epoch 12 ########
24:[9204.46913275453] samples/s
49:[11987.921486806097] samples/s
74:[14063.263894806989] samples/s
99:[14007.013852797589] samples/s
double free or corruption (fasttop)
[1] 3253 abort python3 train-test.py
```
Sometimes it is a segmentation fault:
```
...
299:[5396.608145014218] samples/s
312:[5540.875926004944] samples/s
####### epoch 1 ########
24:[9079.366178889948] samples/s
49:[13414.984817682252] samples/s
Segmentation fault: 11
Stack trace returned 4 entries:
[bt] (0) /home/david/.virtualenvs/mxnet-mkl/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x15fbe2) [0x7efde3bfdbe2]
[bt] (1) /home/david/.virtualenvs/mxnet-mkl/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x277923e) [0x7efde621723e]
[bt] (2) /lib/x86_64-linux-gnu/libc.so.6(+0x34f00) [0x7efdf4125f00]
[bt] (3) [0x7efdf0098010]
[1] 4057 segmentation fault python3 train-test.py
```
## Minimum reproducible example
This is the script described above.
```py
#!/usr/bin/env python3
import time
from multiprocessing import cpu_count

import mxnet as mx
from mxnet.gluon.data import dataset

CPU_COUNT = cpu_count()


class MyDataset(dataset.Dataset):
    def __init__(self):
        """Init."""

    def __getitem__(self, idx):
        """Get item."""
        img = mx.nd.random.uniform(shape=(24, 24, 3))
        return img, 0

    def __len__(self):
        """Dataset length."""
        return 10000


train_dataset = MyDataset()
batch_size = 32
train_data_loader = mx.gluon.data.DataLoader(
    train_dataset, batch_size, shuffle=True, num_workers=CPU_COUNT)

ctx = mx.gpu() if mx.test_utils.list_gpus() else mx.cpu()

epochs = 50
for epoch in range(epochs):
    print('####### epoch {} ########'.format(epoch))
    # training loop (with autograd and trainer steps, etc.)
    start = time.time()
    training_samples = 0
    for batch_idx, (data, label) in enumerate(train_data_loader):
        # comment out these 2 lines and this script finishes without error
        data = data.as_in_context(ctx)
        label = label.as_in_context(ctx)

        training_samples += data.shape[0]
        if training_samples > 0 and training_samples % 100 == 0:
            elapse = time.time() - start
            print('{}:[{}] samples/s'.format(batch_idx,
                                             training_samples / elapse))
```
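Since commenting out the two `as_in_context` lines makes the script finish cleanly, one hedged workaround sketch (an assumption on my part, not verified against mxnet-mkl 1.1.0) is to skip the copy whenever the batch is already on the target context. The helper below is hypothetical; `context` and `as_in_context` follow `mxnet.nd.NDArray`'s API, and a small stub stands in for an NDArray so the control flow can be shown without MXNet installed:

```python
def move_to_ctx(array, ctx):
    """Copy `array` to `ctx` only when it is not already there.

    Hypothetical helper: `array.context` and `array.as_in_context`
    mirror mxnet.nd.NDArray's API, but any object with those
    attributes works.
    """
    if array.context == ctx:
        return array
    return array.as_in_context(ctx)


class FakeArray:
    """Minimal stand-in for an NDArray, tracking how often it was copied."""

    def __init__(self, context):
        self.context = context
        self.copies = 0

    def as_in_context(self, ctx):
        out = FakeArray(ctx)
        out.copies = self.copies + 1
        return out


a = FakeArray('cpu(0)')
same = move_to_ctx(a, 'cpu(0)')   # same object back, no copy made
moved = move_to_ctx(a, 'gpu(0)')  # a fresh copy on the other context
```

In the DataLoader loop this would replace `data = data.as_in_context(ctx)` with `data = move_to_ctx(data, ctx)`; whether that actually avoids the crash on 1.1.0 is untested.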
Also, I tried another script:
```py
#!/usr/bin/env python3
from multiprocessing import Pool

import mxnet as mx


def run(par):
    ctx = mx.cpu()
    print('####### run {} ########'.format(par))
    for x in range(5000):
        data = mx.nd.random.uniform(shape=(32, 320, 320, 3))
        data = data.as_in_context(ctx)


if __name__ == '__main__':
    p = Pool()
    p.map(run, range(50))
    p.close()
    p.join()
```
This script finishes without any error, with or without MKL.