Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2018/05/05 15:41:11 UTC

[GitHub] dwSun opened a new issue #10824: Segmentation Fault when using as_in_context

URL: https://github.com/apache/incubator-mxnet/issues/10824
 
 
   ## Description
   When using `as_in_context` with `gluon.data.DataLoader`, the script always fails with an error: sometimes a segmentation fault, sometimes `double free or corruption`.
   
   
   ## Environment info (Required)
   This error can be reproduced with mxnet-mkl-1.1.0 from PyPI; I have tested it on 2 different machines.
   
   mxnet-mkl-1.2.0b20180503 does not reproduce this error.
   I am not sure about MXNet without MKL; it is too slow to finish my script.
   
   ## Error Message:
   ```
   ...
   249:[15784.430563088692] samples/s
   274:[15608.4416962894] samples/s
   299:[16002.520640366014] samples/s
   312:[16143.928433001587] samples/s
   ####### epoch 12 ########
   24:[9204.46913275453] samples/s
   49:[11987.921486806097] samples/s
   74:[14063.263894806989] samples/s
   99:[14007.013852797589] samples/s
   double free or corruption (fasttop)
   [1]    3253 abort      python3 train-test.py
   ```
   Sometimes it is a segmentation fault:
   ```
   ...
   299:[5396.608145014218] samples/s
   312:[5540.875926004944] samples/s
   ####### epoch 1 ########
   24:[9079.366178889948] samples/s
   49:[13414.984817682252] samples/s
   
   Segmentation fault: 11
   
   Stack trace returned 4 entries:
   [bt] (0) /home/david/.virtualenvs/mxnet-mkl/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x15fbe2) [0x7efde3bfdbe2]
   [bt] (1) /home/david/.virtualenvs/mxnet-mkl/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x277923e) [0x7efde621723e]
   [bt] (2) /lib/x86_64-linux-gnu/libc.so.6(+0x34f00) [0x7efdf4125f00]
   [bt] (3) [0x7efdf0098010]
   [1]    4057 segmentation fault  python3 train-test.py
   ```
   
   ## Minimum reproducible example
   This is the script described above:
   ```py
   #!/usr/bin/env python3
   import time
   from multiprocessing import cpu_count
   
   import mxnet as mx
   from mxnet.gluon.data import dataset
   
   CPU_COUNT = cpu_count()
   
   
   class MyDataset(dataset.Dataset):
       def __init__(self):
           """Init."""
   
       def __getitem__(self, idx):
           """Get item."""
           img = mx.nd.random.uniform(shape=(24, 24, 3))
           return img, 0
   
       def __len__(self):
           """Dataset length."""
           return 10000
   
   
   train_dataset = MyDataset()
   
   batch_size = 32
   train_data_loader = mx.gluon.data.DataLoader(
       train_dataset, batch_size, shuffle=True, num_workers=CPU_COUNT)
   
   ctx = mx.cpu()
   ctx = mx.gpu() if mx.test_utils.list_gpus() else mx.cpu()
   
   epochs = 50
   for epoch in range(epochs):
       print('####### epoch {} ########'.format(epoch))
       # Training loop (autograd and trainer steps omitted for brevity).
       start = time.time()
       training_samples = 0
       for batch_idx, (data, label) in enumerate(train_data_loader):
           
           # Commenting out these 2 lines lets the script finish without error.
           data = data.as_in_context(ctx)
           label = label.as_in_context(ctx)
   
           training_samples += data.shape[0]
           if training_samples > 0 and training_samples % 100 == 0:
               elapse = time.time() - start
               print('{}:[{}] samples/s'.format(batch_idx,
                                                training_samples/elapse))
   
   ```
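The progress lines in the error logs (`24:`, `49:`, ..., `299:`, `312:`) follow directly from the loop's print condition. The sketch below is stdlib-only and does not import MXNet. Its helper, `logged_batches`, is hypothetical and not part of the reproducer; it replays that condition using the reproducer's dataset length (10000) and batch size (32). It assumes the DataLoader keeps the last partial batch of 16 samples, which is consistent with the `312:` lines in the logs.

```python
import math

# Values taken from the reproducer script.
DATASET_LEN = 10000
BATCH_SIZE = 32


def logged_batches(dataset_len=DATASET_LEN, batch_size=BATCH_SIZE, every=100):
    """Return the batch indices at which the reproducer prints a samples/s line.

    Hypothetical helper: mirrors `training_samples % 100 == 0` from the loop,
    assuming the last partial batch is kept rather than discarded.
    """
    num_batches = math.ceil(dataset_len / batch_size)  # 313 batches for 10000/32
    printed = []
    training_samples = 0
    for batch_idx in range(num_batches):
        # The final batch holds only the remaining samples (16 here).
        training_samples += min(batch_size, dataset_len - batch_idx * batch_size)
        if training_samples > 0 and training_samples % every == 0:
            printed.append(batch_idx)
    return printed


print(logged_batches())
```

With 32-sample batches, the running total is a multiple of 100 only after every 25th batch (indices 24, 49, ..., 299). The one other case is the final batch, where the total reaches exactly 10000.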
   
   
   Also, I tried another script:
   ```py
   #!/usr/bin/env python3
   
   import mxnet as mx
   
   from multiprocessing import Pool
   
   def run(par):
       ctx = mx.cpu()
       print('####### run {} ########'.format(par))
       for x in range(5000):
           data = mx.nd.random.uniform(shape=(32, 320, 320, 3))
           data = data.as_in_context(ctx)
   
   p = Pool()
   p.map(run, range(50))
   p.close()
   p.join()
   ```
   This script finishes without any error, with or without MKL.
