Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2018/08/18 11:33:39 UTC

[GitHub] al-rigazzi opened a new issue #12240: ImageIter crashing with 64k samples per batch

URL: https://github.com/apache/incubator-mxnet/issues/12240
 
 
   Hello,
   
   I am experimenting with new schemes to train CNNs in parallel. I am using the script `train_imagenet.py` with ResNet50 and the full ImageNet1k data set.
   
   Basically, I am running CPU-only instances, putting 1024 samples on each instance. Up to 32 instances, thus 32k samples per batch in total, everything runs smoothly.
   
   As soon as I switch to 64 nodes, thus 64k samples per batch, I see the code crash in iter_image_recordio_2.cc.
   
   It **always** happens after 17 epochs (if I restart the training from epoch 10, it will crash at epoch 27). The error I get is:
   
   > Traceback (most recent call last):
   >   File "example/image-classification/train_imagenet.py", line 58, in <module>
   >     fit.fit(args, sym, data.get_rec_iter)
   >   File "/home/users/arigazzi/mxnet/example/image-classification/common/fit.py", line 318, in fit
   >     monitor=monitor)
   >   File "/home/users/arigazzi/mxnet/example/image-classification/common/../../../python/mxnet/module/base_module.py", line 564, in fit
   >     batch_end_callback=eval_batch_end_callback, epoch=epoch)
   >   File "/home/users/arigazzi/mxnet/example/image-classification/common/../../../python/mxnet/module/base_module.py", line 248, in score
   >     for nbatch, eval_batch in enumerate(eval_data):
   >   File "/home/users/arigazzi/mxnet/example/image-classification/common/../../../python/mxnet/io.py", line 232, in __next__
   >     return self.next()
   >   File "/home/users/arigazzi/mxnet/example/image-classification/common/../../../python/mxnet/io.py", line 834, in next
   >     check_call(_LIB.MXDataIterNext(self.handle, ctypes.byref(next_res)))
   >   File "/home/users/arigazzi/mxnet/example/image-classification/common/../../../python/mxnet/base.py", line 210, in check_call
   >     raise MXNetError(py_str(_LIB.MXGetLastError()))
   > mxnet.base.MXNetError: [04:44:26] /home/users/arigazzi/mxnet/src/io/iter_image_recordio_2.cc:318: Check failed: !overflow number of input images must be bigger than the batch size
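
One thing that may be worth checking, since the trace ends in `score` iterating over `eval_data`: whether each worker's share of the validation set is still at least one batch. The sketch below is only arithmetic under two assumptions that are not confirmed in this report: that the validation record file holds the standard 50,000 ImageNet1k validation images, and that it is split evenly across workers (as the iterator's `num_parts` / `part_index` options would do). `records_per_worker` is a hypothetical helper, not an MXNet function.

```python
# Hypothetical helper (not part of MXNet): number of records one worker
# would see if a record file were split evenly across all workers.
def records_per_worker(num_records, num_workers):
    return num_records // num_workers

VAL_IMAGES = 50_000       # assumed: standard ImageNet1k validation set size
BATCH_PER_WORKER = 1024   # per-instance batch size from the setup above

for workers in (32, 64):
    share = records_per_worker(VAL_IMAGES, workers)
    # Print whether one worker's share can fill at least one full batch.
    print(workers, share, share >= BATCH_PER_WORKER)
```

Under these assumptions, a worker's share is above 1024 with 32 workers and below 1024 with 64, which would be consistent with the wording of the failed check. It would not by itself explain why the failure only shows up after 17 epochs.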
   
   Can anyone help me debug this?
   
   Two main questions:
   - Why is the overflow flag set? There should be enough samples to fill the batch, as batches_per_epoch is computed by MXNet. 
   - What does the option `round_batch` do? Should I try to unset it?
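
For context on the second question: the MXNet iterator documentation describes `round_batch` as controlling whether round robin is used to handle an overflow batch. The following plain-Python sketch illustrates that general idea (padding a short final batch with samples from the start of the data). It is an illustration under that reading of the docs, not MXNet's implementation, and `round_robin_batches` is a hypothetical name.

```python
def round_robin_batches(samples, batch_size):
    """Yield (batch, pad) pairs; a short final batch is padded from the start."""
    for start in range(0, len(samples), batch_size):
        batch = samples[start:start + batch_size]
        pad = batch_size - len(batch)      # 0 unless this is a short last batch
        if pad:
            batch = batch + samples[:pad]  # wrap around to the beginning
        yield batch, pad

# Toy data: 10 samples, batch size 4, so the last batch is 2 samples short.
batches = list(round_robin_batches(list(range(10)), 4))
for batch, pad in batches:
    print(batch, pad)
```

In this toy run the last batch borrows two samples from the beginning of the data, and `pad` records how many entries were borrowed.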
   
   Thanks in advance.
