Posted to commits@mxnet.apache.org by GitBox <gi...@apache.org> on 2020/01/30 19:40:07 UTC

[GitHub] [incubator-mxnet] Kh4L opened a new issue #17485: IndexError in gluon-cv Mask-RCNN validation on master

Kh4L opened a new issue #17485: IndexError in gluon-cv Mask-RCNN validation on master
URL: https://github.com/apache/incubator-mxnet/issues/17485
 
 
   ## Description
   
   An IndexError occurs during the first validation step when [training gluon-cv Mask-RCNN](https://github.com/dmlc/gluon-cv/blob/master/scripts/instance/mask_rcnn/train_mask_rcnn.py) with Horovod.
   
   ### Error Message
   
   ```
   [1,2]<stderr>:IndexError: index 999 is out of bounds for axis 1 with size 500
   [1,5]<stderr>:Traceback (most recent call last):
   [1,5]<stderr>:  File "gluon-cv/scripts/instance/mask_rcnn/train_mask_rcnn.py", line 615, in <module>
   [1,5]<stderr>:    train(net, train_data, val_data, eval_metric, batch_size, ctx, logger, args)
   [1,5]<stderr>:  File "gluon-cv/scripts/instance/mask_rcnn/train_mask_rcnn.py", line 540, in train
   [1,5]<stderr>:    args)
   [1,5]<stderr>:  File "gluon-cv/scripts/instance/mask_rcnn/train_mask_rcnn.py", line 275, in validate
   [1,5]<stderr>:    det_bbox = det_bbox[i].asnumpy()
   [1,5]<stderr>:  File "/workspace/incubator-mxnet/python/mxnet/ndarray/ndarray.py", line 2554, in asnumpy
   [1,5]<stderr>:    ctypes.c_size_t(data.size)))
   [1,5]<stderr>:  File "/workspace/incubator-mxnet/python/mxnet/base.py", line 273, in check_call
   [1,5]<stderr>:    raise get_last_ffi_error()
   [1,5]<stderr>:IndexError: Traceback (most recent call last):
   [1,5]<stderr>:  File "src/operator/tensor/indexing_op.cu", line 461
   [1,5]<stderr>:IndexError: index 999 is out of bounds for axis 1 with size 500
   --------------------------------------------------------------------------
   Primary job  terminated normally, but 1 process returned
   a non-zero exit code. Per user-direction, the job has been aborted.
   --------------------------------------------------------------------------
   --------------------------------------------------------------------------
   mpirun detected that one or more processes exited with non-zero status, thus causing
   the job to be terminated. The first process to do so was:
     Process name: [[39172,1],0]
     Exit code:    1
   --------------------------------------------------------------------------
   ```
   
   ### Steps to reproduce
   
   1. Compile and install mxnet master
   2. Get gluon-cv master
   3. Run `horovodrun -np 8 -H localhost:8 python ./gluon-cv/scripts/instance/mask_rcnn/train_mask_rcnn.py --dataset coco -j 4 --log-interval 1000 --use-fpn --horovod --amp --batch-size 16 --lr 0.02 --lr-warmup 500 --epochs 1`
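
For orientation, here is a small sketch of the batch-size arithmetic behind the command in step 3. It assumes that `--batch-size 16` is a global batch size split evenly across the 8 Horovod processes; the thread does not state this explicitly, and the helper `per_gpu_batch_size` is hypothetical, not part of gluon-cv.

```python
def per_gpu_batch_size(global_batch_size, num_workers):
    """Split a global batch evenly across workers.

    Assumption: the training script divides --batch-size by the number of
    Horovod processes. This is an illustration, not gluon-cv code.
    """
    if global_batch_size % num_workers != 0:
        raise ValueError("global batch size must be divisible by the worker count")
    return global_batch_size // num_workers


# Values taken from the command above: --batch-size 16, -np 8.
images_per_gpu = per_gpu_batch_size(16, 8)
print(images_per_gpu)  # 2
```

Under that assumption the command trains with 2 images per GPU, which is the configuration reported as failing in the comments on this issue.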
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [incubator-mxnet] zhreshold closed issue #17485: IndexError in gluon-cv Mask-RCNN validation on master

Posted by GitBox <gi...@apache.org>.
zhreshold closed issue #17485:
URL: https://github.com/apache/incubator-mxnet/issues/17485


   



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@mxnet.apache.org
For additional commands, e-mail: issues-help@mxnet.apache.org


[GitHub] [incubator-mxnet] zhreshold commented on issue #17485: IndexError in gluon-cv Mask-RCNN validation on master

Posted by GitBox <gi...@apache.org>.
zhreshold commented on issue #17485:
URL: https://github.com/apache/incubator-mxnet/issues/17485#issuecomment-764904092


   @karan6181 
   After investigating Mask R-CNN, I think a simple fix should allow validation to work with models initialized with `batch_size > 1`.
   
   Here: https://github.com/dmlc/gluon-cv/blob/ecf491685018e951bec12f9e45bc482749de4f85/gluoncv/model_zoo/rcnn/mask_rcnn/mask_rcnn.py#L85
   
   We can change this line to:
   
   ```python
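   # assumes `from mxnet import autograd` is already imported in mask_rcnn.py
   if autograd.is_training():
       # training: split the leading axis into (batch_images, RoIs per image)
       x = x.reshape((-4, self._batch_images, -1, 0, 0, 0))
   else:
       # always use batch_size = 1 for inference
       x = x.reshape((-4, 1, -1, 0, 0, 0))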
   ```
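
To see why the reshape matters, here is a minimal pure-Python sketch of the shape arithmetic only; it is not the gluon-cv implementation. The helpers `split_leading_axis` and `batch_images_for_mode` are hypothetical. The first mimics how MXNet's `reshape` treats `-4` (split one axis in two) and `0` (copy the input dimension). The input shape `(1000, 80, 28, 28)` is an assumption chosen for illustration: one validation image with 1000 RoIs. Under that assumption, a model configured for 2 images per GPU splits the RoI axis into 2 groups of 500, which would be consistent with `index 999 is out of bounds for axis 1 with size 500`, although the thread does not confirm that this is the exact mechanism.

```python
def split_leading_axis(input_shape, groups):
    """Shape produced by reshape((-4, groups, -1, 0, 0, 0)) on a 4-D input.

    -4 splits the leading axis into (groups, inferred); each 0 keeps the
    matching input dimension unchanged.
    """
    leading, *rest = input_shape
    if leading % groups != 0:
        raise ValueError(f"cannot split axis of size {leading} into {groups} groups")
    return (groups, leading // groups, *rest)


def batch_images_for_mode(configured, training):
    """Proposed fix: use the configured batch size only while training."""
    return configured if training else 1


# Assumed input: ONE validation image with 1000 RoIs (illustrative numbers).
val_input = (1000, 80, 28, 28)
configured = 2  # model built with 2 images per GPU

# Without the fix, the configured value is used even during validation.
old_shape = split_leading_axis(val_input, configured)
# With the fix, inference always uses a batch of 1.
new_shape = split_leading_axis(val_input, batch_images_for_mode(configured, training=False))

print(old_shape)  # (2, 500, 80, 28, 28)
print(new_shape)  # (1, 1000, 80, 28, 28)
```

With the inference branch, all 1000 RoIs stay on axis 1, so an index such as 999 remains in bounds.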



[GitHub] [incubator-mxnet] karan6181 commented on issue #17485: IndexError in gluon-cv Mask-RCNN validation on master

Posted by GitBox <gi...@apache.org>.
karan6181 commented on issue #17485:
URL: https://github.com/apache/incubator-mxnet/issues/17485#issuecomment-764860187


   - I debugged the code a bit and found that this line (https://github.com/dmlc/gluon-cv/blob/master/scripts/instance/mask_rcnn/train_mask_rcnn.py#L694) might be the culprit, or at least the place where we should start looking.
   - With `batch_size=2 per GPU`, I think the training data loader passes 2 images per GPU during the forward pass. Validation, however, doesn't support multi-image batches: it always uses 1 image per GPU, regardless of the `batch_size` the user passes. Since the model expects 2 images per GPU and validation passes 1, we see the above error. If I train with `batch_size=2 per GPU`, save the model params, and then run validation by loading the same params with `batch_size=1 per GPU`, validation works. So the problem has something to do with how `per_device_batch_size` is managed in https://github.com/dmlc/gluon-cv/blob/master/gluoncv/model_zoo/rcnn/faster_rcnn/faster_rcnn.py or https://github.com/dmlc/gluon-cv/blob/master/gluoncv/model_zoo/rcnn/mask_rcnn/rcnn_target.py. I might lack some background knowledge on this, but this is what I found.



[GitHub] [incubator-mxnet] ChaiBapchya commented on issue #17485: IndexError in gluon-cv Mask-RCNN validation on master

Posted by GitBox <gi...@apache.org>.
ChaiBapchya commented on issue #17485:
URL: https://github.com/apache/incubator-mxnet/issues/17485#issuecomment-765034237


   @karan6181 tested it on EC2 instances and verified that this works.
   Thanks @zhreshold for pointing it out & @karan6181 for helping with investigation.



[GitHub] [incubator-mxnet] ChaiBapchya commented on issue #17485: IndexError in gluon-cv Mask-RCNN validation on master

Posted by GitBox <gi...@apache.org>.
ChaiBapchya commented on issue #17485:
URL: https://github.com/apache/incubator-mxnet/issues/17485#issuecomment-764071515


   @Kh4L was this issue fixed? If yes, what was the workaround?



[GitHub] [incubator-mxnet] ChaiBapchya edited a comment on issue #17485: IndexError in gluon-cv Mask-RCNN validation on master

Posted by GitBox <gi...@apache.org>.
ChaiBapchya edited a comment on issue #17485:
URL: https://github.com/apache/incubator-mxnet/issues/17485#issuecomment-765034237


   @karan6181 tested it on EC2 instances and verified that this works.
   I tested it in a Docker environment. In both places, we were able to train with `per_device_batch_size > 1`, while inference used `per_device_batch_size` set to 1.
   Thanks @zhreshold for pointing it out & @karan6181 for helping with investigation.




[GitHub] [incubator-mxnet] karan6181 commented on issue #17485: IndexError in gluon-cv Mask-RCNN validation on master

Posted by GitBox <gi...@apache.org>.
karan6181 commented on issue #17485:
URL: https://github.com/apache/incubator-mxnet/issues/17485#issuecomment-764815167


   The issue occurs when the batch size is greater than 1 per GPU. With a batch size of 1 per GPU, validation works fine; with a batch size greater than 1 per GPU, users see the error `IndexError: index 999 is out of bounds for axis 1 with size 500`. This looks like a bug. FYI @zhreshold



[GitHub] [incubator-mxnet] karan6181 commented on issue #17485: IndexError in gluon-cv Mask-RCNN validation on master

Posted by GitBox <gi...@apache.org>.
karan6181 commented on issue #17485:
URL: https://github.com/apache/incubator-mxnet/issues/17485#issuecomment-764834200


   I did some analysis on different `batch_size` hyperparameter configurations:
   
   1. With `batch_size=1 per GPU`, training plus validation (after every epoch) works without any issue.
   2. With `batch_size=2 per GPU`, training works if we don't run validation at all.
   3. With `batch_size=2 per GPU`, training works but validation fails, whether validation runs after every epoch or only at the end of training.
   4. If we save the model params after training with `batch_size=1 per GPU` and then run validation separately by loading those params with `batch_size=1 per GPU`, it works. Loading the same params with `batch_size=2 per GPU` does not work.
   
   Note: Validation doesn't support multi-image batches. It always runs with 1 image per GPU, regardless of the `batch_size` the user provides.
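
One way to read the four observations is as a single rule: validation succeeds only when the per-GPU batch size the model is built with matches the one image per GPU that validation feeds it. The sketch below encodes that reading; the function `validation_should_work` is hypothetical and only summarizes the reported outcomes, it does not call gluon-cv.

```python
# Validation always feeds one image per GPU (per the note above).
VAL_IMAGES_PER_GPU = 1


def validation_should_work(model_batch_size_per_gpu):
    """True when the model's configured per-GPU batch size matches the
    single image per GPU used during validation."""
    return model_batch_size_per_gpu == VAL_IMAGES_PER_GPU


# (per-GPU batch size the model is built with, outcome reported in the list above)
reported = [
    (1, True),   # case 1: train and validate with batch_size=1 per GPU
    (2, False),  # case 3: train with batch_size=2 per GPU, validation fails
    (1, True),   # case 4: params reloaded with batch_size=1 per GPU
    (2, False),  # case 4: same params reloaded with batch_size=2 per GPU
]

# The rule reproduces every reported outcome.
for batch_size_per_gpu, worked in reported:
    assert validation_should_work(batch_size_per_gpu) == worked
```

Under this reading, the saved weights are not the problem; the batch size the model is configured with at validation time is.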


