
Posted to issues@mxnet.apache.org by GitBox <gi...@apache.org> on 2020/11/09 07:49:07 UTC

[GitHub] [incubator-mxnet] seekFire opened a new issue #19499: Error occurs when I used weight decay in optimizer_params

seekFire opened a new issue #19499:
URL: https://github.com/apache/incubator-mxnet/issues/19499


   ## Description
   I set the weight decay under the key 'wd' of the dict "optimizer_params" as follows:
   optimizer_params={'learning_rate': lr, 'momentum': momentum,
                                   'lr_scheduler': scheduler, 'wd': weight_decay}
   
   The value of "weight_decay" is 1e-4, but an error is raised at the line:
   `trainer.step(batch_size=batchsize)`
   
   ### Error Message
   The error message is as follows:
   File "/home/bst/anaconda2/envs/Python3/lib/python3.7/site-packages/mxnet/optimizer/optimizer.py", line 487, in _get_wds
       wds[i] *= self.param_dict[index].wd_mult
   TypeError: can't multiply sequence by non-int of type 'float'
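
The TypeError above is the message Python gives when a list, tuple, or string is multiplied by a float. The following is a minimal sketch, assuming (the thread does not confirm this) that the value stored under 'wd' reached the optimizer as a sequence rather than as the float 1e-4. The variable names are hypothetical stand-ins, not MXNet code:

```python
# Hypothetical stand-ins for the optimizer's weight decay and a parameter's wd_mult.
wd_as_sequence = [1e-4]   # assumption: 'wd' arrived as a list instead of a float
wd_mult = 1.0

try:
    wd_as_sequence * wd_mult
except TypeError as exc:
    message = str(exc)

print(message)

# Converting to a plain float before building optimizer_params avoids the failure.
wd_as_float = float(wd_as_sequence[0])
scaled = wd_as_float * wd_mult
```

A tuple created by a trailing comma or a string read from the command line would fail in the same way, so printing `type(weight_decay)` just before building `optimizer_params` would show whether this assumption applies.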
   
   ## What have you tried to solve it?
   When I remove the key 'wd' from the dict "optimizer_params", the error above no longer occurs, but a new error occurs in the logging stage:
   ```
   logger.info("Training: [epoch: %d, steps: %d, learning_rate: %.2e, batch_loss: %.4f, batch_time: %.2fs]"
                           % (i, step_num, trainer.learning_rate, batch_loss.mean().asscalar(), cost_batch))
   ```
   
   The new error message is as follows:
   mxnet.base.MXNetError: Traceback (most recent call last):
     File "src/storage/./pooled_storage_manager.h", line 161
   MXNetError: cudaMalloc retry failed: out of memory
   
   ## Environment
   Python3.7
   mxnet-cu102==1.7.0
   
   So could you please help me with these questions?
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@mxnet.apache.org
For additional commands, e-mail: issues-help@mxnet.apache.org

[GitHub] [incubator-mxnet] seekFire commented on issue #19499: cudaMalloc retry failed

Posted by GitBox <gi...@apache.org>.

seekFire commented on issue #19499:
URL: https://github.com/apache/incubator-mxnet/issues/19499#issuecomment-725829524


   @leezu 
   No, I don't. But the tensor being manipulated is of type **mxnet.ndarray.ndarray.NDArray** rather than **numpy.ndarray**, and the special values like 0, -1, -3 are defined in the **reshape** function of **mxnet.ndarray.ndarray.NDArray**. When I used an older version of mxnet (1.5.0), there was no such error... So I wonder whether it is a bug.



[GitHub] [incubator-mxnet] seekFire edited a comment on issue #19499: cudaMalloc retry failed

Posted by GitBox <gi...@apache.org>.

seekFire edited a comment on issue #19499:
URL: https://github.com/apache/incubator-mxnet/issues/19499#issuecomment-725839386


   @leezu 
   I'm doing a semantic segmentation task, and because I use sparse labels, the shape of the label needs to be converted from (n, c, h, w) to (n, h\*w), and the shape of the prediction also needs to be converted from (n, c, h, w) to (n, h\*w, c).
   I just wonder whether the error occurred because I set **last_batch** of the dataloader to '**keep**', so the batch size is not fixed...



[GitHub] [incubator-mxnet] seekFire commented on issue #19499: cudaMalloc retry failed

Posted by GitBox <gi...@apache.org>.

seekFire commented on issue #19499:
URL: https://github.com/apache/incubator-mxnet/issues/19499#issuecomment-725878826


   @leezu 
   I think I may have found the cause of the error: when I use the class **mx.metric.CustomMetric** to wrap my custom metric function, the tensors in this function are converted from **mxnet.ndarray.ndarray.NDArray** to **numpy.ndarray**, which produces this error. I verified this as follows, using the same script as above except with **numpy** in place of **mxnet.ndarray**:
   
   ```
   >>> import numpy as np
   >>> x = np.ones((2, 3, 4, 5))
   >>> y = x.transpose((0, 3, 1, 2)).reshape(0, -3, -1)
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
   ValueError: can only specify one unknown dimension
   ```
   The error is the same as the one mentioned above.
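
NumPy does not implement MXNet's special reshape codes, which is consistent with the ValueError above. The following is a minimal sketch of one way to keep the `(0, -3, -1)` spelling when the metric function receives `numpy.ndarray` inputs: expand the codes into an explicit shape first. The helper name `resolve_special_shape` is hypothetical, and the function is a simplified re-implementation covering only the codes 0, -1 and -3, not MXNet's own code:

```python
from functools import reduce
from operator import mul


def product(values):
    """Multiply all values together (1 for an empty sequence)."""
    return reduce(mul, values, 1)


def resolve_special_shape(in_shape, spec):
    """Expand the special codes 0, -1 and -3 into an explicit shape tuple.

    Simplified sketch: the codes -2 and -4 are not handled.
    """
    out = []
    src = 0            # position in in_shape consumed so far
    infer_at = None    # index in out that the -1 entry should fill
    for code in spec:
        if code == 0:          # copy this dimension from the input
            out.append(in_shape[src])
            src += 1
        elif code == -3:       # merge two consecutive input dimensions
            out.append(in_shape[src] * in_shape[src + 1])
            src += 2
        elif code == -1:       # infer from the remaining elements
            if infer_at is not None:
                raise ValueError("only one -1 is allowed")
            infer_at = len(out)
            out.append(1)
            src += 1
        elif code > 0:         # ordinary explicit dimension
            out.append(code)
            src += 1
        else:
            raise NotImplementedError("code %d is not covered by this sketch" % code)
    if infer_at is not None:
        out[infer_at] = product(in_shape) // product(out)
    return tuple(out)


# Shape after the transpose in the shell example, then the (1, 512, 512, 2) label case.
print(resolve_special_shape((2, 5, 3, 4), (0, -3, -1)))
print(resolve_special_shape((1, 512, 512, 2), (0, -3, -1)))
```

Inside the metric function the result can be passed to NumPy directly, e.g. `labels.reshape(resolve_special_shape(labels.shape, (0, -3, -1)))`. For this particular case, `labels.reshape(labels.shape[0], -1, labels.shape[-1])` is an equivalent plain NumPy spelling.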
   



[GitHub] [incubator-mxnet] seekFire edited a comment on issue #19499: cudaMalloc retry failed

Posted by GitBox <gi...@apache.org>.

seekFire edited a comment on issue #19499:
URL: https://github.com/apache/incubator-mxnet/issues/19499#issuecomment-725361524


   @szha 
   Well, I think the question above is not important to me, as long as the model can be trained with at least one image per batch... BTW, when I run a script like the one below in the interactive shell, it works OK:
   ```
   >>> import mxnet as mx
   >>> x = mx.nd.ones((2, 3, 4, 5))
   >>> y = x.transpose((0, 3, 1, 2)).reshape(0, -3, -1)
   >>> y.shape    # (2, 15, 4)
   ```
   But the reshape operation in my custom metric function for the segmentation task raises an error during training:
   `labels = labels.transpose((0, 2, 3, 1)).reshape(0, -3, -1).argmax(-1)`
   
   The error message is shown below:
   `ValueError: can only specify one unknown dimension`
   
   And I don't think `reshape(0, -3, -1)` is ambiguous for a 4-dim tensor; furthermore, the documentation of the function **mx.nd.NDArray.reshape** has a similar demo...
   
   When I change `reshape(0, -3, -1)` to `reshape(0, -3, 0)`, the error changes to the following:
   `ValueError: cannot reshape array of size 524288 into shape (0,newaxis,0)`
   
   So what do you think about the cause?



[GitHub] [incubator-mxnet] szha closed issue #19499: cudaMalloc retry failed

Posted by GitBox <gi...@apache.org>.

szha closed issue #19499:
URL: https://github.com/apache/incubator-mxnet/issues/19499


   



[GitHub] [incubator-mxnet] seekFire commented on issue #19499: cudaMalloc retry failed

Posted by GitBox <gi...@apache.org>.

seekFire commented on issue #19499:
URL: https://github.com/apache/incubator-mxnet/issues/19499#issuecomment-725834831


   @leezu 
   I tried your suggestion by adding the line:
   `mx.npx.set_np()`
   But it didn't work, and a new error occurred...



[GitHub] [incubator-mxnet] seekFire edited a comment on issue #19499: cudaMalloc retry failed

Posted by GitBox <gi...@apache.org>.

seekFire edited a comment on issue #19499:
URL: https://github.com/apache/incubator-mxnet/issues/19499#issuecomment-725361524


   @szha 
   Well, I think this question is not important to me, as long as the model can be trained with at least one image per batch... BTW, when I run a script like the one below in the interactive shell, it works OK:
   ```
   >>> import mxnet as mx
   >>> x = mx.nd.ones((2, 3, 4, 5))
   >>> y = x.transpose((0, 3, 1, 2)).reshape(0, -3, -1)
   >>> y.shape    # (2, 15, 4)
   ```
   But the reshape operation in my custom metric function raises an error during training:
   `labels = labels.transpose((0, 2, 3, 1)).reshape(0, -3, -1).argmax(-1)`
   
   The error message is shown below:
   `ValueError: can only specify one unknown dimension`
   
   And I don't think `reshape(0, -3, -1)` is ambiguous for a 4-dim tensor; furthermore, the documentation of the function **mx.nd.NDArray.reshape** has a similar demo...
   
   When I change `reshape(0, -3, -1)` to `reshape(0, -3, 0)`, the error changes to the following:
   `ValueError: cannot reshape array of size 524288 into shape (0,newaxis,0)`
   
   So what do you think about the cause?



[GitHub] [incubator-mxnet] github-actions[bot] commented on issue #19499: Error occurs when I used weight decay in optimizer_params

Posted by GitBox <gi...@apache.org>.

github-actions[bot] commented on issue #19499:
URL: https://github.com/apache/incubator-mxnet/issues/19499#issuecomment-723832422


   Welcome to Apache MXNet (incubating)! We are on a mission to democratize AI, and we are glad that you are contributing to it by opening this issue.
   Please make sure to include all the relevant context, and one of the @apache/mxnet-committers will be here shortly.
   If you are interested in contributing to our project, let us know! Also, be sure to check out our guide on [contributing to MXNet](https://mxnet.apache.org/community/contribute) and our [development guides wiki](https://cwiki.apache.org/confluence/display/MXNET/Developments).



[GitHub] [incubator-mxnet] seekFire edited a comment on issue #19499: cudaMalloc retry failed

Posted by GitBox <gi...@apache.org>.

seekFire edited a comment on issue #19499:
URL: https://github.com/apache/incubator-mxnet/issues/19499#issuecomment-725361524


   @szha 
   Well, I think the question above is not important to me, as long as the model can be trained with at least one image per batch... BTW, when I run a script like the one below in the interactive shell, it works OK:
   ```
   >>> import mxnet as mx
   >>> x = mx.nd.ones((2, 3, 4, 5))
   >>> y = x.transpose((0, 3, 1, 2)).reshape(0, -3, -1)
   >>> y.shape    # (2, 15, 4)
   ```
   But the reshape operation in my custom metric function for the segmentation task raises an error during evaluation:
   `labels = labels.transpose((0, 2, 3, 1)).reshape(0, -3, -1).argmax(-1)`
   
   The error message is shown below:
   `ValueError: can only specify one unknown dimension`
   
   And I don't think `reshape(0, -3, -1)` is ambiguous for a 4-dim tensor; furthermore, the documentation of the function **mx.nd.NDArray.reshape** has a similar demo...
   
   When I change `reshape(0, -3, -1)` to `reshape(0, -3, 0)`, the error changes to the following (`labels` has shape (1, 2, 512, 512)):
   `ValueError: cannot reshape array of size 524288 into shape (0,newaxis,0)`
   
   So what do you think about the cause?



[GitHub] [incubator-mxnet] seekFire edited a comment on issue #19499: cudaMalloc retry failed

Posted by GitBox <gi...@apache.org>.

seekFire edited a comment on issue #19499:
URL: https://github.com/apache/incubator-mxnet/issues/19499#issuecomment-725878826


   @leezu 
   I think I may have found the cause of the error: when I use the class **mx.metric.CustomMetric** to wrap my custom metric function, the input tensors (label & pred) of this function are automatically converted from **mxnet.ndarray.ndarray.NDArray** to **numpy.ndarray**, which produces this error. I verified this as follows, using the same script as above except with **numpy** in place of **mxnet.ndarray**:
   
   ```
   >>> import numpy as np
   >>> x = np.ones((2, 3, 4, 5))
   >>> y = x.transpose((0, 3, 1, 2)).reshape(0, -3, -1)
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
   ValueError: can only specify one unknown dimension
   ```
   The error is the same as the one mentioned above. I think the class **mx.metric.CustomMetric** in the new version (e.g. 1.7.0) differs from that in older versions, because I used to use this class to wrap the same custom metric function and it ran OK.
   



[GitHub] [incubator-mxnet] seekFire edited a comment on issue #19499: cudaMalloc retry failed

Posted by GitBox <gi...@apache.org>.

seekFire edited a comment on issue #19499:
URL: https://github.com/apache/incubator-mxnet/issues/19499#issuecomment-725221589


   @szha Yes, you're right. The input image size is 512*512, and the GPU memory is 12GB.



[GitHub] [incubator-mxnet] seekFire edited a comment on issue #19499: cudaMalloc retry failed

Posted by GitBox <gi...@apache.org>.

seekFire edited a comment on issue #19499:
URL: https://github.com/apache/incubator-mxnet/issues/19499#issuecomment-725839386


   @leezu 
   I'm doing a semantic segmentation task, and because I use sparse labels, the shape of the label needs to be converted from (n, c, h, w) to (n, h\*w), and the shape of the prediction also needs to be converted from (n, c, h, w) to (n, h\*w, c).
   I just wonder whether the error occurred because I set **last_batch** of the dataloader to '**keep**', so the batch size is not fixed...



[GitHub] [incubator-mxnet] seekFire commented on issue #19499: cudaMalloc retry failed

Posted by GitBox <gi...@apache.org>.

seekFire commented on issue #19499:
URL: https://github.com/apache/incubator-mxnet/issues/19499#issuecomment-725839386


   @leezu 
   I'm doing a semantic segmentation task, and because I use sparse labels, the shape of the label needs to be converted from (n, c, h, w) to (n, h*w), and the shape of the prediction also needs to be converted from (n, c, h, w) to (n, h*w, c).
   I just wonder whether the error occurred because I set **last_batch** of the dataloader to '**keep**', so the batch size is not fixed...



[GitHub] [incubator-mxnet] seekFire commented on issue #19499: cudaMalloc retry failed

Posted by GitBox <gi...@apache.org>.

seekFire commented on issue #19499:
URL: https://github.com/apache/incubator-mxnet/issues/19499#issuecomment-725361524


   @szha 
   Well, I think this question is not important to me, as long as the model can be trained with at least one image per batch... BTW, when I run a script like the one below in the interactive shell, it works OK:
   ```
   >>> import mxnet as mx
   >>> x = mx.nd.ones((2, 3, 4, 5))
   >>> y = x.transpose((0, 3, 1, 2)).reshape(0, -3, -1)
   >>> y.shape    # (2, 15, 4)
   ```
   But the reshape operation in my custom metric function raises an error during training:
   `labels = labels.transpose((0, 2, 3, 1)).reshape(0, -3, -1).argmax(-1)`
   
   The error message is shown below:
   `ValueError: can only specify one unknown dimension`
   
   So what do you think about the cause?



[GitHub] [incubator-mxnet] seekFire commented on issue #19499: cudaMalloc retry failed

Posted by GitBox <gi...@apache.org>.

seekFire commented on issue #19499:
URL: https://github.com/apache/incubator-mxnet/issues/19499#issuecomment-724714767


   @szha Another question: the loss value not only doesn't decrease, but also switches back and forth between several fixed values during training. What do you think is the possible reason for this?



[GitHub] [incubator-mxnet] seekFire edited a comment on issue #19499: cudaMalloc retry failed

Posted by GitBox <gi...@apache.org>.

seekFire edited a comment on issue #19499:
URL: https://github.com/apache/incubator-mxnet/issues/19499#issuecomment-725878826


   @leezu 
   I think I may have found the cause of the error: when I use the class **mx.metric.CustomMetric** to wrap my custom metric function, the input tensors (label & pred) of this function are automatically converted from **mxnet.ndarray.ndarray.NDArray** to **numpy.ndarray**, which produces this error. I verified this as follows, using the same script as above except with **numpy** in place of **mxnet.ndarray**:
   
   ```
   >>> import numpy as np
   >>> x = np.ones((2, 3, 4, 5))
   >>> y = x.transpose((0, 3, 1, 2)).reshape(0, -3, -1)
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
   ValueError: can only specify one unknown dimension
   ```
   The error is the same as the one mentioned above. I think the class **mx.metric.CustomMetric** in the new version (1.7.0) differs from that in older versions, because I used to use this class to wrap the same custom metric function and it ran OK.
   



[GitHub] [incubator-mxnet] seekFire removed a comment on issue #19499: cudaMalloc retry failed

Posted by GitBox <gi...@apache.org>.

seekFire removed a comment on issue #19499:
URL: https://github.com/apache/incubator-mxnet/issues/19499#issuecomment-724714767


   @szha Another question: the loss value not only doesn't decrease, but also switches back and forth between several fixed values during training. What do you think is the possible reason for this?



[GitHub] [incubator-mxnet] szha commented on issue #19499: cudaMalloc retry failed

Posted by GitBox <gi...@apache.org>.

szha commented on issue #19499:
URL: https://github.com/apache/incubator-mxnet/issues/19499#issuecomment-724823891


   > when I trained with 4 GPUs, it will works fine with batch_size = 4
   
   Is this a per-GPU batch size? I imagine it has to do with the input image sizes.



[GitHub] [incubator-mxnet] seekFire commented on issue #19499: cudaMalloc retry failed

Posted by GitBox <gi...@apache.org>.

seekFire commented on issue #19499:
URL: https://github.com/apache/incubator-mxnet/issues/19499#issuecomment-725221589


   @szha Yes, you're right, and the input image size is 512*512



[GitHub] [incubator-mxnet] seekFire edited a comment on issue #19499: cudaMalloc retry failed

Posted by GitBox <gi...@apache.org>.

seekFire edited a comment on issue #19499:
URL: https://github.com/apache/incubator-mxnet/issues/19499#issuecomment-725878826


   @leezu 
   I think I may have found the cause of the error: when I use the class **mx.metric.CustomMetric** to wrap my custom metric function, the input tensors (label & pred) of this function are converted from **mxnet.ndarray.ndarray.NDArray** to **numpy.ndarray**, which produces this error. I verified this as follows, using the same script as above except with **numpy** in place of **mxnet.ndarray**:
   
   ```
   >>> import numpy as np
   >>> x = np.ones((2, 3, 4, 5))
   >>> y = x.transpose((0, 3, 1, 2)).reshape(0, -3, -1)
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
   ValueError: can only specify one unknown dimension
   ```
   The error is the same as the one mentioned above. I think the class **mx.metric.CustomMetric** in the new version (e.g. 1.7.0) differs from that in older versions, because I used to use this class to wrap the same custom metric function and it ran OK.
   



[GitHub] [incubator-mxnet] seekFire edited a comment on issue #19499: cudaMalloc retry failed

Posted by GitBox <gi...@apache.org>.

seekFire edited a comment on issue #19499:
URL: https://github.com/apache/incubator-mxnet/issues/19499#issuecomment-725361524


   @szha 
   Well, I think this question is not important to me, as long as the model can be trained with at least one image per batch... BTW, when I run a script like the one below in the interactive shell, it works OK:
   ```
   >>> import mxnet as mx
   >>> x = mx.nd.ones((2, 3, 4, 5))
   >>> y = x.transpose((0, 3, 1, 2)).reshape(0, -3, -1)
   >>> y.shape    # (2, 15, 4)
   ```
   But the reshape operation in my custom metric function for the segmentation task raises an error during training:
   `labels = labels.transpose((0, 2, 3, 1)).reshape(0, -3, -1).argmax(-1)`
   
   The error message is shown below:
   `ValueError: can only specify one unknown dimension`
   
   And I don't think `reshape(0, -3, -1)` is ambiguous for a 4-dim tensor; furthermore, the documentation of the function **mx.nd.NDArray.reshape** has a similar demo...
   
   When I change `reshape(0, -3, -1)` to `reshape(0, -3, 0)`, the error changes to the following:
   `ValueError: cannot reshape array of size 524288 into shape (0,newaxis,0)`
   
   So what do you think about the cause?



[GitHub] [incubator-mxnet] seekFire removed a comment on issue #19499: cudaMalloc retry failed

Posted by GitBox <gi...@apache.org>.

seekFire removed a comment on issue #19499:
URL: https://github.com/apache/incubator-mxnet/issues/19499#issuecomment-725834831


   @leezu 
   I tried your suggestion by adding the line:
   `mx.npx.set_np()`
   But it didn't work, and a new error occurred...



[GitHub] [incubator-mxnet] seekFire edited a comment on issue #19499: cudaMalloc retry failed

Posted by GitBox <gi...@apache.org>.

seekFire edited a comment on issue #19499:
URL: https://github.com/apache/incubator-mxnet/issues/19499#issuecomment-725878826


   @leezu 
   I think I may have found the cause of the error: when I use the class **mx.metric.CustomMetric** to wrap my custom metric function, the tensors in this function are converted from **mxnet.ndarray.ndarray.NDArray** to **numpy.ndarray**, which produces this error. I verified this as follows, using the same script as above except with **numpy** in place of **mxnet.ndarray**:
   
   ```
   >>> import numpy as np
   >>> x = np.ones((2, 3, 4, 5))
   >>> y = x.transpose((0, 3, 1, 2)).reshape(0, -3, -1)
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
   ValueError: can only specify one unknown dimension
   ```
   The error is the same as the one mentioned above. I think the class **mx.metric.CustomMetric** in the new version (e.g. 1.7.0) differs from that in older versions, because I used to use this class to wrap the same custom metric function and it ran OK.
   



[GitHub] [incubator-mxnet] seekFire edited a comment on issue #19499: cudaMalloc retry failed

Posted by GitBox <gi...@apache.org>.

seekFire edited a comment on issue #19499:
URL: https://github.com/apache/incubator-mxnet/issues/19499#issuecomment-725878826


   @leezu 
   I think I may have found the cause of the error: when I use the class **mx.metric.CustomMetric** to wrap my custom metric function, the tensors passed to this function are converted from **mxnet.ndarray.ndarray.NDArray** to **numpy.ndarray**, which produces this error. To verify, I ran the same script as above, but with **numpy** in place of **mxnet.ndarray**:
   
   ```
   >>> import numpy as np
   >>> x = np.ones((2, 3, 4, 5))
   >>> y = x.transpose((0, 3, 1, 2)).reshape(0, -3, -1)
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
   ValueError: can only specify one unknown dimension
   ```
   The error is the same as the one mentioned above. So I think the class **mx.metric.CustomMetric** has changed since older versions.
   



[GitHub] [incubator-mxnet] seekFire edited a comment on issue #19499: cudaMalloc retry failed

Posted by GitBox <gi...@apache.org>.

seekFire edited a comment on issue #19499:
URL: https://github.com/apache/incubator-mxnet/issues/19499#issuecomment-725839386


   @leezu 
   I tried your suggestion by adding the script:
   `mx.npx.set_np()`
   But it didn't work, and a new error occurs...
   
   I'm working on a semantic segmentation task, and because I use sparse labels, the label shape needs to be converted from 
   (n, c, h, w) to (n, h\*w), and the prediction shape also needs to be converted from (n, c, h, w) to (n, h\*w, c).
   I wonder whether the error occurs because I set **last_batch** of the dataloader to '**keep**', so the batch size is not fixed...



[GitHub] [incubator-mxnet] szha commented on issue #19499: cudaMalloc retry failed

Posted by GitBox <gi...@apache.org>.

szha commented on issue #19499:
URL: https://github.com/apache/incubator-mxnet/issues/19499#issuecomment-724233640


   Hi @seekFire. Because MXNet execution is asynchronous, the OOM error likely happened earlier. I'd suggest reducing model size or batch size to make it fit in your current GPU. If you have reason to believe that the current setting should fit in your GPU memory, it would be helpful if you elaborate on that so that I can take a closer look.



[GitHub] [incubator-mxnet] seekFire edited a comment on issue #19499: cudaMalloc retry failed

Posted by GitBox <gi...@apache.org>.

seekFire edited a comment on issue #19499:
URL: https://github.com/apache/incubator-mxnet/issues/19499#issuecomment-725829524


   @leezu 
   No, I haven't. But the tensor being manipulated is of type **mxnet.ndarray.ndarray.NDArray** rather than **numpy.ndarray**, and the special values such as 0, -1, and -3 are defined by the **reshape** function of **mxnet.ndarray.ndarray.NDArray**. When I used an older version of mxnet (e.g. 1.5.0), there was no such error... So I wonder whether this is a bug.



[GitHub] [incubator-mxnet] seekFire edited a comment on issue #19499: cudaMalloc retry failed

Posted by GitBox <gi...@apache.org>.

seekFire edited a comment on issue #19499:
URL: https://github.com/apache/incubator-mxnet/issues/19499#issuecomment-725829524


   @leezu 
   No, I haven't. But the tensor being manipulated is of type **mxnet.ndarray.ndarray.NDArray** rather than **numpy.ndarray**, and the special values such as 0, -1, and -3 are defined by the **reshape** function of **mxnet.ndarray.ndarray.NDArray**. When I used an older version of mxnet (e.g. 1.5.0), there was no such error... So I wonder whether this is something that needs to be improved.



[GitHub] [incubator-mxnet] seekFire edited a comment on issue #19499: cudaMalloc retry failed

Posted by GitBox <gi...@apache.org>.

seekFire edited a comment on issue #19499:
URL: https://github.com/apache/incubator-mxnet/issues/19499#issuecomment-725878826


   @leezu 
   I think I may have found the cause of the error: when I use the class **mx.metric.CustomMetric** to wrap my custom metric function, the tensors passed to this function are converted from **mxnet.ndarray.ndarray.NDArray** to **numpy.ndarray**, which produces this error. To verify, I ran the same script as above, but with **numpy** in place of **mxnet.ndarray**:
   
   ```
   >>> import numpy as np
   >>> x = np.ones((2, 3, 4, 5))
   >>> y = x.transpose((0, 3, 1, 2)).reshape(0, -3, -1)
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
   ValueError: can only specify one unknown dimension
   ```
   The error is the same as the one mentioned above. I think the class **mx.metric.CustomMetric** differs from that in older versions, because I used this class to wrap the same custom metric function before and it ran OK.
   



[GitHub] [incubator-mxnet] seekFire edited a comment on issue #19499: cudaMalloc retry failed

Posted by GitBox <gi...@apache.org>.

seekFire edited a comment on issue #19499:
URL: https://github.com/apache/incubator-mxnet/issues/19499#issuecomment-725839386


   @leezu 
   I tried your suggestion by adding the script:
   `mx.npx.set_np()`
   But it didn't work, and a new error occurs...
   
   I'm working on a semantic segmentation task, and because I use sparse labels, the label shape needs to be converted from 
   (n, c, h, w) to (n, h\*w), and the prediction shape also needs to be converted from (n, c, h, w) to (n, h\*w, c).
   I wonder whether the error occurs because I set **last_batch** of the dataloader to '**keep**', so the batch size is not fixed... But even if that is the case, I don't think this error should appear.



[GitHub] [incubator-mxnet] seekFire edited a comment on issue #19499: cudaMalloc retry failed

Posted by GitBox <gi...@apache.org>.

seekFire edited a comment on issue #19499:
URL: https://github.com/apache/incubator-mxnet/issues/19499#issuecomment-725878826


   @leezu 
   I think I may have found the cause of the error: when I use the class **mx.metric.CustomMetric** to wrap my custom metric function, the tensors passed to this function are converted from **mxnet.ndarray.ndarray.NDArray** to **numpy.ndarray**, which produces this error. To verify, I ran the same script as above, but with **numpy** in place of **mxnet.ndarray**:
   
   ```
   >>> import numpy as np
   >>> x = np.ones((2, 3, 4, 5))
   >>> y = x.transpose((0, 3, 1, 2)).reshape(0, -3, -1)
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
   ValueError: can only specify one unknown dimension
   ```
   The error is the same as the one mentioned above. I think the class **mx.metric.CustomMetric** in newer versions (e.g. 1.7.0 / 1.8.0) differs from that in older versions, because I used this class to wrap the same custom metric function before and it ran OK.
   



[GitHub] [incubator-mxnet] seekFire removed a comment on issue #19499: cudaMalloc retry failed

Posted by GitBox <gi...@apache.org>.

seekFire removed a comment on issue #19499:
URL: https://github.com/apache/incubator-mxnet/issues/19499#issuecomment-725839386


   @leezu 
   I tried your suggestion by adding the script:
   `mx.npx.set_np()`
   But it didn't work, and a new error occurs...
   
   I'm working on a semantic segmentation task, and because I use sparse labels, the label shape needs to be converted from 
   (n, c, h, w) to (n, h\*w), and the prediction shape also needs to be converted from (n, c, h, w) to (n, h\*w, c).
   I wonder whether the error occurs because I set **last_batch** of the dataloader to '**keep**', so the batch size is not fixed... But even if that is the case, I don't think this error should appear.



[GitHub] [incubator-mxnet] seekFire commented on issue #19499: cudaMalloc retry failed

Posted by GitBox <gi...@apache.org>.

seekFire commented on issue #19499:
URL: https://github.com/apache/incubator-mxnet/issues/19499#issuecomment-724494683


   @szha Thank you for your suggestion! When I turn the batch size down to 2 on one GPU, it works OK. I'm just surprised that the batch size has to be so low when training HRNet-W18 for segmentation... 
   BTW, when I train the model on one GPU, the batch size cannot even be set to 4, but when I train on 4 GPUs, it works fine with batch_size = 4. What is the difference between these two situations?
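
One possible explanation, assuming the 4-GPU run is data-parallel and splits each batch evenly across the devices (as `gluon.utils.split_and_load` does): what has to fit in memory is each GPU's share of the batch, not the whole batch. The thread does not show the training loop, so this is only an assumption, and `per_device_batch` is a hypothetical helper used for illustration.

```python
def per_device_batch(batch_size, num_devices):
    """Number of samples each device receives when a batch is split evenly.

    Hypothetical helper; assumes data-parallel training with an even split.
    """
    if batch_size % num_devices != 0:
        raise ValueError("batch size must be divisible by the device count")
    return batch_size // num_devices


# One GPU has to hold all 4 samples of the batch.
print(per_device_batch(4, 1))  # 4
# With 4 GPUs, each device holds a single sample.
print(per_device_batch(4, 4))  # 1
```

Under that assumption, batch_size = 4 on 4 GPUs places fewer samples on each GPU than the batch_size = 2 that already fits on a single GPU.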



[GitHub] [incubator-mxnet] seekFire edited a comment on issue #19499: cudaMalloc retry failed

Posted by GitBox <gi...@apache.org>.

seekFire edited a comment on issue #19499:
URL: https://github.com/apache/incubator-mxnet/issues/19499#issuecomment-725361524


   @szha 
   Well, the question above is not that important to me, as long as the model can be trained with at least one image per batch... BTW, when I run a script like the one below in the Python shell, it works OK:
   ```
   >>> import mxnet as mx
   >>> x = mx.nd.ones((2, 3, 4, 5))
   >>> y = x.transpose((0, 3, 1, 2)).reshape(0, -3, -1)
   >>> y.shape    # (2, 15, 4)
   ```
   But the reshape operation in my custom metric function for the segmentation task raises an error during evaluation:
   `labels = labels.transpose((0, 2, 3, 1)).reshape(0, -3, -1).argmax(-1)`
   
   The error message is shown below:
   `ValueError: can only specify one unknown dimension`
   
   I don't think `reshape(0, -3, -1)` is ambiguous for a 4-dim tensor; furthermore, the documentation of **mx.nd.NDArray.reshape** has a similar demo...
   
   When I change `reshape(0, -3, -1)` to `reshape(0, -3, 0)`, the error changes to:
   `ValueError: cannot reshape array of size 524288 into shape (0,newaxis,0)`
   
   So what do you think is the cause?
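
For reference, a minimal sketch of how the special codes in `reshape(0, -3, -1)` resolve to a concrete shape, following the meanings given in the MXNet `reshape` documentation (0 copies a dimension from the input, -3 merges two consecutive input dimensions, -1 is inferred from the remaining elements). `resolve_mx_reshape` is a hypothetical helper, not an MXNet function, and it deliberately ignores the other codes (-2, -4).

```python
from functools import reduce
from operator import mul


def resolve_mx_reshape(in_shape, spec):
    """Resolve MXNet-style special reshape codes into an explicit shape.

    Hypothetical helper for illustration; handles only 0, -1, -3 and
    positive sizes.
    """
    out = []
    src = 0          # index of the next unconsumed input dimension
    infer_at = None  # position of the -1 entry, if any
    for code in spec:
        if code == 0:        # copy this dimension from the input
            out.append(in_shape[src])
            src += 1
        elif code == -3:     # merge two consecutive input dimensions
            out.append(in_shape[src] * in_shape[src + 1])
            src += 2
        elif code == -1:     # infer later from the total element count
            if infer_at is not None:
                raise ValueError("only one -1 is allowed")
            infer_at = len(out)
            out.append(1)    # placeholder until the size is known
            src += 1
        elif code > 0:       # explicit size
            out.append(code)
            src += 1
        else:
            raise ValueError("unsupported code: %d" % code)
    if infer_at is not None:
        total = reduce(mul, in_shape, 1)
        known = reduce(mul, out, 1)
        out[infer_at] = total // known
    return tuple(out)


# Shape of x.transpose((0, 3, 1, 2)) for x of shape (2, 3, 4, 5) is (2, 5, 3, 4).
print(resolve_mx_reshape((2, 5, 3, 4), (0, -3, -1)))  # (2, 15, 4)
```

The result is an explicit tuple, so it can also be passed to a `reshape` that does not understand these codes, such as `numpy.ndarray.reshape`.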



[GitHub] [incubator-mxnet] leezu commented on issue #19499: cudaMalloc retry failed

Posted by GitBox <gi...@apache.org>.

leezu commented on issue #19499:
URL: https://github.com/apache/incubator-mxnet/issues/19499#issuecomment-725579643


   Have you enabled the numpy compatible mode? `use_np` or `set_np`?



[GitHub] [incubator-mxnet] seekFire edited a comment on issue #19499: cudaMalloc retry failed

Posted by GitBox <gi...@apache.org>.

seekFire edited a comment on issue #19499:
URL: https://github.com/apache/incubator-mxnet/issues/19499#issuecomment-725878826


   @leezu 
   I think I may have found the cause of the error: when I use the class **mx.metric.CustomMetric** to wrap my custom metric function, the tensors passed to this function are converted from **mxnet.ndarray.ndarray.NDArray** to **numpy.ndarray**, which produces this error. To verify, I ran the same script as above, but with **numpy** in place of **mxnet.ndarray**:
   
   ```
   >>> import numpy as np
   >>> x = np.ones((2, 3, 4, 5))
   >>> y = x.transpose((0, 3, 1, 2)).reshape(0, -3, -1)
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
   ValueError: can only specify one unknown dimension
   ```
   The error is the same as the one mentioned above. So I think the class **mx.metric.CustomMetric** differs from that in older versions.
   

