Posted to commits@tvm.apache.org by GitBox <gi...@apache.org> on 2021/01/12 03:27:09 UTC

[GitHub] [tvm] masahi opened a new pull request #7257: Unpack NMS inputs into bbox, scores and class ids

masahi opened a new pull request #7257:
URL: https://github.com/apache/tvm/pull/7257


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [tvm] masahi merged pull request #7257: [TOPI] Improve memory layout inside GPU NMS kernel

Posted by GitBox <gi...@apache.org>.
masahi merged pull request #7257:
URL: https://github.com/apache/tvm/pull/7257


   





[GitHub] [tvm] masahi edited a comment on pull request #7257: [TOPI] Improve memory layout inside GPU NMS kernel

Posted by GitBox <gi...@apache.org>.
masahi edited a comment on pull request #7257:
URL: https://github.com/apache/tvm/pull/7257#issuecomment-758901561


   I don't quite follow, maybe you are missing something?
   
   First, this PR doesn't change our NMS API; it only changes the buffer layouts used internally.
   
   Second, the final concat is only required for MXNet, which uses `return_indices=False`. Our NMS returns something completely different depending on the `return_indices` flag: if False, it returns one big output tensor packed with bboxes, scores, and class ids, with invalid entries indicated by -1.
   
   (The valid entries are supposed to move to the top if the `invalid_to_bottom` flag is True. But our GPU NMS kernel ignores this argument and the output is not reordered. This is another difference from the CPU implementation; I think this is a bug.)
   
   https://github.com/apache/tvm/blob/4e8cc4fc26e931e38017d198d29f45cba04f5a60/python/tvm/topi/cuda/nms.py#L762-L763
   
   If `return_indices=True`, which applies to TF, ONNX, and PyTorch, we only return the indices of surviving boxes, so there is no need to concat bboxes, scores, and class ids. For this case there is zero additional overhead after this PR; it is just that the concat before NMS done by the frontends is now completely pointless.
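The two output conventions described above can be sketched in plain NumPy (hypothetical helper, not the actual TOPI code; the packed layout shown here is illustrative):

```python
import numpy as np

def nms_output(boxes, scores, class_ids, keep, return_indices):
    """Illustrative sketch of the two NMS output conventions.

    keep is a boolean mask of boxes that survived suppression.
    """
    if return_indices:
        # TF/ONNX/PyTorch style: only the surviving box indices,
        # padded with -1 for invalid entries.
        idx = np.where(keep)[0]
        out = np.full(len(boxes), -1, dtype=np.int64)
        out[: len(idx)] = idx
        return out
    # MXNet style: one packed row per box, here assumed to be
    # [class_id, score, x1, y1, x2, y2], with suppressed rows set to -1.
    packed = np.concatenate([class_ids[:, None], scores[:, None], boxes], axis=1)
    packed[~keep] = -1
    return packed

boxes = np.array([[0.0, 0.0, 1.0, 1.0], [0.1, 0.1, 1.1, 1.1]])
scores = np.array([0.9, 0.8])
class_ids = np.array([0.0, 0.0])
keep = np.array([True, False])
print(nms_output(boxes, scores, class_ids, keep, return_indices=True))  # [ 0 -1]
```

With `return_indices=True` nothing but indices ever leaves the kernel, which is why the frontend-side concat of bboxes, scores, and class ids buys nothing in that path.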





[GitHub] [tvm] masahi commented on pull request #7257: [TOPI] Improve memory layout inside GPU NMS kernel

Posted by GitBox <gi...@apache.org>.
masahi commented on pull request #7257:
URL: https://github.com/apache/tvm/pull/7257#issuecomment-759351406


   Thanks @mbrookhart





[GitHub] [tvm] masahi commented on pull request #7257: [TOPI] Improve memory layout inside GPU NMS kernel

Posted by GitBox <gi...@apache.org>.
masahi commented on pull request #7257:
URL: https://github.com/apache/tvm/pull/7257#issuecomment-758961671


   Yes, ideally I want to update our NMS to be closer to TF/ONNX/PyTorch, and let the MXNet frontend handle split and concat, rather than the other way around (what we have now). The current API is overcomplicated due to the need to support both styles. If we can assume that `return_indices` is always True, we can clean up our API a lot. For example, the `invalid_to_bottom` argument only makes sense for MXNet. We don't need the `coord_start`, `score_index`, and `id_index` arguments if the inputs come in unpacked.
   
   Supporting `max_output_boxes_per_class` needs changes in the implementation as well. We need to count the number of surviving boxes per class. But I think that's the only change; it is definitely doable without writing another kernel.
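The per-class counting could be sketched like this (plain NumPy, hypothetical helper name; in the real kernel the count would be maintained while suppression runs rather than as a post-filter):

```python
import numpy as np

def cap_per_class(indices, class_ids, max_output_boxes_per_class):
    """Keep at most max_output_boxes_per_class surviving boxes per class.

    indices: surviving box indices, already sorted by descending score.
    class_ids: class id of every box in the original input.
    """
    counts = {}
    kept = []
    for i in indices:
        c = int(class_ids[i])
        counts[c] = counts.get(c, 0) + 1
        # Drop any box beyond the per-class budget.
        if counts[c] <= max_output_boxes_per_class:
            kept.append(int(i))
    return np.array(kept, dtype=np.int64)
```

For example, with surviving indices `[0, 2, 1, 3]`, class ids `[0, 0, 1, 0]`, and a budget of 2 per class, box 3 (the third box of class 0) is dropped.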





[GitHub] [tvm] mbrookhart commented on pull request #7257: [TOPI] Improve memory layout inside GPU NMS kernel

Posted by GitBox <gi...@apache.org>.
mbrookhart commented on pull request #7257:
URL: https://github.com/apache/tvm/pull/7257#issuecomment-758777131


   Even after this, I think we will still need a loop over classes for ONNX and TF, since ONNX explicitly and TF implicitly need `max_output_boxes_per_class`, while this op, even with class ids, will return `max_output_boxes` shared across all classes.
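A per-class loop of the kind described here might look like the following NumPy sketch (hypothetical helpers, not TVM code; assumes well-formed boxes and `max_boxes_per_class >= 1`):

```python
import numpy as np

def iou(a, b):
    # Intersection-over-union of two [x1, y1, x2, y2] boxes.
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def nms_single_class(boxes, scores, iou_threshold, max_boxes):
    """Greedy NMS over one class; returns kept local indices."""
    order = np.argsort(-scores)  # highest score first
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_threshold for j in keep):
            keep.append(i)
        if len(keep) == max_boxes:
            break
    return keep

def nms_per_class(boxes, scores, class_ids, iou_threshold, max_boxes_per_class):
    """ONNX-style behavior: run suppression independently per class id."""
    keep = []
    for c in np.unique(class_ids):
        idx = np.where(class_ids == c)[0]
        kept = nms_single_class(boxes[idx], scores[idx], iou_threshold,
                                max_boxes_per_class)
        keep.extend(int(idx[k]) for k in kept)
    return sorted(keep)
```

Boxes of different classes never suppress each other here, and the `max_boxes_per_class` cap is applied inside each class's loop, which is the behavior ONNX requires and a single shared `max_output_boxes` cannot express.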







